## Feature Selection Methods For Machine Learning

There are two main types of feature selection techniques:

   - Supervised: use the target values (remove irrelevant variables)
   - Unsupervised: do not use the target values (remove redundant variables)

In supervised methods, it is common to use correlation-type statistical measures between input and output variables as the basis for filter feature selection. The choice of statistical measure therefore depends heavily on the variable data types.

Common input variable types and their respective methods:

   - Numerical
      - Pearson's correlation coefficient (linear, numerical output)
      - Spearman's rank coefficient (nonlinear but monotonic, numerical output)
      - ANOVA F-test (linear, categorical output)
      - Kendall's rank coefficient (nonlinear, categorical output)
   - Categorical
      - Chi-Squared test (contingency tables)
      - Mutual Information
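To make the difference between the linear and rank-based measures in the list concrete, here is a minimal sketch on synthetic data (not the weather dataset). The helper names `pearson_r` and `spearman_rho` are hypothetical, and the simple ranking used here assumes there are no tied values.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: strength of the linear relationship."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Ranking via a double argsort is only valid when there are no ties.
    """
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson_r(rank_x, rank_y)

# Synthetic data: y grows monotonically with x, but not linearly
x = np.linspace(0.1, 5.0, 50)
y = np.exp(x)

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson r:    {r:.3f}")
print(f"Spearman rho: {rho:.3f}")
```

On a monotonic but curved relationship like this one, the rank-based coefficient reaches its maximum while Pearson's coefficient stays below it, which is why the rank measures are listed for nonlinear cases.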

The purpose of this work is to try some of these feature selection methods and evaluate the results with an SVR model.

#### Dataset

Data Set Information:

This data is for the purpose of bias correction of next-day maximum and minimum air temperatures forecast of the LDAPS model operated by the Korea Meteorological Administration over Seoul, South Korea. This data consists of summer data from 2013 to 2017. The input data is largely composed of the LDAPS model's next-day forecast data, in-situ maximum and minimum temperatures of present-day, and geographic auxiliary variables. There are two outputs (i.e. next-day maximum and minimum air temperatures) in this data. Hindcast validation was conducted for the period from 2015 to 2017.


For this, a weather-related set of variables in time-series format is used. The goal is to predict the next-day temperature.
The features are:

1. station - used weather station number: 1 to 25
2. Date - Present day: yyyy-mm-dd ('2013-06-30' to '2017-08-30')
3. Present_Tmax - Maximum air temperature between 0 and 21 h on the present day (°C): 20 to 37.6
4. Present_Tmin - Minimum air temperature between 0 and 21 h on the present day (°C): 11.3 to 29.9
5. LDAPS_RHmin - LDAPS model forecast of next-day minimum relative humidity (%): 19.8 to 98.5
6. LDAPS_RHmax - LDAPS model forecast of next-day maximum relative humidity (%): 58.9 to 100
7. LDAPS_Tmax_lapse - LDAPS model forecast of next-day maximum air temperature applied lapse rate (°C): 17.6 to 38.5
8. LDAPS_Tmin_lapse - LDAPS model forecast of next-day minimum air temperature applied lapse rate (°C): 14.3 to 29.6
9. LDAPS_WS - LDAPS model forecast of next-day average wind speed (m/s): 2.9 to 21.9
10. LDAPS_LH - LDAPS model forecast of next-day average latent heat flux (W/m2): -13.6 to 213.4
11. LDAPS_CC1 - LDAPS model forecast of next-day 1st 6-hour split average cloud cover (0-5 h) (%): 0 to 0.97
12. LDAPS_CC2 - LDAPS model forecast of next-day 2nd 6-hour split average cloud cover (6-11 h) (%): 0 to 0.97
13. LDAPS_CC3 - LDAPS model forecast of next-day 3rd 6-hour split average cloud cover (12-17 h) (%): 0 to 0.98
14. LDAPS_CC4 - LDAPS model forecast of next-day 4th 6-hour split average cloud cover (18-23 h) (%): 0 to 0.97
15. LDAPS_PPT1 - LDAPS model forecast of next-day 1st 6-hour split average precipitation (0-5 h) (%): 0 to 23.7
16. LDAPS_PPT2 - LDAPS model forecast of next-day 2nd 6-hour split average precipitation (6-11 h) (%): 0 to 21.6
17. LDAPS_PPT3 - LDAPS model forecast of next-day 3rd 6-hour split average precipitation (12-17 h) (%): 0 to 15.8
18. LDAPS_PPT4 - LDAPS model forecast of next-day 4th 6-hour split average precipitation (18-23 h) (%): 0 to 16.7
19. lat - Latitude (°): 37.456 to 37.645
20. lon - Longitude (°): 126.826 to 127.135
21. DEM - Elevation (m): 12.4 to 212.3
22. Slope - Slope (°): 0.1 to 5.2
23. Solar radiation - Daily incoming solar radiation (wh/m2): 4329.5 to 5992.9
24. Next_Tmax - The next-day maximum air temperature (°C): 17.4 to 38.9
25. Next_Tmin - The next-day minimum air temperature (°C): 11.3 to 29.8


In [2]:
import pandas as pd
import numpy as np 

db = pd.read_csv('Bias_correction_ucl.csv')
db.head()

Unnamed: 0,station,Date,Present_Tmax,Present_Tmin,LDAPS_RHmin,LDAPS_RHmax,LDAPS_Tmax_lapse,LDAPS_Tmin_lapse,LDAPS_WS,LDAPS_LH,...,LDAPS_PPT2,LDAPS_PPT3,LDAPS_PPT4,lat,lon,DEM,Slope,Solar radiation,Next_Tmax,Next_Tmin
0,1.0,2013-06-30,28.7,21.4,58.255688,91.116364,28.074101,23.006936,6.818887,69.451805,...,0.0,0.0,0.0,37.6046,126.991,212.335,2.785,5992.895996,29.1,21.2
1,2.0,2013-06-30,31.9,21.6,52.263397,90.604721,29.850689,24.035009,5.69189,51.937448,...,0.0,0.0,0.0,37.6046,127.032,44.7624,0.5141,5869.3125,30.5,22.5
2,3.0,2013-06-30,31.6,23.3,48.690479,83.973587,30.091292,24.565633,6.138224,20.57305,...,0.0,0.0,0.0,37.5776,127.058,33.3068,0.2661,5863.555664,31.1,23.9
3,4.0,2013-06-30,32.0,23.4,58.239788,96.483688,29.704629,23.326177,5.65005,65.727144,...,0.0,0.0,0.0,37.645,127.022,45.716,2.5348,5856.964844,31.7,24.3
4,5.0,2013-06-30,31.4,21.9,56.174095,90.155128,29.113934,23.48648,5.735004,107.965535,...,0.0,0.0,0.0,37.5507,127.135,35.038,0.5055,5859.552246,31.2,22.5
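The feature selection cell further down calls `db.interpolate()` before scoring, because `SelectKBest` cannot handle missing values. As a rough illustration of what that call does, here is a sketch on a tiny made-up frame. The values in `toy` are invented for the example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; values are invented for illustration only
toy = pd.DataFrame({
    "Present_Tmax": [28.7, np.nan, 31.6, 32.0],
    "LDAPS_WS": [6.8, 5.7, np.nan, 5.6],
})

# Count missing entries per column before filling
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Default interpolation is linear along the row order
filled = toy.interpolate()
print(filled)
```

Note that interpolation follows the row order. In the preview above, consecutive rows are different stations on the same date, so a filled value is a blend of neighboring stations rather than of consecutive days at one station. Whether that is acceptable depends on how many values are missing.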


## Regression Feature Selection 
#### (Numerical Input, Numerical Output)

In [28]:
# Univariate feature selection for numeric input and numeric output.
# f_regression scores each feature with an F-statistic derived from its
# Pearson correlation with the target.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

# Fill missing values by linear interpolation along the row order
db = db.interpolate()

features = db.drop(['Next_Tmin','Next_Tmax','Date'],axis=1) # Independent variables
target = np.array(db[['Next_Tmax']])  # Dependent variable, shape (n_samples, 1)

fs = SelectKBest(score_func=f_regression,k=10)
# k = number of top-scoring features to keep
fit = fs.fit(features,target.ravel())
df_scores = pd.DataFrame(fit.scores_)
df_columns = pd.DataFrame(features.columns)

# concatenate dataframes
feature_scores = pd.concat([df_columns, df_scores],axis=1)
feature_scores.columns = ['Feature_Name','Score']  # name output columns
print(feature_scores.nlargest(10,'Score'))  # print 10 best features
# export selected features to .csv
#df_univ_feat = feature_scores.nlargest(20,'Score')
#df_univ_feat.to_csv('feature_selection_UNIVARIATE.csv', index=False)

        Feature_Name         Score
5   LDAPS_Tmax_lapse  17690.628713
1       Present_Tmax   4676.022750
6   LDAPS_Tmin_lapse   4264.993921
11         LDAPS_CC3   2871.758855
10         LDAPS_CC2   2504.336200
2       Present_Tmin   2240.671968
9          LDAPS_CC1   1974.098567
12         LDAPS_CC4   1968.991569
3        LDAPS_RHmin   1859.568171
7           LDAPS_WS   1044.448522
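The scores printed above are F-statistics, not correlation coefficients. With its default settings, `f_regression` converts each feature's Pearson correlation r with the target into F = r² / (1 - r²) · (n - 2), so ranking by score is the same as ranking by |r|. The sketch below checks this on synthetic data. The arrays `X_demo` and `y_demo` are made up for the example.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic data: feature 0 drives the target most strongly
rng = np.random.default_rng(0)
n_samples = 200
X_demo = rng.normal(size=(n_samples, 3))
y_demo = 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(size=n_samples)

# Scores as computed by scikit-learn
f_scores, p_values = f_regression(X_demo, y_demo)

# Same scores rebuilt from the Pearson correlation of each column
r = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1]
              for j in range(X_demo.shape[1])])
manual_f = r**2 / (1.0 - r**2) * (n_samples - 2)

print("sklearn F:", f_scores)
print("manual F: ", manual_f)
print("match:", np.allclose(f_scores, manual_f))
```

Because F grows very quickly as |r| approaches 1, a large gap between two scores does not mean an equally large gap in correlation.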


In [17]:
X_selected = fs.transform(features)  # keep only the k=10 selected columns
X_selected.shape

(7752, 10)
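The transformed array no longer carries column names. One way to recover which columns were kept is the selector's `get_support()` mask. This is a sketch on a small synthetic frame, where `toy_features`, `toy_target`, and the column names `a` to `e` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic frame: only columns "b" and "d" influence the target
rng = np.random.default_rng(0)
toy_features = pd.DataFrame(rng.normal(size=(100, 5)),
                            columns=["a", "b", "c", "d", "e"])
toy_target = (3.0 * toy_features["b"] - 2.0 * toy_features["d"]
              + rng.normal(scale=0.1, size=100))

selector = SelectKBest(score_func=f_regression, k=2).fit(toy_features, toy_target)

# Boolean mask over the input columns, True for the kept ones
mask = selector.get_support()
selected_names = toy_features.columns[mask].tolist()

# transform() returns only the kept columns, in their original order
X_toy_selected = selector.transform(toy_features)
print(selected_names, X_toy_selected.shape)
```

The mask keeps the original column order, whereas `nlargest` in the cell above orders features by score.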

In [44]:
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
X = features[feature_scores.nlargest(10,'Score')['Feature_Name']]
y = target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [46]:
model = SVR(kernel='rbf', C=100, epsilon=0.01)
svr = model.fit(X_train,y_train.ravel())

In [47]:
from sklearn.metrics import mean_squared_error
yfit = svr.predict(X_test)
score = svr.score(X_test,y_test)
print("R-squared:", score)
print("MSE:", mean_squared_error(y_test, yfit))

R-squared: 0.7796287535272308
MSE: 2.1878204432326487
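In the cells above, the features are scored on the full dataset before the train/test split, so the test rows also influence which features are chosen. A possible variant, sketched here on synthetic data from `make_regression`, wraps selection and the SVR in one pipeline so that selection is fitted on the training split only. The `StandardScaler` step and the fixed `random_state` values are additions that are not part of the original notebook, and the score this prints says nothing about the weather data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in with 22 input columns, like the feature frame above
X_demo, y_demo = make_regression(n_samples=300, n_features=22,
                                 n_informative=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# Scaling, selection, and the model are all fitted on the training split only
pipeline = make_pipeline(
    StandardScaler(),                            # assumption: not in the original
    SelectKBest(score_func=f_regression, k=10),
    SVR(kernel="rbf", C=100, epsilon=0.01),      # same settings as the cell above
)
pipeline.fit(X_tr, y_tr)

r2 = pipeline.score(X_te, y_te)
n_selected = int(pipeline.named_steps["selectkbest"].get_support().sum())
print("R-squared on held-out split:", r2)
print("Features kept:", n_selected)
```

Fixing `random_state` in `train_test_split` also makes the reported metrics repeatable from run to run.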
