# California Housing Market: Feature engineering and feature selection
In the previous exercise, we concluded it was worth including more variables in a model. But is this set of variables **the best** we could have chosen? In this exercise, we'll go further by applying two canonical methods:
* Feature engineering consists of creating more variables from the original dataset
* Feature selection consists of selecting the best set of features among all the available variables

## The dataset
1. Load the California Housing dataset again and remove the outliers:

In [43]:
import pandas as pd

import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder


In [44]:
from sklearn import datasets
housing = datasets.fetch_california_housing(data_home=None, download_if_missing=True, return_X_y=False)

df = pd.DataFrame(columns=housing["feature_names"], data=housing["data"])
df.loc[:,'Price'] = housing["target"]
df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Price
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [45]:
# Remove outliers
mask = (df['AveRooms'] < 10) & (df['AveBedrms'] < 10) & (df['Population'] < 15000) & (df['AveOccup'] < 10) & (df['Price'] < 5)
df = df.loc[mask,:]

In [46]:
df.shape

(19398, 9)

In [47]:
df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Price
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [48]:
df.describe(include="all")

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Price
count,19398.0,19398.0,19398.0,19398.0,19398.0,19398.0,19398.0,19398.0,19398.0
mean,3.674497,28.496907,5.210648,1.066038,1442.17208,2.94464,35.637872,-119.567484,1.924128
std,1.563397,12.477953,1.168098,0.128846,1077.498768,0.766194,2.14296,2.004793,0.971784
min,0.4999,1.0,0.846154,0.333333,3.0,0.75,32.54,-124.35,0.14999
25%,2.5259,18.0,4.407329,1.005413,805.0,2.450413,33.93,-121.77,1.167
50%,3.4478,29.0,5.170038,1.047619,1185.5,2.842105,34.26,-118.49,1.741
75%,4.583175,37.0,5.944617,1.096884,1752.0,3.308127,37.72,-118.0,2.485
max,15.0001,52.0,9.979167,3.411111,13251.0,9.954545,41.95,-114.55,4.991


In [49]:
100*df.isnull().sum()/df.shape[0]

MedInc        0.0
HouseAge      0.0
AveRooms      0.0
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
Price         0.0
dtype: float64

2. Separate the target from the features

In [50]:
target_variable = "Price"

X = df.drop(target_variable, axis = 1)
y = df.loc[:,target_variable]

In [51]:
X.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25


In [52]:
y.head()

0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
Name: Price, dtype: float64

## From linear to non-linear regression
An easy way to implement a non-linear regression is to manually create additional columns containing non-linear functions of the features.

3. For each explanatory variable, create 5 new columns in $X$ containing the following functions:
* $\textrm{X}^2$
* $\textrm{X}^3$
* $\textrm{X}^4$
* $\frac{1}{\textrm{X}}$
* $\frac{1}{\textrm{X}^2}$

In [53]:
features_list = X.columns
for c in features_list:
    X.loc[:, c + '_2'] = X[c]**2
    X.loc[:, c + '_3'] = X[c]**3
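    X.loc[:, c + '_4'] = X[c]**4  # fourth power (the outputs shown in this notebook were generated with **3 here, so the _4 columns equal the _3 columns)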
    X.loc[:, c + '_inverse'] = 1/X[c]
    X.loc[:, c + '_inverse2'] = 1/(X[c]**2)
X.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedInc_2,MedInc_3,...,Latitude_2,Latitude_3,Latitude_4,Latitude_inverse,Latitude_inverse2,Longitude_2,Longitude_3,Longitude_4,Longitude_inverse,Longitude_inverse2
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,69.308955,577.010912,...,1434.8944,54353.799872,54353.799872,0.026399,0.000697,14940.1729,-1826137.0,-1826137.0,-0.008181,6.7e-05
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,68.913242,572.076387,...,1433.3796,54267.751656,54267.751656,0.026413,0.000698,14937.7284,-1825689.0,-1825689.0,-0.008182,6.7e-05
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,52.669855,382.246204,...,1432.6225,54224.761625,54224.761625,0.02642,0.000698,14942.6176,-1826586.0,-1826586.0,-0.008181,6.7e-05
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,31.844578,179.702136,...,1432.6225,54224.761625,54224.761625,0.02642,0.000698,14945.0625,-1827034.0,-1827034.0,-0.00818,6.7e-05
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,14.793254,56.897815,...,1432.6225,54224.761625,54224.761625,0.02642,0.000698,14945.0625,-1827034.0,-1827034.0,-0.00818,6.7e-05
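
The same expansion can be wrapped in a small helper, sketched here on a toy DataFrame. The function name `add_nonlinear_features` and the toy data are hypothetical (they are not part of the original notebook); the helper always reads from the original columns, so the new columns are never expanded a second time.

```python
import pandas as pd

def add_nonlinear_features(frame):
    """Return a copy of `frame` with powers and inverses of every column appended."""
    out = frame.copy()
    for c in frame.columns:  # iterate over the original columns only
        out[c + '_2'] = frame[c] ** 2
        out[c + '_3'] = frame[c] ** 3
        out[c + '_4'] = frame[c] ** 4
        out[c + '_inverse'] = 1 / frame[c]
        out[c + '_inverse2'] = 1 / frame[c] ** 2
    return out

# Toy data, for illustration only
toy = pd.DataFrame({'a': [1.0, 2.0], 'b': [4.0, 5.0]})
expanded = add_nonlinear_features(toy)
print(expanded.shape)  # 2 original columns + 2 * 5 new ones
```

Note that a column containing zeros would produce infinite values in the inverse columns, so it is worth checking the minimum of each variable first.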


4. Split your dataset into train (80%) and test (20%)

In [54]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

5. Apply the same preprocessing as in the previous exercise

In [55]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
print(X_train[0:5,:]) 

[[-8.19927084e-01 -1.80892400e+00  9.55328430e-01  1.78190047e+00
  -4.16134941e-01  2.18006083e-01  2.11764744e-01 -2.73283112e-03
  -7.23284862e-01 -5.55080709e-01 -5.55080709e-01  5.13008216e-01
   1.82695953e-01 -1.26498533e+00 -9.49143480e-01 -9.49143480e-01
   2.46604354e+00  1.09423468e+00  9.05288599e-01  8.03814419e-01
   8.03814419e-01 -8.33507043e-01 -6.12210207e-01  1.45391405e+00
   1.00467627e+00  1.00467627e+00 -1.89314156e+00 -1.63270329e+00
  -3.05939544e-01 -1.69174011e-01 -1.69174011e-01 -7.12692942e-02
  -2.04111800e-02  7.67587648e-02 -3.41707972e-02 -3.41707972e-02
  -4.40113786e-01 -4.73219971e-01  1.80503284e-01  1.49155312e-01
   1.49155312e-01 -2.73488382e-01 -3.03698049e-01 -5.61484636e-03
   1.39454572e-02  1.39454572e-02  1.94679289e-02 -2.78495433e-02]
 [-1.12594285e-01 -1.72879039e+00  2.70646520e+00  3.80387675e+00
   1.40114366e+00 -5.61637519e-01 -3.67622089e-01  1.05246805e+00
  -2.62141004e-01 -3.10620943e-01 -3.10620943e-01 -2.65306230e-01
  -2.8361

In [56]:
X_test = scaler.transform(X_test) # don't fit again !
print(X_test[0:5,:]) # X_test is now a numpy array

[[ 1.04744947 -1.32812236  1.72693875 -0.13930381 -0.22139308 -0.29619119
  -0.33958724 -0.43781566  0.86843834  0.5874456   0.5874456  -0.84075904
  -0.5159755  -1.11905578 -0.90987681 -0.90987681  0.71843539  0.10718311
   1.86566728  1.9136803   1.9136803  -1.20037727 -0.80008105 -0.15085042
  -0.13734077 -0.13734077  0.06109415  0.01122288 -0.24311142 -0.15707881
  -0.15707881 -0.11170421 -0.02074989 -0.34000705 -0.31236615 -0.31236615
   0.07460051 -0.04399513 -0.36099219 -0.381315   -0.381315    0.29394803
   0.2699244   0.42995743 -0.42203602 -0.42203602  0.45333115 -0.46098272]
 [-0.64166613  1.23615306 -0.56959193 -0.29467615 -1.08556008 -0.52236244
   0.95468818 -1.10294228 -0.62336384 -0.51002032 -0.51002032  0.25621231
   0.00951432  1.30229387  1.25747691  1.25747691 -0.55255235 -0.19736173
  -0.61767668 -0.61894617 -0.61894617  0.31839396  0.13978309 -0.26582877
  -0.20956556 -0.20956556  0.26022794  0.19821476 -0.4309706  -0.18453718
  -0.18453718  0.53087518 -0.00776121
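
To see why the scaler is fitted on the training set only, here is a minimal sketch on made-up numbers (the arrays below are illustrative, not taken from the dataset). The test row is standardized with the mean and standard deviation learned from the training rows, so no information from the test set leaks into the preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values, for illustration only
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training rows
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(scaler.mean_)   # mean learned on the training rows
print(test_scaled)    # (10 - training mean) / training std
```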

6. Train a model including all these features. Do you get better performance than before?

In [57]:
model = LinearRegression()
model.fit(X_train, y_train)

In [58]:
print("R2 score on training set : ", model.score(X_train, y_train))
print("R2 score on test set : ", model.score(X_test, y_test))

R2 score on training set :  0.6806483963727856
R2 score on test set :  0.6832246870586418


## Forward selection
This feature engineering trick improved the model's score significantly! But the model is now a lot more complex, as it uses 48 input features (the 8 original variables plus 5 derived columns for each). Do we really need all of them? Let's implement the forward selection method described in this morning's lecture.

Fortunately, the latest versions of sklearn provide a class that implements forward selection, such that we don't need to code the algorithm by hand 🥳

7. Have a look at the documentation of [SequentialFeatureSelector](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html) and try to understand the following lines of code:

In [59]:
from sklearn.feature_selection import SequentialFeatureSelector
feature_selector = SequentialFeatureSelector(model, n_features_to_select = 20)
feature_selector.fit(X_train, y_train)
features_list = X.columns
best_features = features_list[feature_selector.support_]
print("According to the forward selection algorithm, the following features should be kept: ")
print(best_features.to_list())

According to the forward selection algorithm, the following features should be kept: 
['MedInc', 'HouseAge', 'Population', 'Latitude', 'MedInc_inverse2', 'HouseAge_inverse', 'AveRooms_3', 'AveRooms_inverse', 'AveRooms_inverse2', 'AveBedrms_2', 'AveBedrms_inverse', 'Population_inverse2', 'AveOccup_3', 'AveOccup_inverse', 'AveOccup_inverse2', 'Latitude_2', 'Latitude_3', 'Latitude_inverse', 'Latitude_inverse2', 'Longitude_inverse2']
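
To get an intuition for what the selector does, here is a minimal hand-written sketch of greedy forward selection on synthetic data. It is only an approximation of the idea, not sklearn's implementation: the helper name `forward_selection` and the toy data are assumptions made for illustration, and the candidate features are compared with a 5-fold cross-validated R2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, n_features, cv=5):
    """Greedily add the column that most improves the cross-validated R2."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features:
        scores = {}
        for j in remaining:
            cols = selected + [j]  # try adding candidate column j
            scores[j] = cross_val_score(LinearRegression(), X[:, cols], y, cv=cv).mean()
        best = max(scores, key=scores.get)  # candidate with the highest mean score
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data: only columns 1 and 4 carry signal
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 6))
y_toy = 3.0 * X_toy[:, 1] - 2.0 * X_toy[:, 4] + rng.normal(scale=0.1, size=200)

chosen = forward_selection(X_toy, y_toy, n_features=2)
print(sorted(chosen))
```

Each step refits one model per remaining candidate, which is why forward selection becomes slow when the number of features grows.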


8. Create a DataFrame X_best containing only the best set of features, train a model with only these features and evaluate its performance

In [60]:
X_best = X.loc[:, best_features]

X_train, X_test, y_train, y_test = train_test_split(X_best, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test) # don't fit again !

model = LinearRegression()
model.fit(X_train, y_train)

# Print R^2 scores
print("R2 score on training set : ", model.score(X_train, y_train))
print("R2 score on test set : ", model.score(X_test, y_test))

R2 score on training set :  0.6641637765998667
R2 score on test set :  0.6722761833968529


## Advanced feature engineering
Let's do some more advanced feature engineering. Until now, we've included the latitude and longitude in the models as is. However, GPS coordinates are usually not used raw; instead, we derive geographical information from them. Let's use an API that retrieves the name of the city from the latitude and longitude.

💡 As the calls to the API may be time-consuming, we'll work on a sample of the dataset.

9. Take a sample of your dataset X (the one that contains all the features and not only the best set, because we need the values of Latitude and Longitude). Keep only 150 rows.

In [61]:
X_sample = X.sample(150)
X_sample.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedInc_2,MedInc_3,...,Latitude_2,Latitude_3,Latitude_4,Latitude_inverse,Latitude_inverse2,Longitude_2,Longitude_3,Longitude_4,Longitude_inverse,Longitude_inverse2
2136,3.7578,24.0,5.061538,0.957692,781.0,3.003846,36.8,-119.73,14.121061,53.064122,...,1354.24,49836.032,49836.032,0.027174,0.000738,14335.2729,-1716362.0,-1716362.0,-0.008352,7e-05
7208,2.3462,42.0,3.75,1.010163,2191.0,4.453252,34.01,-118.18,5.504654,12.91502,...,1156.6801,39338.690201,39338.690201,0.029403,0.000865,13966.5124,-1650562.0,-1650562.0,-0.008462,7.2e-05
16605,4.2841,8.0,7.060367,1.076115,1085.0,2.847769,35.63,-120.67,18.353513,78.628284,...,1269.4969,45232.174547,45232.174547,0.028066,0.000788,14561.2489,-1757106.0,-1757106.0,-0.008287,6.9e-05
5149,0.5495,38.0,4.249057,1.018868,999.0,3.769811,33.96,-118.27,0.30195,0.165922,...,1153.2816,39165.443136,39165.443136,0.029446,0.000867,13987.7929,-1654336.0,-1654336.0,-0.008455,7.1e-05
20167,2.7019,22.0,5.510937,1.110937,1483.0,2.317188,34.44,-119.27,7.300264,19.724582,...,1186.1136,40849.752384,40849.752384,0.029036,0.000843,14225.3329,-1696655.0,-1696655.0,-0.008384,7e-05


10. Create a y_sample variable containing the target values corresponding to the rows that were kept in X_sample

In [62]:
y_sample = y.loc[X_sample.index]
y_sample.head()

2136     0.692
7208     1.273
16605    2.567
5149     0.917
20167    2.347
Name: Price, dtype: float64

11. Use [geopy](https://pypi.org/project/geopy) to convert the latitude and longitude of each observation into the corresponding city:

In [63]:
# !pip install geopy

In [64]:
# Example of how to get the address from a given pair of latitude/longitude coordinates
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="yet_another_app")
location = geolocator.reverse("52.509669, 13.376294")
loc_dict = dict(location.raw)
loc_dict["address"]

{'tourism': 'Potsdamer Platz',
 'road': 'Potsdamer Platz',
 'suburb': 'Mitte',
 'borough': 'Mitte',
 'city': 'Berlin',
 'ISO3166-2-lvl4': 'DE-BE',
 'postcode': '10785',
 'country': 'Deutschland',
 'country_code': 'de'}
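
Not every address returned by the API has a `city` key: smaller places may be reported under `town` or `village` instead. The fallback logic can be sketched as a small pure function that needs no network access. The helper name `city_from_address` and the two shortened dictionaries after the Berlin one are made up for illustration.

```python
def city_from_address(address, default="Unknown"):
    """Return the first of city, town or village found in an address dict."""
    for key in ("city", "town", "village"):
        if key in address:
            return address[key]
    return default  # none of the three keys is present

# Shortened version of the address shown above
berlin = {'road': 'Potsdamer Platz', 'city': 'Berlin', 'country_code': 'de'}
print(city_from_address(berlin))

# Made-up dictionaries, for illustration only
print(city_from_address({'town': 'Sonoma'}))
print(city_from_address({'county': 'Kern County'}))
```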

In [65]:
# Use geopy to extract the city of each row in the sample dataset
geolocator = Nominatim(user_agent="yet_another_app_2")  # create the client once, not at every iteration
X_sample["City"] = "Unknown"  # default value, kept if no city, town or village is found
for i, row in X_sample.iterrows():
    location = geolocator.reverse("{}, {}".format(row["Latitude"], row["Longitude"]),
                                  timeout = None)
    # reverse() may return None when nothing is found at these coordinates
    address = location.raw.get("address", {}) if location is not None else {}
    # Fall back from city to town to village
    for key in ("city", "town", "village"):
        if key in address:
            X_sample.loc[i, "City"] = address[key]
            break



In [66]:
X_sample.describe(include='all')

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedInc_2,MedInc_3,...,Latitude_3,Latitude_4,Latitude_inverse,Latitude_inverse2,Longitude_2,Longitude_3,Longitude_4,Longitude_inverse,Longitude_inverse2,City
count,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0,...,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150.0,150
unique,,,,,,,,,,,...,,,,,,,,,,78
top,,,,,,,,,,,...,,,,,,,,,,Unknown
freq,,,,,,,,,,,...,,,,,,,,,,18
mean,3.560617,28.246667,5.128556,1.063183,1471.513333,2.956455,35.710933,-119.669533,14.769479,69.779341,...,46065.556082,46065.556082,0.028107,0.000793,14325.027943,-1715284.0,-1715284.0,-0.008359,7e-05,
std,1.451041,12.221082,1.123036,0.116761,1120.002228,0.744426,2.211298,2.063766,12.233164,89.580819,...,8691.02504,8691.02504,0.001707,9.5e-05,494.909716,89029.83,89029.83,0.000144,2e-06,
min,0.5495,3.0,1.902087,0.884615,91.0,1.376963,32.58,-124.18,0.30195,0.165922,...,34582.249512,34582.249512,0.02451,0.000601,13342.5601,-1914939.0,-1914939.0,-0.008657,6.5e-05,
25%,2.49715,18.25,4.329086,1.010591,881.75,2.461991,33.9325,-121.8725,6.235761,15.571644,...,39070.376339,39070.376339,0.026523,0.000703,13909.254075,-1810161.0,-1810161.0,-0.008479,6.7e-05,
50%,3.2609,28.5,5.130846,1.048246,1256.5,2.880206,34.355,-118.515,10.633508,34.675067,...,40548.35106,40548.35106,0.029108,0.000847,14045.80585,-1664639.0,-1664639.0,-0.008438,7.1e-05,
75%,4.44865,37.0,5.761996,1.110097,1806.0,3.339743,37.7025,-117.9375,19.792236,88.064216,...,53593.312466,53593.312466,0.02947,0.000868,14852.906425,-1640423.0,-1640423.0,-0.008205,7.2e-05,


12. Split X_sample and y_sample into train and test sets

In [67]:
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size=0.2, random_state=0)

13. What preprocessing steps are necessary now? The cells below implement them; read them carefully and check what is done

In [68]:
categorical_features = ['City']
numeric_features = [c for c in X_sample.columns if c != 'City']

# There is probably a better way to build numeric_features

In [69]:
# Create transformer for numeric features
numeric_transformer = StandardScaler()

In [70]:
# Create transformer for categorical features
categorical_transformer = OneHotEncoder(drop='first', handle_unknown = 'ignore') # ignore if unknown categories are found in test set
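
The effect of `handle_unknown = 'ignore'` can be checked on a tiny example. This is a minimal sketch with made-up city names; `drop='first'` is left out here to keep the output easy to read. A category that was not seen during `fit` is encoded as a row of zeros instead of raising an error.

```python
from sklearn.preprocessing import OneHotEncoder

# Made-up city names, for illustration only
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit([['Fresno'], ['Oakland']])

known = encoder.transform([['Oakland']]).toarray()
unseen = encoder.transform([['Bakersfield']]).toarray()
print(known)   # one column per category seen during fit
print(unseen)  # all zeros: this category was not seen during fit
```

With only 150 rows and many distinct cities, several test rows are likely to fall in this "all zeros" case.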

In [71]:
# Use ColumnTransformer to make a preprocessor object that describes all the treatments to be done
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [72]:
X_train = preprocessor.fit_transform(X_train)

X_test = preprocessor.transform(X_test) # Don't fit again !! The test set is used for validating decisions

Performing preprocessings on train set...
...Done.
Performing preprocessings on test set...
...Done.




14. Train a regression model and evaluate its performance. Are you satisfied?

In [76]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

print("R2 score on training set : ", regressor.score(X_train, y_train).round(3))
print("R2 score on test set     : ", regressor.score(X_test, y_test).round(3))

R2 score on training set :  0.979
R2 score on test set     :  -1300.729
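
One plausible reading of this gap (an interpretation, not something the notebook states) is overfitting: after one-hot encoding the cities, the model has roughly as many columns as the 120 training rows, so a linear regression can fit the training set almost perfectly without generalizing. The sketch below reproduces this situation on synthetic data; the sizes and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_rows, n_cols = 150, 125  # more columns than the 120 training rows
X_toy = rng.normal(size=(n_rows, n_cols))
y_toy = X_toy[:, 0] + rng.normal(scale=0.5, size=n_rows)  # only column 0 matters

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)

train_r2 = reg.score(X_tr, y_tr)
test_r2 = reg.score(X_te, y_te)
print("train R2:", round(train_r2, 3))
print("test R2 :", round(test_r2, 3))
```

Using more rows, fewer features (for instance the forward-selected set) or a regularized model are possible ways to narrow the gap.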
