# The Etape 3 challenge is to create a new feature from our current data which might help our performance prediction.

Most of this notebook is copied from etape 2. With a few modifications within the pipeline to create the new feature.

In [187]:
import pandas as pd
import numpy as np
%matplotlib inline

## Load clean trees data from saved pickle into the `trees` dataframe

In [188]:
trees = pd.read_pickle("../data/pickle/trees_first_clean.pkl")

display first 5 lines

In [189]:
trees.head(5)

Unnamed: 0,ELEM_POINT_ID,CODE,SOUS_CATEGORIE,SOUS_CATEGORIE_DESC,CODE_PARENT,CODE_PARENT_DESC,ADR_SECTEUR,GENRE_BOTA,ESPECE,VARIETE,STADEDEDEVELOPPEMENT,ANNEEDEPLANTATION,COLLECTIVITE,type,longitude,latitude
0,32215,ESP32919,ESP151,Arbre de voirie,ESP32840,Crs Jaurès impair de PHD/ Lorr,2,Platanus,platanor,Vallis clausa,Arbre jeune,2014.0,Grenoble Alpes Métropole,Point,5.719919,45.190237
1,32214,ESP32918,ESP151,Arbre de voirie,ESP32840,Crs Jaurès impair de PHD/ Lorr,2,Tilia,mongolica,,Arbre jeune,2014.0,Grenoble Alpes Métropole,Point,5.719994,45.19028
2,32213,ESP32917,ESP151,Arbre de voirie,ESP32840,Crs Jaurès impair de PHD/ Lorr,2,Malus,perpetu,Evereste,Arbre jeune,2014.0,Grenoble Alpes Métropole,Point,5.720006,45.190322
3,32212,ESP32916,ESP151,Arbre de voirie,ESP32840,Crs Jaurès impair de PHD/ Lorr,2,Platanus,platanor,Vallis clausa,Arbre jeune,2014.0,Grenoble Alpes Métropole,Point,5.719959,45.190359
4,32211,ESP32915,ESP151,Arbre de voirie,ESP32840,Crs Jaurès impair de PHD/ Lorr,2,Tilia,mongolica,,Arbre jeune,2014.0,Grenoble Alpes Métropole,Point,5.720047,45.190442


display dataframe columns

In [190]:
trees.columns

Index(['ELEM_POINT_ID', 'CODE', 'SOUS_CATEGORIE', 'SOUS_CATEGORIE_DESC',
       'CODE_PARENT', 'CODE_PARENT_DESC', 'ADR_SECTEUR', 'GENRE_BOTA',
       'ESPECE', 'VARIETE', 'STADEDEDEVELOPPEMENT', 'ANNEEDEPLANTATION',
       'COLLECTIVITE', 'type', 'longitude', 'latitude'],
      dtype='object')

# Make train test split

#### random split

In the following we want to build a machine learning model that helps us predict the plantation year for different trees based on their characteristics.

For this you have to first remove the column `ANNEEDEPLANTATION` from the dataframe and save the result in the variable year. Hint: use `pop`

In [191]:
years = trees.pop("ANNEEDEPLANTATION")

In [192]:
from sklearn.model_selection import train_test_split

Split the trees and the year data into train and test partitions. Use seed=800.

You must save the result in 4 variables: `X_train` (instances in the train set), `X_test` (instances in the test set), `y_train` (labels corresponding to the instances in the train set), `y_test` (labels corresponding to the instances in the test set).

For clarifications, read the `train_test_split` documentation.

In [193]:
X_train,X_test,y_train,y_test = train_test_split(trees,years,random_state=800)

How many instances are there in the train set?

In [194]:
X_train.shape

(23550, 15)

How many instances are there in the test set?

In [195]:
X_test.shape

(7850, 15)

#### Stratified shuffle - alternative (and very optional at this stage!)

import  `StratifiedShuffleSplit` from scikit-learn package `model_selection`.

In [196]:
from sklearn.model_selection import StratifiedShuffleSplit

Create a `StratifiedShuffleSplit` instance with `n_split=1` and a `test_size=0.25`  and a `random_state=800`.

Store this instance in a variable named `stratSplit`.

In [197]:
stratSplit  = StratifiedShuffleSplit(n_splits=1,test_size=0.25,random_state=43)

Use the `split` function of this `StratifiedShuffleSplit` instance.

(*Hint the arguments of this function are avalaible in the [sklearn doc](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html)* )

Store the result of the `split` function in a variable named `splits`.

In [198]:
splits = stratSplit.split(trees,years)

What is the type of the `splits` variable?

In [199]:
type(splits)

generator

Use the `splits` variable for creating the test and train sets.

The following cell must create the `X_train`, `X_test`, `y_train`  and `y_test` variables.

In [200]:
X_train,X_test = trees.iloc[train_index],trees.iloc[test_index]
y_train,y_test = years.iloc[train_index],years.iloc[test_index]

NameError: name 'train_index' is not defined

In [201]:
print(X_train.shape,X_test.shape)

(23550, 15) (7850, 15)


Check if the the proportions we find look consistent between splits

In [202]:
# code for checking proportions in labels

In [203]:
# code for checking proportions in data

# Basic Statistics

Should usually only be done after the split. 

# Naive model
Used as a baseline model to know if our ML models are remotely useful. Here based on average tree age.

# Prepare data - using pipeline

Before starting this step, check how one hot encoding works.

Import the `StandardScaler` and `OneHotEncoder` from the scikit-learn `preprocessing` package.

Import the `SimpleImputer` from the scikit-learn `impute` package.
Finally import the `Pipeline` from the scikit-learn `pipeline` package

In [204]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

### Make pipelines for numerical and catageorical data

Import the `TransformerMixin` and `BaseEstimator` from the scikit-learn `base` package.

Import the `ColumnTransformer` from the scikit-learn `compose` package.

In [205]:
from sklearn.base import TransformerMixin,BaseEstimator
from sklearn.compose import ColumnTransformer

Now the goal is to create a pipeline for each type of attribute

The two lists in the code cell below contain the numerical and categorical attributes.

In [206]:
num_attribs = ['longitude', 'latitude']
cat_attribs = ["ADR_SECTEUR", 'COLLECTIVITE', 'STADEDEDEVELOPPEMENT'] # adding , 'GENRE_BOTA' here create many more columns which might cause problems. Leaving out for now.



Create two pipelines which will use a `SimpleImputer` for completing missing data:

- `num_pipeline` which process numerical data by scaling them using a `StandardScaler`.
- `cat_pipeline` which process categorical data by encoding categories using a `OneHotEncoder`.

In [207]:
class DataFrameSelector(TransformerMixin,BaseEstimator):
    def __init__(self,name_attributes):
        self.name_attributes = name_attributes

    def fit(self,X,y=None):
        return self

    def transform(self,X):
        return X[self.name_attributes].values

In [208]:

num_pipeline = Pipeline([("DFS",DataFrameSelector(num_attribs)),("SI",SimpleImputer(strategy="median")),("SS",StandardScaler())])
cat_pipeline = Pipeline([("DFS",DataFrameSelector(cat_attribs)),("SI",SimpleImputer(strategy='constant')),("OHE",OneHotEncoder(handle_unknown='ignore'))])

Check if pipelines creating expected shape for categorical features

In [209]:
X_trainC = cat_pipeline.fit_transform(X_train,y_train)
X_testC = cat_pipeline.fit_transform(X_train,y_train)

X_trainC.shape

(23550, 12)

In [210]:
X_trainC

<23550x12 sparse matrix of type '<class 'numpy.float64'>'
	with 70650 stored elements in Compressed Sparse Row format>

Check if pipelines creating expected shape for numerical features

In [211]:
X_trainN = num_pipeline.fit_transform(X_train,y_train)
X_testN = num_pipeline.fit_transform(X_train,y_train)

X_trainN.shape

(23550, 2)

### Transform the data using full pipeline

This step consists in combining the two pipelines into one named `full_pipeline`.

Create a variable named `full_pipeline` and use a `ColumnTransformer` to create a pipeline combining the 2 created before.

Import `FeatureUnion` from the scikit-learn `pipeline` package

In [212]:
# code here
full_pipeline = ColumnTransformer([("num",num_pipeline,num_attribs),("cat",cat_pipeline,cat_attribs)])

use the full pipeline to `fit_transform` the training set

In [213]:
X_trainT = full_pipeline.fit_transform(X_train,y_train)

Use the full pipeline to `fit_transform` the test set

In [214]:
X_testT = full_pipeline.transform(X_test)

Check the shape of both transformed set.

In [215]:
X_trainT.shape # quick check for shape -> yes,  13 columns is what we expect

(23550, 14)

Do the same to test data, just to ensure that there are no problems pre-modelling.

In [216]:
X_testT = full_pipeline.transform(X_test) # TRANSFORM ONLY(!) test data

### New pipeline

Now import `BaseEstimator` and `TransformerMixin` from the scikit-learn `base` package.

Complete the `CombinedAttributesAdderNewVersion` class declared below. 

In [217]:
# column index
longitude_id, latitude_id = X_train.columns.get_loc("longitude"),X_train.columns.get_loc("latitude")
# dataframe column indices for these cols

class CombinedAttributesAdderNewVersion(BaseEstimator, TransformerMixin):
    def __init__(self): # no *args or **kargs
        pass
    def fit(self, X, y=None):
        return self# return the object itself
    def transform(self, X):
        X = X.to_numpy()# convert the train set to numpy array
        new_feature = X[:,0]+X[:,1]# combine the two attributes of the given dataframe
        return pd.DataFrame(new_feature)
        # return as dataframe created from the new_feature variable declared before

Initialize the `CombinedAttributesAdderNewVersion` and use the `fit_transform` method on the `LONGITUDE` and `LATITUDE` features.

In [218]:
attr_adder = CombinedAttributesAdderNewVersion()# instance of CombinedAttributesAdderNewVersion
X_train_new_att = attr_adder.fit_transform(X_train.loc[:,["longitude","latitude"]]) # fit_transform on the right columns

display the new attributes

In [219]:
X_train_new_att

Unnamed: 0,0
0,50.905736
1,50.910903
2,50.929723
3,50.897548
4,50.912131
...,...
23545,50.918871
23546,50.922004
23547,50.927329
23548,50.924552


Use the pipelines created before and a new feature pipeline in order to create a full_pipeline which includes the newly created feature.

In [220]:
num_attribs = ['longitude', 'latitude']
cat_attribs = ["ADR_SECTEUR", 'COLLECTIVITE', 'STADEDEDEVELOPPEMENT']

num_pipeline = num_pipeline #numerical features pipeline

cat_pipeline = cat_pipeline #categorical features pipeline

new_features = Pipeline([("features",CombinedAttributesAdderNewVersion())]) # create the new feature using the CombinedAttributesAdderNewVersion

Create a full pipeline with the pipeline created before.

This full pipeline has to create two new features : 
- the first combines the `LONGITUDE`and `LATITUDE` columns
- the other combines the `LONGITUDE` and `ADR_SECTEUR` columns

Complete the cell below.

In [221]:

full_pipeline = ColumnTransformer([("num",num_pipeline,num_attribs),("cat",cat_pipeline,cat_attribs),(("f1"),new_features,num_attribs),("f2",new_features,["longitude","ADR_SECTEUR"])]) # code here 

Use the `fit_transform` method of the new pipeline on the train dataset.

In [222]:
X_trainT = full_pipeline.fit_transform(X_train,y_train) #code here

Display the `X_trainT` by instanciating a dataframe with it.

In [223]:
df = pd.DataFrame(X_trainT)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0.590012,-0.507185,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,50.905736,10.734914
1,0.982725,-0.497273,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,50.910903,10.739964
2,0.968887,1.098621,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,50.929723,7.739786
3,0.655137,-1.265350,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,50.897548,11.735751
4,0.920956,-0.327421,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,50.912131,10.739170
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23545,1.298442,-0.168982,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,50.918871,10.744024
23546,1.515768,-0.140606,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,50.922004,10.746818
23547,1.124375,0.729544,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,50.927329,10.741785
23548,0.255137,1.435226,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,50.924552,7.730608


# Model Build

### Linear regression

Import the `LinearRegression` model from the scikit-learn `linear_model` package.

Import `cross_val_score` from the scikit-learn `model_selection` package.

In [224]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score as cvs

Create the LinearRegression object.

In [225]:
lin_reg = LinearRegression()

Build linear regresion model and estimate test error using cross validation (10 folds).

Store the cross validation scores in a `scores` variable. 

*Use the Negative Mean Square Error ($NMSE$) as scoring metric.*

In [226]:
lin_reg.fit(X_trainT,y_train)

scores = cvs(lin_reg,X_trainT,y_train,scoring="neg_mean_squared_error",cv=10)

describe the scores

In [227]:
scores

array([-170.31411547, -176.98928496, -166.03339419, -169.41911681,
       -177.47335573, -168.89577116, -171.47021972, -173.57999428,
       -170.41281671, -175.47670399])

### Decision tree regressor

Now, import the `DecisionTreeRegressor` from the scikit-learn `tree` package.

In [228]:
from sklearn.tree import DecisionTreeRegressor

Build a `DecisionTreeRegressor` with a random state equals to 42.

In [229]:
dec_tree = DecisionTreeRegressor(random_state=42).fit(X_trainT,y_train)

use the `cross_val_score` method to train and evaluate the built model, with the **NMSE** scoring metric.

In [236]:
score_tree = cvs(dec_tree,X_trainT,y_train,scoring="neg_mean_squared_error",cv=10)

Describe the scores

In [237]:
score_tree

array([-75.25902335, -80.07388535, -84.78089172, -80.54140127,
       -70.44543524, -74.16645435, -80.2552017 , -70.85350318,
       -96.22377919, -74.02292994])

### random forest - small number of trees

Now, import the `RandomForestRegressor` from the scikit-learn `ensemble` package.

In [238]:
from sklearn.ensemble import RandomForestRegressor

Build a `RandomForestRegressor` with 4 `estimators` and 8 `max_features`.

In [239]:
r_for_reg = RandomForestRegressor(n_estimators=4,max_features=8).fit(X_trainT,y_train)

use the `cross_val_score` method to train and evaluate the built model, with the **NMSE** scoring metric.

In [240]:
score_rand_for = cvs(r_for_reg,X_trainT,y_train,scoring="neg_mean_squared_error",cv=10)

Describe the scores

In [241]:
score_rand_for

array([-56.75636943, -58.37489384, -63.7345276 , -57.49238323,
       -54.21435775, -60.98357219, -52.95106157, -51.31135881,
       -61.17940552, -58.58142251])

### Conclusion

Does these new features improve the model performance?

Compare to the previous 