# Final assignment

Time to condense all you've learnt through this course in a final assignment. This notebook serves as a basic template for you to fill in with the code needed to do what is requested.

As always, add as many cells as you need explaining your approach. In particular, **if you do things differently than you did for previous assigments (e.g., because of the feedback received, or because you come up with new ideas), please highlight it, and explain why you changed your mind**.

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Linear regression for sklearn
import sklearn.linear_model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
# make the number format not to display in a scientific format
pd.set_option('display.float_format', lambda x: '%.3f' % x)

from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeRegressor

## 1. Load data

First thing's first. Load the Used Cars Dataset, **using only the columns that you're going to use in your models**. That is, **DON'T read columns that**:
* <u>Are obviously useless</u> (for example, unique IDs).

* <u>Could be useful, but previous assignments showed you that they're not worth processing for prediction</u>.

* <u>Special data types that are different from pure numbers or categories</u> (e.g., geographical or text ones). However, **extra points will be given if you use them, as you've specific material on the subject**.

In [2]:
### Your code goes here
df = pd.read_csv(
    './vehicles.csv', 
    usecols=['price', 'year', 'manufacturer', 'condition', 'cylinders', 
             'fuel', 'odometer', 'title_status', 'transmission', 'drive', 
             'type', 'model']
)
df

#df=df.sample(frac=0.1, replace=True, random_state=1)

Unnamed: 0,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,type
0,6000,,,,,,,,,,,
1,11900,,,,,,,,,,,
2,21000,,,,,,,,,,,
3,1500,,,,,,,,,,,
4,4900,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
426875,23590,2019.000,nissan,maxima s sedan 4d,good,6 cylinders,gas,32226.000,clean,other,fwd,sedan
426876,30590,2020.000,volvo,s60 t5 momentum sedan 4d,good,,gas,12029.000,clean,other,fwd,sedan
426877,34990,2020.000,cadillac,xt4 sport suv 4d,good,,diesel,4174.000,clean,other,,hatchback
426878,28990,2018.000,lexus,es 350 sedan 4d,good,6 cylinders,gas,30112.000,clean,other,fwd,sedan


## 2. Divide into X and Y, and into training and testing set

At this point, the dataframe has all features we want in our data, so it's time to split it into training and testing set. Remember some things here:
* <u>`Y` is the `price` feature and `X` is all the rest of features</u>.
* <u>Use fixed ratios. For example, 80% for training and 20% for testing.</u>
* <u>We saw that there's no need to shuffle data, but if you do, justify it, and use always the same `RandomState` so that you always get the same split.</u>

**REMARK**: <b><u>don't use test data from now on, till you've tuned your final model! (Step 7 below)</u></b>. Remember: **test data is like the final exam, so we can't access its questions (= test samples) until we've studied (= trained and tuned our model)**.

In [3]:
### Your code goes here

X, Y = df.drop(columns=['price']), df['price']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, shuffle=False)

## 3. Remove problematic rows in X_train

Now that we have all columns (= features or attributes) needed, it's time to remove rows (= car ads in this case) that we don't want to include. Unfortunately, **scikit-learn doesn't support transformations that remove rows** (there's an <a href="https://github.com/scikit-learn/scikit-learn/issues/3855">open issue</a> about this), so **any row-removing operations must be done as preprocessing steps that cannot be included in a Pipeline**.

You know from previous assignments that there're several reasons for which we may want to delete some rows. The most important ones are:
* <u>Missing values</u>: there's some feature that has a *NaN/None* value, and we know that:
    * <u>The feature is difficult to impute with the rest of information at hand</u> (so we don't know how to replace it with a non-missing value).
    * <u>The fact that it's missing isn't relevant for prediction</u> (that is, if we keep it with a NaN value, the model doesn't take that fact into consideration). Note that this only applies to categorical features (as they can be encoded keeping NaNs), but not to numerical ones. **Numerical features cannot be NaN in sklearn models, except for very limited cases**.
* <u>Outliers</u>: some value of some feature (or belonging to the target `Y`) isn't missing, but it's wrong, out of bounds, or simply doesn't make sense from the business point of view.

**HINT**: once you've decided what rows to remove and with what logic, try to write a function that, given a dataset `X`, returns that same dataset without the rows that meet that logic. Why? Because you'll have to apply this function eventually to `X_test` before predicting for it (see Step 7 below).

**REMARK**: why don't we remove problematic rows before splitting into training and testing? Because **that's a subtle form of data leakage**. We simply don't know what will come in `X_test` (actually we do, but we have to behave as if we didn't know!), so we need to infer from `X_train` what is "problematic" or an "outlier" and what is not. **Note that in previous assignments we cheated a bit because we used the whole dataset to discard prices that are extreme, but now we should do things properly.**

We will start here with removing the missing values for the numerical variables which are the year and the odometer. We decided to drop the missing the values because both features cannot be imputed from other features. 

In [4]:
### Your code goes here

## before we drop any record we will join the X_train and Y_traun together to make sure that we drop the whole record
Train_temp = pd.concat([X_train,Y_train], axis = 1)

## Now we can drop the missing values
Train_temp = Train_temp.dropna(subset=['year', 'odometer'])  # just drop rows whenever year and/or odometer are missing

#print(Train_temp.isna().mean().sort_values().rename('% of samples with NAs in each feature'))



In the transformation part, we will try to impute the manufacturer by utilizing the model. However, If there are records that have missing model and manufacturer, we will drop them. The nxt block will check if we have both model and manufacturer missing and drop the records accordingly 

In [5]:
# If we have a record with a missing manufacturer and model, we will remove it.
manufacturer_missing_ind=pd.DataFrame(Train_temp[Train_temp['manufacturer'].isna()].index,columns=['index'])
model_missing_ind=pd.DataFrame(Train_temp[Train_temp['model'].isna()].index, columns=['index'])
both_missing_ind=pd.merge(manufacturer_missing_ind,model_missing_ind, on='index', how = 'inner')
Train_temp.drop(index=both_missing_ind['index'])
Train_temp.reset_index(inplace=True, drop=True)



Now we will be removing the outliers for the odometer and year by utilizing the turkey method

In [6]:
#Tukey's method
def tukeys_method(df, variable):
    #Takes two parameters: dataframe & variable of interest as string
    q1 = df[variable].quantile(0.25)
    q3 = df[variable].quantile(0.75)
    iqr = q3-q1
    inner_fence = 1.5*iqr
    outer_fence = 3*iqr
    
    #inner fence lower and upper end
    inner_fence_le = q1-inner_fence
    inner_fence_ue = q3+inner_fence
    
    #outer fence lower and upper end
    outer_fence_le = q1-outer_fence
    outer_fence_ue = q3+outer_fence
 
    for index, x in enumerate(df[variable]):
        if x <= outer_fence_le or x >= outer_fence_ue:
            #outliers_prob.append(x)
            df = df[df[variable] != x]
    return df

In [7]:

# Now let's remove the outliers from the odometer and year
Train_temp = tukeys_method(Train_temp,'odometer')
Train_temp = tukeys_method(Train_temp,'year')
Train_temp = tukeys_method(Train_temp,'price')

# After droping the problomatic rows we split the X and Y again
X_train, Y_train = Train_temp.drop(columns=['price']), Train_temp['price']
X_train_original, Y_train_original = Train_temp.drop(columns=['price']), Train_temp['price']

## 4. Transform X_train

At this point, you know that `X_train` has all necessary features, and you also know that the rows that have remained from the previous steps are the ones you're going to train with.

This is the core part. Consider all the features you loaded in point 1, and what you've done so far regarding missing values. **You need to take into consideration these facts**:
* <u>Numerical features cannot have missing values</u>.
* <u>Unless you use some particular models (e.g., trees), numerical features need to be scaled</u>. **This is particularly important if in step 5 you're using a linear model, or a non-linear model that relies on scalar products** (e.g., SVMs on Neural Networks).
* <u>All non-numerical features must be transformed into numbers</u>, so:
    * <u>Categorical features should be one-hot encoded</u> (perhaps with NaNs, perhaps without them).
    * <u>Ordinal features should be categorized and then one-hot encoded, or transformed into numbers somehow</u>.
    * <u>Text features should be tf-idf vectorized</u> (if you use them, and unless you use some more advanced NLP packages).
    * <u>Geographical features are tricky</u>. If numerical (such as coordinates), those numbers can only lie in particular ranges. If categorical, usually they follow a hierarchy (for examples regions include states, which include counties). Think carefully about what to do with these!
* <u>If you are imputing missing values for some feature, the logic must be included in this step.</u>
* <u>If you need to discard some feature because it's used at this step but not anymore, drop it now.</u> For example, if you use `model` to impute missing values in other features, but you don't want to use it for training because it has too many categories.
    
**HINT**: once you've thought about it, try to condense this into a `ColumnTransformer` that splits processing between different kinds of features. Depending on what you do, it's possible that some of those kind-specific transformations need to be compound too (i.e., not just single transformers but `Pipelines`, `FeatureUnions` or `ColumnTransformers`). Remember about recursion!

**HINT**: if you're doing things which aren't included in scikit-learn, such as imputing missing values with some more elaborate logic than replacing by mean or mode (so that `SimpleImputer` isn't enough), you can also **write your own transformers**. To do this, you'll need to write a class that inherits from `BaseEstimator` and `TransformerMixin` and implement yourself the `fit(X)` and the `transform(X)` methods.

In [8]:
def impute_by_model(X):
    for col in ['manufacturer','type','cylinders']:
        model_by_manu = X.groupby(['model',col], as_index=False).size()
        Manu_null=X[X[col].isna()]
        for i in Manu_null.iterrows():
            model=Manu_null.at[i[0],'model']
            temp = model_by_manu[model_by_manu['model'] == model]
            if len(temp) != 0:
                ind = temp[['size']].idxmax()
                X.at[i[0],col] = temp.at[ind[0],col]
    X=X.drop(columns=['model'])
    return X

In [9]:
def replace_missing_by_mode(X):
    for col in ['manufacturer','type','cylinders','fuel','title_status','transmission','condition','drive']:
        X[col]= X[col].where(~(X[col].isna()), other=X[col].mode()[0], inplace=False)
    return X

For the manufacturer & type columns we will transform it by using the average price of each one.

In [10]:
def transform_by_avrgprice(X,X_train=X_train_original,Y_train=Y_train_original):
    Train_temp = pd.concat([X_train,Y_train], axis = 1)
    for col in ['manufacturer','type']:
        df_man_train = Train_temp.groupby(col)[['price']].mean().sort_values(by=['price'], ascending=False).rename(
            columns={'price': col+'_avg_price'})

        # Now we create the dictionary:
        df_dict_man_train = df_man_train.to_dict()[col+'_avg_price']

        # And create a new variable based on each manufacturer's average price:
        X[col+'_avg_price'] = X[col].replace(df_dict_man_train)
        
        # Now we can drop the columns
        X=X.drop(columns=col)
        
    return X

For the rest of the columns we will use the one hot encouding

In [11]:
from sklearn.preprocessing import OneHotEncoder

def fun_ohe (X):
    
    for variable in ['cylinders','fuel','title_status','transmission','condition','drive']:
        ohe = OneHotEncoder(sparse=False, drop='first')
        ohe.fit(X[[variable]])
        ohe_df = pd.DataFrame(ohe.transform(X[[variable]]),
                     columns = ohe.get_feature_names([variable]))
        ohe_df.set_index(X.index, inplace=True)
        X=pd.concat([X, ohe_df], axis=1).drop([variable], axis=1)
    return X

In [12]:
from sklearn import preprocessing
def scaling(X):
    X = X.values
    Standard_Scaler = preprocessing.StandardScaler()
    X = Standard_Scaler.fit_transform(X)
    X = pd.DataFrame(X)
    return X

In [13]:
from sklearn.pipeline import Pipeline, make_pipeline

def Pipe_data_preprocess(X):
    X = X.pipe(impute_by_model).pipe(replace_missing_by_mode).pipe(transform_by_avrgprice).pipe(fun_ohe).pipe(scaling)
    return X


In [14]:
X_train = Pipe_data_preprocess(X_train)

## 5. Train models

Data are now in a suitable way for any model we want to train. Missing values have been dropped or filled, there're no outliers, numbers have been scaled, etc. Try to keep in mind lessons learnt in ML1 and ML2, as to which models may be more suitable for this problem, slower/faster to train, etc.

Also **decide on what metric to use to measure performance**; the one you feel more comfortable with, whatever. In any case, follow this motto: "start simple, and then add complexity little by little". The usual procedure is:
1. <u>Start with a really simple model</u>, perhaps even a `DummyRegressor` (or `DummyClassifier` if this was a classification problem). Such a simple model is very fast to train, and it gives you **a value of the error metric that you must improve. If you do worse than this, you're making some mistake in your pipeline**.
2. Once you've that reference dummy performance, <u>turn linear</u>. Use simple linear models, and see where you can get. **The new error should be better than the dummy one, but probably still not very satisfactory**. In any case, **this becomes the new reference to beat**.
3. Once you've the reference linear performance, <u>turn non-linear, but interpretable</u>. This is where trees, nearest neighbors or naïve bayes come in handy, as they're easily intepreted (if-then rules, using very similar samples, or using independent probabilities). **Most likely you'll get an error which is even better than linear one, so this becomes the new reference**.
4. <u>Turn non-linear and non-interpretable</u>. Typically here we use models like SVMs or Neural Networks, which are even more powerful, but harder to train and particularly difficult to explain in simple words.
5. If not even all of this is enough, <u>build ensembles</u>. That is, not relying on a single model, but combining what several models say.

**HINT**: build a `Pipeline` with the previous preprocessing transformations, and whose final step is the model you want to try. This ensures that transformations are applied before training.

**HINT**: use `GridSearchCV/RandomizedSearchCV` to not only try the default model, but also tune its more important hyperparameters. Remember that a `Pipeline` is an estimator, so that's what you feed into the search. Also, remember the double underscore trick to specify that a parameter belongs to the estimator, and also recall the different CV strategies. **If you don't do CV, you'll most likely end up overfitting.**

In [15]:
### Using dummy regressor by utilizing the mean stategy
dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(X_train, Y_train)


DummyRegressor()

In [16]:
### Using linear regression
Lin_regr = sklearn.linear_model.LinearRegression()
Lin_regr.fit(X_train, Y_train)


LinearRegression()

In [21]:
### Using KNN with CV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
Knn_reg = KNeighborsRegressor()
Knn_grid = {'n_neighbors': [4,5,6,7]}
Knn_model = GridSearchCV(Knn_reg, Knn_grid, scoring='neg_root_mean_squared_error', cv = 5)
Knn_model.fit(X_train, Y_train)

GridSearchCV(cv=5, estimator=KNeighborsRegressor(),
             param_grid={'n_neighbors': [4, 5, 6, 7]},
             scoring='neg_root_mean_squared_error')

In [23]:
Knn_model.best_params_

{'n_neighbors': 7}

In [24]:
## Using Nueral Network
### we will try to tune the network with 3,5 or 7 layers using grid search. We will fix the number of nodes to 30

from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV
NN_reg = MLPRegressor()
NN_grid = {'hidden_layer_sizes': [(20),(20,20,20),(20,20,20,20,20)], 'early_stopping': [True]}
NN_model = GridSearchCV(NN_reg, NN_grid, scoring='neg_root_mean_squared_error', cv = 5)
NN_model.fit(X_train, Y_train)

GridSearchCV(cv=5, estimator=MLPRegressor(),
             param_grid={'early_stopping': [True],
                         'hidden_layer_sizes': [20, (20, 20, 20),
                                                (20, 20, 20, 20, 20)]},
             scoring='neg_root_mean_squared_error')

In [25]:
NN_model.best_params_

{'early_stopping': True, 'hidden_layer_sizes': (20, 20, 20, 20, 20)}

In [None]:
# Using SVM
import sklearn.svm
from sklearn.model_selection import GridSearchCV

SVR_Grid = {'C' : [1,5,10]}

SVR_Reg = sklearn.svm.SVR()

SVR_model = GridSearchCV(SVR_Reg,SVR_Grid, scoring='neg_root_mean_squared_error', cv=5)

SVR_model.fit(X_train,Y_train)

In [None]:
SVR_model.best_params_

## 6. Decide what the final model would be

After having done all the previous steps, you'll have trained several models. Now you need to **decide which of those is the one you're going to choose as your best**. Following the exam metaphor, you have to pick your best student to win the ML Olympics!

**HINT**: in order to try different models, you can write an outer loop that tries different estimators, as well as what hyperparameters and values for those hyperparameters are to be tried in the CV search. This loop calls CV search, picks its `best_estimator_` and compares its performance with the best performance you had so far. If the new one is better than your current best, this becomes your new best.

**HINT**: it's not always about performance (= score). Sometimes you can improve a bit the score, at the expense of training a model that takes way longer, or that requires much more memory. Besides this, clients usually require some interpretability on the model, so think twice about what "best" means.

In [26]:
### Your code goes here
## Dummy regressor predictor
dummy_regr_pred_train = dummy_regr.predict(X_train)
print(mean_squared_error(dummy_regr_pred_train,Y_train,squared=False))

14257.227082995028


In [27]:
## Linear Regressor predictor
Lin_train_predicted = Lin_regr.predict(X_train)
print(mean_squared_error(Lin_train_predicted,Y_train,squared=False))

10359.812667667422


In [33]:
## KNN predictor
best_knn = Knn_model.best_estimator_
Knn_train_predict = best_knn.predict(X_train)
print('The best KNN model score is ', Knn_model.best_score_)
print(mean_squared_error(Knn_train_predict,Y_train,squared=False))

The best KNN model score is  -9277.560981152643
6430.898958791743


In [34]:
## Neural Networks predictor
best_NN = NN_model.best_estimator_
NN_train_predict = best_NN.predict(X_train)
print('The best NN model score is ', NN_model.best_score_)
print(mean_squared_error(NN_train_predict,Y_train,squared=False))

The best NN model score is  -9706.744309802121
9495.164233635734


In [None]:
## SVM Predictor
best_svr = SVR_model.best_estimator_
SVR_train_predict = best_svr.predict(X_train)
print('The best SVM model score is ', SVM_model.best_score_)
print(mean_squared_error(SVR_train_predict,Y_train,squared=False))

## 7. Test your final model

Time to recover `X_test`, which was put on hold since Step 2. Now you've a final `Pipeline` from Step 6, which knows how to transform the data in `X_test` (Step 4) and knows how to predict (because it's been fit by Step 5).

**HINT**: **beware that Step 3 hasn't been applied to the test set!** You need to do that before calling `Pipeline.score(X_test)`. For example, if your model doesn't deal with missing values, you need to remove any row from `X_test` that has missing values! Otherwise the code will crash.

From section 6 we can conclude that KNN is the best model, however, we will also check test with the neural network since the RMSE score are very close.

In [35]:
### Your code goes here
## Before we can start testing with the test set, we need to transoform X_test.
Test_temp = pd.concat([X_test,Y_test], axis = 1)
Test_temp = Test_temp.dropna(subset=['year', 'odometer']) #just drop rows whenever year and/or odometer are missing

manufacturer_missing_ind=pd.DataFrame(Test_temp[Test_temp['manufacturer'].isna()].index,columns=['index'])
model_missing_ind=pd.DataFrame(Test_temp[Test_temp['model'].isna()].index, columns=['index'])
both_missing_ind=pd.merge(manufacturer_missing_ind,model_missing_ind, on='index', how = 'inner')
Test_temp.drop(index=both_missing_ind['index'])
Test_temp.reset_index(inplace=True, drop=True)

Test_temp = tukeys_method(Test_temp,'price')

X_test, Y_test = Test_temp.drop(columns=['price']), Test_temp['price']

X_test= Pipe_data_preprocess(X_test)


In [36]:
## The KNN with 7 nighbors was the best estimator, therefore, we will test with it.
Knn_test_predict = best_knn.predict(X_test)
print(mean_squared_error(Knn_test_predict,Y_test,squared=False))

12074.461572208716


In [37]:
best_NN = NN_model.best_estimator_
NN_test_predict = best_NN.predict(X_test)
print(mean_squared_error(NN_test_predict,Y_test,squared=False))

11568.433320615046


## 8. (Optional) Revisit what you've done

Once you get the score for `X_test`, it's tempting to try to improve even more. If you got a much worse performance than for `X_train`, chances are that you're overfitting, so you need to refine your CV strategy, use regularization, or choose parameters that don't drive to that (for example, don't let a tree grow without limit!).

This is like when you fail an exam, and you want to have another try. The catch is that you already know what the exam is (you saw `X_test`), and you also got your marks (the `score`), so it's not taking another similar exam (as would happen in real life), but taking the same exam again. Strictly speaking, this is another subtle form of data leakage, but a widely used one. The hope is that by refining the training strategy, even if we're cheating a bit, the behavior of the final model when it actually takes another, different exam (that is, when it's put into production), will be better than our current one would have obtained.

So we're going to **overlook this fact and allow that, given the results in Step 7, you can go back, try again, change your final estimator in Step 6 and retry Step 7, until you can't get any better**.

**HINT**: besides the score, you can also plot your predictions against reality, and try to infer when you predict wrongly. This can give you insights on how to improve the model and/or the pre-processing part.

**It looks like the KNN model is over fitting, let's try to search for better model by increasing the number of neighbours in the search grid**

In [38]:
### Your code goes here
# Let's try to improve the KNN by searching more neighbours 9, 11, 13 & 15
Knn_reg = KNeighborsRegressor()
Knn_grid = {'n_neighbors': [9,11,13, 15]}
Knn_model = GridSearchCV(Knn_reg, Knn_grid, scoring='neg_root_mean_squared_error', cv = 5)
Knn_model.fit(X_train, Y_train)

GridSearchCV(cv=5, estimator=KNeighborsRegressor(),
             param_grid={'n_neighbors': [9, 11, 13, 15]},
             scoring='neg_root_mean_squared_error')

In [39]:
Knn_model.best_params_

{'n_neighbors': 9}

In [40]:
#let's implement the best model on training set
best_knn = Knn_model.best_estimator_
Knn_train_predict = best_knn.predict(X_train)
print('The best KNN model score is ', Knn_model.best_score_)
print(mean_squared_error(Knn_train_predict,Y_train,squared=False))

The best KNN model score is  -9279.987865077841
6792.315012936213


In [41]:
# now let's test with the best model
Knn_test_predict = best_knn.predict(X_test)
print(mean_squared_error(Knn_test_predict,Y_test,squared=False))

11987.403133899445


**The model was improved but still it looks like overfitting. We will increase the number of neighbours and check again.**

In [42]:
# Let's try to improve the KNN by searching more neighbours 13, 15, 17 & 19
Knn_reg = KNeighborsRegressor()
Knn_grid = {'n_neighbors': [13,15,17,19]}
Knn_model = GridSearchCV(Knn_reg, Knn_grid, scoring='neg_root_mean_squared_error', cv = 5)
Knn_model.fit(X_train, Y_train)

GridSearchCV(cv=5, estimator=KNeighborsRegressor(),
             param_grid={'n_neighbors': [13, 15, 17, 19]},
             scoring='neg_root_mean_squared_error')

In [43]:
Knn_model.best_params_

{'n_neighbors': 13}

In [44]:
#let's implement the best model on training set
best_knn = Knn_model.best_estimator_
Knn_train_predict = best_knn.predict(X_train)
print('The best NN model score is ', NN_model.best_score_)
print(mean_squared_error(Knn_train_predict,Y_train,squared=False))

The best NN model score is  -9706.744309802121
7270.736951403021


In [45]:
# now let's test with the best model
Knn_test_predict = best_knn.predict(X_test)
print(mean_squared_error(Knn_test_predict,Y_test,squared=False))

11885.190811506334


Now we will try to improve the neural network by changing the number of nodes using only **one layer**
we will try to tune the network with **20, 30** or **40 neurones** using grid search.
we will try **both rectified linear unit (relu)** or **logistic sigmoid function (logistic)** 

In [46]:
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV
NN_reg = MLPRegressor()
NN_grid = {'hidden_layer_sizes': [(20),(30),(40)], 'activation': ['relu','logistic'], 'early_stopping': [True]}
NN_model = GridSearchCV(NN_reg, NN_grid, scoring='neg_root_mean_squared_error', cv = 5)
NN_model.fit(X_train, Y_train)

GridSearchCV(cv=5, estimator=MLPRegressor(),
             param_grid={'activation': ['relu', 'logistic'],
                         'early_stopping': [True],
                         'hidden_layer_sizes': [20, 30, 40]},
             scoring='neg_root_mean_squared_error')

In [47]:
NN_model.best_params_

{'activation': 'relu', 'early_stopping': True, 'hidden_layer_sizes': 40}

In [48]:
best_NN = NN_model.best_estimator_
NN_train_predict = best_NN.predict(X_train)
print('The best NN model score is ', NN_model.best_score_)
print(mean_squared_error(NN_train_predict,Y_train,squared=False))

The best NN model score is  -9871.967548808569
9757.268437734008


In [49]:
best_NN = NN_model.best_estimator_
NN_test_predict = best_NN.predict(X_test)
print(mean_squared_error(NN_test_predict,Y_test,squared=False))

11509.879186786744


In conclusion, the best model will be the neural network with one hidden layer and rectified linear unit (relu) activation.

## 9. (Not Optional) Study for your final exam!

I hope that this final assignment, together with previous ones, gives you a clear view on how ML must be done in real life. I also hope that it's useful for making up your mind, clarifying concepts and understanding much better all we've seen.

All the course slides, notebooks, assignments, feedbacks and forum answers are now your personal `X_train`, your training dataset. So now it's just calling `fit` on yourselves, attending the exam, seeing what the questions in `X_test` are, calling `predict(X_test)` on yourselves (that is, trying to answer correctly all `Y_test`), and getting the highest score possible!

Real life is like ML, or ML is like real life, the way you prefer to see it. Thanks for your patience!