# Final assignment

Time to condense all you've learnt through this course in a final assignment. This notebook serves as a basic template for you to fill in with the code needed to do what is requested.

As always, add as many cells as you need explaining your approach. In particular, **if you do things differently than you did for previous assignments (e.g., because of the feedback received, or because you come up with new ideas), please highlight it, and explain why you changed your mind**.

In [31]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Linear regression for sklearn
import sklearn.linear_model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
# make the number format not to display in a scientific format
pd.set_option('display.float_format', lambda x: '%.3f' % x)

from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeRegressor

## 1. Load data

First things first. Load the Used Cars Dataset, **using only the columns that you're going to use in your models**. That is, **DON'T read columns that**:
* <u>Are obviously useless</u> (for example, unique IDs).

* <u>Could be useful, but previous assignments showed you that they're not worth processing for prediction</u>.

* <u>Special data types that are different from pure numbers or categories</u> (e.g., geographical or text ones). However, **extra points will be given if you use them, as you've specific material on the subject**.

In [32]:
### Your code goes here
df = pd.read_csv(
    './vehicles.csv', 
    usecols=['price', 'year', 'manufacturer' ,'cylinders', 
             'fuel', 'odometer', 'transmission', 'drive', 'type', 'model']
)
df=df.sample(frac=0.10, replace=True, random_state=1)
df


| | price | year | manufacturer | model | cylinders | fuel | odometer | transmission | drive | type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128037 | 35995 | 2018.000 | acura | tlx 3.5l v6 | 6 cylinders | gas | 22026.000 | automatic | fwd | sedan |
| 267336 | 35990 | 2018.000 | audi | a4 allroad premium plus | NaN | gas | 29157.000 | other | NaN | wagon |
| 312201 | 54990 | 2015.000 | ram | 2500 | NaN | diesel | 68664.000 | automatic | 4wd | pickup |
| 371403 | 15999 | 2015.000 | kia | optima | NaN | gas | 68235.000 | automatic | NaN | other |
| 73349 | 0 | 2018.000 | ford | f-150 | NaN | gas | 53964.000 | automatic | 4wd | pickup |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 31637 | 8000 | 2015.000 | chevrolet | cruze lt | 4 cylinders | gas | 75000.000 | automatic | fwd | sedan |
| 264939 | 46875 | 2015.000 | chevrolet | corvette | NaN | gas | 29337.000 | manual | NaN | coupe |
| 18784 | 32590 | 2015.000 | mercedes-benz | gla-class gla 45 | NaN | gas | 34811.000 | other | NaN | other |
| 107796 | 36500 | 2018.000 | nissan | titan crew cab | 8 cylinders | gas | 35116.000 | automatic | rwd | truck |


## 2. Divide into X and Y, and into training and testing set

At this point, the dataframe has all features we want in our data, so it's time to split it into training and testing set. Remember some things here:
* <u>`Y` is the `price` feature and `X` is all the rest of features</u>.
* <u>Use fixed ratios. For example, 80% for training and 20% for testing.</u>
* <u>We saw that there's no need to shuffle data, but if you do, justify it, and use always the same `RandomState` so that you always get the same split.</u>

**REMARK**: <b><u>don't use test data from now on, till you've tuned your final model! (Step 7 below)</u></b>. Remember: **test data is like the final exam, so we can't access its questions (= test samples) until we've studied (= trained and tuned our model)**.
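
As a minimal sketch of the two splitting options described above, the toy example below (the data frame `toy` and its values are made up for illustration) compares an unshuffled 80/20 split with a shuffled one that fixes `random_state`. In both cases the split should come out identical on every run.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the car ads (values are invented for illustration)
toy = pd.DataFrame({"odometer": np.arange(10) * 1000, "price": np.arange(10) * 500})
X_toy, y_toy = toy.drop(columns=["price"]), toy["price"]

# Option 1: no shuffling, so the first 80% of the rows go to training
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, train_size=0.8, shuffle=False)

# Option 2: shuffling with a fixed seed, so repeated calls give the same split
split_a = train_test_split(X_toy, y_toy, train_size=0.8, shuffle=True, random_state=42)
split_b = train_test_split(X_toy, y_toy, train_size=0.8, shuffle=True, random_state=42)

print(list(X_tr.index), list(X_te.index))
print(split_a[0].index.equals(split_b[0].index))
```

Whichever option is chosen, the test part is set aside at this point and only touched again in the final evaluation.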

In [33]:
### Your code goes here

X, Y = df.drop(columns=['price']), df['price']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, shuffle=False)

## 3. Remove problematic rows in X_train

Now that we have all columns (= features or attributes) needed, it's time to remove rows (= car ads in this case) that we don't want to include. Unfortunately, **scikit-learn doesn't support transformations that remove rows** (there's an <a href="https://github.com/scikit-learn/scikit-learn/issues/3855">open issue</a> about this), so **any row-removing operations must be done as preprocessing steps that cannot be included in a Pipeline**.

You know from previous assignments that there're several reasons for which we may want to delete some rows. The most important ones are:
* <u>Missing values</u>: there's some feature that has a *NaN/None* value, and we know that:
    * <u>The feature is difficult to impute with the rest of information at hand</u> (so we don't know how to replace it with a non-missing value).
    * <u>The fact that it's missing isn't relevant for prediction</u> (that is, if we keep it with a NaN value, the model doesn't take that fact into consideration). Note that this only applies to categorical features (as they can be encoded keeping NaNs), but not to numerical ones. **Numerical features cannot be NaN in sklearn models, except for very limited cases**.
* <u>Outliers</u>: some value of some feature (or belonging to the target `Y`) isn't missing, but it's wrong, out of bounds, or simply doesn't make sense from the business point of view.

**HINT**: once you've decided what rows to remove and with what logic, try to write a function that, given a dataset `X`, returns that same dataset without the rows that meet that logic. Why? Because you'll have to apply this function eventually to `X_test` before predicting for it (see Step 7 below).

**REMARK**: why don't we remove problematic rows before splitting into training and testing? Because **that's a subtle form of data leakage**. We simply don't know what will come in `X_test` (actually we do, but we have to behave as if we didn't know!), so we need to infer from `X_train` what is "problematic" or an "outlier" and what is not. **Note that in previous assignments we cheated a bit because we used the whole dataset to discard prices that are extreme, but now we should do things properly.**
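
One possible way to follow the hint and the remark together is to separate "learning" the limits from "applying" them. The sketch below uses two hypothetical helpers, `fit_fences` and `remove_rows` (names and toy values are assumptions, not part of the assignment): the fences are computed from training data only, and the same fences are later applied to any other dataset, including the test set.

```python
import pandas as pd

def fit_fences(train, columns, k=1.5):
    """Learn Tukey fences (q1 - k*IQR, q3 + k*IQR) per column, from training data only."""
    fences = {}
    for col in columns:
        q1, q3 = train[col].quantile(0.25), train[col].quantile(0.75)
        iqr = q3 - q1
        fences[col] = (q1 - k * iqr, q3 + k * iqr)
    return fences

def remove_rows(data, fences):
    """Keep rows strictly inside every fence; rows with NaN in a fenced column are dropped too."""
    keep = pd.Series(True, index=data.index)
    for col, (low, high) in fences.items():
        keep &= (data[col] > low) & (data[col] < high)
    return data[keep]

# Toy prices (invented): one extreme value in each set
train = pd.DataFrame({"price": [5000, 6000, 7000, 8000, 9000, 500000]})
test = pd.DataFrame({"price": [6500, 1000000]})

fences = fit_fences(train, ["price"])       # learnt from the training set
train_clean = remove_rows(train, fences)    # applied to the training set
test_clean = remove_rows(test, fences)      # applied, unchanged, to the test set

print(fences)
print(len(train_clean), len(test_clean))
```

Because the test set never influences the fences, this avoids the leakage described in the remark.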

We start by removing the missing values of the numerical variables, which are the year and the odometer. We decided to drop the missing values because neither feature can be imputed from the other features.

In [34]:
### Your code goes here

## Before we drop any record we join X_train and Y_train, to make sure that we drop the whole record
Train_temp = pd.concat([X_train,Y_train], axis = 1)

## Now we can drop the missing values
#Train_temp = Train_temp.dropna(subset=['year', 'odometer'])  # just drop rows whenever year and/or odometer are missing

print(Train_temp.isna().mean().sort_values().rename('% of samples with NAs in each feature'))



price          0.000
year           0.003
transmission   0.005
fuel           0.006
odometer       0.010
model          0.013
manufacturer   0.042
type           0.221
drive          0.307
cylinders      0.412
Name: % of samples with NAs in each feature, dtype: float64


In the transformation part, we will try to impute the manufacturer from the model. However, if a record is missing both the model and the manufacturer, we will drop it. The next block checks for records with both values missing and drops them.

In [35]:
# If a record is missing both the manufacturer and the model, we remove it.
both_missing = Train_temp['manufacturer'].isna() & Train_temp['model'].isna()
# Keep the remaining rows (the result must be assigned back, otherwise nothing is dropped)
Train_temp = Train_temp[~both_missing].reset_index(drop=True)



In [36]:
# Tukey's method
def tukeys_method(df, variable):
    """Return df without the rows whose `variable` lies on or beyond Tukey's inner fences.

    Takes two parameters: the dataframe and the variable of interest as a string.
    Rows where `variable` is NaN are kept.
    """
    q1 = df[variable].quantile(0.25)
    q3 = df[variable].quantile(0.75)
    iqr = q3 - q1

    # Inner fence lower and upper end (1.5 * IQR beyond the quartiles)
    inner_fence_le = q1 - 1.5 * iqr
    inner_fence_ue = q3 + 1.5 * iqr

    # Flag the values on or outside the inner fences and drop those rows in one vectorized step
    is_outlier = (df[variable] <= inner_fence_le) | (df[variable] >= inner_fence_ue)
    return df[~is_outlier]

In [37]:

# Now let's remove the outliers from the odometer, year and price
Train_temp = tukeys_method(Train_temp,'odometer')
Train_temp = tukeys_method(Train_temp,'year')
Train_temp = tukeys_method(Train_temp,'price')
Train_temp.price.describe()
# After dropping the problematic rows we split X and Y again
X_train, Y_train = Train_temp.drop(columns=['price']), Train_temp['price']


269950

## 4. Transform X_train

At this point, you know that `X_train` has all necessary features, and you also know that the rows that have remained from the previous steps are the ones you're going to train with.

This is the core part. Consider all the features you loaded in point 1, and what you've done so far regarding missing values. **You need to take into consideration these facts**:
* <u>Numerical features cannot have missing values</u>.
* <u>Unless you use some particular models (e.g., trees), numerical features need to be scaled</u>. **This is particularly important if in step 5 you're using a linear model, or a non-linear model that relies on scalar products** (e.g., SVMs or Neural Networks).
* <u>All non-numerical features must be transformed into numbers</u>, so:
    * <u>Categorical features should be one-hot encoded</u> (perhaps with NaNs, perhaps without them).
    * <u>Ordinal features should be categorized and then one-hot encoded, or transformed into numbers somehow</u>.
    * <u>Text features should be tf-idf vectorized</u> (if you use them, and unless you use some more advanced NLP packages).
    * <u>Geographical features are tricky</u>. If numerical (such as coordinates), those numbers can only lie in particular ranges. If categorical, usually they follow a hierarchy (for example, regions include states, which include counties). Think carefully about what to do with these!
* <u>If you are imputing missing values for some feature, the logic must be included in this step.</u>
* <u>If you need to discard some feature because it's used at this step but not anymore, drop it now.</u> For example, if you use `model` to impute missing values in other features, but you don't want to use it for training because it has too many categories.
    
**HINT**: once you've thought about it, try to condense this into a `ColumnTransformer` that splits processing between different kinds of features. Depending on what you do, it's possible that some of those kind-specific transformations need to be compound too (i.e., not just single transformers but `Pipelines`, `FeatureUnions` or `ColumnTransformers`). Remember about recursion!
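
The hint above can be sketched on a tiny made-up data frame as follows. The column names mirror the dataset, but the values and the variable names (`toy`, `toy_preprocessor`) are assumptions for illustration. Each kind of feature gets its own small `Pipeline`, and the `ColumnTransformer` routes columns to the right one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (invented values) with missing entries in every column
toy = pd.DataFrame({
    "year": [2015, 2018, np.nan, 2012],
    "odometer": [68000, 22000, 53000, np.nan],
    "fuel": ["gas", "diesel", np.nan, "gas"],
})

# Numerical columns: fill NaNs with the median, then scale
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill NaNs with the mode, then one-hot encode
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric, ["year", "odometer"]),
        ("cat", categorical, ["fuel"]),
    ],
    remainder="drop",  # any column not listed above is discarded
)

features = toy_preprocessor.fit_transform(toy)
print(features.shape)
```

Here the output has two scaled numerical columns plus one column per fuel category seen during `fit`.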

**HINT**: if you're doing things which aren't included in scikit-learn, such as imputing missing values with some more elaborate logic than replacing by mean or mode (so that `SimpleImputer` isn't enough), you can also **write your own transformers**. To do this, you'll need to write a class that inherits from `BaseEstimator` and `TransformerMixin` and implement yourself the `fit(X)` and the `transform(X)` methods.
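
As an illustration of the hint above, here is a minimal sketch of a custom transformer. The class `GroupMedianImputer`, its parameters and the toy values are hypothetical and not part of scikit-learn or of the assignment. It fills a missing `odometer` with the median odometer of cars from the same `year`, falling back to the overall median for years not seen during `fit`.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class GroupMedianImputer(BaseEstimator, TransformerMixin):
    """Hypothetical example: fill NaNs in `target` with the median of rows sharing the same `group` value."""

    def __init__(self, group="year", target="odometer"):
        # __init__ only stores the parameters, as scikit-learn expects
        self.group = group
        self.target = target

    def fit(self, X, y=None):
        # All statistics are learnt here, from the data passed to fit (the training set)
        self.group_medians_ = X.groupby(self.group)[self.target].median()
        self.global_median_ = X[self.target].median()
        return self

    def transform(self, X):
        X_ = X.copy()
        # Median of the matching group, or the global median if the group was never seen
        fill = X_[self.group].map(self.group_medians_).fillna(self.global_median_)
        X_[self.target] = X_[self.target].fillna(fill)
        return X_

# Toy data (invented values)
train_ads = pd.DataFrame({"year": [2015, 2015, 2018, 2018],
                          "odometer": [60000.0, 80000.0, 20000.0, 30000.0]})
new_ads = pd.DataFrame({"year": [2015, 2018, 2001],
                        "odometer": [np.nan, np.nan, np.nan]})

imputer = GroupMedianImputer().fit(train_ads)
filled = imputer.transform(new_ads)
print(filled["odometer"].tolist())  # [70000.0, 25000.0, 45000.0]
```

Since the class follows the `fit`/`transform` contract, it could be placed as a step inside a `Pipeline` like any built-in transformer.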

In [39]:
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer

class ExperimentalTransformer(BaseEstimator, TransformerMixin):
    """Fill missing values in each column with the value most often seen for the same `model`, then drop `model`."""

    def fit(self, X, y = None):
        # Columns to impute: everything except `model`, which is only used as the lookup key
        self.final_columns = [col_name for col_name in X.columns if col_name != 'model']
        # For each of those columns, count how often every (model, value) pair occurs in the training data
        self.model_by_typ = [
            pd.DataFrame({'size' : X.groupby(['model', col_name]).size()}).reset_index()
            for col_name in self.final_columns
        ]
        return self

    def transform(self, X, y = None):
        X_ = X.copy()
        for col_name, model_by_typ in zip(self.final_columns, self.model_by_typ):
            # Most frequent value of this column for every model seen during fit
            top_rows = model_by_typ.loc[model_by_typ.groupby('model')['size'].idxmax()]
            most_frequent = top_rows.set_index('model')[col_name]
            # Rows whose model is missing or unseen map to NaN and are left for a later imputer
            missing = X_[col_name].isna()
            X_.loc[missing, col_name] = X_.loc[missing, 'model'].map(most_frequent).to_numpy()
        # `model` has done its job, so it is not returned
        return X_[self.final_columns]

class ExperimentalTransformer2(BaseEstimator, TransformerMixin):
    """Replace each category with the average training price of that category."""

    def fit(self, X, y = None):
        prices = pd.Series(np.asarray(y, dtype=float))
        # Global average price, used as a fallback for categories not seen during fit
        self.avg = prices.mean()
        # One {category: average price} dictionary per column; missing values form their own 'na' category
        self.df_dict_man_train = [
            prices.groupby(X[col_name].fillna('na').to_numpy()).mean().to_dict()
            for col_name in X.columns
        ]
        return self

    def transform(self, X, y = None):
        X_ = X.copy()
        for col_name, avg_price in zip(X_.columns, self.df_dict_man_train):
            # Missing values use the 'na' average; unseen categories map to NaN
            # and fall back to the global average (the result is assigned back to the column)
            X_[col_name] = X_[col_name].fillna('na').map(avg_price).fillna(self.avg)
        return X_

In [40]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
print("create pipeline 1")
#categorical_features = ['manufacturer', 'type', 'model','cylinders']
categorical_features = ['manufacturer', 'type', 'model']
categorical_features1 = ['cylinders', 'model']
categorical_features2 = ['transmission', 'drive','fuel']
numrical_features1 = ['year', 'odometer']
pipe1 = Pipeline(steps=[
                       ('fill_data', ExperimentalTransformer()),
                      ('convert_with_price', ExperimentalTransformer2()),
                      ('imputer1',SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
                      ('scaler', StandardScaler())
                       #('enc', OneHotEncoder(sparse = False, drop ='first'))
])
pipe2 = Pipeline(steps=[
                       ('fill_data_2', ExperimentalTransformer()),
                      ('imputer2',SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
                       ('enc', OneHotEncoder(handle_unknown='ignore'))
])
pipe3 = Pipeline(steps=[
                      ('imputer3',SimpleImputer(missing_values=np.nan, strategy='median')),
                       ('scaler', StandardScaler())
])
pipe4 = Pipeline(steps=[
                       ('imputer4',SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
                       ('enc2', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', pipe1, categorical_features),
        ('cat1', pipe2, categorical_features1),
        ('num', pipe3, numrical_features1),
        ('cat2', pipe4, categorical_features2),
    ],remainder='drop')
#new=preprocessor.fit_transform(X_train,Y_train)
#clf = Pipeline(steps=[('preprocessor', preprocessor),
#                       ('linear_model', LinearRegression())])

print("fit pipeline 1")
#print(new)
#clf.fit(X_train, Y_train)
#preds1 = clf.predict(X_test)
#print(preds1)

create pipeline 1
fit pipeline 1

In [41]:
LRModel = Pipeline(steps=[
                       ('preprocessor', preprocessor), 
                       ('linear_model', LinearRegression())
])

print("fit pipeline LR")
LRModel.fit(X_train, Y_train)  
preds4 = LRModel.predict(X_train)  # predictions on the training set
print(f"\n{preds4}")
print(f"RMSE: {np.sqrt(mean_squared_error(Y_train, preds4))}\n")

fit pipeline LR


In [None]:
import sklearn.svm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
RFR = RandomForestRegressor(max_depth=2, random_state=0)

RFRModel = Pipeline(steps=[
                        # incorrect column name passed
                       ('preprocessor', preprocessor), 
                       ('RFR', RFR)
])
'''
print("fit pipeline LR")
RFRModel.fit(X_train, Y_train)  
preds4 = RFRModel.predict(X_train) 
print(f"\n{preds4}")  # should be [196. 289.]
print(f"RMSE: {np.sqrt(mean_squared_error(Y_train, preds4))}\n")
'''
param_grid = dict(RFR__max_depth=[2, 3, 4,5],
                  RFR__n_estimators=[10, 20,30]
                 )

grid_search = GridSearchCV(RFRModel, param_grid=param_grid, verbose=10, n_jobs=-1, cv=3)
grid_search.fit(X_train, Y_train)
print ('best params')
print(grid_search.best_estimator_)
print ('score')
print(grid_search.best_score_)


Fitting 3 folds for each of 12 candidates, totalling 36 fits

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
Knn_reg = KNeighborsRegressor(n_neighbors=3)
# Grid search result is 10

KNNModel = Pipeline(steps=[
                        # incorrect column name passed
                       ('preprocessor', preprocessor), 
                       ('KNN',Knn_reg )
])
'''
print("fit pipeline LR")
KNNModel.fit(X_train, Y_train)  
preds4 = KNNModel.predict(X_train) 
print(f"\n{preds4}")  # should be [196. 289.]
print(f"RMSE: {np.sqrt(mean_squared_error(Y_train, preds4))}\n")
'''
param_grid = dict(KNN__n_neighbors=[4, 6, 8,10]
                 )

grid_search = GridSearchCV(KNNModel, param_grid=param_grid, verbose=10, n_jobs=-1, cv=3)
grid_search.fit(X_train, Y_train)
print ('best params')
print(grid_search.best_estimator_)
print ('score')
print(grid_search.best_score_)

In [None]:
reg_decision_model=DecisionTreeRegressor(max_depth=2,min_samples_leaf=2)

DTRModel = Pipeline(steps=[
                        # incorrect column name passed
                       ('preprocessor', preprocessor), 
                       ('reg_decision_model',reg_decision_model )
])
'''
print("fit pipeline LR")
DTRModel.fit(X_train, Y_train)  
preds4 = DTRModel.predict(X_train) 
print(f"\n{preds4}")  # should be [196. 289.]
print(f"RMSE: {np.sqrt(mean_squared_error(Y_train, preds4))}\n")
'''
param_grid = dict(reg_decision_model__max_depth=[2, 3, 4,5],
                  reg_decision_model__min_samples_leaf=[1, 2,3]
                 )

grid_search = GridSearchCV(DTRModel, param_grid=param_grid, verbose=10, n_jobs=-1, cv=3)
grid_search.fit(X_train, Y_train)
print ('best params')
print(grid_search.best_estimator_)
print ('score')
print(grid_search.best_score_)


In [None]:

from sklearn.svm import SVR
SVR_model=SVR(C=1.0, epsilon=0.2,kernel="linear")

SVRModel = Pipeline(steps=[
                        # incorrect column name passed
                       ('preprocessor', preprocessor), 
                       ('SVR',SVR_model )
])
'''
print("fit pipeline LR")
SVRModel.fit(X_train, Y_train)  
preds4 = SVRModel.predict(X_train) 
print(f"\n{preds4}")  # should be [196. 289.]
print(f"RMSE: {np.sqrt(mean_squared_error(Y_train, preds4))}\n")
'''

param_grid = dict(SVR__kernel=["linear", "rbf", "sigmoid", "poly"],
                  SVR__C=[1, 1.5, 2, 2.5, 3]
                 )

grid_search = GridSearchCV(SVRModel, param_grid=param_grid, verbose=10, n_jobs=-1, cv=3)
grid_search.fit(X_train, Y_train)
print ('best params')
print(grid_search.best_estimator_)
print ('score')
print(grid_search.best_score_)


In [None]:

from sklearn.dummy import DummyRegressor
DR_model=DummyRegressor(strategy="mean")

DRModel = Pipeline(steps=[
                       ('preprocessor', preprocessor), 
                       ('DR',DR_model )
])

print("fit pipeline DR")
DRModel.fit(X_train, Y_train)  
preds4 = DRModel.predict(X_train)  # the dummy model predicts the mean training price for every row
print(f"\n{preds4}")
print(f"RMSE: {np.sqrt(mean_squared_error(Y_train, preds4))}\n")

In [None]:
# to be removed
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
DR_model=DummyRegressor(strategy="mean")
reg_decision_model=DecisionTreeRegressor(max_depth=2,min_samples_leaf=2)
Knn_reg = KNeighborsRegressor(n_neighbors=3)
BestModel = Pipeline(steps=[
                        # incorrect column name passed
                       ('preprocessor', preprocessor), 
                       ('KNN',Knn_reg ), 
                       ('reg_decision_model',reg_decision_model )
])

print("fit pipeline LR")
DRModel.fit(X_train, Y_train)  
preds4 = BestModel.predict(X_train) 
print(f"\n{preds4}")  # should be [196. 289.]
print(f"RMSE: {np.sqrt(mean_squared_error(Y_train, preds4))}\n")

param_grid = dict(reg_decision_model__max_depth=[2, 3, 4,5],
                  reg_decision_model__min_samples_leaf=[1, 2,3],
                  KNN__n_neighbors=[4, 6, 8,10]
                 )

grid_search = GridSearchCV(BestModel, param_grid=param_grid, verbose=10, n_jobs=-1, cv=3)
grid_search.fit(X_train, Y_train)
print ('best estimator')
print(grid_search.best_estimator_)
print ('best params')
print(grid_search.best_params_)
print ('score')
print(grid_search.best_score_)

In [None]:
new=pd.DataFrame(new,columns=['manu','type','cylinders'])
((new.isnull().sum()/new.isnull().count()) * 100).sort_values(ascending=False)

In [None]:
### Your code goes here
## Here we are trying to impute the manufacturer based on the model
model_by_manu = X_train.groupby(['model','manufacturer'], as_index=False).size()
Manu_null=X_train[X_train['manufacturer'].isna()]
for i in Manu_null.iterrows():
    model=Manu_null.at[i[0],'model']
    temp = model_by_manu[model_by_manu['model'] == model]
    if len(temp) != 0:
        ind = temp[['size']].idxmax()
        if (temp[['size']]==temp[['size']].max()).sum()>1:
            print('more than one match with max count')
        X_train.at[i[0],'manufacturer'] = temp.at[ind[0],'manufacturer']

In [None]:
print(X_train.isna().mean().sort_values().rename('% of samples with NAs in each feature'))

In [None]:
## Here we impute the missing type values with the same strategy used for the manufacturer
model_by_typ= pd.DataFrame({'size' : X_train.groupby( [ 'model','type'] ).size()}).reset_index()
Typ_null=X_train[X_train['type'].isna()]
for i in Typ_null.iterrows():
    model=Typ_null.at[i[0],'model']
    temp = model_by_typ[model_by_typ['model'] == model]
    if len(temp) != 0:
        # Use the type that appears most often for this model
        ind = temp[['size']].idxmax()
        X_train.at[i[0],'type'] = temp.at[ind[0],'type']

In [None]:
print(X_train.isna().mean().sort_values().rename('% of samples with NAs in each feature'))

In [None]:
## Here we impute the missing cylinders in a similar way
model_by_cyl = X_train.groupby(['model','cylinders'], as_index=False).size()
Cyl_null=X_train[X_train['cylinders'].isna()]
for i in Cyl_null.iterrows():
    model=Cyl_null.at[i[0],'model']
    temp = model_by_cyl[model_by_cyl['model'] == model]
    if len(temp) != 0:
        ind = temp[['size']].idxmax()
        X_train.at[i[0],'cylinders'] = temp.at[ind[0],'cylinders']

In [None]:
print(X_train.isna().mean().sort_values().rename('% of samples with NAs in each feature'))

Since the model column is free text, it is not going to be useful in our model; therefore, we are going to drop it from our training set. We used it only in the imputation part.

In [None]:
X_train=X_train.drop(columns=['model'])

For the manufacturer and type columns, we will transform them by using the average price of each category.
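
The idea can be sketched on toy data as follows (all values and variable names such as `train_ads` are invented for illustration). The averages are computed on the training set only. Using `Series.map` instead of `Series.replace` means that a category never seen in training becomes NaN rather than staying as a string, so it can be filled with the overall average price.

```python
import pandas as pd

# Toy data (invented): the last training ad has no manufacturer
train_ads = pd.DataFrame({
    "manufacturer": ["ford", "ford", "bmw", None],
    "price": [10000, 20000, 40000, 6000],
})
test_ads = pd.DataFrame({"manufacturer": ["bmw", "tesla", None]})

# Treat missing manufacturers as their own 'na' category
train_cat = train_ads["manufacturer"].fillna("na")
test_cat = test_ads["manufacturer"].fillna("na")

# Average price per category, computed on the training data only
avg_price = train_ads["price"].groupby(train_cat).mean()
global_avg = train_ads["price"].mean()

train_ads["manufacturer_avg_price"] = train_cat.map(avg_price)
# 'tesla' was never seen in training, so it falls back to the overall average
test_ads["manufacturer_avg_price"] = test_cat.map(avg_price).fillna(global_avg)

print(test_ads)
```

The same pattern would apply to `type`, with its own table of averages.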

In [None]:
X_train['manufacturer']= X_train['manufacturer'].where(~(X_train['manufacturer'].isna()), other='na', inplace=False)
X_train['type']= X_train['type'].where(~(X_train['type'].isna()), other='na', inplace=False)

Train_temp = pd.concat([X_train,Y_train], axis = 1)

df_man_train = Train_temp.groupby('manufacturer')[['price']].mean().sort_values(by=['price'], ascending=False).rename(
    columns={'price': 'man_avg_price'})

df_typ_train = Train_temp.groupby('type')[['price']].mean().sort_values(by=['price'], ascending=False).rename(
    columns={'price': 'typ_avg_price'})

# Now we create the dictionary:
df_dict_man_train = df_man_train.to_dict()['man_avg_price']
df_dict_typ_train = df_typ_train.to_dict()['typ_avg_price']

# And create a new variable based on each manufacturer's average price:
X_train['manufacturer_avg_price'] = X_train['manufacturer'].replace(df_dict_man_train)
X_test['manufacturer_avg_price'] = X_test['manufacturer'].replace(df_dict_man_train)

X_train['type_avg_price'] = X_train['type'].replace(df_dict_typ_train)
X_test['type_avg_price'] = X_test['type'].replace(df_dict_typ_train)

# Now we can drop the columns
X_train=X_train.drop(columns='manufacturer')
X_train=X_train.drop(columns='type')
X_test=X_test.drop(columns='manufacturer')
X_test=X_test.drop(columns='type')

For the rest of the columns we will use one-hot encoding.

In [None]:
from sklearn.preprocessing import OneHotEncoder

def fun_ohe (df_in, variable):
    # One-hot encode `variable`, dropping the first level to avoid a redundant column
    ohe = OneHotEncoder(drop='first')
    ohe.fit(df_in[[variable]])
    # .toarray() turns the sparse result into a dense array;
    # get_feature_names_out replaces the removed get_feature_names (scikit-learn >= 1.0)
    ohe_df = pd.DataFrame(ohe.transform(df_in[[variable]]).toarray(),
                 columns = ohe.get_feature_names_out([variable]))
    ohe_df.set_index(df_in.index, inplace=True)
    return pd.concat([df_in, ohe_df], axis=1).drop([variable], axis=1)

# replace nan with 'na' (only for the categorical columns loaded in step 1)
for col in ['cylinders', 'transmission', 'fuel', 'drive']:
    X_train[col] = X_train[col].fillna('na')

# apply ohe function
for col in ['cylinders', 'fuel', 'transmission', 'drive']:
    X_train = fun_ohe(df_in=X_train, variable=col)


In [29]:
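# Standardize the numeric columns (fit on train only, then apply to test)

from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
# Plain assignment does not copy: X_train_scaled and X_train are the same object,
# so the models below, which are fit on X_train, also see the scaled values.
X_train_scaled = X_train
X_test_scaled = X_test
X_train_scaled[['year', 'odometer']] = scale.fit_transform(X_train[['year', 'odometer']])
X_test_scaled[['year', 'odometer']] = scale.transform(X_test[['year', 'odometer']])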

## 5. Train models

The data are now in a suitable form for any model we want to train. Missing values have been dropped or filled, there are no outliers, numbers have been scaled, etc. Try to keep in mind lessons learnt in ML1 and ML2, as to which models may be more suitable for this problem, slower/faster to train, etc.

Also **decide on what metric to use to measure performance**; the one you feel more comfortable with, whatever. In any case, follow this motto: "start simple, and then add complexity little by little". The usual procedure is:
1. <u>Start with a really simple model</u>, perhaps even a `DummyRegressor` (or `DummyClassifier` if this was a classification problem). Such a simple model is very fast to train, and it gives you **a value of the error metric that you must improve. If you do worse than this, you're making some mistake in your pipeline**.
2. Once you've that reference dummy performance, <u>turn linear</u>. Use simple linear models, and see where you can get. **The new error should be better than the dummy one, but probably still not very satisfactory**. In any case, **this becomes the new reference to beat**.
3. Once you've the reference linear performance, <u>turn non-linear, but interpretable</u>. This is where trees, nearest neighbors or naïve Bayes come in handy, as they're easily interpreted (if-then rules, using very similar samples, or using independent probabilities). **Most likely you'll get an error which is even better than the linear one, so this becomes the new reference**.
4. <u>Turn non-linear and non-interpretable</u>. Typically here we use models like SVMs or Neural Networks, which are even more powerful, but harder to train and particularly difficult to explain in simple words.
5. If not even all of this is enough, <u>build ensembles</u>. That is, not relying on a single model, but combining what several models say.

**HINT**: build a `Pipeline` with the previous preprocessing transformations, and whose final step is the model you want to try. This ensures that transformations are applied before training.
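
A minimal sketch of such a `Pipeline`, assuming synthetic data and hypothetical column names (`year`, `odometer`, `fuel`); the real notebook would plug in its own columns and model:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cars data (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "year": rng.integers(1995, 2021, size=n).astype(float),
    "odometer": rng.uniform(0, 250_000, size=n),
    "fuel": rng.choice(["gas", "diesel", "hybrid"], size=n),
})
y = 500 * (X["year"] - 1995) - 0.03 * X["odometer"] + rng.normal(0, 500, size=n)

# Numeric columns: fill gaps with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route each group of columns to its own transformation.
preprocess = ColumnTransformer([
    ("num", numeric, ["year", "odometer"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
])

# The model is the final step, so fit() always preprocesses first.
pipe = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
pipe.fit(X, y)
r2 = pipe.score(X, y)
print(round(r2, 3))
```

Swapping `LinearRegression()` for another estimator changes only the last step; the preprocessing stays identical for every model that is tried.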

**HINT**: use `GridSearchCV/RandomizedSearchCV` to not only try the default model, but also tune its more important hyperparameters. Remember that a `Pipeline` is an estimator, so that's what you feed into the search. Also, remember the double underscore trick to specify that a parameter belongs to the estimator, and also recall the different CV strategies. **If you don't do CV, you'll most likely end up overfitting.**
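
A minimal sketch of the search described in this hint, assuming synthetic data and hypothetical step names (`scale`, `knn`). The key `knn__n_neighbors` uses the double underscore to address a parameter of the `knn` step, and `scoring` is set explicitly so the search optimizes RMSE rather than the default R²:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsRegressor())])

# "<step name>__<parameter>" targets a parameter inside the pipeline.
param_grid = {"knn__n_neighbors": [3, 5, 9, 15]}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the sign
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

# Cross-validated RMSE of the best setting (flip the sign back).
cv_rmse = -search.best_score_
print(search.best_params_, round(cv_rmse, 3))
```

Unlike an error computed on `X_train` after fitting, `cv_rmse` comes from held-out folds, so it is a fairer number to compare across models.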

In [18]:
### Dummy regressor using the mean strategy (baseline to beat)
dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(X_train, Y_train)
dummy_regr_pred_train = dummy_regr.predict(X_train)
# RMSE on the training set; mean_squared_error expects (y_true, y_pred)
print(mean_squared_error(Y_train, dummy_regr_pred_train, squared=False))

11848064.610319335


In [20]:
### Linear regression
Lin_regr = sklearn.linear_model.LinearRegression()
Lin_regr.fit(X_train, Y_train)
Lin_train_predicted = Lin_regr.predict(X_train)
print(mean_squared_error(Y_train, Lin_train_predicted, squared=False))


11846629.811360905


In [55]:
### Using KNN with CV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
Knn_reg = KNeighborsRegressor()
Knn_grid = {'n_neighbors': [4,5,6,7]}
Knn_model = GridSearchCV(Knn_reg, Knn_grid, cv = 5, n_jobs=-1)
Knn_model.fit(X_train, Y_train)

10144218.813050075


In [72]:
Knn_model.best_params_

{'n_neighbors': 5}

In [65]:
best_knn = Knn_model.best_estimator_
Knn_train_predict = best_knn.predict(X_train)
print(mean_squared_error(Y_train, Knn_train_predict, squared=False))

10144218.813050075


In [67]:
## Using a neural network (Keras)
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
from sklearn.preprocessing import OneHotEncoder
from sklearn import preprocessing
import tensorflow

In [73]:
def model():
    model = Sequential()
    model.add(Dense(10, activation = 'relu'))
    model.add(BatchNormalization())
    model.add(Dense(10, activation = 'relu'))
    model.add(BatchNormalization())
    model.add(Dense(15, activation = 'relu'))
    model.add(Dense(15, activation = 'relu'))
    model.add(BatchNormalization())
    model.add(Dense(15, activation = 'relu'))
    model.add(Dense(15, activation = 'relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.compile(loss='mse', optimizer = 'rmsprop', metrics = ['mae'])
    
    return model

In [74]:
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score

In [30]:
#results = model.fit(X_train, Y_train, epochs=20, batch_size=64)
# n_jobs is not a KerasRegressor argument (parallelism belongs to cross_val_score), so it is left out
NN_model = KerasRegressor(build_fn=model, epochs=20, batch_size=64, verbose=0)

NameError: name 'KerasRegressor' is not defined

In [79]:
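# scoring must be a scikit-learn scorer name; a Keras metric function has the wrong
# signature, which is one likely cause of the NaN scores recorded below
cross_val_score(NN_model, np.array(X_train), np.array(Y_train), scoring='neg_mean_squared_error', cv=5)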

array([nan, nan, nan, nan, nan])

In [None]:
### From here on, the cells come from the no-transform version of the notebook


In [None]:
## Using a neural network (scikit-learn MLPRegressor)
### We tune the depth of the network (3, 5 or 7 hidden layers) with grid search, fixing the width at 30 nodes per layer

from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV
NN_reg = MLPRegressor()
NN_grid = {'hidden_layer_sizes': [(30,30,30),(30,30,30,30,30),(30,30,30,30,30,30,30)]}
NN_model = GridSearchCV(NN_reg, NN_grid,n_jobs = -1,  cv = 5)
NN_model.fit(X_train, Y_train)

In [None]:
NN_model.best_params_

In [None]:
best_NN = NN_model.best_estimator_
NN_train_predict = best_NN.predict(X_train)
print(mean_squared_error(Y_train, NN_train_predict, squared=False))

In [None]:
# Using SVM
import sklearn.svm
from sklearn.model_selection import GridSearchCV

SVR_Grid = {'C' : [1,5,10]}

SVR_Reg = sklearn.svm.SVR()

SVR_model = GridSearchCV(SVR_Reg,SVR_Grid,n_jobs = -1, cv = 5)

SVR_model.fit(X_train,Y_train)

best_svr = SVR_model.best_estimator_
SVR_train_predict = best_svr.predict(X_train)
print(mean_squared_error(Y_train, SVR_train_predict, squared=False))

In [None]:
SVR_model.best_params_

## 6. Decide what the final model would be

After having done all the previous steps, you'll have trained several models. Now you need to **decide which of those is the one you're going to choose as your best**. Following the exam metaphor, you have to pick your best student to win the ML Olympics!

**HINT**: in order to try different models, you can write an outer loop that tries different estimators, as well as what hyperparameters and values for those hyperparameters are to be tried in the CV search. This loop calls CV search, picks its `best_estimator_` and compares its performance with the best performance you had so far. If the new one is better than your current best, this becomes your new best.
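
A minimal sketch of the outer loop described in this hint, assuming synthetic data and a small, hypothetical set of candidates and grids:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

# Each candidate is an estimator plus the grid to search for it.
candidates = {
    "dummy": (DummyRegressor(), {"strategy": ["mean", "median"]}),
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "tree": (DecisionTreeRegressor(random_state=0), {"max_depth": [2, 4, 8]}),
}

best_name, best_score, best_model = None, -np.inf, None
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid,
                          scoring="neg_root_mean_squared_error", cv=5)
    search.fit(X, y)
    print(f"{name}: CV RMSE = {-search.best_score_:.3f}, params = {search.best_params_}")
    # Scores are negated errors, so a higher score means a lower RMSE.
    if search.best_score_ > best_score:
        best_name, best_score, best_model = name, search.best_score_, search.best_estimator_

print("Best:", best_name)
```

Comparing the cross-validated scores, rather than training errors, keeps the comparison fair between models of different flexibility.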

**HINT**: it's not always about performance (= score). Sometimes you can improve the score a bit, at the expense of training a model that takes way longer, or that requires much more memory. Besides this, clients usually require some interpretability on the model, so think twice about what "best" means.

In [None]:
### Your code goes here


## 7. Test your final model

Time to recover `X_test`, which was put on hold since Step 2. Now you've a final `Pipeline` from Step 6, which knows how to transform the data in `X_test` (Step 4) and knows how to predict (because it's been fit by Step 5).

**HINT**: **beware that Step 3 hasn't been applied to the test set!** You need to do that before calling `Pipeline.score(X_test, Y_test)`. For example, if your model doesn't deal with missing values, you need to remove any row from `X_test` that has missing values, together with the matching entries of `Y_test`! Otherwise the code will crash.
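
A minimal sketch of that clean-up, assuming tiny hand-made frames in place of the real splits: rows with missing values are removed from the features and the target with the same mask, so both stay aligned before scoring.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Tiny hand-made splits (hypothetical values).
X_train = pd.DataFrame({"year": [2005.0, 2010.0, 2015.0, 2020.0],
                        "odometer": [150000.0, 100000.0, 60000.0, 10000.0]})
Y_train = pd.Series([4000.0, 8000.0, 14000.0, 25000.0])

X_test = pd.DataFrame({"year": [2012.0, np.nan, 2018.0, 2008.0],
                       "odometer": [80000.0, 50000.0, 20000.0, 120000.0]})
Y_test = pd.Series([11000.0, 9000.0, 21000.0, 6000.0])

# Keep only complete rows, using one mask for features and target.
complete = X_test.notna().all(axis=1)
X_test_clean = X_test[complete]
Y_test_clean = Y_test[complete]

model = LinearRegression().fit(X_train, Y_train)
test_pred = model.predict(X_test_clean)

# RMSE computed via np.sqrt so it does not depend on the `squared` argument.
test_rmse = float(np.sqrt(mean_squared_error(Y_test_clean, test_pred)))
print(len(X_test_clean), round(test_rmse, 1))
```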

In [None]:
### Your code goes here

## 8. (Optional) Revisit what you've done

Once you get the score for `X_test`, it's tempting to try to improve even more. If you got a much worse performance than for `X_train`, chances are that you're overfitting, so you need to refine your CV strategy, use regularization, or choose parameters that don't lead to it (for example, don't let a tree grow without limit!).

This is like when you fail an exam, and you want to have another try. The catch is that you already know what the exam is (you saw `X_test`), and you also got your marks (the `score`), so it's not taking another similar exam (as would happen in real life), but taking the same exam again. Strictly speaking, this is another subtle form of data leakage, but a widely used one. The hope is that by refining the training strategy, even if we're cheating a bit, the behavior of the final model when it actually takes another, different exam (that is, when it's put into production), will be better than our current one would have obtained.

So we're going to **overlook this fact and allow that, given the results in Step 7, you can go back, try again, change your final estimator in Step 6 and retry Step 7, until you can't get any better**.

**HINT**: besides the score, you can also plot your predictions against reality, and try to infer when you predict wrongly. This can give you insights on how to improve the model and/or the pre-processing part.
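
A minimal sketch of such a plot, assuming synthetic `y_true` and `y_pred` arrays in place of the real test targets and predictions:

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic prices and predictions (hypothetical); replace with Y_test and
# the predictions of the final model.
rng = np.random.default_rng(0)
y_true = rng.uniform(2_000, 40_000, size=200)
y_pred = y_true + rng.normal(0, 3_000, size=200)
residuals = y_pred - y_true

fig, (ax_fit, ax_res) = plt.subplots(1, 2, figsize=(10, 4))

# Left: predictions against reality; points on the diagonal are perfect.
ax_fit.scatter(y_true, y_pred, s=10, alpha=0.5)
limits = [y_true.min(), y_true.max()]
ax_fit.plot(limits, limits, color="red")
ax_fit.set_xlabel("Actual price")
ax_fit.set_ylabel("Predicted price")

# Right: residuals against the actual price; structure here hints at
# price ranges where the model is systematically wrong.
ax_res.scatter(y_true, residuals, s=10, alpha=0.5)
ax_res.axhline(0, color="red")
ax_res.set_xlabel("Actual price")
ax_res.set_ylabel("Predicted minus actual")

fig.tight_layout()
```

A fan shape or a curve in the residual panel suggests that the errors depend on the price level, which may point to a transformation of the target or to missing features.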

In [None]:
### Your code goes here

## 9. (Not Optional) Study for your final exam!

I hope that this final assignment, together with previous ones, gives you a clear view on how ML must be done in real life. I also hope that it's useful for making up your mind, clarifying concepts and understanding much better all we've seen.

All the course slides, notebooks, assignments, feedback and forum answers are now your personal `X_train`, your training dataset. So now it's just calling `fit` on yourselves, attending the exam, seeing what the questions in `X_test` are, calling `predict(X_test)` on yourselves (that is, trying to answer correctly all `Y_test`), and getting the highest score possible!

Real life is like ML, or ML is like real life, whichever way you prefer to see it. Thanks for your patience!