# Final assignment

Time to condense all you've learnt through this course in a final assignment. This notebook serves as a basic template for you to fill in with the code needed to do what is requested.

As always, add as many cells as you need explaining your approach. In particular, **if you do things differently than you did for previous assigments (e.g., because of the feedback received, or because you come up with new ideas), please highlight it, and explain why you changed your mind**.

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Linear regression for sklearn
import sklearn.linear_model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
# make the number format not to display in a scientific format
pd.set_option('display.float_format', lambda x: '%.3f' % x)

from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeRegressor

## 1. Load data

First thing's first. Load the Used Cars Dataset, **using only the columns that you're going to use in your models**. That is, **DON'T read columns that**:
* <u>Are obviously useless</u> (for example, unique IDs).

* <u>Could be useful, but previous assignments showed you that they're not worth processing for prediction</u>.

* <u>Special data types that are different from pure numbers or categories</u> (e.g., geographical or text ones). However, **extra points will be given if you use them, as you've specific material on the subject**.

In [2]:
### Your code goes here
df = pd.read_csv(
    './vehicles.csv', 
    usecols=['price', 'year', 'manufacturer', 'condition',
             'fuel', 'odometer', 'title_status', 'transmission', 'drive', 
             'type', 'model']
)
df=df.sample(frac=0.010, replace=True, random_state=1)

#df=df.sample(frac=0.1, replace=True, random_state=1)

## 2. Divide into X and Y, and into training and testing set

At this point, the dataframe has all features we want in our data, so it's time to split it into training and testing set. Remember some things here:
* <u>`Y` is the `price` feature and `X` is all the rest of features</u>.
* <u>Use fixed ratios. For example, 80% for training and 20% for testing.</u>
* <u>We saw that there's no need to shuffle data, but if you do, justify it, and use always the same `RandomState` so that you always get the same split.</u>

**REMARK**: <b><u>don't use test data from now on, till you've tuned your final model! (Step 7 below)</u></b>. Remember: **test data is like the final exam, so we can't access its questions (= test samples) until we've studied (= trained and tuned our model)**.

In [3]:
### Your code goes here

X, Y = df.drop(columns=['price']), df['price']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, shuffle=False)

## 3. Remove problematic rows in X_train

Now that we have all columns (= features or attributes) needed, it's time to remove rows (= car ads in this case) that we don't want to include. Unfortunately, **scikit-learn doesn't support transformations that remove rows** (there's an <a href="https://github.com/scikit-learn/scikit-learn/issues/3855">open issue</a> about this), so **any row-removing operations must be done as preprocessing steps that cannot be included in a Pipeline**.

You know from previous assignments that there're several reasons for which we may want to delete some rows. The most important ones are:
* <u>Missing values</u>: there's some feature that has a *NaN/None* value, and we know that:
    * <u>The feature is difficult to impute with the rest of information at hand</u> (so we don't know how to replace it with a non-missing value).
    * <u>The fact that it's missing isn't relevant for prediction</u> (that is, if we keep it with a NaN value, the model doesn't take that fact into consideration). Note that this only applies to categorical features (as they can be encoded keeping NaNs), but not to numerical ones. **Numerical features cannot be NaN in sklearn models, except for very limited cases**.
* <u>Outliers</u>: some value of some feature (or belonging to the target `Y`) isn't missing, but it's wrong, out of bounds, or simply doesn't make sense from the business point of view.

**HINT**: once you've decided what rows to remove and with what logic, try to write a function that, given a dataset `X`, returns that same dataset without the rows that meet that logic. Why? Because you'll have to apply this function eventually to `X_test` before predicting for it (see Step 7 below).

**REMARK**: why don't we remove problematic rows before splitting into training and testing? Because **that's a subtle form of data leakage**. We simply don't know what will come in `X_test` (actually we do, but we have to behave as if we didn't know!), so we need to infer from `X_train` what is "problematic" or an "outlier" and what is not. **Note that in previous assignments we cheated a bit because we used the whole dataset to discard prices that are extreme, but now we should do things properly.**

We will start here with removing the missing values for the numerical variables which are the year and the odometer. We decided to drop the missing the values because both features cannot be imputed from other features. 

In [4]:
### Your code goes here

## before we drop any record we will join the X_train and Y_traun together to make sure that we drop the whole record
Train_temp = pd.concat([X_train,Y_train], axis = 1)

## Now we can drop the missing values
Train_temp = Train_temp.dropna(subset=['year', 'odometer'])  # just drop rows whenever year and/or odometer are missing

#print(Train_temp.isna().mean().sort_values().rename('% of samples with NAs in each feature'))



From the above figures, can reialize since more than 70% of the size is missing, therefore, we decided to remove the whole feature.

In [5]:
Train_temp = Train_temp.drop(columns=['size'])

KeyError: "['size'] not found in axis"

In the transformation part, we will try to impute the manufacturer by utilizing the model. However, If there are records that have missing model and manufacturer, we will drop them. The nxt block will check if we have both model and manufacturer missing and drop the records accordingly 

In [6]:
# If we have a record with a missing manufacturer and model, we will remove it.
manufacturer_missing_ind=pd.DataFrame(Train_temp[Train_temp['manufacturer'].isna()].index,columns=['index'])
model_missing_ind=pd.DataFrame(Train_temp[Train_temp['model'].isna()].index, columns=['index'])
both_missing_ind=pd.merge(manufacturer_missing_ind,model_missing_ind, on='index', how = 'inner')
Train_temp.drop(index=both_missing_ind['index'])
Train_temp.reset_index(inplace=True, drop=True)



Now we will be removing the outliers for the odometer and year by utilizing the turkey method

In [7]:
#Tukey's method
def tukeys_method(df, variable):
    #Takes two parameters: dataframe & variable of interest as string
    q1 = df[variable].quantile(0.25)
    q3 = df[variable].quantile(0.75)
    iqr = q3-q1
    inner_fence = 1.5*iqr
    outer_fence = 3*iqr
    
    #inner fence lower and upper end
    inner_fence_le = q1-inner_fence
    inner_fence_ue = q3+inner_fence
    
    #outer fence lower and upper end
    outer_fence_le = q1-outer_fence
    outer_fence_ue = q3+outer_fence
 
    for index, x in enumerate(df[variable]):
        if x <= outer_fence_le or x >= outer_fence_ue:
            #outliers_prob.append(x)
            df = df[df[variable] != x]
    return df

In [8]:

# Now let's remove the outliers from the odometer and year
Train_temp = tukeys_method(Train_temp,'odometer')
Train_temp = tukeys_method(Train_temp,'year')
Train_temp = tukeys_method(Train_temp,'price')

# After droping the problomatic rows we split the X and Y again
X_train, Y_train = Train_temp.drop(columns=['price']), Train_temp['price']


## 4. Transform X_train

At this point, you know that `X_train` has all necessary features, and you also know that the rows that have remained from the previous steps are the ones you're going to train with.

This is the core part. Consider all the features you loaded in point 1, and what you've done so far regarding missing values. **You need to take into consideration these facts**:
* <u>Numerical features cannot have missing values</u>.
* <u>Unless you use some particular models (e.g., trees), numerical features need to be scaled</u>. **This is particularly important if in step 5 you're using a linear model, or a non-linear model that relies on scalar products** (e.g., SVMs on Neural Networks).
* <u>All non-numerical features must be transformed into numbers</u>, so:
    * <u>Categorical features should be one-hot encoded</u> (perhaps with NaNs, perhaps without them).
    * <u>Ordinal features should be categorized and then one-hot encoded, or transformed into numbers somehow</u>.
    * <u>Text features should be tf-idf vectorized</u> (if you use them, and unless you use some more advanced NLP packages).
    * <u>Geographical features are tricky</u>. If numerical (such as coordinates), those numbers can only lie in particular ranges. If categorical, usually they follow a hierarchy (for examples regions include states, which include counties). Think carefully about what to do with these!
* <u>If you are imputing missing values for some feature, the logic must be included in this step.</u>
* <u>If you need to discard some feature because it's used at this step but not anymore, drop it now.</u> For example, if you use `model` to impute missing values in other features, but you don't want to use it for training because it has too many categories.
    
**HINT**: once you've thought about it, try to condense this into a `ColumnTransformer` that splits processing between different kinds of features. Depending on what you do, it's possible that some of those kind-specific transformations need to be compound too (i.e., not just single transformers but `Pipelines`, `FeatureUnions` or `ColumnTransformers`). Remember about recursion!

**HINT**: if you're doing things which aren't included in scikit-learn, such as imputing missing values with some more elaborate logic than replacing by mean or mode (so that `SimpleImputer` isn't enough), you can also **write your own transformers**. To do this, you'll need to write a class that inherits from `BaseEstimator` and `TransformerMixin` and implement yourself the `fit(X)` and the `transform(X)` methods.

In [9]:
# by Fahad start of transformation with pipeline: 
# here is a custom transformer
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
class ExperimentalTransformer(BaseEstimator, TransformerMixin):
  def __init__(self):
    print('\n>>>>>>>init() called for ExperimentalTransformer.\n')

  def fit(self, X, y = None):
    print('\n>>>>>>>fit() called for ExperimentalTransformer.\n')
    print(X.columns)
    X_ = X.copy()
    final_columns=[]
    group_list=[]
    print(X_.columns)
    for col_name in X_.columns:
        print('before check column name ' + col_name)
        if col_name =='model':
            continue
        else:
            #model_by_manu = X.groupby(['model',col_name], as_index=False).size()
            final_columns.append(col_name)
            model_by_typ= pd.DataFrame({'size' : X_.groupby( [ 'model',col_name] ).size()}).reset_index()
            group_list.append(model_by_typ)
        self.model_by_typ=group_list
    return self

  def transform(self, X, y = None):
    print('\n>>>>>>>transform() called for ExperimentalTransformer.\n')
    X_ = X.copy()
    indicator=0
    print('after indicator')
    final_columns=[]
    print(X_.columns)
    for col_name in X_.columns:
        print('before check column name ' + col_name)
        if col_name =='model':
            continue
        else:
            #model_by_manu = X.groupby(['model',col_name], as_index=False).size()
            final_columns.append(col_name)
           # model_by_typ= pd.DataFrame({'size' : X_.groupby( [ 'model',col_name] ).size()}).reset_index()
            model_by_typ=self.model_by_typ[indicator]
            print('start of group by')
            print(model_by_typ)
            print('end of group by')
            Manu_null=X_[X_[col_name].isna()]
            for i in Manu_null.iterrows():
                model=Manu_null.at[i[0],'model']
                #print(model)
                #temp = model_by_typ[model_by_typ.model.apply(lambda x: x == model)]   
                #temp = model_by_typ[model_by_typ['model'] == model]
                if type(model) == str:
                    temp= model_by_typ[model_by_typ.model.map(tuple) == tuple(model)]
                   # print(temp)
                    if len(temp) != 0:
                      #  print('value has been replaced' )
                        ind = temp[['size']].idxmax()
                        X_.at[i[0],col_name] = temp.at[ind[0],col_name]
        indicator=indicator+1
    print('done for ExperimentalTransformer' +'ind = ')
    print(final_columns)
    without_model = X_[final_columns]
    print(without_model.shape)
    print(without_model.dtypes)                
    return without_model
class ExperimentalTransformer2(BaseEstimator, TransformerMixin):
  def __init__(self):
    print('\n>>>>>>>init() called for ExperimentalTransformer2.\n')

  def fit(self, X, y = None):
    print('\n>>>>>>>fit() called for ExperimentalTransformer2.\n')
    #print(X.columns)
     
    X_ = X.copy() 
    Y_ = y.copy() 
    self.avg=y.mean()
    df_dict_man_train=[]
    counter=0
    print(X_.columns)
    for col_name in X_.columns:
        X_[col_name]= X_[col_name].where(~(X_[col_name].isna()), other='na', inplace=False)
        a=col_name+'_avg_price'
        Train_temp = pd.concat([X_,Y_], axis = 1)

        df_man_train = Train_temp.groupby(col_name)[['price']].mean().sort_values(by=['price'], ascending=False).rename(
        columns={'price': a})
        df_dict_man_train.insert(counter, df_man_train.to_dict()[a])
        counter=counter+1
    self.df_dict_man_train=df_dict_man_train
    
    return self

  def transform(self, X, y = None):
    print('\n>>>>>>>transform() called for ExperimentalTransformer2.\n')
    
    X_ = X.copy() 
# And create a new variable based on each manufacturer's average price:
    counter=0
    for col_name in X_.columns:
     #   print(col_name)
        a=col_name+'_avg_price'
        X_[col_name] = X_[col_name].replace(self.df_dict_man_train[counter])
        # check if manu is new
        
        X_[col_name] = pd.to_numeric(X_[col_name], errors='coerce')
        X_[col_name].fillna(self.avg)
        counter=counter+1

    print(X_.dtypes)
    print('done for ExperimentalTransformer2')
    return X_
# by Fahad end of transformation with pipeline: 

In [10]:
# by Fahad start of data pre-processing using pipelines
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
print("create pipeline 1")
categorical_features = ['manufacturer', 'type', 'model']
categorical_features2 = ['transmission', 'drive','fuel']
numrical_features1 = ['year', 'odometer']
pipe1 = Pipeline(steps=[
                       ('fill_data', ExperimentalTransformer()),
                      ('convert_with_price', ExperimentalTransformer2()),
                      ('imputer1',SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
                      ('scaler', StandardScaler())

])

pipe3 = Pipeline(steps=[
                      ('imputer3',SimpleImputer(missing_values=np.nan, strategy='median')),
                       ('scaler', StandardScaler())
])
pipe4 = Pipeline(steps=[
                       ('imputer4',SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
                       ('enc2', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', pipe1, categorical_features),
        
        ('num', pipe3, numrical_features1),
        ('cat2', pipe4, categorical_features2),
    ],remainder='drop')


print("fit pipeline 1")

# by Fahad start of data pre-processing using pipelines 
# we need to reomve the missing values from type variables to comply with previous assigment findings

create pipeline 1

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

fit pipeline 1


In [126]:
def impute_by_model(X):
    for col in ['manufacturer','type','cylinders']:
        model_by_manu = X.groupby(['model',col], as_index=False).size()
        Manu_null=X[X[col].isna()]
        for i in Manu_null.iterrows():
            model=Manu_null.at[i[0],'model']
            temp = model_by_manu[model_by_manu['model'] == model]
            if len(temp) != 0:
                ind = temp[['size']].idxmax()
                X.at[i[0],col] = temp.at[ind[0],col]
    X=X.drop(columns=['model'])
    return X

In [127]:
def replace_missing_by_mode(X):
    for col in ['manufacturer','type','cylinders','fuel','title_status','transmission','condition','drive']:
        X[col]= X[col].where(~(X[col].isna()), other=X[col].mode()[0], inplace=False)
    return X

For the manufacturer & type columns we will transform it by using the average price of each one.

In [128]:
def transform_by_avrgprice(X,X_train=X_train,Y_train=Y_train):
    Train_temp = pd.concat([X_train,Y_train], axis = 1)
    for col in ['manufacturer','type']:
        df_man_train = Train_temp.groupby(col)[['price']].mean().sort_values(by=['price'], ascending=False).rename(
            columns={'price': col+'_avg_price'})

        # Now we create the dictionary:
        df_dict_man_train = df_man_train.to_dict()[col+'_avg_price']

        # And create a new variable based on each manufacturer's average price:
        X[col+'_avg_price'] = X[col].replace(df_dict_man_train)
        
        # Now we can drop the columns
        X=X.drop(columns=col)
        
    return X

For the rest of the columns we will use the one hot encouding

In [129]:
from sklearn.preprocessing import OneHotEncoder

def fun_ohe (X):
    
    for variable in ['cylinders','fuel','title_status','transmission','condition','drive']:
        ohe = OneHotEncoder(sparse=False, drop='first')
        ohe.fit(X[[variable]])
        ohe_df = pd.DataFrame(ohe.transform(X[[variable]]),
                     columns = ohe.get_feature_names([variable]))
        ohe_df.set_index(X.index, inplace=True)
        X=pd.concat([X, ohe_df], axis=1).drop([variable], axis=1)
    return X

In [130]:
from sklearn import preprocessing
def scaling(X):
    X = X.values
    min_max_scaler = preprocessing.MinMaxScaler()
    X = min_max_scaler.fit_transform(X)
    X = pd.DataFrame(X)
    return X

In [160]:
from sklearn.pipeline import Pipeline, make_pipeline

def Pipe_data_preprocess(X):
    X = X.pipe(impute_by_model).pipe(replace_missing_by_mode).pipe(transform_by_avrgprice).pipe(fun_ohe).pipe(scaling)
    return X


In [132]:
X_train = Pipe_data_preprocess(X_train)
#X_train

## 5. Train models

Data are now in a suitable way for any model we want to train. Missing values have been dropped or filled, there're no outliers, numbers have been scaled, etc. Try to keep in mind lessons learnt in ML1 and ML2, as to which models may be more suitable for this problem, slower/faster to train, etc.

Also **decide on what metric to use to measure performance**; the one you feel more comfortable with, whatever. In any case, follow this motto: "start simple, and then add complexity little by little". The usual procedure is:
1. <u>Start with a really simple model</u>, perhaps even a `DummyRegressor` (or `DummyClassifier` if this was a classification problem). Such a simple model is very fast to train, and it gives you **a value of the error metric that you must improve. If you do worse than this, you're making some mistake in your pipeline**.
2. Once you've that reference dummy performance, <u>turn linear</u>. Use simple linear models, and see where you can get. **The new error should be better than the dummy one, but probably still not very satisfactory**. In any case, **this becomes the new reference to beat**.
3. Once you've the reference linear performance, <u>turn non-linear, but interpretable</u>. This is where trees, nearest neighbors or naïve bayes come in handy, as they're easily intepreted (if-then rules, using very similar samples, or using independent probabilities). **Most likely you'll get an error which is even better than linear one, so this becomes the new reference**.
4. <u>Turn non-linear and non-interpretable</u>. Typically here we use models like SVMs or Neural Networks, which are even more powerful, but harder to train and particularly difficult to explain in simple words.
5. If not even all of this is enough, <u>build ensembles</u>. That is, not relying on a single model, but combining what several models say.

**HINT**: build a `Pipeline` with the previous preprocessing transformations, and whose final step is the model you want to try. This ensures that transformations are applied before training.

**HINT**: use `GridSearchCV/RandomizedSearchCV` to not only try the default model, but also tune its more important hyperparameters. Remember that a `Pipeline` is an estimator, so that's what you feed into the search. Also, remember the double underscore trick to specify that a parameter belongs to the estimator, and also recall the different CV strategies. **If you don't do CV, you'll most likely end up overfitting.**

In [11]:
# by Fahad dummy reg with pipeline: 
DRModel = Pipeline(steps=[
                        # incorrect column name passed
                       ('preprocessor', preprocessor), 
                       ('DummyRegressor', DummyRegressor(strategy="mean"))
])

print("fit pipeline LR")
DRModel.fit(X_train, Y_train) 
DRModel_score = DRModel.score(X_train, Y_train)
print('R2 for DummyRegressor' ,DRModel_score )

# by Fahad dummy reg end  with pipeline, score on training data:

fit pipeline LR

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
start of group by
                           model manufacturer  size
0         1 series 128i coupe 2d          bmw     1
1            124 spider classica         fiat     1
2                            150         ford     1
3                           1500          gmc     1
4                           1500          ram    30
...                          ...          ...   ...
1496                    yukon xl          gmc     3
1497  yukon x

In [12]:
### Using dummy regressor by utilizing the mean stategy
dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(X_train, Y_train)
dummy_regr_pred_train = dummy_regr.predict(X_train)
print(mean_squared_error(dummy_regr_pred_train,Y_train,squared=False))

14418.367254505985


In [13]:
# by Fahad lineaar reg with pipeline: 
LRModel = Pipeline(steps=[
                        # incorrect column name passed
                       ('preprocessor', preprocessor), 
                       ('linear_model', LinearRegression())
])

print("fit pipeline LR")
LRModel.fit(X_train, Y_train)  
LRModel_score = DRModel.score(X_train, Y_train)
print('R2 for linear_model' ,LRModel_score )
# by Fahad lineaar reg end with pipeline:

fit pipeline LR

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
start of group by
                           model manufacturer  size
0         1 series 128i coupe 2d          bmw     1
1            124 spider classica         fiat     1
2                            150         ford     1
3                           1500          gmc     1
4                           1500          ram    30
...                          ...          ...   ...
1496                    yukon xl          gmc     3
1497  yukon x

In [14]:
### Using linear regression
Lin_regr = sklearn.linear_model.LinearRegression()
Lin_regr.fit(X_train, Y_train)
Lin_train_predicted = Lin_regr.predict(X_train)
print(mean_squared_error(Lin_train_predicted,Y_train,squared=False))


ValueError: could not convert string to float: 'acura'

In [79]:
# by Fahad KNN with pipeline:

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
Knn_reg = KNeighborsRegressor()
KNNModel = Pipeline(steps=[
                       ('preprocessor', preprocessor), 
                       ('KNN',Knn_reg )
])

param_grid = dict(KNN__n_neighbors=[4, 6, 8,10]
                 )

grid_search_KNNModel = GridSearchCV(KNNModel, param_grid=param_grid,verbose=3,cv=5)
grid_search_KNNModel.fit(X_train, Y_train)
print ('best params')
print(grid_search_KNNModel.best_params_)
print ('score')
print(grid_search_KNNModel.best_score_)
KNNModel_score=grid_search_KNNModel.best_score_
# by Fahad end KNN with pipeline:



>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

Fitting 5 folds for each of 4 candidates, totalling 20 fits

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
start of group by
                           model manufacturer  size
0         1 series 128i coupe 2d          bmw     1
1            124 spider classica         fiat     1
2                  

[CV 2/5] END ................KNN__n_neighbors=4;, score=0.287 total time=   0.8s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
start of group by
                           model manufacturer  size
0         1 series 128i coupe 2d          bmw     1
1            124 spider classica         fiat     1
2                            150         ford     1
3                           1500          gmc     1

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 4/5] END ................KNN__n_neighbors=4;, score=0.464 total time=   0.8s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
start of group by
                           model manufacturer  size
0         1 series 128i coupe 2d          bmw     1
1                            150         ford     1
2      

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(659, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 1/5] END ................KNN__n_neighbors=6;, score=0.364 total time=   0.8s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='object

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(658, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 3/5] END ................KNN__n_neighbors=6;, score=0.468 total time=   0.8s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='object

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 5/5] END ................KNN__n_neighbors=6;, score=0.441 total time=   0.8s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
start of group by
                           model manufacturer  size
0         1 series 128i coupe 2d          bmw     1
1            124 spider classica         fiat     1
2      

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 2/5] END ................KNN__n_neighbors=8;, score=0.348 total time=   0.8s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
start of group by
                           model manufacturer  size
0         1 series 128i coupe 2d          bmw     1
1            124 spider classica         fiat     1
2      

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 4/5] END ................KNN__n_neighbors=8;, score=0.448 total time=   0.8s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
start of group by
                           model manufacturer  size
0         1 series 128i coupe 2d          bmw     1
1                            150         ford     1
2      

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(659, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 1/5] END ...............KNN__n_neighbors=10;, score=0.384 total time=   0.8s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='object

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 3/5] END ...............KNN__n_neighbors=10;, score=0.473 total time=   0.8s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
start of group by
                           model manufacturer  size
0            124 spider classica         fiat     1
1                           1500          gmc     1
2      

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(658, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 5/5] END ...............KNN__n_neighbors=10;, score=0.465 total time=   0.8s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() ca

In [135]:
### Using KNN with CV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
Knn_reg = KNeighborsRegressor()
Knn_grid = {'n_neighbors': [4,5,6,7]}
Knn_model = GridSearchCV(Knn_reg, Knn_grid, cv = 5)
Knn_model.fit(X_train, Y_train)

GridSearchCV(cv=5, estimator=KNeighborsRegressor(),
             param_grid={'n_neighbors': [4, 5, 6, 7]})

In [19]:
Knn_model.best_params_

NameError: name 'Knn_model' is not defined

In [137]:
best_knn = Knn_model.best_estimator_
Knn_train_predict = best_knn.predict(X_train)
print(mean_squared_error(Knn_train_predict,Y_train,squared=False))

6450.356970888414


In [20]:
# by Fahad RandomForestRegressor with pipeline:
from sklearn.ensemble import RandomForestRegressor
RFR = RandomForestRegressor(max_depth=2, random_state=0)

RFRModel = Pipeline(steps=[
                        # incorrect column name passed
                       ('preprocessor', preprocessor), 
                       ('RFR', RFR)
])

param_grid = dict(RFR__max_depth=[2, 3, 4,5],
                  RFR__n_estimators=[10, 20,30]
                 )

grid_search_RFRModel = GridSearchCV(RFRModel, param_grid=param_grid, verbose=3,cv=3)
grid_search_RFRModel.fit(X_train, Y_train)
print ('best params')
print(grid_search_RFRModel.best_params_)
print ('score')
print(grid_search_RFRModel.best_score_)
RFRModel_score=grid_search_RFRModel.best_score_
# by Fahad end RandomForestRegressor with pipeline:



>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

Fitting 3 folds for each of 12 candidates, totalling 36 fits

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 1/3; 1/12] START RFR__max_depth=2, RFR__n_estimators=10.....................

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
start of group by
                           model manufacturer  size
0         1 series 128i coupe 2d         

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 2/3; 1/12] END RFR__max_depth=2, RFR__n_estimators=10;, score=0.361 total time=   0.9s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 3/3; 1/12] START RFR__max_depth=2, RFR__n_estimators=10.....................

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for Expe

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 1/3; 2/12] END RFR__max_depth=2, RFR__n_estimators=20;, score=0.246 total time=   1.0s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 2/3; 2/12] START RFR__max_depth=2, RFR__n_estimators=20.....................

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for Expe

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 3/3; 2/12] END RFR__max_depth=2, RFR__n_estimators=20;, score=0.360 total time=   0.9s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 1/3; 3/12] START RFR__max_depth=2, RFR__n_estimators=30.....................

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for Expe

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 2/3; 3/12] END RFR__max_depth=2, RFR__n_estimators=30;, score=0.365 total time=   1.6s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 3/3; 3/12] START RFR__max_depth=2, RFR__n_estimators=30.....................

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for Expe

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 1/3; 4/12] END RFR__max_depth=3, RFR__n_estimators=10;, score=0.329 total time=   0.9s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 2/3; 4/12] START RFR__max_depth=3, RFR__n_estimators=10.....................

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for Expe

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 3/3; 4/12] END RFR__max_depth=3, RFR__n_estimators=10;, score=0.414 total time=   1.2s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 1/3; 5/12] START RFR__max_depth=3, RFR__n_estimators=20.....................

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for Expe

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 2/3; 5/12] END RFR__max_depth=3, RFR__n_estimators=20;, score=0.415 total time=   2.1s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 3/3; 5/12] START RFR__max_depth=3, RFR__n_estimators=20.....................

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for Expe

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 1/3; 6/12] END RFR__max_depth=3, RFR__n_estimators=30;, score=0.326 total time=   1.1s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 2/3; 6/12] START RFR__max_depth=3, RFR__n_estimators=30.....................

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for Expe

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 3/3; 6/12] END RFR__max_depth=3, RFR__n_estimators=30;, score=0.426 total time=   2.5s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 1/3; 7/12] START RFR__max_depth=4, RFR__n_estimators=10.....................

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for Expe

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 2/3; 7/12] END RFR__max_depth=4, RFR__n_estimators=10;, score=0.450 total time=   1.2s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 3/3; 7/12] START RFR__max_depth=4, RFR__n_estimators=10.....................

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for Expe

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 1/3; 8/12] END RFR__max_depth=4, RFR__n_estimators=20;, score=0.360 total time=   2.5s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 2/3; 8/12] START RFR__max_depth=4, RFR__n_estimators=20.....................

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for Expe

before check column name type
start of group by
                           model         type  size
0         1 series 128i coupe 2d        coupe     1
1                           1500       pickup    10
2                           1500        truck     6
3                           1500          van     1
4                       1500 4x4       pickup     2
...                          ...          ...   ...
1094  yukon xl 1500 denali sport          SUV     1
1095                yukon xl slt          SUV     1
1096                yukon xl slt        other     1
1097                          z4  convertible     1
1098   z4 sdrive35is roadster 2d        other     1

[1099 rows x 3 columns]
end of group by
before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
do

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 2/3; 9/12] END RFR__max_depth=4, RFR__n_estimators=30;, score=0.447 total time=   1.6s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 3/3; 9/12] START RFR__max_depth=4, RFR__n_estimators=30.....................

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for Expe

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 1/3; 10/12] END RFR__max_depth=5, RFR__n_estimators=10;, score=0.384 total time=   1.2s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 2/3; 10/12] START RFR__max_depth=5, RFR__n_estimators=10....................

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for Exp

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 3/3; 10/12] END RFR__max_depth=5, RFR__n_estimators=10;, score=0.474 total time=   1.2s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 1/3; 11/12] START RFR__max_depth=5, RFR__n_estimators=20....................

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for Exp

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 2/3; 11/12] END RFR__max_depth=5, RFR__n_estimators=20;, score=0.463 total time=   1.1s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 3/3; 11/12] START RFR__max_depth=5, RFR__n_estimators=20....................

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for Exp

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 1/3; 12/12] END RFR__max_depth=5, RFR__n_estimators=30;, score=0.389 total time=   1.2s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 2/3; 12/12] START RFR__max_depth=5, RFR__n_estimators=30....................

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for Exp

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 3/3; 12/12] END RFR__max_depth=5, RFR__n_estimators=30;, score=0.487 total time=   1.1s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>tr

In [21]:
# by Fahad DecisionTreeRegressor with pipeline:
from sklearn.tree import DecisionTreeRegressor
reg_decision_model=DecisionTreeRegressor(max_depth=2,min_samples_leaf=2)

DTRModel = Pipeline(steps=[
                        # incorrect column name passed
                       ('preprocessor', preprocessor), 
                       ('reg_decision_model',reg_decision_model )
])
param_grid = dict(reg_decision_model__max_depth=[2, 3, 4,5],
                  reg_decision_model__min_samples_leaf=[1, 2,3]
                 )

grid_search_DTRModel = GridSearchCV(DTRModel, param_grid=param_grid, verbose=10,cv=3)
grid_search_DTRModel.fit(X_train, Y_train)
print ('best params')
print(grid_search_DTRModel.best_estimator_)
print ('score')
print(grid_search_DTRModel.best_score_)
DTRModel_score=grid_search_DTRModel.best_score_
# by Fahad end of  DecisionTreeRegressor with pipeline:


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

Fitting 3 folds for each of 12 candidates, totalling 36 fits

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 1/3; 1/12] START reg_decision_model__max_depth=2, reg_decision_model__min_samples_leaf=1

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
start of group by
                           model manufacturer  size
0         1 series 128i coupe

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 2/3; 1/12] END reg_decision_model__max_depth=2, reg_decision_model__min_samples_leaf=1;, score=0.286 total time=   0.9s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 3/3; 1/12] START reg_decision_model__max_depth=2, reg_decision_model__min_samples_leaf=1

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column n

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 1/3; 2/12] END reg_decision_model__max_depth=2, reg_decision_model__min_samples_leaf=2;, score=0.188 total time=   0.8s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 2/3; 2/12] START reg_decision_model__max_depth=2, reg_decision_model__min_samples_leaf=2

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column n

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 3/3; 2/12] END reg_decision_model__max_depth=2, reg_decision_model__min_samples_leaf=2;, score=0.265 total time=   0.8s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 1/3; 3/12] START reg_decision_model__max_depth=2, reg_decision_model__min_samples_leaf=3

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column n

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 2/3; 3/12] END reg_decision_model__max_depth=2, reg_decision_model__min_samples_leaf=3;, score=0.286 total time=   1.0s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 3/3; 3/12] START reg_decision_model__max_depth=2, reg_decision_model__min_samples_leaf=3

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column n

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 1/3; 4/12] END reg_decision_model__max_depth=3, reg_decision_model__min_samples_leaf=1;, score=0.260 total time=   1.0s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 2/3; 4/12] START reg_decision_model__max_depth=3, reg_decision_model__min_samples_leaf=1

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column n

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 3/3; 4/12] END reg_decision_model__max_depth=3, reg_decision_model__min_samples_leaf=1;, score=0.361 total time=   1.6s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 1/3; 5/12] START reg_decision_model__max_depth=3, reg_decision_model__min_samples_leaf=2

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column n

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 2/3; 5/12] END reg_decision_model__max_depth=3, reg_decision_model__min_samples_leaf=2;, score=0.327 total time=   1.9s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 3/3; 5/12] START reg_decision_model__max_depth=3, reg_decision_model__min_samples_leaf=2

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column n

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 1/3; 6/12] END reg_decision_model__max_depth=3, reg_decision_model__min_samples_leaf=3;, score=0.260 total time=   1.6s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 2/3; 6/12] START reg_decision_model__max_depth=3, reg_decision_model__min_samples_leaf=3

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column n

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 3/3; 6/12] END reg_decision_model__max_depth=3, reg_decision_model__min_samples_leaf=3;, score=0.361 total time=   1.3s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 1/3; 7/12] START reg_decision_model__max_depth=4, reg_decision_model__min_samples_leaf=1

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column n

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 2/3; 7/12] END reg_decision_model__max_depth=4, reg_decision_model__min_samples_leaf=1;, score=0.368 total time=   2.0s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 3/3; 7/12] START reg_decision_model__max_depth=4, reg_decision_model__min_samples_leaf=1

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column n

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 1/3; 8/12] END reg_decision_model__max_depth=4, reg_decision_model__min_samples_leaf=2;, score=0.274 total time=   1.3s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 2/3; 8/12] START reg_decision_model__max_depth=4, reg_decision_model__min_samples_leaf=2

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column n

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 3/3; 8/12] END reg_decision_model__max_depth=4, reg_decision_model__min_samples_leaf=2;, score=0.390 total time=   1.8s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 1/3; 9/12] START reg_decision_model__max_depth=4, reg_decision_model__min_samples_leaf=3

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column n

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 2/3; 9/12] END reg_decision_model__max_depth=4, reg_decision_model__min_samples_leaf=3;, score=0.368 total time=   1.6s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 3/3; 9/12] START reg_decision_model__max_depth=4, reg_decision_model__min_samples_leaf=3

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column n

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 1/3; 10/12] END reg_decision_model__max_depth=5, reg_decision_model__min_samples_leaf=1;, score=0.294 total time=   1.1s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 2/3; 10/12] START reg_decision_model__max_depth=5, reg_decision_model__min_samples_leaf=1

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 3/3; 10/12] END reg_decision_model__max_depth=5, reg_decision_model__min_samples_leaf=1;, score=0.411 total time=   1.1s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 1/3; 11/12] START reg_decision_model__max_depth=5, reg_decision_model__min_samples_leaf=2

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 2/3; 11/12] END reg_decision_model__max_depth=5, reg_decision_model__min_samples_leaf=2;, score=0.400 total time=   1.0s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 3/3; 11/12] START reg_decision_model__max_depth=5, reg_decision_model__min_samples_leaf=2

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 1/3; 12/12] END reg_decision_model__max_depth=5, reg_decision_model__min_samples_leaf=3;, score=0.283 total time=   1.1s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

[CV 2/3; 12/12] START reg_decision_model__max_depth=5, reg_decision_model__min_samples_leaf=3

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV 3/3; 12/12] END reg_decision_model__max_depth=5, reg_decision_model__min_samples_leaf=3;, score=0.409 total time=   1.0s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before c

In [74]:

from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

NN_reg = MLPRegressor()

NNRModel = Pipeline(steps=[
                        # incorrect column name passed
                       ('preprocessor', preprocessor), 
                       ('NNR',NN_reg )
])
param_grid = dict(NNR__hidden_layer_sizes= [(7,),(8,),(9,)]
                 )

grid_search_NNRModel = GridSearchCV(NNRModel, param_grid=param_grid, verbose=2,cv=3)
grid_search_NNRModel.fit(X_train, Y_train)
print ('best params')
print(grid_search_NNRModel.best_estimator_)
print ('score')
print(grid_search_NNRModel.best_score_)
NNRModel_score=grid_search_NNRModel.best_score_
# by Fahad end of  NN with pipeline:


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

Fitting 3 folds for each of 3 candidates, totalling 9 fits

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
start of group by
                           model manufacturer  size
0         1 series 128i coupe 2d          bmw     1
1            124 spider classica         fiat     1
2                   

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV] END .......................NNR__hidden_layer_sizes=(7,); total time=   5.3s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='objec

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV] END .......................NNR__hidden_layer_sizes=(8,); total time=   2.5s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='objec

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV] END .......................NNR__hidden_layer_sizes=(8,); total time=   2.4s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='objec

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV] END .......................NNR__hidden_layer_sizes=(9,); total time=   2.1s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='objec

In [138]:
## Using Nueral Network
### we will try to tune the network with 3,5 or 7 layers using grid search. We will fix the number of nodes to 30

from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV
NN_reg = MLPRegressor()
NN_grid = {'hidden_layer_sizes': [(30,30,30),(30,30,30,30,30),(30,30,30,30,30,30,30)]}
NN_model = GridSearchCV(NN_reg, NN_grid, cv = 5)
NN_model.fit(X_train, Y_train)

GridSearchCV(cv=5, estimator=MLPRegressor(),
             param_grid={'hidden_layer_sizes': [(30, 30, 30),
                                                (30, 30, 30, 30, 30),
                                                (30, 30, 30, 30, 30, 30, 30)]})

In [139]:
NN_model.best_params_

{'hidden_layer_sizes': (30, 30, 30, 30, 30, 30, 30)}

In [140]:
best_NN = NN_model.best_estimator_
NN_train_predict = best_NN.predict(X_train)
print(mean_squared_error(NN_train_predict,Y_train,squared=False))

8890.724114580584


In [75]:
# by Fahad  SVR with pipeline:
# Using SVM
import sklearn.svm
from sklearn.model_selection import GridSearchCV



SVR_Reg = sklearn.svm.SVR()

SVRModel = Pipeline(steps=[
                        # incorrect column name passed
                       ('preprocessor', preprocessor), 
                       ('SVR',SVR_Reg )
])
param_grid = dict(SVR__C= [1,5,10]
                 )

grid_search_SVRModel = GridSearchCV(SVRModel, param_grid=param_grid, verbose=2,cv=3)
grid_search_SVRModel.fit(X_train, Y_train)
print ('best params')
print(grid_search_SVRModel.best_estimator_)
print ('score')
print(grid_search_SVRModel.best_score_)
SVRModel_score=grid_search_SVRModel.best_score_
# by Fahad end of  SVR with pipeline:


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

Fitting 3 folds for each of 3 candidates, totalling 9 fits

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
start of group by
                           model manufacturer  size
0         1 series 128i coupe 2d          bmw     1
1            124 spider classica         fiat     1
2                   

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV] END ...........................................SVR__C=1; total time=   1.8s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='objec

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV] END ...........................................SVR__C=5; total time=   2.2s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='objec

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV] END ...........................................SVR__C=5; total time=   2.3s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='objec

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV] END ..........................................SVR__C=10; total time=   2.3s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='objec

In [71]:
# Using SVM
import sklearn.svm
from sklearn.model_selection import GridSearchCV

SVR_Grid = {'C' : [1,5,10]}

SVR_Reg = sklearn.svm.SVR()

SVR_model = GridSearchCV(SVR_Reg,SVR_Grid,cv=5)

SVR_model.fit(X_train,Y_train)

ValueError: could not convert string to float: 'acura'

In [None]:
SVR_model.best_params_

In [None]:
best_svr = SVR_model.best_estimator_
SVR_train_predict = best_svr.predict(X_train)
print(mean_squared_error(SVR_train_predict,Y_train,squared=False))

## 6. Decide what the final model would be

After having done all the previous steps, you'll have trained several models. Now you need to **decide which of those is the one you're going to choose as your best**. Following the exam metaphor, you have to pick your best student to win the ML Olympics!

**HINT**: in order to try different models, you can write an outer loop that tries different estimators, as well as what hyperparameters and values for those hyperparameters are to be tried in the CV search. This loop calls CV search, picks its `best_estimator_` and compares its performance with the best performance you had so far. If the new one is better than your current best, this becomes your new best.

**HINT**: it's not always about performance (= score). Sometimes you can improve a bit the score, at the expense of training a model that takes way longer, or that requires much more memory. Besides this, clients usually require some interpretability on the model, so think twice about what "best" means.

In [78]:
### Your code goes here
# refrence https://www.kdnuggets.com/2018/01/managing-machine-learning-workflows-scikit-learn-pipelines-part-3.html




# List of pipelines for ease of iteration
grids = [grid_search_SVRModel, grid_search_NNRModel, grid_search_DTRModel, grid_search_KNNModel, LRModel, DRModel,grid_search_RFRModel]

# Dictionary of pipelines and classifier types for ease of reference
grid_dict = {0: 'SVR', 1: 'NN', 
2: 'DTR', 3: 'KNN', 
4: 'LR', 5: 'DR', 6: 'RFR'}

# Fit the grid search objects
print('Performing model optimizations...')
best_acc = 0.0
best_Reg = 0
best_gs = ''
for idx, gs in enumerate(grids):
    print('\nEstimator: %s' % grid_dict[idx])
    # Fit grid search
    gs.fit(X_train, Y_train)
    # Best params
    print('Best params: %s' % gs.best_params_)
    # Best training data accuracy
    print('Best training accuracy: %.3f' % gs.best_score_)
    # Predict on test data with best params
    y_pred = gs.predict(Y_train)
    # Test data accuracy of model with best params
    #print('Test set accuracy score for best params: %.3f ' % accuracy_score(y_test, y_pred))
    print(mean_squared_error(Y_train,y_pred))
    # Track best (highest test accuracy) model
    if mean_squared_error(Y_train, y_pred) > best_acc:
        best_acc = mean_squared_error(Y_train, y_pred)
        best_gs = gs
        best_Reg = idx
        print('\n best estimator: %s' % grid_dict[best_Reg])

Performing model optimizations...

Estimator: SVR

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.

Fitting 3 folds for each of 3 candidates, totalling 9 fits

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
start of group by
                           model manufacturer  size
0         1 series 128i coupe 2d          bmw     1
1            124 spide

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV] END ...........................................SVR__C=1; total time=   2.4s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='objec

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV] END ...........................................SVR__C=5; total time=   1.9s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='objec

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV] END ...........................................SVR__C=5; total time=   1.9s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='objec

before check column name model
done for ExperimentalTransformerind = 
['manufacturer', 'type']
(1097, 2)
manufacturer    object
type            object
dtype: object

>>>>>>>transform() called for ExperimentalTransformer2.

manufacturer    float64
type            float64
dtype: object
done for ExperimentalTransformer2
[CV] END ..........................................SVR__C=10; total time=   2.4s

>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>init() called for ExperimentalTransformer.


>>>>>>>init() called for ExperimentalTransformer2.


>>>>>>>fit() called for ExperimentalTransformer.

Index(['manufacturer', 'type', 'model'], dtype='object')
Index(['manufacturer', 'type', 'model'], dtype='object')
before check column name manufacturer
before check column name type
before check column name model

>>>>>>>transform() called for ExperimentalTransformer.

after indicator
Index(['manufacturer', 'type', 'model'], dtype='objec

IndexError: tuple index out of range

## 7. Test your final model

Time to recover `X_test`, which was put on hold since Step 2. Now you've a final `Pipeline` from Step 6, which knows how to transform the data in `X_test` (Step 4) and knows how to predict (because it's been fit by Step 5).

**HINT**: **beware that Step 3 hasn't been applied to the test set!** You need to do that before calling `Pipeline.score(X_test)`. For example, if your model doesn't deal with missing values, you need to remove any row from `X_test` that has missing values! Otherwise the code will crash.

In [164]:
### Your code goes here
## Before we can start testing with the test set, we need to transoform X_test.
Test_temp = pd.concat([X_test,Y_test], axis = 1)
Test_temp = Test_temp.dropna(subset=['year', 'odometer'])  # just drop rows whenever year and/or odometer are missing
Test_temp = Test_temp.drop(columns=['size'])

X_test, Y_test = Test_temp.drop(columns=['price']), Test_temp['price']

X_test= Pipe_data_preprocess(X_test)
X_test

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19,20,21,22,23,24,25,26,27,28
0,0.800,0.138,0.459,1.000,0.000,0.000,0.000,0.000,0.000,1.000,...,0.000,0.000,1.000,0.000,1.000,0.000,0.000,0.000,0.000,0.000
1,0.700,0.170,0.360,1.000,0.000,0.000,0.000,0.000,0.000,1.000,...,0.000,0.000,1.000,0.000,1.000,0.000,0.000,0.000,0.000,0.000
2,0.950,0.046,0.360,1.000,0.000,0.000,0.000,0.000,0.000,1.000,...,0.000,0.000,1.000,0.000,1.000,0.000,0.000,0.000,0.000,0.000
3,0.875,0.098,0.313,1.000,0.000,0.000,0.000,0.000,0.000,1.000,...,0.000,0.000,1.000,0.000,1.000,0.000,0.000,0.000,0.000,0.000
4,0.775,0.305,0.379,0.795,0.000,0.000,0.000,0.000,1.000,0.000,...,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,1.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
327962,0.900,0.076,0.459,0.336,0.000,0.000,0.000,0.000,1.000,0.000,...,0.000,0.000,1.000,0.000,1.000,0.000,0.000,0.000,0.000,0.000
327963,0.725,0.224,0.254,0.336,0.000,0.000,1.000,0.000,0.000,0.000,...,0.000,0.000,0.000,0.000,1.000,0.000,0.000,0.000,0.000,0.000
327964,0.900,0.093,0.208,0.160,0.000,0.000,0.000,0.000,1.000,0.000,...,0.000,0.000,1.000,0.000,1.000,0.000,0.000,0.000,1.000,0.000
327965,0.950,0.012,0.313,0.160,0.000,0.000,1.000,0.000,0.000,0.000,...,0.000,0.000,1.000,0.000,1.000,0.000,0.000,0.000,1.000,0.000


In [165]:
## The KNN with 7 nighbors was the best estimator, therefore, we will test with it.
Knn_test_predict = best_knn.predict(X_test)
print(mean_squared_error(Knn_test_predict,Y_test,squared=False))

6450.356970888414


## 8. (Optional) Revisit what you've done

Once you get the score for `X_test`, it's tempting to try to improve even more. If you got a much worse performance than for `X_train`, chances are that you're overfitting, so you need to refine your CV strategy, use regularization, or choose parameters that don't drive to that (for example, don't let a tree grow without limit!).

This is like when you fail an exam, and you want to have another try. The catch is that you already know what the exam is (you saw `X_test`), and you also got your marks (the `score`), so it's not taking another similar exam (as would happen in real life), but taking the same exam again. Strictly speaking, this is another subtle form of data leakage, but a widely used one. The hope is that by refining the training strategy, even if we're cheating a bit, the behavior of the final model when it actually takes another, different exam (that is, when it's put into production), will be better than our current one would have obtained.

So we're going to **overlook this fact and allow that, given the results in Step 7, you can go back, try again, change your final estimator in Step 6 and retry Step 7, until you can't get any better**.

**HINT**: besides the score, you can also plot your predictions against reality, and try to infer when you predict wrongly. This can give you insights on how to improve the model and/or the pre-processing part.

In [166]:
### Your code goes here
# Let's try to improve the KNN by searching more neighbours 7, 9, 11 & 13
Knn_reg = KNeighborsRegressor()
Knn_grid = {'n_neighbors': [7,9,11,13]}
Knn_model = GridSearchCV(Knn_reg, Knn_grid, cv = 5)
Knn_model.fit(X_train, Y_train)

GridSearchCV(cv=5, estimator=KNeighborsRegressor(),
             param_grid={'n_neighbors': [7, 9, 11, 13]})

In [167]:
Knn_model.best_params_

{'n_neighbors': 9}

In [168]:
best_knn = Knn_model.best_estimator_
Knn_train_predict = best_knn.predict(X_train)
print(mean_squared_error(Knn_train_predict,Y_train,squared=False))

6820.705773396576


In [169]:
# now let's test with the best model
Knn_test_predict = best_knn.predict(X_test)
print(mean_squared_error(Knn_test_predict,Y_test,squared=False))

6820.705773396576


As a result, we can conclude that the best model is the KNN with 7 neighbours 

## 9. (Not Optional) Study for your final exam!

I hope that this final assignment, together with previous ones, gives you a clear view on how ML must be done in real life. I also hope that it's useful for making up your mind, clarifying concepts and understanding much better all we've seen.

All the course slides, notebooks, assignments, feedbacks and forum answers are now your personal `X_train`, your training dataset. So now it's just calling `fit` on yourselves, attending the exam, seeing what the questions in `X_test` are, calling `predict(X_test)` on yourselves (that is, trying to answer correctly all `Y_test`), and getting the highest score possible!

Real life is like ML, or ML is like real life, the way you prefer to see it. Thanks for your patience!