<a href="https://colab.research.google.com/github/QianFu520/project2/blob/main/Project_2_Part_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



*   Qian Fu
*   9/7/2022



# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn import set_config
from sklearn.decomposition import PCA

# Upload Data

In [None]:
df = pd.read_csv("/content/Wine.csv")
print('Number of Duplicated Rows', df.duplicated().sum())
print('\n')
print(df.info())
df.head()

Number of Duplicated Rows 5452


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   winery       7500 non-null   object 
 1   wine         7500 non-null   object 
 2   year         7498 non-null   object 
 3   rating       7500 non-null   float64
 4   num_reviews  7500 non-null   int64  
 5   country      7500 non-null   object 
 6   region       7500 non-null   object 
 7   price        7500 non-null   float64
 8   type         6955 non-null   object 
 9   body         6331 non-null   float64
 10  acidity      6331 non-null   float64
dtypes: float64(4), int64(1), object(6)
memory usage: 644.7+ KB
None


Unnamed: 0,winery,wine,year,rating,num_reviews,country,region,price,type,body,acidity
0,Teso La Monja,Tinto,2013,4.9,58,Espana,Toro,995.0,Toro Red,5.0,3.0
1,Artadi,Vina El Pison,2018,4.9,31,Espana,Vino de Espana,313.5,Tempranillo,4.0,2.0
2,Vega Sicilia,Unico,2009,4.8,1793,Espana,Ribera del Duero,324.95,Ribera Del Duero Red,5.0,3.0
3,Vega Sicilia,Unico,1999,4.8,1705,Espana,Ribera del Duero,692.96,Ribera Del Duero Red,5.0,3.0
4,Vega Sicilia,Unico,1996,4.8,1309,Espana,Ribera del Duero,778.06,Ribera Del Duero Red,5.0,3.0


I can see that there are duplicated rows and missing values in some columns. None of the datatypes need to be changed, although "year" is stored as object (text) rather than as a number.

# Data Cleaning

**Deleted unnecessary columns**

In [None]:
# I decided to drop the country column: it contains a single value (Espana), so it cannot help predict the wine price.
df.drop(columns="country", inplace=True)

**Check and drop any duplicates**

In [None]:
#check for duplicates
df.duplicated().sum()

5452

In [None]:
#Drop all the duplicates
df.drop_duplicates(inplace=True)
df.duplicated().sum()

0

In [None]:
df.shape

(2048, 10)

# **Identify and address any missing values in this dataset.**


In [None]:
df.isna().sum()

winery           0
wine             0
year             2
rating           0
num_reviews      0
region           0
price            0
type           106
body           271
acidity        271
dtype: int64

I can see that there are 2 missing values in the "year" column, 106 in the "type" column, and 271 each in the "body" and "acidity" columns.

**Figure out the method for dealing with the missing values in year column**



In [None]:
missing_values = pd.isna(df["year"])
df[missing_values]

Unnamed: 0,winery,wine,year,rating,num_reviews,region,price,type,body,acidity
46,Vega Sicilia,Unico Reserva Especial Edicion,,4.7,12421,Ribera del Duero,423.5,Ribera Del Duero Red,5.0,3.0
851,La Unica,Fourth Edition,,4.4,131,Vino de Espana,40.0,Tempranillo,4.0,2.0


I chose to drop these two rows.

In [None]:
df.dropna(subset=['year'], inplace=True)

**Figure out the method for dealing with the missing values in type column**

In [None]:
#check the value count in type column
df["type"].value_counts()

Ribera Del Duero Red    535
Rioja Red               451
Priorat Red             238
Red                     210
Toro Red                 78
Tempranillo              73
Sherry                   56
Rioja White              37
Pedro Ximenez            35
Grenache                 35
Albarino                 34
Cava                     33
Verdejo                  27
Monastrell               18
Mencia                   17
Montsant Red             17
Syrah                    15
Chardonnay               13
Cabernet Sauvignon       11
Sparkling                 5
Sauvignon Blanc           4
Name: type, dtype: int64

To avoid biasing the model, I decided to give the missing values in the "type" column a new label, "Unidentified". I will do this after the train/test split.
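An alternative to filling each split by hand is a constant-value imputer, which can sit inside the preprocessing pipeline. This is not what the notebook does; it is a minimal sketch on a made-up toy column (the rows below are illustrative and not taken from Wine.csv).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the "type" column; the rows are hypothetical.
toy = pd.DataFrame({"type": ["Rioja Red", np.nan, "Cava", np.nan]}, dtype=object)

# strategy="constant" replaces every missing entry with fill_value.
type_imputer = SimpleImputer(strategy="constant", fill_value="Unidentified")
filled = type_imputer.fit_transform(toy)

print(filled.ravel().tolist())
```

Because the fill value is fixed, nothing is learned from the data, so this choice cannot leak information from the test set.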

**Figure out the method for dealing with the missing values in body and acidity columns**

In [None]:
#check the stats information of body column
df["body"].describe().round(1)

count    1775.0
mean        4.3
std         0.7
min         2.0
25%         4.0
50%         4.0
75%         5.0
max         5.0
Name: body, dtype: float64

In [None]:
#check the most frequent value 
df["body"].value_counts()

4.0    1002
5.0     633
3.0     106
2.0      34
Name: body, dtype: int64

I can see that the most frequent body value is 4.0 and the mean is around 4.3. I can use SimpleImputer(strategy='mean') to fill in the missing values in the "body" column. I will address this after the data split.

In [None]:
#check the stats information of acidity column
df["acidity"].describe().round(1)

count    1775.0
mean        2.9
std         0.3
min         1.0
25%         3.0
50%         3.0
75%         3.0
max         3.0
Name: acidity, dtype: float64

In [None]:
#check the most frequent value
df["acidity"].value_counts()

3.0    1671
2.0      69
1.0      35
Name: acidity, dtype: int64

I can see that the most frequent acidity value is 3.0 and the mean is around 2.9. I can use SimpleImputer(strategy='mean') to fill in the missing values in the "acidity" column. I will address this after the data split.
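The reason for waiting until after the split is that the imputer should learn its mean from the training rows only. A small sketch of that behaviour, using hypothetical acidity-like values rather than the wine data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical acidity-like values; np.nan marks a missing entry.
train = np.array([[3.0], [3.0], [2.0], [np.nan]])
test = np.array([[np.nan], [1.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)                      # the mean is computed from the training rows only

train_mean = imputer.statistics_[0]     # (3 + 3 + 2) / 3
test_filled = imputer.transform(test)   # the test gap is filled with the training mean

print(train_mean, test_filled.ravel())
```

The test value 1.0 plays no part in the fill value, which is what keeps the test set unseen during preprocessing.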

# Identified and corrected inconsistencies in data for categorical values

In [None]:
dtypes = df.dtypes
str_cols = dtypes[dtypes=="object"].index
for col in str_cols:
  print(f'-Column={col}')
  print(df[col].value_counts(dropna=False))
  print('\n\n')

-Column=winery
Vega Sicilia                            96
Alvaro Palacios                         48
Artadi                                  43
La Rioja Alta                           36
Marques de Murrieta                     33
                                        ..
Briego                                   1
Guillem Carol - Cellers Carol Valles     1
Particular                               1
Bodegas Asenjo & Manso                   1
Binigrau                                 1
Name: winery, Length: 479, dtype: int64



-Column=wine
Tinto                                                 56
Unico                                                 41
Valbuena 5o                                           32
Reserva                                               31
Priorat                                               26
                                                      ..
San Valentin Parellada                                 1
Silvanus Edicion Limitada Ribera del Duero             1


I don't see any inconsistencies in the categorical values.

**Check stats information for numeric values**

In [None]:
df.describe()

Unnamed: 0,rating,num_reviews,price,body,acidity
count,7500.0,7500.0,7500.0,6331.0,6331.0
mean,4.254933,451.109067,60.095822,4.158427,2.946612
std,0.118029,723.001856,150.356676,0.583352,0.248202
min,4.2,25.0,4.99,2.0,1.0
25%,4.2,389.0,18.9,4.0,3.0
50%,4.2,404.0,28.53,4.0,3.0
75%,4.2,415.0,51.35,5.0,3.0
max,4.9,32624.0,3119.08,5.0,3.0


I can tell that there are no impossible values in the numeric columns.

# Prepare the data appropriately for modeling

**Define X, y, train test split**

In [None]:
X=df.drop(columns ="price")
y=df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

**Identify Columns Features**



*   Numeric Features: rating, num_reviews, body, acidity (price is the target, not a feature)
*   Nominal Features: winery, wine, region, type, year
*   Ordinal Features: none





**Make column selectors**

In [None]:
num_selector = make_column_selector(dtype_include='number')
cat_selector = make_column_selector(dtype_include='object')
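Because "year" is stored as text, a dtype-based selector routes it to the categorical group. The sketch below checks this on a hypothetical two-row frame that mimics a few of the columns in X; the values are made up.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Two hypothetical rows; text columns are forced to object dtype to mirror df.info().
toy_X = pd.DataFrame({
    "winery": pd.Series(["A", "B"], dtype=object),
    "year": pd.Series(["2013", "2018"], dtype=object),
    "rating": [4.9, 4.8],
    "num_reviews": [58, 31],
    "body": [5.0, 4.0],
})

# Calling a selector on a DataFrame returns the list of matching column names.
num_cols = make_column_selector(dtype_include="number")(toy_X)
cat_cols = make_column_selector(dtype_include="object")(toy_X)

print(num_cols)
print(cat_cols)
```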

**Instantiate transformers**

In [None]:
mean_imputer = SimpleImputer(strategy='mean')  # to fill the missing values in "body" and "acidity"
ohe_encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
scaler = StandardScaler()

**Fill the missing values in "type" using the fillna function**

In [None]:
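# Assign the result back instead of using inplace=True on a column selection,
# which may not modify X_train/X_test in recent pandas versions.
X_train["type"] = X_train["type"].fillna('Unidentified')
X_test["type"] = X_test["type"].fillna('Unidentified')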


**Create pipelines**

In [None]:
num_pipe = make_pipeline(mean_imputer, scaler)

**Create Tuples to Pair Pipelines with Columns**

In [None]:
number_tuple = (num_pipe, num_selector)
nom_tuple = (ohe_encoder, cat_selector)

**Instantiate the ColumnTransformer**

In [None]:
preprocessor = make_column_transformer(nom_tuple,  number_tuple,  remainder='drop')                                                                      

# Try multiple models and tune the hyperparameters of the models to find out the best final model

Define a function that takes the true and predicted values as arguments and prints all four metrics (MAE, MSE, RMSE, and R^2).

In [None]:
def eval_regression(true, pred):
  mae = mean_absolute_error(true, pred)
  mse = mean_squared_error(true, pred)
  rmse = np.sqrt(mse)
  r2 = r2_score(true, pred)

  print(f'MAE {mae},\n MSE {mse},\n RMSE: {rmse},\n R^2: {r2} ')
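To make the four metrics concrete, here is a small worked example on hypothetical prices and predictions, chosen so the arithmetic can be checked by hand. It calls the same scikit-learn functions as eval_regression.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical prices and predictions (not from the wine data).
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

mae = mean_absolute_error(y_true, y_pred)   # mean of |2|, |2|, |3| = 7/3
mse = mean_squared_error(y_true, y_pred)    # mean of 4, 4, 9 = 17/3
rmse = np.sqrt(mse)                         # back in the units of the target
r2 = r2_score(y_true, y_pred)               # 1 - SS_res / SS_tot = 1 - 17/200

print(mae, mse, rmse, r2)
```

An R^2 of 0 corresponds to always predicting the mean of the evaluated target, which is why the baseline model scores 0.0 on the training data.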

**Model 1: Baseline Model**

In [None]:
# instantiate a baseline model
dummy_reg = DummyRegressor(strategy='mean')

In [None]:
# create model pipeline
base_pipe = make_pipeline(preprocessor, dummy_reg)

base_pipe.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse=False),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f75a7b38f90>),
                                                 ('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer()),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f75a7b38d10>)])),
                ('dummyregressor', DummyRegresso

In [None]:
# find MAE, MSE, RMSE and R2 on the baseline model for both the train and test data
print('Train Evaluation')
eval_regression(y_train, base_pipe.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, base_pipe.predict(X_test))

Train Evaluation
MAE 137.69530871565573,
 MSE 76623.99878410886,
 RMSE: 276.8104022324827,
 R^2: 0.0 

Test Evaluation
MAE 130.88599784008295,
 MSE 66428.8687280928,
 RMSE: 257.73798464349954,
 R^2: -0.00012888778872621742 


**Model 2:Linear Regression Model**

In [None]:
# instantiate a linear regression model
lin_reg = LinearRegression()

In [None]:
# create model pipeline
lin_reg_pipe = make_pipeline(preprocessor, lin_reg)
lin_reg_pipe.fit(X_train, y_train);

In [None]:
# find MAE, MSE, RMSE and R2 on the linear regression model for both the train and test data
print('Train Evaluation')
eval_regression(y_train, lin_reg_pipe.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, lin_reg_pipe.predict(X_test))

Train Evaluation
MAE 33.05563919671121,
 MSE 10228.88146741623,
 RMSE: 101.13793288087427,
 R^2: 0.8665055122451061 

Test Evaluation
MAE 873357503934.7001,
 MSE 5.833118263525715e+24,
 RMSE: 2415184933607.7173,
 R^2: -8.782130710548942e+19 


**Model 3: DecisionTree Model**

In [None]:
#use all of the default parameters, instantiate a decision tree model
dec_tree = DecisionTreeRegressor(random_state = 42)

In [None]:
# create model pipeline
dec_tree_pipe = make_pipeline(preprocessor, dec_tree)
dec_tree_pipe.fit(X_train, y_train);

In [None]:
# find MAE, MSE, RMSE and R2 on the decision tree model for both the train and test data
print('Train Evaluation')
eval_regression(y_train, dec_tree_pipe.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, dec_tree_pipe.predict(X_test))

Train Evaluation
MAE 0.0,
 MSE 0.0,
 RMSE: 0.0,
 R^2: 1.0 

Test Evaluation
MAE 54.272085712460935,
 MSE 22855.24218196678,
 RMSE: 151.17950318071158,
 R^2: 0.6558997860229151 


**Hypertune the Decision tree model**

In [None]:
# find the hyperparameters to tune
dec_tree_pipe.get_params()

{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(transformers=[('onehotencoder',
                                    OneHotEncoder(handle_unknown='ignore',
                                                  sparse=False),
                                    <sklearn.compose._column_transformer.make_column_selector object at 0x7f75a7b38f90>),
                                   ('pipeline',
                                    Pipeline(steps=[('simpleimputer',
                                                     SimpleImputer()),
                                                    ('standardscaler',
                                                     StandardScaler())]),
                                    <sklearn.compose._column_transformer.make_column_selector object at 0x7f75a7b38d10>)])),
  ('decisiontreeregressor', DecisionTreeRegressor(random_state=42))],
 'verbose': False,
 'columntransformer': ColumnTransformer(transformers=[('onehotencoder',
               

In [None]:
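#choose the hyperparameters I want to tune
#note: range(start, stop, step) treats the third number as a step, so each range below
#yields a single value (3, 1 and 2) and the grid contains only one candidate
dec_tree_params = {'decisiontreeregressor__max_depth' : range(3, 5, 10),
                   'decisiontreeregressor__min_samples_leaf' : range(1, 8, 20),
                   'decisiontreeregressor__min_samples_split' : range(2,3, 20)}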
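The ranges in this grid each expand to a single value, which can be checked directly. The wider grid in the second half of the sketch is only an assumption about what might have been intended; it is hypothetical and was not run in this notebook.

```python
# range(start, stop, step): the third argument is the step, not a third candidate value.
depths = list(range(3, 5, 10))
leaves = list(range(1, 8, 20))
splits = list(range(2, 3, 20))
print(depths, leaves, splits)

# A hypothetical wider grid, written as explicit lists so every candidate is visible.
wider_params = {
    "decisiontreeregressor__max_depth": [3, 5, 10],
    "decisiontreeregressor__min_samples_leaf": [1, 8, 20],
    "decisiontreeregressor__min_samples_split": [2, 3, 20],
}

# GridSearchCV tries every combination, so the candidate count is the product of the list lengths.
n_candidates = 1
for values in wider_params.values():
    n_candidates *= len(values)
print(n_candidates)
```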

In [None]:
dec_tree_gs = GridSearchCV(dec_tree_pipe, dec_tree_params)

dec_tree_gs.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('onehotencoder',
                                                                         OneHotEncoder(handle_unknown='ignore',
                                                                                       sparse=False),
                                                                         <sklearn.compose._column_transformer.make_column_selector object at 0x7f75a7b38f90>),
                                                                        ('pipeline',
                                                                         Pipeline(steps=[('simpleimputer',
                                                                                          SimpleImputer()),
                                                                                         ('standardscaler',
                                                                    

In [None]:
dec_tree_gs.best_params_

{'decisiontreeregressor__max_depth': 3,
 'decisiontreeregressor__min_samples_leaf': 1,
 'decisiontreeregressor__min_samples_split': 2}

In [None]:
# find MAE, MSE, RMSE and R2 on the decision tree model with the best params for both the train and test data
print('Train Evaluation')
eval_regression(y_train, dec_tree_gs.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, dec_tree_gs.predict(X_test))

Train Evaluation
MAE 89.16279599446379,
 MSE 39354.72701122947,
 RMSE: 198.38025862275074,
 R^2: 0.48639163139850006 

Test Evaluation
MAE 82.79083574951059,
 MSE 25857.822234003077,
 RMSE: 160.80367605873653,
 R^2: 0.6106940327798267 


**Model 4: Bagged Trees Model**

In [None]:
#use all of the default parameters, instantiate a bagged tree model
bagreg = BaggingRegressor(random_state = 42)

In [None]:
# create model pipeline
bagreg_pipe = make_pipeline(preprocessor, bagreg)
bagreg_pipe.fit(X_train, y_train);

In [None]:
# find MAE, MSE, RMSE and R2 on the bagged tree model for both the train and test data
print('Train Evaluation')
eval_regression(y_train, bagreg_pipe.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, bagreg_pipe.predict(X_test))

Train Evaluation
MAE 21.48688720370078,
 MSE 4904.944080022237,
 RMSE: 70.03530595365625,
 R^2: 0.9359868427926594 

Test Evaluation
MAE 49.50219003248633,
 MSE 18443.87556063448,
 RMSE: 135.8082308280116,
 R^2: 0.7223157174860919 


**Hypertune the Bagged trees model**

In [None]:
bagreg_pipe.get_params()

{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(transformers=[('onehotencoder',
                                    OneHotEncoder(handle_unknown='ignore',
                                                  sparse=False),
                                    <sklearn.compose._column_transformer.make_column_selector object at 0x7f75a7b38f90>),
                                   ('pipeline',
                                    Pipeline(steps=[('simpleimputer',
                                                     SimpleImputer()),
                                                    ('standardscaler',
                                                     StandardScaler())]),
                                    <sklearn.compose._column_transformer.make_column_selector object at 0x7f75a7b38d10>)])),
  ('baggingregressor', BaggingRegressor(random_state=42))],
 'verbose': False,
 'columntransformer': ColumnTransformer(transformers=[('onehotencoder',
                         

In [None]:
#choose the hyperparameters I want to tune
bagreg_params = {'baggingregressor__max_samples' : [30, 40, 60, 100],
                 'baggingregressor__n_estimators' : [100, 120, 150, 200]}

In [None]:
bagreg_gs = GridSearchCV(bagreg_pipe, bagreg_params)

bagreg_gs.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('onehotencoder',
                                                                         OneHotEncoder(handle_unknown='ignore',
                                                                                       sparse=False),
                                                                         <sklearn.compose._column_transformer.make_column_selector object at 0x7f75a7b38f90>),
                                                                        ('pipeline',
                                                                         Pipeline(steps=[('simpleimputer',
                                                                                          SimpleImputer()),
                                                                                         ('standardscaler',
                                                                    

In [None]:
bagreg_gs.best_params_

{'baggingregressor__max_samples': 100, 'baggingregressor__n_estimators': 100}

In [None]:
# find MAE, MSE, RMSE and R2 on the best_bag_model for both the train and test data
print('Train Evaluation')
eval_regression(y_train, bagreg_gs.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, bagreg_gs.predict(X_test))

Train Evaluation
MAE 66.89217141284055,
 MSE 34529.64291862124,
 RMSE: 185.82153513148373,
 R^2: 0.5493625565547698 

Test Evaluation
MAE 64.31877307689551,
 MSE 29734.895687058503,
 RMSE: 172.43809233188153,
 R^2: 0.5523222249390032 


**Model 5: Random Forest Model**

In [None]:
#use all of the default parameters, instantiate a random forest model
rf = RandomForestRegressor(random_state = 42)

In [None]:
# create model pipeline
rf_pipe = make_pipeline(preprocessor, rf)
rf_pipe.fit(X_train, y_train);

In [None]:
# find MAE, MSE, RMSE and R2 on the random forest model for both the train and test data
print('Train Evaluation')
eval_regression(y_train, rf_pipe.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, rf_pipe.predict(X_test))

Train Evaluation
MAE 20.340255363093096,
 MSE 3833.5939483834745,
 RMSE: 61.916023357314174,
 R^2: 0.9499687564050947 

Test Evaluation
MAE 45.4659751828209,
 MSE 14942.792167342574,
 RMSE: 122.24071403318362,
 R^2: 0.7750267557324473 


**Hypertune the Random Forest Model**

In [None]:
# find the hyperparameters to tune
rf_pipe.get_params()

{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(transformers=[('onehotencoder',
                                    OneHotEncoder(handle_unknown='ignore',
                                                  sparse=False),
                                    <sklearn.compose._column_transformer.make_column_selector object at 0x7f75a7b38f90>),
                                   ('pipeline',
                                    Pipeline(steps=[('simpleimputer',
                                                     SimpleImputer()),
                                                    ('standardscaler',
                                                     StandardScaler())]),
                                    <sklearn.compose._column_transformer.make_column_selector object at 0x7f75a7b38d10>)])),
  ('randomforestregressor', RandomForestRegressor(random_state=42))],
 'verbose': False,
 'columntransformer': ColumnTransformer(transformers=[('onehotencoder',
               

In [None]:
#choose the hyperparameters I want to tune
rf_params = {'randomforestregressor__max_depth' : [18, 25, 30],
             'randomforestregressor__min_samples_leaf' : [1, 3, 10],
             'randomforestregressor__n_estimators' : [10, 20, 30]}

In [None]:
rf_gs = GridSearchCV(rf_pipe, rf_params)

In [None]:
rf_gs.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('onehotencoder',
                                                                         OneHotEncoder(handle_unknown='ignore',
                                                                                       sparse=False),
                                                                         <sklearn.compose._column_transformer.make_column_selector object at 0x7f75a7b38f90>),
                                                                        ('pipeline',
                                                                         Pipeline(steps=[('simpleimputer',
                                                                                          SimpleImputer()),
                                                                                         ('standardscaler',
                                                                    

In [None]:
rf_gs.best_params_

{'randomforestregressor__max_depth': 25,
 'randomforestregressor__min_samples_leaf': 1,
 'randomforestregressor__n_estimators': 30}

In [None]:
# find MAE, MSE, RMSE and R2 on the random forest model for both the train and test data
print('Train Evaluation')
eval_regression(y_train, rf_gs.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, rf_gs.predict(X_test))

Train Evaluation
MAE 30.952580627255266,
 MSE 4857.569990200602,
 RMSE: 69.69626955727689,
 R^2: 0.9366051097922076 

Test Evaluation
MAE 48.916191776952076,
 MSE 16394.63952183596,
 RMSE: 128.0415538871501,
 R^2: 0.7531682699913764 


**Model 6: K-Nearest Neighbors model**

In [None]:
#use all of the default parameters, instantiate a kNN model
knn = KNeighborsRegressor()

In [None]:
# create model pipeline
knn_pipe = make_pipeline(preprocessor, knn)
knn_pipe.fit(X_train, y_train);

In [None]:
# find MAE, MSE, RMSE and R2 on the knn model for both the train and test data
print('Train Evaluation')
eval_regression(y_train, knn_pipe.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, knn_pipe.predict(X_test))

Train Evaluation
MAE 50.49896298717992,
 MSE 17438.885271805473,
 RMSE: 132.05637156837784,
 R^2: 0.7724096164578904 

Test Evaluation
MAE 53.67091132513281,
 MSE 13646.457914182427,
 RMSE: 116.81805474404385,
 R^2: 0.7945438927790276 


**Hypertune the KNN model**

In [None]:
# find the hyperparameters to tune
knn_pipe.get_params()

{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(transformers=[('onehotencoder',
                                    OneHotEncoder(handle_unknown='ignore',
                                                  sparse=False),
                                    <sklearn.compose._column_transformer.make_column_selector object at 0x7f75a7b38f90>),
                                   ('pipeline',
                                    Pipeline(steps=[('simpleimputer',
                                                     SimpleImputer()),
                                                    ('standardscaler',
                                                     StandardScaler())]),
                                    <sklearn.compose._column_transformer.make_column_selector object at 0x7f75a7b38d10>)])),
  ('kneighborsregressor', KNeighborsRegressor())],
 'verbose': False,
 'columntransformer': ColumnTransformer(transformers=[('onehotencoder',
                                  

In [None]:
#choose the hyperparameters I want to tune
knn_params = {'kneighborsregressor__n_neighbors' : [5, 7, 9, 11],
              'kneighborsregressor__leaf_size' : [30, 35, 40]}

In [None]:
knn_gs = GridSearchCV(knn_pipe, knn_params)
knn_gs.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('onehotencoder',
                                                                         OneHotEncoder(handle_unknown='ignore',
                                                                                       sparse=False),
                                                                         <sklearn.compose._column_transformer.make_column_selector object at 0x7f75a7b38f90>),
                                                                        ('pipeline',
                                                                         Pipeline(steps=[('simpleimputer',
                                                                                          SimpleImputer()),
                                                                                         ('standardscaler',
                                                                    

In [None]:
knn_gs.best_params_

{'kneighborsregressor__leaf_size': 30, 'kneighborsregressor__n_neighbors': 5}

I found that the best hyperparameters are the default hyperparameters.
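One thing worth noting about this grid: leaf_size is passed to the BallTree or KDTree index and affects the speed and memory of neighbor lookups, not which neighbors are found. The sketch below illustrates this on synthetic data (not the wine data), forcing algorithm="kd_tree" so that leaf_size is actually used.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # synthetic features
y = X[:, 0] * 2.0 + rng.normal(size=200)   # synthetic target
X_new = rng.normal(size=(20, 3))

preds = []
for leaf_size in (30, 35, 40):
    model = KNeighborsRegressor(n_neighbors=5, algorithm="kd_tree", leaf_size=leaf_size)
    preds.append(model.fit(X, y).predict(X_new))

# The predictions agree for every leaf_size because the same neighbors are retrieved.
same = all(np.allclose(preds[0], p) for p in preds[1:])
print(same)
```

In practice this means only n_neighbors could have moved the cross-validation score in the search above.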

In [None]:
# find MAE, MSE, RMSE and R2 on the kNN model for both the train and test data
print('Train Evaluation')
eval_regression(y_train, knn_gs.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, knn_gs.predict(X_test))

Train Evaluation
MAE 50.49896298717992,
 MSE 17438.885271805473,
 RMSE: 132.05637156837784,
 R^2: 0.7724096164578904 

Test Evaluation
MAE 53.67091132513281,
 MSE 13646.457914182427,
 RMSE: 116.81805474404385,
 R^2: 0.7945438927790276 


**Summary**


*   After trying all the models and the hyperparameters tested, I can tell that the best final model is the KNN model. It had the highest test R2 score (0.79), which means it explained the largest share of the variance in the test data. It also had the lowest test RMSE (116.8), which means its predictions had, on average, less error than those of the other models.


*   The hypertuned KNN model had a leaf_size of 30 and an n_neighbors of 5, which are the default values.
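The comparison can be laid out side by side. The sketch below copies the test-set RMSE and R2 values (rounded) from the evaluation outputs above into a DataFrame and ranks the models; the row labels are my own shorthand.

```python
import pandas as pd

# Test-set scores copied (rounded) from the evaluation outputs above.
results = pd.DataFrame(
    [
        ("Baseline (mean)", 257.74, -0.0001),
        ("Linear regression", 2.415e12, -8.78e19),
        ("Decision tree (default)", 151.18, 0.6559),
        ("Decision tree (tuned)", 160.80, 0.6107),
        ("Bagged trees (default)", 135.81, 0.7223),
        ("Bagged trees (tuned)", 172.44, 0.5523),
        ("Random forest (default)", 122.24, 0.7750),
        ("Random forest (tuned)", 128.04, 0.7532),
        ("KNN (default = tuned)", 116.82, 0.7945),
    ],
    columns=["model", "test_rmse", "test_r2"],
)

# Highest test R2 first.
ranked = results.sort_values("test_r2", ascending=False).reset_index(drop=True)
print(ranked)
```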







**Perform PCA on the final best model I just created.**

In [None]:
#I want the number of Principal Components that will retain 95% of the variance in the original features
pca = PCA(n_components = .95)
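When n_components is a float between 0 and 1, PCA keeps as many components as are needed to reach that share of explained variance. A minimal sketch on synthetic data (10 features driven by 2 latent factors, an assumption made purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic data: 10 features generated from 2 latent factors plus a little noise.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

pca_demo = PCA(n_components=0.95)   # keep enough components for 95% of the variance
X_reduced = pca_demo.fit_transform(X)

kept = pca_demo.n_components_                         # number of components actually kept
explained = pca_demo.explained_variance_ratio_.sum()  # share of variance they retain
print(kept, X_reduced.shape, round(explained, 3))
```

After fitting, n_components_ reports how many components were kept, which is a quick way to see how much the one-hot encoded feature space was reduced.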

In [None]:
#create a pipeline with the preprocessor and the pca together
pca_pipe = make_pipeline(preprocessor, pca)

In [None]:
#create another pipeline with the pca_pipe and the knn model together
knn_pca_pipe = make_pipeline(pca_pipe, knn)

In [None]:
#fit the pipeline
knn_pca_pipe.fit(X_train, y_train)

Pipeline(steps=[('pipeline',
                 Pipeline(steps=[('columntransformer',
                                  ColumnTransformer(transformers=[('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False),
                                                                   <sklearn.compose._column_transformer.make_column_selector object at 0x7f758d2aba90>),
                                                                  ('pipeline',
                                                                   Pipeline(steps=[('simpleimputer',
                                                                                    SimpleImputer()),
                                                                                   ('standardscaler',
                                                                                    StandardS

In [None]:
# find MAE, MSE, RMSE and R2 on the KNN model with the PCA for both the train and test data
print('Train Evaluation')
eval_regression(y_train, knn_pca_pipe.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, knn_pca_pipe.predict(X_test))

Train Evaluation
MAE 48.75582528232464,
 MSE 17773.54613367993,
 RMSE: 133.31746372354948,
 R^2: 0.7680420440630148 

Test Evaluation
MAE 57.11373571111719,
 MSE 19621.557254067156,
 RMSE: 140.07696903512425,
 R^2: 0.7045849702255434 


**Key Finding**


*   My best KNN model used the default parameters (leaf_size of 30 and n_neighbors of 5). Without PCA, its test RMSE was 116.8 and its test R2 score was 0.79.
*   With PCA, the test RMSE was 140.1 and the test R2 score was 0.70.
*   Performing PCA did not improve the model's predictive ability.



