


*   Qian Fu
*   8/25/2022



# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn import set_config
from sklearn.decomposition import PCA

# Upload Data

In [None]:
df = pd.read_csv("/content/Wine.csv")
df.head()

Unnamed: 0,winery,wine,year,rating,num_reviews,country,region,price,type,body,acidity
0,Teso La Monja,Tinto,2013,4.9,58,Espana,Toro,995.0,Toro Red,5.0,3.0
1,Artadi,Vina El Pison,2018,4.9,31,Espana,Vino de Espana,313.5,Tempranillo,4.0,2.0
2,Vega Sicilia,Unico,2009,4.8,1793,Espana,Ribera del Duero,324.95,Ribera Del Duero Red,5.0,3.0
3,Vega Sicilia,Unico,1999,4.8,1705,Espana,Ribera del Duero,692.96,Ribera Del Duero Red,5.0,3.0
4,Vega Sicilia,Unico,1996,4.8,1309,Espana,Ribera del Duero,778.06,Ribera Del Duero Red,5.0,3.0


# Data Cleaning

**Deleted unnecessary columns**

In [None]:
#I decided to drop the country column: it contains only one value (Espana), so it cannot help predict the wine price.
df.drop(columns="country", inplace= True)

**Check and drop any duplicates**

In [None]:
#check for duplicates
df.duplicated().sum()

5452

In [None]:
#Drop all the duplicates
df.drop_duplicates(inplace=True)
df.duplicated().sum()

0

In [None]:
df.shape

(2048, 10)

# **Identify and address any missing values in this dataset.**


In [None]:
df.isna().sum()

winery           0
wine             0
year             2
rating           0
num_reviews      0
region           0
price            0
type           106
body           271
acidity        271
dtype: int64

I can see that there are 2 missing values in the "year" column, 106 in the "type" column, 271 in the "body" column, and 271 in the "acidity" column.

**Figure out the method for dealing with the missing values in type column**

In [None]:
#check the value count in type column
df["type"].value_counts()

Ribera Del Duero Red    535
Rioja Red               451
Priorat Red             238
Red                     210
Toro Red                 78
Tempranillo              73
Sherry                   56
Rioja White              37
Pedro Ximenez            35
Grenache                 35
Albarino                 34
Cava                     33
Verdejo                  27
Monastrell               18
Mencia                   17
Montsant Red             17
Syrah                    15
Chardonnay               13
Cabernet Sauvignon       11
Sparkling                 5
Sauvignon Blanc           4
Name: type, dtype: int64

To avoid biasing the model, I decided to give the missing values in the type column a new label, "Unidentified". I will deal with this after the data split.

**Figure out the method for dealing with the missing values in year column**



In [None]:
missing_values = pd.isna(df["year"])
df[missing_values]

Unnamed: 0,winery,wine,year,rating,num_reviews,region,price,type,body,acidity
46,Vega Sicilia,Unico Reserva Especial Edicion,,4.7,12421,Ribera del Duero,423.5,Ribera Del Duero Red,5.0,3.0
851,La Unica,Fourth Edition,,4.4,131,Vino de Espana,40.0,Tempranillo,4.0,2.0


I will fill the missing values in the "year" column with the most recent year. I will deal with this after the data split.

**Figure out the method for dealing with the missing values in body and acidity column**

In [None]:
#check the stats information of body column
df["body"].describe().round(1)

count    1777.0
mean        4.3
std         0.7
min         2.0
25%         4.0
50%         4.0
75%         5.0
max         5.0
Name: body, dtype: float64

In [None]:
#check the most frequent value 
df["body"].value_counts()

4.0    1003
5.0     634
3.0     106
2.0      34
Name: body, dtype: int64

I can see that the most frequent body value is 4.0 and the mean is around 4.3. I can use SimpleImputer(strategy='mean') to fill in the missing values in the body column. I will address this after the data split.

In [None]:
#check the stats information of acidity column
df["acidity"].describe().round(1)

count    1777.0
mean        2.9
std         0.3
min         1.0
25%         3.0
50%         3.0
75%         3.0
max         3.0
Name: acidity, dtype: float64

In [None]:
#check the most frequent value
df["acidity"].value_counts()

3.0    1672
2.0      70
1.0      35
Name: acidity, dtype: int64

I can see that the most frequent acidity value is 3.0 and the mean acidity is around 2.9. I can use SimpleImputer(strategy='mean') to fill in the missing values in the acidity column. I will address this after the data split.
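The reason for waiting until after the split can be sketched on a tiny example. The two frames below are hypothetical stand-ins for X_train and X_test, not the wine data. The imputer learns its column means from the training rows only and then reuses those means on the test rows, so nothing from the test set leaks into the fill values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical miniature frames standing in for X_train / X_test.
train = pd.DataFrame({"body": [4.0, 5.0, np.nan, 3.0],
                      "acidity": [3.0, 3.0, 2.0, np.nan]})
test = pd.DataFrame({"body": [np.nan, 4.0],
                     "acidity": [np.nan, 3.0]})

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)  # the means come from the training rows only

# statistics_ holds one learned mean per column
print(dict(zip(train.columns, imputer.statistics_)))

# The test rows are filled with the training means, not their own.
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)
print(test_filled)
```

Inside a pipeline, as used later in this notebook, the fit-on-train and transform-on-test steps happen automatically when the pipeline is fitted and then used to predict.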

# Identified and corrected inconsistencies in data for categorical values

In [None]:
dtypes = df.dtypes
str_cols = dtypes[dtypes=="object"].index
for col in str_cols:
  print(f'-Column={col}')
  print(df[col].value_counts(dropna=False))
  print('\n\n')

-Column=winery
Vega Sicilia                            97
Alvaro Palacios                         48
Artadi                                  43
La Rioja Alta                           36
Marques de Murrieta                     33
                                        ..
Valdelosfrailes                          1
Briego                                   1
Guillem Carol - Cellers Carol Valles     1
Particular                               1
Binigrau                                 1
Name: winery, Length: 480, dtype: int64



-Column=wine
Tinto                                                 56
Unico                                                 41
Valbuena 5o                                           32
Reserva                                               31
Priorat                                               26
                                                      ..
San Valentin Parellada                                 1
Silvanus Edicion Limitada Ribera del Duero             1


I do not see any inconsistencies in the categorical values.

# Identified and corrected any impossible values in numeric columns

In [None]:
dtypes = df.dtypes
float_cols = dtypes[dtypes=="float64"].index
for col in float_cols:
  print(f'-Column={col}')
  print(df[col].value_counts(dropna=False))
  print('\n\n')

-Column=rating
4.3    706
4.4    484
4.5    281
4.2    228
4.6    191
4.7    112
4.8     44
4.9      2
Name: rating, dtype: int64



-Column=price
75.00     16
95.00     12
34.90     12
59.90     12
26.90     11
          ..
75.92      1
47.52      1
94.20      1
185.15     1
995.00     1
Name: price, Length: 1292, dtype: int64



-Column=body
4.0    1003
5.0     634
NaN     271
3.0     106
2.0      34
Name: body, dtype: int64



-Column=acidity
3.0    1672
NaN     271
2.0      70
1.0      35
Name: acidity, dtype: int64





I can see that there are no impossible values in the numeric columns.

# Ensure all columns data types are correct

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2048 entries, 0 to 6100
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   winery       2048 non-null   object 
 1   wine         2048 non-null   object 
 2   year         2046 non-null   object 
 3   rating       2048 non-null   float64
 4   num_reviews  2048 non-null   int64  
 5   region       2048 non-null   object 
 6   price        2048 non-null   float64
 7   type         1942 non-null   object 
 8   body         1777 non-null   float64
 9   acidity      1777 non-null   float64
dtypes: float64(4), int64(1), object(5)
memory usage: 176.0+ KB


I can see that the data type of the "year" column is not correct. I will convert it from object to a numeric type (float, because the column still contains missing values).

In [None]:
df.loc[df["year"]=="N.V.", :]

Unnamed: 0,winery,wine,year,rating,num_reviews,region,price,type,body,acidity
20,Valdespino,Toneles Moscatel,N.V.,4.8,174,Jerez-Xeres-Sherry,253.00,Sherry,4.0,3.0
133,Barbadillo,Reliquia Palo Cortado Sherry,N.V.,4.7,58,Jerez Palo Cortado,380.00,Sherry,4.0,3.0
142,Alvear,Abuelo Diego Palo Cortado,N.V.,4.7,42,Montilla-Moriles,114.28,Pedro Ximenez,5.0,1.0
143,Equipo Navazos,La Bota 78 de Oloroso,N.V.,4.7,41,Manzanilla,95.57,Sherry,4.0,3.0
267,Osborne,Solera India Oloroso Rare Sherry,N.V.,4.6,74,Jerez-Xeres-Sherry,189.99,Sherry,4.0,3.0
...,...,...,...,...,...,...,...,...,...,...
1942,Williams & Humbert,Dos Cortados Palo Cortado Solera Especial Aged...,N.V.,4.2,666,Jerez Palo Cortado,32.16,Sherry,4.0,3.0
1971,Fernando de Castilla,Antique Palo Cortado,N.V.,4.2,519,Jerez Palo Cortado,36.90,Sherry,4.0,3.0
1979,Williams & Humbert,Jalifa Amontillado Rare Old Dry Solera Especia...,N.V.,4.2,487,Jerez Amontillado,33.50,Sherry,4.0,3.0
2012,Lustau,Candela Cream Dulce Sweet,N.V.,4.2,405,Jerez-Xeres-Sherry,7.10,Sherry,,


I checked the data dictionary, but it does not explain what "N.V." means here, so I decided to drop these rows.

In [None]:
df.drop(df[df["year"]=="N.V."].index, inplace=True)

In [None]:
df["year"] = df["year"].astype(float)
df.dtypes

winery          object
wine            object
year           float64
rating         float64
num_reviews      int64
region          object
price          float64
type            object
body           float64
acidity        float64
dtype: object
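As a side note, one way to discover which entries prevent a numeric conversion is to coerce the column and look at what fails. This is only a sketch on hypothetical values, not the wine data: `pd.to_numeric(..., errors="coerce")` turns anything it cannot parse into NaN, so comparing against the original missing values isolates the offending strings.

```python
import pandas as pd

# Hypothetical values that mimic the mix found in the "year" column.
years = pd.Series(["2013", "N.V.", "1999", None, "2018"], name="year")

# Unparseable entries become NaN instead of raising an error.
as_number = pd.to_numeric(years, errors="coerce")

# Entries that were present originally but failed to parse.
non_numeric = years[as_number.isna() & years.notna()]
print(non_numeric.unique())
```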

In [None]:
df.shape

(1978, 10)

# Prepare the data appropriately for modeling

**Define X, y, train test split**

In [None]:
X=df.drop(columns ="price")
y=df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

**Identify Columns Features**



*   Numeric Features: year, rating, num_reviews, body, acidity
*   Nominal Features: winery, wine, region, type
*   Ordinal Features: none





**Make column selectors**

In [None]:
num_selector = make_column_selector(dtype_include='number')
cat_selector = make_column_selector(dtype_include='object')

**Instantiate transformers**

In [None]:
mean_imputer = SimpleImputer(strategy='mean')  # fills the missing values in "body" and "acidity"
# note: scikit-learn 1.2+ renames the `sparse` argument to `sparse_output`
ohe_encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
scaler = StandardScaler()

**Fill the missing values in "type" and "year" using the fillna function**

In [None]:
X_train["type"] = X_train["type"].fillna('Unidentified')
X_test["type"] = X_test["type"].fillna('Unidentified')


In [None]:
X_train["year"] = X_train["year"].fillna(2018)
X_test["year"] = X_test["year"].fillna(2018)


**Create pipelines**

In [None]:
num_pipe = make_pipeline(mean_imputer, scaler)

**Create Tuples to Pair Pipelines with Columns**

In [None]:
number_tuple = (num_pipe, num_selector)
nom_tuple = (ohe_encoder, cat_selector)

**Instantiate the ColumnTransformer**

In [None]:
preprocessor = make_column_transformer(nom_tuple,  number_tuple,  remainder='drop')                                                                      
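A minimal sketch of what a preprocessor like this does to a single row, assuming a hypothetical two-column frame rather than the wine data. The column names are passed explicitly here instead of through `make_column_selector`, and the encoder is left at its default output, so the result is converted to a dense array before printing. It illustrates that a category never seen during fitting is encoded as all zeros when `handle_unknown='ignore'` is set.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training and test rows; "Cava" never appears in train.
train = pd.DataFrame({"type": ["Red", "Sherry", "Red"],
                      "body": [4.0, np.nan, 5.0]})
test = pd.DataFrame({"type": ["Cava"], "body": [4.5]})

pre = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["type"]),
    (make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), ["body"]),
    remainder="drop",
)
pre.fit(train)

out = pre.transform(test)
# The combined output may be sparse or dense depending on its density.
out = out.toarray() if hasattr(out, "toarray") else np.asarray(out)

# Two one-hot columns (Red, Sherry) followed by the scaled body value.
print(out)
```

With many rare wineries and wine names, a test row can contain several categories the encoder never saw, and each of those contributes only zeros to the encoded row.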

# Try multiple models and tune the hyperparameters of the models to find out the best final model

Define a function that takes true and predicted values as arguments
and prints all 4 metrics 

In [None]:
def eval_regression(true, pred):
  mae = mean_absolute_error(true, pred)
  mse = mean_squared_error(true, pred)
  rmse = np.sqrt(mse)
  r2 = r2_score(true, pred)

  print(f'MAE {mae},\n MSE {mse},\n RMSE: {rmse},\n R^2: {r2} ')
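To make the four metrics concrete, here is a small worked example in plain Python. The numbers are hypothetical and chosen only so the arithmetic is easy to check by hand. It also shows why a model that always predicts the mean of the targets it is scored on gets an R^2 of exactly 0.

```python
import math

# Hypothetical toy values, chosen only to keep the arithmetic simple.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]  # -2, 2, -3, 3

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = math.sqrt(mse)                   # same units as the target

mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)                 # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
r2 = 1 - ss_res / ss_tot

# Always predicting the mean makes ss_res equal ss_tot, so R^2 is 0.
ss_res_mean = sum((t - mean_true) ** 2 for t in y_true)
r2_mean_model = 1 - ss_res_mean / ss_tot

print(f"MAE {mae}, MSE {mse}, RMSE {rmse:.3f}, R^2 {r2:.3f}")
print(f"R^2 of the mean-only model: {r2_mean_model}")
```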

**Model 1: Baseline Model**

In [None]:
# instantiate a baseline model
dummy_reg = DummyRegressor(strategy='mean')

In [None]:
# create model pipeline
base_pipe = make_pipeline(preprocessor, dummy_reg)

base_pipe.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse=False),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f7e9b5c3690>),
                                                 ('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer()),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f7e9b5c30d0>)])),
                ('dummyregressor', DummyRegresso

In [None]:
# find MAE, MSE, RMSE and R2 on the baseline model for both the train and test data
print('Train Evaluation')
eval_regression(y_train, base_pipe.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, base_pipe.predict(X_test))

Train Evaluation
MAE 136.81661664457778,
 MSE 69011.44063521981,
 RMSE: 262.70028670562925,
 R^2: 0.0 

Test Evaluation
MAE 140.56462624592072,
 MSE 98340.3743166096,
 RMSE: 313.59268855732205,
 R^2: -0.0003932989341515203 


**Model 2:Linear Regression Model**

In [None]:
# instantiate a linear regression model
lin_reg = LinearRegression()

In [None]:
# create model pipeline
lin_reg_pipe = make_pipeline(preprocessor, lin_reg)
lin_reg_pipe.fit(X_train, y_train);

In [None]:
# find MAE, MSE, RMSE and R2 on the linear regression model for both the train and test data
print('Train Evaluation')
eval_regression(y_train, lin_reg_pipe.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, lin_reg_pipe.predict(X_test))

Train Evaluation
MAE 55.62212060463251,
 MSE 12325.322972071312,
 RMSE: 111.01947113939659,
 R^2: 0.8214017435569789 

Test Evaluation
MAE 48190910089146.32,
 MSE 1.955254032877776e+28,
 RMSE: 139830398443177.44,
 R^2: -1.9890335437481975e+23 


**Model 3: DecisionTree Model**

In [None]:
#use all of the default parameters, instantiate a decision tree model
dec_tree = DecisionTreeRegressor(random_state = 42)

In [None]:
# create model pipeline
dec_tree_pipe = make_pipeline(preprocessor, dec_tree)
dec_tree_pipe.fit(X_train, y_train);

In [None]:
# find MAE, MSE, RMSE and R2 on the decision tree model for both the train and test data
print('Train Evaluation')
eval_regression(y_train, dec_tree_pipe.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, dec_tree_pipe.predict(X_test))

Train Evaluation
MAE 1.221769370322492e-16,
 MSE 6.561960975594336e-30,
 RMSE: 2.5616324825381053e-15,
 R^2: 1.0 

Test Evaluation
MAE 64.11083932343435,
 MSE 38699.33721390723,
 RMSE: 196.721471156321,
 R^2: 0.606320823049308 


**Tune the decision tree model: finding the optimal max_depth**

In [None]:
# find the depth of the default tree
dec_tree.get_depth()

72

In [None]:
#Use a for loop to try many values and find the optimal max_depth.
depths = list(range(2, 132))
scores = pd.DataFrame(index=depths, columns=['Test Score','Train Score'])
for depth in depths:
    dec_tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    dec_tree_pipe = make_pipeline(preprocessor, dec_tree)
    dec_tree_pipe.fit(X_train, y_train)
    train_score = dec_tree_pipe.score(X_train, y_train)
    test_score = dec_tree_pipe.score(X_test, y_test)
    scores.loc[depth, 'Train Score'] = train_score
    scores.loc[depth, 'Test Score'] = test_score

In [None]:
#use sort_values to sort out the best score
sorted_scores = scores.sort_values(by='Test Score', ascending=False)
sorted_scores.head()

Unnamed: 0,Test Score,Train Score
42,0.666689,0.999306
37,0.665015,0.998654
35,0.664006,0.998397
63,0.662086,0.999992
68,0.661489,1.0
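The best row can also be read off programmatically instead of by eye. The sketch below rebuilds the five rows of the table above by hand; in the notebook the `scores` frame is created empty, so its columns may hold Python objects, which is why the column is cast to float before calling `idxmax` (the cast is harmless when the column is already numeric).

```python
import pandas as pd

# Values copied from the sorted table above (index = max_depth).
scores = pd.DataFrame(
    {"Test Score": [0.666689, 0.665015, 0.664006, 0.662086, 0.661489]},
    index=[42, 37, 35, 63, 68],
)

# idxmax returns the index label of the largest value.
best_depth = scores["Test Score"].astype(float).idxmax()
print(best_depth, scores.loc[best_depth, "Test Score"])
```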


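We can see that the best test score (about 0.667) occurs at max_depth = 42. The next cells fit a tree with max_depth = 74, which is deeper than the default tree (depth 72), so it reproduces the scores of the untuned model.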

In [None]:
#instantiate a decision tree model with the max_depth = 74 and create the pipeline
dec_tree_74 = DecisionTreeRegressor(max_depth=74, random_state=42)
# use dec_tree_74 here; dec_tree now holds the last model from the tuning loop
dec_tree_74_pipe = make_pipeline(preprocessor, dec_tree_74)
dec_tree_74_pipe.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse=False),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f7e9b5c3690>),
                                                 ('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer()),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f7e9b5c30d0>)])),
                ('decisiontreeregressor',
      

In [None]:
# find MAE, MSE, RMSE and R2 on the decision tree model with the best max_depth = 74 for both the train and test data
print('Train Evaluation')
eval_regression(y_train, dec_tree_74_pipe.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, dec_tree_74_pipe.predict(X_test))

Train Evaluation
MAE 1.221769370322492e-16,
 MSE 6.561960975594336e-30,
 RMSE: 2.5616324825381053e-15,
 R^2: 1.0 

Test Evaluation
MAE 64.11083932343435,
 MSE 38699.33721390723,
 RMSE: 196.721471156321,
 R^2: 0.606320823049308 


**Model 4: Bagged Trees Model**

In [None]:
#use all of the default parameters, instantiate a bagged tree model
bagreg = BaggingRegressor(random_state = 42)

In [None]:
# create model pipeline
bagreg_pipe = make_pipeline(preprocessor, bagreg)
bagreg_pipe.fit(X_train, y_train);

In [None]:
# find MAE, MSE, RMSE and R2 on the bagged tree model for both the train and test data
print('Train Evaluation')
eval_regression(y_train, bagreg_pipe.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, bagreg_pipe.predict(X_test))

Train Evaluation
MAE 18.291313427030346,
 MSE 2819.5565091036437,
 RMSE: 53.09949631685449,
 R^2: 0.9591436364296866 

Test Evaluation
MAE 57.92638929570505,
 MSE 38504.045324248604,
 RMSE: 196.2244768734232,
 R^2: 0.6083074811143027 


**Tune the bagged tree model: finding the optimal n_estimators**

In [None]:
#use a for loop to try several values and find the optimal n_estimators
estimators = [10, 30, 60, 80, 100, 120]
scores = pd.DataFrame(index=estimators, columns=['Train Score', 'Test Score'])
for num_estimators in estimators:
   bagreg = BaggingRegressor(n_estimators=num_estimators, random_state=42)
   bagreg_pipe = make_pipeline(preprocessor, bagreg)
   bagreg_pipe.fit(X_train, y_train)
   train_score = bagreg_pipe.score(X_train, y_train)
   test_score = bagreg_pipe.score(X_test, y_test)
   scores.loc[num_estimators, 'Train Score'] = train_score
   scores.loc[num_estimators, 'Test Score'] = test_score

In [None]:
#use sort_values to sort out the best score
sorted_scores = scores.sort_values(by='Test Score', ascending=False)
sorted_scores.head()

Unnamed: 0,Train Score,Test Score
100,0.963497,0.661828
120,0.964016,0.661765
80,0.963625,0.660033
60,0.96029,0.651452
30,0.961952,0.642305


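We can see that n_estimators = 100 and n_estimators = 120 give almost identical test scores (0.661828 vs 0.661765), with 100 marginally ahead. The tuned model below uses n_estimators = 120.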

In [None]:
#instantiate a bagged tree model with the n_estimator = 120 and create the pipeline
bagreg = BaggingRegressor(n_estimators=120, random_state=42)
bagreg_120_pipe = make_pipeline(preprocessor, bagreg)
bagreg_120_pipe.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse=False),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f7e9b5c3690>),
                                                 ('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer()),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f7e9b5c30d0>)])),
                ('baggingregressor',
           

In [None]:
# find MAE, MSE, RMSE and R2 on the bagged tree model with the best n_estimators = 120 for both the train and test data
print('Train Evaluation')
eval_regression(y_train, bagreg_120_pipe.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, bagreg_120_pipe.predict(X_test))

Train Evaluation
MAE 16.548544983186563,
 MSE 2483.325569742755,
 RMSE: 49.83297672969933,
 R^2: 0.9640157407687068 

Test Evaluation
MAE 53.59454360185303,
 MSE 33249.083876370685,
 RMSE: 182.3433132208875,
 R^2: 0.6617649573049995 


**Model 5: Random Forest Model**

In [None]:
#use all of the default parameters, instantiate a random forest model
rf = RandomForestRegressor(random_state = 42)

In [None]:
# create model pipeline
rf_pipe = make_pipeline(preprocessor, rf)
rf_pipe.fit(X_train, y_train);

In [None]:
# find MAE, MSE, RMSE and R2 on the random forest model for both the train and test data
print('Train Evaluation')
eval_regression(y_train, rf_pipe.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, rf_pipe.predict(X_test))

Train Evaluation
MAE 16.705355210102095,
 MSE 2547.0911871015537,
 RMSE: 50.46871493412086,
 R^2: 0.9630917545894317 

Test Evaluation
MAE 53.12674734011859,
 MSE 33170.251520156075,
 RMSE: 182.12702029121346,
 R^2: 0.6625669001636101 


**Tune the random forest model: finding the optimal max_depth and n_estimators**

In [None]:
# find the depth of each tree in the random forest when max_depth is unlimited
est_depths = [estimator.get_depth() for estimator in rf.estimators_]
max(est_depths)

94

In [None]:
#Use a for loop to try many values and find the optimal max_depth.
depths = range(1, max(est_depths))
scores = pd.DataFrame(index=depths, columns=['Test Score', 'Train Score'])
for depth in depths:    
   rf = RandomForestRegressor(max_depth=depth, random_state=42)
   rf_pipe = make_pipeline(preprocessor, rf)
   rf_pipe.fit(X_train, y_train)
   train_score = rf_pipe.score(X_train, y_train)
   test_score = rf_pipe.score(X_test, y_test)
   scores.loc[depth, 'Train Score'] = train_score
   scores.loc[depth, 'Test Score'] = test_score

In [None]:
#use sort_values to sort out the best score
sorted_scores = scores.sort_values(by='Test Score', ascending=False)
sorted_scores.head()

Unnamed: 0,Test Score,Train Score
54,0.665523,0.963715
42,0.665137,0.963081
73,0.663972,0.963374
76,0.663501,0.963115
67,0.663367,0.963758


We can see that when max_depth = 54, we have the best test score.

**Next, tune n_estimators with max_depth fixed at 54.**

In [None]:
#using a for loop to find the optimal n_estimators
n_ests = [10, 30, 60, 80, 100, 120]
scores = pd.DataFrame(index=n_ests, columns=['Test Score', 'Train Score'])
for n in n_ests:
   rf_54 = RandomForestRegressor(max_depth=54, n_estimators=n, random_state = 42)
   rf_54_pipe = make_pipeline(preprocessor, rf_54)
   rf_54_pipe.fit(X_train, y_train)
   scores.loc[n, 'Train Score'] = rf_54_pipe.score(X_train, y_train)
   scores.loc[n, 'Test Score'] = rf_54_pipe.score(X_test, y_test)

In [None]:
#use sort_values to sort out the best score
sorted_scores = scores.sort_values(by='Test Score', ascending=False)
sorted_scores.head()

Unnamed: 0,Test Score,Train Score
80,0.666001,0.96393
100,0.665523,0.963715
120,0.662368,0.964142
60,0.660707,0.960648
30,0.648093,0.961896


We can see that with max_depth = 54 and n_estimators = 80, we have the best test score.

In [None]:
#instantiate a random forest model with the max_depth = 54 and the n_estimator=80
rf_54_80 = RandomForestRegressor(max_depth=54, n_estimators=80, random_state = 42)

In [None]:
#create pipeline
rf_54_80_pipe = make_pipeline(preprocessor, rf_54_80)
rf_54_80_pipe.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse=False),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f7e9b5c3690>),
                                                 ('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer()),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f7e9b5c30d0>)])),
                ('randomforestregressor',
      

In [None]:
# find MAE, MSE, RMSE and R2 on the random forest model when the max_depth = 54 and n_estimator = 80 for both the train and test data
print('Train Evaluation')
eval_regression(y_train, rf_54_80_pipe.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, rf_54_80_pipe.predict(X_test))

Train Evaluation
MAE 16.80743830720469,
 MSE 2489.266798954593,
 RMSE: 49.892552539979285,
 R^2: 0.963929650271868 

Test Evaluation
MAE 53.38600129157633,
 MSE 32832.676109262764,
 RMSE: 181.1978921214669,
 R^2: 0.6660009747366369 


**Summary**


*   After trying all the models and hyperparameter settings above, I can tell that the best final model is the Random Forest model with max_depth = 54 and n_estimators = 80.




**Perform PCA on the final best model I just created.**

In [None]:
#I want the number of Principal Components that will retain 95% of the variance in the original features
pca = PCA(n_components = .95)
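To illustrate what a fractional `n_components` asks for, here is a NumPy-only sketch on hypothetical synthetic data (not the wine features). As I understand it, PCA keeps the smallest number of components whose cumulative explained variance ratio reaches the requested fraction. The sketch computes those ratios from the singular values of the centered data and finds that count directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 5 features, variance concentrated
# in two underlying directions plus a little noise.
latent = rng.normal(size=(200, 2)) * [5.0, 2.0]
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + rng.normal(scale=0.1, size=(200, 5))

# PCA works on centered data; squared singular values are
# proportional to the variance explained by each component.
X_centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)
explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
cumulative = np.cumsum(explained_ratio)

# Smallest k whose cumulative ratio reaches 95%.
k = int(np.searchsorted(cumulative, 0.95) + 1)
print("cumulative explained variance:", cumulative.round(4))
print("components kept for 95% variance:", k)
```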

In [None]:
#create a pipeline with the preprocessor and the pca together
pca_pipe = make_pipeline(preprocessor, pca)

In [None]:
#create another pipeline with the pca_pipe and rf_54_80 model together
rf_pca_pipe = make_pipeline(pca_pipe, rf_54_80)

In [None]:
#fit the pipeline
rf_pca_pipe.fit(X_train, y_train)

Pipeline(steps=[('pipeline',
                 Pipeline(steps=[('columntransformer',
                                  ColumnTransformer(transformers=[('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False),
                                                                   <sklearn.compose._column_transformer.make_column_selector object at 0x7f7e9b5c3690>),
                                                                  ('pipeline',
                                                                   Pipeline(steps=[('simpleimputer',
                                                                                    SimpleImputer()),
                                                                                   ('standardscaler',
                                                                                    StandardS

In [None]:
# find MAE, MSE, RMSE and R2 on the random forest model when the max_depth = 54 and n_estimator = 80 with the PCA for both the train and test data
print('Train Evaluation')
eval_regression(y_train, rf_pca_pipe.predict(X_train))

print('\nTest Evaluation')
eval_regression(y_test, rf_pca_pipe.predict(X_test))

Train Evaluation
MAE 18.970932575858317,
 MSE 2640.512215630401,
 RMSE: 51.385914564503,
 R^2: 0.9617380510923167 

Test Evaluation
MAE 55.90016801025883,
 MSE 31235.58260768514,
 RMSE: 176.73591204869808,
 R^2: 0.6822478280545388 


**Key Finding**


*   The Random Forest model with max_depth = 54 and n_estimators = 80 without PCA ("rf_54_80_pipe") reached a test R2 score of about 0.666.
*   The same Random Forest model with PCA ("rf_pca_pipe") reached a test R2 score of about 0.682.


*   Performing PCA before the model improved the test R2 score (0.666 to 0.682) and the test RMSE (181.2 to 176.7), although the test MAE rose slightly (53.4 to 55.9).



