# Mercedes-Benz Greener Manufacturing

### DESCRIPTION

Reduce the time a Mercedes-Benz spends on the test bench.

#### Problem Statement Scenario:

Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before it hits the road, the company’s engineers have developed a robust testing system. For one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on the production lines. However, optimizing the speed of the testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. You will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.

#### Following actions should be performed:

1) If for any column(s), the variance is equal to zero, then you need to remove those variable(s).

2) Check for null and unique values for test and train sets.

3) Apply label encoder.

4) Perform dimensionality reduction.

5) Predict your test_df values using XGBoost.

**Note: Cells with "###" in first line are operations that we will need to perform on Forecast Data ("test.csv") before making Predictions on them.**

In [410]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from statsmodels.api import OLS

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA

from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBRegressor
from sklearn.metrics import r2_score 

from math import sqrt

import warnings
warnings.filterwarnings("ignore")

In [302]:
df = pd.read_csv("train.csv")

In [303]:
df.head()

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0


In [304]:
df.shape

(4209, 378)

In [305]:
df.columns

Index(['ID', 'y', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8',
       ...
       'X375', 'X376', 'X377', 'X378', 'X379', 'X380', 'X382', 'X383', 'X384',
       'X385'],
      dtype='object', length=378)

In [306]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,4209.0,4205.960798,2437.608688,0.00,2095.00,4220.00,6314.00,8417.00
y,4209.0,100.669318,12.679381,72.11,90.82,99.15,109.01,265.32
X10,4209.0,0.013305,0.114590,0.00,0.00,0.00,0.00,1.00
X11,4209.0,0.000000,0.000000,0.00,0.00,0.00,0.00,0.00
X12,4209.0,0.075077,0.263547,0.00,0.00,0.00,0.00,1.00
...,...,...,...,...,...,...,...,...
X380,4209.0,0.008078,0.089524,0.00,0.00,0.00,0.00,1.00
X382,4209.0,0.007603,0.086872,0.00,0.00,0.00,0.00,1.00
X383,4209.0,0.001663,0.040752,0.00,0.00,0.00,0.00,1.00
X384,4209.0,0.000475,0.021796,0.00,0.00,0.00,0.00,1.00


In [307]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 378 entries, ID to X385
dtypes: float64(1), int64(369), object(8)
memory usage: 12.1+ MB


### Missing Value Treatment:

In [308]:
df.isna().sum().sum()

0

In [309]:
# There are no Missing Values in the Data.

### Checking For Numeric Continuous, Numeric Discrete, Numeric Categorical and Categorical Variables:

In [310]:
df.dtypes

ID        int64
y       float64
X0       object
X1       object
X2       object
         ...   
X380      int64
X382      int64
X383      int64
X384      int64
X385      int64
Length: 378, dtype: object

In [311]:
# For Object Data Types:

for cols in df.select_dtypes(include= "object"):
    print(cols)
    print(df[cols].nunique())
    print("\n")

X0
47


X1
27


X2
44


X3
7


X4
4


X5
29


X6
12


X8
25




In [312]:
# All Object Type Columns are Categorical.
# We will Apply Label Encoder to Them.

In [313]:
# Checking Numeric Data Features:

for cols in df.select_dtypes(exclude= "object"):
    print(cols)
    print(df[cols].nunique())

ID
4209
y
2545
X10
2
X11
1
X12
2
X13
2
X14
2
X15
2
X16
2
X17
2
X18
2
X19
2
X20
2
X21
2
X22
2
X23
2
X24
2
X26
2
X27
2
X28
2
X29
2
X30
2
X31
2
X32
2
X33
2
X34
2
X35
2
X36
2
X37
2
X38
2
X39
2
X40
2
X41
2
X42
2
X43
2
X44
2
X45
2
X46
2
X47
2
X48
2
X49
2
X50
2
X51
2
X52
2
X53
2
X54
2
X55
2
X56
2
X57
2
X58
2
X59
2
X60
2
X61
2
X62
2
X63
2
X64
2
X65
2
X66
2
X67
2
X68
2
X69
2
X70
2
X71
2
X73
2
X74
2
X75
2
X76
2
X77
2
X78
2
X79
2
X80
2
X81
2
X82
2
X83
2
X84
2
X85
2
X86
2
X87
2
X88
2
X89
2
X90
2
X91
2
X92
2
X93
1
X94
2
X95
2
X96
2
X97
2
X98
2
X99
2
X100
2
X101
2
X102
2
X103
2
X104
2
X105
2
X106
2
X107
1
X108
2
X109
2
X110
2
X111
2
X112
2
X113
2
X114
2
X115
2
X116
2
X117
2
X118
2
X119
2
X120
2
X122
2
X123
2
X124
2
X125
2
X126
2
X127
2
X128
2
X129
2
X130
2
X131
2
X132
2
X133
2
X134
2
X135
2
X136
2
X137
2
X138
2
X139
2
X140
2
X141
2
X142
2
X143
2
X144
2
X145
2
X146
2
X147
2
X148
2
X150
2
X151
2
X152
2
X153
2
X154
2
X155
2
X156
2
X157
2
X158
2
X159
2
X160
2
X161
2
X162
2
X163
2
X164
2
X165
2
X166
2
X1

In [314]:
# Except "ID" and "y", all other Numeric Columns are Binary Categorical.

# "ID" has unique values for all observations. We can either drop it or make it an Index Column.

# "y" is our Target Variable (Numeric Continuous).

In [315]:
###

df = df.set_index("ID")

In [316]:
df.head()

ID,y,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,130.81,k,v,at,a,d,u,j,o,0,...,0,0,1,0,0,0,0,0,0,0
6,88.53,k,t,av,e,d,y,l,o,0,...,1,0,0,0,0,0,0,0,0,0
7,76.26,az,w,n,c,d,x,j,x,0,...,0,0,0,0,0,0,1,0,0,0
9,80.62,az,t,n,f,d,x,l,e,0,...,0,0,0,0,0,0,0,0,0,0
13,78.02,az,v,n,f,d,h,d,n,0,...,0,0,0,0,0,0,0,0,0,0


### Applying Label Encoder on Object Data Type Columns:

In [317]:
###

le = LabelEncoder()

In [318]:
cat_col = list(df.select_dtypes(include="object").columns)

In [319]:
cat_col

['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']

In [320]:
for col in cat_col:
    df[col] = le.fit_transform(df[col])

In [321]:
df.head()

ID,y,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,130.81,32,23,17,0,3,24,9,14,0,...,0,0,1,0,0,0,0,0,0,0
6,88.53,32,21,19,4,3,28,11,14,0,...,1,0,0,0,0,0,0,0,0,0
7,76.26,20,24,34,2,3,27,9,23,0,...,0,0,0,0,0,0,1,0,0,0
9,80.62,20,21,34,5,3,27,11,4,0,...,0,0,0,0,0,0,0,0,0,0
13,78.02,20,23,34,5,3,12,3,13,0,...,0,0,0,0,0,0,0,0,0,0


In [322]:
df.select_dtypes(include= "object")

0
6
7
9
13
...
8405
8406
8412
8415
8417


In [323]:
# All Object Type Categorical Columns have been Label Encoded.
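The cells marked `###` have to be repeated on `test.csv`, but the loop above refits the single `le` object for every column, so after the loop it only remembers the categories of the last column, `X8`. One possible way to keep the mapping reusable is to store the category list per column. The sketch below uses `pd.Categorical` instead of `LabelEncoder` (a swapped-in technique, not what this notebook ran); the helper names `fit_category_maps` and `apply_category_maps` and the toy frames are hypothetical. Sorting the categories gives the same integer codes as `LabelEncoder` for labels seen in training, and labels that appear only in the forecast data become -1 instead of raising an error.

```python
import pandas as pd


def fit_category_maps(frame, columns):
    """Record the sorted category list of each column (LabelEncoder also sorts its classes)."""
    return {col: sorted(frame[col].unique()) for col in columns}


def apply_category_maps(frame, category_maps):
    """Encode each column with the stored categories; unseen labels become -1."""
    encoded = frame.copy()
    for col, cats in category_maps.items():
        encoded[col] = pd.Categorical(encoded[col], categories=cats).codes
    return encoded


# Toy stand-ins for train.csv and test.csv (hypothetical values).
train_toy = pd.DataFrame({"X0": ["k", "az", "k"], "X1": ["v", "t", "w"]})
test_toy = pd.DataFrame({"X0": ["az", "zz"], "X1": ["t", "v"]})

maps = fit_category_maps(train_toy, ["X0", "X1"])  # learn from train only
train_enc = apply_category_maps(train_toy, maps)
test_enc = apply_category_maps(test_toy, maps)     # reuse the same mapping
print(test_enc)
```

How to treat the -1 codes (drop the rows, map them to a frequent category, or leave them to the model) is a separate design decision that this sketch does not settle.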

### Checking Variance of All Features:

In [324]:
cols_to_drop = []

for cols in df.columns:
    
    if df[cols].var() == 0:
        cols_to_drop.append(cols)

In [325]:
cols_to_drop

['X11',
 'X93',
 'X107',
 'X233',
 'X235',
 'X268',
 'X289',
 'X290',
 'X293',
 'X297',
 'X330',
 'X347']

In [326]:
# These Columns have no Variance:

df[cols_to_drop].nunique()

X11     1
X93     1
X107    1
X233    1
X235    1
X268    1
X289    1
X290    1
X293    1
X297    1
X330    1
X347    1
dtype: int64

In [327]:
# We can Drop These Columns.

In [328]:
###

df = df.drop(cols_to_drop, axis= 1)

In [329]:
df.shape

(4209, 365)
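A column has zero variance exactly when it holds a single distinct value, so the same check can be written with `nunique()`. The sketch below, on hypothetical toy frames, also illustrates a point that matters for the `###` cells: the list of columns to drop is derived from the training data and then applied unchanged to the forecast data, so both frames keep identical features even if a column happens to vary in `test.csv`.

```python
import pandas as pd

# Toy frames standing in for the train and forecast data (hypothetical values).
train_toy = pd.DataFrame({"X10": [0, 1, 0], "X11": [0, 0, 0], "X12": [1, 1, 1]})
test_toy = pd.DataFrame({"X10": [1, 1, 0], "X11": [0, 1, 0], "X12": [1, 1, 1]})

# A column with a single distinct value has zero variance.
constant_cols = [col for col in train_toy.columns if train_toy[col].nunique() == 1]

# Drop the SAME list from both frames, even though X11 varies in the toy test data,
# so that the model sees identical feature columns at prediction time.
train_reduced = train_toy.drop(columns=constant_cols)
test_reduced = test_toy.drop(columns=constant_cols)
print(constant_cols)
```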

### Using the Ordinary Least Squares Method to Find Features Significant to the Target Variable:

In [330]:
x = df.drop("y", axis =1)
y = df["y"]

In [331]:
x.head()

ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,X12,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,32,23,17,0,3,24,9,14,0,0,...,0,0,1,0,0,0,0,0,0,0
6,32,21,19,4,3,28,11,14,0,0,...,1,0,0,0,0,0,0,0,0,0
7,20,24,34,2,3,27,9,23,0,0,...,0,0,0,0,0,0,1,0,0,0
9,20,21,34,5,3,27,11,4,0,0,...,0,0,0,0,0,0,0,0,0,0
13,20,23,34,5,3,12,3,13,0,0,...,0,0,0,0,0,0,0,0,0,0


In [332]:
x.shape

(4209, 364)

In [333]:
y.shape

(4209,)

In [334]:
y.head()

ID
0     130.81
6      88.53
7      76.26
9      80.62
13     78.02
Name: y, dtype: float64

In [335]:
ols_model = OLS(y, x)

In [336]:
results = ols_model.fit()

In [337]:
results.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.592
Model:,OLS,Adj. R-squared:,0.566
Method:,Least Squares,F-statistic:,22.15
Date:,"Fri, 25 Nov 2022",Prob (F-statistic):,0.0
Time:,12:02:00,Log-Likelihood:,-14775.0
No. Observations:,4209,AIC:,30070.0
Df Residuals:,3949,BIC:,31720.0
Df Model:,259,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
X0,0.0665,0.015,4.351,0.000,0.037,0.097
X1,-0.0418,0.038,-1.090,0.276,-0.117,0.033
X2,0.0082,0.052,0.156,0.876,-0.094,0.111
X3,-0.1232,0.126,-0.978,0.328,-0.370,0.124
X4,-0.1255,1.771,-0.071,0.944,-3.598,3.347
X5,-0.0532,0.017,-3.089,0.002,-0.087,-0.019
X6,0.0392,0.050,0.786,0.432,-0.059,0.137
X8,0.0013,0.020,0.067,0.946,-0.038,0.041
X10,4.7385,4.469,1.060,0.289,-4.024,13.501

0,1,2,3
Omnibus:,4021.263,Durbin-Watson:,1.978
Prob(Omnibus):,0.0,Jarque-Bera (JB):,484285.739
Skew:,4.226,Prob(JB):,0.0
Kurtosis:,54.865,Cond. No.,1.31e+16


In [338]:
# Many Columns have a P-Value > 0.05, i.e. they are not Statistically Significant in this OLS Fit.

# These Columns are Probably not Needed to Predict the Target.

# We can try Building the Model Without these Columns.

In [339]:
results.pvalues

X0      0.000014
X1      0.275892
X2      0.876181
X3      0.327985
X4      0.943520
          ...   
X380    0.827400
X382    0.942622
X383    0.015117
X384    0.980310
X385    0.204673
Length: 364, dtype: float64

In [340]:
results.pvalues[results.pvalues < 0.05]

X0      0.000014
X5      0.002022
X45     0.003403
X47     0.015723
X48     0.029690
          ...   
X365    0.011386
X368    0.028281
X375    0.000014
X376    0.008187
X383    0.015117
Length: 70, dtype: float64

In [341]:
(results.pvalues[results.pvalues < 0.05]).index

Index(['X0', 'X5', 'X45', 'X47', 'X48', 'X52', 'X54', 'X74', 'X75', 'X76',
       'X79', 'X95', 'X104', 'X111', 'X113', 'X115', 'X117', 'X118', 'X119',
       'X120', 'X123', 'X128', 'X130', 'X133', 'X134', 'X136', 'X142', 'X143',
       'X147', 'X152', 'X156', 'X157', 'X158', 'X163', 'X174', 'X178', 'X179',
       'X180', 'X186', 'X189', 'X194', 'X201', 'X204', 'X206', 'X209', 'X210',
       'X217', 'X222', 'X226', 'X236', 'X240', 'X249', 'X250', 'X263', 'X272',
       'X301', 'X310', 'X314', 'X315', 'X326', 'X336', 'X337', 'X338', 'X339',
       'X364', 'X365', 'X368', 'X375', 'X376', 'X383'],
      dtype='object')

In [342]:
significant_cols = list((results.pvalues[results.pvalues < 0.05]).index)

In [343]:
len(significant_cols)

70
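To make the selection rule concrete, here is a minimal sketch of where the p-values come from, on synthetic data rather than the notebook's features. It is not the `statsmodels` code used above: the coefficients are fitted with NumPy, an explicit intercept column is added (an assumption of this sketch), and the two-sided p-values use a normal approximation to the t distribution, which is close when there are thousands of residual degrees of freedom. All variable names are hypothetical.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
# Only the first column truly drives y; the other two are pure noise features.
y = 5.0 + 2.0 * X[:, 0] + rng.normal(size=n)

# Design matrix with an explicit intercept column.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Standard errors from the residual variance and (A'A)^-1.
resid = y - A @ beta
dof = n - A.shape[1]
sigma2 = resid @ resid / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))

# t statistics and two-sided p-values (normal approximation).
t_stats = beta / std_err
p_values = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])

significant = np.flatnonzero(p_values < 0.05)
print(p_values, significant)
```

One caveat when reading the list of 70 columns: with 364 features each tested at the 0.05 level, about 18 would be expected to pass the threshold by chance even if none of them were related to `y`, so the selection is a heuristic filter rather than proof of relevance.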

In [344]:
# We will Build our Model Using Only these 70 Features.

In [372]:
new_x = x[significant_cols].copy()

In [373]:
new_x.shape

(4209, 70)

In [374]:
new_x.head()

ID,X0,X5,X45,X47,X48,X52,X54,X74,X75,X76,...,X336,X337,X338,X339,X364,X365,X368,X375,X376,X383
0,32,24,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
6,32,28,0,0,0,0,0,1,0,0,...,1,1,0,0,0,0,0,1,0,0
7,20,27,0,0,0,0,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
9,20,27,0,0,0,0,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
13,20,12,0,0,0,0,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0


### Scaling:

- Most categorical variables take only the values 0 and 1, but the variables we label encoded take many integer values. Scale-sensitive steps, such as the PCA applied afterwards, could give those variables more weight simply because their values are greater than 1. So, we will scale the dataset.

In [376]:
sc= StandardScaler()

In [378]:
temp = sc.fit_transform(new_x)
scaled_x = pd.DataFrame(temp, index=new_x.index, columns= new_x.columns)
scaled_x.head()

ID,X0,X5,X45,X47,X48,X52,X54,X74,X75,X76,...,X336,X337,X338,X339,X364,X365,X368,X375,X376,X383
0,0.163012,1.292117,-0.58238,-0.114002,-0.15114,-0.210138,-0.213201,0.026707,-0.193562,-0.213201,...,-0.382008,-1.033588,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,-0.684167,-0.246447,-0.040815
6,0.163012,1.776974,-0.58238,-0.114002,-0.15114,-0.210138,-0.213201,0.026707,-0.193562,-0.213201,...,2.617749,0.967503,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,1.46163,-0.246447,-0.040815
7,-0.71056,1.65576,-0.58238,-0.114002,-0.15114,-0.210138,4.690416,0.026707,5.166313,4.690416,...,-0.382008,-1.033588,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,-0.684167,-0.246447,-0.040815
9,-0.71056,1.65576,-0.58238,-0.114002,-0.15114,-0.210138,4.690416,0.026707,-0.193562,4.690416,...,-0.382008,-1.033588,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,-0.684167,-0.246447,-0.040815
13,-0.71056,-0.162454,-0.58238,-0.114002,-0.15114,-0.210138,4.690416,0.026707,-0.193562,4.690416,...,-0.382008,-1.033588,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,-0.684167,-0.246447,-0.040815
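`StandardScaler` replaces each value with `(value - mean) / std`, using the population standard deviation (`ddof=0`). The notebook fits the scaler on all 4209 rows before splitting; a stricter evaluation would estimate the mean and standard deviation on the training rows only and reuse them on the held-out rows (with scikit-learn, `sc.fit(x_train)` followed by `sc.transform(x_test)`). The sketch below shows that idea by hand on a hypothetical toy column.

```python
import numpy as np

# Toy label-encoded column, split into "train" and "test" parts (hypothetical values).
train_col = np.array([32.0, 32.0, 20.0, 20.0, 20.0])
test_col = np.array([20.0, 32.0])

# Statistics are estimated on the training part only.
mean = train_col.mean()
std = train_col.std()  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train_col - mean) / std
test_scaled = (test_col - mean) / std  # reuse the training mean and std

print(train_scaled, test_scaled)
```

The same fitted statistics are what has to be applied to `test.csv` before predicting, rather than refitting a new scaler on the forecast data.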


### Dimensionality Reduction Using PCA:

In [379]:
pca = PCA(n_components= 0.99)

In [380]:
pca_result = pca.fit_transform(scaled_x)

In [381]:
pca_result

array([[ 7.38566730e-01, -6.06310781e-01,  8.42687902e-01, ...,
        -4.62081770e-01,  2.71190481e-02,  9.29520826e-01],
       [-8.77990014e-01, -5.24783033e-01,  1.46480126e+00, ...,
        -1.43627754e-01, -6.13814733e-02, -4.77159197e-02],
       [ 2.53123660e+00,  1.19378448e+01, -6.82449209e-02, ...,
         2.16114663e+00,  1.09025426e-01, -1.47043665e+00],
       ...,
       [-3.26179928e-01, -2.04293626e-01,  1.55053118e-01, ...,
         4.63004956e-01, -1.12608503e-01, -6.64108647e-02],
       [-9.78535047e-01, -2.41528421e-02,  3.96064695e-01, ...,
        -5.87925831e-01,  1.57913682e-01, -1.72809030e-02],
       [-1.18629764e+00, -1.15946417e+00, -5.89086902e-01, ...,
        -5.31059588e-01,  6.75673743e-03, -1.82552487e-01]])

In [383]:
pca_result.shape

(4209, 43)

In [384]:
pca.explained_variance_ratio_

array([0.15670056, 0.09494875, 0.05538113, 0.05306349, 0.04308408,
       0.04106288, 0.03696266, 0.0345388 , 0.0312952 , 0.0299603 ,
       0.02534326, 0.02456276, 0.02115288, 0.02014233, 0.0180675 ,
       0.01737717, 0.01612341, 0.01576077, 0.01525681, 0.0150391 ,
       0.01435188, 0.01425306, 0.01414013, 0.01403319, 0.01383682,
       0.01347849, 0.01306786, 0.0128261 , 0.01236534, 0.011939  ,
       0.01156939, 0.01028064, 0.0097577 , 0.00909089, 0.00900305,
       0.0082925 , 0.00732237, 0.00633129, 0.00478431, 0.00459924,
       0.00432563, 0.00414165, 0.00326093])

In [385]:
np.sum(pca.explained_variance_ratio_)

0.9928753048841577
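Passing a float such as `n_components=0.99` asks PCA to keep the smallest number of components whose cumulative explained variance ratio reaches that threshold, which is why 43 of the 70 scaled features remain here with a total of about 0.993. The sketch below reproduces that selection rule with plain NumPy on hypothetical toy data; it illustrates the rule and is not scikit-learn's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 rows, 5 features with very different spreads (hypothetical).
data = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 0.5, 0.1, 0.01])

# PCA works on centered data; the squared singular values give the component variances.
centered = data - data.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
explained_variance = singular_values**2 / (len(data) - 1)
ratio = explained_variance / explained_variance.sum()

# Smallest number of components whose cumulative ratio exceeds the threshold.
cumulative = np.cumsum(ratio)
threshold = 0.99
k = int(np.searchsorted(cumulative, threshold, side="right") + 1)
print(k, cumulative)
```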

### Train Test Split:

In [386]:
pca_result.shape

(4209, 43)

In [387]:
y.shape

(4209,)

In [388]:
x_train, x_test, y_train, y_test = train_test_split(pca_result, y, test_size= 0.2, random_state= 42)

In [390]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(3367, 43)
(842, 43)
(3367,)
(842,)


In [393]:
x_train

array([[ 0.01935025,  0.46122708, -0.07545592, ..., -0.08958546,
         0.20525837, -0.31718347],
       [ 0.50260866, -0.11270403,  0.42246925, ..., -0.53121479,
        -0.34511288, -0.69113605],
       [-0.45737264,  0.640639  , -0.80102688, ...,  0.20973414,
        -0.00425784, -0.10340277],
       ...,
       [-1.6878898 , -0.8662463 , -0.24779434, ...,  0.27311174,
        -0.09071486, -0.20574805],
       [-1.46444445, -0.142273  , -1.02934594, ...,  0.30099518,
        -0.09149417, -0.02939956],
       [-1.01640424, -1.32260175, -0.05691751, ..., -0.42583111,
         0.08116195, -0.12648478]])

In [394]:
y_train

ID
2011     88.96
3690     89.90
7597     92.59
322     108.84
3103    111.15
         ...  
6879    109.42
898      78.25
6214     92.18
7558     91.92
1712     87.71
Name: y, Length: 3367, dtype: float64

In [395]:
x_test

array([[-1.4129803 ,  0.3257658 ,  0.95520315, ..., -1.71545701,
         0.28846419, -0.02113956],
       [-0.72302668, -0.28694555,  0.2555645 , ..., -0.15784824,
         0.02061572,  0.35074088],
       [ 1.52635451, -0.96242301, -0.19938882, ...,  0.41793905,
        -0.00647589, -0.22711589],
       ...,
       [ 1.31727029, -0.89018872, -0.66790561, ...,  0.18478042,
         0.01967501, -0.32233412],
       [-0.45551154,  0.64007108, -0.80133001, ...,  0.20805844,
        -0.0041023 , -0.10448369],
       [-0.51235256, -1.54242536,  0.35736107, ..., -0.85119515,
         0.12623405,  0.33733054]])

In [396]:
y_test

ID
2140     97.94
310      96.41
4779    105.83
385      79.09
5180    108.69
         ...  
1280    113.68
7972     88.85
1810     89.60
7206     89.23
3922    109.49
Name: y, Length: 842, dtype: float64

### XGBRegressor Model on PCA Data:

In [397]:
XGBR_Model_1 = XGBRegressor(n_estimators=300, max_depth=3, learning_rate=0.02, random_state=100, min_child_weight=1,
                          colsample_bytree=0.4, alpha=10)

In [398]:
XGBR_Model_1.fit(x_train, y_train)

XGBRegressor(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.4, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.02, max_delta_step=0, max_depth=3,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=300, n_jobs=4, num_parallel_tree=1, random_state=100,
             reg_alpha=10, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [399]:
pred = XGBR_Model_1.predict(x_test)

In [400]:
r2_score(y_test, pred)

0.5309834096629831
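For reference, `r2_score` computes one minus the ratio of the residual sum of squares to the total sum of squares around the mean of `y_test`. RMSE is a useful companion because it is expressed in the same units as `y`. The sketch below computes both by hand on hypothetical toy values; the function name `r2_and_rmse` is an illustration, not part of any library.

```python
import numpy as np


def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) for two equal-length sequences of numbers."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = float(np.sqrt(ss_res / len(y_true)))
    return float(r2), rmse


# Toy test-bench times and predictions (hypothetical values).
r2, rmse = r2_and_rmse([100, 90, 110, 80], [98, 92, 105, 85])
print(r2, rmse)
```

An R² of about 0.53 therefore means the model removes roughly half of the squared error that a constant prediction at the mean would make.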

$\color{red}{\text{Grid Search Below Takes around 3 Hours to run as I have Checked Many Combinations of Parameters and Data Size is also Large.}}$

$\color{red}{\text{If you are Following this Notebook, it's better if you just take Best Parameters Found from Grid Search below and}}$
$\color{red}{\text{Use Those Parameters to Build Model rather than Running full Grid Search.}}$

In [404]:
xgbr = XGBRegressor()

In [403]:
param_dict = {
    'n_estimators': [100,200,300,500,1000],
    'max_depth': range(3, 6),
    'learning_rate': [0.01, 0.02, 0.1, 0.5, 0.75, 1],
    'colsample_bytree': [0.4,0.5,0.6, 0.7, 0.8, 0.9, 1]}

In [405]:
grid_xgbr = GridSearchCV(estimator= xgbr, param_grid= param_dict, cv= 5)
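The long runtime follows directly from the size of the grid. A quick count, sketched below with the same `param_dict` and `cv=5`, gives 630 parameter combinations and 3150 model fits; with the default `refit=True`, `GridSearchCV` then trains the best combination once more on the full training split. The names `cv_folds`, `n_candidates`, and `n_fits` are only for illustration.

```python
from math import prod

param_dict = {
    'n_estimators': [100, 200, 300, 500, 1000],
    'max_depth': range(3, 6),
    'learning_rate': [0.01, 0.02, 0.1, 0.5, 0.75, 1],
    'colsample_bytree': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
}
cv_folds = 5

# Every combination of values is tried, so the option counts multiply.
n_candidates = prod(len(values) for values in param_dict.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

Trimming a few values per parameter, or passing `n_jobs=-1` to `GridSearchCV` to use all CPU cores, are two common ways to shorten the search.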

In [406]:
grid_xgbr.fit(x_train, y_train)

GridSearchCV(cv=5,
             estimator=XGBRegressor(base_score=None, booster=None,
                                    colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None, gamma=None,
                                    gpu_id=None, importance_type='gain',
                                    interaction_constraints=None,
                                    learning_rate=None, max_delta_step=None,
                                    max_depth=None, min_child_weight=None,
                                    missing=nan, monotone_constraints=None,
                                    n_estimators=100, n_jobs=None,
                                    num_parallel_tree=None, random_state=None,
                                    reg_alpha=None, reg_lambda=None,
                                    scale_pos_weight=None, subsample=None,
                                    tree_method=None, validate_para

In [407]:
grid_xgbr.best_estimator_

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.4, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.02, max_delta_step=0, max_depth=3,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=500, n_jobs=4, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [408]:
grid_xgbr.best_score_

0.49330473622053717

In [409]:
r2_score(y_test, grid_xgbr.best_estimator_.predict(x_test))

0.5505314980737926

In [411]:
# XGB Regressor on PCA Data Doesn't Give a Good R² Score.

# We will try to build XGB Regressor on the 70 Features selected by OLS.

### XGBRegressor Model on Features selected using OLS:

In [546]:
# The 70 Features Selected Using OLS were scaled above into the scaled_x data frame (on which we performed PCA).
# We can Use the scaled_x data frame here.

In [412]:
scaled_x.head()

ID,X0,X5,X45,X47,X48,X52,X54,X74,X75,X76,...,X336,X337,X338,X339,X364,X365,X368,X375,X376,X383
0,0.163012,1.292117,-0.58238,-0.114002,-0.15114,-0.210138,-0.213201,0.026707,-0.193562,-0.213201,...,-0.382008,-1.033588,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,-0.684167,-0.246447,-0.040815
6,0.163012,1.776974,-0.58238,-0.114002,-0.15114,-0.210138,-0.213201,0.026707,-0.193562,-0.213201,...,2.617749,0.967503,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,1.46163,-0.246447,-0.040815
7,-0.71056,1.65576,-0.58238,-0.114002,-0.15114,-0.210138,4.690416,0.026707,5.166313,4.690416,...,-0.382008,-1.033588,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,-0.684167,-0.246447,-0.040815
9,-0.71056,1.65576,-0.58238,-0.114002,-0.15114,-0.210138,4.690416,0.026707,-0.193562,4.690416,...,-0.382008,-1.033588,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,-0.684167,-0.246447,-0.040815
13,-0.71056,-0.162454,-0.58238,-0.114002,-0.15114,-0.210138,4.690416,0.026707,-0.193562,4.690416,...,-0.382008,-1.033588,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,-0.684167,-0.246447,-0.040815


In [413]:
scaled_x.shape

(4209, 70)

In [414]:
y.head()

ID
0     130.81
6      88.53
7      76.26
9      80.62
13     78.02
Name: y, dtype: float64

In [415]:
y.shape

(4209,)

In [421]:
# Train Test Split:

In [422]:
x_train, x_test, y_train, y_test = train_test_split(scaled_x, y, test_size= 0.2, random_state= 42)

In [423]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(3367, 70)
(842, 70)
(3367,)
(842,)


In [427]:
x_train.head()

ID,X0,X5,X45,X47,X48,X52,X54,X74,X75,X76,...,X336,X337,X338,X339,X364,X365,X368,X375,X376,X383
2011,0.308607,-0.88974,-0.58238,-0.114002,-0.15114,-0.210138,-0.213201,0.026707,-0.193562,-0.213201,...,-0.382008,-1.033588,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,-0.684167,-0.246447,-0.040815
3690,0.381405,-1.132169,-0.58238,-0.114002,-0.15114,-0.210138,-0.213201,0.026707,-0.193562,-0.213201,...,2.617749,0.967503,-0.083293,-0.015416,-0.053471,-0.053471,3.865641,1.46163,-0.246447,-0.040815
7597,-0.200976,1.413332,-0.58238,-0.114002,-0.15114,-0.210138,-0.213201,0.026707,-0.193562,-0.213201,...,-0.382008,-1.033588,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,-0.684167,-0.246447,-0.040815
322,0.090214,-0.04124,-0.58238,-0.114002,-0.15114,-0.210138,-0.213201,0.026707,-0.193562,-0.213201,...,-0.382008,-1.033588,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,-0.684167,4.057675,-0.040815
3103,-0.783357,-1.253383,1.717092,-0.114002,-0.15114,-0.210138,-0.213201,0.026707,-0.193562,-0.213201,...,-0.382008,-1.033588,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,-0.684167,-0.246447,-0.040815


In [428]:
y_train.head()

ID
2011     88.96
3690     89.90
7597     92.59
322     108.84
3103    111.15
Name: y, dtype: float64

In [429]:
x_test.head()

ID,X0,X5,X45,X47,X48,X52,X54,X74,X75,X76,...,X336,X337,X338,X339,X364,X365,X368,X375,X376,X383
2140,-1.511334,-0.88974,-0.58238,-0.114002,-0.15114,-0.210138,-0.213201,0.026707,-0.193562,-0.213201,...,-0.382008,0.967503,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,-0.684167,-0.246447,-0.040815
310,-0.200976,-0.04124,-0.58238,-0.114002,-0.15114,-0.210138,-0.213201,0.026707,-0.193562,-0.213201,...,-0.382008,0.967503,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,-0.684167,-0.246447,-0.040815
4779,0.090214,0.564832,-0.58238,-0.114002,-0.15114,4.758788,-0.213201,0.026707,-0.193562,-0.213201,...,-0.382008,0.967503,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,1.46163,-0.246447,-0.040815
385,-0.71056,-0.04124,-0.58238,-0.114002,-0.15114,-0.210138,4.690416,0.026707,-0.193562,4.690416,...,-0.382008,-1.033588,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,-0.684167,-0.246447,-0.040815
5180,-1.584132,0.443617,-0.58238,-0.114002,-0.15114,-0.210138,-0.213201,0.026707,-0.193562,-0.213201,...,-0.382008,-1.033588,-0.083293,-0.015416,-0.053471,-0.053471,-0.258689,-0.684167,-0.246447,-0.040815


In [430]:
y_test.head()

ID
2140     97.94
310      96.41
4779    105.83
385      79.09
5180    108.69
Name: y, dtype: float64

In [431]:
XGBR_Model_2= XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.4, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.02, max_delta_step=0, max_depth=3,
             min_child_weight=1, missing=np.nan, monotone_constraints='()',
             n_estimators=500, n_jobs=4, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [432]:
XGBR_Model_2.fit(x_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.4, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.02, max_delta_step=0, max_depth=3,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=500, n_jobs=4, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [433]:
pred = XGBR_Model_2.predict(x_test)

In [434]:
r2_score(y_test, pred)

0.6007262412395912

In [435]:
# Still not a Good R² Score.

# We will Try to Build Model on all the features that were in data just before performing OLS.

### XGBRegressor Model on all Features:

In [547]:
# We are going to take all the features available after dropping features with Zero Variance.
# Those Features are stored in df dataframe, so we will use df here to create x.

In [438]:
x = df.drop("y", axis= 1)
y = df["y"]

In [439]:
x.head()

ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,X12,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,32,23,17,0,3,24,9,14,0,0,...,0,0,1,0,0,0,0,0,0,0
6,32,21,19,4,3,28,11,14,0,0,...,1,0,0,0,0,0,0,0,0,0
7,20,24,34,2,3,27,9,23,0,0,...,0,0,0,0,0,0,1,0,0,0
9,20,21,34,5,3,27,11,4,0,0,...,0,0,0,0,0,0,0,0,0,0
13,20,23,34,5,3,12,3,13,0,0,...,0,0,0,0,0,0,0,0,0,0


In [440]:
y.head()

ID
0     130.81
6      88.53
7      76.26
9      80.62
13     78.02
Name: y, dtype: float64

In [451]:
# Scaling:

In [452]:
sc = StandardScaler()

In [453]:
temp = sc.fit_transform(x)
scaled_x_all_features = pd.DataFrame(temp, index=x.index, columns= x.columns)
scaled_x_all_features.head()

ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,X12,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0.163012,1.393488,-0.028122,-1.67827,0.028938,1.292117,0.751787,0.339445,-0.116122,-0.284906,...,-0.684167,-0.246447,1.475332,-0.14528,-0.097952,-0.090243,-0.087527,-0.040815,-0.021804,-0.037783
6,0.163012,1.159021,0.155388,0.620969,0.028938,1.776974,1.437511,0.339445,-0.116122,-0.284906,...,1.46163,-0.246447,-0.677814,-0.14528,-0.097952,-0.090243,-0.087527,-0.040815,-0.021804,-0.037783
7,-0.71056,1.510721,1.531709,-0.52865,0.028938,1.65576,0.751787,1.618389,-0.116122,-0.284906,...,-0.684167,-0.246447,-0.677814,-0.14528,-0.097952,-0.090243,11.425027,-0.040815,-0.021804,-0.037783
9,-0.71056,1.159021,1.531709,1.195779,0.028938,1.65576,1.437511,-1.081605,-0.116122,-0.284906,...,-0.684167,-0.246447,-0.677814,-0.14528,-0.097952,-0.090243,-0.087527,-0.040815,-0.021804,-0.037783
13,-0.71056,1.393488,1.531709,1.195779,0.028938,-0.162454,-1.305384,0.19734,-0.116122,-0.284906,...,-0.684167,-0.246447,-0.677814,-0.14528,-0.097952,-0.090243,-0.087527,-0.040815,-0.021804,-0.037783
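
As a rough illustration of what `StandardScaler` does to each column, the sketch below reproduces its output by hand on a small made-up matrix (the values are hypothetical, not the Mercedes data). It also hints at why a rarely set binary column can show large scaled values such as 11.425027 above: when almost every row is 0, the standard deviation is small, so a single 1 sits many standard deviations from the mean.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values): one label-encoded column
# and one binary column.
toy = np.array([[32.0, 0.0],
                [20.0, 1.0],
                [20.0, 0.0],
                [45.0, 0.0]])

scaled_by_sklearn = StandardScaler().fit_transform(toy)

# Same computation by hand: subtract the column mean and divide by the
# population standard deviation (ddof=0, NumPy's default).
scaled_by_hand = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled_by_sklearn, scaled_by_hand))
```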


In [454]:
# Train Test Split:

x_train, x_test, y_train, y_test = train_test_split(scaled_x_all_features, y, test_size= 0.2, random_state= 42)

In [455]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(3367, 364)
(842, 364)
(3367,)
(842,)


In [456]:
x_train.head()

ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,X12,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
2011,0.308607,1.393488,-0.119876,-0.52865,0.028938,-0.88974,1.094649,1.618389,-0.116122,-0.284906,...,-0.684167,-0.246447,1.475332,-0.14528,-0.097952,-0.090243,-0.087527,-0.040815,-0.021804,-0.037783
3690,0.381405,1.041787,-0.119876,0.04616,0.028938,-1.132169,-0.276798,0.907865,-0.116122,-0.284906,...,1.46163,-0.246447,-0.677814,-0.14528,-0.097952,-0.090243,-0.087527,-0.040815,-0.021804,-0.037783
7597,-0.200976,-0.833948,1.439954,-0.52865,0.028938,1.413332,0.408925,-1.081605,-0.116122,-0.284906,...,-0.684167,-0.246447,1.475332,-0.14528,-0.097952,-0.090243,-0.087527,-0.040815,-0.021804,-0.037783
322,0.090214,-1.185648,0.889426,0.04616,0.028938,-0.04124,0.408925,-1.081605,-0.116122,-0.284906,...,-0.684167,4.057675,-0.677814,-0.14528,-0.097952,-0.090243,-0.087527,-0.040815,-0.021804,-0.037783
3103,-0.783357,-0.130547,-0.119876,-0.52865,0.028938,-1.253383,1.437511,-0.228975,-0.116122,-0.284906,...,-0.684167,-0.246447,1.475332,-0.14528,-0.097952,-0.090243,-0.087527,-0.040815,-0.021804,-0.037783


In [457]:
y_train.head()

ID
2011     88.96
3690     89.90
7597     92.59
322     108.84
3103    111.15
Name: y, dtype: float64

In [458]:
x_test.head()

ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,X12,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
2140,-1.511334,0.572854,-0.945669,1.195779,0.028938,-0.88974,0.751787,-0.08687,-0.116122,-0.284906,...,-0.684167,-0.246447,-0.677814,-0.14528,-0.097952,-0.090243,-0.087527,-0.040815,-0.021804,-0.037783
310,-0.200976,0.221153,-1.312688,1.195779,0.028938,-0.04124,0.408925,1.476285,-0.116122,-0.284906,...,-0.684167,-0.246447,-0.677814,-0.14528,-0.097952,-0.090243,-0.087527,-0.040815,-0.021804,-0.037783
4779,0.090214,-1.185648,0.338897,-0.52865,0.028938,0.564832,1.437511,0.339445,8.611662,-0.284906,...,1.46163,-0.246447,-0.677814,-0.14528,-0.097952,-0.090243,-0.087527,-0.040815,-0.021804,-0.037783
385,-0.71056,1.627955,0.430652,-0.52865,0.028938,-0.04124,0.751787,-0.08687,-0.116122,-0.284906,...,-0.684167,-0.246447,-0.677814,-0.14528,-0.097952,-0.090243,11.425027,-0.040815,-0.021804,-0.037783
5180,-1.584132,1.393488,-0.853914,0.04616,0.028938,0.443617,0.408925,0.76576,-0.116122,-0.284906,...,-0.684167,-0.246447,-0.677814,-0.14528,-0.097952,-0.090243,-0.087527,-0.040815,-0.021804,-0.037783


In [459]:
y_test.head()

ID
2140     97.94
310      96.41
4779    105.83
385      79.09
5180    108.69
Name: y, dtype: float64

In [460]:
XGBR_Model_3= XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.4, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.02, max_delta_step=0, max_depth=3,
             min_child_weight=1, missing=np.nan, monotone_constraints='()',
             n_estimators=500, n_jobs=4, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [461]:
XGBR_Model_3.fit(x_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.4, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.02, max_delta_step=0, max_depth=3,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=500, n_jobs=4, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [462]:
pred = XGBR_Model_3.predict(x_test)

In [463]:
r2_score(y_test, pred)

0.5971265803872543
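
The number returned by `r2_score` is the coefficient of determination, not a classification accuracy. A minimal sketch of the calculation follows. The `y_true` values are the first five `y_test` values shown above, while `y_pred` is made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([97.94, 96.41, 105.83, 79.09, 108.69])
y_pred = np.array([95.0, 98.0, 104.0, 85.0, 110.0])  # hypothetical predictions

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# A model that always predicts the mean of y_true scores 0.
r2_of_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```

A score of about 0.60 therefore means the model explains roughly 60% of the variance in test time on the hold-out rows.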

In [464]:
# Again, not a very good score.

# Next, we will build an XGBRegressor on all features after converting the categorical variables into dummy variables.

### XGBRegressor Model on all Features after Getting Dummy Variables for Categorical Features:

In [548]:
# In the model above, x held all the features, which we then scaled to get the scaled_x_all_features data frame.
# Here we use the unscaled x to create dummy variables from the categorical features.

In [504]:
x.head()

ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,X12,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,32,23,17,0,3,24,9,14,0,0,...,0,0,1,0,0,0,0,0,0,0
6,32,21,19,4,3,28,11,14,0,0,...,1,0,0,0,0,0,0,0,0,0
7,20,24,34,2,3,27,9,23,0,0,...,0,0,0,0,0,0,1,0,0,0
9,20,21,34,5,3,27,11,4,0,0,...,0,0,0,0,0,0,0,0,0,0
13,20,23,34,5,3,12,3,13,0,0,...,0,0,0,0,0,0,0,0,0,0


In [505]:
columns_to_dummy = []
for col in x.columns:
    if x[col].nunique() > 2:
        print(col)
        columns_to_dummy.append(col)

X0
X1
X2
X3
X4
X5
X6
X8


In [506]:
# Only these columns have more than two distinct values; the other features are already binary.

In [507]:
columns_to_dummy

['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']

In [508]:
x_dummy = pd.get_dummies(x, columns= columns_to_dummy, drop_first= True)

In [509]:
x_dummy.head()

ID,X10,X12,X13,X14,X15,X16,X17,X18,X19,X20,...,X8_15,X8_16,X8_17,X8_18,X8_19,X8_20,X8_21,X8_22,X8_23,X8_24
0,0,0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
13,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
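
To make the effect of `pd.get_dummies(..., drop_first=True)` concrete, here is a small sketch on a made-up frame (the column names mimic the dataset, but the values are hypothetical). Each multi-valued column is replaced by one indicator column per level except the first, so the dropped level is represented by all of its indicators being 0.

```python
import pandas as pd

toy = pd.DataFrame({"X3": [0, 4, 2, 5, 5], "X10": [0, 0, 1, 0, 0]})

# Same selection rule as above: columns with more than two distinct values.
multi_valued = [col for col in toy.columns if toy[col].nunique() > 2]

toy_dummy = pd.get_dummies(toy, columns=multi_valued, drop_first=True)

print(multi_valued)
print(list(toy_dummy.columns))
```

Here `X3` has four levels (0, 2, 4, 5), so it becomes three indicator columns, and the untouched binary column `X10` is kept as it is.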


In [510]:
# We don't need to scale this data, as all columns are now binary.
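
A related point is that tree-based models split on feature thresholds, so they are generally insensitive to standardizing the inputs. The sketch below illustrates this with a single scikit-learn `DecisionTreeRegressor` on synthetic data. This is a stand-in chosen for illustration, not the XGBoost model used in this notebook, and the data is made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic integer-coded features and a noisy target (hypothetical).
X = rng.integers(0, 30, size=(200, 5)).astype(float)
y = 3.0 * (X[:, 0] > 15) + X[:, 1] + rng.normal(scale=0.1, size=200)

X_scaled = StandardScaler().fit_transform(X)

tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_scaled, y)

pred_raw = tree_raw.predict(X)
pred_scaled = tree_scaled.predict(X_scaled)

print(np.allclose(pred_raw, pred_scaled))
```

This fits the results in this notebook, where the scaled and unscaled XGBRegressor runs score almost the same.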

In [511]:
# Train Test Split:

x_train, x_test, y_train, y_test = train_test_split(x_dummy, y, test_size= 0.2, random_state= 42)

In [512]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(3367, 543)
(842, 543)
(3367,)
(842,)


In [513]:
x_train.head()

ID,X10,X12,X13,X14,X15,X16,X17,X18,X19,X20,...,X8_15,X8_16,X8_17,X8_18,X8_19,X8_20,X8_21,X8_22,X8_23,X8_24
2011,0,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3690,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
7597,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
322,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3103,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [514]:
y_train.head()

ID
2011     88.96
3690     89.90
7597     92.59
322     108.84
3103    111.15
Name: y, dtype: float64

In [515]:
x_test.head()

ID,X10,X12,X13,X14,X15,X16,X17,X18,X19,X20,...,X8_15,X8_16,X8_17,X8_18,X8_19,X8_20,X8_21,X8_22,X8_23,X8_24
2140,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
310,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
4779,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
385,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5180,0,0,0,0,0,0,0,0,0,1,...,0,0,1,0,0,0,0,0,0,0


In [516]:
y_test.head()

ID
2140     97.94
310      96.41
4779    105.83
385      79.09
5180    108.69
Name: y, dtype: float64

In [517]:
XGBR_Model_4= XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.4, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.02, max_delta_step=0, max_depth=3,
             min_child_weight=1, missing=np.nan, monotone_constraints='()',
             n_estimators=500, n_jobs=4, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [518]:
XGBR_Model_4.fit(x_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.4, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.02, max_delta_step=0, max_depth=3,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=500, n_jobs=4, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [519]:
pred = XGBR_Model_4.predict(x_test)

In [520]:
r2_score(y_test, pred)

0.5982197068074213

In [521]:
# Let's see if we can use PCA on the data with dummy variables to reduce dimensions and still get an R² score close to this.

In [522]:
# PCA:

In [528]:
pca_2 = PCA(n_components= 0.99)

In [529]:
pca_2_result = pca_2.fit_transform(x_dummy)

In [530]:
pca_2_result

array([[ 8.17092193e-01, -1.35926461e+00,  1.93980323e+00, ...,
        -1.22764591e-01,  2.80940435e-02, -2.22326646e-01],
       [-1.01498528e-01, -1.29439536e+00, -9.14891423e-02, ...,
         2.30191387e-01,  1.91058288e-03, -1.77685773e-01],
       [-6.68523927e-01, -2.43478060e+00,  1.69808551e+00, ...,
         3.66251606e-03, -8.76945321e-02, -1.19278619e-01],
       ...,
       [-1.03441762e+00, -4.84180675e-01,  1.80068794e+00, ...,
        -3.64025730e-02, -1.41229392e-02, -2.88435861e-02],
       [ 3.90266235e-01, -1.17253942e+00, -3.10414176e+00, ...,
        -8.96313589e-02,  3.32681750e-02,  1.99912401e-02],
       [ 9.63564876e-01, -9.00336985e-01, -9.06489023e-01, ...,
        -2.70507554e-02, -2.92804705e-02, -2.50616125e-02]])

In [531]:
pca_2_result.shape

(4209, 218)

In [533]:
np.sum(pca_2.explained_variance_ratio_)

0.9901745951185926
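
Passing a float between 0 and 1 as `n_components` asks scikit-learn's `PCA` to keep the smallest number of components whose cumulative explained variance ratio exceeds that fraction, which is why 218 components sum to just over 0.99 here. The sketch below checks that rule on synthetic data (the data and variable names are hypothetical, and `svd_solver="full"` is set explicitly as an assumption to keep both fits on the same solver).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Toy data: 10 correlated features built from 3 latent factors plus noise.
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
toy = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

# Full decomposition, to inspect the cumulative explained variance.
pca_full = PCA(svd_solver="full").fit(toy)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Fractional n_components: keep just enough components to pass 99%.
pca_99 = PCA(n_components=0.99, svd_solver="full").fit(toy)
expected_k = int(np.searchsorted(cumulative, 0.99, side="right")) + 1

print(pca_99.n_components_, expected_k)
```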

In [534]:
### Train Test Split:

pca_2_result.shape

(4209, 218)

In [535]:
y.shape

(4209,)

In [536]:
x_train, x_test, y_train, y_test = train_test_split(pca_2_result, y, test_size= 0.2, random_state= 42)

In [537]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(3367, 218)
(842, 218)
(3367,)
(842,)


In [538]:
x_train

array([[-1.4278624 ,  0.64548255,  2.29645895, ..., -0.138267  ,
        -0.11193458,  0.03215213],
       [-1.93015669, -0.29011213,  0.59183733, ...,  0.29790631,
         0.10102458,  0.2768754 ],
       [ 0.85510938, -0.80627092,  2.2602882 , ..., -0.0145267 ,
         0.02536868, -0.00437477],
       ...,
       [ 1.17073239, -1.38305776, -1.77050338, ...,  0.05092168,
        -0.16211578, -0.06952964],
       [ 0.53356227, -0.69417238, -2.12762357, ...,  0.06070186,
         0.05843849,  0.0203837 ],
       [ 2.23176552, -0.68529378, -0.45970559, ...,  0.02002003,
        -0.11532239, -0.00550201]])

In [539]:
x_test

array([[ 1.34039212, -1.89903008, -2.6847891 , ..., -0.07946088,
        -0.0374303 , -0.03027228],
       [-0.4169664 , -2.80610142,  0.1241259 , ..., -0.02693305,
        -0.06557611, -0.00796897],
       [ 0.68685357,  0.35114241, -0.25831204, ..., -0.07884676,
        -0.07603062,  0.06455336],
       ...,
       [ 1.01770431, -0.18906457, -0.41868967, ..., -0.0640775 ,
        -0.05065421, -0.02059913],
       [-0.11958256, -1.23867553,  2.38847224, ...,  0.01667588,
        -0.02003622,  0.00297176],
       [-1.29360794,  1.65573016, -1.33224764, ...,  0.11602609,
        -0.11917746,  0.11274231]])

In [540]:
y_train

ID
2011     88.96
3690     89.90
7597     92.59
322     108.84
3103    111.15
         ...  
6879    109.42
898      78.25
6214     92.18
7558     91.92
1712     87.71
Name: y, Length: 3367, dtype: float64

In [541]:
y_test

ID
2140     97.94
310      96.41
4779    105.83
385      79.09
5180    108.69
         ...  
1280    113.68
7972     88.85
1810     89.60
7206     89.23
3922    109.49
Name: y, Length: 842, dtype: float64

In [542]:
XGBR_Model_5= XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.4, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.02, max_delta_step=0, max_depth=3,
             min_child_weight=1, missing=np.nan, monotone_constraints='()',
             n_estimators=500, n_jobs=4, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [543]:
XGBR_Model_5.fit(x_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.4, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.02, max_delta_step=0, max_depth=3,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=500, n_jobs=4, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [544]:
pred = XGBR_Model_5.predict(x_test)

In [545]:
r2_score(y_test, pred)

0.5205444142381193

### Results

The figures below are R² scores on the 20% hold-out split, expressed as percentages. They are not classification accuracies.

- 55.05% for the XGBRegressor model built on 43 features found from PCA (PCA on the 70 significant features obtained from OLS).


- 60.07% for the XGBRegressor model built on the 70 significant features obtained from OLS.


- 59.71% for the XGBRegressor model built on all 364 features of the data (scaled features).


- 59.82% for the XGBRegressor model built on all features after getting dummy variables for the categorical columns (543 features in total).


- 52.05% for the XGBRegressor model built on 218 features found from PCA (PCA on all features after getting dummy variables for the categorical columns).

**As we can see, XGBR_Model_2 (built on the 70 significant features) has the best R² score and also uses fewer features than the original data.**

**So, we will use XGBR_Model_2 to make predictions on the unseen test data.**

**Note: Every data cleaning and manipulation step that we performed on the data used to build XGBR_Model_2 must also be performed on the unseen test data, so that it has the same form as the training data for XGBR_Model_2.**
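
One way to reduce the risk of the training and prediction steps drifting apart is to bundle the preprocessing and the model into a single scikit-learn `Pipeline`. The sketch below is only an illustration under stated assumptions: the data is synthetic, the column names are borrowed from the dataset for readability, and `GradientBoostingRegressor` stands in for `XGBRegressor` so that the example depends on scikit-learn alone.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic binary features and target (hypothetical).
train_X = pd.DataFrame(rng.integers(0, 2, size=(100, 4)),
                       columns=["X45", "X47", "X48", "X52"])
train_y = 90 + 10 * train_X["X45"] + rng.normal(scale=1.0, size=100)
unseen_X = pd.DataFrame(rng.integers(0, 2, size=(5, 4)),
                        columns=train_X.columns)

model = Pipeline([
    ("scale", StandardScaler()),  # fitted on the training data only
    ("regress", GradientBoostingRegressor(n_estimators=50, max_depth=3,
                                          random_state=0)),
])
model.fit(train_X, train_y)

# predict() applies the already-fitted scaler, then the regressor.
unseen_pred = model.predict(unseen_X)
print(unseen_pred.shape)
```

With this layout the scaler statistics always come from the training data, so there is no separate scaler object to refit or overwrite by accident.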

## Prediction on Unseen Test Data:

In [557]:
unseen_test = pd.read_csv("test.csv")

In [558]:
unseen_test.head()

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,1,az,v,n,f,d,t,a,w,0,...,0,0,0,1,0,0,0,0,0,0
1,2,t,b,ai,a,d,b,g,y,0,...,0,0,1,0,0,0,0,0,0,0
2,3,az,v,as,f,d,a,j,j,0,...,0,0,0,1,0,0,0,0,0,0
3,4,az,l,n,f,d,z,l,n,0,...,0,0,0,1,0,0,0,0,0,0
4,5,w,s,as,c,d,y,i,m,0,...,1,0,0,0,0,0,0,0,0,0


In [559]:
unseen_test.dtypes

ID       int64
X0      object
X1      object
X2      object
X3      object
         ...  
X380     int64
X382     int64
X383     int64
X384     int64
X385     int64
Length: 377, dtype: object

#### 1) Missing Value Treatment:

In [560]:
unseen_test.isna().sum().sum()

0

#### 2) Setting "ID" Column as Data Frame Index:

In [561]:
unseen_test = unseen_test.set_index("ID")

In [562]:
unseen_test.head()

ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,X11,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
1,az,v,n,f,d,t,a,w,0,0,...,0,0,0,1,0,0,0,0,0,0
2,t,b,ai,a,d,b,g,y,0,0,...,0,0,1,0,0,0,0,0,0,0
3,az,v,as,f,d,a,j,j,0,0,...,0,0,0,1,0,0,0,0,0,0
4,az,l,n,f,d,z,l,n,0,0,...,0,0,0,1,0,0,0,0,0,0
5,w,s,as,c,d,y,i,m,0,0,...,1,0,0,0,0,0,0,0,0,0


#### 3) Applying Label Encoder on Categorical Features:

In [566]:
# We already have the "cat_col" list, which holds the names of the categorical columns to encode.

# We also have "le", the LabelEncoder object.

In [564]:
cat_col

['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']

In [567]:
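# Caveat: fit_transform re-learns the letter-to-integer mapping from the new data.
# If the categories present here differ from those in the training data, the same letter can receive a different code.
for col in cat_col:
    unseen_test[col] = le.fit_transform(unseen_test[col])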

In [568]:
unseen_test.dtypes

X0      int32
X1      int32
X2      int32
X3      int32
X4      int32
        ...  
X380    int64
X382    int64
X383    int64
X384    int64
X385    int64
Length: 376, dtype: object
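
A single `LabelEncoder` that is refitted for every column cannot be reused later, because it only remembers the last column it saw. As a hedged alternative (not the approach used in this notebook), scikit-learn's `OrdinalEncoder` can learn the mapping for all categorical columns from the training data once and then apply the same mapping to new data, with a placeholder code for categories it has never seen. The toy frames below are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical columns; "zz" appears only in the new data.
train_cat = pd.DataFrame({"X0": ["az", "t", "w", "az"],
                          "X3": ["f", "a", "c", "f"]})
test_cat = pd.DataFrame({"X0": ["t", "az", "zz"],
                         "X3": ["c", "f", "a"]})

# Unknown categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_cat)  # mapping learned from the training data only

train_codes = encoder.transform(train_cat)
test_codes = encoder.transform(test_cat)  # same mapping reused

print(test_codes.tolist())
```

Because the mapping is fixed at fit time, a given letter always receives the same integer in both the training and the new data.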

#### 4) Dropping Columns with 0 Variance:

In [569]:
cols_to_drop

['X11',
 'X93',
 'X107',
 'X233',
 'X235',
 'X268',
 'X289',
 'X290',
 'X293',
 'X297',
 'X330',
 'X347']

In [570]:
# We dropped these columns from the training data because they had no variance there.
# We have to drop these columns from the new data too, even if they have variance in the new data.

In [572]:
unseen_test = unseen_test.drop(cols_to_drop, axis= 1)

In [573]:
unseen_test.head()

ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,X12,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
1,21,23,34,5,3,26,0,22,0,0,...,0,0,0,1,0,0,0,0,0,0
2,42,3,8,0,3,9,6,24,0,0,...,0,0,1,0,0,0,0,0,0,0
3,21,23,17,5,3,0,9,9,0,0,...,0,0,0,1,0,0,0,0,0,0
4,21,13,34,5,3,31,11,13,0,0,...,0,0,0,1,0,0,0,0,0,0
5,45,20,17,2,3,30,8,12,0,0,...,1,0,0,0,0,0,0,0,0,0


In [574]:
unseen_test.shape

(4209, 364)
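
The idea in this step is that the list of columns to drop is derived from the training data and then applied unchanged to any new data, so both end up with identical columns. A minimal sketch with made-up frames (names prefixed with `toy_` are hypothetical and not part of the notebook):

```python
import pandas as pd

toy_train = pd.DataFrame({"X10": [0, 1, 0, 0],
                          "X11": [0, 0, 0, 0],   # constant in training data
                          "X12": [1, 0, 1, 0]})
toy_test = pd.DataFrame({"X10": [0, 0, 1],
                         "X11": [0, 1, 0],       # varies in the new data
                         "X12": [0, 0, 1]})

# Zero-variance columns are identified on the training data only.
toy_cols_to_drop = [col for col in toy_train.columns
                    if toy_train[col].var() == 0]

toy_train_kept = toy_train.drop(columns=toy_cols_to_drop)
toy_test_kept = toy_test.drop(columns=toy_cols_to_drop)

print(toy_cols_to_drop)
```

`X11` is dropped from the new data even though it varies there, because the model never saw that column during training.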

#### 5) Keeping only those Features which were found Significant to Target Variable by OLS:

In [575]:
significant_cols

['X0',
 'X5',
 'X45',
 'X47',
 'X48',
 'X52',
 'X54',
 'X74',
 'X75',
 'X76',
 'X79',
 'X95',
 'X104',
 'X111',
 'X113',
 'X115',
 'X117',
 'X118',
 'X119',
 'X120',
 'X123',
 'X128',
 'X130',
 'X133',
 'X134',
 'X136',
 'X142',
 'X143',
 'X147',
 'X152',
 'X156',
 'X157',
 'X158',
 'X163',
 'X174',
 'X178',
 'X179',
 'X180',
 'X186',
 'X189',
 'X194',
 'X201',
 'X204',
 'X206',
 'X209',
 'X210',
 'X217',
 'X222',
 'X226',
 'X236',
 'X240',
 'X249',
 'X250',
 'X263',
 'X272',
 'X301',
 'X310',
 'X314',
 'X315',
 'X326',
 'X336',
 'X337',
 'X338',
 'X339',
 'X364',
 'X365',
 'X368',
 'X375',
 'X376',
 'X383']

In [576]:
len(significant_cols)

70

In [577]:
unseen_test_new = unseen_test[significant_cols].copy()

In [578]:
unseen_test_new.head()

ID,X0,X5,X45,X47,X48,X52,X54,X74,X75,X76,...,X336,X337,X338,X339,X364,X365,X368,X375,X376,X383
1,21,26,0,0,0,0,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
2,42,9,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
3,21,0,0,0,0,0,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,21,31,0,0,0,0,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
5,45,30,0,0,0,0,0,1,0,0,...,1,1,0,0,0,0,0,1,0,0


In [579]:
unseen_test_new.shape

(4209, 70)

#### 6) Scaling:

**Note: We always have to use the same scaler object that we used on the training data to transform new data.**

**We should not fit a new scaler on the new data.**

**As we have overwritten the same scaler object multiple times during model building, we will create a scaler object and fit it on the same training data as before.**

**The scaled data used in XGBR_Model_2 was produced by scaling the new_x data frame.**

**So, we will use the same data frame to fit the scaler object and then transform our new data using that scaler.**
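
The difference between reusing a fitted scaler and refitting one can be seen on a tiny example. In the sketch below (hypothetical values and variable names), `transform` applies the mean and standard deviation learned from the training data, whereas `fit_transform` on the new data would replace them with the new data's own statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[32.0], [20.0], [20.0], [45.0]])  # hypothetical training column
new = np.array([[21.0], [42.0], [21.0]])            # hypothetical new data

scaler = StandardScaler().fit(train)

# Reuses the training mean and standard deviation.
new_scaled = scaler.transform(new)

# Refits on the new data: different statistics, different values.
new_refit = StandardScaler().fit_transform(new)

print(new_scaled.ravel())
print(new_refit.ravel())
```

Only the first variant gives the model inputs on the same scale it was trained on.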

In [580]:
new_x.head()

ID,X0,X5,X45,X47,X48,X52,X54,X74,X75,X76,...,X336,X337,X338,X339,X364,X365,X368,X375,X376,X383
0,32,24,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
6,32,28,0,0,0,0,0,1,0,0,...,1,1,0,0,0,0,0,1,0,0
7,20,27,0,0,0,0,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
9,20,27,0,0,0,0,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
13,20,12,0,0,0,0,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0


In [581]:
scalar = StandardScaler()

In [582]:
scalar.fit(new_x)

StandardScaler()

In [583]:
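# Caveat: fit_transform refits the scaler on the new data, discarding the statistics learned from new_x above.
# To reuse the training statistics as the note describes, call scalar.transform(unseen_test_new) instead.
temp = scalar.fit_transform(unseen_test_new)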
unseen_scaled = pd.DataFrame(temp, index=unseen_test_new.index, columns= unseen_test_new.columns)
unseen_scaled.head()

ID,X0,X5,X45,X47,X48,X52,X54,X74,X75,X76,...,X336,X337,X338,X339,X364,X365,X368,X375,X376,X383
1,-0.625211,1.266652,-0.580551,-0.106267,-0.156003,-0.210753,4.904444,0.046291,-0.200715,4.904444,...,-0.382416,-1.055468,-0.088895,-0.015416,-0.059804,-0.057769,-0.259733,-0.69542,-0.228583,-0.021804
2,0.754609,-0.695011,1.722502,-0.106267,-0.156003,-0.210753,-0.203897,0.046291,-0.200715,-0.203897,...,-0.382416,-1.055468,-0.088895,-0.015416,-0.059804,-0.057769,3.850105,-0.69542,-0.228583,-0.021804
3,-0.625211,-1.733538,-0.580551,-0.106267,-0.156003,-0.210753,4.904444,0.046291,-0.200715,4.904444,...,-0.382416,-1.055468,-0.088895,-0.015416,-0.059804,-0.057769,-0.259733,-0.69542,-0.228583,-0.021804
4,-0.625211,1.843611,-0.580551,-0.106267,-0.156003,-0.210753,4.904444,0.046291,-0.200715,4.904444,...,-0.382416,-1.055468,-0.088895,-0.015416,-0.059804,-0.057769,-0.259733,-0.69542,-0.228583,-0.021804
5,0.951726,1.728219,-0.580551,-0.106267,-0.156003,-0.210753,-0.203897,0.046291,-0.200715,-0.203897,...,2.614955,0.947447,-0.088895,-0.015416,-0.059804,-0.057769,-0.259733,1.437979,-0.228583,-0.021804


#### 7) Using XGBR_Model_2 to Make Predictions on New Data:

In [584]:
XGBR_Model_2.get_params

<bound method XGBModel.get_params of XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.4, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.02, max_delta_step=0, max_depth=3,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=500, n_jobs=4, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)>

In [585]:
Predictions = XGBR_Model_2.predict(unseen_scaled)

In [586]:
Predictions

array([ 99.57243 , 113.51955 ,  96.249695, ...,  92.27821 , 110.89687 ,
        91.844406], dtype=float32)

In [587]:
Predictions.shape

(4209,)

#### 8) Adding Predictions to the Original Data in a Data Frame:

In [588]:
# Reload the unseen test data from file, since we modified its columns above.

In [589]:
new_data = pd.read_csv("test.csv")

In [590]:
new_data.head()

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,1,az,v,n,f,d,t,a,w,0,...,0,0,0,1,0,0,0,0,0,0
1,2,t,b,ai,a,d,b,g,y,0,...,0,0,1,0,0,0,0,0,0,0
2,3,az,v,as,f,d,a,j,j,0,...,0,0,0,1,0,0,0,0,0,0
3,4,az,l,n,f,d,z,l,n,0,...,0,0,0,1,0,0,0,0,0,0
4,5,w,s,as,c,d,y,i,m,0,...,1,0,0,0,0,0,0,0,0,0


In [591]:
new_data["y"] = Predictions

In [592]:
new_data.head()

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X376,X377,X378,X379,X380,X382,X383,X384,X385,y
0,1,az,v,n,f,d,t,a,w,0,...,0,0,1,0,0,0,0,0,0,99.572433
1,2,t,b,ai,a,d,b,g,y,0,...,0,1,0,0,0,0,0,0,0,113.519547
2,3,az,v,as,f,d,a,j,j,0,...,0,0,1,0,0,0,0,0,0,96.249695
3,4,az,l,n,f,d,z,l,n,0,...,0,0,1,0,0,0,0,0,0,77.344536
4,5,w,s,as,c,d,y,i,m,0,...,0,0,0,0,0,0,0,0,0,110.06855


#### 9) Saving New Data with Predictions to "csv" file:

In [593]:
new_data.to_csv("test_with_predictions.csv", index=False)