DESCRIPTION

Reduce the time a Mercedes-Benz spends on the test bench.

Problem Statement Scenario:

Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. Mercedes-Benz is the leader in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before they hit the road, the company’s engineers have developed a robust testing system. As one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on Mercedes-Benz’s production lines. However, optimizing the speed of their testing system for many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. Others will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.

Following actions should be performed:

If for any column(s), the variance is equal to zero, then you need to remove those variable(s).

Check for null and unique values for test and train sets.

Apply label encoder.

Perform dimensionality reduction.

Predict your test_df values using XGBoost.

###  Import Required Library

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from sklearn.metrics import explained_variance_score,r2_score,mean_squared_error
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

import warnings
warnings.filterwarnings('ignore')

### Loading train and test dataset and EDA

In [2]:
# reading train dataset
train_df = pd.read_csv('train.csv')
train = train_df
train_df.head()

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0


In [3]:
# reading test dataset
test_df = pd.read_csv('test.csv')
test = test_df
test_df.head()

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,1,az,v,n,f,d,t,a,w,0,...,0,0,0,1,0,0,0,0,0,0
1,2,t,b,ai,a,d,b,g,y,0,...,0,0,1,0,0,0,0,0,0,0
2,3,az,v,as,f,d,a,j,j,0,...,0,0,0,1,0,0,0,0,0,0
3,4,az,l,n,f,d,z,l,n,0,...,0,0,0,1,0,0,0,0,0,0
4,5,w,s,as,c,d,y,i,m,0,...,1,0,0,0,0,0,0,0,0,0


In [4]:
# function to print shape and info for dataset
def df_shape(data):
    print(data.shape)
    print(data.info())
    
df_shape(train_df)
df_shape(test_df)

(4209, 378)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 378 entries, ID to X385
dtypes: float64(1), int64(369), object(8)
memory usage: 12.1+ MB
None
(4209, 377)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 377 entries, ID to X385
dtypes: int64(369), object(8)
memory usage: 12.1+ MB
None


In [5]:
# to check column for data types
def col_unique_dtype(data):
    print(data.dtypes.unique())

# for train
col_unique_dtype(train_df)
# for test
col_unique_dtype(test_df)

[dtype('int64') dtype('float64') dtype('O')]
[dtype('int64') dtype('O')]


### Removal of column , where variance is 0

In [6]:
# choosing the numerical column for variance manipulation on train dataset
train_df = train_df.select_dtypes(include=['float64', 'int64'])
train_df.head()

Unnamed: 0,ID,y,X10,X11,X12,X13,X14,X15,X16,X17,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,0,0,0,1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
# choosing the numerical column for variance manipulation on test dataset
test_df = test_df.select_dtypes(include=['float64', 'int64'])
test_df.head()

Unnamed: 0,ID,X10,X11,X12,X13,X14,X15,X16,X17,X18,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,2,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,3,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,5,0,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [8]:
# printing the column where variance = 0
for column in train_df.columns:
        variance_zero = train_df[[column]].var(axis = 0)
        if np.array(variance_zero) == 0:
            print(variance_zero)

X11    0.0
dtype: float64
X93    0.0
dtype: float64
X107    0.0
dtype: float64
X233    0.0
dtype: float64
X235    0.0
dtype: float64
X268    0.0
dtype: float64
X289    0.0
dtype: float64
X290    0.0
dtype: float64
X293    0.0
dtype: float64
X297    0.0
dtype: float64
X330    0.0
dtype: float64
X347    0.0
dtype: float64


In [9]:
# dropping the column whose variance is 0 in train and test dataset
train_df = train_df.drop(['X11', 'X93','X107','X233', 'X235','X268','X289','X290','X293','X297','X330','X347'], axis = 1)
test_df = test_df.drop(['X11', 'X93','X107','X233', 'X235','X268','X289','X290','X293','X297','X330','X347'], axis = 1)

In [10]:
# to print shape and info for dataset afeter column removal
df_shape(train_df)
df_shape(test_df)

(4209, 358)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 358 entries, ID to X385
dtypes: float64(1), int64(357)
memory usage: 11.5 MB
None
(4209, 357)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 357 entries, ID to X385
dtypes: int64(357)
memory usage: 11.5 MB
None


### Check for null and unique values for test and train sets

In [11]:
# to check for NULL/NaN value in train dataset
train.isna().sum().any()

False

In [12]:
# to check for NULL/NaN value in test dataset
test.isna().sum().any()

False

In [13]:
# function to check unique value in column
def val_unique(data):
    for column in data.columns:
        unique_count = data[column].nunique()
        print(column, ":",unique_count,",",end = ' ')


In [14]:
# printing unique data in train dataset
val_unique(train)

ID : 4209 , y : 2545 , X0 : 47 , X1 : 27 , X2 : 44 , X3 : 7 , X4 : 4 , X5 : 29 , X6 : 12 , X8 : 25 , X10 : 2 , X11 : 1 , X12 : 2 , X13 : 2 , X14 : 2 , X15 : 2 , X16 : 2 , X17 : 2 , X18 : 2 , X19 : 2 , X20 : 2 , X21 : 2 , X22 : 2 , X23 : 2 , X24 : 2 , X26 : 2 , X27 : 2 , X28 : 2 , X29 : 2 , X30 : 2 , X31 : 2 , X32 : 2 , X33 : 2 , X34 : 2 , X35 : 2 , X36 : 2 , X37 : 2 , X38 : 2 , X39 : 2 , X40 : 2 , X41 : 2 , X42 : 2 , X43 : 2 , X44 : 2 , X45 : 2 , X46 : 2 , X47 : 2 , X48 : 2 , X49 : 2 , X50 : 2 , X51 : 2 , X52 : 2 , X53 : 2 , X54 : 2 , X55 : 2 , X56 : 2 , X57 : 2 , X58 : 2 , X59 : 2 , X60 : 2 , X61 : 2 , X62 : 2 , X63 : 2 , X64 : 2 , X65 : 2 , X66 : 2 , X67 : 2 , X68 : 2 , X69 : 2 , X70 : 2 , X71 : 2 , X73 : 2 , X74 : 2 , X75 : 2 , X76 : 2 , X77 : 2 , X78 : 2 , X79 : 2 , X80 : 2 , X81 : 2 , X82 : 2 , X83 : 2 , X84 : 2 , X85 : 2 , X86 : 2 , X87 : 2 , X88 : 2 , X89 : 2 , X90 : 2 , X91 : 2 , X92 : 2 , X93 : 1 , X94 : 2 , X95 : 2 , X96 : 2 , X97 : 2 , X98 : 2 , X99 : 2 , X100 : 2 , X101 : 2

In [15]:
# printing unique data in test dataset
val_unique(test)

ID : 4209 , X0 : 49 , X1 : 27 , X2 : 45 , X3 : 7 , X4 : 4 , X5 : 32 , X6 : 12 , X8 : 25 , X10 : 2 , X11 : 2 , X12 : 2 , X13 : 2 , X14 : 2 , X15 : 2 , X16 : 2 , X17 : 2 , X18 : 2 , X19 : 2 , X20 : 2 , X21 : 2 , X22 : 2 , X23 : 2 , X24 : 2 , X26 : 2 , X27 : 2 , X28 : 2 , X29 : 2 , X30 : 2 , X31 : 2 , X32 : 2 , X33 : 2 , X34 : 2 , X35 : 2 , X36 : 2 , X37 : 2 , X38 : 2 , X39 : 2 , X40 : 2 , X41 : 2 , X42 : 2 , X43 : 2 , X44 : 2 , X45 : 2 , X46 : 2 , X47 : 2 , X48 : 2 , X49 : 2 , X50 : 2 , X51 : 2 , X52 : 2 , X53 : 2 , X54 : 2 , X55 : 2 , X56 : 2 , X57 : 2 , X58 : 2 , X59 : 2 , X60 : 2 , X61 : 2 , X62 : 2 , X63 : 2 , X64 : 2 , X65 : 2 , X66 : 2 , X67 : 2 , X68 : 2 , X69 : 2 , X70 : 2 , X71 : 2 , X73 : 2 , X74 : 2 , X75 : 2 , X76 : 2 , X77 : 2 , X78 : 2 , X79 : 2 , X80 : 2 , X81 : 2 , X82 : 2 , X83 : 2 , X84 : 2 , X85 : 2 , X86 : 2 , X87 : 2 , X88 : 2 , X89 : 2 , X90 : 2 , X91 : 2 , X92 : 2 , X93 : 2 , X94 : 2 , X95 : 2 , X96 : 2 , X97 : 2 , X98 : 2 , X99 : 2 , X100 : 2 , X101 : 2 , X102 : 2

###  Apply label encoder on Categorical Data

In [16]:
# instantiate the LabelEncoder and print the categorical feature of train dataset
le = LabelEncoder()
train_le = train.select_dtypes(include= 'O')
print(train_le)

      X0 X1  X2 X3 X4  X5 X6 X8
0      k  v  at  a  d   u  j  o
1      k  t  av  e  d   y  l  o
2     az  w   n  c  d   x  j  x
3     az  t   n  f  d   x  l  e
4     az  v   n  f  d   h  d  n
...   .. ..  .. .. ..  .. .. ..
4204  ak  s  as  c  d  aa  d  q
4205   j  o   t  d  d  aa  h  h
4206  ak  v   r  a  d  aa  g  e
4207  al  r   e  f  d  aa  l  u
4208   z  r  ae  c  d  aa  g  w

[4209 rows x 8 columns]


In [17]:
# transformation of categorical feature to numerical feature of train dataset
for column in train_le.columns:
    train_le[column] = le.fit_transform(train_le[column])
print(train_le)

      X0  X1  X2  X3  X4  X5  X6  X8
0     32  23  17   0   3  24   9  14
1     32  21  19   4   3  28  11  14
2     20  24  34   2   3  27   9  23
3     20  21  34   5   3  27  11   4
4     20  23  34   5   3  12   3  13
...   ..  ..  ..  ..  ..  ..  ..  ..
4204   8  20  16   2   3   0   3  16
4205  31  16  40   3   3   0   7   7
4206   8  23  38   0   3   0   6   4
4207   9  19  25   5   3   0  11  20
4208  46  19   3   2   3   0   6  22

[4209 rows x 8 columns]


In [18]:
# instantiate the LabelEncoder and print the categorical feature of test dataset
le = LabelEncoder()
test_le = test.select_dtypes(include= 'O')
print(test_le)

      X0  X1  X2 X3 X4  X5 X6 X8
0     az   v   n  f  d   t  a  w
1      t   b  ai  a  d   b  g  y
2     az   v  as  f  d   a  j  j
3     az   l   n  f  d   z  l  n
4      w   s  as  c  d   y  i  m
...   ..  ..  .. .. ..  .. .. ..
4204  aj   h  as  f  d  aa  j  e
4205   t  aa  ai  d  d  aa  j  y
4206   y   v  as  f  d  aa  d  w
4207  ak   v  as  a  d  aa  c  q
4208   t  aa  ai  c  d  aa  g  r

[4209 rows x 8 columns]


In [19]:
# transformation of categorical feature to numerical feature for test dataset
for column in train_le.columns:
    test_le[column] = le.fit_transform(train_le[column])
print(test_le)

      X0  X1  X2  X3  X4  X5  X6  X8
0     32  23  17   0   3  24   9  14
1     32  21  19   4   3  28  11  14
2     20  24  34   2   3  27   9  23
3     20  21  34   5   3  27  11   4
4     20  23  34   5   3  12   3  13
...   ..  ..  ..  ..  ..  ..  ..  ..
4204   8  20  16   2   3   0   3  16
4205  31  16  40   3   3   0   7   7
4206   8  23  38   0   3   0   6   4
4207   9  19  25   5   3   0  11  20
4208  46  19   3   2   3   0   6  22

[4209 rows x 8 columns]


In [20]:
# final train data to train the model
X_train = pd.merge(train_le,train_df.iloc[:,2:], on = train_le.index)
X_train = X_train.iloc[:,1:]
X_train

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8,X10,X12,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,32,23,17,0,3,24,9,14,0,0,...,0,0,1,0,0,0,0,0,0,0
1,32,21,19,4,3,28,11,14,0,0,...,1,0,0,0,0,0,0,0,0,0
2,20,24,34,2,3,27,9,23,0,0,...,0,0,0,0,0,0,1,0,0,0
3,20,21,34,5,3,27,11,4,0,0,...,0,0,0,0,0,0,0,0,0,0
4,20,23,34,5,3,12,3,13,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,8,20,16,2,3,0,3,16,0,0,...,1,0,0,0,0,0,0,0,0,0
4205,31,16,40,3,3,0,7,7,0,0,...,0,1,0,0,0,0,0,0,0,0
4206,8,23,38,0,3,0,6,4,0,1,...,0,0,1,0,0,0,0,0,0,0
4207,9,19,25,5,3,0,11,20,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
# final test data to test mdoel
Y_train = pd.DataFrame(train_df.y)
Y_train

Unnamed: 0,y
0,130.81
1,88.53
2,76.26
3,80.62
4,78.02
...,...
4204,107.39
4205,108.77
4206,109.22
4207,87.48


In [22]:
# final test data to test model
X_test = pd.merge(test_le,test_df.iloc[:,1:], on = train_le.index)
X_test = X_test.iloc[:,1:]
X_test

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8,X10,X12,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,32,23,17,0,3,24,9,14,0,0,...,0,0,0,1,0,0,0,0,0,0
1,32,21,19,4,3,28,11,14,0,0,...,0,0,1,0,0,0,0,0,0,0
2,20,24,34,2,3,27,9,23,0,0,...,0,0,0,1,0,0,0,0,0,0
3,20,21,34,5,3,27,11,4,0,0,...,0,0,0,1,0,0,0,0,0,0
4,20,23,34,5,3,12,3,13,0,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,8,20,16,2,3,0,3,16,0,0,...,0,0,0,0,0,0,0,0,0,0
4205,31,16,40,3,3,0,7,7,0,0,...,0,1,0,0,0,0,0,0,0,0
4206,8,23,38,0,3,0,6,4,0,0,...,0,0,0,0,0,0,0,0,0,0
4207,9,19,25,5,3,0,11,20,0,0,...,0,0,1,0,0,0,0,0,0,0


### Perform dimensionality reduction.

In [23]:
# Instantiating the PCA
pca = PCA(n_components=6)

In [24]:
# fit and transform of final train data
trans_X_train = pd.DataFrame(pca.fit_transform(X_train))
trans_X_train.shape

(4209, 6)

In [25]:
# printing the amount of vaiance after applying pca technique
print(pca.explained_variance_)

[204.02462081 113.83096499  70.58204097  62.94352325  48.99603633
   8.465483  ]


In [26]:
# printing amount of variance ratio defined for before and after pca
print(pca.explained_variance_ratio_)

[0.38334782 0.21388033 0.13261866 0.11826642 0.09206008 0.01590604]


In [27]:
# checking shape for modelling the mdoel
X_train.shape,Y_train.shape

((4209, 364), (4209, 1))

In [28]:
# transofrming the test's train dataset for predicting
trans_Y_train = pd.DataFrame(pca.transform(X_test))

### Generating model  using XGBoost

In [29]:
# Estimating by XGBoost's regression
regressor = XGBRegressor(random_state =4,n_jobs=-1)

In [30]:
#Hyper Parameter Optimization

parameters = {'learning_rate':[0.001,0.01,0.05,0.1,1],
             'n_estimators':[100,150,200,500],
             'max_depth':[2,3,5,10],
             'colsample_bytree':[0.1,0.5,0.7,1],
             'reg_alpha':[1e-5,1e-3,1e-1,1,1e1]}


In [31]:
# performing RandomizedSearchCV to check params for hyper parameter tuning 
random_search = RandomizedSearchCV(regressor,parameters,cv=5,scoring='r2',return_train_score=True,n_jobs=-1,verbose=3)
random_search.fit(trans_X_train,Y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:   13.4s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   31.9s finished


RandomizedSearchCV(cv=5,
                   estimator=XGBRegressor(base_score=None, booster=None,
                                          colsample_bylevel=None,
                                          colsample_bynode=None,
                                          colsample_bytree=None, gamma=None,
                                          gpu_id=None, importance_type='gain',
                                          interaction_constraints=None,
                                          learning_rate=None,
                                          max_delta_step=None, max_depth=None,
                                          min_child_weight=None, missing=nan,
                                          monotone_constraints=None,
                                          n_estimators=100, n...
                                          reg_lambda=None,
                                          scale_pos_weight=None, subsample=None,
                                          tree_met

In [32]:
# best estimator attr for XGRegressor
random_search.best_estimator_

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.05, max_delta_step=0, max_depth=2,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=500, n_jobs=-1, num_parallel_tree=1, random_state=4,
             reg_alpha=1e-05, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [33]:
# best estimator param for XGRegressor
random_search.best_params_

{'reg_alpha': 1e-05,
 'n_estimators': 500,
 'max_depth': 2,
 'learning_rate': 0.05,
 'colsample_bytree': 1}

In [34]:
# fitting the estiamtor
regressor.fit(trans_X_train, Y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=-1, num_parallel_tree=1, random_state=4,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [35]:
# score variability 
regressor.score(trans_X_train, Y_train)

0.8600323816318168

In [36]:
# predicting on test dataset
pred_regressor = regressor.predict(trans_Y_train)

In [37]:
# checking mean squared error 
mean_squared_error(Y_train,pred_regressor)

54.517342817683954

In [38]:
# regression score function
r2_score(Y_train,pred_regressor)

0.660811006409497

### Predicted test_df

In [39]:
# predicted data on test dataset
test_df = pd.DataFrame()
test_df['y']= pred_regressor
test_df

Unnamed: 0,y
0,98.881699
1,94.868675
2,74.654633
3,78.968872
4,77.960724
...,...
4204,108.354019
4205,106.432777
4206,111.529900
4207,90.521278
