DESCRIPTION

Reduce the time a Mercedes-Benz spends on the test bench.

Problem Statement Scenario:
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before it hits the road, the company’s engineers have developed a robust testing system. Because Mercedes-Benz is one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on its production lines. However, optimizing the speed of the testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. You will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.

The following actions should be performed:

If for any column(s), the variance is equal to zero, then you need to remove those variable(s).
Check for null and unique values for test and train sets.
Apply label encoder.
Perform dimensionality reduction.
Predict your test_df values using XGBoost.
Find the datasets here.

In [1]:
#importing numpy and pandas
import numpy as np
import pandas as pd

In [2]:
#Loading Datasets

In [3]:
a = pd.read_csv('train.csv')

In [4]:
b = pd.read_csv('test.csv')

In [5]:
#Splitting data and target

In [6]:
target = a['y']

In [7]:
target.head()

0    130.81
1     88.53
2     76.26
3     80.62
4     78.02
Name: y, dtype: float64

In [8]:
a.head()

,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0


In [9]:
b.head()

,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,1,az,v,n,f,d,t,a,w,0,...,0,0,0,1,0,0,0,0,0,0
1,2,t,b,ai,a,d,b,g,y,0,...,0,0,1,0,0,0,0,0,0,0
2,3,az,v,as,f,d,a,j,j,0,...,0,0,0,1,0,0,0,0,0,0
3,4,az,l,n,f,d,z,l,n,0,...,0,0,0,1,0,0,0,0,0,0
4,5,w,s,as,c,d,y,i,m,0,...,1,0,0,0,0,0,0,0,0,0


In [10]:
c = a.drop(['y'],axis=1)

In [11]:
features = c

In [12]:
features.shape

(4209, 377)

In [13]:
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 377 entries, ID to X385
dtypes: int64(369), object(8)
memory usage: 12.1+ MB


In [14]:
features.describe()

,ID,X10,X11,X12,X13,X14,X15,X16,X17,X18,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
count,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,...,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0
mean,4205.960798,0.013305,0.0,0.075077,0.057971,0.42813,0.000475,0.002613,0.007603,0.00784,...,0.318841,0.057258,0.314802,0.02067,0.009503,0.008078,0.007603,0.001663,0.000475,0.001426
std,2437.608688,0.11459,0.0,0.263547,0.233716,0.494867,0.021796,0.051061,0.086872,0.088208,...,0.466082,0.232363,0.464492,0.142294,0.097033,0.089524,0.086872,0.040752,0.021796,0.037734
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2095.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4220.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,6314.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,8417.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [15]:
#If for any column(s), the variance is equal to zero, then you need to remove those variable(s).

In [16]:
variances = features.var(numeric_only=True)  # skip the categorical (object) columns
variances[variances == 0].index.values

array(['X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290',
       'X293', 'X297', 'X330', 'X347'], dtype=object)

In [17]:
variances = features.var(numeric_only=True)  # skip the categorical (object) columns
features1 = features.drop(columns=variances[variances == 0].index)

In [18]:
features1.shape

(4209, 365)
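The zero-variance step can also be written as a small helper. The following is a minimal sketch rather than the notebook's own code: the function name `drop_constant_columns` and the toy frames are hypothetical. It uses `nunique()` instead of `var()`, so it also covers the categorical columns, and it applies the column list found on the training data to the test data so that both frames keep identical columns.

```python
import pandas as pd


def drop_constant_columns(train_df, test_df):
    """Drop columns that are constant in train_df from both frames.

    Hypothetical helper for illustration: the columns to drop are chosen
    from the training data only, then the same list is applied to the
    test data so that both frames end up with identical columns.
    """
    # A column with at most one distinct value has zero variance.
    constant_cols = [c for c in train_df.columns if train_df[c].nunique() <= 1]
    return (
        train_df.drop(columns=constant_cols),
        test_df.drop(columns=constant_cols),
        constant_cols,
    )


# Toy data standing in for train.csv / test.csv (values are made up).
toy_train = pd.DataFrame({"X0": ["a", "b", "a"], "X10": [0, 1, 0], "X11": [0, 0, 0]})
toy_test = pd.DataFrame({"X0": ["b", "b", "c"], "X10": [0, 0, 0], "X11": [0, 1, 0]})

train_kept, test_kept, dropped = drop_constant_columns(toy_train, toy_test)
print(dropped)
print(list(train_kept.columns), list(test_kept.columns))
```

Choosing the columns from the training data alone keeps the feature layout the same for any data the fitted model later sees.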

In [19]:
#Check for null and unique values for test and train sets.

In [20]:
features1.isna().any()

ID      False
X0      False
X1      False
X2      False
X3      False
        ...  
X380    False
X382    False
X383    False
X384    False
X385    False
Length: 365, dtype: bool

In [21]:
features1.isna().sum()

ID      0
X0      0
X1      0
X2      0
X3      0
       ..
X380    0
X382    0
X383    0
X384    0
X385    0
Length: 365, dtype: int64

In [22]:
target.isna().any()

False
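The cells above cover the null check; the unique-value part of the task can be sketched in the same spirit. This is an illustrative example only: the helper name `summarize_columns` and the toy frames are hypothetical. For each categorical column it reports null counts, the number of distinct values in each set, and any categories that occur in the test set but never in the training set.

```python
import pandas as pd


def summarize_columns(train_df, test_df, columns):
    """Report nulls, unique counts, and test-only categories per column.

    Hypothetical helper for illustration; `columns` is the list of
    categorical column names to inspect.
    """
    rows = []
    for col in columns:
        train_values = set(train_df[col].dropna().unique())
        test_values = set(test_df[col].dropna().unique())
        rows.append(
            {
                "column": col,
                "train_nulls": int(train_df[col].isna().sum()),
                "test_nulls": int(test_df[col].isna().sum()),
                "train_unique": len(train_values),
                "test_unique": len(test_values),
                # Categories that appear in test but never in train.
                "unseen_in_train": sorted(test_values - train_values),
            }
        )
    return pd.DataFrame(rows).set_index("column")


# Toy data standing in for train.csv / test.csv (values are made up).
toy_train = pd.DataFrame({"X0": ["k", "az", "az"], "X3": ["a", "e", "c"]})
toy_test = pd.DataFrame({"X0": ["az", "t", "w"], "X3": ["f", "a", None]})

summary = summarize_columns(toy_train, toy_test, ["X0", "X3"])
print(summary)
```

Categories listed under `unseen_in_train` matter for the encoding step, because an encoder fitted on the training data has no code for them.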

In [23]:
#Applying label encoder for categorical Variables

In [24]:
features1.dtypes

ID       int64
X0      object
X1      object
X2      object
X3      object
         ...  
X380     int64
X382     int64
X383     int64
X384     int64
X385     int64
Length: 365, dtype: object

In [25]:
features1.dtypes.values

array([dtype('int64'), dtype('O'), dtype('O'), dtype('O'), dtype('O'),
       dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
       dtype('int64'), dtype('int64'), dtype('int64'), dtype('int64'),
      

In [26]:
features1.describe(include ='object')

,X0,X1,X2,X3,X4,X5,X6,X8
count,4209,4209,4209,4209,4209,4209,4209,4209
unique,47,27,44,7,4,29,12,25
top,z,aa,as,c,d,v,g,j
freq,360,833,1659,1942,4205,231,1042,277


In [27]:
category = features1.describe(include ='object').columns.values

In [28]:
category

array(['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8'], dtype=object)

In [29]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [30]:
for i in category:
    features1[i] = le.fit_transform(features1[i])

In [31]:
fea = features1.values
fea

array([[   0,   32,   23, ...,    0,    0,    0],
       [   6,   32,   21, ...,    0,    0,    0],
       [   7,   20,   24, ...,    0,    0,    0],
       ...,
       [8412,    8,   23, ...,    0,    0,    0],
       [8415,    9,   19, ...,    0,    0,    0],
       [8417,   46,   19, ...,    0,    0,    0]], dtype=int64)
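One thing to watch with label encoding is that the integer codes depend on the data the encoder was fitted on, so the same letter can receive different codes in the train and test sets if each is encoded separately. The sketch below shows one possible way to keep the mapping shared. It swaps in scikit-learn's `OrdinalEncoder` (not used in the notebook), assumes scikit-learn 0.24 or later for the `handle_unknown="use_encoded_value"` option, and uses made-up toy frames.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data standing in for train.csv / test.csv (values are made up).
toy_train = pd.DataFrame({"X0": ["k", "az", "az"], "X3": ["a", "e", "c"]})
toy_test = pd.DataFrame({"X0": ["az", "t", "k"], "X3": ["c", "a", "e"]})
categorical_cols = ["X0", "X3"]

# Learn the category -> integer mapping from the training data only.
# Categories that appear only in the test data are mapped to -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(toy_train[categorical_cols])

train_encoded = toy_train.copy()
test_encoded = toy_test.copy()
train_encoded[categorical_cols] = encoder.transform(toy_train[categorical_cols]).astype(int)
test_encoded[categorical_cols] = encoder.transform(toy_test[categorical_cols]).astype(int)

print(train_encoded)
print(test_encoded)
```

With this approach a given category such as `az` gets the same integer in both frames, which is what a model trained on one and applied to the other needs.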

In [32]:
from sklearn.model_selection import train_test_split

# y is a continuous target (test time), so it is left as-is rather than label-encoded
X, y = features1.values, target.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [33]:
X_train.shape ,X_test.shape ,y_train.shape , y_test.shape

((3367, 365), (842, 365), (3367,), (842,))

In [34]:
#Normalizing the data using Standard scaler

In [35]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
featuresstd = sc.fit_transform(fea)

In [36]:
X_train = sc.fit_transform(X_train)  # fit the scaler on the training split only
X_test = sc.transform(X_test)        # reuse the training mean and std for the test split

In [37]:
featuresstd

array([[-1.72565045,  0.16301209,  1.39348787, ..., -0.04081511,
        -0.02180363, -0.03778296],
       [-1.72318873,  0.16301209,  1.15902093, ..., -0.04081511,
        -0.02180363, -0.03778296],
       [-1.72277844, -0.71055977,  1.51072134, ..., -0.04081511,
        -0.02180363, -0.03778296],
       ...,
       [ 1.72568262, -1.58413164,  1.39348787, ..., -0.04081511,
        -0.02180363, -0.03778296],
       [ 1.72691348, -1.51133398,  0.924554  , ..., -0.04081511,
        -0.02180363, -0.03778296],
       [ 1.72773405,  1.18217927,  0.924554  , ..., -0.04081511,
        -0.02180363, -0.03778296]])
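To make the scaling step concrete, here is a minimal sketch on made-up data (the array `X_toy` and the variable names are hypothetical). It fits `StandardScaler` on the training rows and reuses the learned mean and standard deviation for the held-out rows, so no information from the held-out rows enters the preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # made-up features

X_tr, X_te = train_test_split(X_toy, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_tr_std = scaler.fit_transform(X_tr)  # learn mean and std from the training rows
X_te_std = scaler.transform(X_te)      # reuse those statistics for the held-out rows

print(X_tr_std.mean(axis=0).round(6))  # approximately zero by construction
print(X_te_std.mean(axis=0).round(3))  # close to, but generally not exactly, zero
```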

In [38]:
#Performing dimensionality reduction using PCA

In [39]:
from sklearn.decomposition import PCA

pca = PCA(n_components=364, svd_solver='full')
pca.fit(featuresstd)  # PCA is unsupervised, so the target is not passed

PCA(copy=True, iterated_power='auto', n_components=364, random_state=None,
    svd_solver='full', tol=0.0, whiten=False)

In [40]:
pca.explained_variance_ratio_

array([6.87384486e-02, 5.67283084e-02, 4.52510465e-02, 3.41738590e-02,
       3.25538323e-02, 3.15418578e-02, 2.85471262e-02, 2.11817663e-02,
       1.96863310e-02, 1.77893503e-02, 1.63562978e-02, 1.56009983e-02,
       1.45906030e-02, 1.44564755e-02, 1.34495596e-02, 1.29257331e-02,
       1.24138205e-02, 1.17139363e-02, 1.11912605e-02, 1.07496090e-02,
       9.89891380e-03, 9.67760321e-03, 9.40045751e-03, 9.08605429e-03,
       8.72347187e-03, 8.40759803e-03, 7.92761993e-03, 7.61388789e-03,
       7.34903377e-03, 7.18304967e-03, 6.91226562e-03, 6.75052104e-03,
       6.55057087e-03, 6.46544442e-03, 6.21347862e-03, 6.00246073e-03,
       5.86650100e-03, 5.74454073e-03, 5.62534299e-03, 5.55771245e-03,
       5.50145016e-03, 5.38603020e-03, 5.32448904e-03, 5.23215509e-03,
       5.11352399e-03, 5.01856856e-03, 4.97724151e-03, 4.77275686e-03,
       4.65790330e-03, 4.59136569e-03, 4.46221069e-03, 4.37329823e-03,
       4.31692752e-03, 4.29122103e-03, 4.22545368e-03, 4.18909864e-03,
      

In [41]:
np.mean(pca.explained_variance_ratio_)

0.0027472527472527475

In [42]:
pca.explained_variance_ratio_ > np.mean(pca.explained_variance_ratio_)

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

In [43]:
print((pca.explained_variance_ratio_ > np.mean(pca.explained_variance_ratio_)).sum())

92
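The mean printed above, 0.0027472527..., equals 1/364: the 364 explained-variance ratios sum to 1, so their mean is simply one over the number of components. Keeping the components whose ratio exceeds that mean is therefore a "better than average" rule. The sketch below reproduces the selection on made-up data (the arrays and variable names are hypothetical) and computes the threshold from the ratios instead of copying a literal value.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up data: 5 strong latent directions mixed into 20 noisy features.
latent = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 20))
X_toy = latent @ mixing + 0.1 * rng.normal(size=(300, 20))
X_std = StandardScaler().fit_transform(X_toy)

# Fit with all components, then keep those whose ratio is above the mean ratio.
full_pca = PCA(svd_solver="full").fit(X_std)
ratios = full_pca.explained_variance_ratio_
n_keep = int((ratios > ratios.mean()).sum())

# Refit with only the selected number of components.
reduced = PCA(n_components=n_keep, svd_solver="full").fit_transform(X_std)
print(n_keep, reduced.shape)
```

A cumulative-variance target (for example, keeping enough components to explain a chosen share of the variance) is a common alternative to this rule.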


In [44]:
pca = PCA(n_components=92, svd_solver='full')
pca.fit(featuresstd)  # PCA is unsupervised, so the target is not passed

PCA(copy=True, iterated_power='auto', n_components=92, random_state=None,
    svd_solver='full', tol=0.0, whiten=False)

In [45]:
finalfeatures = pca.transform(featuresstd)
finalfeatures

array([[12.2479426 , -2.94616767, -0.96889421, ...,  0.25882444,
        -0.61676101,  1.10463618],
       [-0.10761082,  0.3647402 ,  0.98997272, ...,  0.77942142,
        -0.2513724 ,  1.93558103],
       [10.27291072, 21.10326107, -5.02216176, ...,  0.68435581,
        -0.25604245, -0.12838853],
       ...,
       [ 0.44231616,  0.8989563 ,  3.45276206, ..., -0.32567421,
         0.26075324, -0.29454552],
       [-1.33649135,  0.59062955, -0.10217684, ...,  0.04258129,
         0.39559623, -0.32160133],
       [-2.16301364, -1.05903452, -0.28051994, ...,  0.49978042,
         0.50150749,  0.41787128]])

In [46]:
finalfeatures.shape

(4209, 92)

In [47]:
target.shape

(4209,)

In [48]:
#Predicting the values using Linear Regression

In [49]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(finalfeatures,target, test_size=0.2)

In [50]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

In [51]:
lr.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [52]:
y_pred = lr.predict(X_test)

In [53]:
pd.DataFrame({'Actual' : y_test , 'Predicted' : y_pred})   

,Actual,Predicted
2569,87.78,94.908706
1976,100.85,105.942556
722,87.07,95.788066
1028,117.98,108.330053
2790,107.97,113.195746
...,...,...
602,105.43,110.820748
3466,96.26,108.539179
2204,99.03,95.037396
2632,109.49,109.934759


In [54]:
#predicting the values using XGBRegressor

In [55]:
import xgboost as xgb

In [56]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(finalfeatures,target, test_size=0.2)

In [57]:
from xgboost import XGBRegressor
model = XGBRegressor(objective='reg:squarederror', learning_rate=0.1)
    
model.fit(X_train,y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.1, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=0, num_parallel_tree=1,
             objective='reg:squarederror', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

In [58]:
y_pred = model.predict(X_test)

In [59]:
pd.DataFrame({'Actual' : y_test , 'Predicted' : y_pred}) 

,Actual,Predicted
3582,93.96,91.042679
3999,100.90,102.910324
3332,87.46,94.036461
1674,98.07,96.550613
634,112.71,112.923676
...,...,...
1603,91.01,92.351799
3022,91.86,99.766640
2378,92.38,100.559090
651,106.72,106.961151


In [60]:
#fit into model and get scores using XGBoost regression

model = XGBRegressor(objective='reg:squarederror', learning_rate=0.1)
    
model.fit(X_train,y_train)

#Check the quality of the model (for regressors, score() returns R^2, not accuracy)

print("Training R^2 ",model.score(X_train,y_train))
print("Testing R^2 ",model.score(X_test,y_test))

Training R^2  0.8907866395893875
Testing R^2  0.43037251796074166


In [61]:
from sklearn.metrics import r2_score

r2_score(y_test,model.predict(X_test), multioutput='variance_weighted')

0.43037251796074166

In [62]:
from sklearn.metrics import mean_absolute_error

# Mean absolute error (MAE), expressed in the same units as y

mean_absolute_error(y_test,model.predict(X_test))

5.978755960351214
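The gap between the training score (about 0.89) and the test score (about 0.43) suggests overfitting, and a single random split gives a noisy estimate. One possible way to get a steadier picture is k-fold cross-validation with the preprocessing wrapped in a pipeline, so the scaler and PCA are refitted on each training fold. The sketch below is an assumption-laden illustration on made-up data: `Ridge` is only a stand-in so the example runs without xgboost installed, and an `XGBRegressor` could be placed in the pipeline instead since it follows the scikit-learn estimator interface.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 30))                        # made-up features
y_toy = X_toy[:, :5].sum(axis=1) + rng.normal(size=300)  # made-up target

# Ridge is only a stand-in regressor for this sketch.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=10), Ridge())

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    pipeline, X_toy, y_toy, cv=cv,
    scoring=("r2", "neg_mean_absolute_error"),
)

print("R^2 per fold:", scores["test_r2"].round(3))
print("MAE per fold:", (-scores["test_neg_mean_absolute_error"]).round(3))
```

Reporting the mean and spread of the per-fold scores makes it easier to compare model settings than a single split does.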

In [63]:
## Predicting test_df values using XGBoost.

In [64]:
target

0       130.81
1        88.53
2        76.26
3        80.62
4        78.02
         ...  
4204    107.39
4205    108.77
4206    109.22
4207     87.48
4208    110.85
Name: y, Length: 4209, dtype: float64

In [65]:
#If for any column(s), the variance is equal to zero, then you need to remove those variable(s).

In [66]:
test_variances = b.var(numeric_only=True)  # skip the categorical (object) columns
test_variances[test_variances == 0].index.values

array(['X257', 'X258', 'X295', 'X296', 'X369'], dtype=object)

In [67]:
test_variances = b.var(numeric_only=True)  # skip the categorical (object) columns
test = b.drop(columns=test_variances[test_variances == 0].index)

In [68]:
test.shape

(4209, 372)

In [69]:
#Check for null and unique values for test and train sets.

In [70]:
test.isna().any()

ID      False
X0      False
X1      False
X2      False
X3      False
        ...  
X380    False
X382    False
X383    False
X384    False
X385    False
Length: 372, dtype: bool

In [71]:
test.isna().sum()

ID      0
X0      0
X1      0
X2      0
X3      0
       ..
X380    0
X382    0
X383    0
X384    0
X385    0
Length: 372, dtype: int64

In [72]:
target.isna().any()

False

In [73]:
#Applying label encoder for categorical Variables

In [74]:
category1 = test.describe(include ='object').columns.values
category1

array(['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8'], dtype=object)

In [75]:
from sklearn.preprocessing import LabelEncoder
lec = LabelEncoder()

In [76]:
for i in category1:
    test[i] = lec.fit_transform(test[i])

In [77]:
feat = test.values
feat

array([[   1,   21,   23, ...,    0,    0,    0],
       [   2,   42,    3, ...,    0,    0,    0],
       [   3,   21,   23, ...,    0,    0,    0],
       ...,
       [8413,   47,   23, ...,    0,    0,    0],
       [8414,    7,   23, ...,    0,    0,    0],
       [8416,   42,    1, ...,    0,    0,    0]], dtype=int64)

In [78]:
X,y = test.values , target.values

In [79]:

# y is a continuous target (test time), so it is left as-is rather than label-encoded

In [80]:
#Predicting y-value(time) on test_df values using XGBoost.

In [81]:
import xgboost as xgb

In [82]:
from sklearn.model_selection import train_test_split

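# Note: `test` holds the test.csv features and `target` the train.csv times, so rows are paired by position only
X_train, X_test, y_train, y_test = train_test_split(test,target, test_size=0.2)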

In [83]:
from xgboost import XGBRegressor
model = XGBRegressor(objective='reg:squarederror', learning_rate=0.1)
    
model.fit(X_train,y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.1, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=0, num_parallel_tree=1,
             objective='reg:squarederror', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

In [84]:
y_predt = model.predict(X_test)

In [85]:
pd.DataFrame({'Actual' : y_test , 'Predicted' : y_predt}) 

,Actual,Predicted
4200,108.59,103.274933
1012,89.43,103.723076
193,90.54,97.508148
3033,106.04,101.850121
2818,101.92,101.307411
...,...,...
203,136.41,98.734222
429,154.87,101.426765
1444,87.00,104.296928
1112,93.95,100.132278


In [86]:
print(y_predt)

[103.27493  103.723076  97.50815  101.85012  101.30741  106.591835
 102.46691   98.94566  101.74861   99.44234  100.641525  99.029236
 102.30223  102.622314 100.835976 102.41957  102.94305   99.268936
 100.46998   98.39673   94.66549   98.83996  100.259964 102.175674
  95.06955   96.09242   98.21539   99.76069  101.20213  101.47222
 103.4796   102.11756   99.23348   99.18473  123.79226  100.6226
 103.86228  102.577095 109.68129  102.76705   99.32746   98.89783
  98.99497  100.132    101.52334   99.78276   98.83815   94.17593
 101.33589  100.70447  102.28134  101.1475    99.91386  100.59154
  98.28317  101.19163   95.65593   96.660774 100.06372   97.86057
  97.14725  103.27502   99.37115  103.06418  100.09363   98.93131
 104.10266  100.54063   98.6065    99.92577   96.57313  106.387115
  98.85196   99.697365  99.888916  99.02734  101.31686  103.37309
 100.69429  100.17704  100.36803  104.08862  100.85737   98.43767
 102.61132  100.06832  103.571075 109.31889   96.14872  102.91509
 103.2

In [87]:
# y_predt is the array of predicted y values, i.e. the predicted time each car takes to pass testing.
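For reference, predictions for the rows of a separate test file are usually obtained by fitting on the training file and then applying the same preprocessing to the test file. The following is a minimal sketch under stated assumptions, not the notebook's method: the helper `fit_and_predict` and the toy frames are hypothetical, `OrdinalEncoder` replaces the per-column `LabelEncoder`, scaling and PCA are left out for brevity, and `Ridge` is a stand-in so the example runs without xgboost. Any regressor with `fit` and `predict`, such as an `XGBRegressor`, could be passed as `model`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OrdinalEncoder


def fit_and_predict(train_df, test_df, model, target_col="y"):
    """Fit `model` on train_df and return predictions for test_df.

    Hypothetical helper for illustration. All preprocessing choices
    (columns to drop, category codes) are derived from train_df only.
    """
    X_train = train_df.drop(columns=[target_col])
    y_train = train_df[target_col]

    # Keep only columns that vary in the training data; use the same list for test.
    keep = [c for c in X_train.columns if X_train[c].nunique() > 1]
    X_train, X_test = X_train[keep].copy(), test_df[keep].copy()

    # Encode the non-numeric columns with one mapping learned from train.
    cat_cols = [c for c in keep if not pd.api.types.is_numeric_dtype(X_train[c])]
    if cat_cols:
        enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
        X_train[cat_cols] = enc.fit_transform(X_train[cat_cols])
        X_test[cat_cols] = enc.transform(X_test[cat_cols])

    model.fit(X_train.to_numpy(dtype=float), y_train.to_numpy())
    preds = model.predict(X_test.to_numpy(dtype=float))
    return pd.Series(preds, index=test_df.index, name=target_col)


# Toy frames standing in for train.csv / test.csv (values are made up).
toy_train = pd.DataFrame({
    "ID": [0, 6, 7, 9, 13, 18],
    "X0": ["k", "k", "az", "az", "az", "t"],
    "X10": [0, 1, 0, 1, 0, 1],
    "X11": [0, 0, 0, 0, 0, 0],
    "y": [130.8, 88.5, 76.3, 80.6, 78.0, 92.9],
})
toy_test = pd.DataFrame({
    "ID": [1, 2, 3],
    "X0": ["az", "t", "w"],
    "X10": [0, 0, 1],
    "X11": [0, 1, 0],
})

predictions = fit_and_predict(toy_train, toy_test, Ridge())
print(predictions)
```

The returned series has one predicted time per test row, indexed like the test frame, so it can be joined back to the `ID` column if needed.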