DESCRIPTION

Reduce the time a Mercedes-Benz spends on the test bench.

Problem Statement Scenario:
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. Mercedes-Benz is the leader in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before it hits the road, the company’s engineers have developed a robust testing system. Mercedes-Benz is one of the world’s biggest manufacturers of premium cars, and safety and efficiency are paramount on its production lines. However, optimizing the speed of the testing system for the many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. You will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.

In [105]:
import numpy as np
import pandas as pd

In [106]:
train_df=pd.read_csv(r"train.csv")
test_df=pd.read_csv(r"test.csv")
print (train_df.shape)
print (train_df.columns)
print (test_df.shape)
print (test_df.columns)

(4209, 378)
Index(['ID', 'y', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8',
       ...
       'X375', 'X376', 'X377', 'X378', 'X379', 'X380', 'X382', 'X383', 'X384',
       'X385'],
      dtype='object', length=378)
(4209, 377)
Index(['ID', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8', 'X10',
       ...
       'X375', 'X376', 'X377', 'X378', 'X379', 'X380', 'X382', 'X383', 'X384',
       'X385'],
      dtype='object', length=377)


In [107]:
train_df.head()

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0


In [108]:
test_df.head()

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,1,az,v,n,f,d,t,a,w,0,...,0,0,0,1,0,0,0,0,0,0
1,2,t,b,ai,a,d,b,g,y,0,...,0,0,1,0,0,0,0,0,0,0
2,3,az,v,as,f,d,a,j,j,0,...,0,0,0,1,0,0,0,0,0,0
3,4,az,l,n,f,d,z,l,n,0,...,0,0,0,1,0,0,0,0,0,0
4,5,w,s,as,c,d,y,i,m,0,...,1,0,0,0,0,0,0,0,0,0


In [109]:
train_df.describe()

Unnamed: 0,ID,y,X10,X11,X12,X13,X14,X15,X16,X17,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
count,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,...,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0
mean,4205.960798,100.669318,0.013305,0.0,0.075077,0.057971,0.42813,0.000475,0.002613,0.007603,...,0.318841,0.057258,0.314802,0.02067,0.009503,0.008078,0.007603,0.001663,0.000475,0.001426
std,2437.608688,12.679381,0.11459,0.0,0.263547,0.233716,0.494867,0.021796,0.051061,0.086872,...,0.466082,0.232363,0.464492,0.142294,0.097033,0.089524,0.086872,0.040752,0.021796,0.037734
min,0.0,72.11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2095.0,90.82,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4220.0,99.15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,6314.0,109.01,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,8417.0,265.32,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [110]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 378 entries, ID to X385
dtypes: float64(1), int64(369), object(8)
memory usage: 12.1+ MB


In [111]:
d_types = train_df.columns.to_series().groupby(train_df.dtypes).groups
d_types

{dtype('int64'): Index(['ID', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18',
        ...
        'X375', 'X376', 'X377', 'X378', 'X379', 'X380', 'X382', 'X383', 'X384',
        'X385'],
       dtype='object', length=369),
 dtype('float64'): Index(['y'], dtype='object'),
 dtype('O'): Index(['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8'], dtype='object')}

#We see that a few columns have the object datatype; let's observe them

In [112]:
# Label-encode the columns with dtype='object'
from sklearn.preprocessing import LabelEncoder

for c in train_df.columns:
    if train_df[c].dtype == 'object':
        lbl = LabelEncoder()
        lbl.fit(list(train_df[c].values) + list(test_df[c].values))
        train_df[c] = lbl.transform(list(train_df[c].values))
        test_df[c] = lbl.transform(list(test_df[c].values))
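
The encoding step can be illustrated with a minimal sketch on toy data (the two small frames and their values are made up for illustration). Fitting the encoder on the union of train and test values means a category that appears only in the test set still receives a code. LabelEncoder numbers the classes in sorted order, so the resulting integers carry no meaningful ordering.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for train_df / test_df (hypothetical values).
toy_train = pd.DataFrame({"X0": ["k", "az", "k"]})
toy_test = pd.DataFrame({"X0": ["az", "t"]})  # "t" never appears in toy_train

enc = LabelEncoder()
# Fit on the union so categories seen only in the test data are known.
enc.fit(pd.concat([toy_train["X0"], toy_test["X0"]]))

toy_train["X0"] = enc.transform(toy_train["X0"])
toy_test["X0"] = enc.transform(toy_test["X0"])

print(list(enc.classes_))        # classes are stored in sorted order
print(toy_train["X0"].tolist())
print(toy_test["X0"].tolist())
```

Fitting on train values alone would make `transform` raise an error for the unseen category "t" in the toy test frame.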

In [113]:
test_df.head()

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,1,24,23,38,5,3,26,0,22,0,...,0,0,0,1,0,0,0,0,0,0
1,2,46,3,9,0,3,9,6,24,0,...,0,0,1,0,0,0,0,0,0,0
2,3,24,23,19,5,3,0,9,9,0,...,0,0,0,1,0,0,0,0,0,0
3,4,24,13,38,5,3,32,11,13,0,...,0,0,0,1,0,0,0,0,0,0
4,5,49,20,19,2,3,31,8,12,0,...,1,0,0,0,0,0,0,0,0,0


In [114]:
train_df.head()

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,37,23,20,0,3,27,9,14,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,37,21,22,4,3,31,11,14,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,24,24,38,2,3,30,9,23,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,24,21,38,5,3,30,11,4,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,24,23,38,5,3,14,3,13,...,0,0,0,0,0,0,0,0,0,0


In [115]:
# Separate the constant (zero-variance) columns from the rest
import statistics
col_zero_var = []
col_nonzero_var = []
allcolumns = train_df.columns
for item in allcolumns:
    var = statistics.variance(train_df[item])
    if var == 0:
        col_zero_var.append(item)
    else:
        col_nonzero_var.append(item)

print(col_zero_var)
print(col_nonzero_var)

['X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290', 'X293', 'X297', 'X330', 'X347']
['ID', 'y', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8', 'X10', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100', 'X101', 'X102', 'X103', 'X104', 'X105', 'X106', 'X108', 'X109', 'X110', 'X111', 'X112', 'X113', 'X114', 'X115', 'X116', 'X117', 'X118', 'X119', 'X120', 'X122', 'X123', 'X124', 'X125', 'X126', 'X127', 'X128', 'X129', 'X130', 'X131', 'X1

In [116]:
print('Before : ', train_df.shape)
train_df = train_df[col_nonzero_var]
print('After : ', train_df.shape)



Before :  (4209, 378)
After :  (4209, 366)
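
As an alternative to looping over columns with `statistics.variance`, the same filter can be sketched with pandas alone. This is only an illustration on a made-up frame, and it assumes the data has no missing values (`nunique` ignores NaN by default): a column has zero variance exactly when it holds a single distinct value.

```python
import pandas as pd

# Toy frame (hypothetical): "X11" is constant, like the columns dropped above.
toy = pd.DataFrame({
    "ID": [0, 6, 7, 9],
    "X10": [0, 1, 0, 0],
    "X11": [0, 0, 0, 0],
    "y": [130.81, 88.53, 76.26, 80.62],
})

# Count distinct values per column; constant columns have exactly one.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
kept = toy.drop(columns=constant_cols)

print(constant_cols)
print(kept.shape)
```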


In [117]:
# as col_nonzero_var has 'y' column, removing it to apply on Test data
col_nonzero_var1 = []
for col in col_nonzero_var:
    if col != 'y':
        col_nonzero_var1.append(col)
        
print(col_nonzero_var1)

['ID', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8', 'X10', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100', 'X101', 'X102', 'X103', 'X104', 'X105', 'X106', 'X108', 'X109', 'X110', 'X111', 'X112', 'X113', 'X114', 'X115', 'X116', 'X117', 'X118', 'X119', 'X120', 'X122', 'X123', 'X124', 'X125', 'X126', 'X127', 'X128', 'X129', 'X130', 'X131', 'X132', 'X133', 'X134', 'X135', 'X136', 'X137', 'X138', 'X139', 'X140', 'X141', 'X142', 'X143', 'X144',

In [118]:
# Keep the same columns in the test data

print('Before : ', test_df.shape)
test_df = test_df[col_nonzero_var1]
print('After : ', test_df.shape)

Before :  (4209, 377)
After :  (4209, 365)


In [119]:
# Function to check whether any column contains null values
def checknull(df):
    null_columns = []
    for col in df.columns:
        null_chk = sum(pd.isnull(df[col]))
        if (null_chk != 0):
            null_columns.append(col)
    return null_columns


In [120]:
# Let's check whether there are any null values
print(checknull(train_df))
print(checknull(test_df))

[]
[]


#So none of the columns contain null values.
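
The same check can be written without an explicit loop. The sketch below uses a small made-up frame with one missing value to show what a non-empty result looks like.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical) with one missing value in "X1".
toy = pd.DataFrame({"X0": [1, 2, 3], "X1": [0.5, np.nan, 1.5]})

# isnull() marks missing cells; any() reduces each column to a single flag.
has_null = toy.isnull().any()
null_columns = has_null[has_null].index.tolist()

print(null_columns)
```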

In [121]:
# Separate the features (X) from the target (y) and convert everything to NumPy arrays
from sklearn.model_selection import train_test_split

X = train_df.drop("y",axis=1).values
print('Input : ', X.shape)
y = train_df["y"].values
print('Output : ', y.shape)
test_df = test_df.values
print('test_df shape : ', test_df.shape)


Input :  (4209, 365)
Output :  (4209,)
test_df shape :  (4209, 365)


In [122]:
# Split the given training dataset 'train_df' into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [123]:
test_df 

array([[   1,   24,   23, ...,    0,    0,    0],
       [   2,   46,    3, ...,    0,    0,    0],
       [   3,   24,   23, ...,    0,    0,    0],
       ...,
       [8413,   51,   23, ...,    0,    0,    0],
       [8414,   10,   23, ...,    0,    0,    0],
       [8416,   46,    1, ...,    0,    0,    0]], dtype=int64)

In [124]:
print('X_train : ', X_train.shape)
print('X_test : ', X_test.shape)
print('y_train : ', y_train.shape)
print('y_test : ', y_test.shape)


X_train :  (3367, 365)
X_test :  (842, 365)
y_train :  (3367,)
y_test :  (842,)


In [125]:
y_test   

array([ 96.49,  96.93, 114.22,  88.1 ,  92.63,  93.83, 109.79,  89.03,
       109.38, 103.9 ,  93.59,  91.03,  93.5 , 110.7 , 108.92,  75.88,
        89.34, 118.44, 114.88,  92.21,  89.75, 108.01, 108.42, 100.21,
       109.09, 112.97,  95.57,  84.76,  92.34, 101.89,  92.83, 110.43,
        94.35, 111.33, 107.68,  98.59, 106.27, 110.52, 130.6 , 100.21,
        88.29, 118.53,  91.88,  89.24,  93.  ,  87.4 ,  90.69, 121.31,
        91.46, 114.35, 117.35,  87.28, 101.12,  85.14,  89.8 ,  87.79,
        91.67, 112.89,  90.82,  91.68, 102.91, 101.98,  88.1 ,  87.66,
       102.78, 106.77, 131.69, 109.74,  97.99,  78.91,  90.34,  88.73,
       107.96,  88.14, 131.56, 121.23, 109.19, 107.97, 101.49, 101.03,
       103.47, 108.43, 128.14, 104.79, 112.71,  95.43,  89.35,  94.83,
        91.59, 101.71, 110.81, 105.94, 101.59,  89.96,  92.29, 117.36,
        91.81,  91.32, 104.29, 101.34,  90.67,  94.31,  87.86, 108.87,
        98.31, 136.96,  95.65,  99.86, 112.31,  75.54, 107.16, 107.9 ,
      

In [126]:
# Apply PCA
from sklearn.decomposition import PCA
# n_components=None was used first to inspect the variance, then replaced by 12
pca = PCA(n_components = 12)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
# Transform the given test data set with the same fitted PCA
test_df = pca.transform(test_df)

# Explained variance ratio of each component (per component, not cumulative)
explained_variance = pca.explained_variance_ratio_

In [127]:
explained_variance

array([9.99904302e-01, 4.09031166e-05, 2.23421418e-05, 1.12154931e-05,
       8.18773618e-06, 7.67810684e-06, 1.43435904e-06, 6.72260943e-07,
       3.91686235e-07, 2.59578109e-07, 2.17613213e-07, 2.09964812e-07])
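
The first component above accounts for about 99.99% of the variance. One plausible explanation, not verified in this notebook, is that `ID` is still among the inputs and its range (0 to 8417) dwarfs the 0/1 indicator columns, since PCA is sensitive to feature scale. The sketch below reproduces the effect on synthetic data (all names and values are made up) and shows how standardizing the columns first changes the picture.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in: one wide-range "ID"-like column plus five 0/1 flags.
id_like = np.arange(n, dtype=float) * 17.0
flags = rng.integers(0, 2, size=(n, 5)).astype(float)
X_toy = np.column_stack([id_like, flags])

# PCA on the raw matrix: the wide-range column swamps everything else.
ratio_raw = PCA(n_components=3).fit(X_toy).explained_variance_ratio_

# PCA after standardizing each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_toy)
ratio_scaled = PCA(n_components=3).fit(X_scaled).explained_variance_ratio_

print(ratio_raw)
print(ratio_scaled)
```

If this were applied to the notebook's data, the scaler would be fit on `X_train` only and then applied to `X_test` and `test_df`, in the same way `pca` is. Dropping `ID` before PCA is another option worth testing.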

In [128]:
X_train

array([[-2.85394807e+03,  2.46650378e+01, -6.05835204e+00, ...,
        -3.18609442e-01, -1.05518896e-01, -3.29071568e-02],
       [-3.29395519e+03,  3.04903015e+01,  1.66598170e+01, ...,
         7.87493230e-03, -5.16769030e-01, -1.58300935e+00],
       [ 1.67609150e+03, -3.82341068e+00, -9.87003364e+00, ...,
         9.67346199e-01, -7.82362559e-02, -4.83028517e-01],
       ...,
       [ 9.05092868e+02, -1.27409536e+01,  1.23703669e+00, ...,
         9.60468195e-01, -1.08806613e+00,  9.60832207e-01],
       [-1.00094142e+03, -1.19443866e+01,  2.24935713e+00, ...,
        -1.47891951e+00, -1.11814092e-01,  3.47387572e-01],
       [-1.23894154e+03,  3.86709084e+01,  7.34625205e+00, ...,
        -8.21830386e-01,  1.63443333e+00,  9.43115630e-02]])

#Applying XGBoost

In [129]:
import xgboost as xgb

In [130]:
D_train = xgb.DMatrix(X_train, label=y_train)
D_test = xgb.DMatrix(X_test, label=y_test)
D_test_df = xgb.DMatrix(test_df)

In [131]:
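# Train the xgb model, then predict on the test data
y_mean = np.mean(y_train)

param = {
    'eta': 0.0045,
    'max_depth': 4,
    'subsample': 0.93,
    'objective': 'reg:squarederror',  # current name for the deprecated 'reg:linear'
    'eval_metric': 'rmse',
    'base_score': y_mean, # base prediction = mean(target)
    'verbosity': 0  # replaces the deprecated 'silent' flag
}

# Number of boosting rounds. With xgb.train the tree count is set here;
# 'n_trees' is not an XGBoost parameter, so it is not listed in param.
steps = 1250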

In [132]:
model = xgb.train(param, D_train, steps)

In [133]:
# Predict the output for the held-out test split of the original training data
y_pred = model.predict(D_test)

In [134]:
print(y_pred.shape)
y_pred

(842,)


array([108.30901 ,  96.94895 , 111.65728 ,  97.53912 ,  99.6185  ,
        94.10359 , 101.70929 , 101.22127 , 105.089874,  97.84861 ,
        94.23259 ,  94.273575,  95.83564 , 110.30177 , 103.342606,
        83.98297 ,  96.45807 ,  97.835144, 108.268524,  97.623665,
        98.5753  ,  95.80192 , 110.903046, 103.16148 , 102.709236,
        95.184906, 100.240715,  94.68312 ,  95.74212 , 102.902405,
        95.04305 , 109.25165 , 103.99795 , 108.92621 ,  97.33527 ,
        95.15915 , 111.90012 , 105.55895 ,  98.07698 , 103.09326 ,
       100.59174 ,  92.63013 ,  99.41328 ,  94.997925,  91.89498 ,
        92.18721 ,  95.30876 ,  93.62407 ,  98.01995 , 112.06244 ,
       108.84588 ,  92.22907 , 102.9413  ,  91.10362 , 104.694275,
        97.05176 ,  95.47963 , 109.13242 ,  96.06717 ,  97.17301 ,
       103.49769 ,  95.547806,  99.71703 ,  98.62    , 103.456955,
        93.97748 , 103.92313 , 117.49625 ,  99.89029 ,  76.96495 ,
        96.68688 ,  92.711205, 107.53766 ,  97.81105 , 103.956
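
The predictions are printed but never scored against `y_test`. A minimal sketch of two common regression scores, RMSE (the `eval_metric` set in `param`) and R², is given below. The helper names `rmse` and `r2` are made up for this illustration, and the inputs are only the first five values of `y_test` and `y_pred` as printed above, so the resulting numbers say nothing about the full hold-out set.

```python
import numpy as np

# First five values of y_test and y_pred as printed in the outputs above.
y_true_head = np.array([96.49, 96.93, 114.22, 88.10, 92.63])
y_pred_head = np.array([108.30901, 96.94895, 111.65728, 97.53912, 99.6185])

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

print("RMSE:", rmse(y_true_head, y_pred_head))
print("R^2 :", r2(y_true_head, y_pred_head))
```

In the notebook, the same two functions could be called as `rmse(y_test, y_pred)` and `r2(y_test, y_pred)` to score the full 842-row split.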

In [135]:
# predict output for test_df values
predict_test_df = model.predict(D_test_df)

In [136]:
print(predict_test_df.shape)
predict_test_df


(4209,)


array([ 81.1214  , 101.27536 ,  93.7672  , ..., 101.127144, 106.30044 ,
        93.611   ], dtype=float32)