**Task** :
    
Reduce the time a Mercedes-Benz spends on the test bench.

**Problem Statement** :

Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before it hits the road, the company’s engineers have developed a robust testing system. Mercedes-Benz is one of the world’s biggest manufacturers of premium cars, and safety and efficiency are paramount on its production lines. However, optimizing the speed of the testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. You will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.

The following actions should be performed:

1. If the variance of any column is equal to zero, remove that column.
2. Check for null and unique values in the test and train sets.
3. Apply a label encoder.
4. Perform dimensionality reduction.
5. Predict your test_df values using XGBoost.

• **IMPORT REQUIRED LIBRARIES . . .**

In [96]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

• **IMPORT & REVIEW THE DATA**

In [97]:
df_train = pd.read_csv('Benz_Train.csv')
df_test = pd.read_csv('Benz_Test.csv')

In [98]:
print('Shape of df_train -', df_train.shape)
print('Shape of df_test -', df_test.shape)

Shape of df_train - (4209, 378)
Shape of df_test - (4209, 377)


In [99]:
df_train.describe()

Unnamed: 0,ID,y,X10,X11,X12,X13,X14,X15,X16,X17,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
count,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,...,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0
mean,4205.960798,100.669318,0.013305,0.0,0.075077,0.057971,0.42813,0.000475,0.002613,0.007603,...,0.318841,0.057258,0.314802,0.02067,0.009503,0.008078,0.007603,0.001663,0.000475,0.001426
std,2437.608688,12.679381,0.11459,0.0,0.263547,0.233716,0.494867,0.021796,0.051061,0.086872,...,0.466082,0.232363,0.464492,0.142294,0.097033,0.089524,0.086872,0.040752,0.021796,0.037734
min,0.0,72.11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2095.0,90.82,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4220.0,99.15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,6314.0,109.01,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,8417.0,265.32,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [100]:
df_test.describe()

Unnamed: 0,ID,X10,X11,X12,X13,X14,X15,X16,X17,X18,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
count,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,...,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0
mean,4211.039202,0.019007,0.000238,0.074364,0.06106,0.427893,0.000713,0.002613,0.008791,0.010216,...,0.325968,0.049656,0.311951,0.019244,0.011879,0.008078,0.008791,0.000475,0.000713,0.001663
std,2423.078926,0.136565,0.015414,0.262394,0.239468,0.494832,0.026691,0.051061,0.093357,0.10057,...,0.468791,0.217258,0.463345,0.137399,0.108356,0.089524,0.093357,0.021796,0.026691,0.040752
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2115.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4202.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,6310.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,8416.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [101]:
df_train.columns

Index(['ID', 'y', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8',
       ...
       'X375', 'X376', 'X377', 'X378', 'X379', 'X380', 'X382', 'X383', 'X384',
       'X385'],
      dtype='object', length=378)

In [102]:
df_test.columns

Index(['ID', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8', 'X10',
       ...
       'X375', 'X376', 'X377', 'X378', 'X379', 'X380', 'X382', 'X383', 'X384',
       'X385'],
      dtype='object', length=377)

• **Remove Unwanted Columns from Dataset . . .**

In [103]:
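# drop() returns a new DataFrame. The results are not assigned back here,
# so df_train and df_test still contain the 'ID' column afterwards.
df_train.drop(['ID'], axis=1)
df_test.drop(['ID'], axis=1)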

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8,X10,X11,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,az,v,n,f,d,t,a,w,0,0,...,0,0,0,1,0,0,0,0,0,0
1,t,b,ai,a,d,b,g,y,0,0,...,0,0,1,0,0,0,0,0,0,0
2,az,v,as,f,d,a,j,j,0,0,...,0,0,0,1,0,0,0,0,0,0
3,az,l,n,f,d,z,l,n,0,0,...,0,0,0,1,0,0,0,0,0,0
4,w,s,as,c,d,y,i,m,0,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,aj,h,as,f,d,aa,j,e,0,0,...,0,0,0,0,0,0,0,0,0,0
4205,t,aa,ai,d,d,aa,j,y,0,0,...,0,1,0,0,0,0,0,0,0,0
4206,y,v,as,f,d,aa,d,w,0,0,...,0,0,0,0,0,0,0,0,0,0
4207,ak,v,as,a,d,aa,c,q,0,0,...,0,0,1,0,0,0,0,0,0,0


# 1. IF FOR ANY COLUMN(S), THE VARIANCE IS EQUAL TO ZERO, THEN YOU NEED TO REMOVE THOSE VARIABLE(S) . . .

In [104]:
# Variance is only defined for the numeric columns.
train_var = df_train.var(numeric_only=True)
train_var[train_var == 0]

X11     0.0
X93     0.0
X107    0.0
X233    0.0
X235    0.0
X268    0.0
X289    0.0
X290    0.0
X293    0.0
X297    0.0
X330    0.0
X347    0.0
dtype: float64

In [105]:
df_train = df_train.drop(['X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290', 'X293', 'X297', 'X330', 'X347'], axis=1)

In [106]:
df_train.shape

(4209, 366)

In [107]:
# Variance is only defined for the numeric columns.
test_var = df_test.var(numeric_only=True)
test_var[test_var == 0]

X257    0.0
X258    0.0
X295    0.0
X296    0.0
X369    0.0
dtype: float64

In [108]:
df_test = df_test.drop(['X257', 'X258', 'X295', 'X296', 'X369'], axis=1)

In [109]:
df_test.shape

(4209, 372)
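
The two drop calls above remove different columns from each frame (12 from the training set, 5 from the test set), so the two frames no longer share the same feature columns. A minimal sketch of one alternative, assuming the goal is to keep both frames aligned: detect constant columns in the training frame only and drop those same columns from both. The helper name `drop_constant_columns` and the toy frames are illustrative and not part of the notebook.

```python
import pandas as pd

def drop_constant_columns(train, test):
    """Drop columns that are constant in `train` from both frames."""
    # A column with at most one distinct value is constant; for numeric
    # columns this is the same condition as zero variance.
    constant = [c for c in train.columns if train[c].nunique() <= 1]
    # errors="ignore" skips names that are missing from `test`.
    train_out = train.drop(columns=constant)
    test_out = test.drop(columns=constant, errors="ignore")
    return train_out, test_out, constant

# Toy data: X11 is constant in the training frame only.
toy_train = pd.DataFrame({"X10": [0, 1, 0], "X11": [0, 0, 0], "X0": ["a", "b", "a"]})
toy_test = pd.DataFrame({"X10": [1, 1, 0], "X11": [0, 1, 0], "X0": ["b", "c", "a"]})

train_out, test_out, dropped = drop_constant_columns(toy_train, toy_test)
print(dropped)
print(list(train_out.columns), list(test_out.columns))
```

Deciding from the training frame alone avoids using any information from the test set when choosing features.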

# 2. CHECK FOR NULL AND UNIQUE VALUES FOR TEST AND TRAIN SETS

• **Check for Null Values . . .**

In [110]:
df_train.isna().any()

ID      False
y       False
X0      False
X1      False
X2      False
        ...  
X380    False
X382    False
X383    False
X384    False
X385    False
Length: 366, dtype: bool

In [111]:
df_test.isna().any()

ID      False
X0      False
X1      False
X2      False
X3      False
        ...  
X380    False
X382    False
X383    False
X384    False
X385    False
Length: 372, dtype: bool

► Neither dataset contains null values.
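
The `isna().any()` output above is truncated in the middle, so it cannot be read off column by column. A small sketch of a more compact check, assuming a single count is enough: sum the nulls per column and then overall. The toy frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one null in each column.
toy = pd.DataFrame({"X0": ["a", None, "c"], "X10": [0, 1, np.nan]})

nulls_per_column = toy.isna().sum()          # null count for every column
total_nulls = int(nulls_per_column.sum())    # single number for the whole frame

# Show only the columns that actually contain nulls.
print(nulls_per_column[nulls_per_column > 0])
print("total nulls:", total_nulls)
```

On the notebook's frames the same two lines would be applied to `df_train` and `df_test`; a total of zero confirms the conclusion without scanning the truncated listing.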

• **Check for Unique Values . . .**

In [112]:
df_train.describe(include=['object'])

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8
count,4209,4209,4209,4209,4209,4209,4209,4209
unique,47,27,44,7,4,29,12,25
top,z,aa,as,c,d,w,g,j
freq,360,833,1659,1942,4205,231,1042,277


In [113]:
# Copy the categorical columns so they can be modified without touching df_train.
df_train_object = df_train.select_dtypes(include=['object']).copy()
df_train_object.head()

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8
0,k,v,at,a,d,u,j,o
1,k,t,av,e,d,y,l,o
2,az,w,n,c,d,x,j,x
3,az,t,n,f,d,x,l,e
4,az,v,n,f,d,h,d,n


In [114]:
df_test.describe(include=['object'])

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8
count,4209,4209,4209,4209,4209,4209,4209,4209
unique,49,27,45,7,4,32,12,25
top,ak,aa,as,c,d,v,g,e
freq,432,826,1658,1900,4203,246,1073,274


In [115]:
df_test_object = df_test.select_dtypes(include=['object'])
df_test_object.head()

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8
0,az,v,n,f,d,t,a,w
1,t,b,ai,a,d,b,g,y
2,az,v,as,f,d,a,j,j
3,az,l,n,f,d,z,l,n
4,w,s,as,c,d,y,i,m
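
The unique counts differ between the two sets (for example X0 has 47 distinct values in train and 49 in test), so the test set must contain some values that never occur in train. A hedged sketch of how those values could be listed; the helper `unseen_categories` and the toy frames are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def unseen_categories(train, test, columns):
    """Map each column to the sorted values found in `test` but not in `train`."""
    result = {}
    for col in columns:
        extra = set(test[col].unique()) - set(train[col].unique())
        if extra:
            result[col] = sorted(extra)
    return result

# Toy data: X0 has two test-only values, X3 has none.
toy_train = pd.DataFrame({"X0": ["az", "k", "t"], "X3": ["a", "c", "f"]})
toy_test = pd.DataFrame({"X0": ["az", "w", "bb"], "X3": ["a", "c", "c"]})

unseen = unseen_categories(toy_train, toy_test, ["X0", "X3"])
print(unseen)
```

Values listed this way matter for the encoding step, because an encoder fitted on train alone cannot transform them.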


# 3. APPLY LABEL ENCODER . . .

In [116]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [117]:
for i in df_train_object:
    # Use a fresh encoder per column and replace the strings with integer codes.
    # Only the training columns are encoded here; df_test_object is left as strings.
    le = LabelEncoder()
    df_train_object[i] = le.fit_transform(list(df_train_object[i].values))

In [118]:
df_train_object.head(10)

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8
0,32,23,17,0,3,24,9,14
1,32,21,19,4,3,28,11,14
2,20,24,34,2,3,27,9,23
3,20,21,34,5,3,27,11,4
4,20,23,34,5,3,12,3,13
5,40,3,25,2,3,11,7,18
6,9,19,25,5,3,10,7,18
7,36,13,16,5,3,10,9,0
8,43,20,16,4,3,10,8,7
9,31,3,14,2,3,10,0,4
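
The loop above encodes only the training columns. One common variant, sketched here under the assumption that the test set should later receive the same codes, fits each encoder on the combined train and test values and keeps it for reuse. The toy frames and the `encoders` dictionary are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy_train = pd.DataFrame({"X0": ["k", "az", "t"]})
toy_test = pd.DataFrame({"X0": ["az", "w", "k"]})

encoders = {}
for col in ["X0"]:
    le = LabelEncoder()
    # Fit on the union of values so test-only labels (here "w") get a code too.
    le.fit(list(toy_train[col]) + list(toy_test[col]))
    toy_train[col] = le.transform(list(toy_train[col]))
    toy_test[col] = le.transform(list(toy_test[col]))
    encoders[col] = le  # keep the fitted encoder for later reuse

print(list(encoders["X0"].classes_))
print(toy_train["X0"].tolist(), toy_test["X0"].tolist())
```

`LabelEncoder` assigns codes in sorted label order, so the codes depend on which values were seen during fitting; fitting separately on train and test would give the same string different codes.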


# 4. PERFORM DIMENSIONALITY REDUCTION . . .

In [119]:
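# Features: only the eight label-encoded categorical columns; the binary columns are not used.
x = df_train_object
# Target: time on the test bench.
y = df_train['y']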

In [120]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=123)

In [121]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(2946, 8)
(2946,)
(1263, 8)
(1263,)


In [122]:
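from sklearn.decomposition import PCA
# A fractional n_components keeps as many components as needed to explain 95% of the variance.
pc = PCA(n_components=0.95)
# Fit on the training split only, then apply the same projection to both splits.
pc.fit(x_train)
x_train = pc.transform(x_train)
x_test = pc.transform(x_test)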

In [123]:
pc.components_

array([[-9.34265900e-01,  2.48695343e-01,  2.51376752e-01,
         9.89320283e-03,  9.17405074e-06, -1.01686970e-02,
        -1.17093459e-02, -4.20618088e-02],
       [ 2.56330111e-01, -1.93474507e-02,  9.63494112e-01,
        -1.96027686e-02, -1.37752144e-05, -4.29425311e-02,
         2.68999676e-02, -5.14558900e-02],
       [ 1.37908587e-01,  5.40571119e-01,  1.33982042e-02,
         2.13465821e-02,  1.42729002e-04,  8.28927495e-01,
        -2.12327429e-02,  2.36062816e-02],
       [ 2.01111303e-01,  8.00012652e-01, -5.78088372e-02,
         3.75921414e-02, -3.14900020e-04, -5.57291381e-01,
        -1.01559146e-02,  6.40472034e-02],
       [-4.26220238e-02, -5.44399674e-02,  6.34737093e-02,
        -3.30376036e-03, -1.01291699e-04,  1.37420091e-02,
         1.69875241e-02,  9.95340342e-01]])

In [124]:
pc.explained_variance_ratio_

array([0.39971569, 0.22489107, 0.13690151, 0.12122863, 0.0954623 ])
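
With `n_components=0.95`, PCA keeps the smallest number of components whose cumulative explained variance reaches 95%. The arithmetic can be checked directly from the ratios printed above; the variable names below are illustrative.

```python
import numpy as np

# Explained variance ratios as printed by pc.explained_variance_ratio_ above.
ratios = np.array([0.39971569, 0.22489107, 0.13690151, 0.12122863, 0.0954623])

cumulative = np.cumsum(ratios)
# Index of the first cumulative value at or above 0.95, converted to a count.
n_needed = int(np.argmax(cumulative >= 0.95)) + 1

print(np.round(cumulative, 4))
print("components needed for 95%:", n_needed)
```

Four components reach only about 88% of the variance, which is why five of the eight possible components were retained.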

# 5. PREDICT YOUR TEST_DF VALUES USING XGBOOST . . .

In [125]:
from sklearn.metrics import r2_score, mean_squared_error
from xgboost import XGBRegressor
xgbr = XGBRegressor(random_state=123)

In [126]:
model = xgbr.fit(x_train, y_train)

In [127]:
ypred_test = model.predict(x_test)

In [137]:
ypred_test

array([108.031265,  94.25957 ,  96.51694 , ...,  94.389   , 116.920006,
       102.40319 ], dtype=float32)

In [138]:
ypred_train = model.predict(x_train)

In [139]:
ypred_train

array([ 91.040794, 111.4589  , 100.83158 , ..., 104.627914,  94.22453 ,
        94.77243 ], dtype=float32)

In [140]:
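# Training-set score. Note: r2_score expects (y_true, y_pred); the arguments are
# passed in the opposite order here, and R² is not symmetric in its arguments.
print(r2_score(ypred_train, y_train))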

0.8371947617088391


In [141]:
# Training-set mean squared error (symmetric, so the argument order does not change the value).
print(mean_squared_error(y_train, ypred_train))

18.18063508762394
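
Argument order matters for `r2_score`, whose signature is `r2_score(y_true, y_pred)`, but not for `mean_squared_error`. A small worked example on made-up numbers (not the notebook's data) shows the difference:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Made-up values for illustration only.
y_true = np.array([90.0, 100.0, 110.0, 120.0])
y_pred = np.array([95.0, 100.0, 105.0, 110.0])

r2_correct = r2_score(y_true, y_pred)   # 1 - 150/500
r2_swapped = r2_score(y_pred, y_true)   # 1 - 150/125
mse = mean_squared_error(y_true, y_pred)
mse_swapped = mean_squared_error(y_pred, y_true)

print(r2_correct, r2_swapped)
print(mse, mse_swapped)
```

R² divides the squared error by the variance of whichever array is passed first, so swapping the arguments changes the denominator and therefore the score. Applying the same two metrics to `y_test` and `ypred_test` would give a hold-out estimate alongside the training-set figures.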


In [142]:
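# x_test is the 30% hold-out split of the training data, so this repeats ypred_test;
# the rows of df_test are not passed through the encoder, PCA, or model here.
df_test_prediction = model.predict(x_test)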

In [143]:
df_test_prediction

array([108.031265,  94.25957 ,  96.51694 , ...,  94.389   , 116.920006,
       102.40319 ], dtype=float32)
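
To score the rows of `df_test` itself, the same chain of steps (encode, project with the fitted PCA, predict) would have to be applied to its categorical columns. The following is a minimal sketch on random toy data, under stated assumptions: encoders are fitted on train and test values together, and scikit-learn's `GradientBoostingRegressor` is used as a stand-in so the sketch runs without xgboost installed. `XGBRegressor(random_state=123)` exposes the same `fit`/`predict` interface and could be substituted. All names and data here are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for XGBRegressor
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the categorical columns of df_train and df_test.
rng = np.random.default_rng(123)
cols = ["X0", "X1"]
levels = list("abcdef")
toy_train = pd.DataFrame({c: rng.choice(levels, 60) for c in cols})
toy_train["y"] = rng.normal(100, 10, 60)
toy_test = pd.DataFrame({c: rng.choice(levels, 20) for c in cols})

# 1. Encode both frames with encoders fitted on the union of their values.
train_x = pd.DataFrame(index=toy_train.index)
test_x = pd.DataFrame(index=toy_test.index)
for c in cols:
    le = LabelEncoder().fit(list(toy_train[c]) + list(toy_test[c]))
    train_x[c] = le.transform(list(toy_train[c]))
    test_x[c] = le.transform(list(toy_test[c]))

# 2. Fit PCA on the training features only, then reuse it for the test features.
pca = PCA(n_components=0.95).fit(train_x)

# 3. Fit the regressor on the projected training features.
model = GradientBoostingRegressor(random_state=123)
model.fit(pca.transform(train_x), toy_train["y"])

# 4. Predict one value per test row.
toy_test_prediction = model.predict(pca.transform(test_x))
print(toy_test_prediction.shape)
```

The key point is that the PCA and the model are fitted once on training data and only applied to the test rows, so the test features pass through exactly the same transformation.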