# Mercedes-Benz Greener Manufacturing

**Project Objective:-**

Reduce the time a Mercedes-Benz spends on the test bench.

**Problem Statement Scenario:-**

Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before it hits the road, Daimler’s engineers have developed a robust testing system. Daimler is one of the world’s biggest manufacturers of premium cars, and safety and efficiency are paramount on its production lines. However, optimizing the speed of the testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

The task is to reduce the time that cars spend on the test bench. The dataset represents different permutations of features in a Mercedes-Benz car, and the goal is to predict the time a car takes to pass testing. Better-performing algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards.


## Importing the Modules

In [1]:
import pandas as pd 
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import xgboost
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('train.csv')  # load the training dataset

In [3]:
df.head()

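| | ID | y | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 130.81 | k | v | at | a | d | u | j | o | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 6 | 88.53 | k | t | av | e | d | y | l | o | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 7 | 76.26 | az | w | n | c | d | x | j | x | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 9 | 80.62 | az | t | n | f | d | x | l | e | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 13 | 78.02 | az | v | n | f | d | h | d | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |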


In [4]:
df.describe()

| | ID | y | X10 | X11 | X12 | X13 | X14 | X15 | X16 | X17 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | ... | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 |
| mean | 4205.960798 | 100.669318 | 0.013305 | 0.0 | 0.075077 | 0.057971 | 0.42813 | 0.000475 | 0.002613 | 0.007603 | ... | 0.318841 | 0.057258 | 0.314802 | 0.02067 | 0.009503 | 0.008078 | 0.007603 | 0.001663 | 0.000475 | 0.001426 |
| std | 2437.608688 | 12.679381 | 0.11459 | 0.0 | 0.263547 | 0.233716 | 0.494867 | 0.021796 | 0.051061 | 0.086872 | ... | 0.466082 | 0.232363 | 0.464492 | 0.142294 | 0.097033 | 0.089524 | 0.086872 | 0.040752 | 0.021796 | 0.037734 |
| min | 0.0 | 72.11 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2095.0 | 90.82 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 4220.0 | 99.15 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 6314.0 | 109.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 8417.0 | 265.32 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [5]:
df.shape

(4209, 378)

In [6]:
df.isnull().sum() #Checking for the NULL values in the datasets

ID      0
y       0
X0      0
X1      0
X2      0
X3      0
X4      0
X5      0
X6      0
X8      0
X10     0
X11     0
X12     0
X13     0
X14     0
X15     0
X16     0
X17     0
X18     0
X19     0
X20     0
X21     0
X22     0
X23     0
X24     0
X26     0
X27     0
X28     0
X29     0
X30     0
       ..
X355    0
X356    0
X357    0
X358    0
X359    0
X360    0
X361    0
X362    0
X363    0
X364    0
X365    0
X366    0
X367    0
X368    0
X369    0
X370    0
X371    0
X372    0
X373    0
X374    0
X375    0
X376    0
X377    0
X378    0
X379    0
X380    0
X382    0
X383    0
X384    0
X385    0
Length: 378, dtype: int64

In [7]:
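# Find the numeric columns whose variance is zero (constant columns).
# numeric_only=True skips the string columns, which recent pandas versions would otherwise reject.
var = df.var(numeric_only=True)
var = var[var == 0].index.values
print(var)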

['X11' 'X93' 'X107' 'X233' 'X235' 'X268' 'X289' 'X290' 'X293' 'X297'
 'X330' 'X347']


In [8]:
# Remove the zero-variance columns found in the previous cell.
train_df = df.drop(columns=var)
train_df.shape

(4209, 366)
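
The zero-variance filter can be tried on a small frame before running it on the full dataset. The sketch below is illustrative only: the `toy` DataFrame and its column names are made up for the example and are not part of the Mercedes-Benz data.

```python
import pandas as pd

# Hypothetical toy data: "a" is constant, "b" varies, "c" is a string column.
toy = pd.DataFrame({"a": [0, 0, 0], "b": [0, 1, 0], "c": ["x", "y", "x"]})

# Variance is only defined for numeric columns, so string columns are skipped.
variances = toy.var(numeric_only=True)
constant_cols = variances[variances == 0].index.tolist()

# Drop the constant columns; the string column is left untouched.
reduced = toy.drop(columns=constant_cols)
print(constant_cols)
print(reduced.columns.tolist())
```

A constant column has the same value in every row, so it cannot help a model distinguish one car configuration from another.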

In [9]:
test_df = pd.read_csv('test.csv')

Identifying the categorical (string-valued) columns that need label encoding

In [10]:
labels=train_df.describe(include=['object']).columns.values
labels

array(['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8'], dtype=object)

## Label encoding

In [11]:
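lec = LabelEncoder()
for col in labels:
    # Fit on the train and test values together so both share one mapping.
    # pd.concat replaces Series.append, which was removed in pandas 2.0.
    lec.fit(pd.concat([train_df[col], test_df[col]]).values)
    train_df[col] = lec.transform(train_df[col])
    test_df[col] = lec.transform(test_df[col])
train_df.iloc[:5, 2:8]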

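| | X0 | X1 | X2 | X3 | X4 | X5 |
|---|---|---|---|---|---|---|
| 0 | 37 | 23 | 20 | 0 | 3 | 27 |
| 1 | 37 | 21 | 22 | 4 | 3 | 31 |
| 2 | 24 | 24 | 38 | 2 | 3 | 30 |
| 3 | 24 | 21 | 38 | 5 | 3 | 30 |
| 4 | 24 | 23 | 38 | 5 | 3 | 14 |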
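
To see why the encoder is fitted on the train and test values together, here is a minimal sketch on made-up data. The two small Series are hypothetical stand-ins for one categorical column in each file. `LabelEncoder` assigns integer codes in sorted order of the unique values it was fitted on, so a category that appears only in the test file still receives a code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column values; "b" occurs only in the test data.
train_col = pd.Series(["k", "az", "t"])
test_col = pd.Series(["az", "b"])

encoder = LabelEncoder()
# Fit on the union of both columns so every category is known.
encoder.fit(pd.concat([train_col, test_col]))

train_codes = encoder.transform(train_col)
test_codes = encoder.transform(test_col)
print(list(encoder.classes_))  # ['az', 'b', 'k', 't']
print(train_codes, test_codes)
```

Had the encoder been fitted on `train_col` alone, transforming `test_col` would raise an error for the unseen label `"b"`. Note that the resulting integers imply an ordering that the original categories may not have.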


In [12]:
x = train_df.drop(['ID','y'],axis=1)
y = train_df.y
print(x.shape)
print(y.shape)

(4209, 364)
(4209,)


## Principal Component Analysis

In [13]:
pca = PCA(n_components=0.98,svd_solver='full')
X=pca.fit_transform(x)
X.shape

(4209, 12)

In [14]:
display(X)

array([[-2.07635904e-01,  2.44432248e-02,  1.48645082e+01, ...,
         1.73181271e+00,  3.21425250e-01,  3.48388199e-01],
       [-2.44087570e-01,  1.73167468e+00,  1.82110804e+01, ...,
        -1.50111783e-01,  7.65953683e-01, -3.57007990e-01],
       [ 1.62731508e+01,  1.37845266e+01,  1.79269909e+01, ...,
        -4.25315967e-01, -1.11678537e+00,  3.90136871e+00],
       ...,
       [ 3.10773913e+01,  1.60215078e+01, -1.14345977e+01, ...,
        -1.12533822e+00,  1.39088130e+00, -3.23316758e-01],
       [ 2.56465558e+01,  2.81245083e+00, -1.22524179e+01, ...,
         2.04902343e-01,  1.31369943e+00, -1.09328166e+00],
       [-1.88429811e+01, -1.09969233e+01, -1.01931362e+01, ...,
         2.60063925e-01,  4.60020702e-01, -7.25923360e-01]])

In [15]:
print(pca.n_components_)
print(pca.explained_variance_ratio_)

12
[0.40868988 0.21758508 0.13120081 0.10783522 0.08165248 0.0140934
 0.00660951 0.00384659 0.00260289 0.00214378 0.00209857 0.00180388]
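
Passing `n_components=0.98` asks PCA to keep the smallest number of components whose cumulative explained-variance ratio reaches 98%. As a quick check, the sketch below recomputes that cumulative sum from the ratios printed above; only the variable names are introduced here.

```python
import numpy as np

# Explained-variance ratios copied from the output above.
ratios = np.array([0.40868988, 0.21758508, 0.13120081, 0.10783522,
                   0.08165248, 0.0140934, 0.00660951, 0.00384659,
                   0.00260289, 0.00214378, 0.00209857, 0.00180388])

cumulative = np.cumsum(ratios)
target = 0.98

# Position of the first component at which the running total reaches the target.
n_needed = int(np.argmax(cumulative >= target)) + 1
print(n_needed)  # 12
print(cumulative.round(4))
```

The first eleven components together explain about 97.8% of the variance, so the twelfth is needed to cross the 98% threshold.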


In [16]:
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=10)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(3367, 12)
(3367,)
(842, 12)
(842,)


## Implementing the XGBoost Algorithm

In [17]:
# 'reg:squarederror' is the current name of the deprecated 'reg:linear' objective.
model = xgboost.XGBRegressor(objective='reg:squarederror', learning_rate=0.5, max_depth=5, n_estimators=12)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)



In [18]:
# Root mean squared error on the held-out split (arguments are y_true, y_pred).
RMSE_XGB = np.sqrt(mean_squared_error(y_test, y_pred))
print('The RMSE value of XGBoost is : ', RMSE_XGB)

The RMSE value of XGBoost is :  9.20277519691632
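
For reference, RMSE is the square root of the mean of the squared prediction errors, expressed in the same unit as the target `y`. The following sketch computes it by hand on two made-up values; the `rmse` helper and the numbers are illustrative and are not taken from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors of 3 and 4 give sqrt((9 + 16) / 2) = sqrt(12.5).
example = rmse([100.0, 90.0], [103.0, 86.0])
print(round(example, 4))  # 3.5355
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.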


## Cross-checking with Random Forest Regressor

In [19]:
# Fit a random forest on the same split so the two RMSE values are comparable.
reg = RandomForestRegressor(max_features=10, random_state=5)
reg.fit(x_train, y_train)
y_pred_reg = reg.predict(x_test)
RMSE_RFR = np.sqrt(mean_squared_error(y_test, y_pred_reg))
print('The RMSE value of Random Forest Regressor is : ', RMSE_RFR)

The RMSE value of Random Forest Regressor is :  9.834450705692735


**The RMSE value of XGBoost is :  9.202**

**The RMSE value of Random Forest Regressor is :  9.834**