## Mercedes-Benz Greener Manufacturing

**Reduce the time a Mercedes-Benz spends on the test bench.**

**Problem Statement Scenario:**
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before it hits the road, the company’s engineers have developed a robust testing system. For one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on the production lines. However, optimizing the speed of the testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. You will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.

**The following actions should be performed:**

- If the variance of any column is zero, remove that column.
- Check for null and unique values in the test and train sets.
- Apply a label encoder.
- Perform dimensionality reduction.
- Predict values using XGBoost.

In [40]:
#Importing libraries
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import xgboost as xgb

In [41]:
#Variable assignments
train_df=pd.read_csv(r"train.csv")

In [42]:
train_df


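| | ID | y | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 130.81 | k | v | at | a | d | u | j | o | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 6 | 88.53 | k | t | av | e | d | y | l | o | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 7 | 76.26 | az | w | n | c | d | x | j | x | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 9 | 80.62 | az | t | n | f | d | x | l | e | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 13 | 78.02 | az | v | n | f | d | h | d | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4204 | 8405 | 107.39 | ak | s | as | c | d | aa | d | q | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4205 | 8406 | 108.77 | j | o | t | d | d | aa | h | h | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4206 | 8412 | 109.22 | ak | v | r | a | d | aa | g | e | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4207 | 8415 | 87.48 | al | r | e | f | d | aa | l | u | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |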


In [43]:
print ("Training Data Shape is:",train_df.shape)


Training Data Shape is: (4209, 378)


In [46]:
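#Checking for zero variance in the training data set
#numeric_only=True skips the categorical (object) columns, which newer pandas versions refuse to aggregate
variances = train_df.var(numeric_only=True)
variance_0 = variances[variances==0].index.values
variance_0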

array(['X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290',
       'X293', 'X297', 'X330', 'X347'], dtype=object)

In [47]:
# Drop the variables whose variance is zero
train_df = train_df.drop(variance_0, axis=1)
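
A variance check only covers numeric columns. As a minimal sketch, assuming a toy frame with made-up values in place of `train.csv`, constant columns of any dtype can also be found by counting distinct values with `nunique()`:

```python
import pandas as pd

# Toy frame standing in for train.csv (hypothetical values).
toy = pd.DataFrame({
    "X0": ["k", "az", "k"],   # categorical, varies
    "X10": [0, 1, 0],         # binary, varies
    "X11": [0, 0, 0],         # binary, constant (zero variance)
    "X12": ["d", "d", "d"],   # categorical, constant
})

# A column is constant when it holds exactly one distinct value.
constant_cols = toy.columns[toy.nunique() == 1].tolist()
reduced = toy.drop(columns=constant_cols)

print("constant columns:", constant_cols)
print("remaining columns:", list(reduced.columns))
```

If a separate test set is processed later, dropping the same `constant_cols` list from it keeps both frames aligned.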

In [50]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 366 entries, ID to X385
dtypes: float64(1), int64(357), object(8)
memory usage: 11.8+ MB


In [51]:
#Checking if null values are present
train_df.isnull().sum()

ID      0
y       0
X0      0
X1      0
X2      0
       ..
X380    0
X382    0
X383    0
X384    0
X385    0
Length: 366, dtype: int64

In [52]:
#Checking if null values are present - training data
null_values = train_df.isnull().sum()[train_df.isnull().sum()!=0]
null_values.size

0
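
The task list asks for the null check on both the train and test sets, while the cells here load only `train.csv`. The sketch below shows one way to run the same check over several frames at once; `null_report` and the two toy frames are hypothetical stand-ins, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def null_report(frames):
    """Map each frame name to the null counts of its columns that contain nulls."""
    report = {}
    for name, df in frames.items():
        counts = df.isnull().sum()
        report[name] = counts[counts > 0]
    return report


# Toy frames (hypothetical values); the test frame has missing entries on purpose.
train_toy = pd.DataFrame({"X0": ["k", "az"], "X10": [0, 1]})
test_toy = pd.DataFrame({"X0": ["k", None], "X10": [0, np.nan]})

report = null_report({"train": train_toy, "test": test_toy})
for name, counts in report.items():
    print(name, "columns with nulls:", counts.to_dict())
```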

In [55]:
#Checking unique values - train df
for x in train_df:
    print(x,": unique values:", train_df[x].unique())

ID : unique values: [   0    6    7 ... 8412 8415 8417]
y : unique values: [130.81  88.53  76.26 ...  85.71 108.77  87.48]
X0 : unique values: ['k' 'az' 't' 'al' 'o' 'w' 'j' 'h' 's' 'n' 'ay' 'f' 'x' 'y' 'aj' 'ak' 'am'
 'z' 'q' 'at' 'ap' 'v' 'af' 'a' 'e' 'ai' 'd' 'aq' 'c' 'aa' 'ba' 'as' 'i'
 'r' 'b' 'ax' 'bc' 'u' 'ad' 'au' 'm' 'l' 'aw' 'ao' 'ac' 'g' 'ab']
X1 : unique values: ['v' 't' 'w' 'b' 'r' 'l' 's' 'aa' 'c' 'a' 'e' 'h' 'z' 'j' 'o' 'u' 'p' 'n'
 'i' 'y' 'd' 'f' 'm' 'k' 'g' 'q' 'ab']
X2 : unique values: ['at' 'av' 'n' 'e' 'as' 'aq' 'r' 'ai' 'ak' 'm' 'a' 'k' 'ae' 's' 'f' 'd'
 'ag' 'ay' 'ac' 'ap' 'g' 'i' 'aw' 'y' 'b' 'ao' 'al' 'h' 'x' 'au' 't' 'an'
 'z' 'ah' 'p' 'am' 'j' 'q' 'af' 'l' 'aa' 'c' 'o' 'ar']
X3 : unique values: ['a' 'e' 'c' 'f' 'd' 'b' 'g']
X4 : unique values: ['d' 'b' 'c' 'a']
X5 : unique values: ['u' 'y' 'x' 'h' 'g' 'f' 'j' 'i' 'd' 'c' 'af' 'ag' 'ab' 'ac' 'ad' 'ae'
 'ah' 'l' 'k' 'n' 'm' 'p' 'q' 's' 'r' 'v' 'w' 'o' 'aa']
X6 : unique values: ['j' 'l' 'd' 'h' 'i' 'a' 'g' 'c' '
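
Printing every unique value gets long on a frame with hundreds of columns. A more compact alternative, sketched here on a toy frame with made-up rows, is to count the distinct values per categorical column with `nunique()`:

```python
import pandas as pd

# Toy frame (hypothetical rows) with two categorical columns and one binary column.
toy = pd.DataFrame({
    "X3": list("aecfdbg"),
    "X4": list("dbcaddd"),
    "X10": [0, 1, 0, 0, 1, 0, 1],
})

# Keep only the non-numeric columns and count the distinct values in each.
cardinality = (
    toy.select_dtypes(exclude="number")
    .nunique()
    .sort_values(ascending=False)
)
print(cardinality)
```

High-cardinality columns stand out immediately in this summary, which helps when choosing an encoding.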

In [56]:
#Preparing LE variable for using label encoder
LE = preprocessing.LabelEncoder()

In [57]:
type(train_df)

pandas.core.frame.DataFrame

In [58]:
#Inspecting unique data types in training data frame
train_df.dtypes.unique()

array([dtype('int64'), dtype('float64'), dtype('O')], dtype=object)

In [59]:
#Getting columns with object data type
train_df.dtypes[train_df.dtypes=='O'].index.values

array(['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8'], dtype=object)

In [60]:
#Applying the label encoder to the non-numerical columns using its fit_transform function
train_df_objects=train_df[train_df.dtypes[train_df.dtypes=='O'].index.values]
for x in train_df_objects:
    train_df[x]=LE.fit_transform(train_df[x])
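
Because the loop above refits the same `LabelEncoder` on each column, the fitted mappings are not kept for later reuse on a test set. One possible alternative, shown as a sketch on toy frames with made-up values, is scikit-learn's `OrdinalEncoder`, which encodes all columns at once and can map categories unseen during fitting to a sentinel value (here `-1`, an arbitrary choice):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical frames (hypothetical values); "zz" never appears in the training frame.
train_cat = pd.DataFrame({"X0": ["k", "az", "t"], "X3": ["a", "e", "c"]})
test_cat = pd.DataFrame({"X0": ["az", "zz"], "X3": ["c", "a"]})

# Fit on the training frame only, then reuse the same mapping on the test frame.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = encoder.fit_transform(train_cat)
test_enc = encoder.transform(test_cat)

print(train_enc)
print(test_enc)  # the unseen "zz" is encoded as -1
```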

In [61]:
train_df

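| | ID | y | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 130.81 | 32 | 23 | 17 | 0 | 3 | 24 | 9 | 14 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 6 | 88.53 | 32 | 21 | 19 | 4 | 3 | 28 | 11 | 14 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 7 | 76.26 | 20 | 24 | 34 | 2 | 3 | 27 | 9 | 23 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 9 | 80.62 | 20 | 21 | 34 | 5 | 3 | 27 | 11 | 4 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 13 | 78.02 | 20 | 23 | 34 | 5 | 3 | 12 | 3 | 13 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4204 | 8405 | 107.39 | 8 | 20 | 16 | 2 | 3 | 0 | 3 | 16 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4205 | 8406 | 108.77 | 31 | 16 | 40 | 3 | 3 | 0 | 7 | 7 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4206 | 8412 | 109.22 | 8 | 23 | 38 | 0 | 3 | 0 | 6 | 4 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4207 | 8415 | 87.48 | 9 | 19 | 25 | 5 | 3 | 0 | 11 | 20 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |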


In [66]:
#Dropping ID values
train_df = train_df.drop(['ID'], axis=1)

In [67]:
#Separating target variable & Data
y_target_training= train_df["y"]
train_df2 = train_df.drop(["y"], axis=1)

In [68]:
#Splitting testing and training data
X_train, X_test, y_train, y_test = train_test_split(train_df2, y_target_training, test_size=0.3,random_state=1)

In [69]:
#Dimensionality reduction - PCA (Principal Component Analysis)
#n_components=0.95 keeps enough components to explain 95% of the variance; PCA is fit on the training split only
sklearn_pca = PCA(n_components=0.95)
sklearn_pca.fit(X_train)
X_train_transformed = sklearn_pca.transform(X_train)
X_test_transformed = sklearn_pca.transform(X_test)
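
When `n_components` is a fraction, scikit-learn chooses the number of components itself. The sketch below, run on synthetic data rather than the Mercedes-Benz features, shows how to inspect that choice through the fitted `n_components_` and `explained_variance_ratio_` attributes:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 10 observed columns driven by 3 latent factors plus small noise.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(200, 10))

# Keep the smallest number of components that explains at least 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print("components kept:", pca.n_components_)
print("cumulative explained variance:", cumulative.round(3))
```

Note that PCA is sensitive to column scale, so label-encoded columns with large integer codes can dominate the leading components unless the features are standardized first.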

In [70]:
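#Applying XGBoost, importing the estimator
#Note: XGBClassifier treats every distinct value of y as a class label (see objective='multi:softprob' in the output below); y is a continuous test time, so XGBRegressor is the usual choice
from xgboost import XGBClassifier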

In [71]:
model= XGBClassifier() 


In [72]:
model.fit(X_train_transformed, y_train)

XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method=None, validate_parameters=False, verbosity=None)

In [78]:
y_pred2 = model.predict(X_test_transformed) 

In [86]:
pd.DataFrame(y_pred2)

           0
0      87.68
1     121.84
2     107.48
3      99.03
4      75.79
...      ...
1258   98.64
1259   89.65
1260   90.46
1261   97.99
1262   90.44

[1263 rows x 1 columns]
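
The target `y` is a continuous test time, so the prediction step is more naturally framed as regression than classification. The following is only a sketch on synthetic data, not the notebook's actual model: the hyperparameters are illustrative guesses, RMSE and R² are example metrics, and the `GradientBoostingRegressor` fallback is an assumption added so the block still runs where xgboost is not installed.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

try:
    from xgboost import XGBRegressor as Regressor
except ImportError:
    # Fallback (assumption): keeps the sketch runnable without xgboost.
    from sklearn.ensemble import GradientBoostingRegressor as Regressor

rng = np.random.default_rng(1)

# Synthetic stand-in for the PCA-transformed features and the test-bench time.
X = rng.normal(size=(500, 5))
y = 100 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=1.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Illustrative hyperparameters, not tuned values.
model = Regressor(n_estimators=200, max_depth=3, learning_rate=0.1, random_state=1)
model.fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Score the held-out predictions against the known targets.
rmse = float(np.sqrt(mean_squared_error(y_te, y_hat)))
r2 = r2_score(y_te, y_hat)
print("RMSE:", round(rmse, 3))
print("R^2:", round(r2, 3))
```

Scoring the held-out split this way gives a number to compare across model or preprocessing changes, which the prediction printout alone does not provide.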