Project 1 - Mercedes-Benz Greener Manufacturing

DESCRIPTION

Reduce the time a Mercedes-Benz spends on the test bench.

# Problem Statement Scenario:
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with the crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. Mercedes-Benz cars are leaders in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before it hits the road, Daimler’s engineers have developed a robust testing system. For one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on the production lines. However, optimizing the speed of the testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. You will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards.

# The following actions should be performed:
* If the variance of any column is zero, remove that column.
* Check for null and unique values in the train and test sets.
* Apply a label encoder.
* Perform dimensionality reduction.
* Predict the test_df values using XGBoost.

In [2]:
import pandas as pd
import numpy as np

In [3]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

In [4]:
train_df.head()

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0


In [5]:
test_df.head()

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,1,az,v,n,f,d,t,a,w,0,...,0,0,0,1,0,0,0,0,0,0
1,2,t,b,ai,a,d,b,g,y,0,...,0,0,1,0,0,0,0,0,0,0
2,3,az,v,as,f,d,a,j,j,0,...,0,0,0,1,0,0,0,0,0,0
3,4,az,l,n,f,d,z,l,n,0,...,0,0,0,1,0,0,0,0,0,0
4,5,w,s,as,c,d,y,i,m,0,...,1,0,0,0,0,0,0,0,0,0


In [6]:
y_train = train_df['y'].values
y_train[0:5]

array([130.81,  88.53,  76.26,  80.62,  78.02])

In [7]:
# Drop the ID column and the target y from the feature frames
train_df = train_df.drop(['ID','y'],axis=1)
test_df = test_df.drop(['ID'],axis=1)

In [8]:
# Check for null values in train_df
train_df.isnull().sum().sum()

0

In [9]:
# Check for null values in test_df
test_df.isnull().sum().sum()

0
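
The task list also asks for unique values, which the cells above do not inspect. A minimal sketch of one way to do it, using small toy frames (`toy_train`, `toy_test`, and their values are illustrative assumptions, not the real data):

```python
import pandas as pd

# Toy stand-ins for train_df / test_df (hypothetical values)
toy_train = pd.DataFrame({"X0": ["k", "az", "az"], "X4": ["d", "d", "d"], "X10": [0, 1, 0]})
toy_test = pd.DataFrame({"X0": ["az", "t", "w"], "X4": ["d", "d", "d"], "X10": [0, 0, 0]})

# Number of distinct values per column in each set
unique_counts = pd.DataFrame({"train": toy_train.nunique(), "test": toy_test.nunique()})
print(unique_counts)

# Categories that occur in the test set but never in the training set
categorical_cols = ["X0", "X4"]
unseen = {
    col: sorted(set(toy_test[col]) - set(toy_train[col]))
    for col in categorical_cols
}
print(unseen)
```

Columns with a single distinct value are candidates for the zero-variance step, and categories listed in `unseen` matter when choosing how to encode the categorical columns.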

In [10]:
test_df_categorical_columns = test_df.select_dtypes(object).columns
test_df_categorical_columns

Index(['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8'], dtype='object')

In [11]:
train_df_categorical_columns = train_df.select_dtypes(object).columns
train_df_categorical_columns

Index(['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8'], dtype='object')

In [12]:
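from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# apply() refits the encoder on each categorical column in turn, so the codes
# depend only on the categories present in test_df
test_df[test_df_categorical_columns] = test_df[test_df_categorical_columns].apply(le.fit_transform)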
test_df.head()

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8,X10,X11,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,21,23,34,5,3,26,0,22,0,0,...,0,0,0,1,0,0,0,0,0,0
1,42,3,8,0,3,9,6,24,0,0,...,0,0,1,0,0,0,0,0,0,0
2,21,23,17,5,3,0,9,9,0,0,...,0,0,0,1,0,0,0,0,0,0
3,21,13,34,5,3,31,11,13,0,0,...,0,0,0,1,0,0,0,0,0,0
4,45,20,17,2,3,30,8,12,0,0,...,1,0,0,0,0,0,0,0,0,0


In [13]:
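from sklearn.preprocessing import LabelEncoder
le2 = LabelEncoder()
# This encoder is fitted independently of the one used for test_df, so the same
# category can receive a different code in each set (e.g. 'az' in X0 is 21 in
# test_df above but 20 here)
train_df[train_df_categorical_columns] = train_df[train_df_categorical_columns].apply(le2.fit_transform)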
train_df.head()

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8,X10,X11,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,32,23,17,0,3,24,9,14,0,0,...,0,0,1,0,0,0,0,0,0,0
1,32,21,19,4,3,28,11,14,0,0,...,1,0,0,0,0,0,0,0,0,0
2,20,24,34,2,3,27,9,23,0,0,...,0,0,0,0,0,0,1,0,0,0
3,20,21,34,5,3,27,11,4,0,0,...,0,0,0,0,0,0,0,0,0,0
4,20,23,34,5,3,12,3,13,0,0,...,0,0,0,0,0,0,0,0,0,0
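
Because the two encoders are fitted separately, a category such as `az` ends up with different integer codes in the two frames. One possible way to keep the codes aligned is to fit each column's encoder on the union of train and test categories. The sketch below uses toy frames with hypothetical values; `encoders` is an illustrative name:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the categorical columns (hypothetical values)
toy_train = pd.DataFrame({"X0": ["k", "az", "az"], "X1": ["v", "t", "w"]})
toy_test = pd.DataFrame({"X0": ["az", "t", "w"], "X1": ["v", "b", "v"]})

encoders = {}
for col in ["X0", "X1"]:
    enc = LabelEncoder()
    # Fit on every category seen in either set so the mapping is shared
    enc.fit(pd.concat([toy_train[col], toy_test[col]]))
    toy_train[col] = enc.transform(toy_train[col])
    toy_test[col] = enc.transform(toy_test[col])
    encoders[col] = enc  # keep the fitted encoder for later inverse_transform

print(toy_train)
print(toy_test)
```

With a shared mapping, `az` is encoded as the same integer in both frames, which is what a model trained on one frame and applied to the other implicitly assumes.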


In [14]:
# Remove columns whose variance is zero in the training set, and drop the
# same columns from the test set so both frames keep identical features
zero_var_cols = train_df.columns[train_df.var() == 0]
train_df = train_df.drop(columns=zero_var_cols)
test_df = test_df.drop(columns=zero_var_cols)

print(train_df.shape)
print(test_df.shape)

(4209, 364)
(4209, 364)


In [15]:
# Perform dimensionality reduction using PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=12)
# Fit the projection on the training features only, then reuse it for the test features
train_pca_results = pca.fit_transform(train_df)
test_pca_results = pca.transform(test_df)
print(train_pca_results.shape)
print(test_pca_results.shape)

(4209, 12)
(4209, 12)
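
The choice of 12 components is not justified in the notebook. One way to check how much variance a given number of components retains is to inspect `explained_variance_ratio_`. The sketch below runs on synthetic data (an assumption, standing in for the encoded feature matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the encoded feature matrix (not the real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))

pca = PCA(n_components=12)
pca.fit(X)
# Running total of the variance share captured by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative.round(3))

# A float n_components asks PCA to keep enough components to reach that share
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_95.fit_transform(X)
print(pca_95.n_components_, X_reduced.shape)
```

If the cumulative share at 12 components is low on the real data, a larger `n_components` or a variance threshold may be worth trying.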


In [16]:
# Split the training data into train and validation subsets
# (note: this overwrites y_train with the training portion of the target)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(train_pca_results,y_train)

In [17]:
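# Create and fit a regression model using XGBoost
import xgboost as xgb
# "reg:linear" is a deprecated alias; "reg:squarederror" is the same squared-error objective
xgb_model = xgb.XGBRegressor(objective="reg:squarederror", random_state=42)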
xgb_model.fit(x_train,y_train)
y_pred = xgb_model.predict(x_test)



In [20]:
# Evaluate the model on the held-out split using R2
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)

0.41491868757716754