DESCRIPTION

Reduce the time a Mercedes-Benz spends on the test bench.

# Problem Statement Scenario:
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with the crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. Mercedes-Benz cars are leaders in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before it hits the road, Daimler’s engineers have developed a robust testing system. Daimler is one of the world’s biggest manufacturers of premium cars, and safety and efficiency are paramount on its production lines. However, optimizing the speed of the testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. You will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards.

# The following actions should be performed:
* If the variance of any column is equal to zero, remove that column.
* Check for null and unique values in the test and train sets.
* Apply a label encoder.
* Perform dimensionality reduction.
* Predict your test_df values using XGBoost.

In [1]:
# Step1: Import the libraries
import warnings

import numpy as np
import pandas as pd
# PCA is used for dimensionality reduction
from sklearn.decomposition import PCA


In [2]:
# Step2: Read the training dataset

df_train = pd.read_csv('train.csv')
# Show the number of rows and columns
df_train.shape

(4209, 378)

In [3]:
# Let's see what the training dataset looks like
df_train.head()

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0


In [4]:
# Step3: Extract the target variable from the training dataset; this is the value the model learns to predict
y_train = df_train['y'].values

In [8]:
# Step4: Let's see the basic information about the training dataset

df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 378 entries, ID to X385
dtypes: float64(1), int64(369), object(8)
memory usage: 12.1+ MB


In [26]:
# Step5: Group the feature columns by the number of distinct values they hold

# Every column except the identifier and the target is a feature
cols = [c for c in df_train.columns if c not in ['ID', 'y']]

# counts[0]: constant columns, counts[1]: binary columns, counts[2]: categorical columns
counts = [[], [], []]
for item in cols:
    col_type = df_train[item].dtype
    n_unique = len(np.unique(df_train[item]))
    if n_unique == 1:
        counts[0].append(item)
    elif n_unique == 2 and col_type == np.int64:
        counts[1].append(item)
    else:
        counts[2].append(item)

print('Constant Variables: {}, Binary Variables: {}, Categorical Variables: {}\n'
      .format(*[len(item) for item in counts]))
print('Constant Variables:', counts[0])
print('Categorical Variables:', counts[2])

Constant Variables: 12, Binary Variables: 356, Categorical Variables: 8

Constant Variables: ['X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290', 'X293', 'X297', 'X330', 'X347']
Categorical Variables: ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']


In [27]:
# Step6: Read the test.csv data

df_test = pd.read_csv('test.csv')

# Remove the ID and y columns from the features, as they are not used for learning
usable_columns = list(set(df_train.columns) - set(['ID', 'y']))
y_train = df_train['y'].values
id_test = df_test['ID'].values

# Copy the slices so that later column assignments modify independent frames
x_train = df_train[usable_columns].copy()
x_test = df_test[usable_columns].copy()

In [28]:
# Step7: Check for null values in the train and test sets (unique values were counted in Step5)

def check_missing_values(df):
    if df.isnull().any().any():
        print("There are missing values in the dataframe")
    else:
        print("There are no missing values in the dataframe")
check_missing_values(x_train)
check_missing_values(x_test)

There are no missing values in the dataframe
There are no missing values in the dataframe
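
The task also asks for a check of unique values. A minimal sketch of such a check follows; the helper name `summarize_unique_values` and the toy frames are hypothetical stand-ins for `x_train` and `x_test`, not part of the original notebook.

```python
import pandas as pd

def summarize_unique_values(df):
    """Return one row per column with its dtype and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
    }).sort_values("n_unique", ascending=False)

# Toy frame standing in for x_train (hypothetical values).
toy_train = pd.DataFrame({
    "X0": ["k", "az", "az", "k"],
    "X10": [0, 1, 0, 1],
    "X11": [0, 0, 0, 0],   # a single distinct value, so zero variance
})
summary = summarize_unique_values(toy_train)
print(summary)

# Categories present in the test set but absent from the training set
# matter for any encoder that is fitted on the training data only.
toy_test = pd.DataFrame({"X0": ["k", "t"]})
unseen = set(toy_test["X0"]) - set(toy_train["X0"])
print("Categories only in test:", unseen)
```

Columns whose `n_unique` is 1 are the zero-variance candidates, and a non-empty `unseen` set suggests fitting the encoder on the union of train and test values.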


In [29]:
# Step8: Handle zero-variance columns and encode the categorical columns
warnings.filterwarnings("ignore")

for column in usable_columns:
    cardinality = len(np.unique(x_train[column]))
    if cardinality == 1:
        # A column with a single value carries no information.
        # NOTE: drop() returns a new DataFrame; because the result is not
        # assigned here, the constant columns stay in x_train and x_test.
        x_train.drop(column, axis=1)
        x_test.drop(column, axis=1)
    if cardinality > 2:  # Column is categorical
        # Encode each string as the sum of its character codes. This is not a
        # true label encoding: strings made of the same letters in a different
        # order map to the same number.
        mapper = lambda x: sum(ord(ch) for ch in x)
        x_train[column] = x_train[column].apply(mapper)
        x_test[column] = x_test[column].apply(mapper)
x_train.head()

Unnamed: 0,X297,X330,X99,X293,X265,X368,X154,X137,X175,X68,...,X355,X353,X277,X362,X236,X33,X10,X63,X43,X334
0,0,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,1,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
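
The cell above calls `drop` without keeping its result and encodes strings by summing character codes. A hedged sketch of one alternative, using scikit-learn's `LabelEncoder` on toy data, is shown here; the frames `train` and `test` and their values are invented for illustration, and fitting the encoder on train and test values together is an assumption made so that test-only categories still get a code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for x_train / x_test (hypothetical values).
train = pd.DataFrame({"X0": ["k", "az", "k"], "X10": [0, 1, 0], "X11": [0, 0, 0]})
test = pd.DataFrame({"X0": ["az", "t", "k"], "X10": [1, 1, 0], "X11": [0, 1, 0]})

# 1) Zero-variance columns: a single distinct value in the training set.
constant_cols = [c for c in train.columns if train[c].nunique() == 1]
train = train.drop(columns=constant_cols)  # drop() returns a new frame, so reassign
test = test.drop(columns=constant_cols)

# 2) Label-encode the non-numeric columns. The encoder is fitted on the
#    combined train and test values so unseen test categories do not fail.
text_cols = [c for c in train.columns
             if not pd.api.types.is_numeric_dtype(train[c])]
for col in text_cols:
    encoder = LabelEncoder()
    encoder.fit(pd.concat([train[col], test[col]]))
    train[col] = encoder.transform(train[col])
    test[col] = encoder.transform(test[col])

print("Dropped:", constant_cols)
print(train)
print(test)
```

`LabelEncoder` assigns one integer per distinct category in sorted order, so two different strings never share a code, unlike a character-code sum.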


In [30]:
# Step9: Make sure all features are now numeric

print('Feature types:')
x_train.dtypes.value_counts()

Feature types:


int64    376
dtype: int64

In [31]:
# Step10: Perform dimensionality reduction
# Linear dimensionality reduction using Singular Value Decomposition of 
# the data to project it to a lower dimensional space.
n_comp = 12
pca = PCA(n_components=n_comp, random_state=420)
pca2_results_train = pca.fit_transform(x_train)
pca2_results_test = pca.transform(x_test)
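
To make the SVD-based projection concrete, here is a small NumPy sketch of the same idea on synthetic data. It is an illustration under stated assumptions, not the scikit-learn implementation: the data is random, `n_comp = 2` is chosen only for the example, and the resulting components may differ from `PCA` output by a sign flip.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # synthetic data, not the Mercedes-Benz set
X[:, 4] = 3.0                   # a constant (zero-variance) column

n_comp = 2
mean = X.mean(axis=0)
Xc = X - mean                   # PCA centres the data before the SVD
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt[:n_comp]        # principal axes, one per row
projected = Xc @ components.T   # plays the role of pca.fit_transform(X)

# Variance captured by each axis, as a share of the total variance
explained_variance = S ** 2 / (X.shape[0] - 1)
explained_ratio = explained_variance[:n_comp] / explained_variance.sum()
print(projected.shape, explained_ratio)

# A constant column gets (numerically) zero weight in the retained axes
print(np.abs(components[:, 4]).max())
```

New rows are projected with the training mean and the same axes, `(X_new - mean) @ components.T`, which corresponds to `pca.transform(x_test)`. In this sketch the constant column contributes nothing to the retained components.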

In [32]:
# Step11: Training using xgboost

import xgboost as xgb
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hold out 20% of the PCA-transformed training data for validation
x_train, x_valid, y_train, y_valid = train_test_split(
        pca2_results_train,
        y_train, test_size=0.2,
        random_state=4242)

d_train = xgb.DMatrix(x_train, label=y_train)
d_valid = xgb.DMatrix(x_valid, label=y_valid)
d_test = xgb.DMatrix(pca2_results_test)

params = {}
# 'reg:squarederror' is the current name of the objective formerly called 'reg:linear'
params['objective'] = 'reg:squarederror'
params['eta'] = 0.02
params['max_depth'] = 4

# Custom evaluation metric: returns (name, value) so that R² is reported next to RMSE
def xgb_r2_score(preds, dtrain):
    labels = dtrain.get_label()
    return 'r2', r2_score(labels, preds)

watchlist = [(d_train, 'train'), (d_valid, 'valid')]

# Train for up to 1000 rounds; stop when validation R² has not improved for 50 rounds
clf = xgb.train(params, d_train,
                1000, watchlist, early_stopping_rounds=50,
                feval=xgb_r2_score, maximize=True, verbose_eval=10)

[0]	train-rmse:99.14835	train-r2:-58.35295	valid-rmse:98.26297	valid-r2:-67.63754
[10]	train-rmse:81.27653	train-r2:-38.88428	valid-rmse:80.36433	valid-r2:-44.91014
[20]	train-rmse:66.71610	train-r2:-25.87403	valid-rmse:65.77334	valid-r2:-29.75260
[30]	train-rmse:54.86912	train-r2:-17.17722	valid-rmse:53.89136	valid-r2:-19.64525
[40]	train-rmse:45.24710	train-r2:-11.36097	valid-rmse:44.22323	valid-r2:-12.90218
[50]	train-rmse:37.44854	train-r2:-7.46723	valid-rmse:36.37614	valid-r2:-8.40622
[60]	train-rmse:31.14584	train-r2:-4.85695	valid-rmse:30.02252	valid-r2:-5.40732
[70]	train-rmse:26.08417	train-r2:-3.10795	valid-rmse:24.91497	valid-r2:-3.41268
[80]	train-rmse:22.04313	train-r2:-1.93371	valid-rmse:20.83055	valid-r2:-2.08449
[90]	train-rmse:18.84670	train-r2:-1.14458	valid-rmse:17.59590	valid-r2:-1.20092
[100]	train-rmse:16.33346	train-r2:-0.61075	valid-rmse:15.07941	valid-r2:-0.61641
[110]	train-rmse:14.39544	train-r2:-0.25118	valid-rmse:13.14880	valid-r2:-0.22901
[120]	train-rmse:

In [33]:
# Step12: Predict your test_df values using xgboost

p_test = clf.predict(d_test)

sub = pd.DataFrame()
sub['ID'] = id_test
sub['y'] = p_test
sub.to_csv('xgb.csv', index=False)

sub.head()

Unnamed: 0,ID,y
0,1,82.949921
1,2,97.609726
2,3,83.469597
3,4,77.036774
4,5,112.563431


----------------------------------------------------------------------------------------------------------------

**Insights:**
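Because the model reaches R² > 0, it performs better than a baseline that always predicts the mean, and an R² above 0.5 is generally regarded as a good score for a regression model. Hence **the time spent on the test bench can be predicted, and therefore reduced,** using XGBoost with a **squared-error regression objective**. Note that this objective fits gradient-boosted trees, not a linear regression model.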
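
The role of the mean baseline can be checked with a few lines of NumPy. This is a minimal sketch of the R² formula, using the five `y` values shown by `df_train.head()` above; the helper name `r2` is introduced here for illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of the mean baseline
    return 1.0 - ss_res / ss_tot

# The five y values from the head of the training set
y = np.array([130.81, 88.53, 76.26, 80.62, 78.02])

r2_mean = r2(y, np.full_like(y, y.mean()))   # always predict the mean
r2_zero = r2(y, np.zeros_like(y))            # a constant guess far from the targets
r2_perfect = r2(y, y)                        # perfect predictions

print(r2_mean, r2_zero, r2_perfect)
```

Predicting the mean gives an R² of 0, perfect predictions give 1, and predictions far from the targets give negative values. The same effect explains the large negative `r2` values in the early boosting rounds of the training log, where the RMSE is still close to the scale of `y`.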

-----------------------------------------------------------------------------------------------------------------------------