## Project 1 - Mercedes-Benz Greener Manufacturing

DESCRIPTION

Reduce the time a Mercedes-Benz spends on the test bench.

# Problem Statement Scenario:
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with the crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. Mercedes-Benz cars are leaders in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before they hit the road, Daimler’s engineers have developed a robust testing system. As one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on Daimler’s production lines. However, optimizing the speed of their testing system for many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. Others will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards.

# Following actions should be performed:
* If for any column(s), the variance is equal to zero, then you need to remove those variable(s).
* Check for null and unique values for test and train sets
* Apply label encoder.
* Perform dimensionality reduction.
* Predict your test_df values using xgboost

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

### Read train data and test data 

In [2]:
df_train =pd.read_csv('Dataset for the project/train.csv')

In [3]:
df_test =pd.read_csv('Dataset for the project/test.csv')

In [4]:
df_train.head()

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0


In [5]:
df_test.head()

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,1,az,v,n,f,d,t,a,w,0,...,0,0,0,1,0,0,0,0,0,0
1,2,t,b,ai,a,d,b,g,y,0,...,0,0,1,0,0,0,0,0,0,0
2,3,az,v,as,f,d,a,j,j,0,...,0,0,0,1,0,0,0,0,0,0
3,4,az,l,n,f,d,z,l,n,0,...,0,0,0,1,0,0,0,0,0,0
4,5,w,s,as,c,d,y,i,m,0,...,1,0,0,0,0,0,0,0,0,0


### Variance Check for each columns 

In [10]:
variance_train =df_train.select_dtypes(['int','float']).var()
variance_test  =df_test.select_dtypes(['int','float']).var()
variance_train.sort_values(ascending =True)

X289    0.000000e+00
X330    0.000000e+00
X268    0.000000e+00
X347    0.000000e+00
X107    0.000000e+00
            ...     
X362    2.496467e-01
X337    2.497867e-01
X127    2.500357e-01
y       1.607667e+02
ID      5.941936e+06
Length: 370, dtype: float64

####     Columns having zero Variance , this indicate that it is actiually a same number and didnt contribute to our analysis

In [9]:
variance_train[variance_train==0].count()

12

In [11]:
variance_test[variance_test==0].count()

5

In [15]:
zero_variance_cols_train = variance_train[variance_train==0].index.to_list()


In [13]:
zero_variance_cols

['X11',
 'X93',
 'X107',
 'X233',
 'X235',
 'X268',
 'X289',
 'X290',
 'X293',
 'X297',
 'X330',
 'X347']

Drop cloumns having zero variance 

In [23]:
df_train.drop(columns=zero_variance_cols,inplace=True)

In [24]:
df_train.shape

(4209, 366)

In [17]:
df_test[zero_variance_cols].var()

X11     0.000238
X93     0.000475
X107    0.000950
X233    0.000238
X235    0.000238
X268    0.000238
X289    0.000475
X290    0.000238
X293    0.000238
X297    0.000238
X330    0.000238
X347    0.000475
dtype: float64

#### For df_test the variances are not zero but we have to drop the same columns in df_test also 

In [20]:
df_test.drop(columns=zero_variance_cols,inplace=True)

Check for null values 

In [25]:
df_train.isna().sum()

ID      0
y       0
X0      0
X1      0
X2      0
       ..
X380    0
X382    0
X383    0
X384    0
X385    0
Length: 366, dtype: int64

In [22]:
df_cat_train= df_train.select_dtypes('O')
df_cat_train_dummies = pd.get_dummies(df_cat_train)
df_num_train =df_train.select_dtypes(['int','float'])
df_train_final = pd.concat([df_cat_train_dummies,df_num_train], axis=1)
df_train_final.isna().sum()

X0_a     0
X0_aa    0
X0_ab    0
X0_ac    0
X0_ad    0
        ..
X380     0
X382     0
X383     0
X384     0
X385     0
Length: 565, dtype: int64

In [23]:
df_train_final.head()

Unnamed: 0,X0_a,X0_aa,X0_ab,X0_ac,X0_ad,X0_af,X0_ai,X0_aj,X0_ak,X0_al,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
df_cat_test= df_test.select_dtypes('O')
df_cat_test_dummies = pd.get_dummies(df_cat_test)
df_num_test =df_test.select_dtypes(['int','float'])
df_test_final = pd.concat([df_cat_test_dummies,df_num_test], axis=1)
df_test_final.isna().sum()

X0_a     0
X0_ad    0
X0_ae    0
X0_af    0
X0_ag    0
        ..
X380     0
X382     0
X383     0
X384     0
X385     0
Length: 558, dtype: int64

In [26]:
df_train_final.describe()

Unnamed: 0,X0_a,X0_aa,X0_ab,X0_ac,X0_ad,X0_af,X0_ai,X0_aj,X0_ak,X0_al,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
count,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,...,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0
mean,0.004989,0.000475,0.000238,0.000238,0.003326,0.008316,0.008078,0.035876,0.082918,0.015918,...,0.318841,0.057258,0.314802,0.02067,0.009503,0.008078,0.007603,0.001663,0.000475,0.001426
std,0.070467,0.021796,0.015414,0.015414,0.057584,0.09082,0.089524,0.186002,0.27579,0.125174,...,0.466082,0.232363,0.464492,0.142294,0.097033,0.089524,0.086872,0.040752,0.021796,0.037734
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [27]:
df_train_final.head()

Unnamed: 0,X0_a,X0_aa,X0_ab,X0_ac,X0_ad,X0_af,X0_ai,X0_aj,X0_ak,X0_al,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


As The data look standardised no need for standardisation 

#### Preparing the data for model Building

In [28]:
X=df_train_final.drop(columns=['y','ID'])
y=df_train_final.y

In [29]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(X,y, random_state=100)

#### As the Model contains lot of features it is very much required to do the Dimensionality Reduction 

In [30]:
from sklearn.decomposition import PCA
pca_model = PCA(n_components=.95)
pca_model.fit(x_train)
pca_model.fit(x_test)
X_pca_train= pca_model.transform(x_train)
X_pca_test = pca_model.transform(x_test)

In [31]:
X_pca_train

array([[ 0.96354924, -0.8386578 , -0.75073752, ...,  0.19923963,
         0.16036081,  0.25615965],
       [-0.44018992, -2.7589094 , -0.21738357, ..., -0.11892577,
         0.00529397, -0.17920985],
       [-1.1095658 , -2.58640221, -0.87230401, ..., -0.05402031,
         0.07071705,  0.10643146],
       ...,
       [-0.61221108,  1.83066307, -1.10053283, ..., -0.11366487,
         0.37973508,  0.18875616],
       [ 0.29527164, -1.44712478, -2.44564843, ...,  0.46624205,
        -0.46139263,  0.04379387],
       [ 1.18436468, -0.8705941 , -1.04158122, ..., -0.14782503,
         0.08364615,  0.6398504 ]])

In [120]:
print(X_pca_train.shape)
print(y_train.shape)

(3156, 125)
(3156,)


In [32]:
print(X_pca_test.shape)
print(y_test.shape)

(1053, 125)
(1053,)


In [48]:
from sklearn.linear_model import LinearRegression
model_1 = LinearRegression()
model_1.fit(X_pca_train,y_train)

print('Accuracy_train_Linear_Model :' ,model_1.score(X_pca_train,y_train))

print('Accuracy_test_Linear_Model :' ,model_1.score(X_pca_test,y_test))

Accuracy_train_Linear_Model : 0.6215693520645091
Accuracy_test_Linear_Model : 0.4529957373498884


In [51]:
from sklearn.linear_model import Ridge

model_3 = Ridge()
model_3.fit(X_pca_train,y_train)

print('Accuracy_train_Ridge :' ,model_3.score(X_pca_train,y_train))

print('Accuracy_test_Ridge :' ,model_3.score(X_pca_test,y_test))

Accuracy_train_Ridge : 0.621566117487214
Accuracy_test_Ridge : 0.45305799590545837


In [49]:
from sklearn.linear_model import Lasso

model_2 = Lasso()
model_2.fit(X_pca_train,y_train)

print('Accuracy_train_Lasso :' ,model_2.score(X_pca_train,y_train))

print('Accuracy_test_Lasso :' ,model_2.score(X_pca_test,y_test))

Accuracy_train_Lasso : 0.4532478088965546
Accuracy_test_Lasso : 0.3354643790669096


In [52]:
from sklearn.linear_model import ElasticNet

model_4 = ElasticNet()
model_4.fit(X_pca_train,y_train)

print('Accuracy_train_Elastic_Net :' ,model_4.score(X_pca_train,y_train))

print('Accuracy_test_Elastic_Net :' ,model_4.score(X_pca_test,y_test))

Accuracy_train_Elastic_Net : 0.43768848713258424
Accuracy_test_Elastic_Net : 0.3258093693155647


# Boosting Using Catboost

In [53]:
from catboost import CatBoostRegressor

In [54]:
model_cb = CatBoostRegressor(iterations=15,learning_rate=.11,l2_leaf_reg=2, depth=6, use_best_model=True)

In [55]:

model_cb.fit(X_pca_train,y_train,plot=True,eval_set=(X_pca_test,y_test))

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

0:	learn: 11.7105332	test: 13.5896006	best: 13.5896006 (0)	total: 177ms	remaining: 2.48s
1:	learn: 11.2671092	test: 13.2068618	best: 13.2068618 (1)	total: 196ms	remaining: 1.27s
2:	learn: 10.9006433	test: 12.9022059	best: 12.9022059 (2)	total: 215ms	remaining: 858ms
3:	learn: 10.5617962	test: 12.6412743	best: 12.6412743 (3)	total: 233ms	remaining: 641ms
4:	learn: 10.2899731	test: 12.4301535	best: 12.4301535 (4)	total: 251ms	remaining: 502ms
5:	learn: 10.0121955	test: 12.2148411	best: 12.2148411 (5)	total: 269ms	remaining: 403ms
6:	learn: 9.8355590	test: 12.0772122	best: 12.0772122 (6)	total: 287ms	remaining: 328ms
7:	learn: 9.6511915	test: 11.9325837	best: 11.9325837 (7)	total: 305ms	remaining: 267ms
8:	learn: 9.4752142	test: 11.8124503	best: 11.8124503 (8)	total: 323ms	remaining: 215ms
9:	learn: 9.2777322	test: 11.6635030	best: 11.6635030 (9)	total: 341ms	remaining: 171ms
10:	learn: 9.1647886	test: 11.5883348	best: 11.5883348 (10)	total: 359ms	remaining: 131ms
11:	learn: 9.0268842	tes

<catboost.core.CatBoostRegressor at 0x1cb8b7c8e50>

In [68]:
y_pred_test =model_cb.predict(X_pca_test)
y_pred_train =model_cb.predict(X_pca_train)

In [69]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [70]:
mae = mean_absolute_error(y_test, y_pred_test)
mse = mean_squared_error(y_test, y_pred_test)
r2 = r2_score(y_test, y_pred_test)

In [71]:
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R2): {r2:.2f}")

Mean Absolute Error (MAE): 7.28
Mean Squared Error (MSE): 127.01
R-squared (R2): 0.35


In [72]:
r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)

In [73]:
print(f"R-squared (R2) - Training Data: {r2_train:.2f}")
print(f"R-squared (R2) - Test Data: {r2_test:.2f}")

R-squared (R2) - Training Data: 0.49
R-squared (R2) - Test Data: 0.35


#### The data seems to be over fitted and the accuracy is pretty Low , even we uses Catboost Algorithm 

## with the test data we are going to predict the y 

All the Data maniplutation we have to do in the Test data also 

In [74]:
y_actual_pred = model_cb.predict(df_test_final)

In [75]:
df_train.y

0       130.81
1        88.53
2        76.26
3        80.62
4        78.02
         ...  
4204    107.39
4205    108.77
4206    109.22
4207     87.48
4208    110.85
Name: y, Length: 4209, dtype: float64

In [76]:
pd.DataFrame([df_train.y,y_actual_pred]).T.rename(columns={'y':'Actual','Unnamed 0':'Predicted'})

Unnamed: 0,Actual,Predicted
0,130.81,97.453908
1,88.53,97.453908
2,76.26,97.453908
3,80.62,97.453908
4,78.02,97.335883
...,...,...
4204,107.39,99.388521
4205,108.77,97.453908
4206,109.22,97.453908
4207,87.48,97.453908
