# Problem Statement: Refer to [Mercedes-Benz Greener Manufacturing](https://www.kaggle.com/c/mercedes-benz-greener-manufacturing)

## **Objective:** Reduce the time a Mercedes-Benz spends on the test bench

In [None]:
# import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
%matplotlib inline

# import the library to ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# import the train dataset
train_original = pd.read_csv('../input/mercedes-benz-greener-manufacturing/train.csv.zip')

In [None]:
# display the data
train_original.head()

In [None]:
# Check the shape of the train dataset
train_original.shape

**Comments**
 > 1. There are 4209 data points and 378 columns in the train dataset.
 > 2. Most of the features have binary values, so the train data is sparse.
 > 3. The target feature 'y' in the training data is the time that a car spends on the test bench.

In [None]:
# check the info of the train data
train_original.info()

**Comments**
> 1. The train dataset has one feature with float64 datatype, which is the target feature 'y'.
> 2. The train dataset has 369 features with int64 datatype: the 'ID' column plus 368 features having binary values (0 and 1).
> 3. The train dataset has 8 features with object datatype, which are the features having categorical data.

In [None]:
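# First, check the variance of all the numeric features in the train dataset and store it in a new object
# numeric_only=True skips the 8 object (categorical) columns, for which a variance cannot be computed
train_original_var=pd.DataFrame(train_original.var(axis=0, numeric_only=True),columns=['Variance'])
train_original_var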

In [None]:
# Second, define a function that returns the features of a dataset having zero variance
def features_zero_var(df):
    # Index holds the feature names; the 'Variance' column holds the variance of each numeric feature
    df_original_var=pd.DataFrame(df.var(axis=0, numeric_only=True),columns=['Variance'])
    # Keep only the rows (features) whose variance is 0
    return df_original_var[df_original_var.Variance==0]
# Usage: features_zero_var(train_original)
# The function takes a generic df argument so that it is applicable to both train and test data
# It only lists the zero-variance features; removing them is done in a separate step

In [None]:
# Call the function to return the train dataset features having zero variance.
features_zero_var(train_original)

**Comments**: 
> Above listed features have zero variance.
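The zero-variance check can be tried on a small frame first. The following is a minimal sketch on made-up data (the frame `toy` and its column names are hypothetical, not part of the competition data); it shows that a constant column gets a variance of exactly 0 and that `numeric_only=True` keeps the categorical column out of the calculation.

```python
import pandas as pd

# Toy frame (hypothetical data): 'b' is constant, 'c' is categorical.
toy = pd.DataFrame({
    "a": [0, 1, 0, 1],
    "b": [1, 1, 1, 1],
    "c": ["x", "y", "x", "z"],
})

# numeric_only=True skips the object column instead of failing on it.
toy_var = toy.var(axis=0, numeric_only=True)

# Keep only the names of the columns whose variance is exactly zero.
zero_var_cols = toy_var[toy_var == 0].index.tolist()
print(zero_var_cols)  # ['b']
```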

In [None]:
# Remove the features with 0 variance from train dataset and store the data in the new object
train_original_modified= train_original.drop(columns=train_original_var[train_original_var.Variance==0].index)

In [None]:
# Display the modified train dataset after removing the features having zero variance
train_original_modified.head()

In [None]:
# Print the original and modified shape of the train dataset
print('Shape of original train dataset is:', train_original.shape)
print('\nShape of modified train dataset after removing features having zero variance is:', train_original_modified.shape)

**Comments**: 
> The modified train dataset contains 366 columns, which means the 12 features with zero variance are removed from the original dataset.

In [None]:
# Save the modified train dataset
train_original_modified.to_csv('train_original_modified.csv',index=False)

## Similarly remove the zero variance features from the test dataset 

In [None]:
#import the test dataset
test_original = pd.read_csv('../input/mercedes-benz-greener-manufacturing/test.csv.zip')

In [None]:
# display the data
test_original.head()

In [None]:
# Check the shape of the data
test_original.shape

**Comments**
> 1. There are 4209 data points and 377 features in the test dataset.
> 2. Number of features in the test dataset is 377 because Target feature 'y' is missing in the test dataset compared to train dataset.

In [None]:
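# First, check the variance of all the numeric features in the test dataset and store it in a new object
# numeric_only=True skips the 8 object (categorical) columns, for which a variance cannot be computed
test_original_var=pd.DataFrame(test_original.var(axis=0, numeric_only=True),columns=['Variance'])
test_original_var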

In [None]:
# Call the function to return the test dataset features having zero variance.
features_zero_var(test_original)

**Comments**: 
> 1. Above listed features have zero variance in the test dataset.
> 2. However, the model is trained only on the train dataset, so the columns to drop are decided from the train dataset. We therefore remove from the test dataset the same features that had zero variance in the train dataset, whichever features are listed above.
> 3. This ensures that the train and test datasets keep the same set of feature columns.
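As a hedged alternative to typing the column names by hand, the list of columns to drop can be derived from the train dataset and reused for the test dataset. The sketch below uses two made-up frames (`toy_train`, `toy_test`); in the notebook the same idea would apply to `train_original` and `test_original`.

```python
import pandas as pd

# Hypothetical miniature train/test frames.
toy_train = pd.DataFrame({"ID": [1, 2, 3], "X1": [0, 1, 0],
                          "X2": [0, 0, 0], "y": [90.0, 100.0, 95.0]})
toy_test = pd.DataFrame({"ID": [4, 5, 6], "X1": [1, 1, 1], "X2": [0, 1, 0]})

# Decide the columns to drop from the train data only.
train_var = toy_train.var(axis=0, numeric_only=True)
cols_to_drop = train_var[train_var == 0].index.tolist()

# Apply the same drop list to both datasets.
toy_train_reduced = toy_train.drop(columns=cols_to_drop)
toy_test_reduced = toy_test.drop(columns=cols_to_drop)

print(cols_to_drop)                    # ['X2']
print(list(toy_test_reduced.columns))  # ['ID', 'X1']
```

Note that `X1` is constant in the toy test frame but is kept, because only the train data decides which columns go.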

In [None]:
# In test dataset, remove the same features of train dataset having 0 variance.
test_original_modified= test_original.drop(columns=['X11', 'X93', 'X107','X233', 'X235', 'X268', 'X289', 'X290', 'X293','X297','X330','X347'])

In [None]:
# Display the modified test dataset after removing the features having zero variance
test_original_modified.head()

In [None]:
# Print the original and modified shape of the test dataset
print('Shape of original test dataset is:', test_original.shape)
print('\nShape of modified test dataset after removing features having zero variance is:', test_original_modified.shape)

**Comments**: 
> 1. The modified test dataset contains 365 columns, because we removed the same 12 features that had zero variance in the train dataset.
> 2. The modified test dataset has 365 columns, whereas the modified train dataset has 366. This is because the target feature 'y' is missing in the test dataset.

In [None]:
# Save the modified test dataset
test_original_modified.to_csv('test_original_modified.csv',index=False)

##  Check the null and unique values in the train dataset

In [None]:
# Check the null values in the train dataset
print('The total number of null values in the train dataset is:', train_original_modified.isnull().sum().sum())

**Comments:** 
> There are no null values in the train dataset.

In [None]:
# Check the unique values in the train dataset
# unique() function includes the missing value 
# nunique() function excludes the missing value as the default parameter is dropna=True
# Since there are no missing values in the train and test dataset, we can use nunique()
train_original_modified_UV=pd.DataFrame(train_original_modified.nunique(),columns=['Unique_Values'])
train_original_modified_UV
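The difference between `unique()` and `nunique()` mentioned in the cell above only shows up when a column has missing values. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
s = pd.Series(["a", "b", np.nan, "a"])

n_with_nan = len(s.unique())            # unique() keeps NaN as a value
n_default = s.nunique()                 # nunique() drops NaN by default
n_keep_nan = s.nunique(dropna=False)    # NaN counted again

print(n_with_nan, n_default, n_keep_nan)  # 3 2 3
```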

**Comments**: 
> 1. All the values in ID are unique. In the provided dataset, the ID identifies a unique car configuration, so this feature must be ignored for training as it carries no information for the prediction. 
> 2. 'y' is the target feature. 
> 3. Features X0,X1,X2,X3,X4,X5,X6,X8 are the categorical features, which must be converted to numerical values/one hot encoded values.
> 4. All the features after X8 have binary values.

In [None]:
# Print the train dataset features whose number of unique values is > 2 and == 2.
print('Train Features with unique values greater than 2 are as follows:\n',train_original_modified_UV[train_original_modified_UV.Unique_Values>2].unstack())
print('Train Features with unique values equal to 2 are as follows:\n',train_original_modified_UV[train_original_modified_UV.Unique_Values==2].unstack())

**Comments:**
> Out of the 366 train dataset columns, 10 have more than 2 unique values (ID, y and the 8 categorical features) and the remaining features have only 2 unique values (0 and 1). 

### Similarly check the  null and unique values in the  test dataset 

In [None]:
# Check the null values in the test dataset
print('The total number of null values in the test dataset is:', test_original_modified.isnull().sum().sum())

**Comments:** 
> There are no null values in the test dataset.

In [None]:
# Check the unique values in the test dataset
test_original_modified_UV=pd.DataFrame(test_original_modified.nunique(),columns=['Unique_Values'])
test_original_modified_UV

**Comments**: 
> 1. All the values in ID are unique. In the provided dataset, the ID identifies a unique car configuration, so this feature must be ignored for testing as it carries no information for the prediction.
> 2. Features X0,X1,X2,X3,X4,X5,X6,X8 are the categorical features, which must be converted to numerical values/one hot encoded values.
> 3. All the features after X8 have binary values.

In [None]:
# Print the test dataset features unique values where the values = 2 and values >2.
print('Test Features with unique values greater than 2 are as follows:\n',test_original_modified_UV[test_original_modified_UV.Unique_Values>2].unstack())
print('Test Features with unique values equal to 2 are as follows:\n',test_original_modified_UV[test_original_modified_UV.Unique_Values==2].unstack())

**Comments:**
> Out of the 365 test dataset columns, 9 have more than 2 unique values (ID and the 8 categorical features) and the remaining features have only 2 unique values (0 and 1). 

## Apply the label encoder/one hot encoding  for the train dataset

In [None]:
# Before applying label encoder, separate the ID and 'y' features from the train dataset
##
##Drop the columns 'ID' and 'y' and store the data into new object train_X_check and verify the shape
train_X_check = train_original_modified.drop(columns = ['ID','y'])
# train_X_check to verify how the one hot encoding works for the features with multi categorical variables 
train_X_check.shape

In [None]:
# perform label encoding/one hot encoding for the categorical features of train_X_check
##
# Import the required library
from sklearn.preprocessing import LabelEncoder

In [None]:
# Define the function to apply the label encoder to the categorical features of train_X_check
def label_encoder(df,x):
    # df is used to find the object-dtype columns; x is the dataframe that is encoded in place
    features_cat=df.select_dtypes(include='object').columns # select only the features with datatype object
    le=LabelEncoder() # instantiate the label encoder
    for i in features_cat:
        x[i]=le.fit_transform(x[i]) # refit the encoder on each column and replace its categories with integer labels

In [None]:
# Call the function to apply label encoder for the train_X_check data
label_encoder(train_original_modified,train_X_check)

In [None]:
# verify the train_X dataset
train_X_check.head()

**Comments**
> 1. The table above shows that the label encoder is applied to the categorical features of train_X_check. The encoded labels are not binary (0 and 1) because each feature has 4 or more categories.
> 
> 
> 2. Example: the X5 feature has 29 unique categories, so the label encoder replaces them with 29 integer labels from 0 to 28.
> 
> 
> 3. These integers suggest an order between the categories that does not exist in the data, which may impact the accuracy level.
> 
> 
> 4. In order to fix this issue, the following procedure is followed here [[Reference](https://www.youtube.com/watch?v=6WDFfaYtN6s)]
>     - Identify the top 10 most frequent categories of each feature. 
>     
>     - Perform one hot encoding only for these top 10 most frequent categories.
>     
>     - Each top 10 category gets its own new column, with '1' in the rows that have this category and '0' in all other rows. Rows with a less frequent category get '0' in all of the new columns.
>     
>     - These 3 steps ensure only binary values (0 and 1) in all the features.
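The procedure above can be sketched on a tiny made-up column before applying it to the real features. The frame `toy` and the choice of the top 2 (instead of the top 10) categories are assumptions made only to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column: 'a' occurs 3 times, 'b' twice, 'c' and 'd' once.
toy = pd.DataFrame({"X0": ["a", "b", "a", "c", "a", "b", "d"]})

# Step 1: the most frequent categories (top 2 here, top 10 in the notebook).
top_k = toy["X0"].value_counts().head(2).index.tolist()

# Steps 2 and 3: one 0/1 column per top category.
for category in top_k:
    toy["X0_" + category] = np.where(toy["X0"] == category, 1, 0)

print(top_k)  # ['a', 'b']
print(toy)
```

Rows holding the rare categories `c` and `d` end up with 0 in both new columns, which is how the less frequent categories are grouped together.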

In [None]:
# Identify the top 10 most frequent categories of features X0,X1,X2,X3,X4,X5,X6,X8

#X0
train_original_modified.X0.value_counts().sort_values(ascending=False).head(10)

In [None]:
# create a list for the top 10 most frequent categories of feature X0
top_10_X0 = [x for x in train_original_modified.X0.value_counts().sort_values(ascending=False).head(10).index]
top_10_X0

In [None]:
# Define a function to perform one hot encoding for the top 10 most frequent categories of a feature
def one_hot_top10(df,feature,top10_categories):
    for category in top10_categories:
        # Read from df (not from a fixed global dataframe) so the function also works for the test dataset
        df[feature+'_'+category]=np.where(df[feature]==category,1,0)

In [None]:
# Call the function to perform one hot encoding on feature X0
one_hot_top10(train_original_modified,'X0',top_10_X0)

In [None]:
# verify the train dataset after applying one hot encoding for the feature X0
train_original_modified.head(3)

In [None]:
# verify for one feature if it contains only binary value
train_original_modified.X0_z.unique()

**Comments:**
> 1. It is evident from the above table that 10 new features are created with only binary values, for the top 10 most frequent categories of feature X0. 
> 
> 2. Hence the total columns/features are increased from 366 to 376.
> 
> 3. Similarly perform one hot encoding for the top 10 frequent categories of remaining categorical features.

In [None]:
# create a list for the top 10 most frequent categories of features X1, X2,X3,X4,X5,X6 and X8
# Call the function to perform one hot encoding on features X1, X2,X3,X4,X5,X6 and X8 
# verify the train dataset after applying one hot encoding for the features X1, X2,X3,X4,X5,X6 and X8 

# X1 
top_10_X1 = [x for x in train_original_modified.X1.value_counts().sort_values(ascending=False).head(10).index]
one_hot_top10(train_original_modified,'X1',top_10_X1)
train_original_modified.head(2)

In [None]:
# X2
top_10_X2 = [x for x in train_original_modified.X2.value_counts().sort_values(ascending=False).head(10).index]
one_hot_top10(train_original_modified,'X2',top_10_X2)
train_original_modified.head(2)

In [None]:
# X3
top_10_X3 = [x for x in train_original_modified.X3.value_counts().sort_values(ascending=False).head(10).index]
one_hot_top10(train_original_modified,'X3',top_10_X3)
train_original_modified.head(2)

**Comments**: 
> In the case of X3, only 7 columns were added since there are only 7 unique categories in this feature.

In [None]:
# X4
top_10_X4 = [x for x in train_original_modified.X4.value_counts().sort_values(ascending=False).head(10).index]
one_hot_top10(train_original_modified,'X4',top_10_X4)
train_original_modified.head(2)

**Comments**: 
> In the case of X4, only 4 columns were added since there are only 4 unique categories in this feature.

In [None]:
# X5
top_10_X5 = [x for x in train_original_modified.X5.value_counts().sort_values(ascending=False).head(10).index]
one_hot_top10(train_original_modified,'X5',top_10_X5)
train_original_modified.head(2)

In [None]:
# X6
top_10_X6 = [x for x in train_original_modified.X6.value_counts().sort_values(ascending=False).head(10).index]
one_hot_top10(train_original_modified,'X6',top_10_X6)
train_original_modified.head(2)

In [None]:
# X8
top_10_X8 = [x for x in train_original_modified.X8.value_counts().sort_values(ascending=False).head(10).index]
one_hot_top10(train_original_modified,'X8',top_10_X8)
train_original_modified.head(2)

**Comments:** 
> One hot encoding of the most frequent categories is successfully applied to all 8 categorical features of the train dataset

In [None]:
# Store the data set in train_original_modified to new object train_original_modified_OHE
# drop the columns which are not required after performing one hot encoding
train_original_modified_OHE = train_original_modified.drop(columns=['X0','X1','X2','X3','X4','X5','X6','X8'])
# OHE means One hot encoded data

In [None]:
# Save the modified one hot encoded train dataset
train_original_modified_OHE.to_csv('train_original_modified_OHE.csv',index=False)

In [None]:
# Check the shape of the modified one hot encoded train dataset
train_original_modified_OHE.shape

**Comments:** 
> The number of columns in the train dataset is reduced to 429 from 437 since the 8 categorical features were dropped after performing one hot encoding

## Apply the label encoder/one hot encoding for the test dataset similar to the train dataset

In [None]:
# create a list for the top 10 most frequent categories of features X0,X1,X2,X3,X4,X5,X6 and X8
# Call the function to perform one hot encoding on features X0,X1,X2,X3,X4,X5,X6 and X8 
# verify the test dataset after applying one hot encoding for the features X0,X1,X2,X3,X4,X5,X6 and X8 

# X0
top_10_test_X0 = [x for x in test_original_modified.X0.value_counts().sort_values(ascending=False).head(10).index]
one_hot_top10(test_original_modified,'X0',top_10_test_X0)
test_original_modified.head(2)

In [None]:
# X1
top_10_test_X1 = [x for x in test_original_modified.X1.value_counts().sort_values(ascending=False).head(10).index]
one_hot_top10(test_original_modified,'X1',top_10_test_X1)
test_original_modified.head(2)

In [None]:
# X2
top_10_test_X2 = [x for x in test_original_modified.X2.value_counts().sort_values(ascending=False).head(10).index]
one_hot_top10(test_original_modified,'X2',top_10_test_X2)
test_original_modified.head(2)

In [None]:
# X3
top_10_test_X3 = [x for x in test_original_modified.X3.value_counts().sort_values(ascending=False).head(10).index]
one_hot_top10(test_original_modified,'X3',top_10_test_X3)
test_original_modified.head(2)

**Comments**: 
> In the case of X3, only 7 columns were added since this feature has only 7 unique categories.

In [None]:
# X4
top_10_test_X4 = [x for x in test_original_modified.X4.value_counts().sort_values(ascending=False).head(10).index]
one_hot_top10(test_original_modified,'X4',top_10_test_X4)
test_original_modified.head(2)

**Comments**: 
> In the case of X4, only 4 columns were added since this feature has only 4 unique categories.

In [None]:
# X5
top_10_test_X5 = [x for x in test_original_modified.X5.value_counts().sort_values(ascending=False).head(10).index]
one_hot_top10(test_original_modified,'X5',top_10_test_X5)
test_original_modified.head(2)

In [None]:
# X6
top_10_test_X6 = [x for x in test_original_modified.X6.value_counts().sort_values(ascending=False).head(10).index]
one_hot_top10(test_original_modified,'X6',top_10_test_X6)
test_original_modified.head(2)

In [None]:
# X8
top_10_test_X8 = [x for x in test_original_modified.X8.value_counts().sort_values(ascending=False).head(10).index]
one_hot_top10(test_original_modified,'X8',top_10_test_X8)
test_original_modified.head(2)

**Comments:** 
> One hot encoding of the most frequent categories is successfully applied to all 8 categorical features of the test dataset

In [None]:
# Store the data set in test_original_modified to new object test_original_modified_OHE
# drop the columns which are not required after performing one hot encoding
test_original_modified_OHE = test_original_modified.drop(columns=['X0','X1','X2','X3','X4','X5','X6','X8'])
# OHE means One hot encoded data

In [None]:
# Save the modified one hot encoded test dataset
test_original_modified_OHE.to_csv('test_original_modified_OHE.csv',index=False)

In [None]:
# Check the shape of the modified one hot encoded test dataset
test_original_modified_OHE.shape

**Comments:** 
> The number of columns in the test dataset is reduced to 428 from 436 since the 8 categorical features were dropped after performing one hot encoding
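One point worth checking here: the top 10 lists for the test dataset are computed from the test data itself, so they can differ from the train lists, and the new columns may then not match between the two datasets. A possible way around this, sketched below on made-up frames, is to reuse the category lists found on the train data. The helper `one_hot_with_categories` and the frames are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def one_hot_with_categories(df, feature, categories):
    """Return a copy of df with one 0/1 column per listed category (hypothetical helper)."""
    out = df.copy()
    for category in categories:
        out[feature + "_" + category] = np.where(out[feature] == category, 1, 0)
    return out

# Hypothetical data: the most frequent categories differ between train and test.
toy_train = pd.DataFrame({"X0": ["a", "a", "a", "b", "b", "c"]})
toy_test = pd.DataFrame({"X0": ["c", "c", "d", "a", "d", "d"]})

# Top categories are taken from the train data only (top 2 to keep it short).
top_train = toy_train["X0"].value_counts().head(2).index.tolist()

train_enc = one_hot_with_categories(toy_train, "X0", top_train)
test_enc = one_hot_with_categories(toy_test, "X0", top_train)

# Both frames now have identical columns in identical order.
print(list(train_enc.columns) == list(test_enc.columns))  # True
```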

## Perform dimensionality reduction.

In [None]:
# Before performing dimensionality reduction, i.e. principal component analysis (PCA), separate the following:
# features 'ID' and 'y' from the train_original_modified_OHE dataset and store them in new objects
# feature 'ID' from the test_original_modified_OHE dataset and store it in a new object

In [None]:
# store the 'ID' values into the new object train_ID and test_ID and verify the shape

# Train dataset
train_ID = train_original_modified_OHE.ID
print('The shape of the ID feature in train dataset is:',train_ID.shape)

# Test dataset
test_ID = test_original_modified_OHE.ID
print('\nThe shape of the ID feature in test dataset is:',test_ID.shape)

In [None]:
# store the remaining values into the new object train_X and test_X and verify the shape

# Train dataset
train_X = train_original_modified_OHE.drop(columns = ['ID','y'])
print('The shape of the final train dataset is:',train_X.shape)

# Test dataset
test_X = test_original_modified_OHE.drop(columns = ['ID'])
print('\nThe shape of the final test dataset is:',test_X.shape)

In [None]:
# store the 'y' values into the new object train_y and verify the shape
train_y=train_original_modified_OHE.y
print('The shape of the target feature of train dataset is:',train_y.shape)

**Comments**: 
> 1. Successfully separated the 'ID' and target feature 'y' from train dataset.
> 2. Successfully separated the 'ID' feature from test dataset.
> 3. Also observed that the train and test data have the same shape. It is therefore good to proceed further.

In [None]:
# save these datasets to csv 
train_ID.to_csv('train_ID.csv',index=False)
test_ID.to_csv('test_ID.csv',index=False)
train_X.to_csv('train_X.csv',index=False)
test_X.to_csv('test_X.csv',index=False)
train_y.to_csv('train_y.csv',index=False)

In [None]:
# import the required PCA library
from sklearn.decomposition import PCA

In [None]:
# Before performing PCA, the data usually needs to be centered and scaled
# After centering, the average value of each feature will be 0
# After scaling, the standard deviation of each feature will be 1

**Comments:** 
> Since all the features have only the values 0 and 1, they are already on the same scale and there is no need to standardize. Centering is done by scikit-learn's PCA itself.

In [None]:
# create a PCA object (instantiate)
pca=PCA()

In [None]:
# fit and transform the train dataset
# transform the test dataset

# fit the train dataset
pca_fit_train_X = pca.fit(train_X)

# transform the  train dataset
pca_fit_transform_train_X = pca_fit_train_X.transform(train_X)

# transform the test dataset
pca_transform_test_X = pca.transform(test_X)

**Note:**

> 1. In the fit step, the loading scores and the variation of each principal component are calculated.
> 2. The fit method is used only on the train dataset, so that no information from the test dataset is used when the principal components are learned.
> 3. In the transform step, the coordinates for the PCA plot are generated from the loading scores and the centered data (we have not scaled the data since all features have only the values 0 and 1).
> 4. The test dataset is only transformed: it is centered with the train means and projected onto the principal components learned from the train dataset.   
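The fit/transform split described in the note can be checked numerically. The sketch below uses random binary data (an assumption, standing in for `train_X` and `test_X`) and shows that transforming the test data is the same as subtracting the train means and projecting onto the components learned from the train data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical binary data standing in for train_X and test_X.
rng = np.random.default_rng(0)
toy_train = rng.integers(0, 2, size=(50, 6)).astype(float)
toy_test = rng.integers(0, 2, size=(20, 6)).astype(float)

# Fit on the train data only.
pca_toy = PCA(n_components=3).fit(toy_train)

# transform() centers with the train means, then projects onto the components.
test_scores = pca_toy.transform(toy_test)
manual_scores = (toy_test - pca_toy.mean_) @ pca_toy.components_.T

print(np.allclose(test_scores, manual_scores))               # True
print(np.allclose(pca_toy.mean_, toy_train.mean(axis=0)))    # True
```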

In [None]:
# visualize the train_X dataset using scree plot to see how many principal components should go into the final plot.
# Calculate the percentage of variation of each principal components
# Assuming we have two principal components PC1 and PC2, then  
# explained_variance_ratio for PC1 = (Variation for PC1/ (Total variation ie (PC1+PC2)))*100
##
pca_train_X_variation = np.round(pca_fit_train_X.explained_variance_ratio_.cumsum()*100,decimals=1)
##
# cumsum() is used to display PC's variation with respect to cumulative percentage

In [None]:
# Print the cumulative percentage of explained variance
pca_train_X_variation

**Comments**: 
> 1. Looking at the array, among the 427 principal components, ~90% of the variation of the data in the train_X dataset is explained by the first 72 principal components alone. 
> 2. Let's validate this by visualizing the scree plot for the first 72 principal components. 
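Instead of reading the number of components off the printed array, it can also be computed. This is a sketch on random binary data (an assumption, not the notebook's `train_X`), so the resulting count is not the 72 found above; it only illustrates the two ways of picking the smallest number of components that exceeds a 90% threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical binary data standing in for train_X.
rng = np.random.default_rng(0)
toy_X = rng.integers(0, 2, size=(200, 20)).astype(float)

# Option 1: fit all components and find where the cumulative ratio first exceeds 0.90.
pca_full = PCA().fit(toy_X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_needed = int(np.argmax(cumulative > 0.90)) + 1

# Option 2: pass the threshold as a float and let scikit-learn pick the count.
pca_auto = PCA(n_components=0.90, svd_solver="full").fit(toy_X)

print(n_needed, pca_auto.n_components_)
```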

In [None]:
# create a PCA object again (instantiate) by considering first 72 principal components
pca_72=PCA(n_components=72)

In [None]:
# fit and transform the train dataset
# transform the test dataset

# fit the train dataset
pca_fit_train_X_72 = pca_72.fit(train_X)

# transform the train dataset
pca_fit_transform_train_X_72 = pca_fit_train_X_72.transform(train_X)

# transform the test dataset
pca_transform_test_X_72 = pca_72.transform(test_X)

In [None]:
## check the variation for the first 72 principal components 
pca_train_X_variation_72 = np.round(pca_fit_train_X_72.explained_variance_ratio_.cumsum()*100,decimals=1)
pca_train_X_variation_72
##

In [None]:
# assign labels to the principal components as PC1, PC2, etc., for visualization in the scree plot
##
labels = ['PC' + str(x) for x in range(1, len(pca_train_X_variation_72)+1)]
##
# The 72 principal components are labelled PC1, PC2, ..., PC72
# Note: a principal component is not one of the original features
# Each PC is a linear combination of all the features of train_X
# The PCs are ordered by explained variance, so PC1 explains the most variance

In [None]:
# generate the scree plot
## 
plt.figure(figsize=(25,5))
##
plt.bar(x=range(1, len(pca_train_X_variation_72)+1), height=pca_train_X_variation_72,tick_label=labels)
##
plt.xticks(rotation=90, color='indigo', size=15)
plt.yticks(rotation=0, color='indigo', size=15)
##
##################
plt.title('Scree Plot',color='tab:orange', fontsize=25)
###################
##
plt.xlabel('Principal Components', {'color': 'tab:orange', 'fontsize':15})
plt.ylabel('Cumulative percentage of explained variance ', {'color': 'tab:orange', 'fontsize':15})
##

**Comments:**
> Above scree plot shows that considering first 72 principal components should be sufficient to represent the train_X dataset

In [None]:
# Draw the 2D PCA plot by considering only PC1 and PC2
# PCA plot is to visualize how the data is spread across the origin with new coordinates, based on the loading scores and scaling.

In [None]:
####
# Put the new coordinates created by pca_fit_transform_train_X_72 into matrix
# Rows are the observations (X) and columns are the Principal components (Y)
####
pca_fit_transform_train_X_72_df = pd.DataFrame(pca_fit_transform_train_X_72,columns=labels )
#####
# verify the first 2 rows of data with new coordinates
pca_fit_transform_train_X_72_df.head(2)
#####

In [None]:
# Draw the 2D PCA plot for PC1 and PC2

##
## Removing the cumsum() from the earlier explained variance ratio calculation
pca_train_X_variation_72_Nocumsum = np.round(pca_fit_train_X_72.explained_variance_ratio_*100,decimals=1)
##
plt.title('PCA Plot',color='tab:orange', fontsize=20)
##
plt.scatter(pca_fit_transform_train_X_72_df.PC1, pca_fit_transform_train_X_72_df.PC2)
##
plt.xticks(rotation=90, color='indigo', size=15)
plt.yticks(rotation=0, color='indigo', size=15)
##
plt.xlabel('PC1 - {0}%'.format(pca_train_X_variation_72_Nocumsum[0]), {'color': 'tab:orange', 'fontsize':15});
plt.ylabel('PC2 - {0}%'.format(pca_train_X_variation_72_Nocumsum[1]), {'color': 'tab:orange', 'fontsize':15});
##
## The principal components are zero-indexed, So, PC1=[0], PC2=[1]

**Comments:** 
> 1. The PCA plot above shows how the data is spread along the X-axis (PC1) and the Y-axis (PC2).
> 2. 11.9% of the variance of the data is explained by PC1 and 8.2% by PC2.
> 3. Similarly, we can visualize how the data is spread along the other principal components as well.

In [None]:
# print the loading scores
# Loading scores are the weights of the original features in a principal component
# pca_72.components_[0] holds one loading score per feature of train_X for PC1
#
#Lets check only for the PC1
#
loading_scores = pd.Series(pca_72.components_[0])
#
#
#sort the loading scores based on absolute value
sorted_loading_scores=loading_scores.abs().sort_values(ascending=False)
#
# display only the top 10 loading scores (the index is the column position of the feature in train_X)
sorted_loading_scores[0:10]

In [None]:
# Print the minimum and maximum loading scores of PC1
print(sorted_loading_scores.min())
print(sorted_loading_scores.max())

**Comments**:
> 1. It can be concluded from the above loading scores that almost all the features of the train dataset contribute to the principal component PC1.
> 2. Example: PC1 is a 1 unit long vector with one loading score per feature:
>    - PC1 value of an observation = w1 * x1 +.......+ wn * xn, where x1..xn are the centered feature values and w1..wn are the loading scores 
>    - 0.191403 is the absolute loading score of the feature at column position 175 for PC1 
>    - This unit vector is called the singular vector or eigenvector for PC1
> 3. Similarly, the loading scores for PC2 are given by pca_72.components_[1], etc.
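To make the loading scores easier to read, they can be indexed by feature name rather than by column position. The sketch below assumes a small random frame with hypothetical column names; in the notebook the index would be `train_X.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical binary features standing in for train_X.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.integers(0, 2, size=(100, 5)),
                     columns=["X10", "X12", "X13", "X14", "X15"])

pca_toy = PCA(n_components=2).fit(toy_X)

# One loading score per original feature for PC1, labelled by feature name.
pc1_loadings = pd.Series(pca_toy.components_[0], index=toy_X.columns)
ranked = pc1_loadings.abs().sort_values(ascending=False)
print(ranked)

# Each row of components_ is a unit-length vector.
pc1_norm = np.linalg.norm(pca_toy.components_[0])
print(round(pc1_norm, 6))  # 1.0
```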

In [None]:
# From PCA, the final train and test datasets are as follows

#train data
pca_fit_transform_train_X_72.shape

In [None]:
#test data
pca_transform_test_X_72.shape

In [None]:
#train label
train_y.shape

In [None]:
# train ID
train_ID.shape

In [None]:
test_ID.shape

## Predict test_data using XGBoost.

In [None]:
# Before predicting the test values, lets check the target variable train_y for any outliers
# If present, the value will be replaced with median values
# Using boxplot to identify the outliers
plt.boxplot(train_y);

**Comments**: Outliers are observed in the target variable train_y

In [None]:
# Print the 50th percentile value which is the median
print(train_y.quantile(0.50)) 

In [None]:
# Print the 95th percentile value 
print(train_y.quantile(0.95)) 

In [None]:
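# Replace the outliers (values above the 95th percentile, 120.806) with the median value (99.15)
# np.where returns a NumPy array, so train_y is no longer a pandas Series after this step
train_y = np.where(train_y > 120.80600000000001, 99.15, train_y)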

In [None]:
# Verify again with box_plot after replacing the outliers with median values
plt.boxplot(train_y);

In [None]:
# Check the shape again   
train_y.shape

**Comments**: 
> 1. It is evident from the box plot that the outliers are replaced with the median value in the target variable train_y.
> 2. Also, there is no change in the shape of the target variable. Hence it is good to go with further steps.
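The threshold and the median do not have to be typed in by hand; they can be computed from the data. A minimal sketch on a made-up target Series (`toy_y` is hypothetical), following the same rule of replacing values above the 95th percentile with the median:

```python
import numpy as np
import pandas as pd

# Hypothetical target values with one obvious outlier.
toy_y = pd.Series([88.0, 92.5, 95.0, 99.0, 101.0, 104.0, 110.0, 265.0])

threshold = toy_y.quantile(0.95)   # 95th percentile
median = toy_y.median()            # 50th percentile

# Values above the threshold are replaced with the median; the rest are kept.
toy_y_capped = pd.Series(np.where(toy_y > threshold, median, toy_y), index=toy_y.index)

print(threshold, median)      # 210.75 100.0
print(toy_y_capped.tolist())
```

Wrapping the result in `pd.Series` keeps the pandas type, which `np.where` alone would not.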

In [None]:
# import the required libraries
import xgboost as xgb
from sklearn.model_selection import cross_val_score,cross_val_predict
# Since objective is to predict continuous variable we use XGBregressor
from xgboost import XGBRegressor

In [None]:
# Evaluation metrics for regression
### Mean Absolute Error, Mean Squared Error and R2
# We will use R2 in this case
# R2 is also known as the Coefficient of Determination
# It gives the proportion of the variation in 'y' (test time) that is explained by the 'X' variables
# R2 = 1 - SSR/SST
# SSR- Sum of squared residuals; SST- Total sum of squares
# R2 = 1 means a perfect fit; R2 = 0 means the model is no better than always predicting the mean of 'y'
# A negative R2 value means the model is worse than always predicting the mean
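The R2 formula above can be worked through by hand on a few numbers. The values below are made up for illustration; the second prediction is deliberately bad to show that R2 can become negative.

```python
import numpy as np

# Hypothetical true and predicted test times.
y_true = np.array([90.0, 100.0, 110.0, 120.0])
y_pred = np.array([92.0, 98.0, 113.0, 117.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # SSR = 4 + 4 + 9 + 9 = 26
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # SST = 225 + 25 + 25 + 225 = 500
r2 = 1 - ss_res / ss_tot
print(r2)  # approximately 0.948

# A prediction that is worse than the mean gives a negative R2.
bad_pred = np.array([120.0, 110.0, 100.0, 90.0])
r2_bad = 1 - np.sum((y_true - bad_pred) ** 2) / ss_tot
print(r2_bad)  # -3.0
```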

In [None]:
# print the XGBoost parameters
print(XGBRegressor())

In [None]:
# Instantiate the Regressor
# random_state is not set here; passing random_state would ensure the same result if we run the model multiple times
# Objective will be automatically set to 'reg:squarederror'
xgb_reg = xgb.XGBRegressor() 

In [None]:
#To find best XGBoost Parameters
params={ 'learning_rate'   : [0.01,0.05,0.1,1] ,
         'max_depth'       : [2,3,5,10],
         'min_child_weight': [ 0, 1, 3],
         'n_estimators'    : [100,150,200,500],
         'gamma'           : [0,1e-3,1e-2,0.1,0.5,1],
         'colsample_bytree': [0.1,0.5,0.7,1],
         'subsample'       : [0.2,0.3,0.5,1],
         'reg_lambda'      : [0,1,10],
         'reg_alpha'       : [1e-5,1e-3,1e-1,1,1e1] 
        }
## Explanations of the parameters
# 'max_depth' - Maximum depth of trees (default = 6, range: [0,∞])
# 'learning_rate' (eta) - shrinks the contribution of each tree, so the prediction approaches the actual value in smaller steps
# 'reg_lambda' - L2 regularization parameter on weights to avoid overfitting
# 'reg_alpha' - L1 regularization parameter on weights to avoid overfitting
# 'gamma' - Minimum loss reduction required to make a further partition on a leaf node of the tree (pruning)
# 'min_child_weight' - default = 1. Minimum sum of instance weights (hessian) needed in a child node
# If a split would create a leaf whose sum of weights is below min_child_weight, the split is not made
# 'colsample_bytree' - the subsample ratio of columns when constructing each tree
# 'subsample' - the ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees. This helps to prevent overfitting. 
# Subsampling will occur once in every boosting iteration.
# 'n_estimators' is the number of trees

In [None]:
# Optimize the Hyperparameter using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# Using Random search of parameters with 10 fold cross validation
# Improve the predictions using cross validation to optimize the parameters
# n_iter is left at its default of 10, so only 10 random parameter combinations are evaluated
Random_Search=RandomizedSearchCV (xgb_reg,params,cv=10, scoring='r2', return_train_score=True, n_jobs=-1,verbose=1) 
# cv=10 - Number of folds; for a regressor an integer cv means a plain KFold (no stratification)
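To get a feel for how little of the search space a random search touches, the size of the full grid can be compared with the number of sampled combinations. The sketch below uses a much smaller, hypothetical parameter dictionary (`toy_params`) and scikit-learn's `ParameterGrid` and `ParameterSampler`, which is the sampler `RandomizedSearchCV` relies on.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical, shortened parameter lists.
toy_params = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [2, 3, 5],
    "n_estimators": [100, 200],
}

# Full grid: every combination of the listed values.
n_grid = len(ParameterGrid(toy_params))
print(n_grid)  # 12

# Random search: only n_iter combinations are drawn from that grid.
sampled = list(ParameterSampler(toy_params, n_iter=4, random_state=0))
for combo in sampled:
    print(combo)
```

Raising `n_iter` in `RandomizedSearchCV` covers more of the grid at the cost of a longer run.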

In [None]:
# Fit the training set to Random_Search to obtain the best estimator and parameters.
Random_Search.fit(pca_fit_transform_train_X_72,train_y)

In [None]:
# Print the best estimator
Random_Search.best_estimator_

In [None]:
#Print the best parameters
Random_Search.best_params_

In [None]:
# Instantiate the XGBoost regressor with the best parameters found by the random search
# Version-specific arguments from the printed best estimator (gpu_id, missing, validate_parameters, etc.)
# are omitted and left at the library defaults
xgb_reg=XGBRegressor(base_score=0.5, booster='gbtree', colsample_bytree=0.5,
             gamma=0.01, learning_rate=0.1, max_depth=2, min_child_weight=3,
             n_estimators=500, n_jobs=2, random_state=0,
             reg_alpha=0.1, reg_lambda=10, subsample=0.5,
             tree_method='exact')

In [None]:
# Check the r2 score of the model using 10-fold cross validation (cv=10)
r2_Score = cross_val_score(xgb_reg,pca_fit_transform_train_X_72,train_y,scoring='r2',cv=10)
r2_Score

In [None]:
# Print the mean r2_score
print('r2_score of the model with cross validation is:',round(r2_Score.mean(),2))

**Comments** 
> 1. The mean r2_score with cross validation is 0.62, i.e. 62 %. With this score, we proceed with the prediction of the time the car takes to pass testing using the test data.
> 2. This means the model explains about 62% of the variability of the target variable (y) around its mean. Note that the score is measured against the train_y in which the outliers were replaced with the median.
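The cross validation step can also be written with an explicit `KFold` object, which makes shuffling and the random seed visible. This is a sketch on synthetic data; `GradientBoostingRegressor` is used here only as a stand-in for the XGBoost regressor so that the example depends on scikit-learn alone.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (hypothetical stand-in for the PCA-transformed train data).
X_toy, y_toy = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Stand-in model; the notebook uses XGBRegressor at this point.
model = GradientBoostingRegressor(n_estimators=50, random_state=0)

# Explicit 10-fold splitter with shuffling and a fixed seed for reproducibility.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X_toy, y_toy, scoring="r2", cv=cv)

# Report the mean together with the spread across folds.
print(round(scores.mean(), 2), round(scores.std(), 2))
```

Reporting the standard deviation alongside the mean shows how much the score varies from fold to fold.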

In [None]:
# Fit the training data
xgb_reg.fit(pca_fit_transform_train_X_72,train_y)

In [None]:
# predict the time taken by car to pass testing using test dataset
X_test_pred = xgb_reg.predict(pca_transform_test_X_72)
X_test_pred 

In [None]:
# print the predicted value (time) in the form of table
df_test_pred = pd.DataFrame({'ID': test_ID, 'y': X_test_pred})
# Print the first 10 predicted values
df_test_pred.head(10)

In [None]:
# save the predicted time values 
df_test_pred.to_csv('submission.csv', index=False)

**Conclusion**:
>  For a given dataset, XGBoost Regressor algorithm with cross validation results in R2 score of **0.62**.

**Scope of Future Work:**
> 1.  I have considered 72 principal components for the study using principal component analysis (PCA). Features can be further reduced by using different dimensionality reduction techniques [[Reference](https://thenewstack.io/3-new-techniques-for-data-dimensionality-reduction-in-machine-learning/)].
> 2.  I have used only the XGBoost Regressor algorithm for predicting the time a Mercedes-Benz spends on the test bench. The R2 score may be further improved by using other regressor algorithms with hyperparameter tuning and cross validation. 
> 3. As the number of features is large, deep learning techniques can also be used for the study. 

**References:**
> 1. Dataset and problem statement: https://www.kaggle.com/c/mercedes-benz-greener-manufacturing 
> 2. https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/discussion
> 3. https://medium.com/swlh/greener-manufacturing-with-machine-learning-6ec77d0e7a91
> 4. How to Perform One Hot Encoding for Multi Categorical Variables https://www.youtube.com/watch?v=6WDFfaYtN6s