# Mercedes-Benz Greener Manufacturing

### DESCRIPTION for the project

Reduce the time a Mercedes-Benz spends on the test bench.

Problem Statement Scenario:
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before they hit the road, the company’s engineers have developed a robust testing system. As one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on Mercedes-Benz’s production lines. However, optimizing the speed of their testing system for many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. You will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.


#### Task

1. If the variance of any column is equal to zero, remove that column.
2. Check for null and unique values in the test and train sets.
3. Apply a label encoder.
4. Perform dimensionality reduction.
5. Predict your test_df values using XGBoost.

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.decomposition import PCA
import xgboost as xgb
from sklearn.metrics import r2_score

In [2]:
# load the train and test data
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')

In [3]:
# check train data set
train

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,8405,107.39,ak,s,as,c,d,aa,d,q,...,1,0,0,0,0,0,0,0,0,0
4205,8406,108.77,j,o,t,d,d,aa,h,h,...,0,1,0,0,0,0,0,0,0,0
4206,8412,109.22,ak,v,r,a,d,aa,g,e,...,0,0,1,0,0,0,0,0,0,0
4207,8415,87.48,al,r,e,f,d,aa,l,u,...,0,0,0,0,0,0,0,0,0,0


In [4]:
# check train data head values
train.head()
# head is showing first 5 rows

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0


In [5]:
#  check data set information
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 378 entries, ID to X385
dtypes: float64(1), int64(369), object(8)
memory usage: 12.1+ MB


The summary above shows the index range of the DataFrame, the number of columns, the column dtypes, and the memory used by the data set.

In [6]:
# check size of data set
# (rows and columns)
train.shape

(4209, 378)

In [7]:
# check null value
train.isnull().sum()

ID      0
y       0
X0      0
X1      0
X2      0
       ..
X380    0
X382    0
X383    0
X384    0
X385    0
Length: 378, dtype: int64

In [8]:
train.describe()

Unnamed: 0,ID,y,X10,X11,X12,X13,X14,X15,X16,X17,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
count,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,...,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0
mean,4205.960798,100.669318,0.013305,0.0,0.075077,0.057971,0.42813,0.000475,0.002613,0.007603,...,0.318841,0.057258,0.314802,0.02067,0.009503,0.008078,0.007603,0.001663,0.000475,0.001426
std,2437.608688,12.679381,0.11459,0.0,0.263547,0.233716,0.494867,0.021796,0.051061,0.086872,...,0.466082,0.232363,0.464492,0.142294,0.097033,0.089524,0.086872,0.040752,0.021796,0.037734
min,0.0,72.11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2095.0,90.82,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4220.0,99.15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,6314.0,109.01,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,8417.0,265.32,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


#### If for any column(s), the variance is equal to zero, then you need to remove those variable(s).

`DataFrame.var()` returns the variance of each numeric column. A column whose variance is zero holds the same value in every row, so it carries no information for the model and can be dropped.

***Note:*** pandas computes the sample variance (`ddof=1`) by default; a constant column has zero variance under either definition.
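For the zero-variance check itself, the relevant call is the pandas method `DataFrame.var()`. The following is a minimal sketch on a toy frame (the frame `toy` and its values are made up for illustration), assuming a pandas version that accepts the `numeric_only` argument:

```python
import pandas as pd

# Toy frame (hypothetical data): 'b' is constant, 'c' is non-numeric.
toy = pd.DataFrame({
    "a": [0, 1, 0, 1],
    "b": [0, 0, 0, 0],
    "c": ["k", "az", "k", "t"],
})

# numeric_only=True skips the text column instead of failing on it.
variances = toy.var(numeric_only=True)

# Labels of the columns whose variance is exactly zero.
constant_cols = variances[variances == 0].index.tolist()

# Drop the constant columns; the text column is left untouched.
reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(reduced.shape)
```

Computing the variances once and reusing the result avoids scanning the frame twice.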

In [9]:
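# check the variance of each numeric column
# (numeric_only=True skips the object columns, which newer pandas versions no longer drop silently)
train.var(numeric_only=True)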

ID      5.941936e+06
y       1.607667e+02
X10     1.313092e-02
X11     0.000000e+00
X12     6.945713e-02
            ...     
X380    8.014579e-03
X382    7.546747e-03
X383    1.660732e-03
X384    4.750593e-04
X385    1.423823e-03
Length: 370, dtype: float64

In [10]:
# Count the columns whose variance is equal to zero
(train.var(numeric_only=True)==0).sum()

12

In total, 12 columns have a variance equal to zero.

In [11]:
train.var(numeric_only=True)==0

ID      False
y       False
X10     False
X11      True
X12     False
        ...  
X380    False
X382    False
X383    False
X384    False
X385    False
Length: 370, dtype: bool

In [12]:
(train.var(numeric_only=True)==0).values

array([False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

In [13]:
# names of the columns whose variance is zero
variances=train.var(numeric_only=True)
variance_zero=variances[variances==0].index.values
variance_zero

array(['X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290',
       'X293', 'X297', 'X330', 'X347'], dtype=object)

In [14]:
# Drop zero variance variables
train =train.drop(variance_zero,axis=1)

In [15]:
# shape after dropping the zero-variance columns (rows are unchanged)
train.shape

(4209, 366)

In [16]:
# the ID column is irrelevant for the prediction, so drop it
train =train.drop(['ID'],axis=1)

In [17]:
train.head()

Unnamed: 0,y,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,130.81,k,v,at,a,d,u,j,o,0,...,0,0,1,0,0,0,0,0,0,0
1,88.53,k,t,av,e,d,y,l,o,0,...,1,0,0,0,0,0,0,0,0,0
2,76.26,az,w,n,c,d,x,j,x,0,...,0,0,0,0,0,0,1,0,0,0
3,80.62,az,t,n,f,d,x,l,e,0,...,0,0,0,0,0,0,0,0,0,0
4,78.02,az,v,n,f,d,h,d,n,0,...,0,0,0,0,0,0,0,0,0,0


#### Check for null and unique values for test and train sets.

In [18]:
train.isnull().sum().values

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

***DataFrame.any()*** returns, for each column, True if at least one element in that column is true and False otherwise. Applied to `train.isnull()`, it flags every column that contains at least one missing value. For an empty column, any() returns False.
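On a DataFrame, `isnull().any()` reports per column whether at least one value is missing, and a second `any()` collapses that to a single answer for the whole frame. A minimal sketch on a toy frame (hypothetical data, not the Mercedes-Benz set):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): column 'x' has one missing value.
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", "b", "c"]})

per_column = toy.isnull().any()         # one boolean per column
has_any_null = bool(per_column.any())   # single boolean for the whole frame
null_counts = toy.isnull().sum()        # number of missing values per column

print(per_column)
print(has_any_null)
print(null_counts)
```

The single boolean is convenient for wide frames such as this one, where the per-column output is truncated when displayed.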

In [19]:
train.isnull().any()

y       False
X0      False
X1      False
X2      False
X3      False
        ...  
X380    False
X382    False
X383    False
X384    False
X385    False
Length: 365, dtype: bool

Now I repeat the same checks on the test data.

In [20]:
# check test data head
test.head()

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,1,az,v,n,f,d,t,a,w,0,...,0,0,0,1,0,0,0,0,0,0
1,2,t,b,ai,a,d,b,g,y,0,...,0,0,1,0,0,0,0,0,0,0
2,3,az,v,as,f,d,a,j,j,0,...,0,0,0,1,0,0,0,0,0,0
3,4,az,l,n,f,d,z,l,n,0,...,0,0,0,1,0,0,0,0,0,0
4,5,w,s,as,c,d,y,i,m,0,...,1,0,0,0,0,0,0,0,0,0


In [21]:
# check test data shape 
test.shape

(4209, 377)

In [22]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 377 entries, ID to X385
dtypes: int64(369), object(8)
memory usage: 12.1+ MB


In [23]:
test.describe()

Unnamed: 0,ID,X10,X11,X12,X13,X14,X15,X16,X17,X18,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
count,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,...,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0,4209.0
mean,4211.039202,0.019007,0.000238,0.074364,0.06106,0.427893,0.000713,0.002613,0.008791,0.010216,...,0.325968,0.049656,0.311951,0.019244,0.011879,0.008078,0.008791,0.000475,0.000713,0.001663
std,2423.078926,0.136565,0.015414,0.262394,0.239468,0.494832,0.026691,0.051061,0.093357,0.10057,...,0.468791,0.217258,0.463345,0.137399,0.108356,0.089524,0.093357,0.021796,0.026691,0.040752
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2115.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4202.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,6310.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,8416.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [24]:
test.isnull().sum().values

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [25]:
# Find unique values
train.nunique()

y       2545
X0        47
X1        27
X2        44
X3         7
        ... 
X380       2
X382       2
X383       2
X384       2
X385       2
Length: 365, dtype: int64

In [26]:
test.nunique()

ID      4209
X0        49
X1        27
X2        45
X3         7
        ... 
X380       2
X382       2
X383       2
X384       2
X385       2
Length: 377, dtype: int64

##### Find out the columns having object datatype

***dataframe.select_dtypes()*** returns a subset of the DataFrame’s columns based on their dtypes. Its `include` parameter selects all columns of the given data types, and its `exclude` parameter leaves out all columns of the given data types.
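As a small illustration of the two selection directions, the sketch below splits a toy frame into its object columns and its numeric columns. The frame and its values are invented for the example; the text column is created with an explicit `object` dtype so the result does not depend on the default string dtype of the installed pandas version.

```python
import pandas as pd

# Toy frame (hypothetical data) with one text column and two numeric columns.
toy = pd.DataFrame({
    "X0": pd.Series(["k", "az", "t"], dtype=object),
    "X10": [0, 1, 0],
    "y": [130.81, 88.53, 76.26],
})

categorical = toy.select_dtypes(include=[object])    # text columns only
numeric = toy.select_dtypes(include=["number"])      # integer and float columns

print(list(categorical.columns))
print(list(numeric.columns))
```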

In [27]:
object_data_types=train.select_dtypes(include=[object])
object_data_types

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8
0,k,v,at,a,d,u,j,o
1,k,t,av,e,d,y,l,o
2,az,w,n,c,d,x,j,x
3,az,t,n,f,d,x,l,e
4,az,v,n,f,d,h,d,n
...,...,...,...,...,...,...,...,...
4204,ak,s,as,c,d,aa,d,q
4205,j,o,t,d,d,aa,h,h
4206,ak,v,r,a,d,aa,g,e
4207,al,r,e,f,d,aa,l,u


In [28]:
object_data_type_columns=object_data_types.columns
object_data_type_columns

Index(['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8'], dtype='object')

#### Apply label encoder.

https://docs.w3cub.com/scikit_learn/modules/generated/sklearn.preprocessing.labelencoder

In [29]:
label_encoder=preprocessing.LabelEncoder()
train['X0'].unique()

array(['k', 'az', 't', 'al', 'o', 'w', 'j', 'h', 's', 'n', 'ay', 'f', 'x',
       'y', 'aj', 'ak', 'am', 'z', 'q', 'at', 'ap', 'v', 'af', 'a', 'e',
       'ai', 'd', 'aq', 'c', 'aa', 'ba', 'as', 'i', 'r', 'b', 'ax', 'bc',
       'u', 'ad', 'au', 'm', 'l', 'aw', 'ao', 'ac', 'g', 'ab'],
      dtype=object)

In [30]:
# Encode and transform the object data to integers
train['X0']=label_encoder.fit_transform(train['X0'])

In [31]:
train['X0'].unique()

array([32, 20, 40,  9, 36, 43, 31, 29, 39, 35, 19, 27, 44, 45,  7,  8, 10,
       46, 37, 15, 12, 42,  5,  0, 26,  6, 25, 13, 24,  1, 22, 14, 30, 38,
       21, 18, 23, 41,  4, 16, 34, 33, 17, 11,  3, 28,  2])

In [32]:
# apply the same encoding to the remaining object-type columns
for col in ['X1','X2','X3','X4','X5','X6','X8']:
    train[col]=label_encoder.fit_transform(train[col])

In [33]:
train.head()

Unnamed: 0,y,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,130.81,32,23,17,0,3,24,9,14,0,...,0,0,1,0,0,0,0,0,0,0
1,88.53,32,21,19,4,3,28,11,14,0,...,1,0,0,0,0,0,0,0,0,0
2,76.26,20,24,34,2,3,27,9,23,0,...,0,0,0,0,0,0,1,0,0,0
3,80.62,20,21,34,5,3,27,11,4,0,...,0,0,0,0,0,0,0,0,0,0
4,78.02,20,23,34,5,3,12,3,13,0,...,0,0,0,0,0,0,0,0,0,0


#### Perform dimensionality reduction PCA


In [34]:
# PCA keeping 95% of the variance (dimensionality reduction)
# note: train still contains the target column y at this point
sklearn_pca=PCA(n_components=0.95)
sklearn_pca.fit(train)

PCA(n_components=0.95)

In [35]:
x_train_transformed =sklearn_pca.transform(train)

In [36]:
x_train_transformed.shape

(4209, 6)

In [37]:
# now keep 98% of the variance
sklearn_pca_98=PCA(n_components=0.98)

In [38]:
sklearn_pca_98.fit(train)

PCA(n_components=0.98)

In [39]:
x_train_transformed_98=sklearn_pca_98.transform(train)
x_train_transformed_98.shape

(4209, 12)

In [40]:
train.y

0       130.81
1        88.53
2        76.26
3        80.62
4        78.02
         ...  
4204    107.39
4205    108.77
4206    109.22
4207     87.48
4208    110.85
Name: y, Length: 4209, dtype: float64

#### Train and Test split on Train dataset

In [41]:
X=train.drop('y',axis=1)
y=train.y
xtrain,xtest,ytrain,ytest=train_test_split(X,y,test_size=0.3,random_state=42)

In [42]:
print(xtrain)
print(xtrain.shape)

      X0  X1  X2  X3  X4  X5  X6  X8  X10  X12  ...  X375  X376  X377  X378  \
370   35  13  16   1   3   9   6  19    0    0  ...     0     0     0     0   
3392  15  10  16   2   3  23   9  16    0    0  ...     0     0     1     0   
2208  31   3  16   2   3  15   2  21    0    0  ...     0     0     1     0   
3942  35  20   8   6   3  26   6  14    0    1  ...     1     0     0     0   
1105  36  13  16   5   3   1   6   0    0    0  ...     0     0     0     0   
...   ..  ..  ..  ..  ..  ..  ..  ..  ...  ...  ...   ...   ...   ...   ...   
3444  31  10  16   2   3  22  11  17    0    0  ...     0     0     1     0   
466   20  25  25   2   3   9   9   9    0    0  ...     0     0     0     0   
3092  45  24   3   2   3  21   8   2    0    0  ...     1     0     0     0   
3772  45  19   8   5   3  25   8   1    0    1  ...     0     0     0     0   
860   22   1   7   2   3   5   9  17    0    0  ...     1     0     0     0   

      X379  X380  X382  X383  X384  X385  
370     

In [43]:
print(ytrain)
print(ytrain.shape)

370      95.13
3392    117.36
2208    109.01
3942     93.77
1105    103.41
         ...  
3444    109.42
466      78.25
3092     92.18
3772     91.92
860      87.71
Name: y, Length: 2946, dtype: float64
(2946,)


In [44]:
print(xtest)
print(xtest.shape)

      X0  X1  X2  X3  X4  X5  X6  X8  X10  X12  ...  X375  X376  X377  X378  \
1073   9  16   7   5   3   6   9  11    0    0  ...     0     0     0     0   
144   27  13   3   5   3  13   8  22    0    0  ...     0     0     0     0   
2380  31   1  21   2   3  18  11  14    1    0  ...     1     0     0     0   
184   20  25  22   2   3  13   9  11    0    0  ...     0     0     0     0   
2587   8  23   8   3   3  17   8  17    0    0  ...     0     0     0     0   
...   ..  ..  ..  ..  ..  ..  ..  ..  ...  ...  ...   ...   ...   ...   ...   
2493  27  20  16   2   3  18  10   5    0    0  ...     0     0     1     0   
3388  40  19  24   5   3  23   3  19    0    0  ...     0     0     0     0   
3997  22   3   7   0   3  26   6  18    0    0  ...     0     0     1     0   
383   40   1  16   6   3   9   8   0    0    0  ...     1     0     0     0   
3364  27   4  33   2   3  23   6  24    0    0  ...     0     0     1     0   

      X379  X380  X382  X383  X384  X385  
1073    

In [45]:
# PCA with 95% for xtrain
pca_xtrain=PCA(n_components=0.95)
pca_xtrain.fit(xtrain)

PCA(n_components=0.95)

In [46]:
pca_xtrain_transformed =pca_xtrain.transform(xtrain)
print(pca_xtrain_transformed.shape)

(2946, 6)


In [47]:
# PCA with 95% for xtest
pca_xtest=PCA(n_components=0.95)
pca_xtest.fit(xtest)

PCA(n_components=0.95)

In [48]:
pca_xtest_transformed=pca_xtest.transform(xtest)
print(pca_xtest_transformed.shape)

(1263, 6)
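Fitting one PCA on `xtrain` and a second one on `xtest`, as in the cells above, yields two different sets of components, so the six columns of the two projections need not describe the same directions. A common alternative is to fit once on the training split and reuse the fitted object for every other split. The sketch below shows this on synthetic data; the array `X` and the names `X_fit` and `X_holdout` are placeholders, not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for an encoded feature matrix (hypothetical data):
# three high-variance columns and seven low-variance ones.
scales = np.array([10, 8, 6, 1, 1, 1, 1, 1, 1, 1])
X = rng.normal(size=(200, 10)) * scales

X_fit, X_holdout = train_test_split(X, test_size=0.3, random_state=42)

pca = PCA(n_components=0.95)
X_fit_reduced = pca.fit_transform(X_fit)       # learn the components on the fitting split only
X_holdout_reduced = pca.transform(X_holdout)   # project the held-out rows onto the same components

print(X_fit_reduced.shape, X_holdout_reduced.shape)
print(pca.explained_variance_ratio_.cumsum())
```

With a single fitted object, both projections are guaranteed to have the same number of columns with the same meaning, which is what a downstream model needs.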


In [49]:
print(pca_xtest.explained_variance_)
print(pca_xtest.explained_variance_ratio_)

[206.79524961 120.24273955  67.64680756  61.94375666  48.08214872
   8.7271811 ]
[0.38517942 0.22396563 0.12599979 0.11537722 0.08955841 0.01625536]


#### PCA for test dataset

In [50]:
test

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,1,az,v,n,f,d,t,a,w,0,...,0,0,0,1,0,0,0,0,0,0
1,2,t,b,ai,a,d,b,g,y,0,...,0,0,1,0,0,0,0,0,0,0
2,3,az,v,as,f,d,a,j,j,0,...,0,0,0,1,0,0,0,0,0,0
3,4,az,l,n,f,d,z,l,n,0,...,0,0,0,1,0,0,0,0,0,0
4,5,w,s,as,c,d,y,i,m,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,8410,aj,h,as,f,d,aa,j,e,0,...,0,0,0,0,0,0,0,0,0,0
4205,8411,t,aa,ai,d,d,aa,j,y,0,...,0,1,0,0,0,0,0,0,0,0
4206,8413,y,v,as,f,d,aa,d,w,0,...,0,0,0,0,0,0,0,0,0,0
4207,8414,ak,v,as,a,d,aa,c,q,0,...,0,0,1,0,0,0,0,0,0,0


In [51]:
test_object_datatypes=test.select_dtypes(include=[object])
test_object_datatypes

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8
0,az,v,n,f,d,t,a,w
1,t,b,ai,a,d,b,g,y
2,az,v,as,f,d,a,j,j
3,az,l,n,f,d,z,l,n
4,w,s,as,c,d,y,i,m
...,...,...,...,...,...,...,...,...
4204,aj,h,as,f,d,aa,j,e
4205,t,aa,ai,d,d,aa,j,y
4206,y,v,as,f,d,aa,d,w
4207,ak,v,as,a,d,aa,c,q


In [52]:
# encode the object-type columns of the test set
for col in ['X0','X1','X2','X3','X4','X5','X6','X8']:
    test[col]=label_encoder.fit_transform(test[col])
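Calling `fit_transform` separately on the test columns gives each category whatever code its position in the sorted test values dictates, which need not match the code the same category received in train (the unique counts above already differ: 47 values of X0 in train, 49 in test). One possible remedy, sketched here on made-up values, is to fit a single encoder per column on the categories of both sets and then transform each set with it. The lists `train_col` and `test_col` are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up category values; 'w' occurs only in the test column.
train_col = ["k", "az", "t", "k"]
test_col = ["az", "t", "w"]

# Separate fit on test: codes depend only on the test categories.
separate_codes = LabelEncoder().fit_transform(test_col)

# Shared fit: one mapping built from the categories of both columns.
encoder = LabelEncoder()
encoder.fit(train_col + test_col)
train_codes = encoder.transform(train_col)
test_codes = encoder.transform(test_col)

print(list(encoder.classes_))   # categories in sorted order; the code is the position
print(separate_codes, test_codes)
```

In this toy case 't' is encoded as 1 by the separate fit but as 2 by the shared one, which is the kind of mismatch a model trained on the train codes cannot detect.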

In [53]:
print(test)
print(test.shape)

        ID  X0  X1  X2  X3  X4  X5  X6  X8  X10  ...  X375  X376  X377  X378  \
0        1  21  23  34   5   3  26   0  22    0  ...     0     0     0     1   
1        2  42   3   8   0   3   9   6  24    0  ...     0     0     1     0   
2        3  21  23  17   5   3   0   9   9    0  ...     0     0     0     1   
3        4  21  13  34   5   3  31  11  13    0  ...     0     0     0     1   
4        5  45  20  17   2   3  30   8  12    0  ...     1     0     0     0   
...    ...  ..  ..  ..  ..  ..  ..  ..  ..  ...  ...   ...   ...   ...   ...   
4204  8410   6   9  17   5   3   1   9   4    0  ...     0     0     0     0   
4205  8411  42   1   8   3   3   1   9  24    0  ...     0     1     0     0   
4206  8413  47  23  17   5   3   1   3  22    0  ...     0     0     0     0   
4207  8414   7  23  17   0   3   1   2  16    0  ...     0     0     1     0   
4208  8416  42   1   8   2   3   1   6  17    0  ...     1     0     0     0   

      X379  X380  X382  X383  X384  X38

In [54]:
# drop the ID column
test=test.drop('ID',axis=1)

In [55]:
# PCA with 95% for test dataset
pca_test=PCA(n_components=0.95)
pca_test.fit(test)

PCA(n_components=0.95)

In [56]:
pca_test_transformed = pca_test.transform(test)
print(pca_test_transformed.shape)

(4209, 6)


In [57]:
print(pca_test.explained_variance_)
print(pca_test.explained_variance_ratio_)

[247.07875325 100.33535335  77.48364816  62.33258307  48.95689653
   8.14203723]
[0.43515102 0.17670897 0.13646292 0.10977912 0.08622208 0.01433962]


In [58]:
y

0       130.81
1        88.53
2        76.26
3        80.62
4        78.02
         ...  
4204    107.39
4205    108.77
4206    109.22
4207     87.48
4208    110.85
Name: y, Length: 4209, dtype: float64

#### Perform XGboost

In [59]:
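# Building the final feature set
f_train=xgb.DMatrix(xtrain, label=ytrain)
f_test=xgb.DMatrix(xtest,label=ytest)
# The model is trained on the raw xtrain columns, so the competition test set
# needs the same columns: drop the zero-variance columns that were removed from train
fea_test=xgb.DMatrix(test.drop(variance_zero,axis=1))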

In [60]:
# Setting the parameters for XGB
params = {}
params['objective'] = 'reg:squarederror'  # current name of the deprecated 'reg:linear'
params['eta'] = 0.02
params['max_depth'] = 4

In [61]:
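# Custom evaluation metric: R^2 between the true labels and the predictions

def scorer(preds, dmatrix):
    labels = dmatrix.get_label()
    return 'r2', r2_score(labels, preds)

# Data sets evaluated after each boosting round
final_set = [(f_train, 'train'), (f_test, 'test')]

# Train for up to 1000 rounds; stop early if the test R^2 has not improved for 50 rounds
P = xgb.train(params, f_train, 1000, final_set, early_stopping_rounds=50, feval=scorer, maximize=True, verbose_eval=10)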

[0]	train-rmse:98.72994	train-r2:-62.77040	test-rmse:99.52606	test-r2:-54.40662
[10]	train-rmse:80.82342	train-r2:-41.73619	test-rmse:81.62614	test-r2:-36.26890
[20]	train-rmse:66.21521	train-r2:-27.68381	test-rmse:67.03600	test-r2:-24.13646
[30]	train-rmse:54.30886	train-r2:-18.29580	test-rmse:55.15834	test-r2:-16.01807
[40]	train-rmse:44.61799	train-r2:-12.02391	test-rmse:45.50767	test-r2:-10.58396
[50]	train-rmse:36.74573	train-r2:-7.83355	test-rmse:37.68854	test-r2:-6.94523
[60]	train-rmse:30.36847	train-r2:-5.03347	test-rmse:31.37762	test-r2:-4.50716
[70]	train-rmse:25.22260	train-r2:-3.16199	test-rmse:26.30926	test-r2:-2.87173
[80]	train-rmse:21.09160	train-r2:-1.91032	test-rmse:22.27786	test-r2:-1.77610
[90]	train-rmse:17.79897	train-r2:-1.07258	test-rmse:19.09911	test-r2:-1.04040
[100]	train-rmse:15.19879	train-r2:-0.51126	test-rmse:16.62567	test-r2:-0.54613
[110]	train-rmse:13.17097	train-r2:-0.13490	test-rmse:14.72340	test-r2:-0.21256
[120]	train-rmse:11.60943	train-r2:0.11825	test-rmse:13.29016	test-r2:0.01202
[130]	train-rmse:10.42655	train-r2:0.28878	
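The strongly negative R² values in the first rounds follow from the definition R² = 1 - MSE / Var(y): when the predictions sit far from the mean of y, the mean squared error is much larger than the variance of the targets. The first log line, for instance, reports a train RMSE of 98.73 (an MSE of about 9748) next to a train R² of -62.77, which implies a target variance near 153. The sketch below reproduces the effect on synthetic targets; the distribution parameters and the constant prediction of 2.0 are assumptions chosen only to mimic an under-trained model.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic targets (hypothetical data) centred around 100.
y_true = rng.normal(loc=100.0, scale=12.0, size=1000)

# A constant prediction far below the target mean, as early in slow boosting.
y_pred = np.full_like(y_true, 2.0)

mse = np.mean((y_true - y_pred) ** 2)
r2_manual = 1.0 - mse / np.var(y_true)   # np.var uses the population variance (ddof=0)
r2_sklearn = r2_score(y_true, y_pred)

# Predicting the mean of y gives R^2 = 0, the baseline the score is measured against.
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(r2_manual, r2_sklearn, r2_mean)
```

As the predictions approach the targets over the boosting rounds, the MSE drops below the variance and R² turns positive, which matches the trend in the log.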

In [62]:
# Predicting on the held-out split of the train data (f_test)
p_test = P.predict(f_test)
p_test

array([ 92.19598,  94.34275, 112.12796, ...,  91.63689,  94.53489,
        94.61652], dtype=float32)

In [63]:
Predicted_Data = pd.DataFrame()
Predicted_Data['y'] = p_test
Predicted_Data.head()

Unnamed: 0,y
0,92.195976
1,94.342751
2,112.12796
3,78.708305
4,113.040672


### Thank you !!!  :)