### Reduce the time a Mercedes-Benz spends on the test bench.

Problem Statement Scenario:
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before it hits the road, the company’s engineers have developed a robust testing system. Because Mercedes-Benz is one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on its production lines. However, optimizing the speed of the testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. You will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.

The following actions should be performed:

- If the variance of any column is equal to zero, remove that column.
- Check for null and unique values for test and train sets.
- Apply label encoder.
- Perform dimensionality reduction.
- Predict your test_df values using XGBoost.

# Mercedes-Benz Greener Manufacturing

## Importing important libraries

In [1]:
#for algebra
import numpy as np
#to read files
import pandas as pd
#for dimensionality reduction
import sklearn
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

## Reading train and test data

In [2]:
trn_data = pd.read_csv(r"E:\Simplilearn\ML Lecture\Notes and Datasets\Datasets\Project Dataset\Mercedez Benz\train.csv")

In [3]:
tst_data = pd.read_csv("E:/Simplilearn/ML Lecture/Notes and Datasets/Datasets/Project Dataset/Mercedez Benz/test.csv")

## Understanding the data

In [4]:
print('Size of training set: {} rows and {} columns'.format(*trn_data.shape))

Size of training set: 4209 rows and 378 columns


In [5]:
print('Size of test set: {} rows and {} columns'.format(*tst_data.shape))

Size of test set: 4209 rows and 377 columns


In [6]:
trn_data.head()

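|  | ID | y | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 130.81 | k | v | at | a | d | u | j | o | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 6 | 88.53 | k | t | av | e | d | y | l | o | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 7 | 76.26 | az | w | n | c | d | x | j | x | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 9 | 80.62 | az | t | n | f | d | x | l | e | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 13 | 78.02 | az | v | n | f | d | h | d | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |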


In [7]:
tst_data.head()

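|  | ID | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | X10 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | az | v | n | f | d | t | a | w | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | t | b | ai | a | d | b | g | y | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | az | v | as | f | d | a | j | j | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | az | l | n | f | d | z | l | n | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | w | s | as | c | d | y | i | m | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |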


In [8]:
print(trn_data.shape)
print(trn_data.columns)

(4209, 378)
Index(['ID', 'y', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8',
       ...
       'X375', 'X376', 'X377', 'X378', 'X379', 'X380', 'X382', 'X383', 'X384',
       'X385'],
      dtype='object', length=378)


In [9]:
print(tst_data.shape)
print(tst_data.columns)

(4209, 377)
Index(['ID', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8', 'X10',
       ...
       'X375', 'X376', 'X377', 'X378', 'X379', 'X380', 'X382', 'X383', 'X384',
       'X385'],
      dtype='object', length=377)


In [10]:
trn_data.describe()

|  | ID | y | X10 | X11 | X12 | X13 | X14 | X15 | X16 | X17 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | ... | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 | 4209.0 |
| mean | 4205.960798 | 100.669318 | 0.013305 | 0.0 | 0.075077 | 0.057971 | 0.42813 | 0.000475 | 0.002613 | 0.007603 | ... | 0.318841 | 0.057258 | 0.314802 | 0.02067 | 0.009503 | 0.008078 | 0.007603 | 0.001663 | 0.000475 | 0.001426 |
| std | 2437.608688 | 12.679381 | 0.11459 | 0.0 | 0.263547 | 0.233716 | 0.494867 | 0.021796 | 0.051061 | 0.086872 | ... | 0.466082 | 0.232363 | 0.464492 | 0.142294 | 0.097033 | 0.089524 | 0.086872 | 0.040752 | 0.021796 | 0.037734 |
| min | 0.0 | 72.11 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2095.0 | 90.82 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 4220.0 | 99.15 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 6314.0 | 109.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 8417.0 | 265.32 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [11]:
y_train = trn_data['y'].values

In [12]:
y_train

array([130.81,  88.53,  76.26, ..., 109.22,  87.48, 110.85])

In [13]:
#checking for the number of features and data types of the features for Train data
cols = [c for c in trn_data.columns if 'X' in c]
print('Number of features: {}'.format(len(cols)))
print('Feature types:')
trn_data[cols].dtypes.value_counts()

Number of features: 376
Feature types:


int64     368
object      8
dtype: int64

In [14]:
#checking for the number of features and data types of the features for Test data
cols = [c for c in tst_data.columns if 'X' in c]
print('Number of features: {}'.format(len(cols)))
print('Feature types:')
tst_data[cols].dtypes.value_counts()

Number of features: 376
Feature types:


int64     368
object      8
dtype: int64

In [15]:
#Checking for types of features for train data
counts = [[], [], []]
for c in cols:
    typ = trn_data[c].dtype
    uniq = len(np.unique(trn_data[c]))
    if uniq == 1:
        counts[0].append(c)
    elif uniq == 2 and typ == np.int64:
        counts[1].append(c)
    else:
        counts[2].append(c)

print('Constant features: {} Binary features: {} Categorical features: {}\n'.format(*[len(c) for c in counts]))
print('Constant features:', counts[0])
print('Categorical features:', counts[2])

Constant features: 12 Binary features: 356 Categorical features: 8

Constant features: ['X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290', 'X293', 'X297', 'X330', 'X347']
Categorical features: ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']


In [16]:
#Checking for types of features for test data
counts = [[], [], []]
for c in cols:
    typ = tst_data[c].dtype
    uniq = len(np.unique(tst_data[c]))
    if uniq == 1:
        counts[0].append(c)
    elif uniq == 2 and typ == np.int64:
        counts[1].append(c)
    else:
        counts[2].append(c)

print('Constant features: {} Binary features: {} Categorical features: {}\n'.format(*[len(c) for c in counts]))
print('Constant features:', counts[0])
print('Categorical features:', counts[2])

Constant features: 5 Binary features: 363 Categorical features: 8

Constant features: ['X257', 'X258', 'X295', 'X296', 'X369']
Categorical features: ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']
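
The two cells above run the same loop once per dataset. A minimal sketch of a reusable helper follows; the name `summarize_feature_types` and the toy frame are assumptions for illustration and are not part of the original notebook. It keeps the same rules: one unique value is constant, two unique integer values is binary, and everything else is treated as categorical.

```python
import pandas as pd

def summarize_feature_types(df, marker="X"):
    """Group feature columns into constant, binary, and categorical lists.

    Hypothetical helper: columns are selected the same way as in the
    notebook, by checking whether the marker appears in the column name.
    """
    groups = {"constant": [], "binary": [], "categorical": []}
    for col in (c for c in df.columns if marker in c):
        n_unique = df[col].nunique()
        if n_unique == 1:
            groups["constant"].append(col)
        elif n_unique == 2 and pd.api.types.is_integer_dtype(df[col]):
            groups["binary"].append(col)
        else:
            groups["categorical"].append(col)
    return groups

# Toy data with made-up values, standing in for trn_data / tst_data
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "X0": ["k", "az", "k"],
    "X10": [0, 1, 0],
    "X11": [0, 0, 0],
})
result = summarize_feature_types(toy)
print(result)
```

With a helper like this, the train and test summaries could each be produced by a single call.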


## If for any column(s), the variance is equal to zero, then you need to remove those variable(s).

In [17]:
#checking for variance (numeric columns only; object columns have no variance)
trn_data.var(numeric_only=True)


ID      5.941936e+06
y       1.607667e+02
X10     1.313092e-02
X11     0.000000e+00
X12     6.945713e-02
            ...     
X380    8.014579e-03
X382    7.546747e-03
X383    1.660732e-03
X384    4.750593e-04
X385    1.423823e-03
Length: 370, dtype: float64

In [18]:
tst_data.var(numeric_only=True)


ID      5.871311e+06
X10     1.865006e-02
X11     2.375861e-04
X12     6.885074e-02
X13     5.734498e-02
            ...     
X380    8.014579e-03
X382    8.715481e-03
X383    4.750593e-04
X384    7.124196e-04
X385    1.660732e-03
Length: 369, dtype: float64

In [19]:
#finding if variance is zero for any columns
(trn_data.var(numeric_only=True) == 0)


ID      False
y       False
X10     False
X11      True
X12     False
        ...  
X380    False
X382    False
X383    False
X384    False
X385    False
Length: 370, dtype: bool

In [20]:
(tst_data.var(numeric_only=True) == 0)


ID      False
X10     False
X11     False
X12     False
X13     False
        ...  
X380    False
X382    False
X383    False
X384    False
X385    False
Length: 369, dtype: bool

In [21]:
(trn_data.var(numeric_only=True) == 0).values


array([False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

In [23]:
train_var = trn_data.var(numeric_only=True)
var_zero = train_var[train_var == 0].index.values
var_zero


array(['X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290',
       'X293', 'X297', 'X330', 'X347'], dtype=object)

In [26]:
test_var = tst_data.var(numeric_only=True)
var_zero_a = test_var[test_var == 0].index.values
var_zero_a


array(['X257', 'X258', 'X295', 'X296', 'X369'], dtype=object)

In [27]:
# Drop zero variance variables

train_data = trn_data.drop(var_zero, axis=1)

In [28]:
test_data = tst_data.drop(var_zero_a, axis=1)
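
Dropping zero-variance columns separately per dataset leaves the two frames with different feature sets: 12 columns are removed from train and 5 different ones from test. A model fitted on one frame generally expects the same columns in the other. The following is a minimal sketch, on toy data with made-up values, of deriving the zero-variance list from the training frame only and applying it to both frames. The variable names are illustrative.

```python
import pandas as pd

# Toy frames: X11 is constant in train, X10 is constant in test
train = pd.DataFrame({"X10": [0, 1, 0, 1], "X11": [0, 0, 0, 0], "X12": [1, 0, 0, 1]})
test = pd.DataFrame({"X10": [0, 0, 0, 0], "X11": [0, 1, 0, 0], "X12": [1, 1, 0, 1]})

# Decide which columns to drop by looking at the training data only
train_var = train.var(numeric_only=True)
zero_var_cols = train_var[train_var == 0].index.tolist()

# Apply the same list to both frames so their columns stay aligned
train_reduced = train.drop(columns=zero_var_cols)
test_reduced = test.drop(columns=zero_var_cols)

print(zero_var_cols)
print(list(train_reduced.columns), list(test_reduced.columns))
```

In this sketch `X10` stays in both frames even though it is constant in the toy test set, because the decision is made on the training data.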

In [27]:
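#Drop the ID column, as it carries no predictive information.
#Note: this removes ID from trn_data, not from train_data, so ID still appears in the train_data outputs below.
trn_data = trn_data.drop(['ID'], axis=1)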

In [32]:
train_data.shape

(4209, 366)

In [33]:
test_data.shape

(4209, 372)

In [34]:
train_data.head()

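|  | ID | y | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 130.81 | k | v | at | a | d | u | j | o | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 6 | 88.53 | k | t | av | e | d | y | l | o | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 7 | 76.26 | az | w | n | c | d | x | j | x | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 9 | 80.62 | az | t | n | f | d | x | l | e | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 13 | 78.02 | az | v | n | f | d | h | d | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |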


## Check for null and unique values for test and train sets 

In [36]:
#finding null values
train_data.isnull().sum()

ID      0
y       0
X0      0
X1      0
X2      0
       ..
X380    0
X382    0
X383    0
X384    0
X385    0
Length: 366, dtype: int64

In [37]:
test_data.isnull().sum()

ID      0
X0      0
X1      0
X2      0
X3      0
       ..
X380    0
X382    0
X383    0
X384    0
X385    0
Length: 372, dtype: int64

In [38]:
train_data.isnull().sum().values

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [39]:
train_data.isnull().any()

ID      False
y       False
X0      False
X1      False
X2      False
        ...  
X380    False
X382    False
X383    False
X384    False
X385    False
Length: 366, dtype: bool

In [40]:
test_data.isnull().sum().values

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [41]:
test_data.isnull().sum().any()

False

In [42]:
#finding unique values
train_data.nunique()

ID      4209
y       2545
X0        47
X1        27
X2        44
        ... 
X380       2
X382       2
X383       2
X384       2
X385       2
Length: 366, dtype: int64

In [43]:
test_data.nunique()

ID      4209
X0        49
X1        27
X2        45
X3         7
        ... 
X380       2
X382       2
X383       2
X384       2
X385       2
Length: 372, dtype: int64

In [44]:
#Filter out columns having object datatype
#(already found in the cells above)
print('Categorical features:', counts[2])


Categorical features: ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']


In [45]:
#finding them using a different method
object_dtypes = train_data.select_dtypes(include=[object])
object_dtypes

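|  | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 |
|---|---|---|---|---|---|---|---|---|
| 0 | k | v | at | a | d | u | j | o |
| 1 | k | t | av | e | d | y | l | o |
| 2 | az | w | n | c | d | x | j | x |
| 3 | az | t | n | f | d | x | l | e |
| 4 | az | v | n | f | d | h | d | n |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4204 | ak | s | as | c | d | aa | d | q |
| 4205 | j | o | t | d | d | aa | h | h |
| 4206 | ak | v | r | a | d | aa | g | e |
| 4207 | al | r | e | f | d | aa | l | u |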


In [46]:
objct_dtp_clms = object_dtypes.columns
objct_dtp_clms

Index(['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8'], dtype='object')

## Apply Label Encoder

In [47]:
# Initialize Label Encoder object

lbl_encoder = preprocessing.LabelEncoder()
train_data['X0'].unique()

array(['k', 'az', 't', 'al', 'o', 'w', 'j', 'h', 's', 'n', 'ay', 'f', 'x',
       'y', 'aj', 'ak', 'am', 'z', 'q', 'at', 'ap', 'v', 'af', 'a', 'e',
       'ai', 'd', 'aq', 'c', 'aa', 'ba', 'as', 'i', 'r', 'b', 'ax', 'bc',
       'u', 'ad', 'au', 'm', 'l', 'aw', 'ao', 'ac', 'g', 'ab'],
      dtype=object)

In [48]:
#Encoding to Integer
train_data['X0'] = lbl_encoder.fit_transform(train_data['X0'])

In [49]:
train_data['X0'].unique()

array([32, 20, 40,  9, 36, 43, 31, 29, 39, 35, 19, 27, 44, 45,  7,  8, 10,
       46, 37, 15, 12, 42,  5,  0, 26,  6, 25, 13, 24,  1, 22, 14, 30, 38,
       21, 18, 23, 41,  4, 16, 34, 33, 17, 11,  3, 28,  2])

In [50]:
#Applying the encoder to the remaining categorical columns.

for col in ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']:
    train_data[col] = lbl_encoder.fit_transform(train_data[col])

In [51]:
train_data.head()

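|  | ID | y | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 130.81 | 32 | 23 | 17 | 0 | 3 | 24 | 9 | 14 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 6 | 88.53 | 32 | 21 | 19 | 4 | 3 | 28 | 11 | 14 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 7 | 76.26 | 20 | 24 | 34 | 2 | 3 | 27 | 9 | 23 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 9 | 80.62 | 20 | 21 | 34 | 5 | 3 | 27 | 11 | 4 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 13 | 78.02 | 20 | 23 | 34 | 5 | 3 | 12 | 3 | 13 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |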
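
Refitting `LabelEncoder` on each column of each dataset means the same category can receive different integers in train and test. For example, `X0` has 47 unique values in train but 49 in test, so the two mappings cannot match. One possible remedy, sketched below on toy data with made-up values, is to fit one encoder per column on the categories of both frames and reuse it. Fitting on the combined categories is an assumption made for this sketch, not something the original notebook does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames: the test set contains categories ('w', 'b') absent from train
train = pd.DataFrame({"X0": ["k", "az", "t"], "X1": ["v", "t", "v"]})
test = pd.DataFrame({"X0": ["az", "w", "t"], "X1": ["b", "v", "t"]})

encoders = {}
for col in ["X0", "X1"]:
    enc = LabelEncoder()
    # Fit on the categories seen in either frame so the mapping is shared
    enc.fit(pd.concat([train[col], test[col]]))
    train[col] = enc.transform(train[col])
    test[col] = enc.transform(test[col])
    encoders[col] = enc  # keep the encoder to decode or reuse later

print(list(encoders["X0"].classes_))
print(train["X0"].tolist(), test["X0"].tolist())
```

Here `'az'` maps to the same integer in both frames, which per-frame fitting does not guarantee.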


## Perform Dimensionality Reduction

In [47]:
#PCA for Train Dataset

In [53]:
#PCA with 95%

sklearn_pca = PCA(n_components=0.95)

In [54]:
sklearn_pca.fit(train_data)

PCA(n_components=0.95)

In [55]:
x_train_transformed = sklearn_pca.transform(train_data)

In [56]:
print(x_train_transformed.shape)

(4209, 1)


In [57]:
#PCA with 98%

sklearn_pca_98 = PCA(n_components=0.98)

In [58]:
sklearn_pca_98.fit(train_data)

PCA(n_components=0.98)

In [59]:
x_train_transformed_98 = sklearn_pca_98.transform(train_data)

In [60]:
print(x_train_transformed_98.shape)

(4209, 1)
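
Both thresholds return a single component. A likely explanation is that `train_data` still contains the `ID` column, whose variance (about 5.9e6 in the variance output earlier) dwarfs that of the 0/1 features, and PCA ranks directions purely by variance. Note that `train_data` also still contains the target `y` at this point. The sketch below uses synthetic data, not the competition data, to illustrate the effect and one common remedy: leaving out the identifier and standardizing before PCA. `StandardScaler` is an addition for illustration and is not used in the original notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_rows = 500

# An ID-like column with a very large variance, plus 20 random 0/1 flags
ids = np.arange(n_rows, dtype=float) * 2.0
flags = rng.integers(0, 2, size=(n_rows, 20)).astype(float)
with_id = np.column_stack([ids, flags])

# PCA on the raw matrix: the ID-like column accounts for almost all variance
pca_raw = PCA(n_components=0.95).fit(with_id)

# PCA on standardized flags only: variance is spread over many directions
scaled_flags = StandardScaler().fit_transform(flags)
pca_scaled = PCA(n_components=0.95).fit(scaled_flags)

print("components with ID column:", pca_raw.n_components_)
print("components without ID, standardized:", pca_scaled.n_components_)
```

On this synthetic data the raw fit keeps one component, while the standardized fit needs several to reach the same threshold.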


In [61]:
train_df = train_data

In [62]:
train_df.y

0       130.81
1        88.53
2        76.26
3        80.62
4        78.02
         ...  
4204    107.39
4205    108.77
4206    109.22
4207     87.48
4208    110.85
Name: y, Length: 4209, dtype: float64

### Train and Test Split on Dataset

In [63]:
X = train_df.drop('y', axis=1)  # Remove the target y from train_df and keep the remaining columns as features.
y = train_df.y
xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=0.3,random_state=42)

In [64]:
X

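|  | ID | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | X10 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 32 | 23 | 17 | 0 | 3 | 24 | 9 | 14 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 6 | 32 | 21 | 19 | 4 | 3 | 28 | 11 | 14 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 7 | 20 | 24 | 34 | 2 | 3 | 27 | 9 | 23 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 9 | 20 | 21 | 34 | 5 | 3 | 27 | 11 | 4 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 13 | 20 | 23 | 34 | 5 | 3 | 12 | 3 | 13 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4204 | 8405 | 8 | 20 | 16 | 2 | 3 | 0 | 3 | 16 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4205 | 8406 | 31 | 16 | 40 | 3 | 3 | 0 | 7 | 7 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4206 | 8412 | 8 | 23 | 38 | 0 | 3 | 0 | 6 | 4 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4207 | 8415 | 9 | 19 | 25 | 5 | 3 | 0 | 11 | 20 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |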


In [65]:
y

0       130.81
1        88.53
2        76.26
3        80.62
4        78.02
         ...  
4204    107.39
4205    108.77
4206    109.22
4207     87.48
4208    110.85
Name: y, Length: 4209, dtype: float64

In [66]:
print(xtrain)
print(xtrain.shape)

        ID  X0  X1  X2  X3  X4  X5  X6  X8  X10  ...  X375  X376  X377  X378  \
370    735  35  13  16   1   3   9   6  19    0  ...     0     0     0     0   
3392  6770  15  10  16   2   3  23   9  16    0  ...     0     0     1     0   
2208  4414  31   3  16   2   3  15   2  21    0  ...     0     0     1     0   
3942  7907  35  20   8   6   3  26   6  14    0  ...     1     0     0     0   
1105  2191  36  13  16   5   3   1   6   0    0  ...     0     0     0     0   
...    ...  ..  ..  ..  ..  ..  ..  ..  ..  ...  ...   ...   ...   ...   ...   
3444  6879  31  10  16   2   3  22  11  17    0  ...     0     0     1     0   
466    898  20  25  25   2   3   9   9   9    0  ...     0     0     0     0   
3092  6214  45  24   3   2   3  21   8   2    0  ...     1     0     0     0   
3772  7558  45  19   8   5   3  25   8   1    0  ...     0     0     0     0   
860   1712  22   1   7   2   3   5   9  17    0  ...     1     0     0     0   

      X379  X380  X382  X383  X384  X38

In [67]:
print(ytrain)
print(ytrain.shape)

370      95.13
3392    117.36
2208    109.01
3942     93.77
1105    103.41
         ...  
3444    109.42
466      78.25
3092     92.18
3772     91.92
860      87.71
Name: y, Length: 2946, dtype: float64
(2946,)


In [68]:
print(xtest)
print(xtest.shape)

        ID  X0  X1  X2  X3  X4  X5  X6  X8  X10  ...  X375  X376  X377  X378  \
1073  2140   9  16   7   5   3   6   9  11    0  ...     0     0     0     0   
144    310  27  13   3   5   3  13   8  22    0  ...     0     0     0     0   
2380  4779  31   1  21   2   3  18  11  14    1  ...     1     0     0     0   
184    385  20  25  22   2   3  13   9  11    0  ...     0     0     0     0   
2587  5180   8  23   8   3   3  17   8  17    0  ...     0     0     0     0   
...    ...  ..  ..  ..  ..  ..  ..  ..  ..  ...  ...   ...   ...   ...   ...   
2493  4997  27  20  16   2   3  18  10   5    0  ...     0     0     1     0   
3388  6760  40  19  24   5   3  23   3  19    0  ...     0     0     0     0   
3997  8016  22   3   7   0   3  26   6  18    0  ...     0     0     1     0   
383    752  40   1  16   6   3   9   8   0    0  ...     1     0     0     0   
3364  6709  27   4  33   2   3  23   6  24    0  ...     0     0     1     0   

      X379  X380  X382  X383  X384  X38

In [69]:
print(ytest)
print(ytest.shape)

1073     97.94
144      96.41
2380    105.83
184      79.09
2587    108.69
         ...  
2493    115.25
3388     88.59
3997     92.90
383      98.24
3364     91.46
Name: y, Length: 1263, dtype: float64
(1263,)


In [70]:
# PCA with 95% for xtrain

pca_xtrain = PCA(n_components=0.95)
pca_xtrain.fit(xtrain)

PCA(n_components=0.95)

In [71]:
pca_xtrain_transformed = pca_xtrain.transform(xtrain)
print(pca_xtrain_transformed.shape)

(2946, 1)


In [72]:
# PCA with 95% for xtest

pca_xtest = PCA(n_components=0.95)
pca_xtest.fit(xtest)

PCA(n_components=0.95)

In [73]:
pca_xtest_transformed = pca_xtest.transform(xtest)
print(pca_xtest_transformed.shape)

(1263, 1)


In [74]:
print(pca_xtest.explained_variance_)
print(pca_xtest.explained_variance_ratio_)

[6050078.74865925]
[0.99991594]
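
Fitting one PCA on `xtrain` and a second, independent PCA on `xtest` produces two unrelated projections, so a model trained on the first cannot be meaningfully evaluated on the second. The explained-variance figures above (about 6.05e6, ratio 0.9999) also suggest that the retained `ID` column dominates the single component. A minimal sketch of the usual pattern, using synthetic data and illustrative variable names, fits the projection on the training split only and reuses it for the held-out split:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))  # synthetic stand-in for the feature matrix

X_tr, X_val = train_test_split(X_demo, test_size=0.3, random_state=42)

# Learn the projection from the training split only
pca_demo = PCA(n_components=3).fit(X_tr)

# Reuse the same fitted projection for both splits
X_tr_proj = pca_demo.transform(X_tr)
X_val_proj = pca_demo.transform(X_val)

print(X_tr_proj.shape, X_val_proj.shape)
```

The final section of this notebook follows the same pattern with `pca.fit_transform(x_train)` followed by `pca.transform(x_test)`.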


In [77]:
test_df = test_data

In [78]:
#PCA for Test Dataset

In [79]:
test_df

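|  | ID | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | X10 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | az | v | n | f | d | t | a | w | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | t | b | ai | a | d | b | g | y | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | az | v | as | f | d | a | j | j | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | az | l | n | f | d | z | l | n | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | w | s | as | c | d | y | i | m | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4204 | 8410 | aj | h | as | f | d | aa | j | e | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4205 | 8411 | t | aa | ai | d | d | aa | j | y | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4206 | 8413 | y | v | as | f | d | aa | d | w | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4207 | 8414 | ak | v | as | a | d | aa | c | q | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |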


In [80]:
test_df['X0'] = lbl_encoder.fit_transform(test_df['X0'])

In [81]:
test_object_dtypes = test_df.select_dtypes(include=[object])
test_object_dtypes

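|  | X1 | X2 | X3 | X4 | X5 | X6 | X8 |
|---|---|---|---|---|---|---|---|
| 0 | v | n | f | d | t | a | w |
| 1 | b | ai | a | d | b | g | y |
| 2 | v | as | f | d | a | j | j |
| 3 | l | n | f | d | z | l | n |
| 4 | s | as | c | d | y | i | m |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4204 | h | as | f | d | aa | j | e |
| 4205 | aa | ai | d | d | aa | j | y |
| 4206 | v | as | f | d | aa | d | w |
| 4207 | v | as | a | d | aa | c | q |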


In [82]:
for col in ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']:
    test_df[col] = lbl_encoder.fit_transform(test_df[col])

In [83]:
print(test_df)
print(test_df.shape)

        ID  X0  X1  X2  X3  X4  X5  X6  X8  X10  ...  X375  X376  X377  X378  \
0        1  21  23  34   5   3  26   0  22    0  ...     0     0     0     1   
1        2  42   3   8   0   3   9   6  24    0  ...     0     0     1     0   
2        3  21  23  17   5   3   0   9   9    0  ...     0     0     0     1   
3        4  21  13  34   5   3  31  11  13    0  ...     0     0     0     1   
4        5  45  20  17   2   3  30   8  12    0  ...     1     0     0     0   
...    ...  ..  ..  ..  ..  ..  ..  ..  ..  ...  ...   ...   ...   ...   ...   
4204  8410   6   9  17   5   3   1   9   4    0  ...     0     0     0     0   
4205  8411  42   1   8   3   3   1   9  24    0  ...     0     1     0     0   
4206  8413  47  23  17   5   3   1   3  22    0  ...     0     0     0     0   
4207  8414   7  23  17   0   3   1   2  16    0  ...     0     0     1     0   
4208  8416  42   1   8   2   3   1   6  17    0  ...     1     0     0     0   

      X379  X380  X382  X383  X384  X38

In [84]:
test_df = test_df.drop('ID',axis=1)

In [85]:
# PCA with 95% for test_df

pca_test_df = PCA(n_components=0.95)
pca_test_df.fit(test_df)

PCA(n_components=0.95)

In [86]:
pca_test_df_transformed = pca_test_df.transform(test_df)
print(pca_test_df_transformed.shape)

(4209, 6)


In [87]:
print(pca_test_df.explained_variance_)
print(pca_test_df.explained_variance_ratio_)

[247.07875325 100.33535335  77.48364816  62.33258307  48.95689653
   8.14203723]
[0.43515102 0.17670897 0.13646292 0.10977912 0.08622208 0.01433962]


## Predict your test_df values using XGBoost

In [81]:
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

In [89]:
trn_data.shape

(4209, 378)

In [90]:
tst_data.shape

(4209, 377)

In [96]:
usable_columns = list(set(trn_data.columns) - set(['ID', 'y']))
y_train = trn_data['y'].values
id_test = tst_data['ID'].values

# Copy the selections so that later column assignments act on independent frames
x_train = trn_data[usable_columns].copy()
x_test = tst_data[usable_columns].copy()

In [97]:
def check_missing_values(df):
    if df.isnull().any().any():
        print("There are missing values in the dataframe")
    else:
        print("There are no missing values in the dataframe")
check_missing_values(x_train)
check_missing_values(x_test)

There are no missing values in the dataframe
There are no missing values in the dataframe


In [98]:
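for column in usable_columns:
    cardinality = len(np.unique(x_train[column]))
    if cardinality == 1:
        # Column with only one value is useless. Note that DataFrame.drop
        # returns a new frame, so without reassignment these two calls leave
        # x_train and x_test unchanged (X347 still appears in the output below).
        x_train.drop(column, axis=1)
        x_test.drop(column, axis=1)
    if cardinality > 2: # Column is categorical
        # Encode each category as the sum of its character codes;
        # distinct categories can collide (e.g. 'ab' and 'ba').
        mapper = lambda x: sum([ord(digit) for digit in x])
        x_train[column] = x_train[column].apply(mapper)
        x_test[column] = x_test[column].apply(mapper)
x_train.head()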

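|  | X1 | X73 | X94 | X154 | X380 | X347 | X114 | X201 | X8 | X213 | ... | X182 | X343 | X315 | X104 | X356 | X69 | X136 | X118 | X185 | X101 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 118 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 111 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 116 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 111 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 2 | 119 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 120 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 116 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 101 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 118 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 110 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |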


In [101]:
x_test.head()

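|  | X1 | X73 | X94 | X154 | X380 | X347 | X114 | X201 | X8 | X213 | ... | X182 | X343 | X315 | X104 | X356 | X69 | X136 | X118 | X185 | X101 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 118 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 119 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 98 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 121 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 2 | 118 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 106 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 108 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 110 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | 115 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 109 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |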
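
The `mapper` used above turns each category string into the sum of its character codes, which is why `'v'` becomes 118. That mapping is not injective: strings containing the same letters in a different order receive the same number. `X0` contains both `'ab'` and `'ba'`, so those two categories become indistinguishable. The sketch below reproduces the collision and shows one collision-free alternative, a plain dictionary built from the sorted categories. The names `char_code_sum` and `codes` are illustrative.

```python
def char_code_sum(value):
    """Sum of character codes, the same rule as the notebook's `mapper`."""
    return sum(ord(ch) for ch in value)

# Two different categories that end up with the same encoded value
print(char_code_sum("ab"), char_code_sum("ba"))

# Alternative: give every distinct category its own integer
categories = ["k", "az", "ab", "ba", "t"]
codes = {cat: idx for idx, cat in enumerate(sorted(set(categories)))}
encoded = [codes[cat] for cat in categories]
print(codes)
print(encoded)
```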


In [102]:
n_comp = 12
pca = PCA(n_components=n_comp, random_state=420)
pca2_results_train = pca.fit_transform(x_train)
pca2_results_test = pca.transform(x_test)

In [103]:
#Training using xgboost

import xgboost as xgb
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

x_train, x_valid, y_train, y_valid = train_test_split(
        pca2_results_train, 
        y_train, test_size=0.2, 
        random_state=4242)


In [104]:
d_train = xgb.DMatrix(x_train, label=y_train)
d_valid = xgb.DMatrix(x_valid, label=y_valid)
#d_test = xgb.DMatrix(x_test)
d_test = xgb.DMatrix(pca2_results_test)

In [105]:
params = {}
params['objective'] = 'reg:squarederror'  # 'reg:linear' is the deprecated name of this objective
params['eta'] = 0.02
params['max_depth'] = 4

def xgb_r2_score(preds, dtrain):
    labels = dtrain.get_label()
    return 'r2', r2_score(labels, preds)

In [106]:
watchlist = [(d_train, 'train'), (d_valid, 'valid')]

In [107]:
clf = xgb.train(params, d_train, 
                1000, watchlist, early_stopping_rounds=50, 
                feval=xgb_r2_score, maximize=True, verbose_eval=10)

[0]	train-rmse:99.14835	train-r2:-58.35295	valid-rmse:98.26297	valid-r2:-67.63754
[10]	train-rmse:81.27653	train-r2:-38.88428	valid-rmse:80.36433	valid-r2:-44.91014
[20]	train-rmse:66.71610	train-r2:-25.87403	valid-rmse:65.77334	valid-r2:-29.75260
[30]	train-rmse:54.86956	train-r2:-17.17751	valid-rmse:53.88963	valid-r2:-19.64393
[40]	train-rmse:45.24492	train-r2:-11.35979	valid-rmse:44.21995	valid-r2:-12.90012
[50]	train-rmse:37.44736	train-r2:-7.46669	valid-rmse:36.37245	valid-r2:-8.40431
[60]	train-rmse:31.14760	train-r2:-4.85761	valid-rmse:30.01883	valid-r2:-5.40575
[70]	train-rmse:26.08679	train-r2:-3.10878	valid-rmse:24.90901	valid-r2:-3.41057
[80]	train-rmse:22.04665	train-r2:-1.93465	valid-rmse:20.83098	valid-r2:-2.08462
[90]	train-rmse:18.84412	train-r2:-1.14399	valid-rmse:17.60463	valid-r2:-1.20311
[100]	train-rmse:16.34035	train-r2:-0.61211	valid-rmse:15.08357	valid-r2:-0.61730
[110]	train-rmse:14.40184	train-r2:-0.25230	valid-rmse:13.14930	valid-r2:-0.22910
[120]	train-rmse:
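
The strongly negative R² values in the first rounds follow from the definition R² = 1 − SS_res / SS_tot. With a learning rate of 0.02, the early predictions are still far below the targets (train-rmse is about 99 at round 0, while the mean of `y` is about 100.67), so the residual sum of squares is much larger than the total sum of squares. A small sketch with made-up numbers, using a hand-written `r2` function in place of `sklearn.metrics.r2_score`:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_demo = np.array([90.0, 100.0, 110.0])           # illustrative targets
far_off = np.full_like(y_demo, 2.0)               # predictions still near zero
mean_only = np.full_like(y_demo, y_demo.mean())   # always predict the mean

print(r2(y_demo, far_off))     # strongly negative
print(r2(y_demo, mean_only))   # zero: no better than predicting the mean
print(r2(y_demo, y_demo))      # one: perfect predictions
```

As boosting rounds accumulate, predictions move toward the targets and R² rises from large negative values toward positive ones, which matches the trend in the log.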

In [109]:
#Predict your test_df values using xgboost

p_test = clf.predict(d_test)

In [110]:
sub = pd.DataFrame()
sub['ID'] = id_test
sub['y'] = p_test

In [111]:
sub.head()

|  | ID | y |
|---|---|---|
| 0 | 1 | 82.890129 |
| 1 | 2 | 98.129356 |
| 2 | 3 | 83.058907 |
| 3 | 4 | 77.052925 |
| 4 | 5 | 112.577179 |
