## Train-Test Split 

Importing Seaborn Library for loading inbuilt datasets

In [161]:
import seaborn as sns

Loading planets dataset from Seaborn 

In [162]:
planets = sns.load_dataset('planets')

Checking dataset using head() function

In [163]:
planets.head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


In [164]:
planets.number.unique()

array([1, 2, 3, 5, 4, 6, 7], dtype=int64)

We can see that the dataset consists of data regarding planets and their attributes. For this example, we treat the "number" column as the label that represents the planet.

Let's assume that we had to build a planet prediction model which would predict 1 of the 7 planet numbers upon providing other data as input

In that scenario, "number" becomes the target (dependent) variable and all other fields become independent features.

Usually, the target is denoted by y and the independent features by X, as shown below -

In [165]:
X = planets.drop('number',axis=1) # Dropping target so that X only contains Independent features
y = planets.iloc[:,[1]]           # Extracting Target column out of data into y
  

In [166]:
# Checking the Independent feature dataset
X.head()

Unnamed: 0,method,orbital_period,mass,distance,year
0,Radial Velocity,269.3,7.1,77.4,2006
1,Radial Velocity,874.774,2.21,56.95,2008
2,Radial Velocity,763.0,2.6,19.84,2011
3,Radial Velocity,326.03,19.4,110.62,2007
4,Radial Velocity,516.22,10.5,119.47,2009


In [167]:
# Checking Target dataset
y.head()

Unnamed: 0,number
0,1
1,1
2,1
3,1
4,1


The train/test split is done by the train_test_split function of the sklearn.model_selection module.

Following is a description of variables/parameters used - 

X = Independent features data

y = Target data

test_size = Split ratio. 0.30 in the case below means a 70/30 split for Train/Test respectively

random_state = Seeds the random allocation so that every time this cell is run, the rows are allocated to Train and Test in the same way. This keeps results reproducible for comparison/performance measurement purposes. It can take any integer value

X_train = Independent features for train data will be put in this variable. The name can be anything, but this is standard notation. 

y_train = Target for train data will be put in this variable. The name can be anything, but this is standard notation. 

X_test = Independent features for test data will be put in this variable. The name can be anything, but this is standard notation. 

y_test = Target features for test data will be put in this variable.  The name can be anything, but this is standard notation. 



In [168]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=100)
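
To make the roles of test_size and random_state more concrete, here is a minimal plain-Python sketch of a seeded shuffle-and-cut split. The helper name simple_train_test_split and the rounding rule are assumptions made for illustration; this is not scikit-learn's implementation and it will not reproduce the same row assignment as train_test_split.

```python
import random

def simple_train_test_split(rows, test_size=0.30, seed=100):
    """Hypothetical helper: shuffle row indices with a seeded RNG, then cut.

    Illustrative only; scikit-learn's train_test_split uses its own
    shuffling and rounding rules.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)   # same seed -> same shuffle order
    n_test = round(len(rows) * test_size)  # simplified rounding (assumption)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(20))  # stand-in for 20 data rows

# Running the split twice with the same seed gives identical allocations
train_a, test_a = simple_train_test_split(rows, test_size=0.30, seed=100)
train_b, test_b = simple_train_test_split(rows, test_size=0.30, seed=100)

print(len(train_a), len(test_a))                 # sizes of the two splits
print(train_a == train_b and test_a == test_b)   # reproducibility check
```

Changing the seed would most likely change which rows land in each split, while the split sizes stay the same.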

In [169]:
# Exploring X_train. It should contain independent features for train data. 
X_train.head()

Unnamed: 0,method,orbital_period,mass,distance,year
526,Radial Velocity,58.11289,8.02,37.88,1998
707,Transit,42.6318,,2100.0,2011
12,Radial Velocity,479.1,3.88,97.28,2008
287,Radial Velocity,16.546,0.0363,38.01,2011
932,Transit,4.0161,,,2004


In [170]:
# Exploring y_train. It should contain target feature for train data. 
y_train.head()

Unnamed: 0,number
526,2
707,2
12,1
287,3
932,1


In [171]:
# Exploring X_test. It should contain independent features for test data. 
X_test.head()

Unnamed: 0,method,orbital_period,mass,distance,year
826,Transit,13.749,,,2013
917,Microlensing,,,,2004
213,Radial Velocity,360.2,2.37,56.5,2012
270,Radial Velocity,53.881,0.06472,32.31,2011
1022,Transit,1.360031,,93.0,2012


In [172]:
# Exploring y_test. It should contain target feature for test data. 
y_test.head()

Unnamed: 0,number
826,2
917,1
213,2
270,2
1022,1


## Confusion Matrix, Accuracy, Type I & Type II Error

sklearn functions can be used for these performance indicators as below 

In [173]:
# Importing Confusion matrix from sklearn
from sklearn.metrics import confusion_matrix

Let's randomly assume some Actual and Predicted values for now to understand these concepts. 

Assume that 25 Actual values from a test dataset are stored in  actual_val and corresponding predicted ones in predicted_val

In [174]:
# Actual values
actual_val = [1,1,0,1,0,0,0,1,0,1,1,1,0,0,1,0,1,0,1,0,0,0,0,1,1]

# predicted values
predicted_val = [1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,1,0,1,1,1]

### Confusion Matrix

confusion_matrix function can be used to get the confusion matrix as below. 

It takes the actual values as the first parameter and the predicted values as the second parameter. labels can be explicitly passed to specify the order of classes in the matrix. Rows correspond to actual values and columns to predicted values.

Note: Students can always explore functions by Shift+tab for additional parameters and examples


In [175]:
matrix = confusion_matrix(actual_val,predicted_val, labels=[1,0] )
matrix

array([[9, 3],
       [7, 6]], dtype=int64)

As expected, the sum of all the values in the above matrix is 25 because a total of 25 predictions were made. 

Although Type 1 and Type 2 errors can be directly inferred from the matrix, they can also be obtained by indexing into the confusion matrix as below - 

### Type 1 & Type 2 Errors

In [176]:
# With labels=[1,0] the layout is [[TP, FN], [FP, TN]] (rows = actual, columns = predicted)
Type_1_False_positive = matrix[1][0]   # actual 0, predicted 1
print("Type 1 error (False Positive) is : " + str(Type_1_False_positive))

Type_2_False_negative = matrix[0][1]   # actual 1, predicted 0
print("Type 2 error (False Negative) is : " + str(Type_2_False_negative))

Type 1 error (False Positive) is : 7
Type 2 error (False Negative) is : 3
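
To see where each cell of the matrix comes from, the four counts can be tallied by hand in plain Python. This is only a cross-check of the layout for labels=[1,0], where rows are actual values and columns are predicted values, giving [[TP, FN], [FP, TN]]; the variable names are chosen for illustration.

```python
actual_val = [1,1,0,1,0,0,0,1,0,1,1,1,0,0,1,0,1,0,1,0,0,0,0,1,1]
predicted_val = [1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,1,0,1,1,1]

pairs = list(zip(actual_val, predicted_val))

tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives (Type 2)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives (Type 1)
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives

# Same layout as confusion_matrix(..., labels=[1, 0])
matrix_by_hand = [[tp, fn], [fp, tn]]
print(matrix_by_hand)  # [[9, 3], [7, 6]]
```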


### Accuracy

The accuracy_score function of sklearn can be used to find the accuracy of a model. 

From confusion matrix, we can expect accuracy to be 15/25 = 60% (Sum of diagonal elements divided by total sum)

In [177]:
from sklearn.metrics import accuracy_score
score = accuracy_score(actual_val, predicted_val)
score

0.6

The accuracy comes out to be 60% as expected
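
The same figure can be reproduced directly from the confusion matrix values shown earlier. The sketch below hard-codes that matrix as a nested list (an assumption made to keep it self-contained) and divides the diagonal sum by the total.

```python
# Confusion matrix from the earlier cell, copied as a plain nested list
matrix = [[9, 3],
          [7, 6]]

correct = matrix[0][0] + matrix[1][1]       # diagonal: correct predictions
total = sum(sum(row) for row in matrix)     # all predictions

accuracy = correct / total
print(accuracy)  # 0.6
```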

## Feature Engineering

### Data Cleaning

Let's assume we have this data on penguin characteristics, and we want to develop a classification model which predicts the species of the penguin based on all other features.

In [178]:
penguins = sns.load_dataset('penguins')

In [179]:
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


Checking the shape of the data. There are 344 rows and 7 columns - 

In [180]:
penguins.shape

(344, 7)

Checking nulls. 5 columns have fewer than 344 non-null rows. Hence we'll need to treat those 5 fields - 

In [181]:
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


Treating nulls by imputation: the mean for numeric fields and the mode for the categorical field - 

In [182]:
penguins['bill_length_mm']=penguins['bill_length_mm'].fillna(penguins['bill_length_mm'].mean())
penguins['bill_depth_mm']=penguins['bill_depth_mm'].fillna(penguins['bill_depth_mm'].mean())
penguins['flipper_length_mm']=penguins['flipper_length_mm'].fillna(penguins['flipper_length_mm'].mean())
penguins['body_mass_g']=penguins['body_mass_g'].fillna(penguins['body_mass_g'].mean())
penguins['sex']=penguins['sex'].fillna(penguins['sex'].mode()[0])
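
As a rough picture of what mean and mode imputation do, here is a plain-Python sketch on a tiny made-up sample. The values, the use of None for missing entries, and the helper name impute are all assumptions for illustration; the notebook itself relies on pandas' fillna.

```python
from statistics import mean, mode

# Hypothetical mini-sample; None marks a missing value
bill_length_mm = [39.1, 39.5, None, 36.7]
sex = ["Male", "Female", None, "Female"]

def impute(values, fill_value):
    """Replace every missing entry (None) with fill_value."""
    return [fill_value if v is None else v for v in values]

# Statistics are computed from the observed (non-missing) values only
length_mean = mean(v for v in bill_length_mm if v is not None)
sex_mode = mode(v for v in sex if v is not None)

bill_length_filled = impute(bill_length_mm, length_mean)
sex_filled = impute(sex, sex_mode)

print(bill_length_filled)
print(sex_filled)
```

The numeric gap is filled with the average of the observed values, and the categorical gap with the most frequent observed label.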

Confirming no nulls left in the data - 

In [183]:
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     344 non-null    float64
 3   bill_depth_mm      344 non-null    float64
 4   flipper_length_mm  344 non-null    float64
 5   body_mass_g        344 non-null    float64
 6   sex                344 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


### One-hot encoding

We see two categorical features - island and sex. 
For these to be used by an algorithm, we'll need to convert them into binary features using one-hot encoding.

One-hot encoding can be done by using Pandas' get_dummies() function - 

The drop_first parameter means that for N labels, the function will create N-1 fields. That's exactly what we want, because the dropped label is implied when all the other fields are 0. 

In [184]:
import pandas as pd
island_dummies=pd.get_dummies(penguins[['island']],drop_first=True)
island_dummies.head()


Unnamed: 0,island_Dream,island_Torgersen
0,0,1
1,0,1
2,0,1
3,0,1
4,0,1
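
The effect of dropping the first label can be sketched in plain Python. The helper one_hot below is hypothetical and only mimics the idea (sorted labels, first one dropped); it is not how pandas implements get_dummies.

```python
def one_hot(values, prefix, drop_first=True):
    """Hypothetical helper: build one 0/1 field per label, optionally dropping the first."""
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]  # N labels -> N-1 fields
    return [
        {f"{prefix}_{c}": int(v == c) for c in categories}
        for v in values
    ]

islands = ["Torgersen", "Biscoe", "Dream", "Torgersen"]
encoded = one_hot(islands, "island")

for island, row in zip(islands, encoded):
    print(island, row)
```

The dropped label (Biscoe here) is still recoverable: it is the row where every remaining field is 0.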


In [185]:
sex_dummies = pd.get_dummies(penguins[['sex']],drop_first=True)
sex_dummies.head()

Unnamed: 0,sex_Male
0,1
1,0
2,0
3,1
4,0


Merging new binary fields to the original dataframe - 

In [186]:
penguins = pd.concat([penguins,sex_dummies,island_dummies],axis=1)
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,sex_Male,island_Dream,island_Torgersen
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,1,0,1
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,0,0,1
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,0,0,1
3,Adelie,Torgersen,43.92193,17.15117,200.915205,4201.754386,Male,1,0,1
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,0,0,1


Dropping the original categorical fields as they are no longer required

In [187]:
penguins.drop(['island','sex'],axis=1,inplace=True)

NOTE : One-hot encoding should be performed before the Train/Test split. Otherwise it would have to be repeated separately for X_train and X_test, and the two splits could end up with different fields if a label is missing from one of them

### Feature Scaling - Standardisation

Before applying any scaling to the data, it is advised to do the train/test split.
Otherwise data leakage will happen between the Train and Test splits. 

Data Leakage - We want our Train and Test data to be split randomly and kept independent of each other. Scaling before the split computes the scaling statistics (such as the mean and standard deviation) from all rows, so information from the Test rows influences the Train data. That is known as data leakage 
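
A small numeric sketch may help here. The numbers are made up for illustration: they show that the statistic used for scaling changes once test rows are included, which is how test information would reach the training data.

```python
from statistics import fmean

# Made-up values for a single feature
train = [10.0, 12.0, 14.0]
test = [30.0]

mean_train_only = fmean(train)        # statistic learned from Train rows only
mean_all_rows = fmean(train + test)   # statistic if scaling happened before the split

print(mean_train_only)  # 12.0
print(mean_all_rows)    # 16.5

# Centring the training rows with each mean gives different training data
centred_clean = [v - mean_train_only for v in train]
centred_leaky = [v - mean_all_rows for v in train]
print(centred_clean)    # [-2.0, 0.0, 2.0]
print(centred_leaky)    # [-6.5, -4.5, -2.5]
```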

Train/Test Split :

In [188]:
X = penguins.drop('species',axis=1) # Dropping target so that X only contains Independent features
y = penguins.iloc[:,[0]]           # Extracting Target column out of data into y
  

In [189]:
X.head()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex_Male,island_Dream,island_Torgersen
0,39.1,18.7,181.0,3750.0,1,0,1
1,39.5,17.4,186.0,3800.0,0,0,1
2,40.3,18.0,195.0,3250.0,0,0,1
3,43.92193,17.15117,200.915205,4201.754386,1,0,1
4,36.7,19.3,193.0,3450.0,0,0,1


In [190]:
y.head()

Unnamed: 0,species
0,Adelie
1,Adelie
2,Adelie
3,Adelie
4,Adelie


In [191]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=100)

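Standardisation can be performed with the StandardScaler class of sklearn as below. The scaler is fitted on the Train data only, and the same fitted scaler is then used to transform the Test data - 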

In [192]:
from sklearn.preprocessing import StandardScaler

sc=StandardScaler()
X_train_standardised =sc.fit_transform(X_train)
X_test_standardised =sc.transform(X_test)

In [193]:
X_train_standardised

array([[ 0.45175397, -1.87193161,  0.64957392, ..., -1.08077159,
        -0.76980036, -0.40269363],
       [ 0.58236008, -0.86968462,  1.01293885, ..., -1.08077159,
        -0.76980036, -0.40269363],
       [ 0.07859368, -1.47103281,  1.08561184, ...,  0.92526489,
        -0.76980036, -0.40269363],
       ...,
       [ 0.93686235, -1.12024636,  1.95768768, ...,  0.92526489,
        -0.76980036, -0.40269363],
       [ 0.2278578 , -1.72159456,  0.50422794, ..., -1.08077159,
        -0.76980036, -0.40269363],
       [-1.86183983,  0.43323647, -0.58586686, ...,  0.92526489,
        -0.76980036,  2.4832774 ]])

In [194]:
X_test_standardised

array([[ 0.63833412, -1.62136986,  0.79491989,  0.84473087, -1.08077159,
        -0.76980036, -0.40269363],
       [ 0.97417838, -0.76945992,  1.15828482,  2.07836928,  0.92526489,
        -0.76980036, -0.40269363],
       [-0.91028109,  1.18492171, -0.44052089,  0.59167684,  0.92526489,
        -0.76980036,  2.4832774 ],
       [-0.55577881,  0.88424762, -1.38526972, -0.98991086,  0.92526489,
         1.29903811, -0.40269363],
       [ 1.31002265, -0.36856112,  1.73966872,  1.31920718,  0.92526489,
        -0.76980036, -0.40269363],
       [-0.48114675,  0.63368587, -0.00448297, -0.26238052,  0.92526489,
         1.29903811, -0.40269363],
       [-1.47002153, -0.01777467, -1.02190478, -1.33786015, -1.08077159,
        -0.76980036,  2.4832774 ],
       [ 0.65699214, -1.37080811,  1.01293885,  1.5089977 ,  0.92526489,
        -0.76980036, -0.40269363],
       [ 1.2540486 ,  0.03233767,  1.95768768,  1.76205174,  0.92526489,
        -0.76980036, -0.40269363],
       [ 0.39577993, -1.5712

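As expected, both X_train and X_test are standardised around 0. We can also verify that the Train split has mean 0 and standard deviation 1. The Test split is only approximately so, because it is scaled with the statistics learned from the Train split :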

In [195]:
print("Mean of X_train is : "+str(X_train_standardised.mean()))
print("Standard Deviation of X_train is : "+str(X_train_standardised.std()))
print("Mean of X_test is : "+str(X_test_standardised.mean()))
print("Standard Deviation of X_test is : "+str(X_test_standardised.std()))

Mean of X_train is : 8.950635237288084e-17
Standard Deviation of X_train is : 1.0
Mean of X_test is : -0.0666804481475174
Standard Deviation of X_test is : 1.0317952796662435
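
The pattern behind these numbers can be sketched in plain Python for a single feature: the mean and (population) standard deviation are learned from the train values and reused for the test values. The helper names and sample values below are assumptions for illustration, not scikit-learn code.

```python
from statistics import fmean, pstdev

def fit_standardiser(train_values):
    """Learn the mean and population standard deviation from train values."""
    return fmean(train_values), pstdev(train_values)

def standardise(values, mu, sigma):
    """Apply z = (x - mu) / sigma with previously learned statistics."""
    return [(v - mu) / sigma for v in values]

# Made-up body mass values in grams
train = [3750.0, 3800.0, 3250.0, 3450.0]
test = [3650.0, 4675.0]

mu, sigma = fit_standardiser(train)     # statistics from Train only
train_z = standardise(train, mu, sigma)
test_z = standardise(test, mu, sigma)   # Test reuses the Train statistics

print(fmean(train_z), pstdev(train_z))  # approximately 0 and 1
print(fmean(test_z), pstdev(test_z))    # generally not 0 and 1
```

This is why the Train figures come out as 0 and 1 up to floating-point error, while the Test figures only land near them.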


### Feature Scaling - Normalisation

Normalisation can be performed with the MinMaxScaler class of sklearn as below -

In [196]:
from sklearn.preprocessing import MinMaxScaler

scaler_minmax = MinMaxScaler()
X_train_normalised = scaler_minmax.fit_transform(X_train)
X_test_normalised = scaler_minmax.fit_transform(X_test)

In [197]:
X_train_normalised

array([[0.49808429, 0.03614458, 0.6440678 , ..., 0.        , 0.        ,
        0.        ],
       [0.52490421, 0.27710843, 0.72881356, ..., 0.        , 0.        ,
        0.        ],
       [0.42145594, 0.13253012, 0.74576271, ..., 1.        , 0.        ,
        0.        ],
       ...,
       [0.59770115, 0.21686747, 0.94915254, ..., 1.        , 0.        ,
        0.        ],
       [0.45210728, 0.07228916, 0.61016949, ..., 0.        , 0.        ,
        0.        ],
       [0.02298851, 0.59036145, 0.3559322 , ..., 1.        , 0.        ,
        1.        ]])

In [198]:
X_test_normalised

array([[0.59459459, 0.12      , 0.65384615, 0.65322581, 0.        ,
        0.        , 0.        ],
       [0.66409266, 0.34666667, 0.75      , 0.96774194, 1.        ,
        0.        , 0.        ],
       [0.27413127, 0.86666667, 0.32692308, 0.58870968, 1.        ,
        0.        , 1.        ],
       [0.34749035, 0.78666667, 0.07692308, 0.18548387, 1.        ,
        1.        , 0.        ],
       [0.73359073, 0.45333333, 0.90384615, 0.77419355, 1.        ,
        0.        , 0.        ],
       [0.36293436, 0.72      , 0.44230769, 0.37096774, 1.        ,
        1.        , 0.        ],
       [0.15830116, 0.54666667, 0.17307692, 0.09677419, 0.        ,
        0.        , 1.        ],
       [0.5984556 , 0.18666667, 0.71153846, 0.82258065, 1.        ,
        0.        , 0.        ],
       [0.72200772, 0.56      , 0.96153846, 0.88709677, 1.        ,
        0.        , 0.        ],
       [0.54440154, 0.13333333, 0.75      , 0.49193548, 0.        ,
        0.        , 0. 

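The minimum and maximum values can be confirmed to be 0 and 1 as below. Note that X_test shows exactly 0 and 1 only because the cell above calls fit_transform on X_test, which refits the scaler on the Test data. To stay consistent with the data leakage advice, scaler_minmax.transform(X_test) should be used instead, in which case Test values may fall slightly outside the 0 to 1 range :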

In [199]:
print("Min value of X_train is : "+str(X_train_normalised.min()))
print("Max value of X_train is : "+str(X_train_normalised.max()))
print("Min value of X_test is : "+str(X_test_normalised.min()))
print("Max value of X_test is : "+str(X_test_normalised.max()))


Min value of X_train is : 0.0
Max value of X_train is : 1.0
Min value of X_test is : 0.0
Max value of X_test is : 1.0
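
For comparison, here is a plain-Python sketch of min-max scaling in which the minimum and maximum are learned from the train values only and then reused for the test values, mirroring the fit_transform/transform pattern used with StandardScaler. The helper names and sample values are assumptions for illustration.

```python
def fit_min_max(train_values):
    """Learn the minimum and maximum from train values."""
    return min(train_values), max(train_values)

def min_max_scale(values, lo, hi):
    """Apply x' = (x - lo) / (hi - lo) with previously learned bounds."""
    return [(v - lo) / (hi - lo) for v in values]

# Made-up flipper lengths in mm
train = [181.0, 186.0, 195.0, 193.0]
test = [178.0, 190.0, 201.0]

lo, hi = fit_min_max(train)                   # bounds from Train only
train_scaled = min_max_scale(train, lo, hi)
test_scaled = min_max_scale(test, lo, hi)     # Test reuses the Train bounds

print(min(train_scaled), max(train_scaled))   # 0.0 1.0
print(min(test_scaled), max(test_scaled))     # can fall below 0 or above 1
```

Test values outside the 0 to 1 range are expected with this pattern; they simply mean the test data contains values beyond those seen during training.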
