
#Importing libraries

In [68]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

#Loading & exploring the data

In [69]:
#Loading data
data = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

In [70]:
#Checking data
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [71]:
#Checking the shape of the data
data.shape

(891, 12)

In [72]:
#Extracting count of unique values in the data
for i in data.columns:
  print(f'{i:{15}} {len(data[i].unique())}')

PassengerId     891
Survived        2
Pclass          3
Name            891
Sex             2
Age             89
SibSp           7
Parch           7
Ticket          681
Fare            248
Cabin           148
Embarked        4
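
The counts above come from `len(data[i].unique())`, which treats a missing value as one more label. A minimal sketch on a toy frame (the values are invented for illustration and are not the Titanic data) shows how that differs from `nunique()`, which skips missing values by default:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({'Cabin': ['C85', np.nan, 'C123', np.nan, 'E46'],
                    'Sex': ['male', 'female', 'female', 'male', 'male']})

# unique() keeps NaN as one of the returned values, so it is counted
with_nan = {col: len(toy[col].unique()) for col in toy.columns}

# nunique() ignores NaN unless dropna=False is passed
without_nan = {col: toy[col].nunique() for col in toy.columns}

print(with_nan)
print(without_nan)
```

For a column with missing values, such as Cabin, the two counts differ by one.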


We will create a new dataset with two features, Cabin and Sex.

#Creating new data with two features Cabin and Sex

In [73]:
#Creating new data with variable Cabin & Sex
data1 = data[['Cabin', 'Sex']]
data1

Unnamed: 0,Cabin,Sex
0,,male
1,C85,female
2,,female
3,C123,female
4,,male
...,...,...
886,,male
887,B42,female
888,,female
889,C148,male


#Feature Engineering on Cabin feature

In [74]:
#Extracting the first letter of each label
data1['Cabin'].str.slice(0,1).unique()

array([nan, 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [75]:
#Replacing missing value with 'n'
data1['Cabin'].str.slice(0,1).fillna('n').unique()

array(['n', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [76]:
#Creating feature Cabin_reduced on an explicit copy, which avoids pandas' SettingWithCopyWarning
data1 = data1.copy()
data1['Cabin_reduced'] = data1['Cabin'].str.slice(0,1).fillna('n')


In [77]:
#Checking data
data1.head()

Unnamed: 0,Cabin,Sex,Cabin_reduced
0,,male,n
1,C85,female,C
2,,female,n
3,C123,female,C
4,,male,n


In [78]:
#Extracting count of unique values in the data
for i in data1.columns:
  print(f'{i:{15}} {len(data1[i].unique())}')

Cabin           148
Sex             2
Cabin_reduced   9


We have reduced the number of labels in the Cabin feature from 148 to 9 by creating the new feature Cabin_reduced.
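
The reduction can be sketched on a handful of cabin labels. The sample values below are taken from the rows displayed earlier plus one invented label (`E46`); the names `cabins` and `deck` are illustrative only:

```python
import numpy as np
import pandas as pd

# A few cabin labels, with missing values as in the real column
cabins = pd.Series(['C85', np.nan, 'C123', 'E46', np.nan, 'B42', 'C148'])

# Keep only the first letter and mark missing cabins with 'n'
deck = cabins.str.slice(0, 1).fillna('n')

# Number of labels before (NaN counted as a label) and after the reduction
print(cabins.nunique(dropna=False), '->', deck.nunique())
print(deck.value_counts().to_dict())
```

Every distinct cabin number collapses into its first letter, so labels that appear only once or twice are pooled into a few larger groups.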

#Splitting data into train and test

In [79]:
#Splitting data into train and test
X_train, X_test, y_train, y_test = train_test_split(data1,
                                                    data['Survived'],
                                                    test_size = 0.4,
                                                    random_state = 0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((534, 3), (357, 3), (534,), (357,))

#Checking missing values in X_train and X_test

In [80]:
#Checking missing values in X_train
X_train.isnull().sum()

Cabin            407
Sex                0
Cabin_reduced      0
dtype: int64

In [81]:
#Checking missing values in X_test
X_test.isnull().sum()

Cabin            280
Sex                0
Cabin_reduced      0
dtype: int64

In [82]:
#Replacing missing values (reassigning instead of using inplace=True avoids pandas' SettingWithCopyWarning)
X_train = X_train.fillna('0')
X_test = X_test.fillna('0')


In [83]:
#Checking missing values
print(X_train.isnull().sum())
print(X_test.isnull().sum())

Cabin            0
Sex              0
Cabin_reduced    0
dtype: int64
Cabin            0
Sex              0
Cabin_reduced    0
dtype: int64


#Creating dummy variables

In [84]:
#Checking data
display(X_train.head(2))
display(X_test.head(2))

Unnamed: 0,Cabin,Sex,Cabin_reduced
100,0,female,n
722,0,male,n


Unnamed: 0,Cabin,Sex,Cabin_reduced
495,0,male,n
648,0,male,n


In [85]:
#Creating dummy variables from Cabin & Sex features
train = pd.get_dummies(X_train[['Cabin','Sex']], columns = ['Cabin','Sex'])
test = pd.get_dummies(X_test[['Cabin','Sex']], columns = ['Cabin','Sex'])
train.shape, test.shape

((534, 106), (357, 71))

We have 106 features in train and 71 features in test
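
The column counts differ because `get_dummies` is applied to each split separately, so each split only gets columns for the labels it contains. One alternative, which this notebook does not use, is scikit-learn's `OneHotEncoder` fitted on the training split only. The sketch below uses invented toy frames (`train_raw`, `test_raw`) and assumes that encoding unseen test labels as all zeros is acceptable:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy splits, invented for illustration: 'B42' appears only in test
train_raw = pd.DataFrame({'Cabin': ['C85', 'E46', '0'],
                          'Sex': ['male', 'female', 'male']})
test_raw = pd.DataFrame({'Cabin': ['B42', '0'],
                         'Sex': ['female', 'female']})

# handle_unknown='ignore' encodes labels unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')
train_enc = encoder.fit_transform(train_raw).toarray()
test_enc = encoder.transform(test_raw).toarray()

# Both splits now share the columns learned from train
print(train_enc.shape, test_enc.shape)
```

Because the encoder learns its categories from the training split, the test matrix always has the same number of columns as the train matrix.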

#Identifying features present in the train dataset but missing from the test dataset

In [86]:
#Extracting the columns present in train but missing from test
missing_cols = set(train.columns)-set(test.columns)
len(missing_cols)

79

79 features are missing from the test dataset.

# Adding the 79 missing features to the test dataset

In [87]:
#Adding the 79 columns missing from test, filled with 0
for i in missing_cols:
  test[i]=0

In [90]:
#Checking data shape
train.shape, test.shape

((534, 106), (357, 150))

After adding the 79 features, test has 150 features while train has 106. Test still contains 44 columns that never appear in train, so the two feature sets do not match yet.

#Aligning the test features with the train features

In [91]:
#Keeping only the train columns in test, in the same order
test = test[train.columns]

In [92]:
#Checking the shape of the data
train.shape, test.shape

((534, 106), (357, 106))

Now train and test have the same 106 features in the same order.
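
The add-then-select steps above can also be written as a single `reindex` call. This is a minimal sketch on invented toy frames (`train_raw`, `test_raw` and the derived names are illustrative), not a change to the notebook's own cells:

```python
import pandas as pd

# Toy splits, invented for illustration: 'B42' appears only in test,
# while 'C85', 'E46' and 'male' appear only in train
train_raw = pd.DataFrame({'Cabin': ['C85', 'E46', '0'],
                          'Sex': ['male', 'female', 'male']})
test_raw = pd.DataFrame({'Cabin': ['B42', '0'],
                         'Sex': ['female', 'female']})

train_d = pd.get_dummies(train_raw, columns=['Cabin', 'Sex'])
test_d = pd.get_dummies(test_raw, columns=['Cabin', 'Sex'])

# reindex keeps train's column order, fills columns absent from test with 0
# and drops columns that exist only in test
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)

print(train_d.shape, test_d.shape, test_aligned.shape)
```

Either way, a cabin label seen only in test contributes nothing to the model, because the model has no column for it.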

# Creating Random Forest Model

In [93]:
#Fitting a random forest classifier
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [61]:
#Predicting class probabilities on train and test
y_train_pred = clf.predict_proba(train)
y_test_pred = clf.predict_proba(test)

In [94]:
#Checking the data
y_train_pred

array([[0.3296787 , 0.6703213 ],
       [0.86457227, 0.13542773],
       [0.3296787 , 0.6703213 ],
       ...,
       [0.86457227, 0.13542773],
       [0.3296787 , 0.6703213 ],
       [0.86457227, 0.13542773]])

In [95]:
#Checking ROC AUC score
print('Train set')
print('{} roc auc:{}'.format('Random Forest', roc_auc_score(y_train, y_train_pred[:,1])))

print('Test set')
print('{} roc auc:{}'.format('Random Forest', roc_auc_score(y_test, y_test_pred[:,1])))

Train set
Random Forest roc auc:0.8729650130239166
Test set
Random Forest roc auc:0.8261744743146127


The ROC AUC is about 0.87 on the training set and 0.83 on the testing set. The training score is roughly 0.05 higher, which indicates mild overfitting.

#Applying different ML techniques to the data with Cabin and Sex features and comparing performance through the ROC AUC score

In [97]:
#Creating function
def run_model(data_train, data_test, y_train, y_test):
  #1. Defining 4 different ML techniques
  rf = RandomForestClassifier(n_estimators=200, random_state=42)
  ada = AdaBoostClassifier(n_estimators=200, random_state=42)
  logit = LogisticRegression(solver='lbfgs', random_state=42)
  gbc = GradientBoostingClassifier(n_estimators=300, random_state=42)

  #2. Mapping a display name to each model
  model = {'Random Forest':rf, 'AdaBoost':ada, 'Logistic Regression':logit, 'Gradient Boosting':gbc}

  #3. Creating dummy variables
  train = pd.get_dummies(data_train, columns=data_train.columns)
  test = pd.get_dummies(data_test, columns=data_test.columns)

  #4. Finding features present in the training set but missing from the testing set
  missing_cols = set(train.columns)-set(test.columns)

  #5. Adding the missing features to the testing set, filled with 0
  for i in missing_cols:
    test[i]=0

  #6. Keeping only the training columns in the testing set, in the same order
  test = test[train.columns]

  #7. Training each model and predicting on train and test data
  for label, clf in model.items():
    clf.fit(train, y_train)#Training a model
    y_train_pred = clf.predict_proba(train)#Prediction on the train data
    y_test_pred = clf.predict_proba(test)#Prediction on the test data

    #8. Printing the ROC AUC score of each model
    print('Testing for', label)
    print('Train roc_auc:{}'.format(roc_auc_score(y_train, y_train_pred[:,1])))
    print('Test roc_auc:{}'.format(roc_auc_score(y_test, y_test_pred[:,1])))
    print()

In [98]:
#Computing the ROC AUC score of the different models
cols = ['Cabin', 'Sex']
run_model(X_train[cols], X_test[cols], y_train, y_test)

Testing for Random Forest
Train roc_auc:0.8729650130239166
Test roc_auc:0.8261744743146127

Testing for AdaBoost
Train roc_auc:0.8699310324413924
Test roc_auc:0.7899254724514241

Testing for Logistic Regression
Train roc_auc:0.8328572697134738
Test roc_auc:0.819204152249135

Testing for Gradient Boosting
Train roc_auc:0.8743858039308549
Test roc_auc:0.8247604471652914



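All four models score higher on the training set than on the testing set, with gaps ranging from about 0.01 (Logistic Regression) to 0.08 (AdaBoost). This indicates overfitting on the high-cardinality Cabin feature.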

# Applying different ML techniques to the data with Cabin_reduced and Sex features and comparing performance through the ROC AUC score

In [99]:
#Computing the ROC AUC score of the different models
cols = ['Cabin_reduced', 'Sex']
run_model(X_train[cols], X_test[cols], y_train, y_test)

Testing for Random Forest
Train roc_auc:0.8310812810798011
Test roc_auc:0.8101377428799574

Testing for AdaBoost
Train roc_auc:0.8283284986976084
Test roc_auc:0.8166589033803567

Testing for Logistic Regression
Train roc_auc:0.8274109045702107
Test roc_auc:0.8159602076124568

Testing for Gradient Boosting
Train roc_auc:0.8310812810798011
Test roc_auc:0.8101377428799574



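With Cabin_reduced, the gap between train and test ROC AUC shrinks to about 0.01 to 0.02 for every model, so overfitting is largely reduced while the test scores stay at a similar level.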
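
To compare the two runs side by side, the train-test gap can be computed directly from the scores printed above. This is a small sketch; the dictionaries and the `gap` helper are illustrative names, and the numbers are copied from the two `run_model` outputs:

```python
# (train, test) ROC AUC copied from the run with the original Cabin feature
cabin = {
    'Random Forest': (0.8729650130239166, 0.8261744743146127),
    'AdaBoost': (0.8699310324413924, 0.7899254724514241),
    'Logistic Regression': (0.8328572697134738, 0.819204152249135),
    'Gradient Boosting': (0.8743858039308549, 0.8247604471652914),
}

# (train, test) ROC AUC copied from the run with Cabin_reduced
cabin_reduced = {
    'Random Forest': (0.8310812810798011, 0.8101377428799574),
    'AdaBoost': (0.8283284986976084, 0.8166589033803567),
    'Logistic Regression': (0.8274109045702107, 0.8159602076124568),
    'Gradient Boosting': (0.8310812810798011, 0.8101377428799574),
}

def gap(scores):
    """Return train minus test ROC AUC per model, rounded to 3 decimals."""
    return {name: round(train - test, 3) for name, (train, test) in scores.items()}

gap_cabin = gap(cabin)
gap_reduced = gap(cabin_reduced)

for name in cabin:
    print(f'{name:{20}} Cabin gap: {gap_cabin[name]:.3f}  Cabin_reduced gap: {gap_reduced[name]:.3f}')
```

A positive gap means the model scores better on the data it was trained on than on unseen data.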