<a href="https://colab.research.google.com/github/Ashikur-ai/Learn-Machine-Learning/blob/main/1_2_How_Cardinality_Used_to_Improve_ML_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Cardinality refers to the number of possible values that a feature can assume.

High cardinality may pose the following problems:
* Variables with many labels tend to dominate over those with only a few labels, notably in tree-based models
* A large number of labels within a variable may introduce noise and lead to overfitting
* Some labels may be present only in the training set and not in the test set (or the other way round), so the model either learns patterns it cannot validate or meets labels it has never seen


#In this Lecture 
We will:
* Learn how to quantify cardinality
* See examples of high and low cardinality variables
* Understand the effect of cardinality while preparing train and test sets
* Visualise the effect of cardinality on Machine Learning Model performance

#Let's start!
We will first import all the necessary libraries.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/feature-engineering-for-machine-learning-dataset/master/titanic.csv')
data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


#Cardinality is the total number of unique values the column can take

In [3]:
data.shape

(891, 12)

#Check cardinality of Name column

In [4]:
len(data['Name'].unique())

891

#Check cardinality of Sex column

In [5]:
len(data['Sex'].unique())

2

#Check cardinality of Ticket column

In [6]:
len(data['Ticket'].unique())

681

#Check cardinality of Cabin column

In [7]:
len(data['Cabin'].unique())

148
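
Note that `len(...unique())` counts a missing value as one more label, so the 148 reported for `Cabin` probably includes NaN. As a rough sketch on a made-up toy frame (the `toy` DataFrame and its values are illustrative, not the Titanic data), `DataFrame.nunique` gives the cardinality of every column in one call, and its `dropna` flag controls whether NaN is counted:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only to illustrate the counting rules.
toy = pd.DataFrame({
    'Cabin': ['C85', np.nan, 'C123', np.nan, 'C85'],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
})

# unique() keeps NaN as a value, so len() counts it as a label.
with_nan = len(toy['Cabin'].unique())

# nunique() ignores NaN by default; dropna=False counts it.
per_column = toy.nunique()
per_column_with_nan = toy.nunique(dropna=False)

print(with_nan)
print(per_column.to_dict())
print(per_column_with_nan.to_dict())
```

Which convention is right depends on whether "missing" will later be treated as a category of its own, as it is in this notebook.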

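#Method to reduce cardinality of the Cabin column

We keep only the first character of each cabin value (the deck letter) and give missing values their own label, 'n'.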

In [8]:
data = data[['Cabin', 'Sex', 'Survived' ]]

In [9]:
data

,Cabin,Sex,Survived
0,,male,0
1,C85,female,1
2,,female,1
3,C123,female,1
4,,male,0
...,...,...,...
886,,male,0
887,B42,female,1
888,,female,0
889,C148,male,1


In [10]:
data['Cabin'].str[0].unique()


array([nan, 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [11]:
data['Cabin'].str[0].fillna('n').unique()

array(['n', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [12]:
# Build a new DataFrame instead of assigning into `data`, which is a slice of
# the original frame and would trigger pandas' SettingWithCopyWarning.
data = data.assign(Cabin_reduced=data['Cabin'].str[0].fillna('n'))


In [13]:
data

,Cabin,Sex,Survived,Cabin_reduced
0,,male,0,n
1,C85,female,1,C
2,,female,1,n
3,C123,female,1,C
4,,male,0,n
...,...,...,...,...
886,,male,0,n
887,B42,female,1,B
888,,female,0,n
889,C148,male,1,C


In [14]:
len(data['Cabin'].unique())

148

In [15]:
len(data['Cabin_reduced'].unique())

9
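
To see what the `.str[0].fillna('n')` step does to individual values, here is a minimal sketch on a few cabin strings taken from the outputs in this notebook (the `cabins` Series is a hand-picked sample, not the full column):

```python
import numpy as np
import pandas as pd

# A hand-picked sample of values seen in the Cabin column.
cabins = pd.Series(['C85', np.nan, 'C123', 'B96 B98', 'F G73', 'T'])

# .str[0] takes the first character and leaves NaN untouched;
# fillna('n') then turns missing cabins into their own label.
reduced = cabins.str[0].fillna('n')

print(reduced.tolist())
print(cabins.nunique(dropna=False), '->', reduced.nunique())
```

Multi-cabin entries such as 'B96 B98' or 'F G73' are assigned to their first letter only, which is a simplification worth keeping in mind.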

#We will use the Cabin and Sex columns

#Train Test Split

In [16]:
use_cols = ['Cabin', 'Cabin_reduced', 'Sex']
X_train, X_test, y_train, y_test = train_test_split(data[use_cols],
                                                    data['Survived'],
                                                    test_size=.40,
                                                    random_state=0)

In [17]:
X_train.shape, X_test.shape

((534, 3), (357, 3))
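
The two row counts follow from how the test fraction is rounded. A quick check, assuming scikit-learn rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the training set:

```python
import math

n_rows = 891        # rows in the dataset (data.shape[0])
test_size = 0.40    # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest goes to train.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```

Under that assumption the result matches the shapes shown above.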

In [18]:
test_cabin_unique = X_test['Cabin'].unique()
train_cabin_unique = X_train['Cabin'].unique()
test_cabin_unique, train_cabin_unique

(array([nan, 'B78', 'C106', 'C125', 'C7', 'B49', 'C54', 'E34', 'C52',
        'B50', 'B41', 'C83', 'B57 B59 B63 B66', 'E67', 'D45', 'D10 D12',
        'C47', 'C87', 'D20', 'D17', 'B42', 'C126', 'C23 C25 C27', 'D9',
        'F33', 'C65', 'B102', 'F2', 'B18', 'D49', 'F G73', 'G6', 'E33',
        'B96 B98', 'C62 C64', 'B37', 'C78', 'B58 B60', 'C85', 'D36',
        'C124', 'B71', 'B30', 'A26', 'B69', 'E68', 'E121', 'C68', 'B94',
        'E17', 'D33', 'D26', 'C128', 'A14', 'B19', 'D21', 'C148', 'C30',
        'D56', 'E24', 'E40', 'E31', 'E44', 'E38', 'D37', 'E8', 'C92',
        'E63', 'F4'], dtype=object),
 array([nan, 'E67', 'C126', 'B73', 'E36', 'C78', 'E46', 'C111', 'E101',
        'D15', 'E12', 'G6', 'A32', 'B4', 'A10', 'A5', 'C95', 'E25', 'C90',
        'D6', 'A36', 'D', 'D26', 'D50', 'B96 B98', 'C93', 'E77', 'C101',
        'D11', 'C123', 'C32', 'B35', 'C91', 'T', 'B101', 'E58', 'A23',
        'B77', 'D28', 'B82 B84', 'B79', 'E44', 'C45', 'C2', 'B5', 'C104',
        'B20', 'A19', 'B51

In [19]:
len(train_cabin_unique)-len(test_cabin_unique)

35

#Cabin

In [20]:
len([x for x in train_cabin_unique if x not in test_cabin_unique ])

80

In [21]:
len([x for x in test_cabin_unique if x not in train_cabin_unique])

45
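
A caveat on the two counts above: `x not in array` compares with `==`, and NaN never equals itself, so NaN is likely counted as "missing" in both directions even though both splits contain it. A hedged alternative is to drop NaN first and compare sets. The helper name `labels_only_in` and the toy Series below are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def labels_only_in(left, right):
    """Return the non-missing labels found in `left` but not in `right`."""
    return set(left.dropna().unique()) - set(right.dropna().unique())

# Hypothetical toy splits; both contain NaN.
toy_train = pd.Series(['C85', np.nan, 'E46', 'T', 'B96 B98'])
toy_test = pd.Series([np.nan, 'C85', 'B78', 'B96 B98'])

only_train = labels_only_in(toy_train, toy_test)
only_test = labels_only_in(toy_test, toy_train)

print(sorted(only_train))
print(sorted(only_test))
```

With the real splits this would be called as `labels_only_in(X_train['Cabin'], X_test['Cabin'])` and the reverse.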

#Cabin Reduced

In [22]:
len([x for x in X_train['Cabin_reduced'].unique() if x not in X_test['Cabin_reduced'].unique()])

1

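Only one label is present in the training set but absent from the test set. The next check goes the other way: a result of 0 means every label in the test set is also present in the training set.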


In [23]:
len([x for x in X_test['Cabin_reduced'].unique() if x not in X_train['Cabin_reduced'].unique()])

0

#Categorical Encoding

In [26]:
X_train.isnull().sum()

Cabin            407
Cabin_reduced      0
Sex                0
dtype: int64

In [27]:
X_test.isnull().sum()

Cabin            280
Cabin_reduced      0
Sex                0
dtype: int64

In [28]:
X_train = X_train.fillna('0')

In [29]:
X_test = X_test.fillna('0')

In [30]:
X_train.isnull().sum()

Cabin            0
Cabin_reduced    0
Sex              0
dtype: int64

In [31]:
X_test.isnull().sum()

Cabin            0
Cabin_reduced    0
Sex              0
dtype: int64

In [32]:
X_train

,Cabin,Cabin_reduced,Sex
100,0,n,female
722,0,n,male
678,0,n,female
229,0,n,female
334,0,n,female
...,...,...,...
835,E49,E,female
192,0,n,female
629,0,n,male
559,0,n,female


In [36]:
['Cabin', 'Sex']
['Cabin_reduced', 'Sex']

['Cabin_reduced', 'Sex']

In [39]:
train = pd.get_dummies(X_train[['Cabin', 'Sex']], columns=['Cabin', 'Sex'])
test = pd.get_dummies(X_test[['Cabin', 'Sex']], columns=['Cabin', 'Sex'])

In [41]:
train.shape

(534, 106)

In [42]:
test.shape

(357, 71)

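#Align the columns of train and test

Each split was one-hot encoded separately, so each has dummy columns only for the labels it contains (106 in train, 71 in test). We add the training columns missing from the test set as all-zero columns and then select the columns in training order, which also drops any dummy column that exists only in the test set.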

In [43]:
missing_col = set(train.columns)-set(test.columns)

In [46]:
for c in missing_col:
  test[c] = 0

In [47]:
test = test[train.columns]

In [48]:
train.shape, test.shape

((534, 106), (357, 106))
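
The add-missing-then-reorder steps can also be written as a single `reindex` call. This is a sketch on toy splits (the `toy_train` and `toy_test` frames are made up); it should give the same columns as the loop above, including dropping dummy columns that exist only in the test set:

```python
import pandas as pd

# Hypothetical toy splits with partly different Cabin labels.
toy_train = pd.DataFrame({'Cabin': ['C85', '0', 'E46'],
                          'Sex': ['female', 'male', 'male']})
toy_test = pd.DataFrame({'Cabin': ['0', 'B78'],
                         'Sex': ['female', 'female']})

train_enc = pd.get_dummies(toy_train, columns=['Cabin', 'Sex'])
test_enc = pd.get_dummies(toy_test, columns=['Cabin', 'Sex'])

# Use the training columns as the reference: columns missing from the
# test set are added as zeros, test-only columns are dropped, and the
# column order is made identical.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.shape, test_enc.shape, test_aligned.shape)
print(list(test_aligned.columns) == list(train_enc.columns))
```

Either way, the training set defines the feature space; labels seen only at test time carry no information the fitted model can use.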

#Create Random Forest Classifier

In [49]:
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(train, y_train)

RandomForestClassifier(n_estimators=200, random_state=42)

In [50]:
y_train_pred = clf.predict_proba(train)
y_test_pred = clf.predict_proba(test)

In [51]:
y_train_pred

array([[0.3296787 , 0.6703213 ],
       [0.86457227, 0.13542773],
       [0.3296787 , 0.6703213 ],
       ...,
       [0.86457227, 0.13542773],
       [0.3296787 , 0.6703213 ],
       [0.86457227, 0.13542773]])
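
`predict_proba` returns one column per class, ordered as in `clf.classes_`, so column 1 holds the estimated probability of `Survived == 1`. A small sketch on made-up data (the arrays and the `toy_clf` name below are illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy problem: one binary feature, binary target.
X_toy = np.array([[0], [0], [0], [1], [1], [1]])
y_toy = np.array([0, 0, 1, 1, 1, 0])

toy_clf = RandomForestClassifier(n_estimators=10, random_state=0)
toy_clf.fit(X_toy, y_toy)

proba = toy_clf.predict_proba(X_toy)

print(toy_clf.classes_)      # column order of predict_proba
print(proba.shape)           # (n_samples, n_classes)
print(proba.sum(axis=1))     # each row sums to 1
```

Rows with identical feature values get identical probabilities, which is why only a few distinct pairs show up in the output above.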

In [52]:
print('Train Set')
print('{} roc_auc: {}'.format('Random Forest', roc_auc_score(y_train, y_train_pred[:,1])))

print("Test Set")
print('{} roc_auc: {}'.format('Random Forest', roc_auc_score(y_test, y_test_pred[:, 1])))

Train Set
Random Forest roc_auc: 0.8729650130239166
Test Set
Random Forest roc_auc: 0.8261744743146127
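
ROC AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counting half. The function below is a small pure-Python sketch of that definition, not scikit-learn's implementation; `pairwise_auc` and the toy scores are hypothetical names and values:

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_y, toy_scores))  # 3 of 4 pairs are ordered correctly
```

On binary targets this should agree with `roc_auc_score`, which is why only the class-1 column of `predict_proba` is passed to it.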


#ML Models Building

In [55]:
def run_models(data_train, data_test, y_train, y_test):
  rf = RandomForestClassifier(n_estimators=200, random_state=42)
  ada = AdaBoostClassifier(n_estimators=200, random_state=42)
  logit = LogisticRegression(solver='lbfgs', random_state=42)
  gbc = GradientBoostingClassifier(n_estimators=300, random_state=42)

  models = {
      'Random Forest': rf,
      'Adaboost': ada,
      'Logistic Reg': logit,
      'Gradient Boost': gbc,
  }
  # Categorical encoding: one-hot encode train and test separately
  train = pd.get_dummies(data_train, columns=data_train.columns)
  test = pd.get_dummies(data_test, columns=data_test.columns)

  missing_cols = set(train.columns) - set(test.columns)
  for c in missing_cols:
    test[c]=0
  test = test[train.columns]

  for label, clf in models.items():
    clf.fit(train, y_train)
    y_train_pred = clf.predict_proba(train)
    y_test_pred = clf.predict_proba(test)

    print('Testing for {}'.format(label))
    print('Train roc_auc: {}'.format(roc_auc_score(y_train, y_train_pred[:, 1])))
    print('Test roc_auc: {}'.format(roc_auc_score(y_test, y_test_pred[:, 1])))
    print()

In [54]:
X_train.columns

Index(['Cabin', 'Cabin_reduced', 'Sex'], dtype='object')

#High cardinality

In [56]:
cols = ['Cabin', 'Sex']
run_models(X_train[cols], X_test[cols], y_train, y_test)

Testing for Random Forest
Train roc_auc: 0.8729650130239166
Test roc_auc: 0.8261744743146127

Testing for Adaboost
Train roc_auc: 0.8699310324413924
Test roc_auc: 0.7899254724514241

Testing for Logistic Reg
Train roc_auc: 0.8328572697134738
Test roc_auc: 0.819204152249135

Testing for Gradient Boost
Train roc_auc: 0.8743858039308549
Test roc_auc: 0.8247604471652914



#Low cardinality


In [57]:
cols = ['Cabin_reduced', 'Sex']
run_models(X_train[cols], X_test[cols], y_train, y_test)

Testing for Random Forest
Train roc_auc: 0.8310812810798011
Test roc_auc: 0.8101377428799574

Testing for Adaboost
Train roc_auc: 0.8283284986976084
Test roc_auc: 0.8166589033803567

Testing for Logistic Reg
Train roc_auc: 0.8274109045702107
Test roc_auc: 0.8159602076124568

Testing for Gradient Boost
Train roc_auc: 0.8310812810798011
Test roc_auc: 0.8101377428799574
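
One way to read the two result blocks is to compare the train-test gap per model, since a larger gap suggests more overfitting. The sketch below only re-uses the AUC values printed above, rounded to four decimals; the dictionary layout and the `gap` helper are arbitrary choices made for illustration:

```python
# (train AUC, test AUC) copied from the outputs above, rounded to 4 decimals.
high_card = {   # features: Cabin, Sex
    'Random Forest': (0.8730, 0.8262),
    'Adaboost': (0.8699, 0.7899),
    'Logistic Reg': (0.8329, 0.8192),
    'Gradient Boost': (0.8744, 0.8248),
}
low_card = {    # features: Cabin_reduced, Sex
    'Random Forest': (0.8311, 0.8101),
    'Adaboost': (0.8283, 0.8167),
    'Logistic Reg': (0.8274, 0.8160),
    'Gradient Boost': (0.8311, 0.8101),
}

def gap(train_test):
    """Train AUC minus test AUC."""
    train_auc, test_auc = train_test
    return round(train_auc - test_auc, 4)

for label in high_card:
    print(label, gap(high_card[label]), gap(low_card[label]))
```

For every model the gap shrinks with `Cabin_reduced`, while test AUC moves by less than 0.03 in either direction (it improves for AdaBoost). That pattern is consistent with the high-cardinality `Cabin` column mostly adding capacity to memorise the training set.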

