## Dimensionality Reduction - Classification

Blog Link :https://medium.com/@muzammila784/dimensionality-reduction-does-approach-matter-58d5cd0915c3

#### BackGround

In machine learning problems, there are many factors that contribute to the final prediction. Some of these features are highly correlated with the target variable, and some are negatively correlated. Having too many features increases the computational time and sometimes reduces the accuracy of the model, as some models are designed such that they are not able to perform well on high-dimensional data. This is where dimensionality reduction is used. It is a process or technique of reducing the number of features without compromising the accuracy of the model and reducing the computational time, which makes it better for models to learn patterns more precisely as compared to high-dimensional datasets.

The number of features is represented by dimensions. To make the problem simpler and make learning easier, we use dimensionality reduction techniques that map high-dimensional features into low-dimensional space. It is an old technique for reducing the dimensions of large datasets, but it still outperforms many traditional algorithms and newer techniques.

#### Dimensionality Reduction Techniques

In this project, we are using 15 different datasets and evaluating the performance of different machine learning models after doing dimensionality reduction with different approaches. The approaches we are using in this project are:


1- Principal Component Analysis (PCA) :-

Principal component analysis (PCA) is the process of calculating and analyzing the key characteristics of data and using them to change the data’s underlying structure, frequently using only the first few key characteristics, and ignoring the rest. The primary objective of PCA is to learn the pattern of variances, and its major objective is to prevent the pattern of variances from being altered during dimensionality reduction.


2- Linear Discriminant Analysis (LDA) :-

In this technique, we compute a scatter matrix, which tells us about the amount of dispersion within and between classes. As a result, we compute two matrices for the inter-class and intra-class matrices. LDA find linear combinations in the dataset so that classes remain separated.


3- T-SNE : -

TSN-E works on non-linear datasets and calculates similarity in between instances, high dimensional space and low dimensional space and then optimize these two similarities using a cost function.


4- Singular Vector Decomposition : -

The sole distinction between this method and PCA is that, in this case, the data matrix is used as the matrix to be factorized rather than the covariance matrix as it is in PCA. While PCA performs better with dense data, SVD performs better with sparse data.


5- Isomap Embedding : -

The isomap technique uses a non-linear approach to minimize dimensions while preserving local structures. It is distinct from a few other methods in the same class because it employs a non-linear method of dimensionality reduction as opposed to the linear mappings employed by algorithms like PCA.

#### Datasets

We are using eight classification and seven regression datasets to check the performance before and after reducing features.

Classification

1- Chronic Kidney Disease Dataset

2: Blood Pressure Disease Dataset

3- Cardiovascular Disease Dataset

4- Heart Disease Prediction Dataset

5- Credit Card Fraud Dataset

6- Customer churn Dataset

7- Brest Cancer Wisconsin Dataset

8-Accident Severity Prediction Dataset

#### Pre - Processing

Our pre-processing pipeline consists of the following steps:

1- The first step is to check the data types of columns and convert categorical columns into numerical columns using mapping or encoding techniques.

2- Dropping columns, which are redundant and unnecessary,

3- The second step is to check for null values and fill all the null values with either the mean, the previous value, or dropping the value.

4- The third step is scaling the data, which is mandatory before dimensionality reduction. We are doing scaling using the standard scaling method.

5- Splitting our dataset into train and test set with 70–30% ratio.

#### Machine Learning and Dimensionality Reduction Pipeline

So, our second pipeline will implement machine learning models and dimensionality reduction techniques on a dataset. This pipeline contains the following steps:

1- Applying machine learning models to a full-featured dataset using lazy prediction and getting results

2- PCA and Lazy Predict are applied to the reduced-dimensioned data, and the results are checked.

3- Implementing LDA and applying lazy prediction to the reduced data of LDA

4- Similarly, we will repeat the process of TSNE, SVD, and Isomap embedding.

## Importing Libraries

In [1]:
import pandas as pd   
import matplotlib.pyplot as plt
import plotly.express as px
import time
import numpy as np
import warnings

from sklearn.linear_model import LogisticRegression 
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import RFE,SelectFromModel


from sklearn.model_selection import train_test_split, RepeatedKFold, GridSearchCV, cross_val_score
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import classification_report
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

from sklearn.metrics import accuracy_score
from sklearn.metrics import plot_confusion_matrix
from sklearn.model_selection import KFold, cross_val_score

from sklearn.pipeline import Pipeline
from matplotlib import pyplot
# the permutation based importance
import seaborn as sns
from sklearn.inspection import permutation_importance

from numpy import mean
import warnings
warnings.filterwarnings("ignore") 

## Dataset 1 : Chronic Kidney Disease

Data info : This dataset of chronic kidney disease is taken from UCI Machine learning repository. Its a binary classsification dataset having 26 features and 400 instances. 

link for the is https://archive.ics.uci.edu/ml/datasets/chronic_kidney_disease

### Loading Dataset

In [73]:
# Importing Dataset using Pandas.
data = pd.read_csv("E:\DOWNLOADS\kidney_disease.csv")
data.shape

(400, 26)

### Data Pre Processing

In [74]:
data.head()

Unnamed: 0,id,age,bp,sg,al,su,rbc,pc,pcc,ba,...,pcv,wc,rc,htn,dm,cad,appet,pe,ane,classification
0,0,48.0,80.0,1.02,1.0,0.0,,normal,notpresent,notpresent,...,44,7800,5.2,yes,yes,no,good,no,no,ckd
1,1,7.0,50.0,1.02,4.0,0.0,,normal,notpresent,notpresent,...,38,6000,,no,no,no,good,no,no,ckd
2,2,62.0,80.0,1.01,2.0,3.0,normal,normal,notpresent,notpresent,...,31,7500,,no,yes,no,poor,no,yes,ckd
3,3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,...,32,6700,3.9,yes,no,no,poor,yes,yes,ckd
4,4,51.0,80.0,1.01,2.0,0.0,normal,normal,notpresent,notpresent,...,35,7300,4.6,no,no,no,good,no,no,ckd


In [75]:
# Counting Number of Attrition.
dictribution = data["classification"].value_counts()
dictribution

ckd       248
notckd    150
ckd\t       2
Name: classification, dtype: int64

In [76]:
# Replacing '/t' from classififcation column records.
data['classification'] = data['classification'].str.replace(r'\t', '')

In [77]:
# Counting Number of Attrition.
dictribution = data["classification"].value_counts()
dictribution

ckd       250
notckd    150
Name: classification, dtype: int64

In [78]:
# Cleaning columns and removing extra punctuations from numeric columns
data['pcv'] = data['pcv'].str.replace(r'?', '0')
data['wc'] = data['wc'].str.replace(r'?', '0')
data['rc'] = data['rc'].str.replace(r'?', '0')
data['dm'] = data['dm'].str.replace(r' ', '')

In [79]:
# now converting string columns into numeric after removing punctuations.
data['pcv'] = pd.to_numeric(data['pcv'])
data['wc'] = pd.to_numeric(data['wc'])
data['rc'] = pd.to_numeric(data['rc'])

In [80]:
# Checking Null values in all features.
data.isnull().sum()

id                  0
age                 9
bp                 12
sg                 47
al                 46
su                 49
rbc               152
pc                 65
pcc                 4
ba                  4
bgr                44
bu                 19
sc                 17
sod                87
pot                88
hemo               52
pcv                70
wc                105
rc                130
htn                 2
dm                  2
cad                 2
appet               1
pe                  1
ane                 1
classification      0
dtype: int64

In [81]:
data['age'].fillna(value= 51, inplace=True)

data['bp'].fillna(value= 76, inplace=True)

#Finding the mean of the Specific Gravity column having NaN
mean_value=data['sg'].mean()
#print(mean_value)
data['sg'].fillna(value= mean_value, inplace=True)

#Finding the mean of the Albumin column having NaN
mean_value=data['al'].mean()
print(mean_value)
data['al'].fillna(value= mean_value, inplace=True)

#Finding the mean of the Sugar column having NaN
mean_value=data['su'].mean()
print(mean_value)
data['su'].fillna(value= mean_value, inplace=True)

#Finding the mean of the Blood Glucose Random column having NaN
mean_value=data['bgr'].mean()
print(mean_value)
data['bgr'].fillna(value= mean_value, inplace=True)

#Finding the mean of the Blood urea column having NaN
mean_value=data['bu'].mean()
print(mean_value)
data['bu'].fillna(value= mean_value, inplace=True)

#Finding the mean of the Serum Creatine column having NaN
mean_value=data['sc'].mean()
print(mean_value)
data['sc'].fillna(value= mean_value, inplace=True)

#Finding the mean of the Sodium column having NaN
mean_value=data['sod'].mean()
print(mean_value)
data['sod'].fillna(value= mean_value, inplace=True)

#Finding the mean of the Potassium column having NaN
mean_value=data['pot'].mean()
print(mean_value)
data['pot'].fillna(value= mean_value, inplace=True)

#Finding the mean of the Hemoglobin column having NaN
mean_value=data['hemo'].mean()
print(mean_value)
data['hemo'].fillna(value= mean_value, inplace=True)

#Finding the mean of the Packed Cell Volume column having NaN
mean_value=data['pcv'].mean()
print(mean_value)
data['pcv'].fillna(value= mean_value, inplace=True)

#Finding the mean of the White Blood Cells column having NaN
mean_value=data['wc'].mean()
print(mean_value)
data['wc'].fillna(value= mean_value, inplace=True)

#Finding the mean of the Red Blood cells column having NaN
mean_value=data['rc'].mean()
print(mean_value)
data['rc'].fillna(value= mean_value, inplace=True)

1.0169491525423728
0.45014245014245013
148.0365168539326
57.425721784776904
3.072454308093995
137.52875399361022
4.62724358974359
12.526436781609195
38.766666666666666
8377.627118644068
4.689999999999999


In [82]:
# Replacing null values if categorical columns with mode
mode_value=data['pc'].mode()[0]
data['pc'].fillna(value= mode_value, inplace=True)

mode_value=data['pcc'].mode()[0]
data['pcc'].fillna(value= mode_value, inplace=True)

mode_value=data['ba'].mode()[0]
data['ba'].fillna(value= mode_value, inplace=True)

mode_value=data['htn'].mode()[0]
data['htn'].fillna(value= mode_value, inplace=True)

mode_value=data['cad'].mode()[0]
data['cad'].fillna(value= mode_value, inplace=True)

mode_value=data['appet'].mode()[0]
data['appet'].fillna(value= mode_value, inplace=True)

mode_value=data['pe'].mode()[0]
data['pe'].fillna(value= mode_value, inplace=True)

mode_value=data['ane'].mode()[0]
data['ane'].fillna(value= mode_value, inplace=True)

mode_value=data['dm'].mode()[0]
data['dm'].fillna(value= mode_value, inplace=True)

In [83]:
# Checking Null values in all features.
data.isnull().sum()

id                  0
age                 0
bp                  0
sg                  0
al                  0
su                  0
rbc               152
pc                  0
pcc                 0
ba                  0
bgr                 0
bu                  0
sc                  0
sod                 0
pot                 0
hemo                0
pcv                 0
wc                  0
rc                  0
htn                 0
dm                  0
cad                 0
appet               0
pe                  0
ane                 0
classification      0
dtype: int64

In [84]:
# Dropping unimpactful colums
data.drop('id',axis=1,inplace=True)
data.drop('rbc',axis=1,inplace=True)

In [85]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             400 non-null    float64
 1   bp              400 non-null    float64
 2   sg              400 non-null    float64
 3   al              400 non-null    float64
 4   su              400 non-null    float64
 5   pc              400 non-null    object 
 6   pcc             400 non-null    object 
 7   ba              400 non-null    object 
 8   bgr             400 non-null    float64
 9   bu              400 non-null    float64
 10  sc              400 non-null    float64
 11  sod             400 non-null    float64
 12  pot             400 non-null    float64
 13  hemo            400 non-null    float64
 14  pcv             400 non-null    float64
 15  wc              400 non-null    float64
 16  rc              400 non-null    float64
 17  htn             400 non-null    obj

In [86]:
data['pc'] = data['pc'].map({'normal': 0, 'abnormal': 1})
data['pc'] = pd.to_numeric(data['pc'])

data['pcc'] = data['pcc'].map({'notpresent': 0, 'present': 1})
data['pcc'] = pd.to_numeric(data['pcc'])

data['ba'] = data['pcc'].map({'notpresent': 0, 'present': 1})
data['ba'] = pd.to_numeric(data['ba'])

data['htn'] = data['htn'].map({'no': 0, 'yes': 1})
data['htn'] = pd.to_numeric(data['htn'])

data['dm'] = data['dm'].map({'no': 0, 'yes': 1})
data['dm'] = pd.to_numeric(data['dm'])

data['cad'] = data['cad'].map({'no': 0, 'yes': 1})
data['cad'] = pd.to_numeric(data['cad'])

data['appet'] = data['appet'].map({'good': 1, 'poor': 0})
data['appet'] = pd.to_numeric(data['appet'])

data['pe'] = data['pe'].map({'no': 0, 'yes': 1})
data['pe'] = pd.to_numeric(data['pe'])

data['ane'] = data['ane'].map({'no': 0, 'yes': 1})
data['ane'] = pd.to_numeric(data['ane'])

data['classification'] = data['classification'].map({'notckd': 0, 'ckd': 1})
data['classification'] = pd.to_numeric(data['classification'])

In [87]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             400 non-null    float64
 1   bp              400 non-null    float64
 2   sg              400 non-null    float64
 3   al              400 non-null    float64
 4   su              400 non-null    float64
 5   pc              400 non-null    int64  
 6   pcc             400 non-null    int64  
 7   ba              0 non-null      float64
 8   bgr             400 non-null    float64
 9   bu              400 non-null    float64
 10  sc              400 non-null    float64
 11  sod             400 non-null    float64
 12  pot             400 non-null    float64
 13  hemo            400 non-null    float64
 14  pcv             400 non-null    float64
 15  wc              400 non-null    float64
 16  rc              400 non-null    float64
 17  htn             400 non-null    int

In [88]:
data.drop('ba',axis=1,inplace=True)

### Splitting Dataset into X and Y Variables

In [89]:
X = data.loc[:, data.columns != 'classification']
y = data[['classification']]

In [90]:
from sklearn.preprocessing import StandardScaler

# create a StandardScaler model
scaler = StandardScaler()

# fit and transform the data
X_scaled = scaler.fit_transform(X)

In [91]:
X_scaled = pd.DataFrame(X_scaled, columns= X.columns)

In [92]:
X_scaled.head()

Unnamed: 0,age,bp,sg,al,su,pc,pcc,bgr,bu,sc,...,hemo,pcv,wc,rc,htn,dm,cad,appet,pe,ane
0,-0.20482,0.263379,0.483355,-0.013338,-0.437797,-0.484322,-0.342518,-0.361987,-0.435268,-0.333743,...,1.059271,0.625313,-0.226099,0.5851805,1.311903,1.385535,-0.304789,0.507801,-0.484322,-0.420084
1,-2.623145,-1.9655,0.483355,2.347516,-0.437797,-0.484322,-0.342518,0.0,-0.800941,-0.405039,...,-0.452097,-0.091606,-0.930667,-1.019107e-15,-0.762252,-0.721743,-0.304789,0.507801,-0.484322,-0.420084
2,0.620949,0.263379,-1.381391,0.773613,2.479925,-0.484322,-0.342518,3.681441,-0.089909,-0.2268,...,-1.078762,-0.928012,-0.343527,-1.019107e-15,-0.762252,1.385535,-0.304789,-1.969276,-0.484322,2.380476
3,-0.20482,-0.479581,-2.313764,2.347516,-0.437797,2.064742,2.919556,-0.415543,-0.028964,0.129677,...,-0.48896,-0.808525,-0.656668,-0.9064561,1.311903,-0.721743,-0.304789,-1.969276,2.064742,2.380476
4,-0.02787,0.263379,-1.381391,0.773613,-0.437797,-0.484322,-0.342518,-0.56282,-0.63842,-0.298096,...,-0.341509,-0.450066,-0.421812,-0.1032671,-0.762252,-0.721743,-0.304789,0.507801,-0.484322,-0.420084


In [102]:
trainX, testX, trainy, testy = train_test_split(X_scaled, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(280, 22)
(280, 1)
(120, 22)
(120, 1)


Implementing Lazy predict on full dataset before dimensionality reduction

In [100]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:02<00:00, 13.13it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LGBMClassifier,1.0,1.0,1.0,1.0,0.12
ExtraTreesClassifier,1.0,1.0,1.0,1.0,0.17
SVC,1.0,1.0,1.0,1.0,0.02
RandomForestClassifier,1.0,1.0,1.0,1.0,0.23
LogisticRegression,1.0,1.0,1.0,1.0,0.04
XGBClassifier,0.99,0.99,0.99,0.99,0.1
NuSVC,0.99,0.99,0.99,0.99,0.04
AdaBoostClassifier,0.99,0.99,0.99,0.99,0.5
CalibratedClassifierCV,0.99,0.99,0.99,0.99,0.1
SGDClassifier,0.99,0.99,0.99,0.99,0.02


### Dimensionality Reduction Algorithms

### Principal Component Analysis

In [93]:
from sklearn.decomposition import PCA

pca = PCA(n_components=8)

principalComponents = pca.fit_transform(X_scaled)

PCA_Df = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1', 'principal component 2','principal component 3', 'principal component 4','principal component 5', 'principal component 6','principal component 7', 'principal component 8'])

In [94]:
PCA_Df.head()

Unnamed: 0,principal component 1,principal component 2,principal component 3,principal component 4,principal component 5,principal component 6,principal component 7,principal component 8
0,-0.901751,-0.669985,0.73458,-0.562227,-0.314508,0.109762,0.140686,-0.14592
1,-1.228522,0.984284,-1.212028,0.793283,1.033842,-0.189207,-1.531347,-2.165175
2,2.957758,-2.463542,0.657733,0.435719,-1.431037,-1.569641,-1.407006,-1.329318
3,4.363778,1.928367,-3.008661,-0.309425,2.279219,-0.562898,0.067209,-1.09477
4,-0.457323,0.404681,-0.42162,-0.015137,-0.039227,-0.869486,-0.128037,0.006296


In [95]:
trainX, testX, trainy, testy = train_test_split(PCA_Df, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(280, 8)
(280, 1)
(120, 8)
(120, 1)


In [97]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:02<00:00, 11.83it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LinearSVC,1.0,1.0,1.0,1.0,0.02
BaggingClassifier,1.0,1.0,1.0,1.0,0.14
XGBClassifier,1.0,1.0,1.0,1.0,0.1
DecisionTreeClassifier,1.0,1.0,1.0,1.0,0.11
SVC,1.0,1.0,1.0,1.0,0.03
ExtraTreesClassifier,1.0,1.0,1.0,1.0,0.4
RandomForestClassifier,1.0,1.0,1.0,1.0,0.39
LogisticRegression,1.0,1.0,1.0,1.0,0.03
CalibratedClassifierCV,0.99,0.99,0.99,0.99,0.15
SGDClassifier,0.99,0.99,0.99,0.99,0.03


### Linear Discriminant Analysis

In [113]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# create an LDA model
lda = LDA(n_components=1)

# fit and transform the data
X_reduced = lda.fit_transform(X_scaled, y)

In [114]:
trainX, testX, trainy, testy = train_test_split(X_reduced, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(280, 1)
(280, 1)
(120, 1)
(120, 1)


In [115]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:01<00:00, 20.38it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AdaBoostClassifier,1.0,1.0,1.0,1.0,0.24
KNeighborsClassifier,1.0,1.0,1.0,1.0,0.04
XGBClassifier,1.0,1.0,1.0,1.0,0.08
SGDClassifier,1.0,1.0,1.0,1.0,0.02
RandomForestClassifier,1.0,1.0,1.0,1.0,0.29
Perceptron,1.0,1.0,1.0,1.0,0.02
BaggingClassifier,1.0,1.0,1.0,1.0,0.04
LGBMClassifier,1.0,1.0,1.0,1.0,0.07
ExtraTreesClassifier,1.0,1.0,1.0,1.0,0.17
DecisionTreeClassifier,1.0,1.0,1.0,1.0,0.02


### T-SNE

In [117]:
from sklearn.manifold import TSNE
# Initialize the t-SNE model
model = TSNE(n_components=2)
X_reduced = model.fit_transform(X_scaled)

In [118]:
trainX, testX, trainy, testy = train_test_split(X_reduced, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(280, 2)
(280, 1)
(120, 2)
(120, 1)


In [119]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:01<00:00, 22.85it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LGBMClassifier,1.0,1.0,1.0,1.0,0.1
XGBClassifier,1.0,1.0,1.0,1.0,0.09
ExtraTreesClassifier,1.0,1.0,1.0,1.0,0.18
KNeighborsClassifier,0.99,0.99,0.99,0.99,0.02
SGDClassifier,0.99,0.99,0.99,0.99,0.02
RandomForestClassifier,0.99,0.99,0.99,0.99,0.21
BaggingClassifier,0.99,0.99,0.99,0.99,0.05
LabelSpreading,0.99,0.99,0.99,0.99,0.03
LabelPropagation,0.99,0.99,0.99,0.99,0.02
AdaBoostClassifier,0.99,0.99,0.99,0.99,0.19


### Singular Vector Decomposition SVD

In [120]:
from sklearn.decomposition import TruncatedSVD

# create a SVD model
svd = TruncatedSVD(n_components=8)

# fit and transform the data
X_reduced = svd.fit_transform(X_scaled)

In [121]:
trainX, testX, trainy, testy = train_test_split(X_reduced, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(280, 8)
(280, 1)
(120, 8)
(120, 1)


In [122]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:01<00:00, 17.90it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AdaBoostClassifier,1.0,1.0,1.0,1.0,0.19
ExtraTreesClassifier,1.0,1.0,1.0,1.0,0.25
XGBClassifier,1.0,1.0,1.0,1.0,0.08
SVC,1.0,1.0,1.0,1.0,0.02
RandomForestClassifier,1.0,1.0,1.0,1.0,0.29
LogisticRegression,1.0,1.0,1.0,1.0,0.02
BaggingClassifier,1.0,1.0,1.0,1.0,0.05
LinearSVC,1.0,1.0,1.0,1.0,0.02
DecisionTreeClassifier,1.0,1.0,1.0,1.0,0.02
PassiveAggressiveClassifier,0.99,0.99,0.99,0.99,0.03


### ISOMAP Embedding

In [123]:
from sklearn.manifold import Isomap
model = Isomap(n_components=2, n_neighbors=3)
# fit and transform the data
X_reduced = model.fit_transform(X_scaled)

In [124]:
trainX, testX, trainy, testy = train_test_split(X_reduced, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(280, 2)
(280, 1)
(120, 2)
(120, 1)


In [125]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:02<00:00, 11.17it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LinearSVC,0.98,0.99,0.99,0.98,0.03
LogisticRegression,0.98,0.99,0.99,0.98,0.04
GaussianNB,0.97,0.98,0.98,0.98,0.04
QuadraticDiscriminantAnalysis,0.97,0.98,0.98,0.98,0.03
CalibratedClassifierCV,0.97,0.97,0.97,0.97,0.12
SGDClassifier,0.97,0.97,0.97,0.97,0.03
PassiveAggressiveClassifier,0.97,0.97,0.97,0.97,0.03
XGBClassifier,0.97,0.97,0.97,0.97,0.16
DecisionTreeClassifier,0.97,0.97,0.97,0.97,0.06
ExtraTreesClassifier,0.97,0.97,0.97,0.97,0.42


## Dataset 2: Blood Pressure Prediction Dataset 

Dataset Info : Dataset we are using in this case study is named as “Blood pressure data for disease prediction” and contains medical information of patients. Dataset contains 2000 patients records and 15 features. We are using “Blood pressure abnormality” as our target variable. It is a Binary class classification problem, where value of class is 0 and 1 (0 means disease not detected and 1 means disease detected). Dataset is balanced and consist of 14 attributes other than target attribute.

In [188]:
# Importing Dataset using Pandas.
data = pd.read_csv("E:\DOWNLOADS\Datasets\data.csv")
data.shape  

(2000, 15)

In [189]:
data.head()

Unnamed: 0,Patient_Number,Blood_Pressure_Abnormality,Level_of_Hemoglobin,Genetic_Pedigree_Coefficient,Age,BMI,Sex,Pregnancy,Smoking,Physical_activity,salt_content_in_the_diet,alcohol_consumption_per_day,Level_of_Stress,Chronic_kidney_disease,Adrenal_and_thyroid_disorders
0,1,1,11.28,0.9,34,23,1,1.0,0,45961,48071,,2,1,1
1,2,0,9.75,0.23,54,33,1,,0,26106,25333,205.0,3,0,0
2,3,1,10.79,0.91,70,49,0,,0,9995,29465,67.0,2,1,0
3,4,0,11.0,0.43,71,50,0,,0,10635,7439,242.0,1,1,0
4,5,1,14.17,0.83,52,19,0,,0,15619,49644,397.0,2,0,0


In [190]:
# Checking Null values in all features.
data.isnull().sum()

Patient_Number                      0
Blood_Pressure_Abnormality          0
Level_of_Hemoglobin                 0
Genetic_Pedigree_Coefficient       92
Age                                 0
BMI                                 0
Sex                                 0
Pregnancy                        1558
Smoking                             0
Physical_activity                   0
salt_content_in_the_diet            0
alcohol_consumption_per_day       242
Level_of_Stress                     0
Chronic_kidney_disease              0
Adrenal_and_thyroid_disorders       0
dtype: int64

In [191]:
#Solving for Null Values and adding 0 in null spaces.
data['Pregnancy'] = data['Pregnancy'].fillna(0)
data['alcohol_consumption_per_day'] = data['alcohol_consumption_per_day'].fillna(0)

In [192]:
# Checking Null Values again.
data.isnull().sum()

Patient_Number                    0
Blood_Pressure_Abnormality        0
Level_of_Hemoglobin               0
Genetic_Pedigree_Coefficient     92
Age                               0
BMI                               0
Sex                               0
Pregnancy                         0
Smoking                           0
Physical_activity                 0
salt_content_in_the_diet          0
alcohol_consumption_per_day       0
Level_of_Stress                   0
Chronic_kidney_disease            0
Adrenal_and_thyroid_disorders     0
dtype: int64

In [193]:
# Computing Mean of every column.
columns_means = data.mean()
print(columns_means)

Patient_Number                   1000.50
Blood_Pressure_Abnormality          0.49
Level_of_Hemoglobin                11.71
Genetic_Pedigree_Coefficient        0.49
Age                                46.56
BMI                                30.08
Sex                                 0.50
Pregnancy                           0.10
Smoking                             0.51
Physical_activity               25254.42
salt_content_in_the_diet        24926.10
alcohol_consumption_per_day       220.64
Level_of_Stress                     2.01
Chronic_kidney_disease              0.51
Adrenal_and_thyroid_disorders       0.44
dtype: float64


In [194]:
# Adding mean value in null spaces of column Genetic Preigree.
data['Genetic_Pedigree_Coefficient'] = data['Genetic_Pedigree_Coefficient'].fillna(0.49)

In [195]:
# Again Checking Null Values.
data.isnull().sum()

Patient_Number                   0
Blood_Pressure_Abnormality       0
Level_of_Hemoglobin              0
Genetic_Pedigree_Coefficient     0
Age                              0
BMI                              0
Sex                              0
Pregnancy                        0
Smoking                          0
Physical_activity                0
salt_content_in_the_diet         0
alcohol_consumption_per_day      0
Level_of_Stress                  0
Chronic_kidney_disease           0
Adrenal_and_thyroid_disorders    0
dtype: int64

In [196]:
# Checking Data Types of every column.
data.dtypes

Patient_Number                     int64
Blood_Pressure_Abnormality         int64
Level_of_Hemoglobin              float64
Genetic_Pedigree_Coefficient     float64
Age                                int64
BMI                                int64
Sex                                int64
Pregnancy                        float64
Smoking                            int64
Physical_activity                  int64
salt_content_in_the_diet           int64
alcohol_consumption_per_day      float64
Level_of_Stress                    int64
Chronic_kidney_disease             int64
Adrenal_and_thyroid_disorders      int64
dtype: object

In [197]:
# Dropping Unnecessary Column.
data.drop('Patient_Number',axis=1,inplace=True)

In [198]:
X = data.loc[:, data.columns != 'Blood_Pressure_Abnormality']     # All columns except target variable.
y = data[['Blood_Pressure_Abnormality']]                          # Target Variable.

In [199]:
from sklearn.preprocessing import StandardScaler

# create a StandardScaler model
scaler = StandardScaler()

# fit and transform the data
X_scaled = scaler.fit_transform(X)

In [200]:
X_scaled = pd.DataFrame(X_scaled, columns= X.columns)

In [201]:
X_scaled.head()

Unnamed: 0,Level_of_Hemoglobin,Genetic_Pedigree_Coefficient,Age,BMI,Sex,Pregnancy,Smoking,Physical_activity,salt_content_in_the_diet,alcohol_consumption_per_day,Level_of_Stress,Chronic_kidney_disease,Adrenal_and_thyroid_disorders
0,-0.2,1.42,-0.73,-0.6,1.01,3.01,-1.02,1.48,1.63,-1.4,-0.02,0.99,1.12
1,-0.9,-0.93,0.44,0.25,1.01,-0.33,-1.02,0.06,0.03,-0.1,1.2,-1.01,-0.89
2,-0.42,1.46,1.37,1.61,-0.99,-0.33,-1.02,-1.09,0.32,-0.98,-0.02,0.99,-0.89
3,-0.32,-0.23,1.43,1.69,-0.99,-0.33,-1.02,-1.04,-1.23,0.14,-1.23,0.99,-0.89
4,1.13,1.18,0.32,-0.94,-0.99,-0.33,-1.02,-0.69,1.74,1.12,-0.02,-1.01,-0.89


In [202]:
trainX, testX, trainy, testy = train_test_split(X_scaled, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(1400, 13)
(1400, 1)
(600, 13)
(600, 1)


In [203]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:04<00:00,  5.90it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LGBMClassifier,0.9,0.9,0.9,0.9,0.19
XGBClassifier,0.9,0.9,0.9,0.9,0.31
RandomForestClassifier,0.89,0.89,0.89,0.89,0.46
BaggingClassifier,0.87,0.87,0.87,0.87,0.2
AdaBoostClassifier,0.86,0.87,0.87,0.86,0.66
SVC,0.86,0.86,0.86,0.86,0.17
NuSVC,0.85,0.85,0.85,0.85,0.22
GaussianNB,0.83,0.83,0.83,0.83,0.03
ExtraTreesClassifier,0.83,0.83,0.83,0.83,0.59
DecisionTreeClassifier,0.82,0.82,0.82,0.82,0.08


### Dimensionality Reduction Algorithms

### Principal Component Analysis

In [204]:
from sklearn.decomposition import PCA

pca = PCA(n_components=8)

principalComponents = pca.fit_transform(X_scaled)

PCA_Df = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1', 'principal component 2','principal component 3', 'principal component 4','principal component 5', 'principal component 6','principal component 7', 'principal component 8'])

In [206]:
PCA_Df.shape

(2000, 8)

In [205]:
trainX, testX, trainy, testy = train_test_split(PCA_Df, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(1400, 8)
(1400, 1)
(600, 8)
(600, 1)


In [207]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:04<00:00,  6.04it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ExtraTreesClassifier,0.77,0.77,0.77,0.77,0.58
RandomForestClassifier,0.76,0.76,0.76,0.76,0.62
NuSVC,0.76,0.76,0.76,0.76,0.24
QuadraticDiscriminantAnalysis,0.75,0.75,0.75,0.75,0.03
XGBClassifier,0.74,0.75,0.75,0.74,0.34
AdaBoostClassifier,0.74,0.74,0.74,0.74,0.63
SVC,0.74,0.74,0.74,0.74,0.22
LGBMClassifier,0.73,0.74,0.74,0.73,0.21
GaussianNB,0.73,0.73,0.73,0.73,0.03
BaggingClassifier,0.73,0.73,0.73,0.73,0.26


### Linear Discriminant Analysis

In [208]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# create an LDA model
lda = LDA(n_components=1)

# fit and transform the data
X_reduced = lda.fit_transform(X_scaled, y)

In [209]:
trainX, testX, trainy, testy = train_test_split(X_reduced, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(1400, 1)
(1400, 1)
(600, 1)
(600, 1)


In [210]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:02<00:00, 12.79it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
BernoulliNB,0.71,0.71,0.71,0.71,0.01
Perceptron,0.71,0.71,0.71,0.71,0.01
NearestCentroid,0.71,0.71,0.71,0.71,0.02
KNeighborsClassifier,0.7,0.71,0.71,0.7,0.04
LGBMClassifier,0.7,0.7,0.7,0.7,0.16
RidgeClassifierCV,0.7,0.7,0.7,0.7,0.02
RidgeClassifier,0.7,0.7,0.7,0.7,0.02
LogisticRegression,0.7,0.7,0.7,0.7,0.02
LinearDiscriminantAnalysis,0.7,0.7,0.7,0.7,0.02
LinearSVC,0.7,0.7,0.7,0.7,0.02


### T-SNE

In [211]:
from sklearn.manifold import TSNE
# Initialize the t-SNE model
model = TSNE(n_components=2)
X_reduced = model.fit_transform(X_scaled)

In [212]:
trainX, testX, trainy, testy = train_test_split(X_reduced, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(1400, 2)
(1400, 1)
(600, 2)
(600, 1)


In [213]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:02<00:00, 10.00it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LabelSpreading,0.74,0.74,0.74,0.74,0.35
KNeighborsClassifier,0.73,0.73,0.73,0.73,0.06
LabelPropagation,0.73,0.73,0.73,0.73,0.14
RandomForestClassifier,0.71,0.72,0.72,0.71,0.4
ExtraTreesClassifier,0.71,0.71,0.71,0.71,0.33
XGBClassifier,0.7,0.7,0.7,0.7,0.2
BaggingClassifier,0.69,0.69,0.69,0.69,0.07
LGBMClassifier,0.69,0.69,0.69,0.69,0.14
SVC,0.68,0.68,0.68,0.68,0.16
DecisionTreeClassifier,0.68,0.68,0.68,0.68,0.02


### Singular Value Decomposition

In [214]:
from sklearn.decomposition import TruncatedSVD

# create a SVD model
svd = TruncatedSVD(n_components=8)

# fit and transform the data
X_reduced = svd.fit_transform(X_scaled)

In [215]:
trainX, testX, trainy, testy = train_test_split(X_reduced, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(1400, 8)
(1400, 1)
(600, 8)
(600, 1)


In [216]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:02<00:00,  9.75it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ExtraTreesClassifier,0.77,0.77,0.77,0.77,0.34
RandomForestClassifier,0.76,0.76,0.76,0.76,0.45
NuSVC,0.76,0.76,0.76,0.76,0.18
QuadraticDiscriminantAnalysis,0.75,0.75,0.75,0.75,0.02
XGBClassifier,0.74,0.75,0.75,0.74,0.33
AdaBoostClassifier,0.74,0.74,0.74,0.74,0.25
SVC,0.74,0.74,0.74,0.74,0.16
LGBMClassifier,0.73,0.74,0.74,0.73,0.18
GaussianNB,0.73,0.73,0.73,0.73,0.02
BaggingClassifier,0.73,0.73,0.73,0.73,0.12


### ISOMAP Embedding

In [217]:
from sklearn.manifold import Isomap
model = Isomap(n_components=2, n_neighbors=3)
# fit and transform the data
X_reduced = model.fit_transform(X_scaled)

In [218]:
trainX, testX, trainy, testy = train_test_split(X_reduced, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(1400, 2)
(1400, 1)
(600, 2)
(600, 1)


In [219]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:02<00:00, 10.48it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LabelSpreading,0.66,0.66,0.66,0.66,0.17
LabelPropagation,0.66,0.66,0.66,0.66,0.11
KNeighborsClassifier,0.64,0.64,0.64,0.64,0.04
BaggingClassifier,0.64,0.64,0.64,0.63,0.1
LGBMClassifier,0.63,0.63,0.63,0.63,0.14
SVC,0.63,0.63,0.63,0.63,0.18
PassiveAggressiveClassifier,0.64,0.63,0.63,0.63,0.02
XGBClassifier,0.63,0.63,0.63,0.63,0.2
CalibratedClassifierCV,0.63,0.63,0.63,0.63,0.15
LogisticRegression,0.63,0.63,0.63,0.63,0.03


## Dataset 3: Cardiovascular Disease

Dataset Info : Dataset we are using in this case study is  named as “cardiovascular disease prediction” and contains medical information of patients. Dataset contains 70000 patients records and 13 features. We are using “cardio” as our target variable. It is a Binary class classification problem, where value of class is 0 and 1 (0 means disease not detected and 1 means disease 2 means detected). Dataset is balanced and consist of 14 attributes other than target attribute. Dataset is cleaned and does not contain any null values or string column, so we don’t need much data prepossessing.

In [2]:
data = pd.read_csv("E:/cardio_disease.csv", sep=';')
data.shape

(70000, 13)

#### Data Preprocessing

In [3]:
# Counting Number of distribution.
dictribution = data["cardio"].value_counts()
dictribution

0    35021
1    34979
Name: cardio, dtype: int64

In [4]:
# Checking Data Types of every column.
data.dtypes

id               int64
age              int64
gender           int64
height           int64
weight         float64
ap_hi            int64
ap_lo            int64
cholesterol      int64
gluc             int64
smoke            int64
alco             int64
active           int64
cardio           int64
dtype: object

In [5]:
# Dropping Unnecessary Column.
data.drop('id',axis=1,inplace=True)

In [6]:
X = data.loc[:, data.columns != 'cardio']     # All columns except target variable.
y = data[['cardio']]                          # Target Variable.

In [7]:
from sklearn.preprocessing import StandardScaler

# create a StandardScaler model
scaler = StandardScaler()

# fit and transform the data
X_scaled = scaler.fit_transform(X)

In [8]:
X_scaled = pd.DataFrame(X_scaled, columns= X.columns)

In [9]:
trainX, testX, trainy, testy = train_test_split(X_scaled, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(49000, 11)
(49000, 1)
(21000, 11)
(21000, 1)


In [12]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [49:57<00:00, 103.38s/it]  


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LGBMClassifier,0.73,0.73,0.73,0.73,0.65
XGBClassifier,0.73,0.73,0.73,0.73,4.18
AdaBoostClassifier,0.73,0.73,0.73,0.73,2.63
SVC,0.73,0.73,0.73,0.73,254.04
LogisticRegression,0.72,0.72,0.72,0.72,0.2
SGDClassifier,0.72,0.72,0.72,0.72,0.38
RandomForestClassifier,0.72,0.72,0.72,0.72,9.18
LinearSVC,0.71,0.71,0.71,0.71,12.92
BernoulliNB,0.71,0.71,0.71,0.71,0.12
CalibratedClassifierCV,0.71,0.71,0.71,0.71,47.43


### Dimensionality Reduction Algorithm

### Principal Component Analysis

In [13]:
from sklearn.decomposition import PCA

pca = PCA(n_components=5)

principalComponents = pca.fit_transform(X_scaled)

PCA_Df = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1', 'principal component 2','principal component 3', 'principal component 4','principal component 5'])

In [14]:
trainX, testX, trainy, testy = train_test_split(PCA_Df, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(49000, 5)
(49000, 1)
(21000, 5)
(21000, 1)


In [15]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [15:53<00:00, 32.88s/it] 


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LGBMClassifier,0.71,0.71,0.71,0.71,0.61
XGBClassifier,0.71,0.71,0.71,0.71,6.59
SVC,0.7,0.7,0.7,0.7,199.13
RandomForestClassifier,0.7,0.7,0.7,0.7,15.44
ExtraTreesClassifier,0.69,0.69,0.69,0.69,8.51
KNeighborsClassifier,0.68,0.68,0.68,0.68,1.12
SGDClassifier,0.68,0.68,0.68,0.68,0.15
LogisticRegression,0.68,0.68,0.68,0.68,0.09
BaggingClassifier,0.67,0.67,0.67,0.67,3.34
AdaBoostClassifier,0.67,0.67,0.67,0.67,2.96


### Linear Discriminant Analysis

In [16]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# create an LDA model
lda = LDA(n_components=1)

# fit and transform the data
X_reduced = lda.fit_transform(X_scaled, y)

In [17]:
trainX, testX, trainy, testy = train_test_split(X_reduced, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(49000, 1)
(49000, 1)
(21000, 1)
(21000, 1)


In [18]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [09:02<00:00, 18.72s/it]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LogisticRegression,0.65,0.65,0.65,0.65,0.05
CalibratedClassifierCV,0.65,0.65,0.65,0.65,11.33
AdaBoostClassifier,0.65,0.65,0.65,0.65,1.49
LGBMClassifier,0.65,0.65,0.65,0.65,0.41
SGDClassifier,0.65,0.65,0.65,0.65,0.08
LinearSVC,0.65,0.65,0.65,0.65,3.76
SVC,0.65,0.65,0.65,0.65,227.18
RidgeClassifierCV,0.65,0.65,0.65,0.65,0.05
RidgeClassifier,0.65,0.65,0.65,0.65,0.04
LinearDiscriminantAnalysis,0.65,0.65,0.65,0.65,0.05


### Singular Vector Decomposition

In [19]:
from sklearn.decomposition import TruncatedSVD

# create a SVD model
svd = TruncatedSVD(n_components=5)

# fit and transform the data
X_reduced = svd.fit_transform(X_scaled)

In [20]:
trainX, testX, trainy, testy = train_test_split(X_reduced, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(49000, 5)
(49000, 1)
(21000, 5)
(21000, 1)


In [21]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [16:03<00:00, 33.23s/it] 


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LGBMClassifier,0.71,0.71,0.71,0.71,0.53
XGBClassifier,0.71,0.71,0.71,0.71,6.48
SVC,0.7,0.7,0.7,0.7,194.61
RandomForestClassifier,0.7,0.7,0.7,0.7,15.51
ExtraTreesClassifier,0.69,0.69,0.69,0.69,8.64
KNeighborsClassifier,0.68,0.68,0.68,0.68,1.16
SGDClassifier,0.68,0.68,0.68,0.68,0.16
LogisticRegression,0.68,0.68,0.68,0.68,0.1
BaggingClassifier,0.67,0.67,0.67,0.67,3.44
AdaBoostClassifier,0.67,0.67,0.67,0.67,3.03


### T-SNE

In [22]:
from sklearn.manifold import TSNE
# Initialize the t-SNE model
model = TSNE(n_components=2)
X_reduced = model.fit_transform(X_scaled)

In [23]:
trainX, testX, trainy, testy = train_test_split(X_reduced, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(49000, 2)
(49000, 1)
(21000, 2)
(21000, 1)


In [24]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [09:04<00:00, 18.78s/it]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LGBMClassifier,0.64,0.64,0.64,0.64,0.41
XGBClassifier,0.63,0.63,0.63,0.63,3.84
KNeighborsClassifier,0.62,0.62,0.62,0.62,0.77
RandomForestClassifier,0.61,0.61,0.61,0.61,11.37
SVC,0.61,0.61,0.61,0.61,219.31
ExtraTreesClassifier,0.61,0.61,0.61,0.61,7.25
BaggingClassifier,0.6,0.6,0.6,0.6,1.73
AdaBoostClassifier,0.6,0.6,0.6,0.6,1.68
DecisionTreeClassifier,0.58,0.58,0.58,0.58,0.3
ExtraTreeClassifier,0.58,0.58,0.58,0.58,0.11


## Dataset 4 : Heart Disease Prediction

Dataset Info :

This datset is take from UCI Machine learning repository. Its a binary class classififcation dataset.
link for data : https://archive.ics.uci.edu/ml/datasets/heart+disease

In [11]:
# Importing Dataset using Pandas.
data = pd.read_csv("E:/heart_2020.csv")
data.shape  

(319795, 18)

In [12]:
data = data.sample(20000)
data.shape

(20000, 18)

In [13]:
data.head(5)

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
251961,No,24.41,Yes,No,No,0.0,0.0,No,Female,75-79,Hispanic,No,Yes,Excellent,6.0,No,No,No
5347,No,26.63,Yes,Yes,No,21.0,21.0,No,Female,25-29,American Indian/Alaskan Native,No,No,Good,3.0,No,No,No
316653,No,27.4,No,Yes,No,0.0,30.0,No,Female,30-34,Hispanic,No,Yes,Very good,5.0,No,No,No
103705,No,20.18,No,No,No,0.0,15.0,No,Female,18-24,White,No,Yes,Good,8.0,Yes,No,No
288584,No,27.99,No,No,No,0.0,5.0,No,Female,70-74,White,No,Yes,Excellent,7.0,No,No,No


In [14]:
data.columns

Index(['HeartDisease', 'BMI', 'Smoking', 'AlcoholDrinking', 'Stroke',
       'PhysicalHealth', 'MentalHealth', 'DiffWalking', 'Sex', 'AgeCategory',
       'Race', 'Diabetic', 'PhysicalActivity', 'GenHealth', 'SleepTime',
       'Asthma', 'KidneyDisease', 'SkinCancer'],
      dtype='object')

In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20000 entries, 251961 to 72269
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   HeartDisease      20000 non-null  object 
 1   BMI               20000 non-null  float64
 2   Smoking           20000 non-null  object 
 3   AlcoholDrinking   20000 non-null  object 
 4   Stroke            20000 non-null  object 
 5   PhysicalHealth    20000 non-null  float64
 6   MentalHealth      20000 non-null  float64
 7   DiffWalking       20000 non-null  object 
 8   Sex               20000 non-null  object 
 9   AgeCategory       20000 non-null  object 
 10  Race              20000 non-null  object 
 11  Diabetic          20000 non-null  object 
 12  PhysicalActivity  20000 non-null  object 
 13  GenHealth         20000 non-null  object 
 14  SleepTime         20000 non-null  float64
 15  Asthma            20000 non-null  object 
 16  KidneyDisease     20000 non-null  o

In [16]:
data['AgeCategory'].unique()

array(['75-79', '25-29', '30-34', '18-24', '70-74', '40-44', '50-54',
       '55-59', '65-69', '35-39', '60-64', '45-49', '80 or older'],
      dtype=object)

In [17]:
data['Age'] = data['AgeCategory'].str[:2]
data['Age'] = pd.to_numeric(data['Age'])

In [18]:
data.drop('AgeCategory',axis=1,inplace=True)

In [20]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()                                 #encode the education column
label=le.fit_transform(data['HeartDisease'])
label
data.drop('HeartDisease',axis=1)
data['HeartDisease']=label

label1=le.fit_transform(data['Smoking'])   #encode the marital_Status column
data.drop('Smoking',axis=1)
data['Smoking']=label1
data

label1=le.fit_transform(data['AlcoholDrinking'])   #encode the marital_Status column
data.drop('AlcoholDrinking',axis=1)
data['AlcoholDrinking']=label1
data

label1=le.fit_transform(data['Stroke'])   #encode the marital_Status column
data.drop('Stroke',axis=1)
data['Stroke']=label1
data

label1=le.fit_transform(data['DiffWalking'])   #encode the marital_Status column
data.drop('DiffWalking',axis=1)
data['DiffWalking']=label1
data

label1=le.fit_transform(data['Sex'])   #encode the marital_Status column
data.drop('Sex',axis=1)
data['Sex']=label1

label1=le.fit_transform(data['Race'])   #encode the marital_Status column
data.drop('Race',axis=1)
data['Race']=label1

label1=le.fit_transform(data['Diabetic'])   #encode the marital_Status column
data.drop('Diabetic',axis=1)
data['Diabetic']=label1

label1=le.fit_transform(data['PhysicalActivity'])   #encode the marital_Status column
data.drop('Diabetic',axis=1)
data['Diabetic']=label1


label1=le.fit_transform(data['GenHealth'])   #encode the marital_Status column
data.drop('GenHealth',axis=1)
data['GenHealth']=label1


label1=le.fit_transform(data['Asthma'])   #encode the marital_Status column
data.drop('Asthma',axis=1)
data['Asthma']=label1

label1=le.fit_transform(data['KidneyDisease'])   #encode the marital_Status column
data.drop('KidneyDisease',axis=1)
data['KidneyDisease']=label1

label1=le.fit_transform(data['SkinCancer'])   #encode the marital_Status column
data.drop('SkinCancer',axis=1)
data['SkinCancer']=label1

label1=le.fit_transform(data['PhysicalActivity'])   #encode the marital_Status column
data.drop('PhysicalActivity',axis=1)
data['PhysicalActivity']=label1

In [21]:
data.dtypes

HeartDisease          int32
BMI                 float64
Smoking               int32
AlcoholDrinking       int32
Stroke                int32
PhysicalHealth      float64
MentalHealth        float64
DiffWalking           int32
Sex                   int32
Race                  int32
Diabetic              int32
PhysicalActivity      int32
GenHealth             int32
SleepTime           float64
Asthma                int32
KidneyDisease         int32
SkinCancer            int32
Age                   int64
dtype: object

In [22]:
X = data.loc[:, data.columns != 'HeartDisease']     # All columns except target variable.
y = data['HeartDisease']                        # Target Variable.

In [23]:
from sklearn.preprocessing import StandardScaler

# create a StandardScaler model
scaler = StandardScaler()

# fit and transform the data
X_scaled = scaler.fit_transform(X)

In [24]:
X_scaled = pd.DataFrame(X_scaled, columns= X.columns)

In [25]:
trainX, testX, trainy, testy = train_test_split(X_scaled, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(14000, 17)
(14000,)
(6000, 17)
(6000,)


In [26]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:59<00:00,  2.06s/it]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
NearestCentroid,0.79,0.71,0.71,0.83,0.03
GaussianNB,0.85,0.67,0.67,0.87,0.04
Perceptron,0.79,0.65,0.65,0.83,0.04
QuadraticDiscriminantAnalysis,0.57,0.63,0.63,0.66,0.06
BernoulliNB,0.89,0.61,0.61,0.89,0.04
LinearDiscriminantAnalysis,0.91,0.58,0.58,0.9,0.08
DecisionTreeClassifier,0.87,0.58,0.58,0.87,0.09
ExtraTreeClassifier,0.87,0.58,0.58,0.87,0.05
LabelSpreading,0.88,0.57,0.57,0.87,20.39
LabelPropagation,0.88,0.57,0.57,0.87,14.81


### Dimensionality Reduction Algorithm

### Principal Component Analysis

In [27]:
from sklearn.decomposition import PCA

pca = PCA(n_components=5)

principalComponents = pca.fit_transform(X_scaled)

PCA_Df = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1', 'principal component 2','principal component 3', 'principal component 4','principal component 5'])

In [28]:
PCA_Df.shape

(20000, 5)

In [29]:
trainX, testX, trainy, testy = train_test_split(PCA_Df, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(14000, 5)
(14000,)
(6000, 5)
(6000,)


In [30]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:42<00:00,  1.47s/it]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
NearestCentroid,0.78,0.71,0.71,0.83,0.02
PassiveAggressiveClassifier,0.7,0.66,0.66,0.77,0.03
QuadraticDiscriminantAnalysis,0.9,0.59,0.59,0.89,0.03
GaussianNB,0.91,0.58,0.58,0.89,0.02
LinearDiscriminantAnalysis,0.91,0.57,0.57,0.9,0.04
DecisionTreeClassifier,0.86,0.57,0.57,0.87,0.12
ExtraTreeClassifier,0.87,0.56,0.56,0.87,0.03
Perceptron,0.72,0.56,0.56,0.78,0.03
LabelSpreading,0.9,0.56,0.56,0.89,12.95
LabelPropagation,0.9,0.56,0.56,0.88,8.44


### Linear Discriminant Analysis

In [31]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# create an LDA model
lda = LDA(n_components=1)

# fit and transform the data
X_reduced = lda.fit_transform(X_scaled, y)

In [32]:
trainX, testX, trainy, testy = train_test_split(X_reduced, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(14000, 1)
(14000,)
(6000, 1)
(6000,)


In [33]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:29<00:00,  1.01s/it]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
PassiveAggressiveClassifier,0.79,0.72,0.72,0.83,0.02
NearestCentroid,0.81,0.71,0.71,0.84,0.02
Perceptron,0.85,0.68,0.68,0.87,0.02
GaussianNB,0.91,0.6,0.6,0.9,0.02
QuadraticDiscriminantAnalysis,0.91,0.6,0.6,0.9,0.02
LinearDiscriminantAnalysis,0.91,0.59,0.59,0.9,0.02
ExtraTreesClassifier,0.87,0.57,0.57,0.87,1.38
DecisionTreeClassifier,0.87,0.57,0.57,0.87,0.1
RandomForestClassifier,0.87,0.57,0.57,0.87,2.35
BaggingClassifier,0.88,0.56,0.56,0.87,0.3


### Singular Value Decomposition

In [34]:
from sklearn.decomposition import TruncatedSVD

# create a SVD model
svd = TruncatedSVD(n_components=5)

# fit and transform the data
X_reduced = svd.fit_transform(X_scaled)

In [35]:
trainX, testX, trainy, testy = train_test_split(X_reduced, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(14000, 5)
(14000,)
(6000, 5)
(6000,)


In [36]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:43<00:00,  1.51s/it]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
NearestCentroid,0.79,0.71,0.71,0.83,0.03
Perceptron,0.71,0.68,0.68,0.78,0.03
PassiveAggressiveClassifier,0.69,0.66,0.66,0.76,0.03
QuadraticDiscriminantAnalysis,0.9,0.59,0.59,0.89,0.03
GaussianNB,0.91,0.59,0.59,0.89,0.03
LinearDiscriminantAnalysis,0.91,0.57,0.57,0.9,0.04
ExtraTreeClassifier,0.86,0.57,0.57,0.87,0.04
DecisionTreeClassifier,0.87,0.57,0.57,0.87,0.14
LabelSpreading,0.9,0.55,0.55,0.89,13.14
LabelPropagation,0.9,0.55,0.55,0.88,8.69


### T-SNE

In [37]:
from sklearn.manifold import TSNE
# Initialize the t-SNE model
model = TSNE(n_components=2)
X_reduced = model.fit_transform(X_scaled)

In [38]:
trainX, testX, trainy, testy = train_test_split(X_reduced, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(14000, 2)
(14000,)
(6000, 2)
(6000,)


In [39]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:43<00:00,  1.50s/it]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
NearestCentroid,0.59,0.6,0.6,0.68,0.02
ExtraTreeClassifier,0.87,0.56,0.56,0.87,0.03
DecisionTreeClassifier,0.87,0.56,0.56,0.87,0.07
BaggingClassifier,0.9,0.55,0.55,0.88,0.34
ExtraTreesClassifier,0.89,0.55,0.55,0.88,1.25
RandomForestClassifier,0.9,0.55,0.55,0.88,2.12
KNeighborsClassifier,0.91,0.54,0.54,0.89,0.3
XGBClassifier,0.91,0.52,0.52,0.88,1.14
LGBMClassifier,0.92,0.52,0.52,0.88,0.21
AdaBoostClassifier,0.92,0.51,0.51,0.88,0.52


## Dataset 5 : Credit Card Fraud Detection

Dataset Info : Dataset we are using in this case study is  named as “Credit Card dataset” and contains information of peopl2. Dataset contains 284807 records and 31 features. We are using “class” as our target variable. It is a Binary class classification problem, where value of class is 0 and 1 (0 means fraud not detected and 1 means detected). Dataset is imbalanced and consist of 31 attributes other than target attribute.

In [40]:
# Importing Dataset using Pandas.
data = pd.read_csv("E:/creditcard.csv")
data.shape  

(284807, 31)

In [41]:
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.36,-0.07,2.54,1.38,-0.34,0.46,0.24,0.1,0.36,...,-0.02,0.28,-0.11,0.07,0.13,-0.19,0.13,-0.02,149.62,0
1,0.0,1.19,0.27,0.17,0.45,0.06,-0.08,-0.08,0.09,-0.26,...,-0.23,-0.64,0.1,-0.34,0.17,0.13,-0.01,0.01,2.69,0
2,1.0,-1.36,-1.34,1.77,0.38,-0.5,1.8,0.79,0.25,-1.51,...,0.25,0.77,0.91,-0.69,-0.33,-0.14,-0.06,-0.06,378.66,0
3,1.0,-0.97,-0.19,1.79,-0.86,-0.01,1.25,0.24,0.38,-1.39,...,-0.11,0.01,-0.19,-1.18,0.65,-0.22,0.06,0.06,123.5,0
4,2.0,-1.16,0.88,1.55,0.4,-0.41,0.1,0.59,-0.27,0.82,...,-0.01,0.8,-0.14,0.14,-0.21,0.5,0.22,0.22,69.99,0


In [42]:
data = data.sample(n=25000, replace=True)

In [43]:
data.shape

(25000, 31)

### Data Pre-Processing

In [44]:
data.columns

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

In [45]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25000 entries, 175334 to 67337
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Time    25000 non-null  float64
 1   V1      25000 non-null  float64
 2   V2      25000 non-null  float64
 3   V3      25000 non-null  float64
 4   V4      25000 non-null  float64
 5   V5      25000 non-null  float64
 6   V6      25000 non-null  float64
 7   V7      25000 non-null  float64
 8   V8      25000 non-null  float64
 9   V9      25000 non-null  float64
 10  V10     25000 non-null  float64
 11  V11     25000 non-null  float64
 12  V12     25000 non-null  float64
 13  V13     25000 non-null  float64
 14  V14     25000 non-null  float64
 15  V15     25000 non-null  float64
 16  V16     25000 non-null  float64
 17  V17     25000 non-null  float64
 18  V18     25000 non-null  float64
 19  V19     25000 non-null  float64
 20  V20     25000 non-null  float64
 21  V21     25000 non-null  float6

In [46]:
data.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [47]:
data.drop('Time',axis=1,inplace=True)

In [48]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
V1,25000.0,-0.01,1.96,-37.56,-0.92,0.01,1.31,2.41
V2,25000.0,-0.01,1.72,-63.34,-0.6,0.07,0.81,12.79
V3,25000.0,-0.01,1.49,-24.96,-0.9,0.17,1.03,3.94
V4,25000.0,-0.01,1.41,-5.42,-0.85,-0.02,0.74,16.72
V5,25000.0,0.0,1.39,-35.18,-0.7,-0.06,0.61,24.66
V6,25000.0,-0.01,1.33,-17.57,-0.78,-0.28,0.39,21.55
V7,25000.0,-0.0,1.21,-18.75,-0.56,0.04,0.57,36.88
V8,25000.0,-0.0,1.18,-37.35,-0.21,0.02,0.32,11.16
V9,25000.0,0.0,1.11,-8.09,-0.65,-0.06,0.6,10.33
V10,25000.0,0.0,1.09,-14.68,-0.54,-0.1,0.47,15.24


In [49]:
X = data.loc[:, data.columns != 'Class']     # All columns except target variable.
y = data['Class']                        # Target Variable.

In [51]:
from sklearn.preprocessing import StandardScaler

# create a StandardScaler model
scaler = StandardScaler()

# fit and transform the data
X_scaled = scaler.fit_transform(X)

In [52]:
X_scaled = pd.DataFrame(X_scaled, columns= X.columns)

In [53]:
trainX, testX, trainy, testy = train_test_split(X_scaled, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(17500, 29)
(17500,)
(7500, 29)
(7500,)


In [54]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [01:30<00:00,  3.10s/it]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
GaussianNB,0.98,0.95,0.95,0.99,0.08
LGBMClassifier,1.0,0.91,0.91,1.0,0.5
XGBClassifier,1.0,0.91,0.91,1.0,2.33
ExtraTreeClassifier,1.0,0.91,0.91,1.0,0.06
QuadraticDiscriminantAnalysis,0.99,0.91,0.91,1.0,0.09
BernoulliNB,1.0,0.86,0.86,1.0,0.08
NearestCentroid,1.0,0.86,0.86,1.0,0.07
LinearDiscriminantAnalysis,1.0,0.86,0.86,1.0,0.18
KNeighborsClassifier,1.0,0.86,0.86,1.0,0.86
Perceptron,1.0,0.86,0.86,1.0,0.08


### Dimensionality Reduction Algorithms

### Principal Component Analysis

In [55]:
from sklearn.decomposition import PCA

pca = PCA(n_components=5)

principalComponents = pca.fit_transform(X_scaled)

PCA_Df = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1', 'principal component 2','principal component 3', 'principal component 4','principal component 5'])

In [56]:
trainX, testX, trainy, testy = train_test_split(PCA_Df, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(17500, 5)
(17500,)
(7500, 5)
(7500,)


In [57]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:44<00:00,  1.54s/it]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LGBMClassifier,1.0,0.86,0.86,1.0,0.25
QuadraticDiscriminantAnalysis,0.99,0.86,0.86,0.99,0.02
GaussianNB,0.98,0.86,0.86,0.99,0.03
LinearDiscriminantAnalysis,0.99,0.82,0.82,1.0,0.04
NearestCentroid,0.99,0.81,0.81,0.99,0.03
ExtraTreesClassifier,1.0,0.77,0.77,1.0,0.74
XGBClassifier,1.0,0.77,0.77,1.0,0.99
LabelPropagation,1.0,0.77,0.77,1.0,14.77
LabelSpreading,1.0,0.77,0.77,1.0,20.97
AdaBoostClassifier,1.0,0.77,0.77,1.0,1.5


### Linear Discriminant Analysis

In [58]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# create an LDA model
lda = LDA(n_components=1)

# fit and transform the data
X_reduced = lda.fit_transform(X_scaled, y)

In [59]:
trainX, testX, trainy, testy = train_test_split(PCA_Df, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(17500, 5)
(17500,)
(7500, 5)
(7500,)


In [60]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:44<00:00,  1.53s/it]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LGBMClassifier,1.0,0.86,0.86,1.0,0.26
QuadraticDiscriminantAnalysis,0.99,0.86,0.86,0.99,0.03
GaussianNB,0.98,0.86,0.86,0.99,0.02
LinearDiscriminantAnalysis,0.99,0.82,0.82,1.0,0.04
NearestCentroid,0.99,0.81,0.81,0.99,0.02
ExtraTreesClassifier,1.0,0.77,0.77,1.0,0.74
XGBClassifier,1.0,0.77,0.77,1.0,1.02
LabelPropagation,1.0,0.77,0.77,1.0,15.1
LabelSpreading,1.0,0.77,0.77,1.0,20.89
AdaBoostClassifier,1.0,0.77,0.77,1.0,1.09


### Singular Decomposition Analysis

In [61]:
from sklearn.decomposition import TruncatedSVD

# create a SVD model
svd = TruncatedSVD(n_components=5)

# fit and transform the data
X_reduced = svd.fit_transform(X_scaled)

In [62]:
trainX, testX, trainy, testy = train_test_split(PCA_Df, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(17500, 5)
(17500,)
(7500, 5)
(7500,)


In [64]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:44<00:00,  1.52s/it]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LGBMClassifier,1.0,0.86,0.86,1.0,0.31
QuadraticDiscriminantAnalysis,0.99,0.86,0.86,0.99,0.03
GaussianNB,0.98,0.86,0.86,0.99,0.02
LinearDiscriminantAnalysis,0.99,0.82,0.82,1.0,0.04
NearestCentroid,0.99,0.81,0.81,0.99,0.02
ExtraTreesClassifier,1.0,0.77,0.77,1.0,0.71
XGBClassifier,1.0,0.77,0.77,1.0,1.02
LabelPropagation,1.0,0.77,0.77,1.0,15.08
LabelSpreading,1.0,0.77,0.77,1.0,20.66
AdaBoostClassifier,1.0,0.77,0.77,1.0,1.14


### T-SNE

In [65]:
from sklearn.manifold import TSNE
# Initialize the t-SNE model
model = TSNE(n_components=2)
X_reduced = model.fit_transform(X_scaled)

In [66]:
trainX, testX, trainy, testy = train_test_split(PCA_Df, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(17500, 5)
(17500,)
(7500, 5)
(7500,)


In [67]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:47<00:00,  1.64s/it]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LGBMClassifier,1.0,0.86,0.86,1.0,0.26
QuadraticDiscriminantAnalysis,0.99,0.86,0.86,0.99,0.04
GaussianNB,0.98,0.86,0.86,0.99,0.02
LinearDiscriminantAnalysis,0.99,0.82,0.82,1.0,0.07
NearestCentroid,0.99,0.81,0.81,0.99,0.04
ExtraTreesClassifier,1.0,0.77,0.77,1.0,0.73
XGBClassifier,1.0,0.77,0.77,1.0,1.2
LabelPropagation,1.0,0.77,0.77,1.0,14.06
LabelSpreading,1.0,0.77,0.77,1.0,22.76
AdaBoostClassifier,1.0,0.77,0.77,1.0,1.24


## Dataset 6 : Customer Churn Dataset

Dataset we are using in this case study contain 5000 customer records and 21 features. We are  using “class” as our target variable. It is a Binary class classification problem where value of class 
is 0 and 1 (0 means do not churn and 1 means churn). Data set contain other features as well such 
as state, phone number et

In [68]:
# Importing Dataset using Pandas.
data = pd.read_csv("E:/Churn_data_set.csv")
data.shape  

(5000, 21)

In [69]:
data.head()

Unnamed: 0,state,account_length,area_code,phone_number,international_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,...,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls,class
0,16,128,415,2845,0,1,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0
1,35,107,415,2301,0,1,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0
2,31,137,415,1616,0,0,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,0
3,35,84,408,2510,1,0,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,0
4,36,75,415,155,1,0,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,0


In [71]:
data.isna().sum()

state                            0
account_length                   0
area_code                        0
phone_number                     0
international_plan               0
voice_mail_plan                  0
number_vmail_messages            0
total_day_minutes                0
total_day_calls                  0
total_day_charge                 0
total_eve_minutes                0
total_eve_calls                  0
total_eve_charge                 0
total_night_minutes              0
total_night_calls                0
total_night_charge               0
total_intl_minutes               0
total_intl_calls                 0
total_intl_charge                0
number_customer_service_calls    0
class                            0
dtype: int64

In [73]:
X = data.loc[:, data.columns != 'class']     # All columns except target variable.
y = data['class']                        # Target Variable.

In [74]:
from sklearn.preprocessing import StandardScaler

# create a StandardScaler model
scaler = StandardScaler()

# fit and transform the data
X_scaled = scaler.fit_transform(X)

In [75]:
X_scaled = pd.DataFrame(X_scaled, columns= X.columns)

In [76]:
trainX, testX, trainy, testy = train_test_split(X_scaled, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(3500, 20)
(3500,)
(1500, 20)
(1500,)


In [77]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:08<00:00,  3.23it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
XGBClassifier,0.96,0.87,0.87,0.95,0.52
LGBMClassifier,0.95,0.87,0.87,0.95,0.25
BaggingClassifier,0.95,0.86,0.86,0.95,0.44
RandomForestClassifier,0.95,0.85,0.85,0.95,0.9
DecisionTreeClassifier,0.91,0.83,0.83,0.91,0.07
ExtraTreesClassifier,0.93,0.77,0.77,0.92,0.63
QuadraticDiscriminantAnalysis,0.88,0.75,0.75,0.88,0.03
SVC,0.92,0.75,0.75,0.91,0.53
NearestCentroid,0.74,0.74,0.74,0.77,0.02
GaussianNB,0.87,0.74,0.74,0.87,0.02


### Dimensionality Reduction Algorithm

### Principal Component Analysis

In [78]:
from sklearn.decomposition import PCA

pca = PCA(n_components=5)

principalComponents = pca.fit_transform(X_scaled)

PCA_Df = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1', 'principal component 2','principal component 3', 'principal component 4','principal component 5'])

In [79]:
trainX, testX, trainy, testy = train_test_split(X_scaled, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(3500, 20)
(3500,)
(1500, 20)
(1500,)


In [80]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:09<00:00,  3.21it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
XGBClassifier,0.96,0.87,0.87,0.95,0.85
LGBMClassifier,0.95,0.87,0.87,0.95,0.5
BaggingClassifier,0.95,0.86,0.86,0.95,0.37
RandomForestClassifier,0.95,0.85,0.85,0.95,1.0
DecisionTreeClassifier,0.91,0.83,0.83,0.91,0.06
ExtraTreesClassifier,0.93,0.77,0.77,0.92,0.56
QuadraticDiscriminantAnalysis,0.88,0.75,0.75,0.88,0.03
SVC,0.92,0.75,0.75,0.91,0.66
NearestCentroid,0.74,0.74,0.74,0.77,0.02
GaussianNB,0.87,0.74,0.74,0.87,0.02


### Linear Discriminant Analysis

In [81]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# create an LDA model
lda = LDA(n_components=1)

# fit and transform the data
X_reduced = lda.fit_transform(X_scaled, y)

In [82]:
trainX, testX, trainy, testy = train_test_split(X_scaled, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(3500, 20)
(3500,)
(1500, 20)
(1500,)


In [83]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:09<00:00,  3.10it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
XGBClassifier,0.96,0.87,0.87,0.95,1.04
LGBMClassifier,0.95,0.87,0.87,0.95,0.8
BaggingClassifier,0.95,0.86,0.86,0.95,0.38
RandomForestClassifier,0.95,0.85,0.85,0.95,0.93
DecisionTreeClassifier,0.91,0.83,0.83,0.91,0.08
ExtraTreesClassifier,0.93,0.77,0.77,0.92,0.5
QuadraticDiscriminantAnalysis,0.88,0.75,0.75,0.88,0.03
SVC,0.92,0.75,0.75,0.91,0.66
NearestCentroid,0.74,0.74,0.74,0.77,0.02
GaussianNB,0.87,0.74,0.74,0.87,0.02


### Singular Vector Decomposition

In [84]:
from sklearn.decomposition import TruncatedSVD

# create a SVD model
svd = TruncatedSVD(n_components=5)

# fit and transform the data
X_reduced = svd.fit_transform(X_scaled)

In [85]:
trainX, testX, trainy, testy = train_test_split(X_scaled, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(3500, 20)
(3500,)
(1500, 20)
(1500,)


In [86]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:08<00:00,  3.60it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
XGBClassifier,0.96,0.87,0.87,0.95,0.53
LGBMClassifier,0.95,0.87,0.87,0.95,0.24
BaggingClassifier,0.95,0.86,0.86,0.95,0.39
RandomForestClassifier,0.95,0.85,0.85,0.95,0.9
DecisionTreeClassifier,0.91,0.83,0.83,0.91,0.07
ExtraTreesClassifier,0.93,0.77,0.77,0.92,0.6
QuadraticDiscriminantAnalysis,0.88,0.75,0.75,0.88,0.03
SVC,0.92,0.75,0.75,0.91,0.54
NearestCentroid,0.74,0.74,0.74,0.77,0.02
GaussianNB,0.87,0.74,0.74,0.87,0.03


### TSNE

In [87]:
from sklearn.manifold import TSNE
# Initialize the t-SNE model
model = TSNE(n_components=2)
X_reduced = model.fit_transform(X_scaled)

In [88]:
trainX, testX, trainy, testy = train_test_split(X_scaled, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(3500, 20)
(3500,)
(1500, 20)
(1500,)


In [89]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:08<00:00,  3.60it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
XGBClassifier,0.96,0.87,0.87,0.95,0.49
LGBMClassifier,0.95,0.87,0.87,0.95,0.24
BaggingClassifier,0.95,0.86,0.86,0.95,0.38
RandomForestClassifier,0.95,0.85,0.85,0.95,0.88
DecisionTreeClassifier,0.91,0.83,0.83,0.91,0.08
ExtraTreesClassifier,0.93,0.77,0.77,0.92,0.51
QuadraticDiscriminantAnalysis,0.88,0.75,0.75,0.88,0.02
SVC,0.92,0.75,0.75,0.91,0.53
NearestCentroid,0.74,0.74,0.74,0.77,0.02
GaussianNB,0.87,0.74,0.74,0.87,0.02


## Dataset 7 : Brest Cancer Wisconsin Dataset

Data Info : This datset is take from UCI Machine learning repository. Its a binary class classififcation dataset.

Link for data : https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)

In [2]:
# Importing Dataset using Pandas.
data = pd.read_csv("E:\DOWNLOADS\Brest_Cancer_data.csv.csv")
data.shape

(569, 33)

In [3]:
data.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [4]:
# Checking Null values in all features.
data.isnull().sum()

id                           0
diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_worst      0
Unnamed:

In [5]:
# Dropping unimpactful colums
data.drop('Unnamed: 32',axis=1,inplace=True)
data.drop('id',axis=1,inplace=True)

In [6]:
data.dtypes

diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst

In [7]:
data['diagnosis'] = data['diagnosis'].map({'B': 0, 'M': 1})
data['diagnosis'] = pd.to_numeric(data['diagnosis'])

In [8]:
X = data.loc[:, data.columns != 'diagnosis']
y = data[['diagnosis']]

In [9]:
from sklearn.preprocessing import StandardScaler

# create a StandardScaler model
scaler = StandardScaler()

# fit and transform the data
X_scaled = scaler.fit_transform(X)

In [10]:
X_scaled = pd.DataFrame(X_scaled, columns= X.columns)

In [11]:
X_scaled.head()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,2.217515,2.255747,...,1.88669,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015
1,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,0.001392,-0.868652,...,1.805927,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.24389,0.28119
2,1.579888,0.456187,1.566503,1.558884,0.94221,1.052926,1.363478,2.037231,0.939685,-0.398008,...,1.51187,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955,1.152255,0.201391
3,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,2.867383,4.910919,...,-0.281464,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.93501
4,1.750297,-1.151816,1.776573,1.826229,0.280372,0.53934,1.371011,1.428493,-0.00956,-0.56245,...,1.298575,-1.46677,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.3971


In [12]:
from sklearn.model_selection import train_test_split
trainX, testX, trainy, testy = train_test_split(X_scaled, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(398, 30)
(398, 1)
(171, 30)
(171, 1)


In [13]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:02<00:00, 12.69it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
SVC,0.98,0.98,0.98,0.98,0.02
KNeighborsClassifier,0.98,0.98,0.98,0.98,0.22
PassiveAggressiveClassifier,0.97,0.97,0.97,0.97,0.02
LGBMClassifier,0.97,0.97,0.97,0.97,0.16
LogisticRegression,0.97,0.97,0.97,0.97,0.03
SGDClassifier,0.96,0.97,0.97,0.97,0.02
BaggingClassifier,0.96,0.97,0.97,0.96,0.08
XGBClassifier,0.96,0.97,0.97,0.96,0.48
ExtraTreesClassifier,0.96,0.96,0.96,0.96,0.25
CalibratedClassifierCV,0.97,0.96,0.96,0.97,0.08


### Dimensionality Reduction Algorithm

### Principal Component Analysis

In [14]:
from sklearn.decomposition import PCA

pca = PCA(n_components=5)

principalComponents = pca.fit_transform(X_scaled)

PCA_Df = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1', 'principal component 2','principal component 3', 'principal component 4','principal component 5'])

In [15]:
trainX, testX, trainy, testy = train_test_split(PCA_Df, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(398, 5)
(398, 1)
(171, 5)
(171, 1)


In [16]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:01<00:00, 18.98it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
NearestCentroid,0.98,0.97,0.97,0.98,0.02
SGDClassifier,0.97,0.97,0.97,0.97,0.02
LinearSVC,0.96,0.96,0.96,0.96,0.02
PassiveAggressiveClassifier,0.96,0.96,0.96,0.96,0.02
CalibratedClassifierCV,0.96,0.96,0.96,0.96,0.06
RandomForestClassifier,0.96,0.96,0.96,0.96,0.31
ExtraTreesClassifier,0.96,0.96,0.96,0.96,0.2
LabelPropagation,0.96,0.96,0.96,0.96,0.03
LabelSpreading,0.96,0.96,0.96,0.96,0.03
Perceptron,0.96,0.96,0.96,0.96,0.02


### Linear Decomposition Analysis

In [25]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# create an LDA model
lda = LDA(n_components=1)

# fit and transform the data
X_reduced = lda.fit_transform(X_scaled, y)

In [26]:
trainX, testX, trainy, testy = train_test_split(X_reduced, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(398, 1)
(398, 1)
(171, 1)
(171, 1)


In [27]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:01<00:00, 18.95it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LinearSVC,0.98,0.98,0.98,0.98,0.03
LabelPropagation,0.98,0.98,0.98,0.98,0.03
SGDClassifier,0.98,0.98,0.98,0.98,0.01
LabelSpreading,0.98,0.98,0.98,0.98,0.03
KNeighborsClassifier,0.98,0.98,0.98,0.98,0.04
BernoulliNB,0.98,0.98,0.98,0.98,0.01
Perceptron,0.98,0.98,0.98,0.98,0.02
XGBClassifier,0.98,0.97,0.97,0.98,0.08
RandomForestClassifier,0.98,0.97,0.97,0.98,0.3
BaggingClassifier,0.98,0.97,0.97,0.98,0.06


### Singular Vector Decomposition

In [28]:
from sklearn.decomposition import TruncatedSVD

# create a SVD model
svd = TruncatedSVD(n_components=5)

# fit and transform the data
X_reduced = svd.fit_transform(X_scaled)

In [29]:
trainX, testX, trainy, testy = train_test_split(X_reduced, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(398, 5)
(398, 1)
(171, 5)
(171, 1)


In [30]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:01<00:00, 21.47it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
NearestCentroid,0.98,0.97,0.97,0.98,0.02
SGDClassifier,0.97,0.97,0.97,0.97,0.02
LinearSVC,0.96,0.96,0.96,0.96,0.02
PassiveAggressiveClassifier,0.96,0.96,0.96,0.96,0.01
CalibratedClassifierCV,0.96,0.96,0.96,0.96,0.06
RandomForestClassifier,0.96,0.96,0.96,0.96,0.26
ExtraTreesClassifier,0.96,0.96,0.96,0.96,0.18
LabelPropagation,0.96,0.96,0.96,0.96,0.02
LabelSpreading,0.96,0.96,0.96,0.96,0.03
Perceptron,0.96,0.96,0.96,0.96,0.02


### T-SNE

In [31]:
from sklearn.manifold import TSNE
# Initialize the t-SNE model
model = TSNE(n_components=2)
X_reduced = model.fit_transform(X_scaled)

In [32]:
trainX, testX, trainy, testy = train_test_split(X_reduced, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(398, 2)
(398, 1)
(171, 2)
(171, 1)


In [33]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [00:01<00:00, 23.41it/s]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ExtraTreesClassifier,0.96,0.96,0.96,0.96,0.16
KNeighborsClassifier,0.96,0.96,0.96,0.96,0.03
RandomForestClassifier,0.96,0.95,0.95,0.96,0.2
SVC,0.96,0.95,0.95,0.96,0.03
XGBClassifier,0.95,0.95,0.95,0.95,0.13
LabelSpreading,0.95,0.95,0.95,0.95,0.03
BaggingClassifier,0.95,0.95,0.95,0.95,0.04
LabelPropagation,0.95,0.95,0.95,0.95,0.03
LinearSVC,0.96,0.95,0.95,0.96,0.01
CalibratedClassifierCV,0.96,0.95,0.95,0.96,0.05


## Dataset 8: Accident Severity Prediction

Dataset we are using in this project is basically taken from famous dataset. Name of dataset is 1.6 million UK Accidents, this is a countrywide car accident dataset, which covers both urban and rural areas of the UK. Each accident record is described by a wide range of data attributes, including the accident location, weather, time, Date etc. This dataset is officially collected by UK Police department and contain mostly severe accidents in which police was also involved. Dataset is officially present on Government of UK website. Whole datasets consist of three CSV files, each CSV file has around 4 lac and 70 thousand records and 33 different features. All 3 CSV files contain same columns and same number of records. First CSV consist of  record of 2005 to 2007, whereas second CSV contain records of accidents happened in time of 2009 to 2011 and third contain data of year 2012 to 2014. Raw data need to be pre-processed, including removing variables with too many missing values, filling variables, dropping unique features, and encoding variables.

In [2]:
data = pd.read_csv("E:/accidents_2012_to_2014.csv")
data.shape

(464697, 33)

In [3]:
data = data.sample(n=30000, replace=True)

In [4]:
data.shape

(30000, 33)

In [5]:
data.isnull().sum()

Accident_Index                                     0
Location_Easting_OSGR                              0
Location_Northing_OSGR                             0
Longitude                                          0
Latitude                                           0
Police_Force                                       0
Accident_Severity                                  0
Number_of_Vehicles                                 0
Number_of_Casualties                               0
Date                                               0
Day_of_Week                                        0
Time                                               0
Local_Authority_(District)                         0
Local_Authority_(Highway)                          0
1st_Road_Class                                     0
1st_Road_Number                                    0
Road_Type                                          0
Speed_limit                                        0
Junction_Detail                               

In [6]:
data.drop('Junction_Detail',axis=1,inplace=True)
data.drop('Junction_Control',axis=1,inplace=True)
filtered_data = data.dropna()

In [7]:
filtered_data.shape

(28098, 31)

In [8]:
filtered_data.drop('Accident_Index',axis=1,inplace=True)
filtered_data.drop('LSOA_of_Accident_Location',axis=1,inplace=True)
filtered_data.drop('Local_Authority_(Highway)',axis=1,inplace=True)
filtered_data.drop('Date',axis=1,inplace=True)
filtered_data.drop('Time',axis=1,inplace=True)

In [9]:
df_onehot = pd.get_dummies(filtered_data)
df_onehot.dtypes

Location_Easting_OSGR                                            int64
Location_Northing_OSGR                                           int64
Longitude                                                      float64
Latitude                                                       float64
Police_Force                                                     int64
                                                                ...   
Carriageway_Hazards_None                                         uint8
Carriageway_Hazards_Other object in carriageway                  uint8
Carriageway_Hazards_Pedestrian in carriageway (not injured)      uint8
Did_Police_Officer_Attend_Scene_of_Accident_No                   uint8
Did_Police_Officer_Attend_Scene_of_Accident_Yes                  uint8
Length: 67, dtype: object

In [10]:
X = df_onehot.loc[:, df_onehot.columns != 'Accident_Severity']
y = df_onehot[['Accident_Severity']]

In [11]:
from sklearn.preprocessing import StandardScaler

# create a StandardScaler model
scaler = StandardScaler()

# fit and transform the data
X_scaled = scaler.fit_transform(X)

In [12]:
X_scaled = pd.DataFrame(X_scaled, columns= X.columns)

In [13]:
trainX, testX, trainy, testy = train_test_split(X_scaled, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(19668, 66)
(19668, 1)
(8430, 66)
(8430, 1)


In [15]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [12:25<00:00, 25.70s/it]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
NearestCentroid,0.49,0.47,,0.58,0.36
ExtraTreeClassifier,0.76,0.39,,0.76,0.21
LabelSpreading,0.76,0.39,,0.76,155.96
LabelPropagation,0.75,0.39,,0.76,116.69
DecisionTreeClassifier,0.76,0.39,,0.76,0.92
BaggingClassifier,0.82,0.38,,0.79,5.22
ExtraTreesClassifier,0.85,0.37,,0.8,10.27
RandomForestClassifier,0.85,0.37,,0.8,8.08
BernoulliNB,0.83,0.37,,0.77,0.21
LGBMClassifier,0.84,0.35,,0.78,3.85


#### Dimensionality Reduction Algorithm

In [14]:
from sklearn.decomposition import PCA

pca = PCA(n_components=5)

principalComponents = pca.fit_transform(X_scaled)

PCA_Df = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1', 'principal component 2','principal component 3', 'principal component 4','principal component 5'])

In [15]:
trainX, testX, trainy, testy = train_test_split(PCA_Df, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(19668, 5)
(19668, 1)
(8430, 5)
(8430, 1)


In [16]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [02:50<00:00,  5.88s/it]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
NearestCentroid,0.45,0.47,,0.55,0.06
DecisionTreeClassifier,0.76,0.41,,0.76,0.32
ExtraTreeClassifier,0.76,0.41,,0.76,0.07
ExtraTreesClassifier,0.84,0.39,,0.8,3.48
RandomForestClassifier,0.85,0.39,,0.8,8.66
BaggingClassifier,0.82,0.39,,0.79,2.2
LabelPropagation,0.82,0.36,,0.78,23.69
LabelSpreading,0.83,0.36,,0.78,37.78
LGBMClassifier,0.85,0.34,,0.78,0.98
KNeighborsClassifier,0.82,0.34,,0.78,0.71


In [17]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# create an LDA model
lda = LDA(n_components=1)

# fit and transform the data
X_reduced = lda.fit_transform(X_scaled, y)

In [18]:
trainX, testX, trainy, testy = train_test_split(X_scaled, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(19668, 66)
(19668, 1)
(8430, 66)
(8430, 1)


In [19]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [07:35<00:00, 15.71s/it]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
NearestCentroid,0.52,0.48,,0.61,0.14
ExtraTreeClassifier,0.76,0.42,,0.77,0.22
LabelSpreading,0.76,0.42,,0.76,69.85
DecisionTreeClassifier,0.75,0.42,,0.76,0.73
LabelPropagation,0.76,0.42,,0.76,53.88
BaggingClassifier,0.83,0.41,,0.8,4.15
ExtraTreesClassifier,0.85,0.4,,0.8,7.87
RandomForestClassifier,0.86,0.39,,0.8,7.83
BernoulliNB,0.83,0.38,,0.77,0.19
PassiveAggressiveClassifier,0.78,0.37,,0.76,0.31


In [20]:
from sklearn.manifold import TSNE
# Initialize the t-SNE model
model = TSNE(n_components=2)
X_reduced = model.fit_transform(X_scaled)

In [21]:
trainX, testX, trainy, testy = train_test_split(X_scaled, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(19668, 66)
(19668, 1)
(8430, 66)
(8430, 1)


In [22]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [06:35<00:00, 13.65s/it]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
NearestCentroid,0.52,0.48,,0.61,0.13
ExtraTreeClassifier,0.76,0.42,,0.77,0.56
LabelSpreading,0.76,0.42,,0.76,56.6
DecisionTreeClassifier,0.75,0.42,,0.76,1.7
LabelPropagation,0.76,0.42,,0.76,50.97
BaggingClassifier,0.83,0.41,,0.8,3.36
ExtraTreesClassifier,0.85,0.4,,0.8,12.0
RandomForestClassifier,0.86,0.39,,0.8,5.03
BernoulliNB,0.83,0.38,,0.77,0.16
PassiveAggressiveClassifier,0.78,0.37,,0.76,0.23


In [23]:
from sklearn.decomposition import TruncatedSVD

# create a SVD model
svd = TruncatedSVD(n_components=5)

# fit and transform the data
X_reduced = svd.fit_transform(X_scaled)

In [24]:
trainX, testX, trainy, testy = train_test_split(X_scaled, y, test_size=0.3, random_state=2)
print(trainX.shape)
print(trainy.shape)
print(testX.shape)
print(testy.shape)

(19668, 66)
(19668, 1)
(8430, 66)
(8430, 1)


In [25]:
from lazypredict.Supervised import LazyClassifier
clf = LazyClassifier(predictions=True)
models, predictions = clf.fit(trainX, testX, trainy, testy)
models

100%|██████████| 29/29 [06:49<00:00, 14.13s/it]


Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
NearestCentroid,0.52,0.48,,0.61,0.12
ExtraTreeClassifier,0.76,0.42,,0.77,0.13
LabelSpreading,0.76,0.42,,0.76,60.05
DecisionTreeClassifier,0.75,0.42,,0.76,0.46
LabelPropagation,0.76,0.42,,0.76,49.52
BaggingClassifier,0.83,0.41,,0.8,3.32
ExtraTreesClassifier,0.85,0.4,,0.8,5.65
RandomForestClassifier,0.86,0.39,,0.8,7.15
BernoulliNB,0.83,0.38,,0.77,0.18
PassiveAggressiveClassifier,0.78,0.37,,0.76,0.27


## Conclusion

For Classification

we have noticed that after implementing dimensionality reduction techniques, performance of the models increased but accuracy of some models decrease after doing dimensionality reduction using isomap embedding and T-SNE

For some dataset, where dimensions are not too high, performance of models decrease after reducing dimensions. This is because data is losing important patterns in the dataset because dimension is already very low.

Computational time of the model reduce after reducing the dimensions, as the number of features to learn reduce for the model.

PCA and LDA are performing better as compare to the other dimensionality reduction technique. The fact that LDA is a supervised technique, which is also one of its disadvantages, is the main reason it outperforms the other techniques. However, since our attention is on classification, we would need a labelled dataset, and LDA makes use of the labels in conjunction with the dataset to improve class separability.