# Heart Disease training using All Models

This project consist of Heart Disease Analysis using all the Supervised Learning Algorithm and using PCA with hyperparameter tuning using Cross Validation.

Models used here -

#### Logistic Regression/SVM/Decision Tree/Random Forest/K Nearest Neighbors using Grid Search and Randomized Search Cross Validation

#### STEPS INVOLVED -
- Importing Libraries and Dataset
- Checking on Outilers using Z-score and removing all outliers with Z-score >3 or <-3
- One Hot encoding and Label encoding using pd.get_dummies
- Standardization of dataset using StandardScaler from sklearn.preprocessing
- train-test split
- Training on different models like Logistic Regression/SVM/Decision Tree/Random Forest/K Nearest Neighbors using Grid Search and Randomized Search Cross Validation
- Dimensionality Reduction by
-- 95% variance retention
-- 4-D reduction
- Performing training on Reduced Dataset
- Checking for scores at each scores

#### IMPORTING LIBRARIES

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

  import pandas.util.testing as tm


#### IMPORTING DATASETS

In [2]:
heart_df = pd.read_csv('heart.csv')

In [3]:
heart_df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [4]:
heart_df.shape

(918, 12)

In [5]:
heart_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


In [6]:
print(heart_df.Sex.value_counts(), "\n")
print(heart_df.ChestPainType.value_counts(), "\n")
print(heart_df.RestingECG.value_counts(), "\n")
print(heart_df.ExerciseAngina.value_counts(), "\n")
print(heart_df.ST_Slope.value_counts(), "\n")

M    725
F    193
Name: Sex, dtype: int64 

ASY    496
NAP    203
ATA    173
TA      46
Name: ChestPainType, dtype: int64 

Normal    552
LVH       188
ST        178
Name: RestingECG, dtype: int64 

N    547
Y    371
Name: ExerciseAngina, dtype: int64 

Flat    460
Up      395
Down     63
Name: ST_Slope, dtype: int64 



In [7]:
heart_df.columns

Index(['Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS',
       'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope',
       'HeartDisease'],
      dtype='object')

In [8]:
heart_df.describe()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
count,918.0,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,132.396514,198.799564,0.233115,136.809368,0.887364,0.553377
std,9.432617,18.514154,109.384145,0.423046,25.460334,1.06657,0.497414
min,28.0,0.0,0.0,0.0,60.0,-2.6,0.0
25%,47.0,120.0,173.25,0.0,120.0,0.0,0.0
50%,54.0,130.0,223.0,0.0,138.0,0.6,1.0
75%,60.0,140.0,267.0,0.0,156.0,1.5,1.0
max,77.0,200.0,603.0,1.0,202.0,6.2,1.0


## DATA PREPROCESSING

In [9]:
X = heart_df.iloc[:, :-1]
y = heart_df.iloc[:, -1]
X.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up


In [10]:
y.head()

0    0
1    1
2    0
3    1
4    0
Name: HeartDisease, dtype: int64

In [41]:
y.value_counts()

1    508
0    410
Name: HeartDisease, dtype: int64

Dividing columns into numerical and categorical cols for making transformations easier

In [11]:
numeric_cols = heart_df.select_dtypes(include=np.number).columns.to_list()
categorical_cols = heart_df.select_dtypes('object').columns.to_list()

print(numeric_cols)
print(categorical_cols)

['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak', 'HeartDisease']
['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']


In [12]:
y.nunique()

2

In [13]:
for cols in numeric_cols:
    if heart_df[cols].nunique() < 15:
        numeric_cols.remove(cols)
        print(cols)

FastingBS
HeartDisease


In [14]:
numeric_cols

['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']

# Z-score calculation

In [15]:
for cols in numeric_cols:
    col_name = cols + '_zscore'
    heart_df[col_name] = (heart_df[cols] - heart_df[cols].mean())/heart_df[cols].std()
    
    heart_df[col_name] = (abs(heart_df[col_name])>3).astype(int)

In [16]:
heart_df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease,Age_zscore,RestingBP_zscore,Cholesterol_zscore,MaxHR_zscore,Oldpeak_zscore
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0,0,0,0,0,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1,0,0,0,0,0
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0,0,0,0,0,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1,0,0,0,0,0
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0,0,0,0,0,0


In [17]:
for cols in numeric_cols:
    col = cols + '_zscore'
    total_outliers = heart_df[col].sum()
    print(col, total_outliers, '\n')

Age_zscore 0 

RestingBP_zscore 8 

Cholesterol_zscore 3 

MaxHR_zscore 1 

Oldpeak_zscore 7 



#### Dropping rows with Z-score > 3 or <-3

In [18]:
for cols in numeric_cols:
    col = cols + '_zscore'
    heart_df = heart_df[heart_df[col] == 0]
    
heart_df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease,Age_zscore,RestingBP_zscore,Cholesterol_zscore,MaxHR_zscore,Oldpeak_zscore
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0,0,0,0,0,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1,0,0,0,0,0
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0,0,0,0,0,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1,0,0,0,0,0
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0,0,0,0,0,0


In [19]:
heart_df.shape

(899, 17)

In [20]:
heart_df.drop([cols + '_zscore' for cols in numeric_cols], axis=1, inplace=True)

In [21]:
heart_df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [22]:
heart_df.shape

(899, 12)

### Encoding of categorical data

In [23]:
heart_df = pd.concat([heart_df, pd.get_dummies(heart_df[categorical_cols], drop_first=True)], axis=1)

In [24]:
heart_df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,...,HeartDisease,Sex_M,ChestPainType_ATA,ChestPainType_NAP,ChestPainType_TA,RestingECG_Normal,RestingECG_ST,ExerciseAngina_Y,ST_Slope_Flat,ST_Slope_Up
0,40,M,ATA,140,289,0,Normal,172,N,0.0,...,0,1,1,0,0,1,0,0,0,1
1,49,F,NAP,160,180,0,Normal,156,N,1.0,...,1,0,0,1,0,1,0,0,1,0
2,37,M,ATA,130,283,0,ST,98,N,0.0,...,0,1,1,0,0,0,1,0,0,1
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,...,1,0,0,0,0,1,0,1,1,0
4,54,M,NAP,150,195,0,Normal,122,N,0.0,...,0,1,0,1,0,1,0,0,0,1


In [25]:
heart_df.drop(categorical_cols, axis=1, inplace=True)

In [26]:
heart_df.head()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease,Sex_M,ChestPainType_ATA,ChestPainType_NAP,ChestPainType_TA,RestingECG_Normal,RestingECG_ST,ExerciseAngina_Y,ST_Slope_Flat,ST_Slope_Up
0,40,140,289,0,172,0.0,0,1,1,0,0,1,0,0,0,1
1,49,160,180,0,156,1.0,1,0,0,1,0,1,0,0,1,0
2,37,130,283,0,98,0.0,0,1,1,0,0,0,1,0,0,1
3,48,138,214,0,108,1.5,1,0,0,0,0,1,0,1,1,0
4,54,150,195,0,122,0.0,0,1,0,1,0,1,0,0,0,1


In [27]:
heart_df.shape

(899, 16)

In [28]:
heart_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 899 entries, 0 to 917
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Age                899 non-null    int64  
 1   RestingBP          899 non-null    int64  
 2   Cholesterol        899 non-null    int64  
 3   FastingBS          899 non-null    int64  
 4   MaxHR              899 non-null    int64  
 5   Oldpeak            899 non-null    float64
 6   HeartDisease       899 non-null    int64  
 7   Sex_M              899 non-null    uint8  
 8   ChestPainType_ATA  899 non-null    uint8  
 9   ChestPainType_NAP  899 non-null    uint8  
 10  ChestPainType_TA   899 non-null    uint8  
 11  RestingECG_Normal  899 non-null    uint8  
 12  RestingECG_ST      899 non-null    uint8  
 13  ExerciseAngina_Y   899 non-null    uint8  
 14  ST_Slope_Flat      899 non-null    uint8  
 15  ST_Slope_Up        899 non-null    uint8  
dtypes: float64(1), int64(6), u

#### Making target_vector (y) from heart_df last column

In [29]:
target_vector = heart_df.iloc[:, -1]

In [42]:
target_vector.shape

(899,)

In [43]:
target_vector.value_counts()

0    506
1    393
Name: ST_Slope_Up, dtype: int64

In [30]:
heart_df.drop('HeartDisease', axis=1, inplace=True)

In [31]:
heart_df.shape

(899, 15)

## Scaling the dataset using StandardScaler from sklearn.preprocessing

In [32]:
from sklearn.preprocessing import StandardScaler

In [33]:
scaler = StandardScaler()

In [34]:
heart_df = scaler.fit_transform(heart_df)

In [35]:
heart_df

array([[-1.42815446,  0.46590022,  0.84963584, ..., -0.8229452 ,
        -0.99888827,  1.13469459],
       [-0.47585532,  1.63471366, -0.16812204, ..., -0.8229452 ,
         1.00111297, -0.88129441],
       [-1.7455875 , -0.1185065 ,  0.79361247, ..., -0.8229452 ,
        -0.99888827,  1.13469459],
       ...,
       [ 0.3706328 , -0.1185065 , -0.62564622, ...,  1.21514774,
         1.00111297, -0.88129441],
       [ 0.3706328 , -0.1185065 ,  0.35476274, ..., -0.8229452 ,
         1.00111297, -0.88129441],
       [-1.63977649,  0.34901888, -0.21480818, ..., -0.8229452 ,
        -0.99888827,  1.13469459]])

In [36]:
heart_df.shape

(899, 15)

#### Making feature_vector which will be out X dataset

In [45]:
feature_vector = heart_df

In [46]:
feature_vector.shape

(899, 15)

## Train test split using train_test_split from sklearn.model_selection

In [47]:
from sklearn.model_selection import train_test_split

In [48]:
x_train, x_test, y_train, y_test = train_test_split(feature_vector, target_vector, test_size=0.30, random_state=42)

In [49]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(629, 15)
(270, 15)
(629,)
(270,)


# Applying different models with hyperparameter tuning

In [50]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

In [64]:
models = {
    'LogisticRegression' : {
        'model': LogisticRegression(),
        'params': {
            'C': [0.5, 1, 1.5, 2, 2.5, 3, 5, 10]
        }
    },
    'DecisionTree': {
        'model': DecisionTreeClassifier(random_state=42),
        'params': {
            'criterion': ['gini', 'entropy'],
            'max_depth': list(range(5, 31, 5)),
            'max_features': [5, 10, 12, 15]
        }
    },
    'RandomForest': {
        'model': RandomForestClassifier(random_state=42, n_jobs=-1),
        'params': {
            'n_estimators': list(range(20, 111, 10)),
            'max_features': ['auto', 'sqrt', 'log2'],
            'criterion': ['gini', 'entropy'],
            'bootstrap': [True, False]
        }
    },
    'SVM': {
        'model': SVC(random_state=42),
        'params': {
            'kernel': ['rbf', 'linear', 'poly'],
            'C': [1, 5, 10, 20, 25, 50, 100]
        }
    },
    'KNearestNeighbors': {
        'model': KNeighborsClassifier(),
        'params': {
            'n_neighbors': list(range(25, 34)),
            'p': [1, 2]
        }
    }
}

## Training different models and hyperparameter tuning using GridSearchCV and RandomizedSearchCV

In [55]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

### GridSearchCV

In [72]:
%%time
scores = []

for model in models.values():
    clf = GridSearchCV(model['model'], model['params'], cv=10, return_train_score=False)
    clf.fit(x_train, y_train)
    scores.append({'Model': model['model'], 'Parameters': clf.best_params_, 'Score': clf.best_score_})

df = pd.DataFrame(scores)
df

Wall time: 6min 5s


Unnamed: 0,Model,Parameters,Score
0,LogisticRegression(),{'C': 0.5},1.0
1,DecisionTreeClassifier(random_state=42),"{'criterion': 'gini', 'max_depth': 5, 'max_fea...",1.0
2,"RandomForestClassifier(n_jobs=-1, random_state...","{'bootstrap': True, 'criterion': 'gini', 'max_...",1.0
3,SVC(random_state=42),"{'C': 1, 'kernel': 'rbf'}",1.0
4,KNeighborsClassifier(),"{'n_neighbors': 27, 'p': 1}",0.988863


In [101]:
df.Score.mean()

0.9977726574500767

### RandomizedSearchCV

In [73]:
%%time
scores2 = []

for model in models.values():
    clf = RandomizedSearchCV(model['model'], model['params'], cv=10, return_train_score=False, n_iter=10)
    clf.fit(x_train, y_train)
    scores2.append({'Model': model['model'], 'Parameters': clf.best_params_, 'Score': clf.best_score_})

df2 = pd.DataFrame(scores2)
df2



Wall time: 31.7 s


Unnamed: 0,Model,Parameters,Score
0,LogisticRegression(),{'C': 0.5},1.0
1,DecisionTreeClassifier(random_state=42),"{'max_features': 12, 'max_depth': 25, 'criteri...",1.0
2,"RandomForestClassifier(n_jobs=-1, random_state...","{'n_estimators': 60, 'max_features': 'log2', '...",1.0
3,SVC(random_state=42),"{'kernel': 'linear', 'C': 10}",1.0
4,KNeighborsClassifier(),"{'p': 1, 'n_neighbors': 27}",0.988863


In [100]:
df2.Score.mean()

0.9977726574500767

#### Average score of about 99.77% is obtained using standard methods training.

# Dimensionality Reduction using PCA

In [74]:
from sklearn.decomposition import PCA

## PCA with 95% variance retention

In [75]:
pca = PCA(0.95)

In [76]:
x_pca = pca.fit_transform(feature_vector)

In [77]:
x_pca.shape

(899, 13)

In [78]:
pca.explained_variance_ratio_

array([0.22926843, 0.11002739, 0.09403488, 0.08203692, 0.07475953,
       0.07084254, 0.06244973, 0.05500972, 0.05109873, 0.04353943,
       0.04024355, 0.03021333, 0.02833615])

In [79]:
train_x, test_x, train_y, test_y = train_test_split(x_pca, target_vector, test_size=0.3, random_state=42)

### Applying PCA on various models and doing hyperparameter tuning using Cross Validation

## GridSearchCV

In [80]:
%%time
scores3 = []

for model in models.values():
    clf = GridSearchCV(model['model'], model['params'], cv=10, return_train_score=False)
    clf.fit(train_x, train_y)
    scores3.append({'Model': model['model'], 'Parameters': clf.best_params_, 'Score': clf.best_score_})

df3 = pd.DataFrame(scores3)
df3

120 fits failed out of a total of 480.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
120 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\harsh\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 681, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\harsh\Anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 942, in fit
    X_idx_sorted=X_idx_sorted,
  File "C:\Users\harsh\Anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 308, in fit
    raise ValueError("max_features must be in (0, n_features]")
ValueError: max_features must be in (0, n_features]

 0.96656426        nan 0.95552995 0.96185356 0.96656426        nan
 0.95552

Wall time: 6min 28s


Unnamed: 0,Model,Parameters,Score
0,LogisticRegression(),{'C': 2},1.0
1,DecisionTreeClassifier(random_state=42),"{'criterion': 'entropy', 'max_depth': 10, 'max...",0.974578
2,"RandomForestClassifier(n_jobs=-1, random_state...","{'bootstrap': True, 'criterion': 'gini', 'max_...",0.985714
3,SVC(random_state=42),"{'C': 1, 'kernel': 'rbf'}",1.0
4,KNeighborsClassifier(),"{'n_neighbors': 31, 'p': 2}",0.97299


In [99]:
df3.Score.mean()

0.9866564260112647

## RandomizedSearchCV

In [81]:
%%time
scores4 = []

for model in models.values():
    clf = RandomizedSearchCV(model['model'], model['params'], cv=10, return_train_score=False, n_iter=10)
    clf.fit(train_x, train_y)
    scores4.append({'Model': model['model'], 'Parameters': clf.best_params_, 'Score': clf.best_score_})

df4 = pd.DataFrame(scores4)
df4

20 fits failed out of a total of 100.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\harsh\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 681, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\harsh\Anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 942, in fit
    X_idx_sorted=X_idx_sorted,
  File "C:\Users\harsh\Anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 308, in fit
    raise ValueError("max_features must be in (0, n_features]")
ValueError: max_features must be in (0, n_features]

        nan 0.96820276        nan 0.97457757]


Wall time: 35.9 s


Unnamed: 0,Model,Parameters,Score
0,LogisticRegression(),{'C': 2},1.0
1,DecisionTreeClassifier(random_state=42),"{'max_features': 10, 'max_depth': 10, 'criteri...",0.974578
2,"RandomForestClassifier(n_jobs=-1, random_state...","{'n_estimators': 100, 'max_features': 'auto', ...",0.985714
3,SVC(random_state=42),"{'kernel': 'rbf', 'C': 20}",1.0
4,KNeighborsClassifier(),"{'p': 2, 'n_neighbors': 31}",0.97299


In [98]:
df4.Score.mean()

0.9866564260112647

#### Average score of about 98.66% is obtained using PCA with 95% variance retention on standard training methods.

## PCA with 4 dimensions retained

In [82]:
pca2 = PCA(n_components=4)
pca_X = pca2.fit_transform(feature_vector)
print(pca_X.shape)
print(pca2.explained_variance_ratio_)

train_X, test_X, train_Y, test_Y = train_test_split(pca_X, target_vector, test_size=0.3, random_state=42)

(899, 4)
[0.22926843 0.11002739 0.09403488 0.08203692]


### Applying PCA on various models and doing hyperparameter tuning using Cross Validation

## GridSearchCV

In [85]:
models = {
    'LogisticRegression' : {
        'model': LogisticRegression(),
        'params': {
            'C': [0.5, 1, 1.5, 2, 2.5, 3, 5, 10]
        }
    },
    'DecisionTree': {
        'model': DecisionTreeClassifier(random_state=42),
        'params': {
            'criterion': ['gini', 'entropy'],
            'max_depth': list(range(2, 12))
        }
    },
    'RandomForest': {
        'model': RandomForestClassifier(random_state=42, n_jobs=-1),
        'params': {
            'n_estimators': list(range(20, 111, 10)),
            'max_features': ['auto', 'sqrt', 'log2'],
            'criterion': ['gini', 'entropy'],
            'bootstrap': [True, False]
        }
    },
    'SVM': {
        'model': SVC(random_state=42),
        'params': {
            'kernel': ['rbf', 'linear', 'poly'],
            'C': [1, 5, 10, 20, 25, 50, 100]
        }
    },
    'KNearestNeighbors': {
        'model': KNeighborsClassifier(),
        'params': {
            'n_neighbors': list(range(25, 34)),
            'p': [1, 2]
        }
    }
}

In [94]:
%%time
scores5 = []

for model in models.values():
    clf = GridSearchCV(model['model'], model['params'], cv=10, return_train_score=False)
    clf.fit(train_X, train_Y)
    scores5.append({'Model': model['model'], 'Parameters': clf.best_params_, 'Score': clf.best_score_})

df5 = pd.DataFrame(scores5)
df5

Wall time: 6min 24s


Unnamed: 0,Model,Parameters,Score
0,LogisticRegression(),{'C': 5},0.961879
1,DecisionTreeClassifier(random_state=42),"{'criterion': 'gini', 'max_depth': 6}",0.972965
2,"RandomForestClassifier(n_jobs=-1, random_state...","{'bootstrap': False, 'criterion': 'entropy', '...",0.977727
3,SVC(random_state=42),"{'C': 5, 'kernel': 'rbf'}",0.968203
4,KNeighborsClassifier(),"{'n_neighbors': 25, 'p': 1}",0.960266


In [96]:
df5.Score.mean()

0.9682078853046594

## RandomizedSearchCV

In [95]:
%%time
scores6 = []

for model in models.values():
    clf = RandomizedSearchCV(model['model'], model['params'], cv=10, return_train_score=False, n_iter=10)
    clf.fit(train_X, train_Y)
    scores6.append({'Model': model['model'], 'Parameters': clf.best_params_, 'Score': clf.best_score_})

df6 = pd.DataFrame(scores6)
df6



Wall time: 36.1 s


Unnamed: 0,Model,Parameters,Score
0,LogisticRegression(),{'C': 5},0.961879
1,DecisionTreeClassifier(random_state=42),"{'max_depth': 11, 'criterion': 'gini'}",0.972965
2,"RandomForestClassifier(n_jobs=-1, random_state...","{'n_estimators': 100, 'max_features': 'log2', ...",0.977727
3,SVC(random_state=42),"{'kernel': 'rbf', 'C': 5}",0.968203
4,KNeighborsClassifier(),"{'p': 1, 'n_neighbors': 25}",0.960266


In [97]:
df6.Score.mean()

0.9682078853046594

#### Average score of about 96.82% is obtained using PCA with 4 dimension retention on standard training methods.

### CONCLUSION

The following are the scores of model tested on various models like Logistic Regression/SVM/Decision Tree/Random Forest/K Nearest Neighbors using CV search.

#### - Normal Models with Hyperparameters Tuning - 99.77
#### - PCA with 95% variance retention - 98.66
#### - PCA with 4-D dimensions - 96.82