## Implementing PCA

<strong>1.Load heart disease dataset in pandas dataframe<br>
2.Remove outliers using Z score. Usual guideline is to remove anything that has Z score > 3 formula or Z score < -3. <br>
3.Convert text columns to numbers using label encoding and one hot encoding.<br>
4.Apply scaling. <br>
5.Build a classification model using various methods (SVM, logistic regression, random forest) and check which model gives you the best accuracy. <br>
6.Now use PCA to reduce dimensions, retrain your model and see what impact it has on your model in terms of accuracy. Keep in mind that many times doing PCA reduces the accuracy but computation is much lighter and that's the trade off you need to consider while building models in real life

In [364]:
import pandas as pd 

In [365]:
df= pd.read_csv("heart.csv")
df.head(2)

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1


In [366]:
df.shape

(918, 12)

In [367]:
df.describe()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
count,918.0,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,132.396514,198.799564,0.233115,136.809368,0.887364,0.553377
std,9.432617,18.514154,109.384145,0.423046,25.460334,1.06657,0.497414
min,28.0,0.0,0.0,0.0,60.0,-2.6,0.0
25%,47.0,120.0,173.25,0.0,120.0,0.0,0.0
50%,54.0,130.0,223.0,0.0,138.0,0.6,1.0
75%,60.0,140.0,267.0,0.0,156.0,1.5,1.0
max,77.0,200.0,603.0,1.0,202.0,6.2,1.0


In [368]:
df.nunique()

Age                50
Sex                 2
ChestPainType       4
RestingBP          67
Cholesterol       222
FastingBS           2
RestingECG          3
MaxHR             119
ExerciseAngina      2
Oldpeak            53
ST_Slope            3
HeartDisease        2
dtype: int64

In [369]:
z_score=(df.RestingBP-df.RestingBP.mean())/df.RestingBP.std()
df=df[(z_score>-3) & (z_score<3)]
df.shape

(910, 12)

In [370]:
z_score=(df.Cholesterol-df.Cholesterol.mean())/df.Cholesterol.std()
df=df[(z_score>-3) & (z_score<3)]
df.shape

(907, 12)

In [371]:
z_score=(df.MaxHR-df.MaxHR.mean())/df.MaxHR.std()
df=df[(z_score>-3) & (z_score<3)]
df.shape

(906, 12)

In [372]:
z_score=(df.Oldpeak-df.Oldpeak.mean())/df.Oldpeak.std()
df=df[(z_score>-3) & (z_score<3)]
df.shape

(899, 12)

### One hot Encoding

In [373]:
X= df.drop("HeartDisease",axis=1)
y= df.HeartDisease
X.sample(1)

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope
17,43,F,ATA,120,201,0,Normal,165,N,0.0,Up


In [374]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

ct= ColumnTransformer(transformers=[
    ("ohe",OneHotEncoder(drop='first'),[1,2,5,6,8,10])]
    ,remainder="passthrough")

In [375]:
X1=ct.fit_transform(X)
X1.shape

(899, 15)

In [376]:
from sklearn.preprocessing import StandardScaler
ssc= StandardScaler()
scaled_X=ssc.fit_transform(X1)
scaled_X[0]

array([ 0.515943  ,  2.06332497, -0.5349047 , -0.22955001, -0.5503622 ,
        0.80970176, -0.48989795, -0.8229452 , -0.99888827,  1.13469459,
       -1.42815446,  0.46590022,  0.84963584,  1.38431998, -0.85546862])

### Without using PCA

In [377]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test= train_test_split(scaled_X,y,test_size=0.3,random_state=10)

In [378]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

model_params={
    'LogisticR':{
        'model': LogisticRegression(solver='liblinear'),
        'params':{
            'C':[0.75,1,5,10,50]
        }
    },
    'Svm':{
        'model': SVC(gamma='auto'),
        'params':{
            'C': [0.54,1,5,10,40],
            'kernel': ['rbf','linear']
        }
    },
    'RForest':{
        'model':RandomForestClassifier(),
        'params':{
            "n_estimators":[10,35,50,75,100]
        }
    }
}

In [379]:
from sklearn.model_selection import GridSearchCV
scores = []

for model_name, mp in model_params.items():
    clf =  GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(X_train, y_train)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    
scoredf = pd.DataFrame(scores,columns=['model','best_score','best_params'])
scoredf

Unnamed: 0,model,best_score,best_params
0,LogisticR,0.864876,{'C': 0.75}
1,Svm,0.871238,"{'C': 0.54, 'kernel': 'rbf'}"
2,RForest,0.876,{'n_estimators': 35}


### Using PCA

In [380]:
from sklearn.decomposition import PCA
pca= PCA(0.90)
pca_X=pca.fit_transform(scaled_X)
pca_X.shape

(899, 11)

In [381]:
pca.explained_variance_ratio_

array([0.22926843, 0.11002739, 0.09403488, 0.08203692, 0.07475953,
       0.07084254, 0.06244973, 0.05500972, 0.05109873, 0.04353943,
       0.04024355])

In [382]:
from sklearn.model_selection import train_test_split
pcaX_train,pcaX_test,y_train,y_test= train_test_split(scaled_X,y,test_size=0.3,random_state=40)

In [383]:
from sklearn.model_selection import GridSearchCV
scores = []

for model_name, mp in model_params.items():
    clf =  GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(pcaX_train,y_train)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    
scoredf = pd.DataFrame(scores,columns=['model','best_score','best_params'])
scoredf

Unnamed: 0,model,best_score,best_params
0,LogisticR,0.858514,{'C': 0.75}
1,Svm,0.866463,"{'C': 0.54, 'kernel': 'rbf'}"
2,RForest,0.855302,{'n_estimators': 100}
