
### **Task: Reduce the dimensionality of the given dataset with Principal Component Analysis (PCA), then fit a logistic regression model on the reduced data**

## Importing libraries and data

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score

In [None]:
df = pd.read_csv("heart.csv")
df.head()

,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [None]:
df.shape

(918, 12)

In [None]:
df.describe()

,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
count,918.0,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,132.396514,198.799564,0.233115,136.809368,0.887364,0.553377
std,9.432617,18.514154,109.384145,0.423046,25.460334,1.06657,0.497414
min,28.0,0.0,0.0,0.0,60.0,-2.6,0.0
25%,47.0,120.0,173.25,0.0,120.0,0.0,0.0
50%,54.0,130.0,223.0,0.0,138.0,0.6,1.0
75%,60.0,140.0,267.0,0.0,156.0,1.5,1.0
max,77.0,200.0,603.0,1.0,202.0,6.2,1.0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


**No null values are present: every column has 918 non-null entries.**
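The absence of nulls does not rule out missing values encoded some other way. The `describe()` output above reports a minimum of 0 for both `RestingBP` and `Cholesterol`, which is not physiologically plausible, so those zeros may be placeholders for missing measurements. A minimal sketch of how one might count them follows; the small `sample` frame is hypothetical stand-in data, not rows taken from `heart.csv`.

```python
import pandas as pd

# Hypothetical stand-in rows; with the real data, use `df` loaded from heart.csv instead.
sample = pd.DataFrame({
    "RestingBP": [140, 0, 130, 138],
    "Cholesterol": [289, 180, 0, 0],
})

# Count zeros per column; a zero in these columns is suspicious rather than informative.
zero_counts = (sample[["RestingBP", "Cholesterol"]] == 0).sum()
print(zero_counts)
```

If the counts on the real data turn out to be large, imputing or dropping those rows before scaling is one option to consider.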

# Preprocessing

In [None]:
X = df.drop("HeartDisease",axis=1)
y = df.HeartDisease

## Preprocessing tasks
* Apply label encoding to the categorical columns of `X`
* Rescale the features so that they are on a comparable scale
* Split the data into training and test sets

### Label encoding

In [None]:
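# Encode each categorical column as integer codes.
# Classes are sorted alphabetically, so for Sex: F -> 0, M -> 1.
categorical_columns = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
label_encoder = LabelEncoder()
for column in categorical_columns:
    # The same encoder object is refit from scratch on every column
    X[column] = label_encoder.fit_transform(X[column])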

X

,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope
0,40,1,1,140,289,0,1,172,0,0.0,2
1,49,0,2,160,180,0,1,156,0,1.0,1
2,37,1,1,130,283,0,2,98,0,0.0,2
3,48,0,0,138,214,0,1,108,1,1.5,1
4,54,1,2,150,195,0,1,122,0,0.0,2
...,...,...,...,...,...,...,...,...,...,...,...
913,45,1,3,110,264,0,1,132,0,1.2,1
914,68,1,0,144,193,1,1,141,0,3.4,1
915,57,1,0,130,131,0,1,115,1,1.2,1
916,57,0,1,130,236,0,0,174,0,0.0,1
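Integer codes imply an ordering (for example `ASY` < `ATA` < `NAP`) that nominal categories such as `ChestPainType` may not have, and both standard scaling and PCA treat those codes as distances. One common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on a hypothetical `sample` frame built from category values seen in the table above; it illustrates the idea and is not the encoding this notebook uses.

```python
import pandas as pd

# Hypothetical stand-in rows using category values that appear in the table above.
sample = pd.DataFrame({
    "Age": [40, 49, 37, 48],
    "ChestPainType": ["ATA", "NAP", "ATA", "ASY"],
})

# One indicator column per category, so no ordering between categories is implied.
encoded = pd.get_dummies(sample, columns=["ChestPainType"], dtype=int)
print(encoded.columns.tolist())
```

The trade-off is a wider feature matrix, which changes how many principal components are needed to retain a given share of the variance.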


### Standard scaler

In [None]:
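# Standardize every feature to zero mean and unit variance.
# fit_transform returns a NumPy array, so X is no longer a DataFrame after this cell.
scaler = StandardScaler()
X = scaler.fit_transform(X)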
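Because the scaler here is fit on all of `X` before the split, the means and standard deviations it learns include the rows that later become the test set. A stricter variant splits first and fits the scaler on the training rows only. The sketch below shows that order on hypothetical random data; the names `X_demo`, `y_demo`, and `demo_scaler` are illustrative and do not appear in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 100 samples, 3 features on different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[50, 130, 200], scale=[10, 20, 100], size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split first, so that the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=7)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # learn mean and std from the training rows
X_te_scaled = demo_scaler.transform(X_te)      # reuse those statistics on the test rows
```

On a dataset of this size the effect on the score is probably small, but the split-first order gives a more honest estimate of performance on unseen data.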

### Split data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

## Apply Logistic Regression before applying PCA

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
model.score(X_test, y_test)

0.875

In [None]:
f1_score(y_test, predictions)

0.8820512820512821

## PCA

* Apply PCA with `n_components=5`
* Fit and transform PCA on `X` and store the new data in `X2`


In [None]:
pca = PCA(n_components=5)
# fit_transform fits the model and projects X in one step, so a separate fit call is not needed
X2 = pca.fit_transform(X)
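To make the projection less of a black box, the following sketch reproduces PCA by hand: center the data, take a singular value decomposition, and project onto the leading right singular vectors. It runs on hypothetical random data (`X_demo`), not on the heart dataset, and compares absolute values because the sign of each component is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data: 200 samples, 6 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))

# Center the data, then take the SVD; the rows of Vt are the principal directions.
X_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
manual_scores = X_centered @ Vt[:2].T

sk_scores = PCA(n_components=2).fit_transform(X_demo)

# Each component's sign is arbitrary, so compare absolute values.
matches = np.allclose(np.abs(manual_scores), np.abs(sk_scores), atol=1e-6)
print(matches)
```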

* Check the number of components kept by PCA (the second dimension of `X2`)

In [None]:
X2.shape

(918, 5)

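**Explained variance ratio** is the fraction of the total variance in the data captured by each principal component. The components are new axes formed from linear combinations of the original features, not the features themselves.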

In [None]:
pca.explained_variance_ratio_

array([0.25139665, 0.1330889 , 0.10512913, 0.09088956, 0.07916761])
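A running total of these ratios shows how much variance the first *k* components retain together. The sketch below copies the five values from the output above into a plain array, so it can be run without the fitted `pca` object.

```python
import numpy as np

# Ratios copied from the pca.explained_variance_ratio_ output above.
ratios = np.array([0.25139665, 0.1330889, 0.10512913, 0.09088956, 0.07916761])

# Cumulative share of variance retained by the first 1, 2, ..., 5 components.
cumulative = np.cumsum(ratios)
print(cumulative.round(4))
```

The last entry is about 0.66, so the five components retain roughly two thirds of the variance in the 11 scaled features.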

In [None]:
# explained_variance_ holds the variance of each component in absolute units (not ratios)
explained_variance = pca.explained_variance_
total_explained_variance = sum(explained_variance)
print(total_explained_variance)

7.264303477381411
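This total is an absolute variance, not a fraction, which is why it exceeds 1. It can be reconciled with the ratios printed earlier: `StandardScaler` divides by the population standard deviation, while PCA reports variances with an `n - 1` denominator, so each of the 11 standardized features contributes `n / (n - 1)` to the total variance. A short arithmetic check using only numbers from the outputs above:

```python
# Values copied from the outputs above.
ratios = [0.25139665, 0.1330889, 0.10512913, 0.09088956, 0.07916761]
n_samples, n_features = 918, 11

# Total variance of the standardized data under an n - 1 denominator.
total_variance = n_features * n_samples / (n_samples - 1)

# Share retained by the five components, converted back to absolute variance.
reconstructed = sum(ratios) * total_variance
print(round(reconstructed, 4))
```

The result agrees with the printed total to four decimal places, which confirms that the five components carry about 66% of roughly 11.01 units of variance.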


### Run the cells below to apply LogisticRegression on reduced data

In [None]:
# Train test split of X2
X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size=0.2, random_state=7)

In [None]:
log = LogisticRegression(max_iter=1000)
log = log.fit(X_train, y_train)
pred = log.predict(X_test)
log.score(X_test, y_test)

0.8695652173913043

In [None]:
f1_score(y_test, pred)

0.8762886597938143
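With 918 rows and `test_size=0.2`, the test set holds 184 samples, so the accuracy drop from 0.875 to about 0.870 corresponds to a single additional misclassified sample (161 versus 160 correct). A single split is therefore a weak basis for comparison. One way to get a steadier estimate is cross-validation over a pipeline, which also refits the scaler and PCA on the training folds only. The sketch below runs on hypothetical synthetic data from `make_classification`, not on the heart dataset, so its scores say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data with the same number of features as X (11).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=7
)

# Scaling and PCA are refit on the training folds only, inside each CV split.
with_pca = make_pipeline(
    StandardScaler(), PCA(n_components=5), LogisticRegression(max_iter=1000)
)
without_pca = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores_pca = cross_val_score(with_pca, X_demo, y_demo, cv=5, scoring="f1")
scores_full = cross_val_score(without_pca, X_demo, y_demo, cv=5, scoring="f1")
print(scores_pca.mean(), scores_full.mean())
```

To apply the same idea to this notebook, the encoded but unscaled feature matrix and `y` would take the place of `X_demo` and `y_demo`.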