<h1 align="center"> Logistic Regression after PCA</h1>

In [21]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import time
%matplotlib inline

## Load the Corona Dataset

In [22]:
url = "corona_data.csv"

In [23]:
# loading dataset into Pandas DataFrame
df = pd.read_csv(url)

In [24]:
df.head()

Unnamed: 0,cough,fever,sore_throat,shortness_of_breath,head_ache,age_60_and_above,gender,test_indication,corona_result
0,0,0,0,0,0,1,1,0,0
1,1,0,0,0,0,0,1,0,0
2,0,1,0,0,0,0,0,0,0
3,1,0,0,0,0,0,1,0,0
4,1,0,0,0,0,0,0,0,0


## Standardize the Data

"Since PCA yields a feature subspace that maximizes the variance along the axes, it makes sense to standardize the data, especially, if it was measured on different scales. Although, all features in the Iris dataset were measured in centimeters, let us continue with the transformation of the data onto unit scale (mean=0 and variance=1), which is a requirement for the optimal performance of many machine learning algorithms."
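- Source: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

The quoted passage refers to the Iris dataset; the same standardization step is applied here to the features of the corona dataset.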
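To make the effect of standardization concrete, here is a minimal sketch on a small made-up array. The values and the names `toy`, `toy_scaler`, and `toy_scaled` are illustrative assumptions and are not taken from the corona dataset. It only shows that `StandardScaler` rescales each column to mean 0 and variance 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): two features on very different scales.
toy = np.array([[0.0, 100.0],
                [1.0, 200.0],
                [0.0, 300.0],
                [1.0, 400.0]])

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy)

# After scaling, each column has mean 0 and (population) variance 1.
col_means = toy_scaled.mean(axis=0)
col_vars = toy_scaled.var(axis=0)
print(col_means, col_vars)
```

Both columns end up on the same scale, so neither dominates the variance that PCA tries to capture.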

In [25]:
features = ['cough','fever','sore_throat','shortness_of_breath','head_ache','age_60_and_above','gender','test_indication']
x = df.loc[:, features].values

In [26]:
y = df.loc[:,['corona_result']].values

In [27]:
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=1/7.0, random_state=0)
ytrain = ytrain.flatten()
ytest = ytest.flatten()

In [28]:
# check that the split worked correctly
print(xtrain.shape)
print(ytrain.shape)
print(xtest.shape)
print(ytest.shape)

(235676, 8)
(235676,)
(39280, 8)
(39280,)


In [29]:
scaler = StandardScaler()

# Fit on training set only.
scaler.fit(xtrain)

# Apply transform to both the training set and the test set.
xtrain = scaler.transform(xtrain)
xtest = scaler.transform(xtest)

In [30]:
pd.DataFrame(data = xtrain, columns = features).head()

Unnamed: 0,cough,fever,sore_throat,shortness_of_breath,head_ache,age_60_and_above,gender,test_indication
0,-0.421827,-0.290002,-0.083247,-0.074821,-0.094202,1.574752,1.015049,-0.370008
1,2.370638,-0.290002,-0.083247,-0.074821,-0.094202,-0.63502,-0.985174,2.993162
2,2.370638,-0.290002,-0.083247,-0.074821,-0.094202,-0.63502,-0.985174,2.993162
3,-0.421827,-0.290002,-0.083247,-0.074821,-0.094202,-0.63502,1.015049,-0.370008
4,-0.421827,-0.290002,-0.083247,-0.074821,-0.094202,-0.63502,1.015049,-0.370008


## PCA -> logistic regression
#### A 0.85 explained-variance constraint should be fine

In [31]:
pca = PCA(.85)

In [32]:
pca.fit(xtrain)

PCA(n_components=0.85)

In [33]:
# check the new reduced dimension
pca.n_components_

6
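Passing a float between 0 and 1 to `PCA` asks scikit-learn to keep the smallest number of components whose cumulative explained-variance ratio exceeds that value. The sketch below illustrates this on synthetic data. The array `toy` and the helper names `full_pca`, `cumulative`, and `k_manual` are assumptions made for illustration; the result will not be the 6 obtained above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data (hypothetical): 500 samples, 8 features with decreasing spread.
toy = rng.normal(size=(500, 8)) * np.array([5, 4, 3, 2, 1, 1, 0.5, 0.5])

# Fit with all components to inspect the cumulative explained-variance ratio.
full_pca = PCA().fit(toy)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.85.
k_manual = int(np.searchsorted(cumulative, 0.85, side="right") + 1)

# Let PCA choose the number of components from the same threshold.
toy_pca = PCA(0.85).fit(toy)
print(cumulative.round(3))
print(k_manual, toy_pca.n_components_)
```

Inspecting `explained_variance_ratio_` this way is a quick check of how much variance each retained component carries.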

In [34]:
# map both datasets (train and test) onto the principal components
xtrain_PCA = pca.transform(xtrain)
xtest_PCA = pca.transform(xtest)

In [35]:
pd.DataFrame(data = xtrain_PCA, columns = ['z1', 'z2', 'z3', 'z4', 'z5', 'z6']).head(5)

Unnamed: 0,z1,z2,z3,z4,z5,z6
0,-1.153664,1.417074,-0.672417,-0.00547,-0.019272,-0.053346
1,2.808,-1.29353,-1.246544,-0.03626,1.280931,1.565544
2,2.808,-1.29353,-1.246544,-0.03626,1.280931,1.565544
3,-0.580199,0.17207,0.012429,-0.044266,0.08128,0.175304
4,-0.580199,0.17207,0.012429,-0.044266,0.08128,0.175304
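The pattern used so far (fit the scaler and PCA on the training set only, then transform both sets) can also be written as a scikit-learn `Pipeline`. This is a sketch under stated assumptions, not the approach taken in this notebook: the data is a synthetic stand-in, and the names `X_toy`, `y_toy`, `pipe`, and `toy_pred` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data (hypothetical): 1000 samples, 8 binary features.
X_toy = rng.integers(0, 2, size=(1000, 8)).astype(float)
y_toy = (X_toy[:, 0] + X_toy[:, 1] >= 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=1/7.0, random_state=0)

# Each step is fitted on the training data only when pipe.fit is called;
# pipe.predict then applies the already-fitted transforms to the test data.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(0.85)),
    ("clf", LogisticRegression(solver="lbfgs")),
])
pipe.fit(X_tr, y_tr)
toy_pred = pipe.predict(X_te)
print(pipe.named_steps["pca"].n_components_, toy_pred.shape)
```

Bundling the steps this way makes it harder to accidentally fit the scaler or PCA on the test set.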


In [45]:
# regression with PCA
start_PCA = time.time()
logisticRegr_PCA = LogisticRegression(solver = 'lbfgs')
logisticRegr_PCA.fit(xtrain_PCA, ytrain)
y_pred_PCA = logisticRegr_PCA.predict(xtest_PCA)
end_PCA = time.time()

# regression without PCA
start = time.time()
logisticRegr = LogisticRegression(solver = 'lbfgs')
logisticRegr.fit(xtrain, ytrain)
y_pred = logisticRegr.predict(xtest)
end = time.time()

## Predict the labels of the new (test) data

In [46]:
accuracy_PCA = metrics.accuracy_score(ytest, y_pred_PCA)
f1_PCA = metrics.f1_score(ytest, y_pred_PCA, average='weighted')
precision_PCA = metrics.precision_score(ytest, y_pred_PCA, average='weighted')
recall_PCA = metrics.recall_score(ytest, y_pred_PCA, average='weighted')
print("With PCA: ")
print("Running Time: ",end_PCA - start_PCA,"\n")
print("accuracy: ", accuracy_PCA)
print("F1 score: ",f1_PCA)
print("Precision score: ",precision_PCA)
print("Recall: ",recall_PCA)

print('--------------------------------------------------')
accuracy = metrics.accuracy_score(ytest, y_pred)
score = logisticRegr.score(xtest, ytest)
f1 = metrics.f1_score(ytest, y_pred, average='weighted')
precision = metrics.precision_score(ytest, y_pred, average='weighted')
recall = metrics.recall_score(ytest, y_pred, average='weighted')
print("Without PCA: ")
print("Running Time: ",end - start,"\n")
print("accuracy: ", accuracy)
print("F1 score: ",f1)
print("Precision score: ",precision)
print("Recall: ",recall)

With PCA: 
Running Time:  0.22822880744934082 

accuracy:  0.9556771894093686
F1 score:  0.9434455694446444
Precision score:  0.9507407453218507
Recall:  0.9556771894093686
--------------------------------------------------
Without PCA: 
Running Time:  0.2824678421020508 

accuracy:  0.9555244399185336
F1 score:  0.9430470571879157
Precision score:  0.9507114467815961
Recall:  0.9555244399185336


Wow, the scores are actually pretty high :)
As you can see, PCA with a 0.85 explained-variance threshold barely changes the scores (accuracy 0.9557 with PCA vs. 0.9555 without), and at the same time it reduces the running time (about 0.23 s vs. 0.28 s).
(Our dataset is small, so the difference is small.)
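Because a single timing run can be noisy, one way to compare running times more robustly is to repeat the fit several times and look at the median. The sketch below does this on synthetic data. The helper `time_fit`, the arrays `X_demo` and `y_demo`, and the fixed `n_components=6` are assumptions for illustration, and unlike the cells above it times only `fit`, not `predict`.

```python
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def time_fit(make_model, X, y, repeats=5):
    """Return the fit time in seconds for each of `repeats` runs."""
    durations = []
    for _ in range(repeats):
        model = make_model()  # fresh, unfitted model for every run
        t0 = time.perf_counter()
        model.fit(X, y)
        durations.append(time.perf_counter() - t0)
    return durations

rng = np.random.default_rng(0)
# Synthetic data (hypothetical): 2000 samples, 8 features, binary label.
X_demo = rng.normal(size=(2000, 8))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)
X_demo_pca = PCA(n_components=6).fit_transform(X_demo)

def make_lr():
    return LogisticRegression(solver="lbfgs")

times_full = time_fit(make_lr, X_demo, y_demo)
times_pca = time_fit(make_lr, X_demo_pca, y_demo)
print("median without PCA:", np.median(times_full))
print("median with PCA:   ", np.median(times_pca))
```

The measured times depend on the machine and the data, so no particular ordering of the two medians is guaranteed.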