## Import the Libraries

In [1]:
import pandas as pd 
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

## Load the Iris dataset from seaborn

In [3]:
# load the Iris example dataset that seaborn provides
df = sns.load_dataset('iris')
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


* `species` is the target variable
* the other four columns are the independent variables

## Check Target Variable Class Proportions

In [5]:
df['species'].value_counts(normalize=True)

virginica     0.333333
versicolor    0.333333
setosa        0.333333
Name: species, dtype: float64

* the value counts above show three balanced classes, so this is a multiclass problem
* logistic regression can handle multiclass problems

In [6]:
df.shape

(150, 5)

## Describe the Data

In [7]:
df.describe().round(2)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.84,3.06,3.76,1.2
std,0.83,0.44,1.77,0.76
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


### **To compare the results, we build a logistic regression model without PCA and then one with PCA**

## X,y Split

In [9]:
#create X, y
X = df.drop('species',axis=1)
y = df['species']

## Train Test Split

In [10]:
#Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)


In [12]:
# Check Shape
df.shape, X_train.shape, X_test.shape

((150, 5), (120, 4), (30, 4))

## Scaling the Data (Required for PCA)

**Why do we scale before PCA?**
* PCA looks for the directions of maximum variance, so a feature measured on a larger scale would dominate the principal components simply because of its units

* each principal component has a single eigenvalue and
* each principal component has a set of numbers called an eigenvector
* the eigenvector entries are multiplied by the independent variable values and summed to get the value of that principal component
* because these weighted sums mix all features, the features need to be on the same scale so that each one contributes according to its structure and not its magnitude

* so it is good practice to standardize the data before applying PCA
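A minimal sketch of why scale matters, using synthetic data rather than the Iris measurements. The feature names and magnitudes are made up for illustration, and the first principal component is computed directly with NumPy's eigendecomposition instead of sklearn's `PCA`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Two hypothetical, independent features on very different scales
height_mm = rng.normal(loc=1700.0, scale=100.0, size=n)   # std around 100
ratio = rng.normal(loc=0.5, scale=0.1, size=n)            # std around 0.1
X = np.column_stack([height_mm, ratio])

def first_pc_and_share(data):
    """Return the first principal axis and its share of the total variance."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = np.argmax(eigvals)
    return eigvecs[:, top], eigvals[top] / eigvals.sum()

# Without scaling, the large-scale feature dominates the first component
pc_raw, share_raw = first_pc_and_share(X)

# After Z-score scaling, both features contribute on an equal footing
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
pc_scaled, share_scaled = first_pc_and_share(X_scaled)

print("raw:   ", np.round(np.abs(pc_raw), 3), round(share_raw, 4))
print("scaled:", np.round(np.abs(pc_scaled), 3), round(share_scaled, 4))
```

On the raw data the first component points almost entirely along the large-scale feature and absorbs nearly all of the variance. After scaling, the variance is split roughly evenly because the two features are independent.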

In [13]:
# import the StandardScaler class from sklearn
from sklearn.preprocessing import StandardScaler

In [14]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) 
# scaler.fit(X_train)
# scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

**Note 1**
 * formula for scaling: **Zscore = (Xi - mean) / std**
    
 * behind the scenes, **fit_transform** calculates the mean and std of each column of the train data and then scales each column with the Zscore formula
 
 * the test data must be scaled with the same mean and std values that were used for the train data, so for the test data we only call **transform**
 
 
 **Why do we fit the scaler on the train data only?**
 * if we scale the whole dataset first and then split, there is data leakage: the mean and std used for scaling would also reflect the rows that end up in the test dataset, whereas the model is not supposed to see anything related to the test dataset
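The fit/transform split can be sketched by hand with NumPy. The tiny one-column arrays below are hypothetical and only illustrate the idea; they are not taken from the Iris data.

```python
import numpy as np

# Hypothetical single-feature data, already split into train and test
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[10.0]])

# "fit": learn the statistics from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)     # population std (ddof=0), as StandardScaler uses

# "transform": apply the same statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

# Leaky variant: statistics computed on train and test together
full = np.vstack([train, test])
leaky_mean = full.mean(axis=0)

print(mean, leaky_mean)     # the test row shifts the mean from 2.5 to 4.0
print(train_scaled.mean(), train_scaled.std())
```

In the leaky variant the single test row changes the statistics that the training data is scaled with, which is exactly the information the model should not have.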
 
 
 **Note 2**
  * PCA cannot be applied directly to categorical features
  * if categorical features are present, we need to set them aside
  * apply PCA (the dimensionality reduction) only to the continuous feature columns
  * then add the categorical features back
  
  

In [15]:
X_train_scaled

array([[ 0.76416119, -0.18596829,  1.1178787 ,  1.2484939 ],
       [-0.42673937, -1.39880497,  0.10968747,  0.09800464],
       [ 0.76416119, -0.18596829,  0.94984683,  0.73716534],
       [-0.0694692 , -0.9136703 ,  0.05367685, -0.0298275 ],
       [ 1.12143136, -0.18596829,  0.94984683,  1.12066176],
       [-0.18855925, -0.67110296,  0.38974059,  0.09800464],
       [ 1.0023413 ,  0.05659904,  0.50176184,  0.35366892],
       [ 0.04962086, -0.18596829,  0.22170872,  0.35366892],
       [-0.30764931, -0.9136703 ,  0.22170872,  0.09800464],
       [ 2.19324186, -0.18596829,  1.28591057,  1.37632604],
       [-0.90309959,  1.51200306, -1.29057812, -1.05248462],
       [-1.49854987,  1.26943572, -1.57063124, -1.3081489 ],
       [-1.37945981,  0.29916638, -1.2345675 , -1.3081489 ],
       [ 1.0023413 , -0.18596829,  0.66979371,  0.6093332 ],
       [-0.30764931, -0.18596829,  0.38974059,  0.35366892],
       [-1.85582003, -0.18596829, -1.51462062, -1.43598104],
       [ 1.47870152, -0.

In [17]:
X_train_scaled[:,1].mean(), X_train_scaled[:,1].std()

(-1.0565622451016074e-15, 0.9999999999999999)

**Note**
* after scaling, the mean of the column is 0 (up to floating-point error)
* after scaling, the std of the column is 1

## Train & Predict using a Logistic Regression Model

In [18]:
# fit a logistic regression model and get its predictions on train and test datasets
lr = LogisticRegression()
lr.fit(X_train_scaled,y_train)
train_pred = lr.predict(X_train_scaled)
test_pred = lr.predict(X_test_scaled)


## Evaluate Accuracy of Predictions

In [19]:
# import accuracy_score function from sklearn.metrics
from sklearn.metrics import accuracy_score

In [20]:
# calculate accuracy on train & test data predictions
train_acc = accuracy_score(y_train,train_pred) # lr.score(X_train_scaled,y_train)
test_acc = accuracy_score(y_test,test_pred)
train_acc, test_acc

(0.9666666666666667, 0.9666666666666667)

## Implement PCA to Reduce Dimensions
We want to retain 95% of the variance in the dataset, so we keep only as many of the top principal components as are needed to explain at least 95% of the variation together. How high this percentage should be is a judgment call that depends on the domain and the problem.

In [21]:
# create a PCA object that retains 95% of the variance
pca = PCA(n_components=0.95) # n_components=3 would keep the top 3 principal components
pca.fit(X_train_scaled)
# transform the scaled training and test datasets
X_train_trf = pca.transform(X_train_scaled)
X_test_trf = pca.transform(X_test_scaled)

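* `fit` computes the principal components (and their explained variance) from the training data only; `transform` then projects the data onto them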
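As a rough sketch of what the fit step works out, the calculation can be reproduced with NumPy on a synthetic matrix. The random data below is a hypothetical stand-in for `X_train_scaled`, and the eigendecomposition of the covariance matrix is used here for readability; sklearn may use a different but equivalent numerical route (such as SVD), and the signs of the axes can differ.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical stand-in for a scaled training matrix: 120 rows, 4 features
X = rng.normal(size=(120, 4))
X[:, 2] += 0.9 * X[:, 0]          # make two columns correlated
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 1: center the data (already centered here, kept for completeness)
mean = X.mean(axis=0)
Xc = X - mean

# Step 2: covariance matrix of the features (divides by n - 1)
cov = np.cov(Xc, rowvar=False)

# Step 3: eigendecomposition; eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
components = eigvecs[:, order].T   # one principal axis per row

# Step 4: share of the variance explained by each component
ratio = eigvals / eigvals.sum()

# "transform" is then a projection onto the leading axes
k = 2
X_proj = Xc @ components[:k].T
print(eigvals.round(3), ratio.round(3), X_proj.shape)
```

The rows of `components` play the role of `pca.components_`, `eigvals` the role of `pca.explained_variance_`, and `ratio` the role of `pca.explained_variance_ratio_`.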

## Check the Reduced Dimensions

In [22]:
X_train_trf.shape, X_test_trf.shape

((120, 2), (30, 2))

**Does PCA keep two variables and delete the other two?**
* No, that is not how it works
* PCA creates principal components (PC1, PC2, PC3, ..., PCn), which are linear combinations of the original features
* out of these we keep PC1 and PC2, either via n_components=0.95 or n_components=2

In [23]:
X_train_trf

array([[ 1.80493828, -0.21426637],
       [ 0.25517327,  1.44497662],
       [ 1.41820595, -0.17141317],
       [ 0.21325761,  0.87781662],
       [ 1.82305213, -0.32669546],
       [ 0.35545092,  0.67683443],
       [ 1.0039873 , -0.44280731],
       [ 0.40276083,  0.12422613],
       [ 0.25779311,  0.94835163],
       [ 2.72656467, -0.73024854],
       [-2.20922505, -0.9852658 ],
       [-2.76710222, -0.52418125],
       [-2.25927778,  0.33209822],
       [ 1.30864723, -0.23969052],
       [ 0.3123476 ,  0.2465219 ],
       [-2.61967987,  0.96765863],
       [ 2.14104661, -0.45689302],
       [ 0.13368207,  0.89529232],
       [-2.10396433,  0.47567551],
       [-0.46513895,  1.6435746 ],
       [ 0.14071226,  1.42394472],
       [-1.82179701, -0.09244361],
       [ 1.10949254,  0.79520394],
       [-2.3864584 , -0.6962729 ],
       [-0.50267584,  1.96007959],
       [ 1.95367128,  0.87259532],
       [ 0.30829558,  1.10292442],
       [-2.52140712, -0.64451214],
       [-2.16624472,

In [27]:
# coefficients: one row per principal component, one column per original feature
pca.components_

array([[ 0.52606701, -0.25776909,  0.5804543 ,  0.56558059],
       [-0.35178491, -0.93267558, -0.02015382, -0.07718464]])

* These are the coefficients (the eigenvectors) used to calculate the principal components
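The linear combination can be checked by hand with the numbers printed in this notebook: the first row of `X_train_scaled` multiplied by the rows of `pca.components_`. The sketch assumes that `pca.mean_` is effectively zero, which should hold because the training data was standardized before fitting.

```python
import numpy as np

# First row of X_train_scaled, copied from the output above
x = np.array([0.76416119, -0.18596829, 1.1178787, 1.2484939])

# pca.components_, copied from the output above (one row per component)
components = np.array([
    [0.52606701, -0.25776909, 0.5804543, 0.56558059],
    [-0.35178491, -0.93267558, -0.02015382, -0.07718464],
])

# Each principal component value is a weighted sum of the scaled features.
# pca.transform also subtracts pca.mean_, assumed to be ~0 here.
pcs = components @ x
print(pcs.round(5))
```

The result agrees with the first row of `X_train_trf` shown earlier (1.80493828, -0.21426637) up to rounding of the printed values.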

In [28]:
# eigenvalues
pca.explained_variance_

array([2.94793148, 0.92567629])

* these two are the eigenvalues of the principal components PC1 and PC2, respectively

**Note**
* the sum of all eigenvalues is approximately the number of original (standardized) features, here 4
* the two retained eigenvalues sum to 3.87, which is close to 4
* here 2.95 ≈ 3 => PC1 alone carries about as much variance as 3 of the original variables
* here 0.93 ≈ 1 => PC2 carries about as much variance as 1 original variable
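The arithmetic behind this note can be written out as a short check. The eigenvalues are the ones printed above; the `n / (n - 1)` factor is there because `StandardScaler` normalizes with the population std while `explained_variance_` is computed with `n - 1` in the denominator, so the total is slightly above 4.

```python
# Eigenvalues of the two retained components, as printed above
eigenvalues = [2.94793148, 0.92567629]

n_samples, n_features = 120, 4

# Total variance of the standardized training data as PCA measures it:
# each feature has variance n / (n - 1) instead of exactly 1
expected_total = n_features * n_samples / (n_samples - 1)

retained = sum(eigenvalues)
share = retained / expected_total

print(round(retained, 4), round(expected_total, 4), round(share, 4))
```

The retained share computed this way is the same quantity that the explained variance ratio reports per component.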

In [29]:
# eigenvalues divided by the total of all eigenvalues
pca.explained_variance_ratio_

array([0.73084135, 0.22949058])

* the first principal component explains 73% of the variation in the dataset
* the second principal component explains 23% of the variation in the dataset
* together they explain 96% of the variation, which meets the threshold **(set with n_components=0.95)**
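How a fractional `n_components` leads to two components can be sketched with a cumulative sum. This illustrates the selection rule as described in the text, not sklearn's internal code, and it only uses the two ratios printed above because the ratios of the discarded components are not shown in this notebook.

```python
import numpy as np

# Ratios of the two retained components, copied from the output above
ratios = np.array([0.73084135, 0.22949058])
threshold = 0.95

cumulative = np.cumsum(ratios)
# Smallest number of leading components whose cumulative share reaches the threshold
k = int(np.argmax(cumulative >= threshold)) + 1
print(cumulative.round(4), k)
```

PC1 alone stays below the threshold, and adding PC2 pushes the cumulative share past it, so two components are kept.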

## Train a Fresh Logistic Regression Model on the smaller Dimension Dataset obtained using PCA

In [30]:
#logistic Regression with PCA
lr_pca = LogisticRegression()
lr_pca.fit(X_train_trf,y_train)
train_pred_trf = lr_pca.predict(X_train_trf)
test_pred_trf = lr_pca.predict(X_test_trf)


## Calculate Accuracy

In [31]:
# calculate accuracy on training and test data predictions
train_acc_trf = accuracy_score(y_train,train_pred_trf)
test_acc_trf = accuracy_score(y_test, test_pred_trf)
train_acc_trf, test_acc_trf

(0.925, 0.9)

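**Dimensionality reduction**     4 ------> 2      (50%)

**Test accuracy reduced from**   97% -----> 90%    (about 7 percentage points)

**Is a 7-point drop in accuracy acceptable for a 50% reduction in dimensionality?**
    
  Ans-- It depends on the domain for which we are building the model. Note that the test set has only 30 rows, so this drop corresponds to 2 additional misclassified samples (29 correct vs. 27 correct).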


**When should we use PCA?**

* if the dataset has numerical features and we want to reduce its dimensionality
* i.e. if we want to reduce the number of features, we can use PCA
* if we want to eliminate multicollinearity, since the principal components are uncorrelated with each other
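The multicollinearity point can be illustrated with a small synthetic example. The two correlated features below are hypothetical, and the projection is done with NumPy's eigendecomposition as a stand-in for sklearn's `PCA`.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
# Two hypothetical, strongly correlated features
a = rng.normal(size=n)
b = 0.95 * a + 0.2 * rng.normal(size=n)
X = np.column_stack([a, b])
X = (X - X.mean(axis=0)) / X.std(axis=0)

corr_before = np.corrcoef(X, rowvar=False)[0, 1]

# Project onto the principal axes (eigenvectors of the covariance matrix)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
Z = X @ eigvecs

corr_after = np.corrcoef(Z, rowvar=False)[0, 1]
print(round(corr_before, 3), round(abs(corr_after), 6))
```

The original features are highly correlated, while the projected components have a correlation that is zero up to floating-point error.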
