## Machine Learning A-Z™

© Kirill Eremenko, Hadelin de Ponteves, SuperDataScience Team |
[Super Data Science](http://www.superdatascience.com)

Part 9: Dimensionality Reduction | Section 43: Principal Component Analysis (PCA)

Created on Apr  22, 2019
@author: yinka_ola

---

In [1]:
## ---

## Remember that in Part 3 - Classification, we worked with datasets composed 
## of only two independent variables. We did so for two reasons:

## 1. Because we needed two dimensions to better visualize how Machine Learning 
## models work (by plotting the prediction regions and the prediction boundary 
## for each model).
## 2. Because whatever the original number of independent variables, 
## we can often end up with two independent variables by applying an appropriate 
## Dimensionality Reduction technique.

## There are two types of Dimensionality Reduction techniques:
## 1. Feature Selection
## 2. Feature Extraction

## Feature Selection techniques are Backward Elimination, Forward Selection, 
## Bidirectional Elimination, Score Comparison and more. 
## We covered these techniques in Part 2 - Regression.

## In this part we will cover the following Feature Extraction techniques:
## Principal Component Analysis (PCA)

## ---

## Principal Component Analysis (PCA):
## One of the most widely used unsupervised algorithms and one of the most popular
## dimensionality reduction techniques

## Applications:
## Noise Filtering
## Visualization
## Feature Extraction
## Stock Market Predictions
## Gene data analysis

## Goals of PCA:
## 1. identify patterns in the data
## 2. detect correlations between variables
## 3. reduce the dimension of a d-dimensional dataset by projecting it onto a
##    k-dimensional subspace (where k < d)
## 4. learn about the relationship between the x and y values
## 5. find the list of principal axes

## PCA in 2D vs 3D: http://setosa.io/ev/principal-component-analysis/

## Summary:
## from the m independent variables of a dataset, PCA extracts p <= m new independent
## variables that explain the most variance in the dataset, regardless of the dependent variable
## because the dependent variable is not considered, PCA is an unsupervised model
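
The summary above can be sketched with plain NumPy. This is a minimal illustration, assuming the textbook covariance/eigendecomposition formulation of PCA; the function name `pca_fit_transform` and the toy data are hypothetical and are not part of the course code, which uses scikit-learn's `PCA` class.

```python
import numpy as np

def pca_fit_transform(X, k):
    """Project X (n_samples x m features) onto its top-k principal axes."""
    # Center each feature; the dependent variable is never used.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (m x m).
    cov = np.cov(X_centered, rowvar=False)
    # eigh returns eigenvalues in ascending order, so sort them descending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :k]                   # principal axes (m x k)
    explained_ratio = eigvals[:k] / eigvals.sum() # share of variance per axis
    return X_centered @ components, components, explained_ratio

# Hypothetical toy data: 3 features, the third is almost a copy of the first.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X_toy = np.column_stack([a, b, a + 0.05 * rng.normal(size=200)])

Z, axes, ratio = pca_fit_transform(X_toy, k=2)
print(Z.shape)       # 200 rows, 2 extracted features
print(ratio.sum())   # share of total variance kept by the 2 components
```

Because one toy feature is nearly redundant, two components keep almost all of the variance, which is the situation PCA is meant to exploit.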

## ---

## Data Scenario:
## winery collected information on 
## using a clustering technique, 3 customer segments were created
## 3 types of wine for each customer segment
## the business owner can take this information plus the customer segments and build a
## classification model
## the wine owner can then predict which new wines to recommend to each customer segment
## a visual look is not possible with 13 independent variables
## we need a dimensionality reduction technique to extract the 2-3 independent variables with the most impact
## use PCA to extract the 2-3 most influential ones
## the extracted features are called principal components

## ---

In [2]:
# Importing the libraries
import pandas as pd #data
import numpy as np #mathematics
import os
#plotting packages
import matplotlib.pyplot as plt #plotting charts
import seaborn as sns
sns.set()
%matplotlib inline
plt.rcParams['figure.figsize'] = 10,10
#ignore warnings
import warnings
warnings.filterwarnings('ignore') 

In [3]:
## Data Preprocessing

# Importing the dataset
dataset = pd.read_csv('Wine.csv')

## independent variable
x = dataset.iloc[:, 0:13].values

## dependent variable
y = dataset.iloc[:, 13].values

In [4]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)


In [5]:
# Feature Scaling
## should be applied before PCA, because PCA is sensitive to the scale of each variable
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
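
A small sketch of why scaling comes before PCA, using NumPy only. The two feature columns below are made-up numbers on very different scales, not values from `Wine.csv`. The manual standardization uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses, and the test rows reuse the training mean and standard deviation, mirroring `fit_transform` on the training set and `transform` on the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical features on very different scales.
train = np.column_stack([rng.normal(13, 1, 100), rng.normal(750, 300, 100)])
test = np.column_stack([rng.normal(13, 1, 20), rng.normal(750, 300, 20)])

# Learn the scaling parameters from the training set only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # ddof=0, the population standard deviation

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma  # reuse training statistics, no refitting

# Share of total variance carried by each feature, before and after scaling.
share_raw = train.var(axis=0) / train.var(axis=0).sum()
share_scaled = train_scaled.var(axis=0) / train_scaled.var(axis=0).sum()
print(share_raw)     # dominated by the large-scale feature
print(share_scaled)  # equal shares once both features have unit variance
```

Without scaling, the direction of maximum variance is essentially the large-scale feature, so the principal components would reflect units of measurement rather than structure in the data.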

In [None]:
# Applying PCA
from sklearn.decomposition import PCA
## n_components is the number of principal components to keep (the ones that explain the most variance)
pca = PCA(n_components = 2) #create an object of the PCA class
x_train = pca.fit_transform(x_train)
x_test = pca.transform(x_test)
explained_variance = pca.explained_variance_ratio_ #an attribute of PCA class
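
One way to decide how many components to keep is to fit PCA with `n_components=None` first and inspect the explained variance ratios. The sketch below assumes scikit-learn's bundled `load_wine` dataset as a stand-in for `Wine.csv` (both have 13 feature columns, but they are not guaranteed to be identical), and the `_demo` variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's wine dataset, 13 features.
X_demo, y_demo = load_wine(return_X_y=True)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
X_tr_demo = StandardScaler().fit_transform(X_tr_demo)

# Keep every component so the full variance breakdown is visible.
pca_all = PCA(n_components=None)
pca_all.fit(X_tr_demo)

ratios = pca_all.explained_variance_ratio_
cumulative = np.cumsum(ratios)
for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.3f} (cumulative {c:.3f})")
```

The cumulative column shows how much variance is retained for each choice of `n_components`; two components are chosen in this notebook mainly so the result can be plotted.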

In [None]:
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(x_train, y_train)

In [None]:
# Predicting the Test set results
y_pred = classifier.predict(x_test)

In [None]:
# Making the Confusion Matrix to evaluate the model
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

In [None]:
## Observation
## the confusion matrix has 3 classes (3 x 3)
## the diagonal contains the correct predictions
## 14 correct predictions of customer segment 1
## 15 correct predictions of customer segment 2
## 6 correct predictions of customer segment 3
## only 1 of the 36 test observations is predicted incorrectly

In [None]:
## Accuracy check
35/36 #97.2%
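
Rather than typing the counts by hand, accuracy can be computed from the confusion matrix as the sum of the diagonal divided by the total. The labels below are toy values chosen for illustration, not the notebook's predictions; in the notebook the same calls would take `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels (hypothetical), with one mistake out of six.
y_true_toy = [1, 1, 2, 2, 3, 3]
y_pred_toy = [1, 1, 2, 1, 3, 3]

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
# Correct predictions sit on the diagonal, so accuracy = trace / total.
acc_from_cm = np.trace(cm_toy) / cm_toy.sum()
acc_direct = accuracy_score(y_true_toy, y_pred_toy)
print(cm_toy)
print(acc_from_cm, acc_direct)
```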

In [None]:
# Visualising the Training set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = np.meshgrid(np.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('darkred', 'green', 'blue'))) # one colour per class, so 3 colours for 3 classes
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('Logistic Regression (Training set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()

In [None]:
# Visualising the Test set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = np.meshgrid(np.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                color = ListedColormap(('red', 'green', 'blue'))(i), label = j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()
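
The two plotting cells build the same prediction grid. As a sketch, that step could be factored into a helper so the training and test plots share it. The names `decision_grid` and `toy_predict` are hypothetical; `toy_predict` only stands in for `classifier.predict` so the example runs without a fitted model.

```python
import numpy as np

def decision_grid(predict, X, step=0.01, margin=1.0):
    """Evaluate `predict` on a dense grid covering the two columns of X."""
    x1, x2 = np.meshgrid(
        np.arange(X[:, 0].min() - margin, X[:, 0].max() + margin, step),
        np.arange(X[:, 1].min() - margin, X[:, 1].max() + margin, step),
    )
    # One row per grid point, with columns ordered as (PC1, PC2).
    grid_points = np.column_stack([x1.ravel(), x2.ravel()])
    z = predict(grid_points).reshape(x1.shape)
    return x1, x2, z

def toy_predict(points):
    # Stand-in for classifier.predict: class 2 where PC1 > 0, else class 1.
    return (points[:, 0] > 0).astype(int) + 1

X_toy = np.array([[-1.0, 0.0], [1.0, 0.5], [0.5, -0.5]])
x1, x2, z = decision_grid(toy_predict, X_toy, step=0.1)
print(x1.shape, z.shape)
```

With the fitted model, `decision_grid(classifier.predict, x_set)` would return arrays that can be passed straight to `plt.contourf(x1, x2, z, ...)`.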