Created by: Sangwook Cheon

Date: Dec 31, 2018

This is a step-by-step guide to Dimensionality Reduction, which I created for reference. I added some useful notes along the way to clarify things. This notebook's content is from the A-Z Datascience course, and I hope it will be useful to those who want to review the material covered, or to anyone who wants to learn the basics of Dimensionality Reduction.

## Content:
### 1. Principal Component Analysis (PCA)
### 2. Linear Discriminant Analysis (LDA)
### 3. Kernel PCA
--------
These are all part of Feature Extraction. Dimensionality reduction is used to reduce the number of independent variables. In each example below, the data is reduced to two extracted features; for PCA, these are the two that explain the most variance.

### Note:
PCA and LDA are used for **linear** problems, and Kernel PCA is used for **non-linear** problems.


# 1. Principal Component Analysis (PCA)
PCA reduces the dimensions of a d-dimensional dataset by projecting it onto a k-dimensional subspace (where k < d). The extracted features, which serve as the new independent variables, are called principal components.
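
To make the projection concrete, here is a minimal NumPy sketch of what PCA computes, assuming the textbook approach of eigendecomposing the covariance matrix. The function name `pca_project` and the toy data are hypothetical and only for illustration; scikit-learn's `PCA` is based on SVD rather than this exact procedure.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X (n samples, d features) onto the top-k principal components."""
    # Center each feature so the covariance is measured around the mean.
    X_centered = X - X.mean(axis=0)
    # d x d covariance matrix of the features.
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest eigenvalues
    components = eigvecs[:, order]          # d x k projection matrix
    explained_ratio = eigvals[order] / eigvals.sum()
    return X_centered @ components, explained_ratio

# Hypothetical toy data: 100 samples, 5 features, one feature stretched on purpose.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))
X_toy[:, 0] *= 10.0

Z_pca, ratio = pca_project(X_toy, k=2)
print(Z_pca.shape)  # (100, 2)
print(ratio)        # share of variance captured by each kept component
```

Because one toy feature has a much larger spread than the rest, the first component should capture most of the variance, which is also why the notebook scales the features before applying PCA.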

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dataset = pd.read_csv('../input/winedata1/Wine.csv')
dataset.head()

### Dataset overview
Based on the independent variables describing each wine, the algorithm should predict which customer segment a new wine belongs to, so that it can be recommended to that segment. Note that PCA does not pick 2 of the original variables; it extracts 2 new features (principal components) from the 13 independent variables present right now.

In [None]:
#data preparation
x = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

#feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

With the data preprocessed, we can now apply PCA.

In [None]:
#Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = None)
#n_components: number of extracted features (independent variables) to keep.
#None keeps all components, since we do not yet know how many we need
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
#percentage of variance explained by each of the principal components that we extracted

explained_variance

Since we had 13 independent variables, PCA extracted 13 principal components, ordered from the one that explains the most variance to the one that explains the least. Therefore, if we keep one component, we retain 37 % of the variance. If two, 37 % + 19 % = 56 %. In this case, taking two components would be sufficient.
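
The running total above can be computed with `np.cumsum`. The sketch below uses hypothetical ratios, not the notebook's output; in practice the array would come from `pca.explained_variance_ratio_`. The helper name `components_needed` and the threshold are assumptions for illustration.

```python
import numpy as np

# Hypothetical ratios for illustration only (NOT the values from the Wine dataset).
ratios = np.array([0.5, 0.3, 0.1, 0.06, 0.04])

# Running total of explained variance as components are added one by one.
cumulative = np.cumsum(ratios)

def components_needed(cumulative, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold.

    Assumes threshold <= cumulative[-1]; otherwise no component count qualifies.
    """
    # argmax returns the first index where the condition is True.
    return int(np.argmax(cumulative >= threshold)) + 1

k = components_needed(cumulative, threshold=0.75)
print(cumulative)
print(k)  # 2
```

Choosing the threshold is a judgment call; the notebook keeps two components.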

Now that the number of n_components is set, we can redo the PCA process.

In [None]:
# --------------------------------------------------------------------------------------------------
x = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

#feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#----------------------------------------------------------------------------------------------------
#Doing this part again just because X_train and X_test are already transformed from the previous step

pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

X_train[:5, :]

Now we have only two independent variables! Let's apply a Logistic Regression model to the dataset we prepared using PCA. If you want to learn about Logistic Regression and other classification techniques, [please refer to this kernel.](https://www.kaggle.com/sangwookchn/classification-techniques-using-scikit-learn)

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, Y_train)
y_pred = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, y_pred)

cm

The numbers on the diagonal of the confusion matrix are the correct predictions, and the off-diagonal numbers are the errors. Judging by the diagonal, the model performs well. Let's move on to the next one.
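
As a small sketch of how to read a confusion matrix numerically, the example below derives accuracy and per-class recall from the diagonal. The matrix `cm_example` is hypothetical and is not the result produced by this notebook; it follows scikit-learn's convention of true classes in rows and predicted classes in columns.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class).
# These numbers are made up for illustration and are NOT the notebook's results.
cm_example = np.array([
    [5, 1, 0],
    [0, 4, 1],
    [0, 0, 6],
])

correct = np.trace(cm_example)   # sum of the diagonal = correct predictions
total = cm_example.sum()         # all predictions
accuracy = correct / total

# Recall per class: correct predictions divided by the number of true samples in that class.
recall_per_class = np.diag(cm_example) / cm_example.sum(axis=1)

print(accuracy)
print(recall_per_class)
```

The same two lines can be applied to the `cm` variable computed above.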

# 2. Linear Discriminant Analysis (LDA)
From the n independent variables of the dataset, LDA extracts p ≤ n new independent variables that best separate the classes of the dependent variable. In scikit-learn, p can be at most min(n, number of classes - 1). While PCA is unsupervised, LDA is supervised, as it extracts features in relation to the dependent variable.
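
To illustrate the idea of directions that separate classes, here is a rough NumPy sketch of Fisher's criterion, assuming the classic within-class and between-class scatter formulation. The function name `lda_project` and the toy data are hypothetical, and this is not how scikit-learn's `LinearDiscriminantAnalysis` is implemented internally.

```python
import numpy as np

def lda_project(X, y, k):
    """Project X onto the k directions that best separate the classes in y."""
    classes = np.unique(y)
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_w = np.zeros((d, d))  # within-class scatter
    S_b = np.zeros((d, d))  # between-class scatter
    for c in classes:
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        S_w += (X_c - mean_c).T @ (X_c - mean_c)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_b += len(X_c) * (diff @ diff.T)
    # Directions that maximize between-class scatter relative to within-class
    # scatter are the leading eigenvectors of pinv(S_w) @ S_b.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    eigvals, eigvecs = eigvals.real, eigvecs.real  # drop tiny imaginary round-off
    order = np.argsort(eigvals)[::-1][:k]
    return X @ eigvecs[:, order], eigvals[order]

# Hypothetical toy data: 3 classes, 4 features, 30 samples per class.
rng = np.random.default_rng(0)
means = np.array([[0, 0, 0, 0], [4, 0, 0, 0], [0, 4, 0, 0]], dtype=float)
X_toy = np.vstack([rng.normal(loc=m, size=(30, 4)) for m in means])
y_toy = np.repeat([0, 1, 2], 30)

Z_lda, top_eigvals = lda_project(X_toy, y_toy, k=2)
print(Z_lda.shape)  # (90, 2)
```

With three classes, the between-class scatter has rank at most two, so only two directions carry class-separating information. This matches the limit of number of classes - 1 components.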

In [None]:
x = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

#feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 2)
X_train = lda.fit_transform(X_train, Y_train) #Y_train is required as this is supervised learning
X_test = lda.transform(X_test)

classifier = LogisticRegression()
classifier.fit(X_train, Y_train)
y_pred = classifier.predict(X_test)

cm = confusion_matrix(Y_test, y_pred)
cm

As you can see, LDA worked better than PCA here. This is because the algorithm found the two new variables that best separate the classes. Finally, let's see how Kernel PCA works.

# 3. Kernel PCA
The previous models are linear. For non-linear problems, Kernel PCA works very well. Kernel PCA is not much different from PCA: it uses the kernel trick to implicitly map the dataset to a higher dimension, where the data may become linearly separable, and then extracts principal components there. Let's use a dataset that has non-linear patterns: the Social Network Ads dataset.
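
The kernel trick can be sketched in NumPy as follows, assuming an RBF kernel and the usual steps of building the kernel matrix, centering it, and eigendecomposing it. The function name `rbf_kernel_pca`, the `gamma` value and the toy data are hypothetical; this illustrates the idea and is not scikit-learn's `KernelPCA` implementation.

```python
import numpy as np

def rbf_kernel_pca(X, gamma, k):
    """Kernel PCA with an RBF kernel: returns the k leading projections of the training points."""
    # Pairwise squared Euclidean distances between all samples.
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by round-off
    # RBF (Gaussian) kernel matrix: similarity of every pair of samples.
    K = np.exp(-gamma * sq_dists)
    # Center the kernel matrix, which corresponds to centering in the implicit feature space.
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Symmetric eigendecomposition; eigenvalues are returned in ascending order.
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    order = np.argsort(eigvals)[::-1][:k]
    # Projections of the training points: eigenvectors scaled by sqrt(eigenvalue).
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

# Hypothetical toy data: two concentric rings, which no straight line can separate.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=100)
radii = np.repeat([1.0, 3.0], 50)
X_toy = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

Z_kpca = rbf_kernel_pca(X_toy, gamma=1.0, k=2)
print(Z_kpca.shape)  # (100, 2)
```

Note that the high-dimensional mapping is never computed explicitly; only the n x n kernel matrix is needed, which is the point of the kernel trick.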

### Dataset Overview

In [None]:
dataset2 = pd.read_csv('../input/social-network-ads/Social_Network_Ads.csv')
dataset2.head()

In this dataset, the company is trying to see whether or not a customer will buy a product according to gender, age and estimated salary. The code below uses only the columns at positions 2 and 3 as features.

In [None]:
x = dataset2.iloc[:, [2,3]].values
y = dataset2.iloc[:, 4].values

X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# ------------ Please pay attention to this following part ---------------
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components = 2, kernel = 'rbf') 
#'rbf' is the Gaussian (radial basis function) kernel, which implicitly maps values to a higher dimension

X_train = kpca.fit_transform(X_train)
X_test = kpca.transform(X_test)
# ------------------------------------------------------------------------

classifier = LogisticRegression()
classifier.fit(X_train, Y_train)
y_pred = classifier.predict(X_test)

cm = confusion_matrix(Y_test, y_pred)
cm

The algorithm produced very good results!

-------------
Thank you for reading this kernel. If you found this helpful, I would appreciate it if you upvoted the kernel or left a short comment.