# Dimension Reduction with PCA
Principal Component Analysis (PCA) is a linear dimensionality reduction technique that extracts information from a high-dimensional space by projecting the data onto a lower-dimensional subspace. It keeps the directions along which the data varies the most and discards the directions with little variation.

Dimensions are simply the features that represent the data. For example, a 28 × 28 image has 784 picture elements (pixels), and these are the dimensions, or features, that together represent the image.
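
As a quick illustration of the pixel count, the sketch below flattens a placeholder 28 × 28 array into a single feature vector. The all-zero image is hypothetical data used only to show the shapes.

```python
import numpy as np

# Hypothetical 28 x 28 grayscale image (all zeros, placeholder data only).
image = np.zeros((28, 28))

# Flattening turns the 2-D pixel grid into one feature vector per image.
features = image.reshape(-1)

print(features.shape)  # (784,)
```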

One important thing to note about PCA is that it is an unsupervised dimensionality reduction technique: it works only from the correlations between features and never uses labels. Because of this, it can help group similar data points without any supervision, and the later sections of this tutorial show how to apply it in practice using Python.

According to Wikipedia, PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components.

So, given a set of x correlated variables over y samples, you obtain a set of z uncorrelated principal components over the same y samples, where z is no larger than x.
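
To make the idea of "uncorrelated principal components" concrete, here is a minimal NumPy sketch that computes PCA by eigendecomposing the covariance matrix. The data is synthetic and the names (`data`, `components`, `z`) are illustrative assumptions, not part of the wine example that follows; scikit-learn's `PCA` uses a different numerical route internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 200 samples, x = 3 variables, two of them strongly correlated.
base = rng.normal(size=(200, 1))
data = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center the data, then eigendecompose its covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; sort them by descending variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep z = 2 components and project the samples onto them.
z = 2
components = centered @ eigenvectors[:, :z]

# The covariance of the projected data is diagonal (up to rounding error),
# which is what "linearly uncorrelated" means here.
component_cov = np.cov(components, rowvar=False)
print(components.shape)
print(np.round(component_cov, 6))
```

The diagonal entries of `component_cov` equal the two largest eigenvalues, which are the variances captured by each component.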

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


df = pd.read_csv('Wine.csv')
df

,Alcohol,Malic_Acid,Ash,Ash_Alcanity,Magnesium,Total_Phenols,Flavanoids,Nonflavanoid_Phenols,Proanthocyanins,Color_Intensity,Hue,OD280,Proline,Customer_Segment
0,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065,1
1,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050,1
2,13.16,2.36,2.67,18.6,101,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185,1
3,14.37,1.95,2.50,16.8,113,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480,1
4,13.24,2.59,2.87,21.0,118,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,13.71,5.65,2.45,20.5,95,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740,3
174,13.40,3.91,2.48,23.0,102,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750,3
175,13.27,4.28,2.26,20.0,120,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835,3
176,13.17,2.59,2.37,20.0,120,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840,3


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Alcohol               178 non-null    float64
 1   Malic_Acid            178 non-null    float64
 2   Ash                   178 non-null    float64
 3   Ash_Alcanity          178 non-null    float64
 4   Magnesium             178 non-null    int64  
 5   Total_Phenols         178 non-null    float64
 6   Flavanoids            178 non-null    float64
 7   Nonflavanoid_Phenols  178 non-null    float64
 8   Proanthocyanins       178 non-null    float64
 9   Color_Intensity       178 non-null    float64
 10  Hue                   178 non-null    float64
 11  OD280                 178 non-null    float64
 12  Proline               178 non-null    int64  
 13  Customer_Segment      178 non-null    int64  
dtypes: float64(11), int64(3)
memory usage: 19.6 KB


In [3]:
# Separate the 13 feature columns from the class label.
X = df.drop('Customer_Segment', axis=1)
y = df['Customer_Segment']

In [4]:
X.shape

(178, 13)

In [5]:
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance; PCA is sensitive to feature scale.
sc = StandardScaler()
X = sc.fit_transform(X)

In [6]:
X.shape

(178, 13)
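
The effect of the scaling step can be checked on a small sample. The sketch below uses the `Alcohol` and `Proline` values from the first five rows of the table above, two columns on very different scales; the variable names `raw` and `scaled` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Alcohol and Proline values from the first five rows shown above.
raw = np.array([
    [14.23, 1065.0],
    [13.20, 1050.0],
    [13.16, 1185.0],
    [14.37, 1480.0],
    [13.24, 735.0],
])

scaled = StandardScaler().fit_transform(raw)

# After scaling, every column has mean 0 and (population) standard deviation 1,
# so no single feature dominates the variance that PCA looks at.
print(np.round(scaled.mean(axis=0), 6))
print(np.round(scaled.std(axis=0), 6))
```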

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

In [8]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=90)

**Without PCA**

In [9]:
model = SVC()

model.fit(x_train, y_train)

SVC()

In [10]:
y_pred = model.predict(x_test)
y_pred

array([1, 1, 2, 2, 1, 2, 3, 2, 1, 2, 1, 3, 2, 3, 2, 2, 3, 1, 3, 1, 1, 3,
       3, 2, 1, 1, 2, 3, 3, 3, 1, 1, 2, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 2,
       2, 2, 2, 1, 1, 2, 2, 1, 3, 2], dtype=int64)

In [11]:
from sklearn.metrics import confusion_matrix, accuracy_score

In [12]:
confusion_matrix(y_test, y_pred)

array([[18,  0,  0],
       [ 0, 21,  1],
       [ 0,  1, 13]], dtype=int64)

In [13]:
accuracy_score(y_test, y_pred)

0.9629629629629629

**With PCA**

In [14]:
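from sklearn.decomposition import PCA

# A float between 0 and 1 keeps the fewest components that explain at least that share of the variance.
pca = PCA(n_components=0.9)

pca.fit(X)

X = pca.transform(X)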

In [15]:
X.shape

(178, 8)

In [16]:
pca.n_components_

8

In [17]:
pca.explained_variance_ratio_

array([0.36198848, 0.1920749 , 0.11123631, 0.0706903 , 0.06563294,
       0.04935823, 0.04238679, 0.02680749])
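
The ratios above explain why `PCA(0.9)` settled on 8 components. The short sketch below accumulates them and finds the first point where the running total reaches 90%; the array is copied from the output above, and `ratios`, `cumulative`, and `k` are illustrative names.

```python
import numpy as np

# explained_variance_ratio_ values copied from the output above.
ratios = np.array([0.36198848, 0.1920749, 0.11123631, 0.0706903,
                   0.06563294, 0.04935823, 0.04238679, 0.02680749])

# Running total of variance explained by the first 1, 2, ..., 8 components.
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share is at least 0.9.
k = int(np.argmax(cumulative >= 0.9)) + 1

print(np.round(cumulative, 4))
print(k)  # 8
```

Seven components explain about 89.3% of the variance, just short of the threshold, so the eighth is needed to cross 90%.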

In [18]:
df = pd.DataFrame(data=X)
df

,0,1,2,3,4,5,6,7
0,3.316751,-1.443463,-0.165739,-0.215631,0.693043,-0.223880,0.596427,0.065139
1,2.209465,0.333393,-2.026457,-0.291358,-0.257655,-0.927120,0.053776,1.024416
2,2.516740,-1.031151,0.982819,0.724902,-0.251033,0.549276,0.424205,-0.344216
3,3.757066,-2.756372,-0.176192,0.567983,-0.311842,0.114431,-0.383337,0.643593
4,1.008908,-0.869831,2.026688,-0.409766,0.298458,-0.406520,0.444074,0.416700
...,...,...,...,...,...,...,...,...
173,-3.370524,-2.216289,-0.342570,1.058527,-0.574164,-1.108788,0.958416,-0.146097
174,-2.601956,-1.757229,0.207581,0.349496,0.255063,-0.026465,0.146894,-0.552427
175,-2.677839,-2.760899,-0.940942,0.312035,1.271355,0.273068,0.679235,0.047024
176,-2.387017,-2.297347,-0.550696,-0.688285,0.813955,1.178783,0.633975,0.390829


In [19]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

In [20]:
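# This split holds out 20% of the rows (36 samples); the run without PCA held out 30% (54 samples).
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=90)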

In [21]:
model = SVC()

model.fit(x_train, y_train)

SVC()

In [22]:
y_pred = model.predict(x_test)
y_pred

array([1, 1, 2, 2, 1, 2, 3, 2, 1, 3, 1, 3, 2, 3, 2, 2, 3, 1, 3, 1, 1, 3,
       3, 2, 1, 1, 2, 3, 3, 3, 1, 2, 2, 2, 3, 1], dtype=int64)

In [23]:
from sklearn.metrics import confusion_matrix, accuracy_score

In [24]:
confusion_matrix(y_test, y_pred)

array([[12,  1,  0],
       [ 0, 11,  1],
       [ 0,  0, 11]], dtype=int64)

In [25]:
accuracy_score(y_test, y_pred)

0.9444444444444444
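
The two runs above use different held-out fractions (`test_size=0.3` and `test_size=0.2`) and fit the scaler and PCA on all rows before splitting, so their accuracies are not directly comparable. One possible way to compare them on equal terms is sketched below with scikit-learn pipelines. It assumes that `sklearn.datasets.load_wine` can stand in for `Wine.csv` (its class labels are 0 to 2 rather than 1 to 3), and the names `baseline` and `with_pca` are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled wine dataset stands in for Wine.csv (178 rows, 13 features).
X_wine, y_wine = load_wine(return_X_y=True)

# One split shared by both models, so the test sets are identical.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_wine, y_wine, test_size=0.3, random_state=90
)

# Pipelines fit the scaler (and PCA) on the training rows only,
# then apply the learned transform to the test rows.
baseline = make_pipeline(StandardScaler(), SVC())
with_pca = make_pipeline(StandardScaler(), PCA(n_components=0.9), SVC())

baseline.fit(x_tr, y_tr)
with_pca.fit(x_tr, y_tr)

acc_baseline = baseline.score(x_te, y_te)
acc_pca = with_pca.score(x_te, y_te)
n_kept = with_pca.named_steps["pca"].n_components_

print(f"without PCA: {acc_baseline:.3f}")
print(f"with PCA:    {acc_pca:.3f} ({n_kept} components)")
```

Because PCA is now fitted on the training rows alone, the number of components it keeps may differ slightly from the 8 found on the full dataset.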

# Great Work!