### Principal Component Analysis - Step by Step


* Standardize the data.
* Obtain the eigenvectors and eigenvalues from the covariance matrix or the correlation matrix, or alternatively via singular value decomposition (SVD).
* Sort the eigenvalues in descending order and keep the p eigenvectors that correspond to the p largest eigenvalues, which reduces the number of variables in the dataset from m to p (p < m).
* Construct the projection matrix W from the p selected eigenvectors.
* Transform the original dataset X through W to obtain Y, the data in the new subspace of dimension p.
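
The five steps above can be sketched end to end with NumPy alone. This is only an illustration under stated assumptions: the random input matrix, the helper name `pca_project`, and the use of `np.linalg.eigh` are choices made for this example and are not part of the notebook that follows.

```python
import numpy as np


def pca_project(X, p):
    """Project X (n samples x m features) onto its first p principal components."""
    # 1. Standardize: zero mean and unit variance per feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Eigendecomposition of the covariance matrix (symmetric, so eigh applies)
    cov = np.cov(X_std, rowvar=False)
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    # 3. Sort the eigenvalues in descending order and keep the p largest
    order = np.argsort(eig_vals)[::-1][:p]
    # 4. Projection matrix W: the p selected eigenvectors as columns
    W = eig_vecs[:, order]
    # 5. Transform the data into the p-dimensional subspace
    Y = X_std @ W
    return Y, W, eig_vals[order]


# Hypothetical data: 100 samples, 4 features, two of them correlated
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
X_demo[:, 1] += 2.0 * X_demo[:, 0]

Y_demo, W_demo, top_vals = pca_project(X_demo, p=2)
print(Y_demo.shape)  # (100, 2)
```

The variance of each column of `Y_demo` equals the corresponding eigenvalue, which is why the eigenvalues can be read as "variance explained".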

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [2]:
# pip install plotly

In [3]:
import chart_studio
import chart_studio.plotly as py
import plotly.graph_objects as go
import plotly.tools as tls


To publish plots online with the plotly library through `chart_studio`, as this notebook does, you have to create an account at https://plotly.com/

In [4]:
chart_studio.tools.set_credentials_file(username='AlejandroRubiodeCarranza', api_key='d1zea8VnRJlVRg7jkVjP')

In [5]:
df= pd.read_csv(r"C:\Users\Usuario\Desktop\Anaconda\dataset\python-ml-course-master\datasets\iris\iris.csv")

In [6]:
df.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
Sepal.Length    150 non-null float64
Sepal.Width     150 non-null float64
Petal.Length    150 non-null float64
Petal.Width     150 non-null float64
Species         150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


We split the dataset into X and y to separate the predictor variables (the four measurements) from the variable to predict (the target, Species).

In [8]:
X= df.iloc[:,0:4].values
y= df.iloc[:,4].values

In [9]:
traces = []
# Show the legend only for the last feature so each species appears once
legend = {0: False, 1: False, 2: False, 3: True}

colors = {'setosa': 'rgb(255,127,20)',
          'versicolor': 'rgb(31,220,120)',
          'virginica': 'rgb(44,50,180)'}

# One overlaid histogram per feature (column) and species
for col in range(4):
    for key in colors:
        traces.append(go.Histogram(x=X[y == key, col], opacity=0.7,
                                   xaxis="x%s" % (col + 1),
                                   marker=dict(color=colors[key]),
                                   name=key, showlegend=legend[col]))

# Plain lists and dicts replace the deprecated go.Data, go.Marker, go.XAxis and go.YAxis
layout = go.Layout(barmode="overlay",
                   xaxis=dict(domain=[0, 0.25], title="Sepal length (cm)"),
                   xaxis2=dict(domain=[0.3, 0.5], title="Sepal width (cm)"),
                   xaxis3=dict(domain=[0.55, 0.75], title="Petal length (cm)"),
                   xaxis4=dict(domain=[0.8, 1.0], title="Petal width (cm)"),
                   yaxis=dict(title="Number of specimens"),
                   title="Distribution of the features of the different Iris flowers")

fig = go.Figure(data=traces, layout=layout)
py.iplot(fig)



In [10]:
from sklearn.preprocessing import StandardScaler

With StandardScaler I standardize each column of X to zero mean and unit variance.

In [11]:
X_std=StandardScaler().fit_transform(X)
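
StandardScaler divides by the population standard deviation (`ddof=0`), while the covariance computed further down divides by n - 1. That is why the diagonal of the covariance matrix printed below is 1.0067 (150/149) rather than exactly 1. The sketch below reproduces this effect by hand; the random data is a hypothetical stand-in for the iris features.

```python
import numpy as np

# Hypothetical stand-in for the iris features: 150 samples, 4 columns
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(150, 4))

# Standardize by hand with the population standard deviation (ddof=0),
# which is the convention StandardScaler follows
X_demo_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0, ddof=0)

n = X_demo_std.shape[0]
# The sample covariance divides by n - 1, so each variance becomes n / (n - 1)
diag = np.diag(np.cov(X_demo_std, rowvar=False))
print(diag)          # every entry is close to 150 / 149
print(n / (n - 1))
```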

### 1 - Eigendecomposition: eigenvalues and eigenvectors

#### Using the Covariance Matrix

In [12]:
from IPython.display import display,Math,Latex

In [13]:
mean_vect= np.mean(X_std,axis=0)
mean_vect

array([-4.73695157e-16, -7.81597009e-16, -4.26325641e-16, -4.73695157e-16])

In [14]:
cov_matrix=(X_std-mean_vect).T.dot((X_std-mean_vect))/(X_std.shape[0]-1)

print("The covariance matrix is \n",cov_matrix)

The covariance matrix is 
 [[ 1.00671141 -0.11835884  0.87760447  0.82343066]
 [-0.11835884  1.00671141 -0.43131554 -0.36858315]
 [ 0.87760447 -0.43131554  1.00671141  0.96932762]
 [ 0.82343066 -0.36858315  0.96932762  1.00671141]]


In [15]:
np.cov(X_std.T)

array([[ 1.00671141, -0.11835884,  0.87760447,  0.82343066],
       [-0.11835884,  1.00671141, -0.43131554, -0.36858315],
       [ 0.87760447, -0.43131554,  1.00671141,  0.96932762],
       [ 0.82343066, -0.36858315,  0.96932762,  1.00671141]])
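
The covariance matrix is one route to the eigenvalues; singular value decomposition of the standardized data is another. As a rough sketch on hypothetical random data, the eigenvalues of the covariance matrix should equal the squared singular values divided by n - 1:

```python
import numpy as np

# Hypothetical standardized data: 150 samples, 4 features
rng = np.random.default_rng(1)
A = rng.normal(size=(150, 4))
A_std = (A - A.mean(axis=0)) / A.std(axis=0)

n = A_std.shape[0]

# Route 1: eigenvalues of the covariance matrix, largest first
cov_vals = np.sort(np.linalg.eigvalsh(np.cov(A_std, rowvar=False)))[::-1]

# Route 2: singular values of the centered data matrix
# (np.linalg.svd returns them already sorted in descending order)
_, s, Vt = np.linalg.svd(A_std, full_matrices=False)
svd_vals = s ** 2 / (n - 1)

print(np.allclose(cov_vals, svd_vals))  # True
```

The rows of `Vt` play the role of the eigenvectors, up to a possible sign flip.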

In [16]:
eig_vals,eig_vectors= np.linalg.eig(cov_matrix)
print("Eigenvalues \n",eig_vals)
print("Eigenvectors \n",eig_vectors)

Eigenvalues 
 [2.93808505 0.9201649  0.14774182 0.02085386]
Eigenvectors 
 [[ 0.52106591 -0.37741762 -0.71956635  0.26128628]
 [-0.26934744 -0.92329566  0.24438178 -0.12350962]
 [ 0.5804131  -0.02449161  0.14212637 -0.80144925]
 [ 0.56485654 -0.06694199  0.63427274  0.52359713]]
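
Because a covariance matrix is symmetric, `np.linalg.eigh` is a possible alternative to `np.linalg.eig`: it returns real eigenvalues in ascending order. The sketch below applies it to the covariance matrix copied from the output above (rounded to 8 decimals) and checks the defining relation between the matrix, its eigenvectors, and its eigenvalues.

```python
import numpy as np

# Covariance matrix copied from the notebook output (rounded to 8 decimals)
cov_matrix = np.array([
    [ 1.00671141, -0.11835884,  0.87760447,  0.82343066],
    [-0.11835884,  1.00671141, -0.43131554, -0.36858315],
    [ 0.87760447, -0.43131554,  1.00671141,  0.96932762],
    [ 0.82343066, -0.36858315,  0.96932762,  1.00671141],
])

# eigh assumes a symmetric matrix and returns eigenvalues in ascending order
vals, vecs = np.linalg.eigh(cov_matrix)

# Flip to descending order; the eigenvectors are the COLUMNS of `vecs`
vals, vecs = vals[::-1], vecs[:, ::-1]
print(np.round(vals, 4))

# Each column v satisfies cov_matrix @ v = lambda * v
print(np.allclose(cov_matrix @ vecs, vecs * vals))  # True
```

The signs of individual eigenvectors may differ from the ones printed above, since an eigenvector and its negative are equally valid.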


### 2 - The principal components

In [17]:
# The eigenvectors are the columns of eig_vectors, so iterate over the transpose
for ev in eig_vectors.T:
    print("The length of the eigenvector is: ",np.linalg.norm(ev))

The length of the eigenvector is:  0.9999999999999997
The length of the eigenvector is:  1.0000000000000002
The length of the eigenvector is:  1.0
The length of the eigenvector is:  0.9999999999999997


In [19]:
eigen_pairs= [(np.abs(eig_vals[i]),eig_vectors[:,i])for i in range(len(eig_vals))]

eigen_pairs

[(2.938085050199993,
  array([ 0.52106591, -0.26934744,  0.5804131 ,  0.56485654])),
 (0.9201649041624873,
  array([-0.37741762, -0.92329566, -0.02449161, -0.06694199])),
 (0.1477418210449481,
  array([-0.71956635,  0.24438178,  0.14212637,  0.63427274])),
 (0.020853862176462803,
  array([ 0.26128628, -0.12350962, -0.80144925,  0.52359713]))]

I sort the (eigenvalue, eigenvector) pairs by eigenvalue, from highest to lowest:

In [22]:
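# Sort the (eigenvalue, eigenvector) pairs by eigenvalue, largest first
eigen_pairs.sort(key=lambda pair: pair[0], reverse=True)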
eigen_pairs

[(2.938085050199993,
  array([ 0.52106591, -0.26934744,  0.5804131 ,  0.56485654])),
 (0.9201649041624873,
  array([-0.37741762, -0.92329566, -0.02449161, -0.06694199])),
 (0.1477418210449481,
  array([-0.71956635,  0.24438178,  0.14212637,  0.63427274])),
 (0.020853862176462803,
  array([ 0.26128628, -0.12350962, -0.80144925,  0.52359713]))]

In [23]:
print("Eigenvalues in descending order: \n")

for ap in eigen_pairs:
    print(ap[0])

Eigenvalues in descending order: 

2.938085050199993
0.9201649041624873
0.1477418210449481
0.020853862176462803
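
`np.linalg.eig` does not guarantee any particular ordering of the eigenvalues, so an explicit sort is the safe way to get them in descending order. A possible alternative to sorting a list of pairs is `np.argsort`, which keeps the eigenvectors as matrix columns. In the sketch below the eigenvalues are the ones from this notebook in a deliberately shuffled order, and the identity matrix is a hypothetical placeholder for the eigenvector matrix.

```python
import numpy as np

# Eigenvalues in a deliberately shuffled order (values taken from the notebook)
eig_vals_demo = np.array([0.14774182, 2.93808505, 0.02085386, 0.9201649])
# Hypothetical eigenvector matrix: column i belongs to eig_vals_demo[i]
eig_vecs_demo = np.eye(4)

# Indices that sort the eigenvalues from largest to smallest
order = np.argsort(eig_vals_demo)[::-1]

sorted_vals = eig_vals_demo[order]
sorted_vecs = eig_vecs_demo[:, order]  # reorder columns, not rows

print(order)  # [1 3 0 2]
print(sorted_vals)
```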


In [24]:
total_sum= sum(eig_vals)

var_exp=[(i/total_sum)*100 for i in sorted(eig_vals,reverse=True)]

cum_var_exp= np.cumsum(var_exp)

In [25]:
var_exp

[72.96244541329986, 22.850761786701774, 3.6689218892828794, 0.5178709107154932]

In [26]:
labels = ["PC %s" % i for i in range(1, 5)]
plot1 = go.Bar(x=labels, y=var_exp, showlegend=False)
plot2 = go.Scatter(x=labels, y=cum_var_exp, showlegend=True,
                   name="Cumulative explained variance (%)")

# Plain lists and dicts replace the deprecated go.Data, go.XAxis and go.YAxis
layout = go.Layout(xaxis=dict(title="Principal components"),
                   yaxis=dict(title="Percentage of explained variance"),
                   title="Percentage of variability explained by each principal component")

fig = go.Figure(data=[plot1, plot2], layout=layout)
py.iplot(fig)




This chart shows that the first two principal components together explain about 95.8% of the total variance in the dataset (72.96% + 22.85%), so the remaining two components carry very little information. Note that each principal component is a linear combination of the four original features, not one of the original variables. Although predictions may be slightly worse when a model uses only two components instead of the four features, the lower dimensionality reduces the computational cost of the models that use them.
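
The choice of two components can also be made programmatically by picking the smallest number of components whose cumulative explained variance reaches a target. In the sketch below the eigenvalues are copied from the notebook output, while the 95% threshold is an assumption chosen for illustration.

```python
import numpy as np

# Eigenvalues printed earlier in the notebook, already in descending order
eig_vals_sorted = np.array([2.93808505, 0.9201649, 0.14774182, 0.02085386])

# Percentage of the total variance carried by each component
var_exp_pct = eig_vals_sorted / eig_vals_sorted.sum() * 100
cum_var_pct = np.cumsum(var_exp_pct)

# Hypothetical threshold: keep enough components to reach 95% of the variance
threshold = 95.0
k = int(np.searchsorted(cum_var_pct, threshold) + 1)

print(np.round(cum_var_pct, 2))
print(k)  # 2
```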

In [27]:
W=np.hstack((eigen_pairs[0][1].reshape(4,1),
                eigen_pairs[1][1].reshape(4,1)))
W

array([[ 0.52106591, -0.37741762],
       [-0.26934744, -0.92329566],
       [ 0.5804131 , -0.02449161],
       [ 0.56485654, -0.06694199]])

In [28]:
Y= X_std.dot(W)
Y

array([[-2.26470281, -0.4800266 ],
       [-2.08096115,  0.67413356],
       [-2.36422905,  0.34190802],
       [-2.29938422,  0.59739451],
       [-2.38984217, -0.64683538],
       [-2.07563095, -1.48917752],
       [-2.44402884, -0.0476442 ],
       [-2.23284716, -0.22314807],
       [-2.33464048,  1.11532768],
       [-2.18432817,  0.46901356],
       [-2.1663101 , -1.04369065],
       [-2.32613087, -0.13307834],
       [-2.2184509 ,  0.72867617],
       [-2.6331007 ,  0.96150673],
       [-2.1987406 , -1.86005711],
       [-2.26221453, -2.68628449],
       [-2.2075877 , -1.48360936],
       [-2.19034951, -0.48883832],
       [-1.898572  , -1.40501879],
       [-2.34336905, -1.12784938],
       [-1.914323  , -0.40885571],
       [-2.20701284, -0.92412143],
       [-2.7743447 , -0.45834367],
       [-1.81866953, -0.08555853],
       [-2.22716331, -0.13725446],
       [-1.95184633,  0.62561859],
       [-2.05115137, -0.24216355],
       [-2.16857717, -0.52714953],
       ...
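
Two properties of the projection Y = X_std · W are worth checking: the new coordinates are uncorrelated, and multiplying back by the transpose of W gives an approximate reconstruction of the standardized data. The sketch below illustrates both on hypothetical synthetic data (not the iris dataset); all names in it are chosen for this example only.

```python
import numpy as np

# Hypothetical data: 4 features driven by 2 underlying factors plus noise
rng = np.random.default_rng(7)
base = rng.normal(size=(200, 2))
noise = 0.1 * rng.normal(size=(200, 4))
X_demo = np.column_stack([base[:, 0], base[:, 1],
                          base[:, 0] + base[:, 1],
                          base[:, 0] - base[:, 1]]) + noise
X_demo_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

vals, vecs = np.linalg.eigh(np.cov(X_demo_std, rowvar=False))
order = np.argsort(vals)[::-1]
W_demo = vecs[:, order[:2]]        # 4 x 2 projection matrix

Y_demo = X_demo_std @ W_demo       # coordinates in the 2-D subspace
X_back = Y_demo @ W_demo.T         # approximate reconstruction in 4-D

# The new coordinates are uncorrelated and their variances are the eigenvalues
print(np.round(np.cov(Y_demo, rowvar=False), 4))

# Fraction of the total variance lost by dropping the last two components
lost = np.sum((X_demo_std - X_back) ** 2) / np.sum(X_demo_std ** 2)
print(round(float(lost), 4))
```

The lost fraction equals the share of the discarded eigenvalues in the eigenvalue total, which ties the reconstruction error back to the explained-variance percentages.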

### 3 - Projecting the data onto the new subspace

In [29]:
results = []

for name in ('setosa', 'versicolor', 'virginica'):
    # Plain dicts replace the deprecated go.Marker and go.Line
    result = go.Scatter(x=Y[y == name, 0], y=Y[y == name, 1],
                        mode="markers", name=name,
                        marker=dict(size=12,
                                    line=dict(color='rgb(220,220,220)', width=0.5),
                                    opacity=0.8))
    results.append(result)

# A 2-D scatter plot takes xaxis/yaxis directly; `scene` only applies to 3-D plots
layout = go.Layout(showlegend=True,
                   xaxis=dict(title="Principal component 1"),
                   yaxis=dict(title="Principal component 2"))

fig = go.Figure(data=results, layout=layout)
py.iplot(fig)



Next, I obtain the principal components with the scikit-learn library.

### Principal Component Analysis with scikit-learn

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import chart_studio
import chart_studio.plotly as py
import plotly.graph_objects as go
import plotly.tools as tls


In [31]:
chart_studio.tools.set_credentials_file(username='AlejandroRubiodeCarranza', api_key='d1zea8VnRJlVRg7jkVjP')

In [32]:
X= df.iloc[:,0:4].values
y= df.iloc[:,4].values

X_std=StandardScaler().fit_transform(X)

In [33]:
from sklearn.decomposition import PCA as sk_pca

In [34]:
k=2 # Number of principal components to keep

acp= sk_pca(n_components=k)

Y=acp.fit_transform(X_std)
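
scikit-learn also reports how much variance each component explains, so the number of components does not have to be guessed. The sketch below assumes the copy of the iris data bundled with scikit-learn (`load_iris`) as a stand-in for the CSV file used above; the variable names are chosen for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# load_iris ships with scikit-learn; it stands in for the CSV used above
X_iris = load_iris().data
X_iris_std = StandardScaler().fit_transform(X_iris)

# Fit all components to inspect how much variance each one explains
pca_full = PCA().fit(X_iris_std)
print(pca_full.explained_variance_ratio_.round(4))

# A float in (0, 1) asks for the fewest components whose cumulative
# explained variance exceeds that share
pca_95 = PCA(n_components=0.95).fit(X_iris_std)
print(pca_95.n_components_)  # 2
```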

In [65]:
results = []
for name in ('setosa', 'versicolor', 'virginica'):
    # Plain dicts replace the deprecated go.Marker and go.Line
    result = go.Scatter(x=Y[y == name, 0], y=Y[y == name, 1],
                        mode="markers", name=name,
                        marker=dict(size=8,
                                    line=dict(color='rgb(225,225,225)', width=0.5),
                                    opacity=0.75))
    results.append(result)

# Plain lists and dicts replace the deprecated go.Data, go.XAxis and go.YAxis
layout = go.Layout(xaxis=dict(title="PC1", showline=False),
                   yaxis=dict(title="PC2", showline=False))

fig = go.Figure(data=results, layout=layout)
py.iplot(fig)




The result is the same whether I compute the principal components manually or with scikit-learn, up to a possible sign flip of each component. The main difference is that with scikit-learn I pass the number of components up front, whereas the manual approach shows the explained variance that justifies that choice. scikit-learn exposes the same information through the `explained_variance_ratio_` attribute of the fitted PCA object.
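
The agreement between the two routes can be checked numerically. The sketch below is an illustration under assumptions: it uses the iris copy bundled with scikit-learn instead of the CSV file, and it compares absolute values because an eigenvector and its negative are equally valid, so the two methods may disagree on the sign of a component.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_iris_std = StandardScaler().fit_transform(load_iris().data)

# Manual route: top-2 eigenvectors of the covariance matrix
vals, vecs = np.linalg.eigh(np.cov(X_iris_std, rowvar=False))
W_manual = vecs[:, np.argsort(vals)[::-1][:2]]
Y_manual = X_iris_std @ W_manual

# scikit-learn route
Y_sklearn = PCA(n_components=2).fit_transform(X_iris_std)

# Compare absolute values to ignore the sign ambiguity of each component
print(np.allclose(np.abs(Y_manual), np.abs(Y_sklearn)))  # True
```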