<h2 align=center> Principal Component Analysis</h2>

### Task 2: Load the Data and Libraries
---

In [3]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [4]:
plt.style.use("ggplot")
plt.rcParams["figure.figsize"] = (12,8)

In [5]:
# data URL: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

In [6]:
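# Note: iris.data has no header row, so read_csv as called here treats the first
# record as the column names. Passing header=None would keep that record as data.
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")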


In [7]:
df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'species']

df.dropna(how="all", inplace=True)

df.head()


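|   | sepal_len | sepal_wid | petal_len | petal_wid | species |
|---|-----------|-----------|-----------|-----------|---------|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |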
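
The table above starts at the record `4.9, 3.0, ...` because `iris.data` has no header line, so `read_csv` uses the first record as column names. The following is a minimal sketch of that behaviour, using the first three records of the file held in an in-memory string (so no download is needed). The variable names `raw`, `df_default` and `df_named` are illustrative only.

```python
import io
import pandas as pd

# First three records of iris.data (the file itself has no header line).
raw = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)
cols = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'species']

# Default behaviour: the first record is consumed as the header row.
df_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every record as data; names= supplies the column labels.
df_named = pd.read_csv(io.StringIO(raw), header=None, names=cols)

print(len(df_default), len(df_named))
```

Passing `header=None, names=cols` to the real `read_csv` call would also make the separate `df.columns = [...]` assignment unnecessary.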


### Task 3: Visualize the Data
---

In [8]:
import plotly.express as px
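# Keep plotly's bundled copy of the dataset in its own variable so that the
# df loaded above (with the sepal_len, ... column names) is not overwritten
iris_px = px.data.iris()
fig = px.scatter_3d(iris_px, x='sepal_length', y='sepal_width', z='petal_width',
                    color='species')
fig.show()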


### Task 4: Standardize the Data
---

In [140]:
# Z-score standardization: Z = (X - mean) / std, so that every feature has mean 0 and standard deviation 1
# Drop the non-numeric species column: it would cause errors in the std and eigen computations, and PCA does not use it
df.drop('species',inplace=True,axis=1)

df_std= (df-df.mean())/df.std()


In [141]:
# To verify the standardization, check the mean and std of the new data (the means should be ~0 up to floating-point error)
df_std.mean()

sepal_len   -1.457168e-15
sepal_wid   -1.722511e-15
petal_len   -2.043551e-15
petal_wid   -9.843977e-17
dtype: float64

In [142]:
df_std.std()

sepal_len    1.0
sepal_wid    1.0
petal_len    1.0
petal_wid    1.0
dtype: float64
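
One detail worth knowing when reading the check above: `DataFrame.std()` in pandas uses the sample standard deviation (`ddof=1`) by default, whereas some tools (scikit-learn's `StandardScaler`, for instance) divide by the population standard deviation (`ddof=0`). The sketch below uses made-up random data, not the iris measurements, to show that both conventions give zero mean and differ only by a constant factor.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 50 samples, 2 features with different scales.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(loc=[5.0, 3.0], scale=[0.8, 0.4], size=(50, 2)),
    columns=['a', 'b'],
)

# Same formula as in the notebook; pandas std() defaults to ddof=1.
z_sample = (toy - toy.mean()) / toy.std()

# Population convention (ddof=0).
z_pop = (toy - toy.mean()) / toy.std(ddof=0)

print(z_sample.std().tolist())        # sample std of z_sample
print(z_pop.std(ddof=0).tolist())     # population std of z_pop
print(z_pop.std().tolist())           # sample std of z_pop, slightly above 1
```

The two results differ by the factor `sqrt(n / (n - 1))`, which is small for datasets of this size but explains minor discrepancies between implementations.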

### Task 5: Compute the Eigenvectors and Eigenvalues
---

Covariance: $\sigma_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)(x_{ik}-\bar{x}_k)$

Covariance matrix: $\Sigma = \frac{1}{n-1}\left((X-\bar{x})^T(X-\bar{x})\right)$
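
As a rough illustration of the matrix formula, the sketch below builds the covariance matrix by hand and compares it with `np.cov`. The data is random and purely hypothetical. `rowvar=False` tells NumPy that columns are the variables, which is equivalent to passing the transposed data.

```python
import numpy as np

# Hypothetical data: 30 samples (rows), 4 features (columns).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
n = X.shape[0]

# Centre each column, then apply (X - x_bar)^T (X - x_bar) / (n - 1).
X_centered = X - X.mean(axis=0)
cov_manual = X_centered.T @ X_centered / (n - 1)

# NumPy's built-in estimate (also divides by n - 1 by default).
cov_numpy = np.cov(X, rowvar=False)

print(np.allclose(cov_manual, cov_numpy))
```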

In [143]:
# np.cov expects one variable per row, hence the transpose

df_cov = np.cov(df_std.T)
print(df_cov)

eig_values, eig_vectors = np.linalg.eig(df_cov)
print("EigenValues:\n",eig_values)
print("EigenVectors:\n",eig_vectors)

[[ 1.         -0.10936925  0.87175416  0.81795363]
 [-0.10936925  1.         -0.4205161  -0.35654409]
 [ 0.87175416 -0.4205161   1.          0.9627571 ]
 [ 0.81795363 -0.35654409  0.9627571   1.        ]]
EigenValues:
 [2.91081808 0.92122093 0.14735328 0.02060771]
EigenVectors:
 [[ 0.52237162 -0.37231836 -0.72101681  0.26199559]
 [-0.26335492 -0.92555649  0.24203288 -0.12413481]
 [ 0.58125401 -0.02109478  0.14089226 -0.80115427]
 [ 0.56561105 -0.06541577  0.6338014   0.52354627]]


The eigenvectors are orthonormal, which follows from two facts. The covariance matrix is symmetric, so eigenvectors belonging to distinct eigenvalues are orthogonal to each other. In addition, each column (eigenvector) is scaled to unit length, meaning its squared entries sum to one.
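
The orthonormality property can be checked numerically. The sketch below re-enters the covariance matrix printed above (rounded to eight decimals) and tests whether $W^T W$ is the identity. It also prints the plain column sums, which are not equal to one: it is the Euclidean norm of each column that equals one.

```python
import numpy as np

# Covariance matrix copied from the printed output above.
cov = np.array([
    [ 1.        , -0.10936925,  0.87175416,  0.81795363],
    [-0.10936925,  1.        , -0.4205161 , -0.35654409],
    [ 0.87175416, -0.4205161 ,  1.        ,  0.9627571 ],
    [ 0.81795363, -0.35654409,  0.9627571 ,  1.        ],
])

vals, vecs = np.linalg.eig(cov)

# Unit length: the Euclidean norm of every column is 1.
print(np.linalg.norm(vecs, axis=0))

# Orthonormal columns: W^T W equals the identity matrix.
print(np.allclose(vecs.T @ vecs, np.eye(4)))

# The plain sums of the entries are generally not 1.
print(vecs.sum(axis=0))
```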

Eigendecomposition of the covariance matrix:  $\Sigma = W \Lambda W^{-1}$

In [144]:
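# eigh is specialised for symmetric matrices: eigenvalues are returned in ascending order,
# and eigenvector signs may differ from those returned by eig
np.linalg.eigh(df_cov)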

(array([0.02060771, 0.14735328, 0.92122093, 2.91081808]),
 array([[ 0.26199559,  0.72101681,  0.37231836, -0.52237162],
        [-0.12413481, -0.24203288,  0.92555649,  0.26335492],
        [-0.80115427, -0.14089226,  0.02109478, -0.58125401],
        [ 0.52354627, -0.6338014 ,  0.06541577, -0.56561105]]))
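
Comparing the two outputs, `eigh` returns the same eigenpairs as `eig` but in ascending order and with some columns flipped in sign, which is harmless because an eigenvector is only defined up to a scalar factor. The sketch below, again using the covariance matrix printed earlier, reorders the `eigh` result to descending order and rebuilds the matrix from its decomposition. For a symmetric matrix $W^{-1} = W^T$, so the transpose is used in place of the inverse.

```python
import numpy as np

# Covariance matrix copied from the printed output above.
cov = np.array([
    [ 1.        , -0.10936925,  0.87175416,  0.81795363],
    [-0.10936925,  1.        , -0.4205161 , -0.35654409],
    [ 0.87175416, -0.4205161 ,  1.        ,  0.9627571 ],
    [ 0.81795363, -0.35654409,  0.9627571 ,  1.        ],
])

vals, vecs = np.linalg.eigh(cov)      # eigenvalues in ascending order

# Reorder to descending eigenvalue, keeping columns aligned with values.
order = np.argsort(vals)[::-1]
vals_desc = vals[order]
vecs_desc = vecs[:, order]

# Rebuild the matrix: W diag(lambda) W^T.
reconstructed = vecs_desc @ np.diag(vals_desc) @ vecs_desc.T

print(vals_desc)
print(np.allclose(reconstructed, cov))
```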

### Task 6: Singular Value Decomposition (SVD)
---

In [145]:
# Note: the third value returned by np.linalg.svd is V transposed (named V here)
U,S,V = np.linalg.svd(df_std.T)

In [146]:
print(U)
print("-------------------------------------------------------")
print(S)
print("-------------------------------------------------------")
print(V)

[[-0.52237162 -0.37231836  0.72101681  0.26199559]
 [ 0.26335492 -0.92555649 -0.24203288 -0.12413481]
 [-0.58125401 -0.02109478 -0.14089226 -0.80115427]
 [-0.56561105 -0.06541577 -0.6338014   0.52354627]]
-------------------------------------------------------
[20.82575075 11.71588318  4.68568442  1.75229803]
-------------------------------------------------------
[[ 1.08374515e-01  9.98503796e-02  1.13323362e-01 ... -7.27833114e-02
  -6.58701606e-02 -4.59092965e-02]
 [-4.30198387e-02  5.57547718e-02  2.70926177e-02 ... -2.26960075e-02
  -8.64611208e-02  1.89567788e-03]
 [ 2.59377669e-02  4.83370288e-02 -1.09498919e-02 ... -3.81328738e-02
  -1.98113038e-01 -1.12476331e-01]
 ...
 [ 5.42576376e-02  5.32189412e-03  2.76010922e-02 ...  9.89545817e-01
  -1.40226565e-02 -7.86338250e-04]
 [ 1.60581494e-03  8.56651825e-02  1.78415121e-01 ... -1.24233079e-02
   9.52228601e-01 -2.19591161e-02]
 [ 2.27770498e-03  6.44405862e-03  1.49430370e-01 ... -6.58105858e-04
  -2.32385318e-02  9.77215825e-01]]
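
The columns of `U` above match the eigenvectors from Task 5 up to sign. The singular values are related to the eigenvalues as well: for centred data with $n$ samples, each eigenvalue of the covariance matrix equals the corresponding squared singular value divided by $n-1$. The sketch below checks both relations on hypothetical random data (not the iris measurements). `full_matrices=False` is used only to keep the third factor small.

```python
import numpy as np

# Hypothetical correlated data: 40 samples, 4 features.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 4)) @ rng.normal(size=(4, 4))
n = X.shape[0]

# Standardize with the sample standard deviation, as in Task 4.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigendecomposition of the covariance matrix, sorted in descending order.
eig_vals, eig_vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
eig_vals, eig_vecs = eig_vals[::-1], eig_vecs[:, ::-1]

# SVD of the transposed data (features x samples), as in the notebook.
U, S, Vt = np.linalg.svd(Z.T, full_matrices=False)

# Squared singular values divided by (n - 1) give the eigenvalues.
print(np.allclose(S**2 / (n - 1), eig_vals))

# Left singular vectors equal the eigenvectors up to sign.
print(np.allclose(np.abs(U), np.abs(eig_vecs)))
```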

### Task 7: Picking Principal Components Using the Explained Variance
---

In [147]:
# We need to decide which eigenvectors can be dropped without losing too much information.
# The eigenvectors with the smallest eigenvalues carry the least variance, so they are the
# cheapest to drop. To find them, sort the (eigenvalue, eigenvector) pairs by eigenvalue.


In [161]:
#sorting
#make a list of eigenvalue,eigenvector tuples
eig_pairs = [(np.abs(eig_values[i]), eig_vectors[:,i]) for i in range(len(eig_values))]

#sort the list of tuples from high to low (decreasing eigenvalue)
eig_pairs.sort(key=lambda x: x[0], reverse=True)




In [162]:
#explained variance: it tells us how much information can be attributed to each principal component
total = sum(eig_values)
var_exp = [(i / total)*100 for i in sorted(eig_values, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
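
To make the explained-variance numbers concrete, the sketch below repeats the calculation on the eigenvalues printed in Task 5 and picks the smallest number of components whose cumulative share reaches a threshold. The 95% threshold and the name `k` are assumptions chosen for illustration, not something the notebook prescribes.

```python
import numpy as np

# Eigenvalues copied from the printed output in Task 5.
eig_values = np.array([2.91081808, 0.92122093, 0.14735328, 0.02060771])

# Percentage of total variance per component, largest first.
var_exp = np.sort(eig_values)[::-1] / eig_values.sum() * 100
cum_var_exp = np.cumsum(var_exp)

# Hypothetical rule: keep the fewest components reaching the threshold.
threshold = 95.0
k = int(np.searchsorted(cum_var_exp, threshold) + 1)

print(var_exp.round(2))
print(cum_var_exp.round(2))
print(k)
```

With these eigenvalues the first two components account for roughly 72.8% and 23.0% of the variance, which is consistent with keeping two eigenvectors in the projection matrix.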

In [163]:
#projection matrix: its columns are the eigenvectors with the largest eigenvalues (here the top two)
matrix_w = np.hstack((eig_pairs[0][1].reshape(4,1),eig_pairs[1][1].reshape(4,1)))

print('Matrix W:\n', matrix_w)

Matrix W:
 [[ 0.52237162 -0.37231836]
 [-0.26335492 -0.92555649]
 [ 0.58125401 -0.02109478]
 [ 0.56561105 -0.06541577]]


### Task 8: Project Data Onto Lower-Dimensional Linear Subspace
---

In [164]:
Y = df_std.dot(matrix_w)

In [157]:
Y

|     | 0 | 1 |
|-----|-----------|-----------|
| 0 | -2.256981 | -0.504015 |
| 1 | -2.079459 | 0.653216 |
| 2 | -2.360044 | 0.317414 |
| 3 | -2.296504 | 0.573447 |
| 4 | -2.380802 | -0.672514 |
| ... | ... | ... |
| 145 | 1.864277 | -0.381544 |
| 146 | 1.553288 | 0.902291 |
| 147 | 1.515767 | -0.265904 |
| 148 | 1.371796 | -1.012968 |
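
A useful sanity check on a projection like `Y` is that its columns are uncorrelated and that their variances equal the eigenvalues of the retained components. The sketch below runs the whole pipeline on hypothetical random data (not the iris measurements) and verifies this. The names `Z`, `W` and `cov_Y` are illustrative.

```python
import numpy as np

# Hypothetical correlated data: 60 samples, 4 features.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4)) @ rng.normal(size=(4, 4))

# Standardize, then decompose the covariance matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))

# Sort eigenpairs by descending eigenvalue and keep the top two.
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]
W = vecs[:, :2]

# Project the standardized data onto the two principal components.
Y = Z @ W

# Covariance of the projection is diagonal, with the top two eigenvalues.
cov_Y = np.cov(Y, rowvar=False)
print(np.allclose(cov_Y, np.diag(vals[:2])))
```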
