Objective:
The objective of this assignment is to implement PCA on a given dataset and analyse the results.

Instructions:
Download the wine dataset from the UCI Machine Learning Repository.
Load the dataset into a Pandas dataframe.
Split the dataset into features and target variables.
Perform data preprocessing (e.g., scaling, normalisation, missing value imputation) as necessary.
Implement PCA on the preprocessed dataset using the scikit-learn library.
Determine the optimal number of principal components to retain based on the explained variance ratio.
Visualise the results of PCA using a scatter plot.
Perform clustering on the PCA-transformed data using K-Means clustering algorithm.
Interpret the results of PCA and clustering analysis.

Deliverables:
Jupyter notebook containing the code for the PCA implementation.
A report summarising the results of PCA and clustering analysis.
Scatter plot showing the results of PCA.
A table showing the performance metrics for the clustering algorithm.

Additional Information:
You can use the Python programming language.
You can use any other machine learning libraries or tools as necessary.
You can use any visualisation libraries or tools as necessary.

In [1]:
!pip install ucimlrepo

Defaulting to user installation because normal site-packages is not writeable


In [76]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
wine = fetch_ucirepo(id=109) 
  
# data (as pandas dataframes) 
X = wine.data.features 
y = wine.data.targets 
  
 
# variable information 
print(wine.variables)

                            name     role         type demographic  \
0                          class   Target  Categorical        None   
1                        Alcohol  Feature   Continuous        None   
2                      Malicacid  Feature   Continuous        None   
3                            Ash  Feature   Continuous        None   
4              Alcalinity_of_ash  Feature   Continuous        None   
5                      Magnesium  Feature      Integer        None   
6                  Total_phenols  Feature   Continuous        None   
7                     Flavanoids  Feature   Continuous        None   
8           Nonflavanoid_phenols  Feature   Continuous        None   
9                Proanthocyanins  Feature   Continuous        None   
10               Color_intensity  Feature   Continuous        None   
11                           Hue  Feature   Continuous        None   
12  0D280_0D315_of_diluted_wines  Feature   Continuous        None   
13                  

In [77]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
%matplotlib inline

In [78]:
feature_names = wine.data.feature_names

In [79]:
dataset = pd.DataFrame(data=X,columns=feature_names)

In [80]:
dataset.head()

Unnamed: 0,Alcohol,Malicacid,Ash,Alcalinity_of_ash,Magnesium,Total_phenols,Flavanoids,Nonflavanoid_phenols,Proanthocyanins,Color_intensity,Hue,0D280_0D315_of_diluted_wines,Proline
0,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [143]:
dataset.isnull().sum()

Alcohol                         0
Malicacid                       0
Ash                             0
Alcalinity_of_ash               0
Magnesium                       0
Total_phenols                   0
Flavanoids                      0
Nonflavanoid_phenols            0
Proanthocyanins                 0
Color_intensity                 0
Hue                             0
0D280_0D315_of_diluted_wines    0
Proline                         0
dtype: int64

In [144]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

In [145]:
X_train

Unnamed: 0,Alcohol,Malicacid,Ash,Alcalinity_of_ash,Magnesium,Total_phenols,Flavanoids,Nonflavanoid_phenols,Proanthocyanins,Color_intensity,Hue,0D280_0D315_of_diluted_wines,Proline
126,12.43,1.53,2.29,21.5,86,2.74,3.15,0.39,1.77,3.94,0.69,2.84,352
99,12.29,3.17,2.21,18.0,88,2.85,2.99,0.45,2.81,2.30,1.42,2.83,406
54,13.74,1.67,2.25,16.4,118,2.60,2.90,0.21,1.62,5.85,0.92,3.20,1060
124,11.87,4.31,2.39,21.0,82,2.86,3.03,0.21,2.91,2.80,0.75,3.64,380
13,14.75,1.73,2.39,11.4,91,3.10,3.69,0.43,2.81,5.40,1.25,2.73,1150
...,...,...,...,...,...,...,...,...,...,...,...,...,...
11,14.12,1.48,2.32,16.8,95,2.20,2.43,0.26,1.57,5.00,1.17,2.82,1280
137,12.53,5.51,2.64,25.0,96,1.79,0.60,0.63,1.10,5.00,0.82,1.69,515
33,13.76,1.53,2.70,19.5,132,2.95,2.74,0.50,1.35,5.40,1.25,3.00,1235
30,13.73,1.50,2.70,22.5,101,3.00,3.25,0.29,2.38,5.70,1.19,2.71,1285


In [146]:
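# Note: the features are not standardised before PCA here, so the components
# end up dominated by the largest-scale columns (Proline, then Magnesium).
pca = PCA(n_components=3)
X_train = pca.fit_transform(X_train)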

In [147]:
print(pca.components_)

[[ 1.73196586e-03 -6.63911194e-04  1.99972547e-04 -4.74609777e-03
   1.80281397e-02  1.03367617e-03  1.68641805e-03 -1.35523688e-04
   6.49548632e-04  2.24802672e-03  1.98883371e-04  7.44672509e-04
   9.99819474e-01]
 [ 5.32571756e-04  4.95845863e-03  4.10298583e-03  2.79049588e-02
   9.99348040e-01 -1.63142560e-03 -4.32884674e-03 -1.28052829e-03
   5.09068332e-03  9.12220028e-03 -6.23216884e-04 -5.50125592e-03
  -1.78964081e-02]
 [-2.99853173e-02 -1.40871187e-01 -5.28245409e-02 -9.22918953e-01
   3.03361454e-02  3.78316058e-02  9.01882020e-02 -1.26736415e-02
   2.20126213e-02 -3.29034471e-01  2.80006867e-02  6.69921812e-02
  -4.48200024e-03]]


In [148]:
print(pca.explained_variance_ratio_)

[9.98001945e-01 1.83532696e-03 8.81252065e-05]
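
The first component above accounts for about 99.8% of the variance, and its loading in `pca.components_` sits almost entirely on Proline. This is the usual result when features on very different scales are passed to PCA without standardisation. The sketch below standardises first and then picks the number of components from the cumulative explained variance ratio. It makes two assumptions: it loads scikit-learn's bundled copy of the wine data (`load_wine`) instead of the `ucimlrepo` download so that it runs offline, and it uses a 95% threshold, which is an illustrative choice rather than part of the assignment.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the UCI wine data stands in
# for the ucimlrepo download used in the notebook.
X_demo, y_demo = load_wine(return_X_y=True)

# Standardise every feature to zero mean and unit variance before PCA.
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components to inspect the full variance spectrum.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95  # hypothetical cut-off, adjust as needed
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative.round(3))
print("components to keep:", n_keep)
```

With standardised inputs the variance is spread over several components, so the cumulative curve is a more informative guide than the single dominant ratio printed above.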


In [149]:
print(pca.singular_values_)

[3776.27456358  161.94017002   35.48525604]


In [150]:
X_test = pca.transform(X_test)
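
The assignment asks for a scatter plot of the PCA results, which the cells above do not produce. A minimal sketch of one way to draw it is given below. It assumes scikit-learn's bundled `load_wine` copy of the data, standardised features, and a projection onto the first two components, coloured by wine class. The names `wine_demo`, `X_2d`, `fig` and `ax` are illustrative and do not appear elsewhere in the notebook.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: bundled copy of the UCI wine data, standardised before PCA.
wine_demo = load_wine()
X_std = StandardScaler().fit_transform(wine_demo.data)
X_2d = PCA(n_components=2).fit_transform(X_std)

# One scatter call per wine class so that each class gets its own colour.
fig, ax = plt.subplots(figsize=(6, 5))
for label in np.unique(wine_demo.target):
    mask = wine_demo.target == label
    ax.scatter(X_2d[mask, 0], X_2d[mask, 1],
               label=str(wine_demo.target_names[label]), alpha=0.8)

ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("Wine data projected onto the first two principal components")
ax.legend()
fig.tight_layout()
```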

In [151]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
X_train.shape

(142, 3)

In [152]:
# K-Means is unsupervised: a y argument is ignored by fit, so only the features are passed.
kmeans.fit(X_train)

  super()._check_params_vs_input(X, default_n_init=10)


In [153]:
preds = kmeans.predict(X_test)

In [154]:
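from sklearn.metrics import confusion_matrix,classification_report
# scikit-learn expects (y_true, y_pred); with preds passed first, rows are cluster ids and columns are wine classes.
# K-Means ids (0-2) are arbitrary and not aligned with the class labels (1-3), so the scores below are not true accuracies.
print(confusion_matrix(preds,y_test))
print(classification_report(preds,y_test))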

[[ 0  4  1  6]
 [ 0  6  0  0]
 [ 0  0 14  5]
 [ 0  0  0  0]]
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        11
           1       0.60      1.00      0.75         6
           2       0.93      0.74      0.82        19
           3       0.00      0.00      0.00         0

    accuracy                           0.56        36
   macro avg       0.38      0.43      0.39        36
weighted avg       0.59      0.56      0.56        36
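
Because cluster ids carry no inherent meaning, a classification report against the class labels can understate or misstate how well the clusters match the classes. For the metrics table requested in the deliverables, label-invariant measures are a safer choice. The sketch below is one possible version and makes several assumptions: the bundled `load_wine` data, standardised features, two principal components, and fixed `n_init` and `random_state` values chosen only for reproducibility.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumption: bundled copy of the UCI wine data, standardised, reduced to 2 PCs.
X_demo, y_demo = load_wine(return_X_y=True)
X_pca = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X_demo))

# Fixed n_init and random_state are illustrative choices for repeatable runs.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = km.fit_predict(X_pca)

# Adjusted Rand index compares the partition with the true classes and does
# not depend on how clusters are numbered; silhouette needs no labels at all.
ari = adjusted_rand_score(y_demo, cluster_ids)
sil = silhouette_score(X_pca, cluster_ids)

metrics_table = pd.DataFrame({
    "metric": ["adjusted Rand index", "silhouette score", "inertia"],
    "value": [ari, sil, km.inertia_],
})
print(metrics_table)
```

The adjusted Rand index uses the known classes and so applies only when labels are available, whereas the silhouette score and inertia describe the clustering on its own.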



  _warn_prf(average, modifier, msg_start, len(result))
