# Case study on Unsupervised Learning

Do the following with the wine dataset.
1. Read the dataset into the Python environment.
2. Try out different clustering models on the wine dataset.
3. Find the optimum number of clusters for each model and create the model with the optimum number of clusters.

In [None]:
# Import the required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### 1. Read the dataset into the Python environment.

In [None]:
# Load the wine dataset into a DataFrame
wine_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Week 14/Case Study/Wine.csv')

In [None]:
# Display the data
wine_df.head()

In [None]:
# number of elements in each dimension (Rows and Columns)
wine_df.shape

In [None]:
# Summary of the data
wine_df.info()

In [None]:
# Display the columns in the dataset
wine_df.columns

In [None]:
# Count the null values present in each column of the dataset
wine_df.isna().sum()

In [None]:
# The Statistical summary of wine dataset
wine_df.describe().T

### 2. Try out different clustering models in the wine dataset.

We will try two clustering models:
1. K-Means Clustering
2. Hierarchical Clustering - Agglomerative Clustering

### 3. Find the optimum number of clusters in each model and create the model with the optimum number of clusters.

**1. K-Means Clustering**

In [None]:
# Normalize the data
from sklearn.preprocessing import Normalizer
data = pd.DataFrame(Normalizer().fit_transform(wine_df), columns=wine_df.columns)
data.describe().T

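In machine learning, the values of some features can be many times larger than those of others. Features with larger values tend to dominate the learning process, even though that does not mean they are more important for the outcome of the model. Scaling the data addresses this by bringing it to a comparable range. **When every feature is on a similar scale, all variables have a similar influence on the model, which improves the stability and performance of the learning algorithm.** Note that scikit-learn's `Normalizer`, used in the cell above, rescales each sample (row) to unit norm rather than each feature (column); per-feature scaling is what `StandardScaler` does in the PCA section later on.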
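The difference between row-wise and column-wise scaling is easy to see on a small example. The sketch below uses hypothetical toy data (not the wine dataset), and the variable names are illustrative assumptions only.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical toy data: two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Normalizer works row by row: every sample ends up with unit L2 norm
row_normalized = Normalizer().fit_transform(X)
row_norms = np.linalg.norm(row_normalized, axis=1)

# StandardScaler works column by column: every feature gets mean 0 and std 1
col_scaled = StandardScaler().fit_transform(X)
col_means = col_scaled.mean(axis=0)
col_stds = col_scaled.std(axis=0)

print("Row norms after Normalizer:", row_norms)
print("Column means after StandardScaler:", col_means)
print("Column stds after StandardScaler:", col_stds)
```

Which of the two is appropriate depends on whether the direction of each sample or the scale of each feature matters for the clustering.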

In [None]:
# Find the optimal number of clusters using an elbow plot
from sklearn.cluster import KMeans
# Candidate numbers of clusters to try
ks = range(1, 10)
# List to collect the inertia value for each k
inertia = []
for k in ks:
    # Fit a model with k clusters and record its inertia
    inertia.append(KMeans(n_clusters=k, init="k-means++", random_state=42).fit(data).inertia_)

# Set the figure size
plt.figure(figsize=(16, 8))
# Plot inertia against the number of clusters
plt.plot(ks, inertia, "-o")
plt.title("Number of Clusters vs Inertia")
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.show()
# Inspect the inertia values
inertia

From the elbow plot above, the inertia decreases only slowly from 5 clusters onwards. We therefore select 5 as the optimum number of clusters.

Inertia measures how well a dataset was clustered by K-Means. It is calculated by measuring the distance between each data point and the centroid of its assigned cluster, squaring this distance, and summing these squares over all points in all clusters. A good model is one with low inertia AND a low number of clusters (K).
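As a rough check of this definition, the sketch below recomputes inertia by hand and compares it with the `inertia_` attribute reported by scikit-learn. It uses hypothetical toy data (not the wine dataset); the seed and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated blobs in 2-D
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(5.0, 0.5, size=(20, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = np.sum((X - assigned_centroids) ** 2)

print("Manual inertia: ", manual_inertia)
print("model.inertia_: ", model.inertia_)
```

The two printed values should agree up to floating-point rounding.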

In [None]:
# Fit the KMeans model with the optimal number of clusters (5)
model_kmeans = KMeans(n_clusters=5, init = "k-means++", random_state=42)
model_kmeans.fit(data)
kmean_clusters = model_kmeans.labels_
kmean_clusters

In [None]:
# Copy the dataframe to a new dataframe
new_data = data.copy()
# Store the K-Means cluster labels in a new column named Wine_Classes_Kmeans
new_data['Wine_Classes_Kmeans'] = kmean_clusters
new_data.head(5)

In [None]:
# Scatter plot of Alcohol against Malic Acid, colored by K-Means cluster
plt.title("Alcohol and Malic_Acid")
plt.scatter(data["Alcohol"], data["Malic_Acid"], c = kmean_clusters)
plt.xlabel("Alcohol")
plt.ylabel("Malic_Acid")
plt.show()

The scatter plot above shows the Alcohol and Malic Acid features colored by K-Means cluster. Because the model was fitted with 5 clusters, the data points appear in 5 distinct colors.

**2. Hierarchical Clustering - Agglomerative Clustering**

In [None]:
# Use a dendrogram (Ward linkage) to find the number of clusters
from scipy.cluster.hierarchy import dendrogram, linkage
linked = linkage(data, method='ward')
plt.figure(figsize=(14, 5))
plt.title("Dendrogram")
# Dashed horizontal line at the height where the tree is cut
plt.hlines(0.2, 0, 5000, linestyles="dashed")
dendrogram(linked, orientation='top', distance_sort='ascending', show_leaf_counts=True)
plt.show()

In the dendrogram above, counting the branches cut by the dashed horizontal line gives an optimum of 5 clusters.
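Reading the number of clusters off a horizontal cut can also be done programmatically. The sketch below is a minimal illustration on hypothetical toy data (not the wine dataset), using SciPy's `fcluster` with a distance threshold; the threshold `t` and the blob layout are assumptions chosen for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: three well-separated blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(15, 2)) for loc in (0.0, 5.0, 10.0)])

# Build the same kind of Ward linkage matrix that the dendrogram is drawn from
Z = linkage(X, method="ward")

# Cut the tree at height t: merges above t are undone,
# and each remaining subtree becomes one flat cluster
t = 10.0
labels = fcluster(Z, t=t, criterion="distance")
n_clusters = len(np.unique(labels))

print("Clusters below the cut:", n_clusters)
```

In the notebook, the same call could be applied to `linked` with the height of the dashed line as `t`, assuming that height is the intended cut.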

Now, we can assign the optimum value to the Agglomerative clustering model.

In [None]:
# Agglomerative clustering
from sklearn.cluster import AgglomerativeClustering
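# Ward linkage always uses Euclidean distance, so no distance argument is needed
# (the `affinity` parameter was removed in recent scikit-learn releases)
model_agglo = AgglomerativeClustering(n_clusters=5, linkage="ward").fit(data)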
agglo_clusters = model_agglo.labels_
agglo_clusters

In [None]:
# Store the agglomerative cluster labels in a new column named Wine_Classes_Agglomerative
new_data['Wine_Classes_Agglomerative'] = agglo_clusters
new_data.head(5)

In [None]:
# Scatter plot of Alcohol against Malic Acid, colored by agglomerative cluster
plt.title("Alcohol and Malic_Acid")
plt.scatter(data["Alcohol"], data["Malic_Acid"], c = agglo_clusters)
plt.xlabel("Alcohol")
plt.ylabel("Malic_Acid")
plt.show()

The scatter plot above shows the Alcohol and Malic Acid features colored by agglomerative cluster. Because the model was fitted with 5 clusters, the data points appear in 5 distinct colors.

From the two clustering models above (K-Means Clustering and Hierarchical Clustering - Agglomerative Clustering), we conclude that the wine dataset can be divided into 5 clusters.
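Two models choosing the same number of clusters does not by itself show that they group the samples the same way. One possible check, not part of the original analysis, is the adjusted Rand index from `sklearn.metrics`, which compares two label assignments regardless of how the cluster IDs are numbered. The sketch below uses hypothetical toy data and illustrative variable names.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy data: three well-separated blobs in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(20, 2)) for loc in (0.0, 5.0, 10.0)])

# Fit both models with the same number of clusters
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# 1.0 means identical groupings; values near 0 mean chance-level agreement
ari = adjusted_rand_score(kmeans_labels, agglo_labels)
print("Adjusted Rand index:", ari)
```

In the notebook, the equivalent call would be `adjusted_rand_score(kmean_clusters, agglo_clusters)`.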

In machine learning problems, there are often too many factors on the basis of which the final classification is done. These factors are basically variables called features. The higher the number of features, the harder it gets to visualize the training set and then work on it. Sometimes, most of these features are correlated, and hence redundant. This is where dimensionality reduction algorithms come into play. Dimensionality reduction is the process of reducing the number of random variables under consideration, by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

One such method is Principal Component Analysis (PCA), which combines multiple correlated variables into a smaller set of uncorrelated components.

**PCA (Principal Components Analysis)**

In [None]:
# number of elements in each dimension (Rows and Columns)
wine_df.shape

In [None]:
# Standardize the data (zero mean, unit variance per feature)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(wine_df)
scaled_data = pd.DataFrame(scaled_data, columns = wine_df.columns)
scaled_data.describe()

In [None]:
# Fit PCA, keeping enough components to explain 95% of the variance
from sklearn.decomposition import PCA
pca = PCA(n_components= 0.95)
pca.fit(scaled_data)
data_pca = pca.transform(scaled_data)
print(f'Number of Principal Components to explain 95% variance = {pca.n_components_}')

In [None]:
# Plotting the graph of Components vs Explained Variance
plt.figure(figsize = (12, 8))
x = np.arange(1, pca.n_components_+1 , step = 1)
y = np.cumsum(pca.explained_variance_ratio_)
plt.plot(x, y, marker = "o", linestyle = "--", color = "b")
# horizontal line for 95% cutoff threshold
plt.axhline(y = 0.95, color = 'r', linestyle = "-")
plt.text(1.2, 0.93, "95% cut-off threshold", color = 'b', fontsize = 14)
plt.xticks(x)
plt.grid(axis = 'x')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance (fraction)')
plt.title("Components vs Explained Variance")
plt.show()

Explained variance measures how much of the information (variance) in the data is captured by a particular principal component (eigenvector). The explained variance ratio is the eigenvalue of that principal component divided by the sum of all eigenvalues.
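The ratio described above can be reproduced directly from the eigenvalues of the covariance matrix. The sketch below is a minimal illustration on hypothetical random data (not the wine dataset); the shapes, scales, and variable names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 200 samples, 4 features with different variances
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

# Eigenvalues of the covariance matrix, sorted from largest to smallest
cov = np.cov(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]

# Each eigenvalue divided by the sum of all eigenvalues
manual_ratio = eigenvalues / eigenvalues.sum()

# Compare with the ratios reported by scikit-learn when all components are kept
pca_full = PCA().fit(X)
print("Manual ratios: ", manual_ratio)
print("sklearn ratios:", pca_full.explained_variance_ratio_)
```

The two arrays should match up to floating-point rounding, and each sums to 1 because all components are kept.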

The cumulative variance plot above shows how the explained variance accumulates as each principal component is added.

In [None]:
pca.explained_variance_ratio_

According to the PCA explained variance ratios, there are 10 components, where:
- The first principal component explains 36.19% of the total variation in the dataset.
- The second principal component explains 19.20% of the total variation.
- The third principal component explains 11.12% of the total variation, and so on.
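
To see how these percentages accumulate, the short sketch below sums the three ratios quoted above. Only those three values are taken from the text; the remaining components are not reproduced here.

```python
import numpy as np

# Ratios for the first three components, as quoted in the list above
ratios = np.array([0.3619, 0.1920, 0.1112])

# Running total of explained variance after 1, 2, and 3 components
cumulative = np.cumsum(ratios)
print(cumulative.round(4))
```

Together the first three components account for roughly 66.5% of the total variation, which is why further components are needed to reach the 95% threshold.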