# Problem Statement:

***To reduce the dimensions of the dataset and to find out which features explain the most variance in the data.***

Steps:
1. Acquire data
2. Clean data
3. Standardise data
4. Apply PCA

# Data Source:

**"https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"**

# Data Dictionary:

The dataset has **178 rows** (samples/observations) and **14 columns** (attributes).

Below are the attributes:

'wine_class', 'alcohol', 'malic_acid', 'ash', 'alcalinity_ash',
'magnesium', 'total_phenol', 'flavanoids', 'nonflavanoid_phenols',
'proanthocyanins', 'color_intensity', 'hue', 'diluted_wines', 'proline'

# Import Libraries

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline

# Acquire Data

In [None]:
# Read the dataset

wine_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", header=None)

wine_df.head()

In [None]:
# Add column names to the dataset

wine_df.columns = ['wine_class', 'alcohol', 'malic_acid', 'ash', 'alcalinity_ash',
                  'magnesium', 'total_phenol', 'flavanoids', 'nonflavanoid_phenols',
                  'proanthocyanins', 'color_intensity', 'hue', 'diluted_wines',
                  'proline']

In [None]:
wine_df.head()

## Since PCA is an unsupervised learning technique, we drop the target variable ('wine_class') before applying PCA to this data

In [None]:
wine_df = wine_df.drop(['wine_class'], axis=1)

In [None]:
print("Shape of the dataset: ", wine_df.shape)

# Why PCA?

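Since it is difficult to **visualize** high-dimensional data, we can use PCA to find the first three principal components and plot the data in three dimensions. PCA is also used for **data compression**. Before applying it, we have to standardise the data so that every feature has zero mean and unit variance. Otherwise, features measured on large scales (such as 'proline') would dominate the components. Note that standardising only rescales each feature; it does not make its distribution normal.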

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
scaled_data = scaler.fit_transform(wine_df)
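As a rough check of what the scaler does, the sketch below standardises a small made-up array (the values are illustrative and not taken from the wine data) and compares the result with the formula (x - mean) / standard deviation applied per column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): two features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0],
                [4.0, 700.0]])

toy_scaled = StandardScaler().fit_transform(toy)

# Same result by hand: subtract the column mean, divide by the column
# standard deviation (population std, ddof=0, which is what StandardScaler uses)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(toy_scaled.mean(axis=0))  # approximately 0 for every column
print(toy_scaled.std(axis=0))   # approximately 1 for every column
```

After scaling, both columns are on the same footing even though their raw ranges differ by two orders of magnitude.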

We instantiate a PCA object, find the principal components using the fit method, then apply the rotation and dimensionality reduction by calling transform().

We can also specify how many components we want to keep when creating the PCA object.
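Besides an integer count, scikit-learn's PCA also accepts a float between 0 and 1 for n_components, in which case it keeps as many components as are needed to reach that share of the variance. The sketch below is only an illustration: it assumes scikit-learn's bundled load_wine data as a stand-in for the UCI file, and the 0.95 threshold is an arbitrary choice.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled wine data stands in for the UCI file read in this notebook
X_demo = StandardScaler().fit_transform(load_wine().data)

# A float in (0, 1) asks PCA to keep enough components to explain that
# fraction of the total variance; svd_solver='full' is set explicitly
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_demo)

print("Components kept:", pca_95.n_components_)
print("Variance retained:", pca_95.explained_variance_ratio_.sum())
```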

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(n_components=3)
pca.fit(scaled_data)

# Now we can transform this data to the first 3 principal components.

X_pca = pca.transform(scaled_data)

In [None]:
# Shape of Original dataset (after dropping target variable 'wine_class')

scaled_data.shape

In [None]:
# Shape of PCA-transformed new dataset

X_pca.shape

##### Observation: We have reduced the data from 13 dimensions to 3 dimensions
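Reducing to 3 dimensions discards some information, and the fitted PCA object reports how much variance each kept component explains through its explained_variance_ratio_ attribute. The following is a minimal sketch, again assuming the bundled load_wine data in place of the UCI file; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled wine data stands in for the UCI file read in this notebook
X_std = StandardScaler().fit_transform(load_wine().data)
pca_check = PCA(n_components=3).fit(X_std)

# Fraction of the total variance explained by each of the 3 components
ratios = pca_check.explained_variance_ratio_
# Running total: the share of variance retained by the first k components
cumulative = np.cumsum(ratios)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.3f} of variance, {c:.3f} cumulative")
```

The last cumulative value is the share of the original variance that survives the reduction from 13 to 3 dimensions.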

In [None]:
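from mpl_toolkits.mplot3d import Axes3D  # only needed on older matplotlib to register the 3D projection

fig = plt.figure(1, figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=-150, azim=110)

# 'wine_class' was dropped from wine_df above, so wine_df.iloc[:, 0] is 'alcohol'.
# Re-read column 0 of the source file to colour the points by wine class.
wine_class = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",
                         header=None, usecols=[0])[0].values

ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=wine_class,
           cmap=plt.cm.Set1, edgecolor='k', s=40)
ax.set_title("First three PCA directions")
ax.set_xlabel("PCA-1")
ax.xaxis.set_ticklabels([])
ax.set_ylabel("PCA-2")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("PCA-3")
ax.zaxis.set_ticklabels([])

plt.show()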

# Interpreting the components 

Each principal component is a linear combination of the original features. The components themselves are stored in the components_ attribute of the fitted PCA object.

In this NumPy array, each row represents a principal component and each column relates back to one of the original features. We can visualize this relationship with a heatmap:

In [None]:
df_comp = pd.DataFrame(pca.components_, columns=wine_df.columns)

In [None]:
# The heatmap shows the weight (loading) of each original feature in each
# principal component; the color bar gives the sign and size of that weight.

plt.figure(figsize=(12,6))
sns.heatmap(df_comp, cmap='plasma')
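To read the heatmap numerically, one option is to rank the features by the absolute size of their weight in each component. The sketch below assumes the bundled load_wine data and its own feature names (which differ slightly from the column names used in this notebook); keeping the top three features per component is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled wine data stands in for the UCI file read in this notebook
wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)
pca_demo = PCA(n_components=3).fit(X_std)

# One row per principal component, one column per original feature
loadings = pd.DataFrame(pca_demo.components_,
                        columns=wine.feature_names,
                        index=['PC1', 'PC2', 'PC3'])

# For each component, list the three features with the largest absolute weight
for name, row in loadings.iterrows():
    top = row.abs().sort_values(ascending=False).head(3)
    print(name, list(top.index))
```

A large absolute weight means the feature contributes strongly to that component; the sign only indicates the direction of the contribution.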