## 1. Introduction to the Dataset

This notebook applies PCA, k-means clustering and DBSCAN to a wine quality dataset: PCA first reduces the dimensionality of the data, and the clustering algorithms then assign the samples to clusters. Dimensionality reduction and clustering can help us prepare a dataset for training a classification algorithm such as k-nearest neighbours, which you will see next week.

Adapted from https://pubs.acs.org/doi/10.1021/acs.jchemed.1c00142

## Google Colab installs

<div class="alert alert-warning">
The following cell installs necessary packages and downloads data if you are running this tutorial using Google Colab.<br>
<b><i>Run this cell only if you are using Google Colab!</i></b></div>

In [None]:
!if [ -n "$COLAB_RELEASE_TAG" ]; then git clone https://github.com/Edinburgh-Chemistry-Teaching/ATCP_23_24; fi
import os
os.chdir(f"ATCP_23_24{os.sep}Unit_01")

### 1.2 Loading the data

In [None]:
# Imports
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import decomposition
from sklearn import cluster
import os
import warnings
warnings.filterwarnings("ignore")

In [None]:
filename = "wine_data.csv"
dataframe = pd.read_csv(filename)
dataframe.head()

### 1.3 Cleaning up and normalising the data

First, we have to clean up the data frame slightly, because it contains information that we do not want to include in the PCA. Namely, we need to drop the `quality` column, as it is a categorical variable, and the `wine` column, as that is our target variable for analysis and prediction.

Furthermore, since the variables do not all have the same range, we need to normalise the data. Normalisation is carried out by subtracting the mean of each variable from its values and dividing the result by the standard deviation, i.e.

$$ Z = \frac{\mathbf{x} - \mu}{\sigma} $$

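where $\mathbf{x}$ is the data, $\mu$ is its mean and $\sigma$ is its standard deviation, each computed separately for every column.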


<div class="alert alert-success">
<b>Task:</b> Normalising the data

- Create a new data frame, e.g. `clean_data`, which does not contain the columns `quality` and `wine`.
- Normalise the cleaned wine data and store the result in a new data frame, e.g. `normalised_data`.

</div>

In [None]:
### Your solution here:


<details>
<summary> <mark> Solution: </mark> </summary>

```Python

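# Drop the columns that should not enter the PCA
clean_data = dataframe.drop(columns=["quality", "wine"])
# Z-score normalisation: subtract the column mean, divide by the column standard deviation
mean = clean_data.mean()
standard_deviation = clean_data.std()
normalised_data = (clean_data - mean) / standard_deviation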

```

</details>
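
As a quick sanity check of the normalisation formula, the sketch below applies it to a small made-up table and confirms that every column ends up with a mean of 0 and a standard deviation of 1. The column names and values are invented for illustration and are not taken from `wine_data.csv`.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical values, NOT taken from wine_data.csv
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 12.0],
    "pH": [3.51, 3.20, 3.26, 3.16],
})

# Z = (x - mean) / standard deviation, applied column by column
toy_normalised = (toy - toy.mean()) / toy.std()

# After normalisation every column should have mean ~0 and standard deviation ~1
print(toy_normalised.mean().round(6))
print(toy_normalised.std().round(6))
```

Note that pandas `.std()` uses the sample standard deviation (`ddof=1`) by default, whereas `numpy.std` defaults to `ddof=0`, so the two give slightly different scalings on small datasets.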

<div class="alert alert-success">
<b>Task:</b> Exploring the data

- Plot different properties of the data frame against each other to see how correlated they are.
- Can you see any obvious trends that distinguish red from white wine?

</div>

In [None]:
### Your solution here:



<details>
<summary> <mark> Solution: </mark> </summary>

```Python

target_column = 'residual sugar'

# Create scatter plots for each column against the target column
for column in clean_data.columns:
    if column != target_column:
        plt.figure(figsize=(6, 4))
        plt.scatter(clean_data[target_column], clean_data[column])
        plt.xlabel(target_column)
        plt.ylabel(column)
        plt.title(f'{column} vs. {target_column}')
        plt.grid(True)
        plt.show()

```

</details>
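
Scatter plots show correlations visually. A correlation matrix puts a number on them. The sketch below is a minimal example on synthetic data with made-up column names (not the wine dataset): `feature_b` is constructed to follow `feature_a` closely, while `feature_c` is independent noise.

```python
import numpy as np
import pandas as pd

# Synthetic data with made-up column names (not the wine dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "feature_a": x,
    # Strongly tied to feature_a, with a little noise added
    "feature_b": 2.0 * x + rng.normal(scale=0.1, size=200),
    # Independent of the other two columns
    "feature_c": rng.normal(size=200),
})

# Pearson correlation coefficient for every pair of columns
correlations = toy.corr()
print(correlations.round(2))
```

Values close to 1 or -1 indicate strongly correlated columns, and values close to 0 indicate little linear relationship. Assuming `clean_data` from the previous task exists, `clean_data.corr()` should give the same kind of table for the wine properties.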

### 1.4 Task section

<div class="alert alert-success">
<b>Task 1: Principal components analysis of wine qualities </b>

- Perform a PCA of the normalised dataset using two components. What fraction of the variance does each component explain?
- Plot the principal components, labelling your axes.
- Are two components a good choice? Which input features contribute most to these components?

</div>

In [None]:
### Your solution here:

# Performing the PCA


# Plotting


# Contribution of components


<details>
<summary> <mark> Solution to performing the PCA</mark> </summary>

```Python
pca = decomposition.PCA(n_components=2)
pca_results = pca.fit_transform(normalised_data)
print(pca.explained_variance_ratio_)

fig, ax = plt.subplots()
ax.scatter(pca_results[:, 0], pca_results[:, 1], c=dataframe["wine"], edgecolor="k") # COLOURING ACCORDING TO WINE
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
```

</details>

<details>
<summary> <mark> Solution to components contribution</mark> </summary>

```Python

# Access the component loadings
loadings = pca.components_

# Create a DataFrame to display the loadings
loadings_df = pd.DataFrame(loadings, columns=clean_data.columns)

# Display the loadings for each principal component
for i in range(2):
    print(f"Loadings for PC{i + 1}:\n{np.abs(loadings_df.iloc[i])}")
```

</details>
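
One common way to judge whether two components are enough is to look at the cumulative explained variance as more components are added. The sketch below illustrates this on synthetic data, not on the wine dataset: the array `toy` is a made-up stand-in built so that most of its variance lies along two hidden directions.

```python
import numpy as np
from sklearn import decomposition

# Synthetic stand-in for a normalised dataset: 300 samples, 5 features,
# built so that most of the variance lies along two hidden directions
rng = np.random.default_rng(0)
hidden = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
toy = hidden @ mixing + 0.1 * rng.normal(size=(300, 5))

# Fit a PCA that keeps all components, then accumulate the variance ratios
pca_full = decomposition.PCA()
pca_full.fit(toy)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

for n, total in enumerate(cumulative, start=1):
    print(f"{n} component(s): {total:.1%} of the variance")
```

If the cumulative value is already high after two components, little information is lost by discarding the rest. If it is not, more components may be needed.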

<div class="alert alert-success">
<b>Task 2: K-means analysis of wine qualities </b>

- Perform a k-means analysis of the PCA results. How many clusters are there? 
- Plot the results, labelling your axes and marking the cluster centres. 

</div>

In [None]:
### Your solution here



<details>
<summary> <mark> Solution:</mark> </summary>

```Python

kmeans = cluster.KMeans(n_clusters=3)
# fit_predict returns the cluster index assigned to each sample
dataframe["cluster"] = kmeans.fit_predict(pca_results)

fig, ax = plt.subplots()

ax.scatter(pca_results[:, 0], pca_results[:, 1], s=5, linewidth=0, c="gray", alpha=0.7)

# Mark the cluster centres with red crosses
cluster_centers = kmeans.cluster_centers_
ax.scatter(cluster_centers[:, 0], cluster_centers[:, 1], s=100, c="r", marker="x")

ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
plt.show()
```

</details>
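
K-means needs the number of clusters as an input, so "how many clusters are there?" has to be answered separately. One common heuristic is the elbow method: fit the model for several values of `n_clusters` and look for the point where the inertia stops dropping sharply. The sketch below shows the idea on synthetic data with three made-up blobs. It is an illustration, not a statement about the wine data.

```python
import numpy as np
from sklearn import cluster

# Synthetic 2D data: three well-separated blobs (made up, not the wine PCA results)
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
toy = np.vstack([c + 0.5 * rng.normal(size=(100, 2)) for c in centres])

# Fit k-means for several values of k and record the inertia
# (sum of squared distances of the points to their nearest cluster centre)
inertias = {}
for k in range(1, 7):
    model = cluster.KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(toy)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k = {k}: inertia = {value:.1f}")
```

For this toy data the inertia falls steeply up to `k = 3` and only slowly afterwards, which matches the three blobs used to build it. On real data the elbow is often less clear-cut.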

<div class="alert alert-success">
<b>Task 3: Try using DBSCAN to cluster this dataset.</b>

- Perform the clustering.
- Is this a good way to distinguish between red and white wine?
</div>


In [None]:
### Your solution here



<details>
<summary> <mark> Solution</mark> </summary>

```Python
db = cluster.DBSCAN(eps=0.5)
# Cluster every 10th sample only
db.fit(pca_results[::10])

# DBSCAN labels noise points as -1, so exclude that label when counting clusters
clusters = db.labels_.astype(int)
no_clusters = len(set(clusters) - {-1})
no_noise = np.sum(clusters == -1)

print(f'Estimated no. of clusters: {no_clusters}')
print(f'Estimated no. of noise points: {no_noise}')

fig, ax = plt.subplots()
ax.scatter(pca_results[::10, 0], pca_results[::10, 1], c=clusters, marker="o")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")

```

</details>
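
To judge whether a clustering separates two known classes, one option is to cross-tabulate the cluster labels against the known labels. The sketch below does this for synthetic data: the classes `"A"` and `"B"` are made up and stand in for red and white wine, and the values of `eps` and `min_samples` are assumptions chosen for this toy example.

```python
import numpy as np
import pandas as pd
from sklearn import cluster

# Synthetic 2D data: two dense, well-separated groups with known labels
# (the labels "A" and "B" are made up and stand in for red/white wine)
rng = np.random.default_rng(0)
group_a = 0.3 * rng.normal(size=(100, 2))
group_b = 0.3 * rng.normal(size=(100, 2)) + np.array([5.0, 5.0])
toy = np.vstack([group_a, group_b])
known = np.array(["A"] * 100 + ["B"] * 100)

# DBSCAN labels noise points as -1 and clusters as 0, 1, 2, ...
labels = cluster.DBSCAN(eps=0.5, min_samples=5).fit_predict(toy)

# Rows: known class, columns: DBSCAN label
table = pd.crosstab(known, labels, rownames=["known"], colnames=["dbscan"])
print(table)
```

If each DBSCAN label contains samples from only one known class, the clustering distinguishes the classes well. If a single label contains many samples of both classes, it does not.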

## END

---