## Portfolio Assignment week 02

This week's focus is on manifold learning and text clustering. For the portfolio assignment, you are required to contribute to either the manifold learning case or the text clustering case. There are several options for your contribution, so you can choose the one that best matches your learning style or interests.


### Manifold learning

Study the tutorials tutorial_manifold_tSNE and tutorial_manifold_spectral_clustering, and the Study_Case_pipeline. Next, improve the code by comparing the performance of k-means and spectral clustering. Also compare PCA and t-SNE for visualizing the result. You can use the pipeline function of scikit-learn and hyperparameter tuning with GridSearchCV. Here's a possible approach:

- Load the dataset to be used for the clustering analysis.
- Preprocess the dataset as needed (e.g., scale the features, normalize the data, etc.).
- Define a pipeline with preprocessing and clustering.
- Use PCA and t-SNE for dimensionality reduction, visualize the reduced dimensions, and color the data points by cluster.
- Use GridSearchCV to optimize the hyperparameters.
- Evaluate the performance of the models using a suitable metric.
- Choose the best combination of clustering method and visualization method.

Explain your choices and evaluate the outcome. You can do this assignment in pairs, but if you do, mention each other's names. Do not forget to reference your sources. If you cannot figure out how to use GridSearchCV and/or a pipeline, use your own solution.
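
The approach above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the data is a synthetic stand-in from `make_blobs`, the silhouette score is only one possible metric, and the parameter grids are arbitrary. Because `SpectralClustering` has no `predict` method, the sketch loops over a `ParameterGrid` instead of using GridSearchCV directly.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): replace with your own feature matrix
X_demo, _ = make_blobs(n_samples=150, centers=3, n_features=5, random_state=42)

# Candidate clustering methods, each with an illustrative parameter grid
candidates = {
    "kmeans": (
        KMeans(n_init=10, random_state=42),
        {"cluster__n_clusters": [2, 3, 4]},
    ),
    "spectral": (
        SpectralClustering(affinity="rbf", random_state=42),
        {"cluster__n_clusters": [2, 3, 4], "cluster__gamma": [0.1, 1.0]},
    ),
}

results = []
for name, (estimator, grid) in candidates.items():
    for params in ParameterGrid(grid):
        # Pipeline: scale the features first, then cluster
        pipe = Pipeline([("scale", StandardScaler()), ("cluster", estimator)])
        pipe.set_params(**params)
        labels = pipe.fit_predict(X_demo)
        # The silhouette score is only defined for at least two clusters
        if len(set(labels)) < 2:
            continue
        X_scaled = pipe.named_steps["scale"].transform(X_demo)
        results.append((silhouette_score(X_scaled, labels), name, params))

# Pick the combination with the highest silhouette score
best_score, best_name, best_params = max(results, key=lambda r: r[0])
print(f"best: {best_name} {best_params} (silhouette = {best_score:.3f})")
```

The silhouette score needs no ground-truth labels. If labels such as the sample type are available, an external metric like the adjusted Rand index could be reported alongside it.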


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import yaml
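# the "seaborn" style was renamed to "seaborn-v0_8" in matplotlib 3.6
style = "seaborn-v0_8" if "seaborn-v0_8" in plt.style.available else "seaborn"
plt.style.use(style)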

In [None]:
# load the data
with open('config.yaml', 'r') as conf:
    config = yaml.load(conf, yaml.SafeLoader)

data = pd.read_csv(config['tSNE_data'])

In [None]:
print(data.shape)
data.head(10)

In [None]:

# check the sample types
print(f'Unique types: {data["type"].unique()}')
# check missing values
print(f'Missing values: {data.isna().any().any()}')
# descriptive stats per sample type
hcc_df = data[data["type"]=="HCC"]
norm_df = data[data["type"]=="normal"]

# skewness per feature (feature columns start at index 2)
hcc_skewedness = hcc_df.iloc[:,2:].skew()
hcc_skewed = hcc_skewedness[hcc_skewedness.abs() > 0.75]

norm_skewedness = norm_df.iloc[:,2:].skew()
norm_skewed = norm_skewedness[norm_skewedness.abs() > 0.75]

# divide by the number of features, not by the total number of columns
print(f'Skewed features (HCC): {len(hcc_skewed)/len(hcc_skewedness)*100:.1f}%')
print(f'Skewed features (normal): {len(norm_skewed)/len(norm_skewedness)*100:.1f}%')

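A large share of the features is strongly skewed (|skewness| > 0.75). PCA only captures linear structure and is sensitive to extreme values, so it may not be a good option here.
Let's first standardize the data and check whether this assumption proves to be right.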

In [None]:
# standardize the features (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc_data = sc.fit_transform(data.iloc[:,2:])

In [None]:
# perform PCA
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(sc_data)

# scree plot of the first 100 principal components (fewer if not that many exist)
pc_values = pca.explained_variance_ratio_[:100]
x_range = np.arange(1, len(pc_values) + 1)
plt.scatter(x_range, pc_values*100)
plt.xlabel("principal component")
plt.ylabel("explained variance (%)")
plt.title("scree plot: explained variance ratios")
plt.show()
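
A scree plot is often read together with the cumulative explained variance. The sketch below shows one way to find how many components are needed to reach a chosen share of the variance. The synthetic data from `make_classification`, the 90% threshold, and the names `pca_demo` and `n_components_90` are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): replace with your scaled feature matrix
X_demo, _ = make_classification(
    n_samples=200, n_features=50, n_informative=10, random_state=42
)
X_demo = StandardScaler().fit_transform(X_demo)

pca_demo = PCA().fit(X_demo)

# Running total of the explained variance ratio per component
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90%
n_components_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(f"{n_components_90} components explain at least 90% of the variance")
```

scikit-learn can also make this selection itself: passing a float between 0 and 1 as `n_components`, for example `PCA(n_components=0.90)`, keeps just enough components to reach that share of the variance.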

In [None]:
import matplotlib.patches as mpatch
# project the data once and reuse the result for both components
pcs = pca.fit_transform(sc_data)
PC1 = pcs[:,0]
PC2 = pcs[:,1]

colors = data["type"]
color_dict = {'HCC':'r', 'normal':'c'}

# a single scatter call is much faster than plotting point by point
fig, ax = plt.subplots()
ax.scatter(PC1, PC2, c=colors.map(color_dict).tolist(), alpha=0.5)

cyan_patch = mpatch.Patch(color='c', label='normal')
red_patch = mpatch.Patch(color='r', label='HCC')

ax.legend(handles=[cyan_patch,red_patch])
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("PCA")
plt.show()

In [None]:
from sklearn.manifold import TSNE

# initiate instance
tsne = TSNE(n_components=2, learning_rate='auto',
                   init='random', random_state = 42)
# fit once (t-SNE is expensive) and reuse the embedding for both components
ts = tsne.fit_transform(sc_data)
TS1 = ts[:,0]
TS2 = ts[:,1]

# a single scatter call is much faster than plotting point by point
fig, ax = plt.subplots()
ax.scatter(TS1, TS2, c=colors.map(color_dict).tolist(), alpha=0.5)

cyan_patch = mpatch.Patch(color='c', label='normal')
red_patch = mpatch.Patch(color='r', label='HCC')

ax.legend(handles=[cyan_patch,red_patch])
ax.set_xlabel("comp_1")
ax.set_ylabel("comp_2")
ax.set_title("tSNE")
plt.show()
