# CS306 Homework3 Cluster

### Information
Author: 王超逸 WANG Chaoyi
SID: 11811014

### README: 
Both tasks are in the same notebook file. Task 2 depends on the imports made in Task 1 and has been tested in that order. If the two tasks are graded separately and Task 2 raises an exception, please make sure the import cells of Task 1 have been run first.

### Reference:


## Task1: Mystery Data
### Necessary imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
import yellowbrick as yb
import warnings
warnings.filterwarnings('ignore')

### Read the data, necessary cleaning and preview 

In [None]:
df = pd.read_csv('./HW3_1_data.csv')
# Remove rows with null values (in fact, there are none)
df.dropna(axis=0, inplace=True)
# Remove duplicate rows (in fact, there are none either)
df = df.drop_duplicates()
# Briefly check the data
df.info()
df.describe(include='all')
# df.isnull().sum()

In [None]:
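f = plt.figure(figsize=(20, 20))
for i, col in enumerate(df.columns):
    ax = f.add_subplot(6, 3, i + 1)
    # histplot (seaborn >= 0.11) replaces the deprecated distplot
    sns.histplot(df[col].ffill(), kde=False, ax=ax)
    ax.set_title('Distribution of ' + col, color='blue')
    ax.set_ylabel('count')
f.tight_layout()
plt.show()  # show once, after all subplots have been drawn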

### See the correlation
Strictly speaking, this is not necessary for a dataset with only two columns.

In [None]:
corr = df.corr()
f, ax = plt.subplots(figsize=(11,9))
cmap = sns.diverging_palette(200, 5, as_cmap=True)
sns.heatmap(corr, annot=True, cmap=cmap)
plt.show()

### Standardize the feature

In [None]:
from sklearn.preprocessing import StandardScaler
scaler_df = StandardScaler().fit_transform(df)  # returns a numpy.ndarray
df_scaled = pd.DataFrame(scaler_df, columns=df.columns)

fig, ax = plt.subplots(1, 2, figsize=(15, 5))
# histplot with kde=True (seaborn >= 0.11) replaces the deprecated distplot
sns.histplot(df['x2'], kde=True, stat='density', ax=ax[0], color='green')
ax[0].set_title("Original")
# Standardized column, for comparison with the original one
sns.histplot(df_scaled['x2'], kde=True, stat='density', ax=ax[1], color='purple')
ax[1].set_title("Standardized")
fig.tight_layout()
plt.show()
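
As a quick sanity check of what `StandardScaler` does, the sketch below standardizes a small made-up array by hand and compares the result with the scaler's output. The array `toy` is hypothetical and is not part of the homework data. The sketch relies on scikit-learn using the population standard deviation (`ddof=0`), which is also the NumPy default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0],
                [4.0, 700.0]])

# Manual z-score: subtract the column mean, divide by the population std.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# The same transformation through scikit-learn.
scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1, so no feature dominates the Euclidean distances used by k-means.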

### Algorithm1: k-means clustering

In [None]:
from sklearn.cluster import KMeans
score = []
range_values = range(1, 20)
for i in range_values:
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(df_scaled)
    score.append(kmeans.inertia_)

plt.plot(range_values, score, '--')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()
# Select k at the elbow: the point after which the inertia decreases roughly linearly.
# In this case, the elbow is at around k = 5.
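
To make the quantity on the y-axis concrete, here is a minimal sketch that recomputes the inertia by hand as the sum of squared distances from each point to its assigned cluster center. The data `blobs` and the name `km_toy` are made up for illustration; `n_init` and `random_state` are set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: three well-separated blobs in 2D.
blobs = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                   for c in ([0, 0], [5, 5], [0, 5])])

km_toy = KMeans(n_clusters=3, n_init=10, random_state=0).fit(blobs)

# Inertia by hand: squared distance from each point to its assigned center.
assigned_centers = km_toy.cluster_centers_[km_toy.labels_]
manual_inertia = ((blobs - assigned_centers) ** 2).sum()

print(manual_inertia, km_toy.inertia_)
```

Because adding clusters can only shrink these distances, the inertia keeps falling as k grows, which is why the elbow rather than the minimum is used to choose k.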

### Silhouette Coefficient
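The silhouette coefficient evaluates a clustering by comparing, for each sample, its mean distance to the points in its own cluster with its mean distance to the points in the nearest other cluster. It ranges from -1 to 1, and higher values indicate better-separated clusters.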

In [None]:
from sklearn.metrics import silhouette_score, silhouette_samples

for k in range(2, 20):
    km = KMeans(n_clusters=k)
    preds = km.fit_predict(df_scaled)
    centers = km.cluster_centers_

    score = silhouette_score(df_scaled, preds, metric='euclidean')
    print("For k = {}, silhouette score is {}".format(k, score))
# Choose the k with the highest silhouette score; here that is k = 4
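
The silhouette score can also be worked out by hand on a tiny example. The sketch below is an illustration only: `X_toy`, `y_toy` and `silhouette_by_hand` are hypothetical names, and the helper assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Hypothetical 1D data with two obvious clusters.
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
y_toy = np.array([0, 0, 1, 1])

def silhouette_by_hand(X, labels):
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        a = dist[i, same].mean()  # mean distance within its own cluster
        # Smallest mean distance to any other cluster.
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

manual = silhouette_by_hand(X_toy, y_toy)
reference = silhouette_score(X_toy, y_toy, metric='euclidean')
print(manual, reference)
```

For the first point, a = 1 and b = 10.5, so its coefficient is 9.5 / 10.5, which is close to 1 because the two clusters are far apart.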

### Yellowbrick showing the best k

In [None]:
from yellowbrick.cluster import KElbowVisualizer
km = KMeans()
visualizer = KElbowVisualizer(km, k=(2,21), metric='silhouette', timing=False)
visualizer.fit(df_scaled)
visualizer.show()  # show() replaces the deprecated poof() in yellowbrick >= 1.0

In [None]:
from yellowbrick.cluster import SilhouetteVisualizer
km = KMeans(n_clusters=4)
visualizer = SilhouetteVisualizer(km)
visualizer.fit(df_scaled)
visualizer.show()  # show() replaces the deprecated poof() in yellowbrick >= 1.0

### Apply k-means


In [None]:
km = KMeans(n_clusters=4)
# Fit on the two features only, so re-running the cell does not use the label column
km.fit(df_scaled[['x1', 'x2']])
cluster_label = km.labels_

df_scaled['KMeans_Label'] = cluster_label
f, ax = plt.subplots(figsize=(10, 10))
sns.scatterplot(x=df_scaled['x1'], y=df_scaled['x2'], hue=df_scaled['KMeans_Label'], palette='Set1', ax=ax)
ax.set_title("Result", color='blue')
f.tight_layout()
plt.show()

### Algorithm2: DBSCAN
The k-means result is unsatisfactory in places (for example, the scattered purple points near the blue cluster should not be labelled purple). Therefore, I apply DBSCAN, a density-based clustering algorithm, to improve the result.
DBSCAN has two main parameters:
- eps: The maximum distance between two samples for one to be considered as in the neighborhood of the other.
- min_samples: The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.

In [None]:
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.12, min_samples=5)
# Cluster on the two features only, so the KMeans_Label column is not treated as a feature
preds = db.fit_predict(df_scaled[['x1', 'x2']])
plt.figure(figsize=(10,10))
plt.scatter(x=df_scaled['x1'], y=df_scaled['x2'],c=preds,cmap='Paired')
plt.title("Clusters determined by DBSCAN")
plt.show()
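
The effect of the two DBSCAN parameters can be seen on a small made-up example. The points in `pts` and the parameter values below are hypothetical and chosen only to make the behaviour obvious; they are not the values used for the homework data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two tight groups of four points plus one isolated point.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
                [10.0, 0.0]])

# Each group member has 4 points (itself included) within eps, so with
# min_samples=3 they are all core points; the isolated point is noise (-1).
labels_loose = DBSCAN(eps=0.5, min_samples=3).fit_predict(pts)

# With min_samples=5 no neighbourhood is dense enough, so everything is noise.
labels_strict = DBSCAN(eps=0.5, min_samples=5).fit_predict(pts)

print(labels_loose)
print(labels_strict)
```

Unlike k-means, DBSCAN does not need the number of clusters in advance, and it reports points that belong to no dense region with the label -1.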

### Algorithm3: GMM
Now we try a Gaussian mixture model (GMM) on the same dataset.

In [None]:
from sklearn.mixture import GaussianMixture
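# Fit on the two features only, excluding the KMeans_Label column
features = df_scaled[['x1', 'x2']]
gmm = GaussianMixture(n_components=4).fit(features)
labels = gmm.predict(features)
plt.scatter(x=df_scaled['x1'], y=df_scaled['x2'], c=labels, s=40, cmap='viridis')
plt.show()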
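
One difference from k-means is that a GMM gives soft assignments: each point gets a probability of belonging to each component. The sketch below illustrates this on made-up data; `blobs2` and `gmm_toy` are hypothetical names, and `random_state` is set only for repeatability.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical data: two Gaussian blobs in 2D.
blobs2 = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
                    rng.normal(loc=[4, 4], scale=0.5, size=(100, 2))])

gmm_toy = GaussianMixture(n_components=2, random_state=0).fit(blobs2)

hard = gmm_toy.predict(blobs2)         # most likely component per point
proba = gmm_toy.predict_proba(blobs2)  # membership probabilities, shape (200, 2)

print(proba[:3])
print(np.allclose(proba.sum(axis=1), 1.0))
```

The hard labels used for the scatter plot are simply the component with the highest probability for each point.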

## Task2: Credit cards
### Import data

In [None]:
df = pd.read_csv('./HW3_2_data.csv')
# Remove rows with null values (in fact, there are none)
df.dropna(axis=0, inplace=True)
# Remove duplicate rows (in fact, there are none either)
df = df.drop_duplicates()
# Briefly check the data
df.info()
df.describe(include='all')
# df.isnull().sum()

### Data Cleaning and Standardizing
From the info above, we can drop `CUST_ID`, because a customer identifier carries no information that is useful for clustering.

In [None]:
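from sklearn.preprocessing import StandardScaler  # imported again so this cell does not rely on Task 1
df.drop('CUST_ID', axis = 1, inplace = True)
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)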

### PCA
PCA is commonly used for dimensionality reduction: it projects the data onto the first few principal components to obtain low-dimensional data while preserving as much variance as possible along the projected directions.
Note that PCA in sklearn only performs this projection. It does not cluster the points, so the cluster labels have to come from a separate clustering step such as k-means.

In [None]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
# PCA does not cluster, so label the Task 2 data with k-means first.
# n_clusters=4 is an illustrative choice, not a value tuned for this dataset.
labels = KMeans(n_clusters=4).fit_predict(df_scaled)
pca = PCA(n_components=2)   # We need two attributes to visualize the clustering
principal = pca.fit_transform(df_scaled)
df_pca = pd.DataFrame(data = principal, columns = ['pca1', 'pca2'])
df_pca = pd.concat([df_pca, pd.DataFrame({'cluster': labels})], axis=1)
plt.scatter(x=df_pca['pca1'], y=df_pca['pca2'], c=df_pca['cluster'], s=10, cmap='viridis')
plt.show()
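
When reading a 2D PCA plot, it helps to know how much of the total variance the two components keep. The following sketch shows one way to check this on made-up data and compares the result with the eigenvalues of the covariance matrix. The array `data` and the name `pca_toy` are hypothetical; on the homework data the same attribute would be read from the fitted `pca` object.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Hypothetical data: two strongly correlated features plus one independent feature.
data = np.hstack([base + 0.1 * rng.normal(size=(200, 1)),
                  2 * base + 0.1 * rng.normal(size=(200, 1)),
                  rng.normal(size=(200, 1))])

pca_toy = PCA(n_components=2).fit(data)

# The same ratios from the eigenvalues of the covariance matrix, largest first.
eigvals = np.sort(np.linalg.eigvalsh(np.cov(data, rowvar=False)))[::-1]
manual_ratio = eigvals[:2] / eigvals.sum()

print(pca_toy.explained_variance_ratio_)
print(manual_ratio)
```

If the two ratios sum to a small fraction, clusters that overlap in the 2D plot may still be well separated in the original feature space.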