# Notes for this notebook:

## Models:
I've used 3 different algorithms:
- Kmeans
- Birch
- Agglomerative

## Each Model before and after Tuning:
### Kmeans
The baseline model score:
- silhouette score: 0.6407

After tuning the model, I managed to get this score:
- silhouette score: 0.7765

A weak baseline, because I started with 3 clusters rather than the 2 that turned out to work best.

### BIRCH
The baseline model score:
- silhouette score: 0.7100

After tuning the model, I managed to get this score:
- silhouette score: 0.7821

A decent increase in the silhouette score, and well above 0, the value that indicates overlapping clusters.

### Agglomerative
The baseline model score:
- silhouette score: 0.8093

After tuning the model, I managed to get this score:
- silhouette score: 0.8093

I could not find any parameters worth tuning for Agglomerative, so the baseline model and the tuned model are the same.

## Result Discussion (Some stuff I want to point out)
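- The Agglomerative model had the highest silhouette score, 0.8093.
- Every tuned model scored above 0.77, far above the roughly 0 that random cluster assignments would give.
- There were few parameters to tune: I only changed `n_clusters` for KMeans and `branching_factor` for BIRCH, and nothing for Agglomerative.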
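
To make the comparison with random guessing concrete, here is a minimal sketch that scores random cluster labels against KMeans labels. It uses synthetic two-blob data from `make_blobs` as a stand-in (an assumption, this is not the airline dataset), so the numbers will not match the scores above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption): two well-separated blobs, not the airline dataset.
X, _ = make_blobs(n_samples=500, centers=[[0, 0], [10, 10]], cluster_std=1.0, random_state=42)

# "Random guessing": assign every point to one of two clusters at random.
rng = np.random.default_rng(42)
random_labels = rng.integers(0, 2, size=len(X))
random_score = silhouette_score(X, random_labels)

# Labels from an actual clustering model on the same data.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
kmeans_score = silhouette_score(X, kmeans_labels)

print(f"random labels: {random_score:.3f}")
print(f"KMeans labels: {kmeans_score:.3f}")
```

The silhouette score ranges from -1 to 1. Random labels should land close to 0, while a clustering that matches the structure of the data moves toward 1.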

In [77]:
import pandas as pd
from sklearn.cluster import KMeans, Birch, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split

In [57]:
df = pd.read_csv(r'C:\Users\marku\Desktop\ML\MLGit\datasets\airline.csv')

# EDA in Oblig1 Clustering notebook

In [58]:
df = df.drop(['Unnamed: 0', 'id', 'Flight Distance', 'Departure Delay in Minutes'], axis=1)
df['Gender'] = df['Gender'].replace(['Female', 'Male'], [0,1])
df['Type of Travel'] = df['Type of Travel'].replace(['Personal Travel', 'Business travel'], [0,1])
df['Class'] = df['Class'].replace(['Eco Plus', 'Business', 'Eco'], [0,1, 2])
df['Customer Type'] = df['Customer Type'].replace(['disloyal Customer', 'Loyal Customer'], [0,1])
df['satisfaction'] = df['satisfaction'].replace(['neutral or dissatisfied', 'satisfied'], [0,1])

def handle_null_median(df):
    # fillna returns a new Series, so the result must be assigned back; otherwise the null values remain.
    # Assigning back avoids the chained inplace=True pattern, which is deprecated in recent pandas versions.
    df['Arrival Delay in Minutes'] = df['Arrival Delay in Minutes'].fillna(df['Arrival Delay in Minutes'].median())

    return df

df_unlabeled = df.drop(['satisfaction', 'Gate location', 'Departure/Arrival time convenient'], axis=1)
df_unlabeled = handle_null_median(df_unlabeled)
df.head()

Unnamed: 0,Gender,Customer Type,Age,Type of Travel,Class,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,...,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Arrival Delay in Minutes,satisfaction
0,1,1,13,0,0,3,4,3,1,5,...,5,5,4,3,4,4,5,5,18.0,0
1,1,0,25,1,1,3,2,3,3,1,...,1,1,1,5,3,1,4,1,6.0,0
2,0,1,26,1,1,2,2,2,2,5,...,5,5,4,3,4,4,4,5,0.0,1
3,0,1,25,1,1,2,5,5,5,2,...,2,2,2,5,3,1,4,2,9.0,0
4,1,1,61,1,1,3,3,3,3,4,...,5,3,3,4,4,3,3,3,0.0,1
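
The median imputation done by `handle_null_median` can be checked on a tiny example. This is a sketch on a made-up five-row frame (the `toy` DataFrame and its values are assumptions for illustration, not rows from the airline data).

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing delay value (illustration only).
toy = pd.DataFrame({"Arrival Delay in Minutes": [18.0, 6.0, np.nan, 9.0, 0.0]})

# median() skips NaN by default, so it is computed from the four known values.
median_delay = toy["Arrival Delay in Minutes"].median()

# Assign the filled column back to the frame so the change is kept.
toy["Arrival Delay in Minutes"] = toy["Arrival Delay in Minutes"].fillna(median_delay)

print(median_delay)
print(toy["Arrival Delay in Minutes"].isna().sum())
```

The median is a reasonable fill value for delay columns because a few very long delays would pull the mean upward.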


In [59]:
train, test = train_test_split(df_unlabeled, random_state=42, test_size=0.25)

# Kmeans

In [53]:
kmeans = KMeans(n_clusters=3)
kmeans_cluster = kmeans.fit(train)
prediction = kmeans_cluster.predict(test)
silhouette_score(test, prediction)

0.6407188307005522

I start with 3 clusters as the baseline. (Note: scikit-learn's `KMeans` defaults to `n_clusters=8`, not 3.)

# KMEANS TUNING

In [54]:
kmeans = KMeans(n_clusters=4)
kmeans_cluster = kmeans.fit(train)
prediction = kmeans_cluster.predict(test)
silhouette_score(test, prediction)

0.4063376690024187

Moving up from 3 to 4 clusters lowered the silhouette score sharply (0.6407 to 0.4063). I will therefore try 2 clusters.

In [55]:
kmeans = KMeans(n_clusters=2)
kmeans_cluster = kmeans.fit(train)
prediction = kmeans_cluster.predict(test)
silhouette_score(test, prediction)

0.776568653602399

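Since this was originally a binary classification dataset (satisfied vs. neutral or dissatisfied), I expected 2 clusters to work best, and it does give the highest silhouette score of the three values tried. From this point on, I will start each model with 2 clusters.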
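
The manual search over 4, 3 and 2 clusters can also be written as a loop. The sketch below follows the same fit-on-train, score-on-test pattern, but on synthetic two-blob data (an assumption, not the airline dataset); the range of `k` values and the fixed `random_state` are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): two well-separated blobs.
X, _ = make_blobs(n_samples=400, centers=[[0, 0], [10, 10]], cluster_std=1.0, random_state=42)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

# Fit on the training split and score the predicted test labels for each k.
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_train)
    scores[k] = silhouette_score(X_test, model.predict(X_test))

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"n_clusters={k}: {score:.4f}")
print("best:", best_k)
```

Setting `random_state` on `KMeans` makes the scores repeatable between runs, which the cells above do not do.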

# BIRCH

In [72]:
birch = Birch(n_clusters=2)
birch_cluster = birch.fit(train)
prediction = birch_cluster.predict(test)
silhouette_score(test, prediction)

0.7100111087302052

# BIRCH TUNING

In [73]:
birch = Birch(n_clusters=2, branching_factor=60)
birch_cluster = birch.fit(train)
prediction = birch_cluster.predict(test)
silhouette_score(test, prediction)

0.734814752055152

In [74]:
birch = Birch(n_clusters=2, branching_factor=100)
birch_cluster = birch.fit(train)
prediction = birch_cluster.predict(test)
silhouette_score(test, prediction)

0.7691711173890805

In [75]:
birch = Birch(n_clusters=2, branching_factor=150)
birch_cluster = birch.fit(train)
prediction = birch_cluster.predict(test)
silhouette_score(test, prediction)

0.7821382998018902

In [76]:
birch = Birch(n_clusters=2, branching_factor=200)
birch_cluster = birch.fit(train)
prediction = birch_cluster.predict(test)
silhouette_score(test, prediction)

0.7821382998018902
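
The four tuning cells above differ only in `branching_factor`, so the same search can be sketched as one loop. This version runs on synthetic two-blob data (an assumption, not the airline dataset), so it shows the pattern rather than reproducing the scores above. `Birch` also has a `threshold` parameter (default 0.5) that could be swept the same way.

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): two well-separated blobs.
X, _ = make_blobs(n_samples=400, centers=[[0, 0], [10, 10]], cluster_std=1.0, random_state=42)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

# 50 is scikit-learn's default branching_factor; the other values mirror the cells above.
birch_scores = {}
for branching_factor in (50, 60, 100, 150, 200):
    model = Birch(n_clusters=2, branching_factor=branching_factor).fit(X_train)
    birch_scores[branching_factor] = silhouette_score(X_test, model.predict(X_test))

for branching_factor, score in birch_scores.items():
    print(f"branching_factor={branching_factor}: {score:.4f}")
```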

# Agglomerative

In [83]:
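# Note: this split is not used below. AgglomerativeClustering has no predict(),
# so the model is fit and scored on the full df_unlabeled instead.
train, test = train_test_split(df_unlabeled, train_size=0.90, random_state=42)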
df_unlabeled.shape

(13904, 18)

In [79]:
agglomerative = AgglomerativeClustering(n_clusters=2)
agglomerative_cluster = agglomerative.fit(df_unlabeled)
silhouette_score(df_unlabeled, agglomerative.labels_)

0.8093158154120401

# AGGLOMERATIVE TUNING
I could not find any parameters worth tuning, so this cell repeats the baseline model.

In [81]:
agglomerative = AgglomerativeClustering(n_clusters=2)
agglomerative_cluster = agglomerative.fit(df_unlabeled)
silhouette_score(df_unlabeled, agglomerative.labels_)

0.8093158154120401
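
One parameter that could be explored for Agglomerative is `linkage`, which `AgglomerativeClustering` accepts as `'ward'` (the default), `'complete'`, `'average'` or `'single'`. The sketch below compares them on synthetic two-blob data (an assumption, not the airline dataset); whether any of them beats 0.8093 on the real data is untested here.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption): two blobs placed far apart.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [20, 20]], cluster_std=1.0, random_state=42)

# AgglomerativeClustering has no predict(), so fit and score on the same data.
linkage_scores = {}
for linkage in ("ward", "complete", "average", "single"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    linkage_scores[linkage] = silhouette_score(X, labels)

for linkage, score in linkage_scores.items():
    print(f"linkage={linkage}: {score:.4f}")
```

On data this cleanly separated, all four linkages should find the same two groups. Differences tend to show up on noisier data, where `'single'` in particular can chain clusters together.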