The dataset used in this notebook is available on the Kaggle website (https://www.kaggle.com/c/titanic/data).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

In [None]:
titanic_dataset = pd.read_csv("titanic.csv")

In [None]:
titanic_dataset

## Preprocessing

1. Create a new dataframe containing only the columns that will determine the clusters
2. Encode the Sex column as 0 and 1
3. Use PCA for dimensionality reduction, so the data points can be plotted easily
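
The three steps above can be sketched end to end on a tiny toy dataframe. This is only an illustration under stated assumptions: the `toy` frame and its values are invented, and the names `toy_features` and `toy_reduced` are hypothetical and not used elsewhere in this notebook.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the Titanic data (values invented for illustration only).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# 1. Keep only the columns that should drive the clustering.
toy_features = toy.drop(columns=["Name"])

# 2. Encode Sex as integers.
toy_features["Sex"] = LabelEncoder().fit_transform(toy_features["Sex"])

# 3. Project onto two principal components for 2D plotting.
toy_reduced = PCA(n_components=2).fit_transform(toy_features.values)
print(toy_reduced.shape)
```

The real cells below follow the same order, just on the full dataset.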
    

### Creating new dataframe

In [None]:
# Drop the columns that will not be used for clustering.
set_to_use = titanic_dataset.drop(columns=['Name', 'Ticket', 'Fare', 'Embarked'])
set_to_use.head()

In [None]:
# Replace missing values with 0 so that PCA and KMeans do not receive NaNs.
set_to_use.fillna(0, inplace=True)

In [None]:
set_to_use.head()

### Encoding Sex column using LabelEncoder

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
encoder = LabelEncoder()
# Select the column by name rather than by position, so the code survives column reordering.
encoded_sex = encoder.fit_transform(set_to_use['Sex'])

In [None]:
set_to_use['Sex'] = encoded_sex

In [None]:
set_to_use.head()
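
To check which integer each label received, the fitted encoder can be inspected. The sketch below uses a separate, hypothetical `demo_encoder` on a short hand-written list so that it runs on its own; the same attributes exist on the `encoder` fitted above.

```python
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(["male", "female", "female", "male"])

# classes_ lists the original labels in sorted order; each label's position is its integer code.
mapping = {str(label): int(code) for code, label in enumerate(demo_encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels.
print(demo_encoder.inverse_transform(codes))
```

Because the classes are sorted, `female` is encoded as 0 and `male` as 1.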

In [None]:
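# Use every column except the first and the last (PassengerId and Cabin in the Kaggle column order).
features = set_to_use.iloc[:, 1:-1].values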

In [None]:
print(features)

### Using Principal Component Analysis for dimensionality reduction

In [None]:
from sklearn.decomposition import PCA

In [None]:
titanic_pca = PCA(n_components=4)
titanic_pca.fit(features)
test = titanic_pca.transform(features)

In [None]:
plt.plot(list(titanic_pca.explained_variance_ratio_),'-o')
plt.title('Explained variance ratio as function of PCA components')
plt.ylabel('Explained variance ratio')
plt.xlabel('Component')
plt.show()

The 4-component PCA above was only exploratory. For 2D plotting we need 2 components, so every clustering algorithm below uses reduced_features, which has 2 features.

In [None]:
reduction_pca = PCA(n_components=2)
reduced_features = reduction_pca.fit_transform(features)
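
As a quick check on how much information a 2-component projection keeps, the cumulative explained variance can be printed. This is a minimal sketch on synthetic data: `demo_features` and `demo_pca` are hypothetical names, and the column scales are invented to mimic a dataset in which one feature has a much wider range than the others.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for `features`: 200 rows, 6 columns with very different scales.
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(200, 6)) * np.array([30.0, 5.0, 2.0, 1.0, 1.0, 0.5])

demo_pca = PCA(n_components=4).fit(demo_features)

# Running total of the variance explained by the first k components.
cumulative = np.cumsum(demo_pca.explained_variance_ratio_)
for k, share in enumerate(cumulative, start=1):
    print(f"{k} component(s): {share:.3f} of total variance")
```

Because PCA is sensitive to feature scale, a wide-range column tends to dominate the first component unless the features are standardized first.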

## Clustering

First, we are going to use the scikit-learn implementation of KMeans.

In [None]:
from sklearn.cluster import KMeans

We start with 5 clusters.

In [None]:
km = KMeans(n_clusters=5)
clusters = km.fit(reduced_features)

In [None]:
clusters

In [None]:
plt.scatter(reduced_features[:, 0], reduced_features[:, 1], label='Datapoints')
plt.scatter(clusters.cluster_centers_[:, 0], clusters.cluster_centers_[:, 1], label='Centroids')
plt.title("Sklearn version of KMeans")
plt.legend()
plt.show()

In [None]:
reduced_features.shape

In [None]:
clusters.cluster_centers_
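
Besides the centroids, a fitted scikit-learn KMeans also exposes the cluster index of every point through `labels_`. The sketch below shows this on synthetic blobs; `demo_points` and `demo_km` are hypothetical names, and `n_init` and `random_state` are set here only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2D points standing in for `reduced_features`: three well-separated blobs.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
demo_points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

# n_init and random_state are fixed so repeated runs give the same result.
demo_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(demo_points)

# labels_ holds one cluster index per row; cluster_centers_ holds one centroid per cluster.
print(demo_km.labels_.shape)
print(np.bincount(demo_km.labels_))
```

Passing `c=demo_km.labels_` to `plt.scatter` would color each point by its cluster.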

### (Optional) Using custom KMeans

In [None]:
from kmeans_numpy import KMeans_numpy

In [None]:
kmm = KMeans_numpy(n_clusters=5, tolerance=0.00001)

In [None]:
clusters, clustered_data = kmm.fit(reduced_features)

In [None]:
clusters = np.array(clusters)

I extended this version of KMeans so that it also returns the data points grouped by cluster, which makes the clusters easy to plot and inspect.

In [None]:
cluster_one_data = np.array(clustered_data[0])
cluster_two_data = np.array(clustered_data[1])
cluster_three_data = np.array(clustered_data[2])
cluster_four_data = np.array(clustered_data[3])
cluster_five_data = np.array(clustered_data[4])

In [None]:
plt.figure(figsize=(12, 6))
plt.scatter(cluster_one_data[:, 0], cluster_one_data[:, 1], color='r', label='Cluster one')
plt.scatter(cluster_two_data[:, 0], cluster_two_data[:, 1], color='b', label='Cluster two')
plt.scatter(cluster_three_data[:, 0], cluster_three_data[:, 1], color='g', label='Cluster three')
plt.scatter(cluster_four_data[:, 0], cluster_four_data[:, 1], color='y', label='Cluster four')
plt.scatter(cluster_five_data[:, 0], cluster_five_data[:, 1], color='orange', label='Cluster five')
plt.scatter(clusters[:, 0], clusters[:, 1], marker='*', s=200, color='black', label='Centroids')
plt.title("Custom KMeans results")
plt.legend()
plt.show()
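
The `kmeans_numpy` module itself is not shown in this notebook. As a rough idea of how a NumPy k-means with the same return shape (centroids plus a dict of points per cluster) might look, here is a minimal sketch. The function `simple_kmeans`, its parameters, and its initialization strategy are all assumptions made for illustration, not the actual `KMeans_numpy` code.

```python
import numpy as np

def simple_kmeans(points, n_clusters, tolerance=1e-5, max_iter=100, seed=0):
    """Hypothetical Lloyd-style k-means; not the kmeans_numpy implementation."""
    rng = np.random.default_rng(seed)
    # Start from n_clusters distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(max_iter):
        # Distance from every point to every centroid, shape (n_points, n_clusters).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it in place if it has none.
        new_centroids = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(n_clusters)
        ])
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift < tolerance:
            break
    # Group the points by cluster index using the final centroids.
    labels = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    clustered = {k: points[labels == k] for k in range(n_clusters)}
    return centroids, clustered

# Two synthetic blobs, around (0, 0) and (8, 8).
rng = np.random.default_rng(1)
demo_points = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 2)) for c in (0.0, 8.0)])
centroids, clustered = simple_kmeans(demo_points, n_clusters=2)
print(centroids.shape, {k: len(v) for k, v in clustered.items()})
```

Returning the grouped points as a dict keyed by cluster index is what allows each cluster to be plotted with its own color, as in the cell above.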

Let's experiment and see how the custom KMeans behaves with fewer clusters. Comparing the resulting plots can help us judge the optimal number of clusters.

In [None]:
import time

plot_colors = ['red', 'green', 'blue', 'orange', 'yellow']
start = time.time()
for i in range(1, 5):
    test = KMeans_numpy(n_clusters=i, tolerance=0.00001)
    clust, clust_data = test.fit(reduced_features)
    clust = np.array(clust)
    plt.figure(figsize=(12, 6))
    for key in clust_data.keys():
        points = np.array(clust_data[key])
        plt.scatter(points[:, 0], points[:, 1], color=plot_colors[key], label='Cluster {}'.format(key+1))
    
    plt.scatter(clust[:, 0], clust[:, 1], marker='*', s=200, color='black', label='Centroids')
    plt.title("Custom KMeans results ({} clusters)".format(i))
    plt.legend()
    plt.show()
    
end = time.time()
print("This experiment took: {} seconds with custom algorithm".format(end-start))
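
Visual comparison can be complemented with a numeric criterion. One common option, which is not used in the experiment above, is the elbow method: fit KMeans for several values of k and look for the point where the inertia stops dropping sharply. The sketch below uses scikit-learn's KMeans on synthetic data; `demo_points` and `inertias` are hypothetical names.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2D data with three clear groups, standing in for `reduced_features`.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
demo_points = np.vstack([c + rng.normal(scale=0.6, size=(60, 2)) for c in blob_centers])

# inertia_ is the sum of squared distances from each point to its assigned centroid.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(demo_points)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
```

On data like this the inertia falls steeply up to k=3 and only slowly afterwards, which suggests three clusters. On real data the elbow is often less pronounced.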