# Titanic Exercise

https://github.com/datsoftlyngby/dat4sem2020spring-python/blob/master/notebooks/10-4-2%20Clustering%20Titanic%20eample.ipynb  

1. Get the data on people on the Titanic: their class, sex, age, ticket price and whether they survived.
2. Load it into a pandas DataFrame.

In [51]:
import pandas as pd
import numpy as np

titanic_data = pd.read_csv('data/train.csv')
titanic_data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


3. Drop the PassengerId, Name, Ticket, Cabin columns from the dataframe

In [52]:
titanic_data.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)
titanic_data.head()

# Survived: whether the passenger survived the sinking of the Titanic (0 = did not survive, 1 = survived)
# Pclass: the class the passenger was travelling in (1st, 2nd or 3rd)
# Sex: male or female
# Age: how old the passenger is
# SibSp: number of siblings/spouses aboard; Parch: number of parents/children aboard
# Fare: the price of the ticket
# Embarked: where the passenger boarded the ship (C = Cherbourg, Q = Queenstown, S = Southampton)

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


4. Change sex column into 0 or 1

In [53]:
from sklearn import preprocessing
# Convert gender to 0 or 1
label_enc = preprocessing.LabelEncoder()
titanic_data['Sex'] = label_enc.fit_transform(titanic_data['Sex'].astype(str)) # encode string categories to integer levels (0,1,2,3...) https://stackoverflow.com/a/41774086
titanic_data.head() # notice the change in Sex: female is now 0 and male is now 1

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,S
1,1,1,0,38.0,1,0,71.2833,C
2,1,3,0,26.0,0,0,7.925,S
3,1,1,0,35.0,1,0,53.1,S
4,0,3,1,35.0,0,0,8.05,S
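
To check which integer stands for which category, the fitted encoder can be inspected. The sketch below uses a small made-up list (`sample_sex`) instead of the real Titanic column; `LabelEncoder` sorts the class labels, which is why `female` comes before `male`.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample values standing in for the 'Sex' column
sample_sex = ['male', 'female', 'female', 'male']

encoder = LabelEncoder()
codes = encoder.fit_transform(sample_sex)

# classes_ holds the original labels in sorted order;
# the position of each label is its integer code
mapping = {str(label): int(code)
           for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes)
```
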


5. One-hot-encode the Embarked column (3 different ports: Cherbourg in France, Queenstown in Ireland and Southampton in England)

*One-hot encoding  
replaces a categorical column (e.g. countries) with one column per category (each country gets its own column); a 0 or 1 in each new column indicates whether that category applies to the data row.*

In [54]:
# One-hot encoding of 'Embarked' with pd.get_dummies
# dtype=int keeps the new columns as 0/1 integers (newer pandas versions default to True/False)
titanic_data = pd.get_dummies(titanic_data, columns=['Embarked'], dtype=int) #https://stackoverflow.com/a/48170725
titanic_data.head()
# Notice how Embarked has changed from 1 column to 3
# Instead of a letter for the port (S, C or Q), each port now has its own column holding 0 or 1

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S
0,0,3,1,22.0,1,0,7.25,0,0,1
1,1,1,0,38.0,1,0,71.2833,1,0,0
2,1,3,0,26.0,0,0,7.925,0,0,1
3,1,1,0,35.0,1,0,53.1,0,0,1
4,0,3,1,35.0,0,0,8.05,0,0,1


6. Drop rows with missing values

In [55]:
pre_rowcount = len(titanic_data)
titanic_data.dropna(inplace=True)
post_rowcount = len(titanic_data)

print('Rows before:', pre_rowcount)
print('Rows after:', post_rowcount)
print('Rows removed:', pre_rowcount - post_rowcount)

Rows before: 891
Rows after: 714
Rows removed: 177
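
Which columns cause rows to be dropped can be checked with `isna().sum()`. The sketch below runs on a made-up three-row frame (`toy`), not on the Titanic data. It also illustrates a side effect of doing step 5 before step 6: `pd.get_dummies` turns a missing category into a row of zeros, so a missing `Embarked` value is no longer seen as missing when `dropna` runs afterwards.

```python
import numpy as np
import pandas as pd

# Made-up frame: one missing Age and one missing Embarked
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0],
    'Embarked': ['S', 'C', np.nan],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# After one-hot encoding, the missing port becomes 0 in every dummy column
encoded = pd.get_dummies(toy, columns=['Embarked'], dtype=int)
print(encoded)

# dropna now only removes the row with the missing Age
cleaned = encoded.dropna()
print(len(cleaned))
```

If rows with a missing port should be removed as well, one option is to call `dropna` before the one-hot encoding.
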


7. What is the best bandwidth to use for our dataset? Use sklearn.

*The bandwidth is the distance/size scale of the kernel function, i.e. what the size of the “window” is across which you calculate the mean.*([Source](https://softwareengineering.stackexchange.com/a/388324))  
![](https://raw.githubusercontent.com/datsoftlyngby/dat4sem2020spring-python/50423ce9bb39cdb25ce76b8c28b1e92abe42c8d1/notebooks/images/meanshift.gif)  
> The circle window above is one of many windows that are distributed over the feature space.  
Each window moves towards the region with the highest density of feature vectors.  
In every iteration the mean of the points inside the window is calculated and the window is moved to that mean.  
This is repeated over several iterations until all windows have stopped moving.  
Finally, all windows at the same location are merged and every data point is assigned to its nearest cluster.

Smaller bandwidth values result in tall, skinny kernels; larger values result in short, fat kernels.
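
The window movement described above can be sketched in a few lines of NumPy. This is a simplified illustration with a flat kernel and a single window, not scikit-learn's actual implementation; the function name `shift_window`, its parameters and the two made-up blobs of points are assumptions for the example.

```python
import numpy as np

def shift_window(points, start, bandwidth, max_iter=100, tol=1e-6):
    """Move one window towards a local density peak (flat kernel)."""
    center = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        # Points whose distance to the window centre is within the bandwidth
        distances = np.linalg.norm(points - center, axis=1)
        inside = points[distances <= bandwidth]
        if len(inside) == 0:
            break
        # Shift the window to the mean of the points it contains
        new_center = inside.mean(axis=0)
        moved = np.linalg.norm(new_center - center)
        center = new_center
        if moved < tol:
            break
    return center

# Two made-up blobs of points, around (0, 0) and (5, 5)
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2)),
])

mode = shift_window(points, start=[1.0, 1.0], bandwidth=2.0)
print(mode)  # ends up close to the blob around (0, 0)
```

A window started near the other blob would settle there instead, which is how several windows end up describing several clusters.
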

In [56]:
from sklearn.cluster import estimate_bandwidth
titanic_bandwidth = estimate_bandwidth(titanic_data)
titanic_bandwidth # will be used later

30.44675914497196
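
`estimate_bandwidth` derives the bandwidth from nearest-neighbour distances, and its `quantile` parameter (default 0.3) controls how many neighbours are taken into account. The sketch below, on made-up two-blob data rather than the Titanic frame, shows how the estimate grows with the quantile.

```python
import numpy as np
from sklearn.cluster import estimate_bandwidth

# Made-up data: two blobs of 100 points each
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(0, 1, size=(100, 2)),
    rng.normal(8, 1, size=(100, 2)),
])

narrow = estimate_bandwidth(toy, quantile=0.1)
default = estimate_bandwidth(toy, quantile=0.3)  # same as leaving quantile out
wide = estimate_bandwidth(toy, quantile=0.8)
print(narrow, default, wide)
```

A smaller bandwidth generally leads to more, smaller clusters, so trying a few quantiles is a reasonable way to see how stable the clustering is.
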

8. Fit data to a meanshift model

In [57]:
from sklearn.cluster import MeanShift
analyzer = MeanShift(bandwidth=titanic_bandwidth) # use the bandwidth estimated in step 7
analyzer.fit(titanic_data)

MeanShift(bandwidth=30.44675914497196, bin_seeding=False, cluster_all=True,
          min_bin_freq=1, n_jobs=None, seeds=None)
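
Mean shift works on Euclidean distances, so a column with a wide range (such as Fare) weighs much more heavily than a 0/1 column (such as Sex). The exercise fits the unscaled data; as a possible variation, the sketch below standardises made-up data with `StandardScaler` before clustering. The toy columns are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up data: a 0/1 column (like Sex) and a wide-range column (like Fare)
toy = np.column_stack([
    rng.integers(0, 2, size=200),
    rng.uniform(0, 500, size=200),
]).astype(float)

# Rescale every column to mean 0 and standard deviation 1
scaled = StandardScaler().fit_transform(toy)
print(toy.std(axis=0))     # very different spreads before scaling
print(scaled.std(axis=0))  # equal spreads after scaling

bandwidth = estimate_bandwidth(scaled)
model = MeanShift(bandwidth=bandwidth).fit(scaled)
print(len(np.unique(model.labels_)))
```
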

9. How many clusters do we get?

In [58]:
labels = analyzer.labels_
labels_unique = np.unique(labels)
n_clusters = len(labels_unique)

print(f"Number of estimated clusters: {n_clusters}") #https://scikit-learn.org/stable/auto_examples/cluster/plot_mean_shift.html
print('Visual presentation of clusters:\n', labels)

Number of estimated clusters: 5
Visual presentation of clusters:
 [0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 1 1 0 0 0 0 0 0 0 0 0
 0 1 0 1 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 3 0 0 0 1 0 0
 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 3 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0
 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 1 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 4 0 0 1 0 0 0 0 2 2 0 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 2 3 0 2 2 0 1 1 3 0 0 0 0 0 0 2 2 0 0
 0 0 2 0 0 0 1 0 2 0 1 2 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1
 0 0 2 0 0 3 0 0 3 0 0 1 1 1 0 0 2 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 3 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 1 2 0 0 0 0 1 2 0 0 1 0
 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 2 1 1 0 0 0 2 0 0 0 0 2 0 0 0 0 1 1
 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 2
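
The printed label array is hard to read at a glance. A small sketch of counting the members per cluster with `np.unique(..., return_counts=True)`, using a short made-up label array (`sample_labels`) in place of `analyzer.labels_`:

```python
import numpy as np

# Made-up labels standing in for analyzer.labels_
sample_labels = np.array([0, 1, 0, 1, 0, 3, 0, 2, 0, 0])

# Unique cluster ids and how often each one occurs
cluster_ids, sizes = np.unique(sample_labels, return_counts=True)
for cluster_id, size in zip(cluster_ids, sizes):
    print(f"cluster {cluster_id}: {size} members")
```
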

10. Add a column to the titanic dataframe with the cluster label for each person

In [59]:
# Add a new column that shows which cluster each row belongs to.

titanic_data['cluster_group'] = np.nan
data_length = len(titanic_data)
for i in range(data_length): # loop over all 714 rows
    titanic_data.iloc[i, titanic_data.columns.get_loc('cluster_group')] = labels[i] # set the cluster label on each row
    # iloc = integer location: selects the cell in row i of column 'cluster_group'
    # labels_ is in the same order as the rows passed to fit(), so labels[i] belongs to row i

titanic_data.head()

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S,cluster_group
0,0,3,1,22.0,1,0,7.25,0,0,1,0.0
1,1,1,0,38.0,1,0,71.2833,1,0,0,1.0
2,1,3,0,26.0,0,0,7.925,0,0,1,0.0
3,1,1,0,35.0,1,0,53.1,0,0,1,1.0
4,0,3,1,35.0,0,0,8.05,0,0,1,0.0
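
The row-by-row loop can also be written as a single column assignment. The sketch below uses a made-up frame (`toy`) and made-up labels (`toy_labels`); note that it produces an integer column (0, 1, ...) rather than the floats (0.0, 1.0, ...) shown above.

```python
import numpy as np
import pandas as pd

# Made-up frame whose index has a gap, as happens after dropna()
toy = pd.DataFrame({'Age': [22.0, 38.0, 35.0]}, index=[0, 1, 3])
toy_labels = np.array([0, 1, 0])  # stand-in for analyzer.labels_

# A NumPy array is assigned by position, so the gap in the index does not matter
toy['cluster_group'] = toy_labels
print(toy)
```
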


11. Get mean values of each cluster group

In [60]:
# All data is now numeric instead of strings.
# Let's get some overall statistics
titanic_data.describe()

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S,cluster_group
count,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0
mean,0.406162,2.236695,0.634454,29.699118,0.512605,0.431373,34.694514,0.182073,0.039216,0.77591,0.313725
std,0.49146,0.83825,0.481921,14.526497,0.929783,0.853289,52.91893,0.386175,0.194244,0.417274,0.69027
min,0.0,1.0,0.0,0.42,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,1.0,0.0,20.125,0.0,0.0,8.05,0.0,0.0,1.0,0.0
50%,0.0,2.0,1.0,28.0,0.0,0.0,15.7417,0.0,0.0,1.0,0.0
75%,1.0,3.0,1.0,38.0,1.0,1.0,33.375,0.0,0.0,1.0,0.0
max,1.0,3.0,1.0,80.0,5.0,6.0,512.3292,1.0,1.0,1.0,4.0


12. Add a column with the size of each cluster group.

In [61]:
# Group passengers by cluster and take the mean of every column
titanic_cluster_data = titanic_data.groupby(['cluster_group']).mean()

# Count the passengers in each cluster
titanic_cluster_data['Counts'] = titanic_data.groupby(['cluster_group']).size()
titanic_cluster_data # notice the last column, "Counts"
# we now have a set of statistics per cluster instead of per individual

cluster_group,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S,Counts
0.0,0.338104,2.52415,0.677996,28.277728,0.440072,0.372093,15.476296,0.121646,0.046512,0.831843,559
1.0,0.607477,1.28972,0.53271,36.11215,0.813084,0.495327,65.871498,0.336449,0.018692,0.626168,107
2.0,0.733333,1.0,0.366667,32.430667,0.6,0.866667,131.183883,0.5,0.0,0.5,30
3.0,0.733333,1.0,0.266667,30.333333,1.0,1.333333,239.99194,0.533333,0.0,0.466667,15
4.0,1.0,1.0,0.666667,35.333333,0.0,0.333333,512.3292,1.0,0.0,0.0,3
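
As an alternative to computing the means and the counts separately, pandas' named aggregation can build a compact summary in one call. A minimal sketch on a made-up frame (`toy`); the output column names `survival_rate`, `mean_fare` and `count` are chosen for this example.

```python
import pandas as pd

# Made-up frame with the columns needed for the summary
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1],
    'Fare': [7.25, 71.28, 53.10, 8.05, 80.00],
    'cluster_group': [0, 1, 1, 0, 1],
})

# One row per cluster: mean survival, mean fare and number of passengers
summary = toy.groupby('cluster_group').agg(
    survival_rate=('Survived', 'mean'),
    mean_fare=('Fare', 'mean'),
    count=('Survived', 'size'),
)
print(summary)
```
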


13. Write out conclusion from the aggregated data.

- Cluster 0 (0.0)
  - Has 559 passengers
  - Survival rate is 34% (very low), so most of them did not survive
  - They are mostly from 2nd and 3rd class (mean Pclass 2.5) and about 68% are male
  - The average fare paid is about $15
- Cluster 1 (1.0)
  - Has 107 passengers
  - Survival rate is 61%, so a little more than half of them survived
  - They are mostly from 1st class, with some from 2nd class (mean Pclass 1.3)
  - The average fare paid is about $66
- Cluster 2 (2.0)
  - Has 30 passengers
  - Survival rate is 73%, so most of them survived
  - They are all from 1st class and about 63% are female
  - The average fare paid is about $131 (high fare)
- Cluster 3 (3.0)
  - Has 15 passengers
  - Survival rate is 73%, so most of them survived
  - They are all from 1st class and about 73% are female
  - The average fare paid is about $240 (far higher than in clusters 0 and 1)

The last cluster (4.0) has just 3 data points, too few to draw conclusions from, so we leave it out of the analysis.

## Extra: Visualization
Taken from https://scikit-learn.org/stable/auto_examples/cluster/plot_mean_shift.html

In [73]:
import matplotlib.pyplot as plt
from itertools import cycle

cluster_centers = analyzer.cluster_centers_
# The model was fitted before 'cluster_group' was added, so leave that column out
features = titanic_data.drop(columns=['cluster_group'])
X = features.to_numpy(dtype=float)
# Plot two of the fitted columns against each other, here Age and Fare
x_col = features.columns.get_loc('Age')
y_col = features.columns.get_loc('Fare')

colors = cycle('bgrcmyk')
for k, col in zip(range(n_clusters), colors):
    my_members = labels == k  # boolean mask of the rows in cluster k
    cluster_center = cluster_centers[k]
    plt.plot(X[my_members, x_col], X[my_members, y_col], col + '.')
    plt.plot(cluster_center[x_col], cluster_center[y_col], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
plt.xlabel('Age')
plt.ylabel('Fare')
plt.title(f'Estimated number of clusters: {n_clusters}')
plt.show()