## Clustering Titanic data with MeanShift
Source code from Kaggle (2019): 'https://www.kaggle.com/hesh97/titanicdataset-traincsv'

1. Get the data on the people aboard the Titanic: their class, sex, age, ticket price and whether they survived.
2. Load the data into a pandas dataframe.
3. Drop the PassengerId, Name, Ticket and Cabin columns from the dataframe.
4. Encode the Sex column as 0 or 1.
5. One-hot encode the Embarked column (3 different ports: Cherbourg in France, Queenstown in Ireland and Southampton in England).
6. Drop rows with missing values.
7. Find the best bandwidth to use for our dataset, using sklearn.
8. Fit the data to a MeanShift model.
9. Count how many clusters we get.
10. Add a column to the titanic dataframe with the cluster label for each person.
11. Get the mean values of each cluster group.
12. Add a column with the size of each cluster group.
13. Write out a conclusion from the aggregated data.

In [1]:
import pandas as pd 
import numpy as np
titanic_data = pd.read_csv('data/titanic_train.csv')
titanic_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [2]:
# Drop irrelevant columns
titanic_data.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True)  # keyword form; a positional axis argument is rejected by pandas 2.x
titanic_data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


## Understanding the data

- **Survived** tells us whether the passenger survived the sinking of the Titanic: 0 - did not survive, 1 - survived.
- **Pclass** is the class the passenger was travelling in, i.e. 1st, 2nd or 3rd.
- **Sex** is male or female.
- **Age** is how old the passenger was.
- **SibSp and Parch** are the number of siblings/spouses (SibSp) and parents/children (Parch) the passenger had aboard the Titanic.
- **Fare** is the price of the ticket.
- **Embarked** tells where the passenger boarded the ship (C - Cherbourg, Q - Queenstown, S - Southampton).

In [3]:
from sklearn import preprocessing
# Convert gender to 0 or 1 (female -> 0, male -> 1)
label_enc = preprocessing.LabelEncoder()
titanic_data['Sex'] = label_enc.fit_transform(titanic_data['Sex'].astype(str))
titanic_data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,S
1,1,1,0,38.0,1,0,71.2833,C
2,1,3,0,26.0,0,0,7.925,S
3,1,1,0,35.0,1,0,53.1,S
4,0,3,1,35.0,0,0,8.05,S
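
`LabelEncoder` assigns integer codes in sorted order of the class labels, which is why `female` becomes 0 and `male` becomes 1 in the output above. A minimal sketch on a toy list (the sample values are made up for illustration) that shows how to inspect the mapping:

```python
from sklearn.preprocessing import LabelEncoder

# Toy values for illustration only, not rows from the Titanic file
sex_values = ['male', 'female', 'female', 'male']

encoder = LabelEncoder()
codes = encoder.fit_transform(sex_values)

# classes_ holds the labels in sorted order; the position of a label is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

Checking `classes_` this way avoids guessing which sex a mean of, say, 0.68 in the Sex column refers to.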


In [4]:
# One-hot encoding of 'Embarked' with pd.get_dummies
titanic_data = pd.get_dummies(titanic_data,columns=['Embarked'])
titanic_data.head()


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S
0,0,3,1,22.0,1,0,7.25,0,0,1
1,1,1,0,38.0,1,0,71.2833,1,0,0
2,1,3,0,26.0,0,0,7.925,0,0,1
3,1,1,0,35.0,1,0,53.1,0,0,1
4,0,3,1,35.0,0,0,8.05,0,0,1
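
The 0/1 values shown above depend on the pandas version: recent releases (2.0 and later) return boolean `True`/`False` dummy columns by default. A small sketch, on made-up rows, of requesting integer columns explicitly with the `dtype` argument:

```python
import pandas as pd

# Made-up rows for illustration; only the column names follow the notebook
toy = pd.DataFrame({
    'Fare': [7.25, 71.28, 7.92],
    'Embarked': ['S', 'C', 'Q'],
})

# dtype=int forces 0/1 columns regardless of the pandas default
encoded = pd.get_dummies(toy, columns=['Embarked'], dtype=int)
print(encoded)
```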


In [5]:
# Find missing values in the data and drop those rows
print('rows before drop n/a', len(titanic_data))
bool_matrix = titanic_data.isnull()  # dataframe with True/False for each cell in titanic_data
only_null_filter = bool_matrix.any(axis=1)  # True for each row with a null in any column; a pandas Series whose index matches the titanic dataframe
missing = titanic_data[only_null_filter]  # all rows that have one or more null values

# Remove the rows with null values
titanic_data = titanic_data.dropna()
print('rows after', len(titanic_data))
pd.options.display.max_rows = None  # show all rows of a dataframe (display.max_columns does the same for columns)
bool_matrix

rows before drop n/a 891
rows after 714


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S
0,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False
5,False,False,False,True,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False
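
To see which columns are responsible for the dropped rows, the boolean matrix can be summed per column. A minimal sketch on a made-up frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing ages
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

null_counts = toy.isnull().sum()                       # missing cells per column
rows_with_nulls = int(toy.isnull().any(axis=1).sum())  # rows that dropna() would remove
cleaned = toy.dropna()

print(null_counts)
print(rows_with_nulls, len(cleaned))
```

Applied to `titanic_data` before the `dropna()` call, the same `isnull().sum()` would show how the 177 removed rows are distributed across columns.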


In [6]:
only_null_filter

0      False
1      False
2      False
3      False
4      False
5       True
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17      True
18     False
19      True
20     False
21     False
22     False
23     False
24     False
25     False
26      True
27     False
28      True
29      True
30     False
31      True
32      True
33     False
34     False
35     False
36      True
37     False
38     False
39     False
40     False
41     False
42      True
43     False
44     False
45      True
46      True
47      True
48      True
49     False
50     False
51     False
52     False
53     False
54     False
55      True
56     False
57     False
58     False
59     False
60     False
61     False
62     False
63     False
64      True
65      True
66     False
67     False
68     False
69     False
70     False
71     False
72     False
73     False
74     False
75     False
76      True

In [7]:
# What is the best bandwidth to use for our dataset?
# Smaller bandwidth values result in tall, skinny kernels; larger values result in short, fat kernels.
from sklearn.cluster import estimate_bandwidth
bw = estimate_bandwidth(titanic_data)
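
`estimate_bandwidth` derives the bandwidth from nearest-neighbour distances, and its `quantile` parameter (default 0.3) controls how many neighbours are considered. A hedged sketch on synthetic two-dimensional data (the data and the quantile values are assumptions chosen for illustration) showing that a smaller quantile yields a smaller bandwidth:

```python
import numpy as np
from sklearn.cluster import estimate_bandwidth

# Synthetic data: two well-separated blobs of 50 points each
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, (50, 2)),
    rng.normal(10, 1, (50, 2)),
])

bw_small = estimate_bandwidth(X, quantile=0.1)    # fewer neighbours per point
bw_default = estimate_bandwidth(X, quantile=0.3)  # the default quantile

print(bw_small, bw_default)
```

A smaller bandwidth generally produces more, tighter clusters, so the quantile is worth varying when the default gives one dominant cluster.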

In [8]:
from sklearn.cluster import MeanShift
analyzer = MeanShift(bandwidth=bw) 
analyzer.fit(titanic_data)

MeanShift(bandwidth=30.44675914497196, bin_seeding=False, cluster_all=True,
          max_iter=300, min_bin_freq=1, n_jobs=None, seeds=None)
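
Mean shift repeatedly moves each point to the mean of the data points lying within one bandwidth of it, so points that end up at the same location form a cluster. The following is a simplified one-dimensional sketch with a flat kernel, not scikit-learn's implementation; the function name `mean_shift_1d` and the sample fares are hypothetical:

```python
import numpy as np

def mean_shift_1d(points, bandwidth, n_iter=50):
    """Shift every point toward the mean of its neighbours (flat kernel)."""
    points = np.asarray(points, dtype=float)
    modes = points.copy()
    for _ in range(n_iter):
        for i, m in enumerate(modes):
            # Neighbours are the original points within one bandwidth of m
            neighbours = points[np.abs(points - m) <= bandwidth]
            modes[i] = neighbours.mean()
    return modes

# Hypothetical fares: three cheap tickets, two mid-range, one very expensive
fares = [7.0, 8.0, 9.0, 70.0, 72.0, 500.0]
modes = mean_shift_1d(fares, bandwidth=30.0)

# Points sharing a mode belong to the same cluster
centres = sorted(set(np.round(modes, 2).tolist()))
print(centres)
```

With a bandwidth of 30, the six fares collapse onto three modes, which mirrors how a single wide-ranging feature such as Fare can drive the grouping.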

In [9]:
labels = analyzer.labels_
print(labels)
print('\n\n',np.unique(labels))

[0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 1 1 0 0 0 0 0 0 0 0 0
 0 1 0 1 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 3 0 0 0 1 0 0
 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 3 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0
 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 1 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 4 0 0 1 0 0 0 0 2 2 0 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 2 3 0 2 2 0 1 1 3 0 0 0 0 0 0 2 2 0 0
 0 0 2 0 0 0 1 0 2 0 1 2 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1
 0 0 2 0 0 3 0 0 3 0 0 1 1 1 0 0 2 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 3 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 1 2 0 0 0 0 1 2 0 0 1 0
 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 2 1 1 0 0 0 2 0 0 0 0 2 0 0 0 0 1 1
 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 2 0 1 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
 2 0 0 1 0 0 0 0 0 1 0 0 

#### 5 clusters in the above model
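
The number of clusters and the size of each one can be read directly from the label array. A small sketch on a made-up label array (on the fitted model, `analyzer.labels_` would take its place):

```python
import numpy as np

# Made-up labels for illustration
toy_labels = np.array([0, 1, 0, 1, 0, 3, 0, 2, 4, 0])

cluster_ids, counts = np.unique(toy_labels, return_counts=True)
n_clusters = len(cluster_ids)
sizes = dict(zip(cluster_ids.tolist(), counts.tolist()))

print(n_clusters)
print(sizes)
```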

In [10]:
# Add a new column to the dataset that shows which cluster each row belongs to.
# labels has one entry per row of titanic_data (714 rows), in the same order.
# The column is stored as float (0.0, 1.0, ...).
titanic_data['cluster_group'] = labels.astype(float)

titanic_data.head()


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S,cluster_group
0,0,3,1,22.0,1,0,7.25,0,0,1,0.0
1,1,1,0,38.0,1,0,71.2833,1,0,0,1.0
2,1,3,0,26.0,0,0,7.925,0,0,1,0.0
3,1,1,0,35.0,1,0,53.1,0,0,1,1.0
4,0,3,1,35.0,0,0,8.05,0,0,1,0.0


In [11]:
titanic_data.describe()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S,cluster_group
count,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0
mean,0.406162,2.236695,0.634454,29.699118,0.512605,0.431373,34.694514,0.182073,0.039216,0.77591,0.313725
std,0.49146,0.83825,0.481921,14.526497,0.929783,0.853289,52.91893,0.386175,0.194244,0.417274,0.69027
min,0.0,1.0,0.0,0.42,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,1.0,0.0,20.125,0.0,0.0,8.05,0.0,0.0,1.0,0.0
50%,0.0,2.0,1.0,28.0,0.0,0.0,15.7417,0.0,0.0,1.0,0.0
75%,1.0,3.0,1.0,38.0,1.0,1.0,33.375,0.0,0.0,1.0,0.0
max,1.0,3.0,1.0,80.0,5.0,6.0,512.3292,1.0,1.0,1.0,4.0


In [12]:
# Group passengers by cluster and take the mean of every column
titanic_cluster_data = titanic_data.groupby(['cluster_group']).mean()
# Count of passengers in each cluster
titanic_cluster_data['Counts'] = titanic_data.groupby(['cluster_group']).size()
titanic_cluster_data

cluster_group,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S,Counts
0.0,0.338104,2.52415,0.677996,28.277728,0.440072,0.372093,15.476296,0.121646,0.046512,0.831843,559
1.0,0.607477,1.28972,0.53271,36.11215,0.813084,0.495327,65.871498,0.336449,0.018692,0.626168,107
2.0,0.733333,1.0,0.366667,32.430667,0.6,0.866667,131.183883,0.5,0.0,0.5,30
3.0,0.733333,1.0,0.266667,30.333333,1.0,1.333333,239.99194,0.533333,0.0,0.466667,15
4.0,1.0,1.0,0.666667,35.333333,0.0,0.333333,512.3292,1.0,0.0,0.0,3
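
The per-cluster means and the group sizes can also be computed in a single `groupby` call with named aggregation. A minimal sketch on a made-up frame that reuses the `cluster_group` column name (the values and the output column names are illustrative):

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    'cluster_group': [0, 0, 0, 1, 1],
    'Survived': [0, 1, 0, 1, 1],
    'Fare': [7.0, 8.0, 9.0, 70.0, 72.0],
})

# One output column per (input column, aggregation) pair
summary = toy.groupby('cluster_group').agg(
    survival_rate=('Survived', 'mean'),
    mean_fare=('Fare', 'mean'),
    count=('Survived', 'size'),
)
print(summary)
```

This keeps the count next to the means without a second `groupby`, at the cost of listing the columns explicitly.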


## Conclusion
- Cluster 0
  - Has 559 passengers
  - Survival rate is 34% (very low), meaning most of them did not survive
  - They mostly belong to the lower classes (2nd and 3rd) and are mostly male
  - The average fare paid is `$15`
- Cluster 1
  - Has 107 passengers
  - Survival rate is 61%, meaning a little more than half of them survived
  - They are mostly from 1st and 2nd class
  - The average fare paid is `$66`
- Cluster 2, i.e. the 3rd cluster
  - Has 30 passengers
  - Survival rate is 73%, meaning most of them survived
  - They are all from 1st class (mean Pclass is 1.0)
  - The average fare paid is `$131` (high fare)
- Cluster 3, i.e. the 4th cluster
  - Has 15 passengers
  - Survival rate is 73%, meaning most of them survived
  - They are all from 1st class and are mostly female
  - The average fare paid is `$240` (far higher than the average fare of the first cluster)
- The last cluster (cluster 4) has just 3 data points, so it is not significant and we can ignore it in the analysis
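
The figures quoted in the conclusion can be derived from the aggregated table rather than read off by eye. A sketch that copies the Survived, Fare and Counts values from the table above and drops clusters below a size threshold (`MIN_CLUSTER_SIZE` is a hypothetical cut-off chosen only for illustration):

```python
import pandas as pd

# Values copied from the aggregated cluster table above
clusters = pd.DataFrame(
    {
        'Survived': [0.338104, 0.607477, 0.733333, 0.733333, 1.0],
        'Fare': [15.476296, 65.871498, 131.183883, 239.99194, 512.3292],
        'Counts': [559, 107, 30, 15, 3],
    },
    index=pd.Index([0, 1, 2, 3, 4], name='cluster_group'),
)

MIN_CLUSTER_SIZE = 10  # hypothetical cut-off, not part of the original analysis

# Keep only clusters large enough to interpret
report = clusters[clusters['Counts'] >= MIN_CLUSTER_SIZE].copy()
report['Survived_pct'] = (report['Survived'] * 100).round().astype(int)
report['Fare'] = report['Fare'].round(2)

print(report[['Counts', 'Survived_pct', 'Fare']])
```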
