# http://bit.ly/dri18-ml

# Before Lunch?
 * Open, filter, and search spreadsheets
 * Compute & plot descriptive statistics
 * Generate cross tabulations (group by/pivot table)
 * Remove missing and invalid data
 * Numerically encode and Z-score normalize variables
 * Reduce data dimensions using Principal Component Analysis
 * Cluster and classify data using nearest neighbor approaches
 * Evaluate an algorithm's accuracy and compute a confusion matrix
 

![Titanic](figs/RMS_titanic_3.jpg)

# [Kaggle Titanic Data](https://www.kaggle.com/c/titanic/data)

VARIABLE DESCRIPTIONS
=====================
```
survival: Survival (0 = No; 1 = Yes)
pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name: Name
sex: Sex
age: Age
```

VARIABLE DESCRIPTIONS
=====================
```
sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
ticket: Ticket Number
fare: Passenger Fare
cabin: Cabin
embarked: Port of Embarkation 
    (C = Cherbourg; Q = Queenstown; S = Southampton)
```

SPECIAL NOTES
==============
```
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5
```

Family Notes
============
```
With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
```


In [None]:
import pandas as pd
print(f'pandas: {pd.__version__}')

import numpy as np
print(f'numpy: {np.__version__}')

import matplotlib
import matplotlib.pyplot as plt
print(f'matplotlib: {matplotlib.__version__}')

In [None]:
# read_csv can read from a file path or a URL
df = pd.read_csv("http://bit.ly/tscv17")
df.head(5)

In [None]:
#What do we have, what type is it, and what's missing?
df.info()

In [None]:
#how do we take a statistical snapshot?
df.describe()

In [None]:
# what if we also want categorical descriptions?
df.describe?

In [None]:
# the docs say to use the include keyword and 'O' argument
df.describe(include=['O'])

In [None]:
# How do we see how many people survived?  1=Survived
df['Survived'].value_counts()

In [None]:
%matplotlib inline
fig, ax = plt.subplots()
df['Survived'].value_counts().plot.pie(ax=ax, 
                                       labels=["Died", "Survived"])
ax.set_aspect('equal') # this makes it a circle
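
`value_counts()` sorts its result by frequency, so a hard-coded label list only lines up with the slices if it follows that same order. A minimal sketch of building the labels from the index instead, using a made-up 0/1 series (not the Titanic data) and a hypothetical `label_for` lookup:

```python
import pandas as pd

# Toy 0/1 outcomes invented for illustration
outcomes = pd.Series([0, 0, 1, 0, 1], name="Survived")

# value_counts orders the codes from most to least frequent
counts = outcomes.value_counts()

# Hypothetical lookup from code to display label
label_for = {0: "Died", 1: "Survived"}

# Build the labels in the same order as the counts
labels = [label_for[code] for code in counts.index]
print(counts)
print(labels)
```

The resulting `labels` list could then be passed to `plot.pie(labels=labels)`.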

Challenge
=====
* Print the value counts of another variable (example: cabin)
* Create a pie chart for another variable

In [None]:
# What if we want to see how the people who survived differ from those who died?
df.groupby(['Survived'])

In [None]:
list(df.groupby(['Survived']))

In [None]:
# How do we get a summary of that?
df.groupby(['Survived']).count()

In [None]:
#how do we now disaggregate the data to compute a cross tabulation?
by_demo = df.groupby(['Sex', 'Pclass'])
#count gives the number of passengers in each (Sex, Pclass) group
by_demo['Survived'].count()

In [None]:
# Can we put it into a table? unstack moves Pclass into the columns
by_demo['Survived'].sum().unstack() 

In [None]:
#Let's compute a survival rate
Survival_rate = by_demo['Survived'].sum()/by_demo['Survived'].count()
Survival_rate
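
Because `Survived` is coded 0/1, dividing the group sum by the group count is the same as taking the group mean. A small sketch on invented rows (not taken from the Titanic file) that checks this and builds the same table with `pd.crosstab`:

```python
import pandas as pd

# Toy rows invented for illustration; not from the Titanic file
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 0, 0, 1, 1],
})

grouped = toy.groupby(["Sex", "Pclass"])["Survived"]

# Rate as survivors divided by passengers in each group
rate_ratio = grouped.sum() / grouped.count()

# The mean of a 0/1 column gives the same number
rate_mean = grouped.mean()

# crosstab produces the Sex x Pclass table in one call
rate_table = pd.crosstab(toy["Sex"], toy["Pclass"],
                         values=toy["Survived"], aggfunc="mean")
print(rate_table)
```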

In [None]:
# And let's plot that
%matplotlib inline
Survival_rate.plot.bar()

In [None]:
# can we group?
Survival_rate.unstack().plot.bar()
Survival_rate.unstack()

In [None]:
#what about grouped by class first?
# .T transposes the table so Pclass is on the x axis
Survival_rate.unstack().T.plot.bar()
Survival_rate.unstack().T

In [None]:
# Is there a correlation?
df[['Pclass','Survived']].corr()

In [None]:
#What if we want to apply different aggregations to different columns?
#{'key':value} is a Python data structure called a dictionary
#string names avoid the warning newer pandas raises for np.sum/np.mean
df.groupby(['Pclass', 'Sex']).agg( {'Survived': 'sum', 
                                    'Fare': 'mean', 
                                    'Age': 'median'})
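
The dictionary form keeps the original column names in the result. If more descriptive names are wanted, pandas also supports named aggregation, where each keyword is the output column and its value is a `(column, function)` pair. A sketch on invented rows, with output names (`survivors`, `mean_fare`, `median_age`) chosen here purely for illustration:

```python
import pandas as pd

# Toy rows invented for illustration
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Survived": [1, 0, 0, 1],
    "Fare": [80.0, 60.0, 8.0, 10.0],
    "Age": [30.0, 40.0, 20.0, 22.0],
})

# keyword = (input column, aggregation function)
summary = toy.groupby("Pclass").agg(
    survivors=("Survived", "sum"),
    mean_fare=("Fare", "mean"),
    median_age=("Age", "median"),
)
print(summary)
```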

Challenge
=====
* Compute a cross tabulation using a different variable in the dataset
* Plot a chart that illustrates your findings
* If possible, compute the correlation between survival and that variable

![mult](figs/multivariate.png)

![ml](figs/ml_map.png)

# Where do we start? axis notation 
![axis](figs/axis.jpg)
source: [stackoverflow](http://stackoverflow.com/questions/25773245/ambiguity-in-pandas-dataframe-numpy-array-axis-definition)

In [None]:
# names and ticket #s are too complicated and PassengerId has no meaning
# drop Survived since that's the outcome we're trying to uncover
# axis = 1 means that these labels are columns
df_filtered = df.drop(["Name", "Ticket", "PassengerId", "Survived"], 
                      axis=1)
#Now what's missing?
df_filtered.info()
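
To make the axis notation concrete, here is a small sketch on an invented two-row frame: `axis=1` refers to column labels and `axis=0` to row labels when dropping, while for reductions such as `mean` the axis is the direction that gets collapsed.

```python
import pandas as pd

# Toy frame invented for illustration
toy = pd.DataFrame({
    "Age": [22.0, 38.0],
    "Fare": [7.25, 71.28],
    "Name": ["a", "b"],
})

# axis=1: the labels are column names
no_name = toy.drop(["Name"], axis=1)

# axis=0: the labels are row index values
no_first_row = toy.drop([0], axis=0)

# mean(axis=0) collapses the rows: one value per column
col_means = no_name.mean(axis=0)

# mean(axis=1) collapses the columns: one value per row
row_means = no_name.mean(axis=1)
print(col_means)
print(row_means)
```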

In [None]:
#We're missing so much cabin data that it makes sense to drop it as a first pass
df_c = df_filtered.drop(["Cabin"], axis=1)
df_c.info()

How do we find the ~170 missing rows? Fancy Indexing
==========================
![masking](figs/masking.png)

modified from [software carpentry](http://v4.software-carpentry.org/matrix/indexing.html)


In [None]:
#isnull - is any cell missing a value? 
#any(axis=1) - which rows have at least 1 missing value?
bad_rows = df_c.isnull().any(axis=1)
#true = 1, false = 0, so sum() gives total # true
print(bad_rows.sum(), "missing rows")

In [None]:
#~ means not, so ~bad_rows => good rows
df_clean = df_c[~bad_rows]
survived = df['Survived'][~bad_rows]

#What state is our data in now?
df_clean.info()
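
The masking steps can be traced on a tiny invented frame. This sketch also checks that, for this toy case, the mask gives the same result as `dropna()`; keeping the mask around is still useful because it can be reused to filter another column, as done with `survived`.

```python
import numpy as np
import pandas as pd

# Toy frame invented for illustration, with two incomplete rows
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# True for every row that has at least one missing cell
bad = toy.isnull().any(axis=1)

# ~ flips the mask, so only complete rows are kept
clean = toy[~bad]

print(bad.tolist())
print(clean)
```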

# How do we deal with categories and text? One-Hot Coding
![one-hot](figs/word2vec-one-hot.png)
source: [Amazing Power of Word Vectors](http://www.kdnuggets.com/2016/05/amazing-power-word-vectors.html)

In [None]:
#Let's one-hot code Pclass, Sex, & Embarked
#http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
df_coded = pd.get_dummies(df_clean, 
                          columns=["Pclass", "Sex", "Embarked"])
df_coded.head()
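
A small sketch of what `get_dummies` does, on invented rows: each listed column is replaced by one indicator column per category, named `<column>_<category>`, and exactly one indicator is set in each row.

```python
import pandas as pd

# Toy rows invented for illustration
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Only the listed columns are encoded; Age is left as is
coded = pd.get_dummies(toy, columns=["Sex", "Embarked"])
print(coded)

# Each row has exactly one Sex indicator switched on
sex_total = coded[["Sex_female", "Sex_male"]].sum(axis=1)
```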

In [None]:
df_coded['Fare'].plot()
df_coded['Age'].plot()
plt.show() #removes matplotlib text under graph

In [None]:
# need to put continuous numerical values on comparable scale (normalize them)
def zscore(x):
    return ((x - x.mean())/x.std()) 

df_coded['AgeN'] = zscore(df_coded['Age'].values)
df_coded['FareN'] = zscore(df_coded['Fare'].values)
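
One detail worth knowing: because `.values` hands `zscore` a NumPy array, `x.std()` is NumPy's population standard deviation (`ddof=0`). Calling the same function on a pandas Series would use the sample standard deviation (`ddof=1`) and give slightly different numbers. A sketch on three invented ages:

```python
import numpy as np
import pandas as pd

def zscore(x):
    return (x - x.mean()) / x.std()

# Toy ages invented for illustration
ages = pd.Series([20.0, 30.0, 40.0])

# NumPy array: std() divides by n (ddof=0)
z_np = zscore(ages.values)

# pandas Series: std() divides by n - 1 (ddof=1)
z_pd = zscore(ages)

print(z_np)
print(z_pd.values)
```

Either choice puts the variables on a comparable scale; the point is only to be consistent.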


In [None]:
df_coded['FareN'].plot()
df_coded['AgeN'].plot()
plt.show()

In [None]:
# What's the distribution of these?

fig, (ax1, ax2) = plt.subplots(nrows=2)
df_coded[['AgeN','FareN']].hist(ax = (ax1, ax2))
plt.show()

In [None]:
#now it's a feature vector as far as our algs are concerned
dfFV = df_coded.drop(["Age", "Fare"], axis=1)
dfFV.head()

In [None]:
#let's save this cleaned dataset
dfFV.to_csv("titanic_FV.csv", index=False)
#and let's save out the survived column for the same rows
survived.to_csv("survived.csv", index=False)