# Lab: Titanic EDA

For this lab, we're going to take a look at the Titanic manifest. We'll be exploring this data to see what we can learn regarding the survival rates of different groups of people.

## Step 1: Reading the data

1. Go to [Kaggle Titanic Data](https://www.kaggle.com/c/titanic/data).
2. Click Data in the menu to open a page with a data dictionary explaining each of the columns. Take a minute to familiarize yourself with how the data is structured and what it represents.
3. Read the Titanic data (the corresponding `train.csv` file) using the appropriate pandas method.

![Titanic Data Dictionary](Titanic_Data_Dictionary.png)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
ttn = pd.read_csv('train.csv')

## Step 2: Cleaning the data
#### 1. Output the number of missing values in each column

In [None]:
# Number of missing values in each column in the dataframe

In [1]:
# Filtering the output to just output the columns that have more than 0 missing values
# Also sorting the output in descending order


#### 2. Which column has the most `NaN` values? How many cells in that column are empty?


[Series.idxmax](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.idxmax.html)

- Return the row label of the maximum value.
- If multiple values equal the maximum, the first row label with that value is returned.
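
As a rough sketch of how the missing-value count and `idxmax` fit together, the example below uses a small made-up frame (the name `toy` and all values are hypothetical, not the Titanic data). The same pattern should carry over to `ttn`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, NOT the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of missing values in each column
missing = toy.isnull().sum()

# Keep only columns with at least one missing value, largest count first
missing_only = missing[missing > 0].sort_values(ascending=False)

# Label of the column with the most missing values, and that count
worst_column = missing.idxmax()
worst_count = missing.max()

print(missing_only)
print(worst_column, worst_count)
```

Because `missing` is a Series indexed by column name, `idxmax` returns a column name here rather than a row number.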

#### 3. Delete all rows where `Embarked` is empty

In [2]:
# Checking number of missing values in the `Embarked` column


In [3]:
# Dropping any rows in `Embarked` column where the value is NaN


In [4]:
# Check your work...
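
One possible way to drop rows based on a single column is `dropna` with the `subset` argument. The sketch below runs on a hypothetical toy frame (made-up names and values), so treat it as an illustration of the pattern rather than the expected solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, NOT the Titanic data
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Embarked": ["S", np.nan, "C"],
    "Cabin": [np.nan, np.nan, "C85"],
})

# Drop only the rows where `Embarked` is NaN; NaNs in other columns are kept
cleaned = toy.dropna(subset=["Embarked"])

# Check the work: no missing values should remain in `Embarked`
remaining_missing = cleaned["Embarked"].isnull().sum()
print(len(cleaned), remaining_missing)
```

Without `subset`, `dropna` would remove every row that has a `NaN` in any column, which is usually more than this step asks for.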


#### 4. Fill all empty cabins with **¯\\_(ツ)_/¯**

Note: `NaN`, empty, and missing are synonymous.
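
A minimal sketch of filling missing values with a placeholder string, assuming a made-up `toy` frame (hypothetical values) in place of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, NOT the Titanic data
toy = pd.DataFrame({"Cabin": ["C85", np.nan, np.nan, "E46"]})

# The backslash is escaped so the string holds a single literal backslash
placeholder = "¯\\_(ツ)_/¯"

# Replace every NaN in `Cabin` with the placeholder
toy["Cabin"] = toy["Cabin"].fillna(placeholder)

# Check the work: no NaNs left, and the placeholder shows up in the counts
counts = toy["Cabin"].value_counts()
print(toy["Cabin"].isnull().sum())
print(counts)
```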

In [5]:
# Checking number of missing values in the `Cabin` column


In [6]:
# Filling any values in `Cabin` column where the value is NaN


In [7]:
# Check your work...


In [8]:
# You can also use `.value_counts()` to see what's now in the `Cabin` column


## Step 3: Feature extraction

#### 1. Two columns record how many family members each passenger had on the boat. Create a new column called `FamilyCount` that is the sum of those two columns.

In [9]:
# Summing the two columns together and assigning the output
# to a new column called `FamilyCount`


In [10]:
# Check your work...


In [17]:
# Alison's version

#ttn = ttn.assign(FamilyCount = lambda x: x.SibSp + x.Parch)
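
For comparison, here is a sketch of both approaches (direct column assignment and `assign` with a lambda) on a hypothetical toy frame. The column names `SibSp` and `Parch` come from the commented line above; the values are made up.

```python
import pandas as pd

# Hypothetical toy frame, NOT the Titanic data
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Direct assignment: modifies `toy` in place
toy["FamilyCount"] = toy["SibSp"] + toy["Parch"]

# `assign` returns a new DataFrame and leaves the original untouched
via_assign = toy.assign(FamilyCount=lambda x: x.SibSp + x.Parch)

print(toy["FamilyCount"].tolist())
```

Both versions should produce the same column; `assign` is mostly convenient when chaining several steps together.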

In [11]:
# Check your work...


#### 2. Reverends have a special title in their name. Create a new column called `IsReverend`: 1 if they're a reverend, 0 if they're not.


In [12]:
# Filtering the dataframe for only rows where `Rev` appears in the `Name` column


In [13]:
# Boolean output of True if the string contains Rev or False if not
# Since True == 1 and False == 0


# Chaining on `.astype(int)` will convert the booleans to integers


In [14]:
# Lambda function...


In [16]:
# Named function...

#ttn['Name'].apply(Reverend).value_counts()


In [17]:
# Check your work...


[np.where documentation](https://numpy.org/doc/stable/reference/generated/numpy.where.html)
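
The approaches hinted at in the cells above (boolean mask with `.astype(int)`, `np.where`, and a lambda via `.apply`) can be sketched on a toy frame as follows. The names are invented for illustration, and matching on the literal string `"Rev."` is an assumption about how the title appears in `Name`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with invented names, NOT the Titanic data
toy = pd.DataFrame({
    "Name": ["Doe, Rev. John", "Smith, Mr. Adam", "Roe, Mrs. Jane"],
})

# Boolean mask: True where the name contains the literal text "Rev."
is_rev = toy["Name"].str.contains("Rev.", regex=False)

# Option 1: cast the booleans to integers
toy["IsReverend"] = is_rev.astype(int)

# Option 2: np.where picks 1 where the mask is True, else 0
via_where = np.where(is_rev, 1, 0)

# Option 3: a lambda applied to each name
via_apply = toy["Name"].apply(lambda name: 1 if "Rev." in name else 0)

print(toy)
```

Passing `regex=False` keeps the `.` from being treated as a regular-expression wildcard.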

#### 3. In order to feed our training data into a classification algorithm, we need to convert our categories into 1's and 0's using...
`pd.get_dummies`
  - Familiarize yourself with the [`pd.get_dummies` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)
  - Create 3 columns: `Embarked_C`, `Embarked_Q` and `Embarked_S`. These columns will have 1's and 0's that correspond to the `C`, `Q` and `S` values in the `Embarked` column
  - Do the same thing for `Sex`
  - BONUS: Extract the title from everyone's name and create dummy columns
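
A small sketch of `pd.get_dummies` on a hypothetical toy frame (made-up rows) shows the column naming described above. Passing `dtype=int` is one way to get 1's and 0's rather than booleans.

```python
import pandas as pd

# Hypothetical toy frame, NOT the Titanic data
toy = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Sex": ["male", "female", "female", "male"],
})

# One dummy column per category, named <column>_<value>
dummies = pd.get_dummies(toy, columns=["Embarked", "Sex"], dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

The columns listed in `columns=` are replaced by their dummy columns in the returned DataFrame; the original `toy` is not modified.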

## Step 4: Exploratory analysis

_[`df.groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) may be very useful._
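
As an illustration of the split-apply-combine pattern, the sketch below computes survival rates on a hypothetical toy frame. It assumes a 0/1 column named `Survived`, as listed in the Kaggle data dictionary, so the mean of that column is the survival rate.

```python
import pandas as pd

# Hypothetical toy frame, NOT the Titanic data
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male"],
    "Survived": [0, 1, 1, 1, 0],
})

# Overall survival rate: the mean of a 0/1 column
overall_rate = toy["Survived"].mean()

# Split by `Sex`, apply the mean to `Survived`, combine into one Series
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

# Group label with the lowest survival rate
worst_group = rate_by_sex.idxmin()

print(overall_rate)
print(rate_by_sex)
print(worst_group)
```

Swapping `"Sex"` for another categorical column gives the rate for that grouping with the same three steps.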

#### 1. What was the survival rate overall?


#### 2. Which gender fared the worst? What was their survival rate?

In [44]:
# Split-Apply-Combine

#### 3. What was the survival rate for each `Pclass`?

#### 4. Did any reverends survive? How many?

#### 5. What is the survival rate for cabins marked **¯\\_(ツ)_/¯**?

#### 6. What is the survival rate for people whose `Age` is empty?
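
One way to approach this is to build a boolean mask of the missing values and take the mean of `Survived` over just those rows. The frame below is a hypothetical toy example with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, NOT the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan, 35.0],
    "Survived": [1, 0, 1, 0],
})

# True for rows where `Age` is missing
age_missing = toy["Age"].isnull()

# Survival rate among only those rows
rate_missing_age = toy.loc[age_missing, "Survived"].mean()

print(rate_missing_age)
```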

#### 7. What is the survival rate for each port of embarkation?

#### 8. What is the survival rate for children (under 12) in each `Pclass`?
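
Filtering and grouping can be combined for a question like this. The sketch below uses a hypothetical toy frame with made-up values; note that rows with a missing `Age` fail the `< 12` comparison and are left out.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, NOT the Titanic data
toy = pd.DataFrame({
    "Age": [4.0, 11.0, 12.0, 30.0, 8.0, np.nan],
    "Pclass": [1, 3, 3, 1, 3, 2],
    "Survived": [1, 0, 1, 0, 1, 1],
})

# Keep only passengers strictly under 12 (NaN ages compare as False)
children = toy[toy["Age"] < 12]

# Survival rate of those children within each passenger class
child_rate = children.groupby("Pclass")["Survived"].mean()

print(child_rate)
```

A class with no children in the filtered data simply does not appear in the result.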

#### 9. Did the captain of the ship survive? Is he on the list?

#### 10. Of all the people that died, who had the most expensive ticket? How much did it cost?
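
`idxmax` is also handy here: filter to the rows of interest, find the index label of the largest `Fare`, and look that row up with `.loc`. The frame and passenger names below are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with invented passengers, NOT the Titanic data
toy = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
    "Fare": [7.25, 80.0, 50.0, 8.05],
    "Survived": [0, 1, 0, 0],
})

# Only the passengers who did not survive
died = toy[toy["Survived"] == 0]

# Row of the non-survivor with the highest fare
priciest = died.loc[died["Fare"].idxmax()]

print(priciest["Name"], priciest["Fare"])
```

If several passengers tie for the highest fare, `idxmax` returns only the first one, so a comparison against `died["Fare"].max()` may be a safer way to list them all.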

#### 11. Does having family on the boat help or hurt your chances of survival?

### What if we wanted to model the Titanic data? Use any method of machine learning to model the data