<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px" />

# Lab: Titanic EDA

---
For this lab, we're going to take a look at the Titanic manifest. We'll be exploring this data to see what we can learn regarding the survival rates of different groups of people.

## Step 1: Reading the data

1. Read the Titanic data (in the form of the `train.csv` in this repo) using the appropriate Pandas method.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

trn = pd.read_csv('train.csv')

### Data Dictionary

| Variable | Description | Details |
|----------|-------------|---------|
| survival | Survival | 0 = No; 1 = Yes |
| pclass | Passenger Class | 1 = 1st; 2 = 2nd; 3 = 3rd |
| name | First and Last Name | |
| sex | Sex | |
| age | Age | |
| sibsp | Number of Siblings/Spouses Aboard | |
| parch | Number of Parents/Children Aboard | |
| ticket | Ticket Number | |
| fare | Passenger Fare | |
| cabin | Cabin | |
| embarked | Port of Embarkation | C = Cherbourg; Q = Queenstown; S = Southampton |

In [None]:
trn.info()
#trn.describe()
#trn.shape
#trn.dtypes

## Step 2: Cleaning the data
####  1. Create a bar chart showing how many missing values are in each column

In [None]:
missing = trn.isnull().sum()   # count missing values once per column
missing[missing != 0].plot(kind='bar')

####  2. Which column has the most `NaN` values? How many cells in that column are empty?


In [None]:
trn.isna().sum().sort_values(ascending=False)
# Cabin has the most NaN values: 687 cells are empty

####  3. Delete all rows where `Embarked` is empty

In [None]:
trn= trn.dropna(subset=['Embarked'])

In [None]:
#check
trn.isna().sum()

#### 4. Fill all empty cabins with **¯\\_(ツ)_/¯**

Note: `NaN`, empty, and missing are synonymous.

In [None]:
trn.fillna({'Cabin': r'¯\(ツ)/¯'}, inplace = True)

In [None]:
#check
trn.head(2)
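
The note above says that `NaN`, empty, and missing are synonymous. A minimal sketch of what that means in Pandas, using a hypothetical three-row `Cabin` series rather than rows from `train.csv`: both `np.nan` and `None` are flagged as missing, and `isna()` and `isnull()` give the same answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, not taken from train.csv
cabins = pd.Series(['C85', np.nan, None], name='Cabin')

# isna() and isnull() are aliases, so they flag the same cells
missing_mask = cabins.isna()
same_mask = missing_mask.equals(cabins.isnull())

# Both np.nan and None count as missing
n_missing = int(missing_mask.sum())

# fillna() replaces every missing cell, whichever form it took
filled = cabins.fillna('unknown')

print(same_mask, n_missing)
print(filled.tolist())
```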

## Step 3: Feature extraction

#### 1.  There are two columns that pertain to how many family members are on the boat for a given person. Create a new column called `FamilyCount` which will be the sum of those two columns.

In [None]:
trn['FamilyCount'] = trn['SibSp'] + trn['Parch']

In [None]:
trn[['SibSp', 'FamilyCount', 'Parch']]

#### 2. Reverends have a special title in their name. Create a column called `IsReverend`: 1 if they're a preacher, 0 if they're not.


In [None]:
# Match the literal title 'Rev.' so names that merely contain 'Rev' are not flagged
trn['IsReverend'] = trn['Name'].str.contains('Rev.', regex=False).astype(int)

In [None]:
trn['IsReverend'].value_counts()
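
One thing to watch with `str.contains`: by default the pattern is treated as a regular expression, and a short pattern can match more than the title. The sketch below uses three made-up names (assumptions for illustration, not rows from the manifest) to compare a loose match on `'Rev'` with a literal match on `'Rev.'`.

```python
import pandas as pd

# Hypothetical names in the manifest's 'Surname, Title. Given names' format
names = pd.Series([
    'Doe, Rev. John',      # a reverend
    'Revere, Mr. Paul',    # surname that happens to start with 'Rev'
    'Smith, Miss. Jane',
])

# Loose match: any name containing the letters 'Rev'
loose = names.str.contains('Rev').astype(int)

# Literal match: regex=False makes the '.' a plain period, not a wildcard
strict = names.str.contains('Rev.', regex=False).astype(int)

print(loose.tolist())
print(strict.tolist())
```

If the two versions disagree on the real data, the rows where they differ are worth inspecting by hand.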

#### 3. In order to feed our training data into a classification algorithm, we need to convert our categories into 1's and 0's using `pd.get_dummies`.

  - Familiarize yourself with the [**`pd.get_dummies` documentation**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)
  - Create 3 columns: `Embarked_C`, `Embarked_Q` and `Embarked_S`. These columns will have 1's and 0's that correspond to the `C`, `Q` and `S` values in the `Embarked` column
  - Do the same thing for `Sex`
  - BONUS (required): Extract the title from everyone's name and create dummy columns

In [None]:
pd.get_dummies(trn['Embarked'], dtype=int)   # dtype=int returns 1's and 0's instead of booleans

In [None]:
pd.get_dummies(trn['Sex'], dtype=int)

In [None]:
# Option that replaces the original column with its dummy columns
trn = pd.get_dummies(trn, columns=['Embarked'], dtype=int)
trn
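
For the BONUS item, one possible approach is to pull the title out of each name with `str.extract` and pass the result to `pd.get_dummies`. This is only a sketch: the names below are made up, the `Title_` prefix is an arbitrary choice, and the pattern assumes every name follows the `Surname, Title. Given names` layout.

```python
import pandas as pd

# Hypothetical names; the real column is trn['Name']
sample = pd.DataFrame({'Name': [
    'Doe, Mr. John',
    'Van Dyke, Mrs. Anna (Maria Smith)',
    'Roe, Miss. Jane',
    'Poe, Rev. Edgar',
]})

# Capture the text between the comma and the first period
titles = sample['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

# One 0/1 column per distinct title
title_dummies = pd.get_dummies(titles, prefix='Title', dtype=int)

print(titles.tolist())
print(title_dummies)
```

On the full data set, rare titles would each get their own column, so it may make sense to group them into a single category before creating the dummies.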

## Step 4: Exploratory analysis

_[`df.groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) may be very useful._

#### 1. What was the survival rate overall?


In [None]:
# Other option
trn['Survived'].mean().round(4)

#### 2. Which gender fared the worst? What was their survival rate?

In [None]:
# Calculate the survival rate by gender
gender_survival_rate = trn.groupby('Sex')['Survived'].mean()

# Identify which gender fared the worst
worst_gender = gender_survival_rate.idxmin()   # idxmin() returns the index label of the smallest value
worst_survival_rate = gender_survival_rate.min()

print(f'Gender that fared the worst: {worst_gender}')
print(f'Survival rate for {worst_gender}: {worst_survival_rate:.2%}')

#### 3. What was the survival rate for each `Pclass`?

In [None]:
trn.groupby('Pclass')[['Survived']].mean()

In [None]:
trn.groupby(['Pclass', 'Sex'])[['Survived']].mean()

In [None]:
trn.groupby(['Pclass', 'Sex'])[['Fare','Survived']].mean()

#### 4. Did any reverends survive? How many?

In [None]:
trn.groupby('IsReverend')['Survived'].mean()

In [None]:
# None of the reverends survived
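
The mean answers "did any survive?", but the "how many?" part is easier to read from counts. A small sketch, using made-up rows rather than the real manifest, that reports the group size, the number of survivors, and the rate in one table:

```python
import pandas as pd

# Hypothetical rows; on the real data use trn instead of sample
sample = pd.DataFrame({
    'IsReverend': [1, 1, 0, 0, 0],
    'Survived':   [0, 0, 1, 0, 1],
})

# count = people in the group, sum = survivors, mean = survival rate
summary = sample.groupby('IsReverend')['Survived'].agg(['count', 'sum', 'mean'])

# Number of reverends (IsReverend == 1) who survived in this sample
reverends_survived = int(summary.loc[1, 'sum'])

print(summary)
print(reverends_survived)
```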

#### 5. What is the survival rate for cabins marked **¯\\_(ツ)_/¯**

In [None]:
trn[trn['Cabin'] == r'¯\(ツ)/¯']['Survived'].mean().round(4)


#### 6. What is the survival rate for people whose `Age` is empty?

In [None]:
trn[trn['Age'].isnull()]['Survived'].mean().round(4)

####  7. What is the survival rate for each port of embarkation?

In [None]:
# Other option: Embarked was replaced by dummy columns, so filter on each one
for col in ['Embarked_C', 'Embarked_Q', 'Embarked_S']:
    print(col, ':')
    print(trn[trn[col] == 1][['Survived']].mean().mul(100).round(2))
    print()
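
The same per-port rates can also be computed without a loop. The sketch below assumes the dummy columns hold 0/1 integers and uses a made-up five-row sample: multiplying each dummy column by `Survived` counts survivors per port, and dividing by the column sums gives the rate.

```python
import pandas as pd

# Hypothetical rows; on the real data use trn instead of sample
sample = pd.DataFrame({
    'Embarked_C': [1, 1, 0, 0, 0],
    'Embarked_Q': [0, 0, 1, 0, 0],
    'Embarked_S': [0, 0, 0, 1, 1],
    'Survived':   [1, 0, 0, 1, 1],
})
port_cols = ['Embarked_C', 'Embarked_Q', 'Embarked_S']

# Survivors per port: a dummy stays 1 only where the passenger also survived
survivors_per_port = sample[port_cols].mul(sample['Survived'], axis=0).sum()

# Passengers per port: the dummy columns simply add up
passengers_per_port = sample[port_cols].sum()

port_rates = survivors_per_port / passengers_per_port
print(port_rates)
```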

#### 8. What is the survival rate for children (under 12) in each `Pclass`?

In [None]:
trn[trn['Age'] < 12].groupby('Pclass')[['Survived']].mean()

####  9. Did the captain of the ship survive? Is he on the list?

In [None]:
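# Get each passenger's title; names look like 'Surname, Title. Given names'
def get_title(full_name):
    # Take the text between the comma and the first period so that
    # multi-word surnames do not shift the title; keep the period ('Capt.')
    after_comma = full_name.split(',', 1)[1]
    return after_comma.split('.', 1)[0].strip() + '.'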

In [None]:
trn['title'] = trn['Name'].apply(get_title)
trn['title'].value_counts()

In [None]:
trn[trn['title'] == 'Capt.']

In [None]:
# Other option: regex=False matches the literal period in 'Capt.'
trn[trn['Name'].str.contains('Capt.', regex=False)][['Name', 'Survived']]

In [None]:
# The passenger with the title 'Capt.' did not survive
# (the title alone does not confirm that he was the Titanic's captain)

#### 10. Of all the people that died, who had the most expensive ticket? How much did it cost?

In [None]:
# Other option: sort the non-survivors by fare and look at the top rows
trn[trn['Survived'] == 0].sort_values(by='Fare', ascending=False).head(2)
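
Sorting and taking the top rows works, but it relies on guessing how many passengers share the top fare. A sketch of an alternative that keeps every non-survivor tied for the maximum fare, shown on made-up rows (names and fares are placeholders):

```python
import pandas as pd

# Hypothetical rows; on the real data use trn instead of sample
sample = pd.DataFrame({
    'Name':     ['A', 'B', 'C', 'D'],
    'Survived': [0, 0, 0, 1],
    'Fare':     [100.0, 100.0, 8.0, 300.0],
})

# Keep only the passengers who died
died = sample[sample['Survived'] == 0]

# Keep every row tied for the highest fare among them
most_expensive = died[died['Fare'] == died['Fare'].max()]

print(most_expensive[['Name', 'Fare']])
```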

#### 11. Does having family on the boat help or hurt your chances of survival?

In [None]:
# Other option: passengers boarded, survivors, and survival rate per family size
def survival_rate(x):
    return round(x.mean()*100,2)

surv_rate = trn.groupby('FamilyCount').agg({'Survived': ['count','sum', survival_rate]})
surv_rate.columns = ['Boarded', 'Survived', 'Survival_rate']
surv_rate.reset_index()
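
A per-size table can be noisy for the larger families, which tend to have only a few passengers each. One way to get a more direct answer to "help or hurt" is to compare passengers travelling alone with everyone else. The `HasFamily` column below is a hypothetical helper, and the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; on the real data use trn instead of sample
sample = pd.DataFrame({
    'FamilyCount': [0, 0, 0, 1, 2, 5],
    'Survived':    [0, 0, 1, 1, 1, 0],
})

# 1 if the passenger had at least one family member aboard, else 0
sample['HasFamily'] = (sample['FamilyCount'] > 0).astype(int)

# Survival rate for passengers alone (0) versus with family (1)
family_rates = sample.groupby('HasFamily')['Survived'].mean()

print(family_rates)
```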

## Step 5: Plotting
Using Matplotlib and Seaborn, create multiple charts showing the survival rates of different groups of people. It's fine if a handful of charts are basic (Gender, Age, etc), but what we're really looking for is something beneath the surface.


In [None]:
# Import Seaborn
import seaborn as sns

# Helper that draws a bar chart of the mean of Y for each value of X
def plot_chart(X, Y, df):
    plt.figure(figsize=(8, 5))
    sns.barplot(x=X, y=Y, data=df)
    plt.title(f'Survival Rate by {X}')
    plt.ylabel('Survival Rate')
    plt.show()

In [None]:
trn.columns

In [None]:
#Survival rate by Gender group
plot_chart('Sex', 'Survived', trn)

In [None]:
#Survival rate by Pclass group
plot_chart('Pclass', 'Survived', trn)

In [None]:
# Survival rate by number of family members aboard
plot_chart('FamilyCount', 'Survived', trn)
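
To look a little beneath the surface, two features can be combined in one chart. The sketch below is one possible approach, not a required solution: `survival_pivot` and `plot_survival_heatmap` are hypothetical helper names, the rows are made up, and the plotting libraries are imported inside the plotting function so the table can be built without them.

```python
import pandas as pd

def survival_pivot(df, rows='Pclass', cols='Sex'):
    # Mean of Survived for every combination of the two features
    return df.pivot_table(index=rows, columns=cols,
                          values='Survived', aggfunc='mean')

def plot_survival_heatmap(df, rows='Pclass', cols='Sex'):
    # Imported here so the pivot helper works without the plotting libraries
    import matplotlib.pyplot as plt
    import seaborn as sns

    table = survival_pivot(df, rows, cols)
    plt.figure(figsize=(6, 4))
    sns.heatmap(table, annot=True, fmt='.0%', vmin=0, vmax=1, cmap='Blues')
    plt.title(f'Survival Rate by {rows} and {cols}')
    plt.show()

# Hypothetical rows; on the real data pass trn instead of sample
sample = pd.DataFrame({
    'Pclass':   [1, 1, 3, 3, 3, 3],
    'Sex':      ['female', 'male', 'female', 'male', 'male', 'female'],
    'Survived': [1, 0, 1, 0, 0, 0],
})
pivot = survival_pivot(sample)
print(pivot)
```

Calling `plot_survival_heatmap(trn)` would then draw the chart for the full data set; other column pairs can be passed through `rows` and `cols`.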