# Surviving the Titanic

This notebook contains a few exercises focused on using data frames to begin looking at data. It uses the `titanic.csv` dataset ([source](http://web.stanford.edu/class/archive/cs/cs109/cs109.1166/problem12.html)), containing data about the passengers on the [RMS *Titanic*](https://en.wikipedia.org/wiki/RMS_Titanic).

Relevant textbook section: [7.1 - An Introduction to Working with Dataframes](https://snakebear.science/07-Pandas/pandas-descriptives.html)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# For slightly nicer charts
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['figure.dpi'] = 150

***
Now we read a CSV file of passengers on the Titanic and assign it to a variable called `df`. Each row in the dataset represents one passenger on the Titanic and contains information about that passenger, including whether they survived. The columns are labelled as follows:
* 'Survived'
* 'Pclass'
* 'Name'
* 'Sex'
* 'Age'
* 'Siblings/Spouses Aboard'
* 'Parents/Children Aboard'
* 'Fare'

In [None]:
df = pd.read_csv("titanic.csv")
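To see what `read_csv()` does without needing the real file, here is a minimal sketch that reads a tiny CSV from an in-memory string. The two rows, the passenger names, and the variable names `csv_text` and `toy` are invented for illustration and are not taken from `titanic.csv`.

```python
import io
import pandas as pd

# Made-up CSV text with a few of the same column labels (not real data)
csv_text = """Survived,Pclass,Name,Sex,Age
0,3,Passenger A,male,22
1,1,Passenger B,female,38
"""

# read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

# The first line of the CSV becomes the column labels
print(toy.columns.tolist())
```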

At this point all we know is that we've successfully read a CSV file. Let's find out some more about the data. First, let's find out how much data there is in the file.

In [None]:
df.shape

`shape` is a list of two numbers (technically a "tuple," but it's a lot like a list) that represent the number of rows and the number of columns in your dataframe. The Titanic dataset has 887 rows (each representing one passenger) and 8 columns.
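Because `shape` is a tuple, its two numbers can be unpacked into separate variables. The sketch below uses a small made-up dataframe (`toy`, with invented names and ages) rather than the Titanic data.

```python
import pandas as pd

# Toy data invented for illustration; not taken from titanic.csv
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Age": [29, 41, 7],
})

# shape is (number of rows, number of columns)
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```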

It might be a good idea to take a look at our dataframe, but 887 rows is a bit much to look at all at once. We can look at just the first few rows (five by default) by using the dataframe method `.head()`.

In [None]:
df.head()

***
## Replacing Values

The 0 and 1 values used to code the 'Survived' column are not easy to read or understand. The 1, 2, 3 values used to code Passenger Class are a little better but could also be improved with more descriptive values. To recode values in a column we can use the `replace()` method on a column.

In [None]:
df['Survived'] = df['Survived'].replace(0, 'Perished')

In the code above we have applied the `replace()` method to the 'Survived' column of the dataframe. Specifically, `df['Survived']` is accessing the 'Survived' column of the dataframe, and `.replace()` is calling a method on that column that takes any instance of the first argument we supply, in this case `0`, and replaces it with the second argument, `'Perished'`. We then assign the result back to `df['Survived']` so that the change is stored in the dataframe.


We can do this again to replace `1` with `'Lived'`.

In [None]:
df['Survived'] = df['Survived'].replace(1, 'Lived')
df.head()
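The two `replace()` calls above can also be combined: `replace()` accepts a dictionary that maps each old value to its new value. This is only a sketch on a small made-up Series (`survived`), not the Titanic column itself.

```python
import pandas as pd

# Toy 0/1 codes invented for illustration
survived = pd.Series([0, 1, 1, 0])

# A dictionary maps each old value to its replacement in one call
labels = survived.replace({0: "Perished", 1: "Lived"})
print(labels.tolist())
```

A dictionary keeps all of the recodings in one place, which can be easier to read when a column has several values to replace.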

In the cell below use the `replace()` method to replace the `1`, `2`, `3` values in 'Pclass' with `'First Class'`, `'Second Class'`, and `'Third Class'`.

In [None]:
# Write and test code here

***
## Value Counts

That looks pretty good. Now the big question: what can this data tell us about who was likely to survive the Titanic? First, let's find out how many people lived.

In [None]:
df['Survived'].value_counts()

What we've done here is apply the `value_counts()` method to the 'Survived' column of the dataframe.  Specifically, `df['Survived']` is accessing the 'Survived' column of the dataframe, and `.value_counts()` is calling a method on that column that counts the number of times each unique value appears in the column.
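As a small illustration of how `value_counts()` behaves, the sketch below counts values in a made-up Series (`outcome`). The optional `normalize=True` argument, which is not used elsewhere in this notebook, returns proportions instead of counts.

```python
import pandas as pd

# Toy outcomes invented for illustration
outcome = pd.Series(["Perished", "Lived", "Perished", "Perished", "Lived"])

# Count how many times each unique value appears
counts = outcome.value_counts()
print(counts)

# Individual counts can be looked up by value
print(counts["Lived"])

# normalize=True gives each value's share of the total instead
shares = outcome.value_counts(normalize=True)
print(shares)
```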


According to the counts for the 'Survived' column, there were 342 survivors. Now it's your turn: use the same method to look at the counts of Passenger Class and Sex.

In [None]:
# Write and test code here

If we group dataframe rows using `.groupby()`, then `.value_counts()` will apply within each group. For example, here we group the data by the Passenger Class ('Pclass') values, then use `.value_counts()` again on the 'Survived' column of the grouped data:

In [None]:
df_byPclass = df.groupby(by='Pclass')
df_byPclass['Survived'].value_counts()

Notice, however, that by default `value_counts()` sorts the results within each group from most frequent to least frequent. This makes the result above a bit hard to read, since the first class rows appear in a different order than the rest (more first class passengers survived than perished). We can pass an argument to `value_counts()` to stop it from sorting this way.

In [None]:
df_byPclass['Survived'].value_counts(sort=False)
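The result of a grouped `value_counts()` is indexed by both the group and the counted value, so a single count can be looked up with a pair of labels. Here is a minimal sketch on a made-up dataframe (`toy`); the six rows are invented for illustration.

```python
import pandas as pd

# Toy data invented for illustration; not taken from titanic.csv
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Survived": ["Lived", "Lived", "Perished",
                 "Perished", "Perished", "Lived"],
})

# Group rows by class, then count outcomes within each class
by_class = toy.groupby(by="Pclass")
counts = by_class["Survived"].value_counts(sort=False)
print(counts)

# Look up one count with a (group, value) pair
print(counts.loc[(1, "Lived")])
```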

We can also use the `.groupby()` method to group on multiple columns by passing it a list of column names.

In [None]:
df_byClassSex = df.groupby(by=['Pclass', 'Sex'])
df_byClassSex['Survived'].value_counts(sort=False)

The result above might be more informative if we first sorted by 'Sex' and then sorted by 'Pclass'. Try reversing the order of the columns in the list passed to `.groupby()`.

In [None]:
# Write and test code here

***
## Cross Tabulation
Up to this point we have used `value_counts()` and `groupby()` to produce basic counts in a table-like format. When we compare survival for different groups, we are taking one kind of categorical data (Survived, Perished) and seeing how it relates to another kind of categorical data (First Class, Second Class, Third Class). This type of analysis is really common in all kinds of applications. A more formal tool for looking at data this way is a ['Contingency Table' or 'Cross Tabulation'](https://en.wikipedia.org/wiki/Contingency_table).


In [None]:
pd.crosstab(df['Pclass'], df['Survived'])

In the code above we have passed two columns from our dataframe into the Pandas `crosstab()` function. The first column supplies the rows of the table and the second supplies the columns. **Note** that this is a function in Pandas itself, not a method of a particular dataframe, so we are specifying `pd` (the Pandas module we imported above) on the left side of the dot notation, and we are passing dataframe columns into it as arguments.
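To make the layout of a crosstab concrete, here is a minimal sketch on a made-up dataframe (`toy`, six invented rows). Each cell holds the number of rows that share that combination of row label and column label.

```python
import pandas as pd

# Toy data invented for illustration; not taken from titanic.csv
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Survived": ["Lived", "Lived", "Perished",
                 "Perished", "Perished", "Lived"],
})

# Rows come from the first argument, columns from the second
table = pd.crosstab(toy["Pclass"], toy["Survived"])
print(table)

# A single cell is addressed by its row label and column label
print(table.loc[1, "Lived"])
```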

The `crosstab()` function has some additional features that make it very useful.

First, we can add the argument `margins`, which adds row and column totals (margins) to the table:

In [None]:
pd.crosstab(df['Pclass'], df['Survived'], margins=True)

Second, we can add an argument `normalize` that converts frequency counts to proportions (values between 0 and 1). By setting the `normalize` argument to the string `'index'`, we specify that we want the values in each row converted to proportions of that row's total. For example, the value in the resulting table for Pclass=1 and Survived='Perished' will indicate what proportion *of first class passengers* perished:

In [None]:
pd.crosstab(df['Pclass'], df['Survived'], margins=True, normalize='index')
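With `normalize='index'`, each row of the table should sum to 1. The sketch below checks this on a made-up dataframe (`toy`, invented rows) and multiplies by 100 for readers who prefer percentages; the variable names are only illustrative.

```python
import pandas as pd

# Toy data invented for illustration; not taken from titanic.csv
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Survived": ["Lived", "Lived", "Perished",
                 "Perished", "Perished", "Lived"],
})

# Each row is divided by its own total
table = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(table)

# Every row of proportions adds up to 1
print(table.sum(axis=1))

# Multiply by 100 and round to show percentages
percent = (table * 100).round(1)
print(percent)
```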

Write your own code to make crosstabs examining the survival of passengers with siblings or spouses aboard the ship:

In [None]:
# Write and test your code here

Write your own code to make crosstabs examining the survival of passengers with parents or children on board the ship:

In [None]:
# Write and test your code here

***
We can extend crosstabs by passing a list of columns. Here we've passed in a list of two dataframe columns for the crosstab rows and a single column for the crosstab columns.

In [None]:
pd.crosstab([df['Pclass'], df['Sex']], df['Survived'], normalize='index')
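When a list of columns is used for the rows, each row of the result is labelled by a pair of values, and a row can be selected with that pair. This is a minimal sketch on a made-up dataframe (`toy`, invented rows), shown with raw counts rather than proportions.

```python
import pandas as pd

# Toy data invented for illustration; not taken from titanic.csv
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "male", "male"],
    "Survived": ["Lived", "Perished", "Lived",
                 "Perished", "Perished", "Lived"],
})

# Two columns define the rows; one column defines the table columns
table = pd.crosstab([toy["Pclass"], toy["Sex"]], toy["Survived"])
print(table)

# Select one row with a (Pclass, Sex) pair
print(table.loc[(3, "male")])
```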

Write your own code to make crosstabs examining survival, using a list of columns from the dataframe to specify the rows. (Choose any columns you like.)

In [None]:
# Write and test your code here