# Exploratory Data Analysis with the Titanic Dataset

This notebook uses the training dataset from Kaggle's ["Titanic - Machine Learning from Disaster"](https://www.kaggle.com/c/titanic) competition.

## Import module

In [None]:
# import the pandas module and give it the alias "pd"



## Import the data

In [None]:
# The dataset is contained in a CSV file, "data/titanic.csv".
# Use the pandas read_csv function to import the data into a dataframe named "df".



## Look at the data

* look at snapshots of the dataframe
  * `df`, `df.head()`, `df.tail()`, `df.sample()`
* look at the size
  * `df.shape`: number of rows and columns
* look at column names
  * `df.columns`: column labels
* look at summary information
  * `df.describe()`: statistical summary info
  * `df.info()`: data types, sizes, column labels, null values
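As a rough illustration of the calls listed above, here is a minimal sketch on a small made-up dataframe. The name `toy` and all of its values are hypothetical stand-ins, not the Titanic data, so the exercises below still need to be run against the real `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real file; the values are made up.
toy = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 0, 1, 0],
        "Age": [22.0, 38.0, None, 35.0, 4.0, 54.0],
        "Fare": [7.25, 71.28, 7.92, 8.05, 16.70, 51.86],
    }
)

first_rows = toy.head(3)          # first 3 rows (the default is 5)
last_rows = toy.tail(2)           # last 2 rows
random_row = toy.sample(1)        # 1 randomly chosen row
n_rows, n_cols = toy.shape        # shape is a (rows, columns) tuple
column_names = list(toy.columns)  # column labels as a plain list
summary = toy.describe()          # count, mean, std, min, quartiles, max

print(first_rows)
print(n_rows, n_cols)
print(column_names)
print(summary)
toy.info()                        # prints dtypes and non-null counts
```

Note that `describe()` reports a `count` of 5 for `Age` here rather than 6, because the missing value is excluded from the statistics.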

In [None]:
# Put the dataframe variable by itself on a line and execute the cell
# You should see an abbreviated output of the dataframe contents



In [None]:
# What happens when you print the dataframe with the print function?



In [None]:
# Look at the first 5 rows



In [None]:
# Look at the last 5 rows



In [None]:
# Look at 5 sample rows



In [None]:
# Look at the number of rows and columns



Look at the description and details of the training data on the data page:
https://www.kaggle.com/competitions/titanic/data?select=train.csv

Do your number of rows and columns match with the description/details?

In [None]:
# Look at the column names
# Do these match your expectations based on the documentation? (included below)



Let's consult information from the Kaggle site to get more information.

| Variable | Definition | Key| 
| :-- | :-- | :-- |
| survival | Survival | 0 = No, 1 = Yes| 
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd| 
| sex | Sex | | 
| Age | Age in years | | 
| sibsp | # of siblings / spouses aboard the Titanic | | 
| parch | # of parents / children aboard the Titanic | | 
| ticket | Ticket number | | 
| fare | Passenger fare | | 
| cabin | Cabin number | | 
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton| 

**Variable Notes**

pclass: A proxy for socio-economic status (SES)
* 1st = Upper
* 2nd = Middle
* 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...
* Sibling = brother, sister, stepbrother, stepsister
* Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
* Parent = mother, father
* Child = daughter, son, stepdaughter, stepson
* Some children travelled only with a nanny, therefore parch=0 for them.

In [None]:
# Use the "describe()" method to get summary statistical information about the quantitative data



* What information does this show?
  * What is the average survival rate?
  * What is the age range?
  * What is the mean age?
  * How many have siblings or spouses?
  * How does the standard deviation of the fare compare with its mean value?

Are the answers to the above reasonable?

In [None]:
# Use the "info()" method to get a summary description of the dataframe's contents.



-> Which columns have null values?  And what is the percentage of nulls for those that do?

-> Do the data types make sense? (The table below describes data types for reference)

<table class="table table-striped">
  <thead>
    <tr>
      <th>Pandas Type</th>
      <th>Native Python Type</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>object</td>
      <td>string</td>
      <td>The most general dtype. Will be assigned to your column if column has mixed types (numbers and strings).</td>
    </tr>
    <tr>
      <td>int64</td>
      <td>int</td>
      <td>Integers (whole numbers). 64 refers to the number of bits used to store each value.</td>
    </tr>
    <tr>
      <td>float64</td>
      <td>float</td>
      <td>Numbers with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, because NaN is itself a floating-point value.</td>
    </tr>
    <tr>
      <td>datetime64[ns], timedelta64[ns]</td>
      <td>N/A (but see the <a href="http://doc.python.org/2/library/datetime.html">datetime</a> module in Python’s standard library)</td>
      <td>Values meant to hold time data. Look into these for time series experiments.</td>
    </tr>
  </tbody>
</table>
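To see how pandas picks these types, here is a minimal sketch using three small made-up Series. The variable names are hypothetical and the values are not taken from the Titanic data.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])         # whole numbers only
with_gap = pd.Series([1, None, 3])  # a missing value forces a float dtype
mixed = pd.Series(["a", 1, 2.5])    # mixed types fall back to object

print(ints.dtype)
print(with_gap.dtype)
print(mixed.dtype)
```

The second Series is the case to keep in mind for columns such as Age: a single missing entry is enough to turn whole-number data into floats.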

Let's change a column's datatype from int to string (which becomes an object to pandas):

In [None]:
# Execute the following
df['Survived'].astype(str)

In [None]:
# Use "info()" again to see whether that changed anything



Whoops!  The astype method returned a new, converted copy of the column, but it didn't change the underlying dataframe.  To do that, we need to explicitly assign the returned column back into the `df['Survived']` column.
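The same behavior can be reproduced on a tiny made-up frame. This is only a sketch: `toy` and the intermediate variable names are hypothetical, and the exact name of the string dtype may differ between pandas versions.

```python
import pandas as pd

toy = pd.DataFrame({"Survived": [0, 1, 1]})

converted = toy["Survived"].astype(str)  # new Series; toy is left untouched
dtype_before = toy["Survived"].dtype     # still the original integer dtype

toy["Survived"] = converted              # the assignment is what changes toy
dtype_after = toy["Survived"].dtype      # now a string-like dtype

print(dtype_before, dtype_after)
print(toy["Survived"].tolist())
```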

In [None]:
df['Survived'] = df['Survived'].astype(str)

In [None]:
# Let's look again
df.info()

In [None]:
# We'll change two other columns too
df['PassengerId'] = df['PassengerId'].astype(str)
df['Pclass'] = df['Pclass'].astype(str)

## Visualization

Now for some fun stuff.  Let's try to make some simple plots to see what observations we can make.

In [None]:
# List the values of the "Fare" column
# It's ok if the output is abbreviated



In [None]:
# Use the "plot()" method to generate the default plot of "Fare" values



This shows Index vs Fare, i.e., what the value of every Fare was.  We can get a sense of what all the fares were from this, but really we probably want to see a distribution of values.

In [None]:
# Use the "plot()" method again, but now set the "kind" input parameter of plot to be equal to "hist"
# This should generate a histogram of Fare values.



It looks like there are a bunch of low cost tickets, or maybe just a few very *very* expensive tickets.

**Our first look at potentially suspicious values:**  Are there any 0 values?

In [None]:
# Use "loc" and a boolean condition to output those rows that have a 0 value for Fare



A brief search of some names will show that Mr Lionel Leonard, William Cahoone Johnson Jr., Alfred Johnson, and William Henry Tornquist were American Line employees.  It may make sense that they would have traveled on a complimentary fare.

... more investigation may be warranted...  But let's look at the columns that have 'NaN'.

In [None]:
# if you use the "isna()" method of the dataframe, what does that output?



In [None]:
# you can get the count of null values for any column by taking the sum of the True/False values of isna
# That is, look at the output of the following:

df.isna().sum()

In [None]:
# Use the shape attribute (a tuple) and an index to get the number of rows



In [None]:
# Divide df.isna().sum() by the number of rows, then multiply by 100, to find the percentage of null values for each column



* What percentage of age data is missing?
* What percentage of cabin data is missing?
* What percentage of embarked data is missing?

If we want to use those data columns, we would potentially stop here and try to figure out how we need/want to deal with the values that are missing.  For example, we could:
* drop the column completely
* drop the rows with NaNs
* fill the NaNs with other values (a useful value like mean or median, the previous or next row's value, a constant, or the result of an operation)
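The options above can be sketched on a small made-up frame. Everything here is an assumption for illustration: `toy` and its values are hypothetical, and filling with the median is just one of the choices listed, not a recommendation for the real data.

```python
import pandas as pd

# Hypothetical toy data with gaps; the values are made up.
toy = pd.DataFrame(
    {
        "Age": [22.0, None, 30.0, None],
        "Cabin": [None, "C85", None, None],
        "Fare": [7.25, 71.28, 8.05, 13.00],
    }
)

n_rows = toy.shape[0]
pct_null = toy.isna().sum() / n_rows * 100  # percent missing per column

no_cabin = toy.drop(columns=["Cabin"])      # drop a column completely
complete_rows = toy.dropna(subset=["Age"])  # drop rows where Age is NaN
age_filled = toy["Age"].fillna(toy["Age"].median())  # fill NaNs with the median

print(pct_null)
print(age_filled.tolist())
```

None of these calls modify `toy` itself. As with `astype`, each result has to be assigned back if the change should persist.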

Further analysis: let's see how Age is related to Survived.

Here are the variables we might like to look at:
* `df.loc[df['Survived'] == '0', 'Age']`: the Age values of those who did not survive
* `df.loc[df['Survived'] == '1', 'Age']`: the Age values of those who did survive
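Note that the comparisons above use the quoted strings `'0'` and `'1'`, because the Survived column was converted with `astype(str)` earlier. A minimal sketch of the same selections on made-up data follows; `toy`, `died_ages`, and `survived_ages` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Survived": [0, 1, 1, 0], "Age": [22.0, 38.0, 4.0, 54.0]}
)
toy["Survived"] = toy["Survived"].astype(str)  # mirror the earlier conversion

# Boolean mask picks the rows, the second argument picks the column.
died_ages = toy.loc[toy["Survived"] == "0", "Age"]
survived_ages = toy.loc[toy["Survived"] == "1", "Age"]

print(died_ages.tolist())
print(survived_ages.tolist())
```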

Let's make a histogram of each of these.

In [None]:
# Use the "hist()" method on the selected Age values to make a histogram for those who survived



In [None]:
# Make another histogram plot of Age values for those who did not survive



In [None]:
# What happens if you put the commands to make both histograms here and execute the cell?



It would be nice to plot the bars next to each other too to directly compare them.

We can tie in a little bit of another Python plotting package, Matplotlib, to help.

In [None]:
# Execute this cell
import matplotlib.pyplot as plt

In [None]:
# Execute this cell
a = df.loc[df['Survived'] == '0', 'Age']
b = df.loc[df['Survived'] == '1', 'Age']
plt.hist([a,b]);

In [None]:
# Copy the above commands here
# And insert another condition so that you plot data only for Age values > 18
# You'll need to use the "&" symbol to combine two conditions with "and", and wrap each condition in parentheses



In [None]:
# Now try again for Age < 18



Now that we have a method down for Age, apply it to other variables like Sex, Fare, and Pclass.
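As a starting point, here is a minimal sketch of the same masking pattern applied to Sex, using made-up data. The frame `toy` and the result names are hypothetical, and counting values with `value_counts()` is just one possible way to compare groups for a categorical column.

```python
import pandas as pd

# Hypothetical toy data; Survived is stored as strings, as in the notebook.
toy = pd.DataFrame(
    {
        "Survived": ["0", "1", "1", "0", "1"],
        "Sex": ["male", "female", "female", "male", "male"],
        "Age": [22.0, 38.0, 4.0, 54.0, 27.0],
    }
)

# Each condition needs its own parentheses because & binds tighter than == and >.
adult_survivors = toy.loc[(toy["Survived"] == "1") & (toy["Age"] > 18), "Age"]

# The same mask pattern with Sex in place of Age.
female_survivors = toy.loc[(toy["Survived"] == "1") & (toy["Sex"] == "female")]

# For a categorical column, counting values can stand in for a histogram.
survivor_sex_counts = toy.loc[toy["Survived"] == "1", "Sex"].value_counts()

print(adult_survivors.tolist())
print(len(female_survivors))
print(survivor_sex_counts)
```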