# Data Exploration 02

You're working on an exhibit for a local museum called "The Titanic Disaster". They've asked you to analyze the passenger manifests and see if you can find any interesting information for the exhibit. 

The museum curator is particularly interested in why some people might have been more likely to survive than others.

## Part 1: Import Pandas and load the data

Remember to import Pandas the conventional way. If you've forgotten how, you may want to review [Data Exploration 01](https://byui-cse.github.io/cse450-course/module-01/exploration-01.html).

The dataset for this exploration is stored at the following url:

[https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/titanic.csv](https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/titanic.csv)

There are lots of ways to load data into your workspace. The easiest way in this case is to [ask Pandas to do it for you](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html).

### Initial Data Analysis
Once you've loaded the data, it's a good idea to poke around a little bit to find out what you're dealing with.

Some questions you might ask include:

* What does the data look like?
* What kind of data is in each column? 
* Do any of the columns have missing values? 
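
The questions above can each be answered with a one-line pandas call. The following is a minimal sketch rather than the notebook's own solution: the `sample` frame is a hypothetical stand-in with a few Titanic-like columns, so that it runs without downloading the dataset.

```python
import pandas as pd

# Hypothetical stand-in rows that mimic a few Titanic columns (not the real data).
sample = pd.DataFrame({
    "Survived": ["No", "Yes", "Yes", "No"],
    "Age": [22.0, 38.0, None, 35.0],
    "Cabin": [None, "C85", None, None],
})

# What kind of data is in each column?
print(sample.dtypes)

# Do any of the columns have missing values? Count them, largest first.
missing_counts = sample.isna().sum().sort_values(ascending=False)
print(missing_counts)
```

On the real dataframe, the same `isna().sum()` call gives the missing-value counts directly, instead of having to subtract the non-null counts reported by `info()` from the row total.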

In [12]:
# Part 1: Enter your code below to import Pandas according to the 
# conventional method. Then load the dataset into a Pandas dataframe.
import pandas as pd
titanic = pd.read_csv('https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/titanic.csv')

# Write any code needed to explore the data by seeing what the first few 
# rows look like. Then display a technical summary of the data to determine
# the data types of each column, and which columns have missing data.
titanic.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,No,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,2,Yes,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,2,3,Yes,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,3,4,Yes,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,4,5,No,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [13]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   891 non-null    int64  
 1   PassengerId  891 non-null    int64  
 2   Survived     891 non-null    object 
 3   Pclass       891 non-null    int64  
 4   Name         891 non-null    object 
 5   Sex          891 non-null    object 
 6   Age          714 non-null    float64
 7   SibSp        891 non-null    int64  
 8   Parch        891 non-null    int64  
 9   Ticket       891 non-null    object 
 10  Fare         891 non-null    float64
 11  Cabin        204 non-null    object 
 12  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(6)
memory usage: 90.6+ KB


## Part 2: Initial Exploration

Using your visualization library of choice, let's first look at some features in isolation. Generate visualizations showing:

- A comparison of the number of passengers who survived with the number who died
- A comparison of the number of male passengers with the number of female passengers
- A histogram showing the distribution of sibling/spouse counts
- A histogram showing the distribution of parent/child counts

In [4]:
# Part 2: Write the code needed to generate the visualizations specified.
import altair as alt

In [35]:
# A comparison of the total number of passengers who survived compared to those that died.

alt.Chart(titanic).mark_bar(
    
).encode(
    x=alt.X('Survived', axis=alt.Axis(labelFontSize=14, labelAngle=0)),
    y=alt.Y('count()', title='Number of Passengers'),
    color=alt.Color('Survived', legend=None)
).properties(
    width=300,
    title='Titanic Survival Statistics'
)

In [87]:
# A comparison of the total number of males compared to females


alt.Chart(titanic).mark_bar(
    
).transform_calculate(
    # Generally, it's more efficient to have the data in the form you want
    # before you give it to Altair. But you can also use transforms to do some
    # quick modifications.
    #
    # In this case, we uppercase the first character of the string, and add to it
    # the characters from position 1 onwards. We store that in a temporary variable
    # called "capitalized", which we can then use in our encode() method.
    #
    # Notice that by doing this, altair can't figure out the type of feature that
    # "capitalized" is, so we have to specify its type with :N
    #
    # See https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types
    # and https://altair-viz.github.io/user_guide/transform/index.html
    # for details and:
    # https://vega.github.io/vega/docs/expressions/ for a list of functions you
    # can use in altair expressions.
    Capitalized_Sex = 'upper(datum.Sex[0]) + slice(datum.Sex,1)'
).encode(
    x=alt.X('Capitalized_Sex:N', axis=alt.Axis(labelFontSize=14, labelAngle=0)),
    y=alt.Y('count()', title='Number of Passengers'),
    color=alt.Color('Capitalized_Sex:N', legend=None)
).properties(
    width=300,
    title='Titanic Gender Statistics'
)

In [74]:
# A histogram showing the distribution of sibling/spouse counts
alt.Chart(titanic).mark_bar(
    
).encode(
    x=alt.X('SibSp', bin=True, title='Siblings and Spouses'),
    y=alt.Y('count()', title='Number of Passengers')
).properties(
    width=300,
    title='Sibling & Spouse Count Distribution'
)

In [76]:
# A histogram showing the distribution of parent/child counts
alt.Chart(titanic).mark_bar(
    
).encode(
    x=alt.X('Parch', bin=True, title='Parents and Children'),
    y=alt.Y('count()', title='Number of Passengers')
).properties(
    width=300,
    title='Parent & Child Count Distribution'
)

## Part 3: Pairwise Comparisons
Use your visualization library of choice to look at how the survival distribution varied across different groups.

- Choose some features that you think might have influenced the likelihood of a Titanic passenger surviving.

- For each of those features, generate a chart showing the survival distribution when that feature is taken into account.
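
Before charting, it can help to see the same pairwise comparison as a table of proportions. This is a sketch under stated assumptions: `sample` is a small hypothetical frame, and only the column names `Sex` and `Survived` (with their `Yes`/`No` values) come from the dataset.

```python
import pandas as pd

# Hypothetical rows; only the column names and value labels match the dataset.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": ["No", "Yes", "Yes", "No", "Yes", "No"],
})

# normalize="index" divides each row by its row total, so every row
# shows the share of that group that did and did not survive.
rates = pd.crosstab(sample["Sex"], sample["Survived"], normalize="index")
print(rates)
```

Row-normalized proportions make groups of different sizes comparable, which raw counts in a bar chart do not.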

In [89]:
# Write the code to explore how different features affect the survival distribution
# Gender seems to make a big difference. Could Jack have fit on that door?
alt.Chart(titanic).mark_bar().transform_calculate(
    Capitalized_Sex = 'upper(datum.Sex[0]) + slice(datum.Sex,1)'
).encode(
    alt.X('Survived', title='Survived', axis=alt.Axis(labelFontSize=14, labelAngle=0)),
    alt.Y('count()', title='Number of Passengers'),
    color=alt.Color('Survived', legend=None),
    column=alt.Column('Capitalized_Sex:N', title=None)
).properties(
    title='How Did Passenger Gender Affect Survival',
    width=100
)

In [104]:
# The better class ticket you had, the more likely you were to survive.

# Let's look at survivors by ticket class
alt.Chart(titanic).mark_bar(
    
).encode(
    alt.X('Survived', title='Survived'),
    alt.Y('count()', title='Number of Passengers'),
    color=alt.Color('Survived', legend=None),
    column=alt.Column('Pclass:N', title='Passenger Ticket Class')
).properties(
    width=100,
    title='Survival Rates by Ticket Class'
)

## Part 4: Feature Engineering

The museum curator wonders if the passenger's rank and title might have anything to do with whether or not they survived. Since this information is embedded in their name, we'll use "feature engineering" to create two new columns:

- Title: The passenger's title
- Rank: A boolean (true/false) indicating if a passenger was someone of rank.

For the first new column, you'll need to find a way to [extract the title portion of their name](https://pandas.pydata.org/docs/getting_started/intro_tutorials/10_text_data.html). Be sure to clean up any whitespace or extra punctuation.

For the second new column, you'll need to first look at a summary of your list of titles and decide what exactly constitutes a title of rank. Will you include military and ecclesiastical titles? Once you've made your decision, create the second column.

You may want to review prior Data Explorations for tips on creating new columns and checking for lists of values.
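
One possible way to build both columns is a single regular expression plus `isin`. This is a sketch, not the only approach: the names below are illustrative samples in the dataset's `<Last Name>, <Title>. <Given Names>` format, and `rank_titles` is a hypothetical, deliberately incomplete list.

```python
import pandas as pd

# Illustrative names in the "<Last Name>, <Title>. <Given Names>" format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the comma and the first period,
# then strip any leftover whitespace.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())

# Hypothetical subset of titles treated as "rank"; choose your own list.
rank_titles = ["the Countess", "Dr", "Rev"]
has_rank = titles.isin(rank_titles)
print(has_rank.tolist())
```

The pattern stops at the first period, so multi-word titles such as "the Countess" are kept whole.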

In [105]:
# Enter the code needed to create the two new columns

# Names seem to be in this format:
#   Braund, Mr. Owen Harris
#   <Last Name>, <Title>. <Given Names>

# Let's break down this code:
#  .str.split(',') - Use the string split function to split the name field at the comma
#  .str.get(1) - Then use the string get function to get the second piece of that split (0 indexed)
#  .str.split('.') - Then use the string split function again to split that piece at the period.
#  .str.get(0) - Then use the string get function again to get the first piece of that split.
#  .str.strip() - Then use the string strip function to get rid of any extra whitespace
titanic['Title'] = titanic['Name'].str.split(',').str.get(1).str.split('.').str.get(0).str.strip()
titanic.head()

# An alternative method is to use the pandas apply function with a lambda. Note that when
# you do this, you don't have to go through the .str accessor for each string transformation,
# because the lambda parameter (x) is already a plain Python string.
#
# titanic['Title'] = titanic['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,0,1,No,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr
1,1,2,Yes,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs
2,2,3,Yes,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss
3,3,4,Yes,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs
4,4,5,No,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr


In [106]:
# Now see what values we have:
titanic['Title'].value_counts()

Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Major             2
Col               2
Don               1
Ms                1
Jonkheer          1
Mme               1
the Countess      1
Lady              1
Capt              1
Sir               1
Name: Title, dtype: int64

In [140]:
# Some things I think qualify as titles of rank:
titanic['Rank'] = titanic['Title'].isin(['Dr', 'Rev', 'Major', 'Col', 'Capt', 'Lady', 'Sir', 'Don', 'the Countess', 'Jonkheer'])
titanic.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Rank
0,0,1,No,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,False
1,1,2,Yes,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,False
2,2,3,Yes,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,False
3,3,4,Yes,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,False
4,4,5,No,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,False


### Revisit Visualizations
Now that you have the new columns in place, revisit the pairwise comparison plots to see if the new columns reveal any interesting relationships. Don't forget to check with and without different `column` variations.
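
Because some titles belong to only a handful of passengers, a survival rate next to the group size is often easier to read than raw counts. The sketch below assumes a small hypothetical `sample` frame; only the `Title` and `Survived` column names mirror the notebook.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the notebook.
sample = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Mrs", "Mrs", "Rev"],
    "Survived": ["No", "No", "Yes", "Yes", "Yes", "No"],
})

# Turn Yes/No into a boolean, then compute the group size and the
# fraction of survivors for each title.
summary = (
    sample.assign(survived_flag=sample["Survived"].eq("Yes"))
    .groupby("Title")["survived_flag"]
    .agg(passengers="count", survival_rate="mean")
    .reset_index()
)
print(summary)
```

Keeping `passengers` beside `survival_rate` is a reminder that a rate of 0 or 1 in a one-person group says very little.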

In [139]:
# Enter the code needed to recheck the pairwise comparison. Try different variations of the column channel.

# Let's look at survivors by title. This chart shows raw counts; for a
# % of total version, see https://altair-viz.github.io/user_guide/transform/joinaggregate.html

alt.Chart(titanic).mark_bar(
    
).encode(
    alt.X('Survived', title='Survived'),
    alt.Y('count()', title='Number of Passengers'),
    color=alt.Color('Survived', legend=None),
    column=alt.Column('Title:N', title='Passenger Title')
).properties(
    width=50,
    title='Survival Rates by Title'
)

In [145]:
# There are some titles that seem to make it look like they had high survivor rates
# but most of those are probably just artifacts of being female, which we already
# saw made a big difference.
alt.Chart(titanic[titanic['Rank'] == True]).mark_bar().encode(
    x='Survived:N',
    y='count()',
    color='Survived:N',
    column='Title:N'
)

In [150]:
alt.Chart(titanic).mark_bar(
    
).encode(
    alt.X('Survived', title='Survived'),
    alt.Y('count()', title='Number of Passengers'),
    color=alt.Color('Survived', legend=None),
    column=alt.Column('Rank:N', title='Passenger Has Title of Rank')
).properties(
    width=500,
    title='Survival Rates by Rank'
)


### Simplifying Data
There appear to be many variations of similar titles (such as Miss and Mlle, the abbreviation for Mademoiselle).

Scan through the different titles to see which titles can be consolidated, then use what you know about data manipulation to simplify the distribution.

Once you've finished, check the visualizations again to see if that made any difference.
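
When consolidating with a dictionary, `Series.map` turns every value that is not a key into `NaN`, while `Series.replace` leaves such values unchanged. The following sketch compares the two on a hypothetical list of titles; the `merges` dictionary is illustrative, not a recommended grouping.

```python
import pandas as pd

# Hypothetical sample of extracted titles.
titles = pd.Series(["Mr", "Mlle", "Ms", "Col", "Jonkheer", "Dr"])

# Only the titles to be merged are listed (illustrative grouping).
merges = {"Mlle": "Miss", "Ms": "Mrs", "Col": "Officer", "Jonkheer": "Noble"}

# replace() keeps any title that is not a key in the dictionary.
consolidated = titles.replace(merges)
print(consolidated.tolist())

# map() would turn unlisted titles into NaN; this check lists them,
# which also catches misspelled keys.
mapped = titles.map(merges)
unmapped = titles[mapped.isna()].unique().tolist()
print(unmapped)
```

Either method works; checking for `NaN` after `map` (or running `value_counts(dropna=False)`) is a quick way to spot a title that was missed or misspelled.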

In [151]:
# Enter the code needed to consolidate some of the different title variations 
# Recheck the pairwise distributions to see if it made a difference.

# Don't forget, any mapping you don't explicitly specify will be converted to
# NaN, so you need to list every option. value_counts() can be a good check
# to see if you missed anything
titanic['Title_Consolidated'] = titanic['Title'].map({
    'Mme': 'Miss',
    'Ms' : 'Mrs', # This could have gone into Miss as well, would probably need to research the trends of the time period
    'Mlle': 'Miss',
    'Col': 'Officer',
    'Major': 'Officer',
    'Capt': 'Officer',
    'Jonkheer': 'Noble',
    'the Countess': 'Noble',
    'Lady': 'Noble',
    'Sir': 'Noble',
    'Don': 'Noble',
    'Mr': 'Mr',
    'Mrs': 'Mrs',
    'Miss': 'Miss',
    'Master': 'Master',
    'Rev': 'Rev',
    'Dr': 'Dr'
})

titanic.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Rank,Title_Consolidated
0,0,1,No,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,False,Mr
1,1,2,Yes,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,False,Mrs
2,2,3,Yes,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,False,Miss
3,3,4,Yes,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,False,Mrs
4,4,5,No,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,False,Mr


In [152]:
alt.Chart(titanic[titanic['Rank'] == True]).mark_bar().encode(
    x='Survived:N',
    y='count()',
    color='Survived:N',
    column='Title_Consolidated:N'
)

# With the consolidated titles, it doesn't look like title played a significant role,
# except for the poor reverends

## Part 5: Conclusions

Based on your analysis, what interesting relationships did you find? Write three interesting facts the museum can use in their exhibit.

> From a purely visual evaluation, it appears that gender and ticket class had the biggest impact on survival rates. There's probably a deeper story about socio-economic class distinction hidden here.
> 
> Other factors such as familial connections weren't explored, but likely also played a role.


## 🌟 Above and Beyond 🌟

The museum curator has room for a couple of nice visualizations for the exhibit. 

1. Create additional interesting visualizations that are suitable for public display.

2. Use the [GeoPandas library](https://geopandas.org) to create a [Choropleth Map](https://geopandas.org/mapping.html#choropleth-maps) of the likelihood of a Titanic passenger surviving based on their port of embarkation.