# Day Four: Contingency tables and External Resources
This notebook revisits everything you have learned so far, from cleaning data to creating contingency tables and other ways of presenting it!

If you wish to learn more, we have also included some suggestions and links to other resources that you could use to further develop either your Python or general programming skills.

In [2]:
# Import the data
import pandas as pd 
import numpy as np

data = pd.read_csv('https://raw.githubusercontent.com/chroadhouse/Futureme/main/Data/titanic.csv')

In [3]:
#Check the data has imported correctly
data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,Not Survive,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,Survived,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,Survived,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,Survived,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,Not Survive,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# Before looking at contingency tables, you may want to create some custom tables with more than one column
# We can do this by building one table per variable and then combining the values into a single table
age_mean = data['Age'].mean()
age_mode = data['Age'].mode()
age_median = data['Age'].median()
age_max = data['Age'].max()
age_min = data['Age'].min()
age_stand = data['Age'].std()

age_table = pd.DataFrame({
    'Mean': age_mean,
    'Mode': age_mode,  # mode() returns a Series, so the table gets one row per modal value
    'Median': age_median,
    'Maximum': age_max,
    'Minimum': age_min,
    'Standard Deviation': age_stand
})



fare_mean = data['Fare'].mean()
fare_mode = data['Fare'].mode()
fare_median = data['Fare'].median()
fare_max = data['Fare'].max()
fare_min = data['Fare'].min()
fare_stand = data['Fare'].std()


fare_table = pd.DataFrame({
    'Mean': fare_mean,
    'Mode': fare_mode,  # mode() returns a Series, so the table gets one row per modal value
    'Median': fare_median,
    'Maximum': fare_max,
    'Minimum': fare_min,
    'Standard Deviation': fare_stand
})

# We can now combine these two tables to make the data more manageable to look at - run this cell and see how it looks!

combined_table = pd.DataFrame({
    'Type': ['Age', 'Ticket Fare'],
    'Mean': [age_mean, fare_mean],
    'Mode': [age_mode[0], fare_mode[0]],  # mode() returns a Series, so take its first value
    'Median': [age_median, fare_median],
    'Maximum': [age_max, fare_max],
    'Minimum': [age_min, fare_min],
    'Standard Deviation': [age_stand, fare_stand]
}).set_index('Type')

combined_table

Type,Mean,Mode,Median,Maximum,Minimum,Standard Deviation
Age,29.699118,24.0,28.0,80.0,0.42,14.526497
Ticket Fare,32.204208,8.05,14.4542,512.3292,0.0,49.693429
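
As an aside, the same kind of summary table can be built with less typing. The sketch below is one possible approach rather than the method used in this notebook: it uses a small made-up `toy` table (the values are invented for illustration, not taken from the Titanic file) and the pandas `agg()` method to calculate several statistics at once.

```python
import pandas as pd

# Toy stand-in for the Age and Fare columns (made-up values, for illustration only)
toy = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Fare': [7.25, 71.28, 7.93, 53.10, 8.05],
})

# agg() applies each named statistic to every column; .T puts one variable per row
summary = toy[['Age', 'Fare']].agg(['mean', 'median', 'max', 'min', 'std']).T

# mode() can return several values per column, so take the first row of modes
summary['mode'] = toy[['Age', 'Fare']].mode().iloc[0]

print(summary)
```

To try this on the real data, you could swap `toy` for `data`. The column names here are the lowercase names pandas gives the statistics, so you may want to rename them before presenting the table.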


# Contingency Tables
Remember - we generally use contingency tables when we want to observe relationships between categorical data.

To do this, however, we need to make sure the data is **'clean'**.

If you need to remind yourself in more detail about this, you can always go back to the exercises for Day 3. You don't need to do this from scratch as we have filled in these cells for you, but you do need to know **why** it's important.

Once you know **how** these visualisations can help you make conclusions about the data, you can start thinking about the historical significance of what it actually tells you. We have included some questions in the cells below to kickstart your thinking on this.

In [5]:
# Age, Cabin and Embarked all have missing values (check the Non-Null Count column), so we need to clean them
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    object 
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(4), object(6)
memory usage: 83.7+ KB


In [6]:
# Here we count how many passengers embarked at each port
# Why do you think more people got on the ship at Southampton, rather than Cherbourg or Queenstown?
data['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [7]:
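# Now we fill in the missing values: the mean for Age and the most common port ('S') for Embarked
data['Age'] = data['Age'].fillna(data['Age'].mean())
data['Embarked'] = data['Embarked'].fillna('S')

# Cabin is only known for 204 of the 891 passengers, so we drop the column instead of filling it
data = data.drop(columns=['Cabin'])
data.info()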

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    object 
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Embarked     891 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB
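
The cleaning step can also be written so that the most common value is calculated rather than typed in by hand. The following is a minimal sketch on a tiny made-up table (`toy` and `most_common_port` are names chosen for this example, and the values are invented), showing mean-filling for a numeric column and mode-filling for a categorical one.

```python
import numpy as np
import pandas as pd

# Tiny made-up table with one missing value in each column
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'Embarked': ['S', None, 'S'],
})

# Numeric column: fill gaps with the mean of the known values
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())

# Categorical column: fill gaps with the most common value (the mode)
most_common_port = toy['Embarked'].mode()[0]
toy['Embarked'] = toy['Embarked'].fillna(most_common_port)

print(toy)
```

Calculating the mode means the code still works if the data changes, whereas a hard-coded `'S'` only makes sense for this particular dataset.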


In [8]:
# Below we create a contingency table to show the males and females that did and didn't survive
pd.crosstab(data.Sex, data.Survived, margins=True)

Survived,Not Survive,Survived,All
Sex,,,
female,81,233,314
male,468,109,577
All,549,342,891
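
If it helps to see what `pd.crosstab` is doing, a contingency table is simply a count of how many rows fall into each pair of categories. The sketch below uses a small made-up table (`toy` and `by_hand` are names invented for this example) and rebuilds the same counts step by step with `groupby`.

```python
import pandas as pd

# Five made-up passengers, for illustration only
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Survived': ['Survived', 'Survived', 'Not Survive', 'Survived', 'Not Survive'],
})

# The contingency table: rows are Sex, columns are Survived, cells are counts
table = pd.crosstab(toy['Sex'], toy['Survived'])

# The same counts built by hand: group by both columns, count the rows,
# then move the Survived categories into columns (empty pairs become 0)
by_hand = toy.groupby(['Sex', 'Survived']).size().unstack(fill_value=0)

print(table)
print(by_hand)
```

Both tables should show the same numbers, which is a useful check if a contingency table ever gives a result you did not expect.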


In [12]:
# We can also show the contingency table as percentages of each row (normalize='index')
# Have a think about what this table shows you - why do you think more women survived than men?
(pd.crosstab(data.Sex, data.Survived, margins=True, normalize='index') * 100).round(2)

Survived,Not Survive,Survived
Sex,,
female,25.8,74.2
male,81.11,18.89
All,61.62,38.38
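
The `normalize` argument controls which totals the percentages are taken from, and it changes the question the table answers. Here is a short sketch on made-up data (`toy`, `by_row` and `by_col` are names invented for this example) comparing row percentages with column percentages.

```python
import pandas as pd

# Six made-up passengers: 4 female (3 survived), 2 male (1 survived)
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'female', 'male', 'male'],
    'Survived': ['Survived', 'Survived', 'Survived', 'Not Survive',
                 'Survived', 'Not Survive'],
})

# normalize='index': each ROW adds up to 100
# "Of all the women, what percentage survived?"
by_row = (pd.crosstab(toy['Sex'], toy['Survived'], normalize='index') * 100).round(2)

# normalize='columns': each COLUMN adds up to 100
# "Of all the survivors, what percentage were women?"
by_col = (pd.crosstab(toy['Sex'], toy['Survived'], normalize='columns') * 100).round(2)

print(by_row)
print(by_col)
```

In this toy data, 50% of the men survived (row percentage), but men make up only 25% of the survivors (column percentage), so it is worth checking which version you are reading before drawing a conclusion.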


In [None]:
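# We can convert some continuous data to categorical to see more relationships.
# Each bin includes its upper edge: Child is up to and including 17, Adult is above 17 up to 65, Elder is above 65
data['Age'] = pd.cut(data['Age'], bins=[0, 17, 65, 99], labels=['Child', 'Adult', 'Elder'])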

In [None]:
# We can then cross-tabulate this converted age data against the port each passenger embarked from.
# Why do you think so few elderly people were on the Titanic?
pd.crosstab(data.Age, data.Embarked, margins=True)

Embarked,C,Q,S,All
Age,,,,
Child,24,7,82,113
Adult,142,69,559,770
Elder,2,1,5,8
All,168,77,646,891
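
To check which group an age on a boundary ends up in, it can help to run `pd.cut` on a few hand-picked values. The sketch below uses made-up ages (`ages` and `groups` are names invented for this example) with the same bins and labels as above.

```python
import pandas as pd

# Hand-picked ages that sit on or next to the bin edges
ages = pd.Series([0.42, 17, 18, 65, 66, 80])

# By default each bin includes its right-hand edge: (0, 17], (17, 65], (65, 99]
groups = pd.cut(ages, bins=[0, 17, 65, 99], labels=['Child', 'Adult', 'Elder'])

print(groups.tolist())
```

Note that the lowest edge is not included by default, so an age of exactly 0 would be left without a group. You could also store the result in a new column instead of overwriting the original one, so that the numeric ages stay available for later calculations.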


# External resources
If you have enjoyed this brief introduction to Python for data analysis, you should consider broadening your knowledge on the subject. In a society that is placing more and more emphasis on digital skills, Python in particular is a brilliantly versatile one to have on your CV, showing:
   * Your ability to solve problems
   * Digital fluency
   * Your motivation to learn

Programming and computer science have a large number of detailed resources to help you learn and develop. Some of these resources even provide certificates as proof of your capability in programming.
* FreeCodeCamp - https://www.freecodecamp.org - Provides easy and intuitive courses, with certificates awarded on completion. Courses include:
    - Web development
    - Data Analysis using Python
    - Scientific computing using Python
    - Machine Learning (AI) using Python
    - Algorithms and Data structures in JavaScript

* LinkedIn Learning - All MMU students have free access to LinkedIn Learning. It provides video courses and certificates linked to your LinkedIn account (this also automatically earns you Rise points)!

* CS Dojo - A YouTube channel with videos and other resources aimed at encouraging beginners to code.

* Rise - It's also a good idea to keep an eye on the Rise website, as new courses are being added all the time. Currently, you can sign up for Python for Scientific Computing and TensorFlow for Artificial Intelligence with Stephen Lynch: https://rise.mmu.ac.uk/activity/python-for-scientific-computing-and-tensorflow-for-artificial-intelligence-3/

* You could also preorder Stephen Lynch's book, which is centred around Python for Artificial Intelligence and Scientific Computing: https://www.routledge.com/Python-for-Scientific-Computing-and-Artificial-Intelligence/Lynch/p/book/9781032258713#