# Lab for badges 23 - 25

## Who survived on the Titanic?

In this lab we will use a famous learning dataset about the passengers of the Titanic. The Titanic was a large passenger ship which sank on 15 April 1912. We have quite detailed information about the passengers and whether they survived. In this lab we will investigate the relationship between variables such as wealth, age, gender and cabin number, and survival.

In [None]:
import pandas as pd

In [None]:
passengers = pd.read_csv('./data/titanic_kaggle_train.csv')
# this version of titanic dataset is taken from Kaggle
# https://www.kaggle.com/competitions/titanic/data?select=test.csv
passengers

# Task 1: Descriptive statistics

Describe the dataset in a set of bullet points (use Pandas to give you the answers). Try to back up each bullet point with a separate piece of Python code.

Below are just some examples to get you started, but explore the numbers and types of things (rows, columns). What are the max, min and mean of the numeric columns? Which columns have missing values, and how many? etc.

For explanations of what the columns mean see https://www.kaggle.com/competitions/titanic/data?select=test.csv
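As a nudge for the min/max/mean and missing-value questions, here is a minimal sketch. It uses a tiny made-up table called `toy` (an assumption, so the snippet runs without the CSV); the same calls should work on `passengers`.

```python
import pandas as pd
import numpy as np

# Tiny made-up table standing in for the real passengers data
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Cabin": [None, "C85", None, "C123"],
})

# min, max and mean of every numeric column in one table
summary = toy.describe().loc[["min", "max", "mean"]]
print(summary)

# number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

`describe()` skips non-numeric columns by default, which is why `Cabin` does not appear in `summary`.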

### A. What columns and of what type did you find? How many rows and columns?

In [None]:
passengers.shape

In [None]:
passengers.columns

In [None]:
passengers.info()

In [None]:
print("Mean", passengers.Age.mean())
# ...
# ...
# (you can add more cells with + button in the top menu)

**Answer A:**

There are 12 columns (+ index) and 891 rows. 

Columns have names: 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'

The Name and Sex columns are of type object (most likely strings), Age and Fare are floats, and the sibling/parent columns (SibSp, Parch) are integers.

Age has a mean of about 29.7 years, and the max and min are ... [you continue]

# Task 2: Answer basic questions: select some rows, some columns and run some simple statistics

What makes a good **'Basic Question'**? A question which you can explore by selecting only some rows and columns.

Example Basic Questions:

- Just select Name and Age components of the data frame. Sort them by Age and get the names of the 10 oldest passengers
- Split the passengers into those who survived and those who did not, and compare their ticket class (Pclass), Fare and Sex. Is there a large difference in the means?

**Identify 5 Basic Questions** like the ones above and answer each with a simple sentence. Provide the Python code which finds that answer. 

Call them 'A', 'B', 'C', ...

### To get you started, here is some starting code. Read it, make sure you understand it. Feel free to modify it.

Note about referencing: `iloc[rows, columns]` is used when referencing by position (numbers), and `loc[rows, columns]` is used when referencing by label (names). 

Also colon `:` means `'all of them'`
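One difference that is easy to miss: a slice in `iloc` excludes its end position, while a slice in `loc` includes its end label. A small sketch on a made-up table (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cid", "Dee", "Eve"], "Age": [30, 41, 25, 62, 19]}
)

by_position = toy.iloc[0:3, :]   # positions 0, 1, 2 -> the end is excluded
by_label = toy.loc[0:3, :]       # labels 0, 1, 2, 3 -> the end is included

print(len(by_position), len(by_label))
```

With the default index the labels happen to be the same numbers as the positions, which is exactly why the two are easy to confuse.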

In [None]:
# first passenger, their data in column indexed 4
passengers.iloc[0,4]

In [None]:
passengers.iloc[0:5,[2,3,-1]] # first 5 passengers, only columns 2, 3 and the last one

In [None]:
# same as above, but actually making sense ;) - NOTICE WE USE LOC, NOT ILOC
# loc slices by label and INCLUDES the end label, so 0:4 gives the first 5 passengers
passengers.loc[0:4,["Pclass","Name","Embarked"]] # first 5 passengers, columns picked by name

In [None]:
passengers.loc[10:20,['Age', 'Fare']] 
# passengers with index labels 10 to 20 (inclusive), only columns 'Age', 'Fare'

In [None]:
passengers.loc[:,['Name']] # all rows, only the column 'Name'

In [None]:
# finally, you can combine them:
# get first 5 people, and their Names
passengers.iloc[0:5,:].loc[:,'Name']

To pick the first or last n items you can also use .head(n) or .tail(n).

And if picking just one column (and it does not have spaces in the name), you can use .ColumnName 

In [None]:
passengers.head(5).Name

In [None]:
# notice that .loc[:,'Name'] is the same as .Name
# try to guess, what are advantages and limitations of .ColumnName notation? 

passengers.head(5).loc[:,'Name']

Ok, let's find our 5 questions and answer them (call them A,B,C,D and E)

### A: Names of oldest passengers. What can we tell about their survival rate when compared with youngest ones?

When exploring data it is hard to come up with tests, but you can try to write an 'aspirational placeholder sentence' and then aim to create code to 'fill it in'.

**Here's a sentence we will 'complete' together:**


Name of the oldest passenger (`X` years old) was `Mrs. ABC`.

Survival rate of the 10 oldest passengers was `Z`%, and of the 10 youngest passengers `Y`%.

For context, the average survival rate on the whole boat was `W`%.


In [None]:
# select just the columns we need (Age, Name, Pclass, Survived) from the data frame
dfA2 = passengers[['Age', 'Name', 'Pclass', 'Survived']].copy()

# Sort them by Age, oldest first
dfA2.sort_values(by='Age', ascending=False, inplace=True)

# and get the names of the 10 oldest passengers
oldest_10 = dfA2.iloc[0:10]
oldest_10

In [None]:
# note: and if you wanted to get names of all people over 70
print(dfA2[dfA2.Age > 70].loc[:, ['Name','Age']])

In [None]:
# Or get names of all people over 50 who survived
print(dfA2[(dfA2.Age > 50)&(dfA2.Survived == 1)].loc[:, ['Name','Age']])

In [None]:
print('oldest 10 survival', dfA2.iloc[0:10,:].loc[:, 'Survived'].mean())
# could also be
print('oldest 10 survival', dfA2.iloc[0:10,:].Survived.mean())
print('oldest 10 survival', dfA2.head(10).Survived.mean())

In [None]:
print('youngest 10 survival', dfA2.tail(10).Survived.mean())

In [None]:
# wait a minute.. that looks a bit suspicious! Children died with such frequency?
# let's have a look at the data!
dfA2.tail(10)

In [None]:
# oh no! that's the data for those where age is missing!
# let's ask for average for those with known age, but still youngest
dfA2[ dfA2['Age'].notna()].tail(10)
# why do you think decimal ages are 0.42, 0.67, 0.75, 0.83 and 0.93 - oh why?

In [None]:
print("youngest 10 survival: ", dfA2[ dfA2['Age'].notna()].tail(10).Survived.mean())


In [None]:
print('average survival', dfA2.Survived.mean())
print('average survival (if age known)', dfA2[ dfA2['Age'].notna()].Survived.mean())
# how do you interpret those results? What would it mean?

Answer: Name of the oldest passenger (80) was Mr. Algernon Henry Wilson Barkworth. Survival rate of the 10 oldest passengers was 10%, and of the 10 youngest passengers 90%. For context, the average survival rate was 38%.
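The surprise with `tail(10)` came from how sorting treats missing values: `sort_values` puts them last by default, whichever direction you sort in. A minimal sketch on made-up data (the `toy` table is an assumption) that drops unknown ages before looking at the youngest:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee"],
    "Age": [30.0, np.nan, 2.0, 62.0],
    "Survived": [1, 0, 1, 0],
})

# sort_values puts missing values last by default, whatever the sort direction
sorted_all = toy.sort_values(by="Age", ascending=False)

# dropping rows with unknown Age first makes tail() return the truly youngest
known_age = toy.dropna(subset=["Age"]).sort_values(by="Age", ascending=False)
youngest = known_age.tail(1)
print(youngest)
```

`dropna(subset=["Age"])` does the same job here as the `notna()` filter used in the cells above.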

### B: (your turn) What's the impact of Passenger Class and Sex on survival?

Hint: start with this:

- `groupby` takes the columns to group by and builds a 'grid' with one group for every combination of their values. (`observed=True` only matters for categorical columns: categories that never occur in the data are then left out.) Note that the `TicketClass` column is only created later, in Task 3 D, so we use the original `Pclass` column here.

`passengers.groupby(['Pclass', 'Survived'], observed=True)`

- the trick is that each cell of this grid does not hold a single value, like 123, but rather a whole collection of rows
- you can limit the data you care about by selecting just some columns

`passengers.groupby(['Pclass', 'Survived'], observed=True)['PassengerId']`

- so we need to specify how to summarise each collection of rows at an intersection (e.g. `count()` them, or take their `mean()`)

`passengers.groupby(['Pclass', 'Survived'], observed=True)['PassengerId'].count()`

- you will get a stacked table, but you can un-stack it with `.unstack()`

`passengers.groupby(['Pclass', 'Survived'], observed=True)['PassengerId'].count().unstack()`
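To see each of these steps on something small enough to check by hand, here is a sketch on a made-up six-row table (`toy` and its values are assumptions, and it groups by the original `Pclass` column):

```python
import pandas as pd

toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Pclass":      [1, 1, 2, 3, 3, 3],
    "Survived":    [1, 0, 1, 0, 0, 1],
})

# one group per (Pclass, Survived) combination, counted via PassengerId
stacked = toy.groupby(["Pclass", "Survived"])["PassengerId"].count()
print(stacked)

# move the Survived level from the rows into the columns
table = stacked.unstack()
print(table)
```

Nobody in class 2 died in this toy table, so that cell of `table` is NaN; `unstack(fill_value=0)` would show a 0 there instead.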

In [None]:
# your code here (add more cells with '+' in the top menu)

# here's some code to get you started
# (TicketClass is only created in Task 3 D, so we group by the original Pclass column here)

grouped = passengers.groupby(['Pclass', 'Survived'], observed=True)['PassengerId'].count().unstack()
grouped 
# btw. notice we are not printing a table, but rather using the fact that the last 'non-comment' line in a code cell
# will be shown in a prettier, colourful and interactive way. To see the difference, add a line with:  print(grouped)

Your answer here:

### C: Your question? Select a subset, and look at its qualities

In [None]:
# your code here (add more cells with '+' in the top menu)

Your answer here:

# Task 3: Cleaning up and improving the dataset:

A: Identify missing values and decide what to do with them. You can remove whole rows with missing values, or you can fill them with averages, or with averages for a particular subgroup.

B: Rename the column names so that they follow the same naming convention (eg. no spaces, CamelCase or snake_case). 

C: Can you create new columns which use the available information? E.g. can you create a column TravelingAlone based on the values in siblings/spouses (SibSp) and parents/children (Parch)?
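For the 'averages for a particular subgroup' idea in A, one possible approach is `groupby(...).transform('mean')`, which gives every row the mean of its own group. A minimal sketch on made-up data (the `toy` table and the `AgeFilled` column name are assumptions for illustration):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# mean Age of each row's own Pclass group, aligned with the original rows
group_mean_age = toy.groupby("Pclass")["Age"].transform("mean")

# fill only the gaps; known ages stay unchanged
toy["AgeFilled"] = toy["Age"].fillna(group_mean_age)
print(toy)
```

Whether filling by class is sensible for your analysis is a judgement call: filled values make the column look more complete than the original records really were.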

### A: Looking at missing values

In [None]:
# btw. re-run this cell to reset your dataset to its original form
passengers = pd.read_csv('./data/titanic_kaggle_train.csv')


In [None]:
# all data into binary: is missing or not:
passengers.isna()

In [None]:
# by columns, are any items True?
passengers.isna().any()

In [None]:
# how many passengers would you lose if you ran passengers.dropna(axis=0) ?
# you can use .shape (an attribute, so no brackets) to compare the row counts
pass_no_na = passengers.dropna(axis=0)
pass_no_na

In [None]:
# You could replace all missing values with a set value, 
passengers = pd.read_csv('./data/titanic_kaggle_train.csv')
pass_no_na = passengers.fillna(0) 
# but what unexpected effects could it have?
pass_no_na

In [None]:
# or you can replace them with a most common value,
# or other statistic of that column eg.mean, max
passengers = pd.read_csv('./data/titanic_kaggle_train.csv')
passengers.fillna(passengers.mean(numeric_only=True)) 

In [None]:
# for non-numeric values one of the options is to use the most common (mode) value.
# since there might be many most common values, you'll have to specify that you want the first one
passengers = pd.read_csv('./data/titanic_kaggle_train.csv')
passengers.fillna(passengers.mode().iloc[0]) 
# notice what happens with the Cabin

In [None]:
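# you could also specify individual columns and values
passengers = pd.read_csv('./data/titanic_kaggle_train.csv')
# assign the result back; fillna(..., inplace=True) on a single column is unreliable in recent pandas
passengers['Cabin'] = passengers['Cabin'].fillna("NoCabin")
passengers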

In [None]:
# Your task: decide what to do with the missing values in the original dataset.
# Once your_dataset.isna().any() returns all False, there are no more missing values

### B: Duplicates. 

In this dataset there are no duplicates. So here's a small exercise. Let's duplicate the first person so they appear 3 times, and let's change the age in one of those entries. So we're pretending the first passenger got accidentally added 3 times, including once with a slightly different age.

In [None]:
# add two more copies of the first row (person called Braund, Mr. Owen Harris)
passengers = pd.read_csv('./data/titanic_kaggle_train.csv')
# iloc[0:1] keeps the first row as a one-row DataFrame, so it can be concatenated
with_duplicates = pd.concat((passengers.iloc[0:1], passengers.iloc[0:1], passengers))
with_duplicates

In [None]:
# now let's fix the index
with_duplicates.reset_index(drop=True, inplace=True) # drop completely forgets previous index
with_duplicates

In [None]:
# change one of the duplicates to have another age:
with_duplicates.loc[1,'Age'] = 23 

# Note: as well as loc[] and iloc[], which access many items,
# you can use at[] and iat[] to access just one:
# with_duplicates.at[1,'Age'] = 23

with_duplicates

Now let's check for duplicates, and remove them. You might want to define a duplicate as a row where some identifying column (or a combination of identifying columns) has the same value as in another row.
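Before removing anything it can help to look at every suspected copy side by side. A sketch on made-up rows (the `toy` table is an assumption) comparing whole-row matching with matching on one column only, using `keep=False` to flag all copies rather than only the later ones:

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Braund", "Braund", "Braund", "Cumings"],
    "Age":  [22.0, 23.0, 22.0, 38.0],
})

# whole-row comparison: only the exact repeat (index 2) is flagged
exact = toy.duplicated()

# compare on Name only and flag every copy, so all suspects are visible
same_name = toy.duplicated(subset=["Name"], keep=False)

print(toy[same_name])
```

Matching on a name alone is a strong assumption: two different people can share a name, so inspect the flagged rows before dropping them.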

In [None]:
with_duplicates.duplicated()
# here you should see that the person at index 2 is a duplicate (perfectly identical)
# you could also decide to keep last (not first) with  duplicated(keep='last')

In [None]:
no_duplicates1 = with_duplicates.drop_duplicates()
no_duplicates1
# this would remove all completely identical duplicates

In [None]:
# this would remove all duplicates which are identical in just those columns
no_duplicates2 = with_duplicates.drop_duplicates(subset=['Name'])
no_duplicates2
# notice we're down to a single copy of the first passenger

In [None]:
# so now there are no duplicates
no_duplicates2.duplicated()


In [None]:
# and we can re-index
no_duplicates2.reset_index(drop=True, inplace=True)
no_duplicates2

### C: Improving Column names

Look at the notes if you're unsure how to do it. You're looking for rename(). Aim for no spaces or special characters in column names, so that you can use . notation, like .Age

One improvement could be making these columns more human readable: SibSp, Parch. See the link to Kaggle above for more insight into what they mean.
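If `rename()` is new to you, here is a minimal sketch of how it is called. The table `toy` and the replacement names are hypothetical; choose names that make sense to you.

```python
import pandas as pd

toy = pd.DataFrame({"SibSp": [1, 0], "Parch": [0, 2], "Age": [22.0, 38.0]})

# hypothetical replacement names; pick whatever reads best to you
new_names = {"SibSp": "SiblingsSpouses", "Parch": "ParentsChildren"}

# rename returns a new DataFrame; columns not in the mapping are left untouched
renamed = toy.rename(columns=new_names)
print(renamed.columns)
```

Because `rename` returns a new DataFrame, the original keeps its old column names unless you assign the result back.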

In [None]:
# if you'd like you can rename columns to something meaningful, but remember that 

### D: Categorical variables (in R called Factors, in Java called Enums)

Some values are presented as numbers (Pclass), but really, they are categories. Some are presented as strings (sex), but with some careful consideration and kind design can be analysed as categories too (notice this very old dataset is using Sex, not gender - see the note at the end of this lab).

So let's not look at the Sex field, but rather Pclass.

Let's first explore passenger class: what passenger classes were there on the Titanic? (On British trains we still have 1st and 2nd class carriages.)

In the dataset they are represented as:

In [None]:
passengers.Pclass

In [None]:
passengers.Pclass.value_counts()

In [None]:
# are there any NA ?
passengers.Pclass.isna().value_counts()

In [None]:
passengers['TicketClass'] = pd.Categorical(passengers.Pclass)
passengers

In [None]:
# Categorical data has a number of advantages when generating statistics etc.,
# but it also makes our datasets much smaller (in memory) and faster (in processing)

passengers.memory_usage()
# notice TicketClass stores the same data as Pclass, but uses roughly 15% of the memory.

In [None]:
# Additionally you can specify the order of categories, so that you 
# could perform comparisons

# introduce order
passengers['TicketClass'] = pd.Categorical(passengers.Pclass, 
                                           categories=[3,2,1], # here specify order
                                           ordered=True
                                          )
# optionally re-name categories with cat.rename_categories
mapping = {1: 'Luxury', 2: 'Economy', 3: 'LowerDeck'}
passengers['TicketClass'] = passengers['TicketClass'].cat.rename_categories(mapping)
passengers['TicketClass'] # notice printout shows order of categories

In [None]:
# so that you can now e.g. choose all rows with a TicketClass LOWER THAN a given category:
passengers[passengers['TicketClass'] < 'Luxury']

In [None]:
# what other columns could be categories? Try them!

### E: Creating more columns

Look at the notes if you're unsure how to do it. What could you use: age groups? In one of the notebooks you were shown how to group different ranges of values into groups. Maybe you could use that technique on age?

Or maybe you could combine the columns about partners and dependants (SibSp, Parch) into a column TravelsAlone?

What do the survival stats look like for different values of the new columns?
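One way both ideas could look, sketched on a made-up table. The `toy` data, the bin edges and the group labels are all assumptions chosen for illustration, not the 'right' answer:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    "Age":   [4.0, 17.0, 35.0, 70.0, np.nan],
    "SibSp": [1, 0, 0, 1, 0],
    "Parch": [2, 0, 0, 0, 0],
})

# bin edges and labels are arbitrary choices for illustration
toy["AgeGroup"] = pd.cut(
    toy["Age"],
    bins=[0, 12, 18, 60, 120],
    labels=["Child", "Teen", "Adult", "Senior"],
)

# alone means no siblings/spouses and no parents/children aboard
toy["TravelsAlone"] = (toy["SibSp"] + toy["Parch"]) == 0
print(toy)
```

`pd.cut` leaves a missing Age as a missing AgeGroup, so decide what to do about missing ages first if you want every passenger to land in a group.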

# Extra exercises: Visualisation

Because so many of you requested it, here are some extra tasks that you can tackle at home later. 

Here's an example of plotting. This is a bar chart of Survival rates by TicketClass. Note that this relies on the previous cells which added the TicketClass category.

Modify this code to show other things that you think could be meaningful.

In [None]:
# Bar Chart Example

import plotly.graph_objects as go

# Calculate mean survival rates by TicketClass
# reset_index() converts the grouped result back into a DataFrame
survival_rates = passengers.groupby('TicketClass', observed=False)['Survived'].mean().reset_index()

# Print and check the dataframe that we'll use to plot
print(survival_rates)

# Create the bar chart
fig = go.Figure(
    data=[
        go.Bar(
        x=survival_rates['TicketClass'], 
        y=survival_rates['Survived'], 
        # Format the percentages at the top of the bars and round to 2 decimal places
        text=survival_rates['Survived'].apply(lambda x: '{:.2%}'.format(x)), textposition='auto')
        ]
    )

# Add title and axis labels
fig.update_layout(
    title='Survival Rates by Passenger Class',
    xaxis_title='Passenger Class',
    yaxis_title='Survival Rate'
)

fig.show()

# Example of something you could try to do with a graph

Using Plotly, create a stacked bar chart that shows how survival rates varied across different TicketClasses (Luxury, Economy, LowerDeck). We've given you the pseudocode below and you can scroll down for the solution.
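A stacked bar chart needs one number per class and outcome. As a starting point, here is a sketch of just the data-preparation step, on a made-up table (`toy` is an assumption, and the plotting itself is left to you):

```python
import pandas as pd

toy = pd.DataFrame({
    "TicketClass": ["Luxury", "Luxury", "Economy", "LowerDeck",
                    "LowerDeck", "LowerDeck", "LowerDeck", "Economy"],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 1],
})

# rows: classes, columns: Survived (0 or 1), values: share within each class
shares = pd.crosstab(toy["TicketClass"], toy["Survived"], normalize="index")
print(shares)
```

Each column of `shares` could then become one `go.Bar` trace, with `fig.update_layout(barmode='stack')` stacking the traces on top of each other.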


# Other Tasks if you'd like to practice more:

Your turn! What data relationships could you visualise and what type of chart would you use? Here are a few ideas:

- Did Age play any role in survival rates?
- Display the proportions of passengers in each class
- Visualise the distribution of fares for each TicketClass
- ... etc.


**Note on usage of Gender and Sex in datasets: read up on this after the lab**

In many modern datasets Gender is represented as a much more complex data structure than a simple True/False binary (which was common in the early days of databases, where gender was a True/False value 'is_male'). Later it became a category (with 2 or 3 choices, e.g. 'male', 'female', 'prefer_not_to_say'). 

Luckily, to reflect the beauty of human experience, there are some very interesting research projects on how to design inclusive forms and datasets. This Scottish Government project involved data scientists and people with various experiences of gender. Do some reading on the subject https://www.gov.scot/publications/reviewing-design-methods-make-more-sensitive-gender/pages/1/ and notice that it does not provide a 'one answer fits all' solution but rather encourages empathy and curiosity on a project-by-project basis. Form designers, together with LGBTQI+ organisations, are also constantly advancing this field https://uxdesign.cc/designing-forms-for-gender-diversity-and-inclusion-d8194cf1f51  