# Cleaning Data With Pandas

In this exercise, we load data from an Excel spreadsheet, clean the dataset, then save the cleaned data to a new Excel sheet.

This dataset lists 891 of the passengers on board the Titanic, with variables (columns) such as Name, Age, Sex and Pclass (whether they travelled 1st, 2nd or 3rd class). We need to clean the data: some variables have missing values, and both the variable names and their values are cryptic. The file is at a public web location; the URL is provided below. The workbook has several sheets, and the passenger data is in a sheet named 'Passengers'.

Here are some suggested data quality improvements:
* Remove the Ticket and Cabin columns (we don’t need them in this exercise).
* Split the Name column into three columns: last_name, title and other_names.
* The Survived column has two values 0 and 1 to indicate whether the passenger died or survived.  These values are not intuitive.  Create a new column survival, with values 'Died' or 'Survived' based on the value of the Survived column (0 and 1 respectively).
* The Pclass column has values 1, 2 and 3.  Perhaps integer values are not best in this case – is a 2nd class passenger somehow twice as much as a 1st class? Create a new column passenger_class with values '1st', '2nd' and '3rd'.
* In the Embarked column, replace S, C and Q values with Southampton, Cherbourg and Queenstown respectively.  Deal with the two empty values.
* Add a column family_size, with formula = [SibSp]+[Parch]+1
* Remove any further columns we no longer need, e.g. Survived.
* Rename any columns to a more Pythonic lowercase-with-underscores style, e.g. PassengerId -> passenger_id.

### Background
Almost everyone knows the story of the Titanic.  In April 1912, this magnificent ship left Southampton on its maiden voyage to New York, but it never arrived.  It hit an iceberg in the Atlantic and sank.  There were over 2,000 people on board.  Fewer than half survived.
A century later, this Titanic dataset is a classic case study for rookie data scientists, who use it to build a predictive model of who is likely to survive or perish (ignoring the fact that this is a matter of historical record). However, we will visualise the data with Power BI and see if we can gain some intuition about who did and did not survive, and why.  We know from the film that Kate Winslet's character survived but poor old Leo DiCaprio's did not – is that an accurate reflection?

Note: we may need to pip install pandas, numpy and openpyxl (an optional dependency that pandas needs to read and write .xlsx files).

In [None]:
import pandas as pd
import numpy as np 

The Excel file is at this location

In [None]:
file_url = "https://github.com/MarkWilcock/CourseDatasets/raw/main/Misc%20Datasets/Titanic%20Data.xlsx"

Load the data in the Passengers sheet of the Excel file into a pandas dataframe

In [None]:
df = pd.read_excel(file_url, sheet_name="Passengers")
df.head(2) # Show first 2 rows

Remove the Cabin and Ticket columns

In [None]:
# Write your code here
df.drop(columns=['Cabin', 'Ticket'], inplace=True)
df.head(2) # Show first 2 rows

Split the Name column into three columns: last_name, title and other_names.  

In [None]:
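# Write your code here
df[['last_name', 'remainder']] = df['Name'].str.split(',', expand=True, n=1)
df[['title', 'other_names']] = df['remainder'].str.split('.', expand=True, n=1)
# Strip the space left behind after each delimiter
df['title'] = df['title'].str.strip()
df['other_names'] = df['other_names'].str.strip()
df.drop(columns=['remainder', 'Name'], inplace=True)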

df.head(2) # Show first 2 rows
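
Splitting on ',' and '.' keeps the space that followed each delimiter, so the pieces may need trimming. Here is a minimal sketch on two sample names in the same "Last, Title. Other names" layout; the toy frame `names` and the variable `raw_title` are illustrative only and not part of the exercise data.

```python
import pandas as pd

# Toy frame with names laid out as "Last, Title. Other names"
names = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]})

# Split once on the first comma, then once on the first full stop
names[["last_name", "remainder"]] = names["Name"].str.split(",", n=1, expand=True)
names[["title", "other_names"]] = names["remainder"].str.split(".", n=1, expand=True)

# Before trimming, the title still carries the space that followed the comma
raw_title = names.loc[0, "title"]
print(repr(raw_title))

# Trim the leftover whitespace from both derived columns
for col in ["title", "other_names"]:
    names[col] = names[col].str.strip()

names = names.drop(columns=["remainder", "Name"])
print(names)
```

Without the `str.strip()` step, a later filter such as `title == 'Mr'` would silently match nothing.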

Add a new column, passenger_class.  The values are mapped from Pclass, 1 to 1st, 2 to 2nd, 3 to 3rd.

*Hint - use the map function and a dictionary of old and new values*

In [None]:
# Write your code here
df['passenger_class'] = df['Pclass'].map({1:'1st', 2:'2nd', 3:'3rd'})
df.head(2) # Show first 2 rows

Add a column, family_size, calculated as SibSp + Parch + 1

In [None]:
# Write your code here
df['family_size'] = df['SibSp'] + df['Parch'] + 1
df.head(2) # Show first 2 rows

Replace the values of the Embarked column with the full words, C to Cherbourg, Q to Queenstown, S to Southampton

In [None]:
# Write your code here
df['Embarked'] = df['Embarked'].map({'C':'Cherbourg', 'Q':'Queenstown', 'S':'Southampton'})
df.head(2) # Show first 2 rows
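
The suggested improvements also ask us to deal with the two empty Embarked values. One possible approach, sketched on a toy Series rather than the real data, is to give them an explicit label after mapping. The label "Unknown" and the names `embarked`, `ports` and `embarked_clean` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy column: three port codes plus one missing value
embarked = pd.Series(["S", "C", np.nan, "Q"])

ports = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# map() leaves anything not found in the dictionary (including NaN) as NaN;
# fillna() then replaces those gaps with an explicit label (an assumed choice)
embarked_clean = embarked.map(ports).fillna("Unknown")
print(embarked_clean.tolist())
```

Dropping the two rows or leaving them empty are equally defensible; the point is to make the choice deliberately.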

Add a column survival based on the values in the Survived column. Map values of 0 to Died, 1 to Survived

In [None]:
# Write your code here
df['survival'] = df['Survived'].map({0:'Died', 1:'Survived'})
df.head(2) # Show first 2 rows

Missing Age values (presumably np.nan in pandas) show up as #NUM! in Excel, so we need to replace them; an empty string seems best.


In [None]:
df.Age

In [None]:
# Write your code here
df['age'] = df.Age.replace(np.nan, '')
df.head(2) # Show first 2 rows
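
A side effect worth knowing about: once empty strings sit alongside numbers, the column is no longer numeric. The sketch below uses a toy Series (the names `ages` and `ages_as_text` are illustrative) to show the change in dtype.

```python
import numpy as np
import pandas as pd

# Toy ages with one missing value
ages = pd.Series([22.0, np.nan, 38.0])

# Replacing NaN with '' mixes floats and strings, so the dtype becomes object
ages_as_text = ages.replace(np.nan, "")
print(ages.dtype, ages_as_text.dtype)

# Numeric summaries still work on the original column because NaN is skipped
print(ages.mean())
```

If we plan to do arithmetic on age later, it may be better to do so before this replacement.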

Remove the columns we no longer need

In [None]:
# Write your code here
df.drop(columns=['SibSp', 'Parch', 'Pclass', 'Survived', 'Age'], inplace=True)
df.head(2) # Show first 2 rows

Rename any columns to a more Pythonic lowercase-with-underscores style, e.g. PassengerId -> passenger_id

In [None]:
# Write your code here
df.rename(columns={'PassengerId':'passenger_id', 'Sex': 'sex', 'Fare': 'fare', 'Embarked': 'embarked' }, inplace=True)
df.head(2) # Show first 2 rows
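
Typing the mapping by hand is fine for four columns. For wider tables, the renaming could be done with a small helper instead. The function `to_snake_case` below is a hypothetical helper written for this sketch, not something pandas provides.

```python
import re

def to_snake_case(name):
    """Convert a CamelCase column name to lowercase_with_underscores."""
    # Insert an underscore before every capital letter except the first character
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

original = ["PassengerId", "Sex", "Fare", "Embarked"]
renamed = {name: to_snake_case(name) for name in original}
print(renamed)
```

Because `DataFrame.rename` accepts a function as well as a dictionary, `df.rename(columns=to_snake_case)` would apply the helper to every column; names that are already lowercase, such as last_name, pass through unchanged.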

If we are using Colab, uncomment the first two lines of the next code cell to mount our Google Drive. As written, the cell saves the clean data to demo.xlsx in the current working directory.

In [None]:
#from google.colab import drive
#drive.mount('/drive')
df.to_excel("demo.xlsx", index=False)

In [None]:
# Select the rows for 2nd class passengers
second_class_passengers = df.loc[df['passenger_class'] == '2nd']
print(second_class_passengers)

In [None]:
# Select 2nd class women who embarked at Cherbourg; each condition needs its own parentheses
second_class_women_cherbourg = df[(df['passenger_class'] == '2nd') & (df['sex'] == 'female') & (df['embarked'] == 'Cherbourg')]
print(second_class_women_cherbourg)
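
The two filters above rely on boolean masks. The sketch below shows the same idea on a toy frame (`people` is made-up illustrative data), together with `DataFrame.query` as an alternative way to write the condition.

```python
import pandas as pd

# Made-up rows using the cleaned column names
people = pd.DataFrame({
    "passenger_class": ["1st", "2nd", "2nd", "3rd"],
    "sex": ["female", "female", "male", "female"],
    "embarked": ["Cherbourg", "Cherbourg", "Cherbourg", "Southampton"],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
mask = (people["passenger_class"] == "2nd") & (people["sex"] == "female")
selected = people[mask]
print(selected)

# query() expresses the same filter as a string
same_rows = people.query("passenger_class == '2nd' and sex == 'female'")
print(same_rows)
```

Both forms return the same rows; `query` tends to read more easily as the number of conditions grows.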