# Cleaning Data With Pandas

In this exercise, we load data from an Excel spreadsheet, clean the dataset, then save the cleaned data to a new Excel sheet.

This dataset lists 891 of the passengers on board the Titanic, with variables (columns) such as Name, Age, Sex, and Pclass (whether they travelled 1st, 2nd or 3rd class). The data needs cleaning: some variables have missing values, and both the variable names and their values are cryptic. The file is at a public web location (the URL is provided below). The workbook has several sheets; the passenger data is in a sheet named 'Passengers'.

Here are some suggested data quality improvements:
* Remove the Ticket and Cabin columns (we don’t need them in this exercise).
* Split the Name column into three columns: last_name, title and other_names.
* The Survived column has two values 0 and 1 to indicate whether the passenger died or survived.  These values are not intuitive.  Create a new column survival, with values 'Died' or 'Survived' based on the value of the Survived column (0 and 1 respectively).
* The Pclass column has values 1, 2 and 3.  Perhaps integer values are not best in this case – is a 2nd class passenger somehow twice as much as a 1st class? Create a new column passenger_class with values '1st', '2nd' and '3rd'.
* In the Embarked column, replace S, C and Q values with Southampton, Cherbourg and Queenstown respectively.  Deal with the two empty values.
* Add a column family_size, calculated as SibSp + Parch + 1.
* Remove any further columns we no longer need, e.g. Survived.
* Rename the columns in a more Pythonic lowercase-with-underscores style, e.g. PassengerId -> passenger_id.

### Background
Almost everyone knows the story of the Titanic. In April 1912, this magnificent ship left Southampton on its maiden voyage to New York but it never arrived. It hit an iceberg in the Atlantic and sank. There were over 2,000 people on board. Fewer than half survived.
A century later, this Titanic dataset is a classic case study for rookie data scientists, who build a predictive model to determine who is likely to survive or perish (ignoring the fact that this is a matter of historical record). However, we will visualise the data with Power BI and see if we can gain some intuition about who did and did not survive, and why. We know from the film that Kate Winslet survived but poor old Leo DiCaprio did not – is that an accurate reflection?

Note: we may need to pip install pandas, numpy and openpyxl (an optional dependency that pandas requires for reading and writing Excel workbooks).

In [6]:
import pandas as pd
import numpy as np

In [10]:
file_url = "https://github.com/MarkWilcock/CourseDatasets/raw/main/Misc%20Datasets/Titanic%20Data.xlsx"

In [11]:
df = pd.read_excel(file_url,sheet_name="Passengers")
df

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,0
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,1
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,1
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,0
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,0
887,888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,1
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,0
889,890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,1
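
Before cleaning, it can help to count the missing values in each column. The sketch below is a minimal illustration on a small made-up frame (the three rows are invented stand-ins, not the real passenger data); the same `isna().sum()` call could be applied to `df`.

```python
import numpy as np
import pandas as pd

# a tiny made-up sample that mimics three of the Titanic columns
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [np.nan, 'C85', np.nan],
    'Embarked': ['S', 'C', 'S'],
})

# isna() marks each missing cell as True; sum() counts the True values per column
missing_counts = sample.isna().sum()
print(missing_counts)
```

Columns with a non-zero count are the ones that need a decision: drop the column, fill the gaps, or leave them empty.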


In [12]:
# remove the cabin and ticket columns
df.drop(columns=['Cabin', 'Ticket'], inplace=True)
df

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Survived
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.2500,S,0
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,1
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.9250,S,1
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1000,S,1
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,8.0500,S,0
...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,27.0,0,0,13.0000,S,0
887,888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0000,S,1
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,23.4500,S,0
889,890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0000,C,1


In [13]:
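# split the Name column into last_name and the remainder (title and other names)
df[['last_name', 'remainder']] = df['Name'].str.split(',', expand=True, n=1)

# split remainder into two columns with delimiter of period. only bring back two columns max
df[['title', 'other_names']] = df['remainder'].str.split('.', expand=True, n=1)

# strip the spaces left behind after each delimiter
df['title'] = df['title'].str.strip()
df['other_names'] = df['other_names'].str.strip()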

df

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Survived,last_name,remainder,title,other_names
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.2500,S,0,Braund,Mr. Owen Harris,Mr,Owen Harris
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,1,Cumings,Mrs. John Bradley (Florence Briggs Thayer),Mrs,John Bradley (Florence Briggs Thayer)
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.9250,S,1,Heikkinen,Miss. Laina,Miss,Laina
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1000,S,1,Futrelle,Mrs. Jacques Heath (Lily May Peel),Mrs,Jacques Heath (Lily May Peel)
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,8.0500,S,0,Allen,Mr. William Henry,Mr,William Henry
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,27.0,0,0,13.0000,S,0,Montvila,Rev. Juozas,Rev,Juozas
887,888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0000,S,1,Graham,Miss. Margaret Edith,Miss,Margaret Edith
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,23.4500,S,0,Johnston,"Miss. Catherine Helen ""Carrie""",Miss,"Catherine Helen ""Carrie"""
889,890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0000,C,1,Behr,Mr. Karl Howell,Mr,Karl Howell


In [14]:
# add a new column, passenger_class.  The values are mapped from pclass, 1 to 1st, 2 to 2nd, 3 to 3rd
df['passenger_class'] = df['Pclass'].map({1:'1st', 2:'2nd', 3:'3rd'})
df

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Survived,last_name,remainder,title,other_names,passenger_class
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.2500,S,0,Braund,Mr. Owen Harris,Mr,Owen Harris,3rd
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,1,Cumings,Mrs. John Bradley (Florence Briggs Thayer),Mrs,John Bradley (Florence Briggs Thayer),1st
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.9250,S,1,Heikkinen,Miss. Laina,Miss,Laina,3rd
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1000,S,1,Futrelle,Mrs. Jacques Heath (Lily May Peel),Mrs,Jacques Heath (Lily May Peel),1st
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,8.0500,S,0,Allen,Mr. William Henry,Mr,William Henry,3rd
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,27.0,0,0,13.0000,S,0,Montvila,Rev. Juozas,Rev,Juozas,2nd
887,888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0000,S,1,Graham,Miss. Margaret Edith,Miss,Margaret Edith,1st
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,23.4500,S,0,Johnston,"Miss. Catherine Helen ""Carrie""",Miss,"Catherine Helen ""Carrie""",3rd
889,890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0000,C,1,Behr,Mr. Karl Howell,Mr,Karl Howell,1st


In [15]:
# add a column, family_size, which is the sum of sibsp and parch + 1
df['family_size'] = df['SibSp'] + df['Parch'] + 1
df

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Survived,last_name,remainder,title,other_names,passenger_class,family_size
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.2500,S,0,Braund,Mr. Owen Harris,Mr,Owen Harris,3rd,2
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,1,Cumings,Mrs. John Bradley (Florence Briggs Thayer),Mrs,John Bradley (Florence Briggs Thayer),1st,2
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.9250,S,1,Heikkinen,Miss. Laina,Miss,Laina,3rd,1
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1000,S,1,Futrelle,Mrs. Jacques Heath (Lily May Peel),Mrs,Jacques Heath (Lily May Peel),1st,2
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,8.0500,S,0,Allen,Mr. William Henry,Mr,William Henry,3rd,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,27.0,0,0,13.0000,S,0,Montvila,Rev. Juozas,Rev,Juozas,2nd,1
887,888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0000,S,1,Graham,Miss. Margaret Edith,Miss,Margaret Edith,1st,1
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,23.4500,S,0,Johnston,"Miss. Catherine Helen ""Carrie""",Miss,"Catherine Helen ""Carrie""",3rd,4
889,890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0000,C,1,Behr,Mr. Karl Howell,Mr,Karl Howell,1st,1


In [16]:
# add a column embarked holding the full port names mapped from Embarked: C to Cherbourg, Q to Queenstown, S to Southampton
df['embarked'] = df['Embarked'].map({'C':'Cherbourg', 'Q':'Queenstown', 'S':'Southampton'})
df


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Survived,last_name,remainder,title,other_names,passenger_class,family_size,embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.2500,S,0,Braund,Mr. Owen Harris,Mr,Owen Harris,3rd,2,Southampton
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,1,Cumings,Mrs. John Bradley (Florence Briggs Thayer),Mrs,John Bradley (Florence Briggs Thayer),1st,2,Cherbourg
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.9250,S,1,Heikkinen,Miss. Laina,Miss,Laina,3rd,1,Southampton
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1000,S,1,Futrelle,Mrs. Jacques Heath (Lily May Peel),Mrs,Jacques Heath (Lily May Peel),1st,2,Southampton
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,8.0500,S,0,Allen,Mr. William Henry,Mr,William Henry,3rd,1,Southampton
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,27.0,0,0,13.0000,S,0,Montvila,Rev. Juozas,Rev,Juozas,2nd,1,Southampton
887,888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0000,S,1,Graham,Miss. Margaret Edith,Miss,Margaret Edith,1st,1,Southampton
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,23.4500,S,0,Johnston,"Miss. Catherine Helen ""Carrie""",Miss,"Catherine Helen ""Carrie""",3rd,4,Southampton
889,890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0000,C,1,Behr,Mr. Karl Howell,Mr,Karl Howell,1st,1,Cherbourg
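
The brief also asks us to deal with the two empty Embarked values. `map` leaves any value it cannot look up, including a missing one, as NaN. One possible approach, sketched here on a small made-up frame, is to label those rows explicitly; the label 'Unknown' is an assumption, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

port_names = {'C': 'Cherbourg', 'Q': 'Queenstown', 'S': 'Southampton'}

# made-up sample with one missing port of embarkation
sample = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'Q']})

# map the codes to full names, then give the unmapped (missing) rows an explicit label
sample['embarked'] = sample['Embarked'].map(port_names).fillna('Unknown')
print(sample['embarked'].tolist())
```

An explicit label keeps these passengers visible as their own category in a report, rather than as blanks.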


In [17]:
# add a column survival. map values of 0 to No, 1 to Yes
df['survival'] = df['Survived'].map({0:'No', 1:'Yes'})
df


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Survived,last_name,remainder,title,other_names,passenger_class,family_size,embarked,survival
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.2500,S,0,Braund,Mr. Owen Harris,Mr,Owen Harris,3rd,2,Southampton,No
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,1,Cumings,Mrs. John Bradley (Florence Briggs Thayer),Mrs,John Bradley (Florence Briggs Thayer),1st,2,Cherbourg,Yes
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.9250,S,1,Heikkinen,Miss. Laina,Miss,Laina,3rd,1,Southampton,Yes
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1000,S,1,Futrelle,Mrs. Jacques Heath (Lily May Peel),Mrs,Jacques Heath (Lily May Peel),1st,2,Southampton,Yes
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,8.0500,S,0,Allen,Mr. William Henry,Mr,William Henry,3rd,1,Southampton,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,27.0,0,0,13.0000,S,0,Montvila,Rev. Juozas,Rev,Juozas,2nd,1,Southampton,No
887,888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0000,S,1,Graham,Miss. Margaret Edith,Miss,Margaret Edith,1st,1,Southampton,Yes
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,23.4500,S,0,Johnston,"Miss. Catherine Helen ""Carrie""",Miss,"Catherine Helen ""Carrie""",3rd,4,Southampton,No
889,890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0000,C,1,Behr,Mr. Karl Howell,Mr,Karl Howell,1st,1,Cherbourg,Yes
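
The brief at the top suggests the labels 'Died' and 'Survived', whereas the cell above uses 'No' and 'Yes'. Either works; a minimal sketch of the alternative mapping, on a small made-up frame, might look like this.

```python
import pandas as pd

# made-up sample of the 0/1 Survived flag
sample = pd.DataFrame({'Survived': [0, 1, 1, 0]})

# map the flag to the labels suggested in the brief
sample['survival'] = sample['Survived'].map({0: 'Died', 1: 'Survived'})
print(sample['survival'].tolist())
```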


In [18]:
# empty values of Age (NaN in pandas) are shown as #NUM! in Excel, so replace them - an empty string seems best
# assign with df['age']: attribute-style df.age = ... would not create a new column
df['age'] = df['Age'].replace(np.nan, '')
df


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Survived,last_name,remainder,title,other_names,passenger_class,family_size,embarked,survival,age
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.2500,S,0,Braund,Mr. Owen Harris,Mr,Owen Harris,3rd,2,Southampton,No,22.0
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,1,Cumings,Mrs. John Bradley (Florence Briggs Thayer),Mrs,John Bradley (Florence Briggs Thayer),1st,2,Cherbourg,Yes,38.0
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.9250,S,1,Heikkinen,Miss. Laina,Miss,Laina,3rd,1,Southampton,Yes,26.0
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1000,S,1,Futrelle,Mrs. Jacques Heath (Lily May Peel),Mrs,Jacques Heath (Lily May Peel),1st,2,Southampton,Yes,35.0
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,8.0500,S,0,Allen,Mr. William Henry,Mr,William Henry,3rd,1,Southampton,No,35.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,27.0,0,0,13.0000,S,0,Montvila,Rev. Juozas,Rev,Juozas,2nd,1,Southampton,No,27.0
887,888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0000,S,1,Graham,Miss. Margaret Edith,Miss,Margaret Edith,1st,1,Southampton,Yes,19.0
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,23.4500,S,0,Johnston,"Miss. Catherine Helen ""Carrie""",Miss,"Catherine Helen ""Carrie""",3rd,4,Southampton,No,
889,890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0000,C,1,Behr,Mr. Karl Howell,Mr,Karl Howell,1st,1,Cherbourg,Yes,26.0


In [19]:
# remove the columns we no longer need
df.drop(columns=['Name', 'remainder', 'SibSp', 'Parch', 'Pclass', 'Survived', 'Age'], inplace=True)
df

Unnamed: 0,PassengerId,Sex,Fare,Embarked,last_name,title,other_names,passenger_class,family_size,embarked,survival,age
0,1,male,7.2500,S,Braund,Mr,Owen Harris,3rd,2,Southampton,No,22.0
1,2,female,71.2833,C,Cumings,Mrs,John Bradley (Florence Briggs Thayer),1st,2,Cherbourg,Yes,38.0
2,3,female,7.9250,S,Heikkinen,Miss,Laina,3rd,1,Southampton,Yes,26.0
3,4,female,53.1000,S,Futrelle,Mrs,Jacques Heath (Lily May Peel),1st,2,Southampton,Yes,35.0
4,5,male,8.0500,S,Allen,Mr,William Henry,3rd,1,Southampton,No,35.0
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,male,13.0000,S,Montvila,Rev,Juozas,2nd,1,Southampton,No,27.0
887,888,female,30.0000,S,Graham,Miss,Margaret Edith,1st,1,Southampton,Yes,19.0
888,889,female,23.4500,S,Johnston,Miss,"Catherine Helen ""Carrie""",3rd,4,Southampton,No,
889,890,male,30.0000,C,Behr,Mr,Karl Howell,1st,1,Cherbourg,Yes,26.0


To do: rename the columns in a more Pythonic lowercase-with-underscores style, e.g. PassengerId -> passenger_id.
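
One way to tackle this to-do is to pass a function to `DataFrame.rename`. The sketch below is only an illustration: `to_snake_case` is a hypothetical helper written for this example, and the two-row frame is a made-up stand-in for a few of the columns in `df`.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Convert a CamelCase column name to lowercase with underscores."""
    # insert an underscore wherever a lowercase letter or digit is followed by a capital
    with_underscores = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return with_underscores.lower()

# made-up stand-in frame using a few of the column names from this exercise
sample = pd.DataFrame({
    'PassengerId': [1, 2],
    'Sex': ['male', 'female'],
    'Fare': [7.25, 71.2833],
    'last_name': ['Braund', 'Cumings'],
})

# rename() accepts a function and applies it to every column name
renamed = sample.rename(columns=to_snake_case)
print(list(renamed.columns))
```

Note that applying this to `df` as it stands would produce two columns called embarked, because the original Embarked column is still present; dropping Embarked first avoids the clash.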

If we are using Colab, we can uncomment the next code cell to save the clean data to a file on our Google Drive.

In [20]:
#from google.colab import drive
#drive.mount('/drive')
#df.to_excel("/drive/My Drive/clean_titanic1.xlsx")