## Data Cleaning Practice Questions

Import the provided dataset 'sampledata.csv' and complete the following:

* 1) Split the first and last name into two separate columns
* 2) Rename the two new columns to 'first_name' and 'last_name'
* 3) Identify the errors in province names
* 4) Create a dictionary object to change the incorrectly recorded province names to the following format: ON, MB, QB, etc.
* 5) Apply the dictionary to the 'province' column
* 6) Fill in missing values in the 'province' column with 'Other'
* 7) Convert the 'net_worth' column to a numeric data type. Hint: Python doesn't recognize currency so you will need to remove the '$' and ',' symbols from each value
* 8) Add a new column for 'age' and calculate each person's age

In [None]:
#add answers here

## Suggested Solutions

We will first import all the relevant libraries, including some datetime classes and functions. Then we import the sampledata.csv file.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import datetime as dt
from datetime import datetime,timedelta


In [2]:
df=pd.read_csv('sampledata.csv')

Some issues are noticeable at first glance: the province variable is recorded in many different ways, the net_worth column is text data (because of the $ and , symbols), the first and last names aren't split, and the province column has some missing values.

In [3]:
df.head()


Unnamed: 0,customer_id,name,province,birth_date,net_worth
0,1,Audrey Thomas,Ont.,"December 27, 1992","$92,887.00"
1,2,Byron Tucker,Ontario,"May 1, 1995","$50,000.00"
2,3,Melissa Watson,Ontario,"August 11, 1996","$447,015.00"
3,4,Alisa Holmes,ON,"October 13, 1984","$294,583.00"
4,5,Clark Crawford,BC,"April 20, 2006","$24,873.00"


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   customer_id  50 non-null     int64 
 1   name         50 non-null     object
 2   province     42 non-null     object
 3   birth_date   50 non-null     object
 4   net_worth    50 non-null     object
dtypes: int64(1), object(4)
memory usage: 2.1+ KB


The first thing we will do is split up the first and last name. We can use str.split() or str.rsplit() for this.

str.split() splits a string starting from the left, while str.rsplit() starts from the right. The 'n' parameter tells the function how many splits to make (one, two, and so on). In our case, we only need to split once, and we will use rsplit(). Why? Because when a name includes a middle name, splitting once from the right keeps the middle name with the first name. You need to watch out for common name issues, such as hyphenated names, middle initials, middle names, and double last names, and decide how to treat each.

Finally, the 'expand' argument in this function tells it to create two new columns - one for the first name and one for the last.
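
To make the difference concrete, here is a minimal sketch on a few made-up names (they are not from sampledata.csv). It assumes names are separated by whitespace, and it shows where a middle name ends up under each method.

```python
import pandas as pd

# Hypothetical names, chosen to include a middle name and a single-word name
names = pd.Series(["Audrey Thomas", "Mary Ann Smith", "Cher"])

# Split once from the left: the middle name ends up with the last name
left = names.str.split(n=1, expand=True)

# Split once from the right: the middle name stays with the first name
right = names.str.rsplit(n=1, expand=True)

print(left)
print(right)
```

Note that a single-word name leaves the second column empty, so it is worth checking for missing last names after the split.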

In [5]:
df2 = df['name'].str.rsplit(n=1, expand=True)

In [6]:
df2.head()

Unnamed: 0,0,1
0,Audrey,Thomas
1,Byron,Tucker
2,Melissa,Watson
3,Alisa,Holmes
4,Clark,Crawford


We can use the join() function to combine our original dataframe with the two new columns. You can also achieve the same result using concat(). I will also rename the columns.

In [7]:
df=df.join(df2)

In [8]:
df.rename(columns={0:'first_name',1:'last_name'}, inplace=True)
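
For comparison, the concat() route might look like the sketch below. It uses a small stand-in dataframe (the name `people` is just for illustration) and renames the new columns before combining rather than after.

```python
import pandas as pd

# Small stand-in frame; the real data comes from sampledata.csv
people = pd.DataFrame({"name": ["Audrey Thomas", "Byron Tucker"]})

parts = people["name"].str.rsplit(n=1, expand=True)
parts.columns = ["first_name", "last_name"]  # rename before combining

# axis=1 places the frames side by side, aligning rows on the index
combined = pd.concat([people, parts], axis=1)
print(combined)
```

Both approaches align rows on the index, so they give the same result as long as the index has not been changed between the split and the combine.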

Time to deal with the province issues. I'm going to create a dictionary which maps each unique value to the value I actually want.

In [9]:
df['province'].unique()

array(['Ont.', 'Ontario', 'ON', 'BC', 'British Columbia', 'B.C.',
       'Manitoba', 'MB', 'M.B', 'M.B.', 'mb', 'Nfld', 'Newfoundland',
       'NF', 'QB', 'Quebec', 'Q.B.', nan, 'Bc', 'Ontaario', 'Ont',
       'Nunavut', 'Yukon', 'The Yukon', 'Sask', 'Saskatchewan', 'SK',
       'Yukon Territory', 'Toronto', 'Canada', 'PEI', 'Matitoba'],
      dtype=object)

In [10]:
# Codes follow the exercise's format; Canada Post uses QC, NL, NU and YT instead.
# Mapping 'Toronto' and 'Canada' to ON is a judgment call.
dictionary={'Ont.':'ON', 'Ontario':'ON', 'Ontaario':'ON', 'Ont':'ON', 'Toronto':'ON', 'Canada':'ON',
            'British Columbia':'BC', 'B.C.':'BC', 'Bc':'BC',
            'Manitoba':'MB', 'M.B':'MB', 'M.B.':'MB', 'mb':'MB', 'Matitoba':'MB',
            'Nfld':'NF', 'Newfoundland':'NF', 'Quebec':'QB', 'Q.B.':'QB',
            'Nunavut':'NV', 'Sask':'SK', 'Saskatchewan':'SK',
            'Yukon':'YK', 'The Yukon':'YK', 'Yukon Territory':'YK'}

Then, I'm going to use the replace() function to replace the values in the province column with those from the dictionary. Finally, I'll fill in the missing values with 'Other'.

In [11]:
df.replace({'province':dictionary}, inplace=True)

In [12]:
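# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
df['province']=df['province'].fillna('Other')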
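
After applying a mapping like this, it can help to check whether any values slipped through. For instance, 'PEI' has no entry in the dictionary above. The sketch below uses a stand-in column and a partial mapping; the `expected` set is an assumption and should be adjusted to your target format.

```python
import pandas as pd

# Stand-in column with a few of the variants seen above
province = pd.Series(["Ont.", "ON", "B.C.", None, "PEI", "Matitoba"])

# Partial, illustrative mapping; the full one is the dictionary defined earlier
mapping = {"Ont.": "ON", "B.C.": "BC", "Matitoba": "MB"}

cleaned = province.replace(mapping).fillna("Other")

# Codes we expect to see after cleaning (assumed list)
expected = {"ON", "BC", "MB", "Other"}

# Anything left over was not covered by the mapping
leftover = sorted(set(cleaned) - expected)
print(leftover)
```

An empty list means every value was either mapped or already in the target format.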

In [13]:
df.head(2)

Unnamed: 0,customer_id,name,province,birth_date,net_worth,first_name,last_name
0,1,Audrey Thomas,ON,"December 27, 1992","$92,887.00",Audrey,Thomas
1,2,Byron Tucker,ON,"May 1, 1995","$50,000.00",Byron,Tucker


Time to clean that pesky dollar data! This should be easy - use the str.replace() method to get rid of $ and , symbols. Also, don't forget to change the type from string to float. You should use astype() for this.

In [14]:
# regex=False treats '$' as a literal character; as a regex, '$' means end of string
df['net_worth']=df['net_worth'].str.replace('$','',regex=False)
df['net_worth']=df['net_worth'].str.replace(',','',regex=False)

In [15]:
df['net_worth']=df['net_worth'].astype(float)
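
As an alternative sketch, both symbols can be removed in one pass with a character class, and pd.to_numeric() can do the conversion. The stand-in values below include one deliberately bad entry ('n/a', which is hypothetical and not from the dataset) to show what errors='coerce' does.

```python
import pandas as pd

# Stand-in values in the same format as the net_worth column
net_worth = pd.Series(["$92,887.00", "$50,000.00", "n/a"])

# The character class [$,] matches either symbol; inside brackets '$' is literal
stripped = net_worth.str.replace(r"[$,]", "", regex=True)

# errors='coerce' turns anything unparseable into NaN instead of raising
numeric = pd.to_numeric(stripped, errors="coerce")
print(numeric)
```

Whether to coerce or to let the conversion fail depends on the data: astype(float) stops at the first bad value, which can be the safer choice when you don't expect any.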

In [16]:
df.head()

Unnamed: 0,customer_id,name,province,birth_date,net_worth,first_name,last_name
0,1,Audrey Thomas,ON,"December 27, 1992",92887.0,Audrey,Thomas
1,2,Byron Tucker,ON,"May 1, 1995",50000.0,Byron,Tucker
2,3,Melissa Watson,ON,"August 11, 1996",447015.0,Melissa,Watson
3,4,Alisa Holmes,ON,"October 13, 1984",294583.0,Alisa,Holmes
4,5,Clark Crawford,BC,"April 20, 2006",24873.0,Clark,Crawford


Now we can use the to_datetime() function to fix the format of our birth date data. This will help us calculate the age.

In [17]:
df['birth_date']=pd.to_datetime(df['birth_date'])
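
Here to_datetime() infers the format on its own. If you would rather state it explicitly, a sketch like the following should work for dates written as 'December 27, 1992', assuming English month names.

```python
import pandas as pd

# Stand-in values in the same format as the birth_date column
birth = pd.Series(["December 27, 1992", "May 1, 1995"])

# %B = full month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(birth, format="%B %d, %Y")
print(parsed)
```

With an explicit format, a value that does not match raises an error instead of being parsed by guesswork.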

In [18]:
df.head()

Unnamed: 0,customer_id,name,province,birth_date,net_worth,first_name,last_name
0,1,Audrey Thomas,ON,1992-12-27,92887.0,Audrey,Thomas
1,2,Byron Tucker,ON,1995-05-01,50000.0,Byron,Tucker
2,3,Melissa Watson,ON,1996-08-11,447015.0,Melissa,Watson
3,4,Alisa Holmes,ON,1984-10-13,294583.0,Alisa,Holmes
4,5,Clark Crawford,BC,2006-04-20,24873.0,Clark,Crawford


We should also import the date class from datetime, which lets us get today's date and compare it with the birth date to calculate age.

In [19]:
from datetime import date

In [20]:
today=date.today()

In [21]:
today

datetime.date(2021, 1, 13)

In [23]:
def age(i):
    today = date.today()
    return today.year - i.year - ((today.month, today.day) < (i.month, i.day))

In [24]:
df['age'] = df['birth_date'].apply(age)
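
The same logic can also be written without apply(). The sketch below uses stand-in birth dates and a fixed reference date (2021-01-13, the date of the run shown here) so that the result is reproducible. Swap in pd.Timestamp.today() to get current ages.

```python
import pandas as pd

# Stand-in birth dates; the fixed reference date matches the run shown above
birth = pd.to_datetime(pd.Series(["1992-12-27", "1995-05-01", "2006-04-20"]))
ref = pd.Timestamp("2021-01-13")

# True where the birthday has not yet occurred in the reference year
not_yet = (birth.dt.month > ref.month) | (
    (birth.dt.month == ref.month) & (birth.dt.day > ref.day)
)

# Booleans act as 0/1, so subtracting removes a year where needed
age = ref.year - birth.dt.year - not_yet.astype(int)
print(age.tolist())  # [28, 25, 14]
```

Simply dividing the day difference by 365 would drift around leap years, which is why both versions compare month and day directly.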

In [25]:
df.head()

Unnamed: 0,customer_id,name,province,birth_date,net_worth,first_name,last_name,age
0,1,Audrey Thomas,ON,1992-12-27,92887.0,Audrey,Thomas,28
1,2,Byron Tucker,ON,1995-05-01,50000.0,Byron,Tucker,25
2,3,Melissa Watson,ON,1996-08-11,447015.0,Melissa,Watson,24
3,4,Alisa Holmes,ON,1984-10-13,294583.0,Alisa,Holmes,36
4,5,Clark Crawford,BC,2006-04-20,24873.0,Clark,Crawford,14
