#### Description

It's commonly said that data scientists spend 80% of their time cleaning and manipulating data and only 20% of their time analyzing it. The time spent cleaning is vital since analyzing dirty data can lead you to draw inaccurate conclusions.


Data cleaning is an essential task in data science. Without properly cleaned data, the results of any data analysis or machine learning model could be inaccurate. In this course, you will learn how to identify, diagnose, and treat a variety of data cleaning problems in Python, ranging from simple to advanced. You will deal with improper data types, check that your data is in the correct range, handle missing data, perform record linkage, and more!

## Data type constraints

#### Chapter 1 - Common data problems
#### Why do we need to clean data?


| Datatype | Example |
| --- | --- |
| Text data | First name, last name, address... |
| Integers | # Subscribers, # products sold... |
| Decimals | Temperature, $ exchange rates... |
| Binary | Is married, new customer, yes/no... |
| Dates | Order dates, ship dates... |
| Categories | Marriage status, gender... |

#### Python data type
- str
- int
- float
- bool
- datetime
- category


### Strings to integers

In [None]:
# Import pandas, read the CSV file and output the header
import pandas as pd
sales = pd.read_csv('sales.csv')
sales.head(2)

In [None]:
# Get data types of columns
sales.dtypes

In [None]:
# Get DataFrame information
sales.info()

In [None]:
# Print the sum of the Revenue column (on a string column this concatenates instead of adding)
sales['Revenue'].sum()

In [None]:
# Remove $ from Revenue column and convert it to integer
sales['Revenue'] = sales['Revenue'].str.strip('$')
sales['Revenue'] = sales['Revenue'].astype('int')

In [None]:
# Verify that Revenue is now an integer
assert sales['Revenue'].dtype == 'int'
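
To see why the conversion matters, here is a minimal sketch on a made-up three-row table. The values are invented for illustration and do not come from `sales.csv`. Summing a string column concatenates the strings, while the stripped and converted column sums numerically.

```python
import pandas as pd

# Hypothetical stand-in for sales.csv; object dtype mimics strings read from a CSV
sales = pd.DataFrame({'Revenue': pd.Series(['100$', '250$', '75$'], dtype='object')})

# On a string (object) column, .sum() concatenates instead of adding
concatenated = sales['Revenue'].sum()

# Strip the $ sign and cast to integer
sales['Revenue'] = sales['Revenue'].str.strip('$').astype('int')
total = sales['Revenue'].sum()

print(concatenated)  # 100$250$75$
print(total)         # 425
```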

### The assert statement

In [1]:
# This will pass
assert 1+1 == 2

In [2]:
# This will not pass
assert 1+1 == 3

AssertionError: 
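
An `assert` statement can also carry an optional message, which makes a failing check easier to read. The snippet below is only an illustration; the message text is made up.

```python
# Illustrative only: attach a message so a failing check explains itself
try:
    assert 1 + 1 == 3, "1 + 1 should equal 2"
except AssertionError as error:
    message = str(error)

print(message)  # 1 + 1 should equal 2
```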

### Numeric or categorical?

In [None]:
# 0 = Never married, 1 = Married, 2 = Separated, 3 = Divorced

df['marriage_status'].describe()

In [None]:
# Convert to categorical
df["marriage_status"] = df["marriage_status"].astype('category')
df["marriage_status"].describe()
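
The difference between the two summaries can be sketched on a small invented `marriage_status` column (the six responses below are hypothetical). As integers, `.describe()` reports a mean and quartiles that carry no real meaning for coded labels. As a category, it reports counts and the most frequent value.

```python
import pandas as pd

# Hypothetical survey responses, coded 0-3
df = pd.DataFrame({'marriage_status': [0, 1, 1, 3, 2, 1]})

# Integer column: summary contains mean, std, quartiles
numeric_summary = df['marriage_status'].describe()

# Category column: summary contains count, unique, top, freq
df['marriage_status'] = df['marriage_status'].astype('category')
category_summary = df['marriage_status'].describe()

print(numeric_summary.index.tolist())
print(category_summary.index.tolist())
```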

In [7]:
import numpy as np
import pandas as pd

ride_sharing = pd.read_csv('ride_sharing_new.csv',index_col=0)

### Numeric data or ... ?
In this exercise, and throughout this chapter, you'll be working with bicycle ride sharing data in San Francisco called ride_sharing. It contains information on the start and end stations, the trip duration, and some user information for a bike sharing service.

The user_type column contains information on whether a user is taking a free ride and takes on the following values:

- 1 for free riders.
- 2 for pay per ride.
- 3 for monthly subscribers.

In this instance, you will print the information of ride_sharing using .info() and see a firsthand example of how an incorrect data type can flaw your analysis of the dataset. The pandas package is imported as pd.

In [9]:
# Print the head()
display(ride_sharing.head())
print()
# Print the information of ride_sharing
print(ride_sharing.info())
print()
# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())

Unnamed: 0,duration,station_A_id,station_A_name,station_B_id,station_B_name,bike_id,user_type,user_birth_year,user_gender
0,12 minutes,81,Berry St at 4th St,323,Broadway at Kearny,5480,2,1959,Male
1,24 minutes,3,Powell St BART Station (Market St at 4th St),118,Eureka Valley Recreation Center,5193,2,1965,Male
2,8 minutes,67,San Francisco Caltrain Station 2 (Townsend St...,23,The Embarcadero at Steuart St,3652,3,1993,Male
3,4 minutes,16,Steuart St at Market St,28,The Embarcadero at Bryant St,1883,1,1979,Male
4,11 minutes,22,Howard St at Beale St,350,8th St at Brannan St,4626,2,1994,Male



<class 'pandas.core.frame.DataFrame'>
Index: 25760 entries, 0 to 25759
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   duration         25760 non-null  object
 1   station_A_id     25760 non-null  int64 
 2   station_A_name   25760 non-null  object
 3   station_B_id     25760 non-null  int64 
 4   station_B_name   25760 non-null  object
 5   bike_id          25760 non-null  int64 
 6   user_type        25760 non-null  int64 
 7   user_birth_year  25760 non-null  int64 
 8   user_gender      25760 non-null  object
dtypes: int64(5), object(4)
memory usage: 2.0+ MB
None

count    25760.000000
mean         2.008385
std          0.704541
min          1.000000
25%          2.000000
50%          2.000000
75%          3.000000
max          3.000000
Name: user_type, dtype: float64


In [10]:
# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')

# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'

# Print new summary statistics 
print(ride_sharing['user_type_cat'].describe())

count     25760
unique        3
top           2
freq      12972
Name: user_type_cat, dtype: int64


### Summing strings and concatenating numbers
In the previous exercise, you were able to identify that category is the correct data type for user_type and convert it in order to extract relevant statistical summaries that shed light on the distribution of user_type.

Another common data type problem is importing what should be numerical values as strings. Mathematical operations such as summing and multiplication then produce string concatenation rather than numerical results.

In this exercise, you'll be converting the string column duration to the type int. Before that, however, you will need to strip "minutes" from the column so that pandas can read it as numerical. The pandas package has been imported as pd.

In [11]:
ride_sharing[['duration']]

Unnamed: 0,duration
0,12 minutes
1,24 minutes
2,8 minutes
3,4 minutes
4,11 minutes
...,...
25755,11 minutes
25756,10 minutes
25757,14 minutes
25758,14 minutes


In [17]:
# Strip duration of minutes
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip('minutes')

In [19]:
# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')

# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'
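
One detail worth knowing: `.str.strip('minutes')` treats its argument as a set of characters to remove from both ends, not as a substring. The sketch below uses made-up durations in the same format to show what is left after stripping, and shows `.str.replace` as one possible alternative that removes the exact substring.

```python
import pandas as pd

# Made-up durations in the same "<number> minutes" format as ride_sharing
durations = pd.Series(['12 minutes', '8 minutes', '24 minutes'], dtype='object')

# Strips any of the characters m, i, n, u, t, e, s from both ends,
# so the space before "minutes" is left behind
trimmed = durations.str.strip('minutes')

# The integer conversion tolerates the trailing whitespace
as_int = trimmed.astype('int')

# Alternative: remove the literal substring, including the space
explicit = durations.str.replace(' minutes', '', regex=False)

print(trimmed.tolist())   # ['12 ', '8 ', '24 ']
print(as_int.tolist())    # [12, 8, 24]
print(explicit.tolist())  # ['12', '8', '24']
```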

In [23]:
# Print formed columns and calculate average ride duration 
display(ride_sharing[['duration','duration_trim','duration_time']])

Unnamed: 0,duration,duration_trim,duration_time
0,12 minutes,12,12
1,24 minutes,24,24
2,8 minutes,8,8
3,4 minutes,4,4
4,11 minutes,11,11
...,...,...,...
25755,11 minutes,11,11
25756,10 minutes,10,10
25757,14 minutes,14,14
25758,14 minutes,14,14


In [24]:
print(ride_sharing[['duration_time']].mean())

duration_time    11.389053
dtype: float64


- Great work! 11 minutes is really not bad for an average ride duration in a city like San Francisco. In the next lesson, you're going to jump right ahead into sanity checking the range of values in your data.

## Data range constraints

#### Motivation

In [None]:
movies.head()

In [None]:
import matplotlib.pyplot as plt
plt.hist(movies['avg_rating'])
plt.title('Average rating of movies (1-5)')

In [None]:
# Import datetime and output sign-ups dated in the future
import datetime as dt
today_date = dt.date.today()
user_signups[user_signups['subscription_date'] > today_date]

#### How to deal with out of range data?
- Dropping data
- Setting custom minimums and maximums
- Treating as missing and imputing
- Setting custom value depending on business assumptions
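
The first three options can be sketched on a tiny invented `movies` table. The titles and ratings are hypothetical, and imputing with the mean of the valid ratings is just one possible choice. The fourth option depends on business rules, so it is not shown.

```python
import numpy as np
import pandas as pd

# Invented ratings on a 1-5 scale; 6.0 and 7.5 are out of range
movies = pd.DataFrame({'movie_name': ['A', 'B', 'C', 'D'],
                       'avg_rating': [4.2, 6.0, 3.1, 7.5]})
out_of_range = movies['avg_rating'] > 5

# Option 1: drop the offending rows
dropped = movies[~out_of_range]

# Option 2: cap at the custom maximum
capped = movies.copy()
capped.loc[out_of_range, 'avg_rating'] = 5.0

# Option 3: treat as missing, then impute with the mean of the valid ratings
imputed = movies.copy()
imputed.loc[out_of_range, 'avg_rating'] = np.nan
imputed['avg_rating'] = imputed['avg_rating'].fillna(imputed['avg_rating'].mean())

print(dropped['movie_name'].tolist())  # ['A', 'C']
print(capped['avg_rating'].max())      # 5.0
```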


### Movie example

In [None]:
# Output Movies with rating > 5
movies[movies['avg_rating'] > 5]

In [None]:
# Drop values using filtering
movies = movies[movies['avg_rating'] <= 5]
# Drop values using .drop()
movies.drop(movies[movies['avg_rating'] > 5].index, inplace = True)
# Assert results
assert movies['avg_rating'].max() <= 5


In [None]:
# Convert avg_rating > 5 to 5
movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = 5

In [None]:
# Assert statement
assert movies['avg_rating'].max() <= 5

- Remember, no output means it passed

### Date range example

In [None]:
# Output data types
user_signups.dtypes

In [None]:
# Convert to date
user_signups['subscription_date'] = pd.to_datetime(user_signups['subscription_date']).dt.date

In [None]:
today_date = dt.date.today()

#### Drop the data

In [None]:
# Drop values using filtering
user_signups = user_signups[user_signups['subscription_date'] <= today_date]
# Drop values using .drop()
user_signups.drop(user_signups[user_signups['subscription_date'] > today_date].index, inplace = True)

#### Hardcode dates with upper limit

In [None]:
# Set future dates to today's date
user_signups.loc[user_signups['subscription_date'] > today_date, 'subscription_date'] = today_date
# Assert is true (the column holds datetime.date objects, so .max() compares directly)
assert user_signups['subscription_date'].max() <= today_date
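
A self-contained sketch of the capping approach on two invented sign-up dates, one in the past and one dated tomorrow. Because `.dt.date` produces plain `datetime.date` objects, the column's maximum can be compared with `today_date` directly.

```python
import datetime as dt
import pandas as pd

today_date = dt.date.today()
tomorrow = today_date + dt.timedelta(days=1)

# Invented sign-ups: one in the past, one dated tomorrow
user_signups = pd.DataFrame({'subscription_date': ['2020-01-15', tomorrow.isoformat()]})

# Parse the strings, then keep only the date part (datetime.date objects)
user_signups['subscription_date'] = pd.to_datetime(user_signups['subscription_date']).dt.date

# Cap future dates at today
user_signups.loc[user_signups['subscription_date'] > today_date, 'subscription_date'] = today_date

# No output means the check passed
assert user_signups['subscription_date'].max() <= today_date
```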

### Tire size constraints
In this lesson, you're going to build on top of the work you've been doing with the ride_sharing DataFrame. You'll be working with the tire_sizes column which contains data on each bike's tire size.

Bicycle tire sizes could be either 26″, 27″ or 29″ and are here correctly stored as a categorical value. In an effort to cut maintenance costs, the ride sharing provider decided to set the maximum tire size to be 27″.

In this exercise, you will make sure the tire_sizes column has the correct range by first converting it to an integer, then setting and testing the new upper limit of 27″ for tire sizes.

In [33]:
np.random.choice(np.arange(25760, dtype=np.int16), size=(2, 1), replace=False)

array([[24072],
       [20058]], dtype=int16)

In [27]:
ride_sharing.columns

Index(['duration', 'station_A_id', 'station_A_name', 'station_B_id',
       'station_B_name', 'bike_id', 'user_type', 'user_birth_year',
       'user_gender', 'user_type_cat', 'duration_trim', 'duration_time'],
      dtype='object')

In [None]:
# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')

# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27

# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')

# Print tire size description
print(ride_sharing['tire_sizes'].describe())

- Awesome work! You can look at the new maximum by looking at the top row in the description. Notice how essential it was to convert tire_sizes into integer before setting a new maximum.
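
Why the integer conversion is needed can be sketched on a few invented tire sizes. For an unordered category, pandas does not define ordering comparisons such as `>`, so the upper limit has to be applied on the integer version of the column.

```python
import pandas as pd

# Invented tire sizes stored as an unordered category, as in the exercise
tire_sizes = pd.Series([26, 27, 29, 26, 29]).astype('category')

# Ordering comparisons are not defined for unordered categories
try:
    tire_sizes > 27
    comparison_failed = False
except TypeError:
    comparison_failed = True

# Convert to integer, cap at 27, then convert back to category
as_int = tire_sizes.astype('int')
as_int.loc[as_int > 27] = 27
capped = as_int.astype('category')

print(comparison_failed)                # True
print(capped.cat.categories.tolist())   # [26, 27]
```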

### Back to the future
The data pipeline feeding into the ride_sharing DataFrame has been updated to register each ride's date. This information is stored in the ride_date column of the type object, which represents strings in pandas.

A bug was discovered that recorded rides taken today as taken next year. To fix this, you will find all values of the ride_date column that fall in the future and set them to today's date, making today the maximum possible value. Before doing so, you will need to convert ride_date to a datetime object.

The datetime package has been imported as dt, alongside all the packages you've been using till now.

In [None]:
# Convert ride_date to date
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date

# Save today's date
today = dt.date.today()

# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today

# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())

In [None]:
# Convert ride_date to date
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date

# Save today's date
today = dt.date.today()

# Set all in the future to today's date (written to ride_dt, so the maximum printed below reflects the cap)
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today

# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())

- Great job! Imagine counting the number of rides taken today without having cleaned your ranges correctly. You would have wildly underreported your findings!

## Uniqueness constraints

#### What are duplicate values?
- All columns have the same values

In [None]:
first_name       last_name                  address                                      height    weight
Justin           Saddlemyer                 Boulevard du Jardin Botanique 3, Bruxelles   193 cm    87 kg
Justin           Saddlemyer                 Boulevard du Jardin Botanique 3, Bruxelles   193 cm    87 kg

#### What are duplicate values?
- Most columns have the same values

In [None]:
first_name       last_name                  address                                      height    weight
Justin           Saddlemyer                 Boulevard du Jardin Botanique 3, Bruxelles   193 cm    87 kg
Justin           Saddlemyer                 Boulevard du Jardin Botanique 3, Bruxelles   194 cm    87 kg

#### How to find duplicate values?

In [None]:
# Print the header
height_weight.head()

In [None]:
# Get duplicates across all columns
duplicates = height_weight.duplicated()
print(duplicates)

In [None]:
# Get duplicate rows
duplicates = height_weight.duplicated()
height_weight[duplicates]

#### How to find duplicate rows?
- The .duplicated() method
- subset: List of column names to check for duplication.
- keep: Which occurrence to leave unflagged: the first ('first'), the last ('last'), or none (False flags every duplicated row).

In [None]:
# Column names to check for duplication
column_names = ['first_name','last_name','address']
duplicates = height_weight.duplicated(subset = column_names, keep = False)

In [None]:
# Output duplicate values
height_weight[duplicates]

In [None]:
# Output duplicate values
height_weight[duplicates].sort_values(by = 'first_name')
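
The effect of `subset` and `keep` can be sketched on a small invented table in which rows 0 and 1 are complete duplicates and row 2 differs only in height. The names and values are hypothetical.

```python
import pandas as pd

# Invented records: rows 0 and 1 are identical, row 2 differs only in height
height_weight = pd.DataFrame({
    'first_name': ['Justin', 'Justin', 'Justin', 'Mary'],
    'last_name': ['Saddlemyer', 'Saddlemyer', 'Saddlemyer', 'Lee'],
    'height': [193, 193, 194, 170],
})

# Default: all columns, keep='first', so only the second complete copy is flagged
complete = height_weight.duplicated()

# subset + keep=False: every row sharing the same name is flagged
column_names = ['first_name', 'last_name']
by_name = height_weight.duplicated(subset=column_names, keep=False)

print(complete.tolist())  # [False, True, False, False]
print(by_name.tolist())   # [True, True, True, False]
```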

#### How to treat duplicate values?
- The .drop_duplicates() method
- subset: List of column names to check for duplication.
- keep: Which duplicate to keep: the first ('first'), the last ('last'), or none (False drops every duplicated row).
- inplace: Drop duplicated rows directly inside DataFrame without creating new object (True). 

In [None]:
# Drop duplicates
height_weight.drop_duplicates(inplace = True)

In [None]:
# Output the incomplete duplicates that remain after dropping complete ones
column_names = ['first_name','last_name','address']
duplicates = height_weight.duplicated(subset = column_names, keep = False)
height_weight[duplicates].sort_values(by = 'first_name')

- The .groupby() and .agg() methods

In [None]:
# Group by column names and produce statistical summaries
column_names = ['first_name','last_name','address']
summaries = {'height': 'max', 'weight': 'mean'}
height_weight = height_weight.groupby(by = column_names).agg(summaries).reset_index()
# Make sure aggregation is done
duplicates = height_weight.duplicated(subset = column_names, keep = False)
height_weight[duplicates].sort_values(by = 'first_name')
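
A minimal sketch of the aggregation step, assuming the height and weight columns are already numeric (the slide data above carries units such as "cm", which would have to be stripped first). The values are invented.

```python
import pandas as pd

# Invented, already-numeric measurements; Justin appears twice with different values
height_weight = pd.DataFrame({
    'first_name': ['Justin', 'Justin', 'Mary'],
    'last_name': ['Saddlemyer', 'Saddlemyer', 'Lee'],
    'height': [193, 194, 170],
    'weight': [87, 89, 60],
})

column_names = ['first_name', 'last_name']
summaries = {'height': 'max', 'weight': 'mean'}

# One row per person: tallest recorded height, average recorded weight
merged = height_weight.groupby(by=column_names).agg(summaries).reset_index()

# No name appears twice any more
assert not merged.duplicated(subset=column_names, keep=False).any()
print(merged)
```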

### Finding duplicates
A new update to the data pipeline feeding into ride_sharing has added the ride_id column, which represents a unique identifier for each ride.

The update however coincided with radically shorter average ride duration times and irregular user birth dates set in the future. Most importantly, the number of rides taken has increased by 20% overnight, leading you to think there might be both complete and incomplete duplicates in the ride_sharing DataFrame.

In this exercise, you will confirm this suspicion by finding those duplicates. A sample of ride_sharing is in your environment, as well as all the packages you've been working with thus far.

In [39]:
ride_sharing[['bike_id']]

Unnamed: 0,bike_id
0,5480
1,5193
2,3652
3,1883
4,4626
...,...
25755,5063
25756,5411
25757,5157
25758,4438


In [40]:
# Find duplicates
duplicates = ride_sharing.duplicated(subset='bike_id', keep=False)
duplicates.head()

0    True
1    True
2    True
3    True
4    True
dtype: bool

In [43]:
# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values('bike_id')
duplicated_rides

Unnamed: 0,duration,station_A_id,station_A_name,station_B_id,station_B_name,bike_id,user_type,user_birth_year,user_gender,user_type_cat,duration_trim,duration_time
10857,4 minutes,3,Powell St BART Station (Market St at 4th St),47,4th St at Harrison St,11,1,1987,Male,1,4,4
3638,12 minutes,22,Howard St at Beale St,350,8th St at Brannan St,11,1,1988,Female,1,12,12
6088,5 minutes,67,San Francisco Caltrain Station 2 (Townsend St...,63,Bryant St at 6th St,11,2,1985,Male,2,5,5
16021,12 minutes,30,San Francisco Caltrain (Townsend St at 4th St),22,Howard St at Beale St,27,3,1989,Female,3,12,12
17119,11 minutes,3,Powell St BART Station (Market St at 4th St),58,Market St at 10th St,27,3,1996,Male,3,11,11
...,...,...,...,...,...,...,...,...,...,...,...,...
6815,5 minutes,21,Montgomery St BART Station (Market St at 2nd St),343,Bryant St at 2nd St,6638,2,1995,Female,2,5,5
8300,6 minutes,16,Steuart St at Market St,36,Folsom St at 3rd St,6638,2,1962,Male,2,6,6
8812,10 minutes,5,Powell St BART Station (Market St at 5th St),345,Hubbell St at 16th St,6638,2,1986,Female,2,10,10
8456,7 minutes,67,San Francisco Caltrain Station 2 (Townsend St...,19,Post St at Kearny St,6638,1,1983,Male,1,7,7


In [45]:
# Print relevant columns of duplicated_rides
print(duplicated_rides[['bike_id','duration','user_birth_year']])

       bike_id    duration  user_birth_year
10857       11   4 minutes             1987
3638        11  12 minutes             1988
6088        11   5 minutes             1985
16021       27  12 minutes             1989
17119       27  11 minutes             1996
...        ...         ...              ...
6815      6638   5 minutes             1995
8300      6638   6 minutes             1962
8812      6638  10 minutes             1986
8456      6638   7 minutes             1983
8380      6638   8 minutes             1984

[25717 rows x 3 columns]


In [None]:
# Find duplicates
duplicates = ride_sharing.duplicated(subset='ride_id', keep=False)

# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')

# Print relevant columns of duplicated_rides
print(duplicated_rides[['ride_id','duration','user_birth_year']])

- Great job! Notice that rides 33 and 89 are incomplete duplicates, whereas the remaining are complete. You'll learn how to treat them in the next exercise!

### Treating duplicates
In the last exercise, you were able to verify that the new update feeding into ride_sharing contains a bug generating both complete and incomplete duplicated rows for some values of the ride_id column, with occasional discrepant values for the user_birth_year and duration columns.

In this exercise, you will be treating those duplicated rows by first dropping complete duplicates, and then merging the incomplete duplicate rows into one while keeping the average duration, and the minimum user_birth_year for each set of incomplete duplicate rows.

In [None]:
# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing.drop_duplicates()

# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': 'min', 'duration': 'mean'}

# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()

# Find duplicated values again
duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates]

# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0
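
The two-step treatment can be traced on a handful of invented rides. This sketch assumes a numeric `duration` column, and the `ride_id` values and birth years are hypothetical. Ride 55 is a complete duplicate, while ride 33 is an incomplete one.

```python
import pandas as pd

# Invented rides: ride 33 has an incomplete duplicate, ride 55 a complete one
ride_sharing = pd.DataFrame({
    'ride_id': [33, 33, 55, 55, 71],
    'duration': [10, 2, 9, 9, 11],
    'user_birth_year': [1979, 2022, 1985, 1985, 1997],
})

# Step 1: complete duplicates collapse to a single row
ride_dup = ride_sharing.drop_duplicates()

# Step 2: incomplete duplicates are merged per ride_id
statistics = {'user_birth_year': 'min', 'duration': 'mean'}
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()

# Every ride_id now appears exactly once
assert not ride_unique.duplicated(subset='ride_id').any()
print(ride_unique)
```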

- Awesome work! You can bet after this fix that ride sharing KPIs will come back to normal.