# Cleaning Data in Python

## (1) Common Data Problems

### Data Type Constraints
This text explains the importance of cleaning data in the data science workflow, and the potential consequences of not doing so. It highlights the different types of data that can be encountered, such as text, integers, decimals, dates, and zip codes, and emphasizes the need to ensure that variables have the correct data types before analysis to avoid compromising insights.

The text also provides examples of how to convert data types, such as converting a string to an integer or a categorical variable, and introduces the .astype() method and the assert statement. The importance of correctly identifying categorical data is emphasized, and the consequences of not doing so are explained.

#### Numeric data or ...?

In [None]:
import pandas as pd

ride_sharing = pd.read_csv('ride_sharing_new.csv')

# Print the information of ride_sharing
print(ride_sharing.info())

In [None]:
# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())

In [None]:
# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')

# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'

In [None]:
# Print new summary statistics
print(ride_sharing['user_type_cat'].describe())

#### Summing Strings and Concatenating Numbers

In [None]:
# Strip duration of minutes
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip('minutes')

# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')

# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'

# Print formed columns and calculate average ride duration
print(ride_sharing[['duration','duration_trim','duration_time']])
print(ride_sharing['duration_time'].mean())

### Data Range Constraints
In this lesson, the importance of data range constraints in data science is discussed. Examples are given where data falls out of the allowable range due to errors in data collection or parsing, and ways to deal with such out of range data are explored. The simplest option is to drop the data, but depending on the size and importance of the out of range data, this might not always be the best option. Custom minimums or maximums can be set for columns, or the out of range data can be imputed as missing data, to be dealt with in Chapter 3. The business assumptions behind the data can also dictate assigning a custom value for any values of data that go beyond a certain range.

Two examples are given, one involving a dataset of movies and their respective average rating from a streaming service, and another involving subscription dates in the future for a service. In the movie example, the out of range values are either dropped or set to a hard limit of 5, depending on the assumptions behind the data. In the date range example, the subscription date column is first converted to a pandas datetime object and then treated in a similar way to the movie example.

The lesson stresses the importance of understanding the dataset and the business assumptions behind it before deciding how to deal with out of range data. The assert statement is introduced as a way to verify that the treatment of the out of range data has been successful.
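
The two treatments described above can be sketched as follows. This is a minimal illustration, not the lesson's own code: the `movies` and `users` DataFrames, their column names, and their values are made up for the example.

```python
import datetime as dt

import pandas as pd

# Hypothetical data: ratings are expected to fall between 1 and 5
movies = pd.DataFrame({
    'movie_name': ['A', 'B', 'C', 'D'],
    'avg_rating': [4.5, 6.0, 3.2, 5.0],
})

# Option 1: drop the out-of-range rows
movies_dropped = movies[movies['avg_rating'] <= 5]

# Option 2: cap out-of-range values at the hard limit of 5
movies_capped = movies.copy()
movies_capped.loc[movies_capped['avg_rating'] > 5, 'avg_rating'] = 5

# Verify that both treatments worked
assert movies_dropped['avg_rating'].max() <= 5
assert movies_capped['avg_rating'].max() <= 5

# Hypothetical subscription dates, one of them 30 days in the future
today = dt.date.today()
users = pd.DataFrame({
    'subscription_date': ['2019-03-01', str(today + dt.timedelta(days=30))],
})

# Convert to datetime, then to plain date objects so they compare with today
users['subscription_date'] = pd.to_datetime(users['subscription_date']).dt.date

# Drop the rows with a subscription date in the future
users = users[users['subscription_date'] <= today]
assert users['subscription_date'].max() <= today
```

Whether dropping or capping is appropriate depends on the business assumptions behind the column, as the lesson notes.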

#### Tire Size Constraints

In [None]:
# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')

# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27

# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')

# Print tire size description
print(ride_sharing['tire_sizes'].describe())

#### Back to the Future

In [None]:
# Convert ride_date to date
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date

# Save today's date (requires the datetime module)
import datetime as dt
today = dt.date.today()

# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today

# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())

### Uniqueness Constraints
This lesson discusses duplicate values in data cleaning. Duplicate values occur when the same exact information is repeated across multiple rows in a DataFrame. They can arise from data entry and human errors, bugs and design errors, or from joining and consolidating data from various sources. To find duplicate values, the .duplicated() method can be used, and the subset and keep arguments can be adjusted to calibrate the search. Once duplicates are found, they can be treated by keeping only one row with .drop_duplicates() or by combining rows using statistical measures with the groupby and agg methods.
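
A small sketch of how the subset and keep arguments change what gets flagged may help here. The `people` DataFrame and its columns are hypothetical and serve only to illustrate the calls.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are complete duplicates,
# rows 2 and 3 share a name but differ in height
people = pd.DataFrame({
    'first_name': ['Ann', 'Ann', 'Bob', 'Bob'],
    'height': [160, 160, 180, 182],
})

# Default: only later repeats of a fully identical row are flagged
complete_dups = people.duplicated()
print(complete_dups.tolist())

# subset + keep=False: flag every row that shares a first_name
name_dups = people.duplicated(subset='first_name', keep=False)
print(name_dups.tolist())

# Remove complete duplicates, then combine the remaining rows with a statistic
people_unique = (people.drop_duplicates()
                       .groupby('first_name')
                       .agg({'height': 'mean'})
                       .reset_index())

# Verify that no first_name appears twice any more
assert not people_unique.duplicated(subset='first_name').any()
```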

#### Finding Duplicates

In [None]:
# Find duplicates
duplicates = ride_sharing.duplicated(subset = 'ride_id', keep = False)

# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')

# Print relevant columns
print(duplicated_rides[['ride_id','duration','user_birth_year']])

#### Treating Duplicates

In [None]:
# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing.drop_duplicates()

# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': 'min', 'duration': 'mean'}

# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()

# Find duplicated values again
duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates]

# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0

## (2) Text and Categorical Data Problems

### Membership Constraints
This chapter focuses on common problems with text and categorical data, specifically membership constraints. Categorical data represents a predefined finite set of categories and inconsistencies can occur due to data entry issues, data parsing errors, and other reasons. To treat these problems, we can drop rows with incorrect categories, attempt to remap incorrect categories to correct ones, and more. An example is provided where a left anti join is used to return all data in the study_data DataFrame with inconsistent blood types, and an inner join returns all rows containing consistent blood types. To find inconsistent categories in Python, we create a set of the blood_type column, find the difference between that set and the categories DataFrame, and use the isin method to find all rows of the blood_type column that are equal to inconsistent categories. To drop inconsistent rows and keep consistent ones, we use the tilde symbol while subsetting.
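
The blood type example can be sketched in pandas as follows. The data values are invented for illustration, and expressing the anti join through `merge(..., indicator=True)` is one possible approach rather than necessarily the one used in the lesson.

```python
import pandas as pd

# Hypothetical data: 'Z+' is not a valid blood type
study_data = pd.DataFrame({
    'name': ['Beth', 'Ignatius', 'Paul'],
    'blood_type': ['B-', 'A-', 'Z+'],
})
categories = pd.DataFrame({
    'blood_type': ['O-', 'O+', 'A-', 'A+', 'B+', 'B-', 'AB+', 'AB-'],
})

# Set difference + isin, as described above
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
inconsistent_data = study_data[inconsistent_rows]

# The tilde inverts the boolean mask and keeps only consistent rows
consistent_data = study_data[~inconsistent_rows]

# The same split expressed as joins: indicator=True adds a _merge column,
# and 'left_only' marks rows found only in study_data (left anti join)
merged = study_data.merge(categories, on='blood_type', how='left', indicator=True)
anti_join = merged.loc[merged['_merge'] == 'left_only', ['name', 'blood_type']]
inner_join = study_data.merge(categories, on='blood_type', how='inner')
```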

#### Finding Consistency

In [None]:
import pandas as pd

airlines = pd.read_csv('airlines.csv')

# Print categories DataFrame (assumed to be loaded already; it is not defined in this notebook)
print(categories)

# Print unique values of survey columns in airlines
print('Cleanliness: ', airlines['cleanliness'].unique(), "\n")
print('Safety: ', airlines['safety'].unique(), "\n")
print('Satisfaction: ', airlines['satisfaction'].unique(),"\n")

In [None]:
# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])

# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)

# Print rows with inconsistent category
print(airlines[cat_clean_rows])

# Print rows with consistent categories only
print(airlines[~cat_clean_rows])

### Categorical Variables
The lesson discusses how to deal with categorical variables when cleaning data. The problems that can arise include value inconsistency, too many categories, and incorrect data types. To deal with value inconsistency, we can capitalize or lowercase the column, or strip leading and trailing whitespace. We can also create categories out of data by using the qcut or cut functions, or map categories to fewer ones using the replace method.
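
The qcut function mentioned above is not used in the exercises that follow, so here is a minimal sketch of it. The `demographics` DataFrame, its column, and the group names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical household incomes
demographics = pd.DataFrame({
    'household_income': [20000, 35000, 50000, 65000, 80000, 120000],
})

# qcut chooses the bin edges itself so that each group holds
# (roughly) the same number of rows
group_names = ['low', 'medium', 'high']
demographics['income_group'] = pd.qcut(demographics['household_income'],
                                       q=3, labels=group_names)

print(demographics)
```

In contrast, cut takes explicit bin edges, which is the better fit when the ranges come from domain knowledge rather than from the distribution of the data.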

In [None]:
import pandas as pd

airlines = pd.read_csv('airlines_final.csv')

# Print unique values of both columns
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())

In [None]:
# Lower dest_region column and then replace "eur" with "europe"
airlines['dest_region'] = airlines['dest_region'].str.lower()
airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})

In [None]:
# Remove white spaces from `dest_size`
airlines['dest_size'] = airlines['dest_size'].str.strip()

# Verify changes have been effected
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())

#### Remapping Categories

In [None]:
import numpy as np

# Create ranges for categories
label_ranges = [0, 60, 180, np.inf]
label_names = ['short', 'medium', 'long']

# Create wait_type column
airlines['wait_type'] = pd.cut(airlines['wait_min'], bins = label_ranges,
                               labels = label_names)

# Create mappings and replace
mappings = {'Monday':'weekday', 'Tuesday':'weekday', 'Wednesday': 'weekday',
            'Thursday': 'weekday', 'Friday': 'weekday',
            'Saturday': 'weekend', 'Sunday': 'weekend'}

airlines['day_week'] = airlines['day'].replace(mappings)

### Cleaning Text Data
In this lesson, we learn about text data and how to clean it using regular expressions. Text data is a common type of data that includes names, phone numbers, addresses, emails, and more. However, text data can have problems such as inconsistencies, typos, and formatting issues.

As an example, we use a DataFrame named phones that contains the full name and phone numbers of individuals. The phone number column has values that begin with "00" or "+", values that contain dashes, and an invalid entry that is only 4 digits long. These formatting issues would prevent us from using the phone numbers in an automated call system or creating a report on the distribution of users by area code.

To clean the phone number column, we first replace the plus sign with "00" using the str.replace() method. Next, we remove the dashes using the same method. Finally, we replace all phone numbers with fewer than 10 digits with NaN using the loc method and numpy's nan object.

We can also write assert statements to test that every phone number has at least the required length and that the column no longer contains the symbols we removed. Additionally, we introduce regular expressions, which allow us to search for patterns in text data. We use the str.replace() method with a regular expression pattern to keep only the digits in the phone number column.

Regular expressions are powerful tools for cleaning text data, but constructing them can be complex.
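
The phone number cleanup described above might look like the sketch below. The names and numbers in `phones` are invented, and passing `regex=False` for the literal replacements is a choice made here so that "+" is not interpreted as a regex metacharacter.

```python
import numpy as np
import pandas as pd

# Hypothetical phone data
phones = pd.DataFrame({
    'full_name': ['Noelani A. Gray', 'Myles Z. Gomez', 'Gil B. Silva'],
    'phone_number': ['001-702-397-5143', '+1-329-485-0540', '4138'],
})

# Replace "+" with "00" and remove dashes (literal matches, no regex)
phones['phone_number'] = phones['phone_number'].str.replace('+', '00', regex=False)
phones['phone_number'] = phones['phone_number'].str.replace('-', '', regex=False)

# Set numbers with fewer than 10 digits to NaN
digits = phones['phone_number'].str.len()
phones.loc[digits < 10, 'phone_number'] = np.nan

# Verify: the shortest remaining number has at least 10 digits
# and the removed symbols are gone (na=False treats NaN as "no match")
assert phones['phone_number'].str.len().min() >= 10
assert not phones['phone_number'].str.contains('+', regex=False, na=False).any()
assert not phones['phone_number'].str.contains('-', regex=False, na=False).any()

# Regex alternative: \D+ matches any run of non-digit characters
raw = pd.Series(['+1-329-485-0540'])
only_digits = raw.str.replace(r'\D+', '', regex=True)
```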

#### Removing Titles and Taking Names

In [None]:
# Replace "Dr." with empty string "" (regex=False matches the dot literally)
airlines['full_name'] = airlines['full_name'].str.replace("Dr.", "", regex=False)

# Replace "Mr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Mr.", "", regex=False)

# Replace "Miss" with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Miss", "", regex=False)

# Replace "Ms." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Ms.", "", regex=False)

# Assert that full_name has no honorifics (dots are escaped in the regex)
assert not airlines['full_name'].str.contains(r'Ms\.|Mr\.|Miss|Dr\.').any()

#### Keeping it Descriptive

In [None]:
# Store length of each row in survey_response column
resp_length = airlines['survey_response'].str.len()

# Find rows in airlines where resp_length > 40
airlines_survey = airlines[resp_length > 40]

# Assert minimum survey_response length is > 40
assert airlines_survey['survey_response'].str.len().min() > 40

# Print new survey_response column
print(airlines_survey['survey_response'])

## (3) Advanced Data Problems

### Uniformity
In this lesson, we learn about advanced data cleaning techniques, including uniformity, cross-field validation, and dealing with missing data. We begin by discussing the problem of unit uniformity, which occurs when data has values in multiple units (e.g., Celsius and Fahrenheit). To address this issue, we can plot the data and visually identify any discrepancies. We then demonstrate how to convert temperature data from Fahrenheit to Celsius using a simple formula and the loc method.
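
As a rough sketch of the temperature conversion, assume a `temperatures` DataFrame (hypothetical, as are its values) in which any reading above 40 is taken to be in Fahrenheit. That threshold is an assumption for this example, not a general rule.

```python
import pandas as pd

# Hypothetical daily readings; one value was recorded in Fahrenheit
temperatures = pd.DataFrame({
    'date': ['03.03.19', '04.03.19', '05.03.19'],
    'temperature': [14.0, 62.6, 16.0],
})

# Assumption: anything above 40 cannot be a Celsius reading in this data
is_fahrenheit = temperatures['temperature'] > 40
temp_fah = temperatures.loc[is_fahrenheit, 'temperature']

# Convert with C = (F - 32) * 5/9 and write the result back
temp_cels = (temp_fah - 32) * (5 / 9)
temperatures.loc[is_fahrenheit, 'temperature'] = temp_cels

# Verify that no implausible Celsius value remains
assert temperatures['temperature'].max() < 40
```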

Next, we move on to date data, which can be inconsistent due to varying formats and ambiguous values. To address these issues, we can use the pandas to_datetime function to convert the column to a datetime format. We set the infer_datetime_format argument to True to handle the varying formats, and errors to 'coerce' so that values that cannot be parsed become missing (NaT). In pandas 2.0 and later, infer_datetime_format is deprecated because format inference is the default. We can also use the dt.strftime method to convert the datetime format to our desired format. However, if the dataset has ambiguous dates with vague formats, there is no clear way to spot the inconsistency or to treat it. It is important to understand where the data comes from before trying to treat it, as this makes these decisions much easier.

#### Uniform Currencies

In [None]:
# Banking import
import pandas as pd

banking = pd.read_csv('banking_dirty.csv')

# Find values of acct_cur that are equal to 'euro'
acct_eu = banking['acct_cur'] == 'euro'

# Convert acct_amount where it is in euro to dollars
banking.loc[acct_eu, 'acct_amount'] = banking.loc[acct_eu, 'acct_amount'] * 1.1

# Unify acct_cur column by changing 'euro' values to 'dollar'
banking.loc[acct_eu, 'acct_cur'] = 'dollar'

# Assert that only dollar currency remains
assert (banking['acct_cur'] == 'dollar').all()

#### Uniform Dates


In [None]:
# Print the header of account_opened
print(banking['account_opened'].head())

# Convert account_opened to datetime
banking['account_opened'] = pd.to_datetime(banking['account_opened'],
                                           # Infer datetime format
                                           infer_datetime_format = True,
                                           # Return missing value for error
                                           errors = 'coerce')

# Get year of account opened
banking['acct_year'] = banking['account_opened'].dt.strftime('%Y')

# Print acct_year
print(banking['acct_year'])

### Cross Field Validation
The article discusses cross-field validation for data cleaning, which is the use of multiple fields in a dataset to check for data integrity. It provides two examples of cross-field validation, one involving flight statistics and another involving user data. The article suggests that there is no one-size-fits-all solution for dealing with inconsistencies found through cross-field validation, and that the appropriate action will depend on a good understanding of the dataset and its sources.

#### Cross Field or No Cross Field
> Cross Field Validation

1) Confirming the Age provided by users by cross checking their birthdays
2) Row wise operations such as .sum(axis=1)

> Not Cross Field Validation

1) Making sure a subscription_date column has no values set in the future
2) Making sure that a revenue column is a numeric column
3) The use of the .astype() method

#### How's our data integrity?


In [None]:
# Store fund columns to sum against
fund_columns = ['fund_A', 'fund_B', 'fund_C', 'fund_D']

# Find rows where fund_columns row sum == inv_amount
inv_equ = banking[fund_columns].sum(axis = 1) == banking['inv_amount']

# Store consistent and inconsistent data
consistent_inv = banking[inv_equ]
inconsistent_inv = banking[~inv_equ]

# Print the number of inconsistent investments
print("Number of inconsistent investments: ", inconsistent_inv.shape[0])

In [None]:
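# Store today's date and find ages (requires the datetime module;
# birth_date must already be a datetime column)
import datetime as dt
today = dt.date.today()
ages_manual = today.year - banking['birth_date'].dt.year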

# Find rows where age column == ages_manual
age_equ = banking['age'] == ages_manual

# Store consistent and inconsistent data
consistent_ages = banking[age_equ]
inconsistent_ages = banking[~age_equ]

# Print the number of inconsistent ages
print("Number of inconsistent ages: ", inconsistent_ages.shape[0])

### Completeness
This lesson is about completeness and missing data. Missing data is a common problem caused by technical or human errors, and can take various forms. The airquality dataset is used as an example to demonstrate how missing data can be identified and analyzed using the missingno package. Different types of missingness are also explained, including missing completely at random, missing at random, and missing not at random. There are various ways to deal with missing data, including dropping missing values and replacing missing values with statistical measures such as mean, median, or mode.
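
The two basic treatments named above, dropping and imputing with a statistical measure, can be sketched as follows. The `airquality` values and column names here are invented for illustration. The missingno visualization is left out so that the example only needs pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical slice of an air-quality table with missing CO2 readings
airquality = pd.DataFrame({
    'Temperature': [8.3, -49.0, 9.7, -40.0],
    'CO2': [1.2, np.nan, 0.9, np.nan],
})

# Count missing values per column
print(airquality.isna().sum())

# Option 1: drop rows where CO2 is missing
airquality_dropped = airquality.dropna(subset=['CO2'])

# Option 2: replace missing CO2 values with the column mean
co2_mean = airquality['CO2'].mean()
airquality_imputed = airquality.fillna({'CO2': co2_mean})

# Verify that no missing CO2 values remain after imputation
assert airquality_imputed['CO2'].isna().sum() == 0
```

In this made-up data the CO2 readings are missing exactly where the temperature is very low. That pattern would suggest the values are not missing completely at random, which is worth checking before picking a treatment.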

#### Missing Investors
Dealing with missing data is one of the most common tasks in Data Science. There are a variety of types of missingness, as well as a variety of types of solutions to missing data.

In [None]:
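# Imports for the missingness visualizations
import missingno as msno
import matplotlib.pyplot as plt

# Print number of missing values in banking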
print(banking.isna().sum())

# Visualize missingness matrix
msno.matrix(banking)
plt.show()

In [None]:
# Isolate missing and non missing values of inv_amount
missing_investors = banking[banking['inv_amount'].isna()]
investors = banking[~banking['inv_amount'].isna()]

In [None]:
# Sort banking by age and visualize
banking_sorted = banking.sort_values(by = 'age')
msno.matrix(banking_sorted)
plt.show()

#### Follow the Money
In this exercise, you're working with another version of the banking DataFrame that contains missing values for both the cust_id column and the acct_amount column.

In [None]:
# Drop missing values of cust_id
banking_fullid = banking.dropna(subset = ['cust_id'])

# Compute estimated acct_amount
acct_imp = banking_fullid['inv_amount'] * 5

# Impute missing acct_amount with corresponding acct_imp
banking_imputed = banking_fullid.fillna({'acct_amount':acct_imp})

# Print number of missing values
print(banking_imputed.isna().sum())

## (4) Record Linkage

### Comparing Strings
The text discusses string similarity and minimum edit distance algorithms, specifically the Levenshtein distance and thefuzz package. The WRatio function is introduced as a tool for comparing strings, including partial and differently ordered strings. The extract function is used to compare a string with an array of strings. String similarity is also shown as a method for collapsing categories in a DataFrame, with an example of using it to remap inconsistent state categories. Finally, the text briefly mentions record linkage as a way to join data sources with fuzzy duplicates.

#### Minimum Edit Distance
Minimum edit distance is used to measure how similar two strings are. As a reminder, it is the minimum number of steps needed to transform String A into String B, with the available operations being:
1) Insertion of a new character
2) Deletion of an existing character
3) Substitution of an existing character
4) Transposition of two existing consecutive characters
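
To make the four operations concrete, here is a minimal pure-Python sketch of a dynamic-programming edit distance that allows all of them. The function name `min_edit_distance` is hypothetical, and this is only an illustration of the idea, not how thefuzz computes its scores.

```python
def min_edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, substitutions and
    adjacent transpositions needed to turn string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]

    # Turning a prefix into the empty string (or back) costs one step per character
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (free if characters match)
            )
            # Transposition of two consecutive characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)

    return dist[-1][-1]


print(min_edit_distance('reeding', 'reading'))
print(min_edit_distance('kitten', 'sitting'))
```

The lower the distance, the closer the two strings; similarity scores such as those returned by thefuzz rescale this kind of measure to a 0 to 100 range.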

#### The Cutoff point

In [None]:
# Import process from thefuzz
from thefuzz import process
import pandas as pd

# Read the restaurants data with pandas
restaurants = pd.read_csv('restaurants_L2_dirty.csv')

# Store the unique values of cuisine_type in unique_types
unique_types = restaurants['cuisine_type'].unique()

# Calculate similarity of 'asian' to all values of unique_types
print(process.extract('asian', unique_types, limit = len(unique_types)))

# Calculate similarity of 'american' to all values of unique_types
print(process.extract('american', unique_types, limit = len(unique_types)))

# Calculate similarity of 'italian' to all values of unique_types
print(process.extract('italian', unique_types, limit = len(unique_types)))

#### Remapping categories II

In [None]:
# Inspect the unique values of the cuisine_type column
print(restaurants['cuisine_type'].unique())

In [None]:
# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian', restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))

# Inspect the first 5 matches
print(matches[0:5])

In [None]:
# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian', restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))

# Iterate through the list of matches to italian
for match in matches:
    # Check whether the similarity score is greater than or equal to 80
    if match[1] >= 80:
        # Select all rows where the cuisine_type is spelled this way, and set only that column to the correct cuisine
        restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = 'italian'

In [None]:
# Categories to standardize (assumed here from the three cuisines compared above)
categories = ['asian', 'american', 'italian']

# Iterate through categories
for cuisine in categories:
    # Create a list of matches, comparing cuisine with the cuisine_type column
    matches = process.extract(cuisine, restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))

    # Iterate through the list of matches
    for match in matches:
        # Check whether the similarity score is greater than or equal to 80
        if match[1] >= 80:
            # If it is, select all rows where the cuisine_type is spelled this way, and set only that column to the correct cuisine
            restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = cuisine

# Inspect the final result
print(restaurants['cuisine_type'].unique())

### Generating Pairs

In this lesson, the focus is on using record linkage to merge two data sources without duplicates. The example given is of two data frames containing NBA games and schedules from different sources, with some games having different naming and no common unique identifier. Regular join or merge methods cannot be used in such cases, and record linkage is used instead. The steps involved in record linkage include cleaning the data, generating pairs of potentially matching records, scoring these pairs based on string similarity and other metrics, and linking them. In this lesson, the emphasis is on generating pairs using blocking, which involves creating an indexing object and using the block method to generate pairs based on a matching column. The comparison object is then created to assign different comparison procedures for pairs, and string similarities between pairs of rows for columns with fuzzy values are computed using the string method. The compute function is used to compute the matches, and filtering is done to find potential matches.
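
The idea behind blocking and scoring can be sketched without the recordlinkage package. The example below is only an approximation under stated assumptions: the two small DataFrames are made up, a plain pandas merge stands in for the indexer's block method, and `difflib.SequenceMatcher` from the standard library stands in for the package's string comparison.

```python
from difflib import SequenceMatcher

import pandas as pd

# Two hypothetical restaurant tables with no shared unique identifier
left = pd.DataFrame({
    'rest_name': ["arnie morton's of chicago", 'campanile', 'fenix'],
    'city': ['los angeles', 'los angeles', 'hollywood'],
})
right = pd.DataFrame({
    'rest_name': ['arnie mortons of chicago', 'campanille', 'grill on the alley'],
    'city': ['los angeles', 'los angeles', 'los angeles'],
})

# Blocking: only pair rows that agree exactly on the blocking column (city)
pairs = left.reset_index().merge(
    right.reset_index(), on='city', suffixes=('_left', '_right')
)

# Score each candidate pair by name similarity (a value between 0 and 1)
pairs['name_score'] = [
    SequenceMatcher(None, a, b).ratio()
    for a, b in zip(pairs['rest_name_left'], pairs['rest_name_right'])
]

# Keep the pairs whose similarity clears an illustrative threshold of 0.8
potential_matches = pairs[pairs['name_score'] >= 0.8]
print(potential_matches[['index_left', 'index_right', 'name_score']])
```

Blocking keeps the number of candidate pairs manageable: instead of comparing every row with every other row, only rows that share a city are scored.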

#### To Link or Not to Link
Similar to joins, record linkage is the act of linking data from different sources regarding the same entity. But unlike joins, record linkage does not require exact matches between different pairs of data, and instead can find close matches using string similarity. This is why record linkage is effective when the data sources share no common unique key, such as a unique identifier, that you can rely on when linking them.

#### Pairs of Restaurants

In [None]:
# Import recordlinkage (restaurants_new is assumed to be loaded already)
import recordlinkage

# Create an indexer object and find possible pairs
indexer = recordlinkage.Index()

# Block pairing on cuisine_type
indexer.block('cuisine_type')

# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)

#### Similar Restaurants

In [None]:
# Create a comparison object
comp_cl = recordlinkage.Compare()

# Find exact matches on city, cuisine_types
comp_cl.exact('city', 'city', label='city')
comp_cl.exact('cuisine_type', 'cuisine_type', label='cuisine_type')

# Find similar matches of rest_name
comp_cl.string('rest_name', 'rest_name', label='name', threshold = 0.8)
# Cleaning Data in Python

## (1) Common Data Problems

### Data Type Constraints
This text explains the importance of cleaning data in the data science workflow, and the potential consequences of not doing so. It highlights the different types of data that can be encountered, such as text, integers, decimals, dates, and zip codes, and emphasizes the need to ensure that variables have the correct data types before analysis to avoid compromising insights.

The text also provides examples of how to convert data types, such as converting a string to an integer or a categorical variable, and introduces the dot-astype() method and the assert statement. The importance of correctly identifying categorical data is emphasized, and the consequences of not doing so are explained.

#### Numeric data or ...?

In [None]:
import pandas as pd

ride_sharing = pd.read_csv('ride_sharing_new.csv')

# Print the information of ride_sharing
print(ride_sharing.info())

In [None]:
# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())

In [None]:
# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')

# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'

In [None]:
# Print new summary statistics
print(ride_sharing['user_type_cat'].describe())

#### Summing Strings and Concatenating Numbers

In [None]:
# Strip duration of minutes
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip('minutes')

# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')

# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'

# Print formed columns and calculate average ride duration
print(ride_sharing[['duration','duration_trim','duration_time']])
print(ride_sharing['duration_time'].mean())

### Data Range Constraints
In this lesson, the importance of data range constraints in data science is discussed. Examples are given where data falls out of the allowable range due to errors in data collection or parsing, and ways to deal with such out of range data are explored. The simplest option is to drop the data, but depending on the size and importance of the out of range data, this might not always be the best option. Custom minimums or maximums can be set for columns, or the out of range data can be imputed as missing data, to be dealt with in Chapter 3. The business assumptions behind the data can also dictate assigning a custom value for any values of data that go beyond a certain range.

Two examples are given, one involving a dataset of movies and their respective average rating from a streaming service, and another involving subscription dates in the future for a service. In the movie example, the out of range values are either dropped or set to a hard limit of 5, depending on the assumptions behind the data. In the date range example, the subscription date column is first converted to a pandas datetime object and then treated in a similar way to the movie example.

The lesson stresses the importance of understanding the dataset and the business assumptions behind it before deciding how to deal with out of range data. The assert statement is introduced as a way to verify that the treatment of the out of range data has been successful.

#### Tire Size Constraints

In [None]:
# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')

# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27

# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')

# Print tire size description
print(ride_sharing['tire_sizes'].describe())

#### Back to the Future

In [None]:
# Convert ride_date to date
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date

# Save today's date
today = dt.date.today()

# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today

# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())

### Uniqueness Constraints
This lesson discusses duplicate values in data cleaning. Duplicate values occur when the same exact information is repeated across multiple rows in a DataFrame. They can arise from data entry and human errors, bugs and design errors, or from joining and consolidating data from various resources. To find duplicate values, the dot-duplicated() method can be used, and the subset and keep arguments can be adjusted to calibrate the search. Once duplicates are found, they can be treated by keeping only one row with dot-drop_duplicates() or combining rows using statistical measures with the groupby and agg methods.

#### Finding Duplicates

In [None]:
# Find duplicates
duplicates = ride_sharing.duplicated(subset = 'ride_id', keep = False)

# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')

# Print relevant columns
print(duplicated_rides[['ride_id','duration','user_birth_year']])

#### Treating Duplicates

In [None]:
# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing.drop_duplicates()

# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': 'min', 'duration': 'mean'}

# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()

# Find duplicated values again
duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates == True]

# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0

## (2) Text and Categorical Data Problems

### Membership Constraints
This chapter focuses on common problems with text and categorical data, specifically membership constraints. Categorical data represents a predefined finite set of categories and inconsistencies can occur due to data entry issues, data parsing errors, and other reasons. To treat these problems, we can drop rows with incorrect categories, attempt to remap incorrect categories to correct ones, and more. An example is provided where a left anti join is used to return all data in the study_data DataFrame with inconsistent blood types, and an inner join returns all rows containing consistent blood types. To find inconsistent categories in Python, we create a set of the blood_type column, find the difference between that set and the categories DataFrame, and use the isin method to find all rows of the blood_type column that are equal to inconsistent categories. To drop inconsistent rows and keep consistent ones, we use the tilde symbol while subsetting.

#### Finding Consistency

In [None]:
import pandas as pd

airlines = pd.read_csv('airlines.csv')

# Print categories DataFrame
print(categories)

# Print unique values of survey columns in airlines
print('Cleanliness: ', airlines['cleanliness'].unique(), "\n")
print('Safety: ', airlines['safety'].unique(), "\n")
print('Satisfaction: ', airlines['satisfaction'].unique(),"\n")

In [None]:
# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])

# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)

# Print rows with inconsistent category
print(airlines[cat_clean_rows])

In [None]:
# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])

# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)

# Print rows with inconsistent category
print(airlines[cat_clean_rows])

# Print rows with consistent categories only
print(airlines[~cat_clean_rows])

### Categorical Variables
The lesson discusses how to deal with categorical variables when cleaning data. The problems that can arise include value inconsistency, too many categories, and incorrect data types. To deal with value inconsistency, we can capitalize or lowercase the column, or remove leading spaces. We can also create categories out of data by using the qcut or cut functions, or map categories to fewer ones using the replace method.

In [None]:
import pandas as pd

airlines = pd.read_csv('airlines_final.csv')

# Print unique values of both columns
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())

In [None]:
# Lower dest_region column and then replace "eur" with "europe"
airlines['dest_region'] = airlines['dest_region'].str.lower()
airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})

In [None]:
# Remove white spaces from `dest_size`
airlines['dest_size'] = airlines['dest_size'].str.strip()

# Verify changes have been effected
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())

#### Remapping Categories

In [None]:
import numpy as np

# Create ranges for categories
label_ranges = [0, 60, 180, np.inf]
label_names = ['short', 'medium', 'long']

# Create wait_type column
airlines['wait_type'] = pd.cut(airlines['wait_min'], bins = label_ranges,
                               labels = label_names)

# Create mappings and replace
mappings = {'Monday':'weekday', 'Tuesday':'weekday', 'Wednesday': 'weekday',
            'Thursday': 'weekday', 'Friday': 'weekday',
            'Saturday': 'weekend', 'Sunday': 'weekend'}

airlines['day_week'] = airlines['day'].replace(mappings)

### Cleaning Text Data
n this lesson, we learn about text data and how to clean it using regular expressions. Text data is a common type of data that includes names, phone numbers, addresses, emails, and more. However, text data can have problems such as inconsistencies, typos, and formatting issues.

As an example, we use a DataFrame named phones that contains the full name and phone numbers of individuals. The phone number column has values that begin with "00" or "+", as well as dashes and a non-existent 4-digit entry. These formatting issues would prevent us from using the phone numbers in an automated call system or creating a report on the distribution of users by area code.

To clean the phone number column, we first replace the plus sign with "00" using the str.replace() method. Next, we remove the dashes using the same method. Finally, we replace all phone numbers with fewer than 10 digits with NaN using the loc method and numpy's nan object.

We can also write assert statements to test whether the phone number column has a specific length and contains the symbols we removed. Additionally, we introduce regular expressions, which allow us to search for patterns in text data. We use the str.replace() method with a regular expression pattern to extract only digits from the phone number column.

Regular expressions are powerful tools for cleaning text data, but constructing them can be complex.
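
The phone-number steps described above can be sketched as follows. This is a minimal illustration, not the lesson's own code: the `phones` DataFrame, its column names, and its values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the lesson's phones DataFrame
phones = pd.DataFrame({
    'full_name': ['Person A', 'Person B', 'Person C'],
    'phone_number': ['001-702-397-5143', '+1-329-485-0540', '4138'],
})

# Keep the raw values for the regular-expression variant further down
raw_numbers = phones['phone_number'].copy()

# Replace "+" with "00", then remove dashes (literal matches, no regex)
phones['phone_number'] = phones['phone_number'].str.replace('+', '00', regex=False)
phones['phone_number'] = phones['phone_number'].str.replace('-', '', regex=False)

# Set numbers with fewer than 10 digits to NaN
too_short = phones['phone_number'].str.len() < 10
phones.loc[too_short, 'phone_number'] = np.nan

# Check the minimum length and that the removed symbols are gone
assert phones['phone_number'].str.len().min() >= 10
assert not phones['phone_number'].str.contains(r'\+|-', na=False).any()

# Regex variant: drop every non-digit character in a single step
digits_only = raw_numbers.str.replace(r'\D+', '', regex=True)
print(phones)
print(digits_only)
```

Note that the regex variant simply deletes the plus sign rather than turning it into "00", so the two approaches are not interchangeable when the international prefix matters.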

#### Removing Titles and Taking Names

In [None]:
# Replace "Dr." with empty string "" (regex=False matches the "." literally)
airlines['full_name'] = airlines['full_name'].str.replace("Dr.", "", regex=False)

# Replace "Mr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Mr.", "", regex=False)

# Replace "Miss" with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Miss", "", regex=False)

# Replace "Ms." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Ms.", "", regex=False)

# Assert that full_name has no honorifics (dots escaped because str.contains uses regex)
assert not airlines['full_name'].str.contains(r'Ms\.|Mr\.|Miss|Dr\.').any()

#### Keeping it Descriptive

In [None]:
# Store length of each row in survey_response column
resp_length = airlines['survey_response'].str.len()

# Find rows in airlines where resp_length > 40
airlines_survey = airlines[resp_length > 40]

# Assert minimum survey_response length is > 40
assert airlines_survey['survey_response'].str.len().min() > 40

# Print new survey_response column
print(airlines_survey['survey_response'])

## (3) Advanced Data Problems

### Uniformity
In this lesson, we learn about advanced data cleaning techniques, including uniformity, cross-field validation, and dealing with missing data. We begin by discussing the problem of unit uniformity, which occurs when data has values in multiple units (e.g., Celsius and Fahrenheit). To address this issue, we can plot the data and visually identify any discrepancies. We then demonstrate how to convert temperature data from Fahrenheit to Celsius using a simple formula and the loc method.
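
As a rough sketch of the temperature fix described above, the snippet below converts suspicious readings from Fahrenheit to Celsius with `loc`. The `temperatures` DataFrame and the cutoff of 40 degrees are assumptions made for illustration; in practice the cutoff would come from plotting the data and knowing its source.

```python
import pandas as pd

# Hypothetical daily readings; two of them were recorded in Fahrenheit
temperatures = pd.DataFrame({
    'date': pd.date_range('2019-03-01', periods=5, freq='D'),
    'temperature': [14.0, 15.0, 62.6, 16.0, 59.0],
})

# Assumption: values above 40 cannot be Celsius for this location
is_fahrenheit = temperatures['temperature'] > 40

# C = (F - 32) * 5 / 9, applied only to the flagged rows
temp_fah = temperatures.loc[is_fahrenheit, 'temperature']
temperatures.loc[is_fahrenheit, 'temperature'] = (temp_fah - 32) * 5 / 9

# Confirm that no reading exceeds the cutoff any more
assert temperatures['temperature'].max() < 40
print(temperatures)
```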

Next, we move on to date data, which can be inconsistent due to varying formats and ambiguous values. To address these issues, we can use the pandas to_datetime function to convert the column to a datetime format. We set the infer_datetime_format argument to True and errors to 'coerce' to handle the varying formats and to return missing values for entries that cannot be parsed (recent pandas releases deprecate infer_datetime_format because format inference is now the default). We can also use the dt.strftime method to convert the datetime values to our desired string format. However, if the dataset has ambiguous dates with vague formats, there is no clear way to spot the inconsistency or to treat it. It is important to understand where the data comes from before trying to treat it, because that context makes these decisions much easier.
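
A small sketch of the date workflow, using made-up values in a hypothetical `accounts` DataFrame. It leaves out `infer_datetime_format` on purpose, since newer pandas versions deprecate it, and relies only on `errors='coerce'` and `dt.strftime`.

```python
import pandas as pd

# Hypothetical column with one entry that is not a date at all
accounts = pd.DataFrame({
    'account_opened': ['2018-03-05', '2019-11-21', 'not a date', '2017-01-13'],
})

# Unparseable entries become NaT instead of raising an error
accounts['account_opened'] = pd.to_datetime(accounts['account_opened'],
                                            errors='coerce')

# Re-express the parsed dates in a day-month-year string format
accounts['opened_fmt'] = accounts['account_opened'].dt.strftime('%d-%m-%Y')

print(accounts)
print("Unparseable dates:", accounts['account_opened'].isna().sum())
```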

#### Uniform Currencies

In [None]:
# Banking import
import pandas as pd

banking = pd.read_csv('banking_dirty.csv')

# Find values of acct_cur that are equal to 'euro'
acct_eu = banking['acct_cur'] == 'euro'

# Convert acct_amount where it is in euro to dollars
banking.loc[acct_eu, 'acct_amount'] = banking.loc[acct_eu, 'acct_amount'] * 1.1

# Unify acct_cur column by changing 'euro' values to 'dollar'
banking.loc[acct_eu, 'acct_cur'] = 'dollar'

# Assert that only dollar currency remains
assert (banking['acct_cur'] == 'dollar').all()

#### Uniform Dates


In [None]:
# Print the header of account_opened
print(banking['account_opened'].head())

In [None]:
# Convert account_opened to datetime
banking['account_opened'] = pd.to_datetime(banking['account_opened'],
                                           # Infer datetime format
                                           infer_datetime_format = True,
                                           # Return missing value for error
                                           errors = 'coerce')

In [None]:
# Get year of account opened
banking['acct_year'] = banking['account_opened'].dt.strftime('%Y')

# Print acct_year
print(banking['acct_year'])

### Cross Field Validation
The article discusses cross-field validation for data cleaning, which is the use of multiple fields in a dataset to check for data integrity. It provides two examples of cross-field validation, one involving flight statistics and another involving user data. The article suggests that there is no one-size-fits-all solution for dealing with inconsistencies found through cross-field validation, and that the appropriate action will depend on a good understanding of the dataset and its sources.

#### Cross Field or No Cross Field
> Cross Field Validation
1) Confirming the Age provided by users by cross checking their birthdays
2) Row wise operations such as .sum(axis=1)
> Not Cross Field Validation
1) Making sure a subscription_date column has no values set in the future
2) Making sure that a revenue column is a numeric column
3) The use of the .astype() method

#### How's our data integrity?


In [None]:
# Store fund columns to sum against
fund_columns = ['fund_A', 'fund_B', 'fund_C', 'fund_D']

# Find rows where fund_columns row sum == inv_amount
inv_equ = banking[fund_columns].sum(axis = 1) == banking['inv_amount']

# Store consistent and inconsistent data
consistent_inv = banking[inv_equ]
inconsistent_inv = banking[~inv_equ]

# Print the number of inconsistent investments
print("Number of inconsistent investments: ", inconsistent_inv.shape[0])

In [None]:
import datetime as dt

# Store today's date and find ages
today = dt.date.today()
ages_manual = today.year - banking['birth_date'].dt.year

# Find rows where age column == ages_manual
age_equ = banking['age'] == ages_manual

# Store consistent and inconsistent data
consistent_ages = banking[age_equ]
inconsistent_ages = banking[~age_equ]

# Print the number of inconsistent ages
print("Number of inconsistent ages: ", inconsistent_ages.shape[0])

### Completeness
This lesson is about completeness and missing data. Missing data is a common problem caused by technical or human errors, and can take various forms. The airquality dataset is used as an example to demonstrate how missing data can be identified and analyzed using the missingno package. Different types of missingness are also explained, including missing completely at random, missing at random, and missing not at random. There are various ways to deal with missing data, including dropping missing values and replacing missing values with statistical measures such as mean, median, or mode.
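
The two basic treatments mentioned above, dropping and imputing, might look like the sketch below. The `airquality` values are invented for the example, and the missingno visualization is omitted so that the snippet depends only on pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical air-quality readings with missing CO2 values
airquality = pd.DataFrame({
    'temperature': [8.0, 12.5, 31.0, 35.5, 38.0],
    'co2': [1.2, 1.9, np.nan, 2.4, np.nan],
})

# Count missing values per column
print(airquality.isna().sum())

# Option 1: drop rows where co2 is missing
airquality_dropped = airquality.dropna(subset=['co2'])

# Option 2: replace missing co2 with the column mean
co2_mean = airquality['co2'].mean()
airquality_imputed = airquality.fillna({'co2': co2_mean})

print(airquality_dropped)
print(airquality_imputed)
```

Which option is appropriate depends on the type of missingness: if values are not missing completely at random, both dropping and mean imputation can bias the result.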

#### Missing Investors
Dealing with missing data is one of the most common tasks in Data Science. There are a variety of types of missingness, as well as a variety of types of solutions to missing data.

In [None]:
import missingno as msno
import matplotlib.pyplot as plt

# Print number of missing values in banking
print(banking.isna().sum())

# Visualize missingness matrix
msno.matrix(banking)
plt.show()

In [None]:
# Isolate missing and non missing values of inv_amount
missing_investors = banking[banking['inv_amount'].isna()]
investors = banking[~banking['inv_amount'].isna()]

In [None]:
# Sort banking by age and visualize
banking_sorted = banking.sort_values(by = 'age')
msno.matrix(banking_sorted)
plt.show()

#### Follow the Money
In this exercise, you're working with another version of the banking DataFrame that contains missing values for both the cust_id column and the acct_amount column.

In [None]:
# Drop missing values of cust_id
banking_fullid = banking.dropna(subset = ['cust_id'])

# Compute estimated acct_amount
acct_imp = banking_fullid['inv_amount'] * 5

# Impute missing acct_amount with corresponding acct_imp
banking_imputed = banking_fullid.fillna({'acct_amount':acct_imp})

# Print number of missing values
print(banking_imputed.isna().sum())

## (4) Record Linkage

### Comparing Strings
The text discusses string similarity and minimum edit distance algorithms, specifically the Levenshtein distance and thefuzz package. The WRatio function is introduced as a tool for comparing strings, including partial and differently ordered strings. The extract function is used to compare a string with an array of strings. String similarity is also shown as a method for collapsing categories in a DataFrame, with an example of using it to remap inconsistent state categories. Finally, the text briefly mentions record linkage as a way to join data sources with fuzzy duplicates.

#### Minimum Edit Distance
Minimum edit distance is used to identify how similar two strings are. As a reminder, it is the minimum number of steps needed to get from String A to String B, with the available operations being:
1) Insertion of a new character
2) Deletion of an existing character
3) Substitution of an existing character
4) Transposition of two existing consecutive characters
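
To make the four operations concrete, here is a small pure-Python sketch of an edit distance that allows all of them. It uses the optimal string alignment variant with a cost of 1 per operation; this is an illustrative implementation, and the scoring used inside thefuzz may differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, substitutions and
    adjacent transpositions needed to turn string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]

    # Turning a prefix into the empty string (or back) costs its length
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two consecutive characters
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)

    return dist[-1][-1]


print(edit_distance('reeding', 'reading'))  # one substitution
print(edit_distance('raeding', 'reading'))  # one transposition
print(edit_distance('kitten', 'sitting'))
```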

#### The Cutoff point

In [None]:
# Import process from thefuzz
from thefuzz import process
import pandas as pd

# Read the restaurants data
restaurants = pd.read_csv('restaurants_L2_dirty.csv')

# Store the unique values of cuisine_type in unique_types
unique_types = restaurants['cuisine_type'].unique()

# Calculate similarity of 'asian' to all values of unique_types
print(process.extract('asian', unique_types, limit = len(unique_types)))

# Calculate similarity of 'american' to all values of unique_types
print(process.extract('american', unique_types, limit = len(unique_types)))

# Calculate similarity of 'italian' to all values of unique_types
print(process.extract('italian', unique_types, limit = len(unique_types)))

#### Remapping categories II

In [None]:
# Inspect the unique values of the cuisine_type column
print(restaurants['cuisine_type'].unique())

In [None]:
# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian', restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))

# Inspect the first 5 matches
print(matches[0:5])

In [None]:
# Iterate through the list of matches to italian
for match in matches:
    # Check whether the similarity score is greater than or equal to 80
    if match[1] >= 80:
        # Select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
        restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = 'italian'

In [None]:
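# Categories to collapse into (assumed: the three cuisines compared in the cutoff exercise above)
categories = ['italian', 'asian', 'american']

# Iterate through categories
for cuisine in categories: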
    # Create a list of matches, comparing cuisine with the cuisine_type column
    matches = process.extract(cuisine, restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))

    # Iterate through the list of matches
    for match in matches:
        # Check whether the similarity score is greater than or equal to 80
        if match[1] >= 80:
            # If it is, select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
            restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = cuisine

# Inspect the final result
print(restaurants['cuisine_type'].unique())

### Generating Pairs

In this lesson, the focus is on using record linkage to merge two data sources without duplicates. The example given is of two data frames containing NBA games and schedules from different sources, with some games having different naming and no common unique identifier. Regular join or merge methods cannot be used in such cases, and record linkage is used instead. The steps involved in record linkage include cleaning the data, generating pairs of potentially matching records, scoring these pairs based on string similarity and other metrics, and linking them. In this lesson, the emphasis is on generating pairs using blocking, which involves creating an indexing object and using the block method to generate pairs based on a matching column. The comparison object is then created to assign different comparison procedures for pairs, and string similarities between pairs of rows for columns with fuzzy values are computed using the string method. The compute function is used to compute the matches, and filtering is done to find potential matches.
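
The idea behind blocking and scoring can be illustrated without the recordlinkage package. The sketch below is a stand-in built on a pandas merge and the standard library's `difflib`, not the recordlinkage API; the two small DataFrames, the `left_id`/`right_id` column names, and the 0.8 cutoff are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

import pandas as pd

# Two hypothetical sources describing overlapping restaurants
left = pd.DataFrame({
    'rest_name': ["arnie morton's", 'campanile', 'fenix'],
    'cuisine_type': ['steakhouse', 'american', 'american'],
})
right = pd.DataFrame({
    'rest_name': ['arnie mortons', 'fenix at the argyle', 'grill on the alley'],
    'cuisine_type': ['steakhouse', 'american', 'steakhouse'],
})

# Blocking: only pair rows that share the same cuisine_type
pairs = (
    left.reset_index().rename(columns={'index': 'left_id'})
    .merge(right.reset_index().rename(columns={'index': 'right_id'}),
           on='cuisine_type', suffixes=('_left', '_right'))
)

# Scoring: string similarity between the two names in each candidate pair
pairs['name_score'] = [
    SequenceMatcher(None, l, r).ratio()
    for l, r in zip(pairs['rest_name_left'], pairs['rest_name_right'])
]

# Filtering: keep pairs whose names are similar enough
likely_matches = pairs[pairs['name_score'] >= 0.8]

print(len(left) * len(right), "pairs without blocking,", len(pairs), "with blocking")
print(likely_matches[['left_id', 'right_id', 'name_score']])
```

Blocking matters because the number of candidate pairs otherwise grows with the product of the two table sizes; restricting pairs to rows that agree on a reliable column keeps the comparison step manageable.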

#### To Link or Not to Link
Similar to joins, record linkage is the act of linking data from different sources regarding the same entity. But unlike joins, record linkage does not require exact matches between different pairs of data, and instead can find close matches using string similarity. This is why record linkage is effective when the data sources share no common unique key, such as a unique identifier, that you can rely on when linking them.

#### Pairs of Restaurants

In [None]:
import recordlinkage

# Create an indexer object to find possible pairs
indexer = recordlinkage.Index()

# Block pairing on cuisine_type
indexer.block('cuisine_type')

# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)

#### Similar Restaurants

In [None]:
# Create a comparison object
comp_cl = recordlinkage.Compare()

In [None]:
# Find exact matches on city, cuisine_types
comp_cl.exact('city', 'city', label='city')
comp_cl.exact('cuisine_type', 'cuisine_type', label='cuisine_type')

# Find similar matches of rest_name
comp_cl.string('rest_name', 'rest_name', label='name', threshold = 0.8)

In [None]:
# Get potential matches and print
potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new)
print(potential_matches)

### Linking DataFrames
The lesson teaches how to link two DataFrames in Python, where the DataFrames are already compared and scored. The steps include isolating potential matches, extracting the index of the second DataFrame, subsetting the second DataFrame to remove duplicates with the first DataFrame, and finally appending the two DataFrames together. The lesson provides specific code examples and mentions a threshold of 0.85 for string similarity. The lesson ends with a call to practice linking DataFrames.

#### Linking them together!


In [None]:
# Isolate potential matches with row sum >=3
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]

# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)

# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]

# Append non_dup to restaurants
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
full_restaurants = pd.concat([restaurants, non_dup])
print(full_restaurants)

This is a congratulatory message for completing the course. The course taught about diagnosing and cleaning dirty data. The course had four chapters which covered basic data cleaning problems, categorical and text data problems, advanced data problems, and record linkage. The message encourages the learner to check out more courses, tracks, and projects in DataCamp's content library.