
# Cleaning Data in Python

## Common data problems
### Why do we need to clean data?
Dirty data can appear because of
1. duplicate values
2. misspellings
3. data type parsing errors
4. legacy systems

If the data is not properly cleaned during the exploration and processing phase, the insights and reports built on it will be compromised. Garbage in, garbage out.

### Data type constraints
Data type
- Text data: First name, Last name, address
- Integers: # Subscribers, # products sold
- Decimals: Temperature, Exchange rates
- Binary: Is married?, New customer?, yes/no
- Dates: Order dates, ship dates
- Categories: Marriage status, Gender

Python data type
- str
- int
- float
- bool
- datetime
- category

Get dataframe's data types.
```
df.dtypes
```

Get dataframe information
```
df.info()
```

str to int
```
# remove $ from revenue column
df['revenue'] = df['revenue'].str.strip('$')
# convert str to int
df['revenue'] = df['revenue'].astype('int')

# Make sure that the revenue col is now an int by using the assert statement
assert df['revenue'].dtype == 'int'

# sum
df['revenue'].sum()
```

Numeric or categorical?
```
# convert to categorical
df['marriage_status'] = df['marriage_status'].astype('category')
# return statistical summary
df.describe()
```
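To see why the dtype matters, here is a minimal sketch on made-up data (the values and their meaning are assumptions for illustration). While the column is stored as integers, .describe() reports a mean and quartiles that mean nothing for a category code. After the conversion it reports count, unique, top and freq instead.

```python
import pandas as pd

# Hypothetical codes: 0 = never married, 1 = married, 2 = divorced
df = pd.DataFrame({'marriage_status': [0, 1, 1, 2, 1, 0]})

# As integers, describe() returns numeric statistics such as mean and std
numeric_summary = df['marriage_status'].describe()
print(numeric_summary)

# As a category, describe() returns count, unique, top and freq
df['marriage_status'] = df['marriage_status'].astype('category')
categorical_summary = df['marriage_status'].describe()
print(categorical_summary)
```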

For example, assert 1+1 == 2 passes silently, while assert 1+1 == 3 raises an AssertionError.

*Convert to category and store the result in a new column*
```
# Print the information of ride_sharing
print(ride_sharing.info())

# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())

# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')

# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'

# Print new summary statistics
print(ride_sharing['user_type_cat'].describe())
```

*remove word "minutes" from column and convert from str to int*
```
# Strip duration of minutes
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip("minutes")

# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')

# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'

# Print formed columns and calculate average ride duration
print(ride_sharing[['duration','duration_trim','duration_time']])
print(ride_sharing['duration_time'].mean())
```
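One caveat worth checking: .str.strip("minutes") removes any of the characters m, i, n, u, t, e, s from both ends of the string, not the literal word. It works for values like "12 minutes", but removing the literal suffix is more explicit. A small sketch on made-up durations (the sample values are assumptions):

```python
import pandas as pd

# Hypothetical durations in the same "<number> minutes" shape as above
duration = pd.Series(['12 minutes', '7 minutes', '30 minutes'])

# strip() treats its argument as a set of characters, so a word made
# only of those letters disappears entirely
print('tennis'.strip('minutes'))  # prints an empty string

# Remove the literal suffix instead, then convert to integer
duration_trim = duration.str.replace(' minutes', '', regex=False)
duration_time = duration_trim.astype('int64')

print(duration_time.mean())
```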

## Data range constraints

Can future sign-ups exist? They shouldn't, so the filter below should return no rows.
```
# import datetime
import datetime as dt
today_date = dt.date.today()
user_signups[user_signups['subscription_date'] > today_date]
```

### How to deal with out of range data?
- Dropping data
- Setting custom minimums and maximums
- Treat as missing and impute
- Setting custom value depending on business assumptions.

Movie example
```
import pandas as pd
# Output movies with a rating > 5 (ratings should not exceed 5)
movies[movies['avg_rating'] > 5]

# There are 3 alternative ways to handle ratings > 5 (pick one)
# 1. drop values using filtering
movies = movies[movies['avg_rating'] <= 5]
# 2. drop values using .drop()
movies.drop(movies[movies['avg_rating'] > 5].index, inplace=True)
# 3. convert > 5 to 5
movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = 5
```
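The options above can be tried on a small made-up table (the movie names and ratings are assumptions). As a variation on option 3, this sketch caps the values with .clip(upper=5), which has the same effect as the .loc assignment.

```python
import pandas as pd

# Hypothetical ratings on a 1-5 scale, with two out-of-range values
movies = pd.DataFrame({
    'movie_name': ['A', 'B', 'C', 'D'],
    'avg_rating': [4.5, 6.0, 3.2, 9.0],
})

# Dropping: keep only rows that satisfy the range constraint
movies_dropped = movies[movies['avg_rating'] <= 5]

# Capping: set everything above the maximum to the maximum
movies_capped = movies.copy()
movies_capped['avg_rating'] = movies_capped['avg_rating'].clip(upper=5)

# Verify the range constraint in both results
assert movies_dropped['avg_rating'].max() <= 5
assert movies_capped['avg_rating'].max() <= 5
```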

### How to convert to datetime
```
import pandas as pd
import datetime as dt
# Convert to date
df['subscription_date'] = pd.to_datetime(df['subscription_date']).dt.date
```

Clean date
```
import datetime as dt
today_date = dt.date.today()
# Option 1: drop future dates using filtering
df = df[df['subscription_date'] <= today_date]
# Option 2: drop future dates using .drop()
df.drop(df[df['subscription_date'] > today_date].index, inplace=True)
```

Hardcode dates to an upper limit
```
## Replace dates in the future with today_date
df.loc[df['subscription_date'] > today_date, 'subscription_date'] = today_date
```
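Putting the date steps together, a minimal sketch on made-up sign-ups (the dates are assumptions, and one of them is deliberately built to lie 30 days in the future):

```python
import datetime as dt
import pandas as pd

today_date = dt.date.today()
future_date = today_date + dt.timedelta(days=30)

# Hypothetical sign-ups; the last one lies in the future
user_signups = pd.DataFrame({
    'subscription_date': ['2019-01-05', '2020-06-30', future_date.isoformat()]
})

# Parse the strings, then keep only the date part
user_signups['subscription_date'] = pd.to_datetime(
    user_signups['subscription_date']
).dt.date

# Cap future dates at today's date
in_future = user_signups['subscription_date'] > today_date
user_signups.loc[in_future, 'subscription_date'] = today_date

assert user_signups['subscription_date'].max() <= today_date
```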

Another example:
```
# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')

# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27

# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')

# Print tire size description
print(ride_sharing['tire_sizes'].describe())
```

Convert to datetime example:
```
# Convert ride_date to date
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date

# Save today's date
today = dt.date.today()

# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today

# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())
```

## Uniqueness constraints

What are duplicate values?
- All columns have the same values.
- Most columns have the same values.

Why do they happen?
1. Data entry and Human Error
2. Bugs and design errors
3. Join or merge errors

How to find duplicate values?
```
# get duplicate across all columns
duplicates = df.duplicated()
print(duplicates)
## output is a boolean Series: True marks a duplicated row
```

See exactly which rows are duplicated
```
# get duplicated rows
duplicates = df.duplicated()
df[duplicates]

## output shows all duplicated rows except their first occurrences.
```

### How to find duplicate rows?
The .duplicated() method
- subset: List of column names to check for duplication.
- keep: Whether to keep first ('first'), last ('last') or all (False) duplicate values.

```
# column names to check for duplication
col = ['first_name', 'last_name', 'address']
duplicates = df.duplicated(subset=col, keep=False)
# output duplicate values
df[duplicates].sort_values(by='first_name')
```
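The effect of keep is easiest to see on a tiny example. This is a sketch on made-up records (names and weights are assumptions) in which rows 0 and 2 describe the same person:

```python
import pandas as pd

# Hypothetical records: rows 0 and 2 share first_name and last_name
df = pd.DataFrame({
    'first_name': ['Ann', 'Bob', 'Ann', 'Cid'],
    'last_name': ['Lee', 'Ray', 'Lee', 'Fox'],
    'weight': [60, 80, 61, 75],
})
cols = ['first_name', 'last_name']

# keep='first' flags every repeat except the first occurrence
flag_first = df.duplicated(subset=cols, keep='first')
# keep='last' flags every repeat except the last occurrence
flag_last = df.duplicated(subset=cols, keep='last')
# keep=False flags every row that takes part in a duplication
flag_all = df.duplicated(subset=cols, keep=False)

print(df[flag_all].sort_values(by='first_name'))
```

Without subset, no row is flagged here, because the weight values differ.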

### How to treat duplicate values?
The .drop_duplicates() method
- subset: List of column names to check for duplication.
- keep: Whether to keep first ('first'), last ('last') or all (False) duplicate values.
- inplace: Drop duplicated rows directly inside the DataFrame without creating a new object (True).

```
# Drop duplicates
df.drop_duplicates(inplace=True)
```

After dropping complete duplicates, some partial duplicates may remain (first_name and last_name are the same, and only the weight column differs), so try combining them with .groupby() and .agg().

### Group by and Aggregation
The .groupby() and .agg() methods \
Group by a set of common columns and return statistical values for specific columns when the aggregation is being performed.
```
# Group by column name and produce statistical summaries
col = ['first_name', 'last_name', 'address']
# create dict
summaries = {'height': 'max', 'weight': 'mean'}
# we can have numbered indices in the final output by using .reset_index()
height_weight = height_weight.groupby(by=col).agg(summaries).reset_index()

# Make sure aggregation is done
duplicates = height_weight.duplicated(subset=col, keep=False)
height_weight[duplicates].sort_values(by='first_name')
```
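A runnable version of the same idea on made-up data (the people and measurements are assumptions): the two rows for the same person collapse into one, keeping the maximum height and the mean weight.

```python
import pandas as pd

# Hypothetical measurements: the first two rows are the same person
height_weight = pd.DataFrame({
    'first_name': ['Ann', 'Ann', 'Bob'],
    'last_name': ['Lee', 'Lee', 'Ray'],
    'address': ['1 Main St', '1 Main St', '9 Oak Ave'],
    'height': [170, 172, 180],
    'weight': [60, 62, 80],
})

col = ['first_name', 'last_name', 'address']
summaries = {'height': 'max', 'weight': 'mean'}

# One row per person, with the chosen statistic for each column
height_weight = height_weight.groupby(by=col).agg(summaries).reset_index()

# No partial duplicates should remain
duplicates = height_weight.duplicated(subset=col, keep=False)
assert not duplicates.any()
print(height_weight)
```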

### Exercise
Find duplicates
```
# Find duplicates
duplicates = ride_sharing.duplicated(subset='ride_id', keep=False)

# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')

# Print relevant columns of duplicated_rides
print(duplicated_rides[['ride_id','duration','user_birth_year']])
```

Drop duplicates
```
# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing.drop_duplicates()

# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': 'min', 'duration': 'mean'}

# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby(by='ride_id').agg(statistics).reset_index()

# Find duplicated values again
duplicates = ride_unique.duplicated(subset='ride_id', keep=False)
duplicated_rides = ride_unique[duplicates == True]

# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0
```

## Text and categorical data problems

### Categories and membership constraints
Predefined finite set of categories.
- Type of data: Household Income
- Example: 0-20k, 20-40k, ...
- Numeric representation: 0, 1, ...

### Why could we have these problems?
This could be due to
- data entry issues
- data parsing errors

### How do we treat these problems?
- Dropping data
- Remapping Categories
- Inferring Categories

Dropping data example
```
# Import data
study_data = pd.read_csv('study.csv')
study_data.dtypes

# the blood_type column is of type category but contains inconsistent values.
```

Finding inconsistent categories
```
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
print(inconsistent_categories)

## output will return all the values in blood_type that are not in categories (inconsistent values)
```

Dropping inconsistent categories
```
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories) # returns True or False for each row
inconsistent_data = study_data[inconsistent_rows]
# Drop inconsistent categories and get consistent data only
consistent_data = study_data[~inconsistent_rows]
```
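The whole membership check, end to end, as a sketch on made-up data (the names and the invalid 'Z+' value are assumptions for illustration):

```python
import pandas as pd

# Hypothetical reference table holding the only valid categories
categories = pd.DataFrame({
    'blood_type': ['O-', 'O+', 'A-', 'A+', 'B-', 'B+', 'AB-', 'AB+']
})

# Hypothetical study data with one value outside the valid set
study_data = pd.DataFrame({
    'name': ['Beth', 'Ignatius', 'Paul', 'Helen'],
    'blood_type': ['B-', 'A-', 'O+', 'Z+'],
})

# Values present in the data but absent from the reference table
inconsistent_categories = set(study_data['blood_type']).difference(
    categories['blood_type']
)

# Boolean mask of rows holding an invalid value
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
consistent_data = study_data[~inconsistent_rows]
print(consistent_data)
```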

### Exercise
```
# Print categories from DataFrame
print(categories)

# Print unique values of survey columns in airlines
print('Cleanliness: ', airlines['cleanliness'].unique(), "\n")
print('Safety: ', airlines['safety'].unique(), "\n")
print('Satisfaction: ', airlines['satisfaction'].unique(),"\n")
```

Find inconsistencies
```
# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])

# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)

# Print rows with inconsistent category
print(airlines[cat_clean_rows])

## output will be only the rows that contain an inconsistent value
```

Print rows with consistent categories only
```
# Print rows with consistent categories only
print(airlines[~cat_clean_rows])
```

## Categorical variables
### What type of errors could we have?
1. Value inconsistency
   - Inconsistent fields: 'married', 'Maried', 'UNMARRIED', 'not married', ...
   - Trailing white spaces: 'married ', ' married ', ...
2. Collapsing too many categories into few
   - Creating new groups: 0-20k, 20-40k categories ... from continuous household income data
   - Mapping groups to new ones: Mapping household income categories to 2 groups, 'rich' and 'poor'
3. Making sure data is of type category

### Value consistency
Capitalization: 'married', 'Married', 'UNMARRIED', 'unmarried' ..
```
# Get marriage status column
marriage_status = demographic['marriage_status']
marriage_status.value_counts() # .value_counts() is called on a Series here

# get value counts on DataFrame
demographic.groupby('marriage_status').count()
```

To deal with capitalization, convert every value to one case. \
Upper case
```
demographic['marriage_status'] = demographic['marriage_status'].str.upper()
demographic['marriage_status'].value_counts()
```
Lower case
```
demographic['marriage_status'] = demographic['marriage_status'].str.lower()
demographic['marriage_status'].value_counts()
```

If the column contains leading or trailing spaces: \
Strip all spaces
```
demographic['marriage_status'] = demographic['marriage_status'].str.strip()
demographic['marriage_status'].value_counts()
```
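Both fixes can be chained. A minimal sketch on made-up values (the sample strings are assumptions) that shows the number of distinct categories before and after cleaning:

```python
import pandas as pd

# Hypothetical column with mixed capitalization and stray spaces
demographic = pd.DataFrame({
    'marriage_status': ['married', ' Married', 'UNMARRIED ', 'unmarried', 'MARRIED']
})

# Before cleaning, every spelling counts as its own category
before = demographic['marriage_status'].value_counts()

# Strip surrounding spaces, then lower-case everything
demographic['marriage_status'] = (
    demographic['marriage_status'].str.strip().str.lower()
)
after = demographic['marriage_status'].value_counts()

print(len(before), 'categories before,', len(after), 'after')
```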

### Collapsing data into categories
Creating categories out of data: income_group column from income column. \
\
pd.qcut()
```
# Using qcut()
import pandas as pd
group_names = ['0-200k', '200k-500k', '500k+']
demographics['income_group'] = pd.qcut(demographics['household_income'], q=3, labels=group_names)

# Print income_group column
demographics[['income_group', 'household_income']]
```
\
pd.cut()
```
# Using cut() - create category range and names
import numpy as np
ranges = [0, 200000, 500000, np.inf] # np.inf makes the last range open-ended
# Create income group column
demographics['income_group'] = pd.cut(demographics['household_income'], bins=ranges, labels=group_names) # bins takes a list of cutoff points for each category
demographics[['income_group', 'household_income']]
```
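The two functions split the data differently, which a small made-up sample makes visible (the income values are assumptions). qcut() puts roughly the same number of rows in each group regardless of the label text, while cut() respects the stated boundaries.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes, mostly below 200k
demographics = pd.DataFrame({
    'household_income': [50_000, 120_000, 150_000, 180_000, 250_000, 900_000]
})
group_names = ['0-200k', '200k-500k', '500k+']

# qcut: three quantile-based groups, two rows each for this sample
demographics['by_quantile'] = pd.qcut(
    demographics['household_income'], q=3, labels=group_names
)

# cut: groups defined by the explicit boundaries
ranges = [0, 200_000, 500_000, np.inf]
demographics['by_range'] = pd.cut(
    demographics['household_income'], bins=ranges, labels=group_names
)

print(demographics)
```

With qcut() the labels are only names for the quantile groups, so an income of 250,000 ends up labelled '500k+' here. When the labels describe real ranges, cut() is the safer choice.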

Reduce the number of categories in our data. \
Mapping categories to fewer ones
- OS column is: 'Microsoft', 'MacOS', 'IOS', 'Android', 'Linux'
- OS column should become: 'DesktopOS', 'MobileOS'

```
# Create mapping dict and replace
mapping = {'Microsoft': 'DesktopOS', 'MacOS': 'DesktopOS', 'IOS': 'MobileOS', 'Android': 'MobileOS', 'Linux': 'DesktopOS'}
devices['OS'] = devices['OS'].replace(mapping)
devices['OS'].unique()
```

### Exercise
1
```
# Print unique values of both columns
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())

# Lower dest_region column and then replace "eur" with "europe"
airlines['dest_region'] = airlines['dest_region'].str.lower()
airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})

# Remove white spaces from `dest_size`
airlines['dest_size'] = airlines['dest_size'].str.strip()

# Verify changes have been effected
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())
```

2
```
# Create ranges for categories
label_ranges = [0, 60, 180, np.inf]
label_names = ['short', 'medium', 'long']

# Create wait_type column
airlines['wait_type'] = pd.cut(airlines['wait_min'], bins = label_ranges,
                                labels = label_names)

# Create mappings and replace
mappings = {'Monday':'weekday', 'Tuesday':'weekday', 'Wednesday': 'weekday',
            'Thursday': 'weekday', 'Friday': 'weekday',
            'Saturday': 'weekend', 'Sunday': 'weekend'}

airlines['day_week'] = airlines['day'].replace(mappings)
```

## Cleaning Text Data
Text data is one of the most common data types. \
Common problems
1. Data inconsistency: +96171679912 or 0096171679912 or ..?
2. Fixed length violations: Passwords need to be at least 8 characters
3. Typos: +961.71.679912

### Example
Fixing the phone number column
```
# replace "+" with "00" (regex=False treats "+" as a literal character)
phones['phone_number'] = phones['phone_number'].str.replace('+', '00', regex=False)
phones
```
\
```
# replace "-" with nothing
phones['phone_number'] = phones['phone_number'].str.replace('-', '')
phones
```
\
```
# replace phone numbers with fewer than 10 digits with NaN
digit = phones['phone_number'].str.len()
phones.loc[digit < 10, 'phone_number'] = np.nan
phones
```

We can use assert statements to test whether the phone number column has a specific length, and whether it still contains the symbols we removed.
```
# Find length of each row in Phone number column
sanity_check = phones['phone_number'].str.len()

# Assert min phone number length is 10
assert sanity_check.min() >= 10

# Assert all numbers do not have "+" or "-" ("+" is escaped because it is a regex metacharacter)
assert phones['phone_number'].str.contains(r"\+|-").any() == False
```
*Remember: assert returns nothing if the condition passes.*

### What about more complex examples?
e.g. +(01706)-25891, +0500-571437, +0800-1111

### Use regular expressions!
```
# Replace every non-digit character with nothing
phones['phone_number'] = phones['phone_number'].str.replace(r'\D+', '', regex=True)
phones.head()
```
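The regular expression approach can be checked end to end on the numbers listed above plus one longer made-up number (the fourth value is an assumption). Note that "+" has to be escaped in the .str.contains() pattern because it is a regex metacharacter.

```python
import numpy as np
import pandas as pd

# The three examples from the notes plus one hypothetical longer number
phones = pd.DataFrame({
    'phone_number': ['+(01706)-25891', '+0500-571437', '+0800-1111', '001-702-397-5143']
})

# \D matches any character that is not a digit
phones['phone_number'] = phones['phone_number'].str.replace(r'\D+', '', regex=True)

# Treat numbers shorter than 10 digits as missing
digits = phones['phone_number'].str.len()
phones.loc[digits < 10, 'phone_number'] = np.nan

# Sanity checks: no "+" or "-" left, and no remaining number is too short
assert not phones['phone_number'].str.contains(r'\+|-', na=False).any()
assert phones['phone_number'].str.len().min() >= 10
print(phones)
```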

### Exercise
1
```
# Replace "Dr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Dr.","")

# Replace "Mr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Mr.","")

# Replace "Miss" with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Miss","")

# Replace "Ms." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Ms.","")

# Assert that full_name has no honorifics
assert airlines['full_name'].str.contains(r'Ms\.|Mr\.|Miss|Dr\.').any() == False
```
2
```
# Store length of each row in survey_response column
resp_length = airlines['survey_response'].str.len()

# Find rows in airlines where resp_length > 40
airlines_survey = airlines[resp_length > 40]

# Assert minimum survey_response length is > 40
assert airlines_survey['survey_response'].str.len().min() > 40

# Print new survey_response column
print(airlines_survey['survey_response'])
```

# Advanced data problems
## Uniformity, Cross field validation and Dealing with missing data

### Uniformity
- Temp: 32C is also 89.6F
- Weight: 70Kg is also 11st.
- Date: 26-11-2019 is also 26, November, 2019

A scatter plot (for example with matplotlib's pyplot) can help spot values recorded in a different unit.

### Treating temperature data
Convert Fahrenheit to Celsius
```
temp_fah = temperatures.loc[temperatures['Temperature'] > 40, 'Temperature']
temp_cels = (temp_fah - 32) * (5/9)
temperatures.loc[temperatures['Temperature'] > 40, 'Temperature'] = temp_cels

# Assert to check conversion is correct
assert temperatures['Temperature'].max() < 40
```
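A runnable sketch of the conversion on made-up readings (the values are assumptions; 62.6 is the one reading recorded in Fahrenheit). The threshold of 40 follows the notes and assumes that no genuine Celsius reading in this data exceeds it.

```python
import pandas as pd

# Hypothetical readings in Celsius, with one value recorded in Fahrenheit
temperatures = pd.DataFrame({'Temperature': [16.0, 19.0, 62.6, 18.0, 14.0]})

# Anything above 40 is assumed to be Fahrenheit
is_fahrenheit = temperatures['Temperature'] > 40
temp_fah = temperatures.loc[is_fahrenheit, 'Temperature']

# C = (F - 32) * 5/9
temperatures.loc[is_fahrenheit, 'Temperature'] = (temp_fah - 32) * (5 / 9)

# Assert that the conversion removed all out-of-range values
assert temperatures['Temperature'].max() < 40
print(temperatures)
```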

### Treating date data
If a DataFrame contains a variety of date formats, such as 27/27/19, 03-29-19 and March 3rd, 2019:

```
birthdays['Birthday'] = pd.to_datetime(birthdays['Birthday'],
                        # Attempt to infer format of each date (deprecated since pandas 2.0)
                        infer_datetime_format=True,
                        # Return NA for rows where conversion failed
                        errors='coerce')
```

We can also convert the datetime format using .dt.strftime()
```
birthdays['Birthday'] = birthdays['Birthday'].dt.strftime("%d-%m-%Y")
birthdays.head()
```
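When the expected format is known, passing it explicitly avoids ambiguity between day-first and month-first dates. This is a sketch on made-up values (the dates and the Birthday_text column name are assumptions); the impossible date becomes NaT because of errors='coerce'.

```python
import pandas as pd

# Hypothetical birthdays in day-month-year order; the last one is impossible
birthdays = pd.DataFrame({'Birthday': ['29-03-2019', '26-11-2019', '27-27-2019']})

# Parse with an explicit format; rows that fail to parse become NaT
birthdays['Birthday'] = pd.to_datetime(
    birthdays['Birthday'], format='%d-%m-%Y', errors='coerce'
)

# Write the parsed dates back out as text in a different format
birthdays['Birthday_text'] = birthdays['Birthday'].dt.strftime('%Y-%m-%d')
print(birthdays)
```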

### Exercise
```
# Print the header of account_opened
print(banking['account_opened'].head())

# Convert account_opened to datetime
banking['account_opened'] = pd.to_datetime(banking['account_opened'],
                                           # Infer datetime format
                                           infer_datetime_format = True,
                                           # Return missing value for error
                                           errors = 'coerce')

# Get year of account opened. Only pull year from col 'account_opened'
banking['acct_year'] = banking['account_opened'].dt.strftime('%Y')

# Print acct_year
print(banking['acct_year'])
```

### Cross field validation
Cross field validation uses multiple fields in a dataset to sanity check data integrity.
```
# sum the class columns row by row (axis=1)
sum_classes = flights[['economy_class', 'business_class', 'first_class']].sum(axis=1)
passenger_equ = sum_classes == flights['total_passenger']
# find and filter out rows with inconsistent passenger total
inconsistent_pass = flights[~passenger_equ]
consistent_pass = flights[passenger_equ]
```
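The same check on a small made-up table (the flight numbers and counts are assumptions), where the last row's total does not match the sum of its classes:

```python
import pandas as pd

# Hypothetical flights; the last total is inconsistent with its parts
flights = pd.DataFrame({
    'flight_number': ['DL14', 'BA248', 'MEA124'],
    'economy_class': [100, 130, 100],
    'business_class': [60, 100, 50],
    'first_class': [40, 70, 50],
    'total_passenger': [200, 300, 300],
})

# Row-wise sum of the three class columns
sum_classes = flights[['economy_class', 'business_class', 'first_class']].sum(axis=1)
passenger_equ = sum_classes == flights['total_passenger']

# Split into inconsistent and consistent rows
inconsistent_pass = flights[~passenger_equ]
consistent_pass = flights[passenger_equ]
print(inconsistent_pass)
```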

Another example containing user_id, birthday and age values.
```
import pandas as pd
import datetime as dt

# Convert to datetime and get today's date
users['Birthday'] = pd.to_datetime(users['Birthday'])
today = dt.date.today()

# For each row in the Birthday column, calculate year difference
age_manual = today.year - users['Birthday'].dt.year

# Find instances where ages match
age_equ = age_manual == users['Age']

# Find and filter out rows with inconsistent age
inconsistent_age = users[~age_equ]
consistent_age = users[age_equ]
```
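One limitation of the year difference is that it ignores whether the birthday has already happened this year, so it can be off by one. A possible refinement, as a sketch on made-up birthdays with a fixed reference date (both are assumptions chosen so the result is reproducible):

```python
import datetime as dt
import pandas as pd

# Hypothetical users born in the same year, early and late in the year
users = pd.DataFrame({'Birthday': pd.to_datetime(['1990-01-15', '1990-12-31'])})

# Fixed reference date instead of dt.date.today(), for reproducibility
today = dt.date(2024, 6, 1)

# Plain year difference, as in the notes above
year_diff = today.year - users['Birthday'].dt.year

# True when the birthday has already occurred in the reference year
had_birthday = (users['Birthday'].dt.month < today.month) | (
    (users['Birthday'].dt.month == today.month)
    & (users['Birthday'].dt.day <= today.day)
)

# Subtract one year for users whose birthday is still to come
age_exact = year_diff - (~had_birthday).astype(int)
print(age_exact.tolist())
```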

### Exercise
```
# Store today's date and find ages
today = dt.date.today()
ages_manual = today.year - banking['birth_date'].dt.year

# Find rows where age column == ages_manual
age_equ = banking['age'] == ages_manual

# Store consistent and inconsistent data
consistent_ages = banking[age_equ]
inconsistent_ages = banking[~age_equ]

# Print the number of inconsistent ages
print("Number of inconsistent ages: ", inconsistent_ages.shape[0])
```

### Completeness
### What is missing data?
Missing data occurs when no data value is stored for a variable in an observation. It can be represented as NA, nan, 0, or '.'.

```
# To find missing value
df.isna()

# Get summary of missingness
df.isna().sum()
```

Or use missingno to visualize missing values
```
import missingno as msno
import matplotlib.pyplot as plt

# Visualize missingno
msno.matrix(df)
plt.show()

## the plot shows how missing values are distributed across each column.
```
Isolate missing and complete values
```
# Split rows by whether col1 is missing
missing = df[df['col1'].isna()]
complete = df[~df['col1'].isna()]

# Then use .describe() method
missing.describe()
complete.describe()
```

Confirm by sorting and visualizing with missingno
```
sorted_data = df.sort_values(by='col2')
msno.matrix(sorted_data)
plt.show()
```

### How to deal with missing data?
Simple approaches:
1. Drop missing data
2. Impute with statistical measures (mean, median, mode)

More complex approaches:
1. Impute using an algorithmic approach
2. Impute with machine learning models

Dropping missing values
```
df_dropped = df.dropna(subset=['col1'])
df_dropped.head()
```
Replacing missing values with statistical measures
```
col1_mean = df['col1'].mean()
df_imputed = df.fillna({'col1': col1_mean})
df_imputed.head()
```
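Both simple approaches on a tiny made-up frame (the values are assumptions), so the effect of each is visible:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values in col1
df = pd.DataFrame({
    'col1': [10.0, np.nan, 30.0, np.nan],
    'col2': [1, 2, 3, 4],
})

# Count missing values per column
print(df.isna().sum())

# Approach 1: drop rows where col1 is missing
df_dropped = df.dropna(subset=['col1'])

# Approach 2: impute missing col1 values with the column mean
col1_mean = df['col1'].mean()
df_imputed = df.fillna({'col1': col1_mean})
print(df_imputed)
```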

### Exercise
1
```
# Print number of missing values in banking
print(banking.isna().sum())

# Visualize missingness matrix
msno.matrix(banking)
plt.show()

# Isolate missing and non missing values of inv_amount
missing_investors = banking[banking['inv_amount'].isna()]
investors = banking[~banking['inv_amount'].isna()]

# Sort banking by age and visualize
banking_sorted = banking.sort_values(by='age')
msno.matrix(banking_sorted)
plt.show()
```
2
```
# Drop missing values of cust_id
banking_fullid = banking.dropna(subset = ['cust_id'])

# Compute estimated acct_amount
acct_imp = banking_fullid['inv_amount'] * 5

# Impute missing acct_amount with corresponding acct_imp
banking_imputed = banking_fullid.fillna({'acct_amount':acct_imp})

# Print number of missing values
print(banking_imputed.isna().sum())
```

## Record linkage
Record linkage is a powerful technique for merging multiple datasets when values have typos or different spellings.

### Simple string comparison
```
from thefuzz import fuzz

# compare reading vs reeding
fuzz.WRatio('Reeding', 'Reading')
```
\
What if we have many inconsistent categories? String similarity lets us remap them easily.
\
\
Let's say we have a DataFrame named survey containing answers from respondents from the states of New York and California, asking how likely they are to move on a scale of 0 to 5. The state field was free text and contains hundreds of typos, so we'll use string similarity.
\
We also have a categories DataFrame containing the correct category for each state.

```
from thefuzz import process

# For each state category
for state in categories['state']:

  # Find potential matches in states with typos
  matches = process.extract(state, survey['state'], limit=survey.shape[0])

  # For each potential match
  for potential_match in matches:

    # If high similarity score
    if potential_match[1] >= 80:

      # Replace typo with correct category
      survey.loc[survey['state'] == potential_match[0], 'state'] = state
```
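To try the remapping loop without installing anything, the sketch below swaps thefuzz for difflib.SequenceMatcher from the standard library. This is a stand-in, not the thefuzz scorer, so the scores will differ from process.extract. The similarity helper, the sample states and the typos are all assumptions for illustration.

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical correct categories and free-text answers with typos
categories = pd.DataFrame({'state': ['California', 'New York']})
survey = pd.DataFrame({
    'state': ['California', 'Calefornia', 'Califronia',
              'New York', 'New Yrok', 'Texas']
})


def similarity(a, b):
    """Hypothetical helper: 0-100 score from difflib, not a thefuzz ratio."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


# For each correct category, remap every sufficiently similar spelling
for state in categories['state']:
    for value in survey['state'].unique():
        if similarity(state, value) >= 80:
            survey.loc[survey['state'] == value, 'state'] = state

print(survey['state'].unique())
```

Values that resemble no category, like 'Texas' here, are left untouched, so the threshold decides how aggressive the remapping is.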

### Exercise
```
# Import process from thefuzz
from thefuzz import process

# Store the unique values of cuisine_type in unique_types
unique_types = restaurants['cuisine_type'].unique()

# Calculate similarity of 'asian' to all values of unique_types
print(process.extract('asian', unique_types, limit = len(unique_types)))

# Calculate similarity of 'american' to all values of unique_types
print(process.extract('american', unique_types, limit = len(unique_types)))

# Calculate similarity of 'italian' to all values of unique_types
print(process.extract('italian', unique_types, limit = len(unique_types)))
```
Remapping categories II
1
```
# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian', restaurants['cuisine_type'], limit=restaurants.shape[0])

# Inspect the first 5 matches
print(matches[0:5])
```
2
```
# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian', restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))
# Iterate through the list of matches to italian
for match in matches:
  # Check whether the similarity score is greater than or equal to 80
  if match[1] >= 80:
    # Select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
    restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = 'italian'
```
3 \
You have access to a categories list containing the correct cuisine types ('italian', 'asian', and 'american').
```
# Iterate through categories
for cuisine in categories:
  # Create a list of matches, comparing cuisine with the cuisine_type column
  matches = process.extract(cuisine, restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))

  # Iterate through the list of matches
  for match in matches:
    # Check whether the similarity score is greater than or equal to 80
    if match[1] >= 80:
      # If it is, select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
      restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = cuisine

# Inspect the final result
print(restaurants['cuisine_type'].unique())
```

## Record Linkage
Record linkage is the act of linking data from different sources regarding the same entity.\
Generally, we clean two or more DataFrames, generate pairs of potentially matching records, score these pairs according to string similarity and other similarity metrics, and link them, using the *recordlinkage* package.

```
# import recordlinkage
import recordlinkage

# Creating indexing object
indexer = recordlinkage.Index()

# Generate pairs blocked on state col
indexer.block('state')
pairs = indexer.index(census_A, census_B)

## output is a pandas MultiIndex object containing pairs of row indices from both DataFrames
```
Find potential matches
```
# Generate pairs
pairs = indexer.index(census_A, census_B)

# Create a compare object
compare_cl = recordlinkage.Compare()

# Find exact matches for pairs of date_of_birth and state
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('state', 'state', label='state')

# Find similar matches for pairs of surname and address_1 using string similarity
compare_cl.string('surname', 'surname', threshold=0.85, label='surname')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')

# Find matches
potential_matches = compare_cl.compute(pairs, census_A, census_B)

## output is a multi-index DataFrame. Columns are the compared fields; value 1 = match, 0 = not a match
```
Filter the potential matches to pairs that match on at least 2 columns.
```
potential_matches[potential_matches.sum(axis=1) >= 2]
```

### Exercise
```
# Create an indexer object and find possible pairs
indexer = recordlinkage.Index()
# Block pairing on cuisine_type
indexer.block('cuisine_type')

# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)

# Create a comparison object
comp_cl = recordlinkage.Compare()

# Find exact matches on city and cuisine_type
comp_cl.exact('city', 'city', label='city')
comp_cl.exact('cuisine_type', 'cuisine_type', label='cuisine_type')

# Find similar matches of rest_name
comp_cl.string('rest_name', 'rest_name', label='name', threshold = 0.8)

# Get potential matches and print
potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new)
print(potential_matches)

# Use 3 because a match is required in all three compared columns.
potential_matches[potential_matches.sum(axis = 1) >= 3]
```

## Linking DataFrames
We've already generated pairs and compared 4 columns: 2 for exact matches and 2 for string similarity. Finally, we found potential matches. Now it's time to link both census DataFrames.
\
Get the indices
```
matches.index

# Get indices from census_B only
duplicate_rows = matches.index.get_level_values(1)
print(duplicate_rows)
```

```
# Find duplicates in census_B
census_B_duplicates = census_B[census_B.index.isin(duplicate_rows)]

# Find new rows in census_B (non-duplicates)
census_B_new = census_B[~census_B.index.isin(duplicate_rows)]

# Link the DataFrames (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
full_census = pd.concat([census_A, census_B_new])
```
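The linking step only needs the MultiIndex of matched pairs, so it can be sketched with plain pandas. Here a hand-built potential_matches table stands in for the output of the recordlinkage comparison; the surnames, index labels and scores are assumptions for illustration.

```python
import pandas as pd

# Hypothetical census tables with distinct index labels
census_A = pd.DataFrame({'surname': ['Smith', 'Jones']}, index=['a1', 'a2'])
census_B = pd.DataFrame({'surname': ['Smyth', 'Brown']}, index=['b1', 'b2'])

# Stand-in for the comparison output: one row per candidate pair, 1 = match
potential_matches = pd.DataFrame(
    {'date_of_birth': [1, 0], 'state': [1, 1], 'surname': [1, 0]},
    index=pd.MultiIndex.from_tuples([('a1', 'b1'), ('a2', 'b2')]),
)

# Keep pairs that match on all 3 columns
matches = potential_matches[potential_matches.sum(axis=1) >= 3]

# The second index level holds the census_B row labels
duplicate_rows = matches.index.get_level_values(1)

# Keep only census_B rows that are not already in census_A
census_B_new = census_B[~census_B.index.isin(duplicate_rows)]

# Stack the two tables
full_census = pd.concat([census_A, census_B_new])
print(full_census)
```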

Full code for building the full census result
```
# import pandas and recordlinkage
import pandas as pd
import recordlinkage

# Creating indexing object
indexer = recordlinkage.Index()

# Generate pairs blocked on state col
indexer.block('state')

full_pairs = indexer.index(census_A, census_B)

# Create a compare object
compare_cl = recordlinkage.Compare()

# Find exact matches for pairs of date_of_birth and state
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('state', 'state', label='state')

# Find similar matches for pairs of surname and address_1 using string similarity
compare_cl.string('surname', 'surname', threshold=0.85, label='surname')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')

# Generate potential matches
potential_matches = compare_cl.compute(full_pairs, census_A, census_B)

# Isolate matches with matching values for 3 or more cols
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]

# Get index for matching census_B rows only
duplicate_rows = matches.index.get_level_values(1)

# Find new rows in census_B (non-duplicates)
census_B_new = census_B[~census_B.index.isin(duplicate_rows)]

# Link the DataFrames (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
full_census = pd.concat([census_A, census_B_new])
```

### Exercise
```
# Isolate potential matches with row sum >=3
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]

# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)

# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]

# Append non_dup to restaurants (DataFrame.append was removed in pandas 2.0)
full_restaurants = pd.concat([restaurants, non_dup])
print(full_restaurants)
```

In [3]:
pip install thefuzz


In [4]:
from thefuzz import fuzz

# compare reading vs reeding
fuzz.WRatio('Reeding', 'Reading')

86

In [5]:
fuzz.WRatio('Houston Rockets', 'Rockets')

90

In [6]:
fuzz.WRatio('Houston Rockets vs Los Angeles Lakers', 'Lakers vs Rockets')

86