# Lesson I

## Comparing Strings

We'll discover the world of record linkage. But before we dive deep into record linkage, let's sharpen our understanding of string similarity and minimum edit distance.

### Minimum edit distance

*Minimum edit distance* is a systematic way to measure how close two strings are. 

For example, take the two words **intention** and **execution**. The minimum edit distance between them is the smallest number of steps that gets us from the word intention to execution, with the available operations being:

* Insertion
* Deletion
* Substitution
* Transposition

To get from **intention** to **execution**:

* Delete "I" from intention
* Add "C" between "E" and "N"
* Substitute the first "N" with "E"
* Substitute "T" with "X"
* Substitute "N" with "U"

***That gives a minimum edit distance of 5!***

The lower the edit distance, the closer two words are. For example, the typo **reeding** has a minimum edit distance of 1 from **reading**.
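To make the counting concrete, here is a minimal sketch of the classic dynamic-programming computation of Levenshtein distance, where insertion, deletion, and substitution each cost 1. The function name `levenshtein` is only an illustrative choice and is not taken from any package used in this lesson.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of insertions, deletions and substitutions (cost 1 each)."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]  # turning i characters into an empty string takes i deletions
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("intention", "execution"))  # 5
print(levenshtein("reeding", "reading"))      # 1
```

Only two rows of the table are kept in memory at a time, which is enough when we need the distance but not the list of edits.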

#### Minimum edit distance Algorithms

There is a variety of algorithms based on edit distance. They differ in which operations they use, how much weight they attribute to each operation, which types of strings they're suited for, and more, and a variety of packages implement them.

| Algorithm | Operations|
|-----------|-----------|
| Damerau-Levenshtein | insertion, substitution, deletion, transposition |
| **Levenshtein** | **insertion, substitution, deletion** |
| Hamming | substitution only |
| Jaro distance | transposition only |
| ... | ... |
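The effect of the allowed operations shows up on a typo with two swapped letters. The sketch below is an illustration under stated assumptions: `hamming` and `edit_distance` are hypothetical helper names, and the transposition option implements the restricted "optimal string alignment" variant of Damerau-Levenshtein rather than the full algorithm.

```python
def hamming(a: str, b: str) -> int:
    """Substitution only, so both strings must have the same length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str, transpositions: bool = False) -> int:
    """Levenshtein distance, optionally allowing adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # i deletions
    for j in range(cols):
        dist[0][j] = j  # j insertions
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Swapping two adjacent characters counts as a single edit
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


typo, word = "raeding", "reading"
print(hamming(typo, word))                             # 2
print(edit_distance(typo, word))                       # 2
print(edit_distance(typo, word, transpositions=True))  # 1
```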

**Possible packages:**
* ``nltk``
* ``fuzzywuzzy``
* ``textdistance``

For this lesson, we'll compare strings using *Levenshtein* distance, since it's the most general form of string matching, by way of the **fuzzywuzzy** package.

### Simple String Comparison

``fuzzywuzzy`` is a simple-to-use package for string comparison. 
We first import ``fuzz`` with ``from fuzzywuzzy import fuzz``, which allows us to compare single strings. Here we use *fuzz*'s ``WRatio()`` function to compute the similarity between reading and its typo, passing each string as an argument. 
For any comparison function in ``fuzzywuzzy``, the output is a score from **0** to **100**, with 0 being not similar at all and 100 being an exact match. 

Do not confuse this with the minimum edit distance score earlier, where a lower minimum edit distance means a closer match.
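To see the direction of a similarity score without installing anything, here is a rough stand-in built on the standard library's `difflib`. It is an assumption-laden sketch: `similarity_score` is a hypothetical helper, and it is not how `WRatio()` is computed internally, although it happens to agree with it on this pair.

```python
from difflib import SequenceMatcher


def similarity_score(a: str, b: str) -> int:
    """Scale difflib's 0-1 similarity ratio to a 0-100 score (higher = closer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


print(similarity_score("Reeding", "Reading"))  # 86
print(similarity_score("Reading", "Reading"))  # 100
print(similarity_score("abc", "xyz"))          # 0
```

Unlike an edit distance, this score grows as the strings get closer.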

```python
# Let us compare two strings
from fuzzywuzzy import fuzz

# Compare reeding vs reading
fuzz.WRatio('Reeding', 'Reading')
```

```
86
```

#### Partial strings and different orderings

The ``WRatio()`` function is highly robust against partial string comparison with different orderings. 
For example, here we compare the strings *Houston Rockets* and *Rockets*, and still receive a high similarity score.

```python
# Partial string comparison
fuzz.WRatio('Houston Rockets', 'Rockets')
```

```
90
```

The same can be said for the strings *Houston Rockets vs Los Angeles Lakers* and *Lakers vs Rockets*, where the team names are only partial and they are differently ordered.

```python
# Partial string comparison with different order
fuzz.WRatio('Houston Rockets vs Los Angeles Lakers', 'Lakers vs Rockets')
```

```
86
```
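One reason a scorer can ignore word order is token sorting: split both strings into words, sort them, and compare the results. The sketch below illustrates only that single idea, with `difflib` as a stand-in scorer and hypothetical helper names. `WRatio()` blends several strategies, so its scores will differ from these.

```python
from difflib import SequenceMatcher


def plain_score(a: str, b: str) -> int:
    """Character-level similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_score(a: str, b: str) -> int:
    """Compare the strings after lowercasing and sorting their words."""
    def normalize(text: str) -> str:
        return " ".join(sorted(text.lower().split()))
    return plain_score(normalize(a), normalize(b))


first, second = "Rockets vs Lakers", "Lakers vs Rockets"
print(plain_score(first, second))       # well below 100: the word order differs
print(token_sort_score(first, second))  # 100: same words once sorted
```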

#### Comparison with arrays

We can also compare a string with an *array* of strings by using the ``extract()`` function from the ``process`` module of ``fuzzywuzzy``. 
``extract()`` takes in a *string*, *an array of strings*, and *the number of possible matches to return, ranked from highest to lowest*. 
When the choices are a ``pandas`` Series, as below, it *returns* a **list of tuples with 3 elements**: 
* the first one being the matching string, 
* the second one being its similarity score, 
* and the third one being its index in the array.

With a plain list or NumPy array of choices, each tuple holds only the match and its score.

```python
# Import process
from fuzzywuzzy import process
import pandas as pd

# Define string and array of possible matches
string = "Houston Rockets vs Los Angeles Lakers"
choices = pd.Series(["Rockets vs Lakers", "Lakers vs Rockets",
                     "Houston vs Los Angeles", "Heat vs Bulls"])

process.extract(string, choices, limit=2)
```

```
[('Rockets vs Lakers', 86, 0), ('Lakers vs Rockets', 86, 1)]
```
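The ranking logic behind a call like this can be sketched with the standard library alone. This is a simplified illustration, not the package's implementation: `score` and `extract_top` are hypothetical names, and the scores come from `difflib` rather than from `WRatio()`.

```python
import heapq
from difflib import SequenceMatcher


def score(a: str, b: str) -> int:
    """Case-insensitive similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def extract_top(query: str, choices: list, limit: int = 2) -> list:
    """Return the `limit` best (choice, score, position) tuples, best first."""
    scored = ((choice, score(query, choice), position)
              for position, choice in enumerate(choices))
    return heapq.nlargest(limit, scored, key=lambda item: item[1])


choices = ["Rockets vs Lakers", "Lakers vs Rockets",
           "Houston vs Los Angeles", "Heat vs Bulls"]
best = extract_top("Heat vs Bulls", choices, limit=2)
print(best[0])  # ('Heat vs Bulls', 100, 3)
```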

### Collapsing Categories with string similarity

In chapter 2, we learned that collapsing data into categories is an essential aspect of working with categorical and text data, and we saw how to manually replace categories in a column of a DataFrame. But what if we had so many inconsistent categories that a manual replacement is simply not feasible? We can easily do that with string similarity!

* Use ``.replace()`` to collapse ``"eur"`` into ``"Europe"``

* What if there are too many variations?
    - ``"EU", "eur", "Europ", "Europa", "Erope", "Evropa"`` ...

Say we have a DataFrame named ``survey`` containing answers from respondents from the states of New York and California, asking them how likely they are to move on a scale of 0 to 5. 

```python
print(survey['state'].unique())
```

<img src='pictures/survey.jpg' width=250 />

The state field was free text and contains hundreds of typos. Remapping them manually would take a huge amount of time. Instead, we'll use string similarity. 

We also have a ``categories`` DataFrame containing the correct categories for each state (``'California'`` and ``'New York'``). Let's collapse the incorrect categories with string matching!

#### Collapsing all of the states

We first create a *for loop* iterating over each correctly typed state in the ``categories`` DataFrame. 

For each state, we find its matches in the ``state`` column of the ``survey`` DataFrame, returning all possible matches by setting the ``limit`` argument of ``extract()`` to the length of the ``survey`` DataFrame.

Then we iterate over each potential match, keeping only the ones with a similarity score greater than or equal to 80 with an ``if`` statement. 

Then, for each of those returned strings, we replace it with the correct state using the ``.loc`` accessor.

```python
# For each correct category
for state in categories['state']:
    # Find potential matches in states with typos
    matches = process.extract(state, survey['state'], limit = survey.shape[0])
    # For each potential match
    for potential_match in matches:
        # If high similarity score
        if potential_match[1] >= 80:
            # Replace typo with correct category
            survey.loc[survey['state'] == potential_match[0], 'state'] = state
```
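The same collapsing idea can be tried on a plain list without any third-party package. The sketch below is a simplified variant, not the loop above: it goes from each answer to its best category instead of from each category to its matches, uses `difflib` as a stand-in scorer, and the survey answers are made up for illustration.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> int:
    """Case-insensitive similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def collapse(value: str, categories: list, cutoff: int = 80) -> str:
    """Map value to its closest category, or keep it if nothing is close enough."""
    best = max(categories, key=lambda category: score(value, category))
    return best if score(value, best) >= cutoff else value


categories = ["California", "New York"]
answers = ["Californa", "new yrok", "California", "NewYork", "Texas"]  # made-up data

cleaned = [collapse(answer, categories) for answer in answers]
print(cleaned)
# ['California', 'New York', 'California', 'New York', 'Texas']
```

Values that stay below the cutoff for every category, like `'Texas'` here, are left untouched so they can be reviewed by hand.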

### Record Linkage

<img src='pictures/recordlinkage.jpg' />

Record linkage uses string similarity to join data sources that contain fuzzy duplicate values, so that we end up with a final DataFrame with no duplicates. We'll cover record linkage in more detail in the next couple of lessons.

## Exercise 

### The cutoff point

In this exercise, and throughout this chapter, you'll be working with the ``restaurants`` DataFrame which has data on various restaurants. Your ultimate goal is to create a restaurant recommendation engine, but you need to first clean your data.

This version of restaurants has been collected from many sources, where the ``cuisine_type`` column is riddled with typos, and should contain only *italian, american and asian* cuisine types. There are so many unique categories that remapping them manually isn't scalable, and it's best to use string similarity instead.

Before doing so, you want to establish the cutoff point for the similarity score using *fuzzywuzzy*'s ``process.extract()`` function, by finding the similarity score of the most distant typo of each category.

```python
# Restaurant dataset
restaurant = pd.read_csv('datasets/restaurants_L2_dirty.csv')

# Import process from fuzzywuzzy
from fuzzywuzzy import process

# Store the unique values of the cuisine type column ('type') in unique_types
unique_types = restaurant['type'].unique()

# Calculate similarity of 'asian' to all values of unique_types
print(process.extract('asian', unique_types, limit=len(unique_types)))

# Calculate similarity of 'american' to all values of unique_types
print(process.extract('american', unique_types, limit=len(unique_types)))

# Calculate similarity of 'italian' to all values of unique_types
print(process.extract('italian', unique_types, limit=len(unique_types)))
```

```
[('asian', 100), ('indonesian', 72), ('italian', 67), ('russian', 67), ('american', 62), ('californian', 54), ('japanese', 54), ('mexican/tex-mex', 54), ('american ( new )', 54), ('mexican', 50), ('pizza', 40), ('cajun/creole', 36), ('diners', 36), ('middle eastern', 36), ('vietnamese', 36), ('pacific new wave', 36), ('fast food', 36), ('continental', 36), ('seafood', 33), ('chicken', 33), ('chinese', 33), ('hamburgers', 27), ('steakhouses', 25), ('southern/soul', 22), ('delis', 20), ('hot dogs', 18), ('coffee shops', 18), ('noodle shops', 18), ('health food', 18), ('eclectic', 18), ('coffeebar', 18), ('french ( new )', 18), ('desserts', 18)]
[('american', 100), ('american ( new )', 90), ('mexican', 80), ('mexican/tex-mex', 68), ('asian', 62), ('californian', 53), ('italian', 53), ('russian', 53), ('middle eastern', 45), ('pacific new wave', 45), ('hamburgers', 44), ('indonesian', 44), ('chicken', 40), ('japanese', 38), ('eclectic', 38), ('delis', 36), ('pizza', 36), ('southern/soul', 
```
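The search for a cutoff in this exercise amounts to scoring the known typos of a category and taking the lowest score. Here is a minimal sketch of that reasoning, assuming a hand-labelled list of typos. The typos are invented for illustration, `score` is a hypothetical helper, and `difflib` stands in for the package's scorer, so the numbers are not comparable with the output above.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> int:
    """Similarity on a 0-100 scale, using difflib as a stand-in scorer."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Hypothetical, hand-labelled typos of one category
known_typos = {"american": ["amurican", "americen", "ameerican"]}

for category, typos in known_typos.items():
    scores = {typo: score(category, typo) for typo in typos}
    # The most distant typo sets the lowest score we still want to accept
    cutoff = min(scores.values())
    print(category, scores, "cutoff:", cutoff)

# A different category should ideally fall below that cutoff
print(score("american", "mexican"))
```

If a genuinely different category scores at or above the cutoff, it will be remapped too, so checking the nearest non-typo against the cutoff is worthwhile.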

### Remapping Categories II

In the last exercise, you determined that the similarity cutoff point for remapping typos of *'american', 'asian', and 'italian'* cuisine types stored in the ``cuisine_type`` column should be **80**.

In this exercise, you're going to put it all together by finding matches with similarity scores equal to or higher than 80 using **fuzzywuzzy.process**'s ``extract()`` function for each correct cuisine type, and replacing these matches with it. Remember, when comparing a string with an array of strings using ``process.extract()``, the output is a list of tuples where each is formatted like:

```python
(closest match, similarity score, index of match)
```

```python
# Categories array
import numpy as np
categories_np = np.array(['american', 'asian', 'italian'])
# Categories DataFrame
categories = pd.DataFrame(categories_np, columns=['type'])

# Inspect the unique values of the cuisine type column ('type')
print(restaurant['type'].unique())
```

```
['american' 'californian' 'japanese' 'cajun/creole' 'hot dogs' 'diners'
 'delis' 'hamburgers' 'seafood' 'italian' 'coffee shops' 'russian'
 'steakhouses' 'mexican/tex-mex' 'noodle shops' 'mexican' 'middle eastern'
 'asian' 'vietnamese' 'health food' 'american ( new )' 'pacific new wave'
 'indonesian' 'eclectic' 'chicken' 'fast food' 'southern/soul' 'coffeebar'
 'continental' 'french ( new )' 'desserts' 'chinese' 'pizza']
```


```python
# Create a list of matches, comparing 'italian' with the cuisine type column ('type')
matches = process.extract('italian', restaurant['type'], limit=len(restaurant['type']))

# Inspect the first 5 matches
print(matches[0:5])
```

```
[('italian', 100, 14), ('italian', 100, 21), ('italian', 100, 47), ('italian', 100, 57), ('italian', 100, 73)]
```


```python
# Iterate through the list of matches to italian
for match in matches:
    # Check whether the similarity score is greater than or equal to 80
    if match[1] >= 80:
        # Select all rows where the cuisine type is spelled this way, and set them to the correct cuisine
        restaurant.loc[restaurant['type'] == match[0], 'type'] = 'italian'
```

```python
# Iterate through the correct cuisine names; looping over the DataFrame
# itself would only yield its column label, 'type'
for cuisine in categories['type']:
    # Create a list of matches, comparing the cuisine with the type column
    matches = process.extract(cuisine, restaurant['type'], limit=len(restaurant['type']))

    # Iterate through the list of matches
    for match in matches:
        # Check whether the similarity score is greater than or equal to 80
        if match[1] >= 80:
            # Select all rows where the type is spelled this way, and set them to the correct cuisine
            restaurant.loc[restaurant['type'] == match[0], 'type'] = cuisine

# Inspect the final result
print(restaurant['type'].unique())
```

The output below was recorded with the loop running over ``categories`` itself, so nothing was replaced. With the loop over ``categories['type']``, values scoring 80 or higher, such as ``'american ( new )'``, collapse into their category.

```
['american' 'californian' 'japanese' 'cajun/creole' 'hot dogs' 'diners'
 'delis' 'hamburgers' 'seafood' 'italian' 'coffee shops' 'russian'
 'steakhouses' 'mexican/tex-mex' 'noodle shops' 'mexican' 'middle eastern'
 'asian' 'vietnamese' 'health food' 'american ( new )' 'pacific new wave'
 'indonesian' 'eclectic' 'chicken' 'fast food' 'southern/soul' 'coffeebar'
 'continental' 'french ( new )' 'desserts' 'chinese' 'pizza']
```


# Lesson II

## Generating Pairs

At the end of the last video exercise, we saw how record linkage attempts to join data sources with fuzzy duplicate values. 

<img src='pictures/recordlinkage.jpg' />

For example, here are two DataFrames containing NBA games and their schedules. They've both been scraped from different sites, and we want to merge them into one DataFrame containing all unique games.

We see that there are duplicate values in both DataFrames with different naming, marked here in red, and non-duplicate values, marked here in green. Since there are games happening at the same time, no common unique identifier between the DataFrames, and differently named events, a regular **join** or **merge** will not work. This is where **record linkage** comes in.

### Record Linkage

Record linkage is the act of linking data from different sources regarding the same entity. 

* Generally, we clean *two or more* DataFrames, 
* *Generate* pairs of potentially matching records, 
* *Score(Compare)* these pairs according to string similarity and other similarity metrics, 
* And *link* them. 

All of these steps can be achieved with the ``recordlinkage`` package. Let's find out how!

#### Our DataFrames

```census_A```

<img src='pictures/censusA.jpg' />

```census_B```	

<img src='pictures/censusB.jpg' />

Here we have two DataFrames, ``census_A`` and ``census_B``, containing data on individuals throughout the states. 
We want to merge them while avoiding duplication using *record linkage*: since they were collected manually and are prone to typos, there are no consistent IDs between them.

#### Generating Pairs

We first want to generate pairs between both DataFrames. Ideally, we want to generate all possible pairs between our DataFrames.

But what if we had big DataFrames and ended up having to generate millions, if not billions, of pairs? It wouldn't prove scalable and could seriously hamper development time.

##### Blocking

This is where we apply what we call **blocking**, which creates pairs only between rows that agree on a matching column, in this case the ``state`` column, reducing the number of possible pairs.

To do this, we first start off by importing ``recordlinkage``. 

```python
# Import recordlinkage
import recordlinkage
```

We then use ``recordlinkage.Index()`` to create an *indexing* object. This is essentially an object we can use to generate pairs from our DataFrames. 

```python
# Create indexing object
indexer = recordlinkage.Index()
```

To generate pairs blocked on state, we use the ``block()`` method, passing the ``state`` column as input. 

```python
# Generate pairs blocked on state
indexer.block('state')
```

Once the *indexer* object has been initialized, we generate our pairs using the ``.index()`` method, which takes in the *two DataFrames*.

```python
pairs = indexer.index(census_A, census_B)
```

The resulting object is a ``pandas`` multi-index object containing pairs of row indices from both DataFrames, which is a fancy way of saying it is an *array* of possible pairs of indices that makes it much easier to subset DataFrames on.
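To see how much blocking shrinks the search space, here is a plain-Python sketch that counts pairs with and without it. It is only a toy illustration of the idea, not how ``recordlinkage`` is implemented, and the two tiny record lists are made up.

```python
from collections import defaultdict
from itertools import product

# Made-up records; only the blocking column matters here
census_a = [{"state": "NY"}, {"state": "CA"}, {"state": "NY"}]
census_b = [{"state": "CA"}, {"state": "NY"}, {"state": "TX"}]

# Full indexing: every row of A paired with every row of B
all_pairs = list(product(range(len(census_a)), range(len(census_b))))

# Blocking: group B's row positions by state, then pair within each state only
rows_by_state = defaultdict(list)
for position_b, row in enumerate(census_b):
    rows_by_state[row["state"]].append(position_b)

blocked_pairs = [
    (position_a, position_b)
    for position_a, row in enumerate(census_a)
    for position_b in rows_by_state.get(row["state"], [])
]

print(len(all_pairs))  # 9
print(blocked_pairs)   # [(0, 1), (1, 0), (2, 1)]
```

The trade-off is that a typo in the blocking column itself hides a true match, which is one reason to clean that column first.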

<img src='pictures/pairs.jpg' />

#### Comparing the DataFrames

Since we've already generated our pairs, it's time to find potential matches. We first start by creating a *comparison* object using ``recordlinkage.Compare()``. This is similar to the indexing object we created while generating pairs, but this one is responsible for assigning different comparison procedures to pairs. 

```python
# Generate the pairs
pairs = indexer.index(census_A, census_B)
# Create a Compare object
compare_cl = recordlinkage.Compare()
```

Let's say there are columns for which we want exact matches between the pairs. To do that, we use the ``exact()`` method. It takes in the column name in question for each DataFrame, which is in this case ``date_of_birth`` and ``state``, and a ``label`` argument which lets us set the column name in the resulting DataFrame. 

```python
# Find exact matches for pairs of date_of_birth and state
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('state', 'state', label='state')
```

Now, to compute string similarities between pairs of rows for columns that have fuzzy values, we use the ``.string()`` method. It also takes in the column names in question, plus the similarity *cutoff* point in the ``threshold`` argument, a value between ``0`` and ``1`` which we set here to ``0.85``. 

```python
# Find similar matches for pairs of surname and address_1 using string similarity
compare_cl.string('surname', 'surname', threshold=0.85, label='surname')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')
```
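What a threshold does to a similarity value can be sketched in plain Python. The example assumes a common normalization, one minus the Levenshtein distance divided by the longer string's length. The package's exact formula may differ, and the helper names and surnames are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with insertion, deletion and substitution at cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity between 0 (nothing shared) and 1 (identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def thresholded(a: str, b: str, threshold: float = 0.85) -> float:
    """1.0 when the similarity reaches the threshold, otherwise 0.0."""
    return 1.0 if similarity(a, b) >= threshold else 0.0


print(similarity("johnson", "johnsonn"), thresholded("johnson", "johnsonn"))  # 0.875 1.0
print(similarity("smith", "smyth"), thresholded("smith", "smyth"))            # 0.8 0.0
```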

Finally, to compute the matches, we use the ``compute()`` method, which takes in the possible pairs and the two DataFrames in question. 
Note that you need to **always** keep the **same order of DataFrames** when passing them as arguments while generating pairs, comparing columns, and computing comparisons.

```python
# Find Matches
potential_matches = compare_cl.compute(pairs, census_A, census_B)
```

The output is a *multi-index* DataFrame, where the first index level is the row index from the first DataFrame, *census A*, and the second index level holds the row indices from *census B* paired with it. The columns are the columns being compared, with values being **1** for a match, and **0** for not a match.

<img src='pictures/potential_matches.jpg' />

To find potential matches, we just filter for rows where the sum of row values is at or above a certain threshold, which in this case is 2. 

```python
potential_matches[potential_matches.sum(axis=1) >= 2]
```
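The row-sum filter can be tried on a small hand-built comparison table. The values below are invented purely to show the mechanics; only standard ``pandas`` calls are used.

```python
import pandas as pd

# Invented comparison results: one row per (census_A row, census_B row) pair
pair_index = pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 1), (2, 0)])
potential_matches = pd.DataFrame(
    {
        "date_of_birth": [1, 0, 1, 0],
        "state": [1, 1, 1, 0],
        "surname": [1.0, 0.0, 0.0, 0.0],
        "address_1": [1.0, 0.0, 0.0, 1.0],
    },
    index=pair_index,
)

# Number of columns on which each pair agrees
agreement = potential_matches.sum(axis=1)

# Keep the pairs that agree on at least 2 columns
likely_matches = potential_matches[agreement >= 2]
print(likely_matches)
```

Here the pairs `(0, 0)` and `(1, 1)` survive, agreeing on 4 and 2 columns respectively.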

## Exercise

### Pairs of restaurants

In the last lesson, you cleaned the restaurants dataset to make it ready for building a restaurants recommendation engine. You have a new DataFrame named ``restaurants_new`` with new restaurants to train your model on, that's been scraped from a new data source.

You've already cleaned the ``cuisine_type`` and ``city`` columns using the techniques learned throughout the course. However, you saw duplicates with typos in restaurant names that require record linkage, rather than joins, with ``restaurants``.

In this exercise, you will perform the first step in record linkage and generate possible pairs of rows between ``restaurants`` and ``restaurants_new``.

```python
# Import packages
import pandas as pd
import recordlinkage

# Datasets
restaurants = pd.read_csv('datasets/restaurants_L2_dirty.csv')
restaurants_new = pd.read_csv('datasets/restaurants_L2.csv')

# Create an indexer object to find possible pairs
indexer = recordlinkage.Index()

# Block pairing on the cuisine type column, named 'type' in both DataFrames
indexer.block('type', 'type')

# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)
```

### Similar Restaurants

In the last exercise, you generated pairs between ``restaurants`` and ``restaurants_new`` in an effort to cleanly merge both DataFrames using record linkage.

When performing record linkage, there are different types of matching you can perform between different columns of your DataFrames, including exact matches, string similarities, and more.

Now that your pairs have been generated and stored in ``pairs``, you will find exact matches in the ``city`` and ``cuisine_type`` columns between each pair, and similar strings for each pair in the ``rest_name`` column.

```python
# Create a comparison object
comp_cl = recordlinkage.Compare()

# Find exact matches on city and type
comp_cl.exact('city', 'city', label='city')
comp_cl.exact('type', 'type', label='type')

# Find similar matches of name
comp_cl.string('name', 'name', threshold=0.8, label='name')

# Get potential matches and print
potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new)
print(potential_matches)
```

```
        city  type  name
0  0       0     1   0.0
   1       0     1   0.0
   2       0     1   0.0
   3       0     1   0.0
   4       0     1   0.0
...      ...   ...   ...
55 221     1     1   0.0
   230     1     1   0.0
   233     1     1   0.0
   238     1     1   0.0
   241     1     1   0.0

[3631 rows x 3 columns]
```


```python
potential_matches[potential_matches.sum(axis=1) >= 3]
```

```
       city  type  name
1  3      1     1   1.0
7  13     1     1   1.0
12 17     1     1   1.0
20 20     1     1   1.0
27 21     1     1   1.0
28 1      1     1   1.0
40 0      1     1   1.0
43 8      1     1   1.0
50 9      1     1   1.0
53 4      1     1   1.0
```


# Lesson III

## Linking DataFrames

At this point, you've generated your pairs, compared them, and scored them. Now it's time to link your data!

Remember our census DataFrames from the video of the previous lesson? We've already generated pairs between them, compared four of their columns, two for exact matches and two for string similarity alongside a 0.85 threshold, and found potential matches.

Now it's time to link both census DataFrames. Let's look closely at our potential matches. It is a multi-index DataFrame with two index columns, record id 1 and record id 2.

The first index column stores indices from ``census_A``. The second index column stores all possible indices from ``census_B`` for each row index of ``census_A``. The columns of our potential matches are the columns we chose to link both DataFrames on, where the value is *1* for a match, and *0* otherwise.

<img src='pictures/potential_matches1.jpg' />

The first step in linking DataFrames is to narrow the potentially matching pairs down to the ones we're pretty sure of. We saw how to do this in the previous lesson, by subsetting the rows where the row sum is at or above a certain number of columns, in this case 3. 

```python
matches = potential_matches[potential_matches.sum(axis=1) >= 3]
print(matches)
```

The output is the pairs of row indices from ``census_A`` and ``census_B`` that are most likely duplicates. Our next step is to extract one of the index columns and subset its associated DataFrame to filter for duplicates.

<img src='pictures/potential_matches2.jpg' />

Here we choose the second index column, which represents row indices of ``census_B``. We want to extract those indices and *subset* ``census_B`` on them to remove duplicates with ``census_A`` before appending them together.

We can access a DataFrame's index using the ``index`` attribute. Since this is a multi-index DataFrame, it returns a multi-index object containing pairs of row indices from ``census_A`` and ``census_B`` respectively. 

```python
matches.index
```

We want to extract all ``census_B`` indices, so we chain it with the ``get_level_values()`` method, which takes in the index level whose values we want to extract. We can input either the index level's name or its position, which is in this case 1.

```python
# Get indices from census_B only
duplicate_rows = matches.index.get_level_values(1)
```

To find the duplicates in ``census_B``, we simply subset ``census_B`` on the indices found through record linkage, using ``isin()`` on its index. You can choose to examine them further for similarity with their duplicates in ``census_A``, but if you're sure of your analysis, you can go ahead and find the non-duplicates by repeating the exact same line of code with a *tilde* (``~``) added at the beginning of the subset condition. 

```python
# Finding duplicates in census_B
census_B_duplicates = census_B[census_B.index.isin(duplicate_rows)]

# Finding new rows in census_B
census_B_new = census_B[~census_B.index.isin(duplicate_rows)]
```
Now that you have your non-duplicates, all you need is to concatenate them onto ``census_A`` with ``pd.concat()``, and you have your linked data! The older ``DataFrame.append`` method was removed in pandas 2.0.

```python
# Link the DataFrames
full_census = pd.concat([census_A, census_B_new])
```
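The whole linking step can be rehearsed on toy data. The DataFrames and the single match below are invented for illustration, the final step uses ``pd.concat``, which works on current pandas versions, and everything else follows the calls shown above.

```python
import pandas as pd

# Invented miniature versions of the two census tables
census_a = pd.DataFrame({"surname": ["smith", "jones"]}, index=[0, 1])
census_b = pd.DataFrame({"surname": ["smyth", "brown", "lee"]}, index=[0, 1, 2])

# Pretend record linkage decided that row 0 of A and row 0 of B are the same person
matches = pd.DataFrame(
    {"surname": [1.0]},
    index=pd.MultiIndex.from_tuples([(0, 0)]),
)

# Level 1 of the multi-index holds the census_b row labels
duplicate_rows = matches.index.get_level_values(1)

# Keep only the census_b rows that were NOT flagged as duplicates
census_b_new = census_b[~census_b.index.isin(duplicate_rows)]

# Stack the original table and the genuinely new rows
full_census = pd.concat([census_a, census_b_new])
print(full_census)
```

The original row labels are kept, so labels can repeat across the two sources. Passing `ignore_index=True` to `pd.concat` renumbers the rows if that matters.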

To recap, we built on our previous work of generating pairs, comparing across columns, and finding potential matches. 
We then isolated the likely matches, where there are matches across 3 columns or more, tightening our search for duplicates across both DataFrames before linking them. 
Next, we extracted the row indices of ``census_B`` where there are duplicates, and used the tilde symbol to find the rows of ``census_B`` that are not duplicated in ``census_A``. 
Finally, we linked both DataFrames for full census results!

## Exercise 

### Linking them together

In the last lesson, you finished the bulk of the work in your effort to link ``restaurants`` and ``restaurants_new``. You generated the different pairs of potentially matching rows, searched for exact matches between the ``type`` and ``city`` columns, and compared for similar strings in the ``name`` column. You stored the DataFrame containing the scores in ``potential_matches``.

Now it's finally time to link both DataFrames. You will do so by first extracting all row indices of ``restaurants_new`` that match across the columns mentioned above from ``potential_matches``. Then you will subset ``restaurants_new`` on these indices, and append the non-duplicate values to ``restaurants``.

```python
# Isolate the potential matches with row sum >= 3
matches = potential_matches[potential_matches.sum(axis=1) >= 3]

# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)

# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new.loc[~restaurants_new.index.isin(matching_indices)]

# Append non_dup to restaurants (DataFrame.append was removed in pandas 2.0)
full_restaurants = pd.concat([restaurants, non_dup])
print(full_restaurants)
```

```
     Unnamed: 0                name                      addr           city  \
0             0              kokomo         6333 w. third st.             la   
1             1              feenix   8358 sunset blvd. west       hollywood   
2             2             parkway      510 s. arroyo pkwy .       pasadena   
3             3                r-23          923 e. third st.    los angeles   
4             4               gumbo         6333 w. third st.             la   
..          ...                 ...                       ...            ...   
331         331   vivande porta via        2125 fillmore st.   san francisco   
332         332  vivande ristorante     670 golden gate ave.   san francisco   
333         333        world wrapps        2257 chestnut st.   san francisco   
334         334             wu kong            101 spear st.   san francisco   
335         335           yank sing          427 battery st.   san francisco   

          phone          type  
0    21
```
