# Implementation of Record Linkage in Python

Record linkage is the process of linking records from different data sources that refer to the same entity.

A typical record linkage workflow includes the following steps:
- Data preprocessing
- String similarity metrics
- Blocking
- Comparison
- Classification
- Post-processing
- Evaluation
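
The notebook below does not run a preprocessing step, so here is a minimal sketch of what one might look like, using only the standard library. The helper name `normalize` is hypothetical and not part of `recordlinkage`; it assumes that case, accents, and extra whitespace carry no meaning for matching.

```python
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical helper: casefold, strip accents, and collapse whitespace."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # split() + join() trims the ends and collapses runs of whitespace.
    return " ".join(no_accents.casefold().split())


print(normalize("  Etihad   stadium "))  # etihad stadium
print(normalize("Etihad Stadium"))       # etihad stadium
```

After normalisation the two spellings of the venue become identical, which makes the later comparison step less sensitive to formatting noise.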

In [40]:
# Build the example from two fictional DataFrames

import pandas as pd
import recordlinkage

In [41]:
fixtures_data1 = {
    'HomeTeam' : ["Arsenal", "Chelsea", "Liverpool", "Manchester United"],
    'AwayTeam' : ["Manchester City", "Leeds", "Everton", "Wolves"],
    'Date' : ['2024-02-19', '2024-02-20', '2024-02-21', '2024-02-22'],
    'Venue' : ['Emirates Stadium', 'Stamford Bridge', 'Old Trafford', 'Anfield']
}


fixtures_data2 = {
    'HomeTeam' : ['Liverpool', 'Manchester United', 'Chelsea', 'Manchester City'],
    'AwayTeam' : ['Everton', 'Leeds', 'Wolves', 'Manchester City'],
    'Date' : ['2024-02-22', '2024-02-23', '2024-02-24', '2024-02-25'],
    'Venue' : ['Anfield', 'Old Trafford', 'Stamford Bridge', 'Etihad stadium']
}

In [42]:
df1 = pd.DataFrame(fixtures_data1)
df2 = pd.DataFrame(fixtures_data2)

In [43]:
df1

| | HomeTeam | AwayTeam | Date | Venue |
|---|---|---|---|---|
| 0 | Arsenal | Manchester City | 2024-02-19 | Emirates Stadium |
| 1 | Chelsea | Leeds | 2024-02-20 | Stamford Bridge |
| 2 | Liverpool | Everton | 2024-02-21 | Old Trafford |
| 3 | Manchester United | Wolves | 2024-02-22 | Anfield |


In [44]:
df2

| | HomeTeam | AwayTeam | Date | Venue |
|---|---|---|---|---|
| 0 | Liverpool | Everton | 2024-02-22 | Anfield |
| 1 | Manchester United | Leeds | 2024-02-23 | Old Trafford |
| 2 | Chelsea | Wolves | 2024-02-24 | Stamford Bridge |
| 3 | Manchester City | Manchester City | 2024-02-25 | Etihad stadium |


# Implementing Blocking

In [45]:
indexer = recordlinkage.Index()  # create an empty Index object
indexer.block('Date')  # only pair records that share the same Date value

<Index>

In [46]:
candidate_links = indexer.index(df1, df2)
candidate_links

MultiIndex([(3, 0)],
           )
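
To make the result above easier to follow, here is a rough pure-Python sketch of the idea behind blocking: only rows that share the blocking key are paired. The function `block_pairs` is a hypothetical illustration, not how `recordlinkage` implements it internally; the date lists are copied from the two dataframes.

```python
from collections import defaultdict

# 'Date' columns copied from df1 and df2
dates1 = ['2024-02-19', '2024-02-20', '2024-02-21', '2024-02-22']
dates2 = ['2024-02-22', '2024-02-23', '2024-02-24', '2024-02-25']


def block_pairs(keys_left, keys_right):
    """Hypothetical helper: pair row positions whose blocking keys are equal."""
    # Group the right-hand row positions by their key.
    buckets = defaultdict(list)
    for j, key in enumerate(keys_right):
        buckets[key].append(j)
    # Pair each left-hand row only with right-hand rows in the same bucket.
    return [(i, j) for i, key in enumerate(keys_left) for j in buckets.get(key, [])]


pairs = block_pairs(dates1, dates2)
print(pairs)                      # [(3, 0)]
print(len(dates1) * len(dates2))  # 16 pairs without blocking
```

Only `2024-02-22` appears in both date lists, so blocking reduces the 16 possible pairs to the single candidate `(3, 0)`.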

# Implementing Comparison

In [47]:
compare = recordlinkage.Compare()  # create an empty Compare object
# Each comparison yields 1.0 when the Levenshtein similarity reaches the 0.8 threshold, else 0.0
compare.string('HomeTeam', 'HomeTeam', method='levenshtein', threshold=0.8)
compare.string('AwayTeam', 'AwayTeam', method='levenshtein', threshold=0.8)

<Compare>

# Compute Similarity

In [48]:
features = compare.compute(candidate_links, df1, df2)
features

| | | 0 | 1 |
|---|---|---|---|
| 3 | 0 | 0.0 | 0.0 |
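
To see why both features come out as 0.0, the sketch below computes a thresholded Levenshtein similarity by hand. It assumes the similarity is normalised as `1 - distance / max(len(a), len(b))` and that scores at or above the threshold become 1.0; the function names are hypothetical and the library's internals may differ in detail.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 minus distance over the longer length."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def thresholded(a: str, b: str, threshold: float = 0.8) -> float:
    """Binarise the similarity the way the threshold argument is assumed to."""
    return 1.0 if similarity(a, b) >= threshold else 0.0


# The candidate pair (3, 0): df1 row 3 against df2 row 0
print(thresholded("Manchester United", "Liverpool"))  # 0.0
print(thresholded("Wolves", "Everton"))               # 0.0
# Near-identical strings clear the threshold
print(thresholded("Etihad stadium", "Etihad Stadium"))  # 1.0
```

The team names in the candidate pair are entirely different strings, so neither comparison gets close to the 0.8 threshold.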


# Classification

In [53]:
# Two binary features sum to at most 2, so a match requires both to agree (>= 2)
matches = features[features.sum(axis=1) >= 2]

print(matches)

Empty DataFrame
Columns: [0, 1]
Index: []


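The only candidate pair, (3, 0), shares a date, but neither the home team nor the away team matches, so no record pairs are classified as matches between the two dataframes.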
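
The classification rule can also be sketched without pandas. With two binary comparison features the score of a pair is at most 2, so a pair counts as a match only when its score reaches 2. The `hypothetical` feature rows below are invented purely to show the rule producing matches; they do not come from the dataframes above.

```python
# Feature vector actually computed above for the single candidate pair
features_seen = {(3, 0): [0.0, 0.0]}

# Invented rows, for illustration only
hypothetical = {(0, 2): [1.0, 1.0], (1, 3): [1.0, 0.0]}


def classify(feature_rows, min_score):
    """Keep the pairs whose summed feature scores reach min_score."""
    return [pair for pair, scores in feature_rows.items() if sum(scores) >= min_score]


print(classify(features_seen, 2))  # []
print(classify(hypothetical, 2))   # [(0, 2)]
print(classify(hypothetical, 1))   # [(0, 2), (1, 3)]
```

Lowering `min_score` trades precision for recall: a score of 1 accepts pairs that agree on only one of the two team columns.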