# **28.09.23**

# **Exercise of Data Linkage Concepts and Techniques**

---


## **1. Data Preparation**

In [None]:
import pandas as pd
import numpy as np
from math import floor, ceil

import matplotlib.pyplot as plt

In [None]:
dfA = pd.read_csv('dataA.csv', index_col=0)
dfB = pd.read_csv('dataB.csv', index_col=0)

You can inspect the data at your convenience, for example with `dfA.head()` and `dfB.head()`.

## **2. Calculation of Jaro-Winkler Distance**

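Define functions to calculate the Jaro-Winkler distance. Despite the name, the functions below return a similarity score: 1.0 for identical strings and 0.0 for strings with no matching characters.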
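As a quick sanity check of the definition, the arithmetic can be worked by hand for the classic pair `MARTHA` / `MARHTA`. With $m$ matching characters, $t$ transpositions, and string lengths $|s_1|$ and $|s_2|$, the Jaro similarity is $\frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)$. The Winkler adjustment then adds $\ell \, p \, (1 - \text{jaro})$ for a common prefix of length $\ell$ (capped at 4) and a scaling factor $p = 0.1$. The sketch below only plugs in the counts: the values of `m`, `t`, and `prefix` were counted by hand for this pair and are not computed by the code.

```python
# Worked example for "MARTHA" vs "MARHTA" (counts determined by hand)
len1, len2 = len("MARTHA"), len("MARHTA")
m = 6        # all six characters match within the allowed window
t = 1        # "T" and "H" are swapped: 2 out-of-order matches, halved
prefix = 3   # common prefix "MAR"
p = 0.1      # Winkler scaling factor

jaro = (m / len1 + m / len2 + (m - t) / m) / 3
jaro_winkler = jaro + prefix * p * (1 - jaro)

print(round(jaro, 4), round(jaro_winkler, 4))  # 0.9444 0.9611
```

The shared prefix raises the score from about 0.944 to about 0.961, which is the intended effect of the Winkler adjustment.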

In [None]:
# Python 3 implementation of the Jaro-Winkler distance

# Function to calculate the
# Jaro similarity of two strings
def jaro_distance(s1, s2):
     
    # If the strings are equal
    if (s1 == s2):
        return 1.0
 
    # Lengths of the two strings
    len1 = len(s1)
    len2 = len(s2)
 
    # Maximum distance up to which matching
    # is allowed
    max_dist = floor(max(len1, len2) / 2) - 1
 
    # Count of matches
    match = 0
 
    # Hash for matches
    hash_s1 = [0] * len(s1)
    hash_s2 = [0] * len(s2)
 
    # Traverse the first string
    for i in range(len1):
 
        # Check whether there are any matches
        for j in range(max(0, i - max_dist),
                       min(len2, i + max_dist + 1)):
             
            # If there is a match
            if (s1[i] == s2[j] and hash_s2[j] == 0):
                hash_s1[i] = 1
                hash_s2[j] = 1
                match += 1
                break
 
    # If there is no match
    if (match == 0):
        return 0.0
 
    # Number of transpositions
    t = 0
    point = 0
 
    # Count number of occurrences
    # where two characters match but
    # there is a third matched character
    # in between the indices
    for i in range(len1):
        if (hash_s1[i]):
 
            # Find the next matched character
            # in the second string
            while (hash_s2[point] == 0):
                point += 1
 
            if (s1[i] != s2[point]):
                t += 1
            point += 1
    t = t//2
 
    # Return the Jaro Similarity
    return (match / len1 + match / len2 +
            (match - t) / match) / 3.0
 
def jaro_winkler_distance(s1, s2):
    jaro_dist = jaro_distance(s1, s2)
    
    # Length of common prefix
    L = 0
    for l1, l2 in zip(s1, s2):
        if l1 == l2:
            L += 1
        else:
            break
    L = min(4, L)  # Use at most 4 prefix characters
    p = 0.1  # Scaling factor
    
    return jaro_dist + (L * p * (1 - jaro_dist))


def compute_similarity_matrix(dfA, dfB):
    # Number of attributes (columns) to compare; dfB is assumed
    # to have the same columns in the same order as dfA
    m = dfA.shape[1]

    # Collect one row of similarities per record pair
    results = []
    
    # Iterate through each pair of records
    for idx_a, row_a in dfA.iterrows():
        for idx_b, row_b in dfB.iterrows():
            # For each variable, compute the Jaro-Winkler similarity
            similarities = [jaro_winkler_distance(str(row_a.iloc[i]), str(row_b.iloc[i])) for i in range(m)]
            results.append(similarities)
            
    # Convert to DataFrame with combined index
    index_pairs = [(idx_a, idx_b) for idx_a in dfA.index for idx_b in dfB.index]
    df_result = pd.DataFrame(results, columns=dfA.columns, index=pd.MultiIndex.from_tuples(index_pairs))
    
    return df_result

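The function `compute_similarity_matrix()` takes two datasets and computes the Jaro-Winkler score for every pair of records and every attribute. The result is a DataFrame with one row per pair (index in `dfA`, index in `dfB`) and one column per attribute.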
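To make the shape of the result concrete, here is a minimal sketch of the same all-pairs pattern on two tiny, made-up tables. The names `left`, `right`, and `toy_similarity` are hypothetical, and `difflib.SequenceMatcher` is only a stand-in scorer so that the block runs on its own; it is not the Jaro-Winkler score.

```python
import pandas as pd
from difflib import SequenceMatcher


def toy_similarity(a, b):
    # Stand-in scorer (NOT Jaro-Winkler), used only to keep the sketch self-contained
    return SequenceMatcher(None, a, b).ratio()


# Two made-up tables with the same columns
left = pd.DataFrame({"name": ["anna", "bob"], "city": ["oslo", "rome"]},
                    index=["a1", "a2"])
right = pd.DataFrame({"name": ["anna", "rob"], "city": ["oslo", "roma"]},
                     index=["b1", "b2"])

# One row of scores per (left record, right record) pair
rows = []
for _, row_a in left.iterrows():
    for _, row_b in right.iterrows():
        rows.append([toy_similarity(str(row_a[c]), str(row_b[c]))
                     for c in left.columns])

# from_product yields pairs in the same order as the nested loops
pairs = pd.MultiIndex.from_product([left.index, right.index])
toy_matrix = pd.DataFrame(rows, columns=left.columns, index=pairs)
print(toy_matrix)
```

With 2 records on each side and 2 attributes, the result has 4 rows and 2 columns; in general the number of rows is the product of the two table sizes, so the full comparison grows quickly with the data.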

### **Exercise #1**

+ *What is the meaning of the Jaro-Winkler Distance?*
+ *Calculate the Jaro-Winkler Distance for all record pairs and store it as `df_similarity`.*

In [None]:
# Your solutions here
#df_similarity = 

### **Exercise #2**

+ *Suppose there are three similarity levels $L_k$ (different, similar, identical).*
+ *Convert the result into similarity levels using the threshold values 0.88 and 0.94.*
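As a hint, one possible building block for this kind of binning is `np.digitize`, shown here on made-up scores rather than on `df_similarity`. The integer coding (0 = different, 1 = similar, 2 = identical) and the choice that a score exactly equal to a threshold falls into the higher level are assumptions of this sketch; the exercise does not specify either.

```python
import numpy as np

# Made-up scores, not taken from df_similarity
scores = np.array([0.50, 0.88, 0.93, 0.94, 1.00])

# Assumed coding: 0 = different (< 0.88), 1 = similar (0.88 to < 0.94),
# 2 = identical (>= 0.94)
levels = np.digitize(scores, bins=[0.88, 0.94])
labels = np.array(["different", "similar", "identical"])[levels]

print(levels.tolist())  # [0, 1, 1, 2, 2]
print(labels.tolist())
```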

In [None]:
def convert_matrix_values(matrix):
    # Your solutions here
    pass
    return # Your solutions here

converted_df = convert_matrix_values(df_similarity)
converted_df.head()

### **Exercise #3**

+ *Suppose a pair is considered a match if none of its attributes is "different".*
+ *Identify the matched pairs and compute the number of matched pairs.*
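As a hint, a row-wise rule of this kind can be expressed as a boolean mask. The sketch below uses a made-up table `toy_levels` and assumes the levels are stored as integers with 0 meaning "different"; adapt the comparison if your levels are stored differently.

```python
import pandas as pd

# Made-up level table (assumed coding: 0 = different, 1 = similar, 2 = identical)
toy_levels = pd.DataFrame(
    {"name": [2, 1, 0], "city": [2, 1, 2]},
    index=pd.MultiIndex.from_tuples([("a1", "b1"), ("a1", "b2"), ("a2", "b1")]),
)

# Keep only the rows in which no attribute is coded as "different"
no_different = toy_levels[(toy_levels != 0).all(axis=1)]
print(no_different)
print(len(no_different))  # 2
```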

In [None]:
# Your solutions here

filtered_df = # Your solutions here

print("Pairs with no 'different' attributes:")
print(filtered_df)

print("\nNumber of such pairs:", len(filtered_df))


## **3. Using the Fellegi-Sunter method with ECM through the Python Record Linkage Toolkit (optional)**

Installation of the Python Record Linkage Toolkit

The Python Record Linkage Toolkit requires Python 3.6 or higher.

Install the package with pip:

$ pip install recordlinkage

Installation instructions are available at https://recordlinkage.readthedocs.io/en/latest/installation.html

You can verify the installation by running the following code:

In [None]:
import recordlinkage as rl

### **Exercise #4**

+ *Run the following code to fit the model using the original JW scores.*
+ *Try to understand the result.*

In [None]:
cl = rl.ECMClassifier(binarize=0.8)
cl.fit(df_similarity)
fsweights = cl.log_weights

In [None]:
print("Prior probability P(Match):", cl.p)
print("log weights of features:", fsweights)
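To help read the printed weights, recall the usual Fellegi-Sunter setup: for each binary comparison, $m$ is the probability of agreement among true matches and $u$ the probability of agreement among non-matches. The agreement weight is $\log(m/u)$ and the disagreement weight is $\log((1-m)/(1-u))$. The sketch below uses made-up $m$ and $u$ values and the natural logarithm; both are assumptions for illustration and are not taken from the fitted classifier.

```python
from math import log

# Hypothetical m- and u-probabilities for two attributes (illustration only)
m_probs = {"name": 0.95, "city": 0.90}
u_probs = {"name": 0.01, "city": 0.20}

weights = {}
for attr in m_probs:
    m, u = m_probs[attr], u_probs[attr]
    agree = log(m / u)                  # positive when agreement favours a match
    disagree = log((1 - m) / (1 - u))   # negative when disagreement favours a non-match
    weights[attr] = (agree, disagree)
    print(attr, round(agree, 3), round(disagree, 3))
```

An attribute on which non-matches rarely agree (small $u$, as for the hypothetical `name`) receives a larger agreement weight than one on which non-matches often agree by chance.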

### **Exercise #5**

+ *Use the match weights to compute the link probability of each pair.*
+ *Identify the matched pairs using a threshold of 0.9.*
+ *Is this result better?*
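As a hint on how weights relate to a probability: under the conditional-independence assumption of the Fellegi-Sunter model, the posterior log-odds of a match equal the prior log-odds plus the sum of the log weights for the pair, and the logistic function turns log-odds back into a probability. The prior `p` and the values in `pair_weights` below are made up, and the natural logarithm is assumed.

```python
from math import exp, log

p = 0.1                        # hypothetical prior P(Match)
pair_weights = [4.55, -2.08]   # hypothetical log weights for one pair, one per attribute

# Posterior log-odds = prior log-odds + sum of log-likelihood-ratio weights
log_odds = log(p / (1 - p)) + sum(pair_weights)

# Logistic function: convert log-odds to a probability
link_probability = 1 / (1 + exp(-log_odds))
print(round(link_probability, 3))
```

A pair would then be declared a match when this probability exceeds the chosen threshold.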

In [None]:
# Your solutions here


filtered_pairs2 = # Your solutions here

# Print the results
print("Pairs with link probability higher than 0.9:")
print(filtered_pairs2.index)

print("\nNumber of such pairs:", len(filtered_pairs2))