# Name Variation Matching using Fuzzy String Matching
This notebook demonstrates how to match name variations to a set of base names using normalization and fuzzy string matching techniques. We will preprocess the data, apply fuzzy matching, and evaluate the results.

In [3]:
import pandas as pd
from thefuzz import fuzz, process
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re

## Import Required Libraries
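Import the libraries used for data manipulation (`pandas`), fuzzy string matching (`thefuzz`), and regular expressions (`re`). `TfidfVectorizer` and `cosine_similarity` are also imported for text vectorization, but the cells below do not use them.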

In [4]:
name_variations = pd.read_csv("name_variations.csv")
base_names = pd.read_csv("base_names.csv")

## Load Data
Read the name variations and base names from their respective CSV files.

In [5]:
base_names, name_variations

(    Base_Name_ID          Base_Name
 0              1         John Smith
 1              2     Jennifer Brown
 2              3   Michael O'Connor
 3              4       Maria Garcia
 4              5         Robert Lee
 5              6      Linda Johnson
 6              7      William Davis
 7              8   Elizabeth Wilson
 8              9     David Martinez
 9             10        Susan Clark
 10            11    James Rodriguez
 11            12         Mary Lewis
 12            13         Paul Allen
 13            14        Karen Young
 14            15        Thomas King
 15            16       Nancy Wright
 16            17       Daniel Scott
 17            18        Sandra Hill
 18            19  Christopher Green
 19            20      Jessica Adams,
           Variation Matches_With_Base_Name
 0      Thomas  King            Thomas King
 1        ThomasKing            Thomas King
 2      Maria Garcia           Maria Garcia
 3         MaryLewis             Mary Lewis
 4

## Preview Data
Display the loaded dataframes to understand their structure and contents.

In [6]:
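def normalize_name(name: str) -> str:
    """Return a lowercase, letters-only form of `name` with single spaces."""
    # Missing values normalize to an empty string
    if pd.isna(name):
        return ""
    name = name.strip()
    # Split camelCase boundaries, e.g. "ThomasKing" -> "Thomas King"
    name = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', name)
    name = name.lower()
    # Remove everything except a-z and whitespace (digits, punctuation)
    name = re.sub(r'[^a-z\s]', '', name)
    # Collapse repeated whitespace
    name = re.sub(r'\s+', ' ', name).strip()
    return name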

## Define Name Normalization Function
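Create a function that normalizes a name in five steps: trim surrounding whitespace, insert a space at each lowercase-to-uppercase boundary (so `ThomasKing` becomes `Thomas King`), lowercase the result, remove every character other than `a`-`z` and whitespace, and collapse repeated whitespace. Missing values become an empty string.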
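To make the effect of each step concrete, here is a minimal standalone sketch of the same normalization applied to a few variations from the data preview. The name `normalize_name_sketch` is hypothetical, and the sketch checks for `None` instead of using `pd.isna` so that it runs without pandas.

```python
import re

def normalize_name_sketch(name):
    """Standalone restatement of the normalization steps (hypothetical helper)."""
    if name is None:
        return ""
    name = name.strip()
    # Insert a space at lowercase-to-uppercase boundaries
    name = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    name = name.lower()
    # Keep only a-z and whitespace; digits and punctuation are deleted
    name = re.sub(r"[^a-z\s]", "", name)
    # Collapse runs of whitespace into a single space
    return re.sub(r"\s+", " ", name).strip()

samples = ["ThomasKing", "Thomas  King", "Dani3l Scott", "N@ncy Wright", "Michael O'Connor"]
for raw in samples:
    print(repr(raw), "->", repr(normalize_name_sketch(raw)))
# 'ThomasKing' -> 'thomas king'
# 'Thomas  King' -> 'thomas king'
# 'Dani3l Scott' -> 'danil scott'
# 'N@ncy Wright' -> 'nncy wright'
# "Michael O'Connor" -> 'michael oconnor'
```

Note that digits and symbols are deleted, not replaced with a guessed letter, so `Dani3l` and `N@ncy` stay one character short after normalization. Closing that remaining gap is left to the fuzzy matching step.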

In [7]:
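# Pass values through unchanged so that missing entries reach the pd.isna check
# in normalize_name; .astype(str) would turn NaN into the literal string "nan".
name_variations["Normalized"] = name_variations["Variation"].apply(normalize_name)
base_names["Normalized"] = base_names["Base_Name"].apply(normalize_name)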

## Normalize Names
Apply the normalization function to both name variations and base names.

In [8]:
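def get_best_match(name, base_names, threshold=80):
    """Return (base name, score) for the best match, or (None, None) below threshold."""
    # Score `name` against every normalized base name; word order is ignored
    match = process.extractOne(
        name,
        base_names["Normalized"].tolist(),
        scorer=fuzz.token_sort_ratio
    )
    if match and match[1] >= threshold:
        # Map the normalized match back to the original base name for reporting
        matched_name = base_names.loc[base_names["Normalized"] == match[0], "Base_Name"].values[0]
        return matched_name, match[1]
    return None, None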

## Define Fuzzy Matching Function
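Create a function that finds the best fuzzy match for a normalized name among the normalized base names, scored with token sort ratio. A match is accepted only if its score is at least `threshold` (80 by default); otherwise the function returns `(None, None)`.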
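The idea behind token sort scoring with a threshold can be sketched with the standard library alone. This is an approximation for illustration, not the `thefuzz` implementation: it sorts the tokens of both strings and compares them with `difflib.SequenceMatcher`, so scores may differ from `fuzz.token_sort_ratio`. The names `token_sort_score` and `best_match_sketch` are hypothetical.

```python
from difflib import SequenceMatcher

def token_sort_score(a, b):
    """Similarity on a 0-100 scale after sorting each string's tokens."""
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, a_sorted, b_sorted).ratio())

def best_match_sketch(name, candidates, threshold=80):
    """Return (candidate, score) for the top scorer, or (None, None) below threshold."""
    if not candidates:
        return None, None
    best = max(candidates, key=lambda c: token_sort_score(name, c))
    score = token_sort_score(name, best)
    return (best, score) if score >= threshold else (None, None)

candidates = ["daniel scott", "nancy wright", "thomas king"]
print(token_sort_score("king thomas", "thomas king"))  # word order does not matter
print(best_match_sketch("danil scott", candidates))    # one missing letter still matches
print(best_match_sketch("nancy w", candidates))        # too little overlap for the threshold
```

Sorting tokens first makes the score insensitive to word order, while the threshold keeps heavily abbreviated inputs such as `nancy w` from being assigned to a base name on weak evidence.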

In [9]:
results = []
for name, norm in zip(name_variations["Variation"], name_variations["Normalized"]):
    matched_name, score = get_best_match(norm, base_names)
    results.append((name, matched_name, score))


## Apply Fuzzy Matching
For each name variation, pass its normalized form to the fuzzy matching function and record the original variation, the matched base name, and the score.

In [10]:
matches_df = pd.DataFrame(results, columns=["Variation_Name", "Matched_Base_Name", "Score"])
matches_df.head(15)

   Variation_Name  Matched_Base_Name  Score
0     Thomas King        Thomas King  100.0
1      ThomasKing        Thomas King  100.0
2    Maria Garcia       Maria Garcia  100.0
3       MaryLewis         Mary Lewis  100.0
4        Nancy W.
5    Dani3l Scott       Daniel Scott   96.0
6      JOHN smith         John Smith  100.0
7   linda johnson      Linda Johnson  100.0
8    N@ncy Wright       Nancy Wright   96.0
9   William Davis      William Davis  100.0


## Display Matching Results
Create a DataFrame holding the original variation, the matched base name, and the matching score, then display its first rows. A variation with no match at or above the threshold has an empty matched name and score.

In [11]:
from sklearn.metrics import accuracy_score

y_true = name_variations['Matches_With_Base_Name'].fillna("No Match")
y_pred = matches_df['Matched_Base_Name'].fillna("No Match")

accuracy_score(y_true, y_pred)

0.95

## Evaluate Matching Accuracy
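Compare the predicted base names with the ground truth in `Matches_With_Base_Name` and compute the accuracy, that is, the fraction of variations whose prediction equals the expected value exactly. Missing values on either side are replaced with the label `"No Match"`, so a variation that correctly receives no match counts as a hit.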
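An accuracy figure alone does not show which variations were missed. The sketch below computes the same exact-match accuracy by hand and also returns the mismatched rows. The helper name `exact_match_accuracy` is hypothetical, and the input lists are made-up values for illustration, not the notebook's data.

```python
def exact_match_accuracy(y_true, y_pred, missing="No Match"):
    """Return (accuracy, mismatches), treating None as the `missing` label."""
    true = [missing if t is None else t for t in y_true]
    pred = [missing if p is None else p for p in y_pred]
    if len(true) != len(pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not true:
        return 0.0, []
    # Collect (row index, expected, predicted) for every disagreement
    mismatches = [(i, t, p) for i, (t, p) in enumerate(zip(true, pred)) if t != p]
    return 1 - len(mismatches) / len(true), mismatches

# Illustrative values only
y_true_demo = ["Thomas King", "Mary Lewis", "Nancy Wright", None]
y_pred_demo = ["Thomas King", "Mary Lewis", None, None]

accuracy, mismatches = exact_match_accuracy(y_true_demo, y_pred_demo)
print(accuracy)    # 0.75
print(mismatches)  # [(2, 'Nancy Wright', 'No Match')]
```

Listing the mismatches makes it easier to decide whether an error comes from normalization, from the scorer, or from the threshold.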

## Conclusion
This notebook matched name variations to base names by normalizing both sides and scoring candidates with token sort ratio, accepting a match only at a score of 80 or higher. Measured against the ground-truth column, the predictions reach an accuracy of 0.95. The approach could be improved by experimenting with other normalization strategies (the current one deletes digits and symbols, so `Dani3l` becomes `danil`), a different threshold, or other matching algorithms.