## Fuzzy string matching

The helper function `fuzzy_match_any_term` returns a tuple: `(True, matched_term)` if a match is found, and `(False, None)` otherwise.

Storing the matched term: inside `fuzzy_match_any_term`, the `best_match` and `best_score` variables keep track of the best-matching term seen so far and its score.

Creating the `flag` and `matched_term` columns:

- The `results` Series holds tuples of `(boolean, string or None)`.
- `df['flag']` is created by taking the first element (boolean) of each tuple.
- `df['matched_term']` is created by taking the second element (string or None) of each tuple.
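The tuple-unpacking step can be tried in isolation. The sketch below uses a small hand-written `results` Series as a stand-in for the real output of `apply` (the values are illustrative, not taken from the example data), and also shows one possible alternative that builds both columns in a single call.

```python
import pandas as pd

# Hypothetical stand-in for the Series of (flag, matched_term) tuples
results = pd.Series([(True, "apple"), (False, None), (True, "banana")])

# Same approach as in the function: pick one tuple element per column
flags = results.apply(lambda pair: pair[0])
matched = results.apply(lambda pair: pair[1])

# Alternative: expand the tuples into a two-column DataFrame in one step
unpacked = pd.DataFrame(
    results.tolist(), index=results.index, columns=["flag", "matched_term"]
)
print(unpacked)
```

Both forms should give the same columns; the second avoids running `apply` twice over the tuples.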

In [8]:
import pandas as pd
from rapidfuzz import fuzz, process

In [9]:

def flag_text_chunks_fuzzy(df, text_column, terms_list, threshold=90, scorer=fuzz.WRatio):
    """
    Checks each text chunk in a DataFrame column against a list of terms using fuzzy
    matching (RapidFuzz) and flags the column, indicating if a match was found and
    the matched term.

    Args:
        df: The pandas DataFrame.
        text_column: The name of the column containing the text chunks.
        terms_list: A list of terms to search for.
        threshold: The minimum similarity score (0-100) for a match to be considered.
        scorer: The RapidFuzz scorer function to use (default: fuzz.WRatio).

    Returns:
        A DataFrame with added columns: 'flag' (boolean) and 'matched_term'
        (str, or None when no term matched).
    """

    def fuzzy_match_any_term(text, terms, threshold, scorer):
        """
        Helper function to check if any word in a text chunk fuzzy matches any term
        above a given threshold and returns the matched term.
        """
        if not isinstance(text, str):  # Skip missing or non-string values (e.g. NaN)
            return False, None
        words = text.lower().split()  # Lowercase, then split on whitespace
        best_match = None
        best_score = -1

        for word in words:
            match = process.extractOne(word, terms, scorer=scorer)
            if match and match[1] >= threshold and match[1] > best_score:
                best_match = match[0]  # Store the matched term
                best_score = match[1]

        # Return (True, term) if some word cleared the threshold, else (False, None)
        if best_match is not None:
            return True, best_match
        return False, None

    # Apply the fuzzy matching and get both the flag and the matched term
    results = df[text_column].apply(
        lambda x: fuzzy_match_any_term(x, terms_list, threshold, scorer)
    )

    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Create new columns based on the results
    df['flag'] = results.apply(lambda x: x[0])
    df['matched_term'] = results.apply(lambda x: x[1])

    return df
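
One detail of the helper worth noting: `text.lower().split()` splits on whitespace only, so punctuation stays attached to words (for example, `"aple."` keeps its trailing period), which can lower the similarity score. A minimal sketch of a stricter tokenizer follows; the `tokenize` name and the regular expression are assumptions for illustration, not part of the function above.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Whitespace splitting keeps the trailing period on the last word
print("We have an appl here.".lower().split())

# The regex-based tokenizer drops it
print(tokenize("We have an appl here."))
```

Swapping something like this in for `split()` would be one way to compare bare words against the terms list.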


## Example usage


In [10]:

# Sample DataFrame
data = {'text': ["This is a sample text with the word aple.",
                 "Another example with bana and orange.",
                 "This text contains neither.",
                 "We have an appl here.",
                 "This mentions ORANGE juice.",
                 "grappe is close",
                 "bannana is also similar"]}
df = pd.DataFrame(data)


In [11]:
print(df)

                                        text
0  This is a sample text with the word aple.
1      Another example with bana and orange.
2                This text contains neither.
3                      We have an appl here.
4                This mentions ORANGE juice.
5                            grappe is close
6                    bannana is also similar


In [12]:

# List of terms to check for
terms = ["apple", "banana", "grape"]

# Flag the DataFrame based on fuzzy matches
df_flagged = flag_text_chunks_fuzzy(df, 'text', terms, threshold=85)


In [13]:

# Print the flagged DataFrame
print(df_flagged)

                                        text   flag matched_term
0  This is a sample text with the word aple.   True        apple
1      Another example with bana and orange.   True       banana
2                This text contains neither.  False         None
3                      We have an appl here.   True       banana
4                This mentions ORANGE juice.  False         None
5                            grappe is close   True        grape
6                    bannana is also similar   True       banana
