<a href="https://colab.research.google.com/github/SBOSE550/Customer-Data-Processing-and-Validation-System/blob/main/Data_Cleaning_and_Processing_with_Fuzzy_Matching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Cleaning and Processing with Fuzzy Matching

This notebook cleans customer data and matches customer names using fuzzy string matching.  
It processes two datasets:
- **Existing Customer Log** – corrects customer names based on fuzzy matches with a master list.
- **New Customer Log** – classifies each entry as either a genuinely new customer or a likely match for an existing master record.

The matching process is based on the following logic:
- If the fuzzy match score is **above 90**, the name is automatically corrected.
- If the score is **between 55 and 90**, the user is prompted to confirm the suggested match.
- If the score is **below 55**, no match is made, and the record is flagged accordingly.
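
The three bands above can be written as a small routing function. This is a minimal sketch rather than the notebook's own code: the name `route_match` and the returned labels are hypothetical, and a score of exactly 90 is treated as needing confirmation, following the "above 90" wording.

```python
def route_match(score, auto_threshold=90, review_threshold=55):
    """Map a 0-100 fuzzy score to an action label (labels are illustrative)."""
    if score > auto_threshold:
        return "auto_correct"   # high confidence: replace the name directly
    if score >= review_threshold:
        return "ask_user"       # ambiguous: ask for confirmation
    return "flag"               # too dissimilar: no match is made


for s in (100, 79, 40):
    print(s, route_match(s))
```

Keeping the thresholds as parameters makes it easy to experiment with stricter or looser bands.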

The final cleaned data is consolidated and exported to an Excel workbook with multiple sheets.

## Importing Required Libraries

We import the necessary libraries, including:
- **pandas** and **numpy** for data manipulation.
- **fuzzywuzzy** for fuzzy string matching.
- **openpyxl** for Excel file operations.

In [1]:
!pip install fuzzywuzzy
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz, process
import openpyxl
import os

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl.metadata (4.9 kB)
Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0




## Data Extraction Functions

The following function extracts data from a specified sheet in an Excel file.

## Data Preprocessing

We clean the customer name and FPR (or salesperson) fields by stripping whitespace and converting text to lowercase.
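
As a rough illustration of what this normalization does to individual values, here is a plain-Python sketch. `normalize_text` is a hypothetical helper, not part of the notebook; it applies the same strip-and-lowercase steps to a single string and passes missing values through unchanged.

```python
def normalize_text(value):
    """Strip surrounding whitespace and lowercase; leave missing values as None."""
    if value is None:
        return None
    return str(value).strip().lower()


# Hypothetical raw values as they might appear in a log.
raw_names = ["  ABC Food Corp ", "XYZ Retailers", None]
print([normalize_text(v) for v in raw_names])
```

Normalizing both the logs and the master list matters because the FPR filter applied during matching is an exact equality comparison.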
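
## Classifying Existing Customers

This function matches each customer name in the existing customer log against the master list, restricted to customers of the same FPR, using fuzzy matching.
- If the best match score is above 90, the customer name is automatically corrected.
- If the score is between 55 and 90 (inclusive), the user is prompted to confirm the replacement.
- If the score is below 55, or the user declines the suggestion, the record is collected for further review (e.g., via mail).
- If the master list has no customers for the record's FPR, fuzzy matching is skipped and the record is left unchanged.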
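
To get a feel for how scores fall into these bands, the sketch below computes a 0-100 similarity with the standard library's `difflib.SequenceMatcher`. This is a stand-in chosen so the example runs without extra packages; `similarity` is a hypothetical helper, and its scores are not guaranteed to equal `fuzz.ratio` for every pair of strings.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score for two strings (difflib-based stand-in)."""
    return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))


# Illustrative pairs: (name in the log, name in the master list).
pairs = [
    ("abc food corp", "abc food corporation"),
    ("pqr restaurants", "pqr restaurants pvt. ltd."),
    ("xyz retailers", "xyz retailers"),
]
for logged, master in pairs:
    print(f"{logged!r} vs {master!r}: {similarity(logged, master)}")
```

With these inputs the first two pairs score 79 and 75, so they would be put to the user for confirmation, while the identical pair scores 100 and would be accepted automatically.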

## Classifying New Customers

This function processes the new customer log. It attempts to match each new customer against the master records of the same FPR.
- If a strong match is found (score ≥ 90), the customer name is replaced with the master name automatically and the row is treated as an existing customer.
- For scores from 55 up to (but not including) 90, the user is prompted to confirm the match.
- If no sufficient match is found (score < 55), the customer is classified as a genuine new customer.
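
The per-FPR lookup described above can be sketched in plain Python as follows. The function name `best_match_for`, the sample records, and the use of `difflib` in place of `process.extractOne` are all assumptions made for illustration; `extractOne` defaults to a different scorer (`fuzz.WRatio`), so real scores will differ.

```python
from difflib import SequenceMatcher


def best_match_for(name, fpr, master_records):
    """Return (best_name, score) among master customers of the same FPR, or (None, 0)."""
    candidates = [r["Customer Name"] for r in master_records if r["FPR"] == fpr]
    if not candidates:
        return None, 0  # no master customers for this FPR
    scored = [
        (c, int(round(100 * SequenceMatcher(None, name, c).ratio())))
        for c in candidates
    ]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical master data, already stripped and lowercased.
master_records = [
    {"Customer Name": "abc foods", "FPR": "john doe"},
    {"Customer Name": "xyz retailers", "FPR": "jane smith"},
]

print(best_match_for("abc foods", "john doe", master_records))
print(best_match_for("pqr restaurant", "emily white", master_records))
```

When no master customers exist for the FPR, the sketch returns a score of 0, which corresponds to the case where the notebook treats the row as a genuine new customer.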

In [9]:
def extract_data_from_sheet(excel_file, sheet_name):
    """Extracts data from a specific sheet in an Excel file.

    Args:
        excel_file: Path to the Excel file.
        sheet_name: Name of the sheet to extract data from.

    Returns:
        A list of lists representing the data in the sheet, or None if the sheet doesn't exist.
    """
    try:
        workbook = openpyxl.load_workbook(excel_file)
        sheet = workbook[sheet_name]  # Access the sheet by name
        data = []
        for row in sheet.iter_rows():
            row_data = [cell.value for cell in row]
            data.append(row_data)
        return data
    except KeyError:
        print(f"Sheet '{sheet_name}' not found.")
        return None
    except FileNotFoundError:
        print(f"File '{excel_file}' not found.")
        return None

# Preprocessing function
def preprocess_column(df, column1, column2):
    """Strip whitespace and lowercase the text in two columns of df (modified in place)."""
    df[column1] = df[column1].str.strip().str.lower()
    df[column2] = df[column2].str.strip().str.lower()
    return df

def classify_existing_customer(existing_customers_df, master_df):
  mail=[]
  for index, row in existing_customers_df.iterrows():
    customer_name = row["Customer Name"]
    FPR = row["Name"]

    # Filter master list by FPR (if applicable)
    if FPR:
      FPR_customers = master_df[master_df["FPR"] == FPR]
    else:
      FPR_customers = master_df.copy()  # Consider all customers if no FPR filter

    # Check if there are no matching FPR/customers
    if FPR_customers.empty:
      print(f"No customers found for FPR: {FPR}. Skipping fuzzy matching for '{customer_name}'.")
      continue

    # Perform fuzzy matching
    matches = FPR_customers["Customer Name"].apply(lambda x: fuzz.ratio(x.lower(), customer_name.lower()))
    best_match_idx = matches.idxmax()
    best_match_score = matches.max()

    if best_match_score > 90:
      existing_customers_df.at[index, "Customer Name"] = master_df.loc[best_match_idx, "Customer Name"]
      print(f"Auto-corrected '{customer_name}' to '{master_df.loc[best_match_idx, 'Customer Name']}' under {FPR} (Score: {best_match_score})")

    elif 55 <= best_match_score <= 90:  # Adjust threshold for suggestions
      suggestion = master_df.loc[best_match_idx, "Customer Name"]
      print(f"Suggested match for '{customer_name}': '{suggestion}' Under {FPR} (Score: {best_match_score})")
      user_input = input("Replace? (yes/no): ").strip().lower()
      if user_input == 'yes':
        existing_customers_df.at[index, "Customer Name"] = suggestion
        print(f"Replaced '{customer_name}' with '{suggestion}' under {FPR}")
      else:
        print(f"Not matching with existing data '{customer_name}' under {FPR}")
        mail.append(row)

    else:
      print(f"No match found for '{customer_name}' under {FPR} (Score: {best_match_score})")
      mail.append(row)

  mail_df = pd.DataFrame(mail, columns=existing_customers_df.columns)

  print("Cleaning complete for existing customers")
  return existing_customers_df, mail_df

# Function to classify new customers based on fuzzy matching
def classify_new_customers(master_df, new_customers):
    genuine_new = []
    existing_customers = []

    for index, row in new_customers.iterrows():
        new_customer_name = row['Customer Name']
        FPR = row['Name']

        # Filter master list by salesperson (if applicable)
        if FPR:
          master_names = master_df[master_df["FPR"] == FPR]
        else:
          master_names = master_df.copy()  # Consider all customers if no salesperson filter

        # Check if there are no matching salespeople/customers
        if master_names.empty:
          print(f"No customers found for FPR: {FPR}. Skipping fuzzy matching for '{new_customer_name}'.")
          genuine_new.append(row)  # Add to genuine_new if no match found
          continue


        # Perform fuzzy matching against the master dataset.
        # extractOne returns a (match, score) tuple, or None when there are no candidates.
        result = process.extractOne(new_customer_name, master_names['Customer Name'].tolist())
        if result:
            match, score = result
        else:
            match, score = None, 0  # Default values if no match

        if score >= 90:
            # Direct match; replace with Master dataset's name
            row['Customer Name'] = match
            print(f"Potential Match Found:\nNew Customer: {new_customer_name}\nMaster Dataset Match: {match} under {FPR} (Score: {score})")
            existing_customers.append(row)
        elif 55 <= score < 90:
            # Prompt for manual input
            print(f"Potential Match Found:\nNew Customer: {new_customer_name}\nMaster Dataset Match: {match} under {FPR} (Score: {score})")
            user_input = input("Is this a match? (yes/no): ").strip().lower()

            if user_input == "yes":
                row['Customer Name'] = match
                existing_customers.append(row)
            else:
                genuine_new.append(row)
        else:
            # Genuine new customer
            genuine_new.append(row)

    # Convert lists back to DataFrames
    genuine_new_df = pd.DataFrame(genuine_new, columns=new_customers.columns)
    existing_customers_df = pd.DataFrame(existing_customers, columns=new_customers.columns)
    genuine_new_df['Business Status'] = 'Inactive'
    genuine_new_df['Sub Status'] = 'New'


    return genuine_new_df, existing_customers_df




## Data Execution

The following section demonstrates how to:
1. Load the master dataset from an Excel file.
2. Load new and existing customer logs from CSV files.
3. Preprocess the data.
4. Apply fuzzy matching to classify and clean customer records.
5. Consolidate the data and export the final result to an Excel workbook.
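
The consolidation in step 5 also removes repeated visit entries by keeping one row for each combination of `Name`, `For date`, and `Customer Name`. The plain-Python sketch below mimics that keep-the-first behaviour on a list of dicts; the helper name `drop_duplicate_visits` and the sample records are hypothetical.

```python
def drop_duplicate_visits(records, keys=("Name", "For date", "Customer Name")):
    """Keep the first record for each distinct combination of the key fields."""
    seen = set()
    unique = []
    for rec in records:
        signature = tuple(rec[k] for k in keys)
        if signature not in seen:
            seen.add(signature)
            unique.append(rec)
    return unique


# Hypothetical visit records: the first two describe the same visit.
records = [
    {"Name": "john doe", "For date": "2024-01-05", "Customer Name": "abc foods"},
    {"Name": "john doe", "For date": "2024-01-05", "Customer Name": "abc foods"},
    {"Name": "john doe", "For date": "2024-01-06", "Customer Name": "abc foods"},
]
unique = drop_duplicate_visits(records)
print(len(unique))
```

Only the listed key fields decide what counts as a duplicate, so two rows that differ in other columns are still collapsed into one.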

In [19]:

# Example usage
excel_file = "master dumy.xlsx"
sheet_name = "master dumy"
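data = extract_data_from_sheet(excel_file, sheet_name)
if data is None:
    # extract_data_from_sheet has already printed the reason
    raise ValueError("Master data could not be loaded.")
master_df = pd.DataFrame(data[1:], columns=data[0])  # first row holds the column headers
new_customer_df = pd.read_csv("New Customer Log.csv")
existing_customers_df = pd.read_csv("Existing Cutomer log.csv")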


# Preprocess customer names and FPR fields
master_df = preprocess_column(master_df, 'Customer Name','FPR')
new_customer_df = preprocess_column(new_customer_df, 'Customer Name','Name')
existing_customers_df = preprocess_column(existing_customers_df, 'Customer Name','Name')



# Clean the existing customer log; rows without a confirmed match are returned in mail_df
clean_existing_customers, mail_df = classify_existing_customer(existing_customers_df, master_df)

# Split the new customer log into genuinely new customers and matches to the master list
genuine_new_df, new_existing_customers_df = classify_new_customers(master_df, new_customer_df)

# Tag each group with its customer type
genuine_new_df["Customer_type"] = "new"
new_existing_customers_df["Customer_type"] = "existing"
clean_existing_customers["Customer_type"] = "existing"
mail_df["Customer_type"] = "new"


Visit_df = pd.concat([genuine_new_df, clean_existing_customers, new_existing_customers_df, mail_df], axis=0)

# Sort by the 'For date' column
Visit_df.sort_values(by="For date", inplace=True)
# Drop duplicate rows based on a subset of columns (only one entry is kept if the same customer was entered more than once under the same name on the same day)
Visit_df.drop_duplicates(subset=['Name', 'For date', 'Customer Name'], inplace=True)

# Define the file name for the workbook
file_name = "Data_Report.xlsx"

# Use ExcelWriter to write multiple sheets
with pd.ExcelWriter(file_name, engine='openpyxl') as writer:
    Visit_df.to_excel(writer, sheet_name='Visit Data', index=False)
    genuine_new_df.to_excel(writer, sheet_name='Genuine New Data', index=False)
    mail_df.to_excel(writer, sheet_name='Mail Data', index=False)

print(f"Data successfully written to {file_name}")


Suggested match for 'abc food corp': 'abc food corporation' Under john doe (Score: 79)
Replace? (yes/no): yes
Replaced 'abc food corp' with 'abc food corporation' under john doe
Auto-corrected 'xyz retailers' to 'xyz retailers' under jane smith (Score: 100)
Suggested match for 'pqr restaurants': 'pqr restaurants pvt. ltd.' Under david lee (Score: 75)
Replace? (yes/no): yes
Replaced 'pqr restaurants' with 'pqr restaurants pvt. ltd.' under david lee
Cleaning complete for existing customers
Potential Match Found:
New Customer: abc foods
Master Dataset Match: abc foods under john doe (Score: 100)
No customers found for FPR: emily white. Skipping fuzzy matching for 'pqr restaurant'.




## Summary

- **Data Extraction:** Reads master data from an Excel sheet and customer logs from CSV files.
- **Preprocessing:** Standardizes customer names and salesperson identifiers.
- **Fuzzy Matching:** Uses fuzzy matching to correct and classify customer names with auto-correction for high-confidence matches and user prompts for ambiguous cases.
- **Consolidation & Export:** Merges the processed records and exports them into an organized Excel report.

The thresholds, error handling, and matching logic can be adjusted as needed for a specific use case.