### Importing Required Libraries

This cell imports various Python libraries needed for data processing, machine learning, and text analysis.

In [1]:
import pandas as pd
import re
from time import time
from rapidfuzz import process, fuzz
from deep_translator import GoogleTranslator
import random
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
from sklearn.metrics.pairwise import cosine_similarity
import joblib
import os

### Loading and Preparing the Dataset

This section loads the Excel file containing product matching data.  
It reads two sheets:  
1. **Master File** - Likely contains reference product names.  
2. **Dataset** - Contains new product names that need to be matched.  
Duplicates in the dataset are removed to ensure data consistency.

In [2]:
# Load Excel dataset
file_path = "Product Matching Dataset.xlsx"  # Edit this path to train on a different dataset
xls = pd.ExcelFile(file_path)

# Read both sheets
master_file = pd.read_excel(xls, sheet_name="Master File")
dataset = pd.read_excel(xls, sheet_name="Dataset")

# Remove duplicate entries from the dataset
dataset = dataset.drop_duplicates()

### Identifying Missing Product Names

This section extracts product names from both the dataset and master file  
to identify missing product names that are present in the master file but not in the dataset.

In [3]:
# Get missing product names
existing_names = set(dataset["marketplace_product_name_ar"].unique())
all_names = set(master_file["product_name_ar"].unique())
missing_names = list(all_names - existing_names)
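
As a minimal sketch of the set difference used above, the following runs on a few toy names instead of the Excel sheets. The lists `master_names` and `dataset_names` are hypothetical stand-ins for the two columns, not data from the real file.

```python
# Hypothetical toy data standing in for the two Excel columns.
master_names = ["فلاجيل 500 مجم 20 قرص", "اماريل 3 مجم 30 قرص", "انتوكس 30 قرص"]
dataset_names = ["فلاجيل 500 مجم 20 قرص", "فلاجيل 500 مجم 20 قرص"]

# Names that appear in the master list but never in the dataset.
missing = sorted(set(master_names) - set(dataset_names))
print(len(missing), "names have no training examples yet")
```

Names found this way have no seller-side examples at all, which is why the notebook goes on to generate synthetic variants for them.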

### Function: `augment_name(name)`

This function generates **augmented variations** of a given product name  
to simulate different possible representations of the same name.  
It applies multiple random transformations such as:
- **Shuffling words** within the name
- **Removing the first `0` digit** found in the name
- **Deleting random characters** from words
- **Replacing random characters** with Arabic letters
- **Appending random suffixes** (e.g., `"جديد"`, `"سعر جديد"`)

This is useful for training models that need to handle variations in  
product naming conventions.


In [4]:
def augment_name(name):
    """
    Generates augmented variations of the input name by applying random transformations.
    
    Args:
        name (str): The original product name.

    Returns:
        list: A list of unique augmented product names.
    """
    augmented_names = set() # Store unique augmented names

    for _ in range(70): # Generate multiple variations
        modified_name = name

        # Randomly shuffle words in the name (50% probability)
        if random.random() < 0.5:
            words = modified_name.split()
            if len(words) > 1:
                random.shuffle(words)
                modified_name = " ".join(words)

        # Randomly remove the first '0' digit from the name (50% probability)
        if random.random() < 0.5:
            modified_name = re.sub("0", "", modified_name, count=1)

        words = modified_name.split()

        for i, word in enumerate(words):
            # Randomly delete a character from words longer than 5 letters (50% probability)
            if len(word) > 5 and random.random() < 0.5:
                idx = random.randint(0, len(word) - 1)
                word = word[:idx] + word[idx+1:]

            # Randomly replace a character with an Arabic letter (50% probability)
            if len(word) > 5 and random.random() < 0.5:
                idx = random.randint(0, len(word) - 1)
                new_char = random.choice("ابتثجحخدذرزسشصضطظعغفقكلمنهوي")
                word = word[:idx] + new_char + word[idx+1:]

            # Store the word after both edits so the replacement does not discard the deletion
            words[i] = word

        modified_name = " ".join(words)
        
        # Randomly append a suffix (50% probability)
        if random.random() < 0.5:
            suffix = random.choice(["جديد", "سعر", "سعر جديد", "س ج"])
            modified_name = f"{modified_name} {suffix}".strip()
            
        # Add the modified name to the set (ensuring uniqueness)
        augmented_names.add(modified_name)

    return list(augmented_names)
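
`augment_name` draws from the module-level `random` generator without a seed, so the augmented set changes from run to run. The sketch below isolates one transformation (character replacement) and drives it with a seeded `random.Random` instance to show how a run could be made repeatable. The helper `replace_one_char` is a hypothetical name introduced only for illustration.

```python
import random

ARABIC_LETTERS = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"

def replace_one_char(word, rng):
    """Replace one randomly chosen character of `word` with a random Arabic letter."""
    idx = rng.randrange(len(word))
    return word[:idx] + rng.choice(ARABIC_LETTERS) + word[idx + 1:]

# A fixed seed makes the same variant come out on every run.
variant_a = replace_one_char("كولوفيرين", random.Random(42))
variant_b = replace_one_char("كولوفيرين", random.Random(42))
print(variant_a, variant_a == variant_b)
```

Calling `random.seed(...)` once before the augmentation loop would have a similar effect on the notebook's own function.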

### Augmenting Missing Product Names and Expanding the Dataset

This section generates additional variations of missing product names  
to enhance the dataset for better model training. The steps include:

1. **Extracting product details** (SKU & price) from the master file.
2. **Generating augmented variations** of missing product names using the `augment_name()` function.
3. **Storing the augmented names** along with the corresponding SKU and price.
4. **Appending the augmented data** to the original dataset.


In [5]:
# Collect augmented rows in a list (converted to a DataFrame below)
augmented_data = []

# Iterate over missing names and generate augmented samples
for true_name in missing_names:
    # Get SKU and price from master file
    product_info = master_file.loc[master_file["product_name_ar"] == true_name, ["sku", "price"]].values
    if len(product_info) == 0:
        continue  # Skip if no match found (shouldn't happen)

    sku, price = product_info[0]  # Extract SKU and price

    # Generate augmented names
    augmented_names = augment_name(true_name)

    for aug_name in augmented_names:
        augmented_data.append({
            "seller_item_name": aug_name,
            "marketplace_product_name_ar": true_name,
            "sku": sku,
            "price": price  # Use actual price from master file
        })

# Convert to DataFrame
augmented_df = pd.DataFrame(augmented_data)

# Append to the original dataset
dataset = pd.concat([dataset, augmented_df], ignore_index=True)

### Arabic-to-English Number Conversion & English-to-Arabic Translation Dictionary

This section prepares necessary data structures for handling product name translations and number conversions:
1. **Arabic Number Conversion**: Maps Arabic numerals (٠١٢٣٤٥٦٧٨٩) to English numerals (0123456789).
2. **Translation Dictionary**: Creates a dictionary mapping English product names to their Arabic counterparts.
3. **Fuzzy Matching List**: Extracts a list of English product names from the master file for similarity matching.
4. **English Character Detection Function**: Checks if a given text contains English letters.


In [6]:
# Arabic number conversion dictionary
arabic_to_english_numbers = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

# Create a dictionary for English-to-Arabic translation from the Master File
translation_dict = dict(zip(master_file["product_name"].astype(str).str.lower(), master_file["product_name_ar"].astype(str)))

# List of English product names from the Master File for fuzzy matching
master_names_en = list(translation_dict.keys())

# Function to check if text contains English characters
def contains_english(text):
    return bool(re.search("[A-Za-z]", text))
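
A short usage sketch of the two helpers defined above. The sample strings are illustrative only, and `Flagyl 500` is a hypothetical English spelling, not a value taken from the dataset.

```python
import re

# Same translation table and helper as in the cell above.
arabic_to_english_numbers = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

def contains_english(text):
    return bool(re.search("[A-Za-z]", text))

sample = "فلاجيل ٥٠٠ مجم"
converted = sample.translate(arabic_to_english_numbers)

print(converted)                       # digits become 500
print(contains_english(converted))     # False: no Latin letters
print(contains_english("Flagyl 500"))  # True
```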

### Function: `translate_to_arabic(text)`

This function translates English product names to Arabic using two approaches:
1. **Fuzzy Matching with Master File**:
   - Finds the closest English product name match from the Master File.
   - If the similarity score is **≥ 50%**, it returns the corresponding Arabic name.
2. **Google Translation (Fallback)**:
   - If no strong match is found, it translates the text using `GoogleTranslator`.
   - Ensures the output is actually in Arabic before returning it.
   - This fallback is unreliable and often fails because of connection or API problems,
     but it is used because it is free and does not require an API key.
3. **Error Handling & Logging**:
   - If translation raises an exception, it prints an error and returns the input text.
   - If translation returns text that still contains English letters, it prints a warning showing the closest match and its score, then returns the input text.

This keeps translation tied to the Master File whenever a close match exists and falls back to the input text when no usable Arabic output is available.


In [None]:
# Function to translate English to Arabic (fuzzy match first, Google Translate as fallback)
def translate_to_arabic(text):
    """
    Translates English product names to Arabic using:
    1. Fuzzy matching with the Master File
    2. Google Translator (fallback)

    Args:
        text (str): The input text (product name).

    Returns:
        str: Arabic translation of the product name.
    """
    if contains_english(text):   # Only translate if it contains English
        # Normalize text before processing
        text_lower = text.lower().strip()
        text = text.replace('.', ' ')

        # Find the closest match from the Master File
        match, score, _ = process.extractOne(text_lower, master_names_en, scorer=fuzz.ratio)

        # If the match is strong enough (score of 50 or higher), use the Master File translation
        if score >= 50:
            return translation_dict[match]

        try:
            # Attempt translation using GoogleTranslator
            translated_text = GoogleTranslator(source="english", target="arabic").translate(text_lower)
            # Ensure translation is in Arabic
            if translated_text and not contains_english(translated_text):
                return translated_text
            else:
                print(f"Translation did not return Arabic text for: ({text}) and instead return ({translation_dict[match]}) and score : {score}")
                return text  # Return original if translation is not in Arabic
        except Exception as e:
            print(f"Translation failed for {text}: {e}")
    return text  # Return original if translation fails
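
The threshold-based lookup can be sketched without network access or third-party packages. This version swaps in `difflib.SequenceMatcher` from the standard library for `rapidfuzz`, so its scores may differ somewhat from `fuzz.ratio`. The `lookup` dictionary, its English keys, and the helper names are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical English -> Arabic lookup standing in for translation_dict.
lookup = {
    "amaryl 3 mg 30 tablets": "اماريل 3 مجم 30 قرص",
    "antox 30 tablets": "انتوكس 30 قرص",
}

def best_match(query, choices):
    """Return (choice, score) for the closest choice, with score on a 0-100 scale."""
    scored = [(c, 100 * SequenceMatcher(None, query, c).ratio()) for c in choices]
    return max(scored, key=lambda pair: pair[1])

def lookup_translation(text, threshold=50):
    """Return the Arabic name of the closest key, or None if the score is below threshold."""
    match, score = best_match(text.lower().strip(), list(lookup))
    return lookup[match] if score >= threshold else None

print(lookup_translation("Amaryl 3mg 30 tabs"))  # close enough to the first key
print(lookup_translation("xyz"))                 # None: nothing scores 50 or more
```

A threshold of 50 is permissive: it accepts matches that share only about half of their characters, so raising it trades coverage for fewer wrong translations.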

### Arabic Text Normalization & Cleaning Functions

This section defines functions to **clean and normalize Arabic text** for consistency and better processing.  

#### **Steps Included:**
1. **Remove Diacritics**: Eliminates Arabic **tashkeel** (harakat) such as `َ ُ ِ ّ` to normalize text.
2. **Remove Unwanted Words**: Filters out specific words like `"جديد"` and `"سعر"`, which may add noise.
3. **Standardize Arabic Characters**:
   - Converts different forms of **Alef** (`أ, إ, آ, ٱ`) to `"ا"`.
   - Normalizes **Ta Marbuta** (`ة`) to `"ه"`, and **Ya** (`ى`) to `"ي"`.
   - Converts **Hamza-based letters** (`ؤ, ئ`) to `"و"` and `"ي"`.
4. **Remove Non-Arabic Characters**: Removes every character outside the Arabic Unicode block except ASCII digits, spaces, `%`, and slashes.
5. **Convert Arabic Numerals**: Replaces **Arabic digits** (`٠١٢٣٤٥٦٧٨٩`) with **English digits** (`0123456789`).
6. **Ensure Proper Spacing**: Standardizes spaces and removes unnecessary symbols.

In [8]:
# Function to remove diacritics and normalize Arabic text
def remove_diacritics(text):
    arabic_diacritics = re.compile("[\u064B-\u0652]") # Arabic diacritic range
    return re.sub(arabic_diacritics, "", text)

# Function to remove specific unwanted words ("جديد" with variations & "سعر")
def remove_unwanted_words(text):
    text = re.sub(r"جدي+د", "", text)  # Remove "جديد" with varying "ي" count
    text = re.sub(r"\bسعر\b", "", text)  # Remove exact match "سعر"
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra spaces
    return text

def normalize_arabic(text):
    text = str(text).strip()
    text = remove_diacritics(text)
    text = text.replace("أ", "ا").replace("إ", "ا").replace("آ", "ا")  # Normalize Alef
    text = text.replace("ى", "ي").replace("ة", "ه").replace("ٱ", "ا")  # Normalize common variations
    text = text.replace("ؤ", "و").replace("ئ", "ي")  # Normalize more variations
    text = re.sub(r"[^\u0600-\u06FF0-9 %\\/]", "", text)  # Remove non-Arabic characters except numbers
    text = text.translate(arabic_to_english_numbers)  # Convert Arabic numbers to English
    text = re.sub(r"(\d+)", r" \1 ", text).strip()  # Add spaces before and after numbers
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra spaces
    text = re.sub(r"ـ+", "", text)  # Remove tatweel (kashida) elongation characters
    text = remove_unwanted_words(text)  # Remove "جديد" variations and "سعر"
    return text
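
To make the effect of these steps concrete, here is a condensed sketch that applies a subset of them (diacritics, Alef/Ya/Ta Marbuta unification, digit conversion, spacing) to one string. `normalize_sketch` is a hypothetical, simplified stand-in for `normalize_arabic`; it skips the character filtering and unwanted-word removal.

```python
import re

DIGITS = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

def normalize_sketch(text):
    text = re.sub("[\u064B-\u0652]", "", text)               # strip tashkeel
    text = re.sub("[\u0622\u0623\u0625\u0671]", "ا", text)   # unify Alef forms
    text = text.replace("ى", "ي").replace("ة", "ه")          # Ya / Ta Marbuta
    text = text.translate(DIGITS)                            # Arabic-Indic digits -> 0-9
    text = re.sub(r"(\d+)", r" \1 ", text)                   # isolate digit runs
    return re.sub(r"\s+", " ", text).strip()                 # collapse whitespace

print(normalize_sketch("كولوفيرين أ ٣٠قرص"))
```

After normalization the digits are converted and separated from the adjacent word, and the hamza on the Alef is dropped, so spelling variants of the same product collapse to one string.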

### Apply normalization to Arabic names in master file and dataset

In [9]:
# Apply normalization to Arabic names
master_file["new_product_name_ar"] = master_file["product_name_ar"].astype(str).apply(normalize_arabic)
dataset["new_marketplace_product_name_ar"] = dataset["marketplace_product_name_ar"].astype(str).apply(normalize_arabic)

# Translate only English seller names to Arabic
dataset["new_seller_item_name"] = dataset["seller_item_name"].astype(str).apply(translate_to_arabic)

# Normalize translated Arabic names
dataset["new_seller_item_name"] = dataset["new_seller_item_name"].apply(normalize_arabic)

Translation did not return Arabic text for: (تلفاست 180 مجم 20قرص  س ج F) and insteat return (فاستل 180 مجم 20 قرص) and score : 38.297872340425535
Translation did not return Arabic text for: (كولوفرين A اقرص) and insteat return (كولوفيرين أ 30 قرص) and score : 18.181818181818176
Translation did not return Arabic text for: (كولوفرين* 3 شريط01 A) and insteat return (اماريل 3 مجم 30 قرص) and score : 31.57894736842105
Translation did not return Arabic text for: (كولوفرينِA3شريط) and insteat return (انتوكس 30 قرص) and score : 14.814814814814813
Translation did not return Arabic text for: (كولوفيرين A اقراص) and insteat return (كولوفيرين أ 30 قرص) and score : 17.14285714285714
Translation did not return Arabic text for: (تلفاست 120مجم 20 قرص س ج F) and insteat return (فاستل 120 مجم 20 قرص) and score : 34.78260869565217
Translation did not return Arabic text for: (تلفاست  120 مجم  اقراصGEG011) and insteat return (شان جيل ملطف 120 جم) and score : 31.818181818181824
Translation did not return A

### Extract relevant columns from the dataset and Split the data into training and testing sets

In [10]:
# Extract relevant columns
seller_names = dataset["new_seller_item_name"].astype(str).values
marketplace_names = dataset["new_marketplace_product_name_ar"].astype(str).values

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(seller_names, marketplace_names, test_size=0.2, random_state=42)

### Data Preparation and Feature Transformation
1. **Create TF-IDF Vectorizer**: We are using the `TfidfVectorizer` to convert text data into numerical features.
    - `analyzer='char'` specifies that we want to extract character-level features (as opposed to word-level features).
    - `ngram_range=(1, 3)` means the vectorizer will consider unigrams (single characters), bigrams (pairs of characters), and trigrams (triplets of characters) for feature extraction.
2. **Transform the training dataset into TF-IDF vectors**.
3. **Transform the testing dataset into TF-IDF vectors**.

In [11]:
# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(1, 3))

# Fit the vectorizer on the training set, then transform both splits into TF-IDF vectors
X_train_tfidf = vectorizer.fit_transform(X_train).astype(np.float32)  # Convert to 32-bit float
X_test_tfidf = vectorizer.transform(X_test).astype(np.float32)
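
To show which features `analyzer='char'` with `ngram_range=(1, 3)` produces, the sketch below lists the character n-grams of a short string in plain Python. `char_ngrams` is a hypothetical helper, not part of scikit-learn; the real vectorizer also lowercases, collapses whitespace, and applies TF-IDF weighting on top of these counts.

```python
def char_ngrams(text, min_n=1, max_n=3):
    """List every contiguous character n-gram of `text` for min_n <= n <= max_n."""
    return [text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)]

grams = char_ngrams("قرص")
print(grams)  # three unigrams, two bigrams, one trigram
```

Because a single-character typo changes only the few n-grams that overlap it, most features of a misspelled name still match the correct one, which suits noisy seller names.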

### Training the Logistic Regression Model

In [12]:
# Train Logistic Regression model
model = LogisticRegression(max_iter=100)
start_time = time()
model.fit(X_train_tfidf, y_train)
training_time = time() - start_time

### Evaluating the model

In [13]:
# Evaluate the model
y_pred = model.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred)

# Print Performance Metrics
print("Model Accuracy:", accuracy)
print("Training Time (s):", round(training_time, 2))

Model Accuracy: 0.9962661443349452
Training Time (s): 435.85


### Saving the Trained Model and Vectorizer
The models are saved in the "models" directory.

In [20]:
# Create the 'models' directory if it doesn't exist
os.makedirs("models", exist_ok=True)

# Define file paths
model_path = "models/logistic_regression_model.pkl"
vectorizer_path = "models/tfidf_vectorizer.pkl"

# Delete existing files if they exist
if os.path.exists(model_path):
    os.remove(model_path)
if os.path.exists(vectorizer_path):
    os.remove(vectorizer_path)

# Save the trained model and vectorizer in the 'models' directory
joblib.dump(model, model_path)  # Save the trained Logistic Regression model
joblib.dump(vectorizer, vectorizer_path)  # Save the TF-IDF vectorizer

print("Model and vectorizer saved successfully in the 'models' directory!")

Model and vectorizer saved successfully in the 'models' directory!


### Loading the Trained Model and Vectorizer

In [15]:
# Load the model and vectorizer from the 'models' directory
model = joblib.load("models/logistic_regression_model.pkl")
vectorizer = joblib.load("models/tfidf_vectorizer.pkl")

print("Model and vectorizer loaded successfully!")

Model and vectorizer loaded successfully!


### Compute Similarity Function

This function computes the similarity between a given seller name and the predicted marketplace name
based on the trained model and TF-IDF vectorizer. It also provides a confidence level based on similarity scores.

In [16]:
def compute_similarity(seller_name, model, vectorizer, high_threshhold=0.8, medium_threshhold=0.6, Unknown=0.4):
    """
    Predicts the marketplace name for a seller name and scores the match.

    Args:
        seller_name (str): The name of the seller's product.
        model: The fitted scikit-learn classifier used for predictions.
        vectorizer (sklearn.feature_extraction.text.TfidfVectorizer): The TF-IDF vectorizer used for text feature extraction.
        high_threshhold (float, optional): The threshold for classifying the similarity as "High". Default is 0.8.
        medium_threshhold (float, optional): The threshold for classifying the similarity as "Medium". Default is 0.6.
        Unknown (float, optional): The threshold below which the result is considered "Unknown". Default is 0.4.

    Returns:
        dict: A dictionary containing:
            - "seller_name" (str): The original seller's name.
            - "matched_name" (str): The predicted marketplace name or "Not Found".
            - "similarity_score" (float): The cosine similarity score between the seller name and matched name.
            - "confidence" (str): The confidence level based on similarity score ("High", "Medium", "Low", "Unknown").
            - "execution_time_ms" (float): The time taken to compute the similarity in milliseconds.
    """
    # Start time to calculate execution time
    start_time = time()

    # Transform seller name to TF-IDF vector
    seller_vector = vectorizer.transform([seller_name])

    # Predict the most likely marketplace name using the TF-IDF vector
    predicted_name = model.predict(seller_vector)[0]

    # Transform predicted name to TF-IDF vector
    predicted_vector = vectorizer.transform([predicted_name])

    # Compute cosine similarity
    similarity_score = cosine_similarity(seller_vector, predicted_vector)[0, 0]

    # Set confidence levels and handle low similarity cases
    if similarity_score < Unknown:
        matched_name = "Not Found"
        confidence = "Unknown"
        similarity_score = 0.0
    else:
        matched_name = predicted_name
        confidence = "High" if similarity_score > high_threshhold else "Medium" if similarity_score > medium_threshhold else "Low"

    execution_time = (time() - start_time) * 1000  # Convert to milliseconds

    return {
        "seller_name": seller_name,
        "matched_name": matched_name,
        "similarity_score": round(float(similarity_score), 4),
        "confidence": confidence,
        "execution_time_ms": round(execution_time, 2)
    }
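
The confidence bucketing inside `compute_similarity` can be read as a small standalone function. The sketch below is a hypothetical restatement (the name `confidence_label` and its parameter names are not from the notebook) using the 0.8 and 0.6 defaults and the 0.4 cutoff applied in the function body.

```python
def confidence_label(score, high=0.8, medium=0.6, not_found=0.4):
    """Map a cosine similarity score to a confidence label."""
    if score < not_found:
        return "Unknown"
    if score > high:
        return "High"
    if score > medium:
        return "Medium"
    return "Low"

for s in (0.95, 0.7448, 0.5, 0.1):
    print(s, confidence_label(s))
```

Scores exactly equal to `high` or `medium` fall into the lower bucket because the comparisons are strict, matching the `>` checks in the original function.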

### Example Usage of compute_similarity Function

In [17]:
# Example usage
example_seller_name = '*فلاجيل 500 مجم اقراص 15 ج' # Replace this with an actual example seller name to test the function.
result = compute_similarity(example_seller_name, model, vectorizer)

print("Model Accuracy:", accuracy)
print("Training Time (s):", round(training_time, 2))
print(result)

Model Accuracy: 0.9962661443349452
Training Time (s): 435.85
{'seller_name': '*فلاجيل 500 مجم اقراص 15 ج', 'matched_name': 'فلاجيل 500 مجم 20 قرص', 'similarity_score': 0.7448, 'confidence': 'Medium', 'execution_time_ms': 102.47}
