**Entity-Unit Extraction and Assignment**

This code extracts entity values and their units from text and writes them back in a uniform "number unit" form. It covers attributes such as weight, volume, voltage, and wattage. Here's how it works:

Libraries:

It uses spaCy for natural language processing and tokenization, re for regex-based matching, and pandas for CSV file handling.

Entity-Unit Map:

A dictionary (entity_unit_map) maps entity types (e.g., item weight, voltage) to a set of potential units associated with each (e.g., grams, kilograms for weight).

Text Extraction:

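The function extract_and_detect_first tokenizes each text field with spaCy, uses a regex to find number tokens, and checks whether the token that follows is a unit in the set defined for that entity type. The unit check is an exact, single-token comparison, so plural forms (e.g., "pounds") and multi-word units (e.g., "fluid ounce") are not detected as written.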
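As a point of comparison, the same "number followed by a valid unit" idea can be sketched with a regex alone. This is not the notebook's method: the function name find_first_quantity and the trimmed-down map are assumptions made for illustration, and the pattern is deliberately loose (it accepts a unit followed by extra letters, which is what lets "pounds" match "pound").

```python
import re

# Subset of the notebook's entity_unit_map, repeated so the sketch is self-contained
entity_unit_map = {
    "item_weight": {"gram", "kilogram", "ounce", "pound"},
    "item_volume": {"cubic foot", "fluid ounce", "litre", "millilitre"},
}

def find_first_quantity(text, entity_type):
    """Return (number, unit) for the first valid 'number unit' pair, else None."""
    units = entity_unit_map.get(entity_type)
    if not units:
        return None
    # Try longer unit names first so a multi-word unit is preferred over a shorter one
    alternatives = "|".join(re.escape(u) for u in sorted(units, key=len, reverse=True))
    pattern = re.compile(rf"(\d+(?:\.\d+)?)\s*({alternatives})", re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2).lower()

print(find_first_quantity("Net weight 3.75 pounds (1.7 kilogram)", "item_weight"))
print(find_first_quantity("Holds 2 fluid ounces", "item_volume"))
```

Because the unit must directly follow the number, "2 fluid ounces" is not misread as a weight of "2 ounce" when the entity type is item_weight.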

Value Assignment:

The assign_values function writes the detected number and unit back to the entity's own column in "number unit" format (e.g., "3.75 pound"). If several units are found in the same text, only the first one detected is kept.
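The "first one wins" rule relies on Python dictionaries preserving insertion order. A minimal sketch, assuming the detection step returns a unit-to-number dictionary (the helper name format_first and the sample values are hypothetical):

```python
# Hypothetical detection result: unit -> number, in order of appearance in the text
detected_units = {"pound": "3.75", "kilogram": "1.7"}

def format_first(detected):
    """Return 'number unit' for the first detected pair, or None if nothing was found."""
    if not detected:
        return None
    unit, number = next(iter(detected.items()))  # dicts preserve insertion order
    return f"{number} {unit}"

print(format_first(detected_units))
```

Note that each item comes out as (unit, number), so the two names have to be swapped when building the output string.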

CSV Processing:

The process_csv function reads a CSV file, processes the data row by row, extracts entity values for the columns named in the entity-unit map (like weight and volume), and then saves the result as updated_file.csv.

Usage:

Call process_csv with the path to the input CSV. The updated data is written to a separate file, updated_file.csv, in the current working directory.

In [None]:
import spacy
import re
import pandas as pd

# Load spaCy's small English model
nlp = spacy.load("en_core_web_sm")

# Entity-unit map
entity_unit_map = {
    'item_weight': {'gram', 'kilogram', 'microgram', 'milligram', 'ounce', 'pound', 'ton'},
    'maximum_weight_recommendation': {'gram', 'kilogram', 'microgram', 'milligram', 'ounce', 'pound', 'ton'},
    'voltage': {'kilovolt', 'millivolt', 'volt'},
    'wattage': {'kilowatt', 'watt'},
    'item_volume': {'centilitre', 'cubic foot', 'cubic inch', 'cup', 'decilitre', 'fluid ounce', 'gallon', 'imperial gallon', 'litre', 'microlitre', 'millilitre', 'pint', 'quart'}
}

# Function to extract and detect the first unit for each quantity
def extract_and_detect_first(text, entity_type):
    doc = nlp(text)
    detected_units = {}

    # Check if the entity type is in the map
    if entity_type not in entity_unit_map:
        return detected_units

    # Iterate over tokens to extract numbers and their associated units
    for token in doc:
        if re.fullmatch(r"\d+(?:\.\d+)?", token.text):  # Whole token must be a number (rejects "." or "1.2.3")
            next_token = doc[token.i + 1] if token.i + 1 < len(doc) else None
            if next_token and next_token.text in entity_unit_map.get(entity_type, {}):  # Match unit
                unit = next_token.text
                if unit not in detected_units:
                    detected_units[unit] = token.text  # Map unit -> number; only the first number per unit is kept

    return detected_units

# Function to assign extracted values based on entity types
def assign_values(row):
    detected_values = {}
    for entity_type, units in entity_unit_map.items():
        text = str(row.get(entity_type, ''))
        detected_units = extract_and_detect_first(text, entity_type)
        if detected_units:
            # Keep only the first detected pair, stored as (unit, number)
            detected_values[entity_type] = next(iter(detected_units.items()))

    # Write each value back in "number unit" order (e.g., "3.75 pound")
    for entity_type, (unit, number) in detected_values.items():
        row[entity_type] = f"{number} {unit}"

    return row

# Function to process CSV file and update relevant columns
def process_csv(file_path):
    # Read the CSV file into a pandas DataFrame
    df = pd.read_csv(file_path)

    # Process each row in the DataFrame
    for index, row in df.iterrows():
        # Extract and assign values for all quantities
        updated_row = assign_values(row)

        # Update the DataFrame with the new values
        df.loc[index] = updated_row

    # Save the updated DataFrame to a new file (the input CSV is not overwritten)
    df.to_csv('updated_file.csv', index=False)
    print("CSV file processed and updated.")

# Example usage
file_path = r"C:\Users\MYPC\AMAZON_ML\student_resource 3\merged_output.csv" # Replace with your CSV file path
process_csv(file_path)

**Measurement Extraction and Conversion**

This code processes text fields to extract size measurements (e.g., height, width, depth) given in centimeters, millimeters, meters, inches, or feet. Each value is converted to millimeters so that measurements in different units can be compared and ranked. The values written back to the CSV keep their original number and unit.

Libraries:

It uses spaCy for text tokenization, re for regex matching, and pandas for CSV processing.

Conversion Factors:

A dictionary (conversion_factors) stores conversion ratios for different units (e.g., 1 centimeter = 10 millimeters, 1 foot = 304.8 millimeters).
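As a worked example of these ratios, the sketch below converts a few values to millimeters. The helper name to_millimetres is an assumption for illustration. The dictionary mirrors the one defined in the code cell.

```python
# Same factors as the notebook's conversion_factors (unit -> millimeters per unit)
conversion_factors = {
    "centimetre": 10,
    "millimetre": 1,
    "metre": 1000,
    "inch": 25.4,
    "foot": 304.8,
}

def to_millimetres(value, unit):
    """Convert a value in the given unit to millimeters."""
    return value * conversion_factors[unit]

for value, unit in [(30, "centimetre"), (2, "foot"), (10, "inch")]:
    print(value, unit, "->", to_millimetres(value, unit), "mm")
```

Results for inches and feet are floats, so comparisons against expected values are safer with a tolerance (for example math.isclose) than with ==.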

Priority of Units:

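A priority list (unit_priority) defines which unit should take precedence when multiple units are detected (e.g., centimeters before inches). Only the commented-out version of extract_and_convert consults this list. The active version keeps every detected measurement regardless of unit.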
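One way such a priority list could be applied is to keep only the measurements expressed in the highest-priority unit that appears. The following is a minimal sketch under that assumption. The helper name keep_highest_priority and the (number, unit) tuple shape are hypothetical.

```python
# Same order as the notebook's unit_priority: earlier entries take precedence
unit_priority = ["centimetre", "inch", "millimetre", "foot", "metre"]

def keep_highest_priority(measurements):
    """Keep only (number, unit) pairs whose unit ranks highest among those present."""
    if not measurements:
        return []
    # The unit with the smallest index in unit_priority has the highest priority
    best_unit = min((unit for _, unit in measurements), key=unit_priority.index)
    return [(number, unit) for number, unit in measurements if unit == best_unit]

sample = [(10.0, "inch"), (25.4, "centimetre"), (3.0, "inch")]
print(keep_highest_priority(sample))
```

This is useful when a text states the same dimension twice in different units, since ranking both copies by size would otherwise count one physical dimension twice.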

Text Extraction and Conversion:

The function extract_and_convert extracts numerical values that are directly followed by a known unit and multiplies each by the appropriate conversion factor to get its size in millimeters. It returns (number, unit, millimeter value) tuples sorted from largest to smallest by millimeter value.
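The extract, convert, and sort steps can be sketched without spaCy by using a single regex. This is an illustrative variant, not the notebook's token-based implementation. The name extract_sorted is an assumption, and the pattern accepts trailing letters after a unit so that plurals such as "inches" still match.

```python
import re

conversion_factors = {
    "centimetre": 10,
    "millimetre": 1,
    "metre": 1000,
    "inch": 25.4,
    "foot": 304.8,
}

# Longer unit names first, so "centimetre" is tried before "metre"
_units = "|".join(sorted(conversion_factors, key=len, reverse=True))
_pattern = re.compile(rf"(\d+(?:\.\d+)?)\s*({_units})", re.IGNORECASE)

def extract_sorted(text):
    """Return (number, unit, mm_value) tuples, largest mm_value first."""
    found = []
    for number_text, unit_text in _pattern.findall(text):
        unit = unit_text.lower()
        number = float(number_text)
        found.append((number, unit, number * conversion_factors[unit]))
    return sorted(found, key=lambda item: item[2], reverse=True)

for item in extract_sorted("30 centimetre x 10 inches x 5 millimetre"):
    print(item)
```

Sorting on the millimeter value rather than the raw number is what makes "40 centimetre" rank above "1 foot".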

Value Assignment:

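The assign_values function assigns the top three extracted values by size: the largest goes to height, the second largest to width, and the third to depth. If only two values are found, they go to height and width and depth is set to an empty string. If only one value is found, it goes to whichever of height, width, or depth appears in entity_name, and the other two are set to empty strings. If nothing is found, or a single value cannot be matched to entity_name, all three are left as None and the existing cells are not changed.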
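The assignment rules can be condensed into a short sketch. It assumes measurements arrive as (number, unit, mm_value) tuples. The function name assign_dimensions and the dictionary return type are choices made for this example only.

```python
def assign_dimensions(measurements, entity_name=""):
    """Map ranked measurements to height/width/depth.

    None means "leave the existing cell alone"; "" means "clear the cell".
    """
    result = {"height": None, "width": None, "depth": None}
    ranked = sorted(measurements, key=lambda m: m[2], reverse=True)
    labels = [f"{number} {unit}" for number, unit, _ in ranked]

    if len(labels) >= 2:
        # Largest -> height, second -> width, third -> depth (or cleared)
        result["height"], result["width"] = labels[0], labels[1]
        result["depth"] = labels[2] if len(labels) >= 3 else ""
    elif len(labels) == 1:
        # A single value goes to the dimension named in entity_name, if any
        name = entity_name.lower()
        key = next((k for k in result if k in name), None)
        if key is not None:
            result = {k: "" for k in result}
            result[key] = labels[0]
    return result

three = [(10.0, "inch", 254.0), (30.0, "centimetre", 300.0), (5.0, "millimetre", 5.0)]
print(assign_dimensions(three))
print(assign_dimensions([(2.0, "metre", 2000.0)], "Width"))
```

Keeping None and "" distinct matters downstream: only non-None values are written to the DataFrame.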

CSV Processing:

The process_csv function reads a CSV file and, for each row, extracts size measurements from the text in the width column and assigns the ranked values to the height, width, and depth columns. The results are saved in a new CSV file, updated_file1.csv.

Usage:

Call process_csv with the path to the input CSV. The input file is only read, and the extracted measurements are written to updated_file1.csv.

In [None]:
import spacy
import re
import pandas as pd

# Load spaCy's small English model
nlp = spacy.load("en_core_web_sm")

# Conversion factors for units to millimeters
conversion_factors = {
    'centimetre': 10,          # Centimeter to millimeter
    'millimetre': 1,           # Millimeter as base
    'metre': 1000,             # Meter to millimeter
    'inch': 25.4,              # Inch to millimeter
    'foot': 304.8              # Foot to millimeter
}

# Priority list for units (referenced only by the commented-out extract_and_convert below)
unit_priority = ['centimetre', 'inch', 'millimetre', 'foot', 'metre']

# Function to extract and convert quantities from text
# def extract_and_convert(text):
#     doc = nlp(text)
#     measurements = []
#     detected_unit = None

#     # Iterate over tokens to extract numbers and their associated units
#     for token in doc:
#         if re.match(r"[\d.]+", token.text):  # Match numbers
#             next_token = doc[token.i + 1] if token.i + 1 < len(doc) else None
#             if next_token and next_token.text in conversion_factors:  # Match unit
#                 unit = next_token.text

#                 # Check if the unit is of higher priority or same as already detected one
#                 if detected_unit is None or unit_priority.index(unit) < unit_priority.index(detected_unit):
#                     # If a higher priority unit is detected, reset the measurements list
#                     detected_unit = unit
#                     measurements = []

#                 if unit == detected_unit:  # Only process the highest-priority unit
#                     number = float(token.text)
#                     mm_value = number * conversion_factors[unit]
#                     measurements.append((number, unit, mm_value))

#     # Sort the measurements by their millimeter value in descending order
#     sorted_measurements = sorted(measurements, key=lambda x: x[2], reverse=True)

#     return sorted_measurements

def extract_and_convert(text):
    doc = nlp(text)
    measurements = []

    for token in doc:
        if re.fullmatch(r"\d+(?:\.\d+)?", token.text):  # Whole token must be a number, so float() below cannot fail
            next_token = doc[token.i + 1] if token.i + 1 < len(doc) else None
            if next_token and next_token.text in conversion_factors:  # Match unit
                unit = next_token.text
                number = float(token.text)
                mm_value = number * conversion_factors[unit]
                measurements.append((number, unit, mm_value))

    # Sort the measurements by their millimeter value in descending order
    sorted_measurements = sorted(measurements, key=lambda x: x[2], reverse=True)

    return sorted_measurements


# Function to assign extracted values based on count
# def assign_values(extracted_values, entity_name):
#     # Initialize with None for height, width, and depth
#     height, width, depth = None, None, None

#     if len(extracted_values) >= 3:
#         # Assign largest to height, second largest to width, and third to depth
#         height = f"{extracted_values[0][0]} {extracted_values[0][1]}"
#         width = f"{extracted_values[1][0]} {extracted_values[1][1]}"
#         depth = f"{extracted_values[2][0]} {extracted_values[2][1]}"
#     elif len(extracted_values) == 2:
#         # Assign largest to height, second to width, and empty the depth cell
#         height = f"{extracted_values[0][0]} {extracted_values[0][1]}"
#         width = f"{extracted_values[1][0]} {extracted_values[1][1]}"
#         depth = ""
#     elif len(extracted_values) == 1:
#         # Check the entity_name and assign the only value based on that
#         if "height" in entity_name.lower():
#             height = f"{extracted_values[0][0]} {extracted_values[0][1]}"
#             width = ""
#             depth = ""
#         elif "width" in entity_name.lower():
#             width = f"{extracted_values[0][0]} {extracted_values[0][1]}"
#             height = ""
#             depth = ""
#         elif "depth" in entity_name.lower():
#             depth = f"{extracted_values[0][0]} {extracted_values[0][1]}"
#             height = ""
#             width = ""

#     return height, width, depth

def assign_values(extracted_values, entity_name):
    # Initialize with None for height, width, and depth
    height, width, depth = None, None, None

    # Sort extracted values by their millimeter value in descending order
    sorted_values = sorted(extracted_values, key=lambda x: x[2], reverse=True)

    # Assign top three values to height, width, and depth
    if len(sorted_values) >= 3:
        height = f"{sorted_values[0][0]} {sorted_values[0][1]}"
        width = f"{sorted_values[1][0]} {sorted_values[1][1]}"
        depth = f"{sorted_values[2][0]} {sorted_values[2][1]}"
    elif len(sorted_values) == 2:
        height = f"{sorted_values[0][0]} {sorted_values[0][1]}"
        width = f"{sorted_values[1][0]} {sorted_values[1][1]}"
        depth = ""
    elif len(sorted_values) == 1:
        # Check the entity_name and assign the only value based on that
        if "height" in entity_name.lower():
            height = f"{sorted_values[0][0]} {sorted_values[0][1]}"
            width = ""
            depth = ""
        elif "width" in entity_name.lower():
            width = f"{sorted_values[0][0]} {sorted_values[0][1]}"
            height = ""
            depth = ""
        elif "depth" in entity_name.lower():
            depth = f"{sorted_values[0][0]} {sorted_values[0][1]}"
            height = ""
            width = ""

    return height, width, depth

# Function to process CSV file and update height, width, and depth columns
def process_csv(file_path):
    # Read the CSV file into a pandas DataFrame
    df = pd.read_csv(file_path)

    # Iterate through each row and process the width column
    for index, row in df.iterrows():
        # Extract the width column text
        text = str(row['width'])

        # Extract and convert the values using the provided function
        extracted_values = extract_and_convert(text)

        # Assign the extracted values to height, width, depth based on the logic
        entity_name = str(row.get('entity_name', ''))  # Assumes an 'entity_name' column; str() guards against NaN
        height, width, depth = assign_values(extracted_values, entity_name)

        # Update the DataFrame with the new values
        if height is not None:
            df.at[index, 'height'] = height
        if width is not None:
            df.at[index, 'width'] = width
        if depth is not None:
            df.at[index, 'depth'] = depth

    # Save the updated DataFrame to a new file (the input CSV is not overwritten)
    df.to_csv('updated_file1.csv', index=False)
    print("CSV file processed and updated.")

# Example usage
file_path = r"E:\ML_challenge_DATASET\TEST\Check_detection\DETECTION_2\Merged_OUTPUT.csv"  # Replace with your CSV file path
process_csv(file_path)