# **Amazon ML Challenge**


---
### Problem Overview

We have a dataset with the following components:

1. **Image Link**: URL of the product image.
2. **Category**: The product category.
3. **Entity Name**: The attribute whose value has to be predicted (e.g., weight, height).
4. **Actual Value**: The true value of the entity (e.g., "50 grams", "15 cm"). This value is absent in the test data.

**Objective**: Train a machine learning model to predict the entity value (e.g., "12 grams" or "15 cm") from the image URLs in the test data.

**Evaluation Metric**: Predictions will be evaluated using the F1 score, which balances precision and recall.
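
As a reminder of how the metric combines the two quantities, here is a minimal sketch that computes precision, recall, and F1 from counts of true positives, false positives, and false negatives. The function name `f1_from_counts` and the example counts are illustrative assumptions; how the challenge assigns each prediction to one of these categories is not specified here.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts; illustrative helper."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts: 8 correct predictions, 2 wrong ones, 8 values missed
precision, recall, f1 = f1_from_counts(tp=8, fp=2, fn=8)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, so leaving many values unpredicted costs as much as predicting many of them wrongly.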

**Constraints**:
- Use only specific units provided in the appendix.
- Ensure the prediction file format adheres to the validation checker requirements.
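
A quick local check can catch badly formatted predictions before they reach the official checker. The sketch below assumes a prediction is either an empty string or a number followed by one space and an allowed unit. `ALLOWED_UNITS` is a small illustrative subset rather than the appendix list, and `is_valid_prediction` is a hypothetical helper, not the challenge's validator.

```python
import re

# Illustrative subset only; the real list comes from the challenge appendix
ALLOWED_UNITS = {"gram", "kilogram", "centimetre", "millilitre", "volt", "watt"}

# A non-negative number (integer or decimal), one space, then the unit text
PREDICTION_RE = re.compile(r"^(\d+(?:\.\d+)?) ([a-z ]+)$")

def is_valid_prediction(prediction):
    """Return True for '' or '<number> <allowed unit>' (assumed format)."""
    if prediction == "":
        return True
    match = PREDICTION_RE.match(prediction)
    return bool(match) and match.group(2) in ALLOWED_UNITS

for sample in ["500 gram", "15 cm", "12.5 millilitre", ""]:
    print(repr(sample), is_valid_prediction(sample))
```

Under these assumptions "15 cm" is rejected because the abbreviation is not in the allowed set, which is the kind of mistake the unit normalization later in this notebook is meant to prevent.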



# **Our Approach**

---
## ***1. Text Extraction from Images Using OCR***

   - **OCR Technology**: We use Optical Character Recognition (OCR) to extract text from the product images.
   - **Tool**: The EasyOCR library performs the extraction, converting text visible in the images into machine-readable strings.

## ***2. Data Preprocessing and Entity Extraction***

* **Data Preprocessing:** The code cleans and standardizes the extracted text (e.g., replacing decimal commas with periods and replacing characters that OCR commonly confuses with digits).

* **Entity Extraction:** The code extracts each entity's value and unit from the text using predefined unit mappings and regex patterns.

## ***1. Text Extraction from Images Using OCR***

---



## **Import Libraries**

In [4]:
!pip install easyocr


In [12]:
import pandas as pd
import requests
import cv2
import numpy as np
import easyocr
from concurrent.futures import ThreadPoolExecutor, as_completed
import re

## **Initialize EasyOCR Reader**
Creating a global EasyOCR reader instance to avoid reinitialization for each image.

In [13]:
# Global EasyOCR reader to avoid reinitializing for each image
reader = easyocr.Reader(['en'])




## **Define Helper Functions**
* Downloads an image from the given URL.

In [14]:
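def download_image(url):
    # Use a timeout so a slow or dead link cannot hang a worker thread
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Treat HTTP errors (404, 500, ...) as failures
    return response.content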


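* Converts image bytes to an OpenCV image, performs OCR to extract text, and returns the text.

In [15]:
def extract_text(image_bytes):
    # Decode the raw bytes into a grayscale OpenCV image
    arr = np.frombuffer(image_bytes, np.uint8)
    img = cv2.imdecode(arr, cv2.IMREAD_GRAYSCALE)
    if img is None:  # cv2.imdecode returns None when the bytes are not a valid image
        raise ValueError("Could not decode image bytes")
    # readtext returns (bounding box, text, confidence) tuples; keep only the text
    result = reader.readtext(img)
    return ' '.join(text for _, text, _ in result)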
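
With its default settings, `reader.readtext` returns one `(bounding_box, text, confidence)` tuple per detected text region, which is why the function unpacks three values. As a sketch of one possible refinement, the snippet below drops low-confidence regions before joining. The sample results are made up, and the `min_confidence` threshold of 0.3 is an arbitrary assumption, not a value used in this notebook.

```python
def join_confident_text(results, min_confidence=0.3):
    """Join OCR fragments whose confidence meets the threshold (hypothetical helper)."""
    return ' '.join(text for _, text, conf in results if conf >= min_confidence)

# Hand-written stand-in for readtext output: (bounding box, text, confidence)
sample_results = [
    ([[0, 0], [80, 0], [80, 20], [0, 20]], 'Net Wt', 0.91),
    ([[0, 25], [60, 25], [60, 45], [0, 45]], '500 g', 0.87),
    ([[0, 50], [30, 50], [30, 60], [0, 60]], '~:|', 0.05),
]

print(join_confident_text(sample_results))
```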


* Handles the image processing for a single row from the CSV, including downloading and text extraction.

In [16]:
def process_row(row):
    image_url = row['image_link']
    try:
        image_bytes = download_image(image_url)
        text = extract_text(image_bytes)
        return text
    except Exception as e:
        return f"Error processing {image_url}: {str(e)}"


* Processes a batch of rows concurrently using a ThreadPoolExecutor.

In [17]:
def process_batch(batch):
    rows = batch.to_dict('records')
    with ThreadPoolExecutor(max_workers=10) as executor:
        # executor.map returns results in input order, so each text stays
        # aligned with its row (as_completed would yield in completion order)
        results = list(executor.map(process_row, rows))
    return results
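
Each OCR result is written back into the chunk as a column, so the results must line up with the rows that produced them. The sketch below uses a hypothetical stand-in worker, `slow_echo`, to illustrate the difference between two common ways of collecting results from a `ThreadPoolExecutor`: `executor.map` yields results in input order, while `as_completed` yields them in completion order.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def slow_echo(n):
    """Stand-in worker: later inputs sleep less, so they tend to finish first."""
    time.sleep(0.05 * (3 - n))
    return n

inputs = [0, 1, 2]

with ThreadPoolExecutor(max_workers=3) as executor:
    # map yields results in the order of the inputs
    ordered = list(executor.map(slow_echo, inputs))

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(slow_echo, n) for n in inputs]
    # as_completed yields futures as they finish, typically not in input order
    by_completion = [f.result() for f in as_completed(futures)]

print(ordered)
print(by_completion)
```

If completion order is preferred for throughput, each future can be stored alongside its row index and the results sorted by that index before assignment.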


## **Define the main Function**
*  Main function that reads the CSV file, processes it in chunks, and writes the results to a new CSV file.

In [18]:
def main():
    chunk_size = 2  # Process in small chunks
    output_file = "test_data.csv"

    # Read the input CSV in chunks; nrows=6 limits this run to the first six rows
    # (remove it to process the full file)
    for i, chunk in enumerate(pd.read_csv("test.csv", chunksize=chunk_size, nrows=6)):
        chunk['extracted_text'] = process_batch(chunk)

        # Write the header with the first chunk, then append, so the header
        # always matches the columns that are actually written
        chunk.to_csv(output_file, mode='w' if i == 0 else 'a', header=(i == 0), index=False)

    print(f"Processing complete. Results saved to '{output_file}'")

if __name__ == "__main__":
    main()



Processing complete. Results saved to 'test_data.csv'


## ***2. Data Preprocessing and Entity Extraction***


---



In [19]:
df = pd.read_csv('train_data.csv')

## **Define Preprocessing Function**

* Replace commas with periods in numeric values within text strings to standardize numeric formats.
* Replace ambiguous characters in the text to clean and standardize it for further processing.

In [20]:
def replace_comma_with_period(text):
    if isinstance(text, str):  # Ensure the input is a string
        return re.sub(r'(\d),(\d)', r'\1.\2', text)
    return text  # Return unchanged if not a string
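
A short check of the behaviour on made-up sample strings follows. Note that the substitution cannot tell a decimal comma from a thousands separator, so "1,000 g" becomes "1.000 g"; non-string inputs such as missing values pass through unchanged.

```python
import re

def replace_comma_with_period(text):
    if isinstance(text, str):
        return re.sub(r'(\d),(\d)', r'\1.\2', text)
    return text

print(replace_comma_with_period('Poids net 1,5 kg'))
print(replace_comma_with_period('1,000 g'))
print(replace_comma_with_period(None))
```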


In [21]:
# Define Replacement for Ambiguous Characters
replacements = {
    'O': '0',
    'D': '0',
    'l': '1',
    'I': '1',
    'S': '5',
    'B': '8',
    'Z': '2',
    's': '5'
}
# Define Preprocessing Function for Text
def preprocess_text(text):
    for key, value in replacements.items():
        text = text.replace(key, value)
    return text
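
The replacements apply to every character in the string, including letters inside unit names, which is presumably why the extraction step only applies them as a second attempt. A small demonstration on made-up strings follows; the `str.translate` variant is an equivalent alternative shown for comparison, not something this notebook uses.

```python
replacements = {'O': '0', 'D': '0', 'l': '1', 'I': '1',
                'S': '5', 'B': '8', 'Z': '2', 's': '5'}

def preprocess_text(text):
    for key, value in replacements.items():
        text = text.replace(key, value)
    return text

# str.translate performs the same single-character swaps in one pass
translation_table = str.maketrans(replacements)

print(preprocess_text('I5O g'))       # misread digits are repaired
print(preprocess_text('5OO grams'))   # the trailing 's' of the unit is changed too
print('5OO grams'.translate(translation_table))
```

The second example shows the side effect: the number is repaired, but "grams" turns into "gram5", so unit matching has to tolerate such altered spellings.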


In [22]:
df['corrected_text'] = df['extracted_text'].apply(replace_comma_with_period)

## **Define Entity-to-Unit Mapping**

*   Map entities (e.g., width, height) to their possible units for later extraction.
*   Provide mappings from abbreviated or symbolic unit representations to their full names for normalization.



In [23]:
# Entity to unit mapping
entity_unit_map = {
    'width': {'centimetre', 'foot', 'inch', 'metre', 'millimetre', 'yard'},
    'depth': {'centimetre', 'foot', 'inch', 'metre', 'millimetre', 'yard'},
    'height': {'centimetre', 'foot', 'inch', 'metre', 'millimetre', 'yard'},
    'item_weight': {'gram', 'kilogram', 'microgram', 'milligram', 'ounce', 'pound', 'ton'},
    'maximum_weight_recommendation': {'gram', 'kilogram', 'microgram', 'milligram', 'ounce', 'pound', 'ton'},
    'voltage': {'kilovolt', 'millivolt', 'volt'},
    'wattage': {'kilowatt', 'watt'},
    'item_volume': {'centilitre', 'cubic foot', 'cubic inch', 'cup', 'decilitre', 'fluid ounce',
                    'gallon', 'imperial gallon', 'litre', 'microlitre', 'millilitre', 'pint', 'quart'}
}

# Symbol to unit conversion
unit_symbol_map = {
    'g': 'gram',
    'kg': 'kilogram',
    'k9': 'kilogram',
    'mg': 'milligram',
    'm9': 'milligram',
    'lb': 'pound',
    '1b': 'pound',
    'Ib': 'pound',
    '%und': 'pound',
    '% und': 'pound',
    'oz': 'ounce',
    'o2': 'ounce',
    '0z': 'ounce',
    'ml': 'millilitre',
    'mI': 'millilitre',
    'm1': 'millilitre',
    'l': 'litre',
    'cm': 'centimetre',
    'mm': 'millimetre',
    'm': 'metre',
    'kv': 'kilovolt',
    'v': 'volt',
    'w': 'watt'
}
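
Some keys in the symbol table, such as `Ib` and `mI`, contain uppercase letters, so a lookup that only uses the lowercased match would never reach them. One way to handle this, sketched below on a small illustrative subset of the table, is to try the symbol exactly as written and then fall back to its lowercase form. `normalize_unit` is a hypothetical helper name.

```python
# Small illustrative subset of the symbol table above
unit_symbol_map = {
    'kg': 'kilogram',
    'lb': 'pound',
    'Ib': 'pound',       # OCR misread of 'lb'
    'ml': 'millilitre',
    'mI': 'millilitre',  # OCR misread of 'ml'
}

def normalize_unit(unit):
    """Map a matched unit string to its full name (hypothetical helper)."""
    if unit in unit_symbol_map:          # exact spelling first, e.g. 'Ib'
        return unit_symbol_map[unit]
    lowered = unit.lower()               # then the case-insensitive form, e.g. 'KG'
    return unit_symbol_map.get(lowered, lowered)

for symbol in ['KG', 'Ib', 'mI', 'Gram']:
    print(symbol, '->', normalize_unit(symbol))
```

Trying the exact spelling first keeps the mixed-case entries usable without also mapping unrelated lowercase fragments such as "mi" to millilitre.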

## **Function to Extract Entity Value**

Extract the entity value from each row's text, first from the text as-is and then, if nothing is found, from the text with ambiguous characters replaced.

In [24]:
def extract_entity_value(row):
    entity = row['entity_name']
    text = row['corrected_text']

    # Skip rows with no OCR text
    if pd.isna(text):
        return None
    text = str(text)

    # First attempt to extract without preprocessing
    extracted_value = attempt_extraction(text, entity)

    # If no value is extracted, apply the replacement method and try again
    if extracted_value is None:
        text = preprocess_text(text)
        extracted_value = attempt_extraction(text, entity)

    return extracted_value


## **Function to Attempt Extraction of Values**
Use a regex to find number-unit pairs in the text, normalize the units, and return the first match whose unit is valid for the given entity.

In [25]:
def attempt_extraction(text, entity):
    # Get valid units for the entity
    valid_units = entity_unit_map.get(entity, set())

    # Build the alternation longest-first so a longer unit is always tried
    # before any shorter unit or symbol that is a prefix of it
    candidates = sorted(valid_units | set(unit_symbol_map), key=len, reverse=True)
    unit_pattern = '|'.join(re.escape(unit) for unit in candidates)
    pattern = fr'(\d+(?:\.\d+)?)\s*({unit_pattern})'

    # Find all (value, unit) matches in the text
    matches = re.findall(pattern, text, re.IGNORECASE)

    extracted_values = []
    for value, unit in matches:
        # Normalize the unit: try the symbol exactly as written first, so that
        # mixed-case keys such as 'Ib' and 'mI' are found, then its lowercase form
        normalized_unit = unit_symbol_map.get(unit) or unit_symbol_map.get(unit.lower(), unit.lower())

        # Keep the match only if the unit is valid for this entity
        if normalized_unit in valid_units:
            extracted_values.append(f"{value} {normalized_unit}")

    # Return the first valid match, or None if there is none
    return extracted_values[0] if extracted_values else None
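
To make the matching step concrete, here is a simplified, self-contained sketch of the same idea on made-up strings. The `valid_units` and `symbols` sets are small illustrative subsets, and `first_valid_value` is a hypothetical name; it is not a replacement for the notebook's function.

```python
import re

# Small illustrative subsets of the two maps
valid_units = {'gram', 'kilogram'}
symbols = {'g': 'gram', 'kg': 'kilogram', 'ml': 'millilitre'}

def first_valid_value(text):
    """Return the first '<number> <unit>' whose unit is valid, else None."""
    # Longest alternatives first, so 'kilogram' is tried before 'kg' or 'g'
    candidates = sorted(valid_units | set(symbols), key=lambda u: (-len(u), u))
    pattern = r'(\d+(?:\.\d+)?)\s*(' + '|'.join(re.escape(u) for u in candidates) + ')'
    for value, unit in re.findall(pattern, text, re.IGNORECASE):
        unit = symbols.get(unit.lower(), unit.lower())
        if unit in valid_units:
            return f"{value} {unit}"
    return None

print(first_valid_value('250 ml jar, Net Wt 1.5 KG'))  # millilitre is skipped as invalid
print(first_valid_value('Best before 2025'))           # no unit follows the number
print(first_valid_value('5 gallons'))                  # caveat: 'g' matches inside 'gallons'
```

The last call shows a limitation of patterns without a word boundary after the unit: a single-letter symbol can match the start of an unrelated word, so short symbols are a likely source of false positives.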


In [26]:
df['extracted_entity_value'] = df.apply(extract_entity_value, axis=1)

## **Prepare and Save the Output DataFrame**
Create a DataFrame with extracted entity values as predictions, save it to a CSV file, and print the final output.


In [27]:
output_df = df[['extracted_entity_value']].rename(columns={'extracted_entity_value': 'prediction'})
output_df.index.name = 'index'
output_df.reset_index(inplace=True)

# Save the output to a CSV file
output_df.to_csv('output.csv', index=False)

# Show the final DataFrame with predictions
print(output_df)


     index      prediction
0        0        500 gram
1        1    5 millilitre
2        2      0.709 gram
3        3            None
4        4  1400 milligram
..     ...             ...
894    894  240 millilitre
895    895  1485 milligram
896    896    40 milligram
897    897            None
898    898            None

[899 rows x 2 columns]
