#### CarbonFact Take Home exercise  
The second interview is a take-home exercice about data parsing.  
I've attached some raw product data we got from one of our customers.  
The goal of Carbonfact's data science team is to clean this data so that we can measure it with our LCA engine.  
Your challenge is to :  
- load this data,
- parse it,  
- and store it in a structured format.  

I don't have a strict format in mind, so I'll let you decide. I would however like to see a class-based approach, written in Python, based on Pydantic or dataclasses. This is a take-home exercise, so you can take your time. I would like to see your code and a brief explanation of your approach. We can then discuss it in a call.

#### 0. imports

In [2]:
import pandas as pd
import numpy as np
import random
import pprint
import re
pd.set_option('display.max_colwidth', None)
from tqdm import tqdm
import json
import ast
from rapidfuzz import process, fuzz

#### 1. EDA

In [3]:
df_cf=pd.read_csv('care_labels.csv')

In [4]:
df_cf.head()

Unnamed: 0,product_id,product_category,care_label
0,#113,PANTS,"Main: 40% Cotton, 60% Polyester, 290 g/m².\nContrast: 53% Cotton 47% Polyester, 290 g/m².\nReinforcement Knee: 100% CORDURA®-Polyamide, 205 g/m²."
1,#212,PANTS,"Main: DuraTwill, 52% Cotton 48% Polyamide, 240 g/m².\nReinforcement: 100% CORDURA®-Polyamide."
2,#213,PANTS,"Main: 40% Cotton, 60% Polyester, 290 g/m².\nContrast: 53% Cotton, 47% Polyester, 290 g/m².\nReinforcement Knee: 100% CORDURA®-Polyamide, 205 g/m²."
3,#214,PANTS,"Main: Canvas+, 60% Cotton, 40% Polyester, 340 g/m². Reinforcement: 100% CORDURA®-Polyamide."
4,#312,PANTS,"Main: DuraTwill, 52% Cotton 48% Polyamide, 240 g/m². \nReinforcement: 100% CORDURA®-Polyamide."


In [5]:
df_cf.product_category.value_counts()

product_category
PANTS                   139
JACKET                  108
TSHIRT                   51
SHIRT                    38
SWEATER/HOODIE           35
HOSIERY/SOCKS            27
PANTS/SHORTS             26
GLOVES                   22
TSHIRT/LONG-SLEEVE       19
ACCESSORY/KNIT-CAP       17
ACCESSORY/BELT           16
ACCESSORY/PHONE-CASE     12
ACCESSORY/CAP-HAT         9
HOSIERY/LEGGINGS          8
ACCESSORY/KNEEPAD         8
ACCESSORY/MASK            5
UNKNOWN                   5
ACCESSORY/WALLET          4
JACKET/COAT               4
WORKWEAR/COVERALL         3
SWEATER                   3
UNDERWEAR/BOXERS          3
ACCESSORY/SUN-HAT         2
BAG/MEDIUM                2
ACCESSORY/SCARF           1
BAG/LARGE                 1
TSHIRT/TANK-TOP           1
UNDERWEAR/BRA             1
UNDERWEAR/PANTIES         1
ACCESSORY/KEYCHAIN        1
ACCESSORY/HEADBAND        1
Name: count, dtype: int64

#### 2. Parse the product_category

In [6]:
def split_product_category(the_product_category):
    try:
        main_product_category=str(the_product_category).split('/')[0]
        parse_product_category=True
        return main_product_category, parse_product_category
    except:
        parse_product_category=False
        return the_product_category, parse_product_category

In [7]:
df_cf["main_prod_cat"]=[split_product_category(the_pc)[0] for the_pc in df_cf["product_category"]]
df_cf["log_parse_cat"]=[split_product_category(the_pc)[1] for the_pc in df_cf["product_category"]]

In [8]:
df_cf.head()

Unnamed: 0,product_id,product_category,care_label,main_prod_cat,log_parse_cat
0,#113,PANTS,"Main: 40% Cotton, 60% Polyester, 290 g/m².\nContrast: 53% Cotton 47% Polyester, 290 g/m².\nReinforcement Knee: 100% CORDURA®-Polyamide, 205 g/m².",PANTS,True
1,#212,PANTS,"Main: DuraTwill, 52% Cotton 48% Polyamide, 240 g/m².\nReinforcement: 100% CORDURA®-Polyamide.",PANTS,True
2,#213,PANTS,"Main: 40% Cotton, 60% Polyester, 290 g/m².\nContrast: 53% Cotton, 47% Polyester, 290 g/m².\nReinforcement Knee: 100% CORDURA®-Polyamide, 205 g/m².",PANTS,True
3,#214,PANTS,"Main: Canvas+, 60% Cotton, 40% Polyester, 340 g/m². Reinforcement: 100% CORDURA®-Polyamide.",PANTS,True
4,#312,PANTS,"Main: DuraTwill, 52% Cotton 48% Polyamide, 240 g/m². \nReinforcement: 100% CORDURA®-Polyamide.",PANTS,True


In [9]:
df_cf.log_parse_cat.value_counts()

log_parse_cat
True    573
Name: count, dtype: int64

In [10]:
df_cf.main_prod_cat.value_counts()

main_prod_cat
PANTS        165
JACKET       112
ACCESSORY     76
TSHIRT        71
SHIRT         38
SWEATER       38
HOSIERY       35
GLOVES        22
UNKNOWN        5
UNDERWEAR      5
WORKWEAR       3
BAG            3
Name: count, dtype: int64

In [83]:
df_cf[df_cf.care_label.str.contains('CORDURA')]

Unnamed: 0,product_id,product_category,care_label,main_prod_cat,log_parse_cat
0,#113,PANTS,"Main: 40% Cotton, 60% Polyester, 290 g/m².\nContrast: 53% Cotton 47% Polyester, 290 g/m².\nReinforcement Knee: 100% CORDURA®-Polyamide, 205 g/m².",PANTS,True
1,#212,PANTS,"Main: DuraTwill, 52% Cotton 48% Polyamide, 240 g/m².\nReinforcement: 100% CORDURA®-Polyamide.",PANTS,True
2,#213,PANTS,"Main: 40% Cotton, 60% Polyester, 290 g/m².\nContrast: 53% Cotton, 47% Polyester, 290 g/m².\nReinforcement Knee: 100% CORDURA®-Polyamide, 205 g/m².",PANTS,True
3,#214,PANTS,"Main: Canvas+, 60% Cotton, 40% Polyester, 340 g/m². Reinforcement: 100% CORDURA®-Polyamide.",PANTS,True
4,#312,PANTS,"Main: DuraTwill, 52% Cotton 48% Polyamide, 240 g/m². \nReinforcement: 100% CORDURA®-Polyamide.",PANTS,True
...,...,...,...,...,...
545,#9757,PANTS,"39% Modacrylic, 28% CORDURA®, 17% Cotton, 15% Aramid, 1% Antistatic, 270 g/m².",PANTS,True
554,#9770,ACCESSORY/BELT,"100% Nylon, 100% CORDURA®-Polyamide.\nLeather pouches.",ACCESSORY,True
559,#9780,ACCESSORY/BELT,"100% Nylon, 100% CORDURA®-Polyamide. Leather pouches.",ACCESSORY,True
565,#9790,ACCESSORY/BELT,"100% Nylon, 100% CORDURA®-Polyamide.",ACCESSORY,True


#### Explore the care_label structure  
Challenges  
1. not common structure
2. same fabric words but maybe written slightly differently
3. special characters like (r)
4. % composition
5. density information g/m2
6. changing units like : g/mé, grams, else ?
7. changing structure like Main: ... Contrast ... Reinforcement, or not

ideas: what do we want ?  
1. the composition with fabrics / fibers per parts
2. the proportions of each fiber in the fabric  
3. the density of the fabric

Step1: build keywords set  
Step2: build test at materials: sum of % = 100  
Step3: build scraping test: % of characters scrapped = ~100% originial characters

In [12]:
for i in range(20):
    seed=random.randint(0,df_cf.shape[0])
    product_cat=df_cf.product_category.iloc[seed]
    care_label_example=df_cf.care_label.iloc[seed]
    print("{} ({}) : {}".format(seed, product_cat,care_label_example))
    print('\n')

36 (JACKET) : Material: 100% polyester, 260 g/m² (39% Sorona® polyester).


171 (TSHIRT/TANK-TOP) : 91% polyester, 9% Elastane 190 g/m2.


65 (TSHIRT/LONG-SLEEVE) : Color 9567: Main: 100% polyester, 140 g/m².


527 (TSHIRT/LONG-SLEEVE) : X% polyamide, X% polypropylene, X% elastane, X g/m2.


164 (TSHIRT) : Col 0400, 2800 and 3100: 82% cotton, 18% polyester. 
Col 4500: 64% cotton, 36% polyester, 280 g/m2.
2X2 Rib: 97% cotton, 3% elastane, 380 g/m2.



460 (HOSIERY/SOCKS) : 66% cotton, 33% polyamide, 1% elastane.


57 (JACKET) : Main: 100% polyamide, 70 g/m².Lining: 90% polyester, 10% elastane, 95 g/m².


383 (JACKET) : Main Material: 97% Recycled Polyester 3% Elastane, 295 g Contrast;


190 (PANTS) : Main: 60% polyester, 40% cotton, 290 g/m². Contrast: 53% cotton, 47% polyester 290 g/m².


254 (PANTS) : Main: 31% polyester, 28% modacrylic, 20% Aramid Kermel®, 20% CV FR, 1% antistatic, 320 g/m2. Reinforcement: 39% modacrylic, 28% CORDURA®, 17% cotton, 15% aramid, 1% antistatic, 270 g/m2.

In [56]:
def clean_raw_text(text):
    text = text.lower()  # Normalize case
    text = re.sub(r'\s{2,}', ' ', text) #remove multiple spaces
    text = text.replace('\n','')
    text = text.replace('g/m²', 'g/m2')  # Standardize weight units
    text = text.replace('g/m.', 'g/m2')  # Standardize weight units
    text = text.replace('gram', 'g/m2')  # Standardize weight units
    text = text.replace('gsm', 'g/m2')  # Standardize weight units
    text = text.replace('g.', 'g/m2')  # Standardize weight units
    text = text.replace(' gr.', 'g/m2')  # Standardize weight units
    text = text.replace(' gr', 'g/m2')  # Standardize weight units
    text = re.sub(r'[^a-z0-9%.,/®™ ]+', '', text)  # Remove unnecessary punctuation
    text = text.replace('x%','0%')
    text = text.replace('xg/m2','0 g/m2')
    text = text.replace('x g/m2','0 g/m2')
    text = text.replace('reinforcements','reinforcement')
    return text.strip()

In [57]:
raw_text = 'Main: 57% modacrylic, 36% cotton, 5% elastan, 2% Belltron™, 210 g/m. Lining: 58% modacrylic FR, 35% cotton, 2% Belltron™, 5% elastan.'
clean_raw_text(raw_text)

'main 57% modacrylic, 36% cotton, 5% elastan, 2% belltron™, 210 g/m2 lining 58% modacrylic fr, 35% cotton, 2% belltron™, 5% elastan.'

In [58]:
def parse_fabric_description_v3(text_raw, print_test=False):
    parsed_output = {}

    parsed_output['raw_data']=text_raw
    parsed_output['parsed']={}
    text=clean_raw_text(text_raw)
    parsed_output['clean_data']=text
    parsed_output['len_raw_clean_text']=len(parsed_output['clean_data'])
    #clean the percentage
    # Define a regex pattern to target only commas in percentage expressions
    text = re.sub(r'(\d+),(\d+%)', r'\1.\2', text)
    if print_test:
        print("text :", text)

    # Step 2: Identify colors and split by layers parts
    color_pattern = r'((?:color|colors|colour|colours|col|col.)\s+(?:\d{4}(?:,\s*)?)+)'
    layer_pattern = r'(?:(main fabric|main face|main backing|main 2|main|contrast main|contrast 1|contrast 2|contrast|reinforcement|reinforced|cuffs|cuff stretch|cuff|stretch|padding|insulation|isolation|mesh|pocket lining|collar lining|lining|polartec®|details front panels and gusset|ripstop|2x2 rib|dipping|backing|cooltwill)[\s:]+)'  # Pattern for part types  # Pattern for part types
    #layer_pattern = r'(?:(main|contrast|reinforcement|cuff stretch|padding|insulation|mesh|lining|polartec®)[\s:]+)'  # Pattern for part types  # Pattern for part types

    # Split text by color blocks
    color_blocks = re.split(color_pattern, text, flags=re.IGNORECASE)
    if print_test:
        print("color blocks:", color_blocks)
        
    current_color = 'default'  # Default color if none is mentioned
    if len(color_blocks)==1:
        color_blocks=[current_color]+color_blocks
    for i in range(len(color_blocks)):
        block = color_blocks[i].strip()
        if print_test:
            print("block :", block)
        if (('color' in block[:20]) | ('colour' in block[:20]) | ('col' in block[:20])):  # If it's a color block (i.e., odd index)
            current_color = block
            if print_test:
                print("color block :", block)
        else:  # Otherwise, it's a fabric description block
            if 'main' not in block:
                block='main: '+block
            if block:
                # Process the block for parts and materials
                split_layers = re.split(layer_pattern, block)
                if print_test:
                    print("split layers: ",split_layers)
                current_part_type = None
                if len(split_layers)==1:
                    split_layers=['','main']+split_layers
                for j in range(len(split_layers)):
                    part = split_layers[j].strip()
                    if j % 2 == 1:  # It's a part type
                        current_part_type = part.capitalize()  # Capitalize for uniformity
                        #we remove any numbers at the end
                        current_part_type=re.sub(r'\d+\s*$', '', current_part_type).strip()
                    else:
                        if part and current_part_type:
                            materials, weight, test_pc = parse_composition(part)
                            if current_color not in parsed_output['parsed']:
                                parsed_output['parsed'][current_color] = {}
                            if current_part_type not in parsed_output['parsed'][current_color] and materials!=[]:
                                i=0
                            if current_part_type+'_'+str(i) in parsed_output['parsed'][current_color] and materials!=[]:
                                i+=1
                            current_part_type=current_part_type+'_'+str(i)
                            if materials!=[]:
                                parsed_output['parsed'][current_color][current_part_type] = {
                                    'Materials': [],
                                    'Weight': None,
                                    'Test_pc': False
                                }
                                parsed_output['parsed'][current_color][current_part_type]['Materials'] = materials
                                parsed_output['parsed'][current_color][current_part_type]['Weight'] = weight
                                parsed_output['parsed'][current_color][current_part_type]['Test_pc'] = test_pc
            else:
                print("no block detected")
                                

    #test length of parsed output
    parsed_output["test_length"], parsed_output["rebuild_text"], parsed_output["len_rebuild"]=test2_total_characters_length(parsed_output, print_test)

    #test number of False test_pc (target is 0)
    parsed_output["test_test_pc"]=count_false_test_pc(parsed_output)

    #test number of missing weights
    parsed_output["test_missing_weight"]=count_missing_weight(parsed_output)
    
    return parsed_output

In [217]:
def test1_materials_pc(materials_list):
    #test that the sum of parced % is 100 for 1 fabric
    sum_pc=sum([float(mat['percentage']) for mat in materials_list])
    if abs(sum_pc-100)<1:
        return True
    else:
        return False

In [218]:
def parse_composition(comp_text):
    materials = []
    weight = None

    # Improved regex for materials with percentage upfront
    material_pattern = r'(\d+\[.,]?\d*)\s*%\s*([^,]+?)(?=\s*(\d+\s*g/m2|,|\.|$))'

    # Regex for capturing weights (e.g., "149 g/m²")
    weight_pattern = r'(\d+)\s*g/m2'

    # Extract weight if present
    weight_match = re.search(weight_pattern, comp_text)
    if weight_match:
        weight = int(weight_match.group(1))
        # Remove the weight part from the text to avoid interference with material parsing
        comp_text = re.sub(weight_pattern, '', comp_text)

    # Clean up extra spaces and commas
    comp_text = re.sub(r'\s*,\s*', ', ', comp_text)  # Normalize spaces around commas
    comp_text = re.sub(r'\s+', ' ', comp_text)  # Normalize multiple spaces to a single space

    #replace , by .
    comp_text=comp_text.replace(',', '.')
    # Split the text by percentage patterns
    material_parts = re.split(r'(\d+[.,]?\d*\s*%)', comp_text)

    # Process each part to identify materials and percentages
    i = 0
    while i < len(material_parts):
        part = material_parts[i].strip()
        if part and i < len(material_parts):
            percentage_part = material_parts[i - 1].strip()
            if percentage_part.endswith('%'):
                # Extract percentage, allowing for both dot and comma decimals
                match = re.match(r'(\d+[.,]?\d*)', percentage_part)
                if match:
                    # Normalize comma to dot for proper float conversion
                    percentage_str = match.group(1).replace(',', '.')
                # Convert to float
                percentage = float(percentage_str)
                material = part
                clean_material=clean_material_string(material.strip())
                material_class = get_material_class(clean_material, material_set)
                # Add material and percentage
                materials.append({
                    'material_full': clean_material,
                    'percentage': percentage,
                    'material_class': material_class
                })
        i += 2

    #test materials object
    test_pc=test1_materials_pc(materials)
    return materials, weight, test_pc

In [219]:
def rebuild_sentence(parsed_dict):
    #used to compare parse length to original string
    result = []
    
    # Iterate through the outer layers of the dict (i.e., colors like 'default')
    for color, parts in parsed_dict.items():
        # If it's a non-default color, prepend the color info
        if color != 'default':
            result.append(f"{color}:")
        
        # Iterate through parts like 'Main', 'Mesh', etc.
        for part, data in parts.items():
            materials_str = []
            
            # Iterate through the materials list to build material sentences
            for material_info in data['Materials']:
                material = material_info['material_full'].strip(',.')  # Clean up extra commas or periods
                percentage = material_info['percentage']
                materials_str.append(f"{int(percentage)}% {material}")
            
            # Join materials with commas
            material_sentence = ', '.join(materials_str)
            
            # Add weight if available
            if 'Weight' in data and data['Weight']:
                material_sentence += f", {data['Weight']} g/m2"
            
            # Construct the full sentence for each part
            result.append(f"{part} {material_sentence}.")
    
    # Join all parts with spaces to form the final result
    return ' '.join(result).lower()

In [220]:
def test2_total_characters_length(parsed_output, print_test=False):
    #1 rebuilt the cleanded description data from the parsed output
    rebuild=rebuild_sentence(parsed_output['parsed'])
    #2 compare the length
    len_rebuild=len(rebuild)
    len_original=len(parsed_output['clean_data'])
    #test the length comparison
    ratio_length=(len_rebuild-len_original)/len_original*100
    if (ratio_length>-10 and ratio_length<50):
        return True, rebuild, len_rebuild
    else:
        if print_test:
            print("raw data: ",parsed_output['raw_data'])
            print("clean data: ",parsed_output['clean_data'])
            print("rebuild: ",rebuild)
            print('\n')
        return False, rebuild, len_rebuild

In [221]:
def count_false_test_pc(parsed_output):
    count = 0
    
    # Iterate through the outer dict (e.g., 'default')
    for layer, properties in parsed_output['parsed'].items():
        # Iterate through each material block (e.g., 'Main_0', 'Contrast_0', etc.)
        for key, value in properties.items():
            # Check if 'Test_pc' is present and False
            if 'Test_pc' in value and not value['Test_pc']:
                count += 1
    
    return count

In [222]:
def count_missing_weight(parsed_output):
    count = 0
    
    # Iterate through the outer dict (e.g., 'default')
    for layer, properties in parsed_output['parsed'].items():
        # Iterate through each material block (e.g., 'Main_0', 'Contrast_0', etc.)
        for key, value in properties.items():
            # Check if 'Test_pc' is present and False
            if 'Weight' in value and not value['Weight']:
                count += 1
    
    return count

In [223]:
def clean_material_string(the_material):
    clean_material=the_material.strip().rstrip(". ").lstrip(".").strip().lstrip('%').rstrip("/")
    clean_material=re.sub(r'(?<!\d)\.(?!\d)', '', clean_material)
    clean_material=clean_material.replace('®', ' ').replace('™', ' ')
    clean_material=re.sub(r'\s+', ' ', clean_material)
    clean_material=clean_material.strip().rstrip(' .')
    return clean_material

In [224]:
material_set=set([
 'polyester',
 'acrylic',
 'antistatic',
 'antistatic pu laminated',
 'aramid',
 'belltron',
 'carbon',
 'polyamide',
 'cotton',
 'elastane',
 'epdm rubber',
 'eva',
 'leather',
 'glass fibre',
 'goatskin',
 'hppe',
 'latex',
 'lycra',
 'wool',
 'metal fibre',
 'modacrylic',
 'modal',
 'nitrile',
 'nylon',
 'other fibre',
 'pes',
 'polyethylene',
 'polyolefin',
 'polypropylene',
 'polyurethane',
 'poyamide',
 'protal',
 'pu',
 'pu laminated polyester',
 'pvc',
 'reflective yarn',
 'viscose',
 'rubber',
 'silicon',
 'silk',
 'spandex',
 'wbpu/nitrile rubber'])

In [225]:
def get_material_class(the_material, material_set):
    # Step 4: Find the main fabric material by checking keywords
    extracted_fabrics = [fabric for fabric in material_set if fabric in the_material]
    if extracted_fabrics==[]:
        extracted_fabrics.append("Unknown")
    return extracted_fabrics

In [226]:
# Step 2: Function to clean the fabric description
def get_material_class(the_material: str, material_set, threshold: int = 80):
    
    # Step 4: Use fuzzy matching to find the closest fabric materials
    words = the_material.split()
    extracted_fabrics = set()

    for word in words:
        # Find the closest match from fabric_keywords based on a similarity threshold
        match, score,_ = process.extractOne(word, material_set, scorer=fuzz.ratio)
        if score >= threshold:
            extracted_fabrics.add(match)

    if extracted_fabrics=={}:
        extracted_fabrics=set(['Unknown'])
    
    return list(extracted_fabrics)

In [227]:
seed=random.randint(0,df_cf.shape[0])
seed=15
product_cat=df_cf.product_category.iloc[seed]
care_label_example=df_cf.care_label.iloc[seed]
#care_label_example=' Main: 74% polyamide, 22% REPREVE® recycled polyamide, 4% elastane 149 g/m². Mesh: 90% polyester, 10% elastane, 98 g/m²'
print("{} ({}) : {}".format(str(seed), product_cat,care_label_example))
print("\n")
parsed_data = parse_fabric_description_v3(care_label_example, print_test=False)
print(parsed_data, "\n")

15 (JACKET) : Main: 100% polyester, 230 g/m². Lining: 100% solution dyed polyamide 65 g/m². Padding: 100% polyester 120 g/m². Pocket lining: 100% polyester, 215 g/m². Collar lining: 100% polyester, 350 g/m²


{'raw_data': 'Main: 100% polyester, 230 g/m². Lining: 100% solution dyed polyamide 65 g/m². Padding: 100% polyester 120 g/m². Pocket lining: 100% polyester, 215 g/m². Collar lining: 100% polyester, 350 g/m²', 'parsed': {'default': {'Main_0': {'Materials': [{'material_full': 'polyester', 'percentage': 100.0, 'material_class': ['polyester']}], 'Weight': 230, 'Test_pc': True}, 'Lining_0': {'Materials': [{'material_full': 'solution dyed polyamide', 'percentage': 100.0, 'material_class': ['polyamide']}], 'Weight': 65, 'Test_pc': True}, 'Padding_0': {'Materials': [{'material_full': 'polyester', 'percentage': 100.0, 'material_class': ['polyester']}], 'Weight': 120, 'Test_pc': True}, 'Pocket lining_0': {'Materials': [{'material_full': 'polyester', 'percentage': 100.0, 'material_class': ['

#### Apply parser to full dataset to check for performance

In [228]:
output=[]
for i in tqdm(range(df_cf.shape[0])):
    result_dict={}
    result_dict["product_id"]=df_cf.product_id.iloc[i]
    try:
        parsed_dict=parse_fabric_description_v3(df_cf.care_label.iloc[i])
        result_dict={**result_dict, **parsed_dict}
        output.append(result_dict)
    except:
        print("failed:"+df_cf.product_id.iloc[i])
        pass

#build dataframe final
df_parsed=pd.DataFrame(output)
df_parsed_final=pd.merge(df_cf, df_parsed, left_on='product_id', right_on='product_id')
#print shape
print("final shape : {}".format(df_parsed_final.shape))
#save
df_parsed_final.to_csv('care_labels_parsed.csv', sep=';')

100%|██████████████████████████████████████████████████████████████████████████████| 573/573 [00:00<00:00, 4876.96it/s]

final shape : (573, 14)





In [229]:
#build def_evaluate matrixes to navigate in product category errors
df_evaluate_lengthparsing=df_parsed_final.groupby(by='main_prod_cat').test_length.value_counts(normalize=True).reset_index()
df_evaluate_percentparsing=df_parsed_final.groupby(by='main_prod_cat').test_test_pc.value_counts(normalize=True).reset_index()
df_evaluate_missingweight=df_parsed_final.groupby(by='main_prod_cat').test_missing_weight.value_counts(normalize=True).reset_index()

#display key numbers on parsing quality
print("test length string parsed PASSED : {}%".format(round(df_parsed_final.test_length.value_counts(normalize=True).reset_index().iloc[0,1]*100,2)))
print("test percent composition sums 100% : {}%".format(round(df_parsed_final.test_test_pc.value_counts(normalize=True).reset_index().iloc[0,1]*100,2)))
print("test no missing weight : {}%".format(round(df_parsed_final.test_missing_weight.value_counts(normalize=True).reset_index().iloc[0,1]*100,2)))

#display key categories with most of failures


test length string parsed PASSED : 93.89%
test percent composition sums 100% : 94.07%
test no missing weight : 70.33%


In [230]:
df_evaluate_lengthparsing[df_evaluate_lengthparsing.test_length==False].sort_values('proportion', ascending=False).iloc[:5,:]

Unnamed: 0,main_prod_cat,test_length,proportion
2,BAG,False,0.666667
18,UNKNOWN,False,0.6
1,ACCESSORY,False,0.236842
5,GLOVES,False,0.045455
9,JACKET,False,0.035714


In [231]:
df_evaluate_percentparsing[df_evaluate_percentparsing.test_test_pc==0].sort_values('proportion', ascending=True).iloc[:5,:]

Unnamed: 0,main_prod_cat,test_test_pc,proportion
0,ACCESSORY,0,0.855263
4,HOSIERY,0,0.942857
15,TSHIRT,0,0.943662
11,SHIRT,0,0.947368
13,SWEATER,0,0.947368


In [232]:
df_evaluate_missingweight[df_evaluate_missingweight.test_missing_weight>0].sort_values('proportion', ascending=False).iloc[:5,:]

Unnamed: 0,main_prod_cat,test_missing_weight,proportion
8,HOSIERY,1,0.857143
5,GLOVES,1,0.818182
0,ACCESSORY,1,0.75
4,BAG,1,0.333333
27,UNKNOWN,1,0.2


In [233]:
df_parsed_final[df_parsed_final.test_length==False]

Unnamed: 0,product_id,product_category,care_label,main_prod_cat,log_parse_cat,raw_data,parsed,clean_data,len_raw_clean_text,test_length,rebuild_text,len_rebuild,test_test_pc,test_missing_weight
58,#1915,JACKET,"Main: 94% polyamide 6%, elastane, 178 g/m². Contrast 1 (backpanel): 92% polyester, 8% elastane, 118 g/m². Contrast 2 (armpit): 91.5% polyamide, 8.5% elastane, 250 g/m².",JACKET,True,"Main: 94% polyamide 6%, elastane, 178 g/m². Contrast 1 (backpanel): 92% polyester, 8% elastane, 118 g/m². Contrast 2 (armpit): 91.5% polyamide, 8.5% elastane, 250 g/m².","{'default': {'Main_0': {'Materials': [{'material_full': 'polyamide', 'percentage': 94.0, 'material_class': ['polyamide']}, {'material_full': 'elastane', 'percentage': 6.0, 'material_class': ['elastane']}], 'Weight': 178, 'Test_pc': True}, 'Contrast_0': {'Materials': [{'material_full': 'polyester', 'percentage': 92.0, 'material_class': ['polyester']}, {'material_full': 'elastane', 'percentage': 8.0, 'material_class': ['elastane']}], 'Weight': 118, 'Test_pc': True}, 'Contrast_1': {'Materials': [{'material_full': 'polyamide', 'percentage': 91.5, 'material_class': ['polyamide']}, {'material_full': 'elastane', 'percentage': 8.5, 'material_class': ['elastane']}], 'Weight': 250, 'Test_pc': True}}}","main 94% polyamide 6%, elastane, 178 g/m2. contrast 1 backpanel 92% polyester, 8% elastane, 118 g/m2. contrast 2 armpit 91.5% polyamide, 8.5% elastane, 250 g/m2.",161,False,"main_0 94% polyamide, 6% elastane, 178 g/m2. contrast_0 92% polyester, 8% elastane, 118 g/m2. contrast_1 91% polyamide, 8% elastane, 250 g/m2.",142,0,0
140,#2840,TSHIRT,Double interlock fabric,TSHIRT,True,Double interlock fabric,{'default': {}},double interlock fabric,23,False,,0,0,0
290,#6580,PANTS,"Main: 100% GORETEX® polyester 140 g/m². Reinforcement: 100% GORETEX® polyamide 190 g/m². Reinforcement: 100% CORDURA polyamide® 205 g/m². Reinforcement: 77% polyamide, 13% REF, 6% polyamide, 4% PU, 255 g/m². Insulation: 50% 37.5® polyester, 35% REPREVE® recycled polyester, 15% polyester, 60 g/m².",PANTS,True,"Main: 100% GORETEX® polyester 140 g/m². Reinforcement: 100% GORETEX® polyamide 190 g/m². Reinforcement: 100% CORDURA polyamide® 205 g/m². Reinforcement: 77% polyamide, 13% REF, 6% polyamide, 4% PU, 255 g/m². Insulation: 50% 37.5® polyester, 35% REPREVE® recycled polyester, 15% polyester, 60 g/m².","{'default': {'Main_0': {'Materials': [{'material_full': 'goretex polyester', 'percentage': 100.0, 'material_class': ['polyester']}], 'Weight': 140, 'Test_pc': True}, 'Reinforcement_0': {'Materials': [{'material_full': 'goretex polyamide', 'percentage': 100.0, 'material_class': ['polyamide']}], 'Weight': 190, 'Test_pc': True}, 'Reinforcement_1': {'Materials': [{'material_full': 'polyamide', 'percentage': 77.0, 'material_class': ['polyamide']}, {'material_full': 'ref', 'percentage': 13.0, 'material_class': []}, {'material_full': 'polyamide', 'percentage': 6.0, 'material_class': ['polyamide']}, {'material_full': 'pu', 'percentage': 4.0, 'material_class': ['pu']}], 'Weight': 255, 'Test_pc': True}, 'Insulation_0': {'Materials': [{'material_full': '37.5 polyester', 'percentage': 50.0, 'material_class': ['polyester']}, {'material_full': 'repreve recycled polyester', 'percentage': 35.0, 'material_class': ['polyester']}, {'material_full': 'polyester', 'percentage': 15.0, 'material_class': ['polyester']}], 'Weight': 60, 'Test_pc': True}}}","main 100% goretex® polyester 140 g/m2. reinforcement 100% goretex® polyamide 190 g/m2. reinforcement 100% cordura polyamide® 205 g/m2. reinforcement 77% polyamide, 13% ref, 6% polyamide, 4% pu, 255 g/m2. insulation 50% 37.5® polyester, 35% repreve® recycled polyester, 15% polyester, 60 g/m2.",292,False,"main_0 100% goretex polyester, 140 g/m2. reinforcement_0 100% goretex polyamide, 190 g/m2. reinforcement_1 77% polyamide, 13% ref, 6% polyamide, 4% pu, 255 g/m2. insulation_0 50% 37.5 polyester, 35% repreve recycled polyester, 15% polyester, 60 g/m2.",250,0,0
291,#6590,PANTS,"Material: Main: 47% cotton, 53% polyester, 250g/m². Contrast stretch back thigh panels: 91,5% polyamide, 8.5% elastane, 250g/m². Reinforcement 1; knee pad: 100% polyester CORDURA®, 320g/m². \nReinforcement 2; pockets: 100% polyamide CORDURA®,\nDetails front panels and gusset: 88% polyamide, 12% elastane",PANTS,True,"Material: Main: 47% cotton, 53% polyester, 250g/m². Contrast stretch back thigh panels: 91,5% polyamide, 8.5% elastane, 250g/m². Reinforcement 1; knee pad: 100% polyester CORDURA®, 320g/m². \nReinforcement 2; pockets: 100% polyamide CORDURA®,\nDetails front panels and gusset: 88% polyamide, 12% elastane","{'default': {'Main_0': {'Materials': [{'material_full': 'cotton', 'percentage': 47.0, 'material_class': ['cotton']}, {'material_full': 'polyester', 'percentage': 53.0, 'material_class': ['polyester']}], 'Weight': 250, 'Test_pc': True}, 'Stretch_0': {'Materials': [{'material_full': 'polyamide', 'percentage': 91.5, 'material_class': ['polyamide']}, {'material_full': 'elastane', 'percentage': 8.5, 'material_class': ['elastane']}], 'Weight': 250, 'Test_pc': True}, 'Reinforcement_0': {'Materials': [{'material_full': 'polyester cordura', 'percentage': 100.0, 'material_class': ['polyester']}], 'Weight': 320, 'Test_pc': True}, 'Reinforcement_1': {'Materials': [{'material_full': 'polyamide cordura', 'percentage': 100.0, 'material_class': ['polyamide']}], 'Weight': None, 'Test_pc': True}, 'Details front panels and gusset_0': {'Materials': [{'material_full': 'polyamide', 'percentage': 88.0, 'material_class': ['polyamide']}, {'material_full': 'elastane', 'percentage': 12.0, 'material_class': ['elastane']}], 'Weight': None, 'Test_pc': True}}}","material main 47% cotton, 53% polyester, 250g/m2. contrast stretch back thigh panels 91,5% polyamide, 8.5% elastane, 250g/m2. reinforcement 1 knee pad 100% polyester cordura®, 320g/m2. reinforcement 2 pockets 100% polyamide cordura®,details front panels and gusset 88% polyamide, 12% elastane",292,False,"main_0 47% cotton, 53% polyester, 250 g/m2. stretch_0 91% polyamide, 8% elastane, 250 g/m2. reinforcement_0 100% polyester cordura, 320 g/m2. reinforcement_1 100% polyamide cordura. details front panels and gusset_0 88% polyamide, 12% elastane.",244,0,2
292,#6593,PANTS,"Main: 47% cotton, 53% polyester, 237g/m². Contrast stretch back thigh panels: 91,5% polyamide, 8.5% elastane, 250g/m². Reinforcement 1; knee pad: 100% polyester CORDURA®, 320g/m². \nReinforcement 2; pockets: 100% polyamide CORDURA® 205 g/m².\nDetails front panels and gusset: 88% polyamide, 12% elastane, 270 g/m².\n",PANTS,True,"Main: 47% cotton, 53% polyester, 237g/m². Contrast stretch back thigh panels: 91,5% polyamide, 8.5% elastane, 250g/m². Reinforcement 1; knee pad: 100% polyester CORDURA®, 320g/m². \nReinforcement 2; pockets: 100% polyamide CORDURA® 205 g/m².\nDetails front panels and gusset: 88% polyamide, 12% elastane, 270 g/m².\n","{'default': {'Main_0': {'Materials': [{'material_full': 'cotton', 'percentage': 47.0, 'material_class': ['cotton']}, {'material_full': 'polyester', 'percentage': 53.0, 'material_class': ['polyester']}], 'Weight': 237, 'Test_pc': True}, 'Stretch_0': {'Materials': [{'material_full': 'polyamide', 'percentage': 91.5, 'material_class': ['polyamide']}, {'material_full': 'elastane', 'percentage': 8.5, 'material_class': ['elastane']}], 'Weight': 250, 'Test_pc': True}, 'Reinforcement_0': {'Materials': [{'material_full': 'polyester cordura', 'percentage': 100.0, 'material_class': ['polyester']}], 'Weight': 320, 'Test_pc': True}, 'Reinforcement_1': {'Materials': [{'material_full': 'polyamide cordura', 'percentage': 100.0, 'material_class': ['polyamide']}], 'Weight': 205, 'Test_pc': True}, 'Details front panels and gusset_0': {'Materials': [{'material_full': 'polyamide', 'percentage': 88.0, 'material_class': ['polyamide']}, {'material_full': 'elastane', 'percentage': 12.0, 'material_class': ['elastane']}], 'Weight': 270, 'Test_pc': True}}}","main 47% cotton, 53% polyester, 237g/m2. contrast stretch back thigh panels 91,5% polyamide, 8.5% elastane, 250g/m2. reinforcement 1 knee pad 100% polyester cordura®, 320g/m2. reinforcement 2 pockets 100% polyamide cordura® 205 g/m2.details front panels and gusset 88% polyamide, 12% elastane, 270 g/m2.",303,False,"main_0 47% cotton, 53% polyester, 237 g/m2. stretch_0 91% polyamide, 8% elastane, 250 g/m2. reinforcement_0 100% polyester cordura, 320 g/m2. reinforcement_1 100% polyamide cordura, 205 g/m2. details front panels and gusset_0 88% polyamide, 12% elastane, 270 g/m2.",264,0,0
311,#6873,PANTS,"Main:92% recycled polyester, 8% elastane, 260 g/m2. Reinforcement: 88% polyamide, 12% elastane, 275 g/m2.",PANTS,True,"Main:92% recycled polyester, 8% elastane, 260 g/m2. Reinforcement: 88% polyamide, 12% elastane, 275 g/m2.","{'default': {'Reinforcement_0': {'Materials': [{'material_full': 'polyamide', 'percentage': 88.0, 'material_class': ['polyamide']}, {'material_full': 'elastane', 'percentage': 12.0, 'material_class': ['elastane']}], 'Weight': 275, 'Test_pc': True}}}","main92% recycled polyester, 8% elastane, 260 g/m2. reinforcement 88% polyamide, 12% elastane, 275 g/m2.",103,False,"reinforcement_0 88% polyamide, 12% elastane, 275 g/m2.",54,0,0
334,#8002,SWEATER/HOODIE,"Polartec® Power Stretch®: 84% recycled polyester, 16% elastane, 224 g/m2",SWEATER,True,"Polartec® Power Stretch®: 84% recycled polyester, 16% elastane, 224 g/m2","{'default': {'Polartec®_0': {'Materials': [{'material_full': 'recycled polyester', 'percentage': 84.0, 'material_class': ['polyester']}, {'material_full': 'elastane', 'percentage': 16.0, 'material_class': ['elastane']}], 'Weight': 224, 'Test_pc': True}}}","polartec® power stretch® 84% recycled polyester, 16% elastane, 224 g/m2",71,False,"polartec®_0 84% recycled polyester, 16% elastane, 224 g/m2.",59,0,0
335,#8003,JACKET,"Polartec® Power Stretch®: 84% recycled polyester, 16% elastane, 224 g/m2",JACKET,True,"Polartec® Power Stretch®: 84% recycled polyester, 16% elastane, 224 g/m2","{'default': {'Polartec®_0': {'Materials': [{'material_full': 'recycled polyester', 'percentage': 84.0, 'material_class': ['polyester']}, {'material_full': 'elastane', 'percentage': 16.0, 'material_class': ['elastane']}], 'Weight': 224, 'Test_pc': True}}}","polartec® power stretch® 84% recycled polyester, 16% elastane, 224 g/m2",71,False,"polartec®_0 84% recycled polyester, 16% elastane, 224 g/m2.",59,0,0
358,#8042,JACKET,"100% polyester mesh fleece, 210 g/m².",JACKET,True,"100% polyester mesh fleece, 210 g/m².","{'default': {'Main_0': {'Materials': [{'material_full': 'polyester', 'percentage': 100.0, 'material_class': ['polyester']}], 'Weight': None, 'Test_pc': True}}}","100% polyester mesh fleece, 210 g/m2.",37,False,main_0 100% polyester.,22,0,1
372,#8133,JACKET,"Material: Main: 100% polyester 150 g/m². Contrast: 90% polyester, 10% elastane 250 g/m². Reinforcement: 88% polyamide CORDURA® 12% elastane 275 g/m². Lining: 100% polyamide 65 g/m². Padding: 100% polyester, 90% of the isolation is made of Repreve® Polyester 80 g/m².",JACKET,True,"Material: Main: 100% polyester 150 g/m². Contrast: 90% polyester, 10% elastane 250 g/m². Reinforcement: 88% polyamide CORDURA® 12% elastane 275 g/m². Lining: 100% polyamide 65 g/m². Padding: 100% polyester, 90% of the isolation is made of Repreve® Polyester 80 g/m².","{'default': {'Main_0': {'Materials': [{'material_full': 'polyester', 'percentage': 100.0, 'material_class': ['polyester']}], 'Weight': 150, 'Test_pc': True}, 'Contrast_0': {'Materials': [{'material_full': 'polyester', 'percentage': 90.0, 'material_class': ['polyester']}, {'material_full': 'elastane', 'percentage': 10.0, 'material_class': ['elastane']}], 'Weight': 250, 'Test_pc': True}, 'Reinforcement_0': {'Materials': [{'material_full': 'polyamide cordura', 'percentage': 88.0, 'material_class': ['polyamide']}, {'material_full': 'elastane', 'percentage': 12.0, 'material_class': ['elastane']}], 'Weight': 275, 'Test_pc': True}, 'Lining_0': {'Materials': [{'material_full': 'polyamide', 'percentage': 100.0, 'material_class': ['polyamide']}], 'Weight': 65, 'Test_pc': True}, 'Padding_0': {'Materials': [{'material_full': 'polyester', 'percentage': 100.0, 'material_class': ['polyester']}, {'material_full': 'of the', 'percentage': 90.0, 'material_class': []}], 'Weight': None, 'Test_pc': False}}}","material main 100% polyester 150 g/m2. contrast 90% polyester, 10% elastane 250 g/m2. reinforcement 88% polyamide cordura® 12% elastane 275 g/m2. lining 100% polyamide 65 g/m2. padding 100% polyester, 90% of the isolation is made of repreve® polyester 80 g/m2.",260,False,"main_0 100% polyester, 150 g/m2. contrast_0 90% polyester, 10% elastane, 250 g/m2. reinforcement_0 88% polyamide cordura, 12% elastane, 275 g/m2. lining_0 100% polyamide, 65 g/m2. padding_0 100% polyester, 90% of the.",217,1,1


#### Parse all the extacted layers and materials names and build a set to clean the raw text

In [234]:
df_parsed_final=pd.read_csv('care_labels_parsed.csv', sep=';', index_col=0)

In [235]:
df_parsed_final.head(2)

Unnamed: 0,product_id,product_category,care_label,main_prod_cat,log_parse_cat,raw_data,parsed,clean_data,len_raw_clean_text,test_length,rebuild_text,len_rebuild,test_test_pc,test_missing_weight
0,#113,PANTS,"Main: 40% Cotton, 60% Polyester, 290 g/m².\nContrast: 53% Cotton 47% Polyester, 290 g/m².\nReinforcement Knee: 100% CORDURA®-Polyamide, 205 g/m².",PANTS,True,"Main: 40% Cotton, 60% Polyester, 290 g/m².\nContrast: 53% Cotton 47% Polyester, 290 g/m².\nReinforcement Knee: 100% CORDURA®-Polyamide, 205 g/m².","{'default': {'Main_0': {'Materials': [{'material_full': 'cotton', 'percentage': 40.0, 'material_class': ['cotton']}, {'material_full': 'polyester', 'percentage': 60.0, 'material_class': ['polyester']}], 'Weight': 290, 'Test_pc': True}, 'Contrast_0': {'Materials': [{'material_full': 'cotton', 'percentage': 53.0, 'material_class': ['cotton']}, {'material_full': 'polyester', 'percentage': 47.0, 'material_class': ['polyester']}], 'Weight': 290, 'Test_pc': True}, 'Reinforcement_0': {'Materials': [{'material_full': 'cordura polyamide', 'percentage': 100.0, 'material_class': ['polyamide']}], 'Weight': 205, 'Test_pc': True}}}","main 40% cotton, 60% polyester, 290 g/m2.contrast 53% cotton 47% polyester, 290 g/m2.reinforcement knee 100% cordura®polyamide, 205 g/m2.",137,True,"main_0 40% cotton, 60% polyester, 290 g/m2. contrast_0 53% cotton, 47% polyester, 290 g/m2. reinforcement_0 100% cordura polyamide, 205 g/m2.",141,0,0
1,#212,PANTS,"Main: DuraTwill, 52% Cotton 48% Polyamide, 240 g/m².\nReinforcement: 100% CORDURA®-Polyamide.",PANTS,True,"Main: DuraTwill, 52% Cotton 48% Polyamide, 240 g/m².\nReinforcement: 100% CORDURA®-Polyamide.","{'default': {'Main_0': {'Materials': [{'material_full': 'cotton', 'percentage': 52.0, 'material_class': ['cotton']}, {'material_full': 'polyamide', 'percentage': 48.0, 'material_class': ['polyamide']}], 'Weight': 240, 'Test_pc': True}, 'Reinforcement_0': {'Materials': [{'material_full': 'cordura polyamide', 'percentage': 100.0, 'material_class': ['polyamide']}], 'Weight': None, 'Test_pc': True}}}","main duratwill, 52% cotton 48% polyamide, 240 g/m2.reinforcement 100% cordura®polyamide.",88,True,"main_0 52% cotton, 48% polyamide, 240 g/m2. reinforcement_0 100% cordura polyamide.",83,0,1


In [236]:
# Function to recursively parse the dictionary and extract materials
def extract_materials(data_dict):
    for key, value in data_dict.items():
        if isinstance(value, dict):
            # Recursively go deeper if value is a dictionary
            extract_materials(value)
        elif isinstance(value, list):
            # Process lists to extract 'material' field
            for item in value:
                if 'material_full' in item:
                    unique_materials.add(item['material_full']) # Add material to the set

In [237]:
def count_unic_layers(data_dict):
    count = 0
    # Iterate through the outer dict (e.g., 'default')
    for color in data_dict:
        # Iterate through each material block (e.g., 'Main_0', 'Contrast_0', etc.)
        for key in data_dict[color]:
            key=key.split('_')[0]
            # Check if 'Test_pc' is present and False
            unique_layer.add(key)

In [238]:
unique_materials=set()
unique_layer=set()
for i in tqdm(range(df_parsed_final.shape[0])):
    dict_parsed=df_parsed_final.parsed.iloc[i]
    dict_parsed = ast.literal_eval(dict_parsed)
    extract_materials(dict_parsed)
    count_unic_layers(dict_parsed)

100%|██████████████████████████████████████████████████████████████████████████████| 573/573 [00:00<00:00, 6889.77it/s]


In [239]:
len(unique_materials), len(unique_layer)

(138, 25)

In [182]:
unique_materials

{'',
 '37. 5 polyester',
 '37.5',
 '37.5 polyester',
 'abs plastic polyester string/m2',
 'acrylic',
 'antistatic',
 'antistatic pu laminated',
 'aramid',
 'aramid kermel',
 'aramide',
 'belltron',
 'carbon',
 'cordura',
 'cordura and',
 'cordura polyamide',
 'cordura polyamide 500d and 1000d',
 'cordura polyamide leather pouches',
 'cordura polyamideleather pouches',
 'cordura polyester',
 'cordura solution dyed polyester',
 'cotton',
 'cotton 40. s',
 'cv fr',
 'd3o',
 'd3o lite',
 'dupont kevlar aramid',
 'dupont kevlar aramid fibers',
 'durable and comfortable polyester',
 'elastan',
 'elastane',
 'elastane 275. g/m2',
 'elastane 295 g contrast',
 'elastane schoeller',
 'elastane weight',
 'epdm rubber',
 'eva',
 'fullgrain leather 0.81.0 mm thick',
 'glass fiber',
 'glass fibre',
 'goatskin',
 'goatskin thinsulate',
 'goretex polyamide',
 'goretex polyester',
 'hppe',
 'latex',
 'leather',
 'lycra',
 'merino wool',
 'metaaramid',
 'metal fibre',
 'modacryl fr',
 'modacrylic',
 'mo

#### Pydantic object structures

In [38]:
from typing import List, Dict, Optional
from pydantic import BaseModel, Field, model_validator

In [145]:
# Define the Material class
class Material(BaseModel):
    material_full: str
    percentage: float
    material_class:List[str]

    @model_validator(mode="before")
    def check_percentage(cls, values):
        """Ensure percentage is a valid float between 0 and 100."""
        percentage = values.get('percentage')
        if percentage is not None and (percentage < 0 or percentage > 100):
            raise ValueError(f'Percentage {percentage} must be between 0 and 100.')
        return values

# Define the GarmentPart class with type and number attributes
class GarmentPart(BaseModel):
    type: str  # e.g., Main, Contrast, Reinforcement
    number: int  # e.g., 0, 1, 2 (corresponding to Main_0, Contrast_0, etc.)
    Materials: List[Material]
    Weight: Optional[float]  # Weight might not always be present
    Test_pc: bool

    @property
    def total_percentage(self) -> float:
        """Calculate the total percentage of materials in the garment part."""
        return sum(material.percentage for material in self.Materials)

    @model_validator(mode="before")
    def check_total_percentage(cls, values):
        """Ensure that the total percentage of materials equals 100%."""
        materials = values.get('Materials', [])
        total_percentage = sum(material.percentage for material in materials)
        if not (99.9 <= total_percentage <= 100.1):  # Allow small float imprecision
            raise ValueError(f'Total material percentage {total_percentage}% does not sum to 100%.')
        return values

# Define the ColorSection class
class ColorSection(BaseModel):
    name: str 
    parts: List[GarmentPart]  # List of garment parts for each color section
    total_parts: int

# Define the Garment composition model
class GarmentComposition(BaseModel):
    product_id: str
    product_category: str
    product_main_category: str
    colour_blocks: List[ColorSection]
    total_color_blocks: int
    

In [146]:
# Create a JSON schema from the model
display(GarmentComposition.model_json_schema())

{'$defs': {'ColorSection': {'properties': {'name': {'title': 'Name',
     'type': 'string'},
    'parts': {'items': {'$ref': '#/$defs/GarmentPart'},
     'title': 'Parts',
     'type': 'array'},
    'total_parts': {'title': 'Total Parts', 'type': 'integer'}},
   'required': ['name', 'parts', 'total_parts'],
   'title': 'ColorSection',
   'type': 'object'},
  'GarmentPart': {'properties': {'type': {'title': 'Type', 'type': 'string'},
    'number': {'title': 'Number', 'type': 'integer'},
    'Materials': {'items': {'$ref': '#/$defs/Material'},
     'title': 'Materials',
     'type': 'array'},
    'Weight': {'anyOf': [{'type': 'number'}, {'type': 'null'}],
     'title': 'Weight'},
    'Test_pc': {'title': 'Test Pc', 'type': 'boolean'}},
   'required': ['type', 'number', 'Materials', 'Weight', 'Test_pc'],
   'title': 'GarmentPart',
   'type': 'object'},
  'Material': {'properties': {'material_full': {'title': 'Material Full',
     'type': 'string'},
    'percentage': {'title': 'Percentage'

In [147]:
# Function to convert parsed data into the new schema
def convert_parsed_data_to_pydantic(parsed_line):
    product_id=parsed_line["product_id"]
    product_category=parsed_line["product_category"]
    main_prod_cat = parsed_line["main_prod_cat"]
    parsed_data = parsed_line["parsed"]
    parsed_data_dict=ast.literal_eval(parsed_data)
    
    def process_garment_parts(parts_data):
        garment_parts = []
        for key, value in parts_data.items():
            # Split the part name to extract type and number (e.g., "Main_0" -> "Main", 0)
            part_type, part_number = key.rsplit('_', 1)
            part_number = int(part_number)  # Convert the number to an integer

            # Create the GarmentPart object with extracted type and number
            garment_part = GarmentPart(
                type=part_type,
                number=part_number,
                Materials=[Material(**material) for material in value['Materials']],
                Weight=value.get('Weight'),
                Test_pc=value['Test_pc']
            )
            garment_parts.append(garment_part)

            #total_parts
            total_parts=len(garment_parts)
        return garment_parts, total_parts

    # Process any color blocks if present
    color_blocks = []
    for color_key, color_data in parsed_data_dict.items():
        if color_key != 'default':
            color_parts,total_number_parts=process_garment_parts(color_data)
            color_blocks.append(ColorSection(name = color_key,parts=color_parts, total_parts=total_number_parts))
        else:
            # Process the default section
            color_parts,total_number_parts=process_garment_parts(color_data)
            color_blocks.append(ColorSection(name = 'default_color',parts=color_parts, total_parts=total_number_parts))

    # Create the GarmentComposition object
    return GarmentComposition(
        #default=default_section,
        colour_blocks=color_blocks,
        total_color_blocks=len(color_blocks),
        product_id = product_id,
        product_category=product_category,
        product_main_category=main_prod_cat
    )

In [189]:
seed = random.randint(0,df_parsed_final.shape[0])
print(seed)
parsed_data=df_parsed_final.iloc[seed,:]
try:
    parsed_data_classObject=convert_parsed_data_to_pydantic(parsed_data)
except Exception as e:
    print(e)

325


In [190]:
display(parsed_data_classObject)

GarmentComposition(product_id='#6944', product_category='PANTS', product_main_category='PANTS', colour_blocks=[ColorSection(name='default_color', parts=[GarmentPart(type='Main', number=0, Materials=[Material(material_full='polyamide', percentage=51.0, material_class=['polyamide']), Material(material_full='polyamide cordura', percentage=40.0, material_class=['polyamide']), Material(material_full='elastane', percentage=9.0, material_class=['elastane'])], Weight=220.0, Test_pc=True), GarmentPart(type='Contrast', number=0, Materials=[Material(material_full='polyamide', percentage=92.0, material_class=['polyamide']), Material(material_full='elastane', percentage=8.0, material_class=['elastane'])], Weight=250.0, Test_pc=True), GarmentPart(type='Reinforcement', number=0, Materials=[Material(material_full='polyamide cordura', percentage=100.0, material_class=['polyamide'])], Weight=205.0, Test_pc=True), GarmentPart(type='Reinforcement', number=1, Materials=[Material(material_full='polyester co

#### can we do better parsing with spacy NER ?  
- step0: build train dataset
- step1: train custom NER
- step2: apply custom NER on pipeline
- step3:  ompare performance  

In [10]:
#step0: create text corpus from dataset
text=[]
for i in range(50):
    seed=random.randint(0,df_cf.shape[0])
    text.append(df_cf.care_label.iloc[seed])

with open("spacy_ner_train.txt", mode="wt") as f:
    f.write("\n".join([my_val for my_val in text]))
    f.close()

In [8]:
text

['100% PU laminated Polyester.',
 'Main: 61% polyester, 39% Sorona® polyester, 260 g/m².  Reinforcement: 100% polyamide CORDURA®, 205 g/m². Reinforcement: 53% solution dyed CORDURA® polyamide, 47% CORDURA® solution dyed polyester, 283 g/m²',
 '53% Merino Wool, 39% Polyamide, 8% Elastane',
 '95% Cotton, 5% Elastane, 210 g/m².',
 'Main: 88% Polyester, 12% Elastane. Contrast: 65% Polyester, 35% Sorona.\n',
 'Main: 47% cotton, 53% polyester, 251 g/m². Contrast: 91.5% polyamide, 8.5% elastane, 250 g/m². Reinforcement: 100% CORDURA®, polyester 320 g/m². Reinforcement: 100% CORDURA® polyamide, 205 g/m².  Colour 0904; Main: 61% polyester 39% Sorona® polyester, 252 g/m². Contrast: 91.5% polyamide, 8.5% elastane, 250 g/m². Reinforcement: 53% solution dyed CORDURA® polyamide, 47% solution dyed CORDURA® polyester, 283 g/m². Reinforcement: 100% polyamide CORDURA®, 205 g/m²',
 '85% polyester, 15% cotton, 350 g/m2.',
 '100% polyester, 235 g/m2.',
 'Main: 84% polyamide 16% elastane, 193 g/m2. Contrast