### 1. Goal
Classify the headers of a user-uploaded Bill of Materials (BOM) table into a fixed set of predefined categories.

The Relationship Between Categories and BOM Headers

| Categories       | BOM Headers |
|-----------------|------------------------------------------------------------|
| **公司料號** (Company Part No.)    | Comp_item, 元件料號, serial number, 子件代碼, 物料型號 |
| **料號描述** (Part Description)    | Description, Size/Dimension, Comment, Designator, 材料規格, 物料描述 |
| **料號淨重** (Part Net Weight)    | Net weight |
| **料號毛重** (Part Gross Weight)    | Gross weight |
| **淨毛重單位** (Net/Gross Unit)  | 單位, unit, Net weight unit, Gross weight unit |

### 2. User Story
When users upload a new, unseen BOM table (e.g., `test_bom.xlsx`), the service needs to classify the headers into corresponding categories.

True Relationship Between `test_bom.xlsx` Headers and Pre-defined Categories

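| Categories       | BOM Headers |
|-----------------|------------|
| **公司料號** (Company Part No.)    | 料號 (Part Number) |
| **料號描述** (Part Description)    | 規格 (Specification), 備註 (Remarks) |
| **料號淨重** (Part Net Weight)    | 淨重 (Net Weight) |
| **料號毛重** (Part Gross Weight)    | 毛重 (Gross Weight) |
| **淨毛重單位** (Net/Gross Unit)  | 重量單位 (Weight Unit) |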

### 3. Classification examples with test_bom
#### 3.1 Matched categories with their headers:
<span style="color:yellow;">- For demonstration, a few of the BOM headers from test_bom are misclassified or missing.</span>
```json
{
    "公司料號": ["料號"],
    "料號描述": ["規格", "備註", "重量單位"],
    "料號毛重": ["毛重"]
}
```

#### 3.2 Unmatched categories:
<span style="color:yellow;">- Categories 料號淨重 and 淨毛重單位 are not matched. <u>This can be detected with a rule</u>, for example requiring at least one matching header per category.</span>
```json
[
    "料號淨重",
    "淨毛重單位"
]
```
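
The "at least one header per category" rule can be sketched in a few lines. This is a minimal illustration, not the notebook's implementation: the helper name `find_unmatched_categories` and the `min_headers` parameter are hypothetical, and the input is the demonstration mapping from section 3.1.

```python
# Hypothetical helper; the category list is copied from the table in section 1.
CATEGORIES = ["公司料號", "料號描述", "料號淨重", "料號毛重", "淨毛重單位"]

def find_unmatched_categories(matched, categories=CATEGORIES, min_headers=1):
    """Return the categories that have fewer than `min_headers` matched headers."""
    return [c for c in categories if len(matched.get(c, [])) < min_headers]

# Demonstration result from section 3.1.
matched = {
    "公司料號": ["料號"],
    "料號描述": ["規格", "備註", "重量單位"],
    "料號毛重": ["毛重"],
}

unmatched = find_unmatched_categories(matched)
print(unmatched)  # ['料號淨重', '淨毛重單位']
```

Raising `min_headers` would tighten the rule for categories that are expected to map to several columns.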


#### 3.3 Matched categories with wrong headers:
<span style="color:yellow;">- 重量單位 is misclassified into 料號描述; the correct category is 淨毛重單位. <u>This should be detected by validation after classification</u>, for example by validating the data type of the column.</span>
```json
{
    "料號描述": ["重量單位"]
}
```
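
A data type check of this kind could look roughly like the sketch below. The names `ALLOWED_DTYPES` and `validate_dtypes` are hypothetical, and the column dtypes in the example are assumed values rather than ones read from `test_bom.xlsx`. Note one limitation: the check flags a text column assigned to a weight category, but it cannot tell two text categories apart, so on its own it would not catch 重量單位 landing in 料號描述.

```python
# Assumed mapping: the dtype names each category is allowed to have.
ALLOWED_DTYPES = {
    "公司料號": {"object", "str"},
    "料號描述": {"object", "str"},
    "料號淨重": {"int64", "float64"},
    "料號毛重": {"int64", "float64"},
    "淨毛重單位": {"object", "str"},
}

def validate_dtypes(classified, column_dtypes):
    """classified: header -> category; column_dtypes: header -> dtype name.

    Returns header -> True when the dtype is allowed for the assigned category.
    """
    return {
        header: column_dtypes[header] in ALLOWED_DTYPES[category]
        for header, category in classified.items()
    }

# Hypothetical classification result and dtypes, for illustration only.
classified = {"重量單位": "料號描述", "備註": "料號淨重"}
column_dtypes = {"重量單位": "object", "備註": "object"}

checks = validate_dtypes(classified, column_dtypes)
print(checks)  # {'重量單位': True, '備註': False}
```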


#### 3.4 Unmatched BOM headers:
<span style="color:yellow;">- These are BOM headers that the model failed to classify.</span>
```json
["淨重"]
```
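
Unmatched headers are simply the uploaded headers that appear under no category. A minimal sketch, assuming the matched result is a dictionary of category to header list as in section 3.1 (the helper name is hypothetical):

```python
def find_unmatched_headers(uploaded_headers, matched):
    """Return uploaded headers that were not assigned to any category."""
    assigned = {h for headers in matched.values() for h in headers}
    # Keep the upload order so the report is easy to compare with the file.
    return [h for h in uploaded_headers if h not in assigned]

uploaded_headers = ["料號", "規格", "備註", "淨重", "毛重", "重量單位"]
matched = {
    "公司料號": ["料號"],
    "料號描述": ["規格", "備註", "重量單位"],
    "料號毛重": ["毛重"],
}

unmatched_headers = find_unmatched_headers(uploaded_headers, matched)
print(unmatched_headers)  # ['淨重']
```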



Install `openpyxl` with pip or conda; `pd.read_excel` needs it to read the `.xlsx` files below.

In [830]:
import pandas as pd
import numpy as np
data_path="data/"
import re
import json

from langchain_core.tools import tool
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import PromptTemplate
import google.generativeai as genai

In [1307]:
bom1=pd.read_excel(data_path+"bom1.xlsx")
bom2=pd.read_excel(data_path+"bom2.xlsx")
test_bom=pd.read_excel(data_path+"test_bom.xlsx")
case1=pd.read_excel(data_path+"case1.xlsx")

In [913]:
type(5.6)

float

In [None]:
ground_truth={
    'Categories':['Company Part No.','Part Description','Part Net Weight','Part Gross Weight','Net/Gross Unit'],
    'BOM headers':[
        ['Company Part No.','公司料號','Comp_item','元件料號','serial number','子件代碼','物料型號'],
        ['Part Description','料號描述','Description', 'Size/Dimension', 'Comment', 'Designator','材料規格', '物料描述'],
        ['Part Net Weight','料號淨重','Net weight'],
        ['Part Gross Weight','料號毛重','Gross weight'],
        ['Net/Gross Unit','淨毛重單位','單位','unit', 'Net weight unit','Gross weight unit']
    ]
    }

data_type={
    'Categories':['Company Part No.','Part Description','Part Net Weight','Part Gross Weight','Net/Gross Unit'],
    'data_type':[
        ['object','str'],
        ['object','str'],
        ['int64','float64'],
        ['int64','float64'],
        ['object','str']
    ]
    }

data_type_df=pd.DataFrame(data_type)

reversed_ground_truth = {}
for category, headers in zip(ground_truth['Categories'], ground_truth['BOM headers']):
    for header in headers:
        reversed_ground_truth[header] = category
print(reversed_ground_truth)

{'Company Part No.': 'Company Part No.', '公司料號': 'Company Part No.', 'Comp_item': 'Company Part No.', '元件料號': 'Company Part No.', 'serial number': 'Company Part No.', '子件代碼': 'Company Part No.', '物料型號': 'Company Part No.', 'Part Description': 'Part Description', '料號描述': 'Part Description', 'Description': 'Part Description', 'Size/Dimension': 'Part Description', 'Comment': 'Part Description', 'Designator': 'Part Description', '材料規格': 'Part Description', '物料描述': 'Part Description', 'Part Net Weight': 'Part Net Weight', '料號淨重': 'Part Net Weight', 'Net weight': 'Part Net Weight', 'Part Gross Weight': 'Part Gross Weight', '料號毛重': 'Part Gross Weight', 'Gross weight': 'Part Gross Weight', 'Net/Gross Unit': 'Net/Gross Unit', '淨毛重單位': 'Net/Gross Unit', '單位': 'Net/Gross Unit', 'unit': 'Net/Gross Unit', 'Net weight unit': 'Net/Gross Unit', 'Gross weight unit': 'Net/Gross Unit'}


In [None]:
helper = ChatGoogleGenerativeAI(model="gemini-2.0-flash",temperature=1,max_tokens=None,timeout=None,)
json_promt="""
Your goal is to help data engineer to understand inventory database header.
Output very detailed inventory database header description into json format.
Do not need to mention chinese context in the description.
Examples start here.
Given: ['Company Part No.', '子件代碼']
Return: 
```json
    "Company Part No.":"Unique identifier assigned to a part by the company. This is the internal part number used for tracking and identification within the organization. It serves as a primary key for referencing the part in various systems and processes.",
    "子件代碼":"Identifier for a sub-component within a larger assembly. This code allows for easy tracking and management of smaller parts that make up a more complex product. It helps in managing inventory and understanding the composition of the final product."
```
Begin now.
"""

In [839]:
train_df=pd.DataFrame({'BOM headers':reversed_ground_truth.keys(),'Categories':reversed_ground_truth.values()})
train_df['strip_BOM_headers']=train_df['BOM headers'].apply(lambda x: re.sub(r"\s+", "", x).replace("_", "").lower())
print(train_df.head(3))

header_description=helper.invoke(json_promt+str(list(train_df['BOM headers'])))
print(header_description.content)
description=re.sub(r"^```json\n|```$", "", header_description.content, flags=re.MULTILINE)
header_description_json=json.loads(description)
header_description_df=pd.DataFrame({"BOM headers":header_description_json.keys(),"column_description":header_description_json.values()})

        BOM headers        Categories strip_BOM_headers
0  Company Part No.  Company Part No.    companypartno.
1              公司料號  Company Part No.              公司料號
2         Comp_item  Company Part No.          compitem
```json
{
    "Company Part No.":"Unique identifier assigned to a part by the company. This is the internal part number used for tracking and identification within the organization. It serves as a primary key for referencing the part in various systems and processes.",
    "公司料號":"An alternative company part number, potentially used for specific departments or systems within the organization. It serves as a unique identifier for the part, similar to the primary 'Company Part No.', but may have different formatting or usage rules.",
    "Comp_item":"Abbreviated form of 'Component Item', referring to the unique identifier or code assigned to a specific component used in a product or assembly. Used for tracking and managing individual components within a larger system.

In [840]:
train_df=train_df.merge(header_description_df,how='left',on='BOM headers')
#train_df['column_description']=train_df['BOM headers']+": "+train_df['column_description']
train_df.head(3)

Unnamed: 0,BOM headers,Categories,strip_BOM_headers,column_description
0,Company Part No.,Company Part No.,companypartno.,Unique identifier assigned to a part by the co...
1,公司料號,Company Part No.,公司料號,"An alternative company part number, potentiall..."
2,Comp_item,Company Part No.,compitem,"Abbreviated form of 'Component Item', referrin..."
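
The reply from the model arrives wrapped in a Markdown code fence, which the cell above strips with a regular expression before calling `json.loads`. A slightly more defensive variant is sketched below; the function name `parse_json_reply` is hypothetical, and the fallback to the outermost braces is an added assumption about how replies might deviate, not something observed in this notebook.

```python
import json
import re

def parse_json_reply(reply_text):
    """Parse a JSON object from a model reply that may be wrapped in a code fence."""
    # Remove fence lines such as ```json or ``` that sit on their own line.
    cleaned = re.sub(r"^```(?:json)?\s*$", "", reply_text.strip(), flags=re.MULTILINE)
    # Fall back to the outermost braces in case extra text surrounds the object.
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in reply")
    return json.loads(cleaned[start:end + 1])

reply = '```json\n{"淨重": "Net Weight", "毛重": "Gross Weight"}\n```'
parsed = parse_json_reply(reply)
print(parsed)  # {'淨重': 'Net Weight', '毛重': 'Gross Weight'}
```

Malformed JSON still raises `json.JSONDecodeError`, so a caller can retry the request or mark the affected headers as pending.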


In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
embeddings_model = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large")

Reference: `HuggingFaceEmbeddings` API documentation, https://python.langchain.com/api_reference/huggingface/embeddings/langchain_huggingface.embeddings.huggingface.HuggingFaceEmbeddings.html

In [841]:
bom_headers_embedding_list=[]
for c in train_df['BOM headers']:
    embedding = embeddings_model.embed_query(c)
    bom_headers_embedding_list.append(embedding)
print(bom_headers_embedding_list)

description_embedding_list=[]
for c in train_df['column_description']:
    embedding = embeddings_model.embed_query(c)
    description_embedding_list.append(embedding)
print(description_embedding_list)

[[6.642220978392288e-05, -0.010337683372199535, -0.01148230955004692, -0.03415466472506523, 0.005939788185060024, -0.0003789965121541172, 0.004023274406790733, 0.052121374756097794, 0.04695210978388786, -0.02690179832279682, 0.042652033269405365, 0.006332517601549625, -0.05020089074969292, -0.03272632509469986, -0.015712684020400047, -0.026301229372620583, -0.018420176580548286, 0.008542709983885288, -0.004355146083980799, 0.0021711040753871202, 0.026408711448311806, -0.012764856219291687, -0.042218755930662155, -0.0038842298090457916, 0.0007194860954768956, -0.04637666419148445, -0.04588543251156807, -0.023558130487799644, -0.01675584726035595, -0.03137725964188576, 0.021346934139728546, 0.009896727278828621, -0.04399774223566055, -0.012245206162333488, -0.03025130368769169, 0.02732754312455654, 0.052898697555065155, 0.04571720212697983, -0.03843925520777702, 0.02778533287346363, -0.02797052264213562, 0.04984840378165245, 0.008450601249933243, -0.07426391541957855, -0.0019508205587044

In [842]:
train_df['BOM_headers_embedding']=bom_headers_embedding_list
train_df['BOM_headers_relevant_score']=0
train_df['description_embedding']=description_embedding_list
train_df['description_relevant_score']=0
train_df.to_parquet("embeded_train_df.gzip")
train_df.head(3)

Unnamed: 0,BOM headers,Categories,strip_BOM_headers,column_description,BOM_headers_embedding,BOM_headers_relevant_score,description_embedding,description_relevant_score
0,Company Part No.,Company Part No.,companypartno.,Unique identifier assigned to a part by the co...,"[6.642220978392288e-05, -0.010337683372199535,...",0,"[0.026099292561411858, -0.005072433967143297, ...",0
1,公司料號,Company Part No.,公司料號,"An alternative company part number, potentiall...","[0.01681557111442089, 0.01526159979403019, -0....",0,"[0.007724988739937544, -0.02758294716477394, -...",0
2,Comp_item,Company Part No.,compitem,"Abbreviated form of 'Component Item', referrin...","[0.010160735808312893, 0.025978170335292816, -...",0,"[0.030199265107512474, 0.0049665868282318115, ...",0


In [None]:
# Not in use
def L1_match(uploaded_df):
    train_df=pd.read_parquet("embeded_train_df.gzip")
    result_df=pd.DataFrame()
    L1_output_category_list=[]
    L1_matched_list=[]
    L1_status=[]
    for c in uploaded_df.columns:
        # Strip whitespace and underscores, then lowercase, to match 'strip_BOM_headers'
        c = re.sub(r"\s+", "", c).replace("_", "").lower()
        if c in list(train_df['strip_BOM_headers']):
            L1_output_category_list.append(train_df[train_df['strip_BOM_headers']==c].iloc[0]['Categories'])
            L1_matched_list.append(train_df[train_df['strip_BOM_headers']==c].iloc[0]['strip_BOM_headers'])
            L1_status.append('L1 success')
        else:
            L1_output_category_list.append(None)
            L1_matched_list.append(None)
            L1_status.append('pending')
    
    result_df['uploaded_BOM_headers']=list(uploaded_df.columns)
    result_df['status']=L1_status
    result_df['L1_output_category']=L1_output_category_list
    result_df['L1_matched_reference_header']=L1_matched_list
    
    return result_df
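

The exact-match step boils down to normalizing both sides and looking the result up. The sketch below shows the same idea with a plain dictionary instead of repeated DataFrame filtering. The names `normalize_header`, `lookup` and `exact_match` are hypothetical, and the reference table is a three-row excerpt chosen for illustration.

```python
import re

def normalize_header(header):
    """Remove whitespace and underscores, then lowercase (same rule as strip_BOM_headers)."""
    return re.sub(r"\s+", "", str(header)).replace("_", "").lower()

# Small illustrative excerpt of the reference headers: header -> category.
reference = {
    "Comp_item": "Company Part No.",
    "Net weight": "Part Net Weight",
    "單位": "Net/Gross Unit",
}
# Build the lookup once; each match is then a single dictionary access.
lookup = {normalize_header(h): category for h, category in reference.items()}

def exact_match(headers):
    """Return header -> category, or None when no normalized reference header matches."""
    return {h: lookup.get(normalize_header(h)) for h in headers}

matches = exact_match(["comp item", "NET_WEIGHT", "備註"])
print(matches)
# {'comp item': 'Company Part No.', 'NET_WEIGHT': 'Part Net Weight', '備註': None}
```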

In [883]:
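# DataFrame addition aligns on column labels, so `test` carries the union of all headers.
# Only the column names are used below; the cell values are not meaningful.
test=bom1+bom2+test_bom+case1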
test.head(3)

Unnamed: 0,Comp_item,Description,Gross weight,Net weight,Size/Dimension,unit,体积单位,体重,備註,元件料號,...,性别,料號,材料規格,毛重,淨重,物料描述,花费,規格,部门,重量單位
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
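
Since only the header names matter for this test, the union can also be built directly from the column lists, which keeps first-seen order instead of sorting. A minimal sketch with a hypothetical helper name and made-up header lists:

```python
def union_of_headers(*header_lists):
    """Return every distinct header once, in first-seen order."""
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(h for headers in header_lists for h in headers))

# Hypothetical header lists standing in for the columns of several BOM files.
headers_a = ["Comp_item", "Description", "unit"]
headers_b = ["元件料號", "單位", "unit"]
headers_c = ["料號", "規格", "Description"]

all_headers = union_of_headers(headers_a, headers_b, headers_c)
print(all_headers)
# ['Comp_item', 'Description', 'unit', '元件料號', '單位', '料號', '規格']
```

With DataFrames, the same call would take `bom1.columns`, `bom2.columns`, and so on as arguments.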


In [884]:
def L1_translate_match(uploaded_df):
    translate_prompt="""
    You are english translator.
    Output english synonyms into json format. If the original term is english, return the original term.
    Examples start here.
    Given: ['Company Part No.', '淨重']
    Return: 
    ```json
        "Company Part No.":"Company Part No.",
        "淨重":"Net Weight"
    ```
    Begin now.
    """
    L1_translated_BOM_headers_str=helper.invoke(translate_prompt+str(list(uploaded_df.columns))).content
    #print(L1_translated_BOM_headers_str)
    L1_translated_BOM_headers_str=re.sub(r"^```json\n|```$", "", L1_translated_BOM_headers_str, flags=re.MULTILINE)
    L1_translated_BOM_headers_json=json.loads(L1_translated_BOM_headers_str)
    L1_df=pd.DataFrame({"L1_translated_BOM_headers":L1_translated_BOM_headers_json.values()})
    L1_df["BOM headers"]=uploaded_df.columns
    
    train_df=pd.read_parquet("embeded_train_df.gzip")
    result_df=pd.DataFrame()
    L1_output_category_list=[]
    L1_matched_list=[]
    L1_status=[]

    for c in L1_df["BOM headers"]:
        c_strip = re.sub(r"\s+", "", c).replace("_", "").lower()
        c_translated=L1_df[L1_df["BOM headers"]==c].iloc[0]['L1_translated_BOM_headers']
        c_translated_strip = re.sub(r"\s+", "", c_translated).replace("_", "").lower()
        #print(c,c_strip,c_translated_strip)
        if c_strip in list(train_df['strip_BOM_headers']):
            L1_output_category_list.append(train_df[train_df['strip_BOM_headers']==c_strip].iloc[0]['Categories'])
            L1_matched_list.append(train_df[train_df['strip_BOM_headers']==c_strip].iloc[0]['strip_BOM_headers'])
            L1_status.append('L1 success')
        elif c_translated_strip in list(train_df['strip_BOM_headers']):
            L1_output_category_list.append(train_df[train_df['strip_BOM_headers']==c_translated_strip].iloc[0]['Categories'])
            L1_matched_list.append(train_df[train_df['strip_BOM_headers']==c_translated_strip].iloc[0]['strip_BOM_headers'])
            L1_status.append('L1 success after translation')
        else:
            L1_output_category_list.append(None)
            L1_matched_list.append(None)
            L1_status.append('pending')
    
    result_df['uploaded_BOM_headers']=list(uploaded_df.columns)
    result_df['status']=L1_status
    result_df['L1_output_category']=L1_output_category_list
    result_df['L1_translated_BOM_headers']=L1_df['L1_translated_BOM_headers']
    result_df['L1_matched_reference_header']=L1_matched_list
    
    return result_df
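

One thing to watch in the function above: the translated values are paired with the uploaded headers by position, which assumes the model returns one entry per header in the original order. Pairing by key avoids that assumption. The sketch below is a hypothetical alternative, not the notebook's code; `align_translations` is an illustrative name, and headers missing from the reply fall back to themselves.

```python
def align_translations(headers, translations):
    """Map each uploaded header to its translation, falling back to the header itself."""
    return {h: translations.get(h, h) for h in headers}

headers = ["淨重", "毛重", "unit"]
# Hypothetical model reply: reordered, and missing the 'unit' entry.
translations = {"毛重": "Gross Weight", "淨重": "Net Weight"}

aligned = align_translations(headers, translations)
print(aligned)  # {'淨重': 'Net Weight', '毛重': 'Gross Weight', 'unit': 'unit'}
```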

In [885]:
result_df=L1_translate_match(test)

In [887]:
result_df[result_df['status']=='L1 success']

Unnamed: 0,uploaded_BOM_headers,status,L1_output_category,L1_translated_BOM_headers,L1_matched_reference_header
0,Comp_item,L1 success,Company Part No.,Comp_item,compitem
1,Description,L1 success,Part Description,Description,description
2,Gross weight,L1 success,Part Gross Weight,Gross weight,grossweight
3,Net weight,L1 success,Part Net Weight,Net weight,netweight
4,Size/Dimension,L1 success,Part Description,Size/Dimension,size/dimension
5,unit,L1 success,Net/Gross Unit,unit,unit
9,元件料號,L1 success,Company Part No.,Component Part No.,元件料號
11,單位,L1 success,Net/Gross Unit,Unit,單位
14,材料規格,L1 success,Part Description,Material Specification,材料規格
17,物料描述,L1 success,Part Description,Material Description,物料描述


In [888]:
result_df[result_df['status']=='L1 success after translation']

Unnamed: 0,uploaded_BOM_headers,status,L1_output_category,L1_translated_BOM_headers,L1_matched_reference_header
15,毛重,L1 success after translation,Part Gross Weight,Gross Weight,grossweight
16,淨重,L1 success after translation,Part Net Weight,Net Weight,netweight


In [886]:
result_df[result_df['status']=='pending']

Unnamed: 0,uploaded_BOM_headers,status,L1_output_category,L1_translated_BOM_headers,L1_matched_reference_header
6,体积单位,pending,,Volume Unit,
7,体重,pending,,Weight,
8,備註,pending,,Remarks,
10,公司商号,pending,,Company Name,
12,性别,pending,,Gender,
13,料號,pending,,Part Number,
18,花费,pending,,Cost,
19,規格,pending,,Specification,
20,部门,pending,,Department,
21,重量單位,pending,,Weight Unit,


In [999]:
result=L1_translate_match(bom1)

In [1004]:
len(result[result['status']=='pending'])

0

In [1024]:
def L2_relevant2(result_df):
    if len(result_df[result_df['status']=='pending'])>=1:
        train_df=pd.read_parquet("embeded_train_df.gzip")
        L2_df=pd.DataFrame()
        header_description=helper.invoke(json_promt+str(list(result_df[result_df['status']=='pending']['uploaded_BOM_headers'])))
        #print(header_description.content)
        description=re.sub(r"^```json\n|```$", "", header_description.content, flags=re.MULTILINE)
        header_description_json=json.loads(description)
        header_description_df=pd.DataFrame({"BOM headers":header_description_json.keys(),"column_description":header_description_json.values()})
        for input_header in header_description_df['BOM headers']:
            input_header_description=header_description_df[header_description_df['BOM headers']==input_header]['column_description'].iloc[0]
            
            #header_threshold=0.8
            description_threshold=0 # apply threshold later
            ranking=1
            header_embedding = embeddings_model.embed_query(input_header)   
            description_embedding = embeddings_model.embed_query(input_header_description)

            train_df['L2_BOM_headers_relevant_score'] = np.dot(np.stack(train_df['BOM_headers_embedding']), header_embedding) # can remove this line to speed up. BOM header relevant score is not for classification but for reference only
            train_df['L2_description_relevant_score'] = np.dot(np.stack(train_df['description_embedding']), description_embedding)
            
            #bom_header_top_relevant=train_df.loc[(train_df['BOM_headers_relevant_score']>header_threshold)].sort_values('BOM_headers_relevant_score',ascending=False).head(ranking)
            # .copy() so the column assignments below do not write to a slice of train_df
            description_top_relevant=train_df.loc[(train_df['L2_description_relevant_score']>description_threshold)].sort_values('L2_description_relevant_score',ascending=False).head(ranking).copy()
            description_top_relevant['uploaded_BOM_headers']=input_header
            description_top_relevant['L2_uploaded_BOM_headers_description']=input_header_description
            L2_df=pd.concat([L2_df,description_top_relevant[['uploaded_BOM_headers','L2_description_relevant_score','Categories','BOM headers','column_description','L2_uploaded_BOM_headers_description','L2_BOM_headers_relevant_score']]])

            #result_df.loc[result_df['uploaded_BOM_headers'] == input_header, 'status'] = 'L2 success'
        L2_df=L2_df.rename(columns={"Categories": "L2_output_category", "BOM headers": "L2_relevant_BOM_headers","column_description":"L2_relevant_column_description"})
        result_df=result_df.merge(L2_df,how='left',on='uploaded_BOM_headers')
        result_df.loc[result_df['L2_description_relevant_score']>=0.9,'status']='L2 success'
        return result_df
    else:
        return result_df    
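

The core of the L2 step is a nearest-neighbour search: score the new description against every reference description, keep the best one, and accept it only above a threshold. The pure-Python sketch below illustrates that idea with toy 3-dimensional vectors. The function names are hypothetical and the vectors are invented. It uses cosine similarity, which equals the dot product used above only when the embeddings have unit length; whether that holds for the model in this notebook is not verified here.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_match(query_vec, reference, threshold=0.9):
    """reference: list of (category, vector). Return (category or None, best score)."""
    category, score = max(
        ((c, cosine(query_vec, v)) for c, v in reference),
        key=lambda pair: pair[1],
    )
    return (category if score >= threshold else None), score

# Toy reference embeddings, invented for illustration.
reference = [
    ("Part Net Weight", [1.0, 0.0, 0.0]),
    ("Net/Gross Unit", [0.0, 1.0, 0.0]),
]

hit = best_match([0.9, 0.1, 0.0], reference)   # close to the first vector
miss = best_match([0.6, 0.6, 0.5], reference)  # close to neither vector
print(hit[0], miss[0])  # Part Net Weight None
```

Headers whose best score stays below the threshold keep the `pending` status, which is what later turns into "unable classify".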

In [None]:
# not used
def L2_relevant(result_df):
    train_df=pd.read_parquet("embeded_train_df.gzip")
    L2_df=pd.DataFrame()
    for input_header in result_df[result_df['status']=='pending']['uploaded_BOM_headers']:
        promt="""
        Your goal is to help data engineer to understand inventory database header.
        Your output should only contain very detailed inventory database header description.
        Do not need to mention chinese context in the description.
        Examples start here.
        database header: Part Number
        Identifier assigned to a specific part, often alphanumeric. This is the primary key for identifying and tracking a part throughout its lifecycle, from design and manufacturing to inventory management and sales. It ensures uniqueness and avoids confusion between similar but distinct parts.
        database header: 子件代碼
        Identifier for a sub-component within a larger assembly. This code allows for easy tracking and management of smaller parts that make up a more complex product. It helps in managing inventory and understanding the composition of the final product.
        Begin now.
        database header:"""
        input_header_description=helper.invoke(promt+input_header)
        
        #header_threshold=0.8
        description_threshold=0 # apply threshold later
        ranking=1
        header_embedding = embeddings_model.embed_query(input_header)   
        description_embedding = embeddings_model.embed_query(input_header_description.content.replace('"',''))

        train_df['L2_BOM_headers_relevant_score'] = np.dot(np.stack(train_df['BOM_headers_embedding']), header_embedding) # can remove this line to speed up. BOM header relevant score is not for classification but for reference only
        train_df['L2_description_relevant_score'] = np.dot(np.stack(train_df['description_embedding']), description_embedding)
        
        #bom_header_top_relevant=train_df.loc[(train_df['BOM_headers_relevant_score']>header_threshold)].sort_values('BOM_headers_relevant_score',ascending=False).head(ranking)
        description_top_relevant=train_df.loc[(train_df['L2_description_relevant_score']>description_threshold)].sort_values('L2_description_relevant_score',ascending=False).head(ranking)
        description_top_relevant['uploaded_BOM_headers']=input_header
        description_top_relevant['L2_uploaded_BOM_headers_description']=input_header_description.content.replace('"','')
        L2_df=pd.concat([L2_df,description_top_relevant[['uploaded_BOM_headers','L2_description_relevant_score','Categories','BOM headers','column_description','L2_uploaded_BOM_headers_description','L2_BOM_headers_relevant_score']]])

        #result_df.loc[result_df['uploaded_BOM_headers'] == input_header, 'status'] = 'L2 success'
    L2_df=L2_df.rename(columns={"Categories": "L2_output_category", "BOM headers": "L2_relevant_BOM_headers","column_description":"L2_relevant_column_description"})
    result_df=result_df.merge(L2_df,how='left',on='uploaded_BOM_headers')
    result_df.loc[result_df['L2_description_relevant_score']>=0.9,'status']='L2 success'
       
    return result_df


In [None]:
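# Earlier version, superseded by v1_datatype2 below. Known issues: unclassified headers are
# labelled "no result" but checked against 'unable classify', and validation results are
# collected per category, so their order can differ from the row order of result_df.
def v1_datatype(result_df,upload_df):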
    ouput_list=[]
    for c in result_df['uploaded_BOM_headers']:
        s=result_df[result_df['uploaded_BOM_headers']==c].iloc[0]['status']
        if s[:10]=='L1 success':
            ouput_list.append(result_df[result_df['uploaded_BOM_headers']==c].iloc[0]['L1_output_category'])
        elif s[:10]=='L2 success':
            ouput_list.append(result_df[result_df['uploaded_BOM_headers']==c].iloc[0]['L2_output_category'])
        else:
            ouput_list.append("no result")
    result_df['output']=ouput_list

    validation_list=[]
    for o in result_df['output'].drop_duplicates():
        for c in result_df[result_df['output']==o]['uploaded_BOM_headers']:
            c_dtype=upload_df[c].dtype            
            if c=='unable classify':
                validation_list.append('unable classify')
            elif str(c_dtype) in data_type_df[data_type_df['Categories']==o]['data_type'].iloc[0]:
                print(c+" is classified as "+o+", data type validation pass: "+str(c_dtype))
                validation_list.append('pass')
            else:
                print(c+" is classified as "+o+", but data type validation fail "+str(c_dtype))
                validation_list.append('fail')
    result_df['validation']=validation_list
    return result_df

In [None]:
def v1_datatype2(result_df,upload_df):
    ouput_list=[]
    for c in result_df['uploaded_BOM_headers']:
        s=result_df[result_df['uploaded_BOM_headers']==c].iloc[0]['status']
        if s[:10]=='L1 success':
            ouput_list.append(result_df[result_df['uploaded_BOM_headers']==c].iloc[0]['L1_output_category'])
        elif s[:10]=='L2 success':
            ouput_list.append(result_df[result_df['uploaded_BOM_headers']==c].iloc[0]['L2_output_category'])
        else:
            ouput_list.append("unable classify")
    result_df['output']=ouput_list
    #print(ouput_list)

    validation_list=[]
    for c in result_df['uploaded_BOM_headers']:
    #for c in result_df[result_df['output']==o]['uploaded_BOM_headers']:
        c_dtype=upload_df[c].dtype
        output=result_df[result_df['uploaded_BOM_headers']==c].iloc[0]['output']            
        if output=='unable classify':
            validation_list.append('unable classify')
        elif str(c_dtype) in data_type_df[data_type_df['Categories']==output]['data_type'].iloc[0]:
            print(c+" is classified as "+output+", data type validation pass: "+str(c_dtype))
            validation_list.append('pass')
        else:
            print(c+" is classified as "+output+", but data type validation fail: "+str(c_dtype))
            validation_list.append('fail')
    result_df['validation']=validation_list
    return result_df

In [1026]:
def classify_bom(upload_df):
    # L1: exact and translated matching; L2: description similarity for headers still pending.
    return L2_relevant2(L1_translate_match(upload_df))

In [1081]:
upload_df=test_bom
result_df=classify_bom(upload_df)

In [1082]:
result_df[['uploaded_BOM_headers','status']]

Unnamed: 0,uploaded_BOM_headers,status
0,料號,L2 success
1,規格,L2 success
2,備註,L2 success
3,淨重,L1 success after translation
4,毛重,L1 success after translation
5,重量單位,L2 success


In [1087]:
result_df=v1_datatype2(result_df,upload_df)

料號 is classified as Company Part No., data type validation pass: object
規格 is classified as Part Description, data type validation pass: object
備註 is classified as Part Description, data type validation pass: object
淨重 is classified as Part Net Weight, data type validation pass: float64
毛重 is classified as Part Gross Weight, data type validation pass: float64
重量單位 is classified as Net/Gross Unit, data type validation pass: object


In [1296]:
def run_classify_and_validationupload_df(upload_df):
    result_df=classify_bom(upload_df)
    result_df=v1_datatype2(result_df,upload_df)
    return result_df

In [1311]:
result_df=run_classify_and_validationupload_df(case1)
category_df=pd.DataFrame({'Categories':ground_truth['Categories'],'中文對照':['公司料號','料號描述','料號淨重','料號毛重','淨毛重單位']})
result_df=result_df.merge(category_df,how='left',left_on='output',right_on='Categories')
result_df[['uploaded_BOM_headers','output','中文對照','validation']]

产品描述 is classified as Part Description, data type validation pass: object
净重 is classified as Part Net Weight, but data type validation fail object


Unnamed: 0,uploaded_BOM_headers,output,中文對照,validation
0,性别,unable classify,,unable classify
1,产品描述,Part Description,料號描述,pass
2,部门,unable classify,,unable classify
3,BMI体重,unable classify,,unable classify
4,公司商号,unable classify,,unable classify
5,净重,Part Net Weight,料號淨重,fail
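
In this run 净重 fails validation because the column was read with dtype `object`. The notebook does not show the column's values, so the cause is unknown; one possibility is numbers stored as text or mixed with unit strings. A supplementary check could measure how many values actually parse as numbers, as in the hypothetical sketch below (the function name and the sample values are invented).

```python
def numeric_share(values):
    """Fraction of non-empty values that can be parsed as a number."""
    parsed = total = 0
    for v in values:
        if v is None or str(v).strip() == "":
            continue  # skip empty cells
        total += 1
        try:
            float(str(v).strip())
            parsed += 1
        except ValueError:
            pass  # e.g. '3 kg' or 'n/a'
    return parsed / total if total else 0.0

share_clean = numeric_share(["1.2", "3", 4.5, None])
share_mixed = numeric_share(["1.2", "3 kg", "n/a", "7"])
print(share_clean, share_mixed)  # 1.0 0.5
```

When applying this to a pandas column, missing values should be dropped first (for example with `dropna()`), because the string `"nan"` parses as a float.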


In [1298]:
def print_AC(result_df):
    # Relies on the global category_df defined above.
    AC1_matched_cat={}
    AC3_wrong_classification={}

    for k in category_df['中文對照']:
        AC1_matched_cat_list=list(result_df[(result_df['validation']=='pass') & (result_df['中文對照']==k)]['uploaded_BOM_headers'])
        if AC1_matched_cat_list!=[]:
            AC1_matched_cat.update({k:AC1_matched_cat_list})
        
        AC3_wrong_classification_list=list(result_df[(result_df['validation']=='fail') & (result_df['中文對照']==k)]['uploaded_BOM_headers'])
        if AC3_wrong_classification_list!=[]:
            AC3_wrong_classification.update({k:AC3_wrong_classification_list})

    # Unmatched headers do not depend on the category, so compute them once outside the loop.
    AC4_unmatched_header=set(result_df[result_df['validation']=='unable classify']['uploaded_BOM_headers'])

    # A category is unmatched when no header was assigned to it, correctly or wrongly.
    AC2_unmatched_cat_list=[]
    for k in category_df['中文對照']:
        if k not in list(AC1_matched_cat.keys())+list(AC3_wrong_classification.keys()):
            AC2_unmatched_cat_list.append(k)
    AC2_unmatched_cat=set(AC2_unmatched_cat_list)

    print("Matched categories with its headers: ",AC1_matched_cat)
    print("Unmatched categories: ",AC2_unmatched_cat)
    print("Matched categories with wrong headers: ",AC3_wrong_classification)
    print("Unmatched bom headers: ",AC4_unmatched_header)

In [1312]:
print_AC(result_df)

Matched categories with its headers:  {'料號描述': ['产品描述']}
Unmatched categories:  {'淨毛重單位', '公司料號', '料號毛重'}
Matched categories with wrong headers:  {'料號淨重': ['净重']}
Unmatched bom headers:  {'公司商号', '部门', 'BMI体重', '性别'}
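
For a service response, the four outputs are easier to consume as one JSON document than as printed lines. The sketch below builds such a report from plain row dictionaries. The function name, the row format and the report keys are hypothetical; the rows reproduce the `case1` result shown above.

```python
import json

def build_report(rows, categories):
    """rows: list of dicts with 'header', 'category' and 'validation' keys."""
    matched, wrong, unmatched_headers = {}, {}, []
    for row in rows:
        if row["validation"] == "pass":
            matched.setdefault(row["category"], []).append(row["header"])
        elif row["validation"] == "fail":
            wrong.setdefault(row["category"], []).append(row["header"])
        else:
            unmatched_headers.append(row["header"])
    # A category is unmatched when no header reached it, correctly or wrongly.
    unmatched_categories = [c for c in categories if c not in matched and c not in wrong]
    return {
        "matched": matched,
        "unmatched_categories": unmatched_categories,
        "wrong": wrong,
        "unmatched_headers": unmatched_headers,
    }

categories = ["公司料號", "料號描述", "料號淨重", "料號毛重", "淨毛重單位"]
rows = [
    {"header": "性别", "category": None, "validation": "unable classify"},
    {"header": "产品描述", "category": "料號描述", "validation": "pass"},
    {"header": "部门", "category": None, "validation": "unable classify"},
    {"header": "BMI体重", "category": None, "validation": "unable classify"},
    {"header": "公司商号", "category": None, "validation": "unable classify"},
    {"header": "净重", "category": "料號淨重", "validation": "fail"},
]

report = build_report(rows, categories)
# ensure_ascii=False keeps the Chinese text readable in the serialized output.
print(json.dumps(report, ensure_ascii=False, indent=2))
```

Lists are used instead of sets so the report has a stable order and serializes to JSON without conversion.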
