# Task B: RFQ Similarity

In this task, I compute how similar RFQs are to each other based on their dimensional, categorical, and grade properties, so that the top matches for any given RFQ can be looked up. Here is a breakdown of each step:

## Load Datasets

I load the RFQ file and the reference properties file, then print the column names of both to confirm they are what the later steps expect.

In [12]:
import pandas as pd
import numpy as np


# Load datasets
rfq = pd.read_csv("../data/rfq.csv")              # RFQs
ref = pd.read_csv("../data/reference_properties.tsv", sep='\t')  # Reference grade

# Inspect columns to confirm names
print(rfq.columns)
print(ref.columns)

Index(['id', 'grade', 'grade_suffix', 'coating', 'finish', 'surface_type',
       'surface_protection', 'form', 'thickness_min', 'thickness_max',
       'width_min', 'width_max', 'length_min', 'height_min', 'height_max',
       'weight_min', 'weight_max', 'inner_diameter_min', 'inner_diameter_max',
       'outer_diameter_min', 'outer_diameter_max', 'yield_strength_min',
       'yield_strength_max', 'tensile_strength_min', 'tensile_strength_max'],
      dtype='object')
Index(['Grade/Material', 'UNS_No', 'Steel_No', 'Standards', 'Carbon (C)',
       'Manganese (Mn)', 'Silicon (Si)', 'Sulfur (S)', 'Phosphorus (P)',
       'Chromium (Cr)', 'Nickel (Ni)', 'Molybdenum (Mo)', 'Vanadium (V)',
       'Tungsten (W)', 'Cobalt (Co)', 'Copper (Cu)', 'Aluminum (Al)',
       'Titanium (Ti)', 'Niobium (Nb)', 'Boron (B)', 'Nitrogen (N)',
       'Tensile strength (Rm)', 'Yield strength (Re or Rp0.2)',
       'Elongation (A%)', 'Reduction of area (Z%)', 'Hardness (HB, HV, HRC)',
       'Impact toughness 

## Task B.1 — Reference join & missing values

Here, I normalized the grade names in both datasets so that they match (lowercased, with leading and trailing whitespace removed). Then I parsed the numeric ranges in the reference properties (like “2.5–3.0” or “≤5”) into min, max, and mid values. Finally, I left-joined the RFQs with the reference properties on the normalized grade and flagged RFQs that have no matching reference entry.

In [15]:
# Normalize RFQ grades
rfq['grade_norm'] = rfq['grade'].str.lower().str.strip()

# Normalize reference grades
# Adjust 'Grade/Material' based on actual column
ref['grade_norm'] = ref['Grade/Material'].str.lower().str.strip()
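
As a small sketch of what the normalization does, the toy example below shows spelling variants collapsing to one join key. The sample grade strings and the `normalize_grade` helper are hypothetical, and the internal-whitespace collapsing is an extra step that the cell above does not perform.

```python
import pandas as pd

def normalize_grade(series: pd.Series) -> pd.Series:
    """Lowercase, trim, and collapse internal whitespace (hypothetical helper)."""
    return (
        series.astype("string")
        .str.lower()
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
    )

# Made-up grade spellings, including a missing value
toy = pd.Series([" S235JR", "s235jr ", "DX51D  +Z", None])
normalized = normalize_grade(toy)
print(normalized.tolist())
```

Missing grades stay missing after normalization, so they can still be flagged after the join.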

In [16]:
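import re

def parse_range(s):
    """Parse a range string such as '2.5–3.0' into a (min, max) tuple of floats."""
    if pd.isna(s):
        return (np.nan, np.nan)
    # Drop bound symbols and units, so '≤5' is read as the single value 5
    s = str(s).replace('≤','').replace('≥','').replace('MPa','').strip()
    # Split on a hyphen or an en dash
    parts = re.split(r'[-–]', s)
    try:
        a = float(parts[0].replace(',', '.')) if parts[0] else np.nan
        b = float(parts[1].replace(',', '.')) if len(parts)>1 and parts[1] else a
        return (min(a,b), max(a,b))
    except ValueError:
        # Text that is not a number cannot be parsed
        return (np.nan, np.nan)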

# Example numeric columns to parse
numeric_cols = [
    'Carbon (C)','Manganese (Mn)','Silicon (Si)','Sulfur (S)','Phosphorus (P)',
    'Tensile strength (Rm)','Yield strength (Re or Rp0.2)'
]

for col in numeric_cols:
    if col in ref.columns:
        ref[[f'{col}_min', f'{col}_max']] = ref[col].apply(lambda x: pd.Series(parse_range(x)))
        ref[f'{col}_mid'] = ref[[f'{col}_min', f'{col}_max']].mean(axis=1)
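
The parser above drops `≤` and `≥`, so a one-sided bound such as “≤5” becomes the point interval (5, 5). The sketch below is one possible variant, not the method used in this notebook, that keeps the open side as NaN instead. The `parse_bounds` name and the sample strings are made up for illustration.

```python
import math
import re

def parse_bounds(text):
    """Parse '2.5-3.0', '≤5' or '≥235 MPa' into (low, high); NaN marks an open side."""
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return (math.nan, math.nan)
    s = str(text).replace("MPa", "").replace(",", ".").strip()
    number = r"\d+(?:\.\d+)?"
    # Two-sided range: "a-b" with a hyphen or an en dash
    m = re.fullmatch(rf"({number})\s*[-–]\s*({number})", s)
    if m:
        a, b = float(m.group(1)), float(m.group(2))
        return (min(a, b), max(a, b))
    # Single value, optionally preceded by a bound symbol
    m = re.fullmatch(rf"([≤≥<>]?)\s*({number})", s)
    if m:
        value = float(m.group(2))
        if m.group(1) in ("≤", "<"):
            return (math.nan, value)   # upper bound only
        if m.group(1) in ("≥", ">"):
            return (value, math.nan)   # lower bound only
        return (value, value)          # plain number
    return (math.nan, math.nan)        # anything else is unparseable

print(parse_bounds("2.5–3.0"))
print(parse_bounds("≤5"))
print(parse_bounds("≥235 MPa"))
print(parse_bounds("0,17"))
```

With this variant, a row-wise mean that skips NaN (as `DataFrame.mean(axis=1)` does by default) would still yield the known bound as the midpoint.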

### Join RFQs with reference

In [17]:
# Merge RFQs with reference properties
rfq_ref = rfq.merge(ref, on='grade_norm', how='left')

# Flag RFQs whose grade is missing or has no match in the reference table
rfq_ref['missing_ref'] = rfq_ref['grade_norm'].isna() | ~rfq_ref['grade_norm'].isin(ref['grade_norm'].dropna())
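
An alternative way to flag unmatched rows is to let pandas report the match status through the `indicator` option of `merge`. The sketch below uses toy frames, and the `carbon_mid` column and the grade values are made up. `validate="many_to_one"` is an optional guard that raises if the reference table contains duplicate grades, which would otherwise duplicate RFQ rows.

```python
import pandas as pd

# Toy stand-ins for rfq and ref (values are illustrative only)
rfq_toy = pd.DataFrame({"id": [1, 2, 3], "grade_norm": ["s235jr", "unknown", None]})
ref_toy = pd.DataFrame({"grade_norm": ["s235jr", "s355j2"], "carbon_mid": [0.17, 0.20]})

merged = rfq_toy.merge(
    ref_toy,
    on="grade_norm",
    how="left",
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
    validate="many_to_one",  # each reference grade may appear only once
)
merged["missing_ref"] = merged["_merge"] == "left_only"
merged = merged.drop(columns="_merge")
print(merged)
```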

## Task B.2 — Feature engineering
In this step, I prepared the features for the similarity calculation:

For the numeric dimensions, I made sure every dimension has both a min and a max column. If the max is missing, I filled it with the min, so a single value is treated as a point interval.

I defined the categorical columns that we want to compare.

I also collected the midpoint columns of the numeric grade properties (created in Task B.1) so that they can be used in the similarity calculation.

In [18]:
# Dimensions as intervals
dimensions = ['thickness','width','length','height','weight',
              'inner_diameter','outer_diameter',
              'yield_strength','tensile_strength']

for dim in dimensions:
    min_col = f'{dim}_min'
    max_col = f'{dim}_max'

    # Create the columns if they don't exist
    if min_col not in rfq_ref.columns:
        rfq_ref[min_col] = np.nan
    if max_col not in rfq_ref.columns:
        rfq_ref[max_col] = np.nan

    # Fill missing max with min (singleton case)
    rfq_ref[max_col] = rfq_ref[max_col].fillna(rfq_ref[min_col])
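
To make the singleton case concrete, here is a toy illustration of the same fill on a single made-up dimension. The `thickness_is_point` flag is a hypothetical extra column, added only to show which rows end up as point intervals.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "thickness_min": [2.0, 5.0, np.nan],
    "thickness_max": [3.0, np.nan, np.nan],
})

# Same rule as above: a missing max takes the value of min
toy["thickness_max"] = toy["thickness_max"].fillna(toy["thickness_min"])

# Rows where min equals max are point intervals (NaN never compares equal)
toy["thickness_is_point"] = toy["thickness_min"].eq(toy["thickness_max"])
print(toy)
```

Rows where both bounds are missing stay missing, so the dimension is still skipped for them later.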

In [19]:
# Define categorical similarity
categorical_cols = ['coating','finish','form','surface_type']

In [20]:
# Grade midpoints
grade_numeric_cols = [f'{col}_mid' for col in numeric_cols if f'{col}_mid' in rfq_ref.columns]

## Task B.3 — Similarity calculation

I created functions to compute similarity between two RFQs:

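Dimension similarity: I used an interval-based IoU (intersection over union), which measures how much two numeric ranges overlap. The score is the mean IoU over all dimensions that both RFQs have values for.

Categorical similarity: For each categorical column, I scored 1 if both RFQs have the same value and 0 otherwise, skipping columns where either value is missing, and took the mean.

Grade similarity: For the numeric grade properties, I compared the values using the ratio of the smaller to the larger value, again averaged over the properties available for both RFQs.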

Then I combined all three types of similarity into an overall score using weights (dimension 0.5, categorical 0.2, grade 0.3).
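
As a worked example of the weighting, the sketch below combines three component scores with the weights named above. The `combine_scores` helper and the input scores are hypothetical and chosen only to make the arithmetic easy to follow.

```python
WEIGHTS = {"dimension": 0.5, "categorical": 0.2, "grade": 0.3}

def combine_scores(dim_score, cat_score, grade_score, weights=WEIGHTS):
    """Weighted sum of the three component similarities."""
    return (
        dim_score * weights["dimension"]
        + cat_score * weights["categorical"]
        + grade_score * weights["grade"]
    )

# 0.8 * 0.5 + 0.5 * 0.2 + 0.0 * 0.3
example = combine_scores(0.8, 0.5, 0.0)
print(round(example, 3))

# Because the weights sum to 1, three perfect components give a score of 1
perfect = combine_scores(1.0, 1.0, 1.0)
print(round(perfect, 3))
```

A component with nothing to compare scores 0 in the similarity function, so the overall score of such a pair is capped below 1 rather than being renormalized.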

In [21]:
# Interval IoU
def interval_iou(a_min, a_max, b_min, b_max):
    """Compute IoU (intersection over union) for numeric intervals."""
    if np.isnan(a_min) or np.isnan(a_max) or np.isnan(b_min) or np.isnan(b_max):
        return np.nan
    inter = max(0, min(a_max, b_max) - max(a_min, b_min))
    union = max(a_max, b_max) - min(a_min, b_min)
    return inter / union if union > 0 else 0
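
One consequence of the function above: two identical point intervals (for example thickness 2.0 against 2.0, which can arise after filling a missing max with the min) have a union of 0 and therefore score 0. The sketch below is a possible variant, not the version used for the results in this notebook, that treats this case as a full match. The name `interval_iou_with_points` is made up.

```python
import math

def interval_iou_with_points(a_min, a_max, b_min, b_max):
    """Interval IoU where two identical point intervals count as a full match.

    Assumes min <= max within each interval.
    """
    if any(math.isnan(v) for v in (a_min, a_max, b_min, b_max)):
        return math.nan
    union = max(a_max, b_max) - min(a_min, b_min)
    if union == 0:
        # Both intervals are the same single point
        return 1.0
    inter = max(0.0, min(a_max, b_max) - max(a_min, b_min))
    return inter / union

print(interval_iou_with_points(2.0, 4.0, 3.0, 5.0))  # overlap 1 over union 3
print(interval_iou_with_points(2.0, 2.0, 2.0, 2.0))  # identical points
print(interval_iou_with_points(2.0, 2.0, 2.0, 4.0))  # point inside a range
```

A point that lies inside a wider range still scores 0 here, because its overlap has zero length. Whether that is acceptable depends on how exact-value RFQs should rank against range RFQs.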

In [22]:
# RFQ similarity function
def rfq_similarity(a, b, dimension_list, categorical_list, grade_numeric_cols):
    """Compute overall similarity between two RFQs."""
    
    #Dimension similarity
    dim_sims = []
    for dim in dimension_list:
        min_col = f'{dim}_min'
        max_col = f'{dim}_max'

        # Only compute if both RFQs have the dimension
        if all(col in a.index for col in [min_col, max_col]) and all(col in b.index for col in [min_col, max_col]):
            sim = interval_iou(a[min_col], a[max_col], b[min_col], b[max_col])
            if sim is not None and not np.isnan(sim):
                dim_sims.append(sim)

    dim_score = np.mean(dim_sims) if dim_sims else 0

    #Categorical similarity
    cat_sims = []
    for cat in categorical_list:
        if cat in a.index and cat in b.index:
            val_a, val_b = a[cat], b[cat]
            if pd.isna(val_a) or pd.isna(val_b):
                continue
            cat_sims.append(1 if val_a == val_b else 0)

    cat_score = np.mean(cat_sims) if cat_sims else 0

    #Grade numeric properties similarity
    grade_sims = []
    for col in grade_numeric_cols:
        if col in a.index and col in b.index:
            val_a, val_b = a[col], b[col]
            if pd.isna(val_a) or pd.isna(val_b):
                continue
            # Simple similarity: ratio of smaller/larger
            grade_sims.append(min(val_a, val_b) / max(val_a, val_b) if max(val_a, val_b) != 0 else 0)

    grade_score = np.mean(grade_sims) if grade_sims else 0

    # Aggregate similarity
    # Example: weighted average 
    weights = {
        'dimension': 0.5,
        'categorical': 0.2,
        'grade': 0.3
    }

    overall_score = (
        dim_score * weights['dimension'] +
        cat_score * weights['categorical'] +
        grade_score * weights['grade']
    )

    return overall_score

## Compute top-3 most similar RFQs

I looped through all RFQs and compared each one with every other RFQ using the similarity function. For each RFQ, I picked the top 3 most similar ones and saved them into a DataFrame. This gave me an easy way to see which RFQs are closest to any given RFQ.

Finally, I saved the results to a CSV.

In [23]:
from tqdm import tqdm  # optional, shows progress bar

# Define the columns to use (replace with your dataset columns)
dimension_list = [
    'thickness', 'width', 'length', 'height',
    'weight', 'inner_diameter', 'outer_diameter',
    'yield_strength', 'tensile_strength'
]
categorical_list = ['coating', 'finish', 'form', 'surface_type']
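# Note: rfq_similarity skips any name that is not a column of rfq_ref, and
# Task B.1 names its midpoint columns like 'Carbon (C)_mid'.
grade_numeric_cols = [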
    'C', 'Mn', 'Si', 'S', 'P', 'Cr', 'Ni', 'Mo', 'V', 'W', 'Co', 'Cu', 'Al', 'Ti', 'Nb', 'B', 'N', 
    'tensile_strength_mid', 'yield_strength_mid'
]

top3_list = []

# Use tqdm to monitor progress if dataset is large
for idx_a, row_a in tqdm(rfq_ref.iterrows(), total=len(rfq_ref), desc="Processing RFQs"):
    sims = []
    for idx_b, row_b in rfq_ref.iterrows():
        if row_a['id'] == row_b['id']:
            continue
        score = rfq_similarity(row_a, row_b, dimension_list, categorical_list, grade_numeric_cols)
        sims.append((row_b['id'], score))
    
    # Get top 3 matches
    top3 = sorted(sims, key=lambda x: x[1], reverse=True)[:3]
    for match_id, score in top3:
        top3_list.append({
            'rfq_id': row_a['id'],
            'match_id': match_id,
            'similarity_score': score
        })

# Convert to DataFrame and save
top3_df = pd.DataFrame(top3_list)
top3_df.to_csv("../outputs/top3.csv", index=False, encoding='utf-8-sig')
display(top3_df.head())

Processing RFQs: 100%|█████████████████████████████████████████████████████████████| 1005/1005 [13:12<00:00,  1.27it/s]


|   | rfq_id | match_id | similarity_score |
|---|---|---|---|
| 0 | 8aff426d-b8c0-43aa-ad26-835ef4de6129 | 2e56f82b-a80d-4704-83a5-14a52a117b65 | 0.6 |
| 1 | 8aff426d-b8c0-43aa-ad26-835ef4de6129 | 7d1ab305-7fc6-4ab0-bc2a-9ae1e038e67e | 0.544795 |
| 2 | 8aff426d-b8c0-43aa-ad26-835ef4de6129 | 9195d6ad-120f-4ae2-8e69-462ca56c3fc0 | 0.475 |
| 3 | 37e624be-b125-464f-85b6-1838530193ef | 56d8f8e0-7e07-42f7-9503-360aed60100a | 0.2 |
| 4 | 37e624be-b125-464f-85b6-1838530193ef | 74fcfbc9-2569-46c8-ad3f-a9ea8b9f5a45 | 0.2 |
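
The nested loop above scores every ordered pair, so each pair of RFQs is evaluated twice. Since all three component similarities are symmetric, one possible speed-up is to score each unordered pair once and keep the best matches with `heapq.nlargest`. The sketch below assumes unique ids and uses a toy similarity in place of `rfq_similarity`. The `top_k_matches` helper and the toy values are hypothetical.

```python
import heapq
from itertools import combinations

def top_k_matches(ids, similarity, k=3):
    """Return {id: [(other_id, score), ...]} with one similarity call per unordered pair."""
    scores = {i: [] for i in ids}
    for a, b in combinations(ids, 2):
        s = similarity(a, b)
        # Symmetric similarity: record the score for both directions
        scores[a].append((b, s))
        scores[b].append((a, s))
    return {
        i: heapq.nlargest(k, pairs, key=lambda p: p[1])
        for i, pairs in scores.items()
    }

# Toy stand-in for rfq_similarity: closeness of two numbers
values = {"a": 1.0, "b": 1.5, "c": 4.0, "d": 4.2}

def toy_similarity(x, y):
    return 1.0 / (1.0 + abs(values[x] - values[y]))

result = top_k_matches(list(values), toy_similarity, k=2)
print(result["a"])
```

This halves the number of similarity calls but remains quadratic in the number of RFQs.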
