<a href="https://colab.research.google.com/github/RaresNitu03/company-classifier/blob/main/Insurance_Taxonomy_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Company Classifier**



## 🟣 **Section 0 - Introduction**


This project presents a classifier that maps companies to a predefined insurance taxonomy. Given a list of companies with associated data (**company descriptions, business tags, and industry-specific classifications**), the solution assigns each company one or more relevant insurance labels from a provided static taxonomy.

**Objectives:**

*   Develop a method to accurately map companies to one or multiple insurance taxonomy labels.
*   Validate classifier performance effectively, considering the lack of a predefined ground truth.
*   Evaluate the strengths and weaknesses of the implemented solution.
*   Discuss scalability, assumptions, and potential unknown factors impacting the classification performance.

**What the Project Does:**


*   **Accepts Input:** Takes company descriptions, business tags, and sector/classification details.
*   **Classification:** Uses sentence embeddings and cosine similarity to generate pseudo-labels from the taxonomy, then trains a multi-label classifier on those pseudo-labels to assign insurance taxonomy labels.
*   **Validation:** Since no ground-truth labels are provided, compares predictions against a small, manually labeled validation set to estimate classification quality and identify areas for improvement.

**What it Should Achieve:**

*   **Accuracy:** Ensure companies are correctly classified into meaningful and contextually appropriate insurance labels.
*   **Scalability:** Although designed initially for a manageable dataset, the solution framework supports extension to large-scale data processing.
*   **Adaptability:** Allow easy integration or enhancement for similar classification problems within the insurance or related sectors.


**Project Structure:**



*   SECTION  1 – Setup & Imports
*   SECTION  2 – Data Loading
*   SECTION  3 – Data Preparation & Text Processing
*   SECTION  4 – Semantic Similarity & Pseudo-Label Generation
*   SECTION  5 – Training Classifier with Pseudo-Labels
*   SECTION  6 – Prediction Function & Re-labeling
*   SECTION  7 – Model Validation & Performance Analysis
*   SECTION  8 – Visual Check & Sample Evaluation
*   SECTION  9 – Model Saving & Export
*   SECTION 10 – Analysis and Conclusions

## 🟣 **SECTION 1 – Setup & Imports**

In this section, we prepare the necessary environment to run the project by installing and importing relevant Python packages. These libraries are essential for data processing, classification modeling, and performance evaluation.

**Package Installation**

In [None]:
# Install the necessary packages (only on the first run)
!pip install -U sentence-transformers joblib




*   **sentence-transformers:** Used to generate embeddings (vector representations) from descriptions and texts, facilitating semantic similarity measurements.
*   **joblib:** Used to serialize and save trained models so they can be reused without retraining.



**Essential Imports**

In [None]:
# Essential imports
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer, util
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, f1_score
import joblib
from tqdm import tqdm
from IPython.display import display, HTML
import ast
tqdm.pandas()


Purpose:
*   **pandas and numpy:** Efficient management and manipulation of data.
*   **sentence_transformers:** Creating embeddings and measuring semantic similarity between texts.
*   **TfidfVectorizer:** Converts texts into numeric TF-IDF weighted vectors, useful for capturing term importance.
*   **MultiLabelBinarizer:** Converts categorical labels into binary format, suitable for multi-label classification.
*   **LogisticRegression & OneVsRestClassifier:** A logistic regression base model wrapped in a one-vs-rest scheme, which trains one binary classifier per label so that each company can receive several labels.
*   **make_pipeline:** Automates the sequential process of transformations and modeling.
*   **classification_report & f1_score:** Evaluate the quality of classification achieved by the model.
*   **tqdm:** Tracks the progress of time-consuming operations (e.g., embedding generation).
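
To make the multi-label target format concrete, the sketch below builds in plain Python the kind of binary indicator matrix that `MultiLabelBinarizer` produces: one row per company, one column per label. The helper `multi_hot` is a hypothetical name used only for illustration; it is not part of scikit-learn and is not the notebook's own code.

```python
def multi_hot(label_lists):
    """Return (classes, matrix); matrix[i][j] is 1 if classes[j] is in label_lists[i]."""
    # Collect every distinct label and fix a column order (alphabetical here).
    classes = sorted({label for labels in label_lists for label in labels})
    column = {label: j for j, label in enumerate(classes)}

    matrix = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        matrix.append(row)
    return classes, matrix


# Two companies with label sets taken from the sample rows in this notebook.
example = [
    ["Gas Installation Services", "Commercial Plumbing Services"],
    ["Building Cleaning Services"],
]
classes, matrix = multi_hot(example)
print(classes)
for row in matrix:
    print(row)
```

Each column of this matrix becomes the target of one binary classifier in the one-vs-rest setup.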

## 🟣 **SECTION 2 – Data Loading**

In this section, we load the core datasets used in the project: the list of companies to be classified, the insurance taxonomy (labels), and a manually labeled validation set. We also explore the structure of the data to ensure that all files were loaded correctly.

**Data Loading and Initial Exploration**

In [None]:
# Load the main files
companies_df = pd.read_csv("ml_insurance_challenge.csv")
taxonomy_df = pd.read_csv("insurance_taxonomy - insurance_taxonomy.csv")

# Load the file for validation
last_100_df = pd.read_csv("Last_100.csv")

*   Loads datasets containing company data, taxonomy labels, and manually labeled validation data.



**Display Options & Initial Data Preview**

In [None]:
# Set options for displaying DataFrames (optional)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 10)

# To center the text, we apply CSS styles to cells and headers
companies_styled = companies_df.head().style \
    .set_properties(**{'text-align': 'center'}) \
    .set_table_styles([{'selector': 'th', 'props': [('text-align', 'center')]}]) \
    .set_caption("<b>Top 5 Companies</b>")

display(companies_styled)
display(HTML(f"<b>Total company rows: {len(companies_df)}</b>"))

last_100_styled = last_100_df.head().style \
    .set_properties(**{'text-align': 'center'}) \
    .set_table_styles([{'selector': 'th', 'props': [('text-align', 'center')]}]) \
    .set_caption("<b>Top 5 Companies from Validation File</b>")

display(last_100_styled)
display(HTML(f"<b>Total validation rows: {len(last_100_df)}</b>"))


Unnamed: 0,description,business_tags,sector,category,niche
0,"Welchcivils is a civil engineering and construction company that specializes in designing and building utility network connections across the UK. They offer multi-utility solutions that combine electricity, gas, water, and fibre optic installation into a single contract. Their design engineer teams are capable of designing electricity, water and gas networks from existing network connection points to meter locations at the development, as well as project management of reinforcements and diversions. They provide custom connection solutions that take into account any existing assets, maximize the usage of every trench, and meet project deadlines. Welchcivils has considerable expertise installing gas and electricity connections in a variety of market categories, including residential, commercial, and industrial projects, as as well.","['Construction Services', 'Multi-utilities', 'Utility Network Connections Design and Construction', 'Water Connection Installation', 'Multi-utility Connections', 'Fiber Optic Installation']",Services,Civil Engineering Services,Other Heavy and Civil Engineering Construction
1,"Kyoto Vegetable Specialists Uekamo, also known as Iwa-machi, is a company based in Kyoto, Japan that specializes in the sale of vegetables. They have been in business for ten years and offer a collection of vegetable recipes through their Keiō Vegetable Recipe Collection and Online Shop. The company is directly owned by Uekamoo Farm, Uekame Farm, and Lobechi Shijo-hara Farm. They offer a variety of vegetable products, including suguki-zuke and Kamoo eggplant, and also accept production cultivation according to customer requests. Iwaichi Limited Company uses their experience in production and sales to provide tailored vegetables to meet customer needs and also accepts cultivation of products according to their requirements.","['Wholesale', 'Dual-task Movement Products', 'Cast Iron Products Manufacturer', 'Manufacturing Technology', 'Food and Beverage', 'Rice And Noodles', 'High-quality Gloss of Cast Iron', 'Rice Wholesaler', 'Miscellaneous Crop Farming', 'Health and Wellness Products', 'Agricultural Cooperative', 'Medical Practice Based on Eastern Medicine', 'Production', 'Rice Pudding']",Manufacturing,Fruit & Vegetable - Markets & Stores,"Frozen Fruit, Juice, and Vegetable Manufacturing"
2,"Loidholdhof Integrative Hofgemeinschaft is a company that offers a range of services and products to its customers. Their products are all handmade and of the highest quality, produced on a biodynamic basis with a focus on freshness and quality. The company's product range includes homemade bread, honey from their own beekeeping, syrup, and fresh vegetables, which can be purchased in their farm shop. In addition to their farm products, they also have a farm shop and cafe where customers can enjoy fresh coffee and delicious cakes.","['Living Forms', 'Farm Cafe', 'Fresh Coffee', 'Community Engagement', 'Freshly Baked Bread', 'Social Interaction Opportunities', 'Fresh Vegetables', 'Homemade Honey', 'Delicious Cakes', 'Community-oriented Living', 'Handmade Products', 'Fresh Juices', 'Farm Fresh Products', 'Integrated Farming Community', 'Biodynamic Farming']",Manufacturing,Farms & Agriculture Production,All Other Miscellaneous Crop Farming
3,"PATAGONIA Chapa Y Pintura is an auto body shop located in Comodoro Rivadavia, Chubut Province, Argentina. The company specializes in providing auto body repair services.","['Automotive Body Repair Services', 'Interior Repair Services']",Services,Auto Body Shops,"Automotive Body, Paint, and Interior Repair and Maintenance"
4,"Stanica WODNA PTTK Swornegacie is a cultural establishment located in Swornychgaciach, Poland. It is a popular destination for kayakers and tourists of all levels, offering a variety of activities and events. The establishment is managed by Zbigniew Galiński.","['Cultural Activities', 'Accommodation Services', 'Kayak Rentals', 'Small Gastronomy Products', 'Tourism Services', 'Recreational Activities', 'Cultural Center']",Services,Boat Tours & Cruises,"Scenic and Sightseeing Transportation, Water"


Unnamed: 0,description,business_tags,sector,category,niche,corect_label
0,"Mitch Lui is a Hong Kong-based lifestyle photographer who specializes in portrait, film, and landscape photography. His work is inspired by ambient, light, sounds, colors, space, and texture. He is known for his passion-driven and observant approach to capturing fine details through his lens in both digital and analog formats.","['Fine Details Photography', 'Lifestyle Photography Services', 'Ambient Light Photography', 'Analog Photography', 'Art Gallery', 'Still Photography']",Services,Photographers & Photographic Studios,Commercial Photography,"['Media Production Services', 'Content Creation Services', 'Graphic Design Services']"
1,"AWM Perionica veša is a company based in Mirijevo, Croatia that specializes in laundry and dry cleaning services. They offer free delivery of laundry to customers' doors throughout Mirije and its surrounding areas. Their services include washing, ironing, and dry-cleaning of various items such as blankets, duvets, bedding, quilts, curtains, duvet covers, bathrobes, tablecloths, table skirts, and curtains. The company prides itself on using high-quality materials that do not harm the fabric.","['Laundry Service', 'Laundromat Services', 'Ironing Services']",Services,Dry Cleaners,Coin-Operated Laundries and Drycleaners,['Building Cleaning Services']
2,"FreshLookHomes is a company that specializes in the sales and marketing of new and pre-owned homes. They offer a variety of properties for sale, including single-family homes, condos, townhomes, and apartments. The company provides a platform for potential buyers to view properties and receive notifications about upcoming sales and events. FreshLookHouses aims to help buyers find their dream homes quickly and affordably.",['Home Improvement Services'],Services,Home Builders & Renovation Contractors,New Housing For-Sale Builders,"['Real Estate Services', 'Single Family Residential Construction', 'Property Management Services', 'Multi-Family Construction Services']"
3,"Balıkesir Gıda Food is a market that offers a variety of food items for purchase. The company is known for its fast and friendly service, as evidenced by positive customer reviews.","['Shopping Services', 'General Store Products', 'Online Food Ordering Platform', 'Greengrocer Products', 'Buffet Products', 'Wi-fi Services', 'Alcohol Retail', 'Card Payment Services', 'Wheelchair Accessible Entrance']",Retail,Groceries,Convenience Retailers,"['Retail Groceries', 'Convenience Retailers', 'Shopping Services', 'Food Processing Services']"
4,"Hustlerholic is a company that offers a podcast series called ""The Hustlerholist Podcast"" that provides tips and advice on various topics such as real estate, the stock market, forex market, crypto currency, and running a small business. The podcast features stories of Hustlers who have achieved financial independence through hustling and offers insights from leaders in the business world. The company also provides information on Forex trading and the importance of Forex in achieving financial independence. Additionally, they offer a virtual seminar on the stock and real estate markets.","['Financial Education', 'Real Estate Tips', 'Small Business Education', 'Financial Services', 'Real-life Success in Real Estate', 'Virtual Seminar on Real Estate', 'Cryptocurrency Education', 'Podcast Production Services', 'Forex Education']",Services,Cryptocurrency,Securities and Commodity Exchanges,"['Financial Services', 'Consulting Services', 'Media Production Services', 'Training Services']"



*   Sets pandas display options to show all columns and limits the number of displayed rows to 10 for easier reading.
*   Applies basic styling to tables to center-align text in both headers and cells for better visual formatting.
*   Displays the first 5 companies from the main dataset and the validation file with custom captions.
*   Shows the total number of entries in each dataset to quickly understand dataset size.

## 🟣 **SECTION 3 – Data Preparation & Text Processing**

Here, we preprocess the company data by combining multiple descriptive fields (such as description, tags, sector, etc.) into a single unified text field. This step is necessary to create a rich textual representation for each company, which will later be used for generating embeddings and training the model.

**Text Building**

In [None]:
def build_full_text(row):
    parts = [
        str(row["description"]),
        str(row["business_tags"]),
        str(row["sector"]),
        str(row["category"]),
        str(row["niche"]),
    ]
    return " | ".join(parts)

companies_df["full_text"] = companies_df.apply(build_full_text, axis=1)

*   Combines multiple textual fields (description, business_tags, sector, category, niche) into a unified text field (full_text) for semantic analysis and modeling.
*   This concatenated text will be used as input for generating semantic embeddings and training the model.
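
One thing to keep in mind with `str(...)` concatenation is that a missing cell ends up as the literal text `nan`, and `business_tags` is stored as a stringified Python list, brackets and quotes included. The sketch below shows one possible variant that skips missing fields and parses the tags with `ast.literal_eval`. The names `is_missing`, `parse_tags`, and `build_full_text_clean` are hypothetical, and the example row is a shortened, partly blanked-out version of a sample row, used here only to exercise the missing-value path.

```python
import ast
import math


def is_missing(value):
    """True for None and float NaN, the usual markers of an empty CSV cell."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def parse_tags(raw):
    """Turn a stringified list such as "['A', 'B']" into a list of strings."""
    if is_missing(raw):
        return []
    if isinstance(raw, (list, tuple)):
        return [str(tag) for tag in raw]
    try:
        parsed = ast.literal_eval(str(raw))
    except (ValueError, SyntaxError):
        return [str(raw)]  # keep unparseable text as a single tag
    if isinstance(parsed, (list, tuple)):
        return [str(tag) for tag in parsed]
    return [str(parsed)]


def build_full_text_clean(row):
    """Join the non-empty fields with ' | ', mirroring build_full_text."""
    parts = [row.get("description"), ", ".join(parse_tags(row.get("business_tags")))]
    parts.extend(row.get(field) for field in ("sector", "category", "niche"))
    return " | ".join(str(p) for p in parts if not is_missing(p) and str(p).strip())


# A plain dict stands in for a DataFrame row (pandas Series also offer .get).
row = {
    "description": "PATAGONIA Chapa Y Pintura is an auto body shop.",
    "business_tags": "['Automotive Body Repair Services', 'Interior Repair Services']",
    "sector": "Services",
    "category": float("nan"),  # simulated missing value
    "niche": None,             # simulated missing value
}
full_text = build_full_text_clean(row)
print(full_text)
```

Whether stripping the list syntax helps the embedding model is an empirical question; the sketch only shows how the cleaning could be done.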

**Preview with Full Text Column**

In [None]:
# Set options for displaying DataFrames (optional)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 10)

# To center the text, we apply CSS styles to cells and headers
companies_styled = companies_df.head().style \
    .set_properties(**{'text-align': 'center'}) \
    .set_table_styles([{'selector': 'th', 'props': [('text-align', 'center')]}]) \
    .set_caption("<b>Top 5 Companies</b>")

display(companies_styled)

Unnamed: 0,description,business_tags,sector,category,niche,full_text
0,"Welchcivils is a civil engineering and construction company that specializes in designing and building utility network connections across the UK. They offer multi-utility solutions that combine electricity, gas, water, and fibre optic installation into a single contract. Their design engineer teams are capable of designing electricity, water and gas networks from existing network connection points to meter locations at the development, as well as project management of reinforcements and diversions. They provide custom connection solutions that take into account any existing assets, maximize the usage of every trench, and meet project deadlines. Welchcivils has considerable expertise installing gas and electricity connections in a variety of market categories, including residential, commercial, and industrial projects, as as well.","['Construction Services', 'Multi-utilities', 'Utility Network Connections Design and Construction', 'Water Connection Installation', 'Multi-utility Connections', 'Fiber Optic Installation']",Services,Civil Engineering Services,Other Heavy and Civil Engineering Construction,"Welchcivils is a civil engineering and construction company that specializes in designing and building utility network connections across the UK. They offer multi-utility solutions that combine electricity, gas, water, and fibre optic installation into a single contract. Their design engineer teams are capable of designing electricity, water and gas networks from existing network connection points to meter locations at the development, as well as project management of reinforcements and diversions. They provide custom connection solutions that take into account any existing assets, maximize the usage of every trench, and meet project deadlines. Welchcivils has considerable expertise installing gas and electricity connections in a variety of market categories, including residential, commercial, and industrial projects, as as well. 
| ['Construction Services', 'Multi-utilities', 'Utility Network Connections Design and Construction', 'Water Connection Installation', 'Multi-utility Connections', 'Fiber Optic Installation'] | Services | Civil Engineering Services | Other Heavy and Civil Engineering Construction"
1,"Kyoto Vegetable Specialists Uekamo, also known as Iwa-machi, is a company based in Kyoto, Japan that specializes in the sale of vegetables. They have been in business for ten years and offer a collection of vegetable recipes through their Keiō Vegetable Recipe Collection and Online Shop. The company is directly owned by Uekamoo Farm, Uekame Farm, and Lobechi Shijo-hara Farm. They offer a variety of vegetable products, including suguki-zuke and Kamoo eggplant, and also accept production cultivation according to customer requests. Iwaichi Limited Company uses their experience in production and sales to provide tailored vegetables to meet customer needs and also accepts cultivation of products according to their requirements.","['Wholesale', 'Dual-task Movement Products', 'Cast Iron Products Manufacturer', 'Manufacturing Technology', 'Food and Beverage', 'Rice And Noodles', 'High-quality Gloss of Cast Iron', 'Rice Wholesaler', 'Miscellaneous Crop Farming', 'Health and Wellness Products', 'Agricultural Cooperative', 'Medical Practice Based on Eastern Medicine', 'Production', 'Rice Pudding']",Manufacturing,Fruit & Vegetable - Markets & Stores,"Frozen Fruit, Juice, and Vegetable Manufacturing","Kyoto Vegetable Specialists Uekamo, also known as Iwa-machi, is a company based in Kyoto, Japan that specializes in the sale of vegetables. They have been in business for ten years and offer a collection of vegetable recipes through their Keiō Vegetable Recipe Collection and Online Shop. The company is directly owned by Uekamoo Farm, Uekame Farm, and Lobechi Shijo-hara Farm. They offer a variety of vegetable products, including suguki-zuke and Kamoo eggplant, and also accept production cultivation according to customer requests. Iwaichi Limited Company uses their experience in production and sales to provide tailored vegetables to meet customer needs and also accepts cultivation of products according to their requirements. 
| ['Wholesale', 'Dual-task Movement Products', 'Cast Iron Products Manufacturer', 'Manufacturing Technology', 'Food and Beverage', 'Rice And Noodles', 'High-quality Gloss of Cast Iron', 'Rice Wholesaler', 'Miscellaneous Crop Farming', 'Health and Wellness Products', 'Agricultural Cooperative', 'Medical Practice Based on Eastern Medicine', 'Production', 'Rice Pudding'] | Manufacturing | Fruit & Vegetable - Markets & Stores | Frozen Fruit, Juice, and Vegetable Manufacturing"
2,"Loidholdhof Integrative Hofgemeinschaft is a company that offers a range of services and products to its customers. Their products are all handmade and of the highest quality, produced on a biodynamic basis with a focus on freshness and quality. The company's product range includes homemade bread, honey from their own beekeeping, syrup, and fresh vegetables, which can be purchased in their farm shop. In addition to their farm products, they also have a farm shop and cafe where customers can enjoy fresh coffee and delicious cakes.","['Living Forms', 'Farm Cafe', 'Fresh Coffee', 'Community Engagement', 'Freshly Baked Bread', 'Social Interaction Opportunities', 'Fresh Vegetables', 'Homemade Honey', 'Delicious Cakes', 'Community-oriented Living', 'Handmade Products', 'Fresh Juices', 'Farm Fresh Products', 'Integrated Farming Community', 'Biodynamic Farming']",Manufacturing,Farms & Agriculture Production,All Other Miscellaneous Crop Farming,"Loidholdhof Integrative Hofgemeinschaft is a company that offers a range of services and products to its customers. Their products are all handmade and of the highest quality, produced on a biodynamic basis with a focus on freshness and quality. The company's product range includes homemade bread, honey from their own beekeeping, syrup, and fresh vegetables, which can be purchased in their farm shop. In addition to their farm products, they also have a farm shop and cafe where customers can enjoy fresh coffee and delicious cakes. | ['Living Forms', 'Farm Cafe', 'Fresh Coffee', 'Community Engagement', 'Freshly Baked Bread', 'Social Interaction Opportunities', 'Fresh Vegetables', 'Homemade Honey', 'Delicious Cakes', 'Community-oriented Living', 'Handmade Products', 'Fresh Juices', 'Farm Fresh Products', 'Integrated Farming Community', 'Biodynamic Farming'] | Manufacturing | Farms & Agriculture Production | All Other Miscellaneous Crop Farming"
3,"PATAGONIA Chapa Y Pintura is an auto body shop located in Comodoro Rivadavia, Chubut Province, Argentina. The company specializes in providing auto body repair services.","['Automotive Body Repair Services', 'Interior Repair Services']",Services,Auto Body Shops,"Automotive Body, Paint, and Interior Repair and Maintenance","PATAGONIA Chapa Y Pintura is an auto body shop located in Comodoro Rivadavia, Chubut Province, Argentina. The company specializes in providing auto body repair services. | ['Automotive Body Repair Services', 'Interior Repair Services'] | Services | Auto Body Shops | Automotive Body, Paint, and Interior Repair and Maintenance"
4,"Stanica WODNA PTTK Swornegacie is a cultural establishment located in Swornychgaciach, Poland. It is a popular destination for kayakers and tourists of all levels, offering a variety of activities and events. The establishment is managed by Zbigniew Galiński.","['Cultural Activities', 'Accommodation Services', 'Kayak Rentals', 'Small Gastronomy Products', 'Tourism Services', 'Recreational Activities', 'Cultural Center']",Services,Boat Tours & Cruises,"Scenic and Sightseeing Transportation, Water","Stanica WODNA PTTK Swornegacie is a cultural establishment located in Swornychgaciach, Poland. It is a popular destination for kayakers and tourists of all levels, offering a variety of activities and events. The establishment is managed by Zbigniew Galiński. | ['Cultural Activities', 'Accommodation Services', 'Kayak Rentals', 'Small Gastronomy Products', 'Tourism Services', 'Recreational Activities', 'Cultural Center'] | Services | Boat Tours & Cruises | Scenic and Sightseeing Transportation, Water"



*   This block displays the first 5 rows of the companies_df DataFrame, including the newly added full_text column.
*   The purpose is to visually inspect whether the full text was constructed correctly by combining all relevant fields (description, tags, sector, category, niche).
*   The styled display helps confirm that the textual concatenation works as expected before moving forward with embeddings and classification.



## 🟣 **SECTION 4 – Semantic Similarity & Pseudo-Label Generation**

This section focuses on generating initial pseudo-labels for each company by computing semantic similarity between company texts and taxonomy labels. Using a pre-trained embedding model, we encode both sets of texts and calculate cosine similarity scores to infer the most relevant labels for each company.

**Embedding and Similarity Calculation**

In [None]:
# List the labels from the taxonomy
labels = taxonomy_df["label"].dropna().astype(str).tolist()

# Load the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for labels and companies
label_embeddings = model.encode(labels, convert_to_tensor=True, show_progress_bar=True)
company_embeddings = model.encode(companies_df["full_text"].tolist(), convert_to_tensor=True, show_progress_bar=True)

# Compute cosine similarity
from sentence_transformers.util import cos_sim
similarities = cos_sim(company_embeddings, label_embeddings)



*   Generates embeddings for the taxonomy labels and the full_text column in the companies_df.
*   Uses SentenceTransformer to encode both the labels and the company texts into vector representations.
*   Computes cosine similarity between the generated embeddings to measure how similar each company is to each label.
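
For intuition about what `cos_sim` returns, here is a dependency-free sketch of the same computation on toy vectors. Cosine similarity is the dot product of two vectors divided by the product of their lengths, and the result has one row per company and one column per label. The two-dimensional vectors and the helper names `cosine` and `similarity_matrix` are made up for illustration; the real embeddings come from the SentenceTransformer model.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(rows, cols):
    """Entry [i][j] is the cosine similarity between rows[i] and cols[j]."""
    return [[cosine(r, c) for c in cols] for r in rows]


# Toy stand-ins for company and label embeddings (assumed values).
company_vecs = [[1.0, 0.0], [1.0, 1.0]]
label_vecs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

sims = similarity_matrix(company_vecs, label_vecs)
for row in sims:
    print([round(score, 4) for score in row])
```

Vector length does not affect the score: the first company matches `[2.0, 0.0]` perfectly even though the two vectors differ in magnitude.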

**Cosine Similarity – Example for a Single Company**

In [None]:
# Select a sample company (e.g., first row)
sample_index = 0
sample_text = companies_df.iloc[sample_index]["full_text"]
sample_embedding = company_embeddings[sample_index]

# Calculate cosine similarity between the sample company and all labels
sample_similarities = cos_sim(sample_embedding, label_embeddings)[0].cpu().numpy()

# Pair each label with its similarity score
label_scores = list(zip(labels, sample_similarities))

# Sort by similarity (descending)
sorted_scores = sorted(label_scores, key=lambda x: x[1], reverse=True)

# Display top 5 most similar labels
print(f"Full text for company #{sample_index}:\n{sample_text}\n")
print("🔍 Top 5 most similar labels:")
for label, score in sorted_scores[:5]:
    print(f"{label}: {score:.4f}")


Full text for company #0:
Welchcivils is a civil engineering and construction company that specializes in designing and building utility network connections across the UK. They offer multi-utility solutions that combine electricity, gas, water, and fibre optic installation into a single contract. Their design engineer teams are capable of designing electricity, water and gas networks from existing network connection points to meter locations at the development, as well as project management of reinforcements and diversions. They provide custom connection solutions that take into account any existing assets, maximize the usage of every trench, and meet project deadlines. Welchcivils has considerable expertise installing gas and electricity connections in a variety of market categories, including residential, commercial, and industrial projects, as as well. | ['Construction Services', 'Multi-utilities', 'Utility Network Connections Design and Construction', 'Water Connection Installation

*   Selects a sample company (the first row) and computes cosine similarity between its embedding and all label embeddings.
*   Displays the top 5 most similar labels, showing how the model would decide which labels are most relevant to the company.
*   Helps us verify that the embeddings and similarity computations work as intended.



**Pseudo-Label Generation**

In [None]:
# Generate pseudo-labels
top_k = 5
threshold = 0.3
fallback_top_k = 3

pseudo_labels = []
for i in range(len(similarities)):
    sims = similarities[i].cpu().numpy()
    above_thresh = np.where(sims >= threshold)[0]
    if len(above_thresh) > 0:
        sorted_indices = above_thresh[np.argsort(-sims[above_thresh])]
        selected = [labels[idx] for idx in sorted_indices[:top_k]]
    else:
        top_indices = np.argsort(sims)[-fallback_top_k:][::-1]
        selected = [labels[idx] for idx in top_indices]
    pseudo_labels.append(selected)

companies_df["pseudo_labels"] = pseudo_labels


*   Generates pseudo-labels for each company from its similarity scores against all taxonomy labels.
*   Labels with a similarity of at least 0.3 (the threshold) are kept, ranked by score, and capped at the 5 highest. If no label reaches the threshold, the fallback assigns the 3 most similar labels so that every company receives at least one label.
*   These pseudo-labels are added as a new column in the companies_df for training the model.
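
The selection rule can also be written as a small standalone function, which makes both branches easy to test in isolation. This is a plain-Python restatement of the loop above, not the notebook's code: `select_labels` is a hypothetical name, the scores and single-letter labels are invented, and tie-breaking between equal scores may differ from the `np.argsort` version.

```python
def select_labels(scores, labels, threshold=0.3, top_k=5, fallback_top_k=3):
    """Pick labels for one company from its similarity scores."""
    # Indices of all labels, best score first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    above = [i for i in ranked if scores[i] >= threshold]
    # Keep up to top_k labels above the threshold; otherwise fall back.
    chosen = above[:top_k] if above else ranked[:fallback_top_k]
    return [labels[i] for i in chosen]


labels_demo = ["A", "B", "C", "D", "E", "F", "G"]

# Six scores reach the threshold, so only the five highest are kept.
confident = select_labels([0.10, 0.45, 0.31, 0.32, 0.60, 0.35, 0.50], labels_demo)
print(confident)

# No score reaches the threshold, so the three highest are used instead.
uncertain = select_labels([0.10, 0.25, 0.05, 0.29, 0.20, 0.01, 0.15], labels_demo)
print(uncertain)
```

Note that the fallback labels can have low similarity scores, so companies labeled this way are likely the noisiest part of the training set.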

**Preview with pseudo_labels Column**

In [None]:
# Set options for displaying DataFrames (optional)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 10)

# To center the text, we apply CSS styles to cells and headers
companies_styled = companies_df.head().style \
    .set_properties(**{'text-align': 'center'}) \
    .set_table_styles([{'selector': 'th', 'props': [('text-align', 'center')]}]) \
    .set_caption("<b>Top 5 Companies</b>")

display(companies_styled)

Unnamed: 0,description,business_tags,sector,category,niche,full_text,pseudo_labels
0,"Welchcivils is a civil engineering and construction company that specializes in designing and building utility network connections across the UK. They offer multi-utility solutions that combine electricity, gas, water, and fibre optic installation into a single contract. Their design engineer teams are capable of designing electricity, water and gas networks from existing network connection points to meter locations at the development, as well as project management of reinforcements and diversions. They provide custom connection solutions that take into account any existing assets, maximize the usage of every trench, and meet project deadlines. Welchcivils has considerable expertise installing gas and electricity connections in a variety of market categories, including residential, commercial, and industrial projects, as as well.","['Construction Services', 'Multi-utilities', 'Utility Network Connections Design and Construction', 'Water Connection Installation', 'Multi-utility Connections', 'Fiber Optic Installation']",Services,Civil Engineering Services,Other Heavy and Civil Engineering Construction,"Welchcivils is a civil engineering and construction company that specializes in designing and building utility network connections across the UK. They offer multi-utility solutions that combine electricity, gas, water, and fibre optic installation into a single contract. Their design engineer teams are capable of designing electricity, water and gas networks from existing network connection points to meter locations at the development, as well as project management of reinforcements and diversions. They provide custom connection solutions that take into account any existing assets, maximize the usage of every trench, and meet project deadlines. Welchcivils has considerable expertise installing gas and electricity connections in a variety of market categories, including residential, commercial, and industrial projects, as as well. 
| ['Construction Services', 'Multi-utilities', 'Utility Network Connections Design and Construction', 'Water Connection Installation', 'Multi-utility Connections', 'Fiber Optic Installation'] | Services | Civil Engineering Services | Other Heavy and Civil Engineering Construction","['Commercial Construction Services', 'Gas Installation Services', 'Residential Plumbing Services', 'Commercial Communication Equipment Installation', 'Commercial Plumbing Services']"
1,"Kyoto Vegetable Specialists Uekamo, also known as Iwa-machi, is a company based in Kyoto, Japan that specializes in the sale of vegetables. They have been in business for ten years and offer a collection of vegetable recipes through their Keiō Vegetable Recipe Collection and Online Shop. The company is directly owned by Uekamoo Farm, Uekame Farm, and Lobechi Shijo-hara Farm. They offer a variety of vegetable products, including suguki-zuke and Kamoo eggplant, and also accept production cultivation according to customer requests. Iwaichi Limited Company uses their experience in production and sales to provide tailored vegetables to meet customer needs and also accepts cultivation of products according to their requirements.","['Wholesale', 'Dual-task Movement Products', 'Cast Iron Products Manufacturer', 'Manufacturing Technology', 'Food and Beverage', 'Rice And Noodles', 'High-quality Gloss of Cast Iron', 'Rice Wholesaler', 'Miscellaneous Crop Farming', 'Health and Wellness Products', 'Agricultural Cooperative', 'Medical Practice Based on Eastern Medicine', 'Production', 'Rice Pudding']",Manufacturing,Fruit & Vegetable - Markets & Stores,"Frozen Fruit, Juice, and Vegetable Manufacturing","Kyoto Vegetable Specialists Uekamo, also known as Iwa-machi, is a company based in Kyoto, Japan that specializes in the sale of vegetables. They have been in business for ten years and offer a collection of vegetable recipes through their Keiō Vegetable Recipe Collection and Online Shop. The company is directly owned by Uekamoo Farm, Uekame Farm, and Lobechi Shijo-hara Farm. They offer a variety of vegetable products, including suguki-zuke and Kamoo eggplant, and also accept production cultivation according to customer requests. Iwaichi Limited Company uses their experience in production and sales to provide tailored vegetables to meet customer needs and also accepts cultivation of products according to their requirements. 
| ['Wholesale', 'Dual-task Movement Products', 'Cast Iron Products Manufacturer', 'Manufacturing Technology', 'Food and Beverage', 'Rice And Noodles', 'High-quality Gloss of Cast Iron', 'Rice Wholesaler', 'Miscellaneous Crop Farming', 'Health and Wellness Products', 'Agricultural Cooperative', 'Medical Practice Based on Eastern Medicine', 'Production', 'Rice Pudding'] | Manufacturing | Fruit & Vegetable - Markets & Stores | Frozen Fruit, Juice, and Vegetable Manufacturing","['Food Processing Services', 'Bakery Production Services', 'Meat Processing Services', 'Frozen Food Processing', 'Pet Food Manufacturing']"
2,"Loidholdhof Integrative Hofgemeinschaft is a company that offers a range of services and products to its customers. Their products are all handmade and of the highest quality, produced on a biodynamic basis with a focus on freshness and quality. The company's product range includes homemade bread, honey from their own beekeeping, syrup, and fresh vegetables, which can be purchased in their farm shop. In addition to their farm products, they also have a farm shop and cafe where customers can enjoy fresh coffee and delicious cakes.","['Living Forms', 'Farm Cafe', 'Fresh Coffee', 'Community Engagement', 'Freshly Baked Bread', 'Social Interaction Opportunities', 'Fresh Vegetables', 'Homemade Honey', 'Delicious Cakes', 'Community-oriented Living', 'Handmade Products', 'Fresh Juices', 'Farm Fresh Products', 'Integrated Farming Community', 'Biodynamic Farming']",Manufacturing,Farms & Agriculture Production,All Other Miscellaneous Crop Farming,"Loidholdhof Integrative Hofgemeinschaft is a company that offers a range of services and products to its customers. Their products are all handmade and of the highest quality, produced on a biodynamic basis with a focus on freshness and quality. The company's product range includes homemade bread, honey from their own beekeeping, syrup, and fresh vegetables, which can be purchased in their farm shop. In addition to their farm products, they also have a farm shop and cafe where customers can enjoy fresh coffee and delicious cakes. 
| ['Living Forms', 'Farm Cafe', 'Fresh Coffee', 'Community Engagement', 'Freshly Baked Bread', 'Social Interaction Opportunities', 'Fresh Vegetables', 'Homemade Honey', 'Delicious Cakes', 'Community-oriented Living', 'Handmade Products', 'Fresh Juices', 'Farm Fresh Products', 'Integrated Farming Community', 'Biodynamic Farming'] | Manufacturing | Farms & Agriculture Production | All Other Miscellaneous Crop Farming","['Bakery Production Services', 'Dairy Production Services', 'Food Processing Services', 'Gardening Services', 'Catering Services']"
3,"PATAGONIA Chapa Y Pintura is an auto body shop located in Comodoro Rivadavia, Chubut Province, Argentina. The company specializes in providing auto body repair services.","['Automotive Body Repair Services', 'Interior Repair Services']",Services,Auto Body Shops,"Automotive Body, Paint, and Interior Repair and Maintenance","PATAGONIA Chapa Y Pintura is an auto body shop located in Comodoro Rivadavia, Chubut Province, Argentina. The company specializes in providing auto body repair services. | ['Automotive Body Repair Services', 'Interior Repair Services'] | Services | Auto Body Shops | Automotive Body, Paint, and Interior Repair and Maintenance",['Pallet Manufacturing']
4,"Stanica WODNA PTTK Swornegacie is a cultural establishment located in Swornychgaciach, Poland. It is a popular destination for kayakers and tourists of all levels, offering a variety of activities and events. The establishment is managed by Zbigniew Galiński.","['Cultural Activities', 'Accommodation Services', 'Kayak Rentals', 'Small Gastronomy Products', 'Tourism Services', 'Recreational Activities', 'Cultural Center']",Services,Boat Tours & Cruises,"Scenic and Sightseeing Transportation, Water","Stanica WODNA PTTK Swornegacie is a cultural establishment located in Swornychgaciach, Poland. It is a popular destination for kayakers and tourists of all levels, offering a variety of activities and events. The establishment is managed by Zbigniew Galiński. | ['Cultural Activities', 'Accommodation Services', 'Kayak Rentals', 'Small Gastronomy Products', 'Tourism Services', 'Recreational Activities', 'Cultural Center'] | Services | Boat Tours & Cruises | Scenic and Sightseeing Transportation, Water","['Canning Services', 'Welding Services', 'Ornamental Plant Nurseries']"


*   Displays the first 5 companies in the dataset to verify that the pseudo_labels column was added correctly.
*   Useful for checking that each company received a plausible set of labels, shown alongside the fields (description, business tags, sector, category, niche) and the full_text string they were derived from.



## 🟣 **SECTION 5 – Training Classifier with Pseudo-Labels**

We use the pseudo-labeled dataset to train a multi-label classification model. The model pipeline includes TF-IDF vectorization and a logistic regression classifier, wrapped in a One-vs-Rest strategy. This allows the system to assign multiple relevant labels to each company.

**Training the Classifier**

In [None]:
# Select training data by excluding the last 100 rows (reserved for validation)
training_df = companies_df.iloc[:-100].copy()

# Keep only rows that have at least one pseudo-label
df_train = training_df[training_df["pseudo_labels"].map(len) > 0]

# Transform the list of pseudo-labels into a binary format suitable for multi-label classification
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(df_train["pseudo_labels"])

# Create a TF-IDF vectorizer to convert text into numerical features
# Uses unigrams and bigrams, limits to top 5000 features, and removes English stopwords
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english')

# Define a multi-label classifier using Logistic Regression in a One-vs-Rest strategy
clf = OneVsRestClassifier(LogisticRegression(C=2.0, max_iter=2000))

# Combine the vectorizer and classifier into a pipeline
pipeline = make_pipeline(vectorizer, clf)

# Train the pipeline on the full_text and binary-encoded labels
pipeline.fit(df_train["full_text"], Y)


*   Trains the model on all companies except the last 100 rows, which are reserved for manual validation.
*   Keeps only the companies that have at least one pseudo-label.
*   Converts the lists of pseudo-labels into a binary indicator matrix (Y) using MultiLabelBinarizer.
*   Uses TfidfVectorizer to convert the text into numerical features (unigrams and bigrams, at most 5000 features, English stopwords removed).
*   Wraps LogisticRegression in OneVsRestClassifier, which trains one binary classifier per label so that each company can be assigned multiple relevant labels.
*   Combines vectorization and classification into a single pipeline and trains it on the full_text column.
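
To make the binary label format concrete, here is a small sketch in plain Python that builds the same kind of indicator matrix MultiLabelBinarizer produces: one column per class, classes in sorted order. The helper `binarize` and the toy label lists are illustrative assumptions, not part of the notebook.

```python
def binarize(label_lists):
    """Return (classes, matrix): classes sorted alphabetically,
    and one 0/1 row per sample marking which classes it carries."""
    classes = sorted({label for labels in label_lists for label in labels})
    matrix = [[1 if cls in labels else 0 for cls in classes]
              for labels in label_lists]
    return classes, matrix

toy_pseudo_labels = [
    ["Food Processing Services", "Bakery Production Services"],
    ["Travel Services"],
    ["Bakery Production Services", "Catering Services"],
]

classes, Y_toy = binarize(toy_pseudo_labels)
print(classes)
for row in Y_toy:
    print(row)
# Rows printed: [1, 0, 1, 0], [0, 0, 0, 1], [1, 1, 0, 0]
```

In a One-vs-Rest setup, each column of this matrix becomes the target of its own binary classifier, which is what allows several labels to be active for the same company.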

## 🟣 **SECTION 6 – Prediction Function & Re-labeling**

In this section, we apply the trained model to reclassify all companies using a custom prediction function. This step replaces the initial pseudo-labels with refined model-generated labels, ensuring consistency and taking advantage of the model’s learning.

**Prediction Function**

In [None]:
def predict_with_fallback(text, threshold=0.3, fallback_top_k=3):
    # Predict probabilities for each label
    probs = pipeline.predict_proba([text])[0]

    # Select labels with probability above the threshold
    indices = np.where(probs >= threshold)[0]

    # If there are matches above the threshold, return those labels
    if len(indices) > 0:
        return [mlb.classes_[i] for i in indices]

    # If none are above the threshold, return the top-k highest probability labels
    else:
        top_indices = np.argsort(probs)[-fallback_top_k:][::-1]
        return [mlb.classes_[i] for i in top_indices]

# Re-predict labels for the entire dataset
companies_df["model_labels"] = companies_df["full_text"].apply(lambda x: predict_with_fallback(x, threshold=0.3))

# Overwrite pseudo_labels with model_labels
companies_df["pseudo_labels"] = companies_df["model_labels"]



*   Defines a prediction function that uses predict_proba() to get a probability for each label.
*   Returns every label whose probability is at least the threshold (0.3); if no label qualifies, falls back to the 3 highest-probability labels.
*   Relabels the entire dataset using the trained model, stores the result in model_labels, and overwrites the pseudo_labels column with these predictions.
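
The prediction-time rule differs slightly from the pseudo-label rule: there is no top-k cap, and labels above the threshold come back in class order rather than ranked by score. The sketch below reproduces that behaviour in plain Python on made-up probabilities; `predict_labels_from_probs` and the toy values are assumptions for illustration and do not call the trained pipeline.

```python
def predict_labels_from_probs(probs, classes, threshold=0.3, fallback_top_k=3):
    """Mimic the threshold-with-fallback rule on a list of per-class probabilities."""
    # All classes at or above the threshold, kept in class order (no cap)
    above = [i for i, p in enumerate(probs) if p >= threshold]
    if above:
        return [classes[i] for i in above]
    # Otherwise, the fallback_top_k most probable classes, best first
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [classes[i] for i in ranked[:fallback_top_k]]

toy_classes = [
    "Apparel Manufacturing",
    "Building Cleaning Services",
    "Catering Services",
    "Travel Services",
]

# Three classes clear the threshold and are returned in class order
print(predict_labels_from_probs([0.35, 0.80, 0.10, 0.40], toy_classes))

# Nothing clears the threshold, so the three most probable classes are returned
print(predict_labels_from_probs([0.05, 0.20, 0.10, 0.25], toy_classes))
```

Because there is no cap, a company can receive more than five model labels when many probabilities exceed the threshold.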

**Preview with model_labels Column**

In [None]:
# Set options for displaying DataFrames (optional)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 10)

# To center the text, we apply CSS styles to cells and headers
companies_styled = companies_df.head().style \
    .set_properties(**{'text-align': 'center'}) \
    .set_table_styles([{'selector': 'th', 'props': [('text-align', 'center')]}]) \
    .set_caption("<b>Top 5 Companies</b>")

display(companies_styled)

Unnamed: 0,description,business_tags,sector,category,niche,full_text,pseudo_labels,model_labels
0,"Welchcivils is a civil engineering and construction company that specializes in designing and building utility network connections across the UK. They offer multi-utility solutions that combine electricity, gas, water, and fibre optic installation into a single contract. Their design engineer teams are capable of designing electricity, water and gas networks from existing network connection points to meter locations at the development, as well as project management of reinforcements and diversions. They provide custom connection solutions that take into account any existing assets, maximize the usage of every trench, and meet project deadlines. Welchcivils has considerable expertise installing gas and electricity connections in a variety of market categories, including residential, commercial, and industrial projects, as as well.","['Construction Services', 'Multi-utilities', 'Utility Network Connections Design and Construction', 'Water Connection Installation', 'Multi-utility Connections', 'Fiber Optic Installation']",Services,Civil Engineering Services,Other Heavy and Civil Engineering Construction,"Welchcivils is a civil engineering and construction company that specializes in designing and building utility network connections across the UK. They offer multi-utility solutions that combine electricity, gas, water, and fibre optic installation into a single contract. Their design engineer teams are capable of designing electricity, water and gas networks from existing network connection points to meter locations at the development, as well as project management of reinforcements and diversions. They provide custom connection solutions that take into account any existing assets, maximize the usage of every trench, and meet project deadlines. Welchcivils has considerable expertise installing gas and electricity connections in a variety of market categories, including residential, commercial, and industrial projects, as as well. 
| ['Construction Services', 'Multi-utilities', 'Utility Network Connections Design and Construction', 'Water Connection Installation', 'Multi-utility Connections', 'Fiber Optic Installation'] | Services | Civil Engineering Services | Other Heavy and Civil Engineering Construction",['Commercial Construction Services'],['Commercial Construction Services']
1,"Kyoto Vegetable Specialists Uekamo, also known as Iwa-machi, is a company based in Kyoto, Japan that specializes in the sale of vegetables. They have been in business for ten years and offer a collection of vegetable recipes through their Keiō Vegetable Recipe Collection and Online Shop. The company is directly owned by Uekamoo Farm, Uekame Farm, and Lobechi Shijo-hara Farm. They offer a variety of vegetable products, including suguki-zuke and Kamoo eggplant, and also accept production cultivation according to customer requests. Iwaichi Limited Company uses their experience in production and sales to provide tailored vegetables to meet customer needs and also accepts cultivation of products according to their requirements.","['Wholesale', 'Dual-task Movement Products', 'Cast Iron Products Manufacturer', 'Manufacturing Technology', 'Food and Beverage', 'Rice And Noodles', 'High-quality Gloss of Cast Iron', 'Rice Wholesaler', 'Miscellaneous Crop Farming', 'Health and Wellness Products', 'Agricultural Cooperative', 'Medical Practice Based on Eastern Medicine', 'Production', 'Rice Pudding']",Manufacturing,Fruit & Vegetable - Markets & Stores,"Frozen Fruit, Juice, and Vegetable Manufacturing","Kyoto Vegetable Specialists Uekamo, also known as Iwa-machi, is a company based in Kyoto, Japan that specializes in the sale of vegetables. They have been in business for ten years and offer a collection of vegetable recipes through their Keiō Vegetable Recipe Collection and Online Shop. The company is directly owned by Uekamoo Farm, Uekame Farm, and Lobechi Shijo-hara Farm. They offer a variety of vegetable products, including suguki-zuke and Kamoo eggplant, and also accept production cultivation according to customer requests. Iwaichi Limited Company uses their experience in production and sales to provide tailored vegetables to meet customer needs and also accepts cultivation of products according to their requirements. 
| ['Wholesale', 'Dual-task Movement Products', 'Cast Iron Products Manufacturer', 'Manufacturing Technology', 'Food and Beverage', 'Rice And Noodles', 'High-quality Gloss of Cast Iron', 'Rice Wholesaler', 'Miscellaneous Crop Farming', 'Health and Wellness Products', 'Agricultural Cooperative', 'Medical Practice Based on Eastern Medicine', 'Production', 'Rice Pudding'] | Manufacturing | Fruit & Vegetable - Markets & Stores | Frozen Fruit, Juice, and Vegetable Manufacturing","['Agricultural Equipment Services', 'Bakery Production Services', 'Food Processing Services', 'Frozen Food Processing', 'Meat Processing Services', 'Pet Food Manufacturing']","['Agricultural Equipment Services', 'Bakery Production Services', 'Food Processing Services', 'Frozen Food Processing', 'Meat Processing Services', 'Pet Food Manufacturing']"
2,"Loidholdhof Integrative Hofgemeinschaft is a company that offers a range of services and products to its customers. Their products are all handmade and of the highest quality, produced on a biodynamic basis with a focus on freshness and quality. The company's product range includes homemade bread, honey from their own beekeeping, syrup, and fresh vegetables, which can be purchased in their farm shop. In addition to their farm products, they also have a farm shop and cafe where customers can enjoy fresh coffee and delicious cakes.","['Living Forms', 'Farm Cafe', 'Fresh Coffee', 'Community Engagement', 'Freshly Baked Bread', 'Social Interaction Opportunities', 'Fresh Vegetables', 'Homemade Honey', 'Delicious Cakes', 'Community-oriented Living', 'Handmade Products', 'Fresh Juices', 'Farm Fresh Products', 'Integrated Farming Community', 'Biodynamic Farming']",Manufacturing,Farms & Agriculture Production,All Other Miscellaneous Crop Farming,"Loidholdhof Integrative Hofgemeinschaft is a company that offers a range of services and products to its customers. Their products are all handmade and of the highest quality, produced on a biodynamic basis with a focus on freshness and quality. The company's product range includes homemade bread, honey from their own beekeeping, syrup, and fresh vegetables, which can be purchased in their farm shop. In addition to their farm products, they also have a farm shop and cafe where customers can enjoy fresh coffee and delicious cakes. 
| ['Living Forms', 'Farm Cafe', 'Fresh Coffee', 'Community Engagement', 'Freshly Baked Bread', 'Social Interaction Opportunities', 'Fresh Vegetables', 'Homemade Honey', 'Delicious Cakes', 'Community-oriented Living', 'Handmade Products', 'Fresh Juices', 'Farm Fresh Products', 'Integrated Farming Community', 'Biodynamic Farming'] | Manufacturing | Farms & Agriculture Production | All Other Miscellaneous Crop Farming","['Agricultural Equipment Services', 'Bakery Production Services', 'Dairy Production Services', 'Food Processing Services', 'Gardening Services']","['Agricultural Equipment Services', 'Bakery Production Services', 'Dairy Production Services', 'Food Processing Services', 'Gardening Services']"
3,"PATAGONIA Chapa Y Pintura is an auto body shop located in Comodoro Rivadavia, Chubut Province, Argentina. The company specializes in providing auto body repair services.","['Automotive Body Repair Services', 'Interior Repair Services']",Services,Auto Body Shops,"Automotive Body, Paint, and Interior Repair and Maintenance","PATAGONIA Chapa Y Pintura is an auto body shop located in Comodoro Rivadavia, Chubut Province, Argentina. The company specializes in providing auto body repair services. | ['Automotive Body Repair Services', 'Interior Repair Services'] | Services | Auto Body Shops | Automotive Body, Paint, and Interior Repair and Maintenance","['Textile Manufacturing Services', 'Road Maintenance Services', 'Welding Services']","['Textile Manufacturing Services', 'Road Maintenance Services', 'Welding Services']"
4,"Stanica WODNA PTTK Swornegacie is a cultural establishment located in Swornychgaciach, Poland. It is a popular destination for kayakers and tourists of all levels, offering a variety of activities and events. The establishment is managed by Zbigniew Galiński.","['Cultural Activities', 'Accommodation Services', 'Kayak Rentals', 'Small Gastronomy Products', 'Tourism Services', 'Recreational Activities', 'Cultural Center']",Services,Boat Tours & Cruises,"Scenic and Sightseeing Transportation, Water","Stanica WODNA PTTK Swornegacie is a cultural establishment located in Swornychgaciach, Poland. It is a popular destination for kayakers and tourists of all levels, offering a variety of activities and events. The establishment is managed by Zbigniew Galiński. | ['Cultural Activities', 'Accommodation Services', 'Kayak Rentals', 'Small Gastronomy Products', 'Tourism Services', 'Recreational Activities', 'Cultural Center'] | Services | Boat Tours & Cruises | Scenic and Sightseeing Transportation, Water",['Travel Services'],['Travel Services']



*   Shows all columns, including the new model_labels column, for full visibility.
*   Because pseudo_labels was overwritten with the model's predictions, the pseudo_labels and model_labels columns are identical in this preview.


## 🟣 **SECTION 7 – Model Validation & Performance Analysis**

We evaluate the performance of our classification pipeline using a small manually labeled dataset. Accuracy is measured by checking whether the model predicted at least one correct label for each company. While this method is lenient, it provides a first estimate of the model’s effectiveness.

**Model Validation**

In [None]:
# Predictions on the validation set
last_100_df["predicted_labels"] = last_100_df["description"].apply(lambda x: predict_with_fallback(x, threshold=0.3))

# Select relevant columns (description, manually assigned labels, and predicted labels)
validation_results = last_100_df[["description", "corect_label", "predicted_labels"]]

# Set display options for DataFrames
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 10)

# Style the table to show the first 5 rows with centered text and a caption
validation_styled = validation_results.head().style \
    .set_properties(**{'text-align': 'center'}) \
    .set_table_styles([{'selector': 'th', 'props': [('text-align', 'center')]}]) \
    .set_caption("<b>Validation Set Predictions (First 5 Rows)</b>")

display(validation_styled)


Unnamed: 0,description,corect_label,predicted_labels
0,"Mitch Lui is a Hong Kong-based lifestyle photographer who specializes in portrait, film, and landscape photography. His work is inspired by ambient, light, sounds, colors, space, and texture. He is known for his passion-driven and observant approach to capturing fine details through his lens in both digital and analog formats.","['Media Production Services', 'Content Creation Services', 'Graphic Design Services']","['Painting Services', 'Accessory Manufacturing', 'Cosmetic Manufacturing']"
1,"AWM Perionica veša is a company based in Mirijevo, Croatia that specializes in laundry and dry cleaning services. They offer free delivery of laundry to customers' doors throughout Mirije and its surrounding areas. Their services include washing, ironing, and dry-cleaning of various items such as blankets, duvets, bedding, quilts, curtains, duvet covers, bathrobes, tablecloths, table skirts, and curtains. The company prides itself on using high-quality materials that do not harm the fabric.",['Building Cleaning Services'],"['Apparel Manufacturing', 'Building Cleaning Services', 'Textile Manufacturing Services']"
2,"FreshLookHomes is a company that specializes in the sales and marketing of new and pre-owned homes. They offer a variety of properties for sale, including single-family homes, condos, townhomes, and apartments. The company provides a platform for potential buyers to view properties and receive notifications about upcoming sales and events. FreshLookHouses aims to help buyers find their dream homes quickly and affordably.","['Real Estate Services', 'Single Family Residential Construction', 'Property Management Services', 'Multi-Family Construction Services']","['Residential Roofing Services', 'Property Management Services', 'Marketing Services']"
3,"Balıkesir Gıda Food is a market that offers a variety of food items for purchase. The company is known for its fast and friendly service, as evidenced by positive customer reviews.","['Retail Groceries', 'Convenience Retailers', 'Shopping Services', 'Food Processing Services']","['Catering Services', 'Food Processing Services', 'Food Safety Services']"
4,"Hustlerholic is a company that offers a podcast series called ""The Hustlerholist Podcast"" that provides tips and advice on various topics such as real estate, the stock market, forex market, crypto currency, and running a small business. The podcast features stories of Hustlers who have achieved financial independence through hustling and offers insights from leaders in the business world. The company also provides information on Forex trading and the importance of Forex in achieving financial independence. Additionally, they offer a virtual seminar on the stock and real estate markets.","['Financial Services', 'Consulting Services', 'Media Production Services', 'Training Services']","['Market Research Services', 'Real Estate Services']"




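*   Applies predict_with_fallback to the description of each of the 100 validation companies.
*   Displays the first 5 rows with the description, the manually assigned labels (corect_label), and the predicted labels side by side.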



**Manual Accuracy Evaluation**

In [None]:
def calculate_correct_predictions(row):
    def parse_labels(value):
        # Try to parse the value using ast.literal_eval if it looks like a list
        try:
            parsed = ast.literal_eval(value) if isinstance(value, str) else value
            if isinstance(parsed, list):
                return set(map(str.strip, map(str, parsed)))
        except (ValueError, SyntaxError):
            pass
        # Fallback: split by comma and strip whitespace
        return set(map(str.strip, str(value).split(",")))

    correct_labels = parse_labels(row["corect_label"])
    predicted_labels = parse_labels(row["predicted_labels"])

    # Return True if there is at least one match between the two sets
    return len(correct_labels & predicted_labels) > 0


# Use the validation DataFrame directly (alternatively, load the predictions from CSV)
predictions_df = last_100_df.copy()
predictions_df["correct_prediction"] = predictions_df.apply(calculate_correct_predictions, axis=1)

correct_predictions = predictions_df["correct_prediction"].sum()
accuracy_percentage = (correct_predictions / len(predictions_df)) * 100
print(f"Number of correct predictions: {correct_predictions} of {len(predictions_df)}")
print(f"Accuracy: {accuracy_percentage:.2f}%")

Number of correct predictions: 65 of 100
Accuracy: 65.00%


*   Calculates how well the model’s predictions align with the manually assigned labels (corect_label).
*   Counts a prediction as correct when at least one predicted label matches a manual label; under this lenient criterion, 65 of the 100 validation companies are classified correctly (65%).
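
The at-least-one-match criterion can be checked by hand on the first five validation rows shown earlier. This is a small self-contained sketch; the helper name `at_least_one_match` is an assumption, and the label lists are copied from the validation preview rather than read from last_100_df.

```python
def at_least_one_match(correct, predicted):
    """True if the two label lists share at least one label (ignoring stray whitespace)."""
    return bool({c.strip() for c in correct} & {p.strip() for p in predicted})

# (manual labels, predicted labels) for the first five validation rows
validation_rows = [
    (["Media Production Services", "Content Creation Services", "Graphic Design Services"],
     ["Painting Services", "Accessory Manufacturing", "Cosmetic Manufacturing"]),
    (["Building Cleaning Services"],
     ["Apparel Manufacturing", "Building Cleaning Services", "Textile Manufacturing Services"]),
    (["Real Estate Services", "Single Family Residential Construction",
      "Property Management Services", "Multi-Family Construction Services"],
     ["Residential Roofing Services", "Property Management Services", "Marketing Services"]),
    (["Retail Groceries", "Convenience Retailers", "Shopping Services", "Food Processing Services"],
     ["Catering Services", "Food Processing Services", "Food Safety Services"]),
    (["Financial Services", "Consulting Services", "Media Production Services", "Training Services"],
     ["Market Research Services", "Real Estate Services"]),
]

hits = [at_least_one_match(correct, predicted) for correct, predicted in validation_rows]
hit_rate = 100 * sum(hits) / len(hits)
print(hits)                # [False, True, True, True, False]
print(f"{hit_rate:.2f}%")  # 60.00%
```

A row counts as correct even when only one of several predicted labels is right, which is why this measure should be read as an upper bound on label-level accuracy.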

**Preview of Validation Results**

In [None]:
# Set options for displaying DataFrames (optional)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)

# To center the text, we apply CSS styles to cells and headers
companies_styled = predictions_df.head(5).style \
    .set_properties(**{'text-align': 'center'}) \
    .set_table_styles([{'selector': 'th', 'props': [('text-align', 'center')]}]) \
    .set_caption("<b>Top 5 Companies</b>")

display(companies_styled)

Unnamed: 0,description,business_tags,sector,category,niche,corect_label,predicted_labels,correct_prediction
0,"Mitch Lui is a Hong Kong-based lifestyle photographer who specializes in portrait, film, and landscape photography. His work is inspired by ambient, light, sounds, colors, space, and texture. He is known for his passion-driven and observant approach to capturing fine details through his lens in both digital and analog formats.","['Fine Details Photography', 'Lifestyle Photography Services', 'Ambient Light Photography', 'Analog Photography', 'Art Gallery', 'Still Photography']",Services,Photographers & Photographic Studios,Commercial Photography,"['Media Production Services', 'Content Creation Services', 'Graphic Design Services']","['Painting Services', 'Accessory Manufacturing', 'Cosmetic Manufacturing']",False
1,"AWM Perionica veša is a company based in Mirijevo, Croatia that specializes in laundry and dry cleaning services. They offer free delivery of laundry to customers' doors throughout Mirije and its surrounding areas. Their services include washing, ironing, and dry-cleaning of various items such as blankets, duvets, bedding, quilts, curtains, duvet covers, bathrobes, tablecloths, table skirts, and curtains. The company prides itself on using high-quality materials that do not harm the fabric.","['Laundry Service', 'Laundromat Services', 'Ironing Services']",Services,Dry Cleaners,Coin-Operated Laundries and Drycleaners,['Building Cleaning Services'],"['Apparel Manufacturing', 'Building Cleaning Services', 'Textile Manufacturing Services']",True
2,"FreshLookHomes is a company that specializes in the sales and marketing of new and pre-owned homes. They offer a variety of properties for sale, including single-family homes, condos, townhomes, and apartments. The company provides a platform for potential buyers to view properties and receive notifications about upcoming sales and events. FreshLookHouses aims to help buyers find their dream homes quickly and affordably.",['Home Improvement Services'],Services,Home Builders & Renovation Contractors,New Housing For-Sale Builders,"['Real Estate Services', 'Single Family Residential Construction', 'Property Management Services', 'Multi-Family Construction Services']","['Residential Roofing Services', 'Property Management Services', 'Marketing Services']",True
3,"Balıkesir Gıda Food is a market that offers a variety of food items for purchase. The company is known for its fast and friendly service, as evidenced by positive customer reviews.","['Shopping Services', 'General Store Products', 'Online Food Ordering Platform', 'Greengrocer Products', 'Buffet Products', 'Wi-fi Services', 'Alcohol Retail', 'Card Payment Services', 'Wheelchair Accessible Entrance']",Retail,Groceries,Convenience Retailers,"['Retail Groceries', 'Convenience Retailers', 'Shopping Services', 'Food Processing Services']","['Catering Services', 'Food Processing Services', 'Food Safety Services']",True
4,"Hustlerholic is a company that offers a podcast series called ""The Hustlerholist Podcast"" that provides tips and advice on various topics such as real estate, the stock market, forex market, crypto currency, and running a small business. The podcast features stories of Hustlers who have achieved financial independence through hustling and offers insights from leaders in the business world. The company also provides information on Forex trading and the importance of Forex in achieving financial independence. Additionally, they offer a virtual seminar on the stock and real estate markets.","['Financial Education', 'Real Estate Tips', 'Small Business Education', 'Financial Services', 'Real-life Success in Real Estate', 'Virtual Seminar on Real Estate', 'Cryptocurrency Education', 'Podcast Production Services', 'Forex Education']",Services,Cryptocurrency,Securities and Commodity Exchanges,"['Financial Services', 'Consulting Services', 'Media Production Services', 'Training Services']","['Market Research Services', 'Real Estate Services']",False



*   Helps visually inspect whether the model's predictions aligned with manual labels for each entry.
*   Shows all columns, including the original description, manually labeled classes, predicted classes, and evaluation outcome.
*   Useful for quickly confirming that the accuracy evaluation logic was applied correctly before analyzing results in depth.

### **Bonus Verification**

**Manual vs. Model Label Comparison (Validation Preview)**

In [None]:
import ast

import pandas as pd
from IPython.display import HTML, display

# Function to compare manual and predicted labels
def calculate_correct_predictions(row):
    try:
        correct_labels = set(ast.literal_eval(row["corect_label"]))
        predicted_labels = set(ast.literal_eval(row["predicted_labels"]))
    except (ValueError, SyntaxError):
        correct_labels = set(str(row["corect_label"]).split(","))
        predicted_labels = set(str(row["predicted_labels"]).split(","))

    # Strip spaces to make sure matches are accurate
    correct_labels = set(map(str.strip, correct_labels))
    predicted_labels = set(map(str.strip, predicted_labels))

    return len(correct_labels & predicted_labels) > 0

# Prepare the validation DataFrame
validation_from_main = companies_df.iloc[-100:].copy()

# Copy the manually labeled column
validation_from_main["corect_label"] = last_100_df["corect_label"].values

# IMPORTANT: If predictions were made in `last_100_df`, bring them over too
validation_from_main["predicted_labels"] = last_100_df["predicted_labels"].values

# Calculate prediction correctness
validation_from_main["correct_prediction"] = validation_from_main.apply(calculate_correct_predictions, axis=1)

# Compute metrics
correct_predictions = validation_from_main["correct_prediction"].sum()
accuracy_percentage = (correct_predictions / len(validation_from_main)) * 100

# Display summary
display(HTML(f"<b>Number of correct predictions: {correct_predictions} out of {len(validation_from_main)}</b>"))
display(HTML(f"<b>Accuracy: {accuracy_percentage:.2f}%</b>"))

# Optional: styled preview
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 10)

validation_styled = validation_from_main.head(5).style \
    .set_properties(**{'text-align': 'center'}) \
    .set_table_styles([{'selector': 'th', 'props': [('text-align', 'center')]}]) \
    .set_caption("<b>First 5 Companies: Manual Labels vs. Predicted Labels</b>")

display(validation_styled)


Unnamed: 0,description,business_tags,sector,category,niche,full_text,pseudo_labels,model_labels,corect_label,predicted_labels,correct_prediction
9394,"Mitch Lui is a Hong Kong-based lifestyle photographer who specializes in portrait, film, and landscape photography. His work is inspired by ambient, light, sounds, colors, space, and texture. He is known for his passion-driven and observant approach to capturing fine details through his lens in both digital and analog formats.","['Fine Details Photography', 'Lifestyle Photography Services', 'Ambient Light Photography', 'Analog Photography', 'Art Gallery', 'Still Photography']",Services,Photographers & Photographic Studios,Commercial Photography,"Mitch Lui is a Hong Kong-based lifestyle photographer who specializes in portrait, film, and landscape photography. His work is inspired by ambient, light, sounds, colors, space, and texture. He is known for his passion-driven and observant approach to capturing fine details through his lens in both digital and analog formats. | ['Fine Details Photography', 'Lifestyle Photography Services', 'Ambient Light Photography', 'Analog Photography', 'Art Gallery', 'Still Photography'] | Services | Photographers & Photographic Studios | Commercial Photography","['Arts Services', 'Painting Services']","['Arts Services', 'Painting Services']","['Media Production Services', 'Content Creation Services', 'Graphic Design Services']","['Painting Services', 'Accessory Manufacturing', 'Cosmetic Manufacturing']",False
9395,"AWM Perionica veša is a company based in Mirijevo, Croatia that specializes in laundry and dry cleaning services. They offer free delivery of laundry to customers' doors throughout Mirije and its surrounding areas. Their services include washing, ironing, and dry-cleaning of various items such as blankets, duvets, bedding, quilts, curtains, duvet covers, bathrobes, tablecloths, table skirts, and curtains. The company prides itself on using high-quality materials that do not harm the fabric.","['Laundry Service', 'Laundromat Services', 'Ironing Services']",Services,Dry Cleaners,Coin-Operated Laundries and Drycleaners,"AWM Perionica veša is a company based in Mirijevo, Croatia that specializes in laundry and dry cleaning services. They offer free delivery of laundry to customers' doors throughout Mirije and its surrounding areas. Their services include washing, ironing, and dry-cleaning of various items such as blankets, duvets, bedding, quilts, curtains, duvet covers, bathrobes, tablecloths, table skirts, and curtains. The company prides itself on using high-quality materials that do not harm the fabric. | ['Laundry Service', 'Laundromat Services', 'Ironing Services'] | Services | Dry Cleaners | Coin-Operated Laundries and Drycleaners","['Apparel Manufacturing', 'Building Cleaning Services', 'Textile Manufacturing Services']","['Apparel Manufacturing', 'Building Cleaning Services', 'Textile Manufacturing Services']",['Building Cleaning Services'],"['Apparel Manufacturing', 'Building Cleaning Services', 'Textile Manufacturing Services']",False
9396,"FreshLookHomes is a company that specializes in the sales and marketing of new and pre-owned homes. They offer a variety of properties for sale, including single-family homes, condos, townhomes, and apartments. The company provides a platform for potential buyers to view properties and receive notifications about upcoming sales and events. FreshLookHouses aims to help buyers find their dream homes quickly and affordably.",['Home Improvement Services'],Services,Home Builders & Renovation Contractors,New Housing For-Sale Builders,"FreshLookHomes is a company that specializes in the sales and marketing of new and pre-owned homes. They offer a variety of properties for sale, including single-family homes, condos, townhomes, and apartments. The company provides a platform for potential buyers to view properties and receive notifications about upcoming sales and events. FreshLookHouses aims to help buyers find their dream homes quickly and affordably. | ['Home Improvement Services'] | Services | Home Builders & Renovation Contractors | New Housing For-Sale Builders","['Mobile Home Construction Services', 'Real Estate Services', 'Single Family Residential Construction']","['Mobile Home Construction Services', 'Real Estate Services', 'Single Family Residential Construction']","['Real Estate Services', 'Single Family Residential Construction', 'Property Management Services', 'Multi-Family Construction Services']","['Residential Roofing Services', 'Property Management Services', 'Marketing Services']",True
9397,"Balıkesir Gıda Food is a market that offers a variety of food items for purchase. The company is known for its fast and friendly service, as evidenced by positive customer reviews.","['Shopping Services', 'General Store Products', 'Online Food Ordering Platform', 'Greengrocer Products', 'Buffet Products', 'Wi-fi Services', 'Alcohol Retail', 'Card Payment Services', 'Wheelchair Accessible Entrance']",Retail,Groceries,Convenience Retailers,"Balıkesir Gıda Food is a market that offers a variety of food items for purchase. The company is known for its fast and friendly service, as evidenced by positive customer reviews. | ['Shopping Services', 'General Store Products', 'Online Food Ordering Platform', 'Greengrocer Products', 'Buffet Products', 'Wi-fi Services', 'Alcohol Retail', 'Card Payment Services', 'Wheelchair Accessible Entrance'] | Retail | Groceries | Convenience Retailers","['E-Commerce Services', 'Food Processing Services', 'Marketing Services']","['E-Commerce Services', 'Food Processing Services', 'Marketing Services']","['Retail Groceries', 'Convenience Retailers', 'Shopping Services', 'Food Processing Services']","['Catering Services', 'Food Processing Services', 'Food Safety Services']",False
9398,"Hustlerholic is a company that offers a podcast series called ""The Hustlerholist Podcast"" that provides tips and advice on various topics such as real estate, the stock market, forex market, crypto currency, and running a small business. The podcast features stories of Hustlers who have achieved financial independence through hustling and offers insights from leaders in the business world. The company also provides information on Forex trading and the importance of Forex in achieving financial independence. Additionally, they offer a virtual seminar on the stock and real estate markets.","['Financial Education', 'Real Estate Tips', 'Small Business Education', 'Financial Services', 'Real-life Success in Real Estate', 'Virtual Seminar on Real Estate', 'Cryptocurrency Education', 'Podcast Production Services', 'Forex Education']",Services,Cryptocurrency,Securities and Commodity Exchanges,"Hustlerholic is a company that offers a podcast series called ""The Hustlerholist Podcast"" that provides tips and advice on various topics such as real estate, the stock market, forex market, crypto currency, and running a small business. The podcast features stories of Hustlers who have achieved financial independence through hustling and offers insights from leaders in the business world. The company also provides information on Forex trading and the importance of Forex in achieving financial independence. Additionally, they offer a virtual seminar on the stock and real estate markets. | ['Financial Education', 'Real Estate Tips', 'Small Business Education', 'Financial Services', 'Real-life Success in Real Estate', 'Virtual Seminar on Real Estate', 'Cryptocurrency Education', 'Podcast Production Services', 'Forex Education'] | Services | Cryptocurrency | Securities and Commodity Exchanges","['Financial Services', 'Market Research Services', 'Property Management Services', 'Real Estate Services']","['Financial Services', 'Market Research Services', 'Property Management Services', 'Real Estate Services']","['Financial Services', 'Consulting Services', 'Media Production Services', 'Training Services']","['Market Research Services', 'Real Estate Services']",False



*   Compares the labels in the predicted_labels column with the manually assigned labels for the last 100 companies.
*   A prediction is considered correct if at least one label matches between the model’s output and the manual label list.
*   Displays both the number of correct predictions and the accuracy percentage in bold format for clear reporting.
*   Uses a styled table to preview the first 5 validation samples and visually compare manual vs. model-assigned labels.
*   This method gives a quick and visual way to assess the performance of the model on human-labeled examples, even though it's based on a lenient matching rule.
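
Because the at-least-one-match rule is lenient, it can help to report stricter set-based scores next to it. The following is a minimal sketch, not part of the original notebook: the helper name `label_set_metrics` is hypothetical, and the example labels are taken from the mortgage-broker sample shown in Section 8.

```python
def label_set_metrics(true_labels, predicted_labels):
    """Compare the manual and predicted label collections of one company.

    Hypothetical helper: returns the lenient any-match flag used in this
    notebook together with stricter set-overlap scores.
    """
    true_set = {label.strip() for label in true_labels}
    pred_set = {label.strip() for label in predicted_labels}
    overlap = true_set & pred_set
    union = true_set | pred_set
    return {
        # Lenient rule: at least one shared label
        "any_match": len(overlap) > 0,
        # Strict rule: both label sets are identical
        "exact_match": true_set == pred_set,
        # Share of all involved labels that both sides agree on
        "jaccard": len(overlap) / len(union) if union else 1.0,
        # Share of predicted labels that are correct
        "precision": len(overlap) / len(pred_set) if pred_set else 0.0,
        # Share of manual labels that were recovered
        "recall": len(overlap) / len(true_set) if true_set else 0.0,
    }


example = label_set_metrics(
    ["Financial Services", "Consulting Services"],
    ["Financial Services", "Property Management Services", "Real Estate Services"],
)
print(example)
```

Averaging `jaccard`, `precision`, and `recall` over the 100 validation rows would give a less optimistic picture than the any-match accuracy alone.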



## 🟣 **SECTION 8 – Visual Check & Sample Evaluation**

To support qualitative evaluation, we perform a random sample inspection of the classified companies. This manual review helps confirm that the model predictions make sense and provides additional confidence in the results.

**Sample Evaluation**

In [None]:
# Random visual check
sample_df = predictions_df.sample(10, random_state=42)

for i, row in sample_df.iterrows():
    print(f"\n🔹 Company #{i}")
    print("📄 Full text:")
    print(row["description"][:500], "...")
    print("\n🏷️ Manually Assigned Labels:", row["corect_label"])
    print("\n🏷️ Predicted Labels:", row["predicted_labels"])
    print("=" * 100)



🔹 Company #83
📄 Full text:
Kitchener Mortgage Agent Alex Harun is a mortgage broker based in Kitchener, Canada. The company offers expert mortgage advice and services for purchases, refinancing, renewals, debt consolidation, and personal loans. They provide assistance with mortgage pre-approval, first-time buyers, self-employed individuals, new to Canada, investment properties, debt Consolidation, mortgage renewals and refinancing. The mortgage process is fast and easy, and Alex Haru is available to help clients select th ...

🏷️ Manually Assigned Labels: ['Financial Services', 'Consulting Services']

🏷️ Predicted Labels: ['Financial Services', 'Property Management Services', 'Real Estate Services']

🔹 Company #53
📄 Full text:
The District Court for Warsaw-Śródmieście is a regional court located in Warsaw, Poland. It is responsible for the administration of justice in the district of Warsaw- Śrżycie. The court has a list of judges and assessors, as well as a public information bulleti



*   Randomly selects 10 companies (with a fixed random seed) for a visual inspection of the predicted labels.
*   Prints each company's description (truncated to 500 characters) together with its manually assigned and predicted labels, so the classification results can be evaluated qualitatively.


## 🟣 **SECTION 9 – Model Saving & Export**

The trained model and label binarizer are saved using joblib, making them reusable for future inference. We also export the final classified dataset into a CSV file for external use or further analysis.

**Export of Final Classified Dataset**

In [None]:
# Export the final classified dataset
companies_df["insurance_label"] = companies_df["model_labels"]
companies_df.to_csv("classified_companies_with_labels.csv", index=False)
print("✅ File exported: classified_companies_with_labels.csv")


✅ File exported: classified_companies_with_labels.csv




*   Adds a final column named insurance_label, which contains the model-predicted labels (model_labels) for each company.
*   Prepares the complete dataset for export with all original data and model-generated classifications included.
*   This step ensures the results of the classification process are preserved and accessible outside of the notebook.
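
One practical detail when reusing the exported file: list-valued columns such as insurance_label are written to CSV as their string representation, so they come back as text like `"['A', 'B']"` rather than as Python lists. The sketch below shows one way to parse such cells back into lists. It uses only the standard library so it runs on its own; the helper name `parse_label_cell` and the sample row are illustrative assumptions, not part of the notebook.

```python
import ast
import csv
import io


def parse_label_cell(cell):
    """Turn a CSV cell such as "['A', 'B']" back into a list of label strings.

    Hypothetical helper: falls back to comma-splitting when the cell is not
    a valid Python literal, mirroring the fallback used during validation.
    """
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        value = cell.split(",")
    if not isinstance(value, (list, tuple, set)):
        value = [value]
    return [str(label).strip() for label in value if str(label).strip()]


# Simulate the export: the label list is stored as its string representation.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["description", "insurance_label"])
writer.writerow([
    "Laundry and dry cleaning services",
    str(["Building Cleaning Services", "Apparel Manufacturing"]),
])

# Read the file back and recover the list of labels.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
labels = parse_label_cell(rows[0]["insurance_label"])
print(labels)
```

With pandas, the same helper could be passed through the `converters` argument of `pd.read_csv`, for example `converters={"insurance_label": parse_label_cell}`.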



**Export the Trained Model and Label Binarizer**

In [None]:
# Save the model and the label binarizer
joblib.dump(pipeline, "multilabel_model_v2.pkl")
joblib.dump(mlb, "label_binarizer_v2.pkl")
print("✅ Model and label binarizer saved!")

✅ Model and label binarizer saved!


*   Saves the trained classification pipeline (TF-IDF vectorizer + logistic regression) using joblib, making it reusable for future predictions without retraining.
*   This step is essential for model deployment, reproducibility, and scaling to other datasets.
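
To illustrate how the two saved files could be reused for inference, here is a self-contained sketch. It trains a tiny stand-in pipeline on made-up texts so that it runs on its own. The `OneVsRestClassifier` wrapper, the toy data, and the temporary directory are assumptions for illustration; the notebook's actual pipeline may be configured differently.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Tiny made-up training set, only so the example runs end to end.
texts = [
    "laundry and dry cleaning services",
    "dry cleaning and ironing of bedding",
    "mortgage broker offering refinancing advice",
    "personal loans and mortgage renewals",
]
labels = [
    ["Building Cleaning Services"],
    ["Building Cleaning Services"],
    ["Financial Services"],
    ["Financial Services"],
]

# Binarize the label lists and fit a stand-in TF-IDF + logistic regression pipeline.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
pipeline.fit(texts, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = str(Path(tmp_dir) / "multilabel_model_v2.pkl")
    binarizer_path = str(Path(tmp_dir) / "label_binarizer_v2.pkl")

    # Save both artifacts, as in the cell above.
    joblib.dump(pipeline, model_path)
    joblib.dump(mlb, binarizer_path)

    # Later (for example in another session): load them back.
    loaded_pipeline = joblib.load(model_path)
    loaded_mlb = joblib.load(binarizer_path)

# Predict a binary indicator matrix, then map it back to label names.
new_texts = ["mortgage advice and personal loans"]
predicted_matrix = loaded_pipeline.predict(new_texts)
predicted_labels = loaded_mlb.inverse_transform(predicted_matrix)
print(predicted_labels)
```

Both files are needed at inference time: the pipeline produces an indicator matrix, and the binarizer translates its columns back into taxonomy label names.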

## 🟣 **SECTION 10 – Analysis and Conclusions**

### **- Analysis**

Strengths:
*   Effective classification achieved despite limited labeled data, by leveraging semantic embeddings and pseudo-labeling techniques.
*   Adaptable and flexible method, reducing dependency on potentially inaccurate manual labels.
*   Iterative re-labeling enhances consistency and potentially increases accuracy.

Weaknesses:
*   Strong dependence on the quality of semantic embeddings and the initial pseudo-label assignments, potentially propagating labeling inaccuracies.
*   The validation rule was permissive: a prediction counted as correct if at least one predicted label matched any manually assigned label, which may inflate the reported accuracy.

Scalability:
*   Embedding computation and semantic similarity calculation are independent for each company, so they can be parallelized or processed in batches.
*   With significantly larger datasets, compute and memory demands would increase notably; parallel computing infrastructure and further optimization would be needed to keep processing efficient.




### **- Conclusion**

In this project, I tackled the challenge of classifying companies into an insurance taxonomy without having a predefined training dataset. This constraint prevented me from using a conventional supervised learning approach.

My approach involved combining available company information into a single text representation and leveraging semantic embeddings to measure similarity between company texts and taxonomy labels. Based on these semantic similarities, I initially generated "pseudo-labels" to approximate true labels for training purposes.
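
The similarity-based pseudo-labeling step can be sketched as follows. This is a simplified illustration under stated assumptions, not the notebook's exact code: the function name `assign_pseudo_labels`, the `top_k` and `min_similarity` parameters, and the toy 3-dimensional vectors are all hypothetical stand-ins for real sentence embeddings.

```python
import numpy as np


def assign_pseudo_labels(company_vecs, label_vecs, label_names, top_k=3, min_similarity=0.0):
    """Pick up to top_k taxonomy labels per company by cosine similarity.

    Assumes every embedding vector has a non-zero norm.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    company_unit = company_vecs / np.linalg.norm(company_vecs, axis=1, keepdims=True)
    label_unit = label_vecs / np.linalg.norm(label_vecs, axis=1, keepdims=True)
    similarity = company_unit @ label_unit.T  # shape: (n_companies, n_labels)

    # Indices of the top_k most similar labels per company, best first.
    top_idx = np.argsort(-similarity, axis=1)[:, :top_k]
    return [
        [label_names[j] for j in row_idx if similarity[i, j] >= min_similarity]
        for i, row_idx in enumerate(top_idx)
    ]


# Toy vectors standing in for real embeddings of labels and company texts.
label_names = ["Financial Services", "Real Estate Services", "Building Cleaning Services"]
label_vecs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
company_vecs = np.array([[0.9, 0.4, 0.0], [0.0, 0.1, 0.95]])

pseudo_labels = assign_pseudo_labels(
    company_vecs, label_vecs, label_names, top_k=2, min_similarity=0.2
)
print(pseudo_labels)
```

The similarity threshold keeps weakly related labels out of the pseudo-label set, which matters because any errors at this stage are passed on to the classifier trained afterwards.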

After assigning pseudo-labels, I trained a multi-label classification model on them, treating them as a noisy but usable approximation of the true labels. Following the training phase, I re-labeled the entire dataset by applying the trained model, which makes the labels more consistent across similar companies.

In the final step, I validated my approach using a manually labeled subset of data, labeled with the assistance of other AI applications. I deliberately avoided training directly on this manually labeled dataset because of potential inaccuracies, mismatched taxonomy labels, and its limited size (only around 100 labeled examples), which would have compromised the reliability and generalization of the model.

This methodology enabled effective company classification despite limited labeled data, by combining semantic embeddings with iterative pseudo-labeling.
Notably, the model reached 65% accuracy on the manually labeled subset under the lenient at-least-one-match rule, compared with 24% for the initial pseudo-labeling stage.
This suggests that the classifier is not merely copying embedding similarities, but is learning meaningful distinctions beyond its initial weak supervision.