<a href="https://colab.research.google.com/github/LucaLazar07/VeridionChallenge/blob/main/challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Imports**

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer

**Reading the Data**

In [3]:
insurance_df = pd.read_csv('ml_insurance_challenge.csv')
label_df = pd.read_csv('insurance_taxonomy.csv')

**Info about the Data**

In [None]:
insurance_df

In [None]:
label_df

In [None]:
insurance_df.describe()

In [None]:
insurance_df.isnull().sum()

**Text Preprocessing prior to Company Embeddings**

In [19]:
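def create_text_about_company(row):
    # Default to an empty string so rows without business_tags do not raise UnboundLocalError
    business_tags_processed = ""
    if (pd.notna(row['business_tags'])):
        # Strip the list syntax from the stringified tag list, e.g. "['a', 'b']" -> "a, b"
        business_tags_processed = row['business_tags'].replace("['", "").replace("']", "").replace("'", "")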

    description = row['description'] if pd.notna(row['description']) else ""

    business_tags = business_tags_processed if pd.notna(business_tags_processed) else ""

    sector = row['sector'] if pd.notna(row['sector']) else ""

    category = row['category'] if pd.notna(row['category']) else ""

    niche = row['niche'] if pd.notna(row['niche']) else ""

    text = f"{description} Services: {business_tags}. Industry: {sector} - {category} - {niche}"

    return text
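
The chained `replace` calls above suggest that `business_tags` holds the string form of a Python list. Assuming that is the case, the sketch below shows one alternative way to flatten it: parsing the string with `ast.literal_eval`, which keeps apostrophes that belong to a tag. The helper name `tags_to_text` is hypothetical and not part of the notebook.

```python
import ast


def tags_to_text(raw_tags):
    """Turn a stringified list such as "['A', 'B']" into "A, B".

    Hypothetical helper; assumes the column stores the repr of a Python list.
    """
    # NaN is a float, so anything that is not a non-empty string becomes ""
    if not isinstance(raw_tags, str) or not raw_tags.strip():
        return ""
    try:
        tags = ast.literal_eval(raw_tags)
    except (ValueError, SyntaxError):
        # Not a valid list literal: fall back to the raw string
        return raw_tags
    if isinstance(tags, (list, tuple)):
        return ", ".join(str(tag) for tag in tags)
    return str(tags)


example = "['Roof Repair', \"Men's Clothing\"]"
flattened = tags_to_text(example)
print(flattened)
```

With the `replace` approach, the apostrophe in a tag like `Men's Clothing` would be stripped as well; parsing the literal avoids that.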


In [None]:
# Check whether any company text exceeds 512 tokens, to decide whether a model with a longer input limit is needed

rows = insurance_df.shape[0]
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')

for i in range(rows):
    string_row = create_text_about_company(insurance_df.iloc[i])
    tokens = tokenizer.encode(str(string_row))
    if (len(tokens)) > 512:
        print("A model with a bigger number of tokens is needed")
        break

**Model Creation**

In [None]:
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

company_text = []

for i in range(rows):
    string_row = create_text_about_company(insurance_df.iloc[i])
    company_text.append(string_row)

company_embeddings = model.encode(company_text)
label_embeddings = model.encode(label_df['label'].tolist())

similarity = util.cos_sim(company_embeddings, label_embeddings)
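
To make the similarity step concrete, here is a minimal pure-Python sketch of a cosine similarity matrix between two sets of vectors. The toy vectors and the function names are illustrative assumptions; in the notebook the same role is played by `util.cos_sim` on the real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def similarity_matrix(rows, cols):
    """Entry [i][j] is the cosine similarity between rows[i] and cols[j]."""
    return [[cosine_similarity(r, c) for c in cols] for r in rows]


# Toy stand-ins for company and label embeddings
toy_company_vectors = [[1.0, 0.0], [1.0, 1.0]]
toy_label_vectors = [[2.0, 0.0], [0.0, 3.0]]

toy_similarity = similarity_matrix(toy_company_vectors, toy_label_vectors)
print(toy_similarity)
```

The result has one row per company and one column per label, which matches the layout the later cells index with `similarity[i]`.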

**Verifying the most similar labels for each company**

**Bi-Encoder**

In [None]:
k = 5
for i in range(rows):
    company_similarity = similarity[i]

    top_k_labels = company_similarity.topk(k=k)

    label_index = top_k_labels.indices[0].item()
    label_name = label_df.iloc[label_index]['label']

    insurance_df.loc[i, "label_insurance"] = label_name

insurance_df.to_csv('ml_insurance_challenge_with_bi_encoder.csv')
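
The loop above requests the top 5 labels but stores only the first one. As a sketch of how the remaining candidates could be used, the example below keeps each label together with its score and computes the margin between the best and second-best match, which can flag ambiguous assignments. The function name, toy labels and toy scores are assumptions made for illustration.

```python
import heapq


def top_k_labels(scores, labels, k=5):
    """Return the k (label, score) pairs with the highest scores, best first."""
    ranked = heapq.nlargest(k, zip(scores, labels), key=lambda pair: pair[0])
    return [(label, score) for score, label in ranked]


# Toy similarity scores of one company against four labels
toy_labels = ["Roofing", "Plumbing", "Bakery", "Landscaping"]
toy_scores = [0.62, 0.58, 0.11, 0.31]

candidates = top_k_labels(toy_scores, toy_labels, k=2)
best_label, best_score = candidates[0]
# A small margin means the top two labels are nearly tied
margin = best_score - candidates[1][1]
print(best_label, round(margin, 2))
```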


**Cross-Encoder**

In [None]:
from sentence_transformers.cross_encoder import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

for i in range(rows):
    query = company_text[i]
    corpus = label_df['label'].tolist()

    ranks = model.rank(query, corpus)
    label_name = corpus[ranks[0]['corpus_id']]

    insurance_df.loc[i, "label_insurance"] = label_name

insurance_df.to_csv("ml_insurance_challenge_with_cross_encoder.csv")


**Bi-Encoder + Cross-Encoder**

In [None]:
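k = 10  # number of candidates retrieved by the bi-encoder
best_k = 5  # number of those candidates passed to the cross-encoder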
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")


for i in range(rows):
    company_similarity = similarity[i]
    query = company_text[i]

    top_k_labels = company_similarity.topk(k=k)
    best_labels = []

    for index in range(best_k):
        label_index = top_k_labels.indices[index].item()
        best_labels.append(label_df.iloc[label_index]['label'])

    corpus = best_labels

    ranks = model.rank(query, corpus)
    label_name = corpus[ranks[0]['corpus_id']]

    insurance_df.loc[i, "label_insurance"] = label_name

insurance_df.to_csv("ml_insurance_challenge_with_bi_and_cross_encoder.csv")
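
The cell above follows a retrieve-then-rerank pattern: the bi-encoder similarity shortlists a few labels, and the cross-encoder rescores only that shortlist. The sketch below isolates the pattern in plain Python. The word-overlap scorer is a deliberately crude stand-in for a cross-encoder, and all names and toy values are hypothetical.

```python
def retrieve_then_rerank(query, labels, retrieval_scores, rerank_score, shortlist_size=5):
    """Shortlist labels by retrieval score, then pick the best by rerank score."""
    # Stage 1: keep the labels with the highest retrieval scores
    order = sorted(range(len(labels)), key=lambda i: retrieval_scores[i], reverse=True)
    shortlist = [labels[i] for i in order[:shortlist_size]]
    # Stage 2: rescore only the shortlist with the more expensive scorer
    return max(shortlist, key=lambda label: rerank_score(query, label))


def toy_rerank_score(query, label):
    """Stand-in scorer: number of words shared by the query and the label."""
    return len(set(query.lower().split()) & set(label.lower().split()))


labels = ["Roofing Services", "Bakery", "Residential Plumbing Services", "Landscaping"]
retrieval_scores = [0.60, 0.10, 0.55, 0.30]
query = "residential plumbing and pipe repair"

best = retrieve_then_rerank(query, labels, retrieval_scores, toy_rerank_score, shortlist_size=2)
print(best)
```

In this toy case the reranker overturns the retrieval order: "Roofing Services" has the highest retrieval score, but the second stage prefers the label that shares words with the query. The reranker can only choose among shortlisted labels, so the shortlist size bounds how much it can correct.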


**Results overview**

After comparing the results of the bi-encoder, the cross-encoder, and the combined bi-encoder + cross-encoder pipeline, I found that the bi-encoder and the combined pipeline produced the best labels. I was still not satisfied with the results, so I decided to attack the challenge with a new approach: changing the weight of every feature except the description, and replacing the pretrained embedding and cross-encoder models with a DeBERTa-v3 based model, which I expected to perform much better on this kind of task than the models used before.

**Weight change for features**

In [9]:
def create_text_about_company(row):
    business_tags_processed = ""
    if (pd.notna(row['business_tags'])):
        business_tags_processed = row['business_tags'].replace("['", "").replace("']", "").replace("'", "")

    description = row['description'] if pd.notna(row['description']) else ""

    business_tags = business_tags_processed if pd.notna(business_tags_processed) else ""

    sector = row['sector'] if pd.notna(row['sector']) else ""

    category = row['category'] if pd.notna(row['category']) else ""

    niche = row['niche'] if pd.notna(row['niche']) else ""

    text = f"{description} Services: {business_tags}. Industry: {sector} - {category} - {niche}"

    return text

**Using DeBERTa-V3 model for embedding**

In [7]:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModel.from_pretrained("microsoft/deberta-v3-base")

def deberta_embeddings(text, tokenizer, model):
  inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
  with torch.no_grad():
    output = model(**inputs)

  hidden_states = output.last_hidden_state

  mask = inputs['attention_mask'].unsqueeze(-1).float()

  sum_embeddings = torch.sum(hidden_states * mask, dim=1)

  sum_mask = torch.clamp(torch.sum(mask, dim=1), min=1e-9)

  mean_embeddings = sum_embeddings / sum_mask

  return mean_embeddings
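
The function above performs masked mean pooling: token vectors are summed only where the attention mask is 1, then divided by the number of real tokens. The pure-Python sketch below works through the same arithmetic on a toy sequence of three tokens, the last of which is padding. The function name and the numbers are illustrative assumptions.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            for j in range(dim):
                totals[j] += vector[j]
    # Guard against division by zero when every position is masked out
    count = max(sum(attention_mask), 1e-9)
    return [total / count for total in totals]


toy_tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last vector is padding
toy_mask = [1, 1, 0]

pooled = masked_mean_pool(toy_tokens, toy_mask)
print(pooled)
```

The padding vector does not affect the result. Since `deberta_embeddings` is later called on one text at a time, no padding is actually added there, so the mask only matters if the function is applied to batches.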

**Applying DeBERTa Embedding to the companies and labels**

In [10]:
rows = insurance_df.shape[0]
company_text = []

for i in range(rows):
    string_row = create_text_about_company(insurance_df.iloc[i])
    company_text.append(string_row)

company_embeddings_list = [deberta_embeddings(text, tokenizer, model) for text in company_text]
company_embeddings = torch.stack(company_embeddings_list).squeeze()

label_embeddings_list = [deberta_embeddings(label, tokenizer, model) for label in label_df['label'].tolist()]
label_embeddings = torch.stack(label_embeddings_list).squeeze()

similarity = util.cos_sim(company_embeddings, label_embeddings)

print(company_embeddings.shape)
print(label_embeddings.shape)
print(similarity.shape)

torch.Size([9494, 768])
torch.Size([220, 768])
torch.Size([9494, 220])


**Assigning the best label to the company**

In [13]:
for i in range(rows):
  company_similarity = similarity[i]

  top_labels = company_similarity.topk(k=5)

  label_index = top_labels.indices[0].item()
  label_name = label_df.iloc[label_index]['label']

  insurance_df.loc[i, "label_insurance"] = label_name

insurance_df.to_csv("ml_insurance_challenge_with_deberta.csv")

In [None]:
deberta_csv = pd.read_csv("ml_insurance_challenge_with_deberta.csv")
deberta_csv['label_insurance'].value_counts()

**Despite its strong reputation as a pretrained model, DeBERTa-v3 failed to assign the labels correctly: it gave the label "Non-Alcoholic Beverage Manufacturing" to 8815 of the 9494 companies. This may be because the embeddings are computed manually, by mean-pooling the hidden states of a base model that was not fine-tuned for sentence similarity. I therefore decided to keep the results obtained earlier, which contain fewer mistakes.**
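
The skew toward a single label can be expressed as one number: the share of companies that received the most common label. The sketch below computes it with the standard library. The helper name is hypothetical, and the list is built from the two counts reported above (8815 out of 9494), with every other label lumped into a placeholder.

```python
from collections import Counter


def top_label_share(assigned_labels):
    """Return the most common label and the fraction of rows it covers."""
    if not assigned_labels:
        return None, 0.0
    counts = Counter(assigned_labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(assigned_labels)


# 8815 rows with the dominant label; the remaining 679 rows are a placeholder
# standing in for all other labels (their real distribution is not reproduced here)
assigned = ["Non-Alcoholic Beverage Manufacturing"] * 8815 + ["(any other label)"] * 679

dominant_label, share = top_label_share(assigned)
print(dominant_label, f"{share:.1%}")
```

A share this close to 1 for a taxonomy of 220 labels is a strong hint that the embeddings barely distinguish between companies, independent of which label happens to win.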

In [22]:
bi_csv = pd.read_csv("ml_insurance_challenge_with_bi_encoder.csv")
bi_plus_cross_csv = pd.read_csv("ml_insurance_challenge_with_bi_and_cross_encoder.csv")

In [None]:
bi_csv['label_insurance'].value_counts()

In [None]:
bi_plus_cross_csv['label_insurance'].value_counts()

In [None]:
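from sklearn.metrics import classification_report

# No ground-truth labels are available, so this report treats the bi-encoder labels as the
# reference and measures how closely the bi + cross-encoder labels agree with them
print(classification_report(bi_csv['label_insurance'], bi_plus_cross_csv['label_insurance']))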
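
Neither CSV contains ground-truth labels, so comparing the two columns measures agreement between the pipelines rather than accuracy. A minimal sketch of that agreement rate, using toy label lists and a hypothetical helper name, could look like this:

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of positions at which two label lists assign the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must have the same length")
    if not labels_a:
        return 0.0
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Toy stand-ins for the two 'label_insurance' columns
toy_bi = ["Roofing", "Bakery", "Plumbing", "Bakery"]
toy_bi_cross = ["Roofing", "Bakery", "Bakery", "Bakery"]

rate = agreement_rate(toy_bi, toy_bi_cross)
print(rate)
```

A high agreement rate says the two pipelines behave alike; it does not say either one is correct. Judging correctness would need a manually labeled sample.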

**Comparing the bi-encoder results with the bi-encoder + cross-encoder results, the combined pipeline assigns the label "Non-Alcoholic Beverage Manufacturing" more often than the bi-encoder alone, which suggests it is less precise. The best result is therefore obtained with the bi-encoder.**