## Embeddings and Data Preprocessing for Token-Label Matching

In this step, we prepare all necessary embeddings, contextual matrices, token metadata, and TF-IDF weighting structures for semantic matching and final classification.



### Loading Vector Embeddings

We load the following precomputed embedding matrices using `numpy`:

- **Token embeddings (`vec_embeddings.txt`)**  
  Dense semantic representations of company tokens.

- **Label embeddings (`label_embeddings.txt`)**  
  Dense semantic representations of insurance taxonomy labels.

- **Category and Niche embeddings (`cat_niche_embeddings.txt`)**  
  Embeddings representing combined niche and category structures.

- **Context matrices**:
  - **`context_matrix.txt`**: Captures the general semantic context for tokens.
  - **`first_sentence_matrix_total1.txt`**: Captures semantic features specifically from the first sentence of company descriptions.



### Loading Structured Tabular Data

We load multiple structured datasets using `pandas`:

- **`tokens_vector.csv`**  
  Tokens associated with each company.

- **`sectors.csv`**  
  Sector information for each company.

- **`counted_elements.csv`**  
  Frequency counts of each token across company descriptions.

- **`new_label_with_categories.csv`**  
  Structured decomposition of labels into domain, modifier, and core parts.

- **`ml_insurance_challenge.csv`**  
  Main company dataset, cleaned by dropping missing values.

- **`insurance_taxonomy.csv`**  
  Official insurance taxonomy used for classification.

- **`counted_categories.csv`** (loaded as `our_counted_words_category`)  
  Frequency counts of each **category and niche term** across all company descriptions.

- **`our_current_categories.csv`**  
  Category and niche combinations, tokenized and preprocessed.


### Text Normalization and Cleaning

- **Lowercase Conversion**  
  All tokens are converted to lowercase for consistency.

- **Selective Punctuation Removal**  
  Unwanted punctuation is removed, while useful characters such as underscores (`_`), periods (`.`), dashes (`-`), and semicolons (`;`) are preserved to maintain compound word structures.

- **Tokenization**  
  Fields such as `new_col` (company tokens) and `count_words` (token counts) are split into clean, tokenized word lists.  
  Category and niche fields are also tokenized.
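
The cleaning rules above can be sketched on plain strings. This is a minimal illustration only: `normalize_tokens` is a hypothetical helper, and the notebook applies the same idea through pandas string methods rather than this function.

```python
import re
import string

# Characters kept on purpose so compound tokens survive cleaning.
KEEP = "_.-;"
# Every other ASCII punctuation mark is removed.
REMOVE = "".join(ch for ch in string.punctuation if ch not in KEEP)
REMOVE_PATTERN = re.compile("[" + re.escape(REMOVE) + "]")


def normalize_tokens(text):
    """Lowercase, strip unwanted punctuation, and split on whitespace."""
    cleaned = REMOVE_PATTERN.sub("", text.lower())
    return cleaned.split()


print(normalize_tokens("Pet_Care, Veterinary-Clinic (24/7)!"))
# ['pet_care', 'veterinary-clinic', '247']
```

Escaping the character set with `re.escape` avoids surprises from characters such as `]`, `^`, or `\` inside the regex character class.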


### Loading Additional Structures

We also load manually curated and precomputed resources:
- **Most Important Words (`most_important_words.txt`)**  
  Tokens extracted primarily from the **first sentences** of company descriptions (after filtering), representing highly significant features.

- **Average TF-IDF Scores (`average_tf_idf_per_word.txt`)**  
  Precomputed average TF-IDF values for important tokens across all companies.

- **Per-Token TF-IDF Matches (`match_words_with_tf_idf.txt`)**  
  Contains TF-IDF scores for each word-company pair.  
  Higher TF-IDF indicates stronger relevance for that specific company.

### Category and Niche Similarity Matrices

Two important matrices reflect semantic similarity of companies to their metadata categories:

- `similarity_desc_to_niche.txt` → (`similarity_to_niche`)  
  Semantic match between company descriptions and niche embeddings.

- `similarity_desc_to_category.txt` → (`similarity_to_cat`)  
  Semantic match between company descriptions and category embeddings.

These are used to **validate** or **boost** label candidates if their components align with niche/category fields.

### Goal

This preprocessing step ensures that:

- All token-level semantic information (embeddings, context, first-sentence emphasis) is prepared.
- Frequency and TF-IDF importance are properly computed for every token.
- Tokens, sectors, and categories are normalized and consistent.
- All necessary structures for fast and accurate semantic matching against insurance taxonomy labels are ready.

In [1]:
import numpy as np 
import pandas as pd
import string

vec_embeddings = np.loadtxt("vec_embeddings.txt", dtype=float)
label_embeddings = np.loadtxt("label_embeddings.txt", dtype=float)
cat_niche_embeddings =np.loadtxt("cat_niche_embeddings.txt", dtype=float)
context_matrix = np.loadtxt("context_matrix.txt", dtype=float)
first_sentence_matrix = np.loadtxt("first_sentence_matrix_total1.txt", dtype=float)
similarity_to_cat = np.loadtxt("similarity_desc_to_category.txt", dtype=float)
similarity_to_niche = np.loadtxt("similarity_desc_to_niche.txt", dtype=float)

our_words = pd.read_csv("tokens_vector.csv")
our_sector = pd.read_csv("sectors.csv")
our_sector['sector']=our_sector['sector'].fillna("")
our_counted_words = pd.read_csv('counted_elements.csv')
our_counted_words_category = pd.read_csv('counted_categories.csv')

divided_words = pd.read_csv("new_label_with_categories.csv")
# Punctuation to strip; "_", ".", "-" and ";" are kept to preserve compound tokens.
all_punctuation_except_hyphen = string.punctuation.replace("_","").replace(".","").replace("-","").replace(";","")

our_words['new_col2'] = our_words['new_col'].str.lower().replace('[{}]'.format(all_punctuation_except_hyphen), '', regex=True).str.split()
our_counted_words['count_words'] = our_counted_words['new_col'].str.lower().replace('[{}]'.format(all_punctuation_except_hyphen), '', regex=True).str.split()
our_counted_words_category['count_cat'] = our_counted_words_category['count_niche_cat'].str.lower().replace('[{}]'.format(all_punctuation_except_hyphen), '', regex=True).str.split()

df = pd.read_csv("../inputs/ml_insurance_challenge.csv")
our_classes = pd.read_csv("../inputs/insurance_taxonomy - insurance_taxonomy.csv")

In [2]:
context_matrix_niche = np.loadtxt("niche_matrix.txt", dtype=float)
context_matrix_category = np.loadtxt("category_matrix.txt", dtype=float)


In this step, we preprocess the tokens by combining frequency-based and TF-IDF-based information.  
We identify **strong terms** (high z-score and relevance) and **weak terms** (low z-score), based on a combination of frequency counts and TF-IDF similarities.  
Strong terms are later **boosted** to have more impact during label assignment, while weak terms are **filtered out** or deboosted to minimize noise.  
This ensures that only the most relevant tokens contribute significantly to the final classification decisions.
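
The z-score split described above can be sketched as follows. This is a simplified illustration under stated assumptions: `split_strong_weak` and the example scores are hypothetical, and the additional average TF-IDF filter that the notebook applies to strong terms is omitted.

```python
from statistics import mean, pstdev


def split_strong_weak(scores, threshold=0.3):
    """Return (strong, weak) word lists from a {word: combined score} dict."""
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)  # population std, matching np.std's default
    strong, weak = [], []
    for word, score in scores.items():
        # With zero spread every score equals the mean, so z is 0.
        z = (score - mu) / sigma if sigma != 0 else 0.0
        if z > threshold:
            strong.append(word)
        elif z < -threshold:
            weak.append(word)
    return strong, weak


# Hypothetical combined scores for one company.
example = {"veterinary": 0.40, "clinic": 0.30, "services": 0.20, "and": 0.10}
strong, weak = split_strong_weak(example)
print(strong, weak)
```

Words whose z-score falls between the two thresholds are neither boosted nor filtered.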

In [3]:
our_sector = pd.read_csv("sectors.csv")


In [4]:
our_sector['sector']=our_sector['sector'].fillna("")


In [5]:
sum_words_occurence = []
for idx, _ in our_counted_words.iterrows():
    # 'count_words' alternates token, count, token, count, ...
    element = our_counted_words.at[idx, 'count_words'].copy()
    element = {element[i]: int(element[i + 1]) for i in range(0, len(element), 2)}
    # Keep only the tokens that belong to this company.
    company_tokens = set(our_words['new_col2'][idx])
    element = {k: v for k, v in element.items() if k in company_tokens}

    sum_words = sum(element.values())
    our_counted_words.at[idx, 'count_words'] = element.copy()
    sum_words_occurence.append(sum_words)

In [6]:
for idx, _ in our_counted_words.iterrows():
    element_category = (our_counted_words_category.at[idx, 'count_cat'].copy())
    element_category[0] = element_category[0].replace("counter", "")
    if len(element_category) == 1 and element_category[0] == "":
        our_counted_words_category.at[idx, 'count_cat'] = element_category.copy()
        continue
    element_category = {element_category[i]: int(element_category[i + 1]) for i in range(0, len(element_category), 2)}
    our_counted_words_category.at[idx, 'count_cat'] = element_category.copy()


In [7]:
our_categories_niche_words = pd.read_csv("our_current_categories.csv")
our_categories_niche_words['niche_plus_cat'] = our_categories_niche_words['niche_plus_cat'].str.lower().replace('[{}]'.format(all_punctuation_except_hyphen), '', regex=True).str.split()

In [8]:
most_important_words = []
with  open('most_important_words.txt') as f:
    for line in f:
        values = line.strip().split(" ")
        most_important_words.append(values)   

In [9]:
average_tf_idf = dict()
with open('average_tf_idf_per_word.txt') as f:
    for line in f:
        # Each line holds "<word> <average tf-idf>".
        values = line.strip().split(" ")
        average_tf_idf[values[0]] = float(values[1])

In [10]:
match_words_with_tf_idf_valuess = {}
elements_all = []
for idx, _ in enumerate(our_words['new_col2']):
    elements  = []
    elements.append((1, 'z'))
    elements_all.append(elements)

with open('match_words_with_tf_idf.txt') as f:
    min_v = 23
    for line in f:
        # Each line holds "<word> <company index> <tf-idf>".
        values = line.strip().split(" ")
        word, company_idx, tf_idf = values[0], int(values[1]), float(values[2])
        match_words_with_tf_idf_valuess[(word, company_idx)] = tf_idf
        # Track the smallest TF-IDF value seen for company 1.
        if company_idx == 1 and tf_idf < min_v:
            min_v = tf_idf

        # Keep each company's (tf-idf, word) pairs unique and sorted, highest first.
        set_elements_all = set(elements_all[company_idx])
        set_elements_all.add((tf_idf, word))
        list_elements_all = sorted(set_elements_all, reverse=True)
        elements_all[company_idx] = list_elements_all

        # Drop the (1, 'z') placeholder when it sits at the head of the list.
        if list_elements_all[0] == (1, 'z'):
            elements_all[company_idx] = list_elements_all[1:]


In [11]:
element_value_all = []
our_score_dict = {}

for idx, _ in our_counted_words.iterrows():
    element_value = []
    sum_whole_tf_idf = sum([match_words_with_tf_idf_valuess[(wd, idx)] for wd in our_words['new_col2'][idx]])
    for z in our_counted_words['count_words'][idx]:
        element_value.append(our_counted_words['count_words'][idx][z]/sum_words_occurence[idx] * 0.5 + 0.5* match_words_with_tf_idf_valuess[(z, idx)]/sum_whole_tf_idf)
        our_score_dict[(z, idx)] = our_counted_words['count_words'][idx][z]/sum_words_occurence[idx] * 0.5 + 0.5* match_words_with_tf_idf_valuess[(z, idx)]/sum_whole_tf_idf
    element_value = list(zip(element_value, list(our_counted_words['count_words'][idx])))
    element_value.sort(reverse=True)
    element_value_all.append(element_value)

In [12]:

strong_values_all = []
weak_values_all = []

for idx, _ in our_words.iterrows():

    scores = [iz[0] for iz in element_value_all[idx]]
    mean = np.mean(scores)
    std = np.std(scores)
    strong_values = []
    weak_values = []

    for score, word in element_value_all[idx]:
        # With zero spread every score equals the mean, so the z-score is 0.
        z = (score - mean) / std if std != 0 else 0.0

        if z > 0.3 and (word not in average_tf_idf or average_tf_idf[word]>0.075):
            strong_values.append(word)
        elif z < -0.3:
            weak_values.append(word)

    strong_values_all.append(strong_values)
    weak_values_all.append(weak_values)





In [13]:
# Spot check for a single company, without the average TF-IDF filter.
idx = 96
scores = [iz[0] for iz in element_value_all[idx]]
mean = np.mean(scores)
std = np.std(scores)
strong_values = []
weak_values = []

for score, word in element_value_all[idx]:
    z = (score - mean) / std
    
    if z > 0.3:
        strong_values.append(word)
    elif z < -0.3:
        weak_values.append(word)


In [14]:
min_val_list = [2] * len(our_words['new_col2'])
max_val_list = [-2] * len(our_words['new_col2'])
for idx, row in enumerate(our_words['new_col2']):
    # Per-company TF-IDF range, starting from the bounds 2 and -2.
    values = [match_words_with_tf_idf_valuess[(i, idx)] for i in row]
    min_val_list[idx] = min([2] + values)
    max_val_list[idx] = max([-2] + values)


### Final Classification with FAISS and Contextual Tie-Breaking

Here we work with the **preprocessed data** obtained from the previous step.  
We use **FAISS** to compare the `vec_embeddings` of company tokens with the `label_embeddings`. These results are then **combined with the contextual similarity matrix** (`context_matrix`) that we previously generated using `SentenceTransformer`.
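
Because both sets of vectors are L2-normalized before indexing, the squared L2 distance reported by `IndexFlatL2` is a monotone function of cosine similarity: d² = 2 − 2·cos. The sketch below checks this identity in plain Python without FAISS; the helper names are illustrative and not part of the notebook.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cosine_unit(u, v):
    """Cosine similarity; valid only for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


u = normalize([1.0, 2.0, 3.0])
v = normalize([2.0, 0.5, 1.0])

d2 = squared_l2(u, v)
cos = cosine_unit(u, v)
print(math.isclose(d2, 2.0 - 2.0 * cos, abs_tol=1e-12))
```

A smaller distance therefore always means a higher cosine similarity, so ranking by L2 distance on normalized vectors gives the same order as ranking by cosine.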

In [15]:
import faiss

index = faiss.IndexFlatL2(label_embeddings.shape[1])
label_embeddings = np.array(label_embeddings, dtype=np.float32, order='C')
faiss.normalize_L2(label_embeddings)
index.add(label_embeddings)

In [16]:
from sentence_transformers import SentenceTransformer

def find_best_labels(vec_embeddings):
    vec_embeddings = np.array(vec_embeddings, dtype=np.float32, order='C')

    faiss.normalize_L2(vec_embeddings)

    distances, indices = index.search(vec_embeddings, len(label_embeddings))
    return (distances, indices) 

(distances, indices) = find_best_labels(vec_embeddings)


In [17]:
# Reuse find_best_labels (defined above) to rank labels for the category/niche embeddings.
(distances_cat, indices_cat) = find_best_labels(cat_niche_embeddings)

In [18]:
from common_functions_f import generate_noun_for_adj, generate_embedings_index

In [19]:
import numpy as np
embeddings_index = generate_embedings_index()

### Optimal Weighting Between FAISS and Contextual Similarity

Through experimentation, we found that the most effective proportion for combining our two similarity measures is:

- **0.7** for the **FAISS-based vector similarity** (token-level matching)
- **0.3** for the **contextual similarity** (SentenceTransformer-based)

This weighting balances **fine-grained token relevance** with **overall semantic meaning**, and led to more accurate and consistent label predictions in our testing.
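
The weighted combination can be sketched as below. The function name and the candidate values are hypothetical; the distance-to-similarity mapping `1 / (1 + d)` and the 0.7 / 0.3 weights follow the description above.

```python
def combined_score(faiss_distance, context_similarity,
                   w_faiss=0.7, w_context=0.3):
    """Blend a FAISS distance and a contextual similarity into one score."""
    # Map a distance in [0, inf) to a similarity in (0, 1].
    faiss_similarity = 1.0 / (1.0 + faiss_distance)
    return w_faiss * faiss_similarity + w_context * context_similarity


# Hypothetical candidates: (label, FAISS distance, contextual similarity).
candidates = [
    ("label_a", 0.25, 0.40),
    ("label_b", 1.00, 0.90),
]
ranked = sorted(candidates,
                key=lambda c: combined_score(c[1], c[2]),
                reverse=True)
print(ranked[0][0])
```

In this toy case `label_a` wins (0.68 versus 0.62): its much closer token-level match outweighs the higher contextual similarity of `label_b`.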

In [20]:
import copy
values_matrix = context_matrix

list_labels_found_for_companies = []
companies_matrix = copy.deepcopy(values_matrix)
list_matrix=[]
list_matrix2=[]
similarity_matrix_total = []



for i in range(0, len(df['description'])):
    list__ = []
    max1=-1
    max_index = -1
    list__cat=[]

    las = set()
    for j in range(0, 220):
        val_distance_faiss = 1/(1+distances[i][j])
        val_distance_faiss_cat = 1/(1+distances_cat[i][j])
        las.add(indices[i][j])
    
        list__.append((val_distance_faiss * 0.7+ values_matrix[i][indices[i][j]]*0.3, indices[i][j]))
        list__cat.append((val_distance_faiss_cat, indices_cat[i][j]))
        
        if max1 == -1 or max1 < list__[j][0]:
            max1 = list__[j][0]

            max_index = j
    
    list_matrix.append(list__)
    list_matrix2.append(list__cat)
    list_labels_found_for_companies.append(our_classes['label'][indices[i][max_index]])
    list_matrix2.append(max_index)

 

In [21]:
from nltk.corpus import wordnet as wn

In [22]:
def is_the_second_term_common_term(word1, word2):
    syns1 = wn.synsets(word1)
    syns2 = wn.synsets(word2)
    new_word2 = generate_noun_for_adj(word2, embeddings_index)
    new_word1 = generate_noun_for_adj(word1, embeddings_index)
    new_syns1 = None
    new_syns2 = None

    if new_word1 == "":
        new_word1 = word1
        new_syns1 = syns1
    else:
        new_syns1 = wn.synsets(new_word1)

    if new_word2 == "":
        new_word2 = word2
        new_syns2 = syns2
    else:
        new_syns2 = wn.synsets(new_word2)

    while syns1 == [] and len(word1.split("_"))>1:
        word1 = "_".join(word1.split("_")[:-1])
        syns1 = wn.synsets(word1)
    

    for s1 in syns1:
        for s2 in syns2:
            lch = s1.lowest_common_hypernyms(s2)
            if not lch:
                continue

            common_name = lch[0].lemmas()[0].name()
            if common_name == word2 or word2 in common_name:
                return 1
    if(word1!=new_word1 or word2!=new_word2):
        for s1 in new_syns1:
            for s2 in new_syns2:
                lch = s1.lowest_common_hypernyms(s2)
                if not lch:
                    continue

                common_name = lch[0].lemmas()[0].name()

                if common_name == word2 or word2 in common_name:
                    return 1

    return -1


In [23]:
correlated_terms = set()
correlated_terms_with_each_word = {}
with  open('correlated_terms.txt') as f:
    for line in f:
        values = line.strip().split(" ")
        score = np.dot(embeddings_index[values[0]].reshape(1,-1), embeddings_index[values[1]].reshape(1,-1).T)
        if score > 0.3 and is_the_second_term_common_term(values[1], values[0]) == -1:
            correlated_terms.add((values[0], values[1]))   
            if values[0] not in correlated_terms_with_each_word.keys() and values[1] in embeddings_index.keys():
                correlated_terms_with_each_word[values[0]] = [values[1]]
            elif values[1] in embeddings_index.keys():
                correlated_terms_with_each_word[values[0]].append(values[1])

In [24]:
from nltk.corpus import wordnet as wn

In [25]:
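def get_specificity(term_):
    """Longest hypernym path length of the term's first WordNet synset (0 if unknown)."""
    synsets_ = wn.synsets(term_)
    if not synsets_:
        return 0
    return np.max([len(paths) for paths in synsets_[0].hypernym_paths()])


def get_specificity2(term_):
    """Shortest hypernym path length of the term's first WordNet synset (0 if unknown)."""
    synsets_ = wn.synsets(term_)
    if not synsets_:
        return 0
    return np.min([len(paths) for paths in synsets_[0].hypernym_paths()])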


### Smart Handling of Genericity, Specificity, and Semantic Relationships

To maximize label precision and robustness, we implement fine-grained control over matching and scoring based on term specificity, genericity, and semantic relationships.


#### Detecting Very Generic Terms (WordNet-Based)

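- We use **`IsA` hierarchies** (built from `/r/IsA` assertions) together with WordNet depth to identify overly generic terms:
  - Parent terms with **≥ 70 `IsA` children** that sit at most one step from the top of the hierarchy, or with **≥ 450 children** at most two steps from the top, are flagged.
  - Additional heuristics flag multi-word terms that start with words like `prepared`, `shaped`, `processed` or end with words like `object`, `thing`, `fluid`, `matter`.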
- **Very generic terms** are excluded early to prevent noisy matches.
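
The count-and-depth rule can be sketched on a toy hierarchy. This is an illustration under assumptions: the function names and the data are hypothetical, and depth is taken as the number of `IsA` steps up to a term that has no parent.

```python
from collections import deque


def depth_to_root(parents, start):
    """Breadth-first number of IsA steps from `start` to a term with no parents."""
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        term, depth = queue.popleft()
        if term not in parents:
            return depth
        for parent in parents[term]:
            if parent not in visited:
                visited.add(parent)
                queue.append((parent, depth + 1))
    return 0


def flag_very_generic(parents, child_counts):
    """Flag terms with many children that also sit near the top of the hierarchy."""
    flagged = set()
    for term, n_children in child_counts.items():
        depth = depth_to_root(parents, term)
        if (n_children >= 70 and depth <= 1) or (n_children >= 450 and depth <= 2):
            flagged.add(term)
    return flagged


# Toy hierarchy (hypothetical data): child -> set of parents.
parents = {"dog": {"animal"}, "animal": {"organism"}, "organism": {"entity"}}
child_counts = {"entity": 900, "organism": 500, "animal": 120, "dog": 80}
print(sorted(flag_very_generic(parents, child_counts)))
```

Here `animal` has enough children for the first rule but sits too deep, and too few for the second, so only `entity` and `organism` are flagged.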


#### Differentiating Generic (`__`) vs Super Generic (`::`) Terms

When analyzing label components:

- **Generic terms (`__`)**:  
  ➔ Moderately broad but still potentially useful.  
  ➔ Example: `furniture manufacturing__1` — "manufacturing" is generic, but can act as a differentiator in context.

- **Super generic terms (`::`)**:  
  ➔ Extremely broad concepts spanning many fields.  
  ➔ Example: `bakery production::1` — "production" is vague and adds less specificity.
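
A small sketch of how the two markers can be read from a label component. `parse_component` is a hypothetical helper and the category names it returns are illustrative; only the `__` and `::` suffix conventions come from the description above.

```python
def parse_component(component):
    """Split a label component into (term, genericity)."""
    if "::" in component:
        return component.split("::")[0], "super_generic"
    if "__" in component:
        return component.split("__")[0], "generic"
    return component, "specific"


for raw in ("manufacturing__1", "production::1", "bakery"):
    print(parse_component(raw))
```

Single underscores inside compound tokens are left untouched, since only the double underscore acts as a marker.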


#### Specificity-Based Matching Adjustment

- **Dynamic thresholding**:
  - For **specific critical terms** (e.g., `"veterinary"` in `"veterinary clinic"`), we require **higher cosine similarity** to accept a match.
  - For **generic and super generic terms**, we allow **lower cosine similarities** during matching.
- **Scoring adjustment**:
  - **Generic terms (`__`)** are **moderately downweighted**.
  - **Super generic terms (`::`)** are **heavily downweighted** inside the final label score.
  - **Specific, highly meaningful words dominate** the scoring for more accurate label assignment.
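
The downweighting can be sketched as follows, using the base weights (0.3 / 0.4 / 0.3 for domain / modifier / core) and the multipliers that appear in the scoring cell later in this notebook. The function and constant names are hypothetical, and the special case for labels without a modifier is left out for brevity.

```python
BASE_WEIGHTS = {"domain": 0.3, "modifier": 0.4, "core": 0.3}
GENERIC_FACTOR = {"domain": 0.7, "modifier": 0.75, "core": 0.7}
SUPER_GENERIC_FACTOR = 0.4  # applied to domain and core only


def component_weights(domain, modifier, core):
    """Return per-component weights after genericity downweighting."""
    weights = dict(BASE_WEIGHTS)
    parts = (("domain", domain), ("modifier", modifier), ("core", core))
    for name, component in parts:
        if "__" in component:
            weights[name] *= GENERIC_FACTOR[name]
        if "::" in component and name != "modifier":
            weights[name] *= SUPER_GENERIC_FACTOR
    return weights


print(component_weights("furniture", "wood", "manufacturing__1"))
print(component_weights("bakery", "bread", "production::1"))
```

A generic core keeps 70% of its weight, while a super generic core keeps only 40%, so the specific components carry most of the final score.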

#### Generic-Aware Semantic Checking

During token-to-label matching:

- **Priority rule**:
  - If a token has **high cosine similarity** or is **semantically correlated** (via WordNet/ConceptNet), we **keep** the match.
  - Otherwise, if the token is **too generic relative to the label component**, it is **excluded**, even if its raw cosine similarity is acceptable.
- We use custom methods like `is_the_second_term_common_term()` for semantic genericity detection.


#### Boosting Related Words and Handling Antonyms

- **Related words** are expanded dynamically using:
  - **WordNet** `is-a` relationships.
  - **Embedding similarity** (threshold ≥ 0.4).
- **Antonyms** are explicitly detected and **penalized** to prevent assigning contradictory labels.
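
The embedding-based filter for related words can be sketched like this. The helper names and the two-dimensional toy embeddings are hypothetical; only the 0.4 cosine threshold comes from the description above.

```python
import math


def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_related(word, candidates, embeddings, threshold=0.4):
    """Keep candidates whose embedding is close enough to `word`."""
    return {
        c for c in candidates
        if c in embeddings and cosine(embeddings[word], embeddings[c]) >= threshold
    }


# Toy 2-D embeddings (hypothetical values).
embeddings = {"dog": [1.0, 0.0], "puppy": [0.9, 0.1], "rock": [0.0, 1.0]}
related = filter_related("dog", {"puppy", "rock", "unknown"}, embeddings)
print(related)
```

Candidates without an embedding are simply dropped here, which has the same effect as giving them a zero vector.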


In [26]:
import csv
import time
antonyms_words = set()
antonyms_words_for_each_word = {}

isA_relationship = {}
isA_reverse_relationship = {}
isA_reverse_relationship_even_invalid_terms = {}


hasContext_relationship = {}
hasContext_reverse_relationship = {}


related_terms_for_antonyms = {}
convertPluralToSingular = {}
occuranceTermForParent = {}

with open("../inputs/filtered_assertions.txt") as f:
    rows = f.read().split("\n")
    for row in rows:
        if row=="":
            continue
        words = row.split(" ")
        if (words[0]=="/r/Antonym"):
            word1 = words[1]
            word2 = words[2]
            antonyms_words.add((word1, word2))
            if word1 not in antonyms_words_for_each_word.keys():
                antonyms_words_for_each_word[word1] = [word2]
            else:
                antonyms_words_for_each_word[word1].append(word2)
            
            if word2 not in antonyms_words_for_each_word.keys():
                antonyms_words_for_each_word[word2] = [word1]
            else:
                antonyms_words_for_each_word[word2].append(word1)
    
        if (words[0]=="/r/IsA"):
            word1 = words[1]
            word2 = words[2]
            list_full = isA_reverse_relationship_even_invalid_terms.get(word2, set())
            list_full.add(word1)
            isA_reverse_relationship_even_invalid_terms[word2] = list_full

            if word1.startswith(f"{words[2]}_"):
                
                if word1.split(f"{words[2]}_")[1] in embeddings_index.keys() and words[2] in embeddings_index.keys():
                    word_temp = word1.split(f"{words[2]}_")[1]
                    value = np.dot(embeddings_index[word_temp], embeddings_index[words[2]].T)

                    if value > 0.35:
                        word1 = word1.split(f"{words[2]}_")[1]
            elif word1.endswith(f"_{words[2]}"): 
                if word1.split(f"_{words[2]}")[0] in embeddings_index.keys() and words[2] in embeddings_index.keys(): 
                    word_temp = word1.split(f"_{words[2]}")[0]
                    value = np.dot(embeddings_index[word_temp], embeddings_index[words[2]].T)

                    if value > 0.35:
                        word1 = word1.split(f"_{words[2]}")[0]
          
        
            

            if word1 in embeddings_index.keys() and word2 in embeddings_index.keys():
                list_is = isA_relationship.get(word1, set())
                list_is.add(word2)
                isA_relationship[word1] = list_is
                
                list_is_rev = isA_reverse_relationship.get(word2, set())
                
                list_is_rev.add(word1)
                isA_reverse_relationship[word2] = list_is_rev
                occuranceTermForParent[word2] = len(list_is_rev)
                    
             
        
        if (words[0]=="/r/FormOf"):
            word1 = words[1]
            word2 = words[2]
            convertPluralToSingular[word1] = word2

        if (words[0]=="/r/HasContext"):
            word1 = words[1]
            word2 = words[2]
            if word1 not in hasContext_relationship.keys():
                hasContext_relationship[word1] = [word2]
            else:
                hasContext_relationship[word1].append(word2)
            if word2 not in hasContext_reverse_relationship.keys():
                hasContext_reverse_relationship[word2] = [word1]
            else:
                hasContext_reverse_relationship[word2].append(word1)

        

In [27]:
from collections import deque 

def bfs(dict_terms, start):
    visited = []
    queue = deque([(start, 0)])

    while queue:
        node = queue.popleft()
        if node[0] not in visited:

            visited.append(node[0])
        
            if node[0] not in dict_terms.keys():
                return node[1]                
            for neighbor in dict_terms[node[0]]:
                if neighbor not in visited:
                    queue.append((neighbor, node[1]+1)) 
    return 0

In [28]:
original_terms_for_parent = occuranceTermForParent.copy()
occuranceTermForParent = dict(sorted(occuranceTermForParent.items(), key=lambda item: item[1]))
occuranceTermForParent = {k: v for k, v in occuranceTermForParent.items() if (v>=70 and  bfs(isA_relationship,k)<=1) or (v >=450 and bfs(isA_relationship,k)<=2) }
very_generic_terms = set([k for k, _ in occuranceTermForParent.items()])

In [29]:
terms_for_look_for_beginning = {"prepared", "shaped", "solid", "liquid", "long", "light", "dirty", "busy", "organized","constructed", "processed"}


In [30]:
terms_for_look_for_beginning = {"prepared", "shaped", "processed"}
terms_for_look_for_end = {"object", "thing", "fluid", "matter"}

In [31]:
for original_term in original_terms_for_parent.keys():
    
    if ((original_term in occuranceTermForParent and occuranceTermForParent[original_term]>4) or (original_term not in occuranceTermForParent.keys())) and len(original_term.split("_")) > 1 and get_specificity(original_term)<4:
        is_generic_term = original_term.split("_")[0] in terms_for_look_for_beginning or (original_term.split("_")[-1] in terms_for_look_for_end and get_specificity("_".join(original_term.split("_")[:-1])) >1 and get_specificity("_".join(original_term.split("_")[:-1]))<5) 

        if is_generic_term:
            very_generic_terms.add(original_term)


In [32]:
related_words_to_a_word_for_antonyms = {}
related_words_to_a_word_similarity = {}

for word in isA_relationship:
    set_element = isA_relationship[word]
    set_element_temp = set_element.copy()
    
    for word2 in set_element:
        if word==word2:
            continue
        set_is_A_relationship = {}
        set_is_A_reverse_relationship = {}

        if word2 in isA_relationship.keys():
            set_is_A_relationship = isA_relationship[word2]

        if word2 in isA_reverse_relationship.keys()  and word2 not in very_generic_terms:
            set_is_A_reverse_relationship = isA_reverse_relationship[word2]
        set_element_temp.update(set_is_A_relationship)

        set_element_temp.update(set_is_A_reverse_relationship)

        if word in set_element_temp:
            set_element_temp.remove(word)

    if set_element_temp !=set():

        list_words = list(set_element_temp)
        current_word_embedding = (embeddings_index[word]).reshape(1,-1)
        faiss.normalize_L2(current_word_embedding)
        all_word_embeddings = np.array([embeddings_index[wd] if wd in embeddings_index else np.zeros(300) for wd in list_words]).astype(np.float32, order='C')
        faiss.normalize_L2(all_word_embeddings)
        our_values = np.dot(current_word_embedding, all_word_embeddings.T)[0]
        list_words_zip_for_antonyms = list(filter(lambda x: x[1]>=0.4, list(zip(list_words, our_values))))
        set_element_temp = set(map(lambda x: x[0], list_words_zip_for_antonyms))

    
    if word not in related_words_to_a_word_for_antonyms.keys():
        related_words_to_a_word_for_antonyms[word] = set_element_temp
    else:
        set_word = related_words_to_a_word_for_antonyms[word]

        set_word.update(set_element_temp)
        related_words_to_a_word_for_antonyms[word] = set_word

    if word not in related_words_to_a_word_similarity.keys():
        related_words_to_a_word_similarity[word] = set_element_temp
    else:
        set_word_sim = related_words_to_a_word_similarity[word] 

        set_word_sim.update(set_element_temp)
        related_words_to_a_word_similarity[word] = set_word_sim


related_words_to_a_word_for_antonyms2 = {}
for word in isA_reverse_relationship:
    set_element = isA_reverse_relationship[word]
    set_element_temp = set_element.copy()
    for word_related in set_element:
        set_is_A_relationship = {}
        set_is_A_reverse_relationship = {}

        if word_related in isA_relationship.keys():
            set_is_A_relationship = isA_relationship[word_related]

        if word_related in isA_reverse_relationship.keys() and word_related not in very_generic_terms:
            set_is_A_reverse_relationship = isA_reverse_relationship[word_related]
        
        set_element_temp.update(set_is_A_relationship)
        set_element_temp.update(set_is_A_reverse_relationship)
        if word in set_element_temp:
            set_element_temp.remove(word)


    if set_element_temp !=set():
        list_words = list(set_element_temp)
        current_word_embedding = (embeddings_index[word]).reshape(1,-1)
        faiss.normalize_L2(current_word_embedding)
        all_word_embeddings = np.array([embeddings_index[wd] if wd in embeddings_index else np.zeros(300) for wd in list_words]).astype(np.float32, order='C')
        faiss.normalize_L2(all_word_embeddings)
        our_values = np.dot(current_word_embedding, all_word_embeddings.T)[0]
        list_words_zip_with_antonyms = list(filter(lambda x: x[1]>=0.4, list(zip(list_words, our_values))))
        set_element_temp = set(map(lambda x: x[0], list_words_zip_with_antonyms))

    if word not in related_words_to_a_word_for_antonyms2.keys():
        related_words_to_a_word_for_antonyms2[word] = set_element_temp
    else:
        set_word = related_words_to_a_word_for_antonyms2[word]
        set_word.update(set_element_temp)
        related_words_to_a_word_for_antonyms2[word] = set_word





In [33]:
for value_word in related_words_to_a_word_similarity.keys():
    list_value_word = related_words_to_a_word_similarity[value_word]
    if value_word in isA_reverse_relationship.keys():
        for wd in isA_reverse_relationship[value_word]:
            list_value_word = set(filter(lambda x: x if wd not in x else '', related_words_to_a_word_similarity[value_word]))
            related_words_to_a_word_similarity[value_word] = list_value_word

In [34]:
final_list = []
elements_filtered = []

for i in range(len(df['description'])):
    list_matrix[i].sort(reverse=True)
    
    list111 = []

    max_value = list_matrix[i][0][0]



    row_filtered = []
    for j in range(100):
       
        if max_value - list_matrix[i][j][0] <= 0.20:
            list111.append(our_classes['label'][list_matrix[i][j][1]])
            row_filtered.append(list_matrix[i][j])
    final_list.append(list111)
    elements_filtered.append(row_filtered)    
    
        

In [35]:
domain = [""] * len(divided_words)
original_domain = [""] * len(divided_words)
modifier = [""] * len(divided_words)
original_modifier = [""] * len(divided_words)
core = [""] * len(divided_words)
original_core = [""] * len(divided_words)


In [36]:
for i in range(len(divided_words)):
    all_words = divided_words.at[i,'new_label_with_categories'].split(" ")
    domain[i], modifier[i], core[i] = all_words
    original_domain[i] = domain[i].split("__")[0].split("::")[0]
    original_core[i] = core[i].split("__")[0].split("::")[0]
    original_modifier[i] = modifier[i].split("__")[0]

In [37]:
scores_all = []
list_elements = []
for i in range(len(our_classes)):
    scores = []
    elements = []
    a, b, c = 0.3, 0.4, 0.3
    if modifier[i]=="0":
        a, b, c = 0.325, 0, 0.275
    if len(domain[i].split("__"))>1:
        a*=0.7

    if len(modifier[i].split("__"))>1:
        b*=0.75
    
    if len(domain[i].split("::"))>1:
        a*=0.4
    if len(core[i].split("__"))>1:
        c*=0.7

    if len(core[i].split("::"))>1:
        c*=0.4

    if modifier[i] == "0" and core[i] == "0":
        val_score = 0.05 * a/a
        if len(original_domain[i].split("_")) > 2:
            val_score *=1.125
        scores.append(val_score)
        elements.append(original_domain[i])

    elif modifier[i] == "0":
        val_score1 = 0.065 * a/(a+c)
        val_score2 = 0.065 * c/(a+c)
        if len(original_domain[i].split("_")) > 2:
            val_score1 *=1.125
        if len(original_core[i].split("_")) > 2:
            val_score2 *=1.125
        scores.append(val_score1)
        scores.append(val_score2)
        elements.append(original_domain[i])
        elements.append(original_core[i])
    else:
        val_score1 = 0.08 * a
        val_score2 = 0.08 * b
        val_score3 = 0.08 * c
        scores.append(val_score1)
        scores.append(val_score2)
        scores.append(val_score3)
        elements.append(original_domain[i])
        elements.append(original_modifier[i])
        elements.append(original_core[i])

    if f"{original_domain[i]}_{original_modifier[i]}" in embeddings_index.keys() and modifier[i]==original_modifier[i] and original_domain[i]==domain[i]:
        elements = []
        elements.append(f"{original_domain[i]}_{original_modifier[i]}")
        elements.append(original_core[i])
        original_domain[i] = f"{original_domain[i]}_{original_modifier[i]}"
        domain[i] = original_domain[i]
        modifier[i] = "0"
        original_modifier[i] = "0"
        
    if f"{original_domain[i]}_{original_core[i]}" in embeddings_index.keys() and original_domain[i] == domain[i] and original_core[i]==core[i]:
        elements = []

        elements.append(f"{original_domain[i]}_{original_core[i]}")
        original_domain[i] = f"{original_domain[i]}_{original_core[i]}"
        domain[i] = original_domain[i]
        modifier[i] = "0"
        original_modifier[i] = "0"
        core[i] = "0"
        original_core[i] = "0"
    if f"{original_modifier[i]}_{original_core[i]}" in embeddings_index.keys()  and original_core[i]==core[i] and original_modifier[i]==modifier[i]:
        elements = []
        elements.append(original_domain[i])
        elements.append(f"{original_modifier[i]}_{original_core[i]}")
        original_core[i] = f"{original_modifier[i]}_{original_core[i]}"
        core[i] = original_core[i]
        original_modifier[i] = "0"
        modifier[i]="0"


        
    
    
    
    scores_all.append(scores)
    list_elements.append(elements)



In [38]:
def get_sim_list(word_key_label, wd, index_label, dict_words_same_category):
    if (word_key_label, wd) not in antonyms_words and ((word_key_label, wd) not in dict_words_same_category or dict_words_same_category[(word_key_label, wd)]==-1):
        filtered_similarity_list = set()
        modify_word=wd
        if(wd not in related_words_to_a_word_similarity and wd in convertPluralToSingular):
            modify_word = convertPluralToSingular[wd]
        if modify_word in related_words_to_a_word_similarity:
            filtered_similarity_list = set(filter(lambda x: f"_{word_key_label}_" in x or word_key_label == x or f"{word_key_label}_" == x[0:(len(word_key_label)+1)] or f"_{word_key_label}" in x, related_words_to_a_word_similarity[modify_word]))
        is_term_generic = False
        is_term_generic = (word_key_label == original_domain[index_label] and original_domain[index_label]==domain[index_label])
        is_term_generic = is_term_generic or (word_key_label == original_modifier[index_label])
        is_term_generic = is_term_generic or (word_key_label == original_core[index_label] and original_core[index_label]==core[index_label])
        



        if filtered_similarity_list != set() and is_term_generic:
            specialize_wd =  get_specificity2(wd)
            specialize_word_key_label =  get_specificity2(word_key_label)
        
                
            is_term_more_generic_than_the_key_word = (specialize_wd >= specialize_word_key_label)
            if is_term_more_generic_than_the_key_word == False:
                specialize_wd =  get_specificity(wd)
                specialize_word_key_label =  get_specificity(word_key_label)
                    
                is_term_more_generic_than_the_key_word = specialize_wd >= specialize_word_key_label
            filtered_similarity_list = set(filter(lambda x: is_the_second_term_common_term(x, wd)==-1, filtered_similarity_list))
            
            return filtered_similarity_list
    return None

In [39]:
from sklearn.metrics.pairwise import cosine_similarity


### Special Case: Handling `non` Labels

Labels starting with `non` (e.g., `non_alcoholic`) are tricky because their embeddings often score high against the very terms they negate.

#### Strategy:

- **First**, if the label term itself (`non_alcoholic`) isn't among the company's most important words:
  - Check whether a company word is **close to the label's antonym** (cosine > 0.7).
  - If that word is closer to the antonym than to the label, **penalize** the match.

- **Second**, if needed:
  - Expand to **similar words of the antonym**.
  - Filter out confusing terms (cosine > 0.7 with the label).

#### Goal:
- Avoid misclassifying opposites.
- Keep only labels truly matching the non-term meaning.
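
The first check can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the helper names `cosine` and `closer_to_antonym` and the toy 3-dimensional vectors are assumptions made for the example; only the 0.7 threshold comes from the strategy above.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0 or nv == 0:
        return 0.0
    return float(np.dot(u, v) / (nu * nv))

def closer_to_antonym(token_vec, label_vec, antonym_vec, threshold=0.7):
    """True when the token is strongly similar to the antonym (> threshold)
    and more similar to the antonym than to the `non` label itself."""
    sim_antonym = cosine(token_vec, antonym_vec)
    sim_label = cosine(token_vec, label_vec)
    return sim_antonym > threshold and sim_antonym > sim_label

# Toy vectors (hypothetical, for illustration only)
label = np.array([1.0, 0.2, 0.0])    # stands in for a `non_...` label term
antonym = np.array([0.2, 1.0, 0.0])  # stands in for its antonym
token = np.array([0.1, 0.9, 0.1])    # a company token

penalize = closer_to_antonym(token, label, antonym)
```

Here `penalize` is true because the token sits much closer to the antonym than to the label. A token identical to the label would not be penalized.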

In [40]:
list_matrix_original = []
for i in range(len(df['description'])):
    list_matrix[i].sort(reverse=True)
    list_matrix_original.append(list_matrix[i].copy())

### Medium Similarity Recovery via Parent-Child Heuristics

In some cases, label terms may not appear extremely specific but still carry meaningful business context. These often show **medium cosine similarity** (0.3–0.4) with company tokens.

We retain such terms when:

- The **parent term** (e.g., `laboratory`) has only a moderate similarity,
- But a more **specialized child term** (e.g., `medical_laboratory`) shows stronger contextual alignment,
- And the term has a **moderate specificity score** (between 8 and 12), indicating it's precise but not overly niche.

🔍 **Example**:
- Parent: `laboratory` (cosine ~0.35)
- Child: `medical_laboratory` (cosine ~0.52) → More relevant in a healthcare-related context.

This heuristic allows us to **recover contextually meaningful labels** that might otherwise be excluded by rigid similarity thresholds.
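
A minimal sketch of this recovery rule, under stated assumptions: the function name `recover_via_children` and the toy vectors are hypothetical, the similarity band (0.3 to 0.4) and specificity band (8 to 12) come from the text above, and the child threshold of 0.42 mirrors the value used in the matching cell of this notebook.

```python
import numpy as np

def recover_via_children(parent_sim, specificity, child_vecs, token_vec,
                         child_threshold=0.42):
    """Keep a medium-similarity parent term if some child term aligns better.

    parent_sim  : cosine between the parent label term and the company token
    specificity : specificity score of the parent term
    child_vecs  : 2-D array, one row per child-term embedding
    token_vec   : 1-D embedding of the company token
    """
    # Only medium-similarity, moderately specific parents are candidates
    if not (0.3 <= parent_sim <= 0.4 and 8 <= specificity < 12):
        return False
    if len(child_vecs) == 0:
        return False
    # Normalise so that dot products are cosine similarities
    children = child_vecs / np.linalg.norm(child_vecs, axis=1, keepdims=True)
    token = token_vec / np.linalg.norm(token_vec)
    return bool(np.any(children @ token > child_threshold))

# Toy data (hypothetical): two child terms and one company token
children = np.array([[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
token = np.array([1.0, 0.8, 0.0])

recovered = recover_via_children(0.35, 10, children, token)
```

The parent is recovered only when both bands hold and at least one child clears the threshold; a parent that is too specific or too dissimilar is left to the regular thresholds.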

In [41]:
full_list_companies = []
dict_words_same_category = {}
specifity_word = {}
is_the_set_valid_or_not = {}
is_the_set_valid_or_not2 = {}


dict_words_dif_category = {}
list_matrix_elements_val = np.zeros((len(df['description']), len(label_embeddings), 100))
list_matrix_core_val = np.zeros((len(df['description']), len(label_embeddings), 100))

antonyms_words_for_companies = set()
are_antonyms_with_word = {}
are_antonyms = set()
noun_for_adj_dict = {}

for i in range(len(df['description'])):
    list_matrix[i].sort(reverse=True)
    list111 = []
    max_origin = list_matrix[i][0][0]
    # Best score for this company; `j` is not defined yet at this point
    max_val = list_matrix[i][0][0]
    element = {}


    for j in range(len(elements_filtered[i])):
       
        if(max_val - list_matrix[i][j][0]>0.15):
            continue
        
        index_label = list_matrix[i][j][1]
        word_most_specific = original_domain[index_label]

        is_valid_term=0
        should_discard_label = 0

        list_words = []
        word_most_specific1 = None
        all_word_embeddings = None
        domain_values = None
        core_values = None
        modifer_values = None  # reset the name actually used below (spelled `modifer_values`)

        word_most_specific1 = word_most_specific.replace("-","_")
        we1 = (embeddings_index[original_domain[index_label]]).reshape(1,-1)
        faiss.normalize_L2(we1)
        all_word_embeddings = np.array([embeddings_index[wd.replace("-","_")]  for wd in our_words['new_col2'][i]]).astype(np.float32, order='C')
        faiss.normalize_L2(all_word_embeddings)
        domain_values = np.dot(we1, all_word_embeddings.T)[0]
        total_value = domain_values
        core_values = np.zeros(0)
        if np.max(domain_values) < 0.2 and original_domain[index_label]==domain[index_label]:
                should_discard_label=1


        we2 = None
        we3 = None

        if(core[index_label]!="0" and should_discard_label==0):
            we2 = (embeddings_index[original_core[index_label]]).reshape(1,-1)
            faiss.normalize_L2(we2)
            core_values = np.dot(we2, all_word_embeddings.T)[0]


            if np.max(core_values) < 0.2 and original_core[index_label]==core[index_label]:
                should_discard_label=1

            if original_core[index_label] == core[index_label]:
                total_value = domain_values * 0.5 + core_values * 0.5
            elif len(core[index_label].split("__"))>1:
                total_value = domain_values * 0.7 + core_values * 0.3
            elif len(core[index_label].split("::"))>1:
                total_value = domain_values * 0.9 + core_values * 0.1
        
        if(modifier[index_label]!="0" and should_discard_label==0):
            we3 = (embeddings_index[original_modifier[index_label]]).reshape(1,-1)
            faiss.normalize_L2(we3)
            modifer_values = np.dot(we3, all_word_embeddings.T)[0]


            if np.max(modifer_values) < 0.2:
                should_discard_label=1
            if original_core[index_label] != core[index_label] and original_domain[index_label]!=domain[index_label] and len(core[index_label].split("::"))>1:
                total_value = domain_values * 0.235 + modifer_values * 0.665 + core_values * 0.1
            elif original_core[index_label] != core[index_label] and original_domain[index_label]!=domain[index_label]:
                total_value = domain_values * 0.2 + modifer_values * 0.6 + core_values * 0.2
            elif original_core[index_label] != core[index_label] and len(core[index_label].split("::"))>1:
                total_value = domain_values * 0.425 + modifer_values * 0.475 + core_values * 0.1
            elif original_core[index_label] != core[index_label] and len(core[index_label].split("__"))>1:
                if original_modifier[index_label] == modifier[index_label]:
                    total_value = domain_values * 0.35 + modifer_values * 0.45 + core_values * 0.2
                else:
                     total_value = domain_values * 0.55 + modifer_values * 0.25 + core_values * 0.2
            elif original_domain[index_label] != domain[index_label]:
                total_value = domain_values * 0.2 + modifer_values * 0.425 + core_values * 0.375
            else:
                total_value = domain_values * 0.3 + modifer_values * 0.4 + core_values * 0.3
        if len(total_value) < 100:
            mult = 1
            
            list_matrix_elements_val[i][index_label] = np.pad(total_value, (0, 100 - len(total_value)), mode='constant')
            list_matrix_core_val[i][index_label] = np.pad(core_values, (0, 100 - len(core_values)), mode='constant')

        if(np.max(total_value) < 0.25):
            should_discard_label=1

        is_value_antonym = 0

        
        if domain[index_label] != original_domain[index_label] and modifier[index_label]!='0':


            if should_discard_label==0:
                domain_values = modifer_values.copy()
            else:
                we3 = (embeddings_index[original_modifier[index_label]]).reshape(1,-1)
                faiss.normalize_L2(we3)
                domain_values = np.dot(we3, all_word_embeddings.T)[0]

            word_most_specific = original_modifier[index_label]
            word_most_specific1 = word_most_specific.replace("-","_")
        elif domain[index_label] != original_domain[index_label] and core[index_label]!='0' and core[index_label]==original_core[index_label]:
            if should_discard_label==0:
                domain_values = core_values.copy()
            else:
                we3 = (embeddings_index[original_core[index_label]]).reshape(1,-1)
                faiss.normalize_L2(we3)
                domain_values = np.dot(we3, all_word_embeddings.T)[0]
            word_most_specific = core[index_label]
            word_most_specific1 = word_most_specific.replace("-","_")
        elif domain[index_label] != original_domain[index_label]:
            continue


        max_our_value = np.max(domain_values)
        all_values = domain_values
        best_antonyms_word = ''
        scores_max_list_antonyms = []
        is_not_antonyms = 0
        if "non" == word_most_specific[0:3]:
            word_most_specific1 = word_most_specific
            if word_most_specific1 not in antonyms_words_for_each_word.keys() and word_most_specific in convertPluralToSingular.keys() and  convertPluralToSingular[word_most_specific1] in antonyms_words_for_each_word.keys():
                    word_most_specific1 = convertPluralToSingular[word_most_specific1]
            

            if word_most_specific1 in antonyms_words_for_each_word.keys():
                
                filter_antonyms_if_they_are_there = any(x in our_categories_niche_words['niche_plus_cat'][i] for x in antonyms_words_for_each_word[word_most_specific1])
                filter_antonyms_if_they_are_there_most_important = any(x in most_important_words[i] for x in antonyms_words_for_each_word[word_most_specific1])
                filter_similar_if_they_are_there_most_important = word_most_specific in most_important_words[i] or word_most_specific1 in most_important_words[i]
                if filter_antonyms_if_they_are_there or (filter_antonyms_if_they_are_there_most_important and not filter_similar_if_they_are_there_most_important):
                    is_not_antonyms = 2
                elif not filter_similar_if_they_are_there_most_important:
                    antonyms_word_for_that =  embeddings_index.get(antonyms_words_for_each_word[word_most_specific1][0], np.zeros(300))
                    most_important_word_emb =  np.array([embeddings_index.get(wd, np.zeros(300)) if wd[0:3]!="non" else np.zeros(300) for wd in most_important_words[i]], dtype=np.float32, order='C')
                    faiss.normalize_L2(most_important_word_emb)
                    values_most_imp_ant = np.dot(antonyms_word_for_that, most_important_word_emb.T)
                    values_most_imp = np.dot(embeddings_index.get(word_most_specific, np.zeros(300)), most_important_word_emb.T)
                    indices_list = range(len(values_most_imp))
                    list_values_most_imp_ant =  list(values_most_imp_ant)
                    list_values_most_imp = list(values_most_imp)


                    indices_list = list(filter(lambda x: list_values_most_imp_ant[x]>0.7 and list_values_most_imp_ant[x] > list_values_most_imp[x], indices_list))
                    for ind in indices_list:
                        if list_values_most_imp_ant[ind]>0.7:
                            index_val = list_values_most_imp_ant.index(list_values_most_imp_ant[ind])
                            if word_most_specific not in are_antonyms_with_word:
                                are_antonyms_with_word[word_most_specific] = [most_important_words[i][index_val]]
                            else:
                                are_antonyms_with_word[word_most_specific].append(most_important_words[i][index_val])
                            is_not_antonyms = 2


        is_antonyms = 0
            
        for idx, wd in enumerate(our_words['new_col2'][i]):
            our_values = all_values[idx]


            if core[index_label] != "0" and core[index_label] != original_core[index_label] and list_matrix_core_val[i][index_label][idx]>0.6 and list_matrix_core_val[i][index_label][idx]>list_matrix_elements_val[i][index_label][idx]:
                continue
            if scores_max_list_antonyms != []:
                our_values_antonyms = scores_max_list_antonyms[idx]
            else:
                our_values_antonyms = 0
            
            
            if(is_not_antonyms==2):
                continue
           
            specificity_score=-1

            if our_values > 0.85 and (scores_max_list_antonyms == [] or our_values >= our_values_antonyms):
                is_valid_term=1
                break
        
            if word_most_specific not in specifity_word.keys():
                specificity_score = get_specificity(word_most_specific)
                specifity_word[word_most_specific] = specificity_score
            else:
                specificity_score = specifity_word[word_most_specific]
            
            if scores_max_list_antonyms != [] and our_values < our_values_antonyms and our_values_antonyms!=0:
                
                are_antonyms.add((word_most_specific, wd))
                
                is_antonyms = 1  
                continue
            
            if our_values > 0.5  and our_values<0.6 and (word_most_specific, wd) in dict_words_same_category.keys() and dict_words_same_category[(word_most_specific, wd)] == 1 and word_most_specific!=wd:
                continue
            


            if our_values>0.5 and our_values<0.6 and (word_most_specific, wd) not in dict_words_same_category.keys() and (is_the_second_term_common_term(word_most_specific, wd) == 1) and word_most_specific!=wd:
                dict_words_same_category[(word_most_specific, wd)] = 1
                dict_words_same_category[(wd, word_most_specific)] = -1
                continue

            if word_most_specific in isA_relationship.keys() and wd in isA_relationship[word_most_specific]:
                dict_words_same_category[(word_most_specific, wd)] = 1
                dict_words_same_category[(wd, word_most_specific)] = -1
                continue

            
            

            if (wd, word_most_specific) not in antonyms_words:
                the_set_is_valid=False
                if(specificity_score>=8):
                    if (word_most_specific, wd) not in is_the_set_valid_or_not.keys():
                        each_words = get_sim_list(word_most_specific, wd, index_label, dict_words_same_category)
                        the_set_is_valid = each_words != None and each_words != set()
                        if the_set_is_valid == True:
                            is_the_set_valid_or_not[(word_most_specific, wd)] = True
                        else:
                            is_the_set_valid_or_not[(word_most_specific, wd)] = False
                    else:
                        the_set_is_valid = is_the_set_valid_or_not[(word_most_specific, wd)]

                    
                    if our_values >= 0.3 and specificity_score<12 and our_values <= 0.4 and (word_most_specific, wd) not in is_the_set_valid_or_not2 and not the_set_is_valid:
                        current_words = isA_reverse_relationship_even_invalid_terms.get(word_most_specific, set())

                        if current_words != set():

                            embeddings_words = np.array([embeddings_index.get(wd, np.mean([embeddings_index.get(wd2, np.zeros(300)) for wd2 in wd.split("_")], axis=0)) for wd in current_words], dtype=np.float32, order='C')
                    
                            faiss.normalize_L2(embeddings_words)
                            cur_list = list(np.dot(embeddings_words, embeddings_index[wd].T))

                            cur_list_val = any(x>0.42 for x in cur_list)
                            
                            if cur_list_val:
                                the_set_is_valid = True
                              
                                is_the_set_valid_or_not2[(word_most_specific, wd)] = True
                            else:
                                the_set_is_valid = False
                                is_the_set_valid_or_not2[(word_most_specific, wd)] = False
                        else:
                            the_set_is_valid = False
                            is_the_set_valid_or_not2[(word_most_specific, wd)] = False
                    elif (word_most_specific, wd) in is_the_set_valid_or_not2.keys() and the_set_is_valid==False:
                        the_set_is_valid = is_the_set_valid_or_not2[(word_most_specific, wd)]


                
               
                
                is_generic_term = (i in original_core and i not in core) or (i in original_domain and i not in domain)

                
                if specificity_score < 8 and our_values > 0.3:
                    is_valid_term = 1 
                    is_antonyms = 0           
                    break
                elif specificity_score < 12 and (our_values > 0.4 or the_set_is_valid):

                    
                    is_valid_term = 1
                    is_antonyms = 0   
                    break
                elif specificity_score >= 12 and (our_values > 0.6 or (the_set_is_valid and our_values>0.42)):

                    
                    
                    if our_values > 0.75:
                        is_valid_term=1
                        is_antonyms = 0
                        break
                    is_very_specific = None
                    is_second_term = None
                    if (the_set_is_valid):
                        if word_most_specific not in specifity_word.keys():
                            specifity_word[word_most_specific] = get_specificity(word_most_specific)
                        specifity1 = specifity_word[word_most_specific]
                        if word_most_specific not in noun_for_adj_dict.keys():
                            noun_for_adj_dict[word_most_specific] = generate_noun_for_adj(word_most_specific, embeddings_index)
                        noun_for_adj = noun_for_adj_dict[word_most_specific]

                        if noun_for_adj not in specifity_word.keys():
                            specifity_word[noun_for_adj] = get_specificity(noun_for_adj)
                        specifity1_5 = specifity_word[noun_for_adj]



                        if wd not in specifity_word.keys():
                            specifity_word[wd] = get_specificity(wd)
                        specifity2 = specifity_word[wd]

                        if wd not in noun_for_adj_dict.keys():
                            noun_for_adj_dict[wd] = generate_noun_for_adj(wd, embeddings_index)
                        noun_for_adj2 = noun_for_adj_dict[wd]

                        if noun_for_adj2 not in specifity_word.keys():
                            specifity_word[noun_for_adj2] = get_specificity(noun_for_adj2)
                        specifity2_5 = specifity_word[noun_for_adj2]
                        
                        is_very_specific = max(specifity1, specifity1_5)
                        is_second_term = max(specifity2, specifity2_5)


                    if not the_set_is_valid or (not(is_very_specific==0 and is_second_term>0) and not(is_very_specific>0 and is_second_term<is_very_specific)):
                        is_valid_term = 1
                        is_antonyms = 0   
                        break
                    else:
                        continue
            else:
                is_value_antonym = 1
        
    
              
        if(is_valid_term==0 or should_discard_label==1):
          
            if(is_valid_term==0 and (is_value_antonym==1 or is_antonyms==1 or is_not_antonyms==2)):
                antonyms_words_for_companies.add((word_most_specific, i))
            list_matrix[i][j] = (-1, list_matrix[i][j][1])


In [42]:
for i in range(len(df['description'])):
    list_matrix[i].sort(reverse=True)
    list111 = []
    max_origin = list_matrix[i][0][0]
    # Best score for this company; `j` is not defined yet at this point
    max_val = list_matrix[i][0][0]
    element = {}

    for j in range(len(elements_filtered[i])):
       
        if(max_val - list_matrix[i][j][0]>0.15):
            continue
        
        index_label = list_matrix[i][j][1]

        if domain[index_label][0:3] == "non" and domain[index_label] in are_antonyms_with_word: 


            set_elements = set(are_antonyms_with_word[domain[index_label]])

            set_elements_emb = np.array([embeddings_index[wd] for wd in set_elements], dtype=np.float32, order='C')
            faiss.normalize_L2(set_elements_emb)
            values = np.max(np.dot(embeddings_index[domain[index_label]], set_elements_emb.T))
            if values > 0.8:
                continue
            set_elements_expanded = set()
            for el in set_elements:
                if el in related_words_to_a_word_for_antonyms2:
                    
                    set_elements_embeddings_emb = np.array([embeddings_index[wd] for wd in list(related_words_to_a_word_for_antonyms2[el])], dtype=np.float32, order='C')
                    faiss.normalize_L2(set_elements_embeddings_emb)
                    if domain[index_label] in antonyms_words_for_each_word.keys():
                        antonyms_domain = antonyms_words_for_each_word[domain[index_label]][0]
                    elif domain[index_label] in convertPluralToSingular and convertPluralToSingular[domain[index_label]] in antonyms_words_for_each_word.keys():
                        antonyms_domain = antonyms_words_for_each_word[convertPluralToSingular[domain[index_label]]][0]
                    our_values_set_elements = np.dot(embeddings_index[antonyms_domain], set_elements_embeddings_emb.T)
                    range_list = range(len(our_values_set_elements))
                    range_list_filtered = list(filter(lambda x: our_values_set_elements[x]>=0.45, range_list))
                    


                    for z in range_list_filtered:
                        set_elements_expanded.add(list(related_words_to_a_word_for_antonyms2[el])[z])

            if set_elements_expanded == set():
                continue
            set_elements_filtered = set(filter(lambda x: x in most_important_words[i] or x in our_categories_niche_words['niche_plus_cat'][i], set_elements_expanded))
            if set_elements_filtered == set():
                continue
            
            set_elements_emb_anto = np.array([embeddings_index[wd] for wd in set_elements_filtered], dtype=np.float32, order='C')
            values_anto = np.min(np.dot(embeddings_index[domain[index_label]], set_elements_emb_anto.T))
            if values_anto > 0.7:
                continue
            list_matrix[i][j] = (-1, list_matrix[i][j][1])
        

In [43]:
final_list = []
elements_filtered = []



for i in range(len(df['description'])):
    list_matrix[i].sort(reverse=True)
    list111 = []
    max_value = list_matrix[i][0][0]

    row_filtered = []
    for j in range(30):
        
        if max_value - list_matrix[i][j][0] <= 0.3:
            list111.append(our_classes['label'][list_matrix[i][j][1]])
            
            row_filtered.append(list_matrix[i][j])
    final_list.append(list111)
    elements_filtered.append(row_filtered)


### Handling Related-to-Antonym Conflicts

After we finish the main matching between company words and labels, we add one more filtering step to catch tricky cases where a company's description might **hint at the opposite** of what a label means — even if it's not a direct antonym.


### How It Works

- For each company, we go through its **top label candidates** (ranked by cosine similarity).
- We use the set of label terms already flagged as antonym-like for that company (`set_antonyms_list_final`) to catch possible conflicts.
- For every candidate label (up to 100):
  - If the label already looks **pretty weak** compared to the best match (cosine difference > 0.2), we **skip it** — no point wasting time on low-scoring ones.
  - If the **contextual similarity** (`context_matrix[i][index_label]`) is already high (> 0.4), we **skip it too** — the label is probably fine.
- Otherwise:
  - We check the **domain term** of the label.
  - Using our graphs (`related_words_to_a_word_for_antonyms` and `related_words_to_a_word_for_antonyms2`), we look for company words that are **related to antonyms** of the label’s domain.
    - Example: "livestock" is closely related to "meat" — even though they aren't direct opposites, they point to **different industries**.
  - If we find a match like this, we **invalidate** the label by setting its score to -1.


### Why We Do This

Not all conflicts are obvious.  
Sometimes words like "livestock" and "meat" seem close, but they **mean very different things** depending on context.  
Without this step, we could accidentally assign labels that look correct based on similarity, but **completely miss the real meaning**.

By checking for words **related to antonyms**, we catch these fuzzy mismatches and avoid bad predictions — while still being efficient and not stressing over weak candidates.
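
The matching step can be sketched roughly as below. This is an illustration under assumptions, not the notebook's code: `contains_term`, `conflicts_with_company`, and the toy graph are hypothetical names and data. It shows the idea of treating a flagged term as a whole underscore-delimited part (exact, prefix, middle, or suffix) of any term related to the label's domain.

```python
def contains_term(compound, term):
    """True if `term` appears as a whole underscore-delimited part of
    `compound`: exact match, prefix, middle part, or suffix."""
    return (
        compound == term
        or compound.startswith(f"{term}_")
        or f"_{term}_" in compound
        or compound.endswith(f"_{term}")
    )

def conflicts_with_company(domain_term, flagged_terms, related_graph):
    """True if any term related to `domain_term` contains one of the terms
    already flagged as antonym-like for this company."""
    related = related_graph.get(domain_term, set())
    return any(contains_term(r, t) for r in related for t in flagged_terms)

# Hypothetical toy graph, for illustration only
related_graph = {"livestock": {"meat_processing", "cattle_farm"}}

conflict = conflicts_with_company("livestock", {"meat"}, related_graph)
```

Matching on whole underscore-delimited parts avoids false hits on plain substrings: `eat` does not match `meat_processing`, while `meat` does.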


In [44]:
from sklearn.metrics.pairwise import cosine_similarity


for i in range(len(df['description'])):
    list_matrix[i].sort(reverse=True)
    list111 = []
    max_origin = list_matrix[i][0][0]
    # Best score for this company; `j` is not defined yet at this point
    max_val = list_matrix[i][0][0]
    # Label terms flagged as antonym-like for company i
    set_antonyms_list = {x for x in antonyms_words_for_companies if x[1] == i}
    set_antonyms_list_final = {x[0] for x in set_antonyms_list}
    
    if set_antonyms_list_final != set():
        for j in range(100): 
            
            if(max_val - list_matrix[i][j][0]>0.2):
                continue
            
            index_label = list_matrix[i][j][1]

            if context_matrix[i][index_label] > 0.4:
                continue

            word_most_specific = original_domain[index_label]
            val = False
            
            if original_domain[index_label]==domain[index_label]:

                for wd in set_antonyms_list_final:
                    val = True
                    
                    if word_most_specific in related_words_to_a_word_for_antonyms:
                        # Match the antonym as an exact term, prefix, middle part, or suffix of a related term
                        val = any(set(filter(lambda x:x == antonym_word or  f"{antonym_word}_" == x[0:len(antonym_word)+1] or f"_{antonym_word}_" in x or f"_{antonym_word}" == x[-(len(antonym_word)+1):] , related_words_to_a_word_for_antonyms[word_most_specific])) != set() for antonym_word in set_antonyms_list_final)
                        
                    else:
                        val = False
                    
                    if (val==False):
                        if word_most_specific in related_words_to_a_word_for_antonyms2:                           
                            # Same match against the second related-words graph
                            val = any(set(filter(lambda x: f"{antonym_word}_" == x[0:len(antonym_word)+1] or f"_{antonym_word}_" in x or f"_{antonym_word}" == x[-(len(antonym_word)+1):] or x == antonym_word, related_words_to_a_word_for_antonyms2[word_most_specific])) != set() for antonym_word in set_antonyms_list_final)
                        
                    if(val==True):
                        break

            
                
            if(val):
                list_matrix[i][j] = (-1, list_matrix[i][j][1])
            

In [45]:
set_antonyms_list = set(filter(lambda x: x if x[1] == 8 else "", antonyms_words_for_companies))

### Selecting Top Candidate Labels

As mentioned earlier, we only select **labels that have scores close to the highest-scoring label**.  
This helps us narrow down the candidates to only the most relevant options and avoids including unrelated or weakly matching labels.

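Specifically, the selection cell below treats a label as "close" when its score is within **0.2** of the top label (the earlier filtering passes use a tighter 0.15 margin), and it considers only the 30 highest-ranked candidates.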
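
As a compact illustration of this selection rule, here is a small sketch. The function name `select_close_candidates` and the toy scores are assumptions; the `(score, label_index)` tuple layout, the margin of 0.2, and the cap of 30 candidates mirror the selection cell in this notebook.

```python
def select_close_candidates(scored, margin, top_k=30):
    """Return the (score, label_index) pairs within `margin` of the best
    score, looking only at the `top_k` highest-scoring entries."""
    ranked = sorted(scored, reverse=True)[:top_k]
    if not ranked:
        return []
    best = ranked[0][0]
    return [(s, idx) for s, idx in ranked if best - s <= margin]

# Toy scores (hypothetical); -1 marks a label invalidated by an earlier filter
scored = [(0.91, 4), (0.55, 7), (0.80, 2), (-1, 9)]

kept = select_close_candidates(scored, margin=0.2)
```

With these toy scores, labels 4 and 2 survive, while label 7 and the invalidated label 9 fall outside the margin.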

In [46]:
final_list = []
elements_filtered = []

for i in range(len(df['description'])):
    list_matrix[i].sort(reverse=True)
    
    list111 = []

    max_value = list_matrix[i][0][0]

    row_filtered = []
    for j in range(30):
        if max_value - list_matrix[i][j][0] <= 0.2:
            list111.append(our_classes['label'][list_matrix[i][j][1]])
            row_filtered.append(list_matrix[i][j])
      
    
    final_list.append(list111)
    elements_filtered.append(row_filtered)

### Fine-Grained Boosting Strategy for Token-Level Matching

In our matching pipeline, we apply **dynamic boosting** based on the strength of each token match:

#### Boosting Based on Cosine Similarity

- If the cosine similarity is **above 0.95** (a near-perfect match), we apply the **highest possible boost**.
- If the similarity is **above 0.68** (up to 0.95), we apply a **strong boost** to the label score.
- If the similarity is between **0.45 and 0.68**, we apply a **moderate boost**.


#### Additional Contextual Boosts

- If a company token is **similar to niche or category words** (`our_categories_niche_words['niche_plus_cat'][i]`), we apply a **slight extra boost** to favor domain alignment.
- If a token matches **relevant terms from the first sentence** (`most_important_words`), we apply **targeted boosts** based on its contextual importance.


#### Per-Token Boosting Adjustments

- Using `strongest_values_all[i]`, we **increase** the score for tokens that are globally strong indicators.
- Using `weakest_values_all[i]`, we:
  - **Avoid discarding** tokens with cosine similarity **greater than 0.525** (even if they are weak).
  - **Deboost** tokens with lower similarity accordingly to avoid noise.


#### Final Scoring

- All individual token-level boosts are **summed and integrated** into the final label score.
- This ensures that **strong, domain-relevant, and contextually supported matches** have the greatest impact on the final classification.
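
The similarity tiers can be sketched as a simple lookup. This is only an illustration: the tier boundaries (0.45, 0.68, 0.95) come from the text above, but the function names and the boost factors (1.5, 1.3, 1.1) are placeholders chosen for the example, not the values used in the notebook, and the contextual and per-token adjustments are omitted.

```python
def similarity_boost(cos_sim, highest=1.5, strong=1.3, moderate=1.1):
    """Map a token-label cosine similarity to a multiplicative boost.
    Tier boundaries follow the text; the boost factors are placeholders."""
    if cos_sim > 0.95:
        return highest
    if cos_sim > 0.68:
        return strong
    if cos_sim >= 0.45:
        return moderate
    return 1.0  # no boost below the moderate tier

def label_score(token_sims, base_score=1.0):
    """Sum the boosted contributions of all tokens for one label."""
    return sum(base_score * similarity_boost(s) for s in token_sims)

# Toy similarities (hypothetical) for three company tokens against one label
total = label_score([0.97, 0.50, 0.20])
```

Checking the highest tier first keeps the tiers mutually exclusive, so a near-perfect match is never counted as merely "strong".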

In [47]:
# For each company, collect the TF-IDF values of its tokens in ascending order.
list_min_tf_idf_all = []
for idx, _ in enumerate(our_words['new_col2']):
    list_min_tf_idf = []
    for each_word in our_words['new_col2'][idx]:
        # TF-IDF values are keyed by (token, company index).
        list_min_tf_idf.append(match_words_with_tf_idf_valuess[(each_word, idx)])
    list_min_tf_idf.sort()
    list_min_tf_idf_all.append(list_min_tf_idf)

In [48]:
full_list_companies = []
terms_not_compatible = set()
frequency_strong_terms_per_label = {}

terms_are_similar_or_not = {}

list_scores = np.zeros((len(df['description']), len(label_embeddings), 3))


match_cosine_score_word_with_label = {}
match_cosine_score_word_with_company_word = {}

i=0
for index, _ in df.iterrows():
    list_matrix[i].sort(reverse=True)
    list111 = []
    max_val = list_matrix[i][0][0].copy()

  
    for j, z in enumerate(elements_filtered[i]):
        
        index_label = list_matrix[i][j][1]
       

       
        
        current_scores = np.zeros(3)
        element_weight = np.zeros(3)
        multiply_word = -1
        for idx, word_key_label in enumerate(list_elements[index_label]):
            current_score = 1
            we1 = cat_niche_embeddings[i].reshape(1,-1).astype(np.float32, order='C')
            faiss.normalize_L2(we1)
            label_word_embedding = embeddings_index[word_key_label]
            if len(label_word_embedding) < 400:
                label_word_embedding = np.pad(label_word_embedding, (0, 400 - len(label_word_embedding)), mode='constant')
            le13 = label_word_embedding.reshape(1,-1).astype(np.float32, order='C')
            our_values_cat = np.max(np.dot(we1, le13.T))

   
            if our_values_cat > 0.8:
                if (original_domain[index_label] == word_key_label and  domain[index_label]!=original_domain[index_label]) or (original_core[index_label] == word_key_label and core[index_label]!=original_core[index_label]):
                    current_score = 1.35
                else:
                    current_score = 1.75
            
            elif our_values_cat > 0.7:
                if (original_domain[index_label] == word_key_label and  domain[index_label]!=original_domain[index_label]) or (original_core[index_label] == word_key_label and core[index_label]!=original_core[index_label]):
                    current_score = 1.3
                else:
                    current_score = 1.65

            elif our_values_cat > 0.5:
                if (original_domain[index_label] == word_key_label and  domain[index_label]!=original_domain[index_label]) or (original_core[index_label] == word_key_label and core[index_label]!=original_core[index_label]):
                    current_score = 1.2 
                else:
                    current_score = 1.5

            else:

                list_current = list(our_categories_niche_words['niche_plus_cat'][i])
                if our_values_cat > 0.3 and list_current!=[]:
                    word_embedding_list1 = np.array([embeddings_index.get(wd.lower(), np.zeros(300)) if (wd, i) in match_words_with_tf_idf_valuess and wd not in weak_values_all[i] else np.zeros(300) for wd in list_current])
                    we_all = np.array(word_embedding_list1, dtype=np.float32, order='C')
                    faiss.normalize_L2(we_all)
                    le_wd = embeddings_index[word_key_label].reshape(1,-1).astype(np.float32, order='C')
                    faiss.normalize_L2(le_wd)
                    our_values_cat_indiv = np.max(np.dot(we_all, le_wd.T))
                   
                    if our_values_cat_indiv > 0.75:
                        if (original_domain[index_label] == word_key_label and domain[index_label]!=original_domain[index_label]) or (original_core[index_label] == word_key_label and core[index_label]!=original_core[index_label]):
                            current_score = 1.15
                        else:
                            current_score = 1.5
                    elif our_values_cat_indiv > 0.6 and (original_domain[index_label] == word_key_label and domain[index_label]!=original_domain[index_label]) or (original_core[index_label] == word_key_label and core[index_label]!=original_core[index_label]):
                        if current_score != 1:
                            current_score = 1.3

            if list(most_important_words[i]) != []:
                list_current = list(most_important_words[i])
                word_embedding_list1 = np.array([embeddings_index.get(wd.lower(), np.zeros(300)) if  wd not in weak_values_all[i] else np.zeros(300) for wd in list_current])
                we_all = np.array(word_embedding_list1, dtype=np.float32, order='C')
                faiss.normalize_L2(we_all)
                le_wd = embeddings_index[word_key_label].reshape(1,-1).astype(np.float32, order='C')
                faiss.normalize_L2(le_wd)
                our_values_cat_indiv = np.max(np.dot(we_all, le_wd.T))
                if our_values_cat_indiv > 0.70:
                    if (original_domain[index_label] == word_key_label and domain[index_label]!=original_domain[index_label]) or (original_core[index_label] == word_key_label and core[index_label]!=original_core[index_label]):
                        if current_score != 1:
                            current_score *= 1.05
                        else:
                            current_score = 1.15
                    else:
                        if current_score != 1:
                            current_score *= 1.2
                        else:
                            current_score = 1.35
                elif our_values_cat_indiv > 0.6 and ((original_domain[index_label] == word_key_label and domain[index_label]!=original_domain[index_label]) or (original_core[index_label] == word_key_label and core[index_label]!=original_core[index_label])):
                    if current_score != 1:
                        current_score *= 1.125
                    else:
                        current_score = 1.2


                




            current_scores[idx] = current_score

        if len(list_elements[index_label])==3:
            
            if original_core[index_label] != core[index_label] and original_domain[index_label]!=domain[index_label] and len(core[index_label].split("::"))>1:
                element_weight = [0.235, 0.665, 0.1]
            elif original_core[index_label] != core[index_label] and original_domain[index_label]!=domain[index_label]:
                element_weight = [0.2, 0.6, 0.2]
            elif original_core[index_label] != core[index_label] and len(core[index_label].split("::"))>1:
                element_weight = [0.425, 0.475, 0.1]
            elif original_core[index_label] != core[index_label]:
                if len(modifier[index_label].split("__"))>1:
                    element_weight = [0.55, 0.25, 0.2]
                else:
                    element_weight = [0.35, 0.45, 0.2]
            elif original_domain[index_label] != domain[index_label]:
                element_weight = [0.2, 0.425, 0.375]
            else:
                element_weight = [0.3, 0.4, 0.3]
            
            score_total = current_scores[0] * element_weight[0] + current_scores[1] * element_weight[1] + current_scores[2] * element_weight[2]
            list_scores[i][index_label] = [current_scores[0] * element_weight[0], current_scores[1] * element_weight[1], current_scores[2] * element_weight[2]]

            
            score_total *=0.07
            multiply_word = 0.07
        elif len(list_elements[index_label])==2:
            if original_core[index_label] != core[index_label] and original_domain[index_label] != domain[index_label]:
                if len(core[index_label].split("__")) > 1:
                    element_weight = [0.3, 0.3, 0]
                    list_scores[i][index_label] = [current_scores[0] * 0.3, 0, current_scores[1] * 0.3]
                    score_total = current_scores[0] * 0.3 + current_scores[1] * 0.3
                else:
                    element_weight = [0.35, 0.1, 0]
                    list_scores[i][index_label] = [current_scores[0] * 0.35, 0, current_scores[1] * 0.1]
                    score_total = current_scores[0] * 0.35 + current_scores[1] * 0.1
            elif original_core[index_label] != core[index_label]:
                if len(core[index_label].split("__")) > 1:
                    element_weight = [0.7, 0.3, 0]
                    list_scores[i][index_label] = [current_scores[0] * 0.7, 0, current_scores[1] * 0.3]
                    score_total = current_scores[0] * 0.7 + current_scores[1] * 0.3
                else:
                    list_scores[i][index_label] = [current_scores[0] * 0.9, 0, current_scores[1] * 0.1]
                    element_weight = [0.9, 0.1, 0]

                    score_total = current_scores[0] * 0.9 + current_scores[1] * 0.1
            elif original_domain[index_label] != domain[index_label]:
                element_weight = [0.3, 0.7, 0]
                list_scores[i][index_label] = [current_scores[0] * 0.3, 0, current_scores[1] * 0.7]
                score_total = current_scores[0] * 0.3 + current_scores[1] * 0.7
            else:
                list_scores[i][index_label] = [current_scores[0] * 0.5, 0, current_scores[1] * 0.5]
                element_weight = [0.5, 0.5, 0]
                score_total = current_scores[0] * 0.5 + current_scores[1] * 0.5

            score_total *=0.06
            multiply_word = 0.06


        else:
            score_total = current_scores[0]
            list_scores[i][index_label] = [current_scores[0], 0, 0]

            if domain[index_label] != original_domain[index_label]:
                score_total*=0.5
            score_total *=0.05
            multiply_word = 0.05


        weight_core = 0
        if len(list_elements[index_label])==2:
            weight_core = element_weight[1]
        elif len(list_elements[index_label])==3:
            weight_core = element_weight[2]
            
        list_wd = [wd for wd in our_words['new_col2'][i]]
        for idx2, wd in enumerate(list_wd):
            mult_value = 1

            score_total1 = score_total
            
            our_values = list_matrix_elements_val[i][index_label][idx2]
            
            
            if core[index_label] != "0" and core[index_label] != original_core[index_label] and list_matrix_core_val[i][index_label][idx2]>0.6 and list_matrix_core_val[i][index_label][idx2]>list_matrix_elements_val[i][index_label][idx2]:
                score_total1*=weight_core

            if our_values < 0.57 and  original_domain[index_label]!= wd and (original_domain[index_label], wd) in dict_words_same_category.keys() and dict_words_same_category[(original_domain[index_label], wd)]==1:
                continue

            if our_values < 0.57 and original_modifier[index_label]!= wd and  (original_modifier[index_label], wd) in dict_words_same_category.keys() and dict_words_same_category[(original_modifier[index_label], wd)]==1:
                continue

            if our_values < 0.57 and original_core[index_label]!= wd and (original_core[index_label], wd) in dict_words_same_category.keys() and dict_words_same_category[(original_core[index_label], wd)]==1:
                continue



            if (i, index_label) not in frequency_strong_terms_per_label.keys():
                frequency_strong_terms_per_label[(i,index_label)] = set()
            list_set = frequency_strong_terms_per_label[(i,index_label)]
    
            tf_idf_value = match_words_with_tf_idf_valuess[(wd, i)]


            
            can_continue=True
            can_continue2=True

            
            if can_continue and can_continue2:
                if our_values > 0.95:
                    list_set.add(wd.lower())
                    if wd in strong_values_all[i]:
                        score_total1 *=1.25
                    elif wd in weak_values_all[i]:
                        score_total1 *=0.90
                    list_matrix[i][j] = (list_matrix[i][j][0]+score_total1*1.8,list_matrix[i][j][1])
                    continue

                elif our_values > 0.68:
                    list_set.add(wd.lower())
                    if wd in strong_values_all[i]:
                        score_total1 *=1.25
                    elif wd in weak_values_all[i] and wd not in our_categories_niche_words['niche_plus_cat'][i]:
                        score_total1 *=0.85
                    
                    list_matrix[i][j] = (list_matrix[i][j][0]+score_total1*1.5,list_matrix[i][j][1])
                    continue

                elif our_values > 0.45 and (wd not in weak_values_all[i] or wd in our_categories_niche_words['niche_plus_cat'][i]) or our_values > 0.525:
                    if len(wd.split("_")) > 1 and our_values < 0.45:
                        list_Min = [len(wd1) for wd1 in wd.split("_")]
                        min_value  = min(list_Min)
                        if min_value > 3:
                            length_n = len(wd.split("_"))
                            wd_splitted_embeddings = np.array([embeddings_index.get(wd1, np.zeros(300)) for wd1 in wd.split("_")], dtype=np.float32, order='C')
                            faiss.normalize_L2(wd_splitted_embeddings)
                            total_value1 = embeddings_index[original_domain[index_label]] * element_weight[0]
                            if modifier[index_label]!='0' and core[index_label]!='0':
                                total_value1+=embeddings_index[original_modifier[index_label]]*element_weight[1]
                                total_value1+=embeddings_index[original_core[index_label]]*element_weight[2]
                            elif modifier[index_label]=='0':
                                total_value1+=embeddings_index[original_core[index_label]]*element_weight[1]
                            maximum_value = np.max(np.dot(total_value1, wd_splitted_embeddings.T))

                            if (maximum_value<0.3):
                                terms_not_compatible.add((wd, word_most_specific))
                                continue
                    
                    list_set.add(wd.lower())
                    if wd in strong_values_all[i]:
                        score_total1 *=1.25

                    if wd in weak_values_all[i] and wd not in our_categories_niche_words['niche_plus_cat'][i]:
                        score_total1 *=0.75
                   
                    list_matrix[i][j] = (list_matrix[i][j][0]+score_total1,list_matrix[i][j][1])
                    continue
            
                elif (f"{domain[index_label]}_" == wd[0:len(domain[index_label])+1]) and our_values<0.4 and len(wd.split("_"))>1:
                        
                        word_separate_value = np.array([embeddings_index[our_words] for our_words in wd.split("_")], dtype=np.float32, order='C')
                        faiss.normalize_L2(word_separate_value)
                        array_weighted_labels = sum(embeddings_index.get(word_key_label, np.zeros(300)) * element_weight[idx] for idx, word_key_label in enumerate(list_elements[index_label]))
                        our_value_common = np.mean(np.dot(array_weighted_labels, word_separate_value.T))
                        if wd in strong_values_all[i] and word_key_label == original_domain[index_label] and original_domain[index_label] == domain[index_label]:
                            score_total1*=1.25
                        elif wd in weak_values_all[i] and wd not in our_categories_niche_words['niche_plus_cat'][i]:
                            score_total1*=0.75
                        
                        if our_value_common > 0.55 and word_key_label == original_domain[index_label] and original_domain[index_label] != domain[index_label]:
                            list_set.add(wd.lower())
                            list_matrix[i][j] = (list_matrix[i][j][0]+score_total1*0.6,list_matrix[i][j][1])
                            continue
                        elif our_value_common > 0.55:
                            list_set.add(wd.lower())
                            list_matrix[i][j] = (list_matrix[i][j][0]+ score_total1,list_matrix[i][j][1])
                            continue
                
                if our_values > 0.35 and wd not in weak_values_all[i]:
                    list_set.add(wd.lower())
                    list_matrix[i][j] = (list_matrix[i][j][0]+our_values*score_total1*0.8,list_matrix[i][j][1])


        frequency_strong_terms_per_label[(i,index_label)] = list_set
    i+=1

In [49]:
list_matrix_original_2 = []
for i in range(len(df['description'])):
    list_matrix[i].sort(reverse=True)
    list_matrix_original_2.append(list_matrix[i].copy())

In [50]:
final_list = []
elements_filtered = []

for i in range(len(df['description'])):
    list_matrix[i].sort(reverse=True)
    
    list111 = []

    max_value = list_matrix[i][0][0]


    row_filtered = []
    for j in range(30):
       
        if max_value - list_matrix[i][j][0] <= 0.1:
            list111.append(our_classes['label'][list_matrix[i][j][1]])
            row_filtered.append(list_matrix[i][j])
        
      
    
    final_list.append(list111) 
    elements_filtered.append(row_filtered)

    
    

In [51]:
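def eliminate_wrong_similar_elements(word, each_key_word, correlated_terms_filter_list):
    """Return True if a term correlated with `word` is close to `each_key_word`
    (similarity >= 0.55) without also being close to an antonym of `word` (< 0.5)."""
    # Nothing to compare against: bail out before building an empty matrix.
    if len(correlated_terms_filter_list) == 0:
        return False

    values_list = np.array([embeddings_index[wd] for wd in correlated_terms_filter_list], dtype=np.float32, order='C')
    faiss.normalize_L2(values_list)
    # Similarity of each correlated term to the label key word
    # (the key-word embedding itself is not re-normalized here).
    values = list(np.dot(values_list, embeddings_index[each_key_word].T))

    antonyms_terms = [w2 for w1, w2 in antonyms_words if w1 == word and w2 in embeddings_index]
    if antonyms_terms:
        values_list2 = np.array([embeddings_index.get(wd, np.zeros(300)) for wd in antonyms_terms], dtype=np.float32, order='C')
        faiss.normalize_L2(values_list2)

        # Highest similarity of each correlated term to any antonym of `word`.
        list_values = list(np.max(np.dot(values_list, values_list2.T), axis=1))
        return any(values[index] >= 0.55 and list_values[index] < 0.5 for index in range(len(values)))
    return any(val >= 0.55 for val in values)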

In [52]:
all_current_embeddings = []
for idx, _ in enumerate(our_words['new_col2']):
    emb_for_the_current_one = []

    
    for j, z in enumerate(elements_filtered[idx]):


        index_label = list_matrix[idx][j][1]

        col_emb = np.array([embeddings_index[wd] for wd in our_words['new_col2'][idx]], dtype =np.float32, order='C')
        filter_emb = np.array([embeddings_index[wd] for wd in list_elements[index_label]], dtype =np.float32, order='C')
        faiss.normalize_L2(col_emb)
        faiss.normalize_L2(filter_emb)

        current_embedding = np.dot(filter_emb, col_emb.T)
        emb_for_the_current_one.append(current_embedding)

       
    all_current_embeddings.append(emb_for_the_current_one)


In [53]:
all_core_embeddings_terms = []
for idx, _ in enumerate(our_words['new_col2']):
    emb_for_the_current_one = []

    
    for j, z in enumerate(elements_filtered[idx]):


        index_label = list_matrix[idx][j][1]

        col_emb = np.array([embeddings_index[wd] for wd in our_words['new_col2'][idx]], dtype =np.float32, order='C')
        faiss.normalize_L2(col_emb)

        if(original_core[index_label]!='0') and original_core[index_label]!=core[index_label]:
            current_embedding = np.dot(embeddings_index[original_core[index_label]], col_emb.T)
        else:
            current_embedding = np.dot(np.zeros(300), col_emb.T)
        emb_for_the_current_one.append(current_embedding)

       
    all_core_embeddings_terms.append(emb_for_the_current_one)


### Advanced Filtering and Boosting Logic for Noisy or Ambiguous Terms

In this phase, we refine matching by **cleaning noisy matches** and **boosting relevant but low-similarity terms** carefully.


### 1. Cleaning Weak Matches

- We **filter out tokens** that have **low similarity** to the label parts and are listed as weak terms in `weak_values_all` (that is, they are **not strong terms**).
- This helps to **reduce noise** and **focus on meaningful words** when matching labels.
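
A minimal sketch of this filter, assuming the weak tokens for a company are available as a set. The function name `keep_token` and the `label_is_specific` flag are hypothetical; the flag stands in for the notebook's `specifity_score2 != 0` check, and the 0.36 and 0.55 thresholds are the ones used in the cell below.

```python
def keep_token(token, cos_sim, weak_tokens, label_is_specific):
    """Decide whether a token counts as supporting evidence for a label part."""
    if token in weak_tokens:
        return False
    # A specific label word tolerates a lower similarity than a generic one.
    threshold = 0.36 if label_is_specific else 0.55
    return cos_sim > threshold


weak = {"services"}
print(keep_token("roofing", 0.40, weak, label_is_specific=True))
print(keep_token("roofing", 0.40, weak, label_is_specific=False))
print(keep_token("services", 0.90, weak, label_is_specific=True))
```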


### 2. Boosting Low Similarity but Relevant Terms

To avoid missing useful matches, we apply two strategies:

#### (a) Dictionary-Based Similar Term Expansion
- We use a **custom dictionary** (built using ConceptNet relationships) to find words **semantically related** to the current token.
- We ensure that:
  - The term is **not more generic** than the label term.
  - The match is **specific enough** to avoid false positives.
- If the relationship is strong, we **boost the label score** even if the cosine similarity is low.

#### (b) Splitting and Validating Compound Tokens
- If the token is a **compound word** (e.g., `agriculture_machinery`), we split it into parts.
- For each part:
  - If a part matches well with the label term (cosine > 0.7), we **boost** the score proportionally (e.g., `+1/len(word.split("_"))` per good part).
  - If parts match indirectly, we still use them carefully to **adjust the boost**.
- If splitting leads to strong matches, we **accumulate** a multiplier boost.
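
The proportional boost for compound tokens can be sketched as the fraction of parts that match the label term. The function names, the toy 2-D embeddings, and the default threshold of 0.7 are assumptions for illustration; note that one branch of the cell below uses a 0.5 cut-off when filtering split parts, so the threshold is left as a parameter.

```python
import numpy as np


def unit(vector):
    """Return the vector scaled to unit length (zero vectors are returned unchanged)."""
    v = np.asarray(vector, dtype=np.float32)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def compound_match_fraction(token, label_vector, embeddings, threshold=0.7):
    """Fraction of the '_'-separated parts whose cosine similarity to the label exceeds the threshold."""
    parts = token.split("_")
    label_unit = unit(label_vector)
    zero = np.zeros_like(label_unit)
    good_parts = 0
    for part in parts:
        # Unknown parts fall back to a zero vector, i.e. similarity 0.
        similarity = float(unit(embeddings.get(part, zero)) @ label_unit)
        if similarity > threshold:
            good_parts += 1
    return good_parts / len(parts)


# Toy embeddings standing in for the real word vectors.
toy_embeddings = {"agriculture": [1.0, 0.0], "machinery": [0.0, 1.0]}
label_vec = [1.0, 0.1]

print(compound_match_fraction("agriculture_machinery", label_vec, toy_embeddings))
```

Each matching part contributes `1 / len(parts)`, so a two-part token with one good part receives half of the full multiplier.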



### 3. Using Correlated Terms Fallback

If a token is **not directly correlated** with the label term, we:

- Look up the token in the `correlated_terms_with_each_word` dictionary.
- Search its correlated terms for ones that **contain the label term** or have **high cosine similarity** with it.
- Apply `eliminate_wrong_similar_elements` to **filter out false positives**:
  - A correlated term counts only if its similarity to the label term is at least 0.55.
  - If that term is also **similar to an antonym** of the token (similarity of 0.5 or more), it is discarded.
  - If at least one correlated term survives both checks, we **accept** the match and **boost** the label.


### 4. Memory Optimization and Smart Multipliers

- We **store results in dictionaries** (like `are_terms_correlated`) to avoid redundant computation.
- In complex cases, if relatedness is detected only through **second-level correlated terms**, we apply a **small multiplier** (`value_mlt = 0.05`) so that a possible false positive is not promoted too strongly.
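
The caching idea can be sketched as a dictionary keyed by `(token, label_word)` pairs, the same key shape the notebook uses for `are_terms_correlated`. The wrapper `make_cached_check` and the toy `toy_relatedness` function are hypothetical stand-ins, not the notebook's code.

```python
def make_cached_check(compute):
    """Wrap a pairwise check so each (token, label_word) pair is computed only once."""
    cache = {}

    def check(token, label_word):
        key = (token, label_word)
        if key not in cache:
            cache[key] = compute(token, label_word)
        return cache[key]

    return check, cache


calls = []


def toy_relatedness(token, label_word):
    # Hypothetical stand-in for an expensive relatedness computation.
    calls.append((token, label_word))
    return 1.0 if token.startswith(label_word) else 0.0


related, cache = make_cached_check(toy_relatedness)
related("roofing_contractor", "roofing")
related("roofing_contractor", "roofing")  # served from the cache
print(len(calls), cache)
```

Because the same token and label word recur across many companies, a lookup of this kind avoids repeating the embedding comparisons for every occurrence.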


### 5. Final Goal

This filtering and boosting phase ensures:
- **Relevant terms are rescued** even when their direct cosine similarity to the label is low.
- **Irrelevant or noisy labels are heavily penalized** and **filtered out**.
- **Compound terms** are split and validated part by part, so real matches are not missed.
- **False positives** caused by generic or antonymic confusion are minimized.


In [54]:
last_index_elements = [-1] * len(our_words['new_col2'])
zzz=0
are_terms_correlated = {}
are_terms_asssociated = {}
does_contain_similar_elements = {}
each_key_word_specifity_score = {}
for idx, _ in enumerate(our_words['new_col2']):
    min_25th = (min_val_list[idx] + (max_val_list[idx]-min_val_list[idx]) * 0.25)

    max_25th = (max_val_list[idx] - (max_val_list[idx]-min_val_list[idx]) * 0.25)
    list_matrix[idx].sort(reverse=True)
    max_val = list_matrix[idx][0][0]

    if(len(elements_filtered[idx])==1 or (list_matrix[idx][0][0] - list_matrix[idx][1][0])>=0.1):
        last_index_elements[idx] = list_matrix[idx][0][1]
        continue
    
    

    for j, z in enumerate(elements_filtered[idx]):


        if(max_val - list_matrix[idx][j][0]>0.15):
            continue
        index_label = list_matrix[idx][j][1]
        
      
        if (idx, index_label) not in frequency_strong_terms_per_label.keys():
            
            list_matrix[idx][j] = (-1, list_matrix[idx][j][1])
            continue


        for idx2, each_key_word in enumerate(list_elements[index_label]):               
            
            if list_matrix[idx][j][0] == -1:
                continue
            end_loop = -1
            
            if each_key_word == core[index_label] or each_key_word == domain[index_label] or each_key_word == modifier[index_label]:
                end_loop = 0
                max_val = -1
                keydf = ""

                for our_index, i in enumerate(our_words['new_col2'][idx]):

                    if core[index_label] != "0" and core[index_label] != original_core[index_label] and list_matrix_core_val[idx][index_label][our_index]>0.7:
                        continue

                    if each_key_word not in specifity_word.keys():
                        specifity_word[each_key_word] = get_specificity(each_key_word)
                        
                    
                    specifity_score2 = specifity_word[each_key_word]


                    
                    cos_similarity = all_current_embeddings[idx][j][idx2][our_index]

                
                    did_not_split_compound_word = 1
                    cos_similarity1 = -1

                    
                    if (i not in weak_values_all[idx]) and ((specifity_score2!=0 and cos_similarity >0.36) or cos_similarity >0.55) and (did_not_split_compound_word==1 or cos_similarity1 > 0.35 ):
                        
                        end_loop = 1
                        max_val = cos_similarity1
                        keydf = i
                       
                
                    elif match_words_with_tf_idf_valuess[(i, idx)] > min_25th and cos_similarity<=0.35 and ((i, each_key_word) not in are_terms_asssociated or are_terms_asssociated[(i, index_label)] != 0):
                        
                        similar_words_to_that =get_sim_list(each_key_word, i, index_label, dict_words_same_category)
                       
                        if each_key_word not in specifity_word.keys():
                            specifity_word[each_key_word] = get_specificity(each_key_word)
                        specifity1 = specifity_word[each_key_word]
                        if each_key_word not in noun_for_adj_dict.keys():
                            noun_for_adj_dict[each_key_word] = generate_noun_for_adj(each_key_word, embeddings_index)
                        noun_for_adj = noun_for_adj_dict[each_key_word]

                        if noun_for_adj not in specifity_word.keys():
                            specifity_word[noun_for_adj] = get_specificity(noun_for_adj)
                        specifity1_5 = specifity_word[noun_for_adj]



                        if i not in specifity_word.keys():
                            specifity_word[i] = get_specificity(i)
                        specifity2 = specifity_word[i]

                        if i not in noun_for_adj_dict.keys():
                            noun_for_adj_dict[i] = generate_noun_for_adj(i, embeddings_index)
                        noun_for_adj2 = noun_for_adj_dict[i]

                        if noun_for_adj2 not in specifity_word.keys():
                            specifity_word[noun_for_adj2] = get_specificity(noun_for_adj2)
                        specifity2_5 = specifity_word[noun_for_adj2]
                        
                        is_very_specific = max(specifity1, specifity1_5)
                        is_second_term = max(specifity2, specifity2_5)
                                  
                        

                        mult = 1
                        values = []

                        if len(i.split("_")) > 1 and each_key_word==core[index_label] and core[index_label]==original_core[index_label]:
                            values_list = np.array([embeddings_index.get(wd, np.zeros(300)) for wd in i.split("_")], dtype=np.float32, order='C')
                            number_splitted_words = len(i.split("_"))
                            faiss.normalize_L2(values_list)
                            elements = list(np.dot(embeddings_index[each_key_word], values_list.T))
                            elements = list(filter(lambda x: x>0.5, elements))
                            if len(elements)!=0:
                                mult = len(elements)/number_splitted_words
                                values = [1,2,3]
                            else:
                                mult = 0
                                continue
                            are_terms_correlated[(i, each_key_word)] = mult


                        if values == [] and (i, each_key_word) in are_terms_correlated and (similar_words_to_that == None or similar_words_to_that==set()):
                            
                            mult = are_terms_correlated[(i, each_key_word)]
                           
                            if mult == 0:
                                continue
                            else:
                                values = [1,2,3]
                        elif values == [] and (similar_words_to_that == None or similar_words_to_that==set()):
                            mult=0
                            
                            if i in correlated_terms_with_each_word.keys():
                                correlated_terms_filter_list = correlated_terms_with_each_word[i]
                                value_mlt = 1
                                if (i, each_key_word) not in does_contain_similar_elements:
                                    bool_value = eliminate_wrong_similar_elements(i, each_key_word, correlated_terms_filter_list)
                                    if bool_value:
                                        values = [1,2,3]
                                    else:
                                        values = []
                                    does_contain_similar_elements[(i, each_key_word)] = values
                                    if values == list():
                                        set_temp = set()
                                        
                                        for val in correlated_terms_with_each_word[i]:
                                            if val in correlated_terms_with_each_word.keys():
                                                lzt = list(filter(lambda x: x in our_words['new_col2'][idx] , correlated_terms_with_each_word[val]))
                                                set_temp.update(lzt)
                                                                                    
                                        if set_temp != set():

                                            values =  list(set_temp)
                                            values_emb = np.array([embeddings_index[wd] for wd in values],dtype=np.float32, order='C')
                                            our_values = list(np.dot(values_emb, embeddings_index[each_key_word].T))
                                            is_true = any(val>=0.5 for val in our_values)

                                            if(is_true):
                                                values = [1,2,3]
                                            value_mlt = 0.05
                                    else:
                                        does_contain_similar_elements[(i, each_key_word)] = values
                                        values = does_contain_similar_elements[(i, each_key_word)]
                                else:
                                    values = does_contain_similar_elements[(i, each_key_word)]
                                        

            
                                if values !=list():
                                    values = [1,2,3]
                                    mult=1
                                mult = mult * value_mlt
                                are_terms_correlated[(i, each_key_word)] = mult * value_mlt

                            elif len(i.split("_")) > 1:
                                later_list = []
                                now_list = []
                                mult = 0

                                values = []

                                for wd2 in i.split("_"):
                                    curr_list = set()
                                    values2 = []
                                    
                                    
                                    if wd2 in correlated_terms_with_each_word.keys():
                                        correlated_terms_filter_list = correlated_terms_with_each_word[wd2]
                                        if (wd2, each_key_word) not in does_contain_similar_elements and wd2 in embeddings_index.keys():
                                            bool_value = eliminate_wrong_similar_elements(wd2, each_key_word, correlated_terms_filter_list)
                                            if bool_value:
                                                values2 = [1,2,3]
                                            else:
                                                values2 = []
                                            does_contain_similar_elements[(wd2, each_key_word)] = values2
                                            
                                        elif wd2 not in embeddings_index.keys():
                                            does_contain_similar_elements[(wd2, each_key_word)] = []
                                        
                                        values2 = does_contain_similar_elements[(wd2, each_key_word)]

                                        if values2!=[]:
                                            values = values2.copy()
                                            now_list.append(wd2)
                                            mult+=1/len(i.split("_"))

                                    
                                    elif wd2 in embeddings_index.keys():
                                        later_list.append(wd2)
                                
                                if now_list!=[] and later_list!=[]:
                                    now_values_list = np.array([embeddings_index[wd_now] for wd_now in now_list], dtype=np.float32, order='C')
                                    faiss.normalize_L2(now_values_list)
                                    later_values_list = np.array([embeddings_index[wd_later] for wd_later in later_list], dtype=np.float32, order='C')
                                    faiss.normalize_L2(later_values_list)

                                    current_scores_later = np.dot(later_values_list, now_values_list.T)

                                    for idx3, our_word in enumerate(later_list):
                                        values_list = current_scores_later[idx3]

                                    
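                                        # True when this word is close (cosine >= 0.7) to any already-matched part
                                        values_list = any(val>=0.7 for val in values_list)

                                        # values_list is a bool here; comparing it with [] was always true
                                        if values_list: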
                                            mult+=1/len(i.split("_"))

                                            if our_word in our_words['new_col2'][idx]:
                                                if (our_word, each_key_word) in are_terms_correlated and are_terms_correlated[(our_word, each_key_word)] == 0:
                                                    our_words['new_col2'][idx].remove(our_word)
                                                    our_words['new_col2'][idx].append(our_word)


                                                are_terms_correlated[(our_word, each_key_word)] = 1

                                            
                                if values == []:
                                    mult=0
                            are_terms_correlated[(i, each_key_word)] = mult
            
                        
                        is_generic_term = (i in original_core and i not in core) or (i in original_domain and i not in domain)
                        
                        if (mult!=0 and ((similar_words_to_that is not None and similar_words_to_that!=set()) or (values!=list())) and not is_generic_term):
                            
                            end_loop = 1

                            list_set.add(wd.lower())
                            frequency_strong_terms_per_label[(i,index_label)] = list_set
                            are_terms_correlated[(i, each_key_word)] = mult
                            each_key_word1 = each_key_word
                            if each_key_word not in average_tf_idf.keys() and each_key_word1 in convertPluralToSingular.keys() and convertPluralToSingular[each_key_word1] in average_tf_idf.keys():
                                each_key_word1 = convertPluralToSingular[each_key_word]
                            if len(list_elements[index_label]) == 3 and (each_key_word1 not in average_tf_idf or  average_tf_idf[each_key_word1]>0.075):
                                mult3 = 0.07
                                our_current_score = list_scores[idx][index_label][idx2] * mult3 * mult
                            
                                list_matrix[idx][j] = (list_matrix[idx][j][0]+our_current_score, list_matrix[idx][j][1])
                            elif len(list_elements[index_label]) == 2 and (each_key_word1 not in average_tf_idf or average_tf_idf[each_key_word1]>0.075):
                                mult3 = 0.06
                                our_current_score = list_scores[idx][index_label][idx2] * mult3 * mult
                                list_matrix[idx][j] = (list_matrix[idx][j][0]+our_current_score, list_matrix[idx][j][1])
                            elif each_key_word1 not in average_tf_idf or average_tf_idf[each_key_word1]>0.075:
                                mult3 = 0.06
                                our_current_score = list_scores[idx][index_label][idx2] * mult3 * mult
                                
                                list_matrix[idx][j] = (list_matrix[idx][j][0]+our_current_score, list_matrix[idx][j][1])
                        else:
                            are_terms_correlated[(i, each_key_word)] = 0
                            are_terms_asssociated[(i, index_label)] = 0
            if end_loop == 0 and list_matrix[idx][j][0]!=-1:
                list_matrix[idx][j] = (-1, list_matrix[idx][j][1])
            else:        
                last_index_elements[idx] = index_label



In [55]:
final_list = []
elements_filtered = []
for i in range(len(df['description'])):
    list_matrix[i].sort(reverse=True)
    
    list111 = []

    max_value = list_matrix[i][0][0]


    row_filtered = []
    
    for j in range(30):


        
        list111.append(our_classes['label'][list_matrix[i][j][1]])
        row_filtered.append(list_matrix[i][j])
        
       
      
    
    final_list.append(list111)
    elements_filtered.append(row_filtered)
    

In [56]:
final_list = []
elements_filtered = []

for i in range(len(df['description'])):
    list_matrix[i].sort(reverse=True)
    
    list111 = []

    max_value = list_matrix[i][0][0]


    row_filtered = []
    for j in range(30):
        if max_value - list_matrix[i][j][0] <= 0.2:
            list111.append(our_classes['label'][list_matrix[i][j][1]])
            row_filtered.append(list_matrix[i][j])
    
    final_list.append(list111)
    elements_filtered.append(row_filtered)
    
    

In [57]:
ignore_if_label_contains = ["promotion", "solutions", "support", "building"]
terms_to_consider_ignoring = ["commercial"]

In [58]:
weak_values_all_copy = weak_values_all.copy()

### Weak Label Invalidation

After scoring all label candidates, we apply an additional pass to **invalidate weak labels** when stronger ones are present:

- For each company:
  - If the **top label score ≥ 0.30** (i.e., a strong match exists):
    - Among the top 50 candidates, every label scoring **< 0.30** and within **0.4 of the top score** is **discarded** by setting its score to `-1`.
    - This prevents weaker labels from influencing downstream selection or ranking.
    
This ensures that only **semantically confident** labels are retained, while borderline or noisy ones are filtered out early in the pipeline.
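
The rule above can be sketched on a single company's candidate list. This is a minimal illustration, not the notebook's own cell: the function name `invalidate_weak_labels` and the toy scores are assumptions, while the thresholds (0.30, 0.4, top 50) are the ones stated above.

```python
def invalidate_weak_labels(scored, strong=0.30, window=0.40, top_k=50):
    """Return a re-sorted copy of `scored` with weak labels set to -1.

    `scored` is a list of (score, label_index) tuples for one company.
    """
    ranked = sorted(scored, reverse=True)
    # Without a strong top match, nothing is invalidated.
    if not ranked or ranked[0][0] < strong:
        return ranked
    best = ranked[0][0]
    for pos, (score, label) in enumerate(ranked[:top_k]):
        # Weak label that sits inside the window below the top score.
        if best - score <= window and score < strong:
            ranked[pos] = (-1, label)
    return sorted(ranked, reverse=True)


# Hypothetical scores for one company.
example = [(0.62, 3), (0.28, 7), (0.35, 1), (0.10, 9)]
result = invalidate_weak_labels(example)
```

In this toy case label 7 (score 0.28) is invalidated, while label 9 (score 0.10) lies more than 0.4 below the top score and is therefore left untouched by this pass.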

In [59]:
for i in range(len(df['description'])):
    list_matrix[i].sort(reverse=True)
    max_value = list_matrix[i][0][0]

    if list_matrix[i][0][0] >=0.3:
        for j in range(50):
            
            if max_value - list_matrix[i][j][0] <= 0.4:
                if list_matrix[i][j][0] < 0.3:
                    list_matrix[i][j] = (-1, list_matrix[i][j][1])
                

    list_matrix[i].sort(reverse=True)

In [60]:
old_list_matrix = []
for i in range(len(df['description'])):
    list_matrix[i].sort(reverse=True)
    old_list_matrix.append(list_matrix[i].copy())

In [61]:
final_list = []
elements_filtered = []

for i in range(len(df['description'])):
    list_matrix[i].sort(reverse=True)
    
    list111 = []

    max_value = list_matrix[i][0][0]


    row_filtered = []
    for j in range(30):
        if max_value - list_matrix[i][j][0] <= 0.2:
            list111.append(our_classes['label'][list_matrix[i][j][1]])
            row_filtered.append(list_matrix[i][j])
    
    final_list.append(list111)
    elements_filtered.append(row_filtered)
    
    

In [62]:
hasContext_relevant_words = {}
for iz in domain:
    if iz in hasContext_relationship.keys():
        if len(iz.split("_")) > 1:
            list_values = list(set(hasContext_relationship[iz]))
            embeddings_value = np.array([embeddings_index.get(wd, np.zeros(300)) for wd in list_values], dtype=np.float32, order='C')
            our_values = np.dot(embeddings_index[iz], embeddings_value.T)
            list_values_range = range(len(list_values))

            wd_splitted_embeddings = np.array([embeddings_index.get(wd, np.zeros(300)) for wd in iz.split("_")], dtype=np.float32, order='C')
            faiss.normalize_L2(wd_splitted_embeddings)

            list_values_range = list(filter(lambda x: our_values[x]>=0.4 and our_values[x]>np.max(np.dot(embeddings_index[list_values[x]], wd_splitted_embeddings.T)), list_values_range))
            if list_values_range!=[]:
                values_add = []
                for i in list_values_range:
                    values_add.append(list_values[i])
                hasContext_relevant_words[iz] = values_add
                


In [63]:
last_index_elements = [-1] * len(our_words['new_col2'])
idx=-1
each_key_word_specifity_score = {}
while idx + 1 < len(our_words['new_col2']):
    idx+=1
    
    list_min_tf_idf=[]
    for each_word in our_words['new_col2'][idx]:
        list_min_tf_idf.append(match_words_with_tf_idf_valuess[(each_word, idx)])
    list_min_tf_idf.sort()

    list_matrix[idx].sort(reverse=True)
    max_val = list_matrix[idx][0][0]

    if list_matrix[idx][0][0] - list_matrix[idx][1][0]>=0.3 and list_matrix[idx][1][0]<=0.5:
        continue

   
    
    
    words_eliminated=0

    copy_list_matrix = list_matrix[idx].copy()
    no_val=0
    if strong_values_all[idx] == []:
        no_val = 10000.0
    else:
        no_val = (len(weak_values_all[idx])/len(strong_values_all[idx]))

    copy_weak = weak_values_all[idx].copy()
    if no_val > 3.5:
        weak_values_all[idx] = []

    for j in range(20):

        list_matrix_copy_row = list_matrix[idx].copy()
        no_key_word_eliminated=0
        index_label = list_matrix[idx][j][1]
        
        if (idx, index_label) not in frequency_strong_terms_per_label.keys() or list_matrix[idx][j][0] == -1:
            words_eliminated+=1
            list_matrix[idx][j] = (-1, list_matrix[idx][j][1])
            continue
     

        
        
      
        for idx2, each_key_word in enumerate(list_elements[index_label]):               

            end_loop = -1

            

            ignore_term = not(idx2 == 1 or ((core[index_label] !='0' and core[index_label]==original_core[index_label]) and each_key_word!=original_core[index_label]) or ((modifier[index_label]!='0' and modifier[index_label]==original_modifier[index_label]) and each_key_word!=original_modifier[index_label]) or ((domain[index_label]!='0' and domain[index_label]==original_domain[index_label]) and each_key_word!=original_domain[index_label])) and each_key_word in terms_to_consider_ignoring
          
            
           
            
            if  each_key_word not in ignore_if_label_contains and (ignore_term or each_key_word not in terms_to_consider_ignoring):
                end_loop = 0
                max_val = -1
                keydf = ""

                max_v=-1

                
        
                array_value = np.array([embeddings_index[wd] for wd in our_words['new_col2'][idx]], dtype=np.float32, order='C')
                faiss.normalize_L2(array_value)

                our_values_all = np.dot(embeddings_index[each_key_word], array_value.T)
                if each_key_word == domain[index_label]:
                    
                    max_value = np.max(our_values_all)
                    if max_value < 0.3:
                        list_matrix[idx][j] = (-1, list_matrix[idx][j][1])
                        no_key_word_eliminated+=1
                        continue
                merge_word = ''
                cos_merged = 0.0

                if each_key_word == original_core[index_label] and original_core[index_label] != core[index_label] and f"{original_domain[index_label]}_{original_core[index_label]}" in embeddings_index.keys():
                        if modifier[index_label] == '0':
                            merge_word = f"{original_domain[index_label]}_{original_core[index_label]}"
                elif each_key_word == original_core[index_label] and original_core[index_label] != core[index_label] and modifier[index_label]=="0" and original_domain[index_label] in isA_relationship.keys():
                    
                    for word_sub in isA_relationship[original_domain[index_label]]:
                        if f"{word_sub}_{original_core[index_label]}" in embeddings_index.keys():
                            merge_word = f"{word_sub}_{original_core[index_label]}"
                            break
                if each_key_word == original_core[index_label] and original_core[index_label] != core[index_label] and original_modifier[index_label]!="0" and f"{original_modifier[index_label]}_{original_core[index_label]}" in embeddings_index.keys():      
                    merge_word = f"{original_modifier[index_label]}_{original_core[index_label]}"
                elif each_key_word == original_core[index_label] and original_core[index_label] != core[index_label] and original_modifier[index_label]!="0" and original_modifier[index_label] in isA_relationship.keys():
                    for word_sub in isA_relationship[original_modifier[index_label]]:
                        if f"{word_sub}_{original_core[index_label]}" in embeddings_index.keys():
                            merge_word = f"{word_sub}_{original_core[index_label]}"
                            break                 
                    
                if each_key_word == original_modifier[index_label] and original_modifier[index_label] != modifier[index_label] and f"{original_domain[index_label]}_{original_modifier[index_label]}" in embeddings_index.keys():
                    merge_word = f"{original_domain[index_label]}_{original_modifier[index_label]}"
                elif each_key_word == original_modifier[index_label] and original_modifier[index_label] != modifier[index_label] and original_domain[index_label] in isA_relationship.keys():
                    for word_sub in isA_relationship[original_domain[index_label]]:
                        if f"{word_sub}_{original_modifier[index_label]}" in embeddings_index.keys():
                            merge_word = f"{word_sub}_{original_modifier[index_label]}"                            
                            break
                
                
                if merge_word != '':
                    values_emb = np.array([embeddings_index[wd] for wd in our_words['new_col2'][idx]], dtype=np.float32, order='C')
                    faiss.normalize_L2(values_emb)
                    values_merges = np.dot(embeddings_index[merge_word], values_emb.T)
                else:
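                    # One zero per company token, so values_merges[idx2] is valid for every token index
                    values_merges = np.zeros(len(our_words['new_col2'][idx]))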
                
                for idx2, i in enumerate(our_words['new_col2'][idx]):

                    
                    if each_key_word not in specifity_word.keys():
                        specifity_word[each_key_word] = get_specificity(each_key_word)
                    specifity_score2 = specifity_word[each_key_word]


                    cos_similarity = our_values_all[idx2]
                    cos_merged = values_merges[idx2]
                    

                    did_not_split_compound_word = 1                    
                    is_correlated = (i, each_key_word) in are_terms_correlated and are_terms_correlated[(i, each_key_word)]>0.05

                   

                    if each_key_word !=original_core[index_label] and core[index_label] != "0" and core[index_label] != original_core[index_label] and list_matrix_core_val[idx][index_label][idx2]>0.7:
                        continue
                    is_okay = False
                    if len(i.split("_"))>1 and cos_similarity < 0.3 and each_key_word!=original_domain[index_label] and each_key_word == original_core[index_label]:
                        array_emb = np.array([embeddings_index.get(wd, np.zeros(300)) for wd in i.split("_")], dtype=np.float32, order='C')
                        faiss.normalize_L2(array_emb)
                        values_list = np.max(np.dot(embeddings_index[each_key_word], array_emb.T))
                        values_list_domain = np.max(np.dot(embeddings_index[original_domain[index_label]], array_emb.T))

                        if values_list > 0.5 and values_list_domain > 0.6:
                            is_okay = True

                    if len(i.split("_"))>1 and cos_similarity < 0.3 and original_modifier[index_label]!='0' and each_key_word!=original_modifier[index_label] and each_key_word == original_core[index_label]:
                        array_emb = np.array([embeddings_index.get(wd, np.zeros(300)) for wd in i.split("_")], dtype=np.float32, order='C')
                        faiss.normalize_L2(array_emb)
                        values_list = np.max(np.dot(embeddings_index[each_key_word], array_emb.T))
                        values_list_domain = np.max(np.dot(embeddings_index[original_modifier[index_label]], array_emb.T))

                        if values_list > 0.5 and values_list_domain > 0.6:
                            is_okay = True
                    
                    if len(i.split("_"))>1 and cos_similarity < 0.3 and each_key_word!=original_domain[index_label] and each_key_word == original_modifier[index_label]:
                        array_emb = np.array([embeddings_index.get(wd, np.zeros(300)) for wd in i.split("_")], dtype=np.float32, order='C')
                        faiss.normalize_L2(array_emb)
                        values_list = np.max(np.dot(embeddings_index[each_key_word], array_emb.T))
                        values_list_domain = np.max(np.dot(embeddings_index[original_domain[index_label]], array_emb.T))

                        if values_list > 0.5 and values_list_domain > 0.6:
                            is_okay = True

                    
                        
                   
                    if each_key_word == domain[index_label] and domain[index_label]==original_domain[index_label] and i in average_tf_idf and average_tf_idf[i] > 0.075 and (i not in weak_values_all[idx] or (match_words_with_tf_idf_valuess[(i, idx)]>=list_min_tf_idf_all[idx][-len(our_words['new_col2'][idx])//5])) and ((cos_similarity > 0.325 or is_correlated or is_okay)) or ((cos_similarity>0.41 or cos_merged>0.42 or is_okay) and i in average_tf_idf and  average_tf_idf[i] > 0.1) or cos_similarity>0.5:
                        if each_key_word in hasContext_relevant_words.keys():
                            related_words = hasContext_relevant_words[each_key_word]
                            values_list = np.array([embeddings_index.get(wd, np.zeros(300)) for wd in related_words], dtype=np.float32, order='C')
                            faiss.normalize_L2(values_list)
                            values_min = np.min(np.dot(values_list, embeddings_index[i].T))
                            
                            if values_min < 0.20:
                                continue
                        end_loop = 1
                        max_val = cos_similarity
                        keydf = i

                    elif (each_key_word != domain[index_label] or domain[index_label]!=original_domain[index_label]) and i in average_tf_idf and average_tf_idf[i] > 0.065 and (i not in weak_values_all[idx] or (match_words_with_tf_idf_valuess[(i, idx)]>=list_min_tf_idf_all[idx][-len(our_words['new_col2'][idx])//5])) and ((cos_similarity > 0.30 or is_correlated or cos_merged>0.35 or is_okay) or ((cos_similarity > 0.25 or is_correlated or is_okay) and (len(core[index_label].split("::"))>1 and each_key_word==original_core[index_label]))) or ((cos_similarity>0.42 or cos_merged>0.42) and i in average_tf_idf and  average_tf_idf[i] > 0.1):
                        end_loop = 1
                        max_val = cos_similarity
                        keydf = i

                    if cos_similarity > 0.51:
                        
                        end_loop = 1
                        max_val = cos_similarity
                        keydf = i
                       
  
                        break

            
            if end_loop == 0 and list_matrix[idx][j][0]!=-1:  
                list_matrix[idx][j] = (-1, list_matrix[idx][j][1])
                no_key_word_eliminated+=1

            else:   
                last_index_elements[idx] = index_label

        if no_key_word_eliminated!=0:
            words_eliminated+=1
    if no_val > 3.5:
        weak_values_all[idx] = copy_weak
    if(words_eliminated==20):
        list_matrix[idx] = copy_list_matrix.copy()
    
        if weak_values_all[idx] != []:
            weak_values_all[idx] = []
            idx-=1



In [64]:
weak_values_all = weak_values_all_copy.copy()

### 4. Context-Aware Candidate Filtering (Post-Scoring Refinement)

After computing scores for all label candidates (`list_matrix`), we apply a **contextual filtering step** to remove weak or irrelevant matches using semantic signals.

#### How It Works:

- For each company:
  - Retain the top 30 label candidates, **as long as their score is within 0.15** of the best one.
  - For each of these, compute contextual alignment using:
    - `first_sentence_matrix` → alignment with the company’s opening sentence.
    - `context_matrix` → general semantic similarity (via BERT embeddings).
    - `context_matrix_niche` → similarity to niche-specific signals.
    - `context_matrix_category` → similarity to broader category context.

- **Keep the label** if at least one of the following holds:
  - `first_sentence_matrix > 0.3`
  - `context_matrix > 0.3`
  - `context_matrix_niche > 0.5`
  - `context_matrix_category > 0.5`

- All other labels are dropped by setting their score to `-1` in `list_matrix`. If no candidate passes any of the checks, the scores are left unchanged.
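
The filtering rule can be sketched for one company as follows. This is an illustrative helper under stated assumptions: the name `context_filter` and the `signals` dictionary (label index mapped to the four context similarities) are hypothetical, and the thresholds are the ones listed above.

```python
def context_filter(ranked, signals, margin=0.15, top_k=30):
    """Drop candidates that no context signal supports.

    ranked:  list of (score, label_index), sorted by score descending.
    signals: label_index -> (first_sentence, context, niche, category).
    """
    best = ranked[0][0]
    # Candidates are the leading entries whose score is within `margin` of the best.
    candidates = [pos for pos in range(min(top_k, len(ranked)))
                  if best - ranked[pos][0] <= margin]

    def passes(pos):
        first_sentence, context, niche, category = signals[ranked[pos][1]]
        return (first_sentence > 0.3 or context > 0.3
                or niche > 0.5 or category > 0.5)

    kept = {pos for pos in candidates if passes(pos)}
    if not kept:
        # No candidate is supported: leave the scores unchanged.
        return list(ranked)
    return [(score, label) if pos in kept else (-1, label)
            for pos, (score, label) in enumerate(ranked)]


# Hypothetical scores and context similarities for one company.
ranked = [(0.80, 0), (0.72, 1), (0.70, 2), (0.40, 3)]
signals = {
    0: (0.35, 0.10, 0.00, 0.00),  # supported by the first sentence
    1: (0.10, 0.20, 0.20, 0.10),  # no supporting signal
    2: (0.00, 0.00, 0.60, 0.00),  # supported by the niche
    3: (0.90, 0.90, 0.90, 0.90),  # outside the 0.15 margin
}
filtered = context_filter(ranked, signals)
```

Label 3 is dropped despite its high context similarities because it never enters the candidate set: only labels within the score margin are eligible to be kept.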

In [65]:
final_list = []
elements_filtered = []
z=0
for i in range(len(df['description'])):
    list_matrix[i].sort(reverse=True)
    
    list111 = []
    list_el = []
    list_el2 = []
    list_el3 = []
    list_el4 = []

    max_value = list_matrix[i][0][0]


    row_filtered = []


    for j in range(30):
        
        if max_value - list_matrix[i][j][0] <= 0.15:
            list111.append(our_classes['label'][list_matrix[i][j][1]])
            row_filtered.append(list_matrix[i][j])
            list_el.append((our_classes['label'][list_matrix[i][j][1]], first_sentence_matrix[i][list_matrix[i][j][1]]))
            list_el2.append((our_classes['label'][list_matrix[i][j][1]], context_matrix[i][list_matrix[i][j][1]]))
            list_el3.append((our_classes['label'][list_matrix[i][j][1]], context_matrix_niche[i][list_matrix[i][j][1]]))
            list_el4.append((our_classes['label'][list_matrix[i][j][1]],context_matrix_category[i][list_matrix[i][j][1]]))
        
    if len(row_filtered)==1:
        z+=1
    
    final_list.append(list111)
    list_range = list(range(len(row_filtered)))
   
    list_range_filtered = list(filter(lambda x: list_el[x][1] > 0.3 or list_el2[x][1]>0.3 or list_el3[x][1]>0.5 or list_el4[x][1]>0.5 , list_range))
    if list_range_filtered != []:
        new_list = []
        for k in list_range_filtered:
            new_list.append(list111[k])
        for indices in range(len(list_matrix[i])):
            if indices not in list_range_filtered:
                list_matrix[i][indices] = (-1, list_matrix[i][indices][1])
    elements_filtered.append(row_filtered)


In [66]:
final_list = []
row_no1=0
for i in range(len(df['description'])):
    list_matrix_original_2[i].sort(reverse=True)
    
    list111 = []

    max_value = list_matrix_original_2[i][0][0]


    row_filtered = []
    for j in range(20):
        if max_value - list_matrix_original_2[i][j][0]<=0.1:
            list111.append(our_classes['label'][list_matrix_original_2[i][j][1]])
            row_filtered.append(list_matrix_original_2[i][j])
    final_list.append(list111)
  
    
    

In [67]:


for i, (idx, _) in enumerate(df.iterrows()):
    list_matrix[i].sort(reverse=True)
    

    max_value = list_matrix[i][0][0]

    if list_matrix[i][0][0] - list_matrix[i][1][0]>=0.1:
        continue


    row_filtered = []
    for j, _ in enumerate(elements_filtered[i]):
        if max_value - list_matrix[i][j][0] <= 0.1:

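            # Boost a candidate by 0.30 when both the first-sentence similarity (>= 0.43)
            # and the general context similarity (>= 0.45) agree on it.
            # The former outer check (first sentence >= 0.41) is implied by the stricter one.
            if (context_matrix[i][list_matrix[i][j][1]]>=0.45) and (first_sentence_matrix[i][list_matrix[i][j][1]]>=0.43):
                list_matrix[i][j] = (list_matrix[i][j][0]+0.30, list_matrix[i][j][1])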

      
    

In [68]:
final_list = []
elements_filtered = []
row_no1=0
for i in range(len(df['description'])):
    list_matrix[i].sort(reverse=True)
    
    list111 = []

    max_value = list_matrix[i][0][0]


    row_filtered = []
    for j in range(20):
        if max_value - list_matrix[i][j][0] <= 0.1:
            list111.append(our_classes['label'][list_matrix[i][j][1]])
            row_filtered.append(list_matrix[i][j])
    
    final_list.append(list111)

    if(len(row_filtered)==1):
        row_no1+=1
    elements_filtered.append(row_filtered)


In [69]:
weak_values_all = weak_values_all_copy.copy()

### Sector-Aware Label Filtering

This module refines the top label predictions by checking if key terms from each label (core/modifier/domain) align with the company's sector and context.

#### Purpose

To prevent selecting labels that score high numerically but are semantically inconsistent with the company’s actual business activity.


#### Logic Summary

1. **Normalize Cores**: Convert plural core terms to singular using a mapping dictionary.

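2. **Sector-Term Vocabulary**:  
   - Collect the distinct sector names from the sector metadata.
   - Remove a sector name when it appears as the last word of a label but is not a label core (or a standalone domain).

3. **Strong Term Identification**:  
   - A core term is considered “strong” if its cosine similarity with at least one sector term exceeds 0.4.

4. **Per-Company Validation**:
   - The check runs only when the top two scores are less than 0.10 apart and the company's sector name appears among its tokens.
   - For each label, extract a representative term: the domain when the label has no modifier and no core, otherwise the core. Labels whose representative term is not strong are skipped.
   - Compare this term’s embedding with:
     - The sector embedding (threshold 0.4)
     - Category/niche tokens (threshold 0.5)
     - All company context tokens (threshold 0.55)
   - If all three similarities are below their thresholds, remove the label.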

5. **Fallback**:  
   - If all filtered labels are eliminated, revert to the original top labels.
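
Steps 3 and 4 can be sketched with plain NumPy. This is a simplified illustration rather than the notebook's implementation: the helper names (`unit`, `max_cosine`, `is_strong_term`, `label_survives`) and the 3-dimensional toy vectors are assumptions, and vectors are normalized here with NumPy instead of FAISS.

```python
import numpy as np


def unit(v):
    """L2-normalize a vector or each row of a matrix (zero rows stay zero)."""
    v = np.asarray(v, dtype=np.float32)
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / np.where(norm == 0, 1, norm)


def max_cosine(term_vec, matrix):
    """Highest cosine similarity between `term_vec` and any row of `matrix`."""
    matrix = np.asarray(matrix, dtype=np.float32)
    if matrix.size == 0:
        return 0.0
    return float(np.max(unit(matrix) @ unit(term_vec)))


def is_strong_term(term_vec, sector_vecs, threshold=0.4):
    """Step 3: a term is strong if it is close to at least one sector term."""
    return max_cosine(term_vec, sector_vecs) > threshold


def label_survives(term_vec, sector_vec, cat_niche_vecs, token_vecs):
    """Step 4: remove the label only if all three similarities are low."""
    sector_sim = max_cosine(term_vec, [sector_vec])
    cat_niche_sim = max_cosine(term_vec, cat_niche_vecs)
    token_sim = max_cosine(term_vec, token_vecs)
    return not (sector_sim < 0.4 and cat_niche_sim < 0.5 and token_sim < 0.55)


# Toy embeddings (hypothetical values).
term = [1.0, 0.0, 0.0]
sector = [0.0, 1.0, 0.0]
cat_niche = [[0.0, 0.0, 1.0]]
supported = label_survives(term, sector, cat_niche, [[1.0, 1.0, 0.0]])
unsupported = label_survives(term, sector, cat_niche, [[0.0, 1.0, 0.0]])
```

In the first call a company token has cosine similarity of about 0.71 with the term, so the label survives. In the second call none of the three signals reaches its threshold and the label would be removed.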


#### Benefit

Ensures that selected labels are not just similar in wording, but contextually appropriate based on the company’s sector and descriptive metadata.

In [70]:
simplified_cores= []
original_cores = []

In [71]:
for core_ in original_core:
    if core_ in convertPluralToSingular:
        simplified_cores.append(convertPluralToSingular[core_])
        original_cores.append(core_)
    else:
        original_cores.append(core_)
        simplified_cores.append(core_)

In [72]:
set_element_sectors = set()
our_sector=our_sector.dropna()
for i in our_sector['sector']:
    set_element_sectors.add(i)

sectors_last = set()
for label in our_classes['label']:
    sectors_last.add(label.split(" ")[-1].lower())

sectors_cores = set()
for co in original_core:
    if co != '0':
        sectors_cores.add(co)

for idx, la in enumerate(original_domain):
    if original_modifier[idx] == '0' and original_cores[idx] == '0':
        sectors_cores.add(la)

list_set_element_sectors = set_element_sectors.copy()
for i in list_set_element_sectors:
    if i not in sectors_cores and i in sectors_last:
        set_element_sectors.remove(i)
set_element_sectors = set(list(filter(lambda x: x!='', set_element_sectors)))
list_element_sectors = list(set_element_sectors)

list_element_sectors_embeddings = np.array([embeddings_index[wd] for wd in list_element_sectors], dtype=np.float32, order='C')
faiss.normalize_L2(list_element_sectors_embeddings)

strong_terms = []
for i in set(original_core):
    if i != '0' and np.max(np.dot(list_element_sectors_embeddings, embeddings_index[i].T))>0.4:
        strong_terms.append(i)


In [73]:
our_sector['sector']=our_sector['sector'].fillna("")

In [74]:
last_index_elements = [-1] * len(our_words['new_col2'])
zzz=0
for idx, _ in enumerate(our_words['new_col2']):
    list_min_tf_idf=[]
    for each_word in our_words['new_col2'][idx]:
        list_min_tf_idf.append(match_words_with_tf_idf_valuess[(each_word, idx)])
    list_min_tf_idf.sort()

    list_matrix[idx].sort(reverse=True)
    max_val = list_matrix[idx][0][0]

   
    
    
    words_eliminated=0

    copy_list_matrix = list_matrix[idx].copy()

    if list_matrix[idx][0][0] - list_matrix[idx][1][0]>=0.10:
        continue
    no_key_word_eliminated=0
    list_matrix_copy_row = list_matrix[idx].copy()
    last_element = -1
        
    for  j in range(30):

        if list_matrix[idx][0][0] - list_matrix[idx][j][0]>=0.15:
            last_element = j
            break

        index_label = list_matrix[idx][j][1]
      
        our_terms = ""

        if modifier[index_label] == '0' and core[index_label]=="0":
            our_terms = original_domain[index_label]
        else:
            our_terms = original_core[index_label]

     
        if our_terms not in strong_terms:
            continue
        if our_sector['sector'][idx] in our_words['new_col2'][idx]:
            our_array = np.array([embeddings_index[wd] for wd in our_categories_niche_words['niche_plus_cat'][idx]], dtype=np.float32, order='C')
            our_array_words = np.array([embeddings_index[wd] for wd in our_words['new_col2'][idx]], dtype=np.float32, order='C')

            if our_categories_niche_words['niche_plus_cat'][idx] != []:
                faiss.normalize_L2(our_array)
                max_sim_cat_niche = np.max(np.dot(embeddings_index[our_terms], our_array.T))
            else:
                max_sim_cat_niche = 0

            if our_words['new_col2'][idx] != []:
                faiss.normalize_L2(our_array_words)
                max_words = np.max(np.dot(embeddings_index[our_terms], our_array_words.T))

            else:
                max_words = 0


            cos_similarity = np.dot(embeddings_index[our_terms],embeddings_index[our_sector['sector'][idx]].T )

            if cos_similarity < 0.4 and max_sim_cat_niche < 0.5 and max_words < 0.55:
                no_key_word_eliminated+=1     
                list_matrix[idx][j] = (-1, list_matrix[idx][j][1])
                
            else:        
                last_index_elements[idx] = index_label
        else:
            last_index_elements[idx] = index_label
   
    if(no_key_word_eliminated==last_element):
       
       list_matrix[idx] = list_matrix_copy_row.copy()



### Category and Niche-Aware Boosting and Filtering

This module performs selective boosting of label scores based on metadata fields such as `niche` and `category`. The approach ensures that only contextually validated metadata contributes to boosting decisions, avoiding overreliance on noisy or weakly aligned metadata.

#### Core Logic

For each of the top label candidates in `list_matrix[i]`, the algorithm:

1. **Checks Label Presence in Niche**
   - If the label is found in the `niche` field, it proceeds to validate the `domain`, `modifier`, and `core` components of the label.
   - Validation is done using:
     - Cosine similarity between the label components and the filtered token set (`new_col2`) of the company.
     - Embeddings from `embeddings_index`, normalized via FAISS.
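   - If the domain is not well supported (maximum cosine similarity < 0.4), the label's score is set to -1 and its `context_matrix_category` entry is zeroed. If the modifier or core fails the same check, the label is simply not boosted.
   - If all components pass the validation, the label is boosted by +0.30.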

2. **Checks Label Presence in Category**
   - The same component checks as for niche are applied.
   - The difference is the trigger for the cosine check: it runs when a component's count in `our_counted_words` equals 2, instead of comparing that count against `our_counted_words_category`.

3. **Contextual Safeguards**
   - Even if the label appears in niche or category, its domain/modifier/core must also appear meaningfully in the full description or related fields.
   - This ensures that we avoid boosting false positives coming from overly broad, generic, or mismatched metadata.
   - If the label passes all checks, it is boosted; otherwise, it is penalized or ignored.

4. **Anchor Filtering**
   - Labels that do not meet the required semantic similarity thresholds are explicitly filtered.
   - The system ensures that labels are only retained if they fall within 0.1 of the top score, and are supported by strong tokens.
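
The component check and boost described above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the helper names `l2_normalize`, `component_supported`, and `boosted_score` are hypothetical, plain NumPy stands in for `faiss.normalize_L2`, and both sides of the dot product are normalized here, whereas the notebook normalizes only the token matrix.

```python
import numpy as np

def l2_normalize(matrix):
    """Row-wise L2 normalization; all-zero rows are left as zeros."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return matrix / norms

def component_supported(component_vec, token_vecs, threshold=0.4):
    """True if at least one company token reaches `threshold` cosine similarity."""
    if len(token_vecs) == 0:
        return False
    unit_component = component_vec / np.linalg.norm(component_vec)
    sims = l2_normalize(token_vecs) @ unit_component
    return float(sims.max()) >= threshold

def boosted_score(score, component_vecs, token_vecs, boost=0.30):
    """Add `boost` only when every label component is supported by the tokens."""
    if all(component_supported(c, token_vecs) for c in component_vecs):
        return score + boost
    return score

# Toy 2-dimensional embeddings (assumed values, for illustration only).
tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned = np.array([2.0, 0.0])    # points the same way as the first token
opposed = np.array([-1.0, -1.0])  # points away from both tokens

print(boosted_score(0.5, [aligned], tokens))
print(boosted_score(0.5, [aligned, opposed], tokens))
```

The first call boosts the score because the only component is supported. The second leaves it unchanged because one component has no token above the threshold.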

### Output

- `final_list`: Final list of boosted and filtered label names per company.
- `elements_filtered`: Intermediate list of (score, label) tuples that survived all filtering stages.
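
As a rough sketch of how these two outputs relate, the snippet below keeps every `(score, label_id)` tuple within 0.1 of the top score. The function name `within_top_window` and the sample scores are assumptions. Unlike the notebook cell, which applies the window while boosts are still being added, this sketch sorts the final scores once.

```python
def within_top_window(candidates, window=0.1):
    """Keep (score, label_id) tuples whose score is within `window` of the best one."""
    ranked = sorted(candidates, reverse=True)
    if not ranked:
        return []
    top_score = ranked[0][0]
    return [(score, label_id) for score, label_id in ranked
            if top_score - score <= window]

# Assumed scores for one company; -1 marks a label eliminated earlier.
row = [(0.92, 3), (0.55, 7), (0.85, 1), (-1, 4)]
row_filtered = within_top_window(row)
print(row_filtered)
```

Here `row_filtered` plays the role of one entry of `elements_filtered`. Mapping its label ids through `our_classes['label']` would give the matching entry of `final_list`.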

This layered approach ensures robustness against misleading metadata and enforces semantic coherence in label assignment.

In [75]:
final_list = []
elements_filtered = []
df['niche']=df['niche'].fillna("")
df['category']=df['category'].fillna("")
skip_categories_for_this_one = []

i=0

zz=0
for idx, _ in df.iterrows():
    
    list_matrix[i].sort(reverse=True)
    list111 = []

    
    row_filtered = []
        
    for j in range(30):

        index_label = list_matrix[i][j][1]
        okay = 1
        if our_classes['label'][index_label] in df['niche'][idx]:
            okay = 1
            if original_domain[index_label]!='0' and original_domain[index_label] not in our_counted_words['count_words'][idx]:
                okay = -1
            elif our_counted_words['count_words'][idx][original_domain[index_label]] == our_counted_words_category['count_cat'][idx][original_domain[index_label]]:
                values_domain = np.array([embeddings_index[wd] if wd != original_domain[index_label] and (wd not in our_categories_niche_words['niche_plus_cat'] or (our_counted_words['count_words'][idx][wd] != our_counted_words_category['count_cat'][idx][wd])) else np.zeros(300) for wd in our_words['new_col2'][idx]], dtype=np.float32, order='C')
                faiss.normalize_L2(values_domain)
                values_list = np.max(np.dot(embeddings_index[original_domain[index_label]], values_domain.T))
                if values_list < 0.4:
                    okay = -1
                    list_matrix[i][j] = (-1, list_matrix[i][j][1])
                    context_matrix_category[idx][index_label] = 0

                    skip_categories_for_this_one.append(i)
            
            
            if original_modifier[index_label]!='0' and original_modifier[index_label] not in our_counted_words['count_words'][idx]:
                okay = -1
            elif okay == 1 and modifier[index_label] != '0' and our_counted_words['count_words'][idx][original_modifier[index_label]] == our_counted_words_category['count_cat'][idx][original_modifier[index_label]]:
                values_domain = np.array([embeddings_index[wd] if wd != modifier[index_label] and (wd not in our_categories_niche_words['niche_plus_cat'] or our_counted_words['count_words'][idx][wd] != our_counted_words_category['count_cat'][idx][wd])  else np.zeros(300) for wd in our_words['new_col2'][idx]], dtype=np.float32, order='C')
                faiss.normalize_L2(values_domain)
                values_list = np.max(np.dot(embeddings_index[original_modifier[index_label]], values_domain.T))
                if values_list < 0.4:
                    okay = -1
                    

            if okay == 1 and core[index_label] != '0' and original_core[index_label] not in our_counted_words['count_words'][idx]:
                okay = -1
            
            elif okay == 1 and core[index_label] != '0' and original_core[index_label] in our_counted_words['count_words'][idx] and our_counted_words['count_words'][idx][original_core[index_label]] == our_counted_words_category['count_cat'][idx][original_core[index_label]]:
                values_domain = np.array([embeddings_index[wd] if wd != original_core[index_label] and (wd not in our_categories_niche_words['niche_plus_cat'] or our_counted_words['count_words'][idx][wd] != our_counted_words_category['count_cat'][idx][wd])  else np.zeros(300) for wd in our_words['new_col2'][idx]], dtype=np.float32, order='C')
                faiss.normalize_L2(values_domain)
                values_list = np.max(np.dot(embeddings_index[original_core[index_label]], values_domain.T))
                if values_list < 0.4:

                    okay = -1
            

            if (okay==1):
                list_matrix[i][j] = (list_matrix[i][j][0]+0.30, list_matrix[i][j][1])
                okay = 2


        # Skip the category boost when the niche boost was already applied (okay == 2).
        if okay != 2 and our_classes['label'][index_label] in df['category'][idx]:
            okay = 1
            if original_domain[index_label]!='0' and original_domain[index_label] not in our_counted_words['count_words'][idx]:
                okay = -1
            elif our_counted_words['count_words'][idx][original_domain[index_label]] == 2:
                values_domain = np.array([embeddings_index[wd] if wd != original_domain[index_label] and (wd not in our_categories_niche_words['niche_plus_cat'] or our_counted_words['count_words'][idx][wd] != 2) else np.zeros(300) for wd in our_words['new_col2'][idx]], dtype=np.float32, order='C')
                faiss.normalize_L2(values_domain)
                values_list = np.max(np.dot(embeddings_index[original_domain[index_label]], values_domain.T))
                if values_list < 0.4:
                    okay = -1
                    list_matrix[i][j] = (-1, list_matrix[i][j][1])
                    context_matrix_category[idx][index_label] = 0
                    skip_categories_for_this_one.append(i)

            
            
            if original_modifier[index_label]!='0' and original_modifier[index_label] not in our_counted_words['count_words'][idx]:
                okay = -1
            elif okay == 1 and modifier[index_label] != '0' and our_counted_words['count_words'][idx][original_modifier[index_label]] == 2:
                values_domain = np.array([embeddings_index[wd] if wd != modifier[index_label] and (wd not in our_categories_niche_words['niche_plus_cat'] or our_counted_words['count_words'][idx][wd] != 2)  else np.zeros(300) for wd in our_words['new_col2'][idx]], dtype=np.float32, order='C')
                faiss.normalize_L2(values_domain)
                values_list = np.max(np.dot(embeddings_index[original_modifier[index_label]], values_domain.T))
                if values_list < 0.4:
                    okay = -1

            if okay == 1 and core[index_label] != '0' and original_core[index_label] not in our_counted_words['count_words'][idx]:
                okay = -1
            
            elif okay == 1 and core[index_label] != '0' and original_core[index_label] in our_counted_words['count_words'][idx] and our_counted_words['count_words'][idx][original_core[index_label]] == 2:
                values_domain = np.array([embeddings_index[wd] if wd != original_core[index_label] and (wd not in our_categories_niche_words['niche_plus_cat'] or our_counted_words['count_words'][idx][wd] != 2)  else np.zeros(300) for wd in our_words['new_col2'][idx]], dtype=np.float32, order='C')
                faiss.normalize_L2(values_domain)
                values_list = np.max(np.dot(embeddings_index[original_core[index_label]], values_domain.T))
                if values_list < 0.4:
                    okay = -1
            

            if (okay==1):
                list_matrix[i][j] = (list_matrix[i][j][0]+0.30, list_matrix[i][j][1])
        
        
        
        
        if list_matrix[i][0][0] - list_matrix[i][j][0] <= 0.1:
            list111.append(our_classes['label'][list_matrix[i][j][1]])
            row_filtered.append(list_matrix[i][j])
        
        
        
    final_list.append(list111)
    
    elements_filtered.append(row_filtered)
    i+=1


In [76]:
last_index_elements2 = [-1] * len(our_words['new_col2'])


### Term Filtering and Anchor-Based Label Selection

In the final label selection step, we focus on **reinforcing matches around the most meaningful tokens** and filtering out noisy or weak candidates.


### How It Works:

1. **Anchor Term Selection**:
   - For each company:
     - Sort label candidates by similarity score.
     - Pick the label with the **highest number of strong matches** (cosine similarity ≥ **0.5**) with:
       - Business-important words (`strong_values_all[idx]`)
       - Niche/category terms (`our_categories_niche_words['niche_plus_cat'][idx]`)
     - If no label has a match ≥ 0.5, no anchor is chosen and the company's candidates are left unchanged. Matches ≥ **0.45** are used only to break ties.

2. **Fallback to Contextual Support**:
   - If there are ties or uncertainty:
     - Prefer labels supported by:
       - `context_matrix`
       - `context_matrix_niche` + high `similarity_to_niche`
       - `context_matrix_category` + high `similarity_to_cat`
       - `first_sentence_matrix`

3. **Fallback to Mean Cosine Similarity**:
   - Still unresolved?
     - Compute **mean cosine similarity** for each label vs. strong/niche tokens.
     - Choose the label with the **highest average similarity** as anchor.

4. **Candidate Filtering**:
   - Once an anchor is selected:
     - Discard any other label with:
       - Cosine similarity ≤ **0.75** to the anchor, or a structure already subsumed by the anchor (e.g. modifier/core match).
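
A simplified sketch of the anchor choice follows. It assumes each candidate label already has a list of cosine similarities against the strong and niche/category tokens. The function name `pick_anchor` and the default thresholds (0.5 for strong matches, 0.45 for tie-breaking) are illustrative parameters, and the contextual-support step is omitted.

```python
def pick_anchor(similarities, strong=0.5, relaxed=0.45):
    """Return the position of the anchor candidate, or None if nothing matches.

    `similarities` holds one list of cosine values per candidate label.
    Candidates are ranked by: number of strong matches, then number of
    relaxed matches, then mean of the relaxed matches.
    """
    def ranking_key(sims):
        strong_hits = [s for s in sims if s >= strong]
        relaxed_hits = [s for s in sims if s >= relaxed]
        relaxed_mean = sum(relaxed_hits) / len(relaxed_hits) if relaxed_hits else 0.0
        return (len(strong_hits), len(relaxed_hits), relaxed_mean)

    keys = [ranking_key(sims) for sims in similarities]
    if not keys or max(key[0] for key in keys) == 0:
        return None  # no candidate has a strong match
    return max(range(len(keys)), key=keys.__getitem__)

# Assumed similarity values for three candidate labels.
print(pick_anchor([[0.62, 0.51, 0.20], [0.55, 0.47, 0.46], [0.30, 0.10]]))
```

Ranking by a tuple makes the priority order explicit: a later criterion only matters when all earlier ones are tied.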


### Why This Matters:

- Keeps labels **centered around the most relevant and specific tokens**.
- Prioritizes **semantic closeness** and **contextual fit** over weak textual overlap.
- Produces **tighter, more meaningful matches** for each company profile.

In [78]:

max_rows = []
final_list = []
array_list = []
list_best_cat = []
last_index_elements2 = [-1] * len(our_words['new_col2'])
for idx in range(len(our_words['new_col2'])):
    list_matrix[idx].sort(reverse=True)

    max_value11 = list_matrix[idx][0][0]

    
    
    if(len(elements_filtered[idx])==1 or max_value11 - list_matrix[idx][1][0]>=0.1):
        last_index_elements2[idx] = list_matrix[idx][1][1]
        continue

    rows = []
    rows_emb = []

    list111 = []
    row_filtered=[]
    mean_values = []
    mean_values_total = []

    row_embeddings = []
    our_mean_embeddings = []
    rows_emb4_5 = []

    set_values = set(strong_values_all[idx])
    set_values.update(our_categories_niche_words['niche_plus_cat'][idx])
    list_val = set_values
   
    
    if not list_val:
        continue
    embeddings_value_all = np.array([embeddings_index.get(val, np.zeros(300)) if val in weak_values_all[idx]  else np.zeros(300) for val in our_words['new_col2'][idx]],  dtype=np.float32, order='C')
    embeddings_value = np.array([embeddings_index.get(val, np.zeros(300)) for val in list_val],  dtype=np.float32, order='C')

    
    
    if list_val == set():
        continue
    faiss.normalize_L2(embeddings_value)
   
    for j in range(len(elements_filtered[idx])):
        list111.append(our_classes['label'][list_matrix[idx][j][1]])
        row_filtered.append(list_matrix[idx][j])
        index_label = list_matrix[idx][j][1]

        our = 0
        our_values = np.dot(embeddings_index[original_domain[index_label]].reshape(1,-1), embeddings_value.T)[0]

        score_b = 1
        score_a = 0
        embedding_val = None
        penalizer_genericity = 1.0
        our_cores_value = np.zeros(300)
        if len(list_elements[index_label])==1:
           embedding_val = embeddings_index[original_domain[index_label]]
           our_values_all = np.dot(embeddings_index[original_domain[index_label]], embeddings_value_all.T)

           if len(original_domain[index_label].split("_"))>1 and original_domain[index_label] == domain[index_label]:
        
                our_values_temp = 0.5 * np.dot(embeddings_index[original_domain[index_label].split("_")[0]], embeddings_value.T) + 0.5 * np.dot(embeddings_index[original_domain[index_label].split("_")[1]], embeddings_value.T)
                our_values_45 = list(filter(lambda x: x>=0.45, our_values))
                our_values_45_temp = list(filter(lambda x: x>=0.45, our_values_temp))

                our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))
               
                if (len(our_values_45_temp) > len(our_values_45)) or (len(our_values_5_temp) > len(our_values_5)):
                    our_values = our_values_temp

        elif len(list_elements[index_label])==2:
            if  original_domain[index_label] != domain[index_label] and original_core[index_label] == core[index_label]:
                score_a = 0.3
                score_b = 0.7
        
                our_values = our_values * 0.3 + 0.7 * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                if len(original_core[index_label].split("_")) > 1:
                    our_values_temp =  0.2 * np.dot(embeddings_index[original_domain[index_label]],  embeddings_value.T)+ 0.425 * np.dot(embeddings_index[original_core[index_label].split("_")[0]], embeddings_value.T) + 0.375 * np.dot(embeddings_index[original_core[index_label].split("_")[1]], embeddings_value.T) 
                    our_values_45 = list(filter(lambda x: x>=0.45, our_values))
                    our_values_45_temp = list(filter(lambda x: x>=0.45, our_values_temp))
                    our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                    our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))
               
                    if (len(our_values_5_temp) > len(our_values_5) or len(our_values_45_temp) > len(our_values_45)):
                        our_values = our_values_temp
            elif original_domain[index_label] != domain[index_label] and original_core[index_label] != core[index_label]:
                score_a = 0.3
                if len(core[index_label].split("::"))>1:
                    score_b = 0.1
                elif len(core[index_label].split("__"))>1:
                    score_b = 0.3
                penalizer_genericity = 0.7
                
                our_values = our_values * (score_a/(score_a+score_b)) + (score_b/(score_a+score_b)) * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
            elif original_domain[index_label] == domain[index_label] and original_core[index_label] == core[index_label]:
                score_a = 0.5
                score_b = 0.5
             
                our_values = our_values * (score_a/(score_a+score_b)) + (score_b/(score_a+score_b)) * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                if len(original_core[index_label].split("_")) > 1:
                    our_values_temp = 0.3 * np.dot(embeddings_index[original_domain[index_label]], embeddings_value.T) + 0.4 * np.dot(embeddings_index[original_core[index_label].split("_")[0]], embeddings_value.T) + 0.3 * np.dot(embeddings_index[original_core[index_label].split("_")[1]], embeddings_value.T)
                    our_values_45 = list(filter(lambda x: x>=0.45, our_values))
                    our_values_45_temp = list(filter(lambda x: x>=0.45, our_values_temp))
                    our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                    our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))

                    if len(our_values_45_temp) > len(our_values_45) or len(our_values_5_temp) > len(our_values_5):
                        our_values = our_values_temp
                if len(original_domain[index_label].split("_"))>1:
                    our_values_temp = 0.3 * np.dot(embeddings_index[original_domain[index_label].split("_")[0]], embeddings_value.T) + 0.4 * np.dot(embeddings_index[original_domain[index_label].split("_")[1]], embeddings_value.T) + 0.3 * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                    our_values_45 = list(filter(lambda x: x>=0.45, our_values))
                    our_values_45_temp = list(filter(lambda x: x>=0.45, our_values_temp))
                    our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                    our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))

                    if len(our_values_5_temp) > len(our_values_5) or len(our_values_45_temp) > len(our_values_45):
                        our_values = our_values_temp
            
            elif original_domain[index_label] == domain[index_label] and original_core[index_label] != core[index_label]:
               
                
                if len(core[index_label].split("::"))>1:
                    score_a = 0.9
                elif len(core[index_label].split("__"))>1:
                    score_a = 0.7
                score_b= 1 - score_a
    
               
                our_values = our_values * score_a + score_b * np.dot(embeddings_index[original_core[index_label]].reshape(1,-1), embeddings_value.T)[0]
                if len(original_domain[index_label].split("_"))>1:
                    if len(core[index_label].split("::"))>1:
                        score_c1 = 0.1
                        score_a1 = 0.4
                        score_b1 = 0.5
                    elif len(core[index_label].split("__"))>1:
                        score_c1 = 0.2
                        score_a1 = 0.35
                        score_b1 = 0.45

                    our_values_temp = score_a1 * np.dot(embeddings_index[original_domain[index_label].split("_")[0]], embeddings_value.T) + score_b1 * np.dot(embeddings_index[original_domain[index_label].split("_")[1]], embeddings_value.T) + score_c1 * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                    our_values_45 = list(filter(lambda x: x>=0.45, our_values))
                    our_values_45_temp = list(filter(lambda x: x>=0.45, our_values_temp))
                    our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                    our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))
                    
                    if len(our_values_5_temp) > len(our_values_5) or len(our_values_45_temp) > len(our_values_45):
                        our_values = our_values_temp
            
            our_cores_value = np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
            embedding_val = (score_a/(score_a+score_b)) * embeddings_index[original_domain[index_label]] +  (score_b/(score_a+score_b)) * embeddings_index[original_core[index_label]]
            our_values_all = np.dot(embeddings_index[original_domain[index_label]], embeddings_value_all.T) * (score_a/(score_a+score_b))  + (score_b/(score_a+score_b)) * np.dot(embeddings_index[original_core[index_label]], embeddings_value_all.T)

        elif len(list_elements[index_label])==3:
            score_a = 0.3
            score_b = 0.4
            score_c = 0.3
            if  original_domain[index_label] != domain[index_label] and original_core[index_label] == core[index_label]:
                score_a = 0.2
                score_b = 0.425
                score_c = 0.375                
                our_values = our_values * 0.2 + 0.425 * np.dot(embeddings_index[original_modifier[index_label]], embeddings_value.T)[0] + 0.375 * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)[0]
  
            elif original_domain[index_label] != domain[index_label] and original_core[index_label] != core[index_label]:                
                if len(core[index_label].split("::"))>1:
                    score_c = 0.1
                    score_a = 0.235
                elif len(core[index_label].split("__"))>1:
                    score_c = 0.2
                    score_a = 0.2
                score_b = 1 - score_a - score_c
            elif original_domain[index_label] == domain[index_label] and original_core[index_label] == core[index_label]:
                our_values = our_values * score_a + score_b * np.dot(embeddings_index[original_modifier[index_label]], embeddings_value.T)[0] + score_c * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)[0]
            elif original_domain[index_label] == domain[index_label] and original_core[index_label] != core[index_label] and len(core[index_label].split("::"))>1:
                score_a = 0.425
                score_c = 0.1
                score_b = 1 - score_a - score_c
            elif original_domain[index_label] == domain[index_label] and original_core[index_label] != core[index_label]:
                if len(modifier[index_label].split("__"))>1:
                    score_b = 0.5
                    score_a = 0.3 
                    score_c = 1 - score_a-score_b
                else:
                    score_b = 0.45
                    score_a = 0.35 
                    score_c = 1 - score_a - score_b
            
            our_cores_value = np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
            our_values = our_values * (score_a/(score_a+score_b+score_c)) + (score_b/(score_a+score_b+score_c)) * np.dot(embeddings_index[original_modifier[index_label]], embeddings_value.T) + (score_c/(score_a+score_b+score_c)) * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
            our_values_all = np.dot(embeddings_index[original_domain[index_label]], embeddings_value_all.T) * (score_a/(score_a+score_b+score_c)) + (score_b/(score_a+score_b+score_c)) * np.dot(embeddings_index[original_modifier[index_label]], embeddings_value_all.T) + (score_c/(score_a+score_b+score_c)) * np.dot(embeddings_index[original_core[index_label]], embeddings_value_all.T)

            embedding_val = (score_a/(score_a+score_b+score_c)) * embeddings_index[original_domain[index_label]] +  (score_b/(score_a+score_b+score_c)) * embeddings_index[original_modifier[index_label]]+ (score_c/(score_a+score_b+score_c)) * embeddings_index[original_core[index_label]]




        range_list_val = list(range(len(our_values)))

        length_no6 = list(filter(lambda x: (x * penalizer_genericity) >= 0.6, our_values)) 
        length_no = list(filter(lambda x: (x * penalizer_genericity) >= 0.5, our_values)) 
        length_no4_5 = list(filter(lambda x: (x * penalizer_genericity) >= 0.45, our_values)) 

        if core[index_label] != original_core[index_label]:
            range_list_val = list(filter(lambda x: our_cores_value[x]<0.6 or our_cores_value[x] <= our_values[x], range_list_val))

            our_values_temp = []
            for val in range_list_val:
                our_values_temp.append(our_values[val])
            our_values = our_values_temp.copy()


        max_value = 0
      
        
        length_no6 = list(filter(lambda x: (x * penalizer_genericity) >= 0.6, our_values)) 
        length_no = list(filter(lambda x: (x * penalizer_genericity) >= 0.5, our_values)) 
        length_no4_5 = list(filter(lambda x: (x * penalizer_genericity) >= 0.45, our_values)) 

       


       
        length_no4 = list(filter(lambda x: (x * penalizer_genericity) >= 0.4, our_values))


        if len(length_no) > 0: 
            mean_values_total.append(np.mean(list(map(lambda x: x, length_no))))
        else:
            mean_values_total.append(0)

        our=len(length_no)
        rows.append(our)
        rows_emb.append(length_no)
        rows_emb4_5.append(length_no4_5)
        our_mean_embeddings = our_values
        list_elements2 = our_mean_embeddings
        row_embeddings.append(embedding_val)


    max_values = 0
    mean_max_values = -1
    index_x = -1
    max_index=-1
    indices = []

    for idx4, ivv in enumerate(rows):

        if max_values < ivv and mean_values_total[idx4] >= 0.085:
            max_values = ivv
            index_x = idx4
            indices = [idx4]
        elif max_values == ivv and mean_values_total[idx4] >= 0.085:
            indices.append(idx4)
    

    indices_value = indices.copy()
    if max_values != 0:
        indices_elements_copy = []
        for idx4 in indices:
            
            if context_matrix[idx][list_matrix[idx][idx4][1]]>0.45 or (context_matrix_niche[idx][list_matrix[idx][idx4][1]]>0.45 and similarity_to_niche[idx][idx] > 0.3) or (context_matrix_category[idx][list_matrix[idx][idx4][1]]>0.45 and similarity_to_cat[idx][idx]>0.4) or first_sentence_matrix[idx][list_matrix[idx][idx4][1]]>0.45:
                indices_elements_copy.append(idx4)
                index_x = idx4
            
        if indices_elements_copy != []:
            indices_value = indices_elements_copy.copy()
    

    if max_values==0:
        continue

    
    max_len_4_5_values = -1
    list_indices_temp = []
    if len(indices_value) > 1:
        for idx3 in indices_value:
            mean = 0
            ivv = list_matrix[idx][idx3]
            if len(rows_emb[idx3])!=0:
                mean_val = np.mean(rows_emb[idx3])
            length_4_5 = len(rows_emb4_5[idx3])
        
            if (length_4_5 > max_len_4_5_values):
                max_len_4_5_values = length_4_5
                index_x = idx3
                list_indices_temp = [idx3]
            elif (length_4_5 == max_len_4_5_values):
                list_indices_temp.append(idx3)

        if len(list_indices_temp) > 0:
            indices_value = list_indices_temp.copy()
    if len(indices_value) > 1:
        mean_val = -1
        for idx3  in indices_value:
            ivv = list_matrix[idx][idx3]
            if len(rows_emb4_5[idx3]) > 0:
                mean_ = np.mean(rows_emb4_5[idx3])

            else:
                mean_ = 0
            if mean_ > mean_val:
                index_x = idx3
                mean_val = mean_

    list_best_cat.append(index_x)

    for j in range(len(rows)):
        max_new=-1
        element = cosine_similarity(row_embeddings[j].reshape(1,-1), row_embeddings[index_x].reshape(1,-1))
        mean = -1
        if element.size > 0:
            mean = np.mean(element)
        else:
            mean = 0
        values_6 = list(filter(lambda x: x>=0.6, rows_emb[j]))
        length_4_5 = len(rows_emb4_5[j]) + len(values_6)

        is_a_label_subset_of_another_label = original_modifier[list_matrix[idx][index_x][1]] == original_domain[list_matrix[idx][j][1]] and original_core[list_matrix[idx][index_x][1]] == original_core[list_matrix[idx][j][1]]
        is_a_label_subset_of_another_label = is_a_label_subset_of_another_label or original_core[list_matrix[idx][index_x][1]] == original_domain[list_matrix[idx][j][1]] and original_modifier[list_matrix[idx][index_x][1]] == "0" and original_modifier[list_matrix[idx][j][1]] == "0"

        if mean <= 0.75 or is_a_label_subset_of_another_label: 
            list_matrix[idx][j] = (-1, list_matrix[idx][j][1])
            array_list.append(idx)
        else:
            last_index_elements2[idx] = list_matrix[idx][len(elements_filtered[idx])][1]


In [79]:
final_list = []
elements_filtered = []

i=0

zz=0
ridiculous_number = 0
for idx, _ in df.iterrows():
    list_matrix[i].sort(reverse=True)
    
    list111 = []
    list111_og = []

    max_value = list_matrix[i][0][0]
    row_filtered = []
    

    for j in range(30):

        
        if(list_matrix[i][j][1]==last_index_elements2[i]):
            break
        
        
        
        if max_value - list_matrix[i][j][0] <= 0.15:
            list111.append(our_classes['label'][list_matrix[i][j][1]])
            row_filtered.append(list_matrix[i][j])
                   
        
    final_list.append(list111)  
     
    elements_filtered.append(row_filtered)
    i+=1



### Category-Based and Text-Based Anchor Selection

This stage filters out semantically weak or ambiguous labels, ensuring that only contextually relevant and structurally meaningful candidates remain. It runs in two passes:


### 1. Category-Based Anchor Filtering (Strong and Niche Tokens)

Selects an anchor label based on alignment with the company's important tokens (`strong_values_all[idx]`) and metadata (`niche_plus_cat[idx]`).

#### Process

- **Skip Filtering** if:
  - Only one candidate in `elements_filtered[idx]`
  - Or top label is clearly dominant (score gap ≥ 0.1)

- **Embedding Preparation**:
  - For each candidate label:
    - Build a **weighted embedding** using `domain`, `core`, and `modifier`
    - Adjust weights based on:
      - Structural overlap (e.g., `domain = core`)
      - Genericity (e.g., `__`, `::`)
      - Compound splitting (e.g., `agriculture_machinery`)

- **Similarity Evaluation**:
  - Cosine ≥ 0.475 → `row_values`
  - Cosine ≥ 0.4 → `row_values_4`
  - Mean ≥ 0.3 → `mean_values`
  - Mean ≥ 0.475 → `mean_values_total`

- **Anchor Selection**:
  - Choose label with:
    - Most matches ≥ 0.475
    - `mean_values_total ≥ 0.085`
    - Contextual alignment via:
      - `context_matrix`
      - `context_matrix_niche`
      - `context_matrix_category`
      - `first_sentence_matrix`

- **Tie-Breaking**:
  1. Prefer high contextual support
  2. Then highest `mean_values`
  3. Then most matches ≥ 0.4
  4. Then highest `mean` in `row_values_4`
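
The weighted embedding from the "Embedding Preparation" step can be illustrated with a small sketch. The helper name `blend_components` and the toy vectors are assumptions. The weights 0.3 / 0.4 / 0.3 are the default three-part weights used in the code cell, and they are renormalized so that they sum to 1.

```python
import numpy as np

def blend_components(vectors, weights):
    """Weighted average of component embeddings; weights are renormalized to sum to 1."""
    vectors = np.asarray(vectors, dtype=np.float32)
    weights = np.asarray(weights, dtype=np.float32)
    weights = weights / weights.sum()
    return weights @ vectors

# Toy 3-dimensional embeddings standing in for domain, modifier, and core.
domain = np.array([1.0, 0.0, 0.0])
modifier = np.array([0.0, 1.0, 0.0])
core = np.array([0.0, 0.0, 1.0])

blended = blend_components([domain, modifier, core], [0.3, 0.4, 0.3])
print(blended)
```

Renormalizing inside the helper means the caller can lower one weight (for example, for a generic core) without rebalancing the others by hand.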

#### Filtering

- Discard labels with:
  - Cosine to anchor ≤ 0.75
  - Or a structure already subsumed by the anchor (e.g., shared domain/core)
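
A minimal sketch of the cosine-to-anchor filter is shown below. The names `cosine` and `mark_far_from_anchor` are hypothetical, the structural-relation check is left out, and eliminated labels are marked with a score of -1 as in the notebook.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def mark_far_from_anchor(candidates, embeddings, anchor_pos, min_cos=0.75):
    """Set the score to -1 for candidates whose embedding is not close to the anchor's."""
    anchor_vec = embeddings[anchor_pos]
    result = []
    for (score, label_id), vec in zip(candidates, embeddings):
        if cosine(vec, anchor_vec) <= min_cos:
            result.append((-1, label_id))
        else:
            result.append((score, label_id))
    return result

# Assumed candidates and blended embeddings for one company.
candidates = [(0.90, 11), (0.88, 12), (0.86, 13)]
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(mark_far_from_anchor(candidates, embeddings, anchor_pos=0))
```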


### 2. Text-Based Anchor Filtering (All Company Tokens)

This step refines label selection by comparing each label candidate against **all company tokens** (excluding only the **very weakest terms**), ensuring structural and semantic alignment beyond top-scoring labels.


### When It Runs

- Only if the **top 2 label scores differ by less than 0.1**
- Skipped if:
  - Only one label survives early filters (`elements_filtered[idx]`)
  - The top label is clearly dominant

### Anchor Selection Logic

For each label in `elements_filtered[idx]`:

- Compute semantic similarity between the label's domain/core/modifier and **all tokens in `new_col2[idx]`**
- Apply component-specific blending:
  - One-part labels: use domain embedding
  - Two-part: blend domain + core (adjusted by specificity)
  - Three-part: weighted blend of domain, modifier, core (penalize genericity)

- Retain similarity values:
  - `length_no`: tokens with cosine ≥ 0.475
  - `length_no2`: tokens with cosine ≥ 0.3
  - `length_no4`: tokens with cosine ≥ 0.4

- Use contextual support as secondary signal:
  - Accept label if any of:
    - `context_matrix[idx][label_id] > 0.3`
    - `context_matrix_niche[idx][label_id] > 0.5` and `similarity_to_niche[idx][idx] > 0.3`
    - `context_matrix_category[idx][label_id] > 0.5` and `similarity_to_cat[idx][idx] > 0.3`
    - `first_sentence_matrix[idx][label_id] > 0.3`
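
The acceptance rule above is a plain disjunction, which a small predicate makes explicit. The function and parameter names are illustrative. Each argument stands for the scalar looked up for one company and one label.

```python
def has_context_support(ctx, ctx_niche, ctx_cat, first_sentence, sim_niche, sim_cat):
    """True if any of the four contextual signals supports the label."""
    return (
        ctx > 0.3
        or (ctx_niche > 0.5 and sim_niche > 0.3)
        or (ctx_cat > 0.5 and sim_cat > 0.3)
        or first_sentence > 0.3
    )

# A strong niche context counts only when the niche itself is similar enough.
print(has_context_support(0.1, 0.6, 0.0, 0.0, sim_niche=0.35, sim_cat=0.0))
print(has_context_support(0.1, 0.6, 0.0, 0.0, sim_niche=0.20, sim_cat=0.0))
```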

### Anchor Label Selection (Tie-Breaking Priority)

1. Labels supported by **any context matrix ≥ 0.45**
2. Label with highest **mean similarity** over values ≥ 0.475
3. Label with highest count of values ≥ 0.4
4. Label with highest overall mean similarity if tie persists
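
One way to read this priority list is as a lexicographic sort key. The sketch below is an interpretation rather than the notebook's code: `anchor_priority` and `choose_anchor` are hypothetical names, and each candidate is assumed to arrive as `(name, token_similarities, context_scores)`.

```python
from statistics import mean

def anchor_priority(sims, context_scores):
    """Sort key mirroring the four tie-breaking rules, most important first."""
    strong = [s for s in sims if s >= 0.475]
    medium = [s for s in sims if s >= 0.4]
    return (
        any(c >= 0.45 for c in context_scores),  # 1. contextual support
        mean(strong) if strong else 0.0,         # 2. mean over values >= 0.475
        len(medium),                             # 3. count of values >= 0.4
        mean(sims) if sims else 0.0,             # 4. overall mean similarity
    )

def choose_anchor(candidates):
    """Return the name of the candidate with the highest priority key."""
    best = max(candidates, key=lambda c: anchor_priority(c[1], c[2]))
    return best[0]

# Assumed values: B has weaker token matches but is backed by a context matrix.
print(choose_anchor([("A", [0.50, 0.42], [0.2]), ("B", [0.48, 0.10], [0.5])]))
```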


### Filtering Logic (Post-Anchor)

- After selecting the best anchor:
  - Discard any other label if:
    - Cosine similarity to anchor < 0.85
    - Mean similarity to company tokens < 0.1
    - Structurally redundant (e.g., label is just a weaker modifier/domain variant)
    - Any of domain, core, modifier similarity < 0.4

- Labels whose **maximum cosine exceeds 0.9** and that have at least as many values ≥ 0.4 as the anchor are tracked in `values_to_reconsider` and may be re-added
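
The discard rules can be sketched as a single decision function. The names `cosine` and `keep_after_anchor` are hypothetical, and `part_max_sims` stands in for the best token similarity of each label part (domain, modifier, core), which the notebook computes inline. The 0.99 shortcut mirrors the cell below, where a label that is practically identical to the anchor skips the per-part check.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def keep_after_anchor(label_vec, anchor_vec, part_max_sims,
                      is_structural_variant=False,
                      min_anchor_cos=0.85, min_part_sim=0.4):
    """Decide whether a label survives once the anchor is fixed.

    part_max_sims: non-empty list with the best token similarity per label part.
    """
    cos = cosine(label_vec, anchor_vec)
    if cos <= min_anchor_cos and not is_structural_variant:
        return False  # too far from the anchor
    if cos >= 0.99 and not is_structural_variant:
        return True  # effectively the anchor itself
    # Otherwise every part must be backed by at least one company token.
    return min(part_max_sims) >= min_part_sim


anchor_vec = [1.0, 0.0]
kept = keep_after_anchor([0.9, 0.3], anchor_vec, [0.5, 0.45])
dropped = keep_after_anchor([0.9, 0.3], anchor_vec, [0.5, 0.3])
```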


### Result

- `list_matrix[idx]` is updated to keep only labels that are:
  - **Semantically grounded**: they match the company's important tokens and metadata
  - **Contextually validated** through the BERT context matrices
  - **Structurally sound and specific**, rather than generic or structurally vague alternatives
- The anchor is chosen from structure, semantics, and context together, so the outcome stays consistent even when several label candidates are equally strong in FAISS.
- Discarded labels are not deleted; their score is set to `-1` so they sort to the bottom of `list_matrix[idx]`.
- The aim is stable, interpretable, and accurate predictions.

In [80]:
max_rows = []
final_list = []

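for idx in range(len(our_words['new_col2'])):
    # Rank this company's (score, label_id) pairs, best first.
    list_matrix[idx].sort(reverse=True)

    max_value11 = list_matrix[idx][0][0]

    # Nothing to disambiguate when a single label survived the earlier filters.
    if(len(elements_filtered[idx])==1):
        continue

    # Skip when the top label is clearly dominant (gap of at least 0.1).
    if(list_matrix[idx][0][0] - list_matrix[idx][1][0]>=0.1):
        continue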

    rows = []


    list111 = []
    row_filtered=[]
    mean_values = []
    mean_values_total = []
    row_values = []
    row_values_4 = []

    row_embeddings = []
   
    for j in range(len(elements_filtered[idx])):
        list111.append(our_classes['label'][list_matrix[idx][j][1]])
        row_filtered.append(list_matrix[idx][j])
      
    
    
        
       
        index_label = list_matrix[idx][j][1]

        our = 0
        our_mean_embeddings = []
        
        embeddings_value = np.array([embeddings_index[val] if val not in weak_values_all[35][:-(len(weak_values_all[35])//2)] else np.zeros(300) for val in our_words['new_col2'][idx]],  dtype=np.float32, order='C')
        our_values = np.dot(embeddings_index[original_domain[index_label]], embeddings_value.T)
        score_b = 1
        score_a = 0
        embedding_val = None

            

        if len(list_elements[index_label])==1:
           embedding_val = embeddings_index[original_domain[index_label]]
           our_values = list(map(lambda x: x, our_values))
           if len(original_domain[index_label].split("__"))>1:
                our_values_temp = 0.5 * np.dot(embeddings_index[original_domain[index_label].split("_")[0]], embeddings_value.T) + 0.5 * np.dot(embeddings_index[original_domain[index_label].split("_")[1]], embeddings_value.T)
                our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))

                if len(our_values_5_temp) > len(our_values_5):
                    our_values = our_values_temp


        
        elif len(list_elements[index_label])==2:
            if  original_domain[index_label] != domain[index_label] and original_core[index_label] == core[index_label]:
                score_a = 0.3
                score_b = 0.7

                our_values = our_values * 0.3 + 0.7 * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                if len(original_core[index_label].split("_")) > 1:
                    our_values_temp =  0.2 * np.dot(embeddings_index[original_domain[index_label]],  embeddings_value.T)+ 0.425 * np.dot(embeddings_index[original_core[index_label].split("_")[0]], embeddings_value.T) + 0.375 * np.dot(embeddings_index[original_core[index_label].split("_")[1]], embeddings_value.T)
                    our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                    our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))
                    if len(our_values_5_temp) > len(our_values_5):
                        our_values = our_values_temp

            elif original_domain[index_label] != domain[index_label] and original_core[index_label] != core[index_label]:
                score_a = 0.3
                if len(core[index_label].split("::"))>1:
                    score_b = 0.1
                elif len(core[index_label].split("__"))>1:
                    score_b = 0.3

                our_values = our_values * (score_a/(score_a+score_b)) + (score_b/(score_a+score_b)) * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                
            
            elif original_domain[index_label] == domain[index_label] and original_core[index_label] == core[index_label]:
                score_a = 0.5
                score_b = 0.5
               
                our_values = our_values * (score_a/(score_a+score_b)) + (score_b/(score_a+score_b)) * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                if len(original_core[index_label].split("_")) > 1:
                    our_values_temp =  0.3 * np.dot(embeddings_index[original_domain[index_label]],  embeddings_value.T)+ 0.4 * np.dot(embeddings_index[original_core[index_label].split("_")[0]], embeddings_value.T) + 0.3 * np.dot(embeddings_index[original_core[index_label].split("_")[1]], embeddings_value.T)
                    our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                    our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))
                    if len(our_values_5_temp) > len(our_values_5):
                        our_values = our_values_temp
                if  len(original_domain[index_label].split("__"))>1:
                    our_values_temp = 0.3 * np.dot(embeddings_index[original_domain[index_label].split("_")[0]], embeddings_value.T) + 0.4 * np.dot(embeddings_index[original_domain[index_label].split("_")[1]], embeddings_value.T) + 0.3 * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                    our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                    our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))

                    if len(our_values_5_temp) > len(our_values_5):
                        our_values = our_values_temp
            elif original_domain[index_label] == domain[index_label] and original_core[index_label] != core[index_label]:
               
                
                if len(core[index_label].split("::"))>1:
                    score_b = 0.1
                elif len(core[index_label].split("__"))>1:
                    score_b = 0.3
                score_a = 1 - score_b
                
                our_values = our_values * score_a + score_b * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                if len(original_domain[index_label].split("__"))>1:
                    if len(core[index_label].split("::"))>1:
                        score_c1 = 0.1
                        score_a1 = 0.4
                        score_b1 = 0.5
                    elif len(core[index_label].split("__"))>1:
                        score_c1 = 0.2
                        score_a1 = 0.35
                        score_b1 = 0.45

                    our_values_temp = score_a1 * np.dot(embeddings_index[original_domain[index_label].split("_")[0]], embeddings_value.T) + score_b1 * np.dot(embeddings_index[original_domain[index_label].split("_")[1]], embeddings_value.T) + score_c1 * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                    our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                    our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))

                    if len(our_values_5_temp) > len(our_values_5):
                        our_values = our_values_temp
                
            embedding_val = (score_a/(score_a+score_b)) * embeddings_index[original_domain[index_label]] +  (score_b/(score_a+score_b)) * embeddings_index[original_core[index_label]]
      
        elif len(list_elements[index_label])==3:
            score_a = 0.3
            score_b = 0.4
            score_c = 0.3
            if  original_domain[index_label] != domain[index_label] and original_core[index_label] == core[index_label]:
                score_a = 0.2
                score_b = 0.425
                score_c = 0.375

                if len(modifier[index_label].split("__"))>1:
                    score_b = 0.325
                    score_c = 0.45
                    score_a = 0.225
                    our_values = our_values * score_a + score_b * np.dot(embeddings_index[original_modifier[index_label]], embeddings_value.T) + score_c * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                else:
                    our_values = our_values * 0.2 + 0.425 * np.dot(embeddings_index[original_modifier[index_label]], embeddings_value.T) + 0.375 * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)

                
            elif original_domain[index_label] != domain[index_label] and original_core[index_label] != core[index_label]:
                score_a = 0.25
                if len(core[index_label].split("::"))>1:
                    score_c = 0.1
                    score_a = 0.20
                elif len(core[index_label].split("__"))>1:
                    score_c = 0.25

                if len(modifier[index_label].split("__"))>1:
                    score_b = 0.35
                    score_c+= 0.075
                    score_a+= 0.075
                else:
                    score_b = 1 - score_a - score_c

                our_values = our_values * (score_a/(score_a+score_b+score_c)) + (score_b/(score_a+score_b+score_c)) * np.dot(embeddings_index[original_modifier[index_label]], embeddings_value.T) + (score_c/(score_a+score_b+score_c)) * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
            elif original_domain[index_label] == domain[index_label] and original_core[index_label] == core[index_label]:
                if len(modifier[index_label].split("__"))>1:
                    score_b = 0.3
                    score_a = 0.35
                    score_c = 0.35
                
                our_values = our_values * score_a + score_b * np.dot(embeddings_index[original_modifier[index_label]], embeddings_value.T) + score_c * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
            elif original_domain[index_label] == domain[index_label] and original_core[index_label] != core[index_label]:
                score_a = 0.35
                score_b = 0
                
                if len(core[index_label].split("::"))>1:
                    score_c = 0.1

                    score_a = 0.4
                elif len(core[index_label].split("__"))>1:
                    score_c = 0.2
                    score_a = 0.35
                if len(modifier[index_label].split("__"))>1:
                    score_b = 0.35
                    score_a = 1 - score_a-score_b
                else:
                    score_b = 1 - score_a - score_c
                our_values = our_values * score_a + score_b * np.dot(embeddings_index[original_modifier[index_label]], embeddings_value.T) + score_c * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)

            embedding_val = (score_a/(score_a+score_b+score_c)) * embeddings_index[original_domain[index_label]] +  (score_b/(score_a+score_b+score_c)) * embeddings_index[original_modifier[index_label]]+ (score_c/(score_a+score_b+score_c)) * embeddings_index[original_core[index_label]]
        
     
        
        elements = our_values
        length_no = []
        length_no4 = []
        length_no2 = []
        length_no_avg = []
        if context_matrix[idx][list_matrix[idx][j][1]] >= 0.075 or np.max(context_matrix[idx]) <= 0.25:
            length_no = list(filter(lambda x: x >= 0.475 , elements)) 
            length_no2 = list(filter(lambda x: x >= 0.3 , elements))
            length_no4 = list(filter(lambda x: x >= 0.4 , elements))
            length_no_avg = list(filter(lambda x: x <=0.45 , elements))




        # Mean similarity over each bucket; 0.0 when the bucket is empty.
        mean_values.append(np.mean(length_no2) if length_no2 else 0.0)
        mean_values_total.append(np.mean(length_no) if length_no else 0.0)


        our+=len(length_no)
        rows.append(our)
        row_values_4.append(length_no4)
        row_values.append(length_no)

        row_embeddings.append(embedding_val)
    max_values = 0
    mean_max_values = -1
    index_x = -1
    max_index=-1

    
    max_mean_5 = -1
    indices_value = []
    final_indices_val = -1
    values_to_reconsider = []
    for idx4, ivv in enumerate(rows):

        if max_values < ivv and mean_values_total[idx4] >= 0.085:
            
            max_values = ivv
            index_x = idx4
            mean_val_mid = sum(row_values[idx4]) / len(row_values[idx4])
            
            if max_mean_5 < mean_val_mid:
                max_mean_5 = mean_val_mid
            indices_value = [idx4]
        elif max_values == ivv and mean_values_total[idx4] >= 0.085 :
            indices_value.append(idx4)


    for idx4, ivv in enumerate(rows):
        if index_x != idx4 and  mean_values_total[idx4] >= 0.085 and np.max(row_values[idx4]) > 0.9 and len(row_values_4[idx4]) >= len(row_values_4[index_x]):
            indices_value.append(idx4)
            values_to_reconsider.append(idx4)

    if max_mean_5 < 0.65 and max_values != 0:
        indices_elements_copy = []
        for idx4 in indices_value:
            
            if context_matrix[idx][list_matrix[idx][idx4][1]]>0.45 or (context_matrix_niche[idx][list_matrix[idx][idx4][1]]>0.45 and similarity_to_niche[idx][idx] > 0.3) or (context_matrix_category[idx][list_matrix[idx][idx4][1]]>0.45 and similarity_to_cat[idx][idx] > 0.3) or first_sentence_matrix[idx][list_matrix[idx][idx4][1]]>0.45:
                indices_elements_copy.append(idx4)
        if indices_elements_copy != []:
            indices_value = indices_elements_copy.copy()

    if max_mean_5 < 0.65 or len(indices_value)>1:
        indices_value_copy = []

        for idx4 in indices_value:
            ivv = len(row_values_4[idx4])

            if max_values < ivv and mean_values_total[idx4] >= 0.085:
                max_values = ivv
                index_x = idx4
                indices_value_copy=[idx4]
            elif max_values == ivv and mean_values_total[idx4] >= 0.085:
                indices_value_copy.append(idx4)
            elif idx4 in values_to_reconsider:
                indices_value_copy.append(idx4)
        if indices_value_copy != []:
            indices_value = indices_value_copy.copy()

    if max_values!=0 and len(indices_value) > 1:
        indices_value_copy = []
        max_values = len(list(row_values[index_x]))
        mean_max_values = 0

        for idx3 in indices_value:

            row_values_final = list(row_values[idx3])
            ivv = len(list(row_values[idx3]))
            mean_val = 0
            if len(row_values_final) != 0:
                mean_val = np.mean(row_values_final)            
            
            if (max_values == ivv and mean_val > mean_max_values):
                mean_max_values = mean_val
                index_x = idx3
                indices_value_copy= [idx3]
        
        for idx3, ivv in enumerate(rows):
            if idx3 in values_to_reconsider:
                indices_value_copy.append(idx3)
        
        if indices_value_copy != []:
            indices_value = indices_value_copy.copy()
    
    if max_values == 0 or len(indices_value) > 1:
        
    
        mean_max_values = -1
        index_x=-1
        if max_values==0:
            indices_value = list(range(len(row_values_4)))  
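        # Compare only the remaining candidates: take their largest count of values >= 0.4
        # and iterate over the candidate indices themselves, not their list positions.
        max_values = max(len(row_values_4[k]) for k in indices_value)
        for idx3 in indices_value: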
            mean_val = 0
            
            
            if len(row_values_4[idx3]) != 0:
                mean_val = np.mean(row_values_4[idx3])
            
            if (max_values == len(row_values_4[idx3]) and mean_val > mean_max_values):
                mean_max_values = mean_val
                max_values = len(row_values_4[idx3])
                index_x = idx3
            
            
    for j in range(len(rows)):
        max_new=-1
        element = cosine_similarity(row_embeddings[j].reshape(1,-1), row_embeddings[index_x].reshape(1,-1))
        mean = -1
        if element.size > 0:
            mean = np.mean(element)
        else:
            mean = 0

        mean_val = 0
        if len(row_values[j]) > 1:
            mean_val = np.mean(row_values[j])
        
        is_a_label_subset_of_another_label = original_modifier[list_matrix[idx][index_x][1]] == original_domain[list_matrix[idx][j][1]] and original_core[list_matrix[idx][index_x][1]] == original_core[list_matrix[idx][j][1]]
        
        is_a_label_subset_of_another_label = is_a_label_subset_of_another_label or original_core[list_matrix[idx][index_x][1]] == original_domain[list_matrix[idx][j][1]] and original_modifier[list_matrix[idx][index_x][1]] == "0" and original_modifier[list_matrix[idx][j][1]] == "0" and original_core[list_matrix[idx][j][1]]=="0"
        
        is_a_label_subset_of_another_label = is_a_label_subset_of_another_label or original_domain[list_matrix[idx][index_x][1]] == original_core[list_matrix[idx][j][1]] and original_modifier[list_matrix[idx][index_x][1]] == "0" and original_modifier[list_matrix[idx][j][1]] == "0" and original_core[list_matrix[idx][index_x][1]]=="0"

        
        if (mean <= 0.85 and not is_a_label_subset_of_another_label):
            list_matrix[idx][j] = (-1, list_matrix[idx][j][1])
            array_list.append(idx)
        else:
            
            if mean >=0.85 and mean < 0.99 or is_a_label_subset_of_another_label:
                min_value = 1
                min_value2 = 1
                min_value3 = 1
                min_value = np.max(np.dot(embeddings_index[original_domain[list_matrix[idx][index_x][1]]], embeddings_value.T))
                if original_modifier[list_matrix[idx][index_x][1]] != "0":
                    min_value2 = np.max(np.dot(embeddings_index[original_modifier[list_matrix[idx][index_x][1]]], embeddings_value.T))
                if original_core[list_matrix[idx][j][1]] != "0":
                    min_value3 = np.max(np.dot(embeddings_index[original_core[list_matrix[idx][j][1]]], embeddings_value.T))
                if len(original_core[list_matrix[idx][j][1]].split("_"))>1:
                    min_value11 = np.min(np.dot(embeddings_index[original_core[list_matrix[idx][index_x][1]].split("_")[0]], embeddings_value.T))
                    min_value1= np.min(np.dot(embeddings_index[original_core[list_matrix[idx][index_x][1]].split("_")[1]], embeddings_value.T))  
                    min_value = np.min([min_value, min_value11, min_value1])
                if len(original_domain[list_matrix[idx][j][1]].split("_"))>1:
                    min_value22 = np.min(np.dot(embeddings_index[original_domain[list_matrix[idx][j][1]].split("_")[0]], embeddings_value.T))
                    min_value2= np.min(np.dot(embeddings_index[original_domain[list_matrix[idx][j][1]].split("_")[1]], embeddings_value.T))  
                    min_value = np.min([min_value, min_value22, min_value2])

                min_value = np.min([min_value, min_value2, min_value3])
                
                if min_value < 0.4:
                    list_matrix[idx][j] = (-1, list_matrix[idx][j][1])
                    array_list.append(idx)
            last_index_elements2[idx] = list_matrix[idx][len(elements_filtered[idx])][1]


In [81]:
final_list = []
elements_filtered = []
no_elements=0
no_elements_not_okay = 0

for i in range(len(df['description'])):
    list_matrix[i].sort(reverse=True)
    
    list111 = []

    max_value = list_matrix[i][0][0]


    row_filtered = []

    for j in range(30):

        if(list_matrix[i][j][1]==last_index_elements2[i]):
            break

        if max_value - list_matrix[i][j][0] <= 0.1:
            list111.append(our_classes['label'][list_matrix[i][j][1]])
            row_filtered.append(list_matrix[i][j])


        

    if (list_matrix[i][0][0]) < 0.30:
        no_elements_not_okay +=1
    
    elements_filtered.append(row_filtered)
        

In [82]:
final_list = []
elements_filtered = []

i=0

zz=0
ridiculous_number = 0
for idx, _ in df.iterrows():
    list_matrix[i].sort(reverse=True)
    
    list111 = []

    max_value = list_matrix[i][0][0]
    row_filtered = []
    

    for j in range(30):

        
        if(list_matrix[i][j][1]==last_index_elements2[i]):
            break
        
        
        
        if max_value - list_matrix[i][j][0] <= 0.10:
            list111.append(our_classes['label'][list_matrix[i][j][1]])
            row_filtered.append(list_matrix[i][j])
                   
        
    final_list.append(list111)    
    elements_filtered.append(row_filtered)
    i+=1
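
The two cells above rebuild the candidate list the same way: walk the ranked labels, stop at the cut-off label stored in `last_index_elements2`, and keep every label within 0.1 of the top score. A minimal sketch of that rule as a reusable helper follows; the name `labels_within_window` and its parameters are hypothetical. Slicing with `ranked[:limit]` is a deliberate difference from `range(30)`, so that a company with fewer than 30 ranked labels does not raise an `IndexError`.

```python
def labels_within_window(ranked, stop_label_id=None, window=0.1, limit=30):
    """Keep labels whose score is within `window` of the best score.

    ranked: list of (score, label_id) tuples sorted by score, descending.
    stop_label_id: label at which the scan stops (exclusive), if any.
    """
    if not ranked:
        return []
    top = ranked[0][0]
    kept = []
    for score, label_id in ranked[:limit]:
        if label_id == stop_label_id:
            break  # everything from here on was already ruled out
        if top - score <= window:
            kept.append((score, label_id))
    return kept


ranked = [(0.90, 4), (0.85, 7), (0.60, 2), (0.55, 9)]
window_all = labels_within_window(ranked)
window_cut = labels_within_window(ranked, stop_label_id=7)
```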


In [83]:
def filter_by_text(our_matrix, idx, magic_number):


    our_matrix[idx].sort(reverse=True)

    max_value11 = our_matrix[idx][0][0]



    rows = []


    list111 = []
    row_filtered=[]
    mean_values = []
    mean_values_total = []
    row_values = []
    row_values_4 = []

    row_embeddings = []
   
    for j in range(magic_number):
        list111.append(our_classes['label'][list_matrix[idx][j][1]])
        row_filtered.append(list_matrix[idx][j])
      
    
    
        
       
        index_label = list_matrix[idx][j][1]

        our = 0
        our_mean_embeddings = []
        
        embeddings_value = np.array([embeddings_index[val] if val not in weak_values_all[35][:-(len(weak_values_all[35])//2)] else np.zeros(300) for val in our_words['new_col2'][idx]],  dtype=np.float32, order='C')
        our_values = np.dot(embeddings_index[original_domain[index_label]], embeddings_value.T)
        score_b = 1
        score_a = 0
        embedding_val = None

            

        if len(list_elements[index_label])==1:
           embedding_val = embeddings_index[original_domain[index_label]]
           our_values = list(map(lambda x: x, our_values))
           if len(original_domain[index_label].split("__"))>1:
                our_values_temp = 0.5 * np.dot(embeddings_index[original_domain[index_label].split("_")[0]], embeddings_value.T) + 0.5 * np.dot(embeddings_index[original_domain[index_label].split("_")[1]], embeddings_value.T)
                our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))

                if len(our_values_5_temp) > len(our_values_5):
                    our_values = our_values_temp


        
        elif len(list_elements[index_label])==2:
            if  original_domain[index_label] != domain[index_label] and original_core[index_label] == core[index_label]:
                score_a = 0.3
                score_b = 0.7

                our_values = our_values * 0.3 + 0.7 * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                if len(original_core[index_label].split("_")) > 1:
                    our_values_temp =  0.2 * np.dot(embeddings_index[original_domain[index_label]],  embeddings_value.T)+ 0.425 * np.dot(embeddings_index[original_core[index_label].split("_")[0]], embeddings_value.T) + 0.375 * np.dot(embeddings_index[original_core[index_label].split("_")[1]], embeddings_value.T)
                    our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                    our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))
                    if len(our_values_5_temp) > len(our_values_5):
                        our_values = our_values_temp

            elif original_domain[index_label] != domain[index_label] and original_core[index_label] != core[index_label]:
                score_a = 0.3
                if len(core[index_label].split("::"))>1:
                    score_b = 0.1
                elif len(core[index_label].split("__"))>1:
                    score_b = 0.3

                our_values = our_values * (score_a/(score_a+score_b)) + (score_b/(score_a+score_b)) * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                
            
            elif original_domain[index_label] == domain[index_label] and original_core[index_label] == core[index_label]:
                score_a = 0.5
                score_b = 0.5
               
                our_values = our_values * (score_a/(score_a+score_b)) + (score_b/(score_a+score_b)) * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                if len(original_core[index_label].split("_")) > 1:
                    our_values_temp =  0.3 * np.dot(embeddings_index[original_domain[index_label]],  embeddings_value.T)+ 0.4 * np.dot(embeddings_index[original_core[index_label].split("_")[0]], embeddings_value.T) + 0.3 * np.dot(embeddings_index[original_core[index_label].split("_")[1]], embeddings_value.T)
                    our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                    our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))
                    if len(our_values_5_temp) > len(our_values_5):
                        our_values = our_values_temp
                if  len(original_domain[index_label].split("__"))>1:
                    our_values_temp = 0.3 * np.dot(embeddings_index[original_domain[index_label].split("_")[0]], embeddings_value.T) + 0.4 * np.dot(embeddings_index[original_domain[index_label].split("_")[1]], embeddings_value.T) + 0.3 * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                    our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                    our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))

                    if len(our_values_5_temp) > len(our_values_5):
                        our_values = our_values_temp
            elif original_domain[index_label] == domain[index_label] and original_core[index_label] != core[index_label]:
               
                
                if len(core[index_label].split("::"))>1:
                    score_b = 0.1
                elif len(core[index_label].split("__"))>1:
                    score_b = 0.3
                score_a = 1 - score_b
                
                our_values = our_values * score_a + score_b * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                if len(original_domain[index_label].split("__"))>1:
                    if len(core[index_label].split("::"))>1:
                        score_c1 = 0.1
                        score_a1 = 0.4
                        score_b1 = 0.5
                    elif len(core[index_label].split("__"))>1:
                        score_c1 = 0.2
                        score_a1 = 0.35
                        score_b1 = 0.45

                    our_values_temp = score_a1 * np.dot(embeddings_index[original_domain[index_label].split("_")[0]], embeddings_value.T) + score_b1 * np.dot(embeddings_index[original_domain[index_label].split("_")[1]], embeddings_value.T) + score_c1 * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                    our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                    our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))

                    if len(our_values_5_temp) > len(our_values_5):
                        our_values = our_values_temp
                
            embedding_val = (score_a/(score_a+score_b)) * embeddings_index[original_domain[index_label]] +  (score_b/(score_a+score_b)) * embeddings_index[original_core[index_label]]
      
        elif len(list_elements[index_label])==3:
            score_a = 0.3
            score_b = 0.4
            score_c = 0.3
            if  original_domain[index_label] != domain[index_label] and original_core[index_label] == core[index_label]:
                score_a = 0.2
                score_b = 0.425
                score_c = 0.375

                if len(modifier[index_label].split("__"))>1:
                    score_b = 0.325
                    score_c = 0.45
                    score_a = 0.225
                    our_values = our_values * score_a + score_b * np.dot(embeddings_index[original_modifier[index_label]], embeddings_value.T) + score_c * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                else:
                    our_values = our_values * 0.2 + 0.425 * np.dot(embeddings_index[original_modifier[index_label]], embeddings_value.T) + 0.375 * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)

                
            elif original_domain[index_label] != domain[index_label] and original_core[index_label] != core[index_label]:
                score_a = 0.25
                if len(core[index_label].split("::"))>1:
                    score_c = 0.1
                    score_a = 0.20
                elif len(core[index_label].split("__"))>1:
                    score_c = 0.25

                if len(modifier[index_label].split("__"))>1:
                    score_b = 0.35
                    score_c+= 0.075
                    score_a+= 0.075
                else:
                    score_b = 1 - score_a - score_c

                our_values = our_values * (score_a/(score_a+score_b+score_c)) + (score_b/(score_a+score_b+score_c)) * np.dot(embeddings_index[original_modifier[index_label]], embeddings_value.T) + (score_c/(score_a+score_b+score_c)) * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
            elif original_domain[index_label] == domain[index_label] and original_core[index_label] == core[index_label]:
                if len(modifier[index_label].split("__"))>1:
                    score_b = 0.3
                    score_a = 0.35
                    score_c = 0.35
                
                our_values = our_values * score_a + score_b * np.dot(embeddings_index[original_modifier[index_label]], embeddings_value.T) + score_c * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
            elif original_domain[index_label] == domain[index_label] and original_core[index_label] != core[index_label]:
                score_a = 0.35
                score_b = 0
                
                if len(core[index_label].split("::"))>1:
                    score_c = 0.1

                    score_a = 0.4
                elif len(core[index_label].split("__"))>1:
                    score_c = 0.2
                    score_a = 0.35
                if len(modifier[index_label].split("__"))>1:
                    score_b = 0.35
                    score_a = 1 - score_a-score_b
                else:
                    score_b = 1 - score_a - score_c
                our_values = our_values * score_a + score_b * np.dot(embeddings_index[original_modifier[index_label]], embeddings_value.T) + score_c * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)

            embedding_val = (score_a/(score_a+score_b+score_c)) * embeddings_index[original_domain[index_label]] +  (score_b/(score_a+score_b+score_c)) * embeddings_index[original_modifier[index_label]]+ (score_c/(score_a+score_b+score_c)) * embeddings_index[original_core[index_label]]
        
        
        elements = our_values
        length_no = list(filter(lambda x: x >= 0.475 , elements)) 

        
        length_no2 = list(filter(lambda x: x >= 0.3 , elements))
        length_no4 = list(filter(lambda x: x >= 0.4 , elements))
      


        # Mean similarity over each bucket; 0.0 when the bucket is empty.
        mean_values.append(np.mean(length_no2) if length_no2 else 0.0)
        mean_values_total.append(np.mean(length_no) if length_no else 0.0)


        our+=len(length_no)
        rows.append(our)
        row_values_4.append(length_no4)
        row_values.append(length_no)

        row_embeddings.append(embedding_val)
    max_values = 0
    mean_max_values = -1
    index_x = -1
    max_index=-1

    
    max_mean_5 = -1
    indices_value = []
    final_indices_val = -1
    values_to_reconsider = []

    for idx4, matches in enumerate(row_values):
        # Rank candidates by the number of strong (>= 0.475) matches they have.
        ivv = len(matches)

        if max_values < ivv and mean_values_total[idx4] >= 0.085:
            max_values = ivv
            index_x = idx4

            mean_val_mid = sum(matches) / len(matches)

            if max_mean_5 < mean_val_mid:
                max_mean_5 = mean_val_mid
            indices_value = [idx4]
        elif max_values == ivv and mean_values_total[idx4] >= 0.085:
            indices_value.append(idx4)


    for idx4, ivv in enumerate(row_values):
        if index_x != idx4 and  mean_values_total[idx4] >= 0.085 and np.max(row_values[idx4]) > 0.9 and len(row_values_4[idx4]) >= len(row_values_4[index_x]):
            indices_value.append(idx4)
            values_to_reconsider.append(idx4)

    if max_mean_5 < 0.65 and max_values != 0:
        indices_elements_copy = []
        for idx4 in indices_value:
            
            label_idx = list_matrix[idx][idx4][1]
            if (
                context_matrix[idx][label_idx] > 0.45
                or (context_matrix_niche[idx][label_idx] > 0.45 and similarity_to_niche[idx][idx] > 0.3)
                # The category similarity is only tested for being non-zero here.
                or (context_matrix_category[idx][label_idx] > 0.45 and similarity_to_cat[idx][idx])
                or first_sentence_matrix[idx][label_idx] > 0.45
            ):
                indices_elements_copy.append(idx4)
        if indices_elements_copy != []:
            indices_value = indices_elements_copy.copy()

    if max_mean_5 < 0.65 or len(indices_value)>1:
        indices_value_copy = []

        for idx4 in indices_value:
            ivv = len(row_values_4[idx4])

            if max_values < ivv and mean_values_total[idx4] >= 0.085:
                max_values = ivv
                index_x = idx4
                indices_value_copy=[idx4]
            elif max_values == ivv and mean_values_total[idx4] >= 0.085:
                indices_value_copy.append(idx4)
            elif idx4 in values_to_reconsider:
                indices_value_copy.append(idx4)
        if indices_value_copy != []:
            indices_value = indices_value_copy.copy()

    if max_values!=0 and len(indices_value) > 1:
        indices_value_copy = []
        max_values = len(list(row_values[index_x]))
        mean_max_values = 0

        for idx3 in indices_value:

            row_values_final = list(row_values[idx3])
            ivv = len(list(row_values[idx3]))
            mean_val = 0
            if len(row_values_final) != 0:
                mean_val = np.mean(row_values_final)            
            
            if (max_values == ivv and mean_val > mean_max_values):
                mean_max_values = mean_val
                index_x = idx3
                indices_value_copy= [idx3]
        
        for idx3, ivv in enumerate(rows):
            if idx3 in values_to_reconsider:
                indices_value_copy.append(idx3)
        
        if indices_value_copy != []:
            indices_value = indices_value_copy.copy()
 
    if max_values == 0 or len(indices_value) > 1:
        mean_max_values = -1
        index_x = -1
        if max_values == 0:
            indices_value = list(range(len(row_values_4)))
        # Compare the remaining candidates against the first one: only a candidate
        # with the same number of >= 0.4 matches and a higher mean can replace it.
        max_values = len(row_values_4[indices_value[0]])
        for idx3 in indices_value:
            mean_val = 0

            if len(row_values_4[idx3]) != 0:
                mean_val = np.mean(row_values_4[idx3])

            if max_values == len(row_values_4[idx3]) and mean_val > mean_max_values:
                mean_max_values = mean_val
                index_x = idx3

    return index_x, row_values[index_x], row_values_4[index_x]

### Layered Label Recovery and Semantic Filtering Strategy

This module implements a robust fallback mechanism to recover meaningful labels when initial predictions are weak (score ≤ 0.30) or overly filtered. It operates through multiple backup matrices and semantic checks to re-rank and reinforce candidate labels using contextual embeddings and structural logic.


### Objective

To maintain stable and contextually accurate label predictions, especially when the top scoring candidate is filtered out or deemed unreliable.


### Fallback Recovery Pipeline

When `list_matrix[i][0][0] ≤ 0.30`, or when more than five candidates fall within 0.10 of the top score, the following recovery layers are applied in order:

1. **old_list_matrix**  
   A less heavily filtered version of the current matrix. If its top candidate scores above 0.30, it is used. Otherwise, the fallback continues.

2. **list_matrix_original_2**  
   A moderately-filtered version, used if the previous matrix fails to yield a suitable result.

3. **list_matrix_original**  
   A stable baseline. If no valid candidates are found yet, this version is used.

4. **Last Resort**  
   If all of the above fail, the candidate with the highest contextual similarity among the top 10 entries of `list_matrix_original` that score at least 0.30 is selected as a fallback.
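
The layering above can be sketched in a few lines. This is a simplified illustration, not the notebook's implementation: `pick_with_fallback` and the `select` callback are hypothetical names standing in for the row-level selectors, and candidates are assumed to be `(score, label_index)` tuples as in the candidate matrices.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Candidate = Tuple[float, int]  # (score, label_index)

SCORE_FLOOR = 0.30  # threshold quoted in the text


def pick_with_fallback(
    layers: Sequence[List[Candidate]],
    select: Callable[[List[Candidate]], Optional[int]],
    context_scores: Sequence[float],
) -> Optional[Candidate]:
    """Try each layer in order; use the context score as a last resort."""
    for layer in layers:
        ranked = sorted(layer, reverse=True)
        # A layer is only usable if its best candidate clears the floor.
        if not ranked or ranked[0][0] <= SCORE_FLOOR:
            continue
        position = select(ranked)  # hypothetical stand-in for the selectors
        if position is not None:
            return ranked[position]

    # Last resort: among the top 10 of the final layer, keep candidates at or
    # above the floor and take the one with the highest contextual similarity.
    baseline = sorted(layers[-1], reverse=True)[:10]
    eligible = [c for c in baseline if c[0] >= SCORE_FLOOR]
    if not eligible:
        return None
    return max(eligible, key=lambda c: context_scores[c[1]])


# Toy data: the first layer is too weak, so the second one is used.
old = [(0.25, 0), (0.20, 1)]
moderate = [(0.42, 2), (0.35, 1)]
baseline = [(0.50, 0), (0.31, 2)]
chosen = pick_with_fallback([old, moderate, baseline], lambda ranked: 0, [0.1, 0.2, 0.9])
```

In the notebook each layer additionally masks every non-selected candidate with a score of `-1`, which this sketch omits.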


### Key Functions

**`filter_by_text()`**
Originally designed to filter candidates across the entire dataset, this function is now applied at row level to select **a single most semantically relevant label** for a specific instance. It uses domain-core-modifier decomposition and token similarity scores to prioritize meaningful matches.

**`find_best_indices_cat()`**
Previously used globally for category-aware filtering, this function now operates at row level. It identifies the best label aligned with the company’s category and niche by scoring semantic similarity against known business-relevant terms.
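
Both functions reduce each candidate label to a list of similarity scores and then rank the candidates by how many scores clear a threshold, breaking ties with a weaker threshold and a mean. A minimal sketch of that ranking idea follows. `pick_best_candidate` is a hypothetical helper, and the 0.5 and 0.45 defaults mirror the thresholds used in `find_best_indices_cat`; the real functions apply further checks.

```python
from statistics import mean
from typing import Sequence, Tuple


def pick_best_candidate(similarities: Sequence[Sequence[float]],
                        strong: float = 0.5, weak: float = 0.45) -> int:
    """Return the index of the candidate with the most strong matches.

    Ties are broken by the number of weak matches, then by their mean.
    Returns -1 when no candidate has a strong match.
    """
    if not similarities:
        return -1

    def key(row: Sequence[float]) -> Tuple[int, int, float]:
        strong_hits = [s for s in row if s >= strong]
        weak_hits = [s for s in row if s >= weak]
        return (len(strong_hits), len(weak_hits), mean(weak_hits) if weak_hits else 0.0)

    keys = [key(row) for row in similarities]
    best = max(range(len(keys)), key=lambda i: keys[i])
    return best if keys[best][0] > 0 else -1


# Candidate 1 has two strong matches, candidate 0 only one.
best = pick_best_candidate([[0.6, 0.46, 0.2], [0.55, 0.52, 0.1], [0.3, 0.2]])
```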


### Semantic Validation and Scoring

Each fallback step involves:

- Comparing embeddings of label terms with input tokens.
- Verifying modifier, domain, and core relevance through cosine similarity.
- Penalizing generic or ambiguous terms using a weighted scheme.
- Enforcing contextual similarity thresholds to prevent mislabeling.
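
The scoring steps listed above can be illustrated with a small, dependency-free sketch. It assumes each label part (domain, modifier, core) has an embedding vector and that the per-part weights are given; `score_label` and `cosine` are hypothetical helpers, while the 0.7 genericity penalty and the 0.5 threshold are taken from the code in this section.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def score_label(parts: Dict[str, Vector], weights: Dict[str, float],
                tokens: List[Vector], penalty: float = 1.0,
                threshold: float = 0.5) -> List[float]:
    """Blend per-part similarities and keep the tokens that clear the threshold."""
    total = sum(weights.values())  # normalise so the weights sum to 1
    kept = []
    for token in tokens:
        blended = sum(weights[name] / total * cosine(vec, token)
                      for name, vec in parts.items())
        # Generic labels are penalised before the threshold check.
        if blended * penalty >= threshold:
            kept.append(blended)
    return kept


parts = {"domain": (1.0, 0.0), "core": (0.0, 1.0)}
weights = {"domain": 0.3, "core": 0.7}
tokens = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

matches = score_label(parts, weights, tokens)                # two tokens clear 0.5
penalised = score_label(parts, weights, tokens, penalty=0.7)  # none survive the penalty
```

A label whose matches disappear under the penalty contributes no counts to the ranking, which is how generic terms are suppressed.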


### Outcome

This multi-stage approach ensures:

- Recovery of relevant labels even under heavy filtering.
- Suppression of generic or poorly-aligned terms.
- More stable and meaningful predictions, consistent with the company’s context and domain.

In [84]:
def find_best_indices_cat(our_matrix, idx, magic_number):
    # Rank the candidates of the matrix that was passed in (highest score first).
    our_matrix[idx].sort(reverse=True)

    max_value11 = our_matrix[idx][0][0]

    # A clear winner (lead of at least 0.1 over the runner-up) needs no re-ranking.
    if max_value11 - our_matrix[idx][1][0] >= 0.1:
        return 0

    rows = []
    rows_emb = []

    list111 = []
    row_filtered = []
    mean_values_total = []

    row_embeddings = []
    rows_emb4_5 = []

    # Reference terms: the row's strong values plus its category and niche words.
    set_values = set(strong_values_all[idx])
    set_values.update(our_categories_niche_words['niche_plus_cat'][idx])
    list_val = set_values

    # Nothing to compare against: return -1 so the caller can try filter_by_text.
    if not list_val:
        return -1
    embeddings_value = np.array([embeddings_index.get(val, np.zeros(300)) for val in list_val], dtype=np.float32, order='C')

    faiss.normalize_L2(embeddings_value)

    for j in range(magic_number):
        list111.append(our_classes['label'][our_matrix[idx][j][1]])
        row_filtered.append(our_matrix[idx][j])
        index_label = our_matrix[idx][j][1]

        our = 0
        our_values = np.dot(embeddings_index[original_domain[index_label]].reshape(1,-1), embeddings_value.T)[0]

        score_b = 1
        score_a = 0
        embedding_val = None
        penalizer_genericity = 1.0
        
        if len(list_elements[index_label])==1:
           embedding_val = embeddings_index[original_domain[index_label]]
           
           if len(original_domain[index_label].split("_"))>1 and original_domain[index_label] == domain[index_label]:
 
                our_values_temp = 0.5 * np.dot(embeddings_index[original_domain[index_label].split("_")[0]], embeddings_value.T) + 0.5 * np.dot(embeddings_index[original_domain[index_label].split("_")[1]], embeddings_value.T)
                our_values_45 = list(filter(lambda x: x>=0.45, our_values))
                our_values_45_temp = list(filter(lambda x: x>=0.45, our_values_temp))

                our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))
               
                if (len(our_values_45_temp) > len(our_values_45)) or (len(our_values_5_temp) > len(our_values_5)):
                    our_values = our_values_temp

        elif len(list_elements[index_label])==2:
            if  original_domain[index_label] != domain[index_label] and original_core[index_label] == core[index_label]:
                score_a = 0.3
                score_b = 0.7
        
                our_values = our_values * 0.3 + 0.7 * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                if len(original_core[index_label].split("_")) > 1:
                    our_values_temp =  0.2 * np.dot(embeddings_index[original_domain[index_label]],  embeddings_value.T)+ 0.425 * np.dot(embeddings_index[original_core[index_label].split("_")[0]], embeddings_value.T) + 0.375 * np.dot(embeddings_index[original_core[index_label].split("_")[1]], embeddings_value.T) 
                    our_values_45 = list(filter(lambda x: x>=0.45, our_values))
                    our_values_45_temp = list(filter(lambda x: x>=0.45, our_values_temp))
                    our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                    our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))
               
                    if (len(our_values_5_temp) > len(our_values_5) or len(our_values_45_temp) > len(our_values_45)):
                        our_values = our_values_temp

            elif original_domain[index_label] != domain[index_label] and original_core[index_label] != core[index_label]:
                score_a = 0.3
                if len(core[index_label].split("::"))>1:
                    score_b = 0.1
                elif len(core[index_label].split("__"))>1:
                    score_b = 0.3
                penalizer_genericity = 0.7
                
                our_values = our_values * (score_a/(score_a+score_b)) + (score_b/(score_a+score_b)) * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
            elif original_domain[index_label] == domain[index_label] and original_core[index_label] == core[index_label]:
                score_a = 0.5
                score_b = 0.5
             
                our_values = our_values * (score_a/(score_a+score_b)) + (score_b/(score_a+score_b)) * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                if len(original_core[index_label].split("_")) > 1:
                    our_values_temp = 0.3 * np.dot(embeddings_index[original_domain[index_label]], embeddings_value.T) + 0.4 * np.dot(embeddings_index[original_core[index_label].split("_")[0]], embeddings_value.T) + 0.3 * np.dot(embeddings_index[original_core[index_label].split("_")[1]], embeddings_value.T)
                    our_values_45 = list(filter(lambda x: x>=0.45, our_values))
                    our_values_45_temp = list(filter(lambda x: x>=0.45, our_values_temp))
                    our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                    our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))

                    if len(our_values_45_temp) > len(our_values_45) or len(our_values_5_temp) > len(our_values_5):
                        our_values = our_values_temp
                if len(original_domain[index_label].split("_"))>1:
                    our_values_temp = 0.3 * np.dot(embeddings_index[original_domain[index_label].split("_")[0]], embeddings_value.T) + 0.4 * np.dot(embeddings_index[original_domain[index_label].split("_")[1]], embeddings_value.T) + 0.3 * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                    our_values_45 = list(filter(lambda x: x>=0.45, our_values))
                    our_values_45_temp = list(filter(lambda x: x>=0.45, our_values_temp))
                    our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                    our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))

                    if len(our_values_5_temp) > len(our_values_5) or len(our_values_45_temp) > len(our_values_45):
                        our_values = our_values_temp
            elif original_domain[index_label] == domain[index_label] and original_core[index_label] != core[index_label]:
               
                
                if len(core[index_label].split("::"))>1:
                    score_a = 0.9
                elif len(core[index_label].split("__"))>1:
                    score_a = 0.7
                score_b= 1 - score_a
               
                our_values = our_values * score_a + score_b * np.dot(embeddings_index[original_core[index_label]].reshape(1,-1), embeddings_value.T)[0]
                if len(original_domain[index_label].split("_"))>1:
                    if len(core[index_label].split("::"))>1:
                        score_c1 = 0.1
                        score_a1 = 0.4
                        score_b1 = 0.5
                    elif len(core[index_label].split("__"))>1:
                        score_c1 = 0.2
                        score_a1 = 0.35
                        score_b1 = 0.45

                    our_values_temp = score_a1 * np.dot(embeddings_index[original_domain[index_label].split("_")[0]], embeddings_value.T) + score_b1 * np.dot(embeddings_index[original_domain[index_label].split("_")[1]], embeddings_value.T) + score_c1 * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
                    our_values_45 = list(filter(lambda x: x>=0.45, our_values))
                    our_values_45_temp = list(filter(lambda x: x>=0.45, our_values_temp))
                    our_values_5 = list(filter(lambda x: x>=0.5, our_values))
                    our_values_5_temp = list(filter(lambda x: x>=0.5, our_values_temp))
                    
                    if len(our_values_5_temp) > len(our_values_5) or len(our_values_45_temp) > len(our_values_45):
                        our_values = our_values_temp
                
            embedding_val = (score_a/(score_a+score_b)) * embeddings_index[original_domain[index_label]] +  (score_b/(score_a+score_b)) * embeddings_index[original_core[index_label]]

        elif len(list_elements[index_label])==3:
            # Default weights for domain (a), modifier (b) and core (c).
            score_a = 0.3
            score_b = 0.4
            score_c = 0.3
            if original_domain[index_label] != domain[index_label] and original_core[index_label] == core[index_label]:
                score_a = 0.2
                score_b = 0.425
                score_c = 0.375
            elif original_domain[index_label] != domain[index_label] and original_core[index_label] != core[index_label]:
                if len(core[index_label].split("::"))>1:
                    score_c = 0.1
                    score_a = 0.235
                elif len(core[index_label].split("__"))>1:
                    score_c = 0.2
                    score_a = 0.2
                score_b = 1 - score_a - score_c
            elif original_domain[index_label] == domain[index_label] and original_core[index_label] == core[index_label]:
                pass  # the default weights apply
            elif original_domain[index_label] == domain[index_label] and original_core[index_label] != core[index_label] and len(core[index_label].split("::"))>1:
                score_a = 0.425
                score_c = 0.1
                score_b = 1 - score_a - score_c
            elif original_domain[index_label] == domain[index_label] and original_core[index_label] != core[index_label]:
                if len(modifier[index_label].split("__"))>1:
                    score_b = 0.5
                    score_a = 0.3
                    score_c = 1 - score_a - score_b
                else:
                    score_b = 0.45
                    score_a = 0.35
                    score_c = 1 - score_a - score_b

            # Apply the normalised weights once, to the similarity scores and to the blended label embedding.
            our_values = our_values * (score_a/(score_a+score_b+score_c)) + (score_b/(score_a+score_b+score_c)) * np.dot(embeddings_index[original_modifier[index_label]], embeddings_value.T) + (score_c/(score_a+score_b+score_c)) * np.dot(embeddings_index[original_core[index_label]], embeddings_value.T)
            embedding_val = (score_a/(score_a+score_b+score_c)) * embeddings_index[original_domain[index_label]] +  (score_b/(score_a+score_b+score_c)) * embeddings_index[original_modifier[index_label]]+ (score_c/(score_a+score_b+score_c)) * embeddings_index[original_core[index_label]]

       
     
        # Strong matches (>= 0.5) and weaker matches (>= 0.45), after the genericity penalty.
        length_no = list(filter(lambda x: (x * penalizer_genericity) >= 0.5, our_values))

        length_no4_5 = list(filter(lambda x: (x * penalizer_genericity) >= 0.45, our_values))

        if len(length_no) > 0:
            mean_values_total.append(np.mean(length_no))
        else:
            mean_values_total.append(0)

        our+=len(length_no)
        rows.append(our)
        rows_emb.append(length_no)
        rows_emb4_5.append(length_no4_5)
        row_embeddings.append(embedding_val)

    max_values = 0
    index_x = -1

    for idx4, ivv in enumerate(rows):
        if max_values < ivv and mean_values_total[idx4] >= 0.085:
            max_values = ivv
            index_x = idx4
    list_best_cat.append(index_x)

    if max_values==0:
        return -1


    # Among candidates with the maximum number of strong matches, prefer the
    # one with the most >= 0.45 matches.
    max_len_4_5_values = -1
    list_indices = []
    for idx3, ivv in enumerate(rows):
        if max_values == ivv and len(rows_emb4_5[idx3]) > max_len_4_5_values:
            max_len_4_5_values = len(rows_emb4_5[idx3])
            index_x = idx3
            list_indices = [idx3]
        elif max_values == ivv and len(rows_emb4_5[idx3]) == max_len_4_5_values:
            list_indices.append(idx3)

    # Remaining ties are broken by the mean of the >= 0.45 matches.
    if len(list_indices) > 1:
        mean_val = -1
        for idx3 in list_indices:
            if len(rows_emb4_5[idx3]) > 0:
                mean_ = np.mean(rows_emb4_5[idx3])
            else:
                mean_ = 0
            if mean_ > mean_val:
                index_x = idx3
                mean_val = mean_
    
    

    return index_x


In [85]:
final_list = []
elements_filtered = []

i=0

zz=0
ridiculous_number = 0
for idx, _ in df.iterrows():
    list_matrix[i].sort(reverse=True)
    
    list111 = []

    max_value = list_matrix[i][0][0]
    row_filtered = []


    for j in range(30):
        
        if(list_matrix[i][j][1]==last_index_elements2[i] and j!=0):
            break        
        
        
        if max_value - list_matrix[i][j][0] <= 0.10:
            list111.append(our_classes['label'][list_matrix[i][j][1]])
            row_filtered.append(list_matrix[i][j])
        if list_matrix[i][j+1][0] < 0.30:
            break
            
            
    final_list.append(list111)
    
    ind_cat = -1
    list1 = [11]
    list2 = [11]
    list_matrix_original[i].sort(reverse=True)
    list_matrix_original_2[i].sort(reverse=True)
    old_list_matrix[i].sort(reverse=True)

    if(list_matrix[i][0][0]<=0.30 and old_list_matrix[i][0][0]>0.30) or len(list111)>5:
        if i not in skip_categories_for_this_one:
            ind_cat = find_best_indices_cat(old_list_matrix, i, 5)
            list_matrix_temp = old_list_matrix[i].copy()
        else:
                
            # Mask candidates that are already strongly tied to the niche context.
            for j in range(220):
                if context_matrix_niche[i][old_list_matrix[i][j][1]]>=0.46:
                    old_list_matrix[i][j] = (-1, old_list_matrix[i][j][1])
            old_list_matrix[i].sort(reverse=True)

            

        if(ind_cat==-1):
            ind_cat,list1, list2 = filter_by_text(old_list_matrix, i, 5)
        temp_list_element = old_list_matrix[i].copy()
        if list1!=[] or list2 != []:
            for j in range(220):
                if j!=ind_cat:
                    temp_list_element[j] = (-1, temp_list_element[j][1])
            list_matrix[i] = temp_list_element.copy()        
   
    if (list1==[] and list2==[]) or (list_matrix[i][0][0]<=0.30 and list_matrix_original_2[i][0][0]>0.30):
        ind_cat = -1
        if i not in skip_categories_for_this_one:
            ind_cat = find_best_indices_cat(list_matrix_original_2, i, 5)
            list_matrix_temp = list_matrix_original_2[i].copy()
        else:

            for j in range(220):
                if context_matrix_niche[i][list_matrix_original_2[i][j][1]]>=0.46:
                    list_matrix_original_2[i][j] = (-1, list_matrix_original_2[i][j][1])
            list_matrix_original_2[i].sort(reverse=True)
            
        if(ind_cat==-1):
            ind_cat, list1, list2 = filter_by_text(list_matrix_original_2, i, 5)
        temp_list_element = list_matrix_original_2[i].copy()
        if list1!=[] or list2 != []:
            for j in range(220):
                if j!=ind_cat:
                    temp_list_element[j] = (-1, temp_list_element[j][1])
            list_matrix[i] = temp_list_element.copy()
        
  
    if (list1==[] and list2==[]) or (list_matrix[i][0][0]<=0.30 and list_matrix_original[i][0][0]>0.30):
        ind_cat = -1
        if i not in skip_categories_for_this_one:
            ind_cat = find_best_indices_cat(list_matrix_original, i, 5)
            list_matrix_temp = list_matrix_original[i].copy()
            
        else:
            listxzz=[]
            for j in range(220):
                if context_matrix_niche[i][list_matrix_original[i][j][1]]>=0.46:
                    list_matrix_original[i][j] = (-1, list_matrix_original[i][j][1])

            # Re-sort once, after all masked entries have been written.
            list_matrix_original[i].sort(reverse=True)

            for element in our_categories_niche_words['niche_plus_cat'][i]:
                if element in our_words['new_col2'][i]:
                    our_words['new_col2'][i].remove(element)
        
        if(ind_cat==-1):
            ind_cat, list1, list2 = filter_by_text(list_matrix_original, i, 5)
        temp_list_element = list_matrix_original[i].copy()
        if list1!=[] or list2 != []:
            for j in range(220):
                if j!=ind_cat:
                    temp_list_element[j] = (-1, temp_list_element[j][1])
            list_matrix[i] = temp_list_element.copy()
            
        
        

        if list1 == [] and list2 == []:
            max_value = -1
            index_v = -1
            
            for j in range(10):
                if max_value < context_matrix[i][list_matrix_original[i][j][1]] and list_matrix_original[i][j][0] >=0.3:
                    max_value = context_matrix[i][list_matrix_original[i][j][1]]
                    index_v = j
            

            for j in range(220):
                if j!=index_v:
                    temp_list_element[j] = (-1, temp_list_element[j][1])
           
            
            list_matrix[i] = temp_list_element.copy()
                        
                
    elements_filtered.append(row_filtered)

    i+=1



In [86]:
final_list = []
elements_filtered = []

i=0

zz=0
ridiculous_number = 0

for idx, _ in df.iterrows():
    list_matrix[i].sort(reverse=True)
    
    list111 = []
    list111_old = []

    if list_matrix[i] == []:
        final_list.append([])
        i+=1
        continue

    max_value = list_matrix[i][0][0]

    
    row_filtered = []



    for j in range(min(30, len(list_matrix[i]))):
        
        if(list_matrix[i][j][1]==last_index_elements2[i] and j!=0):
            break
        
        if max_value - list_matrix[i][j][0] <= 0.1:
            list111.append(our_classes['label'][list_matrix[i][j][1]])
            row_filtered.append(list_matrix[i][j])
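        # Stop once the next candidate drops below 0.30 (guarding against the end of the list).
        if j + 1 < len(list_matrix[i]) and list_matrix[i][j+1][0] < 0.30: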
            break
        
    
    final_list.append(list111)

    elements_filtered.append(row_filtered)
    i+=1



In [87]:
df['insurance_label'] = final_list

In [88]:
df.to_csv('../output/challenge_ml_insurance_completed_full_final.csv')

### Saving Final Predictions

After generating the final results, we insert them back into the **valid rows** of our dataset by adding a new `insurance_label` column.

The completed dataset, including the predicted labels, is saved to **`../output/challenge_ml_insurance_completed_full_final.csv`** for further use or evaluation.

In [91]:
import os

# Intermediate artifacts produced earlier in the notebook; remove any that exist.
temp_files = [
    "context_matrix.txt",
    "context_sentence_embedding.txt",
    "label_embeddings.txt",
    "vec_embeddings.txt",
    "tokens_vector.csv",
    "strongest_words2.csv",
    "sectors.csv",
    "our_current_categories.csv",
    "counted_elements.csv",
    "average_tf_idf_per_word.txt",
    "most_important_words.txt",
    "match_words_with_tf_idf.txt",
    "correlated_terms.txt",
    "new_label_with_categories.csv",
    "first_sentence_matrix_total1.txt",
    "cat_niche_embeddings.txt",
    "counted_categories.csv",
    "niche_matrix.txt",
    "category_matrix.txt",
    "similarity_desc_to_category.txt",
    "similarity_desc_to_niche.txt",
]

for path in temp_files:
    if os.path.exists(path):
        os.remove(path)
