## Rec-System v1 (similarity)

The objective of this challenge is to build a system that matches each content item to the most relevant curriculum topics using only the information provided in the dataset. To address this, I designed a solution based on semantic similarity rather than a traditional supervised classifier. The core idea is to represent both contents and topics as dense embeddings generated by a Sentence Transformer, capturing their semantic meaning regardless of text length or structure. All topic embeddings are indexed with FAISS, which enables efficient nearest-neighbor search at scale. When a new content item is provided, the system encodes its text fields (title, description, and extracted text) into an embedding, queries FAISS, and returns the top-k most semantically similar topics. Because no model is trained on the labels, this approach applies directly to unseen content and depends only on the dataset provided.

In [1]:
import faiss
import ast
import pandas as pd
import numpy as np

In [2]:
df_topics_v1 = pd.read_csv('../data/new_topics_MiniLM.csv')
df_topics_v1.head()

Unnamed: 0,id,title,description,channel,category,level,language,parent,has_content,parent_title,full_info,embedding
0,t_00004da3a1b2,Откриването на резисторите,"Изследване на материали, които предизвикват на...",000cf7,source,4,bg,t_16e29365b50d,True,Открития и проекти,title: Откриването на резисторите description:...,[-4.65347767e-02 -1.63958110e-02 -6.97771162e-...
1,t_000095e03056,Unit 3.3 Enlargements and Similarities,,b3f329,aligned,2,en,t_aa32fb6252dc,False,Junior High Level 3,title: Unit 3.3 Enlargements and Similarities ...,[-3.45621109e-02 -8.46866667e-02 -4.03910577e-...
2,t_00068291e9a4,Entradas e saídas de uma função,Entenda um pouco mais sobre funções.,8e286a,source,4,pt,t_d14b6c2a2b70,True,Álgebra: funções,title: Entradas e saídas de uma função descrip...,[ 3.56158055e-03 1.42162973e-02 -4.03103158e-...
3,t_00069b63a70a,Transcripts,,6e3ba4,source,3,en,t_4054df11a74e,True,Flow Charts: Logical Thinking?,title: Transcripts description: channel: 6e3b...,[-1.94012914e-02 -3.58841158e-02 -7.74949836e-...
4,t_0006d41a73a8,Графики на експоненциални функции (Алгебра 2 н...,Научи повече за графиките на сложните показате...,000cf7,source,4,bg,t_e2452e21d252,True,Показателни и логаритмични функции,title: Графики на експоненциални функции (Алге...,[-7.89579586e-04 -3.88328498e-03 -4.68218066e-...


In [3]:
df_content = pd.read_csv('../data/new_content_MiniLM.csv', nrows=100)
df_content.head()

Unnamed: 0,id,title,description,kind,text,language,copyright_holder,license,full_info,embedding
0,c_00002381196d,"Sumar números de varios dígitos: 48,029+233,930","Suma 48,029+233,930 mediante el algoritmo está...",video,,es,,,"title: Sumar números de varios dígitos: 48,029...",[-4.57105562e-02 3.43930125e-02 -2.76543275e-...
1,c_000087304a9e,Trovare i fattori di un numero,Sal trova i fattori di 120.\n\n,video,,it,,,title: Trovare i fattori di un numero descript...,[ 2.53774282e-02 1.16816849e-01 -1.95691772e-...
2,c_0000ad142ddb,Sumar curvas de demanda,Cómo añadir curvas de demanda\n\n,video,,es,,,title: Sumar curvas de demanda description: Có...,[-3.95715050e-02 1.02949284e-01 -4.05631624e-...
3,c_0000c03adc8d,Nado de aproximação,Neste vídeo você vai aprender o nado de aproxi...,document,\nNado de aproximação\nSaber nadar nas ondas ...,pt,Sikana Education,CC BY-NC-ND,title: Nado de aproximação description: Neste ...,[-1.54546378e-02 -1.90587360e-02 -2.66271979e-...
4,c_00016694ea2a,geometry-m3-topic-a-overview.pdf,geometry-m3-topic-a-overview.pdf,document,Estándares Comunes del Estado de Nueva York\n\...,es,Engage NY,CC BY-NC-SA,title: geometry-m3-topic-a-overview.pdf descri...,[ 1.35210603e-02 1.60078555e-02 3.03699523e-...


In [4]:
df_corr=pd.read_csv('../data/correlations_content_to_topics.csv')
df_corr.head()

Unnamed: 0,content_id,topic_ids
0,c_00002381196d,t_81be1094dd83 t_d0edb1c53d90 t_d66311c2e171 t...
1,c_000087304a9e,t_696e745e6d1f t_c5a64afaec08
2,c_0000ad142ddb,t_66f09929d0d3
3,c_0000c03adc8d,t_472e0a2df5f7
4,c_00016694ea2a,t_7a81e3d4aeae t_bc8f347a0b19


In [5]:
def str_to_float_list(s: str):
    return np.fromstring(s.strip("[]"), sep=" ").tolist()

In [6]:
df_topics_v1.iloc[0]["full_info"]

'title: Откриването на резисторите description: Изследване на материали, които предизвикват намаление в отклонението, когато се свържат последователно с нашия измервателен уред.  channel: 000cf7 parent_title: Открития и проекти'

In [7]:
df_content.iloc[52]["full_info"]

"title: Finding average speed or rate description: Using the formula for finding distance we can determine Usian Bolt's average\nspeed, or rate, when he broke the world record in 2009 in the 100m. Watch.\n\n textSALMAN KHAN: I have some footage here of one of the most exciting moments in sports history. And to make it even more exciting, the commentator is speaking in German. And I'm assuming that this is OK under fair use, because I'm really using it for a math problem. But I want you to watch this video, and then I'll ask you a question about it. [CHEERING] COMMENTATOR: [SPEAKING GERMAN] SALMAN KHAN: So you see, it's exciting in any language that you might watch it. But my question to you is, how fast was Usain Bolt going? What was his average speed when he ran that 100 meters right there? And I encourage you to watch the video as many times as you need to do it. And now I'll give you a little bit of time to think about it, and then we will solve it. So we needed to figure out how fa

### Creation of the Vector DB

The emb_to_store function prepares a dataframe column of embedding strings for storage and use in the FAISS index. It first converts each embedding from its string representation into a list of floats using str_to_float_list. These lists are then stacked into a NumPy array of type float32, the format required by FAISS. Finally, the function applies L2 normalization to all vectors, ensuring that similarity search using inner product is equivalent to cosine similarity. The returned matrix can be directly added to a FAISS index for efficient retrieval.

In [8]:
def emb_to_store(df_emb):
    emb = df_emb.apply(str_to_float_list)
    vec_emb=np.array(emb.to_list()).astype('float32')
    faiss.normalize_L2(vec_emb)
    return vec_emb
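
As a sanity check on the normalization step, the following is a minimal NumPy-only sketch of the same pipeline on toy data. The helper `parse_embedding` and the toy vectors are illustrative assumptions, not part of the notebook. The division by the row norms stands in for `faiss.normalize_L2`, which performs the same row-wise normalization in place.

```python
import numpy as np

def parse_embedding(s: str) -> np.ndarray:
    """Parse a string such as '[0.1 0.2 0.3]' into a float32 vector (hypothetical helper)."""
    return np.array([float(x) for x in s.strip("[]").split()], dtype=np.float32)

# Toy embedding strings (illustrative values, not taken from the dataset)
raw = ["[3.0 4.0 0.0]", "[1.0 2.0 2.0]"]
mat = np.stack([parse_embedding(s) for s in raw])

# Row-wise L2 normalization: every row is divided by its own Euclidean norm
unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)

# Cosine similarity computed from the raw (unnormalized) vectors
cosine = mat[0] @ mat[1] / (np.linalg.norm(mat[0]) * np.linalg.norm(mat[1]))
# Inner product of the normalized vectors
inner = unit[0] @ unit[1]

print(np.linalg.norm(unit, axis=1))  # row norms are 1 up to floating-point error
print(np.isclose(cosine, inner))     # the two similarity values agree
```

The two values agree because dividing each vector by its norm moves the denominator of the cosine formula into the vectors themselves.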

In [9]:
vec_emb = emb_to_store(df_topics_v1["embedding"])

In [10]:
vec_emb.shape

(76972, 384)

The variable size = 384 sets the dimensionality of the embeddings (all-MiniLM-L6-v2 produces 384-dimensional vectors), matching the second dimension of vec_emb above. faiss.IndexFlatIP(size) creates a flat FAISS index that performs exact, exhaustive inner-product search over vectors of that dimensionality. Since the embeddings are L2-normalized, the inner product corresponds to cosine similarity, so the index returns the most semantically similar topic vectors.

In [11]:
size = 384
index_top_v1 = faiss.IndexFlatIP(size)

In [12]:
index_top_v1.add(vec_emb)
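
To make the search step concrete, here is a rough NumPy sketch of what an exhaustive inner-product search computes: score every stored vector against the query and keep the k best. The function name `topk_inner_product` and the random data are assumptions for illustration. This is not FAISS code, only the same idea written out by hand.

```python
import numpy as np

def topk_inner_product(queries: np.ndarray, database: np.ndarray, k: int):
    """Exhaustive top-k inner-product search.

    Returns (scores, indices), each of shape (n_queries, k),
    sorted from most to least similar.
    """
    scores = queries @ database.T                   # (n_queries, n_database)
    idx = np.argsort(-scores, axis=1)[:, :k]        # indices of the k largest scores
    top = np.take_along_axis(scores, idx, axis=1)   # the matching scores
    return top, idx

rng = np.random.default_rng(0)

# Random unit vectors standing in for the topic embeddings (illustrative only)
db = rng.normal(size=(1000, 384)).astype("float32")
db /= np.linalg.norm(db, axis=1, keepdims=True)

# A query built as a slightly perturbed copy of row 42, then re-normalized
query = db[42:43] + 0.01 * rng.normal(size=(1, 384)).astype("float32")
query /= np.linalg.norm(query, axis=1, keepdims=True)

D, I = topk_inner_product(query, db, k=5)
print(I[0])  # row 42 should come first
print(D[0])  # similarity scores in descending order
```

The returned `D` and `I` mirror the distance and index arrays that `index.search` returns later in this notebook.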

In [13]:
########################
        # SAVE #
########################

#file_name = 'vector_db_v1.bin'
#faiss.write_index(index_top_v1, file_name)

########################
        # LOAD #
########################

# file_name = './vector_db_v1.bin'
# index_top_v1 = faiss.read_index(file_name)

### Retrieving similar topics

The functions cont_emb and get_similars work together to retrieve the topics most semantically aligned with a given content item. The cont_emb function extracts the stored embedding for a specific content ID, converts it from a string to a float vector, reshapes it into a single-row matrix, and applies L2 normalization so it can be used in cosine-similarity search. The get_similars function then takes this normalized vector and performs a nearest-neighbor search in the FAISS index to find the top-k most similar topic embeddings. When can_print=True, it also displays debugging information: the content metadata and the matched topics along with their cosine-similarity scores (see the note below). Together, these functions provide a simple and interpretable way to inspect how well the embedding-based retrieval works on individual examples.

<b>Note:</b> Cosine similarity is a fundamental metric in embedding-based NLP systems because it measures how similar two vectors are based on their direction rather than their magnitude. This makes it well suited to text embeddings, where the goal is to capture semantic meaning rather than raw vector length. When embeddings are L2-normalized, which is the case in this project, cosine similarity is equal to the inner product, allowing efficient computation within FAISS. By comparing the angle between vectors, cosine similarity indicates how closely the meanings of a content item and a topic align, independent of differences in embedding scale. This property makes it a natural choice for retrieving semantically related topics in a large vector space.
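
For reference, the standard definition behind the note, written for two generic vectors $\mathbf{u}$ and $\mathbf{v}$ (symbols chosen here for illustration):

$$
\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert_2 \, \lVert \mathbf{v} \rVert_2}
$$

Rescaling either vector by a positive constant $a$ leaves the value unchanged, which is the sense in which the metric ignores magnitude:

$$
\cos(a\,\mathbf{u}, \mathbf{v}) = \frac{a \, (\mathbf{u} \cdot \mathbf{v})}{a \, \lVert \mathbf{u} \rVert_2 \, \lVert \mathbf{v} \rVert_2} = \cos(\mathbf{u}, \mathbf{v}), \qquad a > 0
$$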

In [14]:
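def cont_emb(content_id, df=df_content):
    """Return the L2-normalized embedding of a content item as a (1, dim) float32 array."""
    rows = df[df["id"] == content_id]
    if rows.empty:
        print("ID not found:", content_id)
        return None
    # Parse the stored string, then reshape to the (n_queries, dim) layout FAISS expects
    vec = np.array(str_to_float_list(rows["embedding"].iloc[0]), dtype="float32")
    vec = vec.reshape(1, -1)
    faiss.normalize_L2(vec)  # normalizes in place
    return vec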

In [15]:
def get_similars(c_id, k=10, can_print=True, df=df_content, index=index_top_v1):
    vec=cont_emb(c_id, df=df)
    if vec is None:
        return []  # unknown ID: return an empty list so callers can still iterate
    if can_print:
        print(">","#"*46, "CONTENT","#"*46,"<")
        print("C_ID:",df[df.id == c_id]["id"].tolist())
        print("TITLE:",df[df.id == c_id]["title"].tolist())
        print("DESCRIPTION:",df[df.id == c_id]["description"].tolist())
        
        cont_text = df[df.id == c_id]["text"].tolist()
        cont_text = cont_text if isinstance(cont_text[0], float) else cont_text[0][:200]
        print("TEXT:",cont_text,"...")    

        print("\n>","#"*46, "SIMILARS","#"*46,"<\n")

    D, I = index.search(vec, k) ## Here we make the vector search
    topic_ids = []
    for i,line in enumerate(I[0]):
        topic_ids.append(df_topics_v1.iloc[line]["id"])
        if not can_print:
            continue
        print("ID:",df_topics_v1.iloc[line]["id"], "\n(COS_SIMILARITY:", D[0][i], ")\n")
        print("TITLE:",df_topics_v1.iloc[line]["title"])
        print("DESCRIPTION:",df_topics_v1.iloc[line]["description"])
        print("CHANNEL:",df_topics_v1.iloc[line]["channel"])
        print("PARENT:",df_topics_v1.iloc[line]["parent_title"])
        print("<","-"*100,">")
    return topic_ids

In [16]:
test_id='c_00046806ad8a'
top_ids=get_similars(test_id, can_print=True)

> ############################################## CONTENT ############################################## <
C_ID: ['c_00046806ad8a']
TITLE: ['Compare multi-digit numbers']
DESCRIPTION: ['Use your place value skills to practice comparing whole numbers.']
TEXT: [nan] ...

> ############################################## SIMILARS ############################################## <

ID: t_fb62b461ea0c 
(COS_SIMILARITY: 0.7848346 )

TITLE: Comparing 3-digit numbers
DESCRIPTION: Learn how to compare three-digit numbers by thinking about place value (hundreds, tens, and ones).
CHANNEL: 7f116c
PARENT: Knowing our numbers
< ---------------------------------------------------------------------------------------------------- >
ID: t_8b1faedc3acf 
(COS_SIMILARITY: 0.77856964 )

TITLE: Comparing 3-digit numbers
DESCRIPTION: Learn how to compare three-digit numbers by thinking about place value (hundreds, tens, and ones).
CHANNEL: 0ec697
PARENT: Place value
< ---------------------------------------------

In [17]:
test_id='c_001dbc1e76fb'
top_ids=get_similars(test_id, can_print=True)

> ############################################## CONTENT ############################################## <
C_ID: ['c_001dbc1e76fb']
TITLE: ['2.5: Entropy and Energy']
DESCRIPTION: [nan]
TEXT: Most students who have had some chemistry know about the principle of the Second Law of Thermodynamics with respect to increasing disorder of a system. Cells are very organized or ordered structures,  ...

> ############################################## SIMILARS ############################################## <

ID: t_5adaffb1882c 
(COS_SIMILARITY: 0.51307786 )

TITLE: Laws of thermodynamics
DESCRIPTION: Although it might be fun to be exempt from the laws of physics (flying, anyone?), it turns out that cells and organisms are in fact subject to these laws, just like any other type of matter. Learn more about the laws of thermodynamics and how they relate to energy transfers in biological systems.
CHANNEL: 2ee29d
PARENT: Energy and enzymes
< ----------------------------------------------------------

In [18]:
test_id='c_000087304a9e'
top_ids=get_similars(test_id, can_print=True)

> ############################################## CONTENT ############################################## <
C_ID: ['c_000087304a9e']
TITLE: ['Trovare i fattori di un numero']
DESCRIPTION: ['Sal trova i fattori di 120.\n\n']
TEXT: [nan] ...

> ############################################## SIMILARS ############################################## <

ID: t_a451e5b28ede 
(COS_SIMILARITY: 0.7889689 )

TITLE: Massimo comun divisore
DESCRIPTION: Sai come trovare i fattori di un numero. Ma per quanto riguarda i fattori che sono comuni ai due numeri? Ancora meglio, immagina i fattori più grandi che sono comuni ai due numeri. Lo so. Troppo emozionante!
CHANNEL: 60b280
PARENT: Fattori e multipli
< ---------------------------------------------------------------------------------------------------- >
ID: t_f1ed395c3c90 
(COS_SIMILARITY: 0.6875235 )

TITLE: Massimo comun divisore
DESCRIPTION: Sai come trovare i fattori di un numero, ma cosa succede quando due numeri hanno dei fattori in comune? Anche m

In [19]:
test_id='c_002ba31673b0'
top_ids=get_similars(test_id, can_print=True)

> ############################################## CONTENT ############################################## <
C_ID: ['c_002ba31673b0']
TITLE: ['15.7: Basic data modeling']
DESCRIPTION: [nan]
TEXT: The real power of a relational database is when we create multiple tables and make links between those tables. The act of deciding how to break up your application data into multiple tables and establ ...

> ############################################## SIMILARS ############################################## <

ID: t_8288b3b5d91c 
(COS_SIMILARITY: 0.4991407 )

TITLE: 7: Entity Relationship Modelling
DESCRIPTION: 7: Entity Relationship Modelling
CHANNEL: 88c9d6
PARENT: Book: Relational Databases and Microsoft Access (McFadyen)
< ---------------------------------------------------------------------------------------------------- >
ID: t_bef271c8c5a6 
(COS_SIMILARITY: 0.47522593 )

TITLE: Two-way tables
DESCRIPTION: Learn how to read, interpret, and use two-way frequency tables.
CHANNEL: 0ec697
PAR

In [20]:
test_id='c_002b2fb1886d'
top_ids=get_similars(test_id, can_print=True)

> ############################################## CONTENT ############################################## <
C_ID: ['c_002b2fb1886d']
TITLE: ['TI-AIE: Developing your English']
DESCRIPTION: [nan]
TEXT: [nan] ...

> ############################################## SIMILARS ############################################## <

ID: t_46d0d76a2bc9 
(COS_SIMILARITY: 0.80369496 )

TITLE: TI-AIE: Supporting independent writing in English
DESCRIPTION: nan
CHANNEL: 4d2d4a
PARENT: Secondary
< ---------------------------------------------------------------------------------------------------- >
ID: t_535764933d82 
(COS_SIMILARITY: 0.79736865 )

TITLE: TI-AIE: Building your students' confidence to speak English
DESCRIPTION: nan
CHANNEL: 4d2d4a
PARENT: Secondary
< ---------------------------------------------------------------------------------------------------- >
ID: t_6b8aa81bd8fb 
(COS_SIMILARITY: 0.77969706 )

TITLE: TI-AIE: Reading for information
DESCRIPTION: nan
CHANNEL: 4d2d4a
PARENT: Language and 

### Metrics

<p>
In a recommendation scenario, the metrics below help evaluate how effectively the system retrieves the most relevant items (topics) for each query (content). 
Recall@k measures how much of the true relevant set is recovered, while Precision@k indicates how accurate the top-k recommendations are. 
The micro-aggregated versions reflect overall performance across the entire dataset.
</p>

<table border="1" cellpadding="6" cellspacing="0">
    <tr>
        <th>Metric</th>
        <th>Description</th>
    </tr>
    <tr>
        <td><b>hits</b></td>
        <td>Number of relevant topics correctly retrieved within the top-k.</td>
    </tr>
    <tr>
        <td><b>true_count</b></td>
        <td>Total number of true topics for the content item.</td>
    </tr>
    <tr>
        <td><b>rec@k</b></td>
        <td>Recall@k = hits / true_count. Measures how many relevant items were recovered.</td>
    </tr>
    <tr>
        <td><b>prec@k</b></td>
        <td>Precision@k = hits / k. Measures how many retrieved items are correct.</td>
    </tr>
    <tr>
        <td><b>micro_rec@k</b></td>
        <td>Aggregated recall across the dataset: total_hits / total_true.</td>
    </tr>
    <tr>
        <td><b>micro_prec@k</b></td>
        <td>Aggregated precision: total_hits / (N * k), where N is the number of evaluated items.</td>
    </tr>
</table>
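
The definitions in the table can be checked by hand on a small example. The sketch below uses two made-up content items with invented topic IDs. The function names and all values are illustrative assumptions, separate from the notebook's own evaluation code.

```python
def recall_precision_at_k(retrieved, relevant, k):
    """Per-item metrics: hits, true_count, rec@k and prec@k."""
    hits = sum(1 for topic in retrieved[:k] if topic in relevant)
    return {
        "hits": hits,
        "true_count": len(relevant),
        "rec@k": hits / len(relevant),
        "prec@k": hits / k,
    }

def micro_average(per_item, k):
    """Micro-aggregated metrics: pool hits and true counts over all items."""
    total_hits = sum(m["hits"] for m in per_item)
    total_true = sum(m["true_count"] for m in per_item)
    return {
        "micro_rec@k": total_hits / total_true,
        "micro_prec@k": total_hits / (len(per_item) * k),
    }

k = 3
# Hypothetical retrieved lists and ground-truth topics for two content items
item_1 = recall_precision_at_k(["t_a", "t_b", "t_c"], ["t_a", "t_d"], k)
item_2 = recall_precision_at_k(["t_x", "t_y", "t_z"], ["t_y", "t_z", "t_w", "t_v"], k)

print(item_1)  # 1 hit out of 2 true topics: rec@3 = 0.5, prec@3 = 1/3
print(item_2)  # 2 hits out of 4 true topics: rec@3 = 0.5, prec@3 = 2/3
print(micro_average([item_1, item_2], k))  # 3 hits / 6 true = 0.5 and 3 hits / (2 * 3) = 0.5
```

Micro-aggregation pools the counts before dividing, so items with many true topics weigh more in the recall than items with few.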


In [21]:
def num_of_match(id, top_k=10, df=df_content, index=index_top_v1):

    top_ids=get_similars(id, k=top_k, can_print=False, df=df, index=index)
    true_pred=df_corr[df_corr.content_id==id]["topic_ids"].tolist()[0].split(" ")
    real_match = 0

    for value in top_ids:
        if value in true_pred:
            real_match+=1

    return {
        "hits": real_match,
        "true_count": len(true_pred),
        "rec@k":real_match/len(true_pred), 
        "prec@k": real_match/top_k
    }

In [22]:
match_test=num_of_match('c_00046806ad8a')
match_test

{'hits': 2, 'true_count': 5, 'rec@k': 0.4, 'prec@k': 0.2}

In [23]:
def micro_metrics(match_top_k, k):
    total_hits = 0
    total_true = 0
    for match in match_top_k:
        total_hits += match.get("hits")
        total_true += match.get("true_count")
    
    return {
        "micro_rec@k" : round(total_hits/total_true, 4), 
        "micro_prec@k": round(total_hits/(len(match_top_k)*k), 4)
    }

In [24]:
def calculate_metrics(top_k=3, df=df_content, index=index_top_v1):
    matchs_top_k=[]
    for _, element in df.iterrows():
        matchs_top_k.append(num_of_match(element["id"],top_k=top_k, df=df, index=index))

    #print(matchs_top_k)
    print(f"MICRO_METRICS TOP_{top_k} (dataset size:{df.shape[0]}):{micro_metrics(matchs_top_k, top_k)}")

calculate_metrics()
calculate_metrics(top_k=5)
calculate_metrics(top_k=10)
calculate_metrics(top_k=50)
calculate_metrics(top_k=100)
calculate_metrics(top_k=200)

MICRO_METRICS TOP_3 (dataset size:100):{'micro_rec@k': 0.1709, 'micro_prec@k': 0.1133}
MICRO_METRICS TOP_5 (dataset size:100):{'micro_rec@k': 0.2111, 'micro_prec@k': 0.084}
MICRO_METRICS TOP_10 (dataset size:100):{'micro_rec@k': 0.2764, 'micro_prec@k': 0.055}
MICRO_METRICS TOP_50 (dataset size:100):{'micro_rec@k': 0.4221, 'micro_prec@k': 0.0168}
MICRO_METRICS TOP_100 (dataset size:100):{'micro_rec@k': 0.4824, 'micro_prec@k': 0.0096}
MICRO_METRICS TOP_200 (dataset size:100):{'micro_rec@k': 0.5377, 'micro_prec@k': 0.0053}


These results illustrate the recall–precision trade-off typical of retrieval-based recommendation systems. Small values of k (TOP-3 and TOP-5) yield the highest precision (0.1133 and 0.084), while larger values raise recall substantially, from 0.1709 at k=3 to 0.5377 at k=200, at the cost of precision falling to 0.0053. This pattern is expected, and it highlights some positive aspects of the embedding-based approach: without any supervision, the model captures semantic relationships between content and topics and retrieves relevant items despite the sparsity and variability of the dataset. The steady growth in recall suggests that the embeddings encode meaningful structure, even when textual descriptions are short or inconsistent. Nonetheless, ranking quality at the very top positions remains a challenge, a common limitation of zero-shot similarity models: even at k=200, roughly 46% of the true topics are still not retrieved. Future improvements could include fine-tuning a bi-encoder on the correlations data, enriching topic representations (e.g., using ancestors in the hierarchy), or adding a cross-encoder reranker to improve the top-k ranking accuracy.
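
Part of the precision drop at large k is purely arithmetic: a content item with fewer than k true topics cannot reach a precision of 1 at that k, however good the ranking is. The sketch below computes this ceiling for a handful of made-up true-topic counts. The function name and the counts are assumptions for illustration, not statistics from the dataset.

```python
def precision_ceiling(true_counts, k):
    """Best achievable micro precision@k if every true topic were ranked first."""
    best_hits = sum(min(count, k) for count in true_counts)
    return best_hits / (len(true_counts) * k)

# Hypothetical number of true topics for four content items
true_counts = [1, 2, 5, 1]

for k in (3, 10, 50):
    print(k, round(precision_ceiling(true_counts, k), 4))
# 3 0.5833
# 10 0.225
# 50 0.045
```

Comparing a measured micro precision against such a ceiling shows how much of the gap comes from ranking errors rather than from the choice of k.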

In [25]:
df_content_nrows=pd.read_csv('../data/new_content_MiniLM.csv', nrows=10_000)
df_content_nrows.shape

(10000, 10)

In [26]:
calculate_metrics(df=df_content_nrows)
calculate_metrics(top_k=5, df=df_content_nrows)

MICRO_METRICS TOP_3 (dataset size:10000):{'micro_rec@k': 0.1816, 'micro_prec@k': 0.1104}
MICRO_METRICS TOP_5 (dataset size:10000):{'micro_rec@k': 0.2258, 'micro_prec@k': 0.0824}


Comparing the results from the smaller sample (100 items) with the larger sample (10,000 items), the metrics remain stable across both scales. Micro recall@3 increases slightly from 0.1709 to 0.1816 and micro recall@5 from 0.2111 to 0.2258, while precision shifts only marginally (0.1133 → 0.1104 at k=3 and 0.084 → 0.0824 at k=5). This consistency indicates that the 100-item estimates were not an artifact of a few specific examples and that the approach behaves similarly on a much larger and more diverse sample. The small gain in recall suggests that the larger sample contains proportionally more cases where the embeddings capture the correct semantic match. Overall, the stability of both recall and precision across sample sizes strengthens confidence in the retrieval approach.