## Homework Week 2
Tasks:
### 1. Experiment with Sentence Embeddings

    ◦ Load sentence-transformers/all-MiniLM-L6-v2

    ◦ Encode 5–10 text samples and compare them using cosine similarity

    ◦ Visualize embeddings using PCA or t-SNE
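
The comparison and visualization steps above can be sketched with NumPy alone. This is a minimal illustration, not the notebook's own solution: the helper names `cosine_similarity_matrix` and `pca_2d` are hypothetical, and the PCA here is a plain SVD projection rather than a library implementation.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Return pairwise cosine similarities between the rows of `embeddings`."""
    vectors = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    unit = vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def pca_2d(embeddings):
    """Project the rows of `embeddings` onto their first two principal components."""
    vectors = np.asarray(embeddings, dtype=np.float64)
    centered = vectors - vectors.mean(axis=0)
    # Rows of `vt` are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Hypothetical usage, assuming `samples` is a list of 5-10 strings:
# embeddings = SentenceTransformer('all-MiniLM-L6-v2').encode(samples)
# print(cosine_similarity_matrix(embeddings))
# points = pca_2d(embeddings)  # shape (len(samples), 2), ready for a scatter plot
```

Entry `[i, j]` of the similarity matrix compares sample `i` with sample `j`, so the diagonal is 1 for any non-zero vector.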

In [1]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np




In [2]:

import json


class retriever:
    """Chunk text files, embed the chunks, and search them with FAISS."""

    def __init__(self):
        # Load sentence-transformers/all-MiniLM-L6-v2
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.faiss_index = None
        # Chunks are kept in the same order as their vectors in the index.
        self.chunks = []

    def add_documents(self, path, chunk_size=200):
        # Read the whole file and split it into fixed-size character chunks.
        text = self.load_text_file(path)
        new_chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        if not new_chunks:
            return  # nothing to index for an empty file

        # encode() returns one row per chunk; FAISS requires float32.
        embeddings = np.asarray(self.model.encode(new_chunks), dtype='float32')

        if self.faiss_index is None:
            dimension = embeddings.shape[1]
            self.faiss_index = faiss.IndexFlatL2(dimension)

        self.faiss_index.add(embeddings)
        self.chunks.extend(new_chunks)

    def query(self, query, top_k=1):
        if self.faiss_index is None:
            return []
        # FAISS expects a 2-D float32 array, so encode the query as a batch of one.
        query_vector = np.asarray(self.model.encode([query]), dtype='float32')
        top_k = min(top_k, len(self.chunks))
        distances, indices = self.faiss_index.search(query_vector, top_k)
        # Row 0 holds the results for the single query, closest first.
        return [self.chunks[i] for i in indices[0]]

    def save(self, index_file, docs_file):
        faiss.write_index(self.faiss_index, index_file)
        # Chunks can contain newlines, so store them as JSON rather than one per line.
        with open(docs_file, 'w', encoding='utf-8') as f:
            json.dump(self.chunks, f)

    def load(self, index_file, docs_file):
        self.faiss_index = faiss.read_index(index_file)
        with open(docs_file, 'r', encoding='utf-8') as f:
            self.chunks = json.load(f)

    @staticmethod
    def load_text_file(path):
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()


myRetriever = retriever()
myRetriever.add_documents("document1.txt")
myRetriever.query("Why do people use TikTok?")


### 2. Try out FAISS for Similarity Search
◦ Store your embedded text chunks in a FAISS index

◦ Query it with different formulations of the same question and print top-k matching chunks. Describe the difference in performance and a possible explanation for your observations.

In [18]:
# 4. Store your embedded text chunks in a FAISS index
dimension = embeddings.shape[1]
print(dimension)
faiss_index = faiss.IndexFlatL2(dimension)
# FAISS requires float32 input
embedding_matrix = np.array(embeddings).astype('float32')
faiss_index.add(embedding_matrix)

384
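
A note on reading the scores from this index: `IndexFlatL2` reports squared Euclidean distances, so smaller means closer. If the embeddings are unit-length (an assumption worth checking with `np.linalg.norm`), a squared distance can be mapped to a cosine similarity through the identity ||a - b||^2 = 2 - 2 cos(a, b). The helper name below is hypothetical; it is only a sketch of that conversion.

```python
def squared_l2_to_cosine(squared_distance):
    """Convert a squared L2 distance between unit vectors to cosine similarity.

    Only valid when both vectors have length 1 (assumption, not checked here).
    """
    return 1.0 - squared_distance / 2.0


# Identical unit vectors have distance 0 and cosine similarity 1;
# orthogonal unit vectors have squared distance 2 and cosine similarity 0.
print(squared_l2_to_cosine(0.0))  # 1.0
print(squared_l2_to_cosine(2.0))  # 0.0
```

Under that assumption, ranking by ascending L2 distance and ranking by descending cosine similarity give the same order.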


In [19]:
queries = [
    "Why do people use TikTok?",
    "What is the main reason users watch TikTok videos?",
    "How do people typically interact with TikTok?",
    "What are users looking for on TikTok?",
    "Why do friends send TikToks to each other?",
    "Is TikTok mostly for content sharing among friends?",
]

query_vectors = model.encode(queries).astype('float32')

top_k = 3
similarity_scores, matched_sentence_indices = faiss_index.search(query_vectors, top_k)

for query_id, question in enumerate(queries):
    print(f"\nQuery {query_id + 1}: {question}")
    
    for rank, sentence_id in enumerate(matched_sentence_indices[query_id]):
        matched_sentence = sentences[sentence_id]
        score = similarity_scores[query_id][rank]
        
        print(f"Top {rank + 1}: {matched_sentence} (distance: {score:.4f})")



Query 1: Why do people use TikTok?
Top 1: So I took a step that would have nauseated an earlier version of myself: I downloaded the TikTok app, while I still could, to find out what all the fuss was about. (distance: 0.6798)
Top 2: According to Pew Research, the typical TikTok user never adds information to their account’s “bio” field. (distance: 1.0309)
Top 3: I use it pretty much exclusively either to view content that my friends have shared with me, or to look for content to share with my friends and family. (distance: 1.2729)

Query 2: What is the main reason users watch TikTok videos?
Top 1: So I took a step that would have nauseated an earlier version of myself: I downloaded the TikTok app, while I still could, to find out what all the fuss was about. (distance: 0.9000)
Top 2: I use it pretty much exclusively either to view content that my friends have shared with me, or to look for content to share with my friends and family. (distance: 1.1201)
Top 3: According to Pew Research,