# Analysis of Results and Model Comparison

## Code approach

- The data loader script loads the data, cleans it, and returns the cleaned texts, original texts, and volunteer IDs.

- `embedding.py` contains a class that takes a model and generates embeddings. The module saves the volunteer IDs, cleaned texts, original texts, and embeddings as pickle files, which `main.py` and `similarity.py` use for similarity search.

- `similarity.py` contains a class that takes the outputs of `embedding.py` and runs a similarity search between the document embeddings and the embedding of a search query. Cosine similarity measures how closely related the two embeddings (document and query) are, and the documents closest to the query embedding are returned.

**What is Cosine Similarity?**

- Cosine similarity measures the **angle** between two vectors, not their magnitude.
- It ranges from **-1 to 1**:
  - `1` means the vectors are pointing in the **same direction** (very similar)
  - `0` means they are **orthogonal** (no similarity)
  - `-1` means they are **opposite** (completely different)
- In this context, we use it to rank documents by their **semantic similarity** to the query.

The top `k` documents with the highest cosine similarity scores are returned as the most relevant matches.
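
To make the ranking step concrete, here is a minimal sketch of cosine similarity and top-`k` selection in plain Python. It is not the code from `similarity.py`; the function names and the toy 2-D vectors are assumptions chosen for illustration, and a real implementation would more likely use a vectorized library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def top_k(query_embedding, doc_embeddings, k=3):
    """Return (index, score) pairs for the k documents most similar to the query."""
    scored = [
        (i, cosine_similarity(query_embedding, doc))
        for i, doc in enumerate(doc_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored[:k]


# Toy 2-D "embeddings" (hypothetical); real sentence embeddings have hundreds of dimensions.
docs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [2.0, 2.0]]
query = [3.0, 0.0]
results = top_k(query, docs, k=2)
```

Note that the query `[3.0, 0.0]` and the document `[1.0, 0.0]` differ in length but point in the same direction, so they score as a perfect match. This is the "angle, not magnitude" property described above.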

- `main.py` is kept separate so that it does only one thing: take a query and return the top 3 most similar documents. It uses the logic from `similarity.py` to generate these results and returns JSON output.
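
The JSON output could be shaped along the following lines. This is only a sketch: the field names (`query`, `results`, `rank`, `volunteer_id`, `score`, `text`) and the helper `format_results` are hypothetical, and the values are made-up placeholders rather than results from the project.

```python
import json


def format_results(query, matches):
    """Serialize ranked matches as a JSON string.

    `matches` is a list of (volunteer_id, score, text) tuples that is already
    sorted from most to least similar. Field names are illustrative only.
    """
    payload = {
        "query": query,
        "results": [
            {
                "rank": rank,
                "volunteer_id": volunteer_id,
                "score": round(score, 2),
                "text": text,
            }
            for rank, (volunteer_id, score, text) in enumerate(matches, start=1)
        ],
    }
    return json.dumps(payload, indent=2)


# Placeholder input, not real project data.
example = format_results(
    "graphic design volunteers",
    [(101, 0.8123, "designer"), (102, 0.5, "photographer")],
)
```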



## How this would work with large datasets

When working on large datasets, generating embeddings and performing similarity search in-memory can be slow and inefficient. A better approach is to use a **vector database** to store and index embeddings, allowing for fast and scalable retrieval.

In this project, **ChromaDB** was integrated to improve the current approach. It efficiently retrieved the most relevant documents based on embedding similarity.

By default, ChromaDB measures how close embeddings are with **Euclidean (L2) distance**, which is what was used here.

**What is Euclidean distance?**

- **Euclidean distance** calculates the straight-line distance between two vectors.
- The **smaller the distance**, the more similar the two embeddings are.
- This method is simple and effective when embeddings are normalized or when magnitude matters.
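
The link between Euclidean distance and cosine similarity can be checked numerically. The sketch below uses hypothetical helper names and toy vectors. It shows that for unit-length vectors the squared distance equals `2 - 2 * cosine`, so both measures rank documents in the same order, while for raw vectors the magnitude also affects the distance.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0.0 else [x / norm for x in v]


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

cos_ab = cosine_similarity(a, b)  # normalization does not change the angle
dist_unit = euclidean_distance(a_unit, b_unit)

# For unit vectors: distance squared equals 2 - 2 * cosine similarity.
identity_gap = abs(dist_unit ** 2 - (2 - 2 * cos_ab))

# Same direction, different magnitude: far apart before normalizing, identical after.
same_direction_raw = euclidean_distance([1.0, 0.0], [10.0, 0.0])
same_direction_unit = euclidean_distance(
    l2_normalize([1.0, 0.0]), l2_normalize([10.0, 0.0])
)
```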




## For a production environment and faster queries:

- Create an AWS EC2 instance (`t3.medium` for basic compute or `g4dn.xlarge` for GPU inference), an Azure Virtual Machine, or Azure ML Compute, using Terraform for infrastructure as code and a CI/CD pipeline. On it, run a Python script that generates embeddings in batches and uploads them to a vector database. The vector database in this case should be persistent, not in-memory.
- Convert the `main.py` script to a FastAPI application. Package the application in a Docker container and push the image to a container registry (AWS ECR or Azure ACR). Create another EC2 instance to host the application.
- Expose the application as a REST API using AWS API Gateway or Azure API Management. The API is integrated with the EC2 instance, which pulls the image and runs the application.

Additionally, a load balancer and Auto Scaling groups can be added to handle a high volume of requests. The load balancer distributes incoming traffic evenly across multiple EC2 instances, while the Auto Scaling group automatically launches or terminates instances based on demand. This ensures high availability and cost-efficiency under varying workloads.
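
The batch-processing script from the first step could follow the pattern sketched below. The embedding model and the vector database are deliberately left abstract: `embed_fn` and `upsert_fn` are hypothetical callables standing in for whichever model and database client are chosen, and the stand-ins at the bottom exist only to show the control flow.

```python
from typing import Callable, Iterator, List, Sequence


def chunks(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def index_in_batches(
    ids: Sequence,
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    upsert_fn: Callable[[List, List[List[float]]], None],
    batch_size: int = 64,
) -> int:
    """Embed texts batch by batch and hand each batch to the store. Returns the item count."""
    if len(ids) != len(texts):
        raise ValueError("ids and texts must have the same length")
    total = 0
    for id_batch, text_batch in zip(chunks(ids, batch_size), chunks(texts, batch_size)):
        vectors = embed_fn(list(text_batch))   # placeholder for the embedding model
        upsert_fn(list(id_batch), vectors)     # placeholder for the vector database client
        total += len(id_batch)
    return total


# Stand-ins for illustration only: a fake "model" and an in-memory dict as the "database".
store = {}


def fake_embed(batch):
    return [[float(len(text))] for text in batch]


def fake_upsert(batch_ids, vectors):
    store.update(zip(batch_ids, vectors))


count = index_in_batches(
    [1, 2, 3, 4, 5], ["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, fake_upsert, batch_size=2
)
```

Keeping the model and the database behind plain callables means the same loop can be pointed at a persistent vector database in production and at an in-memory stand-in for tests.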

# Modelling results:

### General performance of the models used:

| **Model Name**              | **Performance<br>Sentence Embeddings** | **Performance<br>Semantic Search** | **Model Size** | **Speed<br>(sentences/sec)** |
|-----------------------------|----------------------------------------|------------------------------------|----------------|------------------------------|
| `all-mpnet-base-v2`         | 69.57                                  | 57.02                              | 420 MB         | 2800                         |
| `all-MiniLM-L12-v2`         | 68.70                                  | 50.82                              | 80 MB          | 14200                        |
| `all-MiniLM-L6-v2`          | 68.06                                  | 49.54                              | 80 MB          | 14200                        |
| `multi-qa-MiniLM-L6-cos-v1` | 65.98                                  | 52.83                              | 250 MB         | 4000                         |

Source: https://sbert.net/docs/sentence_transformer/pretrained_models.html#

# Findings:

Small models outperformed big models in terms of relevant results and speed. The recommended small model is `all-MiniLM-L6-v2`: its top 2 results were more relevant than the top 2 results of the other small model, `all-MiniLM-L12-v2`. See the results below for each model. Cosine similarity was used to compare the vectors; the higher the score, the better the match.

The ChromaDB run used Euclidean distance, ChromaDB's default metric, rather than cosine similarity (ChromaDB can also be configured to use cosine distance for a collection). The smaller the score, the more similar the document and the query. When the embeddings are normalized, the scores improve further.

## ChromaDB Results for one model 

### ChromaDB (Euclidean Distance Based Retrieval)

**Query:**  
_“Looking for volunteers skilled in graphic design to help with non-profit branding.”_

**Top 3 Matches:**

| Rank | Volunteer ID | Euclidean Distance | Description (Truncated) |
|------|--------------|--------------------|--------------------------|
| 1    | 63           | **0.93**           | UX designer for tech company. can help improve website experiences for nonprofit services. remote work in evenings and some weekends. |
| 2    | 15           | 0.92               | i've written successful grants for several nonprofits (references available). happy to help educational programs find $$$. can work remotely on my own schedule. |
| 3    | 2            | 0.43               | I'm pretty good with graphic design stuff, Photoshop, Illustrator etc. Would love to help non-profits with their promotional materials!! I work from home so can be flexible with my hours.  |

Note: these scores are Euclidean distances, so a lower score means a closer match. By that measure, volunteer 2 (0.43) is the closest match to the query, even though it is listed third.



### Model: `all-MiniLM-L6-v2`

**Query:**  
_“Looking for volunteers skilled in graphic design to help with non-profit branding.”_

**Top 3 Matches:**

| Rank | Volunteer ID | Similarity Score | Description (Truncated) |
|------|--------------|------------------|--------------------------|
| 1    | 2            | **0.77**         | I'm pretty good with graphic design stuff, Photoshop, Illustrator etc. Would love to help non-profits with their promotional materials!! I work from home so can be flexible with my hours.  |
| 2    | 63           | 0.54             | UX designer for tech company. can help improve website experiences for nonprofit services. remote work in evenings and some weekends. |
| 3    | 15           | 0.52             | i've written successful grants for several nonprofits (references available). happy to help educational programs find $$$. can work remotely on my own schedule. |



### Model: `all-MiniLM-L12-v2`

**Query:**  
_“Looking for volunteers skilled in graphic design to help with non-profit branding.”_

**Top 3 Matches:**

| Rank | Volunteer ID | Similarity Score | Description (Truncated) |
|------|--------------|------------------|--------------------------|
| 1    | 2            | **0.76**         | I'm pretty good with graphic design stuff, Photoshop, Illustrator etc. Would love to help non-profits with their promotional materials!! I work from home so can be flexible with my hours.  |
| 2    | 13           | 0.58             | Professional photographer here. Got all my own equipment. Would do free photos for nonprofits for their websites/social/etc... |
| 3    | 63           | 0.56             | UX designer for tech company. can help improve website experiences for nonprofit services. remote work in evenings and some weekends. |



### Model: `all-mpnet-base-v2`

**Query:**  
_“Looking for volunteers skilled in graphic design to help with non-profit branding.”_

**Top 3 Matches:**

| Rank | Volunteer ID | Similarity Score | Description (Truncated) |
|------|--------------|------------------|--------------------------|
| 1    | 2            | **0.78**         | I'm pretty good with graphic design stuff, Photoshop, Illustrator etc. Would love to help non-profits with their promotional materials!! I work from home so can be flexible with my hours.  |
| 2    | 60           | 0.67             | I've been a McKinsey consultant for 8 years, specialized in nonprofit sector. Can help with strategic planning, efficiency improvements... |
| 3    | 19           | 0.67             | i'm a lawyer (corporate law background) & can give some free legal advice to nonprofits. don't need much just schedule a time and we can zoom. |


### Model: `multi-qa-MiniLM-L6-cos-v1`

**Query:**  
_“Looking for volunteers skilled in graphic design to help with non-profit branding.”_

**Top 3 Matches:**

| Rank | Volunteer ID | Similarity Score | Description (Truncated) |
|------|--------------|------------------|--------------------------|
| 1    | 2            | **0.73**         | I'm pretty good with graphic design stuff, Photoshop, Illustrator etc. Would love to help non-profits with their promotional materials!! I work from home so can be flexible with my hours.  |
| 2    | 13           | 0.57             | Professional photographer here. Got all my own equipment. Would do free photos for nonprofits for their websites/social/etc... |
| 3    | 63           | 0.53             | UX designer for tech company. can help improve website experiences for nonprofit services. remote work in evenings and some weekends. |




## Fine-tuning:

- **Data cleaning:** We lowercase the descriptions and remove special characters for consistency.
- **Create sentence pairs:** We don’t have labeled pairs of similar descriptions, so we use the similarity search model to find which sentences are closest to each other and pair them.
- **Load pretrained model:** We load a pretrained model for semantic search.
- **Apply LoRA (PEFT):** Instead of updating all model weights, we use LoRA to add small trainable layers (faster training, less memory).
- **Tokenization:** The text pairs are tokenized so the model can process them.
- **Training with the Trainer API:** We train using Hugging Face’s Trainer. The model sees many text pairs and learns to predict how similar they are.
- **Saving the model:** Once trained, we save the fine-tuned model to disk for future use in the similarity search app.
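
The first two steps (cleaning and pair creation) can be sketched in plain Python as follows. This is an illustration under assumptions, not the project's training code: the function names, the `min_score` threshold of 0.5, and the toy 2-D embeddings are all hypothetical, and in practice the embeddings would come from the pretrained model.

```python
import math
import re


def clean_text(text):
    """Lowercase, replace characters other than a-z, 0-9, and whitespace, and collapse spaces."""
    lowered = text.lower()
    no_special = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return re.sub(r"\s+", " ", no_special).strip()


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def nearest_neighbor_pairs(texts, embeddings, min_score=0.5):
    """Pair each text with its most similar other text.

    Only pairs scoring at least `min_score` are kept, and each unordered pair
    appears once. Returns a list of (text_a, text_b, score) tuples.
    """
    pairs = {}
    for i in range(len(texts)):
        best_j, best_score = None, -math.inf
        for j in range(len(texts)):
            if i == j:
                continue
            score = cosine_similarity(embeddings[i], embeddings[j])
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None and best_score >= min_score:
            pairs[(min(i, best_j), max(i, best_j))] = best_score
    return [(texts[a], texts[b], score) for (a, b), score in sorted(pairs.items())]


cleaned = clean_text("I'm GOOD with Photoshop!!")

# Toy embeddings (hypothetical): the first two texts point in nearly the same direction.
toy_texts = ["design", "branding", "law"]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
training_pairs = nearest_neighbor_pairs(toy_texts, toy_embeddings)
```

Because the pairs are produced by the same kind of model that is being fine-tuned, they act as pseudo-labels rather than ground truth, so a threshold like `min_score` helps keep weak matches out of the training set.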
