# Overview Notebook:

This notebook gives an overview of the work done in the Long Document Similarity project, starting with the datasets used and then the proposed models.

## 1 Datasets :

Two datasets extracted from Wikipedia, wines and video_games, are used in this project. The datasets can be downloaded from [here](https://zenodo.org/record/4468783#.Yb-fKOrMJhH).

The (/data/Ground-Truth) folder contains the ground-truth labels for each dataset, and (data/Load_dataset) contains the class (WikipediaLongDocumentSimilarityDataset(Dataset)), which creates a Dataset object for each dataset and loads the raw data and the ground-truth labels.

The next cells show how to use this class:

In [26]:
from data import Load_dataset

wines_data = Load_dataset.WikipediaLongDocumentSimilarityDataset(dataset_name = "wines") #Load wines dataset
print("dataset size : {}".format(len(wines_data))) # print the length of the dataset, which is number of articles
print("Number of ground-truth articles: {}".format(len(wines_data.labels)))

dataset size : 1662
Number of ground-truth articles: 89


Each article in the dataset has a title and sections, and each section has a section_title and a description. The number of sections varies from article to article. An example of an article in this dataset:

In [23]:
print(wines_data.articles[765])

['Alphonse Tchami', '[["", "Alphonse Marie Tchami Djomaha (born 14 September 1971) is a Cameroonian former professional footballer who played as a striker. At international level, he represented Cameroon at the 1994 and 1998 FIFA World Cups.\\n\\n"], ["Club career.", "Born in Kekem, Tchami began his career in Cameroon with Unisport Bafang before moving to Danish club Vejle BK. In his short spell at Vejle he scored 8 goals in 15 games, but was unable to prevent the club being relegated. Tchami\'s spell at Vejle led to interest from other Danish clubs and Tchami eventually moved to Odense BK (OB). Tchami was a part of the OB team that defeated Real Madrid in the 1994\\u201395 UEFA Cup third round by 4\\u20133 on aggregate, earning a place in the quarter-finals.\\n\\nTchami joined Argentinian club Boca Juniors shortly after the 1994 FIFA World Cup. In total Tchami played 50 games and scored 11 goals for Boca. After three years he returned to Europe with German side Hertha BSC. Tchami spen
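
Judging from the printed record above, an article appears to be stored as a two-element list: the title, followed by a JSON string that encodes a list of `[section_title, description]` pairs. A minimal sketch of unpacking such a record, assuming that layout holds for every article (`parse_sections` is a hypothetical helper, not part of the project code):

```python
import json

def parse_sections(article):
    """Split one [title, sections_json] record into a title and (section_title, text) pairs."""
    title, sections_json = article
    # Assumption: the second element is a JSON-encoded list of [section_title, text] pairs.
    sections = [(sec_title, text) for sec_title, text in json.loads(sections_json)]
    return title, sections

# Tiny hand-written record that mimics the layout printed above (not real dataset content).
example = ["Some wine", json.dumps([["", "Lead paragraph."], ["History.", "Section text."]])]
title, sections = parse_sections(example)
for sec_title, text in sections:
    print(repr(sec_title), len(text))
```

If the layout assumption is right, the same helper could be applied to a real record such as `wines_data.articles[765]`. The empty section title corresponds to the lead paragraph of the article.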

Each wiki_title in the ground truth has a different number of similar articles. An example of the ground-truth labels of an article:

In [16]:
print("An Example of ground-truth Labels for 'Champagne in popular culture' article :")
wines_data.labels['Champagne in popular culture']

An Example of ground-truth Labels for 'Champagne in popular culture' article :


{'Sparkling wine': 1,
 'Champagne Krug': 1,
 'Moët &amp; Chandon': 1,
 'Champagne Riots': 1,
 'Champagne wine region': 1,
 'History of French wine': 1,
 'Dom Pérignon': 1}

In [27]:
#To Load video_games dataset 
games_data = Load_dataset.WikipediaLongDocumentSimilarityDataset(dataset_name = "video_games") 
print("dataset size : {}".format(len(games_data))) 
print("Number of ground-truth articles: {}".format(len(games_data.labels)))

dataset size : 21228
Number of ground-truth articles: 88


## 2 Baseline Models :

Two baseline models have been implemented to find the articles similar to each article in the ground truth; the results are then compared to the ground-truth labels.

### 2.1 TF_IDF Model :

The implementation can be found [here](https://github.com/AzzaAbdelGhani/Long-Document-Similarity/blob/main/models/TFIDF.py), where [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) is used to build the TF-IDF feature matrix. The TF-IDF (Term Frequency-Inverse Document Frequency) model represents each article as a **weight vector** of **N** dimensions, where N is the number of distinct tokens in the dataset vocabulary (the number of columns of the TF-IDF matrix). Each component is the weight of the corresponding token, which reflects how important that token is to the article relative to the rest of the dataset.

After building the TF-IDF feature matrix, we compute the similarity scores between articles using *cosine_similarity* and then sort the scores.
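
To make the weighting and ranking steps concrete, here is a minimal standard-library sketch. It is not the project's implementation: it uses whitespace tokenization and one common smoothed IDF variant, so the numbers will not match those produced by TfidfVectorizer, and the names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors); each vector has one weight per vocabulary term."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[tok] * idf[tok] for tok in vocab])
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [
    "sparkling wine from champagne",
    "champagne is a sparkling wine region",
    "a football striker scored goals",
]
vocab, vectors = tfidf_vectors(docs)
query = 0
# Rank every other document by descending cosine similarity to the query document.
ranking = sorted(
    (i for i in range(len(docs)) if i != query),
    key=lambda i: cosine(vectors[query], vectors[i]),
    reverse=True,
)
print(ranking)
```

Each vector has as many components as there are distinct tokens in the toy corpus, which mirrors how the number of columns of the TF-IDF matrix equals the vocabulary size.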

In [35]:
from models.TFIDF import TF_IDF 

tfidf_model_1 = TF_IDF(dataset_name="wines")
# For this model, we rank all k = 1662 articles by similarity for each wiki_article in the ground truth

Find k = 1662 Similar articles: 100%|███████████| 89/89 [00:01<00:00, 65.19it/s]


In [34]:
print("TFIDF matrix size : {}".format(tfidf_model_1.tfidf_matrix.shape))

TFIDF matrix size : (1662, 65968)


In [36]:
tfidf_model_2 = TF_IDF(dataset_name="video_games")

Find k = 21228 Similar articles: 100%|██████████| 88/88 [00:23<00:00,  3.78it/s]


In [38]:
print("TFIDF matrix size : {}".format(tfidf_model_2.tfidf_matrix.shape))

TFIDF matrix size : (21228, 203022)


### 2.2 SBERT Model :

The implementation can be found [here](https://github.com/AzzaAbdelGhani/Long-Document-Similarity/blob/main/models/SBERT.py), where [SentenceTransformer](https://www.sbert.net/) is used to compute sentence embeddings, from which the article embeddings are derived: an article's embedding is the average of the embeddings of the sentences it contains. Similarity scores between articles are then computed with *faiss*.

This model runs on a GPU, and the output article embeddings for each dataset are saved in (/data/saved_embeddings).
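
The averaging step can be illustrated without the actual encoder. The sketch below uses hand-made 3-dimensional vectors in place of real sentence embeddings and an exhaustive cosine search as a stand-in for the *faiss* lookup; the function names are hypothetical, and the choice of cosine similarity is an assumption rather than a statement about how the project configures faiss.

```python
import math

def mean_pool(sentence_vectors):
    """Average a list of equal-length sentence vectors into one article vector."""
    n = len(sentence_vectors)
    return [sum(column) / n for column in zip(*sentence_vectors)]

def l2_normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def top_k(query_idx, article_vecs, k):
    """Exhaustive inner-product search over normalized vectors, skipping the query itself."""
    q = l2_normalize(article_vecs[query_idx])
    scores = [
        (sum(a * b for a, b in zip(q, l2_normalize(vec))), idx)
        for idx, vec in enumerate(article_vecs)
        if idx != query_idx
    ]
    return [idx for _, idx in sorted(scores, reverse=True)[:k]]

# Toy "sentence embeddings"; real ones would come from the sentence encoder.
articles = {
    "A": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "B": [[0.5, 0.5, 0.0]],
    "C": [[0.0, 0.0, 1.0], [0.0, 0.2, 0.8]],
}
names = list(articles)
article_vecs = [mean_pool(vectors) for vectors in articles.values()]
neighbours = [names[i] for i in top_k(0, article_vecs, k=2)]
print(neighbours)
```

Excluding the query article from its own result list is consistent with the progress output for this model, which reports k as one less than the dataset size.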

In [45]:
#These libraries need to be installed to call SBERT class
#!pip install faiss-gpu
#!pip install -U sentence-transformers

In [48]:
from models.SBERT import SBERT

#This loads the article embeddings saved in "all-MiniLM-L6-v2_wines_embeddings.pkl" for the wines dataset and computes similarities
SBERT_wines = SBERT("wines", saved_embeddings= "data/saved_embeddings/all-MiniLM-L6-v2_wines_embeddings.pkl")

#This loads the article embeddings of the video_games dataset
SBERT_games = SBERT("video_games", saved_embeddings= "data/saved_embeddings/all-MiniLM-L6-v2_video_games_embeddings.pkl")

Find k = 1661 Similar articles: 100%|███████████| 89/89 [00:00<00:00, 98.58it/s]
Find k = 21227 Similar articles: 100%|██████████| 88/88 [00:13<00:00,  6.70it/s]


## 3 Evaluation Metrics :

Three metrics are used for evaluation: **Mean Percentile Ranking (MPR)**, **Mean Reciprocal Rank (MRR)** and **Hit-Ratio @ k (HR@k)**.

By running the *main* method in [main.py](https://github.com/AzzaAbdelGhani/Long-Document-Similarity/blob/main/main.py), we can see the results of these metrics for each model on each dataset:
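
For orientation, the three metrics can be sketched as below. The definitions are common textbook variants and are assumptions: the project's main.py may define them differently, for example in how the percentile is normalized or how HR@k is averaged. Here the percentile of a relevant article is its 1-based rank divided by the list length, the reciprocal rank uses the first relevant article, and HR@k is the fraction of relevant articles found in the top k.

```python
def evaluate(rankings, labels, k):
    """rankings: {query: ranked titles}; labels: {query: {relevant title: 1, ...}}."""
    percentiles, reciprocal_ranks, hit_ratios = [], [], []
    for query, ranking in rankings.items():
        relevant = set(labels[query])
        # 1-based positions of the relevant titles in the ranked list.
        ranks = [pos for pos, title in enumerate(ranking, start=1) if title in relevant]
        percentiles.extend(rank / len(ranking) for rank in ranks)
        reciprocal_ranks.append(1.0 / ranks[0] if ranks else 0.0)
        hit_ratios.append(sum(rank <= k for rank in ranks) / len(relevant))
    return {
        "MPR": sum(percentiles) / len(percentiles),
        "MRR": sum(reciprocal_ranks) / len(reciprocal_ranks),
        f"HR@{k}": sum(hit_ratios) / len(hit_ratios),
    }

# Toy rankings and labels in the same shape as the ground-truth dictionaries.
rankings = {
    "q1": ["a", "b", "c", "d"],
    "q2": ["d", "c", "b", "a"],
}
labels = {"q1": {"a": 1, "c": 1}, "q2": {"b": 1}}
scores = evaluate(rankings, labels, k=2)
print(scores)
```

Under these definitions a lower MPR is better, since relevant articles sit closer to the top of the list, while higher MRR and HR@k are better.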

In [56]:
#This library is required to show the results in a nice table
#!pip install tabulate

In [55]:
from main import main

main()

Find k = 21227 Similar articles: 100%|██████████| 88/88 [00:11<00:00,  7.54it/s]
Find k = 100 Similar articles: 100%|████████████| 88/88 [00:11<00:00,  7.53it/s]
Find k = 1661 Similar articles: 100%|███████████| 89/89 [00:00<00:00, 89.22it/s]
Find k = 100 Similar articles: 100%|███████████| 89/89 [00:00<00:00, 100.36it/s]
Find k = 21228 Similar articles: 100%|██████████| 88/88 [00:22<00:00,  3.88it/s]
Find k = 100 Similar articles: 100%|████████████| 88/88 [00:22<00:00,  3.84it/s]
Find k = 1662 Similar articles: 100%|███████████| 89/89 [00:01<00:00, 62.93it/s]
Find k = 100 Similar articles: 100%|████████████| 89/89 [00:01<00:00, 62.21it/s]


 Results : 

-----------------------------------------------------------------
|         |       video_games        |          wines           |
-----------------------------------------------------------------
| Model   | MPR   | MRR   | HR@100   | MPR   | MRR   | HR@100   |
|---------|-------|-------|----------|-------|-------|----------|
| SBERT   | 29.5% | 29.8% | 19.9%    | 32.9% | 19.3% | 24.0%    |
| TF-IDF  | 35.0% | 43.2% | 11.6%    | 38.6% | 24.1% | 6.3%     |



