# Comparison of research articles
----

This page describes the model recommended for comparing articles.

On this page, articles from Unict and from UPHF are compared.

---

## The model 
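Among the different ways to compare texts, the approach recommended here for short texts such as titles or abstracts is to encode each text as a vector (an embedding) and to measure the cosine similarity between these vectors.
Many pre-trained embedding models are available, trained on different datasets.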
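To make the measure concrete, here is a minimal sketch of the cosine similarity between two vectors, written in plain Python. The helper name `cosine` and the toy vectors are illustrative assumptions; the notebook itself relies on `sklearn.metrics.pairwise.cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (made up): the score depends on direction, not on length.
same_direction = cosine([1.0, 2.0], [2.0, 4.0])  # close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])      # 0
opposite = cosine([1.0, 0.0], [-2.0, 0.0])       # close to -1
print(same_direction, orthogonal, opposite)
```

A score near 1 means the two texts point in the same direction in the embedding space, a score near 0 means they are unrelated.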

The model chosen is **all-MiniLM-L6-v2**. It gives the best results in terms of both speed and relevance.

In [1]:
#load the model
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')



## Datasets
We assume that both datasets follow the same template (see [introduction](introduction.ipynb)).
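The loading code below only relies on a `title` column. The following sketch checks for that column before use. It assumes `title` is the only column needed here (the full template is described in the introduction and is not reproduced). `load_titles` is a hypothetical helper, and dropping rows with a missing title is an assumption about how gaps should be handled.

```python
import io
import pandas as pd

def load_titles(source, nb_rows=None):
    """Return the non-missing titles of a CSV file as a Series of strings."""
    doc = pd.read_csv(source)
    if "title" not in doc.columns:
        raise ValueError("the dataset has no 'title' column")
    # Assumption: rows without a title are dropped rather than encoded.
    titles = doc["title"].dropna().astype(str)
    if nb_rows is not None:
        titles = titles.head(nb_rows)
    return titles.reset_index(drop=True)

# Toy CSV content (made up) standing in for a real export.
toy_csv = io.StringIO("title,year\nFirst paper,2020\n,2021\nThird paper,2022\n")
titles = load_titles(toy_csv)
print(titles.tolist())
```

With the real files, the call would look like `load_titles("outputRepositori.csv", nb_rows=1000)`.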

In [7]:
#read the datasets
import pandas as pd

# for testing, only the first nb_lignes rows of each file are used
nb_lignes = 1000
doc = pd.read_csv("outputRepositori.csv")
sentences_unict = doc['title'][:nb_lignes]

doc = pd.read_csv("outputHAL_IA.csv")
sentences_uphf = doc['title'][:nb_lignes]


In [8]:
type(sentences_unict)

pandas.core.series.Series

In [9]:
# concatenate the two Series (Unict titles first, then UPHF titles)
sentences = pd.concat([sentences_unict, sentences_uphf], ignore_index=True)

In [10]:
sentences.size

2000
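Because `ignore_index=True` renumbers the rows, the first `len(sentences_unict)` positions hold Unict titles and the remaining positions hold UPHF titles. The sketch below maps a position back to its institution; `source_of` is a hypothetical helper, not part of the notebook.

```python
def source_of(index, n_unict, n_total):
    """Name the institution a row of the concatenated Series comes from."""
    if not 0 <= index < n_total:
        raise IndexError(f"index {index} is outside 0..{n_total - 1}")
    return "Unict" if index < n_unict else "UPHF"

# Sizes used in this notebook: 1000 Unict titles followed by 1000 UPHF titles.
print(source_of(0, 1000, 2000))     # Unict
print(source_of(1133, 1000, 2000))  # UPHF
```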

## Use the model

In [11]:
# encode the sentences: each sentence is represented by a single fixed-size vector (384 dimensions for this model)
embeddings = model.encode(sentences)

In [12]:
#use the cosine similarity to compute the similarities between each pair of sentences
similarities = cosine_similarity(embeddings)

In [14]:
# similarities is a matrix of size (nb_sentences, nb_sentences)
# similarities[i][j] is the cosine similarity (not a distance) between sentences i and j
similarities

array([[ 1.0000002 , -0.00358217,  0.181815  , ..., -0.06081729,
         0.03208993,  0.02925877],
       [-0.00358217,  1.0000002 ,  0.03591001, ...,  0.02880706,
         0.1662038 ,  0.00543173],
       [ 0.181815  ,  0.03591001,  1.        , ...,  0.0258176 ,
         0.0014459 ,  0.00467276],
       ...,
       [-0.06081729,  0.02880706,  0.0258176 , ...,  1.        ,
         0.03940472, -0.07543759],
       [ 0.03208993,  0.1662038 ,  0.0014459 , ...,  0.03940472,
         1.0000001 ,  0.5452151 ],
       [ 0.02925877,  0.00543173,  0.00467276, ..., -0.07543759,
         0.5452151 ,  1.0000001 ]], dtype=float32)
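The matrix is symmetric, and its diagonal equals 1 up to float32 rounding (hence values such as 1.0000002). Since the Unict titles come first, the Unict-versus-UPHF comparisons sit in the upper-right block. The sketch below illustrates both points on made-up embeddings; `cosine_matrix` and `cross_block` are hypothetical helpers, and `cosine_matrix` assumes that no embedding is the zero vector.

```python
import numpy as np

def cosine_matrix(embeddings):
    """Pairwise cosine similarities: dot products of L2-normalised rows."""
    emb = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)  # assumed non-zero
    unit = emb / norms
    return unit @ unit.T

def cross_block(similarities, n_first):
    """Rows: the first n_first documents; columns: the remaining documents."""
    return similarities[:n_first, n_first:]

# Toy embeddings (made up): 2 "Unict" rows followed by 3 "UPHF" rows.
toy = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [1.0, 1.0], [0.0, -1.0]])
sims = cosine_matrix(toy)
cross = cross_block(sims, n_first=2)
print(cross.shape)  # (2, 3)
```

On the notebook's data, `cross_block(similarities, len(sentences_unict))` would keep only the pairs made of one Unict title and one UPHF title.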

----
## Use of the similarities

For each sentence, we compute the `nb_closest` most similar sentences.

In [19]:
nb_closest = 5
top_similarities = []
for i in range(len(sentences)):
    similarities_i = similarities[i]
    # indices sorted by decreasing similarity; position 0 is skipped because it is normally sentence i itself (similarity 1)
    top_i = sorted(range(len(similarities_i)), key=lambda j: similarities_i[j], reverse=True)[1:nb_closest+1]
    top_similarities.append(top_i)
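The loop above sorts each row in pure Python and assumes that a sentence always ranks first against itself, which may not hold when two titles are identical. As an alternative sketch (not the notebook's code), the same ranking can be done with NumPy by masking the diagonal first; `top_k_neighbours` is a hypothetical helper and the matrix is made up.

```python
import numpy as np

def top_k_neighbours(similarities, k):
    """For each row, return the indices of the k most similar other rows."""
    sims = np.array(similarities, dtype=np.float64)  # copy, the input is left untouched
    np.fill_diagonal(sims, -np.inf)                  # a document is never its own neighbour
    return np.argsort(-sims, axis=1)[:, :k]

toy = [[1.0, 0.2, 0.9],
       [0.2, 1.0, 0.5],
       [0.9, 0.5, 1.0]]
neighbours = top_k_neighbours(toy, k=2)
print(neighbours.tolist())  # [[2, 1], [2, 0], [0, 1]]
```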


In [20]:
top_similarities[:3]

[[1133, 1930, 1139, 1301, 1498],
 [586, 392, 685, 1021, 278],
 [16, 1253, 1781, 1147, 1251]]
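The neighbour lists in `top_similarities` can also be saved for later inspection. The sketch below writes one CSV row per (document, neighbour) pair using the standard `csv` module. `write_neighbours`, the column names and the toy data are assumptions made for illustration.

```python
import csv
import io

def write_neighbours(out, sentences, similarities, top_similarities):
    """Write one CSV row per (document, neighbour) pair."""
    writer = csv.writer(out)
    writer.writerow(["doc", "title", "neighbour", "neighbour_title", "similarity"])
    for i, neighbours in enumerate(top_similarities):
        for t in neighbours:
            writer.writerow([i, sentences[i], t, sentences[t], f"{similarities[i][t]:.3f}"])

# Toy data (made up) with the same structure as the notebook's variables.
toy_sentences = ["a", "b", "c"]
toy_similarities = [[1.0, 0.2, 0.9], [0.2, 1.0, 0.5], [0.9, 0.5, 1.0]]
toy_top = [[2], [2], [0]]
buffer = io.StringIO()
write_neighbours(buffer, toy_sentences, toy_similarities, toy_top)
print(buffer.getvalue())
```

To write to disk instead of a buffer, open the target file with `open(path, "w", newline="", encoding="utf-8")` and pass it as `out`.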

In [21]:
for i in range(len(sentences)):
    # print each document, followed by its closest documents and their similarity scores
    print(f"Document {i}: {sentences[i]}")
    top_i = top_similarities[i]
    print("Closest documents:")
    for t in top_i:
        print(f" {similarities[i][t]:.3f}  - doc {t}: {sentences[t]}")
    print()

Document 0: Design and verification of integrated circuitry for real-time frailty monitoring
Closest documents:
 0.388  - doc 1133: A DSP-Based EBI, ECG, and PPG Measurement Platform
 0.384  - doc 1930: Distributed Artificial Intelligence Integrated Circuits For Ultra-Low-Power Smart Sensors
 0.333  - doc 1139: A framework for detecting and analyzing behavior changes of elderly people over time using learning techniques
 0.317  - doc 1301: Active Monitoring of a Product
 0.316  - doc 1498: Automated System-Level Design for Reliability : RF front-end application

Document 1: Structural safety assessment criteria for dismantling operations of unique structures. San Mames roof arch experience
Closest documents:
 0.359  - doc 586: Equivalent static force in heavy mass impacts on structures
 0.341  - doc 392: Simplified model to consider influence of gravity on impacts on structures: Experimental and numerical validation
 0.322  - doc 685: Structural integrity assessment of additively manuf