### Using the dataset from the topic modeling lesson, find the document most similar to a chosen document from the same dataset.

In [1]:
import pandas as pd
from gensim.models import Word2Vec
import gensim.downloader
from scipy import spatial
import numpy as np



In [2]:
glove_vectors = gensim.downloader.load('glove-wiki-gigaword-300')

## Import dataset

In [3]:
URL = "https://raw.githubusercontent.com/ProfAI/natural-language-processing/main/datasets/Lezione_7-Topic_modeling/"

In [4]:
dataset = pd.read_csv(URL+"dataset_Research_Article.csv")
dataset

| | ID | TITLE | ABSTRACT | Computer Science | Physics | Mathematics | Statistics | Quantitative Biology | Quantitative Finance |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Reconstructing Subject-Specific Effect Maps | Predictive models allow subject-specific inf... | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | Rotation Invariance Neural Network | Rotation invariance and translation invarian... | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | Spherical polyharmonics and Poisson kernels fo... | We introduce and develop the notion of spher... | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 4 | A finite element approximation for the stochas... | The stochastic Landau--Lifshitz--Gilbert (LL... | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 5 | Comparative study of Discrete Wavelet Transfor... | Fourier-transform infra-red (FTIR) spectra o... | 1 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20967 | 20968 | Contemporary machine learning: a guide for pra... | Machine learning is finding increasingly bro... | 1 | 1 | 0 | 0 | 0 | 0 |
| 20968 | 20969 | Uniform diamond coatings on WC-Co hard alloy c... | Polycrystalline diamond coatings have been g... | 0 | 1 | 0 | 0 | 0 | 0 |
| 20969 | 20970 | Analysing Soccer Games with Clustering and Con... | We present a new approach for identifying si... | 1 | 0 | 0 | 0 | 0 | 0 |
| 20970 | 20971 | On the Efficient Simulation of the Left-Tail o... | The sum of Log-normal variates is encountere... | 0 | 0 | 1 | 1 | 0 | 0 |


### Function for data cleaning

In [5]:
import re
import string

import spacy
from nltk.corpus import stopwords

# Requires the NLTK stopwords corpus and the spaCy model 'en_core_web_sm'
english_stopwords = set(stopwords.words('english'))
nlp = spacy.load('en_core_web_sm')

def data_cleaner(sentence):
    # Lowercase and replace every punctuation character with a space
    sentence = sentence.lower()
    for c in string.punctuation:
        sentence = sentence.replace(c, " ")
    # Lemmatize with spaCy
    document = nlp(sentence)
    sentence = ' '.join(token.lemma_ for token in document)
    # Drop stopwords, then strip digits
    sentence = ' '.join(word for word in sentence.split() if word not in english_stopwords)
    sentence = re.sub(r'\d', '', sentence)

    # Return the cleaned text as a list of tokens
    return sentence.split()

### Function for the average vector

In [6]:
def avg_vector(sentence):
    # Average the GloVe vectors of the in-vocabulary words of a tokenized sentence
    vector = np.zeros(300)
    known_words = 0
    for word in sentence:
        if word in glove_vectors.key_to_index:
            vector += glove_vectors.get_vector(word)
            known_words += 1
    # No known words (or empty input): return the zero vector
    if known_words == 0:
        return np.zeros(300)

    return vector / known_words
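
The averaging step can be checked by hand on a toy vocabulary. The sketch below is only an illustration: `toy_vectors` and `toy_avg_vector` are hypothetical names, and the 3-dimensional embeddings are made up so that the arithmetic is easy to follow. It does not use the GloVe model loaded above.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, for illustration only
toy_vectors = {
    "matrix": np.array([1.0, 0.0, 2.0]),
    "rank": np.array([3.0, 2.0, 0.0]),
}

def toy_avg_vector(tokens, embeddings, dim=3):
    # Keep only the tokens that have an embedding
    known = [embeddings[t] for t in tokens if t in embeddings]
    # No known tokens: fall back to the zero vector
    if not known:
        return np.zeros(dim)
    # Element-wise mean over the known tokens
    return np.mean(known, axis=0)

# "levenberg" is out of vocabulary and is ignored in the average
doc_vector = toy_avg_vector(["matrix", "rank", "levenberg"], toy_vectors)
print(doc_vector)  # [2. 1. 1.]
```

Out-of-vocabulary tokens do not count towards the denominator, so the result is the mean of the two known vectors only.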

In [7]:
def most_similar(vectors, index):
    # Return the highest cosine similarity to vectors[index] and the index of that document

    similarity = 0
    index_similar_doc = 0

    for i in range(len(vectors)):
        # Skip the query document and all-zero vectors (documents with no known words)
        if i != index and vectors[i].any():
            current = 1 - spatial.distance.cosine(vectors[index], vectors[i])
            if current > similarity:
                index_similar_doc = i
                similarity = current

    return similarity, index_similar_doc
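
With about 21,000 documents, the Python loop can be replaced by a single matrix product. The following is a minimal sketch of that alternative, not the notebook's own method: `most_similar_vectorized` is a hypothetical name, it assumes the query vector is non-zero, and unlike the loop above it returns the best candidate even when every similarity is negative.

```python
import numpy as np

def most_similar_vectorized(vectors, index):
    # Stack the document vectors into an (n_docs, dim) matrix
    matrix = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(matrix, axis=1)
    valid = norms > 0

    # Scale every non-zero row to unit length; zero rows stay zero
    unit = np.zeros_like(matrix)
    unit[valid] = matrix[valid] / norms[valid, None]

    # Dot products of unit vectors are cosine similarities
    similarities = unit @ unit[index]

    # Exclude the query document and the all-zero vectors
    similarities[index] = -np.inf
    similarities[~valid] = -np.inf

    best = int(np.argmax(similarities))
    return float(similarities[best]), best

# Toy 2-dimensional vectors, for illustration only
toy = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
       np.array([2.0, 0.1]), np.zeros(2)]
toy_similarity, toy_best = most_similar_vectorized(toy, 0)
print(toy_best)  # 2
```

In the toy data, the third vector points almost in the same direction as the query, so it wins even though its length differs: cosine similarity ignores magnitude.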

In [8]:
vectors = [avg_vector(data_cleaner(doc)) for doc in dataset['ABSTRACT']]

### Test

In [9]:
index = 12345

In [10]:
dataset["ABSTRACT"][index]

'  It is well known that the affine matrix rank minimization problem is NP-hard\nand all known algorithms for exactly solving it are doubly exponential in\ntheory and in practice due to the combinational nature of the rank function. In\nthis paper, a generalized singular value thresholding operator is generated to\nsolve the affine matrix rank minimization problem. Numerical experiments show\nthat our algorithm performs effectively in finding a low-rank matrix compared\nwith some state-of-art methods.\n'

In [11]:
similarity,index_similar_doc = most_similar(vectors,index)

In [12]:
similarity

0.9302897940113917

In [13]:
index_similar_doc

17189

In [14]:
dataset["ABSTRACT"][index_similar_doc]

'  Differentiable systems in this paper means systems of equations that are\ndescribed by differentiable real functions in real matrix variables. This paper\nproposes algorithms for finding minimal rank solutions to such systems over\n(arbitrary and/or several structured) matrices by using the Levenberg-Marquardt\nmethod (LM-method) for solving least squares problems. We then apply these\nalgorithms to solve several engineering problems such as the low-rank matrix\ncompletion problem and the low-dimensional Euclidean embedding one. Some\nnumerical experiments illustrate the validity of the approach.\nOn the other hand, we provide some further properties of low rank solutions\nto systems linear matrix equations. This is useful when the differentiable\nfunction is linear or quadratic.\n'