## Using the dataset from the topic modeling lesson, find the document in the dataset that is most similar to a chosen document from the same dataset.

In [1]:
import numpy as np
import pandas as pd
import string
import re
import spacy
import nltk
from nltk.corpus import stopwords

In [2]:
dataset = pd.read_csv('../datasets/Lezione_7-Topic_modeling/dataset_Research_Article.csv')
dataset.head()

Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...,1,0,0,0,0,0
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...,1,0,0,0,0,0
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...,0,0,1,0,0,0
3,4,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...,0,0,1,0,0,0
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...,1,0,0,1,0,0


## Data cleaning

In [3]:
X = dataset["TITLE"] + dataset["ABSTRACT"]
X.head()

0    Reconstructing Subject-Specific Effect Maps  P...
1    Rotation Invariance Neural Network  Rotation i...
2    Spherical polyharmonics and Poisson kernels fo...
3    A finite element approximation for the stochas...
4    Comparative study of Discrete Wavelet Transfor...
dtype: object

In [4]:
#utils
nlp = spacy.load("en_core_web_sm")

nltk.download("stopwords")
stopwords_en = stopwords.words("english")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aless\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
def clean_text(s:str):
    #lowercase
    s = s.lower()
    
    #punctuation
    for c in string.punctuation:
        s = s.replace(c," ")

    #lemmatization
    doc = nlp(s)
    s = " ".join(token.lemma_ for token in doc)

    #stopwords
    s = " ".join(word for word in s.split() if word not in stopwords_en)
    
    #remove numbers
    s = re.sub(r"\d","",s)

    #remove multiple spaces
    s = re.sub(r" +"," ",s)
    return s
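
To see what the non-lemmatization steps do to a piece of text, here is a minimal, spaCy-free sketch. It is not the function above: `clean_text_light` and `TOY_STOPWORDS` are hypothetical names introduced for illustration, the stopword list is a tiny stand-in for NLTK's English list, and the steps are applied in a slightly different order.

```python
import re
import string

# Hypothetical, tiny stopword list for illustration only;
# the notebook itself uses NLTK's English stopword list.
TOY_STOPWORDS = {"the", "in", "by", "of"}

# Translation table mapping every punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans({c: " " for c in string.punctuation})

def clean_text_light(s: str, stopwords: set = TOY_STOPWORDS) -> str:
    """Lowercase, replace punctuation with spaces, drop digits and stopwords.

    Lemmatization is deliberately left out so the sketch has no spaCy dependency.
    """
    s = s.lower().translate(_PUNCT_TO_SPACE)
    s = re.sub(r"\d", "", s)
    # split() with no argument also collapses repeated whitespace
    return " ".join(w for w in s.split() if w not in stopwords)

sample = "Probing valley filtering, in 2 zigzag-edge ribbons!"
cleaned = clean_text_light(sample)
print(cleaned)  # probing valley filtering zigzag edge ribbons
```

Replacing punctuation with a space rather than deleting it keeps hyphenated terms such as "zigzag-edge" as two separate tokens, which is also what the loop over `string.punctuation` in `clean_text` does.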
    

In [6]:
X_clean = X.apply(clean_text)

## Word Embedding
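Each document in the corpus is represented by the mean of the embedding vectors of its words; words that are missing from the embedding model's vocabulary are skipped.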

In [7]:
import gensim.downloader

In [8]:
print(list(gensim.downloader.info()['models']))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In this exercise the `glove-wiki-gigaword-200` model is used.

In [9]:
glove_model = gensim.downloader.load("glove-wiki-gigaword-200")

In [10]:
def compute_embedding(s:str):
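    """
    Compute the mean embedding vector of a cleaned document.
    Words missing from the GloVe vocabulary are skipped; if no word
    is found, a zero vector is returned.
    """
    v = np.zeros(glove_model.vector_size)  # same dimension as the GloVe vectors (200)
    counter = 0
    for word in s.split():
        if word in glove_model.key_to_index:
            v += glove_model.get_vector(word)
            counter += 1
    return v / counter if counter > 0 else v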
    

In [11]:
embeddings = X_clean.apply(compute_embedding)

In [12]:
embeddings.head()

0    [0.22885676364413188, 0.1617810904739637, 0.12...
1    [0.16675768384955963, 0.313653826146861, 0.115...
2    [0.18669769784160467, 0.16619212610477752, -0....
3    [0.26152576006164674, 0.25411178452295163, 0.2...
4    [0.21645674137573223, 0.19605113154975698, 0.0...
dtype: object

In [13]:
embeddings[0].shape

(200,)

`embeddings` is a pandas Series in which each element is a 200-dimensional vector: the mean embedding of one document.
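
The averaging step can be checked by hand on a toy vocabulary. The sketch below assumes made-up 3-dimensional vectors in place of the GloVe model; `toy_vectors` and `mean_embedding` are hypothetical names used only for this illustration.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings"; real GloVe vectors here have 200 dimensions.
toy_vectors = {
    "graphene": np.array([1.0, 0.0, 0.0]),
    "valley": np.array([0.0, 1.0, 0.0]),
}

def mean_embedding(text: str, vectors: dict, dim: int) -> np.ndarray:
    """Average the vectors of the known words; return zeros if none is known."""
    known = [vectors[w] for w in text.split() if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

# "nanoribbonx" is not in the toy vocabulary, so it is ignored.
doc_vec = mean_embedding("graphene valley nanoribbonx", toy_vectors, dim=3)
empty_vec = mean_embedding("zzz", toy_vectors, dim=3)
print(doc_vec)    # [0.5 0.5 0. ]
print(empty_vec)  # [0. 0. 0.]
```

Note that a document with no known words ends up as an all-zero vector, which carries no direction and therefore needs care when cosine-based measures are applied to it.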

## Cosine distances
Find the document most similar to a chosen document (document 42 in the example that follows).
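
For reference, the cosine distance between two embedding vectors $u$ and $v$ is conventionally defined as one minus their cosine similarity:

$$
d_{\cos}(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert_2 \, \lVert v \rVert_2}
$$

It ranges from 0 (same direction) to 2 (opposite directions) and ignores vector length, so long and short documents can be compared on equal terms. The expression is undefined when either vector is all zeros, which can happen here for a document whose words are all missing from the embedding vocabulary.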

In [15]:
embeddings_array = np.array([row for row in embeddings])
embeddings_array.shape

(20972, 200)

In [18]:
from scipy.spatial.distance import cdist

def find_closest_doc(idx:int):
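    """
    Find the document most similar to the idx-th document in the dataset.
    Cosine distance is used as the dissimilarity measure.
    """
    # distances from the query to every other document (the query row is removed)
    query = np.expand_dims(embeddings_array[idx], axis=0)
    others = np.delete(embeddings_array, idx, 0)
    distances = cdist(query, others, 'cosine')
    closest_index = np.argmin(distances)
    # indices at or after idx were shifted down by one by np.delete
    return closest_index if closest_index < idx else closest_index + 1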
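
As a possible variation on `find_closest_doc`, the sketch below returns the `k` nearest documents instead of one, using only NumPy. It is an alternative under stated assumptions, not the notebook's method: `top_k_similar` and `toy` are hypothetical names, similarity (rather than distance) is reported, and all-zero rows are explicitly excluded because their cosine is undefined.

```python
import numpy as np

def top_k_similar(matrix: np.ndarray, idx: int, k: int = 3) -> list:
    """Return the k rows most similar to row `idx` as (index, cosine similarity) pairs."""
    norms = np.linalg.norm(matrix, axis=1)
    valid = norms > 0  # all-zero rows have no direction and cannot be compared
    if not valid[idx]:
        raise ValueError("the query row is an all-zero vector")

    # Normalize valid rows to unit length so a dot product equals cosine similarity.
    unit = np.zeros_like(matrix, dtype=float)
    unit[valid] = matrix[valid] / norms[valid, None]

    sims = unit @ unit[idx]
    sims[~valid] = -np.inf  # never return all-zero rows
    sims[idx] = -np.inf     # never return the query itself

    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]

# Toy data: row 1 points almost the same way as row 0, row 2 is orthogonal,
# row 3 is an all-zero vector.
toy = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.0, 0.0],
])
result = top_k_similar(toy, 0, k=2)
print(result)
```

With the notebook's data this would be called as `top_k_similar(embeddings_array, 42, k=5)`. Masking the query with `-inf` avoids the index shift introduced by `np.delete`, so the returned indices refer directly to rows of the original array.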
    

In [19]:
idx_doc = 42
closest_idx = find_closest_doc(idx_doc)

print(f"Doc n. {idx_doc}:\n{X[idx_doc]}")
print("-"*10)
print(f"Closest doc is the n. {closest_idx}:\n{X[closest_idx]}")

Doc n. 42:
Probing valley filtering effect by Andreev reflection in zigzag graphene nanoribbon  Ballistic point contact (BPC) with zigzag edges in graphene is a main
candidate of a valley filter, in which the polarization of the valley degree of
freedom can be selected by using a local gate voltage. Here, we propose to
detect the valley filtering effect by Andreev reflection. Because electrons in
the lowest conduction band and the highest valence band of the BPC possess
opposite chirality, the inter-band Andreev reflection is strongly suppressed,
after multiple scattering and interference. We draw this conclusion by both the
scattering matrix analysis and the numerical simulation. The Andreev reflection
as a function of the incident energy of electrons and the local gate voltage at
the BPC is obtained, by which the parameter region for a perfect valley filter
and the direction of valley polarization can be determined. The Andreev
reflection exhibits an oscillatory decay with the length