
# Document-level similarity

* Often needed when dealing with large document collections
* Supported by most search engines like Solr (MoreLikeThis function)
* Document-by-document similarity matrix is input to many clustering algorithms
* Useful for pairing different document sources

# TF-IDF based document-level similarity

* Columns of term-by-document matrix represent documents (or rows of a document-by-term matrix, naturally)
* All-by-all cosine similarity is easy to calculate even for relatively large document collections
* In the following, let us try to pair news from HS and YLE from the beginning of 2020
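
The idea in the bullets above can be sketched on a tiny corpus before moving to the real data. The three sentences below are made up for illustration; the sketch only assumes scikit-learn's `TfidfVectorizer` and `cosine_similarity` with their default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (made up for illustration): documents 0 and 1 share most words
docs = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock markets fell sharply today",
]

vectorizer = TfidfVectorizer()
doc_term = vectorizer.fit_transform(docs)   # sparse document-by-term matrix, one row per document

sims = cosine_similarity(doc_term)          # all-by-all similarities, shape (3, 3)
print(sims.round(2))
```

Each row of `sims` compares one document against all others, so the two cat sentences should score much higher with each other than with the unrelated third sentence.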

# YLE-HS News Data

* http://dl.turkunlp.org/textual-data-analysis-course-data/hs_yle_spring_2020.json.gz 
* A single JSON object which holds YLE/HS data from 1 January 2020 to roughly mid-March 2020 (i.e. the Rise of Coronavirus data :)
* This will be one of the datasets we can use throughout the course
* It has been gathered from public RSS feeds of YLE News and HS
* Note: historical STT and YLE news data can also be obtained via kielipankki.fi

In [2]:
!wget http://dl.turkunlp.org/textual-data-analysis-course-data/hs_yle_spring_2020.json.gz

--2021-03-09 16:29:31--  http://dl.turkunlp.org/textual-data-analysis-course-data/hs_yle_spring_2020.json.gz
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20590829 (20M) [application/octet-stream]
Saving to: ‘hs_yle_spring_2020.json.gz’


2021-03-09 16:29:34 (8.47 MB/s) - ‘hs_yle_spring_2020.json.gz’ saved [20590829/20590829]



In [4]:
import json
import gzip
from pprint import pprint  #pprint is prettyprint

with gzip.open("hs_yle_spring_2020.json.gz") as f:
    news_data=json.load(f)

pprint(news_data.keys())
pprint(news_data["2020"].keys())
pprint(news_data["2020"]["02"].keys())
pprint(news_data["2020"]["02"]["yle-text"][:2]) #first two YLE news articles in Feb 2020
pprint(news_data["2020"]["02"]["hs-text"][:2]) #first two HS news articles in Feb 2020

#So we have a nested dictionary: year -> month -> data source ("hs-text" or "yle-text") -> list of articles,
#each article being a dictionary with the keys "orig_filename" and "text"



dict_keys(['2020'])
dict_keys(['01', '02', '03'])
dict_keys(['hs-text', 'yle-text'])
[{'orig_filename': '2020-02-01-01-01-03--3-11187676.txt',
  'text': 'Taksimatkan hinnaston täytyy näkyä taksin kyljessä tästä päivästä '
          'lähtien\n'
          'Traficomin mukaan määräyksen myötä taksien hintojen muodostumisesta '
          'tulee selkeämpää ja ymmärrettävämpää.\n'
          'Taksimatkan hinta tai hinnan määräytyminen tulee tästä päivästä '
          'alkaen ilmoittaa siten, että hintatiedot ovat luettavissa taksin '
          'ulkopuolelta jo ennen matkan alkua.\n'
          'Liikenne- ja viestintävirasto Traficomin valtakunnallinen määräys '
          'koskee ennalta tilaamattomia, siis esimerkiksi taksitolpalla '
          'odotettavia takseja. Ennakkoon tilattavat ja hankintasopimuksella '
          'ajavat taksit ovat määräyksen ulkopuolella.\n'
          'Takseille tunnuksellisen keltamustan hinnaston tulee näkyä '
          'asiakkaalle auton oikealla puolella. Hinnasto
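
Since the data is nested as year, month, source, and then a list of articles, a small generator can flatten it for counting or filtering. This is only a sketch: `toy_news` and `iter_articles` are hypothetical names, and the toy contents are made up to mirror the nesting seen in the output above.

```python
from collections import Counter

# Toy stand-in with the same nesting as news_data (contents are made up)
toy_news = {
    "2020": {
        "01": {"yle-text": [{"orig_filename": "a.txt", "text": "first"}],
               "hs-text": [{"orig_filename": "b.txt", "text": "second"},
                           {"orig_filename": "c.txt", "text": "third"}]},
        "02": {"yle-text": [{"orig_filename": "d.txt", "text": "fourth"}],
               "hs-text": []},
    }
}

def iter_articles(data):
    """Yield (year, month, source, article) for every article in the nested dict."""
    for year, months in data.items():
        for month, sources in months.items():
            for source, articles in sources.items():
                for article in articles:
                    yield year, month, source, article

# Count articles per (month, source) pair
counts = Counter((month, source) for _, month, source, _ in iter_articles(toy_news))
print(counts)
```

Calling `iter_articles(news_data)` on the real object should work the same way, assuming every month follows the structure shown for February.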

# Find corresponding news in HS/YLE

* This is an easy, yet quite useful task
* Let us look for news article correspondence, i.e. pair news from YLE and HS
* Often needed in various data aggregation tasks


In [9]:
#1) Vectorize
from sklearn.feature_extraction.text import TfidfVectorizer

yle_texts=[item["text"] for item in news_data["2020"]["02"]["yle-text"]] #The texts
hs_texts=[item["text"] for item in news_data["2020"]["02"]["hs-text"]]   #...

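v=TfidfVectorizer()
yle_vec=v.fit_transform(yle_texts)    #learn the vocabulary and IDF weights from YLE, then vectorize YLE
hs_vec=v.transform(hs_texts)          #vectorize HS with the same vocabulary, so the columns line up. That's all there is to it :D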

print("YLE",yle_vec.shape)
print("HS",hs_vec.shape)

YLE (1853, 119006)
HS (6811, 119006)
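
Both matrices have the same number of columns because the vectorizer is fitted on the YLE texts only and then reused for HS. One consequence is that words which occur only in HS are silently ignored. The sketch below illustrates this with made-up mini collections (`source_a` and `source_b` are hypothetical names), together with one possible alternative: fitting on both sources.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini collections standing in for the two sources
source_a = ["hospital doctors warn of shortage", "taxi prices must be visible"]
source_b = ["doctors warn of hospital crisis", "parliament debates budget"]

# Fit on one source only, as in the cell above
v_a = TfidfVectorizer().fit(source_a)
b_vec = v_a.transform(source_b)
print(b_vec.shape)      # the columns are the vocabulary learned from source_a
print(b_vec[1].nnz)     # number of non-zero entries for a document sharing no words with source_a

# Alternative: learn the vocabulary and IDF weights from both sources
v_both = TfidfVectorizer().fit(source_a + source_b)
print(len(v_a.vocabulary_), len(v_both.vocabulary_))
```

Which option is better depends on the task. Fitting on both sources keeps all terms, while fitting on one source is enough when only the shared vocabulary matters for pairing.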


*   (Often, this job is all about knowing the correct libraries)
*   [sklearn's metrics.pairwise module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.pairwise) has a bunch of useful functions to calculate different pairwise metrics



In [11]:
#2) Compare
import sklearn.metrics.pairwise as pairwise

yle_hs_sims=pairwise.cosine_similarity(yle_vec,hs_vec) #can it be made any easier than this?!
print(yle_hs_sims.shape) #we now have all YLE-vs-HS cosine similarities :)


(1853, 6811)
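
A side note on cost: assuming the default `TfidfVectorizer` settings (`norm="l2"`), every non-empty row already has unit length, so cosine similarity reduces to a plain dot product. The sketch below checks this on made-up texts using `linear_kernel` from the same `pairwise` module.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up texts for illustration
texts_a = ["doctors warn of shortage at the hospital", "taxi prices must be visible"]
texts_b = ["hospital doctors warn of a crisis", "budget debate in parliament", "new taxi prices"]

v = TfidfVectorizer()                   # default norm="l2" gives unit-length rows
a_vec = v.fit_transform(texts_a)
b_vec = v.transform(texts_b)

cos = cosine_similarity(a_vec, b_vec)   # normalizes the rows, then takes dot products
dot = linear_kernel(a_vec, b_vec)       # plain dot products, no extra normalization

print(cos.shape)
print(np.allclose(cos, dot))
```

For large collections the dot-product form skips a redundant normalization step, but it is only equivalent while the vectors stay L2-normalized.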


In [36]:
#3) Pick most similar
# 
# This is easy, in the end, but needs some amount of numpy magic ;)

import numpy as np
sorted_indices=np.argsort(-yle_hs_sims)[:,:1]
# argsort ("argument sort") returns the indices that would sort the array, rather than the sorted values
# argsort always sorts in ascending order but we want descending, so we sort -yle_hs_sims instead
# [:,:1] means "take all rows and the first column" and keeps the result as a 2-dim array; [:,0] would produce a 1-dim array
print("Sorted_indices shape",sorted_indices.shape) #one row per YLE article, holding the index of its most similar HS article
print("First ten sorted indices",sorted_indices[:10])

#But now we want to see the YLE articles that have the highest correspondence to any HS article
#for that we need to sort again. For that, we also need the scores!
scores=np.take_along_axis(yle_hs_sims,sorted_indices,-1)  #pick values from yle_hs_sims using the sorted_indices, on the last axis (does your head spin?)
print("scores.shape",scores.shape)
scores_sorted_indices=np.argsort(-scores.flatten()) #We need to flatten before sort or else the 2nd dimension (which has only one element) will get sorted
#this is now indices to YLE texts sorted in descending order by their similarity to any HS article



Sorted_indices shape (1853, 1)
First ten sorted indices [[ 858]
 [  54]
 [  71]
 [ 567]
 [ 580]
 [4356]
 [2809]
 [  33]
 [2809]
 [1352]]
scores.shape (1853, 1)
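
When only the single best match per row is needed, the full sort is not strictly necessary. A minimal alternative sketch uses `argmax` and `max` on a made-up similarity matrix (the values are hypothetical). Note that if a row contains ties, `argmax` and `argsort` may pick different columns.

```python
import numpy as np

# Made-up similarity matrix: 3 "YLE" rows by 4 "HS" columns
sims = np.array([
    [0.10, 0.80, 0.30, 0.05],
    [0.20, 0.10, 0.15, 0.90],
    [0.40, 0.35, 0.50, 0.45],
])

best_col = sims.argmax(axis=1)        # index of the most similar column for every row
best_score = sims.max(axis=1)         # the corresponding similarity value
row_order = np.argsort(-best_score)   # rows sorted by their best score, descending

for row in row_order:
    print(row, best_col[row], best_score[row])

# Compare with the argsort-based approach used in the cell above
same_as_argsort = np.array_equal(best_col, np.argsort(-sims)[:, 0])
print(same_as_argsort)
```

The argsort version remains the more general tool, since slicing `[:, :k]` gives the top k candidates per row rather than just one.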


In [37]:
#4) Inspect!

#Can we convince ourselves this works?
for yle_i in scores_sorted_indices[:5]: #first five YLE articles with the highest sim to HS
    #Which is the corresponding HS?
    hs_i=sorted_indices[yle_i][0] #look up the HS index in sorted_indices; since that is a 2-dim array, pick the first column (numpy arrays can be a head-spinning experience)
    print("------------------------------------------")
    print("yle_i",yle_i,"hs_i",hs_i) #now we know which row (YLE) and column (HS) we are referring to
    sim=yle_hs_sims[yle_i,hs_i] #this is the similarity
    print("Sim",sim)
    print("*********** YLE")
    print(yle_texts[yle_i][:500]) #this is the YLE article, first 500 chars
    print("*********** HS")
    print(hs_texts[hs_i][:500]) #...and this is the HS article, first 500 chars
    print("------------------------------------------")
    print()

------------------------------------------
yle_i 1290 hs_i 4710
Sim 0.975945391128141
*********** YLE
Helsingin Laakson sairaalan lääkärit pelkäävät potilasturvallisuuden vaarantuvan lääkäripulan vuoksi – myös osastojen sulkeminen vaihtoehtona
"Nyt olemme kriisitilanteessa", lääkärit sanovat. Lääkäreistä on vajetta ympäri Suomen.
Helsingin Laakson sairaalan lääkärit ovat erittäin huolissaan sairaalan pitkään jatkuneesta ja yhä pahenevasta lääkäripulasta.
Lääkärien mukaan potilasturvallisuus on vaarantunut toistuvasti.
Laakson sairaalan lääkärit ovat lähestyneet Helsingin kaupunkia ja myös työsuo
*********** HS
Helsingin Laakson sairaalan lääkärit ovat erittäin huolissaan sairaalan pitkään jatkuneesta ja yhä pahenevasta lääkäripulasta.
Lääkärien mukaan potilasturvallisuus on vaarantunut toistuvasti.
Laakson sairaalan lääkärit ovat lähestyneet Helsingin kaupunkia ja myös työsuojelua kirjelmillä, joissa kerrotaan lääkäripulan vaikutuksista.
”Tämän seurauksena me apulaisylilääkärit ja osas

# That worked like a charm!