#### PGGM Bootcamp Text Analytics 2020
*Notebook by [Pedro V Hernandez Serrano](https://github.com/pedrohserrano)*

---
![](images/4_1.png)

# 4.1 Text Similarity
* [4.1.1. Text distance algorithms](#4.1.1)
* [4.1.2. Vectorization](#4.1.2)

---

Text similarity measures how ‘close’ two pieces of text are, both in surface form (lexical similarity) and in meaning (semantic similarity).

For instance, how similar are the phrases “the cat ate the mouse” and “the mouse ate the cat food” if you look only at the words?
On the surface, at word level, the two phrases appear very similar: 4 of the 5 unique words they contain overlap exactly, and only “food” differs. Lexical similarity of this kind typically does not take into account the actual meaning of the words or of the entire phrase in context.
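The word-level comparison of the two phrases can be sketched in a few lines. This is a minimal illustration, assuming lowercased whitespace tokenization; `word_overlap` is a hypothetical helper name, not part of any library.

```python
# Hypothetical helper for illustration; whitespace tokenization is an assumption.
def word_overlap(text_a, text_b):
    """Return the shared words and all unique words of two texts."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    return words_a & words_b, words_a | words_b

shared, union = word_overlap("the cat ate the mouse", "the mouse ate the cat food")
print(sorted(shared))            # ['ate', 'cat', 'mouse', 'the']
print(len(shared) / len(union))  # 0.8
```

The score is high even though the two phrases describe very different events, which is exactly the limitation of purely lexical measures.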

- [Repository: similarity between CJEU court decisions](https://github.com/MaastrichtU-IDS/cjeu-court-decision-similarity)
- [textdistance library](https://github.com/life4/textdistance) and a [blog post](https://itnext.io/string-similarity-the-basic-know-your-algorithms-guide-3de3d7346227) on string similarity algorithms
- [Tokenizers library](https://github.com/huggingface/tokenizers)

---
### 4.1.1. Text distance algorithms
<a id="4.1.1"></a>

In [1]:
str1 = "I am a man. She is a woman."
str2 = """Case C-40/08
(Directive 93/13/EEC – Consumer contracts – Unfair arbitration clause – Measure void – Arbitration award which has become final – Enforcement – Whether the national court responsible for enforcement can consider of its own motion whether the unfair arbitration clause is null and void – Principles of equivalence and effectiveness)
"""

In [2]:
set(str1.split())

{'I', 'She', 'a', 'am', 'is', 'man.', 'woman.'}

In [3]:
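def get_jaccard_sim(str1, str2):
    """Jaccard similarity of the whitespace-separated tokens of two strings."""
    a = set(str1.split())
    b = set(str2.split())
    c = a.intersection(b)
    # |A ∩ B| / |A ∪ B|, where |A ∪ B| = |A| + |B| - |A ∩ B|
    return float(len(c)) / (len(a) + len(b) - len(c))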

print(get_jaccard_sim(str1,str2))

0.021739130434782608
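The set printed earlier keeps `'man.'` and `'woman.'` with their trailing full stops, because `str.split()` only splits on whitespace. A possible refinement is to normalize the text before comparing. The sketch below is one way to do it, under the assumption that lowercasing and keeping only runs of letters, digits and apostrophes is acceptable for the task; `tokenize` and `jaccard_normalized` are illustrative names.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard_normalized(text_a, text_b):
    """Jaccard similarity on normalized tokens."""
    a, b = tokenize(text_a), tokenize(text_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty texts
    return len(a & b) / len(a | b)

print(sorted(tokenize("I am a man. She is a woman.")))
# Without normalization, "man." and "man" would count as different tokens.
print(jaccard_normalized("I am a man.", "She is a man"))  # 2 shared tokens out of 6
```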


### Using textdistance

In [4]:
import pandas as pd

data = pd.read_pickle('pickle/EDGAR_matrix.pkl')
data.head(10)

Unnamed: 0,aa,aaa,aaaa,aaaawyi,aaab,aaabc,aaabf,aaabq,aaabs,aaac,...,zzzutreoh,zzzw,zzzwbme,zzzwql,zzzx,zzzxpe,zzzxzl,zzzyv,zzzz,zzzzx
AAIIQ,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ADK,49,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AMRC,22,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ARX,44,2,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AWIN,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AXLX,8,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
BLFS,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
BMRA,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
BPZI,150,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
BUN,17,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
company1 = 'ADK'
all_tokens1 = data.loc[company1,:]
# Keep only the tokens that occur at least once for this company
tokens1 = [i for i, j in zip(all_tokens1.index, all_tokens1) if j > 0]

In [6]:
company2 = 'BLFS'
all_tokens2 = data.loc[company2,:]
tokens2 = [i for i, j in zip(all_tokens2.index, all_tokens2) if j > 0]

In [8]:
import textdistance

In [9]:
print('Distances',
      textdistance.jaccard(tokens1, tokens2),  # calling the algorithm directly returns a similarity
      textdistance.hamming.distance(tokens1, tokens2),
      textdistance.hamming.similarity(tokens1, tokens2),  # maximum - distance
      textdistance.hamming.normalized_distance(tokens1, tokens2),
     )

Distances 0.086432350718065 167480 -141571 6.46416303215099
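The Hamming figures are hard to interpret here because Hamming distance compares two sequences position by position and is classically defined only for sequences of equal length. Two token lists of different sizes, where position carries no meaning, are therefore a poor fit. The sketch below is a plain-Python illustration of the classical definition, not the `textdistance` implementation; `hamming_distance` is an illustrative name.

```python
def hamming_distance(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance is defined for sequences of equal length")
    return sum(x != y for x, y in zip(seq_a, seq_b))

print(hamming_distance("karolin", "kathrin"))  # 3

# Two token lists that share two of their three tokens still differ
# at every position, so the positional distance is maximal.
tokens_a = ["aa", "bank", "risk"]
tokens_b = ["bank", "risk", "zeta"]
print(hamming_distance(tokens_a, tokens_b))  # 3
```

For comparing vocabularies, set-based measures such as Jaccard or cosine are usually the more natural choice.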


In [10]:
print('Distances',
textdistance.cosine.distance(tokens1 , tokens2),
textdistance.cosine.normalized_distance(tokens1 , tokens2),
     )

Distances 0.7332928592466588 0.7332928592466588
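To make the cosine figure more concrete, here is a minimal sketch of a cosine measure on token lists. It assumes, as appears to be the case for `textdistance`, that the inputs are treated as bags of tokens and that the similarity is the size of the shared bag divided by the geometric mean of the two bag sizes (the Ochiai coefficient). The function name and the toy tokens are illustrative only.

```python
from collections import Counter
from math import sqrt

def cosine_similarity_tokens(tokens_a, tokens_b):
    """Shared token count divided by the geometric mean of the two bag sizes."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = sum((counts_a & counts_b).values())  # multiset intersection
    size_a, size_b = sum(counts_a.values()), sum(counts_b.values())
    if size_a == 0 or size_b == 0:
        return 0.0
    return shared / sqrt(size_a * size_b)

a = ["bank", "risk", "loan", "asset"]
b = ["bank", "risk", "loan", "oil", "gas", "well", "rig", "field", "basin"]
similarity = cosine_similarity_tokens(a, b)
print(similarity)      # 0.5  (3 shared tokens / sqrt(4 * 9))
print(1 - similarity)  # 0.5  (the corresponding distance)
```

Because the similarity already lies between 0 and 1, the distance and the normalized distance coincide, which matches the two identical values printed by the notebook cell.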


---
#### *Check the textdistance [documentation](https://github.com/life4/textdistance)*

---

In [11]:
edgar_data = pd.read_csv('data.csv', index_col=0)

In [12]:
edgar_data.head()

| | company_name | count_tokens | average_sentence_length | percentage_complex_word | positive_score | negative_score | uncertainty_score | constraining_score | relative_positive | relative_negative | relative_uncertainty | fog_index | relative_constraining | risk_loss | relative_risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAIIQ | Nortem NV | 218399 | 218399 | 0.116548 | 458 | 2278 | 1077 | 658 | 0.021 | 0.104 | 0.049 | 87359.647 | 0.03 | 134 | 0.06 |
| ADK | BPZ Resources Inc | 3596084 | 3596084 | 0.084405 | 15859 | 11807 | 45740 | 3771 | 0.044 | 0.033 | 0.127 | 1438433.634 | 0.01 | 506 | 0.01 |
| AMRC | Kadant Inc | 2108248 | 2108248 | 0.113625 | 12793 | 3535 | 8625 | 1370 | 0.061 | 0.017 | 0.041 | 843299.245 | 0.006 | 881 | 0.04 |
| ARX | Cascade Microtech Inc | 2600192 | 2600192 | 0.116484 | 15966 | 6181 | 12894 | 2251 | 0.061 | 0.024 | 0.05 | 1040076.847 | 0.009 | 1050 | 0.04 |
| AWIN | MBF Healthcare Acquisition Corp | 305942 | 305942 | 0.073458 | 149 | 728 | 419 | 211 | 0.005 | 0.024 | 0.014 | 122376.829 | 0.007 | 130 | 0.04 |


#### Comparing with gold standard

In [15]:
gold_standard = 'ADK'
bag_ow = data.loc[gold_standard,:]
gold_standard_tokens = [i for i, j in zip(bag_ow.index, bag_ow) if j > 0]

In [16]:
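cosine_distance = []
for company in data.index:
    # Tokens that occur at least once in this company's row of the matrix
    bag = data.loc[company,:]
    tokens = [i for i, j in zip(bag.index, bag) if j > 0]
    # Distance to the gold standard; the gold standard itself scores 0
    cosine_distance.append(textdistance.cosine.distance(gold_standard_tokens, tokens))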

In [17]:
edgar_data['cosine_distance'] = cosine_distance

In [18]:
edgar_data['cosine_distance'].describe()

count    80.000000
mean      0.699667
std       0.103950
min       0.000000
25%       0.656705
50%       0.681399
75%       0.738100
max       0.933104
Name: cosine_distance, dtype: float64

In [19]:
edgar_data.head()

| | company_name | count_tokens | average_sentence_length | percentage_complex_word | positive_score | negative_score | uncertainty_score | constraining_score | relative_positive | relative_negative | relative_uncertainty | fog_index | relative_constraining | risk_loss | relative_risk | cosine_distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAIIQ | Nortem NV | 218399 | 218399 | 0.116548 | 458 | 2278 | 1077 | 658 | 0.021 | 0.104 | 0.049 | 87359.647 | 0.03 | 134 | 0.06 | 0.725294 |
| ADK | BPZ Resources Inc | 3596084 | 3596084 | 0.084405 | 15859 | 11807 | 45740 | 3771 | 0.044 | 0.033 | 0.127 | 1438433.634 | 0.01 | 506 | 0.01 | 0.0 |
| AMRC | Kadant Inc | 2108248 | 2108248 | 0.113625 | 12793 | 3535 | 8625 | 1370 | 0.061 | 0.017 | 0.041 | 843299.245 | 0.006 | 881 | 0.04 | 0.649052 |
| ARX | Cascade Microtech Inc | 2600192 | 2600192 | 0.116484 | 15966 | 6181 | 12894 | 2251 | 0.061 | 0.024 | 0.05 | 1040076.847 | 0.009 | 1050 | 0.04 | 0.641332 |
| AWIN | MBF Healthcare Acquisition Corp | 305942 | 305942 | 0.073458 | 149 | 728 | 419 | 211 | 0.005 | 0.024 | 0.014 | 122376.829 | 0.007 | 130 | 0.04 | 0.763096 |
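Once every company has a distance to the gold standard, a natural next step is to rank the companies from most to least similar. The sketch below does this in plain Python for the five rows displayed in the table, with the distance values copied from that output; the dictionary is only a stand-in for the `cosine_distance` column.

```python
# Distances to the gold standard, copied from the five rows shown in the table
cosine_distance = {
    "AAIIQ": 0.725294,
    "ADK": 0.0,
    "AMRC": 0.649052,
    "ARX": 0.641332,
    "AWIN": 0.763096,
}
gold_standard = "ADK"

# Smallest distance first; the gold standard itself (distance 0) is left out
ranking = sorted(
    ((ticker, dist) for ticker, dist in cosine_distance.items() if ticker != gold_standard),
    key=lambda pair: pair[1],
)
for ticker, dist in ranking:
    print(ticker, dist)
```

On the full data frame, `edgar_data.sort_values('cosine_distance')` gives the same ordering, with the gold standard in first place.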


---
### 4.1.2. Vectorization
<a id="4.1.2"></a>

In [20]:
import pandas as pd

data = pd.read_pickle('pickle/AnnualReports_corpus.pkl')
data.head()

| | report | company_name |
|---|---|---|
| ABN_AMRO_Group_(2018).pdf | babn amro bank nvabn amro group nv annual repo... | ABN_AMRO_Group |
| AGNC_Investment_(2018).pdf | bproviding private capital to the us housing m... | AGNC_Investment |
| A_G_Barr_(2018).pdf | bag barr plc nannual report nand accounts njan... | A_G_Barr |
| Aboitiz_Power_(2018).pdf | bscanned with sheet nn n n n nn n n n nn ... | Aboitiz_Power |
| Acer_(2018).pdf | bacer annual reportnnpublication date april ... | Acer |


In [21]:
corpus = list(data.report)

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [23]:
# min_df=1 keeps every term; recent scikit-learn releases reject the integer 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')

In [24]:
tfidf_matrix =  tf.fit_transform(corpus)
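The TF-IDF matrix has one row per report, and the cosine of two rows can serve as a similarity score between the corresponding reports. The sketch below illustrates the idea in plain Python on three toy sentences (assumed for illustration, not taken from the corpus). It uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is scikit-learn's documented default, but it omits stop-word removal and n-grams, so its numbers are not expected to match `TfidfVectorizer` output.

```python
from collections import Counter
from math import log, sqrt

def tfidf_vectors(docs):
    """One L2-normalized {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: count * (log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

def cosine_similarity(vec_a, vec_b):
    """Dot product of two L2-normalized sparse vectors, i.e. their cosine."""
    return sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())

docs = [
    "the bank reported higher credit risk",
    "the bank reported lower credit risk",
    "the oil field produced more gas",
]
vectors = tfidf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # the two bank sentences: high
print(cosine_similarity(vectors[0], vectors[2]))  # bank vs oil: low, only "the" is shared
```

In the notebook itself, `sklearn.metrics.pairwise.cosine_similarity(tfidf_matrix)` computes the full report-by-report similarity matrix directly from the sparse TF-IDF output.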

Semantic similarity: [Text similarities](https://medium.com/@adriensieg/text-similarities-da019229c894)