#### PGGM Bootcamp Text Analytics 2020
*Notebook by [Pedro V Hernandez Serrano](https://github.com/pedrohserrano)*

---
![](images/4_1.png)

# 4.1 Text Similarity
* [4.1.1. Text distance algorithms](#4.1.1)
* [4.1.2. Vectorization](#4.1.2)

---

Text similarity measures how ‘close’ two pieces of text are, both in surface form (lexical similarity) and in meaning (semantic similarity).  

For instance, how similar are the phrases “the cat ate the mouse” and “the mouse ate the cat food” if we look only at the words?
On the surface, at the word level, the two phrases appear very similar: 4 of the 5 unique words across them overlap exactly. Lexical similarity of this kind does not take into account the meaning of the words or of the phrase in context.
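
A quick check with Python sets makes the word-level overlap concrete. This is a minimal sketch that splits on whitespace; the variable names are illustrative and not part of the notebook's data.

```python
# Compare the two example phrases at the word level only.
phrase_a = "the cat ate the mouse"
phrase_b = "the mouse ate the cat food"

words_a = set(phrase_a.split())
words_b = set(phrase_b.split())

shared = words_a & words_b      # words that appear in both phrases
combined = words_a | words_b    # every distinct word across both phrases

print(sorted(shared))
print(sorted(combined))
print(len(shared) / len(combined))
```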

Instead of comparing word for word, we also need to pay attention to context in order to capture more of the semantics. Semantic similarity is usually assessed at the phrase or paragraph level (or lexical chain level), where a piece of text is broken into groups of related words before similarity is computed. In the example above, the words overlap heavily, yet the two phrases have different meanings.  
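
One crude way to bring some context into a lexical measure is to compare word pairs (bigrams) instead of single words. The sketch below is only a proxy for context, not a semantic similarity method; `word_ngrams` and `jaccard` are helper names introduced here for illustration.

```python
def word_ngrams(text, n):
    """Return the set of n-word sequences in a whitespace-tokenized text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

phrase_a = "the cat ate the mouse"
phrase_b = "the mouse ate the cat food"

unigram_sim = jaccard(word_ngrams(phrase_a, 1), word_ngrams(phrase_b, 1))
bigram_sim = jaccard(word_ngrams(phrase_a, 2), word_ngrams(phrase_b, 2))
print(unigram_sim, bigram_sim)
```

For these two phrases the bigram score comes out lower than the single-word score, because word order now contributes to the comparison.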

---
### 4.1.1. Text distance algorithms
<a id="4.1.1"></a>

In [None]:
#Example

In [None]:
str1 = "I am a man. She is a woman."
str2 = """Case C-40/08
(Directive 93/13/EEC – Consumer contracts – Unfair arbitration clause – Measure void – Arbitration award which has become final – Enforcement – Whether the national court responsible for enforcement can consider of its own motion whether the unfair arbitration clause is null and void – Principles of equivalence and effectiveness)
"""

In [None]:
set(str1.split())

In [None]:
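def get_jaccard_sim(str1, str2):
    """Jaccard similarity of the whitespace-separated word sets of two strings."""
    a = set(str1.split())
    b = set(str2.split())
    c = a.intersection(b)
    union_size = len(a) + len(b) - len(c)
    # Two empty strings have no words to compare; return 0.0 instead of dividing by zero
    return len(c) / union_size if union_size else 0.0

print(get_jaccard_sim(str1, str2))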
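
Splitting on whitespace keeps punctuation and case, so `man.` and `man` count as different words. A possible refinement, shown here as a sketch with hypothetical helper names (`tokenize`, `jaccard_normalized`), lowercases the text and extracts runs of word characters with a regular expression before comparing.

```python
import re

def tokenize(text):
    """Lowercase the text and keep only runs of letters, digits, or underscores."""
    return set(re.findall(r"\w+", text.lower()))

def jaccard_normalized(text_a, text_b):
    """Jaccard similarity on normalized tokens; 0.0 if both texts are empty."""
    a, b = tokenize(text_a), tokenize(text_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

raw_a, raw_b = "I am a man.", "A man I am"

# Plain whitespace splitting treats "man." / "man" and "a" / "A" as different words.
print(set(raw_a.split()) & set(raw_b.split()))
# After normalization the two texts contain the same set of words.
print(jaccard_normalized(raw_a, raw_b))
```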

### Using textdistance

In [None]:
import pandas as pd

data = pd.read_pickle('pickle/EDGAR_matrix.pkl')
data.head(10)

In [None]:
company1 = 'ADK'
all_tokens1 = data.loc[company1,:]
tokens1 = [i for i, j in zip(all_tokens1.index, all_tokens1) if j > 0]

In [None]:
company2 = 'BLFS'
all_tokens2 = data.loc[company2,:]
tokens2 = [i for i, j in zip(all_tokens2.index, all_tokens2) if j > 0]

In [None]:
import textdistance

In [None]:
print('Distances',
      textdistance.jaccard(tokens1, tokens2),
      textdistance.hamming(tokens1, tokens2),
      textdistance.hamming.distance(tokens1, tokens2),  # same as calling hamming directly
      textdistance.hamming.similarity(tokens1, tokens2),
      textdistance.hamming.normalized_distance(tokens1, tokens2),
      textdistance.hamming.normalized_similarity(tokens1, tokens2),
      textdistance.Hamming(qval=2).distance(tokens1, tokens2),
      )
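
To see what a Hamming-style distance measures on token lists, the following plain-Python sketch counts the positions at which two sequences differ. It is an illustration under the assumption that extra positions in the longer sequence count as differences; it is not the library's implementation, and `hamming_distance` and the toy token lists are made up for this example.

```python
from itertools import zip_longest

def hamming_distance(seq_a, seq_b):
    """Count positions where the two sequences differ.

    Positions beyond the end of the shorter sequence count as differences.
    """
    missing = object()  # sentinel that never equals a real token
    return sum(x != y for x, y in zip_longest(seq_a, seq_b, fillvalue=missing))

tokens_a = ["annual", "report", "revenue", "risk"]
tokens_b = ["annual", "revenue", "report", "risk", "tax"]
print(hamming_distance(tokens_a, tokens_b))
```

Because the comparison is positional, two lists with nearly the same tokens can still be far apart if the tokens sit at different positions. For unordered bags of tokens, a set-based measure such as Jaccard is usually easier to interpret.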

In [None]:
print('Distances',
textdistance.cosine.distance(tokens1 , tokens2),
textdistance.cosine.normalized_distance(tokens1 , tokens2),
     )
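
On sets of tokens, cosine similarity reduces to the size of the intersection divided by the geometric mean of the two set sizes. The sketch below assumes each token is counted at most once per document; the library may treat repeated tokens differently, and `set_cosine_similarity` and the toy lists are hypothetical names for illustration.

```python
from math import sqrt

def set_cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two token collections treated as sets."""
    a, b = set(tokens_a), set(tokens_b)
    if not a or not b:
        return 0.0  # no tokens on one side: treat as no similarity
    return len(a & b) / sqrt(len(a) * len(b))

tokens_a = ["annual", "report", "revenue", "risk"]
tokens_b = ["annual", "risk", "tax", "audit"]

similarity = set_cosine_similarity(tokens_a, tokens_b)
distance = 1 - similarity  # distance as the complement of similarity
print(similarity, distance)
```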

In [None]:
#dir(textdistance)

---
#### *Check the textdistance [documentation](https://github.com/life4/textdistance)*

---

In [None]:
edgar_data = pd.read_csv('data.csv', index_col=0)

In [None]:
edgar_data.head()

In [None]:
gold_standard = 'ADK'
bag_ow = data.loc[gold_standard,:]
gold_standard_tokens = [i for i, j in zip(bag_ow.index, bag_ow) if j > 0]

In [None]:
cosine_distance = []
for company in data.index:
    bag = data.loc[company,:]
    tokens = [i for i, j in zip(bag.index, bag) if j > 0]
    # Use the cosine measure so the values match the column name assigned below
    cosine_distance.append(textdistance.cosine.distance(gold_standard_tokens, tokens))

In [None]:
edgar_data['cosine_distance'] = cosine_distance

In [None]:
edgar_data['cosine_distance'].describe()
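
Once every company has a distance to the gold standard, sorting by that distance gives a ranking of the most similar companies. The sketch below uses made-up tickers and token sets (`AAA`, `BBB`, `CCC` are not in the EDGAR data) and a set-based cosine distance written in plain Python.

```python
from math import sqrt

# Hypothetical data: ticker -> set of tokens present in that company's filing.
toy_tokens = {
    "AAA": {"revenue", "risk", "loan", "interest"},
    "BBB": {"revenue", "risk", "loan", "patent"},
    "CCC": {"patent", "trial", "drug", "clinic"},
}

def cosine_distance_sets(a, b):
    """1 minus the set-based cosine similarity; 1.0 if either set is empty."""
    if not a or not b:
        return 1.0
    return 1 - len(a & b) / sqrt(len(a) * len(b))

gold = "AAA"
distances = {
    ticker: cosine_distance_sets(toy_tokens[gold], tokens)
    for ticker, tokens in toy_tokens.items()
}

# Smallest distance first: the gold standard itself, then its nearest neighbours.
ranking = sorted(distances, key=distances.get)
print(ranking)
```

With the notebook's DataFrame, a comparable ranking could be obtained with `edgar_data.sort_values('cosine_distance')`.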

In [None]:
#edgar_data

---
### 4.1.2. Vectorization
<a id="4.1.2"></a>

In [None]:
import pandas as pd

data = pd.read_pickle('pickle/AnnualReports_corpus.pkl')
data.head()

In [None]:
corpus = list(data.report)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# min_df=1 (the default) keeps every term; recent scikit-learn versions reject an integer min_df of 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')

In [None]:
tfidf_matrix =  tf.fit_transform(corpus)
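
To make the idea behind TF-IDF vectorization concrete, here is a small pure-Python sketch on a made-up corpus. It weights single words only (no n-grams, no stop-word removal) and uses a smoothed inverse document frequency with L2 normalization, which is assumed here to resemble scikit-learn's defaults and does not reproduce `TfidfVectorizer` exactly. The names `tfidf_vectors`, `cosine_similarity`, and `toy_corpus` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed inverse document frequency.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

def cosine_similarity(vec_a, vec_b):
    """Dot product of two L2-normalized sparse vectors."""
    return sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())

toy_corpus = [
    "revenue grew and risk fell",
    "revenue fell and risk grew",
    "the drug trial was approved",
]
vecs = tfidf_vectors(toy_corpus)
print(round(cosine_similarity(vecs[0], vecs[1]), 3))  # same words, different order
print(round(cosine_similarity(vecs[0], vecs[2]), 3))  # no words in common
```

The first two toy documents contain the same words in a different order, so a single-word TF-IDF representation cannot tell them apart. This is one reason the notebook's vectorizer also includes two- and three-word n-grams.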

*For semantic similarity, see this [overview of text similarities](https://medium.com/@adriensieg/text-similarities-da019229c894).*