<a href="https://colab.research.google.com/github/KAKUMA-Minato/TextAnalysis/blob/master/Similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocessing and calculating similarity


In [1]:
# Install the required packages
!pip install nltk
!pip install gensim



In [14]:
import nltk
import numpy as np
import pandas as pd
import re
from nltk.corpus import wordnet as wn

In [3]:
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Documents used in this notebook

In [47]:
docs = [
  "Japan is an island country in East Asia. Located in the Pacific Ocean, it lies off the eastern coast of the Asian continent and stretches from the Sea of Okhotsk in the north to the East China Sea and the Philippine Sea in the south.",
  "The United States of America (USA), commonly known as the United States (U.S. or US) or America, is a country comprising 50 states, a federal district, five major self-governing territories, and various possessions.",
  "England is a country that is part of the United Kingdom.",
  "China, officially the People's Republic of China (PRC), is a country in East Asia and the world's most populous country, with a population of around 1.404 billion.",
  "India, also known as the Republic of India,[19][e] is a country in South Asia.",
  "Korea is a region in East Asia.",
  "Germany, officially the Federal Republic of Germany is a country in Central and Western Europe, lying between the Baltic and North Seas to the north, and the Alps, Lake Constance and the High Rhine to the south.",
  "Russia, or the Russian Federation[12], is a transcontinental country in Eastern Europe and North Asia.",
  "France, officially the French Republic, is a country whose territory consists of metropolitan France in Western Europe and several overseas regions and territories.",
  "Italy, officially the Italian Republic,[10][11][12][13] is a European country consisting of a peninsula delimited by the Italian Alps and surrounded by several islands.",
  "Brazil officially the Federative Republic of Brazil [9] is the largest country in both South America and Latin America.",
  "Canada is a country in the northern part of North America.",
  "Spain, officially the Kingdom of Spain[11][a][b] is a country mostly located in Europe.",
  "Australia, officially the Commonwealth of Australia,[12] is a sovereign country comprising the mainland of the Australian continent, the island of Tasmania, and numerous smaller islands.",
  "Indonesia, officially the Republic of Indonesia (Indonesian: Republik Indonesia [reˈpublik ɪndoˈnesia])[a], is a country in Southeast Asia, between the Indian and Pacific oceans.",
  "Mexico, officially the United Mexican States (Spanish: Estados Unidos Mexicanos[10][11][12][13], is a country in the southern portion of North America.",
]

## Preprocessing

In [48]:
en_stop = nltk.corpus.stopwords.words('english')

In [35]:
def preprocessing_text(text):
  def cleaning_text(text):
    # Remove "@" and "%" characters
    pattern1 = r'@|%'
    text = re.sub(pattern1, '', text)
    # Remove numeric citation markers such as [12]
    pattern2 = r'\[[0-9 ]*\]'
    text = re.sub(pattern2, '', text)
    # Remove parenthesized lowercase text such as (abc)
    pattern3 = r'\([a-z ]*\)'
    text = re.sub(pattern3, '', text)
    # Remove any remaining digits
    pattern4 = r'[0-9]'
    text = re.sub(pattern4, '', text)
    return text

  def tokenize_text(text):
    # Drop periods and commas, then split on whitespace
    text = re.sub(r'[.,]', '', text)
    return text.split()

  def remove_stopwords(word, stopwordset):
    # Return None for stopwords so they can be filtered out afterwards
    if word in stopwordset:
      return None
    else:
      return word

  def lemmatize_word(word):
    # Lowercase the word, e.g. Python => python
    word = word.lower()

    # Lemmatize with WordNet, e.g. cooked => cook
    lemma = wn.morphy(word)
    if lemma is None:
      return word
    else:
      return lemma

  text = cleaning_text(text)
  tokens = tokenize_text(text)
  tokens = [lemmatize_word(word) for word in tokens]
  tokens = [remove_stopwords(word, en_stop) for word in tokens]
  tokens = [word for word in tokens if word is not None]
  return tokens

preprocessed_docs = [preprocessing_text(text) for text in docs]

In [50]:
print(preprocessed_docs[0])

['japan', 'island', 'country', 'east', 'asia', 'locate', 'pacific', 'ocean', 'lie', 'eastern', 'coast', 'asian', 'continent', 'stretch', 'sea', 'okhotsk', 'north', 'east', 'china', 'sea', 'philippine', 'sea', 'south']
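
To make the effect of the cleaning step easier to see, the following is a minimal sketch that applies the same four substitutions as `cleaning_text` to one of the sample sentences. The helper name `clean` is hypothetical and exists only for this illustration. It suggests that numeric markers such as `[19]` are removed, while a letter marker such as `[e]` is left in place because the bracket pattern only matches digits and spaces.

```python
import re

def clean(text):
    # Same four substitutions as cleaning_text, written as raw strings.
    for pattern in (r'@|%', r'\[[0-9 ]*\]', r'\([a-z ]*\)', r'[0-9]'):
        text = re.sub(pattern, '', text)
    return text

sample = "India, also known as the Republic of India,[19][e] is a country in South Asia."
cleaned = clean(sample)
print(cleaned)
```

If letter markers should also be dropped, one option would be to widen the character class in the bracket pattern, at the cost of possibly removing other bracketed text.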


## Calculating similarity

- 1 Set-based similarity
  - 1.1 Jaccard coefficient
  - 1.2 Dice coefficient
  - 1.3 Simpson coefficient
- 2 Vector-based similarity
  - 2.1 Euclidean distance
  - 2.2 Cosine similarity


### 1 Set-based


#### 1.1 Jaccard coefficient
The Jaccard coefficient is a similarity measure defined for two sets A and B.  
It is computed as follows:

\begin{equation}
J(A,B)=\dfrac{|A\cap B|}{|A \cup B|}
\end{equation}

The larger the share of common elements, the more similar the two documents are considered to be.

In [56]:
from nltk.metrics import jaccard_distance

for i in range(16):
   set_0 = set(preprocessed_docs[0])
   set_i = set(preprocessed_docs[i])
   print("jaccard(0,",i,") = ", 1 - jaccard_distance(set_0, set_i))

jaccard(0, 0 ) =  1.0
jaccard(0, 1 ) =  0.027027027027026973
jaccard(0, 2 ) =  0.04166666666666663
jaccard(0, 3 ) =  0.13793103448275867
jaccard(0, 4 ) =  0.12
jaccard(0, 5 ) =  0.09090909090909094
jaccard(0, 6 ) =  0.11764705882352944
jaccard(0, 7 ) =  0.16000000000000003
jaccard(0, 8 ) =  0.030303030303030276
jaccard(0, 9 ) =  0.06451612903225812
jaccard(0, 10 ) =  0.07407407407407407
jaccard(0, 11 ) =  0.08333333333333337
jaccard(0, 12 ) =  0.07692307692307687
jaccard(0, 13 ) =  0.09999999999999998
jaccard(0, 14 ) =  0.13793103448275867
jaccard(0, 15 ) =  0.0625
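
As a cross-check on the formula, here is a small sketch that computes the Jaccard coefficient directly from Python sets, without `nltk`. The function name `jaccard_similarity` and the two toy word sets are assumptions made for illustration, and returning 1.0 for two empty sets is a convention chosen here, not something the formula defines.

```python
def jaccard_similarity(set_a, set_b):
    """Return |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = set_a | set_b
    if not union:
        return 1.0
    return len(set_a & set_b) / len(union)

a = {"japan", "island", "country", "east", "asia"}
b = {"korea", "region", "east", "asia"}
score = jaccard_similarity(a, b)
print(score)  # 2 shared words out of 7 distinct words
```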


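#### 1.2 Sørensen-Dice coefficient

Because the denominator of the Jaccard coefficient is the union of the two sets,  
the coefficient becomes small when one set is much larger than the other, even if the intersection is large.  
The Sørensen-Dice coefficient mitigates this effect by using the average size of the two sets as the denominator.  

\begin{equation}
DSC(A,B) = \dfrac{|A\cap B|}{\dfrac{|A| + |B|}{2}} = \dfrac{2|A\cap B|}{|A| + |B|}
\end{equation}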

In [57]:
def dice_similarity(set_a, set_b):
  num_intersection = len(set.intersection(set_a, set_b))
  sum_nums = len(set_a) + len(set_b)
  try:
    return 2 * num_intersection / sum_nums
  except ZeroDivisionError:
    # Both sets are empty: treat them as identical
    return 1.0

In [58]:
for i in range(16):
   set_0 = set(preprocessed_docs[0])
   set_i = set(preprocessed_docs[i])
   print("dice(0,",i,") = ", dice_similarity(set_0, set_i))

dice(0, 0 ) =  1.0
dice(0, 1 ) =  0.05263157894736842
dice(0, 2 ) =  0.08
dice(0, 3 ) =  0.24242424242424243
dice(0, 4 ) =  0.21428571428571427
dice(0, 5 ) =  0.16666666666666666
dice(0, 6 ) =  0.21052631578947367
dice(0, 7 ) =  0.27586206896551724
dice(0, 8 ) =  0.058823529411764705
dice(0, 9 ) =  0.12121212121212122
dice(0, 10 ) =  0.13793103448275862
dice(0, 11 ) =  0.15384615384615385
dice(0, 12 ) =  0.14285714285714285
dice(0, 13 ) =  0.18181818181818182
dice(0, 14 ) =  0.24242424242424243
dice(0, 15 ) =  0.11764705882352941
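
The Dice and Jaccard coefficients are tied together: since |A| + |B| = |A ∪ B| + |A ∩ B|, dividing numerator and denominator by |A ∪ B| gives D = 2J / (1 + J). The sketch below checks this against two of the Jaccard values printed earlier; the function name `dice_from_jaccard` is hypothetical.

```python
def dice_from_jaccard(j):
    # D = 2J / (1 + J), which follows from |A| + |B| = |A ∪ B| + |A ∩ B|.
    return 2 * j / (1 + j)

# Jaccard values printed earlier for documents 3 and 7.
dice_3 = dice_from_jaccard(0.13793103448275867)
dice_7 = dice_from_jaccard(0.16)
print(dice_3, dice_7)
```

Because this mapping is monotonically increasing, the two coefficients always rank document pairs in the same order; only the scale differs.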


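#### 1.3 Szymkiewicz-Simpson coefficient

The Szymkiewicz-Simpson coefficient (also called the overlap coefficient) suppresses the influence of the elements outside the intersection as far as possible by dividing by the size of the smaller set.  

\begin{equation}
overlap(A,B) = \dfrac{|A\cap B|}{\min(|A|, |B|)}
\end{equation}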



In [59]:
def simpson_similarity(list_a, list_b):
  # Accepts lists or sets; duplicates are ignored
  set_a, set_b = set(list_a), set(list_b)
  num_intersection = len(set_a & set_b)
  min_num = min(len(set_a), len(set_b))
  try:
    return num_intersection / min_num
  except ZeroDivisionError:
    # At least one set is empty, so the intersection is empty too and
    # the ratio is undefined; 1.0 is returned by convention
    return 1.0

In [60]:
for i in range(16):
   set_0 = set(preprocessed_docs[0])
   set_i = set(preprocessed_docs[i])
   print("simpson(0,",i,") = ", simpson_similarity(set_0, set_i))

simpson(0, 0 ) =  1.0
simpson(0, 1 ) =  0.05555555555555555
simpson(0, 2 ) =  0.2
simpson(0, 3 ) =  0.3076923076923077
simpson(0, 4 ) =  0.375
simpson(0, 5 ) =  0.5
simpson(0, 6 ) =  0.2222222222222222
simpson(0, 7 ) =  0.4444444444444444
simpson(0, 8 ) =  0.07142857142857142
simpson(0, 9 ) =  0.15384615384615385
simpson(0, 10 ) =  0.2222222222222222
simpson(0, 11 ) =  0.3333333333333333
simpson(0, 12 ) =  0.25
simpson(0, 13 ) =  0.23076923076923078
simpson(0, 14 ) =  0.3076923076923077
simpson(0, 15 ) =  0.14285714285714285
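
To illustrate how the three set-based coefficients react to a size mismatch, the following sketch evaluates all of them on one toy pair in which the smaller set is fully contained in the larger one. The word sets and variable names are made up for this example.

```python
small = {"east", "asia"}
large = {"east", "asia", "country", "island", "pacific", "ocean"}

inter = len(small & large)
jaccard = inter / len(small | large)            # 2 / 6
dice = 2 * inter / (len(small) + len(large))    # 4 / 8
overlap = inter / min(len(small), len(large))   # 2 / 2

print(jaccard, dice, overlap)
```

The overlap coefficient reaches 1.0 whenever one set is a subset of the other, which is why short documents such as the Korea sentence tend to score high on it.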


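### 2 Vector-based

### TF-IDF (Term Frequency - Inverse Document Frequency)

A bag-of-words (BoW) representation gives every word the same weight, but the importance of a word varies.  
TF-IDF takes the importance of each word into account.  

TF(t, d) = relative frequency of term t in document d (the count of t in d divided by the number of terms in d)  
IDF(t) = log(N / df(t)), where N is the number of documents in the collection D and df(t) is the number of documents that contain t  

TF-IDF(t,d) = TF(t, d) * IDF(t)  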
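
As a worked example of these definitions, here is a minimal sketch on a two-document toy corpus. The names `toy_docs`, `toy_tf`, and `toy_idf` are hypothetical and chosen so that they do not clash with anything else in the notebook. A word that occurs in every document gets an IDF of log(1) = 0, so its TF-IDF weight vanishes regardless of how often it appears.

```python
import math

toy_docs = [["east", "asia", "country"], ["europe", "country"]]

def toy_tf(term, doc):
    # Relative frequency of the term within one document.
    return doc.count(term) / len(doc)

def toy_idf(term, docs):
    # log(N / df), where df is the number of documents containing the term.
    doc_freq = sum(term in doc for doc in docs)
    return math.log(len(docs) / doc_freq)

w_country = toy_tf("country", toy_docs[0]) * toy_idf("country", toy_docs)
w_asia = toy_tf("asia", toy_docs[0]) * toy_idf("asia", toy_docs)
print(w_country)  # "country" is in every document, so its weight is 0.0
print(w_asia)     # (1/3) * ln(2)
```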

In [64]:
def tfidf_vectorizer(docs):
  def tf(word2id, doc):
    # Relative frequency of every vocabulary term in one document
    term_counts = np.zeros(len(word2id))
    for term, idx in word2id.items():
      term_counts[idx] = doc.count(term)
    return term_counts / term_counts.sum()

  def idf(word2id, docs):
    # log(N / df): terms that occur in many documents get a low weight
    idf_values = np.zeros(len(word2id))
    for term, idx in word2id.items():
      doc_freq = sum(term in doc for doc in docs)
      idf_values[idx] = np.log(len(docs) / doc_freq)
    return idf_values

  # Assign an index to each distinct word in order of first appearance
  word2id = {}
  for doc in docs:
    for w in doc:
      if w not in word2id:
        word2id[w] = len(word2id)

  # IDF depends only on the corpus, so compute it once rather than per document
  idf_values = idf(word2id, docs)
  return [(tf(word2id, doc) * idf_values).tolist() for doc in docs], word2id
  

In [65]:
tfidf_vector, word2id = tfidf_vectorizer(preprocessed_docs)
print(tfidf_vector)
print(word2id.items())

[[0.1205473357495557, 0.07278158406833354, 0.0028060226581552677, 0.1455631681366671, 0.04264475013094462, 0.09041050181216677, 0.09041050181216677, 0.09041050181216677, 0.1205473357495557, 0.09041050181216677, 0.1205473357495557, 0.1205473357495557, 0.09041050181216677, 0.1205473357495557, 0.2712315054365003, 0.1205473357495557, 0.05057177433937743, 0.09041050181216677, 0.1205473357495557, 0.06027366787477785, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.002933569142616871, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0


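#### 2.1 Euclidean distance

Now that each document is represented as a vector,  
we can compute the Euclidean distance between documents.  
The smaller the distance, the more similar the two documents are considered to be.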

\begin{equation}
d(v_1,v_2) =(\sum_{i=1}^n (v_{1i}-v_{2i})^2)^{\frac{1}{2}}
\end{equation}

In [66]:
def euclidean_distance(list_a, list_b):
  diff_vec = np.array(list_a) - np.array(list_b)
  return np.linalg.norm(diff_vec)

In [69]:
for i in range(16):
   list_0 = list(tfidf_vector[0])
   list_i = list(tfidf_vector[i])
   print("euclidean_distance(0,",i,") = ", euclidean_distance(list_0, list_i))

euclidean_distance(0, 0 ) =  0.0
euclidean_distance(0, 1 ) =  0.7568610948880183
euclidean_distance(0, 2 ) =  1.0121429343618757
euclidean_distance(0, 3 ) =  0.7126923005791219
euclidean_distance(0, 4 ) =  0.8432973548323277
euclidean_distance(0, 5 ) =  1.049789826669889
euclidean_distance(0, 6 ) =  0.6801622588258939
euclidean_distance(0, 7 ) =  0.8246258019334979
euclidean_distance(0, 8 ) =  0.803525019398566
euclidean_distance(0, 9 ) =  0.8246629509361179
euclidean_distance(0, 10 ) =  0.8809987778032525
euclidean_distance(0, 11 ) =  0.9367125179042475
euclidean_distance(0, 12 ) =  0.8616972672884817
euclidean_distance(0, 13 ) =  0.8165896066566779
euclidean_distance(0, 14 ) =  0.8733968730080951
euclidean_distance(0, 15 ) =  0.786732425078347


#### 2.2 Cosine similarity

Cosine similarity measures similarity by the angle between the two vectors.  

\begin{equation}
similarity(A, B)=\cos(\theta)=\dfrac{\sum_{i=1}^n A_iB_i}{\sqrt{\sum_{i=1}^n A_i^2}\sqrt{\sum_{i=1}^n B_i^2}}
\end{equation}


In [70]:
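def cosine_similarity(list_a, list_b):
  vec_a = np.array(list_a)
  vec_b = np.array(list_b)
  inner_prod = vec_a.dot(vec_b)
  norm_prod = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
  # NumPy floats do not raise ZeroDivisionError (the result would be nan),
  # so check for a zero-length vector explicitly
  if norm_prod == 0:
    return 1.0
  return inner_prod / norm_prod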

In [71]:
for i in range(16):
   list_0 = list(tfidf_vector[0])
   list_i = list(tfidf_vector[i])
   print("cosine_similarity(0,",i,") = ", cosine_similarity(list_0, list_i))

cosine_similarity(0, 0 ) =  0.9999999999999998
cosine_similarity(0, 1 ) =  2.8873351476511364e-05
cosine_similarity(0, 2 ) =  8.135936584958245e-05
cosine_similarity(0, 3 ) =  0.14932426500054383
cosine_similarity(0, 4 ) =  0.044358536973654335
cosine_similarity(0, 5 ) =  0.14128529478592497
cosine_similarity(0, 6 ) =  0.14207323255887835
cosine_similarity(0, 7 ) =  0.09054479697520224
cosine_similarity(0, 8 ) =  3.576950361677573e-05
cosine_similarity(0, 9 ) =  0.025864847099641517
cosine_similarity(0, 10 ) =  0.02050403304744488
cosine_similarity(0, 11 ) =  0.02419449001683132
cosine_similarity(0, 12 ) =  0.06346642101352062
cosine_similarity(0, 13 ) =  0.08293620326441613
cosine_similarity(0, 14 ) =  0.0732154433477262
cosine_similarity(0, 15 ) =  0.013645710261608436
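
Euclidean distance and cosine similarity can rank documents differently because the former is sensitive to vector length. Once both vectors are scaled to unit length, however, the two are directly related by d² = 2(1 − cos θ). The sketch below checks this identity on two made-up vectors; all names in it are assumptions for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 0.0, 1.0])

# Scale both vectors to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cos = a_unit.dot(b_unit)                   # cosine of the angle between a and b
dist_sq = np.sum((a_unit - b_unit) ** 2)   # squared Euclidean distance
print(cos, dist_sq, 2 * (1 - cos))
```

On length-normalized vectors, sorting by ascending Euclidean distance therefore gives the same order as sorting by descending cosine similarity.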
