# Bag of Words, TF-IDF, N-gram

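## 1. Bag of Words

Here is a simple example of a text-based bag-of-words (BoW) model. Start with two short documents:

    Document 1: John likes to watch movies. Mary likes too.
    Document 2: John also likes to watch football games.

From the words that appear in these two documents, build the following dictionary:

    Vocabulary = {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}

The dictionary contains 10 words, each with a unique index, so every document can be represented as a 10-dimensional vector. Each entry is a non-negative integer: the number of times the corresponding word occurs in the document. Word order is discarded; only the counts are kept.

    Document 1: [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
    Document 2: [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]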


In [1]:
dataset=['John likes to watch movies. Mary likes too.','John also likes to watch football games.']
idx=0
vocabulary={}
# Split each sentence into tokens and build the vocabulary dictionary
for sentence in dataset:
    # Strip punctuation (here only the period); no stop words are removed
    sentence=sentence.replace('.','')
    words=sentence.split(' ')
    for i in words:
        if i not in vocabulary:
            vocabulary[i]=idx
            idx+=1
vocabulary

{'John': 0,
 'likes': 1,
 'to': 2,
 'watch': 3,
 'movies': 4,
 'Mary': 5,
 'too': 6,
 'also': 7,
 'football': 8,
 'games': 9}

In [2]:
# Build a bag-of-words count vector for each sentence in the dataset
encoded_dataset=[]
for sentence in dataset:
    sentence=sentence.replace('.','')
    words=sentence.split(' ')
    encoded_vector=[0]*idx
    for i in words:
        encoded_vector[vocabulary[i]]+=1
    encoded_dataset.append(encoded_vector)
encoded_dataset

[[1, 2, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]]
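The vectors printed above use 0-based indices assigned in order of first appearance, so their columns are ordered differently from the dictionary given in the prose. As a minimal sketch, the following reproduces the prose vectors from that fixed dictionary. The names `prose_vocabulary` and `bag_of_words` are hypothetical and chosen only for this illustration.

```python
from collections import Counter

# Dictionary copied from the prose example (1-based indices).
prose_vocabulary = {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5,
                    "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}

documents = ["John likes to watch movies. Mary likes too.",
             "John also likes to watch football games."]

def bag_of_words(text, vocab):
    """Return a count vector for `text`, ordered by the indices in `vocab`."""
    # Same preprocessing as the cells above: drop periods, split on whitespace.
    counts = Counter(text.replace(".", "").split())
    vector = [0] * len(vocab)
    for word, index in vocab.items():
        vector[index - 1] = counts[word]  # shift the 1-based index to a list position
    return vector

bow_vectors = [bag_of_words(doc, prose_vocabulary) for doc in documents]
print(bow_vectors)
```

Words missing from the dictionary are silently ignored in this sketch; whether to ignore them or map them to an "unknown" slot is a design choice the original example does not address.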

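## 2. TF-IDF (Term Frequency-Inverse Document Frequency)

Term frequency (TF):
    The number of times a given word occurs in a document. The count is usually normalized by the document length (so the numerator is normally smaller than the denominator) to avoid a bias toward long documents: whether or not a word is important, it is likely to occur more often in a long document than in a short one.

    TF(word) = k/n, where n is the total number of words in the document and k is the number of times word occurs in it

Inverse document frequency (IDF):
    A measure of a word's general importance across the corpus. The main idea: the fewer documents contain a term t, the larger its IDF, and the better the term discriminates between categories. The IDF of a word is obtained by dividing the total number of documents by the number of documents containing the word, then taking the logarithm of the quotient.

    IDF(word) = log(N/N(word)), where N is the total number of documents in the corpus and N(word) is the number of documents that contain word

Combining the two gives the TF-IDF formula:

    TF-IDF = TF * IDF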

In [3]:
import math 
corpus = ['Cats have four legs',
          'Cats and dogs are antagonistic',
          'He hate dogs']
# Here we compute the TF-IDF of the word "cats"

# 1. Lowercase every letter and remove periods
lower_corpus=[s.lower().replace('.','') for s in corpus]

# 2. Tokenization
tokenized_corpus=[s.split(' ') for s in lower_corpus]
tokenized_corpus

[['cats', 'have', 'four', 'legs'],
 ['cats', 'and', 'dogs', 'are', 'antagonistic'],
 ['he', 'hate', 'dogs']]

In [4]:
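# 3. TF: k[i] = occurrences of 'cats' in document i, n[i] = number of words in document i
k=[]
N_cat=0  # number of documents that contain 'cats' (used for the IDF below)
n=[]
for s in tokenized_corpus:
    count=s.count('cats')
    if count != 0:
        N_cat+=1
    k.append(count)
    n.append(len(s))
TF=[k[i]/n[i] for i in range(len(tokenized_corpus))]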
TF

[0.25, 0.2, 0.0]

In [5]:
# 4. IDF (base-10 logarithm): log10(total documents / documents containing 'cats')
IDF=math.log((len(tokenized_corpus)/(N_cat)),10)
IDF

0.17609125905568124

In [6]:
# 5. TF-IDF: one score per document
TF_IDF=[TF_i*IDF for TF_i in TF]
TF_IDF

[0.04402281476392031, 0.03521825181113625, 0.0]
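The cells above hard-code the word 'cats'. As a sketch of how the same formula could be wrapped for any word, the hypothetical helper `tf_idf` below follows the definitions in this section (TF = k/n, base-10 IDF). It assumes every document is non-empty, and it returns zeros for a word that appears in no document, which is a convention chosen here, not something the source specifies.

```python
import math

def tf_idf(word, tokenized_docs):
    """Return one TF-IDF score per document for `word`, using a base-10 IDF."""
    n_docs = len(tokenized_docs)
    docs_with_word = sum(1 for doc in tokenized_docs if word in doc)
    if docs_with_word == 0:
        # Assumption: a word absent from the corpus scores 0 everywhere
        # (this also avoids dividing by zero in the IDF).
        return [0.0] * n_docs
    idf = math.log(n_docs / docs_with_word, 10)
    # TF = occurrences of the word / number of words in the document
    return [doc.count(word) / len(doc) * idf for doc in tokenized_docs]

docs = [['cats', 'have', 'four', 'legs'],
        ['cats', 'and', 'dogs', 'are', 'antagonistic'],
        ['he', 'hate', 'dogs']]

cats_scores = tf_idf('cats', docs)
dogs_scores = tf_idf('dogs', docs)
print(cats_scores)
print(dogs_scores)
```

For 'cats' this should agree with the step-by-step result above, since it performs the same arithmetic.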

In [7]:
# scikit-learn's TfidfVectorizer can also compute TF-IDF (its default weighting differs from the manual formula above)
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
 
corpus = ['Cats have four legs',
          'Cats and dogs are antagonistic',
          'He hate dogs']
 
tfidf = TfidfVectorizer()
vect = tfidf.fit_transform(corpus)
 
df = pd.DataFrame()
df['vocabulary'] = tfidf.get_feature_names_out()  # on scikit-learn < 1.0, use get_feature_names()
df['sentence1'] = vect.toarray()[0]
df['sentence2'] = vect.toarray()[1]
df['sentence3'] = vect.toarray()[2]
df.set_index('vocabulary', inplace=True)
print(df.T)

vocabulary       and  antagonistic       are      cats      dogs      four  \
sentence1   0.000000      0.000000  0.000000  0.402040  0.000000  0.528635   
sentence2   0.490479      0.490479  0.490479  0.373022  0.373022  0.000000   
sentence3   0.000000      0.000000  0.000000  0.000000  0.473630  0.000000   

vocabulary      hate      have        he      legs  
sentence1   0.000000  0.528635  0.000000  0.528635  
sentence2   0.000000  0.000000  0.000000  0.000000  
sentence3   0.622766  0.000000  0.622766  0.000000  
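The scikit-learn values differ from the manual result (0.4020 rather than 0.0440 for 'cats' in the first sentence). With its default settings, `TfidfVectorizer` uses the raw count as TF, a smoothed natural-log IDF of the form ln((1 + N) / (1 + N(word))) + 1, and then scales each document vector to unit L2 length. The sketch below re-implements that weighting by hand, without scikit-learn, to make the difference concrete. The function name `sklearn_style_tfidf` is hypothetical, and the tokenization is simplified to pre-split lowercase tokens.

```python
import math

docs = [['cats', 'have', 'four', 'legs'],
        ['cats', 'and', 'dogs', 'are', 'antagonistic'],
        ['he', 'hate', 'dogs']]

def sklearn_style_tfidf(tokenized_docs):
    """Raw-count TF * smoothed natural-log IDF, then L2-normalize each document."""
    n_docs = len(tokenized_docs)
    vocab = sorted({w for doc in tokenized_docs for w in doc})
    # Document frequency: how many documents contain each word.
    df = {w: sum(1 for doc in tokenized_docs if w in doc) for w in vocab}
    # Smoothed IDF: acts as if one extra document contained every word.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized_docs:
        weights = {w: doc.count(w) * idf[w] for w in vocab}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        rows.append({w: v / norm for w, v in weights.items()})
    return rows

rows = sklearn_style_tfidf(docs)
for i, row in enumerate(rows, start=1):
    print(f"sentence{i}", {w: round(v, 6) for w, v in row.items() if v > 0})
```

Under these assumptions the printed weights should match the table above to about six decimal places. Because of the L2 normalization, the same word can receive different weights in documents of different lengths even when its count is identical.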


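## 3. N-gram

The model rests on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and on no other words; the probability of a whole sentence is then the product of the conditional probabilities of its words. These probabilities can be estimated directly from a corpus by counting how often each sequence of N words occurs together. The most common variants are the bigram (N=2) and the trigram (N=3) models.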
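As an illustration of the counting approach, here is a minimal bigram (N=2) sketch. It estimates each conditional probability as count(previous word, word) / count(previous word) and multiplies them to score a sentence. The sentence-boundary markers `<s>` and `</s>`, the toy corpus, and the function names are assumptions made for this example. No smoothing is applied, so any unseen bigram drives the sentence probability to zero.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count context words and adjacent word pairs over whitespace-tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        context_counts.update(tokens[:-1])           # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent pairs (prev, word)
    return context_counts, bigram_counts

def sentence_probability(sentence, context_counts, bigram_counts):
    """Product of maximum-likelihood bigram probabilities P(word | prev)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0  # context never seen in training
        prob *= bigram_counts[(prev, word)] / context_counts[prev]
    return prob

toy_corpus = ["John likes to watch movies",
              "John also likes to watch football games"]
context_counts, bigram_counts = train_bigram_counts(toy_corpus)

p_seen = sentence_probability("John likes to watch movies", context_counts, bigram_counts)
p_unseen = sentence_probability("Mary likes football", context_counts, bigram_counts)
print(p_seen, p_unseen)
```

On this toy corpus the first sentence scores 0.25: "John" is followed by "likes" in one of its two occurrences, "watch" by "movies" in one of two, and every other transition is certain. A trigram model would follow the same pattern with pairs of preceding words as the context.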