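# Text Representation

Text representation: converting text data into numbers or vectors that a computer can operate on.

In natural language processing (NLP), text representation is the first step of the processing pipeline: it converts text into numbers that a computer can operate on.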

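## 01 Bag of Words

- Idea:

  Treat each document as a bag of words and ignore the order in which the words appear. Concretely, the whole text is represented as one long vector in which each dimension corresponds to one word, and the weight in that dimension reflects how important the word is in the original document.

- Example 1:

  Sentence 1: Jane wants to go to Shenzhen; Sentence 2: Bob wants to go to Shanghai

  Build the bag of words from the two example sentences: [Jane, wants, to, go, Shenzhen, Bob, Shanghai]

  Each sentence can then be represented by a vector whose indices match the indices of the vocabulary list, and whose values are the number of times each word occurs:

  Sentence 1: [1,1,2,1,1,0,0]; Sentence 2: [0,1,2,1,0,1,1]

- Example 2:

  This time we also handle stop words and punctuation.

  Sentence 1: Jane wants to go to Shenzhen . Sentence 2: Bob wants to go to Shanghai , me too .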

In [3]:
sentence1 = 'Jane wants to go to Shenzhen .'
sentence2 = 'Bob wants to go to Shanghai , me too .'

tokens1 = sentence1.split(" ")
tokens2 = sentence2.split(" ")
print(tokens1)
print(tokens2)

['Jane', 'wants', 'to', 'go', 'to', 'Shenzhen', '.']
['Bob', 'wants', 'to', 'go', 'to', 'Shanghai', ',', 'me', 'too', '.']


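### Vectorization

In [7]:
def vectorize(tokens, filtered_vocab):
    """
    Return a count vector: one entry per vocabulary word,
    holding the number of times that word occurs in tokens.
    """
    vector = []
    for w in filtered_vocab:
        vector.append(tokens.count(w))
    return vector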

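### Deduplication

In [8]:
def unique(sequence):
    """
    Remove duplicates while preserving first-occurrence order.
    """
    seen = set()
    # seen.add(x) returns None (falsy), so an unseen x is recorded and kept
    return [x for x in sequence if not (x in seen or seen.add(x))]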

# create a vocabulary list
vocab = unique(tokens1+tokens2)
vocab

['Jane',
 'wants',
 'to',
 'go',
 'Shenzhen',
 '.',
 'Bob',
 'Shanghai',
 ',',
 'me',
 'too']

Build the effective bag of words from the tokens of the two example sentences, after filtering out stop words and punctuation:

In [9]:
# stop words
stopwords = ["to", "is", "a"]
# punctuation marks
special_chars = [",", ":", ";", ".", "?"]

# filter out stop words and punctuation
filtered_vocab = []
for w in vocab:
    if w not in stopwords and w not in special_chars:
        filtered_vocab.append(w)
filtered_vocab

['Jane', 'wants', 'go', 'Shenzhen', 'Bob', 'Shanghai', 'me', 'too']

In [10]:
# convert sentences into vectors
vector1 = vectorize(tokens1, filtered_vocab)
print(vector1)
vector2 = vectorize(tokens2, filtered_vocab)
print(vector2)

[1, 1, 1, 1, 0, 0, 0, 0]
[0, 1, 1, 0, 1, 1, 1, 1]


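The length of a Bag of Words vector equals the size of the vocabulary, so with a realistic vocabulary most entries are zero and the representation is very sparse.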
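To make the sparsity concrete, here is a minimal sketch in plain Python (not part of the original notebook). It rebuilds the vector for sentence 1, measures the share of zero entries, and keeps only the non-zero positions. The names `dense`, `sparse`, and `zero_fraction` are illustrative assumptions.

```python
from collections import Counter

# Filtered vocabulary and tokens of sentence 1, copied from the cells above
vocab = ['Jane', 'wants', 'go', 'Shenzhen', 'Bob', 'Shanghai', 'me', 'too']
tokens = ['Jane', 'wants', 'to', 'go', 'to', 'Shenzhen', '.']

counts = Counter(tokens)

# Dense form: one slot per vocabulary word (a Counter returns 0 for missing words)
dense = [counts[w] for w in vocab]

# Sparse form: store only the positions with a non-zero count
sparse = {i: c for i, c in enumerate(dense) if c > 0}

# Share of vector entries that are zero
zero_fraction = dense.count(0) / len(dense)

print(dense)
print(sparse)
print(zero_fraction)
```

Even with two short sentences, half of the entries are zero; the share grows quickly as more documents enlarge the vocabulary.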


The following cell builds the same Bag of Words model with the sklearn library:

In [12]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
 
sentence1 = 'Jane wants to go to Shenzhen .'
sentence2 = 'Bob wants to go to Shanghai , me too .'
  
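count_vec = CountVectorizer(ngram_range=(1, 1),  # unigrams only; use ngram_range=(2, 2) for bigrams
                            # stop_words='english'
                            )
# fit the vocabulary and transform the sentences into a count matrix
feature = count_vec.fit_transform([sentence1, sentence2])

# create dataframe
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(feature.toarray(), columns=count_vec.get_feature_names_out())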
df

| | bob | go | jane | me | shanghai | shenzhen | to | too | wants |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 0 | 1 | 2 | 0 | 1 |
| 1 | 1 | 1 | 0 | 1 | 1 | 0 | 2 | 1 | 1 |
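The columns above differ from the hand-built vocabulary: every word is lowercased, the columns are sorted alphabetically, and the punctuation tokens are gone. This follows from the `CountVectorizer` defaults, which lowercase the text and keep only tokens of two or more word characters. The sketch below approximates that behaviour in plain Python; the helper `tokenize` is a hypothetical name, and the regular expression is an assumed stand-in for the library's default token pattern.

```python
import re
from collections import Counter

docs = ['Jane wants to go to Shenzhen .',
        'Bob wants to go to Shanghai , me too .']

def tokenize(text):
    # Lowercase, then keep runs of at least two word characters,
    # which drops '.' and ',' (and would drop one-letter words)
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]

# Vocabulary sorted alphabetically, matching the column order above
vocab = sorted({t for toks in tokenized for t in toks})

# One count row per document
matrix = [[Counter(toks)[w] for w in vocab] for toks in tokenized]

print(vocab)
for row in matrix:
    print(row)
```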


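## 02 Term Frequency-Inverse Document Frequency (TF-IDF)

- Idea:

  The importance of a term increases in proportion to the number of times it appears in a document, but decreases in inverse proportion to how frequently it appears across the corpus. If a word has a high term frequency (TF) in one document and rarely appears in other documents, it is considered to discriminate well between classes and to be well suited for classification.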

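- Formulas:

  - $\text{TF-IDF}(t,d)=\text{TF}(t,d) \times \text{IDF}(t)$
  - $\text{IDF}(t)=\log\frac{\text{total number of documents}}{\text{number of documents containing term } t + 1}$
  - $\text{TF}(t,d)=\frac{\text{number of times term } t \text{ occurs in document } d}{\text{total number of terms in document } d}$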

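- Drawbacks:

  (1) It ignores the position of a term in the text. A term contributes differently to discrimination depending on where in the document it appears.

  (2) Under classic TF-IDF, rare words tend to have a high IDF (inverse document frequency), so they are often mistaken for document keywords.

  (3) The IDF part only considers how many documents a term appears in, and ignores how the term is distributed within a class and across different classes.

  (4) It performs poorly at extracting important person and place names that occur only a few times in a document.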
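The formulas in this section can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used by any library: the function names `tf`, `idf`, and `tf_idf` are hypothetical, the toy `corpus` is made up for the example, and the natural logarithm is an assumption because the formula does not state a base.

```python
import math

def tf(term, doc_tokens):
    # Occurrences of the term divided by the document length
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # log(total documents / (documents containing the term + 1))
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (doc_freq + 1))

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical toy corpus, already tokenized and filtered
corpus = [
    ["jane", "wants", "go", "shenzhen"],
    ["bob", "wants", "go", "shanghai", "me", "too"],
    ["tim", "wants", "visit", "beijing"],
    ["jane", "likes", "tea"],
]

for term in ["shenzhen", "jane", "wants"]:
    print(term, tf_idf(term, corpus[0], corpus))
```

In this corpus "shenzhen" appears in one document and gets the highest weight, while "wants" appears in three of the four documents and its IDF is log(4/4) = 0. With the +1 in the denominator, a term that appears in every document would get a negative IDF, which is one reason libraries often use a different smoothing.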

Usage example:

In [15]:
sentence3 = 'Bob wants to visit Disneyland in Shanghai during the summer vacation  .'
sentence4 = 'Tim is planning to go to Shenzhen next month to discuss project with Jane .'
contents = [sentence1, sentence2, sentence3, sentence4]
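# TfidfVectorizer accepts the parameters of both CountVectorizer and TfidfTransformer
vec = TfidfVectorizer(stop_words=stopwords,
                      norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
# fit on the documents and extract the TF-IDF features in a single step
feature = vec.fit_transform(contents)
print(feature.toarray())
# create dataframe
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(feature.toarray(), columns=vec.get_feature_names_out())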
df

[[0.         0.         0.         0.         0.44493104 0.
  0.54957835 0.         0.         0.         0.         0.
  0.         0.54957835 0.         0.         0.         0.
  0.         0.         0.44493104 0.        ]
 [0.39137817 0.         0.         0.         0.31685436 0.
  0.         0.49641358 0.         0.         0.         0.
  0.39137817 0.         0.         0.         0.         0.49641358
  0.         0.         0.31685436 0.        ]
 [0.26805872 0.         0.33999849 0.33999849 0.         0.33999849
  0.         0.         0.         0.         0.         0.
  0.26805872 0.         0.33999849 0.33999849 0.         0.
  0.33999849 0.33999849 0.21701663 0.        ]
 [0.         0.33999849 0.         0.         0.21701663 0.
  0.26805872 0.         0.33999849 0.33999849 0.33999849 0.33999849
  0.         0.26805872 0.         0.         0.33999849 0.
  0.         0.         0.         0.33999849]]


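| | bob | discuss | disneyland | during | go | in | jane | me | month | next | ... | shanghai | shenzhen | summer | the | tim | too | vacation | visit | wants | with |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.444931 | 0.0 | 0.549578 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.549578 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.444931 | 0.0 |
| 1 | 0.391378 | 0.0 | 0.0 | 0.0 | 0.316854 | 0.0 | 0.0 | 0.496414 | 0.0 | 0.0 | ... | 0.391378 | 0.0 | 0.0 | 0.0 | 0.0 | 0.496414 | 0.0 | 0.0 | 0.316854 | 0.0 |
| 2 | 0.268059 | 0.0 | 0.339998 | 0.339998 | 0.0 | 0.339998 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.268059 | 0.0 | 0.339998 | 0.339998 | 0.0 | 0.0 | 0.339998 | 0.339998 | 0.217017 | 0.0 |
| 3 | 0.0 | 0.339998 | 0.0 | 0.0 | 0.217017 | 0.0 | 0.268059 | 0.0 | 0.339998 | 0.339998 | ... | 0.0 | 0.268059 | 0.0 | 0.0 | 0.339998 | 0.0 | 0.0 | 0.0 | 0.0 | 0.339998 |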
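These values do not follow the textbook formula directly. With `smooth_idf=True` and `norm='l2'`, `TfidfVectorizer` uses the raw count as TF, computes IDF as ln((1 + n) / (1 + df)) + 1, and then scales each row to unit Euclidean length. The sketch below recomputes two rows by hand in plain Python to show where the numbers come from. The helpers `tokenize` and `tfidf_row` are hypothetical names, and the regular expression is an assumed approximation of the library's default tokenizer.

```python
import math
import re
from collections import Counter

contents = [
    'Jane wants to go to Shenzhen .',
    'Bob wants to go to Shanghai , me too .',
    'Bob wants to visit Disneyland in Shanghai during the summer vacation  .',
    'Tim is planning to go to Shenzhen next month to discuss project with Jane .',
]
stopwords = {"to", "is", "a"}

def tokenize(text):
    # Lowercase, keep tokens of 2+ word characters, drop stop words
    return [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in stopwords]

docs = [tokenize(c) for c in contents]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf_row(doc):
    counts = Counter(doc)
    # raw count * smoothed IDF
    weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
    # L2-normalize the row
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row0 = tfidf_row(docs[0])
row1 = tfidf_row(docs[1])
print(row0["jane"], row0["wants"])
print(row1["me"], row1["bob"])
```

The printed weights should agree with rows 0 and 1 of the table above up to rounding.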
