In [3]:
corpus=[
    'UNC played Duke in basketball',
    'Duke lost the basketball game'
]

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]]
{'unc': 7, 'played': 5, 'duke': 1, 'in': 3, 'basketball': 0, 'lost': 4, 'the': 6, 'game': 2}


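A 1 in an output vector means that the vocabulary word at that index appears in the sentence. (`CountVectorizer` stores counts by default; here every word occurs at most once per sentence, so the entries are all 0 or 1.)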
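
As a rough illustration of how the vocabulary and vectors relate, the minimal sketch below rebuilds the same result with the standard library only. The helper names `tokenize` and `encode` are made up for this example, and the regular expression is an assumption that approximates `CountVectorizer`'s default tokenization (lowercase, tokens of at least two characters); it is not the library's implementation.

```python
import re

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    # (an assumed approximation of the default tokenizer).
    return re.findall(r'\b\w\w+\b', text.lower())

# Give each distinct term an index in alphabetical order.
terms = sorted({token for document in corpus for token in tokenize(document)})
vocabulary = {term: index for index, term in enumerate(terms)}

def encode(text):
    # Count how often each vocabulary term occurs in the text.
    vector = [0] * len(vocabulary)
    for token in tokenize(text):
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector

vectors = [encode(document) for document in corpus]
for vector in vectors:
    print(vector)
print(vocabulary)
```

Each position in a vector corresponds to one vocabulary index, so position 7 is `unc` and position 0 is `basketball`.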

In [5]:
corpus.append('Duke lost the game basketball')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]
 [1 1 1 0 1 0 1 0]]
{'unc': 7, 'played': 5, 'duke': 1, 'in': 3, 'basketball': 0, 'lost': 4, 'the': 6, 'game': 2}


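Note that the bag-of-words model ignores word order: even if two sentences do not mean exactly the same thing, they are encoded as the same vector whenever they contain exactly the same words.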
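
A small sketch of why this happens, using `collections.Counter` as a stand-in for a bag of words (the plain whitespace split is a simplification chosen for this example):

```python
from collections import Counter

first = Counter('duke lost the basketball game'.split())
second = Counter('duke lost the game basketball'.split())

# A Counter records only which words occur and how often, not their order,
# so the two sentences produce equal bags.
print(first == second)
```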

In [6]:
corpus.append('I ate a sandwich')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[0 1 1 0 1 0 1 0 0 1]
 [0 1 1 1 0 1 0 0 1 0]
 [0 1 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0]]
{'unc': 9, 'played': 6, 'duke': 2, 'in': 4, 'basketball': 1, 'lost': 5, 'the': 8, 'game': 3, 'ate': 0, 'sandwich': 7}


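The Euclidean norm ($L^2$ norm) of the difference between two vectors can be used to measure their distance, which serves as a rough proxy for how far apart the two sentences are in meaning.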

In [7]:
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np
X = np.asarray(vectorizer.fit_transform(corpus).todense())
# Pairwise distances between consecutive documents; each row is reshaped to a 1 x n matrix
d01 = euclidean_distances(X[0].reshape(1, -1), X[1].reshape(1, -1))
d12 = euclidean_distances(X[1].reshape(1, -1), X[2].reshape(1, -1))
d23 = euclidean_distances(X[2].reshape(1, -1), X[3].reshape(1, -1))
print(f'Distance between each document is {d01} and {d12}, and {d23}')

Distance between each document is [[2.44948974]] and [[0.]], and [[2.64575131]]
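
These numbers can be checked by hand with $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$. The sketch below copies the rows printed by cell 6 and uses a hypothetical helper named `euclidean`; since the second and third documents have identical vectors, only one of them is needed.

```python
import math

doc0 = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]  # 'UNC played Duke in basketball'
doc1 = [0, 1, 1, 1, 0, 1, 0, 0, 1, 0]  # 'Duke lost the basketball game'
doc3 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]  # 'I ate a sandwich'

def euclidean(x, y):
    # Square root of the summed squared element-wise differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean(doc0, doc1))  # the vectors differ in six positions: sqrt(6)
print(euclidean(doc1, doc3))  # the vectors differ in seven positions: sqrt(7)
```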


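## Curse of dimensionality

When the training corpus contains a large amount of text and the documents share little vocabulary, the vocabulary (and therefore the number of columns in each vector) grows rapidly, and most elements of each vector are 0. This is known as the curse of dimensionality.

One way to reduce the dimensionality is to normalize and filter the vocabulary:  
convert all words to lowercase (which `CountVectorizer` does by default) and drop stop words such as articles, auxiliary verbs, and prepositions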

In [8]:
vectorizer = CountVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
{'unc': 7, 'played': 5, 'duke': 2, 'basketball': 1, 'lost': 4, 'game': 3, 'ate': 0, 'sandwich': 6}
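
The drop from 10 to 8 columns can be reproduced by hand. The sketch below is only an approximation: `STOP_WORDS` is a tiny hand-picked set invented for this example (the English list shipped with scikit-learn is much longer), and `tokenize` is an assumed stand-in for the default tokenizer.

```python
import re

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'Duke lost the game basketball',
    'I ate a sandwich',
]

# Hypothetical stop list for illustration only.
STOP_WORDS = {'in', 'the'}

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # so 'I' and 'a' are already dropped at this stage.
    return re.findall(r'\b\w\w+\b', text.lower())

all_terms = sorted({token for document in corpus for token in tokenize(document)})
kept_terms = [term for term in all_terms if term not in STOP_WORDS]

print(len(all_terms), len(kept_terms))
print(kept_terms)
```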


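Although this reduces the dimensionality to some extent, the effect is often limited.
Two other techniques that can reduce dimensionality are **stemming** and **lemmatization**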

In [9]:
corpus=[
    'He ate the sandwiches',
    'Every sandwich was eaten by him'
]
vectorizer=CountVectorizer(binary=True,stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[1 0 0 1]
 [0 1 1 0]]
{'ate': 0, 'sandwiches': 3, 'sandwich': 2, 'eaten': 1}


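The two sentences have similar meanings, but their feature vectors are orthogonal, so we need a way to map the different inflected forms of a word (tenses, plurals) to a single representation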
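
Orthogonal here means that the dot product of the two rows is zero, so their cosine similarity is zero as well. A quick check with the rows printed above (the variable names are chosen for this example):

```python
import math

he_ate = [1, 0, 0, 1]     # 'He ate the sandwiches'
was_eaten = [0, 1, 1, 0]  # 'Every sandwich was eaten by him'

# Dot product: sum of element-wise products.
dot = sum(a * b for a, b in zip(he_ate, was_eaten))

# Cosine similarity: dot product divided by the product of the vector lengths.
norm_product = math.sqrt(sum(a * a for a in he_ate)) * math.sqrt(sum(b * b for b in was_eaten))
cosine = dot / norm_product

print(dot, cosine)
```

No vocabulary index is set in both rows, which is why the two sentences look completely unrelated to the model.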

In [10]:
corpus=[
    'I am gathering ingredients for the sandwich.',
    'There were many wizards at the gathering'    
]

In [11]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()
print(lemmatizer.lemmatize('gathering','v'))
print(lemmatizer.lemmatize('gathering','n'))

gather
gathering


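Similarly, stemming can be used to reduce a word to its stem; unlike the lemmatizer above, the stemmer is not given a part of speech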

In [12]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('gathering'))

gather


In [13]:
# Stem the earlier corpus here; the following cells lemmatize it
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import pos_tag

wordnet_tags=['n','v']
corpus=[
    'He ate the sandwiches',
    'Every sandwich was eaten by him'
]
stemmer=PorterStemmer()
print('stemmed: ',[[stemmer.stem(token) for token in word_tokenize(document)] for document in corpus])

stemmed:  [['he', 'ate', 'the', 'sandwich'], ['everi', 'sandwich', 'wa', 'eaten', 'by', 'him']]


In [14]:
def lemmatize(token,tag):
    if tag[0].lower() in wordnet_tags:
        return lemmatizer.lemmatize(token,tag[0].lower())
    return token

In [15]:
lemmatizer=WordNetLemmatizer()
tagged_corpus=[pos_tag(word_tokenize(document)) for document in corpus]
print('Lemmatized:', [[lemmatize(token,tag) for token,tag in document] for document in tagged_corpus])


LookupError: 
**********************************************************************
  Resource averaged_perceptron_tagger not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('averaged_perceptron_tagger')

  For more information see: https://www.nltk.org/data.html

  Attempted to load taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle

  Searched in:
    - 'C:\\Users\\diomedes/nltk_data'
    - 'C:\\Users\\diomedes\\.conda\\envs\\pyCharm\\nltk_data'
    - 'C:\\Users\\diomedes\\.conda\\envs\\pyCharm\\share\\nltk_data'
    - 'C:\\Users\\diomedes\\.conda\\envs\\pyCharm\\lib\\nltk_data'
    - 'C:\\Users\\diomedes\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************
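
The traceback above means that the part-of-speech tagger model used by `pos_tag` has not been downloaded yet. One possible way to fetch it on demand is sketched below. `ensure_nltk_resource` is a hypothetical helper name, and the resource path and package name are taken from the error message; some NLTK releases may ask for a differently named tagger resource, in which case the name printed in the error message is the one to pass.

```python
def ensure_nltk_resource(resource_path, package_name):
    """Download an NLTK data package only if it is not installed yet."""
    import nltk  # imported here so that defining the helper has no side effects

    try:
        # Raises LookupError when the resource is missing from every search path.
        nltk.data.find(resource_path)
    except LookupError:
        # Requires network access; files go to the default nltk_data directory.
        nltk.download(package_name)
```

Calling `ensure_nltk_resource('taggers/averaged_perceptron_tagger', 'averaged_perceptron_tagger')` once and then rerunning cell 15 should let `pos_tag` load the tagger, assuming the download succeeds.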
