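The English tokenization APIs are listed below. NLTK looks for the punkt resource under the following path (if it is missing, nltk.download('punkt') fetches it):
~/nltk_data/tokenizers/punkt/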

In [22]:
import nltk.tokenize as tk
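# Split a sample into sentences; sent_list is a list of sentences.
# The string 'text' is only a placeholder for the real input.
sent_list = tk.sent_tokenize(text='text')

# Split a sample into words; word_list is a list of words.
word_list = tk.word_tokenize(text='text')

# Split a sample into words and punctuation; punctTokenizer is a tokenizer object.
punctTokenizer = tk.WordPunctTokenizer()
word_list = punctTokenizer.tokenize(text='text')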

**Example: English tokenization**

In [23]:
import numpy as np
import nltk.tokenize as tk

In [24]:
doc = "Are you curious about tokenization? " \
"Let's see how it works! " \
"We need to analyze a couple of sentences " \
"with punctuations to see it in action."
print(doc)

Are you curious about tokenization? Let's see how it works! We need to analyze a couple of sentences with punctuations to see it in action.


In [25]:
# Split the document into sentences
sents = tk.sent_tokenize(doc)
for i, sent in enumerate(sents, start=1):
    print(i, ':', sent)

1 : Are you curious about tokenization?
2 : Let's see how it works!
3 : We need to analyze a couple of sentences with punctuations to see it in action.
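
For contrast, here is a naive splitter sketched with a regular expression. It is not how sent_tokenize works: Punkt is a trained model that also copes with abbreviations. The helper name naive_sent_split is hypothetical and exists only for this illustration.

```python
import re

def naive_sent_split(text):
    """Hypothetical helper: split after '.', '!' or '?' when whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

doc = ("Are you curious about tokenization? "
       "Let's see how it works! "
       "We need to analyze a couple of sentences "
       "with punctuations to see it in action.")

naive_sents = naive_sent_split(doc)
for i, sent in enumerate(naive_sents, start=1):
    print(i, ':', sent)

# An abbreviation exposes the weakness of the naive rule:
# "Dr." is treated as the end of a sentence.
print(naive_sent_split("Dr. Smith arrived. He sat down."))
```

On this document the naive rule happens to give the same three sentences, but it breaks on abbreviations such as "Dr.", which is the case a trained sentence tokenizer is meant to handle.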


In [26]:
# Word tokenization, method 1: word_tokenize
words = tk.word_tokenize(doc)
for i, word in enumerate(words, start=1):
    print(i, ':', word)

1 : Are
2 : you
3 : curious
4 : about
5 : tokenization
6 : ?
7 : Let
8 : 's
9 : see
10 : how
11 : it
12 : works
13 : !
14 : We
15 : need
16 : to
17 : analyze
18 : a
19 : couple
20 : of
21 : sentences
22 : with
23 : punctuations
24 : to
25 : see
26 : it
27 : in
28 : action
29 : .


In [27]:
# Word tokenization, method 2: WordPunctTokenizer
tokenizer = tk.WordPunctTokenizer()
words = tokenizer.tokenize(doc)
for i, word in enumerate(words, start=1):
    print(i, ':', word)

1 : Are
2 : you
3 : curious
4 : about
5 : tokenization
6 : ?
7 : Let
8 : '
9 : s
10 : see
11 : how
12 : it
13 : works
14 : !
15 : We
16 : need
17 : to
18 : analyze
19 : a
20 : couple
21 : of
22 : sentences
23 : with
24 : punctuations
25 : to
26 : see
27 : it
28 : in
29 : action
30 : .
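
The two word tokenizers differ on "Let's": word_tokenize keeps the clitic 's as one token (29 tokens in total), while WordPunctTokenizer splits it into ' and s (30 tokens). In the NLTK source, WordPunctTokenizer is a regex tokenizer built on the pattern `\w+|[^\w\s]+`. Assuming that pattern, the second result can be reproduced with the standard library alone. This is a sketch of the idea, not a replacement for the NLTK class.

```python
import re

# Assumed WordPunctTokenizer pattern: a run of word characters,
# or a run of characters that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

doc = ("Are you curious about tokenization? "
       "Let's see how it works! "
       "We need to analyze a couple of sentences "
       "with punctuations to see it in action.")

tokens = WORD_PUNCT.findall(doc)
for i, tok in enumerate(tokens, start=1):
    print(i, ':', tok)
```

Because the apostrophe is neither a word character nor whitespace, it always becomes its own token, which explains the extra entry compared with word_tokenize.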


In [28]:
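import sklearn.feature_extraction.text as ft
# Build the bag-of-words model object
cv = ft.CountVectorizer()
# Fit the model: every word that occurs in the sentences becomes a feature name,
# each sentence is one sample, and the number of times a word occurs
# in that sentence is the feature value.
# sents: the list of sentences produced by sent_tokenize above
bow = cv.fit_transform(sents)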
print(bow)
print(bow.toarray())
print(cv.get_feature_names_out())

  (0, 3)	1
  (0, 20)	1
  (0, 5)	1
  (0, 0)	1
  (0, 16)	1
  (1, 9)	1
  (1, 13)	1
  (1, 6)	1
  (1, 8)	1
  (1, 19)	1
  (2, 13)	1
  (2, 8)	1
  (2, 17)	1
  (2, 10)	1
  (2, 15)	2
  (2, 2)	1
  (2, 4)	1
  (2, 11)	1
  (2, 14)	1
  (2, 18)	1
  (2, 12)	1
  (2, 7)	1
  (2, 1)	1
[[1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1]
 [0 0 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 0 0 1 0]
 [0 1 1 0 1 0 0 1 1 0 1 1 1 1 1 2 0 1 1 0 0]]
['about' 'action' 'analyze' 'are' 'couple' 'curious' 'how' 'in' 'it' 'let'
 'need' 'of' 'punctuations' 'see' 'sentences' 'to' 'tokenization' 'we'
 'with' 'works' 'you']
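
To make the matrix above less of a black box, the same counts can be rebuilt by hand. The sketch below assumes the default CountVectorizer settings (lowercasing, and a token pattern that keeps only tokens of two or more word characters, which is why "a" and the "s" of "Let's" do not appear in the vocabulary). The names TOKEN, vocab and bow_rows are illustrative only.

```python
import re
from collections import Counter

sents = [
    "Are you curious about tokenization?",
    "Let's see how it works!",
    "We need to analyze a couple of sentences with punctuations to see it in action.",
]

# Assumed default token pattern: words of at least two word characters.
TOKEN = re.compile(r"\b\w\w+\b")

# Lowercase and tokenize each sentence.
tokenized = [TOKEN.findall(s.lower()) for s in sents]

# The vocabulary is the sorted set of all tokens; its order defines the columns.
vocab = sorted({tok for toks in tokenized for tok in toks})

# One row per sentence, one column per vocabulary term.
bow_rows = []
for toks in tokenized:
    counts = Counter(toks)  # a Counter returns 0 for terms that are absent
    bow_rows.append([counts[term] for term in vocab])

print(vocab)
for row in bow_rows:
    print(row)
```

The only value greater than 1 is the column for "to" in the third sentence, where the word occurs twice.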


**Term frequency-inverse document frequency (TF-IDF)**

In [29]:
# Convert the raw counts in bow into TF-IDF weights
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow).toarray()
print(np.round(tfidf,2))
print(cv.get_feature_names_out())

[[0.45 0.   0.   0.45 0.   0.45 0.   0.   0.   0.   0.   0.   0.   0.
  0.   0.   0.45 0.   0.   0.   0.45]
 [0.   0.   0.   0.   0.   0.   0.49 0.   0.37 0.49 0.   0.   0.   0.37
  0.   0.   0.   0.   0.   0.49 0.  ]
 [0.   0.26 0.26 0.   0.26 0.   0.   0.26 0.2  0.   0.26 0.26 0.26 0.2
  0.26 0.51 0.   0.26 0.26 0.   0.  ]]
['about' 'action' 'analyze' 'are' 'couple' 'curious' 'how' 'in' 'it' 'let'
 'need' 'of' 'punctuations' 'see' 'sentences' 'to' 'tokenization' 'we'
 'with' 'works' 'you']
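
The numbers above can be checked by hand. The sketch below assumes the default TfidfTransformer settings: smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of sentences and df(t) the number of sentences containing term t, followed by L2 normalization of each row. The count matrix is copied from the bag-of-words output; the variable names are illustrative only.

```python
import math

# Count matrix copied from the bag-of-words output (3 sentences x 21 terms).
bow_counts = [
    [1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1],
    [0,0,0,0,0,0,1,0,1,1,0,0,0,1,0,0,0,0,0,1,0],
    [0,1,1,0,1,0,0,1,1,0,1,1,1,1,1,2,0,1,1,0,0],
]

n_docs = len(bow_counts)
n_terms = len(bow_counts[0])

# Document frequency: in how many sentences does each term occur?
df = [sum(1 for row in bow_counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed default formula).
idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]

tfidf_rows = []
for row in bow_counts:
    weighted = [tf * w for tf, w in zip(row, idf)]     # tf * idf
    norm = math.sqrt(sum(v * v for v in weighted))     # L2 norm of the row
    tfidf_rows.append([v / norm for v in weighted])

for row in tfidf_rows:
    print([round(v, 2) for v in row])
```

Terms shared by two sentences ("it", "see") get a lower IDF and therefore a lower weight (0.37 and 0.2) than terms unique to one sentence (0.49 and 0.26), while "to" scores highest in the third sentence because it occurs twice there.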
