In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2
# Show the output of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## IMDB

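The authors considered only highly polarized reviews. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset. The dataset is divided into a training set and a test set. The training set consists of 25,000 labeled reviews.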

In [2]:
from fastai import *
from fastai.text import *

In [3]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

- Get the data

In [4]:
path = untar_data(URLs.IMDB_SAMPLE)
path

PosixPath('/home/lab/.fastai/data/imdb_sample')

In [5]:
path.ls()

[PosixPath('/home/lab/.fastai/data/imdb_sample/texts.csv')]

In [6]:
df = pd.read_csv(path/'texts.csv')
df.head()

| | label | text | is_valid |
|---|---|---|---|
| 0 | negative | Un-bleeping-believable! Meg Ryan doesn't even ... | False |
| 1 | positive | This is a extremely well-made film. The acting... | False |
| 2 | negative | Every once in a long while a movie will come a... | False |
| 3 | positive | Name just says it all. I watched this movie wi... | False |
| 4 | negative | This movie succeeds at being one of the most u... | False |


In [7]:
data = TextDataBunch.from_csv(path, 'texts.csv', text_cols=1, label_cols=0)

In [26]:
data.train_ds[:3]

LabelList (3 items)
x: TextList
xxbos ' xxmaj major xxmaj payne ' is a film about a major who makes life a living xxmaj hell for his small group of boys in the marines . xxmaj this film does not really have a lot to offer , but it provides several hilarious moments that are well - worth a watch . xxmaj do n't expect it to be a memorable film , however . xxmaj just expect to laugh your way through the film and at the expense of other people . xxmaj the confrontation between xxmaj major xxmaj payne and the chubby boy were hilarious , and that 's really all i remember about the film except for the boys wanting revenge on xxmaj major xxmaj payne . xxmaj again , it is not a great film , and it is probably best watched on a rainy day when you need some laughter .,xxbos xxmaj prince stars as ' the xxmaj kid ' in this semi - autobiographical film of a talented , but narcissistic young musician who has a less then stellar home life . xxmaj true the acting leaves a tad to be desired ( barring xx

### Tokenization
- Splits text into words, punctuation, and tokens with special meaning

- fastai's special tokens

```
defaults.text_spec_tok = [UNK,PAD,BOS,FLD,TK_MAJ,TK_UP,TK_REP,TK_WREP]

The rules are all listed below, here is the meaning of the special tokens:

UNK (xxunk) is for an unknown word (one that isn't present in the current vocabulary)
PAD (xxpad) is the token used for padding, if we need to regroup several texts of different lengths in a batch
BOS (xxbos) represents the beginning of a text in your dataset
FLD (xxfld) is used if you set mark_fields=True in your TokenizeProcessor to separate the different fields of texts (if your texts are loaded from several columns in a dataframe)
TK_MAJ (xxmaj) is used to indicate the next word begins with a capital in the original text
TK_UP (xxup) is used to indicate the next word is written in all caps in the original text
TK_REP (xxrep) is used to indicate the next character is repeated n times in the original text (usage xxrep n {char})
TK_WREP(xxwrep) is used to indicate the next word is repeated n times in the original text (usage xxwrep n {word})
```
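The casing markers described above can be illustrated with a small sketch. This is a simplified stand-in written for this note, not fastai's own rule set; `mark_case` is a hypothetical helper, and it ignores details such as how fastai treats single-letter words.

```python
# Simplified illustration only; fastai's own rules handle more cases.
TK_MAJ, TK_UP = "xxmaj", "xxup"

def mark_case(tokens):
    """Lower-case each token, prefixing a marker that records its original casing."""
    out = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:
            out += [TK_UP, tok.lower()]    # written in all caps, e.g. "GREAT"
        elif tok[:1].isupper():
            out += [TK_MAJ, tok.lower()]   # starts with a capital, e.g. "Major"
        else:
            out.append(tok)                # already lower case, or punctuation
    return out

marked = mark_case(["Major", "Payne", "is", "a", "GREAT", "film", "."])
print(marked)
```

The point of the markers is that the vocabulary only needs the lower-case form of each word, while the casing information is kept as a separate token.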

In [30]:
len(data.train_ds)
len(data.valid_ds)

799

201

In [68]:
len(data.vocab.itos), len(data.vocab.stoi)

(8861, 19162)

stoi (string-to-int) is larger than itos (int-to-string) because many words **map to unknown**. We can confirm this here:

In [62]:
data.vocab.itos[0]

'xxunk'

In [73]:
# Index 0 is the unknown token; collect every word that maps to it
unk = []
for word, num in data.vocab.stoi.items():
    if num==0:
        unk.append(word)

In [74]:
len(unk)

10302

In [75]:
19162 - 8861

10301

In [76]:
unk[:10]

['xxunk',
 'dumpster',
 'showman',
 'concerts',
 'magnoli',
 'cavallo',
 'thorin',
 'grafitti',
 'bachstage',
 'riffs']
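Note that the list of words mapped to 0 starts with `'xxunk'` itself, which is also entry 0 of `itos`. That accounts for the off-by-one above: 10302 words map to 0, but only 10301 of them are missing from `itos`. The sketch below reproduces the effect on a toy vocabulary. It assumes `stoi` behaves like a `collections.defaultdict(int)`, so that looking up an unseen word returns 0 and stores the word as a new key; that behavior is an assumption about the library, not something shown in this notebook.

```python
from collections import defaultdict

# Toy vocabulary; index 0 is reserved for the unknown token.
itos = ["xxunk", "film", "great"]

# Assumption: stoi acts like defaultdict(int), so a missing key yields 0
# and is stored as a new entry.
stoi = defaultdict(int, {word: idx for idx, word in enumerate(itos)})

for word in ["film", "dumpster", "showman", "great"]:
    _ = stoi[word]  # looking up an unseen word inserts it with value 0

unk = [word for word, idx in stoi.items() if idx == 0]
print(len(itos), len(stoi))  # stoi has grown past itos
print(unk)                   # includes 'xxunk' itself
```

Here `len(stoi) - len(itos)` is 2 while `len(unk)` is 3, the same one-entry gap as between 10301 and 10302.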

In [44]:
data.vocab.stoi['quotable']

6063

In [45]:
data.vocab.itos[6063]

'quotable'

In [46]:
data.vocab.stoi['xdfs']

0

In [47]:
data.vocab.itos[0]

'xxunk'

A term-document matrix represents each document as a "bag of words": we do not record the order of the words, only which words appear (and how often).
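As a minimal sketch of the idea, the toy example below builds a term-document matrix with the standard library only. The documents and variable names are made up for illustration.

```python
from collections import Counter

# Two tiny, already tokenized documents (illustrative data).
docs = [
    ["good", "film", "good", "cast"],
    ["bad", "film"],
]

# Fixed column order: one column per distinct term.
vocab = sorted({tok for doc in docs for tok in doc})

# One row per document; each cell is how often the term occurs in it.
counts = [Counter(doc) for doc in docs]
matrix = [[c[term] for term in vocab] for c in counts]

print(vocab)
print(matrix)

# Word order is discarded: shuffling a document leaves its row unchanged.
assert Counter(reversed(docs[0])) == counts[0]
```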

### Sparse matrices

- Sparse matrix formats
    - coordinate-wise (scipy calls COO)
    - compressed sparse row (CSR)
    - compressed sparse column (CSC)
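To make the formats concrete, here is a plain-Python sketch that stores one small matrix in COO and CSR form. It is for illustration only; in practice `scipy.sparse` provides `coo_matrix`, `csr_matrix` and `csc_matrix`. CSC has the same layout as CSR, but is compressed along columns instead of rows.

```python
dense = [
    [0, 0, 3],
    [4, 0, 0],
    [0, 5, 6],
]

# COO: one (row, col, value) triple per nonzero entry.
coo = [(r, c, v) for r, row in enumerate(dense) for c, v in enumerate(row) if v != 0]

# CSR: nonzero values and their column indices in row order,
# plus indptr, where row i occupies data[indptr[i]:indptr[i + 1]].
data = [v for _, _, v in coo]
indices = [c for _, c, _ in coo]
indptr = [0]
for row in dense:
    indptr.append(indptr[-1] + sum(1 for v in row if v != 0))

print(coo)
print(data, indices, indptr)

# Recover the nonzero values of the last row from the CSR arrays.
last_row = data[indptr[2]:indptr[3]]
```

A term-document matrix is mostly zeros, which is why these formats matter here: only the nonzero counts are stored.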

In [78]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

- n-gram

In [79]:
veczr = CountVectorizer(ngram_range=(1,3), preprocessor=noop, tokenizer=noop, max_features=800000)
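With `ngram_range=(1,3)`, every run of one, two or three consecutive tokens becomes a feature. The sketch below shows what that means for a single tokenized document. `word_ngrams` is a hypothetical helper written for this note, not part of scikit-learn, and it joins tokens with a space, as the feature names in this notebook suggest.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous n-grams of length min_n..max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams(["xxmaj", "this", "film", "rocks"])
print(grams)
```

A document of four tokens yields 4 unigrams, 3 bigrams and 2 trigrams, so the feature space grows much faster than the vocabulary, which is the reason for capping it with `max_features`.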

In [84]:
docs = data.train_dl.x
docs[:2]

TextList (2 items)
xxbos ' xxmaj major xxmaj payne ' is a film about a major who makes life a living xxmaj hell for his small group of boys in the marines . xxmaj this film does not really have a lot to offer , but it provides several hilarious moments that are well - worth a watch . xxmaj do n't expect it to be a memorable film , however . xxmaj just expect to laugh your way through the film and at the expense of other people . xxmaj the confrontation between xxmaj major xxmaj payne and the chubby boy were hilarious , and that 's really all i remember about the film except for the boys wanting revenge on xxmaj major xxmaj payne . xxmaj again , it is not a great film , and it is probably best watched on a rainy day when you need some laughter .,xxbos xxmaj prince stars as ' the xxmaj kid ' in this semi - autobiographical film of a talented , but narcissistic young musician who has a less then stellar home life . xxmaj true the acting leaves a tad to be desired ( barring xxmaj morris xx

In [86]:
docs[0].data

array([   2,   64,    5,  546, ...,  342,   62, 1570,   11])

In [87]:
docs.vocab.itos[2]

'xxbos'

In [89]:
train_words = [[docs.vocab.itos[o] for o in doc.data] for doc in data.train_dl.x]

In [90]:
valid_words = [[docs.vocab.itos[o] for o in doc.data] for doc in data.valid_dl.x]

- Count term frequencies

In [95]:
train_ngram_doc = veczr.fit_transform(train_words)

In [99]:
train_ngram_doc.toarray()

array([[0, 0, 0, 0, ..., 0, 0, 0, 0],
       [3, 0, 0, 0, ..., 0, 0, 0, 0],
       [0, 0, 0, 0, ..., 0, 0, 0, 0],
       [2, 0, 0, 0, ..., 0, 0, 0, 0],
       ...,
       [2, 0, 0, 0, ..., 0, 0, 0, 0],
       [5, 0, 0, 0, ..., 0, 0, 0, 0],
       [5, 0, 0, 0, ..., 0, 0, 0, 0],
       [4, 0, 0, 1, ..., 0, 0, 0, 0]], dtype=int64)

In [120]:
sorted(veczr.vocabulary_.items(), key=lambda x: x[1], reverse=True)[:10]

[('£ 200 million', 283503),
 ('£ 200', 283502),
 ('£ 1 in', 283501),
 ('£ 1', 283500),
 ('£', 283499),
 ('\x96 xxmaj xxunk', 283498),
 ('\x96 xxmaj trained', 283497),
 ('\x96 xxmaj setting', 283496),
 ('\x96 xxmaj robert', 283495),
 ('\x96 xxmaj rhys', 283494)]

In [121]:
val_ngram_doc = veczr.transform(valid_words)

In [122]:
val_ngram_doc.toarray()

array([[6, 0, 0, 0, ..., 0, 0, 0, 0],
       [3, 0, 0, 0, ..., 0, 0, 0, 0],
       [1, 0, 0, 0, ..., 0, 0, 0, 0],
       [4, 0, 0, 0, ..., 0, 0, 0, 0],
       ...,
       [2, 0, 0, 0, ..., 0, 0, 0, 0],
       [0, 0, 0, 0, ..., 0, 0, 0, 0],
       [2, 0, 0, 0, ..., 0, 0, 0, 0],
       [0, 0, 0, 0, ..., 0, 0, 0, 0]])

- Classify using term frequencies

In [123]:
vocab = veczr.get_feature_names()

In [128]:
vocab[200000:200005]

['so xxunk at',
 'so xxunk by',
 'so xxunk cheesy',
 'so xxunk claimed',
 'so xxunk disappointed']

In [130]:
y = data.train_ds.y

In [136]:
y
y.items

CategoryList (799 items)
negative,positive,positive,negative,positive
Path: /home/lab/.fastai/data/imdb_sample

array([0, 1, 1, 0, ..., 0, 0, 1, 0])

In [140]:
?np.sign

Call signature:  np.sign(*args, **kwargs)
Type:            ufunc
String form:     <ufunc 'sign'>
File:            ~/Softwares/miniconda3/envs/fastai/lib/python3.6/site-packages/numpy/__init__.py
Docstring:
sign(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])

Returns an element-wise indication of the sign of a number.

The `sign` function returns ``-1 if x < 0, 0 if x==0, 1 if x > 0``.  nan
is returned for nan inputs.

For complex inputs, the `sign` function returns
``sign(x.real) + 0j if x.real != 0 else sign(x.imag) + 0j``.

complex(nan, 0) is returned for complex nan inputs.

Parameters
----------
x : array_like
    Input values.
out : ndarray, None, or tuple of ndarray and None, optional
    A location into which the result is stored. If pro

In [150]:
train_ngram_doc.shape

(799, 283504)

- Whether a term appears matters more than how often it appears

In [138]:
train_ngram_doc.sign().toarray()  # binarize: 1 if the n-gram appears in the document, else 0

array([[0, 0, 0, 0, ..., 0, 0, 0, 0],
       [1, 0, 0, 0, ..., 0, 0, 0, 0],
       [0, 0, 0, 0, ..., 0, 0, 0, 0],
       [1, 0, 0, 0, ..., 0, 0, 0, 0],
       ...,
       [1, 0, 0, 0, ..., 0, 0, 0, 0],
       [1, 0, 0, 0, ..., 0, 0, 0, 0],
       [1, 0, 0, 0, ..., 0, 0, 0, 0],
       [1, 0, 0, 1, ..., 0, 0, 0, 0]], dtype=int64)

In [144]:
set.union(*[set(c) for c in train_ngram_doc.sign().toarray()])

{0, 1}
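Because counts are never negative, taking the sign is the same as asking "does this n-gram appear at all?". A plain-Python sketch on a few made-up count rows:

```python
# Illustrative count rows; the values are not taken from the real matrix.
counts = [
    [0, 0, 0, 0],
    [3, 0, 0, 0],
    [4, 0, 0, 1],
]

# Same effect as sign() on non-negative counts: any positive count becomes 1.
binary = [[1 if c > 0 else 0 for c in row] for row in counts]

# Only 0 and 1 remain after binarization.
values = set().union(*[set(row) for row in binary])
print(binary)
print(values)
```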

- Train

In [145]:
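# dual=True is only supported by the liblinear solver, so name it explicitly
model = LogisticRegression(C=0.1, dual=True, solver='liblinear')
# Fit on the sign (binarized presence) of the term counts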
model.fit(train_ngram_doc.sign(), y.items);



- Predict

In [147]:
preds = model.predict(val_ngram_doc.sign())
preds

array([0, 0, 0, 1, ..., 1, 0, 0, 1])

In [149]:
(preds.T == data.valid_dl.y.items).mean()

0.7910447761194029

- Without binarization

In [151]:
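# dual=True is only supported by the liblinear solver, so name it explicitly
model = LogisticRegression(C=0.1, dual=True, solver='liblinear')
# Fit on the raw term counts (no binarization)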
model.fit(train_ngram_doc, y.items);



In [153]:
preds = model.predict(val_ngram_doc)
preds

array([0, 0, 0, 0, ..., 1, 0, 0, 0])

In [154]:
(preds.T == data.valid_dl.y.items).mean()

0.7562189054726368

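- On this sample, raw term counts score lower on the validation set (about 0.756) than binarized presence (about 0.791), which suggests that how often a term appears adds little beyond whether it appears at all.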
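To put the two validation accuracies reported above in perspective, the short calculation below converts them back into numbers of correctly classified reviews, using the validation set size of 201 shown earlier. The variable names are chosen for this note.

```python
n_valid = 201                     # size of the validation set
acc_binary = 0.7910447761194029   # accuracy with binarized features
acc_counts = 0.7562189054726368   # accuracy with raw term counts

# Accuracy is the mean of a 0/1 "prediction matches label" array,
# so accuracy * n recovers the number of correct predictions.
correct_binary = round(acc_binary * n_valid)
correct_counts = round(acc_counts * n_valid)

print(correct_binary, correct_counts, correct_binary - correct_counts)
```

The gap between the two models is 7 reviews out of 201, so the comparison is suggestive rather than conclusive on a sample this small.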