<font color=red>***Read the data and save it as a CSV file***</font>

In [4]:
import pyprind
import pandas as pd 
import os 

'''
PyPrind (Python Progress Indicator) provides a progress bar and a percentage
indicator for tracking the progress of loops and other iterative computations.
A typical use is processing large datasets, where it gives a visual estimate
of how far the computation has progressed.
'''
pbar = pyprind.ProgBar(50000, title='Reading data ')
labels = {'pos': 1, 'neg': 0}
rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = 'D:/Program Files/Sublime Text 3/MyProject/SentimentAnalysis/aclImdb_v1/aclImdb/%s/%s' % (s, l)
        for file in os.listdir(path):
            with open(file=os.path.join(path, file), mode='r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])
print(pbar)

Reading data 
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:03:37


Title: Reading data 
  Started: 07/10/2019 17:20:49
  Finished: 07/10/2019 17:24:26
  Total time elapsed: 00:03:37


In [5]:
import pandas as pd
import numpy as np 
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('./aclImdb_v1/movie_data.csv', index=False)

In [18]:
import pandas as pd 

df = pd.read_csv('./aclImdb_v1/movie_data.csv')
print(df.shape)
print(df.iloc[49999, 0]) # first column of the last row
df.tail()

(50000, 2)
I waited long to watch this movie. Also because I like Bruce Willis. The plot was quite different from what I had expected but still quite good. Its a good mix of emotions, humor and drama.<br /><br />Left me thinking over and again :)


Unnamed: 0,review,sentiment
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0
49999,I waited long to watch this movie. Also becaus...,1


<font color=green>***Building the bag-of-words model***</font>

In [4]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(ngram_range=(1, 1)) # ngram_range sets which n-gram sizes the bag-of-words model counts
docs = np.array(['The sun is shining', 'The weather is sweet',
                 'The sun is shining and the weather is sweet'])
bag = count.fit_transform(docs)

In [5]:
count.vocabulary_ # a dict: keys are words, values are their column indices in the feature vectors

{'the': 5, 'sun': 3, 'is': 1, 'shining': 2, 'weather': 6, 'sweet': 4, 'and': 0}

In [6]:
print(bag.toarray()) # each entry is the number of times the word occurs in that document

[[0 1 1 1 0 1 0]
 [0 1 0 0 1 1 1]
 [1 2 1 1 1 2 1]]
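
The vocabulary and count matrix above can be reproduced by hand. The following is a minimal sketch in plain Python, assuming a simplified tokenizer (lowercase, then split on whitespace) instead of CountVectorizer's own token pattern; the helper name `to_count_vector` is introduced here for illustration only.

```python
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet']

# Lowercase and split on whitespace (a simplification of CountVectorizer's tokenizer).
tokenized = [doc.lower().split() for doc in docs]

# Map each unique word to a column index, in alphabetical order.
vocabulary = {word: idx for idx, word in
              enumerate(sorted({w for doc in tokenized for w in doc}))}

def to_count_vector(tokens):
    """Return the raw term-frequency vector of one tokenized document."""
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocabulary]

bag = [to_count_vector(tokens) for tokens in tokenized]
for row in bag:
    print(row)
```

For these three sentences the sketch yields the same word-to-index mapping and the same counts as the CountVectorizer output; on real text the results can differ, because CountVectorizer also strips punctuation and ignores single-character tokens by default.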


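Words that occur frequently across many documents usually carry little discriminative information, so term frequency-inverse document frequency (tf-idf) can be used to downweight them in the feature vectors. scikit-learn implements tf-idf with the following formula: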

$$
\operatorname{tf-idf}(\mathrm{t}, \mathrm{d})=\mathrm{tf}(\mathrm{t}, \mathrm{d}) \times(\mathrm{idf}(\mathrm{t}, \mathrm{d})+1)
$$

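where $\operatorname{idf}(t, d)$ is computed as follows, with $n_{d}$ the total number of documents and $\mathrm{df}(\mathrm{d}, \mathrm{t})$ the number of documents that contain the term $t$: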

$$
\operatorname{idf}(\mathrm{t}, \mathrm{d})=\log \frac{1+n_{d}}{1+\mathrm{df}(\mathrm{d}, \mathrm{t})}
$$

Before the final feature vectors are obtained, each one is also normalized, by default to unit L2 norm:

$$
v_{\text {norm}}=\frac{v}{\|v\|_{2}}=\frac{v}{\sqrt{v_{1}^{2}+v_{2}^{2}+\cdots+v_{n}^{2}}}=\frac{v}{\left(\sum_{i=1}^{n} v_{i}^{2}\right)^{1 / 2}}
$$
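
The three formulas can be checked by hand on the count matrix from the bag-of-words example. This is a minimal sketch in plain Python, assuming the natural logarithm and scikit-learn's default smoothing; the helper name `tfidf_row` is hypothetical.

```python
import math

# Raw term counts from the bag-of-words example
# (columns: and, is, shining, sun, sweet, the, weather).
counts = [[0, 1, 1, 1, 0, 1, 0],
          [0, 1, 0, 0, 1, 1, 1],
          [1, 2, 1, 1, 1, 2, 1]]

n_docs = len(counts)
n_terms = len(counts[0])

# df(d, t): number of documents that contain term t.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# idf(t, d) = log((1 + n_d) / (1 + df(d, t))), using the natural logarithm.
idf = [math.log((1 + n_docs) / (1 + df)) for df in doc_freq]

def tfidf_row(row):
    """Compute tf * (idf + 1) for one document, then L2-normalize the result."""
    raw = [tf * (w + 1.0) for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print([round(v, 2) for v in row])
```

Rounded to two decimals, the rows agree with the matrix that TfidfTransformer prints for the same three documents. Note that the word "is" has the highest raw count in the third document but ends up with a moderate weight, because it appears in every document.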


tf-idf is applied in code as follows:

In [7]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
np.set_printoptions(precision=2) # set the print precision
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.56 0.56 0.   0.43 0.  ]
 [0.   0.43 0.   0.   0.56 0.43 0.56]
 [0.4  0.48 0.31 0.31 0.31 0.48 0.31]]


<font color=blue>***Cleaning the data***</font>

In [8]:
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [9]:
import re

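r'''
The first regular expression, <[^>]*>, removes all HTML markup from the movie reviews.
After that, the regular expression (?::|;|=)(?:-)?(?:\)|\(|D|P) finds emoticons, which are
stored temporarily in emoticons. Next, the regular expression [\W]+ removes all non-word
characters and the text is converted to lowercase; finally, the stored emoticons are appended
to the end of the processed string. For consistency, the nose character (-) is also removed
from the emoticons.
'''
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text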

In [10]:
preprocessor(df.loc[0, 'review'][-50:])

'is seven title brazil not available'

In [11]:
preprocessor('</a>This :) is :( a test :-)!')

'this is a test :) :( :)'

In [12]:
df['review'] = df['review'].apply(preprocessor)

<font color=orange>***Tokenizing documents***</font>

The code in this section addresses how to split text into individual elements (tokens).

In [13]:
# Method 1: split the document into individual words on whitespace
def tokenizer_by_space(text):
    return text.split()

# Method 2: stemming, which reduces each word to its stem
from nltk.stem.porter import PorterStemmer

def tokenizer_porter_stemmer(text):
    porter = PorterStemmer()
    return [porter.stem(word) for word in text.split()]

In [14]:
tokenizer_by_space('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [15]:
tokenizer_porter_stemmer('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

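The Porter stemmer is one of several stemming algorithms. As the output above shows, stemming can produce words that do not exist, such as thu from thus. Lemmatization is an alternative that aims to recover the canonical form (lemma) of each word, but it is computationally more expensive. In practice, stemming and lemmatization have both been observed to have little effect on text classification performance.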

<font color=gray>***Stop-word removal***</font>

In [16]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
stop = stopwords.words('english')
text = tokenizer_porter_stemmer('a runner likes running and runs a lot') # stemming
[w for w in text if w not in stop] # remove stop words

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hsz\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['runner', 'like', 'run', 'run', 'lot']

<font color=purple>***Training a logistic regression model***</font>

In [1]:
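# Split the cleaned DataFrame into a training set and a test set.
# iloc slices by position and excludes the end point, so the two halves do not overlap
# (label-based df.loc[:25000] would put row 25000 in both sets).
X_train = df.iloc[:25000]['review'].values
y_train = df.iloc[:25000]['sentiment'].values
X_test = df.iloc[25000:]['review'].values
y_test = df.iloc[25000:]['sentiment'].values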


In [18]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf= TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)

param_grid = [{'vect__ngram_range':[(1, 1)], 'vect__stop_words': [stop, None], 
              'vect__tokenizer':[tokenizer_by_space, tokenizer_porter_stemmer],
             'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0]},
             {'vect__ngram_range': [(1, 1)], 'vect__stop_words': [stop, None],
              'vect__tokenizer':[tokenizer_by_space, tokenizer_porter_stemmer],
              'vect__use_idf': [False], 'vect__norm': [None], 
              'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0]}]

# solver='liblinear' supports both the 'l1' and 'l2' penalties searched below
lr_tfidf = Pipeline([('vect', tfidf), ('clf', LogisticRegression(random_state=0, solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 35.4min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 171.7min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 220.3min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...e, penalty='l2', random_state=0, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid=[{'vect__ngram_range': [(1, 1)], 'vect__stop_words': [['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's...se_idf': [False], 'vect__norm': [None], 'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0]}],
       pre_dispatch='2*n_jobs', refit=T

In [20]:
print('Best parameter set: %s' % gs_lr_tfidf.best_params_)

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer_by_space at 0x000002A2ABF04488>}


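Note: the best parameter set returned by the grid search uses the plain whitespace tokenizer without stop-word removal, combined with tf-idf features and a logistic regression classifier with L2 regularization and C = 10.0 (C is the inverse of the regularization strength).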
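
The figure of 48 candidates and 240 fits reported by the grid search follows directly from the two parameter dictionaries. The sketch below recounts them in plain Python; the string placeholders stand in for the real stop-word list and tokenizer functions (only the number of options per parameter matters), and `count_candidates` is a hypothetical helper.

```python
from itertools import product

# Placeholder strings replace the real stop-word list and tokenizer functions;
# only the number of options per parameter affects the count.
param_grid = [
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer_by_space', 'tokenizer_porter_stemmer'],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer_by_space', 'tokenizer_porter_stemmer'],
     'vect__use_idf': [False],
     'vect__norm': [None],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
]

def count_candidates(grids):
    """Count every parameter combination across a list of grid dictionaries."""
    return sum(len(list(product(*grid.values()))) for grid in grids)

n_candidates = count_candidates(param_grid)
n_folds = 5
print(n_candidates, n_candidates * n_folds)
```

Each dictionary contributes 1 x 2 x 2 x 2 x 3 = 24 combinations, and every combination is fitted once per fold, which explains the long running time of the search.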

In [22]:
print('CV Accuracy: %0.3f' % gs_lr_tfidf.best_score_)

CV Accuracy: 0.897


In [25]:
clf = gs_lr_tfidf.best_estimator_ # the model with the best cross-validated performance
print('Test Accuracy: %0.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.899


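Summary: the logistic regression model classifies movie reviews as positive or negative with a test accuracy of 0.899, which is not an outstanding result. A classifier that remains very popular for text classification, especially spam filtering, is naive Bayes. Naive Bayes classifiers are easy to implement, computationally efficient, and tend to perform particularly well on relatively small datasets compared with other algorithms.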

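<font color=red>***Out-of-core learning for very large datasets***</font>

In this section we stream documents directly from local disk and use the partial_fit function of scikit-learn's SGDClassifier to train a logistic regression model on small minibatches of documents.

#### 1. Data preprocessing: cleaning the data and removing stop words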

In [62]:
import numpy as np
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')
def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

#### 2. Define a generator to read the data in batches (avoids loading all of it into memory)

In [65]:
# Generator function stream_docs: reads and yields one document at a time
def stream_docs(path):
    with open(file=path, mode='r', encoding='utf-8') as csv:
        next(csv) # skip header
        for line in csv: # each line is a string
            text, label = line[:-3],  int(line[-2]) # each row ends with a newline, so the class label is at index -2
            yield text, label

In [66]:
next(stream_docs(path='./aclImdb_v1/movie_data.csv')) # next() returns the first item produced by the generator

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

In [102]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
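
To see how the generator and the minibatch helper interact without touching the real dataset, the following sketch runs the same logic on an in-memory file. It assumes a hypothetical three-row stand-in for movie_data.csv, and `stream_docs` is adapted here to accept any iterable of lines instead of a path.

```python
import io

def stream_docs(lines):
    """Yield (text, label) pairs from an iterable of CSV lines."""
    lines = iter(lines)
    next(lines)  # skip the header row
    for line in lines:
        # Each row ends with ",<label>\n", so the label is the second-to-last character.
        yield line[:-3], int(line[-2])

def get_minibatch(doc_stream, size):
    """Collect `size` documents; return (None, None) if the stream runs out first."""
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# Hypothetical three-row stand-in for movie_data.csv.
fake_csv = io.StringIO('review,sentiment\n'
                       'great movie,1\n'
                       'boring plot,0\n'
                       'loved it,1\n')

stream = stream_docs(fake_csv)
first_batch = get_minibatch(stream, size=2)
second_batch = get_minibatch(stream, size=2)
print(first_batch)
print(second_batch)
```

The second call returns (None, None) even though one document was still available: a final batch that cannot be filled completely is discarded. With 50,000 documents and batch sizes that divide that total evenly, nothing is lost, but other batch sizes would silently drop the tail of the file.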

#### 3. Define the bag-of-words feature extractor and the classifier

In [108]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error='ignore', n_features=2**21, preprocessor=None, tokenizer=tokenizer)
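# loss='log_loss' gives logistic regression (the option was named 'log' before scikit-learn 1.1).
# max_iter is the maximum number of passes over the data; tol stops training early
# once the improvement in the loss falls below it.
clf = SGDClassifier(loss='log_loss', random_state=1, max_iter=1, tol=1e-3)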
doc_stream = stream_docs(path='./aclImdb_v1/movie_data.csv')
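
HashingVectorizer suits out-of-core learning because it needs no fitted vocabulary: each token is hashed straight to a column index, so batches can be transformed independently. The sketch below illustrates that idea only. It uses md5 from the standard library as a stand-in hash (scikit-learn uses MurmurHash3), omits the sign alternation and normalization that HashingVectorizer applies by default, and `hash_vectorize` is a hypothetical helper.

```python
import hashlib

def hash_vectorize(tokens, n_features=2**4):
    """Map tokens to a fixed-length count vector without storing a vocabulary."""
    vector = [0] * n_features
    for token in tokens:
        # md5 is only a stable stand-in hash; scikit-learn uses MurmurHash3.
        digest = hashlib.md5(token.encode('utf-8')).hexdigest()
        index = int(digest, 16) % n_features
        vector[index] += 1
    return vector

v1 = hash_vectorize('the sun is shining'.split())
v2 = hash_vectorize('the weather is sweet'.split())
print(len(v1), sum(v1))
print(len(v2), sum(v2))
```

With only 16 columns, different words are likely to collide in the same index; choosing a large feature space such as 2**21 keeps such collisions rare.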

In [109]:
import pyprind

pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])

for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()   
    
print(pbar)

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:28


Title: 
  Started: 07/10/2019 21:37:17
  Finished: 07/10/2019 21:37:45
  Total time elapsed: 00:00:28


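partial_fit() is intended for online learning. Its main difference from fit() is that, when new data arrives, the existing model can be trained further on that data to improve it, rather than being retrained from scratch.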
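
The idea behind incremental training can be illustrated with a toy model whose weights persist between calls. This is a minimal sketch of logistic regression trained by stochastic gradient descent, not scikit-learn's implementation; the class `OnlineLogisticRegression`, its learning rate, and the two small batches are all assumptions made for illustration.

```python
import math

class OnlineLogisticRegression:
    """Toy logistic regression trained by SGD, one minibatch at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def _predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        """Update the existing weights with one pass over a new minibatch."""
        for x, target in zip(X, y):
            error = target - self._predict_proba(x)
            self.weights = [w + self.learning_rate * error * xi
                            for w, xi in zip(self.weights, x)]
            self.bias += self.learning_rate * error
        return self

    def predict(self, x):
        return 1 if self._predict_proba(x) >= 0.5 else 0

model = OnlineLogisticRegression(n_features=2)
batches = [([[2.0, 0.0], [0.0, 2.0]], [1, 0]),
           ([[1.5, 0.2], [0.1, 1.8]], [1, 0])]
for _ in range(20):
    for X_batch, y_batch in batches:
        model.partial_fit(X_batch, y_batch)

print(model.predict([2.0, 0.1]), model.predict([0.1, 2.0]))
```

Each call to partial_fit starts from the weights left by the previous call, so only one minibatch has to be held in memory at any time.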

In [105]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.867


In [106]:
clf = clf.partial_fit(X_test, y_test) # train on the new data to update the model

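Summary: these notes covered how to use machine learning to classify text documents by sentiment. We learned how to encode documents with the bag-of-words model and how to reweight term frequencies with tf-idf. Although bag-of-words remains the most popular model for text classification, it ignores sentence structure and grammar. Because the resulting feature vectors are very large, processing text data for sentiment analysis can be computationally expensive. In the last section we covered out-of-core and incremental learning, which train a machine learning model without loading the entire dataset into memory at once.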