<br>
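Problem type: <strong>binary classification</strong><br>
Task: <strong>predict the sentiment of each movie review in the test set from its text: 1 means a positive review, 0 a negative one.</strong><br>
Evaluation metric: <strong>AUC</strong><br>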
<br>

# Loading the data

In [1]:
import pandas as pd


# Each record is tab-separated; quoting=3 (csv.QUOTE_NONE) makes pandas treat double quotes as ordinary characters
data_train = pd.read_csv("labeledTrainData.tsv", delimiter="\t", quoting=3)
data_test = pd.read_csv("testData.tsv", delimiter="\t", quoting=3)
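
The value `quoting=3` is the standard library constant `csv.QUOTE_NONE`, which pandas accepts for this parameter. The sketch below uses the `csv` module on a made-up two-line sample (not taken from the dataset) to show the effect: with `QUOTE_NONE` the double quotes stay inside the field values, which is why the ids and reviews later in this notebook still carry their surrounding quotes. The helper name `parse` is only for illustration.

```python
import csv
import io

# Made-up TSV sample in the same column layout as the dataset (illustration only).
sample = 'id\tsentiment\treview\n"1_8"\t1\t"A "great" film."\n'


def parse(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting))


rows_raw = parse(sample, csv.QUOTE_NONE)         # quotes are ordinary characters
rows_default = parse(sample, csv.QUOTE_MINIMAL)  # quotes delimit fields

print(csv.QUOTE_NONE)      # the integer passed as quoting=3
print(rows_raw[1][0])      # id keeps its double quotes
print(rows_default[1][0])  # id has its double quotes stripped
```

Reviews often contain unbalanced double quotes, so disabling quote handling is a reasonable way to keep each line as exactly one record.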

In [2]:
print("Training set shape:", data_train.shape)
print("Test set shape:", data_test.shape)

Training set shape: (25000, 3)
Test set shape: (25000, 2)


In [3]:
data_train.head(3)

| | id | sentiment | review |
|---|---|---|---|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` |


<br>
Look at the full text of one review from the training set

In [4]:
data_train["review"][8]

'"A friend of mine bought this film for £1, and even then it was grossly overpriced. Despite featuring big names such as Adam Sandler, Billy Bob Thornton and the incredibly talented Burt Young, this film was about as funny as taking a chisel and hammering it straight through your earhole. It uses tired, bottom of the barrel comedic techniques - consistently breaking the fourth wall as Sandler talks to the audience, and seemingly pointless montages of \'hot girls\'.<br /><br />Adam Sandler plays a waiter on a cruise ship who wants to make it as a successful comedian in order to become successful with women. When the ship\'s resident comedian - the shamelessly named \'Dickie\' due to his unfathomable success with the opposite gender - is presumed lost at sea, Sandler\'s character Shecker gets his big break. Dickie is not dead, he\'s rather locked in the bathroom, presumably sea sick.<br /><br />Perhaps from his mouth he just vomited the worst film of all time."'

# Data cleaning and text processing

## Walking through a single review

In [5]:
from bs4 import BeautifulSoup

<strong>1.&emsp;Parse one review into a BeautifulSoup object</strong>


In [6]:
soup = BeautifulSoup(data_train["review"][8], 'lxml')

<strong>2.&emsp;Use get_text() to extract the text of the review; this drops all markup tags from the document</strong>

In [7]:
text = soup.get_text()
text

'"A friend of mine bought this film for £1, and even then it was grossly overpriced. Despite featuring big names such as Adam Sandler, Billy Bob Thornton and the incredibly talented Burt Young, this film was about as funny as taking a chisel and hammering it straight through your earhole. It uses tired, bottom of the barrel comedic techniques - consistently breaking the fourth wall as Sandler talks to the audience, and seemingly pointless montages of \'hot girls\'.Adam Sandler plays a waiter on a cruise ship who wants to make it as a successful comedian in order to become successful with women. When the ship\'s resident comedian - the shamelessly named \'Dickie\' due to his unfathomable success with the opposite gender - is presumed lost at sea, Sandler\'s character Shecker gets his big break. Dickie is not dead, he\'s rather locked in the bathroom, presumably sea sick.Perhaps from his mouth he just vomited the worst film of all time."'
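
Note that removing `<br /><br />` glues the neighbouring sentences together (`girls'.Adam`, `sick.Perhaps`). Here the next step turns the period into a space, so no words are lost, but a tag sitting directly between two words would merge them. BeautifulSoup's `get_text()` accepts a separator argument for this case. The sketch below reproduces the effect with the standard library's `html.parser` instead of BeautifulSoup; `TagStripper` and `strip_tags` are hypothetical helper names, not part of the notebook.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text chunks of an HTML fragment, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html, separator=" "):
    """Return the text of `html`, joining text chunks with `separator`."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


sample = "montages of 'hot girls'.<br /><br />Adam Sandler plays a waiter"
joined = strip_tags(sample, separator="")   # chunks glued together
spaced = strip_tags(sample, separator=" ")  # chunks kept apart

print(joined)
print(spaced)
```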

<strong>3.&emsp;Use a regular expression to replace every character that is not a letter with a space</strong>

In [8]:
import re

text = re.sub("[^a-zA-Z]", " ", text)
text

' A friend of mine bought this film for     and even then it was grossly overpriced  Despite featuring big names such as Adam Sandler  Billy Bob Thornton and the incredibly talented Burt Young  this film was about as funny as taking a chisel and hammering it straight through your earhole  It uses tired  bottom of the barrel comedic techniques   consistently breaking the fourth wall as Sandler talks to the audience  and seemingly pointless montages of  hot girls  Adam Sandler plays a waiter on a cruise ship who wants to make it as a successful comedian in order to become successful with women  When the ship s resident comedian   the shamelessly named  Dickie  due to his unfathomable success with the opposite gender   is presumed lost at sea  Sandler s character Shecker gets his big break  Dickie is not dead  he s rather locked in the bathroom  presumably sea sick Perhaps from his mouth he just vomited the worst film of all time  '

<strong>4.&emsp;Convert all characters in the text to lowercase,</strong><br>
<strong>&emsp;&emsp;then use split() to break it into a list of words, which also removes all whitespace</strong>

In [9]:
words_list = text.lower().split()
print(words_list)

['a', 'friend', 'of', 'mine', 'bought', 'this', 'film', 'for', 'and', 'even', 'then', 'it', 'was', 'grossly', 'overpriced', 'despite', 'featuring', 'big', 'names', 'such', 'as', 'adam', 'sandler', 'billy', 'bob', 'thornton', 'and', 'the', 'incredibly', 'talented', 'burt', 'young', 'this', 'film', 'was', 'about', 'as', 'funny', 'as', 'taking', 'a', 'chisel', 'and', 'hammering', 'it', 'straight', 'through', 'your', 'earhole', 'it', 'uses', 'tired', 'bottom', 'of', 'the', 'barrel', 'comedic', 'techniques', 'consistently', 'breaking', 'the', 'fourth', 'wall', 'as', 'sandler', 'talks', 'to', 'the', 'audience', 'and', 'seemingly', 'pointless', 'montages', 'of', 'hot', 'girls', 'adam', 'sandler', 'plays', 'a', 'waiter', 'on', 'a', 'cruise', 'ship', 'who', 'wants', 'to', 'make', 'it', 'as', 'a', 'successful', 'comedian', 'in', 'order', 'to', 'become', 'successful', 'with', 'women', 'when', 'the', 'ship', 's', 'resident', 'comedian', 'the', 'shamelessly', 'named', 'dickie', 'due', 'to', 'his', 
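
Steps 3 and 4 can be checked on a short fragment. This is a minimal sketch on a shortened piece of the review above; it shows that digits, punctuation and non-ASCII symbols such as `£` all become spaces, and that contractions are split into separate tokens (`he's` becomes `he` and `s`).

```python
import re

# Shortened fragment of the review shown above.
snippet = "this film for £1, and he's rather locked in"

# Step 3: everything that is not an ASCII letter becomes a space.
letters_only = re.sub("[^a-zA-Z]", " ", snippet)

# Step 4: lowercase, then split on runs of whitespace.
tokens = letters_only.lower().split()

print(tokens)
```

The leftover single-letter tokens such as `s` are in the NLTK stop-word list printed below, so the stop-word step removes them.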

## Handling stop words

<strong>Load the English stop-word list from NLTK's nltk.corpus module</strong><br>

In [10]:
import nltk
from nltk.corpus import stopwords


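# Opens the interactive NLTK downloader; nltk.download("stopwords") fetches only the stop-word corpus
nltk.download()
stopwords_list = set(stopwords.words("english"))           # set of English stop words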
print(stopwords_list)

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
{'herself', 'are', 'd', 'how', 'from', 'they', 'by', 'both', 'who', 'before', 'she', 'those', 'or', 'ain', 'haven', 'will', 'a', 'against', 'why', 'won', 'where', 'this', 'our', 'doing', 'that', 'too', 'very', 'at', 'while', 'doesn', 'these', 'me', 'been', 'until', 'about', 'shouldn', 'for', 'itself', 'hadn', 'same', 'yourself', 'ourselves', 'above', 'few', 'an', 'their', 'her', 'if', 'mustn', 'had', 'on', 'be', 'being', 'down', 'wasn', 'hers', 'myself', 'were', 'nor', 'you', 'here', 'now', 've', 't', 'when', 'having', 'then', 'did', 'own', 'weren', 'can', 'ours', 'him', 'is', 'don', 'further', 'each', 'because', 'shan', 'so', 'needn', 'his', 'have', 'i', 'we', 'and', 'my', 'hasn', 'wouldn', 'more', 'am', 'such', 'o', 'of', 'out', 'not', 'into', 'some', 'he', 'the', 'up', 'there', 'any', 'just', 'them', 'has', 'what', 'should', 'was', 'themselves', 'most', 'only', 'whom', 'which', 'mightn', 'all', 'through

<strong>5.&emsp;Remove the stop words from the review's word list</strong>

In [11]:
words_list_rm_stopwords = [w for w in words_list if w not in stopwords_list]
print(words_list_rm_stopwords)

['friend', 'mine', 'bought', 'film', 'even', 'grossly', 'overpriced', 'despite', 'featuring', 'big', 'names', 'adam', 'sandler', 'billy', 'bob', 'thornton', 'incredibly', 'talented', 'burt', 'young', 'film', 'funny', 'taking', 'chisel', 'hammering', 'straight', 'earhole', 'uses', 'tired', 'bottom', 'barrel', 'comedic', 'techniques', 'consistently', 'breaking', 'fourth', 'wall', 'sandler', 'talks', 'audience', 'seemingly', 'pointless', 'montages', 'hot', 'girls', 'adam', 'sandler', 'plays', 'waiter', 'cruise', 'ship', 'wants', 'make', 'successful', 'comedian', 'order', 'become', 'successful', 'women', 'ship', 'resident', 'comedian', 'shamelessly', 'named', 'dickie', 'due', 'unfathomable', 'success', 'opposite', 'gender', 'presumed', 'lost', 'sea', 'sandler', 'character', 'shecker', 'gets', 'big', 'break', 'dickie', 'dead', 'rather', 'locked', 'bathroom', 'presumably', 'sea', 'sick', 'perhaps', 'mouth', 'vomited', 'worst', 'film', 'time']
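
One side effect worth knowing for sentiment analysis: the stop-word list printed above contains negations such as `not` and `nor`, so "not funny" and "funny" end up as the same token list. The sketch below uses a small hand-picked subset of that list and a hypothetical `keep` argument to show one possible way to retain negations; it is an illustration, not what this notebook does.

```python
# Hand-picked subset of the NLTK English stop words printed above (illustration only).
stopwords_subset = {"this", "was", "not", "a", "the", "is"}

# Words we might want to keep even though they are stop words (an assumption).
negations = {"not", "nor"}


def remove_stopwords(tokens, stopwords, keep=frozenset()):
    """Drop stop words from `tokens`, except those listed in `keep`."""
    return [t for t in tokens if t not in stopwords or t in keep]


tokens = "this film was not funny".split()
plain = remove_stopwords(tokens, stopwords_subset)
with_negation = remove_stopwords(tokens, stopwords_subset, keep=negations)

print(plain)
print(with_negation)
```

Because the vectorizer used later includes bigrams and trigrams, keeping negations would let features such as `not funny` appear in the vocabulary.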


<strong>6.&emsp;Finally, join the cleaned word list into a single space-separated string</strong>

In [12]:
document = " ".join(words_list_rm_stopwords)
document

'friend mine bought film even grossly overpriced despite featuring big names adam sandler billy bob thornton incredibly talented burt young film funny taking chisel hammering straight earhole uses tired bottom barrel comedic techniques consistently breaking fourth wall sandler talks audience seemingly pointless montages hot girls adam sandler plays waiter cruise ship wants make successful comedian order become successful women ship resident comedian shamelessly named dickie due unfathomable success opposite gender presumed lost sea sandler character shecker gets big break dickie dead rather locked bathroom presumably sea sick perhaps mouth vomited worst film time'

## Wrapping the steps in a text-processing function

In [13]:
def clean_review(raw_review, remove_stopwords=False):
    """
    Input: one raw review.
    Output: the cleaned review as a single space-separated string.
    """
    # Strip the HTML markup and keep only the text
    soup = BeautifulSoup(raw_review, "lxml")
    text = soup.get_text()
    # Replace every non-letter character with a space
    text = re.sub("[^a-zA-Z]", " ", text)
    # Lowercase and split on whitespace
    words_list = text.lower().split()
    if remove_stopwords:
        # Drop English stop words (stopwords_list is defined above)
        words_list = [w for w in words_list if w not in stopwords_list]
    return " ".join(words_list)

## Cleaning and storing every review

In [14]:
# Number of reviews
num_train_reviews = data_train["review"].size

# Empty list that will hold the cleaned text of each review
clean_train_reviews = []

for i in range(0, num_train_reviews):
    review = data_train["review"][i]
    clean_train_reviews.append(clean_review(review, remove_stopwords=True))
    

# Building features from a bag-of-words model (with scikit-learn)

3.1 <strong>Counting term frequencies</strong>

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

Initialize a CountVectorizer object, scikit-learn's bag-of-words tool

In [16]:
vectorizer = CountVectorizer(analyzer="word",
                             ngram_range=(1, 3),    # unigrams, bigrams and trigrams
                             max_features=5000,     # keep the 5000 most frequent terms
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None)

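<strong>Calling fit_transform() on the CountVectorizer object vectorizer does two things:</strong><br>
&emsp;First: &emsp;<strong>it fits the model and learns the vocabulary</strong><br>
&emsp;Second: &emsp;<strong>it transforms the training data into feature vectors</strong><br>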

In [17]:
data_train_X = vectorizer.fit_transform(clean_train_reviews)
data_train_X.shape

(25000, 5000)
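
To make `ngram_range=(1, 3)` and `max_features` concrete, here is a pure-Python sketch of the idea: every run of one, two or three consecutive words is a candidate term, and only the most frequent terms across the corpus are kept. The function `word_ngrams` and the two toy documents are made up for illustration; this is not scikit-learn's implementation.

```python
from collections import Counter


def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all word n-grams of length min_n..max_n as space-joined strings."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


# Two toy "cleaned reviews" (illustration only).
docs = ["worst film time", "funny film"]

# Corpus-wide term counts over all n-grams.
counts = Counter(g for d in docs for g in word_ngrams(d.split()))

# Analogue of max_features=1: keep only the single most frequent term.
top_terms = [term for term, _ in counts.most_common(1)]

print(word_ngrams("worst film time".split()))
print(top_terms)
```

With 25000 reviews the number of distinct n-grams is very large, which is why the vocabulary is capped at 5000 columns here.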

3.2 <strong>Feature extraction with TF-IDF</strong>

In [18]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer(sublinear_tf=True)
data_train_X_tfidf = tfidf_transformer.fit_transform(data_train_X)
data_train_X_tfidf.shape

(25000, 5000)
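
With `sublinear_tf=True` and the other `TfidfTransformer` defaults (`smooth_idf=True`, `norm="l2"`), the scikit-learn documentation describes the weighting as follows: each raw count tf is replaced by 1 + ln(tf), multiplied by idf = ln((1 + n) / (1 + df)) + 1, and each row is then scaled to unit Euclidean length. The sketch below works through that formula for one document in plain Python; `tfidf_row` and the toy numbers are hypothetical, and this is an illustration of the formula rather than the library's code.

```python
import math


def tfidf_row(term_counts, doc_freq, n_docs):
    """TF-IDF weights for one document: sublinear tf, smoothed idf, L2 norm."""
    weights = {}
    for term, tf in term_counts.items():
        if tf == 0:
            continue
        sublinear_tf = 1.0 + math.log(tf)
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        weights[term] = sublinear_tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


# Toy corpus statistics: 2 documents, "film" occurs in both, "worst" in one.
row = tfidf_row({"film": 3, "worst": 1}, {"film": 2, "worst": 1}, n_docs=2)
print(row)
```

The logarithm dampens repeated words: a term occurring three times gets a tf factor of about 2.1 instead of 3.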

# Processing the test set

In [19]:
num_test_reviews = data_test["review"].size
clean_test_reviews = []

for i in range(0, num_test_reviews):
    review = data_test["review"][i]
    clean_test_reviews.append(clean_review(review, remove_stopwords=True))
    
data_test_X = vectorizer.transform(clean_test_reviews)
data_test_X_tfidf = tfidf_transformer.transform(data_test_X)
data_test_X_tfidf.shape

(25000, 5000)
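
The test set goes through `transform()` rather than `fit_transform()`, so it is mapped onto the vocabulary and idf weights learned from the training set. That is why it also has exactly 5000 columns. The sketch below shows the principle in plain Python with two hypothetical helpers, `fit_vocabulary` and `transform`: terms that were never seen during fitting are simply ignored.

```python
def fit_vocabulary(docs):
    """Learn a term -> column index mapping from the training documents."""
    terms = sorted({t for d in docs for t in d.split()})
    return {term: index for index, term in enumerate(terms)}


def transform(docs, vocabulary):
    """Count known terms per document; terms outside the vocabulary are dropped."""
    rows = []
    for d in docs:
        row = [0] * len(vocabulary)
        for t in d.split():
            if t in vocabulary:
                row[vocabulary[t]] += 1
        rows.append(row)
    return rows


train_docs = ["worst film", "funny film"]   # toy training reviews
test_docs = ["funny boring film"]           # "boring" is unseen in training

vocabulary = fit_vocabulary(train_docs)
test_rows = transform(test_docs, vocabulary)

print(vocabulary)
print(test_rows)
```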

# Building models

## Naive Bayes classifier

In [21]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

nb_model = MultinomialNB()
nb_model_auc = cross_val_score(nb_model, data_train_X_tfidf, data_train["sentiment"], cv=10, scoring="roc_auc").mean()
nb_model_auc

0.93587929600000008
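
The `roc_auc` score measures ranking quality: it is the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one, with ties counted as one half. The pure-Python sketch below computes it by comparing every positive/negative pair. The function name `roc_auc` and the toy scores are illustrative only; scikit-learn computes the same quantity more efficiently.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a correct pair
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
auc_mixed = roc_auc(labels, [0.1, 0.4, 0.35, 0.8])    # one pair is out of order
auc_perfect = roc_auc(labels, [0.1, 0.2, 0.7, 0.9])   # all positives ranked on top

print(auc_mixed, auc_perfect)
```

Only the ordering of the scores matters, not their absolute values, so a model does not need calibrated probabilities to reach a high AUC.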

## Linear support vector machine classifier

In [22]:
from sklearn.svm import LinearSVC

svc_model = LinearSVC(C=1)
svc_model_auc = cross_val_score(svc_model, data_train_X_tfidf, data_train["sentiment"], cv=10, scoring="roc_auc").mean()
svc_model_auc

0.94756691199999987

## Logistic regression classifier

In [23]:
from sklearn.linear_model import LogisticRegression

logistic_model = LogisticRegression(C=1.0)
logistic_model_auc = cross_val_score(logistic_model, data_train_X_tfidf, data_train["sentiment"], cv=10, scoring="roc_auc").mean()
logistic_model_auc

0.95550905600000002

## Stochastic gradient descent (SGD) classifier

In [24]:
from sklearn.linear_model import SGDClassifier

sgd_model = SGDClassifier()
sgd_model_auc = cross_val_score(sgd_model, data_train_X_tfidf, data_train["sentiment"], cv=10, scoring="roc_auc").mean()
sgd_model_auc

0.95466809600000013

## Random forest classifier

In [25]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier()
rf_model_auc = cross_val_score(rf_model, data_train_X_tfidf, data_train["sentiment"], cv=10, scoring="roc_auc").mean()
rf_model_auc

0.86589513600000001

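# Submitting predictions

<strong>No hyperparameter tuning and no elaborate model ensembling here: the five models' predicted labels are simply averaged with equal weights</strong>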

In [26]:
df = pd.DataFrame({
        "model": ["Naive Bayes", "Linear SVM", "Logistic regression", "SGD", "Random forest"],
        "score": [nb_model_auc, svc_model_auc, logistic_model_auc, sgd_model_auc, rf_model_auc]
    })

df.sort_values(by="score", ascending=False)

| | model | score |
|---|---|---|
| 2 | Logistic regression | 0.955509 |
| 3 | SGD | 0.954668 |
| 1 | Linear SVM | 0.947567 |
| 0 | Naive Bayes | 0.935879 |
| 4 | Random forest | 0.865895 |


In [27]:
nb_model = nb_model.fit(data_train_X_tfidf, data_train["sentiment"])
nb_result = nb_model.predict(data_test_X_tfidf)

svc_model = svc_model.fit(data_train_X_tfidf, data_train["sentiment"])
svc_result = svc_model.predict(data_test_X_tfidf)

logistic_model = logistic_model.fit(data_train_X_tfidf, data_train["sentiment"])
logistic_result = logistic_model.predict(data_test_X_tfidf)

sgd_model = sgd_model.fit(data_train_X_tfidf, data_train["sentiment"])
sgd_result = sgd_model.predict(data_test_X_tfidf)

rf_model = rf_model.fit(data_train_X_tfidf, data_train["sentiment"])
rf_result = rf_model.predict(data_test_X_tfidf)

In [28]:
result = 0.2 * nb_result + 0.2 * svc_result + 0.2 * logistic_result + 0.2 * sgd_result + 0.2 * rf_result
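
Because each `predict()` call returns hard 0/1 labels, this equal-weight average can only take the six values 0, 0.2, 0.4, 0.6, 0.8 and 1.0, and thresholding it as done below amounts to a majority vote among the five models. The sketch checks that equivalence over all 32 vote combinations, including the floating-point detail that three votes sum to slightly more than 0.6. The helper names are hypothetical.

```python
from itertools import product


def averaged_vote(votes):
    """Equal-weight average of five 0/1 votes, written like the cell above."""
    a, b, c, d, e = votes
    return 0.2 * a + 0.2 * b + 0.2 * c + 0.2 * d + 0.2 * e


def majority_vote(votes):
    """1 if at least three of the five models voted positive, else 0."""
    return 1 if sum(votes) >= 3 else 0


mismatches = []
for votes in product([0, 1], repeat=5):
    avg = averaged_vote(votes)
    # Same thresholds as the notebook: <= 0.4 becomes 0, >= 0.6 becomes 1.
    thresholded = 1 if avg >= 0.6 else 0 if avg <= 0.4 else None
    if thresholded != majority_vote(votes):
        mismatches.append(votes)

print(averaged_vote((1, 1, 1, 0, 0)))  # slightly above 0.6 in floating point
print(mismatches)                      # stays empty if the two rules agree
```

Counting integer votes avoids float comparisons altogether. Since AUC rewards ranking, averaging continuous scores instead of hard labels would likely keep more information, though that is outside what this notebook does.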

In [30]:
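df = pd.DataFrame({
        "id": data_test["id"],
        "sentiment": result,
    })

# Summary statistics of the averaged votes
df.describe()

| | sentiment |
|---|---|
| count | 25000.0 |
| mean | 0.489248 |
| std | 0.447065 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.6 |
| 75% | 1.0 |
| max | 1.0 |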


In [32]:
# At most two of the five models voted positive: label the review negative
df.loc[df["sentiment"] <= 0.4, "sentiment"] = 0

In [34]:
# At least three of the five models voted positive: label the review positive
df.loc[df["sentiment"] >= 0.6, "sentiment"] = 1

In [35]:
df.to_csv("output.csv", header=True, index=False, quoting=3)