# Outline of sentiment analysis on movie reviews 


## Introduction to this notebook

This project is a machine learning exercise based on movie reviews. The data set comes from Kaggle, which labels each review phrase with one of five sentiment classes. Participants are required to predict the sentiment label of each phrase through text analysis and machine learning models.


In this notebook, I mainly use the bag-of-words and TF-IDF models for feature engineering on the text data, then train logistic regression and Naive Bayes classifiers and predict the results. Finally, a grid search is used to find the best combination of hyperparameters.


The Kaggle competition URL:
https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews

Sentiment labels of the movie reviews:

- 0 - negative
- 1 - somewhat negative
- 2 - neutral
- 3 - somewhat positive
- 4 - positive


## Feature engineering and ML model

Steps:

- 1. Build the corpus

- 2. Bag-of-words model
    
    - Logistic Regression
    - Multinomial Naive Bayes
    
- 3. TF-IDF model
    
    - Logistic Regression
    - Multinomial Naive Bayes
    
- 4. Find the optimal hyperparameters with grid search
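
The steps above can be sketched end to end on a toy corpus. This is a minimal sketch, not the notebook's actual code: the four phrases, their labels, and the names `toy_phrases`, `toy_labels`, and `candidates` are made up for illustration, and `make_pipeline` is just one convenient way to chain a vectorizer with a classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical data, not taken from train.tsv)
toy_phrases = ["a wonderful film", "a terrible film", "wonderful acting", "terrible plot"]
toy_labels = [4, 0, 4, 0]

# Pair each feature-extraction step with each classifier
candidates = {
    "bag-of-words + logistic regression": make_pipeline(CountVectorizer(), LogisticRegression()),
    "bag-of-words + multinomial NB": make_pipeline(CountVectorizer(), MultinomialNB()),
    "tf-idf + logistic regression": make_pipeline(TfidfVectorizer(), LogisticRegression()),
    "tf-idf + multinomial NB": make_pipeline(TfidfVectorizer(), MultinomialNB()),
}

predictions = {}
for name, model in candidates.items():
    model.fit(toy_phrases, toy_labels)
    # "story" never occurs in the toy corpus, so only "wonderful" drives the prediction
    predictions[name] = int(model.predict(["a wonderful story"])[0])
    print(name, '->', predictions[name])
```

Each pipeline applies the vectorizer and the classifier in one `fit` / `predict` call, which mirrors the vectorize-then-classify order of the steps listed above.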




In [9]:
import sklearn
import warnings 
import pandas as pd

In [10]:
# import data 
data_train = pd.read_csv('./train.tsv',sep='\t') 
data_test = pd.read_csv('./test.tsv',sep='\t') 

# general config
with open('./stop_words.txt', encoding='utf-8') as f:
    stop_words = f.read().splitlines()
warnings.filterwarnings('ignore')

In [11]:
data_train.head() # training data with sentiment labels

| | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 0 | 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
| 1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
| 2 | 3 | 1 | A series | 2 |
| 3 | 4 | 1 | A | 2 |
| 4 | 5 | 1 | series | 2 |


In [12]:
data_test.head() # test data whose sentiment labels must be predicted

| | PhraseId | SentenceId | Phrase |
|---|---|---|---|
| 0 | 156061 | 8545 | An intermittently pleasing but mostly routine ... |
| 1 | 156062 | 8545 | An intermittently pleasing but mostly routine ... |
| 2 | 156063 | 8545 | An |
| 3 | 156064 | 8545 | intermittently pleasing but mostly routine effort |
| 4 | 156065 | 8545 | intermittently pleasing but mostly routine |


## Feature Engineering


### 1. Build the corpus and pre-process the data

In [13]:

# concatenate the phrase text from the training and test sets
train_sentences = data_train['Phrase']
test_sentences = data_test['Phrase']
sentences = pd.concat([train_sentences,test_sentences])
sentences.shape

(222352,)

In [14]:
# extract the sentiment labels
label = data_train['Sentiment']
label.shape

(156060,)

### 2. Bag-of-words model

In [17]:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
co = CountVectorizer(
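    analyzer='word', # analyze the text word by word; for Latin-script languages it is sometimes useful to analyze character by character instead (analyzer='char')
    ngram_range=(1,4), # n-grams cover several adjacent words, which mitigates the loss of word order in the plain bag-of-words model
    stop_words=stop_words,
    max_features=150000 # keep only the max_features most frequent terms of the corpus in the final bag-of-words matrix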
)

co.fit(sentences)

x_train,x_test,y_train,y_test = train_test_split(train_sentences,label,random_state=1234)

# transform words to vector
x_train = co.transform(x_train)
x_test = co.transform(x_test)
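
To make the effect of `ngram_range` concrete, here is a minimal sketch on two made-up phrases. The names `toy_vectorizer`, `toy_matrix`, and `feature_names` are hypothetical, and the range is reduced to `(1, 2)` to keep the output small.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams and bigrams only, so the vocabulary stays readable
toy_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2))
toy_matrix = toy_vectorizer.fit_transform(["good movie", "not good movie"])

# Columns of the count matrix follow the alphabetical order of the vocabulary
feature_names = sorted(toy_vectorizer.vocabulary_)
print(feature_names)
# ['good', 'good movie', 'movie', 'not', 'not good']
print(toy_matrix.toarray())
# [[1 1 1 0 0]
#  [1 1 1 1 1]]
```

The bigram `not good` only appears in the second phrase, so the two phrases get different vectors even though they share every unigram except `not`.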




Build classifiers on the bag-of-words features and check their accuracy.

#### Logistic Regression

In [26]:
from sklearn.linear_model import LogisticRegression
lg1 = LogisticRegression()
lg1.fit(x_train, y_train)
print('Bag-of-words features with Logistic Regression, accuracy on the held-out split (x_test):')
print(lg1.score(x_test, y_test))

Bag-of-words features with Logistic Regression, accuracy on the held-out split (x_test):
0.6461873638344227
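
For a scikit-learn classifier, `score` returns the mean accuracy on the data passed to it. The following minimal sketch checks this on a tiny made-up data set; `X_toy`, `y_toy`, and `toy_clf` are hypothetical names that do not appear in the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Tiny hypothetical feature matrix and labels
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

toy_clf = LogisticRegression().fit(X_toy, y_toy)

# score() is the fraction of samples whose predicted label equals the true label
score_value = toy_clf.score(X_toy, y_toy)
manual_value = accuracy_score(y_toy, toy_clf.predict(X_toy))
print(score_value, manual_value)
```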


#### Multinomial Naive Bayes

In [25]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(x_train,y_train)
print('Bag-of-words features with Multinomial Naive Bayes, accuracy on the held-out split (x_test):')
print(classifier.score(x_test,y_test))

Bag-of-words features with Multinomial Naive Bayes, accuracy on the held-out split (x_test):
0.6084070229398949


### 3. TF-IDF Model

In [27]:


from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(
    analyzer='word',
    ngram_range=(1,4),
    # stop_words=stop_words,
    max_features=150000
)

In [28]:
tf.fit(sentences)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0,
                max_features=150000, min_df=1, ngram_range=(1, 4), norm='l2',
                preprocessor=None, smooth_idf=True, stop_words=None,
                strip_accents=None, sublinear_tf=False,
                token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
                vocabulary=None)

In [29]:
x_train,x_test,y_train,y_test = train_test_split(train_sentences,label,random_state=1234)
x_train = tf.transform(x_train)
x_test = tf.transform(x_test)
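
With the settings shown in the vectorizer output above (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), scikit-learn computes the inverse document frequency of a term as `ln((1 + n) / (1 + df)) + 1` and then scales each row to unit length. The sketch below checks this on three made-up phrases; `toy_docs`, `toy_tfidf`, and `manual_idf` are hypothetical names.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["good movie", "bad movie", "good plot"]
toy_tfidf = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
toy_weights = toy_tfidf.fit_transform(toy_docs).toarray()

n_docs = len(toy_docs)
# Document frequency of each term, in the vectorizer's column order
doc_freq = (toy_weights > 0).sum(axis=0)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Rare terms ("bad", "plot") get a larger idf than common ones ("good", "movie")
print(dict(zip(sorted(toy_tfidf.vocabulary_), np.round(toy_tfidf.idf_, 3))))
print(np.allclose(manual_idf, toy_tfidf.idf_))
```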

#### Multinomial Naive Bayes

In [31]:
# train a Naive Bayes classifier and evaluate its predictions

classifier = MultinomialNB()
classifier.fit(x_train,y_train)
print('TF-IDF features with Multinomial Naive Bayes, accuracy on the held-out split (x_test):')
print(classifier.score(x_test,y_test))

TF-IDF features with Multinomial Naive Bayes, accuracy on the held-out split (x_test):
0.6045367166474432


#### Logistic Regression

In [32]:
# scikit-learn's default logistic regression model

lg1 = LogisticRegression()
lg1.fit(x_train,y_train)
print('TF-IDF features with Logistic Regression, accuracy on the held-out split (x_test):',lg1.score(x_test,y_test))

TF-IDF features with Logistic Regression, accuracy on the held-out split (x_test): 0.640958605664488


### 4. Find the optimal hyperparameters with grid search

In [None]:


from sklearn.model_selection import GridSearchCV
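param_grid = {'C': range(1, 10),
              'dual': [True, False]
              }
# dual=True is only supported by the liblinear solver (with the l2 penalty),
# so the solver is set explicitly; newer scikit-learn versions default to lbfgs
lgGS = LogisticRegression(solver='liblinear')
grid = GridSearchCV(lgGS, param_grid=param_grid, cv=3, n_jobs=-1)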
grid.fit(x_train,y_train)

In [31]:
grid.best_params_

{'C': 5, 'dual': True}
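
Besides `best_params_`, the fitted search object keeps the cross-validated score of every parameter combination in `cv_results_`. The following is a minimal sketch on synthetic data, which stands in for the TF-IDF features; `X_toy`, `y_toy`, and `toy_grid` are hypothetical names, and the grid is smaller than the one used above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF features (hypothetical data)
X_toy, y_toy = make_classification(n_samples=120, n_features=10, random_state=0)

toy_grid = GridSearchCV(
    LogisticRegression(solver='liblinear'),  # liblinear accepts both dual settings
    param_grid={'C': [1, 5, 10], 'dual': [True, False]},
    cv=3,
)
toy_grid.fit(X_toy, y_toy)

# One entry per parameter combination: 3 values of C x 2 values of dual
for params, score in zip(toy_grid.cv_results_['params'],
                         toy_grid.cv_results_['mean_test_score']):
    print(params, round(score, 3))
print('best:', toy_grid.best_params_)
```

The combination reported in `best_params_` is the one with the highest mean score across the three folds, which is computed on the training data only.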

In [None]:
# The hyperparameter search selects C = 5 and dual = True, the combination
# with the highest mean cross-validated accuracy for the logistic regression model.
# The final Logistic Regression classifier is built with these parameters
# and reaches an accuracy of about 0.6546 on the held-out validation split.

In [36]:
lg_final = grid.best_estimator_

In [35]:
print('Logistic regression with the best hyperparameters from the grid search, accuracy on the validation split:')
print(lg_final.score(x_test,y_test))

Logistic regression with the best hyperparameters from the grid search, accuracy on the validation split:
0.6546456491093169
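
A natural next step is to predict labels for the test phrases and collect them in a table. The sketch below uses made-up stand-ins (`toy_train`, `toy_test`) for `data_train` and `data_test`, and it assumes that a submission consists of the two columns `PhraseId` and `Sentiment`; that layout is an assumption and should be checked against the competition page.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for data_train and data_test
toy_train = pd.DataFrame({
    'Phrase': ['a wonderful film', 'a terrible film', 'wonderful acting', 'terrible plot'],
    'Sentiment': [4, 0, 4, 0],
})
toy_test = pd.DataFrame({
    'PhraseId': [156061, 156062],
    'Phrase': ['wonderful story', 'terrible story'],
})

# Fit the vectorizer and the classifier on the labelled phrases
toy_tf = TfidfVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_tf.fit_transform(toy_train['Phrase']), toy_train['Sentiment'])

# Assumed submission layout: one predicted label per PhraseId
submission = pd.DataFrame({
    'PhraseId': toy_test['PhraseId'],
    'Sentiment': toy_model.predict(toy_tf.transform(toy_test['Phrase'])),
})
print(submission.to_csv(index=False))
```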
