# Bag of Words Meets Bags of Popcorn

This is my second Kaggle competition. It is a tutorial competition for learning about the Word2vec neural network implementation, but I built this project on the tutorial's tips for applying a simple Bag of Words model.

Link: https://www.kaggle.com/c/word2vec-nlp-tutorial

Problem description: build a model for sentiment analysis of movie reviews. The model must distinguish negative from positive movie reviews and label them 0 and 1, respectively. (An NLP problem.)

## 1 - Packages

In [1]:
import pandas as pd 
import numpy as np
import re
import seaborn as sns
from bs4 import BeautifulSoup

from sklearn import model_selection, ensemble, metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

import nltk
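# nltk.download takes one resource id as its first argument (the second is the download directory)
for resource in ('stopwords', 'wordnet', 'punkt'):
    nltk.download(resource)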
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

## 2 - Overview of the Dataset

Loading data:

In [2]:
train_data = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
test_data = pd.read_csv("unlabeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
submit_data = pd.read_csv("testData.tsv", header=0, delimiter="\t", quoting=3)

Shape of datasets:

In [3]:
print('Shape of train data: ', train_data.shape)
print('Shape of test data: ', test_data.shape)
print('Shape of submit data: ', submit_data.shape)

Shape of train data:  (25000, 3)
Shape of test data:  (50000, 2)
Shape of submit data:  (25000, 2)


First 5 rows of train data dataset:

In [4]:
train_data.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
| 2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
| 3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
| 4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |


Check column names:

In [5]:
print('Column names:', list(train_data.columns))

Column names: ['id', 'sentiment', 'review']


Let's see review samples.

In [6]:
train_data.review[0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

The text contains HTML tags, punctuation, stop words (such as “the”, “a”, “an”, “in”) and numbers, so we need to clean the data and then tokenize it.

## 3 - Data Cleaning and Text Preprocessing

The following functions handle data cleaning. The preprocessing function prepares a review for splitting into tokens. The tokenizing function then filters and lemmatizes every word in the review to get the most accurate tokens for analysis.

In [7]:
def my_tokenizer(sample):
    # Split into words
    words = nltk.word_tokenize(sample)

    # Keep alphanumeric tokens, then drop tokens made of digits only
    tokens = [word for word in words if word.isalnum()]
    tokens = [word for word in tokens if not word.isdigit()]

    # Remove stopwords
    meaningful_words = [w for w in tokens if w not in stops]

    # Lemmatization
    word_list = [lemmatizer.lemmatize(w) for w in meaningful_words]

    return word_list

def my_preprocessor(sample):
    # Remove HTML tags (naming the parser explicitly avoids bs4's "no parser specified" warning)
    no_tags_text = BeautifulSoup(sample, "html.parser").get_text()

    # To lowercase
    review_text = no_tags_text.lower()

    return review_text

Next we instantiate the lemmatizer and load the set of English stopwords. We are then ready to instantiate `vectorizer`, which performs the whole text preprocessing and creates the bag-of-words features.

In [8]:
stops = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
vectorizer = CountVectorizer(analyzer="word", tokenizer=my_tokenizer, preprocessor=my_preprocessor,
                             stop_words=None, max_features=5000)

Let's check how `vectorizer` works on the first sample in our train dataset.

In [9]:
sample = train_data.review[0]
train_data_features = vectorizer.fit_transform([sample])
train_data_features = train_data_features.toarray()

train_data_features

array([[ 1,  1,  2,  1,  1,  1,  1,  3,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  2,  1,  1,  1,  1,  1,
         1,  2,  1,  1,  3,  1,  1,  1,  1,  1,  1,  1,  1,  1,  2,  2,
         1,  2,  1,  1,  1,  1,  1,  1,  1,  3,  1,  2,  2,  3,  1,  1,
         1,  1,  1,  2,  2,  1,  1,  3,  1,  1,  1,  1,  3,  1,  1,  1,
         1,  1,  1,  1,  3,  3,  2,  1,  1, 11,  1,  2,  3,  1,  1,  1,
         2,  1,  1,  4,  1,  1,  2,  1,  5,  1,  2,  1,  1,  1,  1,  1,
         1,  2,  1,  1,  1,  1,  1,  1,  1,  3,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         2,  1,  2,  1,  2,  1,  2,  1,  1,  1,  1]], dtype=int64)

So, at least it works. I don't really know how to check if it's valid. Now we fit the vectorizer on the whole train dataset and transform it.
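As a possible sanity check before that, the counts can be compared against a corpus small enough to verify by hand. The sketch below uses a hypothetical two-document toy corpus and a default `CountVectorizer` (not the custom tokenizer and preprocessor defined above), and looks words up through the fitted `vocabulary_` mapping.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not taken from the competition data)
toy_docs = ["good movie good cast", "bad movie"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs).toarray()

# vocabulary_ maps each token to its column index in the count matrix
toy_vocab = toy_vectorizer.vocabulary_

# Look up the count of a given word in a given document
good_in_first = toy_counts[0, toy_vocab["good"]]
movie_in_second = toy_counts[1, toy_vocab["movie"]]

# inverse_transform lists the tokens with a non-zero count in each document
tokens_per_doc = toy_vectorizer.inverse_transform(toy_counts)

print(good_in_first, movie_in_second, sorted(tokens_per_doc[1]))
```

The same lookups should work on the notebook's `vectorizer` after fitting, for example to confirm that a word seen several times in a review has a matching count in its row.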

In [10]:
train_data_features = vectorizer.fit_transform(train_data.review)
train_data_features = train_data_features.toarray()

In [11]:
# test_data_features = vectorizer.transform(test_data.review)
# test_data_features = test_data_features.toarray()

In [12]:
submit_data_features = vectorizer.transform(submit_data.review)
submit_data_features = submit_data_features.toarray()

The shape of the training data array is now:

In [13]:
train_data_features.shape

(25000, 5000)
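A note on memory: the `.toarray()` calls above turn the sparse count matrix into a dense one. The sketch below estimates the size of a dense `int64` array with this shape and, on a hypothetical toy matrix, compares dense and CSR storage. It is only an illustration; several of the estimators used later (for example `LogisticRegression`, `SVC` and `RandomForestClassifier`) accept sparse input, while `GaussianNB` needs a dense array.

```python
import numpy as np
from scipy import sparse

# Size of a dense int64 matrix with the shape reported above
n_rows, n_cols = 25000, 5000
dense_bytes = n_rows * n_cols * np.dtype(np.int64).itemsize
print(dense_bytes / 1e9, "GB")

# Toy illustration (hypothetical data): a mostly-zero matrix stored dense vs. CSR
rng = np.random.default_rng(0)
toy_dense = np.zeros((1000, 500), dtype=np.int64)
rows = rng.integers(0, 1000, size=5000)
cols = rng.integers(0, 500, size=5000)
toy_dense[rows, cols] = 1

toy_sparse = sparse.csr_matrix(toy_dense)
sparse_bytes = (toy_sparse.data.nbytes
                + toy_sparse.indices.nbytes
                + toy_sparse.indptr.nbytes)
print(toy_dense.nbytes, sparse_bytes)
```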

Now that the Bag of Words model is trained, let's look at the vocabulary:

In [14]:
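# get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
vocab = vectorizer.get_feature_names()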
vocab

['13th',
 '1930s',
 '1950s',
 '1960s',
 '1970s',
 '1980s',
 '19th',
 '1st',
 '20th',
 '2nd',
 '3d',
 '3rd',
 '50',
 '60',
 '70',
 '80',
 '90',
 'abandoned',
 'abc',
 'ability',
 'able',
 'abraham',
 'abrupt',
 'absence',
 'absent',
 'absolute',
 'absolutely',
 'absurd',
 'absurdity',
 'abuse',
 'abused',
 'abusive',
 'abysmal',
 'academy',
 'accent',
 'accept',
 'acceptable',
 'acceptance',
 'accepted',
 'accepts',
 'access',
 'accident',
 'accidentally',
 'acclaimed',
 'accompanied',
 'accomplish',
 'accomplished',
 'according',
 'account',
 'accuracy',
 'accurate',
 'accused',
 'achieve',
 'achieved',
 'achievement',
 'acid',
 'across',
 'act',
 'acted',
 'acting',
 'action',
 'active',
 'activity',
 'actor',
 'actress',
 'actual',
 'actually',
 'ad',
 'adam',
 'adaptation',
 'adapted',
 'add',
 'added',
 'addict',
 'addiction',
 'adding',
 'addition',
 'additional',
 'address',
 'adequate',
 'admirable',
 'admire',
 'admit',
 'admittedly',
 'adolescent',
 'adopted',
 'adorable',
 'a

## 4 - Model

Below we consider several models and choose the one that gives the best metric score. According to the competition rules, submissions are judged on the area under the ROC curve.

Let's configure:
1. SVM
2. Naive Bayes
3. RandomForestClassifier
4. KNN
5. Logistic Regression

In [15]:
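# Built this way, the scorer computes ROC AUC from hard predict() labels rather than probabilities
roc_auc_scorer = metrics.make_scorer(metrics.roc_auc_score)
# The grid searches below run on the first 500 reviews only
X_train = train_data_features[:500]
Y_train = train_data.sentiment[:500]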
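One caveat about this scorer: `make_scorer(roc_auc_score)` with default arguments passes the output of `predict()` to the metric, so the AUC is computed from 0/1 labels. The sketch below, on hypothetical toy numbers, shows how AUC from continuous scores can differ from AUC computed after thresholding. In `GridSearchCV`, the built-in `scoring="roc_auc"` string would use probabilities or decision scores instead; the scores printed in this notebook were obtained with the label-based scorer.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and hypothetical predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.2, 0.3, 0.4, 0.9])

# AUC computed from the continuous scores: every positive outranks every negative
auc_from_scores = roc_auc_score(y_true, y_prob)

# AUC computed from hard 0/1 labels obtained by thresholding at 0.5
y_label = (y_prob >= 0.5).astype(int)
auc_from_labels = roc_auc_score(y_true, y_label)

print(auc_from_scores, auc_from_labels)  # 1.0 0.75
```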

In [16]:
forest = RandomForestClassifier(n_estimators = 100) 
forest = forest.fit( train_data_features, train_data.sentiment )

In [17]:
result = forest.predict(submit_data_features)

### SVM

In [18]:
params = {'kernel':['linear', 'rbf'], 'C':[0.1, 1, 5, 10]}
svc = SVC(probability = True, random_state = 0)
clf = GridSearchCV(svc, param_grid = params, scoring = roc_auc_scorer, cv = 5, n_jobs = -1)
clf.fit(X_train, Y_train)
print('Best score: {}'.format(clf.best_score_))
print('Best parameters: {}'.format(clf.best_params_))

Best score: 0.7880767730011616
Best parameters: {'C': 1, 'kernel': 'linear'}


In [19]:
svc_best = SVC(C = clf.best_params_['C'], kernel = clf.best_params_['kernel'], probability = True, random_state = 0)
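As an alternative to rebuilding the model from `best_params_` by hand, a fitted search object also exposes `best_estimator_`. The sketch below shows this on synthetic stand-in data (the names `X_toy`, `y_toy` and `toy_search` are hypothetical); `clone` returns an unfitted copy with the same settings, which is the closest match to what the cell above constructs.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical; not the review features)
X_toy, y_toy = make_classification(n_samples=100, n_features=10, random_state=0)

toy_search = GridSearchCV(SVC(random_state=0), param_grid={"C": [0.1, 1]}, cv=3)
toy_search.fit(X_toy, y_toy)

# With the default refit=True, best_estimator_ is already refitted on all of
# X_toy using best_params_
fitted_best = toy_search.best_estimator_

# clone() gives an unfitted estimator with the same hyperparameters
unfitted_best = clone(fitted_best)

print(toy_search.best_params_, unfitted_best.get_params()["C"])
```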

### RandomForestClassifier

In [20]:
params = {'n_estimators':[10, 50, 100, 150], 'criterion':['gini', 'entropy'], 'max_depth':[None, 5, 10, 50]}
rf = RandomForestClassifier(random_state = 0)
clf = GridSearchCV(rf, param_grid = params, scoring = roc_auc_scorer, cv = 5, n_jobs = -1)
clf.fit(X_train, Y_train)
print('Best score: {}'.format(clf.best_score_))
print('Best parameters: {}'.format(clf.best_params_))



Best score: 0.7515696759542082
Best parameters: {'criterion': 'gini', 'max_depth': 50, 'n_estimators': 150}


In [21]:
rf_best = RandomForestClassifier(n_estimators = clf.best_params_['n_estimators'], criterion = clf.best_params_['criterion'], \
                                 max_depth = clf.best_params_['max_depth'], random_state = 0)

### LogisticRegression

In [32]:
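params = {'penalty':['l1', 'l2'], 'C':[1, 2, 3, 5, 10]}
# liblinear supports both l1 and l2; it was the default solver before scikit-learn 0.22
lr = LogisticRegression(solver = 'liblinear', random_state = 0)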
clf = GridSearchCV(lr, param_grid = params, scoring = roc_auc_scorer, cv = 5, n_jobs = -1)
clf.fit(X_train, Y_train)
print('Best score: {}'.format(clf.best_score_))
print('Best parameters: {}'.format(clf.best_params_))

Best score: 0.8009286287317111
Best parameters: {'C': 10, 'penalty': 'l2'}




In [33]:
lr_best = LogisticRegression(penalty = clf.best_params_['penalty'], C = clf.best_params_['C'],
                             solver = 'liblinear', random_state = 0)
# lr_best = LogisticRegression(penalty = 'l2', C = 10, solver = 'liblinear', random_state = 0)

### Naive Bayes

In [24]:
params = {"var_smoothing" : [1e-8, 1e-7, 1e-6, 1e-5, 1e-4]}
nb = GaussianNB()
clf = GridSearchCV(nb, param_grid = params, scoring = roc_auc_scorer, cv = 5, n_jobs = -1)
clf.fit(X_train, Y_train)
print('Best score: {}'.format(clf.best_score_))
print('Best parameters: {}'.format(clf.best_params_))

Best score: 0.6568464123016774
Best parameters: {'var_smoothing': 0.0001}




In [25]:
nb_best = GaussianNB(var_smoothing = clf.best_params_['var_smoothing'])

### KNN

In [26]:
params = {'n_neighbors':[3, 5, 10, 20], 'p':[1, 2, 5], 'weights':['uniform', 'distance']}
knc = KNeighborsClassifier()
clf = GridSearchCV(knc, param_grid = params, scoring = roc_auc_scorer, cv = 5, n_jobs = -1)
clf.fit(X_train, Y_train)
print('Best score: {}'.format(clf.best_score_))
print('Best parameters: {}'.format(clf.best_params_))

Best score: 0.5947478636456784
Best parameters: {'n_neighbors': 10, 'p': 5, 'weights': 'distance'}




In [27]:
knc_best = KNeighborsClassifier(n_neighbors = clf.best_params_['n_neighbors'], p=clf.best_params_['p'],\
                               weights = clf.best_params_['weights'])

This is where I stopped.

In [28]:
# voting_clf = VotingClassifier(estimators=[('svc', svc_best), ('rf', rf_best), ('lr', lr_best), ('nb', nb_best),\
#                                           ('knc', knc_best)], voting='hard')
# voting_clf.fit(train_data_features, train_data.sentiment)
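If the ensemble were picked up again, one point to keep in mind is that `voting='hard'` only produces class labels, while a ROC AUC evaluation benefits from probabilities. The sketch below is a hypothetical variant with `voting='soft'`, which averages the members' predicted probabilities. It uses synthetic data and two freshly configured models, not the tuned `*_best` estimators from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; not the review features)
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

# Soft voting needs every member to implement predict_proba
toy_voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
toy_voter.fit(X_fit, y_fit)

# Probability of the positive class, usable directly for ROC AUC
val_prob = toy_voter.predict_proba(X_val)[:, 1]
val_auc = roc_auc_score(y_val, val_prob)
print(val_auc)
```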


In [68]:
lr_best.fit(train_data_features, train_data.sentiment)
y_pred = lr_best.predict(submit_data_features)



Writing the results to a file.

In [69]:
submission = pd.read_csv("sampleSubmission.csv", header=0, delimiter=",", quoting=3)
col = submission.columns[1]
submission[col] = y_pred
submission.to_csv('submission.csv', index=False)

In [75]:
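# Rewrite the submission with a fixed header and trimmed id fields
with open("submission.csv", "r") as src, open("valid_submission.csv", "w") as dst:
    src.readline()  # skip the original header line
    dst.write('"id","sentiment"\n')
    for line in src:
        fields = line.split(',')
        fields[0] = fields[0][2:-2]  # trim two characters from each end of the id field
        dst.write(','.join(fields))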