# Bag of Words Challenge

## Felipe de Ávila Granja

In this notebook I will tackle the Kaggle challenge: [Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/overview)

In [1]:
# import the necessary libraries

import pandas as pd
import numpy as np
import re
import nltk

In [2]:
# load the data, which is in TSV format (tab-separated, \t)

data = pd.read_csv('C:\\Users\\DELL\\Kaggle\\Datasets\\bow\\labeledTrainData.tsv',sep='\t')

In [3]:
data.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


## Kaggle overview

The sentiment of reviews is binary:
an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1.
No individual movie has more than 30 reviews.
The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set.
In addition, there are another 50,000 IMDB reviews provided without any rating labels.

In [4]:
display(data['review'][0])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

## Text processing:

There are a couple of things that need to be done in order to process this data with ML:

1. Text cleaning: eliminate the non-word characters, like: '' <> () /
2. Tokenize the words: transform the text into a list of words
3. Small words removal: remove words like: I, a, he, she, so...
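
The three steps above can be sketched end to end on a single sentence. This is only a minimal sketch: `simple_tokenize` is a plain whitespace split standing in for a real tokenizer, and the function names and the sample sentence are made up for illustration.

```python
import re

def clean_text(text):
    # Replace every character that is not a letter or digit with a space.
    return re.sub(r"[^a-zA-Z0-9]", " ", text)

def simple_tokenize(text):
    # Whitespace split; an assumed stand-in for a proper tokenizer.
    return text.split()

def drop_short_words(tokens, min_len=5):
    # Keep only tokens with at least min_len characters.
    return [t for t in tokens if len(t) >= min_len]

sample = "Visually impressive, but it's all about MJ!<br />"
tokens = drop_short_words(simple_tokenize(clean_text(sample)))
print(tokens)  # -> ['Visually', 'impressive', 'about']
```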

In [5]:
# before anything else, it is good to make a copy of the dataset to keep the original intact

review_data_clean = data.copy()

In [6]:
# function for text cleaning: replace every non-alphanumeric character with a space

def text_cleaning(row):
    return re.sub(r'[^a-zA-Z0-9]', ' ', row)

In [7]:
review_data_clean['review'] = review_data_clean['review'].apply(text_cleaning)

In [8]:
display(review_data_clean['review'][0])

'With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay  br    br   Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him  br    br   The actual feature film bit when it finally sta
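
The cleaned review above still contains stray `br` tokens, left over from the `<br />` tags in the raw text. They get dropped later by the short-word filter, but if that filter is ever relaxed they would end up in the vocabulary. A possible variant, assuming the only markup in the reviews is simple HTML tags, is to strip the tags before the character cleaning. The name `text_cleaning_no_html` is hypothetical and not used elsewhere in this notebook.

```python
import re

def text_cleaning_no_html(text):
    # First drop anything that looks like an HTML tag, e.g. <br />.
    without_tags = re.sub(r"<[^>]+>", " ", text)
    # Then apply the same rule as text_cleaning: non-alphanumerics become spaces.
    return re.sub(r"[^a-zA-Z0-9]", " ", without_tags)

sample = "drugs are bad m'kay.<br /><br />Visually impressive"
cleaned = text_cleaning_no_html(sample)
print(cleaned.split())
# -> ['drugs', 'are', 'bad', 'm', 'kay', 'Visually', 'impressive']
```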

In [9]:
# function for tokenizing the words

from nltk.tokenize import word_tokenize # I will use the nltk library for this
nltk.download('punkt')

def tokenize(row):
    return word_tokenize(row)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [10]:
review_data_clean['review'] = review_data_clean['review'].apply(tokenize)

In [11]:
display(review_data_clean['review'][0])

['With',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'moment',
 'with',
 'MJ',
 'i',
 've',
 'started',
 'listening',
 'to',
 'his',
 'music',
 'watching',
 'the',
 'odd',
 'documentary',
 'here',
 'and',
 'there',
 'watched',
 'The',
 'Wiz',
 'and',
 'watched',
 'Moonwalker',
 'again',
 'Maybe',
 'i',
 'just',
 'want',
 'to',
 'get',
 'a',
 'certain',
 'insight',
 'into',
 'this',
 'guy',
 'who',
 'i',
 'thought',
 'was',
 'really',
 'cool',
 'in',
 'the',
 'eighties',
 'just',
 'to',
 'maybe',
 'make',
 'up',
 'my',
 'mind',
 'whether',
 'he',
 'is',
 'guilty',
 'or',
 'innocent',
 'Moonwalker',
 'is',
 'part',
 'biography',
 'part',
 'feature',
 'film',
 'which',
 'i',
 'remember',
 'going',
 'to',
 'see',
 'at',
 'the',
 'cinema',
 'when',
 'it',
 'was',
 'originally',
 'released',
 'Some',
 'of',
 'it',
 'has',
 'subtle',
 'messages',
 'about',
 'MJ',
 's',
 'feeling',
 'towards',
 'the',
 'press',
 'and',
 'also',
 'the',
 'obvious',
 'message',
 'of',
 'drugs',

In [16]:
#small words (the, a, in) are often useless so let's keep only words with 5 or more letters

def remove_small_words(row):
    return [word for word in row['review'] if len(word)>4]

In [17]:
review_data_clean['review'] = review_data_clean.apply(remove_small_words,axis=1)
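
One side effect of a pure length filter is worth keeping in mind: short words that carry sentiment ("bad", "good", "not") are dropped together with "the" and "a". The sketch below uses made-up tokens and a hypothetical helper, `filter_by_length`, that applies the same rule to a plain list.

```python
def filter_by_length(tokens, min_len=5):
    # Same rule as remove_small_words, applied to a plain list of tokens.
    return [w for w in tokens if len(w) >= min_len]

tokens = ['this', 'movie', 'was', 'bad', 'not', 'good', 'really', 'boring']
kept = filter_by_length(tokens)
dropped = [w for w in tokens if w not in kept]

print(kept)     # -> ['movie', 'really', 'boring']
print(dropped)  # -> ['this', 'was', 'bad', 'not', 'good']
```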

In [18]:
display(review_data_clean['review'][0])

['stuff',
 'going',
 'moment',
 'started',
 'listening',
 'music',
 'watching',
 'documentary',
 'there',
 'watched',
 'watched',
 'Moonwalker',
 'again',
 'Maybe',
 'certain',
 'insight',
 'thought',
 'really',
 'eighties',
 'maybe',
 'whether',
 'guilty',
 'innocent',
 'Moonwalker',
 'biography',
 'feature',
 'which',
 'remember',
 'going',
 'cinema',
 'originally',
 'released',
 'subtle',
 'messages',
 'about',
 'feeling',
 'towards',
 'press',
 'obvious',
 'message',
 'drugs',
 'Visually',
 'impressive',
 'course',
 'about',
 'Michael',
 'Jackson',
 'unless',
 'remotely',
 'anyway',
 'going',
 'boring',
 'egotist',
 'consenting',
 'making',
 'movie',
 'would',
 'which',
 'really',
 'actual',
 'feature',
 'finally',
 'starts',
 'minutes',
 'excluding',
 'Smooth',
 'Criminal',
 'sequence',
 'Pesci',
 'convincing',
 'psychopathic',
 'powerful',
 'wants',
 'beyond',
 'Because',
 'overheard',
 'plans',
 'Pesci',
 'character',
 'ranted',
 'wanted',
 'people',
 'supplying',
 'drugs',
 'du

## Bag of Words

Now what I want to do is transform these lists of words into a Bag of Words.
This is a bit of a black box, but the objective is the following:
each row is a review (just like we have now),
for each word we create a column (stuff, going, moment, started, ...),
if the word "apply" appears once in the review of row 1, then in row 1, column "apply", we put a 1 (a word that appears n times gets n),
if the word "account" does not appear in the review of row 1, then in row 1, column "account", we put a 0.
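
To open the black box a little, here is a minimal sketch of the same idea in plain Python, on two tiny made-up reviews. It is not how sklearn implements it, just the table described above built by hand. Note that each cell holds a count, so a word that appears twice in a review gets a 2.

```python
from collections import Counter

# Two hypothetical, already tokenized reviews.
reviews = [
    ["stuff", "going", "moment", "going"],
    ["moment", "started"],
]

# Vocabulary: one column per distinct word, sorted for a stable column order.
vocabulary = sorted({word for review in reviews for word in review})

# One row per review; each cell is how many times that word occurs in the review.
bow_rows = []
for review in reviews:
    counts = Counter(review)
    bow_rows.append([counts[word] for word in vocabulary])

print(vocabulary)  # -> ['going', 'moment', 'started', 'stuff']
print(bow_rows)    # -> [[2, 1, 0, 1], [0, 1, 1, 0]]
```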

For this we use the sklearn library.

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

# create the vectorizer
CountVec = CountVectorizer(ngram_range=(1,1), stop_words='english')

# CountVectorizer expects strings, so join each token list back into one string;
# fit_transform returns a sparse matrix of word counts (one row per review)
Count_data = CountVec.fit_transform(review_data_clean['review'].apply((lambda x: " ".join(x))))

# now let's turn Count_data into a dense dataframe, one column per word
# (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
cv_dataframe=pd.DataFrame(Count_data.toarray(),columns=CountVec.get_feature_names_out())

# keep only words that occur at least 30 times in total across all reviews
cv_dataframe=cv_dataframe.loc[:,(cv_dataframe.sum() >29)]
cv_dataframe.head()

Unnamed: 0,1920s,1930s,1940s,1950s,1960s,1970s,1980s,1990s,aaron,abandon,...,zealand,zelah,zenia,zhang,zizek,zodiac,zombi,zombie,zombies,zorro
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
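
A note on the filter: summing a count column gives the total number of occurrences of a word, which is not the same as the number of reviews it appears in (its document frequency). The toy matrix below, with made-up numbers, shows the difference. If the goal is to filter by number of reviews, `CountVectorizer` has a `min_df` parameter that drops words below a given document frequency.

```python
# Toy count matrix: rows are reviews, columns are words (hypothetical data).
words = ["zombie", "plot"]
counts = [
    [5, 1],
    [0, 1],
    [0, 1],
]

# Total occurrences per word: what summing a count column measures.
total_occurrences = {
    w: sum(row[j] for row in counts) for j, w in enumerate(words)
}

# Document frequency per word: number of reviews with at least one occurrence.
document_frequency = {
    w: sum(1 for row in counts if row[j] > 0) for j, w in enumerate(words)
}

print(total_occurrences)   # -> {'zombie': 5, 'plot': 3}
print(document_frequency)  # -> {'zombie': 1, 'plot': 3}
```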


In [21]:
# we can now add back the reviews so it's easier to peruse the dataset
# (pandas adds the _x suffix because 'sentiment' and 'review' are also word columns)
review_bow = pd.merge(left=review_data_clean,
                   right=cv_dataframe,
                   left_index=True,
                   right_index=True)

display(review_bow.head())

Unnamed: 0,id,sentiment_x,review_x,1920s,1930s,1940s,1950s,1960s,1970s,1980s,...,zealand,zelah,zenia,zhang,zizek,zodiac,zombi,zombie,zombies,zorro
0,5814_8,1,"[stuff, going, moment, started, listening, mus...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2381_9,1,"[Classic, Worlds, Timothy, Hines, entertaining...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,7759_3,0,"[starts, manager, Nicholas, giving, welcome, i...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3630_4,0,"[assumed, those, praised, greatest, filmed, op...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9495_8,1,"[Superbly, trashy, wondrously, unpretentious, ...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Training the model 

We will use the frequency of words that show up in positive / non-positive reviews as clues. Reviews where a lot of "bad words" appear will be more likely tagged as "non-positive".

We now need to estimate, for each word:

- P(word showing up | review is positive)
- P(word showing up | review is non-positive)

Roughly, this is the word's share of all word occurrences within each class (times the word shows up in the class / total number of words in the class). The model computes this for us, with smoothing so that no probability is exactly zero.
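
To make these probabilities concrete, here is a minimal sketch of a multinomial Naive Bayes estimate in plain Python. It assumes Laplace smoothing with `alpha = 1.0` (the default of sklearn's `MultinomialNB`) and estimates each probability as the word's smoothed share of all word occurrences in the class. The toy reviews and the helper names `word_prob` and `predict` are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy training set: (tokens, sentiment) pairs.
train = [
    (["great", "great", "movie"], 1),
    (["great", "acting"], 1),
    (["boring", "movie"], 0),
    (["boring", "boring", "acting"], 0),
]

vocabulary = sorted({w for tokens, _ in train for w in tokens})
alpha = 1.0  # Laplace smoothing (assumed, matches MultinomialNB's default)

word_counts = {0: Counter(), 1: Counter()}
class_counts = Counter()
for tokens, label in train:
    word_counts[label].update(tokens)
    class_counts[label] += 1

def word_prob(word, label):
    # Smoothed share of `word` among all word occurrences in class `label`.
    total = sum(word_counts[label].values())
    return (word_counts[label][word] + alpha) / (total + alpha * len(vocabulary))

def predict(tokens):
    # Pick the class with the highest log prior + summed log likelihoods.
    scores = {}
    for label in (0, 1):
        log_prior = math.log(class_counts[label] / len(train))
        log_likelihood = sum(
            math.log(word_prob(w, label)) for w in tokens if w in vocabulary
        )
        scores[label] = log_prior + log_likelihood
    return max(scores, key=scores.get)

print(round(word_prob("great", 1), 3))  # -> 0.444
print(round(word_prob("great", 0), 3))  # -> 0.111
print(predict(["great", "movie"]))      # -> 1
print(predict(["boring", "acting"]))    # -> 0
```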

In [23]:
from sklearn.model_selection import train_test_split

X = review_bow.drop(columns=['id','sentiment_x','review_x'])
y = review_bow['sentiment_x']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
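
`train_test_split` shuffles the rows before splitting, so without a fixed seed each run gives a different split and slightly different scores. sklearn's function takes a `random_state` argument for this. The sketch below is a simplified stand-in, not sklearn's implementation, that shows the idea of a seeded 90/10 split on row indices. The helper `split_indices` is hypothetical.

```python
import random

def split_indices(n_rows, test_size=0.1, seed=0):
    # Shuffle row indices with a fixed seed, then cut off the test share.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(100, test_size=0.1, seed=42)
print(len(train_idx), len(test_idx))  # -> 90 10

# Same seed, same split: the result is reproducible.
print(split_indices(100, seed=42) == (train_idx, test_idx))  # -> True
```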

In [24]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train,y_train)

MultinomialNB()

In [25]:
model.predict(X_test)

array([1, 1, 0, ..., 0, 1, 1], dtype=int64)

In [26]:
print('Test score:', model.score(X_test,y_test))
print('Train score:', model.score(X_train,y_train))

Test score: 0.854
Train score: 0.8713777777777778
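
For a classifier, `model.score` returns the mean accuracy: the fraction of reviews whose predicted sentiment matches the true one. As a minimal sketch on made-up labels, the same number can be computed by hand, along with a breakdown of where the errors fall. The helpers `accuracy` and `confusion_counts` are hypothetical names.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

def confusion_counts(y_true, y_pred):
    # Counts keyed by (true label, predicted label).
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for t, p in zip(y_true, y_pred):
        counts[(t, p)] += 1
    return counts

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]

print(accuracy(y_true, y_pred))          # -> 0.6
print(confusion_counts(y_true, y_pred))
# -> {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 2}
```

A train score only slightly above the test score, as in the output above, suggests the model is not overfitting much.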
