[Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/data)
======

## Data Set

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, another 50,000 IMDB reviews are provided without any rating labels.
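
The labeling rule can be sketched as a small function. This is only an illustration: `rating_to_sentiment` is a hypothetical helper, not part of the dataset or of this notebook, and it returns `None` for ratings of 5 and 6 because the description does not assign them a label.

```python
def rating_to_sentiment(rating):
    """Map an IMDB rating (1-10) to the binary label described above."""
    if rating < 5:
        return 0      # negative review
    if rating >= 7:
        return 1      # positive review
    return None       # ratings 5 and 6 are not covered by the labeling rule


for rating in (1, 4, 5, 7, 10):
    print(rating, rating_to_sentiment(rating))
```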

## File descriptions

labeledTrainData - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review.

## Data fields

* id - Unique ID of each review
* sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
* review - Text of the review

## Objective
The objective is to predict **sentiment** (positive or negative) from the **review** text, so X is the **review** column and y is the **sentiment** column.

## 1. Load Dataset
We focus on the "labeledTrainData.tsv" file for training and evaluation.

Let's first have a look at the data.

[Click here to download dataset](https://s3-ap-southeast-1.amazonaws.com/ml101-khanhnguyen/week3/assignment/labeledTrainData.tsv)

In [1]:
# Import pandas, numpy
import numpy as np
import pandas as pd

In [2]:
# Read the tab-separated dataset with sep='\t' and encoding='latin-1'
sentiment = pd.read_csv('labeledTrainData.tsv', sep='\t', encoding='latin-1')
sentiment.sample(10)

Unnamed: 0,id,sentiment,review
12053,7834_3,0,i usually don't write reviews but i can't unde...
4329,10958_4,0,Having watched this after receiving the DVD fo...
9508,11134_9,1,"Acidic, unremitting, and beautiful, John Schle..."
24781,7091_1,0,This film was so amateurish I could hardly bel...
24924,1333_1,0,I was utterly disappointed by this movie. I ha...
3828,8384_1,0,When I saw that IMDb users rated this movie th...
15749,2370_10,1,It's a great American martial arts movie. The ...
4328,1145_1,0,"Not much to say beyond the summary, save that ..."
16414,4972_9,1,LES CONVOYEURS ATTENDENT was the first film I ...
9258,2828_9,1,"After tracking it down for half a year, I fina..."


In [3]:
sentimentTestNoLabel = pd.read_csv('testData.tsv', sep='\t', encoding='latin-1')
sentimentTestNoLabel.sample(10)

Unnamed: 0,id,review
19423,6644_10,"A gruelling watch, but one of Bergman's finest..."
19614,8372_1,This really is by far the worst movie I've eve...
9716,1525_1,Doesn't anyone bother to check where this kind...
22901,10891_10,GoldenEye 007 is not only the best movie tie-i...
3989,2210_4,"This movie is total dumbness incarnate. Yet, i..."
4322,7626_7,i was 9 when i first saw this on TV. on a Frid...
16064,3447_3,Coen Brothers-wannabe from writer-director Pau...
596,12025_1,How in the world does a thing like this get in...
19560,9635_3,This representation of the popular children's ...
18261,739_10,"Although it isn't mentioned very often, \Don't..."


## 2. Preprocessing

In [4]:
# stop words
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to /home/hgn/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
from collections import Counter
vocab = Counter()
vocab_reduced = Counter()

for word in sentiment['review'].str.cat(sep=' ').split():
    vocab[word] += 1

for word, repeat in vocab.items():
    if word not in stop:
        vocab_reduced[word] = repeat
        
vocab_reduced.most_common(20)

[('I', 65973),
 ('/><br', 50935),
 ('The', 33762),
 ('movie', 30496),
 ('film', 27394),
 ('one', 20685),
 ('like', 18133),
 ('This', 12279),
 ('would', 11922),
 ('good', 11435),
 ('It', 10952),
 ('really', 10814),
 ('even', 10607),
 ('see', 10154),
 ('-', 9355),
 ('get', 8776),
 ('story', 8523),
 ('much', 8507),
 ('time', 7762),
 ('make', 7485)]
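
The counts above still contain capitalized stop words such as `I`, `The` and `This`, plus the HTML fragment `/><br`. The NLTK stop-word list is lowercase, and the text was split on whitespace without any cleaning. Below is a minimal sketch of counting after tag stripping and lower-casing. `sample_reviews`, the tiny `stop_words` set, and `count_words` are made-up stand-ins so the snippet runs without the dataset or NLTK.

```python
import re
from collections import Counter

# Hypothetical stand-ins for sentiment['review'] and stopwords.words('english')
sample_reviews = [
    "I loved this movie.<br /><br />The story is great.",
    "This film is bad. I would not watch it again.",
]
stop_words = {"i", "the", "this", "is", "it", "not", "would"}


def count_words(reviews, stop_words):
    """Count words after removing HTML tags and lower-casing."""
    counts = Counter()
    for review in reviews:
        # Replace tags with a space so neighbouring words do not merge
        text = re.sub(r"<[^>]*>", " ", review).lower()
        for word in re.findall(r"[a-z']+", text):
            if word not in stop_words:
                counts[word] += 1
    return counts


vocab_clean = count_words(sample_reviews, stop_words)
print(vocab_clean.most_common(5))
```

With this ordering the stop-word filter sees `i` and `the` instead of `I` and `The`, so they are dropped as intended.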

In [6]:
# Remove special characters and other noise
import re
def preprocessor(text):
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Save emoticons for later appending
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', '', text)
    # Replace runs of non-word characters with a space and append the emoticons,
    # removing the nose character for standardization. Convert to lower case
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', ''))
    text = re.sub(' +', ' ', text)
    
    return text
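
To make the cleaning steps concrete, here is a short usage sketch. It repeats the `preprocessor` function (with raw-string patterns) so the snippet runs on its own. The sample strings are invented for illustration.

```python
import re


def preprocessor(text):
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D, then remove them from the text
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', '', text)
    # Lowercase, replace non-word runs with a space, append nose-less emoticons
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '
            + ' '.join(emoticons).replace('-', ''))
    # Collapse repeated spaces
    return re.sub(' +', ' ', text)


print(repr(preprocessor("<p>This movie is GREAT :-) </p>")))
# 'this movie is great :)'
print(repr(preprocessor("Good film")))
# 'good film '
```

When no emoticon is found the result keeps one trailing space. That does not matter for the tokenizer, because `str.split()` ignores it.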

In [7]:
# tokenizer and stemming
# tokenizer: break each review down into individual words
# stemming: reduce a word to its root
from nltk.stem import PorterStemmer
porter = PorterStemmer()
def tokenizer(text):
    # Clean the text, then split on whitespace
    text = preprocessor(text)
    token = text.split()
    return token

def tokenizer_porter(text):
    # Tokenize, then stem every token with the Porter stemmer
    token = []
    for word in tokenizer(text):
        token.append(porter.stem(word))
    return token


In [10]:
# Split the labeled data into train (70%) and test (30%) sets
from sklearn.model_selection import train_test_split

X = sentiment['review']
y = sentiment['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

## 3. Create Model and Train 

A **Pipeline** chains the **tfidf** step and the **LogisticRegression** step, so both are fitted together.

In [12]:
# Import Pipeline, LogisticRegression, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer


tfidf = TfidfVectorizer(stop_words=stop,
                        tokenizer=tokenizer_porter,
                        preprocessor=preprocessor)

clf = Pipeline([('vect', tfidf),
                ('clf', LogisticRegression(random_state=0))])
clf.fit(X_train, y_train)

  'stop_words.' % sorted(inconsistent))


Pipeline(memory=None,
         steps=[('vect',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=<function preprocessor at 0x7fd2f0442620>,
                                 smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', '...
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenizer_porter at 0x7fd2c4610b70>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
         

## 4. Evaluate Model

In [13]:
# Using Test dataset to evaluate model
# classification_report
# confusion matrix
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_predict = clf.predict(X_test)
accuracy_score(y_test, y_predict)

0.8836

In [14]:
confusion_matrix(y_test, y_predict)

array([[3270,  505],
       [ 368, 3357]])

In [15]:
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.90      0.87      0.88      3775
           1       0.87      0.90      0.88      3725

    accuracy                           0.88      7500
   macro avg       0.88      0.88      0.88      7500
weighted avg       0.88      0.88      0.88      7500
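
The figures in the report can be checked by hand from the confusion matrix. The sketch below copies the four counts from the output above. In scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels. The variable names are chosen only for this illustration.

```python
# Counts copied from the confusion matrix above
# (rows = true label, columns = predicted label)
tn, fp = 3270, 505    # true class 0: predicted 0, predicted 1
fn, tp = 368, 3357    # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Class 1 (positive reviews)
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)

# Class 0 (negative reviews)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy          {accuracy:.4f}")
print(f"class 0 precision {precision_neg:.2f}  recall {recall_neg:.2f}")
print(f"class 1 precision {precision_pos:.2f}  recall {recall_pos:.2f}")
```

The rounded values agree with the accuracy of 0.8836 and with the per-class precision and recall in the classification report.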



## 5. Export Model 

In [17]:
# Use pickle to export the trained model
import pickle

# The context manager closes the file once the model is written
with open('logisticRegression.pkl', 'wb') as f:
    pickle.dump(clf, f, protocol=4)
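
Loading the model back follows the same pattern in reverse. The sketch below shows a full round trip with a plain dictionary standing in for the fitted pipeline, so it runs without the trained `clf`. The stand-in object and the temporary file name are assumptions made for the example.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline; any picklable object round-trips the same way
model = {"name": "logisticRegression", "classes": [0, 1]}

path = os.path.join(tempfile.gettempdir(), "logisticRegression_demo.pkl")

# Export
with open(path, "wb") as f:
    pickle.dump(model, f, protocol=4)

# Import
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
os.remove(path)  # clean up the demo file
```

Pickle stores functions by reference, so a process that loads the real pipeline would also need `preprocessor` and `tokenizer_porter` to be importable under the same names. Only unpickle files from a trusted source.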

## 6. Predict Unlabeled Data

In [20]:
sentimentTestNoLabel['predict review'] =  clf.predict(sentimentTestNoLabel['review'])

In [22]:
# None shows the full column content; the older value -1 is deprecated in recent pandas
pd.set_option('display.max_colwidth', None)
sentimentTestNoLabel.sample(20)

Unnamed: 0,id,review,predict review
557,11818_4,"The plot for Black Mama White Mama, revolves around two female inmates, at a women's prison in the Phillipines. One Black, and one White. These two women, are thrown together in the prison. Pam Grier is Lee Daniels Lee is incarcerated in the hellish women's prison, for dancing as a harem girl. <br /><br />Lee's boyfriend owes her part of his profits, from his drug-dealing activities. Lee is mainly interested in breaking out of the prison to get hold of her beau's drug money, so that she can leave the Phillipines and assume a better life. Margaret Markov plays Karen Brent, a white women from a privileged background, who is also a revolutionary. Karen has joined a group of revolutionaries, determined to change the corrupt Phillipino political system. She's captured by Phillipino authorities, and held as a political prisoner.<br /><br />The story-line takes-off, when Karen and Lee break out of the prison they were in together. The two of them also happened to be chained together at the wrist. As they flee, they also fight with each other, because they have different goals to pursue. Naturally, they hate being chained together. But they also realize that they must put aside their differences, to help each other survive while they evade capture.<br /><br />If this film seems very similar to The Big Bird Cage, it's because much of the cast in the two films is the same, as well as their location in the Phillipines. Roger Corman, has always had a consistent stable of actors, that he used in all of his 70s B movies. Besides Pam Grier, Sid Haig, Roberta Collins, Claudia Jennings, Betty Anne Rees, and William Smith, were also among the many actors that were frequently cast, in Corman's AIP films.<br /><br />Like The Big Bird Cage, Black Mama White Mama, relies on too much gory violence to be palatable. Pam Grier conveys her usual tough chick persona in this film, and shows her competence as a female action heroine. 
Margaret Markov is less effect, in her portrayal of the revolutionary Karen. She just seems to fragile and well-coiffed, to be a dedicated political guerrilla. Except for Sid Haig, as the colorful Ruben, the rest of the cast is forgettable.<br /><br />This film has little entertainment value, unless excessive, heinous acts of violence are your thing. Only the performances by Pam Grier and Sig Haig, make this film worth watching.",1
20836,1237_2,"I must have missed a part of this movie... I found myself asking who is this? And, when did that happen? It seemed to jump around but I kept watching for fear I was missing something and it would all be explained to me. I loved Lonesome Dove but this movie made no sense to me at all. I did love all the actors but what happened to the rest of the movie? It made me go \what\""? at the beginning of each part..As far as the scenery - I thought it was fine..It made me feel though like I was leafing through a book and leaving pages out.. The ending had me a little confused too although I imagine the boy was waiting for his father and was meant to leave you wondering if his father would finally come home to his son and be a father since his mother was now gone..I would like to read the book just to see what I missed in the movie..I don't expect this one to win any awards.""",1
5691,5339_1,"I've been watching a lot of Asian horror movies lately, but this one has to be the worst so far. It started out interestingly enough, but lost momentum after the first 15 minutes of the movie. The added \drama\"" scenes, flashback sequences and serious plot holes left me hanging. What really happened in the tunnel? Just \""something terrible\""??? Who started all the killing if it wasn't the ghost? What did she want returned to her????? No answers whatsoever! Overall, not very scary at all and the movie makers need to come up with a lot better ideas than this...<br /><br />One positive was the cute actress, but that's about it.<br /><br />Not recommended.""",0
13968,183_4,"He who fights with monsters might take care lest he thereby become a monster. And if you gaze for long into an abyss, the abyss gazes also into you.<br /><br />Yes, this is from Nietzsche's Aphorism 146 from \Beyond Good and Evil\"". And that's what you find at the start of this movie.<br /><br />If you watch the whole movie, you will doubt if it was the message that the Ram Gopal Varma Production wanted to pass on. As the scenes crop up one by one, quite violent and at times puke-raking, the viewer is expected to forget the Nietzsche quote and think otherwise. That to deal with few people you need dedicated people like Sadhu Agashe who will have the licence to kill anyone, not just writing FIRs (something unworthy of the police to do, as we are made to believe).<br /><br />When TADA was repealed and the government wanted to pass newer and even more draconian laws, RGV's \""Satya\"" did the required brain surgery without blood transfusion for the multiplex growing thinking urban crowd whose views matters in a democratic country like India. Within a year MCOCA was passed.<br /><br />When real life encounter specialist Daya Nayak 'became a monster on the path of fighting them' and was himself booked by MCOCA, \""Aab tak Chhappan\"" was made to heed out \""false\"" impression among the people about this. With it's \""you have to be a monster to save your nation\"" approach.<br /><br />And people consumed it. No questions raised. Only praises and hopes that they get a Sadhu Agashe in their local police station who will solve all problems and hence let only milk and butter flow all over. Blood? You can ignore.<br /><br />Every time Israel attacks Palestine or Lebanon, we hear voices like \""India must also similarly attack Pakistan\"". This movie is made for such psychopaths. If you don't give them this, they will probably die out of boredom and LSD and what not.<br /><br />Hence this game of the passion of hatred.""",0
4622,12206_10,"Surface was awesome, I don't know how many Mondays I survived at school just by thinking about the new episode of surface. I loved it, sometimes I had to call home and tell my mom to tape it for me. I was pretty upset when I heard it was cancelled, I mean jeez way to let us hang. So,they can have their new Tina fay comedy(you couldn't pay me to watch that, I think seeing the commercials made me dumber). I'm gonna miss my Monday night fix of Surface, even if my sister did make fun of me. although,kidnapped does look good and, they still have L&O: SVU (i think, i still have to check) (i only wrote the 2 lines above, because they said i needed ten lines).",1
20983,2842_9,"A question immediately arises in this extremely idiosyncratic film: Who are the crazy people?<br /><br />The answer become less clear as the film goes on.<br /><br />Renee Zellweger loses the whiney note in her voice and, while her voice is still high, she is incredibly effective as the shell-shocked Betty. In fact, she is so effective I almost wanted her to be just a little more crazy because her created reality was so believable.<br /><br />This is the first time Ms Zellweger has been called upon to carry a film and she is more than equal to the task.<br /><br />Chris Rock Â though as foul-mouthed as usual Â is fairly subdued as Wesley. He is able to sublimate his manic energy and it only occasionally surfaces and always when it is needed most.<br /><br />There are some interesting allusions: the first time you see Betty she is dressed almost exactly like Dorothy Gale from the `Wizard of Oz' Â then later in the film she is compared to Dorothy when she says she has never been out of Kansas before. At one point the song that Doris Day was best known for, ÂQue Sera Sera' is on the soundtrack and then later Charlie (Morgan Freeman) describes her as having Âa whole Doris Day thing going on.'<br /><br />This is an extremely quirky film with good performances by everyone including the supporting cast.<br /><br />It has a surprising ending that, as contrary as it sounds, is actually fairly predictable.<br /><br />If for no other reason see this film just to listen to the master of the human voice: Morgan Freeman.",1
15541,2570_4,"Apart from some quite stunning scenery, this Steven Seagal vehicle is devoid of reasons to spend any time watching it. For a Seagal movie it has very little (almost no) action but he does put in some reasonable (for him) acting in contrived character development scenes. Not recommended. To anyone.",0
24819,939_4,"A quick, funny coming-of-age matinÃ©e romp appealing to the underdog aldolescent in us all. It functions, in effect, as a vehicle for Justin Long who has subsequently erupted onto our screens in the fourth Die Hard via PC vs Mac ads, Dodgeball and The Break Up. He's funny, earnest and young - a big career ahead.<br /><br />A town's worth of college wannabes find a fake website Bartelby (Long) has set up to delude his judgemental parents and descend on the 'college' like it were a short notice Facebook party. Lewis Black summises the anarchic philosophy as a stand-in Dean - Long's delinquent friends provide support for the subterfuge and consequent appeal to grander traditions of education and friendship (Adam Herschman deserves special mention for his never-flagging slapstick contribution). Well executed, feelgood and instantly forgettable. 4/10",1
12083,11669_10,"I'm astonished how a filmmaker notorious for his political left-wing fervor could make such a subtle, non-sanctimonious picture. If you're for capital punishment, you'll still be for it after seeing this. If you're against capital punishment, you'll still be against it. But whatever your stance is, this movie will, at the very least, make you reflect on why you feel the way you do. There's not one false note in the film.",1
13259,7272_1,"Finally, I can connect the dots between Return of the Jedi and Phantom Menace. We see here where Lucas lost touch with what made the original Star Wars films great and began to descend into the plot less tripe that ruined episodes 1-3. This film is more like one of those cheesy low-budget 80s swords and sorcerer films than anything worthy of being associated with the Star Wars saga. As with the Jar-Jar character, this seems targeted at children (and the toy market). The battle scenes are particularly bad. It was depressing to see Sian Phillips' incredible talent go to such a waste, after her classic performance in I, Claudius.",0
