# Bag of Words Meets Bags of Popcorn

[Kaggle Challenge](https://www.kaggle.com/c/word2vec-nlp-tutorial)
Use Google's Word2Vec for movie reviews

Deadline: 2019/01/05

## Data

[Reference](https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184)

For this analysis we’ll be using a dataset of 50,000 movie reviews taken from IMDb compiled by Andrew Maas. 

The data is split evenly with 25k reviews intended for training and 25k for testing your classifier. Moreover, each set has 12.5k positive and 12.5k negative reviews.

* Rating rule

IMDb lets users rate movies on a scale from 1 to 10. To label these reviews the curator of the data labeled anything with ≤ 4 stars as negative and anything with ≥ 7 stars as positive. Reviews with 5 or 6 stars were left out.
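
The labeling rule above can be sketched as a small function. This is only an illustration: `label_from_rating` is a hypothetical helper, not part of the dataset's tooling, and the 0/1 encoding is assumed to match the `sentiment` column used later.

```python
from typing import Optional


def label_from_rating(rating: int) -> Optional[int]:
    """Map an IMDb star rating (1-10) to a sentiment label.

    Returns 0 for negative (<= 4 stars), 1 for positive (>= 7 stars),
    and None for the 5- and 6-star reviews that were left out.
    """
    if not 1 <= rating <= 10:
        raise ValueError("rating must be between 1 and 10")
    if rating <= 4:
        return 0
    if rating >= 7:
        return 1
    return None


labels = [label_from_rating(r) for r in range(1, 11)]
print(labels)
```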

# Part 1: Bag of Words

NLP (Natural Language Processing) is a set of techniques for approaching text problems. This page will help you get started with loading and cleaning the IMDb movie reviews, then applying a simple Bag of Words model to get surprisingly accurate predictions of whether a review is thumbs-up or thumbs-down.

## Reading the Data

In [16]:
import time

import numpy as np
import pandas as pd

# header = 0: the first line of the file contains column names
# delimiter="\t": tab-delimited
# quoting=3: ignore double-quotes
train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
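
The value `quoting=3` is the numeric value of `csv.QUOTE_NONE` in Python's standard `csv` module. The sketch below uses a made-up two-line sample (not the real file) and the standard-library reader, rather than pandas, to show what that setting does with double quotes.

```python
import csv
import io

# Hypothetical sample laid out like labeledTrainData.tsv (tab-delimited, with a header).
sample = 'id\tsentiment\treview\n"1_8"\t1\t"A "great" film"\n'

reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

print(csv.QUOTE_NONE)  # the constant behind quoting=3
print(rows[1])         # double quotes are kept as ordinary characters
```

With quote handling switched off, a stray `"` inside a review cannot be mistaken for the start or end of a quoted field.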

In [17]:
train.shape

(25000, 3)

In [18]:
train.columns.values

array(['id', 'sentiment', 'review'], dtype=object)

## Data Cleaning and Text Preprocessing

The raw reviews contain HTML tags such as `<br />`, abbreviations, and punctuation.

In [19]:
from bs4 import BeautifulSoup

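# Check the first review; naming the parser explicitly avoids Beautiful Soup's "no parser specified" warning
example1 = BeautifulSoup(train['review'][0], "html.parser")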
#print(train['review'][0])
#print(example1.get_text())

For many problems, it makes sense to remove **punctuation**.  
In this case, we are tackling a sentiment analysis problem, and it is possible that "!!!" or ":-(" could carry sentiment, and should be treated as words. In this tutorial, we remove the punctuation altogether for simplicity, but it is something you can play with on your own.
  
Similarly, we will remove **numbers**, but there are other ways of dealing with them: treat them as words, or replace them all with a placeholder string such as "NUM".

In [20]:
import re

# Use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z]",         # search for
                        " ",               # replace with
                     example1.get_text())  # target text
#print(letters_only)
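
As an illustration of the placeholder alternative mentioned above, here is a minimal sketch that replaces each run of digits with "NUM" before dropping other non-letters. The helper name `letters_with_num_placeholder` is hypothetical, and the tutorial itself does not use this variant.

```python
import re


def letters_with_num_placeholder(text: str) -> str:
    """Replace each run of digits with NUM, then drop other non-letters."""
    # Mark numbers first so they survive the letters-only filter.
    text = re.sub(r"\d+", " NUM ", text)
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Collapse the runs of spaces left behind by the substitutions.
    return " ".join(text.split())


cleaned = letters_with_num_placeholder("I give it 10/10, watched it 3 times!")
print(cleaned)
```

If the text is lower-cased afterwards, the placeholder becomes "num", which is still a single consistent token.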

In [21]:
# convert reviews to lower case and split into individual words
lower_case = letters_only.lower()
words = lower_case.split()

Next, remove *stop words*: frequently occurring words that carry little meaning, such as "a", "is", and "the".

In [22]:
import nltk
#nltk.download()

In [23]:
from nltk.corpus import stopwords

#print(stopwords.words("english"))
stops = set(stopwords.words("english"))  # build the set once; membership tests are much faster
words = [w for w in words if w not in stops]

### Other operations

There are many other things we could do to the data. For example, **Porter Stemming** and **Lemmatizing** (both available in NLTK) would allow us to treat "messages", "message", and "messaging" as the same word, which could certainly be useful. However, for simplicity, the tutorial will stop here.

In [24]:
# Convert a raw review to a string of words
def review_to_words(raw_review):
    
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    
    # 2. Remove non-letters
    letters = re.sub("[^a-zA-Z]", " ", review_text)
    
    # 3. Convert to lower case, split into individual words
    words = letters.lower().split()
    
    # 4. Convert the stop words to a set (faster to search) and remove stop words
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if not w in stops]
    
    # 5. Join the words back into one string separated by space
    return (" ".join(meaningful_words))
    

In [25]:
#clean_review = review_to_words(train['review'][0])
#print(clean_review)

In [26]:
num_reviews = train['review'].size

clean_train_reviews = [ ]

for i in range(0, num_reviews):
#     if ((i + 1) % 1000 == 0):
#         print("Review %d of %d\n" % (i+1, num_reviews))
    clean_train_reviews.append(review_to_words(train['review'][i]))

## Creating Features from a Bag of Words

Convert the training reviews to a numeric representation for machine learning, using scikit-learn.  
**Bag of Words**: learn a vocabulary from all of the documents, then model each document by **counting the number of times each word appears**.
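
To make the idea concrete, here is a minimal pure-Python sketch on two toy sentences. It only illustrates the counting step; `CountVectorizer` does more (its own tokenization, sparse output, and the `max_features` cap), and the helper name `bag_of_words` is hypothetical.

```python
from collections import Counter

docs = ["the cat sat on the hat", "the dog ate the cat and the hat"]

# Vocabulary: every distinct word across all documents, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})


def bag_of_words(doc, vocab):
    """Count how often each vocabulary word appears in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
for vector in vectors:
    print(vector)
```

Each document becomes a vector with one entry per vocabulary word, so every review has the same length regardless of how many words it contains.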

In [27]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer = "word",
                            tokenizer = None,
                            preprocessor = None,
                            stop_words = None,
                            max_features = 5000)

# Fit the model, learn the vocabulary, and transform training data into feature vectors.
train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features = train_data_features.toarray()

train_data_features.shape

(25000, 5000)

In [28]:
# vocab = vectorizer.get_feature_names()
# dist = np.sum(train_data_features, axis = 0)

# for tag, count in zip(vocab, dist):
#     print(count, tag)

## Random Forest (Supervised Learning)

More trees may perform better, but will certainly take longer to run.

In [29]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(train_data_features, train['sentiment'])

## Creating a Submission

Run the trained Random Forest on the test set.

In [30]:
test = pd.read_csv("testData.tsv", header=0, delimiter="\t", quoting=3)

test.shape

(25000, 2)

In [31]:
clean_test_reviews = [ ]

for i in range(0, len(test['review'])):
    clean_test_reviews.append(review_to_words(test['review'][i]))
    
test_data_features = vectorizer.transform(clean_test_reviews).toarray()

In [32]:
result = forest.predict(test_data_features)

output = pd.DataFrame(data = {"id": test['id'], "sentiment": result})
output.to_csv("Bag_of_Words_model_tutorial.csv", index=False, quoting=3)

# Part 2: Word Vectors

Google's Word2Vec is a neural network implementation that learns **distributed representations** for words.
* Word2Vec learns quickly
* It does not need labels to create meaningful representations
* Words with similar meanings appear in clusters
* Word relationships such as analogies can be reproduced using vector math, *e.g. king - man + woman = queen*
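
The analogy arithmetic can be sketched with hand-made vectors. The 2-D values below are invented purely for illustration (they are not learned by Word2Vec), and cosine similarity is assumed as the measure of closeness.

```python
import math

# Hand-made toy vectors: first coordinate ~ "royalty", second ~ "gender".
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "movie": [-0.7, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# king - man + woman
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Rank the words not used in the query by similarity to the target.
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)
```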

In order for data to make sense for our machine learning algorithms, we need to convert each review to a numeric representation, which is called **vectorization**.

In [33]:
train = pd.read_csv( "labeledTrainData.tsv", header=0, delimiter="\t", quoting=3 )
unlabeled_train = pd.read_csv( "unlabeledTrainData.tsv", header=0, delimiter="\t", quoting=3 )

test = pd.read_csv( "testData.tsv", header=0, delimiter="\t", quoting=3 )

In [None]:
print("Read %d labeled train reviews, %d labeled test reviews, and %d unlabeled reviews\n" 
      % (train["review"].size, 
         test["review"].size, 
         unlabeled_train["review"].size ))

Read 25000 labeled train reviews, 25000 labeled test reviews, and 50000 unlabeled reviews



To train Word2Vec, it's better not to remove stop words, since the algorithm relies on the **broader context of the sentence** to produce high-quality word vectors.

In [None]:
def review_to_wordlist(review, remove_stopwords = False):
    
    review_text = BeautifulSoup(review, "html.parser").get_text()
    review_text = re.sub("[^a-zA-Z]", " ", review_text)
    words = review_text.lower().split()
    
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if not w in stops]
    
    return(words)

In [None]:
# Split paragraph into sentences
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [None]:
def review_to_sentences(review, tokenizer, remove_stopwords = False):
    
    raw_sentences = tokenizer.tokenize(review.strip())
    
    sentences = []
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:       # skip empty sentences
            sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
            
    return sentences

In [None]:
sentences = []

for review in train['review']:
    sentences += review_to_sentences(review, tokenizer)

for review in unlabeled_train['review']:
    sentences += review_to_sentences(review, tokenizer)



In [None]:
print(sentences[1])

len(sentences)

## Training

In [None]:
import logging

logging.basicConfig(format = '%(asctime)s: %(levelname)s : %(message)s', level = logging.INFO)

In [None]:
# Parameter values
num_features = 300    # Word vector dimensionality                      
min_word_count = 40   # Minimum word count                        
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size                                                                   
downsampling = 1e-3   # Downsample setting for frequent words

In [None]:
from gensim.models import word2vec

model = word2vec.Word2Vec(sentences, workers = num_workers, size = num_features, min_count = min_word_count,
                         window = context, sample = downsampling)

# If you don't plan to train the model any further, calling init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)

model_name = "features300-minwords40-context10"
model.save(model_name)

## Explore Model Results

In [None]:
# Deduce which word in a set is most dissimilar from the others
model.wv.doesnt_match("man woman child movie".split())

In [None]:
model.wv.most_similar("awful")

## Part 3-1: Numeric Representation of Words

The Word2Vec model consists of a feature vector for each word, called "syn0".  
  
The number of rows in syn0 is the **number of words** in the model's vocabulary, and the number of columns corresponds to the **size of the feature vector**.  
Setting the minimum word count to 40 gave us a total vocabulary of 16,492 words with 300 features apiece. 
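
The effect of the minimum word count on the matrix shape can be sketched on a toy corpus. This is a simplified illustration, not gensim's implementation: `build_vocab` is a hypothetical helper, the corpus is made up, and the alphabetical ordering is chosen here only to keep the output deterministic.

```python
from collections import Counter


def build_vocab(sentences, min_count):
    """Keep words that occur at least min_count times across all sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return sorted(word for word, n in counts.items() if n >= min_count)


# Toy corpus: each sentence is a list of words, like the `sentences` list above.
toy_sentences = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "awful"],
    ["great", "acting"],
]

vocab = build_vocab(toy_sentences, min_count=2)
num_features = 300
shape = (len(vocab), num_features)  # rows = vocabulary words, columns = features

print(vocab)
print(shape)
```

Raising `min_count` shrinks the number of rows; the number of columns depends only on the chosen vector dimensionality.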

In [None]:
#model["king"]

## Part 3-2: From words to paragraphs - Vector Averaging

We need to find a way to take **individual word vectors** and transform them into a feature set that is the same length for every review.  
  
-> Use vector operations to combine the words in each review.  
Remove stop words here, because they add noise.
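
A minimal sketch of the averaging idea on toy data, using a plain dictionary in place of a trained model. The name `average_vector` and the 3-dimensional vectors are assumptions made for illustration.

```python
def average_vector(words, vectors, num_features):
    """Average the vectors of the words found in `vectors`; skip the rest."""
    total = [0.0] * num_features
    found = 0
    for word in words:
        if word in vectors:
            total = [t + x for t, x in zip(total, vectors[word])]
            found += 1
    if found == 0:
        # No known words: return the zero vector rather than dividing by zero.
        return total
    return [t / found for t in total]


toy_vectors = {"great": [1.0, 0.0, 2.0], "movie": [3.0, 2.0, 0.0]}

avg = average_vector(["great", "movie", "zzz"], toy_vectors, 3)
empty = average_vector(["zzz"], toy_vectors, 3)
print(avg)
print(empty)
```

Whatever the review length, the result always has `num_features` entries, which is what the classifier needs.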

In [None]:
def makeFeatureVec(words, model, num_features):
    
    # Empty array
    featureVec = np.zeros((num_features,), dtype = "float32")
    
    # index2word: List contains the names of the words in the model's vocabulary.
    word_count = 0.
    index2word_set = set(model.wv.index2word)
    
    for word in words:
        if word in index2word_set:
            word_count = word_count + 1.
            featureVec = np.add(featureVec, model.wv[word])
    
    # Get average; keep the zero vector when no word was in the vocabulary
    if word_count > 0:
        featureVec = np.divide(featureVec, word_count)
    
    return featureVec    

In [None]:
def getAvgFeatureVecs(reviews, model, num_features):
    
    counter = 0
    
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype = "float32")
    
    for review in reviews:
        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)
        counter = counter + 1
    
    return reviewFeatureVecs

Calculate average feature vectors for training and testing sets, using the functions we defined above.  
Notice that we now use stop word removal.

In [None]:
clean_train_reviews = []
for review in train['review']:
    clean_train_reviews.append(review_to_wordlist(review, remove_stopwords = True))
    
trainDataVecs = getAvgFeatureVecs(clean_train_reviews, model, num_features)

clean_test_reviews = []
for review in test['review']:
    clean_test_reviews.append(review_to_wordlist(review, remove_stopwords = True))
    
testDataVecs = getAvgFeatureVecs(clean_test_reviews, model, num_features)

Use the average paragraph vectors to train a random forest.

In [None]:
forest = RandomForestClassifier(n_estimators = 100)

forest = forest.fit(trainDataVecs, train['sentiment'])

In [None]:
result = forest.predict(testDataVecs)

output = pd.DataFrame(data = {"id": test["id"], "sentiment": result})
output.to_csv("Result/Word2Vec_AverageVectors.csv", index=False, quoting=3)

## Part 3-3: From words to paragraphs - Clustering

Word2Vec creates **clusters of semantically related words**, so another approach is to exploit the similarity of words within a cluster.  
  
Grouping vectors in this way is known as "**vector quantization**". To accomplish this, we first need to find the centers of the word clusters, which can be done by using a clustering algorithm such as K-Means.
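
The assignment step of vector quantization can be sketched as follows. The centroids here are fixed by hand for illustration, whereas K-Means would learn them from the data; `nearest_centroid` and the toy vectors are hypothetical.

```python
def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to `vector` (squared Euclidean distance)."""
    distances = [sum((x - c) ** 2 for x, c in zip(vector, centroid)) for centroid in centroids]
    return distances.index(min(distances))


# Hand-picked centroids and toy 2-D word vectors.
centroids = [[0.0, 0.0], [10.0, 10.0]]
toy_word_vectors = {
    "good": [1.0, 0.0],
    "great": [0.0, 1.0],
    "awful": [9.0, 10.0],
    "terrible": [10.0, 9.0],
}

assignments = {word: nearest_centroid(vec, centroids) for word, vec in toy_word_vectors.items()}
print(assignments)
```

Each word is then represented by its cluster number alone, which is the mapping the later cells build from the K-Means output.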

### K-Means

The one parameter we need to set is "K" (the number of clusters).  
  
Trial and error suggested that small clusters, with an average of only 5 words or so per cluster, gave better results than large clusters with many words.  
K-Means clustering with large K can be very slow; the following code took more than 40 minutes on my computer.

In [None]:
from sklearn.cluster import KMeans

start = time.time()

# Set k (num_clusters) to be 1/5th of the vocabulary size, or an average of 5 words per cluster
word_vectors = model.wv.vectors
num_clusters = int(word_vectors.shape[0] / 5)

kmeans_clustering = KMeans(n_clusters = num_clusters)
idx = kmeans_clustering.fit_predict(word_vectors)

end = time.time()
elapsed = end - start
print("Time taken for K Means clustering: ", elapsed, "seconds.")

The cluster assignment for each word is now stored in **idx**, and the vocabulary from our original Word2Vec model is still stored in **model.wv.index2word**.

In [None]:
# Create a Word / Index dictionary, mapping each vocabulary word to a cluster number.
word_centroid_map = dict(zip( model.wv.index2word, idx ))

In [None]:
# Print the words in the first five clusters
for cluster in range(0, 5):
    print("\nCluster %d" % cluster)

    # Collect every word whose cluster number matches
    words = [word for word, c in word_centroid_map.items() if c == cluster]

    print(words)

In [None]:
# This works just like Bag of Words but uses semantically related clusters instead of individual words
def create_bag_of_centroids( wordlist, word_centroid_map ):
    
    num_centroids = max( word_centroid_map.values() ) + 1
    
    bag_of_centroids = np.zeros( num_centroids, dtype="float32" )
    
    for word in wordlist:
        if word in word_centroid_map:
            index = word_centroid_map[word]
            bag_of_centroids[index] += 1

    return bag_of_centroids

# Bag of centroids: array of reviews, each with a number of features (=clusters)
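
A toy run of the same idea, written in pure Python so it can be tried without a trained model. The mapping and review below are made up, and `bag_of_centroids` is a simplified stand-in for the function above.

```python
def bag_of_centroids(wordlist, word_to_cluster, num_clusters):
    """Count how many words of the review fall into each cluster."""
    counts = [0] * num_clusters
    for word in wordlist:
        cluster = word_to_cluster.get(word)
        if cluster is not None:  # ignore words outside the vocabulary
            counts[cluster] += 1
    return counts


# Hypothetical word-to-cluster mapping with three clusters.
toy_map = {"good": 0, "great": 0, "bad": 1, "awful": 1, "movie": 2}

bag = bag_of_centroids(["great", "movie", "good", "acting", "awful"], toy_map, 3)
print(bag)
```

"great" and "good" land in the same feature, which is the difference from a plain Bag of Words, where they would be counted separately.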

In [None]:
train_centroid = np.zeros((train['review'].size, num_clusters), dtype="float32")

counter = 0
for review in clean_train_reviews:
    train_centroid[counter] = create_bag_of_centroids(review, word_centroid_map)
    counter += 1
    

test_centroid = np.zeros((test['review'].size, num_clusters), dtype="float32")

counter = 0
for review in clean_test_reviews:
    test_centroid[counter] = create_bag_of_centroids(review, word_centroid_map)
    counter += 1

In [None]:
forest = RandomForestClassifier(n_estimators = 100)

forest = forest.fit(train_centroid, train["sentiment"])

In [None]:
result = forest.predict(test_centroid)

output = pd.DataFrame(data={"id":test["id"], "sentiment":result})
output.to_csv("Result/Word2Vec_BagOfCentroids.csv", index=False, quoting=3)