# Word2vec

So, **bag-of-words** models may be surprisingly successful, but they are limited in what they can do. First and foremost, with bag-of-words models, words are encoded using one-hot-encoding. Instead, Word2vec, published by Google in 2013, is a neural network implementation that learns **distributed representations for words** (using vectors of real numbers). It does not need labels in order to create meaningful representations. This is useful, since most data in the real world is unlabeled. If the network is given enough training data (tens of billions of words), it produces word vectors with intriguing characteristics. Words with similar meanings appear in clusters, and clusters are spaced such that some word relationships, such as analogies, can be reproduced using vector math.


**We'll train a Word2vec model on our IMBD dataset and then we will use its word vectors to train our ML models and make predictions.**

## Fetch the data

The data files are located in the **data** folder. The training set contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [train/pos/200_8.txt] is the text for a positive-labeled train set example with unique id 200 and star rating 8/10 from IMDb.

Let's write some functions to get and store the dato:

In [1]:
import os
import pandas as pd

def load_imbd_dataset(data_path, unsup = False):
    """
    Load the IMDb dataset into Pandas DataFrames.

    Parameters:
        data_path (str): The root directory where the IMDb dataset is stored.
        unsup (bool): Whether the data is labeled or not
        
    Returns:
        df (pandas.DataFrame): A DataFrame containing the reviews and their labels.
    """
    reviews = []
    labels = []
    
    if not unsup:
        for label in ['pos', 'neg']:
            label_dir = os.path.join(data_path, label)
            for filename in os.listdir(label_dir):
                filepath = os.path.join(label_dir, filename)
                with open(filepath, 'r', encoding='utf-8') as file:
                    review_text = file.read()
                rating = int(filename.split('_')[1].split('.')[0])
                sentiment = 1 if label == 'pos' else 0
                reviews.append(review_text)
                labels.append(sentiment)
        df = pd.DataFrame({'review': reviews, 'sentiment': labels})
        return df
    else:
        label_dir = os.path.join(data_path, 'unsup')
        for filename in os.listdir(label_dir):
                filepath = os.path.join(label_dir, filename)
                with open(filepath, 'r', encoding='utf-8') as file:
                    review_text = file.read()
                reviews.append(review_text)
        df = pd.DataFrame({'review': reviews})
        return df

In [2]:
train_set = load_imbd_dataset('data/train')
test_set = load_imbd_dataset('data/test')
unlabeled_train_set = load_imbd_dataset('data/train', unsup=True)

In [3]:
train_set.shape

(25000, 2)

## Data Cleaning and Text Preprocessing
We will implement all the preprocessing steps as Transformers, so then we can apply them in a preprocessing Pipeline.


**Important:** to train Word2Vec **it is better not to remove stop words** because the algorithm relies on the broader context of the sentence in order to produce high-quality word vectors.

### Removing HTML Markup
First, we'll remove the HTML tags. We will use the **Beautiful Soup** package. 

In [4]:
from bs4 import BeautifulSoup
from sklearn.base import BaseEstimator, TransformerMixin

class HTMLTagRemover(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # Function to remove HTML tags from each element in the input X
        def remove_html_tags(html_text):
            soup = BeautifulSoup(html_text, 'html.parser')
            return soup.get_text()
        
        return [remove_html_tags(text) for text in X]

### Dealing with Punctuation, Numbers and Stopwords

When considering text cleaning, it is essential to tailor the approach to the specific data problem we aim to solve. For certain tasks, removing punctuation can be beneficial, but in the context of sentiment analysis, expressions like "!!!" or ":-(" might contain sentiment and could be treated as words. Nevertheless, for simplicity, we will proceed with punctuation removal.

Similarly, we'll exclude numbers, although alternative methods exist, such as treating them as words or substituting them with a placeholder like "NUM."

To execute the punctuation and number removal, we'll leverage the re package, which handles regular expressions. Additionally, we'll tokenize the reviews, breaking them down into individual words.

Lastly, we must address frequently occurring words that carry little meaning, known as "stop words". In English, these encompass words like "a," "and," "is," and "the." Fortunately, Python packages like the Natural Language Toolkit (NLTK) provide built-in stop word lists that we can utilize by importing them.

In [5]:
import re
import string
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

class TextCleaner(BaseEstimator, TransformerMixin):
    def __init__(self, remove_stopwords = False):
        self.remove_stopwords = remove_stopwords
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # Function to clean text (lowercase, remove numbers and punctuation)
        def clean_text(text):
            if isinstance(text, str):
                text = text.lower()

                # Remove numbers using regex
                text = re.sub(r'\d+', '', text)

                # Remove punctuation using string library
                text = text.translate(str.maketrans('', '', string.punctuation))

                # Split the text into words
                words = text.split()
                if self.remove_stopwords:
                    # Remove stop words from "words"
                    stops = set(stopwords.words("english"))   
                    words = [w for w in words if not w in stops]

                # Returns a list of words
                return words
            
            elif isinstance(text, list):
                word_list = []
                for sentence in text:
                    sentence = sentence.lower()

                    # Remove numbers using regex
                    sentence = re.sub(r'\d+', '', sentence)

                    # Remove punctuation using string library
                    sentence = sentence.translate(str.maketrans('', '', string.punctuation))

                    # Split the text into words
                    words = sentence.split()
                    if self.remove_stopwords:
                        # Remove stop words from "words"
                        stops = set(stopwords.words("english"))   
                        words = [w for w in words if not w in stops]
                    
                    word_list.append(words)
                    
                # Returns a list of lists of words
                return word_list

        return [clean_text(text) for text in X]


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Agustin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Sentence Tokenizer

Word2Vec expects single sentences, each one as a list of words. In other words, the input format is a list of lists.

It is not at all straightforward how to split a paragraph into sentences. There are all kinds of gotchas in natural language. English sentences can end with "?", "!", """, or ".", among other things, and spacing and capitalization are not reliable guides either. For this reason, we'll use NLTK's **punkt** tokenizer for sentence splitting.

In [6]:
class SentenceTokenizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        nltk.download('punkt')
        self.tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        def tokenize_sentence(text):
            # Use the NLTK tokenizer to split the paragraph into sentences
            raw_sentences = self.tokenizer.tokenize(text.strip())
            
            sentences = []
            for raw_sentence in raw_sentences:
                if len(raw_sentence) > 0:
                    sentences.append(raw_sentence)
                    
            return sentences
                    
        return [tokenize_sentence(text) for text in X]

We have to apply this transformer before applying our DataCleaner Transformer so that we obtain a list of words for each sentence in a review.

### Building the Preprocessing Pipeline

In [7]:
from sklearn.pipeline import Pipeline

review_to_sentences_pipeline = Pipeline([
    ('html_tag_remover', HTMLTagRemover()),
    ('sentence_tokenizer', SentenceTokenizer()),
    ('text_cleaner', TextCleaner()),
])

review_to_wordlist_pipeline = Pipeline([
    ('html_tag_remover', HTMLTagRemover()),
    ('text_cleaner', TextCleaner()),
])

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Agustin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Now we can use this pipeline to transform our training set. Let's prepare our training set and preprocess it.

In [8]:
X_train_labeled, y_train = train_set["review"], train_set["sentiment"]
X_test, y_test = test_set["review"], test_set["sentiment"]
X_train_unlabeled = unlabeled_train_set["review"]

X_train = pd.concat([X_train_labeled, X_train_unlabeled])

# Transform our training set so we can train our Words2vec model.
# We will later apply other transformations to our training data to extract meaninful information
# from our W2V model.
X_train_w2v = review_to_sentences_pipeline.fit_transform(X_train)

  soup = BeautifulSoup(html_text, 'html.parser')


 Now we can extract the sentences of both lists into a single list:

In [9]:
sentences = []
for i in range(len(X_train_w2v)):
    sentences += X_train_w2v[i]
    
print(len(sentences))

812440


We have around 798 thousands sentences.

## Training and Saving our Words2vec Model

With the list of nicely parsed sentences, we're ready to train the model. There are a number of parameter choices that affect the run time and the quality of the final model that is produced. For details on the algorithms below, see the word2vec [API documentation](https://radimrehurek.com/gensim/models/word2vec.html) as well as the [Google documentation](https://code.google.com/archive/p/word2vec/).

- **Architecture**: Architecture options are skip-gram (default) or continuous bag of words. We found that skip-gram was very slightly slower but produced better results.
- **Training algorithm**: Hierarchical softmax (default) or negative sampling. For us, the default worked well.
- **Downsampling of frequent words**: The Google documentation recommends values between .00001 and .001. For us, values closer 0.001 seemed to improve the accuracy of the final model.
- **Word vector dimensionality**: More features result in longer runtimes, and often, but not always, result in better models. Reasonable values can be in the tens to hundreds; we used 300.
- **Context / window size**: How many words of context should the training algorithm take into account? 10 seems to work well for hierarchical softmax (more is better, up to a point).
- **Worker threads**: Number of parallel processes to run.
- **Minimum word count**: This helps limit the size of the vocabulary to meaningful words. Any word that does not occur at least this many times across all documents is ignored. Reasonable values could be between 10 and 100. In this case, since each movie occurs 30 times, we set the minimum word count to 40, to avoid attaching too much importance to individual movie titles. This resulted in an overall vocabulary size of around 15,000 words. Higher values also help limit run time.

In [10]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Set values for various parameters
num_features = 300    # Word vector dimensionality                      
min_word_count = 40   # Minimum word count                        
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model
from gensim.models import word2vec
print("Training model...")
model = word2vec.Word2Vec(sentences, workers=num_workers, vector_size=num_features, min_count = min_word_count, 
                          window = context, sample = downsampling)

# If you don't plan to train the model any further, calling 
# init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and 
# save the model for later use. You can load it later using Word2Vec.load()
model_name = "300features_40minwords_10context"
model.save(model_name)

2023-08-06 12:28:19,820 : INFO : collecting all words and their counts
2023-08-06 12:28:19,821 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2023-08-06 12:28:19,867 : INFO : PROGRESS: at sentence #10000, processed 217680 words, keeping 18588 word types
2023-08-06 12:28:19,917 : INFO : PROGRESS: at sentence #20000, processed 441935 words, keeping 28870 word types
2023-08-06 12:28:19,966 : INFO : PROGRESS: at sentence #30000, processed 661959 words, keeping 36734 word types
2023-08-06 12:28:20,013 : INFO : PROGRESS: at sentence #40000, processed 867989 words, keeping 42994 word types


Training model...


2023-08-06 12:28:20,067 : INFO : PROGRESS: at sentence #50000, processed 1084246 words, keeping 48774 word types
2023-08-06 12:28:20,115 : INFO : PROGRESS: at sentence #60000, processed 1304628 words, keeping 54830 word types
2023-08-06 12:28:20,168 : INFO : PROGRESS: at sentence #70000, processed 1525673 words, keeping 60382 word types
2023-08-06 12:28:20,221 : INFO : PROGRESS: at sentence #80000, processed 1747403 words, keeping 65838 word types
2023-08-06 12:28:20,275 : INFO : PROGRESS: at sentence #90000, processed 1958677 words, keeping 70537 word types
2023-08-06 12:28:20,329 : INFO : PROGRESS: at sentence #100000, processed 2179880 words, keeping 75221 word types
2023-08-06 12:28:20,385 : INFO : PROGRESS: at sentence #110000, processed 2393830 words, keeping 79727 word types
2023-08-06 12:28:20,436 : INFO : PROGRESS: at sentence #120000, processed 2614721 words, keeping 84197 word types
2023-08-06 12:28:20,490 : INFO : PROGRESS: at sentence #130000, processed 2830967 words, keep

2023-08-06 12:28:24,800 : INFO : PROGRESS: at sentence #770000, processed 16208738 words, keeping 273307 word types
2023-08-06 12:28:24,877 : INFO : PROGRESS: at sentence #780000, processed 16421644 words, keeping 275473 word types
2023-08-06 12:28:24,952 : INFO : PROGRESS: at sentence #790000, processed 16639359 words, keeping 278120 word types
2023-08-06 12:28:25,024 : INFO : PROGRESS: at sentence #800000, processed 16853982 words, keeping 280485 word types
2023-08-06 12:28:25,098 : INFO : PROGRESS: at sentence #810000, processed 17076297 words, keeping 283066 word types
2023-08-06 12:28:25,117 : INFO : collected 283624 word types from a corpus of 17125640 raw words and 812440 sentences
2023-08-06 12:28:25,120 : INFO : Creating a fresh vocabulary
2023-08-06 12:28:25,355 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=40 retains 16561 unique words (5.84% of original 283624, drops 267063)', 'datetime': '2023-08-06T12:28:25.355754', 'gensim': '4.3.0', 'python': '3.10.12 |

2023-08-06 12:29:11,945 : INFO : EPOCH 2 - PROGRESS: at 24.80% examples, 599319 words/s, in_qsize 7, out_qsize 0
2023-08-06 12:29:12,946 : INFO : EPOCH 2 - PROGRESS: at 30.14% examples, 602164 words/s, in_qsize 7, out_qsize 0
2023-08-06 12:29:13,952 : INFO : EPOCH 2 - PROGRESS: at 35.19% examples, 601614 words/s, in_qsize 7, out_qsize 0
2023-08-06 12:29:14,976 : INFO : EPOCH 2 - PROGRESS: at 40.19% examples, 601729 words/s, in_qsize 7, out_qsize 0
2023-08-06 12:29:15,981 : INFO : EPOCH 2 - PROGRESS: at 45.28% examples, 604908 words/s, in_qsize 7, out_qsize 0
2023-08-06 12:29:16,991 : INFO : EPOCH 2 - PROGRESS: at 50.29% examples, 605885 words/s, in_qsize 7, out_qsize 0
2023-08-06 12:29:17,991 : INFO : EPOCH 2 - PROGRESS: at 55.25% examples, 606494 words/s, in_qsize 7, out_qsize 0
2023-08-06 12:29:18,995 : INFO : EPOCH 2 - PROGRESS: at 60.29% examples, 607912 words/s, in_qsize 7, out_qsize 0
2023-08-06 12:29:19,999 : INFO : EPOCH 2 - PROGRESS: at 65.24% examples, 607945 words/s, in_qsiz

2023-08-06 12:30:09,737 : INFO : not storing attribute cum_table
2023-08-06 12:30:09,820 : INFO : saved 300features_40minwords_10context


### Exploring the models results
Let's take a look at the model we created out of our 75,000 training reviews.

In [11]:
print(model.wv.doesnt_match("man woman child kitchen".split()))

kitchen


Our model is capable of distinguishing differences in meaning.

## Numeric Representation of Words
The Word2Vec model trained consists of a feature vector for each word in the vocabulary, stored in a numpy array called "vectors":

In [12]:
type(model.wv.vectors)

numpy.ndarray

In [13]:
model.wv.vectors.shape

(16561, 300)

The number of rows in vectors is the number of words in the model's vocabulary, and the number of columns corresponds to the size of the feature vector, which we set before.  Setting the minimum word count to 40 gave us a total vocabulary of 16,492 words with 300 features apiece.

One challenge with the IMDB dataset is the variable-length reviews. We need to find a way to take individual word vectors and transform them into a feature set that is the same length for every review.

### Option 1: Vector Averaging

Since each word is a vector in 300-dimensional space, we can use vector operations to combine the words in each review. One method we tried was to simply average the word vectors in a given review  (for this purpose, we will remove stop words, which would just add noise).

In [14]:
import numpy as np

class VectorAverageVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, w2v_model, num_features):
        self.w2v_model = w2v_model
        self.num_features = num_features
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        def make_feature_vector(words):
            # Function to average all of the word vectors in a given review
            
            # Pre-initialize an empty numpy array (for speed)
            featureVec = np.zeros((self.num_features,),dtype="float32")
        
            nwords = 0.
            
            # Index2word is a list that contains the names of the words in 
            # the model's vocabulary. Convert it to a set, for speed 
            index2word_set = set(self.w2v_model.wv.index_to_key)
            
            # Loop over each word in the review and, if it is in the model's
            # vocabulary, add its feature vector to the total
            for word in words:
                if word in index2word_set: 
                    nwords = nwords + 1.
                    featureVec = np.add(featureVec,self.w2v_model.wv[word])
                    
            # Divide the result by the number of words to get the average
            featureVec = np.divide(featureVec,nwords)
            
            return featureVec
        
        def getAvgFeatureVecs(reviews):
            # Given a set of reviews (each one a list of words), calculate 
            # the average feature vector for each one and return a 2D numpy array 
            
            counter = 0
            
            # Preallocate a 2D numpy array, for speed
            reviewFeatureVecs = np.zeros((len(reviews), self.num_features),dtype="float32") 
        
            for review in reviews:
                # Call the function (defined above) that makes average feature vectors
                reviewFeatureVecs[counter] = make_feature_vector(review)
                counter += 1
                
            return reviewFeatureVecs
        
        return getAvgFeatureVecs(X)

In [15]:
from gensim.models import Word2Vec
w2v_model = Word2Vec.load("300features_40minwords_10context")

vector_average_preprocessor = Pipeline([
    ('review_to_wordlist', review_to_wordlist_pipeline),
    ('vector_average_vectorizer', VectorAverageVectorizer(w2v_model, w2v_model.wv.vectors.shape[1])),
]);

# We will remove stopwords in this step, so we need to set this hyperparameter manually.
vector_average_preprocessor.set_params(review_to_wordlist__text_cleaner__remove_stopwords=True);

2023-08-06 12:30:09,919 : INFO : loading Word2Vec object from 300features_40minwords_10context
2023-08-06 12:30:09,964 : INFO : loading wv recursively from 300features_40minwords_10context.wv.* with mmap=None
2023-08-06 12:30:09,965 : INFO : setting ignored attribute cum_table to None
2023-08-06 12:30:10,424 : INFO : Word2Vec lifecycle event {'fname': '300features_40minwords_10context', 'datetime': '2023-08-06T12:30:10.424665', 'gensim': '4.3.0', 'python': '3.10.12 | packaged by Anaconda, Inc. | (main, Jul  5 2023, 19:09:20) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19045-SP0', 'event': 'loaded'}


In [16]:
X_train_transformed_vect_avrg = vector_average_preprocessor.fit_transform(X_train_labeled)
X_test_transformed_vect_avrg = vector_average_preprocessor.transform(X_test)

  soup = BeautifulSoup(html_text, 'html.parser')


In [17]:
X_train_transformed_vect_avrg.shape

(25000, 300)

#### Building our classification models

In [18]:
# Create a little validation dataset to perform EarlyStop
X_train_transformed_vect_avrg, X_valid_transformed_vect_avrg = X_train_transformed_vect_avrg[:20000], X_train_transformed_vect_avrg[20000:]
y_train, y_valid = y_train[:20000], y_train[20000:]

print(f"Train shape: {y_train.shape} - Valid shape: {y_valid.shape}")

Train shape: (20000,) - Valid shape: (5000,)


##### Random Forest

In [19]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Initialize a Random Forest classifier with 100 trees
rf_clf = RandomForestClassifier(n_estimators = 100)
cv = cross_val_score(rf_clf, X_train_transformed_vect_avrg, y_train, cv=5)
print(cv)
print(cv.mean())

[0.83075 0.81875 0.81875 0.823   0.80375]
0.819


#### FeedForward Neural Network (FNN)

In [20]:
import tensorflow as tf
import numpy as np
from tensorflow import keras
from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

def build_model(input_shape, n_hidden=1, n_neurons=30, learning_rate=3e-3):
    model = keras.models.Sequential()
    options = {"input_shape": input_shape}
    for layer in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu", **options))
        options = {}
    model.add(keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

keras_clf = KerasClassifier(build_model, input_shape=X_train_transformed_vect_avrg.shape[1:])

param_distribs = {
 "n_hidden": [0, 1, 2, 3],
 "n_neurons": np.arange(1, 100),
 "learning_rate": reciprocal(3e-4, 3e-2)
}

vect_avrg_rnd_search_cv =  RandomizedSearchCV(keras_clf, param_distribs, n_iter=10, cv=3)

# Callbacks
early_stopping_cb = keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)


vect_avrg_rnd_search_cv.fit(X_train_transformed_vect_avrg, y_train, epochs=20, 
                        validation_data=(X_valid_transformed_vect_avrg, y_valid), 
                        callbacks=[early_stopping_cb]);

Epoch 1/20


  keras_clf = KerasClassifier(build_model, input_shape=X_train_transformed_vect_avrg.shape[1:])


Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20


Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20


Epoch 7/20
Epoch 8/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20


Epoch 14/20
Epoch 15/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20


Epoch 11/20
Epoch 12/20
Epoch 13/20


In [32]:
vect_avrg_rnd_search_cv.best_score_

0.711654524008433

### Option 2: Clustering
Word2Vec creates clusters of semantically related words, so another possible approach is to exploit the similarity of words within a cluster. Grouping vectors in this way is known as "vector quantization." To accomplish this, we first need to find the centers of the word clusters, which we can do by using a clustering algorithm such as K-Means.

In [21]:
from sklearn.cluster import KMeans

# Set "k" (num_clusters) to be 1/5th of the vocabulary size
word_vectors = model.wv.vectors
num_clusters = int(word_vectors.shape[0] / 5)

# Initalize a k-means object and use it to extract centroids
kmeans_clustering = KMeans(n_clusters=num_clusters)
idx = kmeans_clustering.fit_predict(word_vectors)



The cluster assignment for each word is now stored in idx, and the vocabulary from our original Word2Vec model is still stored in `model.mv.index_to_key`. For convenience, we zip these into one dictionary as follows:

In [22]:
# Create a Word / Index dictionary, mapping each vocabulary word to
# a cluster number                                                                                            
word_centroid_map = dict(zip(model.wv.index_to_key,idx))

In [23]:
# For the first 10 clusters
for cluster in range(0,10):
    #
    # Print the cluster number  
    print(f"Cluster {cluster}")
    #
    # Find all of the words for that cluster number, and print them out
    words = []
    for i in range(0,len(word_centroid_map.values())):
        if(list(word_centroid_map.values())[i] == cluster ):
            words.append(list(word_centroid_map.keys())[i])
    print(words)

Cluster 0
['knows', 'thinks', 'cares', 'believes', 'understands', 'considers']
Cluster 1
['sand', 'moonlight', 'bushes']
Cluster 2
['professor', 'assistant', 'ex', 'ally', 'magician', 'knox', 'acquaintance', 'apprentice', 'physician', 'ivy']
Cluster 3
['horrors', 'atrocities', 'tragedies', 'genocide', 'symptoms']
Cluster 4
['enhance', 'contributing', 'heighten', 'relieve', 'reinforce']
Cluster 5
['rogers', 'solo', 'ruby', 'lil', 'merry', 'lucille', 'macmurray', 'dolly', 'burlesque', 'duet', 'rodgers', 'prima', 'keeler']
Cluster 6
['gregg', 'araki']
Cluster 7
['brady', 'evans', 'dale', 'meyers', 'hicks', 'perennial', 'kruger', 'trey', 'culkin', 'feldman', 'fosters', 'rowan', 'fenton', 'richie', 'talbot', 'mcdowall', 'joness', 'adrienne', 'brewster', 'robby', 'matthews', 'hammond', 'sawyer', 'rory', 'henson', 'damien', 'angelo']
Cluster 8
['go', 'sit', 'sat']
Cluster 9
['schedule']


Now we have a cluster (or "centroid") assignment for each word, and we can define a function to convert reviews into bags-of-centroids. This works just like Bag of Words but uses semantically related clusters instead of individual words:

In [24]:
class BagOfCentroidsVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, num_clusters, word_centroid_map):
        self.num_clusters = num_clusters
        self.word_centroid_map = word_centroid_map
        
    def fit(self, X, y=None):
        # Pre-allocate an array for the training set bags of centroids (for speed)
        self.train_centroids = np.zeros((len(X), self.num_clusters), dtype="float32")
        
        return self
    
    def transform(self, X, y=None):
        def create_bag_of_centroids(wordlists):
            for count, wordlist in enumerate(wordlists):
                # The number of clusters is equal to the highest cluster index
                # in the word / centroid map
                num_centroids = max(word_centroid_map.values()) + 1

                # Pre-allocate the bag of centroids vector (for speed)
                bag_of_centroids = np.zeros( num_centroids, dtype="float32" )

                # Loop over the words in the review. If the word is in the vocabulary,
                # find which cluster it belongs to, and increment that cluster count 
                # by one
                for word in wordlist:
                    if word in word_centroid_map:
                        index = word_centroid_map[word]
                        bag_of_centroids[index] += 1

                self.train_centroids[count] = bag_of_centroids
            return self.train_centroids
        
        return create_bag_of_centroids(X)

In [25]:
from gensim.models import Word2Vec
w2v_model = Word2Vec.load("300features_40minwords_10context")

boc_preprocessor = Pipeline([
    ('review_to_wordlist', review_to_wordlist_pipeline),
    ('bag_centroids_vectorizer', BagOfCentroidsVectorizer(num_clusters, word_centroid_map)),
]);

# We will remove stopwords in this step, so we need to set this hyperparameter manually.
boc_preprocessor.set_params(review_to_wordlist__text_cleaner__remove_stopwords=True);

2023-08-06 13:01:58,045 : INFO : loading Word2Vec object from 300features_40minwords_10context
2023-08-06 13:01:58,068 : INFO : loading wv recursively from 300features_40minwords_10context.wv.* with mmap=None
2023-08-06 13:01:58,068 : INFO : setting ignored attribute cum_table to None
2023-08-06 13:01:58,228 : INFO : Word2Vec lifecycle event {'fname': '300features_40minwords_10context', 'datetime': '2023-08-06T13:01:58.226989', 'gensim': '4.3.0', 'python': '3.10.12 | packaged by Anaconda, Inc. | (main, Jul  5 2023, 19:09:20) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19045-SP0', 'event': 'loaded'}


Here we will transform all our training data **(both our labeled and unlabeled data)**.

In [26]:
X_train_transformed_boc = boc_preprocessor.fit_transform(X_train_labeled)
X_test_transformed_boc = boc_preprocessor.transform(X_test)

  soup = BeautifulSoup(html_text, 'html.parser')


In [27]:
X_train_transformed_boc.shape

(25000, 3312)

#### Building our classification models

In [28]:
X_train_transformed_boc, X_valid_transformed_boc = X_train_transformed_boc[:20000], X_train_transformed_boc[20000:]

##### Random Forest

In [29]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Initialize a Random Forest classifier with 100 trees
rf_clf = RandomForestClassifier(n_estimators = 100)
cv = cross_val_score(rf_clf, X_train_transformed_boc, y_train, cv=5)
print(cv)
print(cv.mean())

[0.855   0.83375 0.84275 0.8345  0.83625]
0.8404499999999999


#### FeedForward Neural Network

In [30]:
keras_clf = KerasClassifier(build_model, input_shape=X_train_transformed_boc.shape[1:])

param_distribs = {
 "n_hidden": [0, 1, 2, 3],
 "n_neurons": np.arange(1, 100),
 "learning_rate": reciprocal(3e-4, 3e-2)
}

boc_rnd_search_cv =  RandomizedSearchCV(keras_clf, param_distribs, n_iter=10, cv=3)

# Callbacks
early_stopping_cb = keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)


boc_rnd_search_cv.fit(X_train_transformed_boc, y_train, epochs=20, 
                        validation_data=(X_valid_transformed_boc, y_valid), 
                        callbacks=[early_stopping_cb]);

  keras_clf = KerasClassifier(build_model, input_shape=X_train_transformed_boc.shape[1:])


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 1/20
Epoch 2/20
Epoch 3/20


Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20


Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20


In [33]:
boc_rnd_search_cv.best_score_

0.699600358804067

We found that the code above gives about the same (or slightly worse) results compared to the Bag of Words and TF-IDF.

## Comparing our models on the test set

In [31]:
# Evaluate the best models on the test set
rf_clf = RandomForestClassifier(n_estimators=100).fit(X_train_transformed_vect_avrg, y_train)
rf_vct_avrg_score = rf_clf.score(X_test_transformed_vect_avrg, y_test)
print(f"RF - Vector Average Accuracy: {rf_vct_avrg_score}")

fnn_avrg_score = vect_avrg_rnd_search_cv.score(X_test_transformed_vect_avrg, y_test)
print(f"FNN - Vector Average Accuracy: {fnn_avrg_score}")

rf_clf = RandomForestClassifier(n_estimators=100).fit(X_train_transformed_boc, y_train)
rf_boc_score = rf_clf.score(X_test_transformed_boc, y_test)
print(f"RF - Clustering Accuracy: {rf_boc_score}")

fnn_boc_score = boc_rnd_search_cv.score(X_test_transformed_boc, y_test)
print(f"FNN - Clustering Accuracy: {fnn_boc_score}")

RF - Vector Average Accuracy: 0.79928
FNN - Vector Average Accuracy: 0.8626518249511719
RF - Clustering Accuracy: 0.93224
FNN - Clustering Accuracy: 0.8975383639335632


We can see that the Clustering option outperforms Vector Average and also obtains much better results than Bag-of-Words and TF-IDF vectors. We might further improve our results if we had used some model like **Doc2vec** that directly outputs a representation vector for a review instead of having to compute the average or perform clustering.