## Fetch the data

The data files are located in the **data** folder. The training set contains `pos/` and `neg/` directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention `[id]_[rating].txt`, where `[id]` is a unique id and `[rating]` is the star rating for that review on a 1-10 scale. For example, the file `train/pos/200_8.txt` is the text for a positive-labeled train set example with unique id 200 and star rating 8/10 from IMDb.
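
To make the naming convention concrete, here is a minimal sketch that splits such a path into its parts. The helper name `parse_review_filename` is hypothetical and is not used elsewhere in this notebook.

```python
import os

def parse_review_filename(path):
    """Split a path like 'train/pos/200_8.txt' into (review_id, rating, label)."""
    directory, filename = os.path.split(path)
    label = os.path.basename(directory)       # 'pos' or 'neg'
    stem, _ = os.path.splitext(filename)      # '200_8'
    review_id, rating = stem.split('_')
    return int(review_id), int(rating), label

print(parse_review_filename('train/pos/200_8.txt'))  # (200, 8, 'pos')
```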

Let's write a function to load the data and store it in DataFrames:

In [1]:
import os
import pandas as pd

def load_imbd_dataset(data_path, unsup=False):
    """
    Load the IMDb dataset into a pandas DataFrame.

    Parameters:
        data_path (str): Directory containing the 'pos' and 'neg' subdirectories
            (and 'unsup' for the unlabeled reviews).
        unsup (bool): If True, load the unlabeled reviews from 'unsup' instead.

    Returns:
        df (pandas.DataFrame): The reviews, plus their labels when unsup is False.
    """
    reviews = []
    labels = []

    if not unsup:
        for label in ['pos', 'neg']:
            label_dir = os.path.join(data_path, label)
            for filename in os.listdir(label_dir):
                filepath = os.path.join(label_dir, filename)
                with open(filepath, 'r', encoding='utf-8') as file:
                    reviews.append(file.read())
                # The star rating in the filename is not needed: the directory gives the label
                labels.append(1 if label == 'pos' else 0)
        return pd.DataFrame({'review': reviews, 'sentiment': labels})

    # Unlabeled reviews live in the 'unsup' subdirectory
    unsup_dir = os.path.join(data_path, 'unsup')
    for filename in os.listdir(unsup_dir):
        filepath = os.path.join(unsup_dir, filename)
        with open(filepath, 'r', encoding='utf-8') as file:
            reviews.append(file.read())
    return pd.DataFrame({'review': reviews})

In [2]:
train_set = load_imbd_dataset('data/train')
test_set = load_imbd_dataset('data/test')
unlabeled_train_set = load_imbd_dataset('data/train', unsup=True)

In [3]:
train_set.shape

(25000, 2)

In [4]:
train_set.columns.values

array(['review', 'sentiment'], dtype=object)

The train set has 25,000 rows and 2 columns ('review' and 'sentiment'). Now let's take a look at one of the reviews.

In [5]:
print(train_set["review"][1])

Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. Most people think of the homeless as just a lost cause while worrying about things such as racism, the war on Iraq, pressuring kids to succeed, technology, the elections, inflation, or worrying if they'll be next to end up on the streets.<br /><br />But what if you were given a bet to live on the streets for a month without the luxuries you once had from a home, the entertainment sets, a bathroom, pictures on the wall, a computer, and everything you once treasure to see what it's like to be homeless? That is Goddard Bolt's lesson.<br /><br />Mel Brooks (who directs) who stars as Bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival (Jeffery Tambor) to see if he can live in the streets for thirty days without th

We can see some HTML tags such as `<br />`, as well as contractions and punctuation.

## Data Cleaning and Text Preprocessing
We will implement each preprocessing step as a scikit-learn transformer, so that we can chain the steps in a preprocessing Pipeline.
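
As a rough illustration of the transformer idea, the sketch below chains two toy steps by hand. It is not scikit-learn code: `Lowercaser`, `WhitespaceNormalizer` and `run_steps` are hypothetical names, and `run_steps` only mimics the way a pipeline passes each step's output to the next one.

```python
class Lowercaser:
    """Hypothetical transformer: lowercases every string."""
    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return [text.lower() for text in X]


class WhitespaceNormalizer:
    """Hypothetical transformer: collapses runs of whitespace."""
    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return [" ".join(text.split()) for text in X]


def run_steps(steps, X):
    """Fit each step on the current data, then feed its output to the next step."""
    for step in steps:
        X = step.fit(X).transform(X)
    return X

print(run_steps([Lowercaser(), WhitespaceNormalizer()], ["Great   MOVIE!\n"]))
# ['great movie!']
```

Because every step exposes the same `fit`/`transform` interface, steps can be added, removed or reordered without touching the others.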

### Removing HTML Markup
First, we'll remove the HTML tags. We will use the **Beautiful Soup** package. 

In [6]:
from bs4 import BeautifulSoup
from sklearn.base import BaseEstimator, TransformerMixin

class HTMLTagRemover(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # Function to remove HTML tags from each element in the input X
        def remove_html_tags(html_text):
            soup = BeautifulSoup(html_text, 'html.parser')
            return soup.get_text()
        
        return [remove_html_tags(text) for text in X]

Let's try this:

In [7]:
example = [train_set["review"][1]]
print(example)

['Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. Most people think of the homeless as just a lost cause while worrying about things such as racism, the war on Iraq, pressuring kids to succeed, technology, the elections, inflation, or worrying if they\'ll be next to end up on the streets.<br /><br />But what if you were given a bet to live on the streets for a month without the luxuries you once had from a home, the entertainment sets, a bathroom, pictures on the wall, a computer, and everything you once treasure to see what it\'s like to be homeless? That is Goddard Bolt\'s lesson.<br /><br />Mel Brooks (who directs) who stars as Bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival (Jeffery Tambor) to see if he can live in the streets for thirty days witho

In [8]:
html_remover = HTMLTagRemover()
example = html_remover.transform(example)
print(example)

['Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. Most people think of the homeless as just a lost cause while worrying about things such as racism, the war on Iraq, pressuring kids to succeed, technology, the elections, inflation, or worrying if they\'ll be next to end up on the streets.But what if you were given a bet to live on the streets for a month without the luxuries you once had from a home, the entertainment sets, a bathroom, pictures on the wall, a computer, and everything you once treasure to see what it\'s like to be homeless? That is Goddard Bolt\'s lesson.Mel Brooks (who directs) who stars as Bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival (Jeffery Tambor) to see if he can live in the streets for thirty days without the luxuries; if Bolt
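
In the output above, sentences that were separated only by `<br />` tags are now fused ("streets.But", "lesson.Mel"), which later produces tokens such as "streetsbut". One possible remedy is to join the text fragments with a space. The sketch below illustrates the idea with the standard library's `html.parser` rather than Beautiful Soup; `_TextCollector` and `strip_tags_keep_spacing` are hypothetical names. Beautiful Soup's `get_text` accepts a `separator` argument that serves the same purpose.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collects the text fragments found between tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags_keep_spacing(html_text):
    collector = _TextCollector()
    collector.feed(html_text)
    collector.close()  # flush any text still buffered by the parser
    # Join fragments with a space, then collapse repeated whitespace
    return " ".join(" ".join(collector.chunks).split())

sample = "end up on the streets.<br /><br />But what if"
print(strip_tags_keep_spacing(sample))
# end up on the streets. But what if
```

The trade-off is that a word interrupted by inline markup (for example a tag in the middle of a word) would be split in two, which is rare in these reviews.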

### Dealing with Punctuation, Numbers and Stopwords

When considering text cleaning, it is essential to tailor the approach to the specific data problem we aim to solve. For certain tasks, removing punctuation can be beneficial, but in the context of sentiment analysis, expressions like "!!!" or ":-(" might contain sentiment and could be treated as words. Nevertheless, for simplicity, we will proceed with punctuation removal.

Similarly, we'll exclude numbers, although alternative methods exist, such as treating them as words or substituting them with a placeholder like "NUM."
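
As a minimal sketch of the placeholder alternative mentioned above (not the approach used in this notebook), numbers can be substituted instead of dropped. The function name `replace_numbers` is hypothetical.

```python
import re

def replace_numbers(text, placeholder="NUM"):
    # Replace each run of digits (optionally with a decimal part) by the placeholder
    return re.sub(r"\d+(?:\.\d+)?", placeholder, text)

print(replace_numbers("I rate it 8 out of 10, maybe 8.5."))
# I rate it NUM out of NUM, maybe NUM.
```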

To remove numbers and URLs, we'll use the `re` package, which handles regular expressions; punctuation is stripped using `string.punctuation`. Additionally, we'll tokenize the reviews, breaking them down into individual words.

Moreover, we'll apply lemmatization, which converts each word to its dictionary base form (its lemma) by looking it up in WordNet; for example, "stated" becomes "state". In the code below, every word is lemmatized as a verb.

Lastly, we must address frequently occurring words that carry little meaning, known as "stop words". In English, these encompass words like "a," "and," "is," and "the." Fortunately, Python packages like the Natural Language Toolkit (NLTK) provide built-in stop word lists that we can utilize by importing them.
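
The filtering itself is simple, as the following sketch shows. It uses a tiny hand-written set, `EXAMPLE_STOP_WORDS`, purely for illustration; NLTK's English list is much longer, and `remove_stop_words` is a hypothetical helper.

```python
# Illustrative subset only; NLTK's English list is much longer
EXAMPLE_STOP_WORDS = {"a", "and", "is", "the"}

def remove_stop_words(text, stop_words=EXAMPLE_STOP_WORDS):
    # Lowercase, split on whitespace, and drop every token found in stop_words
    return [w for w in text.lower().split() if w not in stop_words]

print(remove_stop_words("The plot is thin and the acting is a mess"))
# ['plot', 'thin', 'acting', 'mess']
```

One caveat for sentiment analysis: NLTK's English list includes negations such as "not", so removing stop words can discard information about polarity.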

In [9]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

class TextCleaner(BaseEstimator, TransformerMixin):
    def __init__(self, remove_stopwords=False, lemmatization=True):
        nltk.download('stopwords')
        nltk.download('punkt')
        nltk.download('wordnet')  # corpus required by WordNetLemmatizer

        self.remove_stopwords = remove_stopwords
        self.lemmatization = lemmatization

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Build these once per call rather than once per review
        stops = set(stopwords.words("english")) if self.remove_stopwords else set()
        lemmatizer = WordNetLemmatizer() if self.lemmatization else None

        # Function to clean a single review (lowercase, remove numbers, URLs and punctuation)
        def clean_text(text):
            text = text.lower()

            # Remove numbers using regex
            text = re.sub(r'\d+', '', text)

            # Remove URLs
            text = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', text, flags=re.MULTILINE)

            # Remove punctuation using string library
            text = text.translate(str.maketrans('', '', string.punctuation))

            # Split the text into words
            words = text.split()

            # Remove stop words from "words"
            if self.remove_stopwords:
                words = [w for w in words if w not in stops]

            # Lemmatize every word as a verb ('v')
            if self.lemmatization:
                words = [lemmatizer.lemmatize(word, 'v') for word in words]

            return " ".join(words)

        return [clean_text(text) for text in X]


Let's try this:

In [10]:
text_cleaner = TextCleaner()
example = text_cleaner.transform(example)
example

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Agustin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Agustin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['homelessness or houselessness as george carlin state have be an issue for years but never a plan to help those on the street that be once consider human who do everything from go to school work or vote for the matter most people think of the homeless as just a lose cause while worry about things such as racism the war on iraq pressure kid to succeed technology the elections inflation or worry if theyll be next to end up on the streetsbut what if you be give a bet to live on the streets for a month without the luxuries you once have from a home the entertainment set a bathroom picture on the wall a computer and everything you once treasure to see what its like to be homeless that be goddard bolt lessonmel brook who direct who star as bolt play a rich man who have everything in the world until decide to make a bet with a sissy rival jeffery tambor to see if he can live in the streets for thirty days without the luxuries if bolt succeed he can do what he want with a future project of ma

### Building the Preprocessing Pipeline

In [11]:
from sklearn.pipeline import Pipeline

preprocessing_pipeline = Pipeline([
    ('html_tag_remover', HTMLTagRemover()),
    ('text_cleaner', TextCleaner(remove_stopwords=True, lemmatization=True)),
])


Now we can use this pipeline to transform our training set. Let's prepare our training set and preprocess it.

In [12]:
# Shuffle our training and test sets:
train_set = train_set.sample(frac=1, random_state=42)
test_set = test_set.sample(frac=1, random_state=42)

train_set = train_set.reset_index(drop=True)
test_set = test_set.reset_index(drop=True)

X_train_full, y_train_full = train_set["review"], train_set["sentiment"]

# Transform our training set
X_train_transformed_full = preprocessing_pipeline.fit_transform(X_train_full)

# Hold out the last 5,000 shuffled reviews as a validation set.
X_train_transformed, X_valid_transformed = X_train_transformed_full[:20000], X_train_transformed_full[20000:]
y_train, y_valid = y_train_full[:20000], y_train_full[20000:]


## Numerical Representations for our Data

Now that we have our training reviews tidied up, we need to convert them to a numeric representation that machine learning models can work with.

### Bag of Words
One common approach is called a **Bag of Words**. The Bag of Words model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears.

In the IMDb data, we have a very large number of reviews, which will give us a large vocabulary. To limit the size of the feature vectors, we should choose some maximum vocabulary size. Below, we use the **10,000 most frequent words and bigrams** (remembering that stop words have already been removed).

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

bow_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,2), tokenizer=None, preprocessor=None, stop_words=None, max_features=10000)

X_train_vectorized_bow = bow_vectorizer.fit_transform(X_train_transformed).toarray()
X_valid_vectorized_bow = bow_vectorizer.transform(X_valid_transformed).toarray()

In [14]:
X_train_vectorized_bow.shape

(20000, 10000)

Now our training data has 20,000 rows and 10,000 features (one for each vocabulary word/bigram).

In [15]:
# Take a look at the words in the vocabulary
vocab = bow_vectorizer.get_feature_names_out()
print(vocab)

['abandon' 'abc' 'abilities' ... 'zoom' 'zorro' 'zu']
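
To make the counting concrete, here is a small standard-library sketch of a bag of words over unigrams and bigrams with a capped vocabulary. It only approximates what `CountVectorizer` does (tokenization here is a plain whitespace split), and the names `ngrams`, `fit_vocabulary` and `to_count_vector` are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Yield all word n-grams with n_min <= n <= n_max, joined by spaces."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent n-grams across all documents."""
    totals = Counter()
    for doc in docs:
        totals.update(ngrams(doc.split()))
    kept = [term for term, _ in totals.most_common(max_features)]
    # Map each kept term to a column index, in alphabetical order
    return {term: idx for idx, term in enumerate(sorted(kept))}

def to_count_vector(doc, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(ngrams(doc.split()))
    vector = [0] * len(vocabulary)
    for term, idx in vocabulary.items():
        vector[idx] = counts[term]
    return vector

toy_docs = ["good movie good plot", "bad movie"]
toy_vocab = fit_vocabulary(toy_docs, max_features=4)
toy_vector = to_count_vector(toy_docs[0], toy_vocab)
print(toy_vector[toy_vocab["good"]])  # 2
```

Terms that fall outside the kept vocabulary are simply ignored when a document is vectorized, which is why capping the vocabulary bounds the feature vector length.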


### TF-IDF

TF-IDF (term frequency-inverse document frequency) reweights the raw counts so that terms appearing in many reviews count for less than terms concentrated in a few. Below, `min_df=5` drops terms found in fewer than 5 reviews, `max_df=0.8` drops terms found in more than 80% of the reviews, and `sublinear_tf=True` replaces each count `tf` with `1 + log(tf)`.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df=5, max_df=0.8, sublinear_tf=True, use_idf=True, stop_words=None)

X_train_vectorized_tfidf = tfidf_vectorizer.fit_transform(X_train_transformed).toarray()
X_valid_vectorized_tfidf = tfidf_vectorizer.transform(X_valid_transformed).toarray()

In [17]:
X_train_vectorized_tfidf.shape

(20000, 20995)

In [18]:
X_train_vectorized_tfidf[0]

array([0., 0., 0., ..., 0., 0., 0.])
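
The weighting behind these vectors can be sketched with the standard library alone. The code below assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1`, the sublinear term frequency `1 + ln(tf)` and L2 normalization, which is my understanding of scikit-learn's defaults combined with `sublinear_tf=True`. It omits the `min_df`/`max_df` filtering, and `tfidf_vectors` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF with sublinear tf, smoothed idf and L2 normalization."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        # Scale each document vector to unit length (guard against empty documents)
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

toy_tfidf = tfidf_vectors(["good movie", "bad movie", "good good plot"])
# "bad" occurs in one document, "movie" in two, so "bad" gets the larger weight
print(toy_tfidf[1]["bad"] > toy_tfidf[1]["movie"])  # True
```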

## Building our models

In [19]:
from sklearn.model_selection import cross_val_score

### Random Forest

In [20]:
from sklearn.ensemble import RandomForestClassifier

# Initialize a Random Forest classifier with 100 trees
rf_clf = RandomForestClassifier(n_estimators = 100)

print("BOW Metrics:")
cv = cross_val_score(rf_clf, X_train_vectorized_bow, y_train, cv=5)
print(cv)
print(cv.mean())

print("TF-IDF Metrics:")
cv = cross_val_score(rf_clf, X_train_vectorized_tfidf, y_train, cv=5)
print(cv)
print(cv.mean())

BOW Metrics:
[0.839   0.84825 0.847   0.83525 0.8435 ]
0.8426
TF-IDF Metrics:
[0.8435  0.8445  0.84275 0.83575 0.8385 ]
0.841


### FeedForward Neural Network (FNN)

In [21]:
import tensorflow as tf
import numpy as np
from tensorflow import keras
from sklearn.model_selection import RandomizedSearchCV
# Note: this wrapper is deprecated and absent from recent TensorFlow releases;
# SciKeras offers a KerasClassifier replacement.
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

def build_model(input_shape, n_hidden=1, n_neurons=30, learning_rate=3e-3):
    # Note: learning_rate is accepted but unused; optimizer='adam' runs with its default rate.
    model = keras.models.Sequential()
    # Only the first hidden layer needs the input shape
    options = {"input_shape": input_shape}
    for layer in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu", **options))
        options = {}
    # Single sigmoid unit for binary sentiment classification
    model.add(keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

keras_clf_bow = KerasClassifier(build_model, input_shape=X_train_vectorized_bow.shape[1:])
keras_clf_tfidf = KerasClassifier(build_model, input_shape=X_train_vectorized_tfidf.shape[1:])

param_distribs = {
 "n_hidden": [0, 1, 2, 3],
 "n_neurons": np.arange(1, 100),
}

bow_rnd_search_cv =  RandomizedSearchCV(keras_clf_bow, param_distribs, n_iter=10, cv=3)
tfidf_rnd_search_cv = RandomizedSearchCV(keras_clf_tfidf, param_distribs, n_iter=10, cv=3)


In [22]:
# Callbacks
early_stopping_cb = keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)


print("BOW:")
bow_rnd_search_cv.fit(X_train_vectorized_bow, y_train, epochs=20, 
                        validation_data=(X_valid_vectorized_bow, y_valid), 
                        callbacks=[early_stopping_cb]);


print("TF-IDF:")
tfidf_rnd_search_cv.fit(X_train_vectorized_tfidf, y_train, epochs=20, 
                        validation_data=(X_valid_vectorized_tfidf, y_valid), 
                        callbacks=[early_stopping_cb]);

BOW:
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
[... per-epoch progress lines for the remaining fits omitted ...]

TF-IDF:
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
[... per-epoch progress lines for the remaining fits omitted ...]


## Making predictions on the test set

In [23]:
X_test, y_test = test_set["review"], test_set["sentiment"]
X_test_transformed = preprocessing_pipeline.transform(X_test)
X_test_vectorized_bow = bow_vectorizer.transform(X_test_transformed)
X_test_vectorized_tfidf = tfidf_vectorizer.transform(X_test_transformed)


# Evaluate the best models on the test set
rf_clf = RandomForestClassifier(n_estimators=100).fit(X_train_vectorized_bow, y_train)
rf_bow_score = rf_clf.score(X_test_vectorized_bow, y_test)
print(f"RF - BoW Accuracy: {rf_bow_score}")

rf_clf = RandomForestClassifier(n_estimators=100).fit(X_train_vectorized_tfidf, y_train)
rf_tfidf_score = rf_clf.score(X_test_vectorized_tfidf, y_test)
print(f"RF - TFIDF Accuracy: {rf_tfidf_score}")

fnn_bow_score = bow_rnd_search_cv.score(X_test_vectorized_bow, y_test)
print(f"FNN - BoW Accuracy: {fnn_bow_score}")

fnn_tfidf_score = tfidf_rnd_search_cv.score(X_test_vectorized_tfidf, y_test)
print(f"FNN - TFIDF Accuracy: {fnn_tfidf_score}")


RF - BoW Accuracy: 0.8476
RF - TFIDF Accuracy: 0.84392
FNN - BoW Accuracy: 0.8828399777412415
FNN - TFIDF Accuracy: 0.8727999925613403


With these approaches, the feedforward networks reach up to about 88% accuracy on the test set, compared with roughly 84-85% for the random forests. This performance might be improved by using distributed vector representations for words and documents (such as Word2vec or Doc2vec).