In [1]:
import numpy as np
np.set_printoptions(precision=3)
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

# Loading the libraries that will help us fit a logistic regression on the IMDb dataset

In [17]:
"""
Now that we have the necessary libraries, we need to load the data into
a pandas dataframe so that we can better understand what information is stored within
the dataset. We will also need to split the data into training and testing sets so that
we get a baseline for how our first model performs, which we can tune later on when
we perform 5-fold cross-validation.

I downloaded the data from Kaggle as a csv file and saved it as data.csv
in this directory. Using the read_csv function from pandas, we can easily load the data and view it.
"""

df = pd.read_csv("data.csv", names=["Review", "Sentiment"]) # loading in the data
df = df.iloc[1:] # removing the first row because it is just the column names


In [18]:
"""
Now that we have a representation of the data, let's visualize what it looks like in the dataframe.
Let's go over some basic information, like the feature (Review) and the target (Sentiment).
We can also use the describe method from pandas to spot anything
interesting about the data, such as duplicated or missing values.
"""

df.head() # viewing the first 5 rows of the data

Unnamed: 0,Review,Sentiment
1,One of the other reviewers has mentioned that ...,positive
2,A wonderful little production. <br /><br />The...,positive
3,I thought this was a wonderful way to spend ti...,positive
4,Basically there's a family where a little boy ...,negative
5,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [19]:
"""
Right away, we can see that the reviews are our only feature, which we will later need to break up into
multiple feature values, such as the words that carry useful signal, while ignoring the rest. We also
have a sentiment, which is simply the label for whether the reviewer rated the movie as good or bad. Assuming this is the usual
Kaggle export of the Large Movie Review Dataset, positive reviews have a rating of 7 or above out of 10 and negative reviews a rating of 4 or below.

Looking at the description of the data, I notice that we have 50000 reviews, although only 49582 of them are
unique. Since count is the number of non-missing entries, the gap comes from duplicates: 418 rows repeat an earlier review, which may need to be handled.
Another thing I notice is that Review has a freq of 5. In describe, freq is the number of times the most common
value (top) occurs, so the review shown in the top row appears 5 times in the data.
"""

df.describe() # viewing the summary of the data

Unnamed: 0,Review,Sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000
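
To make the describe() fields concrete, here is a minimal sketch that recomputes count, unique, top and freq by hand with the standard library. The toy list of reviews is made up for illustration and is not taken from the dataset.

```python
from collections import Counter

# Hypothetical toy column standing in for df["Review"]
reviews = ["great film", "awful", "great film", "fine", "great film"]

value_counts = Counter(reviews)
top, freq = value_counts.most_common(1)[0]

summary = {
    "count": len(reviews),        # number of non-missing entries
    "unique": len(value_counts),  # number of distinct values
    "top": top,                   # the most frequent value
    "freq": freq,                 # how many times `top` occurs
}

# rows that repeat an earlier value
duplicates = summary["count"] - summary["unique"]
print(summary, duplicates)
```

Applied to the output above, the same subtraction gives 50000 - 49582 = 418 repeated rows.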


In [20]:
from sklearn.feature_extraction.text import CountVectorizer
"""
For this particular problem we have no numeric values that we can use to train a logistic regression model.
You might ask, what exactly are we supposed to do with the words we have been given? Well, we can evaluate the
words based on their frequencies and how important they are. We can do this by creating a bag of words
model. This bag of words model is simply a dictionary of the words that we want to count, together with how many times
they occur in each review. Most words will have a count of zero for a given review and some will occur many
times, but the point is that we can turn our one feature (reviews) into many numeric features using our
bag of words model.

Another thing we must consider when extracting information from the reviews is unwanted words, symbols, formatting, etc.
I notice in one particular review that there are HTML break tags. We need to find a way to parse all of our
reviews and make sure they only contain "rich" information that can be used as a feature, and not uninformative
words like "the", "and", "a", etc.

Using scikit-learn's CountVectorizer class we can easily create a bag of words model from the reviews. Before we do,
I am going to apply it to a small sample to get familiar with the bag of words model and how to use it.
"""

samples = np.array(df.sample(n=10).drop(columns=["Sentiment"])) # sampling 10 reviews from the data (removing the sentiment column)

counts = CountVectorizer() 
bag = counts.fit_transform(samples.ravel()) # fitting the model to the reviews (getting frequencies of words in reviews as well as total words in all reviews)

dictionary = counts.vocabulary_



bag = bag.toarray() # converting the bag of words model to a numpy array


In [21]:
"""
Using the code above we can now take a look at some of the words that we have extracted from the reviews. The
vocabulary_ attribute of counts maps every token to a column index, and each review becomes a sequence of counts.
An index in a review's sequence stands for one particular word in vocabulary_, and the
value at that index is the frequency of that word in the review. The indices stay consistent
across all reviews, which will help us later on when deciding word importance.

One thing we will implement later on is a way to give these words
a weight that reflects how strongly they point towards good or bad reviews. Let's take a look
at the index of the word "the" next to the indices of "good" and "bad". We can't make the decision based on word
frequency alone because words like "the" are common in both good and bad reviews, so we will need a method of determining
the relative importance of a word.
"""

print(f"the: {dictionary['the']}\ngood: {dictionary['good']}\nbad: {dictionary['bad']}") # index of good and bad relative to all sequences

print(f"Example review and its word frequencies:\n{bag[0]}") # example review and its word frequencies

the: 561
good: 229
bad: 53
Example review and its word frequencies:
[ 0  1  0  0  0  0  0  3  1  0  2  1  0  1  1  1  0  1  0  1  0  3  3  0
  0  1  0  0  1  2  0 12  0  0  1  0  0  1  1  3  0  3  0  1  3  0  1  1
  0  0  0  0  0  0  0  0  0  0  1  1  4  0  1  0  0  1  0  0  0  3  0  0
  0  0  0 14  0  0  1  0  0  0  0  4  3  0  1  0  0  0  0  0  0  1  0  1
  2  1  1  1  2  1  1  0  0  4  0  0  0  0  1  0  0  1  0  1  0  0  0  0
  0  2  1  0  0  0  0  1  0  1  0  0  1  1  1  1  0  0  1  0  2  1  1  0
  1  0  0  1  0  0  2  1  0  0  0  0  0  0  0  0  1  0  0  0  0  1  1  0
  0  2  0  0  0  0  0  0  0  0  0  1  0  0  0  1  1  0  0  0  0  1  0  0
  0  0  0  0  0  0  1  0  1  0  1  5  0  1  0  0  0  0  0  2  0  0  1  0
 12  5  2  0  0  0  0  0  0  1  0  1  0  0  0  0  0  7  2  0  0  2  0  2
  0  0  1  0  7  0  6  0  1  2  1  0  0  1  6  0  0  2  0  1  1  0  1  0
  2  1  0  0  1  1  0  0  1  0  0  1 12  0  2  1  1  0  0  1  0  1  0  0
  1  1  2  1  0  7  1  0  2  0  0  2  0  0  0  0  0  0  
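
As a from-scratch illustration of the mapping described above, the following sketch builds a vocabulary and count vectors using only the standard library. The two sentences are made up, and the regular expression is meant to mirror CountVectorizer's default of keeping tokens with two or more word characters; it is a simplified stand-in, not the library's implementation.

```python
import re
from collections import Counter

docs = ["The movie was good", "The plot was bad, really bad"]  # made-up examples

def tokenize(text):
    # lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# every distinct token, sorted, gets a fixed column index
all_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}

def count_vector(text):
    # one row of the bag of words: the count of each vocabulary token in `text`
    counts = Counter(tokenize(text))
    row = [0] * len(vocabulary)
    for tok, n in counts.items():
        row[vocabulary[tok]] = n
    return row

bag_of_words = [count_vector(doc) for doc in docs]
print(vocabulary)
print(bag_of_words)
```

Because every row uses the same column for the same token, column 0 ("bad") can be compared directly across reviews.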

In [22]:
"""
The example above is one sequence from the unigram model we created. The prefix "uni" means one: every token
in the sequence of a single review is one word. We can change a parameter when initializing the CountVectorizer class
to use an n-gram model, where a single token is made of 2, 3, 4, or more consecutive words. This can enrich our bag of words model with
phrases that are more common in good or bad reviews, but excessive use of n-grams can lead to overfitting, where a
review must contain some very specific phrase or ordering of words to be classified as positive or negative, which we don't want.

Let's do a quick demo of this feature to get familiar with the n-gram model.
"""

counts = CountVectorizer(ngram_range=(2,2)) # using a ngram model with 2 words for each token
bag = counts.fit_transform(samples.ravel()) # fitting the model to the reviews (getting frequencies of words in reviews as well as total words in all reviews)

dictionary = counts.vocabulary_

print(f"A token we might be looking for, for a movie review and its index location in the sequence array: {list(dictionary.keys())[0]} - {dictionary[list(dictionary.keys())[0]]}") # the first token in the dictionary
print(f"An example sequence of the 2 word ngram model: {bag.toarray()[8]}")


A token we might be looking for, for a movie review and its index location in the sequence array: blood of - 184
An example sequence of the 2 word ngram model: [1 0 0 ... 1 1 0]
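
To show what a 2-word token looks like, here is a small sketch that slides a window over a list of words. The helper name ngrams and the sentence are illustrative only.

```python
def ngrams(tokens, n):
    # join every run of n consecutive tokens into one string
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this movie was not good".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(unigrams)
print(bigrams)
```

The bigram "not good" keeps the negation together, which a unigram model would split into two unrelated tokens.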


In [23]:

"""
Now that we have our n-gram model, we need a way to evaluate each word or phrase (depending on which we use) to determine its value.
We can do this with the tf-idf model, which combines the term frequency from the bag of words model with the inverse document frequency.
Essentially, we multiply the frequency of a token in a document by the log of the total number of documents
divided by the sum of 1 and the number of documents in which the token appears. This "weights" tokens based on
their relative importance across the documents of good and bad reviews. There is an alternative equation that is the same as
above except that it also adds 1 to the total number of documents in the numerator and then adds
1 to the idf (so that tokens that appear in every document still get a non-zero weight). We will use this smoothed version, which is also the one the code below implements.

Using these formulas and a little bit of basic programming, I'm going to write a function that turns an entire sample
set into an L2-normalized tf-idf model. After this has been implemented, there is one last thing we need to consider before creating
training/testing data and doing cross-validation, which is filtering. By this I mean removing things like stop words (words that
provide little to no information about either class) as well as the symbols and HTML tags that I mentioned before.
"""

# l2 normalization
def normalize(x):
    x = np.array(x)
    return x / np.sqrt(np.sum(x**2))

# calculates document frequencies (number of documents containing each token) from the bag of words
def doc_freq(bag):
    # a token occurs in a document when its count in that row is non-zero;
    # counting on the bag avoids substring matches such as "is" inside "this"
    return np.count_nonzero(bag, axis=0)

# function that takes an entire sample set and gets the l2 normalized tf * idf for each sequence
def tfidf(samples):

    counts = CountVectorizer()
    bag = counts.fit_transform(samples.ravel()).toarray().astype(float) # creating bag of words
    doc_freqs = doc_freq(bag) # document frequency of every token, aligned with the columns of the bag

    N = len(bag) # total number of documents
    idf = 1 + np.log((N + 1) / (1 + doc_freqs)) # smoothed idf of every token

    # iterate every sequence
    for idx, sequence in enumerate(bag):

        if idx % 5000 == 0:
            print(f"{idx} documents processed / {N}")

        bag[idx] = normalize(sequence * idf) # tf * idf followed by l2 normalization

    print("tf-idf model complete")

    return bag
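
For reference, the weighting described in this cell can be written out as follows. The symbols are my own notation rather than the notebook's: N is the number of documents, df(t) the number of documents containing token t, tf(t, d) the count of t in document d, and v_d the vector of tf-idf weights for document d.

```latex
\begin{align}
\mathrm{idf}(t) &= \log\frac{N}{1 + \mathrm{df}(t)} \\
\mathrm{idf}_{\mathrm{smooth}}(t) &= \log\frac{1 + N}{1 + \mathrm{df}(t)} + 1 \\
\text{tf-idf}(t, d) &= \mathrm{tf}(t, d) \cdot \mathrm{idf}_{\mathrm{smooth}}(t) \\
\hat{v}_d &= \frac{v_d}{\lVert v_d \rVert_2}
\end{align}
```

The first line is the basic form, the second is the smoothed alternative, and the last two lines are the weighting and L2 normalization applied to every review.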


In [24]:
"""
Here is an example from page 238 of the machine learning book, which shows that the implementation I built above
works as expected.
"""


test = np.array(["The sun is shining", "The weather is sweet", "The sun is shining and the weather is sweet"])
print(tfidf(test))

0 documents processed / 3
tf-idf model complete
[[0.    0.434 0.558 0.558 0.    0.434 0.   ]
 [0.    0.434 0.    0.    0.558 0.434 0.558]
 [0.405 0.478 0.308 0.308 0.308 0.478 0.308]]
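
As a sanity check on the first row of the output above, this sketch redoes the arithmetic for "The sun is shining" by hand with the math module. The document frequencies are counted from the three test sentences; the variable names are mine.

```python
import math

N = 3  # number of test sentences

# in how many of the three sentences each token of the first sentence appears
doc_freq = {"is": 3, "shining": 2, "sun": 2, "the": 3}
# counts of each token in the first sentence, "The sun is shining"
term_freq = {"is": 1, "shining": 1, "sun": 1, "the": 1}

# smoothed tf-idf weight for each token
weights = {
    tok: tf * (1 + math.log((N + 1) / (1 + doc_freq[tok])))
    for tok, tf in term_freq.items()
}

# L2 normalization of the weight vector
norm = math.sqrt(sum(w * w for w in weights.values()))
normalized = {tok: round(w / norm, 3) for tok, w in weights.items()}
print(normalized)
```

The rounded values agree with the non-zero entries 0.434 and 0.558 in the first row.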


In [25]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize as tokenizer
from nltk.stem import WordNetLemmatizer


"""
Now that we have a function that can weight the tokens of an entire sample of sequences by their importance,
we need to take care of the last preprocessing step before we can create training/testing data and do cross-validation. In this
step we will focus on cleaning up the data: tokenizing the reviews, removing stop words, and lemmatizing what is left.
As a brief introduction to these topics, a stop word is a word that is common in a language and adds little to the meaning of a sentence.
For example, words like "the", "a", and "an" are common in English and carry almost no meaning on their own, which makes them stop
words. We need to get rid of them because they appear in reviews of both classes, so they are not helpful
as features for determining the sentiment of a review. Tokenization means
splitting the text into individual tokens; before that, we strip things like line breaks, extra whitespace, and punctuation. This also resolves the HTML problem I mentioned
earlier in the project. Lemmatizing means reducing words to their canonical or base form. For example, we can reduce the word
"trained" to "train". Overall, the combination of these three processes will produce reviews that keep the information about the sentiment of the
review, and therefore about which class it belongs to.

To eliminate stop words we will tokenize the reviews and then drop the stop words from the sequence of tokens. Then, to lemmatize, we will
use the NLTK (Natural Language Toolkit) lemmatizer object and its built-in method, which allows us to take the filtered tokens and reduce them to their base
form. Let's try this on an example first to make sure it works before we apply it to the entire review dataframe.
"""

# helper function to get pos of word so that it's lemmatized correctly
def pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def clean(reviews):

    lemmatizer = WordNetLemmatizer() # initializing lemmatizer object

    stop_words = set(stopwords.words("english")) # create a set of stop words

    N = len(reviews) # get the number of reviews

    # apply filter to each review
    for i, review in enumerate(reviews):
        
        review = re.sub(r"<[^>]*>", " ", review) # stripping HTML tags
        review = re.sub(r"[\W]+", " ", review.lower()) # remove all non-word characters

        review = tokenizer(review) # tokenize the review
        review = [lemmatizer.lemmatize(t, pos(t)) for t in review if t not in stop_words] # strip tokens from stop words and lemmatize
        reviews[i] = " ".join(review) # join the tokens back together

        if (i + 1) % 5000 == 0:
            print(f"{i + 1} review(s) processed / {N}")
    
    print(f"{N} / {N} reviews processed")

    return reviews

In [26]:
"""
As we can see, we removed all stop words and lemmatized the tokens. Now we can apply this to the entire review dataframe. One thing
I will mention is that the lemmatization process is not perfect: "shining" becomes "shin" instead of "shine". However, as long as it stays consistent across
the reviews, it should not affect the importance of tokens relative to the sentiment label.
"""

print(clean(test))

3 / 3 reviews processed
['sun shin' 'weather sweet' 'sun shin weather sweet']
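
The two regular expressions do most of the cleaning work, so here is a dependency-free sketch of just that part. It uses a tiny hand-picked stop word set instead of NLTK's list and skips lemmatization entirely; simple_clean and STOP_WORDS are illustrative names.

```python
import re

# tiny hand-picked set for illustration, not NLTK's English stop word list
STOP_WORDS = {"a", "an", "and", "is", "the", "this", "was"}

def simple_clean(review):
    review = re.sub(r"<[^>]*>", " ", review)      # drop HTML tags such as <br />
    review = re.sub(r"\W+", " ", review.lower())  # non-word characters become spaces
    # split on whitespace and keep everything that is not a stop word
    return " ".join(tok for tok in review.split() if tok not in STOP_WORDS)

cleaned = simple_clean("A wonderful little production. <br /><br />The filming is great!")
print(cleaned)
```

The tags are removed before the non-word pass, so the `br` inside `<br />` never turns into a token of its own.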


In [27]:
from sklearn.feature_extraction.text import TfidfTransformer as TFIDF

"""
Due to the runtime of my tf-idf model, I will be using the transformer provided by sklearn. It takes care of
turning our bag of words into a tf-idf matrix.
"""

# quick data preprocessing 
def preprocess(reviews):
    counts = CountVectorizer()
    transformer = TFIDF()
    reviews = clean(reviews)
    reviews = transformer.fit_transform(counts.fit_transform(reviews)).toarray()
    return np.array(reviews)


In [None]:

# preprocessing the training data
X = preprocess(df["Review"].values)

# turning sentiment labels into binary labels (1 = positive, 0 = negative)
y = np.array([1 if i == "positive" else 0 for i in df["Sentiment"].values])

# splitting training & testing data
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
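
For intuition about what an 80/20 split does, this is a minimal standard-library sketch that shuffles row indices with a fixed seed and cuts them in two. It will not reproduce the exact rows that train_test_split picks, since scikit-learn uses its own random number generator; split_indices is a hypothetical helper.

```python
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    # shuffle the row indices reproducibly, then cut off the test portion
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(50000)
print(len(train_idx), len(test_idx))
```

With 50000 reviews this yields 40000 training rows and 10000 test rows, and no row lands in both sets.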



In [2]:
from sklearn.linear_model import LogisticRegression

"""
With the data ready to go, let's build our logistic model. We can easily do this using scikit-learn's LogisticRegression class. Before
I begin performing cross-validation on the model, I will first try it on our training/testing split, and then I will move on to cross-validation
over 5 folds on the entire dataset. From there, we will be able to see the results and analyze the model. I will cap the solver at
10 iterations (max_iter=10); with 40,000 training reviews seen per iteration, that should be enough for our model to learn its parameters.
"""

clf = LogisticRegression(max_iter=10) # initializing logistic regression model


In [24]:
clf.fit(x_train, y_train) # fitting the model

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(max_iter=10)

In [25]:
acc = clf.score(x_test, y_test) # testing the model
print(f"Base Accuracy: {acc*100}%")

Base Accuracy: 87.91%
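
For intuition about what the fitted classifier computes at prediction time, here is a sketch of the logistic function applied to a weighted sum of features. The three weights and the feature vectors are invented for illustration and have nothing to do with the coefficients learned above.

```python
import math

def sigmoid(z):
    # squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    # probability of the positive class for one review
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# made-up weights for three tokens: "great", "boring", "plot"
weights = [2.0, -3.0, 0.1]
bias = 0.0

p_positive_review = predict_proba(weights, bias, [0.8, 0.0, 0.6])  # mentions "great"
p_negative_review = predict_proba(weights, bias, [0.0, 0.8, 0.6])  # mentions "boring"
print(p_positive_review, p_negative_review)
```

A review is labeled positive when this probability is above 0.5, which is the same as the weighted sum being above zero.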


In [28]:
"""
Now that we have a basic understanding of how our model performs and a general idea of reasonable hyperparameters, we can
begin cross-validation. For this project, I will be using the KFold cross-validation method. On the entire dataset, I will
use 5 folds and report the accuracy of our model as the average over the 5 folds.
"""

clf = LogisticRegression(max_iter=10)

kf = KFold(n_splits=5)
total_accuracy = 0
iter_ = 1
for train, test in kf.split(X):
    print(f"Iteration {iter_}")
    clf.fit(X[train], y[train])
    acc = clf.score(X[test], y[test])
    total_accuracy += acc
    print(f"Model Accuracy: {acc*100}%")
    iter_ += 1


print(f"Average accuracy across 5 folds: {np.array(total_accuracy / 5 * 100)}%")

    

Iteration 1




Model Accuracy: 87.92%
Iteration 2




Model Accuracy: 88.79%
Iteration 3




Model Accuracy: 87.69%
Iteration 4




Model Accuracy: 88.58%
Iteration 5




Model Accuracy: 88.77000000000001%
Average accuracy across 5 folds: 88.35000000000001%
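
To make the fold mechanics explicit, the sketch below generates contiguous, unshuffled folds in the spirit of KFold(n_splits=5) and then averages the five accuracies printed above. kfold_indices is a hypothetical helper written for illustration, not scikit-learn code.

```python
def kfold_indices(n_samples, n_splits=5):
    # contiguous folds without shuffling; the first n_samples % n_splits folds get one extra sample
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(kfold_indices(10, n_splits=5))
print(folds[0])

# fold accuracies as printed in the cell output above
accuracies = [0.8792, 0.8879, 0.8769, 0.8858, 0.8877]
mean_accuracy = sum(accuracies) / len(accuracies)
print(f"{mean_accuracy:.4f}")
```

Every sample is used for testing exactly once, so the average summarizes performance over the whole dataset.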


In [32]:
from sklearn.metrics import confusion_matrix

"""
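Before I conclude, let's look at the confusion matrix from the logistic model we built. Keep in mind that clf was
last fit on four of the five folds, so most of the reviews in X were already seen during training and the matrix is somewhat optimistic.

It isn't a perfect model, but it did a solid job of predicting the sentiment of
reviews correctly.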
"""

pred = clf.predict(X)
cm = confusion_matrix(y, pred)
print(f"Confusion Matrix:\n{cm}")

Confusion Matrix:
[[22418  2582]
 [ 2042 22958]]
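
The matrix above can be turned into the usual summary metrics with a few lines of arithmetic. This sketch assumes the layout that confusion_matrix uses for labels 0 and 1: rows are true labels and columns are predicted labels.

```python
# rows: true label (0 = negative, 1 = positive); columns: predicted label
cm = [[22418, 2582],
      [2042, 22958]]

tn, fp = cm[0]  # true negatives, false positives
fn, tp = cm[1]  # false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The accuracy here works out to about 90.8%, higher than the cross-validated average, because the predictions include reviews from the four folds that clf was last trained on.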


In [None]:
"""
After doing cross-validation we can see our model is not that bad (roughly 88% accuracy). With some tweaks and improvements we can probably get a better
accuracy. Something I thought about during the project was ensembling models to predict the sentiment of a new review,
or using an SVM to do the sentiment analysis. Maybe a random forest would do better, but I will leave those ideas as a learning exercise
for the next project. As a whole, it is crucial to understand your data even better than your model, because the data often dictates how well your model
performs. I believe I did a good job of preprocessing the data and building the model, although there are always areas to improve.
I wonder if I could have gotten better results with a different preprocessing method or a different model entirely.
""";

In [3]:
import preprocessing as pp
import process_functions as pf
import pickle


path = pf.path

X = pp.features
y = pp.labels

Loading in the Data...
5000 review(s) processed / 50000
10000 review(s) processed / 50000
15000 review(s) processed / 50000
20000 review(s) processed / 50000
25000 review(s) processed / 50000
30000 review(s) processed / 50000
35000 review(s) processed / 50000
40000 review(s) processed / 50000
45000 review(s) processed / 50000
50000 review(s) processed / 50000
50000 / 50000 reviews processed
Data Loaded...


### Main Model

In [4]:
model = LogisticRegression(max_iter=10)
model.fit(X, y)
print(f"Saving model and transformer to disk...")

Saving model and transformer to disk...


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [5]:
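with open(f"{path}/model/model.pkl", "wb") as model_file:
    pickle.dump(model, model_file) # saving the fitted model
with open(f"{path}/model/transformer.obj", "wb") as transformer_file:
    pickle.dump(pp.transformer, transformer_file) # saving the fitted transformer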
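
Loading the saved objects back is the mirror image of the dump calls. The sketch below shows the round trip with a stand-in dictionary and a temporary directory so that it runs anywhere; the real model and transformer, and the paths under path, are not touched.

```python
import os
import pickle
import tempfile

# placeholder standing in for the fitted model; not the real LogisticRegression object
stand_in_model = {"coef": [0.5, -1.2], "intercept": 0.1}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")

    # write the object to disk, closing the file when done
    with open(model_path, "wb") as fh:
        pickle.dump(stand_in_model, fh)

    # read it back in a later session
    with open(model_path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == stand_in_model)
```

When scoring a new review, the same cleaning step would need to run first, and, assuming pp.transformer is a fitted scikit-learn transformer, it should be applied with transform rather than fit_transform so the saved vocabulary and idf weights are reused.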