%pip install --user -U nltk

%pip install spacy

%pip install --upgrade spacy pydantic

!python -m spacy download en_core_web_sm

In [33]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import os

import nltk  # natural language processing library
from nltk.corpus import stopwords  # lists of common, low-information words (e.g. "the", "is")
from nltk.tokenize import RegexpTokenizer  # splits text into tokens that match a regular expression
import re  # regular expressions for pattern-based search and replace

from datetime import datetime #calendar dates
from time import time

#get split of data
from sklearn.model_selection import train_test_split
#models
from sklearn.naive_bayes import GaussianNB
#analyze models
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, pairwise_distances
#confusion matrix: counts of predicted vs. actual labels
from sklearn.metrics import confusion_matrix

#visualization stuff
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

#analyze text

import spacy

nltk.download('punkt')  # Punkt sentence tokenizer models
nltk.download('stopwords')  # stop-word lists for several languages

In [56]:
#loading in the data
df = pd.read_csv("IMDB Dataset.csv")

In [58]:
#first 5 rows
#'review' holds the review text; 'sentiment' is the positive/negative label
df.head()

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.",positive
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I'd laughed at one of Woody's comedies in years (dare I say a decade?). While I've never been impressed with Scarlet Johanson, in this she managed to tone down her ""sexy"" image and jumped right into a average, but spirited young woman.<br /><br />This may not be the crown jewel of his career, but it was wittier than ""Devil Wears Prada"" and more interesting than ""Superman"" a great comedy to go see with friends.",positive
3,"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them.",negative
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a variation on the Arthur Schnitzler's play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of contact. Stylishly, the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.<br /><br />The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment, as one discerns is the case with most of the people we encounter.<br /><br />The acting is good under Mr. Mattei's direction. Steve Buscemi, Rosario Dawson, Carol Kane, Michael Imperioli, Adrian Grenier, and the rest of the talented cast, make these characters come alive.<br /><br />We wish Mr. Mattei good luck and await anxiously for his next work.",positive


In [50]:
def data_cleaning(df):
    
    start_1 = time()
    
    # Remove rows with empty cells
    df.dropna(inplace=True)
    df['review_cleaned'] = df['review'].copy()
    
    # Remove URLs and a leading "rt" marker
    df['review_cleaned'] = df['review_cleaned'].apply(
        lambda rev: re.sub(r"(\w+:\/\/\S+)|^rt|http.+?", "", rev))
        
    # Replace HTML keywords with blank space ("&quot;", "br", "&#34")
    remove_dict = {"<br /><br />": " ", "<br />": " ", "br ": "", "&quot;": " ", "&#34": " ",
                   "<BR>": " ", "_": ""}
    for key, val in remove_dict.items():
        df['review_cleaned'] = df['review_cleaned'].apply(
            lambda x: x.replace(key, val))
        
    end_1 = time()
        
    print(f"\n######## [{end_1 - start_1:0.2f} secs] Remove URL and HTML Keywords Complete ########")
    
    start_2 = time()
    
    # Remove punctuation (via the \w+ tokenizer), then digits
    tokenizer = RegexpTokenizer(r'\w+')
    df['review_cleaned'] = df['review_cleaned'].apply(
        lambda x: ' '.join([word for word in tokenizer.tokenize(x)]))
    
    remove_dict = {"0": "", "1": "", "2": "", "3": "", "4": "", "5": "", "6": "", "7": "", "8": "", "9": "",
                   "(": "", ")":""}
    for key, val in remove_dict.items():
        df['review_cleaned'] = df['review_cleaned'].apply(
            lambda x: x.replace(key, val))
    
    end_2 = time()
    
    print(f"\n######## [{end_2 - start_2:0.2f} secs] Remove Punctuation and Numbers Complete ########")
    
    start_3 = time()
    
    # Lowercase Words
    df['review_cleaned'] = df['review_cleaned'].str.lower()
    
    end_3 = time()
    
    print(f"\n######## [{end_3 - start_3:0.2f} secs] Lowercase Complete ########")
    
    start_4 = time()

    # Remove Stop Words.
    stop = set(stopwords.words('english'))  # a set makes membership tests fast
      
    df['review_cleaned'] = df['review_cleaned'].apply(
        lambda x: ' '.join([word for word in x.split() if word.strip() not in stop]))
    
    end_4 = time()
    
    print(f"\n######## [{end_4 - start_4:0.2f} secs] Remove Stop Words Complete ########")
    
    start_5 = time()
    
    # Lemmatization using .lemma_
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    df['review_cleaned'] = df['review_cleaned'].apply(
        lambda x: ' '.join([token.lemma_ for token in nlp(x)]))
    
    end_5 = time()
    
    print(f"\n######## [{end_5 - start_5:0.2f} secs] Lemmatization Complete ########")
    
    return df

In [52]:
# Initialize necessary resources
stop = set(stopwords.words('english'))  # English stop words; a set makes membership tests fast
tokenizer = RegexpTokenizer(r'\w+')
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def remove_urls_html(text):
    text = re.sub(r"(\w+:\/\/\S+)|^rt|http.+?", "", text)  # remove URLs and a leading "rt" marker
    remove_dict = {"<br /><br />": " ", "<br />": " ", "br ": "", "&quot;": " ", "&#34": " ",  "<BR>": " ", "_": ""}  # HTML line breaks, HTML entities, underscores
    for key, val in remove_dict.items():  # apply each replacement in turn
        text = text.replace(key, val)
    return text

def remove_punctuation_numbers(text):  # strip punctuation, then digits
    text = ' '.join(tokenizer.tokenize(text))  # keep only \w+ tokens, which drops punctuation
    remove_dict = {"0": "", "1": "", "2": "", "3": "", "4": "", "5": "", "6": "", "7": "", "8": "", "9": "",
                   "(": "", ")":""}
    for key, val in remove_dict.items():
        text = text.replace(key, val)
    return text

def lowercase(text):
    return text.lower()

def remove_stopwords(text):  # expects lowercased text, since the stop-word list is lowercase
    return ' '.join([word for word in text.split() if word not in stop])  # drop common low-information words such as "the"

def lemmatize(text):
    return ' '.join([token.lemma_ for token in nlp(text)])


In [None]:
def data_cleaning(df):
    start_1 = time()
    df.dropna(inplace=True)
    df['review_cleaned'] = df['review'].copy()
    
    df['review_cleaned'] = df['review_cleaned'].apply(remove_urls_html)
    end_1 = time()
    print(f"\n######## [{end_1 - start_1:0.2f} secs] Remove URL and HTML Keywords Complete ########")
    
    start_2 = time()
    df['review_cleaned'] = df['review_cleaned'].apply(remove_punctuation_numbers)
    end_2 = time()
    print(f"\n######## [{end_2 - start_2:0.2f} secs] Remove Punctuation and Numbers Complete ########")
    
    start_3 = time()
    df['review_cleaned'] = df['review_cleaned'].apply(lowercase)
    end_3 = time()
    print(f"\n######## [{end_3 - start_3:0.2f} secs] Lowercase Complete ########")
    
    start_4 = time()
    df['review_cleaned'] = df['review_cleaned'].apply(remove_stopwords)
    end_4 = time()
    print(f"\n######## [{end_4 - start_4:0.2f} secs] Remove Stop Words Complete ########")
    
    start_5 = time()
    df['review_cleaned'] = df['review_cleaned'].apply(lemmatize)
    end_5 = time()
    print(f"\n######## [{end_5 - start_5:0.2f} secs] Lemmatization Complete ########")
    
    return df  # each step above applies one of the helper functions from the previous cell

In [None]:
#everything is processed and ready to use
cleaned_df = data_cleaning(df)

In [None]:
cleaned_df.shape  # (rows, columns); the 'review_cleaned' column holds the cleaned reviews

In [None]:
cleaned_df.head()

In [None]:
sns.countplot(data=cleaned_df, x='sentiment')

The dataset is balanced between positive and negative reviews. This is good, because no additional tuning or resampling is needed to handle class imbalance.

Vectorization methods for converting the text into numerical representations:
* bag of words
* tf-idf
* word2vec - word embeddings
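To make the first two options concrete, here is a minimal from-scratch sketch of bag of words and TF-IDF on a toy corpus. The three documents and the helper names `bag_of_words` and `tf_idf` are made up for illustration, and the formula is the plain textbook one; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already cleaned and lowercased
docs = ["great movie great cast", "boring movie", "great fun"]
tokenized = [d.split() for d in docs]

# Vocabulary: every distinct word, in a fixed (sorted) order
vocab = sorted({w for doc in tokenized for w in doc})

def bag_of_words(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def tf_idf(tokens):
    """Weight each word by its in-document frequency times log(N / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vec = []
    for w in vocab:
        tf = counts[w] / len(tokens)                      # term frequency in this document
        df = sum(1 for doc in tokenized if w in doc)      # number of documents containing w
        vec.append(tf * math.log(n_docs / df))            # rarer words get a larger weight
    return vec

bow = [bag_of_words(t) for t in tokenized]
tfidf = [tf_idf(t) for t in tokenized]

print(vocab)   # ['boring', 'cast', 'fun', 'great', 'movie']
print(bow[0])  # [0, 1, 0, 2, 1]
```

In the second document, "boring" appears in only one review while "movie" appears in two, so "boring" receives the higher TF-IDF weight even though both words occur once.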

Training and testing follow an 80-20 split: 80% of the reviews are used for training and 20% for testing.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [None]:
X_train, X_test, y_train, y_test = train_test_split(cleaned_df['review_cleaned'], cleaned_df['sentiment'], test_size=0.2, random_state=42)

# Vectorization using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000) #converts text into numeric TF-IDF features, keeping the 5000 most frequent terms
X_train_vect = vectorizer.fit_transform(X_train) #learn the vocabulary and IDF weights from the training set only, then transform it
X_test_vect = vectorizer.transform(X_test)

model = LogisticRegression()
model.fit(X_train_vect, y_train) #trains the model

y_pred = model.predict(X_test_vect) #predict

#get accuracy score
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(classification_report(y_test, y_pred))

In [None]:
def clean_single_review(review):
    review = remove_urls_html(review)
    review = remove_punctuation_numbers(review)
    review = lowercase(review)
    review = remove_stopwords(review)
    review = lemmatize(review)
    return review

In [None]:
#custom input
def test_model(review):
    cleaned = clean_single_review(review) #clean text
    new_review_vect = vectorizer.transform([cleaned]) #get numerical values
    predicted_sentiment = model.predict(new_review_vect) #run prediction/model
    return predicted_sentiment #NumPy array holding one label: 'positive' or 'negative'

In [None]:
test_model("Hello how are you today?")

In [None]:
test_model("This is a great product!")

In [None]:
test_model("This product could be better")

---

In [None]:
test_model('This was a decent product')

In [None]:
test_model('I have had other products that I think work better than this one')

In [None]:
test_model('not the best')

Future work: use bigrams to build a custom word vectorization model.
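As a rough sketch of what bigram features would look like, the helper below (the name `word_ngrams` is hypothetical, not part of the notebook) turns a string into overlapping word pairs. Note that `TfidfVectorizer` also accepts an `ngram_range` parameter, for example `ngram_range=(1, 2)`, which would add bigrams to the existing pipeline without a custom vectorizer.

```python
def word_ngrams(text, n=2):
    """Return the overlapping word n-grams of a string (bigrams by default)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = word_ngrams("not the best")
print(bigrams)  # ['not the', 'the best']
```

Pairs like these keep some word order that single-word features lose. For negations to survive, the stop-word step would likely need revisiting, since the NLTK English list includes "not".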