## Import Libraries
Since we're working with a few things I have only really discusses briefly or read in passing, this project should be a log of fun. I import many of the usual libraries I do, but for some of the newer libraries 

### nltk
The natural language toolkit is a pretty amazing library that has all sorts of common NLP functions and various corpora. After doing some reading on what the common NLP libraries were, this one sprang up and I decided to give it a try. To install some corpora I had to run the following line:

$$\text{python -m textblob.download_corpora}$$

### textblob
This is another useful library for NLP but more specifically, sentiment analysis, noun-phrase extraction and tagging. I do not use it for much in the code, but a few of the examples I referenced used it.

In [1]:
import os
import re
import pandas as pd 
import numpy as np
from string import punctuation
from textblob import Word 

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk import word_tokenize

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kanin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kanin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kanin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
## Set Some Constants
seed_number = 35
np.random.seed(seed_number)
rev_col_name = 'review'
sent_col_name = 'sentiment'

## Import and Save Data Frame
First, I need to read in all of the data files. This assumes that you're running this code in the working directory where you have your data stored, and that all of the data is still in it's default folders after being unpack. I add them all to a data frame to begin working on them

In [3]:
folder_path = os.getcwd()

labels = {'pos':1, 'neg':0}
df = pd.DataFrame()
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(folder_path, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path,file), 'r', encoding="utf-8") as infile:
                txt = infile.read()
            df = df.append([[txt, labels[l]]], ignore_index=True)
df.columns = [rev_col_name, sent_col_name]
df.to_csv('movie_data.csv', index=False, encoding='utf-8') #Saving a csv in case the other files break later

  df = df.append([[txt, labels[l]]], ignore_index=True)


## Build Regression Model
First, I wanted to just build just a normal Logistic Regression model without any cleaning. I did not thumb through the data, but I am assuming everything is relatively nice looking. 
NOTE: this takes quite a while. There are a lot of individual files. I thought of writng a script to combine them first, but then I realized I would have to wait the same amount if not more time.

In [4]:
#Get Data
x_train, x_test, y_train, y_test = train_test_split(df[rev_col_name],df[sent_col_name], shuffle=True, test_size=0.2, random_state=seed_number)

#Make Model
clf = Pipeline(steps =[('preprocessing', CountVectorizer()), ('classifier', LogisticRegression(dual=False,max_iter=2000))])
clf.fit(x_train, y_train)
print(clf.score(x_test,y_test))
p = clf.predict(x_test)
t = y_test.tolist()

0.8889


It looks like our accuracy is pretty good already which means just vectorizing our text was enough. Something tells me this shouldn't be the case. For now, I'll go ahead and K-Fold just to be sure this accuracy is repeated.

## K-Fold Validation

In [5]:
split_cnt = 5
kf = KFold(n_splits=split_cnt)
kf.get_n_splits(df[rev_col_name])
data = []

for train_index, test_index in kf.split(df[rev_col_name]):
    x_train, x_test = df[rev_col_name][train_index], df[rev_col_name][test_index]
    y_train, y_test = df[sent_col_name][train_index], df[sent_col_name][test_index]
    clf = Pipeline(steps =[('preprocessing', CountVectorizer()), ('classifier', LogisticRegression(dual=False,max_iter=2000))])
    clf.fit(x_train, y_train)
    data += [clf.score(x_test,y_test)]
print(sum(data)/len(data))

0.8341000000000001


Well, it seems like the accuracy is reproducible. I'm still going to check the data and clean it up just in case we're experiecing any overfitting. 

## Clean Data Function
I cannot take credit for everything hear. I did some reading on what sorts of things I could do to improve my accuracy. I'm not sure if we have any html tags or not, but all of the others are very applicable. I especially like removing all the most frequent words as they tend to add very little to your information (words like 'the', 'a', and things of the sort). Sadly, I was not able to get lemmatization to work due to some key errors. My guess is there is a name or something that is unrecognized. 

In [6]:
def clean_reviews(frame):
    #Remove HTML Tags
    frame[rev_col_name] = frame[rev_col_name].apply(lambda words: re.sub('<.*?>','',words))
    
    #Word Tokenization
    frame[rev_col_name] = frame[rev_col_name].apply(word_tokenize)
    
    #Convert to lower case
    frame[rev_col_name] = frame[rev_col_name].apply(lambda words: [x.lower() for x in words])
    
    #Remove Punctuation
    frame[rev_col_name] = frame[rev_col_name].apply(lambda words: [x for x in words if not x in punctuation])
    
    #Remove Numbers
    frame[rev_col_name] = frame[rev_col_name].apply(lambda words: [x for x in words if not x.isdigit()])
    
    #Remove Frequent Words
    temp = frame[rev_col_name].apply(lambda words: " ".join(words))
    freq = pd.Series(temp).value_counts()[:10] #removing 10 most common
    frame[rev_col_name] = frame[rev_col_name].apply(lambda words: [x for x in words if x not in freq.keys()])
    
    #Lemmatization
    #frame['review'] = frame['review'].apply(lambda words: " ".join([Word(x).lemmatize() for x in words]))
    frame['review'] = frame['review'].apply(lambda words: " ".join(words))
    return frame

## Cleaning Data 
Alright, time to clean up our data and re-run

In [7]:
df = clean_reviews(df)
x_train, x_test, y_train, y_test = train_test_split(df[rev_col_name],df[sent_col_name], shuffle=True, test_size=0.2, random_state=seed_number)

## Build Regression Model (Pt. 2)

In [8]:
clf = Pipeline(steps =[('preprocessing', CountVectorizer()), ('classifier', LogisticRegression(dual=False,max_iter=2000))])
clf.fit(x_train, y_train)
print(clf.score(x_test,y_test))
p = clf.predict(x_test) 
t = y_test.tolist()

0.8884


Hmm, well that accuracy is basically the same for our purposes. Did all that cleaning mean nothing?

## K-Fold Validation (Pt.2)

In [9]:
review_kf = KFold(n_splits=split_cnt)
kf.get_n_splits(df[rev_col_name])
data = []

for train_index, test_index in kf.split(df[rev_col_name]):
    x_train, x_test = df[rev_col_name][train_index], df[rev_col_name][test_index]
    y_train, y_test = df[sent_col_name][train_index], df[sent_col_name][test_index]
    clf = Pipeline(steps =[('preprocessing', CountVectorizer()), ('classifier', LogisticRegression(dual=False,max_iter=2000))])
    clf.fit(x_train, y_train)
    data += [clf.score(x_test,y_test)]
print(sum(data)/len(data))

0.83492


Well, I guess the data was already cleaned enough and the count vectorization pulls a majority of the weight in creating the model.

## Conclusion
Although I did not follow the book's code explicitly, I still followed along with what it was teaching me about Natural Language Processing and sentiment analysis. It was quite fun to actually dig my hands into a project like this especially because I have some research where I can turn around and use these skills. I opted to follow some other examples of this project so I could experience using some of these NLP libraries too (the one in the book was pretty cut and dry base python for the most part I felt). Overall, I'm surprised how much vectorizing the review does by itself. I almost wish I had worse data to see how much the cleaning could do. I feel like this project was a really good way to use logistic regression and NLP, and I look forward to doing the next one.

## References
- Textbook pages
- https://textblob.readthedocs.io/en/dev/
- https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
- https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
- https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/
- https://kavita-ganesan.com/what-are-stop-words/#:~:text=Stop%20words%20are%20a%20set,on%20the%20important%20words%20instead.
- https://monkeylearn.com/sentiment-analysis/
- https://www.analyticsvidhya.com/blog/2015/10/6-practices-enhance-performance-text-classification-model/
- https://www.nltk.org/
- https://medium.com/@pyashpq56/sentiment-analysis-on-imdb-movie-review-d004f3e470bd
- https://medium.com/hackerdawn/imdb-review-sentiment-analysis-using-logistic-regression-d7878ee01947
- https://towardsdatascience.com/imdb-reviews-or-8143fe57c825