## Hands-On Lab 5 - Document Classification

### Step 1 - Load Data

The *hotel reviews* data is stored as a CSV file inside the *HotelReviews.zip* archive. The *read_csv()* function from the *pandas* library reads the CSV data directly from the ZIP file, so no manual extraction is needed. Run the following code cell to load the data and display information about the data frame.

In [None]:
import pandas as pd

hotel_reviews = pd.read_csv('HotelReviews.zip')
hotel_reviews.info()
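
To see why no unzip step is needed, the following minimal sketch builds a small ZIP archive and reads it back with *read_csv()*. The toy file name, the toy rows, and the `Rating` column name are assumptions made for illustration; they are not taken from the lab data. pandas infers ZIP compression from the `.zip` extension and expects the archive to hold a single data file.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical toy data standing in for the real hotel reviews file
csv_text = "Review,Rating\nGreat stay,5\nNot clean,2\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "ToyReviews.zip")

    # Write a single CSV member into the archive
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("ToyReviews.csv", csv_text)

    # compression="infer" is the default; it is spelled out here for clarity
    toy_reviews = pd.read_csv(zip_path, compression="infer")

print(toy_reviews.shape)
```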

### Step 2 - Custom Normalization

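As discussed during lecture, the scikit-learn library classes provide extension points for using custom tokenization. The following code creates a global stop word list and a Snowball stemmer so that each object is instantiated only once. Negation words (e.g., *not*, *didn't*) are taken off the stop word list so they remain in the tokenized reviews, since negation can flip the sentiment of a review. Run the following code to create the objects.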

In [None]:
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# Customize the stop word list
stop_words = set(stopwords.words('english'))

# Negation words to take OFF the stop word list so they stay in the reviews
remove_words = ["mustn't", 'mustn', "couldn't", 'couldn', "hadn't", 'hadn',
                "didn't", 'didn', "wouldn't", 'wouldn', "wasn't", 'wasn',
                "isn't", 'isn', "doesn't", 'doesn', "weren't", 'weren',
                "hasn't", 'hasn', 'not']

for word in remove_words:
    # discard() does not raise an error if a word is missing from the list
    stop_words.discard(word)

# Instantiate the Snowball stemmer
snowball_stemmer = SnowballStemmer(language = 'english')
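
The effect of keeping negation words can be illustrated with a minimal sketch. The three-word stop list and the sample tokens below are hypothetical and stand in for the full NLTK list, so the example runs without downloading any corpus.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language="english")

# Hypothetical mini stop list for illustration only (not the NLTK list)
toy_stop_words = {"the", "was", "not"}
tokens = ["the", "room", "was", "not", "clean"]

# With 'not' treated as a stop word, the negation disappears
with_not_removed = [stemmer.stem(t) for t in tokens if t not in toy_stop_words]

# After taking 'not' off the stop list, the negation survives
toy_stop_words.discard("not")
with_not_kept = [stemmer.stem(t) for t in tokens if t not in toy_stop_words]

print(with_not_removed)
print(with_not_kept)
```

In the first result a negative review becomes indistinguishable from a positive one, which is the situation the customized stop word list avoids.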

### Step 3 - Custom Tokenization

The following code defines an NLTK-based custom tokenizer and uses it to create a document-term matrix. As discussed during lecture, the naive Bayes algorithm is typically used with token counts as opposed to TF-IDF. Run the following code to tokenize the data.

In [None]:
from nltk.tokenize import word_tokenize
import string
from sklearn.feature_extraction.text import CountVectorizer

# Define a custom tokenizer based on the NLTK
def nltk_tokenizer(text):
    # Split the text into word and punctuation tokens
    raw_tokens = word_tokenize(text)
    # Drop punctuation tokens
    punctuation_tokens = [token for token in raw_tokens if token not in string.punctuation]
    # Drop stop words (negation words were kept in Step 2)
    stop_words_tokens = [token for token in punctuation_tokens if token not in stop_words]
    # Stem the remaining tokens
    return [snowball_stemmer.stem(token) for token in stop_words_tokens]

# Add bigrams and trigrams, and constrain the dimensionality by keeping only terms
# that appear in at least 5 documents and in no more than 75% of all documents
count_vectorizer = CountVectorizer(tokenizer = nltk_tokenizer, token_pattern = None,
                                   ngram_range = (1, 3), min_df = 5, max_df = 0.75)

doc_term_matrix = count_vectorizer.fit_transform(hotel_reviews['Review'])
print(f'Rows: {doc_term_matrix.shape[0]}, Columns: {doc_term_matrix.shape[1]}')

### Step 4 - Train the Model

As discussed in lecture, the naive Bayes algorithm is simple to understand, simple to train, and surprisingly effective at classifying documents (e.g., email spam filtering). In this lab you will train a naive Bayes model to predict whether a review is positive (i.e., a rating of 4 or 5). Type the following code into the blank code cell in your lab notebook and run it to produce the output.

In [None]:
# Enter your lab code here
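
As a minimal sketch of what training might look like, the example below fits a model on toy data. Several things here are assumptions rather than part of the lab: the toy reviews, the `Rating` column name, and the choice of `MultinomialNB`, which is scikit-learn's naive Bayes variant for count features.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data; 'Rating' is an assumed column name
toy_train = pd.DataFrame({
    "Review": ["great room", "great staff", "dirty room", "rude staff"],
    "Rating": [5, 4, 1, 2],
})

# Positive label: a rating of 4 or 5
toy_labels = toy_train["Rating"] >= 4

# Token counts, as used with naive Bayes in Step 3
toy_vectorizer = CountVectorizer()
toy_dtm = toy_vectorizer.fit_transform(toy_train["Review"])

# Fit naive Bayes on the token counts
toy_model = MultinomialNB()
toy_model.fit(toy_dtm, toy_labels)

print(toy_model.classes_)
```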

### Step 5 - Tokenize the Test Data

Before predictions can be made, the test data needs to be tokenized in the same way the training data was tokenized. This ensures the vocabulary of the test data is the same as that of the training data. This is accomplished by reusing the fitted *CountVectorizer* object rather than fitting a new one. Type the following code into the blank code cell in your lab notebook and run it to produce the output.

In [None]:
# Enter your lab code here
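
The difference between fitting and reusing a vectorizer can be sketched on toy data. The documents below are hypothetical; the point is that `transform()` applies the vocabulary learned from the training data, so the test matrix has the same columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents
toy_train_docs = ["great room", "dirty room"]
toy_test_docs = ["great pool", "dirty pool"]

# Learn the vocabulary from the training documents only
toy_vectorizer = CountVectorizer()
toy_train_dtm = toy_vectorizer.fit_transform(toy_train_docs)

# transform() reuses the fitted vocabulary; unseen words such as 'pool' are dropped
toy_test_dtm = toy_vectorizer.transform(toy_test_docs)

print(toy_train_dtm.shape[1] == toy_test_dtm.shape[1])
```

Calling `fit_transform()` on the test documents instead would learn a different vocabulary, and the columns would no longer line up with what the model was trained on.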

### Step 6 - Make Predictions and Evaluate

When classifying documents, you want some idea of how effective the predictive model will be when faced with new, unseen data. Using a test set allows you to simulate this scenario and estimate how effective the model will be. Type the following code into the blank code cell in your lab notebook and run it to produce the output.

In [None]:
# Enter your lab code here
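
One possible way to score predictions is sketched below on toy data. The documents, labels, and the choice of accuracy and a confusion matrix as the evaluation measures are assumptions for illustration, not the lab's prescribed code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training and test sets (True = positive review)
toy_train_docs = ["great room", "great staff", "dirty room", "rude staff"]
toy_train_labels = [True, True, False, False]
toy_test_docs = ["great view", "dirty view"]
toy_test_labels = [True, False]

# Fit the vectorizer and the model on the training data only
toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB()
toy_model.fit(toy_vectorizer.fit_transform(toy_train_docs), toy_train_labels)

# Predict on the held-out documents using the fitted vocabulary
toy_predictions = toy_model.predict(toy_vectorizer.transform(toy_test_docs))

# Rows of the confusion matrix are actual labels, columns are predicted labels
toy_accuracy = accuracy_score(toy_test_labels, toy_predictions)
toy_confusion = confusion_matrix(toy_test_labels, toy_predictions)

print(toy_accuracy)
print(toy_confusion)
```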