## Hands-On Lab 4 - Grouping Documents

### Step 1 - Load Data

The *hotel reviews* data is stored as a CSV file inside the *HotelReviews.zip* file. The *read_csv()* function from the *pandas* library infers the compression from the *.zip* extension and loads the CSV data directly from the ZIP file. Run the following code cell to load the data and display info about the data frame.

In [None]:
import pandas as pd

hotel_reviews = pd.read_csv('HotelReviews.zip')
hotel_reviews.info()
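
To see the ZIP handling in isolation, here is a minimal sketch that builds a tiny stand-in archive and reads it back. The file name *ToyReviews.zip* and its two columns are made up for illustration and are not part of the lab data.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

# Build a tiny stand-in archive; the file and column names are hypothetical
with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = Path(tmp_dir) / "ToyReviews.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("ToyReviews.csv", "Review,Rating\nGreat stay,5\nNoisy room,2\n")

    # The default compression='infer' detects the .zip extension;
    # the archive must contain exactly one data file
    toy_reviews = pd.read_csv(zip_path)

print(toy_reviews.shape)
toy_reviews.info()
```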

### Step 2 - Custom Normalization

As discussed during lecture, the scikit-learn library classes provide extension points for custom tokenization. The following code creates a global stopword list and a Snowball stemmer so that each object is created only once rather than on every tokenizer call. Run the following code to create the objects.

In [None]:
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# Customize the stopword list
stop_words = set(stopwords.words('english'))

# Negated contractions to take out of the stopword list so they are not filtered out
remove_words = ["mustn't", 'mustn', "couldn't", 'couldn', "hadn't", 'hadn',
                "didn't", 'didn', "wouldn't", 'wouldn', "wasn't", 'wasn',
                "isn't", 'isn', "doesn't", 'doesn', "weren't", 'weren',
                "hasn't", 'hasn']

for word in remove_words:
    stop_words.remove(word)

# Instantiate Snowball stemmer
snowball_stemmer = SnowballStemmer(language = 'english')
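
The effect of trimming a stopword list can be sketched without NLTK. The example below is only an illustration: the miniature stopword set is hypothetical, tokens come from a plain whitespace split rather than an NLTK tokenizer, and the stated motivation (negations presumably carry meaning worth keeping) is an assumption rather than something the lab spells out.

```python
# Hypothetical miniature stopword list; the lab uses NLTK's full English list
toy_stop_words = {"the", "was", "and", "wasn't", "didn't"}

# Take negations out of the stopword list so they survive filtering.
# discard() is used here because it does not raise if a word is absent.
for word in ["wasn't", "didn't", "couldn't"]:
    toy_stop_words.discard(word)

# Whitespace split keeps contractions whole (a simplification for this sketch)
tokens = "the room wasn't clean and the shower didn't work".split()
kept_tokens = [token for token in tokens if token not in toy_stop_words]
print(kept_tokens)
```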

### Step 3 - Custom Tokenization

The term frequency-inverse document frequency (TF-IDF) calculation is very commonly used when documents are grouped by similarity under the vector space model, because it down-weights terms that appear in most documents and emphasizes terms that distinguish one document from another. Run the following code to tokenize the data and build the TF-IDF document-term matrix.

In [None]:
from nltk.tokenize import word_tokenize
import string
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a custom tokenizer based on NLTK
def nltk_tokenizer(text):
    raw_tokens = word_tokenize(text)
    # Drop punctuation tokens
    no_punctuation = [token for token in raw_tokens if token not in string.punctuation]
    # Drop stopwords (the comparison is case-sensitive)
    no_stop_words = [token for token in no_punctuation if token not in stop_words]
    # Stem the remaining tokens
    return [snowball_stemmer.stem(token) for token in no_stop_words]

# Include unigrams, bigrams, and trigrams, and constrain the dimensionality by requiring
# a term to show up in at least 5 documents and in at most 75% of all documents
tfidf_vectorizer = TfidfVectorizer(tokenizer = nltk_tokenizer, token_pattern = None, 
                                   ngram_range = (1, 3), min_df = 5, max_df = 0.75)

doc_term_matrix = tfidf_vectorizer.fit_transform(hotel_reviews['Review'])
print(f'Rows: {doc_term_matrix.shape[0]}, Columns: {doc_term_matrix.shape[1]}')
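
The way *min_df* and *max_df* prune the vocabulary is easier to see on a toy corpus. The sketch below is an illustration only: the four documents are made up, the default scikit-learn tokenizer stands in for the NLTK tokenizer, and the thresholds are scaled down to fit four documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "hotel" appears in every document, "noisy" in only one
toy_docs = [
    "hotel clean room friendly staff",
    "hotel clean room noisy street",
    "hotel friendly staff great breakfast",
    "hotel clean lobby great breakfast",
]

# min_df=2 drops terms found in fewer than 2 documents;
# max_df=0.75 drops terms found in more than 75% of documents
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.75)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

print(sorted(toy_vectorizer.vocabulary_))
print(f'Rows: {toy_matrix.shape[0]}, Columns: {toy_matrix.shape[1]}')
```

Here "hotel" is removed for being too common and "noisy" for being too rare, while a bigram such as "clean room" survives because it appears in two documents.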

### Step 4 - Cosine Similarity

TF-IDF document-term matrices are a prime use case for cosine similarity. The TF-IDF calculation has a normalizing effect on the data (e.g., putting small documents on more equal footing with large documents), which often improves the cosine similarity results. Type the following code into the blank code cell in your lab notebook and run it to produce the output.

In [None]:
# Enter your lab code here
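
As a point of reference, and not the lab's own solution, the following sketch computes pairwise cosine similarity on a made-up three-document corpus. It assumes scikit-learn's *cosine_similarity* function and the default *TfidfVectorizer* settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up corpus: a short review, a longer review on the same topic,
# and an unrelated review with no shared words
toy_docs = [
    "clean room friendly staff",
    "clean room and very friendly staff at the front desk all week",
    "terrible breakfast cold coffee",
]

toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

# Entry [i, j] compares document i with document j
similarities = cosine_similarity(toy_matrix)
print(similarities.round(2))
```

The short and long reviews score as similar despite their different lengths, and the unrelated review scores zero against the first. Because *TfidfVectorizer* L2-normalizes each row by default, the cosine similarity here equals the plain dot product of the rows.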

### Step 5 - Clustering with K-Means

Another common use case for TF-IDF document-term matrices is clustering using the k-means algorithm. As you learned in lecture, k-means relies on distance for determining clusters and is sensitive to outliers. Again, TF-IDF's normalization of values can help make k-means more effective. Type the following code into the blank code cell in your lab notebook and run it to produce the output.

In [None]:
# Enter your lab code here
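
For orientation only, here is a minimal k-means sketch on a made-up corpus with two obvious themes. The choice of two clusters, the *n_init* value, and the *random_state* are assumptions for this toy example and say nothing about the settings the lab uses.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: three reviews about rooms, three about breakfast and service
toy_docs = [
    "clean room comfortable bed",
    "comfortable bed clean room quiet",
    "quiet clean room",
    "cold breakfast bad coffee",
    "bad coffee cold breakfast slow service",
    "slow service bad breakfast",
]

toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

# n_clusters=2 is an arbitrary choice for this toy corpus;
# random_state makes the run repeatable
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
toy_labels = toy_kmeans.fit_predict(toy_matrix)
print(toy_labels)
```

The cluster numbers themselves are arbitrary; what matters is which documents share a label.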

### Step 6 - Examining Clusters Part 1

Looking at random samples of the documents assigned to each cluster is an important part of getting a feel for the effectiveness of the clustering. Larger random samples, of course, are more useful than smaller samples. Type the following code into the blank code cell in your lab notebook and run it to produce the output.

In [None]:
# Enter your lab code here
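
One possible way to draw an equal-sized random sample from every cluster is sketched below. The data frame, its *Cluster* column, and the sample size are hypothetical; the sketch relies on the pandas *groupby().sample()* method.

```python
import pandas as pd

# Hypothetical reviews with cluster labels already assigned
toy_clustered = pd.DataFrame({
    "Review": [f"review {i}" for i in range(10)],
    "Cluster": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# Draw the same number of random reviews from every cluster
toy_samples = toy_clustered.groupby("Cluster").sample(n=3, random_state=42)

for cluster_id, group in toy_samples.groupby("Cluster"):
    print(f"Cluster {cluster_id}:")
    for review in group["Review"]:
        print(f"  {review}")
```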

### Step 7 - Examining Clusters Part 2

In [None]:
# Enter your lab code here