# Introduction

We are going to run increasingly sophisticated classification models on whisky reviews.

## Classifier based on TF-IDF vectorization of reviews

### Follow Along 

1. Join the Kaggle Competition https://www.kaggle.com/c/whiskey-201911/submissions
2. Download the data
3. Train and hyperparameter tune a model using an sklearn pipeline
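
As a rough preview of step 3, here is a minimal sketch of a text classification pipeline. The toy reviews, the labels, and the names `toy_reviews`, `toy_labels`, and `toy_pipe` are made up for illustration; the real notebook uses the competition data loaded further down.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy reviews and made-up category labels (illustrative only, not competition data)
toy_reviews = [
    "smoky peat and sea salt",
    "heavy peat smoke with brine",
    "sweet vanilla and caramel",
    "caramel toffee and sweet oak",
]
toy_labels = [1, 1, 2, 2]

# The pipeline vectorizes raw text and feeds the TF-IDF matrix to the classifier
toy_pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])
toy_pipe.fit(toy_reviews, toy_labels)

# Raw strings go in; the pipeline applies the fitted vectorizer before predicting
toy_pred = toy_pipe.predict(["peat smoke and brine", "vanilla toffee"])
print(toy_pred)
```

Because the vectorizer lives inside the pipeline, it is refit on each training fold during cross-validation, which avoids leaking vocabulary statistics from the validation fold.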

#### Get spacy and restart runtime

In [None]:
#YOUR CODE HERE
!python -m spacy download en_core_web_sm

#### import necessary packages, load spacy

In [None]:
import pandas as pd
import re

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
import spacy

Load `spacy`

In [None]:
#YOUR CODE HERE
nlp = spacy.load('en_core_web_sm')

#### Load Kaggle Whisky Competition Data
The goal is to predict each whisky's `category` label from its review text.

In [None]:
# !!!!! You may need to change the path !!!!!
# You can download these datasets from the Kaggle in-class 
# competition for your cohort. 
 
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [None]:
test

In [None]:
train.head()

In [None]:
train['description'] = train['description'].astype(str)
train['category'] = train['category'].astype(str)
test['description'] = test['description'].astype(str)
train.dtypes

In [None]:
train['description'][0]

### 1.1 Clean Text

In [None]:
def clean_doc(text):
  # COMPLETE THE CODE IN THIS CELL
  # case normalize
  text = text.lower()
  # replace literal "\n" sequences (a backslash followed by n) with a space
  text = text.replace('\\n', ' ')
  # remove non-breaking spaces
  text = text.replace('\xa0', '')
  # replace anything that is not a letter (digits, punctuation, whitespace) with a space
  text = re.sub(r'[^a-zA-Z]', ' ', text)
  # collapse runs of spaces into a single space
  text = re.sub(r"[ ]{2,}", " ", text)

  # strip leading and trailing white space
  text = text.strip()
  return text

train['description'] = train['description'].apply(clean_doc)
test['description'] = test['description'].apply(clean_doc)
train['description'][0]


In [None]:
def lemmatize(text):
    """Return the space-joined lemmas of the alphabetic, non-stop-word tokens in text."""
    lemmas = [token.lemma_.strip() for token in nlp(text)
              if not token.is_stop and not token.is_punct
              and token.is_alpha and len(token.lemma_.strip()) > 1]
    return ' '.join(lemmas)

train['description'] = train['description'].apply(lemmatize)
test['description'] = test['description'].apply(lemmatize)

train['description'][0]

In [None]:
from collections import Counter
def count(token_lists):
    """
    Calculates some basic statistics about the tokens in our corpus (a corpus is a collection of text documents)
    """
    # stores the count of each token
    word_counts = Counter()
    
    # stores the number of docs that each token appears in 
    appears_in_docs = Counter()

    total_docs = len(token_lists)

    for token_list in token_lists:
        # stores count of every appearance of a token 
        word_counts.update(token_list)
        
        # use set() in order to not count duplicates, thereby count the num of docs that each token appears in
        appears_in_docs.update(set(token_list))

    # build word count dataframe
    word_count_dict = zip(word_counts.keys(), word_counts.values())
    wc = pd.DataFrame(word_count_dict, columns = ['word', 'count'])

    # rank the word counts
    wc['rank'] = wc['count'].rank(method='first', ascending=False)
    total = wc['count'].sum()

    # calculate each token's fraction of the total token count
    wc['fraction_of_total'] = wc['count'].apply(lambda token_count: token_count / total)

    # calculate the cumulative fraction of the total token count, in rank order
    wc = wc.sort_values(by='rank')
    wc['cumulative_fraction_of_total'] = wc['fraction_of_total'].cumsum()

    # create dataframe for document stats
    t2 = zip(appears_in_docs.keys(), appears_in_docs.values())
    ac = pd.DataFrame(t2, columns=['word', 'appears_in_docs'])
    
    # merge word count stats with doc stats
    wc = ac.merge(wc, on='word')

    wc['appears_in_fraction_of_docs'] = wc['appears_in_docs'].apply(lambda x: x / total_docs)

    return wc.sort_values(by='rank')
#token_lists = [doc.split(' ') for doc in train['description']]
#count(token_lists)

### Split training data into Feature Matrix `X` and Target Vector `y`

In [None]:
target = 'category'
# COMPLETE THE CODE IN THIS CELL
y = train[target]
X = train['description']

### Specify the Model and Define the Pipeline Components

For the classifier model, you can try any or several of 
* `RandomForestClassifier()` or `GradientBoostingClassifier()` from the `sklearn` library
* `XGBClassifier()` from the `xgboost` library
* `CatBoostClassifier()` from the `catboost` library
* `LGBMClassifier()` from the `lightgbm` library
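
Trying a different classifier does not require rebuilding the pipeline. The sketch below swaps one step by name with `set_params`; the variable name `swap_pipe` is an assumption for illustration, and only `sklearn` classifiers are used so the example runs without extra installs.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

swap_pipe = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Replace only the 'clf' step; the vectorizer step is left untouched
swap_pipe.set_params(clf=GradientBoostingClassifier(random_state=42))
print(type(swap_pipe.named_steps["clf"]).__name__)
```

After a swap, any `clf__...` keys in the search space must name parameters that the new classifier actually has.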


In [None]:
# max_features is not capped here; it is tuned in the search space below (smaller values train faster on Colab).
# COMPLETE THE CODE IN THIS CELL
vect = TfidfVectorizer(stop_words="english")
clf = RandomForestClassifier(random_state=42)

pipe = Pipeline([('vect', vect), ('clf', clf)])

In [None]:
'''
vect.fit(X)
dtm = vect.transform(X)
print(vect.get_feature_names_out())
print(type(dtm))
print(dtm.todense())
'''

### Define Search Space
Look for both the best hyperparameters of vectorizer and classification model. 
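
A minimal sketch of how one search can cover both components: keys in the search space are written as `<step name>__<parameter>`. The toy documents and the names `demo_pipe`, `demo_space`, and `demo_search` are assumptions for illustration, and the value ranges are deliberately tiny.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Every toy document contains "whisky", so min_df=2 never empties the vocabulary
demo_docs = [
    "whisky peat smoke", "whisky peat brine", "whisky smoke ash", "whisky peat ash",
    "whisky vanilla caramel", "whisky caramel toffee", "whisky vanilla oak", "whisky toffee oak",
]
demo_labels = [1, 1, 1, 1, 2, 2, 2, 2]

demo_pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# 'vect__' keys reach the vectorizer, 'clf__' keys reach the classifier
demo_space = {
    "vect__min_df": [1, 2],
    "clf__n_estimators": [10, 20],
    "clf__max_depth": [2, 4],
}

demo_search = RandomizedSearchCV(
    demo_pipe, param_distributions=demo_space, n_iter=4, cv=2, random_state=0
)
demo_search.fit(demo_docs, demo_labels)
print(demo_search.best_params_)
```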

In [None]:
# COMPLETE THE CODE IN THIS CELL
# Parameters to search in dictionary 
import numpy as np
parameters = {
    'vect__max_df': [0.95, 1.0],
    'vect__min_df': range(14, 24, 2),
    'vect__max_features': range(200, 400, 10),
    'clf__n_estimators': range(300, 460, 10),
    'clf__max_depth': range(10, 40, 3)
}

# Implement a grid search with cross-validation
#grid_search = GridSearchCV(pipe, param_grid=parameters, n_jobs=-1, cv=2, verbose=1)
#grid_search.fit(X, y)

# Display the best score from the grid search
#grid_search.best_score_


In [None]:
import os
import tensorflow as tf

if 'COLAB_TPU_ADDR' not in os.environ:
    print('ERROR: Not connected to a TPU runtime')
else:
    tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
    print ('TPU address is', tpu_address)

In [None]:
grid_search = RandomizedSearchCV(
    pipe,
    param_distributions = parameters,
    n_jobs = -1,
    cv = 2,
    verbose = 1,
    n_iter = 500
)

grid_search.fit(X, y)

In [None]:
# Display the best score and parameters from the randomized search
'''
0.7408866177573559
{'vect__min_df': 10, 'vect__max_features': 2000, 'vect__max_df': 1.0, 'clf__n_estimators': 340, 'clf__max_depth': 80}
'''
print(grid_search.best_score_)
print(grid_search.best_params_)

### 1.5 Make a Submission File

In [None]:
test['description']

In [None]:
# COMPLETE THE CODE IN THIS CELL
# Predictions on **test** sample
pred = grid_search.predict(test['description'])

In [None]:
# COMPLETE THE CODE IN THIS CELL
submission = pd.DataFrame({'id' : test['id'], 'category': pred})
submission['category'] = submission['category'].astype('int64')

In [None]:
# Make Sure the Category is an Integer
submission

In [None]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model
submission_number = 0

submission.to_csv(f'submission{submission_number}.csv', index=False)
submission_number += 1

In [None]:
# Download submission to local machine from this Google Colab notebook
from google.colab import files
files.download(f'submission{submission_number-1}.csv')

### 1.6 Submit results to `kaggle` and get score

First, upload the `kaggle.json` API token file from your local machine.<br>
Do this by clicking the file icon in the left sidebar, <br>
then clicking the file icon with an up arrow inside it at the upper left, <br>
then navigating to and selecting the `kaggle.json` file on your local machine.<br>
`kaggle.json` is usually found in a folder called `.kaggle` on your local machine.<br>

Then make a folder `/root/.kaggle` in this notebook,<br>
and move the `kaggle.json` file into the `/root/.kaggle/` folder.
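
The same steps can also be done from Python instead of shell commands. This is a minimal sketch: the helper name `install_kaggle_token` is hypothetical, and the demonstration runs in a temporary directory with a placeholder token rather than touching `/root/.kaggle`.

```python
import os
import shutil
import tempfile

def install_kaggle_token(token_path, kaggle_dir):
    """Copy an API token into kaggle_dir and restrict it to owner read/write."""
    os.makedirs(kaggle_dir, exist_ok=True)
    dest = os.path.join(kaggle_dir, "kaggle.json")
    shutil.copy(token_path, dest)
    os.chmod(dest, 0o600)  # same effect as `chmod 600` on the copied file
    return dest

# Demonstration in a temporary directory with a placeholder token file
with tempfile.TemporaryDirectory() as tmp:
    fake_token = os.path.join(tmp, "kaggle.json")
    with open(fake_token, "w") as f:
        f.write('{"username": "placeholder", "key": "placeholder"}')
    installed = install_kaggle_token(fake_token, os.path.join(tmp, ".kaggle"))
    installed_mode = oct(os.stat(installed).st_mode & 0o777)
    print(installed, installed_mode)
```

In the notebook itself, the call would presumably look like `install_kaggle_token('<path to uploaded kaggle.json>', '/root/.kaggle')`, with the first argument depending on where the upload landed.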

In [None]:
!mkdir -p /root/.kaggle/
!mv /kaggle.json /root/.kaggle/ 
!chmod 600 /root/.kaggle/kaggle.json # to safeguard your privacy
!ls -l /root/.kaggle/

## 2. Add Latent Semantic Indexing to your pipeline (Learn)
<a id="p2"></a>

### Follow Along
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try: 
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (LSI) into your pipeline. *Note:* You can grid search a nested pipeline, but you have to chain the step names with double underscores, e.g. `lsa__svd__n_components` when the nested pipeline step is named `lsa`
4. Make a submission to Kaggle 
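
To see which double-underscore names a nested pipeline exposes, you can list them with `get_params()`. A small sketch, assuming the step names `lsa`, `vect`, `svd`, and `clf`; the variables `inner` and `outer` are illustrative.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

inner = Pipeline([("vect", TfidfVectorizer()), ("svd", TruncatedSVD())])
outer = Pipeline([("lsa", inner), ("clf", RandomForestClassifier())])

# Each tunable parameter is addressed as <outer step>__<inner step>__<parameter>
nested_keys = sorted(k for k in outer.get_params() if k.startswith("lsa__svd__"))
print(nested_keys)
```

Any key returned by `get_params()` is a valid key for the search space, so this is a quick way to catch a misspelled prefix before launching a long search.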


### 2.1 Define Pipeline Components

Nest pipelines to perform SVD on our vectorization (LSA)
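
The effect of the SVD step is easiest to see from the matrix shapes. This sketch uses made-up documents and the illustrative names `lsa_docs`, `tfidf`, `svd_demo`, and `topics`: the TF-IDF matrix has one column per vocabulary term, and the truncated SVD compresses it to `n_components` dense columns.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

lsa_docs = [
    "peat smoke and brine",
    "smoke ash and peat",
    "vanilla caramel and oak",
    "toffee caramel and vanilla",
    "brine salt and smoke",
    "oak toffee and honey",
]

# Sparse matrix: one row per document, one column per vocabulary term
tfidf = TfidfVectorizer().fit_transform(lsa_docs)

# Dense matrix: one row per document, one column per latent component
svd_demo = TruncatedSVD(n_components=2, random_state=42)
topics = svd_demo.fit_transform(tfidf)

print(tfidf.shape, "->", topics.shape)
```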

In [None]:
# COMPLETE THE CODE IN THIS CELL
# Transforming our Vectorization with SVD is how LSA generates topic columns
svd = TruncatedSVD(algorithm='randomized', n_iter=10)

# vectorizer and classifier like before
vect = TfidfVectorizer(stop_words="english")
clf = RandomForestClassifier(random_state=42)

# LSA pipeline with vectorizer & truncated SVD
lsa = Pipeline([('vect', vect), ('svd', svd)])

# combine LSA pipeline together with classifier
pipe = Pipeline([('lsa', lsa), ('clf', clf)])

### 2.2 Define Your grid search space and run a grid search with cross-validation
You're looking for both the best hyperparameters of your vectorizer and your classification model. 
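
A quick count shows why a randomized search is a practical choice here. The sketch below multiplies the number of candidate values per parameter, using ranges that mirror the search space in this section; the name `section2_space` is an assumption for illustration.

```python
from math import prod

section2_space = {
    "lsa__svd__n_components": range(10, 40, 2),
    "lsa__svd__n_iter": range(2, 17, 3),
    "lsa__vect__max_df": [0.98, 1.0],
    "lsa__vect__min_df": range(10, 20, 2),
    "lsa__vect__max_features": range(200, 400, 20),
    "clf__n_estimators": range(320, 420, 10),
}

# An exhaustive grid search would fit every combination of values
n_combinations = prod(len(values) for values in section2_space.values())
print(n_combinations)  # 75000
```

With `n_iter = 600`, the randomized search samples fewer than 1% of these combinations, and each sampled combination is fit once per cross-validation fold.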

In [None]:
# COMPLETE THE CODE IN THIS CELL
'''
0.7408866177573559
{'vect__min_df': 10, 'vect__max_features': 2000, 'vect__max_df': 1.0, 'clf__n_estimators': 340, 'clf__max_depth': 80}
'''
parameters = {
    'lsa__svd__n_components': range(10, 40, 2),
    'lsa__svd__n_iter': range(2, 17, 3),
    'lsa__vect__max_df': [0.98, 1.0],
    'lsa__vect__min_df': range(10, 20, 2),
    'lsa__vect__max_features': range(200, 400, 20),
    'clf__n_estimators': range(320, 420, 10),
    #'clf__max_depth': range(70, 90, 2),
}

grid_search = RandomizedSearchCV(
    pipe,
    param_distributions = parameters,
    n_jobs = -1,
    cv = 2,
    verbose = 1,
    n_iter = 600
)

grid_search.fit(X, y)

In [None]:
grid_search.best_score_

In [None]:
grid_search.best_params_

### 2.3 Make a Submission File
See section 1.6 above for instructions on how to submit your results file to `kaggle` and get your score

In [None]:
# Predictions on test sample
pred = grid_search.predict(test['description'])

In [None]:
submission = pd.DataFrame({'id': test['id'], 'category':pred})
submission['category'] = submission['category'].astype('int64')

In [None]:
# Make Sure the Category is an Integer
submission.head()

In [None]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model
submission_number = 1  # use a new number so the section 1 submission file is not overwritten
submission.to_csv(f'submission{submission_number}.csv', index=False)

In [None]:
# Download submission to your local machine from this Colab notebook
from google.colab import files
files.download(f'submission{submission_number}.csv')
submission_number +=1

## Challenge

Continue to apply Latent Semantic Indexing (LSI) to various datasets. 

## 3. Add spaCy Word Embeddings
<a id="p3"></a>

## Challenge

What you should be doing now:
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try: 
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (LSI) into your pipeline. *Note:* You can grid search a nested pipeline, but you have to chain the step names with double underscores, e.g. `lsa__svd__n_components` when the nested pipeline step is named `lsa`
    - Try to extract word embeddings with Spacy and use document vectors made from those word embeddings as your features for a classification model.
4. Make a submission to Kaggle 
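
A document vector built from word embeddings is typically just the average of the vectors of the words in the document, which is also what spaCy's `Doc.vector` returns by default when word vectors are available. The sketch below shows the idea with a hypothetical three-dimensional embedding table; `toy_embeddings` and `doc_vector` are illustrative names, not part of spaCy.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; real word vectors have far more dimensions
toy_embeddings = {
    "peat": np.array([1.0, 0.0, 0.0]),
    "smoke": np.array([0.8, 0.2, 0.0]),
    "vanilla": np.array([0.0, 1.0, 0.0]),
}

def doc_vector(text, embeddings, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [embeddings[tok] for tok in text.split() if tok in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

vec = doc_vector("peat smoke", toy_embeddings)
print(vec)
```

Every document maps to a vector of the same length regardless of how many words it contains, so the result can be passed directly to a classifier as a feature matrix.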

### 3.1 Process the data set with spacy

In [None]:
# Apply to your Dataset

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier

from scipy.stats import randint

In [None]:
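# Continue Word Embedding Work Here
# Note: en_core_web_sm ships without static word vectors, so `.vector` falls back to
# context-sensitive tensors; en_core_web_md or en_core_web_lg provide true word vectors.
nlp = spacy.load("en_core_web_sm")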

def get_word_vectors(docs):
    # YOUR CODE HERE
    return [nlp(d).vector for d in docs]

X_train_emb = get_word_vectors(train['description'])
X_test_emb = get_word_vectors(test['description'])

In [None]:
train['description'][0]

In [None]:
gbc = GradientBoostingClassifier()
params = { 
    'n_estimators': range(280, 400, 10), 
    'max_depth': range(6, 20, 2)
}

rsrf = RandomizedSearchCV(gbc,
                  param_distributions=params, 
                  cv=2, 
                  n_jobs=-1, 
                  verbose=1,
                  n_iter=20)
rsrf.fit(X_train_emb, y)

In [None]:
# the gradient boosting model massively overfits the training data
print('Training Accuracy: ', rsrf.score(X_train_emb, y))

Here we use `best_score_` (the mean cross-validated accuracy of the best parameter set) as a **proxy** for the test score;<br>
for your submission, you will predict on the test set, as before
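
The gap between training accuracy and a held-out estimate can be checked directly. This is a minimal sketch on synthetic data from `make_classification`, not on the whisky embeddings; `X_demo`, `y_demo`, and `gb` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the document-vector features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)
gb = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Accuracy on folds the model did not see during fitting
cv_scores = cross_val_score(gb, X_demo, y_demo, cv=2)

# Accuracy on the same data the model was fit on
train_score = gb.fit(X_demo, y_demo).score(X_demo, y_demo)

print("Training accuracy:", train_score)
print("Mean CV accuracy: ", cv_scores.mean())
```

A training accuracy well above the cross-validated accuracy is the usual sign of overfitting, which is why the cross-validated number is the better guide to the expected Kaggle score.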

In [None]:
# cross-validated score and parameters of the best candidate from the search
print(rsrf.best_score_)
print(rsrf.best_params_)

### 3.2 Make a Submission File
See section 1.6 above for instructions on how to submit your results file to `kaggle` and get your score


In [None]:
# Predictions on test sample
pred = rsrf.predict(X_test_emb)

In [None]:
# YOUR CODE HERE
submission = pd.DataFrame({'id': test['id'], 'category':pred})
submission['category'] = submission['category'].astype('int64')

In [None]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model
submission_number = 2
submission.to_csv(f'submission{submission_number}.csv', index=False)

In [None]:
# Download submission to local machine from Google Colab
from google.colab import files
files.download(f'submission{submission_number}.csv')

### 3.3 Submit your predictions to Kaggle


---



In [None]:
# YOUR CODE HERE 

# Post Lecture Assignment (Stretch)
<a id="p4"></a>

Your primary assignment this afternoon is to achieve a minimum of 80% accuracy on the Kaggle competition. <br>
Once you've accomplished that, do (1), and either (2) or (3): 

1. Research "Sentiment Analysis". Provide answers in markdown to the following questions: 
    - What is "Sentiment Analysis"? 
    - Is Document Classification different than "Sentiment Analysis"? Provide evidence for your response
    - How do people create labeled sentiment data? Are those labels really sentiment?
    - What are common applications of sentiment analysis?

2. Singular Value Decomposition (SVD) is one of the most important and powerful methods in Applied Mathematics and in all of Machine Learning.  Principal Components Analysis (PCA) -- which we used in Module 2 -- is closely related to SVD. Research SVD using the resources below. Then write a few paragraphs explaining -- in your own words -- your understanding of SVD and why it has become so important in Machine Learning. As you write, pretend that you will be presenting this summary orally as an answer to a question during a job interview.<br>

* [Daniela Witten](https://www.danielawitten.com/), a Professor of Mathematical Statistics at the University of Washington, recently penned a highly amusing and informative [tweetstorm](https://twitter.com/WomenInStat/status/1285611042446413824) about SVD, well worth reading!<br>
* [Stanford University Lecture on SVD](https://www.youtube.com/watch?v=P5mlg91as1c) <br>
* [StatQuest Principal Components Analysis](https://www.youtube.com/watch?v=FgakZw6K1QQ)<br>
* [Luis Serrano Principal Components Analysis](https://www.youtube.com/watch?v=g-Hb26agBFg)<br>

3. Research which other models can be used for text classification -- see [Multi-Class Text Classification Model Comparison and Selection](https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568)
  - Try a few other classical machine learning models, and compare with the gradient boosting results 
  - Neural Networks are becoming more popular for document classification. Why is that the case? 
  - If you have the time and interest, check out this [text classification documentation](https://developers.google.com/machine-learning/guides/text-classification/step-2-5) from Google
   