Lambda School Data Science

*Unit 4, Sprint 1, Module 3*

---

# Document Classification (Assignment)

This notebook is for you to practice skills during lecture.

Today's guided module project and assignment will be different. You already know how to do classification. You ready know how to extract features from documents. So? That means you're ready to combine and practice those skills in a [kaggle competition](https://www.kaggle.com/c/whiskey-201911/) We we will open with a five minute sprint explaining the competition, and then give you 25 minutes to work. After those twenty five minutes are up, I will give a 5-minute demo an NLP technique that will help you with document classification (*and **maybe** the competition*).

Today's all about having fun and practicing your skills.

## Sections
* <a href="#p1">Part 1</a>: Text Feature Extraction & Classification Pipelines
* <a href="#p2">Part 2</a>: Latent Semantic Indexing
* <a href="#p3">Part 3</a>: Word Embeddings with Spacy
* <a href="#p4">Part 4</a>: Post Lecture Assignment

# Text Feature Extraction & Classification Pipelines (Learn)
<a id="p1"></a>

We are going to run increasingly sophisticated classification models on our whisky reviews in parts 1, 2, and 3. For each of parts 1, 2, and 3, submit your best model's results to the Kaggle competition to measure `generalization accuracy` -- i.e. how well the model performs on new data.

##1. Classifier based on TfIdf vectorization of reviews

### Follow Along 

What you should be doing now:
1. Join the Kaggle Competition
2. Download the data
3. Train a model (try using the pipe method I just demoed)

### 1.0 Setup

#### 1.0.1 Get spacy and restart runtime

In [None]:
#YOUR CODE HERE

#### 1.0.2 import necessary packages, load spacy

In [26]:
import pandas as pd
import re

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
import xgboost as xgb
from xgboost import XGBClassifier

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
import spacy
nlp = spacy.load("en_core_web_md")

Load `spacy`

In [None]:
def clean_data(text):
    """
    Accepts a single text document and performs several regex substitutions in order to clean the document. 
    
    Parameters
    ----------
    text: string or object 
    
    Returns
    -------
    text: string or object
    """
    
    # order of operations - apply the expression from top to bottom
    email_regex = r"From: \S*@\S*\s?"
    non_alpha = '[^a-zA-Z]'
    multi_white_spaces = "[ ]{2,}"
    
    text = re.sub(email_regex, "", text)
    text = re.sub(non_alpha, ' ', text)
    text = re.sub(multi_white_spaces, " ", text)
    
    # apply case normalization 
    return text.lower().lstrip().rstrip()

In [None]:
#YOUR CODE HERE
vect = Tf
(list(v)[0] for v in X.values)

In [None]:
# svd = TruncatedSVD(n_components=2, # number of topics to generate (also the size of the new feature space)
#                    algorithm='randomized',
#                    n_iter=10)

# tf_vectorizer = CountVectorizer()
# tfm = tf_vectorizer.fit_transform(data)
# tfm = pd.DataFrame(data=tfm.toarray(), columns=tf_vectorizer.get_feature_names())

# tfm.index = data
# tfm

#### 1.0.3 Load Kaggle Whisky Competition Data

In [17]:
# !!!!! You may need to change the path !!!!!
# You can download these datasets from the Kaggle in-class 
# competition for your cohort. 
 
train = pd.read_csv('train.csv',usecols=['description','category'])
test = pd.read_csv('test.csv',usecols=['description'])

In [7]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288 entries, 0 to 287
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           288 non-null    int64 
 1   description  288 non-null    object
dtypes: int64(1), object(1)
memory usage: 4.6+ KB


In [15]:
test.head()

Unnamed: 0,id,description
0,955,"Think carnival aromas—the good ones, anyway—me..."
1,3532,"A blend of three bourbons, between 6 and 12 ye..."
2,1390,"The nose is focused on cereal, hints of fresh ..."
3,1024,Swiss-based Chapter 7 released this 19 year ol...
4,1902,Valkyrie replaces the current Dark Origins exp...


In [14]:
train.head()

Unnamed: 0,description,category
0,A marriage of 13 and 18 year old bourbons. A m...,2
1,There have been some legendary Bowmores from t...,1
2,This bottling celebrates master distiller Park...,2
3,What impresses me most is how this whisky evol...,1
4,"A caramel-laden fruit bouquet, followed by une...",2


### 1.1 Clean Text

In [19]:
train['description'][0]

'a marriage of and year old bourbons a mature yet very elegant whiskey with a silky texture and so easy to embrace with a splash of water balanced notes of honeyed vanilla soft caramel a basket of complex orchard fruit blackberry papaya and a dusting of cocoa and nutmeg smooth finish sophisticated stylish with well defined flavors a classic'

### 1.2 Split training data into Feature Matrix and Target Vector

In [29]:
%%time

def clean_doc(text):
  # COMPLETE THE CODE IN THIS CELL
  # remove new line characters
  text = text.replace('\\n', ' ')

  # remove numbers for the text
  non_alpha = '[^a-zA-Z]'
  multi_white_spaces = "[ ]{2,}"
  
  text = re.sub(non_alpha, ' ', text)
  text = re.sub(multi_white_spaces, " ", text)

  # case normalize and strip extra white spaces on the far left and right hand side
  return text.lower().lstrip().rstrip()

train['description'] = train['description'].apply(clean_doc)
test['description'] = test['description'].apply(clean_doc)


###BEGIN SOLUTION
# build a model that is trained on word vectors
def get_word_vectors(docs):
    """
    This serves as both our tokenizer and vectorizer. 
    Returns a list of word vectors, i.e. our doc-term matrix
    """
    return [nlp(doc).vector for doc in docs]

# You may need to change the path
#train = pd.read_csv('./Kaggle Data/train.csv')
#test = pd.read_csv('./Kaggle Data/test.csv')

# create our doc-term matrices 

# raw text data for train and test sets
X_train_text = train["description"]
X_test_text = test["description"]

# transform raw data into doc-term matrices for train and test sets 
X_train = get_word_vectors(X_train_text)
X_test = get_word_vectors(X_test_text)

# save ratings to y vector
y_train = train["category"]

# create RF model, use out-of-bag (oob) score
rfc = RandomForestClassifier(oob_score=True)

rfc.fit(X_train, y_train)
###END SOLUTION

Wall time: 38.8 s


RandomForestClassifier(oob_score=True)

In [None]:
# CREATE Term-Frequency matrix 

tfidf = TfidfVectorizer(stop_words="english", tokenizer=None) # data transformer 
rfc = RandomForestClassifier(random_state=42) # estimator

###BEGIN SOLUTION
# use CountVectorizer to create a Term-Frequency matrix (a.k.a. Doc-Term Matrix )
tf_vectorizer = CountVectorizer()
tfm = tf_vectorizer.fit_transform(data)
tfm = pd.DataFrame(data=tfm.toarray(), columns=tf_vectorizer.get_feature_names())

# switch integer indicies with terms
tfm.index = data
tfm
###END SOLUTION

### 1.3 Define Pipeline Components
We can try`RandomForestClassifier()`,  `GradientBoostingClassifier()` from the `sklearn` library, and `XGBClassifier()` from the `xgboost` library.

In [31]:
# limiting max_features to 500 to speed up training on Colab.
# COMPLETE THE CODE IN THIS CELL
rfc = RandomForestClassifier(oob_score=True)

svd = TruncatedSVD(n_components=2, # number of topics to generate (also the size of the new feature space)
                   algorithm='randomized',
                   n_iter=10)

vect = TfidfVectorizer(stop_words="english", tokenizer=None)

lsi = Pipeline([("vect", vect), # creating our term-doc matrix
                ("svd", svd)]) # apply svd to our term-doc matrix 

pipe = Pipeline([("lsi", lsi), # data transform
                 ("clf", rfc)]) # estimator 

### 1.4 Define Your Search Space
You're looking for both the best hyperparameters of your vectorizer and your classification model. 

In [33]:
# COMPLETE THE CODE IN THIS CELL


# Parameters to search in dictionary 
parameters = {
    'lsi__vect__max_df':[.9,  1.0],
    'clf__n_estimators':[10, 100, 250], 
    'clf__max_depth':(15, 20)
}

# Implement a grid search with cross-validation
grid_search = GridSearchCV(pipe,
                  param_grid=parameters, 
                  cv=3, 
                  n_jobs=-2, 
                  verbose=1)

grid_search.fit(X_train, y_train)

# Display the best score from the grid search
grid_search.best_score_

Fitting 3 folds for each of 12 candidates, totalling 36 fits




AttributeError: 'numpy.ndarray' object has no attribute 'lower'

In [None]:
# Display the best parameters from the grid search
print(grid_search.best_params_)

{'clf__max_depth': 20, 'vect__max_df': 0.75}

### 1.5 Make a Submission File
*Note:* In a typical Kaggle competition, you are only allowed two submissions a day, so only submit when your predicted test accuracy is the highest you can make it. For this competition the max daily submissions are capped at **20**.  The submission file is made from the results of running your best model on the test data set, for which we don't get the targets.

In [None]:
# COMPLETE THE CODE IN THIS CELL
# Predictions on **test** sample
pred = grid_search.predict(...)

In [None]:
# COMPLETE THE CODE IN THIS CELL
submission = pd.DataFrame({... : ..., ...: ...})
submission['ratingCategory'] = submission['ratingCategory'].astype('int64')

In [None]:
# Make Sure the Category is an Integer
submission.head()

Unnamed: 0,id,ratingCategory
0,3461,1
1,2604,1
2,3341,1
3,3764,1
4,2306,1


In [None]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model
submission_number = 0

submission.to_csv(f'submission{submission_number}.csv', index=False)
submission_number += 1

In [None]:
# Download submission if in Google Colab
from google.colab import files
files.download(f'submission{submission_number}.csv')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

## Challenge

You're trying to achieve a minimum of 75% Accuracy on your model.

## 2. Add Latent Semantic Indexing to pipeline (Learn)
<a id="p2"></a>

### Follow Along
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try: 
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (LSI) into your pipeline. *Note:* You can grid search a nested pipeline, but you have to use double underscores ie `lsi__svd__n_components`
4. Make a submission to Kaggle 


### 2.1 Define Pipeline Components

Nest pipelines to perform SVD on our vectorization (LSA)

In [None]:
# COMPLETE THE CODE IN THIS CELL
# Transforming our Vectorization with SVD is how LSA generates topic columns
svd = ...

# vectorizer and classifier like before
vect = TfidfVectorizer(...)
clf = XGBClassifier()

# LSA pipeline with vectorizer & truncated SVD
lsa = Pipeline(???)

# combine LSA pipeline together with classifier
pipe = Pipeline([('lsa', lsa), ('clf', clf)])

### 2.2 Define Your grid search space and run a grid search with cross-validation
You're looking for both the best hyperparameters of your vectorizer and your classification model. 

In [None]:
# COMPLETE THE CODE IN THIS CELL
parameters = {
    'lsa__svd__n_components': [...],
    'lsa__vect__max_df': (...),
    'clf__max_depth': (...)
}

grid_search = GridSearchCV(pipe,parameters, cv=3, n_jobs=-1, verbose=1)
grid_search.fit(..., ...)

Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:  4.6min finished


GridSearchCV(cv=3, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('lsa',
                                        Pipeline(memory=None,
                                                 steps=[('vect',
                                                         TfidfVectorizer(analyzer='word',
                                                                         binary=False,
                                                                         decode_error='strict',
                                                                         dtype=<class 'numpy.float64'>,
                                                                         encoding='utf-8',
                                                                         input='content',
                                                                         lowercase=True,
                                                                         max_df=1.0,
             

In [None]:
grid_search.best_score_

0.7337908122828017

In [None]:
grid_search.best_params_

{'clf__max_depth': 20, 'lsa__svd__n_components': 100, 'lsa__vect__max_df': 1.0}

### 2.3 Make a Submission File

In [None]:
# Predictions on test sample
pred = grid_search.predict(test['description'])

In [None]:
submission = pd.DataFrame({'id': test['id'], 'ratingCategory':pred})
submission['ratingCategory'] = submission['ratingCategory'].astype('int64')

In [None]:
# Make Sure the Category is an Integer
submission.head()

Unnamed: 0,id,ratingCategory
0,3461,1
1,2604,1
2,3341,1
3,3764,1
4,2306,1


In [None]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model

submission.to_csv(f'submission{submission_number}.csv', index=False)
submission_number +=1

In [None]:
# Download submission if in Google Colab
from google.colab import files
files.download(f'submission{submission_number}.csv')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

## Challenge

Continue to apply Latent Semantic Indexing (LSI) to various datasets. 

# 3. Add Spacy Word Embeddings
<a id="p3"></a>

### 3.1 Process the data set with spacy

In [None]:
# Apply to your Dataset

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier

from scipy.stats import randint

param_dist = {
    
    'max_depth' : randint(3,10),
    'min_samples_leaf': randint(2,15)
}

In [None]:
# Continue Word Embedding Work Here
nlp = spacy.load("en_core_web_md")

def get_word_vectors(docs):
    # YOUR CODE HERE
    return 

X_train_emb = get_word_vectors(train['description'])
X_test_emb = get_word_vectors(test['description'])

In [None]:
rfc = RandomForestClassifier(oob_score=True)

rfc.fit(X_train_emb, y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=True, random_state=None,
                       verbose=0, warm_start=False)

In [None]:
# massively overfit with the Random Forest
print('Training Accuracy: ', rfc.score(X_train_emb, y))

Training Accuracy:  0.9997553217518963


Here we use oob_score_ (out-of-bag score) as a **proxy** for the test score;<br>
for your submission, you will predict on the test set, as before

In [None]:
# validation looks decent without any tuning

rfc.oob_score_

0.7230242231465622

### 3.2 Make a Submission File

In [None]:
# YOUR CODE HERE


## Challenge

What you should be doing now:
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try: 
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (lsi) into your pipeline. *Note:* You can grid search a nested pipeline, but you have to use double underscores ie `lsi__svd__n_components`
    - Try to extract word embeddings with Spacy and use those embeddings as your features for a classification model.
4. Make a submission to Kaggle 

# 4. Post Lecture Assignment
<a id="p4"></a>

Your primary assignment this afternoon is to achieve a minimum of 75% accuracy on the Kaggle competition. <br>
Once you have achieved that goal, please explore a few of the following topics: 

1. Research "Sentiment Analysis". Provide answers in markdown to the following questions: 
    - What is "Sentiment Analysis"? 
    - Is Document Classification different than "Sentiment Analysis"? Provide evidence for your response
    - How do you create labeled sentiment data? Are those labels really sentiment?
    - What are common applications of sentiment analysis?
2. Research why word embeddings worked better for the lecture notebook than on the whiskey competition.
    - This [text classification documentation](https://developers.google.com/machine-learning/guides/text-classification/step-2-5) from Google might be of interest
    - Neural Networks are becoming more popular for document classification. Why is that the case?

3. Research Singular Value Decomposition (SVD), one of the most important and powerful methods in Applied Mathematics and in all of Machine Learning.  Principal Components Analysis (PCA) -- which we used in Module 2 -- is closely releated to SVD.<br>

* [Daniela Witten](https://www.danielawitten.com/), a Professor of Mathematical Statistics at the University of Washington, recently penned a highly amusing and informative [tweetstorm](https://twitter.com/WomenInStat/status/1285611042446413824) about SVD, well worth reading!<br>
* [Stanford University Lecture on SVD](https://www.youtube.com/watch?v=P5mlg91as1c) <br>
* [StatQuest Principal Components Analysis](https://www.youtube.com/watch?v=FgakZw6K1QQ)<br>
* [Luis Serrano Principal Components Analysis](https://www.youtube.com/watch?v=g-Hb26agBFg)<br>

