<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing Lab

_Authors: Dave Yerrington (SF)_

---

In this lab, we'll explore scikit-learn and NLTK's capabilities for processing text even further. We'll use the 20 newsgroups data set, which is provided by scikit-learn.

In [1]:
# Imports of import
import pandas as pd
import nltk
import numpy as np
import matplotlib.pyplot as plt
import re

# NLTK imports
from nltk.corpus import brown
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Sci-Kit Learn imports
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer

In [2]:
# Getting the scikit-learn data set:
from sklearn.datasets import fetch_20newsgroups

### 1. Use the `fetch_20newsgroups` function to download a training and testing set.

The "20 Newsgroups" dataset is described [here](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html).

For this lab let's choose 4 categories to analyze.  The full list is given below.


```python
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
```

Note that the solution code will use these categories:
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Also remove the headers, footers, and quotes using the `remove` keyword argument of the function.

In [3]:
#Extracting Information from the Data's Dictionary format 

categories = ['alt.atheism','sci.space','sci.crypt','sci.electronics']  # Fill in whatever categories you want to use!!

# Setting out training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))
# Setting our testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

**Question:** What does the `shuffle` argument do?  Why are we setting a `random_state`?

Answer 1: shuffle=True ensures a random sampling from the data rather than just taking as it was sorted.

Answer 2: random_state sets the same seed on the random genereator so that it will always produce the same random numbers, which allows for repeatability with random numbers.

### 2) Inspect the data.

We've downloaded a few `newsgroups` categories and removed their headers, footers, and quotes.

Because this is a scikit-learn data set, it comes with pre-split training and testing sets (note: we were able to call "train" and "test" in subset).

Let's inspect them.

1) What data type is `data_train`?
- Is it a list? A dictionary? What else?
- How many data points does it contain?
- Inspect the first data point. What does it look like?

Answer 1:  It is a dictionary (beginning {) with keys: data, filenames, target_names, target, DESCR.  

Answer 2: There are 18846 total samples (according to the data dictionary under key DESCR), but our training data only contains 2259 entries.

Answer 3: The first data point is a string containing what looks like the title of an entry under a newsgroup.

In [4]:
print(type(data_train))

<class 'sklearn.utils.Bunch'>


In [5]:
data_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [6]:
len(data_train.data)

2259

In [7]:
# data_train.DESCR

In [8]:
data_train.data[0]

"Does anyone out there know of any products using Motorola's Neuron(r) chips MC143150 or MC143120. If so, what are they and are they utilizing Standard Network Variable Types (SNVT)?\n_________________________________________________________________________________"

### 3) Create a bag-of-words model.

Let's train a model using a simple count vectorizer.

1) Initialize a standard CountVectorizer and fit the training data.
- How big is the feature dictionary?
- Eliminate English stop words.
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer.
- Evaluate the performance of a logistic regression on the features extracted by the CountVectorizer.
    - You will have to transform the `test_set`, too. Be careful to use the trained vectorizer without refitting it.

**Bonus**
- Try a couple of modifications:
    - Restrict the `max_features`.
    - Change the `max_df` and `min_df`.

How big is the Dictionary?
Answer 1: It's got a length of 28737 entries.

Eliminate English stop words - is the dictionary smaller?
Answer 2: It's slightly smaller at 28433 entries.
    
Transform the training data using the trained vectorizer and evaluate the performance of a logistic regression on the features extracted by the CountVectorizer

Evaluation: The training and test data performed exactly the same in terms of scoring, with an accuracy of 81.3%.  My personal belief is that this is not a good score for NLP.  I would really expect something closer to 90%, so at this point I would keep trying new models.

In [9]:
# Instantiating the CountVectorizer and fitting the training data to it
cv = CountVectorizer()
vec = cv.fit(data_train.data)
len(vec.vocabulary_)

28737

In [10]:
# This time removing the English stopwords.
cv2 = CountVectorizer(stop_words = 'english')
vec2 = cv2.fit(data_train.data)
len(vec2.vocabulary_)

28433

In [11]:
# The difference:
len(vec.vocabulary_) - len(vec2.vocabulary_)

304

In [12]:
# Doing the transform after the fit on both the testing and training data - testing data has NOT been fit
vec2_train = vec2.transform(data_train.data)
vec2_test = vec2.transform(data_test.data)

In [13]:
# I prefer to use GridSearch and fit a few parameters at once to see where the optimum fit might beand then rerun
# GridSearch using the best of those and trying to close in on things like an optimum C value for LogReg.

params = {
    'solver':['lbfgs','newton-cg','saga','liblinear'], # My favorite thing to try - the different LogReg solvers
    'C':[0.001,0.01,0.1,1]
}

grid = GridSearchCV(LogisticRegression(multi_class  = 'ovr',
                                       random_state = 42,
                                       n_jobs       = -1,
                                       verbose      = 1),
                    params,
                    cv      = 5,
                    verbose = 1,
                    n_jobs  = -1,
                    iid     = False,
                    return_train_score = True)

log_reg_train = grid.fit(vec2_train,data_train.target)
log_reg_test = grid.fit(vec2_test,data_test.target);

print(f'\nBest Training Parameters: {log_reg_train.best_params_}')
print(f'Best Testing Parameters:  {log_reg_test.best_params_}')
print(f'Best Training Score:      {log_reg_train.best_score_:{2}.{4}}')
print(f'Best Testing Score:       {log_reg_test.best_score_:{2}.{4}}')

Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   26.2s
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:   54.2s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   4 | elapsed:    0.4s remaining:    0.4s
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:    0.4s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   14.4s
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:   31.3s finished
  " = {}.".format(effective_n_jobs(self.n_jobs)))


[LibLinear]
Best Training Parameters: {'C': 0.1, 'solver': 'liblinear'}
Best Testing Parameters:  {'C': 0.1, 'solver': 'liblinear'}
Best Training Score:      0.8076
Best Testing Score:       0.8076


In [14]:
# Now to just use liblinear and see how close we can get to the optimal value for C.

params = {
    'C':[0.05,0.075,0.1,0.25,0.5]
}

grid = GridSearchCV(LogisticRegression(multi_class  = 'ovr',
                                       solver       = 'liblinear',
                                       random_state = 42,
                                       verbose      = 1),
                    params,
                    cv      = 5,
                    verbose = 1,
                    n_jobs  = -1,
                    iid     = False,
                    return_train_score = True)

log_reg_train = grid.fit(vec2_train,data_train.target)
log_reg_test = grid.fit(vec2_test,data_test.target);

print(f'\nBest Training Parameters: {log_reg_train.best_params_}')
print(f'Best Testing Parameters:  {log_reg_test.best_params_}')
print(f'Best Training Score:      {log_reg_train.best_score_:{2}.{4}}')
print(f'Best Testing Score:       {log_reg_test.best_score_:{2}.{4}}')

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    9.9s finished


[LibLinear]Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    3.7s finished


[LibLinear]
Best Training Parameters: {'C': 0.25}
Best Testing Parameters:  {'C': 0.25}
Best Training Score:      0.8096
Best Testing Score:       0.8096


In [15]:
# One more iteration on finding the optimal C just because the calculation time is low.

params = {
    'C':[0.15,0.20,0.25,0.30,0.35]
}

grid = GridSearchCV(LogisticRegression(multi_class  = 'ovr',
                                       solver       = 'liblinear',
                                       random_state = 42,
                                       verbose      = 1),
                    params,
                    cv      = 5,
                    verbose = 1,
                    n_jobs  = -1,
                    iid     = False,
                    return_train_score = True)

log_reg_train = grid.fit(vec2_train,data_train.target)
log_reg_test = grid.fit(vec2_test,data_test.target);

print(f'\nBest Training Parameters: {log_reg_train.best_params_}')
print(f'Best Testing Parameters:  {log_reg_test.best_params_}')
print(f'Best Training Score:      {log_reg_train.best_score_:{2}.{4}}')
print(f'Best Testing Score:       {log_reg_test.best_score_:{2}.{4}}')

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:   10.8s finished


[LibLinear]Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    4.2s finished


[LibLinear]
Best Training Parameters: {'C': 0.15}
Best Testing Parameters:  {'C': 0.15}
Best Training Score:      0.8129
Best Testing Score:       0.8129


### 4) Test Out Hashing and TF-IDF.

Let's see if hashing or TF-IDF improves our accuracy.

1) Initialize a HashingVectorizer and repeat the test with no restriction on the number of features.
- Does the score improve with respect to the CountVectorizer?
- Print out the number of features for this model.
- Initialize a TF-IDF vectorizer and repeat the analysis above.
- Print out the number of features for this model.

**Bonus**
- Change the parameters of either (or both) models to improve your score.

Answer 1: The score does not improve with respect to the CountVectorizer.  CountVectorizer scored 0.8129 while HashVectorizer scored 0.8076.

Answer 2: The n_features, number of features, is set as a default in the instantiation of the HashVectorizer at $2^{20}$  

In [16]:
# Instantiating, fitting and transforming the data with the HashVectorizer
hv = HashingVectorizer(stop_words='english')
hvec_train = hv.fit_transform(data_train.data)
hvec_test = hv.fit_transform(data_test.data)

In [17]:
params = {
    'solver':['lbfgs','newton-cg','saga','liblinear'],
    'C':[0.001,0.01,0.1,1]
}

grid = GridSearchCV(LogisticRegression(multi_class  = 'ovr',
                                       random_state = 42,
                                       n_jobs       = -1,
                                       verbose      = 1),
                    params,
                    cv      = 5,
                    verbose = 1,
                    n_jobs  = -1,
                    iid     = False,
                    return_train_score = True)

log_reg_hv_train = grid.fit(hvec_train,data_train.target)
log_reg_hv_test = grid.fit(hvec_test,data_test.target);

print('Hashing Vectorizer Performance')
print(f'\nBest Training Parameters: {log_reg_hv_train.best_params_}')
print(f'Best Testing Parameters:  {log_reg_hv_test.best_params_}')
print(f'Best Training Score:      {log_reg_hv_train.best_score_:{2}.{4}}')
print(f'Best Testing Score:       {log_reg_hv_test.best_score_:{2}.{4}}')

Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  4.3min
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:  8.9min finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.


convergence after 18 epochs took 1 seconds
convergence after 18 epochs took 1 seconds
convergence after 19 epochs took 1 seconds
convergence after 21 epochs took 1 seconds

[Parallel(n_jobs=-1)]: Done   2 out of   4 | elapsed:    1.2s remaining:    1.2s
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:    1.3s finished



Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  5.7min
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed: 10.4min finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.


convergence after 16 epochs took 1 seconds
convergence after 19 epochs took 1 seconds
convergence after 19 epochs took 1 seconds
convergence after 22 epochs took 1 seconds


[Parallel(n_jobs=-1)]: Done   2 out of   4 | elapsed:    0.8s remaining:    0.8s
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:    0.9s finished


Hashing Vectorizer Performance

Best Training Parameters: {'C': 1, 'solver': 'saga'}
Best Testing Parameters:  {'C': 1, 'solver': 'saga'}
Best Training Score:      0.8056
Best Testing Score:       0.8056


In [18]:
# solver = 'liblinear' won out. Honing in on the optimal C value
params = {
    'C':[0.06,0.08,0.1,0.3,0.5]
}

grid = GridSearchCV(LogisticRegression(solver       = 'liblinear',
                                       multi_class  = 'ovr',
                                       random_state = 42,
                                       verbose      = 1),
                    params,
                    cv      = 5,
                    verbose = 1,
                    n_jobs  = -1,
                    iid     = False,
                    return_train_score = True)

log_reg_hv_train = grid.fit(hvec_train,data_train.target)
log_reg_hv_test = grid.fit(hvec_test,data_test.target);

print('Hashing Vectorizer Performance')
print(f'\nBest Training Parameters: {log_reg_hv_train.best_params_}')
print(f'Best Testing Parameters:  {log_reg_hv_test.best_params_}')
print(f'Best Training Score:      {log_reg_hv_train.best_score_:{2}.{4}}')
print(f'Best Testing Score:       {log_reg_hv_test.best_score_:{2}.{4}}')

# No change - time for tf_idf

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:   23.8s finished


[LibLinear]Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:   19.7s finished


[LibLinear]Hashing Vectorizer Performance

Best Training Parameters: {'C': 0.5}
Best Testing Parameters:  {'C': 0.5}
Best Training Score:      0.7963
Best Testing Score:       0.7963


In [19]:
print(f'\nNumber of Features for Hash Vectorizer: {hvec_train.shape[1]}')


Number of Features for Hash Vectorizer: 1048576


In [20]:
tfidf = TfidfVectorizer(stop_words = 'english')
tfidf_train = tfidf.fit_transform(data_train.data)
tfidf_test = tfidf.fit_transform(data_test.data)

In [21]:
params = {
    'solver':['lbfgs','newton-cg','saga','liblinear'],
    'C':[0.001,0.01,0.1,1]
}

grid = GridSearchCV(LogisticRegression(multi_class  = 'ovr',
                                       random_state = 42,
                                       n_jobs       = -1,
                                       verbose      = 1),
                    params,
                    cv      = 5,
                    verbose = 1,
                    n_jobs  = -1,
                    iid     = False,
                    return_train_score = True)

log_reg_tfidf_train = grid.fit(tfidf_train,data_train.target)
log_reg_tfidf_test = grid.fit(tfidf_test,data_test.target);

print('TF_IDF Vectorizer Performance')
print(f'\nBest Training Parameters: {log_reg_tfidf_train.best_params_}')
print(f'Best Testing Parameters:  {log_reg_tfidf_test.best_params_}')
print(f'Best Training Score:      {log_reg_tfidf_train.best_score_:{2}.{4}}')
print(f'Best Testing Score:       {log_reg_tfidf_test.best_score_:{2}.{4}}')

Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   17.6s
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:   25.9s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.


convergence after 18 epochs took 1 seconds
convergence after 18 epochs took 1 seconds
convergence after 19 epochs took 1 seconds
convergence after 21 epochs took 1 seconds
Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Done   2 out of   4 | elapsed:    0.7s remaining:    0.7s
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:    0.8s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   16.8s
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:   22.1s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.


convergence after 20 epochs took 0 seconds
convergence after 18 epochs took 0 seconds
convergence after 18 epochs took 0 seconds
convergence after 19 epochs took 0 seconds
TF_IDF Vectorizer Performance

Best Training Parameters: {'C': 1, 'solver': 'saga'}
Best Testing Parameters:  {'C': 1, 'solver': 'saga'}
Best Training Score:      0.8323
Best Testing Score:       0.8323


[Parallel(n_jobs=-1)]: Done   2 out of   4 | elapsed:    0.4s remaining:    0.4s
[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:    0.4s finished


In [22]:
print(f'\nNumber of Features for TF-IDF Vectorizer: {tfidf_train.shape[1]}')


Number of Features for TF-IDF Vectorizer: 28433


### 5. [Bonus] Robust Text Preprocessing

Your mission, should you choose to accept it, is to write a preprocessing function for all of your text.  This functions should

- convert all text to lowercase,
- remove punctuation,
- stem or lemmatize each word of the text,
- remove stopwords.

The function should receive one string of text and return the processed text.

Once you have built your function, use it to process your train and test data, then fit a Logistic Regression model to see how it performs.

In [23]:
# My function to do all of the required text processing:
# 
# 1) Convert everything to lowercase
# 2) Remove the punctuation
# 3) Lemmatize each word of the text - I am choosing not to stem the words
# 4) Remove English stopwords

def my_texturizer(text):
    
    # Step 0 - instantiate the WordNetLemmatizer and initialize the list that will be returned
    lemma = WordNetLemmatizer()
    cleaned_text = []
    
    # Step 1 - Lowercase the text
    new_text = [word.lower() for word in text]
    tokenized_text = [word.split() for word in new_text]
    
    # Step 2 - Remove the punctuation - also removing anything else non-letter adn extraneous whitespace
    words_from_text = [re.sub(r'[^a-zA-Z]', " ", word) for lst in tokenized_text for word in lst]
    spaces_removed_text = [word.replace(" ","") for word in words_from_text]
    
    # Step 2.5 - I wanted to pass the words through a standard dictionary so that I only capture words
    # NLTK's standard dictionary of english words is Brown and using a set parses faster in python
    browns_words = set(brown.words())
    words_only_text = [word for word in spaces_removed_text if word in browns_words]
    
    # Step 3 - Lemmatize - this means to find the root word like run from ran and running
    lemmatize_the_text = [lemma.lemmatize(word) for word in words_only_text if len(word) >= 3]
    
    # Step 4 - Remove the stopwords
    stop_words   = set(stopwords.words('english'))
    cleaned_text = [word for word in lemmatize_the_text if word not in stop_words]
    
    return " ".join(cleaned_text)

In [24]:
# Voila!  Preprocessed text that is in the English Dictionary, has no punctuation, is lowercased, has been
# lemmatized and contains no English stopwords (which includes all 2-letter words, so all words are also at
# least 3 letters long)
my_texturizer(data_train.data)

