<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing Lab

_Authors: Dave Yerrington (SF)_

---

In this lab, we'll explore scikit-learn and NLTK's capabilities for processing text even further. We'll use the 20 newsgroups data set, which is provided by scikit-learn.

In [2]:
# Standard data science imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
# Getting the scikit-learn data set:
from sklearn.datasets import fetch_20newsgroups

### 1) Use the `fetch_20newsgroups` function to download a training and testing set.

The "20 Newsgroups" dataset is described [here](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html).

For this lab, let's choose four categories to analyze. The full list is given below.


```python
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
```

Note that the solution code will use the categories below (the code cell in this notebook substitutes a different set of four):
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Also remove the headers, footers, and quotes using the `remove` keyword argument of the function.

In [4]:
# Choose the categories, then fetch the training and testing subsets

categories = ['sci.crypt',
              'sci.electronics',
              'talk.politics.guns',
              'talk.politics.mideast']  # Fill in whatever categories you want to use!!

# Setting our training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))
# Setting our testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


**Question:** What does the `shuffle` argument do?  Why are we setting a `random_state`?

In [4]:
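# A: `shuffle` randomly reorders the documents so they are not grouped by category,
# which matters for models that assume i.i.d. samples (e.g., stochastic gradient descent).
# `random_state` seeds that shuffle, which makes the ordering reproducible.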
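
To make the role of the seed concrete, here is a minimal sketch using `sklearn.utils.shuffle` on a toy array. The array contents are made up for illustration, and this is not a claim about how `fetch_20newsgroups` shuffles internally; it only shows that a fixed `random_state` yields the same ordering on every call.

```python
import numpy as np
from sklearn.utils import shuffle

# Hypothetical stand-in for a list of documents, grouped by topic.
docs = np.array(["crypt_1", "crypt_2", "guns_1", "guns_2", "mideast_1"])

# The same seed gives the same permutation on every call.
first = shuffle(docs, random_state=42)
second = shuffle(docs, random_state=42)

print(first)
print(np.array_equal(first, second))
```

Shuffling changes only the order of the documents; the set of documents stays the same.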

### 2) Inspect the data.

We've downloaded a few `newsgroups` categories and removed their headers, footers, and quotes.

Because this is a scikit-learn data set, it comes with pre-split training and testing sets (note that we were able to pass `'train'` and `'test'` as the `subset` argument).

Let's inspect them.

1) What data type is `data_train`?
- Is it a list? A dictionary? What else?
- How many data points does it contain?
- Inspect the first data point. What does it look like?

In [6]:
# A:
type(data_train)

sklearn.utils.Bunch

In [13]:
data_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])

In [17]:
data_train['data'][0]

"Actually, I'm still trying to understand the self-justifying rationale\nbehind the recent murder of Ian Feinberg (?) in Gaza.\n"

In [19]:
len(data_train['data'])

2296

In [20]:
len(data_train['target'])

2296

In [21]:
data_train['target'][0:6]

array([3, 3, 3, 3, 3, 2])
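
A `Bunch` behaves like a dictionary whose keys are also available as attributes. The sketch below builds a small, hypothetical `Bunch` (the documents and labels are invented) to show both access styles and one way to count documents per category with `np.bincount`.

```python
import numpy as np
from sklearn.utils import Bunch

# Hypothetical stand-in for data_train; the real object holds many more documents.
toy = Bunch(
    data=["first post", "second post", "third post"],
    target=np.array([1, 0, 1]),
    target_names=["sci.crypt", "sci.electronics"],
)

# Key access and attribute access return the same object.
same_object = toy["data"] is toy.data

# Count how many documents fall into each category.
counts = np.bincount(toy.target)
per_class = dict(zip(toy.target_names, counts.tolist()))
print(same_object, per_class)
```

The same `np.bincount` call on `data_train['target']` would show whether the four chosen categories are balanced.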

### 3) Create a bag-of-words model.

Let's train a model using a simple count vectorizer.

1) Initialize a standard CountVectorizer and fit the training data.
- How big is the feature dictionary?
- Eliminate English stop words.
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer.
- Evaluate the performance of a logistic regression on the features extracted by the CountVectorizer.
    - You will have to transform the test set (`data_test`), too. Be careful to use the trained vectorizer without refitting it.
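
The steps in this list can be sketched on a toy corpus. The sentences below are invented for illustration. The sketch compares the vocabulary with and without English stop words, then transforms unseen text with the already-fitted vectorizer, which ignores words it did not see during fitting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the newsgroup posts.
train_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
]
unseen_docs = ["the cat barked at the bird"]

# Fit once on the training text only.
plain = CountVectorizer().fit(train_docs)
filtered = CountVectorizer(stop_words="english").fit(train_docs)

# Transform new text with the fitted vectorizer; do not call fit again.
X_unseen = filtered.transform(unseen_docs)

print(len(plain.vocabulary_), len(filtered.vocabulary_))
print(X_unseen.shape)
```

Refitting on the test text would build a different vocabulary, so the columns would no longer line up with the features the model was trained on.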

**Bonus**
- Try a couple of modifications:
    - Restrict the `max_features`.
    - Change the `max_df` and `min_df`.
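
For the bonus, the following sketch shows how each option prunes the vocabulary on a four-document toy corpus (invented for illustration). The helper `vocabulary` is a hypothetical convenience function, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "space" appears in 3 documents, "launch" in 2.
docs = [
    "space shuttle launch",
    "space station orbit",
    "space probe launch",
    "graphics card driver",
]

def vocabulary(**kwargs):
    """Fit a CountVectorizer with the given options and return its sorted vocabulary."""
    return sorted(CountVectorizer(**kwargs).fit(docs).vocabulary_)

# Keep only the two most frequent terms across the corpus.
top_two = vocabulary(max_features=2)

# Keep terms that appear in at least two documents.
in_two_or_more_docs = vocabulary(min_df=2)

# Drop terms that appear in more than half of the documents.
in_at_most_half = vocabulary(max_df=0.5)

print(top_two)
print(in_two_or_more_docs)
print(in_at_most_half)
```

An integer `min_df` or `max_df` is read as a document count, while a float is read as a proportion of documents.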

In [28]:
# A:
from sklearn.feature_extraction.text import CountVectorizer

cvec = CountVectorizer()
cvec.fit(data_train['data'])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [29]:
# bag of hot garbage, at least at the top
# Note: in scikit-learn 1.0+ this method is named get_feature_names_out()
cvec.get_feature_names()

['00',
 '000',
 '0000',
 '00000000',
 '00000000b',
 '00000001',
 '00000001b',
 '00000010',
 '00000010b',
 '00000011',
 '00000011b',
 '00000100',
 '00000100b',
 '00000101',
 '00000101b',
 '00000110',
 '00000110b',
 '00000111',
 '00000111b',
 '00001000',
 '00001000b',
 '00001001',
 '00001001b',
 '00001010',
 '00001010b',
 '00001011',
 '00001011b',
 '00001100',
 '00001100b',
 '00001101',
 '00001101b',
 '00001110',
 '00001110b',
 '00001111',
 '00001111b',
 '00010000',
 '00010000b',
 '00010001',
 '00010001b',
 '00010010',
 '00010010b',
 '00010011',
 '00010011b',
 '00010100',
 '00010100b',
 '00010101',
 '00010101b',
 '00010110',
 '00010110b',
 '00010111',
 '00010111b',
 '00011000',
 '00011000b',
 '00011001',
 '00011001b',
 '00011010',
 '00011010b',
 '00011011',
 '00011011b',
 '00011100',
 '00011100b',
 '00011101',
 '00011101b',
 '00011110',
 '00011110b',
 '00011111',
 '00011111b',
 '000152',
 '0005895485',
 '000th',
 '001',
 '00100000',
 '00100000b',
 '00100001',
 '00100001b',
 '00100010',
 

In [30]:
# Size of the feature dictionary (vocabulary), numeric tokens included
len(cvec.get_feature_names())

33374

In [31]:
from sklearn.feature_extraction.text import CountVectorizer

cvec = CountVectorizer(stop_words='english')
cvec.fit(data_train['data'])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [32]:
len(cvec.get_feature_names())

33069

In [33]:
X_train = pd.DataFrame(cvec.transform(data_train['data']).todense(), columns=cvec.get_feature_names())
y_train = data_train['target']

In [37]:
# Most common terms
X_train.sum().sort_values(ascending=False).head(15)

people        1605
don           1031
like           989
use            955
key            943
just           937
know           888
government     815
time           745
said           677
think          658
right          657
gun            634
armenian       611
used           593
dtype: int64

In [39]:
len(data_test['target'])

1529

In [40]:
X_test = pd.DataFrame(cvec.transform(data_test['data']).todense(), columns=cvec.get_feature_names())
y_test = data_test['target']

In [41]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.7985611510791367
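
Converting to a dense DataFrame is handy for inspecting term totals, but it can use a lot of memory with tens of thousands of columns. `LogisticRegression` also accepts the sparse matrix that the vectorizer returns. A minimal sketch on invented toy data:

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: label 0 for cryptography posts, 1 for gun-politics posts.
train_docs = [
    "encryption key for the chip",
    "secret key and cipher",
    "gun control law",
    "the gun owners vote",
]
train_labels = [0, 0, 1, 1]
test_docs = ["a new cipher key", "gun law debate"]

vec = CountVectorizer(stop_words="english")
X_train_sparse = vec.fit_transform(train_docs)  # fit on training text only
X_test_sparse = vec.transform(test_docs)        # reuse the fitted vocabulary

clf = LogisticRegression()
clf.fit(X_train_sparse, train_labels)           # no .todense() needed
predictions = clf.predict(X_test_sparse)
print(predictions)
```

The model is the same either way; only the storage format of the features differs.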

### 4) Test Out Hashing and TF-IDF.

Let's see if hashing or TF-IDF improves our accuracy.

1) Initialize a HashingVectorizer and repeat the test with no restriction on the number of features.
- Does the score improve with respect to the CountVectorizer?
- Print out the number of features for this model.
- Initialize a TF-IDF vectorizer and repeat the analysis above.
- Print out the number of features for this model.
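
One way to read off the number of features for each vectorizer is sketched below on an invented toy corpus. `HashingVectorizer` is stateless: it needs no fitting and keeps no vocabulary, so its feature count is simply its `n_features` setting (2**20 by default). `TfidfVectorizer` learns a vocabulary, so its feature count depends on the text.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Hypothetical toy corpus.
docs = [
    "encryption key for the chip",
    "secret key and cipher",
    "gun control law",
]

# Hashing: transform directly, no fit required.
hasher = HashingVectorizer(stop_words="english")
X_hashed = hasher.transform(docs)
n_hash_features = X_hashed.shape[1]

# TF-IDF: the vocabulary is learned from the text during fit.
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(docs)
n_tfidf_features = len(tfidf.vocabulary_)

print(n_hash_features, n_tfidf_features)
```

Because hashing stores no vocabulary, a hashed column cannot be mapped back to the word that produced it.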

**Bonus**
- Change the parameters of either (or both) models to improve your score.
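
For the bonus, one possible approach is a grid search over pipeline parameters. The sketch below uses invented toy data and an arbitrary, illustrative parameter grid. Step names such as `tfidfvectorizer` come from `make_pipeline`, which names each step after its lowercased class.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 0 = cryptography, 1 = gun politics.
docs = [
    "encryption key chip", "secret cipher key", "public key encryption",
    "cipher algorithm secret", "gun control law", "gun owners vote",
    "firearm law debate", "handgun ban vote",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2)],
    "tfidfvectorizer__sublinear_tf": [False, True],
    "logisticregression__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)
print(search.best_params_, search.best_score_)
```

On the real data, fit the search on `data_train` only and keep `data_test` for the final evaluation, so the test score stays an honest estimate.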

In [42]:
# A:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

In [64]:
# HashingVectorizer is stateless, so it can transform text without being fit first
hvec = HashingVectorizer(stop_words='english')
lr = LogisticRegression()

X_train_hashed = hvec.transform(data_train['data'])
X_test_hashed = hvec.transform(data_test['data'])

lr.fit(X_train_hashed, y_train)
preds = lr.predict(X_test_hashed)

In [56]:
# The same model as a pipeline, so the vectorizer and classifier are handled together
from sklearn.pipeline import make_pipeline

X_train = data_train['data']
X_test = data_test['data']

model = make_pipeline(HashingVectorizer(stop_words='english'), LogisticRegression())
model.fit(X_train, y_train)
preds = model.predict(X_test)

In [57]:
len(y_test)

1529

In [58]:
len(data_test['target'])

1529

In [59]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, preds)

0.8024852844996729

In [65]:
X_train = data_train['data']
X_test = data_test['data']

model = make_pipeline(TfidfVectorizer(stop_words='english'), LogisticRegression())
model.fit(X_train, y_train)
preds = model.predict(X_test)
accuracy_score(y_test, preds)

0.8260300850228908

In [66]:
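# Note: this is the number of training documents, not the number of features.
# The feature count is the size of the fitted TfidfVectorizer's vocabulary.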
len(X_train)

2296
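
A sketch of reading the feature count from a fitted pipeline, using invented toy data. `make_pipeline` names each step after its lowercased class, so the vectorizer is available as `named_steps['tfidfvectorizer']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 0 = cryptography, 1 = gun politics.
docs = [
    "encryption key for the chip",
    "secret key and cipher",
    "gun control law",
    "the gun owners vote",
]
labels = [0, 0, 1, 1]

model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(docs, labels)

n_documents = len(docs)  # rows: one per document
n_features = len(model.named_steps["tfidfvectorizer"].vocabulary_)  # columns
n_coefficients = model.named_steps["logisticregression"].coef_.shape[1]

print(n_documents, n_features, n_coefficients)
```

The classifier learns one coefficient per feature, so `n_features` and `n_coefficients` agree, while `n_documents` is unrelated to either.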

### 5) [Bonus] Robust Text Preprocessing

Your mission, should you choose to accept it, is to write a preprocessing function for all of your text. This function should

- convert all text to lowercase,
- remove punctuation,
- stem or lemmatize each word of the text,
- remove stopwords.

The function should receive one string of text and return the processed text.

Once you have built your function, use it to process your train and test data, then fit a Logistic Regression model to see how it performs.
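
One possible shape for such a function is sketched below, under a few assumptions: it stems with NLTK's `PorterStemmer` (which needs no corpus download), borrows scikit-learn's built-in `ENGLISH_STOP_WORDS` rather than NLTK's stop word list, and splits on whitespace instead of using a tokenizer. The name `preprocess` is illustrative.

```python
import string

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, and stem one string."""
    cleaned = text.lower().translate(PUNCTUATION_TABLE)
    # Remove stop words before stemming so they still match the stop word list.
    tokens = [tok for tok in cleaned.split() if tok not in ENGLISH_STOP_WORDS]
    return " ".join(stemmer.stem(tok) for tok in tokens)

processed = preprocess("The cats are running!")
print(processed)
```

Applying it to the data could look like `[preprocess(doc) for doc in data_train['data']]`, with the same call on `data_test['data']`. Note that deleting punctuation turns contractions such as "don't" into "dont", which the stop word list will not catch.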