<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing Lab

_Authors: Dave Yerrington (SF)_

---

In this lab, we'll explore scikit-learn and NLTK's capabilities for processing text even further. We'll use the 20 newsgroups data set, which is provided by scikit-learn.

In [1]:
# Standard data science imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer

%matplotlib inline

In [2]:
# Getting the scikit-learn data set:
from sklearn.datasets import fetch_20newsgroups

### 1) Use the `fetch_20newsgroups` function to download a training and testing set.

The "20 Newsgroups" dataset is described [here](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html).

For this lab let's choose 4 categories to analyze.  The full list is given below.


```python
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
```

Note that the solution code will use these categories:
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Also remove the headers, footers, and quotes using the `remove` keyword argument of the function.

In [3]:
#Extracting Information from the Data's Dictionary format 

categories = ['misc.forsale','rec.autos','rec.sport.baseball','sci.space']  # Fill in whatever categories you want to use!!

# Setting our training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))
# Setting our testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

**Question:** What does the `shuffle` argument do?  Why are we setting a `random_state`?

# A:

The `shuffle` argument controls whether the documents are returned in a randomly permuted order rather than in the order in which they are stored.


Setting `random_state` seeds that shuffle, so the documents come back in the same order on every run and our results are reproducible.
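
As a rough illustration of why a fixed seed matters, the sketch below shuffles a hypothetical index array with NumPy instead of loading the real data set. The helper name `shuffled_indices` is made up for this example, and it only approximates what a seeded shuffle does.

```python
import numpy as np

def shuffled_indices(n_samples, random_state):
    """Return a reproducible permutation of range(n_samples)."""
    rng = np.random.RandomState(random_state)
    indices = np.arange(n_samples)
    rng.shuffle(indices)  # in-place shuffle driven by the seeded generator
    return indices

first = shuffled_indices(10, random_state=42)
second = shuffled_indices(10, random_state=42)

# Same seed, same order: the shuffle is repeatable across runs.
print(np.array_equal(first, second))               # True
# Shuffling only reorders; every index is still present exactly once.
print(sorted(first.tolist()) == list(range(10)))   # True
```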

### 2) Inspect the data.

We've downloaded a few `newsgroups` categories and removed their headers, footers, and quotes.

Because this is a scikit-learn data set, it comes with pre-split training and testing sets (note: we were able to pass "train" and "test" to the `subset` argument).

Let's inspect them.

1) What data type is `data_train`?
- Is it a list? A dictionary? What else?
- How many data points does it contain?
- Inspect the first data point. What does it look like?

# A:

The data type of `data_train` is `sklearn.utils.Bunch`, a dictionary subclass that also exposes its keys as attributes. It has five keys, and `data_train.data` holds 2,369 posts.

The first data point, `data_train.data[0]`, is the raw text of a single post from one of the categories selected above.
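
To make the dictionary-like behavior concrete, here is a small sketch that builds a hypothetical toy `Bunch` with the same field names as `data_train`. The posts and labels are invented for illustration.

```python
from sklearn.utils import Bunch

# Hypothetical toy Bunch mirroring the field names of data_train.
toy = Bunch(
    data=["first post text", "second post text", "third post text"],
    target=[1, 0, 1],
    target_names=["rec.autos", "sci.space"],
)

print(isinstance(toy, dict))      # True: Bunch subclasses dict
print(toy.data is toy["data"])    # True: attribute and key access return the same object
print(len(toy.data))              # 3 documents

# Map each integer label back to its category name.
labels = [toy.target_names[i] for i in toy.target]
print(labels)                     # ['sci.space', 'rec.autos', 'sci.space']
```

The same pattern should apply to the real data: `data_train.target_names[data_train.target[0]]` gives the category of the first post.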

In [4]:
type(data_train)

sklearn.utils.Bunch

In [5]:
data_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [6]:
len(data_train.data)

2369

In [7]:
data_train.data[0]

'I have a \'71 Buick Skylark with 148K on it.  I bought it in California, and if\nit\'ll let me, I\'d like to keep it for another year.  The only problem is these\nIndiana winters--my heater controls don\'t work.\n\nThe car has vacuum operated control switches for the vents.  Right now it is\nstuck in the "vent" mode.  It will blow warm air, but I can\'t switch the air\nflow to either the floor (I can live without this) or the defrost (I can\'t \nlive without this).  I probably could just jam the air deflector to the \ndefrost position, but this blows a lot of air in my face and is, well,\nkind of like putting a vacuum cleaner in reverse.\n\nI have taken parts of the dash off and looked at the vacuum system and I think\nthe problem (or part of it) is with the two diaphragms which control up/down\nand outside/inside air flow.  THe diaphragm which controls outside(vent)/in-\nside(no vent) air is cracked most of the way around, and the other one is\nprobably damaged too, considering the a

### 3) Create a bag-of-words model.

Let's train a model using a simple count vectorizer.

1) Initialize a standard CountVectorizer and fit the training data.
- How big is the feature dictionary?
- Eliminate English stop words.
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer.
- Evaluate the performance of a logistic regression on the features extracted by the CountVectorizer.
    - You will have to transform the test set (`data_test`), too. Be careful to use the trained vectorizer without refitting it.

**Bonus**
- Try a couple of modifications:
    - Restrict the `max_features`.
    - Change the `max_df` and `min_df`.
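
For the bonus parameters, the following sketch shows how each one trims the vocabulary on a made-up four-document corpus. The thresholds are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "space shuttle launch",
    "space station orbit",
    "baseball game tonight",
    "baseball season opener",
]

# Keep only the 3 most frequent terms across the corpus.
top3 = CountVectorizer(max_features=3).fit(docs)
# Keep terms that appear in at least 2 documents.
common = CountVectorizer(min_df=2).fit(docs)
# Drop terms that appear in more than 25% of the documents.
rare = CountVectorizer(max_df=0.25).fit(docs)

print(len(top3.vocabulary_))        # 3
print(sorted(common.vocabulary_))   # ['baseball', 'space']
print("space" in rare.vocabulary_)  # False: "space" is in half of the documents
```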

In [8]:
# A:

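The dictionary is barely smaller: scikit-learn's built-in English stop-word list holds only a few hundred words, a small fraction of a vocabulary with tens of thousands of terms. Note that the vectorizer below also sets `max_features = 5000`, which caps the dictionary at 5,000 terms regardless.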
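
One way to check the effect of stop words directly is to fit two vectorizers and compare their vocabularies. The sketch below does this on an invented three-sentence corpus, so the sizes it prints say nothing about the newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the car is for sale", "the rocket is in space", "the baseball game"]

full = CountVectorizer().fit(docs)
filtered = CountVectorizer(stop_words="english").fit(docs)

# Vocabulary size with and without English stop words.
print(len(full.vocabulary_), len(filtered.vocabulary_))
# Terms that the stop-word filter removed.
print(sorted(set(full.vocabulary_) - set(filtered.vocabulary_)))
```

Swapping `docs` for `data_train.data` would answer the question for the lab data.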

In [9]:
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = 'english',
                             max_features = 5000) 

In [10]:
data_train_vc = vectorizer.fit_transform(data_train.data)
data_test_vc = vectorizer.transform(data_test.data)
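
The difference between `fit_transform` and `transform` can be seen on a tiny invented corpus: the test matrix reuses the training vocabulary, and words never seen in training are silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["space shuttle launch", "baseball game tonight"]
test_docs = ["space telescope launch"]  # "telescope" never appears in training

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)  # learns the vocabulary
X_test = vec.transform(test_docs)        # reuses it without refitting

print(X_train.shape[1] == X_test.shape[1])  # True: same columns
print("telescope" in vec.vocabulary_)       # False: unseen word is ignored
print(X_test.sum())                         # 2: only "space" and "launch" are counted
```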

In [11]:
data_train_vc = data_train_vc.toarray()

In [12]:
logreg = LogisticRegression()

In [13]:
logreg.fit(data_train_vc, data_train['target'])



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [14]:
logreg.score(data_train_vc, data_train['target'])

0.9725622625580413

In [15]:
logreg.score(data_test_vc, data_test['target'])

0.8421052631578947
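
As an alternative to the cell-by-cell approach above, the vectorizer and classifier can be chained in a pipeline. This is a sketch on a hypothetical toy corpus, not the lab's solution; it also shows that the classifier accepts the sparse matrix without calling `.toarray()`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for data_train / data_test.
train_docs = [
    "shuttle launch orbit", "rocket orbit station",
    "pitcher inning homerun", "baseball inning pitcher",
]
train_labels = [0, 0, 1, 1]
test_docs = ["rocket launch", "baseball homerun"]
test_labels = [0, 1]

# The pipeline fits the vectorizer on the training text only and passes the
# sparse count matrix straight to the classifier.
model = make_pipeline(CountVectorizer(stop_words="english"), LogisticRegression())
model.fit(train_docs, train_labels)

print(model.score(train_docs, train_labels))  # training accuracy
print(model.score(test_docs, test_labels))    # test accuracy
```

Comparing the two scores, as in the cells above, gives a quick read on how much the model overfits.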

### 4) Test Out Hashing and TF-IDF.

Let's see if hashing or TF-IDF improves our accuracy.

1) Initialize a HashingVectorizer and repeat the test with no restriction on the number of features.
- Does the score improve with respect to the CountVectorizer?
- Print out the number of features for this model.
- Initialize a TF-IDF vectorizer and repeat the analysis above.
- Print out the number of features for this model.

**Bonus**
- Change the parameters of either (or both) models to improve your score.
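
Before running the vectorizers, it may help to see the idea behind hashing. The sketch below is a simplified, standard-library version of the hashing trick: `hash_features` is a hypothetical helper, and `zlib.crc32` stands in for the signed MurmurHash3 that scikit-learn uses.

```python
import zlib

def hash_features(text, n_features=16):
    """Map a document to a fixed-length count vector via the hashing trick."""
    vector = [0] * n_features
    for token in text.lower().split():
        # crc32 is a stand-in hash for this sketch, not scikit-learn's hash.
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] += 1
    return vector

v1 = hash_features("space shuttle launch")
v2 = hash_features("baseball game tonight space")

# The vector length is fixed up front, whatever words appear.
print(len(v1), len(v2))
# Every token lands in some bucket; distinct tokens may collide.
print(sum(v1), sum(v2))
```

No vocabulary is stored, which is why a hashing vectorizer needs no fitting but also cannot map a column back to a word.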

# A:

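The HashingVectorizer model overfits slightly less than the CountVectorizer model: its training accuracy is lower (0.957 vs. 0.973) while its test accuracy is higher (0.857 vs. 0.842). TF-IDF gives the best test accuracy of the three (0.885).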

In [16]:
hashingvectorizer = HashingVectorizer(analyzer = "word",
                       tokenizer = None,
                       preprocessor = None,
                       stop_words = 'english')

In [17]:
data_train_hashvc = hashingvectorizer.fit_transform(data_train.data)
data_test_hashvc = hashingvectorizer.transform(data_test.data)

In [18]:
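# Caution: the default 2**20 hashed columns make this dense array huge; the sparse matrix also works.
data_train_hashvc = data_train_hashvc.toarray()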

In [19]:
logreg.fit(data_train_hashvc, data_train['target'])



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [20]:
logreg.score(data_train_hashvc, data_train['target'])

0.9573659772055719

In [21]:
logreg.score(data_test_hashvc, data_test['target'])

0.8573240329740013
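
The task also asks for the number of features in the hashing model. Because a `HashingVectorizer` stores no vocabulary, the count comes from the width of the output matrix. A minimal sketch on two invented documents:

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["space shuttle launch", "baseball game tonight"]

hv = HashingVectorizer(stop_words="english")
X = hv.fit_transform(docs)  # stateless: fitting learns nothing from the text

print(X.shape[1])                  # 1048576 columns, the default n_features
print(hv.n_features == 2 ** 20)    # True
print(hasattr(hv, "vocabulary_"))  # False: no vocabulary to inspect
```

For the lab data, `data_test_hashvc.shape[1]` should report the same width.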

#### TF-IDF

In [22]:
tfidfvect = TfidfVectorizer(analyzer = "word",
                            tokenizer = None,
                            preprocessor = None,
                            stop_words = 'english')

In [23]:
data_train_tfidf = tfidfvect.fit_transform(data_train.data)
data_test_tfidf = tfidfvect.transform(data_test.data)

In [24]:
data_train_tfidf = data_train_tfidf.toarray()

In [25]:
logreg.fit(data_train_tfidf, data_train['target'])



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [26]:
logreg.score(data_train_tfidf, data_train['target'])

0.9691853102574927

In [27]:
logreg.score(data_test_tfidf, data_test['target'])

0.8852251109701966

In [28]:
# get_feature_names() was removed in scikit-learn 1.2; vocabulary_ works in every version.
len(tfidfvect.vocabulary_)

24340
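
For reference, the weighting that `TfidfVectorizer` applies by default can be reproduced by hand. The sketch below assumes the default settings (`smooth_idf=True`, `norm="l2"`) and an invented three-document corpus, and checks the smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, against the fitted `idf_` values.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["space shuttle launch", "space station orbit", "baseball game tonight"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# Document frequency: number of documents containing each term.
counts = CountVectorizer().fit_transform(docs).toarray()
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF, computed manually for comparison.
n_docs = len(docs)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(np.allclose(tfidf.idf_, manual_idf))                    # True
# Each row is scaled to unit Euclidean length by default.
print(np.allclose(np.linalg.norm(X.toarray(), axis=1), 1.0))  # True
```

Terms that appear in fewer documents get a larger weight, which is one plausible reason TF-IDF scores best on the test set here.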

### 5) [Bonus] Robust Text Preprocessing

Your mission, should you choose to accept it, is to write a preprocessing function for all of your text.  This function should:

- convert all text to lowercase,
- remove punctuation,
- stem or lemmatize each word of the text,
- remove stopwords.

The function should receive one string of text and return the processed text.

Once you have built your function, use it to process your train and test data, then fit a Logistic Regression model to see how it performs.
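
One possible starting point is sketched below using only the standard library. The stop-word set is a tiny illustrative sample, and `naive_stem` is a crude, hypothetical stand-in for a real stemmer such as NLTK's `PorterStemmer`. Stop words are removed before stemming so that stemmed forms do not slip past the stop list.

```python
import string

# Tiny illustrative stop-word set (an assumption for this sketch); a full list
# such as scikit-learn's ENGLISH_STOP_WORDS would be used in practice.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to"}

# Translation table that turns every punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def naive_stem(word):
    """Crude suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text, stem=naive_stem, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, drop stop words, and stem one string."""
    tokens = text.lower().translate(PUNCT_TO_SPACE).split()
    kept = [tok for tok in tokens if tok not in stop_words]
    return " ".join(stem(tok) for tok in kept)

print(preprocess("The rockets are launching, and the engines roared!"))
# rocket launch engin roar
```

The function could then be applied with `[preprocess(doc) for doc in data_train.data]` (and likewise for `data_test.data`) before vectorizing, and a real stemmer can be swapped in through the `stem` argument.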