<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing Lab

---

In this lab we will further explore sklearn and NLTK's capabilities for processing text. We will use the 20 Newsgroups dataset, which is provided by sklearn.

In [49]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
plt.style.use('ggplot')
sns.set(font_scale=1.5)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [50]:
# Getting the Sklearn Dataset
from sklearn.datasets import fetch_20newsgroups

### 1. Use the `fetch_20newsgroups` function to download a training and testing set.

Look up the [function documentation](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html) for how to grab the data.

You should pull these categories:
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Also remove the headers, footers, and quotes using the `remove` keyword argument of the function.

In [51]:
# Download the training and testing subsets for the selected newsgroups
# Categories of posts we want
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
# Setting training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))
# Setting testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

### 2. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Because this is an sklearn dataset, it comes with pre-split train and test sets (note that we could request `'train'` and `'test'` through the `subset` argument).

Let's inspect them.

1. What data type is `data_train`?
- What does `data_train` contain? 
- How many data points does `data_train` contain?
- How many data points of each category does `data_train` contain?
- Inspect the first data point. What does it look like?

In [53]:
# A:
#1.What data type is data_train?
type(data_train)
# A Bunch: a dict-like object whose keys are also accessible as attributes

sklearn.utils.Bunch

In [54]:
#2.What does data_train contain?
data_train.keys()
#['data', 'filenames', 'target_names', 'target', 'DESCR', 'description']

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])

In [55]:
#3.How many data points does data_train contain?
print(len(data_train.data))
print(pd.DataFrame(data_train.data).shape[0])
#2034 data points for data_train.data

2034
2034


In [65]:
#4.How many data points of each category does data_train contain?
# Note: the lines below print the size of each field of the Bunch;
# per-category counts come from data_train.target (e.g. np.bincount(data_train.target)).
print('Data',len(data_train.data))
print('filenames',len(data_train.filenames))
print('target_names',len(data_train.target_names))
print('target',len(data_train.target))
print('DESCR', 0 if data_train.DESCR is None else len(data_train.DESCR))
print('description',len(data_train.description))

Data 2034
filenames 2034
target_names 4
target 2034
DESCR 0
description 33
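
The per-category counts asked for in question 4 can be read from the integer labels in `data_train.target`, which index into `data_train.target_names`. The following is a minimal sketch, assuming small hypothetical stand-ins (`target`, `target_names`) so that it runs without downloading the dataset; in the lab you would pass the real attributes instead.

```python
import numpy as np

# Hypothetical stand-ins for data_train.target and data_train.target_names.
target = np.array([1, 3, 2, 0, 2, 1, 1, 3])
target_names = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

# np.bincount counts how often each integer label occurs;
# minlength keeps a slot for categories that never appear.
counts = np.bincount(target, minlength=len(target_names))
per_category = dict(zip(target_names, counts.tolist()))

for name, n in per_category.items():
    print(name, n)
```

With the real data, `pd.Series(data_train.target).value_counts()` should give the same numbers keyed by label index.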


In [71]:
#5.Inspect the first data point, what does it look like?
# data_train.data is a plain list of strings, so it can be indexed directly
data_train.data[0]

"Hi,\n\nI've noticed that if you only save a model (with all your mapping planes\npositioned carefully) to a .3DS file that when you reload it after restarting\n3DS, they are given a default position and orientation.  But if you save\nto a .PRJ file their positions/orientation are preserved.  Does anyone\nknow why this information is not stored in the .3DS file?  Nothing is\nexplicitly said in the manual about saving texture rules in the .PRJ file. \nI'd like to be able to read the texture rule information, does anyone have \nthe format for the .PRJ file?\n\nIs the .CEL file format available from somewhere?\n\nRych"

### 3. Bag of Words model

Let's train a model using a simple count vectorizer.

1. Initialize a standard CountVectorizer and fit the training data.
- How big is the feature dictionary?
- Repeat eliminating English stop words.
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer.
- What are the 20 words that are most common in the whole corpus?
- What are the 20 most common words in each of the 4 classes?
- Evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer.
    - You will have to transform the test_set, too. Be careful to use the trained vectorizer, without re-fitting it.
    - Create a confusion matrix.

**BONUS:**
- Try a couple of modifications:
    - restrict max_features
    - change max_df and min_df
    - for each of the above print a confusion matrix and investigate what gets mixed
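
The evaluation step in the list above (fit the vectorizer on the training set only, reuse it on the test set, then score a Logistic Regression) can be sketched as follows. This is a minimal illustration under stated assumptions: the toy documents and the two labels are hypothetical stand-ins for `data_train` and `data_test`, and `max_iter=1000` is an arbitrary choice, not a tuned setting.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical toy corpus standing in for data_train.data / data_test.data.
train_docs = [
    "the shuttle reached orbit around the moon",
    "nasa launched a new satellite into orbit",
    "the image was rendered with a new graphics card",
    "render the polygon image using the graphics library",
]
train_labels = [0, 0, 1, 1]  # 0 = space, 1 = graphics (hypothetical labels)
test_docs = ["a satellite in orbit", "a graphics image"]
test_labels = [0, 1]

vectorizer = CountVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_docs)  # learn the vocabulary on training data only
X_test = vectorizer.transform(test_docs)        # reuse that vocabulary, no re-fitting

model = LogisticRegression(max_iter=1000)
model.fit(X_train, train_labels)
predictions = model.predict(X_test)

# Rows of the confusion matrix are true classes, columns are predicted classes.
cm = confusion_matrix(test_labels, predictions, labels=[0, 1])
print(accuracy_score(test_labels, predictions))
print(cm)
```

Calling `transform` rather than `fit_transform` on the test documents keeps the feature columns aligned with the ones the model was trained on.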

In [118]:
# A:
#1.Initialize a standard CountVectorizer and fit the training data.
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()
cv.fit(data_train.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [119]:
#2.How big is the feature dictionary?
len(cv.vocabulary_)

26879

In [148]:
#3.Repeat eliminating English stop words.
cv=CountVectorizer(stop_words='english')
cv.fit(data_train.data)
len(cv.vocabulary_)

26576

In [149]:
#4.Is the dictionary smaller?
#Yes: 26879 terms without stop-word removal vs. 26576 with it (303 fewer)

In [150]:
#5.Transform the training data using the trained vectorizer.
data_mat=cv.transform(data_train.data)

In [164]:
#6.What are the 20 words that are most common in the whole corpus?
# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names()
df=pd.DataFrame(np.sum(data_mat.toarray(),axis=0),index=cv.get_feature_names_out()).T
df.sort_values(0,axis=1,ascending=False,inplace=True)
df.iloc[0,0:20]

space       1061
people       793
god          745
don          730
like         682
just         675
does         600
know         592
think        584
time         546
image        534
edu          501
use          468
good         449
data         444
nasa         419
graphics     414
jesus        411
say          409
way          387
Name: 0, dtype: int64
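
For step 7 (the most common words in each class), one possible approach is to select the rows of the count matrix that belong to a class and sum each column. The sketch below assumes a hypothetical toy corpus with two classes; `top_words` is an illustrative helper, not part of sklearn. It sums the sparse matrix directly instead of calling `.toarray()`, and rebuilds the vocabulary order from `vocabulary_` so it does not depend on the sklearn version.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for data_train.data / data_train.target.
docs = ["orbit orbit moon", "orbit satellite", "image image image pixel", "pixel image"]
labels = np.array([0, 0, 1, 1])
class_names = ['space', 'graphics']  # hypothetical names, like data_train.target_names

cv = CountVectorizer(stop_words='english')
counts = cv.fit_transform(docs)

# Terms ordered by column index, built from the fitted vocabulary mapping.
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

def top_words(count_matrix, vocab, n=20):
    """Return the n most frequent terms as (term, total count) pairs."""
    totals = np.asarray(count_matrix.sum(axis=0)).ravel()
    order = np.argsort(totals)[::-1][:n]
    return [(str(vocab[i]), int(totals[i])) for i in order]

per_class = {}
for idx, name in enumerate(class_names):
    rows = np.flatnonzero(labels == idx)  # row indices of documents in this class
    per_class[name] = top_words(counts[rows], vocab, n=20)
    print(name, per_class[name])
```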

In [None]:
#7.


In [None]:
#8.


### 4. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

1. Initialize a HashingVectorizer and repeat the test with no restriction on the number of features.
- Does the score improve with respect to the count vectorizer? 
- Print out the number of features for this model.
- Initialize a TF-IDF Vectorizer and repeat the analysis above.

**BONUS:**
- Change the parameters of either (or both!) models to improve your score.
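
The comparison described above can be sketched as follows. This is a hedged illustration only: the toy documents and labels are hypothetical stand-ins for the newsgroup data, and the scores it prints say nothing about the real dataset. Note that `HashingVectorizer` is stateless (it has no vocabulary to fit), and with default settings it maps tokens into `2**20` columns.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus standing in for data_train / data_test.
train_docs = [
    "the shuttle reached orbit around the moon",
    "nasa launched a new satellite into orbit",
    "the image was rendered with a new graphics card",
    "render the polygon image using the graphics library",
]
train_labels = [0, 0, 1, 1]
test_docs = ["a satellite in orbit", "a graphics image"]
test_labels = [0, 1]

# Hashing: no fit step is needed, the same hash function is applied everywhere.
hashing = HashingVectorizer()
X_hash_train = hashing.transform(train_docs)
X_hash_test = hashing.transform(test_docs)

# TF-IDF: fit on the training documents, then reuse on the test documents.
tfidf = TfidfVectorizer(stop_words='english')
X_tfidf_train = tfidf.fit_transform(train_docs)
X_tfidf_test = tfidf.transform(test_docs)

features = {
    'hashing': (X_hash_train, X_hash_test),
    'tfidf': (X_tfidf_train, X_tfidf_test),
}
results = {}
for name, (X_tr, X_te) in features.items():
    clf = LogisticRegression(max_iter=1000).fit(X_tr, train_labels)
    results[name] = (X_tr.shape[1], clf.score(X_te, test_labels))
    print(name, 'features:', X_tr.shape[1], 'accuracy:', results[name][1])
```

Passing a smaller `n_features` to `HashingVectorizer` reduces the number of columns at the cost of more hash collisions.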

In [6]:
# A:

### 5. Classifier comparison

Of all the vectorizers tested above, choose one that performs reasonably well with a manageable number of features, then compare the performance of these models:

- KNN
- Logistic Regression
- Decision Trees
- Support Vector Machine
- Random Forest
- Extra Trees

In order to speed up the calculation it's better to vectorize the data only once and then compare the models.
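
A minimal sketch of that pattern, vectorizing once and then looping over the classifiers, might look like the following. The toy corpus is a hypothetical stand-in for the newsgroup data, and the hyperparameters (`n_neighbors=3`, `n_estimators=50`, a linear SVM kernel) are illustrative assumptions rather than recommended settings.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus standing in for data_train / data_test.
train_docs = [
    "the shuttle reached orbit around the moon",
    "nasa launched a new satellite into orbit",
    "the rocket engine burned fuel during launch",
    "astronauts repaired the telescope in orbit",
    "the image was rendered with a new graphics card",
    "render the polygon image using the graphics library",
    "the shader draws pixels on the screen",
    "convert the image file to another graphics format",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
test_docs = ["a satellite launch into orbit", "a graphics image format"]
test_labels = [0, 1]

# Vectorize once; every model below reuses the same matrices.
vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

models = {
    'KNN': KNeighborsClassifier(n_neighbors=3),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'SVM': SVC(kernel='linear'),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'Extra Trees': ExtraTreesClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    scores[name] = model.score(X_test, test_labels)  # mean accuracy on the test set
    print(f"{name}: {scores[name]:.2f}")
```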

### Bonus: Other classifiers

Adapt the code from [this example](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#example-text-document-classification-20newsgroups-py) to compare all the classifiers suggested above and to display the final plot.

### Bonus: 

- #### Fit a model to the 20newsgroups dataset with all classes

- #### Choose texts, for example from newspaper articles, and check what is the class label predicted for them. Does the predicted label meet your expectations?