<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing Lab

---

In this lab we will further explore sklearn's and NLTK's capabilities for processing text. We will use the 20 Newsgroups dataset, which is provided by sklearn.

In [79]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
plt.style.use('ggplot')
sns.set(font_scale=1.5)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [80]:
# Getting the Sklearn Dataset
from sklearn.datasets import fetch_20newsgroups

### 1. Use the `fetch_20newsgroups` function to download a training and testing set.

Look up the [function documentation](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html) for how to grab the data.

You should pull these categories:
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Also remove the headers, footers, and quotes using the `remove` keyword argument of the function.

In [81]:
data=fetch_20newsgroups()

In [82]:
print(data.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features       

In [83]:
data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [84]:
pd.DataFrame(data['data']).head()

Unnamed: 0,0
0,From: lerxst@wam.umd.edu (where's my thing)\nS...
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...


In [85]:
# Extracting Information from the Data's Dictionary format 
# Categories of newsgroup posts we want
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
# Setting training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))
# Setting testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))
print(data_train)

       '/Users/discoveryscientist/scikit_learn_data/20news_home/20news-bydate-train/talk.religion.misc/83741',
       '/Users/discoveryscientist/scikit_learn_data/20news_home/20news-bydate-train/sci.space/61092',
       ...,
       '/Users/discoveryscientist/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38737',
       '/Users/discoveryscientist/scikit_learn_data/20news_home/20news-bydate-train/alt.atheism/53237',
       '/Users/discoveryscientist/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38269'],


In [86]:
print(data_test)

       '/Users/discoveryscientist/scikit_learn_data/20news_home/20news-bydate-test/comp.graphics/38808',
       '/Users/discoveryscientist/scikit_learn_data/20news_home/20news-bydate-test/comp.graphics/40062',
       ...,
       '/Users/discoveryscientist/scikit_learn_data/20news_home/20news-bydate-test/talk.religion.misc/84302',
       '/Users/discoveryscientist/scikit_learn_data/20news_home/20news-bydate-test/comp.graphics/38839',
       '/Users/discoveryscientist/scikit_learn_data/20news_home/20news-bydate-test/comp.graphics/38973'],


### 2. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Because this is an sklearn dataset, it comes with pre-split train and test sets (note we were able to call 'train' and 'test' in subset).

Let's inspect them.

1. What data type is `data_train`?
- What does `data_train` contain? 
- How many data points does `data_train` contain?
- How many data points of each category does `data_train` contain?
- Inspect the first data point, what does it look like?
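
The inspection questions above can be answered with a few lines of plain Python. The sketch below is a minimal illustration that assumes only that the object exposes `data`, `target` and `target_names`, as the `Bunch` returned by `fetch_20newsgroups` does. The `toy` object and the `summarize` helper are hypothetical stand-ins so that the example runs without downloading anything.

```python
from collections import Counter
from types import SimpleNamespace

# Hypothetical toy stand-in for the object returned by fetch_20newsgroups.
toy = SimpleNamespace(
    data=["first post about orbits", "a post about shading", "another orbit question"],
    target=[1, 0, 1],
    target_names=["comp.graphics", "sci.space"],
)

def summarize(bunch):
    """Return the number of documents and a {category name: count} mapping."""
    counts = Counter(bunch.target_names[int(i)] for i in bunch.target)
    return len(bunch.data), dict(counts)

n_docs, per_category = summarize(toy)
print(type(toy).__name__, n_docs, per_category)
print(repr(toy.data[0][:40]))  # peek at the start of the first document
```

Calling `summarize(data_train)` on the real training subset should work the same way, since the integer labels in `target` index into `target_names`.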

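#### A: `data_train` is a dictionary-like sklearn `Bunch`. The four-category training subset holds 2034 documents; 11314 is the size of the full 20-category training set.

**Data Set Characteristics (full dataset, from `DESCR`):**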

    =================   ==========
    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features                  text
    =================   ==========


In [87]:
data_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [88]:
pd.DataFrame(data_train['data']).count()

0    2034
dtype: int64

In [89]:
pd.DataFrame(data_train['data']).sum()

0    Hi,\n\nI've noticed that if you only save a mo...
dtype: object

In [90]:
pd.DataFrame(data_train['target']).head()

Unnamed: 0,0
0,1
1,3
2,2
3,0
4,2


In [91]:
pd.DataFrame(data_train['data']).shape

(2034, 1)

In [92]:
pd.DataFrame(data_train['data']).columns

RangeIndex(start=0, stop=1, step=1)

### 3. Bag of Words model

Let's train a model using a simple count vectorizer.

1. Initialize a standard CountVectorizer and fit the training data.
- How big is the feature dictionary?
- Repeat eliminating English stop words.
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer.
- What are the 20 words that are most common in the whole corpus?
- What are the 20 most common words in each of the 4 classes?
- Evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer.
    - You will have to transform the test_set, too. Be careful to use the trained vectorizer, without re-fitting it.
    - Create a confusion matrix.

**BONUS:**
- Try a couple of modifications:
    - restrict max_features
    - change max_df and min_df
    - for each of the above print a confusion matrix and investigate what gets mixed
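
As a rough illustration of the first few steps listed above, the sketch below compares vocabulary sizes with and without English stop words and lists the most frequent words. The three-document corpus and the `top_words` helper are hypothetical; they stand in for `data_train.data` so the example runs instantly. It reads the vocabulary from the fitted `vocabulary_` attribute.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for data_train.data.
docs = [
    "the rocket orbits the moon",
    "the image is rendered in the graphics card",
    "god and the moon",
]

def top_words(vectorizer, texts, n=3):
    """Fit the vectorizer on texts and return the n most frequent (word, count) pairs."""
    counts = vectorizer.fit_transform(texts)           # sparse document-term matrix
    totals = np.asarray(counts.sum(axis=0)).ravel()    # total count per column
    index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}
    order = np.argsort(-totals)[:n]                    # columns with the largest totals
    return [(index_to_word[i], int(totals[i])) for i in order]

plain = CountVectorizer()
no_stop = CountVectorizer(stop_words="english")

top_plain = top_words(plain, docs)
top_no_stop = top_words(no_stop, docs)

print(len(plain.vocabulary_), top_plain)
print(len(no_stop.vocabulary_), top_no_stop)
```

Summing the sparse matrix directly avoids building a dense `DataFrame` with one column per vocabulary entry, which matters once the vocabulary reaches tens of thousands of words.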

In [93]:
np.unique(data_train.target)

array([0, 1, 2, 3])

In [94]:
# A: Initialize a standard CountVectorizer and fit the training data.
from sklearn.feature_extraction.text import CountVectorizer
c_v = CountVectorizer(token_pattern=r'\w+')  # tokens of one or more word characters
c_v.fit(data_train.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='\\w+', tokenizer=None,
        vocabulary=None)

In [95]:
len(c_v.get_feature_names())

26919

In [96]:
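# Repeat, eliminating English stop words.
# NOTE: this cell fits on data_test, so the vocabulary size below is not
# directly comparable with the 26919 features learned from data_train above;
# fit on data_train.data for a like-for-like comparison.
from sklearn.feature_extraction.text import CountVectorizer
c_v = CountVectorizer(token_pattern=r'\w+', stop_words='english')  # tokens of one or more word characters
c_v.fit(data_test.data)
len(c_v.get_feature_names()),c_v.get_feature_names()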

(21275,
 ['0',
  '00',
  '000',
  '0000',
  '00000',
  '000005102000',
  '00041555',
  '0004244402',
  '00043819',
  '00044808',
  '00044939',
  '0004988',
  '0005169',
  '0007',
  '0008512',
  '000th',
  '000usd',
  '0011265',
  '00196',
  '0028',
  '003',
  '0038',
  '004119',
  '0049',
  '006',
  '0094',
  '00index',
  '00pm',
  '01',
  '010',
  '01242',
  '01272',
  '01479',
  '015',
  '01752',
  '01821',
  '01852',
  '01854',
  '01890',
  '0195',
  '0199',
  '01readme',
  '02',
  '0200',
  '02139',
  '02154',
  '0223',
  '023220',
  '0235',
  '0238',
  '024',
  '025511',
  '028',
  '02860',
  '029',
  '03',
  '030',
  '031',
  '033',
  '0330',
  '0388',
  '039',
  '04',
  '040',
  '0410',
  '0430',
  '0451',
  '05',
  '0500',
  '0511',
  '053530',
  '053534',
  '05402',
  '05446',
  '058',
  '06',
  '0639',
  '0649',
  '0674',
  '07',
  '070156',
  '071',
  '07410',
  '074502',
  '07653',
  '0773',
  '08',
  '08240',
  '086',
  '0891',
  '09',
  '0900',
  '0903',
  '0908',
  '0920

In [97]:
trans=c_v.transform(data_train.data)

In [128]:
# What are the 20 words that are most common in the whole corpus?
df=pd.DataFrame(trans.toarray(),
                columns=c_v.get_feature_names()).sum()
df.nlargest(20)

s         2276
t         2031
space     1061
1          937
people     793
god        745
don        730
2          724
like       682
just       675
does       600
m          598
know       592
think      584
3          549
time       546
image      534
edu        501
use        468
good       449
dtype: int64

In [129]:
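# What are the 20 most common words in each of the 4 classes?
df1 = pd.DataFrame(trans.toarray(), columns=c_v.get_feature_names())
df1['target'] = data_train.target
# Per-class word totals; class_counts.loc[k].nlargest(20) gives the top words for class k
class_counts = df1.groupby('target').sum()
df1.head(5)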

Unnamed: 0,0,00,000,0000,00000,000005102000,00041555,0004244402,00043819,00044808,...,zubrin,zug,zuni,zur,zurich,zvezdny,zvi,zwork,zyda,zyxel
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
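
One way to answer the per-class question without building a dense table is to sum the rows of the sparse count matrix that belong to each class. The sketch below assumes a toy corpus with hypothetical documents, labels and category names; the same loop should apply to the matrix from `transform(data_train.data)` together with `data_train.target`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus and labels standing in for data_train.data / data_train.target.
docs = [
    "rocket launch orbit rocket",
    "orbit moon rocket",
    "pixel image render image",
    "image texture pixel",
]
labels = np.array([0, 0, 1, 1])
names = ["sci.space", "comp.graphics"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)  # stays sparse
# Words ordered by their column index in X.
words = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

top_per_class = {}
for k, name in enumerate(names):
    rows = np.flatnonzero(labels == k)                  # row indices of class k
    totals = np.asarray(X[rows].sum(axis=0)).ravel()    # word totals within class k
    order = np.argsort(-totals)[:2]
    top_per_class[name] = [(words[i], int(totals[i])) for i in order]

print(top_per_class)
```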


In [None]:
#from nltk.tokenize import sent_tokenize
#tokenized_text=sent_tokenize(data_train)
#print(tokenized_text)
#filtered_sent=[]
#for w in tokenized_sent:
  #  if w not in stopwords:
    #    filtered_sent.append(w)
#print("Tokenized Sentence:",tokenized_sent)
#print("Filterd Sentence:",filtered_sent)

In [None]:
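# Is the dictionary smaller? Yes: 21275 features versus 26919 without stop-word removal.
# Note that the 21275 comes from a vectorizer fit on data_test, so the two counts are not strictly comparable.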

In [None]:

#def get_top_n_words(data_train, n=None):
   # vec = CountVectorizer().fit(data_train)
   # bag_of_words = vec.transform(data_train)
   # sum_words = bag_of_words.sum(axis=0) 
   # words_freq = [(word, sum_words[0, idx]) for word, 
   #               idx in vec.vocabulary_.items()]
   # words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
   # return words_freq[:n]
#print(get_top_n_words(data_train, n=None))

In [132]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import os

In [185]:
#Evaluate the performance of a Logistic 
#Regression on the features extracted by the CountVectorizer.

from sklearn.linear_model import LogisticRegression

#count_vect = CountVectorizer()
#X_train_counts = count_vect.fit_transform(twenty_train.data)
#X_train_counts.shape

X_c_v = CountVectorizer(token_pattern=r'\w+', stop_words='english')  # tokens of one or more word characters

X_train = X_c_v.fit_transform(data_train.data).toarray()
y_train = data_train.target
# Transform (do not re-fit) the test set with the vectorizer trained above
X_test = X_c_v.transform(data_test.data).toarray()
y_test = data_test.target

# fit with l1 (the liblinear solver supports the l1 penalty)
model_l1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
model_l1.fit(X_train, y_train)

# fit with l2
model_l2 = LogisticRegression(penalty='l2', C=1.0)
model_l2.fit(X_train, y_train)




LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [186]:
print(model_l1.score(X_test,y_test))
print(model_l2.score(X_test,y_test))

0.7169253510716925
0.7405764966740577
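
A `Pipeline` is one way to make sure the vectorizer is never re-fit on the test set, and it lets the classifier work on the sparse matrix directly instead of a dense array. The sketch below is a minimal illustration on a hypothetical toy corpus; the documents and labels are made up and stand in for `data_train` and `data_test`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for data_train / data_test.
train_docs = ["rocket orbit launch", "moon orbit rocket",
              "pixel image render", "image texture pixel"]
train_y = [0, 0, 1, 1]
test_docs = ["rocket to the moon", "render the texture"]

# fit() learns the vocabulary from the training text only; predict() then
# calls transform() (never fit) on whatever text it is given.
clf = make_pipeline(
    CountVectorizer(stop_words="english"),
    LogisticRegression(solver="liblinear"),
)
clf.fit(train_docs, train_y)
pred = clf.predict(test_docs)
vocab_size = len(clf.named_steps["countvectorizer"].vocabulary_)
print(pred, vocab_size)
```

With the real data, `clf.score(data_test.data, data_test.target)` would report test accuracy in a single call.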


In [187]:
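# Create a confusion matrix.
# NOTE: the model is fit and evaluated on the training data here, so the matrix
# below shows training fit, not held-out performance; predict on X_test and
# compare with y_test to evaluate generalization.
from sklearn.metrics import confusion_matrix
import pandas as pd

log = LogisticRegression()
log.fit(X_train, y_train)
pred = log.predict(X_train)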



In [188]:
# check intercept
log.intercept_

array([-1.20945851, -1.25497462, -1.09730196, -1.25611795])

In [189]:
# check coef
log.coef_

array([[ 2.33582262e-01, -1.41333719e-01,  1.44684147e-01, ...,
        -3.62708205e-03, -1.88808748e-02, -4.13153148e-02],
       [ 8.75203978e-02,  1.31688722e-01, -4.31118979e-02, ...,
        -1.69298166e-06,  2.48965745e-02,  6.82363603e-02],
       [-6.88112450e-02, -3.21635618e-02, -1.00764457e-01, ...,
        -2.98474840e-04, -2.18631397e-02, -2.57454564e-02],
       [-6.14435687e-01, -1.22770156e-01,  2.09177101e-03, ...,
         6.35972737e-04, -2.70179487e-03, -5.24503465e-02]])

In [196]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

In [201]:
from sklearn.metrics import confusion_matrix

In [228]:
print(confusion_matrix(y_train, pred))

[[467   0  13   0]
 [  0 568  16   0]
 [  0   0 593   0]
 [  1   0  16 360]]
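
The rows of this matrix sum to 2034, the size of the training subset, so it describes training fit. For reading such a matrix it can help to label its rows and columns with the category names. The sketch below uses hypothetical labels and predictions for a three-class problem; in `confusion_matrix`, rows are true classes and columns are predicted classes.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions for a 3-class problem.
names = ["alt.atheism", "comp.graphics", "sci.space"]
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
cm_df = pd.DataFrame(
    cm,
    index=[f"true {n}" for n in names],
    columns=[f"pred {n}" for n in names],
)
accuracy = cm.trace() / cm.sum()  # correct predictions lie on the diagonal
print(cm_df)
print(accuracy)
```

Off-diagonal cells show which categories get mixed up, which is what the bonus questions in this section ask you to investigate.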


### 4. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

1. Initialize a HashingVectorizer and repeat the test with no restriction on the number of features.
- Does the score improve with respect to the count vectorizer? 
- Print out the number of features for this model.
- Initialize a TF-IDF Vectorizer and repeat the analysis above.

**BONUS:**
- Change the parameters of either (or both!) models to improve your score.

In [135]:
# 1- initialize vectorizer 
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_v = TfidfVectorizer(stop_words='english')

In [136]:
stopwords =  tfidf_v.get_stop_words()
stopwords

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

In [137]:
# 2- fit vectorizer
tfidf_v.fit_transform(data_train.data)

<2034x26576 sparse matrix of type '<class 'numpy.float64'>'
	with 133634 stored elements in Compressed Sparse Row format>

In [138]:
# 3- transform (calculate weights)
document_matrix = tfidf_v.transform(data_train.data)
document_matrix 

<2034x26576 sparse matrix of type '<class 'numpy.float64'>'
	with 133634 stored elements in Compressed Sparse Row format>

In [139]:
# create dataframe
df_ = pd.DataFrame(document_matrix.toarray(), 
                   columns=tfidf_v.get_feature_names())

In [140]:
# sort columns
df_.sort_values(0, ascending=False , axis=1).head(4)

Unnamed: 0,file,prj,3ds,texture,orientation,save,format,cel,rych,restarting,...,earths,earthquake,earthly,earthings,earthinfo,earthers,earth,ears,earnshaw,zyxel
0,0.446625,0.415287,0.361119,0.225817,0.22287,0.182422,0.16718,0.145893,0.145893,0.145893,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [141]:
# Transform the training data using the already-fitted vectorizer (no re-fit)
tfidf_v.transform(data_train.data)

<2034x26576 sparse matrix of type '<class 'numpy.float64'>'
	with 133634 stored elements in Compressed Sparse Row format>

In [142]:
# A:
# How big is the feature dictionary? 
from sklearn.feature_extraction.text import HashingVectorizer

In [143]:
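# 1- initialize vectorizer (n_features=100 restricts the hash space; the default is 2**20)
h_c = HashingVectorizer(n_features=100)
# 2- fit vectorizer (a no-op: HashingVectorizer is stateless and learns no vocabulary)
h_c.fit(data_train.data)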

HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
         decode_error='strict', dtype=<class 'numpy.float64'>,
         encoding='utf-8', input='content', lowercase=True, n_features=100,
         ngram_range=(1, 1), non_negative=False, norm='l2',
         preprocessor=None, stop_words=None, strip_accents=None,
         token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None)

In [144]:
h_c.fit_transform(data_train.data)

<2034x100 sparse matrix of type '<class 'numpy.float64'>'
	with 93910 stored elements in Compressed Sparse Row format>
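
The task in this section asks for a `HashingVectorizer` with no restriction on the number of features. The sketch below, on a hypothetical two-document corpus, shows what that looks like next to a `TfidfVectorizer`: the hasher needs no fit and always produces `2**20` columns by default, while TF-IDF learns one column per vocabulary word.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

docs = ["rocket orbit launch", "pixel image render image"]  # hypothetical toy corpus

# HashingVectorizer is stateless: no fit is needed and no vocabulary is stored.
hasher = HashingVectorizer()           # default n_features = 2**20
X_hash = hasher.transform(docs)

tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(docs)    # learns the vocabulary and IDF weights

# TfidfVectorizer L2-normalizes each row by default, so every row norm is 1.
tfidf_norms = np.sqrt(np.asarray(X_tfidf.multiply(X_tfidf).sum(axis=1)).ravel())
print(X_hash.shape, X_tfidf.shape, tfidf_norms)
```

Because hashing keeps no vocabulary, its columns cannot be mapped back to words; that is the trade-off for not having to store a dictionary.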

### 5. Classifier comparison

Of all the vectorizers tested above, choose one that has a reasonable performance with a manageable number of features and compare the performance of these models:

- KNN
- Logistic Regression
- Decision Trees
- Support Vector Machine
- Random Forest
- Extra Trees

In order to speed up the calculation it's better to vectorize the data only once and then compare the models.
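
A comparison along these lines could be sketched as follows: vectorize once, then loop over the classifiers. The corpus, labels and hyperparameters below are hypothetical and chosen only so the example runs quickly; `LinearSVC` is used here as one possible support vector machine, which is an assumption rather than a requirement of the lab.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for data_train / data_test.
train_docs = ["rocket orbit launch", "moon orbit rocket", "launch moon shuttle",
              "pixel image render", "image texture pixel", "render texture shader"]
train_y = [0, 0, 0, 1, 1, 1]
test_docs = ["shuttle orbit", "shader pixel"]
test_y = [0, 1]

vec = TfidfVectorizer(stop_words="english")
X_tr = vec.fit_transform(train_docs)  # vectorize once ...
X_te = vec.transform(test_docs)       # ... and reuse the result for every model

models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Linear SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, train_y)
    scores[name] = model.score(X_te, test_y)  # mean accuracy for classifiers
    print(f"{name:20s} {scores[name]:.2f}")
```

All six estimators accept the sparse matrix directly, so no `.toarray()` call is needed.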

In [227]:
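# NOTE: RandomForestRegressor treats the class labels 0-3 as numeric targets,
# so .score() below reports R^2 rather than accuracy. RandomForestClassifier
# is the appropriate estimator for this classification task.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
forest = RandomForestRegressor(n_estimators = 500, max_features ='sqrt')
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_pred)**.5)
# Score our model
print('train score ', forest.score(X_train, y_train))
print('test score ', forest.score(X_test, y_test))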

0.9290131486774462
train score  0.8813891364605494
test score  0.20613125674180952


In [216]:
from sklearn.neighbors import KNeighborsClassifier
# Initialize our model
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=2, p=2,
           weights='uniform')

In [217]:
# Score our model
print('train score ', knn.score(X_train, y_train))
print('test score ', knn.score(X_test, y_test))

train score  0.6809242871189773
test score  0.3909830007390983


In [218]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn releases

In [226]:
# build model (fit on the training split; X and y were not defined in this notebook)
classifier = DecisionTreeClassifier(criterion='gini', max_depth=5)
classifier.fit(X_train, y_train)
# Score our model
print('train score ', classifier.score(X_train, y_train))
print('test score ', classifier.score(X_test, y_test))

train score  0.49065880039331367
test score  0.4515890613451589


In [229]:
# build model (fit on the training split; a very large C means almost no regularization)
from sklearn.svm import SVC
model = SVC(kernel='linear', C=1E10)
model.fit(X_train, y_train)
# Score our model
print('train score ', model.score(X_train, y_train))
print('test score ', model.score(X_test, y_test))

train score  0.9788593903638152
test score  0.6164079822616408


In [230]:
model.support_vectors_

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 2., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [234]:
# AdaBoost (note: the task asks for Extra Trees; this cell fits AdaBoost instead)
# Create AdaBoost classifier object
from sklearn.ensemble import AdaBoostClassifier
from sklearn import datasets
from sklearn import metrics
abc = AdaBoostClassifier(n_estimators=50,
                         learning_rate=1)
# Train AdaBoost classifier
model = abc.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = model.predict(X_test)

In [235]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.623059866962306


### Bonus: Other classifiers

Adapt the code from [this example](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#example-text-document-classification-20newsgroups-py) to compare across all the classifiers suggested and to display the final plot.

### Bonus: 

- #### Fit a model to the 20newsgroups dataset with all classes

- #### Choose texts, for example from newspaper articles, and check what is the class label predicted for them. Does the predicted label meet your expectations?