<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing Lab

---

In this lab we will further explore sklearn and NLTK's capabilities for processing text. We will use the 20 Newsgroups dataset, which is provided by sklearn.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')
sns.set(font_scale=1.5)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
# Getting the Sklearn Dataset
from sklearn.datasets import fetch_20newsgroups

### 1. Use the `fetch_20newsgroups` function to download a training and testing set.

Look up the [function documentation](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html) for how to grab the data.

You should pull these categories:
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Also remove the headers, footers, and quotes using the `remove` keyword argument of the function.

In [3]:
# Extracting Information from the Data's Dictionary format 
# Categories of emails we want
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
# Setting training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))
# Setting testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

### 2. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Because this is an sklearn dataset, it comes with pre-split train and test sets (note that we could pass `'train'` and `'test'` to the `subset` argument).

Let's inspect them.

1. What data type is `data_train`?
- What does `data_train` contain? 
- How many data points does `data_train` contain?
- How many data points of each category does `data_train` contain?
- Inspect the first data point. What does it look like?

In [4]:
data_train.target

array([1, 3, 2, ..., 1, 0, 1], dtype=int64)

In [5]:
# A:
df_train = pd.DataFrame({'text':data_train.data, 'category':data_train.target})
df_train.head()


Unnamed: 0,text,category
0,"Hi,\n\nI've noticed that if you only save a mo...",1
1,"\n\nSeems to be, barring evidence to the contr...",3
2,\n >In article <1993Apr19.020359.26996@sq.sq.c...,2
3,I have a request for those who would like to s...,0
4,AW&ST had a brief blurb on a Manned Lunar Exp...,2


In [6]:
len(df_train)

2034

In [7]:
df_train.category.value_counts()

2    593
1    584
0    480
3    377
Name: category, dtype: int64
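
The integer labels are easier to read once mapped back to newsgroup names. Below is a minimal sketch, assuming the label order matches `data_train.target_names` (sklearn lists the requested categories alphabetically). The short `target` array is a toy stand-in for `data_train.target`, not the real data.

```python
import numpy as np
import pandas as pd

# Assumed to mirror data_train.target_names (alphabetical order)
target_names = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

# Toy stand-in for data_train.target
target = np.array([1, 3, 2, 0, 2, 1])

# Map each integer label to its newsgroup name, then count per class
labels = pd.Series(target).map(dict(enumerate(target_names)))
counts = labels.value_counts()
print(counts)
```

With the real data, replacing `target` with `data_train.target` should give the same counts as above, keyed by name instead of by integer.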

### 3. Bag of Words model

Let's train a model using a simple count vectorizer.

1. Initialize a standard CountVectorizer and fit the training data.
- How big is the feature dictionary?
- Repeat eliminating English stop words.
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer.
- What are the 20 words that are most common in the whole corpus?
- What are the 20 most common words in each of the 4 classes?
- Evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer.
    - You will have to transform the test_set, too. Be careful to use the trained vectorizer, without re-fitting it.
    - Create a confusion matrix.

**BONUS:**
- Try a couple of modifications:
    - restrict max_features
    - change max_df and min_df
    - for each of the above print a confusion matrix and investigate what gets mixed
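
Before running this on the full corpus, the effect of stop word removal can be checked on a tiny example. This is only a sketch on a made-up three-document corpus (the `docs` list is hypothetical), comparing the vocabulary size with and without `stop_words='english'`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not the newsgroup data
docs = [
    "the rocket is on the launch pad",
    "the image is a jpeg file",
    "people think about the universe",
]

# Fit once with the default settings and once with English stop words removed
cv_all = CountVectorizer().fit(docs)
cv_stop = CountVectorizer(stop_words='english').fit(docs)

n_all = len(cv_all.vocabulary_)
n_stop = len(cv_stop.vocabulary_)
print(f'all words: {n_all}, without stop words: {n_stop}')
```

Words such as "the" disappear from the second vocabulary, so the feature dictionary should come out smaller.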

In [8]:
# A:
from sklearn.feature_extraction.text import CountVectorizer

# 1- initialize vectorizer
c_v = CountVectorizer(token_pattern=r'\w+', stop_words='english')

# 2- fit vectorizer (build Vocabulary)
c_v.fit(df_train.text)
document_matrix = c_v.transform(df_train.text)
document_matrix


<2034x26614 sparse matrix of type '<class 'numpy.int64'>'
	with 138536 stored elements in Compressed Sparse Row format>

In [9]:
print(f'Length of feature dictionary= {len(c_v.get_feature_names())}')

Length of feature dictionary= 26614


In [10]:
# 20 most common words
sum_words = document_matrix.sum(axis=0)
sum_words
words_freq = [(word, sum_words[0, idx]) for word, idx in c_v.vocabulary_.items()]
words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
words_freq[0:20]

[('s', 2276),
 ('t', 2031),
 ('space', 1061),
 ('1', 937),
 ('people', 793),
 ('god', 745),
 ('don', 730),
 ('2', 724),
 ('like', 682),
 ('just', 675),
 ('does', 600),
 ('m', 598),
 ('know', 592),
 ('think', 584),
 ('3', 549),
 ('time', 546),
 ('image', 534),
 ('edu', 501),
 ('use', 468),
 ('good', 449)]
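
The same word totals can be computed overall and per class directly on the sparse matrix, without building a dense DataFrame. A minimal sketch on a hypothetical toy corpus (`docs`, `labels` and the helper `top_words` are illustrative names, not part of the lab):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus with two classes
docs = ["space space launch", "image file image", "space orbit", "image jpeg"]
labels = np.array([0, 1, 0, 1])

cv = CountVectorizer()
X = cv.fit_transform(docs)  # sparse document-term matrix

# Words ordered by their column index in X
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

def top_words(matrix, k=2):
    """Return the k most frequent words in a sparse count matrix."""
    totals = np.asarray(matrix.sum(axis=0)).ravel()
    order = np.argsort(-totals, kind='stable')[:k]
    return [(vocab[i], int(totals[i])) for i in order]

overall = top_words(X)
# Select the rows of each class with a boolean mask
per_class = {int(c): top_words(X[labels == c]) for c in np.unique(labels)}
print(overall)
print(per_class)
```

Keeping the labels in a separate array also avoids mixing a label column into the word counts.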

In [11]:
df = pd.DataFrame(data=document_matrix.toarray(), columns=c_v.get_feature_names())
df['target']= data_train.target
df.head()

Unnamed: 0,0,00,000,0000,00000,000000,000005102000,000062david42,0001,000100255pixel,...,zvi,zwaartepunten,zwak,zwakke,zware,zwarte,zyxel,¹,ú,þ
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
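# Note: .sum() also totals the 'target' label column, so 'target' shows up among the words for the non-zero classes
for target in range(0,4):
    print('target ', target, '\n',df[df.target == target].sum().sort_values(ascending=False).iloc[:20])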

target  0 
 t           701
s           574
god         405
people      330
don         262
think       215
just        209
does        207
atheism     199
say         174
believe     163
like        162
atheists    162
religion    156
jesus       155
know        154
argument    148
time        135
true        131
said        131
dtype: int64
target  1 
 target      584
image       484
s           481
t           413
graphics    410
1           400
2           326
edu         297
0           282
jpeg        267
file        265
x           254
3           227
use         225
data        219
files       217
images      212
software    212
program     199
ftp         189
dtype: int64
target  2 
 target       1186
space         989
s             741
t             460
1             378
nasa          374
2             272
launch        267
3             246
earth         222
like          222
data          216
m             216
orbit         201
time          197
shuttle       192
just      

In [13]:
from sklearn.linear_model import LogisticRegression
logistic_reg = LogisticRegression()
logistic_reg.fit(df.drop('target', axis=1), df.target)
logistic_reg.score(df.drop('target', axis=1), df.target)




0.9773844641101278

In [14]:
df_test = pd.DataFrame(data_test.data,columns=['text'])
df_test['target'] =pd.Series(data_test.target)

In [15]:
# Reuse the vectorizer fitted on the training text; only transform the test text
document_matrix = c_v.transform(df_test.text)

In [16]:
dtest = pd.DataFrame(data=document_matrix.toarray(), columns=c_v.get_feature_names())

In [17]:
logistic_reg.score(dtest.drop('target', axis=1),pd.Series(df_test.target))

0.7405764966740577
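
The exercise also asks for a confusion matrix. A minimal sketch of that step, assuming a hypothetical two-class toy corpus in place of the newsgroup data; the vectorizer is fitted on the training text only and then reused for the test text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Hypothetical toy data: class 0 = space, class 1 = graphics
train_docs = [
    "the shuttle reached orbit", "nasa planned a launch", "the moon orbit mission",
    "save the image as jpeg", "graphics software for image files", "render the image file",
]
train_y = [0, 0, 0, 1, 1, 1]
test_docs = ["a new launch to orbit", "convert the jpeg image"]
test_y = [0, 1]

cv = CountVectorizer(stop_words='english')
X_train = cv.fit_transform(train_docs)  # fit + transform on training text
X_test = cv.transform(test_docs)        # transform only, no re-fitting

clf = LogisticRegression(max_iter=1000).fit(X_train, train_y)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(test_y, clf.predict(X_test), labels=[0, 1])
print(cm)
```

On the real data, the off-diagonal cells show which newsgroups get mixed up with each other.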

### 4. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

1. Initialize a HashingVectorizer and repeat the test with no restriction on the number of features.
- Does the score improve with respect to the count vectorizer? 
- Print out the number of features for this model.
- Initialize a TF-IDF Vectorizer and repeat the analysis above.

**BONUS:**
- Change the parameters of either (or both!) models to improve your score.

In [18]:
len(df_train.text)

2034

In [19]:
# A:
from sklearn.feature_extraction.text import HashingVectorizer

# Initialize a HashingVectorizer
h_v = HashingVectorizer()
h_v.fit(df_train.text)




HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
                  decode_error='strict', dtype=<class 'numpy.float64'>,
                  encoding='utf-8', input='content', lowercase=True,
                  n_features=1048576, ngram_range=(1, 1), norm='l2',
                  preprocessor=None, stop_words=None, strip_accents=None,
                  token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None)

In [20]:
document_matrix = h_v.transform(df_train.text)

In [21]:
document_matrix

<2034x1048576 sparse matrix of type '<class 'numpy.float64'>'
	with 196673 stored elements in Compressed Sparse Row format>

In [22]:
df_hash_train = pd.DataFrame(data=document_matrix.toarray())


In [23]:
# HashingVectorizer is stateless, so no fitting is needed; just transform the test text
document_matrix = h_v.transform(df_test.text)


In [26]:
# Couldn't run this because the memory limit was exceeded: the dense matrix has 1,048,576 columns
df_hash_test = pd.DataFrame(data=document_matrix.toarray())

MemoryError: 

In [28]:
logistic_reg = LogisticRegression()
logistic_reg.fit(df_hash_train,df_train.category)
logistic_reg.score(df_hash_train,df_train.category)
# The test labels are stored in df_test.target (df_test has no 'category' column)
logistic_reg.score(df_hash_test,df_test.target)



NameError: name 'df_has_test' is not defined

In [29]:
print(len(df_hash_train.columns))
print(len(df_hash_test.columns))

1048576


NameError: name 'df_hash_test' is not defined
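
One way around the memory problem is to skip `.toarray()` and pass the sparse matrices straight to the classifier, which `LogisticRegression` accepts. A minimal sketch, assuming a hypothetical toy corpus in place of the newsgroup data:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: class 0 = space, class 1 = graphics
train_docs = [
    "the shuttle reached orbit", "nasa planned a launch",
    "save the image as jpeg", "graphics software for image files",
]
train_y = [0, 0, 1, 1]
test_docs = ["launch into orbit", "jpeg image software"]
test_y = [0, 1]

# Stateless vectorizer: transform() works without fit()
hv = HashingVectorizer()
X_train = hv.transform(train_docs)  # stays sparse
X_test = hv.transform(test_docs)

clf = LogisticRegression(max_iter=1000).fit(X_train, train_y)
acc = clf.score(X_test, test_y)

print(X_train.shape[1])  # number of hashed features
print(acc)
```

Only the non-zero entries are stored, so the default 2**20 columns are not a problem in this form.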

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_v = TfidfVectorizer()
# Fit on the training text only
tfidf_v.fit(df_train.text)
document_matrix = tfidf_v.transform(df_train.text)
df_tf = pd.DataFrame(document_matrix.toarray(), columns=tfidf_v.get_feature_names())

# Do not re-fit on the test text: that would build a different vocabulary
document_matrix = tfidf_v.transform(df_test.text)
tf_test = pd.DataFrame(document_matrix.toarray(), columns=tfidf_v.get_feature_names())
# The test labels stay in df_test.target, separate from the features

In [36]:
logistic = LogisticRegression()
logistic.fit(df_tf,df_train.category)
logistic.score(df_tf,df_train.category)
print('dataframe.category'+ str(df_train.category.shape) + ' df_tf '+ str(df_tf.shape))
# Compare the TF-IDF test features with the test labels (stored in df_test.target)
print('tf_test'+ str(tf_test.shape) +' df_test.target '+ str(df_test.target.shape))



dataframe.category(2034,) df_tf (2034, 26879)

In [37]:
# Score the TF-IDF model on the TF-IDF test features
print(logistic.score(tf_test,df_test.target))
print(len(df_tf.columns))
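
Why the test text must only be transformed, never re-fitted, can be seen on a small example. This is a sketch with a hypothetical toy corpus: a second vectorizer fitted on the test text learns a different vocabulary, so its columns would not line up with the features the model was trained on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data
train_docs = ["the shuttle reached orbit", "save the image as jpeg"]
test_docs = ["a rocket launch", "jpeg image viewer"]

# Correct: fit on train, transform both sets with the same vocabulary
tfidf = TfidfVectorizer().fit(train_docs)
X_train = tfidf.transform(train_docs)
X_test = tfidf.transform(test_docs)

# Incorrect: a vectorizer fitted on the test text has its own vocabulary
refit = TfidfVectorizer().fit(test_docs)
X_test_refit = refit.transform(test_docs)

print(X_train.shape, X_test.shape, X_test_refit.shape)
print(sorted(set(refit.vocabulary_) - set(tfidf.vocabulary_)))
```

Words that never appeared in the training text are simply ignored by `tfidf.transform`, which is the behaviour a trained model expects.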

### 5. Classifier comparison

Of all the vectorizers tested above, choose one that has a reasonable performance with a manageable number of features and compare the performance of these models:

- KNN
- Logistic Regression
- Decision Trees
- Support Vector Machine
- Random Forest
- Extra Trees

In order to speed up the calculation it's better to vectorize the data only once and then compare the models.

In [38]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import svm 

In [None]:
# KNN: grid search over the number of neighbours (the estimator must be instantiated)
param_grid = {'n_neighbors': [3, 4, 5, 6, 7]}
KG = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid)
# Baseline KNN with default settings
knn = KNeighborsClassifier()
knn.fit(df_hash_train, df_train.category)
knn.score(df_hash_train, df_train.category)
# Decision tree: grid search over the maximum depth
param_grid = {'max_depth': [2, 3, 4, 5, 6]}
dt = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
dt.fit(df_hash_train, df_train.category)
dt.score(df_hash_train, df_train.category)

In [None]:
rf = RandomForestClassifier(n_estimators = 500 , max_depth=5)
rf.fit(df_hash_train,df_train.category)
rf.score(df_hash_train,df_train.category)
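
The comparison can be organised as a loop over models that share one vectorized matrix, so the text is vectorized only once. Below is a minimal sketch on a hypothetical toy corpus; the choice of TF-IDF, `LinearSVC` for the support vector machine, and all hyperparameters are assumptions for illustration, not tuned values.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: class 0 = space, class 1 = graphics
train_docs = [
    "the shuttle reached orbit", "nasa planned a launch",
    "the moon orbit mission", "a rocket launch to the moon",
    "save the image as jpeg", "graphics software for image files",
    "render the image file", "a jpeg viewer program",
]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
test_docs = ["rocket launch into orbit", "jpeg image software"]
test_y = [0, 1]

# Vectorize once, then reuse the sparse matrices for every model
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_docs)
X_test = tfidf.transform(test_docs)

models = {
    'KNN': KNeighborsClassifier(n_neighbors=3),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Linear SVM': LinearSVC(),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
    'Extra Trees': ExtraTreesClassifier(n_estimators=50, random_state=42),
}

# Fit each model on the training matrix and score it on the test matrix
scores = {name: model.fit(X_train, train_y).score(X_test, test_y)
          for name, model in models.items()}

for name, score in scores.items():
    print(f'{name}: {score:.2f}')
```

Scoring on held-out test data, rather than on the training data, is what makes the models comparable.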

### Bonus: Other classifiers

Adapt the code from [this example](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#example-text-document-classification-20newsgroups-py) to compare all the suggested classifiers and to display the final plot.

### Bonus: 

- #### Fit a model to the 20newsgroups dataset with all classes

- #### Choose texts, for example from newspaper articles, and check which class label is predicted for them. Does the predicted label meet your expectations?
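
For the last bonus, a pipeline keeps the vectorizer and the classifier together so that raw strings can be passed to `predict`. A minimal sketch, assuming a hypothetical two-class toy corpus and made-up `new_texts`; with the real data, `target_names` would come from `data_train.target_names`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data and class names
target_names = ['sci.space', 'comp.graphics']
train_docs = [
    "the shuttle reached orbit", "nasa planned a launch", "the moon orbit mission",
    "save the image as jpeg", "graphics software for image files", "render the image file",
]
train_y = [0, 0, 0, 1, 1, 1]

# The pipeline fits the vectorizer and the classifier in one call
pipe = make_pipeline(TfidfVectorizer(stop_words='english'),
                     LogisticRegression(max_iter=1000))
pipe.fit(train_docs, train_y)

# Made-up texts standing in for newspaper excerpts
new_texts = ["The agency delayed the launch of its moon mission.",
             "The update improves how the software renders image files."]

pred = pipe.predict(new_texts)
proba = pipe.predict_proba(new_texts)

for text, label, p in zip(new_texts, np.array(target_names)[pred], proba):
    print(f'{label} ({p.max():.2f}): {text}')
```

Looking at the predicted probabilities alongside the label helps judge how confident the model is about text from outside the newsgroups.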