<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing Lab

---

In this lab we will further explore sklearn and NLTK's capabilities for processing text. We will use the 20 Newsgroups dataset, which sklearn provides.

In [414]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
plt.style.use('ggplot')
sns.set(font_scale=1.5)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [415]:
# Getting the Sklearn Dataset
from sklearn.datasets import fetch_20newsgroups

### 1. Use the `fetch_20newsgroups` function to download a training and testing set.

Look up the [function documentation](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html) for how to grab the data.

You should pull these categories:
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Also remove the headers, footers, and quotes using the `remove` keyword argument of the function.

In [416]:
# Download the selected newsgroups with fetch_20newsgroups
# Categories of newsgroup posts we want
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
# Setting training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))
# Setting testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

### 2. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Because this is an sklearn dataset, it comes with pre-split train and test sets (note that we could pass `'train'` and `'test'` as the `subset` argument).

Let's inspect them.

1. What data type is `data_train`?
- What does `data_train` contain? 
- How many data points does `data_train` contain?
- How many data points of each category does `data_train` contain?
- Inspect the first data point, what does it look like?

In [417]:
# A:
#1.What data type is data_train?
type(data_train)
#A Bunch: a dictionary-like object whose keys can also be read as attributes

sklearn.utils.Bunch

In [418]:
#2.What does data_train contain?
data_train.keys()
#['data', 'filenames', 'target_names', 'target', 'DESCR', 'description']

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])

In [419]:
#3.How many data points does data_train contain?
print(len(data_train.data))
print(pd.DataFrame(data_train.data).shape[0])
#2034 data points for data_train.data

2034
2034


In [420]:
#4.How many data points of each category does data_train contain?
# Note: the lines below print the size of each field of the Bunch,
# not the number of posts per category.
print('Data',len(data_train.data))
print('filenames',len(data_train.filenames))
print('target_names',len(data_train.target_names))
print('target',len(data_train.target))
print('DESCR',0 if data_train.DESCR is None else len(data_train.DESCR))
print('description',len(data_train.description))

Data 2034
filenames 2034
target_names 4
target 2034
DESCR 0
description 33
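
The field sizes above do not yet say how the 2034 posts are split across the four categories. A minimal sketch of one way to count them follows; it uses a toy label array in place of `data_train.target`, and the values in `target` are made up for illustration.

```python
import numpy as np

# Toy stand-ins for data_train.target and data_train.target_names (assumed values).
target = np.array([1, 3, 2, 0, 1, 1, 2, 0])
target_names = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

# np.bincount counts how often each integer label occurs.
counts = np.bincount(target, minlength=len(target_names))
per_category = dict(zip(target_names, counts.tolist()))
print(per_category)
```

With the real data, replacing the toy arrays with `data_train.target` and `data_train.target_names` should give the per-category counts.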


In [421]:
#5.Inspect the first data point, what does it look like?
data_train.data[0]

"Hi,\n\nI've noticed that if you only save a model (with all your mapping planes\npositioned carefully) to a .3DS file that when you reload it after restarting\n3DS, they are given a default position and orientation.  But if you save\nto a .PRJ file their positions/orientation are preserved.  Does anyone\nknow why this information is not stored in the .3DS file?  Nothing is\nexplicitly said in the manual about saving texture rules in the .PRJ file. \nI'd like to be able to read the texture rule information, does anyone have \nthe format for the .PRJ file?\n\nIs the .CEL file format available from somewhere?\n\nRych"

### 3. Bag of Words model

Let's train a model using a simple count vectorizer.

1. Initialize a standard CountVectorizer and fit the training data.
- How big is the feature dictionary?
- Repeat eliminating English stop words.
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer.
- What are the 20 words that are most common in the whole corpus?
- What are the 20 most common words in each of the 4 classes?
- Evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer.
    - You will have to transform the test_set, too. Be careful to use the trained vectorizer, without re-fitting it.
    - Create a confusion matrix.

**BONUS:**
- Try a couple of modifications:
    - restrict max_features
    - change max_df and min_df
    - for each of the above print a confusion matrix and investigate what gets mixed
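
As a rough illustration of what the bonus parameters do to the vocabulary, the sketch below fits four vectorizers on a made-up four-document corpus. The corpus and the variable names are assumptions for illustration only; the thresholds are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus, not taken from the newsgroup data.
docs = [
    "the rocket reached orbit",
    "the shuttle reached orbit",
    "the image file format",
    "the image rendering software",
]

full = CountVectorizer().fit(docs)
# min_df=2 keeps only terms that appear in at least two documents.
frequent = CountVectorizer(min_df=2).fit(docs)
# max_df=0.75 drops terms that appear in more than 75% of the documents.
no_common = CountVectorizer(max_df=0.75).fit(docs)
# max_features=3 keeps the three terms with the highest total count.
top3 = CountVectorizer(max_features=3).fit(docs)

for name, vec in [('full', full), ('min_df=2', frequent),
                  ('max_df=0.75', no_common), ('max_features=3', top3)]:
    print(name, len(vec.vocabulary_))
```

Here `the` occurs in every document, so `max_df=0.75` removes it, while `min_df=2` removes every term that occurs in a single document.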

In [422]:
# A:
#1.Initialize a standard CountVectorizer and fit the training data.
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()
cv.fit(data_train.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [423]:
#2.How big is the feature dictionary?
len(cv.vocabulary_)

26879

In [424]:
#3.Repeat eliminating English stop words.
cv=CountVectorizer(stop_words='english')
cv.fit(data_train.data)
len(cv.vocabulary_)

26576

In [425]:
#4.Is the dictionary smaller?
#Yes: 26576 terms instead of 26879, so removing English stop words dropped 303 terms

In [426]:
#5.Transform the training data using the trained vectorizer.
data_mat=cv.transform(data_train.data)


In [427]:
#6.What are the 20 words that are most common in the whole corpus?
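# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
word_counts = np.asarray(data_mat.sum(axis=0)).ravel()  # column sums without densifying the matrix
pd.DataFrame(word_counts,index=cv.get_feature_names_out()).sort_values(0,ascending=False).head(20).T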


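| | space | people | god | don | like | just | does | know | think | time | image | edu | use | good | data | nasa | graphics | jesus | say | way |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1061 | 793 | 745 | 730 | 682 | 675 | 600 | 592 | 584 | 546 | 534 | 501 | 468 | 449 | 444 | 419 | 414 | 411 | 409 | 387 |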


In [428]:
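#7.What are the 20 most common words in each of the 4 classes?
# Caution: str() on a pandas Series returns its truncated printed representation,
# not the full text of the class, so the counts below cover only a handful of posts
# and include artifacts such as 'ni' and 'nit' (from escaped '\n' sequences).
df2=pd.DataFrame(data_train.data)
df2['target']=data_train.target

all_classes=cv.fit_transform((str(df2[df2['target']==0][0]),str(df2[df2['target']==1][0]),
                             str(df2[df2['target']==2][0]),str(df2[df2['target']==3][0])))

# get_feature_names_out() needs scikit-learn >= 1.0
df_4_classes=pd.DataFrame(all_classes.toarray(),columns=cv.get_feature_names_out())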

In [429]:
#class 0
df_4_classes.T[[0]].sort_values(0,ascending=False).head(20).T

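| | ni | atheism | just | agree | read | state | nif | nit | nand | nmost | nno | post | does | nso | mean | point | nthis | think | deleted | object |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 4 | 4 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |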


In [430]:
#class 1
df_4_classes.T[[1]].sort_values(1,ascending=False).head(20).T

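| | ni | hi | looking | program | just | hello | interested | 3d | 68010 | using | package | problem | ve | graphics | got | format | 68070 | file | files | trying |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 8 | 7 | 6 | 4 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |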


In [431]:
#class 2
df_4_classes.T[[2]].sort_values(2,ascending=False).head(20).T

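| | ni | worse | article | did | just | spent | assuming | fuel | space | nhow | nit | test | remember | sq | early | performance | sure | nyou | dc | talk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 4 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |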


In [432]:
#class 3
df_4_classes.T[[3]].sort_values(3,ascending=False).head(20).T

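| | ni | don | nyou | point | want | replied | generally | jose | tell | deleted | said | mormons | think | god | letter | th | good | rick | nthe | quote |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 5 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |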
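
Tokens such as `ni` and `nit` in the per-class tables are a sign that the vectorizer saw the printed representation of a pandas Series rather than the posts themselves. One alternative, sketched here on a toy corpus, is to vectorize the posts once and sum the rows of the document-term matrix for each class. `docs`, `target`, and `cv_demo` are illustrative names and values, not part of the lab.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for data_train.data and data_train.target (assumed values).
docs = [
    "god exists god",
    "image file image format",
    "orbit launch orbit orbit",
    "graphics image",
]
target = np.array([0, 1, 2, 1])

cv_demo = CountVectorizer()
counts = cv_demo.fit_transform(docs)  # sparse document-term matrix
# Terms ordered by column index; works across scikit-learn versions.
terms = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)

top_words = {}
for label in np.unique(target):
    rows = np.flatnonzero(target == label)
    # Sum term counts over the documents of this class only.
    class_totals = np.asarray(counts[rows].sum(axis=0)).ravel()
    order = np.argsort(class_totals)[::-1][:2]
    top_words[int(label)] = [(terms[i], int(class_totals[i])) for i in order]

print(top_words)
```

On the real data, the slice `[:2]` would become `[:20]` to list the 20 most common words per class.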


In [433]:
#8.Evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
lm=LogisticRegression()
X_train=cv.fit_transform(data_train.data)
y_train=data_train.target
X_test=cv.transform(data_test.data)
y_test=data_test.target
lm.fit(X_train,y_train)
print('accuracy:',lm.score(X_test,y_test),'\n')
print(confusion_matrix(y_test,lm.predict(X_test)))
len(cv.vocabulary_)

accuracy: 0.7450110864745011 

[[187  16  46  70]
 [ 13 345  28   3]
 [ 22  23 333  16]
 [ 67  14  27 143]]


26576
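
To see what gets mixed up, it can help to turn the confusion matrix into per-class recall. The sketch below reuses the matrix printed above; the label order is an assumption (the loader is expected to sort `target_names` alphabetically), so check it against `data_train.target_names` before relying on it.

```python
import numpy as np

# Confusion matrix reported above (rows: true class, columns: predicted class).
cm = np.array([
    [187,  16,  46,  70],
    [ 13, 345,  28,   3],
    [ 22,  23, 333,  16],
    [ 67,  14,  27, 143],
])
# Assumed label order; verify with data_train.target_names.
labels = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

recall = cm.diagonal() / cm.sum(axis=1)  # share of each true class predicted correctly
accuracy = cm.trace() / cm.sum()

for name, r in zip(labels, recall):
    print(f'{name:20s} recall: {r:.3f}')
print(f'overall accuracy: {accuracy:.3f}')
```

Under that label order, the largest off-diagonal entries (70 and 67) sit between the first and last classes, which would mean the two religion-related groups are the ones most often confused with each other.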

### 4. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

1. Initialize a HashingVectorizer and repeat the test with no restriction on the number of features.
- Does the score improve with respect to the count vectorizer? 
- Print out the number of features for this model.
- Initialize a TF-IDF Vectorizer and repeat the analysis above.

**BONUS:**
- Change the parameters of either (or both!) models to improve your score.

In [434]:
# A:
#1.Initialize a HashingVectorizer and repeat the test with no restriction on the number of features.
from sklearn.feature_extraction.text import HashingVectorizer,TfidfVectorizer
hasher=HashingVectorizer()  # named hasher to avoid shadowing the built-in hash()
X_train=hasher.fit_transform(data_train.data)
X_test=hasher.transform(data_test.data)
lm.fit(X_train,y_train)
print('accuracy:',lm.score(X_test,y_test),'\n')
print(confusion_matrix(y_test,lm.predict(X_test)))

accuracy: 0.6740576496674058 

[[180  41  54  44]
 [ 15 337  34   3]
 [ 29  47 315   3]
 [100  29  42  80]]


In [435]:
#2.Does the score improve with respect to the count vectorizer?
#No, it is worse: accuracy drops from about 0.745 to about 0.674

In [436]:
#3.Print out the number of features for this model.
hasher.n_features
#2**20 hash buckets by default, far more than the 26576 terms in the count vectorizer's vocabulary

1048576
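
The feature count of a hashing vectorizer is set by its `n_features` parameter rather than learned from the corpus. The sketch below, on two made-up sentences, shows the default width next to a smaller one; the value `2**10` is an arbitrary choice for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Made-up sentences, not taken from the newsgroup data.
docs = ["the rocket reached orbit", "the image file format"]

# Fitting learns nothing here: the hash function alone decides each token's column.
small = HashingVectorizer(n_features=2**10)
X_small = small.fit_transform(docs)
X_default = HashingVectorizer().fit_transform(docs)

print(X_small.shape)                  # 2 rows, 2**10 columns
print(X_default.shape)                # 2 rows, 2**20 columns
print(hasattr(small, 'vocabulary_'))  # no vocabulary is stored
```

A smaller `n_features` saves memory at the cost of more hash collisions, where different tokens share a column.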

In [437]:
#4.Initialize a TF-IDF Vectorizer and repeat the analysis above.
tfidf=TfidfVectorizer()
X_train=tfidf.fit_transform(data_train.data)
X_test=tfidf.transform(data_test.data)
lm.fit(X_train,y_train)
print('accuracy:',lm.score(X_test,y_test),'\n')
print(confusion_matrix(y_test,lm.predict(X_test)))
len(tfidf.vocabulary_)

accuracy: 0.7331855136733185 

[[199  26  55  39]
 [ 11 349  27   2]
 [ 17  39 338   0]
 [ 86  25  34 106]]


26879

### 5. Classifier comparison

Of all the vectorizers tested above, choose one that has reasonable performance with a manageable number of features and compare the performance of these models:

- KNN
- Logistic Regression
- Decision Trees
- Support Vector Machine
- Random Forest
- Extra Trees

To speed up the computation, vectorize the data only once and then compare the models.
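
One way to organize such a comparison, sketched here on a made-up corpus, is to vectorize once and then loop over a dictionary of models. The documents, labels, and the `models` and `scores` names are assumptions for illustration, and only three of the listed classifiers are included to keep the sketch short.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-ins for the newsgroup posts and their labels.
train_docs = ["rocket launch orbit", "shuttle orbit nasa", "nasa rocket moon",
              "image pixel graphics", "render image file", "graphics file format"]
train_y = [0, 0, 0, 1, 1, 1]
test_docs = ["nasa orbit", "image format"]
test_y = [0, 1]

# Vectorize once; every model reuses the same matrices.
vec = CountVectorizer()
Xtr = vec.fit_transform(train_docs)
Xte = vec.transform(test_docs)

models = {
    'KNN': KNeighborsClassifier(n_neighbors=3),
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(Xtr, train_y)
    scores[name] = accuracy_score(test_y, model.predict(Xte))
    print(f'{name:20s} accuracy: {scores[name]:.3f}')
```

Adding another classifier then only requires one more entry in `models`.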

In [449]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier
from sklearn.metrics import accuracy_score

cv=CountVectorizer(stop_words='english')
X_train=cv.fit_transform(data_train.data)
X_test=cv.transform(data_test.data)

In [450]:
knn=KNeighborsClassifier()
knn.fit(X_train,y_train)
knn_accuracy=accuracy_score(y_test,knn.predict(X_test))

In [451]:
lm=LogisticRegression()
lm.fit(X_train,y_train)
lm_accuracy=accuracy_score(y_test,lm.predict(X_test))

In [452]:
dtree=DecisionTreeClassifier()
dtree.fit(X_train,y_train)
dtree_accuracy=accuracy_score(y_test,dtree.predict(X_test))

In [453]:
svm=SVC()
svm.fit(X_train,y_train)
svm_accuracy=accuracy_score(y_test,svm.predict(X_test))

In [454]:
rftree=RandomForestClassifier()
rftree.fit(X_train,y_train)
rftree_accuracy=accuracy_score(y_test,rftree.predict(X_test))

In [455]:
extree=ExtraTreesClassifier()
extree.fit(X_train,y_train)
extree_accuracy=accuracy_score(y_test,extree.predict(X_test))

In [456]:
print('KNN                    accuracy:',knn_accuracy)
print('Logistic Regression    accuracy:',lm_accuracy)
print('Decision Tree          accuracy:',dtree_accuracy)
print('Support Vector Machine accuracy:',svm_accuracy)
print('Random Forest          accuracy:',rftree_accuracy)
print('Extra Trees            accuracy:',extree_accuracy)

KNN                    accuracy: 0.3392461197339246
Logistic Regression    accuracy: 0.7450110864745011
Decision Tree          accuracy: 0.6134515890613451
Support Vector Machine accuracy: 0.30524759793052475
Random Forest          accuracy: 0.6289726533628973
Extra Trees            accuracy: 0.647450110864745


### Bonus: Other classifiers

Adapt the code from [this example](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#example-text-document-classification-20newsgroups-py) to compare across all the classifiers suggested and to display the final plot.

### Bonus

- Fit a model to the 20newsgroups dataset with all classes.

- Choose texts, for example from newspaper articles, and check which class label is predicted for them. Does the predicted label meet your expectations?