## Homework Week 10

1. Explain the idea of the bag-of-words model.

2. What are two methods for handling frequently occurring words that carry little meaning?

3. Classify the documents in fetch_20newsgroups.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

news = fetch_20newsgroups(categories=categories, shuffle=True, random_state=1)

X_train, X_test, y_train, y_test = train_test_split(news.data, news.target, test_size = 0.5, random_state=1)
```




**Problem 1: Explain the idea of the bag-of-words model.**

The bag-of-words model can be summarized in two steps:

First, create a vocabulary of unique tokens, such as words, from the entire set of documents. This is a list of every distinct word that appears anywhere in the collection.

Then, create a feature vector for each document that counts how often each vocabulary word occurs in that document.

Because each document contains only a small subset of the vocabulary, the feature vectors consist mostly of zeros; that is, they are sparse. Word order is discarded: only the counts are kept.
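The two steps above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: the three toy sentences and the names `docs`, `vocabulary`, and `to_vector` are assumptions made up for this example, and tokenization is a simple whitespace split.

```python
from collections import Counter

# Toy corpus (hypothetical example documents)
docs = [
    "the sun is shining",
    "the weather is sweet",
    "the sun is shining and the weather is sweet",
]

# Step 1: build the vocabulary, mapping each unique word to a column index
unique_words = sorted({word for doc in docs for word in doc.split()})
vocabulary = {word: idx for idx, word in enumerate(unique_words)}

# Step 2: turn each document into a vector of word counts
def to_vector(doc):
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocabulary]

bag = [to_vector(doc) for doc in docs]

print(vocabulary)
for row in bag:
    print(row)
```

With only seven vocabulary words the vectors are still fairly dense; on a real corpus with tens of thousands of words, almost every entry would be zero.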

**Problem 2: What are two methods for handling frequently occurring words that carry little meaning?**

1. **term frequency-inverse document frequency (tf-idf)** - tf-idf downweights words that occur frequently across documents in the feature vectors. It is the product of the term frequency (how often a word occurs in a document) and the inverse document frequency (which shrinks as the word appears in more documents). Scikit-learn provides a transformer called TfidfTransformer that takes the raw term frequencies produced by CountVectorizer and converts them into tf-idf weights.


2. **stop-word removal** - Stop-words are common words that appear in almost any text and carry little or no useful meaning, such as is, the, and has. Because they contribute so little information, they can simply be removed. The Natural Language Toolkit (NLTK) library provides a list of English stop-words (127 in older releases; the count varies by version) that can be used to filter them out of the bag-of-words model.
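Both methods can be tried on a toy corpus. The sketch below is illustrative only: the sentences and variable names are assumptions, and it uses scikit-learn's built-in English stop-word list (`stop_words='english'`) rather than the NLTK list, so that it runs without a download.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (hypothetical example documents)
docs = [
    "the sun is shining",
    "the weather is sweet",
    "the sun is shining and the weather is sweet",
]

# Method 1: tf-idf. Count the words, then reweight the raw counts.
count = CountVectorizer()
tfidf = TfidfTransformer()
weights = tfidf.fit_transform(count.fit_transform(docs))

# Learned inverse document frequency per word: words that occur in
# every document ("the", "is") receive the smallest value.
idf = {word: tfidf.idf_[col] for word, col in count.vocabulary_.items()}
for word in sorted(idf, key=idf.get):
    print(f"{word:10s} idf = {idf[word]:.3f}")

# Method 2: stop-word removal. Common words never enter the vocabulary.
count_no_stop = CountVectorizer(stop_words='english')
count_no_stop.fit(docs)
remaining = sorted(count_no_stop.vocabulary_)
print(remaining)
```

The two methods can also be combined, as the solution to Problem 3 does by passing `stop_words='english'` to TfidfVectorizer.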

**Problem 3: Classify the documents in fetch_20newsgroups.**

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

news = fetch_20newsgroups(categories=categories, shuffle=True, random_state=1)

X_train, X_test, y_train, y_test = train_test_split(news.data, news.target, test_size = 0.5, random_state=1)
```

```python
import re

import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

nltk.download('stopwords')

count = CountVectorizer()
tfidf = TfidfTransformer()
porter = PorterStemmer()
stop = stopwords.words('english')

def preprocessor(text):
    # Strip HTML-style markup
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :-) or ;D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    # Lowercase, replace runs of non-word characters with a space,
    # then append the emoticons with their "noses" removed
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

def tokenizer(text):
    # Split on whitespace
    return text.split()

def tokenizer_porter(text):
    # Split on whitespace and reduce each word to its Porter stem
    return [porter.stem(word) for word in text.split()]
```

```
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
```


```python
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None,
                        stop_words='english'
                        )

param_grid = [{
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2']},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf,
                           param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

gs_lr_tfidf.fit(X_train, y_train)
```

```
Fitting 5 folds for each of 4 candidates, totalling 20 fits

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vect',
                                        TfidfVectorizer(lowercase=False,
                                                        stop_words='english')),
                                       ('clf',
                                        LogisticRegression(random_state=0,
                                                           solver='liblinear'))]),
             n_jobs=-1,
             param_grid=[{'clf__penalty': ['l1', 'l2'],
                          'vect__tokenizer': [<function tokenizer at 0x7fd9bc43d1f0>,
                                              <function tokenizer_porter at 0x7fd9bc43d550>]}],
             scoring='accuracy', verbose=1)
```

```python
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)
```

```
Best parameter set: {'clf__penalty': 'l2', 'vect__tokenizer': <function tokenizer_porter at 0x7fd9bc43d550>} 
CV Accuracy: 0.910
```


```python
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))
```

```
Test Accuracy: 0.919
```


```python
import pandas as pd

predict = gs_lr_tfidf.predict(X_test)

data = {'Documents':X_test, 'Class Predictions':predict, 'Actual Class Values':y_test}

predictions = pd.DataFrame(data)
predictions.head(20)
```

| | Documents | Class Predictions | Actual Class Values |
|---|---|---|---|
| 0 | From: wjhovi01@ulkyvx.louisville.edu\nSubject:... | 3 | 3 |
| 1 | From: zyeh@caspian.usc.edu (zhenghao yeh)\nSub... | 1 | 1 |
| 2 | From: newmme@helios.tn.cornell.edu (Mark E. J.... | 1 | 1 |
| 3 | From: peterbak@microsoft.com (Peter Bako)\nSub... | 1 | 1 |
| 4 | From: euclid@mrcnext.cso.uiuc.edu (Euclid K.)\... | 2 | 2 |
| 5 | From: joachim@kih.no (joachim lous)\nSubject: ... | 1 | 1 |
| 6 | From: boylan@pi.eai.iastate.edu (Terran Boylan... | 1 | 1 |
| 7 | From: stgprao@st.unocal.COM (Richard Ottolini)... | 2 | 2 |
| 8 | From: mayne@ds3.scri.fsu.edu (Bill Mayne)\nSub... | 3 | 3 |
| 9 | From: mrb@cbnewsj.cb.att.com (m..bruncati)\nSu... | 2 | 2 |
