## Intro to NLP walkthrough with gridsearching


In this lab, we'll explore scikit-learn and NLTK's capabilities for processing text even further. We'll use the 20 newsgroups data set, which is provided by scikit-learn.

In [1]:
# Standard data science imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
import regex as re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords


In [3]:
# Getting the scikit-learn data set:
from sklearn.datasets import fetch_20newsgroups
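from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # the `stop_words` module is private in newer releases
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # `sklearn.grid_search` was removed in scikit-learn 0.20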



### 1. Use the `fetch_20newsgroups` function to download a training and testing set.

The "20 Newsgroups" dataset is described [here](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html).

For this lab let's choose 4 categories to analyze.  The full list is given below.


```python
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
```

Note that the reference solution uses the categories below; the code in this walkthrough picks a different set of four:
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Also remove the headers, footers, and quotes using the `remove` keyword argument of the function.

In [4]:
#Extracting Information from the Data's Dictionary format 

categories = ['sci.med', 'comp.graphics', 'sci.crypt', 'talk.politics.misc']  # Fill in whatever categories you want to use!!

# Setting up our training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))
# Setting our testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

y_train = data_train['target']
y_test = data_test['target']
X_train = data_train.data
X_test = data_test.data

**Question:** What does the `shuffle` argument do?  Why are we setting a `random_state`?

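**Answer:** `shuffle=True` randomizes the order of the documents, so that we can more reasonably treat the samples as i.i.d. (independent and identically distributed) instead of being grouped by category or date. Setting `random_state` fixes the seed of that shuffle, which makes the ordering (and therefore our results) reproducible.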
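
To make the role of the seed concrete, here is a minimal sketch using NumPy rather than `fetch_20newsgroups` itself. The `documents` list and the `shuffled` helper are hypothetical stand-ins; the point is only that the same seed yields the same order.

```python
import numpy as np

# Hypothetical stand-ins for newsgroup posts
documents = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]

def shuffled(items, seed):
    """Return a shuffled copy of `items` using a seeded random generator."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(items))  # a random ordering of the indices
    return [items[i] for i in order]

first = shuffled(documents, seed=42)
second = shuffled(documents, seed=42)

print(first)
print(first == second)  # same seed, so the two orderings should match
```

Without a fixed seed, each run could produce a different ordering, and the scores reported later would be harder to reproduce.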

### 2) Inspect the data.

We've downloaded a few `newsgroups` categories and removed their headers, footers, and quotes.

Because this is a scikit-learn data set, it comes with pre-split training and testing sets (note that we selected them with `subset='train'` and `subset='test'`).

Let's inspect them.

1) What data type is `data_train`?
- Is it a list? A dictionary? What else?
- How many data points does it contain?
- Inspect the first data point. What does it look like?
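
As a hint for the first question, the sketch below builds a tiny `Bunch` by hand (the `toy` object and its contents are made up for illustration) to show that it behaves like a dictionary whose keys are also available as attributes.

```python
from sklearn.utils import Bunch

# A hand-built Bunch with made-up contents, mimicking the shape of `data_train`
toy = Bunch(data=["first post", "second post"], target=[0, 1])

print(type(toy))
print(isinstance(toy, dict))     # Bunch is a dict subclass
print(toy["data"] is toy.data)   # key access and attribute access reach the same object
print(len(toy.data))             # number of data points
```

This is why both `data_train['target']` and `data_train.data` work in the cell above.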

In [5]:
print(type(data_train))
print(data_train.keys())
print(data_train.description)
print(len(data_train.data))
data_train.data[0]

<class 'sklearn.utils.Bunch'>
dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])
the 20 newsgroups by date dataset
2238


'\nI understand, believe it or not, and there are any number of kinds of \nconversation and communication I engage in that I wouldn\'t even consider \nusing this scheme for.  On the other hand, I don\'t see "Clipper" as providing \na secure channel--it just prevents casual eavesdropping.  This is part of why \nI am not worried about it per se.  Trying to look at Clipper as a serious \nsecurity tool is simply ludicrous.  It\'s a voice scrambler, nothing more.\n\nThere is still plenty of market for real crypto.\n\n\nThey cost an arm and a leg, though. "Clipper" is obviously aimed at the mass \nmarket.  It certainly won\'t put Cylink out of business.\n\n\nThis is old news.  I can do this now.\n\n\nThere ARE restrictions.  Example: We\'re a networking software vendor with a \nlarge overseas share of our market.  We cannot currently ship PEM, or even \nsimple DES, in our products without case-by-case approval from the Department \nof State.  ITAR presents a material trade barrier to US firm

### 3) Create a bag-of-words model.

Let's train a model using a simple count vectorizer.

1) Initialize a standard CountVectorizer and fit the training data.
- How big is the feature dictionary?
- Eliminate English stop words.
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer.
- Evaluate the performance of a logistic regression on the features extracted by the CountVectorizer.
    - You will have to transform the test set, too. Be careful to use the fitted vectorizer without refitting it.

**Bonus**
- Try a couple of modifications:
    - Restrict the `max_features`.
    - Change the `max_df` and `min_df`.
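
The fit-on-train, transform-on-test pattern can be sketched on a toy corpus. The sentences below are invented for illustration; the key point is that the vocabulary is learned only from the training documents, so unseen test words are silently ignored and both matrices share the same columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy documents
train_docs = ["the chip encrypts the call", "the doctor prescribed rest"]
test_docs = ["the doctor encrypts nothing"]

vectorizer = CountVectorizer()
X_train_toy = vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_test_toy = vectorizer.transform(test_docs)        # reuses it, no refitting

print(sorted(vectorizer.vocabulary_))
print(X_train_toy.shape, X_test_toy.shape)
print("nothing" in vectorizer.vocabulary_)  # a word seen only at test time
```

Calling `fit` again on the test documents would build a different vocabulary, and the columns would no longer line up with what the model was trained on.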

In [6]:
cv = CountVectorizer()
cv.fit(X_train)
X_train_cv = cv.transform(X_train)

In [7]:
# On scikit-learn >= 1.0, use cv.get_feature_names_out() (get_feature_names was removed in 1.2)
len(cv.get_feature_names())

31154

In [8]:
cv = CountVectorizer(stop_words='english')
cv.fit(X_train)
X_train_cv = cv.transform(X_train)
len(cv.get_feature_names())

30848

## Removing English stop words drops about 300 features (31,154 to 30,848)
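
The size of that drop is bounded by scikit-learn's built-in English stop word list, which has a few hundred entries: only words that are both on the list and present in the corpus can be removed. A small sketch on invented sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Invented toy documents
docs = ["the key is in the escrow", "this is a voice scrambler"]

with_stops = CountVectorizer().fit(docs)
without_stops = CountVectorizer(stop_words="english").fit(docs)

# Vocabulary entries that disappear once stop words are filtered out
removed = set(with_stops.vocabulary_) - set(without_stops.vocabulary_)

print(len(ENGLISH_STOP_WORDS))  # size of the built-in list
print(sorted(removed))
```

Everything in `removed` should come from `ENGLISH_STOP_WORDS`, while content words such as "escrow" stay in the vocabulary.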

## Also transforming the testing data with the fitted CountVectorizer

In [9]:
X_test_cv = cv.transform(X_test)

In [10]:
pd.set_option("display.max_columns", 400)

## Setting up a pipeline with gridsearch for a logistic regression 

In [11]:
pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('lr', LogisticRegression(solver='liblinear'))  # liblinear supports both 'l1' and 'l2'
])

parameters = {
    'cv__stop_words' : ['english', None],
    'cv__max_features': [10000, 15000, 20000, 25000, 30000],
    'cv__max_df': [1.0, 0.9, 0.8],
    'lr__penalty' : ['l2', 'l1'],
    'lr__C': [1, 1/10, 1/20]
}

gs = GridSearchCV(pipe, parameters, verbose=2)
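
Before running the search, it can help to count how much work the grid implies. The sketch below uses `ParameterGrid` to expand the same dictionary; keys follow the `<step name>__<parameter>` convention, so `cv__max_df` targets the `'cv'` step of the pipeline. The fold count of 3 is an assumption taken from the fit log in this notebook (newer scikit-learn releases default to 5 folds).

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    'cv__stop_words': ['english', None],
    'cv__max_features': [10000, 15000, 20000, 25000, 30000],
    'cv__max_df': [1.0, 0.9, 0.8],
    'lr__penalty': ['l2', 'l1'],
    'lr__C': [1, 1/10, 1/20],
}

grid = ParameterGrid(parameters)
n_candidates = len(grid)  # 2 * 5 * 3 * 2 * 3 combinations
n_folds = 3               # assumption: matches the 3-fold log in this notebook

print(n_candidates, n_candidates * n_folds)
```

Each extra value in any list multiplies the total, which is why a modest-looking grid can take several minutes.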

In [12]:
gs.fit(X_train, y_train)

Fitting 3 folds for each of 180 candidates, totalling 540 fits
[CV] cv__max_df=1.0, cv__max_features=10000, cv__stop_words=english, lr__C=1, lr__penalty=l2 
[CV]  cv__max_df=1.0, cv__max_features=10000, cv__stop_words=english, lr__C=1, lr__penalty=l2 -   1.4s
...

[Parallel(n_jobs=1)]: Done 540 out of 540 | elapsed:  4.8min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('cv', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_a...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'cv__stop_words': ['english', None], 'cv__max_features': [10000, 15000, 20000, 25000, 30000], 'cv__max_df': [1.0, 0.9, 0.8], 'lr__penalty': ['l2', 'l1'], 'lr__C': [1, 0.1, 0.05]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=2)

In [13]:
print(gs.score(X_train, y_train))
print(gs.score(X_test, y_test))
print(gs.best_estimator_)
print(gs.best_params_)
print(gs.best_score_)

0.966934763181412
0.7994634473507712
Pipeline(memory=None,
     steps=[('cv', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=20000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        s...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
{'cv__max_df': 1.0, 'cv__max_features': 20000, 'cv__stop_words': 'english', 'lr__C': 0.1, 'lr__penalty': 'l2'}
0.8516532618409294
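
The three numbers measure different things: `gs.score(X_train, y_train)` (0.967) is computed on data the refit model has already seen, `best_score_` (0.852) is the mean score on held-out folds of the training data, and `gs.score(X_test, y_test)` (0.799) is on the untouched test set. To see the per-candidate cross-validation scores, `cv_results_` can be loaded into a DataFrame. The following is a minimal sketch on an invented toy corpus with a deliberately tiny grid, not the search above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy corpus: label 0 = crypto, label 1 = medicine
docs = ["encrypt the key", "secret key cipher", "cipher text encrypt", "decrypt secret text",
        "doctor prescribed pills", "patient took pills", "doctor saw patient", "nurse gave medicine"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('lr', LogisticRegression(solver='liblinear')),
])
toy_gs = GridSearchCV(toy_pipe, {'lr__C': [1, 0.1]}, cv=2)
toy_gs.fit(docs, labels)

# One row per candidate, with its mean held-out score and rank
results = pd.DataFrame(toy_gs.cv_results_)
print(results[['param_lr__C', 'mean_test_score', 'rank_test_score']])
```

Sorting this table by `rank_test_score` shows how close the runner-up candidates were to the winner.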


### 4) Test Out Hashing and TF-IDF.

Let's see if hashing or TF-IDF improves our accuracy.

1) Initialize a HashingVectorizer and repeat the test with no restriction on the number of features.
- Does the score improve with respect to the CountVectorizer?
- Print out the number of features for this model.
- Initialize a TF-IDF vectorizer and repeat the analysis above.
- Print out the number of features for this model.

**Bonus**
- Change the parameters of either (or both) models to improve your score.

## Hashing vectorizer pipeline and gridsearch on logistic regression

In [14]:
pipe_hash = Pipeline([
    ('hv', HashingVectorizer()),
    ('lr', LogisticRegression(solver='liblinear'))  # liblinear supports both 'l1' and 'l2'
])

parameters_hash = {
    'hv__stop_words' : ['english', None],
  
    'lr__penalty' : ['l2', 'l1'],
    'lr__C': [1, 1/10, 1/20]
}

In [15]:
gs_hash = GridSearchCV(pipe_hash, parameters_hash)
gs_hash.fit(X_train,y_train)
print(gs_hash.score(X_train, y_train))
print(gs_hash.score(X_test, y_test))
print(gs_hash.best_estimator_)
print(gs_hash.best_params_)
print(gs_hash.best_score_)

0.953083109919571
0.8148893360160966
Pipeline(memory=None,
     steps=[('hv', HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
         decode_error='strict', dtype=<class 'numpy.float64'>,
         encoding='utf-8', input='content', lowercase=True,
         n_features=1048576, ngram_range=(1, 1), non_negative=False,
         norm='l2', prep...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
{'hv__stop_words': 'english', 'lr__C': 1, 'lr__penalty': 'l2'}
0.8413762287756926


## HashingVectorizer does about the same as CountVectorizer: test accuracy 0.815 vs. 0.799, cross-validated score 0.841 vs. 0.852
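
On the "number of features" question: a `HashingVectorizer` has no learned vocabulary to count. It maps each token to one of `n_features` columns with a hash function, so the width is fixed up front (1,048,576 by default, as shown in the estimator printout above). A small sketch on invented sentences:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Invented toy documents
docs = ["clipper is a voice scrambler", "real crypto costs more"]

hv = HashingVectorizer()
X_hashed = hv.transform(docs)  # stateless: no fit step is required

print(X_hashed.shape)              # the column count comes from n_features
print(hasattr(hv, "vocabulary_"))  # no vocabulary is stored
```

The trade-off is that columns cannot be mapped back to words, which makes the model harder to interpret than one built on `CountVectorizer`.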

## Pipeline and gridsearch for tf-idf and logistic regression

In [16]:
pipe_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('lr', LogisticRegression(solver='liblinear'))  # liblinear supports both 'l1' and 'l2'
])

parameters_tfidf = {
    'tfidf__stop_words' : ['english', None],
    'tfidf__max_features':[10000, 15000, 20000, 22000, 24000, 25000, 30000],
  
    'lr__penalty' : ['l2', 'l1'],
    'lr__C': [1, 1/10, 1/20]
}

gs_tfidf = GridSearchCV(pipe_tfidf, parameters_tfidf)
gs_tfidf.fit(X_train,y_train)
print(gs_tfidf.score(X_train, y_train))
print(gs_tfidf.score(X_test, y_test))
print(gs_tfidf.best_estimator_)
print(gs_tfidf.best_params_)
print(gs_tfidf.best_score_)

0.966041108132261
0.8262910798122066
Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=22000, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
{'lr__C': 1, 'lr__penalty': 'l2', 'tfidf__max_features': 22000, 'tfidf__stop_words': 'english'}
0.8757819481680071


## Overall, TF-IDF outperformed CountVectorizer and HashingVectorizer with logistic regression (test accuracy 0.826 vs. 0.799 and 0.815)
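
To see what TF-IDF changes relative to raw counts, the sketch below recomputes the weights by hand on an invented corpus and compares them with `TfidfVectorizer`. It assumes the default settings (`smooth_idf=True`, `norm='l2'`, raw term counts), under which the weight is `count * (ln((1 + n) / (1 + df)) + 1)` followed by row-wise L2 normalization.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented toy documents
docs = ["crypto crypto key", "key escrow", "voice scrambler key"]

counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed inverse document frequency (assumes smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

tfidf_manual = counts * idf
tfidf_manual /= np.linalg.norm(tfidf_manual, axis=1, keepdims=True)  # norm='l2'

tfidf_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

A word like "key" that appears in every document gets the smallest IDF, so it contributes less than rarer, more distinctive words such as "escrow".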

### 5. [Bonus] Robust Text Preprocessing

Your mission, should you choose to accept it, is to write a preprocessing function for all of your text.  This function should

- convert all text to lowercase,
- remove punctuation,
- stem or lemmatize each word of the text,
- remove stopwords.

The function should receive one string of text and return the processed text.

Once you have built your function, use it to process your train and test data, then fit a Logistic Regression model to see how it performs.

In [17]:
# One-time setup if the corpora are missing:
# nltk.download('wordnet'); nltk.download('stopwords')

# Build these once instead of on every call
tokenizer = RegexpTokenizer(r'\w+')             # keeps runs of word characters, dropping punctuation
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))    # set lookup is much faster than scanning a list per token

def text_preprocessing(text):
    """Lowercase, strip punctuation, lemmatize, and remove stop words from one string."""
    tokens = tokenizer.tokenize(text.lower())
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(lemma for lemma in lemmas if lemma not in stop_words)

In [18]:
X_train_processed = [text_preprocessing(article) for article in X_train]


In [19]:
X_test_processed = [text_preprocessing(article) for article in X_test]
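
An alternative to preprocessing the lists up front is to hand the vectorizer a custom tokenizer, so the same steps run inside the pipeline on both train and test data. The sketch below is one way that might look; `stem_tokenizer` is a hypothetical helper, and it swaps in `PorterStemmer` and scikit-learn's `ENGLISH_STOP_WORDS` (instead of the lemmatizer and NLTK stop word list used above) so that no NLTK corpus download is needed.

```python
import re
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def stem_tokenizer(text):
    """Hypothetical helper: split on word characters, drop stop words, stem the rest."""
    tokens = re.findall(r"\w+", text.lower())
    return [stemmer.stem(tok) for tok in tokens if tok not in ENGLISH_STOP_WORDS]

# token_pattern=None signals that the custom tokenizer replaces the default pattern
vectorizer = CountVectorizer(tokenizer=stem_tokenizer, token_pattern=None)
X_toy = vectorizer.fit_transform(["The doctors are running tests", "A doctor ran a test"])

print(sorted(vectorizer.vocabulary_))
```

Note that stemming merges "doctors" with "doctor" and "running" with "run", but leaves the irregular form "ran" as its own feature.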

In [25]:
pipe_final_count = Pipeline([
    ('cv', CountVectorizer()),
    ('lr', LogisticRegression(solver='liblinear'))  # liblinear supports both 'l1' and 'l2'
])

params = {
    'cv__stop_words' : ['english', None],
    'cv__max_features': [10000, 15000, 20000, 25000, 30000],
    'cv__max_df': [1.0, 0.9, 0.8],
    'lr__penalty' : ['l2', 'l1'],
    'lr__C': [1, 1/10, 1/20]
}

gs_final_count = GridSearchCV(pipe_final_count, params)
gs_final_count.fit(X_train_processed, y_train)

print(gs_final_count.best_params_)
print(gs_final_count.best_score_)
print(gs_final_count.best_estimator_)
print(gs_final_count.score(X_train_processed, y_train))
print(gs_final_count.score(X_test_processed, y_test))

{'cv__max_df': 1.0, 'cv__max_features': 15000, 'cv__stop_words': 'english', 'lr__C': 0.1, 'lr__penalty': 'l2'}
0.8543342269883825
Pipeline(memory=None,
     steps=[('cv', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=15000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        s...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
0.9664879356568364
0.8001341381623072


In [26]:
pipe_hash_final = Pipeline([
    ('hv', HashingVectorizer()),
    ('lr', LogisticRegression(solver='liblinear'))  # liblinear supports both 'l1' and 'l2'
])

parameters_hash_final = {
    'hv__stop_words' : ['english', None],
  
    'lr__penalty' : ['l2', 'l1'],
    'lr__C': [1, 1/10, 1/20]
}

gs_hash_final = GridSearchCV(pipe_hash_final, parameters_hash_final)
gs_hash_final.fit(X_train_processed,y_train)
print(gs_hash_final.score(X_train_processed, y_train))
print(gs_hash_final.score(X_test_processed, y_test))
print(gs_hash_final.best_estimator_)
print(gs_hash_final.best_params_)
print(gs_hash_final.best_score_)

0.953083109919571
0.8088531187122736
Pipeline(memory=None,
     steps=[('hv', HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
         decode_error='strict', dtype=<class 'numpy.float64'>,
         encoding='utf-8', input='content', lowercase=True,
         n_features=1048576, ngram_range=(1, 1), non_negative=False,
         norm='l2', prep...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
{'hv__stop_words': 'english', 'lr__C': 1, 'lr__penalty': 'l2'}
0.8467381590705988


In [28]:
pipe_tfidf_final = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('lr', LogisticRegression(solver='liblinear'))  # liblinear supports both 'l1' and 'l2'
])

parameters_tfidf_final = {
    'tfidf__stop_words' : ['english', None],
    'tfidf__max_features':[10000, 15000, 20000, 22000, 24000, 25000, 30000],
  
    'lr__penalty' : ['l2', 'l1'],
    'lr__C': [1, 1/10, 1/20]
}

gs_tfidf_final = GridSearchCV(pipe_tfidf_final, parameters_tfidf_final)
gs_tfidf_final.fit(X_train_processed,y_train)
print(gs_tfidf_final.score(X_train_processed, y_train))
print(gs_tfidf_final.score(X_test_processed, y_test))
print(gs_tfidf_final.best_estimator_)
print(gs_tfidf_final.best_params_)
print(gs_tfidf_final.best_score_)

0.9673815907059875
0.8323272971160295
Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=22000, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
{'lr__C': 1, 'lr__penalty': 'l2', 'tfidf__max_features': 22000, 'tfidf__stop_words': 'english'}
0.8806970509383378


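## Applying the text preprocessing function slightly improved the cross-validated score of all 3 models. On the test set, TF-IDF with logistic regression gained the most (0.826 to 0.832), CountVectorizer barely moved (0.799 to 0.800), and HashingVectorizer dipped slightly (0.815 to 0.809). The gains are probably due to lemmatizing.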