# Applying machine learning to sentiment analysis.

The topics that we will cover in the following sections include the following:

- Cleaning and preparing data.
- Building feature vectors from text documents.
- Training a machine learning model to classify positive and negative movie reviews.
- Working with large datasets using out-of-core learning.
- Inferring topics from document collections for categorisation.

## Preparing the IMDb movie review data for text processing.

Sentiment analysis, sometimes also called opinion mining, is a popular subdiscipline of the broader field of NLP; it is concerned with analyzing the polarity of documents.  A popular taks in sentiment analysis is the classification of documents based on the expressed opinions or emotions of the authors with regard to a particular topic.

### Obtaining the movie review dataset
[Link to the IMDB dataset](http://ai.stafor.edu/~amaas/data/sentiment) that has been collected by Maas and others.

From the terminal run the command ```tar -zxf aclImdb_v1.tar.gz``` to decompress the dataset.

### Preprocessing the movie dataset into more convenient format

Now we assemble the individual text documents from the decompressed download archive into a single CSV file.  In the following code section, we will be reading the movie reviews into a pandas ```DataFrame``` object, which can take up to 10 minutes on a standard desktop computer.

To visualise the progress and estimated time untill completion, we will use3 the **Python Progress Indicator** ([PyPrind](https://pypi.python.org/pypi/PyPrind/)).  PyPrind can be installed by executing the ```pip install pyprind``` command.

In [1]:
import pyprind
import pandas as pd
import os

# Change the `basepath` to the directory of the unzipped movie dataset:
basepath = "aclImdb"

labels = {"pos": 1, "neg": 0}
pbar = pyprind.ProgBar(50000)  # The number of documents that we are reading in.
df = pd.DataFrame()  # Instantiating the dataframe.

for s in ("test", "train"):  # iterate over the `test` and `train` subdirectories.
    for l in ("pos", "neg"):  # iterate over the `pos` and `neg` subdirectories.
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):  # iterate over all the text files.
            with open(os.path.join(path, file), "r", encoding="utf-8") as infile:
                txt = infile.read()
            df = df.append([[txt, labels[l]]], ignore_index=True)  
            # ignore_index:  boolean, default False. If True, do not use the index Labels.
            pbar.update()
df.columns = ["review", "sentiment"]  # Initialise the column names of the pandas dataframe.
# Total time elapsed: 00:01:29.

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:01:33


Since the class labels in the assembled dataset are sorted, we will now shuffle the DataFrame using the ```permutation``` function from the ```np.random``` submodule -- this will be useful to split the dataset into training and test sets in later sections when we will stream the data from our local drive directly.  For our own convenience, we will also store the assembled and shuffled movie review datasets as a CSV file:

In [2]:
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))  # Conform the dataframe to the new index.
df.to_csv("movie_data.csv", index=False, encoding="utf-8")  # do not write the row names (indices)

Since we are going to use this dataset later in this chapter, let's quickly confirm that we have successfully saved the data in the right format by reading in the CSV and printing an exceprt of the the first three samples:

In [3]:
df = pd.read_csv("movie_data.csv", encoding="utf-8")
df.head(5)

Unnamed: 0,review,sentiment
0,'This Is Not a Film' works because it is so tr...,1
1,I probably have to blame myselfbut I sure as ...,0
2,I love Jane Austen's stories. I've only read t...,0
3,'Stanley and Iris' show the triumph of the hum...,1
4,talk about your waste of money.. im just wonde...,0


## Introducing the bag-of-words model

Bag-of-words allows us to represent text as numerical feature vectors.  The idea behind the bag-of-words model is quite simple and can be summarised as follows:

1. We create a vocabulary of unique tokens - for example, words - from the entire set of documents.
2. We construct a feature vector from each document that contains the counts of how often each word occurs in the particular document.

Since the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors will mostly consist of zeros, which is why we call them sparse vectors.

## Transforming words into feature vectors

To construct a bag-of-words model based on the word counts in the respective documents, we can use the ```CountVectorizer``` class implemented in scikit-learn.  As we will see in the following code section, ```CountVectorizer``` takes an array of text data, which can be documents or sentences, and constructs the bag-of-words model for us:

In [4]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining, the weather is sweet, and one and one is two"])
bag = count.fit_transform(docs)

By calling the ```fit_transform``` method on ```CountVectorizer```, we constructed the vocabulary of the bag-of-words model and transformed the following three sentences into sparse feature vectors:

- ```'The sun is shining'```
- ```'The seather is sweet'```
- ```'The sun is shining, the weather is sweet, and one and one is two'```

Now, let us print the contents of the vocabulary to get a better understanding of the unerlying concepts:

In [5]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


The vocabulary is stored in a Python dictionary that **maps the unique words to integer indices.**  Next, let us print the feature vectors that we just created:

In [6]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


These values in the feature vectors are also called the **raw term frequencies:**  $\mathrm{tf}(t,d)$ --> the number of times a term $t$ occurs in a document $d$.

## Assessing the word relevancy via the term frequency-inverse document frequency.

tf-idf is used to downweight the frequently occurring words in the feature vectors.  The tf-idf can be defined as the product of the term frequency and the inverse document frequency:
$$
\mathrm{tfidf}(t,d) = tf(t,d) \times idf(t,d) 
$$
$tf(t,d)$ is the term frequency that we introduced in the previous section, and $idf(t,d)$ is the inverse document frequency and can be calculated as follows:
$$
\mathrm{idf}(t,d) = \ln\frac{n_d}{1 + df(d,t)}
$$

Here $n_d$ is the total number of documents, and $df(d,t)$ is the number of documents $d$ that contain the term $t$.  Note that adding the constant 1 to the denominator is optional and serves the purpose of assigning a non-zero value to terms that occur in all the training samples.

The $\ln$ is used to ensure that low-document-frequencies are not given too much weight.

The scikit-learn library implements yet another transformer, the ```TfidfTransformer``` class, that takes the raw term frequencies from the ```CountVectorizer``` class as input and transforms them into tf-idfs:

In [7]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True,
                        norm="l2",
                        smooth_idf=True)
# use_idf = True: Enable inverse-document-frequency reweighting.
# norm = "l2": Norm used to normalise term vectors.  None for no normalisation.
# smooth_idf=True: Smooth the idf weights by adding one to document frequencies,
# as if an extra document was seen containing every term in the collection exactly once.
# Prevents zero division.
np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


As we saw in the previous subsection, the word "is" had the largest term frequency in the third document, being the most frequently-ocurring word.  However, after transforming the same feature vector into tf-idfs, we see that the word "is" is now associated with a relatively small tf-idf (```0.45```) in the third document, since it is also present in the first and second document and thus is unlikely to contain any useful discriminatory information.

However, if we'd manually calculated the tf-idfs of the individual terms in our feature vectors, we'd notice that ```TfidfTransformer``` calculates the tf-idfs slightly diffferently compared to the standard textbook equations that we defined previously.  The equation for the inverse document frequency implemented in sciki-learn is computed as follows:
$$
idf(t,d) = \ln\frac{1 + n_d}{1 + df(d,t)}
$$

Similarly, the tf-idf computed in scikit-learn deviates slightly from the default equation we defined earlier:
$$
tfidf(t,d) = tf(t,d) \times (idf(t,d) + 1)
$$

For L2-normalisation we divide an un-normalised feature vector $\mathbf{v}$ by its L2-norm:
$$
v_{norm} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \ldots + v_n^2}}
$$

## Cleaning Text Data

The first important step - before we built our bag-of-words model - is to clean the text data by stripping it of all unwanted characters.

For simplicity, we will remove all punctuation marks except for emoticon characters such as :) since those are cetainly useful for sentiment analysis.  To accomplish this task, we will use Pyton's **regular expression (regex)** library, ```re```, as shown here:

In [8]:
import re

def preprocessor(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = (re.sub('[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))
    return text

Via the first regex, we tried to remove all of the HTML markup from the movie reviews.
Then we used a slightly more complex regex to find emoticons, which we temporarily stored as ```emoticons.```
Next, we removed all the non-word characters from the text and converted the text into lowercase characters.
Eventually, we added the temporarily stored ```emoticons``` to the end of the processed document string.  Additionally, we removed the _nose_ character (-) from the emoticons for consistency.

We shall note that the order of the words does not matter in our bag-of-wordes model if our vocabulary consists of only one-word tokens.  Let us confirm that our preprocessor works correctly:

In [9]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

Let us now apply our ```preprocessor``` function to all the movie reviews in our ```DataFrame```:

In [10]:
df['review'] = df['review'].apply(preprocessor)

## Processing documents into tokens

One way to _tokenize_ documents is to split them into individual words by splitting the cleaned documents at its whitespace characters:

In [11]:
def tokenizer(text):
    return text.split()

tokenizer("runners like running and thus they run")

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In the context of tokenization, another useful technique is **word stemming** which is the process of transforming a work into its foot form.  It allows us to map related words to the same stem.  The original stemming algorithm was developed by Martin F. Porter in 1979 and is hence known as the Porter stmmer algorithm.

The following code shows hose to use the Porter stemming algorithm:

In [12]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tokenizer_porter("runners like running and thus they run")

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

Using the ```PorterStemmer``` from the ```nltk``` package, we modified our ```tokenizer``` function to reduce words to their root form, which was illustrated by the simple preceding example.

Let us also talk about **stop-word-removal**.  Stop words are simply those words that are extremely common in all sorts of texts and probably bear no (or only little) useful information that can be used to distinguish between different classers of documents.  Examples of stopping words are: _is, and, has, like, ..._  Removing stop-words can be useful if we are working with raw or normalised term frequencies rather than tf-idfs, which are already downweighting frequently occurring words.

In order to remove stop-words from the movie reviews, we will use the set of 127 English stop-words that is available from the NLTK library, which can be obtained by calling the ```nltk.download``` function.

In [13]:
import nltk

nltk.download('stopwords')

from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]

[nltk_data] Downloading package stopwords to /home/henri/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['runner', 'like', 'run', 'run', 'lot']

## Training a logistic regression model for document classification.

In this section, we will train a logistic regression model to classify the movie reviews into _positive_ and _negative_ reviews.  First, we will divide the ```DataFrame``` of cleaned text documents into 25000 documents for training and 25000 documents for testing:

In [14]:
X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

Next, we will use a ```GridSearchCV``` object to find the optimal set of parameters for our logistic regression model using 5-fold stratified cross-validation:

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents=None,  # Remove accents durign the preprocessing step.
                       lowercase=False,  # convert all characters to lowercase before tokenizing
                       preprocessor=None)  # Override the preprocessing stage while preserving the tokenizing and n-grams generation steps.

param_grid = [{'vect__ngram_range': [(1,1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1,1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf': [False],
               'vect__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]}]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy',
                          cv=5, verbose=2,
                          n_jobs=-1)

gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] clf__C=1.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how

[CV] clf__C=1.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other

[CV]  clf__C=1.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'othe

[CV] clf__C=1.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7f1dbd2bc950> 
[CV]  clf__C=1.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'unde

[CV]  clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'othe

[CV]  clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'othe

[CV]  clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'othe

[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  5.2min


[CV]  clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f1dbd2c9598>, total=   5.0s
[CV] clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f1dbd2c9598> 
[CV]  clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f1dbd2c9598>, total=   5.4s
[CV] clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f1dbd2c9598> 
[CV]  clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f1dbd2c9598>, total=   5.9s
[CV] clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7f1dbd2bc950> 
[CV]  clf__C=1.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function toke

[CV]  clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'othe

[CV]  clf__C=10.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'oth

[CV]  clf__C=10.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'oth

[CV]  clf__C=10.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f1dbd2c9598>, total=   5.9s
[CV] clf__C=10.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7f1dbd2bc950> 
[CV]  clf__C=10.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f1dbd2c9598>, total=   6.7s
[CV] clf__C=10.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7f1dbd2bc950> 
[CV]  clf__C=10.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f1dbd2c9598>, total=   6.6s
[CV] clf__C=10.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7f1dbd2bc950> 
[CV]  clf__C=10.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my

[CV] clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'othe

[CV] clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'othe

[CV] clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'othe

[CV] clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7f1dbd2bc950> 
[CV]  clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'un

[CV]  clf__C=100.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'ot

[CV] clf__C=100.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'oth

[CV]  clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7f1dbd2bc950>, total= 2.7min
[CV] clf__C=100.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'of

[CV] clf__C=100.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7f1dbd2bc950> 
[CV]  clf__C=100.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', '

[CV] clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'oth

[CV] clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'oth

[CV] clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'oth

[CV] clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7f1dbd2bc950> 
[CV]  clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', '

[CV] clf__C=1.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more

[CV]  clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'ot

[CV] clf__C=1.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more

[CV] clf__C=1.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7f1dbd2bc950>, vect__use_idf=False 
[CV]  clf__C=1.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from'

[CV]  clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'mor

[CV] clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more

[CV]  clf__C=1.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'mor

[CV] clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f1dbd2c9598>, vect__use_idf=False 


[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed: 43.6min


[CV]  clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f1dbd2c9598>, vect__use_idf=False, total=  23.9s
[CV] clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f1dbd2c9598>, vect__use_idf=False 
[CV]  clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f1dbd2c9598>, vect__use_idf=False, total=  23.3s
[CV] clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7f1dbd2bc950>, vect__use_idf=False 
[CV]  clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f1dbd2c9598>, vect__use_idf=False, total=  26.1s
[CV] clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vec

[CV]  clf__C=10.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'mo

[CV] clf__C=10.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'mor

[CV]  clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'mor

[CV]  clf__C=1.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7f1dbd2bc950>, vect__use_idf=False, total= 3.0min
[CV] clf__C=10.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below

[CV] clf__C=10.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7f1dbd2bc950>, vect__use_idf=False 
[CV]  clf__C=10.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'fro

[CV] clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'mor

[CV]  clf__C=10.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'mo

[CV] clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'mor

[CV]  clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'mo

[CV] clf__C=100.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'mo

[CV]  clf__C=100.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'm

[CV] clf__C=100.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'mo

[CV]  clf__C=10.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7f1dbd2bc950>, vect__use_idf=False, total= 3.0min
[CV] clf__C=100.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f1dbd2c9598>, vect__use_idf=False 
[CV]  clf__C=100.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f1dbd2c9598>, vect__use_idf=False, total=   5.2s
[CV] clf__C=100.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f1dbd2c9598>, vect__use_idf=False 
[CV]  clf__C=100.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer at 0x7f1dbd2c9598>, vect__use_idf=False, total=   6.1s
[CV] clf__C=100.0, clf__penalty=l1, vect__ngram_range=

[CV] clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'mo

[CV]  clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'm

[CV] clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'mo

[CV]  clf__C=100.0, clf__penalty=l1, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7f1dbd2bc950>, vect__use_idf=False, total= 2.7min
[CV] clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'be

[CV] clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=None, vect__tokenizer=<function tokenizer_porter at 0x7f1dbd2bc950>, vect__use_idf=False 
[CV]  clf__C=100.0, clf__penalty=l2, vect__ngram_range=(1, 1), vect__norm=None, vect__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'f

[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 75.4min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...nalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid=[{'vect__ngram_range': [(1, 1)], 'vect__stop_words': [['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's...se_idf': [False], 'vect__norm': [None], 'clf__penalty': ['l1', 'l2'], 'clf__C': [1.0, 10.0, 100.0]}],
       pre_dispatch='2*n_jobs', refit=True, return_tr

It is highly recommended to set ```n_jobs=-1``` (instead of ```n_jobs=1```) in the previous code example to utilise all available cores on your machine and speed up the grid search.

When we initialized the ```GridSearchCV``` object and its parameter grid using the preceding code, we restricted ourselves to a limited number of parameter combinations, since the number of feature vectors, as well as the large vocabulary can make the grid search computationally quite expensive.  Using a standard desktop computer, our grid search may take up to 40 minutes to complete.

In the previous code example, we replaced ```CountVectorizer``` and ```TfidfTransformer``` from the previous subsection with TfidfVectorizer, which combines the former transformer objects.  Our ```param_grid``` consisted of 2 parameter dictionaries.  In the first dictionary, we use the ```TfidfVectorizer``` with its default settings (```use_idf=True, smooth_idf=True, norm='l2'```) to calculate the tf-idfs.  In the second dictionary, we set those parameters to ```use_idf=False, smooth_idf=False, norm=None``` in order to train a model based on raw term frequencies.  Furthermore, for the logistic regression classifier itself, we trained models using L2 and L1 regularization via the penalty parameter and compared different regularizations strengths by defining a range of values for the inverse-regularization parameter C.

After the grid search has finished, we can FINALLY print the best parameter set:

In [16]:
print("Best parameter set: {}".format(gs_lr_tfidf.best_params_))

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7f1dbd2c9598>}


Using the best model from this grid search, let's print the average 5-fold cross-validation accuracy scores on the training set, and the classification accuracy on the test dataset:

In [17]:
print("CV Accuracy: {:.3f}".format(gs_lr_tfidf.best_score_))
clf = gs_lr_tfidf.best_estimator_ 
print("Test Accuracy: {:.3f}".format(clf.score(X_test, y_test)))

CV Accuracy: 0.898
Test Accuracy: 0.896


The results reveal that our machine learning model can predict whether a movie review is positive or negative with 90 percent accuracy.

## Working with bigger data - online algorithms and out-of-core learning

Since not everyone has access to supercomputer facilities, we will now apply a technique called **out-of-core learning**, which allows us to work with large datasets by fitting the classifier incrementally on smaller batches of the dataset.

In this section, we will make use of the ```partial_fit``` function of the ```SGDClassifier``` in scikit-learn to stream the documents directly from our local drive, and train a logistic regression model using small mini-batches of documents.

First, we define a ```tokenizer``` function that cleans the unprocessed text data from the ```movie_data.csv``` file that we constructed at the beginning of this chapter and separate it into word tokens while removing stop words:

In [18]:
import numpy as np
import re
from nltk.corpus import stopwords
stop = stopwords.words("english")

def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub('[\W]+', ' ', text.lower() + ' '.join(emoticons).replace('-', ''))
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

Next, we define a generator function ```stream_docs``` that reads and returns one document at a time:

In [19]:
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv) # skip the header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

In [20]:
with open("movie_data.csv", "r", encoding="utf-8") as csv:
    for i, line in enumerate(csv):
        print(i, line)

0 review,sentiment

1 "'This Is Not a Film' works because it is so true in what it is trying to say. If you ignore the dynamics of the plot and focus in on the message, you will see a little bit of yourself in the main character, Michael. Whether male or female, all of us have come to a point in our lives where we want to look back and reexamine a situation or a relationship. Did it really occur like we remembered? What went wrong? Michael's desire to find Grace is completely selfish. More than anything, he wants to make himself feel better about how things turned out. But even so, he is a sympathetic character because everyone is selfish when it comes to relationships. We would not be in them otherwise. As the film ends, I am not sure if Michael has learned anything new about himself or not. Our best gauge on the relationship is through his friend, Nadia. She is the soul of the movie and reminds us of how there are always two sides to every story. I found Michael to be pompous, arroga

1385 "*SPOILERS AHEAD*<br /><br />Great WrestleManias were still a few years away. But this one was certainly good, with lots of good matches, and one great match.<br /><br />Demolition was always at their best at WrestleMania. I'm glad their last WM hoorah (I refuse to include the other version) was a win over the Colossal Connection. I liked the gag of Andre never tagging in.<br /><br />Few fans know that this was the first time anyone ever beat Mr. Perfect. For some reason, Brutus Beefcake's feat was never recognized. Or the fact that he did it pretty easily.<br /><br />The Hart Foundation's win over the Bolsheviks was the shortest in WM history, including the 24/9 second match between King Kong Bundy and S.D. Jones.<br /><br />I'm glad Jake and DiBiase got to fight at WrestleMania. This made up for the fact that the feud had to be put on hold for so long.<br /><br />I expected the Big Bossman-Akeem feud to heat up, but the Bossman just clobbered him. As good as Bossman was as a hee

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



3076 "Minor Spoilers<br /><br />In Chicago, Grace Beasley (Kathy Bates) is a housewife having a twenty-five years marriage with the lawyer Max Beasley (Dan Aykroyd) and a hysterical and psychotic dwarf daughter-in-law, Maudey (Meredith Eaton). Grace worships the singer Victor Fox (Jonathan Price), who will present a TV show in Chicago and will give five spots on the first row in a TV promotion. Kate calls the show and wins a ticket, when Max simultaneously asks for the divorce, claiming that their lives are too monotonous. Grace becomes depressed, and when she goes to the show, the audience is informed that a Chicago serial killer, who uses a crossbow, killed Victor Fox. With a broken heart, she decides to fly to England to Victor Fox's funeral. There, she realizes that he was gay, and becomes friend of his former mate Dirk Simpson (Rupert Everett). They fly back to Chicago, trying to find the killer. This movie is a delightful, original and weird dramatic comedy, having bizarre charac

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



8219 """What Alice Found"" was a pleasant discovery. As written and directed by A. Dean Bell, this is combination of a road movie with a cautionary tale, as well as a voyage of discovery.<br /><br />If you haven't seen the film, maybe you should stop reading here.<br /><br />Alice is a case study of a young woman that wants to break away from the unhappy life she leads in a New England town. Her pretext for leaving is going to join her best friend, who is away studying at a Miami university. Alice is the product of a single mother's home, one that is struggling to make ends meet, in sharp contrast with the life of ease her friend seems to inhabit. In flashbacks we get to see Alice's life before going on the road.<br /><br />Alice, like her namesake in ""Alice in Wonderland"", embarks in a trip to the unknown that life hasn't prepared her for. The highways of America are full of predators in search of the weak and innocent. Alice meets with disaster when her car breaks down the road and

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



31303 "This extraordinary pseudo-documentary, made in 1971, perfectly captures the zeitgeist of America today...which makes it all the more scary and relevant. ""subversives"" (college students, hippies, black activists, academics) are being rounded up by the government and given lengthy prison terms for what amount to thought crimes and social protest. As an alternative to life in prison, these convicted ""criminals"" are offered three days in ""Punishment Park"". Their objective inside the park is to make their way to the American flag where freedom awaits them. Not surprisingly, the Punishment Park option is a dirty lie. This brilliant film from Peter Watkins even pre-dates ""Battle Royale"" and ""Series 7"", though its angle of attack is more blatantly political. Shot in '71, it looks and feels as fresh as anything made today. The performances are exemplary and the direction is razer sharp. The narrative cuts back and forth between various groups trying to survive the harsh conditi

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



To verify that our ```stream_docs``` function works correctly, let's read in the first document from the ```movie_data.csv``` ile, which should return a tuple consisting of the review text as well as the corresponding class label:

In [21]:
print(next(stream_docs(path="movie_data.csv")))

('"\'This Is Not a Film\' works because it is so true in what it is trying to say. If you ignore the dynamics of the plot and focus in on the message, you will see a little bit of yourself in the main character, Michael. Whether male or female, all of us have come to a point in our lives where we want to look back and reexamine a situation or a relationship. Did it really occur like we remembered? What went wrong? Michael\'s desire to find Grace is completely selfish. More than anything, he wants to make himself feel better about how things turned out. But even so, he is a sympathetic character because everyone is selfish when it comes to relationships. We would not be in them otherwise. As the film ends, I am not sure if Michael has learned anything new about himself or not. Our best gauge on the relationship is through his friend, Nadia. She is the soul of the movie and reminds us of how there are always two sides to every story. I found Michael to be pompous, arrogant, and just plai

We will now define a function, ```get_minibatch```, that will take a document stream from the stream_docs function and return a particular number of documents specified by the ```size``` parameter:

In [22]:
def get_minibatch(doc_streams, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

Unfortunately, we cannot use ```CountVectorizer``` for out-of-core learning since it requires holding the complete vocabulary in memory.  Also, ```TfidfVectorizer``` needs to keep all the feature vectors of the training dataset in memory to calculate the inverse document frequencies.  However, another useful vectorizer for text processing implemented in scikit-learn is ```HashingVectorizer```.  ```HashingVectorizer``` is data-independent and makes use of the hashing trick via the 32-bit MurmurHash3 function by Austin Appleby

In [23]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error="ignore",
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)

clf = SGDClassifier(loss="log",
                    random_state=1,
                    max_iter=1)

doc_stream = stream_docs(path="movie_data.csv")

We initialized ```HashingVectorizer``` with our tokenizer function and set the number of features to 2^21.  Furthermore, we reinitialized a logistic regression classifier by setting the ```loss``` parameter of the ```SGDClassifier``` to ```'log'```.   Note that by choosing a large number of features in the ```HashingVectorizer```, we recude the chance of causing hash collisions, but we also increase the number of coefficients in our logistic regression model.  Now comes the really interesting part.

Having set up all the complementary functions, we can now start out-of-core learning using the following code:

In [24]:
import pyprind
pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:19


We initialised the progress bar object with 45 iterations and, in the following ```for``` loop, we iterated over ```45``` mini-batches of documentsw where each mini-batch consists of 1000 documents.  Having completed the incremental leardning progress, we will use the last 5000 documents to evaluate the performance of our model.

In [25]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print("Accuracy: {:.3f}".format(clf.score(X_test, y_test)))

Accuracy: 0.874


The accuracy of the model is approximately 88%, slightly below the accuracy that we achieved in the previous section using the grid search for hyperparameter tuning.  However, out-of-core learning is very memory efficient and took less than a minute to complete.  Finally, we can use the last 5000 documents to update our model:

In [26]:
clf = clf.partial_fit(X_test, y_test)

# Embedding a Machine Learning Model into a Web Application!

Machine learning technqiues can be the predictive engines of your web services.  For example, popular and useful applications of machine learning models in web applications include spam detection in submission forms, search engines, recommendation systems for media or shopping portals, and many more.

In this chapter, we will focus on how to embed a machine learning model into a web application that can not only classify, but also learn from the data in real time.  The topics that we will cover are as follows:

- Saving the current state of a trained machine learning model.
- Using SQLite databses for data storage.
- Developing web applications using the popular Flas web framework.
- Deploying a machine learning application to a public web server.

## Serialising fitted scikit-learn estimators.

Surely we don't want to train our model every time we close our Python interpreter and want to make a new prediction or reload our web application.

The pickle module allows us to serialize and deserialze Python object structures to compact bytecode so that we can save our classifier in its current state and reload it if we want to classify new samples, without needing the model to learn from the training data all over again.

**Before you execute the following code, please make sure that you have trained the out-of-core logistic regression model from the last section of Chapter 8, and have it ready in your current Python session!**

```python
import pickle
import os
dest = os.path.join("movieclassifier", "pkl_objects")
if not os.path.exists(dest):
    os.makedirs(dest)
    
pickle.dump(stop, open(os.path.join(dest, "stopwords.pkl"), "wb"), protocol=4)
pickle.dump(clf, open(os.path.join(dest), "classifier.pkl"), "wb", protocol=4)
```

Using the preceding code, we create a ```movieclassifier``` direcotyr where we will later sotre the files and data for our web application.  Within this ```movieclassifier``` directory, we created a ```pkl_objects``` subdirectory to save the serialized Python objects to our local drive.

In [27]:
import pickle
import os
dest = os.path.join("movieclassifier", "pkl_objects")
if not os.path.exists(dest):
    os.makedirs(dest)

pickle.dump(stop, open(os.path.join(dest, "stopwords.pkl"), "wb"), protocol=4)
pickle.dump(clf, open(os.path.join(dest, "classifier.pkl"), "wb"), protocol=4)

Via the ```dump``` method of the pickle module, we serialized the trained logistic regression model as well as the stop word set from the Natural Language ToolKit (NLTK) library, so that we don't have to install the NLTK vocabulary on our server.

The dump method take as its first argument the objecte that we want to pickle, and for the second argument we provided an open file object that the Python object will be written to.  Via the ```wb``` argument inside the open function, we opened the file in binary mode for pickle, and we set ```protocol=4``` to choose the latest and most efficient pickle protocol that has been added to Python 3.4.

We don't need to pickle the ```HashingVectorizer```, since it does not need to be fitted.  Instead we can create a new Python script file from which we can import the vectorizer into our current Python session.  Now, copy the following code and save it as ```vectorizer.py``` in the ```movieclassifier``` directory.

```python
from sklearn.feature_extraction.text import HashingVectorizer
import re
import os
import pickle

cur_dir = os.path.dirname(__file__)
stop = pickle.load(open(os.path.join(cur_dir, "pkl_objects", "stopwords.pkl"), "rb"))

def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub('', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

vect = HashingVectorizer(decode_error="ignore", n_features=2**21, preprocessor=None, tokenizer=tokenizer)
```

After we have pickled the Python objects and created the ```vectorizer.py``` file, it would now be a good idea to restart our Python interpreter or IPython Notebook kernel to test if we can deserialize the objects without error.

From your terminal, navigate to the ```movieclassifier``` directory, start a new Python session and execute the following code to verify that you can import the ```vectorizer``` and unpickle the classifier.

```python
import pickle
import re
import os
from vectorizer import vect

clf = pickle.load(open(os.path.join("pkl_objects", "classifier.pkl"), "rb"))
```

After we have successfully loaded the vectorizer and unpickled the classifier, we can now use these objects to preprocess document samples and make predicitons about their sentiment.

```python
import numpy as np
label = {0: "negative", 1: "positive"}
example = ["I love this movie"]
X = vect.transform(example)
print("Prediction: {:s}".format(label[clf.predict(X)[0]]))
print("Probability: {:.2f}%".format(np.max(clf.predict_proba(X))*100))
```

Since our classifier returns the class labels as integers, we defined a simple Python dictionary to map these integers to their sentiment.  We then used ```HashingVectorizer``` to transform the simple example document into a word vector ```X```.  Finally, we used the ```predict``` method of the logistic regression classifier to predict the class label, as well as the ```predict_proba``` method to return the corresponding probability of our prediction.  Note that the ```predict_proba``` method call returns an array with a probability value for each unique class label.  Since the class label with the largest probability corresponds to the class label that is returned by the ```predict``` call, we used the ```np.max``` function to return the probability of the predicted class.

## Setting up an SQLite database for data storage

In this section , we will set up a simple SQLite database to collect optional feedback about the predictions from users of the web application.  We can use this feedback to update our classification model.

Essentially, a SQLite database can be understood as a single, self-contained database file that allows us to directly access storage files.

Fortunately, following Pyton's _batteries included_ philosophy, there is already an API in the Python standard Library, ```sqlite3```, which allows us to work with SQLite databases.

By executing the following code, we will create a new SQLite database inside the ```movieclassifier``` directory and store 2 example movie reviews:

```python
import sqlite3
import os

if os.path.exists("reviews.sqlite"):
    os.remove("reviews.sqlite")
conn = sqlite3.connect("reviews.sqlite")
c = conn.cursor()
c.execute("CREATE TABLE review_db (review TEXT, sentiment INTEGER, date TEXT)")

example1 = "I love this movie"
c.execute("INSERT INTO review_db (review, sentiment, date) VALUES (?, ?, DATETIME('now'))", (example1, 1))

example2 = "I dislike this movie"
c.execute("INSERT INTO review_db (review, sentiment, date) VALUES (?, ?, DATETIME('now'))", (example2, 0))

conn.commit()
conn.close()
```

Following the preceding code example, we created a connection (```conn```) to a SQLite database file by calling the ```connect``` method of the ```sqlite3``` library, which created the new database file ```review.sqlite``` in the ```movieclassifier``` directory if it didn't already exist.  Please note that SQLite does not implement a replace function for existing tables; you need to delete the database file manually from your file browser if you want to execute the code a second time.

Next, we created a cursor via the ```cursor``` method, which allows us to traverse over the database records using the versatile SQLite syntax.  Via the first ```execute``` call, we then created a new database table, ```review_db```.  We used this to store and access database entries.  Along with ```review_db```, we also created three columns in this database table: ```review, sentiment,``` and ```date```.  We used these to store 2 example movie reviews and respective class labels (sentiments).

Using the ```DATETIME("now")``` SQL command we also added date and timestamps to our entries.  In addition to the timestamps, we used the question mark symbols (?) to pass the movie review texts and the corresponding class labels as positional arguments to the ```execute``` method, as members of a tuple.  Lastly, we called the ```commit``` method to save the changes that we made to the database and closed the connection via the ```close``` method.

To check if the entires have been stored in the database table correctly, we will now reopen the connection to the database and use the SQL ```SELECT``` command to fetch all the rows in the database table that have been committed between the beginning of the year 2017 and today:

```python
conn = sqlite3.connect("reviews.sqlite")
c = conn.cursor()
c.execute("SELECT * FROM review_db WHERE date BETWEEN '2018-01-01 00:00:00' AND DATETIME('now')")
results = c.fetchall()

conn.close()
print(results)
```

Alternatively, we could also use the free Firefox browser pluging SQLite Manager (available at https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager) which offers a nice GUI interface for working with SQLite databases.

## Developing a web application with Flask

Since Flask is written in Python, it provides us Python programmers with a convenient interface for embededing existing Python code, such as our movie classifier.

## Our first Flask web application
In this subsection, we will develop a very simple web application to become more familiar with the Flask API before we implement our movie classifier.  The first application we are going to build consists of a simple web page with a form field that lets us enter a name.  After submitting the name to the web application, it will render it on a new page.  While this is a very simple example of a web application, it helps with building intuition about how to store and pass variables and values between the different parts of our code withing the Flask framework.

First we create a directory tree
```
1st_flask_app/
    app.py
    templates/
        first_app.html
```

The ```app.py``` file will contain the main code that will be executed by the Python interpreter to run the Flask web application.  The ```templates``` directory is the directory in which Flask will look for static HTML files for rendering in the web browser.  Let's now take alook at the contents of ```app.py```:

```python
from flask import Flask, render_template

app = Flask(__name__)
@app.route('/')
def index():
    return render_template("first_app.html")

if __name__ == "__main__":
    app.run()
```

Now, let us take alook at the contents of the ```first_app.html``` file:
```hmtl
<!doctype html>
<html>
    <head>
        <title>First app</title>
    </head>
    <body>
        <div>Hi, this is my first Flask we3b app!</div>
    </body>
</html>
```
Here we, have simply filled an empty HTML template file with a ```<div>``` element that contains the sentence: "Hi, this is my first Flask web app!"

Conveniently, Flask allows us to run our applications locally, which is useful for developing and testing web applications before we deploy them on a public web server.  Now, let us start our web application by executing the command from the Terminal inside the ```1st_flask_app_1``` directory:
```bash
python3 app.py
```

We should see a line such as the following displayed in the Terminal:

```* Running on http://127.0.0.1:5000/```  
This line contains the address of our local server.  We can enter this address in our web browser to see the we application in action.

## Form validation and rendering

In this subsection, we will extend our simple Flask web application with HTML form elements to learn how to collect data from a user using the WTForms library, which can be installed via conda or pip.

This web application will prompt a user to type in his or her name into a text field.  After the submission button has been clicked, and the form is validated, a new HTML page will be rendered to display the user's name.

### Setting up the directory structure
The new directory structure that we need to set up for this application looks like this:
```
1st_flask_app_2/
    app.py
    static/
        style.css
    templates/
        _formhelpers.html
        first_app.html
        hell.html
```

The following are the contents of the modified ```app.py``` file:
```python
from flask import Flask, render_template, request
from wtforms import Form, TextAreaField, validators

app = Flask(__name__)

class HelloForm(Form):
    sayhello = TextAreaField('', [validators.DataRequired()])

@app.route('/')
def index():
    form = HelloForm(request.form)
    return render_template('first_app.html', form=form)

@app.route('/hello', methods=['POST'])
def hello():
    form = HelloForm(request.form)
    if request.method == 'POST' and form.validate():
        name = request.form['sayhello']
        return render_template('hello.html', name=name)
    return render_template('first_app.html', form=form)

if __name__ == '__main__':
    app.run(debug=True) # We activate Flask's debugger, a useful feature for developing new web applications.
```

**What the previous code does step by step:**
1. Using ```wtforms```, we extended the ```index``` function with a text field that we will embed in our start page using the ```TextAreaField``` class, which automatically checks whether a user has provided valid input text or not.
2. Furthermore, we defined a new function, ```hello```, which will render an HTML page ```hello.html``` after validating the HTML form.
3. Here, we used the ```POST``` method to transport the form data to the server in the message body.  Finally, by setting the ```debug=True``` argument inside the ```app.run``` method, we further activated Flask's debugger.  This is a useful feature for developing new web applications.

### Implementing a macro using the Jinja2 templating engine

Now, we will implement a generic macro in the ```_formhelpers.html``` file via the Jinja2 templating engine, which we will later import in our ```first_app.html``` file to render the text field:

```html
{% macro render_field(field) %}
    <dt>{{ field.label }}
    <dd>{{ field(**kwargs)|safe }}
    {% if field.errors %}
        <ul class=errors>
        {% for error in field.errors %}
            <li>{{ error }}</li>
        {% endfor %}
        <\ul>
    {% endif %}
    </dd>
    </dt>
{% endmacro %}
```

## Adding style via CSS
Next, we set up a simple **Cascading Style Sheet (CSS)** file, ```style.css```, to demonstrate how the look and feel of HTML documents can be modified.  We have to save the following CSS file, which will simply double the font size of our HTML body elementsw, in a subdirectory called ```static```, which is the default directory where Flask looks for static files such as CSS.  The file contents is as follows:

```css
body {
    font-size: 2em;
}
```

The following are the contents of the modified ```first_app.html``` file that will now render a text form where a user can enter a name:
```html
<!doctype html>
<html>
    <head>
        <title>First app</title>
            <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
    </head>
    <body>
        {% from "_formhelpers.html" import render_field %}
        <div>What's your name?</div>
        <form method=post action="/hello">
            <dl>
                {{ render_field(form.sayhello) }}
            </dl>
            <input type=submit value='Say hello' name='submit_btn'>
        </form>
    </body>
</html>
```

In the header section of ```first_app.html```, we loaded the CSS file.  It should now alter the size of all text elements in the HTML body.  In the HTML body section, we imported the form macro from ```_formhelpers.html```, and we rendered the ```sayhello``` form that we specified in the ```app.py``` file.  Furthermore, we added a button to the same form element so that a user can submit the text field entry.

## Creating the Results Page

Lastly, we will create a ```hello.html``` file that will be rendered via the ```render_template('hello.html', name=name)``` line return inside the ```hello``` function, which we defined in the ```app.py``` script to display the text that a user submitted via the text field.  The file contents is as follows:

```html
<!doctype html>
<html>
    <head>
        <title>First app</title>
            <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
    </head>
    <body>
        <div>Hello {{ name }}</div>
    </body>
</html>
```

Having set up our modified Flask web application, we can run it locally by executing the following command from the application's main directory, and we can view the result in our web browser at http://127.0.0.1:5000/:
```bash
python3 app.py
```

# Topic modeling with Latent Dirichlet Allocation

Topic modelling describes the broad task of assigning topics to unlabelled text documents.  We can consider topic modelling as a clustering task, a subcategory of unsupervised learning.

### Decomposing text document with LDA

LDA is a generative prababilistic model that tries to find groups of words that appear frequently together accross different documents.  These frequntly appearing words represent our topics, assuming that each document is a mixture of different words.  The input to an LDA is the bag-of-words model we discussed earlier in this chapter.  Given a bag-of-words matrix as input, LDA decomposes it into 2 new matrices:

- a document-to-topic matrix
- a word-to-topic matrix.

If we multiply those two matrices together again, we would be able to reproduce the input, the bag-of-wrods matrix, with the lowest possible error.  The only downside may be that we must define the number of topics beforehand -- the number of topics is a hyperparameter of LDA that has to be specified manually.

### LDA with scikit-learn

First, we load the dataset into a pandas ```Dataframe``` using the ```movie_data.csv``` file of the movie reviews that we have created at the beginning of this chapter.  Secondly, we are going to use the already familiar ```CountVectorizer``` to create the bag-of-words matrix as input to the LDA.  For convenience, we will use the scikit-learn's built-in English stop word library via ```stop_words='english'```:

In [35]:
import pandas as pd

df = pd.read_csv("movie_data.csv", encoding="utf-8")

from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words="english",
                        max_df=0.1,
                        max_features=5000)
# max_df=0.1 : Represents a proportion of documents.  # Exclude words that occur too frequently across the documents.
# max_features : Build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

X = count.fit_transform(df["review"].values)

print("X.shape = {}".format(X.shape))

X.shape = (50000, 5000)


Both ```max_df=0.1``` and ```max_features=5000``` are hyperparameters values chosen arbitrarily.  Readers are encouraged to tune them whiloe comparing the results.

The following code example demonstrates how to fit a ```LatentDirichletAllocation``` estimator to the bag-of-words matrix and infer the 10 different topics from the documents (note that the model fitting cn take up to five minutesw or more):

In [29]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method="batch",
                                n_jobs=-1)

X_topics = lda.fit_transform(X)

In [30]:
print("X_topics.shape = {}".format(X_topics.shape))
print("X_topics[4,:] = {}".format(X_topics[4,:]))
print("type(X_topics) = {}".format(type(X_topics)))

X_topics.shape = (50000, 10)
X_topics[4,:] = [0.35 0.18 0.   0.   0.   0.   0.41 0.04 0.   0.  ]
type(X_topics) = <class 'numpy.ndarray'>


By setting the learning method to ```batch```, we let the ```lda``` estimator do its estimation based on all the available training data in one iteration.  This is slower than the alternative ```'online'``` learning method but can lead to more accurate results.  Setting ```learning_method='online'``` is analogous to online or mini-batch learning that we discussed in chapter 2.

After fitting the LDA, we now have access to the ```components_``` attribute of the ```lda``` instance, which stores a matrix containing the word importance (here, 5000) for each of the 10 topics in increasing order.

```componenets_[i,j]``` can be viewd as a pseudocount that represents the number of times word j was assigned to topic i.

In [31]:
print("lda.components_.shape = {}".format(lda.components_.shape))
print("type(lda.components_) = {}".format(type(lda.components_)))
print("lda.components_[:,5] = {}".format(lda.components_[:,5]))

lda.components_.shape = (10, 5000)
type(lda.components_) = <class 'numpy.ndarray'>
lda.components_[:,5] = [211.36 115.75  27.48   7.62  23.75  48.35   3.6   27.12  35.66  49.33]


To analyze the results, let's print the five most important words for each of the 10 topics.  Note that the word importance values are ranked in increasing order.  Thus to print the top five words, we need to sort the ```topic``` array in reverse order:

In [32]:
n_top_words = 5
feature_names = count.get_feature_names()  # words
print("feature_names[100:110] = {}".format(feature_names[100:110]))
print("\n")

for topic_idx, topic in enumerate(lda.components_):
    descending_indices = topic.argsort()[::-1]
    print("Topic {:d}:".format(topic_idx + 1))
    print(" ".join([feature_names[i] for i in descending_indices[:n_top_words + 1]]))

feature_names[100:110] = ['accompanied', 'accomplished', 'according', 'account', 'accuracy', 'accurate', 'accurately', 'accused', 'achieve', 'achieved']


Topic 1:
worst minutes script awful stupid terrible
Topic 2:
family mother father children girl women
Topic 3:
war american dvd history music german
Topic 4:
human audience cinema art feel sense
Topic 5:
police guy car dead wife murder
Topic 6:
horror house sex blood girl gore
Topic 7:
role performance comedy actor performances plays
Topic 8:
series episode episodes tv season original
Topic 9:
book version original effects special fi
Topic 10:
action fight guy guys kids fun


Based on reading the five most important words for each topic, we may guess that the LDA identified the following topics:

1. Generally bad movies (not really a topic)
2. Movies about families
3. War movies
4. Art movies
5. Crime movies
6. Horror movies
7. Comedy movies
8. Movies somehow related to TV shows
9. Movies based on books
10. Action movies

To confirm that the categories make sense based ont the reviews, let's print three movies from the horror movie category (horror movies belong to category 6 at index position 5):

In [33]:
print("X_topics.shape = {}".format(X_topics.shape))
print("#"*80, "\n")

horror = X_topics[:, 1].argsort()[::-1]
print("horror.shape = {}".format(horror.shape))
print("type(horror) = {}".format(type(horror)))
print("horror[0:10] = {}".format(horror[0:10]))
print("horror.min() = {}, horror.max() = {}".format(horror.min(), horror.max()))
print("#"*80, "\n")

for iter_idx, movie_idx in enumerate(horror[:5]):
    print(":"*40)
    print("iter_idx = {}, movie_idx = {}".format(iter_idx + 1, movie_idx))
    print(":"*40,"\n")
    
    print("Horror Movie Number {}".format(iter_idx + 1))
    print(df["review"][movie_idx][:1000], "...", "\n")

X_topics.shape = (50000, 10)
################################################################################ 

horror.shape = (50000,)
type(horror) = <class 'numpy.ndarray'>
horror[0:10] = [31349 39689 35997 11681 10735   149 36641 12640 17934 49627]
horror.min() = 0, horror.max() = 49999
################################################################################ 

::::::::::::::::::::::::::::::::::::::::
iter_idx = 1, movie_idx = 31349
:::::::::::::::::::::::::::::::::::::::: 

Horror Movie Number 1
That magical moment in life, that point between the beautiful innocence of childhood, and the confusing whirlwind that marks adulthood . . . this is what this movie is all about. <br /><br />Danni (wonderfully played by Reese Witherspoon) is right at that moment in life when the movie starts. She swoons over Elvis, playing his records and wishfully thinking about love. Maureen her sister will soon be off to college, has no trouble with attracting boys, is beautiful, and seems to have

In [34]:
print(X_topics[34,:])
print(df.iloc[34, 0])

[0.   0.43 0.   0.   0.13 0.   0.39 0.   0.04 0.  ]
It's along the line of comedy of errors, mistaken affection transferring from one to another, blossoms and passes onkinda cat and mouse situations Flares of passion, sparks of fire fanned and put outguessing maybe she loves, he loves or they love Circle of emotions, evolving, releasinghiding, yet not hidingwanting to let him know, wanting to let her know, let them know Good ensemble cast in spite of the seemingly confusing mix of emotions from different parties involved. <br /><br />It's a refreshing charmer, casual, free and easy and rather down to earth -- not Hollywood glamorous like "Notting Hill", but lots of human feelings, frailty, vulnerability a-flowing. Yes, all revolving around an accidentally (lost &) found love letter. Kate Capshaw as the owner of the town's bookstore, with a variety of characters portrayed by Ellen DeGeneres, Tom Selleck, Blythe Danner, Tom Everett Scott, Gloria Stuart, Alice Drummond and Geraldin