# P4
[Code Reference](https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch08/ch08.ipynb)

Download the data set from http://ai.stanford.edu/~amaas/data/sentiment/, then use the command *tar -xvzf* to extract the files.

In [0]:
!tar -xvzf aclImdb_v1.tar.gz

Because loading a large data set into Python memory takes a while, install the 'pyprind' package to display a progress bar while the data is imported.

In [5]:
!pip install pyprind

Collecting pyprind
  Downloading https://files.pythonhosted.org/packages/1e/30/e76fb0c45da8aef49ea8d2a90d4e7a6877b45894c25f12fb961f009a891e/PyPrind-2.11.2-py3-none-any.whl
Installing collected packages: pyprind
Successfully installed pyprind-2.11.2


Import every review from the folder that was just extracted. The directory tree looks like this:

aclImdb
* train
* * pos
* * neg
* test
* * pos
* * neg

A nested *for* loop therefore retrieves each review.

In [6]:
import pyprind
import pandas as pd
import os

basepath = './aclImdb'

# Map folder names to class labels.
labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list
# and build the data frame once at the end.
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:01:51


Shuffle the rows of the data frame, save them to a CSV file, and read the file back to check the result.

In [7]:
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

df.to_csv('./movie_data.csv', index=False)

import pandas as pd

df = pd.read_csv('./movie_data.csv')
df.head(3)

Unnamed: 0,review,sentiment
0,Burt Reynolds stars as an undercover cop who i...,1
1,"The Mod Squad isn't a movie, it's a void. That...",0
2,"1937's ""Stella Dallas"" with Barbara Stanwyck h...",0


## Clean text data

During this step, you may want to strip all unwanted characters from review texts.

In [10]:
df.loc[0, 'review'][-50:]

'missing in the equation on a scale of one to ten 7'

The function below removes HTML markup and other non-word characters from the text. Emoticons are kept and moved to the end of the string.

In [0]:
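import re

def preprocessor(text):
    # Remove HTML tags.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is stripped.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace runs of non-word characters with a space,
    # then append the emoticons without their '-' nose character.
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text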

In [14]:
preprocessor(df.loc[0, 'review'][-50:])

'missing in the equation on a scale of one to ten 7'

In [15]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'
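
To see what the emoticon pattern inside *preprocessor* matches on its own, the sketch below applies only that regular expression to a few strings. The sample strings and the name `EMOTICON` are made up for illustration.

```python
import re

# Same pattern as in preprocessor: eyes (: ; =), optional nose (-),
# then a mouth ( ) or ( or D or P ).
EMOTICON = re.compile(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)')

# Hypothetical sample strings, not taken from the data set.
samples = ['great movie :-)', 'so boring ;(', 'no emoticon here', 'wow =D :P']
for s in samples:
    print(repr(s), '->', EMOTICON.findall(s))
```

Every group in the pattern is non-capturing, so `findall` returns the whole emoticon rather than its parts.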

In [0]:
df['review'] = df['review'].apply(preprocessor)

## Process documents into tokens

Some words appear in several forms. You may want to reduce them to a common stem.

Use *PorterStemmer* from the *nltk.stem.porter* package to map the different forms of a word to its stem.

In [0]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

Try a simple example to test our function.

In [19]:
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [20]:
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

In [21]:
import nltk

nltk.download('stopwords')

from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:]
if w not in stop]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['runner', 'like', 'run', 'run', 'lot']


## Transform words into feature vectors

Transform each document into a vector where each dimension represents the frequency of a word (bag-of-words model).

Use *CountVectorizer* from *sklearn.feature_extraction.text* to transform text into vectors.

In [0]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()

A simple example that transforms three sentences:

In [23]:
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

print(count.vocabulary_)
print(bag.toarray())

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}
[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
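
The counts above can be reproduced by hand, which makes it clearer what the vectorizer does. The following is a minimal sketch using only the standard library. It assumes the tokenization rule is "lowercase, then keep tokens of two or more word characters", which mirrors the default settings of *CountVectorizer*; the helper name `tokenize` is made up for illustration.

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

def tokenize(doc):
    # Assumption: lowercase and keep tokens of two or more word characters.
    return re.findall(r'\b\w\w+\b', doc.lower())

# Vocabulary: every distinct token, sorted alphabetically and numbered from 0.
all_tokens = {t for d in docs for t in tokenize(d)}
vocabulary = {w: i for i, w in enumerate(sorted(all_tokens))}

# One row per document, one column per vocabulary entry.
bag = []
for d in docs:
    counts = Counter(tokenize(d))  # missing tokens count as 0
    bag.append([counts[w] for w in sorted(vocabulary, key=vocabulary.get)])

print(vocabulary)
for row in bag:
    print(row)
```

Each column index is the position of the word in the alphabetically sorted vocabulary, which is why 'and' lands in column 0 and 'weather' in column 8.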


## Assess word relevancy

When we are analyzing text data, we often encounter words that occur across multiple documents from both classes. Those frequently occurring words typically don’t contain useful or discriminatory information. In this step, you may want to downweight those frequently occurring words in the feature vectors.


Use term frequency-inverse document frequency (tf-idf) to assess word relevancy. The function is provided in the *sklearn* package.

Build the transformer:

In [25]:
np.set_printoptions(precision=2)
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]
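
The numbers above follow from a short calculation. With `smooth_idf=True`, the inverse document frequency of a term $t$ is $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$, where $n$ is the number of documents and $\text{df}(t)$ is the number of documents containing $t$. Each raw count is multiplied by its idf, and with `norm='l2'` every row is then divided by its Euclidean length. The sketch below redoes this in plain Python on the count matrix from the previous step; the names `doc_freq` and `tfidf_row` are made up for illustration.

```python
import math

# Raw term counts for the three example sentences; columns follow the
# vocabulary order: and, is, one, shining, sun, sweet, the, two, weather.
counts = [
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: the number of documents in which each term appears.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

def tfidf_row(row):
    # Weight each count by its idf, then scale the row to unit L2 length.
    weighted = [tf * w for tf, w in zip(row, idf)]
    length = math.sqrt(sum(v * v for v in weighted))
    return [v / length for v in weighted]

for row in counts:
    print([round(v, 2) for v in tfidf_row(row)])
```

Words such as 'is' and 'the' occur in all three documents, so their idf is exactly 1 and they end up with a lower weight than rarer words with the same count.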


## Build the logistic model

After finishing all the steps above, you are now ready to build the model using the manipulated vectors. Remember to report your 5-fold cross-validation error.

First, split the data set imported at the beginning into a training set and a test set, and separate each into data and target.

In [0]:
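# .loc slices by label and includes the end label, so stop at 24999
# to keep row 25000 out of the training set.
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values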

Now build the logistic regression model. The processing options from the previous steps become parameters of the grid search.

In [0]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

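# liblinear supports both the 'l1' and 'l2' penalties searched above;
# the current default solver (lbfgs) does not support 'l1'.
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])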
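
The size of the search follows directly from the grid: each dictionary contributes the product of the lengths of its value lists. The sketch below counts the combinations with `itertools.product`. The string stand-ins replace the real stop-word list and tokenizer functions, since only the number of options per parameter matters here; `grid_tfidf`, `grid_counts`, and `n_candidates` are illustrative names.

```python
from itertools import product

# Stand-in values: only the number of options per parameter matters.
grid_tfidf = {
    'vect__ngram_range': [(1, 1)],
    'vect__stop_words': ['stop', None],
    'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
    'clf__penalty': ['l1', 'l2'],
    'clf__C': [1.0, 10.0, 100.0],
}
# The second grid adds two single-valued settings, so its size is the same.
grid_counts = dict(grid_tfidf, vect__use_idf=[False], vect__norm=[None])

def n_candidates(grid):
    # Number of combinations in the Cartesian product of all value lists.
    return len(list(product(*grid.values())))

candidates = n_candidates(grid_tfidf) + n_candidates(grid_counts)
n_folds = 5
print(candidates, 'candidates,', candidates * n_folds, 'fits')
```

Each grid yields 24 combinations, so 5-fold cross-validation over both grids requires 240 model fits, which explains the long running time of the search.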

Use 5-fold cross-validation over the training data to estimate the average accuracy of each parameter combination.

In [0]:
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)
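
As a rough picture of what `cv=5` means, the sketch below splits sample indices into five folds and uses each fold once for validation while the remaining folds are used for training. This is a simplified illustration with a made-up helper `k_fold_indices`, not the scikit-learn implementation: for a classifier, *GridSearchCV* with an integer `cv` uses stratified folds that preserve the class balance, which this sketch does not attempt.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# Toy example with 10 samples instead of the 25,000 training reviews.
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, k=5), start=1):
    print(f'fold {fold}: validate on {val_idx}, train on {len(train_idx)} samples')
```

The reported cross-validation accuracy is the mean of the five validation scores obtained this way.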

Fit the model. The full search takes almost 7 hours to finish.

In [0]:
gs_lr_tfidf.fit(X_train, y_train)

print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed: 70.1min
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed: 315.9min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 394.7min finished


Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7f02c7c666a8>} 
CV Accuracy: 0.895
Test Accuracy: 0.899


From cross-validation, I get an accuracy of $89.5\%$, which is not bad. When predicting the test data, I get a similar accuracy of $89.9\%$.