# Out-Of-Core Learning and Online Learning

This note is intended to build a sentiment model but with bigger feature space and the data is live-streaming. The considered example situation is like tweets or live news announcement. The difficulty for these cases is that the word features of the model as well as the vector space always change. Thus our learning process is online and in real-time, called **out-of-core learning**.

In the previous cases, we constructed the word feature space from **all** documents to train a logistic model or use Naive Bayes classifier for sentiment analysis. Here, we will use the `SGDClassifier` for minibatches of documents to train a logistic regression model. There are **50000** documents in the entire corpus, divided by **45000** as training set and **5000** as test set. We simulate that in each period, there are **100/1000** incoming documents, so the number of minibatches is **100/1000**. Eventually we have **450/45** iterations of data streamming and online training.

The `Hashingvectorizer` in Python helps us to build a hash table to store the word features during all streaming processes.

In [1]:
import pandas as pd
import numpy as np
import os

## The Data

In [4]:
df = pd.DataFrame()
for type in ['train', 'test']:
    for sentiment in ['pos', 'neg']:
        path = 'aclImdb/train/%s/' % sentiment
        if sentiment == 'pos': y = 1
        if sentiment == 'neg': y = 0
        print (path, y)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r') as infile:
                text = infile.read()
                df = df.append([[text, y]], ignore_index=True)

aclImdb/train/pos/ 1
aclImdb/train/neg/ 0
aclImdb/train/pos/ 1
aclImdb/train/neg/ 0


In [5]:
df.columns = ['review', 'sentiment']

In [6]:
df.shape

(50000, 2)

In [7]:
df.head()

Unnamed: 0,review,sentiment
0,Bromwell High is a cartoon comedy. It ran at t...,1
1,Homelessness (or Houselessness as George Carli...,1
2,Brilliant over-acting by Lesley Ann Warren. Be...,1
3,This is easily the most underrated film inn th...,1
4,This is not the typical Mel Brooks film. It wa...,1


We need to randomize the data:

In [8]:
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index=False)

In [9]:
df.head()

Unnamed: 0,review,sentiment
11841,"Often tagged as a comedy, The Man In The White...",1
19602,After Chaplin made one of his best films: Doug...,0
45519,"***SPOILER*** Do not read this, if you think a...",0
25747,hi for all the people who have seen this wonde...,1
42642,"I recently bought the DVD, forgetting just how...",0


In data **movie_data.csv**, we have **50000** documents, each of them is composed of (review, sentiment).

## Tokenizer

Here there exist a lot '< br/>' in data. Here we use regular expression to move it and remove the stop words

In [10]:
import re
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
stop_words = set(stopwords.words('English'))

def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub('[\W]+', ' ', text.lower())+ ' '.join(emoticons).replace('-', '')
    return [x for x in text.split() if x not in stop_words]

## Doc stream function

In [11]:
def stream_docs(path):
    """
    This function is to help each time, we stream ONE text.
    """
    with open(path, 'r') as csv:
        next(csv) # skip header
        for line in csv:
            #print(line[-1])
            ## line[-1] is blank
            line = line.rstrip()
            text, label = line[:-2], int(line[-1])
            yield text, label

The `stream_docs` is to help each time, we stream ONE text. For example, 

In [87]:
next(stream_docs('movie_data.csv'))

('"I watched like 8 or 9 Herzog movies and none of them had any impact on me.<br /><br />I watched several documentaries about him. He is obviously an intelligent man, with great knowledge about films and passion for making them, but does this makes him a good director. Definitely NO! A complete anti-talent. He can make a good documentary because of previously mentioned traits, but a film with actors \x96 never!<br /><br />He can\'t direct nor write. His screenplays are full of badly thought out situations, and many situations/dialogues in his movies are so childishly and badly done that they cannot be hidden behind the word ""art"" in any sense. No way. Not to mention the unskillful direction, so amateurish-like. To say that he wants to direct like that and write crap like that is a lie.<br /><br />Like the scene when Scheitz gets arrested and Storszek hides in the back of the store. WHO IS HE KIDDING?<br /><br />He is a cheater; he knows what fake intellectuals and critics want. He k

## Minibatch Function

This function helps to stream the doc with **batch_size** times, until we went through all of them. In the following, we use **batch_size=100/1000**, mimic in a certian period, we have **100/1000** incoming documents.

In [12]:
def get_minibatch(doc_stream, batch_size):
    docs, y = [], []
    try:
        for batch in range(batch_size):
            text, label = next(doc_stream)
            #print (text, label)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

## HashingVectorizer

In [13]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vect = HashingVectorizer(decode_error='ignore', n_features=2**21, preprocessor=None, tokenizer=tokenizer)
sgd = SGDClassifier(loss='log', random_state=1, n_iter=1)

In [31]:
doc_stream = stream_docs(path='movie_data.csv')

## Training Model with 1000 minibatches

In [15]:
"""
training text data
"""
classes = np.array([0, 1])
for iteration in range(45):
    X, y = get_minibatch(doc_stream, batch_size = 1000)
    if not X: break
    X = vect.transform(X)
    sgd.partial_fit(X, y, classes = classes)

In [16]:
X_test, y_test = get_minibatch(doc_stream, batch_size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % sgd.score(X_test, y_test))

Accuracy: 0.876


## Training Model with 100 minibatches

In [32]:
classes = np.array([0, 1])
for iteration in range(450):
    X, y = get_minibatch(doc_stream, batch_size = 100)
    if not X: break
    X = vect.transform(X)
    sgd.partial_fit(X, y, classes = classes)

In [33]:
X_test, y_test = get_minibatch(doc_stream, batch_size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % sgd.score(X_test, y_test))

Accuracy: 0.877


## Compare the Minibatch Learning and Batch Learning

Previous we considered either 100 or 1000 minibatches to perform the logistic regression for sentiment model. How does it compare with the entire batches? Here we revisit the same data, but use Nive Bayes classifier from sklearn and NLTK.

### Construct the entire corpus

In [17]:
hd = open('movie_data.csv', 'r')
next(hd)
corpus_words = []
for review in hd:
    review = review.rstrip()
    #words = [ps.stem(x.lower()) for x in nltk.word_tokenize(str(review[0]))
    #            if x not in stop_words if x not in punctuations if x != "''"]
    words = tokenizer(review[:-2])
    for word in words:
        corpus_words.append(word)

corpus_words = nltk.FreqDist(corpus_words)

In [18]:
print (len(corpus_words), corpus_words.most_common(10))

75803 [('movie', 88056), ('film', 80288), ('one', 53576), ('like', 40548), ('good', 30280), ('time', 25446), ('even', 25290), ('would', 24872), ('story', 23962), ('really', 23470)]


### Use 5000 most common words for the corpus

In [35]:
print (corpus_words.most_common(5000)[4999])

('claus', 160)


In [19]:
word_features = [x for x, y in corpus_words.most_common(5000)]

### Construct the feature vector space

In the following, we go through all documents and examine each tokenized word. If they exist in **corpus_words**, we label 'Ture'; otherwise 'False'. In other words, here we simply consider the **bag of words** model.

In [20]:
def get_features(review):
    features = {}
    review_words = tokenizer(review)
    for word in word_features:
        features[word] = (word in review_words)
    return features

In [21]:
hd = open('movie_data.csv', 'r')
next(hd)
allData = []
for review in hd:
    review = review.rstrip()
    category = review[-1]
    #print (review[:-2])
    allData.append((get_features(review[:-2]), category))
    #allData.append(review[1])

### Training models

In [22]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(allData, test_size=0.3)

In [23]:
classifier = nltk.NaiveBayesClassifier.train(train)
print ('Accuracy:', nltk.classify.accuracy(classifier, test))

Accuracy: 0.8612


In [24]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB

In [25]:
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(train)
print ('MNB Accuracy:', nltk.classify.accuracy(MNB_classifier, test)) 

MNB Accuracy: 0.8620666666666666
