# Building an online perceptron for text data

Online learning is a sequential approach to machine learning: at each step we receive a new observation and make a prediction for it. We then observe the true label of that observation and use it to update the predictor.

Here, we build an online perceptron by hand and use it to predict restaurant recommendations from customer reviews.
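
The predict-then-update protocol described above can be sketched in a few lines. This is only an illustration: the function `run_online`, the toy stream and the "repeat the last label" learner are hypothetical and are not used in the rest of the notebook.

```python
# Hypothetical sketch of the online protocol: predict first, then learn from the label.
def run_online(stream, predict, update, state):
    mistakes = 0
    for x, y in stream:
        y_hat = predict(state, x)    # 1) predict before the label is revealed
        if y_hat != y:
            mistakes += 1
        state = update(state, x, y)  # 2) the label is revealed, update the model
    return state, mistakes

# Toy learner (an assumption for illustration): always predict the last label seen.
stream = [("a", 1), ("b", 1), ("c", -1), ("d", -1), ("e", 1)]
final_state, mistakes = run_online(
    stream,
    predict=lambda state, x: state,
    update=lambda state, x, y: y,
    state=1,
)
print(final_state, mistakes)  # 1 2
```

The perceptron follows the same loop, with a weight vector as its state.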

In [1]:
import pandas as pd
import numpy as np
import time
import heapq
import nltk
from scipy.sparse import hstack,csr_matrix
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from nltk.corpus import stopwords

In [2]:
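# the labels are stored in the first column of each file, which is read as the index
df_test=pd.read_csv("reviews_te.csv",index_col=0)
#df_test=df_test[:1000]
y_test=df_test.index.to_numpy().copy()
y_test[y_test==0]=-1 # the perceptron expects labels in {-1,+1}
df_train=pd.read_csv("reviews_tr.csv",index_col=0)
#df_train=df_train[:10000]
y_train=df_train.index.to_numpy().copy()
y_train[y_train==0]=-1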

We will train the perceptron on four different representations of the text data.

## 1. Term frequency

Here, each word in a document is represented by a feature that counts the number of times it appears in the document.
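
As a rough illustration of what the count matrix contains, here is a pure-Python sketch on two invented documents. It splits on whitespace, which is a simplification: `CountVectorizer` uses its own tokenizer and returns a sparse matrix, so this is not a drop-in replacement.

```python
from collections import Counter

# Toy documents, invented for illustration (not taken from the review data).
docs = ["good food good service", "bad food"]

# Vocabulary: every distinct word gets one column, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def term_frequency(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

tf_matrix = [term_frequency(doc) for doc in docs]
print(vocabulary)  # ['bad', 'food', 'good', 'service']
print(tf_matrix)   # [[0, 1, 2, 1], [1, 1, 0, 0]]
```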

In [3]:
def add_constant(matrix):
    # append a column of ones so that the last weight acts as a bias (intercept) term
    return(hstack((matrix,np.ones(matrix.shape[0]).astype(np.int64)[:,None]),format="csr"))
    

In [4]:
vectorizer=CountVectorizer()
sparse_train =add_constant(vectorizer.fit_transform(df_train.text.values))
sparse_test = add_constant(vectorizer.transform(df_test.text.values))

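The perceptron updates its weights every time it makes a mistake. For an example $x$ with label $y \in \{-1, +1\}$, the prediction is correct if $y \, (w \cdot x) > 0$; otherwise the weights are updated as $w \leftarrow w + y \, x$. The function below returns the average of the weight vectors used during the last pass over the data.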
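
To make the update concrete, here is a minimal pure-Python sketch of a single perceptron step on dense toy vectors. The helpers `dot` and `perceptron_step` and the example vectors are illustrative assumptions. The sketch uses the standard mistake-driven rule: an example counts as correct when $y \, (w \cdot x) > 0$, and the weights become $w + y \, x$ otherwise.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def perceptron_step(w, x, y):
    """Return the (possibly updated) weights and whether the example was a mistake."""
    if y * dot(w, x) > 0:
        return w, False  # correct prediction: leave w unchanged
    return [wi + y * xi for wi, xi in zip(w, x)], True  # mistake: w <- w + y*x

w = [0.0, 0.0, 0.0]
w, mistake1 = perceptron_step(w, [1, 2, 1], +1)  # margin is 0, so this is a mistake
w, mistake2 = perceptron_step(w, [1, 2, 1], +1)  # the same example is now correct
w, mistake3 = perceptron_step(w, [2, 0, 1], -1)  # margin is -3, so w is updated again
print(w)  # [-1.0, 2.0, 0.0]
```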

In [5]:
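def online_perceptron(X,y,n):
    # averaged perceptron: n shuffled passes over the data; the result is the
    # average of the weight vectors used during the last pass
    w=np.zeros(X.shape[1])
    w_sum=np.zeros(X.shape[1])
    index = np.arange(X.shape[0]) # use the argument X, not the global sparse_train
    for i in range(n):
        np.random.shuffle(index)
        X_shuffle=X[index, :]
        y_shuffle=y[index]
        count=0 # weight given to the current w in the average
        k=0
        while k<len(y):
            # move forward while the current w classifies the examples correctly
            while k<len(y) and y_shuffle[k]*X_shuffle[k].dot(w)[0]>0:
                if i==n-1:
                    count+=1
                k+=1
            if i==n-1:
                w_sum=w_sum+count*w
            if k<len(y):
                # mistake on example k: w <- w + y*x
                w=w+y_shuffle[k]*X_shuffle[k].toarray().ravel()
            k+=1
            count=1
    return(w_sum/(len(y)+1))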
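
The function above returns an average of the weight vectors used during the last pass rather than the final vector. A minimal sketch of such a weighted average, assuming a made-up history of weight vectors and of the number of examples each one was used for:

```python
# Hypothetical history: (weight vector, number of examples it was used for).
history = [([1.0, 0.0], 1), ([1.0, 2.0], 3), ([0.0, 2.0], 4)]

total = sum(count for _, count in history)
n_features = len(history[0][0])

# Each weight vector contributes in proportion to how long it was in use.
w_avg = [
    sum(w[j] * count for w, count in history) / total
    for j in range(n_features)
]
print(w_avg)  # [0.5, 1.75]
```

Dividing by a slightly different constant, as the function does with `len(y)+1`, only rescales the vector and does not change the sign of any prediction.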
    

Train the perceptron on the training data:

In [6]:
w_final=online_perceptron(sparse_train, y_train, 2) 

10 words with lowest (most negative) weights

In [7]:
feature_names=vectorizer.get_feature_names_out()
sorted([feature_names[i] for i in w_final[:-1].argsort()[:10]]) # w_final[:-1] skips the bias weight

[u'flavorless',
 u'hopes',
 u'inedible',
 u'lacked',
 u'mediocre',
 u'meh',
 u'poisoning',
 u'underwhelmed',
 u'underwhelming',
 u'worst']

10 words with highest (most positive) weights

In [8]:
feature_names=vectorizer.get_feature_names_out()
sorted([feature_names[i] for i in (-w_final[:-1]).argsort()[:10]]) # w_final[:-1] skips the bias weight

[u'disappoint',
 u'exceeded',
 u'gem',
 u'heavenly',
 u'incredible',
 u'perfection',
 u'skeptical',
 u'worried',
 u'yerm',
 u'yurm']

We evaluate the model with the empirical risk under the zero-one loss, i.e. the fraction of incorrect predictions.

In [9]:
def risk_train(X,y,w):
    # predict +1 where the score is positive and -1 otherwise
    y_pred_train=np.where(X.dot(w)>0,1,-1)
    # |prediction - label| is 2 for a mistake and 0 otherwise, hence the division by 2
    return(np.mean(abs(y_pred_train-y))/2)

def risk_test(X,y,w):
    # same computation as risk_train, kept under a separate name for the test set
    y_pred_test=np.where(X.dot(w)>0,1,-1)
    return(np.mean(abs(y_pred_test-y))/2)
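
With labels encoded as -1 and +1, the quantity mean(|prediction - label|)/2 is the fraction of misclassified examples. The following pure-Python check illustrates this on invented labels. The function names and the data are assumptions made for the example.

```python
def zero_one_risk(y_true, y_pred):
    """Fraction of positions where the prediction differs from the label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def risk_via_abs(y_true, y_pred):
    """Same quantity computed as mean(|pred - label|) / 2 for labels in {-1, +1}."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true) / 2

y_true = [1, -1, 1, 1, -1]
y_pred = [1, 1, 1, -1, -1]  # two mistakes out of five

r1 = zero_one_risk(y_true, y_pred)
r2 = risk_via_abs(y_true, y_pred)
print(r1, r2)  # 0.4 0.4
```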

In [10]:
risk_train(sparse_train, y_train, w_final)

0.101561

In [11]:
risk_test(sparse_test, y_test, w_final)

0.10525986967468652

## 2. Term frequency-inverse document frequency (tf-idf).

Tf-idf refines the term-frequency representation. It multiplies each term frequency by the logarithm of the inverse document frequency of the word, i.e. the inverse of the fraction of training documents that contain it. Words that are frequent in one document but rare in the others get larger values, which highlights what is specific to each document.
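
The weighting can be illustrated with a small pure-Python sketch that uses the textbook formula $\text{tf} \times \log(N / \text{df})$ on invented documents. This is an assumption-laden simplification: with its default settings, `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row, so its numbers differ from the ones below.

```python
import math
from collections import Counter

# Toy documents, invented for illustration.
docs = ["good food good service", "bad food bad service", "good food"]
n_docs = len(docs)
tokenized = [doc.split() for doc in docs]

# Document frequency: number of documents in which each word occurs.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Textbook tf-idf: count * log(N / document frequency)."""
    tf = Counter(tokens)
    return {word: count * math.log(n_docs / doc_freq[word]) for word, count in tf.items()}

weights = [tf_idf(tokens) for tokens in tokenized]
print(weights[0]["food"])  # 0.0, because "food" occurs in every document
print(weights[1]["bad"])   # largest weight: frequent here, absent elsewhere
```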

In [12]:
idf_vectorizer=TfidfVectorizer()
sparse_train= add_constant(idf_vectorizer.fit_transform(df_train.text.values))
sparse_test= add_constant(idf_vectorizer.transform(df_test.text.values))

In [13]:
w_final_idf=online_perceptron(sparse_train, y_train, 2)

In [15]:
risk_train(sparse_train, y_train, w_final_idf)

0.095028000000000001

In [16]:
risk_test(sparse_test, y_test, w_final_idf)

0.10616265048950088

As we can see, tf-idf slightly lowers the training risk (about 0.095 versus 0.102) but does not improve the test risk (about 0.106 versus 0.105).

## 3. Bigram representation.

The downside of the two previous approaches is that they ignore the order of the words and only consider their frequency. "I do not like apples" and "I like apples" both contain the words "like" and "apples", but they have opposite meanings because of the expression "not like" in the former. One way to account for this is a bigram (2-gram) representation, which counts every pair of consecutive words in addition to the single words.
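
A short pure-Python sketch shows which features separate the two example sentences once bigrams are added. The helper `unigrams_and_bigrams` is hypothetical and splits on whitespace, whereas `CountVectorizer` applies its own tokenization rules.

```python
from collections import Counter

def unigrams_and_bigrams(text):
    """Count single words and pairs of consecutive words."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

a = unigrams_and_bigrams("I do not like apples")
b = unigrams_and_bigrams("I like apples")

only_in_a = sorted(set(a) - set(b))
print(only_in_a)  # ['do', 'do not', 'i do', 'not', 'not like']
```

The bigram "not like" is a feature of the first sentence only, which gives a linear model a way to tell the two apart.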

In [17]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
sparse_train= add_constant(bigram_vectorizer.fit_transform(df_train.text.values))
sparse_test= add_constant(bigram_vectorizer.transform(df_test.text.values))

In [19]:
w_final_bigram=online_perceptron(sparse_train, y_train, 2)

In [20]:
risk_train(sparse_train, y_train, w_final_bigram)

0.068169999999999994

In [21]:
risk_test(sparse_test, y_test, w_final_bigram)

0.089034805480410595

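As expected, this significantly improves the score: the test risk drops to about 0.089, compared with about 0.105 and 0.106 for the two previous representations.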

## 4. Stemming
We can also use stemming, which reduces each word to a common root so that "apples" and "apple" contribute to the same count. We also remove stopwords ("the", "a", "is", ...) and then apply the bigram representation to the cleaned text.
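
The effect of this cleaning step can be illustrated without NLTK. The stopword set and the plural-stripping rule below are hypothetical stand-ins chosen for the example: they are much cruder than NLTK's English stopword list and the Porter stemmer used in this notebook.

```python
# Hypothetical stand-ins for the NLTK stopword list and the Porter stemmer.
TOY_STOPWORDS = {"the", "a", "is", "are"}

def toy_stem(word):
    # Strip a trailing "s" from words longer than three letters (NOT the Porter algorithm).
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def toy_clean(text):
    """Drop stopwords, then reduce the remaining words to a crude root."""
    tokens = [w for w in text.lower().split(" ") if w not in TOY_STOPWORDS]
    return " ".join(toy_stem(w) for w in tokens)

first = toy_clean("The apples are fresh")
second = toy_clean("The apple is fresh")
print(first)   # apple fresh
print(second)  # apple fresh
```

Both sentences map to the same cleaned text, so they produce identical features.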

In [None]:
stopWords = set(stopwords.words('english'))
theStemmer = nltk.stem.porter.PorterStemmer()

In [None]:
def clean_data(df,stopWords=stopWords,theStemmer=theStemmer):
    df_copy=df.copy() # work on a real copy so that the input frame is left untouched
    def clean_text(text):
        tokens = [word for word in text.split(" ") if word not in stopWords]
        tokens = [theStemmer.stem(word) for word in tokens] #stem words using the Porter stemming algorithm
        return " ".join(tokens)
    df_copy["text"]=df_copy["text"].apply(clean_text)
    return(df_copy)

In [None]:
clean_train=clean_data(df_train)
clean_test=clean_data(df_test)

In [None]:
bigram_vectorizer_clean = CountVectorizer(ngram_range=(1, 2))
sparse_train= add_constant(bigram_vectorizer_clean.fit_transform(clean_train.text.values))
sparse_test= add_constant(bigram_vectorizer_clean.transform(clean_test.text.values))

In [None]:
w_final_bigram_clean=online_perceptron(sparse_train, y_train, 2)

In [None]:
risk_train(sparse_train, y_train, w_final_bigram_clean)

In [None]:
risk_test(sparse_test, y_test, w_final_bigram_clean)