# Bag of Words with LDA

Latent Dirichlet Allocation (LDA) is a topic model: it describes each document as a mixture of topics, and each topic as a distribution over words that tend to occur together. Replacing raw word counts with topic weights reduces the dimensionality of our data. For example, the words **movie**, **film**, and **show** might all load on the same topic, which we could label **MOVIE_related** (LDA does not name topics itself). Collapsing related words into one feature instead of many greatly reduces the dimensionality of our data. We hope this will improve our learner's score.

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

# Load the data
train = pd.read_csv('../../data/train.tsv', sep='\t')
test = pd.read_csv('../../data/test.tsv', sep='\t')
train.shape



(156060, 4)

## Count Vectorization

Now that the data is loaded, we can vectorize it using the count vectorizer. This gives us a matrix of word counts, with one row per phrase and one column per word in the vocabulary.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

subset = train[:500]

# Create the vectorizer
# We ignore common English words, drop words that appear in fewer than 2 phrases
# or in more than 95% of them, and keep at most the 100 most frequent words
def vectorize(phrases):
    vectorizer = CountVectorizer(stop_words='english', min_df=2, max_df=0.95, max_features=100)
    return vectorizer.fit_transform(phrases)

X = vectorize(subset.Phrase)
print(X.shape)

(500, 100)
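
To make the shape of this matrix concrete, here is a minimal sketch on three made-up phrases. The phrases and the `toy_` names are hypothetical and are not taken from `train.tsv`; only `stop_words='english'` is carried over from the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy phrases, not taken from train.tsv
toy_phrases = [
    "the movie was great",
    "the film was great",
    "the movie was dull",
]

toy_vectorizer = CountVectorizer(stop_words='english')
toy_counts = toy_vectorizer.fit_transform(toy_phrases)

# vocabulary_ maps each word to its column index; sorting by index
# recovers the column order of the count matrix
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_vocab)
print(toy_counts.toarray())
```

Each row is a phrase and each column is a word that survived stop-word removal, so "the" and "was" do not get columns.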


## LDA Dimension Reduction

Now we can put LDA to use and reduce dimensionality: each phrase goes from 100 word counts to 10 topic weights.

In [3]:
from sklearn.decomposition import LatentDirichletAllocation

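def ldaify(X, y):
    # n_components was named n_topics in scikit-learn releases before 0.19
    lda = LatentDirichletAllocation(n_components=10)
    # LDA is unsupervised: y is accepted for API consistency and ignored
    return lda.fit_transform(X, y)

L = ldaify(X, subset.Sentiment)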
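
As a rough illustration of what the transform returns, the sketch below fits LDA on six made-up phrases and prints the topic weights. The phrases, the choice of two topics, and `random_state=0` are assumptions made for the example. `n_components` is the parameter name in recent scikit-learn releases (older ones called it `n_topics`).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy phrases, not taken from train.tsv
toy_phrases = [
    "the movie had a great plot",
    "a great film with a great cast",
    "the film and the movie share a plot",
    "the soup was salty",
    "salty soup and cold bread",
    "cold bread with soup",
]

toy_vectorizer = CountVectorizer(stop_words='english')
toy_counts = toy_vectorizer.fit_transform(toy_phrases)

# Two topics and a fixed seed are arbitrary choices for this sketch
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_topics = toy_lda.fit_transform(toy_counts)

print(toy_topics.shape)        # one row per phrase, one column per topic
print(toy_topics.sum(axis=1))  # each row of topic weights sums to about 1

# components_ holds one row of word weights per topic
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
for k, weights in enumerate(toy_lda.components_):
    top_words = [toy_vocab[i] for i in np.argsort(weights)[::-1][:3]]
    print("topic %d: %s" % (k, ", ".join(top_words)))
```

On a corpus this small the topics are not guaranteed to separate cleanly, so the printed top words are best read as a way to inspect the model rather than as a result.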

## Learning

Now that each phrase is represented by its topic weights, we can begin to learn. We will compute cross-validation accuracy using Random Forest, AdaBoost, and SVC models.

In [4]:
def cv(X, y):
    import time
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.svm import SVC
    # sklearn.cross_validation was removed in scikit-learn 0.20
    from sklearn.model_selection import cross_val_score

    forest = RandomForestClassifier(n_estimators=500)
    boost = AdaBoostClassifier()
    svc = SVC()

    # cv=3 matches the old default; newer releases default to 5 folds
    t0 = time.time()
    print("Random Forest cross validation running...")
    forest_score = cross_val_score(forest, X, y, cv=3).mean()
    print("Random Forest Score: %2.2f" % forest_score)
    print("dt: %f" % (time.time() - t0))
    print("")

    t0 = time.time()
    print("AdaBoost cross validation running...")
    boost_score = cross_val_score(boost, X, y, cv=3).mean()
    print("AdaBoost Score:      %2.2f" % boost_score)
    print("dt: %f" % (time.time() - t0))
    print("")

    t0 = time.time()
    print("SVC cross validation running...")
    svc_score = cross_val_score(svc, X, y, cv=3).mean()
    print("SVC Score:           %2.2f" % svc_score)
    print("dt: %f" % (time.time() - t0))
    print("")

cv(L, subset.Sentiment)

Random Forest cross validation running...
Random Forest Score: 0.49
dt: 2.209551

AdaBoost cross validation running...
AdaBoost Score:      0.55
dt: 0.222177

SVC cross validation running...
SVC Score:           0.59
dt: 0.023728
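
One variation on the cross validation above, which this notebook does not use, is to wrap the vectorizer, LDA, and classifier in a single scikit-learn `Pipeline`, so that each fold fits the vocabulary and topics on its own training split only. The sketch below assumes a recent scikit-learn release and uses made-up phrases and labels in place of `train.Phrase` and `train.Sentiment`.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Hypothetical toy data standing in for train.Phrase / train.Sentiment
toy_phrases = [
    "great movie", "wonderful film", "great acting",
    "wonderful story", "great fun", "wonderful cast",
    "dull movie", "boring film", "dull acting",
    "boring story", "dull plot", "boring cast",
]
toy_labels = [1] * 6 + [0] * 6

# Two topics and a fixed seed are arbitrary choices for this sketch
toy_pipeline = make_pipeline(
    CountVectorizer(stop_words='english'),
    LatentDirichletAllocation(n_components=2, random_state=0),
    SVC(),
)

# Every fold refits the vectorizer and LDA on that fold's training split
toy_scores = cross_val_score(toy_pipeline, toy_phrases, toy_labels, cv=3)
print("mean accuracy: %2.2f" % toy_scores.mean())
```

The toy data is far too small for the score to mean anything; the point is only the structure, where raw phrases go in and the preprocessing is repeated inside each fold.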



In [6]:
# Vectorize
print("vectorizing training data...\n")
# Fit one vectorizer on the training phrases and reuse it for the test set,
# so both matrices share the same vocabulary and column order
vectorizer = CountVectorizer(stop_words='english', min_df=2, max_df=0.95, max_features=100)
X = vectorizer.fit_transform(train.Phrase)

# LDA
print("running lda transform...\n")
# n_components was named n_topics in scikit-learn releases before 0.19
lda = LatentDirichletAllocation(n_components=10)
# LDA is unsupervised, so no labels are passed
L = lda.fit_transform(X)

# Fit
print("training random forest...\n")
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=500, max_depth=5)
forest.fit(L, train.Sentiment)

# Vectorize test data with the vocabulary learned from the training data
print("vectorizing test data...\n")
X_test = vectorizer.transform(test.Phrase)
L_test = lda.transform(X_test)

# Predict
print("predicting...\n")
y = forest.predict(L_test)

# Export results
print("exporting...\n")
df = pd.DataFrame({
    'PhraseId': test.PhraseId,
    'Sentiment': y
})

df.to_csv('results.csv', index=False)
print("done")

vectorizing training data...

running lda transform...

training random forest...

vectorizing test data...

predicting...

exporting...

done
