# Bag of Words with LDA

Latent Dirichlet Allocation (LDA) is a topic model: it describes each document as a mixture of topics, and each topic as a distribution over words. We use it here to reduce the dimensionality of our data. Words that tend to appear in the same contexts, such as **movie**, **film**, and **show**, might end up weighted heavily in a single topic, which we could call **MOVIE_related**. Representing a phrase by a few topic weights instead of one feature per word greatly reduces the number of features. Hopefully this will improve our learner's score.
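To make the dimensionality argument concrete, here is a toy sketch. It uses a hand-written word-to-topic mapping, not a trained LDA model: the mapping, the topic names other than **MOVIE_related**, and the example phrase are all made up for illustration. LDA learns soft, probabilistic word-topic weights rather than a hard lookup like this one.

```python
from collections import Counter

# Hypothetical, hand-written mapping. A real LDA model would learn
# probabilistic word-topic weights from the corpus instead.
WORD_TO_TOPIC = {
    "movie": "MOVIE_related",
    "film": "MOVIE_related",
    "show": "MOVIE_related",
    "great": "POSITIVE_related",
    "good": "POSITIVE_related",
}

def topic_counts(phrase):
    """Count how many words of the phrase fall into each topic."""
    counts = Counter()
    for word in phrase.lower().split():
        topic = WORD_TO_TOPIC.get(word)
        if topic is not None:  # words without a topic are dropped
            counts[topic] += 1
    return counts

phrase = "great film and a good show"
word_features = Counter(phrase.split())
topic_features = topic_counts(phrase)
print("%d word features -> %d topic features"
      % (len(word_features), len(topic_features)))
```

Six distinct words collapse into two topic features here; with a real vocabulary of thousands of words and 100 topics the reduction is far larger.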

In [2]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
train = pd.read_csv('../Lemmatization/result_train.csv', encoding='ascii')
test = pd.read_csv('../Lemmatization/result_test.csv', encoding='ascii')

# convert empty fields into empty strings
train.fillna('', inplace=True)
test.fillna('', inplace=True)

# Keep only the first 500 rows of each set
train = train[:500]
test = test[:500]

## Split the Data

We will split the training data into two halves. One half will be used to train the model, and the other will be held out so the model can make predictions on it. Later on, we will feed these predictions into a higher-level learner as a feature. Since we are using labeled data, we can give the ensemble learner these predictions along with the corresponding true sentiments. This will allow the ensemble learner to fit the predictions to the actual sentiments.

In [3]:
split_index = int(len(train) / 2)
cv = train[:split_index]
train = train[split_index:]
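The ensemble step itself lives outside this notebook, but the idea of fitting predictions to actual sentiments can be sketched in a few lines. This is a toy stand-in, not the ensemble learner we will actually use: the `held_out` pairs are invented, `fit_meta` is a hypothetical helper, and a real ensemble would combine several base models rather than one.

```python
from collections import Counter, defaultdict

# Hypothetical held-out rows: (base-model prediction, true sentiment).
held_out = [
    (2, 2), (2, 2), (2, 1),
    (3, 4), (3, 4), (3, 3),
    (1, 1), (1, 1), (1, 0),
]

def fit_meta(pairs):
    """Toy meta-learner: for each base prediction, remember the true
    label that most often accompanied it on the held-out half."""
    seen = defaultdict(Counter)
    for predicted, actual in pairs:
        seen[predicted][actual] += 1
    return {predicted: counts.most_common(1)[0][0]
            for predicted, counts in seen.items()}

meta = fit_meta(held_out)

# Apply the learned correction to new base-model predictions.
new_predictions = [3, 2, 1]
corrected = [meta.get(p, p) for p in new_predictions]
print(corrected)
```

In this made-up data the base model tends to say 3 when the truth is 4, and the meta-learner picks that up. This only works because the base model never saw the held-out half during training, which is why we split before fitting anything.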

## Vectorization

Fit the vectorizer on all of the phrases (training, cross-validation, and test) so that every split shares the same vocabulary.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english', min_df=2, max_df=0.95)
all_phrases = pd.concat([train.Phrase, cv.Phrase, test.Phrase])
vectorizer.fit(all_phrases)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=0.95, max_features=None, min_df=2,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
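The `min_df=2` and `max_df=0.95` arguments prune the vocabulary by document frequency: a term must appear in at least 2 documents and in no more than 95% of them. The plain-Python sketch below mimics that rule; it is not scikit-learn's implementation, `prune_vocabulary` is a hypothetical helper, tokenization is a simple whitespace split, and stop-word removal is left out.

```python
from collections import Counter

def prune_vocabulary(docs, min_df=2, max_df=0.95):
    """Keep terms whose document frequency is at least min_df (a count)
    and at most max_df * len(docs) (a proportion of the corpus)."""
    doc_freq = Counter()
    for doc in docs:
        # set() so a term counts once per document, however often it repeats
        doc_freq.update(set(doc.lower().split()))
    max_count = max_df * len(docs)
    return sorted(term for term, df in doc_freq.items()
                  if min_df <= df <= max_count)

# Made-up phrases for illustration.
docs = [
    "the movie was good",
    "the film was good",
    "the show was dull",
    "the movie was long",
]
print(prune_vocabulary(docs))  # ['good', 'movie']
```

Here "the" and "was" appear in every document and exceed the `max_df` cutoff, while "film", "show", "dull", and "long" appear only once and fall below `min_df`.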

Vectorize the training data.

In [6]:
X_train = vectorizer.transform(train.Phrase)
y_train = train.Sentiment

Vectorize the cross-validation data.

In [7]:
X_cv = vectorizer.transform(cv.Phrase)
y_cv = cv.Sentiment

Vectorize the test data.

In [8]:
X_test = vectorizer.transform(test.Phrase)

## LDA

Now we'll train LDA to describe each phrase as a mixture of 100 topics.

In [9]:
from sklearn.decomposition import LatentDirichletAllocation
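# Note: n_topics was renamed n_components in scikit-learn 0.19
lda = LatentDirichletAllocation(n_topics=100, n_jobs=-1)
# LDA is unsupervised: y_train is accepted for API consistency but ignored
lda.fit(X_train, y_train)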

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_jobs=-1, n_topics=100, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

Transform each split into its topic representation.

In [10]:
L_train = lda.transform(X_train)
L_cv = lda.transform(X_cv)
L_test = lda.transform(X_test)
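To see what the transformed matrices look like, here is a small self-contained sketch on made-up phrases. It assumes a recent scikit-learn release, where the argument is called `n_components` (the cells above use the older `n_topics` name) and where `transform` returns one normalized topic distribution per document. The toy corpus, the choice of 2 topics, and `random_state=0` are for illustration only.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up phrases for illustration.
docs = [
    "good movie good film",
    "great film great show",
    "dull plot weak story",
    "weak plot dull story",
]

counts = CountVectorizer().fit_transform(docs)

# Assumes a scikit-learn release that uses n_components instead of n_topics.
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_lda.fit(counts)
doc_topic = toy_lda.transform(counts)

# One row per document, one column per topic.
print(doc_topic.shape)
# Each row is a distribution over topics, so it should sum to about 1.
print(np.allclose(doc_topic.sum(axis=1), 1.0))
```

In our case the same reasoning gives `L_train`, `L_cv`, and `L_test` one column per topic, which is the reduced feature set the classifier sees.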

## SVC

SVC has worked well for our other models, so let's try it here. We should probably test other representations later on.

Train

In [11]:
from sklearn.svm import SVC
svc = SVC()
svc.fit(L_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Predict

In [12]:
y_cv_pred = svc.predict(L_cv)
y_test_pred = svc.predict(L_test)

## Save Predictions

Save the cross-validation predictions.

In [13]:
results_cv = pd.DataFrame({
    'PhraseId': cv.PhraseId,
    'Predicted': y_cv_pred,
    'Sentiment': cv.Sentiment
})
results_cv.to_csv('results_train.csv', index=False)

Save the test predictions.

In [15]:
results_test = pd.DataFrame({
    'PhraseId': test.PhraseId,
    'Sentiment': y_test_pred
})
results_test.to_csv('results_test.csv', index=False)

## LDA Results

In [20]:
# svc.score returns the mean accuracy on the held-out half
print("Cross Validation score: %f" % svc.score(L_cv, y_cv))

Cross Validation score: 0.632000
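For a classifier, `score` is mean accuracy, so the number is easier to judge next to a trivial baseline that always predicts the most common label. If one sentiment class dominates the data, that baseline can already be fairly high. The sketch below shows the comparison in plain Python; the labels are invented and the helper names are hypothetical, so the printed values say nothing about our actual data.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / float(len(y_true))

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / float(len(y_true))

# Hypothetical labels, for illustration only.
y_true = [2, 2, 2, 3, 1, 2, 4, 2]
y_pred = [2, 2, 3, 3, 1, 2, 2, 2]

print("accuracy: %.3f" % accuracy(y_true, y_pred))
print("majority baseline: %.3f" % majority_baseline(y_true))
```

Running `majority_baseline` on `y_cv` would tell us how much of the cross-validation score comes from the topic features rather than from class imbalance.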


![Kaggle Results]()