# Sentiment Analysis Using Doc2Vec

### We use Doc2Vec for sentiment analysis by attempting to classify the Cornell IMDB movie review corpus (http://www.cs.cornell.edu/people/pabo/movie-review-data/).

## Setup

In [1]:
# gensim modules
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec

# numpy
import numpy

# classifier
from sklearn.linear_model import LogisticRegression



#### Data:
* **test-neg.txt**: 12500 negative movie reviews from the test data
* **test-pos.txt**: 12500 positive movie reviews from the test data
* **train-neg.txt**: 12500 negative movie reviews from the training data
* **train-pos.txt**: 12500 positive movie reviews from the training data
* **train-unsup.txt**: 50000 unlabelled movie reviews

## Feeding Data to Doc2Vec

In [2]:
# Each sentence needs to be labeled in this format:
# [['word1', 'word2', 'word3', 'lastword'], ['label1']]
# This is done with LabeledSentence
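
As a rough illustration of the labeled format described above, the sketch below builds the same structure with a plain `namedtuple`. `LabeledExample` and `label_lines` are hypothetical names introduced here so the example runs without gensim; they stand in for gensim's `LabeledSentence` and are not part of it.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's LabeledSentence: a word list plus a label list.
LabeledExample = namedtuple('LabeledExample', ['words', 'labels'])


def label_lines(lines, prefix):
    """Split each line on whitespace and label it '<prefix>_<line number>'."""
    return [LabeledExample(line.split(), ['%s_%s' % (prefix, item_no)])
            for item_no, line in enumerate(lines)]


# Two made-up reviews, one per line, as in the data files.
reviews = ['a great movie', 'not worth watching']
labeled = label_lines(reviews, 'TRAIN_POS')
for example in labeled:
    print(example.words, example.labels)
```

Each review becomes one word list paired with a single label, which is the shape Doc2Vec expects as input.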

In [6]:
class LabeledLineSentence(object):
    def __init__(self, sources):
        # sources maps each file path to the prefix used to build its sentence labels
        self.sources = sources

        flipped = {}

        # make sure that each prefix is used by only one file
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        # stream one LabeledSentence per line, labeled '<prefix>_<line number>'
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    yield LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no])

    def to_array(self):
        # load every labeled sentence into memory and keep the list on the instance
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    self.sentences.append(LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))
        return self.sentences

    def sentences_perm(self):
        # return the sentences in random order; to_array() must be called first
        return numpy.random.permutation(self.sentences)

In [7]:
sources = {'data/test-neg.txt':'TEST_NEG', 'data/test-pos.txt':'TEST_POS', 
           'data/train-neg.txt':'TRAIN_NEG', 'data/train-pos.txt':'TRAIN_POS', 'data/train-unsup.txt':'TRAIN_UNS'}
sentences = LabeledLineSentence(sources)
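
Each prefix has to be unique because labels are built as `prefix_N`: two files sharing a prefix would produce colliding labels such as `TRAIN_0`. The minimal sketch below shows one way to list offending prefixes before constructing the object. `duplicate_prefixes` is a hypothetical helper, and the prefixes in `clash` are made up for illustration.

```python
from collections import Counter


def duplicate_prefixes(sources):
    """Return the prefixes that are assigned to more than one file, sorted."""
    counts = Counter(sources.values())
    return sorted(prefix for prefix, n in counts.items() if n > 1)


# Distinct prefixes: nothing to report.
ok = {'data/train-neg.txt': 'TRAIN_NEG', 'data/train-pos.txt': 'TRAIN_POS'}
# Illustrative clash: both files would generate labels TRAIN_0, TRAIN_1, ...
clash = {'data/train-neg.txt': 'TRAIN', 'data/train-pos.txt': 'TRAIN'}

print(duplicate_prefixes(ok))
print(duplicate_prefixes(clash))
```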

## Model - Building the Vocabulary Table

In [8]:
model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=7)
model.build_vocab(sentences.to_array())
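
To give a feel for what a vocabulary table holds, here is a simplified sketch in plain Python. It counts word frequencies over tokenized sentences and drops words that occur fewer times than a threshold, which is the role `min_count` plays in the call above. `build_vocab_table` is a hypothetical helper written for this illustration; it is not gensim's implementation.

```python
from collections import Counter


def build_vocab_table(sentences, min_count=1):
    """Map each word to its frequency, keeping words seen at least min_count times."""
    counts = Counter(word for words in sentences for word in words)
    return {word: count for word, count in counts.items() if count >= min_count}


# Toy tokenized corpus, made up for this example.
toy = [['good', 'movie'], ['bad', 'movie'], ['good', 'plot']]

vocab = build_vocab_table(toy, min_count=2)
print(sorted(vocab.items()))
```

With `min_count=1`, as in the notebook, no word is discarded, so every token in the corpus ends up in the table.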

