To learn feature representations from text, words and sentences must first be converted to vectors so that an algorithm can perform mathematical computation on them.
The vectors that result from converting words are called word embeddings. word2vec is an algorithm that learns such vectors from words; they can then be used for semantic similarity, sentiment analysis, and other prediction tasks.
In this tutorial we will use a word2vec model to perform sentiment analysis on the Cornell IMDB movie review database.
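
To make the idea of converting words to vectors concrete, here is a minimal sketch of an embedding lookup. The three-dimensional vectors and the `embed` helper are made up for illustration; a trained model would learn the values, and real vectors usually have a hundred or more dimensions.

```python
# Made-up 3-dimensional vectors; a trained model would learn these values.
toy_embeddings = {
    "good": [0.9, 0.1, 0.3],
    "movie": [0.2, 0.8, 0.5],
    "bad": [-0.7, 0.6, 0.1],
}


def embed(sentence, embeddings):
    """Map each known word of a whitespace-tokenised sentence to its vector."""
    return [embeddings[w] for w in sentence.lower().split() if w in embeddings]


# Each word becomes a list of numbers that an algorithm can compute with.
vectors = embed("Good movie", toy_embeddings)
```

Words missing from the table are simply skipped in this sketch; how unknown words are handled in practice depends on the library.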

In [1]:
#We will use Gensim for word2vec. 

#Import 
import numpy as np
import pandas as pd
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec
from random import shuffle
from sklearn.linear_model import LogisticRegression

We will be working with the following datasets.
The goal is to end up with five documents:
test-neg.txt: 12500 negative movie reviews from the test data
test-pos.txt: 12500 positive movie reviews from the test data
train-neg.txt: 12500 negative movie reviews from the training data
train-pos.txt: 12500 positive movie reviews from the training data
train-unsup.txt: 50000 Unlabelled movie reviews
Each movie review is formatted as a single sentence on its own line.
Each sentence ends with a newline; the parser depends on this to identify where a new sentence starts.
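
The one-review-per-line layout could be produced along the following lines. This is only a sketch: `review_to_line` is a hypothetical helper, and the assumption that raw reviews may contain HTML line breaks or embedded newlines is mine, not something the files above guarantee.

```python
import re


def review_to_line(raw_review):
    """Collapse a raw review into one lowercase line (hypothetical helper)."""
    # Replace HTML line breaks such as <br />, which raw reviews may contain.
    text = re.sub(r"<br\s*/?>", " ", raw_review)
    # Collapse all whitespace, including embedded newlines, into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


raw_reviews = [
    "A great film.<br /><br />Loved it!",
    "Too long,\nand the ending\nmade no sense.",
]
lines = [review_to_line(r) for r in raw_reviews]

# One review per line, each terminated by a newline.
corpus_text = "\n".join(lines) + "\n"
```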

word2vec only converts words to vectors, but doc2vec converts words to vectors and also aggregates the words
of a sentence into a single vector. It does this by adding a label for each sentence as a special word, which gets its own vector.
So we have to format sentences as
[['word1', 'word2', 'word3', 'lastword'], ['label1']]
LabeledSentence does exactly that. To use it we need to convert our corpus into sentences. LabeledLineSentence
can do that, but only for a single file, and in practice we have to deal with many files. So we need to write our own LabeledLineSentence
class to do this for us.
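
The words-plus-label format can be sketched without gensim. The named tuple below is only a stand-in that assumes LabeledSentence exposes a `words` field and a `tags` field; `label_lines` is a hypothetical helper, not part of the library.

```python
from collections import namedtuple

# Stand-in with two fields: the tokenised words and the list of labels.
LabeledSentenceSketch = namedtuple("LabeledSentenceSketch", ["words", "tags"])


def label_lines(lines, prefix):
    """Turn raw lines into labelled sentences tagged PREFIX_0, PREFIX_1, ..."""
    return [
        LabeledSentenceSketch(words=line.split(), tags=["%s_%s" % (prefix, i)])
        for i, line in enumerate(lines)
    ]


labelled = label_lines(["a great film", "too long and dull"], "TRAIN_POS")
```

Each label combines the file prefix with the line number, so every sentence in the corpus gets a unique label as long as the prefixes themselves are unique.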

In [2]:
class LabeledLineSentence(object):
    def __init__(self, sources):
        self.sources = sources
        
        flipped = {}
        #make sure that the prefixes (the dict values) are unique
        
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')
    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    yield LabeledSentence(utils.to_unicode(line).split(),
                                          [prefix + '_%s' % item_no])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    self.sentences.append(LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))
        return self.sentences
    
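    def sentence_perm(self):
        #shuffle a copy of the list; np.random.permutation would convert
        #the LabeledSentence objects into plain numpy array rows
        shuffled = list(self.sentences)
        shuffle(shuffled)
        return shuffled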

In [3]:
sources = {'test-neg.txt':'TEST_NEG', 'test-pos.txt':'TEST_POS', 'train-neg.txt':'TRAIN_NEG', 'train-pos.txt':'TRAIN_POS', 'train-unsup.txt':'TRAIN_UNS'}

sentences = LabeledLineSentence(sources)

In [4]:
#doc2vec requires a vocabulary: the set of unique words used in the corpus.
#LabeledLineSentence.to_array returns the corpus as an array of sentences,
#and build_vocab creates the vocabulary from it.
model = Doc2Vec(min_count=1, window = 10, size=100, sample=1e-4, negative=5, workers=7)

model.build_vocab(sentences.to_array())
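
As a rough illustration of what a vocabulary is and what `min_count` controls, the sketch below counts words and drops those seen fewer than `min_count` times. `build_vocabulary` is a hypothetical helper; gensim's `build_vocab` does considerably more bookkeeping than this.

```python
from collections import Counter


def build_vocabulary(sentences, min_count=1):
    """Count words across tokenised sentences and keep the frequent ones."""
    counts = Counter(word for words in sentences for word in words)
    return {word: n for word, n in counts.items() if n >= min_count}


toy_sentences = [
    ["a", "great", "film"],
    ["a", "dull", "film"],
]
vocab_all = build_vocabulary(toy_sentences, min_count=1)       # keeps every word
vocab_frequent = build_vocabulary(toy_sentences, min_count=2)  # keeps repeated words only
```

With `min_count=1`, as in the cell above, no word is discarded.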

In [None]:
#Now we need to train the model. In each epoch the model is fed the sentences
#in a random order produced by the sentence_perm function.
sentences = list(sentences.sentence_perm())
model.train(sentences, total_examples=model.corpus_count, epochs = model.iter)

Note: if sentence_perm returns np.random.permutation(self.sentences), training fails in gensim's job_producer thread with
AttributeError: 'numpy.ndarray' object has no attribute 'words'
NumPy converts the list of LabeledSentence objects into an array, so each item loses its words attribute. Shuffling a plain Python list with random.shuffle keeps the objects intact.
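
The training code reads `sentence.words` from every item it receives, so a shuffle has to hand back the original sentence objects. Below is a minimal sketch using only the standard library; the `Sentence` named tuple is a stand-in for LabeledSentence and `shuffled_copy` is a hypothetical helper.

```python
import random
from collections import namedtuple

# Stand-in for LabeledSentence: anything with .words and .tags will do here.
Sentence = namedtuple("Sentence", ["words", "tags"])


def shuffled_copy(sentences, seed=None):
    """Return a new list with the same sentence objects in random order."""
    rng = random.Random(seed)
    result = list(sentences)  # copy, so the original order is left untouched
    rng.shuffle(result)       # in-place shuffle of the copy
    return result


corpus = [Sentence(["word%d" % i], ["DOC_%d" % i]) for i in range(5)]
epoch_order = shuffled_copy(corpus, seed=0)
```

Because the result is still a list of the same objects, every item keeps its `words` attribute.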



In [None]:
#Inspecting the model: let's see what the model returns as similar words for the word "good".

model.most_similar('good')
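
A similarity query of this kind ranks the other words by how close their vectors are to the query word's vector, typically using cosine similarity. The sketch below shows the idea on made-up two-dimensional vectors; this `most_similar` function is a simplified stand-in, not gensim's implementation.

```python
import math


def most_similar(word, vectors, topn=2):
    """Rank the other words in `vectors` by cosine similarity to `word`."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return scored[:topn]


# Made-up vectors for illustration only.
toy_vectors = {
    "good": [0.9, 0.1],
    "nice": [0.8, 0.3],
    "bad": [-0.9, 0.2],
    "film": [0.1, 0.9],
}
neighbours = most_similar("good", toy_vectors, topn=2)
```

The result is a list of (word, similarity) pairs, highest similarity first.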

In [None]:
#We can also look under the hood and see the vectors inside the model,
#one for each word and each sentence. The word vectors are stored in model.syn0,
#the weights of the network's first (input) layer. But instead of printing the vectors of all
#the words, we want to print the vector of one sentence.

model['TRAIN_NEG_0']

#To avoid rerunning model we can save it
model.save('./imdb.d2v')
#and load it
model = Doc2Vec.load('./imdb.d2v')

Classifying Sentiments:
Note that we have a total of 25,000 training review vectors: 12,500 positive and 12,500 negative. We need to create a numpy array with the review sentence vectors and another with the labels.

In [None]:
#25,000 rows: positive reviews in rows 0..12499, negative reviews in rows 12500..24999
train_arrays = np.zeros([25000, 100])
train_labels = np.zeros([25000])

for i in range(12500):
    prefix_train_pos = 'TRAIN_POS_' + str(i)
    prefix_train_neg = 'TRAIN_NEG_' + str(i)
    train_arrays[i] = model[prefix_train_pos]
    train_arrays[12500 + i] = model[prefix_train_neg]
    train_labels[i] = 1
    train_labels[12500 + i] = 0
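
The same layout can also be built by stacking instead of filling a preallocated array. This is an alternative sketch at toy scale: `fake_model` is a plain dict standing in for the trained model, and the sizes are shrunk from 12500 x 100 so the example runs on its own.

```python
import numpy as np

n_per_class, dim = 3, 4  # toy sizes; the tutorial uses 12500 reviews and 100 dimensions

# Stand-in for the trained model: a dict from sentence label to a random vector.
rng = np.random.RandomState(0)
fake_model = {}
for prefix in ("TRAIN_POS", "TRAIN_NEG"):
    for i in range(n_per_class):
        fake_model["%s_%d" % (prefix, i)] = rng.rand(dim)

# Stack the positive vectors first, then the negative ones.
pos = np.vstack([fake_model["TRAIN_POS_%d" % i] for i in range(n_per_class)])
neg = np.vstack([fake_model["TRAIN_NEG_%d" % i] for i in range(n_per_class)])
train_arrays = np.vstack([pos, neg])

# Labels follow the same order: 1 for positive, 0 for negative.
train_labels = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
```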

In [None]:
#The training array looks like this: rows and rows of vectors,
#each representing one sentence

print(train_arrays)

In [None]:
print(train_labels)

In [None]:
#Similarly for test data
test_arrays = np.zeros([25000, 100])
test_labels = np.zeros([25000])

for i in range(12500):
    prefix_test_pos = 'TEST_POS_' + str(i)
    prefix_test_neg = 'TEST_NEG_' + str(i) 
    test_arrays[i] = model[prefix_test_pos]
    test_arrays[12500 + i] = model[prefix_test_neg]
    test_labels[i] = 1
    test_labels[12500 + i] = 0

In [None]:
#Classification: train a logistic regression classifier using sklearn
clf = LogisticRegression()
clf.fit(train_arrays, train_labels)

In [None]:
#Test classifier's accuracy
clf.score(test_arrays, test_labels)
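
For a scikit-learn classifier, `score` reports the mean accuracy: the fraction of test samples whose predicted label matches the true label. The calculation can be sketched by hand as follows; `accuracy` is a hypothetical helper and the labels are toy values, not results from this model.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    matches = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return matches / float(len(true_labels))


# Toy labels: three of the four predictions are correct.
toy_true = [1, 1, 0, 0]
toy_pred = [1, 0, 0, 0]
toy_accuracy = accuracy(toy_true, toy_pred)
```

Since the test set here has equal numbers of positive and negative reviews, a classifier that always guesses one class would score 0.5 on this measure.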