Model used: Word2Vec

There is no need to forward-propagate our data set through a network: the total number of words is small, and the result is in fact already stored in the form of a dictionary. We simply look up each word in the dictionary to get its Word2Vec features.

The dictionary used is the pre-trained Google News vectors file (GoogleNews-vectors-negative300.bin), available at: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

Gensim is a Python library designed to automatically extract semantic topics from documents.
It can process raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation and Random Projections discover semantic structure of documents by examining statistical co-occurrence patterns of the words within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary - and we only need a corpus of plain text documents.

Once these statistical patterns are found, any plain text documents can be succinctly expressed in the new, semantic representation and queried for topical similarity against other documents.

The word2vec class is used for training, using and evaluating the unsupervised neural networks described in the paper: http://arxiv.org/pdf/1405.4053v2.pdf. Loading pre-trained vectors through that class is now deprecated, so I have used KeyedVectors, which is the current replacement.

The algorithm represents each overview by a dense vector which is trained to predict words in the document.
Its construction gives the algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results in the paper show Paragraph Vectors outperforming bag-of-words models as well as other techniques for text representation.
The paper also reports state-of-the-art results on several text classification and sentiment analysis tasks.

In [1]:
from gensim import models
modelDict = models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, limit=500000)

Using TensorFlow backend.


In [3]:
# Now we can simply look up a word in the model loaded above.
# For example, modelDict['movies'] returns the Word2Vec representation of the word "movies".
modelDict['movies'].shape  # shape of the Word2Vec representation of "movies"
# In this way, we can represent the words in our overviews using the Word2Vec model.

(300,)

In [3]:
import pickle
f=open('final_movies_set.pckl','rb')
finalMoviesSet=pickle.load(f)
f.close()
del f

In [5]:
len(finalMoviesSet)

1278

The preprocessing we do here is to delete commonly occurring words that we know are not informative about the genre. Such words are referred to as "stop words" and include simple words like "a", "and", "but", "how", "or", etc. The overviews are tokenized with the Python package NLTK, and the stop words are removed using the English list from the stop_words package.

As a result of this preprocessing, movies whose overviews contain only stop words, or contain no words with a word2vec representation, are dropped.
The others are used to build a mean word2vec representation. Concisely, the preprocessing steps are as follows:
<ul>
<li>Take the movie overview</li>
<li>Throw out stop words</li>
<li>For non-stop words:
<ul>
<li>If the word is in word2vec, take its word2vec representation, which is 300-dimensional</li>
<li>If not, throw the word out</li>
</ul>
</li>
<li>For each movie, calculate the arithmetic mean of the 300-dimensional vector representations of all words in the overview that weren't thrown out</li>
</ul>

This mean becomes the 300-dimensional representation of the movie. For all movies, these are stored in a numpy array, which starts out as (1278, 300). After the movies with no usable words are dropped, the X matrix is (1261, 300) and Y is (1261, 20), i.e. the 20 binarized genres, as before.
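The averaging step can be sketched on a toy example. This is only an illustration, not the notebook's actual cell: `toy_vectors`, `toy_stop_words` and `mean_overview_vector` are hypothetical names, and the 3-dimensional vectors stand in for the real 300-dimensional Word2Vec entries.

```python
import numpy as np

# Toy stand-in for the Word2Vec dictionary (hypothetical 3-dimensional vectors).
toy_vectors = {
    "space": np.array([1.0, 0.0, 2.0]),
    "battle": np.array([3.0, 2.0, 0.0]),
}
toy_stop_words = {"a", "in", "the"}


def mean_overview_vector(tokens, vectors, stop_words):
    """Return the mean vector of the usable tokens, or None if there are none."""
    kept = [vectors[t.lower()] for t in tokens
            if t not in stop_words and t.lower() in vectors]
    if not kept:
        return None  # such a movie would be dropped from X and Y
    return np.mean(kept, axis=0)


# "zzz" has no vector and "a"/"in" are stop words, so only two words are averaged.
example = mean_overview_vector(["a", "battle", "in", "space", "zzz"],
                               toy_vectors, toy_stop_words)
# An overview made only of stop words yields no representation at all.
empty = mean_overview_vector(["the", "a"], toy_vectors, toy_stop_words)
print(example)
```

Returning `None` for an empty overview plays the role of the list of rows to delete used in the notebook.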

I have taken the arithmetic mean rather than keeping all the words separately because of a limitation in how current neural networks work; the details were taken from the paper https://jiajunwu.com/papers/dmil_cvpr.pdf, in the context of multiple instance learning.

In [7]:
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
tokenizer = RegexpTokenizer(r'\w+')

# create list of english stop words
enStop = get_stop_words('en')

In [9]:
import numpy as np
movieMeanWordvec=np.zeros((len(finalMoviesSet),300))
movieMeanWordvec.shape

(1278, 300)

In [16]:
genres=[]
rows_to_delete=[]
for i in range(len(finalMoviesSet)):
    mov=finalMoviesSet[i]
    movie_genres=mov['genre_ids']
    genres.append(movie_genres)
    overview=mov['overview']
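    tokens = tokenizer.tokenize(overview)
    # drop stop words (this check is case-sensitive; tokens are lowercased only for the Word2Vec lookup)
    stopped_tokens = [k for k in tokens if k not in enStop]
    count_in_vocab=0  # number of tokens found in the Word2Vec vocabulary
    s=0  # running sum of the word vectors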
    if len(stopped_tokens)==0:
        rows_to_delete.append(i)
        genres.pop(-1)
#         print overview
#         print "sample ",i,"had no nonstops"
    else:
        for tok in stopped_tokens:
            if tok.lower() in modelDict.vocab:
                count_in_vocab+=1
                s+=modelDict[tok.lower()]
        if count_in_vocab!=0:
            movieMeanWordvec[i]=s/float(count_in_vocab)
        else:
            rows_to_delete.append(i)
            genres.pop(-1)
#             print overview
#             print "sample ",i,"had no word2vec"

In [19]:
len(genres)

1261

In [21]:
# True for the rows we keep, False for the rows marked for deletion
rowsToDelete=set(rows_to_delete)
mask2=[row not in rowsToDelete for row in range(len(movieMeanWordvec))]

In [23]:
X=movieMeanWordvec[mask2]
X.shape

(1261, 300)

In [25]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb=MultiLabelBinarizer()
Y=mlb.fit_transform(genres)
Y.shape

(1261, 20)
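To make the shape of Y concrete, here is a hand-rolled sketch of the binarization on made-up genre-id lists. It does not call scikit-learn; it only mirrors the idea of one column per distinct genre id, in sorted order, with a 1 wherever a movie carries that genre. The names `toy_genres`, `classes`, `column` and `toy_Y` are hypothetical.

```python
import numpy as np

toy_genres = [[28, 12], [35], [12, 35, 18]]  # made-up genre-id lists, one per movie

# One column per distinct genre id, in sorted order.
classes = sorted({g for movie in toy_genres for g in movie})
column = {g: j for j, g in enumerate(classes)}

# Fill in a 1 for every (movie, genre) pair that occurs.
toy_Y = np.zeros((len(toy_genres), len(classes)), dtype=int)
for i, movie in enumerate(toy_genres):
    for g in movie:
        toy_Y[i, column[g]] = 1

print(classes)
print(toy_Y)
```

With 20 distinct genre ids across 1261 movies, the same construction gives the (1261, 20) matrix above.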

In [27]:
textualFeatures=(X,Y)
#f=open('textual_features.pckl','wb')
#pickle.dump(textualFeatures,f)
#f.close()
#del f

In [29]:
#f=open('textual_features.pckl','rb')
#textualFeatures=pickle.load(f)
#f.close()
#del f

In [30]:
print X.shape
print Y.shape

(1261, 300)
(1261, 20)


In [31]:
maskText=np.random.rand(len(X))<0.8

In [33]:
XTrain=X[maskText]
YTrain=Y[maskText]
XTest=X[~maskText]
YTest=Y[~maskText]
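The random mask sends each row to the training set with probability 0.8, so the split is only approximately 80/20 and changes on every run. A minimal sketch of a repeatable variant follows, assuming a fixed seed is acceptable; the seed value 0 and the toy arrays are arbitrary choices for illustration, not part of the notebook.

```python
import numpy as np

rng = np.random.RandomState(0)          # seed 0 is an arbitrary, hypothetical choice
toy_X = np.arange(20).reshape(10, 2)    # 10 made-up samples with 2 features each
toy_Y = np.arange(10)                   # 10 made-up labels

# Each row lands in the training set with probability 0.8.
mask = rng.rand(len(toy_X)) < 0.8
X_train, X_test = toy_X[mask], toy_X[~mask]
Y_train, Y_test = toy_Y[mask], toy_Y[~mask]

print(len(X_train), len(X_test))
```

Because the same mask indexes both arrays, every feature row stays aligned with its label.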

In [35]:
#Training steps are similar to the deep learning case for posters

from keras.models import Sequential
from keras.layers import Dense, Activation

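modelTextual = Sequential([
    Dense(300, input_shape=(300,)),
    Activation('relu'),
    Dense(20),
    # Note: softmax makes the 20 outputs sum to 1. For multi-label targets,
    # 'sigmoid' is the usual pairing with binary_crossentropy.
    # The results reported below were obtained with softmax.
    Activation('softmax'),
])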

modelTextual.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
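The model above ends in a softmax layer, while the targets are multi-label. The small numpy sketch below shows what that implies: softmax scores compete for a total of 1, whereas sigmoid scores are independent per genre. The `logits` values are made up, and `softmax` and `sigmoid` are written by hand here rather than taken from Keras.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


logits = np.array([4.0, 4.0, -4.0])  # hypothetical scores for three genres

p_softmax = softmax(logits)  # the two strong genres have to share the total mass of 1
p_sigmoid = sigmoid(logits)  # each strong genre can get a score close to 1 on its own

print(p_softmax)
print(p_sigmoid)
```

Ranking genres by score, as the top-3 selection later in the notebook does, gives the same order under both activations, since both are monotonic in the logits.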

In [37]:
modelTextual.fit(XTrain, YTrain, epochs=10, batch_size=500)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f34ee794b10>

In [38]:
modelTextual.fit(XTrain, YTrain, epochs=10000, batch_size=500,verbose=0)

<keras.callbacks.History at 0x7f34e5ff6f90>

In [41]:
score = modelTextual.evaluate(XTest, YTest, batch_size=249)



In [42]:
print("%s: %.2f%%" % (modelTextual.metrics_names[1], score[1]*100))

acc: 85.50%


In [58]:
YPreds=modelTextual.predict(XTest)
#print YPreds

f=open('Genredict.pckl','rb')
GenreIDToName=pickle.load(f)
f.close()
del f
GenreIDToName
genreList=sorted(GenreIDToName.keys())
genreList

[12,
 14,
 16,
 18,
 27,
 28,
 35,
 36,
 37,
 53,
 80,
 99,
 878,
 9648,
 10402,
 10749,
 10751,
 10752,
 10769,
 10770]

In [60]:
def precision_recall(gt,preds):
    TP=0
    FP=0
    FN=0
    for t in gt:
        if t in preds:
            TP+=1
        else:
            FN+=1
    for p in preds:
        if p not in gt:
            FP+=1
    if TP+FP==0:
        precision=0
    else:
        precision=TP/float(TP+FP)
    if TP+FN==0:
        recall=0
    else:
        recall=TP/float(TP+FN)
    return precision,recall
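As a worked example of these two measures, the sketch below uses a set-based variant, assuming that neither list contains duplicates. `precision_recall_sets` is a hypothetical helper, not the function defined above, and the genre names are taken from one of the sampled predictions in this notebook.

```python
def precision_recall_sets(gt, preds):
    """Set-based precision and recall; assumes no duplicate labels."""
    gt, preds = set(gt), set(preds)
    tp = len(gt & preds)   # predicted and actually present
    fp = len(preds - gt)   # predicted but not present
    fn = len(gt - preds)   # present but not predicted
    precision = tp / float(tp + fp) if tp + fp else 0
    recall = tp / float(tp + fn) if tp + fn else 0
    return precision, recall


actual = ['Adventure', 'Action', 'Comedy', 'Romance']
predicted = ['Adventure', 'Thriller', 'Action']

# TP = 2 (Adventure, Action), FP = 1 (Thriller), FN = 2 (Comedy, Romance)
precision, recall = precision_recall_sets(actual, predicted)
print(precision, recall)
```

Here two of the three predictions are correct (precision 2/3) and two of the four actual genres are recovered (recall 1/2).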

In [61]:
print "Our predictions for the movies are - \n"
precs=[]
recs=[]
for i in range(len(YPreds)):
    row=YPreds[i]
    gtGenres=YTest[i]
    gtGenreNames=[]
    for j in range(20):
        if gtGenres[j]==1:
            gtGenreNames.append(GenreIDToName[genreList[j]])
    top3=np.argsort(row)[-3:]
    predictedGenres=[]
    for genre in top3:
        predictedGenres.append(GenreIDToName[genreList[genre]])
    (precision,recall)=precision_recall(gtGenreNames,predictedGenres)
    precs.append(precision)
    recs.append(recall)
    if i%50==0:
        print "Predicted: ",predictedGenres," Actual: ",gtGenreNames

Our predictions for the movies are - 

Predicted:  [u'Adventure', u'Thriller', u'Action']  Actual:  [u'Adventure', u'Action', u'Comedy', u'Romance']
Predicted:  [u'Comedy', u'Thriller', u'Drama']  Actual:  [u'Drama', u'Comedy', u'Romance']
Predicted:  [u'Comedy', u'Thriller', u'Drama']  Actual:  [u'Drama', u'Comedy', u'Thriller', u'Crime']
Predicted:  [u'Action', u'Thriller', u'Drama']  Actual:  [u'Drama', u'Thriller', u'Crime', u'Romance']
Predicted:  [u'Drama', u'Family', u'Comedy']  Actual:  [u'Animation', u'Music', u'Family']
Predicted:  [u'Drama', u'Family', u'Adventure']  Actual:  [u'Adventure', u'Animation', u'Comedy', u'Science Fiction', u'Family']
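The top-3 selection used for these predictions relies on `np.argsort`, which returns indices in ascending order of score, so the last three indices are the three highest-scoring genres. A small sketch with made-up scores and a made-up label order (`toy_names` and `toy_scores` are hypothetical):

```python
import numpy as np

toy_names = ['Action', 'Comedy', 'Drama', 'Family', 'Thriller']  # made-up label order
toy_scores = np.array([0.10, 0.50, 0.05, 0.30, 0.20])            # made-up model outputs

# Indices of the 3 largest scores, with the lowest of the three first.
top3 = np.argsort(toy_scores)[-3:]
top3_names = [toy_names[j] for j in top3]
print(top3_names)
```

Since exactly three genres are always predicted, a movie with fewer than three actual genres can never reach a precision of 1.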


In [64]:
print np.mean(np.asarray(precs)),np.mean(np.asarray(recs))
#print 0.509713261649, 0.563918757467

0.509713261649 0.563918757467
