# Deep learning on video titles

In this notebook we train a neural network on the titles of the videos and the subscriber counts of their channels to predict the range of views each video gets.

In [53]:
import requests
import json
import pandas as pd
from math import *
import numpy as np
import tensorflow as tf
import time
import collections
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from IPython.display import display
from random import randint

# Database selection

We can choose which database to train on. To test our neural network we created three databases: one on the theme "animals", another on "cars", and one with random videos. We want to see whether the results differ depending on the dataset. More databases can easily be created with the notebook "create_videos_database".
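As a minimal sketch, the selection could also be driven by a theme name rather than by commenting lines in and out. The `DATABASE_FOLDERS` mapping and the `database_url` helper are hypothetical names introduced here for illustration; the folder names and the `videos.sqlite` file name are the ones used in this notebook.

```python
import os

# Hypothetical mapping from a theme name to the folders used in this notebook
DATABASE_FOLDERS = {
    'animals': 'sql_database_animaux',
    'cars': 'sql_database_voitures',
    'random': 'sql_database_random',
}

def database_url(theme):
    """Build the SQLite URL for the given theme (hypothetical helper)."""
    folder = DATABASE_FOLDERS[theme]
    return 'sqlite:///' + os.path.join(folder, 'videos.sqlite')

print(database_url('animals'))
```

The returned string has the same form as the URL passed to `pd.read_sql` in the next cell.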

In [54]:
folder = os.path.join('sql_database_animaux')
#folder = os.path.join('sql_database_voitures')
#folder = os.path.join('sql_database_random')

In [55]:
videos_database = pd.read_sql('videos', 'sqlite:///' + os.path.join(folder, 'videos.sqlite'), index_col='index')
videos_database = videos_database.drop_duplicates('id')
videos_database = videos_database.reset_index(drop=True)

display(videos_database)

print("Length of the video database :",len(videos_database))

|  | id | channelId | title | thumbnailUrl | viewCount | likeCount | dislikeCount | commentCount | subsCount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | p3mjQp8hLp8 | UCa4R_3Ii-u7RxvROF9GUJnw | bêtisier des animaux | https://i.ytimg.com/vi/p3mjQp8hLp8/default.jpg | 11889 | 40 | 2 | 1 | 239 |
| 1 | fsM9ecpIBFA | UC8_aLXmRelD95h5CLJOIryw | Quand un chien policier attaque!!! | https://i.ytimg.com/vi/fsM9ecpIBFA/default.jpg | 8479 | 22 | 5 | 0 | 18 |
| 2 | 7fgCfC3bM0U | UCpko_-a4wgz2u_DgDgd9fqA | The Try Guys Try Drag For The First Time | https://i.ytimg.com/vi/7fgCfC3bM0U/default.jpg | 23449551 | 266289 | 6713 | 20994 | 11639882 |
| 3 | GT22KpOF98E | UCOyVzq9Qv5vuiJACON4FJxA | نسخة عن FILM documentaire animaux sauvages 2016 | https://i.ytimg.com/vi/GT22KpOF98E/default.jpg | 48336 | 40 | 24 | 1 | 70 |
| 4 | ZpF-jjy72-M | UCakQLdwrxuo0KhJ49Sq2csA | Alice aux Pays des Merveilles - Extrait - Le c... | https://i.ytimg.com/vi/ZpF-jjy72-M/default.jpg | 81586 | 223 | 4 | 5 | 940704 |
| 5 | iRff2nHA-tc | UCvWx3bt6RmIE2aW-EkN0hOA | JUL - C'EST LA DANSE DES CANARDS [CLIP OFFICIEL] | https://i.ytimg.com/vi/iRff2nHA-tc/default.jpg | 487 | 16 | 7 | 7 | 48 |
| 6 | 8SHtIMrM2FI | UCckz6n8QccTd6K_xdwKqa0A | Streetart: M. Chat sort les griffes face à la ... | https://i.ytimg.com/vi/8SHtIMrM2FI/default.jpg | 15510 | 93 | 4 | 1 | 109749 |
| 7 | cZh39iYOJAs | UC5pVNCws5dA6Ef818KcsYyg | FS17 / Map Fichtelberg / ep14 / Les animaux/ v... | https://i.ytimg.com/vi/cZh39iYOJAs/default.jpg | 17770 | 499 | 10 | 72 | 128848 |
| 8 | etswSufeUek | UCLMKLU-ZuDQIsbjMvR3bbog | Présentation du mod "ANIMAL BIKES V2"! - 26 Mo... | https://i.ytimg.com/vi/etswSufeUek/default.jpg | 523449 | 5225 | 141 | 402 | 1232513 |
| 9 | fwS0PthZKSk | UC-iBrOB6NtcCYaZ_yLNnEdg | CHIEN DE SUIVI -THE HUNTER | https://i.ytimg.com/vi/fwS0PthZKSk/default.jpg | 2269 | 55 | 4 | 22 | 4200 |


Length of the video database : 3822


# Train_data creation

To build our train_data set we use the bag-of-words and Term Frequency-Inverse Document Frequency (TF-IDF) methods. TF-IDF is commonly used for text classification tasks such as sentiment analysis, and it should give interesting results in our case because we think that some particular words in a video title attract more viewers.

We also append the normalized number of subscribers to each vector.
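To make the weighting concrete, here is a minimal pure-Python sketch of sublinear TF-IDF on three made-up titles. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is our understanding of scikit-learn's defaults, and it splits titles on whitespace only, which is simpler than the tokenizer of `TfidfVectorizer`. The `tfidf` function and the toy titles are illustrative and not part of the notebook.

```python
import math
from collections import Counter

def tfidf(titles):
    """Sublinear TF-IDF with smoothed IDF and L2 normalization (toy sketch)."""
    docs = [title.lower().split() for title in titles]
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # document frequency: number of titles that contain each word
    df = Counter(word for doc in docs for word in set(doc))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # sublinear term frequency: 1 + ln(count) for the words that are present
        row = [(1 + math.log(counts[w])) * idf[w] if counts[w] else 0.0
               for w in vocab]
        # L2 normalization so that every title vector has unit length
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        vectors.append([v / norm for v in row])
    return vocab, vectors

# made-up titles, for illustration only
vocab, vectors = tfidf(['funny cat compilation', 'funny dog video', 'cat vs dog'])
for word, weight in zip(vocab, vectors[0]):
    print(word, round(weight, 3))
```

In the first title, "compilation" appears in only one title and therefore gets a larger weight than "cat" or "funny", which each appear in two.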

In [60]:
#maximal number of words to extract, it is also the maximal size of our vectors
#we experimented with this value and 2000 seems to give good results
nwords = 2000  

#stop words are words such as "the" or "is" that appear everywhere and carry no information
#we don't want those words in our vocabulary
#we get them from the file "stopwords.txt" found on the internet
with open('stopwords.txt') as f:
    stopwords = [line.rstrip('\n') for line in f]

#print('stopwords:',stopwords)

def compute_bag_of_words(text, nwords):
    vectorizer = CountVectorizer(max_features=nwords)
    vectors = vectorizer.fit_transform(text)
    vocabulary = vectorizer.get_feature_names()
    return vectors, vocabulary

#we collect all the titles to extract the words from them
concatenated_titles = videos_database['title'].tolist()

#create a vocabulary from the titles
title_bow, titles_vocab = compute_bag_of_words(concatenated_titles, nwords)

del concatenated_titles

titles_list = videos_database['title'].tolist()

#we apply the TF-IDF method to the titles
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words=stopwords, vocabulary=titles_vocab)
vect.fit(titles_list)

#create a sparse TF-IDF matrix 
titles_tfidf = vect.transform(titles_list)

del titles_list

train_data = titles_tfidf.todense()

print(train_data.shape)

def print_most_frequent(bow, vocab, n=100):
    #total count of each word over all the titles
    counts = np.asarray(bow.sum(axis=0)).ravel()
    #indices of the words sorted from most to least frequent
    idx = np.argsort(counts)[::-1]
    for j in idx[:n]:
        print(vocab[j],': ',counts[j])
 
print('most used words:')

print_most_frequent(title_bow,titles_vocab)

#print(len(titles_vocab))

#print(train_data.shape)

(3822, 2000)
most used words:
de :  706
un :  522
animaux :  473
chat :  468
les :  449
chien :  443
animal :  440
le :  429
hamster :  417
cheval :  412
oiseau :  401
canards :  393
la :  373
et :  327
des :  325
the :  240
du :  233
en :  209
pour :  171
son :  164
comment :  143
of :  131
to :  129
mon :  118
aux :  117
qui :  116
plus :  112
vs :  108
10 :  106
compilation :  104
2016 :  103
chasse :  101
in :  96
and :  96
avec :  96
video :  93
dans :  88
une :  85
funny :  84
au :  84
for :  83
est :  82
sur :  82
with :  81
top :  80
on :  77
hd :  71
how :  71
animals :  68
petit :  67
cage :  62
2015 :  61
monde :  59
par :  57
official :  55
faire :  52
danse :  52
hamsters :  49
enfant :  48
kids :  46
new :  46
parole :  45
2014 :  45
apprendre :  44
tuto :  44
epic :  44
fait :  43
sauvages :  42
enfants :  42
diy :  41
history :  41
best :  40
documentaire :  40
chiens :  40
battles :  40
rap :  39
se :  38
my :  38
season :  37
vidéo :  37
you :  37
pas :  36


In [61]:
#add the subscriber count to train_data

subsCountTemp = videos_database['subsCount'].tolist()

maxSubs = max(subsCountTemp)

print(max(subsCountTemp))

#divide every subscriber count by the maximal number of subscribers
#so that the values lie in [0, 1], the same range as the values produced by the TF-IDF algorithm
subsCount = []
for x in subsCountTemp:
    subsCount.append(x/maxSubs)
    
del subsCountTemp

#add the subs to our train_data
subsCount = np.asarray(subsCount)
subsCount = np.reshape(subsCount, [len(subsCount),1])
train_data = np.append(train_data, subsCount, 1)

del subsCount

print(train_data.shape)

52434012
(3822, 2001)


# The labels

Each of our labels corresponds to a range of views. We have 8 labels:
+ 0         to 99 views
+ 100       to 999 views
+ 1'000      to 9'999 views
+ 10'000     to 99'999 views
+ 100'000    to 999'999 views
+ 1'000'000   to 9'999'999 views
+ 10'000'000  to 99'999'999 views
+ more than 99'999'999 views
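Since each range covers one more decimal digit than the previous one, the label can also be derived from the number of digits of the view count. The sketch below illustrates this; `view_label` and `one_hot` are hypothetical helper names that do not appear in the notebook, and the view count is assumed to be a non-negative integer.

```python
def view_label(views, nbr_labels=8):
    """Return the index of the view range that contains `views`."""
    # number of decimal digits minus 2: 0-99 -> 0, 100-999 -> 1, ...
    label = len(str(int(views))) - 2
    # clamp to the first and last labels
    return min(max(label, 0), nbr_labels - 1)

def one_hot(label, nbr_labels=8):
    """One-hot encode a label index as a list of 0/1 values."""
    return [1 if k == label else 0 for k in range(nbr_labels)]

for views in (0, 99, 100, 11889, 23449551, 100000000):
    print(views, view_label(views), one_hot(view_label(views)))
```

Counting digits on the integer avoids the rounding issues that a floating-point `log10` can have exactly at the range boundaries.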

In [62]:
nbr_labels = 8
nbr_video = len(videos_database['title'])

train_labels = np.zeros([train_data.shape[0],nbr_labels])

for i in range(nbr_video):
    views = int(videos_database['viewCount'][i])

    #strict upper bounds so that e.g. 99 views falls in the "0 to 99" range
    if views < 100:
        train_labels[i] = [1,0,0,0,0,0,0,0]
    elif views < 1000:
        train_labels[i] = [0,1,0,0,0,0,0,0]
    elif views < 10000:
        train_labels[i] = [0,0,1,0,0,0,0,0]
    elif views < 100000:
        train_labels[i] = [0,0,0,1,0,0,0,0]
    elif views < 1000000:
        train_labels[i] = [0,0,0,0,1,0,0,0]
    elif views < 10000000:
        train_labels[i] = [0,0,0,0,0,1,0,0]
    elif views < 100000000:
        train_labels[i] = [0,0,0,0,0,0,1,0]
    else:
        train_labels[i] = [0,0,0,0,0,0,0,1]
        
print('train_labels shape :', train_labels.shape)


train_labels shape : (3822, 8)


# Test set extraction
We randomly extract 100 items from our dataset to build our test set, and remove them from the training set.
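One way to express such a split is to draw the test indices without replacement and keep the rest for training. This is a minimal sketch using only the standard library; `split_indices` and its `seed` parameter are hypothetical names, and the sizes 3822 and 100 are the ones reported in this notebook.

```python
import random

def split_indices(n_items, n_test, seed=None):
    """Return (train_idx, test_idx): disjoint index lists covering range(n_items)."""
    rng = random.Random(seed)
    # sample without replacement so that no item appears twice in the test set
    test_idx = rng.sample(range(n_items), n_test)
    test_set = set(test_idx)
    train_idx = [i for i in range(n_items) if i not in test_set]
    return train_idx, test_idx

train_idx, test_idx = split_indices(3822, 100, seed=0)
print(len(train_idx), len(test_idx))
```

The two index lists can then be used to select rows of the data and label arrays, which avoids deleting rows one at a time.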

In [63]:
testset = 100

test_data = np.zeros([testset,train_data.shape[1]])
test_labels = np.zeros([testset,nbr_labels])

for i in range(len(test_data)):
    #draw a row among those still in train_data (randint includes both bounds)
    x = randint(0, train_data.shape[0]-1)
    test_data[i] = train_data[x]
    test_labels[i] = train_labels[x]
    train_data=np.delete(train_data,x,axis=0)
    train_labels=np.delete(train_labels,x,axis=0)
    
print('train data shape  ', train_data.shape)
print('train labels shape', train_labels.shape)
print('test data shape   ', test_data.shape)
print('test labels shape ', test_labels.shape)

train data shape   (3722, 2001)
train labels shape (3722, 8)
test data shape    (100, 2001)
test labels shape  (100, 8)


# Neural Network Classifier

We tried different networks, with 2, 3 or even 4 layers, fully connected or not, and different activations. In the end the classic two-layer network with ReLU activation works as well as the others, or better.

$$
y=\textrm{softmax}(\textrm{ReLU}(xW_1+b_1)W_2+b_2)
$$
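The formula can be traced by hand on a tiny example. The sketch below implements the forward pass in plain Python for one input row; the sizes (3 features, 2 hidden units, 2 classes) and the weights are arbitrary values chosen for illustration, and dropout is left out.

```python
import math

def relu(v):
    """Element-wise max(0, a)."""
    return [max(0.0, a) for a in v]

def softmax(v):
    """Turn scores into probabilities that sum to 1."""
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def affine(x, W, b):
    """Compute x W + b for a row vector x and a matrix W given as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def forward(x, W1, b1, W2, b2):
    """y = softmax(ReLU(x W1 + b1) W2 + b2)"""
    return softmax(affine(relu(affine(x, W1, b1)), W2, b2))

# arbitrary toy values: 3 input features, 2 hidden units, 2 classes
x = [1.0, 0.0, 2.0]
W1 = [[0.5, -1.0], [0.3, 0.8], [-0.2, 0.4]]
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0], [0.5, 0.5]]
b2 = [0.0, 0.0]

y = forward(x, W1, b1, W2, b2)
print(y, sum(y))
```

Here the second hidden unit is negative before the ReLU and is therefore set to zero, so only the first hidden unit contributes to the class scores.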

In [51]:
# Define computational graph (CG)
batch_size = testset     # batch size
d = train_data.shape[1]  # data dimensionality
nc = nbr_labels          # number of classes

# CG inputs
xin = tf.placeholder(tf.float32,[batch_size,d])
y_label = tf.placeholder(tf.float32,[batch_size,nc])

# 1st Fully Connected layer
nfc1 = 300
Wfc1 = tf.Variable(tf.truncated_normal([d,nfc1], stddev=tf.sqrt(5./tf.to_float(d+nfc1)) ))
bfc1 = tf.Variable(tf.zeros([nfc1]))
y = tf.matmul(xin, Wfc1)
y += bfc1

# ReLU activation
y = tf.nn.relu(y)

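# dropout with a keep probability of 0.25
# (applied unconditionally, so it is also active when the test accuracy is evaluated)
y = tf.nn.dropout(y, 0.25)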

# 2nd layer
nfc2 = nc
#nfc2 = 100
Wfc2 = tf.Variable(tf.truncated_normal([nfc1,nfc2], stddev=tf.sqrt(5./tf.to_float(nfc1+nc)) ))
bfc2 = tf.Variable(tf.zeros([nfc2]))
y = tf.matmul(y, Wfc2)
y += bfc2

#y = tf.nn.relu(y)

# 3rd layer
#nfc3 = nc
#Wfc3 = tf.Variable(tf.truncated_normal([nfc2,nfc3], stddev=tf.sqrt(5./tf.to_float(nfc1+nc)) )); 
#bfc3 = tf.Variable(tf.zeros([nfc3])); 
#y = tf.matmul(y, Wfc3); 
#y += bfc3;

# Softmax
y = tf.nn.softmax(y)

# Loss
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_label * tf.log(y), 1))

# L2 Regularization
reg_loss = tf.nn.l2_loss(Wfc1)
reg_loss += tf.nn.l2_loss(bfc1)
reg_loss += tf.nn.l2_loss(Wfc2)
reg_loss += tf.nn.l2_loss(bfc2)
reg_par = 4*1e-3
total_loss = cross_entropy + reg_par*reg_loss

# Optimization scheme
train_step = tf.train.AdamOptimizer(0.001).minimize(total_loss)

# Accuracy
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_label,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

In [52]:
# Run Computational Graph
n = train_data.shape[0]
indices = collections.deque()
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
for i in range(10001):
    
    # Batch extraction
    if len(indices) < batch_size:
        indices.extend(np.random.permutation(n)) 
    idx = [indices.popleft() for i in range(batch_size)]
    batch_x, batch_y = train_data[idx,:], train_labels[idx]
    
    # Run CG for variable training
    _,acc_train,total_loss_o = sess.run([train_step,accuracy,total_loss], feed_dict={xin: batch_x, y_label: batch_y})
    
    # Run CG for test set
    if not i%100:
        print('\nIteration i=',i,', train accuracy=',acc_train,', loss=',total_loss_o)
        acc_test = sess.run(accuracy, feed_dict={xin: test_data, y_label: test_labels})
        print('test accuracy=',acc_test)


Iteration i= 0 , train accuracy= 0.15 , loss= 4.17036
test accuracy= 0.09

Iteration i= 100 , train accuracy= 0.37 , loss= 1.75347
test accuracy= 0.31

Iteration i= 200 , train accuracy= 0.42 , loss= 1.67264
test accuracy= 0.37

Iteration i= 300 , train accuracy= 0.52 , loss= 1.46247
test accuracy= 0.42

Iteration i= 400 , train accuracy= 0.57 , loss= 1.44062
test accuracy= 0.35

Iteration i= 500 , train accuracy= 0.58 , loss= 1.44749
test accuracy= 0.37

Iteration i= 600 , train accuracy= 0.5 , loss= 1.57928
test accuracy= 0.36

Iteration i= 700 , train accuracy= 0.56 , loss= 1.47601
test accuracy= 0.4

Iteration i= 800 , train accuracy= 0.52 , loss= 1.56849
test accuracy= 0.47

Iteration i= 900 , train accuracy= 0.55 , loss= 1.52185
test accuracy= 0.4

Iteration i= 1000 , train accuracy= 0.55 , loss= 1.52442
test accuracy= 0.43

Iteration i= 1100 , train accuracy= 0.58 , loss= 1.47568
test accuracy= 0.47

Iteration i= 1200 , train accuracy= 0.58 , loss= 1.44343
test accuracy= 0.47



# Results

random dataset:
+ train accuracy: ~0.7
+ test accuracy: ~0.32

"cars" dataset:
+ train accuracy: ~0.8
+ test accuracy: ~0.4

"animals" dataset:
+ train accuracy: ~0.67
+ test accuracy: ~0.41

We can see that we get better results when the videos share a theme. Unfortunately we could not use a really big dataset because of the limited memory of the virtual machine. It is a good result considering that this neural network does not take the thumbnail of the video into consideration!

Without the L2 regularization and dropout the network overfits: the train accuracy can reach 0.95, but the test accuracy drops.