# CA684 Machine Learning - Media Memorability Assignment 

## The notebook is divided into the following sections:

* Section 1 : Mounting google drive and importing the required libraries
* Section 2 : Function definitions
* Section 3 : Loading features
* Section 4 : Data preprocessing
* Section 5 : Applying different machine learning algorithms
* Section 6 : Selecting the best model
* Section 7 : Prediction on test dataset
* Section 8 : Exporting the results to CSV

## **Section 1 : Mounting google drive and importing the required libraries**

In [4]:
#mounting the drive and importing the required libraries

from google.colab import drive
import os
drive.mount('/content/gdrive/')
os.chdir('/content/gdrive/My Drive/CA684_Assignment/')

!pip install pyprind

import pandas as pd
import numpy as np
import pyprind
import matplotlib.pyplot as plt
import glob
import nltk

from string import punctuation
from collections import Counter
from sklearn.model_selection import train_test_split
from keras import Sequential
from keras import layers
from keras import regularizers
from keras.preprocessing.text import Tokenizer
from keras import preprocessing

Mounted at /content/gdrive/
Collecting pyprind
  Downloading https://files.pythonhosted.org/packages/1e/30/e76fb0c45da8aef49ea8d2a90d4e7a6877b45894c25f12fb961f009a891e/PyPrind-2.11.2-py3-none-any.whl
Installing collected packages: pyprind
Successfully installed pyprind-2.11.2


Using TensorFlow backend.


## **Section 2 : Function definitions**

In [0]:
#Function to load the captions into dataframes
def load_captions(filename):
    video_name = []
    captions = []
    dataframe = pd.DataFrame()
    with open(filename) as file:
        for line in file:
            pair = line.split() #each line holds two whitespace-separated fields, so split it into a pair
            video_name.append(pair[0]) #first field is the video name
            captions.append(pair[1]) #second field is the caption
        dataframe['video']=video_name #store the two lists as dataframe columns
        dataframe['caption']=captions
    return dataframe

#Function to load C3D features
def load_c3d(captions, c3dPath):
    files = list(captions["video"].values)
    c3dfeatures = []
    for file in files:
        file = c3dPath+file[:-4]+'txt'
        c3dfeatures.append(np.loadtxt(file))
    #print(type(c3dfeatures))
    return c3dfeatures

#Function to calculate Spearman coefficient scores
def Get_score(Y_pred,Y_true):
    '''Calculate the Spearman's correlation coefficient'''
    Y_pred = np.squeeze(Y_pred)
    Y_true = np.squeeze(Y_true)
    if Y_pred.shape != Y_true.shape:
        print('Input shapes don\'t match!')
    else:
        if len(Y_pred.shape) == 1:
            Res = pd.DataFrame({'Y_true':Y_true,'Y_pred':Y_pred})
            score_mat = Res[['Y_true','Y_pred']].corr(method='spearman',min_periods=1)
            print('The Spearman\'s correlation coefficient is: %.3f' % score_mat.iloc[1][0])
        else:
            for ii in range(Y_pred.shape[1]):
                Get_score(Y_pred[:,ii],Y_true[:,ii])
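
`Get_score` prints each coefficient but returns nothing, so the values cannot be stored or compared later. A minimal sketch of a variant that returns them, assuming SciPy is installed; the name `spearman_scores` and the toy arrays are hypothetical and not part of the notebook:

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_scores(y_pred, y_true):
    """Return one Spearman coefficient per target column (hypothetical helper)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    if y_pred.shape != y_true.shape:
        raise ValueError("Input shapes don't match")
    # Treat a 1-D input as a single target column
    if y_pred.ndim == 1:
        y_pred = y_pred[:, None]
        y_true = y_true[:, None]
    scores = []
    for col in range(y_pred.shape[1]):
        rho, _ = spearmanr(y_pred[:, col], y_true[:, col])
        scores.append(float(rho))
    return scores

# Toy data: the first column preserves the ranking, the second reverses it
y_true = np.array([[0.1, 0.9], [0.4, 0.5], [0.8, 0.2]])
y_pred = np.array([[0.2, 0.1], [0.3, 0.6], [0.9, 0.7]])
print(spearman_scores(y_pred, y_true))
```

Returning a list keeps the short-term and long-term coefficients available for building a results table instead of copying them from printed output.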

## **Section 3 : Loading features**

In [0]:
#load the ground truth dataset
csv_path ='./Dev-set/Ground-truth/'
dataset = pd.read_csv(csv_path+'ground-truth.csv')

#load captions
captions_path ='./Dev-set/Captions/dev-set_video-captions.txt'
captions = load_captions(captions_path)

#load C3D features
c3dPath = './Dev-set/C3D/'
c3dfeatures = load_c3d(captions,c3dPath)

In [7]:
dataset.head()

| |video|short-term_memorability|nb_short-term_annotations|long-term_memorability|nb_long-term_annotations|
|---|---|---|---|---|---|
|0|video3.webm|0.924|34|0.846|13|
|1|video4.webm|0.923|33|0.667|12|
|2|video6.webm|0.863|33|0.7|10|
|3|video8.webm|0.922|33|0.818|11|
|4|video10.webm|0.95|34|0.9|10|


In [8]:
captions.head()

| |video|caption|
|---|---|---|
|0|video3.webm|blonde-woman-is-massaged-tilt-down|
|1|video4.webm|roulette-table-spinning-with-ball-in-closeup-shot|
|2|video6.webm|khr-gangsters|
|3|video8.webm|medical-helicopter-hovers-at-airport|
|4|video10.webm|couple-relaxing-on-picnic-crane-shot|


In [9]:
captions.shape

(6000, 2)

In [10]:
c3dfeatures

[array([2.0249420e-02, 1.5778000e-03, 8.2625000e-04, 9.4509000e-04,
        6.2790000e-05, 3.4900000e-06, 1.1618200e-03, 9.7420000e-05,
        2.1790000e-05, 1.0330000e-05, 3.3725000e-04, 6.3631000e-04,
        1.1117000e-04, 1.0078200e-03, 3.6100000e-06, 6.3123000e-04,
        3.9050000e-05, 4.0980000e-05, 9.1250000e-05, 3.0321000e-04,
        1.5410000e-05, 3.1970000e-05, 5.2210000e-05, 6.1550000e-05,
        1.7464590e-02, 6.6581000e-04, 6.5270000e-05, 5.4450000e-05,
        2.7318000e-04, 1.3858800e-03, 3.3300000e-06, 1.3557900e-03,
        5.1650000e-04, 2.4261200e-03, 2.7191400e-03, 2.7700000e-06,
        1.5570800e-03, 2.4923000e-04, 2.6324300e-03, 9.3050000e-05,
        8.9018000e-04, 9.8830000e-05, 1.0030000e-05, 2.2525000e-04,
        9.7030000e-05, 3.3656300e-03, 7.9170000e-05, 2.3487000e-04,
        5.0306000e-04, 7.2603369e-01, 1.9330000e-05, 1.3091000e-04,
        9.6670000e-05, 1.3184000e-04, 3.4292000e-04, 4.9308000e-04,
        9.7340000e-05, 1.0900000e-06, 3.7808000e

In [11]:
len(c3dfeatures)

6000

## **Section 4 : Data preprocessing**

In [12]:
#loading the nltk stopwords of English
import nltk
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
print(f'Length of Stopwords: {len(stopwords)}')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Length of Stopwords: 179


**Removing punctuation and stop words from captions**

In [13]:
# set up progress tracker
pbar = pyprind.ProgBar(len(captions['caption']), title='Counting word occurrences')
for i, cap in enumerate(captions['caption']):
    # replace punctuation with spaces
    # and convert words to lower case
    text = ''.join([c if c not in punctuation else ' ' for c in cap]).lower()
    # remove stop words
    rmv_stopwords= ' '.join([word for word in text.split() if word not in stopwords])
    captions.loc[i,'caption'] = rmv_stopwords # update the captions in place
    pbar.update()

Counting word occurrences
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:01


In [14]:
captions.head()

| |video|caption|
|---|---|---|
|0|video3.webm|blonde woman massaged tilt|
|1|video4.webm|roulette table spinning ball closeup shot|
|2|video6.webm|khr gangsters|
|3|video8.webm|medical helicopter hovers airport|
|4|video10.webm|couple relaxing picnic crane shot|
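
The cleaning loop above can also be expressed as a small reusable function, which makes it easy to apply the identical steps to the development and test captions. This is only a sketch: `clean_caption` and `demo_stopwords` are hypothetical names, and the tiny stop-word set stands in for the NLTK list used in the notebook.

```python
from string import punctuation

# Translation table mapping every punctuation character to a space
_PUNCT_TO_SPACE = str.maketrans(punctuation, ' ' * len(punctuation))

def clean_caption(caption, stopwords):
    """Lower-case a caption, replace punctuation with spaces, drop stop words."""
    words = caption.translate(_PUNCT_TO_SPACE).lower().split()
    return ' '.join(w for w in words if w not in stopwords)

# Tiny stand-in stop-word set for illustration (assumption; the notebook uses NLTK's list)
demo_stopwords = {'is', 'down', 'with', 'in', 'at', 'on'}
print(clean_caption('blonde-woman-is-massaged-tilt-down', demo_stopwords))
# prints: blonde woman massaged tilt
```

With the real data this could be applied as `captions['caption'].apply(clean_caption, stopwords=set(stopwords))`; converting the stop-word list to a set makes each membership test constant time.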


In [15]:
#building a bag-of-words representation of the cleaned captions
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = "word",max_features=3112) 
captions_bag = vectorizer.fit_transform(captions.caption).toarray()
type(captions_bag)

numpy.ndarray

In [16]:
captions_bag

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [17]:
captions_bag.shape

(6000, 3112)

**Combining captions and C3D into a single vector**

In [0]:
#append each video's C3D vector to its bag-of-words vector
captions_c3d = [np.append(bag, c3d) for bag, c3d in zip(captions_bag, c3dfeatures)]

In [19]:
len(captions_c3d[0])

3213

In [20]:
len(captions_c3d)

6000
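
Each combined vector has 3213 entries: the 3112 bag-of-words columns plus the 101 C3D values. The same concatenation can be done in one step on whole matrices, which yields a 2-D array rather than a list of arrays. A minimal sketch with toy stand-ins (the names `bag`, `c3d` and `combined` and the small sizes are assumptions for illustration):

```python
import numpy as np

# Toy stand-ins (assumption): 4 videos, 5 bag-of-words columns, 3 C3D values each
bag = np.zeros((4, 5), dtype=int)
c3d = [np.full(3, 0.5) for _ in range(4)]

# Stack the per-video C3D vectors into a matrix, then join column-wise
combined = np.hstack([bag, np.vstack(c3d)])
print(combined.shape)  # (4, 8)
```

Because the C3D values are floats, the integer word counts are promoted to floats in the combined matrix.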

**We have the following features :**

1. Captions (bag of words) -> captions_bag
2. Captions & C3D -> captions_c3d

## **Section 5 : Applying different machine learning algorithms**

### **Model 1 : Random Forest With Captions**

In [0]:
X = captions_bag
y = dataset[['short-term_memorability','long-term_memorability']].values

In [0]:
# Splitting the dataset into the Training set and Test set
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [23]:
print('X_train ', X_train.shape)
print('X_test  ', X_test.shape)
print('Y_train ', y_train.shape)
print('Y_test  ', y_test.shape)

X_train  (4800, 3112)
X_test   (1200, 3112)
Y_train  (4800, 2)
Y_test   (1200, 2)


In [0]:
from sklearn.ensemble import RandomForestRegressor
captions_rf = RandomForestRegressor(n_estimators=100,random_state=45)

In [25]:
captions_rf.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=45, verbose=0, warm_start=False)

In [0]:
captions_pred = captions_rf.predict(X_test)

In [27]:
Get_score(captions_pred, y_test)

The Spearman's correlation coefficient is: 0.409
The Spearman's correlation coefficient is: 0.176


### **Model 2 : Random Forest With Captions And C3D Combined**

In [0]:
X = captions_c3d
y = dataset[['short-term_memorability','long-term_memorability']].values

In [0]:
# Splitting the dataset into the Training set and Test set
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [0]:
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators=50,random_state=45,verbose=2)

In [31]:
rf_regressor.fit(X_train,y_train)

building tree 1 of 50


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.2s remaining:    0.0s


building tree 2 of 50
building tree 3 of 50
building tree 4 of 50
building tree 5 of 50
building tree 6 of 50
building tree 7 of 50
building tree 8 of 50
building tree 9 of 50
building tree 10 of 50
building tree 11 of 50
building tree 12 of 50
building tree 13 of 50
building tree 14 of 50
building tree 15 of 50
building tree 16 of 50
building tree 17 of 50
building tree 18 of 50
building tree 19 of 50
building tree 20 of 50
building tree 21 of 50
building tree 22 of 50
building tree 23 of 50
building tree 24 of 50
building tree 25 of 50
building tree 26 of 50
building tree 27 of 50
building tree 28 of 50
building tree 29 of 50
building tree 30 of 50
building tree 31 of 50
building tree 32 of 50
building tree 33 of 50
building tree 34 of 50
building tree 35 of 50
building tree 36 of 50
building tree 37 of 50
building tree 38 of 50
building tree 39 of 50
building tree 40 of 50
building tree 41 of 50
building tree 42 of 50
building tree 43 of 50
building tree 44 of 50
building tree 45 of

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:  1.1min finished


RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=50, n_jobs=None, oob_score=False,
                      random_state=45, verbose=2, warm_start=False)

In [32]:
rf_pred = rf_regressor.predict(X_test)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:    0.0s finished


In [33]:
Get_score(rf_pred, y_test)

The Spearman's correlation coefficient is: 0.330
The Spearman's correlation coefficient is: 0.115


### **Model 3 : Gradient Boosting Regressor With Captions**

In [0]:
Y_st = dataset['short-term_memorability'].values # st targets
Y_lt = dataset['long-term_memorability'].values  # lt targets
X = captions_bag # captions

In [0]:
#split the short-term dev data 80/20 into training and validation sets; the fixed random state makes the split reproducible
X_train_st, X_test_st, Y_train_st, Y_test_st = train_test_split(X,Y_st, test_size=0.2, random_state=42)

#split the long-term dev data the same way
X_train_lt, X_test_lt, Y_train_lt, Y_test_lt = train_test_split(X,Y_lt, test_size=0.2, random_state=42)

In [36]:
#Just testing to see shape of data split for Short Term
print('Data split for Short Term')
print('X_train', len(X_train_st))
print('X_test', len(X_test_st))
print('Y_train', len(Y_train_st))
print('Y_test', len(Y_test_st))
print('')

#Just testing to see shape of data split for Long Term
print('Data split for Long Term')
print('X_train', len(X_train_lt))
print('X_test', len(X_test_lt))
print('Y_train', len(Y_train_lt))
print('Y_test', len(Y_test_lt))

Data split for Short Term
X_train 4800
X_test 1200
Y_train 4800
Y_test 1200

Data split for Long Term
X_train 4800
X_test 1200
Y_train 4800
Y_test 1200


In [0]:
#Model params - 650 trees, max depth 12, learning rate 0.01, least absolute deviation ('lad') loss
from sklearn import ensemble
params = {'n_estimators':650, 'max_depth':12, 'min_samples_split':2, 'learning_rate':0.01, 'loss':'lad'}
clf = ensemble.GradientBoostingRegressor(**params)

In [38]:
#fit to short term training set
clf.fit(X_train_st, Y_train_st)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.01, loss='lad',
                          max_depth=12, max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=650,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [39]:
#fit to long-term training set (this refits the same clf, replacing the short-term fit above)
clf.fit(X_train_lt, Y_train_lt)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.01, loss='lad',
                          max_depth=12, max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=650,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [40]:
#predict for stm for test set
print('Short Term:')
Get_score(clf.predict(X_test_st), Y_test_st) #Get_score prints the coefficient itself and returns None

#predict for ltm for test set
print('Long Term:')
Get_score(clf.predict(X_test_lt), Y_test_lt)

Short Term:
The Spearman's correlation coefficient is: 0.261
Long Term:
The Spearman's correlation coefficient is: 0.143
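
Calling `fit` a second time on the same scikit-learn estimator discards the earlier fit, so in the cells above the short-term predictions come from the model last fitted on the long-term targets. One way to keep the two apart is to train one estimator per target. The sketch below uses synthetic data and small, default-loss models purely for illustration; the data, the `models` dictionary and the hyperparameters are assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 samples, 10 features, 2 targets
rng = np.random.RandomState(0)
X = rng.rand(200, 10)
Y = np.column_stack([X[:, 0] + 0.1 * rng.rand(200),   # "short-term" stand-in
                     X[:, 1] + 0.1 * rng.rand(200)])  # "long-term" stand-in

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# One estimator per target, so fitting one never overwrites the other
models = {}
for idx, name in enumerate(['short-term', 'long-term']):
    model = GradientBoostingRegressor(n_estimators=50, max_depth=3,
                                      learning_rate=0.1, random_state=0)
    models[name] = model.fit(X_train, Y_train[:, idx])

predictions = {name: m.predict(X_test) for name, m in models.items()}
```

Each entry of `predictions` can then be scored against its own column of `Y_test`.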


### **Model 4 : Gradient Boosting Regressor With Captions And C3D Combined**

In [0]:
Y_st = dataset['short-term_memorability'].values # st targets
Y_lt = dataset['long-term_memorability'].values  # lt targets
X = captions_c3d # captions & C3D merged

In [0]:
#split the short-term dev data 80/20 into training and validation sets; the fixed random state makes the split reproducible
X_train_st, X_test_st, Y_train_st, Y_test_st = train_test_split(X,Y_st, test_size=0.2, random_state=42)

#split the long-term dev data the same way
X_train_lt, X_test_lt, Y_train_lt, Y_test_lt = train_test_split(X,Y_lt, test_size=0.2, random_state=42)

In [43]:
#Just testing to see shape of data split for Short Term
print('X_train', len(X_train_st))
print('X_test', len(X_test_st))
print('Y_train', len(Y_train_st))
print('Y_test', len(Y_test_st))

print('')

#Just testing to see shape of data split for Long Term
print('X_train', len(X_train_lt))
print('X_test', len(X_test_lt))
print('Y_train', len(Y_train_lt))
print('Y_test', len(Y_test_lt))

X_train 4800
X_test 1200
Y_train 4800
Y_test 1200

X_train 4800
X_test 1200
Y_train 4800
Y_test 1200


In [0]:
#Model params - 650 trees, max depth 12, learning rate 0.01, least absolute deviation ('lad') loss
from sklearn import ensemble
params = {'n_estimators':650, 'max_depth':12, 'min_samples_split':2, 'learning_rate':0.01, 'loss':'lad'}
clf = ensemble.GradientBoostingRegressor(**params)

In [45]:
#fit to short term training set
clf.fit(X_train_st, Y_train_st)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.01, loss='lad',
                          max_depth=12, max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=650,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [46]:
#fit to long-term training set (this refits the same clf, replacing the short-term fit above)
clf.fit(X_train_lt, Y_train_lt)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.01, loss='lad',
                          max_depth=12, max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=650,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [47]:
#predict for stm for test set
print('Short Term:')
Get_score(clf.predict(X_test_st), Y_test_st) #Get_score prints the coefficient itself and returns None

#predict for ltm for test set
print('Long Term:')
Get_score(clf.predict(X_test_lt), Y_test_lt)

Short Term:
The Spearman's correlation coefficient is: 0.273
Long Term:
The Spearman's correlation coefficient is: 0.116


## **Section 6 : Selecting the best model**

|Model|Features|Short Term|Long Term|
|------|--------|------|------|
|Random Forest|**Captions**|**0.409**|**0.176**|
|Random Forest|Captions & C3D|0.330|0.115|
|Gradient Boosting|Captions|0.261|0.143|
|Gradient Boosting|Captions & C3D|0.273|0.116|

The table above shows the Spearman's correlation coefficients obtained on the held-out 20% of the development set for each algorithm and feature combination.

From these results, Random Forest with captions alone performed best on both targets (0.409 short-term, 0.176 long-term).
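
The comparison can also be done programmatically rather than by eye. A minimal sketch that holds the scores from the table in a DataFrame and looks up the best row per target; the column names and the `results` variable are assumptions for illustration:

```python
import pandas as pd

# Scores copied from the table above
results = pd.DataFrame(
    [
        ('Random Forest', 'Captions', 0.409, 0.176),
        ('Random Forest', 'Captions & C3D', 0.330, 0.115),
        ('Gradient Boosting', 'Captions', 0.261, 0.143),
        ('Gradient Boosting', 'Captions & C3D', 0.273, 0.116),
    ],
    columns=['model', 'features', 'short_term', 'long_term'],
)

# Row with the highest coefficient for each target
best_short = results.loc[results['short_term'].idxmax()]
best_long = results.loc[results['long_term'].idxmax()]
print(best_short['model'], best_short['features'])
print(best_long['model'], best_long['features'])
```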

## **Section 7 : Prediction on test dataset**

In [0]:
X = captions_bag
y = dataset[['short-term_memorability','long-term_memorability']].values

In [49]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [50]:
X.shape

(6000, 3112)

In [51]:
print(f'X: ({len(X)})')
print(f'y:{y.shape}')

X: (6000)
y:(6000, 2)


In [0]:
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators=100,random_state=45)

In [53]:
rf_regressor.fit(X,y)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=45, verbose=0, warm_start=False)

**Importing test dataset and test captions**

In [0]:
#importing test dataset
test_csv_path = './Test-set/Ground-truth_test/ground_truth_template.csv'
test_dataset = pd.read_csv(test_csv_path)

#load the test set captions
test_captions_path ='./Test-set/Captions_test/test-set-1_video-captions.txt'
test_captions = load_captions(test_captions_path)

In [55]:
test_dataset.head()

| |video|short-term_memorability|nb_short-term_annotations|long-term_memorability|nb_long-term_annotations|
|---|---|---|---|---|---|
|0|7494|NaN|33|NaN|12|
|1|7495|NaN|34|NaN|10|
|2|7496|NaN|32|NaN|13|
|3|7497|NaN|33|NaN|10|
|4|7498|NaN|33|NaN|10|


In [56]:
test_captions.head()

| |video|caption|
|---|---|---|
|0|video7494.webm|green-jeep-struggling-to-drive-over-huge-rocks|
|1|video7495.webm|hiking-woman-tourist-is-walking-forward-in-mou...|
|2|video7496.webm|close-up-of-african-american-doctors-hands-usi...|
|3|video7497.webm|slow-motion-of-a-man-using-treadmill-in-the-gy...|
|4|video7498.webm|slow-motion-of-photographer-in-national-park|


In [57]:
#printing the dimensions of test-set dataset and features
print(f'Test-Dataset : {test_dataset.shape}')
print(f'Test-Captions: {test_captions.shape}')

Test-Dataset : (2000, 5)
Test-Captions: (2000, 2)


In [58]:
#Removing punctuation and stop words from captions
# set up progress tracker
pbar = pyprind.ProgBar(len(test_captions['caption']), title='Counting word occurrences')
for i, cap in enumerate(test_captions['caption']):
    # replace punctuation with spaces
    # and convert words to lower case
    text = ''.join([c if c not in punctuation else ' ' for c in cap]).lower()
    # remove stop words
    rmv_stopwords= ' '.join([word for word in text.split() if word not in stopwords])
    test_captions.loc[i,'caption'] = rmv_stopwords # update the captions in place
    pbar.update()

Counting word occurrences
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [59]:
test_captions.head()

| |video|caption|
|---|---|---|
|0|video7494.webm|green jeep struggling drive huge rocks|
|1|video7495.webm|hiking woman tourist walking forward mountains...|
|2|video7496.webm|close african american doctors hands using sph...|
|3|video7497.webm|slow motion man using treadmill gym regular ph...|
|4|video7498.webm|slow motion photographer national park|


In [60]:
#bag of words for the test captions (fit_transform learns a fresh vocabulary from the test captions)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = "word",max_features=3112) 
test_captions_bag = vectorizer.fit_transform(test_captions.caption).toarray()
type(test_captions_bag)

numpy.ndarray

In [61]:
len(test_captions_bag)

2000

In [62]:
test_captions_bag.shape

(2000, 3112)

In [63]:
print(f'Testing Bag of words size     : {len(test_captions_bag)}')
print(f'Development Bag of words size : {len(captions_bag)}')
print(f'Development Vocabulary Size   : {len(captions_bag[0])}')
print(f'Testing Vocabulary Size       : {len(test_captions_bag[0])}')

Testing Bag of words size     : 2000
Development Bag of words size : 6000
Development Vocabulary Size   : 3112
Testing Vocabulary Size       : 3112
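
Equal column counts do not by themselves mean the columns refer to the same words. Because `fit_transform` is called on the test captions, a new vocabulary is learned, so a given column may stand for a different word than in the development matrix the model was trained on. A common alternative is to fit the vectorizer once on the development captions and only call `transform` on the test captions. The sketch below illustrates this on toy captions; the variable names and the captions themselves are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy captions (assumption), already cleaned
dev_captions = ['blonde woman massaged tilt', 'medical helicopter hovers airport']
test_captions_demo = ['woman helicopter', 'green jeep rocks']

# Learn the vocabulary on the development captions only
demo_vectorizer = CountVectorizer(analyzer='word')
dev_bag = demo_vectorizer.fit_transform(dev_captions).toarray()

# Reuse the same vocabulary for the test captions; unseen words are dropped
test_bag = demo_vectorizer.transform(test_captions_demo).toarray()

print(dev_bag.shape, test_bag.shape)
```

With this approach every column keeps the meaning it had during training, and words that appear only in the test captions contribute nothing.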


**Predicting the scores for test captions**

In [0]:
test_pred = rf_regressor.predict(test_captions_bag)

In [65]:
pred = pd.DataFrame()
pred['short-term'] = test_pred[:,0]
pred['long-term'] = test_pred[:,1]
pred.head()

| |short-term|long-term|
|---|---|---|
|0|0.854635|0.712699|
|1|0.898883|0.7815|
|2|0.841901|0.805625|
|3|0.915282|0.82699|
|4|0.866134|0.711727|


## **Section 8 : Exporting the results to CSV**

In [0]:
pred.to_csv("/content/gdrive/My Drive/Saurabh_Bhise_19210665_predictions.csv",index=False)
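
The exported file contains only the two score columns, so each row is identified by its position alone. If an identifier column is wanted, a sketch along these lines could add one before writing. The toy values mirror the first rows shown earlier; `video_ids`, `pred_demo` and the in-memory buffer are assumptions for illustration, and whether an identifier column is expected in the submission is not stated in the notebook.

```python
import io
import pandas as pd

# Toy stand-ins (assumption) mirroring the first rows shown earlier
video_ids = [7494, 7495]
test_pred_demo = [[0.854635, 0.712699], [0.898883, 0.7815]]

pred_demo = pd.DataFrame(test_pred_demo, columns=['short-term', 'long-term'])
pred_demo.insert(0, 'video', video_ids)  # keep the identifier next to its scores

buffer = io.StringIO()  # in-memory target instead of a Drive path
pred_demo.to_csv(buffer, index=False)
print(buffer.getvalue())
```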