# Predicting Media Memorability: MediaEval 2020






# Table of Contents 

This notebook is organized into the following parts:

* **PART 1 &nbsp;  :** Function Definitions


* **PART 2 &nbsp;  :** Loading Dev-set Data


* **PART 3 &nbsp;  :** Data Preprocessing on Dev-set Data


* **PART 4 &nbsp;  :** Model Evaluation with video precomputed features
  * P4.1: Random Forest with Captions
  * P4.2: SVR with Captions
  * P4.3: Random Forest with C3D
  * P4.4: SVR with C3D
  * P4.5: Random Forest with Captions+C3D
  * P4.6: SVR with Captions+C3D
  
  
  
* **PART 5 &nbsp;  :** Evaluating and comparing the features


* **PART 6 &nbsp;  :** Predicting the Memorability scores on Test-set
  * P6.1: Training on the entire 6000-record Dev-set
  * P6.2: Loading Test-set with Captions
  * P6.3: Predicting the Scores
  * P6.4: Exporting the results to CSV file
  * P6.5: Re-arranging the order of the columns in CSV file


In [0]:
pip install PyPrind

Collecting PyPrind
  Downloading https://files.pythonhosted.org/packages/1e/30/e76fb0c45da8aef49ea8d2a90d4e7a6877b45894c25f12fb961f009a891e/PyPrind-2.11.2-py3-none-any.whl
Installing collected packages: PyPrind
Successfully installed PyPrind-2.11.2


#### Importing necessary libraries

In [0]:
import pandas as pd
import numpy as np
from string import punctuation
import pyprind
from collections import Counter
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import os
import glob
import nltk

<a id='section1'></a>

## PART 1&nbsp;:  FUNCTION DEFINITIONS



In [0]:
#Function to load C3D features
def load_c3d(captions, c3dPath):
    files = list(captions["video"].values)
    c3dfeatures = []
    for file in files:
        file = c3dPath+file[:-4]+'txt'
        c3dfeatures.append(np.loadtxt(file))
    #print(type(c3dfeatures))
    return c3dfeatures
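
A variant of the loader above is sketched below, assuming each C3D file holds a single whitespace-separated feature vector and that the `.txt` file shares the video's base name. It builds the file name with `os.path` and stacks the vectors into one 2-D array. The name `load_c3d_matrix` is hypothetical and is not used elsewhere in this notebook.

```python
import os
import numpy as np

def load_c3d_matrix(video_names, c3d_dir):
    """Return an (n_videos, n_features) array of C3D vectors.

    Assumes one '<basename>.txt' file per video inside c3d_dir.
    """
    rows = []
    for name in video_names:
        base, _ext = os.path.splitext(name)   # 'video3.webm' -> 'video3'
        rows.append(np.loadtxt(os.path.join(c3d_dir, base + '.txt')))
    return np.vstack(rows)
```

Returning a 2-D array rather than a list of vectors makes it possible to check `.shape` directly before passing the features to a model.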

In [0]:
#Function to load the captions into dataframes
def load_captions(filename):
    video_name = []
    captions = []
    dataframe = pd.DataFrame()
    with open(filename) as file:
        for line in file:
            pair = line.split() # each line holds two whitespace-separated tokens: video name and caption
            video_name.append(pair[0]) # first token is the video name
            captions.append(pair[1]) # second token is the hyphen-joined caption
        dataframe['video']=video_name # store the two lists as DataFrame columns
        dataframe['caption']=captions
    return dataframe
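
The same two-column layout can probably also be read in one call with pandas. The sketch below assumes every line is `video-name<whitespace>caption` with no header row; `read_captions` and the two sample lines are illustrative only.

```python
import io
import pandas as pd

def read_captions(source):
    """Read 'video caption' lines into a two-column DataFrame."""
    return pd.read_csv(source, sep=r'\s+', header=None,
                       names=['video', 'caption'])

# Two sample lines in the same layout as the caption file.
sample = io.StringIO(
    "video3.webm\tblonde-woman-is-massaged-tilt-down\n"
    "video6.webm\tkhr-gangsters\n"
)
captions_demo = read_captions(sample)
print(captions_demo)
```

`read_csv` accepts a file path as well as a file-like object, so the same function would work on a caption file on disk.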

In [0]:
#Function to calculate Spearman coefficient scores
def Get_score(Y_pred,Y_true):
    '''Calculate and print Spearman's correlation coefficient'''
    Y_pred = np.squeeze(Y_pred)
    Y_true = np.squeeze(Y_true)
    if Y_pred.shape != Y_true.shape:
        print('Input shapes don\'t match!')
    else:
        if len(Y_pred.shape) == 1:
            Res = pd.DataFrame({'Y_true':Y_true,'Y_pred':Y_pred})
            score_mat = Res[['Y_true','Y_pred']].corr(method='spearman',min_periods=1)
            print('The Spearman\'s correlation coefficient is: %.3f' % score_mat.iloc[1][0])
        else:
            for ii in range(Y_pred.shape[1]):
                Get_score(Y_pred[:,ii],Y_true[:,ii])
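
`Get_score` prints the coefficient. When the value is needed for further comparison, a version that returns it can be handy. The sketch below uses `scipy.stats.spearmanr`; the name `spearman_scores` is hypothetical and the function is not used in the rest of the notebook.

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_scores(y_pred, y_true):
    """Return one Spearman coefficient per target column as a list of floats."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    if y_pred.shape != y_true.shape:
        raise ValueError("Input shapes don't match")
    # Treat 1-D input as a single target column.
    y_pred = y_pred.reshape(len(y_pred), -1)
    y_true = y_true.reshape(len(y_true), -1)
    scores = []
    for j in range(y_pred.shape[1]):
        rho, _pvalue = spearmanr(y_pred[:, j], y_true[:, j])
        scores.append(float(rho))
    return scores
```

For a two-column target such as short-term and long-term memorability, the result is a list with two entries in the same column order.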

<a id='section2'></a>

## PART 2&nbsp;:  LOADING DEV-SET DATA

In [0]:
# connect information in the google drive to this colab session

from google.colab import drive
drive.mount('/content/drive/')

import os
os.chdir("/content/drive/My Drive/CA684_Assignment")

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [0]:
#load the ground truth dataset
csv_path ='Dev-set/Ground-truth/ground-truth.csv'
dataset = pd.read_csv(csv_path)

In [0]:
#load captions
captions_path ='Dev-set/Captions/dev-set_video-captions.txt'
captions = load_captions(captions_path)

In [0]:
#load C3D features
c3dPath = 'Dev-set/C3D/'
c3dfeatures = load_c3d(captions,c3dPath)

In [0]:
dataset.head()

Unnamed: 0,video,short-term_memorability,nb_short-term_annotations,long-term_memorability,nb_long-term_annotations
0,video3.webm,0.924,34,0.846,13
1,video4.webm,0.923,33,0.667,12
2,video6.webm,0.863,33,0.7,10
3,video8.webm,0.922,33,0.818,11
4,video10.webm,0.95,34,0.9,10


In [0]:
captions.head()

Unnamed: 0,video,caption
0,video3.webm,blonde-woman-is-massaged-tilt-down
1,video4.webm,roulette-table-spinning-with-ball-in-closeup-shot
2,video6.webm,khr-gangsters
3,video8.webm,medical-helicopter-hovers-at-airport
4,video10.webm,couple-relaxing-on-picnic-crane-shot


In [0]:
#shape of the dataframes
print(f'Ground Truth : {dataset.shape}')
print(f'Captions     : {captions.shape}')
print(f'C3D          : ({len(c3dfeatures)})')


Ground Truth : (6000, 5)
Captions     : (6000, 2)
C3D          : (6000)


***
<a id='section3'></a>

## PART 3: DATA PREPROCESSING ON DEV-SET DATA

**Vectorizing Captions**



For training, I used all 6000 Dev-set captions, splitting each caption into words and removing stopwords.

In [0]:
import nltk
nltk.download('stopwords')
#loading the nltk stopwords of English
stopwords = nltk.corpus.stopwords.words('english')
print(f'Length of Stopwords: {len(stopwords)}')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Length of Stopwords: 179


Cleaning Data: dev-set_captions

In [0]:
#Removing punctuations and stop words from captions

pbar = pyprind.ProgBar(len(captions['caption']), title='Counting word occurrences')
for i, cap in enumerate(captions['caption']):
    # replace punctuations with space
    # convert words to lower case 
    text = ''.join([c if c not in punctuation else ' ' for c in cap]).lower()
    #removing stopwords
    rmv_stopwords= ' '.join([word for word in text.split() if word not in stopwords])
    captions.loc[i,'caption'] = rmv_stopwords #updating the original captions 
    pbar.update()

Counting word occurrences
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:01


In [0]:
captions.head()

Unnamed: 0,video,caption
0,video3.webm,blonde woman massaged tilt
1,video4.webm,roulette table spinning ball closeup shot
2,video6.webm,khr gangsters
3,video8.webm,medical helicopter hovers airport
4,video10.webm,couple relaxing picnic crane shot
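
The cleaning step can also be written as a small stand-alone function, which makes it easy to apply the same rule to any caption. This is a minimal sketch: `clean_caption` and the tiny stop-word set are hypothetical, whereas the notebook itself uses NLTK's English stop-word list.

```python
from string import punctuation

# Translation table that maps every punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(punctuation, ' ' * len(punctuation))

def clean_caption(caption, stop_words):
    """Lower-case a caption, replace punctuation with spaces, drop stop words."""
    words = caption.translate(_PUNCT_TO_SPACE).lower().split()
    return ' '.join(w for w in words if w not in stop_words)

# A tiny illustrative stop-word set (assumption, not the NLTK list).
demo_stop_words = {'is', 'down', 'with', 'in', 'at', 'on'}
print(clean_caption('blonde-woman-is-massaged-tilt-down', demo_stop_words))
# prints: blonde woman massaged tilt
```

With a DataFrame, such a function could be applied to a whole column via `Series.apply` instead of an explicit loop.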


In [0]:
#implementing bag of words for the combined captions
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = "word",max_features=3112) 
captions_bag = vectorizer.fit_transform(captions.caption).toarray()
type(captions_bag)

numpy.ndarray

In [0]:
captions_bag.shape

(6000, 3112)
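
To make the shape of the bag-of-words matrix concrete, here is a toy run on four cleaned captions. The variable names are illustrative and the captions are just a small sample, not the full Dev-set.

```python
from sklearn.feature_extraction.text import CountVectorizer

demo_captions = [
    'blonde woman massaged tilt',
    'medical helicopter hovers airport',
    'couple relaxing picnic crane shot',
    'roulette table spinning ball closeup shot',
]

demo_vectorizer = CountVectorizer(analyzer='word')
demo_bag = demo_vectorizer.fit_transform(demo_captions).toarray()

# Each row is a caption, each column a vocabulary word, each cell a count.
print(demo_bag.shape)
# Column index assigned to the word 'shot', which occurs in two captions.
shot_col = demo_vectorizer.vocabulary_['shot']
print(demo_bag[:, shot_col])
```

The number of columns equals the number of distinct words, unless `max_features` caps it at the most frequent ones.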

**Combining Captions and C3D into a single Vector for model implementation with Random Forest and SVR**

In [0]:
captions_c3d = captions_bag.tolist()
for i in range(len(captions_c3d)):
    # append the C3D vector of video i to its bag-of-words vector
    captions_c3d[i] = np.append(captions_c3d[i], c3dfeatures[i], axis=0)

In [0]:
len(captions_c3d[0]) 

3213

In [0]:
len(captions_c3d)

6000

Now we have the following features:

1. Captions in `captions_bag`
2. C3D in `c3dfeatures`
3. Captions and C3D in `captions_c3d`
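
The combined feature set is a column-wise concatenation of the other two. A minimal sketch with small stand-in arrays (the sizes here are made up for illustration) shows the same operation with `np.hstack`:

```python
import numpy as np

# Stand-ins: 4 videos, 5 bag-of-words columns, 3 C3D columns.
demo_bag = np.arange(20).reshape(4, 5)
demo_c3d = [np.full(3, 0.5) for _ in range(4)]   # list of 1-D vectors

# hstack joins the two blocks column-wise, one row per video.
demo_combined = np.hstack([demo_bag, np.vstack(demo_c3d)])
print(demo_combined.shape)   # (4, 8)
```

The row order of both blocks has to refer to the same videos for the concatenation to be meaningful.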


***
<a id='section4'></a>

## PART 4: &nbsp; MODEL EVALUATION WITH VIDEO PRECOMPUTED FEATURES

<a id='section4.1'></a>

### P4.1: &nbsp; Random Forest  with Captions

In [0]:
X = captions_bag
y = dataset[['short-term_memorability','long-term_memorability']].values

In [0]:
# Splitting the dataset into the Training set and Test set
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [0]:
print('X_train ', X_train.shape)
print('X_test  ', X_test.shape)
print('Y_train ', y_train.shape)
print('Y_test  ', y_test.shape)

X_train  (4800, 3112)
X_test   (1200, 3112)
Y_train  (4800, 2)
Y_test   (1200, 2)


In [0]:
from sklearn.ensemble import RandomForestRegressor
captions_rf = RandomForestRegressor(n_estimators=100,random_state=45)

In [0]:
captions_rf.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=45, verbose=0, warm_start=False)

In [0]:
captions_pred = captions_rf.predict(X_test)

In [0]:
Get_score(captions_pred, y_test)

The Spearman's correlation coefficient is: 0.409
The Spearman's correlation coefficient is: 0.176


***

### P4.2: &nbsp; SVR with Captions

In [0]:
svr_X = captions_bag
svr_y_short = dataset[['short-term_memorability']].values
svr_y_long = dataset[['long-term_memorability']].values

In [0]:
# Splitting the dataset into the Training set and Test set
short_X_train,short_X_test,short_y_train,short_y_test = train_test_split(svr_X,svr_y_short,test_size=0.2,random_state=40)
long_X_train,long_X_test,long_y_train,long_y_test = train_test_split(svr_X,svr_y_long,test_size=0.2,random_state=40)

In [0]:
from sklearn.preprocessing import StandardScaler
short_X = StandardScaler()
short_y = StandardScaler()
short_X_train = short_X.fit_transform(short_X_train)
short_y_train = short_y.fit_transform(short_y_train)
long_X = StandardScaler()
long_y = StandardScaler()
long_X_train = long_X.fit_transform(long_X_train)
long_y_train = long_y.fit_transform(long_y_train)

In [0]:
from sklearn.svm import SVR
short_regressor = SVR(kernel = 'rbf')
long_regressor = SVR(kernel = 'rbf')
short_regressor.fit(short_X_train, short_y_train)
long_regressor.fit(long_X_train,long_y_train)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [0]:
short_pred = short_regressor.predict(short_X_test)
short_pred = short_y.inverse_transform(short_pred)
long_pred = long_regressor.predict(long_X_test)
long_pred = long_y.inverse_transform(long_pred)

In [0]:
Get_score(short_pred, short_y_test)
Get_score(long_pred, long_y_test)

The Spearman's correlation coefficient is: 0.338
The Spearman's correlation coefficient is: 0.170
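
With an SVR, the scaling learned on the training rows normally has to be applied to any rows passed to `predict` as well. One way to keep that consistent is to bundle scaler and regressor together. The sketch below runs on synthetic stand-in data, not on the memorability features, and all names in it are illustrative.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: 60 rows, 4 features, scores near 0.6-0.9).
rng = np.random.RandomState(0)
X_demo = rng.rand(60, 4)
y_demo = X_demo[:, 0] * 0.3 + 0.6 + rng.normal(scale=0.01, size=60)

X_tr, X_te = X_demo[:48], X_demo[48:]
y_tr, y_te = y_demo[:48], y_demo[48:]

# The pipeline standardises features with statistics learned on the training
# rows and reuses them at predict time; the wrapper does the same for y and
# inverts the scaling on the predictions.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel='rbf')),
    transformer=StandardScaler(),
)
model.fit(X_tr, y_tr)
demo_pred = model.predict(X_te)
print(demo_pred.shape)
```

Because the target is passed as a 1-D array, this form also avoids the column-vector warning that `SVR.fit` raises for a target of shape `(n, 1)`.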


***
<a id='section4.3'></a>

### P4.3: &nbsp; Random Forest with C3D

In [0]:
X = c3dfeatures
y = dataset[['short-term_memorability','long-term_memorability']].values

In [0]:
# Splitting the dataset into the Training set and Test set
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [0]:
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators=20,random_state=42,verbose=2)

In [0]:
rf_regressor.fit(X_train,y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


building tree 1 of 20


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.4s remaining:    0.0s


building tree 2 of 20
building tree 3 of 20
building tree 4 of 20
building tree 5 of 20
building tree 6 of 20
building tree 7 of 20
building tree 8 of 20
building tree 9 of 20
building tree 10 of 20
building tree 11 of 20
building tree 12 of 20
building tree 13 of 20
building tree 14 of 20
building tree 15 of 20
building tree 16 of 20
building tree 17 of 20
building tree 18 of 20
building tree 19 of 20
building tree 20 of 20


[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    8.4s finished


RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=20, n_jobs=None, oob_score=False,
                      random_state=42, verbose=2, warm_start=False)

In [0]:
rf_pred = rf_regressor.predict(X_test)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    0.0s finished


In [0]:
Get_score(rf_pred, y_test)

The Spearman's correlation coefficient is: 0.266
The Spearman's correlation coefficient is: 0.122


***
<a id='section4.4'></a>

### P4.4: &nbsp;  SVR with C3D

In [0]:
svr_X = c3dfeatures
svr_y_short = dataset[['short-term_memorability']].values
svr_y_long = dataset[['long-term_memorability']].values

In [0]:
# Splitting the dataset into the Training set and Test set
short_X_train,short_X_test,short_y_train,short_y_test = train_test_split(svr_X,svr_y_short,test_size=0.2,random_state=40)
long_X_train,long_X_test,long_y_train,long_y_test = train_test_split(svr_X,svr_y_long,test_size=0.2,random_state=40)

In [0]:
from sklearn.preprocessing import StandardScaler
short_X = StandardScaler()
short_y = StandardScaler()
short_X_train = short_X.fit_transform(short_X_train)
short_y_train = short_y.fit_transform(short_y_train)
long_X = StandardScaler()
long_y = StandardScaler()
long_X_train = long_X.fit_transform(long_X_train)
long_y_train = long_y.fit_transform(long_y_train)

In [0]:
from sklearn.svm import SVR
short_regressor = SVR(kernel = 'rbf')
long_regressor = SVR(kernel = 'rbf')
short_regressor.fit(short_X_train, short_y_train)
long_regressor.fit(long_X_train,long_y_train)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [0]:
short_pred = short_regressor.predict(short_X_test)
short_pred = short_y.inverse_transform(short_pred)
long_pred = long_regressor.predict(long_X_test)
long_pred = long_y.inverse_transform(long_pred)

In [0]:
Get_score(short_pred, short_y_test)
Get_score(long_pred, long_y_test)

The Spearman's correlation coefficient is: 0.242
The Spearman's correlation coefficient is: 0.107


***
<a id='section4.5'></a>

### P4.5: &nbsp; Random Forest with Captions and C3D

In [0]:
X = captions_c3d
y = dataset[['short-term_memorability','long-term_memorability']].values

In [0]:
# Splitting the dataset into the Training set and Test set
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [0]:
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators=50,random_state=45,verbose=2)

In [0]:
rf_regressor.fit(X_train,y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


building tree 1 of 50


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.3s remaining:    0.0s


building tree 2 of 50
building tree 3 of 50
building tree 4 of 50
building tree 5 of 50
building tree 6 of 50
building tree 7 of 50
building tree 8 of 50
building tree 9 of 50
building tree 10 of 50
building tree 11 of 50
building tree 12 of 50
building tree 13 of 50
building tree 14 of 50
building tree 15 of 50
building tree 16 of 50
building tree 17 of 50
building tree 18 of 50
building tree 19 of 50
building tree 20 of 50
building tree 21 of 50
building tree 22 of 50
building tree 23 of 50
building tree 24 of 50
building tree 25 of 50
building tree 26 of 50
building tree 27 of 50
building tree 28 of 50
building tree 29 of 50
building tree 30 of 50
building tree 31 of 50
building tree 32 of 50
building tree 33 of 50
building tree 34 of 50
building tree 35 of 50
building tree 36 of 50
building tree 37 of 50
building tree 38 of 50
building tree 39 of 50
building tree 40 of 50
building tree 41 of 50
building tree 42 of 50
building tree 43 of 50
building tree 44 of 50
building tree 45 of

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:  1.1min finished


RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=50, n_jobs=None, oob_score=False,
                      random_state=45, verbose=2, warm_start=False)

In [0]:
rf_pred_captions_c3d = rf_regressor.predict(X_test)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:    0.0s finished


In [0]:
Get_score(rf_pred_captions_c3d, y_test)

The Spearman's correlation coefficient is: 0.330
The Spearman's correlation coefficient is: 0.115


***
<a id='section4.6'></a>

### P4.6: &nbsp; SVR with Captions and C3D

In [0]:
svr_X = captions_c3d
svr_y_short = dataset[['short-term_memorability']].values
svr_y_long = dataset[['long-term_memorability']].values

In [0]:
# Splitting the dataset into the Training set and Test set
short_X_train,short_X_test,short_y_train,short_y_test = train_test_split(svr_X,svr_y_short,test_size=0.2,random_state=40)
long_X_train,long_X_test,long_y_train,long_y_test = train_test_split(svr_X,svr_y_long,test_size=0.2,random_state=40)

In [0]:
from sklearn.preprocessing import StandardScaler
short_X = StandardScaler()
short_y = StandardScaler()
short_X_train = short_X.fit_transform(short_X_train)
short_y_train = short_y.fit_transform(short_y_train)
long_X = StandardScaler()
long_y = StandardScaler()
long_X_train = long_X.fit_transform(long_X_train)
long_y_train = long_y.fit_transform(long_y_train)

In [0]:
from sklearn.svm import SVR
short_regressor = SVR(kernel = 'rbf')
long_regressor = SVR(kernel = 'rbf')
short_regressor.fit(short_X_train, short_y_train)
long_regressor.fit(long_X_train,long_y_train)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [0]:
short_pred = short_regressor.predict(short_X_test)
short_pred = short_y.inverse_transform(short_pred)
long_pred = long_regressor.predict(long_X_test)
long_pred = long_y.inverse_transform(long_pred)

In [0]:
Get_score(short_pred, short_y_test)
Get_score(long_pred, long_y_test)

The Spearman's correlation coefficient is: 0.363
The Spearman's correlation coefficient is: 0.171


## PART 5: &nbsp; EVALUATING AND COMPARING THE FEATURES


The code below prints the Spearman's correlation coefficients obtained with the different feature sets.

The results show that Captions with the Random Forest model gave the best performance on the held-out part of the Dev-set (0.409 short-term, 0.176 long-term).

I have therefore chosen it as the model for predicting video memorability on the Test-set.

In [0]:
print('Random Forest with captions:')
Get_score(captions_pred, y_test)

print('Random Forest with c3d')
Get_score(rf_pred, y_test)

print('Random Forest with captions and c3d')
Get_score(rf_pred_captions_c3d, y_test)



Random Forest with captions:
The Spearman's correlation coefficient is: 0.409
The Spearman's correlation coefficient is: 0.176
Random Forest with c3d
The Spearman's correlation coefficient is: 0.266
The Spearman's correlation coefficient is: 0.122
Random Forest with captions and c3d
The Spearman's correlation coefficient is: 0.330
The Spearman's correlation coefficient is: 0.115
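
For a side-by-side view, the scores printed in sections P4.1 to P4.6 can be collected into one structure and ranked. The numbers below are copied from those outputs. Note that the Random Forest and SVR runs used different `random_state` values for the split, so the ranking is indicative rather than strictly like-for-like.

```python
# Spearman scores reported in P4.1 to P4.6: (short-term, long-term).
reported = {
    ('Random Forest', 'Captions'):     (0.409, 0.176),
    ('SVR',           'Captions'):     (0.338, 0.170),
    ('Random Forest', 'C3D'):          (0.266, 0.122),
    ('SVR',           'C3D'):          (0.242, 0.107),
    ('Random Forest', 'Captions+C3D'): (0.330, 0.115),
    ('SVR',           'Captions+C3D'): (0.363, 0.171),
}

# Rank the model/feature pairs by short-term score, highest first.
ranked = sorted(reported.items(), key=lambda kv: kv[1][0], reverse=True)
for (model, features), (short, long_) in ranked:
    print(f'{model:<14} {features:<13} short={short:.3f} long={long_:.3f}')

best_short = ranked[0][0]
best_long = max(reported, key=lambda k: reported[k][1])
```

Both `best_short` and `best_long` point to the Random Forest with Captions, which matches the choice made above.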


***
<a id='section7'></a>

## PART 6: &nbsp; PREDICTING THE MEMORABILITY SCORES ON TEST-SET



In this part, we use all 6000 records of the Dev-set to train our model and predict the scores on the Test-set, which has 2000 records.


### P6.1: &nbsp; Training 6000 Dev-set

In [0]:
X = captions_bag
y = dataset[['short-term_memorability','long-term_memorability']].values

In [0]:
print(f'X: ({len(X)})')
print(f'y:{y.shape}')

X: (6000)
y:(6000, 2)


In [0]:
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators=100,random_state=45)

In [0]:
rf_regressor.fit(X,y)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=45, verbose=0, warm_start=False)

Now we have trained our model on the full 6000-record Dev-set.

<a id='section7.2'></a>

### P6.2: &nbsp; Loading Test-set with Captions

In [0]:
#importing test Dataset
csv_path ='Test-set/Ground-truth_test/ground_truth_template.csv'
test_dataset = pd.read_csv(csv_path)

In [0]:
#load the test set captions
test_captions_path ='Test-set/Captions_test/test-set-1_video-captions.txt'
test_captions = load_captions(test_captions_path)

In [0]:
test_dataset.head()

Unnamed: 0,video,short-term_memorability,nb_short-term_annotations,long-term_memorability,nb_long-term_annotations
0,7494,,33,,12
1,7495,,34,,10
2,7496,,32,,13
3,7497,,33,,10
4,7498,,33,,10


We need to predict and fill in the Short-term and Long-term Memorability scores above using our trained model.

In [0]:
test_captions.head()

Unnamed: 0,video,caption
0,video7494.webm,green-jeep-struggling-to-drive-over-huge-rocks
1,video7495.webm,hiking-woman-tourist-is-walking-forward-in-mou...
2,video7496.webm,close-up-of-african-american-doctors-hands-usi...
3,video7497.webm,slow-motion-of-a-man-using-treadmill-in-the-gy...
4,video7498.webm,slow-motion-of-photographer-in-national-park


In [0]:
#printing the dimensions of test-set dataset and features
print(f'Test-Dataset : {test_dataset.shape}')
print(f'Test-Captions: {test_captions.shape}')

Test-Dataset : (2000, 5)
Test-Captions: (2000, 2)


In [0]:
#Removing punctuations and stop words from captions
# set up progress tracker
pbar = pyprind.ProgBar(len(test_captions['caption']), title='Counting word occurrences')
for i, cap in enumerate(test_captions['caption']):
    # replace punctuations with space
    # convert words to lower case 
    text = ''.join([c if c not in punctuation else ' ' for c in cap]).lower()
    #removing stopwords
    rmv_stopwords= ' '.join([word for word in text.split() if word not in stopwords])
    test_captions.loc[i,'caption'] = rmv_stopwords #updating the original captions 
    pbar.update()

Counting word occurrences
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


In [0]:
test_captions.head()

Unnamed: 0,video,caption
0,video7494.webm,green jeep struggling drive huge rocks
1,video7495.webm,hiking woman tourist walking forward mountains...
2,video7496.webm,close african american doctors hands using sph...
3,video7497.webm,slow motion man using treadmill gym regular ph...
4,video7498.webm,slow motion photographer national park


In [0]:
#implementing bag of words for the combined captions
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = "word",max_features=3122) 
test_captions_bag = vectorizer.fit_transform(test_captions.caption).toarray()
type(test_captions_bag)

numpy.ndarray

In [0]:
#print(f'Testing Bag of words size     : {len(test_captions_bag)}')
#print(f'Development Bag of words size : {len(captions_bag)}')
print(f'Development Vocabulary Size   : {len(captions_bag[0])}')
print(f'Testing Vocabulary Size       : {len(test_captions_bag[0])}')

Development Vocabulary Size   : 3112
Testing Vocabulary Size       : 3112
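
A model trained on a bag-of-words matrix ties each column to one specific word, so matching column counts alone do not guarantee that the columns mean the same thing. Features for new text are usually built with the vocabulary learned on the training text. The sketch below illustrates this with toy captions, assuming the fitted vectorizer is still in memory; all names in it are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ['green jeep rocks', 'woman walking mountains', 'man using treadmill']
new_texts = ['jeep walking park', 'treadmill man']

demo_vec = CountVectorizer(analyzer='word')
train_bag = demo_vec.fit_transform(train_texts).toarray()   # learns the vocabulary

# transform() reuses the learned vocabulary: same columns, same order.
# Words never seen during fitting (here 'park') are ignored.
new_bag = demo_vec.transform(new_texts).toarray()

print(train_bag.shape, new_bag.shape)
jeep_col = demo_vec.vocabulary_['jeep']
print(new_bag[:, jeep_col])
```

Calling `fit_transform` on the new text instead would learn a separate vocabulary, and column `j` could then stand for a different word than it did during training.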


***
<a id='section7.3'></a>

### P6.3: &nbsp; Predicting the scores 

In [0]:
test_pred = rf_regressor.predict(test_captions_bag)

In [0]:
pred = pd.DataFrame()

In [0]:
type(test_pred)

numpy.ndarray

In [0]:
pred['short-term'] = test_pred[:,0]

In [0]:
pred['long-term'] = test_pred[:,1]

In [0]:
pred.head()

Unnamed: 0,short-term,long-term
0,0.854635,0.712699
1,0.898883,0.7815
2,0.841901,0.805625
3,0.915282,0.82699
4,0.866134,0.711727


In [0]:
pred_details=pred.describe()

In [0]:
pred_details

Unnamed: 0,short-term,long-term
count,2000.0,2000.0
mean,0.848028,0.751163
std,0.034564,0.067863
min,0.687093,0.397318
25%,0.830368,0.721224
50%,0.849698,0.757218
75%,0.871004,0.792872
max,0.953749,0.949478


***

In [0]:
test_dataset.head()

Unnamed: 0,video,short-term_memorability,nb_short-term_annotations,long-term_memorability,nb_long-term_annotations
0,7494,,33,,12
1,7495,,34,,10
2,7496,,32,,13
3,7497,,33,,10
4,7498,,33,,10


### P6.4: &nbsp; Exporting results in the given format

In [0]:
ground_truth_values = test_dataset.drop('short-term_memorability', axis=1)   #drop the empty short-term memorability column
ground_truth_values = ground_truth_values.drop('long-term_memorability', axis=1)   #drop the empty long-term memorability column

In [0]:
print(ground_truth_values)

      video  nb_short-term_annotations  nb_long-term_annotations
0      7494                         33                        12
1      7495                         34                        10
2      7496                         32                        13
3      7497                         33                        10
4      7498                         33                        10
...     ...                        ...                       ...
1995  10004                         34                        17
1996  10005                         34                         9
1997  10006                         34                        12
1998  10007                         34                        12
1999  10008                         33                        10

[2000 rows x 3 columns]


In [0]:
ground_truth_values = pd.concat([ground_truth_values, pred], axis = 1)   #merge with my final predictions
print(ground_truth_values)

      video  nb_short-term_annotations  ...  short-term  long-term
0      7494                         33  ...    0.854635   0.712699
1      7495                         34  ...    0.898883   0.781500
2      7496                         32  ...    0.841901   0.805625
3      7497                         33  ...    0.915282   0.826990
4      7498                         33  ...    0.866134   0.711727
...     ...                        ...  ...         ...        ...
1995  10004                         34  ...    0.840399   0.646480
1996  10005                         34  ...    0.840914   0.834525
1997  10006                         34  ...    0.860869   0.749333
1998  10007                         34  ...    0.771076   0.713031
1999  10008                         33  ...    0.857138   0.762567

[2000 rows x 5 columns]


### P6.5: &nbsp; Re-arranging the order of the columns in CSV file

In [0]:
ground_truth_values = ground_truth_values[['video', 'short-term','nb_short-term_annotations','long-term', 'nb_long-term_annotations']]

In [0]:
print(ground_truth_values)  

      video  short-term  ...  long-term  nb_long-term_annotations
0      7494    0.854635  ...   0.712699                        12
1      7495    0.898883  ...   0.781500                        10
2      7496    0.841901  ...   0.805625                        13
3      7497    0.915282  ...   0.826990                        10
4      7498    0.866134  ...   0.711727                        10
...     ...         ...  ...        ...                       ...
1995  10004    0.840399  ...   0.646480                        17
1996  10005    0.840914  ...   0.834525                         9
1997  10006    0.860869  ...   0.749333                        12
1998  10007    0.771076  ...   0.713031                        12
1999  10008    0.857138  ...   0.762567                        10

[2000 rows x 5 columns]


In [0]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [0]:
import os
os.chdir("/content/drive/My Drive/CA684_MachineLearning_Assignment")

In [0]:
ground_truth_values.to_csv('Prateek_Sakaray_19210825_predictions.csv')
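
By default `to_csv` also writes the DataFrame index as a first, unnamed column. Whether that column is wanted depends on the expected submission format, which this notebook does not state. The sketch below uses made-up toy values to show the merge, the column re-ordering, and an export without the index.

```python
import io
import pandas as pd

# Toy stand-ins for the test template and the predictions (values are made up).
template = pd.DataFrame({
    'video': [7494, 7495],
    'nb_short-term_annotations': [33, 34],
    'nb_long-term_annotations': [12, 10],
})
scores = pd.DataFrame({'short-term': [0.85, 0.90], 'long-term': [0.71, 0.78]})

# Merge side by side, then select the columns in the desired order.
out = pd.concat([template, scores], axis=1)[
    ['video', 'short-term', 'nb_short-term_annotations',
     'long-term', 'nb_long-term_annotations']
]

buffer = io.StringIO()
out.to_csv(buffer, index=False)   # index=False leaves out the row-number column
print(buffer.getvalue())
```

Writing to a `StringIO` buffer first is a quick way to inspect the header line before saving the real file.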