
Unsupervised Learning from Narrated Instruction Videos

Created by Jean-Baptiste Alayrac at INRIA, Paris.

Introduction

We address the problem of automatically learning the main steps to complete a certain task, such as changing a car tire, from a set of narrated instruction videos. The contributions of this paper are three-fold. First, we develop a new unsupervised learning approach that takes advantage of the complementary nature of the input video and the associated narration. The method solves two clustering problems, one in text and one in video, applied one after the other and linked by joint constraints to obtain a single coherent sequence of steps in both modalities. Second, we collect and annotate a new challenging dataset of real-world instruction videos from the Internet. The dataset contains about 800,000 frames for five different tasks (how to: change a car tire, perform cardiopulmonary resuscitation (CPR), jump a car, repot a plant, and make coffee) that include complex interactions between people and objects, and are captured in a variety of indoor and outdoor settings. Third, we experimentally demonstrate that the proposed method can automatically discover, in an unsupervised manner, the main steps to achieve the task and locate the steps in the input videos.

The webpage for this project is available here. It contains links to the paper, as well as other material such as the original data, the poster, and the slides of the presentation.

License

Our code is released under the MIT License (refer to the LICENSE file for details).

Cite

If you find this code useful in your research, please consider citing our paper:

@InProceedings{Alayrac16unsupervised,
  author    = "Alayrac, Jean-Baptiste and Bojanowski, Piotr and Agrawal, Nishant and Laptev, Ivan and Sivic, Josef and Lacoste-Julien, Simon",
  title     = "Unsupervised Learning from Narrated Instruction Videos",
  booktitle = "Computer Vision and Pattern Recognition (CVPR)",
  year      = "2016"
}

Contents

  1. Requirements
  2. Method
  3. Evaluation
  4. Features

Requirements

To run the code, you need MATLAB installed. The code was tested on Ubuntu 12.04 LTS with MATLAB R2014b. Obtaining the features used here requires additional dependencies; see the Features section below.

Method

This repo contains the code for the method described in the CVPR paper. The method aims to discover the main steps needed to achieve a task and to temporally localize them in narrated instruction videos. It is a two-stage approach:

  1. Multiple Sequence Alignment of the text input sequences
  2. Discriminative clustering of videos under text constraints

Code is provided for both stages, with a separate script for each. You can run both stages with different parameter configurations (see the comments in the code).
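For intuition only, the second stage follows the general shape of a discriminative-clustering objective (square loss with a ridge penalty) minimized over step assignments that must respect the ordering recovered from the text. The display below is a schematic of this family of objectives, not the exact cost from the paper; refer to the paper for the precise formulation and constraints.

$$
\min_{Z \in \mathcal{Z}} \; \min_{W,\, b} \; \frac{1}{2N} \left\| Z - X W - \mathbf{1} b^\top \right\|_F^2 + \frac{\lambda}{2} \left\| W \right\|_F^2
$$

Here $X$ stacks the per-interval video features, $Z$ assigns each time interval to one of the $K$ discovered steps (or to background), and $\mathcal{Z}$ encodes the constraint that, within each video, the assigned steps follow the order produced by the first (text) stage.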

Multiple Sequence Alignment:

To run a demo of this code, you need to follow these steps:

  1. Download the package and go to that folder
git clone https://github.com/jalayrac/instructionVideos.git
cd instructionVideos
  2. Download and unpack the preprocessed features
wget -P data http://www.di.ens.fr/willow/research/instructionvideos/release/NLP_data.zip
unzip data/NLP_data.zip -d data
  3. Go to the corresponding folder
cd nlp_utils
  4. Open MATLAB and run
compile.m
launching_script.m

Discriminative clustering under text constraints:

Note that you do not need to run the first stage before launching this demo, as we provide .mat files with the first-stage results (see the instructions below). To run a demo of this code, you need to follow these steps:

  1. Download the package and go to that folder
git clone https://github.com/jalayrac/instructionVideos.git
cd instructionVideos
  2. Download and unpack the preprocessed features (both for NLP and VISION)
wget -P data http://www.di.ens.fr/willow/research/instructionvideos/release/NLP_data.zip
wget -P data http://www.di.ens.fr/willow/research/instructionvideos/release/VISION_data.zip
unzip data/NLP_data.zip -d data
unzip data/VISION_data.zip -d data
  3. Download and unpack the preprocessed results of the first stage:
wget -P results http://www.di.ens.fr/willow/research/instructionvideos/release/NLP_results.zip
unzip results/NLP_results.zip -d results
  4. Go to the corresponding folder
cd cv_utils
  5. Open MATLAB and run
compile.m
launching_script.m

Evaluation

We provide preprocessed results so that the results of the paper can be reproduced. To reproduce our result plots, please follow these steps:

  1. Download the package and go to that folder
git clone https://github.com/jalayrac/instructionVideos.git
cd instructionVideos
  2. Download and unpack the preprocessed results, both for NLP and VISION
wget -P results http://www.di.ens.fr/willow/research/instructionvideos/release/NLP_results.zip
unzip results/NLP_results.zip -d results
wget -P results http://www.di.ens.fr/willow/research/instructionvideos/release/VISION_results.zip
unzip results/VISION_results.zip -d results
  3. Download and unpack the preprocessed data for NLP (needed for the qualitative results)
wget -P data http://www.di.ens.fr/willow/research/instructionvideos/release/NLP_data.zip
unzip data/NLP_data.zip -d data
  4. Go to the corresponding folder
cd display_res
  5. Open MATLAB and run (for the NLP qualitative results)
display_res_NLP.m
  6. Open MATLAB and run (for the temporal localization results)
display_res_VISION.m

Features

If you want to run this code on new data, you will need to preprocess the data as follows. If you need more details, do not hesitate to email the first author of the paper.

NLP

To obtain the direct object (dobj) relations, we used the Stanford Parser 3.5.1, available here. We first construct a dictionary of direct object relations ranked by their number of occurrences across the whole corpus. The indexing is based on this ranking (see the count_files folder for a given task).
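As a minimal sketch of this step (not part of the released code), the ranked dictionary can be built with a simple frequency count, assuming the Stanford Parser output has already been reduced to one dobj phrase per line in plain-text files (one file per video); the directory layout and file names below are hypothetical.

```python
# Hypothetical sketch: build a frequency-ranked dobj dictionary for one task.
from collections import Counter
from pathlib import Path

counts = Counter()
for path in Path("parsed_dobjs/change_tire").glob("*.txt"):  # hypothetical layout
    with open(path) as f:
        counts.update(line.strip().lower() for line in f if line.strip())

# Rank dobjs by frequency; the rank (starting at 1) is the index used
# to refer to a dobj in the per-video *.trlst files described below.
dictionary = {dobj: rank + 1 for rank, (dobj, _) in enumerate(counts.most_common())}
```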

For each video, we created a *.trlst file. For each dobj spoken during the video, the file contains one line with the following fields (a minimal parsing sketch is given after the list):

  • The index of the corresponding dobj in our dictionary
  • The start time in the video (coming from subtitles)
  • The end time in the video (coming from subtitles)
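The sketch below (not part of the released code) reads such a file, assuming one whitespace-separated record per line in the order listed above; the field separator and the example path are assumptions.

```python
# Hypothetical sketch: read a *.trlst file into (dobj index, start, end) tuples.
def read_trlst(path):
    records = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            idx, start, end = line.split()[:3]
            records.append((int(idx), float(start), float(end)))
    return records

# Example (placeholder path):
# dobjs = read_trlst("data/NLP_data/change_tire/video_0001.trlst")
```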

We then used the nltk python package to obtain the distance between dobjs (through its WordNet interface). This gives us the sim_mat similarity matrix.
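For illustration, a WordNet-based similarity matrix between dobj head nouns can be computed with nltk as sketched below. This is not the released preprocessing code, and the exact similarity measure used in the paper may differ; path_similarity and the example word list are used here purely as assumptions.

```python
# Hypothetical sketch: WordNet similarity matrix between dobj head nouns.
# Requires the WordNet corpus: nltk.download('wordnet')
import numpy as np
from nltk.corpus import wordnet as wn

def wordnet_similarity(word_a, word_b):
    """Best path similarity over all noun synset pairs; 0.0 if none is defined."""
    synsets_a = wn.synsets(word_a, pos=wn.NOUN)
    synsets_b = wn.synsets(word_b, pos=wn.NOUN)
    scores = [sa.path_similarity(sb) for sa in synsets_a for sb in synsets_b]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else 0.0

dobjs = ["tire", "jack", "wrench"]  # illustrative head nouns from the dictionary
sim_mat = np.array([[wordnet_similarity(a, b) for b in dobjs] for a in dobjs])
```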

VISION

The data for VISION contains two folders:

  • videos_info: This folder contains per-video information (FPS, number of frames, ...)
  • features: This folder contains a mat file holding a struct with all the features, the ground truth, and the other information needed to launch the second stage of the method. The features used here are a concatenation of a bag-of-words representation of Improved Dense Trajectories and a CNN representation obtained with MatConvNet. Please see the paper for detailed explanations.
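If you prefer to inspect this struct from Python rather than MATLAB, the sketch below (not part of the released code) loads the mat file with scipy; the file name is a placeholder, and the field names are not listed because they depend on the released struct.

```python
# Hypothetical sketch: inspect the VISION features mat file from Python.
from scipy.io import loadmat

data = loadmat("data/VISION_data/features/example_task.mat",  # placeholder path
               squeeze_me=True, struct_as_record=False)
for key, value in data.items():
    if not key.startswith("__"):  # skip MATLAB file metadata entries
        print(key, type(value))
```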
