# Video Processing Example

This example shows how to use the `interactionvideo` package to process a video for studies of human interactions. See also our research paper: Hu and Ma (2020), "Human Interactions and Financial Investment: A Video-Based Approach", available at https://songma.github.io/files/hm_video.pdf.

## Overview

The video processing involves the following steps:

1. Set up folders and check dependencies (requirements)
2. Extract images and audio from a video using `pliers`
3. Extract text from audio using the Google Speech-to-Text API
4. Process images (faces) using the Face++ API
5. Process text using the Loughran and McDonald (2011) Finance Dictionary and the Nicolas, Bai, and Fiske (2019) Social Psychology Dictionary
6. Process audio using pre-trained ML models in `pyAudioAnalysis` and `speechemotionrecognition`
7. Aggregate information from the 3V (visual, vocal, and verbal) channels to the video level

## Structure

```bash
├── interactionvideo
│   ├── __pycache__
│   ├── prepare.py
│   ├── decompose.py
│   ├── faceppml.py
│   ├── googleml.py
│   ├── textualanalysis.py
│   ├── audioml.py
│   ├── aggregate.py
│   └── utils.py
├── data
│   ├── example_video.mp4
│   └── VideoDictionary.csv
├── mlmodel
│   ├── pyAudioAnalysis
│   └── speechemotionrecognition
├── output
│   ├── audio_temp
│   ├── image_temp
│   └── result_temp
├── PythonSDK
├── example.py
├── Video Processing Example.ipynb
├── README.md
└── requirement.txt
```

## Dependencies
 - pandas 
 - tqdm 
 - codecs
 - pliers
 - pydub
 - PIL
 - google-cloud-speech
 - google-cloud-storage
 - speechemotionrecognition
 - pyAudioAnalysis
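
Before running the example, it can help to confirm which dependencies are importable. The sketch below is not the package's `check_requirements` function; it is a minimal stand-alone check built on the standard library. The helper name `find_missing` is hypothetical, and the import names in `MODULES` are assumptions, since distribution names and import names can differ (for example, `PIL` is provided by the Pillow distribution).

```python
import importlib.util

# Import names are assumptions; adjust them to match your installation.
MODULES = ["pandas", "tqdm", "codecs", "pliers", "pydub", "PIL"]

def find_missing(modules):
    """Return the module names that cannot be located by this interpreter."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing

print("Missing modules:", find_missing(MODULES))
```

An empty list means every listed module was found. `find_spec` only locates a module without importing it, so the check stays fast even for heavy packages.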

## 1. Set up folders and check dependencies (requirements)

In [1]:
from os.path import join
# Set your root path here
RootPath = r''
# Set your video file path here
VideoFilePath = join(RootPath,'data','example_video.mp4')
# Set your work path here
# Work path is where to store meta files and output files
WorkPath = join(RootPath,'output')

In [2]:
# Set up the folders
from interactionvideo.prepare import setup_folder
setup_folder(WorkPath)

# Check the requirements for interactionvideo
from interactionvideo.prepare import check_requirements
check_requirements()

Audio Temp folder is set up.

Image Temp folder is set up.

Result Temp folder is set up.

decompose.py requirements satisfied.

faceppml.py requirements satisfied.

googleml.py requirements satisfied.

audioml.py requirements satisfied.



True
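
For orientation, the folder layout reported above can be reproduced with the standard library. This is a sketch, not the body of `setup_folder`; the helper name `make_work_folders` is hypothetical, and the subfolder names are taken from the Structure section.

```python
import os
import tempfile

SUBFOLDERS = ("audio_temp", "image_temp", "result_temp")

def make_work_folders(work_path):
    """Create the three temp folders under work_path and return their paths."""
    created = []
    for name in SUBFOLDERS:
        path = os.path.join(work_path, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        created.append(path)
    return created

# Demonstrate in a throwaway directory so nothing in the project is touched.
with tempfile.TemporaryDirectory() as tmp:
    folders = make_work_folders(tmp)
    print([os.path.basename(p) for p in folders])
```

Because `exist_ok=True` is set, calling the helper a second time on the same work path is harmless.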

## 2. Extract images and audio from a video

In [3]:
from interactionvideo.decompose import convert_video_to_images

# Decompose the video into a stream of images
# The default sampling rate is 10 frames per second
# Find the output at WorkPath\image_temp
convert_video_to_images(VideoFilePath, WorkPath)

Video is 70.12 seconds long.




100%|████████████████████████████████████████████████████████████████████████████████| 702/702 [06:10<00:00,  1.84it/s]


Video is sampled to 702 images.

Video to images finished.



True
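
The image count in this output follows from simple arithmetic: a 70.12-second video sampled at 10 frames per second yields 702 images. The sketch below reproduces that count. `sample_onsets` is a hypothetical helper, and it assumes one image every 1/rate seconds starting at time 0, which the source does not confirm.

```python
import math

def sample_onsets(duration_s, rate_hz=10):
    """Return onset times (seconds) k / rate_hz for every k with onset < duration_s."""
    n = math.ceil(duration_s * rate_hz)
    return [round(k / rate_hz, 3) for k in range(n)]

onsets = sample_onsets(70.12)
print(len(onsets), onsets[:3], onsets[-1])
```

A lower sampling rate shortens both this step and the later per-image API calls proportionally, at the cost of coarser time resolution.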

In [4]:
from interactionvideo.decompose import convert_video_to_audios

# Decompose the video into audios
# Find the output at WorkPath\audio_temp
convert_video_to_audios(VideoFilePath, WorkPath)

MoviePy - Writing audio in %s


                                                                                                                       

MoviePy - Done.
Video to audios finished.



True

## 3. Extract text from audio using the Google Speech-to-Text API

Set up your Google Cloud environment by following these guides:

 - https://cloud.google.com/python
 - https://cloud.google.com/storage/docs/quickstart-console
 - https://cloud.google.com/speech-to-text

Create a Google Cloud Storage bucket.

In [5]:
from interactionvideo.googleml import upload_audio_to_googlecloud

# Set your Google Cloud Storage bucket name here
GoogleBucketName = ''

# Upload audio file to Google Cloud Storage
upload_audio_to_googlecloud(WorkPath, GoogleBucketName)

Uploaded the audio file to Google Cloud.



True

In [6]:
from interactionvideo.googleml import convert_audio_to_text_by_google

# Use Google Speech2Text API to convert audio to text
# Return a txt file of full speech script and a csv file of text and punctuation
# Find the output at 
# - WorkPath\result_temp\script_google.txt (full speech script)
# - WorkPath\result_temp\text_panel_google.csv (text panel from Google)
google_result_text, google_result_df = convert_audio_to_text_by_google(WorkPath, GoogleBucketName)

Google Speech2Text begins. 70.12 seconds audio to process.

Google Speech2Text ends. 70.12 seconds audio processed.



In [7]:
# Check full speech script from Google
print(google_result_text)

Hello, everyone. First of all, we will like to thank you for your interest in our research in this paper. We try to understand how human interaction features such as facial expressions vocal emotions and word choices might influence economic agents decision making in order to study this question empirically, we build an empirical approach that uses videos of human interactions as data input and and machine learning based algorithms as the tool. We apply an empirical approach in a setting where early stage Turn up Pitch Venture capitalists for early-stage funding. We find that pitch features along visual vocal and verbal damages all matter for the probability of receiving funding and we also show that this event impact is largely due to interaction induced biases rather than that interactions provide additional valuable information the empirical structure that you see in this code example can hopefully help you to get started with using video in other important settings such as As inter

In [8]:
# Check text panel from Google
google_result_df.head(10)

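| Unnamed: 0 | Text | Onset | Offset | Duration | Sentence End |
|---|---|---|---|---|---|
| 0 | Hello, | 0.1 | 0.7 | 0.6 | True |
| 1 | everyone. | 0.7 | 1.1 | 0.4 | True |
| 2 | First | 1.1 | 1.5 | 0.4 | False |
| 3 | of | 1.5 | 1.6 | 0.1 | False |
| 4 | all, | 1.6 | 1.9 | 0.3 | True |
| 5 | we | 1.9 | 2.0 | 0.1 | False |
| 6 | will | 2.0 | 2.2 | 0.2 | False |
| 7 | like | 2.2 | 2.3 | 0.1 | False |
| 8 | to | 2.3 | 2.4 | 0.1 | False |
| 9 | thank | 2.4 | 2.7 | 0.3 | False |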
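
The word-level panel can be regrouped into longer units with the `Sentence End` flag. In the rows shown, the flag is also `True` after commas ("Hello," and "all,"), so it appears to mark punctuation rather than only full stops. The sketch below groups the ten rows into phrases using only the standard library; `group_phrases` is a hypothetical helper, not part of `interactionvideo`.

```python
import csv
import io

# The ten rows shown above, copied as CSV text (index and Duration columns omitted).
PANEL = '''Text,Onset,Offset,Sentence End
"Hello,",0.1,0.7,True
everyone.,0.7,1.1,True
First,1.1,1.5,False
of,1.5,1.6,False
"all,",1.6,1.9,True
we,1.9,2.0,False
will,2.0,2.2,False
like,2.2,2.3,False
to,2.3,2.4,False
thank,2.4,2.7,False
'''

def group_phrases(rows):
    """Group words into phrases, closing a phrase at each flagged word."""
    phrases, current = [], []
    for row in rows:
        current.append(row)
        if row["Sentence End"] == "True":
            phrases.append(current)
            current = []
    if current:  # trailing words with no closing flag yet
        phrases.append(current)
    return [
        (" ".join(w["Text"] for w in p), float(p[0]["Onset"]), float(p[-1]["Offset"]))
        for p in phrases
    ]

rows = list(csv.DictReader(io.StringIO(PANEL)))
for text, onset, offset in group_phrases(rows):
    print(f"{onset:>4} - {offset:<4} {text}")
```

Each phrase carries the onset of its first word and the offset of its last word, which makes it easy to align text with the image and audio panels by time.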


## 4. Process images (faces) using the Face++ API

Get your key and secret from https://www.faceplusplus.com.

If you register at https://console.faceplusplus.com/register, use https://api-us.faceplusplus.com as the server.

If you register at https://console.faceplusplus.com.cn/register, use https://api-cn.faceplusplus.com as the server.

The Face++ Python SDK is included in this package (see the `PythonSDK` folder). You can also download it from https://github.com/FacePlusPlus/facepp-python-sdk.
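
Since the server must match the console you registered on, a small lookup can guard against mixing the two. This is an illustrative sketch: the region labels `"us"` and `"cn"` and the helper `pick_facepp_server` are hypothetical, while the two URLs are the ones listed above.

```python
# Server URLs as listed above, keyed by a hypothetical region label.
FACEPP_SERVERS = {
    "us": "https://api-us.faceplusplus.com",
    "cn": "https://api-cn.faceplusplus.com",
}

def pick_facepp_server(region):
    """Return the Face++ server URL for the console the account was registered on."""
    try:
        return FACEPP_SERVERS[region.lower()]
    except KeyError:
        raise ValueError(
            f"Unknown region {region!r}; expected one of {sorted(FACEPP_SERVERS)}"
        ) from None

FaceppServer = pick_facepp_server("us")
print(FaceppServer)
```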

In [10]:
from interactionvideo.faceppml import process_image_by_facepp

# Use Face++ ML API to process images
# Return csv files of facial emotion, gender, predicted age
# Find the output at
# - WorkPath\result_temp\face_panel_facepp.csv (full returns from Face++)
# - WorkPath\result_temp\face_panel.csv (clean results)

# Set your key, secret, and server here
FaceppKey = ''
FaceppSecret = ''
FaceppServer = 'https://api-us.faceplusplus.com'

facepp_result_df, facepp_result_clean_df = process_image_by_facepp(
    VideoFilePath, WorkPath, FaceppKey, FaceppSecret, FaceppServer)

Face++ API begins. 702 images to process.



100%|██████████████████████████████████████████████████████████████████████████████| 702/702 [1:13:47<00:00,  6.22s/it]


Face++ API ends. 702 images processed.



In [11]:
# Check full returns from Face++
facepp_result_df.head(10)

| Unnamed: 0 | ImageName | Onset | Offset | Duration | face_rectangle#top | face_rectangle#left | face_rectangle#width | face_rectangle#height | landmark#contour_chin#x | landmark#contour_chin#y | ... | attributes#beauty#male_score | attributes#beauty#female_score | attributes#mouthstatus#surgical_mask_or_respirator | attributes#mouthstatus#other_occlusion | attributes#mouthstatus#close | attributes#mouthstatus#open | Number of Faces | Visual-Positive | Visual-Negative | Visual-Beauty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | frame[0] | 0.0 | 0.1 | 0.1 | 405 | 868 | 249 | 249 | 1008 | 654 | ... | 41.801 | 44.379 | 0.0 | 0.003 | 0.054 | 99.943 | 1 | 7e-05 | 0.26876 | 0.4309 |
| 1 | frame[3] | 0.1 | 0.2 | 0.1 | 406 | 867 | 250 | 250 | 1007 | 655 | ... | 38.811 | 42.511 | 0.0 | 0.0 | 0.0 | 100.0 | 1 | 8e-05 | 0.2349 | 0.40661 |
| 2 | frame[6] | 0.2 | 0.3 | 0.1 | 404 | 866 | 252 | 252 | 1006 | 655 | ... | 39.404 | 43.379 | 0.0 | 0.0 | 0.0 | 100.0 | 1 | 0.00115 | 0.33071 | 0.413915 |
| 3 | frame[9] | 0.3 | 0.4 | 0.1 | 403 | 867 | 253 | 253 | 1006 | 655 | ... | 38.345 | 42.218 | 0.0 | 0.0 | 0.0 | 100.0 | 1 | 0.00153 | 0.33095 | 0.402815 |
| 4 | frame[12] | 0.4 | 0.5 | 0.1 | 401 | 866 | 258 | 258 | 1005 | 658 | ... | 38.964 | 44.045 | 0.0 | 0.0 | 0.0 | 100.0 | 1 | 0.00039 | 0.92737 | 0.415045 |
| 5 | frame[15] | 0.5 | 0.6 | 0.1 | 405 | 867 | 261 | 261 | 1010 | 665 | ... | 42.981 | 46.557 | 0.0 | 0.0 | 0.0 | 100.0 | 1 | 0.00734 | 0.98612 | 0.44769 |
| 6 | frame[18] | 0.6 | 0.7 | 0.1 | 407 | 867 | 261 | 261 | 1010 | 667 | ... | 43.866 | 46.03 | 0.0 | 0.0 | 0.0 | 100.0 | 1 | 0.00196 | 0.80259 | 0.44948 |
| 7 | frame[21] | 0.7 | 0.8 | 0.1 | 404 | 869 | 258 | 258 | 1012 | 662 | ... | 42.087 | 47.846 | 0.0 | 0.0 | 0.0 | 99.999 | 1 | 0.00021 | 0.09575 | 0.449665 |
| 8 | frame[24] | 0.8 | 0.9 | 0.1 | 403 | 867 | 262 | 262 | 1014 | 664 | ... | 42.112 | 48.182 | 0.0 | 0.003 | 0.0 | 99.997 | 1 | 0.00095 | 0.60956 | 0.45147 |
| 9 | frame[27] | 0.9 | 1.0 | 0.1 | 402 | 868 | 262 | 262 | 1014 | 664 | ... | 44.736 | 49.043 | 0.0 | 0.001 | 0.065 | 99.934 | 1 | 0.00046 | 0.05656 | 0.468895 |
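
The column names in the full panel, such as `attributes#beauty#male_score`, read like nested fields joined with `#`. The sketch below shows one way to flatten a nested record into that naming scheme. The nested `face` dict is a hypothetical fragment built from the values in row 0, and `flatten` is an illustrative helper, not the package's own code.

```python
def flatten(nested, prefix=""):
    """Flatten a nested dict into a single level, joining keys with '#'."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}#{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

# Hypothetical nested record built from the values in row 0 above.
face = {
    "face_rectangle": {"top": 405, "left": 868, "width": 249, "height": 249},
    "attributes": {"beauty": {"male_score": 41.801, "female_score": 44.379}},
}
row = flatten(face)
print(sorted(row))

# Observation from the rows shown: Visual-Beauty matches the mean of the two
# beauty scores divided by 100. Whether the package computes it this way is
# an assumption.
beauty = (row["attributes#beauty#male_score"]
          + row["attributes#beauty#female_score"]) / 2 / 100
print(round(beauty, 6))
```

The same relation holds for the other rows shown; for example, row 1 gives (38.811 + 42.511) / 2 / 100 = 0.40661.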


In [12]:
# Check clean results
facepp_result_clean_df.head(10)

| Unnamed: 0 | Onset | Offset | Duration | Number of Faces | Gender | Age | Visual-Positive | Visual-Negative | Visual-Beauty |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.1 | 0.1 | 1 | Male | 31 | 7e-05 | 0.26876 | 0.4309 |
| 1 | 0.1 | 0.2 | 0.1 | 1 | Male | 33 | 8e-05 | 0.2349 | 0.40661 |
| 2 | 0.2 | 0.3 | 0.1 | 1 | Male | 30 | 0.00115 | 0.33071 | 0.413915 |
| 3 | 0.3 | 0.4 | 0.1 | 1 | Male | 28 | 0.00153 | 0.33095 | 0.402815 |
| 4 | 0.4 | 0.5 | 0.1 | 1 | Male | 28 | 0.00039 | 0.92737 | 0.415045 |
| 5 | 0.5 | 0.6 | 0.1 | 1 | Male | 26 | 0.00734 | 0.98612 | 0.44769 |
| 6 | 0.6 | 0.7 | 0.1 | 1 | Male | 30 | 0.00196 | 0.80259 | 0.44948 |
| 7 | 0.7 | 0.8 | 0.1 | 1 | Male | 32 | 0.00021 | 0.09575 | 0.449665 |
| 8 | 0.8 | 0.9 | 0.1 | 1 | Male | 29 | 0.00095 | 0.60956 | 0.45147 |
| 9 | 0.9 | 1.0 | 0.1 | 1 | Male | 29 | 0.00046 | 0.05656 | 0.468895 |


## 5. Process text using LM and NBF Dictionaries

Use the Loughran and McDonald (2011) Finance Dictionary (LM) to construct the verbal positive and verbal negative measures.

For more details, see https://sraf.nd.edu/textual-analysis/resources.

Use the Nicolas, Bai, and Fiske (2019) Social Psychology Dictionary (NBF) to construct the verbal warmth and verbal ability measures.

For more details, see https://psyarxiv.com/afm8k.

In [13]:
from interactionvideo.textualanalysis import process_text_by_dict

# Set LM-NBF dictionary path
DictionaryPath = join(RootPath,'data','VideoDictionary.csv')

# Dictionary-based textual analysis to get verbal measures
# (e.g., verbal positive, negative, warmth, ability)
# Return csv files of verbal positive, negative, warmth, and ability
# Find the output at 
# - WorkPath\result_temp\text_panel.csv
text_result_df = process_text_by_dict(WorkPath, DictionaryPath)

LM and NBF Dictionaries loaded.

Dictionary-based textual analysis begins. 183 words to process.

Dictionary-based textual analysis ends. 183 words processed.



In [14]:
# Check text panel from Dictionary
text_result_df.head(10)

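| Unnamed: 0 | Text | Onset | Offset | Duration | Sentence End | Verbal-Negative | Verbal-Positive | Verbal-Warmth | Verbal-Ability |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Hello, | 0.1 | 0.7 | 0.6 | True | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | everyone. | 0.7 | 1.1 | 0.4 | True | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | First | 1.1 | 1.5 | 0.4 | False | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | of | 1.5 | 1.6 | 0.1 | False | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | all, | 1.6 | 1.9 | 0.3 | True | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | we | 1.9 | 2.0 | 0.1 | False | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | will | 2.0 | 2.2 | 0.2 | False | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | like | 2.2 | 2.3 | 0.1 | False | 0.0 | 0.0 | 1.0 | 0.0 |
| 8 | to | 2.3 | 2.4 | 0.1 | False | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | thank | 2.4 | 2.7 | 0.3 | False | 0.0 | 0.0 | 1.0 | 0.0 |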
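
The rows above suggest a per-word lookup: each transcript token receives a 0/1 indicator for every category. The sketch below illustrates that idea with a toy dictionary. It is not the code in `textualanalysis.py`; `TOY_DICTIONARY` and `score_word` are hypothetical, the two entries simply mirror the rows where "like" and "thank" score 1.0 on Verbal-Warmth, and the punctuation stripping and lowercasing are assumptions about preprocessing.

```python
import string

# Toy dictionary for illustration only. The layout of the real
# VideoDictionary.csv is not shown in this document.
TOY_DICTIONARY = {
    "like": "Verbal-Warmth",
    "thank": "Verbal-Warmth",
}
CATEGORIES = ("Verbal-Negative", "Verbal-Positive", "Verbal-Warmth", "Verbal-Ability")

def score_word(token, dictionary=TOY_DICTIONARY):
    """Return a 0/1 indicator per category for one transcript token."""
    word = token.strip(string.punctuation).lower()  # "Hello," -> "hello"
    scores = {category: 0.0 for category in CATEGORIES}
    category = dictionary.get(word)
    if category is not None:
        scores[category] = 1.0
    return scores

for token in ["Hello,", "like", "thank"]:
    print(token, score_word(token))
```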


## 6. Process audio using pre-trained ML models

Construct vocal arousal and vocal valence from pre-trained SVM ML models in `pyAudioAnalysis`.

The pre-trained models are located at `mlmodel\pyAudioAnalysis`:

- `svmSpeechEmotion_arousal`
- `svmSpeechEmotion_arousalMEANS`
- `svmSpeechEmotion_valence`
- `svmSpeechEmotion_valenceMEANS`

For more details, please check https://github.com/tyiannak/pyAudioAnalysis/wiki/4.-Classification-and-Regression.

Construct vocal positive and vocal negative from pre-trained LSTM ML models in `speechemotionrecognition`.

The pre-trained model is located at `mlmodel\speechemotionrecognition`:

- `best_model_LSTM_39.h5`

For more details, please check https://github.com/harry-7/speech-emotion-recognition.

Note: `speechemotionrecognition` requires TensorFlow and Keras.


In [15]:
from interactionvideo.audioml import process_audio_by_pyAudioAnalysis

# Set the model path
pyAudioAnalysisModelPath = join(RootPath,'mlmodel','pyAudioAnalysis')

# Construct vocal arousal and vocal valence
# Find the output at 
# - WorkPath\result_temp\audio_panel_pyAudioAnalysis.csv
audio_result_df1 = process_audio_by_pyAudioAnalysis(WorkPath, pyAudioAnalysisModelPath)

pyAudioAnalysis vocal emotion analysis begins. 70.12 seconds audio to process.

pyAudioAnalysis ML model loaded.

pyAudioAnalysis vocal emotion analysis ends. 70.12 seconds audio processed.



In [16]:
# Check audio panel from pyAudioAnalysis
audio_result_df1.head()

Unnamed: 0,Onset,Offset,Duration,Vocal-Arousal,Vocal-Valence
0,0,70.12,70.12,0.404089,-0.01519


In [17]:
from interactionvideo.audioml import process_audio_by_speechemotionrecognition

# Set the model path
speechemotionrecognitionModelPath = join(RootPath,'mlmodel','speechemotionrecognition')

# Construct vocal positive and vocal negative
# Find the output at 
# - WorkPath\result_temp\audio_panel_speechemotionrecognition.csv
audio_result_df2 = process_audio_by_speechemotionrecognition(WorkPath, speechemotionrecognitionModelPath)

speechemotionrecognition vocal emotion analysis begins. 70.12 seconds audio to process.



Using TensorFlow backend.

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 128)               86016     
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                4128      
_________________________________________________________________
dense_2 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_3 (Dense)              (None, 4)                 68        
Total params: 90,740
Trainable params: 90,740
Non-trainable params: 0
_________________________________________________________________


None


speechemotionrecognition ML model loaded.

speechemotionrecognition vocal emotion analysis ends. 70.12 seconds audio processed.



In [18]:
# Check audio panel from speechemotionrecognition
audio_result_df2.head()

Unnamed: 0,Onset,Offset,Duration,Vocal-Positive,Vocal-Negative
0,0,70.12,70.12,0.459319,0.006388


## 7. Aggregate information from 3V to video level

In [19]:
from interactionvideo.aggregate import aggregate_3v_to_video

# Aggregate 3V information
# Find the output at 
# - WorkPath\result_temp\video_panel.csv
video_result_df = aggregate_3v_to_video(WorkPath)

3V to video aggregation finished.



In [20]:
# Check video panel
video_result_df.T

| Unnamed: 0 | 0 |
|---|---|
| Number of Faces | 1 |
| Gender | Male |
| Age | 32 |
| Visual-Positive | 0.0142023 |
| Visual-Negative | 0.443462 |
| Visual-Beauty | 0.450598 |
| Vocal-Positive | 0.46 |
| Vocal-Negative | 0.01 |
| Vocal-Arousal | 0.4 |
| Vocal-Valence | -0.02 |
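
To see how panel-level values might roll up into one video-level row, the sketch below combines numbers that appear earlier in this example. It is not the implementation of `aggregate_3v_to_video`. Averaging the frame-level measures is an assumption, and the mean here covers only the ten frames shown in step 4, so it will not equal the 702-frame value in the video panel. The vocal values, by contrast, do match the audio panels from step 6 once rounded to two decimals.

```python
from statistics import mean

# Frame-level Visual-Positive values from the ten clean rows shown in step 4.
visual_positive = [7e-05, 8e-05, 0.00115, 0.00153, 0.00039,
                   0.00734, 0.00196, 0.00021, 0.00095, 0.00046]

# Audio-level values from the two audio panels in step 6.
vocal = {
    "Vocal-Positive": 0.459319,
    "Vocal-Negative": 0.006388,
    "Vocal-Arousal": 0.404089,
    "Vocal-Valence": -0.01519,
}

# Assumption: frame-level measures are averaged over frames.
video_row = {"Visual-Positive (first 10 frames)": mean(visual_positive)}

# Rounding the audio measures to two decimals reproduces the video panel values.
video_row.update({name: round(value, 2) for name, value in vocal.items()})

for name, value in video_row.items():
    print(name, value)
```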
