# Experiment: Presidential Campaigns Ads Dataset - Data Collection

This notebook shows how to use public cloud services to augment data from a widely available source. The data will be used to predict candidates' odds of winning the presidential election. You will collect the data using a few dependencies (i.e. pytube, pydub and cv2) that let you download presidential campaign ad videos, split them into smaller audio files, and extract frames. You will then call public cloud services through their REST APIs to convert the audio to text and to analyze the content of the extracted text and frames. The services used are cognitive services, text analytics, speech recognition and optical character recognition, which are among the most powerful data augmentation tools offered by the Microsoft Azure public cloud. With these tools you will increase the amount of data at your disposal from a widely available source: YouTube videos.

![data_collection](img/data_collection.PNG)

**The aim of the experiment** is to demonstrate how public cloud services can be used in real applications to enhance social scientists' research. For this purpose we use only a handful of videos, but keep in mind that you could repeat the experiment with as many videos as you want. The videos we selected are some of the most influential presidential campaign commercials, and you can find a link to them below. We estimate that reading this guide, running the code and understanding how it works will take you around 10-15 minutes.


**YouTube Video Source:** [Ten of the Most Successful Presidential Campaign Ads Ever Made](https://www.kqed.org/lowdown/3955/ten-of-the-best-presidential-campaign-commercials-of-all-time)

_**NOTE:**_ remember to install the dependencies needed to run this experiment before starting this tutorial. Follow the instructions in the markdown within the workshop repository ([link to workshop repo](https://github.com/IQSS/workshops/tree/master/public_cloud_computing)).

# Table of Contents
* [Experiment: Presidential Campaigns Ads Dataset - Data Collection](#Experiment:-Presidential-Campaigns-Ads-Dataset---Data-Collection)
* [Data collection](#Data-collection)
    * [Use pytube to download videos from YouTube](#Use-pytube-to-download-videos-from-YouTube)
    * [Use pydub to chunk videos and convert to audio](#Use-pydub-to-chunck-videos-and-convert-to-audio)
    * [Use cv2 to extract frames from video](#Use-cv2-to-extract-frames-from-video)
    * [Combine data into a dataframe](#Combine-data-into-a-dataframe)
* [Recap](#Recap)
    * [What you have learnt](#What-you-have-learnt)
    * [What you will learn next guide](#What-you-will-learn-next-guide)

## Data collection
### Use pytube to download videos from YouTube

We are going to download the YouTube videos listed on the webpage [Ten of the Most Successful Presidential Campaign Ads Ever Made](https://www.kqed.org/lowdown/3955/ten-of-the-best-presidential-campaign-commercials-of-all-time). To do so, we will use the pytube package ([pytube doc](https://github.com/nficano/pytube)), which makes it easy to download YouTube videos from a notebook or Python script. Follow these steps:

- import libraries
- set directory to save video
- select video URLs
- download video

In [1]:
#import libraries
import os
import time
import requests
from pytube import YouTube

In [2]:
#set notebook current directory
cur_dir = os.getcwd()

#set directory to the folder where we are going to download the videos
os.chdir('../../data/video/video.mp4/')
dir_video = os.getcwd()

#print directories
print('---------------------------------------------------------')
print('Your documents directories are:')
print('- notebook:\t', cur_dir)
print('- video:\t', dir_video)
print('---------------------------------------------------------')

---------------------------------------------------------
Your documents directories are:
- notebook:	 C:\Users\popor\iqss_workshop\workshops\public_cloud_computing\experiment\data_collection
- video:	 C:\Users\popor\iqss_workshop\workshops\public_cloud_computing\data\video\video.mp4
---------------------------------------------------------
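The cell above moves into the video folder with `os.chdir`, which raises `FileNotFoundError` if that folder does not exist yet. As a minimal sketch, assuming you would rather create missing folders than fail, a small helper can build and create a path in one step. The name `ensure_dir` and the temporary base folder are hypothetical and used only for illustration.

```python
import os
import tempfile

def ensure_dir(*parts):
    """Join the path parts, create the folder if it is missing, return its absolute path."""
    path = os.path.abspath(os.path.join(*parts))
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path

# Demo in a throwaway base folder so nothing in the workshop repository is touched
base = tempfile.mkdtemp()
dir_video_demo = ensure_dir(base, 'data', 'video', 'video.mp4')
print(os.path.isdir(dir_video_demo))  # True
```

Because the helper returns an absolute path, it can be passed directly to functions that take a target folder, without changing the working directory.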


In [3]:
#list of URLs in the page (consider writing a script to scrape for greater quantities)
videos_urls = ['https://youtu.be/Y9RAxAgksSE', 
                'https://youtu.be/vs5ORK8RLWk', 
                'https://youtu.be/dDTBnsqxZ3k',
                'https://youtu.be/Qwk_epMblW4', 
                'https://youtu.be/qVcFUIXEDZ8', 
                'https://youtu.be/EU-IBF8nwSY',
                'https://youtu.be/PmwhdDv8VrM', 
                'https://youtu.be/Xq_x3JUwrU0', 
                'https://youtu.be/pbdzMLk9wHQ', 
                'https://youtu.be/jjXyqcx-mYY']

In [4]:
#set procedure starting time
print('---------------------------------------------------------')
print("Start downloading video")
print('---------------------------------------------------------')
start = time.time()

#store useful info to lists
titles = []
#thumbnail_urls = []
video_file_spec = []

#download video
for video in videos_urls:

    #instantiate YouTube object passing video URL
    yt = YouTube(video)

    #extract the video's title
    replacements = {" ": "_", "-": "", "'": "", '"': "", ':':'', '.':'', '(':'', ')':''}
    title = "".join([replacements.get(c, c) for c in yt.title.lower()])
    titles.append(title)

    #choose file to download (filter by extension and resolution and take first in the list)
    video_file = yt.streams.filter(file_extension = 'mp4', res="360p").first()
    video_file_spec.append(video_file)
    
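    #download video (note: depending on the pytube version, `filename` may need to include the '.mp4' extension)
    try:
        video_file.download(dir_video, filename=title)
        print('{} download status: success'.format(title))
    except Exception as err:
        #report the reason for the failure instead of silently swallowing it
        print('{} download status: fail ({})'.format(title, err))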

#set procedure ending time
end = time.time()
print('---------------------------------------------------------')
print('Download completed')
print('---------------------------------------------------------')
print('It took {} seconds to download {} videos'.format(round((end - start),2), len(video_file_spec)))

---------------------------------------------------------
Start downloading video
---------------------------------------------------------
eisenhower_for_president_1952 download status: success
kennedy_for_me_campaign_jingle_jfk_1960 download status: success
high_quality_famous_daisy_attack_ad_from_1964_presidential_election download status: success
humphrey_laughing_at_spiro_agnew_1968_political_ad download status: success
mcgovern_defense_plan_ad_nixon_1972_presidential_campaign_commercial download status: success
ronald_reagan_tv_ad_its_morning_in_america_again download status: success
1988_george_bush_sr_revolving_door_attack_ad_campaign download status: success
bill_clinton_hope_ad_1992 download status: success
historical_campaign_ad_windsurfing_bushcheney_04 download status: success
yes_we_can__barack_obama_music_video download status: success
---------------------------------------------------------
Download completed
---------------------------------------------------------
It
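The file names in the output above come from the character replacement table in the download cell. As a minimal sketch, the same logic can be wrapped in a reusable function. The name `sanitize_title` is hypothetical, and the sample title is made up for illustration rather than taken from YouTube.

```python
# Replacement table copied from the download cell above
REPLACEMENTS = {" ": "_", "-": "", "'": "", '"': "", ":": "", ".": "", "(": "", ")": ""}

def sanitize_title(raw_title):
    """Lower-case a title and apply the replacement table character by character."""
    return "".join(REPLACEMENTS.get(c, c) for c in raw_title.lower())

# Made-up title used only to illustrate the transformation
print(sanitize_title("Eisenhower for President (1952)"))  # eisenhower_for_president_1952
```

Characters that are not in the table (for example `/` or `?`) pass through unchanged, so titles containing them may still need extra handling before being used as file names.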

### Use pydub to chunck videos and convert to audio

Next, we are going to split each video into chunks and convert each chunk to WAV audio. To do so, we will use the pydub package ([pydub doc](https://github.com/jiaaro/pydub)), which lets us do this from a notebook or Python script. Follow these steps:

- import library
- set directory to save the converted audio
- split video into chunks and convert audio from .mp4 to .wav
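Before running the conversion, it can help to estimate how many audio files each video will produce. The sketch below assumes fixed-length chunks of 10,000 ms (the value used in this guide) and assumes that a shorter remainder at the end is kept as its own chunk. The function name `expected_chunks` is hypothetical.

```python
import math

def expected_chunks(duration_ms, chunk_length_ms=10000):
    """Number of pieces when audio is cut every chunk_length_ms; the last piece may be shorter."""
    return math.ceil(duration_ms / chunk_length_ms)

# Durations in seconds, converted to whole milliseconds
for seconds in (62.09, 19.25, 59.95):
    print(seconds, expected_chunks(round(seconds * 1000)))
```

A video of 62.09 seconds gives seven chunks under these assumptions: six full 10-second pieces plus one remainder of about 2 seconds.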

In [5]:
#import libraries
from pydub import AudioSegment
from pydub.utils import make_chunks

In [6]:
#set directory to the folder where we are going to save audio file
os.chdir(cur_dir)
os.chdir('../../data/video/audio/')
dir_audio = os.getcwd()

#print directories where files are going to be saved
print('---------------------------------------------------------')
print('Documents will be stored in the following directory:')
print('- audio:\t', dir_audio)
print('---------------------------------------------------------')

---------------------------------------------------------
Documents will be stored in the following directory:
- audio:	 C:\Users\popor\iqss_workshop\workshops\public_cloud_computing\data\video\audio
---------------------------------------------------------


In [7]:
#set procedure starting time
print('---------------------------------------------------------')
print("Start dividing video in chuncks and converting to audio")
print('---------------------------------------------------------')
start = time.time()

#store audio length
audio_lenght = []

#convert video to audio, and split in chunks
for title in titles:
    
    #set path to file to convert
    path_to_file = os.path.join(dir_video, title) + '.mp4'
        
    #read file into variable
    myaudio = AudioSegment.from_file(path_to_file, "mp4")
    
    #estimate length
    file_length = round(len(myaudio)/1000, 2)
    audio_lenght.append(file_length)
    
    #make chunks of 10 seconds (10,000 ms) each
    chunk_length_ms = 10000
    chunks = make_chunks(myaudio, chunk_length_ms)

    #take each chunk and export it to path as wav
    dir_file = os.path.join(dir_audio, title)
    for i, chunk in enumerate(chunks):
        chunk.export(dir_file + "_chunck_{}.wav".format(i+1), format="wav")
        
    print('The video {0} is long {1} seconds and has been split in {2} audio files'.format(title, file_length,len(chunks)))
    
#set procedure ending time
end = time.time()
print('---------------------------------------------------------')
print('Process completed')
print('---------------------------------------------------------')
print('It took {} seconds to chunk the video and convert '.format(round(end - start, 2)))

---------------------------------------------------------
Start dividing video in chuncks and converting to audio
---------------------------------------------------------
The video eisenhower_for_president_1952 is long 62.09 seconds and has been split in 7 audio files
The video kennedy_for_me_campaign_jingle_jfk_1960 is long 60.23 seconds and has been split in 7 audio files
The video high_quality_famous_daisy_attack_ad_from_1964_presidential_election is long 66.9 seconds and has been split in 7 audio files
The video humphrey_laughing_at_spiro_agnew_1968_political_ad is long 19.25 seconds and has been split in 2 audio files
The video mcgovern_defense_plan_ad_nixon_1972_presidential_campaign_commercial is long 60.05 seconds and has been split in 7 audio files
The video ronald_reagan_tv_ad_its_morning_in_america_again is long 59.95 seconds and has been split in 6 audio files
The video 1988_george_bush_sr_revolving_door_attack_ad_campaign is long 29.88 seconds and has been split in 3 audi

### Use cv2 to extract frames from video

Next, we are going to extract frames from each video. To do so, we will use the cv2 package ([cv2 doc](https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_gui/py_image_display/py_image_display.html)), which lets us do this from a notebook or Python script. Follow these steps:

- import library
- retrieve directory to load video and set directory to save frames
- extract frames from video and save to .jpg 

NOTE: this will generate many files on your local storage (~500MB); remember to delete them once you have finished.
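To keep the number of files manageable, this guide saves only every 100th frame. As a minimal sketch, the indices of the frames that get saved can be computed in advance from the total frame count. The function name `sampled_frame_indices` is hypothetical.

```python
def sampled_frame_indices(total_frames, step=100):
    """Indices of the frames kept when saving frame 0 and every step-th frame after it."""
    return list(range(0, total_frames, step))

indices = sampled_frame_indices(575)
print(indices)       # [0, 100, 200, 300, 400, 500]
print(len(indices))  # 6
```

At roughly 30 frames per second, a step of 100 keeps about one frame every three seconds. Lowering `step` gives denser coverage at the cost of more files on disk.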

In [8]:
#import library
import cv2

In [9]:
#set directory to the folder where we are going to save the extracted frames
os.chdir(cur_dir)
os.chdir('../../data/image/')
dir_images = os.getcwd()

#print directories where files are going to be saved
print('---------------------------------------------------------')
print('Documents will be stored in the following directory:')
print('- video:\t', dir_video)
print('- image:\t', dir_images)
print('---------------------------------------------------------')

---------------------------------------------------------
Documents will be stored in the following directory:
- video:	 C:\Users\popor\iqss_workshop\workshops\public_cloud_computing\data\video\video.mp4
- image:	 C:\Users\popor\iqss_workshop\workshops\public_cloud_computing\data\image
---------------------------------------------------------


In [11]:
#set procedure starting time
print('---------------------------------------------------------')
print("Start extracting frames from video")
print('---------------------------------------------------------')
start = time.time()

#make a new directory (no error if it already exists, so the cell can be re-run)
os.makedirs('frames', exist_ok=True)

#store the number of frames for video and frame extracted
frame_count = []
frame_saved = []

#read each video frame by frame and save a sample of the frames
for title in titles:
    
    #set input path
    path_to_file = os.path.join(dir_video, title) + '.mp4'
    save_img_path = (os.path.join(dir_images, 'frames'))

    #create object to capture frames from cv2 module
    cap = cv2.VideoCapture(path_to_file)
    
    #get video frame and save to output path
    count = 0
    frames = 0
    while (cap.isOpened()):
        #capture frame-by-frame
        ret, frame = cap.read()
        #print(ret,frame)
        if ret:
            #save only frame 0 and every 100th frame after it
            if count%100 == 0:
                save_frame_path =  (os.path.join(save_img_path, title + '_frame{:d}.jpg'.format(count)))
                cv2.imwrite(save_frame_path, frame)
                frames +=1
            count += 1
        else:
            break
    
    #release the capture
    cap.release()
    cv2.destroyAllWindows()
    
    #add number of frames to list
    #frame_extracted = round(count//100)
    frame_saved.append(frames)
    frame_count.append(count)
    
    print('The video {0} has been split in {1} frames and {2} has been extracted'.format(title, count, frames))
    
#set procedure ending time
end = time.time()
print('---------------------------------------------------------')
print('Process completed')
print('---------------------------------------------------------')
print('It took {} seconds to partion the videos into frames and to save them'.format(round(end - start, 2)))

---------------------------------------------------------
Start extracting frames from video
---------------------------------------------------------
The video eisenhower_for_president_1952 has been split in 1859 frames and 19 has been extracted
The video kennedy_for_me_campaign_jingle_jfk_1960 has been split in 1789 frames and 18 has been extracted
The video high_quality_famous_daisy_attack_ad_from_1964_presidential_election has been split in 2003 frames and 21 has been extracted
The video humphrey_laughing_at_spiro_agnew_1968_political_ad has been split in 575 frames and 6 has been extracted
The video mcgovern_defense_plan_ad_nixon_1972_presidential_campaign_commercial has been split in 1798 frames and 18 has been extracted
The video ronald_reagan_tv_ad_its_morning_in_america_again has been split in 1793 frames and 18 has been extracted
The video 1988_george_bush_sr_revolving_door_attack_ad_campaign has been split in 894 frames and 9 has been extracted
The video bill_clinton_hope_ad

### Combine data into a dataframe

To combine the data into a dataframe, follow these steps:

- import library
- add additional data to list
- retrieve data previously stored
- organize lists into a dataframe
- save the dataframe
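The dataframe mixes lists collected by the earlier cells with lists typed by hand, and pandas raises a `ValueError` when a dictionary of lists with different lengths is passed to `pd.DataFrame`. As a minimal sketch, a short check run beforehand can report which list is off. The function name `check_equal_lengths` and the two sample lists are hypothetical.

```python
def check_equal_lengths(**columns):
    """Return the common length of the named lists, or raise ValueError if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return next(iter(lengths.values()), 0)

# Two short sample lists used only for illustration
n_rows = check_equal_lengths(year=['1952', '1960'], candidate=['DDE', 'JFK'])
print(n_rows)  # 2
```

The error message lists every column with its length, which makes a missing or duplicated entry in a hand-typed list easy to spot.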

In [12]:
#import library
import pandas as pd
import pickle
import json

In [14]:
#add some predictors
year = ['1952','1960','1964','1968','1972','1984','1988','1992','2004','2008']
candidate = ['DDE', 'JFK', 'LBJ', 'HH', 'RN','RR','GBS','BC','GBJ','BO' ]
party = ['republican','democratic','democratic','democratic','republican','republican','republican','democratic','republican','democratic']
#frames per second: total frame count divided by video length in seconds
frame_sec = [n_frames/length for n_frames, length in zip(frame_count, audio_lenght)]

df_data_collection = pd.DataFrame({'video_url':videos_urls,
                                   'video_title':titles,
                                   'video_length[sec]':audio_lenght,
                                   'video_frames[n]': frame_count,
                                   'frame_sec[n/sec]': frame_sec,
                                   'frame_extracted[n]': frame_saved,
                                   'year':year,
                                   'candidate_name':candidate,
                                   'party': party})

#save dataframe to folder
df_data_collection.to_csv('../../data/dataset/data_collection_presidential_campaign.csv', sep=',', encoding='utf-8')

#display
df_data_collection

Unnamed: 0,video_url,video_title,video_length[sec],video_frames[n],frame_sec[n/sec],frame_extracted[n],year,candidate_name,party
0,https://youtu.be/Y9RAxAgksSE,eisenhower_for_president_1952,62.09,1859,29.940409,19,1952,DDE,republican
1,https://youtu.be/vs5ORK8RLWk,kennedy_for_me_campaign_jingle_jfk_1960,60.23,1789,29.702806,18,1960,JFK,democratic
2,https://youtu.be/dDTBnsqxZ3k,high_quality_famous_daisy_attack_ad_from_1964_...,66.9,2003,29.940209,21,1964,LBJ,democratic
3,https://youtu.be/Qwk_epMblW4,humphrey_laughing_at_spiro_agnew_1968_politica...,19.25,575,29.87013,6,1968,HH,democratic
4,https://youtu.be/qVcFUIXEDZ8,mcgovern_defense_plan_ad_nixon_1972_presidenti...,60.05,1798,29.941715,18,1972,RN,republican
5,https://youtu.be/EU-IBF8nwSY,ronald_reagan_tv_ad_its_morning_in_america_again,59.95,1793,29.908257,18,1984,RR,republican
6,https://youtu.be/PmwhdDv8VrM,1988_george_bush_sr_revolving_door_attack_ad_c...,29.88,894,29.919679,9,1988,GBS,republican
7,https://youtu.be/Xq_x3JUwrU0,bill_clinton_hope_ad_1992,60.26,1802,29.90375,19,1992,BC,democratic
8,https://youtu.be/pbdzMLk9wHQ,historical_campaign_ad_windsurfing_bushcheney_04,30.09,900,29.910269,9,2004,GBJ,republican
9,https://youtu.be/jjXyqcx-mYY,yes_we_can__barack_obama_music_video,270.21,3781,13.99282,38,2008,BO,democratic


## Recap
### What you have learnt

- How to download YouTube videos using pytube
- How to partition video and convert to chosen format using pydub
- How to extract frames from a video using cv2
- How to organize the dataset and save it to a CSV file

### What you will learn next guide

- The next guide will show you how to apply the cloud computing services learned in guide 3 to the data you just collected. The end result will be a dataset with all the extracted features.
    
### Questions for you

- What other information could we extract from these YouTube videos?
- Can you think of data sources similar to the one presented here?
- Do you think the packages presented here will be useful for your future research?

In [42]:
#import library to display notebook as HTML
import os
from IPython.display import HTML

#path to .css style script
cur_path = os.path.dirname(os.path.abspath("__file__"))
new_path = os.path.relpath('..\\..\\styles\\custom_styles_public_cloud_computing.css', cur_path)

#function to display notebook
def css():
    #read the stylesheet and close the file when done
    with open(new_path, "r") as style_file:
        style = style_file.read()
    return HTML(style)

In [43]:
#run this cell to apply HTML style
css()