# Experiment: Predict Elections Using Presidential Commercial Campaign

This notebook shows how to use public cloud services to augment a widely available data source: YouTube videos. The augmented data will be used to predict candidates' odds of winning the presidential election. You will first collect the data with a few dependencies (pytube, pydub, and cv2) that let you download presidential campaign ad videos, split them into smaller audio files, and extract their frames. You will then call public cloud services through REST APIs to convert the audio to text and to analyze the content of the extracted text and frames. The services used here (cognitive services, text analytics, speech recognition, and optical character recognition) are among the data augmentation tools offered by the Microsoft Azure public cloud. The notebook ends by applying a statistical model to predict a presidential candidate's likelihood of winning, based mainly on campaign ads from past elections. For this experiment, we use only the most influential presidential campaign commercials, as listed in the link below.

**Source:** [Ten of the Most Successful Presidential Campaign Ads Ever Made](https://www.kqed.org/lowdown/3955/ten-of-the-best-presidential-campaign-commercials-of-all-time)

# Table of Contents
* [Experiment: Predict Elections Using Presidential Commercial Campaign](#Experiment:-Predict-Elections-Using-Presidential-Commercial-Campaign)
* [Data collection](#Data-collection)
    * [Set up environment and dependencies](#Set-up-environment-and-dependencies)        
    * [Use pytube to download videos from YouTube](#Use-pytube-to-download-videos-from-YouTube)
    * [Use pydub to chunk videos and convert to audio](#Use-pydub-to-chunk-videos-and-convert-to-audio)
    * [Use cv2 to extract frames from video](#Use-cv2-to-extract-frames-from-video)
    * [Combine data into a dataframe](#Combine-data-into-a-dataframe)
* [Feature engineering: extract data using Microsoft Azure public cloud services](#Paragraph-3)
    * [Set up containers and upload files (audio & image)](#Set-up-containers-and-upload-files(audio-&-image))
    * [Extract speech from audio using Bing Speech Recognition API](#Extract-speech-from-audio-using-Bing-Speech-Recognition-API)
    * [Extract sentiment and key phrases from text using Text Analytics API](#Extract-sentiment-and-key-phrases-from-text-using-Text-Analytics-API)
    * [Extract images contents and text using Vision API](#Paragraph-3)
    * [Extract facial features using Vision API](#Paragraph-3)
* [Modeling: Classification](#Paragraph-3)
    * [One versus one](#Paragraph-3)
    * [Bag of words](#Paragraph-3)

## Data collection

We are going to download the videos listed on the webpage [Ten of the Most Successful Presidential Campaign Ads Ever Made](https://www.kqed.org/lowdown/3955/ten-of-the-best-presidential-campaign-commercials-of-all-time).

### Set up environment and dependencies

TO DO 
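
Until this section is filled in, the following is a minimal sketch of an environment check. It assumes that the third-party modules imported later in this notebook (pytube, pydub, cv2, pandas, requests) are the only dependencies; the helper name `missing_modules` is hypothetical. Note that cv2 is usually installed from the `opencv-python` package, and that pydub relies on ffmpeg being available to read .mp4 files.

```python
import importlib.util

# third-party modules imported in the cells of this notebook (assumed to be the full list)
REQUIRED = ["pytube", "pydub", "cv2", "pandas", "requests"]

def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available")
```

Any module reported as missing needs to be installed before running the cells below.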

### Use pytube to download videos from YouTube

To download the videos from YouTube, follow these steps:

- import library
- set directory to save video
- select video URLs
- download video (and thumbnails)

In [8]:
#import libraries
import os
import time
import requests
from pytube import YouTube

In [9]:
#set directory to the folder where we are going to download the videos
cur_dir = os.getcwd()
os.chdir('../../data/video/video.mp4/')
dir_video = os.getcwd()

#print directories where files are going to be saved
print('---------------------------------------------------------')
print('Documents will be stored in the following directory:')
print('- video:\t', dir_video)
print('---------------------------------------------------------')

---------------------------------------------------------
Documents will be stored in the following directory:
- video:	 C:\Users\popor\iqss_workshop\workshops\public_cloud_computing\data\video\video.mp4
---------------------------------------------------------


In [10]:
#list of URLs in the page (consider writing a script to scrape for greater quantities)
videos_urls = ['https://youtu.be/Y9RAxAgksSE', 
                'https://youtu.be/vs5ORK8RLWk', 
                'https://youtu.be/dDTBnsqxZ3k',
                'https://youtu.be/Qwk_epMblW4', 
                'https://youtu.be/qVcFUIXEDZ8', 
                'https://youtu.be/EU-IBF8nwSY',
                'https://youtu.be/PmwhdDv8VrM', 
                'https://youtu.be/Xq_x3JUwrU0', 
                'https://youtu.be/pbdzMLk9wHQ', 
                'https://youtu.be/jjXyqcx-mYY']

In [11]:
#set procedure starting time
print('-------------------')
print("Start downloading video")
print('-------------------')
start = time.time()

#store useful info to lists
titles = []
#thumbnail_urls = []
video_file_spec = []

#download video
for video in videos_urls:

    #instantiate YouTube object passing video URL
    yt = YouTube(video)

    #extract the video's title
    replacements = {" ": "_", "-": "", "'": "", '"': "", ':':'', '.':'', '(':'', ')':''}
    title = "".join([replacements.get(c, c) for c in yt.title.lower()])
    titles.append(title)

    #choose file to download (filter by extension and resolution and take first in the list)
    video_file = yt.streams.filter(file_extension = 'mp4', res="360p").first()
    video_file_spec.append(video_file)
    
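    #download video (catch Exception rather than using a bare except, so that Ctrl-C still interrupts the loop)
    try:
        video_file.download(dir_video, filename=title)
        print('{} download status: success'.format(title))
    except Exception:
        print('{} download status: fail'.format(title))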

#set procedure ending time
end = time.time()
print('-------------------')
print('Download completed')
print('-------------------')
print('It took {} seconds to download {} videos'.format(round((end - start),2), len(video_file_spec)))

-------------------
Start downloading video
-------------------
eisenhower_for_president_1952 download status: success
kennedy_for_me_campaign_jingle_jfk_1960 download status: success
high_quality_famous_daisy_attack_ad_from_1964_presidential_election download status: success
humphrey_laughing_at_spiro_agnew_1968_political_ad download status: success
mcgovern_defense_plan_ad_nixon_1972_presidential_campaign_commercial download status: success
ronald_reagan_tv_ad_its_morning_in_america_again download status: success
1988_george_bush_sr_revolving_door_attack_ad_campaign download status: success
bill_clinton_hope_ad_1992 download status: success
historical_campaign_ad_windsurfing_bushcheney_04 download status: success
yes_we_can__barack_obama_music_video download status: success
-------------------
Download completed
-------------------
It took 21.51 seconds to download 10 videos
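
The title clean-up used in the download loop can be isolated in a small helper, which makes it easier to test on its own. This is only a sketch: the name `clean_title` is hypothetical, and the sample title passed to it is made up for illustration.

```python
# characters to replace or drop (same mapping as in the download loop)
REPLACEMENTS = {" ": "_", "-": "", "'": "", '"': "", ":": "", ".": "", "(": "", ")": ""}

def clean_title(raw_title):
    """Lower-case a title and replace or drop characters that are awkward in file names."""
    return "".join(REPLACEMENTS.get(c, c) for c in raw_title.lower())

# made-up title, used only to show the effect of the mapping
print(clean_title("Bill Clinton 'Hope' ad (1992)"))
```

Because a hyphen surrounded by spaces is dropped while both spaces become underscores, titles such as the last one in the output above end up with a double underscore.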


### Use pydub to chunk videos and convert to audio

To partition each video into chunks and convert them to audio, follow these steps:

- import library
- set directory to save the converted audio
- split video into chunks and convert audio from .mp4 to .wav

In [12]:
#import libraries
#import pydub
from pydub import AudioSegment
from pydub.utils import make_chunks

In [15]:
#set directory to the folder where we are going to save audio file
os.chdir(cur_dir)
os.chdir('../../data/video/audio/')
dir_audio = os.getcwd()

#print directories where files are going to be saved
print('---------------------------------------------------------')
print('Documents will be stored in the following directory:')
print('- audio:\t', dir_audio)
print('---------------------------------------------------------')

---------------------------------------------------------
Documents will be stored in the following directory:
- audio:	 C:\Users\popor\iqss_workshop\workshops\public_cloud_computing\data\video\audio
---------------------------------------------------------


In [16]:
#set procedure starting time
print('-------------------')
print("Start dividing video in chuncks and converting to audio")
print('-------------------')
start = time.time()

#store audio length
audio_lenght = []

#convert video to audio, and split in chunks
for title in titles:
    
    #set path to file to convert
    path_to_file = os.path.join(dir_video, title) + '.mp4'
        
    #read file into variable
    myaudio = AudioSegment.from_file(path_to_file, "mp4")
    
    #estimate length
    file_length = round(len(myaudio)/1000, 2)
    audio_lenght.append(file_length)
    
    #make chunks of fixed length (10 seconds)
    chunk_length_ms = 10000
    chunks = make_chunks(myaudio, chunk_length_ms)

    #take each chunk and export it to path as wav
    dir_file = os.path.join(dir_audio, title)
    for i, chunk in enumerate(chunks):
        chunk.export(dir_file + "_chunck_{}.wav".format(i+1), format="wav")
        
    print('The video {0} is long {1} seconds and has been split in {2} audio files'.format(title, file_length,len(chunks)))
    
#set procedure ending time
end = time.time()
print('-------------------')
print('Process completed')
print('-------------------')
print('It took {} seconds to chunk the video and convert '.format(round(end - start, 2)))

-------------------
Start dividing video in chuncks and converting to audio
-------------------
The video eisenhower_for_president_1952 is long 62.09 seconds and has been split in 7 audio files
The video kennedy_for_me_campaign_jingle_jfk_1960 is long 60.23 seconds and has been split in 7 audio files
The video high_quality_famous_daisy_attack_ad_from_1964_presidential_election is long 66.9 seconds and has been split in 7 audio files
The video humphrey_laughing_at_spiro_agnew_1968_political_ad is long 19.25 seconds and has been split in 2 audio files
The video mcgovern_defense_plan_ad_nixon_1972_presidential_campaign_commercial is long 60.05 seconds and has been split in 7 audio files
The video ronald_reagan_tv_ad_its_morning_in_america_again is long 59.95 seconds and has been split in 6 audio files
The video 1988_george_bush_sr_revolving_door_attack_ad_campaign is long 29.88 seconds and has been split in 3 audio files
The video bill_clinton_hope_ad_1992 is long 60.26 seconds and has be
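
The number of audio files per video follows from the chunk length. The sketch below assumes that the audio is sliced every `chunk_length_ms` milliseconds and that the shorter remainder is kept as the last chunk, which is consistent with the counts printed above; the helper name `count_chunks` is hypothetical.

```python
import math

def count_chunks(length_sec, chunk_length_ms=10000):
    """Number of fixed-length chunks needed to cover an audio file of the given length."""
    length_ms = round(length_sec * 1000)
    return math.ceil(length_ms / chunk_length_ms)

# lengths (in seconds) taken from the output above
for seconds in (62.09, 19.25, 59.95, 29.88):
    print(seconds, "seconds ->", count_chunks(seconds), "chunks")
```

This explains why a 60-second ad can yield 7 files: six full 10-second chunks plus a final, much shorter one.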

### Use cv2 to extract frames from video

To extract frames from the downloaded videos, follow these steps:

- import library
- retrieve directory to load video and set directory to save frames
- extract frames from video and save to .jpg 

NOTE: this will write many files to your local disk (~500MB); remember to delete them once you have finished.
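
If disk space is a concern, one option (not used in this notebook) is to save only one frame every `step` frames. The sketch below shows which frame indices would be kept; the helper name `frames_to_keep` and the choice of `step` are assumptions. In the extraction loop of this section, this would amount to calling `cv2.imwrite` only when `count % step == 0`.

```python
def frames_to_keep(total_frames, step):
    """Indices of the frames kept when saving one frame every `step` frames."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return list(range(0, total_frames, step))

# example: a video of 1859 frames at roughly 30 frames per second,
# keeping about one frame per second
kept = frames_to_keep(1859, 30)
print(len(kept), "frames kept instead of 1859")
```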

In [17]:
#import libraries
import cv2

In [19]:
#set directory to the folder where we are going to save the frames (images)
os.chdir(cur_dir)
os.chdir('../../data/image/')
dir_images = os.getcwd()

#print directories where files are going to be saved
print('---------------------------------------------------------')
print('Documents will be stored in the following directory:')
print('- video:\t', dir_video)
print('- image:\t', dir_images)
print('---------------------------------------------------------')

---------------------------------------------------------
Documents will be stored in the following directory:
- video:	 C:\Users\popor\iqss_workshop\workshops\public_cloud_computing\data\video\video.mp4
- image:	 C:\Users\popor\iqss_workshop\workshops\public_cloud_computing\data\image
---------------------------------------------------------


In [25]:
#set procedure starting time
print('-------------------')
print("Start extracting frames from video")
print('-------------------')
start = time.time()

#store the number of frames for video
frame_count = []

#extract frames from each video and save them as .jpg
for title in titles:
    
    #set input and output paths
    save_img_path = (os.path.join(dir_images, title))
    path_to_file = os.path.join(dir_video, title) + '.mp4'
    
    #make a new directory (no error if it already exists, so the cell can be re-run)
    os.makedirs(save_img_path, exist_ok=True)
    
    #create object to capture frames from cv2 module
    cap = cv2.VideoCapture(path_to_file)
    
    #get video frame and save to output path
    count = 0
    while (cap.isOpened()):
        #capture frame-by-frame
        ret, frame = cap.read()
        #print(ret,frame)
        if ret:
            #save frame as JPEG file
            cv2.imwrite(os.path.join(save_img_path, "frame{:d}.jpg".format(count)), frame)  
            count += 1
        else:
            break
    
    #release the capture
    cap.release()
    cv2.destroyAllWindows()
    
    #add number of frames extracted to list
    frame_count.append(count)
    
    print('The video {0} has been split in {1} frames'.format(title, count))
    
#set procedure ending time
end = time.time()
print('-------------------')
print('Process completed')
print('-------------------')
print('It took {} seconds to partition the videos into frames '.format(round(end - start, 2)))

-------------------
Start extracting frames from video
-------------------
The video eisenhower_for_president_1952 has been split in 1859 frames
The video kennedy_for_me_campaign_jingle_jfk_1960 has been split in 1789 frames
The video high_quality_famous_daisy_attack_ad_from_1964_presidential_election has been split in 2003 frames
The video humphrey_laughing_at_spiro_agnew_1968_political_ad has been split in 575 frames
The video mcgovern_defense_plan_ad_nixon_1972_presidential_campaign_commercial has been split in 1798 frames
The video ronald_reagan_tv_ad_its_morning_in_america_again has been split in 1793 frames
The video 1988_george_bush_sr_revolving_door_attack_ad_campaign has been split in 894 frames
The video bill_clinton_hope_ad_1992 has been split in 1802 frames
The video historical_campaign_ad_windsurfing_bushcheney_04 has been split in 900 frames
The video yes_we_can__barack_obama_music_video has been split in 3781 frames
-------------------
Process completed
-----------------

### Combine data into a dataframe

To combine the data into a dataframe, follow these steps:

- import library
- add additional data to lists
- retrieve the data previously stored
- organize lists into a dataframe
- save the dataframe

In [26]:
#import library
import pandas as pd
import pickle
import json

In [33]:
#add some predictors
year = ['1952','1960','1964','1968','1972','1984','1988','1992','2004','2008']
candidate = ['DDE', 'JFK', 'LBJ', 'HH', 'NIX','RR','GBS','BC','GBJ','BO' ]
party = ['republican','democratic','democratic','democratic', 'republican','republican','republican','democratic','republican','democratic']
win = ['-','-','-',0,'-','-','-','-','-','-']
vote_percentage = ['-','-','-','-','-','-','-','-','-','-']
#frames per second of video (number of frames divided by length in seconds)
frame_sec = [n_frames/length for n_frames, length in zip(frame_count, audio_lenght)]

df = pd.DataFrame({'video_url':videos_urls,
                   'video_title':titles,
                   'video_length[sec]':audio_lenght,
                   'video_frames[n]': frame_count,
                   'frame_sec[n/sec]': frame_sec,
                   'year':year,
                   'candidate_name':candidate,
                   'party': party,
                   'win': win,
                   'vote[%]': vote_percentage})

df

Unnamed: 0,video_url,video_title,video_length[sec],video_frames[n],frame_sec[n/sec],year,candidate_name,party,win,vote[%]
0,https://youtu.be/Y9RAxAgksSE,eisenhower_for_president_1952,62.09,1859,29.940409,1952,DDE,republican,-,-
1,https://youtu.be/vs5ORK8RLWk,kennedy_for_me_campaign_jingle_jfk_1960,60.23,1789,29.702806,1960,JFK,democratic,-,-
2,https://youtu.be/dDTBnsqxZ3k,high_quality_famous_daisy_attack_ad_from_1964_...,66.9,2003,29.940209,1964,LBJ,democratic,-,-
3,https://youtu.be/Qwk_epMblW4,humphrey_laughing_at_spiro_agnew_1968_politica...,19.25,575,29.87013,1968,HH,democratic,0,-
4,https://youtu.be/qVcFUIXEDZ8,mcgovern_defense_plan_ad_nixon_1972_presidenti...,60.05,1798,29.941715,1972,NIX,republican,-,-
5,https://youtu.be/EU-IBF8nwSY,ronald_reagan_tv_ad_its_morning_in_america_again,59.95,1793,29.908257,1984,RR,republican,-,-
6,https://youtu.be/PmwhdDv8VrM,1988_george_bush_sr_revolving_door_attack_ad_c...,29.88,894,29.919679,1988,GBS,republican,-,-
7,https://youtu.be/Xq_x3JUwrU0,bill_clinton_hope_ad_1992,60.26,1802,29.90375,1992,BC,democratic,-,-
8,https://youtu.be/pbdzMLk9wHQ,historical_campaign_ad_windsurfing_bushcheney_04,30.09,900,29.910269,2004,GBJ,republican,-,-
9,https://youtu.be/jjXyqcx-mYY,yes_we_can__barack_obama_music_video,270.21,3781,13.99282,2008,BO,democratic,-,-
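
A quick sanity check on the `frame_sec[n/sec]` column can flag videos whose frame rate differs markedly from the others. The sketch below uses plain Python on three rows copied from the table above; the helper name `flag_unusual_rates` and the 20% tolerance are arbitrary assumptions.

```python
from statistics import median

# (title, number of frames, length in seconds) copied from three rows of the table above
rows = [
    ("eisenhower_for_president_1952", 1859, 62.09),
    ("bill_clinton_hope_ad_1992", 1802, 60.26),
    ("yes_we_can__barack_obama_music_video", 3781, 270.21),
]

def flag_unusual_rates(rows, tolerance=0.2):
    """Return titles whose frames-per-second differs from the median by more than `tolerance`."""
    rates = {title: frames / seconds for title, frames, seconds in rows}
    typical = median(rates.values())
    return [title for title, rate in rates.items()
            if abs(rate - typical) / typical > tolerance]

print(flag_unusual_rates(rows))
```

The last video has about 14 frames per second against roughly 30 for the others, so it is worth checking whether this comes from the source video or from the frame extraction before using the column as a predictor.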


In [34]:
#dump the dataframe on a file
df.to_pickle('../dataset/data_collection_presidential_campaign')

#load the dataframe
#df = pd.read_pickle('../dataset/data_collection_presidential_campaign')

In [6]:
#import library to display notebook as HTML
import os
from IPython.display import HTML

#path to .css style script
cur_path = os.path.dirname(os.path.abspath("__file__"))
new_path = os.path.relpath('..\\..\\styles\\custom_styles_public_cloud_computing.css', cur_path)

#function to display notebook
def css():
    style = open(new_path, "r").read()
    return HTML(style)

In [7]:
#run this cell to apply HTML style
css()

# -- Notebook End --

In [None]:
#advanced features
###number of views
###users comments on YouTube

In [None]:
#set procedure starting time
print('-------------------')
print("Start downloading videos and thumbnails")
print('-------------------')
start = time.time()

#download images (images_urls is not defined in the cells above and must be set before running this cell)
for url in images_urls:
    #set thumbnail name and choose path to save it
    name = url.split('/')[-1].split('?')[0]    
    save_img_path = (os.path.join(dir_images, name))
    
    #download thumbnail
    img_data = requests.get(url).content
    with open(save_img_path, 'wb') as handler:
        try:
            handler.write(img_data)
            print('{} download status: success'.format(name))
        except:
            print('{} download status: fail'.format(name))

#set procedure ending time
end = time.time()
print('-------------------')
print('Download completed')
print('-------------------')
print('It took {} seconds to download 22 images from iqss website'.format(round(end - start, 2)))

In [None]:
#TODO split on silence
from pydub.silence import split_on_silence

#set procedure starting time
print('-------------------')
print("Start dividing video in chuncks and converting to audio")
print('-------------------')
start = time.time()

#store audio length
audio_lenght = []

#convert video to audio, and split in chunks
for title in titles:

    #set path to file to convert
    path_to_file = os.path.join(dir_video, title) + '.mp4'
        
    #read file 
    sound = AudioSegment.from_file(path_to_file, "mp4")
    
    #estimate length
    file_length = round(len(sound)/1000, 2)
    audio_lenght.append(file_length)
    

    print(title)
    
    #split audio where it stays quieter than -16 dBFS for at least half a second (500 ms)
    chunks = split_on_silence(sound, min_silence_len=500, silence_thresh=-16,  keep_silence=200)
    print(len(chunks))
    
    if len(chunks) == 0:
        chunks = split_on_silence(sound, min_silence_len=500, silence_thresh=-48,  keep_silence=200)
    print(len(chunks))
    
    if len(chunks) == 1:
        chunks = split_on_silence(sound, min_silence_len=300, silence_thresh=-72,  keep_silence=200)
    print(len(chunks))
    
    # now recombine the chunks so that the parts are at least 10 sec long
    target_length = 10 * 1000
    output_chunks = [chunks[0]]
    for chunk in chunks[1:]:
        if len(output_chunks[-1]) < target_length:
            output_chunks[-1] += chunk
        else:
            # if the last output chunk is longer than the target length start a new one
            output_chunks.append(chunk)
    
    print(file_length,len(chunks))
    
    dir_file = os.path.join(dir_audio, title)
    #take each chunk and save it as wav
    for i, chunk in enumerate(output_chunks):
        chunk.export(dir_file + "_chunck_{}.wav".format(i+1), format="wav")
        
    print('The video {0} is long {1} seconds and has been split in {2} audio files'.format(title, file_length,len(output_chunks)))
    
#set procedure ending time
end = time.time()
print('-------------------')
print('Process completed')
print('-------------------')
print('It took {} seconds to chunk the video and convert '.format(round(end - start, 2)))
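
The recombination step in the cell above can be illustrated on plain numbers. The sketch below works on chunk lengths in milliseconds instead of audio segments; the helper name `merge_short_chunks` is hypothetical. It also guards against an empty list, a case in which `chunks[0]` in the cell above would raise an `IndexError`.

```python
def merge_short_chunks(chunk_lengths_ms, target_length_ms=10 * 1000):
    """Merge consecutive chunk lengths until each merged chunk reaches the target length."""
    if not chunk_lengths_ms:
        return []  # nothing to merge: avoids indexing an empty list
    merged = [chunk_lengths_ms[0]]
    for length in chunk_lengths_ms[1:]:
        if merged[-1] < target_length_ms:
            merged[-1] += length   # last merged chunk is still too short: extend it
        else:
            merged.append(length)  # last merged chunk is long enough: start a new one
    return merged

# made-up chunk lengths, in milliseconds
print(merge_short_chunks([4000, 3000, 5000, 2000, 9000, 12000]))
```

Note that only the last merged chunk can end up shorter than the target, since every other chunk is extended until it reaches it.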

In [None]:
#from pydub.silence import split_on_silence

#set directory to folder to download the video thumbnail
#os.chdir('../thumbnail/')
#dir_thumbnail = os.getcwd()
#print('- thumbnail:\t', dir_thumbnail)

#download video
for video in videos_urls:

    #instantiate YouTube object passing video URL
    yt = YouTube(video)

    #extract the video's title
    replacements = {" ": "_", "-": "", "'": "", '"': "", ':':'', '.':'', '(':'', ')':''}
    title = "".join([replacements.get(c, c) for c in yt.title.lower()])
    titles.append(title)
#     thumbnail_url = yt.thumbnail_url.lower()
#     thumbnail_urls.append(thumbnail_url)

    #choose file to download (filter by extension and resolution and take first in the list)
    video_file = yt.streams.filter(file_extension = 'mp4', res="360p").first()
    video_file_spec.append(video_file)
    
    #download video
    try:
        video_file.download(dir_video, filename=title)
        print('{} download status: success'.format(title))
    except:
        print('{} download status: fail'.format(title))

#     #set thumbnail name and choose path to save it
#     thumbnail_name = title +'_'+ yt.thumbnail_url.split('/')[-1]
#     save_thumbnail_path = (os.path.join(dir_thumbnail, thumbnail_name))
    
#     #download thumbnail
#     img_data = requests.get(thumbnail_url).content
#     with open(save_thumbnail_path, 'wb') as handler:
#         try:
#             handler.write(img_data)
#             print('{} download status: success'.format(thumbnail_name))
#         except:
#             print('{} download status: fail'.format(thumbnail_name))

In [None]:
#sample code to check split on silence

title_HS = titles[4]
path_to_file = os.path.join(dir_video, title_HS) + '.mp4'
sound = AudioSegment.from_file(path_to_file, "mp4")
file_length = round(len(sound)/1000, 2)

chunks = split_on_silence(sound, min_silence_len=300, silence_thresh=-16,  keep_silence=200) #,  keep_silence=200
print(title_HS, len(chunks))

if len(chunks) == 0:
    chunks = split_on_silence(sound, min_silence_len=500, silence_thresh=-32,  keep_silence=200)
print(len(chunks))

if len(chunks) == 0:
    chunks = split_on_silence(sound, min_silence_len=300, silence_thresh=-32,  keep_silence=200)
print(len(chunks))

target_length =  10 * 1000
    
output_chunks = [chunks[0]]
for chunk in chunks[1:]:
    if len(output_chunks[-1]) < target_length:
        output_chunks[-1] += chunk
    else:
        # if the last output chunk is longer than the target length start a new one
        output_chunks.append(chunk)
        
print(file_length,len(chunks))

dir_file = os.path.join(dir_audio, title_HS)
#take each chunk and save it as wav
for i, chunk in enumerate(output_chunks):
    chunk.export(dir_file + "_chunck_{}.wav".format(i+1), format="wav")

print('The video {0} is long {1} seconds and has been split in {2} audio files'.format(title_HS, file_length,len(output_chunks)))

| title | min_silence_len [ms] | silence_thresh [dBFS] | keep_silence [ms] | target_length [sec] |
|---|---|---|---|---|
| eisenhower | 500 | -16 | 200 | 10 |
| kennedy | 500 | -16 | 200 | 10 |
| LBJ | 500 | -32 | 200 | 5 |
| HHH | 500 | -16 | 200 | 5 |
