# Experiment: Presidential Campaigns Ads Dataset - Feature Extraction -

This notebook shows how to use cloud services using REST API to convert audio to text, to analyze the extracted text and frames contents. Using the files previously collected (see Experiment:  Presidential Campaigns Ads Dataset - Data Collection -), You are going to use cognitive services, text analytics, speech recognition and optical character recognition that are among the most powerful tools for captions generartion, transcribe audio to text and for text analytics offered by Azure Microsoft public cloud. By using these tools you will be able to extract features from a largely available source: YouTube videos. We are going to use the audio files (.WAV format) and the frames extracted from them (.JPG format) stored in the data folder in the main repository folder.

![features_extraction](img/features_extraction.PNG)


# Table of Contents
* [Experiment: Predict Elections Outcomes Using Presidential Commercial Campaign](#Experiment:-Predict-Elections-Outcomes-Using-Presidential-Commercial-Campaign)
* [Feature engineering: extract data using Microsoft Azure public cloud services](#Paragraph-3)
    * [Set up containers and upload files: audio & image](#Set-up-containers-and-upload-files-audio-&-image)
    * [Extract speech from audio using Speech Recognition API](#Extract-speech-from-audio-using-Bing-Speech-Recognition-API)
    * [Extract sentiment and key phrases from text using Text Analytics API](#Extract-sentiment-and-key-phrases-from-text-using-Text-Analytics-API)
    * [Extract image contents and text using Vision API](#Extract-image-contents-and-text-using-Vision-API)
* [Presidential Campaigns Ads Dataset](#Presidential-Campaigns-Ads-Dataset)
    * [Combine dataframes](#Combine-dataframes)
    * [Data type and data description](#Data-type-and-data-description)
* [Recap](#Recap)
    * [What you have learnt](#What-you-have-learnt)
    * [What are you going to learn](#What-are-you-going-to-learn)

## Feature engineering: extract data using Microsoft Azure public cloud services

### Set up containers and upload files: audio and  image (video frames)

To set up containers, follows these steps:

- access Azure Portal using your account [[Link here](https://portal.azure.com)]
- import libraries and run functions we will use to accomplish tasks quickly and without hardcoding
- set directories to import videos and images
- retrieve storage account service credentials from your azure_keys (public_cloud_computing\guides\keys)
- create container, retrive files to download path and upload them. Repeat the task twice using **`upload_files_to_container()`**. First to upload audio files and then to upload the image files. The function call these funtions at once:
    - retrieve files name, path and extensions use **`get_files()`**
    - set two containers name and create containers: audio and image use **`make_public_container()`**
    - upload files to containers use **`upload_file()`**

#### _Import libraries and functions_

In [2]:
#import library
import sys

#import functions from utilities
sys.path.insert(0, "../../guides/utilities/")
try:
    from utils import *
except ImportError:
    print('No Import')

In [704]:
#import libraries
import os
import time
import pickle
from azure.storage.blob import BlockBlobService, PublicAccess
from azure.storage.blob import ContentSettings

#### _Set directories_

In [4]:
#set notebook current directory
cur_dir = os.getcwd()

#set directory to the folder to import azure keys
os.chdir('../../guides/keys/')
dir_azure_keys = os.getcwd()

#set directory to the folder to import audio files
os.chdir('../../data/video/audio/')
dir_audio_files = os.getcwd()

#set directory to the folder to import image files
os.chdir(cur_dir)
os.chdir('../../data/image/frames/')
dir_image_files = os.getcwd()

#print your notebook directory 
#print directories where files are goint to be saved
print('---------------------------------------------------------')
print('Your documents directories are:')
print('- notebook:\t', cur_dir)
print('- azure keys:\t', dir_azure_keys)
print('- audio files:\t', dir_audio_files)
print('- image files:\t', dir_image_files)
print('---------------------------------------------------------')

---------------------------------------------------------
Your documents directories are:
- notebook:	 C:\Users\popor\iqss_workshop\workshops\public_cloud_computing\experiment\features_extraction
- azure keys:	 C:\Users\popor\iqss_workshop\workshops\public_cloud_computing\guides\keys
- audio files:	 C:\Users\popor\iqss_workshop\workshops\public_cloud_computing\data\video\audio
- image files:	 C:\Users\popor\iqss_workshop\workshops\public_cloud_computing\data\image\frames
---------------------------------------------------------


#### _Retrieve storage account credentials_

In [7]:
#ERASE MY PATH BEFORE REALISING THE WORKSHOP MATERIALS
my_path_to_keys = 'C:/Users/popor/Desktop/keys/'

#set service name, path to the keys and keys file name
SERVICE_NAME = 'STORAGE' #add here: STORAGE, FACE, COMPUTER_VISION, SPEECH_RECOGNITION, TEXT_ANALYTICS, ML_STUDIO
PATH_TO_KEYS = my_path_to_keys #add here (use dir_azure_keys)
KEYS_FILE_NAME = 'azure_services_keys_v1.1.json' #add file name (eg 'azure_services_keys.json')

#call function to retrive
storage_keys = retrieve_keys(SERVICE_NAME, PATH_TO_KEYS, KEYS_FILE_NAME)

#set storage name and keys
STORAGE_NAME = storage_keys['NAME']
STORAGE_KEY = storage_keys['API_KEY']

#### _Create container, get files and upload audio files_

In [11]:
#set a name for a new container
NEW_CONTAINER_NAME ='myaudio'

#set the audio file directory
DIR_FILES = dir_audio_files

# #set content type of the file, in this case is a audio .wav
# CONTENT_TYPE = 'audio/x-'
#upload files
upload_files_to_container(STORAGE_NAME, STORAGE_KEY, NEW_CONTAINER_NAME, DIR_FILES)

myaudio BLOB container has been successfully created: True
------------------------------------------------------------------------------------------------------------------
Data stored from directory):	 C:\Users\popor\iqss_workshop\workshops\public_cloud_computing\data\video\audio
------------------------------------------------------------------------------------------------------------------
Start uploading files
------------------------------------------------------------------------------------------------------------------
1988_george_bush_sr_revolving_door_attack_ad_campaign_chunck_1.wav // BLOB upload status: successful
1988_george_bush_sr_revolving_door_attack_ad_campaign_chunck_2.wav // BLOB upload status: successful
1988_george_bush_sr_revolving_door_attack_ad_campaign_chunck_3.wav // BLOB upload status: successful
bill_clinton_hope_ad_1992_chunck_1.wav // BLOB upload status: successful
bill_clinton_hope_ad_1992_chunck_2.wav // BLOB upload status: successful
bill_clinton_hop

#### _Create container, get files and upload image files_

In [13]:
#set a name for a new container
NEW_CONTAINER_NAME ='myimage'

#set the audio file directory
DIR_FILES = dir_image_files

#crete container and upload frames
upload_files_to_container(STORAGE_NAME, STORAGE_KEY, NEW_CONTAINER_NAME, DIR_FILES)

myimage something went wrong: check parameters and subscription
------------------------------------------------------------------------------------------------------------------
Data stored from directory):	 C:\Users\popor\iqss_workshop\workshops\public_cloud_computing\data\image\frames
------------------------------------------------------------------------------------------------------------------
Start uploading files
------------------------------------------------------------------------------------------------------------------
1988_george_bush_sr_revolving_door_attack_ad_campaign_frame0.jpg // BLOB upload status: successful
1988_george_bush_sr_revolving_door_attack_ad_campaign_frame100.jpg // BLOB upload status: successful
1988_george_bush_sr_revolving_door_attack_ad_campaign_frame200.jpg // BLOB upload status: successful
1988_george_bush_sr_revolving_door_attack_ad_campaign_frame300.jpg // BLOB upload status: successful
1988_george_bush_sr_revolving_door_attack_ad_campaign_fra

kennedy_for_me_campaign_jingle_jfk_1960_frame1000.jpg // BLOB upload status: successful
kennedy_for_me_campaign_jingle_jfk_1960_frame1100.jpg // BLOB upload status: successful
kennedy_for_me_campaign_jingle_jfk_1960_frame1200.jpg // BLOB upload status: successful
kennedy_for_me_campaign_jingle_jfk_1960_frame1300.jpg // BLOB upload status: successful
kennedy_for_me_campaign_jingle_jfk_1960_frame1400.jpg // BLOB upload status: successful
kennedy_for_me_campaign_jingle_jfk_1960_frame1500.jpg // BLOB upload status: successful
kennedy_for_me_campaign_jingle_jfk_1960_frame1600.jpg // BLOB upload status: successful
kennedy_for_me_campaign_jingle_jfk_1960_frame1700.jpg // BLOB upload status: successful
kennedy_for_me_campaign_jingle_jfk_1960_frame200.jpg // BLOB upload status: successful
kennedy_for_me_campaign_jingle_jfk_1960_frame300.jpg // BLOB upload status: successful
kennedy_for_me_campaign_jingle_jfk_1960_frame400.jpg // BLOB upload status: successful
kennedy_for_me_campaign_jingle_jfk_

yes_we_can__barack_obama_music_video_frame700.jpg // BLOB upload status: successful
yes_we_can__barack_obama_music_video_frame800.jpg // BLOB upload status: successful
yes_we_can__barack_obama_music_video_frame900.jpg // BLOB upload status: successful
------------------------------------------------------------------------------------------------------------------
Uploading completed
------------------------------------------------------------------------------------------------------------------
It took 39.97 seconds to upload 175 files


In [742]:
def get_list_blob(STORAGE_NAME, STORAGE_KEY, CONTAINER_NAME):
    """"create blob service and return list of blobs in the container"""
    
    blob_service = BlockBlobService(account_name= STORAGE_NAME, account_key=STORAGE_KEY)
    
    uploaded_file = blob_service.list_blobs(CONTAINER_NAME)
    blob_name_list = []
    for blob in uploaded_file:
        blob_name_list.append(blob.name)
        
    return blob_name_list

## Extract audio script using Speech Recognition

To extract text from the audio files uploaded to cloud storage previously, follows these steps:

- access Azure Portal using your account [[Link here](https://portal.azure.com)]
- retrieve speech recognition service credentials and configure API to access cloud service
- request speech recognition services to the public cloud for each audio file
- extract text from each response
- recompose the script of each video, collect results into a dataframe and save it as tabular dataset

#### _Retrieve speech recognition service credentials and configure API to access cloud service_

In [None]:
# import libraries
import requests
import urllib
import uuid
import json

#set service name
SERVICE_NAME = 'SPEECH_RECOGNITION' #add here: STORAGE, FACE, COMPUTER_VISION, SPEECH_RECOGNITION, TEXT_ANALYTICS, ML_STUDIO

#call function to retrive keys
storage_keys = retrieve_keys(SERVICE_NAME, PATH_TO_KEYS, KEYS_FILE_NAME)

#set speech recognition keys
SPEECH_RECOGNITION_KEY = storage_keys['API_KEY']

#configure API access to request speech recognition service
URI_TOKEN_SPEECH = 'https://api.cognitive.microsoft.com/sts/v1.0/issueToken'
URL_SPEECH = 'https://speech.platform.bing.com/recognize'

#set api request REST headers
headers_api = {}
headers_api['Authorization'] = 'Bearer {0}'.format(access_token)
headers_api['Content-type'] = 'audio/wav'
headers_api['codec'] = 'audio/pcm'
headers_api['samplerate'] = '16000'

#set api request parameters
params_set = {}
params_set['scenarios'] = 'ulm'
params_set['appid'] = 'D4D52672-91D7-4C74-8AD8-42B1D98141A5'
params_set['locale'] = 'en-US'
params_set['device.os'] = 'PC'
params_set['version'] = '3.0'
params_set['format'] = 'json'
params_set['instanceid'] = str(uuid.uuid1())
params_set['requestid'] = str(uuid.uuid1())

#### _Request speech recognition service to the public cloud for each audio file_

In [28]:
#set container to retrieve files from
CONTAINER_NAME = 'myaudio'

#get list of blob
#blob_list = get_list_blob(STORAGE_NAME, STORAGE_KEY, CONTAINER_NAME)
blob_name_list, blob_url_list = retrieve_blob_list(STORAGE_NAME, STORAGE_KEY, CONTAINER_NAME)

#store http response and json file
responses = []
http_responses = []

#set procedure starting time
print('---------------------------------------------------------')
print("Start speech to text conversion")
print('---------------------------------------------------------')
start = time.time()

#run speech recognition on uploaded audio files (i.e. extension .wax)
for blob_name in blob_name_list:
    if blob_name.split('.')[-1] == 'wav':
                
        #set token request REST headers
        headers_token = {}
        headers_token['Ocp-Apim-Subscription-Key'] = SPEECH_RECOGNITION_KEY
        headers_token['Content-Length'] = '0'

        #request for token
        api_response = requests.post(URI_TOKEN_SPEECH, headers=headers_token)
        access_token = str(api_response.content.decode('utf-8'))
        
        #convert blob to bytes
        blob_service = BlockBlobService(STORAGE_NAME, STORAGE_KEY)
        blob = blob_service.get_blob_to_bytes(CONTAINER_NAME, blob_name)

        #request for speech recognition service
        params = urllib.parse.urlencode(params_set)
        api_response = requests.post(URL_SPEECH, headers=headers_api, params=params, data=blob.content)
        print('{} had a {} response'.format(blob_name, api_response))

        #extract data from response
        res_json = json.loads(api_response.content.decode('utf-8'))
        http_responses.append(api_response)
        responses.append(res_json)
        
#set procedure ending time
end = time.time()
print('---------------------------------------------------------')
print('Conversion completed')
print('---------------------------------------------------------')
print('It took {} seconds to '.format(round(end - start, 2)))

---------------------------------------------------------
Start speech to text conversion
---------------------------------------------------------
1988_george_bush_sr_revolving_door_attack_ad_campaign_chunck_1.wav had a <Response [200]> response
1988_george_bush_sr_revolving_door_attack_ad_campaign_chunck_2.wav had a <Response [200]> response
1988_george_bush_sr_revolving_door_attack_ad_campaign_chunck_3.wav had a <Response [200]> response
bill_clinton_hope_ad_1992_chunck_1.wav had a <Response [200]> response
bill_clinton_hope_ad_1992_chunck_2.wav had a <Response [200]> response
bill_clinton_hope_ad_1992_chunck_3.wav had a <Response [200]> response
bill_clinton_hope_ad_1992_chunck_4.wav had a <Response [200]> response
bill_clinton_hope_ad_1992_chunck_5.wav had a <Response [200]> response
bill_clinton_hope_ad_1992_chunck_6.wav had a <Response [200]> response
bill_clinton_hope_ad_1992_chunck_7.wav had a <Response [200]> response
eisenhower_for_president_1952_chunck_1.wav had a <Response

#### _Extract text from each response_

In [32]:
#organize response output
status = []
name = []
lexical = []
request_id = []
confidence = []

#select variables from output
for i, response in enumerate(responses):
    if responses[i]['header']['status'] == 'success':
        status.append(responses[i]['header']['status'])
        name.append(responses[i]['header']['name'])
        lexical.append(responses[i]['header']['lexical'])
        request_id.append(responses[i]['header']['properties']['requestid'])
        confidence.append(responses[i]['results'][0]['confidence'])
    else:
        status.append('Error')
        name.append('Nan')
        lexical.append('Nan')
        request_id.append('Nan')
        confidence.append('Nan')

#combine output into df
df_log_response = pd.DataFrame({'file_name' : blob_name_list,
                                'stt_http_response' :  http_responses,
                                'stt_id' : request_id,
                                'stt_status' : status,
                                'stt_name' : name,
                                'stt_text' : lexical,
                                'stt_confidence' : confidence})

#display df
df_log_response.head()

Unnamed: 0,file_name,stt_http_response,stt_id,stt_status,stt_name,stt_text,stt_confidence
0,1988_george_bush_sr_revolving_door_attack_ad_c...,<Response [200]>,88f34f10-fc93-4c74-bdc9-cb65f25174f4,success,who is governor michael dukakis vitov mandator...,who is governor michael dukakis vitov mandator...,0.8767573
1,1988_george_bush_sr_revolving_door_attack_ad_c...,<Response [200]>,279bb9e8-0e30-4934-b0d6-e8f1300b1026,success,impolicy gave weekend frontales to first degre...,impolicy gave weekend frontales to first degre...,0.8108339
2,1988_george_bush_sr_revolving_door_attack_ad_c...,<Response [200]>,f7f9b2d9-396c-46df-856c-a91414491f7b,success,large how michael dukakis says he wants to do ...,large how michael dukakis says he wants to do ...,0.8396592
3,bill_clinton_hope_ad_1992_chunck_1.wav,<Response [200]>,600005f9-2c3c-4a1f-922e-ace84b6c6726,success,I was born a little town called hope arkansas ...,i was born a little town called hope arkansas ...,0.8724898
4,bill_clinton_hope_ad_1992_chunck_2.wav,<Response [200]>,08e79595-766f-46f2-80da-fe81d1e10017,success,very limited income it was in 1963 that I went...,very limited income it was in nineteen sixty t...,0.8587455


#### _Recompose the script of each video, collect results into a dataframe_

In [197]:
#recompose text from speech recognition service into a dataframe
dict_speech_recognition = dict()
video_list = ['eisenhower_for_president_1952',
               '1988_george_bush_sr_revolving_door_attack_ad_campaign',
               'high_quality_famous_daisy_attack_ad_from_1964_presidential_election',
               'humphrey_laughing_at_spiro_agnew_1968_political_ad',
               'kennedy_for_me_campaign_jingle_jfk_1960',
               'mcgovern_defense_plan_ad_nixon_1972_presidential_campaign_commercial',
               'ronald_reagan_tv_ad_its_morning_in_america_again',
               'bill_clinton_hope_ad_1992',
               'historical_campaign_ad_windsurfing_bushcheney_04',
               'yes_we_can__barack_obama_music_video'] 

#extract text for each candidate and join it
for name in video_list:    
    dict_name = dict()
    audio_text = []
    
    #for each entry append text to the correspendent video
    for i, entry in enumerate(df_log_response.loc[:,'file_name']):
        if name in entry:
            if df_log_response.loc[i, 'stt_text'] != 'Nan': 
                audio_text.append(df_log_response.loc[i, 'stt_text'])
            #uncomment the line below and indent the next if you want to keep track of empty audio chuncks
            #else:    
                #audio_text.append('Nan')
                        
    n_words = []
    for words in audio_text: 
        n_words.append(int(len(words.split(' '))))
    words_count = sum(n_words)    
    
    joined_audio = " ".join(audio_text)
    dict_name['stt_text'] = joined_audio
    dict_name['stt_words_count'] = words_count
    dict_speech_recognition[name] = dict_name 

#convert dictionary to df
df_speech_to_text = pd.DataFrame.from_dict(dict_speech_recognition , orient='index').reset_index()
df_speech_to_text.columns =  'video_title', 'stt_text', 'stt_words_count'

#display dataframe
df_speech_to_text

Unnamed: 0,video_title,stt_text,stt_words_count
0,1988_george_bush_sr_revolving_door_attack_ad_c...,who is governor michael dukakis vitov mandator...,63
1,bill_clinton_hope_ad_1992,i was born a little town called hope arkansas ...,172
2,eisenhower_for_president_1952,i for president for president i like my comput...,28
3,high_quality_famous_daisy_attack_ad_from_1964_...,play hello by standing ben please are the stak...,48
4,historical_campaign_ad_windsurfing_bushcheney_04,i'm george W bush and i approve this message i...,77
5,humphrey_laughing_at_spiro_agnew_1968_politica...,,0
6,kennedy_for_me_campaign_jingle_jfk_1960,do you wanna man for president who season thro...,11
7,mcgovern_defense_plan_ad_nixon_1972_presidenti...,the mcgovern defense plan he would cut the mar...,83
8,ronald_reagan_tv_ad_its_morning_in_america_again,it's morning again in america today more men a...,112
9,yes_we_can__barack_obama_music_video,how how what what people of this nation false ...,21


In [50]:
#orint the output of the script text extracted of a selected Ad
print('---------------------------------------------------------')
print('The entire script of the video {} is:'.format(df_speech_to_text.loc[1,'video_title']))
print('---------------------------------------------------------')
print('{}"'.format(df_speech_to_text.loc[1,'stt_text'].replace('Nan', '').replace('   ', ' ')))

---------------------------------------------------------
The entire script of the video bill_clinton_hope_ad_1992 is:
---------------------------------------------------------
i was born a little town called hope arkansas three months after my father died i remembered old two story house where i live in the grandparents very limited income it was in nineteen sixty three that i went to washington invent president kennedy at the boys nation program an IRA member just thinking morning kredible country this was it somebody like me you had no money or anything would be given the opportunity to meet the president rocket really do public service 'cause i care so much about people i work my way through law school with part time jobs anything i could find after i graduated i really didn't care about making a lot of money i just wanna go home and see if i can make a difference between work hard and education and healthcare create jobs and we real progress now it's exhilarating tomato think that

### Extract sentiment and key phrases from text using Text Analytics API

To extract sentiment and key phrases from the text, follows these steps:

- retrieve text analytics service credentials and configure API to access service
- use text from the audio files
- request sentiment analysis and extract key phrases services to the public cloud for each script
- collect results into a dataframe and save it as tabular dataset

#### _Retrieve text analytics service credentials and configure API to access service_


In [54]:
#set service name
SERVICE_NAME = 'TEXT_ANALYTICS' #add here: STORAGE, FACE, COMPUTER_VISION, SPEECH_RECOGNITION, TEXT_ANALYTICS, ML_STUDIO

#call function to retrive keys
storage_keys = retrieve_keys(SERVICE_NAME, PATH_TO_KEYS, KEYS_FILE_NAME)

#set text analytics keys
TEXT_ANALYTICS_KEY = storage_keys['API_KEY']

#configure API access to request text analytics service
URI_SENTIMENT = 'https://eastus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment'
URI_KEY_PHRASES = 'https://eastus.api.cognitive.microsoft.com/text/analytics/v2.0/keyPhrases'

#set REST headers
headers = {}
headers['Ocp-Apim-Subscription-Key'] = TEXT_ANALYTICS_KEY
headers['Content-Type'] = 'application/json'
headers['Accept'] = 'application/json'

#### _Request sentiment analysis and extract key phrases services to the public cloud for each script_

In [55]:
#set procedure starting time
print('--------------------------------------')
print("Start text analysis")
print('--------------------------------------')
start = time.time()

#store text analysis to list
sentiment_text = []
key_phrases = []
sentiment_mean_key_phrases = []   

#perform on text for each audio
for i, entry in enumerate(df_speech_to_text.index):
    text = df_speech_to_text.loc[i,'stt_text'].replace('Nan', '').replace('   ', ' ')

    #create request to determine sentiment from text
    data = json.dumps({"documents":[{"id":str(uuid.uuid1()), "language":"en", "text":text}]}).encode('utf-8')
    request = urllib.request.Request(URI_SENTIMENT, data, headers)
    response = urllib.request.urlopen(request)
    responsejson = json.loads(response.read().decode('utf-8'))
    try:
        sentiment = responsejson['documents'][0]['score']
    except:
        sentiment = 'Nan'
    sentiment_text.append(sentiment)

    #create request to determine key phrases from text
    data = data
    request = urllib.request.Request(URI_KEY_PHRASES, data, headers)
    response = urllib.request.urlopen(request)
    responsejson = json.loads(response.read().decode('utf-8'))
    try:
        key_phrase = responsejson['documents'][0]['keyPhrases']
    except:
        key_phrase = 'Nan'
    key_phrases.append(key_phrase)
    
    #create request to determine sentiment from key phrases
    sentiment_key_phrases = []
    for key in key_phrase:
        data = json.dumps({"documents":[{"id":str(uuid.uuid1()), "language":"en", "text":key}]}).encode('utf-8')
        request = urllib.request.Request(URI_SENTIMENT, data, headers)
        response = urllib.request.urlopen(request)
        responsejson = json.loads(response.read().decode('utf-8'))
        sentiment = responsejson['documents'][0]['score']
        sentiment_key_phrases.append(round(sentiment, 2))
        time.sleep(1)

    sentiment_mean = sum(sentiment_key_phrases)/len(sentiment_key_phrases)
    sentiment_mean_key_phrases.append(sentiment_mean)

#assign new column to df_stt
df_speech_to_text['ta_sentiment_text'] = sentiment_text
df_speech_to_text['ta_key_phrases'] = key_phrases
df_speech_to_text['ta_sentiment_key_phrases'] = sentiment_mean_key_phrases

#set procedure ending time
end = time.time()
print('Text analysis completed')
print('--------------------------------------')
print('It took {} to perform text analysis'.format(round(end - start, 2)))

--------------------------------------
Start text analysis
Text analysis completed
--------------------------------------
It took 138.91 to perform text analysis


#### _Collect results into a dataframe and save it as tabular dataset_

In [58]:
#make a copy of the dataset
df_text_analysis = df_speech_to_text.copy()

#save dataframe to folder
df_text_analysis.to_csv('../../dataset/data_extraction_text_analysis.csv', sep=',', encoding='utf-8')

#display dataset with text analytics
df_text_analysis

Unnamed: 0,video_title,stt_text,stt_words_count,ta_sentiment_text,ta_key_phrases,ta_sentiment_key_phrases
0,1988_george_bush_sr_revolving_door_attack_ad_c...,who is governor michael dukakis vitov mandator...,63,0.0557907,"[massachusetts america, governor michael dukak...",0.529167
1,bill_clinton_hope_ad_1992,i was born a little town called hope arkansas ...,172,0.5,"[cause i, present i, graduated i, lot of money...",0.571613
2,eisenhower_for_president_1952,i for president for president i like my comput...,28,0.918349,"[president i, computer billy graham washington...",0.7
3,high_quality_famous_daisy_attack_ad_from_1964_...,play hello by standing ben please are the stak...,48,0.856453,"[stakes, god, president johnson, s children, v...",0.7625
4,historical_campaign_ad_windsurfing_bushcheney_04,i'm george W bush and i approve this message i...,77,0.5,"[john kerry lead carey, john kerry whichever w...",0.677273
5,humphrey_laughing_at_spiro_agnew_1968_politica...,,0,Nan,Nan,0.4
6,kennedy_for_me_campaign_jingle_jfk_1960,do you wanna man for president who season thro...,11,0.5,"[wanna man, president]",0.38
7,mcgovern_defense_plan_ad_nixon_1972_presidenti...,the mcgovern defense plan he would cut the mar...,83,0.5,"[cut navy personnel, navy fleet, mcgovern defe...",0.602667
8,ronald_reagan_tv_ad_its_morning_in_america_again,it's morning again in america today more men a...,112,0.5,"[short years, half, young men, rates, leadersh...",0.590588
9,yes_we_can__barack_obama_music_video,how how what what people of this nation false ...,21,0.243644,"[people, nation false hope]",0.815


## Extract image contents and text using Vision API

To extract text from the audio files uploaded to cloud storage previously, follows these steps:

- access Azure Portal using your account [Link here]
- retrieve computer vision service credentials and configure API to access analyze image and optical character recognition services
- request analyze image and optical character recognition services to the public cloud for each script
- extract text from each response
- recompose the script of each video, collect results into a dataframe and save it as tabular dataset
- collect results into a dataframe and save it as tabular dataset

#### _Retrieve computer vision service credentials and configure API to access analyze image and optical character recognition services_

In [107]:
#set service name
SERVICE_NAME = 'COMPUTER_VISION'

#call function to retrive keys
storage_keys = retrieve_keys(SERVICE_NAME, PATH_TO_KEYS, KEYS_FILE_NAME)

#set text analytics keys
COMPUTER_VISION_KEY = storage_keys['API_KEY']

#configure API access to request text analytics service
URI_ANALYZE = 'https://eastus.api.cognitive.microsoft.com/vision/v1.0/analyze'
URI_OCR = 'https://eastus.api.cognitive.microsoft.com/vision/v1.0/ocr'

#set REST headers
headers = {}
headers['Ocp-Apim-Subscription-Key'] = COMPUTER_VISION_KEY
headers['Content-Type'] = 'application/json'
headers['Accept'] = 'application/json'

#set api request parameters for analyze image service
params_set_vis = {}
params_set_vis['visualFeatures'] = 'Categories,Tags,Description,Faces,ImageType,Color,Adult'

#set api request parameters for ocr service
params_set_ocr = {}
params_set_ocr['language'] =  'unk'
params_set_ocr['detectOrientation'] = 'false'

#### _Request analyze image and optical character recognition services to the public cloud for each script_

In [110]:
#set container to retrieve files from
CONTAINER_NAME = 'myimage'

#get list of BLOB urls and names
blob_name_list, blob_url_list = retrieve_blob_list(STORAGE_NAME, STORAGE_KEY, CONTAINER_NAME)

#store http response
responses_vis = []
http_responses_vis = []
responses_ocr = []
http_responses_ocr = []

#set procedure starting time
print('-------------------')
print("Start computer vision")
print('-------------------')
start = time.time()

#run analyze image service on video frames (i.e. extension .jpg)
for i, blob_name in enumerate(blob_url_list):
    if i == 0 or i%10 != 0:
        if blob_name.split('.')[-1] == 'jpg':
            
            #send image request REST to computer vision
            params = urllib.parse.urlencode(params_set_vis)
            query_string = '?{0}'.format(params) 
            url = URI_ANALYZE + query_string
            body = '{\'url\':\'' + blob_name + '\'}'

            #request for analyze image service   
            api_response = requests.post(url, headers=headers, data=body)
            print('{} had a {} from analyze service'.format(blob_name.split('/')[-1], api_response))

            #extract data from analyze image response
            res_json = json.loads(api_response.content.decode('utf-8'))
            http_responses_vis.append(api_response)
            responses_vis.append(res_json)

            #send ocr request REST to computer vision
            params = urllib.parse.urlencode(params_set_ocr)
            query_string = '?{0}'.format(params)
            url = URI_OCR + query_string
            body = '{\'url\':\'' + blob_name + '\'}'
        
            #request for ocr service   
            api_response = requests.post(url, headers=headers, data=body)
            print('{} had a {} from ocr service'.format(blob_name.split('/')[-1], api_response))

            #extract data from ocr response
            res_json = json.loads(api_response.content.decode('utf-8'))
            http_responses_ocr.append(api_response)
            responses_ocr.append(res_json)

    else:
        time.sleep(50)
        
#set procedure ending time
end = time.time()
print('-------------------')
print('Computer Vision completed')
print('-------------------')
print('It took {} seconds to '.format(round(end - start, 2)))

-------------------
Start computer vision
-------------------
1988_george_bush_sr_revolving_door_attack_ad_campaign_frame0.jpg had a <Response [200]> from analyze service
1988_george_bush_sr_revolving_door_attack_ad_campaign_frame0.jpg had a <Response [200]> from ocr service
1988_george_bush_sr_revolving_door_attack_ad_campaign_frame100.jpg had a <Response [200]> from analyze service
1988_george_bush_sr_revolving_door_attack_ad_campaign_frame100.jpg had a <Response [200]> from ocr service
1988_george_bush_sr_revolving_door_attack_ad_campaign_frame200.jpg had a <Response [200]> from analyze service
1988_george_bush_sr_revolving_door_attack_ad_campaign_frame200.jpg had a <Response [200]> from ocr service
1988_george_bush_sr_revolving_door_attack_ad_campaign_frame300.jpg had a <Response [200]> from analyze service
1988_george_bush_sr_revolving_door_attack_ad_campaign_frame300.jpg had a <Response [200]> from ocr service
1988_george_bush_sr_revolving_door_attack_ad_campaign_frame400.jpg had

high_quality_famous_daisy_attack_ad_from_1964_presidential_election_frame1000.jpg had a <Response [200]> from ocr service
high_quality_famous_daisy_attack_ad_from_1964_presidential_election_frame1200.jpg had a <Response [200]> from analyze service
high_quality_famous_daisy_attack_ad_from_1964_presidential_election_frame1200.jpg had a <Response [200]> from ocr service
high_quality_famous_daisy_attack_ad_from_1964_presidential_election_frame1300.jpg had a <Response [200]> from analyze service
high_quality_famous_daisy_attack_ad_from_1964_presidential_election_frame1300.jpg had a <Response [200]> from ocr service
high_quality_famous_daisy_attack_ad_from_1964_presidential_election_frame1400.jpg had a <Response [200]> from analyze service
high_quality_famous_daisy_attack_ad_from_1964_presidential_election_frame1400.jpg had a <Response [200]> from ocr service
high_quality_famous_daisy_attack_ad_from_1964_presidential_election_frame1500.jpg had a <Response [200]> from analyze service
high_qua

kennedy_for_me_campaign_jingle_jfk_1960_frame1600.jpg had a <Response [200]> from ocr service
kennedy_for_me_campaign_jingle_jfk_1960_frame1700.jpg had a <Response [200]> from analyze service
kennedy_for_me_campaign_jingle_jfk_1960_frame1700.jpg had a <Response [200]> from ocr service
kennedy_for_me_campaign_jingle_jfk_1960_frame200.jpg had a <Response [200]> from analyze service
kennedy_for_me_campaign_jingle_jfk_1960_frame200.jpg had a <Response [200]> from ocr service
kennedy_for_me_campaign_jingle_jfk_1960_frame300.jpg had a <Response [200]> from analyze service
kennedy_for_me_campaign_jingle_jfk_1960_frame300.jpg had a <Response [200]> from ocr service
kennedy_for_me_campaign_jingle_jfk_1960_frame400.jpg had a <Response [200]> from analyze service
kennedy_for_me_campaign_jingle_jfk_1960_frame400.jpg had a <Response [200]> from ocr service
kennedy_for_me_campaign_jingle_jfk_1960_frame500.jpg had a <Response [200]> from analyze service
kennedy_for_me_campaign_jingle_jfk_1960_frame50

ronald_reagan_tv_ad_its_morning_in_america_again_frame500.jpg had a <Response [200]> from ocr service
ronald_reagan_tv_ad_its_morning_in_america_again_frame600.jpg had a <Response [200]> from analyze service
ronald_reagan_tv_ad_its_morning_in_america_again_frame600.jpg had a <Response [200]> from ocr service
ronald_reagan_tv_ad_its_morning_in_america_again_frame700.jpg had a <Response [200]> from analyze service
ronald_reagan_tv_ad_its_morning_in_america_again_frame700.jpg had a <Response [200]> from ocr service
ronald_reagan_tv_ad_its_morning_in_america_again_frame800.jpg had a <Response [200]> from analyze service
ronald_reagan_tv_ad_its_morning_in_america_again_frame800.jpg had a <Response [200]> from ocr service
ronald_reagan_tv_ad_its_morning_in_america_again_frame900.jpg had a <Response [200]> from analyze service
ronald_reagan_tv_ad_its_morning_in_america_again_frame900.jpg had a <Response [200]> from ocr service
yes_we_can__barack_obama_music_video_frame0.jpg had a <Response [2

#### _Extract text from each response_

In [112]:
#store results from response
response_status_vis = []
fr_category = []
fr_category_confidence = []
fr_detail_celebrities = []
fr_detail_celebrities_confidence = []
fr_tag_name = []
fr_tag_confidence = []
fr_tag_description = []
fr_caption = []
fr_caption_confidence = []
fr_face_age = []
fr_face_gender = []

#parse over response and extract features
for i, response in enumerate(responses_vis):
    if next(iter(responses_vis[i])) == 'statusCode':
        response_status_vis.append(responses_vis[i]['statusCode'])
        fr_category.append('Nan')
        fr_category_confidence.append('Nan') 
        fr_detail_celebrities.append('Nan')
        fr_detail_celebrities_confidence.append('Nan')
        fr_tag_name.append('Nan')
        fr_tag_confidence.append('Nan')
        fr_tag_description.append('Nan')
        fr_caption.append('Nan')
        fr_caption_confidence.append('Nan')
        fr_face_age.append('Nan')
        fr_face_gender.append('Nan')
            
    else:
        response_status_vis.append('<200>')
        
        #parse over the categories key of the response
        for j, response in enumerate(responses_vis[i]['categories']):
                                
                #get all the category with a relatively high score
                count =  0
                if response['score'] > 0.25:
                    #check for multiple high score category
                    fr_category.append(response['name'].strip('_'))
                    fr_category_confidence.append(response['score']) 
                    
                    #extract celebrities
                    if 'detail' in response.keys():
                        if 'celebrities' in response['detail'].keys():
                            if response['detail']['celebrities'] != []:
                                fr_detail_celebrities.append(response['detail']['celebrities'][0]['name'])
                                fr_detail_celebrities_confidence.append(response['detail']['celebrities'][0]['confidence'])             
                            else:
                                fr_detail_celebrities.append('Nan')
                                fr_detail_celebrities_confidence.append('Nan')        
                        else:
                            fr_detail_celebrities.append('Nan')
                            fr_detail_celebrities_confidence.append('Nan')
                    else:
                        fr_detail_celebrities.append('Nan')
                        fr_detail_celebrities_confidence.append('Nan')
                    
                    break
                
                else:           
                    if count == j:
                        fr_category.append('Nan')
                        fr_category_confidence.append('Nan') 
                        fr_detail_celebrities.append('Nan')
                        fr_detail_celebrities_confidence.append('Nan')
                        break
                        
                    count =+ 1
        
        #parse over the tags key of the response
        tags_name = []
        tags_confidence = []
        for k, response in enumerate(responses_vis[i]['tags']):
            tags_name.append(response['name'])
            tags_confidence.append(response['confidence'])
        fr_tag_name.append(tags_name)
        fr_tag_confidence.append(tags_confidence)
        
        #parse over the description key of the response
        tags_description = []
        for k, response in enumerate(responses_vis[i]['description']['tags']):
            tags_description.append(response)
        fr_tag_description.append(tags_description) 
        
        caption = []
        caption_confidence = []
        for k, response in enumerate(responses_vis[i]['description']['captions']):
            caption.append(response['text'])
            caption_confidence.append(response['confidence'])
        fr_caption.append(caption)
        fr_caption_confidence.append(caption_confidence)
        
        #parse over the faces key of the response
        #print(i)
        face_age = []
        face_gender = []
        for k, response in enumerate(responses_vis[i]['faces']):
            face_age.append(response['age'])
            face_gender.append(response['gender'])
        fr_face_age.append(face_age)
        fr_face_gender.append(face_gender)

In [111]:
#store results from response
response_status = []
fr_ocr_words = []

#parse over response and extract features
for i, response in enumerate(responses_ocr):
    if next(iter(responses_ocr[i])) == 'statusCode':
        response_status.append(responses_ocr[i]['statusCode'])
        fr_ocr_words.append('Nan')

    else:
        response_status.append('<200>')
        words = []
        for j, response in enumerate(responses_ocr[i]['regions']):
            for k, box in enumerate(response['lines']):
                for l, word in enumerate(box['words']):
                    words.append(word['text'])
                                        
        fr_ocr_words.append(words)

In [231]:
#display results
log_text_analysis = {'file_name' : blob_name_list,
                   'vis_http_response' :  response_status,
                   'vis_fr_caption' : fr_caption,
                   'vis_fr_caption_score[%]' : fr_caption_confidence,
                   'vis_tag_description': fr_tag_description,
                   'vis_tag_name' :  fr_tag_name,
                   'vis_tag_confidence' :  fr_tag_confidence, 
                   'vis_face_gender' : fr_face_gender,
                   'vis_face_age' : fr_face_age,
                   'vis_ocr' : fr_ocr_words,
                   'vis_fr_category' : fr_category,
                   'vis_fr_category_score[%]' : fr_category_confidence,
                   'vis_fr_celebrities' : fr_detail_celebrities,
                   'vis_fr_celebrities_score[%]' : fr_detail_celebrities_confidence}

df_log_text_analysis = pd.DataFrame.from_dict(log_text_analysis, orient='index')
df_log_text_analysis = df_log_text_analysis.transpose()
df_log_text_analysis[df_log_text_analysis['vis_http_response'] == '<200>'].head(10)

Unnamed: 0,file_name,vis_http_response,vis_fr_caption,vis_fr_caption_score[%],vis_tag_description,vis_tag_name,vis_tag_confidence,vis_face_gender,vis_face_age,vis_ocr,vis_fr_category,vis_fr_category_score[%],vis_fr_celebrities,vis_fr_celebrities_score[%]
0,1988_george_bush_sr_revolving_door_attack_ad_c...,<200>,[a tall glass building],[0.6281513925662964],"[building, photo, sitting, black, large, sign,...",[],[],[],[],"[.11, THE, DUKAKIS, FURLOUGH, PROGRAM]",Nan,Nan,Nan,Nan
1,1988_george_bush_sr_revolving_door_attack_ad_c...,<200>,[a gate in front of a window],[0.5849918523400203],"[building, standing, sitting, window, large, r...",[building],[0.8614632487297058],[],[],[],Nan,Nan,Nan,Nan
2,1988_george_bush_sr_revolving_door_attack_ad_c...,<200>,[a tall glass building],[0.4673530253144375],"[building, window, clock, photo, water, small,...","[building, silhouette, clouds, tower, distance]","[0.8810973763465881, 0.23954088985919952, 0.20...",[],[],[],Nan,Nan,Nan,Nan
3,1988_george_bush_sr_revolving_door_attack_ad_c...,<200>,[a group of people standing in front of a mirr...,[0.5246359866485301],"[man, standing, front, building, mirror, peopl...",[],[],[],[],[],Nan,Nan,Nan,Nan
4,1988_george_bush_sr_revolving_door_attack_ad_c...,<200>,[a group of people standing in front of a mirr...,[0.8146214362046119],"[person, man, standing, looking, photo, people...",[person],[0.9888064861297607],[],[],[],Nan,Nan,Nan,Nan
5,1988_george_bush_sr_revolving_door_attack_ad_c...,<200>,[a group of people in a cage],[0.944788099490945],"[person, people, photo, building, window, man,...","[person, people]","[0.899941086769104, 0.6188361644744873]",[],[],"[268, Escaped.]",Nan,Nan,Nan,Nan
6,1988_george_bush_sr_revolving_door_attack_ad_c...,<200>,[a group of people in a cage],[0.9060217418649054],"[person, building, man, group, people, photo, ...","[person, group, people]","[0.9542752504348755, 0.5842924118041992, 0.551...",[],[],"[Many, are, still, at, large.]",Nan,Nan,Nan,Nan
7,1988_george_bush_sr_revolving_door_attack_ad_c...,<200>,[a group of people standing in front of a fence],[0.9185269151722641],"[person, fence, group, outdoor, people, buildi...","[person, group, people]","[0.9850817918777466, 0.8664748072624207, 0.741...",[],[],[],Nan,Nan,Nan,Nan
8,1988_george_bush_sr_revolving_door_attack_ad_c...,<200>,[a tall building],[0.5896768068792926],"[building, window, table, living, clock, stand...","[building, tower]","[0.8481139540672302, 0.31022948026657104]",[],[],[],Nan,Nan,Nan,Nan
9,bill_clinton_hope_ad_1992_frame0.jpg,<200>,[a close up of a computer],[0.4195503962965528],"[laptop, computer]",[],[],[],[],[],Nan,Nan,Nan,Nan


#### _Recompose output of computer vision of each video, collect results into a dataframe and save it as tabular dataset_

In [258]:
#recompose text from speech recognition service into a dataframe
dict_computer_vision = dict()
video_list = ['eisenhower_for_president_1952',
               '1988_george_bush_sr_revolving_door_attack_ad_campaign',
               'high_quality_famous_daisy_attack_ad_from_1964_presidential_election',
               'humphrey_laughing_at_spiro_agnew_1968_political_ad',
               'kennedy_for_me_campaign_jingle_jfk_1960',
               'mcgovern_defense_plan_ad_nixon_1972_presidential_campaign_commercial',
               'ronald_reagan_tv_ad_its_morning_in_america_again',
               'bill_clinton_hope_ad_1992',
               'historical_campaign_ad_windsurfing_bushcheney_04',
               'yes_we_can__barack_obama_music_video']

#extract text for each candidate and join it
for name in video_list:
    
    dict_name = dict()
    caption = []
    caption_score = []
    caption_tag = []
    caption_tag_description = []
    caption_tag_score = []
    caption_people_gender = []
    caption_people_age = []
    caption_tag_celebrities = []
    caption_text = []
    caption_category = []
    caption_category_score = []

    for i, entry in enumerate(df_log_text_analysis.loc[:,'file_name']):        
        if name in entry:
            
            caption.append(df_log_text_analysis.loc[i,'vis_fr_caption'])
            caption_score.append(df_log_text_analysis.loc[i, 'vis_fr_caption_score[%]'])
            caption_tag.append(df_log_text_analysis.loc[i, 'vis_tag_name'])
            caption_tag_description.append(df_log_text_analysis.loc[i, 'vis_tag_description'])
            caption_tag_score.append(df_log_text_analysis.loc[i, 'vis_tag_confidence'])
            caption_people_gender.append(df_log_text_analysis.loc[i, 'vis_face_gender'])
            caption_people_age.append(df_log_text_analysis.loc[i,  'vis_face_age'])
            caption_tag_celebrities.append(df_log_text_analysis.loc[i,  'vis_fr_celebrities'])
            caption_text.append(df_log_text_analysis.loc[i, 'vis_ocr'])
            caption_category.append(df_log_text_analysis.loc[i, 'vis_fr_category'])
            caption_category_score.append(df_log_text_analysis.loc[i, 'vis_fr_category_score[%]'])

    dict_name['caption'] = caption
    dict_name['caption_score'] = caption_score
    dict_name['tag'] = caption_tag
    dict_name['tag_description'] = caption_tag_description
    dict_name['tag_score'] = caption_tag_score
    dict_name['people_gender'] = caption_people_gender
    dict_name['people_age'] = caption_people_age
    dict_name['people_celebrities'] = caption_tag_celebrities
    dict_name['image_text'] = caption_text
    dict_name['category'] = caption_category 
    dict_name['category_score'] = caption_category_score  
    
    dict_computer_vision[name] = dict_name
    
#convert dictionary to df
df_computer_vision = pd.DataFrame.from_dict(dict_computer_vision , orient='index').reset_index()

#save dataframe to folder
df_computer_vision.to_csv('../../dataset/data_extraction_computer_vision.csv', sep=',', encoding='utf-8')

#display dataframe
df_computer_vision

Unnamed: 0,index,caption,caption_score,tag,tag_description,tag_score,people_gender,people_age,people_celebrities,image_text,category,category_score
0,1988_george_bush_sr_revolving_door_attack_ad_c...,"[[a tall glass building], [a gate in front of ...","[[0.6281513925662964], [0.5849918523400203], [...","[[], [building], [building, silhouette, clouds...","[[building, photo, sitting, black, large, sign...","[[], [0.8614632487297058], [0.8810973763465881...","[[], [], [], [], [], [], [], [], []]","[[], [], [], [], [], [], [], [], []]","[Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan]","[[.11, THE, DUKAKIS, FURLOUGH, PROGRAM], [], [...","[Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan]","[Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan]"
1,bill_clinton_hope_ad_1992,"[[a close up of a computer], [a person sitting...","[[0.4195503962965528], [0.6316425808088616], [...","[[], [], [person, striped], [outdoor], [person...","[[laptop, computer], [sitting, table, white, b...","[[], [], [0.9881473183631897, 0.77130669355392...","[[], [], [], [], [Male, Female], [Male], [], [...","[[], [], [], [], [29, 24], [42], [], [33], [48...","[Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, Bill ...","[[], [], [], [], [], [], [], [], [cu, '92, mMM...","[Nan, Nan, Nan, Nan, Nan, people_many, Nan, Na...","[Nan, Nan, Nan, Nan, Nan, 0.359375, Nan, Nan, ..."
2,eisenhower_for_president_1952,"[[a group of people in a room], [a group of pe...","[[0.8372239654487026], [0.6410701379449969], [...","[[], [crowd], [wall, indoor], [], [building, g...","[[photo, group, white, standing, people, woman...","[[], [0.009168439544737339], [0.96387577056884...","[[], [], [], [], [], [], [], [], [], [], [], [...","[[], [], [], [], [], [], [], [], [], [], [], [...","[Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, ...","[[ΙΚΕ, ΙΚΕ], [IKE], [], [IKE], [], [ΙΚΕ], [VOT...","[Nan, Nan, Nan, Nan, building, Nan, Nan, Nan, ...","[Nan, Nan, Nan, Nan, 0.65234375, Nan, Nan, Nan..."
3,high_quality_famous_daisy_attack_ad_from_1964_...,"[[a close up of a person], [a star filled sky]...","[[0.37491668320809257], [0.39321677414931927],...","[[], [outdoor object], [nature, clouds, cloud]...","[[sitting, white, black, table, food, holding,...","[[], [0.33399033546447754], [0.660276889801025...","[[], [], [], [], [], [], [], [], [], [Female],...","[[], [], [], [], [], [], [], [], [], [10], [12...","[Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, ...","[[], [], [], [], [], [VOTE, FOR, PRESIDENT, JO...","[Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, ...","[Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, ..."
4,historical_campaign_ad_windsurfing_bushcheney_04,"[[a ship in the background], [a small boat in ...","[[0.4505885722809501], [0.7388361383124457], [...","[[], [water, transport, outdoor, watercraft, s...","[[ship, water, man, engine], [water, transport...","[[], [0.986984133720398, 0.9118053317070007, 0...","[[], [], [], [], [], [], [], [], []]","[[], [], [], [], [], [], [], [], []]","[Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan]","[[], [], [], [], [?sident?], [Agnew, for, -Pre...","[Nan, Nan, Nan, Nan, others, others, Nan, Nan,...","[Nan, Nan, Nan, Nan, 0.65234375, 0.54296875, N..."
5,humphrey_laughing_at_spiro_agnew_1968_politica...,"[[John F. Kennedy et al. on a newspaper], [a b...","[[0.46781462865547085], [0.46116296912474647],...","[[text, newspaper], [several], [text, outdoor,...","[[text, photo, newspaper, sitting, white, blac...","[[0.9706620573997498, 0.8873120546340942], [0....","[[Male], [Male], [], [], [], []]","[[43], [56], [], [], [], []]","[Nan, Nan, Nan, Nan, Nan, Nan]","[[IKENNEDY, PRESIDENT], [], [], [], [A, rog], []]","[Nan, others, Nan, Nan, Nan, Nan]","[Nan, 0.3046875, Nan, Nan, Nan, Nan]"
6,kennedy_for_me_campaign_jingle_jfk_1960,"[[a black sign with white text], [a black sign...","[[0.7830852370379959], [0.9310520289368901], [...","[[], [sign, street], [person], [black, white, ...","[[white, sign, sitting, hanging, street, black...","[[], [0.793519139289856, 0.7934789657592773], ...","[[], [], [], [], [Male], [], [], [], [], [], [...","[[], [], [], [], [45], [], [], [], [], [], [],...","[Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, ...","[[KENNEDY], [KENNEDY, KENNEDY, KENNEDYI], [], ...","[text_sign, text_sign, Nan, Nan, Nan, text_sig...","[0.83203125, 0.94921875, Nan, Nan, Nan, 0.9960..."
7,mcgovern_defense_plan_ad_nixon_1972_presidenti...,"[[], [], [a close up of a snow covered ground]...","[[], [], [0.5722464133123896], [0.264241769977...","[[], [], [], [outdoor, gun, day], [distance, d...","[[snow, skiing, water, large, group, country, ...","[[], [], [], [0.8725705742835999, 0.3045244514...","[[], [], [], [], [], [], [], [], [], [], [], [...","[[], [], [], [], [], [], [], [], [], [], [], [...","[Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, ...","[[], [], [], [], [], [], [], [], [], [], [], [...","[Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, ...","[Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, Nan, ..."
8,ronald_reagan_tv_ad_its_morning_in_america_again,"[[a close up of a street], [a close up of a pe...","[[0.35940994135170257], [0.2898669278402918], ...","[[], [], [], [], [crowd], [], [person, dark], ...","[[street, rain, train, bird, traffic], [man], ...","[[], [], [], [], [0.00718245655298233], [], [0...","[[], [], [], [], [], [], [], [Male, Female], [...","[[], [], [], [], [], [], [], [35, 35], [31, 22...","[Nan, Nan, Nan, Nan, Nan, Nan, Nan, Enrique Mu...","[[], [], [], [], [], [], [], [], [], [], [], [...","[Nan, Nan, Nan, Nan, Nan, Nan, Nan, people_por...","[Nan, Nan, Nan, Nan, Nan, Nan, Nan, 0.33203125..."
9,yes_we_can__barack_obama_music_video,"[[a man in a suit and tie], [Ed Kowalczyk hold...","[[0.9698386271569711], [0.9397864121385892], [...","[[person, man, indoor, suit], [person, guitar]...","[[person, man, indoor, standing, photo, screen...","[[0.9795127511024475, 0.9436362981796265, 0.91...","[[Male, Male], [Male, Male, Male], [Male, Male...","[[38, 35], [35, 37, 32], [44, 39, 24, 28], [27...","[Nan, Nan, Barack Obama, Nan, Nan, will.i.am, ...","[[], [], [], [], [], [], [], [], [], [], [], [...","[people, Nan, people_group, Nan, Nan, people, ...","[0.40625, Nan, 0.59375, Nan, Nan, 0.56640625, ..."


## Presidential Campaigns Ads Dataset
### Combine dataframes

In [307]:
#read data from data collection dataset
df1 = pd.read_csv('../../dataset/data_collection_presidential_campaign.csv')
df1 = df1.set_index('video_title').drop('Unnamed: 0', axis=1)

#read data from text analysis dataset
df2 = pd.read_csv('../../dataset/data_extraction_text_analysis.csv')
df2 = df2.set_index('video_title').drop('Unnamed: 0', axis=1)

#read data from computer vision dataset
df3 = pd.read_csv('../../dataset/data_extraction_computer_vision.csv')
df3 = df3.set_index('index').drop('Unnamed: 0', axis=1)

#concatenate dataframes
df_presidential_campaigns = pd.concat([df1, df2, df3], axis=1,sort=False)
df_presidential_campaigns.reset_index(level=0, inplace=True)

#make column title look the same to join dataset
df_presidential_campaigns['index'] = df_presidential_campaigns['index'].apply(
    lambda value: 'eisenhower_for_president' if value == 'eisenhower_for_president_1952'
    else 'kennedy_for_me' if value == 'kennedy_for_me_campaign_jingle_jfk_1960'  
    else 'daisy_attack' if value == 'high_quality_famous_daisy_attack_ad_from_1964_presidential_election'
    else 'humphrey_laughing_at_spiro_agnew' if value == 'humphrey_laughing_at_spiro_agnew_1968_political_ad'
    else 'mcgovern_defense_plan' if value == 'mcgovern_defense_plan_ad_nixon_1972_presidential_campaign_commercial'
    else 'its_morning_in_america_again' if value == 'ronald_reagan_tv_ad_its_morning_in_america_again'
    else 'revolving_door_attack' if value == '1988_george_bush_sr_revolving_door_attack_ad_campaign'
    else 'hope' if value == 'bill_clinton_hope_ad_1992'
    else 'windsurfing' if value == 'historical_campaign_ad_windsurfing_bushcheney_04'
    else 'yes_we_can' if value == 'yes_we_can__barack_obama_music_video'
    else 'unknown')

#save dataframe to folder
df_presidential_campaigns.to_csv('../../dataset/presidential_campaigns_ads_dataset.csv', sep=',', encoding='utf-8')

#display the dataset
df_presidential_campaigns

Unnamed: 0,index,video_url,video_length[sec],video_frames[n],frame_sec[n/sec],frame_extracted[n],year,candidate_name,party,stt_text,...,caption_score,tag,tag_description,tag_score,people_gender,people_age,people_celebrities,image_text,category,category_score
0,eisenhower_for_president,https://youtu.be/Y9RAxAgksSE,62.09,1859,29.940409,19,1952,DDE,republican,i for president for president i like my comput...,...,"[[0.8372239654487026], [0.6410701379449969], [...","[[], ['crowd'], ['wall', 'indoor'], [], ['buil...","[['photo', 'group', 'white', 'standing', 'peop...","[[], [0.009168439544737339], [0.96387577056884...","[[], [], [], [], [], [], [], [], [], [], [], [...","[[], [], [], [], [], [], [], [], [], [], [], [...","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Na...","[['ΙΚΕ', 'ΙΚΕ'], ['IKE'], [], ['IKE'], [], ['Ι...","['Nan', 'Nan', 'Nan', 'Nan', 'building', 'Nan'...","['Nan', 'Nan', 'Nan', 'Nan', 0.65234375, 'Nan'..."
1,kennedy_for_me,https://youtu.be/vs5ORK8RLWk,60.23,1789,29.702806,18,1960,JFK,democratic,do you wanna man for president who season thro...,...,"[[0.7830852370379959], [0.9310520289368901], [...","[[], ['sign', 'street'], ['person'], ['black',...","[['white', 'sign', 'sitting', 'hanging', 'stre...","[[], [0.793519139289856, 0.7934789657592773], ...","[[], [], [], [], ['Male'], [], [], [], [], [],...","[[], [], [], [], [45], [], [], [], [], [], [],...","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Na...","[['KENNEDY'], ['KENNEDY', 'KENNEDY', 'KENNEDYI...","['text_sign', 'text_sign', 'Nan', 'Nan', 'Nan'...","[0.83203125, 0.94921875, 'Nan', 'Nan', 'Nan', ..."
2,daisy_attack,https://youtu.be/dDTBnsqxZ3k,66.9,2003,29.940209,21,1964,LBJ,democratic,play hello by standing ben please are the stak...,...,"[[0.37491668320809257], [0.39321677414931927],...","[[], ['outdoor object'], ['nature', 'clouds', ...","[['sitting', 'white', 'black', 'table', 'food'...","[[], [0.33399033546447754], [0.660276889801025...","[[], [], [], [], [], [], [], [], [], ['Female'...","[[], [], [], [], [], [], [], [], [], [10], [12...","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Na...","[[], [], [], [], [], ['VOTE', 'FOR', 'PRESIDEN...","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Na...","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Na..."
3,humphrey_laughing_at_spiro_agnew,https://youtu.be/Qwk_epMblW4,19.25,575,29.87013,6,1968,HH,republican,,...,"[[0.46781462865547085], [0.46116296912474647],...","[['text', 'newspaper'], ['several'], ['text', ...","[['text', 'photo', 'newspaper', 'sitting', 'wh...","[[0.9706620573997498, 0.8873120546340942], [0....","[['Male'], ['Male'], [], [], [], []]","[[43], [56], [], [], [], []]","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Nan']","[['IKENNEDY', 'PRESIDENT'], [], [], [], ['A', ...","['Nan', 'others', 'Nan', 'Nan', 'Nan', 'Nan']","['Nan', 0.3046875, 'Nan', 'Nan', 'Nan', 'Nan']"
4,mcgovern_defense_plan,https://youtu.be/qVcFUIXEDZ8,60.05,1798,29.941715,18,1972,RN,democratic,the mcgovern defense plan he would cut the mar...,...,"[[], [], [0.5722464133123896], [0.264241769977...","[[], [], [], ['outdoor', 'gun', 'day'], ['dist...","[['snow', 'skiing', 'water', 'large', 'group',...","[[], [], [], [0.8725705742835999, 0.3045244514...","[[], [], [], [], [], [], [], [], [], [], [], [...","[[], [], [], [], [], [], [], [], [], [], [], [...","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Na...","[[], [], [], [], [], [], [], [], [], [], [], [...","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Na...","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Na..."
5,its_morning_in_america_again,https://youtu.be/EU-IBF8nwSY,59.95,1793,29.908257,18,1984,RR,republican,it's morning again in america today more men a...,...,"[[0.35940994135170257], [0.2898669278402918], ...","[[], [], [], [], ['crowd'], [], ['person', 'da...","[['street', 'rain', 'train', 'bird', 'traffic'...","[[], [], [], [], [0.00718245655298233], [], [0...","[[], [], [], [], [], [], [], ['Male', 'Female'...","[[], [], [], [], [], [], [], [35, 35], [31, 22...","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Na...","[[], [], [], [], [], [], [], [], [], [], [], [...","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Na...","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Na..."
6,revolving_door_attack,https://youtu.be/PmwhdDv8VrM,29.88,894,29.919679,9,1988,GBS,republican,who is governor michael dukakis vitov mandator...,...,"[[0.6281513925662964], [0.5849918523400203], [...","[[], ['building'], ['building', 'silhouette', ...","[['building', 'photo', 'sitting', 'black', 'la...","[[], [0.8614632487297058], [0.8810973763465881...","[[], [], [], [], [], [], [], [], []]","[[], [], [], [], [], [], [], [], []]","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Na...","[['.11', 'THE', 'DUKAKIS', 'FURLOUGH', 'PROGRA...","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Na...","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Na..."
7,hope,https://youtu.be/Xq_x3JUwrU0,60.26,1802,29.90375,19,1992,BC,democratic,i was born a little town called hope arkansas ...,...,"[[0.4195503962965528], [0.6316425808088616], [...","[[], [], ['person', 'striped'], ['outdoor'], [...","[['laptop', 'computer'], ['sitting', 'table', ...","[[], [], [0.9881473183631897, 0.77130669355392...","[[], [], [], [], ['Male', 'Female'], ['Male'],...","[[], [], [], [], [29, 24], [42], [], [33], [48...","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Na...","[[], [], [], [], [], [], [], [], ['cu', ""'92"",...","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'people_ma...","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 0.359375, ..."
8,windsurfing,https://youtu.be/pbdzMLk9wHQ,30.09,900,29.910269,9,2004,GBJ,republican,i'm george W bush and i approve this message i...,...,"[[0.4505885722809501], [0.7388361383124457], [...","[[], ['water', 'transport', 'outdoor', 'waterc...","[['ship', 'water', 'man', 'engine'], ['water',...","[[], [0.986984133720398, 0.9118053317070007, 0...","[[], [], [], [], [], [], [], [], []]","[[], [], [], [], [], [], [], [], []]","['Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Nan', 'Na...","[[], [], [], [], ['?sident?'], ['Agnew', 'for'...","['Nan', 'Nan', 'Nan', 'Nan', 'others', 'others...","['Nan', 'Nan', 'Nan', 'Nan', 0.65234375, 0.542..."
9,yes_we_can,https://youtu.be/jjXyqcx-mYY,270.21,3781,13.99282,38,2008,BO,democratic,how how what what people of this nation false ...,...,"[[0.9698386271569711], [0.9397864121385892], [...","[['person', 'man', 'indoor', 'suit'], ['person...","[['person', 'man', 'indoor', 'standing', 'phot...","[[0.9795127511024475, 0.9436362981796265, 0.91...","[['Male', 'Male'], ['Male', 'Male', 'Male'], [...","[[38, 35], [35, 37, 32], [44, 39, 24, 28], [27...","['Nan', 'Nan', 'Barack Obama', 'Nan', 'Nan', '...","[[], [], [], [], [], [], [], [], [], [], [], [...","['people', 'Nan', 'people_group', 'Nan', 'Nan'...","[0.40625, 'Nan', 0.59375, 'Nan', 'Nan', 0.5664..."


### Data type and data description

Below it is a complete list of the data available in the _**Presidential Campaigns Ads Dataset**_:

|field|data type|data description|
|:---|:---|:---|
|index|**string**|presidential campaign title title|
|video_url|**string**|YouTube video url|
|video_length[sec]|**float**|YouTube video lenght|
|video_frames[n]|**integer**|YouTube video number of frames|
|frame_sec[n/sec]|**float**|ratio number of frames in the video and video lenght|
|frame_extracted[n]|**integer**|number of frames extracted by the orignal video (i.e. 100th frame and multiples)|
|year|**integer**|presidential campaign year|
|candidate_name|**string**|presidential candidate name|
|party|**string**|candidate party|
|stt_text|**string**|text trascription of the video (i.e. script)|
|stt_words_count|**integer**|number of words in the script|
|ta_sentiment_text|**float**|scores close to 1 indicate positive sentiment in the script of the video, while scores close to 0 indicate negative sentiment|
|ta_key_phrases|**string**|list of strings denoting the key talking points in the script of the video|
|ta_sentiment_key_phrases|**float**|scores close to 1 indicate positive sentiment in the key talking point, while scores close to 0 indicate negative sentiment|
|caption|**string**|sentence describing image contents|
|caption_score|**float**|probability of the captions being accurate|
|tag|**string**|key points in the image|
|tag_description|**string**|list of subjects the image|
|tag_score|**float**|probability of the tag being accurate|
|people_gender|**string**|gender of people in the image|
|people_age|**integer**|age of people in the image|
|people_celebrities|**string**|name of celebrities in the image|
|image_text|**string**|character recognition in the image|
|category|**string**|main subject in the image|
|category_score|**float**|probability of the category being accurate|

## Recap
### What you have learnt

- How convert audio to text and recombine it to trascribe a video
- How extract key phrases and sentiment from the text
- How to extract image contents and text from frames extracted by video
- Organize dataset and save it

### What you will learn next guide

- This is the last guide of the workshop. In the future we hope to provide you with tutorials that use this dataset to make prediction as well as to build other interesting models. Stay tuned!

![future_work](img/future_work.PNG)

    
### Question for you¶

- Does the datset looks interesting for your research? In general to social scientists?
- What was your idea about public cloud computing before and after the workshop?
- What do you think about captions generation (before/after the workshop)?

In [200]:
#import library to display notebook as HTML
import os
from IPython.core.display import HTML

#path to .ccs style script
cur_path = os.path.dirname(os.path.abspath("__file__"))
new_path = os.path.relpath('..\\..\\..\\styles\\custom_styles_public_cloud_computing.css', cur_path)

#function to display notebook
def css():
    style = open(new_path, "r").read()
    return HTML(style)

In [201]:
css()

In [727]:
def get_files(dir_files):
    """"store file name, extension and path """
    
    files_name = []
    files_path = []
    files_extension = []
    
    for root, directories, files in os.walk(dir_files):
        for file in files:
            files_name.append(file)
            files_path.append(os.path.join(root,file))
            files_extension.append(file.split('.')[-1])
            
    print('Data stored from directory):\t {}'.format(dir_files))
          
    return files_name, files_path, files_extension

def retrive_keys(service_name, PATH_TO_KEYS, KEYS_FILE_NAME):
    """"function to retrieve_keys. return name and key for the selected cloud computing service"""
  
    path_to_keys = os.path.join(PATH_TO_KEYS, KEYS_FILE_NAME)

    with open(path_to_keys, 'rb') as handle:
        azure_keys = pickle.load(handle)

    service_key = azure_keys[service_name]
    
    return service_key

def make_public_container(STORAGE_NAME, STORAGE_KEY, NEW_CONTAINER_NAME):
    """"create blob service, blob container and set it to public access. return blob service"""
    
    blob_service = BlockBlobService(account_name= STORAGE_NAME, account_key=STORAGE_KEY)
    new_container_status = blob_service.create_container(NEW_CONTAINER_NAME) 
    blob_service.set_container_acl(NEW_CONTAINER_NAME, public_access=PublicAccess.Container)
    
    if new_container == True:
        print('{} BLOB container has been successfully created: {}'.format(NEW_CONTAINER_NAME, new_container_status))
    else:
        print('{] something went wrong: check parameters and subscription'.format(NEW_CONTAINER_NAME))

def upload_file(STORAGE_NAME, STORAGE_KEY, NEW_CONTAINER_NAME, file, path, extension, content_type):
    """"create blob service, and upload files to container"""
    
    blob_service = BlockBlobService(account_name= STORAGE_NAME, account_key=STORAGE_KEY)
    
    try:
        blob_service.create_blob_from_path(NEW_CONTAINER_NAME, file, path, content_settings=ContentSettings(content_type= content_type+extension))    
        print("{} // BLOB upload status: successful".format(file))

    except:
        print("{} // BLOB upload status: failed".format(file))

def upload_files_to_container(STORAGE_NAME, STORAGE_KEY, NEW_CONTAINER_NAME, DIR_FILES, CONTENT_TYPE):
    """"create container, get files, and upload to storage"""

    #call funtion to make container
    make_public_container(STORAGE_NAME, STORAGE_KEY, NEW_CONTAINER_NAME)

    print('---------------------------------------------------------')

    #find names, paths and extension of the files stored into directory
    files_name, files_path, files_extension = get_files(DIR_FILES)

    #set uploading procedure starting time
    print('---------------------------------------------------------')
    print("Start uploading files")
    print('---------------------------------------------------------')
    start = time.time()


    #upload all files at once to the new container
    count = 0
    for path, file, ext in zip(files_path, files_name, files_extension):
        upload_file(STORAGE_NAME, STORAGE_KEY, NEW_CONTAINER_NAME, file, path, ext, CONTENT_TYPE) #(blob_service, NEW_CONTAINER_NAME, file, path, ext, CONTENT_TYPE) 
        count += 1
        #add print only failed otherwise good to go

    #set procedure ending time
    end = time.time()
    print('---------------------------------------------------------')
    print('Uploading completed')
    print('---------------------------------------------------------')
    print('It took {} seconds to upload {} files'.format(round(end - start, 2), count))


def delete_container(STORAGE_NAME, STORAGE_KEY, CONTAINER_NAME):

    ##############################################################
    #RUN THIS ONLY IF YOU WANT TO DELETE A CONTAINTER            #
    #REMEMBER TO DOWNLOAD YOUR DATA BEFORE DELETING THE CONTAINER#
    #IMPORTANT: YOU WILL LOOSE YOUR BLOB INTO THE CONTAINER      #
    ##############################################################

    blob_service = BlockBlobService(account_name= STORAGE_NAME, account_key=STORAGE_KEY)

    #delete container
    delete_container = blob_service.delete_container(CONTAINER_NAME)
    print("{} delition status success: {}".format(CONTAINER_NAME, delete_container))