# Connecting to the YouTube API

First we need to create a connection with the YouTube API. For that we need an API key, which can be generated on the Google API dashboard.



In [1]:
from googleapiclient.discovery import build
import pandas as pd
import seaborn as sns

In [2]:
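# Read the YouTube API key generated on the Google API dashboard
# See https://www.youtube.com/watch?v=SwSbnmqk3zY&t=621s for details
# The key is read from an environment variable so that it is not stored in the notebook
# (YOUTUBE_API_KEY is an assumed variable name; set it before running this cell)
import os
api_key = os.environ['YOUTUBE_API_KEY']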

#Build our YouTube API search
youtube = build("youtube",'v3',developerKey = api_key)


# Building a pipeline to scrape the data

The search request of the YouTube API does not return statistics for the videos it finds, so we need a workaround to search and get the information we are interested in.

We start by getting the snippet of each video from the search method of the API, which contains the video id. With that id we can call the videos list method for each video to get the statistics we are interested in.


## YouTube search

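We start by building a function to search for YouTube videos and get their ids. Notice that the keywords argument is not a string but a list: the topic, the search query (also used as the subtopic), the level of difficulty and the csv filename, in that order.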

In [3]:
def videos_id(youtube,keywords):
    '''
    Search the YouTube API for physics videos matching the query in the keywords list.
        
    Args:
        youtube (object/class): Resource built from the API description.
        
        keywords (list): A list containing, in this order, the topic, the search query 
            (also used as subtopic), the level of difficulty and the csv filename.
            
    Returns:
        video_ids (list): list of lists, each containing a video id and the topic, subtopic 
            and level of difficulty associated to it.
    '''
    
    # Make a request from our youtube API search (template on https://developers.google.com/youtube/v3/docs/search/list?apix=true)
    max_results = 50
    max_pages = 10
    video_ids = []
        
    # Given the API cost of each search, we do not loop over several queries here.
    # Instead the function is called once per keywords list.
    
    request = youtube.search().list(
                #See https://developers.google.com/youtube/v3/getting-started#partial for details on part variable
                part = 'snippet',
                maxResults = max_results,
                q = keywords[1],
                type = 'video',
                # Search for Portuguese videos since we are only interested in those
                relevanceLanguage = 'pt',
                # Only include videos that are between four and 20 minutes long (inclusive).
                videoDuration = 'medium'
                )
    # Paginate to collect at most the first 500 videos (10 pages of 50) for the keyword
    page = 0
    while request is not None and page < max_pages:
        response = request.execute()
        
        # A page may hold fewer than max_results items, so iterate over what was returned
        for item in response['items']:
            video_id = item['id']['videoId']
            video_ids.append([video_id,keywords[0],keywords[1], keywords[2]])
            
        # list_next returns None when there are no more pages
        request = youtube.search().list_next(request, response)
        
        page += 1

    return video_ids
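
The remark about API cost in videos_id can be made concrete with a rough quota estimate. The sketch below is only an illustration: the unit costs and the daily allocation are assumptions taken from the YouTube Data API v3 quota documentation and may change, and `keyword_cost` is a hypothetical helper that is not part of the pipeline.

```python
# Assumed quota figures; check the current quota documentation before relying on them.
SEARCH_LIST_COST = 100    # units per search().list call (assumption)
VIDEOS_LIST_COST = 1      # units per videos().list call (assumption)
DAILY_QUOTA = 10_000      # default daily allocation per project (assumption)


def keyword_cost(pages=10, results_per_page=50):
    """Estimate the quota units spent scraping one keyword."""
    # One search().list call per page of results
    search_units = pages * SEARCH_LIST_COST
    # One videos().list call per video id found
    detail_units = pages * results_per_page * VIDEOS_LIST_COST
    return search_units + detail_units


units = keyword_cost()
print(units, DAILY_QUOTA // units)
```

Under these assumptions the searches, not the per-video requests, dominate the cost, which is why each keyword is scraped with a single query.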

    

The videos_id function returns a list of lists. Before moving on, it will be useful to transform it into a list of dictionaries so that it can be converted to a dataframe later on. Let's create a function for that.

In [4]:
def list_to_dict(array):
    '''
    Transform a list of lists into a list of dictionaries.
    
    Args:
        array (list): A list of lists in the format returned by videos_id.
        
    Returns:
        list (list): A list where all entries are dictionaries.
    '''
    all_data = []
    
    for row in array:
        data = dict(
                video_id = row[0],
                topic = row[1],
                subtopic = row[2],
                dificulty = row[3]
                )
        all_data.append(data)
    return all_data
    
    

We only need the video_id information to obtain the statistics for the videos later on. Next, we build a function to retrieve only that information from the outcome of the videos_id function:

In [5]:
def videos_id_list(array):
    '''
    Takes a list of lists and generates a list containing only the first item of each list.
    
    Args:
        array (list): List of lists to be split.
        
    Returns:
        list (list): a list containing only the first item (video_id) of each list.
    '''

    # Use a local name that does not shadow the function itself
    ids = []
    for row in array:
        ids.append(row[0])
        
    return ids

### Comment

The three functions define steps 1, 2 and 3 of our pipeline:

1 - Search for videos on youtube.

2 - Convert the videos_id outcome into a list of dictionaries, ready to be loaded into a dataframe.

3 - Store the video id information, given by the first item of the list returned by the videos_id function, in a variable.
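
Steps 2 and 3 can be illustrated on a small hand-made input. The rows below are hypothetical placeholders in the shape returned by videos_id, not real video ids, and the comprehensions are just a compact equivalent of list_to_dict and videos_id_list. The column name `dificulty` is spelled as in list_to_dict.

```python
# Hypothetical rows in the shape returned by videos_id:
# [video_id, topic, subtopic, difficulty]
rows = [
    ['id_001', 'Cinematica', 'Queda livre fisica', 'Easy'],
    ['id_002', 'Cinematica', 'Queda livre fisica', 'Easy'],
]

# Step 2: one dictionary per row, ready to be passed to pd.DataFrame
columns = ['video_id', 'topic', 'subtopic', 'dificulty']
records = [dict(zip(columns, row)) for row in rows]

# Step 3: keep only the video ids
ids = [row[0] for row in rows]

print(records[0])
print(ids)
```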


## Getting videos statistics

The next step is to get relevant information for all scraped videos.

In [6]:
def videos_info(youtube,videos_id_list):
    '''
    Collect details and statistics for each video id using the videos list method 
        of the YouTube API.
    Args:
        youtube (object/class): Resource built from the API description.
        
        videos_id_list (list): List containing the video ids scraped from the API 
            using the videos_id and videos_id_list functions.  
    
    Returns:
        list: one dictionary per video id, in the same order as videos_id_list, 
            containing the relevant information for each video.
    '''
    # We will append all videos in the list below
    all_videos = []
    
    for video_id in videos_id_list:
        # Request to use the API and get the information we want
        request = youtube.videos().list(
                    part = 'snippet,contentDetails,statistics',
                    id = video_id
                    )
        response = request.execute()

        # Keep one row per id even when the API returns nothing for it (e.g. a removed
        # video), so the rows stay aligned with the list returned by videos_id.
        if not response['items']:
            all_videos.append(dict(video_id = video_id))
            continue

        item = response['items'][0]
        snippet = item['snippet']
        statistics = item['statistics']

        # Some videos do not provide 'tags' or 'likeCount', so we read the optional
        # fields with .get(), which returns None when the key is missing.
        data = dict(
                video_id = video_id,
                title = snippet['title'],
                description = snippet['description'],
                tags = snippet.get('tags'),
                channel_id = snippet['channelId'],
                duration = item['contentDetails']['duration'],
                view_count = statistics.get('viewCount'),
                like_count = statistics.get('likeCount'),
                # comment_count = statistics.get('commentCount')
                )
        # Append inside the loop so that every video gets its own, up-to-date row
        all_videos.append(data)
    
    return all_videos
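
videos_info sends one request per video. A possible variation, sketched below, groups the ids into batches and passes each batch as a comma-separated string in the `id` parameter of the videos list method. The helper names `chunked` and `fetch_items_batched` are hypothetical, and the batch size of 50 is an assumption about the maximum the API accepts per call.

```python
def chunked(ids, size=50):
    """Split a list of ids into consecutive batches of at most `size` items."""
    return [ids[start:start + size] for start in range(0, len(ids), size)]


def fetch_items_batched(youtube, ids, size=50):
    """Return the raw 'items' for all ids, using one videos().list call per batch."""
    items = []
    for batch in chunked(ids, size):
        response = youtube.videos().list(
            part='snippet,contentDetails,statistics',
            id=','.join(batch),   # comma-separated ids in a single request
        ).execute()
        items.extend(response.get('items', []))
    return items
```

The API may omit removed videos from a batch, so the returned items should be matched to the input through each item's `id` field rather than by position.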

## Workflow 

The functions above finally allow us to define a pipeline:

1 - Get the videos id using the videos_id function,

2 - Generate a dataframe containing the videos_id, topic, subtopic and level of difficulty,

3 - Get a list containing only the video_id information in order to obtain their statistics,

4 - Create a list of dictionaries containing all scraped info,

5 - Transform the list of dictionaries into a dataframe,

6 - Join the two dataframes,

7 - Clean the extra video_id column,

8 - Generate a csv file.

Let's build a function to execute it:

In [7]:
def scrape_n_save(youtube,kw_filename):
    '''
    Scrape videos using the YouTube API and save the resulting dataframe into a csv file.
    
    Args:
        youtube (object/class): Resource built from the API description.
        
        kw_filename (list): A list containing, in this order, the topic, the search query 
            (also used as subtopic), the level of difficulty and the csv filename.
    
    Returns:
        None. The scraped data is written to the csv file named in kw_filename[3].
    '''   
    # Generate a list containing the ids of the videos and the topic, subtopic and difficulty 
    # associated to them.
    video_id = videos_id(youtube,kw_filename)
    
    # Dataframe containing the information returned by the function videos_id
    videos_id_df = pd.DataFrame(list_to_dict(video_id))
    
    # Retrieve only the list of video ids for the keyword.
    videos_id_kw = videos_id_list(video_id)
    
    # List of dictionaries containing all the scraped information for the keyword
    videos_info_kw = videos_info(youtube,videos_id_kw)
    
    # Transform the list of dictionaries into a dataframe
    kw_df = pd.DataFrame(videos_info_kw)
    
    # Joining videos_id_df and kw_df for a given keyword. The join is made on the index,
    # so it relies on both dataframes having their rows in the same order.
    df = pd.merge(left = kw_df, right = videos_id_df, left_index = True, right_index = True)
    
    # Dropping the extra video_id column and renaming the one left after the merging process
    df = df.drop(['video_id_y'], axis = 1).rename(columns = {'video_id_x': 'video_id'})
    
    # Save the final dataframe for a given keyword in a csv file
    df.to_csv(kw_filename[3])   


# Generating csv files

It is finally time to start generating the csv files we are interested in! We will do it one keyword at a time, given the amount of data and the quota limits imposed by the YouTube API.
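
Since every keyword follows the same two steps, the runs could also be driven by a small helper that skips keywords whose csv file already exists. This is only a sketch: `scrape_pending` and its `max_runs` cap are hypothetical and are not used in the cells that follow.

```python
import os


def scrape_pending(keyword_lists, scrape, max_runs=None):
    """Call `scrape` for each keyword list whose csv file does not exist yet.

    `scrape` is any callable taking one keyword list, for example
    lambda kw: scrape_n_save(youtube, kw).
    """
    done = []
    for kw in keyword_lists:
        filename = kw[3]
        if os.path.exists(filename):
            continue            # already scraped on a previous run
        if max_runs is not None and len(done) >= max_runs:
            break               # stop once the cap for this session is reached
        scrape(kw)
        done.append(filename)
    return done
```

A call such as `scrape_pending([kw_1, kw_2], lambda kw: scrape_n_save(youtube, kw), max_runs=1)` would then scrape at most one missing keyword per session.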

# Keyword 1

In [8]:
#kw_1 = ['Cinematica','Movimento e Repouso fisica','Easy','movimento_e_repouso.csv']

#scrape_n_save(youtube,kw_1) 
         

In [9]:
#df1 = pd.read_csv('movimento_e_repouso.csv')

# Keyword 2

In [10]:
#kw_2 =  ['Cinematica','Movimento Uniforme fisica','Easy','movimento_uniforme.csv']

#scrape_n_save(youtube,kw_2) 

In [11]:
#df2 = pd.read_csv('movimento_uniforme.csv')

# Keyword 3

In [12]:
#kw_3 =  ['Cinematica','Movimento Uniformente variado fisica','Medium','mov_unif_variado.csv']

#scrape_n_save(youtube,kw_3) 

In [13]:
#df3 = pd.read_csv('mov_unif_variado.csv')

# Keyword 4

In [14]:
#kw_4 =  ['Cinematica','Lançamento vertical para cima fisica','Medium','lancamento_vertical.csv']

#scrape_n_save(youtube,kw_4) 

In [15]:
#df4 = pd.read_csv('lancamento_vertical.csv')

# Keyword 5

In [16]:
#kw_5 =  ['Cinematica','Queda livre fisica','Easy','queda_livre.csv']

#scrape_n_save(youtube,kw_5) 

In [17]:
#df5 = pd.read_csv('queda_livre.csv')

# Keyword 6

In [18]:
#kw_6 =  ['Cinematica','Vetores Lançamento oblíquo fisica','Hard','lancamento_obliquo.csv']

#scrape_n_save(youtube,kw_6) 

In [19]:
#df6 = pd.read_csv('lancamento_obliquo.csv')

# Keyword 7

In [20]:
#kw_7 =  ['Cinematica','Lançamento horizontal fisica','Hard','lancamento_horizontal.csv']

#scrape_n_save(youtube,kw_7) 

In [21]:
#df7 = pd.read_csv('lancamento_horizontal.csv')

# Keyword 8

In [22]:
#kw_8 =  ['Cinematica','cinematica vetorial fisica','Hard','cinematica_vetorial.csv']

#scrape_n_save(youtube,kw_8) 

In [23]:
#df8 = pd.read_csv('cinematica_vetorial.csv')

In [24]:
#df8

# Keyword 9

In [25]:
#kw_9 =  ['Cinematica','Movimento Circular fisica','Hard','movimento_circular.csv']

#scrape_n_save(youtube,kw_9) 

In [26]:
#df9 = pd.read_csv('movimento_circular.csv')

# Keyword 10

In [27]:
#kw_10 =  ['Cinematica','Estática de um ponto material fisica','Hard','estatica_ponto_material.csv']

#scrape_n_save(youtube,kw_10) 

In [28]:
#df10 = pd.read_csv('estatica_ponto_material.csv')

# Keyword 11

In [29]:
#kw_11 =  ['Cinematica','centro de massa e equilíbrio','Hard','centro_massa.csv']

#scrape_n_save(youtube,kw_11) 

In [30]:
#df11 = pd.read_csv('centro_massa.csv')

# Keyword 12

In [31]:
#kw_12 = ['Cinematica','Estática do corpo extenso fisica','Hard','estatica_corpo_extenso.csv']

#scrape_n_save(youtube,kw_12) 

In [32]:
#df12 = pd.read_csv('estatica_corpo_extenso.csv')

# Keyword 13

In [33]:
#kw_13 = ['Dinamica','Leis de Newton','Medium','leis_Newton.csv']

#scrape_n_save(youtube,kw_13) 

In [34]:
#df13 = pd.read_csv('leis_Newton.csv')

# Keyword 14

In [35]:
#kw_14 = ['Dinamica','Forças de tração normal e peso fisica','Medium','forcas_tracao_normal_peso.csv']

#scrape_n_save(youtube,kw_14) 

In [36]:
#df14 = pd.read_csv('forcas_tracao_normal_peso.csv')

# Keyword 15

In [37]:
#kw_15 = ['Dinamica','Força elástica fisica','Easy','forca_elastica.csv']

#scrape_n_save(youtube,kw_15) 

In [38]:
#df15 = pd.read_csv('forca_elastica.csv')

# Keyword 16

In [39]:
#kw_16 = ['Dinamica','Força de atrito fisica','Easy','forca_atrito.csv']

#scrape_n_save(youtube,kw_16) 

In [40]:
#df16 = pd.read_csv('forca_atrito.csv')

In [41]:
#df16

# Keyword 17

In [42]:
#kw_17 = ['Dinamica','Trabalho e energia fisica','Easy','trabalho_energia.csv']

#scrape_n_save(youtube,kw_17) 

In [43]:
#df17 = pd.read_csv('trabalho_energia.csv')

In [44]:
#df17

# Keyword 18

In [45]:
#kw_18 = ['Dinamica','Impulso e quantidade de movimento fisica','Medium','quantidade_movimento.csv']

#scrape_n_save(youtube,kw_18) 

In [46]:
#df18 = pd.read_csv('quantidade_movimento.csv')

In [47]:
#df18

# Keyword 19

In [48]:
#kw_19 =  ['Dinamica','Lei de Kepler fisica','Hard','lei_Kepler.csv']

#scrape_n_save(youtube,kw_19) 

In [49]:
#df19 = pd.read_csv('lei_Kepler.csv')

In [50]:
#df19

# Keyword 20

In [51]:
#kw_20 =  ['Dinamica','Lei de gravitação Universal fisica','Medium','gravitacao.csv']

#scrape_n_save(youtube,kw_20) 

In [52]:
#df20 = pd.read_csv('gravitacao.csv')

In [53]:
#df20

# Keyword 21

In [54]:
#kw_21 = ['Dinamica','Satélite em órbitas circulares fisica','Medium','orbitas_circulares.csv']

#scrape_n_save(youtube,kw_21) 

In [55]:
#df21 = pd.read_csv('orbitas_circulares.csv')

In [56]:
#df21

# Keyword 22

In [57]:
#kw_22 = ['Dinamica','Velocidade de escape fisica','Medium','velocidade_escape.csv']

#scrape_n_save(youtube,kw_22) 

In [58]:
#df22 = pd.read_csv('velocidade_escape.csv')

In [59]:
#df22

# Keyword 23

In [60]:
#kw_23 = ['Dinamica','Aceleração da gravidade fisica','Medium','aceleracao_gravidade.csv']

#scrape_n_save(youtube,kw_23) 

In [61]:
#df23 = pd.read_csv('aceleracao_gravidade.csv')

In [62]:
#df23