This notebook documents the construction of an application that performs three actions:

1. Create a database containing snapshots of descriptor statistics for all the videos on a number of YouTube channels.

2. Generate reports of pertinent information summarising some subselection of the videos retrieved.

3. Perform time series analysis of historical snapshots.


# Part 1


Let's build the part of the application that talks to Youtube. There is a quickstart guide at:
https://developers.google.com/youtube/v3/quickstart/python
but we will mostly follow our own route using 
https://github.com/youtube/api-samples/blob/master/python/search.py

We will also refer to 

https://developers.google.com/youtube/v3/docs/
https://www.forgov.qld.gov.au/file/21896/download?token=mo1SpiZT 

We want to get a list of statistics for all videos for all channels in a given list. 

We'll need to make calls to at least 3 different resources:

Channels
    Need: channel_name
    Return: channel_id
Videos
    Need: video_id
    Return: Statistics on video
PlaylistItems
    Need: playlist_id (derivable from channel_id)
    Return: All video_ids on that playlist
    
It's probably a good idea to first make an index of the channel_ids of all the channels that we want to survey. Usefully, each channel id differs by only one character from the id of the corresponding default playlist, which contains every video uploaded by that channel.
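
The one-character relationship can be sketched as below. This is a minimal sketch, assuming channel ids always start with `UC` and that the default uploads playlist id is the same string with `UU` in its place. That is a convention observed on the Khan Academy ids used in this notebook, not a documented guarantee. The helper name `uploads_playlist_id` is ours.

```python
def uploads_playlist_id(channel_id):
    """Derive the default uploads playlist id from a channel id.

    Assumption: channel ids start with "UC" and the matching uploads
    playlist id is the same string with "UU" in place of "UC".
    """
    if not channel_id.startswith("UC"):
        raise ValueError("unexpected channel id format: %r" % channel_id)
    return "UU" + channel_id[2:]


# Khan Academy's channel id, as returned by the channels call in this notebook
print(uploads_playlist_id("UC4a-Gbdw7vOaccHmFo40b9g"))  # UU4a-Gbdw7vOaccHmFo40b9g
```

If the prefix convention ever fails, the channels resource also reports the uploads playlist under `contentDetails.relatedPlaylists.uploads`, which would be the safer source.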

There is a stackoverflow question which has a lot of useful information as to how we should structure our calls to the API: https://stackoverflow.com/questions/18953499/youtube-api-to-fetch-all-videos-on-a-channel/20795628

From this, and with a little consideration, we note a few important facts:
1. Each call to an API resource returns at most 50 results, i.e., if a channel has uploaded 6000 videos (e.g., Khan Academy), to return every video_id we must make at least 120 calls to the API.
2. You can iterate through this list of 6000 videos, 50 at a time, by passing the nextPageToken value from each response as the pageToken of the next request.
3. You can actually only get a maximum of 500 videos by paging with this token.
4. To get around the 500-video limit, you can try to restrict your query by date of upload. This might take some trial and error if there were any periods of really high upload density.
5. We'll need to do a big initial survey to get the video ids to date. After that the process will shift to maintenance of an existing database and require much lower volumes of queries. If we want up-to-date information on videos, that will require a fair bit of resampling.
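
The call count in point 1 and the paging loop in point 2 can be sketched as follows. This is only an illustration under stated assumptions: `calls_needed`, `iter_items` and `fake_fetch` are hypothetical names, and `fake_fetch` is a stand-in that mimics the shape of a list response (an `items` list plus a `nextPageToken` that is absent on the last page) so the loop can run without touching the network.

```python
import math


def calls_needed(total_items, page_size=50):
    """Minimum number of paged requests needed to list total_items."""
    return int(math.ceil(total_items / float(page_size)))


def iter_items(fetch_page):
    """Yield every item by following nextPageToken until it is absent.

    fetch_page is a hypothetical callable: it takes a page token (None for
    the first page) and returns a dict shaped like an API list response.
    """
    token = None
    while True:
        response = fetch_page(token)
        for item in response.get("items", []):
            yield item
        token = response.get("nextPageToken")
        if token is None:
            break


def fake_fetch(token, data=list(range(130)), page_size=50):
    """Stand-in for the real API: serves 130 items, 50 per page."""
    start = int(token) if token else 0
    page = {"items": data[start:start + page_size]}
    if start + page_size < len(data):
        page["nextPageToken"] = str(start + page_size)
    return page


print(calls_needed(6000))                 # 120
print(len(list(iter_items(fake_fetch))))  # 130
```

With the real client, `fetch_page` would presumably wrap a `playlistItems().list(...)` request that passes the token as `pageToken` and then calls `execute()`.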



All this suggests the following order of operations:

1. Construct index of channel name and channel id. 
    1. Manually construct list of names from Epsilon Stream channel list.
    2. Use channels API call to get channel id for each.
    3. Store list in file.
    4. Derive upload playlist ids from channel ids. 
2. Get list of all video_ids on each channel. For each channel:
    1. Get number of uploads. (TODO: write the remaining steps.)
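
Step 1 of this plan might look roughly like the sketch below. It assumes the index is stored as JSON and that upload playlist ids are derived by swapping the `UC` prefix for `UU`. The function names, the file name `channel_index.json` and the `lookup_channel_id` callable are all hypothetical; in practice the lookup would wrap the channels API call, and here a dictionary stands in for it.

```python
import json
import os
import tempfile


def build_channel_index(channel_names, lookup_channel_id):
    """Return one record per channel: name, channel id, uploads playlist id.

    lookup_channel_id is a hypothetical callable mapping a channel name
    to its channel id (for example a wrapper around the channels API call).
    """
    index = []
    for name in channel_names:
        channel_id = lookup_channel_id(name)
        index.append({
            "channel_name": name,
            "channel_id": channel_id,
            # Assumption: uploads playlist id is the channel id with "UC" -> "UU"
            "uploads_playlist_id": "UU" + channel_id[2:],
        })
    return index


def save_channel_index(index, path):
    """Store the index as a JSON file."""
    with open(path, "w") as f:
        json.dump(index, f, indent=2)


def load_channel_index(path):
    """Read the index back from a JSON file."""
    with open(path, "r") as f:
        return json.load(f)


# Demo with a stub lookup instead of a live API call
known_ids = {"khanacademy": "UC4a-Gbdw7vOaccHmFo40b9g"}
index = build_channel_index(["khanacademy"], known_ids.get)
path = os.path.join(tempfile.gettempdir(), "channel_index.json")
save_channel_index(index, path)
print(load_channel_index(path)[0]["uploads_playlist_id"])  # UU4a-Gbdw7vOaccHmFo40b9g
```

Keeping the lookup as a parameter means the index-building logic can be checked offline and the API is only queried once per channel during the initial survey.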
    


### Calls to the Youtube API

The following API call retrieves the video_ids of the first 50 videos, according to some order, of all videos posted by Khan Academy's channel on YouTube. We take advantage of the fact that every channel has a default playlist resource, which contains a list of every video published on that channel.


In [2]:
from googleapiclient.discovery import build
# We are building an authorisationless app, so we don't need oauth2client
import os

# The API key is read from apikey.txt in the parent of the working directory
apikey_path = os.path.join(os.getcwd(), os.pardir, "apikey.txt")

with open(apikey_path, "r") as f:
    apikey = f.readline().strip()  # drop the trailing newline from the key

DEVELOPER_KEY = apikey
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"

playlist_id = "UU4a-Gbdw7vOaccHmFo40b9g" #This is the default playlist id for Khan Academy

def get_youtube_playlist(playlist_id, max_results=50):
    
    youtube = build(serviceName = YOUTUBE_API_SERVICE_NAME,
                    version = YOUTUBE_API_VERSION,
                    developerKey = DEVELOPER_KEY)
    
    search_response = youtube.playlistItems().list(
        playlistId = playlist_id,
        part = 'snippet',
        maxResults= max_results)
    
    return search_response.execute()

playlist_result = get_youtube_playlist(playlist_id)


In [3]:
for item in playlist_result["items"]:
    print(item["snippet"]["resourceId"]["videoId"])

lPfNgTrWKyE
-rr81Uf10pc
chfz7QiwdOc
i0sUJVxSXlk
7R1B4QFd668
VKq7wUoIIK8
hScx0e9qyMw
re2d80cqhYw
zJtCmpH--70
bP1pDUwV5hE
9Pp_LtKgTQg
vfmLI150g4w
9Me8VGbA9dA
iJr9PpY3PjM
ogFLbvKru8A
NkGvw18zlGQ
X0gIJUXz6jc
8CJ-RsoYbg0
pVPPDbZGd04
Lh1TPIFH7iI
Yadshgsx6FQ
Y650kaYNlJU
-VZUijm02h0
HOyNEU94xJQ
NCENjXTMp9I
7iTWX8bfBv0
HR5iEX3Sy1k
KYz6HH9wZ8g
rn_C25fPOVw
YyXlrRMFUJw
Rk-EtuESRm4
08zHioOVTd4
C5titprQAc4
3WVVbCUNPHY
nh78-VKSBUY
C8Eb4-Wz27A
-hK1p6Ymbhw
vA3o-lTreKM
You0q4JJzS4
ReahyukN0qg
HSwIERXX4T8
2EjEUoeIccc
ck_L-TD1t2Y
_VuMqbtphr0
alqGXue7Ca8
Zt1Esq-qOZM
NMb3bguJ7wY
h_ZIMHgKTP0
DR5ZytN7Pj0
hcUxFwUjCxM


The following API call retrieves the "id" part of the channel resource that contains information about Khan Academy's channel. This is the channel id.

In [8]:
def get_channel_id(channel_name, max_results=50):
    
    youtube = build(serviceName = YOUTUBE_API_SERVICE_NAME,
                    version = YOUTUBE_API_VERSION,
                    developerKey = DEVELOPER_KEY)
    
    query_response = youtube.channels().list(
        forUsername = channel_name,
        part = 'id',
        maxResults= max_results)
    
    return query_response.execute()

result = get_channel_id("khanacademy")

In [13]:
print(result["items"][0]["id"])

UC4a-Gbdw7vOaccHmFo40b9g


This is the channel id for Khan Academy.
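
The third resource listed earlier, Videos, has not been called yet. A rough sketch of how statistics might be fetched is given below. It assumes the `youtube` client is built with `build(...)` as in the cells above, and that the videos endpoint accepts up to 50 comma-separated ids per request. The helper names `chunked` and `get_video_statistics` are hypothetical.

```python
def chunked(seq, size=50):
    """Split seq into consecutive lists of at most size elements."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def get_video_statistics(youtube, video_ids):
    """Return {video_id: statistics dict} for the given ids.

    Assumption: the videos endpoint takes up to 50 comma-separated ids
    per call, so the ids are sent in batches of 50.
    """
    stats = {}
    for batch in chunked(list(video_ids), 50):
        response = youtube.videos().list(
            part="statistics",
            id=",".join(batch)).execute()
        for item in response.get("items", []):
            stats[item["id"]] = item["statistics"]
    return stats


# 120 ids would be sent as three requests
print([len(b) for b in chunked(list(range(120)))])  # [50, 50, 20]
```

Batching this way means a snapshot of a 6000-video channel would need about 120 statistics calls, on top of the calls used to list the video ids.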