This notebook documents the construction of an application that performs three actions:

1. Create a database containing snapshots of descriptor statistics for all the videos on a number of YouTube channels.

2. Generate reports of pertinent information summarising some subselection of the videos retrieved.

3. Perform time series analysis of historical snapshots.


# Part 1


Let's build the part of the application that talks to YouTube. There is a quickstart guide at:
https://developers.google.com/youtube/v3/quickstart/python
but we will mostly follow our own route using 
https://github.com/youtube/api-samples/blob/master/python/search.py

We will also refer to 

https://developers.google.com/youtube/v3/docs/
https://www.forgov.qld.gov.au/file/21896/download?token=mo1SpiZT 

We want to get a list of statistics for all videos for all channels in a given list. We'll need to make calls to at least 3 different resources:

> 
**Channels**
-  Needs: channel_name
-  Returns: channel_id
> 
**Videos**
-  Needs: video_id
-  Returns: Statistics on video
> 
**PlaylistItems**
-  Needs: playlist_id (derivable from channel_id)
-  Returns: All video_ids on that playlist
    
It's probably a good idea to first make an index of all the channel_ids of the channels that we want to survey. Usefully, each channel id is only one character away from the id of the corresponding default playlist, which contains every video uploaded by that channel: the second character changes from C to U, so UC4a-Gbdw7vOaccHmFo40b9g becomes UU4a-Gbdw7vOaccHmFo40b9g.
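As a minimal sketch of that derivation, the helper below swaps the prefix. The function name `uploads_playlist_id` is our own, and the "UC" to "UU" rule is assumed from the two channels used in this notebook rather than taken from the API documentation.

```python
def uploads_playlist_id(channel_id):
    """Derive the default uploads playlist id from a channel id.

    Assumption: channel ids start with "UC" and the matching uploads
    playlist starts with "UU", as for both channels in this notebook.
    """
    if not channel_id.startswith("UC"):
        raise ValueError("expected a channel id starting with 'UC': %r" % channel_id)
    return "UU" + channel_id[2:]


print(uploads_playlist_id("UC4a-Gbdw7vOaccHmFo40b9g"))  # UU4a-Gbdw7vOaccHmFo40b9g
```

If the prefix rule ever fails, the Channels resource also reports the uploads playlist directly under contentDetails.relatedPlaylists.uploads, at the cost of requesting the contentDetails part.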

There is a Stack Overflow question which has a lot of useful information on how we should structure our calls to the API: https://stackoverflow.com/questions/18953499/youtube-api-to-fetch-all-videos-on-a-channel/20795628

From this, and a little consideration, we note a few important facts:
1. Each call to an API resource returns at most 50 results, i.e., if a channel has uploaded 6000 videos (e.g., Khan Academy), we must make at least 120 calls to the API to return every video_id.
2. You can iterate through this list of 6000 videos, 50 at a time, using the optional nextPageToken.
3. According to that discussion, you can only get a maximum of 500 videos using the nextPageToken.
4. To get around the 500-video limit, you can try to restrict your query by date of upload. This might take some trial and error if there were any periods of really high upload density.
5. We'll need to do a big initial survey to get the video ids to date. After that, the process will shift to maintenance of an existing database and should require much lower volumes of queries. However, if we want up-to-date information on videos, that will require a fair bit of resampling, and the API call quota may become an issue.
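To make facts 1 and 4 concrete, here is a rough sketch. `calls_needed` reproduces the 6000 / 50 = 120 arithmetic, and `date_windows` splits a date range into consecutive windows that could be passed as the publishedAfter and publishedBefore parameters of a search query. Both function names, the 30-day window size and the 2017 dates are illustrative assumptions, not values from the API.

```python
import math
from datetime import datetime, timedelta

PAGE_SIZE = 50  # maximum results per call (fact 1)
RFC3339 = "%Y-%m-%dT%H:%M:%SZ"  # timestamp format used by the date filters


def calls_needed(num_videos, page_size=PAGE_SIZE):
    """Minimum number of paginated calls needed to list num_videos items."""
    return int(math.ceil(num_videos / float(page_size)))


def date_windows(start, end, days):
    """Yield (after, before) timestamp pairs that cover [start, end)."""
    step = timedelta(days=days)
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield cursor.strftime(RFC3339), upper.strftime(RFC3339)
        cursor = upper


print(calls_needed(6000))  # 120

# Hypothetical survey period, split into 30-day windows
windows = list(date_windows(datetime(2017, 1, 1), datetime(2017, 4, 1), 30))
for after, before in windows:
    print("%s %s" % (after, before))
```

If a window still returns more results than can be paged through, it can be split again with a smaller `days` value, which is the trial and error mentioned in fact 4.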



All this suggests the following order of operations:

1. Construct index of channel name and channel id. 
    1. Manually construct list of names from Epsilon Stream channel list.
    2. Use channels API call to get channel id for each.
    3. Store list in file.
    4. Derive upload playlist ids from channel ids. 
2. Get list of all video_ids on each channel. For each channel:
    1. Get number of uploads. (The remaining steps are still to be written. #TODO)
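Step 1 of this plan could be sketched as follows, assuming we store the index as JSON. The helper names, the file name `channel_index.json` and the dictionary layout are our own choices for illustration; the channel ids are the two that appear in this notebook, and the keys are just labels.

```python
import json
import os
import tempfile


def build_channel_index(names_to_ids):
    """Map each channel name to its channel id and derived uploads playlist id."""
    index = {}
    for name, channel_id in names_to_ids.items():
        index[name] = {
            "channel_id": channel_id,
            # Assumed rule: replace the "UC" prefix with "UU"
            "uploads_playlist_id": "UU" + channel_id[2:],
        }
    return index


def save_index(index, path):
    """Write the index to a JSON file."""
    with open(path, "w") as f:
        json.dump(index, f, indent=2, sort_keys=True)


def load_index(path):
    """Read the index back from a JSON file."""
    with open(path) as f:
        return json.load(f)


index = build_channel_index({
    "khanacademy": "UC4a-Gbdw7vOaccHmFo40b9g",
    "3blue1brown": "UCYO_jab_esuFRV4b17AJtAw",
})

# A temporary folder keeps the example from touching the repo
path = os.path.join(tempfile.mkdtemp(), "channel_index.json")
save_index(index, path)
restored = load_index(path)
print(restored["khanacademy"]["uploads_playlist_id"])  # UU4a-Gbdw7vOaccHmFo40b9g
```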
    


### Calls to the Youtube API

The following API call retrieves the video_ids of the first 50 videos, in whatever order the API returns them, of all videos posted by Khan Academy's channel on YouTube. We take advantage of the fact that every channel has a default playlist resource, which contains a list of every video published on that channel.


In [1]:
from googleapiclient.discovery import build  
# We are currently building an authorisation-less app, so we don't need oauth2client
import os

# Put the text file containing the API key into the parent folder of your local git repo
apikey_path = os.path.join(os.getcwd(), os.pardir, "apikey.txt")

# Read the key as text and strip the trailing newline so that only the key itself is sent
with open(apikey_path, "r") as f:
    apikey = f.readline().strip()

DEVELOPER_KEY = apikey
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"

In [2]:
playlist_id = "UU4a-Gbdw7vOaccHmFo40b9g" #This is the default playlist id for Khan Academy

def get_youtube_playlist(playlist_id, max_results=50):
    
    youtube = build(serviceName = YOUTUBE_API_SERVICE_NAME,
                    version = YOUTUBE_API_VERSION,
                    developerKey = DEVELOPER_KEY)
    
    search_response = youtube.playlistItems().list(
        playlistId = playlist_id,
        part = 'snippet',
        maxResults= max_results)
    
    return search_response.execute()

playlist_result = get_youtube_playlist(playlist_id)

In [10]:
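videos = []

# Collect (and print) the video id of every item on the first page of the playlist
for item in playlist_result["items"]:
    video_id = item["snippet"]["resourceId"]["videoId"]
    print(video_id)
    videos.append(video_id)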

wT_ApV1s_io
fToDs5nd_rE
mFI58RRCDbs
iaGjqkRIUSk
lPfNgTrWKyE
-rr81Uf10pc
chfz7QiwdOc
i0sUJVxSXlk
7R1B4QFd668
VKq7wUoIIK8
hScx0e9qyMw
re2d80cqhYw
zJtCmpH--70
bP1pDUwV5hE
9Pp_LtKgTQg
vfmLI150g4w
9Me8VGbA9dA
iJr9PpY3PjM
ogFLbvKru8A
NkGvw18zlGQ
X0gIJUXz6jc
8CJ-RsoYbg0
pVPPDbZGd04
Lh1TPIFH7iI
Yadshgsx6FQ
Y650kaYNlJU
-VZUijm02h0
HOyNEU94xJQ
NCENjXTMp9I
7iTWX8bfBv0
HR5iEX3Sy1k
KYz6HH9wZ8g
rn_C25fPOVw
YyXlrRMFUJw
Rk-EtuESRm4
08zHioOVTd4
C5titprQAc4
3WVVbCUNPHY
nh78-VKSBUY
C8Eb4-Wz27A
-hK1p6Ymbhw
vA3o-lTreKM
You0q4JJzS4
ReahyukN0qg
HSwIERXX4T8
2EjEUoeIccc
ck_L-TD1t2Y
_VuMqbtphr0
alqGXue7Ca8
Zt1Esq-qOZM


The following API call retrieves the "id" part of the channel resource that contains information about Khan Academy's channel. This is the channel id.

In [8]:
def get_channel_id(channel_name, max_results=50):
    
    youtube = build(serviceName = YOUTUBE_API_SERVICE_NAME,
                    version = YOUTUBE_API_VERSION,
                    developerKey = DEVELOPER_KEY)
    
    query_response = youtube.channels().list(
        forUsername = channel_name,
        part = 'id',
        maxResults= max_results)
    
    return query_response.execute()

result = get_channel_id("khanacademy")

In [13]:
print(result["items"][0]["id"])

UC4a-Gbdw7vOaccHmFo40b9g


This is the channel id for Khan Academy.

Now we'll take that list of 50 video ids and get a bunch of statistics for them using the Videos resource.

In [27]:

test_video_ids = ",".join(videos)

def get_video_statistics(video_ids, max_results=50):
    """returns a youtube API Video resource, containing details for a list of videos"""
    
    youtube = build(serviceName = YOUTUBE_API_SERVICE_NAME,
                    version = YOUTUBE_API_VERSION,
                    developerKey = DEVELOPER_KEY)
    
    search_response = youtube.videos().list(
        id = video_ids,
        part = 'snippet,statistics',
        maxResults= max_results)
    
    return search_response.execute()

video_result = get_video_statistics(test_video_ids)

We've now got some statistics for the 50 videos captured above.
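The call above takes a single comma-separated string of ids. For a channel with more than 50 videos, the ids would first need to be split into batches. The sketch below shows one way to do that; the name `batch_video_ids` and the placeholder ids are illustrative, and the batch size of 50 follows the per-call limit noted earlier.

```python
def batch_video_ids(video_ids, batch_size=50):
    """Split a list of ids into comma-separated strings of at most batch_size ids."""
    return [",".join(video_ids[i:i + batch_size])
            for i in range(0, len(video_ids), batch_size)]


# Placeholder ids, only to show the batch sizes
example_ids = ["id%03d" % n for n in range(120)]
batches = batch_video_ids(example_ids)
print([len(batch.split(",")) for batch in batches])  # [50, 50, 20]
```

Each string in `batches` could then be passed to `get_video_statistics` in turn.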

In [49]:

for item in video_result["items"]:
    views = float(item["statistics"]["viewCount"])
    likes = float(item["statistics"]["likeCount"])
    like_ratio = likes / views  # likes per view
    print("%d %d %s" % (views, likes, like_ratio))


114 2 0.0175438596491
1059 37 0.0349386213409
118 4 0.0338983050847
853 37 0.0433763188746
2378 61 0.0256518082422
2530 67 0.0264822134387
3433 112 0.0326245266531
3651 122 0.033415502602
3471 61 0.0175741861135
3493 153 0.0438018894933
952 12 0.0126050420168
4521 104 0.02300376023
5697 120 0.0210637177462
675 8 0.0118518518519
3863 108 0.0279575459487
3607 111 0.03077349598
656 11 0.0167682926829
1511 19 0.012574454004
1759 32 0.0181921546333
2347 38 0.016190881977
600 4 0.00666666666667
483 0 0.0
2543 63 0.0247738891074
1615 25 0.015479876161
420 3 0.00714285714286
1118 21 0.0187835420394
822 9 0.0109489051095
3931 53 0.0134825744085
1481 30 0.0202565833896
1707 42 0.02460456942
11632 319 0.02742434663
1880 67 0.0356382978723
1256 36 0.0286624203822
1050 22 0.0209523809524
3656 70 0.0191466083151
6180 90 0.0145631067961
4043 51 0.0126143952511
2405 86 0.0357588357588
560 10 0.0178571428571
4878 112 0.0229602296023
2532 41 0.0161927330174
2940 48 0.0163265306122
2371 56 0.023618726275
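The loop above assumes that every video reports both counts and has at least one view. A slightly more defensive sketch is given below. The helper name `like_ratio` is our own, and treating a missing likeCount or a zero viewCount as "no ratio" is an assumption about how we would want the reports to behave.

```python
def like_ratio(statistics):
    """Return likes per view, or None when the ratio cannot be computed."""
    views = int(statistics.get("viewCount", 0))
    likes = statistics.get("likeCount")
    if likes is None or views == 0:
        return None
    return int(likes) / float(views)


# First row of the output above: 114 views, 2 likes
print(like_ratio({"viewCount": "114", "likeCount": "2"}))
# No views yet, so no ratio
print(like_ratio({"viewCount": "0", "likeCount": "0"}))  # None
```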

Next we should figure out how to use pagination of results. We'll choose the default playlist of 3Blue1Brown as our testing grounds.

In [54]:
bb_channelId = "UCYO_jab_esuFRV4b17AJtAw"
bb_playlist_id = "UUYO_jab_esuFRV4b17AJtAw"

def get_next_page_token(response):
    """Given a JSON YouTube API response, return its nextPageToken (None on the last page)"""
    return response.get("nextPageToken")

def num_results(response):
    """Return the total number of results reported by the response"""
    return int(response["pageInfo"]["totalResults"])



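def add_playlist_video_details(response, dataset):
    """Reads a playlist call response and adds one entry per item to dataset"""
    for item in response["items"]:
        # Note: item["id"] is the id of the playlist item, not of the video;
        # the video id is at item["snippet"]["resourceId"]["videoId"]
        video_id = item["id"]
        # Placeholder values until real statistics are fetched
        dataset[video_id] = {"likes":1, "dislikes":3, "views":20}
    return dataset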

def get_paginated_playlist(playlist_id, max_results=50):
    """Fetch the first page of a playlist, then return its second page"""
    
    youtube = build(serviceName = YOUTUBE_API_SERVICE_NAME,
                    version = YOUTUBE_API_VERSION,
                    developerKey = DEVELOPER_KEY)
    
    first_request = youtube.playlistItems().list(
        playlistId = playlist_id,
        part = 'snippet',
        maxResults= max_results,
        pageToken = None)
    
    first_page = first_request.execute()
    
    # token is None if the playlist fits on one page; the call below then repeats page one
    token = get_next_page_token(first_page)
    
    second_request = youtube.playlistItems().list(
        playlistId = playlist_id,
        part = 'snippet',
        maxResults= max_results,
        pageToken = token
        )
    
    return second_request.execute()

In [48]:
bb_results = get_youtube_playlist(bb_playlist_id)
bb_results_2 = get_paginated_playlist(bb_playlist_id)

In [55]:
def get_video_details_from_playlist(playlist_id):
    """Gets all videos on a youtube playlist by collecting paginated results"""
    
    youtube = build(serviceName = YOUTUBE_API_SERVICE_NAME,
                    version = YOUTUBE_API_VERSION,
                    developerKey = DEVELOPER_KEY)
    dataset = {}
    
    # Ask for 50 items per page (the API default is 5) to keep the number of calls down
    query = youtube.playlistItems().list(
                playlistId = playlist_id,
                part = 'snippet',
                maxResults = 50,
                pageToken=None)
    first_page = query.execute()
    
    token = get_next_page_token(first_page)
    dataset = add_playlist_video_details(first_page, dataset) 
    #maybe create a dataset class with some nice methods?
    
    # Keep requesting pages until the response carries no nextPageToken
    while token:
        
        query = youtube.playlistItems().list(
            playlistId = playlist_id,
            part = 'snippet',
            maxResults = 50,
            pageToken = token
            )
        page_response = query.execute()
        
        token = get_next_page_token(page_response)
        dataset = add_playlist_video_details(page_response, dataset) 
    
    return dataset

bb_results_3 = get_video_details_from_playlist(bb_playlist_id)
    

In [56]:
bb_results_3

{u'VVVZT19qYWJfZXN1RlJWNGIxN0FKdEF3Li05T1V5bzhORlpn': {'dislikes': 3,
  'likes': 1,
  'views': 20},
 u'VVVZT19qYWJfZXN1RlJWNGIxN0FKdEF3LjFTTW1jOWdRbUhR': {'dislikes': 3,
  'likes': 1,
  'views': 20},
 u'VVVZT19qYWJfZXN1RlJWNGIxN0FKdEF3LjJTVXZXZk5KU3NN': {'dislikes': 3,
  'likes': 1,
  'views': 20},
 u'VVVZT19qYWJfZXN1RlJWNGIxN0FKdEF3LjNkNkRzaklCeko0': {'dislikes': 3,
  'likes': 1,
  'views': 20},
 u'VVVZT19qYWJfZXN1RlJWNGIxN0FKdEF3LjNzN2gyTUhRdHhj': {'dislikes': 3,
  'likes': 1,
  'views': 20},
 u'VVVZT19qYWJfZXN1RlJWNGIxN0FKdEF3Ljg0aEVtR0h3M0o4': {'dislikes': 3,
  'likes': 1,
  'views': 20},
 u'VVVZT19qYWJfZXN1RlJWNGIxN0FKdEF3Ljl2S3FWa01RSEtr': {'dislikes': 3,
  'likes': 1,
  'views': 20},
 u'VVVZT19qYWJfZXN1RlJWNGIxN0FKdEF3Lk16UkNETHJlMWI0': {'dislikes': 3,
  'likes': 1,
  'views': 20},
 u'VVVZT19qYWJfZXN1RlJWNGIxN0FKdEF3Lk1Cbm5YYk9NNVM0': {'dislikes': 3,
  'likes': 1,
  'views': 20},
 u'VVVZT19qYWJfZXN1RlJWNGIxN0FKdEF3Lk5hTF9DYjQyV3lZ': {'dislikes': 3,
  'likes': 1,
  'views': 20},


In [57]:
len(bb_results_3)

67
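The pagination loop above can also be written as a generator, which separates walking the pages from what we do with each page. The sketch below is one possible shape, not the approach used in the cells above: `iter_pages` and `FAKE_PAGES` are hypothetical names, and the fake pages stand in for the API so that the example runs without a key.

```python
def iter_pages(fetch_page):
    """Yield every page of a paginated listing.

    fetch_page(token) must return a dict shaped like a YouTube API list
    response; it is called with token=None for the first page.
    """
    token = None
    while True:
        page = fetch_page(token)
        yield page
        token = page.get("nextPageToken")
        if not token:
            break


# Stand-in for the API: three pages keyed by the token that requests them
FAKE_PAGES = {
    None: {"items": ["a", "b"], "nextPageToken": "p2"},
    "p2": {"items": ["c", "d"], "nextPageToken": "p3"},
    "p3": {"items": ["e"]},
}

collected = [item for page in iter_pages(FAKE_PAGES.get) for item in page["items"]]
print(collected)  # ['a', 'b', 'c', 'd', 'e']
```

With the real client, `fetch_page` could be a small function that calls `youtube.playlistItems().list(...)` with `pageToken = token` and executes the request, as in the cells above.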

In [61]:
h3h3 = "UUDWIvJwLJsE4LG1Atne2blQ"
h3h3_results = get_video_details_from_playlist(h3h3) #288 of 289, missing one?