This notebook documents the construction of an application that performs three actions:

1. Create a database containing snapshots of descriptor statistics for all the videos on a number of channels on Youtube.

2. Generate reports of pertinent information summarising some subselection of the videos retrieved.

3. Perform time series analysis of historical snapshots.


# Part 1


Let's build the part of the application that talks to Youtube. There is a quickstart guide at:
https://developers.google.com/youtube/v3/quickstart/python
but we will mostly follow our own route using 
https://github.com/youtube/api-samples/blob/master/python/search.py

We will also refer to 

https://developers.google.com/youtube/v3/docs/
https://www.forgov.qld.gov.au/file/21896/download?token=mo1SpiZT 

We want to get a list of statistics for all videos for all channels in a given list. We'll need to make calls to at least 4 different resources:

> 
**Channels**
-  Needs: channel_name
-  Returns: channel_id
> 
**Videos**
-  Needs: video_id
-  Returns: Statistics on video
> 
**PlaylistItems**
-  Needs: playlist_id (derivable from channel_id)
-  Returns: All video_ids on that playlist
>
**Search**
-  Needs: channel_id, plus optional filters such as type, publishedAfter and publishedBefore
-  Returns: All video_ids on a channel
    
It's probably a good idea to first make an index of the channel_ids of all the channels that we want to survey. Usefully, each channel id differs by only one character from the id of the corresponding default playlist, which contains every video uploaded by that channel. For example, the channel `UCDWIvJwLJsE4LG1Atne2blQ` used later in this notebook has the default playlist `UUDWIvJwLJsE4LG1Atne2blQ`.
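
As a minimal sketch of that one-character relationship: in the ids used in this notebook the channel id starts with `UC` and the matching default playlist id starts with `UU`. The helper name `uploads_playlist_id` is hypothetical, and the prefix swap is an observed convention rather than a documented guarantee; the `contentDetails.relatedPlaylists.uploads` field of a `channels` response is the safer source for this id.

```python
def uploads_playlist_id(channel_id):
    """Guess the default (uploads) playlist id by swapping the 'UC' prefix for 'UU'.

    Assumption: the channel id follows the 'UC...' pattern seen in this notebook.
    """
    if not channel_id.startswith("UC"):
        raise ValueError("expected a channel id starting with 'UC': {0!r}".format(channel_id))
    return "UU" + channel_id[2:]


print(uploads_playlist_id("UCDWIvJwLJsE4LG1Atne2blQ"))
```

Anything that does not look like a channel id (for example a `PL...` playlist id) is rejected rather than silently mangled.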

There is a Stack Overflow question with a lot of useful information on how we should structure our calls to the API: https://stackoverflow.com/questions/18953499/youtube-api-to-fetch-all-videos-on-a-channel/20795628

From this, and a little consideration, we note a few important facts:
1. Each call to an API resource returns at most 50 results, i.e., if a channel has uploaded 6000 videos (e.g., Khan Academy), we must make at least 120 calls to the API to return every video_id.
2. You can iterate through this list of 6000 videos, 50 at a time, using the optional next-page token (`nextPageToken` in a response, sent back as the `pageToken` parameter).
3. You can actually only get a maximum of 500 videos using the next-page token.
4. To get around the 500-video limit, you can try to restrict your query by date of upload.
5. We'll need to do a big initial survey to get the video ids to date. After that the process will shift to maintenance of an existing database and should require much lower volumes of queries. However, if we want up-to-date information on videos, that will require a fair bit of resampling. API call quota may become an issue.
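
The paging arithmetic and the token-following loop in the list above can be sketched without touching the network. This is only an illustration under stated assumptions: `min_calls`, `collect_all_pages` and `fake_fetch_page` are hypothetical names, and the fake fetcher stands in for a real API call that returns a decoded JSON response with `items` and, when more pages exist, `nextPageToken`.

```python
import math

PAGE_SIZE = 50  # maximum results per call, as noted in the list above


def min_calls(n_videos, page_size=PAGE_SIZE):
    """Smallest number of calls needed to list n_videos items."""
    return int(math.ceil(n_videos / float(page_size)))


def collect_all_pages(fetch_page):
    """Follow next-page tokens until a response arrives without one.

    fetch_page is a hypothetical callable: it takes a page token (None for the
    first page) and returns the decoded response as a dict.
    """
    items = []
    token = None
    while True:
        page = fetch_page(token)
        items.extend(page.get("items", []))
        token = page.get("nextPageToken")
        if not token:
            break
    return items


# Stand-in for the network: 120 made-up ids served 50 at a time.
FAKE_IDS = ["vid{0}".format(i) for i in range(120)]


def fake_fetch_page(token):
    start = int(token) if token else 0
    page = {"items": FAKE_IDS[start:start + PAGE_SIZE]}
    if start + PAGE_SIZE < len(FAKE_IDS):
        page["nextPageToken"] = str(start + PAGE_SIZE)
    return page


print(min_calls(6000))
print(len(collect_all_pages(fake_fetch_page)))
```

Passing the fetcher in as an argument keeps the loop testable; the same loop could later be pointed at a function that performs the real request.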


There are only a few things that we really want:

1. A list of all the videos on a given playlist (-time indexed? only by upload date?)
2. Engagement statistics for each video. (-time indexed)


This suggests the following tasks:

#### Construct index of channel name and channel id. 
    - [X] Manually construct list of names from Epsilon Stream channel list.
    - [X] Store list in file. <span style="color:red"> json_channels.txt</span>
    - [ ] Complete channel/playlist details manually

#### Build update queue
    - [ ] read database, determine which playlists need updating by date
    - [ ] load some number of playlist details to memory. construct queue of update requests, write back to database on completion
    - [ ] 
    

#### Get list of all video ids on each channel -> Create video-list-updater
    - [X] find total number of videos in playlist
    - [X] create search request for channel with bounded upload dates to restrict number of results <50
    - [X] figure out how to make RFC 3339 timestamps for search reqs
    - [ ] write function that recursively subdivides upload time until all periods for a channel have < 50 vids
    - [ ] create (1-shot?) exclusion list to discard non maths videos
    - [ ] read search response and obtain all desired details
    - [ ] handle aggregation and quality control of requests (e.g. check for doubles/missing, repeat failed requests)
    - [ ] handle I/O of results

#### Having gotten all video_ids for each channel, for each video:
    - [ ] Asynchronously query video resource 50 ids at a time 
    - [ ] routine to extract "status" (all details desired) on every video
    - [ ] routine to add each 
    - [ ] program to govern all these actions, reuse/retry function and quality control from earlier
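
Querying the video resource "50 ids at a time" needs the id list cut into batches first. A minimal sketch, assuming the ids are already held in a Python list; the name `batches` and the ids themselves are made up for illustration.

```python
def batches(ids, size=50):
    """Split a list of ids into consecutive batches of at most `size` ids."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


video_ids = ["vid{0}".format(i) for i in range(120)]      # made-up ids

# One comma-separated string per request, ready to use as an `id` parameter.
id_params = [",".join(batch) for batch in batches(video_ids)]

print([len(batch) for batch in batches(video_ids)])
```

Each element of `id_params` would then feed one request, so 120 ids cost three calls rather than 120.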

To consider:
What kinds of operations will I frequently be doing with this database?
Should the database be snapshot based, or delta based, or a mix of the two?
    
Other TODO:
    - [ ] Talk to Yoni about database structure. Architecture, basic unit of storage? Learn about NoSQL?
    - [ ] create table of which resources have which params

### Calls to the Youtube API

The following API call retrieves the video_ids of the first 50 videos, according to some order, of all videos posted by Khan Academy's channel on youtube. We take advantage of the fact that every channel has a default playlist resource, which contains a list of every video published on that channel.


In [1]:
from googleapiclient.discovery import build  
#We are currently building an authorisationless app, so we don't need oauth2client
import os

#put the textfile containing the apikey into the parent folder of your local git repo
apikey_path = os.path.join(os.getcwd(), os.pardir, "apikey.txt")

with open(apikey_path,"r") as f:
    apikey = f.readline().strip()   #strip any trailing newline so the key can be embedded in a URL

DEVELOPER_KEY = apikey
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"

# Asynchronous requests to the Youtube API


In [54]:
import nbgrequests as grequests   #changed one tiny boolean
import requests
import time 
import os

#put the textfile containing the apikey into the parent folder of your local git repo
apikey_path = os.path.join(os.getcwd(), os.pardir, "apikey.txt")

with open(apikey_path,"r") as f:
    apikey = f.readline().strip()   #strip any trailing newline so the key can be embedded in a URL

def tic(t):
    # time since t in ms
    return (time.time() -t)*1000 

# n=2
# urls = [url_maker(20+i) for i in range(0,n)]

##Example 1: 7 http requests, repeating some targets. Works asynchronously
# urls = [
#     'http://www.heroku.com',
#     'http://python-tablib.org',
#     'http://httpbin.org',
#     'http://python-requests.org',
#     'http://python-requests.org',
#     'http://python-requests.org',
#     'http://python-requests.org',
#       ]

###Example 2: 4 https requests, repeating some targets. 
# urls = [
#     'https://www.heroku.com',
#     'https://httpbin.org',
#     'https://httpbin.org',
#     'https://httpbin.org',
#       ]

###Example 3: 3 Youtube v3 API search queries for the same channel, 3 different search terms. These queries all successfully return results; compare the timings printed below to check whether they run asynchronously
urls =['https://www.googleapis.com/youtube/v3/search?part=snippet&channelId=UCDWIvJwLJsE4LG1Atne2blQ&key={0}&q=1'.format(apikey)
        ,'https://www.googleapis.com/youtube/v3/search?part=snippet&channelId=UCDWIvJwLJsE4LG1Atne2blQ&key={0}&q=2'.format(apikey)
        ,'https://www.googleapis.com/youtube/v3/search?part=snippet&channelId=UCDWIvJwLJsE4LG1Atne2blQ&key={0}&q=3'.format(apikey)
        ]

###Example 4: 3 Youtube v3 API search queries for 3 different channels.
# urls = ['https://www.googleapis.com/youtube/v3/search?part=snippet&channelId=UCDWIvJwLJsE4LG1Atne2blQ&key={0}&q=1'.format(apikey)
#         ,'https://www.googleapis.com/youtube/v3/search?part=snippet&channelId=UC4a-Gbdw7vOaccHmFo40b9g&key={0}&q=1'.format(apikey)
#         ,'https://www.googleapis.com/youtube/v3/search?part=snippet&channelId=UCjwOWaOX-c-NeLnj_YGiNEg&key={0}&q=1'.format(apikey)
#         ]


def fetch(url):
    start = time.time()
    response = requests.get(url)
    result = response.status_code
    print 'Process {}: {:.2f}'.format(url, tic(start))
    return result

def async_fetch(unsent_requests):
    start = time.time()
    responses = grequests.map(unsent_requests)
    results = [response.status_code for response in responses]
    print 'Async Process: {:.2f}'.format( tic(start))
    return results

def synchronous(urls):
    responses = [fetch(i) for i in urls]
    return responses
        
def asynchronous(urls):
    unsent_requests = [grequests.get(url) for url in urls]
    responses = async_fetch(unsent_requests)
    return responses

    
p_start = time.time()
print 'Synchronous:'
sync_results = synchronous(urls)
print "All done in {:.2f}".format(tic(p_start))
print sync_results
    
p_start = time.time()
print 'Asynchronous:'
async_results = asynchronous(urls)
print "All done in {:.2f}".format(tic(p_start))   
print async_results

Synchronous:
Process https://www.googleapis.com/youtube/v3/search?part=snippet&channelId=UCDWIvJwLJsE4LG1Atne2blQ&key={YOUR_API_KEY}&q=1: 884.00
Process https://www.googleapis.com/youtube/v3/search?part=snippet&channelId=UCDWIvJwLJsE4LG1Atne2blQ&key={YOUR_API_KEY}&q=2: 801.00
Process https://www.googleapis.com/youtube/v3/search?part=snippet&channelId=UCDWIvJwLJsE4LG1Atne2blQ&key={YOUR_API_KEY}&q=3: 707.00
All done in 2395.00
[200, 200, 200]
Asynchronous:
Async Process: 720.00
All done in 727.00
[200, 200, 200]


The async queries are returning in about the time it takes to do one sync query, as expected.
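
The reason the total is close to one query's latency is that the waits overlap. A small network-free sketch of the same effect follows. Note the swap: it uses threads from the Python 3 standard library (`concurrent.futures`) rather than the `nbgrequests` module used above, and `slow_fetch` is a made-up stand-in for `requests.get` that simply sleeps.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_fetch(url, delay=0.1):
    """Stand-in for a network call: waits `delay` seconds, then reports success."""
    time.sleep(delay)
    return 200


urls = ["fake://query/{0}".format(i) for i in range(3)]   # made-up targets

# One after another: total time is roughly the sum of the waits.
start = time.time()
sync_results = [slow_fetch(u) for u in urls]
sync_elapsed = time.time() - start

# All at once: total time is roughly the longest single wait.
start = time.time()
with ThreadPoolExecutor(max_workers=len(urls)) as pool:
    async_results = list(pool.map(slow_fetch, urls))
async_elapsed = time.time() - start

print("sequential: {0:.2f}s, concurrent: {1:.2f}s".format(sync_elapsed, async_elapsed))
```

`pool.map` returns results in the order of the input URLs, which matters later when responses have to be matched back to the date intervals that produced them.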

# Creating json files

We need to choose a method of data storage for everything we'll be retrieving. JSON might be a good choice: our data is highly structured, we don't yet know exactly what we want to do with it, and we need our data structure to be language independent.

Next we'll look at creating and reading JSON files. This example creates a JSON file and then reads it back into Python.

In [27]:
import json

A = [1,3,4,5,9]

#json allows dict, string and list structures
json_data = {"name":"Jeff", "shifts" : A, "title" : "Chef"} 

dump_list = [json_data for _ in range(10)]

#serialises the python object to a json string
dump = json.dumps(dump_list)

#creates file in cwd
with open("json_dump.txt","w") as f:
    f.write(dump)

with open("json_dump.txt","r") as f:
    everything = f.read()

#parses the json string back to python types
jsond = json.loads(everything)

print jsond

[{u'shifts': [1, 3, 4, 5, 9], u'name': u'Jeff', u'title': u'Chef'}, {u'shifts': [1, 3, 4, 5, 9], u'name': u'Jeff', u'title': u'Chef'}, {u'shifts': [1, 3, 4, 5, 9], u'name': u'Jeff', u'title': u'Chef'}, {u'shifts': [1, 3, 4, 5, 9], u'name': u'Jeff', u'title': u'Chef'}, {u'shifts': [1, 3, 4, 5, 9], u'name': u'Jeff', u'title': u'Chef'}, {u'shifts': [1, 3, 4, 5, 9], u'name': u'Jeff', u'title': u'Chef'}, {u'shifts': [1, 3, 4, 5, 9], u'name': u'Jeff', u'title': u'Chef'}, {u'shifts': [1, 3, 4, 5, 9], u'name': u'Jeff', u'title': u'Chef'}, {u'shifts': [1, 3, 4, 5, 9], u'name': u'Jeff', u'title': u'Chef'}, {u'shifts': [1, 3, 4, 5, 9], u'name': u'Jeff', u'title': u'Chef'}]


Here's a text file with all the current channels added to it.

In [51]:


with open("json_channels.txt","rb") as f:
    channels_json = json.loads(f.read())
    
channels_json


[{u'channelId': u'',
  u'channelName': u'mathbff',
  u'comment': u'Mathbff',
  u'playListId': u''},
 {u'channelId': u'',
  u'channelName': u'mathantics',
  u'comment': u'mathantics',
  u'playListId': u''},
 {u'channelId': u'',
  u'channelName': u'',
  u'comment': u'Khan Academy, Algebra Worked Examples List',
  u'playListId': u'PL3128E15B8D159842'},
 {u'channelId': u'',
  u'channelName': u'tecmath',
  u'comment': u'tecmath',
  u'playListId': u''},
 {u'channelId': u'UC1_uAIS3r8Vu6JjXWvastJg',
  u'channelName': u'',
  u'comment': u'Mathologer',
  u'playListId': u''},
 {u'channelId': u'UCYO_jab_esuFRV4b17AJtAw',
  u'channelName': u'',
  u'comment': u'3Blue1Brown',
  u'playListId': u''},
 {u'channelId': u'',
  u'channelName': u'Numberphile',
  u'comment': u'Numberphile',
  u'playListId': u''},
 {u'channelId': u'',
  u'channelName': u'kylepearce3',
  u'comment': u'Kyle Pearce (too many)',
  u'playListId': u''},
 {u'channelId': u'',
  u'channelName': u'',
  u'comment': u'Trigonometry Tutoria

In [54]:
for i in channels_json:
    print("{0:12.12}    {1:5.5}    {2:20.20}    {3}".format(i["channelName"],i["channelId"],i["playListId"],i["comment"]))
#     print i


mathbff                                          Mathbff
mathantics                                       mathantics
                         PL3128E15B8D159842      Khan Academy, Algebra Worked Examples List
tecmath                                          tecmath
                UC1_u                            Mathologer
                UCYO_                            3Blue1Brown
Numberphile                                      Numberphile
kylepearce3                                      Kyle Pearce (too many)
                         PLAF816DCEEB2A2F7B      Trigonometry Tutorials by patrickJMT
                         PL8gnhgRJl1x4rjaE3rM    PatrickJMT Algebra
                         PLANMHOrJaFxPCjR2enL    PatrickJMT The Fundamentals of Logic
                         PLANMHOrJaFxMobwlFya    PatrickJMT Puzzle Problems - Fun Problems!
                         PLANMHOrJaFxM2UbRPM9    PatrickJMT Inverse Trigonometric Functions
                         PLANMHOrJaFxN4Ny3jqa    Patrick

# General API requests

I'll need to do a lot of different queries; some of them will be async and some not. I'd like the async part of all this to look the same as the sync part. How should I define the basic objects and their methods?


### What params does each resource have?

I'll be pointing my requests at only a few of the many Youtube *resources*. Each resource takes a set of parameters: some have the same meaning across resources, others keep the same name but change meaning, and others are entirely unique to one resource. The following list of dictionaries describes the params available for each resource.

In [5]:
resource_params = [
    {
     "name":"channels",
     "params" : ["part", "categoryId","forUsername",
                 "hl","id","managedByMe",
                 "maxResults", "mine", "mySubscribers",
                 "onBehalfOfContentOwner","pageToken","fields"
                ]
    },
    {
     "name":"videos",
     "params":["part", "chart", "hl",
               "id", "locale", "maxHeight",
               "maxWidth", "myRating", "onBehalfOfContentOwner",
               "pageToken", "regionCode", "videoCategoryId",
               "fields"
              ]
    },
    {
     "name":"search",
     "params":["part", "channelId", "channelType",
               "eventType", "forContentOwner", "forDeveloper",
               "forMine", "location", "locationRadius",
               "maxResults", "onBehalfOfContentOwner", "order",
               "pageToken", "publishedAfter", "publishedBefore",
               "q", "regionCode", "relatedToVideoId", "relevanceLanguage",
               "safeSearch", "type", "videoCaption", "videoCategoryId",
               "videoDefinition", "videoDimension", "videoDuration",
               "videoEmbeddable", "videoLicense", "videoSyndicated",
               "videoType", "fields"
              ]
    },
    {
     "name":"playlistItems",
     "params":["part","id","maxResults", 
               "onBehalfOfContentOwner","pageToken",
               "playlistId","videoId","fields"
              ]
    },
    {
     "name":"playlists",
     "params":["part", "channelId","hl","id",
               "maxResults", "mine","onBehalfOfContentOwner",
               "onBehalfOfContentOwnerChannel","pageToken","fields"
              ]
    }
]
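
One possible use of this table is to catch a misplaced parameter before a request is sent. The sketch below is an assumption about how the table might be used, not something the notebook does yet: `allowed` and `unknown_params` are hypothetical names, and only two of the entries are copied so that the example runs on its own.

```python
# A trimmed copy of two entries from the table, so the sketch is self-contained.
resource_params = [
    {"name": "playlistItems",
     "params": ["part", "id", "maxResults", "onBehalfOfContentOwner",
                "pageToken", "playlistId", "videoId", "fields"]},
    {"name": "playlists",
     "params": ["part", "channelId", "hl", "id", "maxResults", "mine",
                "onBehalfOfContentOwner", "onBehalfOfContentOwnerChannel",
                "pageToken", "fields"]},
]

# Index by resource name so each lookup is a single dictionary access.
allowed = {entry["name"]: set(entry["params"]) for entry in resource_params}


def unknown_params(resource, params):
    """Return the keys in `params` that `resource` does not list, in order.

    "key" (the API key) accompanies every request but is not in the table,
    so it is skipped here.
    """
    return [key for key, _ in params
            if key != "key" and key not in allowed[resource]]


query_params = [("part", ["contentDetails"]),
                ("playlistId", "UUDWIvJwLJsE4LG1Atne2blQ"),
                ("key", "YOUR_API_KEY")]

print(unknown_params("playlistItems", query_params))
print(unknown_params("playlists", query_params))
```

The second call flags `playlistId`, because the `playlists` resource identifies a playlist through `id` instead.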

## All the functions!

Here are the functions we need to create, modify and update our database of Youtube videos and channels.


In [2]:
import requests
import nbgrequests
import datetime

def send_query(query):
    '''gets API response for a single query'''
    response = requests.get(query)
    return response.json()

def send_async_queries(query_list):
    '''gets API responses for a list of queries asynchronously'''
    return None

#query construction

def htmlify(s):
    """percent-encode characters (currently only commas) that are not allowed raw in a Youtube API query URL"""
    substitutions = {",":"%2C"}
    for a,b in substitutions.items():
        s = s.replace(a,b)
    return s

def write_request(resource, params):
    """take resource name and a list of (key,value) pairs, write a Youtube API request
    
    attributes
    ----------
    resource: string                  name of the desired Youtube resource
    params  : list of key,val pairs   other parameters for the query  
    
    params is encoded as a list of key val pairs because code below needed a predictable order
    in params and dict doesn't have that."""
    
    query = "https://www.googleapis.com/youtube/v3/"    #all queries begin thus
    query = query+"{0}".format(resource)
    
    for i, (key, value) in enumerate(params):
        if isinstance(value, list):                     #special handling of lists
            comma_sep_vals = ",".join(value)
            term = "{0}={1}".format(key,comma_sep_vals)
        elif isinstance(value, datetime.datetime):      #special handling of datetimes, must be RFC3339 format
            term = "{0}={1}".format(key,to_RFC3339(value))
        else:
            term = "{0}={1}".format(key,value)
            
        separator = "?" if i == 0 else "&"              #the first term opens the query string, whatever its key
        query = query+"{0}{1}".format(separator,term)
    return htmlify(query)



###info fetch functions
#build write and read functions for each type of query

def make_query_PL_vid_count(playlistId):
    '''make a query to get the number of videos in a given playlist'''
    
    query_params = [("part",["contentDetails"]),
                    ("id",playlistId),
                    ("maxResults",50),
                    ("key",apikey)
                   ]
    
    s = write_request("playlists", query_params)
    return s

def read_reply_get_PL_vid_count(response):
    '''reads PL_vid_count response and returns number of videos in the playlist'''
    return response["items"][0]["contentDetails"]["itemCount"]


def make_query_Ch_vid_count(channelId):
    '''make a query to get the number of videos on a given channel'''
    
    query_params = [("part",["statistics"]),
                    ("id",channelId),
                    ("maxResults",50),
                    ("key",apikey)
                   ]
#     s = 'https://www.googleapis.com/youtube/v3/channels?part=statistics&id={0}&key={1}'.format(channelId, apikey)
    s = write_request("channels", query_params)
    return s

def read_reply_get_Ch_vid_count(response):
    '''reads Ch_vid_count response and returns number of videos on the channel'''
    return response["items"][0]["statistics"]["videoCount"]

def make_query_Ch_srch_response_count_between_dates(channelId, publishedAfter, publishedBefore):
    '''make a query to get the number of videos on a given channel between two dates, by counting items in results.
    
    attributes
    ----------
    channelId : string       Youtube Channel id
    publishedAfter  : datetime    start of date interval
    publishedBefore : datetime    end of date interval 
    
    Do not trust totalResults : val in response for the true number of hits. This number is unreliable. 
    e.g. #results(fn(chId,d1,d2))+ #results(fn(chId,d2,d3)) =/= #results(fn(chId,d1,d3))'''
    
    query_params = [("part",["snippet"]),
                    ("publishedAfter", publishedAfter),
                    ("publishedBefore", publishedBefore),
                    ("channelId",channelId),
                    ("type","video"),
                    ("order","date"),                      #probably not necessary
                    ("maxResults",50),
                    ("key",apikey)
                   ]
#     s = 'https://www.googleapis.com/youtube/v3/channels?part=statistics&id={0}&key={1}'.format(channelId, apikey)
    s = write_request("search", query_params)
    return s

def read_reply_Ch_srch_response_count_between_dates(response):
    """reads Ch_srch_response_count_between_dates response and gets number of search results"""
    return len(response["items"])    #maxResults for request should be set to 50

def lt_50_vids(channelId, publishedAfter, publishedBefore):
    q = make_query_Ch_srch_response_count_between_dates(
            channelId, publishedAfter, publishedBefore)
    response = send_query(q)
    n = read_reply_Ch_srch_response_count_between_dates(response)
    return n < 50

def make_query_get_Ch_creation_date(channelId):
    '''make a query to get the creation date of a channel'''
    
    query_params = [("part",["snippet"]),
                    ("id",channelId),
                    ("maxResults",50),
                    ("key",apikey)
                   ]
    s = write_request("channels", query_params)
#     s = 'https://www.googleapis.com/youtube/v3/channels?part=snippet&id={0}&key={1}'.format(channelId, apikey)
    return s

def read_reply_get_Ch_creation_date(response):
    '''reads get_Ch_creation_date response and returns creation date of a channel as a datetime obj'''
    return read_str_RFC3339(response["items"][0]["snippet"]["publishedAt"])




#datetime manipulation

def to_RFC3339(datetime_obj):
    """format a datetime object in RFC3339 format with fractional seconds, e.g. 1999-11-20T04:34:11.000000Z"""
    return datetime_obj.strftime("%Y-%m-%dT%H:%M:%S.%fZ")

def read_str_RFC3339(datetime_str):
    """parse an RFC3339 string with fractional seconds, e.g. 2006-11-16T18:22:54.001Z, into a datetime object"""
    return datetime.datetime.strptime(datetime_str,"%Y-%m-%dT%H:%M:%S.%fZ")

def make_time_intervals(stint, n):
    """create n contiguous datetime intervals of equal period.
    
    attributes
    ------------
    stint : tuple     a pair of datetime objects
    n     : integer   the number of time intervals"""
    start, end = stint
    period = (end - start)/n #interval length, type = timedelta
    intervals = [[start+i*period, start+(i+1)*period] for i in range(n)]
    return intervals

def partition_channel_history(channelId, publishedAfter, publishedBefore, m):
    """recursively chop a channel into segments of publishing time that contain at most 50 videos.
    Returns a list of date boundary tuples."""
    
    #-----helper function-----
    #for this next fn, I couldn't figure out a way to pass pairs upwards through layers of 
    #recursion without inadvertently nesting lists many layers deep, so I'll just flatten, 
    #delete dupes and recreate pairs at the end.
    def recur_split(interval, n=2):
        d1,d2 = interval
        if n==1:
            return [d1,d2]
        else:
            if lt_50_vids(channelId, d1, d2): #calls Youtube api
                good_interval = [d1,d2]
                return good_interval
            else:
                new_intervals = make_time_intervals([d1,d2], n)
                A = [recur_split(i) for i in new_intervals]
                return [x for y in A for x in y]            #flatten everything 1 level    
    
    A = recur_split([publishedAfter, publishedBefore], n=m) #flattened list o.t.f [a,b,b,c,c,d,d,e,e,f]
    datetime_segments = [[A[i],A[i+1]] for i in range(0, len(A),2)]
    
    return datetime_segments

#object creation 
def create_new_channels(channelIds):
    """get all channel data required to create a new channel record.
    
    attributes
    -----------
    channelIds : list      list of youtube channelIds"""
    
    def extract_new_channels_data(response):
        new_channels = []
        for channel in response["items"]:
            data = {"metaData":{},"timeSeries":{}}    #initialise channel data

            #metaData content
            data["metaData"]["channelId"] = channel["id"]
            data["metaData"]["channelTitle"] = channel["snippet"]["title"]
            data["metaData"]["publishedAt"] = read_str_RFC3339(channel["snippet"]["publishedAt"])
            data["metaData"]["isMixedContentChannel"] = False #default

            #timeSeries content
            date = str(datetime.datetime.utcnow())    #youtube uses UTC time
            data["timeSeries"][date] = {}             #initialise timeSeries data entry     
            if channel["statistics"]["hiddenSubscriberCount"] is False:
                data["timeSeries"][date]["subscriberCount"] = channel["statistics"]["subscriberCount"]
            
            new_channels.append(data)
        return new_channels

    query_params = [("part",["snippet","statistics"]),
                    ("id",channelIds),
                    ("maxResults",50),
                    ("key",apikey)
                   ]
    s = write_request("channels", query_params)
    r = send_query(s)
    channels_data = extract_new_channels_data(r)
    return channels_data
# CId = "UC4a-Gbdw7vOaccHmFo40b9g" #khan academy
# response = create_new_channels([CId])



q2 = make_query_PL_vid_count("UUDWIvJwLJsE4LG1Atne2blQ")
# q = make_query_get_channel_vid_ids("UCDWIvJwLJsE4LG1Atne2blQ")
r = send_query(q2)

print q2
print read_reply_get_PL_vid_count(r)
# print q

https://www.googleapis.com/youtube/v3/playlists?part=contentDetails&id=UUDWIvJwLJsE4LG1Atne2blQ&maxResults=50&key={YOUR_API_KEY}
292
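
For comparison, the same kind of URL can be assembled with the standard library's `urlencode`, which handles the percent-encoding that `htmlify` does by hand. This is a sketch of an alternative, not a replacement for `write_request`: `build_url` is a hypothetical name, the key is a placeholder, and datetime values are not handled here.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

BASE_URL = "https://www.googleapis.com/youtube/v3/"


def build_url(resource, params):
    """Build a request URL from a resource name and a list of (key, value) pairs."""
    # Lists become one comma-separated value, as in write_request.
    flat = [(key, ",".join(value) if isinstance(value, (list, tuple)) else value)
            for key, value in params]
    return BASE_URL + resource + "?" + urlencode(flat)


print(build_url("videos", [("part", ["snippet", "statistics"]),
                           ("maxResults", 50),
                           ("key", "YOUR_API_KEY")]))
```

One difference to be aware of: `urlencode` also encodes characters such as `:` (as `%3A`), so timestamps would look different from those produced by `write_request`, even though they decode to the same value.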


In [58]:
import datetime

str(datetime.MAXYEAR)

d = datetime.datetime.utcnow() #youtube uses UTC time
k = datetime.datetime(1999,11,20,4,34,11)
k2 =  datetime.datetime(1989,11,20,4,34,11)
# print d
# print k
# print k.strftime('%Y-%m-%dT%H:%M:%SZ')

print dt_obj

#create evenly spaced dates from a given range

   
k1 = datetime.datetime(2000,1,1,0,0,0)
k2 =  datetime.datetime(2000,1,11,0,0,0)
for i in make_time_intervals([k1,k2],2):
    print i


# q = make_query_Ch_vid_count("UCDWIvJwLJsE4LG1Atne2blQ")
# r = send_query(q)
# print read_reply_get_Ch_vid_count(r)
# q = make_query_get_Ch_creation_date("UCDWIvJwLJsE4LG1Atne2blQ")
# r = send_query(q)
# print read_reply_get_Ch_creation_date(r)

2011-04-29 18:22:54
[datetime.datetime(2000, 1, 1, 0, 0), datetime.datetime(2000, 1, 6, 0, 0)]
[datetime.datetime(2000, 1, 6, 0, 0), datetime.datetime(2000, 1, 11, 0, 0)]
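
The `%f` directive used in `read_str_RFC3339` only matches timestamps that carry fractional seconds. A tolerant variant might try both layouts, as sketched below. The names `format_rfc3339` and `parse_rfc3339` are hypothetical, and the sketch assumes every timestamp is in UTC with a literal `Z` suffix.

```python
import datetime

# Layouts to try, most specific first.
FORMATS = ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ")


def format_rfc3339(dt):
    """Format a naive UTC datetime to whole seconds, e.g. 1999-11-20T04:34:11Z."""
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


def parse_rfc3339(text):
    """Parse a UTC timestamp with or without fractional seconds."""
    for fmt in FORMATS:
        try:
            return datetime.datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError("not a recognised timestamp: {0!r}".format(text))


k = datetime.datetime(1999, 11, 20, 4, 34, 11)
print(format_rfc3339(k))
print(parse_rfc3339("2006-11-16T18:22:54.001Z"))
print(parse_rfc3339(format_rfc3339(k)) == k)
```

Formatting to whole seconds drops any microseconds, so the round trip is exact only for datetimes that have none.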


Proposed storage layout, per video:
- key is the video id
- value is {metaData, timeSeries}
- metaData is {name, channel, playlists, duration, maybe a bit more stuff}
- timeSeries is another array of {time, stats}; by array I mean a dictionary where the key is the time and the value is the stats.
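
A rough sketch of what one such record could look like in Python, and how a new snapshot would be added to it. The helper names, the title, the duration and the view counts are all made up for illustration; only the overall shape (id as key, `metaData` plus a time-keyed `timeSeries`) comes from the notes above.

```python
import datetime
import json


def new_video_record(name, channel, duration):
    """Create an empty record: static metaData plus a time-keyed timeSeries."""
    return {"metaData": {"name": name, "channel": channel,
                         "playlists": [], "duration": duration},
            "timeSeries": {}}


def add_snapshot(record, stats, when):
    """Store one snapshot of stats under its timestamp (as a string, so it survives JSON)."""
    record["timeSeries"][str(when)] = stats
    return record


videos = {}
vid = "0a799xooy-w"
videos[vid] = new_video_record("example title", "UCDWIvJwLJsE4LG1Atne2blQ", "PT5M")
add_snapshot(videos[vid], {"viewCount": 100}, datetime.datetime(2017, 1, 1))
add_snapshot(videos[vid], {"viewCount": 150}, datetime.datetime(2017, 1, 2))

print(json.dumps(videos, indent=2, sort_keys=True))
```

Because the timestamps are stored as strings in a fixed layout, sorting the `timeSeries` keys also sorts the snapshots chronologically.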

In [2]:

#https://www.googleapis.com/youtube/v3/search?part=snippet
#&channelId=UCDWIvJwLJsE4LG1Atne2blQ&type=video&maxResults=50
#&key={YOUR_API_KEY}

#example 2
#https://www.googleapis.com/youtube/v3/videos?part=snippet%2Cstatistics
#&id=0a799xooy-w%2Cyb5DH9y-rB8%2CQh8hO9j76R4%2CIJ-obdnR_j8%2CPE8NNZG9IYw%2CcSw5R-jdMiI%2CB0A2lDzn3yw
#&maxResults=50&key={YOUR_API_KEY}



     
query_params = [("part","snippet"),
                ("channelId" ,"UCDWIvJwLJsE4LG1Atne2blQ"),
                ("type","video"),
                ("maxResults",50),
                ("key",apikey)
               ]
print "example 1"
print str(write_request("search",query_params))
    
    
query_params = [("part",["snippet", "statistics"]),
                ("id" , ["0a799xooy-w","yb5DH9y-rB8","Qh8hO9j76R4","IJ-obdnR_j8","PE8NNZG9IYw","cSw5R-jdMiI","B0A2lDzn3yw"]),
                ("maxResults",50),
                ("key",apikey)
               ]
print "example 2"
print write_request("videos",query_params)



example 1


NameError: name 'write_request' is not defined

In [24]:
CId = "UC4a-Gbdw7vOaccHmFo40b9g" #Khan Academy
d1 = datetime.datetime.strptime("2006-11-16T18:22:54.001Z","%Y-%m-%dT%H:%M:%S.%fZ") #approx creation date khan academy
d2 = datetime.datetime.utcnow()

m = int(read_reply_get_Ch_vid_count(    
            send_query(
                make_query_Ch_vid_count(
                    CId))))//50  

history =  partition_channel_history(CId,d1,d2, m)
for i in history:
    print i

[datetime.datetime(2006, 11, 16, 18, 22, 54), datetime.datetime(2006, 12, 16, 23, 18, 23, 611014)]
[datetime.datetime(2006, 12, 16, 23, 18, 23, 611014), datetime.datetime(2007, 1, 16, 4, 13, 53, 222028)]
[datetime.datetime(2007, 1, 16, 4, 13, 53, 222028), datetime.datetime(2007, 2, 15, 9, 9, 22, 833042)]
[datetime.datetime(2007, 2, 15, 9, 9, 22, 833042), datetime.datetime(2007, 3, 17, 14, 4, 52, 444056)]
[datetime.datetime(2007, 3, 17, 14, 4, 52, 444056), datetime.datetime(2007, 4, 16, 19, 0, 22, 55070)]
[datetime.datetime(2007, 4, 16, 19, 0, 22, 55070), datetime.datetime(2007, 5, 16, 23, 55, 51, 666084)]
[datetime.datetime(2007, 5, 16, 23, 55, 51, 666084), datetime.datetime(2007, 6, 16, 4, 51, 21, 277098)]
[datetime.datetime(2007, 6, 16, 4, 51, 21, 277098), datetime.datetime(2007, 7, 16, 9, 46, 50, 888112)]
[datetime.datetime(2007, 7, 16, 9, 46, 50, 888112), datetime.datetime(2007, 8, 15, 14, 42, 20, 499126)]
[datetime.datetime(2007, 8, 15, 14, 42, 20, 499126), datetime.datetime(2007,

In [25]:
len(history)

271
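
The recursive subdivision can also return the pairs directly, which avoids the flatten-and-rebuild step in `partition_channel_history`. The sketch below is a simplified variant (plain bisection rather than an initial m-way split), run against made-up upload times instead of the API. `partition` and `count_between` are hypothetical names, `count_between` stands in for the search query behind `lt_50_vids`, and treating each interval as half-open is an assumption about how boundaries should be handled.

```python
import datetime


def partition(start, end, count_between, limit=50):
    """Split [start, end) until every piece holds fewer than `limit` items.

    Returns a list of (start, end) pairs in chronological order.
    """
    if count_between(start, end) < limit:
        return [(start, end)]
    mid = start + (end - start) // 2
    if mid == start:                       # interval can no longer be divided
        return [(start, end)]
    # Concatenating the two halves keeps the result flat: no nesting to undo.
    return (partition(start, mid, count_between, limit)
            + partition(mid, end, count_between, limit))


# Made-up data: one upload every six hours, 200 uploads in total.
first = datetime.datetime(2007, 1, 1)
last = datetime.datetime(2007, 3, 1)
uploads = [first + datetime.timedelta(hours=6 * i) for i in range(200)]


def count_between(start, end):
    """Stand-in for the API: count uploads with start <= time < end."""
    return sum(1 for t in uploads if start <= t < end)


pieces = partition(first, last, count_between)
for a, b in pieces:
    print("{0} -> {1}: {2} uploads".format(a, b, count_between(a, b)))
```

Each level returns a flat list and the parent concatenates two flat lists, so the pairs never nest however deep the recursion goes.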

In [50]:
def save_history(channelId, history):
    """save a partitioning of a channel's upload history to file
    
    attributes
    ----------
    channelId : (string)        Youtube channelId
    history   : (list of pairs of datetime objects)   
        a partitioning of channel's publishing history into <50 video increments of time """
    
    #make sure all datetimes have identical formats as strings, i.e. microseconds always present
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    formatter = lambda date : datetime.datetime.strftime(date,fmt)
    history_formatted = [[formatter(a),formatter(b)] for a,b in history] 
    
    #format as json
    history_json = json.dumps(history_formatted)
    
    filename = "{0}_partition.txt".format(channelId)
    with open(filename,"w") as f:
        f.write(history_json)
    print "{0} written to cwd".format(filename)

    
def load_history(filename):
    """load a channel's upload history"""
    with open(filename,"rb") as f:
        data = json.loads(f.read())
        fmt = "%Y-%m-%d %H:%M:%S.%f"   #datetime format
        p = lambda date: datetime.datetime.strptime(date,fmt) #date formatter
        formatted_data = [[p(a),p(b)] for a,b in data]
        
        return formatted_data

save_history(CId, history)
recovered_history = load_history("UC4a-Gbdw7vOaccHmFo40b9g_partition.txt")   

print recovered_history

UC4a-Gbdw7vOaccHmFo40b9g_partition.txt written to cwd
[[datetime.datetime(2006, 11, 16, 18, 22, 54), datetime.datetime(2006, 12, 16, 23, 18, 23, 611014)], [datetime.datetime(2006, 12, 16, 23, 18, 23, 611014), datetime.datetime(2007, 1, 16, 4, 13, 53, 222028)], [datetime.datetime(2007, 1, 16, 4, 13, 53, 222028), datetime.datetime(2007, 2, 15, 9, 9, 22, 833042)], [datetime.datetime(2007, 2, 15, 9, 9, 22, 833042), datetime.datetime(2007, 3, 17, 14, 4, 52, 444056)], [datetime.datetime(2007, 3, 17, 14, 4, 52, 444056), datetime.datetime(2007, 4, 16, 19, 0, 22, 55070)], [datetime.datetime(2007, 4, 16, 19, 0, 22, 55070), datetime.datetime(2007, 5, 16, 23, 55, 51, 666084)], [datetime.datetime(2007, 5, 16, 23, 55, 51, 666084), datetime.datetime(2007, 6, 16, 4, 51, 21, 277098)], [datetime.datetime(2007, 6, 16, 4, 51, 21, 277098), datetime.datetime(2007, 7, 16, 9, 46, 50, 888112)], [datetime.datetime(2007, 7, 16, 9, 46, 50, 888112), datetime.datetime(2007, 8, 15, 14, 42, 20, 499126)], [datetime.da

# Async get all videos from channel!

In [67]:
import nbgrequests as grequests   #changed one tiny boolean

def async_fetch(urls):                                  #named urls to avoid shadowing the requests module
    unsent_requests = [grequests.get(u) for u in urls]
    responses = grequests.map(unsent_requests)
    status_codes = [response.status_code for response in responses]
    print status_codes
    results = [response.json() for response in responses]
    return results


def make_query_Ch_get_vids_between_dates(channelId, publishedAfter, publishedBefore):
    '''make a query to get the videos on a given channel between two dates.
    
    attributes
    ----------
    channelId : string       Youtube Channel id
    publishedAfter  : datetime    start of date interval
    publishedBefore : datetime    end of date interval 
    
    Do not trust totalResults : val in response for the true number of hits. This number is unreliable. 
    e.g. #results(fn(chId,d1,d2))+ #results(fn(chId,d2,d3)) =/= #results(fn(chId,d1,d3))'''
    
    query_params = [("part",["snippet"]),
                    ("publishedAfter", publishedAfter),
                    ("publishedBefore", publishedBefore),
                    ("channelId",channelId),
                    ("type","video"),
                    ("maxResults",50),
                    ("key",apikey)
                   ]
#     s = 'https://www.googleapis.com/youtube/v3/channels?part=statistics&id={0}&key={1}'.format(channelId, apikey)
    s = write_request("search", query_params)
    return s

CId = "UC4a-Gbdw7vOaccHmFo40b9g"  # Khan Academy
dates = [[datetime.datetime(2007, 10, 15, 0, 33, 19, 721154), datetime.datetime(2007, 10, 30, 3, 1, 4, 526661)], 
         [datetime.datetime(2007, 10, 30, 3, 1, 4, 526661), datetime.datetime(2007, 11, 14, 5, 28, 49, 332168)], 
         [datetime.datetime(2007, 11, 14, 5, 28, 49, 332168), datetime.datetime(2007, 11, 21, 18, 42, 41, 734921)], 
         [datetime.datetime(2007, 11, 21, 18, 42, 41, 734921), datetime.datetime(2007, 11, 25, 13, 19, 37, 936297)], 
         [datetime.datetime(2007, 11, 25, 13, 19, 37, 936297), datetime.datetime(2007, 11, 29, 7, 56, 34, 137673)], 
         [datetime.datetime(2007, 11, 29, 7, 56, 34, 137675), datetime.datetime(2007, 12, 14, 10, 24, 18, 943182)], 
         [datetime.datetime(2007, 12, 14, 10, 24, 18, 943182), datetime.datetime(2008, 1, 13, 15, 19, 48, 554196)], 
         [datetime.datetime(2008, 1, 13, 15, 19, 48, 554196), datetime.datetime(2008, 2, 12, 20, 15, 18, 165210)], 
         [datetime.datetime(2008, 2, 12, 20, 15, 18, 165210), datetime.datetime(2008, 3, 14, 1, 10, 47, 776224)], 
         [datetime.datetime(2008, 3, 14, 1, 10, 47, 776224), datetime.datetime(2008, 4, 13, 6, 6, 17, 387238)], 
         [datetime.datetime(2008, 4, 13, 6, 6, 17, 387238), datetime.datetime(2008, 4, 28, 8, 34, 2, 192745)], 
         [datetime.datetime(2008, 4, 28, 8, 34, 2, 192745), datetime.datetime(2008, 5, 13, 11, 1, 46, 998252)], 
         [datetime.datetime(2008, 5, 13, 11, 1, 46, 998252), datetime.datetime(2008, 6, 12, 15, 57, 16, 609266)], 
         [datetime.datetime(2008, 6, 12, 15, 57, 16, 609266), datetime.datetime(2008, 7, 12, 20, 52, 46, 220280)], 
         [datetime.datetime(2008, 7, 12, 20, 52, 46, 220280), datetime.datetime(2008, 8, 12, 1, 48, 15, 831294)], 
         [datetime.datetime(2008, 8, 12, 1, 48, 15, 831294), datetime.datetime(2008, 8, 27, 4, 16, 0, 636801)], 
         [datetime.datetime(2008, 8, 27, 4, 16, 0, 636801), datetime.datetime(2008, 9, 11, 6, 43, 45, 442308)]]

requests = [make_query_Ch_get_vids_between_dates(CId,d1,d2) for d1,d2 in dates]

responses = async_fetch(requests)


[200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200]
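
The `write_request` helper used in the cell above is defined elsewhere in the notebook. As a rough sketch of what such a query looks like on the wire, the following assembles a Search URL with RFC 3339 timestamps. The names `to_rfc3339` and `build_search_url` are hypothetical, the datetimes are assumed to be naive UTC values, and microseconds are dropped in the formatting; this is not necessarily how `write_request` behaves.

```python
import datetime
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

API_BASE = "https://www.googleapis.com/youtube/v3/"

def to_rfc3339(dt):
    """Format a naive datetime (assumed UTC) as an RFC 3339 string.
    Microseconds are discarded."""
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

def build_search_url(channel_id, published_after, published_before, api_key):
    """Hypothetical stand-in for write_request("search", ...)."""
    params = [
        ("part", "snippet"),
        ("publishedAfter", to_rfc3339(published_after)),
        ("publishedBefore", to_rfc3339(published_before)),
        ("channelId", channel_id),
        ("type", "video"),
        ("maxResults", 50),
        ("key", api_key),
    ]
    # urlencode percent-escapes the colons in the timestamps.
    return API_BASE + "search?" + urlencode(params)

url = build_search_url(
    "UC4a-Gbdw7vOaccHmFo40b9g",
    datetime.datetime(2007, 10, 15, 0, 33, 19, 721154),
    datetime.datetime(2007, 10, 30, 3, 1, 4, 526661),
    "YOUR_API_KEY",   # placeholder, not a real key
)
print(url)
```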


In [74]:
for r in responses:
    for item in r["items"]:
        print item["id"]["videoId"]

xRspb-iev-g
OLzXqIqZZz0
Pra6r20geXU
ouYZiIh8Ctc
0RdI3-8G4Fs
mHvSYRUEWnE
VJ9VRUDQyK8
xmgk8_l3lig
77-najNh4iY
zw0waJCEc-w
ZWSoyUxAQW0
V3-xCPDzQ1Q
F-OsMq7QKEQ
Zyq6TmQVBxk
7wUHJ7JQ-gs
11Bt6OhIeqA
JXCiFbEMTZ4
CmXmRNFrtFw
6PaFm_Je5A0
zbZyiyzMUQ8
RcSadoSQhdA
cKOtT4WnZb4
8wZugqi_uCg
dlpmllTx5MY
pGaDcOMdw48
FP2arCfAfBY
15zliAL4llE
1vamogV81Y8
5GRlLD7M430
ko-cYG3d6ec
BI-rtfZVXy0
-W3RkgvLrGI
Y5cSGxdDHz4
2439OIVBgPg
emdHj6WodLw
yEAxG_D1HDw
o2ZrX9MbIwU
OKXyKt40WFE
kqU_ymV581c
zrqzG6xKa1A
-uAfg0t6NmM
NLg6hfoKKlE
qO2cTx6DwCA
8PXJ3hHPxlo
Ij8AotZHfzU
oDcPHWX2Nv0
cEOxZWGp-8E
4CNnPgabrLE
bl2DvFn8LjM
uHwKV2NzDno
enmHaVxLfAE
PupNgv49_WY
TMmxKZaCqe0
tP9bocr_C2I
5tptL-SjfHY
MFAuLptYXFE
r_MPl6c23cc
I1CPivr1Rqs
Pj-dYWwdlDA
THu1yyU350A
zYltrrFpuRU
-UtUjx4nj-E
uFfMl-wGOqA
281GDzKgNIw
qCFS9sM3U4w
fX0ZVflYqf0
BRX5mWU0pKo
VUY_9-dl9Ro
QWtZ9jpN_3k
D4UfrKzUVz8
j-G31l9tETk
rbt51hXmzig
EWsVUf6ZFgw
eZuR5-Jng0o
oVkzin26KJk
4mUAiRKIhj0
MuhPEK5_kog
EDEg7SY2-VU
jG_R8MyQ53U
7rt8X3bIhf4
pCpLtKdMjSE
YFV6QFVxxPc
n30T7Uc6IOg
oRWM
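
Adjacent date intervals share an endpoint, so it seems prudent to guard against the same video being returned by two queries. A small sketch that gathers the ids without duplicates while keeping first-seen order, assuming the response layout used in the loop above; `collect_video_ids` is a hypothetical helper and the two responses below are hand-made stand-ins, not real API output:

```python
def collect_video_ids(responses):
    """Return the video ids found in a list of search responses,
    in first-seen order and without duplicates."""
    seen = set()
    ordered = []
    for response in responses:
        for item in response.get("items", []):
            video_id = item.get("id", {}).get("videoId")
            if video_id is not None and video_id not in seen:
                seen.add(video_id)
                ordered.append(video_id)
    return ordered

# Hand-made responses that reuse two ids from the listing above.
fake_responses = [
    {"items": [{"id": {"videoId": "xRspb-iev-g"}},
               {"id": {"videoId": "OLzXqIqZZz0"}}]},
    {"items": [{"id": {"videoId": "OLzXqIqZZz0"}}]},
]
print(collect_video_ids(fake_responses))   # prints ['xRspb-iev-g', 'OLzXqIqZZz0']
```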

In [81]:
print responses[0]
for i in extract_new_video_data(responses[0]):
    print i


{u'nextPageToken': u'CDIQAA', u'kind': u'youtube#searchListResponse', u'items': [{u'snippet': {u'thumbnails': {u'default': {u'url': u'https://i.ytimg.com/vi/xRspb-iev-g/default.jpg', u'width': 120, u'height': 90}, u'high': {u'url': u'https://i.ytimg.com/vi/xRspb-iev-g/hqdefault.jpg', u'width': 480, u'height': 360}, u'medium': {u'url': u'https://i.ytimg.com/vi/xRspb-iev-g/mqdefault.jpg', u'width': 320, u'height': 180}}, u'title': u'The Indefinite Integral or Anti-derivative', u'channelId': u'UC4a-Gbdw7vOaccHmFo40b9g', u'publishedAt': u'2007-10-19T00:23:08.000Z', u'liveBroadcastContent': u'none', u'channelTitle': u'Khan Academy', u'description': u'An introduction to indefinite integration of polynomials.'}, u'kind': u'youtube#searchResult', u'etag': u'"DuHzAJ-eQIiCIp7p4ldoVcVAOeY/AIIBjT6Wy_S56DTIBcbnzc9hev4"', u'id': {u'kind': u'youtube#video', u'videoId': u'xRspb-iev-g'}}, {u'snippet': {u'thumbnails': {u'default': {u'url': u'https://i.ytimg.com/vi/OLzXqIqZZz0/default.jpg', u'width': 120
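
The response above carries a `nextPageToken`, which suggests that this date interval holds more results than fit in a single page of 50. A sketch for flagging such intervals so that they can be paged through or split further; the helper name `intervals_needing_more_pages` is hypothetical, and it assumes the responses come back in the same order as the intervals that produced them:

```python
def intervals_needing_more_pages(dates, responses):
    """Return the date intervals whose response advertises a further page."""
    return [interval
            for interval, response in zip(dates, responses)
            if "nextPageToken" in response]

# Hand-made example: only the first interval has a further page.
example_dates = [("d1", "d2"), ("d2", "d3")]
example_responses = [
    {"nextPageToken": "CDIQAA", "items": []},
    {"items": []},
]
print(intervals_needing_more_pages(example_dates, example_responses))
# prints [('d1', 'd2')]
```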

In [79]:
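def extract_new_video_data(response):
    '''Turn one search response into a list of fresh video records.

    Each record has a filled "metaData" dict and an empty "timeSeries" dict.'''
    new_videos = []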
    for video in response["items"]:
        data = {"metaData":{},"timeSeries":{}}    #initialise video data

        #metaData content
        data["metaData"]["videoId"] = video["id"]["videoId"]
        data["metaData"]["videoName"] = video["snippet"]["title"]
        data["metaData"]["channelId"] = video["snippet"]["channelId"]
        data["metaData"]["channelTitle"] = video["snippet"]["channelTitle"]
        data["metaData"]["publishedAt"] = read_str_RFC3339(video["snippet"]["publishedAt"])
        data["metaData"]["description"] = video["snippet"]["description"]

        #no timeSeries content at creation

        new_videos.append(data)
    return new_videos
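
The function above relies on `read_str_RFC3339`, which is defined elsewhere in the notebook. A minimal sketch of one way such a parser could work for timestamps of the form seen in the response (`2007-10-19T00:23:08.000Z`), together with the shape of the record it feeds. The name `parse_rfc3339_sketch` is hypothetical, and the format string only covers UTC timestamps with fractional seconds and a trailing `Z`:

```python
import datetime

def parse_rfc3339_sketch(text):
    """Hypothetical stand-in for read_str_RFC3339.
    Handles only strings like '2007-10-19T00:23:08.000Z'."""
    return datetime.datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")

published = parse_rfc3339_sketch("2007-10-19T00:23:08.000Z")

# Shape of one record as built by extract_new_video_data (abridged).
record = {
    "metaData": {"videoId": "xRspb-iev-g", "publishedAt": published},
    "timeSeries": {},   # empty at creation
}
print(record["metaData"]["publishedAt"])   # prints 2007-10-19 00:23:08
```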