
In this notebook we document a first attempt at using Google's YouTube Data API. We will try to obtain some simple attributes for a target video on YouTube, e.g. view count, video name, thumbs up, and channel name.


YouTube has a getting-started page for developers at:
https://developers.google.com/youtube/v3/getting-started

There's a decent tutorial for getting some YouTube data here:
https://medium.com/greyatom/youtube-data-in-python-6147160c5833



We'll start by installing the right packages for Python:

Input into cmd: conda install -c conda-forge google-api-python-client

You might need to restart the ipynb kernel or rerun the conda command to make sure everything actually got installed.

This should get the Google API client packages and their dependencies.
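
A quick way to confirm the install before running anything else is to ask Python whether it can locate the packages. This is a minimal sketch: the helper name `is_installed` is made up for illustration, and it assumes `googleapiclient` is the import name provided by the `google-api-python-client` package.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Report on the two packages this notebook relies on.
for name in ("googleapiclient", "pandas"):
    print(name, "found" if is_installed(name) else "MISSING")
```

If either line reports MISSING, rerunning the conda command and restarting the kernel is the first thing to try.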


Here's an example request that performs a YouTube search, taken from the above tutorial.
(There are some things I'd do differently if I rewrote it: I'd make it clear whether we've retrieved everything from this query, and I'd also cast the variables to their types as they are read from the response.)

In [None]:
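from googleapiclient.discovery import build  # "apiclient" is an older alias of this package
# from googleapiclient.errors import HttpError
# from oauth2client.tools import argparser
import pandas as pd
import pprint
import matplotlib.pyplot as plt
import os

# The key file lives one directory above the notebook's working directory.
apikey_path = os.path.join(os.getcwd(), os.pardir, "apikey.txt")

# Read the key as text and strip the trailing newline.
with open(apikey_path, "r") as f:
    apikey = f.readline().strip()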

DEVELOPER_KEY = apikey
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"

def youtube_search(q, max_results=50,order="relevance", token=None, location=None, location_radius=None):

    youtube = build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION,developerKey=DEVELOPER_KEY)

    search_response = youtube.search().list(
    q=q,
    type="video",
    pageToken=token,
    order = order,
    part="id,snippet", # Part signifies the different types of data you want 
    maxResults=max_results,
    location=location,
    locationRadius=location_radius).execute()

    title = []
    channelId = []
    channelTitle = []
    categoryId = []
    videoId = []
    viewCount = []
    likeCount = []
    dislikeCount = []
    commentCount = []
    favoriteCount = []
    category = []
    tags = []
    videos = []
    
    
    for search_result in search_response.get("items", []):
        if search_result["id"]["kind"] == "youtube#video":

            title.append(search_result['snippet']['title'])
            videoId.append(search_result['id']['videoId'])

            # One extra request per result to fetch its statistics and snippet.
            response = youtube.videos().list(
                part='statistics,snippet',
                id=search_result['id']['videoId']).execute()

            snippet = response['items'][0]['snippet']
            statistics = response['items'][0]['statistics']

            channelId.append(snippet['channelId'])
            channelTitle.append(snippet['channelTitle'])
            categoryId.append(snippet['categoryId'])

            # Counts can be absent (hidden likes, disabled comments), so use
            # .get() and store None instead of raising KeyError. dislikeCount
            # in particular may be missing: YouTube made dislike counts
            # private in December 2021.
            favoriteCount.append(statistics.get('favoriteCount'))
            viewCount.append(statistics.get('viewCount'))
            likeCount.append(statistics.get('likeCount'))
            dislikeCount.append(statistics.get('dislikeCount'))
            commentCount.append(statistics.get('commentCount'))

            # Tags are optional; keep an empty list when there are none.
            tags.append(snippet.get('tags', []))
    
    
    youtube_dict = {'tags':tags,'channelId': channelId,'channelTitle': channelTitle,'categoryId':categoryId,'title':title,'videoId':videoId,'viewCount':viewCount,'likeCount':likeCount,'dislikeCount':dislikeCount,'commentCount':commentCount,'favoriteCount':favoriteCount}

    return youtube_dict
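
One of the changes mentioned earlier, making it clear whether everything has been retrieved from a query, comes down to following the `nextPageToken` field that a search response carries when more results exist. The sketch below is not part of the tutorial: `collect_all_pages` and `FAKE_PAGES` are hypothetical names, and a dictionary of fake pages stands in for the API so the example runs offline.

```python
def collect_all_pages(fetch_page, max_pages=5):
    """Call fetch_page(token) repeatedly, following nextPageToken.

    Returns (items, exhausted). exhausted is True only when the last
    response carried no nextPageToken, i.e. nothing is left to fetch.
    """
    items, token, exhausted = [], None, False
    for _ in range(max_pages):
        response = fetch_page(token)
        items.extend(response.get("items", []))
        token = response.get("nextPageToken")
        if not token:
            exhausted = True
            break
    return items, exhausted


# Stand-in for the API so the sketch runs offline: three fake pages.
FAKE_PAGES = {
    None: {"items": ["a", "b"], "nextPageToken": "p2"},
    "p2": {"items": ["c", "d"], "nextPageToken": "p3"},
    "p3": {"items": ["e"]},
}

items, exhausted = collect_all_pages(lambda token: FAKE_PAGES[token])
print(items, exhausted)

# With the real client, fetch_page could look like this (not run here):
# fetch_page = lambda token: youtube.search().list(
#     q="h3h3", type="video", part="id", maxResults=50, pageToken=token
# ).execute()
```

The `max_pages` cap is a design choice to keep API quota use bounded; when it is hit, `exhausted` comes back False, so the caller knows the result list is partial.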

Using this function we can get the first 50 results for a search:

In [1]:
A = youtube_search("h3h3")

A now contains the top 50 results for the search term "h3h3", which should return the work of Ethan Klein, a popular YouTube video blogger. By construction A is a dictionary of lists, so it is easy to load into a pandas DataFrame.
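
The search function makes one `videos().list` request for every result, so 50 results cost 50 extra requests. The `id` parameter of `videos().list` also accepts a comma-separated list of video ids, so the lookups can be grouped. The sketch below assumes a limit of 50 ids per request; `chunk_ids` and `fetch_video_items` are hypothetical helpers, and only the chunking part is exercised here since the second function needs a live client.

```python
def chunk_ids(video_ids, size=50):
    """Split a list of ids into comma-joined groups of at most `size`."""
    return [",".join(video_ids[i:i + size])
            for i in range(0, len(video_ids), size)]


def fetch_video_items(youtube, video_ids):
    """Fetch statistics and snippet for many videos with few requests.

    `youtube` is assumed to be the client returned by build().
    """
    items = []
    for id_group in chunk_ids(video_ids):
        response = youtube.videos().list(
            part="statistics,snippet", id=id_group).execute()
        items.extend(response.get("items", []))
    return items


# Offline demonstration of the chunking with 120 made-up ids.
ids = ["id%03d" % n for n in range(120)]
groups = chunk_ids(ids)
print(len(groups), [g.count(",") + 1 for g in groups])
```

Because the response may not contain an item for every requested id (deleted or private videos, for example), it is safer to match results back to the search hits by each item's `id` field than by position.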

In [41]:
import pandas as pd
import pprint 
import matplotlib.pyplot as plt

Adf = pd.DataFrame(A)
Adf["viewCount"] = pd.to_numeric(Adf["viewCount"])
Adf["likeCount"] = pd.to_numeric(Adf["likeCount"])
Adf["dislikeCount"] = pd.to_numeric(Adf["dislikeCount"])
Adf["ControversyIndex"] = Adf["dislikeCount"]/(Adf["dislikeCount"]+Adf["likeCount"])

The values come back from the API as strings, which is why the count columns are converted with pd.to_numeric above (ideally, we would cast the types while reading the response).
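
Casting while reading the response could look roughly like the sketch below. The helper names are made up, and `sample_item` is a hand-written stand-in for one entry of a `videos().list` response, using counts from the first row of the table further down and deliberately leaving out `dislikeCount` to show the missing-value path.

```python
def to_int_or_none(value):
    """Convert a numeric string from the API to int; None if absent."""
    return int(value) if value is not None else None


def parse_statistics(item):
    """Pull typed counts out of one video item, tolerating missing keys."""
    stats = item.get("statistics", {})
    keys = ("viewCount", "likeCount", "dislikeCount", "commentCount")
    return {key: to_int_or_none(stats.get(key)) for key in keys}


# Hand-written stand-in for a single response item (not real API output).
sample_item = {
    "statistics": {
        "viewCount": "592609",
        "likeCount": "16613",
        "commentCount": "5358",
    }
}

parsed = parse_statistics(sample_item)
print(parsed)
```

Missing counts become None here, which pandas turns into NaN when the column is numeric, so later arithmetic does not fail on them.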

In [43]:
Adf_VC_sort = Adf.sort_values(by=['viewCount'], ascending=False)
Adf_Controv_sort = Adf.sort_values(by=["ControversyIndex"], ascending=False)

Adf_Controv_sort

index,categoryId,channelId,channelTitle,commentCount,dislikeCount,favoriteCount,likeCount,tags,title,videoId,viewCount,ControversyIndex
49,22,UCZZHPXsg6LopvdOKF7qM6cQ,H3 Podcast Highlights,5358,1522,0,16613,"[h3 podcast, podcast, h3h3, h3h3productions, e...",H3H3 On Elon Musk Haters,S-BA4PRUVMU,592609,0.083926
20,22,UCZZHPXsg6LopvdOKF7qM6cQ,H3 Podcast Highlights,7015,1520,0,21781,"[h3 podcast, podcast, h3h3, h3h3productions, e...",H3H3 On the Biggest Idiot On Youtube,MgLO_ojVcJ8,976344,0.065233
8,22,UCZZHPXsg6LopvdOKF7qM6cQ,H3 Podcast Highlights,10514,5512,0,79292,"[h3 podcast, podcast, h3h3, h3h3productions, e...",H3H3 Slams Ellen DeGeneres,YuZxT89-GDk,2682618,0.064997
41,24,UCLtREJY21xRfCuEKvdki1Kw,H3 Podcast,11694,4194,0,60646,"[jontron, jon tron, jon, tron, jon jafari, jaf...",H3 Podcast #41 - JonTron,irVe0wQkMXg,2043603,0.064682
3,24,UCDWIvJwLJsE4LG1Atne2blQ,h3h3Productions,10848,8686,0,143769,"[facebook, mark zuckerberg, ethans corner, h3h...",Ethan's Corner - Facebook,vb97DrNDSfM,1759749,0.056974
28,24,UCDWIvJwLJsE4LG1Atne2blQ,h3h3Productions,17575,13540,0,248347,"[human mail challenge, human, mail, challenge,...",The Human Mail Challenge is Stupid,XvYDPHEJ9Rg,5089338,0.051702
12,22,UCZZHPXsg6LopvdOKF7qM6cQ,H3 Podcast Highlights,6276,1440,0,27807,"[h3 podcast, podcast, h3h3, h3h3productions, e...",H3H3 On Logan Paul Demonetization,386gkrAKKC4,1152948,0.049236
10,22,UCZZHPXsg6LopvdOKF7qM6cQ,H3 Podcast Highlights,2397,835,0,18267,"[h3 podcast, podcast, h3h3, h3h3productions, e...",H3H3 On Deepfakes,HqjKn0iC6l8,1146329,0.043713
25,24,UCLtREJY21xRfCuEKvdki1Kw,H3 Podcast,5515,1050,0,23050,"[h3 podcast, h3h3 podcast, h3h3, h3h3productio...",H3 Podcast #53 - Female Teacher Sleeps w Stude...,sUyR_nz-JsA,1115107,0.043568
13,24,UCDWIvJwLJsE4LG1Atne2blQ,h3h3Productions,39106,18780,0,413395,"[net neutrality, net, neutrality, ajit pai, aj...",It's Time To Stop Ajit Pai,5uXsCaakZD8,5732701,0.043455


We invented a metric called Controversy Index, which is the ratio of thumbs down to all thumbs (thumbs up plus thumbs down).
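
Written out as a standalone function, the metric is dislikes / (likes + dislikes). The sketch below is only an illustration of that formula: the function name is made up, and returning None for a video with no votes at all is an assumption about how that case should be handled.

```python
def controversy_index(likes, dislikes):
    """Share of thumbs down among all thumbs; None if there are no votes."""
    total = likes + dislikes
    if total == 0:
        return None
    return dislikes / total


# First row of the table above: 16613 likes, 1522 dislikes.
top_row = controversy_index(16613, 1522)
print(round(top_row, 6))  # 0.083926, matching the ControversyIndex column
```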

In [4]:
with open("C:/Users/AI/Documents/Projects/Python/Notebooks/apikey.txt","r") as f:
    print(f.readline())

AIzaSyDYEubHdR-SolXXyXdCbCF1ivVL5sy8k3c


In [12]:
import os
print(os.path.join(os.getcwd(), os.pardir, "apikey.txt"))


C:\Users\AI\Documents\Projects\Python\Notebooks\Google API\..\apikey.txt
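
The path printed above still contains a literal `..` segment. A sketch of an alternative using `pathlib` is shown below; it builds the parent directory directly and reads the key as text with the trailing newline stripped. `read_api_key` is a hypothetical helper, and the demonstration writes a placeholder key to a temporary file so that it runs without the real `apikey.txt`.

```python
import tempfile
from pathlib import Path


def read_api_key(path):
    """Read the first line of a key file as text, without the newline."""
    with open(path, "r", encoding="utf-8") as f:
        return f.readline().strip()


# Same location as os.path.join(os.getcwd(), os.pardir, "apikey.txt"),
# but with the ".." already resolved.
key_path = Path.cwd().parent / "apikey.txt"
print(key_path)

# Offline demonstration with a placeholder key in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "apikey.txt"
    demo_path.write_text("not-a-real-key\n", encoding="utf-8")
    demo_key = read_api_key(demo_path)

print(repr(demo_key))
```

Reading in text mode and stripping matters because a trailing newline or a bytes object passed as `developerKey` can produce confusing request errors.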
