## Scraping Closed Captions from YouTube
    -In this notebook we set up a mechanism to download subtitles (closed captions) from YouTube videos.
    -There are a few different ways to choose which videos to scrape.
    -For my project, we start with a hand-picked list of playlists.
    -A function that returns search results and filters them is also included.

In [351]:
import pandas as pd
import os
import webvtt
import google.oauth2.credentials
import time

import google_auth_oauthlib.flow
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google_auth_oauthlib.flow import InstalledAppFlow




In [352]:
#use google's resource builder code from API documentation

# The CLIENT_SECRETS_FILE variable specifies the name of a file that contains
# the OAuth 2.0 information for this application, including its client_id and
# client_secret.
CLIENT_SECRETS_FILE = "client_secret.json"

# This OAuth 2.0 access scope allows for full read/write access to the
# authenticated user's account and requires requests to use an SSL connection.
SCOPES = ['https://www.googleapis.com/auth/youtube.force-ssl']
API_SERVICE_NAME = 'youtube'
API_VERSION = 'v3'


def get_authenticated_service():
    flow = InstalledAppFlow.from_client_secrets_file(CLIENT_SECRETS_FILE, SCOPES)
    credentials = flow.run_console()
    return build(API_SERVICE_NAME, API_VERSION, credentials = credentials)

def print_response(response):
    print(response)

# Build a resource based on a list of properties given as key-value pairs.
# Leave properties with empty values out of the inserted resource.
def build_resource(properties):
    resource = {}
    for p in properties:
        # Given a key like "snippet.title", split into "snippet" and "title", where
        # "snippet" will be an object and "title" will be a property in that object.
        prop_array = p.split('.')
        ref = resource
        # This loop must run once per property, so it is nested inside "for p".
        for pa in range(0, len(prop_array)):
            is_array = False
            key = prop_array[pa]

            # For properties that have array values, convert a name like
            # "snippet.tags[]" to snippet.tags, and set a flag to handle
            # the value as an array.
            if key[-2:] == '[]':
                key = key[0:len(key)-2:]
                is_array = True

            if pa == (len(prop_array) - 1):
                # Leave properties without values out of inserted resource.
                if properties[p]:
                    if is_array:
                        ref[key] = properties[p].split(',')
                    else:
                        ref[key] = properties[p]
            elif key not in ref:
                # For example, the property is "snippet.title", but the resource does
                # not yet have a "snippet" object. Create the snippet object here.
                # Setting "ref = ref[key]" means that in the next time through the
                # "for pa in range ..." loop, we will be setting a property in the
                # resource's "snippet" object.
                ref[key] = {}
                ref = ref[key]
            else:
                # For example, the property is "snippet.description", and the resource
                # already has a "snippet" object.
                ref = ref[key]
    return resource

# Remove keyword arguments that are not set
def remove_empty_kwargs(**kwargs):
    good_kwargs = {}
    if kwargs is not None:
        for key, value in kwargs.items():
            if value:
                good_kwargs[key] = value
    return good_kwargs

if __name__ == '__main__':
    # When running locally, disable OAuthlib's HTTPs verification. When
    # running in production *do not* leave this option enabled.
    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = '1'
    client = get_authenticated_service()

Please visit this URL to authorize this application: https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=111385062335-jq2vrqf7hvtp6mfh2o5bprq11sabsq1r.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fyoutube.force-ssl&state=DfoE7FZXxyd0CPQkMlzm44eN4kJwXp&prompt=consent&access_type=offline
Enter the authorization code: 4/4wBDp-DRrBAIbklO7zQqLgGxUbx5Y44eemC4b_KIcjgmKjKWhW9drwA


## Function to search videos by search term
    - Given a search term and an integer n, this function will return the first n results for the term
    - Videos are limited to those that have CCs and are longer than 20 minutes (these parameters can be changed)
    - Returns a DF with the video title, videoID, and channel title (stored in the channelIds column)

In [252]:
def youtube_keyword(client, **kwargs):    
    # See full sample for function
    kwargs = remove_empty_kwargs(**kwargs)
    response = client.search().list(
        **kwargs
        ).execute()    
    return response

def youtube_search(criteria, n_terms):
    #create lists and empty dataframe
    titles = []
    videoIds = []
    channelIds = []
    resp_df = pd.DataFrame()
    token = None #page token; None requests the first page

    while len(titles) < n_terms:
        response = youtube_keyword(client,
                        part='id,snippet',
                        maxResults=50,
                        q=criteria,#whatever you search for above
                        videoCaption='closedCaption',#we only want videos with CCs
                        type='video', #because we specify duration
                        videoDuration='long',#videos longer than 20 minutes
                        pageToken=token) #this will update for the next 50 results

        for item in response['items']:
            titles.append(item['snippet']['title'])
            channelIds.append(item['snippet']['channelTitle']) #note: channel title, not the channel ID
            videoIds.append(item['id']['videoId'])

        #move on to the next page; stop when there are no more results
        token = response.get('nextPageToken')
        if token is None:
            break
    resp_df['titles'] = titles
    resp_df['channelIds'] = channelIds
    resp_df['videoIds'] = videoIds

    #pages hold up to 50 results, so trim to the n_terms that were asked for
    return resp_df.head(n_terms)

#Sample usage

'''Linear_Algebra_Videos = youtube_search('[linear+algebra]',1000)'''


"Linear_Algebra_Videos = youtube_search('[linear+algebra]',1000)"

## Function to get a list of videoIDs for a given playlist
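    -This function takes a list of playlistIDs and returns a DF with the title, description, channel title, videoID, and playlistID of each video
    -Only the first page (up to 50 items) of each playlist is requested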

In [510]:
def playlist_items_list_by_playlist_id(client, **kwargs):
    # See full sample for function
    kwargs = remove_empty_kwargs(**kwargs)

    response = client.playlistItems().list(
    **kwargs
    ).execute()

    return response


def get_vid_ids (play_lists):
    #generate data from call
    titles = []
    descriptions = []
    channelids = []
    vidids = []
    playlist_ids = []
    video_df = pd.DataFrame()
    for play_list in play_lists:
        #request playlist items
        pl_data = playlist_items_list_by_playlist_id(client,
                            part='snippet,contentDetails',
                            maxResults=50,
                            playlistId=play_list)
        
        #extract information about each video in the playlist
        
        for item in pl_data['items']:
            titles.append(item['snippet']['title'])
            descriptions.append(item['snippet']['description'])
            channelids.append(item['snippet']['channelTitle'])
            vidids.append(item['snippet']['resourceId']['videoId'])
            playlist_ids.append(item['snippet']['playlistId'])
                
    video_df['title'] = titles
    video_df['description'] = descriptions
    video_df['channelid'] = channelids
    video_df['videoids'] = vidids
    video_df['playlist_id'] = playlist_ids
            
                
    return video_df
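
`get_vid_ids` requests a single page of up to 50 items per playlist (`maxResults=50`, no page token), so longer playlists would be cut off. Below is a minimal sketch of how the remaining pages could be followed using the `nextPageToken` field of each response. `iter_playlist_items` and `FakePlaylistClient` are hypothetical names of mine, and the fake client only serves canned pages so the sketch runs without credentials; with the real `client` from above, the same generator should work unchanged.

```python
def iter_playlist_items(client, playlist_id, page_size=50):
    """Yield every item in a playlist, following nextPageToken until it runs out."""
    token = None
    while True:
        kwargs = {"part": "snippet,contentDetails",
                  "maxResults": page_size,
                  "playlistId": playlist_id}
        if token:
            kwargs["pageToken"] = token
        response = client.playlistItems().list(**kwargs).execute()
        yield from response.get("items", [])
        token = response.get("nextPageToken")
        if not token:
            break


class _FakeRequest:
    """Hypothetical stand-in for an API request object."""
    def __init__(self, response):
        self._response = response

    def execute(self):
        return self._response


class FakePlaylistClient:
    """Hypothetical stand-in for the API client; serves canned pages by token."""
    def __init__(self, pages):
        self._pages = pages  # dict: page token (None for the first page) -> response

    def playlistItems(self):
        return self

    def list(self, **kwargs):
        return _FakeRequest(self._pages[kwargs.get("pageToken")])


# Two canned pages: the first points at the second, the second has no token.
pages = {
    None: {"items": [{"id": "a"}, {"id": "b"}], "nextPageToken": "p2"},
    "p2": {"items": [{"id": "c"}]},
}
demo_items = list(iter_playlist_items(FakePlaylistClient(pages), "PL_demo"))
```

Each extra page is one more API request, so this trades quota for completeness.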

## Importing a list of playlists

In [513]:
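#import csv of target playlists
pl_df = pd.read_csv('playlists_math.csv')
#correct a playlist ID; .loc writes to pl_df itself, whereas pl_df.iloc[16].PlaylistID = ... may only change a temporary copy
pl_df.loc[pl_df.index[16], 'PlaylistID'] = 'PLUl4u3cNGP61hsJNdULdudlRL493b-XZf'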

In [None]:
pl_df.PlaylistID

In [514]:
#call our function to get the videoIDs of the playlists
t1 = time.time()
video_df = get_vid_ids(pl_df.PlaylistID)
print("Num Videos listed in {} minutes".format((time.time()-t1)/60))

Num Videos listed in 0.25398961702982586 minutes


In [515]:
print('Number of VideoIds downloaded: {}'.format(len(video_df.videoids)))

Number of VideoIds downloaded: 1285


## Function to scrape Captions 
    -Given a list of videoIds, this function will download the CCs if available.
    -As this function does not use the API, it is not subject to a quota (be nice)
    -Also, it's slow, so be patient

In [474]:
#This is the function that actually downloads the CCs; it takes a while as it writes a .vtt file to disk for each video
import subprocess

def get_all_ccs(vids):
    base_url = 'https://www.youtube.com/watch?v='
    lang="en"
    for vid in vids:
        url = base_url + vid
        cmd = ["youtube-dl","--skip-download","--write-sub",
               "--sub-lang",lang,url]
        #pass the arguments as a list so the shell never interprets the URL
        subprocess.run(cmd)
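
The "be nice" note above could be made concrete with a pause between requests. This is only a sketch: `build_caption_cmd`, `download_captions`, the `delay_seconds` default, and the injectable `runner` argument are my own assumptions, not part of the notebook. The `youtube-dl` flags are the same ones used above.

```python
import subprocess
import time


def build_caption_cmd(video_id, lang="en"):
    """Return the youtube-dl argument list for one video's subtitles."""
    url = "https://www.youtube.com/watch?v=" + video_id
    return ["youtube-dl", "--skip-download", "--write-sub", "--sub-lang", lang, url]


def download_captions(video_ids, delay_seconds=1.0, runner=subprocess.run):
    """Run one command per video, pausing between requests.

    Returns the IDs whose command exited with a non-zero status.
    """
    failed = []
    for video_id in video_ids:
        result = runner(build_caption_cmd(video_id))
        if result.returncode != 0:
            failed.append(video_id)
        time.sleep(delay_seconds)  # assumed courtesy delay between requests
    return failed
```

A zero exit status only means the command ran; whether a `.vtt` file was actually written still has to be checked on disk, as the next cells do.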

In [475]:
t1 = time.time()

get_all_ccs(video_df.videoids)

print("Captions downloaded in {} minutes".format((time.time()-t1)/60))

Captions downloaded in 0.9936846733093262 minutes


In [422]:
#Enumerate .vtt files in cwd
filenames_vtt = [os.fsdecode(file) for file in os.listdir(os.getcwd())
                 if os.fsdecode(file).endswith(".vtt")]
print('Captions Downloaded {}'.format(len(filenames_vtt)))

Captions Downloaded 195


In [423]:
#look at how they're saved
filenames_vtt[:2]

['Determining whether a transformation is onto _ Linear Algebra _ Khan Academy-eR8vEdJTvd0.en.vtt',
 'More on linear independence _ Vectors and spaces _ Linear Algebra _ Khan Academy-Alhcv5d_XOs.en.vtt']

## Function to convert .vtt files to csv files
    -Given a list of file names, this function converts each .vtt file in the cwd to
    a csv file with the text, start time and stop time for each line.
    -It also removes the downloaded .vtt files

In [424]:
def convert_vtt(filenames):    
    #create an assets folder if one does not yet exist
    os.makedirs('assets2', exist_ok=True)
    #extract the text and times from the vtt file
    for file in filenames:
        captions = webvtt.read(file)
        text_time = pd.DataFrame()
        text_time['text'] = [caption.text for caption in captions]
        text_time['start'] = [caption.start for caption in captions]
        text_time['stop'] = [caption.end for caption in captions]
        text_time.to_csv('assets2/{}.csv'.format(file[:-4]),index=False) #-4 to remove '.vtt'
        #remove files from local drive
        os.remove(file)
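
The `start` and `stop` columns written above are saved as text. Assuming they follow the usual WebVTT timestamp layout (`HH:MM:SS.mmm`, with the hours part optional), a small helper like the one below could turn them into seconds for later arithmetic. `vtt_timestamp_to_seconds` is a hypothetical name, and the format is an assumption about what the csv files contain.

```python
def vtt_timestamp_to_seconds(timestamp):
    """Convert 'HH:MM:SS.mmm' (or 'MM:SS.mmm') into seconds as a float."""
    parts = timestamp.strip().split(":")
    seconds = float(parts[-1])
    minutes = int(parts[-2]) if len(parts) >= 2 else 0
    hours = int(parts[-3]) if len(parts) >= 3 else 0
    return hours * 3600 + minutes * 60 + seconds


# Duration of one caption line, in seconds.
caption_length = (vtt_timestamp_to_seconds("00:01:04.000")
                  - vtt_timestamp_to_seconds("00:01:02.500"))
```

Applied to a dataframe column, this would be something like `df['start'].map(vtt_timestamp_to_seconds)`.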

In [425]:
#convert the vtt files and create csvs in the assets folder
convert_vtt(filenames_vtt)

## General Clean up
    -Here we clean up the csv file names and get the text into a dataframe

In [528]:
#Get a list of the CSV files 
csv_files = [os.fsdecode(file) for file in os.listdir(os.getcwd()+'/math2clean/assets') \
                    if os.fsdecode(file).endswith('.csv')]

In [524]:
len(csv_files)

868

In [529]:
csv_files

['VisualizingacolumnspaceasaplaneinR3_Vectorsandspaces_LinearAlgebra_KhanAcademy-EGNlXtjYABw.en.csv',
 'Modelingpopulationwithsimpledifferentialequation_KhanAcademy-IYFkXWlgC_w.en.csv',
 'FunctionsasArguments-QaOHeMnpnmU.en.csv',
 'Introductiontoprojections_Matrixtransformations_LinearAlgebra_KhanAcademy-27vT-NWuw0M.en.csv',
 '11.DynamicProgramming-All-PairsShortestPaths-NzgFUwOaoIw.en.csv',
 'Lec16_MIT18.02MultivariableCalculus,Fall2007-YP_B0AapU0c.en.csv',
 '6.MaximumLikelihoodEstimation(cont.)andtheMethodofMoments-JTbZP0yt9qc.en.csv',
 'Lec19_MIT18.03DifferentialEquations,Spring2006-sZ2qulI6GEk.en.csv',
 'L05.2DefinitionofRandomVariables-vfqPpai_9jI.en.csv',
 'Lec6_MIT6.042JMathematicsforComputerScience,Fall2010-h9wxtqoa1jY.en.csv',
 'Mean&VarianceoftheExponential-GJ2klfD0Q3g.en.csv',
 'ComputingtheFourFundamentalSubspaces-D8u1LV9CnCk.en.csv',
 '11.Learning-IdentificationTrees,Disorder-SXBG3RGr_Rc.en.csv',
 'Lec26_MIT18.02MultivariableCalculus,Fall2007-RMBGQtwkoyU.en.csv',
 'Lecture

In [525]:
#look at a file example
csv_files[0]

'VisualizingacolumnspaceasaplaneinR3_Vectorsandspaces_LinearAlgebra_KhanAcademy-EGNlXtjYABw.en.csv'

In [None]:
#remove the spaces from the names
path = os.path.join(os.getcwd(), 'math2clean', 'assets') #same folder that was listed above
for filename in csv_files:
    os.rename(os.path.join(path, filename), os.path.join(path, filename.replace(' ', '')))

In [526]:
clean_csv = sorted([os.fsdecode(file) for file in os.listdir(os.getcwd()+'/math2clean/assets')])
'''clean_csv'''

'clean_csv'

In [530]:
#extract the text and videoid
vidText = []
csv_vidid = []
path = 'math2clean/assets/'
for file in clean_csv:

    df = pd.read_csv(path+file)
    text = " ".join(df.text)
    vidText.append(text)
    csv_vidid.append(file[-18:-7]) #the 11-character video ID sits just before '.en.csv'
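
The slice `file[-18:-7]` relies on every name ending in an 11-character ID followed by `.en.csv`. A slightly more defensive sketch uses a regular expression and returns `None` for names that do not fit. It assumes video IDs are 11 characters drawn from letters, digits, `-` and `_`, which matches the file names shown in this notebook; `video_id_from_filename` is a hypothetical helper.

```python
import re

# "-<11-char id>.<language code>.csv" (or .vtt) at the end of the file name.
_VIDEO_ID_RE = re.compile(r"-([A-Za-z0-9_-]{11})\.[A-Za-z-]+\.(?:csv|vtt)$")


def video_id_from_filename(filename):
    """Return the video ID embedded in a caption file name, or None if absent."""
    match = _VIDEO_ID_RE.search(filename)
    return match.group(1) if match else None


example_id = video_id_from_filename("FunctionsasArguments-QaOHeMnpnmU.en.csv")
```

IDs that themselves begin with a dash, such as the one in `8.ObjectOrientedProgramming--DP1i2ZU9gk.en.csv`, are still picked up whole because the pattern anchors on the end of the name.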

In [531]:
#set up a data frame with the available caption information
lectures_df = pd.DataFrame()
lectures_df['lecture_title'] = clean_csv
lectures_df['lecture_text'] = vidText
lectures_df['vid_id'] = csv_vidid

## Adding information from the YouTube API and the scraped captions together
    -Because the API listed more videos than had downloadable captions (1285 videoIds vs. 868 caption files in the run shown here), when we add the information together
    we will need to discard some of the videos in the api videoId list.
    -To do this we will join the two DFs using the video ID as the index; joining from the captions DF keeps only the videos that have captions.
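
A toy version of this step, using made-up rows, may help show why the row count can grow after the join and shrink again after dropping duplicates: a video that appears in more than one playlist matches more than once. Note that this sketch uses `merge` on columns rather than the index-based `join` used in this notebook; the frames and values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for lectures_df (captions) and video_df (API metadata).
captions = pd.DataFrame({"vid_id": ["a1", "b2"],
                         "lecture_text": ["text a", "text b"]})
metadata = pd.DataFrame({
    "videoids": ["a1", "a1", "b2", "c3"],      # "a1" sits in two playlists
    "playlist_id": ["PL1", "PL2", "PL1", "PL1"],
})

# Left merge: keep every captioned video, attach metadata where the IDs match.
merged = captions.merge(metadata, left_on="vid_id", right_on="videoids", how="left")

# "a1" matched twice, so keep only the first row per video ID.
deduped = merged.drop_duplicates(subset="vid_id").drop(columns="videoids")
```

Here `merged` has three rows and `deduped` has two; "c3" has no captions and never enters the result.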

In [532]:
lectures_df.shape

(868, 3)

In [533]:
lectures_df.head(1)

Unnamed: 0,lecture_title,lecture_text,vid_id
0,'Shifting'transformbymultiplyingfunctionbyexpo...,Now I think is a good time\nto add some notati...,_X_QwpXsdOs


In [534]:
video_df.head(1)

Unnamed: 0,title,description,channelid,videoids,playlist_id
0,L01.1 Lecture Overview,"MIT RES.6-012 Introduction to Probability, Spr...",MIT OpenCourseWare,1uW3qMFA9Ho,PLUl4u3cNGP60hI9ATjSFgLZpbNJ7myAg6


In [535]:
base_df = lectures_df.set_index('vid_id').join(video_df.set_index('videoids')).reset_index()

In [536]:
base_df.shape


(892, 7)

In [537]:
base_df.head()

Unnamed: 0,index,lecture_title,lecture_text,title,description,channelid,playlist_id
0,--lPz7VFnKI,"Lec39_MIT18.01SingleVariableCalculus,Fall2007-...",The following content is\nprovided under a Cre...,"Lec 39 | MIT 18.01 Single Variable Calculus, F...",Lecture 39: Final review\nInstructor: David Je...,MIT OpenCourseWare,PL590CCC2BC5AF3BC1
1,-630YTQEuCI,S01.0MathematicalBackgroundOverview--630YTQEuC...,"In this sequence of segments,\nwe review some ...",S01.0 Mathematical Background Overview,"MIT RES.6-012 Introduction to Probability, Spr...",MIT OpenCourseWare,PLUl4u3cNGP60hI9ATjSFgLZpbNJ7myAg6
2,-DP1i2ZU9gk,8.ObjectOrientedProgramming--DP1i2ZU9gk.en.csv,The following content is\nprovided under a Cre...,8. Object Oriented Programming,MIT 6.0001 Introduction to Computer Science an...,MIT OpenCourseWare,PLUl4u3cNGP63WbdFxL8giv4yhgdMGaZNA
3,-DwGrJ8JxDc,Recitation9b-DNASequenceMatching--DwGrJ8JxDc.e...,The following\ncontent is provided under a Cre...,Recitation 9b: DNA Sequence Matching,"MIT 6.006 Introduction to Algorithms, Fall 201...",MIT OpenCourseWare,PLUl4u3cNGP61Oq3tWYp6V_F-5jb5L2iHb
4,-FElVPKykgw,R10.Quiz1Review--FElVPKykgw.en.csv,The following\ncontent is provided under a Cre...,R10. Quiz 1 Review,"MIT 6.006 Introduction to Algorithms, Fall 201...",MIT OpenCourseWare,PLUl4u3cNGP61Oq3tWYp6V_F-5jb5L2iHb


In [538]:
#drop duplicates
base_df = base_df.drop_duplicates(subset=['index'])

In [540]:
base_df.shape

(868, 7)

In [541]:
base_df.head(1)

Unnamed: 0,index,lecture_title,lecture_text,title,description,channelid,playlist_id
0,--lPz7VFnKI,"Lec39_MIT18.01SingleVariableCalculus,Fall2007-...",The following content is\nprovided under a Cre...,"Lec 39 | MIT 18.01 Single Variable Calculus, F...",Lecture 39: Final review\nInstructor: David Je...,MIT OpenCourseWare,PL590CCC2BC5AF3BC1


In [542]:
base_df['vidid'] = base_df['index']
base_df = base_df.drop(['index'],axis=1)

In [543]:
base_df.head()

Unnamed: 0,lecture_title,lecture_text,title,description,channelid,playlist_id,vidid
0,"Lec39_MIT18.01SingleVariableCalculus,Fall2007-...",The following content is\nprovided under a Cre...,"Lec 39 | MIT 18.01 Single Variable Calculus, F...",Lecture 39: Final review\nInstructor: David Je...,MIT OpenCourseWare,PL590CCC2BC5AF3BC1,--lPz7VFnKI
1,S01.0MathematicalBackgroundOverview--630YTQEuC...,"In this sequence of segments,\nwe review some ...",S01.0 Mathematical Background Overview,"MIT RES.6-012 Introduction to Probability, Spr...",MIT OpenCourseWare,PLUl4u3cNGP60hI9ATjSFgLZpbNJ7myAg6,-630YTQEuCI
2,8.ObjectOrientedProgramming--DP1i2ZU9gk.en.csv,The following content is\nprovided under a Cre...,8. Object Oriented Programming,MIT 6.0001 Introduction to Computer Science an...,MIT OpenCourseWare,PLUl4u3cNGP63WbdFxL8giv4yhgdMGaZNA,-DP1i2ZU9gk
3,Recitation9b-DNASequenceMatching--DwGrJ8JxDc.e...,The following\ncontent is provided under a Cre...,Recitation 9b: DNA Sequence Matching,"MIT 6.006 Introduction to Algorithms, Fall 201...",MIT OpenCourseWare,PLUl4u3cNGP61Oq3tWYp6V_F-5jb5L2iHb,-DwGrJ8JxDc
4,R10.Quiz1Review--FElVPKykgw.en.csv,The following\ncontent is provided under a Cre...,R10. Quiz 1 Review,"MIT 6.006 Introduction to Algorithms, Fall 201...",MIT OpenCourseWare,PLUl4u3cNGP61Oq3tWYp6V_F-5jb5L2iHb,-FElVPKykgw


In [544]:
#save the csv
base_df.to_csv('all_lectures.csv', index=False)