# Building a Data Pipeline in Python

The goal of this project is to pull data for a YouTube channel from the YouTube Data API, extract the useful fields into a dataframe, and then upload that dataframe to an AWS database.

In [14]:
import requests 
import time
import pandas as pd
from dotenv import load_dotenv
import os

In [15]:
def configure():
    load_dotenv()  # load the API key from a local .env file so it stays out of the notebook

For this project I will be looking at the popular science channel Kurzgesagt. The channel ID can be found in the source code of the channel's YouTube homepage. We also need the base URL that forms the root of our API requests, which can be found in the documentation: https://developers.google.com/youtube/v3/docs/search/list

In [16]:
#key and ID, you will want to replace the API key with your own
configure()
API_KEY = os.getenv("API_KEY")
CHANNEL_ID = "UCsXVk37bltHxD1rDPwtNM8Q"
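`os.getenv` returns `None` when a variable is missing, and the string concatenation used later to build the request URL would then fail with a less obvious `TypeError`. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper name, not part of the original notebook.

```python
import os


def require_env(name):
    """Return the environment variable `name`, or raise if it is unset or empty.

    Hypothetical helper: surfaces a missing key immediately instead of
    letting a None value reach the URL-building code.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

With this in place, `API_KEY = require_env("API_KEY")` could stand in for the plain `os.getenv` call, assuming `configure()` has already run.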

## 1. Initial Exploration

First, I will build an API request URL from the base URL and the parameters found in the documentation.

In [17]:
# maxResults is capped at 50 per page by the API, so larger values have no effect
url = "https://www.googleapis.com/youtube/v3/search?key="+API_KEY+"&channelId="+CHANNEL_ID+"&part=snippet,id&order=date&maxResults=50"

video_info = requests.get(url).json()

video_info

{'kind': 'youtube#searchListResponse',
 'etag': 'cm-L7gAvAGsag_UyO3_I23FNvvg',
 'nextPageToken': 'CDIQAA',
 'regionCode': 'US',
 'pageInfo': {'totalResults': 204, 'resultsPerPage': 50},
 'items': [{'kind': 'youtube#searchResult',
   'etag': 'Fmk-JUAYW9KB-NAJIx6mh6rRURI',
   'id': {'kind': 'youtube#video', 'videoId': 'LEENEFaVUzU'},
   'snippet': {'publishedAt': '2022-06-28T14:00:23Z',
    'channelId': 'UCsXVk37bltHxD1rDPwtNM8Q',
    'title': 'The Last Human – A Glimpse Into The Far Future',
    'description': 'Because of the potential size of the future, the most important thing about our actions today might be their impact on future ...',
    'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/LEENEFaVUzU/default.jpg',
      'width': 120,
      'height': 90},
     'medium': {'url': 'https://i.ytimg.com/vi/LEENEFaVUzU/mqdefault.jpg',
      'width': 320,
      'height': 180},
     'high': {'url': 'https://i.ytimg.com/vi/LEENEFaVUzU/hqdefault.jpg',
      'width': 480,
      'height
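The response above reports `resultsPerPage` of 50, which matches the documented ceiling for `maxResults`. As an alternative to concatenating the query string by hand, here is a minimal sketch that assembles the same request with the standard library's `urlencode`. The helper name `build_search_url` and its `page_token` and `max_results` arguments are my own choices for illustration.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def build_search_url(api_key, channel_id, page_token=None, max_results=50):
    """Assemble a search request URL (hypothetical helper, not from the notebook)."""
    params = {
        "key": api_key,
        "channelId": channel_id,
        "part": "snippet,id",
        "order": "date",
        "maxResults": max_results,
    }
    # Only send pageToken when we are asking for a page after the first
    if page_token:
        params["pageToken"] = page_token
    # urlencode percent-escapes values, so the comma in "snippet,id" becomes %2C
    return SEARCH_ENDPOINT + "?" + urlencode(params)
```

Keeping the parameters in a dictionary makes it harder to misplace an `&` and easier to add the page token later on.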

We have a few statistics to pick from, including like count, view count, comment count, and favorite count. Favorite count is always zero, so we will leave it out.
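Indexing the statistics directly raises a `KeyError` if a field is absent. I am assuming here that this can happen, for example when a video has comments disabled or likes hidden. The sketch below reads the three fields defensively and converts them to integers; `extract_stats` is a hypothetical helper name.

```python
def extract_stats(video_response):
    """Pull view, like, and comment counts from a videos response dict.

    Missing fields come back as None instead of raising KeyError.
    """
    items = video_response.get("items", [])
    stats = items[0].get("statistics", {}) if items else {}

    def to_int(value):
        # Counts typically arrive as strings; int() also accepts real integers
        return int(value) if value is not None else None

    return {
        "view_count": to_int(stats.get("viewCount")),
        "like_count": to_int(stats.get("likeCount")),
        "comment_count": to_int(stats.get("commentCount")),
    }


# Sample shaped like the 'items' -> 'statistics' structure used in this notebook
sample = {"items": [{"statistics": {"viewCount": "3692527", "likeCount": "278874"}}]}
print(extract_stats(sample))
```

Storing the counts as integers rather than strings should also make later aggregation and the database upload more predictable.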

In [43]:
video_df = pd.DataFrame(columns = [ 'vid_id', 'vid_title', 'upload_date', 'view_count', 
                                    'like_count', 'comment_count'])

for vid in video_info['items']:
    if vid['id']['kind'] == 'youtube#video':
        vid_id = vid['id']['videoId']
        vid_title = vid['snippet']['title']
        upload_date = vid['snippet']['publishedAt']
        upload_date = str(upload_date).split("T")[0]
        
        #obtaining stats using video id
        
        vid_url = "https://www.googleapis.com/youtube/v3/videos?key="+API_KEY+"&part=statistics&id="+vid_id
        video_info_vid = requests.get(vid_url).json()
        
        view_count = video_info_vid['items'][0]['statistics']['viewCount']
        like_count = video_info_vid['items'][0]['statistics']['likeCount']
        comment_count = video_info_vid['items'][0]['statistics']['commentCount']
        d = {'vid_id':[vid_id], 'vid_title':[vid_title], 'upload_date':[upload_date], 
             'view_count':[view_count], 'like_count':[like_count], 'comment_count':[comment_count]}
        # ignore_index keeps the row index sequential instead of repeating 0
        video_df = pd.concat([video_df, pd.DataFrame(data = d)], ignore_index = True)

In [42]:
video_df

Unnamed: 0,vid_id,vid_title,upload_date,view_count,like_count,comment_count


In [40]:
d = {"app":[0], "tap": [2]}
pd.DataFrame(data = d)

Unnamed: 0,app,tap
0,0,2


## 2. Cleaning and Optimizing Code

So far we have only collected videos from a single page of results, so we want to loop through all of the page tokens. It would also be better to wrap this loop in a function that returns the same data.
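The paging idea can be sketched on its own, without touching the network. The example assumes that every page except the last carries a `nextPageToken`, as in the response shown earlier. `iterate_pages`, `fetch_page`, and `FAKE_PAGES` are hypothetical names, and the fake pages stand in for real API responses.

```python
def iterate_pages(fetch_page):
    """Yield every page, following nextPageToken until it is absent.

    fetch_page is a hypothetical callable: it takes a page token (or None
    for the first page) and returns the parsed JSON response as a dict.
    """
    token = None
    while True:
        response = fetch_page(token)
        yield response
        token = response.get("nextPageToken")
        if token is None:
            break


# Stand-in for the real API: two fake pages keyed by page token
FAKE_PAGES = {
    None: {"items": ["a", "b"], "nextPageToken": "CDIQAA"},
    "CDIQAA": {"items": ["c"]},
}

all_items = [item for page in iterate_pages(FAKE_PAGES.get) for item in page["items"]]
print(all_items)  # ['a', 'b', 'c']
```

Separating the paging logic from the request itself also makes the loop easy to test with canned responses like these.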

In [48]:
def get_youtube_data(API_KEY, CHANNEL_ID):
    page = ""
    vid_df = pd.DataFrame(columns=["vid_id","vid_title","upload_date","view_count","like_count","comment_count"]) 
    
    while True:
        # maxResults is capped at 50 per page by the API, so larger values have no effect
        url = "https://www.googleapis.com/youtube/v3/search?key="+API_KEY+"&channelId="+CHANNEL_ID+"&order=date&maxResults=50&part=snippet,id&"+page

        video_info = requests.get(url).json()
        time.sleep(1) #waits for one second
        for video in video_info['items']:
            if video['id']['kind'] == "youtube#video":
                vid_id = video['id']['videoId']
                vid_title = video['snippet']['title']
                vid_title = str(vid_title).replace("&amp;", "&")  # decode the HTML-escaped ampersand
                upload_date = video['snippet']['publishedAt']
                upload_date = str(upload_date).split("T")[0]
                
                #making a separate api call to pull the video stats
                url_vid_stats = "https://www.googleapis.com/youtube/v3/videos?id="+vid_id+"&part=statistics&key="+API_KEY
                vid_stats = requests.get(url_vid_stats).json()
                
                view_count = vid_stats['items'][0]['statistics']['viewCount']
                like_count = vid_stats['items'][0]['statistics']['likeCount']
                comment_count = vid_stats['items'][0]['statistics']['commentCount']
                
                #concatenating into the dataframe
                d = {'vid_id':[vid_id], 'vid_title':[vid_title], 'upload_date':[upload_date], 
                     'view_count':[view_count], 'like_count':[like_count], 'comment_count':[comment_count]}
                # append to the function's own vid_df, not the global video_df
                vid_df = pd.concat([vid_df, pd.DataFrame(data = d)], ignore_index = True)
                
                
        # follow nextPageToken; the final page has none, which ends the loop
        next_token = video_info.get('nextPageToken')
        if next_token is None:
            break
        page = "pageToken=" + next_token


    return vid_df

In [49]:
get_youtube_data(API_KEY, CHANNEL_ID)

Unnamed: 0,vid_id,vid_title,upload_date,view_count,like_count,comment_count
0,LEENEFaVUzU,The Last Human – A Glimpse Into The Far Future,2022-06-28,3692527,278874,13463
1,75d_29QWELk,Change Your Life – One Tiny Step at a Time,2022-06-07,4690303,347079,10065
2,Pj-h6MEgE7I,You Are Not Where You Think You Are,2022-05-17,5835854,324037,13865
3,7OPg-ksxZ4Y,The Most Horrible Parasite: Brain Eating Amoeba,2022-05-03,5228868,310379,15820
4,LxgMdjyw8uw,We WILL Fix Climate Change!,2022-04-05,7939435,546628,38302
5,KRvv0QdruMQ,Are There Lost Alien Civilizations in Our Past?,2022-03-01,9100101,388482,16461
6,lheapd7bgLA,What Happens if the Moon Crashes into Earth?,2022-02-08,11961361,443269,25101
7,xAUJYP8tnRE,Why We Should NOT Look For Aliens - The Dark F...,2021-12-14,10934881,550268,28299
8,XFqn3uy238E,...And We&#39;ll Do it Again,2021-12-07,9796172,626471,24854
9,F1Hq8eVOMHs,Is Meat Really that Bad?,2021-11-30,6590393,368886,43074


Now, instead of a series of for loops, we have a single function that lets us pull this data for any channel whose channel ID we know.
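One possible further optimization: the function above issues one statistics request per video. The sketch below assumes the videos endpoint accepts a comma-separated list of IDs in its `id` parameter, and it treats the batch size of 50 as an assumption rather than a confirmed limit. It only builds the batched URLs; `chunked` and `build_stats_urls` are hypothetical helper names.

```python
def chunked(ids, size=50):
    """Yield consecutive slices of `ids`, each with at most `size` entries."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_stats_urls(api_key, video_ids, size=50):
    """Build one statistics URL per batch of video IDs (assumed batch size: 50)."""
    base = "https://www.googleapis.com/youtube/v3/videos"
    return [
        f"{base}?key={api_key}&part=statistics&id={','.join(batch)}"
        for batch in chunked(video_ids, size)
    ]


# 120 made-up IDs are split into batches of 50, 50, and 20
demo_ids = [f"id{i}" for i in range(120)]
demo_urls = build_stats_urls("KEY", demo_ids)
print(len(demo_urls))  # 3
```

If this approach is used, each returned item should be matched back to its video through the item's own `id` field rather than by position.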

The next steps will be to perform sentiment analysis on the titles, relate the results to view counts, and then export everything to AWS.