# Extract and store data from YouTube songs

![song](https://source.unsplash.com/nyBUfH9MkL4)
Photo by [@mybibimbaplife](https://unsplash.com/@mybibimbaplife)


## Guidelines

Open the `02-Extract-and-store-from-Youtube-API.ipynb` notebook, and follow the instructions.

Please, note that the code we wrote for you in this notebook uses the module `src/data_collection.py` to work efficiently. We'd like you to have a look at this module, just to familiarize yourself with lesser-known Python concepts, specifically the `asyncio` package.

Don't panic if you don't understand this module: it takes practice to master those concepts. The goal here is just to give you a first glimpse at what is possible with Python, beyond its "classical" utilization that you see everywhere on the Web.


## 1. Objective

The goal of this mini-project is to fetch extra data for all the videos appearing in the Youtube songs playlog dataset. With this exercice, you will learn how to:

- Use the YouTube API
- Write efficient and asynchronous Python code 
- Create a `boto3` session object to store your AWS credentials
- Store a big JSON object in AWS S3 using `boto3`

### Why ?

One of the jobs of the Data Engineer is to be able to collect and store data, so that her fellow Data Scientists and Data Analysts will have enough materials to work efficiently.

The original dataset is interesting, but we need more data about the songs to do an interesting and insightful analysis and/or modelization of the users behaviour.

### How ?

Luckily for us, in this case there is a simple way to gather data about YouTube videos: the Youtube API!

You now know what is web API and how to make your Python program interact with it. However, the volume of the dataset (more than 630,000 songs!) will force us to be creative in the difficulties we will encounter... Let's go!

----

### First steps

Before anything, we need to have a list of all the songs IDs of the dataset. To achieve this, we will quickly load the data in a `pandas` dataframe, and extract all the unique values in the `song` column.

In [1]:
# Will be necessary to read the dataset from a public file stored on S3 from pandas
!pip install s3fs



In [2]:
# TODO: import pandas
### BEGIN STRIP ###
import pandas as pd
### END STRIP ###

In [3]:
# TODO: Load the 'youtube-playlog.csv' dataset in a pandas dataframe `df`
# NOTE: it may take some time to complete...
### BEGIN STRIP ###
df = pd.read_csv('s3://full-stack-bigdata-datasets/Big_Data/youtube_playlog.csv')
### END STRIP ###

In [4]:
# TODO: create a list `song_ids` containing all the IDs of songs contained in the dataset
# WARNING: we don't want duplicates!
### BEGIN STRIP ###
song_ids = df['song'].unique().tolist()
### END STRIP ###

In [5]:
# TODO: how many songs (or video IDs) do we have in our list?
### BEGIN STRIP ###
len(song_ids)
### END STRIP ###

631348

## 2. Extract data from YouTube Data API


The [Youtube Data API](https://developers.google.com/youtube/v3/docs/videos/list) allows us to get information such as statistics about play logs, details on the video category etc...

### Quick maths

Before proceeding, answer these question (using Python as a calculator).

Considering we need to fetch data about more than 630,000 videos, and that a single API call can fetch information about 50 videos max, 

**1. how many API calls (roughly) would be necessary to gather all the data we want?**

Knowing that a single API call to YouTube takes about 2 seconds in average to complete, 

**2. how much time is necessary to fetch all the extra data ? Is it doable in a class session like today?**

In [6]:
# Calculate the 2 variables `required_api_calls` and `time_in_hours`
### BEGIN STRIP ###
required_api_calls = len(song_ids) // 50
time_in_seconds = required_api_calls * 2
time_in_hours = time_in_seconds / 3600
### END STRIP ###

print(f'We need to do {required_api_calls} API calls to fetch all the data.')
print(f'It would take about {time_in_hours:.1f} hours to complete.')

We need to do 12626 API calls to fetch all the data.
It would take about 7.0 hours to complete.


-----

### A brief introduction to concurrency

To avoid spending the entire day waiting for the API calls to do their job, we will leverage a very important concept in Software and Data Engineering: **concurrency** — [Wikipedia](https://en.wikipedia.org/wiki/Concurrency_(computer_science)).

In a nutshell, **concurrency** allows us to run commands _asynchronously_ : launching the next command whereas the previous one has not terminated yet.

#### Two differents types of commands

In a program, a command is considered either being **CPU-bound** — requiring lots of processing and calculation — or **I/O-bound** — having to wait for external inputs or outputs — to finish. This distinction is very important, as it will guide you in choosing the best tool for writing asynchronous code.

In our case here, the command is definitely not CPU intensive, it just has to wait for the YouTube API to respond to our request. Hence, our problem here is said to be **I/O bound**.

In Python, there is an historical way to handle concurrency with I/O-bound tasks, with threads. Since Python 3.5 however, the `asyncio` package has joined the standard library.

Today, we don't ask you to build this asynchronous program yourself, as it would require too much of extra knowledge. 

**However, we strongly recommend you to 1) read the code below and the functions it uses, and 2) dig into this topic of asynchronous Python programming when you have some time.**

### API quotas

Another problem is raising up: by default, **Google gives you a credit of 10,000 unit-calls a day** on their YouTube Data API. Depending on the amount of data you want to fetch, each API call will cost you more or less units. 

Here is the official [quota calculator](https://developers.google.com/youtube/v3/determine_quota_cost). In our case, we need to call the `list` method of the `videos` resource. 



Considering that we would like to grab data about the following parts: `snippet`, `contentDetails`, `status`, `statistics` and `topicDetails`, 

**1. how much units does a single call cost?**

**2. Hence, how many of such calls can we make in a one day before reaching the limit sets by Google?**

Considering now that you can grab data from 50 videos per call, 

**3. how many videos can you extra data from?**

In [7]:
# TODO: Open the Quota calculator and find the cost of our API call, knowing all the data we want to fetch.
# Then, calculate the `max_calls` and the `max_videos_count` variables
### BEGIN STRIP ###
COST = 11
QUOTA = 10000
max_calls = QUOTA // COST
max_videos_count = max_calls * 50
### END STRIP ###

print(f'We can perform {max_calls} calls a day before reaching the API limit.')
print(f'That means we can fetch extra data for {max_videos_count} videos per day.')

We can perform 909 calls a day before reaching the API limit.
That means we can fetch extra data for 45450 videos per day.


---

As you just calculate, you won't be able to extract the data for all the 630,000+ videos listed in the dataset in a single day. 

Today, you are only going to fetch data from 5,000 videos. Tomorrow, when you need to work with the rest of the data, you could do so by accessing all the data we've fetched for you, and stored on S3. You're welcome :)

But before that, learn how to do it yourself! If you can fetch data from 40,000 videos, fetching data from 630,000 is the same... You would just need to wait 20 days to comply with the API quotas!

When you are done, all you need to do is to push the data you just extracted as a JSON file to an S3 bucket.

In [8]:
# TODO: Slice the `song_ids` list to keep only 5,000 songs ids: `songs_sample`
### BEGIN STRIP ###
songs_sample = song_ids[:50000]
### END STRIP ###

### Get a YouTube API key

Create your personnal YouTube API key by following this [guided tutorial](https://developers.google.com/youtube/v3/getting-started)

When you get your key, export it as an environment variable in your current shell by running the following cell:

In [9]:
# TODO: write your own API key
YOUTUBE_API_KEY = 'AIzaSyDDKKNgq__t6jeJE5jTdsFh5vFBOZ6PQrI'

import os
os.environ['YOUTUBE_API_KEY'] = YOUTUBE_API_KEY

Please, note that in the real world, you would **never** write an API key directly in your code!

### Making asynchronous calls to the YouTube API

We wrote for you the actual code that will download and save the data fetched from the API. All you need to do is to have a look at the function `fetch_all` from the module `src.data_collection` and then to import it.

**NOTE** If your python version is too low (< 3.5) you probably need to use the function `alt_fetch_all()` instead of `fetch_all()`.

In [10]:
# TODO: after reviewing the `fetch_all` function, import it to your notebook
from data_collection import fetch_all

Again, we don't ask you to write the following code yourself, as it is beyond the scope of this course. However, we strongly encourage you to dig the topic of concurrent programming in Python.

If you don't understand what this code does in details, that's OK. Read carrefully the comments.

In [11]:
import asyncio
loop = asyncio.get_event_loop()

task = loop.create_task(fetch_all(songs_sample, dry_run=False))

[2022-01-05 12:14:12 Paris, Madrid]	INFO	Requesting data for 50000 videos. Please wait...	(data_collection)
[2022-01-05 12:14:13 Paris, Madrid]	DEBUG	API call succeeded.	(data_collection)
[2022-01-05 12:14:13 Paris, Madrid]	DEBUG	API call succeeded.	(data_collection)
[2022-01-05 12:14:13 Paris, Madrid]	DEBUG	API call succeeded.	(data_collection)
[2022-01-05 12:14:13 Paris, Madrid]	DEBUG	API call succeeded.	(data_collection)
[2022-01-05 12:14:13 Paris, Madrid]	DEBUG	API call succeeded.	(data_collection)
[2022-01-05 12:14:13 Paris, Madrid]	DEBUG	API call succeeded.	(data_collection)
[2022-01-05 12:14:13 Paris, Madrid]	DEBUG	API call succeeded.	(data_collection)
[2022-01-05 12:14:13 Paris, Madrid]	DEBUG	API call succeeded.	(data_collection)
[2022-01-05 12:14:13 Paris, Madrid]	DEBUG	API call succeeded.	(data_collection)
[2022-01-05 12:14:13 Paris, Madrid]	DEBUG	API call succeeded.	(data_collection)
[2022-01-05 12:14:13 Paris, Madrid]	DEBUG	API call succeeded.	(data_collection)
[2022-01-05 

☝️ **Wait for a log message `Done! Fetched data...` to show up before continuing! It may take 10-30 seconds depending on your Internet connection.**

**NOTES**

1 - Did you notice the timestamps of the debug log messages? Thank to concurrency we are able to perform a dozen of API calls per second!

2 - Have you heard of logging before? It's an important feature to write maintainable code! Check out the resource at the end of this notebook for further information

Now that the asynchronous calls are done, we need to extract the result. You can do so by calling the `result()` method on your [task](https://docs.python.org/3/library/asyncio-task.html#asyncio.Task) object. Store it into a `result` variable.

When you have your `result` variable, dump its content into a `json_result` variable: we need th result to be stored as a JSON string.

In [12]:
import json
# TODO: call the `.result(...)` method on the `task` object.
# Store the result of our API calls in `result` variable
### BEGIN STRIP ###
result = task.result()
### END STRIP ###

# TODO: print how many items are inside the `result` list?
### BEGIN STRIP ###
print(len(result))
### END STRIP ###

HTTPTimeoutError: Timeout in request queue

In [16]:
print(result)

NameError: name 'result' is not defined

As you saw, `result` is a list of objects returned by each API calls.  These objects are represented as Python dictionnary, and really big! 

Display the first dictionary of the list.

In [16]:
# TODO: display the first item to have a feel of the content. The rest is the same, but for other videos.
### BEGIN STRIP ###
result[0]
### END STRIP ###

{'kind': 'youtube#videoListResponse',
 'etag': '1aMNqcJ3-XytgWjIA7QkDUlHHHs',
 'items': [{'kind': 'youtube#video',
   'etag': 'OV2vqB0CW-MV73zLNAxyZ524bds',
   'id': 't1l8Z6gLPzo',
   'snippet': {'publishedAt': '2013-07-22T12:09:11Z',
    'channelId': 'UCUERSOitwgUq_37kGslN96w',
    'title': 'VOLO. "L\'air d\'un con"',
    'description': 'Enregistré et mixé par Cyrille PELTIER au "Keen Studio" à Tours.\nMerci à Cyrille PELTIER et au "Keen Studio" à Tours.\nToutes les dates de la tournée sur volo.fr\nLe nouvel album "Sans rire" est sorti le 11 mars 2013.\nVolo était à l\'Olympia le 29 avril dernier et est en tournée dans toute la France.\nLe clip de "Toujours à côté" est sur youtube sur http://www.youtube.com/watch?v=cWCH2dpyw1c : Le premier extrait de l\'album est en playliste sur France Bleue, Europe 1, RFM, RFI, Radio Alouette.\n"Sans rire", autre extrait de l\'album, est en playlist sur Virgin Radio.\n\nSortie digitale de "Toujours à côté"\nhttps://itunes.apple.com/fr/album/toujours

In [20]:
# TODO: dumps the `result` in a JSON string, `json_result`
### BEGIN STRIP ###
json_result = json.dumps(result)
### END STRIP ###

## 3. Store the data in S3

We now have a JSON string containing extra data from 5000 videos. We now need to store this data in S3, in order to make it available for future computations.

You are going to store the data under the key `youtube/{{ your-name }}/songs.json`. 

For instance, if you are George Abitbol, first, whoooaaa 😳, then your storage key is `youtube/george-abitbol/songs.json`.

#### Reminder

You probably need the `boto3` [documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#bucket) related to S3 buckets. 

To put your AWS credentials directly in our s3 object, we recommend using a `boto3.Session()` object. Check out [this page](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/session.html) for a quick explanation.

In [17]:
import os
import logging

import boto3

# Avoid DEBUG messages from Boto3. You don't have to understand ;-)
logging.getLogger('botocore').setLevel(logging.INFO)
logging.getLogger('boto3').setLevel(logging.INFO)

In [18]:
# TODO: set up your AWS credentials
# NOTE: This is NOT how you should do that in the real world...
ACCESS_KEY_ID = 'YOUR_ACCESS_KEY'
SECRET_ACCESS_KEY = 'YOUR_SECRET_ACCESS_KEY'

# TODO: create a `boto3.Session` object containing your AWS credentials
### BEGIN STRIP ###
session = boto3.Session(
    region_name='eu-west-3',  # Datacenters located in Paris, FR
    aws_access_key_id=ACCESS_KEY_ID, 
    aws_secret_access_key=SECRET_ACCESS_KEY
)
### END STRIP ###

In [21]:
# TODO: create an `s3` resource object
### BEGIN STRIP ###
s3 = session.resource('s3')
### END STRIP ###

# TODO: create the bucket object representing the bucket `jedha-cloud-storage`
### BEGIN STRIP ###
bucket = s3.Bucket('jedha-cloud-storage-157')
### END STRIP ###

# TODO: As you did in the previous exercice, store your `json_result` object under your own storage key
### BEGIN STRIP ###
bucket.put_object(
    ACL='private', 
    Body=json_result.encode(), # See docs: `Body` needs to be bytes
    Key='youtube/george-abitbol/songs.json'
)
### END STRIP ###

s3.Object(bucket_name='jedha-cloud-storage-157', key='youtube/george-abitbol/songs.json')

Depending on your Internet connection, this may take some time... Uploading data big amount of data in the Cloud is not free!

---

## 4. Summing up

In this exercice, you have learned:
- How to read a public API documentation
- The concept of concurrent programming, to improve performance
- How to create a `boto3` session object to explicitely store your AWS credentials
- How to upload a JSON data structure to S3

Not too bad!

![travis](https://media.giphy.com/media/LYDNZAzOqrez6/giphy.gif)

## 5. Extra resources

About `asyncio` et asyncrhonous programming in Python:

- [asyncio: A complete walkthrough](https://realpython.com/async-io-python/)
- [Making 1 million requests with Python](https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html)

About logging

- [What is logging?](https://en.wikipedia.org/wiki/Log_file)
- [Logging in Python](https://docs.python.org/3/library/logging.html)