# Data Extraction
All the heavy lifting for data extraction happens here. To avoid bloating the notebook, we use functions declared in `src/data_acquisition.py`. Those functions read the `API_KEY` defined in the `.env` file at the root of the project. If the `API_KEY` is not there, the code may fail.
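
To catch a missing key before any request is made, a pre-flight check along these lines can help. This is a minimal sketch that uses only the standard library; `read_env_value` and `require_api_key` are hypothetical helpers, not part of `src/data_acquisition.py`, and the default `.env` location is an assumption.

```python
import os
from pathlib import Path


def read_env_value(env_path, key):
    """Return the value for `key` in a simple KEY=VALUE file, or None if absent."""
    path = Path(env_path)
    if not path.is_file():
        return None
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            # Drop optional surrounding quotes; treat an empty value as missing.
            return value.strip().strip("'\"") or None
    return None


def require_api_key(env_path=".env"):
    """Raise a readable error early instead of failing inside an API call."""
    api_key = os.environ.get("API_KEY") or read_env_value(env_path, "API_KEY")
    if not api_key:
        raise RuntimeError(f"API_KEY not found in the environment or in '{env_path}'.")
    return api_key
```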

The notebook is organized into the following steps:
1)  Imports and Preliminary work
2)  Extract YouTube Channel ID based on handle
3)  Extract YouTube Uploads Playlist based on Channel ID
4)  Extract YouTube Videos based on YouTube Playlist ID
5)  Extract YouTube Comments & Replies based on YouTube Video List
6)  (Optional) Extract YouTube Playlists based on Channel ID

If you are continuing the fetch of comments from `<channel_handle>_videos.json`, there is no need to do steps 2, 3, 4 or 6. Executing 1 and 5 is enough.

# IMPORTANT
- Remember to configure your desired channel handle in `config.py` so that the other notebooks also have access to the same YouTube handle. The handle is used to identify the data produced by the project.

## 1) Imports and Preliminary Work
We need some basic libraries for file handling, pathing, and time. Since we are importing functions from other parts of the project, for example `src/data_acquisition.py`, we need to add the root directory to the current scope of imports of the notebook.

In [1]:
import os
import sys
from datetime import date

# load project directory to path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))
from src.data_acquisition import *
from src.utils import *
from paths import Paths
import config

# show Channel Handle
print(f"Working with the handle '{config.channel_handle}'.")

Working with the handle 'kurzgesagt'.


Some of the generated files will have a name like `file_25_08_01.ndjson`. For that, we provide a `Paths` class that handles the naming for us. We pass it the channel handle, the date object, and the base directory:

**1) channel_handle**:
> This doesn't necessarily have to be the same as the real channel handle; it is used for naming conventions only.

**2) date_obj**:
> Some files are generated per day; if the extraction takes multiple days, the program will generate multiple files. It defaults to today, but a specific date can be passed, for example `date_path = date(2025, 8, 1)`.

**3) base_dir**:
> You can specify the directory to save the files to. By default, it points to the root directory of the current project. All files will be saved to `base_dir/data/raw` for raw files and `base_dir/data/processed` for processed files.

In [2]:
date_path = date.today()
channel_paths = Paths(channel_handle=config.channel_handle, date_obj=date_path)
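
To make the date-based naming concrete, here is a minimal sketch of how a dated file name could be built. It is not the project's `Paths` class: `dated_raw_file` and the `<handle>_<suffix>_<date>` scheme are assumptions chosen only to reproduce the `25_08_01` date stamp shown above.

```python
from datetime import date
from pathlib import Path


def dated_raw_file(base_dir, channel_handle, date_obj, suffix="comments"):
    """Build a path such as <base_dir>/data/raw/<handle>_<suffix>_25_08_01.ndjson."""
    stamp = date_obj.strftime("%y_%m_%d")  # two-digit year, month, day
    return Path(base_dir) / "data" / "raw" / f"{channel_handle}_{suffix}_{stamp}.ndjson"


print(dated_raw_file(".", "kurzgesagt", date(2025, 8, 1)).name)
# kurzgesagt_comments_25_08_01.ndjson
```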

## 2) Extract YouTube Channel ID based on handle
Most of the API calls use the Channel ID to complete the queries, so we need to make a request to the YouTube Data API v3 to get it. The Channel ID of a channel is an alphanumeric string; for example, the Channel ID for the handle `@smalin` is `UC2zb5cQbLabj3U9l3tke1pg`. The channel handle appears with an `@` symbol on the YouTube profile. You can also open `youtube.com/@smalin` in a browser to confirm that it is indeed the handle.

In [3]:
channel_id = get_channel_id(config.channel_handle)
print(f"The Channel ID for the YouTube handle '{config.channel_handle}' is {channel_id}.")
# channel_id = "UC2zb5cQbLabj3U9l3tke1pg" # smalin

# UCmAOuA1OIofz5XYzARmKkfQ for yeju
# UCsXVk37bltHxD1rDPwtNM8Q for Kurzgesagt

The Channel ID for the YouTube handle 'smalin' is UC2zb5cQbLabj3U9l3tke1pg.
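
The implementation of `get_channel_id` is not shown in this notebook. As a rough sketch of what such a lookup can look like, the YouTube Data API v3 `channels.list` endpoint accepts a `forHandle` parameter. The helper names below are hypothetical, and the code only builds the request URL and parses a response; it does not perform the HTTP call.

```python
from urllib.parse import urlencode

CHANNELS_ENDPOINT = "https://www.googleapis.com/youtube/v3/channels"


def build_channel_lookup_url(handle, api_key):
    """Build a channels.list request URL that resolves a handle to a channel."""
    params = {"part": "id", "forHandle": handle, "key": api_key}
    return f"{CHANNELS_ENDPOINT}?{urlencode(params)}"


def parse_channel_id(response_json):
    """Return the first channel ID in a channels.list response, or None."""
    items = response_json.get("items") or []
    return items[0]["id"] if items else None
```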


## 3)  Extract YouTube Uploads Playlist based on Channel ID
To get all the public videos that the channel has uploaded, we need to fetch the channel's Uploads Playlist.

In [4]:
uploads_playlist_id = get_channel_uploads_playlist(channel_id)
print(f"The Uploads Playlist for the YouTube handle '{config.channel_handle}' is {uploads_playlist_id}.")

The Uploads Playlist for the YouTube handle 'smalin' is UU2zb5cQbLabj3U9l3tke1pg.


For `smalin`, the uploads playlist was `UU2zb5cQbLabj3U9l3tke1pg`. You can also compare against the YouTube Channel ID `UC2zb5cQbLabj3U9l3tke1pg`:
- `UU2zb5cQbLabj3U9l3tke1pg`
- `UC2zb5cQbLabj3U9l3tke1pg`

They are almost the same string, but the Channel ID starts with `UC` and the Uploads Playlist ID starts with `UU`. If you want to avoid calling this method, you can build the Uploads Playlist ID manually by replacing the `UC` prefix of the Channel ID with `UU`.
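
The prefix swap can be sketched in a few lines. `uploads_playlist_from_channel_id` is a hypothetical helper, and it relies on the `UC`/`UU` pattern observed above rather than on a guarantee from the API.

```python
def uploads_playlist_from_channel_id(channel_id):
    """Derive the uploads playlist ID by swapping the 'UC' prefix for 'UU'."""
    if not channel_id.startswith("UC"):
        raise ValueError(f"Unexpected channel ID format: {channel_id!r}")
    return "UU" + channel_id[2:]


print(uploads_playlist_from_channel_id("UC2zb5cQbLabj3U9l3tke1pg"))
# UU2zb5cQbLabj3U9l3tke1pg
```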

## 4)  Extract YouTube Videos based on YouTube Playlist ID
We save the videos of the given playlist (by default, the Uploads Playlist) to a JSON file at `base_dir/data/raw/<channel_handle>_videos.json`. The file lists every video in a format like this:
```json
[{
    "videoId": "21hbsC7nYwk",
    "done": false,
    "nextPageToken": null
}, ...]
```

This structure is useful if we have to distribute data extraction across multiple days. If the YouTube API quota is exceeded, we have to wait until the quota is reset before continuing. Videos that were completely processed are marked as **done**, and videos that started processing but did not finish store the `nextPageToken`, so that on the next day of extraction we can resume where we left off.

If the file already exists, the function will log an error. To overwrite the file, set the `overwrite` argument to `True` <span style="color: red; font-weight: bold;">⚠️BUT BE CAREFUL, THIS WILL COMPLETELY ERASE ANY EXTRACTION PROGRESS</span>.

In [None]:
# uploads_playlist_id = "UUsXVk37bltHxD1rDPwtNM8Q" # for kurzgesagt
save_playlist_videos(uploads_playlist_id, channel_paths.videos_file_path)

ERROR:root:File 'e:\Conda-Projects\Jupyter notebooks\youtube-nlp\data\raw\smalin_videos.json' already exists. Set overwrite=True to allow overwriting.


## 5)  Extract YouTube Comments & Replies based on YouTube Video List
Now for the big part: extracting comments. The function doesn't take any IDs this time. **THIS FUNCTION WORKS FROM THE `data/raw/<channel_handle>_videos.json` FILE**. The file is strictly necessary; if it doesn't exist, you will get an error. If you want comments for a single video, you can create the file at the specified path with the following structure:

```json
[{
    "videoId": "<your_video_id>",
    "done": false,
    "nextPageToken": null
}]
```

You can get the Video ID by opening the desired video in the browser and looking at the address: in `youtube.com/watch?v=21hbsC7nYwk`, the Video ID is `21hbsC7nYwk`, the part that comes after `/watch?v=` and before any `&`. Sometimes the address will contain more, as in `watch?v=21hbsC7nYwk&list=RD21hbsC7nYwk&start_radio=1&ab_channel=smalin`. The `&list=RD21hb...` part is just additional URL parameters; the Video ID is still the same.
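
If you prefer not to pick the ID out by eye, the standard library can parse it. This is a minimal sketch; `extract_video_id` is a hypothetical helper and only handles `watch?v=` style addresses like the ones above.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url):
    """Return the value of the `v` query parameter of a watch URL, or None."""
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    return values[0] if values else None
```

For example, `extract_video_id("youtube.com/watch?v=21hbsC7nYwk&list=RD21hbsC7nYwk&start_radio=1")` returns `21hbsC7nYwk`.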


This function goes through every video in `<channel_handle>_videos.json` and extracts comments from all the videos that have `"done": false`. When a video is finished, it is marked as `"done": true`. If a video is interrupted, either because of an error or because the quota was exceeded, the `nextPageToken` is saved to tell the program where to continue from. This is useful when a video has so many comments that a single run is not enough.
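
The resume behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the code of `save_all_videos_comments`: `fetch_page` is a hypothetical callable that returns `(comments, next_page_token)`, and `QuotaExceeded` is a stand-in exception for the quota error.

```python
class QuotaExceeded(Exception):
    """Stand-in for the API's quota error (hypothetical)."""


def process_videos(videos, fetch_page):
    """Walk the video list, skipping finished entries and resuming partial ones."""
    collected = []
    for video in videos:
        if video["done"]:
            continue  # already fully extracted in an earlier run
        token = video["nextPageToken"]  # None means start from the first page
        try:
            while True:
                comments, token = fetch_page(video["videoId"], token)
                collected.extend(comments)
                video["nextPageToken"] = token  # remember where to resume
                if token is None:
                    video["done"] = True  # no more pages for this video
                    break
        except QuotaExceeded:
            break  # keep the saved token and stop processing for today
    return collected
```

After a run, the updated `videos` list would be written back to the JSON file so that the next run starts from the saved tokens.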


### IMPORTANT
The `nextPageToken` is an opaque token: it doesn't carry any information that is meaningful to us, and YouTube uses it internally to handle pagination. <span style="color: red; font-weight: bold;">BUT IT CAN CHANGE</span>. A `nextPageToken` will not be valid forever, so it is better not to wait too long between runs. If you could not finish a video today, plan to continue tomorrow, because in two days the token may no longer be valid. **IF A VIDEO'S `nextPageToken` BECOMES INVALID YOU HAVE TO REPROCESS THE ENTIRE VIDEO**. This is a limitation of the API that, unfortunately, we have to work around on our end.

A `nextPageToken` can change for numerous reasons. For example, when someone posts a new comment, the comments may shift and a new set of pagination tokens may be needed. It is not confirmed that the token changes for this specific reason; just keep in mind that **IT CAN CHANGE**. A `nextPageToken` may also become invalid in the middle of a fetch, not just across days.


For now, if you get page token errors, you can open the `<channel_handle>_videos.json` file, set `nextPageToken` to null, and make sure that `done` is set to false:
```json
[{
    "videoId": "<video_id>",
    "done": false, // keep this set to false
    "nextPageToken": "<next_page_token>" // replace this value with null
}]
```

A functionality to handle this issue programmatically is planned for the future.
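
Until that exists, the manual edit can be scripted. The following is a minimal sketch, not a function provided by the project: `reset_page_token` is a hypothetical helper that assumes the file structure shown above. Whether comments already saved for that video end up duplicated after reprocessing is not covered here.

```python
import json


def reset_page_token(videos_file_path, video_id):
    """Clear the saved page token for one video so it restarts from page one."""
    with open(videos_file_path, "r", encoding="utf-8") as f:
        videos = json.load(f)
    changed = False
    for video in videos:
        if video["videoId"] == video_id:
            video["done"] = False
            video["nextPageToken"] = None
            changed = True
    if changed:
        # Only rewrite the file when the video was actually found.
        with open(videos_file_path, "w", encoding="utf-8") as f:
            json.dump(videos, f, indent=4)
    return changed
```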

### LOGGING
By default, the function logs after every video. This is convenient when a single video may contain numerous comments. To log less often, use the `log_every_count` argument to set how many videos are processed between log messages. If the videos have numerous comments, it is better to leave it at 1.

In [8]:
save_all_videos_comments(channel_paths.videos_file_path, channel_paths.raw_comments_file_path, debugging=False, log_every_count=1)

INFO:root:Comments fetch initialized...
INFO:root:Videos list from e:\Conda-Projects\Jupyter notebooks\youtube-nlp\data\raw\kurzgesagt_videos.json loaded successfully.
INFO:root:Video gtDKKJq9u30 processed. Finished: True. Comments: 7323, Replies: 1658
INFO:root:Success. Skipped: 301, Processed: 1, Comments: 7323, Replies: 1658. (46.41s)
INFO:root:Videos saved successfully, videos count: 1, comments & replies: 8981


We can check the progress for our videos JSON with the following function:

In [1]:
result = get_videos_progress(channel_paths.videos_file_path)
result

NameError: name 'get_videos_progress' is not defined
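
As a rough illustration of what a progress check over the videos JSON could compute, here is a minimal sketch. It is not the project's `get_videos_progress`, whose return value is not shown here; `summarize_progress` is a hypothetical helper that only counts entries by state.

```python
import json


def summarize_progress(videos_file_path):
    """Count finished, partially processed, and untouched videos."""
    with open(videos_file_path, "r", encoding="utf-8") as f:
        videos = json.load(f)
    done = sum(1 for v in videos if v["done"])
    # Partially processed: not done, but a page token was saved.
    in_progress = sum(1 for v in videos if not v["done"] and v["nextPageToken"])
    return {
        "total": len(videos),
        "done": done,
        "in_progress": in_progress,
        "pending": len(videos) - done - in_progress,
    }
```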

### Add a video manually
You can add a video manually if you have its YouTube Video ID. Pass the path of the videos JSON file that you want to append to and the Video ID, and the function will handle the insertion for you; there is no need to modify the file manually. Be careful: the function accepts any video from any channel, since it doesn't check whether the video belongs to the current channel. USE IT AT YOUR OWN RISK.

In [6]:
add_video(channel_paths.videos_file_path, 'gtDKKJq9u30')

INFO:root:Success: Video gtDKKJq9u30 added.


## 6)  (Optional) Extract YouTube Playlists based on Channel ID
Since a video list can be extracted for any playlist (see step 4), it can be useful to fetch all the YouTube playlists of the specified YouTube Channel ID. You can also look up a Playlist ID in the URL of a YouTube playlist, or of a video that is being played as part of a playlist. In the URL, the playlist is identified by `list=` followed by something like `PLFB3B01978B3A6D7C` (more recent ones look like `PLtj_HurkS7Zx9aPdiCyf9GI5gC7qsmLwF`); user-created playlist IDs start with `PL`. Remember not to include the `&` symbol or anything after it.

This function fetches **all channel playlists**, that is, all the playlists that the user created. It doesn't list the Uploads Playlist, because that one is not user created. It writes all playlists to a JSON file at `base_dir/data/raw/<channel_handle>_playlists.json`, which will look something like this:
```json
[
    {
        "id": "PLFB3B01978B3A6D7C",
        "title": "My Top Videos",
        "description": ""
    },
    ...
]
```

From this file you can look up the ID of the playlist that you want to extract comments from. Remember that you have to create the `videos.json` for that playlist first, and then extract comments from that specific videos file. This function is mostly informational, for readers who want to work with a specific playlist; it doesn't contribute anything significant to the main flow of this notebook.

In [None]:
# save_channel_playlists(channel_id, channel_paths.playlists_file_path)

# FAQ
- **I've got a `quotaExceeded` error but have not used 10,000 units today.**

A: This sometimes happens. For example, `quotaExceeded` has been triggered at 9897 units. Once the notebook gets a `quotaExceeded` error, it finishes processing and saves the progress. From there, it is recommended to continue on a different day.

- **How do I identify the videos that have comments disabled?**

A: With the endpoints that the notebook uses, there is no way to tell in advance which videos have comments disabled. If a video has comments disabled, it will be reported as an error, and processing will continue with the rest of the videos.

- **Why is there so little information about the video itself?**

A: The original scope of this project was NLP on YouTube comments, so we only use the YouTube endpoints for comments, not those for video metadata. Metadata could be added, but that would change the scope of the project significantly, since the extra requests needed to retrieve video metadata would consume part of the same daily quota.

- **How much content can I extract with the basic quota of 10,000 units?**

A: In the tests done during the development of this project, the notebook extracted about 300,000 comments per 10,000 units (roughly 30 comments per unit). There isn't an exact number, and it is hard to estimate, since it depends on how many comments there are and how many replies those comments have. The notebook doesn't use the replies endpoint if a top-level comment has 5 replies or fewer. If it has more, it makes a call to the replies endpoint.

- **How long does the notebook take to finish?**

A: In the tests done during development, we ran the notebook for about 1 hour every day, at which point the quota was almost exhausted.

- **Does the notebook handle new comments?**

A: If the video has not been processed, yes. The notebook doesn't distinguish new comments; it fetches all comments available for that video on the day the notebook is run. If the video has already been processed and a new comment is posted afterwards, the notebook doesn't pick up the new comment.

- **Does the notebook handle new videos?**

A: No. You would have to get a new list of videos, but doing so will erase any extraction progress. The notebook may handle this case in the future. For now, if you want to add a video, you can use `add_video` as described in the section on adding a video manually, or edit the file by hand, respecting the required format.
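
The reply-handling rule from the quota answer above can be sketched as a small decision function. This is an illustration, not the notebook's code: the field names follow the `commentThread` resource of the YouTube Data API v3, the threshold of 5 is the one stated in that answer, and both helper names are hypothetical.

```python
REPLIES_INLINE_LIMIT = 5  # threshold stated in the quota answer above


def needs_replies_call(thread):
    """Return True when a thread has more replies than the inline threshold."""
    return thread["snippet"]["totalReplyCount"] > REPLIES_INLINE_LIMIT


def inline_replies(thread):
    """Return the replies embedded in the thread itself, if any."""
    return thread.get("replies", {}).get("comments", [])
```

Threads for which `needs_replies_call` returns `False` can be served entirely from `inline_replies`, which saves one request per thread.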