# THIS PROJECT

This project is a YouTube comments analyzer, with the following tasks:
- YouTube extraction
- Cleaning
- Enriching
- Profiling
- Basic statistics
- Word Cloud
- n-grams
- Language detection with langdetect
- Sentiment classification with VADER

The notebooks are structured as follows:
- **01_data_acquisition**:

The main notebook, responsible for data extraction. If this is your first time exploring the project, start here, since all later analyses depend on the extracted data.

- **02_1_data_cleaning**:

Cleaning, reformatting, renaming, and null filling. This notebook tidies the data obtained from the API and saves the result to Parquet.

- **02_2_enriched_columns**:

Adds columns that are too heavy to keep in a single file, such as tokens with stopwords, tokens without stopwords, emojis, mentions, and hashtags.

- **02_3_pandas_profiling**:

Basic understanding of some of the columns from the clean dataset (not the enriched ones) with pandas profiling.

- **03_1_cloud_of_words**:

Draws a word cloud from the tokens without stopwords in the enriched dataset.

- **03_2_cloud_of_emojis**:

Draws an emoji cloud from the emojis column of the enriched dataset.

- **03_3_comment_statistics**:

Basic descriptive analysis of some of the columns obtained from the clean dataset.

- **03_4_ngrams**:

N-grams and word clouds (built from the n-grams), using the tokens with stopwords from the enriched dataset.

- **03_5_1_language_detection**:

Language detection with langdetect. This task uses the original comment from the clean dataset.

- **03_5_2_language_detection_notes (Optional)**:

Some notes regarding language detection and CPU performance.

- **03_6_1_sentiment_analysis**:

Sentiment analysis using VADER sentiment analyzer.

- **03_6_2_sentiment_analysis_statistics**:

Basic descriptive statistics with the sentiment of each column.

- **03_6_3_sentiment_analysis_notes (Optional)**:

Some notes regarding sentiment analysis with VADER and CPU performance.
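
As an illustration of what the **03_4_ngrams** notebook computes, here is a minimal sketch of n-gram counting using only the standard library. The `ngrams` helper and the sample token lists are hypothetical and are not taken from the notebooks, which may use a different approach.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical tokenized comments, stopwords kept (as in the enriched dataset).
comments = [
    ["i", "love", "this", "video"],
    ["i", "love", "this", "channel"],
]

# Count bigrams across all comments.
bigram_counts = Counter(bg for tokens in comments for bg in ngrams(tokens, 2))
print(bigram_counts.most_common(3))
```

The resulting counts can be fed to a word cloud as frequencies, with each n-gram joined into a single string.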


# NECESSARY CONFIGS
For the project to work correctly, you need to configure two things:
1) API KEY
- Go to the YouTube developer console and register a new API key. You can follow this tutorial: https://www.youtube.com/watch?v=brCkpzAD0gc
- In the `.env` file at the root of the project there is the variable `api_key = ""`. Place the generated API key within the quotation marks, like this: `api_key = "<your_api_key_here>"`
2) Configure YouTube channel's handle
- Go to the YouTube channel that you want to analyze and look for a name starting with "@" on the profile page.
- Open the file `config.py` at the root of the project and place the channel handle there, without the "@".
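
To check both settings before running the notebooks, a small script like the one below may help. It is only a sketch: `read_api_key` and `normalize_handle` are hypothetical helpers, not part of this project, and the parser assumes the `.env` file uses the `api_key = "..."` form shown above.

```python
from pathlib import Path


def read_api_key(env_path=".env"):
    """Return the value of `api_key` from a .env-style file, or "" if absent."""
    path = Path(env_path)
    if not path.exists():
        return ""
    for line in path.read_text(encoding="utf-8").splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == "api_key":
            # Drop surrounding whitespace and quotation marks.
            return value.strip().strip('"').strip("'")
    return ""


def normalize_handle(handle):
    """Strip whitespace and any leading "@" from a channel handle."""
    return handle.strip().lstrip("@")


if __name__ == "__main__":
    if not read_api_key():
        print("api_key is empty: fill it in the .env file first")
    print(normalize_handle("@SomeChannel"))  # hypothetical handle
```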

# API QUOTAS
The YouTube API used in this project has certain usage restrictions. YouTube allocates 10,000 units per day, which are used in the following way:
- Fetching the YouTube channel ID: 1 unit
- Fetching the YouTube channel playlists: 1 unit, max 50 playlists per page
- Fetching YouTube videos from a playlist: 1 unit, max 50 videos per page
- Fetching YouTube comments from comment threads: 1 unit, max 100 comments per page
- Fetching YouTube replies from comments: 1 unit, max 100 replies per page

On average, the extraction process can fetch 300,000 to 500,000 comments in a single day, top-level comments and replies combined. If the YouTube channel you are trying to analyze is relatively big, you may need to run the extraction process over several days. The notebook is built with this in mind.
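
The figures above can be turned into a rough budget estimate. The sketch below only restates the arithmetic from this section (1 unit per page, at most 100 comments or replies per page, 10,000 units per day). The function names are hypothetical, and the estimate ignores the smaller cost of fetching channel IDs, playlists, and videos.

```python
import math

DAILY_QUOTA_UNITS = 10_000  # daily allocation stated above
PAGE_SIZE = 100             # max comments or replies returned per page


def pages_needed(item_count, page_size=PAGE_SIZE):
    """Units spent to fetch item_count items, at 1 unit per page."""
    return math.ceil(item_count / page_size)


def max_items_per_day(avg_items_per_page, quota=DAILY_QUOTA_UNITS):
    """Items fetched in a day if each request returns avg_items_per_page."""
    return quota * avg_items_per_page


print(pages_needed(250))       # 250 comments fit in 3 pages
print(max_items_per_day(100))  # upper bound when every page is full
print(max_items_per_day(30))   # partially filled pages
```

With every page full, the ceiling would be 1,000,000 items per day. The observed 300,000 to 500,000 would correspond to an average of 30 to 50 items per request, which is plausible since many reply pages come back far from full.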

# FUNNY FINDINGS
- To draw an emoji cloud with 1 million emojis, where every emoji is 72x72 pixels and the emojis do not stack too much on top of each other, we would need a resolution of 96000x54000, or 25 times the width and height of 4K (3840x2160).
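
The resolution above can be reproduced with a short calculation. This is a sketch that assumes a 16:9 canvas and emojis tiled side by side with no overlap and no gaps. The function name `canvas_for_emojis` is hypothetical.

```python
import math


def canvas_for_emojis(count, emoji_px=72, aspect=(16, 9)):
    """Smallest canvas of the given aspect ratio whose area fits all emojis."""
    total_px = count * emoji_px * emoji_px
    aw, ah = aspect
    # width / height = aw / ah and width * height >= total_px
    # together give width >= sqrt(total_px * aw / ah).
    width = math.ceil(math.sqrt(total_px * aw / ah))
    height = math.ceil(width * ah / aw)
    return width, height


width, height = canvas_for_emojis(1_000_000)
print(width, height)                 # canvas size in pixels
print(width / 3840, height / 2160)   # scale relative to 4K on each axis
```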