# Scrape YouTube data
This notebook scrapes general information on videos uploaded by multiple news channels between 2021-11-05 and 2021-11-15, along with each video's details and comment section.

## Environment
Import dependencies

In [None]:
import os
from dotenv import load_dotenv
import json
import numpy as np
import pandas as pd
import langid
from youtube import channel, video
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

Get API key from environment variables
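
The cell for this step is not shown here, so the following is only a minimal sketch of one way to do it. It assumes the key is stored in a `.env` file or in the shell environment under the hypothetical name `YOUTUBE_API_KEY`; the actual variable name used in this project may differ.

```python
import os

try:
    # python-dotenv: copies KEY=value pairs from a local .env file into os.environ
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    # Without python-dotenv, only variables already set in the shell are visible
    pass


def get_api_key(name="YOUTUBE_API_KEY"):
    """Return the API key stored in environment variable `name` (hypothetical name)."""
    key = os.getenv(name)
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key
```

Failing loudly on a missing key avoids spending time on requests that would be rejected anyway.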

## Import list of news channels

This data set is a handpicked list of news channels that:
1. Are relevant (top 100 subscribed or viewed channels)
1. Post political content in English
1. Have open comment sections

## 1. Build `channels` table

NOTE @ 2022-03-14: This section has already been executed. Since the requests are expensive, it's best to just load the result.

In [None]:
# Load result
df1 = pd.read_csv('../../dat/channels.csv')
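
The "load the result instead of re-running the requests" pattern used throughout this notebook could be wrapped in a small helper. This is a sketch under assumptions: `load_or_build` and `build_rows` are hypothetical names, and it uses the standard-library `csv` module to stay dependency-free, where the notebook itself relies on `pd.read_csv`.

```python
import csv
from pathlib import Path


def load_or_build(path, build_rows):
    """Return cached rows from `path`; call `build_rows` only if the file is missing."""
    path = Path(path)
    if path.exists():
        with path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    rows = build_rows()  # expensive: this is where the API requests would happen
    if not rows:
        return rows  # nothing to cache
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Note that rows read back from the cache hold every value as a string, so numeric columns need converting after loading.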

## 2. Build `channelVideos` table

NOTE @ 2022-03-15: This section has already been executed. Since the requests are expensive, it's best to just load the result.

### 2.1. Pre-treatment videos
Videos uploaded on or before 2021-11-09

### 2.2. Post-treatment videos
Videos uploaded on or after 2021-11-11 (skip November 10th because the policy was gradually rolled out)
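
The split described in 2.1 and 2.2 can be sketched as a function that maps an upload timestamp to its group. It assumes timestamps are ISO 8601 strings in UTC (such as `2021-11-09T23:59:59Z`) and that the calendar day is taken in UTC; `treatment_period` is a hypothetical name, not a function from this project.

```python
from datetime import date, datetime

FUZZY_DAY = date(2021, 11, 10)  # rollout day, kept separate from both groups


def treatment_period(published_at):
    """Map an ISO 8601 upload timestamp to 'pre', 'fuzzy', or 'post'."""
    # Replace a trailing 'Z' so datetime.fromisoformat accepts it on Python < 3.11
    day = datetime.fromisoformat(published_at.replace("Z", "+00:00")).date()
    if day < FUZZY_DAY:
        return "pre"
    if day > FUZZY_DAY:
        return "post"
    return "fuzzy"
```

Applied to a column of upload timestamps, this gives one label per video, and the `fuzzy` rows can then be dropped or handled separately.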

API quota ran out on `channelId = UCt-WqkTyKK1_70U4bb4k4lQ`.

Export table

Create backup of `videos.csv` table before adding videos posted on fuzzy day
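
A minimal sketch of such a backup step, assuming a plain file copy next to the original is enough. `backup_file` and the `.bak` suffix are hypothetical choices, not taken from this notebook.

```python
import shutil
from pathlib import Path


def backup_file(path, suffix=".bak"):
    """Copy `path` to a sibling file with `suffix` appended and return the new path."""
    src = Path(path)
    dst = src.with_name(src.name + suffix)
    if dst.exists():
        raise FileExistsError(f"Refusing to overwrite existing backup: {dst}")
    shutil.copy2(src, dst)  # copy2 also preserves the modification time
    return dst
```

Refusing to overwrite an existing backup protects the original snapshot if the cell is re-run after new rows have already been appended.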

NOTE @ 2022-08-24: This section has already been executed. Since the requests are expensive, it's best to just load the result.

### 2.3. Fuzzy-day videos

Videos uploaded on 2021-11-10

Add videos posted on 2021-11-10 to `videos.csv` table

In [None]:
# Load result
df2 = pd.read_csv('../../dat/videos.csv')

## 3. Build `videoDetails` table

Get the details of each video (title, description, duration, definition, etc.). These data will be used as controls.

NOTE @ 2022-03-15: This section has already been executed. Since the requests are expensive, it's best to just load the result.

Export table

Mine details of fuzzy-day videos.

Append new details to `videoDetails.csv` backup table

In [None]:
# Load data
df3 = pd.read_csv('../../dat/videoDetails.csv')

## 4. Build `videoComments` table

### 4.1. Mine all comment sections

- Quota Exceeded on `videoId = '_laKJi8Xwh8'`
- Quota Exceeded on `videoId = 'OzRR6ROQ-mA'`
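
Because the quota ran out partway through, mining has to resume where it stopped. One possible shape for such a resumable loop is sketched below; `QuotaExceeded`, `mine_comments`, and `fetch_comments` are hypothetical names standing in for whatever the project's own request code raises and calls.

```python
class QuotaExceeded(Exception):
    """Hypothetical error raised by the fetch function when the daily quota is spent."""


def mine_comments(video_ids, fetch_comments, done=None):
    """Fetch comments for each video not already in `done`.

    Returns (results, stopped_at): `stopped_at` is the videoId on which the
    quota ran out, or None if every video was processed.
    """
    results = dict(done or {})
    for video_id in video_ids:
        if video_id in results:
            continue  # mined in an earlier session
        try:
            results[video_id] = fetch_comments(video_id)
        except QuotaExceeded:
            return results, video_id
    return results, None
```

Passing the partial `results` back in as `done` on the next day continues from the video that triggered the quota error without repeating earlier requests.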

Mine videos uploaded on 2021-11-10

### 4.2. Count and classify relevant comments
Each video's negative comment ratio is defined as
$$ncr_i(h) = \frac
    {\#\{\text{Negative comments} \mid \text{Post time} \leq \text{Upload time} + h\}}
    {\#\{\text{Comments} \mid \text{Post time} \leq \text{Upload time} + h\}}$$
for $i \in \{1, \dots, n\}$ and $h \in \{12, 24, \dots, 72\}$.
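
The definition above can be sketched in code as follows. The function assumes that $h$ is measured in hours and that each comment is given as a `(post_time, is_negative)` pair; both are assumptions made for illustration, as is the name `negative_comment_ratio`.

```python
from datetime import datetime, timedelta


def negative_comment_ratio(upload_time, comments, h):
    """Compute ncr(h) for one video.

    `comments` is an iterable of (post_time, is_negative) pairs.
    Returns None when no comment falls inside the window.
    """
    cutoff = upload_time + timedelta(hours=h)
    in_window = [neg for post_time, neg in comments if post_time <= cutoff]
    if not in_window:
        return None  # the ratio is undefined with zero comments
    return sum(in_window) / len(in_window)


# Toy example: four comments posted 1, 10, 30, and 80 hours after upload
upload = datetime(2021, 11, 8, 12, 0)
comments = [
    (upload + timedelta(hours=1), True),
    (upload + timedelta(hours=10), False),
    (upload + timedelta(hours=30), True),
    (upload + timedelta(hours=80), False),
]
ratios = {h: negative_comment_ratio(upload, comments, h) for h in range(12, 73, 12)}
```

In the toy data, the comment posted 80 hours after upload falls outside every window and is never counted.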

---

Note: No comment may overlap with the 10th (2021-11-10).

Export `videoFlags` table

### 4.3. Count and classify comments (including `2021-11-10`)
Count types of comments **without language check**

Count types of comments **with language check**
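
Both counts can share one routine in which the language check is optional. The sketch below takes the two classifiers as plain callables so that it stays independent of any particular library; `count_comments`, `is_negative`, and `is_english` are hypothetical names.

```python
def count_comments(comments, is_negative, is_english=None):
    """Count total and negative comments, optionally keeping only English ones.

    `is_negative` and `is_english` are callables mapping text to bool; passing
    `is_english=None` reproduces the count without a language check.
    """
    counts = {"total": 0, "negative": 0}
    for text in comments:
        if is_english is not None and not is_english(text):
            continue  # skipped by the language check
        counts["total"] += 1
        if is_negative(text):
            counts["negative"] += 1
    return counts
```

With the libraries imported at the top of this notebook, `is_english` could wrap `langid.classify(text)[0] == 'en'`, and `is_negative` could compare the `compound` value from `SentimentIntensityAnalyzer().polarity_scores(text)` against a threshold. VADER's conventional cutoff for negative text is a compound score of -0.05 or lower; whether this notebook uses that cutoff is an assumption.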

Sample 1000 comments and score them (used to compare VADER's classifications with my own labels)
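
A sketch of the sampling and comparison step, assuming a fixed seed for reproducibility and simple agreement as the comparison metric. Neither choice is confirmed by this notebook, and `sample_comments` and `agreement_rate` are hypothetical names.

```python
import random


def sample_comments(comments, k=1000, seed=0):
    """Draw up to `k` comments without replacement, reproducibly."""
    items = list(comments)
    rng = random.Random(seed)  # local generator, leaves global random state alone
    return rng.sample(items, min(k, len(items)))


def agreement_rate(auto_labels, manual_labels):
    """Share of items on which two equally long label sequences agree."""
    if len(auto_labels) != len(manual_labels):
        raise ValueError("label sequences must have the same length")
    if not auto_labels:
        raise ValueError("no labels to compare")
    matches = sum(a == m for a, m in zip(auto_labels, manual_labels))
    return matches / len(auto_labels)
```

Fixing the seed means the same 1000 comments are drawn on every run, so manual labels stay aligned with the sample.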