# Scrape YouTube data
This notebook scrapes general information on the videos that multiple news channels uploaded between 2021-11-05 and 2021-11-15, along with each video's details and comment section.

## Environment
Import dependencies

In [None]:
import os
from dotenv import load_dotenv
import json
import numpy as np
import pandas as pd
from youtube import channel, video
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

Get API key from environment variables
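The cell for this step is not shown, so here is a minimal sketch of how the key could be read. The variable name `YOUTUBE_API_KEY` and the helper `get_api_key` are assumptions for illustration, not names taken from this notebook.

```python
import os

try:
    # python-dotenv is optional here: it copies entries from a local .env file
    # into the process environment, if such a file exists.
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass


def get_api_key(name="YOUTUBE_API_KEY"):
    """Return the API key stored in the environment variable `name` (hypothetical name)."""
    key = os.getenv(name)
    if not key:
        # Fail early instead of sending unauthenticated requests later on.
        raise RuntimeError(f"Environment variable {name} is not set")
    return key
```

Failing early on a missing key avoids spending time on requests that would be rejected anyway.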

## Import list of News channels

This data set is a handpicked list of news channels that:
1. Are relevant (top 100 subscribed or viewed channels)
1. Post political content in English
1. Have open comment sections

## 1. Build `channels` table

NOTE @ 2021-03-14: This section has already been executed. Since the requests are expensive, it's best to just load the result.

In [None]:
# Load result
df1 = pd.read_csv('../../dat/channels.csv')

## 2. Build `channelVideos` table

NOTE @ 2021-03-15: This section has already been executed. Since the requests are expensive, it's best to just load the result.

### 2.1. Pre-treatment videos
Videos uploaded on or before 2021-11-09

### 2.2. Post-treatment videos
Videos uploaded on or after 2021-11-11 (skip November 10th because the policy was gradually rolled out)
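
The pre/post split described in 2.1 and 2.2 can be sketched as a small labelling function. It assumes `publishedAt` is an ISO 8601 string such as `2021-11-09T14:30:00Z` and that `treat` is coded 0 for pre-treatment and 1 for post-treatment. The coding and the helper name `treatment_label` are assumptions, not taken from the scraping code.

```python
from datetime import date

# Rollout day of the policy; videos uploaded on this day are skipped.
ROLLOUT = date(2021, 11, 10)


def treatment_label(published_at):
    """Return 0 (pre-treatment), 1 (post-treatment), or None (rollout day)."""
    # The first ten characters of an ISO 8601 timestamp are YYYY-MM-DD.
    day = date.fromisoformat(published_at[:10])
    if day < ROLLOUT:
        return 0
    if day > ROLLOUT:
        return 1
    return None
```

Returning `None` for November 10th makes the skipped day explicit, so those rows can be dropped in one place.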

API quota ran out on `channelId = UCt-WqkTyKK1_70U4bb4k4lQ`.

Export table

In [None]:
# Load result
df2 = pd.read_csv('../../dat/videos.csv')

## 3. Build `videoDetails` table

Get the details of each video (title, description, duration, definition, etc.). These data will be used as controls.

NOTE @ 2021-03-15: This section has already been executed. Since the requests are expensive, it's best to just load the result.

Export table

In [None]:
# Load result
df3 = pd.read_csv('../../dat/videoDetails.csv')

## 4. Build `videoComments` table

### 4.1. Mine all comment sections

- Quota exceeded on `videoId = '_laKJi8Xwh8'`
- Quota exceeded on `videoId = 'OzRR6ROQ-mA'`

### 4.2. Count and classify relevant comments
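Function that classifies text as negative (1) or non-negative (0). A text counts as negative when its `neg` score is the largest of the `neg`, `neu`, and `pos` scores: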

In [None]:
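# Build the analyzer once; constructing it on every call reloads the lexicon
analyzer = SentimentIntensityAnalyzer()

def clf(text):
    """Return 1 if the negative score is the largest of neg/neu/pos, else 0."""
    r = analyzer.polarity_scores(text)
    # Compare by key instead of relying on the order of the returned dict
    return int(r['neg'] == max(r['neg'], r['neu'], r['pos']))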

For every video, take the comments posted within one day of its release and count both the total number of comments and the number of negative comments.

To do:
- Repeat procedure to get `yt` for $t \in \{1,2,3,4,5\}$
- Pass on videos as a function of $t$
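
The counting step, with the window length $t$ as a parameter, could look roughly like the sketch below. It uses only the standard library and takes the classifier as an argument, so any function that returns 0 or 1 (such as `clf`) can be passed in. The input layout (a list of `(timestamp, text)` pairs per video) and the helper names are assumptions, not the code that produced `comments_classified_t1.csv`.

```python
from datetime import datetime, timedelta


def parse_ts(ts):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def count_comments(video_published_at, comments, classify, t=1):
    """Count comments posted within t days of the video's release.

    comments: iterable of (comment_published_at, text) pairs (assumed layout).
    classify: function mapping text to 1 (negative) or 0 (non-negative).
    """
    start = parse_ts(video_published_at)
    cutoff = start + timedelta(days=t)
    n_comments = 0
    n_negative = 0
    for ts, text in comments:
        if start <= parse_ts(ts) <= cutoff:
            n_comments += 1
            n_negative += classify(text)
    return {"nComments": n_comments, "nNegativeComments": n_negative}
```

Calling the function with `t=1` through `t=5` gives one pair of counts per window length, which is one way to approach the first to-do item.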

In [None]:
# Read t1
df4 = pd.read_csv('../../dat/comments_classified_t1.csv')
# Add publishedAt
df = pd.merge(df3, df4, on='videoId', how='inner')
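# Remove videos published on November 9 (their one-day comment window reaches the rollout day)
df = df[df['publishedAt'].str[8:10] != '09']
# Add treatment column, joining explicitly on the video identifier
df = df.merge(df2[['videoId','treat']], on='videoId')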
# Declare target
df['ncr1'] = df['nNegativeComments'].div(df['nComments'])
# Days until treatment
df['dUntil'] = 10 - df['publishedAt'].str[8:10].astype(int)

- The policy was rolled out on November 10th
- $t = 1 \implies$ videos posted on 11-09 and 11-10 have comments posted on the 10th
    - Discard 11-09
    - Discard 11-10