# Scrape YouTube data
This notebook scrapes general information about videos uploaded by a set of news channels between 2021-11-05 and 2021-11-15, along with each video's details and comment section.

## Environment
Import dependencies

In [1]:
import os
from dotenv import load_dotenv
import json
import pandas as pd
from youtube import channel, video

Get API key from environment variables

In [2]:
load_dotenv()
key = os.getenv('API_KEY')
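If `.env` is missing or does not define `API_KEY`, `os.getenv` returns `None` and the problem only shows up later, at the first API request. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper introduced here for illustration and is not part of the notebook or of `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise if it is unset or empty.

    Hypothetical helper: the notebook itself calls os.getenv directly.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file")
    return value
```

With this in place, the cell above could read `key = require_env('API_KEY')` after `load_dotenv()`.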

## Import list of news channels

This data set is a handpicked list of news channels that:
1. Are relevant (among the top 100 channels by subscribers or views)
1. Post political content in English
1. Have open comment sections

## 1. Build `channels` table

NOTE @ 2021-03-14: This section has already been executed. Since the requests are expensive, it's best to just load the result.

In [3]:
# Load result
df1 = pd.read_csv('../../dat/channels.csv')

## 2. Build `channelVideos` table

NOTE @ 2021-03-15: This section has already been executed. Since the requests are expensive, it's best to just load the result.

### 2.1. Pre-treatment videos
Videos uploaded on or before 2021-11-09

### 2.2. Post-treatment videos
Videos uploaded on or after 2021-11-11. 2021-11-10 is skipped because the policy was rolled out gradually.
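The pre/post split described above can be sketched as a labelling step on the publication timestamp. This is only an illustration under stated assumptions: `label_period` and the constants are hypothetical names, and the timestamps are assumed to be strings whose first 10 characters are the date, as the later cells in this notebook suggest.

```python
import pandas as pd

# Cut-off dates taken from the section text above.
PRE_END = pd.Timestamp("2021-11-09")     # last pre-treatment day
POST_START = pd.Timestamp("2021-11-11")  # first post-treatment day


def label_period(published_at: pd.Series) -> pd.Series:
    """Label each timestamp 'pre' or 'post'; the skipped day (2021-11-10) stays missing."""
    dates = pd.to_datetime(published_at.str[:10])
    period = pd.Series(pd.NA, index=published_at.index, dtype="object")
    period[dates <= PRE_END] = "pre"
    period[dates >= POST_START] = "post"
    return period
```

Rows labelled as missing can then be dropped before any pre/post comparison.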

API quota ran out on `channelId = UCt-WqkTyKK1_70U4bb4k4lQ`.

Export table

In [4]:
# Load result
df2 = pd.read_csv('../../dat/videos.csv')

## 3. Build `videoDetails` table

Get the details of each video (title, description, duration, definition, etc.). These data will be used as controls.

NOTE @ 2021-03-15: This section has already been executed. Since the requests are expensive, it's best to just load the result.

Export table

In [5]:
# Load result
df3 = pd.read_csv('../../dat/videoDetails.csv')

## 4. Build `videoComments` table

- Quota Exceeded on `videoId = '_laKJi8Xwh8'`
- Quota Exceeded on `videoId = 'OzRR6ROQ-mA'`
- Quota Exceeded on `videoId = ''`
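Because the quota ran out several times, the scrape had to be resumed from the video where it stopped. One way to make that resumable is sketched below, assuming one JSON file per video. Everything here is hypothetical: `QuotaExceeded`, `scrape_comments`, and the `fetch` callable are illustrative names, not the API of the `youtube` module imported above.

```python
import json
from pathlib import Path


class QuotaExceeded(Exception):
    """Hypothetical error raised by the fetcher when the API quota is spent."""


def scrape_comments(video_ids, fetch, out_dir):
    """Save comments for each video as <videoId>.json, skipping files that already exist.

    Returns the videoId at which the quota ran out, or None if every video was processed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for video_id in video_ids:
        target = out_dir / f"{video_id}.json"
        if target.exists():
            continue  # already scraped in an earlier run
        try:
            comments = fetch(video_id)
        except QuotaExceeded:
            return video_id  # resume from here once the quota resets
        target.write_text(json.dumps(comments))
    return None
```

Re-running the same call after the quota resets picks up at the first video without a saved file, so no request is repeated.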

Read json files and filter by days after

In [6]:
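# Attach each video's publication timestamp from the details table
df = pd.merge(df2, df3[['videoId','publishedAt']], on='videoId')
# Keep only the first 10 characters of the timestamp (the date part)
df['publishedAt'] = df['publishedAt'].str[:10]
# Count videos per publication date
t = df.groupby('publishedAt').size().reset_index(name='n')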

In [13]:
# Hand-summed daily video counts (pre- vs. post-treatment) for each value of t
print('t=3: ', 200 + 136 + 116, 'pre and', 229 + 207 + 156, 'post')
print('t=2: ', 200 + 136 + 116 + 160, 'pre and', 229 + 207 + 156 + 122, 'post')
print('t=1: ', 200 + 136 + 116 + 160 + 219, 'pre and', 229 + 207 + 156 + 122 + 185, 'post')

t=3:  452 pre and 592 post
t=2:  612 pre and 714 post
t=1:  831 pre and 899 post
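The totals printed above can also be obtained with running sums instead of hand-written additions. The sketch below is illustrative only: it reuses the daily counts exactly as they appear in the `print` calls, in the same order, and the names `pre_counts` and `post_counts` are assumptions. Which calendar day each count belongs to is not stated here, so the lists are indexed by position only.

```python
from itertools import accumulate

# Daily video counts copied from the print statements, in the order they appear there.
pre_counts = [200, 136, 116, 160, 219]
post_counts = [229, 207, 156, 122, 185]

# Running totals: element i is the sum of the first i + 1 counts.
pre_totals = list(accumulate(pre_counts))
post_totals = list(accumulate(post_counts))

# t=3 sums the first three counts, t=2 the first four, t=1 all five.
for t, n_days in [(3, 3), (2, 4), (1, 5)]:
    print(f"t={t}: ", pre_totals[n_days - 1], "pre and", post_totals[n_days - 1], "post")
```

In the notebook itself the counts would more naturally come from the `n` column of the table `t` built earlier, rather than from literals.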


In [14]:
221/219

1.0091324200913243