# Data collection

In this notebook we collect the data for the Google Analytics Capstone project. We chose to analyze data from a YouTube channel.

So the aim of this notebook is to create a dataset with the following columns:

| Variable | Description | Notes |
| --- | --- | --- |
| `id` | YouTube video id | |
| `title` | Title of the video | At the time of extraction<br>I've noticed that some titles change, perhaps due to A/B tests |
| `published_at` | Date of publication | |
| `description` | Video description | At the time of extraction |
| `tags` | Video tags | AFAIK this is not displayed at the front-end |
| `thumbnail` | Thumbnail url | At the time of extraction |
| `duration` | Video duration | ISO 8601 duration string |
| `view_count` | Number of views | At the time of extraction |
| `like_count` | Number of likes | At the time of extraction |
| `comment_count` | Number of comments | At the time of extraction |
| `transcription` | Video transcription | Needs to be parsed |
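Since `duration` arrives as an ISO 8601 string, it helps to see how such a value can be turned into seconds. The following is a minimal sketch, not part of the original pipeline: the helper name `duration_to_seconds` is hypothetical, and it assumes durations only use days, hours, minutes and seconds.

```python
import re

# Matches durations such as PT23M49S, PT1H2M3S or P1DT2H.
# Assumption: weeks, months and years never appear in video durations.
_DURATION_RE = re.compile(
    r'^P(?:(?P<days>\d+)D)?'
    r'(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$'
)


def duration_to_seconds(duration: str) -> int:
    '''Converts an ISO 8601 duration string into a number of seconds.'''
    match = _DURATION_RE.match(duration)
    if match is None:
        raise ValueError(f'Unsupported duration: {duration!r}')

    # Missing components come back as None, so they default to 0
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}

    return (parts['days'] * 86400 + parts['hours'] * 3600
            + parts['minutes'] * 60 + parts['seconds'])


print(duration_to_seconds('PT23M49S'))  # 1429
```

A numeric duration is easier to aggregate and plot than the raw string, which is why a conversion like this is usually applied in the cleaning step.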

In [1]:
import requests
import pickle
import pandas as pd

from dotenv import dotenv_values
from tqdm import tqdm
from urllib.parse import urlparse, parse_qs
from IPython.display import JSON

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

from youtube_transcript_api import YouTubeTranscriptApi

# Video information

The first thing we will do is set up the YouTube Data API v3.

You need to populate the `.env` file with your [API key](https://developers.google.com/youtube/v3), stored under the `DATA-V3` name that the next cell reads.

In [2]:
env = dotenv_values('.env')

api_key = env.get('DATA-V3')

base_api = 'https://www.googleapis.com/youtube/v3'
videos_api = lambda id: f'{base_api}/videos?key={api_key}&id={id}&part=id,snippet,contentDetails,statistics'

## Sample request

Note: we can send multiple ids per request, [apparently](https://stackoverflow.com/questions/36370821/does-youtube-v3-data-api-have-a-limit-to-the-number-of-ids-you-can-send-to-vide) the max is 50.

In [3]:
response = requests.get(videos_api('-n8N62DeNDU'))

JSON(response.json())

<IPython.core.display.JSON object>

After looking at the sample structure, we choose to capture the following for each video:

- `snippet.title`
- `snippet.publishedAt`
- `snippet.description`
- `snippet.tags`
- `snippet.thumbnails.default.url`
- `contentDetails.duration`
- `statistics.viewCount`
- `statistics.likeCount`
- `statistics.commentCount`

Now we can define our function that extracts the video information.

In [4]:
def get_item_info(item: dict) -> dict:
    '''Collects the relevant information from a video

    Parameters
    ----------
    item: dict
        Dictionary from YouTube response

    Returns
    -------
    dict
        Dictionary that represents a video
    '''
    snippet = item.get('snippet', {})
    statistics = item.get('statistics', {})
    
    item_info = {}

    item_info['id'] = item.get('id')
    
    item_info['title'] = snippet.get('title', None)
    item_info['published_at'] = snippet.get('publishedAt', None)
    item_info['description'] = snippet.get('description', None)
    item_info['tags'] = snippet.get('tags', None)
    item_info['thumbnail'] = snippet\
        .get('thumbnails', {})\
        .get('default', {})\
        .get('url', None)

    item_info['duration'] = item.get('contentDetails', {}).get('duration', None)

    # Default to None (not an empty dict) so missing counts become NaN in pandas
    item_info['view_count'] = statistics.get('viewCount', None)
    item_info['like_count'] = statistics.get('likeCount', None)
    item_info['comment_count'] = statistics.get('commentCount', None)

    return item_info

Finally, we create the function to get the information for all videos in a list.

In [5]:
def get_video_list_info(ids: list, chunk_size: int = 50) -> list:
    '''Collects the relevant information from a list of video ids

    Parameters
    ----------
    ids: list
        List of YouTube video ids
    chunk_size: int
        Number of ids passed to each API call

    Returns
    -------
    list
        List of video information, can be used by pd.DataFrame
    '''
    info = []
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]

    for chunk in chunks:
        response = requests.get(videos_api(','.join(chunk)))

        if response.status_code != 200:
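            # Raising a plain string is a TypeError in Python 3, so raise a real exception
            raise RuntimeError(f'YouTube API request failed with status {response.status_code}')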

        items = response.json()['items']
        # info = [get_item_info(item) for item in items]
        for item in items:
            info.append(get_item_info(item))

    return info
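The batching in `get_video_list_info` can be checked offline, without spending API quota. The sketch below is illustrative only: `build_video_urls` is a hypothetical helper, the ids and the API key are made up, and it builds the query string with `urlencode` instead of the f-string used above.

```python
from urllib.parse import urlencode

BASE_API = 'https://www.googleapis.com/youtube/v3'


def build_video_urls(ids: list, api_key: str, chunk_size: int = 50) -> list:
    '''Builds one videos endpoint URL per chunk of ids.'''
    urls = []

    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        # safe=',' keeps the comma separators readable in the final URL
        query = urlencode({
            'key': api_key,
            'id': ','.join(chunk),
            'part': 'id,snippet,contentDetails,statistics',
        }, safe=',')
        urls.append(f'{BASE_API}/videos?{query}')

    return urls


fake_ids = [f'video{n:04d}' for n in range(120)]  # made-up ids
urls = build_video_urls(fake_ids, api_key='PLACEHOLDER')

print(len(urls))  # 3
```

With 120 ids and a chunk size of 50, the ids are split into groups of 50, 50 and 20, so three requests would be sent.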

# Transcription

We also want the transcription for each video. The following function uses `YouTubeTranscriptApi` to fetch it.

In [6]:
def get_transcription(id: str) -> list:
    '''Collects the transcription for a single video id

    Parameters
    ----------
    id: str
        YouTube video id

    Returns
    -------
    list or None
        List of dicts with text, start and duration, or None if no transcript could be fetched
    '''

    try:
        return YouTubeTranscriptApi.get_transcript(id)
    except Exception:
        # No transcript could be fetched (e.g. the video doesn't have one)
        return None
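The `transcription` column still needs to be parsed, because each transcript is a list of dicts with `text`, `start` and `duration`. As a rough illustration, the segments can be flattened into a single string. The helper name `transcript_to_text` and the sample segments are made up for this sketch.

```python
def transcript_to_text(segments):
    '''Joins the text of all transcript segments into one string.

    Returns None when there is no transcript, mirroring get_transcription.
    '''
    if not segments:
        return None

    return ' '.join(segment['text'].strip() for segment in segments)


# Made-up segments following the text/start/duration structure
sample = [
    {'text': 'hello there', 'start': 0.0, 'duration': 1.5},
    {'text': 'welcome back', 'start': 1.5, 'duration': 2.0},
]

print(transcript_to_text(sample))  # hello there welcome back
print(transcript_to_text(None))    # None
```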

# Channel information

We will use Selenium to scrape the video ids from the channel page. When the browser window appears, scroll down to the oldest video you want to include.

This could be more automated, but since we will only do it once, there's no need to overdo it.

We will gather the data for the past 2 years.

In [7]:
ltt_channel = 'https://www.youtube.com/@LinusTechTips/videos'

chrome_driver = ChromeDriverManager().install()
driver = webdriver.Chrome(service=Service(chrome_driver))

driver.get(ltt_channel)

In [8]:
get_id = lambda href: parse_qs(urlparse(href).query)['v'][0]
id_xpath = '//*[@id="video-title-link"]'

ids = [get_id(el.get_attribute('href')) for el in driver.find_elements(By.XPATH, id_xpath)]

print(ids[:3])
print(f'We have {len(ids)} video ids')

['8N3sFRR9-OE', '-n8N62DeNDU', 'buLyy7x2dcQ']
We have 900 video ids
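The `get_id` lambda assumes every link carries a `v` query parameter and raises a `KeyError` otherwise. A more defensive variant could skip such links and drop duplicates. This is only a sketch: `extract_video_id` is a hypothetical helper and the links below are examples, not scraped values.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(href):
    '''Returns the `v` query parameter of a link, or None if it is absent.'''
    if not href:
        return None

    values = parse_qs(urlparse(href).query).get('v')

    return values[0] if values else None


hrefs = [
    'https://www.youtube.com/watch?v=8N3sFRR9-OE',
    'https://www.youtube.com/watch?v=8N3sFRR9-OE&t=10s',  # same video, extra parameter
    'https://www.youtube.com/shorts/xyz',  # hypothetical link without a `v` parameter
    None,  # element without an href attribute
]

# dict.fromkeys removes duplicates while keeping the original order
unique_ids = list(dict.fromkeys(filter(None, map(extract_video_id, hrefs))))

print(unique_ids)  # ['8N3sFRR9-OE']
```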


# Creating the DataFrame

First we will get the video information.

In [9]:
video_info = pd.DataFrame(get_video_list_info(ids)).set_index('id')

video_info.head()

id,title,published_at,description,tags,thumbnail,duration,view_count,like_count,comment_count
8N3sFRR9-OE,How bad is the Cheapest Laptop,2023-07-12T17:23:57Z,"Check out the UGREEN Nexode 100W Charger, 145W...","[Cheapest Laptop, AliExpress]",https://i.ytimg.com/vi/8N3sFRR9-OE/default.jpg,PT23M49S,453842,25856,1891
-n8N62DeNDU,WHY is Everyone Buying This Power Supply??,2023-07-11T17:03:27Z,Checkout iFixit's toolkits at: https://www.iFi...,"[Power Supply, PSU Tester, Thermaltake Smart 6...",https://i.ytimg.com/vi/-n8N62DeNDU/default.jpg,PT14M39S,1116918,51871,2983
buLyy7x2dcQ,"Apple fans, start typing your angry comments now…",2023-07-10T18:14:04Z,"Check out the UGREEN PowerRoam 1200W, 145W Pow...","[apple, mac, mac studio, apple silicon, m2, m2...",https://i.ytimg.com/vi/buLyy7x2dcQ/default.jpg,PT18M25S,1728263,70674,4720
H5e3ALqgpaA,I said YES to everything… I regret it,2023-07-09T17:20:35Z,Visit https://www.squarespace.com/LTT and use ...,"[saying yes, roundup, tech, assorted, cuktech,...",https://i.ytimg.com/vi/H5e3ALqgpaA/default.jpg,PT26M11S,1823313,76297,3044
P32OKr74NPQ,Upgrading our FREE internet to 25 gigabit!,2023-07-08T17:00:29Z,It’s no secret their chairs are great! Check o...,,https://i.ytimg.com/vi/P32OKr74NPQ/default.jpg,PT32M19S,1835615,68739,2771


In [10]:
assert len(video_info) == len(ids), 'DataFrame and ids should have the same size'
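The count columns are worth converting to a numeric dtype before analysis. The sketch below assumes the counts arrive as strings and that some may be missing (for example when likes or comments are hidden). This should be checked against the actual response, and the sample frame is made up.

```python
import pandas as pd

# Made-up sample mimicking the count columns, with some missing values
sample = pd.DataFrame({
    'view_count': ['453842', '1116918', None],
    'like_count': ['25856', None, '70674'],
    'comment_count': ['1891', '2983', '4720'],
})

count_columns = ['view_count', 'like_count', 'comment_count']

for column in count_columns:
    # The nullable Int64 dtype keeps integers while still allowing missing values
    sample[column] = pd.to_numeric(sample[column], errors='coerce').astype('Int64')

print(sample.dtypes)
```

The nullable `Int64` dtype avoids turning the counts into floats just because a few values are missing.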

Now we get the transcriptions.

The transcription API handles about 1 video per second*, so this takes a while for many videos. Since it is a one-time step, I won't try to optimize it.

\* There's a `get_transcripts` method, but it seems to process the videos serially, too.

In [12]:
transcriptions = [{
    'id': id,
    'transcription': get_transcription(id)
} for id in tqdm(ids)]

100%|██████████████████████████████████████████████████████████████████████████████| 900/900 [15:03<00:00,  1.00s/it]


In [13]:
video_transcription = pd.DataFrame(transcriptions).set_index('id')

video_transcription.head()

id,transcription
8N3sFRR9-OE,[{'text': 'I've got something I want to show y...
-n8N62DeNDU,[{'text': 'this is the most popular power supp...
buLyy7x2dcQ,[{'text': 'I really did try this time guys I e...
H5e3ALqgpaA,[{'text': 'my inbox is full of opportunities t...
P32OKr74NPQ,[{'text': 'when we expanded our space to give ...


In [14]:
assert len(video_transcription) == len(ids), 'DataFrame and ids should have the same size'

Finally we can merge and export the DataFrame. We will export with pickle just to keep the `transcription` format.

In [15]:
raw = video_info.join(video_transcription)

with open('data/raw.pkl', 'wb') as f:
    pickle.dump(raw, f)

raw.head()

id,title,published_at,description,tags,thumbnail,duration,view_count,like_count,comment_count,transcription
8N3sFRR9-OE,How bad is the Cheapest Laptop,2023-07-12T17:23:57Z,"Check out the UGREEN Nexode 100W Charger, 145W...","[Cheapest Laptop, AliExpress]",https://i.ytimg.com/vi/8N3sFRR9-OE/default.jpg,PT23M49S,453842,25856,1891,[{'text': 'I've got something I want to show y...
-n8N62DeNDU,WHY is Everyone Buying This Power Supply??,2023-07-11T17:03:27Z,Checkout iFixit's toolkits at: https://www.iFi...,"[Power Supply, PSU Tester, Thermaltake Smart 6...",https://i.ytimg.com/vi/-n8N62DeNDU/default.jpg,PT14M39S,1116918,51871,2983,[{'text': 'this is the most popular power supp...
buLyy7x2dcQ,"Apple fans, start typing your angry comments now…",2023-07-10T18:14:04Z,"Check out the UGREEN PowerRoam 1200W, 145W Pow...","[apple, mac, mac studio, apple silicon, m2, m2...",https://i.ytimg.com/vi/buLyy7x2dcQ/default.jpg,PT18M25S,1728263,70674,4720,[{'text': 'I really did try this time guys I e...
H5e3ALqgpaA,I said YES to everything… I regret it,2023-07-09T17:20:35Z,Visit https://www.squarespace.com/LTT and use ...,"[saying yes, roundup, tech, assorted, cuktech,...",https://i.ytimg.com/vi/H5e3ALqgpaA/default.jpg,PT26M11S,1823313,76297,3044,[{'text': 'my inbox is full of opportunities t...
P32OKr74NPQ,Upgrading our FREE internet to 25 gigabit!,2023-07-08T17:00:29Z,It’s no secret their chairs are great! Check o...,,https://i.ytimg.com/vi/P32OKr74NPQ/default.jpg,PT32M19S,1835615,68739,2771,[{'text': 'when we expanded our space to give ...


In [16]:
raw.shape

(900, 10)
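To illustrate why pickle keeps the `transcription` format while a text format would not, the small round trip below compares both. It is a sketch on a made-up one-row frame and uses in-memory buffers instead of `data/raw.pkl`.

```python
import io
import pickle

import pandas as pd

# Made-up frame with a list-of-dicts column, like `transcription`
frame = pd.DataFrame({
    'id': ['abc123'],
    'transcription': [[{'text': 'hello', 'start': 0.0, 'duration': 1.0}]],
}).set_index('id')

# Round trip through pickle: Python objects are restored as they were
pickle_buffer = io.BytesIO()
pickle.dump(frame, pickle_buffer)
pickle_buffer.seek(0)
from_pickle = pickle.load(pickle_buffer)

# Round trip through CSV: the list is written as text and read back as a string
csv_buffer = io.StringIO()
frame.to_csv(csv_buffer)
csv_buffer.seek(0)
from_csv = pd.read_csv(csv_buffer, index_col='id')

print(type(from_pickle.loc['abc123', 'transcription']))  # <class 'list'>
print(type(from_csv.loc['abc123', 'transcription']))     # <class 'str'>
```

Keep in mind that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.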