# Data collector

We will use the [youtube_dl](https://youtube-dl.org) package to extract from YouTube the data needed to build the machine learning model. Because this project is built around my personal interests in YouTube videos, the following search topics will be used to train the model:
- artificial intelligence;
- machine learning;
- deep learning.

For each video returned by the YouTube search on the topics above, we want to extract the following information:
- Title;
- Category;
- Tags;
- Duration;
- Number of views;
- Number of likes;
- Number of dislikes;
- Upload date;
- Uploader;
- Video link;
- Channel link.
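
As a rough guide, the fields listed above can be matched to the column names that appear later in this notebook. The dictionary name `FIELD_TO_KEY` is hypothetical and only serves as a lookup table for the reader; the key names on the right are the ones selected further down.

```python
# Hypothetical lookup table: field of interest -> youtube_dl key used later on.
FIELD_TO_KEY = {
    "Title": "title",
    "Category": "categories",
    "Tags": "tags",
    "Duration": "duration",
    "Number of views": "view_count",
    "Number of likes": "like_count",
    "Number of dislikes": "dislike_count",
    "Upload date": "upload_date",
    "Uploader": "uploader",
    "Video link": "webpage_url",
    "Channel link": "channel_url",
}

# Print the mapping in a readable form.
for field, key in FIELD_TO_KEY.items():
    print("{:<20} -> {}".format(field, key))
```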

In [110]:
import youtube_dl      # YouTube scraper
import pandas as pd    # module to handle dataframes
import numpy as np     # module to handle arrays

We will use youtube_dl's *ytsearchdate* search prefix to retrieve the search results sorted by date, starting from the most recent video. The result is returned as a dictionary with the following keys: <br>
['_type', 'entries', 'id', 'extractor', 'webpage_url', 'webpage_url_basename', 'extractor_key']

When we inspect this dictionary, we see that all the information we need for this project is stored under the key *'entries'*. With this in mind, let's proceed to collect the data.
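
To make the structure concrete, the sketch below builds a small stand-in for the returned dictionary and pulls a few fields out of `'entries'`. The values are made up for illustration (the real dictionary is far larger), and the helper name `entry_fields` is hypothetical. It also assumes that a video that could not be extracted may show up as `None` in the list, which is the case the later cleaning step guards against.

```python
# A hand-written stand-in for the search result dictionary (illustrative values only).
fake_result = {
    "_type": "playlist",
    "id": "deep+learning",
    "entries": [
        {"title": "Example video A", "duration": 125.0, "view_count": 10},
        None,  # assumed placeholder for a video that could not be extracted
        {"title": "Example video B", "duration": 3600.0, "view_count": 42},
    ],
}


def entry_fields(result, keys):
    """Return one dict per valid entry, keeping only the requested keys."""
    rows = []
    for entry in result.get("entries") or []:
        if entry is None:  # skip entries that carry no data
            continue
        rows.append({key: entry.get(key) for key in keys})
    return rows


rows = entry_fields(fake_result, ["title", "view_count"])
print(rows)
```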

**Note:** The cell below will take a few minutes to run.
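
For a quicker trial run, the search string can be built with a result limit instead of `all`. This is a minimal sketch: the helper `search_spec` is hypothetical, and it assumes that the search prefix accepts a positive integer in place of `all` (for example `ytsearchdate5:`), which is worth checking against the installed youtube_dl version.

```python
def search_spec(query, limit=None):
    """Build a youtube_dl date-sorted search string.

    limit=None asks for all results; an integer asks for that many
    (assumption: the prefix accepts a number in place of "all").
    """
    count = "all" if limit is None else str(int(limit))
    return "ytsearchdate{}:{}".format(count, query)


print(search_spec("deep+learning"))
print(search_spec("deep+learning", limit=5))
```

The resulting string would be passed to `ydl.extract_info(..., download=False)` in the same way as in the collection cell.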

In [111]:
# My subjects of interest:
queries = ["artificial+intelligence", "machine+learning", "deep+learning"]

# Defining youtube_dl object:
ydl = youtube_dl.YoutubeDL({"ignoreerrors": True})

# Starting the list where we will store the collected data:
results = []

# Collecting data for each query:
for query in queries:
    r = ydl.extract_info("ytsearchdateall:{}".format(query), download=False)
    results += r['entries']

[download] Downloading playlist: artificial+intelligence
[youtube:search:date] query "artificial+intelligence": Downloading page 1
[youtube:search:date] query "artificial+intelligence": Downloading page 2
[...]

ERROR: This live event will begin in 2 days.

[download] Downloading video 38 of 244
[youtube] z3V7zsTEb2I: Downloading webpage
[...]

ERROR: Premieres in 2 days

[download] Downloading video 50 of 244
[youtube] 1doZIRJ5RIY: Downloading webpage
[...]
[youtube:search:date] query "machine+learning": Downloading page 4
[...]

ERROR: This live event will begin in a few moments.

[download] Downloading video 37 of 312
[youtube] MPGaDZiv5wQ: Downloading webpage
[...]

ERROR: Premieres in 3 days

[download] Downloading video 11 of 453
[youtube] AH_nyxnCoak: Downloading webpage
[...]

In [114]:
# Getting rid of possible null entries:
results = [datum for datum in results if datum is not None]

# Writing the data as a dataframe:
df_initial = pd.DataFrame(results)

# Removing possible duplicated entries:
df_initial = df_initial.drop_duplicates(subset=['title', 'uploader', 'duration'], keep='first')
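
The effect of these two cleaning steps can be seen on a toy list. The records below are made up, and the variable names prefixed with `toy_` are hypothetical. The sketch shows that a video returned by two different queries is kept only once, under the query that found it first, while a video with the same title but a different uploader is kept.

```python
import pandas as pd

# Toy records standing in for the collected entries (values are made up).
toy_results = [
    {"title": "Intro to ML", "uploader": "Channel X", "duration": 600.0,
     "playlist_id": "machine+learning"},
    None,  # a video that could not be extracted
    {"title": "Intro to ML", "uploader": "Channel X", "duration": 600.0,
     "playlist_id": "deep+learning"},
    {"title": "Intro to ML", "uploader": "Channel Y", "duration": 600.0,
     "playlist_id": "deep+learning"},
]

# Same steps as in the cell above: drop None, build a dataframe, drop duplicates.
toy_clean = [datum for datum in toy_results if datum is not None]
toy_df = pd.DataFrame(toy_clean)
toy_df = toy_df.drop_duplicates(subset=["title", "uploader", "duration"], keep="first")

print(toy_df[["title", "uploader", "playlist_id"]])
```

Note that the original row labels are preserved after `drop_duplicates`, so the index is no longer consecutive.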

In [116]:
pd.set_option("display.max_columns", 100)
df_initial.head(2)

Unnamed: 0,id,title,formats,thumbnails,description,upload_date,uploader,uploader_id,uploader_url,channel_id,channel_url,duration,view_count,average_rating,age_limit,webpage_url,categories,tags,is_live,like_count,dislike_count,channel,extractor,webpage_url_basename,extractor_key,n_entries,playlist,playlist_id,playlist_title,playlist_uploader,playlist_uploader_id,playlist_index,thumbnail,display_id,requested_subtitles,requested_formats,format,format_id,width,height,resolution,fps,vcodec,vbr,stretched_ratio,acodec,abr,ext,automatic_captions,subtitles,location,track,artist,album,creator,alt_title,chapters,license,release_date,release_year
0,2q0IOgtFHRM,Tackling Biases in Artificial Intelligence AI ...,"[{'asr': 48000, 'filesize': 48988677, 'format_...","[{'height': 94, 'url': 'https://i.ytimg.com/vi...",Our Diversity & Inclusion and ICT Committees a...,20210518,British Chamber of Commerce Singapore,britchamsg,http://www.youtube.com/user/britchamsg,UClpAYQSuNmsPYC7BDXNxAEQ,https://www.youtube.com/channel/UClpAYQSuNmsPY...,3637.0,1,0.0,0,https://www.youtube.com/watch?v=2q0IOgtFHRM,[Travel & Events],[],,0.0,0.0,,youtube,2q0IOgtFHRM,Youtube,244,artificial+intelligence,artificial+intelligence,,,,1,https://i.ytimg.com/vi/2q0IOgtFHRM/maxresdefau...,2q0IOgtFHRM,,"({'asr': None, 'filesize': 655576612, 'format_...",137 - 1920x1080 (1080p)+140 - audio only (tiny),137+140,1920,1080,,30,avc1.640028,1442.386,,mp4a.40.2,129.472,mp4,,,,,,,,,,,,
1,1SKWh9wz66I,English Listening Practice with A.I. Artificia...,"[{'asr': 48000, 'filesize': 13847160, 'format_...","[{'height': 94, 'url': 'https://i.ytimg.com/vi...",English listening practice is really important...,20210518,English Fluency Mission - Learn with Movies Sc...,UCi-2T61TG0r2dqx857dHE4w,http://www.youtube.com/channel/UCi-2T61TG0r2dq...,UCi-2T61TG0r2dqx857dHE4w,https://www.youtube.com/channel/UCi-2T61TG0r2d...,1062.0,14,5.0,0,https://www.youtube.com/watch?v=1SKWh9wz66I,[Education],"[learn english through movies, hollywood movie...",,9.0,0.0,,youtube,1SKWh9wz66I,Youtube,244,artificial+intelligence,artificial+intelligence,,,,2,https://i.ytimg.com/vi/1SKWh9wz66I/maxresdefau...,1SKWh9wz66I,,"({'asr': None, 'filesize': 234645613, 'format_...",137 - 1920x1080 (1080p)+140 - audio only (tiny),137+140,1920,1080,,30,avc1.640028,1768.474,,mp4a.40.2,129.476,mp4,"{'af': [{'ext': 'srv1', 'url': 'https://www.yo...",{},,,,,,,,,,


After looking at the columns above, we select the ones of interest:

In [117]:
columns = ['title','duration','categories','uploader','view_count','like_count','dislike_count',
           'tags','upload_date','channel_url','webpage_url','playlist_id']

df = df_initial[columns].copy()

# Parsing upload_date (strings such as '20210518') into datetimes:
df['upload_date'] = pd.to_datetime(df['upload_date'], format='%Y%m%d')
# Writing the duration of the video in integer units of minutes:
df['duration'] = (df['duration'] / 60).round().astype(int)
# Renaming some columns:
df = df.rename(columns={'duration': 'minutes', 'view_count': 'views', 'like_count': 'likes',
                         'dislike_count': 'dislikes', 'playlist_id': 'query'})
df.head(2)

Unnamed: 0,title,minutes,categories,uploader,views,likes,dislikes,tags,upload_date,channel_url,webpage_url,query
0,Tackling Biases in Artificial Intelligence AI ...,61,[Travel & Events],British Chamber of Commerce Singapore,1,0.0,0.0,[],2021-05-18,https://www.youtube.com/channel/UClpAYQSuNmsPY...,https://www.youtube.com/watch?v=2q0IOgtFHRM,artificial+intelligence
1,English Listening Practice with A.I. Artificia...,18,[Education],English Fluency Mission - Learn with Movies Sc...,14,9.0,0.0,"[learn english through movies, hollywood movie...",2021-05-18,https://www.youtube.com/channel/UCi-2T61TG0r2d...,https://www.youtube.com/watch?v=1SKWh9wz66I,artificial+intelligence
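
The date and duration conversions can be checked on the two rows shown above. This is a small sketch on a hypothetical dataframe `toy`, using the `upload_date` and `duration` values from those rows, with the duration conversion written in vectorized form.

```python
import pandas as pd

# Values copied from the first two rows displayed above.
toy = pd.DataFrame({
    "upload_date": ["20210518", "20210518"],
    "duration": [3637.0, 1062.0],  # seconds
})

# Parse the YYYYMMDD strings into datetimes.
toy["upload_date"] = pd.to_datetime(toy["upload_date"], format="%Y%m%d")
# Convert seconds to whole minutes, rounding to the nearest integer.
toy["minutes"] = (toy["duration"] / 60).round().astype(int)

print(toy)
```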


Now that the dataset is clean and tidy, we will save it to a CSV file so we can start labeling the data. I will use the label 1 if the video appeals to me and 0 if it does not. The labeling will be done manually in the Numbers app on a MacBook, and the labeled dataset will be saved as "Labeled_data.csv".

In [54]:
df.to_csv('Unlabeled_data.csv')
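
Once the labels have been filled in, a quick sanity check when reading the file back might look like the sketch below. The column name `label` is an assumption (the notebook does not name it), and an in-memory string stands in for the contents of "Labeled_data.csv" so that the example runs on its own.

```python
import io

import pandas as pd

# Stand-in for the contents of "Labeled_data.csv"; the `label` column name is an assumption.
fake_csv = io.StringIO(
    ",title,minutes,label\n"
    "0,Example video A,61,1\n"
    "1,Example video B,18,0\n"
)

# to_csv writes the index as the first column, so read it back as the index
# rather than letting it appear as an "Unnamed: 0" column.
labeled = pd.read_csv(fake_csv, index_col=0)

# Every row should carry exactly one of the two labels.
labels_ok = bool(labeled["label"].isin([0, 1]).all())
print(labels_ok)
print(labeled["label"].value_counts())
```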