# Create Sample Data
This notebook explores and exports small data samples that showcase the data types and structures used in the project. The samples are not intended for use in the actual pipeline (and thus cannot guarantee similar results), but they may be used to test certain functions or features of our analysis.

In [1]:
import pandas as pd
import pickle

## Initial JSON Files
These JSON files were extracted from .tar files downloaded from Archive.org.

In [2]:
json_sample = pd.read_json("../sample_data/raw/5000.json.gz", compression="gzip")
json_sample.head(3)

Unnamed: 0,v,id,fetch_date,uploader,uploader_id,upload_date,title,description,category,tags,...,is_live_content,is_ads_enabled,is_comments_enabled,formats,credits,regions_allowed,recommended_videos,headline_badges,unavailable_message,license
0,3,3oZrwTwGmxw,20190203003531,Zoltan Kerekes,UC071WEbL9XvLQMBBSG1XnHQ,20081008.0,trentemoller - kink,trentemoller - kink,Music,"[electronic, minimal, trentemoller, kink]",...,False,True,True,"[{'format_id': '133', 'ext': 'mp4', 'height': ...","[{'title': 'Category', 'author': 'Music', 'url...","[AD, AE, AF, AG, AI, AL, AM, AO, AQ, AR, AS, A...","[{'video_id': 'kPc2XLMDmRE', 'view_count': 715...",,,
1,3,Vlgvu45xPIU,20190203003531,Adrianna Skon,UC_IrqtYlVpQRZCvffHQ42Pw,20160706.0,TEST! Słuchawka do telefonu! O.o ♥ G+ #17,Wbijemy 20 000 łapek ♥ ? \nUDOSTĘPNIAJCIE I WY...,People & Blogs,"[adrianna, ada, skon, skoneczna, moviestarplan...",...,False,True,True,"[{'format_id': '137', 'ext': 'mp4', 'height': ...","[{'title': 'Category', 'author': 'People & Blo...","[AD, AE, AF, AG, AI, AL, AM, AO, AQ, AR, AS, A...","[{'video_id': 'A5huOROyvn0', 'view_count': 249...",,,
2,3,ZrQYcCKtpUs,20190203003531,Abaddon Tv,UCbULg9eolm4RQ1BKfQR5GlQ,20110727.0,Abaddon Pinapangarap ko ft. Curse 0ne,/www.facebook.com/pages/Abaddon-AKA-Django-Bal...,Music,"[Abaddon, Rap, Hip hop, thugs music, thugszill...",...,False,False,True,"[{'format_id': '135', 'ext': 'mp4', 'height': ...","[{'title': 'Category', 'author': 'Music', 'url...","[AD, AE, AF, AG, AI, AL, AM, AO, AQ, AR, AS, A...","[{'video_id': 'BM8ZKtXg9wk', 'view_count': 611...",,,
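
In the preview above, `fetch_date` is shown as an integer such as `20190203003531` and `upload_date` as a float such as `20081008.0`. The sketch below shows one way to turn such values into Python date objects. The helper names `parse_fetch_date` and `parse_upload_date` are hypothetical, and the `YYYYMMDDHHMMSS` / `YYYYMMDD` layouts are assumptions read off the sample rows, not a documented schema.

```python
import math
from datetime import date, datetime


def parse_fetch_date(value):
    """Parse a value like 20190203003531 (assumed YYYYMMDDHHMMSS) into a datetime."""
    return datetime.strptime(str(int(value)), "%Y%m%d%H%M%S")


def parse_upload_date(value):
    """Parse a value like 20081008.0 (assumed YYYYMMDD) into a date.

    Returns None for missing values (None or NaN).
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    return datetime.strptime(str(int(value)), "%Y%m%d").date()
```

With the DataFrame loaded above, these could be applied per column, for example `json_sample["upload_date"].apply(parse_upload_date)`.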


## Parquet Files
We combined the numerous JSON files into parquet files to reduce I/O processing time and make initial analysis easier.
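
As a rough illustration of that step, the sketch below concatenates several gzipped JSON files and writes the result to a single parquet file. It is not the code used in the actual pipeline: the function names `combine_frames` and `json_files_to_parquet` are hypothetical, and it assumes every file has the same layout as the sample loaded above and that a parquet engine such as `pyarrow` is installed.

```python
import pandas as pd


def combine_frames(frames):
    """Stack DataFrames row-wise and renumber the index from 0."""
    return pd.concat(list(frames), ignore_index=True)


def json_files_to_parquet(json_paths, parquet_path):
    """Read gzipped JSON files, combine them, and write one parquet file.

    Assumes all files share the same columns; to_parquet needs a parquet
    engine (for example pyarrow) to be available.
    """
    frames = (pd.read_json(path, compression="gzip") for path in json_paths)
    combined = combine_frames(frames)
    combined.to_parquet(parquet_path)
    return combined
```

Reading one parquet file back with `pd.read_parquet` then replaces many separate JSON reads.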

In [3]:
# Load a sample of our main parquet data
parq_sample = pd.read_parquet("../sample_data/interim/parq_sample.parquet")
parq_sample.head(3)

Unnamed: 0,v,id,fetch_date,uploader,uploader_id,upload_date,title,description,category,tags,...,is_live_content,is_ads_enabled,is_comments_enabled,formats,credits,regions_allowed,recommended_videos,headline_badges,unavailable_message,license
1231,3,RtAl0ZIMEJs,20190203204022,Chidambaram VIJAY GEMS,UCbqZDu2OP13FeXgImdLju7A,20171002.0,jimikki kammal,,People & Blogs,,...,False,False,True,"[{'bitrate': 4359297, 'ext': 'mp4', 'format_id...","[{'author': 'People & Blogs', 'title': 'Catego...","[AD, AE, AF, AG, AI, AL, AM, AO, AQ, AR, AS, A...","[{'video_id': 'cgib0bCzUpg', 'view_count': 270...",,,
1954,3,DtQ3Q3tnSIY,20190203171016,Dragon Stein,UCbIS-NpawTXInuh8B_sfZZA,20150129.0,Replay from Geometry Dash!,Replay from Geometry Dash!\nhttps://everyplay....,People & Blogs,,...,False,False,True,"[{'bitrate': 1155000, 'ext': 'mp4', 'format_id...","[{'author': 'People & Blogs', 'title': 'Catego...","[AD, AE, AF, AG, AI, AL, AM, AO, AQ, AR, AS, A...","[{'video_id': 'qQqkLQ6wF48', 'view_count': 572...",,,
1131,3,g-MwwpfBOZ8,20190204173206,Electric Bird,UCs04YaGf84ediaW4wb3zcLA,20131230.0,"Claude D. Pepper Building, Bethesda, Maryland,...",It has 11 floors.,Entertainment,,...,False,False,True,"[{'bitrate': 2310000, 'ext': 'mp4', 'format_id...","[{'author': 'Entertainment', 'title': 'Categor...","[AD, AE, AF, AG, AI, AL, AM, AO, AQ, AR, AS, A...","[{'video_id': 'dA42vievwZk', 'view_count': 762...",,,


## Archive Data CSV Exports from Database

In [4]:
csv_liked_sample = pd.read_csv("../sample_data/processed/csv_liked_sample.csv", engine="python")
csv_liked_sample.head(3)

Unnamed: 0,row_id,id,fetch_date,uploader,uploader_id,upload_date,title,desc_text,category,tags,...,credits,regions_allowed,recommended_videos,headline_badges,unavailable_message,license,view_like_ratio,view_dislike_ratio,like_dislike_ratio,dislike_like_ratio
0,3931071,SqtyROrkw0w,2019-02-04 18:51:27,Too Many T's,UCA3yMT22rr0-Y6OMMDeoO0g,2018-03-19,Too Many T's – Featuring Alexa (Full Version),AS SEEN ON TECH CRUNCH :)\n\nThe world’s first...,Music,,...,"[{""url"": ""/channel/UC-9-kyTW8ZkZNDHQJ6FgpwQ"", ...","{AD,AE,AF,AG,AI,AL,AM,AO,AQ,AR,AS,AT,AU,AW,AX,...","[{""video_id"": ""2srkD_BTVkg"", ""view_count"": 259...",,,,42.81262,924881.0,21603.0,9.3e-05
1,1131743,dAPClFn9-_w,2019-02-03 00:19:52,Bitcoin Official,UCbGspiLVD7HNtZEHc8miLeA,2018-04-27,New Leak - My Bitsler Maintenance Mode Method ...,Here it is guys! My Leaked Method for http://B...,Gaming,"{bitsler,mikethemug,bitcoin,""bitcoin gambling""...",...,"[{""url"": ""/gaming"", ""title"": ""Category"", ""auth...","{AD,AE,AF,AG,AI,AL,AM,AO,AQ,AR,AS,AT,AU,AW,AX,...","[{""video_id"": ""1Ih6GF1wPnw"", ""view_count"": 153...",,,,52.699596,1007669.0,19121.0,0.000105
2,53176560,1LEUsNbGAyw,2019-02-04 09:19:23,DrRay Baez,UCyV1RvxOz1iHkDq4HX5ucJQ,2018-06-16,RAY BAEZ 7 HOMICIDIOS AL SINDICO LLEGA A SANT...,Distribuido bajo el acto de uso justo de 1976,People & Blogs,"{""ray baez"",""dr ray baez"",""ray baez acosta""}",...,"[{""url"": ""/channel/UC1vGae2Q3oT5MkhhfW8lwjg"", ...","{AD,AE,AF,AG,AI,AL,AM,AO,AQ,AR,AS,AT,AU,AW,AX,...","[{""video_id"": ""Jnei1md-Ia0"", ""view_count"": 695...",,,,2.709518,24653.0,9098.667,0.000147
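
In this CSV export, `tags` and `regions_allowed` appear as brace-delimited text such as `{AD,AE,AF,...}`, while `credits` and `recommended_videos` appear as JSON text. The sketch below shows one way to turn such cells back into Python lists. The helper names are hypothetical, and treating the brace format as a PostgreSQL-style array literal is an assumption based on how the values look; nested arrays and NULL elements are not handled.

```python
import csv
import json


def parse_brace_list(cell):
    """Parse text like '{a,b,"c d"}' into ['a', 'b', 'c d'].

    Non-string cells (for example NaN for missing values) yield an empty list.
    """
    if not isinstance(cell, str):
        return []
    text = cell.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError(f"not a brace-delimited list: {text!r}")
    inner = text[1:-1]
    if not inner:
        return []
    # csv handles the double-quoted elements and backslash escapes.
    reader = csv.reader([inner], delimiter=",", quotechar='"',
                        escapechar="\\", doublequote=False)
    return next(reader)


def parse_json_cell(cell):
    """Parse a JSON text cell; non-string cells yield an empty list."""
    if not isinstance(cell, str):
        return []
    return json.loads(cell)
```

These could be applied column-wise, for example `csv_liked_sample["tags"].apply(parse_brace_list)`.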


In [5]:
csv_disliked_sample = pd.read_csv("../sample_data/processed/csv_disliked_sample.csv", engine="python")
csv_disliked_sample.head(3)

Unnamed: 0,row_id,id,fetch_date,uploader,uploader_id,upload_date,title,desc_text,category,tags,...,credits,regions_allowed,recommended_videos,headline_badges,unavailable_message,license,view_like_ratio,view_dislike_ratio,like_dislike_ratio,dislike_like_ratio
0,761755,m6b3FCBYuM0,2019-02-03 02:11:51,Сергей Тарасов,UCoBAX8sJx6rOYXQ-g8zWNsQ,2017-10-23,КВН 2012 выборка,,People & Blogs,,...,"[{""url"": ""/channel/UC1vGae2Q3oT5MkhhfW8lwjg"", ...","{AD,AE,AF,AG,AI,AL,AM,AO,AQ,AR,AS,AT,AU,AW,AX,...","[{""video_id"": ""SZlXjN80O78"", ""view_count"": 232...",,,,22900.0,1.467008,6.4e-05,7805.5
1,7829816,T4CxthMciIg,2019-02-03 02:16:26,Александр Храмцов,UCdXRYgjGpzKp3oucGPFjk7w,2018-08-20,10,,Pets & Animals,,...,"[{""url"": ""/channel/UCFYJCBaHRzLJrnhRglM3GdA"", ...","{AD,AE,AF,AG,AI,AL,AM,AO,AQ,AR,AS,AT,AU,AW,AX,...","[{""video_id"": ""kEqLudxk_SE"", ""view_count"": 267...",,,,inf,51.556072,0.0,2373.0
2,33453926,PkWuE_6NnTk,2019-02-03 10:44:18,Народный Фронт,UCOmixtlsD9LWx-puOmo1Hog,2018-03-05,Путин: В России нужно развивать не только мега...,В ходе заключительного дня работы Медиафорума ...,Nonprofits & Activism,"{ОНФ,""Народный фронт"",медиафорум,Калининград,П...",...,"[{""url"": ""/channel/UCM6FFmRAK_uTICRwyTubV0A"", ...","{AD,AE,AF,AG,AI,AL,AM,AO,AQ,AR,AS,AT,AU,AW,AX,...","[{""video_id"": ""LTused4tv5E"", ""view_count"": 106...",,,,32.33333,0.020798,0.000643,1166.25
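
The second row of the preview above has a `view_like_ratio` of `inf`, so functions tested on these samples may need to handle non-finite ratios. The sketch below replaces infinite values with `NaN` in the four ratio columns. The function name `mask_infinite_ratios` is hypothetical, and whether masking is the right treatment depends on the analysis; how the ratios were computed is not shown in this notebook.

```python
import numpy as np
import pandas as pd

RATIO_COLUMNS = [
    "view_like_ratio",
    "view_dislike_ratio",
    "like_dislike_ratio",
    "dislike_like_ratio",
]


def mask_infinite_ratios(df, columns=RATIO_COLUMNS):
    """Return a copy of df with +/-inf replaced by NaN in the ratio columns."""
    out = df.copy()
    present = [col for col in columns if col in out.columns]
    if present:
        out[present] = out[present].replace([np.inf, -np.inf], np.nan)
    return out
```

For example, `mask_infinite_ratios(csv_disliked_sample)` leaves the original DataFrame untouched and returns a cleaned copy.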


In [6]:
csv_random_sample = pd.read_csv("../sample_data/processed/csv_random_sample.csv", engine="python")
csv_random_sample.head(3)

Unnamed: 0,row_id,id,fetch_date,uploader,uploader_id,upload_date,title,desc_text,category,tags,...,credits,regions_allowed,recommended_videos,headline_badges,unavailable_message,license,view_like_ratio,view_dislike_ratio,like_dislike_ratio,dislike_like_ratio
0,62750321,AVCsN7n5vFQ,2019-02-04 16:37:10,DannyPhantom,UCkztGwmh6LowfcvBPVqmsvQ,2013-07-31,Beesh's HUNGER GAMES Let's Play #1,"Leave a Like, it helps more than you know! :]\...",Gaming,,...,"[{""url"": ""/gaming"", ""title"": ""Category"", ""auth...","{AD,AE,AF,AG,AI,AL,AM,AO,AQ,AR,AS,AT,AU,AW,AX,...","[{""video_id"": ""cPGGn__t6nU"", ""view_count"": 274...",,,,25.4,254.0,10.0,0.181818
1,62750514,AVNmHRkyl1I,2019-02-04 16:38:27,Anurak Show,UCGVLcqjprhTfa8CPTJZfkHQ,2015-08-19,โรงเรียนเทศบาล1 (วัดพรหมวิหาร) กิจกรรมการแข่...,การแข่งขันทักษะวิชาการท้องถิ่น ภาคเหนือ จ.แพร่...,People & Blogs,,...,"[{""url"": ""/channel/UC1vGae2Q3oT5MkhhfW8lwjg"", ...","{AD,AE,AF,AG,AI,AL,AM,AO,AQ,AR,AS,AT,AU,AW,AX,...","[{""video_id"": ""LA9qt-kjmYM"", ""view_count"": 153...",,,,56.115383,1459.0,26.0,0.074074
2,62750691,AVVumgQdPnw,2019-02-04 16:39:38,Bunny Turconi,UC56dLjNql-y9VwWWUTd430w,2015-07-28,Bunny Primer Vicio - Adelanto,Bunny\nPrimer Vicio - Adelanto\nEstudio: Ritmo...,People & Blogs,"{""face to face"",24/siempre,redbullbatalladegal...",...,"[{""url"": ""/channel/UC1vGae2Q3oT5MkhhfW8lwjg"", ...","{AD,AE,AF,AG,AI,AL,AM,AO,AQ,AR,AS,AT,AU,AW,AX,...","[{""video_id"": ""eb8phBrZNwE"", ""view_count"": 121...",,,,31.428572,990.0,31.5,0.046875


In [7]:
csv_random_point2_sample = pd.read_csv("../sample_data/processed/csv_random_point2_sample.csv", engine="python")
csv_random_point2_sample.head(3)

Unnamed: 0,row_id,id,fetch_date,uploader,uploader_id,upload_date,title,desc_text,category,tags,...,credits,regions_allowed,recommended_videos,headline_badges,unavailable_message,license,view_like_ratio,view_dislike_ratio,like_dislike_ratio,dislike_like_ratio
0,62751068,XUU-c5BzGv0,2019-02-04 16:57:14,Š T O K I X,UCIP6gRFMVGGMa6uHxJBzxMw,2017-06-24,EDYN - JEDNÉHO VEČERA FT. ROKO | PARODY [REUPL...,"Dneska to bude ďalšia paródia, tentokrát na Ed...",Comedy,"{""EDYN PARODY"",EDYN,SAYTYR,""ZLÝ ZAJO"",DUKLOCK,...",...,"[{""url"": ""/channel/UCDbM8yVukVKPWUQSODaw_Mw"", ...","{AD,AE,AF,AG,AI,AL,AM,AO,AQ,AR,AS,AT,AU,AW,AX,...","[{""video_id"": ""0OlHDnD7ovc"", ""view_count"": 103...",,,Creative Commons Attribution license (reuse al...,15.907408,143.16667,9.0,0.127273
1,62752016,bhq-OytkWM4,2019-02-03 04:00:21,Australian TV Fan,UCV_E5tsLsjryRQoRXgsmwSg,2016-02-01,7 News Brisbane New Opening Theme,7 News Brisbane 1 February 2016 new opening th...,News & Politics,"{""News presentation"",""7 news"",""7 Brisbane""}",...,"[{""url"": ""/channel/UCYfdidRxbB8Qhf0Nx7ioOYw"", ...","{AD,AE,AF,AG,AI,AL,AM,AO,AQ,AR,AS,AT,AU,AW,AX,...","[{""video_id"": ""SKNW6kwBRdA"", ""view_count"": 116...",,,,334.78946,6361.0,19.0,0.1
2,62751952,Balr2jpHKzs,2019-02-03 04:00:14,gracenote6,UCISbkwID3Myr6PbyCLrWlhg,2008-12-20,Gracenote - jinglebellrock/i caught myself (co...,venue: SM megamall\nline up: \njinglebellrock(...,Music,"{Gracenote,Converse,event,jingle,bell,rock,Par...",...,"[{""url"": ""/channel/UC-9-kyTW8ZkZNDHQJ6FgpwQ"", ...","{AD,AE,AF,AG,AI,AL,AM,AO,AQ,AR,AS,AT,AU,AW,AX,...","[{""video_id"": ""F39tLtFub90"", ""view_count"": 891...",,,,229.96297,1552.25,6.75,0.178571


## Comments data CSVs downloaded via web scraper

In [8]:
comments_liked_sample = pd.read_csv("../sample_data/processed/comments_csv/comments_liked_sample.csv", lineterminator="\n")
comments_liked_sample.head(3)

Unnamed: 0,cid,text,time,author,channel,votes,heart,video_id,text_cleaned,language
0,UgyAPB4OnCUFpVnfP8B4AaABAg,"sorry chaeng, this two can't stop sailing alth...",3 years ago,aise aysa,UCKwvC1czPFIZz6tepH_Kg7A,95,True,vYRBswkH8Zc,"sorry chaeng, this two can't stop sailing alth...",en
1,Ugzx0oZUUpf3y1OZXG14AaABAg,*You try to stop crying\n*But you refused.,5 years ago,Mirlica x Hayley,UCb38VDSqFRhV0zXAQfc4hIw,21,False,ZNcyYsXQrzg,*You try to stop crying *But you refused.,en
2,Uggg6jXmXr5QD3gCoAEC,"Man merkt, dass das Wetter kälter wird, die Ne...",7 years ago (edited),DieGinny,UCch5a56qiRolZtL9cgo90UA,1,False,RNEu2E7-zcY,"Man merkt, dass das Wetter kälter wird, die Ne...",de
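
The `time` column holds relative strings such as `3 years ago` or `7 years ago (edited)` rather than timestamps. The sketch below splits such a string into a count, a unit, and an edited flag. The function name `parse_relative_time` and the list of units are assumptions; only the English phrasing seen in this sample is covered.

```python
import re

# Matches strings like "3 years ago" or "7 years ago (edited)".
_RELATIVE_TIME = re.compile(
    r"^(?P<count>\d+)\s+"
    r"(?P<unit>second|minute|hour|day|week|month|year)s?\s+ago"
    r"(?P<edited>\s+\(edited\))?$"
)


def parse_relative_time(text):
    """Return (count, unit, edited) for an English relative time, else None."""
    match = _RELATIVE_TIME.match(text.strip())
    if match is None:
        return None
    return int(match["count"]), match["unit"], match["edited"] is not None
```

Strings in other languages, such as the Arabic values in the other comment samples in this notebook, return `None` and would need their own handling.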


In [9]:
comments_disliked_sample = pd.read_csv("../sample_data/processed/comments_csv/comments_disliked_sample.csv", lineterminator="\n")
comments_disliked_sample.head(3)

Unnamed: 0,cid,text,time,author,channel,votes,photo,heart,video_id
0,Ugwg-WhqKmZ6qxz5opp4AaABAg,"your not evening screaming, your just making s...",قبل 11 سنة,AvoidingtheQuestion,UCGT2aruoruPN4vCHOAbQYJw,0,,False,ks-eAAEXnNU
1,UggLXI6ZBTRq-HgCoAEC,How do you get a skin on luncher team extream ...,قبل 7 سنوات,iiZeex,UCf6fi8_eyp92_Yli8LuFcqg,0,,False,-IBoKsjMeHw
2,UgjoomrAr9z-ZngCoAEC,Real corruption,قبل 5 سنوات,Stefano Cinquegrana,UCkpEFOsa83lPqxU-JSMakrA,0,,False,UuJtngplkb4


In [10]:
comments_random_sample = pd.read_csv("../sample_data/processed/comments_csv/comments_random_sample.csv", lineterminator="\n")
comments_random_sample.head(3)

Unnamed: 0,cid,text,time,author,channel,votes,photo,heart,video_id
0,UgjS3JFda9BHhngCoAEC,"Я верю, Беркут спасет Украину!!!",قبل 8 سنوات,Pol Hant,UC3qoG-iVFEItoqNuBqOJ8oQ,17,,False,47zLQOph4vQ
1,UgwNWWPd4kBKOsbDQJx4AaABAg,amo-te muito é lindo demais o teu trabalho que...,قبل سنة واحدة,Zurema Chandicua,UCaWWuDU02cN9dXuY4NwQOpw,0,,False,OYyS0-JRd80
2,Ugjs3EyHxXfMrHgCoAEC,love scoota,قبل 4 سنوات,Kesha Backus,UCIU5bTvGnZauHGwNdU4k_qA,15,,True,Bk8B5EQUhb4


In [11]:
comments_random_point2_sample = pd.read_csv("../sample_data/processed/comments_csv/comments_random_point2_sample.csv", lineterminator="\n")
comments_random_point2_sample.head(3)

Unnamed: 0,cid,text,time,author,channel,votes,photo,heart,video_id
0,UgiBG8owCBF5cXgCoAEC,"Czytalam kiedys ksiazke,wspaniala.Ciesze sie z...",قبل 7 سنوات,Nessa Yola,UCIGoLID77TTlhsQTRwBlJZg,11,,False,pn0aK59xhl0
1,Ugwr2HtFrZYNZeyKell4AaABAg,"""Now we are free"" from the Gladiator Soundtrac...",قبل 14 سنة,TehCheese,UC4PwZvOzDFUG-i848MVbqYw,0,,False,EFkt2a9ufuE
2,UgiJ6o-fWxjhVngCoAEC,세계 어딘가에선 지금도 석양이 지고 있지..,قبل 5 سنوات,foxy MOTO,UCyW0qCb0Q-CbEOVNQqCC0sw,0,,False,YN7EmvQIDrc


## Training DataFrame small sample

In [12]:
training_sample = pd.read_pickle("../sample_data/processed/training_sample.pkl")
training_sample.head(3)

Unnamed: 0,id,fetch_date,uploader,upload_date,title,desc_text,category,duration,age_limit,view_count,...,desc_pos,desc_compound,video_id,votes,comment_neg,comment_neu,comment_pos,comment_compound,NoComments,NoCommentsBinary
695596,hS0bcE5V5LI,2019-02-04 12:25:51,Dorivan Salles,2014-11-03,Dorivan Gamer Minha Intro Nova XD,created at httpanimotocom,Film & Animation,25,0,32,...,0.5,0.25,0,0.0,0.0,0.0,0.0,0.0,True,1
140467,numh27cyViE,2019-02-03 05:51:47,UriahJabez,2017-08-23,Gomes signals to Kluber to hit batter.,yan gomes uses right thumb to signal cory klub...,Music,12,0,2407,...,0.0,-0.3818,numh27cyViE,0.2,0.0584,0.8206,0.121,0.1551,False,0
163426,6eIs_J-kwz0,2019-02-03 03:42:19,Eduardo Feldberg,2018-11-15,O Que Todo Iniciante no Violão Deve Estudar (P...,na aula de hoje passarei um plano de estudo pa...,Music,581,0,10660,...,0.0,-0.296,6eIs_J-kwz0,62.7,0.0214,0.9786,0.0,-0.11187,False,0


## Testing DataFrame small sample

In [13]:
testing_sample = pd.read_pickle("../sample_data/processed/testing_sample.pkl")
testing_sample.head(3)

Unnamed: 0,id,fetch_date,uploader,upload_date,title,desc_text,category,duration,age_limit,view_count,...,desc_pos,desc_compound,video_id,votes,comment_neg,comment_neu,comment_pos,comment_compound,NoComments,NoCommentsBinary
139336,LXdz7JKVM-4,2019-02-04 03:49:23,Noah T. Gaming Plus,2018-12-11,Unboxing Inkling Girl Amiibo Unboxing #1,oh boy at least am a squid and a kid and maybe...,Gaming,415,0,8,...,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,True,1
157139,Wxa1cC5jXT8,2019-02-04 06:58:50,ملاصادق الجابري/ sadek ligabere,2017-07-30,ملا ستار الحجامي / ثكيل / اجاعو /جديد. 2017,0,People & Blogs,520,0,450,...,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,True,1
89730,L3nqhwQZkLs,2019-02-03 03:08:56,毎日気持ちいいチャンネル,2018-09-04,気持ちいい！高圧洗浄動画#54,チャンネル登録で毎日気持ちいい再生リスト高圧洗浄シリーズはこちらhttpswwwyoutub...,Entertainment,618,0,21591,...,0.0,0.0,L3nqhwQZkLs,4.5,0.0894,0.7903,0.0201,-0.25923,False,0
