## Deep Dive into YouTube Trending
As part of a Data Analytics examination, the Data Scientist at company N asked for an in-depth analysis of the YouTube Trending datasets.

This notebook analyzes the datasets prepared in the preceding steps.
### Datasets
- _categories_Jun-dd-2022.pkl_ contains YouTube categories
- _trending_Jun-dd-2022.pkl_ contains the YouTube videos listed in the Trending tab

In [1]:
from pathlib import Path
import pandas as pd
from collections import Counter
import nltk
# nltk.download("stopwords")  # Required once, to fetch the stopword list used below

### Define paths

In [2]:
version = "Jun-16-2022"
path_categories = Path.cwd() / "datasets" / "processed" / f"categories_{version}.pkl"
path_trending = Path.cwd() / "datasets" / "processed" / f"trending_{version}.pkl"
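
Since both paths depend on the `version` string, it can help to confirm that the pickles exist before loading them. The following is a minimal sketch, assuming the same `datasets/processed` layout as above; the helper names `dataset_path` and `missing_datasets` are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def dataset_path(name: str, version: str, base: Path) -> Path:
    """Build the path of a processed pickle, e.g. categories_Jun-16-2022.pkl."""
    return base / "datasets" / "processed" / f"{name}_{version}.pkl"


def missing_datasets(names, version, base):
    """Return the expected dataset paths that are not present on disk."""
    paths = [dataset_path(name, version, base) for name in names]
    return [path for path in paths if not path.is_file()]


# Report any pickle that is absent for the chosen version.
for path in missing_datasets(["categories", "trending"], "Jun-16-2022", Path.cwd()):
    print(f"Missing dataset: {path}")
```

An empty result means both files are in place for the chosen version.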

### Pathfinding | Technology Categories

In [3]:
df = pd.read_pickle(path_categories)
df_tech = df[df["root"] == "Education & Science"]
df_tech.head(10)

| | category_id | parent_id | root_id | category | parent | root |
|---|---|---|---|---|---|---|
| 167 | 300 | 300 | 300 | Education & Science | Education & Science | Education & Science |
| 6 | 301 | 301 | 300 | Science & Technology | Science & Technology | Education & Science |
| 7 | 302 | 302 | 300 | Education | Education | Education & Science |


In [4]:
tech_categories = df_tech["category_id"].tolist()
print(tech_categories)

[300, 301, 302]


### Analysis | Read Dataset

In [5]:
df = pd.read_pickle(path_trending)

### Analysis | Popular Words

In [6]:
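word_counter = Counter()
word_tokenizer = nltk.RegexpTokenizer(r"\w+")  # Split on runs of word characters
word_stemmer = nltk.stem.PorterStemmer()
word_stoplist = set(nltk.corpus.stopwords.words("english"))

for _, row in df.iterrows():
    # The stoplist is lowercase and the comparison is case-sensitive, so capitalized
    # stopwords such as "The" pass the filter and are lowercased later by the stemmer.
    tokens = [token for token in word_tokenizer.tokenize(row["description"]) if token not in word_stoplist]
    # Count the stem of every remaining token
    word_counter.update([word_stemmer.stem(word) for word in tokens])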

print(word_counter.most_common(50))

[('n', 93972), ('video', 16556), ('the', 15689), ('i', 11141), ('us', 10598), ('new', 10064), ('subscrib', 9240), ('channel', 8166), ('de', 8007), ('show', 7551), ('twitter', 7318), ('facebook', 7085), ('music', 7047), ('episod', 6709), ('use', 6418), ('get', 6160), ('watch', 6092), ('youtub', 5902), ('instagram', 5274), ('to', 5131), ('like', 5073), ('product', 4710), ('2018', 4347), ('link', 4344), ('make', 4285), ('live', 4168), ('full', 4155), ('com', 4085), ('one', 4055), ('a', 3971), ('2', 3823), ('_', 3717), ('game', 3699), ('thi', 3654), ('nsubscrib', 3651), ('follow', 3611), ('you', 3575), ('time', 3572), ('la', 3537), ('offici', 3509), ('day', 3399), ('song', 3376), ('love', 3357), ('here', 3269), ('on', 3173), ('tv', 3145), ('1', 3100), ('le', 3069), ('latest', 3016), ('best', 2931)]
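
The top of this list is dominated by artifacts rather than vocabulary. The stems `n` and `nsubscrib` suggest that the descriptions store line breaks as the two literal characters `\` and `n`, and stopwords such as `the`, `i`, and `to` survive because the filter runs before lowercasing. Below is a minimal sketch of a cleaner tokenization step under those assumptions. It uses only the standard library, applies the same `\w+` pattern through `re.findall`, and leaves stemming out; `clean_tokens` and `DEMO_STOPLIST` are hypothetical names, and in the notebook `word_stoplist` would be passed instead.

```python
import re
from collections import Counter

# Hypothetical miniature stoplist, for illustration only.
DEMO_STOPLIST = {"the", "i", "to", "a", "on", "you"}


def clean_tokens(text, stoplist):
    """Tokenize text after undoing escaped newlines and lowercasing."""
    # Assumption: line breaks are stored as a literal backslash followed by "n".
    text = text.replace("\\n", " ")
    # Lowercase first so that "The" and "the" are both caught by the stoplist.
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stoplist]


sample = "The new video\\nSubscribe to the channel"
demo_counter = Counter(clean_tokens(sample, DEMO_STOPLIST))
print(demo_counter.most_common(3))
```

Whether numeric tokens such as `2018` or the underscore token `_` should also be dropped depends on the question being asked, so the sketch leaves them in.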


### Analysis | Popular Words in Tech

In [7]:
word_counter = Counter()
for _, row in df.iterrows():
    # Same pipeline as above, restricted to videos in the tech categories
    if row["category_id"] in tech_categories:
        tokens = [token for token in word_tokenizer.tokenize(row["description"]) if token not in word_stoplist]
        word_counter.update([word_stemmer.stem(word) for word in tokens])

print(word_counter.most_common(50))

[('n', 7125), ('video', 1270), ('the', 1071), ('i', 779), ('link', 628), ('de', 613), ('you', 565), ('get', 508), ('use', 498), ('music', 494), ('us', 488), ('thi', 436), ('iphon', 416), ('a', 406), ('credit', 396), ('new', 391), ('smartphon', 389), ('x', 361), ('support', 351), ('le', 349), ('one', 345), ('patreon', 343), ('in', 338), ('like', 336), ('click', 332), ('amazon', 325), ('to', 320), ('kevin', 314), ('for', 305), ('com', 300), ('product', 295), ('is', 293), ('thank', 293), ('channel', 291), ('screen', 281), ('à', 273), ('what', 259), ('make', 256), ('phone', 255), ('life', 254), ('never', 252), ('world', 251), ('tech', 236), ('pleas', 234), ('it', 230), ('buy', 228), ('your', 228), ('la', 227), ('gmail', 222), ('10', 220)]
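
Raw counts make the two lists hard to compare, because generic stems such as `video` rank highly in both. One possible way to see which stems lean towards tech is to compute, for each stem, the share of its occurrences that come from tech videos. The sketch below does this for a handful of counts copied from the two outputs above; the helper `tech_share` is a hypothetical name, and the choice of stems is arbitrary.

```python
from collections import Counter

# Counts copied from the two printed outputs above (a small subset of stems).
overall_counts = Counter({"video": 16556, "link": 4344, "music": 7047, "get": 6160, "use": 6418})
tech_counts = Counter({"video": 1270, "link": 628, "music": 494, "get": 508, "use": 498})


def tech_share(tech, overall):
    """For each stem, return the fraction of its occurrences found in tech videos."""
    return {stem: tech[stem] / overall[stem] for stem in tech if overall[stem] > 0}


shares = tech_share(tech_counts, overall_counts)
# List the stems from most to least tech-leaning.
for stem, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{stem}: {share:.3f}")
```

In the notebook, the same function could be applied to the full counters, provided the overall and tech counters are kept under separate names instead of reusing `word_counter`.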
