# Collecting and Cleaning YouTube Trending Data
This notebook collects YouTube trending data for selected countries, explores it briefly, and cleans it to prepare it for further analysis.

## Step 1: Data Collection Using Kaggle API
Authenticate with the Kaggle API and download the dataset of YouTube trending videos from 113 countries, which is updated daily.


In [None]:
from kaggle.api.kaggle_api_extended import KaggleApi

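# Create the API object and authenticate; credentials are read from
# ~/.kaggle/kaggle.json or the KAGGLE_USERNAME / KAGGLE_KEY environment variables
api = KaggleApi()
api.authenticate()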
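
The cell above only authenticates; the download call itself is not shown. A minimal sketch of that step follows, assuming the `dataset_download_files` method of `KaggleApi`. The helper name `download_dataset` is hypothetical, and the dataset slug is left as a parameter because the notebook does not state it.

```python
from pathlib import Path


def download_dataset(api, dataset_slug, dest_dir="."):
    """Download and unzip a Kaggle dataset into dest_dir.

    `api` is an authenticated KaggleApi object, as created in the cell above.
    `dataset_slug` has the form "<owner>/<dataset-name>"; the notebook does
    not name it, so the caller has to supply it.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # unzip=True extracts the CSV instead of leaving a .zip archive behind
    api.dataset_download_files(dataset_slug, path=str(dest), unzip=True)
    return dest
```

With the `api` object from above, a call would look like `download_dataset(api, "<owner>/<dataset-name>")`.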

## Step 2: Loading the Dataset
Load the trending data from the downloaded CSV file into a pandas DataFrame.

In [37]:
import pandas as pd
df_trending = pd.read_csv('trending_yt_videos_113_countries.csv')
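
Before loading a large file in full, it can help to parse only a few rows and check the column names and inferred types. The sketch below uses a tiny in-memory stand-in for the real CSV; the sample rows are invented for illustration, and only the column names are taken from the notebook.

```python
import io

import pandas as pd

# Tiny stand-in for the real file (illustrative rows, not real data)
sample_csv = io.StringIO(
    "title,country,view_count\n"
    "Video A,GB,100\n"
    "Video B,US,200\n"
    "Video C,CA,300\n"
)

# nrows limits how many data rows are parsed, so the check stays cheap
preview = pd.read_csv(sample_csv, nrows=2)
print(preview.columns.tolist())
print(preview.dtypes)
```

For the real dataset, the file path would replace `sample_csv`.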

## Step 3: Filtering for Relevant Countries
Define the country codes to keep: USA, UK (GB), Australia, and Canada. The filter itself is applied while the file is read in Step 4.


In [38]:
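# NOTE: rows shown later use two-letter codes ('GB', 'AU', 'CA'); if the USA is stored as 'US', 'USA' matches nothing
desired_countries = ['USA', 'GB', 'AU', 'CA']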
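
A quick sanity check on a filter list is to confirm that every requested code actually occurs in the `country` column. The sketch below uses a hypothetical helper, `unmatched_codes`, and an invented sample column that assumes two-letter codes.

```python
import pandas as pd


def unmatched_codes(df, codes, column="country"):
    """Return the requested codes that never occur in df[column]."""
    present = set(df[column].dropna().unique())
    return [code for code in codes if code not in present]


# Illustrative sample only; it assumes the column holds two-letter codes
sample = pd.DataFrame({"country": ["GB", "US", "AU", "CA", "GB"]})
print(unmatched_codes(sample, ["USA", "GB", "AU", "CA"]))  # ['USA']
```

An empty list means every code matched at least one row.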

## Step 4: Merging Filtered Chunks
Read the CSV in chunks, keep only the rows for the desired countries in each chunk, and then concatenate the filtered chunks into a single DataFrame.


In [40]:
filtered_data = []

# Use the chunksize parameter to read the CSV in chunks
for chunk in pd.read_csv('trending_yt_videos_113_countries.csv', chunksize=100000):
    # Filter the chunk for the desired countries
    filtered_chunk = chunk[chunk['country'].isin(desired_countries)]
    # Append the filtered chunk to the list
    filtered_data.append(filtered_chunk)
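
The chunked read and the later concatenation can be wrapped in one function. This is a sketch under the assumption that the filter column is named `country`; `filter_csv_by_country` is a hypothetical name, and the in-memory CSV stands in for the real file.

```python
import io

import pandas as pd


def filter_csv_by_country(source, countries, chunksize=100_000):
    """Read a CSV in chunks and keep only rows whose country is in `countries`."""
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        kept.append(chunk[chunk["country"].isin(countries)])
    # ignore_index=True gives the combined frame a clean 0..n-1 index
    return pd.concat(kept, ignore_index=True)


# Illustrative data; chunksize=2 forces several chunks on this tiny input
sample_csv = io.StringIO("title,country\nA,GB\nB,FR\nC,CA\nD,DE\nE,AU\n")
result = filter_csv_by_country(sample_csv, ["GB", "AU", "CA"], chunksize=2)
print(result)
```

Only one chunk is parsed at a time, so peak memory depends on `chunksize` and the size of the filtered result rather than on the full file.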

In [42]:
filtered_data

[                                                   title  \
 4000                 ANDREW GARFIELD | CHICKEN SHOP DATE   
 4001   Hojlund nets first PL goal of the season! 🙌 | ...   
 4002                    Surprise! 🎲  | Ep. 1 | Wild Life   
 4003                   Naming Babies in 2045: Part 3 👶💀🧌   
 4004    ROSÉ & Bruno Mars - APT. (Official Music Video)   
 ...                                                  ...   
 95789  THE COST OF MANSORY FINISHING MY ROLLS ROYCE R...   
 95790               Backrooms - Lighting and Tile Survey   
 95791             Can Shayne Guess Our Childhood Photos?   
 95792  The road to Chimney Rock is gone - Hurricane H...   
 95793   Minecraft but I get CAPTURED in PVP CIVILIZATION   
 
                     channel_name  daily_rank  daily_movement  weekly_movement  \
 4000         Amelia Dimoldenberg           1               0               49   
 4001   Sky Sports Premier League           2              48               48   
 ...

In [49]:
df_filtered = pd.concat(filtered_data, ignore_index=True)

In [54]:
df = df_filtered

## Step 5: Basic Data Exploration
Inspect the first few rows of the DataFrame and check the total number of records.

In [55]:
df.head()

Unnamed: 0,title,channel_name,daily_rank,daily_movement,weekly_movement,snapshot_date,country,view_count,like_count,comment_count,description,thumbnail_url,video_id,channel_id,video_tags,kind,publish_date,langauge
0,ANDREW GARFIELD | CHICKEN SHOP DATE,Amelia Dimoldenberg,1,0,49,2024-10-20,GB,4336922,354482,16095,Amelia meets Andrew Garfield for a date in a C...,https://i.ytimg.com/vi/eFS5vxYlfY8/mqdefault.jpg,eFS5vxYlfY8,UCyQ-DUV6lZgoL8wiPusYiUg,,youtube#video,2024-10-18 00:00:00+00:00,en-GB
1,Hojlund nets first PL goal of the season! 🙌 | ...,Sky Sports Premier League,2,48,48,2024-10-20,GB,986155,10686,1009,► Subscribe to Sky Sports Premier League: http...,https://i.ytimg.com/vi/06wRDKlLymQ/mqdefault.jpg,06wRDKlLymQ,UCNAf1k0yIjyGu3k9BwAg3lg,"sky sports, premier league, football, Sky Spor...",youtube#video,2024-10-19 00:00:00+00:00,en-GB
2,Surprise! 🎲 | Ep. 1 | Wild Life,LDShadowLady,3,47,47,2024-10-20,GB,313699,37034,4620,"Welcome to Wild Life SMP, a new variation on t...",https://i.ytimg.com/vi/UluZ54MxGNI/mqdefault.jpg,UluZ54MxGNI,UCzTlXb7ivVzuFlugVCv3Kvg,"ldshadowlady, minecraft, mini game, girl gamer...",youtube#video,2024-10-19 00:00:00+00:00,en-GB
3,Naming Babies in 2045: Part 3 👶💀🧌,Josiah Schneider,4,46,46,2024-10-20,GB,437026,33246,336,"If my name were Dracula, I wouldn't be mad. Ju...",https://i.ytimg.com/vi/gEIZrpjklYE/mqdefault.jpg,gEIZrpjklYE,UCmXQs4aZSk4eL5MIimvwMVQ,"skit, skits, funny, comedy, funny skit, sketch...",youtube#video,2024-10-19 00:00:00+00:00,en-US
4,ROSÉ & Bruno Mars - APT. (Official Music Video),ROSÉ,5,0,45,2024-10-20,GB,55940223,4139961,209332,ROSÉ & Bruno Mars - APT.\nDownload/stream: ht...,https://i.ytimg.com/vi/ekr2nIex040/mqdefault.jpg,ekr2nIex040,UCBo1hnzxV9rz3WVsv__Rn1g,"YG Entertainment, YG, 와이지, K-pop, BLACKPINK, 블...",youtube#video,2024-10-18 00:00:00+00:00,en-US


In [64]:
len(df)

53909
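
The total row count does not show how the rows are spread across countries. A per-country breakdown, sketched here on invented sample rows, can confirm that every requested country contributed data.

```python
import pandas as pd

# Illustrative rows; column names follow the notebook's DataFrame
sample = pd.DataFrame({
    "country": ["GB", "GB", "AU", "CA", "CA", "CA"],
    "snapshot_date": ["2024-10-14", "2024-10-15", "2024-10-14",
                      "2024-10-14", "2024-10-14", "2024-10-15"],
})

# Number of rows per country
rows_per_country = sample["country"].value_counts()
# Number of distinct snapshot days per country
days_per_country = sample.groupby("country")["snapshot_date"].nunique()

print(rows_per_country)
print(days_per_country)
```

A country missing from `rows_per_country` would indicate that its code did not match the filter list.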

In [56]:
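# NOTE: df_channel is not created anywhere in this notebook; it is assumed to be
# a DataFrame with a 'Channel' column that was loaded beforehand.
# Lowercase the channel names on both sides so the match is case-insensitive
df['normalized_channelTitle'] = df['channel_name'].str.lower()
df_channel['normalized_Channel'] = df_channel['Channel'].str.lower()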

# Merge the DataFrames on the normalized channel names
common_channels = pd.merge(df, df_channel, left_on='normalized_channelTitle', right_on='normalized_Channel')

# Display the common channels
print("Common channels in both DataFrames:")
print(common_channels[['channel_name', 'Channel']])  # Display relevant columns

Common channels in both DataFrames:
     channel_name      Channel
0      DOM Studio   DOM Studio
1     MoreSidemen  MoreSidemen
2      colinfurze   colinfurze
3      DOM Studio   DOM Studio
4     MoreSidemen  MoreSidemen
...           ...          ...
1646  The Beatles  The Beatles
1647  The Beatles  The Beatles
1648  Mumbo Jumbo  Mumbo Jumbo
1649          BBC          BBC
1650        Grian        Grian

[1651 rows x 2 columns]
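
The merge returns one row per matching trending record, so the same channel appears many times. If only the list of shared channels is needed, one option is to deduplicate both sides on the normalized name first. The sketch below uses invented sample frames; `trending` and `channels` stand in for `df` and `df_channel`.

```python
import pandas as pd

# Illustrative stand-ins for df and df_channel
trending = pd.DataFrame({"channel_name": ["BBC", "bbc", "Grian", "Grian", "Other"]})
channels = pd.DataFrame({"Channel": ["BBC", "Grian", "Unlisted"]})

# Keep one row per normalized name on each side so the merge cannot multiply rows
left = (trending.assign(key=trending["channel_name"].str.strip().str.lower())
        .drop_duplicates("key"))
right = (channels.assign(key=channels["Channel"].str.strip().str.lower())
         .drop_duplicates("key"))

# validate="one_to_one" raises if either side still has duplicate keys
common = left.merge(right, on="key", how="inner", validate="one_to_one")
print(common[["channel_name", "Channel"]])
```

The result then has exactly one row per shared channel, so `len(common)` gives the number of common channels directly.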


In [57]:
common_channels['channel_name'].nunique()

35

In [58]:
common_channels['Channel'].nunique()

35

In [59]:
df['channel_name'].nunique()

4043

In [65]:
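# Summarize the publish_date column (the Timestamp values in the output below
# indicate the column had already been parsed as datetimes when this cell ran)
publish_date_stats = {
    'min': df['publish_date'].min(),
    'max': df['publish_date'].max(),
    'count': df['publish_date'].count(),
    'unique': df['publish_date'].nunique(),
    'first': df['publish_date'].first_valid_index(),  # index label of the first non-null row, not a date
    'last': df['publish_date'].last_valid_index()  # index label of the last non-null row, not a date
}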

# Print the statistics
print(publish_date_stats)

{'min': Timestamp('2023-10-19 06:53:35+0000', tz='UTC'), 'max': Timestamp('2024-10-19 00:00:00+0000', tz='UTC'), 'count': 53909, 'unique': 471, 'first': 0, 'last': 53908}
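
`pd.read_csv` loads date columns as plain strings unless told otherwise, so a conversion step is needed before `min` and `max` return `Timestamp` objects like those above. A minimal sketch of that conversion with `pd.to_datetime`; the notebook does not show how it was actually done, so this is an assumption, and the sample values only mimic the format seen in the outputs.

```python
import pandas as pd

# Sample strings in the same format as the publish_date column
raw = pd.Series(
    ["2024-10-18 00:00:00+00:00", "2023-10-19 06:53:35+00:00", None],
    name="publish_date",
)

# utc=True yields timezone-aware timestamps; missing values become NaT
parsed = pd.to_datetime(raw, utc=True)

summary = {
    "min": parsed.min(),
    "max": parsed.max(),
    "count": int(parsed.count()),    # non-null values only
    "unique": int(parsed.nunique()),
}
print(summary)
```

Comparing parsed timestamps avoids relying on string ordering, which only matches chronological order when every value uses exactly the same format.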


In [66]:
len(df)

53909

In [67]:
df['video_id'].nunique()

13534
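
With 13,534 distinct video IDs in 53,909 rows, most videos appear several times, across snapshot dates and countries. The sketch below, on invented sample rows, counts appearances per video and keeps the most recent snapshot of each one; whether a one-row-per-video table is wanted depends on the later analysis.

```python
import pandas as pd

# Illustrative rows; column names follow the notebook's DataFrame
sample = pd.DataFrame({
    "video_id": ["a", "a", "b", "a", "b"],
    "country": ["GB", "CA", "GB", "GB", "AU"],
    "snapshot_date": ["2024-10-14", "2024-10-14", "2024-10-14",
                      "2024-10-15", "2024-10-15"],
    "view_count": [100, 100, 50, 150, 80],
})

# How many rows each video contributes
appearances = sample.groupby("video_id").size()

# One row per video: the row from its latest snapshot date
# (ISO-formatted date strings sort chronologically)
latest = (sample.sort_values("snapshot_date")
          .drop_duplicates("video_id", keep="last")
          .reset_index(drop=True))

print(appearances)
print(latest)
```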

In [75]:
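# Ran after the Step 6 cleaning (see the In [75] counter): 'kind' is dropped, 'description' and 'video_tags' are filled
df.isnull().sum()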

title                          0
channel_name                   0
daily_rank                     0
daily_movement                 0
weekly_movement                0
snapshot_date                  0
country                        0
view_count                     0
like_count                     0
comment_count                  0
description                    0
thumbnail_url                  0
video_id                       0
channel_id                     0
video_tags                     0
publish_date                   0
langauge                   11372
normalized_channelTitle        0
dtype: int64

In [76]:
df['langauge'].nunique()

41
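
Absolute null counts are easier to judge next to the share of rows they represent. A small sketch on invented sample data that reports both and lists only the columns with missing values:

```python
import pandas as pd

# Illustrative rows; 'langauge' is spelled as in the dataset
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "langauge": ["en", None, "en-GB", None],
    "description": ["x", "y", None, "z"],
})

is_missing = sample.isna()
missing = pd.DataFrame({
    "n_missing": is_missing.sum(),
    "share_missing": is_missing.mean(),  # fraction of rows that are null
})
# Keep only columns that have at least one missing value, worst first
missing = (missing[missing["n_missing"] > 0]
           .sort_values("share_missing", ascending=False))
print(missing)
```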

## Step 6: Data Cleaning
Prepare the dataset with two cleaning steps:
- Remove unnecessary columns: Drop the 'kind' column, which is not needed for the analysis.
- Handle missing values: Replace null values in the 'description' and 'video_tags' columns with "no description" and "no tags", respectively. Null values in 'langauge' are left as they are.


In [71]:
df = df.drop(columns=['kind'])

In [73]:
df[df['description'].isnull()].head()

Unnamed: 0,title,channel_name,daily_rank,daily_movement,weekly_movement,snapshot_date,country,view_count,like_count,comment_count,description,thumbnail_url,video_id,channel_id,video_tags,publish_date,langauge,normalized_channelTitle
786,Moving on.,ItsMeYellow,37,-12,13,2024-10-15,GB,487481,27843,2610,,https://i.ytimg.com/vi/ZqyTZ-hgFGU/mqdefault.jpg,ZqyTZ-hgFGU,UCvjx1ZaKxwGH5hbZQccx_9Q,,2024-10-12 00:00:00+00:00,,itsmeyellow
845,Moving on.,ItsMeYellow,46,-21,4,2024-10-15,CA,487354,27840,2610,,https://i.ytimg.com/vi/ZqyTZ-hgFGU/mqdefault.jpg,ZqyTZ-hgFGU,UCvjx1ZaKxwGH5hbZQccx_9Q,,2024-10-12 00:00:00+00:00,,itsmeyellow
872,Port Antonio,J. Cole,23,-13,27,2024-10-15,AU,4718518,335489,29432,,https://i.ytimg.com/vi/BWtBckf8RIw/mqdefault.jpg,BWtBckf8RIw,UCnc6db-y3IU7CkT_yeVXdVg,,2024-10-10 00:00:00+00:00,,j. cole
931,Moving on.,ItsMeYellow,32,18,18,2024-10-14,GB,378962,25882,2509,,https://i.ytimg.com/vi/ZqyTZ-hgFGU/mqdefault.jpg,ZqyTZ-hgFGU,UCvjx1ZaKxwGH5hbZQccx_9Q,,2024-10-12 00:00:00+00:00,,itsmeyellow
979,Moving on.,ItsMeYellow,30,20,20,2024-10-14,CA,378742,25872,2509,,https://i.ytimg.com/vi/ZqyTZ-hgFGU/mqdefault.jpg,ZqyTZ-hgFGU,UCvjx1ZaKxwGH5hbZQccx_9Q,,2024-10-12 00:00:00+00:00,,itsmeyellow


In [74]:
# Replace null values in 'description' with 'no description'
df['description'] = df['description'].fillna('no description')

# Replace null values in 'video_tags' with 'no tags'
df['video_tags'] = df['video_tags'].fillna('no tags')
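
The cleaning steps above can also be bundled into one function that returns a new DataFrame and leaves its input untouched. This is a sketch; `clean_trending` is a hypothetical name and the sample rows are invented.

```python
import pandas as pd


def clean_trending(df):
    """Drop the 'kind' column and fill missing descriptions and tags."""
    # errors="ignore" makes the function safe to re-run after 'kind' is gone
    cleaned = df.drop(columns=["kind"], errors="ignore")
    fill_values = {"description": "no description", "video_tags": "no tags"}
    # A dict fills each listed column with its own value; other columns are untouched
    return cleaned.fillna(value=fill_values)


# Illustrative rows; 'langauge' is spelled as in the dataset
sample = pd.DataFrame({
    "title": ["A", "B"],
    "description": [None, "some text"],
    "video_tags": ["tag1, tag2", None],
    "kind": ["youtube#video", "youtube#video"],
    "langauge": [None, "en"],
})
cleaned = clean_trending(sample)
print(cleaned)
```

Because the function does not modify `sample`, running the cell twice gives the same result, which avoids the state problems that come with re-executing in-place notebook cells.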

In [77]:
df.to_csv('youtube_trending_data.csv', index=False)

In [3]:
import pandas as pd

# Load the CSV file into a pandas DataFrame
file_path = 'youtube_trending_data.csv'  # Update with the correct file path
df = pd.read_csv(file_path)
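
A CSV file does not store column types, so date columns come back as strings after a reload unless they are parsed again. The round trip below is a sketch on an in-memory buffer with invented rows; it uses the `parse_dates` parameter of `pd.read_csv`.

```python
import io

import pandas as pd

# Illustrative frame with a real datetime column and one missing language
original = pd.DataFrame({
    "video_id": ["a", "b"],
    "snapshot_date": pd.to_datetime(["2024-10-14", "2024-10-15"]),
    "langauge": ["en", None],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates converts the listed columns back to datetimes on load
reloaded = pd.read_csv(buffer, parse_dates=["snapshot_date"])
print(reloaded.dtypes)
```

Missing values survive the round trip: they are written as empty fields and read back as `NaN`.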

In [5]:
df['langauge'].unique()

array(['en-GB', 'en-US', 'en', 'ko', nan, 'en-CA', 'ja', 'fr', 'zxx',
       'es-419', 'hi', 'ta', 'te', 'ht', 'tr', 'es', 'en-AU', 'en-IN',
       'fr-CA', 'de', 'vi', 'en-IE', 'pa', 'akk', 'zh-Hans', 'es-MX',
       'fr-FR', 'ml', 'no', 'fil', 'ro', 'ar', 'it', 'uk', 'fa', 'zh-HK',
       'ne', 'sq', 'zh-Hant', 'es-US', 'kn', 'zh'], dtype=object)
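
Many of these codes are regional or script variants of the same language (for example 'en-GB', 'en-US', and 'en-CA'). If a coarser grouping is useful, the part before the first hyphen can serve as a base language. The sketch below is one possible approach on a sample of the codes listed above; the name `base_language` is hypothetical.

```python
import pandas as pd

# A few of the codes from the output above, plus one missing value
codes = pd.Series(
    ["en-GB", "en-US", "en", None, "es-419", "zh-Hans", "fr-CA"],
    name="langauge",
)

# Split on '-' and keep the first part; missing values stay missing
base_language = codes.str.split("-").str[0]
print(base_language.value_counts())
```

This merges script variants such as 'zh-Hans' and 'zh-Hant' as well, which may or may not be desirable for a given analysis.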

In [6]:
df['video_id'].nunique()

13534

In [7]:
df['channel_name'].nunique()

4043

In [8]:
df['channel_id'].nunique()

4021

In [10]:
df['channel_id'].isnull().sum()

0

In [12]:
df['title'].nunique()

13743
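
The counts above differ in two places: there are more distinct titles (13,743) than video IDs (13,534), and more channel names (4,043) than channel IDs (4,021). One possible explanation, stated here as an assumption, is that some videos and channels were renamed between snapshots. The sketch below lists the IDs that map to more than one value; `ids_with_multiple_values` is a hypothetical helper and the sample rows are invented.

```python
import pandas as pd


def ids_with_multiple_values(df, id_col, value_col):
    """Return, per ID, the number of distinct values when it exceeds one."""
    counts = df.groupby(id_col)[value_col].nunique()
    return counts[counts > 1].sort_values(ascending=False)


# Illustrative rows; column names follow the notebook's DataFrame
sample = pd.DataFrame({
    "video_id": ["v1", "v1", "v2", "v3"],
    "title": ["Old title", "New title", "Same", "Same"],
    "channel_id": ["c1", "c1", "c2", "c2"],
    "channel_name": ["Chan", "Chan (renamed)", "Other", "Other"],
})

retitled = ids_with_multiple_values(sample, "video_id", "title")
renamed = ids_with_multiple_values(sample, "channel_id", "channel_name")
print(retitled)
print(renamed)
```

Since the IDs are the stable keys, grouping or joining on `video_id` and `channel_id` rather than on titles or names avoids splitting one entity into several.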