# YouTube Trending Video Dataset Cleaning

This notebook contains code to clean up the [YouTube Trending Video Dataset from Kaggle](https://www.kaggle.com/rsrishav/youtube-trending-video-dataset).

The rules by which data cleaning is done depend very much on the task, so mine may be very different from yours.

My rules/stages:
- Combine datasets for all countries into one large pandas DataFrame
- Replace NaN in description with space (videos with no description)
- Delete all rows with missing values if any
- Keep records only with unique video IDs with the maximum number of views (the latest request)
- Delete entries where comments_disabled=True or ratings_disabled=True (where comments or information about likes/dislikes are unavailable). Note that all videos have dislikes unavailable from December 13, 2021
- Split tags with '|' and convert a list of strings into one string
- Delete non-ASCII and non-English characters from video title, channel title, description, and tags
- Delete rows where the video title or channel title no longer contains any English letters
- Create a new comments column filled with " " (a space) for all rows
- Rename columns to match snake case
- Delete the following columns: ['categoryId', 'thumbnail_link', 'comments_disabled', 'ratings_disabled']
- Reorder columns
- Sort rows by published_at
- Reset index

On the data snapshot used in this run, the code yields a dataset of about 284k rows.
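The "keep only the record with the maximum number of views" rule can be illustrated on a toy frame. This is a minimal sketch: the `toy` frame and its values are made up for illustration, and only the `groupby(...).idxmax()` pattern mirrors the cleaning function further down.

```python
import pandas as pd

# Toy data (made-up values): video "a" was recorded twice, video "b" once.
toy = pd.DataFrame({
    "video_id": ["a", "a", "b"],
    "view_count": [100, 250, 40],
})

# Index label of the row with the highest view_count for each video_id
latest_idx = toy.groupby("video_id")["view_count"].idxmax()

# Keep only those rows, one per video
deduped = toy.loc[latest_idx].reset_index(drop=True)
print(deduped)
```

Since view counts normally only grow, the row with the most views is assumed to be the most recent request for that video.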

# Imports

In [23]:
import glob
import pandas as pd

# Combine datasets from all countries

In [24]:
# list of paths to all .csv files
datasets_filenames = glob.glob("data/*.csv")

In [25]:
def combine_datasets(filenames):
    """ Concatenate all dataframes from 'filenames' and reset index """
    
    list_of_df = []

    for filename in filenames:
        current_df = pd.read_csv(filename,
                                 parse_dates=['publishedAt', 'trending_date'])
        list_of_df.append(current_df)

    all_df = pd.concat(list_of_df)
    all_df.reset_index(drop=True, inplace=True)
        
    return all_df
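One detail that may matter here: `glob.glob` does not guarantee the order of the returned paths, and `idxmax` returns the first occurrence when several rows tie, so the file order can influence which duplicate survives later. The sketch below shows one possible way to make the order deterministic. The two temporary CSV files and their contents are hypothetical stand-ins, not the real Kaggle files.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Two tiny stand-in files with made-up rows
    pd.DataFrame({"video_id": ["a"], "view_count": [1]}).to_csv(
        os.path.join(tmp_dir, "BR_youtube_trending_data.csv"), index=False)
    pd.DataFrame({"video_id": ["b"], "view_count": [2]}).to_csv(
        os.path.join(tmp_dir, "US_youtube_trending_data.csv"), index=False)

    # sorted() fixes the file order, which glob alone leaves unspecified
    filenames = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))

    # ignore_index=True renumbers the rows, like reset_index(drop=True)
    combined = pd.concat([pd.read_csv(f) for f in filenames], ignore_index=True)

print(combined)
```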

In [26]:
combined_df = combine_datasets(datasets_filenames)
combined_df

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description
0,s9FH4rDMvds,LEVEI UM FORA? FINGI ESTAR APAIXONADO POR ELA!,2020-08-11 22:21:49+00:00,UCGfBwrCoi9ZJjKiUK8MmJNw,Pietro Guedes,22,2020-08-12 00:00:00+00:00,pietro|guedes|ingrid|ohara|pingrid|vlog|amigos...,263835,85095,487,4500,https://i.ytimg.com/vi/s9FH4rDMvds/default.jpg,False,False,"Salve rapaziada, neste vídeo me declarei pra e..."
1,jbGRowa5tIk,ITZY “Not Shy” M/V TEASER,2020-08-11 15:00:13+00:00,UCaO6TYtlC8U5ttz62hTrZgg,JYP Entertainment,10,2020-08-12 00:00:00+00:00,JYP Entertainment|JYP|ITZY|있지|ITZY Video|ITZY ...,6000070,714310,15176,31040,https://i.ytimg.com/vi/jbGRowa5tIk/default.jpg,False,False,ITZY Not Shy M/V[ITZY Official] https://www.yo...
2,3EfkCrXKZNs,Oh Juliana PARÓDIA - MC Niack,2020-08-10 14:59:00+00:00,UCoXZmVma073v5G1cW82UKkA,As Irmãs Mota,22,2020-08-12 00:00:00+00:00,OH JULIANA PARÓDIA|MC Niack PARÓDIA|PARÓDIAS|A...,2296748,39761,5484,0,https://i.ytimg.com/vi/3EfkCrXKZNs/default.jpg,True,False,Se inscrevam meus amores! 📬 Quer nos mandar al...
3,gBjox7vn3-g,Contos de Runeterra: Targon | A Estrada Tortuosa,2020-08-11 15:00:09+00:00,UC6Xqz2pm50gDCORYztqhDpg,League of Legends BR,20,2020-08-12 00:00:00+00:00,Riot|Riot Games|League of Legends|lol|trailer|...,300510,46222,242,2748,https://i.ytimg.com/vi/gBjox7vn3-g/default.jpg,False,False,Você se unirá aos Lunari e aos Solari em Targo...
4,npoUGx7UW7o,Entrevista com Thammy Miranda | The Noite (10/...,2020-08-11 20:04:02+00:00,UCEWOoncsrmirqnFqxer9lmA,The Noite com Danilo Gentili,23,2020-08-12 00:00:00+00:00,The Noite|The Noite com Danilo Gentili|Danilo ...,327235,22059,3972,2751,https://i.ytimg.com/vi/npoUGx7UW7o/default.jpg,False,False,Danilo Gentili recebe Thammy Miranda. Após pas...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2273175,_R0tSt5tTUU,A Grueling Footrace To An Abandoned Ghost Town!,2023-06-11 20:00:11+00:00,UCEjBDKfrqQI4TgzT9YLNT8g,Ghost Town Living,22,2023-06-18 00:00:00+00:00,Ghost Town Living|Cerro Gordo|Footrace|Death V...,246175,23217,0,1227,https://i.ytimg.com/vi/_R0tSt5tTUU/default.jpg,False,False,Check out https://drinklmnt.com/Brent for a FR...
2273176,wq1FgZy1LzE,Game Theory: Garten Of BanBan Lore Is... Somet...,2023-06-10 18:05:36+00:00,UCo_IB5145EVNcf8hw1Kku7w,The Game Theorists,20,2023-06-18 00:00:00+00:00,garten of banban|garten|banban|jumbo josh|gart...,2264185,108041,0,5348,https://i.ytimg.com/vi/wq1FgZy1LzE/default.jpg,False,False,Game Theory Is Now On Spotify!Check Out Some O...
2273177,uvebNBKOSTg,"bye, from Dream.",2023-06-09 21:40:08+00:00,UCTkXRDQl0luXxVQrRQvWS6w,Dream,20,2023-06-18 00:00:00+00:00,minecraft|dream|dream minecraft,6152817,459258,0,92458,https://i.ytimg.com/vi/uvebNBKOSTg/default.jpg,False,False,"bye, from Dream. I deleted my face reveal, and..."
2273178,LScLTYUTXAM,Brandon Crawford's Scoreless Inning | Pitching...,2023-06-11 23:54:18+00:00,UCpXMHgjrpnynDSV5mXpqImw,San Francisco Giants,17,2023-06-18 00:00:00+00:00,San Francisco Giants|SF Giants|SFGiants|Oracle...,138893,2662,0,237,https://i.ytimg.com/vi/LScLTYUTXAM/default.jpg,False,False,"For the first time in his Major League career,..."


In [27]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2273180 entries, 0 to 2273179
Data columns (total 16 columns):
 #   Column             Dtype              
---  ------             -----              
 0   video_id           object             
 1   title              object             
 2   publishedAt        datetime64[ns, UTC]
 3   channelId          object             
 4   channelTitle       object             
 5   categoryId         int64              
 6   trending_date      datetime64[ns, UTC]
 7   tags               object             
 8   view_count         int64              
 9   likes              int64              
 10  dislikes           int64              
 11  comment_count      int64              
 12  thumbnail_link     object             
 13  comments_disabled  bool               
 14  ratings_disabled   bool               
 15  description        object             
dtypes: bool(2), datetime64[ns, UTC](2), int64(5), object(7)
memory usage: 247.1+ MB


In [28]:
# dataset size in this run - 2,273,180 rows,
# but it contains only 398,811 unique video IDs
# (an earlier snapshot, with data up to December 13, 2021,
# had 1,074,418 rows and 199,728 unique video IDs)
len(combined_df['video_id'].unique())

398811

# Sample Dataset

In [29]:
sample_df = pd.read_csv("data/US_youtube_trending_data.csv", parse_dates=['publishedAt', 'trending_date'])
sample_df.reset_index(drop=True, inplace=True)

sample_df

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description
0,3C66w5Z0ixs,I ASKED HER TO BE MY GIRLFRIEND...,2020-08-11 19:20:14+00:00,UCvtRTOMP2TqYqu51xNrqAzg,Brawadis,22,2020-08-12 00:00:00+00:00,brawadis|prank|basketball|skits|ghost|funny vi...,1514614,156908,5855,35313,https://i.ytimg.com/vi/3C66w5Z0ixs/default.jpg,False,False,SUBSCRIBE to BRAWADIS ▶ http://bit.ly/Subscrib...
1,M9Pmf9AB4Mo,Apex Legends | Stories from the Outlands – “Th...,2020-08-11 17:00:10+00:00,UC0ZV6M2THA81QT9hrVWJG3A,Apex Legends,20,2020-08-12 00:00:00+00:00,Apex Legends|Apex Legends characters|new Apex ...,2381688,146739,2794,16549,https://i.ytimg.com/vi/M9Pmf9AB4Mo/default.jpg,False,False,"While running her own modding shop, Ramya Pare..."
2,J78aPJ3VyNs,I left youtube for a month and THIS is what ha...,2020-08-11 16:34:06+00:00,UCYzPXprvl5Y-Sf0g4vX-m6g,jacksepticeye,24,2020-08-12 00:00:00+00:00,jacksepticeye|funny|funny meme|memes|jacksepti...,2038853,353787,2628,40221,https://i.ytimg.com/vi/J78aPJ3VyNs/default.jpg,False,False,I left youtube for a month and this is what ha...
3,kXLn3HkpjaA,XXL 2020 Freshman Class Revealed - Official An...,2020-08-11 16:38:55+00:00,UCbg_UMjlHJg_19SZckaKajg,XXL,10,2020-08-12 00:00:00+00:00,xxl freshman|xxl freshmen|2020 xxl freshman|20...,496771,23251,1856,7647,https://i.ytimg.com/vi/kXLn3HkpjaA/default.jpg,False,False,Subscribe to XXL → http://bit.ly/subscribe-xxl...
4,VIUo6yapDbc,Ultimate DIY Home Movie Theater for The LaBran...,2020-08-11 15:10:05+00:00,UCDVPcEbVLQgLZX0Rt6jo34A,Mr. Kate,26,2020-08-12 00:00:00+00:00,The LaBrant Family|DIY|Interior Design|Makeove...,1123889,45802,964,2196,https://i.ytimg.com/vi/VIUo6yapDbc/default.jpg,False,False,Transforming The LaBrant Family's empty white ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
208983,_R0tSt5tTUU,A Grueling Footrace To An Abandoned Ghost Town!,2023-06-11 20:00:11+00:00,UCEjBDKfrqQI4TgzT9YLNT8g,Ghost Town Living,22,2023-06-18 00:00:00+00:00,Ghost Town Living|Cerro Gordo|Footrace|Death V...,246175,23217,0,1227,https://i.ytimg.com/vi/_R0tSt5tTUU/default.jpg,False,False,Check out https://drinklmnt.com/Brent for a FR...
208984,wq1FgZy1LzE,Game Theory: Garten Of BanBan Lore Is... Somet...,2023-06-10 18:05:36+00:00,UCo_IB5145EVNcf8hw1Kku7w,The Game Theorists,20,2023-06-18 00:00:00+00:00,garten of banban|garten|banban|jumbo josh|gart...,2264185,108041,0,5348,https://i.ytimg.com/vi/wq1FgZy1LzE/default.jpg,False,False,Game Theory Is Now On Spotify!Check Out Some O...
208985,uvebNBKOSTg,"bye, from Dream.",2023-06-09 21:40:08+00:00,UCTkXRDQl0luXxVQrRQvWS6w,Dream,20,2023-06-18 00:00:00+00:00,minecraft|dream|dream minecraft,6152817,459258,0,92458,https://i.ytimg.com/vi/uvebNBKOSTg/default.jpg,False,False,"bye, from Dream. I deleted my face reveal, and..."
208986,LScLTYUTXAM,Brandon Crawford's Scoreless Inning | Pitching...,2023-06-11 23:54:18+00:00,UCpXMHgjrpnynDSV5mXpqImw,San Francisco Giants,17,2023-06-18 00:00:00+00:00,San Francisco Giants|SF Giants|SFGiants|Oracle...,138893,2662,0,237,https://i.ytimg.com/vi/LScLTYUTXAM/default.jpg,False,False,"For the first time in his Major League career,..."


# Clean dataset

In [30]:
def clean_kaggle_dataset(dataset):
    """ 
    Clean YouTube Kaggle dataset:
        - Replace NaN in description with space
        - Delete missing values
        - Keep records only with unique video IDs 
          with the maximum number of views (the latest request)
        - Delete entries where comments_disabled=True or ratings_disabled=True
        - Convert "tags" field from list of strings to string
        - Delete non-ASCII and non-English characters from text columns
        - Rename columns to snake case, add empty 'comments' column
        - Delete non-relevant columns and reorder the remaining ones
        - Sort by 'published_at' and reset index
    """

    clean_df = dataset.copy(deep=True)

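    # Replace NaN in description with a space (videos with no description).
    # Assigning the result back avoids the chained in-place fillna,
    # which newer pandas versions warn about and may silently ignore.
    clean_df["description"] = clean_df["description"].fillna(" ")
    # Delete all rows that still contain missing values, if any
    clean_df.dropna(inplace=True)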

    # primary_key - unique 'video_id' with the largest number of views
    primary_key = clean_df.groupby("video_id")["view_count"].idxmax()
    # keep only most relevant records
    clean_df = clean_df.loc[primary_key]

    # delete rows with comments_disabled=True or ratings_disabled=True
    clean_df = clean_df[(clean_df['comments_disabled'] == False) &
                        (clean_df['ratings_disabled'] == False)]

    # Replace [None] in tags with space
    clean_df.loc[clean_df['tags'] == '[None]', 'tags'] = ' '
    # split tags with '|' and convert list to one string
    clean_df['tags'] = [' '.join(tag)
                        for tag in clean_df['tags'].str.split('|')]

    printable = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n'
    # delete non-ASCII and non-English characters
    for text_column in ['title', 'channelTitle', 'description', 'tags']:
        # for all rows in the column apply a filter
        # that only leaves characters from 'printable'
        # since filter does not return a string, then you need to use the join method
        clean_df[text_column] = clean_df[text_column].apply(
            lambda x: ''.join(filter(lambda xi: xi in printable, x)))

    # if there is not a single letter left
    # in the title of the video or in the channel title
    # the video is definitely not in English
    english_letter = r'[A-Za-z]'
    clean_df = clean_df[clean_df['title'].str.contains(english_letter)]
    clean_df = clean_df[clean_df['channelTitle'].str.contains(english_letter)]

    # create new empty column
    clean_df['comments'] = " "

    # rename column to match snake case
    clean_df.rename(columns={'channelTitle': 'channel_title',
                             'publishedAt': 'published_at',
                             'channelId': 'channel_id'}, inplace=True)

    # delete non-relevant columns
    clean_df.drop(['categoryId',
                   'thumbnail_link', 'comments_disabled',
                   'ratings_disabled'], axis=1, inplace=True)

    clean_df = clean_df.reindex(columns=['video_id', 'title', 'channel_id', 'channel_title',
                                         'published_at', 'trending_date', 'view_count', 'likes', 'dislikes',
                                         'comment_count', 'tags', 'description', 'comments'])
    
    clean_df['published_at'] = pd.to_datetime(clean_df['published_at'], format='%Y-%m-%d %H:%M:%S%z')

    sorted_clean_df = clean_df.sort_values('published_at')

    sorted_clean_df['published_at'] = sorted_clean_df['published_at'].dt.strftime('%Y-%m-%d %H:%M:%S%z')

    sorted_clean_df.reset_index(drop=True, inplace=True)

    return sorted_clean_df
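The text-cleaning steps inside the function can be tried on single strings without pandas. This is a sketch only: the helper names `join_tags`, `keep_allowed` and `has_english_letter` are hypothetical, and `ALLOWED` is built from the `string` module on the assumption that it matches the hand-written `printable` string above.

```python
import re
import string

# Digits, ASCII letters, ASCII punctuation, space, tab and newline
ALLOWED = set(string.digits + string.ascii_letters + string.punctuation + " \t\n")


def join_tags(tags):
    """Turn 'tag one|tag two' into 'tag one tag two'."""
    return " ".join(tags.split("|"))


def keep_allowed(text):
    """Drop every character that is not in ALLOWED."""
    return "".join(ch for ch in text if ch in ALLOWED)


def has_english_letter(text):
    """True if at least one of a-z or A-Z is present."""
    return re.search(r"[A-Za-z]", text) is not None


print(join_tags("minecraft|dream|dream minecraft"))
print(keep_allowed("Fuerte explosión"))          # the accented letter is dropped
print(has_english_letter(keep_allowed("있지")))   # nothing is left, so the row would be removed
```

Note that accented characters are deleted rather than transliterated, which is why non-English titles that contain some ASCII letters still pass the filter in a truncated form.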

In [31]:
clean_df = clean_kaggle_dataset(combined_df)
clean_df

Unnamed: 0,video_id,title,channel_id,channel_title,published_at,trending_date,view_count,likes,dislikes,comment_count,tags,description,comments
0,rRQUPkoXnCI,"Fuerte explosin en El Cairo, en tubera de petrleo",UC1ziDs9ZvpXVgGaI4wEGAbQ,Canal66,2020-07-14 22:24:57+0000,2020-08-14 00:00:00+00:00,10796592,245506,32229,13714,Canal 66 El Cairo Egipto,Se produjo un incendio masivo despus de una ex...,
1,0l3-iufiywU,FIRST TIME HEARING Phil Collins - In the Air T...,UCopm4iCRGWS6PkLB8uDe-Wg,TwinsthenewTrend,2020-07-27 21:49:32+0000,2020-08-14 00:00:00+00:00,5452179,110393,2160,16244,Twinsthenewtrend reaction reactions reaction c...,NEW VLOG CHANNEL: https://m.youtube.com/channe...,
2,-2RJTVPSOPc,DOCTOR (Official Video) Sidhu Moose Wala | Kid...,UC9ChdqQRCaZmTCwSJ49tcbw,Sidhu Moose Wala,2020-08-03 02:30:09+0000,2020-08-12 00:00:00+00:00,23422631,1106267,108509,1191634,doctor sidhu moose wala sidhu moose wala docto...,Sidhu Moose Wala PresentsAlso available on: Sp...,
3,JXzk8G9aXI8,Avatar Intro but with Animals,UCa0LID3WQdj-bd5KiI-34vw,BLTW,2020-08-03 21:51:14+0000,2020-08-16 00:00:00+00:00,3793289,141023,2866,2083,avatar avatar the last airbender avatar meme a...,"Long ago , the four4 pets lived together in ha...",
4,86N766ZoH3U,"En video: fuerte explosin en Beirut, capital d...",UCe5-b0fCK3eQCpwS6MT0aNw,EL TIEMPO,2020-08-04 17:11:42+0000,2020-08-12 00:00:00+00:00,4358150,28720,1984,6832,Noticias de ltimo momento noticias hoy noticia...,Noticia de ltimo momento: dos fuertes explosio...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
284432,9CNSz09gIAg,BRASIL 4 X 1 GUIN | GOLS | AMISTOSO SELEO | ge...,UCgCKagVhzGnZcuP9bSMgMCg,ge,2023-06-17 22:03:36+0000,2023-06-18 00:00:00+00:00,602281,28390,0,2174,melhores momentos highlights esporte esportes ...,Em amistoso na Espanha pautado pelo combate ao...,
284433,lNMSqxQtO0w,ONE PIECE | Official Teaser Trailer | Netflix,UCWOA1ZGywLbqmigxE4Qlvuw,Netflix,2023-06-17 22:26:00+0000,2023-06-18 00:00:00+00:00,1434251,73007,0,9786,Adventure Alvida Anime Buggy Eiichiro Oda Emil...,Heres a first look at the live action adaptati...,
284434,0zCMi98Zxqo,ONE PIECE - Netflix,UCv2ejD5B1xOYtGB2cf80B8g,Netflix Japan,2023-06-17 22:26:04+0000,2023-06-18 00:00:00+00:00,587221,7150,0,1769,Netflix ONE PIECE,ONE PIECEONE PIECENetflix831 () !: https://bit...,
284435,qoPjpZ7PsQc,ONE PIECE Trailer (2023),UCzcRQ3vRNr6fJ1A9rqFn7QA,ONE Media,2023-06-17 22:30:35+0000,2023-06-18 00:00:00+00:00,854164,33449,0,5410,One Media Trailer Official Movie Film 2023 One...,ONE PIECE Trailer (2023) One Piece Live Action...,


In [32]:
# check that group by worked correctly
len(clean_df['video_id'].unique())

284437
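The count above equals the number of rows in the cleaned frame, which is what we expect after deduplication. A more explicit check could look like the sketch below. The `toy_clean` frame is a made-up stand-in for the real `clean_df`.

```python
import pandas as pd

# Toy stand-in for the cleaned frame (made-up values)
toy_clean = pd.DataFrame({
    "video_id": ["a", "b", "c"],
    "view_count": [250, 40, 7],
})

# True only if no video_id appears more than once
assert toy_clean["video_id"].is_unique

# Rows whose video_id occurs more than once (empty when deduplication worked)
repeats = toy_clean[toy_clean["video_id"].duplicated(keep=False)]
print(len(repeats))
```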

In [33]:
# save dataset
clean_df.to_csv('preprocessing_output/youtube_kaggle_clean_dataset.tsv', sep='\t', index=False, header=True)

In [34]:
clean_sample = clean_kaggle_dataset(sample_df)
clean_sample

Unnamed: 0,video_id,title,channel_id,channel_title,published_at,trending_date,view_count,likes,dislikes,comment_count,tags,description,comments
0,JXzk8G9aXI8,Avatar Intro but with Animals,UCa0LID3WQdj-bd5KiI-34vw,BLTW,2020-08-03 21:51:14+0000,2020-08-15 00:00:00+00:00,3146234,123862,2410,1865,avatar avatar the last airbender avatar meme a...,"Long ago , the four4 pets lived together in ha...",
1,3bC2T0oFwoo,This is Goodbye,UCIcgBZ9hEJxHv6r_jDYOMqg,Unus Annus,2020-08-05 19:00:01+0000,2020-08-12 00:00:00+00:00,4971181,360168,7850,48742,unus annus markiplier crankgameplays memento m...,Only 100 days left. Will you make the most of ...,
2,hpMeCem6Hss,"I FOUGHT BRYCE HALL...Ft. Mike Majlak, Lana Rh...",UCWTQG2aMDYKGDqYEGqJb1FA,Life of Bradley Martyn,2020-08-05 19:42:52+0000,2020-08-12 00:00:00+00:00,1123529,39289,2056,3626,bradley martyn steve will do it fullsend nelk ...,It is what it is.... New drop coming this mont...,
3,FnSr820S2Mk,Explained: What happened in deadly Beirut expl...,UCoMdktPbSTixAyNGwb-UYkQ,Sky News,2020-08-05 21:01:33+0000,2020-08-12 00:00:00+00:00,8496552,74508,3034,11855,BEIRUT LEBANON MIDDLE EAST EXPLOSION BLAST SKY...,The size of the explosion that ripped through ...,
4,K_uCyxNsHpo,Brooklyn's 10 DATES in 10 DAYS | Meet Jorge (D...,UC6QWhGQqf0YDYdRb0n6ojWw,Brooklyn and Bailey,2020-08-05 21:07:19+0000,2020-08-12 00:00:00+00:00,1120675,41671,888,5282,,Brooklyn is going on TEN DATES in TEN DAYS! (T...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
37132,_XcRIrDDTpI,How I Survived the Wormhole,UCPLMPHT-d8GZOqL_AHJFdQQ,Parrot,2023-06-17 14:00:32+0000,2023-06-18 00:00:00+00:00,414165,22544,0,1484,Minecraft Minecraft SMP SMP Minecraft Server L...,PARROT PLUSHIE: https://youtooz.com/products/p...,
37133,c0td7Noukww,"People Order Coffee, I Serve Them Suffering - ...",UCto7D1L-MiRoOziCXK9uT5Q,Let's Game It Out,2023-06-17 15:00:27+0000,2023-06-18 00:00:00+00:00,1095295,65153,0,2899,let's game it out lets game it out let game it...,Get a browser thats literally better at everyt...,
37134,lsml1LqVpK0,I Built A CHERRY MANSION in Minecraft 1.20 Har...,UCVtz3s3FUxVxBgPl2OWtIJQ,Farzy,2023-06-17 16:00:27+0000,2023-06-18 00:00:00+00:00,218631,7189,0,2532,Minecraft Farzy survival farzy minecraft minec...,Today I built a cherry blossom mansion in Mine...,
37135,Ef6S8Bhj5M8,Minecraft Escape Rooms Got EVEN DUMBER,UCrbA5a8z3E0Dm6hIiJRuL_w,Kenadian,2023-06-17 16:47:20+0000,2023-06-18 00:00:00+00:00,288996,19729,0,1603,minecraft escape rooms minecraft prison debunk...,Minecraft youtubers really thought they could ...,


In [35]:
clean_sample.to_csv('preprocessing_output/youtube_kaggle_sample_dataset.tsv', sep='\t', index=False, header=True)
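Because the character filter keeps tabs and newlines, text fields in the saved TSV files may contain the separator itself. The sketch below, on made-up data written to an in-memory buffer instead of the real output files, suggests that pandas quotes such fields on writing and restores them on reading, so the round trip should be safe with the default settings.

```python
import io

import pandas as pd

# Toy frame (made-up values); the first title contains a tab character
toy = pd.DataFrame({
    "video_id": ["a", "b"],
    "title": ["first\ttitle", "second"],
    "view_count": [250, 40],
})

# Write with the same options as in the notebook, but to an in-memory buffer
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False, header=True)

# Read the data back with the same separator
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t")
print(restored)
```

Tools that split lines on tabs without honouring quotes would not read such rows correctly, so it may be worth checking how the downstream consumer parses the file.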