## Data Loading and Cleaning:
-------

The Data Loading and Cleaning process outlines how the data was prepared for analysis. This includes, but is not limited to: identifying and removing or imputing null values, identifying and removing duplicate data, and cleaning text data.

##### Data Cleaning and Loading overview:
- Concatenating individual CSV files into a single data frame.
- Exploring the data 
- Cleaning the data

##### Imports:
-------

In [1]:
import pandas as pd # Pandas imported and given the alias 'pd'
import numpy as np # numpy imported and given the alias 'np'. 
import matplotlib.pyplot as plt # matplotlib imported and given the alias 'plt'. Used for plotting graphs
import seaborn as sns 
import matplotlib as mlp

##### Loading datasets:


In [2]:
# Reading the csv files
gbd = pd.read_csv('../data/youtube-dataset/GBvideos.csv')
frd = pd.read_csv('../data/youtube-dataset/FRvideos.csv')
usd = pd.read_csv('../data/youtube-dataset/USvideos.csv') 
cad = pd.read_csv('../data/youtube-dataset/CAvideos.csv') 
ded = pd.read_csv('../data/youtube-dataset/DEvideos.csv') 



A total of five CSV files are used for this project. Each file contains YouTube trending statistics from a different country. The files are merged later on into a single data frame, which makes the data easier to analyse during EDA. It is important to note that a data frame with large dimensions can make it harder to fit a model by increasing computation times. Working with a sample of the dataset is one way to mitigate this risk.

Using data from several countries, as opposed to a single country, is useful for understanding how a video's location or origin can affect the number of views it receives. This could be assessed further by looking at country-specific features within the data, e.g. language.
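To keep the country of origin available once the files are combined, each frame could be tagged before concatenation. The sketch below is a minimal illustration that uses small in-memory frames in place of the CSV files. The `country` column and the `tag_and_concat` helper are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def tag_and_concat(frames):
    """Concatenate per-country frames and record each row's source
    in a 'country' column (hypothetical column name)."""
    tagged = [frame.assign(country=code) for code, frame in frames.items()]
    return pd.concat(tagged, ignore_index=True)

# Tiny stand-ins for the real CSV files (values are made up for illustration)
gb_demo = pd.DataFrame({"video_id": ["a1", "b2"], "views": [100, 250]})
us_demo = pd.DataFrame({"video_id": ["a1", "c3"], "views": [120, 900]})

combined = tag_and_concat({"GB": gb_demo, "US": us_demo})

# A random sample keeps model fitting fast while experimenting
sample = combined.sample(frac=0.5, random_state=42)
print(combined)
```

With a column like this, views could later be grouped or compared by country without going back to the individual files.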

##### Checking individual entries:



In [None]:
gbd.head() #Great Britain data set

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,Jw1Y-zhQURU,17.14.11,John Lewis Christmas Ad 2017 - #MozTheMonster,John Lewis,26,2017-11-10T07:38:29.000Z,"christmas|""john lewis christmas""|""john lewis""|...",7224515,55681,10247,9479,https://i.ytimg.com/vi/Jw1Y-zhQURU/default.jpg,False,False,False,Click here to continue the story and make your...
1,3s1rvMFUweQ,17.14.11,Taylor Swift: …Ready for It? (Live) - SNL,Saturday Night Live,24,2017-11-12T06:24:44.000Z,"SNL|""Saturday Night Live""|""SNL Season 43""|""Epi...",1053632,25561,2294,2757,https://i.ytimg.com/vi/3s1rvMFUweQ/default.jpg,False,False,False,Musical guest Taylor Swift performs …Ready for...
2,n1WpP7iowLc,17.14.11,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,10,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579,787420,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé i...
3,PUTEiSjKwJU,17.14.11,Goals from Salford City vs Class of 92 and Fri...,Salford City Football Club,17,2017-11-13T02:30:38.000Z,"Salford City FC|""Salford City""|""Salford""|""Clas...",27833,193,12,37,https://i.ytimg.com/vi/PUTEiSjKwJU/default.jpg,False,False,False,Salford drew 4-4 against the Class of 92 and F...
4,rHwDegptbI4,17.14.11,Dashcam captures truck's near miss with child ...,Cute Girl Videos,25,2017-11-13T01:45:13.000Z,[none],9815,30,2,30,https://i.ytimg.com/vi/rHwDegptbI4/default.jpg,False,False,False,Dashcam captures truck's near miss with child ...


In [None]:
frd.head() #France data set 

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,Ro6eob0LrCY,17.14.11,Malika LePen : Femme de Gauche - Trailer,Le Raptor Dissident,24,2017-11-13T17:32:55.000Z,"Raptor""|""Dissident""|""Expliquez""|""moi""|""cette""|...",212702,29282,1108,3817,https://i.ytimg.com/vi/Ro6eob0LrCY/default.jpg,False,False,False,Dimanche.\n18h30.\nSoyez présents pour la vidé...
1,Yo84eqYwP98,17.14.11,"LA PIRE PARTIE ft Le Rire Jaune, Pierre Croce,...",Le Labo,24,2017-11-12T15:00:02.000Z,[none],432721,14053,576,1161,https://i.ytimg.com/vi/Yo84eqYwP98/default.jpg,False,False,False,Le jeu de société: https://goo.gl/hhG1Ta\n\nGa...
2,ceqntSXE-10,17.14.11,DESSINS ANIMÉS FRANÇAIS VS RUSSES 2 - Daniil...,Daniil le Russe,23,2017-11-13T17:00:38.000Z,"cartoon""|""pokémon""|""école""|""ours""|""мультфильм",482153,76203,477,9580,https://i.ytimg.com/vi/ceqntSXE-10/default.jpg,False,False,False,Une nouvelle dose de dessins animés français e...
3,WuTFI5qftCE,17.14.11,PAPY GRENIER - METAL GEAR SOLID,Joueur Du Grenier,20,2017-11-12T17:00:02.000Z,"Papy grenier""|""Metal Gear Solid""|""PS1""|""Tirage...",925222,85016,550,4303,https://i.ytimg.com/vi/WuTFI5qftCE/default.jpg,False,False,False,"Nouvel ,épisode de Papy Grenier ! Ce mois-ci o..."
4,ee6OFs8TdEg,17.14.11,QUI SAUTERA LE PLUS HAUT ? (VÉLO SKATE ROLLER ...,Aurelien Fontenoy,17,2017-11-13T16:30:03.000Z,"vélo""|""vtt""|""bmx""|""freestyle""|""bike""|""mtb""|""di...",141695,8091,72,481,https://i.ytimg.com/vi/ee6OFs8TdEg/default.jpg,False,False,False,Sauts à plus de 4 mètres de haut dans un tramp...


In [None]:
usd.head() #USA data set

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


In [None]:
cad.head() #Canada data set

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,n1WpP7iowLc,17.14.11,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,10,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579,787425,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé i...
1,0dBIkQ4Mz1M,17.14.11,PLUSH - Bad Unboxing Fan Mail,iDubbbzTV,23,2017-11-13T17:00:00.000Z,"plush|""bad unboxing""|""unboxing""|""fan mail""|""id...",1014651,127794,1688,13030,https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg,False,False,False,STill got a lot of packages. Probably will las...
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146035,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095828,132239,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...
4,2Vv-BfVoq4g,17.14.11,Ed Sheeran - Perfect (Official Music Video),Ed Sheeran,10,2017-11-09T11:04:14.000Z,"edsheeran|""ed sheeran""|""acoustic""|""live""|""cove...",33523622,1634130,21082,85067,https://i.ytimg.com/vi/2Vv-BfVoq4g/default.jpg,False,False,False,🎧: https://ad.gt/yt-perfect\n💰: https://atlant...


## Merging Data-Sets:
------


##### Concatenating individual datasets:

In [None]:
df = pd.concat([usd,frd,gbd,cad,ded]) # concatenating the different dataframes into one dataframe
print(df.shape)

(202310, 16)


##### Resetting the dataframe index

In [None]:
df.reset_index(inplace=True,drop=True)
print(df.shape)
df.tail(1)

(202310, 16)


Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
202309,go-F6xvezAM,18.14.06,–ì–∏—Ä–æ—Å–∫—É—Ç–µ—Ä - –ê–∑–±—É–∫–∞ –£—Ä–∞–ª—å—Å–∫–∏—Ö –ü–µ–ª—å–º–µ–Ω–µ–π –ë - –£—Ä...,–£—Ä–∞–ª—å—Å–∫–∏–µ –ü–µ–ª—å–º–µ–Ω–∏,23,2018-06-13T15:02:15.000Z,"–ì–∏—Ä–æ—Å–∫—É—Ç–µ—Ä|""—É—Ä–∞–ª—å—Å–∫–∏–µ –ø–µ–ª—å–º–µ–Ω–∏ –≥–∏—Ä–æ—Å–∫—É—Ç–µ—Ä""|""–º—è...",316328,11394,352,550,https://i.ytimg.com/vi/go-F6xvezAM/default.jpg,False,False,False,–ü–æ–ø—É–ª—è—Ä–Ω—ã–π –Ω–æ–º–µ—Ä –∏–∑ –Ω–æ–≤–æ–≥–æ —à–æ—É –ê–∑–±—É–∫–∞ –£—Ä–∞–ª—å—Å–∫–∏...


The data frame index was reset after merging so that indexing is consistent across the data frame. Without the reset, each source file's row labels would be kept, resulting in overlapping index values that could interfere with analysis of the data.
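The overlap described here can be reproduced on a toy example. The sketch below uses made-up frames (`first` and `second` are illustrative names) to show what the index looks like with and without a reset. It also shows `ignore_index=True`, a `pd.concat` option that builds the fresh index in the same step as the concatenation.

```python
import pandas as pd

first = pd.DataFrame({"views": [10, 20]})
second = pd.DataFrame({"views": [30, 40]})

# Without resetting, both source indexes are kept, so labels 0 and 1 appear twice
stacked = pd.concat([first, second])
print(stacked.index.tolist())

# ignore_index=True builds a fresh 0..n-1 index while concatenating
clean = pd.concat([first, second], ignore_index=True)
print(clean.index.tolist())
```

With the overlapping index, a label-based lookup such as `stacked.loc[0]` returns two rows instead of one, which is the kind of ambiguity the reset avoids.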

The data now contains: (202310 rows and 16 columns)

##### Checking the concatenated data:

In [None]:
df.head() # Displays the first five rows of data (checking for data format consistency etc.)

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


In [None]:
df.tail() # Displays the last five rows of data (checking for data format consistency etc.)

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
202305,fn5WNxy-Wcw,18.14.06,KINGDOM HEARTS III – E3 2018 Pirates of the Ca...,Kingdom Hearts,20,2018-06-12T01:54:02.000Z,"Kingdom Hearts|""KH3""|""Kingdom Hearts 3""|""Pirat...",1394530,46778,501,9878,https://i.ytimg.com/vi/fn5WNxy-Wcw/default.jpg,False,False,False,Find out more about Kingdom Hearts 3: https://...
202306,zAFv43lxqHE,18.14.06,YMS: The Visit,YourMovieSucksDOTorg,24,2018-06-13T21:58:43.000Z,[none],139733,11155,119,1968,https://i.ytimg.com/vi/zAFv43lxqHE/default.jpg,False,False,False,Patreon: http://www.patreon.com/YMSTwitch: htt...
202307,zSXG5I6Y2fA,18.14.06,Ungut umgeschult – Grünwald als Ersthelfer am ...,Grünwald Freitagscomedy,24,2018-06-12T10:01:28.000Z,"Günter Grünwald|""Grünwald Freitagscomedy""|""Gün...",26054,364,11,8,https://i.ytimg.com/vi/zSXG5I6Y2fA/default.jpg,False,False,False,Günter versucht sich als Ersthelfer bei einem ...
202308,5d115sePmaU,18.14.06,Assassin's Creed Odyssey: E3 2018 Welt-Enthüll...,Assassin's Creed DE,20,2018-06-11T21:16:55.000Z,"Assassin's Creed|""Assassins Creed""|""Assassin's...",1139198,14900,1421,1587,https://i.ytimg.com/vi/5d115sePmaU/default.jpg,False,False,False,"Vom verstoßenen Söldner zum legendären Helden,..."
202309,go-F6xvezAM,18.14.06,–ì–∏—Ä–æ—Å–∫—É—Ç–µ—Ä - –ê–∑–±—É–∫–∞ –£—Ä–∞–ª—å—Å–∫–∏—Ö –ü–µ–ª—å–º–µ–Ω–µ–π –ë - –£—Ä...,–£—Ä–∞–ª—å—Å–∫–∏–µ –ü–µ–ª—å–º–µ–Ω–∏,23,2018-06-13T15:02:15.000Z,"–ì–∏—Ä–æ—Å–∫—É—Ç–µ—Ä|""—É—Ä–∞–ª—å—Å–∫–∏–µ –ø–µ–ª—å–º–µ–Ω–∏ –≥–∏—Ä–æ—Å–∫—É—Ç–µ—Ä""|""–º—è...",316328,11394,352,550,https://i.ytimg.com/vi/go-F6xvezAM/default.jpg,False,False,False,–ü–æ–ø—É–ª—è—Ä–Ω—ã–π –Ω–æ–º–µ—Ä –∏–∑ –Ω–æ–≤–æ–≥–æ —à–æ—É –ê–∑–±—É–∫–∞ –£—Ä–∞–ª—å—Å–∫–∏...


df.head() and df.tail() were used to check that the data entries are consistent throughout the data frame. Each column contains the appropriate type of entry.

Initial Observations:
- trending_date: the date format will need to be converted to datetime so it can be used for modelling.
- All date columns will be converted to numerical values and broken down into month, day and year for modelling.
- Non-English text could be kept as column entries because it can reveal country-specific trends that relate to views.
- Text data, e.g. tags and title, will need to be vectorized so that it can be used for modelling.
- publish_time and trending_date use different date formats. One of the two columns could potentially be dropped to avoid multicollinearity.
- category_id is numerical and provides little understandable information about the specific category. The corresponding names could be extracted from the dataset's JSON files to create a new names column.
- The distributions of the numerical columns (views, likes, dislikes, comment_count) will be plotted to explore relationships.
- Each video has a unique video id.
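The date observations above can be sketched as follows. This is a minimal example on two made-up rows that copy the formats seen in the head and tail outputs. It assumes that trending_date is stored as year.day.month with a two-digit year (the value `17.14.11` cannot be read any other way, since 14 is not a valid month). The new column names (`trending_year` and so on) are hypothetical.

```python
import pandas as pd

dates_demo = pd.DataFrame({
    "trending_date": ["17.14.11", "18.14.06"],
    "publish_time": ["2017-11-10T07:38:29.000Z", "2018-06-13T15:02:15.000Z"],
})

# trending_date appears to be year.day.month with a two-digit year
dates_demo["trending_date"] = pd.to_datetime(
    dates_demo["trending_date"], format="%y.%d.%m"
)

# publish_time is an ISO 8601 timestamp in UTC (the trailing 'Z')
dates_demo["publish_time"] = pd.to_datetime(dates_demo["publish_time"], utc=True)

# Numeric parts that a model can consume directly (hypothetical column names)
dates_demo["trending_year"] = dates_demo["trending_date"].dt.year
dates_demo["trending_month"] = dates_demo["trending_date"].dt.month
dates_demo["trending_day"] = dates_demo["trending_date"].dt.day
dates_demo["publish_hour"] = dates_demo["publish_time"].dt.hour
print(dates_demo.dtypes)
```

Passing an explicit `format` is the safer choice here, because the unusual day and month order is easy for automatic parsing to misread.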


## Exploring the data:
----------

In [None]:
df.shape # Provides the shape of the dataframe (Rows and Columns)

(202310, 16)

In [None]:
df.columns # Provides all of the column names of the Data Frame.

Index(['video_id', 'trending_date', 'title', 'channel_title', 'category_id',
       'publish_time', 'tags', 'views', 'likes', 'dislikes', 'comment_count',
       'thumbnail_link', 'comments_disabled', 'ratings_disabled',
       'video_error_or_removed', 'description'],
      dtype='object')

In [None]:
df.dtypes # Shows the data type recorded for each column.

video_id                  object
trending_date             object
title                     object
channel_title             object
category_id                int64
publish_time              object
tags                      object
views                      int64
likes                      int64
dislikes                   int64
comment_count              int64
thumbnail_link            object
comments_disabled           bool
ratings_disabled            bool
video_error_or_removed      bool
description               object
dtype: object

Most of the listed data types are correct. However, 'publish_time' and 'trending_date' are stored as objects (strings) rather than datetimes, so they will need to be converted later on.

##### Number of unique values in id columns:

In [None]:
df["category_id"].nunique()

18

There are a total of 18 different categories within the data. However, df.head() showed that they are stored as numerical ids, so the category names will need to be obtained to gain more understanding of the data.
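One possible way to attach category names is to build an id-to-name dictionary from a category JSON file and map it onto `category_id`. The sketch below uses an inline JSON string instead of a real file. The layout (`items`, `id`, `snippet.title`) is an assumption about how the dataset's JSON files are structured and should be checked against the actual files. The `category_name` column is a hypothetical name.

```python
import json
import pandas as pd

# Assumed layout of a category file; the real JSON files may differ
category_json = """
{"items": [
  {"id": "10", "snippet": {"title": "Music"}},
  {"id": "24", "snippet": {"title": "Entertainment"}}
]}
"""

items = json.loads(category_json)["items"]
# Ids are stored as strings in this assumed layout, so cast them to int
id_to_name = {int(item["id"]): item["snippet"]["title"] for item in items}

videos_demo = pd.DataFrame({"category_id": [10, 24, 99]})
# Ids missing from the mapping become NaN rather than raising an error
videos_demo["category_name"] = videos_demo["category_id"].map(id_to_name)
print(videos_demo)
```

Checking for NaN in the new column afterwards is a quick way to spot ids that the mapping file does not cover.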

In [None]:
df["video_id"].nunique()

79408

There are a total of 202310 rows within the data frame. However, nunique() shows that there are only 79408 unique video ids. This is important because it reveals that many videos appear more than once, for example because they trended in several countries or on more than one date.
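To see how the repeats are distributed, the rows per video and the number of distinct trending dates per video could be counted. The sketch below runs on a small made-up frame; `repeats_demo` and its values are illustrative only.

```python
import pandas as pd

repeats_demo = pd.DataFrame({
    "video_id": ["a1", "a1", "a1", "b2", "c3", "c3"],
    "trending_date": ["17.14.11", "17.15.11", "17.14.11",
                      "17.14.11", "17.14.11", "17.15.11"],
})

# Number of rows recorded for each video
rows_per_video = repeats_demo["video_id"].value_counts()

# Number of distinct trending dates per video
days_per_video = repeats_demo.groupby("video_id")["trending_date"].nunique()

print(rows_per_video.to_dict())
print(days_per_video.to_dict())
```

A video with more rows than distinct trending dates appears more than once on the same date, which in the merged data would point to entries from different country files.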

In [None]:
df.info() # Shows a summary of the data frame: columns, non-null counts and data types.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202310 entries, 0 to 202309
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   video_id                202310 non-null  object
 1   trending_date           202310 non-null  object
 2   title                   202310 non-null  object
 3   channel_title           202310 non-null  object
 4   category_id             202310 non-null  int64 
 5   publish_time            202310 non-null  object
 6   tags                    202310 non-null  object
 7   views                   202310 non-null  int64 
 8   likes                   202310 non-null  int64 
 9   dislikes                202310 non-null  int64 
 10  comment_count           202310 non-null  int64 
 11  thumbnail_link          202310 non-null  object
 12  comments_disabled       202310 non-null  bool  
 13  ratings_disabled        202310 non-null  bool  
 14  video_error_or_removed  202310 non-n

In [None]:
df.describe() # A statistical summary of the data (mean, standard deviation, count etc.)

Unnamed: 0,category_id,views,likes,dislikes,comment_count
count,202310.0,202310.0,202310.0,202310.0,202310.0
mean,19.712412,2053181.0,56822.84,3067.639,6177.626
std,7.359156,9412473.0,207818.3,28599.58,31453.32
min,1.0,223.0,0.0,0.0,0.0
25%,17.0,75174.25,1447.0,67.0,209.0
50%,23.0,309129.0,7603.0,290.0,924.0
75%,24.0,1103690.0,32244.75,1152.0,3520.0
max,44.0,424538900.0,5613827.0,1944971.0,1626501.0


In [None]:
df.describe().T # A statistical summary of the data - columns repositioned.

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
category_id,202310.0,19.71241,7.359156,1.0,17.0,23.0,24.0,44.0
views,202310.0,2053181.0,9412473.0,223.0,75174.25,309129.0,1103690.5,424538912.0
likes,202310.0,56822.84,207818.3,0.0,1447.0,7603.0,32244.75,5613827.0
dislikes,202310.0,3067.639,28599.58,0.0,67.0,290.0,1152.0,1944971.0
comment_count,202310.0,6177.626,31453.32,0.0,209.0,924.0,3520.0,1626501.0


In [None]:
cat_cols = df.select_dtypes(include=['object']).columns.to_list() # selects all the object (text) columns
num_cols = df.select_dtypes(include=['float64', 'bool']).columns.to_list() # selects float64 and bool columns; int64 is not listed, so only the boolean flags are returned

In [None]:
print(cat_cols)
print(num_cols)

['video_id', 'trending_date', 'title', 'channel_title', 'publish_time', 'tags', 'thumbnail_link', 'description']
['comments_disabled', 'ratings_disabled', 'video_error_or_removed']


##### Identifying numerical and categorical columns:

In [None]:
df.select_dtypes('object').columns # Provides a list of non- numeric columns in the data - listed as object.

Index(['video_id', 'trending_date', 'title', 'channel_title', 'publish_time',
       'tags', 'thumbnail_link', 'description'],
      dtype='object')

In [None]:
df.select_dtypes(include='int64').columns # Provides a list of the int64 numeric columns. (A second positional argument is treated as `exclude`, so 'bool' passed that way would not add the bool columns.)


Index(['category_id', 'views', 'likes', 'dislikes', 'comment_count'], dtype='object')
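The sketch below shows how `select_dtypes` treats its `include` and `exclude` keywords, using a small made-up frame. The `'number'` selector matches every numeric dtype but not `bool`, so boolean columns have to be requested explicitly if they are wanted.

```python
import pandas as pd

dtypes_demo = pd.DataFrame({
    "title": ["x", "y"],
    "views": [10, 20],
    "ratings_disabled": [False, True],
})

# 'number' matches int64, float64 and other numeric dtypes, but not bool
numeric_cols = dtypes_demo.select_dtypes(include="number").columns.to_list()

# Passing a list to include selects several dtype families at once
numeric_and_bool = dtypes_demo.select_dtypes(
    include=["number", "bool"]
).columns.to_list()

# exclude returns everything that is left over
non_numeric = dtypes_demo.select_dtypes(
    exclude=["number", "bool"]
).columns.to_list()

print(numeric_cols, numeric_and_bool, non_numeric)
```

Using the keywords explicitly avoids the situation where a second positional argument is silently interpreted as `exclude`.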


## Data Cleaning:
---------
- Cleaning the data to handle missing values and inconsistencies.
- Data cleaning ensures the data is ready for analysis.

##### Duplicate data:

In [None]:
df.duplicated().sum() # Counts the fully duplicated rows in the data frame

18083

df.duplicated() shows that there are 18083 duplicated rows across the dataset. This is a large number of duplicates. Given the nature of this dataset (trending YouTube videos), it is possible that the data contains entries for videos that trended in multiple countries. Removing this data could limit how well our model is able to identify which features are most statistically significant for the number of views.
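Before deciding whether to remove them, the duplicated rows can be inspected side by side. The sketch below uses a small made-up frame and shows two standard pandas options: `duplicated(keep=False)` to mark every copy, and `drop_duplicates()` to keep only the first occurrence. Whether dropping is the right choice for this project is left open.

```python
import pandas as pd

dupes_demo = pd.DataFrame({
    "video_id": ["a1", "a1", "b2", "a1"],
    "trending_date": ["18.15.05", "18.15.05", "18.15.05", "18.16.05"],
    "views": [100, 100, 50, 130],
})

# keep=False marks every copy, which makes duplicates easier to inspect together
all_copies = dupes_demo[dupes_demo.duplicated(keep=False)]

# Dropping exact duplicates keeps the first occurrence;
# the same video on a later trending date is not affected
deduped = dupes_demo.drop_duplicates().reset_index(drop=True)

print(all_copies)
print(deduped)
```

Only rows that match in every column are treated as duplicates, so a video that trends again with a different view count is kept.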

In [None]:
df[df.duplicated()] # Displays all the duplicated rows in the data frame.

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
34899,QBL8IRJ5yHU,18.15.05,Why I'm So Scared (being myself and crying too...,grav3yardgirl,26,2018-05-14T19:00:01.000Z,"beauty|""how to""|""makeup""|""howto""|""style""|""fash...",1469627,188652,3124,33032,https://i.ytimg.com/vi/QBL8IRJ5yHU/default.jpg,False,False,False,I will never be able to say Thank You enough.....
34900,t4pRQ0jn23Q,18.15.05,YoungBoy Never Broke Again Goes Sneaker Shoppi...,Complex,24,2018-05-14T14:00:03.000Z,"sneakerhead|""complex""|""complex originals""|""sne...",1199587,49709,2380,7261,https://i.ytimg.com/vi/t4pRQ0jn23Q/default.jpg,False,False,False,YoungBoy Never Broke Again goes Sneaker Shoppi...
34901,j4KvrAUjn6c,18.15.05,WE MADE OUR MOM CRY...HER DREAM CAME TRUE!,Lucas and Marcus,24,2018-05-13T18:03:56.000Z,"Lucas and Marcus|""Marcus and Lucas""|""Dobre""|""D...",3906727,77378,12160,15874,https://i.ytimg.com/vi/j4KvrAUjn6c/default.jpg,False,False,False,BEST MOM EVER! WANT TO SEE US IN NYC & NJ?!BUY...
34902,MAjY8mCTXWk,18.15.05,"Âë®Êù∞ÂÄ´ Jay Chou„Äê‰∏çÊÑõÊàëÂ∞±ÊãâÂÄí If You Don't Love Me, It's...",Êù∞Â®ÅÁàæÈü≥Ê®Ç JVR Music,10,2018-05-14T15:59:47.000Z,"Âë®Êù∞ÂÄ´|""Jay""|""Chou""|""Âë®Ëë£""|""Âë®Êù∞‰º¶""|""Âë®ÂÇëÂÄ´""|""Êù∞Â®ÅÂ∞î""|""Âë®Âë®""|""...",916128,40485,1042,4746,https://i.ytimg.com/vi/MAjY8mCTXWk/default.jpg,False,False,False,Ë©ûÔºöÂë®Êù∞ÂÄ´„ÄÅÂÆãÂÅ•ÂΩ∞ÔºàÂΩàÈ†≠Ôºâ Êõ≤ÔºöÂë®Êù∞ÂÄ´ÊÜÇÈ¨±ÂûãÁî∑ÁöÑËµ∞ÂøÉÊóãÂæã Áî®Ëã±ÂºèÊêñÊªæÂÆ£Ê¥©ÊÉÖÂÇ∑‰∏çÊÑõÊàëÂ∞±ÊãâÂÄí...
34903,xhs8tf1v__w,18.15.05,Terry Crews Answers the Web's Most Searched Qu...,WIRED,24,2018-05-14T16:00:29.000Z,"autocomplete|""deadpool 2""|""google autocomplete...",343967,16988,132,1308,https://i.ytimg.com/vi/xhs8tf1v__w/default.jpg,False,False,False,Terry Crews takes the WIRED Autocomplete Inter...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
202300,DGzy8FE1Rhk,18.14.06,Shawn Mendes - Nervous,ShawnMendesVEVO,10,2018-06-11T16:00:03.000Z,"Shawn|""Mendes""|""Nervous""|""Island""|""Records""|""Pop""",4986664,518240,5215,34466,https://i.ytimg.com/vi/DGzy8FE1Rhk/default.jpg,False,False,False,Music video by Shawn Mendes performing Nervous...
202301,HyqTJpG_JbE,18.14.06,Ÿáÿ∞Ÿá ŸáŸä ÿßŸÑÿØŸàŸÑ ÿßŸÑÿ™Ÿä ŸÑŸÖ ÿ™ÿµŸàÿ™ ÿπŸÑŸâ ÿßŸÑŸÖÿ∫ÿ±ÿ® ŸÑÿ•ÿ≥ÿ™ÿ∂ÿßŸÅÿ© ...,DailyProFootball,17,2018-06-13T12:05:22.000Z,"ÿßŸÑŸÖÿ∫ÿ±ÿ® 2026|""morocco 2026""|""ŸÉÿ£ÿ≥ ÿßŸÑÿπÿßŸÑŸÖ 2026""|""...",184156,504,217,1368,https://i.ytimg.com/vi/HyqTJpG_JbE/default.jpg,False,False,False,FACEBOOK ÿßŸÜÿ∂ŸÖ ÿßŸÑŸâ ÿµŸÅÿ≠ÿ™ŸÜÿß ÿπŸÑŸâhttps://www.facebo...
202303,C9DCE3k2lWA,18.14.06,ÿßŸÑÿ±ÿ¶Ÿäÿ≥ ÿ£ÿ≠ŸÖÿØ ÿ£ÿ≠ŸÖÿØ Ÿäÿµÿ±ÿ≠ ÿ®ÿ£ŸÜŸá ÿ™ŸÑŸÇŸâ ÿ™ŸáÿØŸäÿØÿßÿ™ ŸàŸäÿ™ÿ≠ÿ≥ÿ±...,Shahid TV,17,2018-06-13T19:31:05.000Z,"ÿßŸÑÿ±ÿ¶Ÿäÿ≥|""ÿ£ÿ≠ŸÖÿØ""|""Ÿäÿµÿ±ÿ≠""|""ÿ®ÿ£ŸÜŸá""|""ÿ™ŸÑŸÇŸâ""|""ÿ™ŸáÿØŸäÿØÿßÿ™""|""...",184242,1440,97,470,https://i.ytimg.com/vi/C9DCE3k2lWA/default.jpg,False,False,False,ÿßŸÑÿ±ÿ¶Ÿäÿ≥ ÿ£ÿ≠ŸÖÿØ ÿ£ÿ≠ŸÖÿØ Ÿäÿµÿ±ÿ≠ ÿ®ÿ£ŸÜŸá ÿ™ŸÑŸÇŸâ ÿ™ŸáÿØŸäÿØÿßÿ™ ŸàŸäÿ™ÿ≠ÿ≥ÿ±...
202304,KYke3FFiyk4,18.14.06,Crime Patrol Dial 100 - Ep 796 - Full Episode ...,SET India,24,2018-06-13T13:54:47.000Z,"true events|""sony entertainment channel""|""cons...",54395,285,61,37,https://i.ytimg.com/vi/KYke3FFiyk4/default.jpg,False,False,False,Click here to subscribe to SonyLIV : http://w...


In [None]:
#df.T.duplicated() # Checks to see if any of the columns are duplicated

There are no duplicated columns in the dataframe.

##### Null values:

In [None]:
df.isna().sum() # checks for null values in all of the columns.

video_id                     0
trending_date                0
title                        0
channel_title                0
category_id                  0
publish_time                 0
tags                         0
views                        0
likes                        0
dislikes                     0
comment_count                0
thumbnail_link               0
comments_disabled            0
ratings_disabled             0
video_error_or_removed       0
description               6942
dtype: int64

In [None]:
df.isna().mean()*100 # Calculates the percentage of null values in each column.

video_id                  0.000000
trending_date             0.000000
title                     0.000000
channel_title             0.000000
category_id               0.000000
publish_time              0.000000
tags                      0.000000
views                     0.000000
likes                     0.000000
dislikes                  0.000000
comment_count             0.000000
thumbnail_link            0.000000
comments_disabled         0.000000
ratings_disabled          0.000000
video_error_or_removed    0.000000
description               3.431368
dtype: float64

In [None]:
df.isna().sum().sum() # this collects the total sum of null values across the Data Frame. 

6942

There are a total of 6942 null values, all within the description column. All other columns contain no null values. The majority of the entries in this dataset are unique, so imputing these values from other columns would not be worthwhile because it would take too long. The missing descriptions would most likely need to be web scraped. However, I do not think that web scraping is needed, because information relating to this column could be readily available within the remaining text columns, e.g. tags. For now, the rows with null values will be dropped.
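The two simplest options for the missing descriptions are compared in the sketch below on a made-up frame. Dropping is the approach taken in this notebook. Filling with an empty string is shown only as an alternative that keeps every row, and is not something the original analysis does.

```python
import numpy as np
import pandas as pd

nulls_demo = pd.DataFrame({
    "title": ["first", "second", "third"],
    "description": ["some text", np.nan, "more text"],
})

# Option 1: drop rows where description is missing
dropped = nulls_demo.dropna(subset=["description"])

# Option 2 (alternative): keep every row and treat a missing description as empty text
filled = nulls_demo.fillna({"description": ""})

print(len(dropped), len(filled))
```

Restricting `dropna` to the description column with `subset` makes the intent explicit, even though in this dataset no other column contains nulls.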

In [None]:
df_dropnul = df.copy() # new data frame to make changes on

In [None]:
df_dropnul.dropna(inplace=True) # dropping null values

In [None]:
df_dropnul[df_dropnul['tags'] == '[none]'] # Rows where the tags column holds the '[none]' placeholder

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
97,xfmipNU4Odc,17.14.11,Edna's registered owner thought she was dead f...,Hope For Paws - Official Rescue Channel,15,2017-11-10T18:02:04.000Z,[none],284666,16396,81,949,https://i.ytimg.com/vi/xfmipNU4Odc/default.jpg,False,False,False,Please donate $5 and help us save more lives:\...
133,X7flefV8tec,17.14.11,"President Bill Clinton On Dictators, Democracy...",Team Coco,24,2017-11-09T02:37:49.000Z,[none],366180,4364,4448,1997,https://i.ytimg.com/vi/X7flefV8tec/default.jpg,False,False,False,#ConanNYC Highlight: President Clinton talks a...
136,5x1FAiIq_pQ,17.14.11,Alicia Keys - When You Were Gone,Alicia Keys,10,2017-11-09T15:49:21.000Z,[none],95944,1354,181,117,https://i.ytimg.com/vi/5x1FAiIq_pQ/default.jpg,False,False,False,Find out more in The Vault: http://bit.ly/AK_A...
178,JuP1Z8xpRb8,17.14.11,Brian Justin Crum - Wild Side,BrianJustinCrum,24,2017-11-09T17:01:01.000Z,[none],27010,1666,36,150,https://i.ytimg.com/vi/JuP1Z8xpRb8/default.jpg,False,False,False,Get the new single 'Wild Side' now! http://sma...
180,1640fZpYBSY,17.14.11,I love the Price is Right! Wooo! -Kevin,Anaki Abo,22,2017-11-07T18:54:39.000Z,[none],358597,1211,72,593,https://i.ytimg.com/vi/1640fZpYBSY/default.jpg,False,False,False,Price is Right contestant plays for a car.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
202199,pGxPSPqwdiw,18.14.06,Nj√´ shtepi u inagurua ne qytetin e Vushtrrise,Jetimat e Ballkanit,22,2018-06-13T18:01:37.000Z,[none],44577,658,16,63,https://i.ytimg.com/vi/pGxPSPqwdiw/default.jpg,False,False,False,www.jetimat.com+386 49 259 931+377 45 459 856+...
202227,KAyj5Xm1C64,18.14.06,[ENG SUB] BTS PROM PARTY 2018 Intro + 2nd Gran...,DaisyxBTS 07,24,2018-06-13T12:51:23.000Z,[none],449418,24806,93,974,https://i.ytimg.com/vi/KAyj5Xm1C64/default.jpg,False,False,False,***I do NOT own anything. Just want to share t...
202247,VdOGUFr3glA,18.14.06,ŸÑÿ≠ÿ∏ÿ© ÿ™ÿ™ŸàŸäÿ¨ ŸÖŸÑŸÅ ÿßŸÑÿ™ŸÑÿßÿ™Ÿä ÿ®ŸÉÿ£ÿ≥ ÿßŸÑÿπÿßŸÑŸÖ 2026 -ÿÆÿ≥ÿßÿ±ÿ©...,Sahifa-Tv ŸÇŸÜÿßÿ© ÿßŸÑÿµÿ≠ŸäŸÅÿ©,24,2018-06-13T11:00:10.000Z,[none],429897,1915,1021,1635,https://i.ytimg.com/vi/VdOGUFr3glA/default.jpg,False,False,False,üõëT√©l√©charge Onefootball maintenant : https://...
202296,LR308Yr8tsg,18.14.06,LIVE NOW ! - 68th FIFA Congress 2018,FIFATV,17,2018-06-13T12:24:25.000Z,[none],312470,2752,412,133,https://i.ytimg.com/vi/LR308Yr8tsg/default.jpg,False,False,False,Follow the congress LIVE on FIFA on YouTube ! ...


Missing values have also been detected within the tags column. However, they will be left alone because they are already marked with the placeholder '[none]'. It is unlikely that leaving these values will have a negative impact on predicting 'views', because some of this information could also be found within the description and title columns.
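If the placeholder needs to be tracked later on, one option is to record it in a boolean column and measure how common it is. The sketch below uses a made-up frame; the `has_tags` column is a hypothetical name and is not created in the original notebook.

```python
import pandas as pd

tags_demo = pd.DataFrame(
    {"tags": ['SNL|"Saturday Night Live"', "[none]", "[none]"]}
)

# Boolean flag recording whether a video carries any tags (hypothetical column name)
tags_demo["has_tags"] = tags_demo["tags"] != "[none]"

# Share of rows that hold the placeholder
placeholder_share = (tags_demo["tags"] == "[none]").mean()

print(tags_demo)
print(round(placeholder_share, 3))
```

A flag like this would also stop the literal token "none" from being the only signal left after the tags are vectorized.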

##### Closer look at the text data: Description

In [None]:
df_dropnul['description'].head(10) # First 10 rows of data from the description column.

0    SHANTELL'S CHANNEL - https://www.youtube.com/s...
1    One year after the presidential election, John...
2    WATCH MY PREVIOUS VIDEO ‚ñ∂ \n\nSUBSCRIBE ‚ñ∫ http...
3    Today we find out if Link is a Nickelback amat...
4    I know it's been a while since we did this sho...
5    Using the iPhone for the past two weeks -- her...
6    Embattled Alabama Senate candidate Roy Moore (...
7    Ice Cream Pint Combination Lock - http://amzn....
8    Inspired by the imagination of P.T. Barnum, Th...
9    For now, at least, we have better things to wo...
Name: description, dtype: object

The description can be considered an important column because it contains the most text data. Looking at this column, it is clear that it contains a large amount of unhelpful data, e.g. special characters, that will need to be removed. Although cleaning this data is an important step, much of the unhelpful data should be filtered out by the CountVectorizer when the data is vectorized.


## Cleaning Text Data 
-------

##### Text formatting to lowercase:
Converting description, tags and title to lower case

In [None]:
df_transform = df_dropnul.copy()

In [None]:
# Converting all text to lower case and storing the results in new '_lower' columns (the original columns are kept).

df_transform['tags_lower'] = df_transform['tags'].str.lower() # converts tags to lowercase
df_transform['title_lower'] = df_transform['title'].str.lower() # converts title to lowercase
df_transform['description_lower'] = df_transform['description'].astype(str).str.lower() # converts description to lowercase



In [None]:
print(df_transform['tags_lower']) #confirming the column has been converted to lowercase.

0                                           shantell martin
1         last week tonight trump presidency|"last week ...
2         racist superman|"rudy"|"mancuso"|"king"|"bach"...
3         rhett and link|"gmm"|"good mythical morning"|"...
4         ryan|"higa"|"higatv"|"nigahiga"|"i dare you"|"...
                                ...                        
202305    kingdom hearts|"kh3"|"kingdom hearts 3"|"pirat...
202306                                               [none]
202307    günter grünwald|"grünwald freitagscomedy"|"gün...
202308    assassin's creed|"assassins creed"|"assassin's...
202309    гироскутер|"уральские пельмени гироскутер"|"мя...
Name: tags_lower, Length: 195368, dtype: object


In [None]:
print(df_transform['title_lower']) #confirming the column has been converted to lowercase.

0                        we want to talk about our marriage
1         the trump presidency: last week tonight with j...
2         racist superman | rudy mancuso, king bach & le...
3                          nickelback lyrics: real or fake?
4                                  i dare you: going bald!?
                                ...                        
202305    kingdom hearts iii – e3 2018 pirates of the ca...
202306                                       yms: the visit
202307    ungut umgeschult – grünwald als ersthelfer am ...
202308    assassin's creed odyssey: e3 2018 welt-enthüll...
202309    гироскутер - азбука уральских пельменей б - ур...
Name: title_lower, Length: 195368, dtype: object


In [None]:
print(df_transform['description_lower']) #confirming the column has been converted to lowercase.

0         shantell's channel - https://www.youtube.com/s...
1         one year after the presidential election, john...
2         watch my previous video ▶ \n\nsubscribe ► http...
3         today we find out if link is a nickelback amat...
4         i know it's been a while since we did this sho...
                                ...                        
202305    find out more about kingdom hearts 3: https://...
202306    patreon: http://www.patreon.com/ymstwitch: htt...
202307    günter versucht sich als ersthelfer bei einem ...
202308    vom verstoßenen söldner zum legendären helden,...
202309    популярный номер из нового шоу азбука уральски...
Name: description_lower, Length: 195368, dtype: object
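
One detail of the `str(x).lower()` approach used for the description column: if a missing value were still present, it would be turned into the literal text 'nan'. The sketch below, on a hypothetical three-row Series, contrasts this with the vectorized `.str.lower()` accessor, which leaves missing values as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-column with one missing description
descriptions = pd.Series(["Shantell's Channel", np.nan, "TODAY We Find Out"])

# str(x).lower() silently turns the missing value into the text 'nan'
via_apply = descriptions.apply(lambda x: str(x).lower())

# The vectorized .str accessor lowercases strings and leaves the missing value untouched
via_str = descriptions.str.lower()

print(via_apply.tolist())
print(via_str.isna().tolist())
```

Since null rows were dropped before this step, both approaches should give the same result here; the difference only matters if nulls are reintroduced.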


## Text Cleaning:
------
Removing all numerical values and special characters from text columns: Title, Tags & Description.

In [None]:
df_cleaningtext = df_transform.copy()

In [None]:
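# Removing @mentions, hyperlinks and every character that is not an ASCII letter, digit, space or tab
# (this covers punctuation and emojis, and also accented and non-Latin letters)
import re
pattern = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?"
replace = ""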

df_cleaningtext['tags_no_chars'] = df_cleaningtext['tags_lower'].apply(lambda x: re.sub(pattern, replace, x))
df_cleaningtext['description_no_chars'] = df_cleaningtext['description_lower'].apply(lambda x: re.sub(pattern, replace, x))
df_cleaningtext['title_no_chars'] = df_cleaningtext['title_lower'].apply(lambda x: re.sub(pattern, replace, x))
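
Because the pattern deletes the `|` and `"` characters that separate tags, neighbouring tags end up fused into one word. The sketch below shows the effect on a hypothetical tag string and one possible alternative, replacing the separators with spaces before stripping other characters. It is an illustration only, not the cleaning used in this notebook.

```python
import re

# Hypothetical tag string in the dataset's pipe-separated format
tags = 'last week tonight trump presidency|"last week tonight"|"john oliver"'

# Stripping every non-alphanumeric character also deletes the separators,
# so neighbouring tags are fused together
fused = re.sub(r"[^0-9a-z \t]", "", tags)

# Alternative sketch: turn separators and quotes into spaces first,
# strip the remaining special characters, then collapse repeated whitespace
spaced = re.sub(r'[|"]+', " ", tags)
spaced = re.sub(r"[^0-9a-z \t]", "", spaced)
spaced = re.sub(r"\s+", " ", spaced).strip()

print(fused)
print(spaced)
```

Keeping a space between tags means a later vectorizer sees each tag word as its own token.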


In [None]:
# Removing numbers from tags, e.g. years such as '2002'
df_cleaningtext['tags_no_dates'] = df_cleaningtext['tags_no_chars'].apply(lambda x: re.sub(r'\b\d{4}\b', '', x)) # removing four-digit entries such as years
df_cleaningtext['tags_no_numerics'] = df_cleaningtext['tags_no_dates'].apply(lambda x: re.sub(r'\d+', '', x)) # removing any remaining digits

df_cleaningtext['tags_no_numerics']

0                                           shantell martin
1         last week tonight trump presidencylast week to...
2         racist supermanrudymancusokingbachracistsuperm...
3         rhett and linkgmmgood mythical morningrhett an...
4         ryanhigahigatvnigahigai dare youidyrhpcdaresno...
                                ...                        
202305    kingdom heartskhkingdom hearts pirates of the ...
202306                                                 none
202307    gnter grnwaldgrnwald freitagscomedygnter grnwa...
202308    assassins creedassassins creedassassins creed ...
202309                                                     
Name: tags_no_numerics, Length: 195368, dtype: object

In [None]:
# Removing any remaining numerical values in the tags column, e.g. '2'
df_cleaningtext['tags_clean'] = df_cleaningtext['tags_no_numerics'].str.replace(r'\d+', '', regex=True)  # replaces runs of digits with an empty string



In [None]:
df_cleaningtext['tags_clean'] #checking  the data entries inside the tags column.

0                                           shantell martin
1         last week tonight trump presidencylast week to...
2         racist supermanrudymancusokingbachracistsuperm...
3         rhett and linkgmmgood mythical morningrhett an...
4         ryanhigahigatvnigahigai dare youidyrhpcdaresno...
                                ...                        
202305    kingdom heartskhkingdom hearts pirates of the ...
202306                                                 none
202307    gnter grnwaldgrnwald freitagscomedygnter grnwa...
202308    assassins creedassassins creedassassins creed ...
202309                                                     
Name: tags_clean, Length: 195368, dtype: object
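
A note on `Series.str.replace`: from pandas 2.0 onwards the `regex` argument defaults to `False`, so a pattern such as `\d+` is treated as literal text unless `regex=True` is passed. A minimal sketch on hypothetical values:

```python
import pandas as pd

tags = pd.Series(["kingdom hearts 3", "e3 2018"])  # hypothetical values

# Literal replacement: looks for the characters \d+ themselves, so nothing changes
literal = tags.str.replace(r"\d+", "", regex=False)

# Regex replacement: removes every run of digits
pattern_based = tags.str.replace(r"\d+", "", regex=True)

print(literal.tolist())
print(pattern_based.tolist())
```

Passing `regex` explicitly keeps the behaviour the same across pandas versions.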

- The percentage of null values is small across the dataframe, so it is not significant.
- Although the contents of this column could be found in other columns, it may be useful for understanding the category_id.
- The null values in these columns will be imputed by web scraping descriptions online.

## Numerical Columns:
----

In [None]:
numerical_columns = df_cleaningtext.select_dtypes(include=['float64', 'int64'])
numerical_columns

Unnamed: 0,category_id,views,likes,dislikes,comment_count
0,22,748374,57527,2966,15954
1,24,2418783,97185,6146,12703
2,23,3191434,146033,5339,8181
3,24,343168,10172,666,2146
4,24,2095731,132235,1989,17518
...,...,...,...,...,...
202305,20,1394530,46778,501,9878
202306,24,139733,11155,119,1968
202307,24,26054,364,11,8
202308,20,1139198,14900,1421,1587


In [None]:
df2 = df_cleaningtext.copy() # working copy from which columns will be dropped

In [None]:
df2['comments_disabled'].value_counts()


False    191769
True       3599
Name: comments_disabled, dtype: int64

In [None]:
df2['ratings_disabled'].value_counts()

False    193511
True       1857
Name: ratings_disabled, dtype: int64

In [None]:
df2['video_error_or_removed'].value_counts()

False    195218
True        150
Name: video_error_or_removed, dtype: int64

The 'comments_disabled', 'ratings_disabled' and 'video_error_or_removed' columns will be removed because they are unlikely to carry useful information for predicting the number of 'views'. The values in these binary columns are overwhelmingly False (each flag is True in fewer than 2% of rows), so they vary too little to be meaningful for the target.
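
The share of True values in each flag column can be worked out directly from the counts reported above. The dictionary and variable names in this sketch are illustrative.

```python
# Counts copied from the value_counts() outputs above
true_counts = {
    "comments_disabled": 3599,
    "ratings_disabled": 1857,
    "video_error_or_removed": 150,
}
total_rows = 195368

# Share of rows where each flag is True, as a percentage
true_share = {col: round(100 * n / total_rows, 2) for col, n in true_counts.items()}
print(true_share)
```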

In [None]:
df2 = df.drop(columns=['comments_disabled', 'ratings_disabled', 'video_error_or_removed']) # dropping the three flag columns (applied to df, not to the df2 copy created above)
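
Because the drop is called on `df` rather than on the `df2` copy, the result appears to hold only the original columns and none of the cleaned text columns. If the intent is to carry the cleaned columns forward, the drop would be applied to the cleaned frame instead. A minimal sketch with hypothetical one-row stand-ins for the notebook's frames:

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's frames
df = pd.DataFrame({"tags": ["A|B"], "comments_disabled": [False]})
df_cleaningtext = df.copy()
df_cleaningtext["tags_clean"] = df_cleaningtext["tags"].str.lower()

flag_cols = ["comments_disabled"]

# Dropping from the original frame: the derived column is not present
from_original = df.drop(columns=flag_cols)

# Dropping from the cleaned frame: the derived column is kept
from_cleaned = df_cleaningtext.drop(columns=flag_cols)

print(list(from_original.columns))
print(list(from_cleaned.columns))
```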

In [None]:
df2.head() # showing the first 5 rows to observe changes in the dataframe.

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,I know it's been a while since we did this sho...


##### Resetting Dataframe indexes:

In [None]:
df2.reset_index(inplace=True,drop=True)
print(df2.shape)
df2.tail(1)

(202310, 13)


Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,description
202309,go-F6xvezAM,18.14.06,Гироскутер - Азбука Уральских Пельменей Б - Ур...,Уральские Пельмени,23,2018-06-13T15:02:15.000Z,"Гироскутер|""уральские пельмени гироскутер""|""мя...",316328,11394,352,550,https://i.ytimg.com/vi/go-F6xvezAM/default.jpg,Популярный номер из нового шоу Азбука Уральски...


In [None]:
set(df2['trending_date'].str[:3]) # checking the distinct year prefixes in trending_date

{'17.', '18.'}
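
The `trending_date` values follow a YY.DD.MM layout (for example '17.14.11' is 14 November 2017). If a proper datetime column is wanted later, one option is `pd.to_datetime` with an explicit format. The Series below is a hypothetical sample, not the full column.

```python
import pandas as pd

# Sample values in the dataset's YY.DD.MM layout
trending = pd.Series(["17.14.11", "18.14.06"])

# Parse with an explicit format so that day and month are not swapped
parsed = pd.to_datetime(trending, format="%y.%d.%m")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```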

##### Pickling the data frame:

In [None]:
#pickling the dataframe

import joblib
joblib.dump(df2,'df_cleaning.pkl')

['df_cleaning.pkl']

## Summary:
