In [1]:
import pandas as pd

Note : The CSV file must be in the same directory as the notebook

In [2]:
df = pd.read_csv('videosUS.csv')
df.head()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22.0,2017-11-13T17:13:01.000Z,SHANtell martin,748374.0,57527.0,2966.0,15954.0,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24.0,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783.0,97185.0,6146.0,12703.0,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23.0,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434.0,146033.0,5339.0,8181.0,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24.0,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168.0,10172.0,666.0,2146.0,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24.0,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731.0,132235.0,1989.0,17518.0,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


# Question 1 : Missing values

Let's find out how many missing values there are

In [3]:
#Number of missing values in each column (.head() restricts the check to the first five rows)
df.isna().head().sum()

video_id                  0
trending_date             0
title                     0
channel_title             0
category_id               0
publish_time              0
tags                      0
views                     0
likes                     0
dislikes                  0
comment_count             0
thumbnail_link            0
comments_disabled         0
ratings_disabled          0
video_error_or_removed    0
description               0
dtype: int64

In [4]:
#Total number of missing values across the first five rows (.head() restricts the check)
df.isna().head().sum().sum()

0

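The first five rows contain no missing values. Because of `.head()`, however, this check does not cover the full dataframe: the `df.info()` output in Question 2 reports 40949 non-null values out of 41415 entries for most columns, so missing values do exist further down.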
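Calling `.head()` before `.sum()` restricts the count to the first five rows. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset) to show how the two checks can disagree when the missing values sit further down.

```python
import pandas as pd

# Hypothetical toy frame: the missing values sit below the first five rows.
nan = float("nan")
toy = pd.DataFrame({
    "views": [10.0, 20.0, 30.0, 40.0, 50.0, nan, 70.0],
    "likes": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, nan],
})

# Restricted to the first five rows, the check finds nothing.
missing_in_head = toy.isna().head().sum()

# Over every row, it finds one missing value in each column.
missing_per_column = toy.isna().sum()
missing_total = int(missing_per_column.sum())

print(missing_in_head)
print(missing_per_column)
print("Total missing values:", missing_total)
```

Applied to the real data, the same idea would be `df.isna().sum()` without the `.head()` call.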

# Question 2 : Mean, median and quartiles

We start by determining which variables are numerical in our dataset

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41415 entries, 0 to 41414
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   video_id                41415 non-null  object 
 1   trending_date           40949 non-null  object 
 2   title                   40949 non-null  object 
 3   channel_title           40949 non-null  object 
 4   category_id             40949 non-null  float64
 5   publish_time            40949 non-null  object 
 6   tags                    40949 non-null  object 
 7   views                   40949 non-null  float64
 8   likes                   40949 non-null  float64
 9   dislikes                40949 non-null  float64
 10  comment_count           40949 non-null  float64
 11  thumbnail_link          40949 non-null  object 
 12  comments_disabled       40949 non-null  object 
 13  ratings_disabled        40949 non-null  object 
 14  video_error_or_removed  40949 non-null

The 4 numerical variables in this dataset are :
* views
* likes
* dislikes
* comment_count

(category_id is stored as float64, but it identifies a category rather than measuring a quantity, so it is not included)

Let's put them all in a list, so that we don't have to type them each time:

In [6]:
num_cols = ['views', 'likes', 'dislikes', 'comment_count']

Let's start with the mean:

In [7]:
df[num_cols].mean()

views            2.360785e+06
likes            7.426670e+04
dislikes         3.711401e+03
comment_count    8.446804e+03
dtype: float64

Now the median:

In [8]:
df[num_cols].median()

views            681861.0
likes             18091.0
dislikes            631.0
comment_count      1856.0
dtype: float64

And finally, the quartiles:

In [9]:
df[num_cols].quantile(q=[0.25, 0.5, 0.75])

Unnamed: 0,views,likes,dislikes,comment_count
0.25,242329.0,5424.0,202.0,614.0
0.5,681861.0,18091.0,631.0,1856.0
0.75,1823157.0,55417.0,1938.0,5755.0
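The mean, the median and the quartiles can also be read from a single call to `describe()`. This is a minimal sketch on a made-up frame (`toy` and its values are hypothetical), not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical toy frame with one large value in each column.
toy = pd.DataFrame({
    "views": [100.0, 200.0, 300.0, 400.0, 1000.0],
    "likes": [1.0, 2.0, 3.0, 4.0, 10.0],
})

# describe() returns count, mean, std, min, 25%, 50%, 75% and max per column.
summary = toy.describe()

# Keep only the rows discussed in this question; "50%" is the median.
stats = summary.loc[["mean", "25%", "50%", "75%"]]
print(stats)
```

In the toy data, as in the real dataset, the mean is larger than the median, which suggests a right-skewed distribution pulled up by a few very large values.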


# Question 3 : Checking for outliers

Let's calculate the z-scores for each column

In [10]:
#z-score = (value - column mean) / column standard deviation
for col in num_cols:
    df[col + '_zscore'] = (df[col] - df[col].mean()) / df[col].std()

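Now, let's count the records in each column that are not flagged as outliers, using a z-score above 3 as the criterion. Subtracting each count from the total gives the number of outliers, for example 41415 - 40896 = 519 for views.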

In [11]:
thresh = 3
print('Total number of elements :',len(df))
print('views :',len(df) - len(df[df['views_zscore'] > thresh]))
print('likes :',len(df) - len(df[df['likes_zscore'] > thresh]))
print('dislikes :',len(df) - len(df[df['dislikes_zscore'] > thresh]))
print('comment_count :',len(df) - len(df[df['comment_count_zscore'] > thresh]))

Total number of elements : 41415
views : 40896
likes : 40833
dislikes : 41212
comment_count : 41043
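To count the outliers directly, the comparison can be summed column by column. The sketch below works on a made-up frame (`toy` is hypothetical) and, as an assumption that differs from the cell above, uses the absolute z-score so that unusually low values would be flagged as well.

```python
import pandas as pd

# Hypothetical toy frame: one extreme value in "views", none in "likes".
toy = pd.DataFrame({
    "views": [10.0] * 19 + [1000.0],
    "likes": [float(i) for i in range(1, 21)],
})

thresh = 3

# z-score of every cell, computed column by column (sample standard deviation).
zscores = (toy - toy.mean()) / toy.std()

# Number of values per column whose absolute z-score exceeds the threshold.
outlier_counts = (zscores.abs() > thresh).sum()
print(outlier_counts)
```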


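Finally, let's filter the records on their z-scores. Note that the conditions below keep only the rows whose z-score exceeds 3 in all four columns, which are the outliers themselves; getting rid of the outliers would require the opposite comparison (`<= 3`). The remaining questions are answered on the 98 records that this filter retains.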

In [12]:
#views
df = df[df['views_zscore'] > 3]

#likes
df = df[df['likes_zscore'] > 3]

#dislikes
df = df[df['dislikes_zscore'] > 3]

#comment_count
df = df[df['comment_count_zscore'] > 3]

print('Number of remaining records :', len(df))

Number of remaining records : 98
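Removing outliers means keeping the rows whose z-scores stay within the threshold in every column. The following is a minimal sketch on a made-up frame (`toy` and `cleaned` are hypothetical names); using the absolute z-score is an assumption, and rows with missing values are dropped too, because a comparison with NaN evaluates to False.

```python
import pandas as pd

# Hypothetical toy frame: the last row is an extreme value in "views".
toy = pd.DataFrame({
    "views": [10.0] * 19 + [1000.0],
    "likes": [float(i) for i in range(1, 21)],
})

thresh = 3
zscores = (toy - toy.mean()) / toy.std()

# Keep a row only if every column stays within the threshold.
within_threshold = (zscores.abs() <= thresh).all(axis=1)
cleaned = toy[within_threshold]

print("Number of remaining records :", len(cleaned))
```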


# Question 4 : Unique categories

In [13]:
print('There are '+str(len(df['category_id'].unique()))+' unique categories: ')
for e in df['category_id'].unique():
    print(e)

There are 3 unique categories: 
10.0
24.0
22.0
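The number of distinct categories and the number of records in each can also be obtained with `nunique()` and `value_counts()`. This sketch uses a made-up series (`categories` is hypothetical) and assumes it contains no missing values, so the float IDs can be cast to integers.

```python
import pandas as pd

# Hypothetical category IDs stored as floats, as in the dataset.
categories = pd.Series([10.0, 24.0, 10.0, 22.0, 10.0], name="category_id")

# Number of distinct categories.
n_categories = categories.nunique()

# Records per category, with the IDs cast to integers for readability.
category_counts = categories.astype(int).value_counts()

print("There are", n_categories, "unique categories")
print(category_counts)
```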


# Question 5 : Adjusting column types

In [14]:
df.head(1)

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,views_zscore,likes_zscore,dislikes_zscore,comment_count_zscore
2175,TyHvyGVs42U,17.24.11,"Luis Fonsi, Demi Lovato - Échame La Culpa",LuisFonsiVEVO,10.0,2017-11-17T05:00:01.000Z,"Luis|""Fonsi""|""Demi""|""Lovato""|""Échame""|""La""|""Cu...",80605857.0,2173715.0,104121.0,122511.0,https://i.ytimg.com/vi/TyHvyGVs42U/default.jpg,False,False,False,“Échame La Culpa” disponible ya en todas las p...,10.582076,9.172489,3.458857,3.047361


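The columns "comments_disabled", "ratings_disabled" and "video_error_or_removed" hold boolean values, and it might be beneficial to encode them numerically with one-hot encoding rather than keep them as booleans. Note that `pd.get_dummies` returns one indicator column per distinct value, so assigning its result back to a single column only works when that column holds a single distinct value, which appears to be the case for the remaining records. The resulting 1 then marks the value that was present (False in the row shown above), not a flag that is switched on.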

In [15]:
df['comments_disabled'] = pd.get_dummies(df['comments_disabled'], prefix='comments_disabled')
df['ratings_disabled'] = pd.get_dummies(df['ratings_disabled'], prefix='ratings_disabled')
df['video_error_or_removed'] = pd.get_dummies(df['video_error_or_removed'], prefix='video_error_or_removed')

df.head(1)

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,views_zscore,likes_zscore,dislikes_zscore,comment_count_zscore
2175,TyHvyGVs42U,17.24.11,"Luis Fonsi, Demi Lovato - Échame La Culpa",LuisFonsiVEVO,10.0,2017-11-17T05:00:01.000Z,"Luis|""Fonsi""|""Demi""|""Lovato""|""Échame""|""La""|""Cu...",80605857.0,2173715.0,104121.0,122511.0,https://i.ytimg.com/vi/TyHvyGVs42U/default.jpg,1,1,1,“Échame La Culpa” disponible ya en todas las p...,10.582076,9.172489,3.458857,3.047361
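Two common ways of encoding boolean flags are sketched below on a made-up frame (`toy`, `as_int` and `one_hot` are hypothetical names, and the columns are assumed to be genuine booleans without missing values). Casting to integers keeps one column per flag with 1 meaning True, while `pd.get_dummies` with the `columns` argument creates one indicator column per distinct value.

```python
import pandas as pd

# Hypothetical toy frame with two boolean flag columns.
toy = pd.DataFrame({
    "comments_disabled": [False, True, False],
    "ratings_disabled": [False, False, False],
})
flag_cols = ["comments_disabled", "ratings_disabled"]

# Option 1: cast each flag to 0/1, so that 1 means True.
as_int = toy[flag_cols].astype(int)

# Option 2: one-hot encoding, one indicator column per distinct value.
one_hot = pd.get_dummies(toy, columns=flag_cols)

print(as_int)
print(list(one_hot.columns))
```

With the second option, a column that only ever contains False produces a single indicator column ending in `_False`, which is why its values are all set.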


# Question 6 : Tags

First, we need to get all of the existing tags

In [16]:
#Getting all tags: each entry of the 'tags' column is a '|'-separated string
all_tags = []
for tag in df['tags']:
    all_tags.extend(tag.split('|'))

#Removing the surrounding " characters from each tag
all_tags = [tag.strip('"') for tag in all_tags]

#Creating a dictionary with the number of occurrences of each tag
tag_dict = dict((x,all_tags.count(x)) for x in set(all_tags))

#How many distinct tags do we have?
print("There are "+str(len(tag_dict))+" tags")

#Most common tags
#We first sort the dictionary by value, in descending order
sorted_tags_dict = dict(sorted(tag_dict.items(), key=lambda item: item[1], reverse=True))
#then, we select the first 5 keys
print("The 5 most famous tags are :",list(sorted_tags_dict)[:5])

There are 184 tags
The 5 most famous tags are : ['Pop', 'This Is America', 'Rap', 'mcDJ Recording/RCA Records', 'Childish Gambino']
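The same counting can be done in a single pass with `collections.Counter` from the standard library, which avoids calling `list.count` once per distinct tag. This is a minimal sketch on made-up tag strings (`tag_strings` is hypothetical) in the same pipe-separated format as the `tags` column.

```python
from collections import Counter

# Hypothetical tag strings, pipe-separated with quoted entries.
tag_strings = ['Pop|"Rap"|"Music"', 'Pop|"Dance"', 'Rap|"Pop"']

# Split each string on '|', strip the surrounding quotes, and count every tag.
tag_counts = Counter(
    tag.strip('"')
    for tag_string in tag_strings
    for tag in tag_string.split("|")
)

print("There are", len(tag_counts), "tags")
print("The 2 most common tags are :", tag_counts.most_common(2))
```

`most_common(n)` returns the `n` most frequent tags together with their counts; tags with equal counts are returned in the order they were first seen.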
