In [95]:
#from numpy import nan as NA
import numpy as np
import pandas as pd

In [96]:
# Import the first 16 columns of the dataset (comma is the default separator)
df = pd.read_csv("USvideos.csv", usecols=list(range(16)))

In [97]:
df.shape

(40949, 16)

Let's look at the first five rows of the data, check its shape, and see whether any columns have null/missing values.

In [98]:
# Have a look at the first five rows of the data
df.head(5)

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► ...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


In [99]:
# Shape of dataframe
df.shape

(40949, 16)

In [100]:
# check whether there are any columns that have null/missing values.
df.isna().sum()

video_id                    0
trending_date               0
title                       0
channel_title               0
category_id                 0
publish_time                0
tags                        0
views                       0
likes                       0
dislikes                    0
comment_count               0
thumbnail_link              0
comments_disabled           0
ratings_disabled            0
video_error_or_removed      0
description               570
dtype: int64

Only the "description" column has missing values (570 rows). This column mostly holds additional information that publishers want their audience to know about the video, and it is not used in the analysis. Therefore, we do not need to impute it or apply any other missing-value treatment.
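
Should a later step need string operations on "description", one option is to replace the missing values with empty strings instead of dropping rows. A minimal sketch on a toy frame (the `toy` data and the names `missing_share` and `toy_filled` are made up for illustration and are not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; values are invented for illustration.
toy = pd.DataFrame({
    "video_id": ["a1", "b2", "c3"],
    "description": ["Official trailer", np.nan, "Subscribe for more"],
})

# Share of rows with a missing description.
missing_share = toy["description"].isna().mean()

# Replace missing descriptions with empty strings so string methods work on every row.
toy_filled = toy.assign(description=toy["description"].fillna(""))
```

The original frame is left untouched here, so the choice can be revisited later.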

Now let's check for rows that are duplicated across the entire dataset.

In [101]:
df.duplicated().sum()

48

We will drop all rows that are exact duplicates of another row because they do not provide any additional information.

In [102]:
df.drop_duplicates(inplace=True)

In [103]:
df.duplicated().sum()

0
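
Before dropping exact duplicates, it can help to look at every copy side by side. A small sketch on invented toy data (the `toy` frame and variable names are hypothetical) showing how `keep=False` differs from the default:

```python
import pandas as pd

# Toy data: rows 0 and 1 are exact copies; rows 2 and 3 share an ID but differ.
toy = pd.DataFrame({
    "video_id": ["a1", "a1", "b2", "b2"],
    "trending_date": ["17.14.11", "17.14.11", "17.14.11", "17.15.11"],
    "views": [100, 100, 50, 80],
})

# keep=False marks every member of a duplicate group, not just the repeats.
all_copies = toy[toy.duplicated(keep=False)]

# The default keep='first' retains the first row of each group and drops the rest.
deduped = toy.drop_duplicates()
```

Here `all_copies` holds both copies of the repeated row, while `deduped` keeps one of them and leaves the two distinct "b2" rows alone.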

Next, we want to analyze the data by video ID, so we need to make sure that each video ID appears only once. The other columns do not need to be unique.

In [104]:
df.duplicated(subset='video_id').sum()

34619

We can see that there are a lot of duplicates in the video_id column. To find the reason, we pick one video ID that has duplicates and inspect its rows.

In [105]:
df[df.video_id == 'uxbQATBAXf8']

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
34152,uxbQATBAXf8,18.12.05,Deadpool 2 | With Apologies to David Beckham,20th Century Fox,1,2018-05-10T14:24:29.000Z,"Trailer|""Deadpool""|""20th Century Fox (Producti...",9399654,268143,2399,6739,https://i.ytimg.com/vi/uxbQATBAXf8/default.jpg,False,False,False,Get your Deadpool 2 tickets at http://www.Dead...
34362,uxbQATBAXf8,18.13.05,Deadpool 2 | With Apologies to David Beckham,20th Century Fox,1,2018-05-10T14:24:29.000Z,"Trailer|""Deadpool""|""20th Century Fox (Producti...",13293647,334533,3302,8333,https://i.ytimg.com/vi/uxbQATBAXf8/default.jpg,False,False,False,Get your Deadpool 2 tickets at http://www.Dead...
34792,uxbQATBAXf8,18.15.05,Deadpool 2 | With Apologies to David Beckham,20th Century Fox,1,2018-05-10T14:24:29.000Z,"Trailer|""Deadpool""|""20th Century Fox (Producti...",15960127,374825,3823,9059,https://i.ytimg.com/vi/uxbQATBAXf8/default.jpg,False,False,False,Get your Deadpool 2 tickets at http://www.Dead...
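
The ID inspected above was typed in by hand. One way to draw a repeated ID programmatically is sketched below on invented toy data (the `toy` frame and the names `repeated_ids`, `picked`, and `rows_for_picked` are assumptions for illustration):

```python
import pandas as pd

# Toy data: "a1" and "c3" repeat, "b2" appears once.
toy = pd.DataFrame({
    "video_id": ["a1", "a1", "b2", "c3", "c3", "c3"],
    "views": [10, 20, 5, 1, 2, 3],
})

# IDs that occur on more than one row.
repeated_ids = toy.loc[toy.duplicated(subset="video_id", keep=False), "video_id"].unique()

# Draw one of them; random_state is fixed only to make the sketch reproducible.
picked = pd.Series(repeated_ids).sample(n=1, random_state=0).iloc[0]

# All rows belonging to the sampled ID.
rows_for_picked = toy[toy["video_id"] == picked]
```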


The result above shows that one video can appear on the trending list many times (on many days). Across all records with the same video ID, the columns "trending_date", "views", "likes", "dislikes", and "comment_count" have different values. We will create a column named 'trending_date_count' that indicates the number of times the video appeared on the trending list. After that, we will keep the most recent record for each video_id, which carries the latest views, likes, dislikes, and comment_count, and remove the older rows with the same video_id.

In [106]:
# Create a column named 'trending_date_count' that holds the number of trending records for each video.
df['trending_date_count'] = df.groupby(['video_id'])['trending_date'].transform('count')

In [107]:
# Keep only the last row for each video_id and remove the older rows (this assumes the rows are in chronological order).
df.drop_duplicates(subset='video_id', keep='last', inplace=True)

In [108]:
df[df.video_id == 'uxbQATBAXf8']

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,trending_date_count
34792,uxbQATBAXf8,18.15.05,Deadpool 2 | With Apologies to David Beckham,20th Century Fox,1,2018-05-10T14:24:29.000Z,"Trailer|""Deadpool""|""20th Century Fox (Producti...",15960127,374825,3823,9059,https://i.ytimg.com/vi/uxbQATBAXf8/default.jpg,False,False,False,Get your Deadpool 2 tickets at http://www.Dead...,3


In [109]:
df.shape

(6282, 17)
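
Note that `keep='last'` selects the last row in file order, so it returns the most recent record only if the rows are already sorted by trending date. A minimal sketch that makes the ordering explicit, assuming `trending_date` is in YY.DD.MM format as the sample rows suggest (the `toy` frame and the `trending_dt` column are hypothetical and used only for illustration):

```python
import pandas as pd

# Toy data, deliberately out of chronological order.
toy = pd.DataFrame({
    "video_id": ["x", "y", "x", "x"],
    "trending_date": ["18.15.05", "18.12.05", "18.12.05", "18.13.05"],
    "views": [300, 40, 100, 200],
})

# Parse YY.DD.MM strings into real dates (format assumed from the sample rows).
toy["trending_dt"] = pd.to_datetime(toy["trending_date"], format="%y.%d.%m")

# Number of trending records per video, broadcast back to every row.
toy["trending_date_count"] = toy.groupby("video_id")["trending_date"].transform("count")

# Sort chronologically so that keep='last' really is the most recent record.
latest = (
    toy.sort_values("trending_dt")
       .drop_duplicates(subset="video_id", keep="last")
       .reset_index(drop=True)
)
```

In this toy case the row kept for "x" is the one dated 15 May 2018 even though it comes first in the input.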