## Import

In [58]:
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import isodate

## Read raw data from file

In [59]:
raw_video_df = pd.read_csv("../data/raw/video_data_raw.csv")
raw_video_df.tail()

Unnamed: 0,video_id,channelTitle,title,description,tags,publishedAt,viewCount,likeCount,favouriteCount,commentCount,duration,definition,caption
60027,Lz-IYgZRaXE,DataEng Uncomplicated,How to Get Free FME Desktop Licenses,This is a walk-through on the 6 ways on how yo...,"['free fme license', 'fme license key', 'fme h...",2020-04-12T21:13:01Z,2420.0,13.0,,6.0,PT2M57S,hd,False
60028,tdRbvrP5Hmk,DataEng Uncomplicated,Rest API to S3 with FME | Step by Step,This is a walkthrough on how to use any Rest A...,"['s3connector', 'api to s3 fme', 'httpcaller',...",2020-04-10T16:25:39Z,780.0,11.0,,0.0,PT7M33S,hd,False
60029,ZGWLggwc3Bs,DataEng Uncomplicated,How To Manage Temporary Data In FME,This video is a full guide on using the tempPa...,"['temppathnamecreater', 'FME Transformer Guide...",2020-03-22T22:30:55Z,373.0,2.0,,0.0,PT5M8S,hd,False
60030,vbOtRupga44,DataEng Uncomplicated,"Creating Points From X,Y Coordinates with FME ...",Learn how to convert a CSV file with longitude...,"['csv to shapefile', 'csv to geopackage', 'fme...",2019-08-15T04:15:09Z,6271.0,19.0,,4.0,PT3M47S,hd,False
60031,OeiAAjgGEbY,DataEng Uncomplicated,FME FeatureJoiner Transformer Guide,This video walks through how to use the Featur...,"['FeatureJoiner', 'FME FeatureJoiner', 'joinin...",2019-08-11T20:31:25Z,2286.0,36.0,,2.0,PT6M50S,hd,False


In [60]:
raw_comment_df = pd.read_csv("../data/raw/comment_data_raw.csv")
raw_comment_df.head()

Unnamed: 0,video_id,comments
0,WkqM0ndr42c,['👩\u200d💻 Code: https://github.com/justmarkha...
1,tWFQqaRtSQA,['THANKS for watching! 🙌 Which trick are you m...
2,gd-TZut-oto,"[""Want to watch all 50 scikit-learn tips? Enro..."
3,v2QpvCJ1ar8,"[""Thanks for watching! 🙌 If you're brand new t..."
4,sMlsd2CnIf4,['Want to learn how to use pipelines effective...


### How many rows and how many columns does the raw data have?

In [61]:
data_video_shape = raw_video_df.shape
print(f"Video data current shape: {data_video_shape}")
data_comment_shape = raw_comment_df.shape
print(f"Comments data current shape: {data_comment_shape}")

Video data current shape: (60032, 13)
Comments data current shape: (55309, 2)


### What is the meaning of each row?

- Answer: Each row appears to describe a single YouTube video: its ID, channel, title, description, tags, publication time, engagement counts, duration, definition, and caption flag.

### What does each column mean?

(Too long, not written yet.)

### Does the raw data have duplicate rows?

In [62]:
# retrieve the index
index = raw_video_df.index
# boolean NumPy array marking index labels that repeat an earlier label
# (this compares index labels only, not the contents of each row)
deDupSeries = index.duplicated(keep='first')
# count the repeated index labels
num_duplicated_rows = deDupSeries.sum()

In [63]:
if num_duplicated_rows == 0:
    print("Raw data has no duplicated rows!")
else:
    ext = "rows" if num_duplicated_rows > 1 else "row"
    print(f"Raw data has {num_duplicated_rows} duplicated {ext}")

Raw data has no duplicated rows!
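
The check above inspects index labels only. A row-level check can be sketched with `DataFrame.duplicated`, which compares column values. The `toy_df` frame below is made-up illustration data, not part of the dataset:

```python
import pandas as pd

# Hypothetical toy data: the last row repeats the first one exactly.
toy_df = pd.DataFrame(
    {
        "video_id": ["a1", "b2", "a1"],
        "viewCount": [10.0, 20.0, 10.0],
    }
)

# The default RangeIndex has unique labels, so an index-based check finds nothing.
num_duplicated_labels = int(toy_df.index.duplicated(keep="first").sum())

# DataFrame.duplicated compares the column values of each row instead.
num_duplicated_rows = int(toy_df.duplicated(keep="first").sum())

print(num_duplicated_labels, num_duplicated_rows)
```

On the real data the equivalent call would be `raw_video_df.duplicated().sum()`.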


### What data type does each column currently have? Are there any columns having inappropriate data types?

In [64]:
raw_video_df.dtypes

video_id           object
channelTitle       object
title              object
description        object
tags               object
publishedAt        object
viewCount         float64
likeCount         float64
favouriteCount    float64
commentCount      float64
duration           object
definition         object
caption              bool
dtype: object

- The columns `publishedAt` and `duration` are currently of object type (strings). `publishedAt` is a timestamp, so we convert it to a datetime type. `duration` is an ISO 8601 duration string (for example `PT2M57S`), so we convert it to a float holding the total number of seconds.

In [65]:
# convert publishedAt to datetime
raw_video_df["publishedAt"] = pd.to_datetime(raw_video_df["publishedAt"])
# convert duration to float (total number of seconds)
raw_video_df['duration'] = raw_video_df['duration'].apply(isodate.parse_duration)
raw_video_df['duration'] = raw_video_df['duration'].dt.total_seconds()

In [66]:
# check the dtypes after conversion
raw_video_df.dtypes

video_id                       object
channelTitle                   object
title                          object
description                    object
tags                           object
publishedAt       datetime64[ns, UTC]
viewCount                     float64
likeCount                     float64
favouriteCount                float64
commentCount                  float64
duration                      float64
definition                     object
caption                          bool
dtype: object
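
To make the duration conversion concrete, here is a minimal standard-library sketch of what turning an ISO 8601 duration into seconds involves. It is not how `isodate` works internally. The helper name `iso_duration_to_seconds` is hypothetical, and it handles only day, hour, minute, and second designators:

```python
import re

# Covers only the day/hour/minute/second designators (e.g. "PT2M57S", "P1DT2H").
# Year, month and week designators are deliberately not handled in this sketch.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def iso_duration_to_seconds(text: str) -> float:
    """Convert a simple ISO 8601 duration string to a number of seconds."""
    match = _DURATION_RE.match(text)
    if match is None:
        raise ValueError(f"Unsupported ISO 8601 duration: {text!r}")
    # Designators that are absent from the string count as zero.
    parts = {
        name: float(value) if value else 0.0
        for name, value in match.groupdict().items()
    }
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


print(iso_duration_to_seconds("PT2M57S"))
```

The sample string `PT2M57S` comes from the `duration` column shown earlier: 2 minutes and 57 seconds.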

### For each numerical column, how are the values distributed?

What fraction of the values is missing?

In [67]:
missing_vals = raw_video_df.select_dtypes(include='number').isna().sum()
missing_percentage = missing_vals / len(raw_video_df)
missing_percentage

viewCount         0.000067
likeCount         0.004048
favouriteCount    1.000000
commentCount      0.012793
duration          0.000000
dtype: float64
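
The values above are fractions of rows (between 0 and 1), not percentages. A minimal sketch of the same calculation expressed in percent, using `isna().mean()` on a made-up `toy_df` (the data is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one partly missing column, one fully missing column.
toy_df = pd.DataFrame(
    {
        "viewCount": [1.0, 2.0, np.nan, 4.0],
        "favouriteCount": [np.nan, np.nan, np.nan, np.nan],
        "title": ["a", "b", "c", "d"],
    }
)

# Keep only numerical columns, as in the cell above.
numeric_cols = toy_df.select_dtypes(include="number")

# The mean of a boolean mask is the fraction of True values; scale it to percent.
missing_pct = numeric_cols.isna().mean() * 100

print(missing_pct)
```

Taking the mean of the `isna()` mask is equivalent to dividing the count of missing values by the number of rows.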

### For each categorical column, how are the values distributed?

What fraction of the values is missing?

In [68]:
missing_vals = raw_video_df.select_dtypes(exclude='number').isna().sum()
missing_percentage = missing_vals / len(raw_video_df)
missing_percentage

video_id        0.000000
channelTitle    0.000000
title           0.000000
description     0.028035
tags            0.189566
publishedAt     0.000000
definition      0.000000
caption         0.000000
dtype: float64

### What are the minimum and maximum values? Are they abnormal?

In [69]:
raw_video_df.describe()

Unnamed: 0,viewCount,likeCount,favouriteCount,commentCount,duration
count,60028.0,59789.0,0.0,59264.0,60032.0
mean,48492.34,1124.692686,,69.896244,1496.953575
std,358519.0,8288.857051,,532.723104,2475.709199
min,0.0,0.0,,0.0,0.0
25%,622.0,11.0,,0.0,293.0
50%,3059.5,55.0,,5.0,695.0
75%,16021.75,334.0,,27.0,1786.0
max,34476450.0,571358.0,,60054.0,92218.0


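- `favouriteCount` has no non-missing values (its count is 0), so we remove this column. The `duration` column ranges from 0 seconds to 92218 seconds (about 25.6 hours); both extremes may deserve a closer look.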

In [70]:
# remove favouriteCount
raw_video_df = raw_video_df.drop(columns='favouriteCount')
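
One possible follow-up to the "Are they abnormal?" question is to list the rows with extreme `duration` values. The sketch below uses a made-up `toy_df`. The 12-hour cutoff `MAX_PLAUSIBLE_SECONDS` is an arbitrary assumption for illustration, not a threshold taken from the data:

```python
import pandas as pd

# Hypothetical toy data with one zero-length and one very long video.
toy_df = pd.DataFrame(
    {
        "video_id": ["a1", "b2", "c3", "d4"],
        "duration": [0.0, 177.0, 453.0, 92218.0],
    }
)

# Arbitrary illustrative cutoff (12 hours); tune it to the use case.
MAX_PLAUSIBLE_SECONDS = 12 * 3600

# Boolean masks for the two kinds of extreme values.
is_zero = toy_df["duration"] == 0
is_very_long = toy_df["duration"] > MAX_PLAUSIBLE_SECONDS

# Rows matching either condition are candidates for manual inspection.
suspicious = toy_df[is_zero | is_very_long]

print(suspicious)
```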

### Save the processed data

In [71]:
# Save processed data to disk
raw_video_df.to_csv("../data/processed/video_data_processed.csv", index=False)
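
CSV stores timestamps as text, so the datetime dtype of `publishedAt` is not preserved in the saved file. A minimal sketch of restoring it on reload with `parse_dates`, using an in-memory buffer and a made-up `toy_df` in place of the real file:

```python
import io

import pandas as pd

# Hypothetical toy data standing in for the processed frame.
toy_df = pd.DataFrame(
    {
        "video_id": ["a1", "b2"],
        "publishedAt": pd.to_datetime(
            ["2020-04-12T21:13:01Z", "2020-04-10T16:25:39Z"]
        ),
        "duration": [177.0, 453.0],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)

# A plain read leaves publishedAt as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Asking pandas to parse the column restores a datetime dtype.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["publishedAt"])

print(plain["publishedAt"].dtype)
print(parsed["publishedAt"].dtype)
```

When loading the processed file later, the same `parse_dates=["publishedAt"]` argument could be passed to `pd.read_csv`.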