# 2. ETL

This notebook extracts data from SQLite and MongoDB, transforms it, and loads it into pandas DataFrames that are then saved as pickle files.

### a) Gather Data from SQLite (tiktok.db)

Extract the TikTok view, like, and comment counts from the SQLite database, then derive total engagement as likes plus comments.

In [1]:
import os
import sqlite3
import pandas as pd

# setting path to database
folder_path = "data"
db_file_path = os.path.join(folder_path, "tiktok.db")


In [2]:
# connecting to db
conn = sqlite3.connect(db_file_path)

In [3]:
# SQL query for view, like, and comment counts, skipping rows where any of them is NULL
query = '''
SELECT  m.video_view_count AS tiktok_view_count,
        m.video_like_count AS tiktok_like_count,
        m.video_comment_count AS tiktok_comment_count
FROM VideoMetrics m
WHERE m.video_view_count IS NOT NULL AND
        m.video_like_count IS NOT NULL AND
        m.video_comment_count IS NOT NULL 
'''

In [4]:
# Load into df
tiktok_df = pd.read_sql_query(query, conn)

# close connection
conn.close()
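
The extraction step above can be tried without the real `tiktok.db`. The following is a minimal sketch, assuming a throwaway in-memory database with a `VideoMetrics` table shaped like the one queried above; the sample rows and the helper name `load_metrics` are made up for illustration. It wraps the connection in `contextlib.closing` so it is closed even if the query raises.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Same query as in the notebook cell above.
QUERY = """
SELECT  m.video_view_count AS tiktok_view_count,
        m.video_like_count AS tiktok_like_count,
        m.video_comment_count AS tiktok_comment_count
FROM VideoMetrics m
WHERE m.video_view_count IS NOT NULL AND
      m.video_like_count IS NOT NULL AND
      m.video_comment_count IS NOT NULL
"""


def load_metrics(conn):
    """Run the metrics query on an open sqlite3 connection (hypothetical helper)."""
    return pd.read_sql_query(QUERY, conn)


# Assumption: a tiny in-memory stand-in for tiktok.db with invented rows.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute(
        "CREATE TABLE VideoMetrics ("
        "video_view_count REAL, video_like_count REAL, video_comment_count REAL)"
    )
    conn.executemany(
        "INSERT INTO VideoMetrics VALUES (?, ?, ?)",
        [(100.0, 10.0, 1.0), (200.0, None, 2.0), (300.0, 30.0, 3.0)],
    )
    sample_df = load_metrics(conn)

# The row with a NULL like count is filtered out by the WHERE clause.
print(sample_df)
```

Because the filtering happens in SQL, the DataFrame should contain no missing values by construction.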


In [5]:
tiktok_df.head()

Unnamed: 0,tiktok_view_count,tiktok_like_count,tiktok_comment_count
0,343296.0,19425.0,0.0
1,140877.0,77355.0,684.0
2,902185.0,97690.0,329.0
3,437506.0,239954.0,584.0
4,56167.0,34987.0,152.0


In [6]:
# total engagement: likes + comments
tiktok_df['tiktok_likes_and_comments'] = (
    tiktok_df['tiktok_like_count'] +
    tiktok_df['tiktok_comment_count']
)

In [7]:
tiktok_df.head()

Unnamed: 0,tiktok_view_count,tiktok_like_count,tiktok_comment_count,tiktok_likes_and_comments
0,343296.0,19425.0,0.0,19425.0
1,140877.0,77355.0,684.0,78039.0
2,902185.0,97690.0,329.0,98019.0
3,437506.0,239954.0,584.0,240538.0
4,56167.0,34987.0,152.0,35139.0


In [8]:
tiktok_df.describe()

Unnamed: 0,tiktok_view_count,tiktok_like_count,tiktok_comment_count,tiktok_likes_and_comments
count,76336.0,76336.0,76336.0,76336.0
mean,254708.558688,84304.63603,349.312146,84653.948176
std,322886.935826,133417.925044,799.623152,133968.673871
min,20.0,0.0,0.0,0.0
25%,4942.5,810.75,1.0,813.0
50%,9954.5,3403.5,9.0,3412.5
75%,504327.0,125020.0,292.0,125487.0
max,999817.0,657830.0,9599.0,659520.0


In [9]:
missing_values_count = tiktok_df.isnull().sum()
missing_values_count

tiktok_view_count            0
tiktok_like_count            0
tiktok_comment_count         0
tiktok_likes_and_comments    0
dtype: int64
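
The counts are displayed as floats (for example `343296.0`) even though they are whole numbers. Since the check above reports no missing values, they could be cast to integers before saving. A minimal sketch, assuming a small hand-made frame in place of `tiktok_df`; the helper name `to_int_counts` is hypothetical:

```python
import pandas as pd


def to_int_counts(df):
    """Return a copy with every column cast to int64, refusing lossy casts."""
    if df.isnull().any().any():
        raise ValueError("missing values present; cannot cast to int64")
    if not (df % 1 == 0).all().all():
        raise ValueError("fractional values present; cast would lose data")
    return df.astype("int64")


# Assumption: two rows copied from the head() output above, as a stand-in.
sample_df = pd.DataFrame({
    "tiktok_like_count": [19425.0, 77355.0],
    "tiktok_comment_count": [0.0, 684.0],
})
int_df = to_int_counts(sample_df)
print(int_df.dtypes)
```

Whether the cast is worthwhile depends on how the columns are used downstream; the float columns work equally well for the sums computed in this notebook.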

In [10]:
# saving data to pickle
tiktok_df.to_pickle("tiktok.pkl")
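
To confirm that a pickle written this way restores the same DataFrame, a round trip can be checked with `pd.read_pickle`. This is a minimal sketch, assuming a small sample frame and a temporary directory rather than the real `tiktok.pkl`:

```python
import os
import tempfile

import pandas as pd

# Assumption: a two-row stand-in for tiktok_df.
sample_df = pd.DataFrame({
    "tiktok_view_count": [343296.0, 140877.0],
    "tiktok_like_count": [19425.0, 77355.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "sample.pkl")
    sample_df.to_pickle(pkl_path)           # write
    restored_df = pd.read_pickle(pkl_path)  # read back

# True when values, dtypes, and labels all survived the round trip.
print(restored_df.equals(sample_df))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.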

### b) Gather Data from MongoDB

Gather the view, like, and comment counts from the `youtube_videos` collection in MongoDB.

In [11]:
import pymongo

# Create the client
client = pymongo.MongoClient('localhost', 27017)

# Connect to our database
db = client['local']
collection = db["youtube_videos"]

In [13]:
cursor = collection.find()

# print the view, like, and comment counts of each document
for document in cursor:
    view_count = document["view_count"]
    like_count = document["like_count"]
    comment_count = document["comment_count"]
    print(f"View Count: {view_count}, Like Count: {like_count}, Comment Count: {comment_count}")

View Count: 77057, Like Count: 909, Comment Count: 557
View Count: 188949, Like Count: 3734, Comment Count: 1930
View Count: 1922598, Like Count: 20215, Comment Count: 3737
View Count: 6013, Like Count: 94, Comment Count: 77
View Count: 190901, Like Count: 2969, Comment Count: 762
View Count: 210388, Like Count: 2149, Comment Count: 502
View Count: 241646, Like Count: 5616, Comment Count: 4049
View Count: 173568, Like Count: 2996, Comment Count: 622
View Count: 4386847, Like Count: 141446, Comment Count: 6709
View Count: 25148, Like Count: 440, Comment Count: 112
View Count: 1261920, Like Count: 37491, Comment Count: 5630
View Count: 4392617, Like Count: 110483, Comment Count: 5467
View Count: 293962, Like Count: 13823, Comment Count: 2065
View Count: 431068, Like Count: 6845, Comment Count: 1745
View Count: 2781748, Like Count: 60948, Comment Count: 14457
View Count: 18697, Like Count: 220, Comment Count: 138
View Count: 3651531, Like Count: 113700, Comment Count: 17293
View Count: 38

In [14]:
# loading into a dataframe
data = []
cursor = collection.find()
for document in cursor:
    data.append({
        "youtube_view_count": document["view_count"],
        "youtube_like_count": document["like_count"],
        "youtube_comment_count": document["comment_count"]
    })

youtube_df = pd.DataFrame(data)
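
The loop above raises a `KeyError` if any document lacks one of the three fields. The following is a minimal sketch of a more tolerant conversion, assuming plain dictionaries in place of a live cursor; the helper name `documents_to_frame`, the `FIELD_MAP` constant, and the sample documents are hypothetical:

```python
import pandas as pd

# Source field name -> DataFrame column name, matching the cell above.
FIELD_MAP = {
    "view_count": "youtube_view_count",
    "like_count": "youtube_like_count",
    "comment_count": "youtube_comment_count",
}


def documents_to_frame(documents):
    """Build a DataFrame from an iterable of dict-like documents.

    Missing fields become nulls instead of raising KeyError.
    """
    rows = [
        {column: doc.get(field) for field, column in FIELD_MAP.items()}
        for doc in documents
    ]
    return pd.DataFrame(rows, columns=list(FIELD_MAP.values()))


# Assumption: hand-written documents; the second one lacks comment_count.
sample_docs = [
    {"_id": 1, "view_count": 77057, "like_count": 909, "comment_count": 557},
    {"_id": 2, "view_count": 188949, "like_count": 3734},
]
sample_youtube_df = documents_to_frame(sample_docs)
print(sample_youtube_df)
```

With a live collection, the same helper could be fed `collection.find({}, {"_id": 0, "view_count": 1, "like_count": 1, "comment_count": 1})`, where the second argument is a projection that limits the fields MongoDB returns. That call is not exercised here because it needs a running server.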

In [15]:
youtube_df.head()

Unnamed: 0,youtube_view_count,youtube_like_count,youtube_comment_count
0,77057,909,557
1,188949,3734,1930
2,1922598,20215,3737
3,6013,94,77
4,190901,2969,762


In [16]:
youtube_df.dtypes

youtube_view_count       int64
youtube_like_count       int64
youtube_comment_count    int64
dtype: object

In [17]:

missing_values_count = youtube_df.isnull().sum()
missing_values_count

youtube_view_count       0
youtube_like_count       0
youtube_comment_count    0
dtype: int64

In [18]:
youtube_df.head()

Unnamed: 0,youtube_view_count,youtube_like_count,youtube_comment_count
0,77057,909,557
1,188949,3734,1930
2,1922598,20215,3737
3,6013,94,77
4,190901,2969,762


In [19]:
# calculate total likes and comments
youtube_df['youtube_likes_and_comments'] = (
    youtube_df['youtube_like_count'] +
    youtube_df['youtube_comment_count']
)

In [20]:
youtube_df.head()

Unnamed: 0,youtube_view_count,youtube_like_count,youtube_comment_count,youtube_likes_and_comments
0,77057,909,557,1466
1,188949,3734,1930,5664
2,1922598,20215,3737,23952
3,6013,94,77,171
4,190901,2969,762,3731


In [21]:
youtube_df.describe()

Unnamed: 0,youtube_view_count,youtube_like_count,youtube_comment_count,youtube_likes_and_comments
count,4450.0,4450.0,4450.0,4450.0
mean,16671990.0,180340.3,4210.897978,184551.2
std,141069300.0,861944.4,22026.580878,878492.6
min,1.0,0.0,0.0,0.0
25%,66167.75,1051.5,86.0,1283.0
50%,411961.0,6913.0,505.0,7984.0
75%,2709688.0,48722.0,2269.0,52170.0
max,6104343000.0,15626340.0,532252.0,16158590.0


In [22]:
youtube_df.to_pickle("youtube.pkl")

The data has been extracted from both databases, transformed by adding a total of likes and comments, loaded into DataFrames, and saved as `tiktok.pkl` and `youtube.pkl`.