# Gathering Data and Creating Databases

This notebook loads a Kaggle dataset into a SQLite database, then makes YouTube API calls and stores the resulting JSON in a local MongoDB database.

### a) Dataset from Kaggle

In [1]:
import os

folder_path = "data"

# Create the folder that holds the data (data/ is also listed in .gitignore)
if not os.path.exists(folder_path):
    os.makedirs(folder_path)
    print(f"Folder '{folder_path}' created successfully.")
else:
    print(f"Folder '{folder_path}' already exists.")

Folder 'data' already exists.


In [2]:
import os
from dotenv import load_dotenv
import pandas as pd

# Load environment variables from the .env file
load_dotenv()

# Credentials are read from .env instead of the kaggle.json file from Kaggle.
# load_dotenv() has already placed them in os.environ, so just check that both are set.
for name in ("KAGGLE_USERNAME", "KAGGLE_KEY"):
    if not os.getenv(name):
        raise RuntimeError(f"{name} is not set; add it to the .env file.")


In [3]:
from kaggle.api.kaggle_api_extended import KaggleApi

# Function to download Kaggle dataset
def download_kaggle_dataset(owner, dataset_name, download_path="data"):
    os.makedirs(download_path, exist_ok=True)
    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files(f"{owner}/{dataset_name}", path=download_path, unzip=True)


In [4]:
# downloading kaggle tiktok dataset
download_kaggle_dataset("yakhyojon", "tiktok")

Dataset URL: https://www.kaggle.com/datasets/yakhyojon/tiktok


### b) Dataset into a SQLite DB

ChatGPT assisted in creating the database. The schema could be normalized further with more tables, but normalization is not the focus of this coursework; the goal here is to practice interacting with a database.

In [19]:
import sqlite3

# Load the CSV downloaded from Kaggle
csv_path = 'data/tiktok_dataset.csv'
tiktok_data = pd.read_csv(csv_path)

In [20]:
# Create the database file inside data/ and open a connection to it
db_file_path = os.path.join(folder_path, "tiktok.db")
conn = sqlite3.connect(db_file_path)


In [21]:
cursor = conn.cursor()
# Create tables

# 1. Videos table
cursor.execute('''
CREATE TABLE IF NOT EXISTS Videos (
    video_id INTEGER PRIMARY KEY,
    video_duration_sec INTEGER,
    claim_status TEXT,
    verified_status TEXT,
    video_transcription_text TEXT
)
''')

# 2. Authors table
cursor.execute('''
CREATE TABLE IF NOT EXISTS Authors (
    author_id INTEGER PRIMARY KEY AUTOINCREMENT,
    author_ban_status TEXT
)
''')

# 3. VideoMetrics table
cursor.execute('''
CREATE TABLE IF NOT EXISTS VideoMetrics (
    metric_id INTEGER PRIMARY KEY AUTOINCREMENT,
    video_id INTEGER,
    video_view_count REAL,
    video_like_count REAL,
    video_share_count REAL,
    video_download_count REAL,
    video_comment_count REAL,
    FOREIGN KEY (video_id) REFERENCES Videos(video_id)
)
''')

<sqlite3.Cursor at 0x7d679570cd50>
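The `FOREIGN KEY` clause in `VideoMetrics` is only enforced when foreign key support is switched on for the connection, which SQLite typically leaves off by default. The sketch below is an assumption-laden illustration rather than part of the original notebook: it uses an in-memory database and trimmed-down versions of the two tables to show an insert being rejected once `PRAGMA foreign_keys = ON` is set.

```python
import sqlite3

# In-memory database so the sketch does not touch data/tiktok.db
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is set per connection

# Trimmed-down versions of the notebook's tables (illustrative only)
conn.execute("CREATE TABLE Videos (video_id INTEGER PRIMARY KEY)")
conn.execute("""
CREATE TABLE VideoMetrics (
    metric_id INTEGER PRIMARY KEY AUTOINCREMENT,
    video_id INTEGER,
    FOREIGN KEY (video_id) REFERENCES Videos(video_id)
)
""")

try:
    # No Videos row with video_id 999 exists, so this insert should be rejected
    conn.execute("INSERT INTO VideoMetrics (video_id) VALUES (?)", (999,))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
```

Without the `PRAGMA` line the same insert would normally succeed, leaving a metrics row that points at no video.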

In [22]:

# Insert data into these tables

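# Maps each distinct ban status to its Authors row id. The loop keys on ban status,
# so Authors receives one row per distinct status rather than one row per author.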
authors = {}

for _, row in tiktok_data.iterrows():
    # Insert into Videos table
    cursor.execute('''
        INSERT OR IGNORE INTO Videos (video_id, video_duration_sec, claim_status, verified_status, video_transcription_text)
        VALUES (?, ?, ?, ?, ?)
    ''', (row['video_id'], row['video_duration_sec'], row['claim_status'], row['verified_status'], row['video_transcription_text']))

    # Insert into Authors only the first time a ban status is seen
    author_ban_status = row['author_ban_status']
    if author_ban_status not in authors:
        cursor.execute('''
            INSERT INTO Authors (author_ban_status)
            VALUES (?)
        ''', (author_ban_status,))
        authors[author_ban_status] = cursor.lastrowid  # Store the author_id for reference
    
    # Insert into VideoMetrics table
    cursor.execute('''
        INSERT INTO VideoMetrics (video_id, video_view_count, video_like_count, video_share_count, video_download_count, video_comment_count)
        VALUES (?, ?, ?, ?, ?, ?)
    ''', (row['video_id'], row['video_view_count'], row['video_like_count'], row['video_share_count'], row['video_download_count'], row['video_comment_count']))


In [23]:

# Commit changes
conn.commit()

In [24]:

# Close the connection
conn.close()

In [25]:
# connecting and testing a query
conn = sqlite3.connect(db_file_path)
cursor = conn.cursor()

query = '''
SELECT v.video_id, v.video_duration_sec, v.claim_status, m.video_view_count, m.video_like_count
FROM Videos v
JOIN VideoMetrics m ON v.video_id = m.video_id
LIMIT 5
'''
result = cursor.execute(query).fetchall()
print(result)

conn.close()

[(7017666017, 59, 'claim', 343296.0, 19425.0), (4014381136, 32, 'claim', 140877.0, 77355.0), (9859838091, 31, 'claim', 902185.0, 97690.0), (1866847991, 25, 'claim', 437506.0, 239954.0), (7105231098, 19, 'claim', 56167.0, 34987.0)]
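The test query returns plain tuples, so each value has to be matched to its column by position. As a small sketch, setting `row_factory` to `sqlite3.Row` makes the columns accessible by name. The in-memory table and the single row below are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows now support access by column name

conn.execute("CREATE TABLE Videos (video_id INTEGER PRIMARY KEY, claim_status TEXT)")
conn.execute("INSERT INTO Videos VALUES (?, ?)", (1, "claim"))  # hypothetical row

row = conn.execute("SELECT video_id, claim_status FROM Videos").fetchone()
record = dict(row)  # convert to a regular dict when convenient
print(record["claim_status"])
conn.close()
```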


### c) API Calls to YouTube

Functions that call the YouTube Data API v3 to search for videos and retrieve their metrics.

In [39]:

from googleapiclient.discovery import build
import json

# Get the YouTube API key from the environment
YOUTUBE_API_KEY = os.getenv("YOUTUBE_API_KEY")

# Initialize the YouTube API client
youtube = build("youtube", "v3", developerKey=YOUTUBE_API_KEY)

In [40]:

def get_video_ids(query, max_results=50):
    # Retrieve video IDs based on the search query
    request = youtube.search().list(
        part="id",
        q=query,
        type="video",
        maxResults=max_results
    )
    response = request.execute()

    # Extract video IDs from the response
    video_ids = [item["id"]["videoId"] for item in response.get("items", [])]
    return video_ids
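`get_video_ids` reads a single page of search results. When more pages exist, the response carries a `nextPageToken` that can be sent back as `pageToken` on the next request. The sketch below shows one way to follow those tokens. `collect_video_ids` and `fetch_page` are hypothetical names, and `fetch_page` is passed in as a callable so the loop can be exercised without network access.

```python
def collect_video_ids(fetch_page, limit=120):
    """Follow nextPageToken until `limit` IDs are gathered or pages run out.

    `fetch_page(page_token)` is an assumed callable that returns one
    search response as a dict (items plus an optional nextPageToken).
    """
    video_ids = []
    page_token = None
    while len(video_ids) < limit:
        response = fetch_page(page_token)
        video_ids.extend(item["id"]["videoId"] for item in response.get("items", []))
        page_token = response.get("nextPageToken")
        if not page_token:
            break  # no further pages
    return video_ids[:limit]


# Stand-in for the API: two fake pages keyed by page token
fake_pages = {
    None: {"items": [{"id": {"videoId": "a"}}, {"id": {"videoId": "b"}}], "nextPageToken": "p2"},
    "p2": {"items": [{"id": {"videoId": "c"}}]},
}
ids = collect_video_ids(lambda token: fake_pages[token], limit=10)
print(ids)
```

With the real client, `fetch_page` could wrap a `youtube.search().list(...)` request that passes `pageToken=page_token` and calls `execute()`. Each extra page is another request against the API quota.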


In [42]:

def get_video_metrics(video_ids):
    # Retrieve video details including metrics
    request = youtube.videos().list(
        part="snippet,contentDetails,statistics",
        id=",".join(video_ids)
    )
    response = request.execute()

    # Collect relevant metrics in a structured format
    video_data = []
    for item in response.get("items", []):
        video_info = {
            "video_id": item["id"],
            "title": item["snippet"]["title"],
            "channel_title": item["snippet"]["channelTitle"],
            "published_at": item["snippet"]["publishedAt"],
            "view_count": int(item["statistics"].get("viewCount", 0)),
            "like_count": int(item["statistics"].get("likeCount", 0)),
            "comment_count": int(item["statistics"].get("commentCount", 0)),
            "duration": item["contentDetails"]["duration"]
        }
        video_data.append(video_info)

    return video_data
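The `duration` value is stored as returned by the API, an ISO 8601 duration string such as `PT1M21S`, which is awkward to compare with the TikTok table's `video_duration_sec`. A minimal sketch of converting it to seconds follows. `iso8601_duration_to_seconds` is a hypothetical helper, and it only handles the day, hour, minute, and second components.

```python
import re

# Matches durations such as PT50S, PT1M21S, PT9H49M30S, or P1DT2H
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def iso8601_duration_to_seconds(duration):
    """Convert an ISO 8601 duration string to a whole number of seconds."""
    match = _DURATION_RE.match(duration)
    if match is None:
        raise ValueError(f"Unsupported duration format: {duration!r}")
    # Components that are absent from the string count as zero
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


print(iso8601_duration_to_seconds("PT1M21S"))  # 81
```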


In [43]:

# Fetch video IDs based on search query
video_ids = get_video_ids(query="Harris", max_results=50)

# Split video IDs into batches of 50 to stay within API limits
batch_size = 50
all_video_data = []
for i in range(0, len(video_ids), batch_size):
    batch_ids = video_ids[i:i + batch_size]
    all_video_data.extend(get_video_metrics(batch_ids))


In [48]:

# Convert to a DataFrame for easy analysis
df_videos = pd.DataFrame(all_video_data)
df_videos.head()

| | video_id | title | channel_title | published_at | view_count | like_count | comment_count | duration |
|---|---|---|---|---|---|---|---|---|
| 0 | Xe7TuOlyUIM | Kamala supporters told to leave Harris HQ | Sky News Australia | 2024-11-06T06:00:14Z | 237002 | 3980 | 3062 | PT1M21S |
| 1 | 0QLZB6djGAA | LIVE election results, analysis | ABC 7 Chicago | 2024-11-06T11:20:05Z | 2244135 | 7182 | 267 | PT9H49M30S |
| 2 | zZVO0mQFg9s | Kamala Harris won't speak at watch party | CNBC Television | 2024-11-06T06:08:22Z | 112507 | 940 | 3013 | PT2M15S |
| 3 | uJE07Lpoom8 | Trump supporters celebrate as Harris backs out... | news.com.au | 2024-11-06T06:43:19Z | 316297 | 7229 | 4156 | PT50S |
| 4 | FMPSk8Fvt04 | Deafening silence at Democrat HQ as Kamala Har... | Sky News Australia | 2024-11-06T06:10:01Z | 379885 | 9243 | 5741 | PT6M5S |


### d) Data into MongoDB

Storing the JSON records in a local MongoDB database.

In [54]:
import pymongo

# Create the client
client = pymongo.MongoClient('localhost', 27017)

# Connect to our database
db = client['local']
video_collection = db["youtube_videos"] 

In [51]:
def store_data_in_mongodb(video_data):
    # Insert data into MongoDB collection
    if video_data:
        video_collection.insert_many(video_data)
        print(f"Inserted {len(video_data)} records into MongoDB with query metric.")
    else:
        print("No data to insert.")
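Different search terms can return the same video, and without a unique index `insert_many` will store it again. The stored records also carry no trace of the search term that produced them. The sketch below shows one possible pre-insert step that tags each record with its search term and drops repeated `video_id` values. `tag_and_dedupe` and the `search_query` field are hypothetical additions, not part of the original pipeline.

```python
def tag_and_dedupe(video_data, query, seen_ids=None):
    """Return new records tagged with `query`, skipping video IDs already seen.

    `seen_ids` is an optional set shared across calls so that repeats are
    also caught between different search terms.
    """
    if seen_ids is None:
        seen_ids = set()
    unique_records = []
    for record in video_data:
        if record["video_id"] in seen_ids:
            continue
        seen_ids.add(record["video_id"])
        # Copy so the caller's dictionaries are left unchanged
        unique_records.append({**record, "search_query": query})
    return unique_records


# Made-up records: "a" repeats within a query, "b" repeats across queries
seen = set()
first = tag_and_dedupe([{"video_id": "a"}, {"video_id": "b"}, {"video_id": "a"}], "Harris", seen)
second = tag_and_dedupe([{"video_id": "b"}, {"video_id": "c"}], "president", seen)
print(len(first), len(second))
```

The returned list could then be passed to `store_data_in_mongodb` in place of the raw `video_data`.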


In [67]:

# Example usage
query = "Harris"  # Define your query term here
video_ids = get_video_ids(query=query, max_results=50)
video_data = get_video_metrics(video_ids)
store_data_in_mongodb(video_data)

Inserted 50 records into MongoDB with query metric.


In [68]:
# Helper that searches, fetches metrics, and stores the results in one call


def quick_query_mongoDB(query):
    video_ids = get_video_ids(query=query, max_results=50)
    video_data = get_video_metrics(video_ids)
    store_data_in_mongodb(video_data)
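Calling the helper once per cell works but repeats the same line many times. As a sketch, the search terms could be kept in a list and processed in a loop. `run_queries` is a hypothetical wrapper, and the storing function is passed in as an argument so the example runs without the API or MongoDB.

```python
def run_queries(queries, fetch_and_store):
    """Call `fetch_and_store` for each query and return the queries that failed."""
    failed = []
    for query in queries:
        try:
            fetch_and_store(query)
        except Exception as exc:  # keep going if one query fails (e.g. quota or network)
            print(f"Query {query!r} failed: {exc}")
            failed.append(query)
    return failed


# Stand-in for quick_query_mongoDB that records calls and fails on one term
calls = []


def fake_fetch_and_store(query):
    if query == "bad":
        raise RuntimeError("simulated API error")
    calls.append(query)


failed = run_queries(["Trump", "bad", "Apple"], fake_fetch_and_store)
print(calls, failed)
```

In the notebook this would be called with `quick_query_mongoDB` as the second argument, and the returned list shows which terms need to be retried.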



In [69]:
quick_query_mongoDB("Trump")

Inserted 50 records into MongoDB with query metric.


In [71]:

quick_query_mongoDB("Apple")


Inserted 50 records into MongoDB with query metric.


In [73]:

quick_query_mongoDB("Samsung")

Inserted 50 records into MongoDB with query metric.


In [74]:

quick_query_mongoDB("elections")


Inserted 50 records into MongoDB with query metric.


In [75]:

quick_query_mongoDB("vote")


Inserted 50 records into MongoDB with query metric.


In [76]:
quick_query_mongoDB("Crime")


Inserted 50 records into MongoDB with query metric.


In [77]:

quick_query_mongoDB("Health Care")


Inserted 50 records into MongoDB with query metric.


In [78]:

quick_query_mongoDB("Abortion")


Inserted 50 records into MongoDB with query metric.


In [79]:

quick_query_mongoDB("Economy")


Inserted 50 records into MongoDB with query metric.


In [80]:

quick_query_mongoDB("Immigration")


Inserted 50 records into MongoDB with query metric.


In [81]:

quick_query_mongoDB("Unemployement")

Inserted 50 records into MongoDB with query metric.


In [83]:
quick_query_mongoDB("artificial intelligence")

Inserted 50 records into MongoDB with query metric.


In [84]:
quick_query_mongoDB("Travel")

Inserted 50 records into MongoDB with query metric.


In [85]:
quick_query_mongoDB("Tesla")

Inserted 50 records into MongoDB with query metric.


In [86]:
quick_query_mongoDB("Biden")

Inserted 50 records into MongoDB with query metric.


In [87]:
quick_query_mongoDB("data")

Inserted 50 records into MongoDB with query metric.


In [88]:
quick_query_mongoDB("Latvia")

Inserted 50 records into MongoDB with query metric.


In [89]:
quick_query_mongoDB("president")

Inserted 50 records into MongoDB with query metric.


In [90]:
quick_query_mongoDB("Holidays")

Inserted 50 records into MongoDB with query metric.


In total, 19,383 entries from the Kaggle dataset were loaded into the SQLite database.

1,000 entries were stored in the MongoDB database from 20 arbitrary YouTube API search queries (50 results each).