# Gathering Data and Creating Databases

This notebook gathers data from a Kaggle dataset into a SQLite database and makes API calls whose JSON results are stored in a local MongoDB database.

### a) Dataset from Kaggle

In [4]:
import os

folder_path = "data"

# Create a folder to hold the data (data/ is also listed in .gitignore)
if not os.path.exists(folder_path):
    os.makedirs(folder_path)
    print(f"Folder '{folder_path}' created successfully.")
else:
    print(f"Folder '{folder_path}' already exists.")

Folder 'data' already exists.


In [5]:
import os
from dotenv import load_dotenv
import pandas as pd

# Load environment variables from the .env file
load_dotenv()

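# load_dotenv() already copies KAGGLE_USERNAME and KAGGLE_KEY from .env into
# os.environ (a .env file is used instead of the kaggle.json file from Kaggle).
# Fail early with a clear message if either one is missing.
for var in ("KAGGLE_USERNAME", "KAGGLE_KEY"):
    if not os.getenv(var):
        raise RuntimeError(f"{var} is not set; add it to the .env file")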


In [6]:
from kaggle.api.kaggle_api_extended import KaggleApi

# Function to download Kaggle dataset
def download_kaggle_dataset(owner, dataset_name, download_path="data"):
    os.makedirs(download_path, exist_ok=True)
    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files(f"{owner}/{dataset_name}", path=download_path, unzip=True)


In [7]:
# downloading kaggle tiktok dataset
download_kaggle_dataset("yakhyojon", "tiktok")

Dataset URL: https://www.kaggle.com/datasets/yakhyojon/tiktok


### b) Dataset into a SQLite DB

ChatGPT assisted in creating the database. A fully normalized schema would need more tables, but normalization is not the focus of this coursework; the goal here is to practice interacting with a database.

In [8]:
import sqlite3

# loading csv
csv_path = 'data/tiktok_dataset.csv'
tiktok_data = pd.read_csv(csv_path)

In [9]:
# Create the database file in data/ and open a connection to it
db_file_path = os.path.join(folder_path, "tiktok.db")
conn = sqlite3.connect(db_file_path)


In [10]:
cursor = conn.cursor()
# Create tables

# 1. Videos table
cursor.execute('''
CREATE TABLE IF NOT EXISTS Videos (
    video_id INTEGER PRIMARY KEY,
    video_duration_sec INTEGER,
    claim_status TEXT,
    verified_status TEXT,
    video_transcription_text TEXT
)
''')

# 2. Authors table
cursor.execute('''
CREATE TABLE IF NOT EXISTS Authors (
    author_id INTEGER PRIMARY KEY AUTOINCREMENT,
    author_ban_status TEXT
)
''')

# 3. VideoMetrics table
cursor.execute('''
CREATE TABLE IF NOT EXISTS VideoMetrics (
    metric_id INTEGER PRIMARY KEY AUTOINCREMENT,
    video_id INTEGER,
    video_view_count REAL,
    video_like_count REAL,
    video_share_count REAL,
    video_download_count REAL,
    video_comment_count REAL,
    FOREIGN KEY (video_id) REFERENCES Videos(video_id)
)
''')

<sqlite3.Cursor at 0x7e561eaf2d50>

In [11]:

# Insert data into these tables

# Track each distinct author_ban_status value and the Authors row created for it
authors = {}

for _, row in tiktok_data.iterrows():
    # Insert into Videos table
    cursor.execute('''
        INSERT OR IGNORE INTO Videos (video_id, video_duration_sec, claim_status, verified_status, video_transcription_text)
        VALUES (?, ?, ?, ?, ?)
    ''', (row['video_id'], row['video_duration_sec'], row['claim_status'], row['verified_status'], row['video_transcription_text']))

    # Insert into Authors only the first time a ban status value is seen
    author_ban_status = row['author_ban_status']
    if author_ban_status not in authors:
        cursor.execute('''
            INSERT INTO Authors (author_ban_status)
            VALUES (?)
        ''', (author_ban_status,))
        authors[author_ban_status] = cursor.lastrowid  # Store the author_id for reference
    
    # Insert into VideoMetrics table
    cursor.execute('''
        INSERT INTO VideoMetrics (video_id, video_view_count, video_like_count, video_share_count, video_download_count, video_comment_count)
        VALUES (?, ?, ?, ?, ?, ?)
    ''', (row['video_id'], row['video_view_count'], row['video_like_count'], row['video_share_count'], row['video_download_count'], row['video_comment_count']))
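
The row-by-row loop above works, but the same inserts can also be batched. The following is a minimal sketch, not the notebook's method: it uses `sqlite3`'s `executemany` inside a transaction, with an in-memory database and hypothetical sample rows standing in for the DataFrame. With pandas, an equivalent iterable of tuples could probably be produced with `tiktok_data[cols].itertuples(index=False, name=None)`; the helper name `bulk_insert_videos` is an assumption for illustration.

```python
import sqlite3

# Hypothetical sample rows standing in for the DataFrame contents.
sample_rows = [
    (1001, 59, "claim", "not verified", "first transcript"),
    (1002, 32, "opinion", "verified", "second transcript"),
    (1001, 59, "claim", "not verified", "duplicate id, ignored"),
]


def bulk_insert_videos(conn, rows):
    """Insert many video rows in one transaction and return the table size."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            """
            INSERT OR IGNORE INTO Videos
                (video_id, video_duration_sec, claim_status,
                 verified_status, video_transcription_text)
            VALUES (?, ?, ?, ?, ?)
            """,
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM Videos").fetchone()[0]


# In-memory database so the sketch does not touch data/tiktok.db.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    """
    CREATE TABLE IF NOT EXISTS Videos (
        video_id INTEGER PRIMARY KEY,
        video_duration_sec INTEGER,
        claim_status TEXT,
        verified_status TEXT,
        video_transcription_text TEXT
    )
    """
)
inserted_total = bulk_insert_videos(demo_conn, sample_rows)
print(inserted_total)
```

Because `video_id` is the primary key and the statement uses `INSERT OR IGNORE`, the third sample row is skipped rather than raising an error.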


In [12]:

# Commit changes
conn.commit()

In [13]:

# Close the connection
conn.close()

In [14]:
# Reconnect and test a query
conn = sqlite3.connect(db_file_path)
cursor = conn.cursor()

query = '''
SELECT v.video_id, v.video_duration_sec, v.claim_status, m.video_view_count, m.video_like_count
FROM Videos v
JOIN VideoMetrics m ON v.video_id = m.video_id
LIMIT 5
'''
result = cursor.execute(query).fetchall()
print(result)

conn.close()

[(7017666017, 59, 'claim', 343296.0, 19425.0), (4014381136, 32, 'claim', 140877.0, 77355.0), (9859838091, 31, 'claim', 902185.0, 97690.0), (1866847991, 25, 'claim', 437506.0, 239954.0), (7105231098, 19, 'claim', 56167.0, 34987.0)]
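
Beyond a `LIMIT 5` spot check, row counts and a simple aggregate give a quick sanity check of the load. The sketch below is one possible way to do that, run against a small in-memory database with made-up rows; the helper names `table_counts` and `avg_views_by_claim` are hypothetical, and only the table and column names come from the schema above.

```python
import sqlite3

KNOWN_TABLES = ("Videos", "Authors", "VideoMetrics")


def table_counts(conn, tables=KNOWN_TABLES):
    """Return {table_name: row_count} for a fixed set of known tables."""
    counts = {}
    for name in tables:
        # Table names cannot be bound as parameters, so only names from the
        # fixed tuple above are interpolated into the statement.
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]
    return counts


def avg_views_by_claim(conn):
    """Average view count per claim_status, joining Videos to VideoMetrics."""
    query = """
        SELECT v.claim_status, AVG(m.video_view_count)
        FROM Videos v
        JOIN VideoMetrics m ON v.video_id = m.video_id
        GROUP BY v.claim_status
        ORDER BY v.claim_status
    """
    return conn.execute(query).fetchall()


# Made-up demo data in an in-memory database (not the real TikTok data).
demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE Videos (video_id INTEGER PRIMARY KEY, claim_status TEXT);
    CREATE TABLE Authors (author_id INTEGER PRIMARY KEY, author_ban_status TEXT);
    CREATE TABLE VideoMetrics (
        metric_id INTEGER PRIMARY KEY AUTOINCREMENT,
        video_id INTEGER,
        video_view_count REAL
    );
    INSERT INTO Videos VALUES (1, 'claim'), (2, 'claim'), (3, 'opinion');
    INSERT INTO VideoMetrics (video_id, video_view_count)
        VALUES (1, 100.0), (2, 300.0), (3, 50.0);
    """
)
counts = table_counts(demo)
averages = avg_views_by_claim(demo)
print(counts)
print(averages)
```

Pointing the same helpers at a connection to `data/tiktok.db` should report the counts for the real tables.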


### c) API Calls to YouTube

The functions below call the YouTube API to retrieve metrics for YouTube videos.

In [15]:

from googleapiclient.discovery import build
import json

# Get the YouTube API key from the environment
YOUTUBE_API_KEY = os.getenv("YOUTUBE_API_KEY")

# Initialize the YouTube API client
youtube = build("youtube", "v3", developerKey=YOUTUBE_API_KEY)

In [16]:

def get_video_ids(query, max_results=50):
    # Retrieve video IDs based on the search query
    request = youtube.search().list(
        part="id",
        q=query,
        type="video",
        maxResults=max_results
    )
    response = request.execute()

    # Extract video IDs from the response
    video_ids = [item["id"]["videoId"] for item in response.get("items", [])]
    return video_ids
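
`get_video_ids` returns at most one page of results. If more than one page per query were wanted, the search response's `nextPageToken` could be fed back as `pageToken`. The following is a sketch under that assumption: `get_video_ids_paged` is a hypothetical helper that takes the client as an argument, and the `FakeYouTube` classes are stand-ins so the example runs without an API key. Each extra page is another search call that counts against the API quota.

```python
def get_video_ids_paged(youtube_client, query, total_results=120, page_size=50):
    """Collect up to total_results video IDs by following nextPageToken."""
    video_ids = []
    page_token = None
    while len(video_ids) < total_results:
        remaining = total_results - len(video_ids)
        params = {
            "part": "id",
            "q": query,
            "type": "video",
            "maxResults": min(page_size, remaining),
        }
        if page_token:
            params["pageToken"] = page_token
        response = youtube_client.search().list(**params).execute()
        video_ids.extend(item["id"]["videoId"] for item in response.get("items", []))
        page_token = response.get("nextPageToken")
        if not page_token:
            break  # no further pages available
    return video_ids


# --- Stand-in client for demonstration only (not the real API) ---
class _FakeRequest:
    def __init__(self, payload):
        self._payload = payload

    def execute(self):
        return self._payload


class _FakeSearch:
    def __init__(self, ids):
        self._ids = ids

    def list(self, **kwargs):
        # The fake uses the start offset as its page token.
        start = int(kwargs.get("pageToken") or 0)
        end = start + kwargs["maxResults"]
        payload = {"items": [{"id": {"videoId": v}} for v in self._ids[start:end]]}
        if end < len(self._ids):
            payload["nextPageToken"] = str(end)
        return _FakeRequest(payload)


class FakeYouTube:
    def __init__(self, ids):
        self._search = _FakeSearch(ids)

    def search(self):
        return self._search


fake_client = FakeYouTube([f"vid{i}" for i in range(130)])
paged_ids = get_video_ids_paged(fake_client, "demo", total_results=120)
all_ids = get_video_ids_paged(fake_client, "demo", total_results=200)
print(len(paged_ids), len(all_ids))
```

With the real client, the call would look like `get_video_ids_paged(youtube, "Harris", total_results=100)`.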


In [17]:

def get_video_metrics(video_ids):
    # Retrieve video details including metrics
    request = youtube.videos().list(
        part="snippet,contentDetails,statistics",
        id=",".join(video_ids)
    )
    response = request.execute()

    # Collect relevant metrics in a structured format
    video_data = []
    for item in response.get("items", []):
        video_info = {
            "video_id": item["id"],
            "title": item["snippet"]["title"],
            "channel_title": item["snippet"]["channelTitle"],
            "published_at": item["snippet"]["publishedAt"],
            "view_count": int(item["statistics"].get("viewCount", 0)),
            "like_count": int(item["statistics"].get("likeCount", 0)),
            "comment_count": int(item["statistics"].get("commentCount", 0)),
            "duration": item["contentDetails"]["duration"]
        }
        video_data.append(video_info)

    return video_data
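
The `duration` field is kept as the ISO 8601 string the API returns (for example `PT7M51S`), which is awkward to compare with a column such as `video_duration_sec`. A possible conversion to seconds is sketched below; `duration_to_seconds` is a hypothetical helper, and the pattern only covers day, hour, minute, and second components, which is an assumption about the values that appear.

```python
import re

# Matches durations such as PT7M51S, PT1H2M3S, or P1DT2H.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(duration):
    """Convert an ISO 8601 duration string to a whole number of seconds."""
    match = _DURATION_RE.match(duration)
    if not match:
        raise ValueError(f"Unrecognized duration: {duration!r}")
    # Components that are absent from the string count as zero.
    parts = {key: int(value) if value else 0 for key, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


print(duration_to_seconds("PT7M51S"))   # 7 * 60 + 51 = 471
print(duration_to_seconds("PT55M54S"))  # 55 * 60 + 54 = 3354
```

Applied to a DataFrame of results, this could be used as `df_videos["duration"].map(duration_to_seconds)`.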


In [23]:

# Fetch video IDs based on search query
video_ids = get_video_ids(query="Harris", max_results=50)

# Split video IDs into batches of 50 to stay within API limits
batch_size = 50
all_video_data = []
for i in range(0, len(video_ids), batch_size):
    batch_ids = video_ids[i:i + batch_size]
    all_video_data.extend(get_video_metrics(batch_ids))


In [24]:

# Convert to a DataFrame for easy analysis
df_videos = pd.DataFrame(all_video_data)
df_videos.head()

| | video_id | title | channel_title | published_at | view_count | like_count | comment_count | duration |
|---|---|---|---|---|---|---|---|---|
| 0 | fBlnmptY3dA | Harris campaign is reportedly $20M in debt \| L... | LiveNOW from FOX | 2024-11-16T15:03:59Z | 77057 | 909 | 558 | PT7M51S |
| 1 | 4WXark4iLwA | Joe Rogan reveals what Kamala Harris didn't wa... | Fox News | 2024-11-14T17:00:03Z | 1922598 | 20216 | 3737 | PT5M33S |
| 2 | XF85fnxkpMk | Harris Campaign $20 MILLION IN DEBT; Staffers,... | The Hill | 2024-11-14T16:24:31Z | 188979 | 3734 | 1930 | PT9M26S |
| 3 | hUXLM4_OSgw | Joe Biden was 'unhappy' being 'shoved aside' f... | Sky News Australia | 2024-11-17T00:23:37Z | 180429 | 3685 | 686 | PT5M19S |
| 4 | xbbhvDPgWYw | Harris Campaign Re-directs Money to Recounting... | Firstpost | 2024-11-16T01:57:07Z | 210428 | 2149 | 502 | PT55M54S |


### d) Data into MongoDB

The JSON records are stored in a local MongoDB database.

In [20]:
import pymongo

# Create the client
client = pymongo.MongoClient('localhost', 27017)

# Connect to our database
db = client['local']
video_collection = db["youtube_videos"] 

In [21]:
def store_data_in_mongodb(video_data):
    # Insert data into MongoDB collection
    if video_data:
        video_collection.insert_many(video_data)
        print(f"Inserted {len(video_data)} records into MongoDB with query metric.")
    else:
        print("No data to insert.")
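
`insert_many` adds a new document every time, so a video returned by two searches (or a query that is run twice, as "Biden" is in this notebook) ends up stored more than once. One way to avoid that is to upsert on `video_id`. The sketch below assumes that approach: `upsert_videos` is a hypothetical helper built on pymongo's `update_one(filter, update, upsert=True)`, and `FakeCollection` is a stand-in that mimics just enough of a collection for the example to run without a MongoDB server.

```python
def upsert_videos(collection, video_data):
    """Insert new videos or refresh existing ones; return how many were new."""
    new_count = 0
    for doc in video_data:
        result = collection.update_one(
            {"video_id": doc["video_id"]},  # match on the YouTube video ID
            {"$set": doc},                  # overwrite metrics with the latest values
            upsert=True,                    # insert when no document matches
        )
        if result.upserted_id is not None:
            new_count += 1
    return new_count


# --- Stand-in collection for demonstration only (not pymongo) ---
class _FakeResult:
    def __init__(self, upserted_id):
        self.upserted_id = upserted_id


class FakeCollection:
    def __init__(self):
        self.docs = {}

    def update_one(self, filter_doc, update_doc, upsert=False):
        key = filter_doc["video_id"]
        is_new = key not in self.docs
        if is_new and not upsert:
            return _FakeResult(None)
        self.docs.setdefault(key, {}).update(update_doc["$set"])
        return _FakeResult(key if is_new else None)


fake_collection = FakeCollection()
sample_videos = [
    {"video_id": "a", "view_count": 1},
    {"video_id": "b", "view_count": 2},
    {"video_id": "a", "view_count": 5},  # same video seen again with newer metrics
]
new_videos = upsert_videos(fake_collection, sample_videos)
print(new_videos, len(fake_collection.docs))
```

With the real collection, the call would be `upsert_videos(video_collection, video_data)`. A unique index created with `video_collection.create_index("video_id", unique=True)` would additionally make MongoDB reject duplicates inserted by other code paths.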


In [22]:

# Example usage
query = "Harris"  # Define your query term here
video_ids = get_video_ids(query=query, max_results=50)
video_data = get_video_metrics(video_ids)
store_data_in_mongodb(video_data)

Inserted 50 records into MongoDB with query metric.


In [25]:
# Helper that searches, fetches metrics, and stores the results in one call


def quick_query_mongoDB(query):
    video_ids = get_video_ids(query=query, max_results=50)
    video_data = get_video_metrics(video_ids)
    store_data_in_mongodb(video_data)



In [26]:
quick_query_mongoDB("Trump")

Inserted 50 records into MongoDB with query metric.


In [27]:

quick_query_mongoDB("Apple")


Inserted 50 records into MongoDB with query metric.


In [28]:

quick_query_mongoDB("Samsung")

Inserted 50 records into MongoDB with query metric.


In [29]:

quick_query_mongoDB("elections")


Inserted 50 records into MongoDB with query metric.


In [30]:

quick_query_mongoDB("vote")


Inserted 50 records into MongoDB with query metric.


In [31]:
quick_query_mongoDB("Crime")


Inserted 50 records into MongoDB with query metric.


In [32]:

quick_query_mongoDB("Health Care")


Inserted 50 records into MongoDB with query metric.


In [33]:

quick_query_mongoDB("Abortion")


Inserted 50 records into MongoDB with query metric.


In [34]:

quick_query_mongoDB("Economy")


Inserted 50 records into MongoDB with query metric.


In [35]:

quick_query_mongoDB("Immigration")


Inserted 50 records into MongoDB with query metric.


In [36]:

quick_query_mongoDB("Unemployement")

Inserted 50 records into MongoDB with query metric.


In [37]:
quick_query_mongoDB("artificial intelligence")

Inserted 50 records into MongoDB with query metric.


In [38]:
quick_query_mongoDB("Travel")

Inserted 50 records into MongoDB with query metric.


In [39]:
quick_query_mongoDB("Tesla")

Inserted 50 records into MongoDB with query metric.


In [40]:
quick_query_mongoDB("Biden")

Inserted 50 records into MongoDB with query metric.


In [41]:
quick_query_mongoDB("data")

Inserted 50 records into MongoDB with query metric.


In [42]:
quick_query_mongoDB("Latvia")

Inserted 50 records into MongoDB with query metric.


In [43]:
quick_query_mongoDB("president")

Inserted 50 records into MongoDB with query metric.


In [44]:
quick_query_mongoDB("Holidays")

Inserted 50 records into MongoDB with query metric.


In [45]:
quick_query_mongoDB("Thanksgiving")
quick_query_mongoDB("Halloween")
quick_query_mongoDB("Weekend")
quick_query_mongoDB("Food")
quick_query_mongoDB("Texas")
quick_query_mongoDB("vice-president")
quick_query_mongoDB("CEO")
quick_query_mongoDB("new")
quick_query_mongoDB("year")
quick_query_mongoDB("Biden")

Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.


In [46]:
quick_query_mongoDB("football")
quick_query_mongoDB("soccer")
quick_query_mongoDB("basketball")
quick_query_mongoDB("hockey")
quick_query_mongoDB("volleyball")
quick_query_mongoDB("tennis")
quick_query_mongoDB("sports")
quick_query_mongoDB("baseball")
quick_query_mongoDB("olympics")
quick_query_mongoDB("cycling")

Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.


In [48]:
states = ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"]

for state in states:
    quick_query_mongoDB(state)


Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
Inserted 50 records into MongoDB with query metric.
...

In total, 19,383 entries from the Kaggle dataset were loaded into the SQLite database.

4,450 entries were stored in the MongoDB database from 89 assorted YouTube API queries (50 results each).