<a href="https://colab.research.google.com/github/Rnov24/civic_sentiment/blob/master/notebooks/01-data-collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Collection Notebook

Hey there! 👋 This notebook collects YouTube comments and splits them into training and inference datasets. I'll be scraping comments from videos that are relevant to the current political situation in Indonesia.

## 1. Environment Setup

Since I'm using Colab to run this notebook, I need to set up the environment first. I'll clone my GitHub repo and install the required dependencies.

In [1]:
import os
from pathlib import Path
import toml
import sys
import subprocess

# --- 1. Clone your GitHub repository ---
repo_url = "https://github.com/rnov24/civic_sentiment.git"
repo_name = "civic_sentiment"  # Should match the PROJ_ROOT in config.py
clone_path = Path("/content") / repo_name

if not clone_path.exists():
    subprocess.run(["git", "clone", repo_url, str(clone_path)], check=True)
else:
    print("Repository already cloned.")

# --- 2. Configure Git for Colab (IMPORTANT!) ---
# This is needed to be able to commit in Colab
subprocess.run(["git", "config", "--global", "user.name", "rnov24"], check=True)
subprocess.run(["git", "config", "--global", "user.email", "rnov24@users.noreply.github.com"], check=True)

# --- 3. Install project dependencies ---
sys.path.append(str(clone_path))
pyproject_path = clone_path / "pyproject.toml"

with open(pyproject_path, "r") as f:
    pyproject = toml.load(f)

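dependencies = pyproject["project"]["dependencies"]
# Use the running interpreter's pip so packages land in the notebook's environment
subprocess.run([sys.executable, "-m", "pip", "install", *dependencies], check=True)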

!cd /content/civic_sentiment && git pull

# --- 4. Now you can import your project modules ---
# The PROJ_ROOT in config.py is now correctly set to /content/civic_sentiment
from civic_sentiment.config import RAW_DATA_DIR

print("\nEnvironment setup complete!")
print(f"Raw data directory: {RAW_DATA_DIR}")
print("✅ Git is now configured for commits!")

Already up to date.


2025-09-07 03:13:20.439 | INFO     | civic_sentiment.config:<module>:22 - PROJ_ROOT path is: /content/civic_sentiment

Environment setup complete!
Raw data directory: /content/civic_sentiment/data/raw
✅ Git is now configured for commits!


## 2. Scrape YouTube Comments

I want to scrape comments from 4 videos. I chose them because they're directly relevant to the current political situation:

1. **Pernyataan Presiden Prabowo, 29 Agustus 2025** (LJ8yd0uRvwY) - This is President Prabowo's first official statement after the demonstrations, uploaded the same night after the incident where Affan Kurniawan was hit by a police tactical vehicle.

2. **LIVE: Keterangan Pers Presiden Prabowo, Istana Merdeka, 31 Agustus 2025** (oOf1b1P6fGc) - President Prabowo's official statement with the Speaker of Parliament, 5th President Megawati, Speaker of the People's Consultative Assembly, and representatives from majority political parties.

3. **BREAKING NEWS - PIMPINAN DPR RI MENERIMA PERWAKILAN MAHASISWA** (3Lz8PnFvjhs) - This video covers the DPR RI leadership receiving student representatives for an audience at the DPR RI building, where the students voiced their aspirations. These aspirations became the foundation for the 17+8 demands formulated by Andovi et al.

4. **BREAKING NEWS - KONFERENSI PERS DPR RI MENJAWAB TUNTUTAN 17+8** (I9peHTC9g3o) - DPR RI's press conference answering the people's revolutionary "17+8" demands.

*Note: 2 of these videos are live streams, but I'm not scraping the live chat because it's too massive and I can't handle it by myself 😅*

In [2]:
from civic_sentiment.scraping import scrape_videos
import os
from google.colab import userdata

# Get the API key from the colab secrets
API_KEY = userdata.get('YOUTUBE_API_KEY')

video_ids = [
    "LJ8yd0uRvwY",
    "oOf1b1P6fGc",
    "I9peHTC9g3o",
    "3Lz8PnFvjhs"
]

if API_KEY:
    print(f"Scraping comments from {len(video_ids)} videos...")
    comments_df = scrape_videos(API_KEY, video_ids)
    print(f"\n✅ Found {len(comments_df)} comments from {comments_df['video_id'].nunique()} videos.")

    # Display video titles that were scraped
    if not comments_df.empty:
        print("\n📺 Videos processed:")
        for video_id, title in comments_df[['video_id', 'video_title']].drop_duplicates().values:
            print(f"  • {video_id}: {title}")
else:
    print("❌ YOUTUBE_API_KEY is missing or empty in Colab secrets.")
    print("Please provide your YouTube Data API key either:")
    print("1. As a Colab secret named YOUTUBE_API_KEY (with notebook access enabled)")
    print("2. Or by modifying the API_KEY variable in this cell")

Scraping comments from 4 videos...
2025-09-07 03:13:22.581 | INFO     | civic_sentiment.scraping:scrape_videos:76 - Scraping comments from: Pernyataan Presiden Prabowo, 29 Agustus 2025
2025-09-07 03:13:46.868 | INFO     | civic_sentiment.scraping:scrape_videos:120 - Collected 11209 comments from Pernyataan Presiden Prabowo, 29 Agustus 2025
2025-09-07 03:13:46.960 | INFO     | civic_sentiment.scraping:scrape_videos:76 - Scraping comments from: LIVE: Keterangan Pers Presiden Prabowo, Istana Merdeka, 31 Agustus 2025
2025-09-07 03:14:07.290 | INFO     | civic_sentiment.scraping:scrape_videos:120 - Collected 9909 comments from LIVE: Keterangan Pers Presiden Prabowo, Istana Merdeka, 31 Agustus 2025
2025-09-07 03:14:07.386 | INFO     | civic_sentiment.scraping:scrape_videos:76 - Scraping comments from: BREAKING NEWS - KONFERENSI PERS DPR RI MENJAWAB TUNTUTAN 17+8
2025-09-07 03:14:09.882 | INFO     | civic_sentiment.scraping:scrape_videos:120 - Collected 1147 comments from BREAKING NEWS - KONFERENSI PERS DPR RI MENJAWAB TUNTUTAN 17+8
2025-09-07 03:14:09.981 | INFO     | civic_sentiment.scraping:scrape_videos:76 - Scraping comments from: BREAKING NEWS - PIMPINAN DPR RI MENERIMA PERWAKILAN MAHASISWA
2025-09-07 03:14:10.695 | INFO     | civic_sentiment.scraping:scrape_videos:120 - Collected 306 comments from BREAKING NEWS - PIMPINAN DPR RI MENERIMA PERWAKILAN MAHASISWA

✅ Found 22571 comments from 4 videos.

📺 Videos processed:
  • LJ8yd0uRvwY: Pernyataan Presiden Prabowo, 29 Agustus 2025
  • oOf1b1P6fGc: LIVE: Keterangan Pers Presiden Prabowo, Istana Merdeka, 31 Agustus 2025
  • I9peHTC9g3o: BREAKING NEWS - KONFERENSI PERS DPR RI MENJAWAB TUNTUTAN 17+8
  • 3Lz8PnFvjhs: BREAKING NEWS - PIMPINAN DPR RI MENERIMA PERWAKILAN MAHASISWA
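
For readers curious how a helper like `scrape_videos` can gather thousands of comments per video: the YouTube Data API returns comment threads one page at a time, and each response carries an `items` list plus a `nextPageToken` when more pages remain. The sketch below is not the project's implementation. `collect_comments`, `fetch_page`, and the fake pages are hypothetical names that only illustrate the pagination loop, so it runs without network access or an API key.

```python
from typing import Callable, Optional


def collect_comments(fetch_page: Callable[[Optional[str]], dict]) -> list[dict]:
    """Follow nextPageToken until the API reports no further pages."""
    comments = []
    page_token = None
    while True:
        response = fetch_page(page_token)
        for item in response.get("items", []):
            snippet = item["snippet"]["topLevelComment"]["snippet"]
            comments.append({
                "author": snippet["authorDisplayName"],
                "published_at": snippet["publishedAt"],
                "text": snippet["textDisplay"],
            })
        page_token = response.get("nextPageToken")
        if not page_token:
            break
    return comments


def _item(author: str, text: str) -> dict:
    """Build a fake comment thread shaped like an API response item."""
    return {"snippet": {"topLevelComment": {"snippet": {
        "authorDisplayName": author,
        "publishedAt": "2025-09-06T04:21:29Z",
        "textDisplay": text,
    }}}}


# Two fake pages standing in for real API responses (hypothetical data)
FAKE_PAGES = {
    None: {"items": [_item("@user_a", "first"), _item("@user_b", "second")],
           "nextPageToken": "page-2"},
    "page-2": {"items": [_item("@user_c", "third")]},
}


def fake_fetch(token: Optional[str]) -> dict:
    return FAKE_PAGES[token]


demo = collect_comments(fake_fetch)
print(len(demo), [c["author"] for c in demo])
```

With the real client library, `fetch_page` would presumably wrap a `commentThreads().list(...)` request for one video ID and pass the token as `pageToken`.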




### Scraping Results

Alright, now let me check the scraping results. I'll see how many comments I successfully collected from each video.


In [3]:
# Display the first 5 rows of the DataFrame with the new video_title column
if not comments_df.empty:
    print("📊 Sample of scraped comments:")
    print(f"Columns: {list(comments_df.columns)}")
    print(f"Shape: {comments_df.shape}")
    print("\nFirst 5 comments:")
    display(comments_df.head())

    # Show some basic statistics
    print(f"\n📈 Summary:")
    print(f"Total comments: {len(comments_df)}")
    print(f"Unique videos: {comments_df['video_id'].nunique()}")
    print(f"Unique authors: {comments_df['author'].nunique()}")

    # Show comments per video
    print(f"\n📺 Comments per video:")
    video_stats = comments_df.groupby(['video_id', 'video_title']).size().reset_index(name='comment_count')
    for _, row in video_stats.iterrows():
        title_preview = row['video_title'][:50] + "..." if len(row['video_title']) > 50 else row['video_title']
        print(f"  {row['video_id']}: {row['comment_count']} comments")
        print(f"    Title: {title_preview}")
else:
    print("No comments were scraped.")

📊 Sample of scraped comments:
Columns: ['author', 'published_at', 'text', 'video_id', 'video_title']
Shape: (22571, 5)

First 5 comments:


| | author | published_at | text | video_id | video_title |
|---|---|---|---|---|---|
| 0 | @SHOFAHIDAYATULILMIYAH | 2025-09-06T04:21:29Z | Gimana mau percaya pak udah bbuuuanyak korban ... | LJ8yd0uRvwY | Pernyataan Presiden Prabowo, 29 Agustus 2025 |
| 1 | @AkbarPutra-m2z | 2025-09-05T22:29:54Z | Assalamualaikum Pak Presiden RI, Pak Prabowo S... | LJ8yd0uRvwY | Pernyataan Presiden Prabowo, 29 Agustus 2025 |
| 2 | @REZANGAPAK | 2025-09-05T14:41:42Z | AKU YAKIN SEYAKIN YAKINNYA ! SEMUA INI ADA DAL... | LJ8yd0uRvwY | Pernyataan Presiden Prabowo, 29 Agustus 2025 |
| 3 | @riskaindriyani8418 | 2025-09-05T12:46:09Z | Jangan turut berduka cita doang dengeken suara... | LJ8yd0uRvwY | Pernyataan Presiden Prabowo, 29 Agustus 2025 |
| 4 | @ErikMuhdani8 | 2025-09-04T16:54:55Z | Coba pak presiden minta pak mahpud md ditugask... | LJ8yd0uRvwY | Pernyataan Presiden Prabowo, 29 Agustus 2025 |



📈 Summary:
Total comments: 22571
Unique videos: 4
Unique authors: 19202

📺 Comments per video:
  3Lz8PnFvjhs: 306 comments
    Title: BREAKING NEWS - PIMPINAN DPR RI MENERIMA PERWAKILA...
  I9peHTC9g3o: 1147 comments
    Title: BREAKING NEWS - KONFERENSI PERS DPR RI MENJAWAB TU...
  LJ8yd0uRvwY: 11209 comments
    Title: Pernyataan Presiden Prabowo, 29 Agustus 2025
  oOf1b1P6fGc: 9909 comments
    Title: LIVE: Keterangan Pers Presiden Prabowo, Istana Mer...


In [5]:
# Save the DataFrame to a CSV file
if not comments_df.empty:
    output_path = RAW_DATA_DIR / "comments.csv"
    comments_df.to_csv(output_path, index=False)
    print(f"💾 Comments saved to {output_path}")
    print(f"📁 File size: {output_path.stat().st_size / 1024:.1f} KB")
else:
    print("⚠️ No data to save - comments DataFrame is empty")

💾 Comments saved to /content/civic_sentiment/data/raw/comments.csv
📁 File size: 5804.4 KB


In [6]:
print("Checking for missing values:")
print(comments_df.isnull().sum())

Checking for missing values:
author          0
published_at    0
text            0
video_id        0
video_title     0
dtype: int64


## 3. Data Cleaning

Now I'll clean up the data I've collected. First, I'll check and remove duplicate comments, then I'll clean up the comment text.


In [7]:
print("Checking for duplicate comments:")
duplicate_rows = comments_df.duplicated()
print(f"Number of duplicate rows found: {duplicate_rows.sum()}")

comments_df.drop_duplicates(inplace=True)

print("\nDataFrame shape after removing duplicates:")
print(comments_df.shape)

Checking for duplicate comments:
Number of duplicate rows found: 7

DataFrame shape after removing duplicates:
(22564, 5)
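
A note on what counts as a duplicate here: by default, pandas compares every column and keeps the first occurrence, so a row is dropped only when author, timestamp, text, and video all match an earlier row. The plain-Python sketch below mirrors that rule on a few made-up rows. `drop_exact_duplicates` and the sample data are hypothetical, for illustration only.

```python
def drop_exact_duplicates(rows: list[dict]) -> list[dict]:
    """Keep the first occurrence of each row that matches on every field."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # all columns take part in the comparison
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows: the second is an exact copy, the third shares only the text
rows = [
    {"author": "@user_a", "text": "setuju", "video_id": "vid1"},
    {"author": "@user_a", "text": "setuju", "video_id": "vid1"},
    {"author": "@user_b", "text": "setuju", "video_id": "vid1"},
]

unique_rows = drop_exact_duplicates(rows)
print(len(rows), "->", len(unique_rows))
```

The same short comment posted by different authors therefore survives. Dropping those as well would need a narrower comparison, for example on the text column alone.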


In [8]:
import re
import string

def clean_text(text):
    """
    Cleans the input text by converting to lowercase, removing URLs,
    punctuation, numbers, and extra whitespace.
    """
    text = text.lower() # Convert to lowercase
    text = re.sub(r'http\S+', '', text) # Remove URLs
    text = text.translate(str.maketrans('', '', string.punctuation)) # Remove punctuation
    text = re.sub(r'\d+', '', text) # Remove numbers
    text = re.sub(r'\s+', ' ', text).strip() # Remove extra whitespace
    return text

comments_df['cleaned_text'] = comments_df['text'].apply(clean_text)

print("DataFrame with cleaned_text column:")
display(comments_df[['text', 'cleaned_text']].head())

DataFrame with cleaned_text column:


| | text | cleaned_text |
|---|---|---|
| 0 | Gimana mau percaya pak udah bbuuuanyak korban ... | gimana mau percaya pak udah bbuuuanyak korban ... |
| 1 | Assalamualaikum Pak Presiden RI, Pak Prabowo S... | assalamualaikum pak presiden ri pak prabowo su... |
| 2 | AKU YAKIN SEYAKIN YAKINNYA ! SEMUA INI ADA DAL... | aku yakin seyakin yakinnya semua ini ada dalan... |
| 3 | Jangan turut berduka cita doang dengeken suara... | jangan turut berduka cita doang dengeken suara... |
| 4 | Coba pak presiden minta pak mahpud md ditugask... | coba pak presiden minta pak mahpud md ditugask... |
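
To make the effect of each cleaning step concrete, here is the same `clean_text` logic applied to a few made-up strings (the samples are hypothetical, not taken from the dataset). Two side effects are worth knowing: `string.punctuation` covers ASCII punctuation only, so emojis and curly quotes pass through, and a token such as "17+8" disappears entirely because the "+" and then the digits are stripped.

```python
import re
import string


def clean_text(text: str) -> str:
    """Same steps as the notebook cell above, repeated so this block runs alone."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)  # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # ASCII punctuation only
    text = re.sub(r"\d+", "", text)  # remove digits
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


# Hypothetical sample comments
samples = [
    "Cek https://example.com SEKARANG!!! 123 😅",
    "Tuntutan 17+8 harus dijawab",
    "“Halo” pak",
]

results = [clean_text(s) for s in samples]
for before, after in zip(samples, results):
    print(repr(before), "->", repr(after))
```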


In [9]:
print("Current data types:")
print(comments_df.info())

Current data types:
<class 'pandas.core.frame.DataFrame'>
Index: 22564 entries, 0 to 22570
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   author        22564 non-null  object
 1   published_at  22564 non-null  object
 2   text          22564 non-null  object
 3   video_id      22564 non-null  object
 4   video_title   22564 non-null  object
 5   cleaned_text  22564 non-null  object
dtypes: object(6)
memory usage: 1.2+ MB
None


In [10]:
import pandas as pd

comments_df['published_at'] = pd.to_datetime(comments_df['published_at'])

print("\nData types after converting 'published_at':")
print(comments_df.info())


Data types after converting 'published_at':
<class 'pandas.core.frame.DataFrame'>
Index: 22564 entries, 0 to 22570
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype              
---  ------        --------------  -----              
 0   author        22564 non-null  object             
 1   published_at  22564 non-null  datetime64[ns, UTC]
 2   text          22564 non-null  object             
 3   video_id      22564 non-null  object             
 4   video_title   22564 non-null  object             
 5   cleaned_text  22564 non-null  object             
dtypes: datetime64[ns, UTC](1), object(5)
memory usage: 1.2+ MB
None
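
The `published_at` strings follow the ISO 8601 pattern with a trailing `Z`, which marks UTC. That is why the converted column is reported as `datetime64[ns, UTC]`. For a single value, the same parse can be sketched with the standard library. `parse_youtube_timestamp` is a hypothetical helper, and the format string assumes timestamps without fractional seconds, like the ones shown in the sample rows.

```python
from datetime import datetime, timezone


def parse_youtube_timestamp(value: str) -> datetime:
    """Parse 'YYYY-MM-DDTHH:MM:SSZ' into a timezone-aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


ts = parse_youtube_timestamp("2025-09-06T04:21:29Z")
print(ts.isoformat())  # 2025-09-06T04:21:29+00:00
```

Keep in mind that CSV stores timestamps as text, so the column has to be parsed again whenever a saved file is reloaded.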


"Why didn't I encode the emojis too?" I'm planning to manually label a subset of the data, and comments with encoded emojis would be harder to read and interpret during labeling, so I'm leaving them as they are for now.
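
If the emojis do need encoding later (after labeling), one possible approach is to swap each symbol for its Unicode name. The sketch below uses only the standard library and is purely illustrative. `describe_emojis` is a hypothetical helper, not something this project uses. It treats every character in Unicode category `So` (Symbol, other) as an emoji, which is an approximation: skin-tone modifiers and joiner characters fall outside that category and are left untouched.

```python
import unicodedata


def describe_emojis(text: str) -> str:
    """Replace characters in Unicode category 'So' with :unicode_name: tokens."""
    parts = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "unknown symbol")
            parts.append(" :" + name.lower().replace(" ", "_") + ": ")
        else:
            parts.append(ch)
    # Collapse the extra spaces introduced around the tokens
    return " ".join("".join(parts).split())


encoded = describe_emojis("mantap pak 👍")
print(encoded)
```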

## 4. Data Splitting

Now I'll split the data into training and inference sets: 25% goes to the training set, which I'll label manually, and the remaining 75% is kept for inference.


In [11]:
from sklearn.model_selection import train_test_split

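# train_test_split returns the (1 - test_size) share first, so the 75% share
# becomes the inference set and the 25% "test" share becomes the training set.
# Only the cleaned_text column is split, so both results are pandas Series.
inference_df, training_df = train_test_split(comments_df['cleaned_text'], test_size=0.25, random_state=42)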
print(f"Training set size: {len(training_df)}")
print(f"Inference set size: {len(inference_df)}")

Training set size: 5641
Inference set size: 16923
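
The two sizes can be checked by hand. This is a minimal sketch, assuming scikit-learn rounds the held-out (`test_size`) share up and gives the remainder to the first returned split, which agrees with the numbers printed above. The variable names are illustrative.

```python
import math

n_total = 22564        # rows left after removing duplicates
test_fraction = 0.25   # the share passed as test_size

n_training = math.ceil(n_total * test_fraction)  # held-out share, labeled manually
n_inference = n_total - n_training               # everything else

print(n_training, n_inference)  # 5641 16923
```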


## 5. Save Processed Data

Now I'll save all the cleaned and processed data to the processed folder. I'll save 3 files: cleaned data, training data, and inference data. The cell that writes them is In [13], which appears after the commit check in the next section.


## 6. Commit and Push Changes

Now that we have processed data, let's commit our changes and push them to GitHub.


In [12]:
# Commit and push changes to GitHub
import subprocess
from pathlib import Path

# Change to the project directory
project_path = Path("/content/civic_sentiment")

# Check if there are any changes first
print("Checking for changes...")
result = subprocess.run(["git", "status", "--porcelain"], cwd=project_path, capture_output=True, text=True)

if result.stdout.strip():
    print("Changes detected:")
    print(result.stdout)

    # Add all changes
    subprocess.run(["git", "add", "."], cwd=project_path, check=True)

    # Check status
    print("\nGit status:")
    subprocess.run(["git", "status"], cwd=project_path, check=True)

    # Commit with a message
    commit_message = "Update data collection notebook and processed data"
    subprocess.run(["git", "commit", "-m", commit_message], cwd=project_path, check=True)

    # Push to GitHub
    subprocess.run(["git", "push", "origin", "master"], cwd=project_path, check=True)

    print("✅ Successfully committed and pushed changes to GitHub!")
else:
    print("ℹ️ No changes to commit. Repository is up to date.")
    print("If you made changes, make sure to save your notebook first!")


Checking for changes...
ℹ️ No changes to commit. Repository is up to date.
If you made changes, make sure to save your notebook first!
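
The commit cell only tests whether `git status --porcelain` printed anything. Each line of that output starts with a two-character status code, then a space, then the path, so it can also be split into pairs when more detail is useful. The sketch below runs on a made-up sample string. `parse_porcelain` and the sample paths are hypothetical, and renamed entries (which contain `->`) are not handled.

```python
def parse_porcelain(output: str) -> list[tuple[str, str]]:
    """Split `git status --porcelain` output into (status, path) pairs."""
    entries = []
    for line in output.splitlines():
        if len(line) < 4:
            continue  # skip blank or malformed lines
        # Leading spaces are part of the status code, so the line is not stripped
        entries.append((line[:2], line[3:]))
    return entries


# Made-up output: one modified tracked file and one untracked file
SAMPLE = " M notebooks/01-data-collection.ipynb\n?? data/raw/comments.csv\n"

changes = parse_porcelain(SAMPLE)
for status, path in changes:
    print(repr(status), path)
```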


In [13]:
import os
from civic_sentiment.config import PROCESSED_DATA_DIR

# Create the processed data directory if it doesn't exist
os.makedirs(PROCESSED_DATA_DIR, exist_ok=True)

# Save the cleaned DataFrame to a CSV file in the processed directory
cleaned_output_path = PROCESSED_DATA_DIR / "cleaned_comments.csv"
if not comments_df.empty:
    comments_df.to_csv(cleaned_output_path, index=False)
    print(f"💾 Cleaned comments saved to {cleaned_output_path}")
    print(f"📁 File size: {cleaned_output_path.stat().st_size / 1024:.1f} KB")
else:
    print("⚠️ No cleaned data to save - comments DataFrame is empty")

# Save the training DataFrame to a CSV file in the processed directory
training_output_path = PROCESSED_DATA_DIR / "training_comments.csv"
if not training_df.empty:
    training_df.to_csv(training_output_path, index=False)
    print(f"💾 Training comments saved to {training_output_path}")
    print(f"📁 File size: {training_output_path.stat().st_size / 1024:.1f} KB")
else:
    print("⚠️ No training data to save - training DataFrame is empty")

# Save the inference DataFrame to a CSV file in the processed directory
inference_output_path = PROCESSED_DATA_DIR / "inference_comments.csv"
if not inference_df.empty:
    inference_df.to_csv(inference_output_path, index=False)
    print(f"💾 Inference comments saved to {inference_output_path}")
    print(f"📁 File size: {inference_output_path.stat().st_size / 1024:.1f} KB")
else:
    print("⚠️ No inference data to save - inference DataFrame is empty")

# Show a preview of what was saved
print(f"\n📋 Data saved includes:")
print(f"  • {len(comments_df)} total cleaned comments")
print(f"  • {len(training_df)} training comments")
print(f"  • {len(inference_df)} inference comments")
print(f"  • Columns: {', '.join(comments_df.columns)}")

💾 Cleaned comments saved to /content/civic_sentiment/data/processed/cleaned_comments.csv
📁 File size: 9161.6 KB
💾 Training comments saved to /content/civic_sentiment/data/processed/training_comments.csv
📁 File size: 800.1 KB
💾 Inference comments saved to /content/civic_sentiment/data/processed/inference_comments.csv
📁 File size: 2448.7 KB

📋 Data saved includes:
  • 22564 total cleaned comments
  • 5641 training comments
  • 16923 inference comments
  • Columns: author, published_at, text, video_id, video_title, cleaned_text


## 7. Commit Changes

Now that I have processed the data, I will commit the changes to my GitHub repository.

In [None]:
# Add, commit, and push changes to GitHub
%cd /content/civic_sentiment

!git config --global user.email "rrrijal24@gmail.com"
!git config --global user.name "Rnov24"

!git status
!git add .
!git commit -m "feat: Add raw and processed data files"
!git push