# **Crawling Twitter Data for Sentiment Analysis of AI Ethics**

This notebook collects Twitter data with tweet-harvest to gather tweets that discuss AI (Artificial Intelligence) ethics. The collected data will be used for sentiment analysis to understand public perception of AI ethics in Indonesia.

## Crawling Goals:
- Collect Indonesian-language tweets about AI ethics
- Obtain raw data for preprocessing and sentiment analysis
- Save the data in CSV format for further processing

## Tools Used:
- **tweet-harvest**: Node.js command-line tool for crawling Twitter, run here through `npx`
- **pandas**: For data manipulation and cleaning
- **Twitter Auth Token**: For authenticating access to Twitter

## **Installing the Required Libraries (Windows)**

Before crawling Twitter data, make sure all required libraries and tools are installed correctly.

In [1]:
# Install the Python libraries needed for data crawling and manipulation
!pip install pandas numpy requests beautifulsoup4 lxml



In [2]:
import subprocess
import os

# Check if Node.js is installed (required for tweet-harvest)
try:
    node_version = subprocess.check_output("node -v", shell=True).decode().strip()
    print(f"✓ Node.js is installed: {node_version}")
    
    # Check npm version
    npm_version = subprocess.check_output("npm -v", shell=True).decode().strip()
    print(f"✓ npm is installed: {npm_version}")
    
    print("\n✓ Ready to use tweet-harvest for Twitter data crawling")
    
except subprocess.CalledProcessError:
    print("❌ Node.js is not installed!")
    print("Please install Node.js from: https://nodejs.org/")
    print("After installation, restart this notebook")
except Exception as e:
    print(f"Error checking Node.js: {e}")
    print("Please ensure Node.js is properly installed")

✓ Node.js is installed: v22.15.0
✓ npm is installed: 11.0.0

✓ Ready to use tweet-harvest for Twitter data crawling


## **Twitter Authentication Setup**

To use tweet-harvest, you need a Twitter auth token. The token must be stored in an `auth.key` file in the same directory as this notebook.

In [3]:
import os

# Load Twitter authentication token from file
auth_file = 'auth.key'

if os.path.exists(auth_file):
    with open(auth_file, 'r') as f:
        twitter_auth_token = f.read().strip()
    print("✓ Twitter auth token loaded successfully")
    print(f"Token length: {len(twitter_auth_token)} characters")
else:
    print("❌ auth.key file not found!")
    print("Please create an 'auth.key' file with your Twitter auth token")
    print("You can copy this token from the auth_token cookie of a logged-in Twitter session in your browser")
    twitter_auth_token = None

✓ Twitter auth token loaded successfully
Token length: 40 characters
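
Because the auth token is a credential, it is safer to mask it whenever a command or message containing it is printed. The helper below is a minimal sketch of that idea. The name `mask_secret`, the `visible` parameter, and the demo token are all hypothetical and not part of the original notebook.

```python
def mask_secret(text: str, secret: str, visible: int = 4) -> str:
    """Replace every occurrence of `secret` in `text` with a masked form."""
    if not secret:
        return text
    # Keep a short prefix for recognisability, but mask very short secrets completely
    keep = visible if len(secret) > visible else 0
    masked = secret[:keep] + "*" * (len(secret) - keep)
    return text.replace(secret, masked)

# Hypothetical token value, for illustration only
demo_token = "abcd1234efgh5678"
demo_cmd = f'npx -y tweet-harvest@2.6.1 -s "etika ai" --token {demo_token}'
print(mask_secret(demo_cmd, demo_token))
```

Calling `mask_secret` on a command string just before printing keeps the full token out of saved cell outputs.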


## **Crawling Configuration**

This section defines the parameters for crawling Twitter data about AI ethics. You can adjust the search keywords, the number of tweets, and the output file name as needed.

In [4]:
# Crawling parameter configuration
OUTPUT_FILENAME = 'etikaAI.csv'
SEARCH_KEYWORDS = [
    'etika ai since:2023-01-01 lang:id',
    'artificial intelligence ethics lang:id',
    'ai ethics indonesia lang:id',
    'etika kecerdasan buatan lang:id',
    'ai bias lang:id'
]
TWEET_LIMIT = 200  # Maximum number of tweets per keyword
OUTPUT_DIRECTORY = 'tweets-data'

# Create the output directory if it does not exist yet
if not os.path.exists(OUTPUT_DIRECTORY):
    os.makedirs(OUTPUT_DIRECTORY)
    print(f"✓ Created directory: {OUTPUT_DIRECTORY}")

print("Crawling Configuration:")
print(f"• Output file: {OUTPUT_FILENAME}")
print(f"• Tweet limit per keyword: {TWEET_LIMIT}")
print(f"• Search keywords: {len(SEARCH_KEYWORDS)} queries")
print(f"• Output directory: {OUTPUT_DIRECTORY}")
print(f"• Total estimated tweets: {TWEET_LIMIT * len(SEARCH_KEYWORDS)}")

Crawling Configuration:
• Output file: etikaAI.csv
• Tweet limit per keyword: 200
• Search keywords: 5 queries
• Output directory: tweets-data
• Total estimated tweets: 1000
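
The queries above combine a phrase with the `since:` and `lang:` search operators by hand. A small helper can build them consistently. This is only a sketch, and the function name `build_query` and its parameters are hypothetical.

```python
from datetime import date
from typing import Optional

def build_query(phrase: str, lang: str = "id", since: Optional[date] = None) -> str:
    """Compose a search string from a phrase plus optional since: and lang: operators."""
    parts = [phrase.strip()]
    if since is not None:
        parts.append(f"since:{since.isoformat()}")  # YYYY-MM-DD
    if lang:
        parts.append(f"lang:{lang}")
    return " ".join(parts)

print(build_query("etika ai", since=date(2023, 1, 1)))
print(build_query("ai bias"))
```

With this helper, `SEARCH_KEYWORDS` could be generated from a plain list of phrases, so the language filter and start date stay identical across queries.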


## **Running the Twitter Data Crawl**

This section runs the crawl with tweet-harvest. The process runs once for each configured keyword.

In [None]:
import time
from datetime import datetime

# Function to crawl with tweet-harvest
def crawl_twitter_data(search_query, output_file, limit, auth_token):
    """
    Crawl Twitter data using tweet-harvest.
    """
    if auth_token is None:
        print("❌ Auth token is not available")
        return False
    
    try:
        # Run the tweet-harvest command
        cmd = f'npx -y tweet-harvest@2.6.1 -o "{output_file}" -s "{search_query}" --tab "LATEST" -l {limit} --token {auth_token}'
        # Mask the token before printing so the credential does not end up in the cell output
        print(f"Executing: {cmd.replace(auth_token, '***')}")
        
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        
        if result.returncode == 0:
            print(f"✓ Successfully crawled data for: {search_query}")
            return True
        else:
            print(f"❌ Error crawling data for: {search_query}")
            print(f"Error: {result.stderr}")
            return False
            
    except Exception as e:
        print(f"❌ Exception during crawling: {str(e)}")
        return False

# Start the crawling process
print("🚀 Starting Twitter data crawling for AI Ethics analysis...")
print(f"⏰ Start time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("="*60)

successful_crawls = 0
failed_crawls = 0

for i, search_query in enumerate(SEARCH_KEYWORDS, 1):
    print(f"\n📥 Crawling {i}/{len(SEARCH_KEYWORDS)}: {search_query}")
    
    # Generate unique filename for each search
    output_file = f"{OUTPUT_DIRECTORY}/etika_ai_batch_{i}.csv"
    
    success = crawl_twitter_data(search_query, output_file, TWEET_LIMIT, twitter_auth_token)
    
    if success:
        successful_crawls += 1
    else:
        failed_crawls += 1
    
    # Delay between requests to avoid rate limiting
    if i < len(SEARCH_KEYWORDS):
        print("⏸️  Waiting 30 seconds before next crawl...")
        time.sleep(30)

print("\n" + "="*60)
print("📊 Crawling Summary:")
print(f"✓ Successful crawls: {successful_crawls}")
print(f"❌ Failed crawls: {failed_crawls}")
print(f"⏰ End time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

🚀 Starting Twitter data crawling for AI Ethics analysis...
⏰ Start time: 2025-06-16 15:58:10

📥 Crawling 1/5: etika ai since:2023-01-01 lang:id
Executing: npx -y tweet-harvest@2.6.1 -o "tweets-data/etika_ai_batch_1.csv" -s "etika ai since:2023-01-01 lang:id" --tab "LATEST" -l 200 --token ***
✓ Successfully crawled data for: etika ai since:2023-01-01 lang:id
⏸️  Waiting 30 seconds before next crawl...

📥 Crawling 2/5: artificial intelligence ethics lang:id
Executing: npx -y tweet-harvest@2.6.1 -o "tweets-data/etika_ai_batch_2.csv" -s "artificial intelligence ethics lang:id" --tab "LATEST" -l 200 --token ***
✓ Successfully crawled data for: artificial intelligence ethics lang:id
⏸️  Waiting 30 seconds before next crawl...
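
The crawl function assembles one shell string, so any query containing quotes or other shell metacharacters has to be escaped by hand. A possible alternative, sketched below, is to pass the arguments as a list, which avoids shell quoting. The flags are the ones already used in this notebook, while the helper name `build_harvest_command` and the placeholder token are hypothetical.

```python
import shutil
from typing import List

def build_harvest_command(query: str, output_file: str, limit: int,
                          token: str, version: str = "2.6.1") -> List[str]:
    """Return the tweet-harvest invocation as an argument list (no shell quoting needed)."""
    # shutil.which resolves the platform-specific launcher (for example npx.cmd on Windows);
    # fall back to the bare name if it is not found on PATH.
    npx = shutil.which("npx") or "npx"
    return [
        npx, "-y", f"tweet-harvest@{version}",
        "-o", output_file,
        "-s", query,
        "--tab", "LATEST",
        "-l", str(limit),
        "--token", token,
    ]

# Placeholder token, for illustration only
cmd = build_harvest_command("etika ai since:2023-01-01 lang:id",
                            "etika_ai_batch_1.csv", 200, "YOUR_TOKEN")
print(cmd[1:-1])  # print without the launcher path and without the token
```

The resulting list could then be passed to `subprocess.run(cmd, capture_output=True, text=True)` without `shell=True`.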


### **Alternative: Simple Crawl for a Single Query**

If you only want to crawl a single keyword, use the cell below:

In [None]:
# Simple crawl for one main query
filename = 'etikaAI.csv'
search_keyword = 'etika ai since:2025-01-01 lang:id'
limit = 1000

if twitter_auth_token:
    print(f"🔍 Searching for: {search_keyword}")
    print(f"📁 Output file: {filename}")
    print(f"🔢 Limit: {limit} tweets")
    
    !npx -y tweet-harvest@2.6.1 -o "{filename}" -s "{search_keyword}" --tab "LATEST" -l {limit} --token {twitter_auth_token}
    
    print("✅ Crawling completed!")
else:
    print("❌ Cannot proceed: Twitter auth token not available")

## **Validation and Initial Data Processing**

After the crawl finishes, we need to validate the collected data and perform an initial cleanup.

In [None]:
import pandas as pd
import re
import glob

def load_and_parse_twitter_data(file_path):
    """
    Load and parse Twitter data from a CSV file produced by tweet-harvest.
    """
    try:
        # Read the raw file to parse it properly
        with open(file_path, 'r', encoding='utf-8') as f:
            lines = f.readlines()

        if not lines:
            print(f"❌ File {file_path} is empty")
            return None

        # Extract and parse the malformed header
        header_line = lines[0].strip()
        header_line = header_line.replace(';;;', '')

        # Extract column names using regex to find quoted strings
        header_matches = re.findall(r'"([^"]*)"', header_line)
        if not header_matches:
            # Fallback: split by comma and clean
            header_parts = header_line.split(',')
            headers = []
            for part in header_parts:
                clean_header = part.strip().replace('"', '').replace("'", "")
                if clean_header and clean_header not in ['', ';;;']:
                    headers.append(clean_header)
        else:
            headers = header_matches

        # Clean up headers - remove empty strings and duplicates
        headers = [h for h in headers if h.strip()]
        if len(headers) == 0:
            # Use default headers if parsing fails
            headers = ['conversation_id_str', 'created_at', 'favorite_count', 'full_text', 'id_str', 
                       'image_url', 'in_reply_to_screen_name', 'lang', 'location', 'quote_count', 
                       'reply_count', 'retweet_count', 'tweet_url', 'user_id_str', 'username']

        print(f"📋 Parsed headers: {headers}")

        # Process data lines
        data_rows = []
        for line in lines[1:]:
            line = line.strip()
            if line and not line.startswith(';;;'):
                # Remove the trailing ;;; and extract quoted values
                line = line.replace(';;;', '')
                # Use regex to extract quoted content
                parts = re.findall(r'"([^"]*)"', line)
                if len(parts) >= len(headers):
                    data_rows.append(parts[:len(headers)])
                elif len(parts) > 0:
                    # Pad with empty strings if needed
                    padded_parts = parts + [''] * (len(headers) - len(parts))
                    data_rows.append(padded_parts[:len(headers)])

        # Create DataFrame
        df = pd.DataFrame(data_rows, columns=headers)

        # Clean up the data
        df = df.dropna(subset=['full_text'])  # Remove rows without text
        df = df[df['full_text'].str.strip() != '']  # Remove rows with empty text

        print(f"✅ Successfully loaded {len(df)} tweets from {file_path}")
        return df

    except Exception as e:
        print(f"❌ Error loading {file_path}: {str(e)}")
        return None

# Load data from all crawled files
print("📂 Loading crawled Twitter data...")

# Check for individual batch files
batch_files = glob.glob(f"{OUTPUT_DIRECTORY}/etika_ai_batch_*.csv")
single_file = "etikaAI.csv"

all_dataframes = []

# Load batch files if they exist
if batch_files:
    print(f"Found {len(batch_files)} batch files:")
    for file_path in batch_files:
        print(f"  📄 {file_path}")
        df = load_and_parse_twitter_data(file_path)
        if df is not None:
            all_dataframes.append(df)

# Load single file if it exists
if os.path.exists(single_file):
    print(f"  📄 {single_file}")
    df = load_and_parse_twitter_data(single_file)
    if df is not None:
        all_dataframes.append(df)

# Combine all dataframes
if all_dataframes:
    df_combined = pd.concat(all_dataframes, ignore_index=True)
    print(f"\n📊 Combined dataset: {len(df_combined)} total tweets")
    
    # Remove duplicates based on tweet content
    df_combined = df_combined.drop_duplicates(subset=['full_text'], keep='first')
    print(f"📊 After removing duplicates: {len(df_combined)} unique tweets")
    
    # Display basic info
    print(f"\n🔍 Dataset Overview:")
    print(f"  Columns: {df_combined.columns.tolist()}")
    print(f"  Shape: {df_combined.shape}")
    
    if 'full_text' in df_combined.columns:
        print(f"\n📝 Sample tweets:")
        for i, tweet in enumerate(df_combined['full_text'].head(3), 1):
            print(f"{i}. {tweet[:100]}...")
else:
    print("❌ No data files found or all files failed to load")
    df_combined = None
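
The loader above reads the file line by line and extracts quoted values with a regular expression, so tweets that contain line breaks or unquoted fields can be split or dropped. If the crawled files turn out to be standard comma-separated CSV (an assumption that should be checked against the actual output), a CSV parser handles those cases directly. The sketch below uses Python's built-in `csv` module on made-up sample rows, and the function name `parse_csv_text` is hypothetical.

```python
import csv
import io
from typing import Dict, List

def parse_csv_text(text: str) -> List[Dict[str, str]]:
    """Parse CSV text and keep only rows whose full_text is non-empty."""
    # newline="" leaves line breaks inside quoted fields for the csv module to handle
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return [row for row in reader if (row.get("full_text") or "").strip()]

# Made-up sample rows: an embedded comma, an embedded line break, and an empty text
sample = (
    'id_str,full_text,username\n'
    '"1","Etika AI itu penting, menurut saya","user_a"\n'
    '"2","Baris pertama\nbaris kedua","user_b"\n'
    '"3","","user_c"\n'
)
rows = parse_csv_text(sample)
print(len(rows))
print(rows[1]["full_text"])
```

For real files, `pandas.read_csv(file_path)` applies the same quoting rules and returns a DataFrame directly.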

In [None]:
# Generate data statistics and export final dataset
if df_combined is not None:
    print("📈 Generating Data Statistics...")
    print("="*50)
    
    # Basic statistics
    print(f"📊 Total Tweets: {len(df_combined)}")
    
    # Check for key columns
    if 'created_at' in df_combined.columns:
        print(f"📅 Date Range: Available")
        
    if 'username' in df_combined.columns:
        unique_users = df_combined['username'].nunique()
        print(f"👤 Unique Users: {unique_users}")
        top_users = df_combined['username'].value_counts().head(5)
        print(f"🔝 Top Users:")
        for user, count in top_users.items():
            print(f"   {user}: {count} tweets")
    
    if 'lang' in df_combined.columns:
        lang_dist = df_combined['lang'].value_counts()
        print(f"🌐 Language Distribution:")
        for lang, count in lang_dist.head(5).items():
            print(f"   {lang}: {count} tweets")
    
    # Text length analysis
    if 'full_text' in df_combined.columns:
        df_combined['text_length'] = df_combined['full_text'].str.len()
        avg_length = df_combined['text_length'].mean()
        print(f"📝 Average Tweet Length: {avg_length:.1f} characters")
        print(f"📝 Longest Tweet: {df_combined['text_length'].max()} characters")
        print(f"📝 Shortest Tweet: {df_combined['text_length'].min()} characters")
    
    # Export final dataset
    final_output_file = f"{OUTPUT_DIRECTORY}/{OUTPUT_FILENAME}"
    df_combined.to_csv(final_output_file, index=False, encoding='utf-8')
    print(f"\n💾 Final dataset exported to: {final_output_file}")
    
    # Also create a backup in the root directory
    backup_file = "etikaAI_crawled.csv"
    df_combined.to_csv(backup_file, index=False, encoding='utf-8')
    print(f"💾 Backup created: {backup_file}")
    
    print(f"\n✅ Data crawling and initial processing completed successfully!")
    print(f"📁 Files created:")
    print(f"   • {final_output_file}")
    print(f"   • {backup_file}")
    
else:
    print("❌ No data available for export")
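
The statistics cell reports the date range only as "Available". The sketch below shows one way to compute the actual earliest and latest timestamps. It assumes `created_at` uses the classic Twitter timestamp layout (for example `Mon Jun 16 08:58:10 +0000 2025`), which should be verified against the crawled files. The names `TWITTER_DATE_FORMAT` and `date_range` are hypothetical.

```python
from datetime import datetime

# Assumed timestamp layout; adjust if the crawled files use a different one
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def date_range(created_at_values):
    """Return (earliest, latest) datetimes, skipping values that fail to parse."""
    parsed = []
    for value in created_at_values:
        try:
            parsed.append(datetime.strptime(value, TWITTER_DATE_FORMAT))
        except (TypeError, ValueError):
            continue  # ignore missing or malformed timestamps
    if not parsed:
        return None
    return min(parsed), max(parsed)

# Made-up sample values, for illustration only
sample_dates = [
    "Mon Jun 16 08:58:10 +0000 2025",
    "Sun Jan 01 00:00:00 +0000 2023",
    "not a date",
]
earliest, latest = date_range(sample_dates)
print(earliest.date(), "to", latest.date())
```

In the notebook, the function could be called as `date_range(df_combined['created_at'])`.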

## **Data Preview and Quality Check**

This section previews the crawled data and checks its quality to confirm that it is ready for the preprocessing stage.

In [None]:
# Data Preview and Quality Check
if df_combined is not None:
    print("🔍 DATA QUALITY CHECK")
    print("="*50)
    
    # Check for missing values
    print("📋 Missing Values Analysis:")
    missing_data = df_combined.isnull().sum()
    for col, missing_count in missing_data.items():
        if missing_count > 0:
            percentage = (missing_count / len(df_combined)) * 100
            print(f"   {col}: {missing_count} ({percentage:.1f}%)")
    
    if missing_data.sum() == 0:
        print("   ✅ No missing values found")
    
    # Check for empty text fields
    if 'full_text' in df_combined.columns:
        empty_texts = df_combined['full_text'].str.strip().eq('').sum()
        print(f"📝 Empty text fields: {empty_texts}")
        
        # Check text length distribution
        text_lengths = df_combined['full_text'].str.len()
        print(f"📏 Text length statistics:")
        print(f"   Min: {text_lengths.min()} characters")
        print(f"   Max: {text_lengths.max()} characters")
        print(f"   Mean: {text_lengths.mean():.1f} characters")
        print(f"   Median: {text_lengths.median():.1f} characters")
    
    # Display sample data
    print(f"\n📄 DATA PREVIEW (First 5 tweets):")
    print("="*50)
    
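    # Number rows by position; index labels have gaps after filtering and deduplication
    for n, (_, row) in enumerate(df_combined.head().iterrows(), 1):
        print(f"\n{n}. Tweet ID: {row.get('id_str', 'N/A')}")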
        if 'username' in df_combined.columns:
            print(f"   User: @{row.get('username', 'N/A')}")
        if 'created_at' in df_combined.columns:
            print(f"   Date: {row.get('created_at', 'N/A')}")
        if 'full_text' in df_combined.columns:
            text = row.get('full_text', 'N/A')
            print(f"   Text: {text[:150]}{'...' if len(text) > 150 else ''}")
        if 'lang' in df_combined.columns:
            print(f"   Language: {row.get('lang', 'N/A')}")
    
    # Data readiness check
    print(f"\n✅ DATA READINESS CHECK:")
    print("="*30)
    
    if 'full_text' in df_combined.columns and len(df_combined) > 0:
        print("✅ Text data available for preprocessing")
    else:
        print("❌ No text data available")
    
    if len(df_combined) >= 100:
        print("✅ Sufficient data volume for analysis")
    else:
        print(f"⚠️  Limited data volume: {len(df_combined)} tweets")
    
    print(f"\n🎯 NEXT STEPS:")
    print("1. Run preprocessing pipeline in the main analysis notebook")
    print("2. Apply text cleaning and normalization")
    print("3. Perform sentiment analysis")
    print("4. Train machine learning models")
    
    # Display final dataframe info
    print(f"\n📊 FINAL DATASET INFO:")
    df_combined.info()
    
else:
    print("❌ No data available for quality check")

## **Conclusion and Next Steps**

### ✅ **Twitter Data Crawling Results**

This notebook crawls Twitter data on the topic of **AI (Artificial Intelligence) ethics**. The collected data forms the foundation for analyzing Indonesian public sentiment toward ethical issues in the use of AI.

### 📊 **Outputs Produced**

1. **Main Data Files**: 
   - `tweets-data/etikaAI.csv` - Complete dataset from the crawl
   - `etikaAI_crawled.csv` - Backup file in the root directory

2. **Data Contents**:
   - Indonesian-language tweets about AI ethics
   - Full metadata (username, date, engagement metrics)
   - Rows deduplicated by tweet text

### 🔄 **Next Steps**

Once the crawl is complete, the data is ready for the following stages:

1. **Preprocessing**:
   - Text cleaning and normalization
   - Tokenization and stemming for Indonesian
   - Stopword removal

2. **Sentiment Analysis**:
   - Sentiment labeling (positive/negative/neutral)
   - Feature extraction using TF-IDF
   - Training machine learning models

3. **Evaluation and Visualization**:
   - Model performance evaluation
   - Sentiment distribution analysis
   - Word cloud and trend analysis

### 🚀 **Using the Data for Analysis**

To continue with preprocessing and sentiment analysis, use the notebook `dataMining_KompEtik.ipynb` with the data crawled by this notebook.

### 💡 **Optimization Tips**

- For a larger dataset, consider running the crawl periodically
- Watch Twitter's rate limits to avoid being throttled
- Validate data quality regularly to make sure the tweets stay relevant to AI ethics
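
As a rough illustration of the relevance check in the last tip, the sketch below flags tweets that mention at least one topic term as a whole word. The term list is a hypothetical starting point and not part of the original notebook, and keyword matching is only a coarse filter that does not replace manual review.

```python
import re

# Hypothetical topic terms; extend or replace to suit the study
RELEVANCE_TERMS = ["etika", "ethics", "bias", "kecerdasan buatan"]

def is_relevant(text: str, terms=RELEVANCE_TERMS) -> bool:
    """Return True if the text contains any term as a whole word (case-insensitive)."""
    lowered = text.lower()
    return any(re.search(rf"\b{re.escape(term)}\b", lowered) for term in terms)

print(is_relevant("Diskusi soal etika AI di kampus"))
print(is_relevant("Hari ini biasa saja"))
```

Whole-word matching matters here: the Indonesian word "biasa" (ordinary) contains "bias" but should not count as a match. With a DataFrame, the check could be applied as `df_combined['full_text'].apply(is_relevant)`.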