# Data Acquisition Notebook

This notebook handles data acquisition for the CheckDi project. It scrapes news headlines from the Anti-Fake News Center (AFNC) website, adds synthetic fake news items, and saves the result as a dataset for training the fake news detection model.

In [1]:
# Import required modules
import sys
import os

# Add the src directory to the path so we can import our modules
sys.path.append(os.path.join(os.getcwd(), '..'))

from src.core.scraper import main as scrape_data
import pandas as pd
import logging

In [2]:
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

In [3]:
# Run the data scraping process
print("Starting data acquisition process...")
scrape_data()
print("Data acquisition completed!")

INFO:src.core.scraper:Starting to scrape real news data...
INFO:src.core.scraper:Scraping page 1: https://www.antifakenewscenter.com/

Starting data acquisition process...

INFO:src.core.scraper:Scraping page 2: https://www.antifakenewscenter.com/?page=2
INFO:src.core.scraper:Scraping page 3: https://www.antifakenewscenter.com/?page=3
INFO:src.core.scraper:Creating synthetic fake news data...
INFO:src.core.scraper:Created 8 fake news items
INFO:src.core.scraper:Raw data saved to data/raw/news.csv
INFO:src.core.scraper:Preparing processed data...
INFO:src.core.scraper:Processed data saved to data/processed/news.csv

Data acquisition and preparation completed!
Total articles: 8
Real news: 0
Fake news: 8
Data acquisition completed!
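
The log reports a relative output path (`data/raw/news.csv`), while the loading cell reads `../data/raw/news.csv`. Whether these name the same file depends on the working directory the scraper ran in, which the output does not show. A minimal sketch for checking this, where `resolves_to_same_file` is a hypothetical helper and `/project/notebooks` is an assumed working directory used purely for illustration:

```python
import os


def resolves_to_same_file(path_a: str, path_b: str, cwd: str) -> bool:
    """Return True if two relative paths name the same location from cwd.

    The comparison is purely lexical: the filesystem is not touched and
    symbolic links are not followed.
    """
    full_a = os.path.normpath(os.path.join(cwd, path_a))
    full_b = os.path.normpath(os.path.join(cwd, path_b))
    return full_a == full_b


# Assumed layout for illustration only: the notebook runs in /project/notebooks.
print(resolves_to_same_file(
    "data/raw/news.csv", "../data/raw/news.csv", "/project/notebooks"
))  # prints False
```

If the two paths differ, the file loaded in the next cell may come from an earlier run rather than from the scrape that just finished.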


In [4]:
# Load and examine the scraped data
try:
    df = pd.read_csv('../data/raw/news.csv')
    print(f"Dataset shape: {df.shape}")
    print("\nFirst few rows:")
    display(df.head())
    
    print("\nLabel distribution:")
    print(df['label'].value_counts())
    
    print("\nMissing values:")
    print(df.isnull().sum())
except FileNotFoundError:
    print("Data file not found. Please run the scraping process first.")
except Exception as e:
    print(f"Error loading data: {e}")

Dataset shape: (8, 4)

First few rows:


| | headline | label | date | link |
|---|---|---|---|---|
| 0 | รัฐบาลเปิดเผยแผนพัฒนาเศรษฐกิจในปีหน้า | Real | NaN | NaN |
| 1 | พบยารักษาโรคเบาหวานใหม่ที่มีประสิทธิภาพสูง | Real | NaN | NaN |
| 2 | ผู้เชี่ยวชาญด้านสุขภาพเผยถึงอันตรายของการดื่มเ... | Real | NaN | NaN |
| 3 | การศึกษาใหม่แสดงให้เห็นว่าออกกำลังกายช่วยเพิ่ม... | Real | NaN | NaN |
| 4 | วิทยาศาสตร์ใหม่พบว่ากินใบย่านางช่วยลดน้ำหนักได... | Fake | NaN | NaN |



Label distribution:
label
Real    4
Fake    4
Name: count, dtype: int64

Missing values:
headline    0
label       0
date        8
link        8
dtype: int64
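
The output above shows that `date` and `link` are empty in every row. A minimal sketch of how such checks could be collected in one place, assuming the column names shown above; `summarize_quality` is a hypothetical helper (not part of the project code), and the demo rows are placeholders rather than scraped data:

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame, label_col: str = "label") -> dict:
    """Return basic quality indicators for a labeled headline DataFrame."""
    # Columns in which every value is missing carry no information.
    empty_columns = [col for col in df.columns if df[col].isna().all()]
    # Rows per label, converted to a plain dict of Python ints.
    label_counts = {
        str(label): int(count)
        for label, count in df[label_col].value_counts().items()
    }
    # Headlines that repeat an earlier row exactly.
    duplicate_headlines = int(df["headline"].duplicated().sum())
    return {
        "empty_columns": empty_columns,
        "label_counts": label_counts,
        "duplicate_headlines": duplicate_headlines,
    }


# Placeholder rows for illustration only.
demo = pd.DataFrame({
    "headline": ["headline a", "headline b", "headline a"],
    "label": ["Real", "Fake", "Real"],
    "date": [None, None, None],
    "link": [None, None, None],
})
print(summarize_quality(demo))
```

Running the same helper on the loaded `df` would flag fully empty columns before they reach the exploration or preparation notebooks.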


## Next Steps

After running this notebook:
1. Check the `data/raw/news.csv` file for the scraped data
2. Proceed to the data exploration notebook for further analysis
3. Move to the data preparation notebook to clean and prepare the data for modeling
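
As a rough preview of the preparation step, the sketch below drops unusable rows and columns and adds an integer target. It assumes the `headline` and `label` columns shown earlier; `prepare_for_modeling` and the `Real` = 0 / `Fake` = 1 encoding are hypothetical choices for illustration, not the project's actual preparation code.

```python
import pandas as pd

# Assumed encoding for illustration; the project may use a different one.
LABEL_TO_INT = {"Real": 0, "Fake": 1}


def prepare_for_modeling(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy with an integer `target` column."""
    # Rows without a headline or label cannot be used for training.
    cleaned = df.dropna(subset=["headline", "label"])
    # Drop columns that are empty in every remaining row.
    cleaned = cleaned.dropna(axis="columns", how="all")
    # Normalize surrounding whitespace, then remove repeated headlines.
    cleaned = cleaned.assign(headline=cleaned["headline"].astype(str).str.strip())
    cleaned = cleaned.drop_duplicates(subset="headline")
    # Fail loudly on labels outside the assumed encoding.
    unknown = set(cleaned["label"]) - set(LABEL_TO_INT)
    if unknown:
        raise ValueError(f"Unexpected labels: {sorted(unknown)}")
    cleaned = cleaned.assign(target=cleaned["label"].map(LABEL_TO_INT))
    return cleaned.reset_index(drop=True)


# Placeholder rows for illustration only.
raw = pd.DataFrame({
    "headline": [" headline a ", "headline b", "headline a", None],
    "label": ["Real", "Fake", "Real", "Fake"],
    "date": [None, None, None, None],
})
print(prepare_for_modeling(raw))
```

The function returns a new DataFrame and leaves its input unchanged, so the raw data stays available for comparison.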