# GCAP3226 Week 7: Web Crawler & API Workshop
## Field Studies – Project Data Collection

**Date:** October 14, 2025 (Week 7)  
**Duration:** 3 hours  
**Focus:** Data Collection Techniques for Government Policy Analysis  
**Course:** Empowering Citizens through Data: Participatory Policy Analysis for Hong Kong

---

## 🎯 Workshop Objectives

By the end of this workshop, students will be able to:
1. **Understand APIs and Web Scraping** for government data collection
2. **Implement basic web crawlers** using Python libraries
3. **Access government APIs** for real-time data
4. **Apply data collection techniques** to their specific team projects
5. **Set up automated data collection** for ongoing project monitoring

---

## 📋 Workshop Agenda

### Part 1: Introduction to Data Collection (45 minutes)
- **Government Data Sources Overview** (15 min)
- **APIs vs Web Scraping** (15 min)
- **Legal and Ethical Considerations** (15 min)

### Part 2: Hands-on API Workshop (60 minutes)
- **Hong Kong Government APIs** (20 min)
- **Weather Data APIs** (20 min)
- **Transport Data APIs** (20 min)

### Part 3: Web Scraping Fundamentals (60 minutes)
- **BeautifulSoup Introduction** (20 min)
- **Requests Library** (20 min)
- **Practical Scraping Exercise** (20 min)

### Part 4: Team-Specific Application (45 minutes)
- **Team Breakout Sessions** (30 min)
- **Data Collection Planning** (15 min)

---

## 🔧 Technical Setup

### Required Software
```bash
# Install required Python packages
pip install requests beautifulsoup4 pandas numpy matplotlib seaborn
# The HKO examples below call the Observatory's open data API directly with requests,
# so no HKO-specific package is required.
```

### Development Environment
- **Jupyter Notebook** or **Google Colab**
- **Python 3.8+**
- **Internet connection** for API access


## 📊 Government Data Sources

### Hong Kong Government APIs

#### 1. Hong Kong Observatory (HKO) API
- **Weather Data**: Real-time weather conditions
- **Typhoon Information**: Tropical cyclone tracking
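- **Air Quality**: AQHI data and forecasts (AQHI is published by the Environmental Protection Department, so confirm the endpoint before relying on it)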

#### 2. Transport Department APIs
- **KMB Bus Routes**: Route information and schedules
- **Bus Stops**: Stop locations and facilities
- **Real-time Arrival**: Live bus arrival times

#### 3. Data.gov.hk
- **Open Data Portal**: Comprehensive government datasets
- **Department-specific data**: Various government departments
- **Historical data**: Long-term trend analysis

### Team-Specific Data Sources

#### Team 1: Flu Shot Campaign Analysis
- **Department of Health**: Vaccination statistics
- **Hospital Authority**: Healthcare utilization data
- **Census and Statistics Department**: Population demographics

#### Team 2: Bus Route Coordination
- **Transport Department**: Route performance data
- **KMB/CTB**: Real-time bus data
- **HKO**: Weather impact on ridership

#### Team 3: Typhoon Preparedness
- **Hong Kong Observatory**: Weather and typhoon data
- **Security Bureau**: Emergency response protocols
- **Census and Statistics Department**: Population distribution

#### Team 4: Municipal Solid Waste Charging
- **Environmental Protection Department**: Waste statistics
- **Food and Environmental Hygiene Department**: Collection data
- **Census and Statistics Department**: Household characteristics

#### Team 5: Green @ Community Initiatives
- **Environmental Protection Department**: Environmental data
- **Home Affairs Department**: Community programs
- **Development Bureau**: Urban planning data

#### Team 6: Bus Stop Optimization
- **Transport Department**: Bus stop utilization
- **Planning Department**: Urban development plans
- **Census and Statistics Department**: Population density


```python
# Part 1: Setup and Import Libraries
# Install required packages (run this cell first)
# !pip install requests beautifulsoup4 pandas numpy matplotlib seaborn

import requests
import json
import numpy as np  # used to generate the sample data in Part 5
import pandas as pd
import time
import random
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import seaborn as sns

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")
print("📊 Ready to start data collection workshop")
```


## Part 2: Hong Kong Government API Examples

### 2.1 Hong Kong Observatory (HKO) Weather API

Let's start with the Hong Kong Observatory API to get real-time weather data.


```python
# HKO Weather API Class
class HKOWeatherAPI:
    """Hong Kong Observatory Weather API Interface"""

    def __init__(self):
        self.base_url = "https://data.weather.gov.hk/weatherAPI/opendata/weather.php"

    def get_current_weather(self):
        """Get the current weather report"""
        params = {
            'dataType': 'rhrread',  # Current weather report
            'lang': 'en'
        }
        return self._make_request(params)

    def get_typhoon_info(self):
        """Get the weather warning summary, which carries any tropical cyclone signal"""
        # Assumption to verify: this endpoint is understood to have no dedicated
        # tropical cyclone dataType, so 'warnsum' (warning summary) is used because it
        # reports the tropical cyclone signal when one is in force. The result may be
        # empty when no warning is active. Check the HKO Open Data API documentation.
        params = {
            'dataType': 'warnsum',
            'lang': 'en'
        }
        return self._make_request(params)

    def get_air_quality(self):
        """Get air quality data"""
        # Assumption to verify: AQHI is published by the Environmental Protection
        # Department, so this HKO endpoint may not recognise 'aqhi'. Expect None
        # until the correct source has been confirmed.
        params = {
            'dataType': 'aqhi',  # Air quality health index
            'lang': 'en'
        }
        return self._make_request(params)

    def _make_request(self, params):
        """Make API request with error handling; returns None on any failure"""
        try:
            response = requests.get(self.base_url, params=params, timeout=30)
            if response.status_code == 200:
                return response.json()
            print(f"API Error: {response.status_code}")
            return None
        except (requests.RequestException, ValueError) as e:
            # ValueError covers a 200 response whose body is not valid JSON
            print(f"Request failed: {e}")
            return None

# Test the HKO API
print("🌤️ Testing Hong Kong Observatory API...")
weather_api = HKOWeatherAPI()

# Get current weather
weather_data = weather_api.get_current_weather()
if weather_data:
    print("✅ Weather API working!")
    print(f"📊 Available data keys: {list(weather_data.keys())}")
else:
    print("❌ Weather API failed")
```


```python
# Display weather data in a readable format
def first_reading(block):
    """Return the first reading in a data block, or {} if the block has none"""
    readings = (block or {}).get('data') or [{}]
    return readings[0]

if weather_data:
    print("🌤️ Current Weather Data:")
    print("=" * 40)

    # Each block lists several readings; only the first is shown as a sample
    if 'temperature' in weather_data:
        temp = first_reading(weather_data['temperature'])
        print(f"🌡️ Temperature: {temp.get('value', 'N/A')}°C")

    # Display humidity data
    if 'humidity' in weather_data:
        humidity = first_reading(weather_data['humidity'])
        print(f"💧 Humidity: {humidity.get('value', 'N/A')}%")

    # Display wind data (printed only if the response includes a 'wind' block)
    if 'wind' in weather_data:
        wind = first_reading(weather_data['wind'])
        print(f"💨 Wind: {wind.get('value', 'N/A')}")

    print("=" * 40)
    print("📊 Full data structure:")
    print(json.dumps(weather_data, indent=2)[:500] + "...")
else:
    print("❌ No weather data available")
```
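
For analysis it is usually easier to hold the readings as flat rows than as nested JSON. The sketch below flattens one block of a weather response into a list of dictionaries. The `sample_payload` is a hand-written stand-in, the `place`, `value`, `unit` and `recordTime` keys are assumptions about the response layout, and `flatten_readings` is a hypothetical helper name, so inspect a live response before relying on any of them.

```python
# Hand-written stand-in for a weather response (assumed layout, not live data)
sample_payload = {
    "temperature": {
        "recordTime": "2025-10-14T10:00:00+08:00",
        "data": [
            {"place": "Station A", "value": 28, "unit": "C"},
            {"place": "Station B", "value": 27, "unit": "C"},
        ],
    }
}


def flatten_readings(payload, key):
    """Return one dict per reading in payload[key]['data'], tagged with the record time."""
    block = (payload or {}).get(key) or {}
    record_time = block.get("recordTime")
    return [
        {
            "measure": key,
            "place": item.get("place"),
            "value": item.get("value"),
            "unit": item.get("unit"),
            "record_time": record_time,
        }
        for item in block.get("data", [])
    ]


rows = flatten_readings(sample_payload, "temperature")
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` gives a table that can be cleaned and plotted like any other dataset.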


### 2.2 Transport Department API

Now let's explore the Transport Department APIs for bus route and stop information.


```python
# Transport Department API Class
class TransportAPI:
    """Interface to the KMB and Citybus (CTB) open data endpoints"""

    def __init__(self):
        self.kmb_base = "https://data.etabus.gov.hk/v1/transport/kmb"
        # Assumption to verify: Citybus publishes its data separately from KMB, so
        # this URL may not resolve. Check the Citybus data dictionary on data.gov.hk
        # before using company='ctb'.
        self.ctb_base = "https://data.etabus.gov.hk/v1/transport/ctb"

    def _base(self, company):
        """Return the base URL for the chosen bus company"""
        return self.kmb_base if company == 'kmb' else self.ctb_base

    def get_bus_routes(self, company='kmb'):
        """Get all bus routes"""
        return self._make_request(f"{self._base(company)}/route")

    def get_bus_stops(self, company='kmb'):
        """Get all bus stops"""
        return self._make_request(f"{self._base(company)}/stop")

    def get_bus_arrival(self, stop_id, route, company='kmb', service_type=1):
        """Get real-time bus arrival information"""
        url = f"{self._base(company)}/eta/{stop_id}/{route}"
        if company == 'kmb':
            # Assumption to verify: the KMB ETA endpoint expects the service type
            # as a third path segment (1 being the regular service)
            url += f"/{service_type}"
        return self._make_request(url)

    def _make_request(self, url):
        """Make API request with error handling; returns None on any failure"""
        try:
            response = requests.get(url, timeout=30)
            if response.status_code == 200:
                return response.json()
            print(f"API Error: {response.status_code}")
            return None
        except (requests.RequestException, ValueError) as e:
            # ValueError covers a 200 response whose body is not valid JSON
            print(f"Request failed: {e}")
            return None

# Test the Transport API
print("🚌 Testing Transport Department API...")
transport_api = TransportAPI()

# Get bus routes
routes = transport_api.get_bus_routes()
if routes:
    print("✅ Transport API working!")
    print(f"📊 Available routes: {len(routes.get('data', []))}")

    # Show first few routes
    if 'data' in routes and len(routes['data']) > 0:
        print("🚌 Sample routes:")
        for i, route in enumerate(routes['data'][:5]):
            print(f"  {i+1}. Route {route.get('route', 'N/A')}: {route.get('dest_en', 'N/A')}")
else:
    print("❌ Transport API failed")
```
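
Live endpoints fail intermittently, so a single `None` result does not always mean the data is unavailable. The sketch below retries a request with exponential backoff. It assumes only that the fetch function returns `None` on failure, as the `_make_request` methods in this workshop do; `fetch_with_retry` and `flaky_fetch` are hypothetical names used for illustration.

```python
import time


def fetch_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-None result or the retries run out."""
    for attempt in range(retries):
        result = fetch()
        if result is not None:
            return result
        if attempt < retries - 1:
            # Wait 1x, 2x, 4x ... the base delay before the next attempt
            sleep(base_delay * (2 ** attempt))
    return None


# Demonstration with a stand-in fetcher that fails twice, then succeeds
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    return {"data": []} if attempts["count"] >= 3 else None


waits = []  # records the delays instead of actually sleeping
result = fetch_with_retry(flaky_fetch, retries=4, base_delay=0.5, sleep=waits.append)
print(result, waits)
```

With the class above, a call might look like `fetch_with_retry(lambda: transport_api.get_bus_routes())`. The `sleep` argument is injectable only so the demonstration runs instantly.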


## Part 3: Web Scraping Fundamentals

### 3.1 Introduction to BeautifulSoup

Web scraping allows us to extract data from websites when APIs are not available. Let's learn the basics with BeautifulSoup.


```python
# Web Scraping Example: Government News
class GovernmentWebScraper:
    """Web scraper for government websites"""

    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }

    def scrape_news_announcements(self, url, max_pages=5):
        """Scrape government news and announcements"""
        all_news = []

        for page in range(1, max_pages + 1):
            try:
                # Let requests build the query string so existing parameters survive
                response = requests.get(url, params={'page': page},
                                        headers=self.headers, timeout=30)
                response.raise_for_status()  # treat 4xx/5xx responses as errors
                soup = BeautifulSoup(response.content, 'html.parser')

                # Extract news items (adjust selectors based on actual website)
                news_items = soup.find_all('div', class_='news-item')

                for item in news_items:
                    # Look up each tag once, then fall back to a placeholder if missing
                    title_tag = item.find('h3')
                    date_tag = item.find('span', class_='date')
                    link_tag = item.find('a')

                    all_news.append({
                        'title': title_tag.text.strip() if title_tag else 'No title',
                        'date': date_tag.text.strip() if date_tag else 'No date',
                        'link': link_tag.get('href', 'No link') if link_tag else 'No link',
                        'scraped_at': datetime.now().isoformat()
                    })

                # Be respectful - add delay between requests
                time.sleep(random.uniform(1, 3))

            except requests.RequestException as e:
                print(f"Error scraping page {page}: {e}")
                continue

        return pd.DataFrame(all_news)

# Example: Scrape a simple government page
print("🕷️ Web Scraping Example")
print("=" * 40)

# Create a simple HTML example for demonstration
sample_html = """
<html>
<body>
    <div class="news-item">
        <h3>Government Announces New Policy</h3>
        <span class="date">2025-10-14</span>
        <a href="/news/1">Read more</a>
    </div>
    <div class="news-item">
        <h3>Public Consultation Opens</h3>
        <span class="date">2025-10-13</span>
        <a href="/news/2">Read more</a>
    </div>
</body>
</html>
"""

# Parse the HTML
soup = BeautifulSoup(sample_html, 'html.parser')
news_items = soup.find_all('div', class_='news-item')

print("📰 Extracted News Items:")
for i, item in enumerate(news_items, 1):
    title = item.find('h3').text.strip()
    date = item.find('span', class_='date').text.strip()
    link = item.find('a')['href']
    print(f"{i}. {title} ({date}) - {link}")

print("✅ Web scraping demonstration complete!")
```
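
The links extracted in the demonstration (`/news/1`, `/news/2`) are relative, so they cannot be requested until they are resolved against the page they came from. A minimal sketch using the standard library's `urljoin` follows. The `base_url` is a made-up address used only for illustration, and the third link is an extra hypothetical case showing how a path without a leading slash is resolved.

```python
from urllib.parse import urljoin

# Hypothetical page address, used only to illustrate link resolution
base_url = "https://www.example.gov.hk/news/index.html"
relative_links = ["/news/1", "/news/2", "archive/2024.html"]

# urljoin resolves each link the way a browser would from base_url
absolute_links = [urljoin(base_url, link) for link in relative_links]
for link in absolute_links:
    print(link)
```

Storing the absolute form in the `link` column makes the scraped table usable without remembering which site each row came from.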


## Part 4: Team-Specific Data Collection

### 4.1 Team Data Collection Functions

Now let's create team-specific data collection functions for each project.


```python
# Team-Specific Data Collection Functions
class TeamDataCollectors:
    """Team-specific data collection functions"""

    @staticmethod
    def team1_flu_shot_data():
        """Team 1: Flu shot campaign data collection"""
        print("💉 Team 1: Flu Shot Campaign Analysis")
        print("📊 Data sources: Department of Health, Hospital Authority")
        print("🔍 Focus: Vaccination rates, demographic patterns, policy effectiveness")
        # Implementation would go here
        return {"status": "ready", "data_sources": ["DOH", "HA", "Census"]}

    @staticmethod
    def team2_bus_route_data():
        """Team 2: Bus route coordination data collection"""
        print("🚌 Team 2: Bus Route Coordination")
        print("📊 Data sources: Transport Department, KMB/CTB APIs, Weather data")

        transport_api = TransportAPI()
        weather_api = HKOWeatherAPI()

        # Get bus routes
        routes = transport_api.get_bus_routes()

        # Get weather data for correlation
        weather = weather_api.get_current_weather()

        return {
            'routes': routes,
            'weather': weather,
            'status': 'collected'
        }

    @staticmethod
    def team3_typhoon_data():
        """Team 3: Typhoon preparedness data collection"""
        print("🌀 Team 3: Typhoon Preparedness & Emergency Management")
        print("📊 Data sources: Hong Kong Observatory, Security Bureau")

        weather_api = HKOWeatherAPI()

        # Get typhoon information
        typhoon_info = weather_api.get_typhoon_info()

        # Get current weather
        current_weather = weather_api.get_current_weather()

        return {
            'typhoon': typhoon_info,
            'weather': current_weather,
            'status': 'collected'
        }

    @staticmethod
    def team4_waste_data():
        """Team 4: Municipal solid waste charging data collection"""
        print("🗑️ Team 4: Municipal Solid Waste Charging")
        print("📊 Data sources: Environmental Protection Department, Census data")
        print("🔍 Focus: Waste generation patterns, charging effectiveness")
        return {"status": "ready", "data_sources": ["EPD", "Census", "FEHD"]}

    @staticmethod
    def team5_green_community_data():
        """Team 5: Green @ Community data collection"""
        print("🌱 Team 5: Green @ Community Initiatives")
        print("📊 Data sources: Environmental Protection Department, Home Affairs")
        print("🔍 Focus: Community engagement, environmental impact")
        return {"status": "ready", "data_sources": ["EPD", "HAD", "DevB"]}

    @staticmethod
    def team6_bus_stop_data():
        """Team 6: Bus stop optimization data collection"""
        print("🚏 Team 6: Bus Stop Optimization")
        print("📊 Data sources: Transport Department, Planning Department")

        transport_api = TransportAPI()

        # Get bus stops
        stops = transport_api.get_bus_stops()

        # Get bus routes
        routes = transport_api.get_bus_routes()

        return {
            'stops': stops,
            'routes': routes,
            'status': 'collected'
        }

# Test team data collection
print("🎯 Testing Team Data Collection Functions")
print("=" * 50)

# Test each team's data collection
teams = [
    TeamDataCollectors.team1_flu_shot_data,
    TeamDataCollectors.team2_bus_route_data,
    TeamDataCollectors.team3_typhoon_data,
    TeamDataCollectors.team4_waste_data,
    TeamDataCollectors.team5_green_community_data,
    TeamDataCollectors.team6_bus_stop_data
]

for i, team_func in enumerate(teams, 1):
    print(f"\n--- Team {i} ---")
    result = team_func()
    print(f"Status: {result.get('status', 'unknown')}")
    if 'data_sources' in result:
        print(f"Data sources: {', '.join(result['data_sources'])}")

print("\n✅ All team data collection functions tested!")
```
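
The collection functions return their results in memory only, so anything fetched from a live API is lost when the notebook closes. One possible way to keep it is to write each result to a timestamped JSON file, sketched below. The `save_snapshot` helper, the file naming scheme, and the temporary output folder are all assumptions made for illustration, not part of the workshop templates.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_snapshot(result, team_name, out_dir):
    """Write one collection result to <out_dir>/<team_name>_<UTC timestamp>.json."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = folder / f"{team_name}_{stamp}.json"
    with path.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps Chinese stop and route names readable in the file
        json.dump(result, handle, ensure_ascii=False, indent=2)
    return path


# Stand-in result shaped like the dictionaries returned by the team functions
example_result = {"status": "ready", "data_sources": ["EPD", "Census", "FEHD"]}
saved_path = save_snapshot(example_result, "team4", out_dir=tempfile.mkdtemp())
print(f"Saved to {saved_path}")
```

In a real project the output folder would be a fixed path inside the team folder rather than a temporary directory, so that successive snapshots accumulate in one place.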


## Part 5: Data Processing and Visualization

### 5.1 Data Cleaning and Processing

Let's create utilities for processing the collected data.


```python
# Data Processing and Visualization Utilities
class DataProcessor:
    """Data processing and cleaning utilities"""

    @staticmethod
    def clean_government_data(df):
        """Clean government data with common issues"""
        if df is None or df.empty:
            return df

        # Remove duplicates (returns a new DataFrame, so the caller's data is untouched)
        df = df.drop_duplicates()

        # Handle missing values: forward-fill, then back-fill any leading gaps
        df = df.ffill().bfill()

        # Standardize date formats; values that cannot be parsed become NaT
        text_columns = df.select_dtypes(include=['object']).columns
        for col in text_columns:
            if 'date' in col.lower() or 'time' in col.lower():
                df[col] = pd.to_datetime(df[col], errors='coerce')

        # Remove extreme outliers (rows above the 99th percentile of any numeric column)
        numeric_columns = df.select_dtypes(include=['number']).columns
        for col in numeric_columns:
            if len(df[col].dropna()) > 0:
                q99 = df[col].quantile(0.99)
                df = df[df[col] <= q99]

        return df

    @staticmethod
    def create_time_series(df, date_col, value_col, freq='D'):
        """Create time series data"""
        if df is None or df.empty:
            return None

        # Work on a copy so the caller's DataFrame is not modified
        df = df.copy()
        df[date_col] = pd.to_datetime(df[date_col])
        df = df.set_index(date_col)
        ts = df[value_col].resample(freq).mean()
        return ts

class DataVisualizer:
    """Data visualization utilities"""

    @staticmethod
    def plot_time_series(data, title, xlabel, ylabel):
        """Plot time series data"""
        plt.figure(figsize=(12, 6))
        plt.plot(data.index, data.values)
        plt.title(title)
        plt.xlabel(xlabel)
        plt.ylabel(ylabel)
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.show()

    @staticmethod
    def plot_correlation_matrix(df, title):
        """Plot correlation matrix"""
        if df is None or df.empty:
            print("No data to plot")
            return

        plt.figure(figsize=(10, 8))
        # numeric_only=True skips date and text columns, which cannot be correlated
        correlation_matrix = df.corr(numeric_only=True)
        sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
        plt.title(title)
        plt.tight_layout()
        plt.show()

    @staticmethod
    def plot_distribution(data, title, bins=30):
        """Plot data distribution"""
        plt.figure(figsize=(10, 6))
        plt.hist(data, bins=bins, alpha=0.7, edgecolor='black')
        plt.title(title)
        plt.xlabel('Value')
        plt.ylabel('Frequency')
        plt.tight_layout()
        plt.show()

# Example: Create sample data and demonstrate processing
print("📊 Data Processing and Visualization Demo")
print("=" * 50)

# Create sample data (random values, for demonstration only)
sample_data = {
    'date': pd.date_range('2025-01-01', periods=100, freq='D'),
    'temperature': np.random.normal(25, 5, 100),
    'humidity': np.random.normal(70, 10, 100),
    'pressure': np.random.normal(1013, 20, 100)
}

df = pd.DataFrame(sample_data)
print(f"📈 Sample dataset created: {df.shape}")
print(f"📅 Date range: {df['date'].min()} to {df['date'].max()}")

# Clean the data
df_clean = DataProcessor.clean_government_data(df)
print(f"🧹 Data cleaned: {df_clean.shape}")

# Create time series
ts = DataProcessor.create_time_series(df_clean, 'date', 'temperature')
print(f"📈 Time series created: {len(ts)} data points")

print("✅ Data processing utilities ready!")
```
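
The `freq` argument of `create_time_series` is passed straight to pandas `resample`, so the same data can be summarised at a coarser interval. The sketch below shows what weekly averaging does to two weeks of daily values. The readings are made-up numbers chosen so the result is easy to check by hand.

```python
import pandas as pd

# Fourteen made-up daily readings; 2025-01-06 is a Monday
daily = pd.Series(
    range(1, 15),
    index=pd.date_range("2025-01-06", periods=14, freq="D"),
    name="reading",
)

# 'W' groups the days into weeks ending on Sunday and averages each group
weekly = daily.resample("W").mean()
print(weekly)
```

The first week holds the values 1 to 7 and the second 8 to 14, so the weekly means are 4 and 11, each labelled with the Sunday that closes the week.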


## Part 6: Workshop Summary and Next Steps

### 6.1 What We've Learned

In this workshop, we've covered:

1. **Government APIs**: Accessing Hong Kong Observatory and Transport Department data
2. **Web Scraping**: Using BeautifulSoup to extract data from websites
3. **Data Processing**: Cleaning and organizing collected data
4. **Team-Specific Applications**: Customized data collection for each project

### 6.2 Next Steps for Your Team Project

1. **Choose Your Data Sources**: Select the most relevant APIs and websites for your project
2. **Implement Data Collection**: Use the templates provided to collect your data
3. **Clean and Process**: Apply the data cleaning utilities to prepare your data
4. **Analyze and Visualize**: Use the visualization tools to explore your data
5. **Document Your Process**: Keep track of your data collection methods and sources
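
For ongoing monitoring, steps 2 and 5 can be combined by polling a source at a fixed interval and stamping each result with its collection time. The sketch below is one possible shape for such a loop. `collect_periodically` is a hypothetical helper, and the fetcher is a stand-in that returns a counter instead of calling a real API.

```python
import time
from datetime import datetime, timezone


def collect_periodically(fetch, interval_seconds, rounds, sleep=time.sleep):
    """Call fetch() `rounds` times, pausing between calls, and timestamp each result."""
    records = []
    for round_number in range(rounds):
        records.append({
            "collected_at": datetime.now(timezone.utc).isoformat(),
            "payload": fetch(),
        })
        if round_number < rounds - 1:
            # No pause is needed after the final round
            sleep(interval_seconds)
    return records


# Stand-in fetcher that returns 0, 1, 2, ... on successive calls
counter = iter(range(100))
records = collect_periodically(
    lambda: {"reading": next(counter)}, interval_seconds=0, rounds=3
)
for record in records:
    print(record)
```

In practice `fetch` would be a method such as `weather_api.get_current_weather`, and the interval should be long enough to stay well within the source's rate limits.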

### 6.3 Resources and Support

- **Code Templates**: Available in `Week7_Code_Templates.py`
- **Quick Reference**: See `Week7_Quick_Reference.md`
- **Workshop Summary**: Review `Week7_Workshop_Summary.md`
- **Team Materials**: Check your team folder for project-specific resources

### 6.4 Legal and Ethical Considerations

Remember to:
- **Respect rate limits** when accessing APIs
- **Follow robots.txt** guidelines for web scraping
- **Cite your data sources** in your final report
- **Be mindful of data privacy** and usage terms
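
Checking `robots.txt` can be done in code before a scraper requests any page. The sketch below uses the standard library's `urllib.robotparser` on a made-up rules file, parsed locally so it runs without network access. The user agent string and the example URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, parsed locally so no request is sent
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
]

parser = RobotFileParser()
parser.parse(robots_lines)

user_agent = "GCAP3226-student-bot"  # hypothetical identifier for a course crawler
candidate_urls = [
    "https://www.example.gov.hk/news/",
    "https://www.example.gov.hk/private/report",
]
for url in candidate_urls:
    verdict = "allowed" if parser.can_fetch(user_agent, url) else "blocked"
    print(f"{url} -> {verdict}")

# Seconds the site asks crawlers to wait between requests (None if not stated)
print("Requested delay:", parser.crawl_delay(user_agent))
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` loads the real file instead, and the reported crawl delay can replace the fixed pause in the scraper.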

---

## 🎉 Workshop Complete!

You now have the tools and knowledge to collect data for your GCAP3226 team project. Use the templates and examples provided to build your data collection pipeline.

**Good luck with your projects!** 🚀
