# CC5: Web Scraping - India State-wise Development Data

**Objective**: Scrape India's state-wise Human Development Index (HDI) data from Wikipedia and convert it to tidy (long) format for visualization.

**Data Source**: Wikipedia - List of Indian states and union territories by Human Development Index

**Tools**: BeautifulSoup, Pandas, Requests

In [None]:
# Install required libraries
!pip install beautifulsoup4 pandas requests lxml

In [None]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

## Step 1: Scrape the Data

In [None]:
# Target URL
url = 'https://en.wikipedia.org/wiki/List_of_Indian_states_and_union_territories_by_Human_Development_Index'

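# Send GET request
# Wikimedia asks clients to send a descriptive User-Agent; the string below is a placeholder, replace it with your own
headers = {'User-Agent': 'CC5-HDI-notebook/0.1 (educational use)'}
response = requests.get(url, headers=headers, timeout=30)
print(f"Status Code: {response.status_code}")
response.raise_for_status()  # Stop on a 4xx/5xx response instead of parsing an error page

# Parse HTML
soup = BeautifulSoup(response.content, 'html.parser')
print("Successfully parsed HTML")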

In [None]:
# Find the table with HDI data
# Looking for tables with class 'wikitable'
tables = soup.find_all('table', {'class': 'wikitable'})
print(f"Found {len(tables)} tables")

# Display first few rows of the first table to verify
if tables:
    from io import StringIO  # pandas 2.1+ deprecates passing a raw HTML string to read_html
    df_raw = pd.read_html(StringIO(str(tables[0])))[0]
    print("\nFirst table preview:")
    print(df_raw.head())
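
The preview above assumes the HDI table is `tables[0]`, which may not hold if the page lists a legend or another table first. A minimal sketch of picking the table by column name instead is shown here. The helper names `flatten_name` and `pick_table`, the keyword `"hdi"`, and the sample frames are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def flatten_name(col):
    """Turn a column label (str or MultiIndex tuple) into one lowercase string."""
    parts = col if isinstance(col, tuple) else (col,)
    return " ".join(str(p) for p in parts).lower()

def pick_table(frames, keyword="hdi"):
    """Return the index of the first frame with a column name containing keyword, or None."""
    for i, frame in enumerate(frames):
        if any(keyword in flatten_name(c) for c in frame.columns):
            return i
    return None

# Hypothetical stand-ins for the parsed tables; all values are placeholders
frames = [
    pd.DataFrame({"Legend": ["high", "low"]}),
    pd.DataFrame({"Rank": [1, 2], "State/UT": ["State A", "State B"], "HDI (2021)": [0.7, 0.6]}),
]
idx = pick_table(frames)
print(idx)
```

In the notebook, `frames` would be built by running `pd.read_html` on each entry of `tables`, and the returned index would replace the hard-coded `0`.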

## Step 2: Clean and Normalize the Data

In [None]:
# Extract the relevant table (adjust index if needed based on preview)
from io import StringIO  # pandas 2.1+ deprecates passing a raw HTML string to read_html
df = pd.read_html(StringIO(str(tables[0])))[0]

# Display raw data structure
print("Raw columns:")
print(df.columns.tolist())
print("\nRaw data shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

In [None]:
# Clean column names - remove multi-level headers if present
if isinstance(df.columns, pd.MultiIndex):
    df.columns = ['_'.join(col).strip() for col in df.columns.values]

# Rename columns for clarity (adjust based on actual column names)
# Typical structure: Rank, State/UT, HDI value, etc.

# Select relevant columns - adjust based on your data
# Example: keeping state name and HDI value columns
print("\nCleaned columns:")
print(df.columns.tolist())
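
Joining MultiIndex levels with `_` can produce names such as `Rank_Rank` when a header spans two rows with the same text. The sketch below shows one possible way to flatten headers into short snake_case names while dropping repeated levels and footnote markers. The function name `flatten_columns` and the sample header are assumptions for illustration; the real column labels depend on the current state of the Wikipedia page.

```python
import re
import pandas as pd

def flatten_columns(columns):
    """Flatten column labels to snake_case strings, dropping repeated levels and footnote markers."""
    flat = []
    for col in columns:
        parts = col if isinstance(col, tuple) else (col,)
        seen = []
        for part in parts:
            text = re.sub(r"\[.*?\]", "", str(part)).strip()
            # Skip empty levels, pandas' "Unnamed: ..." fillers, and levels repeating the previous one
            if not text or text.startswith("Unnamed") or (seen and seen[-1] == text):
                continue
            seen.append(text)
        # Replace every run of non-alphanumeric characters with a single underscore
        name = re.sub(r"[^0-9a-zA-Z]+", "_", " ".join(seen)).strip("_").lower()
        flat.append(name)
    return flat

# Hypothetical two-row header, similar in shape to what read_html can return
cols = pd.MultiIndex.from_tuples([
    ("Rank", "Rank"),
    ("State/UT", "State/UT"),
    ("HDI[a]", "2021"),
])
print(flatten_columns(cols))
```

Assigning the result back with `df.columns = flatten_columns(df.columns)` would give names that are easier to select explicitly than positional indices.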

In [None]:
# Clean the data
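# Drop rows missing a value in the second or third column (assumed here to be state name and HDI)
# .copy() avoids a SettingWithCopyWarning when the columns are modified below
df_clean = df.dropna(subset=[df.columns[1], df.columns[2]]).copy()  # Adjust column indices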

# Remove footnote markers and special characters
# Clean state names
if 'State' in df_clean.columns or 'State/UT' in df_clean.columns:
    state_col = 'State' if 'State' in df_clean.columns else 'State/UT'
    df_clean[state_col] = df_clean[state_col].str.replace(r'\[.*?\]', '', regex=True)
    df_clean[state_col] = df_clean[state_col].str.strip()

print("\nCleaned data:")
print(df_clean.head(10))
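
The footnote cleanup above only runs when a column is named exactly `State` or `State/UT`. As a rough sketch, the same cleanup can be wrapped in a helper that works on any text column and also handles stray whitespace. The name `clean_text_column` and the sample values are made up for illustration.

```python
import pandas as pd

def clean_text_column(series):
    """Remove bracketed footnote markers and tidy whitespace in a text column."""
    return (
        series.astype(str)
        .str.replace(r"\[.*?\]", "", regex=True)   # markers such as "[a]", "[12]", "[note 3]"
        .str.replace("\u00a0", " ", regex=False)   # non-breaking spaces left over from HTML
        .str.replace(r"\s+", " ", regex=True)      # collapse runs of whitespace
        .str.strip()
    )

# Placeholder values, not taken from the scraped table
raw = pd.Series(["State A[a]", "  State\u00a0B [note 1]", "State C"])
cleaned = clean_text_column(raw)
print(cleaned.tolist())
```

It could then be applied by position, for example to `df_clean.iloc[:, 1]`, so the cleanup does not depend on the exact header text.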

## Step 3: Convert to TIDY (Long) Format

In [None]:
# Create a simplified tidy dataset
# Select key columns: State, HDI, Rank

# Adjust column selection based on actual data structure
tidy_df = df_clean.iloc[:, [1, 2]].copy()  # Typically: State name and HDI value
tidy_df.columns = ['state', 'hdi']  # Standardize column names

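# Convert HDI to numeric; strip footnote markers such as "[a]" first so those rows are not coerced to NaN and dropped
tidy_df['hdi'] = tidy_df['hdi'].astype(str).str.replace(r'\[.*?\]', '', regex=True).str.strip()
tidy_df['hdi'] = pd.to_numeric(tidy_df['hdi'], errors='coerce')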

# Remove any remaining null values
tidy_df = tidy_df.dropna()

# Add year column (metadata)
tidy_df['year'] = 2021  # Adjust based on Wikipedia data year

# Sort by HDI value descending
tidy_df = tidy_df.sort_values('hdi', ascending=False).reset_index(drop=True)

print("\nTIDY FORMAT DATA:")
print(tidy_df.head(15))
print(f"\nTotal states/UTs: {len(tidy_df)}")
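
The cell above keeps a single HDI column, so each row is already one observation. If the scraped table instead carries one HDI column per year (an assumption; the actual layout is not confirmed here), reshaping from wide to long is what makes it tidy. A minimal sketch with `DataFrame.melt`, using placeholder state names and values that are not real HDI figures:

```python
import pandas as pd

# Placeholder values for illustration only; these are not real HDI figures
wide = pd.DataFrame({
    "state": ["State A", "State B"],
    "2019": [0.70, 0.60],
    "2021": [0.71, 0.62],
})

# One row per (state, year) pair instead of one column per year
long_df = wide.melt(id_vars="state", var_name="year", value_name="hdi")
long_df["year"] = long_df["year"].astype(int)
long_df = long_df.sort_values(["state", "year"]).reset_index(drop=True)
print(long_df)
```

With this shape the `year` column comes from the data itself rather than from a hard-coded constant.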

## Step 4: Validate Data Quality

In [None]:
# Check data quality
print("Data Quality Checks:")
print(f"- Missing values: {tidy_df.isnull().sum().sum()}")
print(f"- HDI range: {tidy_df['hdi'].min():.3f} to {tidy_df['hdi'].max():.3f}")
print(f"- Data types:\n{tidy_df.dtypes}")

# Display summary statistics
print("\nSummary Statistics:")
print(tidy_df['hdi'].describe())
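
The printed checks above rely on someone reading the output. One possible way to make them explicit is a small function that returns a list of problems, sketched below. It assumes the `state` and `hdi` column names used in this notebook and relies on HDI being defined on a 0 to 1 scale; the name `validate_hdi` and the sample rows are illustrative.

```python
import pandas as pd

def validate_hdi(df):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if df[["state", "hdi"]].isnull().any().any():
        problems.append("missing values in state or hdi")
    if not df["hdi"].dropna().between(0, 1).all():
        problems.append("hdi outside the 0 to 1 range")
    if df["state"].duplicated().any():
        problems.append("duplicate state names")
    return problems

# Placeholder rows, not real figures
good = pd.DataFrame({"state": ["State A", "State B"], "hdi": [0.7, 0.6]})
bad = pd.DataFrame({"state": ["State A", "State A"], "hdi": [0.7, 1.4]})
print(validate_hdi(good))
print(validate_hdi(bad))
```

Calling `validate_hdi(tidy_df)` before the export step would surface a wrong column selection, for example a rank column picked up as `hdi`.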

## Step 5: Export to CSV (Tidy Format)

In [None]:
# Export to CSV
tidy_df.to_csv('india_state_hdi.csv', index=False)
print("\n✅ Data exported to 'india_state_hdi.csv'")

# Display final dataset
print("\nFinal TIDY dataset:")
print(tidy_df)
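
A quick way to confirm the export is usable downstream is to read the CSV back and compare it with the original frame. The sketch below does this round trip in memory with `io.StringIO` so that it does not touch the file system; the sample rows are placeholders.

```python
import io
import pandas as pd

# Placeholder rows, not real figures
tidy = pd.DataFrame({"state": ["State A", "State B"], "hdi": [0.7, 0.6], "year": [2021, 2021]})

# Write to an in-memory buffer, then read it back as if it were the CSV file
buffer = io.StringIO()
tidy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises AssertionError if values, column order, or dtypes changed in the round trip
pd.testing.assert_frame_equal(tidy, reloaded)
print(reloaded.dtypes)
```

The same comparison works against the real file by replacing the buffer with `pd.read_csv('india_state_hdi.csv')`.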

In [None]:
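# Download the file (Google Colab only; elsewhere the CSV is already in the working directory)
try:
    from google.colab import files
    files.download('india_state_hdi.csv')
except ImportError:
    print("Not running in Colab; 'india_state_hdi.csv' was saved to the current working directory.")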

## Summary

**What we did:**
1. Scraped the Wikipedia table containing India's state-wise HDI data
2. Cleaned the data: stripped footnote markers from state names and dropped rows with missing values
3. Normalized to tidy format: each row is one observation (a state), each column is one variable
4. Exported the result as CSV for use in a Vega-Lite visualization

**Challenges:**
- Wikipedia tables often have multi-level headers that require careful parsing
- Footnote markers and special characters had to be removed
- Scraped values arrive as text and had to be converted to numeric types before analysis

**Tidy Data Principles Applied:**
- ✅ Each variable forms a column (state, hdi, year)
- ✅ Each observation forms a row (one state per row)
- ✅ Ready for visualization in Vega-Lite