# Checkee.info Data Scraper

This notebook scrapes visa case data from [checkee.info](https://checkee.info) and combines data from multiple months into a single CSV file.

**Data Range**: January 2024 - January 2026

**Output Columns**: ID, visa_type, visa_entry, US_consulate, major, status, check_date, complete_date, waiting_days, details

## 1. Imports & Setup

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from datetime import datetime

# Configuration
BASE_URL = "https://checkee.info/main.php"
START_YEAR = 2024
START_MONTH = 1
END_YEAR = 2026
END_MONTH = 1
REQUEST_DELAY = 1.5  # seconds between requests (be respectful to the server)

# Headers to mimic a real browser (prevents 403 Forbidden)
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}

print("Setup complete!")

Setup complete!


## 2. Scraping Function

In [3]:
def scrape_month(year: int, month: int) -> list[dict]:
    """
    Scrape all visa case records for a specific month.
    
    Args:
        year: The year (e.g., 2024)
        month: The month (1-12)
    
    Returns:
        List of dictionaries, each representing a case record
    """
    date_str = f"{year}-{month:02d}"
    url = f"{BASE_URL}?dispdate={date_str}"
    
    # Retry logic with exponential backoff
    max_retries = 3
    response = None
    
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=HEADERS, timeout=30)
            response.raise_for_status()
            break  # Success! Exit the retry loop
        except requests.exceptions.RequestException:
            if attempt < max_retries - 1:
                wait_time = 1 * (2 ** attempt)  # Exponential backoff: 1s, then 2s (no wait after the last attempt)
                time.sleep(wait_time)
            else:
                raise  # Final attempt failed, propagate error
    
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find the 7th table (XPath: /html/body/table[7] -> index 6 in Python)
    tables = soup.find_all('table')
    if len(tables) < 7:
        return []
    table = tables[6]  # 0-indexed, so table[7] in XPath = index 6
    
    records = []
    rows = table.find_all('tr')
    
    # Skip header row, process data rows
    for row in rows[1:]:
        cells = row.find_all('td')
        if len(cells) >= 10:
            # cells[0] is the Update column, so the record fields start at cells[1]
            record = {
                'ID': cells[1].get_text(strip=True),
                'visa_type': cells[2].get_text(strip=True),
                'visa_entry': cells[3].get_text(strip=True),
                'US_consulate': cells[4].get_text(strip=True),
                'major': cells[5].get_text(strip=True),
                'status': cells[6].get_text(strip=True),
                'check_date': cells[7].get_text(strip=True),
                'complete_date': cells[8].get_text(strip=True),
                'waiting_days': cells[9].get_text(strip=True),
                'details': cells[10].get_text(strip=True) if len(cells) > 10 else ''
            }
            records.append(record)
    
    return records

print("Scraping function defined!")

Scraping function defined!
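
The retry loop in `scrape_month` sleeps only between attempts, so with `max_retries = 3` there are at most two waits before the error propagates. A minimal sketch of that schedule follows; `backoff_delays` is a hypothetical helper for illustration, not part of the notebook.

```python
def backoff_delays(max_retries: int, base: float = 1.0) -> list[float]:
    """Return the sleep durations between attempts (hypothetical helper).

    No sleep follows the final attempt, so the list has max_retries - 1 entries.
    """
    return [base * (2 ** attempt) for attempt in range(max_retries - 1)]


delays = backoff_delays(3)
print(delays)       # [1.0, 2.0]
print(sum(delays))  # 3.0, the worst-case extra wait per month in seconds
```

Raising `max_retries` to 4 would add a third wait of 4 seconds under the same doubling rule.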


## 3. Test Scraping (Single Month)

Before running the full scrape, let's test on a single month to verify the parsing works correctly.

In [4]:
# Test on December 2025
test_records = scrape_month(2025, 12)
print(f"Found {len(test_records)} records for December 2025")

# Display first few records
if test_records:
    test_df = pd.DataFrame(test_records)
    print("\nColumns:", test_df.columns.tolist())
    print("\nFirst 5 records:")
    display(test_df.head())

Found 154 records for December 2025

Columns: ['ID', 'visa_type', 'visa_entry', 'US_consulate', 'major', 'status', 'check_date', 'complete_date', 'waiting_days', 'details']

First 5 records:


|   | ID | visa_type | visa_entry | US_consulate | major | status | check_date | complete_date | waiting_days | details |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 二进宫让我 | B1 | New | ShangHai | Spanish, Management | Clear | 2025-12-01 | 2025-12-03 | 2 | detail |
| 1 | 鸡飞助我 | H1 | New | GuangZhou | Computer Science | Pending | 2025-12-01 | 0000-00-00 | 39 | detail |
| 2 | 孜然牛肉 | H1 | New | BeiJing | EE/BME | Pending | 2025-12-01 | 0000-00-00 | 39 | detail |
| 3 | 东京h1b | H1 | Renewal | Others | EE | Pending | 2025-12-01 | 0000-00-00 | 39 | detail |
| 4 | 疯狂海豚 | H1 | New | GuangZhou | CS | Pending | 2025-12-01 | 0000-00-00 | 39 | detail |
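
Eyeballing the first five rows catches gross parsing errors, but a column shift deeper in the table could slip through. The sketch below shows one way to check each record programmatically. `record_problems` is a hypothetical helper, and the rule that `check_date` must fall inside the requested month is an assumption inferred from the sample output, not something the site documents.

```python
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def record_problems(record: dict, year: int, month: int) -> list[str]:
    """Return a list of human-readable problems; empty means the row looks sane."""
    problems = []
    if not record["ID"]:
        problems.append("empty ID")
    if not DATE_RE.match(record["check_date"]):
        problems.append("check_date is not YYYY-MM-DD")
    elif not record["check_date"].startswith(f"{year}-{month:02d}-"):
        # Assumption: a month page lists only cases checked in that month.
        problems.append("check_date outside requested month")
    if not record["waiting_days"].isdigit():
        problems.append("waiting_days is not a whole number")
    return problems


# One row copied from the December 2025 test output.
sample = {"ID": "鸡飞助我", "check_date": "2025-12-01", "waiting_days": "39"}
print(record_problems(sample, 2025, 12))  # []
print(record_problems({**sample, "waiting_days": ""}, 2025, 12))
# ['waiting_days is not a whole number']
```

Running such a check over `test_records` before the full scrape would flag a misaligned column mapping early.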


## 4. Main Scraping Loop

In [5]:
def generate_month_range(start_year, start_month, end_year, end_month):
    """Generate a list of (year, month) tuples for the date range."""
    months = []
    year, month = start_year, start_month
    
    while (year, month) <= (end_year, end_month):
        months.append((year, month))
        month += 1
        if month > 12:
            month = 1
            year += 1
    
    return months

# Generate list of months to scrape
months_to_scrape = generate_month_range(START_YEAR, START_MONTH, END_YEAR, END_MONTH)
print(f"Will scrape {len(months_to_scrape)} months: {months_to_scrape[0]} to {months_to_scrape[-1]}")

Will scrape 25 months: (2024, 1) to (2026, 1)
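
As a cross-check on the month count, the same range can be derived by mapping each `(year, month)` pair to a single month index. This is a sketch of an alternative formulation, not the notebook's own function; `month_range` is a hypothetical name.

```python
def month_range(start_year, start_month, end_year, end_month):
    """Yield (year, month) tuples from start to end, inclusive."""
    start = start_year * 12 + (start_month - 1)
    end = end_year * 12 + (end_month - 1)
    for index in range(start, end + 1):
        year, month0 = divmod(index, 12)
        yield year, month0 + 1


months = list(month_range(2024, 1, 2026, 1))
print(len(months), months[0], months[-1])  # 25 (2024, 1) (2026, 1)
```

Both approaches agree on 25 months for January 2024 through January 2026.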


In [6]:
# Scrape all months
all_records = []
failed_months = []

for i, (year, month) in enumerate(months_to_scrape):
    date_str = f"{year}-{month:02d}"
    print(f"[{i+1}/{len(months_to_scrape)}] Scraping {date_str}...", end=" ")
    
    try:
        records = scrape_month(year, month)
        all_records.extend(records)
        print(f"Found {len(records)} records")
    except Exception as e:
        print(f"FAILED: {e}")
        failed_months.append((year, month, str(e)))
    
    # Respect the server with a delay between requests
    if i < len(months_to_scrape) - 1:
        time.sleep(REQUEST_DELAY)

print(f"\n{'='*50}")
print("Scraping complete!")
print(f"Total records collected: {len(all_records)}")
if failed_months:
    print(f"Failed months: {failed_months}")

[1/25] Scraping 2024-01... Found 180 records
[2/25] Scraping 2024-02... Found 121 records
[3/25] Scraping 2024-03... Found 167 records
[4/25] Scraping 2024-04... Found 219 records
[5/25] Scraping 2024-05... Found 310 records
[6/25] Scraping 2024-06... Found 213 records
[7/25] Scraping 2024-07... Found 158 records
[8/25] Scraping 2024-08... Found 145 records
[9/25] Scraping 2024-09... Found 137 records
[10/25] Scraping 2024-10... Found 138 records
[11/25] Scraping 2024-11... Found 179 records
[12/25] Scraping 2024-12... Found 420 records
[13/25] Scraping 2025-01... Found 172 records
[14/25] Scraping 2025-02... Found 142 records
[15/25] Scraping 2025-03... Found 159 records
[16/25] Scraping 2025-04... Found 183 records
[17/25] Scraping 2025-05... Found 264 records
[18/25] Scraping 2025-06... Found 164 records
[19/25] Scraping 2025-07... Found 131 records
[20/25] Scraping 2025-08... Found 114 records
[21/25] Scraping 2025-09... Found 108 records
[22/25] Scraping 2025-10... Found 187 records
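
If any month ends up in `failed_months`, it can be retried without repeating the whole run. The sketch below assumes the `(year, month, error)` tuple layout used by the loop above. `retry_failed` and `fake_fetch` are hypothetical names, and `fake_fetch` only stands in for `scrape_month` so the example runs offline.

```python
def retry_failed(failed_months, fetch):
    """Re-run fetch(year, month) for each failed month (hypothetical helper).

    Returns (records, still_failed); `fetch` stands in for scrape_month.
    """
    records, still_failed = [], []
    for year, month, _error in failed_months:
        try:
            records.extend(fetch(year, month))
        except Exception as exc:
            still_failed.append((year, month, str(exc)))
    return records, still_failed


# Stand-in fetcher for demonstration only: succeeds for every month but March.
def fake_fetch(year, month):
    if month == 3:
        raise RuntimeError("timeout")
    return [{"ID": f"case-{year}-{month:02d}"}]


recovered, remaining = retry_failed(
    [(2024, 2, "HTTP 503"), (2024, 3, "HTTP 503")], fake_fetch
)
print(len(recovered), remaining)  # 1 [(2024, 3, 'timeout')]
```

In the notebook this would be called as `retry_failed(failed_months, scrape_month)`, ideally with a `time.sleep(REQUEST_DELAY)` between calls.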

## 5. Data Processing & Export

In [7]:
# Convert to DataFrame
df = pd.DataFrame(all_records)

# Display basic info
print("DataFrame Info:")
print(f"Shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print("\nData types:")
print(df.dtypes)
print("\nFirst few rows:")
display(df.head())

DataFrame Info:
Shape: (4326, 10)

Columns: ['ID', 'visa_type', 'visa_entry', 'US_consulate', 'major', 'status', 'check_date', 'complete_date', 'waiting_days', 'details']

Data types:
ID               object
visa_type        object
visa_entry       object
US_consulate     object
major            object
status           object
check_date       object
complete_date    object
waiting_days     object
details          object
dtype: object

First few rows:


|   | ID | visa_type | visa_entry | US_consulate | major | status | check_date | complete_date | waiting_days | details |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | MYOS | F1 | Renewal | BeiJing | Optical Science | Clear | 2024-01-02 | 2024-02-20 | 49 | detail |
| 1 | wobalaba | F1 | Renewal | Vancouver | Biochemistry | Clear | 2024-01-02 | 2024-01-10 | 8 | detail |
| 2 | 顺利过！ | F1 | Renewal | BeiJing | Biomedical science | Clear | 2024-01-02 | 2024-01-02 | 0 | detail |
| 3 | 嗞油美利坚 | F1 | Renewal | ShangHai | Plant Sciences | Clear | 2024-01-02 | 2024-02-09 | 38 | detail |
| 4 | 快点吧 | H1 | New | BeiJing | CS | Clear | 2024-01-02 | 2024-01-26 | 24 | detail |


In [8]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print("\nEmpty string counts:")
print((df == '').sum())

Missing values per column:
ID               0
visa_type        0
visa_entry       0
US_consulate     0
major            0
status           0
check_date       0
complete_date    0
waiting_days     0
details          0
dtype: int64

Empty string counts:
ID               0
visa_type        0
visa_entry       0
US_consulate     0
major            0
status           0
check_date       0
complete_date    0
waiting_days     0
details          0
dtype: int64
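
Both checks report zero, yet the December 2025 sample shows `complete_date` as `0000-00-00` for pending cases. That placeholder is neither null nor an empty string, so it passes unnoticed. A minimal sketch of handling it with the standard library follows; `parse_case_date` is a hypothetical helper, and treating the placeholder as "no date yet" is an assumption based on the sample rows.

```python
from datetime import date

PENDING_PLACEHOLDER = "0000-00-00"  # seen in complete_date for Pending rows


def parse_case_date(text: str):
    """Return a date, or None for the placeholder (hypothetical helper)."""
    if text == PENDING_PLACEHOLDER:
        return None
    return date.fromisoformat(text)


check = parse_case_date("2025-12-01")
complete = parse_case_date("2025-12-03")
print((complete - check).days)        # 2
print(parse_case_date("0000-00-00"))  # None
```

The two-day difference matches the `waiting_days` value of the first sample row, which suggests the column is the gap between the two dates for cleared cases.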


In [9]:
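# Save to CSV (create the output directory first so to_csv does not fail)
from pathlib import Path

output_path = "data/checkee_data.csv"
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
df.to_csv(output_path, index=False)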
print(f"Data saved to {output_path}")
print(f"Total records: {len(df)}")

Data saved to data/checkee_data.csv
Total records: 4326
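
Some `major` values contain commas (for example `Spanish, Management`), so the CSV relies on quoting to keep such fields intact. The sketch below illustrates the round trip with the standard-library `csv` module, which uses minimal quoting by default, on an in-memory buffer. It is an illustration only and does not read the file written above.

```python
import csv
import io

# A subset of one sample row, including a field with an embedded comma.
row = {"ID": "二进宫让我", "major": "Spanish, Management", "waiting_days": "2"}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

buffer.seek(0)
restored = next(csv.DictReader(buffer))
print(restored == row)  # True
```

Note that every value comes back as a string, so numeric columns such as `waiting_days` need an explicit conversion after loading.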


## 6. Quick Data Exploration

In [11]:
# Summary statistics
print("Status distribution:")
print(df['status'].value_counts())

print("\nVisa type distribution:")
print(df['visa_type'].value_counts())

print("\nTop 10 consulates:")
print(df['US_consulate'].value_counts().head(10))

Status distribution:
status
Clear      2707
Pending    1584
Reject       35
Name: count, dtype: int64

Visa type distribution:
visa_type
F1    1522
H1    1321
B1     482
J1     456
B2     194
L1     131
O1     115
H4      65
J2      17
L2      13
F2       8
O2       2
Name: count, dtype: int64

Top 10 consulates:
US_consulate
BeiJing      1219
GuangZhou     895
ShangHai      873
Others        427
ShenYang      214
Vancouver     193
HongKong      173
Europe        155
Toronto        96
Calgary        33
Name: count, dtype: int64
