# US Population Data Exploration and Preprocessing (SEER Format)

This notebook explores the US Population data in **SEER fixed-width format** and prepares it for analysis.

## SEER Fixed-Width Format Specification:

**File Format:** Fixed length ASCII text (26 bytes per record)

| Variable | Start Column | Length | Type | Description |
|----------|--------------|--------|------|-------------|
| Year | 1 | 4 | numeric | 1969, 1970, ... 2023 |
| State Abbrev | 5 | 2 | character | AL, AK, ... (KR for Katrina evacuees) |
| State FIPS | 7 | 2 | numeric | 01, 02, ... (99 for dummy state) |
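| County FIPS | 9 | 3 | numeric | 001, 002, ... (999 for dummy) |
| Registry | 12 | 2 | numeric | SEER registry code; blank in the records previewed in Step 3 and not used in this notebook |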
| Race | 14 | 1 | numeric | 1=White, 2=Black, 3=AIAN, 4=API |
| Origin | 15 | 1 | numeric | 0=Non-Hispanic, 1=Hispanic, 9=N/A |
| Sex | 16 | 1 | numeric | 1=Male, 2=Female |
| Age | 17 | 2 | numeric | 00-90 (single year), 90=90+ |
| Population | 19 | 8 | numeric | Population count |

**Example:** `1969AL01001  1910000000159` (first record of the file; columns 12-13 are blank)
- 1969 = Year
- AL = Alabama
- 01 = State FIPS
- 001 = County code
- 1 = White
- 9 = Origin not applicable
- 1 = Male
- 00 = Age 0
- 00000159 = Population (159)
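
The mapping from the 1-indexed start columns in the table to Python slices can be sketched as follows. This is a minimal illustration, not the parser used later in the notebook; the names `SEER_FIELDS` and `slice_record` are invented for this example, and the record is the first line shown in the Step 3 file preview.

```python
# Field layout copied from the specification table: name -> (start column, length).
# Start columns are 1-indexed, as in the SEER documentation.
SEER_FIELDS = {
    "year": (1, 4),
    "state_abbrev": (5, 2),
    "state_fips": (7, 2),
    "county_fips": (9, 3),
    "race": (14, 1),
    "origin": (15, 1),
    "sex": (16, 1),
    "age": (17, 2),
    "population": (19, 8),
}


def slice_record(record):
    """Split one fixed-width record into a dict of raw string fields."""
    fields = {}
    for name, (start, length) in SEER_FIELDS.items():
        begin = start - 1  # convert the 1-indexed column to a 0-indexed offset
        fields[name] = record[begin:begin + length]
    return fields


# First record shown in the Step 3 preview (26 characters, columns 12-13 blank).
example = "1969AL01001  1910000000159"
fields = slice_record(example)
print(fields)
print(int(fields["population"]))
```

Keeping every field as a string at this stage preserves leading zeros in the FIPS codes; only the population is converted to an integer.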

## Goals:
1. Parse SEER fixed-width format correctly
2. Extract state and county FIPS codes (create 5-digit FIPS)
3. Filter for 2006-2015 (to match ARCOS prescription data)
4. Aggregate by county-year (sum across race, origin, sex, age)
5. Create county-year dataset for merging

## Step 1: Import Required Libraries

In [57]:
import pandas as pd
import os
import gzip

# Configure pandas display options for better table viewing
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', 100)      # Show up to 100 rows
pd.set_option('display.width', None)        # Auto-detect width
pd.set_option('display.max_colwidth', None) # Show full column content
pd.set_option('display.float_format', '{:.2f}'.format)  # Format floats

print("Libraries imported successfully!")
print("Display options configured for Excel-like table view")

Libraries imported successfully!
Display options configured for Excel-like table view


## Step 2: Locate and Examine Population File

In [42]:
# Search for population .gz files in the workspace
import glob

print("Searching for population data files (.gz)...")
print("=" * 60)

# Look for gz files that might contain population data
gz_files = glob.glob('**/*.gz', recursive=True)

# Filter for likely population files ('pop' also matches 'population')
pop_files = [f for f in gz_files if 'pop' in f.lower()]

if pop_files:
    print("Population files found:")
    for file in pop_files:
        file_size = os.path.getsize(file)
        print(f"  {file} ({file_size:,} bytes)")
else:
    print("No population .gz files found with 'population' or 'pop' in name.")
    print("\nAll .gz files in workspace:")
    for file in gz_files[:10]:  # Show first 10
        file_size = os.path.getsize(file)
        print(f"  {file} ({file_size:,} bytes)")
    if len(gz_files) > 10:
        print(f"  ... and {len(gz_files) - 10} more")

Searching for population data files (.gz)...
No population .gz files found with 'population' or 'pop' in name.

All .gz files in workspace:


## Step 3: Preview Population Data (from .gz file)

**Note:** Update the `pop_file_path` variable below with the actual path to your population .gz file.

In [58]:
# Use the population file in data/raw/
pop_file_path = '../data/raw/us.1969_2023.singleages.through89.90plus.adjusted.txt.gz'

# Check if file exists
if os.path.exists(pop_file_path):
    print(f"Reading data from: {pop_file_path}")
    print("=" * 60)
    
    # Read first few lines from gzip file to understand structure
    with gzip.open(pop_file_path, 'rt') as f:
        print("First 20 lines of the file (SEER fixed-width format):")
        print()
        for i, line in enumerate(f):
            if i < 20:
                # Show the line with position markers
                if i == 0:
                    # The ruler uses the same 10-character prefix as "Line NN: [" so digits align with the record
                    print("Position: 0         1         2")
                    print("          01234567890123456789012345")
                print(f"Line {i+1:2d}: [{line.rstrip()}]")
            else:
                break
                
    print("\n" + "=" * 60)
    print("Format breakdown:")
    print("Columns 0-3:    Year")
    print("Columns 4-5:    State abbreviation")
    print("Columns 6-10:   State FIPS (2 digits) + county FIPS (3 digits)")
    print("Columns 11-12:  Not used here (blank)")
    print("Columns 13-15:  Race, origin, sex (1 digit each)")
    print("Columns 16-17:  Age")
    print("Columns 18-25:  Population")
else:
    print(f"File not found: {pop_file_path}")
    print("Please check the file path.")

Reading data from: ../data/raw/us.1969_2023.singleages.through89.90plus.adjusted.txt.gz
First 20 lines of the file (SEER fixed-width format):

Position: 0         1         2
          01234567890123456789012345
Line  1: [1969AL01001  1910000000159]
Line  2: [1969AL01001  1910100000159]
Line  3: [1969AL01001  1910200000165]
Line  4: [1969AL01001  1910300000159]
Line  5: [1969AL01001  1910400000174]
Line  6: [1969AL01001  1910500000234]
Line  7: [1969AL01001  1910600000222]
Line  8: [1969AL01001  1910700000208]
Line  9: [1969AL01001  1910800000220]
Line 10: [1969AL01001  1910900000253]
Line 11: [1969AL01001  1911000000204]
Line 12: [1969AL01001  1911100000215]
Line 13: [1969AL01001  1911200000179]
Line 14: [1969AL01001  1911300000179]
Line 15: [1969AL01001  1911400000179]
Line 16: [1969AL01001  1911500000171]
Line 17: [1969AL01001  1911600000165]
Line 18: [1969AL01001  1911700000164]
Line 19: [1969AL01001  1911800000123]
Line 20: [1969AL01001  1911900000098]

Format breakdown:
Column

## Step 4: Parse SEER Fixed-Width Format

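Now we'll parse each record by slicing it at the fixed positions from the specification:
- Extract the year from columns 1-4
- Extract the state and county FIPS codes from columns 7-11 and join them into a 5-digit FIPS
- Extract race, origin, sex, and age from columns 14-18
- Extract the population count from columns 19-26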

In [59]:
print("Parsing SEER fixed-width format...")
print("=" * 60)

def parse_seer_line(line):
    """
    Parse a line in SEER fixed-width format (26 bytes).
    
    Format (1-indexed in docs, 0-indexed in Python):
    Column  1-4  (0:4):   Year (1969-2023)
    Column  5-6  (4:6):   State abbreviation (AL, AK, etc.)
    Column  7-8  (6:8):   State FIPS code (01, 02, etc.)
    Column  9-11 (8:11):  County FIPS code (001, 002, etc.)
    Column  12-13 (11:13): Not used here (blank in this file)
    Column  14   (13):    Race (1=White, 2=Black, 3=AIAN, 4=API)
    Column  15   (14):    Origin (0=Non-Hispanic, 1=Hispanic, 9=N/A)
    Column  16   (15):    Sex (1=Male, 2=Female)
    Column  17-18 (16:18): Age (00-90, 90=90+)
    Column  19-26 (18:26): Population (8 digits)
    
    Example: "1969AL01001  1910000000159"
    """
    if len(line) < 26:
        return None
    
    try:
        # Extract fields (using 0-indexed positions)
        year = int(line[0:4])
        state_abbrev = line[4:6]
        state_fips = line[6:8]
        county_code = line[8:11]
        race = line[13]
        origin = line[14]
        sex = line[15]
        age = line[16:18]
        population_str = line[18:26]
        
        # Create 5-digit FIPS: state + county
        fips = state_fips + county_code
        
        # Parse population
        pop = int(population_str)
        
        return {
            'year': year,
            'state_abbrev': state_abbrev,
            'state_fips': state_fips,
            'county_code': county_code,
            'fips': fips,
            'race': race,
            'origin': origin,
            'sex': sex,
            'age': age,
            'population': pop
        }
    except (ValueError, IndexError):
        return None

# Test the parser on first 10 lines
print("Testing parser on first 10 lines:")
print("-" * 80)
with gzip.open(pop_file_path, 'rt') as f:
    for i, line in enumerate(f):
        if i < 10:
            result = parse_seer_line(line.rstrip())
            if result:
                print(f"Line {i+1}: Year={result['year']}, State={result['state_abbrev']}, "
                      f"FIPS={result['fips']}, Age={result['age']}, Pop={result['population']:,}")
        else:
            break

print("\n✓ Parser working correctly!")

Parsing SEER fixed-width format...
Testing parser on first 10 lines:
--------------------------------------------------------------------------------
Line 1: Year=1969, State=AL, FIPS=01001, Age=00, Pop=159
Line 2: Year=1969, State=AL, FIPS=01001, Age=01, Pop=159
Line 3: Year=1969, State=AL, FIPS=01001, Age=02, Pop=165
Line 4: Year=1969, State=AL, FIPS=01001, Age=03, Pop=159
Line 5: Year=1969, State=AL, FIPS=01001, Age=04, Pop=174
Line 6: Year=1969, State=AL, FIPS=01001, Age=05, Pop=234
Line 7: Year=1969, State=AL, FIPS=01001, Age=06, Pop=222
Line 8: Year=1969, State=AL, FIPS=01001, Age=07, Pop=208
Line 9: Year=1969, State=AL, FIPS=01001, Age=08, Pop=220
Line 10: Year=1969, State=AL, FIPS=01001, Age=09, Pop=253

✓ Parser working correctly!
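
As a possible alternative to hand-written slicing, `pandas.read_fwf` can read the same layout when given explicit column spans. The sketch below is not the approach used in this notebook; it assumes the 0-indexed slices from `parse_seer_line`, reads three records copied from the Step 3 preview out of an in-memory buffer, and uses illustrative names (`COLSPECS`, `NAMES`, `df_sample`).

```python
import io

import pandas as pd

# Half-open, 0-indexed (start, end) spans matching the slices in parse_seer_line.
COLSPECS = [(0, 4), (4, 6), (6, 8), (8, 11), (13, 14),
            (14, 15), (15, 16), (16, 18), (18, 26)]
NAMES = ["year", "state_abbrev", "state_fips", "county_code", "race",
         "origin", "sex", "age", "population"]

# Three records copied from the file preview in Step 3.
sample = io.StringIO(
    "1969AL01001  1910000000159\n"
    "1969AL01001  1910100000159\n"
    "1969AL01001  1910200000165\n"
)

# dtype=str keeps leading zeros in the FIPS and age fields.
df_sample = pd.read_fwf(sample, colspecs=COLSPECS, names=NAMES,
                        header=None, dtype=str)

# Convert only the fields that are used as numbers.
df_sample["year"] = df_sample["year"].astype(int)
df_sample["population"] = df_sample["population"].astype(int)
df_sample["fips"] = df_sample["state_fips"] + df_sample["county_code"]

print(df_sample)
```

Reading every field as a string first and converting selectively avoids the common pitfall of `01001` turning into the integer `1001`.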


## Step 5: Load and Parse Dataset (2006-2015)

Load and parse the dataset, filtering for years 2006-2015 only to match the ARCOS prescription data timeframe.

In [None]:
print("Loading and parsing dataset (filtering for 2006-2015)...")
print("=" * 60)
print("Processing file and creating aggregated data directly...")
print("This approach sums population as we parse to save memory.")
print()

# Create a dictionary to aggregate as we parse
# Key: (year, fips, state_abbrev, state_fips, county_code)
# Value: total_population
population_dict = {}
line_count = 0
filtered_count = 0

with gzip.open(pop_file_path, 'rt') as f:
    for line in f:
        line_count += 1
        
        # Parse the line
        result = parse_seer_line(line.rstrip())
        
        if result and 2006 <= result['year'] <= 2015:
            filtered_count += 1
            
            # Create key for aggregation
            key = (
                result['year'],
                result['fips'],
                result['state_abbrev'],
                result['state_fips'],
                result['county_code']
            )
            
            # Sum population for this key
            if key in population_dict:
                population_dict[key] += result['population']
            else:
                population_dict[key] = result['population']
        
        # Progress indicator every 10 million lines
        if line_count % 10_000_000 == 0:
            print(f"  Processed {line_count:,} lines, kept {filtered_count:,} rows, tracking {len(population_dict):,} unique county-years")

print(f"\n✓ Parsing and aggregation complete!")
print(f"  Total lines processed: {line_count:,}")
print(f"  Rows kept (2006-2015): {filtered_count:,}")
print(f"  Unique county-year combinations: {len(population_dict):,}")

# Convert to DataFrame
print("\nConverting to DataFrame...")
df_pop_condensed = pd.DataFrame([
    {
        'year': key[0],
        'fips': key[1],
        'state_abbrev': key[2],
        'state_fips': key[3],
        'county_code': key[4],
        'population': value
    }
    for key, value in population_dict.items()
])

# Sort for better organization
df_pop_condensed = df_pop_condensed.sort_values(['fips', 'year']).reset_index(drop=True)

print(f"✓ DataFrame created!")
print(f"  Shape: {df_pop_condensed.shape[0]:,} rows × {df_pop_condensed.shape[1]} columns")
print(f"  Columns: {df_pop_condensed.columns.tolist()}")
print(f"  Year range: {df_pop_condensed['year'].min()} - {df_pop_condensed['year'].max()}")

print("\n" + "=" * 60)
print("First 20 rows:")
print(df_pop_condensed.head(20))

Loading and parsing full dataset (all years 1969-2023)...
Processing file and creating aggregated data directly...
This approach sums population as we parse to save memory.

  Processed 10,000,000 lines, tracking 28,417 unique county-years
  Processed 20,000,000 lines, tracking 55,298 unique county-years
  Processed 30,000,000 lines, tracking 82,404 unique county-years
  Processed 40,000,000 lines, tracking 108,308 unique county-years


KeyboardInterrupt: 
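
The captured run above was interrupted before it finished. One option that may reduce per-line Python overhead is to let pandas read the file in chunks and aggregate each chunk with `groupby`. The following is a sketch under stated assumptions, not a measured improvement: the function name `aggregate_county_year` and its parameters are invented here, only the four columns needed for the county-year total are read, and the toy records are made up to exercise the logic.

```python
import io

import pandas as pd

# Only the columns needed for county-year totals (0-indexed, half-open spans).
COLSPECS = [(0, 4), (6, 8), (8, 11), (18, 26)]
NAMES = ["year", "state_fips", "county_code", "population"]


def aggregate_county_year(source, first_year, last_year, chunksize=1_000_000):
    """Sum population by (fips, year), reading the input in chunks."""
    partial_sums = []
    reader = pd.read_fwf(source, colspecs=COLSPECS, names=NAMES,
                         header=None, dtype=str, chunksize=chunksize)
    for chunk in reader:
        chunk["year"] = chunk["year"].astype(int)
        chunk = chunk[chunk["year"].between(first_year, last_year)]
        if chunk.empty:
            continue
        chunk = chunk.assign(
            fips=chunk["state_fips"] + chunk["county_code"],
            population=chunk["population"].astype(int),
        )
        partial_sums.append(chunk.groupby(["fips", "year"])["population"].sum())
    if not partial_sums:
        return pd.DataFrame(columns=["fips", "year", "population"])
    # A county-year can straddle a chunk boundary, so sum the partial sums again.
    total = pd.concat(partial_sums).groupby(level=["fips", "year"]).sum()
    return total.reset_index()


# Invented toy records in the 26-character layout (not real SEER values).
toy = io.StringIO(
    "2006AL01001  1910000000100\n"
    "2006AL01001  1920000000050\n"
    "2007AL01001  1910000000120\n"
    "2005AL01001  1910000000999\n"
    "2006AL01003  1910000000030\n"
)
result = aggregate_county_year(toy, 2006, 2015, chunksize=2)
print(result)
```

The small `chunksize` in the usage example is deliberate: it forces one county-year to be split across chunks, which is the case the final re-aggregation handles.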

## Step 6: Verify Aggregated Data

The data was already collapsed during parsing (Step 5) to save memory. Let's verify the results.

In [None]:
if 'df_pop_condensed' in locals():
    print("Data already aggregated during parsing!")
    print("=" * 60)
    
    print("Dataset Summary:")
    print(f"  Shape: {df_pop_condensed.shape[0]:,} rows × {df_pop_condensed.shape[1]} columns")
    print(f"  Years: {df_pop_condensed['year'].min()} - {df_pop_condensed['year'].max()}")
    print(f"  Unique FIPS codes: {df_pop_condensed['fips'].nunique():,}")
    print(f"  Columns: {df_pop_condensed.columns.tolist()}")
    
    print("\n" + "=" * 60)
    print("Sample statistics:")
    print(df_pop_condensed.describe())
    
    print("\n" + "=" * 60)
    print("Sample records from different years:")
    print(df_pop_condensed.sample(10).sort_values(['fips', 'year']))
else:
    print("Please load and aggregate the data first (run Step 5).")
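
Beyond summary statistics, a few structural checks can catch parsing mistakes before the merge. The helper below is a hypothetical sketch (`check_county_year` is not part of the notebook) that assumes the column names `fips`, `year`, and `population` produced in Step 5 and a string-typed `fips` column. A county with fewer years than expected is reported for review only, since county boundaries and codes can change over time.

```python
import pandas as pd


def check_county_year(df, first_year, last_year):
    """Return a list of human-readable problems found in a county-year table."""
    problems = []
    if df.duplicated(subset=["fips", "year"]).any():
        problems.append("duplicate (fips, year) rows")
    if not df["fips"].str.fullmatch(r"\d{5}").all():
        problems.append("fips values that are not 5-digit strings")
    if not df["year"].between(first_year, last_year).all():
        problems.append("years outside the expected range")
    if (df["population"] < 0).any():
        problems.append("negative population counts")
    # Count counties that do not cover every year in the range.
    n_expected = last_year - first_year + 1
    years_per_county = df.groupby("fips")["year"].nunique()
    n_incomplete = int((years_per_county < n_expected).sum())
    if n_incomplete:
        problems.append(f"{n_incomplete} counties with fewer than {n_expected} years")
    return problems


# Invented toy tables: one clean, one with deliberate defects.
good = pd.DataFrame({
    "fips": ["01001", "01001", "01003", "01003"],
    "year": [2006, 2007, 2006, 2007],
    "population": [100, 110, 50, 55],
})
bad = pd.DataFrame({
    "fips": ["01001", "01001", "1003"],
    "year": [2006, 2006, 2006],
    "population": [100, 100, 50],
})

good_problems = check_county_year(good, 2006, 2007)
bad_problems = check_county_year(bad, 2006, 2007)
print(good_problems)
print(bad_problems)
```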

## Step 7: Save to Parquet

Save the condensed county-year population dataset to a parquet file.

In [None]:
if 'df_pop_condensed' in locals():
    # Save to parquet in data/processed/
    output_file = '../data/processed/us_population_condensed_2006_2015.parquet'
    
    print(f"Saving condensed population data to: {output_file}")
    print("=" * 60)
    
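    # Make sure the output directory exists before writing
    os.makedirs(os.path.dirname(output_file), exist_ok=True)
    df_pop_condensed.to_parquet(output_file, index=False, compression='snappy')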
    
    # Check file size
    file_size = os.path.getsize(output_file)
    print(f"✓ File saved successfully!")
    print(f"  File: {output_file}")
    print(f"  Size: {file_size:,} bytes ({file_size / (1024**2):.2f} MB)")
    print(f"  Rows: {df_pop_condensed.shape[0]:,}")
    print(f"  Columns: {df_pop_condensed.shape[1]}")
    print(f"  Column names: {df_pop_condensed.columns.tolist()}")
    
    print("\n" + "=" * 60)
    print("Summary:")
    print(f"  Years: {df_pop_condensed['year'].min()} - {df_pop_condensed['year'].max()}")
    print(f"  Counties: {df_pop_condensed['fips'].nunique():,}")
    print(f"  Total records: {df_pop_condensed.shape[0]:,}")
    
    print("\n" + "=" * 60)
    print("✓ Condensed population dataset ready for 2006-2015!")
else:
    print("Please load and aggregate the data first (run Step 5).")
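
The county-year table is intended for merging with the ARCOS prescription data on `fips` and `year`. The ARCOS schema is not shown in this notebook, so the sketch below uses an invented `prescriptions` table with a hypothetical `pills` column and made-up numbers. It illustrates one way to merge on the two keys, flag rows that have no population match, and compute a per-capita rate.

```python
import pandas as pd

# Invented population rows in the shape produced by this notebook.
population = pd.DataFrame({
    "fips": ["01001", "01001", "01003"],
    "year": [2006, 2007, 2006],
    "population": [50_000, 51_000, 170_000],
})

# Hypothetical prescription table; the `pills` column name and values are assumptions.
prescriptions = pd.DataFrame({
    "fips": ["01001", "01001", "01005"],
    "year": [2006, 2007, 2006],
    "pills": [1_000_000, 1_530_000, 400_000],
})

# validate= raises if either side has duplicate (fips, year) keys;
# indicator= adds a `_merge` column showing where each row came from.
merged = prescriptions.merge(
    population,
    on=["fips", "year"],
    how="left",
    validate="one_to_one",
    indicator=True,
)
merged["pills_per_capita"] = merged["pills"] / merged["population"]

# Rows without a population match keep NaN and are listed for review.
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print(unmatched[["fips", "year"]])
```

Both tables need `fips` stored as a zero-padded string; if one side holds integers, the keys will not match.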