# US Population Data Exploration and Preprocessing

This notebook explores the US Population data and prepares it for analysis.

## Goals:
1. Load and examine the population data
2. Understand the data structure and coverage
3. Clean up and filter the dataset
4. Prepare data for merging with other datasets

## Step 1: Import Required Libraries

In [18]:
import pandas as pd
import os
import gzip

# Configure pandas display options for better table viewing
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', 100)      # Show up to 100 rows
pd.set_option('display.width', None)        # Auto-detect width
pd.set_option('display.max_colwidth', None) # Show full column content
pd.set_option('display.float_format', '{:.2f}'.format)  # Format floats

print("Libraries imported successfully!")
print("Display options configured for Excel-like table view")

Libraries imported successfully!
Display options configured for Excel-like table view


## Step 2: Locate and Examine Population File

In [8]:
# Search for population .gz files in the workspace
import glob

print("Searching for population data files (.gz)...")
print("=" * 60)

# Look for gz files that might contain population data
gz_files = glob.glob('**/*.gz', recursive=True)

# Filter for likely population files
pop_files = [f for f in gz_files if 'population' in f.lower() or 'pop' in f.lower()]

if pop_files:
    print("Population files found:")
    for file in pop_files:
        file_size = os.path.getsize(file)
        print(f"  {file} ({file_size:,} bytes)")
else:
    print("No population .gz files found with 'population' or 'pop' in name.")
    print("\nAll .gz files in workspace:")
    for file in gz_files[:10]:  # Show first 10
        file_size = os.path.getsize(file)
        print(f"  {file} ({file_size:,} bytes)")
    if len(gz_files) > 10:
        print(f"  ... and {len(gz_files) - 10} more")

Searching for population data files (.gz)...
No population .gz files found with 'population' or 'pop' in name.

All .gz files in workspace:
  us.1969_2023.singleages.through89.90plus.adjusted.txt.gz (278,134,338 bytes)


## Step 3: Preview Population Data (from .gz file)

**Note:** Update the `pop_file_path` variable below with the actual path to your population .gz file.

In [19]:
# Use the population file found in the workspace
pop_file_path = 'us.1969_2023.singleages.through89.90plus.adjusted.txt.gz'

# Check if file exists
if os.path.exists(pop_file_path):
    print(f"Reading data from: {pop_file_path}")
    print("=" * 60)
    
    # Read first few lines from gzip file to understand structure
    with gzip.open(pop_file_path, 'rt') as f:
        print("First 15 lines of the file:")
        for i, line in enumerate(f):
            if i < 15:
                print(line.rstrip())
            else:
                break
else:
    print(f"File not found: {pop_file_path}")
    print("Please check the file path.")

Reading data from: us.1969_2023.singleages.through89.90plus.adjusted.txt.gz
First 15 lines of the file:
1969AL01001  1910000000159
1969AL01001  1910100000159
1969AL01001  1910200000165
1969AL01001  1910300000159
1969AL01001  1910400000174
1969AL01001  1910500000234
1969AL01001  1910600000222
1969AL01001  1910700000208
1969AL01001  1910800000220
1969AL01001  1910900000253
1969AL01001  1911000000204
1969AL01001  1911100000215
1969AL01001  1911200000179
1969AL01001  1911300000179
1969AL01001  1911400000179


## Step 4: Load and Examine Population Data

**Note:** Adjust the `sep` parameter based on the file format discovered in Step 3.

In [21]:
# Load a SAMPLE of the population data first for quick exploration
# This is much faster than loading the entire file
print("Loading SAMPLE of population data (100,000 rows)...")
print("=" * 60)

# Based on the preview, the format appears to be:
# Columns: Year (4), State (2), County (5), Race (1), Hispanic Origin (1), Sex (1), Age (3), Population (8)

# Read as fixed-width format
colspecs = [
    (0, 4),    # year
    (4, 6),    # state FIPS
    (6, 11),   # county FIPS (5 digits)
    (11, 12),  # race (1=white, 2=black, 3=AIAN, 4=Asian/PI)
    (12, 13),  # hispanic origin (1=not hispanic, 2=hispanic)
    (13, 14),  # sex (1=male, 2=female)
    (14, 17),  # age (0-90+, 999=total all ages)
    (17, 25)   # population
]

column_names = ['year', 'state_fips', 'county_fips', 'race', 'hispanic', 'sex', 'age', 'population']

# Load only first 100k rows for quick exploration
df_pop = pd.read_fwf(pop_file_path, colspecs=colspecs, names=column_names, 
                     compression='gzip', nrows=100000)

print(f"Sample loaded successfully!")
print(f"\nDataset Shape: {df_pop.shape[0]:,} rows × {df_pop.shape[1]} columns")

print("\n" + "=" * 60)
print("DATASET INFO:")
print("=" * 60)
df_pop.info()

print("\n" + "=" * 60)
print("First few rows:")
print("=" * 60)
df_pop.head(10)

Loading SAMPLE of population data (100,000 rows)...
Sample loaded successfully!

Dataset Shape: 100,000 rows × 8 columns

DATASET INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   year         100000 non-null  int64  
 1   state_fips   100000 non-null  object 
 2   county_fips  100000 non-null  int64  
 3   race         0 non-null       float64
 4   hispanic     0 non-null       float64
 5   sex          100000 non-null  int64  
 6   age          100000 non-null  int64  
 7   population   100000 non-null  int64  
dtypes: float64(2), int64(5), object(1)
memory usage: 6.1+ MB

First few rows:
Sample loaded successfully!

Dataset Shape: 100,000 rows × 8 columns

DATASET INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  -----

Unnamed: 0,year,state_fips,county_fips,race,hispanic,sex,age,population
0,1969,AL,1001,,,1,910,15
1,1969,AL,1001,,,1,910,10000015
2,1969,AL,1001,,,1,910,20000016
3,1969,AL,1001,,,1,910,30000015
4,1969,AL,1001,,,1,910,40000017
5,1969,AL,1001,,,1,910,50000023
6,1969,AL,1001,,,1,910,60000022
7,1969,AL,1001,,,1,910,70000020
8,1969,AL,1001,,,1,910,80000022
9,1969,AL,1001,,,1,910,90000025


## Step 5: Check for Missing Values

In [11]:
if 'df_pop' in locals():
    print("Missing Values Analysis:")
    print("=" * 60)
    missing = df_pop.isnull().sum()
    missing_pct = (missing / len(df_pop)) * 100
    missing_df = pd.DataFrame({
        'Missing_Count': missing,
        'Missing_Percentage': missing_pct
    })
    missing_with_values = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)
    
    if len(missing_with_values) > 0:
        missing_with_values
    else:
        print("No missing values found!")
else:
    print("Please load the population data first.")

Missing Values Analysis:
No missing values found!
No missing values found!


## Step 6: Basic Statistics

In [12]:
if 'df_pop' in locals():
    print("Basic Statistics:")
    print("=" * 60)
    df_pop.describe(include='all')
else:
    print("Please load the population data first.")

Basic Statistics:


## Step 7: Data Preprocessing - Filter Years and Clean Data

In [13]:
if 'df_pop' in locals():
    # Check what columns we have
    print("Column names:")
    print(df_pop.columns.tolist())
    
    print("\n" + "=" * 60)
    print("Unique years in dataset:")
    if 'year' in df_pop.columns:
        print(f"Years: {sorted(df_pop['year'].unique())}")
    elif 'Year' in df_pop.columns:
        print(f"Years: {sorted(df_pop['Year'].unique())}")
    else:
        print("Year column not found. Available columns:")
        print(df_pop.columns.tolist())
else:
    print("Please load the population data first.")

Column names:
['1969AL01001', '1910000000159']

Unique years in dataset:
Year column not found. Available columns:
['1969AL01001', '1910000000159']


## Step 8: Filter for Project Time Period (2003-2015)

In [22]:
# Now load ONLY the years we need (2003-2015) from the full file
# This is more efficient than loading all years then filtering
print("Loading full dataset filtered for 2003-2015...")
print("=" * 60)
print("This may take 1-2 minutes as we're reading the entire file...")

# Define a function to filter as we read
def year_filter(df_chunk):
    return df_chunk[(df_chunk['year'] >= 2003) & (df_chunk['year'] <= 2015)]

# Read the file in chunks and filter
chunks = []
chunk_size = 500000  # Read 500k rows at a time

reader = pd.read_fwf(pop_file_path, colspecs=colspecs, names=column_names, 
                     compression='gzip', chunksize=chunk_size)

for i, chunk in enumerate(reader):
    # Filter this chunk
    filtered_chunk = year_filter(chunk)
    if not filtered_chunk.empty:
        chunks.append(filtered_chunk)
    
    # Progress indicator
    if (i + 1) % 10 == 0:
        print(f"  Processed {(i + 1) * chunk_size:,} rows...")

# Combine all filtered chunks
df_pop_filtered = pd.concat(chunks, ignore_index=True)

print(f"\n✓ Filtered dataset loaded!")
print(f"Rows (2003-2015 only): {df_pop_filtered.shape[0]:,}")

print("\n" + "=" * 60)
print("Years in filtered dataset:")
print(sorted(df_pop_filtered['year'].unique()))

print("\n" + "=" * 60)
print("Sample of filtered data:")
df_pop_filtered.head(10)

Loading full dataset filtered for 2003-2015...
This may take 1-2 minutes as we're reading the entire file...
  Processed 5,000,000 rows...
  Processed 5,000,000 rows...
  Processed 10,000,000 rows...
  Processed 10,000,000 rows...
  Processed 15,000,000 rows...
  Processed 15,000,000 rows...
  Processed 20,000,000 rows...
  Processed 20,000,000 rows...
  Processed 25,000,000 rows...
  Processed 25,000,000 rows...
  Processed 30,000,000 rows...
  Processed 30,000,000 rows...
  Processed 35,000,000 rows...
  Processed 35,000,000 rows...
  Processed 40,000,000 rows...
  Processed 40,000,000 rows...
  Processed 45,000,000 rows...
  Processed 45,000,000 rows...
  Processed 50,000,000 rows...
  Processed 50,000,000 rows...
  Processed 55,000,000 rows...
  Processed 55,000,000 rows...
  Processed 60,000,000 rows...
  Processed 60,000,000 rows...
  Processed 65,000,000 rows...
  Processed 65,000,000 rows...

✓ Filtered dataset loaded!
Rows (2003-2015 only): 17,620,800

Years in filtered datase

Unnamed: 0,year,state_fips,county_fips,race,hispanic,sex,age,population
0,2003,AL,1001,,,1,910,24
1,2003,AL,1001,,,1,910,10000021
2,2003,AL,1001,,,1,910,20000024
3,2003,AL,1001,,,1,910,30000024
4,2003,AL,1001,,,1,910,40000022
5,2003,AL,1001,,,1,910,50000025
6,2003,AL,1001,,,1,910,60000030
7,2003,AL,1001,,,1,910,70000025
8,2003,AL,1001,,,1,910,80000030
9,2003,AL,1001,,,1,910,90000030


## Step 9: Save Filtered Data to Parquet

In [26]:
if 'df_pop_filtered' in locals():
    # Save the filtered data to parquet for efficient subsequent processing
    filtered_file = 'us_population_2003_2015_filtered.parquet'
    
    print(f"Saving filtered data to: {filtered_file}")
    print("=" * 60)
    
    df_pop_filtered.to_parquet(filtered_file, index=False, compression='snappy')
    
    # Check file size
    file_size = os.path.getsize(filtered_file)
    print(f"✓ Filtered data saved successfully!")
    print(f"  File size: {file_size:,} bytes ({file_size / (1024**2):.2f} MB)")
    print(f"  Rows: {df_pop_filtered.shape[0]:,}")
    print(f"  Columns: {df_pop_filtered.shape[1]}")
    
    print("\n" + "=" * 60)
    print("From now on, we'll work with the Parquet file for faster processing.")
else:
    print("Please filter the data first (run Step 8).")

Saving filtered data to: us_population_2003_2015_filtered.parquet
✓ Filtered data saved successfully!
  File size: 23,849,744 bytes (22.74 MB)
  Rows: 17,620,800
  Columns: 8

From now on, we'll work with the Parquet file for faster processing.
✓ Filtered data saved successfully!
  File size: 23,849,744 bytes (22.74 MB)
  Rows: 17,620,800
  Columns: 8

From now on, we'll work with the Parquet file for faster processing.


## Step 10: Load from Parquet and Aggregate by State and Year

In [27]:
# Load the filtered data from Parquet (much faster than CSV/TSV!)
filtered_file = 'us_population_2003_2015_filtered.parquet'

if os.path.exists(filtered_file):
    print(f"Loading filtered data from Parquet: {filtered_file}")
    print("=" * 60)
    
    # Load from parquet - very fast!
    df_pop_filtered = pd.read_parquet(filtered_file)
    
    print(f"✓ Data loaded from Parquet!")
    print(f"  Rows: {df_pop_filtered.shape[0]:,}")
    print(f"  Columns: {df_pop_filtered.shape[1]}")
    
    print("\n" + "=" * 60)
    print("Aggregating by state and year...")
    print("=" * 60)
    
    # Aggregate population by state and year (sum across counties, age, sex, race, hispanic origin)
    df_pop_agg = df_pop_filtered.groupby(['state_fips', 'year'])['population'].sum().reset_index()
    
    print(f"\n✓ Aggregated dataset created!")
    print(f"  Shape: {df_pop_agg.shape[0]:,} rows × {df_pop_agg.shape[1]} columns")
    
    print("\n" + "=" * 60)
    print("Sample of aggregated data:")
    df_pop_agg.head(20)
else:
    print(f"File not found: {filtered_file}")
    print("Please run Step 9 first to create the filtered Parquet file.")

Loading filtered data from Parquet: us_population_2003_2015_filtered.parquet
✓ Data loaded from Parquet!
  Rows: 17,620,800
  Columns: 8

Aggregating by state and year...
✓ Data loaded from Parquet!
  Rows: 17,620,800
  Columns: 8

Aggregating by state and year...

✓ Aggregated dataset created!
  Shape: 664 rows × 3 columns

Sample of aggregated data:

✓ Aggregated dataset created!
  Shape: 664 rows × 3 columns

Sample of aggregated data:


## Step 11: Save Final Aggregated Data to Parquet

In [28]:
if 'df_pop_agg' in locals():
    # Save to parquet format for efficient storage and fast loading
    output_file = 'us_population_2003_2015.parquet'
    
    print(f"Saving processed population data to: {output_file}")
    print("=" * 60)
    
    df_pop_agg.to_parquet(output_file, index=False, compression='snappy')
    
    # Check file size
    file_size = os.path.getsize(output_file)
    print(f"✓ File saved successfully!")
    print(f"  File size: {file_size:,} bytes ({file_size / 1024:.2f} KB)")
    print(f"  Rows: {df_pop_agg.shape[0]:,}")
    print(f"  Columns: {df_pop_agg.shape[1]}")
    
    print("\n" + "=" * 60)
    print("Summary of processed data:")
    print(f"  Years covered: {df_pop_agg['year'].min()} - {df_pop_agg['year'].max()}")
    print(f"  Number of states/regions: {df_pop_agg['state_fips'].nunique()}")
    print(f"  Total population records: {df_pop_agg.shape[0]:,}")
else:
    print("Please create the aggregated dataset first (run Step 10).")

Saving processed population data to: us_population_2003_2015.parquet
✓ File saved successfully!
  File size: 8,763 bytes (8.56 KB)
  Rows: 664
  Columns: 3

Summary of processed data:
  Years covered: 2003 - 2015
  Number of states/regions: 52
  Total population records: 664
