# US Vital Statistics Data Exploration and Preprocessing

This notebook explores the US Vital Statistics (Underlying Cause of Death) data and prepares it for analysis.

## Goals:
1. Examine the US_VitalStatistics folder contents
2. Load and understand the mortality data structure
3. Clean up and filter the dataset
4. Prepare data for analysis

## Step 1: Import Required Libraries

In [1]:
import pandas as pd
import os
import glob

# Configure pandas display options for better table viewing
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', 100)      # Show up to 100 rows
pd.set_option('display.width', None)        # Auto-detect width
pd.set_option('display.max_colwidth', None) # Show full column content
pd.set_option('display.float_format', '{:.2f}'.format)  # Format floats

print("Libraries imported successfully!")
print("Display options configured for Excel-like table view")

Libraries imported successfully!
Display options configured for Excel-like table view


## Step 2: Explore US Vital Statistics Folder

In [2]:
# Path to the vital statistics folder
vital_stats_folder = 'US_VitalStatistics'

# List all files in the folder
print("Files in US_VitalStatistics folder:")
print("=" * 60)

files = sorted(glob.glob(os.path.join(vital_stats_folder, '*.txt')))
for file in files:
    file_name = os.path.basename(file)
    file_size = os.path.getsize(file)
    print(f"File: {file_name}")
    print(f"  Size: {file_size:,} bytes ({file_size / 1024:.2f} KB)")
    print("-" * 60)

print(f"\nTotal files: {len(files)}")
print("Year range: 2003-2015")  # hardcoded; matches the years in the file names

Files in US_VitalStatistics folder:
File: Underlying Cause of Death, 2003.txt
  Size: 371,567 bytes (362.86 KB)
------------------------------------------------------------
File: Underlying Cause of Death, 2004.txt
  Size: 376,159 bytes (367.34 KB)
------------------------------------------------------------
File: Underlying Cause of Death, 2005.txt
  Size: 382,237 bytes (373.28 KB)
------------------------------------------------------------
File: Underlying Cause of Death, 2006.txt
  Size: 388,051 bytes (378.96 KB)
------------------------------------------------------------
File: Underlying Cause of Death, 2007.txt
  Size: 392,636 bytes (383.43 KB)
------------------------------------------------------------
File: Underlying Cause of Death, 2008.txt
  Size: 398,659 bytes (389.32 KB)
------------------------------------------------------------
File: Underlying Cause of Death, 2009.txt
  Size: 398,417 bytes (389.08 KB)
------------------------------------------------------------
File:

## Step 3: Preview First File Structure

In [4]:
from IPython.display import display

# Read the first file to understand structure
if files:
    sample_file = files[0]
    print(f"Reading sample from: {os.path.basename(sample_file)}")
    print("=" * 60)

    # Try reading as tab-separated (common for CDC data)
    df_sample = pd.read_csv(sample_file, sep='\t', nrows=10)

    print("\nFirst few rows:")
    # A bare expression inside an `if` block is not rendered by Jupyter, so display it explicitly
    display(df_sample)

Reading sample from: Underlying Cause of Death, 2003.txt

First few rows:
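
Before relying on `sep='\t'`, it can help to look at the raw text of a file. The following is a minimal sketch and not part of the original notebook: `peek_lines` and `tab_counts` are hypothetical helpers, the UTF-8 encoding is an assumption, and the demo file is a throwaway stand-in rather than a real CDC export.

```python
import os
import tempfile
from itertools import islice


def peek_lines(path, n=3, encoding="utf-8"):
    """Return the first n lines of a text file with line endings stripped."""
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\r\n") for line in islice(handle, n)]


def tab_counts(lines):
    """Count tab characters per line; equal, non-zero counts suggest a TSV layout."""
    return [line.count("\t") for line in lines]


# Demo on a temporary file (assumed layout, for illustration only)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("Notes\tCounty\tDeaths\n\tExample County\t12\n")
    demo_lines = peek_lines(demo_path)

print(tab_counts(demo_lines))
```

If every line reports the same non-zero tab count, reading the file with `sep='\t'` is probably the right choice; on the real data one would call `peek_lines(files[0])` instead of using the demo file.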


## Step 4: Load Single Year Data and Examine Structure

In [5]:
# Load one year of data for exploration
print("Loading data from one year for exploration...")
print("=" * 60)

df = pd.read_csv(sample_file, sep='\t')

print("Data loaded successfully!")
print(f"\nDataset Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")

print("\n" + "=" * 60)
print("DATASET INFO:")
print("=" * 60)
df.info()

print("\n" + "=" * 60)
print("Basic Statistics (Numeric Columns):")
print("=" * 60)
df.describe()

Loading data from one year for exploration...
Data loaded successfully!

Dataset Shape: 4,102 rows × 8 columns

DATASET INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4102 entries, 0 to 4101
Data columns (total 8 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Notes                            15 non-null     object 
 1   County                           4087 non-null   object 
 2   County Code                      4087 non-null   float64
 3   Year                             4087 non-null   float64
 4   Year Code                        4087 non-null   float64
 5   Drug/Alcohol Induced Cause       4087 non-null   object 
 6   Drug/Alcohol Induced Cause Code  4087 non-null   object 
 7   Deaths                           4087 non-null   float64
dtypes: float64(4), object(4)
memory usage: 256.5+ KB

Basic Statistics (Numeric Columns):


       County Code    Year  Year Code   Deaths
count       4087.0  4087.0     4087.0   4087.0
mean      29925.36  2003.0     2003.0   595.33
std       15375.96     0.0        0.0  1895.81
min         1001.0  2003.0     2003.0     10.0
25%        18038.0  2003.0     2003.0     47.5
50%        29157.0  2003.0     2003.0    170.0
75%        44003.0  2003.0     2003.0    436.0
max        56045.0  2003.0     2003.0  59244.0
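
`County Code` is read as `float64` because the column contains missing values, and pandas cannot store NaN in a plain integer column. A possible cleanup is sketched below. It assumes `County Code` holds 5-digit FIPS codes, which the observed range (1001 to 56045) is consistent with but does not prove. `county_code_to_fips` is a hypothetical helper, not something the notebook defines.

```python
import pandas as pd


def county_code_to_fips(codes: pd.Series) -> pd.Series:
    """Convert float county codes (e.g. 1001.0) to zero-padded 5-character strings.

    Missing values stay missing (<NA>) instead of turning into the text 'nan'.
    """
    as_int = codes.astype("Int64")  # nullable integer dtype keeps missing values
    return as_int.astype("string").str.zfill(5)


# Toy input that mimics the float column, including one missing value
example = pd.Series([1001.0, 56045.0, float("nan")])
fips = county_code_to_fips(example)
print(fips.tolist())
```

Keeping the code as a string preserves the leading zero that identifies states with a code below 10, which a numeric dtype would drop.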


## Step 5: Check for Missing Values

In [6]:
print("Missing Values Analysis:")
print("=" * 60)
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing_Count': missing,
    'Missing_Percentage': missing_pct
})
missing_with_values = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)

if len(missing_with_values) > 0:
    print(missing_with_values)
else:
    print("No missing values found!")

Missing Values Analysis:
                                 Missing_Count  Missing_Percentage
Notes                                     4087               99.63
County                                      15                0.37
County Code                                 15                0.37
Year                                        15                0.37
Year Code                                   15                0.37
Drug/Alcohol Induced Cause                  15                0.37
Drug/Alcohol Induced Cause Code             15                0.37
Deaths                                      15                0.37
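
The counts above fit a simple pattern: 15 rows have a value in `Notes`, and every other column is missing in exactly 15 rows (15 + 4,087 = 4,102). That suggests the `Notes` rows carry no data, although their contents are not shown here. Under that assumption, a minimal sketch for separating the data rows could look like this. `drop_notes_rows` is a hypothetical helper, and the toy frame is invented for illustration.

```python
import pandas as pd


def drop_notes_rows(df: pd.DataFrame, notes_col: str = "Notes") -> pd.DataFrame:
    """Keep rows whose Notes cell is empty, then drop the Notes column."""
    data_rows = df[df[notes_col].isna()]
    return data_rows.drop(columns=[notes_col]).reset_index(drop=True)


# Toy frame: two data rows followed by two note-only rows (placeholder text)
toy = pd.DataFrame({
    "Notes": [None, None, "placeholder note", "placeholder note"],
    "County": ["Example County A", "Example County B", None, None],
    "Deaths": [12.0, 47.0, None, None],
})

clean = drop_notes_rows(toy)
print(clean.shape)
```

Before applying something like this to the real data, it would be worth checking that the rows being dropped are empty in every other column, for example with `df[df['Notes'].notna()].drop(columns='Notes').isna().all().all()`.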


## Step 6: Load All Years and Combine

In [7]:
# Load all years and combine into one dataset
print("Loading all years of vital statistics data...")
print("=" * 60)

all_dfs = []
for file in files:
    year = os.path.basename(file).split(',')[1].strip().replace('.txt', '')
    print(f"Loading {year}...")
    df_year = pd.read_csv(file, sep='\t')
    all_dfs.append(df_year)

# Combine all years
df_all_years = pd.concat(all_dfs, ignore_index=True)

print("\nAll years combined!")
print(f"Total Shape: {df_all_years.shape[0]:,} rows × {df_all_years.shape[1]} columns")
print(f"Memory usage: {df_all_years.memory_usage(deep=True).sum() / (1024**2):.2f} MB")

Loading all years of vital statistics data...
Loading 2003...
Loading 2004...
Loading 2005...
Loading 2006...
Loading 2007...
Loading 2008...
Loading 2009...
Loading 2010...
Loading 2011...
Loading 2012...
Loading 2013...
Loading 2014...
Loading 2015...

All years combined!
Total Shape: 57,436 rows × 8 columns
Memory usage: 16.32 MB
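
A quick sanity check after concatenation is to count rows per year, which shows whether every file contributed and how many rows have no year at all. The sketch below runs on a toy frame; `rows_per_year` is a hypothetical helper, and applying it to `df_all_years` assumes the `Year` column is present in every file.

```python
import pandas as pd


def rows_per_year(df: pd.DataFrame, year_col: str = "Year") -> pd.Series:
    """Count rows per year, keeping rows with a missing year as their own group."""
    return df[year_col].value_counts(dropna=False).sort_index()


# Toy frame: two rows for 2003, one for 2004, one row with no year
toy_years = pd.DataFrame({"Year": [2003.0, 2003.0, 2004.0, None]})

counts = rows_per_year(toy_years)
print(counts)
```

The counts should sum to the total row count of the combined frame, so a mismatch or an unexpectedly small year would point to a loading problem.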
