## PDS Group 7

### Import Libraries

This notebook analyzes US vital statistics data on drug- and alcohol-induced mortality from 2003 to 2015. We load the data from multiple text files, clean and standardize it (handling missing values, removing duplicates, and converting data types), and then merge it with county population estimates.

In [2]:
import requests
import zipfile
import io
import pandas as pd
import numpy as np

pd.set_option("mode.copy_on_write", True)

### Load and Combine Data from Multiple Years

In [3]:
url = "https://www.dropbox.com/scl/fi/bnkoej224ve1tr35fhek8/US_VitalStatistics.zip?rlkey=oenpdsvsiovlqw7v7j1yhldye&dl=1"

# Download ZIP file into memory
resp = requests.get(url, timeout=60)
resp.raise_for_status()
zip_bytes = io.BytesIO(resp.content)

dfs = []

# Open ZIP and read data files
with zipfile.ZipFile(zip_bytes, "r") as zf:
    # Filter out metadata and resource fork files
    txt_files = [
        name
        for name in zf.namelist()
        if name.lower().endswith(".txt")
        and "__macosx" not in name.lower()
        and "/._" not in name
    ]

    for name in sorted(txt_files):
        print("Reading:", name)
        with zf.open(name) as f:
            df = pd.read_csv(f, sep="\t", encoding="latin1")
            dfs.append(df)

# Combine all years into one DataFrame
mortality_03_15 = pd.concat(dfs, ignore_index=True)

print("Number of files read:", len(dfs))
print("Final dataframe shape:", mortality_03_15.shape)
mortality_03_15.sample(20)

Reading: Underlying Cause of Death, 2003.txt
Reading: Underlying Cause of Death, 2004.txt
Reading: Underlying Cause of Death, 2005.txt
Reading: Underlying Cause of Death, 2006.txt
Reading: Underlying Cause of Death, 2007.txt
Reading: Underlying Cause of Death, 2008.txt
Reading: Underlying Cause of Death, 2009.txt
Reading: Underlying Cause of Death, 2010.txt
Reading: Underlying Cause of Death, 2011.txt
Reading: Underlying Cause of Death, 2012.txt
Reading: Underlying Cause of Death, 2013.txt
Reading: Underlying Cause of Death, 2014.txt
Reading: Underlying Cause of Death, 2015.txt
Number of files read: 13
Final dataframe shape: (57436, 8)


Unnamed: 0,Notes,County,County Code,Year,Year Code,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths
47137,,"Briscoe County, TX",48045.0,2013.0,2013.0,All other non-drug and non-alcohol causes,O9,16.0
16149,,"Mitchell County, TX",48335.0,2006.0,2006.0,All other non-drug and non-alcohol causes,O9,86.0
19042,,"Gage County, NE",31067.0,2007.0,2007.0,All other non-drug and non-alcohol causes,O9,269.0
24758,,"Harris County, TX",48201.0,2008.0,2008.0,Drug poisonings (overdose) Suicide (X60-X64),D2,40.0
45514,,"Newton County, MS",28101.0,2013.0,2013.0,All other non-drug and non-alcohol causes,O9,303.0
6315,,"Kearney County, NE",31099.0,2004.0,2004.0,All other non-drug and non-alcohol causes,O9,76.0
21492,,"Clear Creek County, CO",8019.0,2008.0,2008.0,All other non-drug and non-alcohol causes,O9,45.0
19474,,"Person County, NC",37145.0,2007.0,2007.0,All other non-drug and non-alcohol causes,O9,414.0
29433,,"Giles County, VA",51071.0,2009.0,2009.0,All other non-drug and non-alcohol causes,O9,200.0
18497,,"Antrim County, MI",26009.0,2007.0,2007.0,All other non-drug and non-alcohol causes,O9,235.0


### Initial Data Exploration

In [4]:
mortality_03_15.columns

Index(['Notes', 'County', 'County Code', 'Year', 'Year Code',
       'Drug/Alcohol Induced Cause', 'Drug/Alcohol Induced Cause Code',
       'Deaths'],
      dtype='object')

In [5]:
mortality_03_15.isna().sum()

Notes                              57241
County                               195
County Code                          195
Year                                 195
Year Code                            195
Drug/Alcohol Induced Cause           195
Drug/Alcohol Induced Cause Code      195
Deaths                               195
dtype: int64

In [6]:
mortality_03_15.duplicated().sum()

183

In [7]:
mortality_03_15[mortality_03_15.duplicated()]

Unnamed: 0,Notes,County,County Code,Year,Year Code,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths
4094,---,,,,,,,
4096,---,,,,,,,
4101,---,,,,,,,
8237,---,,,,,,,
8238,"Dataset: Underlying Cause of Death, 1999-2017",,,,,,,
...,...,...,...,...,...,...,...,...
57431,Suggested Citation: Centers for Disease Contro...,,,,,,,
57432,"1999-2017 on CDC WONDER Online Database, relea...",,,,,,,
57433,compiled from data provided by the 57 vital st...,,,,,,,
57434,at http://wonder.cdc.gov/ucd-icd10.html on Oct...,,,,,,,
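
The duplicated rows shown above are not mortality records. They are the footer that CDC WONDER appends to each export: separator lines, the dataset description, and the suggested citation, all stored in `Notes` with every data column empty. A minimal sketch of separating such rows explicitly, rather than relying on them being duplicates, might look like this. The toy frame `raw` and its values are illustrative stand-ins, not the notebook's data.

```python
import pandas as pd

# Toy stand-in for the combined frame; the values are illustrative only.
raw = pd.DataFrame(
    {
        "Notes": [None, None, "---", "Dataset: Underlying Cause of Death", "---"],
        "County": ["Autauga County, AL", "Baldwin County, AL", None, None, None],
        "Deaths": [397.0, 10.0, None, None, None],
    }
)

# A footer row has text in Notes and nothing in any data column.
data_cols = raw.columns.drop("Notes")
is_footer = raw["Notes"].notna() & raw[data_cols].isna().all(axis=1)

footer = raw[is_footer]
data = raw[~is_footer].drop(columns="Notes")

print(len(footer), "footer rows;", len(data), "data rows")
```

Filtering on the footer pattern removes repeated and unique footer lines in one step, so no all-null rows are left to clean up afterwards.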


### Data Cleaning

In [8]:
# Remove duplicate rows
mortality_03_15_clean = mortality_03_15.drop_duplicates()
mortality_03_15_clean.tail()

Unnamed: 0,Notes,County,County Code,Year,Year Code,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths
57416,,"Sweetwater County, WY",56037.0,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,251
57417,,"Teton County, WY",56039.0,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,95
57418,,"Uinta County, WY",56041.0,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,142
57419,,"Washakie County, WY",56043.0,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,81
57420,,"Weston County, WY",56045.0,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,61


In [9]:
mortality_03_15_clean.drop(columns="Notes", inplace=True)

In [10]:
mortality_03_15_clean

Unnamed: 0,County,County Code,Year,Year Code,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths
0,"Autauga County, AL",1001.0,2003.0,2003.0,All other non-drug and non-alcohol causes,O9,397.0
1,"Baldwin County, AL",1003.0,2003.0,2003.0,Drug poisonings (overdose) Unintentional (X40-...,D1,10.0
2,"Baldwin County, AL",1003.0,2003.0,2003.0,All other alcohol-induced causes,A9,14.0
3,"Baldwin County, AL",1003.0,2003.0,2003.0,All other non-drug and non-alcohol causes,O9,1479.0
4,"Barbour County, AL",1005.0,2003.0,2003.0,All other non-drug and non-alcohol causes,O9,287.0
...,...,...,...,...,...,...,...
57416,"Sweetwater County, WY",56037.0,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,251
57417,"Teton County, WY",56039.0,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,95
57418,"Uinta County, WY",56041.0,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,142
57419,"Washakie County, WY",56043.0,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,81


In [11]:
mortality_03_15_clean.isna().sum()

County                             12
County Code                        12
Year                               12
Year Code                          12
Drug/Alcohol Induced Cause         12
Drug/Alcohol Induced Cause Code    12
Deaths                             12
dtype: int64

In [12]:
mortality_03_15_clean[mortality_03_15_clean.isnull().any(axis=1)]

Unnamed: 0,County,County Code,Year,Year Code,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths
4087,,,,,,,
4088,,,,,,,
4089,,,,,,,
4090,,,,,,,
4091,,,,,,,
4092,,,,,,,
4093,,,,,,,
4095,,,,,,,
4097,,,,,,,
4098,,,,,,,


In [13]:
mortality_03_15_clean.dropna(how="all")

Unnamed: 0,County,County Code,Year,Year Code,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths
0,"Autauga County, AL",1001.0,2003.0,2003.0,All other non-drug and non-alcohol causes,O9,397.0
1,"Baldwin County, AL",1003.0,2003.0,2003.0,Drug poisonings (overdose) Unintentional (X40-...,D1,10.0
2,"Baldwin County, AL",1003.0,2003.0,2003.0,All other alcohol-induced causes,A9,14.0
3,"Baldwin County, AL",1003.0,2003.0,2003.0,All other non-drug and non-alcohol causes,O9,1479.0
4,"Barbour County, AL",1005.0,2003.0,2003.0,All other non-drug and non-alcohol causes,O9,287.0
...,...,...,...,...,...,...,...
57416,"Sweetwater County, WY",56037.0,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,251
57417,"Teton County, WY",56039.0,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,95
57418,"Uinta County, WY",56041.0,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,142
57419,"Washakie County, WY",56043.0,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,81


In [14]:
# Drop rows where every remaining column is null (rows that held only Notes text)
mortality_03_15_clean = mortality_03_15_clean.dropna(how="all")

In [15]:
mortality_03_15_clean.isna().sum()

County                             0
County Code                        0
Year                               0
Year Code                          0
Drug/Alcohol Induced Cause         0
Drug/Alcohol Induced Cause Code    0
Deaths                             0
dtype: int64

In [16]:
# Check for unusual placeholder values in columns
for col in mortality_03_15_clean.columns:
    uniques = mortality_03_15_clean[col].astype(str).unique()
    unusual = [
        u
        for u in uniques
        if u.strip().lower()
        in ["missing", "n/a", "na", "none", ".", "null", "suppressed", ""]
    ]
    if unusual:
        print(f"{col}: {unusual}")

Deaths: ['Missing']


In [17]:
mortality_03_15_clean[mortality_03_15_clean["Deaths"] == "Missing"]

Unnamed: 0,County,County Code,Year,Year Code,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths
52756,"Prince of Wales-Outer Ketchikan Census Area, AK",2201.0,2015.0,2015.0,Drug poisonings (overdose) Unintentional (X40-...,D1,Missing
52757,"Prince of Wales-Outer Ketchikan Census Area, AK",2201.0,2015.0,2015.0,Drug poisonings (overdose) Suicide (X60-X64),D2,Missing
52758,"Prince of Wales-Outer Ketchikan Census Area, AK",2201.0,2015.0,2015.0,Drug poisonings (overdose) Homicide (X85),D3,Missing
52759,"Prince of Wales-Outer Ketchikan Census Area, AK",2201.0,2015.0,2015.0,Drug poisonings (overdose) Undetermined (Y10-Y14),D4,Missing
52760,"Prince of Wales-Outer Ketchikan Census Area, AK",2201.0,2015.0,2015.0,All other drug-induced causes,D9,Missing
52761,"Prince of Wales-Outer Ketchikan Census Area, AK",2201.0,2015.0,2015.0,"Alcohol poisonings (overdose) (X45, X65, Y15)",A1,Missing
52762,"Prince of Wales-Outer Ketchikan Census Area, AK",2201.0,2015.0,2015.0,All other alcohol-induced causes,A9,Missing
52763,"Prince of Wales-Outer Ketchikan Census Area, AK",2201.0,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,Missing
52765,"Skagway-Hoonah-Angoon Census Area, AK",2232.0,2015.0,2015.0,Drug poisonings (overdose) Unintentional (X40-...,D1,Missing
52766,"Skagway-Hoonah-Angoon Census Area, AK",2232.0,2015.0,2015.0,Drug poisonings (overdose) Suicide (X60-X64),D2,Missing


In [18]:
# Replace "Missing" with NaN in Deaths column
mortality_03_15_clean["Deaths"] = mortality_03_15_clean["Deaths"].replace(
    "Missing", np.nan
)
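
An alternative to replacing the placeholder after loading is to declare it as a missing-value marker when the file is read. The sketch below uses the `na_values` parameter of `pd.read_csv` on a hypothetical two-row extract laid out like the source files. The inline sample text and the name `df_sample` are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical extract in the same tab-separated layout as the source files.
sample = (
    "County\tCounty Code\tDeaths\n"
    "Autauga County, AL\t1001\t397\n"
    "Skagway-Hoonah-Angoon Census Area, AK\t2232\tMissing\n"
)

# Treat the literal string "Missing" as NaN while parsing.
df_sample = pd.read_csv(io.StringIO(sample), sep="\t", na_values=["Missing"])

print(df_sample["Deaths"])
```

With the placeholder handled at read time, `Deaths` parses as a numeric column directly instead of `object`.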

### Remove Redundant Columns

In [19]:
# Check if Year and Year Code are identical
(mortality_03_15_clean["Year"] == mortality_03_15_clean["Year Code"]).all()

True

In [20]:
mortality_03_15_clean.drop(columns="Year Code", inplace=True)

### Data Type Conversion

In [21]:
mortality_03_15_clean.dtypes

County                              object
County Code                        float64
Year                               float64
Drug/Alcohol Induced Cause          object
Drug/Alcohol Induced Cause Code     object
Deaths                              object
dtype: object

In [22]:
mask = ~mortality_03_15_clean["Year"].astype(str).str.match(r"^\d{4}\.0$")
mortality_03_15_clean.loc[mask, "Year"].unique()

array([], dtype=float64)

In [23]:
# Convert columns to appropriate data types
mortality_03_15_clean = mortality_03_15_clean.astype(
    {
        "County": "string",
        "County Code": "int",
        "Year": "float",
        "Drug/Alcohol Induced Cause": "string",
        "Drug/Alcohol Induced Cause Code": "string",
        "Deaths": "float",
    }
)

In [24]:
# Convert Year and Deaths to nullable integer types
mortality_03_15_clean = mortality_03_15_clean.astype(
    {
        "Year": "Int64",
        "Deaths": "Int64",
    }
)
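
The `Deaths` column mixes floats and digit strings (for example `397.0` and `251` in the outputs above), which is why it is cast to `float` before `Int64`. A possible one-pass alternative is `pd.to_numeric` with `errors="coerce"`, sketched here on a toy Series. The name `deaths_toy` and its values are illustrative assumptions.

```python
import pandas as pd

# Toy object column mixing floats, digit strings, and a placeholder.
deaths_toy = pd.Series([397.0, "251", "Missing", "10"], dtype="object")

# Anything that cannot be parsed as a number becomes NaN, then cast to nullable Int64.
deaths_int = pd.to_numeric(deaths_toy, errors="coerce").astype("Int64")

print(deaths_int)
```

Note that `errors="coerce"` silently turns every unparseable value into a missing value, so it is worth checking the set of placeholders first, as the notebook does.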

In [25]:
mortality_03_15_clean.dtypes

County                             string[python]
County Code                                 int32
Year                                        Int64
Drug/Alcohol Induced Cause         string[python]
Drug/Alcohol Induced Cause Code    string[python]
Deaths                                      Int64
dtype: object

### Final Cleaned Dataset

In [26]:
mortality_03_15_clean.head()

Unnamed: 0,County,County Code,Year,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths
0,"Autauga County, AL",1001,2003,All other non-drug and non-alcohol causes,O9,397
1,"Baldwin County, AL",1003,2003,Drug poisonings (overdose) Unintentional (X40-...,D1,10
2,"Baldwin County, AL",1003,2003,All other alcohol-induced causes,A9,14
3,"Baldwin County, AL",1003,2003,All other non-drug and non-alcohol causes,O9,1479
4,"Barbour County, AL",1005,2003,All other non-drug and non-alcohol causes,O9,287


## Merging with the Population Dataset

In [27]:
population = pd.read_csv("data/clean/population_2000_2024.csv")

In [28]:
population

Unnamed: 0,STATE,COUNTY,STNAME,CTYNAME,year,population,fips
0,1,1,Alabama,Autauga County,2000,44021,1001
1,1,1,Alabama,Autauga County,2001,44889,1001
2,1,1,Alabama,Autauga County,2002,45909,1001
3,1,1,Alabama,Autauga County,2003,46800,1001
4,1,1,Alabama,Autauga County,2004,48366,1001
...,...,...,...,...,...,...,...
81708,56,45,Wyoming,Weston County,2020,6817,56045
81709,56,45,Wyoming,Weston County,2021,6747,56045
81710,56,45,Wyoming,Weston County,2022,6872,56045
81711,56,45,Wyoming,Weston County,2023,6828,56045


In [29]:
population["fips"].dtype


dtype('int64')

In [30]:
# Check the string length of County Code values
mortality_03_15_clean["County Code"].astype(str).str.len().value_counts()

County Code
5    50411
4     6830
Name: count, dtype: int64

In [31]:
# County Code lengths differ, so zero-pad to five characters before merging
mortality_03_15_clean["County Code"] = (
    mortality_03_15_clean["County Code"].astype(str).str.zfill(5)
)

mortality_03_15_clean["County Code"].str.len().value_counts()


County Code
5    57241
Name: count, dtype: int64

In [32]:
# Check the string length of fips values
population["fips"].astype(str).str.len().value_counts()

fips
5    73487
4     8226
Name: count, dtype: int64

In [33]:
population["fips"] = population["fips"].astype(str).str.zfill(5)

In [34]:
population["fips"].str.len().value_counts()

fips
5    81713
Name: count, dtype: int64
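
A county FIPS code is a two-digit state code followed by a three-digit county code, so states with codes below 10 lose their leading zero when the value is stored as an integer. The sketch below shows the zero-padding used above and, as a cross-check, rebuilds the same code from separate state and county parts like the `STATE` and `COUNTY` columns of the population table. The toy Series are illustrative.

```python
import pandas as pd

# Integer county codes as they appear before padding.
codes = pd.Series([1001, 8019, 48201])
padded = codes.astype(str).str.zfill(5)

# The same codes rebuilt from separate state and county parts.
state = pd.Series([1, 8, 48])
county = pd.Series([1, 19, 201])
composed = state.astype(str).str.zfill(2) + county.astype(str).str.zfill(3)

print(padded.tolist())
print(composed.tolist())
```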

In [35]:
population.head()

Unnamed: 0,STATE,COUNTY,STNAME,CTYNAME,year,population,fips
0,1,1,Alabama,Autauga County,2000,44021,01001
1,1,1,Alabama,Autauga County,2001,44889,01001
2,1,1,Alabama,Autauga County,2002,45909,01001
3,1,1,Alabama,Autauga County,2003,46800,01001
4,1,1,Alabama,Autauga County,2004,48366,01001


In [36]:
# missing_pop["County Code"].unique()
# These FIPS codes belong to county-equivalents that do not appear in the modern Census population dataset.

In [37]:
obsolete_fips = ['02201','02232','02280','02270','46113','51515','51560'] # Alaska, South Dakota, Virginia

mortality_03_15_clean = mortality_03_15_clean[~mortality_03_15_clean["County Code"].isin(obsolete_fips)]
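
The notebook does not show how the list of unmatched codes was found. One common way is an anti-join: a left merge with `indicator=True`, keeping the rows marked `left_only`. The sketch below is a guess at how a frame like the `missing_pop` mentioned in the comment could be built. The toy frames and the population figure are made up for illustration.

```python
import pandas as pd

# Toy stand-ins; the population value is invented for illustration.
mortality_toy = pd.DataFrame(
    {"County Code": ["01001", "02201", "46113"], "Year": [2015, 2015, 2015]}
)
population_toy = pd.DataFrame(
    {"fips": ["01001"], "year": [2015], "population": [100]}
)

# indicator=True adds a _merge column recording where each row found a match.
checked = mortality_toy.merge(
    population_toy,
    left_on=["County Code", "Year"],
    right_on=["fips", "year"],
    how="left",
    indicator=True,
)

# Rows present only on the left have no population record.
missing_pop = checked[checked["_merge"] == "left_only"]
unmatched_codes = sorted(missing_pop["County Code"].unique())

print(unmatched_codes)
```

Deriving the list this way keeps it in sync with the data if the population file is updated.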


In [38]:
pop_mortality_merged = mortality_03_15_clean.merge(
    population,
    left_on=["County Code", "Year"],
    right_on=["fips", "year"],
    how="left"
)

pop_mortality_merged

Unnamed: 0,County,County Code,Year,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths,STATE,COUNTY,STNAME,CTYNAME,year,population,fips
0,"Autauga County, AL",01001,2003,All other non-drug and non-alcohol causes,O9,397,1,1,Alabama,Autauga County,2003,46800,01001
1,"Baldwin County, AL",01003,2003,Drug poisonings (overdose) Unintentional (X40-...,D1,10,1,3,Alabama,Baldwin County,2003,151509,01003
2,"Baldwin County, AL",01003,2003,All other alcohol-induced causes,A9,14,1,3,Alabama,Baldwin County,2003,151509,01003
3,"Baldwin County, AL",01003,2003,All other non-drug and non-alcohol causes,O9,1479,1,3,Alabama,Baldwin County,2003,151509,01003
4,"Barbour County, AL",01005,2003,All other non-drug and non-alcohol causes,O9,287,1,5,Alabama,Barbour County,2003,28594,01005
...,...,...,...,...,...,...,...,...,...,...,...,...,...
61554,"Sweetwater County, WY",56037,2015,All other non-drug and non-alcohol causes,O9,251,56,37,Wyoming,Sweetwater County,2015,44719,56037
61555,"Teton County, WY",56039,2015,All other non-drug and non-alcohol causes,O9,95,56,39,Wyoming,Teton County,2015,23047,56039
61556,"Uinta County, WY",56041,2015,All other non-drug and non-alcohol causes,O9,142,56,41,Wyoming,Uinta County,2015,20763,56041
61557,"Washakie County, WY",56043,2015,All other non-drug and non-alcohol causes,O9,81,56,43,Wyoming,Washakie County,2015,8278,56043


In [39]:
pop_mortality_merged["population"].isna().sum()

0
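
The merged frame's displayed index runs to 61557, which is more than the 57,241 rows the mortality table held before the merge. A left join only gains rows when a key matches more than one row on the right, so this would be consistent with repeated (`fips`, `year`) pairs in the population file, though the notebook does not confirm it. A minimal sketch of how to check for this and guard against it with the `validate` argument of `merge` follows. The toy frames are illustrative, with one population row deliberately repeated.

```python
import pandas as pd

# Toy stand-ins; one population row is deliberately repeated.
mortality_toy = pd.DataFrame(
    {
        "County Code": ["01001", "01003", "01003"],
        "Year": [2003, 2003, 2003],
        "Deaths": [397, 10, 14],
    }
)
population_toy = pd.DataFrame(
    {
        "fips": ["01001", "01001", "01003"],
        "year": [2003, 2003, 2003],
        "population": [46800, 46800, 151509],
    }
)

keys = ["fips", "year"]
n_dupes = int(population_toy.duplicated(subset=keys).sum())
print("Repeated population keys:", n_dupes)

# validate="many_to_one" raises if the right-hand keys are not unique.
try:
    mortality_toy.merge(
        population_toy,
        left_on=["County Code", "Year"],
        right_on=keys,
        how="left",
        validate="many_to_one",
    )
    validation_failed = False
except pd.errors.MergeError:
    validation_failed = True

# After de-duplicating the keys, the left join keeps the row count unchanged.
population_unique = population_toy.drop_duplicates(subset=keys)
merged_toy = mortality_toy.merge(
    population_unique,
    left_on=["County Code", "Year"],
    right_on=keys,
    how="left",
    validate="many_to_one",
)
print(len(mortality_toy), "rows before;", len(merged_toy), "rows after")
```

Comparing the row counts before and after the merge is a cheap check that death counts are not being double-counted.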

### Final Merged Dataset

In [40]:
pop_mortality_merged.head()

Unnamed: 0,County,County Code,Year,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths,STATE,COUNTY,STNAME,CTYNAME,year,population,fips
0,"Autauga County, AL",01001,2003,All other non-drug and non-alcohol causes,O9,397,1,1,Alabama,Autauga County,2003,46800,01001
1,"Baldwin County, AL",01003,2003,Drug poisonings (overdose) Unintentional (X40-...,D1,10,1,3,Alabama,Baldwin County,2003,151509,01003
2,"Baldwin County, AL",01003,2003,All other alcohol-induced causes,A9,14,1,3,Alabama,Baldwin County,2003,151509,01003
3,"Baldwin County, AL",01003,2003,All other non-drug and non-alcohol causes,O9,1479,1,3,Alabama,Baldwin County,2003,151509,01003
4,"Barbour County, AL",01005,2003,All other non-drug and non-alcohol causes,O9,287,1,5,Alabama,Barbour County,2003,28594,01005


In [42]:
pop_mortality_merged.to_csv("data/clean/merged_mortality_population.csv", index=False)
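
CSV files carry no type information, so the zero-padded `County Code` and `fips` strings will be parsed back as integers unless a dtype is given when the file is read. The sketch below shows the round trip on a one-row toy frame, using an in-memory buffer in place of the file on disk.

```python
import io

import pandas as pd

# One-row toy frame with a zero-padded code stored as text.
toy = pd.DataFrame({"County Code": ["01001"], "Deaths": [397]})

buf = io.StringIO()
toy.to_csv(buf, index=False)

# Default parsing infers an integer and drops the leading zero.
buf.seek(0)
naive = pd.read_csv(buf)

# Passing a dtype for the code column preserves the padding.
buf.seek(0)
restored = pd.read_csv(buf, dtype={"County Code": "string"})

print(naive["County Code"].iloc[0], restored["County Code"].iloc[0])
```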
