## PDS Group 7

### Import Libraries

This notebook analyzes US vital statistics data on drug- and alcohol-induced mortality from 2003 to 2015. We load the data from one tab-separated text file per year, combine the files, and clean the result: we drop footer rows and duplicates, handle missing values, remove redundant columns, and convert each column to an appropriate data type.

In [6]:
import requests
import zipfile
import io
import numpy as np  # used later for np.nan
import pandas as pd

pd.set_option("mode.copy_on_write", True)

### Load and Combine Data from Multiple Years

In [None]:
url = "https://www.dropbox.com/scl/fi/bnkoej224ve1tr35fhek8/US_VitalStatistics.zip?rlkey=oenpdsvsiovlqw7v7j1yhldye&dl=1"

# Download ZIP file into memory
resp = requests.get(url, timeout=60)  # fail instead of hanging on a stalled download
resp.raise_for_status()
zip_bytes = io.BytesIO(resp.content)

dfs = []

# Open ZIP and read data files
with zipfile.ZipFile(zip_bytes, "r") as zf:
    # Filter out metadata and resource fork files
    txt_files = [
        name
        for name in zf.namelist()
        if name.lower().endswith(".txt")
        and "__macosx" not in name.lower()
        and "/._" not in name
    ]

    for name in sorted(txt_files):
        print("Reading:", name)
        with zf.open(name) as f:
            df = pd.read_csv(f, sep="\t", encoding="latin1")
            dfs.append(df)

# Combine all years into one DataFrame
mortality_03_15 = pd.concat(dfs, ignore_index=True)

print("Number of files read:", len(dfs))
print("Final dataframe shape:", mortality_03_15.shape)
mortality_03_15.sample(20)

Reading: Underlying Cause of Death, 2003.txt
Reading: Underlying Cause of Death, 2004.txt
Reading: Underlying Cause of Death, 2005.txt
Reading: Underlying Cause of Death, 2006.txt
Reading: Underlying Cause of Death, 2007.txt
Reading: Underlying Cause of Death, 2008.txt
Reading: Underlying Cause of Death, 2009.txt
Reading: Underlying Cause of Death, 2010.txt
Reading: Underlying Cause of Death, 2011.txt
Reading: Underlying Cause of Death, 2012.txt
Reading: Underlying Cause of Death, 2013.txt
Reading: Underlying Cause of Death, 2014.txt
Reading: Underlying Cause of Death, 2015.txt
Number of files read: 13
Final dataframe shape: (57436, 8)


Unnamed: 0,Notes,County,County Code,Year,Year Code,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths
1238,,"Franklin County, KS",20059.0,2003.0,2003.0,All other non-drug and non-alcohol causes,O9,227.0
19589,,"Clinton County, OH",39027.0,2007.0,2007.0,All other non-drug and non-alcohol causes,O9,341.0
37238,,"Clark County, OH",39023.0,2011.0,2011.0,All other non-drug and non-alcohol causes,O9,1607.0
9603,,"Sedgwick County, KS",20173.0,2005.0,2005.0,Drug poisonings (overdose) Suicide (X60-X64),D2,15.0
57231,,"Jackson County, WV",54035.0,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,359.0
2590,,"Richmond County, NC",37153.0,2003.0,2003.0,All other non-drug and non-alcohol causes,O9,527.0
41910,,"Stark County, OH",39151.0,2012.0,2012.0,All other non-drug and non-alcohol causes,O9,3987.0
30375,,"Bay County, FL",12005.0,2010.0,2010.0,Drug poisonings (overdose) Unintentional (X40-...,D1,22.0
35409,,"Crawford County, IN",18025.0,2011.0,2011.0,All other non-drug and non-alcohol causes,O9,105.0
19749,,"Latimer County, OK",40077.0,2007.0,2007.0,All other non-drug and non-alcohol causes,O9,115.0
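
The loading step above can be tried offline with a small stand-in archive. The following is a minimal sketch, not the notebook's actual pipeline: the file names, the two-column layout, and the `source_file` column are all made up for illustration. It shows the same pattern of filtering ZIP members and concatenating tab-separated files, plus one optional addition that records which file each row came from.

```python
import io
import zipfile

import pandas as pd

# Build a tiny in-memory ZIP that mimics the layout of the real archive
# (file names and contents are invented for this example).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data/2003.txt", "County\tDeaths\nA County, XX\t10\n")
    zf.writestr("data/2004.txt", "County\tDeaths\nA County, XX\t12\n")
    zf.writestr("__MACOSX/data/._2003.txt", "resource fork junk")
buf.seek(0)

frames = []
with zipfile.ZipFile(buf, "r") as zf:
    for name in sorted(zf.namelist()):
        # Same filter idea as above: keep .txt files, skip macOS metadata
        if not name.lower().endswith(".txt") or "__macosx" in name.lower():
            continue
        with zf.open(name) as f:
            frame = pd.read_csv(f, sep="\t", encoding="latin1")
        # Hypothetical extra column: remember the file each row came from
        frame["source_file"] = name
        frames.append(frame)

combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Keeping a provenance column like this can make it easier to trace an odd row back to a specific yearly file after concatenation.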


### Initial Data Exploration

In [3]:
mortality_03_15.columns

Index(['Notes', 'County', 'County Code', 'Year', 'Year Code',
       'Drug/Alcohol Induced Cause', 'Drug/Alcohol Induced Cause Code',
       'Deaths'],
      dtype='object')

In [4]:
mortality_03_15.isna().sum()

Notes                              57241
County                               195
County Code                          195
Year                                 195
Year Code                            195
Drug/Alcohol Induced Cause           195
Drug/Alcohol Induced Cause Code      195
Deaths                               195
dtype: int64

In [5]:
mortality_03_15.duplicated().sum()

np.int64(183)

In [6]:
mortality_03_15[mortality_03_15.duplicated()]

Unnamed: 0,Notes,County,County Code,Year,Year Code,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths
4094,---,,,,,,,
4096,---,,,,,,,
4101,---,,,,,,,
8237,---,,,,,,,
8238,"Dataset: Underlying Cause of Death, 1999-2017",,,,,,,
...,...,...,...,...,...,...,...,...
57431,Suggested Citation: Centers for Disease Contro...,,,,,,,
57432,"1999-2017 on CDC WONDER Online Database, relea...",,,,,,,
57433,compiled from data provided by the 57 vital st...,,,,,,,
57434,at http://wonder.cdc.gov/ucd-icd10.html on Oct...,,,,,,,
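
The duplicated rows listed above all appear to be export footer lines: text in `Notes` and nothing in the data columns. Under that assumption, footer rows can also be identified directly instead of via `duplicated()`. The sketch below uses a made-up toy frame (not the real data) to show the idea.

```python
import pandas as pd

# Toy frame mimicking the export layout: data rows have an empty Notes
# column, footer rows have text in Notes and nothing else.
raw = pd.DataFrame(
    {
        "Notes": [None, None, "---", "Dataset: example footer", "---"],
        "County": ["A County, XX", "B County, XX", None, None, None],
        "Deaths": [10.0, 12.0, None, None, None],
    }
)

# A row counts as a footer if every column other than Notes is missing
data_cols = raw.columns.drop("Notes")
is_footer = raw[data_cols].isna().all(axis=1)

data_rows = raw.loc[~is_footer].drop(columns="Notes")
print("Footer rows removed:", int(is_footer.sum()))
print(data_rows)
```

Unlike `drop_duplicates()`, this also removes the first occurrence of each footer line, so no separate pass over all-null rows is needed afterwards.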


### Data Cleaning

In [None]:
# Remove duplicate rows (the duplicates found above are all repeated footer/notes lines)
mortality_03_15_clean = mortality_03_15.drop_duplicates()
mortality_03_15_clean.tail()

Unnamed: 0,Notes,County,County Code,Year,Year Code,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths
57416,,"Sweetwater County, WY",56037,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,251
57417,,"Teton County, WY",56039,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,95
57418,,"Uinta County, WY",56041,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,142
57419,,"Washakie County, WY",56043,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,81
57420,,"Weston County, WY",56045,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,61


In [8]:
mortality_03_15_clean.drop(columns="Notes", inplace=True)

In [9]:
mortality_03_15_clean

Unnamed: 0,County,County Code,Year,Year Code,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths
0,"Autauga County, AL",1001,2003.0,2003.0,All other non-drug and non-alcohol causes,O9,397.0
1,"Baldwin County, AL",1003,2003.0,2003.0,Drug poisonings (overdose) Unintentional (X40-...,D1,10.0
2,"Baldwin County, AL",1003,2003.0,2003.0,All other alcohol-induced causes,A9,14.0
3,"Baldwin County, AL",1003,2003.0,2003.0,All other non-drug and non-alcohol causes,O9,1479.0
4,"Barbour County, AL",1005,2003.0,2003.0,All other non-drug and non-alcohol causes,O9,287.0
...,...,...,...,...,...,...,...
57416,"Sweetwater County, WY",56037,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,251
57417,"Teton County, WY",56039,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,95
57418,"Uinta County, WY",56041,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,142
57419,"Washakie County, WY",56043,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,81


In [10]:
mortality_03_15_clean.isna().sum()

County                             12
County Code                        12
Year                               12
Year Code                          12
Drug/Alcohol Induced Cause         12
Drug/Alcohol Induced Cause Code    12
Deaths                             12
dtype: int64

In [11]:
mortality_03_15_clean[mortality_03_15_clean.isnull().any(axis=1)]

Unnamed: 0,County,County Code,Year,Year Code,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths
4087,,,,,,,
4088,,,,,,,
4089,,,,,,,
4090,,,,,,,
4091,,,,,,,
4092,,,,,,,
4093,,,,,,,
4095,,,,,,,
4097,,,,,,,
4098,,,,,,,


In [12]:
# Preview only: the result is not assigned here
mortality_03_15_clean.dropna(how="all")

Unnamed: 0,County,County Code,Year,Year Code,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths
0,"Autauga County, AL",1001,2003.0,2003.0,All other non-drug and non-alcohol causes,O9,397.0
1,"Baldwin County, AL",1003,2003.0,2003.0,Drug poisonings (overdose) Unintentional (X40-...,D1,10.0
2,"Baldwin County, AL",1003,2003.0,2003.0,All other alcohol-induced causes,A9,14.0
3,"Baldwin County, AL",1003,2003.0,2003.0,All other non-drug and non-alcohol causes,O9,1479.0
4,"Barbour County, AL",1005,2003.0,2003.0,All other non-drug and non-alcohol causes,O9,287.0
...,...,...,...,...,...,...,...
57416,"Sweetwater County, WY",56037,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,251
57417,"Teton County, WY",56039,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,95
57418,"Uinta County, WY",56041,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,142
57419,"Washakie County, WY",56043,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,81


In [None]:
# Drop rows where every remaining column is null
# (Notes was already dropped, so these are leftover footer rows)
mortality_03_15_clean = mortality_03_15_clean.dropna(how="all")

In [14]:
mortality_03_15_clean.isna().sum()

County                             0
County Code                        0
Year                               0
Year Code                          0
Drug/Alcohol Induced Cause         0
Drug/Alcohol Induced Cause Code    0
Deaths                             0
dtype: int64

In [None]:
# Check for unusual placeholder values in columns
for col in mortality_03_15_clean.columns:
    uniques = mortality_03_15_clean[col].astype(str).unique()
    unusual = [
        u
        for u in uniques
        if u.strip().lower()
        in ["missing", "n/a", "na", "none", ".", "null", "suppressed", ""]
    ]
    if unusual:
        print(f"{col}: {unusual}")

Deaths: ['Missing']


In [16]:
mortality_03_15_clean[mortality_03_15_clean["Deaths"] == "Missing"]

Unnamed: 0,County,County Code,Year,Year Code,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths
52756,"Prince of Wales-Outer Ketchikan Census Area, AK",2201,2015.0,2015.0,Drug poisonings (overdose) Unintentional (X40-...,D1,Missing
52757,"Prince of Wales-Outer Ketchikan Census Area, AK",2201,2015.0,2015.0,Drug poisonings (overdose) Suicide (X60-X64),D2,Missing
52758,"Prince of Wales-Outer Ketchikan Census Area, AK",2201,2015.0,2015.0,Drug poisonings (overdose) Homicide (X85),D3,Missing
52759,"Prince of Wales-Outer Ketchikan Census Area, AK",2201,2015.0,2015.0,Drug poisonings (overdose) Undetermined (Y10-Y14),D4,Missing
52760,"Prince of Wales-Outer Ketchikan Census Area, AK",2201,2015.0,2015.0,All other drug-induced causes,D9,Missing
52761,"Prince of Wales-Outer Ketchikan Census Area, AK",2201,2015.0,2015.0,"Alcohol poisonings (overdose) (X45, X65, Y15)",A1,Missing
52762,"Prince of Wales-Outer Ketchikan Census Area, AK",2201,2015.0,2015.0,All other alcohol-induced causes,A9,Missing
52763,"Prince of Wales-Outer Ketchikan Census Area, AK",2201,2015.0,2015.0,All other non-drug and non-alcohol causes,O9,Missing
52765,"Skagway-Hoonah-Angoon Census Area, AK",2232,2015.0,2015.0,Drug poisonings (overdose) Unintentional (X40-...,D1,Missing
52766,"Skagway-Hoonah-Angoon Census Area, AK",2232,2015.0,2015.0,Drug poisonings (overdose) Suicide (X60-X64),D2,Missing


In [None]:
# Replace "Missing" with NaN in Deaths column
mortality_03_15_clean["Deaths"] = mortality_03_15_clean["Deaths"].replace(
    "Missing", np.nan
)
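
An alternative to replacing the placeholder by name is `pd.to_numeric` with `errors="coerce"`, which converts anything non-numeric to `NaN` in one step. This is a minimal sketch on a made-up Series, not the notebook's data.

```python
import pandas as pd

# Toy column mixing numeric strings, a float, and the "Missing" placeholder
deaths = pd.Series(["251", "95", "Missing", 14.0], dtype="object")

# errors="coerce" turns any value that cannot be parsed into NaN
deaths_num = pd.to_numeric(deaths, errors="coerce")

# Optional: switch to the nullable integer dtype once values are whole numbers
deaths_int = deaths_num.astype("Int64")
print(deaths_int)
```

The trade-off is that coercion silently hides any other unexpected string as well, so checking for placeholder values first, as the loop above does, is still worthwhile.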

### Remove Redundant Columns

In [None]:
# Check if Year and Year Code are identical
(mortality_03_15_clean["Year"] == mortality_03_15_clean["Year Code"]).all()

np.True_

In [19]:
mortality_03_15_clean.drop(columns="Year Code", inplace=True)

### Data Type Conversion

In [20]:
mortality_03_15_clean.dtypes

County                             object
County Code                        object
Year                               object
Drug/Alcohol Induced Cause         object
Drug/Alcohol Induced Cause Code    object
Deaths                             object
dtype: object

In [21]:
mask = ~mortality_03_15_clean["Year"].astype(str).str.match(r"^\d{4}\.0$")
mortality_03_15_clean.loc[mask, "Year"].unique()

array([], dtype=object)

In [None]:
# Convert columns to appropriate data types
mortality_03_15_clean = mortality_03_15_clean.astype(
    {
        "County": "string",
        "County Code": "int",
        "Year": "float",
        "Drug/Alcohol Induced Cause": "string",
        "Drug/Alcohol Induced Cause Code": "string",
        "Deaths": "float",
    }
)

In [None]:
# Convert Year and Deaths to nullable integer types
mortality_03_15_clean = mortality_03_15_clean.astype(
    {
        "Year": "Int64",
        "Deaths": "Int64",
    }
)

In [25]:
mortality_03_15_clean.dtypes

County                             string[python]
County Code                                 int64
Year                                        Int64
Drug/Alcohol Induced Cause         string[python]
Drug/Alcohol Induced Cause Code    string[python]
Deaths                                      Int64
dtype: object
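
One side effect of storing `County Code` as an integer is that leading zeros are lost (Autauga County, AL appears as 1001 rather than 01001). County FIPS codes are five digits, so if a later merge expects string codes, a zero-padded column may help. The sketch below uses toy values; the `fips` and `state_fips` column names are hypothetical.

```python
import pandas as pd

codes = pd.DataFrame({"County Code": [1001, 20059, 56045]})

# Hypothetical helper column: five-character, zero-padded FIPS string
codes["fips"] = codes["County Code"].astype(str).str.zfill(5)

# The first two characters identify the state
codes["state_fips"] = codes["fips"].str[:2]
print(codes)
```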

### Final Cleaned Dataset

In [26]:
mortality_03_15_clean.head()

Unnamed: 0,County,County Code,Year,Drug/Alcohol Induced Cause,Drug/Alcohol Induced Cause Code,Deaths
0,"Autauga County, AL",1001,2003,All other non-drug and non-alcohol causes,O9,397
1,"Baldwin County, AL",1003,2003,Drug poisonings (overdose) Unintentional (X40-...,D1,10
2,"Baldwin County, AL",1003,2003,All other alcohol-induced causes,A9,14
3,"Baldwin County, AL",1003,2003,All other non-drug and non-alcohol causes,O9,1479
4,"Barbour County, AL",1005,2003,All other non-drug and non-alcohol causes,O9,287


## After looking at the population dataset