# Birth Defects for Nebraska - EPHTracking Fall 2024 Data Call
- Babak J.Fard -- October 2024

This notebook shows the steps for creating the Birth Defects (BD) datasets, as required by the Tracking How-To-Guide (HTG) and Data Dictionary, from the raw datasets. Because the format of the raw dataset may be very specific and different from other states (and even from future BD datasets), no separate Python module (.py file) was created. Users should account for potential differences and changes before applying these steps to other datasets. The notebook is organized into four sections to make it easy to follow.

* All cell outputs that may contain detailed health data have been removed.

In [None]:
# Import the libraries for data validation
import numpy as np
import pandas as pd


from libraries import general as ge


import os
os.chdir('/Users/babak.jfard/projects/EPHTracking')

## 1. Preparing the Birth Defects Data
This year, the provided data were already in the final format, including nine of the required finalized columns to submit. In the last data submission cycle (2023), we had received the raw birth defects data; that process is available in `Birth_Defects_2023.ipynb`. To make sure that the provided birth defects values match last year's data for the common years (2005 to 2021), we compared our finalized birth defects data with the data we received. The steps are in `Birth_Defects_2024_vs_2023.ipynb`.

Below, we calculate only the two columns `LBWBD` and `LBFDTWD`, following the How-to-Guide.

In [None]:
# Reading all the provided .XLSX files into one dataframe

# Directory containing the .xlsx files
# folder_path = 'Data/BirthDefects_09192024'
folder_path = 'Data/BirthDefects_09192024/final'

# Initialize an empty list to store DataFrames
data_frames = []

# Iterate through all files in the directory
for file_name in os.listdir(folder_path):
    if file_name.endswith('.xlsx'):  
        file_path = os.path.join(folder_path, file_name)
        
        # Try to read the .xlsx file
        try:
            df = pd.read_excel(file_path, sheet_name=0)  # Read the first sheet
            data_frames.append(df)
        except Exception as e:
            print(f"Error reading {file_name}: {e}")

# Concatenate all DataFrames into one
bd_2024 = pd.concat(data_frames, ignore_index=True)

In [None]:
# Total births with a detected birth defect for each group
group_by = ['BirthDefect', 'County', 'StartDate', 'EndDate','MaternalAgeGroup', 'MaternalEthnicity', 'MaternalRace',
                   'InfantSex']

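# as_index=False already returns the group keys as columns, so no reset_index() is needed
# (calling it would add a spurious 'index' column to the saved file)
bd_2024_final = bd_2024.groupby(group_by, as_index=False).agg(LBWBD=('LiveBirth', lambda x: (x == 'Y').sum()),
                                           LBFDTWD=('LiveBirth', 'size'))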
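
As a minimal sketch of what this aggregation computes, the toy table below uses invented values (not Nebraska data) and assumes, as the cell above does, that `LiveBirth` is `'Y'` for live births. `LBWBD` counts only those rows, while `LBFDTWD` counts every record in the group.

```python
import pandas as pd

# Toy case-level records (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'BirthDefect': [1, 1, 1, 2],
    'County': [31001, 31001, 31001, 31055],
    'LiveBirth': ['Y', 'N', 'Y', 'Y'],
})

toy_counts = toy.groupby(['BirthDefect', 'County'], as_index=False).agg(
    LBWBD=('LiveBirth', lambda x: (x == 'Y').sum()),  # live births only
    LBFDTWD=('LiveBirth', 'size'),                    # every record in the group
)
print(toy_counts)
```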



In [None]:
# Saving the finalized birth defects file
bd_2024_final.to_csv('Data/BirthDefects_09192024/Birth_Dfects_2024.csv', index=False)

## 2. Preparing the Live Birth Data

In [None]:
# Categorizing ages into MaternalAgeGroup
# Categorizing maternal age into the groups from Birth Defects Dictionary, May 2022
def categorize_age(df, age_col = 'Mom_Age', new_col= 'MaternalAgeGroup'):
    # Define the age categorization function
    def age_category(age):
        if age < 20:
            return 1
        elif 20 <= age <= 24:
            return 2
        elif 25 <= age <= 29:
            return 3
        elif 30 <= age <= 34:
            return 4
        elif 35 <= age <= 39:
            return 5
        elif age >= 40:
            return 6
        else:
            return 9  # Unknown

    # Apply the age categorization function to the 'Mom_Age' column
    df[new_col] = df[age_col].apply(age_category)
    return df
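
For large tables, the same grouping can probably be done without a row-wise `apply`. The sketch below is an alternative, not the method used in this notebook: `categorize_age_cut` is a hypothetical helper name, and it assumes ages are recorded in whole years (fractional ages fall on different sides of the bin edges than in the function above).

```python
import numpy as np
import pandas as pd

def categorize_age_cut(ages: pd.Series) -> pd.Series:
    """Map maternal age in whole years to group codes 1-6, with 9 for unknown."""
    # Left-closed bins: [-inf, 20), [20, 25), [25, 30), [30, 35), [35, 40), [40, inf)
    bins = [-np.inf, 20, 25, 30, 35, 40, np.inf]
    codes = pd.cut(ages, bins=bins, labels=[1, 2, 3, 4, 5, 6], right=False)
    # Missing ages fall outside every bin and become NaN; code them as 9 (unknown)
    return codes.astype('float').fillna(9).astype('int')

# Hypothetical ages, including the bin edges and one missing value
ages = pd.Series([15, 20, 24, 25, 29, 30, 34, 35, 39, 40, 50, np.nan])
age_groups = categorize_age_cut(ages)
print(age_groups.tolist())
```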

In [None]:
import numpy as np
import pandas as pd


from libraries import general as ge

# from itables import init_notebook_mode

#init_notebook_mode(all_interactive=True)


import os
os.chdir('/Users/babak.jfard/projects/EPHTracking')

In [None]:
#live_births = pd.read_csv('Data/BIRTHDEFECTS080823/Live Births/bth2005.csv')
folder = r'Data/Live_Births_09192024'
live_births = pd.concat((pd.read_csv(folder+'/'+filename)) for filename in os.listdir(folder) if filename.endswith('.csv'))

In [None]:
print("Number of live births in each year")
live_births.DOB_YY.value_counts()

In [None]:
# keep only those with mother state of residence as NE
live_births = live_births[live_births.strm == 'NE'].copy()

In [None]:
# Are all counties covered?
live_births['coures'].nunique()

### Correcting the FIPS code, Maternal Ethnicity, and Dates
The received county FIPS codes are in three-digit format and need to be changed to five digits.

In [None]:

# Making FIPS codes into 5 digits
state_FIPS = '31' #For Nebraska.

live_births.coures = live_births.coures.astype('str').str.zfill(3) #Pad strings in the Series/Index by prepending ‘0’ characters.
live_births['County'] = (state_FIPS+ live_births['coures']).astype('int')
live_births.drop(columns='coures', inplace=True)
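
A small sketch of the padding step on its own, using three made-up county codes rather than the live birth file: each code is left-padded to three digits and prefixed with Nebraska's state code `31`.

```python
import pandas as pd

state_fips = '31'  # Nebraska

# Hypothetical county codes as they might arrive (1 to 3 digits)
county_codes = pd.Series([1, 55, 109])

# Pad to three digits, prepend the state code, and convert to integer
county_fips = (state_fips + county_codes.astype('str').str.zfill(3)).astype('int')
print(county_fips.tolist())
```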

In [None]:
live_births['StartDate'] = live_births['DOB_YY'].astype('str') + '0101'

In [None]:
live_births.rename(columns={'sex': 'InfantSex'}, inplace=True)

In [None]:
# Distinguishing Ethnicity
# Define the conditions and choices
conditions = [
    live_births['hispanicm'].str.contains('H'),
    live_births['hispanicm'].str.contains('U')
]
choices = ['H', 'U']

# Create the new column using np.select
live_births['MaternalEthnicity'] = np.select(conditions, choices, default='NH')
live_births = live_births.drop(columns='hispanicm')

In [None]:
live_births.MaternalEthnicity.value_counts()

In [None]:
# Categorizing maternal age into the groups from the HTG, and saving it into the appropriate column
live_births = categorize_age(live_births, age_col='agemo')

### Editing Maternal Race

In [None]:
race_other_columns = ['aindianm', 'chamorrom', 'chinesem', 'filipinom', 'indianm', 'japanesem',
                      'koreanm', 'nhawaiianm', 'opacislm', 'otheram', 'otherm', 'samoanm', 'vietnamesem']
race_black_columns = 'blackm'
race_white_columns = 'whitem'

In [None]:
# Checking that the only values in the other-race columns are 'Y' or 'N'
for col in race_other_columns:
    print(f"{col}: {live_births[col].value_counts(dropna=False)}")

In [None]:
# Changing to digits to make consistency checks easier
to_digits = {'Y':1, 'N':0}

In [None]:
#[live_births[col].replace(to_digits, inplace=True) for col in race_other_columns]
live_births[race_other_columns] = live_births[race_other_columns].replace(to_digits)

In [None]:
# Calculating all into one column for other
live_births['race_other'] = live_births[race_other_columns].sum(axis=1)
live_births['race_other'] = np.where(live_births['race_other']>0, 1, 0)

In [None]:
# Doing the same for Black and White races
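# Assign the result back: an inplace replace on a selected column may act on a copy
# and leave live_births unchanged
live_births[race_white_columns] = live_births[race_white_columns].replace(to_digits)
live_births[race_black_columns] = live_births[race_black_columns].replace(to_digits)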

In [None]:
# checking potential values. In an ideal situation there must be only one
# or 0 for unknown
three_races = ['blackm', 'whitem', 'race_other']
live_births[three_races].sum(axis=1).value_counts(dropna=False)

This looks good: only about 3% of rows do not sum to 1. If the sum is anything other than 1, the code below returns 'U'; otherwise it checks which of 'W', 'B', or 'O' applies.

In [None]:
# Calculate the sum of the three columns
live_births['sum_race'] = live_births[['blackm', 'whitem', 'race_other']].sum(axis=1)

# Define the conditions and choices for the 'MaternalRace' column
conditions = [
    (live_births['sum_race'] != 1),
    (live_births['blackm'] == 1),
    (live_births['whitem'] == 1),
    (live_births['race_other'] == 1)
]
choices = ['U', 'B', 'W', 'O']

# Create the 'MaternalRace' column using numpy.select
live_births['MaternalRace'] = np.select(conditions, choices, default='U')

# Drop the 'sum_race' column as it's no longer needed
live_births.drop('sum_race', axis=1, inplace=True)
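
The order of the conditions matters, because `np.select` returns the choice for the first condition that is true. The sketch below reproduces the same logic on five invented rows: three with exactly one flag set, one with two flags, and one with none.

```python
import numpy as np
import pandas as pd

# Toy race flags (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    'blackm':     [1, 0, 0, 1, 0],
    'whitem':     [0, 1, 0, 1, 0],
    'race_other': [0, 0, 1, 0, 0],
})

n_flags = toy[['blackm', 'whitem', 'race_other']].sum(axis=1)

# The "not exactly one flag" test comes first, so it overrides the others
conditions = [
    n_flags != 1,
    toy['blackm'] == 1,
    toy['whitem'] == 1,
    toy['race_other'] == 1,
]
toy['MaternalRace'] = np.select(conditions, ['U', 'B', 'W', 'O'], default='U')
print(toy)
```

The fourth row has both `blackm` and `whitem` set, so it is coded 'U' rather than 'B'.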


In [None]:
live_births.MaternalRace.value_counts()

### Wrapping up
Here we select, rename, and reorder the columns to get them ready for joining with the birth defects data.

In [None]:
columns_to_keep = ['County', 'StartDate', 'MaternalAgeGroup', 'MaternalEthnicity', 'MaternalRace',
                   'InfantSex']
live_births = live_births[columns_to_keep]

In [None]:
live_births.StartDate.value_counts()

In [None]:
live_births.to_csv('Data/BirthDefects_09192024/live_births_cleaned_2024.csv')

In [None]:
# Now grouping them and calculating TLB
#live_births['TLB'] = live_births.groupby(columns_to_keep).transform('size')
lb_grouped = live_births.groupby(columns_to_keep).agg(TLB=('County', 'size')).reset_index()
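
As a quick illustration of the counting step, the sketch below builds a tiny table of invented births and counts rows per group. It uses `groupby(...).size()`, which should give the same counts as the named aggregation above. The group totals must add back up to the number of input rows.

```python
import pandas as pd

# One row per birth (hypothetical values, for illustration only)
toy_births = pd.DataFrame({
    'County': [31001, 31001, 31001, 31055],
    'StartDate': ['20050101'] * 4,
    'InfantSex': ['F', 'F', 'M', 'F'],
})

keys = ['County', 'StartDate', 'InfantSex']

# Number of births in each combination of the key columns
toy_tlb = toy_births.groupby(keys).size().reset_index(name='TLB')
print(toy_tlb)
```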


In [None]:
lb_grouped.head(50)

In [None]:
# Saving Total births
lb_grouped.to_csv('Data/BirthDefects_09192024/live_births_summarized_2.csv', index=False)

# ****************************** F I N I S H E D (Live Birth data section)

In [None]:
# Cleaning the memory
%reset

## 3. Joining the Datasets
Here we join the two tables into the final table. First, though, we check that the key columns actually match.

In [None]:
import numpy as np
import pandas as pd


from libraries import general as ge

from itables import init_notebook_mode

init_notebook_mode(all_interactive=True)


import os
os.chdir('/Users/babak.jfard/projects/EPHTracking')

### Preparing the tables

In [None]:
bd_grouped = pd.read_csv('Data/BirthDefects_09192024/Birth_Dfects_2024.csv')
# Read the summarized live births file written at the end of Section 2
lb_grouped = pd.read_csv('Data/BirthDefects_09192024/live_births_summarized_2.csv')

In [None]:
# Check if all the years have all 93 counties
lb_grouped.groupby('StartDate')['County'].nunique()

In [None]:
# The key columns to join the live birth and birth defects tables
key_cols = ['County', 'StartDate', 'MaternalAgeGroup', 'MaternalEthnicity', 'MaternalRace', 'InfantSex']

In [None]:
print(bd_grouped.dtypes)
print(lb_grouped.dtypes)

In [None]:
lb_grouped['County'] = lb_grouped['County'].astype('str')

In [None]:
# Comparing the data types of the key columns in each df before joining them
def comp_col_types(lb_grouped, bd_grouped, key_cols):
    lb_cols = lb_grouped[key_cols].dtypes.to_list()
    bd_cols = bd_grouped[key_cols].dtypes.to_list()

    if lb_cols == bd_cols:
        print("Columns match. We're good to go!")
    else:
        print("There are some mismatches between key columns")
        print(key_cols)
        print(f"Birth Defects: {bd_cols}")
        print(f"Live Births: {lb_cols}")


In [None]:
comp_col_types(lb_grouped, bd_grouped, key_cols)

In [None]:
# Checking for the differences in the unique values for corresponding columns
for col in key_cols:
    values_bd = set(bd_grouped[col])
    values_lb = set(lb_grouped[col])
    
    unique_to_bd = values_bd - values_lb
    unique_to_lb = values_lb - values_bd
    
    print(f"Values in {col} unique to Birth Defects: {unique_to_bd}")
    print(f"Values in {col} unique to Live Births: {unique_to_lb}\n")
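
Another way to see how many key combinations match is the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column labelling each row as `both`, `left_only`, or `right_only`. This is a sketch on two invented tables with a reduced key list, not a step from the original workflow.

```python
import pandas as pd

key_cols_demo = ['County', 'StartDate']  # reduced key list for the demo

# Hypothetical summaries: one shared key, one key unique to each table
bd_demo = pd.DataFrame({'County': [31001, 31055],
                        'StartDate': [20050101, 20050101],
                        'LBWBD': [1, 2]})
lb_demo = pd.DataFrame({'County': [31001, 31109],
                        'StartDate': [20050101, 20050101],
                        'TLB': [100, 40]})

check = lb_demo.merge(bd_demo, on=key_cols_demo, how='outer', indicator=True)
match_counts = check['_merge'].value_counts()
print(match_counts)
```

A large `right_only` count would mean that many birth defect groups have no live birth denominator, which is worth investigating before the final join.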



### Outer join between Birth Defects and Live Births
We want all counties present in each year. For many rows, LBWBD and LBFDTWD will be 0 because no birth defect occurred in that group, so an outer join is required. In addition, the TLB values for each birth defect must add up to the total live births.

***Note:*** To make sure that the denominator for each birth defect is the total live births, the data for each birth defect are separated and outer joined with the live birth data. In the end, the 12 resulting tables are concatenated into the final dataframe.

In [None]:
# Merging the two datasets
all_bds = []  # Create an empty list to store the dataframes

# Loop over unique values of the 'BirthDefect' column in bd_grouped
for defect in bd_grouped['BirthDefect'].unique():
    # Separate rows with the current 'BirthDefect' value
    current_defect_df = bd_grouped[bd_grouped['BirthDefect'] == defect]
    
    # Perform the outer join with lb_grouped on key columns
    merged_df = lb_grouped.merge(current_defect_df, on=key_cols, how='outer')
    
    # For rows without a 'BirthDefect' (no match), fill in the current 'BirthDefect' value
    merged_df['BirthDefect'] = merged_df['BirthDefect'].fillna(defect)
    
    # Add the new dataframe to the 'all_bds' list
    all_bds.append(merged_df)

# Continue the loop for the next 'BirthDefect'


In [None]:
final_bd = pd.concat(all_bds, ignore_index=True)
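
The effect of joining one birth defect at a time can be seen in a small sketch with invented numbers: two counties and two defects, each defect recorded in only one county. After the per-defect outer join, every defect carries the full set of live birth rows, so its TLB total equals the overall live birth total.

```python
import pandas as pd

# Hypothetical summaries (County is the only key in this demo)
lb_demo = pd.DataFrame({'County': [31001, 31055], 'TLB': [100, 40]})
bd_demo = pd.DataFrame({'BirthDefect': [1, 2],
                        'County': [31001, 31055],
                        'LBWBD': [3, 1]})

parts = []
for defect in bd_demo['BirthDefect'].unique():
    one_defect = bd_demo[bd_demo['BirthDefect'] == defect]
    merged = lb_demo.merge(one_defect, on=['County'], how='outer')
    # Rows with no defect record still belong to this defect's table
    merged['BirthDefect'] = merged['BirthDefect'].fillna(defect)
    parts.append(merged)

joined = pd.concat(parts, ignore_index=True)
tlb_per_defect = joined.groupby('BirthDefect')['TLB'].sum()
print(tlb_per_defect)
```

A single outer join of the full tables would instead leave each defect with only the live birth rows it happened to match.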

* The next step is to take care of NaN values in the non-matched rows

In [None]:
# final_bd.drop(columns='Unnamed: 0', inplace=True)

In [None]:
final_bd.isna().sum()

1) TLB: TLB is missing for rows with maternal age group 9 (unknown). This is exactly the same number as we had in the first approach, so we set these values to -999.
2) BirthDefect: for cases where we only want to provide live births, we add 21.
3) EndDate = (StartDate // 1e4) * 1e4 + 1231, i.e. December 31 of the same year.
4) LBWBD is set to 0 for all missing values.
5) LBFDTWD is handled the same way as LBWBD.

In [None]:
final_bd['TLB'] = final_bd['TLB'].fillna(-999).astype('int')
final_bd['EndDate'] = ((final_bd['StartDate'] // 1e4) * 1e4 + 1231).astype('int')
final_bd['LBWBD'] = (final_bd['LBWBD'].fillna(0)).astype('int')
final_bd['LBFDTWD'] = (final_bd['LBFDTWD'].fillna(0)).astype('int')
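
As a worked example of the EndDate formula, the sketch below applies it to two sample StartDate values. It uses integer constants instead of `1e4`, which is an equivalent variant that keeps the result as an integer without a cast.

```python
import pandas as pd

# Sample StartDate values in YYYYMMDD form
start = pd.Series([20050101, 20210101])

# Floor division by 10,000 keeps the year; adding 1231 sets the date to December 31
end = (start // 10_000) * 10_000 + 1231
print(end.tolist())
```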

In [None]:
data_toSave = final_bd

### Final step: Prepare Data to Save
This is the final step: saving the data in the format and with the values that can be submitted to the Tracking system.

In [None]:
# Order the columns in the same order as Data Dictionary
ordered_columns = ['County', 'StartDate', 'EndDate', 'BirthDefect', 'MaternalAgeGroup',
                   'MaternalEthnicity', 'MaternalRace', 'InfantSex', 'TLB', 'LBWBD', 'LBFDTWD']

# .copy() avoids a SettingWithCopyWarning when InfantSex is edited below
data_toSave = data_toSave[ordered_columns].copy()

In [None]:
# Checking the data types
data_toSave.dtypes

In [None]:
data_toSave['InfantSex'] = data_toSave['InfantSex'].replace({'N': 'U'})

In [None]:
# final check
for i in range(8):
    col = data_toSave.columns[i]
    print(f'Column: {col}')
    print(data_toSave[col].unique())
    print('\n')

In [None]:
# Now saving each year into a separate file:
#output_folder = 'Data/BIRTHDEFECTS080823/To_Submit/'
output_folder = 'Data/BirthDefects_09192024/final_submit_2/'

for st_date in data_toSave.StartDate.unique():
    to_save = data_toSave[data_toSave['StartDate'] == st_date]
    to_save.index = range(1, len(to_save) + 1)
    year = (st_date//1e4).astype('int').astype('str')
    filename = output_folder+'BirthDefects_AllCounties_'+year+ '.csv'

    to_save.to_csv(filename, index = True,index_label='RowIdentifier')

## A. Some Checks
Some checks to make sure the data makes sense
 

* Checking if the sum of TLBs for each BirthDefect adds up to total live births in each year

In [None]:
def sum_ignore_values(series, ignore=-999):
    return series.replace(ignore, np.nan).sum()

print(data_toSave.pivot_table(index='StartDate', columns='BirthDefect', values='TLB', aggfunc=lambda x: sum_ignore_values(x)))
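
To show what this check does with the -999 sentinel, the sketch below runs the same kind of pivot on a toy table with invented values. Each defect has one real TLB value and one -999 row, and the sentinel rows are turned into NaN so they drop out of the sum.

```python
import numpy as np
import pandas as pd

# Toy data: one real TLB value and one -999 sentinel per birth defect
toy = pd.DataFrame({
    'StartDate': [20050101] * 4,
    'BirthDefect': [1, 1, 2, 2],
    'TLB': [100, -999, 100, -999],
})

totals = toy.pivot_table(
    index='StartDate', columns='BirthDefect', values='TLB',
    aggfunc=lambda s: s.replace(-999, np.nan).sum(),  # NaN is skipped by sum()
)
print(totals)
```

If the join worked as intended, every column in a given row of this table shows the same total.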

* Plotting the Sum, max, mean, min for each Birth Defect in each year

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

def plot_summaries(df, groups, coexist, x_column, nrow):
    # Get unique values from the groups column
    unique_groups = df[groups].unique()
    
    # Calculate number of columns for the grid
    ncol = len(unique_groups) // nrow
    if len(unique_groups) % nrow != 0:
        ncol += 1
    
    # Initialize a figure; squeeze=False always returns a 2D array of axes,
    # even when there is only one row or one column
    fig, axes = plt.subplots(nrow, ncol, figsize=(15, 10), squeeze=False)
    
    # Iterate over each unique group and plot
    for idx, group in enumerate(unique_groups):
        ax = axes[idx // ncol, idx % ncol]
        
        # Filter dataframe for the current group
        subset = df[df[groups] == group]
        
        # Plot each column in coexist
        for col in coexist:
            sns.lineplot(data=subset, x=x_column, y=col, ax=ax, label=col)
        
        ax.set_title(f"{groups}: {group}")
        ax.legend()
    
    # If there are empty subplots, hide them
    for idx in range(len(unique_groups), nrow * ncol):
        axes[idx // ncol, idx % ncol].axis('off')
    
    plt.tight_layout()
    plt.show()


In [None]:
ds_summary = data_toSave.groupby(['BirthDefect', 'StartDate'])['LBFDTWD'].agg(SUM = 'sum', MIN='min', MEAN='mean', MAX='max').reset_index()
ds_summary['BirthDefect'] = ds_summary['BirthDefect'].astype('int')
ds_summary.StartDate = ds_summary.StartDate.astype('str').str[2:4]

plot_summaries(df=ds_summary, groups='BirthDefect', coexist=['SUM', 'MIN', 'MEAN', 'MAX'], x_column='StartDate', nrow=3)