# Birth Defects for Nebraska - EPHTracking Fall 2023 Data Call
- Babak J.Fard -- October 2023

This notebook shows the steps for creating the Birth Defects (BD) datasets, as required by the Tracking How-To-Guide (HTG) and Data Dictionary, from the raw datasets. Because the format of the raw dataset may be very specific and different from other states (and even from future BD datasets), no separate Python module (.py file) was created. Users should account for potential differences and changes before applying it to other datasets. The notebook is organized into four sections to make it easy to follow.

* All cell outputs that may contain detailed level health data are removed.

In [None]:
# Import the libraries for data validation
import numpy as np
import pandas as pd

#from libraries import Validator_Nebraska_2023_BirthDefects as VNBD
from libraries import general as ge
#from pydantic import ValidationError

#from itables import init_notebook_mode

#init_notebook_mode(all_interactive=True)


import os
os.chdir('/Users/babak.jfard/projects/EPHTracking')

## 1. Preparing the Birth Defects Data
Using the provided dataset, we have created a validator class. In this step we validate the raw data using the Nebraska Birth Defects validator.

In [None]:
bd_2 = pd.read_csv('Data/BIRTHDEFECTS080823/bd10_9/bd10_9.csv', na_values=['nan'], dtype=str)

In [None]:
# According to HTG Appendix A, the CDC/BPA codes for BD 27 must exclude 752.61.
# Since the codes before 2015 are CDC/BPA, we replace these values with ''
# to make sure they are excluded.
bd_2[bd_2['d15'].fillna('').str.startswith('752.621')]

In [None]:
# When d15 is treated as CDC/BPA instead of ICD-9, replace these values with ''
bd_2.loc[bd_2['d15']=='752.621', 'd15'] = ''

In [None]:
# double check
bd_2[bd_2['d15'].fillna('').str.startswith('752.621')]

In [None]:
# Checking if the maternal state of residence is NE
bd_2.stresm.value_counts(dropna=False)

In [None]:
# Keep only records whose maternal state of residence is Nebraska
bd_2 = bd_2[bd_2['stresm'].str.lower() == 'nebraska']

In [None]:
# Remove the extra column
bd_2 = bd_2.drop(columns='stresm')

In [None]:
# Records where the birth certificate and fetal death certificate conflict: if fetal_cert is not NA, the birth certificate number should be null
bd_2[bd_2['fetal_cert'].notna()]['bth_cert1'].value_counts(dropna=False)

In [None]:
# Remove records where both the fetal death certificate and the birth certificate are non-null
bd_2 = bd_2[~(bd_2['fetal_cert'].notna() & bd_2['bth_cert1'].notna())]


It looks like there is a big change between the two datasets: the old one and the new one that was just provided!

###  Dates
Reminder from the HTG:
* A row of data is a unique combination of County, StartDate, EndDate, BirthDefect, MaternalAgeGroup, MaternalRace, MaternalEthnicity, and InfantSex

In [None]:
# Change date columns into datetime format
bd_2['dob_c'] = pd.to_datetime(bd_2['dob_c'], format='%m/%d/%Y', errors='coerce')
bd_2['dob_m'] = pd.to_datetime(bd_2['dob_m'], format='%m/%d/%Y', errors='coerce')

In [None]:
bd_2 = bd_2[bd_2['dob_c'].dt.year <= 2021]

### Maternal Age
This section is about the mother's ages and the corrections

In [None]:
# Some of the mothers' dates of birth are implausible (before 1930). We change them to NA
# bd_2.loc[bd_2['dob_m'].dt.year < 1930].replace('dob_m', pd.NA, inplace=True)
bd_2.loc[bd_2['dob_m'].dt.year < 1930, 'dob_m'] = pd.NA


In [None]:
# Double check that the years extracted from 'dob_c' match the bthyr column.
# Count mismatches directly; summing the differences could cancel out to 0.
(bd_2['dob_c'].dt.year != bd_2['bthyr'].astype('int')).sum()

In [None]:
bd_2['Mom_Age'] = bd_2['bthyr'].astype('int') - bd_2['dob_m'].dt.year

In [None]:
bd_2[bd_2['Mom_Age'] <= 12]

In [None]:
# correcting for these ages
bd_2.loc[bd_2['Mom_Age'] <= 12, 'dob_m'] = pd.NA

# recalculating Mom_Age
bd_2['Mom_Age'] = bd_2['bthyr'].astype('int') - bd_2['dob_m'].dt.year

In [None]:
import matplotlib.pyplot as plt

# Plotting the histogram
bd_2['Mom_Age'].hist(bins=10, edgecolor='black')
plt.title('Mother Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Categorizing ages into MaternalAgeGroup
# Categorizing maternal age into the groups from Birth Defects Dictionary, May 2022
def categorize_age(df, age_col = 'Mom_Age', new_col= 'MaternalAgeGroup'):
    # Define the age categorization function
    def age_category(age):
        if age < 20:
            return 1
        elif 20 <= age <= 24:
            return 2
        elif 25 <= age <= 29:
            return 3
        elif 30 <= age <= 34:
            return 4
        elif 35 <= age <= 39:
            return 5
        elif age >= 40:
            return 6
        else:
            return 9  # Unknown

    # Apply the age categorization function to the 'Mom_Age' column
    df[new_col] = df[age_col].apply(age_category)
    return df
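
The nested `if`/`elif` function above can also be expressed with `pd.cut`. The following is a minimal sketch rather than the notebook's method: `categorize_age_cut` is a hypothetical helper, and it assumes ages are whole numbers (as produced by subtracting years), so that the right-closed bins reproduce the same boundaries.

```python
import numpy as np
import pandas as pd

def categorize_age_cut(ages):
    """Map whole-number ages to groups 1-6, and missing ages to 9 (unknown)."""
    # Right-closed bins: (-inf, 19], (19, 24], (24, 29], (29, 34], (34, 39], (39, inf)
    bins = [-np.inf, 19, 24, 29, 34, 39, np.inf]
    labels = [1, 2, 3, 4, 5, 6]
    groups = pd.cut(ages, bins=bins, labels=labels)
    # Missing ages fall in no bin and come back as NaN; code them as 9
    return groups.astype('float').fillna(9).astype('int')

# Toy ages covering each boundary plus one missing value
ages = pd.Series([17, 20, 24, 25, 34, 39, 40, np.nan])
age_groups = categorize_age_cut(ages)
print(age_groups.tolist())
```

For fractional ages (for example 19.5) the two approaches would disagree, so this sketch only applies when ages are integers.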


In [None]:
bd_2 = categorize_age(bd_2)
# matches = categorize_age(matches)


In [None]:
# Count the occurrences of each category
age_counts = bd_2['MaternalAgeGroup'].value_counts()

# Sort the index
age_counts = age_counts.sort_index()

# Plotting the bar plot
age_counts.plot(kind='bar', edgecolor='black', align='center')
plt.title('Mother Age Category')
plt.xlabel('Age Category')
plt.ylabel('Number')
plt.xticks(rotation=0)  # Ensure x-axis labels are horizontal
plt.show()

### County Codes
The provided counties are given as names. In this section we create a column with their corresponding FIPS codes.

In [None]:
# Getting the numeric FIPS code
fips = ge.get_Counties_FIPS()
fips['county_name'] = fips['county_name'].str.lower()
fips['county_name'] = fips['county_name'].str.replace(" county", "")


bd_2['cou'] = bd_2['cou'].str.lower()

In [None]:
bd_2 = bd_2.merge(fips, left_on='cou', right_on='county_name')

In [None]:
# Check that the merge worked; this should return no rows. Note that the default inner merge drops any county name without a FIPS match.
bd_2[bd_2['cou'] != bd_2['county_name']]
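
Because the merge above is an inner join, a county name that fails to match is dropped before the check can see it. A minimal sketch of one way to surface such rows, using a left join with `indicator=True`, is shown here. The frames `records` and `fips_demo` are toy data made up for illustration; they are not the notebook's tables.

```python
import pandas as pd

# Toy data: one name has a trailing space and one has no FIPS entry
records = pd.DataFrame({'cou': ['douglas', 'lancaster', 'sarpy ', 'unknown']})
fips_demo = pd.DataFrame({'county_name': ['douglas', 'lancaster', 'sarpy'],
                          'fips': [31055, 31109, 31153]})

# A left join keeps every record; the _merge column says where each row came from
merged = records.merge(fips_demo, left_on='cou', right_on='county_name',
                       how='left', indicator=True)

# Rows flagged 'left_only' found no matching county name
unmatched = merged.loc[merged['_merge'] == 'left_only', 'cou'].tolist()
print(unmatched)
```

Inspecting `unmatched` before dropping anything makes it clear whether rows are lost to spelling or whitespace differences.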

In [None]:
# Removing cou and renaming the new column
bd_2 = bd_2.drop(columns=['cou', 'county_name'])
bd_2.rename(columns={'fips': 'County', 'sex':'InfantSex'}, inplace=True)

### Maternal Race and Ethnicity
Following the HTG, race and ethnicity each get their own column.

In [None]:
race = {1:'W', 2:'B', 3:'O', 4:'O', 8:'O', 9:'U'}
ethnicity = {1:'NH', 2:'NH', 3:'NH', 4:'NH', 8: 'H', 9:'U'}

In [None]:
bd_2.racethm = bd_2.racethm.astype('int')

In [None]:
bd_2['MaternalEthnicity'] = bd_2['racethm'].replace(ethnicity)
bd_2['MaternalRace'] = bd_2['racethm'].replace(race)

In [None]:
bd_2 = bd_2.drop(columns='racethm')
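
One caveat with `replace` is that a code missing from the dictionary passes through unchanged. The sketch below, on made-up codes, uses `map` instead so that unmapped codes become NaN and can be listed. The `race` dictionary is copied from the cell above; the code 7 is a hypothetical unexpected value.

```python
import pandas as pd

race = {1: 'W', 2: 'B', 3: 'O', 4: 'O', 8: 'O', 9: 'U'}

# Toy codes; 7 is a hypothetical value that the dictionary does not cover
codes = pd.Series([1, 2, 8, 9, 7])

# map() returns NaN for any code that is not a key in the dictionary
mapped = codes.map(race)

# List the codes that were left unmapped so they can be reviewed
unmapped_codes = sorted(codes[mapped.isna()].unique().tolist())
print(mapped.tolist(), unmapped_codes)
```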

In [None]:
bd_2['StartDate'] = bd_2['bthyr'].astype('str')+'0101'
bd_2['EndDate'] = bd_2['bthyr'].astype('str')+'1231'

In [None]:
bd_2 = bd_2.drop(columns=['dob_c', 'bthyr', 'dob_m', 'case_id', 'Mom_Age'])

### Converting ICD-9 and ICD-10s to BirthDefect Codes

In [None]:
import json

# Reading ICD-9 codes related to the Birth Defects
with open('Data/Dictionaries/BirthDefects_icd_9_convert.json', 'r') as f:
    icd_9_dict = json.load(f)

# Reading CDC/BPA codes related to the Birth Defects
with open('Data/Dictionaries/BirthDefects_CDC_BPA_convert.json', 'r') as f:
    icd_BPA_dict = json.load(f)

In [None]:
with open('Data/Dictionaries/BirthDefects_icd_10_convert.json') as f:
    icd_10_dict = json.load(f)

In [None]:
icd_BPA_dict

In [None]:
# this function maps the values of icd_9 or icd_10 into the values of BirthDefect 12 codes
# This method was not used
# def lookup_icd_value(value, icd_code = 10):
#     if icd_code == 10:
#         return icd_10_dict.get(value, np.nan)
#     if icd_code == 9:
#         return icd_9_dict.get(value, np.nan)
#     
# # Creating a column mapping from ICD-9
# bd['BIRTH_DEFECTS_from_9'] = bd['DEFECT_CODE'].apply(lookup_icd_value, icd_code = 9)
# bd['BIRTH_DEFECTS_from_10'] = bd['DEFECT_CODE10CM'].apply(lookup_icd_value)

In [None]:
# The second method: instead of an exact match, check whether the value starts with a key in the dictionary
def lookup_icd_value_startsWith(value, icd_code=10):
    # icd_code selects the dictionary: 10 = ICD-10, 9 = ICD-9, 8 = CDC/BPA
    if pd.isnull(value):
        return np.nan
    if icd_code == 10:
        lookup = icd_10_dict
    elif icd_code == 9:
        lookup = icd_9_dict
    elif icd_code == 8:
        lookup = icd_BPA_dict
    else:
        return np.nan
    # Return the mapping of the first key that the value starts with
    for key in lookup:
        if value.startswith(key):
            return lookup[key]
    return np.nan

# Creating a column mapping from CDC/BPA (icd_code=8); the ICD-9 version is kept commented out for reference
#bd_2['BIRTH_DEFECTS_from_9'] = bd_2['d15'].apply(lookup_icd_value_startsWith, icd_code = 9)
bd_2['BIRTH_DEFECTS_from_9'] = bd_2['d15'].apply(lookup_icd_value_startsWith, icd_code = 8)
bd_2['BIRTH_DEFECTS_from_10'] = bd_2['d16'].apply(lookup_icd_value_startsWith)
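
The prefix lookup above returns the first dictionary key that matches, so if one key is a prefix of another the result depends on key order. Whether the real dictionaries contain such overlaps is not known here. As a sketch under that assumption, trying the longest keys first makes the result order-independent. The helper `lookup_by_longest_prefix`, the map `demo_map`, and the codes are all made up for illustration.

```python
import numpy as np
import pandas as pd

def lookup_by_longest_prefix(value, prefix_map):
    """Return the mapping of the longest key that `value` starts with, else NaN."""
    if pd.isnull(value):
        return np.nan
    # Longest keys first, so a more specific prefix wins over a shorter one
    for key in sorted(prefix_map, key=len, reverse=True):
        if value.startswith(key):
            return prefix_map[key]
    return np.nan

# Hypothetical overlapping prefixes (not real diagnosis codes)
demo_map = {'X10': 1, 'X10.5': 2}
codes = pd.Series(['X10.1', 'X10.52', 'Y99', None])
result = codes.apply(lookup_by_longest_prefix, prefix_map=demo_map)
print(result.tolist())
```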

In [None]:
# Birth Defects out of our 12 categories
not_matches = bd_2[bd_2['BIRTH_DEFECTS_from_9'].isna() & bd_2['BIRTH_DEFECTS_from_10'].isna()]

# Birth Defects of our 12 categories
matches = bd_2[bd_2['BIRTH_DEFECTS_from_9'].notna() | bd_2['BIRTH_DEFECTS_from_10'].notna()]


Checking what percentage of all birth defects from 2005 to 2021 falls in the 12 Tracking categories (the cell below shows it is around 6.3%).

In [None]:
matches.shape[0]* 100 / bd_2.shape[0]

In [None]:
# Check that records with both an ICD-10 and a CDC/BPA code map to the same BD id
dup_check = bd_2[bd_2['BIRTH_DEFECTS_from_10'].notna() & bd_2['BIRTH_DEFECTS_from_9'].notna()]
sum(dup_check['BIRTH_DEFECTS_from_10'] != dup_check['BIRTH_DEFECTS_from_9'])

In [None]:
# Merge the two columns into BirthDefect column
def merge_defect_columns(row, ICD_9_Col='BIRTH_DEFECTS_from_9', ICD_10_Col= 'BIRTH_DEFECTS_from_10'):
    val_10 = row[ICD_10_Col]
    val_9 = row[ICD_9_Col]
    
    if pd.isna(val_10) and pd.isna(val_9):
        return np.nan
    elif pd.isna(val_10):
        return val_9
    elif pd.isna(val_9):
        return val_10
    else:
        return val_10 if val_10 == val_9 else np.nan

bd_2['BirthDefects'] = bd_2.apply(merge_defect_columns, axis=1)

In [None]:
bd_2['BirthDefects'].value_counts(dropna=False)
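
The row-wise merge rule (take whichever code is present, and return NaN when both are present but disagree) can also be written without `apply`. This is a minimal sketch on toy Series; `from_bpa` and `from_icd10` are hypothetical stand-ins for the two mapped columns.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two mapped columns
from_bpa = pd.Series([1, np.nan, 3, np.nan, 5])
from_icd10 = pd.Series([np.nan, 2, 4, np.nan, 5])

# A conflict is a row where both values exist and differ
conflict = from_bpa.notna() & from_icd10.notna() & (from_bpa != from_icd10)

# Prefer the ICD-10 value, fall back to CDC/BPA, then blank out conflicts
combined = from_icd10.combine_first(from_bpa).mask(conflict)
print(combined.tolist())
```

The third row (3 vs. 4) is a conflict and becomes NaN, the fourth has no code at all, and the last row agrees on 5.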

### Final preparations of the Birth Defect data

* We are only interested in the `matches` subset of the data, which contains the 12 BDs of interest

In [None]:
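# Work on an explicit copy so the new column is not assigned to a slice of bd_2
matches = matches.copy()
matches['BirthDefect'] = matches.apply(merge_defect_columns, axis=1)
matches['BirthDefect'].value_counts(dropna=False)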

In [None]:
matches.MaternalEthnicity.value_counts(dropna=False)

In [None]:
matches.MaternalRace.value_counts(dropna=False)

In [None]:
# Percent of live births with the specified birth defects
matches.bth_cert1.notna().sum()*100/matches.shape[0]

In [None]:
# Percent of fetal death with the specified birth defects
matches.fetal_cert.notna().sum()*100/matches.shape[0]

In [None]:
matches.columns

In [None]:
group_by = ['BirthDefect', 'County', 'StartDate', 'EndDate','MaternalAgeGroup', 'MaternalEthnicity', 'MaternalRace',
                   'InfantSex']

In [None]:
# bd_grouped = matches.groupby(group_by).agg(LBWBD=('bth_cert1', 'size')).reset_index()
bd_grouped = matches.groupby(group_by).agg(LBWBD=('bth_cert1', lambda x: x.notna().sum()),
                                           LBFDTWD=('bth_cert1', 'size')).reset_index()
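
To make the two counts concrete, here is a small sketch on made-up rows. It uses the built-in `'count'` aggregation, which counts non-null values and so plays the same role as the lambda above, while `'size'` counts all rows in the group. The frame `demo` and its certificate values are hypothetical.

```python
import pandas as pd

# Toy records: bth_cert1 is null when there is no birth certificate
demo = pd.DataFrame({
    'County': [31055, 31055, 31055, 31109],
    'BirthDefect': [1, 1, 1, 2],
    'bth_cert1': ['A1', None, 'A3', None],
})

counts = (demo.groupby(['County', 'BirthDefect'])
              .agg(LBWBD=('bth_cert1', 'count'),   # non-null certificates only
                   LBFDTWD=('bth_cert1', 'size'))  # every row in the group
              .reset_index())
print(counts)
```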


In [None]:
bd_grouped.to_csv('Data/BIRTHDEFECTS080823/BirthDefects_without_TBL_BPA.csv')

In [None]:
# At the end of this section, the following command clears all variables from the namespace.
%reset

## 2. Preparing the Live Birth Data

In [None]:
import numpy as np
import pandas as pd


from libraries import general as ge

# from itables import init_notebook_mode

#init_notebook_mode(all_interactive=True)


import os
os.chdir('/Users/babak.jfard/projects/EPHTracking')

In [None]:
#live_births = pd.read_csv('Data/BIRTHDEFECTS080823/Live Births/bth2005.csv')
folder = r'Data/BIRTHDEFECTS080823/Live Births'
live_births = pd.concat((pd.read_csv(folder+'/'+filename)) for filename in os.listdir(folder) if filename.endswith('.csv'))

In [None]:
print("Number of live births in each year")
live_births.DOB_YY.value_counts()

In [None]:
# keep only those with mother state of residence as NE
live_births = live_births[live_births.strm == 'NE']

In [None]:
live_births['coures'].nunique()

### Correcting the FIPS code, Maternal Ethnicity, and Dates
The received FIPS codes for the counties are in three-digit format and need to be changed to five digits.

In [None]:

# Making FIPS codes into 5 digits
state_FIPS = '31' #For Nebraska.

live_births.coures = live_births.coures.astype('str').str.zfill(3) #Pad strings in the Series/Index by prepending ‘0’ characters.
live_births['County'] = (state_FIPS+ live_births['coures']).astype('int')
live_births.drop(columns='coures', inplace=True)
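
As a quick worked example of the padding step, the sketch below applies the same `zfill` logic to three made-up county codes. The Series `coures` here is toy data, not the notebook's column.

```python
import pandas as pd

state_fips = '31'  # Nebraska

# Toy three-digit county codes stored as integers, so leading zeros are lost
coures = pd.Series([1, 55, 109])

# Pad to three digits, prepend the state code, and convert back to integer
county = (state_fips + coures.astype(str).str.zfill(3)).astype(int)
print(county.tolist())
```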

In [None]:
live_births['StartDate'] = live_births['DOB_YY'].astype('str') + '0101'

In [None]:
live_births.rename(columns={'sex': 'InfantSex'}, inplace=True)

In [None]:
# Distinguishing Ethnicity
# Define the conditions and choices
conditions = [
    live_births['hispanicm'].str.contains('H'),
    live_births['hispanicm'].str.contains('U')
]
choices = ['H', 'U']

# Create the new column using np.select
live_births['MaternalEthnicity'] = np.select(conditions, choices, default='NH')
live_births = live_births.drop(columns='hispanicm')

In [None]:
live_births.MaternalEthnicity.value_counts()

In [None]:
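# Categorizing maternal age into the groups from the HTG, and saving it into the appropriate column.
# Note: %reset at the end of section 1 cleared categorize_age, so re-run its defining cell before this one.
live_births = categorize_age(live_births, age_col='agemo')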

### Editing Maternal Race

In [None]:
race_other_columns = ['aindianm', 'chamorrom', 'chinesem', 'filipinom', 'indianm', 'japanesem',
                      'koreanm', 'nhawaiianm', 'opacislm', 'otheram', 'otherm', 'samoanm', 'vietnamesem']
race_black_columns = 'blackm'
race_white_columns = 'whitem'

In [None]:
# checking the unique values in all other race columns are 'Y' or 'N'
[print(f"{col}: {live_births[col].value_counts(dropna=False)}") for col in race_other_columns]

In [None]:
# changing to digits to make consistency checks easier
to_digits = {'Y':1, 'N':0}

In [None]:
# Assign the result back; an in-place replace on a selected column may not update live_births
live_births[race_other_columns] = live_births[race_other_columns].replace(to_digits)

In [None]:
# Calculating all into one column for other
live_births['race_other'] = live_births[race_other_columns].sum(axis=1)
live_births['race_other'] = np.where(live_births['race_other']>0, 1, 0)

In [None]:
# Doing the same for Black and White races
live_births[race_white_columns] = live_births[race_white_columns].replace(to_digits)
live_births[race_black_columns] = live_births[race_black_columns].replace(to_digits)

In [None]:
# checking potential values. In an ideal situation there must be only one
# or 0 for unknown
three_races = ['blackm', 'whitem', 'race_other']
live_births[three_races].sum(axis=1).value_counts(dropna=False)

Looks good: only about 3% of rows do not sum to 1. If the sum is anything other than 1, the race is set to 'U'; otherwise we check which of 'W', 'B', or 'O' applies.

In [None]:
# Calculate the sum of the three columns
live_births['sum_race'] = live_births[['blackm', 'whitem', 'race_other']].sum(axis=1)

# Define the conditions and choices for the 'MaternalRace' column
conditions = [
    (live_births['sum_race'] != 1),
    (live_births['blackm'] == 1),
    (live_births['whitem'] == 1),
    (live_births['race_other'] == 1)
]
choices = ['U', 'B', 'W', 'O']

# Create the 'MaternalRace' column using numpy.select; the explicit default keeps every value a string
live_births['MaternalRace'] = np.select(conditions, choices, default='U')

# Drop the 'sum_race' column as it's no longer needed
live_births.drop('sum_race', axis=1, inplace=True)
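
The rule can be traced on a few made-up rows. In this sketch `flags` is toy data with the three 0/1 indicator columns; rows with no flag or with more than one flag end up as 'U'.

```python
import numpy as np
import pandas as pd

# Toy indicator rows: Black only, White only, other only, two flags, no flag
flags = pd.DataFrame({'blackm':     [1, 0, 0, 1, 0],
                      'whitem':     [0, 1, 0, 1, 0],
                      'race_other': [0, 0, 1, 0, 0]})

n_selected = flags.sum(axis=1)

# np.select uses the first condition that is True for each row
conditions = [n_selected != 1,
              flags['blackm'] == 1,
              flags['whitem'] == 1,
              flags['race_other'] == 1]
choices = ['U', 'B', 'W', 'O']

maternal_race = np.select(conditions, choices, default='U')
print(maternal_race.tolist())
```

Since the "not exactly one flag" condition comes first, it takes priority over the individual race checks.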


In [None]:
live_births.MaternalRace.value_counts()

### Wrapping up
Here we prepare, rename, and reorder the columns to get ready for joining with the birth defects data.

In [None]:
columns_to_keep = ['County', 'StartDate', 'MaternalAgeGroup', 'MaternalEthnicity', 'MaternalRace',
                   'InfantSex']
live_births = live_births[columns_to_keep]

In [None]:
live_births.StartDate.value_counts()

In [None]:
live_births.to_csv('Data/BIRTHDEFECTS080823/live_births_cleaned.csv')

In [None]:
# Now grouping them and calculating TLB
#live_births['TLB'] = live_births.groupby(columns_to_keep).transform('size')
lb_grouped = live_births.groupby(columns_to_keep).agg(TLB=('County', 'size')).reset_index()


In [None]:
lb_grouped.head(50)

In [None]:
# Saving Total births
lb_grouped.to_csv('Data/BIRTHDEFECTS080823/live_births_summarized.csv', index=False)

# ****************************** F I N I S H E D (Live Birth data section)

In [None]:
# Cleaning the memory
%reset

## 3. Joining the Datasets
Here we join the two tables into the final table. But first check to make sure that the key columns actually match

In [None]:
import numpy as np
import pandas as pd


from libraries import general as ge

from itables import init_notebook_mode

init_notebook_mode(all_interactive=True)


import os
os.chdir('/Users/babak.jfard/projects/EPHTracking')

### Preparing the tables

In [None]:
bd_grouped = pd.read_csv('Data/BIRTHDEFECTS080823/BirthDefects_without_TBL_BPA.csv')
lb_grouped = pd.read_csv('Data/BIRTHDEFECTS080823/live_births_summarized.csv')

In [None]:
# Check if all the years have all 93 counties
lb_grouped.groupby('StartDate')['County'].nunique()

In [None]:
# Which county is missing in 20120101?
set(lb_grouped.County) - set(lb_grouped[lb_grouped['StartDate']==20120101]['County'])
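
The set difference above checks a single year. A sketch of the same idea applied to every year at once is given below, on a made-up table `lb`; it assumes that each county appears in at least one year, so the union of all counties can serve as the reference set.

```python
import pandas as pd

# Toy table: county 31003 has no rows for 20130101
lb = pd.DataFrame({'County':    [31001, 31003, 31001, 31003, 31001],
                   'StartDate': [20110101, 20110101, 20120101, 20120101, 20130101]})

# Reference set: every county seen in any year
all_counties = set(lb['County'])

# For each year, record the counties that are absent
missing_by_year = {}
for start_date, group in lb.groupby('StartDate'):
    missing = sorted(int(c) for c in all_counties - set(group['County']))
    if missing:
        missing_by_year[int(start_date)] = missing
print(missing_by_year)
```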

In [None]:
import pandas as pd

def check_and_insert_row(df, new_row, key_columns):
    """
    Check if a new row is a duplicate based on key columns and insert it if not.

    Parameters:
    df (pd.DataFrame): The DataFrame to check against.
    new_row (dict): The new row to insert, in the form of a dictionary.
    key_columns (list): The list of columns to check for duplicates.

    Returns:
    tuple: (is_duplicate, df)
        is_duplicate (bool): True if the new row is a duplicate, False otherwise.
        df (pd.DataFrame): The updated DataFrame with the new row inserted if not a duplicate.
    """
    # Check if the new_row values for key columns match any existing rows
    is_duplicate = (df[key_columns] == pd.Series(new_row, index=key_columns)).all(axis=1).any()

    if is_duplicate:
        print("The new row is a duplicate based on the key columns.")
    else:
        print("The new row is not a duplicate and has been added to the DataFrame.")
        # Add the new row to the DataFrame
        #df = df.append(new_row, ignore_index=True)
        new_row_df = pd.DataFrame([new_row])
        df = pd.concat([df, new_row_df], ignore_index=True)
    
    return is_duplicate, df

# Example usage:
# df is your existing DataFrame
# new_row is a dictionary with your new row data
# key_columns are the columns you want to check for duplicates
# is_duplicate, updated_df = check_and_insert_row(df, new_row, key_columns)


In [None]:
# Adding a new row for 2012
new_row = {'County': 31117 , 'StartDate': 20120101, 'MaternalAgeGroup':9, 'MaternalEthnicity': "U",
       'MaternalRace': "U", 'InfantSex': "U", 'TLB': 0}
key_columns = ['County', 'StartDate', 'MaternalAgeGroup', 'MaternalEthnicity',
       'MaternalRace', 'InfantSex']
_, lb_grouped = check_and_insert_row(lb_grouped, new_row, key_columns)


In [None]:
# Checking again
lb_grouped.groupby('StartDate')['County'].nunique()

In [None]:
bd_grouped

In [None]:
# The key columns to join the live birth and birth defects tables
key_cols = ['County', 'StartDate', 'MaternalAgeGroup', 'MaternalEthnicity', 'MaternalRace', 'InfantSex']

In [None]:
# Comparing the data types of the key columns in each df before joining them
def comp_col_types(lb_grouped, bd_grouped, key_cols):
    lb_cols = lb_grouped[key_cols].dtypes.to_list()
    bd_cols = bd_grouped[key_cols].dtypes.to_list()

    if lb_cols == bd_cols:
        print("Columns match. We're good to go!")
    else:
        print("There are some mismatches between key columns")
        print(key_cols)
        print(f"Birth Defects: {bd_cols}")
        print(f"Live Births: {lb_cols}")


In [None]:
comp_col_types(lb_grouped, bd_grouped, key_cols)

In [None]:
# Checking for the differences in the unique values for corresponding columns
for col in key_cols:
    values_bd = set(bd_grouped[col])
    values_lb = set(lb_grouped[col])
    
    unique_to_bd = values_bd - values_lb
    unique_to_lb = values_lb - values_bd
    
    print(f"Values in {col} unique to Birth Defects: {unique_to_bd}")
    print(f"Values in {col} unique to Live Births: {unique_to_lb}\n")



### Outer join between Birth Defects and Live Births
All counties will be present in each year, but LBWBD and LBFDTWD will be 0 in many rows because no birth defect was recorded for that combination. This requires an outer join. In addition, the TLBs for each birth defect must add up to the total live births. Therefore, we separate the BD data by BirthDefect, outer join each subset with the live birth data, and at the end concatenate the 12 resulting tables into one final table.

***Note:*** this per-defect join is what guarantees that the denominator for each birth defect is the total number of live births.

In [None]:
# Merging the two datasets
all_bds = []  # Create an empty list to store the dataframes

# Loop over unique values of the 'BirthDefect' column in bd_grouped
for defect in bd_grouped['BirthDefect'].unique():
    # Separate rows with the current 'BirthDefect' value
    current_defect_df = bd_grouped[bd_grouped['BirthDefect'] == defect]
    
    # Perform the outer join with lb_grouped on key columns
    merged_df = lb_grouped.merge(current_defect_df, on=key_cols, how='outer')
    
    # Rows that came only from lb_grouped have no 'BirthDefect'; assign the current value
    merged_df['BirthDefect'] = merged_df['BirthDefect'].fillna(defect)
    
    # Add the new dataframe to the 'all_bds' list
    all_bds.append(merged_df)


In [None]:
final_bd = pd.concat(all_bds, ignore_index=True)
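
To see why the join is done per defect, the sketch below repeats the loop on two made-up tables, with a reduced set of key columns for brevity. After concatenation, the TLB for each defect sums to the same total (30), which is the property the denominator relies on. The frames `lb` and `bd` are toy data.

```python
import pandas as pd

# Toy live births: two counties with 10 and 20 births
lb = pd.DataFrame({'County': [31001, 31003],
                   'StartDate': [20200101, 20200101],
                   'TLB': [10, 20]})
# Toy birth defects: defect 1 only in the first county, defect 2 only in the second
bd = pd.DataFrame({'County': [31001, 31003],
                   'StartDate': [20200101, 20200101],
                   'BirthDefect': [1, 2],
                   'LBWBD': [1, 2]})
keys = ['County', 'StartDate']

parts = []
for defect in bd['BirthDefect'].unique():
    # Outer join keeps every live-birth row, even those with no cases of this defect
    part = lb.merge(bd[bd['BirthDefect'] == defect], on=keys, how='outer')
    part['BirthDefect'] = part['BirthDefect'].fillna(defect)
    part['LBWBD'] = part['LBWBD'].fillna(0)
    parts.append(part)

joined = pd.concat(parts, ignore_index=True)
tlb_per_defect = joined.groupby('BirthDefect')['TLB'].sum()
print(tlb_per_defect)
```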

* The next step is to take care of NaN values in the non-matched rows

In [None]:
final_bd.drop(columns='Unnamed: 0', inplace=True)

In [None]:
final_bd.isna().sum()

1) TLB: TLB is missing for the cases with maternal age group 9. This is exactly the same number as in the first approach, so we set them to -999
2) BirthDefect: for cases where we only want to provide live births, we add 21
3) EndDate = (StartDate // 1e4) * 1e4 + 1231
4) LBWBD is set to 0 for all missing values
5) LBFDTWD is handled the same way as LBWBD

In [None]:
final_bd['TLB'] = final_bd['TLB'].fillna(-999)
final_bd['EndDate'] = (final_bd['StartDate'] // 1e4) * 1e4 + 1231
final_bd['LBWBD'] = final_bd['LBWBD'].fillna(0)
final_bd['LBFDTWD'] = final_bd['LBFDTWD'].fillna(0)

In [None]:
data_toSave = final_bd

### Final step: Prepare Data to Save
This is the final step to save data into format and numbers that can be submitted to the Tracking system

In [None]:
# Order the columns in the same order as Data Dictionary
ordered_columns = ['County', 'StartDate', 'EndDate', 'BirthDefect', 'MaternalAgeGroup',
                   'MaternalEthnicity', 'MaternalRace', 'InfantSex', 'TLB', 'LBWBD', 'LBFDTWD']

data_toSave = data_toSave[ordered_columns]

In [None]:
# Checking the data types
data_toSave.dtypes

In [None]:
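cols_to_int = ['EndDate', 'BirthDefect', 'TLB', 'LBWBD', 'LBFDTWD']
# astype returns a new DataFrame, which avoids assigning into a slice of final_bd
data_toSave = data_toSave.astype({col: 'int' for col in cols_to_int})

data_toSave['InfantSex'] = data_toSave['InfantSex'].replace({'N': 'U'})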

In [None]:
# final check
for i in range(8):
    col = data_toSave.columns[i]
    print(f'Column: {col}')
    print(data_toSave[col].unique())
    print('\n')

In [None]:
# Now saving each year into a separate file:
#output_folder = 'Data/BIRTHDEFECTS080823/To_Submit/'
output_folder = 'Data/BIRTHDEFECTS080823/final_submit/'

for st_date in data_toSave.StartDate.unique():
    to_save = data_toSave[data_toSave['StartDate'] == st_date]
    to_save.index = range(1, len(to_save) + 1)
    year = str(int(st_date) // 10000)
    filename = output_folder+'BirthDefects_AllCounties_'+year+ '.csv'

    to_save.to_csv(filename, index = True,index_label='RowIdentifier')

## A. Some Checks
Some checks to make sure the data makes sense
 

* Checking if the sum of TLBs for each BirthDefect adds up to total live births in each year

In [None]:
def sum_ignore_values(series, ignore=-999):
    return series.replace(ignore, np.nan).sum()

data_toSave.pivot_table(index='StartDate', columns='BirthDefect', values='TLB', aggfunc=lambda x: sum_ignore_values(x))

* Plotting the Sum, max, mean, min for each Birth Defect in each year

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

def plot_summaries(df, groups, coexist, x_column, nrow):
    # Get unique values from the groups column
    unique_groups = df[groups].unique()
    
    # Calculate number of columns for the grid
    ncol = len(unique_groups) // nrow
    if len(unique_groups) % nrow != 0:
        ncol += 1
    
    # Initialize a figure; squeeze=False always returns a 2D array of axes,
    # including the single-row, single-column, and 1x1 cases
    fig, axes = plt.subplots(nrow, ncol, figsize=(15, 10), squeeze=False)
    
    # Iterate over each unique group and plot
    for idx, group in enumerate(unique_groups):
        ax = axes[idx // ncol, idx % ncol]
        
        # Filter dataframe for the current group
        subset = df[df[groups] == group]
        
        # Plot each column in coexist
        for col in coexist:
            sns.lineplot(data=subset, x=x_column, y=col, ax=ax, label=col)
        
        ax.set_title(f"{groups}: {group}")
        ax.legend()
    
    # If there are empty subplots, hide them
    for idx in range(len(unique_groups), nrow * ncol):
        axes[idx // ncol, idx % ncol].axis('off')
    
    plt.tight_layout()
    plt.show()


In [None]:
ds_summary = data_toSave.groupby(['BirthDefect', 'StartDate'])['LBFDTWD'].agg(SUM = 'sum', MIN='min', MEAN='mean', MAX='max').reset_index()
ds_summary['BirthDefect'] = ds_summary['BirthDefect'].astype('int')
ds_summary.StartDate = ds_summary.StartDate.astype('str').str[2:4]

plot_summaries(df=ds_summary, groups='BirthDefect', coexist=['SUM', 'MIN', 'MEAN', 'MAX'], x_column='StartDate', nrow=3)