# M1L6 Data Challenge:  Missing Data

## Scenario

You are continuing your analysis of the "New York City Leading Causes of Death" dataset.  You've noticed that the Deaths and Death Rate columns contain some missing values, represented by periods ('.').  Missing data is a common issue in real-world datasets, and it's crucial to handle it appropriately to avoid biased or inaccurate conclusions.


Reading the dataset documentation before you start is highly recommended: [Link to the Data](https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam/about_data)

## Objectives 

- Create missing value indicator columns.
- Identify the extent of missing data in the Deaths and Death Rate columns.
- Impute these missing values using appropriate strategies.



**Let's get started!**

### Step 1:  Import Pandas & Numpy

In [2]:
# Import pandas and NumPy
import pandas as pd
import numpy as np

### Step 2: Load the dataset (csv file stored in the data folder) into a Pandas DataFrame. 

The file is called:  `nyc_causeofdeath.csv`


In [3]:
df = pd.read_csv("/Users/gabriel/Desktop/marcy/DA2025_Lectures2/Mod1/nyc_causeofdeath.csv")

### Step 3: Count up the number of missing values in 2 columns

a.  Use value_counts() to determine the number of missing values (represented by '.') in the `Deaths` and `Death Rate` columns.  Print the value counts for each column.  Be sure to set dropna=False as an argument within value_counts()

b.  Add a comment stating how many missing values (aka periods '.') are in each column.
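As a minimal sketch of the technique on made-up data (the `toy` Series and its values are hypothetical, not taken from the dataset), `value_counts(dropna=False)` tallies every distinct value, including true `NaN` entries, and the count for the `'.'` placeholder can then be looked up by label:

```python
import pandas as pd

# Hypothetical stand-in for one column; the values are invented for illustration.
toy = pd.Series(["11", ".", "70", ".", None, "11"], name="Deaths")

# dropna=False keeps missing entries (None/NaN) in the tally as their own row.
counts = toy.value_counts(dropna=False)
print(counts)

# Look up how many entries are the '.' placeholder.
period_count = counts["."]
print(period_count)
```

Note that at this stage the periods are ordinary strings, so pandas does not yet treat them as missing.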


In [23]:
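# Tally every distinct value in Deaths, keeping NaN in the count
deathperiods = df["Deaths"].value_counts(dropna=False)
count1 = deathperiods["."]  # rows where Deaths is the '.' placeholder
print(count1)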

138


In [24]:
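# Tally every distinct value in Death Rate, keeping NaN in the count
deathrateperiods = df["Death Rate"].value_counts(dropna=False)
count2 = deathrateperiods["."]  # rows where Death Rate is the '.' placeholder
print(count2)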

386


In [None]:
# Comment 1: Deaths has 138 missing values (periods '.')
# Comment 2: Death Rate has 386 missing values (periods '.')

### Step 4:  Replace periods with NaN (not a number) and convert to numeric 

- a. Replace the '.' values in the `Deaths` and `Death Rate` columns with `np.nan`.
- b. Convert the `Deaths` and `Death Rate` columns to numeric.

This may take several lines of code
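One possible shape for these two sub-steps, sketched on a small made-up DataFrame (the `toy` frame and its values are hypothetical; only the column names mirror the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical data: numbers stored as text, with '.' marking missing values.
toy = pd.DataFrame({
    "Deaths": ["11", ".", "213"],
    "Death Rate": [".", "25", "176.5"],
})
cols = ["Deaths", "Death Rate"]

# a. Swap the '.' placeholder for a real missing value, assigning the result back.
toy[cols] = toy[cols].replace(".", np.nan)

# b. Convert the text to numbers; errors="coerce" turns any leftover
#    unparseable text into NaN instead of raising an error.
toy[cols] = toy[cols].apply(pd.to_numeric, errors="coerce")

print(toy)
print(toy.dtypes)
```

The result of `replace()` has to be assigned back to the DataFrame to take effect. With `errors="coerce"`, `pd.to_numeric` would also turn the periods into `NaN` by itself, so the explicit `replace()` mainly makes the intent visible.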

In [None]:
# Assign the result back so the DataFrame itself is updated
df["Deaths"] = df["Deaths"].replace(".", np.nan)
print(df["Deaths"])


0         11
1         70
2        213
3        NaN
4       1852
        ... 
1089    2293
1090      94
1091       9
1092     149
1093      93
Name: Deaths, Length: 1094, dtype: object


In [34]:
# Assign the result back so the DataFrame itself is updated
df["Death Rate"] = df["Death Rate"].replace(".", np.nan)
print(df["Death Rate"])

0         NaN
1         NaN
2          25
3         NaN
4       176.5
        ...  
1089    170.3
1090      NaN
1091      NaN
1092       13
1093      8.9
Name: Death Rate, Length: 1094, dtype: object


In [35]:
df[['Deaths', 'Death Rate']] = df[['Deaths', 'Death Rate']].apply(pd.to_numeric, errors='coerce')
print(df)

      Year                                      Leading Cause Sex  \
0     2007                        Diabetes Mellitus (E10-E14)   M   
1     2010     Diseases of Heart (I00-I09, I11, I13, I20-I51)   F   
2     2007          Cerebrovascular Disease (Stroke: I60-I69)   M   
3     2007                              Atherosclerosis (I70)   F   
4     2014              Malignant Neoplasms (Cancer: C00-C97)   F   
...    ...                                                ...  ..   
1089  2013                                   All Other Causes   M   
1090  2009     Diseases of Heart (I00-I09, I11, I13, I20-I51)   M   
1091  2008  Human Immunodeficiency Virus Disease (HIV: B20...   M   
1092  2010       Chronic Lower Respiratory Diseases (J40-J47)   M   
1093  2013  Nephritis, Nephrotic Syndrome and Nephrisis (N...   F   

             Race Ethnicity  Deaths  Death Rate Age Adjusted Death Rate  
0     Other Race/ Ethnicity    11.0         NaN                       .  
1        Not Stated/Unk

### Step 5:  Check the data's info again 

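Run `df.info()` to check that `Deaths` and `Death Rate` are now numeric and that their non-null counts are below the total row count, which means they contain missing data -- they should!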

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1094 entries, 0 to 1093
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Year                     1094 non-null   int64  
 1   Leading Cause            1094 non-null   object 
 2   Sex                      1094 non-null   object 
 3   Race Ethnicity           1094 non-null   object 
 4   Deaths                   956 non-null    float64
 5   Death Rate               708 non-null    float64
 6   Age Adjusted Death Rate  1094 non-null   object 
dtypes: float64(2), int64(1), object(4)
memory usage: 60.0+ KB
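The non-null counts tie back to Step 3: 1094 rows minus 138 periods leaves 956 non-null `Deaths` values, and 1094 minus 386 leaves 708 for `Death Rate`. A more direct way to count missing values per column is `isna().sum()`; the sketch below uses a small made-up DataFrame (`toy` is a hypothetical name, and its values are not from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with a few missing entries.
toy = pd.DataFrame({
    "Deaths": [11.0, np.nan, 213.0, np.nan],
    "Death Rate": [np.nan, 25.0, 176.5, 8.9],
})

# isna() marks each missing cell as True; sum() counts the True values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Non-null count = total rows - missing count, the same figure info() reports.
non_null_per_column = len(toy) - missing_per_column
print(non_null_per_column)
```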


### Step 6:  Create a missing indicator (run the code below without changes)

-  Each indicator column holds a 1 if the row is missing a value in the corresponding column and a 0 if it is not.

-  Add a comment explaining what the `np.where()` function does (feel free to use the documentation)
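To see what `np.where()` does in isolation, here is a minimal sketch on a made-up Series (`toy` and `flags` are hypothetical names used only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
toy = pd.Series([11.0, np.nan, 213.0])

# np.where(condition, x, y) picks x where the condition is True and y elsewhere.
flags = np.where(toy.isna(), 1, 0)
print(flags)  # [0 1 0]
```

Creating the indicator before imputing matters: once the missing values are filled in, `isna()` can no longer tell which rows were originally missing.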

In [37]:
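#Run this cell without changes 

# np.where(condition, x, y) returns x where the condition is True and y elsewhere,
# so each indicator is 1 for rows where the value is missing (NaN) and 0 otherwise.
df['Deaths_missing'] = np.where(df['Deaths'].isna(), 1, 0)
df['Death_Rate_missing'] = np.where(df['Death Rate'].isna(), 1, 0)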

### Step 7:  Calculate the median for each column 

In [40]:
median_col_a = df['Deaths'].median()
median_col_b = df['Death Rate'].median()

print(f"Median of Deaths: {median_col_a}")
print(f"Median of Death Rate: {median_col_b}")

Median of Deaths: 148.5
Median of Death Rate: 18.35
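These medians are computed from the non-missing values only, because `median()` skips `NaN` by default. A small sketch on made-up numbers (the `toy` Series is hypothetical) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry.
toy = pd.Series([10.0, np.nan, 20.0, 40.0])

# Default behavior (skipna=True): NaN is ignored, so this is the median of 10, 20, 40.
median_skipping_nan = toy.median()
print(median_skipping_nan)  # 20.0

# With skipna=False, any NaN in the data makes the result NaN.
median_with_nan = toy.median(skipna=False)
print(median_with_nan)  # nan
```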


### Step 8:  Use the median to fill in each column's missing values (aka impute)

Hint:  Use `fillna()` with the median values you created above.
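As a hedged illustration of the hint on made-up data (the `toy` frame is hypothetical), `fillna()` accepts a Series of per-column values and matches them to columns by name, so several columns can be imputed in one call:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns, each with one missing value.
toy = pd.DataFrame({
    "Deaths": [10.0, np.nan, 30.0],
    "Death Rate": [np.nan, 2.0, 4.0],
})

# median() on a DataFrame returns one median per column, indexed by column name.
medians = toy.median()

# fillna() aligns the Series index with the column names,
# so each column receives its own median.
filled = toy.fillna(medians)
print(filled)
```

Filling each column separately, for example `toy["Deaths"].fillna(toy["Deaths"].median())`, gives the same result and may read more clearly when only one column is involved.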

In [42]:
columns_to_fill = ['Deaths', 'Death Rate']
medians = df[columns_to_fill].median()

df[columns_to_fill] = df[columns_to_fill].fillna(medians)
df

|  | Year | Leading Cause | Sex | Race Ethnicity | Deaths | Death Rate | Age Adjusted Death Rate | Deaths_missing | Death_Rate_missing |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2007 | Diabetes Mellitus (E10-E14) | M | Other Race/ Ethnicity | 11.0 | 18.35 | . | 0 | 1 |
| 1 | 2010 | Diseases of Heart (I00-I09, I11, I13, I20-I51) | F | Not Stated/Unknown | 70.0 | 18.35 | . | 0 | 1 |
| 2 | 2007 | Cerebrovascular Disease (Stroke: I60-I69) | M | Black Non-Hispanic | 213.0 | 25.00 | 33 | 0 | 0 |
| 3 | 2007 | Atherosclerosis (I70) | F | Other Race/ Ethnicity | 148.5 | 18.35 | . | 1 | 1 |
| 4 | 2014 | Malignant Neoplasms (Cancer: C00-C97) | F | Black Non-Hispanic | 1852.0 | 176.50 | 148.4 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1089 | 2013 | All Other Causes | M | White Non-Hispanic | 2293.0 | 170.30 | 143.3 | 0 | 0 |
| 1090 | 2009 | Diseases of Heart (I00-I09, I11, I13, I20-I51) | M | Not Stated/Unknown | 94.0 | 18.35 | . | 0 | 1 |
| 1091 | 2008 | Human Immunodeficiency Virus Disease (HIV: B20... | M | Not Stated/Unknown | 9.0 | 18.35 | . | 0 | 1 |
| 1092 | 2010 | Chronic Lower Respiratory Diseases (J40-J47) | M | Hispanic | 149.0 | 13.00 | 23.9 | 0 | 0 |


In [43]:
#Check the info to see if imputation worked 

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1094 entries, 0 to 1093
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Year                     1094 non-null   int64  
 1   Leading Cause            1094 non-null   object 
 2   Sex                      1094 non-null   object 
 3   Race Ethnicity           1094 non-null   object 
 4   Deaths                   1094 non-null   float64
 5   Death Rate               1094 non-null   float64
 6   Age Adjusted Death Rate  1094 non-null   object 
 7   Deaths_missing           1094 non-null   int64  
 8   Death_Rate_missing       1094 non-null   int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 77.1+ KB


## Above and Beyond (AAB)  -- OPTIONAL

### Question 1:  What year had the most deaths?
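One way to approach this, sketched on made-up numbers rather than the real dataset (the `toy` frame and its values are hypothetical), is to total `Deaths` per `Year` and then ask which year has the largest total:

```python
import pandas as pd

# Hypothetical rows: several records per year, as in the real dataset.
toy = pd.DataFrame({
    "Year": [2007, 2007, 2008, 2008],
    "Deaths": [10.0, 20.0, 50.0, 5.0],
})

# Sum Deaths within each year, then find the year with the largest total.
deaths_by_year = toy.groupby("Year")["Deaths"].sum()
top_year = deaths_by_year.idxmax()

print(deaths_by_year)
print(top_year)  # 2008
```

On the real data, keep in mind that after Step 8 the totals include median-imputed values, so rows flagged by `Deaths_missing` contribute an estimate rather than a recorded count.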

In [None]:
None

### Question 2:  Change the 'Death Rate' column to a float.  

Why would you want to do this?  Add a comment answering this question.
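A small sketch of what goes wrong when numbers are stored as text (the `rates` Series is hypothetical data, not the actual column): string values are compared character by character, so summary statistics can be silently wrong until the column is converted with `astype(float)`.

```python
import pandas as pd

# Hypothetical death rates stored as text.
rates = pd.Series(["25", "176.5", "13"])

# Strings are compared lexicographically, so "25" beats "176.5".
text_max = rates.max()
print(text_max)  # 25

# After conversion, comparisons and arithmetic are numeric.
as_float = rates.astype(float)
numeric_max = as_float.max()
print(numeric_max)  # 176.5
```

`astype(float)` raises an error on values such as `'.'` that cannot be parsed, which is why the placeholder has to be replaced (or coerced with `pd.to_numeric`) first.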

In [19]:
None 