# M1L5 Pandas Part 2 Data Challenge:  EDA

## Scenario

We'll be working with a real-world dataset from the NYC Open Data portal on the leading causes of death in New York City (the same dataset used in Data Challenge 4). It offers insight into public health trends and disparities, which makes it useful for community advocacy and policy-making.

Reading the dataset documentation first is highly recommended: [Link to the Data](https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam/about_data)

## Objectives 
- Group and Aggregate Data
- Create a contingency table with `crosstab()`
- Check for duplicated data (remember, not all duplicated data needs to be dropped)

**Let's get started!**

### Step 1:  Import Pandas & Numpy

In [1]:
# Import Pandas & Numpy
import pandas as pd
import numpy as np

### Step 2: Load the dataset (csv file stored in the data folder) into a Pandas DataFrame. The file is called:  `nyc_causeofdeath.csv`


In [2]:
# Adjust this path to wherever nyc_causeofdeath.csv lives on your machine
df = pd.read_csv("/Users/gabriel/Desktop/marcy/DA2025_Lectures2/Mod1/nyc_causeofdeath.csv")


### Step 3: Check the information of the data (column names, data types, size, etc.)


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1094 entries, 0 to 1093
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Year                     1094 non-null   int64  
 1   Leading Cause            1094 non-null   object 
 2   Sex                      1094 non-null   object 
 3   Race Ethnicity           1094 non-null   object 
 4   Deaths                   956 non-null    float64
 5   Death Rate               1094 non-null   object 
 6   Age Adjusted Death Rate  1094 non-null   object 
dtypes: float64(1), int64(1), object(5)
memory usage: 60.0+ KB
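
The `Non-Null Count` column above shows 956 non-null `Deaths` values out of 1094 rows, so 138 are missing. As a minimal sketch, the same check can be made directly with `isna()`. The `toy` frame below is a hypothetical stand-in, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Year": [2019, 2019, 2020, 2020],
    "Sex": ["F", "M", "F", "M"],
    "Deaths": [120.0, np.nan, 95.0, np.nan],
})

# Missing values per column: the complement of info()'s "Non-Null Count"
missing_per_column = toy.isna().sum()
print(missing_per_column)

# (number of rows, number of columns)
print(toy.shape)
```

On the real data, `df.isna().sum()` would report the missing count for each column in one call.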


### Step 4: Make sure Deaths is a numeric column so that we can do some math (you will learn this officially later). For now, just run the cell below.


In [16]:
#Run this cell without changes 
df['Deaths'] = df['Deaths'].replace('.', np.nan)
df['Deaths'] = pd.to_numeric(df['Deaths'])
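
To see what this kind of cleanup does, here is a small sketch on a hypothetical series of strings in which `"."` marks a missing value. It uses `errors="coerce"`, an alternative to replacing the placeholder first: anything that cannot be parsed as a number becomes `NaN`. Because `NaN` is a float, the result is a float column rather than an integer one.

```python
import pandas as pd

# Hypothetical raw values: numbers stored as text, "." marking a missing entry
raw = pd.Series(["120", ".", "95", "."], dtype="object", name="Deaths")

# errors="coerce" turns unparseable entries into NaN instead of raising
as_numbers = pd.to_numeric(raw, errors="coerce")

print(as_numbers)
print(as_numbers.isna().sum())  # how many entries could not be parsed
print(as_numbers.sum())         # NaN values are skipped by default
```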


### Step 5: Create code to get the sum of deaths by Sex. Which Sex has the most deaths based on this data? (Add a comment in the cell with your answer.)

In [20]:
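# count() tallies the non-null rows in each column per Sex; it does not add up Deaths.
# To answer the question, sum the Deaths column: df.groupby('Sex')['Deaths'].sum()
rows_by_sex = df.groupby(['Sex']).count()
print(rows_by_sex)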

     Year  Leading Cause  Race Ethnicity  Deaths  Death Rate  \
Sex                                                            
F     554            554             554     463         554   
M     540            540             540     493         540   

     Age Adjusted Death Rate  
Sex                           
F                        554  
M                        540  
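
The output above counts rows, so it shows how many records each sex has, not how many deaths. A minimal sketch of the sum the step asks for, run on a hypothetical toy frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration
toy = pd.DataFrame({
    "Sex": ["F", "M", "F", "M", "F"],
    "Deaths": [100.0, 150.0, np.nan, 50.0, 25.0],
})

# Select the Deaths column, then add it up within each Sex group
deaths_by_sex = toy.groupby("Sex")["Deaths"].sum()
print(deaths_by_sex)

# idxmax() returns the group label with the largest total
print(deaths_by_sex.idxmax())
```

Applied to the lesson's `df`, the same pattern would be `df.groupby('Sex')['Deaths'].sum()`.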


### Step 6:  Now create a contingency table (using `crosstab()`) of the Leading Cause of Death by Sex -- put a comment in the cell of a takeaway from the output 

In [6]:
cause_by_sex =   pd.crosstab(df['Leading Cause'], df['Sex'])
cause_by_sex

| Leading Cause | F | M |
|---|---|---|
| Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86) | 35 | 45 |
| All Other Causes | 48 | 48 |
| Alzheimer's Disease (G30) | 31 | 1 |
| Aortic Aneurysm and Dissection (I71) | 2 | 1 |
| Assault (Homicide: Y87.1, X85-Y09) | 3 | 17 |
| Atherosclerosis (I70) | 3 | 0 |
| Cerebrovascular Disease (Stroke: I60-I69) | 48 | 42 |
| Certain Conditions originating in the Perinatal Period (P00-P96) | 13 | 13 |
| Chronic Liver Disease and Cirrhosis (K70, K73) | 8 | 21 |
| Chronic Lower Respiratory Diseases (J40-J47) | 45 | 43 |
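
By default `crosstab()` counts rows, so each cell above is the number of records with that cause and sex, not the number of deaths. The sketch below, on a hypothetical toy frame, shows the default counts with totals added via `margins=True`, and a variant that sums a value column instead using `values=` and `aggfunc=`.

```python
import pandas as pd

# Hypothetical toy frame; causes "A" and "B" are placeholders
toy = pd.DataFrame({
    "Leading Cause": ["A", "A", "B", "B", "B"],
    "Sex": ["F", "M", "F", "M", "M"],
    "Deaths": [10, 20, 5, 15, 30],
})

# Default behaviour: count rows per (cause, sex); margins=True adds "All" totals
row_counts = pd.crosstab(toy["Leading Cause"], toy["Sex"], margins=True)
print(row_counts)

# Sum the Deaths column within each (cause, sex) cell instead of counting rows
death_totals = pd.crosstab(
    toy["Leading Cause"], toy["Sex"], values=toy["Deaths"], aggfunc="sum"
)
print(death_totals)
```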


### Step 7:  Are there any duplicate records in this dataset?  Code it below and add a comment with your answer

In [12]:
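# duplicated() returns one True/False per row; df.duplicated().sum() gives the number of duplicate rows
df.duplicated()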

0       False
1       False
2       False
3       False
4       False
        ...  
1089    False
1090    False
1091    False
1092    False
1093    False
Length: 1094, dtype: bool
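
A truncated list of booleans is hard to read for 1094 rows, so it helps to reduce it to a single answer. A minimal sketch on a hypothetical toy frame, where the last row repeats the first:

```python
import pandas as pd

# Hypothetical toy frame: row 3 is an exact copy of row 0
toy = pd.DataFrame({
    "Year": [2019, 2019, 2020, 2019],
    "Sex": ["F", "M", "F", "F"],
    "Deaths": [100, 150, 90, 100],
})

# True if at least one row repeats an earlier row
has_duplicates = toy.duplicated().any()

# Number of rows flagged as repeats (the first occurrence is not flagged)
duplicate_count = toy.duplicated().sum()

# keep=False flags every copy, which is handy for inspecting them side by side
all_copies = toy[toy.duplicated(keep=False)]

print(has_duplicates, duplicate_count)
print(all_copies)
```

Inspecting the copies before dropping anything matters here, because two identical rows can be legitimate records.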

## Above and Beyond (AAB)  -- OPTIONAL

### Question 1:  What year had the most deaths?

TypeError: 'int' object is not callable
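
One possible approach, sketched on a hypothetical toy frame: sum `Deaths` per year, then ask for the year with the largest total. Note that `idxmax()` and `max()` are methods and need parentheses, while an already computed integer cannot be called again with `()`.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration
toy = pd.DataFrame({
    "Year": [2019, 2019, 2020, 2020, 2021],
    "Deaths": [100.0, 50.0, 200.0, 25.0, 120.0],
})

# Total deaths per year
deaths_by_year = toy.groupby("Year")["Deaths"].sum()

# Year (index label) with the largest total, and the total itself
top_year = deaths_by_year.idxmax()
top_total = deaths_by_year.max()

print(top_year, top_total)
```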

### Question 2:  Change the 'Death Rate' column to a float.  Why would you want to do this?
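
`Death Rate` is stored as `object` (text), so numeric operations such as `mean()` or sorting by value will not behave as expected until it is converted. The sketch below assumes, as a guess based on how `Deaths` is cleaned in Step 4, that missing rates are written as `"."`. The toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; "." as a missing-value marker is an assumption
toy = pd.DataFrame({
    "Death Rate": pd.Series(["18.7", ".", "203.4"], dtype="object"),
})

# Convert text to floats; unparseable entries become NaN
toy["Death Rate"] = pd.to_numeric(toy["Death Rate"], errors="coerce")

print(toy["Death Rate"].dtype)

# Numeric summaries now work and skip NaN by default
mean_rate = toy["Death Rate"].mean()
print(mean_rate)
```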
