# M1L4 Pandas Data Challenge:  EDA 

## Scenario

We'll be working with a real-world dataset from the NYC Open Data portal, focusing on the leading causes of death in New York City. This dataset provides valuable insights into public health trends and disparities. Understanding this data is crucial for community advocacy and policy-making.

## Recall:  What is Exploratory Data Analysis (EDA)?

EDA is the process of understanding a dataset by summarizing its main characteristics, often with visual methods. In Pandas, this involves:

- Loading data: Reading data from files (like CSV) into DataFrames.
- Inspecting data: Checking the data's structure, data types, and basic statistics.
- Cleaning data: Handling missing values, correcting data types, and removing inconsistencies.
- Analyzing data: Calculating summary statistics, identifying patterns, and exploring relationships between variables.
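
The four activities above can be sketched on a tiny table. This is a minimal illustration, assuming a made-up inline CSV (the `city`, `year`, and `visits` columns are hypothetical and unrelated to the NYC dataset):

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for a real file; the values are made up.
raw = io.StringIO(
    "city,year,visits\n"
    "A,2020,10\n"
    "B,2020,\n"
    "A,2021,14\n"
)

# Load: read the CSV text into a DataFrame.
sample = pd.read_csv(raw)

# Inspect: count missing values per column.
missing = sample.isna().sum()
print(missing)

# Clean: drop rows where the measurement is missing.
cleaned = sample.dropna(subset=["visits"])

# Analyze: average the measurement within each city.
mean_visits = cleaned.groupby("city")["visits"].mean()
print(mean_visits)
```

Dropping rows is only one cleaning option; whether it is appropriate depends on why the values are missing.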

For more information about the data (reviewing it is highly recommended), see the [NYC Open Data page for this dataset](https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam/about_data).

**Let's get started!**

### Step 1: Import Pandas using the alias `pd`

In [26]:
# Import Pandas 
import pandas as pd
import numpy as np

### Step 2: Load the dataset (csv file stored in the data folder) into a Pandas DataFrame. The file is called:  `nyc_causeofdeath.csv`


In [9]:
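# Note: this cell loads a local export whose name differs from the
# `nyc_causeofdeath.csv` named above; adjust the path to match your copy.
df = pd.read_csv('New_York_City_Leading_Causes_of_Death_20250605.csv')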

### Step 3: Display the first 10 rows AND the last 10 rows of the DataFrame.


In [11]:
# Display the first 10 rows
print(df.head(10))


   Year                                    Leading Cause Sex  \
0  2007                      Diabetes Mellitus (E10-E14)   M   
1  2010   Diseases of Heart (I00-I09, I11, I13, I20-I51)   F   
2  2007        Cerebrovascular Disease (Stroke: I60-I69)   M   
3  2007                            Atherosclerosis (I70)   F   
4  2014            Malignant Neoplasms (Cancer: C00-C97)   F   
5  2010     Chronic Lower Respiratory Diseases (J40-J47)   F   
6  2007  Intentional Self-Harm (Suicide: X60-X84, Y87.0)   M   
7  2012                                 All Other Causes   F   
8  2009   Diseases of Heart (I00-I09, I11, I13, I20-I51)   F   
9  2010                             Septicemia (A40-A41)   F   

               Race Ethnicity Deaths Death Rate Age Adjusted Death Rate  
0       Other Race/ Ethnicity     11          .                       .  
1          Not Stated/Unknown     70          .                       .  
2          Black Non-Hispanic    213         25                      33  

In [12]:
# Display the last 10 rows
print(df.tail(10))


      Year                                      Leading Cause Sex  \
1084  2010  Certain Conditions originating in the Perinata...   F   
1085  2009  Essential Hypertension and Renal Diseases (I10...   F   
1086  2011          Cerebrovascular Disease (Stroke: I60-I69)   F   
1087  2008     Diseases of Heart (I00-I09, I11, I13, I20-I51)   F   
1088  2010     Diseases of Heart (I00-I09, I11, I13, I20-I51)   M   
1089  2013                                   All Other Causes   M   
1090  2009     Diseases of Heart (I00-I09, I11, I13, I20-I51)   M   
1091  2008  Human Immunodeficiency Virus Disease (HIV: B20...   M   
1092  2010       Chronic Lower Respiratory Diseases (J40-J47)   M   
1093  2013  Nephritis, Nephrotic Syndrome and Nephrisis (N...   F   

                  Race Ethnicity Deaths Death Rate Age Adjusted Death Rate  
1084  Asian and Pacific Islander     18        3.2                       4  
1085                    Hispanic     84          7                     8.8  
1086     

### Step 4: Get a summary of the DataFrame's information using the `.info()` method. Add a comment naming the only numeric column.


In [None]:
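# Only numeric column is Year
# Note: df.info() prints its summary itself and returns None,
# so wrapping it in print() adds the trailing "None" seen in the output.
print(df.info())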

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1094 entries, 0 to 1093
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Year                     1094 non-null   int64 
 1   Leading Cause            1094 non-null   object
 2   Sex                      1094 non-null   object
 3   Race Ethnicity           1094 non-null   object
 4   Deaths                   1094 non-null   object
 5   Death Rate               1094 non-null   object
 6   Age Adjusted Death Rate  1094 non-null   object
dtypes: int64(1), object(6)
memory usage: 60.0+ KB
None
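
The `object` dtype on `Deaths`, `Death Rate`, and `Age Adjusted Death Rate` is consistent with the `.` placeholders visible in the first rows. One option, sketched here on made-up rows rather than the real file, is to declare the placeholder as missing at load time with the `na_values` parameter of `pd.read_csv`:

```python
import io
import pandas as pd

# Made-up rows that mimic the "." placeholder; not taken from the real file.
csv_text = io.StringIO(
    "Year,Sex,Deaths,Death Rate\n"
    "2007,M,11,.\n"
    "2007,F,213,25\n"
)

# na_values lists extra strings that read_csv should treat as missing.
demo = pd.read_csv(csv_text, na_values=["."])
print(demo.dtypes)
```

With the placeholder declared, the column is parsed as numeric and the `.` entries become `NaN`. Converting after loading, as a later step in this notebook does, is an equally valid route.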


### Step 5: Calculate descriptive statistics for the numerical columns using the `.describe()` method. Add a comment listing the range of years in the dataset.


In [None]:
# Years range from 2007 to 2014

print(df.describe())


              Year
count  1094.000000
mean   2010.477148
std       2.293419
min    2007.000000
25%    2008.000000
50%    2010.000000
75%    2012.000000
max    2014.000000
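
Only `Year` appears because `.describe()` summarizes numeric columns by default. As a small sketch on a made-up frame (the values are illustrative only), passing `include="all"` also reports counts, unique values, and the most frequent value for text columns:

```python
import pandas as pd

# Made-up frame with one numeric and one text column.
toy = pd.DataFrame({
    "Year": [2007, 2008, 2008],
    "Sex": ["F", "M", "F"],
})

# Default behaviour: numeric columns only.
numeric_summary = toy.describe()

# include="all" adds rows such as "unique", "top", and "freq" for text columns.
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```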


### Step 6: Run the cell below to see the `value_counts()` method in action. Then use the same method to check the unique values in the 'Race Ethnicity' column. Does one value appear in the data more often than the others? Type the answer as a comment.


In [15]:
# Run this cell without changes 
print(df['Sex'].value_counts())

Sex
F    554
M    540
Name: count, dtype: int64


In [17]:
# Use value_counts() on the Race Ethnicity column
# The most common value is 'Not Stated/Unknown' (200 rows);
# the other groups have similar counts (176 to 186 rows each)
df['Race Ethnicity'].value_counts()


Race Ethnicity
Not Stated/Unknown            200
Other Race/ Ethnicity         186
Black Non-Hispanic            178
Asian and Pacific Islander    177
Hispanic                      177
White Non-Hispanic            176
Name: count, dtype: int64
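
Raw counts can be easier to compare as shares of the total. A minimal sketch, assuming a made-up Series with placeholder labels, using the `normalize=True` option of `value_counts()`:

```python
import pandas as pd

# Small made-up Series; the labels are placeholders, not values from the dataset.
groups = pd.Series(["A", "B", "A", "C", "A", "B"])

# Absolute counts, most frequent first.
counts = groups.value_counts()

# Same ordering, expressed as fractions of the total.
shares = groups.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

The same call on `df['Race Ethnicity']` would show each group's share of the rows.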

### Step 7: Convert the `Death Rate` column to numeric by running the cell below. Type a response on what you think the `errors='coerce'` argument does AND list the method you would use to change a numeric column to text.

In [19]:
#Run this cell without changes 

df['Death Rate'] = pd.to_numeric(df['Death Rate'], errors='coerce')
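# errors='coerce' converts any value that cannot be parsed as a number
# (such as the '.' placeholder) to NaN instead of raising an error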


What method would you use to change a numeric column to text?

Double-click this cell to type in an answer below:

**You would use the `.astype(str)` method.**
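
To make the behaviour of `errors='coerce'` concrete, here is a small sketch on made-up values. It contrasts coercion with the default `errors='raise'` and shows the reverse conversion with `.astype(str)`:

```python
import pandas as pd

# Made-up values: two parseable numbers and one "." placeholder.
rates = pd.Series(["25", ".", "3.2"])

# errors="coerce" replaces anything that cannot be parsed with NaN.
numeric = pd.to_numeric(rates, errors="coerce")
print(numeric)

# The default, errors="raise", stops with a ValueError instead.
try:
    pd.to_numeric(rates)
    raised = False
except ValueError:
    raised = True
print(raised)

# Going the other way: astype(str) turns numbers into text.
years_as_text = pd.Series([2007, 2014]).astype(str)
print(years_as_text.tolist())
```

Coercion is convenient, but it silently turns every unparseable entry into `NaN`, so it is worth checking how many values were lost afterwards.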

### Step 8: Use `.describe()` again now that Death Rate is a number. Type a response on the main takeaways about the death rates in NYC based on this data. List at least one other question you could answer about death rates based solely on this dataset.

In [21]:
print(df.describe())


              Year  Death Rate
count  1094.000000  708.000000
mean   2010.477148   53.438842
std       2.293419   76.524700
min    2007.000000    2.400000
25%    2008.000000   11.600000
50%    2010.000000   18.350000
75%    2012.000000   64.625000
max    2014.000000  491.400000


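What is one other question you could answer about death rates based solely on this data?

Double-click this cell to type in an answer below:

Death rates range from a minimum of 2.4 to a maximum of 491.4. The median (18.35) sits well below the mean (53.44), which suggests that a relatively small number of very high rates pull the average up. Only 708 of the 1,094 rows have a usable death rate after the conversion. One further question this dataset could answer: how does the average death rate differ by sex or by race/ethnicity?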

## Above and Beyond (AAB) -- OPTIONAL

### Question 1: View the documentation for the pandas `groupby()` method and get the average of the Death Rate column by Sex

[Group By Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)

In [30]:
# Group rows by Sex, then average Death Rate within each group
# (NaN values are skipped by mean())
average_death_rates = df.groupby("Sex")["Death Rate"].mean()
print(average_death_rates)

Sex
F    51.401130
M    55.476554
Name: Death Rate, dtype: float64
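
A mean alone can hide how many rows contributed to it. As a sketch on made-up rows (same column names as the dataset, illustrative values only), `agg()` can report several statistics per group at once:

```python
import pandas as pd

# Made-up rows with the same column names as the dataset; values are illustrative only.
toy = pd.DataFrame({
    "Sex": ["F", "F", "M", "M"],
    "Death Rate": [10.0, 30.0, 20.0, None],
})

# agg() accepts a list of statistic names; each one skips NaN rows.
summary = toy.groupby("Sex")["Death Rate"].agg(["mean", "median", "count"])
print(summary)
```

The `count` column makes it visible when a group's average rests on fewer non-missing values.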


### Question 2: Use `groupby()` along with the `size()` method to count the number of Sex and Race combinations that exist in the data

In [None]:
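# Count the rows for each (Sex, Race Ethnicity) combination
sex_race_counts = df.groupby(['Sex', 'Race Ethnicity']).size()
print(sex_race_counts)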
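
Grouping by two columns with `size()` returns a Series indexed by both. A minimal sketch on made-up rows (illustrative values, not results from the dataset) shows the result and one way to pivot it into a table with `unstack()`:

```python
import pandas as pd

# Made-up rows; the labels are illustrative and the counts are not from the dataset.
toy = pd.DataFrame({
    "Sex": ["F", "F", "M", "M", "M"],
    "Race Ethnicity": ["Hispanic", "Hispanic", "Hispanic", "Other", "Other"],
})

# size() counts rows per (Sex, Race Ethnicity) pair.
pair_counts = toy.groupby(["Sex", "Race Ethnicity"]).size()
print(pair_counts)

# unstack() moves the inner level into columns; absent pairs are filled with 0.
table = pair_counts.unstack(fill_value=0)
print(table)
```

Unlike `count()`, `size()` includes rows with missing values, so it reports how many rows exist for each combination.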