# Week 11 Assignment


Please do the programming exercise and verify that your code works using the tests, then think about your final project and fill out the questions in the second part.

---

### 47.1: Filtering and summarizing data

For this work, you'll find a data file at `https://hds5210-data.s3.amazonaws.com/complications_all.csv`.

Read in the data file and create a variable called `mo_hospitals` that contains a data frame from the `complications_all.csv` file, filtered down to only contain those hospitals from the state of Missouri (MO).

Then aggregate that data by hospital into a variable named `mo_summary`.  There are some key fields that we want to summarize:
* We want to know the earliest date that each hospital was participating in any program
* We want to know the latest date that each hospital stopped participating in any program
* We want to know the total number of patients in the denominators of these programs

Some things to note:
* You will need to convert the `Start Date` and `End Date` to actual datetime fields
* You will need to clean up and convert the `Denominator` field to just be numeric - the rule that you should use is to simply remove any records where the `Denominator` is `'Not Available'`
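
The two cleanup notes above can be sketched on a tiny made-up frame. The values below are hypothetical, and the `MM/DD/YYYY` date layout is an assumption about the file rather than something confirmed here; only the column names come from the assignment.

```python
import pandas as pd

# Toy rows with hypothetical values; column names mirror the assignment.
toy = pd.DataFrame({
    'Start Date': ['04/01/2015', '07/01/2016', '04/01/2015'],
    'End Date': ['03/31/2018', '06/30/2018', '03/31/2018'],
    'Denominator': ['120', 'Not Available', '45'],
})

# Parse the date strings into real datetime columns.
# The explicit format is an assumption; drop it to let pandas infer.
toy['Start Date'] = pd.to_datetime(toy['Start Date'], format='%m/%d/%Y')
toy['End Date'] = pd.to_datetime(toy['End Date'], format='%m/%d/%Y')

# Remove rows holding the 'Not Available' placeholder,
# then convert the remaining strings to numbers.
cleaned = toy[toy['Denominator'] != 'Not Available'].copy()
cleaned['Denominator'] = pd.to_numeric(cleaned['Denominator'])

print(cleaned.dtypes)
```

Filtering before calling `pd.to_numeric` matters: the conversion raises an error on the placeholder string unless `errors='coerce'` is passed.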


The final result of this step should be a new data frame called `mo_summary` that contains one row for each hospital and contains the min start date, max end date, and total denominator.  Use the names `start_date`, `end_date`, and `number` for those columns in `mo_summary`.
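
As a rough illustration of the target shape, here is a minimal sketch using pandas named aggregation on made-up rows. The hospital names and numbers are hypothetical, and grouping on `Facility Name` is an assumption about which column identifies a hospital.

```python
import pandas as pd

# Hypothetical, already-cleaned rows: two programs for one hospital, one for another.
toy = pd.DataFrame({
    'Facility Name': ['HOSPITAL A', 'HOSPITAL A', 'HOSPITAL B'],
    'Start Date': pd.to_datetime(['2015-04-01', '2016-07-01', '2015-07-01']),
    'End Date': pd.to_datetime(['2018-03-31', '2018-06-30', '2017-12-31']),
    'Denominator': [120, 30, 45],
})

# Named aggregation: each keyword becomes an output column,
# defined as (source column, aggregation function).
toy_summary = toy.groupby('Facility Name').agg(
    start_date=('Start Date', 'min'),
    end_date=('End Date', 'max'),
    number=('Denominator', 'sum'),
)

print(toy_summary)
```

The result is indexed by hospital name, so a single hospital can be looked up with `toy_summary.loc['HOSPITAL A']`.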


You do not need to create your code in the form of a function, just make sure your variable names match what I've described above so the tests work.

In [1]:
import pandas as pd
all_hospitals = pd.read_csv('https://hds5210-data.s3.amazonaws.com/complications_all.csv')


In [3]:
mo_hospitals = all_hospitals[all_hospitals['State'] == 'MO'].copy()

In [4]:
mo_hospitals['Start Date'] = pd.to_datetime(mo_hospitals['Start Date'])
mo_hospitals['End Date'] = pd.to_datetime(mo_hospitals['End Date'])


In [5]:
mo_hospitals = mo_hospitals[mo_hospitals['Denominator'] != 'Not Available'].copy()

In [6]:
mo_hospitals['Denominator'] = pd.to_numeric(mo_hospitals['Denominator'])

In [7]:
mo_summary = mo_hospitals.groupby('Facility Name').agg(
    start_date=('Start Date', 'min'),
    end_date=('End Date', 'max'),
    number=('Denominator', 'sum')
)


In [8]:
assert(mo_summary['number'].sum() == 1766908)
assert(mo_summary['start_date'].min() == pd.Timestamp(2015,4,1))
assert(mo_summary['end_date'].max() == pd.Timestamp(2018,6,30))
assert(mo_summary.shape == (108,3))
assert(mo_summary.loc['BARNES JEWISH HOSPITAL'].number == 131313)
assert(mo_summary.loc['BOONE HOSPITAL CENTER'].number == 63099)

---

### 47.2: Planning your final project

You should be thinking about the things we've been learning and how you can apply them to your final project.  Use the rubric to help guide your thinking and then answer the questions below.  This is meant as a guide to help you think through what you will do.

#### A. Data Access

Your project should include data from at least three distinct types of sources.  For example: AWS S3, Relational Databases, Internet, Web Services, local files.  List what data sources you're planning to use.

1. Heart-Disease from Kaggle
https://www.kaggle.com/datasets/allanwandia/heart-disease
2. Heart Disease Mortality Data from Healthdata.gov
https://healthdata.gov/dataset/Heart-Disease-Mortality-Data-Among-US-Adults-35-by/xg8i-mvdk
3. Behavioral Risk Factor Surveillance System (BRFSS) - National Cardiovascular Disease Surveillance Data from CDC
https://data.cdc.gov/Heart-Disease-Stroke-Prevention/Behavioral-Risk-Factor-Surveillance-System-BRFSS-N/ikwk-8git


#### B. Data Formats

Your project should include data that comes in different file formats.  For example: HL7, EDI, HTML, CSV, Excel, JSON, XML.  List what data formats you're planning to use.

* CSV
* JSON
* XML
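
One way the three formats could be brought to a common shape is sketched below with the Python standard library. The `state` and `deaths` fields and all values are hypothetical placeholders, not fields taken from the actual data sources.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical samples of the same two records in each format.
csv_text = "state,deaths\nMO,10\nIL,12\n"
json_text = '[{"state": "MO", "deaths": 10}, {"state": "IL", "deaths": 12}]'
xml_text = (
    "<rows>"
    "<row><state>MO</state><deaths>10</deaths></row>"
    "<row><state>IL</state><deaths>12</deaths></row>"
    "</rows>"
)

# CSV: every value arrives as a string, so cast the numeric field.
csv_rows = [
    {'state': r['state'], 'deaths': int(r['deaths'])}
    for r in csv.DictReader(io.StringIO(csv_text))
]

# JSON: types are preserved by the parser.
json_rows = json.loads(json_text)

# XML: walk each <row> element and read its child text.
xml_rows = [
    {'state': row.findtext('state'), 'deaths': int(row.findtext('deaths'))}
    for row in ET.fromstring(xml_text).findall('row')
]

print(csv_rows == json_rows == xml_rows)
```

Once each source is a list of dictionaries with matching keys, any of them can be passed to `pd.DataFrame(...)` for the combined analysis.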


#### C. Objective

What purpose would your project serve in a real work setting?  Take a couple of paragraphs to write down why this is an interesting product.

My project aims to construct a comprehensive picture of heart disease dynamics. This approach is extremely relevant in real-world healthcare applications.

Firstly, the project will offer valuable insights for healthcare practitioners and medical researchers. By comparing predictive data with actual mortality data, it will produce a clearer understanding of the effectiveness of current predictive models for heart disease. Such insights are essential for enhancing early detection and preventative strategies.

Additionally, the inclusion of BRFSS data allows for an exploration of the lifestyle and behavioral factors that influence heart disease. This analysis is crucial for public health officials and policymakers, offering data-driven evidence to shape health promotion campaigns and preventative measures aimed at reducing heart disease risk.



---



## Submit your work via GitHub as normal
