# Week 11 Assignment


Please do the programming exercise and verify that your code works using the tests, then think about your final project and fill out the questions in the second part.

---

### 47.1: Filtering and summarizing data

For this work, you'll find a data file at `https://hds5210-data.s3.amazonaws.com/complications_all.csv`.

Read in the data file and create a variable called `mo_hospitals` that contains a data frame from the `complications_all.csv` file, filtered down to only contain those hospitals from the state of Missouri (MO).

Then aggregate that data by hospital into a variable named `mo_summary`.  There are some key fields that we want to summarize:
* We want to know the earliest date that each hospital was participating in any program
* We want to know the latest date that each hospital stopped participating in any program
* We want to know the total number of patients in the denominators of these programs

Some things to note:
* You will need to convert the `Start Date` and `End Date` to actual datetime fields
* You will need to clean up and convert the `Denominator` field to be numeric. The rule you should use is to simply remove any records where the `Denominator` is `'Not Available'`
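
The two notes above can be sketched on a small made-up frame before touching the real file. The rows below are invented, and the `MM/DD/YYYY` date format is an assumption about how the dates are stored, so check it against the actual data.

```python
import pandas as pd

# Made-up rows that mimic the three columns named in the notes above.
toy = pd.DataFrame({
    "Start Date": ["04/01/2015", "07/01/2016", "04/01/2015"],
    "End Date": ["03/31/2018", "06/30/2018", "03/31/2018"],
    "Denominator": ["120", "Not Available", "45"],
})

# Drop the rows whose Denominator is the literal string 'Not Available'.
toy = toy[toy["Denominator"] != "Not Available"].copy()

# Convert the remaining text values to numeric and datetime dtypes.
# The explicit format string is an assumption about the source data.
toy["Denominator"] = pd.to_numeric(toy["Denominator"])
toy["Start Date"] = pd.to_datetime(toy["Start Date"], format="%m/%d/%Y")
toy["End Date"] = pd.to_datetime(toy["End Date"], format="%m/%d/%Y")

print(toy.dtypes)
```

Filtering before calling `pd.to_numeric` matters here: the conversion would otherwise raise on the `'Not Available'` strings.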


The final result of this step should be a new data frame called `mo_summary` that contains one row for each hospital and contains the min start date, max end date, and total denominator.  Use the names `start_date`, `end_date`, and `number` for those columns in `mo_summary`.


You do not need to create your code in the form of a function, just make sure your variable names match what I've described above so the tests work.

In [15]:
import pandas as pd
# This is just to show you the name to use for the variable you need to create for this step to pass.
all_hospitals = pd.read_csv('https://hds5210-data.s3.amazonaws.com/complications_all.csv')


In [16]:
import pandas as pd

# 1. Read the data
all_hospitals = pd.read_csv('https://hds5210-data.s3.amazonaws.com/complications_all.csv')

# 2. Create mo_hospitals with Missouri data
mo_hospitals = all_hospitals[all_hospitals['State'] == 'MO'].copy()

# 3. Convert dates (plain column assignment so the dtype becomes datetime64)
mo_hospitals['Start Date'] = pd.to_datetime(mo_hospitals['Start Date'])
mo_hospitals['End Date'] = pd.to_datetime(mo_hospitals['End Date'])

# 4. Clean Denominator data: drop 'Not Available' rows, then convert to numeric
mo_hospitals = mo_hospitals[mo_hospitals['Denominator'] != 'Not Available'].copy()
mo_hospitals['Denominator'] = pd.to_numeric(mo_hospitals['Denominator'])

# 5. Create mo_summary - using the exact column name from the dataset
mo_summary = mo_hospitals.groupby('Facility Name').agg({
    'Start Date': 'min',
    'End Date': 'max',
    'Denominator': 'sum'
}).rename(columns={
    'Start Date': 'start_date',
    'End Date': 'end_date',
    'Denominator': 'number'
})

# Verify results
print("\nShape of mo_summary:", mo_summary.shape)
print("Total patients:", mo_summary['number'].sum())
print("Date range:", mo_summary['start_date'].min(), "to", mo_summary['end_date'].max())
print("\nBarnes Jewish Hospital number:", mo_summary.loc['BARNES JEWISH HOSPITAL', 'number'])
print("Boone Hospital Center number:", mo_summary.loc['BOONE HOSPITAL CENTER', 'number'])


Shape of mo_summary: (108, 3)
Total patients: 1766908
Date range: 2015-04-01 00:00:00 to 2018-06-30 00:00:00

Barnes Jewish Hospital number: 131313
Boone Hospital Center number: 63099


In [17]:
assert(mo_summary['number'].sum() == 1766908)
assert(mo_summary['start_date'].min() == pd.Timestamp(2015,4,1))
assert(mo_summary['end_date'].max() == pd.Timestamp(2018,6,30))
assert(mo_summary.shape == (108,3))
assert(mo_summary.loc['BARNES JEWISH HOSPITAL'].number == 131313)
assert(mo_summary.loc['BOONE HOSPITAL CENTER'].number == 63099)
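
As an aside, the aggregate-then-rename step can also be written with pandas named aggregation, which sets the output column names in the same call. This is a minimal sketch on invented data (`HOSPITAL A` and `HOSPITAL B` are hypothetical facilities), not a replacement for the tested solution above.

```python
import pandas as pd

# Invented rows: two programs for HOSPITAL A, one for HOSPITAL B.
toy = pd.DataFrame({
    "Facility Name": ["HOSPITAL A", "HOSPITAL A", "HOSPITAL B"],
    "Start Date": pd.to_datetime(["2015-04-01", "2016-07-01", "2015-07-01"]),
    "End Date": pd.to_datetime(["2018-03-31", "2018-06-30", "2017-12-31"]),
    "Denominator": [100, 50, 30],
})

# Named aggregation: output_name=(input_column, aggregation_function).
toy_summary = toy.groupby("Facility Name").agg(
    start_date=("Start Date", "min"),
    end_date=("End Date", "max"),
    number=("Denominator", "sum"),
)

print(toy_summary)
```

The result has one row per facility, indexed by `Facility Name`, with the same three column names the tests expect from `mo_summary`.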

---

### 47.2 Planning your final project

You should be thinking about the things we've been learning and how you can apply them to your final project.  Use the rubric to help guide your thinking and then answer the questions below.  This is meant as a guide to help you think through what you will do.

#### A. Data Access

Your project should include data from at least three distinct types of sources.  For example: AWS S3, Relational Databases, Internet, Web Services, local files.  List what data sources you're planning to use.

I will use three main ways to get data for my project:

* AWS S3 Storage: Like a big online folder where we can keep lots of health data files (just like the hospital file we used)
* SQL Database: A place to store and organize patient information in a way that's easy to find and use
* Web APIs: Tools that let us get live data directly from hospital systems
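
As a rough sketch of the SQL source, the snippet below loads query results into pandas. The in-memory SQLite database, the `patients` table, and its columns are all hypothetical stand-ins; the real project database and schema are not decided here.

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database stands in for the project's real SQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (patient_id INTEGER, state TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [(1, "MO", 54), (2, "IL", 61), (3, "MO", 47)],
)
conn.commit()

# A parameterized query keeps the filter value out of the SQL string.
mo_patients = pd.read_sql_query(
    "SELECT patient_id, age FROM patients WHERE state = ? ORDER BY patient_id",
    conn,
    params=("MO",),
)
conn.close()

print(mo_patients)
```

The S3 source can be read the same way as in part 47.1, by passing the file's URL to `pd.read_csv`.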




#### B. Data Formats

Your project should include data that comes in different file formats.  For example: HL7, EDI, HTML, CSV, Excel, JSON, XML.  List what data formats you're planning to use.

My project will work with these types of files:

* CSV Files: Simple files like spreadsheets that hold lists of hospital data
* JSON Files: Files that websites commonly use to share data
* XML Files: Another way to organize and share hospital information
* HL7 Files: Special files made just for healthcare data
* Excel Files: Spreadsheets that hospital staff often use for reports
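
To illustrate how three of these formats could end up in the same tabular shape, here is a minimal sketch that parses CSV, JSON, and XML from in-memory strings. The records and field names (`facility`, `number`) are invented for illustration. HL7 and Excel are left out because they need additional libraries.

```python
import io
import json
import xml.etree.ElementTree as ET

import pandas as pd

# The same two invented records, written in three different formats.
csv_text = "facility,number\nHOSPITAL A,150\nHOSPITAL B,30\n"
json_text = (
    '[{"facility": "HOSPITAL A", "number": 150},'
    ' {"facility": "HOSPITAL B", "number": 30}]'
)
xml_text = (
    "<hospitals>"
    "<hospital><facility>HOSPITAL A</facility><number>150</number></hospital>"
    "<hospital><facility>HOSPITAL B</facility><number>30</number></hospital>"
    "</hospitals>"
)

# CSV: pandas reads directly from a file-like object.
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: decode to a list of dicts, then build a frame from it.
from_json = pd.DataFrame(json.loads(json_text))

# XML: walk the <hospital> elements with the standard library parser.
from_xml = pd.DataFrame(
    [
        {"facility": h.findtext("facility"), "number": int(h.findtext("number"))}
        for h in ET.fromstring(xml_text).iter("hospital")
    ]
)

for frame in (from_csv, from_json, from_xml):
    print(frame)
```

Once each source is a data frame with matching column names, the same filtering and aggregation code can be applied regardless of the original format.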


#### C. Objective

What purpose would your project serve in a real work setting?  Take a couple of paragraphs to write down why this is an interesting product.

This project would help hospitals track and improve their work. Here's how it would help in real life:

First, it would save time by automatically collecting and analyzing data from different hospitals. Instead of staff spending hours gathering information by hand, the system would do it automatically.

The main benefits would be:

* Making it easier to see how well hospitals are caring for patients
* Helping hospital managers make better decisions using real data
* Creating reports that hospitals need to show they're following healthcare rules

For example, just like we analyzed Missouri hospital data in our code, this system would help hospital leaders quickly see important information about patient care and make their hospitals work better.

This simple approach would help hospitals:

* Spend less time on paperwork
* Find and fix problems faster
* Take better care of patients
* Keep track of everything more easily




---



## Submit your work via GitHub as normal
