<a href="https://colab.research.google.com/github/Bhavyasaradhi/HDS5210_InClass/blob/master/week11/week11_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 11 Assignment


Please do the programming exercise and verify that your code works using the tests, then think about your final project and fill out the questions in the second part.

---

### 47.1: Filtering and summarizing data

For this work, you'll find a data file in `/data/complications_all.csv`.

Read in the data file and create a variable called `mo_hospitals` that contains a data frame from the `complications_all.csv` file, filtered down to only contain those hospitals from the state of Missouri (MO).

Then aggregate that data by hospital into a variable named `mo_summary`.  There are some key fields that we want to summarize:
* We want to know the earliest date that each hospital was participating in any program
* We want to know the latest date that each hospital stopped participating in any program
* We want to know the total number of patients in the denominators of these programs

Some things to note:
* You will need to convert the `Start Date` and `End Date` to actual datetime fields
* You will need to clean up and convert the `Denominator` field so that it is numeric. The rule you should use is to simply remove any records where the `Denominator` is `'Not Available'`


The final result of this step should be a new data frame called `mo_summary` that contains one row for each hospital and contains the min start date, max end date, and total denominator.  Use the names `start_date`, `end_date`, and `number` for those columns in `mo_summary`.
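
As a rough illustration of the shape of this result, the sketch below runs the same filter, convert, and aggregate steps on a tiny made-up table. The hospital names, dates, and counts are invented for the example, and the `MM/DD/YYYY` date format is an assumption rather than something confirmed for the real file.

```python
import pandas as pd

# Toy stand-in for complications_all.csv; every value here is made up.
toy = pd.DataFrame({
    "Facility Name": ["HOSPITAL A", "HOSPITAL A", "HOSPITAL B", "HOSPITAL B", "HOSPITAL C"],
    "State": ["MO", "MO", "MO", "MO", "IL"],
    "Start Date": ["04/01/2015", "07/01/2015", "07/01/2015", "04/01/2016", "04/01/2015"],
    "End Date": ["03/31/2018", "06/30/2018", "06/30/2018", "06/30/2017", "06/30/2018"],
    "Denominator": ["100", "Not Available", "25", "75", "10"],
})

# Keep Missouri rows and drop rows whose Denominator is 'Not Available'.
keep = (toy["State"] == "MO") & (toy["Denominator"] != "Not Available")
mo_toy = toy[keep].copy()

# Convert the text columns to real datetime and numeric types.
mo_toy["Start Date"] = pd.to_datetime(mo_toy["Start Date"], format="%m/%d/%Y")
mo_toy["End Date"] = pd.to_datetime(mo_toy["End Date"], format="%m/%d/%Y")
mo_toy["Denominator"] = pd.to_numeric(mo_toy["Denominator"])

# One row per hospital: earliest start, latest end, total denominator.
toy_summary = mo_toy.groupby("Facility Name").agg(
    start_date=("Start Date", "min"),
    end_date=("End Date", "max"),
    number=("Denominator", "sum"),
)
print(toy_summary)
```

Because the `'Not Available'` row is removed before aggregating, its dates do not contribute to the min and max either. In this toy table that is why `HOSPITAL A` ends on 2018-03-31 rather than 2018-06-30.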


You do not need to create your code in the form of a function, just make sure your variable names match what I've described above so the tests work.

In [1]:
import pandas as pd
# This is just to show you the name to use for the variable you need to create for this step to pass.
all_hospitals = pd.read_csv('https://hds5210-data.s3.amazonaws.com/complications_all.csv')


In [2]:
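# Filter down to hospitals in Missouri (MO); .copy() gives an independent frame
# so later column assignments do not trigger SettingWithCopyWarning
mo_hospitals = all_hospitals[all_hospitals['State'] == 'MO'].copy()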

In [3]:
# Convert 'Start Date' and 'End Date' to datetime.
# Assign the columns directly: `.loc[:, col] = ...` may keep the original object dtype in recent pandas.
mo_hospitals['Start Date'] = pd.to_datetime(mo_hospitals['Start Date'])
mo_hospitals['End Date'] = pd.to_datetime(mo_hospitals['End Date'])

In [4]:
# Remove records where 'Denominator' is 'Not Available', then convert the column to numeric
mo_hospitals = mo_hospitals[mo_hospitals['Denominator'] != 'Not Available'].copy()
mo_hospitals['Denominator'] = pd.to_numeric(mo_hospitals['Denominator'])

In [5]:
# Strip stray whitespace from the column names so the groupby keys below match exactly
mo_hospitals.columns = mo_hospitals.columns.str.strip()

In [6]:
# Aggregate by hospital: earliest start date, latest end date, total denominator.
# groupby already uses 'Facility Name' as the index, so no reset_index/set_index is needed.
mo_summary = mo_hospitals.groupby('Facility Name').agg(
    start_date=pd.NamedAgg(column='Start Date', aggfunc='min'),
    end_date=pd.NamedAgg(column='End Date', aggfunc='max'),
    number=pd.NamedAgg(column='Denominator', aggfunc='sum')
)

In [7]:
mo_summary

| Facility Name | start_date | end_date | number |
|---|---|---|---|
| BARNES JEWISH HOSPITAL | 2015-04-01 00:00:00 | 2018-06-30 00:00:00 | 131313 |
| BARNES-JEWISH ST PETERS HOSPITAL | 2015-04-01 00:00:00 | 2018-06-30 00:00:00 | 15668 |
| BARNES-JEWISH WEST COUNTY HOSPITAL | 2015-04-01 00:00:00 | 2018-06-30 00:00:00 | 9622 |
| BATES COUNTY MEMORIAL HOSPITAL | 2015-07-01 00:00:00 | 2018-06-30 00:00:00 | 3117 |
| BELTON REGIONAL MEDICAL CENTER | 2015-04-01 00:00:00 | 2018-06-30 00:00:00 | 9270 |
| ... | ... | ... | ... |
| TRUMAN MEDICAL CENTER LAKEWOOD | 2015-04-01 00:00:00 | 2018-06-30 00:00:00 | 4297 |
| UNIVERSITY OF MISSOURI HEALTH CARE | 2015-04-01 00:00:00 | 2018-06-30 00:00:00 | 56493 |
| WASHINGTON COUNTY MEMORIAL HOSPITAL | 2015-07-01 00:00:00 | 2018-06-30 00:00:00 | 220 |
| WESTERN MISSOURI MEDICAL CENTER | 2015-04-01 00:00:00 | 2018-06-30 00:00:00 | 7254 |


In [8]:
assert(mo_summary['number'].sum() == 1766908)
assert(mo_summary['start_date'].min() == pd.Timestamp(2015,4,1))
assert(mo_summary['end_date'].max() == pd.Timestamp(2018,6,30))
assert(mo_summary.shape == (108,3))
assert(mo_summary.loc['BARNES JEWISH HOSPITAL'].number == 131313)
assert(mo_summary.loc['BOONE HOSPITAL CENTER'].number == 63099)

---

### 47.2: Planning your final project

You should be thinking about the things we've been learning and how you can apply them to your final project.  Use the rubric to help guide your thinking and then answer the questions below.  This is meant as a guide to help you think through what you will do.

#### A. Data Access

Your project should include data from at least three distinct types of sources.  For example: AWS S3, Relational Databases, Internet, Web Services, local files.  List what data sources you're planning to use.

For my final project, I will use three types of sources: AWS S3 (cloud storage), the Internet, and local files. Together they give broad coverage of the data I plan to integrate and analyze.


AWS S3: I will store and retrieve searchable client information (patients, health parameters, insurance reimbursement) in AWS S3. This managed cloud storage service lets me work with very large datasets and keep the data secure without needing a large amount of storage on my local machine.
How it will be useful: AWS S3 offers secure storage, straightforward data sharing with other cloud services, and high availability. That makes it a reasonable fit for health care data, which must comply with data privacy laws such as HIPAA.
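
The programming exercise above already reads a publicly readable S3 object through its HTTPS URL. A minimal sketch of how such a URL is put together is shown below; the helper name `public_s3_url` is hypothetical. This only applies to public objects. Private buckets holding patient data would need authenticated access instead, which is not shown here.

```python
from urllib.parse import quote


def public_s3_url(bucket: str, key: str) -> str:
    """Build the virtual-hosted-style HTTPS URL for a publicly readable S3 object."""
    # quote() leaves "/" unescaped by default, so nested keys keep their path shape.
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


# Bucket and key taken from the URL used in the exercise above.
url = public_s3_url("hds5210-data", "complications_all.csv")
print(url)
```

The resulting string can be passed straight to `pd.read_csv`, as the exercise does.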


Internet: I will either scrape data from the web or use APIs to pull data and reports from reliable governmental and healthcare sources (e.g., CDC, WHO, NIH). This will give me up-to-date health statistics, trends, and other population health information for my project.
How it will be useful: Internet sources matter because my analysis needs real-world data from the health services field. They will help me look for trends that may be affecting patient outcomes or healthcare delivery.

Local files: I will use temporary local files when I preprocess or clean data before uploading it to AWS S3. Any dataset that cannot be stored in the cloud because of its size can also be managed locally. I will also use CSV files and Excel sheets for interim results, summaries, and other project documents.
How it will be useful: Local storage gives fast access for data manipulation and initial exploratory data analysis (EDA). It lets me work offline and keep control of the data pipeline before pushing data to cloud platforms.


#### B. Data Formats

Your project should include data that comes in different file formats.  For example: HL7, EDI, HTML, CSV, Excel, JSON, XML.  List what data formats you're planning to use.

To analyze and connect different kinds of healthcare data in my project, I will use HTML, CSV, and Excel files, which are flexible and widely compatible.

Usage:
HTML: I will scrape data from web pages by parsing their HTML tags and attributes. This will involve collecting healthcare information about patients, the general public, or research from reputable websites.
How it will be useful: HTML data lets me gather information from web sources on demand, including details that conventional datasets may not contain, which gives my project a current perspective.
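
As a sketch of what parsing HTML tags can look like, the example below pulls the cell text out of a small table using only Python's standard `html.parser` module. The class name `TableCellParser` and the sample markup (including its values) are made up for illustration; real pages are usually messier and may call for a dedicated scraping library.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample page fragment.
sample_html = """
<table>
  <tr><th>Region</th><th>Clinics</th></tr>
  <tr><td>North</td><td>12</td></tr>
  <tr><td>South</td><td>7</td></tr>
</table>
"""

parser = TableCellParser()
parser.feed(sample_html)
header, *records = parser.rows
print(header, records)
```

Every value comes out as text, so numeric columns still need converting before analysis.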

CSV (Comma-Separated Values): Structured data, such as health statistics and patient information, will be stored in CSV files. I will also use CSV files for preprocessing and initial analysis, since they are lightweight and easy to handle in a data manipulation environment early in the analysis.
How it will be useful: CSV is a widely supported format with a simple structure, which makes it convenient for cleaning, transforming, and loading data into other platforms such as AWS S3.
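
One practical consequence of CSV's simple structure is that it carries no type information. The sketch below writes a couple of made-up records to an in-memory CSV with the standard `csv` module and reads them back. The field names and values are assumptions chosen for the example.

```python
import csv
import io

# Made-up records standing in for a cleaned extract.
records = [
    {"facility": "HOSPITAL A", "start_date": "2015-04-01", "number": 100},
    {"facility": "HOSPITAL B", "start_date": "2015-07-01", "number": 25},
]

# Write the records to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["facility", "start_date", "number"])
writer.writeheader()
writer.writerows(records)

# Read the same text back as a list of dictionaries.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))

# Every value comes back as a string, so numbers must be converted again.
total = sum(int(row["number"]) for row in loaded)
print(loaded[0]["number"], total)
```

This is the same reason the exercise above has to convert `Denominator` and the date columns after reading the file.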

Excel: Smaller datasets will be kept in Excel files for exploratory analysis and for presenting preliminary results. Excel will also be useful for datasets that need manual entry, correction, or verification.
How it will be useful: Because Excel is flexible and has built-in analytical tools, it suits quick, basic analysis, data validation, and pivot tables or charts. That supports my project's early exploratory stage before I move to programming tools.

Together, these three formats support data acquisition, storage, retrieval, and processing, and give the project a sound and scalable foundation.









#### C. Objective

What purpose would your project serve in a real work setting?  Take a couple of paragraphs to write down why this is an interesting product.


This project provides a flexible solution for businesses facing data integration challenges by combining data from AWS S3, Internet sources, and local files in a variety of formats such as HTML, CSV, and Excel. Many organizations work with data that is dispersed across multiple platforms and encoded in multiple formats, which makes it difficult to integrate and analyze. A platform that can handle these sources and formats streamlines the data preprocessing phases and makes analytics work more accurate and efficient.

This project's practical value lies in its ability to improve organizational decision-making. With a flexible, unified data integration tool, businesses can gain insights from a larger range of data, which makes well-informed decisions easier. This in turn fosters responsiveness and agility, which are essential traits for businesses in fast-paced, highly competitive markets. By offering an easy-to-use interface for data aggregation and analysis, the project contributes to a more efficient and successful business intelligence strategy. It lets businesses make full use of the data they have, which helps with operational and strategic planning.




---



## Submit your work via GitHub as normal
