<a href="https://colab.research.google.com/github/MeghanaSen/-HD5210-homework-/blob/main/week12_assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 11 Assignment


Please do the programming exercise and verify that your code works using the tests, then think about your final project and fill out the questions in the second part.

---

### 47.1: Filtering and summarizing data

For this work, you'll find a data file at `https://hds5210-data.s3.amazonaws.com/complications_all.csv`.

Read in the data file and create a variable called `mo_hospitals` that contains a data frame from the `complications_all.csv` file, filtered down to only contain those hospitals from the state of Missouri (MO).

Then aggregate that data by hospital into a variable named `mo_summary`.  There are some key fields that we want to summarize:
* We want to know the earliest date that each hospital was participating in any program
* We want to know the latest date that each hospital stopped participating in any program
* We want to know the total number of patients in the denominators of these programs

Some things to note:
* You will need to convert the `Start Date` and `End Date` to actual datetime fields
* You will need to clean up and convert the `Denominator` field to be numeric. The rule you should use is to simply remove any records where the `Denominator` is `'Not Available'`


The final result of this step should be a new data frame called `mo_summary` that contains one row for each hospital and contains the min start date, max end date, and total denominator.  Use the names `start_date`, `end_date`, and `number` for those columns in `mo_summary`.


You do not need to create your code in the form of a function, just make sure your variable names match what I've described above so the tests work.
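
The steps described above can be tried on a small scale first. The sketch below is only an illustration on a made-up toy frame: the hospital names, the values, and the `MM/DD/YYYY` date format are assumptions, not taken from `complications_all.csv`, and `toy` and `toy_summary` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the real file; names, values, and date format are invented.
toy = pd.DataFrame({
    "Facility Name": ["HOSPITAL A", "HOSPITAL A", "HOSPITAL B", "HOSPITAL C"],
    "State": ["MO", "MO", "MO", "IL"],
    "Denominator": ["100", "Not Available", "40", "75"],
    "Start Date": ["04/01/2015", "07/01/2015", "07/01/2015", "04/01/2015"],
    "End Date": ["03/31/2017", "06/30/2018", "06/30/2018", "06/30/2018"],
})

# Keep one state, then drop rows whose Denominator is the placeholder string.
mo = toy[toy["State"] == "MO"].copy()
mo = mo[mo["Denominator"] != "Not Available"].copy()

# Assign whole columns so the numeric and datetime dtypes are stored.
mo["Denominator"] = mo["Denominator"].astype(int)
mo["Start Date"] = pd.to_datetime(mo["Start Date"], format="%m/%d/%Y")
mo["End Date"] = pd.to_datetime(mo["End Date"], format="%m/%d/%Y")

# One row per hospital: earliest start, latest end, total denominator.
toy_summary = mo.groupby("Facility Name").agg(
    start_date=("Start Date", "min"),
    end_date=("End Date", "max"),
    number=("Denominator", "sum"),
)
print(toy_summary)
```

Because the `'Not Available'` row is removed before aggregating, its dates do not contribute to the minimum or maximum for that hospital.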

In [None]:
import pandas as pd
# This is just to show you the name to use for the variable you need to create for this step to pass.
all_hospitals = pd.read_csv('https://hds5210-data.s3.amazonaws.com/complications_all.csv')


In [None]:

# Keep only Missouri hospitals; .copy() gives an independent frame to modify.
mo_hospitals = all_hospitals[all_hospitals['State'] == 'MO'].copy()

# Turn Denominator into numbers; 'Not Available' becomes NaN and is dropped.
# Assigning the whole column (not through .loc[:, ...]) keeps the new dtype.
mo_hospitals['Denominator'] = pd.to_numeric(
    mo_hospitals['Denominator'], errors='coerce'
)
mo_hospitals = mo_hospitals.dropna(subset=['Denominator']).copy()
mo_hospitals['Denominator'] = mo_hospitals['Denominator'].astype(int)

# Convert the date columns to real datetime values.
mo_hospitals['Start Date'] = pd.to_datetime(
    mo_hospitals['Start Date'], errors='coerce'
)
mo_hospitals['End Date'] = pd.to_datetime(
    mo_hospitals['End Date'], errors='coerce'
)

# One row per hospital; groupby leaves 'Facility Name' as the index,
# which the tests rely on when they look hospitals up with .loc.
mo_summary = (
    mo_hospitals
    .groupby('Facility Name')
    .agg(
        start_date=('Start Date', 'min'),
        end_date=('End Date', 'max'),
        number=('Denominator', 'sum')
    )
)

print(mo_summary)


                                              start_date             end_date  \
Facility Name                                                                   
BARNES JEWISH HOSPITAL               2015-04-01 00:00:00  2018-06-30 00:00:00   
BARNES-JEWISH ST PETERS HOSPITAL     2015-04-01 00:00:00  2018-06-30 00:00:00   
BARNES-JEWISH WEST COUNTY HOSPITAL   2015-04-01 00:00:00  2018-06-30 00:00:00   
BATES COUNTY MEMORIAL HOSPITAL       2015-07-01 00:00:00  2018-06-30 00:00:00   
BELTON REGIONAL MEDICAL CENTER       2015-04-01 00:00:00  2018-06-30 00:00:00   
...                                                  ...                  ...   
TRUMAN MEDICAL CENTER LAKEWOOD       2015-04-01 00:00:00  2018-06-30 00:00:00   
UNIVERSITY OF MISSOURI HEALTH CARE   2015-04-01 00:00:00  2018-06-30 00:00:00   
WASHINGTON COUNTY MEMORIAL HOSPITAL  2015-07-01 00:00:00  2018-06-30 00:00:00   
WESTERN MISSOURI MEDICAL CENTER      2015-04-01 00:00:00  2018-06-30 00:00:00   
WRIGHT MEMORIAL HOSPITAL    

In [None]:
assert(mo_summary['number'].sum() == 1766908)
assert(mo_summary['start_date'].min() == pd.Timestamp(2015,4,1))
assert(mo_summary['end_date'].max() == pd.Timestamp(2018,6,30))
assert(mo_summary.shape == (108,3))
assert(mo_summary.loc['BARNES JEWISH HOSPITAL'].number == 131313)
assert(mo_summary.loc['BOONE HOSPITAL CENTER'].number == 63099)

---

### 47.2: Planning your final project

You should be thinking about the things we've been learning and how you can apply them to your final project.  Use the rubric to help guide your thinking and then answer the questions below.  This is meant as a guide to help you think through what you will do.

#### A. Data Access

Your project should include data from at least three distinct types of sources.  For example: AWS S3, Relational Databases, Internet, Web Services, local files.  List what data sources you're planning to use.


For this project, I will draw on three types of sources so that my dataset is both diverse and comprehensive. First, I will use local files, specifically CSV files stored on my machine, which are easy to load with Python's pandas library. Second, I will use Google BigQuery, a cloud-based data warehouse queried with SQL, as my relational database source. BigQuery will let me run complex SQL queries to extract and analyze large volumes of structured data, adding depth to the analysis. Finally, I will use **Internet** data by calling publicly available APIs, using Python's requests library to fetch JSON. This gives access to real-time or current information in areas such as healthcare or finance. Combining local files, a relational database, and web-based APIs gives a well-rounded dataset that can support strong, insightful analysis.
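
As a rough sketch of how these three source types might be combined, the example below uses stand-ins so that it runs without credentials or network access: an in-memory CSV string in place of a local file, `sqlite3` in place of BigQuery, and a hard-coded JSON string in place of an API response. All table names, fields, and values are hypothetical.

```python
import csv
import io
import json
import sqlite3

# 1) Local file: an in-memory CSV stands in for a file on disk.
csv_text = "hospital_id,name\n1,HOSPITAL A\n2,HOSPITAL B\n"
hospitals = list(csv.DictReader(io.StringIO(csv_text)))

# 2) Relational database: sqlite3 stands in for BigQuery (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (hospital_id INTEGER, patients INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)", [(1, 120), (1, 30), (2, 45)]
)
totals = dict(conn.execute(
    "SELECT hospital_id, SUM(patients) FROM visits GROUP BY hospital_id"
))
conn.close()

# 3) Web API: a JSON string stands in for the body of an HTTP response.
api_body = (
    '{"results": [{"hospital_id": 1, "rating": 4},'
    ' {"hospital_id": 2, "rating": 3}]}'
)
ratings = {
    row["hospital_id"]: row["rating"]
    for row in json.loads(api_body)["results"]
}

# Join the three sources on hospital_id (CSV values arrive as strings).
combined = [
    {
        "name": h["name"],
        "patients": totals[int(h["hospital_id"])],
        "rating": ratings[int(h["hospital_id"])],
    }
    for h in hospitals
]
print(combined)
```

In the real project, each stand-in would be swapped for the actual source, and the shared key used for joining would depend on what the chosen datasets have in common.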

#### B. Data Formats

Your project should include data that comes in different file formats.  For example: HL7, EDI, HTML, CSV, Excel, JSON, XML.  List what data formats you're planning to use.


I will use a variety of formats so that my dataset is as broad and diverse as possible. I will use **CSV** for structured tabular data, **Excel** for multi-sheet data, and **JSON** for the complex nested data typically returned by web APIs. For tables scraped from websites, I will use **HTML**. I will also use **XML** for structured data published under industry-specific standards, such as those in healthcare and finance. Using this range of formats should make the analysis more flexible and give it more depth.
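
To illustrate how several of these formats could be normalized into one common shape, here is a minimal sketch that uses only the Python standard library. The inline sample strings, the `name` and `beds` fields, and the `CellCollector` helper are all hypothetical. Excel is left out because reading it needs a third-party package.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# CSV: each row becomes a dict keyed by the header line.
csv_rows = list(csv.DictReader(io.StringIO("name,beds\nHOSPITAL A,200\n")))

# JSON: already a list of dicts once parsed.
json_rows = json.loads('[{"name": "HOSPITAL B", "beds": 150}]')

# XML: read the attributes of each <hospital> element.
xml_root = ET.fromstring(
    "<hospitals><hospital name='HOSPITAL C' beds='90'/></hospitals>"
)
xml_rows = [dict(el.attrib) for el in xml_root.iter("hospital")]


class CellCollector(HTMLParser):
    """Hypothetical helper: collects the text of <td> cells, row by row."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1].append(data.strip())


# HTML: pull the cells out of a simple table.
parser = CellCollector()
parser.feed("<table><tr><td>HOSPITAL D</td><td>60</td></tr></table>")
html_rows = [{"name": r[0], "beds": r[1]} for r in parser.rows if r]

# Normalize every source to the same shape and types.
records = [
    {"name": r["name"], "beds": int(r["beds"])}
    for r in csv_rows + json_rows + xml_rows + html_rows
]
print(records)
```

Once every format is reduced to the same list-of-dicts shape, the records can be loaded into a single data frame for analysis.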


#### C. Objective

What purpose would your project serve in a real work setting?  Take a couple of paragraphs to write down why this is an interesting product.


In a real work environment, this project would be a powerful tool: it aggregates data from diverse sources into a unified platform that organizations can use to make informed decisions. By drawing data from AWS S3, relational databases like Google BigQuery, and real-time web services, the project supports broad analysis that stitches together historical, structured, and real-time data. This is especially valuable in industries that depend on large-scale data analysis, such as healthcare, finance, and retail, where access to both internal and external data can improve operational efficiency and strategic insight.

The project is exciting because it brings different data formats and types together, letting businesses bridge the gaps between data silos. With support for formats such as CSV, Excel, JSON, HTML, and XML, data from various departments, external APIs, and online sources can be analyzed together with little friction. This approach is especially useful in healthcare, where data is often split across systems and formats. For instance, the project could link performance data with real-time public health data, making assessments more perceptive and more responsive to health trends. The solution is valuable precisely because of the diversity of data it integrates, and it would suit any organization that wants to use comprehensive, integrated analytics for better decision-making and a competitive advantage.




---



## Submit your work via GitHub as normal
