# Week 11 Assignment

Because I was unable to conduct our workshop this week, I'm keeping the assignment light as well.  Below you'll find just two steps for this week: one programming exercise and then a planning activity for your final project.

For clarification, the "final project" I've been referring to is your "final."  It is not a project in addition to a final exam.  They're one and the same.

Please do the programming exercise and verify that your code works using the tests, then think about your final project and fill out the questions in the second part.

---

### 47.1: Filtering and summarizing data

For this work, you'll find a data file at `/data/complications_all.csv`.

Read in the data file and create a variable called `mo_hospitals` that contains a data frame from the `complications_all.csv` file, filtered down to only contain those hospitals from the state of Missouri (MO).

Then aggregate that data by hospital into a variable named `mo_summary`.  There are some key fields that we want to summarize:
* We want to know the earliest date that each hospital was participating in any program
* We want to know the latest date that each hospital stopped participating in any program
* We want to know the total number of patients in the denominators of these programs

Some things to note:
* You will need to convert the `Start Date` and `End Date` to actual datetime fields
* You will need to clean up and convert the `Denominator` field to just be numeric.  The rule that you should use is to simply remove any records where the `Denominator` is `'Not Available'`


The final result of this step should be a new data frame called `mo_summary` that contains one row for each hospital and contains the min start date, max end date, and total denominator.  Use the names `start_date`, `end_date`, and `number` for those columns in `mo_summary`.


You do not need to create your code in the form of a function, just make sure your variable names match what I've described above so the tests work.

In [1]:
import pandas as pd
import pandasql as ps
import datetime
from pathlib import Path
HOME = str(Path.home())

# This is just to show you the name to use for the variable you need to create for this step to pass.
#mo_hospitals = ...

#read in dataset
filename = "/data/complications_all.csv"
df = pd.read_csv(filename)
df

Unnamed: 0,Facility ID,Facility Name,Address,City,State,ZIP Code,County Name,Phone Number,Measure ID,Measure Name,Compared to National,Denominator,Score,Lower Estimate,Higher Estimate,Footnote,Start Date,End Date
0,010001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,(334) 793-8701,COMP_HIP_KNEE,Rate of complications for hip/knee replacement...,No Different Than the National Rate,292,3.2,2.1,4.8,,04/01/2015,03/31/2018
1,010001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,(334) 793-8701,MORT_30_AMI,Death rate for heart attack patients,No Different Than the National Rate,688,13,11.0,15.5,,07/01/2015,06/30/2018
2,010001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,(334) 793-8701,MORT_30_CABG,Death rate for CABG surgery patients,No Different Than the National Rate,291,4.3,2.6,6.8,,07/01/2015,06/30/2018
3,010001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,(334) 793-8701,MORT_30_COPD,Death rate for COPD patients,No Different Than the National Rate,411,8.8,6.7,11.4,,07/01/2015,06/30/2018
4,010001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,HOUSTON,(334) 793-8701,MORT_30_HF,Death rate for heart failure patients,No Different Than the National Rate,869,12.7,10.7,15.0,,07/01/2015,06/30/2018
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91390,670128,BAYLOR SCOTT & WHITE MEDICAL CENTER PFLUGERVILLE,2600 EAST PFLUGERVILLE PARKWAY,PFLUGERVILLE,TX,78660,TRAVIS,(512) 654-6100,PSI_4_SURG_COMP,Deaths among Patients with Serious Treatable C...,Not Available,Not Available,Not Available,Not Available,Not Available,5.0,07/01/2016,06/30/2018
91391,670128,BAYLOR SCOTT & WHITE MEDICAL CENTER PFLUGERVILLE,2600 EAST PFLUGERVILLE PARKWAY,PFLUGERVILLE,TX,78660,TRAVIS,(512) 654-6100,PSI_6_IAT_PTX,Collapsed lung due to medical treatment,Not Available,Not Available,Not Available,Not Available,Not Available,5.0,07/01/2016,06/30/2018
91392,670128,BAYLOR SCOTT & WHITE MEDICAL CENTER PFLUGERVILLE,2600 EAST PFLUGERVILLE PARKWAY,PFLUGERVILLE,TX,78660,TRAVIS,(512) 654-6100,PSI_8_POST_HIP,Broken hip from a fall after surgery,Not Available,Not Available,Not Available,Not Available,Not Available,5.0,07/01/2016,06/30/2018
91393,670128,BAYLOR SCOTT & WHITE MEDICAL CENTER PFLUGERVILLE,2600 EAST PFLUGERVILLE PARKWAY,PFLUGERVILLE,TX,78660,TRAVIS,(512) 654-6100,PSI_90_SAFETY,Serious complications,Not Available,Not Available,Not Available,Not Available,Not Available,5.0,07/01/2016,06/30/2018


In [2]:
#quick eyeball of the data
#check unique values of the State column 
print(df['State'].unique())
print(len(df['State'].unique()))

['AL' 'AK' 'AZ' 'AR' 'CA' 'CO' 'CT' 'DE' 'DC' 'FL' 'GA' 'HI' 'ID' 'IL'
 'IN' 'IA' 'KS' 'KY' 'LA' 'ME' 'MD' 'MA' 'MI' 'MN' 'MS' 'MO' 'MT' 'NE'
 'NV' 'NH' 'NJ' 'NM' 'NY' 'NC' 'ND' 'OH' 'OK' 'OR' 'PA' 'PR' 'RI' 'SC'
 'SD' 'TN' 'TX' 'UT' 'VT' 'VI' 'VA' 'WA' 'WV' 'WI' 'WY' 'AS' 'GU' 'MP']
56


In [3]:
#filter dataframe to select MO state only

#method 1:
# filter = df['State'] == 'MO'
# filter.head()

# mo_hospitals = df[filter]
# mo_hospitals.head()

#method 2 (SQL-like Syntax - I prefer this one): 
sql = """
select *
from df 
where State = 'MO' 
"""

mo_hospitals =  ps.sqldf(sql, locals())
mo_hospitals

Unnamed: 0,Facility ID,Facility Name,Address,City,State,ZIP Code,County Name,Phone Number,Measure ID,Measure Name,Compared to National,Denominator,Score,Lower Estimate,Higher Estimate,Footnote,Start Date,End Date
0,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,COMP_HIP_KNEE,Rate of complications for hip/knee replacement...,No Different Than the National Rate,26,2.5,1.4,4.2,,04/01/2015,03/31/2018
1,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_AMI,Death rate for heart attack patients,No Different Than the National Rate,175,13.9,11.0,16.9,,07/01/2015,06/30/2018
2,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_CABG,Death rate for CABG surgery patients,No Different Than the National Rate,91,2.5,1.2,5.1,,07/01/2015,06/30/2018
3,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_COPD,Death rate for COPD patients,No Different Than the National Rate,326,8.5,6.5,10.9,,07/01/2015,06/30/2018
4,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_HF,Death rate for heart failure patients,No Different Than the National Rate,461,13.1,10.7,15.9,,07/01/2015,06/30/2018
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2128,263304,SHRINERS HOSPITALS FOR CHILDREN,4400 CLAYTON AVE,SAINT LOUIS,MO,63110,ST. LOUIS CITY,(314) 432-3600,PSI_4_SURG_COMP,Deaths among Patients with Serious Treatable C...,Not Available,Not Available,Not Available,Not Available,Not Available,19.0,07/01/2016,06/30/2018
2129,263304,SHRINERS HOSPITALS FOR CHILDREN,4400 CLAYTON AVE,SAINT LOUIS,MO,63110,ST. LOUIS CITY,(314) 432-3600,PSI_6_IAT_PTX,Collapsed lung due to medical treatment,Not Available,Not Available,Not Available,Not Available,Not Available,19.0,07/01/2016,06/30/2018
2130,263304,SHRINERS HOSPITALS FOR CHILDREN,4400 CLAYTON AVE,SAINT LOUIS,MO,63110,ST. LOUIS CITY,(314) 432-3600,PSI_8_POST_HIP,Broken hip from a fall after surgery,Not Available,Not Available,Not Available,Not Available,Not Available,19.0,07/01/2016,06/30/2018
2131,263304,SHRINERS HOSPITALS FOR CHILDREN,4400 CLAYTON AVE,SAINT LOUIS,MO,63110,ST. LOUIS CITY,(314) 432-3600,PSI_90_SAFETY,Serious complications,Not Available,Not Available,Not Available,Not Available,Not Available,19.0,07/01/2016,06/30/2018


In [4]:
#quick check
print(mo_hospitals['State'].unique())
print(len(mo_hospitals['State'].unique()))

['MO']
1


In [5]:
# These assertions will help make sure that you're on the right track.
assert(mo_hospitals['State'].unique() == ['MO'])
assert(mo_hospitals.shape == (2133,18))

In [6]:
#summarize data
#key fields to summarize:
#(1)earliest date that each hospital was participating in any program
#(2)latest date that each hospital stopped participating in any program
#(3)total number of patients in the denominators of these programs

#Some things to note:
#convert the Start Date and End Date to actual datetime fields
#clean up and convert the Denominator field to just be numeric (the rule that you should use is to simply remove any records where the Denominator is 'Not Available')
#final result of this step should be a new data frame called mo_summary that contains one row for each hospital and contains the min start date, max end date, and total denominator. 
#Use the names start_date, end_date, and number for those columns in mo_summary.

# mo_summary = ...


##### First, Some Quick Checks

In [7]:
#check data type
type(mo_hospitals['Start Date'])

#info of data
print(mo_hospitals.info())

print('----------')

#check columns
print(mo_hospitals.keys())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2133 entries, 0 to 2132
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Facility ID           2133 non-null   object 
 1   Facility Name         2133 non-null   object 
 2   Address               2133 non-null   object 
 3   City                  2133 non-null   object 
 4   State                 2133 non-null   object 
 5   ZIP Code              2133 non-null   int64  
 6   County Name           2133 non-null   object 
 7   Phone Number          2133 non-null   object 
 8   Measure ID            2133 non-null   object 
 9   Measure Name          2133 non-null   object 
 10  Compared to National  2133 non-null   object 
 11  Denominator           2133 non-null   object 
 12  Score                 2133 non-null   object 
 13  Lower Estimate        2133 non-null   object 
 14  Higher Estimate       2133 non-null   object 
 15  Footnote             

In [8]:
#datefield
print(mo_hospitals['Start Date'])
print(mo_hospitals['End Date'])

0       04/01/2015
1       07/01/2015
2       07/01/2015
3       07/01/2015
4       07/01/2015
           ...    
2128    07/01/2016
2129    07/01/2016
2130    07/01/2016
2131    07/01/2016
2132    07/01/2016
Name: Start Date, Length: 2133, dtype: object
0       03/31/2018
1       06/30/2018
2       06/30/2018
3       06/30/2018
4       06/30/2018
           ...    
2128    06/30/2018
2129    06/30/2018
2130    06/30/2018
2131    06/30/2018
2132    06/30/2018
Name: End Date, Length: 2133, dtype: object


###### Observation: It seems the date fields are not date types
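
A minimal sketch of the conversion, assuming the dates always follow the `MM/DD/YYYY` pattern printed above. The `sample` frame is made up for illustration; passing an explicit `format` to `pd.to_datetime` removes any ambiguity between month-first and day-first parsing.

```python
import pandas as pd

# Hypothetical sample mirroring the 'Start Date' / 'End Date' strings shown above
sample = pd.DataFrame({
    "Start Date": ["04/01/2015", "07/01/2015"],
    "End Date": ["03/31/2018", "06/30/2018"],
})

# Parse month/day/year strings explicitly instead of relying on format inference
sample["start_date"] = pd.to_datetime(sample["Start Date"], format="%m/%d/%Y")
sample["end_date"] = pd.to_datetime(sample["End Date"], format="%m/%d/%Y")

# The two new columns should report a datetime64 dtype
print(sample.dtypes)
```
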

In [9]:
#check value & counts of facility field
mo_hospitals['Facility Name'].value_counts()

HARRISON COUNTY COMMUNITY HOSPITAL         19
SSM HEALTH ST CLARE HOSPITAL - FENTON      19
MOSAIC LIFE CARE AT ST JOSEPH              19
CITIZENS MEMORIAL HOSPITAL                 19
MOSAIC MEDICAL CENTER ALBANY               19
                                           ..
PEMISCOT COUNTY MEMORIAL HOSPITAL          19
KANSAS CITY VA MEDICAL CENTER               6
ST LOUIS-JOHN COCHRAN VA MEDICAL CENTER     6
POPLAR BLUFF VA MEDICAL CENTER              6
COLUMBIA MO VA MEDICAL CENTER               6
Name: Facility Name, Length: 115, dtype: int64

##### Next, Convert date fields...

In [10]:
#convert Start and End Dates to actual datetime fields
# start_date = pd.to_datetime(mo_hospitals['Start Date'], format='%Y%m%d')
# mo_hospitals['start_date'] = start_date

mo_hospitals['start_date']= pd.to_datetime(mo_hospitals['Start Date'])
mo_hospitals['end_date']= pd.to_datetime(mo_hospitals['End Date'])

#check
print(mo_hospitals['start_date'])
print(mo_hospitals['end_date'])

mo_hospitals.head()

0      2015-04-01
1      2015-07-01
2      2015-07-01
3      2015-07-01
4      2015-07-01
          ...    
2128   2016-07-01
2129   2016-07-01
2130   2016-07-01
2131   2016-07-01
2132   2016-07-01
Name: start_date, Length: 2133, dtype: datetime64[ns]
0      2018-03-31
1      2018-06-30
2      2018-06-30
3      2018-06-30
4      2018-06-30
          ...    
2128   2018-06-30
2129   2018-06-30
2130   2018-06-30
2131   2018-06-30
2132   2018-06-30
Name: end_date, Length: 2133, dtype: datetime64[ns]


Unnamed: 0,Facility ID,Facility Name,Address,City,State,ZIP Code,County Name,Phone Number,Measure ID,Measure Name,Compared to National,Denominator,Score,Lower Estimate,Higher Estimate,Footnote,Start Date,End Date,start_date,end_date
0,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,COMP_HIP_KNEE,Rate of complications for hip/knee replacement...,No Different Than the National Rate,26,2.5,1.4,4.2,,04/01/2015,03/31/2018,2015-04-01,2018-03-31
1,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_AMI,Death rate for heart attack patients,No Different Than the National Rate,175,13.9,11.0,16.9,,07/01/2015,06/30/2018,2015-07-01,2018-06-30
2,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_CABG,Death rate for CABG surgery patients,No Different Than the National Rate,91,2.5,1.2,5.1,,07/01/2015,06/30/2018,2015-07-01,2018-06-30
3,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_COPD,Death rate for COPD patients,No Different Than the National Rate,326,8.5,6.5,10.9,,07/01/2015,06/30/2018,2015-07-01,2018-06-30
4,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_HF,Death rate for heart failure patients,No Different Than the National Rate,461,13.1,10.7,15.9,,07/01/2015,06/30/2018,2015-07-01,2018-06-30


##### Next, clean up and convert the Denominator field to just be numeric (the rule that you should use is to simply remove any records where the Denominator is 'Not Available')
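
One way to apply this rule, sketched on a made-up frame: keep only the rows whose `Denominator` is not `'Not Available'`, take an explicit `.copy()` so that later column assignments act on an independent frame rather than a slice, then cast the remaining strings to integers. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical rows; the real data comes from complications_all.csv
toy = pd.DataFrame({
    "Facility Name": ["A", "A", "B"],
    "Denominator": ["26", "Not Available", "175"],
})

# Drop the 'Not Available' rows; .copy() yields an independent frame
cleaned = toy[toy["Denominator"] != "Not Available"].copy()

# The remaining values are digit strings, so an integer cast is safe here
cleaned["number"] = cleaned["Denominator"].astype(int)

print(cleaned)
```
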

In [11]:
#clean up and convert the Denominator field to just be numeric 
#(the rule that you should use is to simply remove any records where the Denominator is 'Not Available')

#check dimension before filtering
mo_hospitals.shape
#before (2133, 20)

#create a boolean mask (named to avoid shadowing Python's built-in filter)
has_denominator = mo_hospitals['Denominator'] != 'Not Available'

#apply the mask
mo_hospitals = mo_hospitals[has_denominator]

mo_hospitals.shape
#after (1189, 20)

mo_hospitals.head()


Unnamed: 0,Facility ID,Facility Name,Address,City,State,ZIP Code,County Name,Phone Number,Measure ID,Measure Name,Compared to National,Denominator,Score,Lower Estimate,Higher Estimate,Footnote,Start Date,End Date,start_date,end_date
0,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,COMP_HIP_KNEE,Rate of complications for hip/knee replacement...,No Different Than the National Rate,26,2.5,1.4,4.2,,04/01/2015,03/31/2018,2015-04-01,2018-03-31
1,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_AMI,Death rate for heart attack patients,No Different Than the National Rate,175,13.9,11.0,16.9,,07/01/2015,06/30/2018,2015-07-01,2018-06-30
2,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_CABG,Death rate for CABG surgery patients,No Different Than the National Rate,91,2.5,1.2,5.1,,07/01/2015,06/30/2018,2015-07-01,2018-06-30
3,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_COPD,Death rate for COPD patients,No Different Than the National Rate,326,8.5,6.5,10.9,,07/01/2015,06/30/2018,2015-07-01,2018-06-30
4,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_HF,Death rate for heart failure patients,No Different Than the National Rate,461,13.1,10.7,15.9,,07/01/2015,06/30/2018,2015-07-01,2018-06-30


In [12]:
#check denom field type
#mo_data_types = pd.DataFrame(mo_hospitals, columns=['Denominator', 'start_date', 'end_date'])
#mo_hospitals['Denominator'].dtypes

print(mo_hospitals.dtypes)

#convert to numeric; assign() returns a new frame, which avoids pandas' SettingWithCopyWarning
mo_hospitals = mo_hospitals.assign(number=mo_hospitals["Denominator"].astype('int'))

#print(mo_hospitals["Denominator"].dtypes)
print(mo_hospitals.dtypes)
#the new "number" column is an int type (Denominator itself is unchanged)

#check 
mo_hospitals.head()

Facility ID                     object
Facility Name                   object
Address                         object
City                            object
State                           object
ZIP Code                         int64
County Name                     object
Phone Number                    object
Measure ID                      object
Measure Name                    object
Compared to National            object
Denominator                     object
Score                           object
Lower Estimate                  object
Higher Estimate                 object
Footnote                       float64
Start Date                      object
End Date                        object
start_date              datetime64[ns]
end_date                datetime64[ns]
dtype: object
Facility ID                     object
Facility Name                   object
Address                         object
City                            object
State                           object
ZIP Code   


Unnamed: 0,Facility ID,Facility Name,Address,City,State,ZIP Code,County Name,Phone Number,Measure ID,Measure Name,...,Denominator,Score,Lower Estimate,Higher Estimate,Footnote,Start Date,End Date,start_date,end_date,number
0,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,COMP_HIP_KNEE,Rate of complications for hip/knee replacement...,...,26,2.5,1.4,4.2,,04/01/2015,03/31/2018,2015-04-01,2018-03-31,26
1,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_AMI,Death rate for heart attack patients,...,175,13.9,11.0,16.9,,07/01/2015,06/30/2018,2015-07-01,2018-06-30,175
2,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_CABG,Death rate for CABG surgery patients,...,91,2.5,1.2,5.1,,07/01/2015,06/30/2018,2015-07-01,2018-06-30,91
3,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_COPD,Death rate for COPD patients,...,326,8.5,6.5,10.9,,07/01/2015,06/30/2018,2015-07-01,2018-06-30,326
4,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_HF,Death rate for heart failure patients,...,461,13.1,10.7,15.9,,07/01/2015,06/30/2018,2015-07-01,2018-06-30,461


##### Summarize the Data
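
As an alternative to building three separate series and concatenating them, the same kind of summary can be sketched with a single named aggregation. The `toy` frame below is hypothetical and only mirrors the column names used in this notebook; it is not the assignment data.

```python
import pandas as pd

# Hypothetical cleaned rows with the columns created in the earlier steps
toy = pd.DataFrame({
    "Facility Name": ["A", "A", "B"],
    "start_date": pd.to_datetime(["2015-04-01", "2015-07-01", "2015-07-01"]),
    "end_date": pd.to_datetime(["2018-03-31", "2018-06-30", "2018-06-30"]),
    "number": [26, 175, 91],
})

# One row per facility: earliest start, latest end, total denominator
summary = toy.groupby("Facility Name").agg(
    start_date=("start_date", "min"),
    end_date=("end_date", "max"),
    number=("number", "sum"),
)

print(summary)
```

Named aggregation sets the output column names directly, so no rename step is needed afterwards.
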

In [13]:
#final result of this step should be a new data frame called mo_summary that contains one row for each hospital and contains the min start date, max end date, and total denominator. 
#Use the names start_date, end_date, and number for those columns in mo_summary.

print('-----Min Date--------')
#(1)earliest date that each hospital was participating in any program (min_date)
min_date = mo_hospitals.groupby('Facility Name')['start_date'].min()
print(min_date)

print('         ')
print('-----Max Date--------')
#(2)latest date that each hospital stopped participating in any program (max_date)
max_date = mo_hospitals.groupby('Facility Name')['end_date'].max()
print(max_date)

print('         ')
print('-----Totals--------')
#(3)total number of patients in the denominators of these programs (denom_sum)

number = mo_hospitals.groupby('Facility Name')['number'].sum()
print(number)

#Aggregate/Compile
mo_summary = pd.concat([min_date, max_date, number], axis = 1)

#rename denominator column
#mo_summary.rename(columns = {'Denominator':'number'}, inplace = True)
#denom_rename = {'Denominator':'number'}

mo_summary

-----Min Date--------
Facility Name
BARNES JEWISH HOSPITAL                2015-04-01
BARNES-JEWISH ST PETERS HOSPITAL      2015-04-01
BARNES-JEWISH WEST COUNTY HOSPITAL    2015-04-01
BATES COUNTY MEMORIAL HOSPITAL        2015-07-01
BELTON REGIONAL MEDICAL CENTER        2015-04-01
                                         ...    
TRUMAN MEDICAL CENTER LAKEWOOD        2015-04-01
UNIVERSITY OF MISSOURI HEALTH CARE    2015-04-01
WASHINGTON COUNTY MEMORIAL HOSPITAL   2015-07-01
WESTERN MISSOURI MEDICAL CENTER       2015-04-01
WRIGHT MEMORIAL HOSPITAL              2015-07-01
Name: start_date, Length: 108, dtype: datetime64[ns]
         
-----Max Date--------
Facility Name
BARNES JEWISH HOSPITAL                2018-06-30
BARNES-JEWISH ST PETERS HOSPITAL      2018-06-30
BARNES-JEWISH WEST COUNTY HOSPITAL    2018-06-30
BATES COUNTY MEMORIAL HOSPITAL        2018-06-30
BELTON REGIONAL MEDICAL CENTER        2018-06-30
                                         ...    
TRUMAN MEDICAL CENTER LAKEWOOD  

Facility Name,start_date,end_date,number
BARNES JEWISH HOSPITAL,2015-04-01,2018-06-30,131313
BARNES-JEWISH ST PETERS HOSPITAL,2015-04-01,2018-06-30,15668
BARNES-JEWISH WEST COUNTY HOSPITAL,2015-04-01,2018-06-30,9622
BATES COUNTY MEMORIAL HOSPITAL,2015-07-01,2018-06-30,3117
BELTON REGIONAL MEDICAL CENTER,2015-04-01,2018-06-30,9270
...,...,...,...
TRUMAN MEDICAL CENTER LAKEWOOD,2015-04-01,2018-06-30,4297
UNIVERSITY OF MISSOURI HEALTH CARE,2015-04-01,2018-06-30,56493
WASHINGTON COUNTY MEMORIAL HOSPITAL,2015-07-01,2018-06-30,220
WESTERN MISSOURI MEDICAL CENTER,2015-04-01,2018-06-30,7254


In [14]:
assert(mo_summary['number'].sum() == 1766908)
assert(mo_summary['start_date'].min() == pd.Timestamp(2015,4,1))
assert(mo_summary['end_date'].max() == pd.Timestamp(2018,6,30))
assert(mo_summary.shape == (108,3))
assert(mo_summary.loc['BARNES JEWISH HOSPITAL'].number == 131313)
assert(mo_summary.loc['BOONE HOSPITAL CENTER'].number == 63099)

---

### 47.2: Planning your final project

You should be thinking about the things we've been learning and how you can apply them to your final project.  Use the rubric to help guide your thinking and then answer the questions below.  This is meant as a guide to help you think through what you will do.

#### A. Data Access

Your project should include data from at least three distinct types of sources.  For example: AWS S3, Relational Databases, Internet, Web Services, local files.  List what data sources you're planning to use.

**Double-click to enter your answer**

- Web Services via the Internet
- Local File(s)

#### B. Data Formats

Your project should include data that comes in different file formats.  For example: HL7, EDI, HTML, CSV, Excel, JSON, XML.  List what data formats you're planning to use.

**Double-click to enter your answer**

- CSV 
- Excel 

#### C. Objective


What purpose would your project serve in a real work setting?  Take a couple of paragraphs to write down why this is an interesting product.

**Double-click to enter your answer**

Context: The opioid epidemic has been at the forefront of public discourse in the United States in recent years. It is of great concern because of the devastating consequences and long-term health problems that result from reliance on and misuse of opioids. While there are treatments for opioid use disorder (OUD), there seem to be barriers to getting treatment.

This HDS 5210 course project is based on an ongoing research study (for a different course) that evaluates the social determinants of health (SDOH) most associated with a lack of access to Medication Assisted Treatment (MAT) care for OUD across counties in the US. The HDS 5210 course project will focus on the Python data management topics covered in class, such as retrieving, merging, aggregating, and summarizing data. The final product for this class project will be descriptive/summary statistics and visualizations.

Purpose in Real-World Setting:
Determining the factors that could inhibit care for OUD would let policy makers and other key decision makers see where programs and funding need to be targeted at specific groups in order to equalize access to OUD care. If funding and programs were targeted toward high-risk groups to improve access to OUD treatment at MAT facilities, more people could receive care, which could potentially help lessen the opioid epidemic in the United States.



---



## Submitting Your Work

In order to submit your work, you'll need to use the `git` command line program to **add** your homework file (this file) to your local repository, **commit** your changes to your local repository, and then **push** those changes up to github.com.  From there, I'll be able to **pull** the changes down and do my grading.  I'll provide some feedback, **commit** and **push** my comments back to you.  Next week, I'll show you how to **pull** down my comments.

To run through everything one last time and submit your work:
1. Use the `Kernel` -> `Restart Kernel and Run All Cells` menu option to run everything from top to bottom and stop here.
2. Follow the instructions in the prompt below to either save and submit your work, or continue working.

If anything fails along the way with this submission part of the process, let me know.  I'll help you troubleshoot.

---

In [None]:
a=input('''
Are you ready to submit your work?
1. Click the Save icon (or do Ctrl-S / Cmd-S)
2. Type "yes" or "no" below
3. Press Enter

''')

if a=='yes':
    !git add week11_assignment_2.ipynb
    !git commit -a -m "Submitting the week 11 programming assignment"
    !git push
else:
    print('''
    
OK. We can wait.
''')