# Filter, Sort, Merge and Create Data
    1. Data Structures
    2. Sort this out!
    3. Formats, Mappings, and Dictionaries

In [50]:
#Import libraries
from pandasql import sqldf #!pip install -U pandasql
pysqldf = lambda q: sqldf(q, globals())
from collections import Counter
import pandas as pd #Pandas Lib
import numpy as np  #NumPy Lib
import matplotlib.pyplot as plt
import urllib.request, json 

## Data set for all sections:
**Inpatient Prospective Payment System (IPPS):** *Provider Summary for the Top 100 Diagnosis-Related Groups (DRG) - FY2011 Medicare - Inpatient*

~~~~
Link: https://data.cms.gov/Medicare-Inpatient/Inpatient-Prospective-Payment-System-IPPS-Provider/97k6-zzx3
~~~~

Let's access the data via the Socrata Open Data API (SODA), which provides programmatic access to this dataset, including the ability to filter, query, and aggregate data. The file on the server is a JSON file, and we will use urllib to access it.

In [40]:
with urllib.request.urlopen("https://data.cms.gov/resource/ehrv-m9r6.json") as url:
    datax = json.loads(url.read().decode())

Unnamed: 0,average_covered_charges,average_medicare_payments,average_medicare_payments_2,drg_definition,hospital_referral_region_description,provider_city,provider_id,provider_name,provider_state,provider_street_address,provider_zip_code,total_discharges
0,32963.07,5777.24,4763.73,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,AL - Dothan,DOTHAN,10001,SOUTHEAST ALABAMA MEDICAL CENTER,AL,1108 ROSS CLARK CIRCLE,36301,91
1,15131.85,5787.57,4976.71,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,AL - Birmingham,BOAZ,10005,MARSHALL MEDICAL CENTER SOUTH,AL,2505 U S HIGHWAY 431 NORTH,35957,14
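
The filtering that SODA offers is driven by query parameters appended to the resource URL. The sketch below only builds such a URL and does not call the server. `$where` and `$limit` are standard SoQL parameters, but the filter values are illustrative assumptions, and whether this particular endpoint still accepts them has not been checked here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint taken from the cell above.
BASE_URL = "https://data.cms.gov/resource/ehrv-m9r6.json"

# SoQL parameters; the state filter and row limit are illustrative assumptions.
params = {
    "$where": "provider_state = 'TX'",
    "$limit": 5,
}

# urlencode percent-encodes the parameter names and values for us.
filtered_url = f"{BASE_URL}?{urlencode(params)}"
print(filtered_url)
```

The resulting string can be passed to `urllib.request.urlopen` in the same way as the unfiltered URL.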


## 1. Data Structures

**What are data frames?**

If you've used SAS, then this is simply a data set (yes, even in IML). In SQL, we call it a table (not the one we eat on), and in R, it's called a data frame as well! It is a way for Python to store the data efficiently in an *i x j* grid (*i* rows by *j* columns). If you've used arrays or IML in SAS, you'll notice some similarities in concepts.

Data frames come from the Pandas library and consist of data values, an index, and columns. This is not very different from table partitions and indexes.

**Pandas data structures can be:**
1.  Pandas DataFrame: 2-d table
2.  Pandas Series: 1-d array
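
To make the two structures concrete, here is a minimal sketch using a hand-built frame rather than the CMS download. The three columns and two rows are copied from the data preview; the variable names `sample` and `cities` are just for illustration.

```python
import pandas as pd

# Two rows copied from the data preview, kept as plain Python types.
sample = pd.DataFrame(
    {
        "provider_city": ["DOTHAN", "BOAZ"],
        "provider_state": ["AL", "AL"],
        "total_discharges": [91, 14],
    }
)

print(sample.shape)          # (rows, columns)
print(list(sample.columns))  # column labels
print(list(sample.index))    # row labels

# Selecting a single column returns a Series, not a DataFrame.
cities = sample["provider_city"]
print(type(cities).__name__)
```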





In [41]:
# Convert to a data frame using pd.DataFrame
df = pd.DataFrame(data=datax)
df.head(2)

Unnamed: 0,average_covered_charges,average_medicare_payments,average_medicare_payments_2,drg_definition,hospital_referral_region_description,provider_city,provider_id,provider_name,provider_state,provider_street_address,provider_zip_code,total_discharges
0,32963.07,5777.24,4763.73,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,AL - Dothan,DOTHAN,10001,SOUTHEAST ALABAMA MEDICAL CENTER,AL,1108 ROSS CLARK CIRCLE,36301,91
1,15131.85,5787.57,4976.71,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,AL - Birmingham,BOAZ,10005,MARSHALL MEDICAL CENTER SOUTH,AL,2505 U S HIGHWAY 431 NORTH,35957,14


## 2. Sort this out!

**Why bother sorting?**

Sorting data helps in many ways, including but not limited to the following:

* making lookup or search efficient
* making merging of sequences efficient
* enabling processing of data in a defined order
* understanding data better
* reducing computing times

Examples provided below. 
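
As a quick illustration of the first point, sorted data can be searched with a binary search instead of a full scan. This is a minimal sketch on a handful of charge values taken from this notebook's output tables, not on the full dataset.

```python
import numpy as np

# A few charge values from the output tables, sorted ascending.
charges = np.sort(np.array([32963.07, 15131.85, 9973.26, 9929.25]))

# Binary search: position of the first value >= 15000, found in
# O(log n) steps because the array is already sorted.
pos = np.searchsorted(charges, 15000.0)
print(charges[pos:])  # every charge of at least 15000
```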

*I am sure many are wondering what happened to 'distinct' (SQL) or noduprecs, nodupkeys (SAS). We will talk about that in data processing.*

**Example 1:**

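Sort the dataframe's rows by average_covered_charges in descending order by passing `ascending=0`, which pandas treats the same as `ascending=False`. The output suggests the values arrive from the JSON feed as text, so convert the column with `pd.to_numeric` first if a numeric ordering is intended.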

~~~~
/* same cms data*/
proc sort data=df out=sorted;
    by descending average_covered_charges;
run;
~~~~
 

In [44]:
df.sort_values(by='average_covered_charges', ascending=0).head(2)

Unnamed: 0,average_covered_charges,average_medicare_payments,average_medicare_payments_2,drg_definition,hospital_referral_region_description,provider_city,provider_id,provider_name,provider_state,provider_street_address,provider_zip_code,total_discharges
760,9973.26,5954.43,4725.7,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,OH - Canton,CANTON,360070,MERCY MEDICAL CENTER,OH,1320 MERCY DRIVE NW,44708,30
485,9929.25,6582.75,5747.75,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,MA - Boston,NORWOOD,220126,NORWOOD HOSPITAL,MA,800 WASHINGTON STREET,2062,16
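
The output above lists 9973.26 first even though the data contains larger charges such as 32963.07, which is what a character-by-character sort of strings would produce. The sketch below contrasts text and numeric ordering on a hypothetical three-row frame (`demo` is not part of the original notebook), assuming the charges are stored as strings.

```python
import pandas as pd

# Hypothetical three-row frame; the values are charges seen in the outputs,
# stored as strings to mimic what a JSON payload may deliver.
demo = pd.DataFrame({"average_covered_charges": ["32963.07", "15131.85", "9973.26"]})

# String sort compares character by character, so "9..." beats "3...".
as_text = demo.sort_values("average_covered_charges", ascending=False)

# Convert first, then sort by numeric value.
demo["average_covered_charges"] = pd.to_numeric(demo["average_covered_charges"])
as_number = demo.sort_values("average_covered_charges", ascending=False)

print(as_text["average_covered_charges"].tolist())
print(as_number["average_covered_charges"].tolist())
```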


**Example 2:**

In this example, provider_city is sorted in ascending order while average_covered_charges is sorted in descending order, hence the `1, 0` flags (equivalent to `True, False`), one per sort column.

~~~
/* same cms data*/
proc sql;
create table sorted as 
select * 
from df
order by provider_city, average_covered_charges desc;
quit;
~~~

In [47]:
df.sort_values(['provider_city', 'average_covered_charges'], ascending=[1, 0]).head(2)    

Unnamed: 0,average_covered_charges,average_medicare_payments,average_medicare_payments_2,drg_definition,hospital_referral_region_description,provider_city,provider_id,provider_name,provider_state,provider_street_address,provider_zip_code,total_discharges
961,35184.16,5890.3,4695.54,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,TX - Abilene,ABILENE,450558,ABILENE REGIONAL MEDICAL CENTER,TX,6250 HWY 83/84,79606,42
946,22869.07,6921.92,4198.69,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,TX - Abilene,ABILENE,450229,HENDRICK MEDICAL CENTER,TX,1900 PINE,79601,56
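
The per-column flags are easier to see on a tiny frame. This is a minimal sketch with hypothetical rows (`demo` is an illustrative name); the charge values are numeric here and are borrowed from the outputs above.

```python
import pandas as pd

# Hypothetical rows; the charge values are borrowed from the outputs above.
demo = pd.DataFrame(
    {
        "provider_city": ["BOAZ", "ABILENE", "ABILENE"],
        "average_covered_charges": [15131.85, 22869.07, 35184.16],
    }
)

ordered = demo.sort_values(
    ["provider_city", "average_covered_charges"],
    ascending=[True, False],  # city A to Z, then charges high to low
)
print(ordered.to_string(index=False))
```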


## 3. Formats, Mappings, and Dictionaries

Let's explore some constructs from SAS and SQL that we can bring over. I will explain and illustrate each one in the markdown above the code.

You are now going to learn about the Dictionary data structure in Python. A Dictionary (or "dict") is a way to store data just like a list, but instead of using only numbers to get the data, you can use almost anything. This lets you treat a dict like it's a database for storing and organizing data.
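
A minimal sketch of a dict in action. The state nicknames are the ones used in the format example in this section, and the tuple-keyed `discharges` dict is a hypothetical illustration built from two preview rows.

```python
# State nicknames reused from the format example in this section.
state_names = {"TX": "Lone Star", "IN": "Hoosiers"}

# Keys do not have to be integers: strings and tuples both work.
discharges = {("DOTHAN", "AL"): 91, ("BOAZ", "AL"): 14}

print(state_names["TX"])                 # lookup by string key
print(discharges[("BOAZ", "AL")])        # lookup by tuple key
print(state_names.get("ZZ", "unknown"))  # .get avoids a KeyError for missing keys
```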

### Formats

One of the first things you learn in SAS is formats, which attach display labels to raw values.

~~~~
proc format;
value $state_name
    'TX'="Lone Star"
    'CA'="Sunshine Tax"
    'IN'="Hoosiers";
run; 
~~~~

In [72]:
# Work on a copy so the original df is left untouched
df1 = df.copy()

def state_name(code):
    """Return the label for a state code, or None if there is no label."""
    if code == 'TX':
        return "Lone Star"
    if code == 'CA':
        return "Sunshine Tax"
    if code == 'IN':
        return "Hoosiers"
    return None

# 'new_var' is the column that receives the format values
df1['new_var'] = df1['provider_state'].apply(state_name)

# Proof that it worked for the state of Indiana:
df1[df1['provider_state'] == 'IN'].head(2)


Unnamed: 0,average_covered_charges,average_medicare_payments,average_medicare_payments_2,drg_definition,hospital_referral_region_description,provider_city,provider_id,provider_name,provider_state,provider_street_address,provider_zip_code,total_discharges,new_var
332,34741.5,6980.2,6088.04,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,IN - Gary,GARY,150002,METHODIST HOSPITALS INC,IN,600 GRANT ST,46402,24,Hoosiers
333,23228.89,7409.42,6529.14,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,KY - Louisville,JEFFERSONVILLE,150009,CLARK MEMORIAL HOSPITAL,IN,1220 MISSOURI AVE,47130,28,Hoosiers
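
The same labelling can also be expressed with a dict and `Series.map`, which avoids the chain of `if` statements. This is a sketch of an alternative, not the approach used above. The `states` Series is a hypothetical stand-in for `df['provider_state']`, and the `"Other"` fallback label is an assumption.

```python
import pandas as pd

# Same labels as the function above, held in a dict.
state_names = {"TX": "Lone Star", "CA": "Sunshine Tax", "IN": "Hoosiers"}

# Hypothetical stand-in for df['provider_state'].
states = pd.Series(["TX", "IN", "AL"])

# .map looks each value up in the dict; unmatched values become NaN,
# so fillna supplies a label for states without an entry.
labels = states.map(state_names).fillna("Other")
print(labels.tolist())
```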


### Mappings

Mappings are similar to formats: they are indexed and contain primary keys that you can join on to obtain the corresponding values.
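
One way to picture this is a small lookup table joined to the data on its key. The sketch below uses two hypothetical frames (`state_lookup` and `providers`, with an assumed `state_label` column) rather than the CMS download.

```python
import pandas as pd

# Hypothetical lookup table keyed on provider_state (the "primary key").
state_lookup = pd.DataFrame(
    {"provider_state": ["TX", "IN"], "state_label": ["Lone Star", "Hoosiers"]}
)

# Hypothetical stand-in for the provider data.
providers = pd.DataFrame(
    {
        "provider_city": ["ABILENE", "GARY", "DOTHAN"],
        "provider_state": ["TX", "IN", "AL"],
    }
)

# Left join keeps every provider; states missing from the lookup get NaN.
merged = providers.merge(state_lookup, on="provider_state", how="left")
print(merged)
```

A left join is used here so that providers without a matching key are kept rather than silently dropped.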
