# Filter, Sort, Merge and Create Data
    1. Data Structures
    2. Sort this out!
    3. Formats, Mappings, and Dictionaries
    4. Filters
    5. Merge and Joins
    6. Create & Delete
    7. Duplicates

In [258]:
#Import libraries
from pandasql import sqldf #!pip install -U pandasql
pysqldf = lambda q: sqldf(q, globals())
from collections import Counter
import pandas as pd #Pandas Lib
import numpy as np  #NumPy Lib
import matplotlib.pyplot as plt
import urllib.request, json 
from IPython.display import display
from IPython.display import Image

## Data set for all sections:
**Inpatient Prospective Payment System (IPPS):** *Provider Summary for the Top 100 Diagnosis-Related Groups (DRG) - FY2011Medicare - Inpatient*

~~~~
Link: https://data.cms.gov/Medicare-Inpatient/Inpatient-Prospective-Payment-System-IPPS-Provider/97k6-zzx3
~~~~

Let's access the data via the Socrata Open Data API (SODA), which provides programmatic access to this dataset, including the ability to filter, query, and aggregate data. The file on the server is a JSON file, and we will use urllib to access it.

In [286]:
with urllib.request.urlopen("https://data.cms.gov/resource/ehrv-m9r6.json") as url:
    datax = json.loads(url.read().decode())

## 1. Data Structures

**What are data frames?**

If you've used SAS then this is simply a data set (yes, even in IML). In SQL, we call it a table (not the one we eat on), and in R, it's called a data frame as well! It is a way for Python to store the data efficiently in an *i x j* grid. If you've used arrays or IML in SAS, you'll notice some similarities in concepts. 

Data frames come from the Pandas library. A data frame consists of data values, an index (row labels) and columns, which is not very different from table partitions and indexes.

**Pandas data structures:**
1.  Pandas DataFrame: 2-d table
2.  Pandas Series: 1-d array
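
To make the DataFrame vs. Series distinction concrete, here is a minimal sketch. The records are made up for illustration (they only mimic the shape of the JSON rows), and the names `records`, `frame` and `states` are my own.

```python
import pandas as pd

# Toy records shaped like JSON rows (values are invented for illustration)
records = [
    {"provider_state": "AL", "total_discharges": "91"},
    {"provider_state": "AL", "total_discharges": "14"},
    {"provider_state": "TX", "total_discharges": "42"},
]

frame = pd.DataFrame(records)      # 2-d: rows x columns
states = frame["provider_state"]   # selecting one column gives a 1-d Series

print(type(frame).__name__, frame.shape)    # DataFrame (3, 2)
print(type(states).__name__, states.shape)  # Series (3,)
print(list(frame.columns))                  # column labels
print(list(frame.index))                    # default integer row index
```

A list of dictionaries is the same shape that `json.loads` returns for this kind of API response, which is why `pd.DataFrame(data=...)` can consume it directly.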





In [287]:
#convert to a data frame using pd.DataFrame
df = pd.DataFrame(data=datax)
df.head(2)

Unnamed: 0,average_covered_charges,average_medicare_payments,average_medicare_payments_2,drg_definition,hospital_referral_region_description,provider_city,provider_id,provider_name,provider_state,provider_street_address,provider_zip_code,total_discharges
0,32963.07,5777.24,4763.73,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,AL - Dothan,DOTHAN,10001,SOUTHEAST ALABAMA MEDICAL CENTER,AL,1108 ROSS CLARK CIRCLE,36301,91
1,15131.85,5787.57,4976.71,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,AL - Birmingham,BOAZ,10005,MARSHALL MEDICAL CENTER SOUTH,AL,2505 U S HIGHWAY 431 NORTH,35957,14


## 2. Sort this out!

**Why bother sorting?**

Sorting data helps in many ways, including but not limited to the following:

* making lookup or search efficient
* making merging of sequences efficient
* enable processing of data in a defined order
* understanding data better
* reducing computing times

Examples provided below. 

*I am sure many are wondering what happened to 'distinct' (SQL) or noduprecs, nodupkeys (SAS)? We will talk about that in data processing.*

**Example 1:**

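Sort the data frame's rows by average_covered_charges in descending order by setting ascending to False (0 also works). Note that the columns loaded from the JSON are strings, so the result is ordered as text, which is why 9973.26 appears first below; convert the column with pd.to_numeric first if you want a numeric sort.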

~~~~
/* same cms data*/
proc sort data=df out=sorted;
    by descending average_covered_charges; 
run;
~~~~
 

In [76]:
df.sort_values(by='average_covered_charges', ascending=False).head(2)

Unnamed: 0,average_covered_charges,average_medicare_payments,average_medicare_payments_2,drg_definition,hospital_referral_region_description,provider_city,provider_id,provider_name,provider_state,provider_street_address,provider_zip_code,total_discharges
760,9973.26,5954.43,4725.7,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,OH - Canton,CANTON,360070,MERCY MEDICAL CENTER,OH,1320 MERCY DRIVE NW,44708,30
485,9929.25,6582.75,5747.75,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,MA - Boston,NORWOOD,220126,NORWOOD HOSPITAL,MA,800 WASHINGTON STREET,2062,16
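
The output above puts 9973.26 first, which suggests the column is being compared as text rather than as numbers. Here is a small sketch of the difference, using made-up values and hypothetical names (`toy`, `charges`, `charges_num`), assuming the column arrives as strings:

```python
import pandas as pd

toy = pd.DataFrame({"charges": ["9973.26", "32963.07", "15131.85"]})

# As strings, "9973.26" sorts first in descending order because "9" > "3" > "1"
as_text = toy.sort_values(by="charges", ascending=False)

# Convert first, then sort, to get a true numeric ordering
toy["charges_num"] = pd.to_numeric(toy["charges"], errors="coerce")
as_number = toy.sort_values(by="charges_num", ascending=False)

print(as_text["charges"].tolist())
print(as_number["charges_num"].tolist())
```

`errors="coerce"` turns anything that cannot be parsed into NaN instead of raising, which is handy for messy public data but can hide bad values, so it is worth checking how many NaNs appear afterwards.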


**Example 2:**

In this example, provider_city is sorted in ascending order while average_covered_charges is sorted in descending order, hence the two flags passed to ascending, one per column (1/True for ascending, 0/False for descending).

~~~
/* same cms data*/
proc sql;
create table sorted as 
select * 
from df
order by provider_city, average_covered_charges desc;
quit;
~~~

In [77]:
df.sort_values(['provider_city', 'average_covered_charges'], ascending=[True, False]).head(2)

Unnamed: 0,average_covered_charges,average_medicare_payments,average_medicare_payments_2,drg_definition,hospital_referral_region_description,provider_city,provider_id,provider_name,provider_state,provider_street_address,provider_zip_code,total_discharges
961,35184.16,5890.3,4695.54,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,TX - Abilene,ABILENE,450558,ABILENE REGIONAL MEDICAL CENTER,TX,6250 HWY 83/84,79606,42
946,22869.07,6921.92,4198.69,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,TX - Abilene,ABILENE,450229,HENDRICK MEDICAL CENTER,TX,1900 PINE,79601,56


## 3. Formats, Mappings, and Dictionaries

Let's explore some constructs from SAS and SQL that we can bring over. I will explain and illustrate each one in the 'markdown' (above the code).

You are now going to learn about the Dictionary data structure in Python. A Dictionary (or "dict") is a way to store data just like a list, but instead of using only numbers to get the data, you can use almost anything. This lets you treat a dict like it's a database for storing and organizing data.
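
A quick sketch of the dictionary idea in plain Python, before we bring Pandas into it. The keys and values are just illustrative, and the names `state_nicknames` and `by_state_year` are my own.

```python
# A dict maps keys to values; keys do not have to be numbers
state_nicknames = {"TX": "Lone Star", "IN": "Hoosiers"}

# Add or overwrite an entry by assigning to a key
state_nicknames["CA"] = "Sunshine Tax"

# Look up by key; .get() returns a default instead of raising KeyError
print(state_nicknames["TX"])
print(state_nicknames.get("NY", "unknown"))

# Keys can be almost anything hashable, for example a (state, year) tuple
by_state_year = {("TX", 2011): 42}
print(by_state_year[("TX", 2011)])
```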

### Formats

One of the first things you learn in SAS is formats.

~~~~
proc format;
value $state_name
    'TX'="Lone Star"
    'CA'="Sunshine Tax";
run; 
~~~~

In [78]:
df1 = df.copy()  #work on a copy so the new column is not added to df itself

def state_name (i):
    if i =='TX':
        return "Lone Star"
    if i =='CA':
        return "Sunshine Tax"
    if i =='IN':
        return "Hoosiers"
    
#'new_var' is the new column that receives the format values
df1['new_var'] = df1['provider_state'].apply(state_name)

#proof that it worked for the State of Indiana:
df1[df1['provider_state'] == 'IN'].head(2)


Unnamed: 0,average_covered_charges,average_medicare_payments,average_medicare_payments_2,drg_definition,hospital_referral_region_description,provider_city,provider_id,provider_name,provider_state,provider_street_address,provider_zip_code,total_discharges,new_var
332,34741.5,6980.2,6088.04,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,IN - Gary,GARY,150002,METHODIST HOSPITALS INC,IN,600 GRANT ST,46402,24,Hoosiers
333,23228.89,7409.42,6529.14,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,KY - Louisville,JEFFERSONVILLE,150009,CLARK MEMORIAL HOSPITAL,IN,1220 MISSOURI AVE,47130,28,Hoosiers


### Mappings

Mappings are similar to formats: a dictionary is indexed by unique keys (much like primary keys) that you can look up, or join on, to obtain the corresponding values.

In [84]:
df1 = df.copy()  #work on a copy so the new column is not added to df itself

#Let's create a dictionary to illustrate 'mappings'
state_name =  { 'TX' : 'Lone Star', 
                'CA' : 'Sunshine Tax',
                'IN' : 'Hoosiers'}
df1['new_var'] = df1['provider_state'].map(state_name)

#proof it worked fine
df1[df1['provider_state'] == 'IN'].head(2)

Unnamed: 0,average_covered_charges,average_medicare_payments,average_medicare_payments_2,drg_definition,hospital_referral_region_description,provider_city,provider_id,provider_name,provider_state,provider_street_address,provider_zip_code,total_discharges,new_var
332,34741.5,6980.2,6088.04,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,IN - Gary,GARY,150002,METHODIST HOSPITALS INC,IN,600 GRANT ST,46402,24,Hoosiers
333,23228.89,7409.42,6529.14,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,KY - Louisville,JEFFERSONVILLE,150009,CLARK MEMORIAL HOSPITAL,IN,1220 MISSOURI AVE,47130,28,Hoosiers
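
One thing worth knowing about `.map()` with a dictionary: any value that is not a key in the dictionary comes back as NaN. A minimal sketch with toy data (the `"Other"` label and the variable names are my own choices):

```python
import pandas as pd

states = pd.Series(["TX", "CA", "IN", "NY"])
nicknames = {"TX": "Lone Star", "CA": "Sunshine Tax", "IN": "Hoosiers"}

# "NY" is not in the dictionary, so it maps to NaN
mapped = states.map(nicknames)

# Fill the gaps with a catch-all label, much like an 'other' bucket in a format
labelled = mapped.fillna("Other")

print(mapped.isna().tolist())
print(labelled.tolist())
```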


## 4. Filters

Filtering is the bedrock of data wrangling! If you recall, in **SQL** we can use *where* and *join* statements. In **SAS** you can use the same in *proc sql*, and in the *data step* we can use *if* and *where*.


**Unique values in Pandas data frame**

In [105]:
#Method 1: Convert the pandas column into a set (a set keeps only unique values)
g = list(set(df.provider_state))

In [106]:
#Method 2: Create a list of unique values in df.provider_state
h = list(df['provider_state'].unique())
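
The two methods are not quite interchangeable. Here is a small sketch on toy data (names are illustrative) showing that `.unique()` keeps the order of first appearance, while a set has no guaranteed order and is usually sorted before use. `value_counts()` is included as a rough analogue of a frequency table.

```python
import pandas as pd

states = pd.Series(["TX", "AL", "TX", "IN", "AL", "TX"])

in_order = states.unique().tolist()   # order of first appearance
sorted_states = sorted(set(states))   # sets are unordered, so sort explicitly
counts = states.value_counts()        # how often each value occurs

print(in_order)
print(sorted_states)
print(states.nunique())
print(counts["TX"])
```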

**Select records that DO NOT have a specific DRG definition**

In [297]:
j= df[~(df['drg_definition'] == '039 - EXTRACRANIAL PROCEDURES W/O CC/MCC')]
j.head(1)

Unnamed: 0,average_covered_charges,average_medicare_payments,average_medicare_payments_2,drg_definition,hospital_referral_region_description,provider_city,provider_id,provider_name,provider_state,provider_street_address,provider_zip_code,total_discharges
253,23979.53,5389.91,4314.06,069 - TRANSIENT ISCHEMIA,MA - Boston,SALEM,220035,NORTH SHORE MEDICAL CENTER,MA,81 HIGHLAND AVENUE,1970,80


**Select records that HAVE a specific DRG definition**

In [298]:
j= df[(df['drg_definition'] == '039 - EXTRACRANIAL PROCEDURES W/O CC/MCC')]
j.head(1)

Unnamed: 0,average_covered_charges,average_medicare_payments,average_medicare_payments_2,drg_definition,hospital_referral_region_description,provider_city,provider_id,provider_name,provider_state,provider_street_address,provider_zip_code,total_discharges
0,32963.07,5777.24,4763.73,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,AL - Dothan,DOTHAN,10001,SOUTHEAST ALABAMA MEDICAL CENTER,AL,1108 ROSS CLARK CIRCLE,36301,91


**Greater than 25 discharges**

In [288]:
#total_discharges arrives as text, so convert it to numeric before comparing
cols_to_convert = ['total_discharges']
for col in cols_to_convert:
    df[col] = pd.to_numeric(df[col], errors='coerce')
    
gt_25dis = df[df['total_discharges'] > 25]
gt_25dis.head(1)

Unnamed: 0,average_covered_charges,average_medicare_payments,average_medicare_payments_2,drg_definition,hospital_referral_region_description,provider_city,provider_id,provider_name,provider_state,provider_street_address,provider_zip_code,total_discharges
0,32963.07,5777.24,4763.73,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,AL - Dothan,DOTHAN,10001,SOUTHEAST ALABAMA MEDICAL CENTER,AL,1108 ROSS CLARK CIRCLE,36301,91


**Two or more filters**
*In this example: all discharges greater than 25 and only Indiana, without using the provider_state column*
~~~
/* same cms data*/
proc sql;
create table filters as 
select * 
from df
where total_discharges  > 25 and substr (hospital_referral_region_description, 1,2) = 'IN';
quit;
~~~

In [290]:
#how about 'AND'
df1 = df[(df['total_discharges']  > 25) & (df['hospital_referral_region_description'].str[0:2] == 'IN')]
df1.head(1)

Unnamed: 0,average_covered_charges,average_medicare_payments,average_medicare_payments_2,drg_definition,hospital_referral_region_description,provider_city,provider_id,provider_name,provider_state,provider_street_address,provider_zip_code,total_discharges
334,25856.23,6675.53,5289.6,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,IN - South Bend,MISHAWAKA,150012,SAINT JOSEPH REGIONAL MEDICAL CENTER,IN,5215 HOLY CROSS PKWY,46545,30


In [293]:
#how about 'OR'
df2 = df[(df['total_discharges']  > 1000) | (df['hospital_referral_region_description'].str[0:2] == 'IN')]
df2.head(2)

Unnamed: 0,average_covered_charges,average_medicare_payments,average_medicare_payments_2,drg_definition,hospital_referral_region_description,provider_city,provider_id,provider_name,provider_state,provider_street_address,provider_zip_code,total_discharges
332,34741.5,6980.2,6088.04,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,IN - Gary,GARY,150002,METHODIST HOSPITALS INC,IN,600 GRANT ST,46402,24
334,25856.23,6675.53,5289.6,039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,IN - South Bend,MISHAWAKA,150012,SAINT JOSEPH REGIONAL MEDICAL CENTER,IN,5215 HOLY CROSS PKWY,46545,30
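
For readers who want to see the AND / OR logic in isolation, here is a sketch on a four-row toy frame. The values are invented and the column name `region` is a stand-in for hospital_referral_region_description. It also shows `.str.startswith()` and `.query()` as alternative spellings; which one reads best is a matter of taste.

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["IN - Gary", "IN - South Bend", "TX - Abilene", "AL - Dothan"],
    "total_discharges": [24, 30, 42, 91],
})

# Build the boolean mask once and reuse it
is_indiana = toy["region"].str.startswith("IN")

# '&' is AND, '|' is OR; each condition needs its own parentheses
both = toy[(toy["total_discharges"] > 25) & is_indiana]
either = toy[(toy["total_discharges"] > 50) | is_indiana]

# .query() accepts a SQL-like expression on column names
same_as_both = toy[is_indiana].query("total_discharges > 25")

print(both.index.tolist())
print(either.index.tolist())
```

The parentheses matter because `&` and `|` bind more tightly than `>` and `==` in Python, so leaving them out typically raises an error or gives the wrong mask.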


**Date ranges**

In [132]:
#for this example we will need a new data set which has some dates
url = "https://data.medicare.gov/api/views/9n3s-kdb3/rows.csv?accessType=DOWNLOAD" 
life = pd.read_csv(url)
life.head(2)
df =life

##Step 1: Convert to dates:
cols_to_convert = ['Start Date']
for col in cols_to_convert:
    df[col] = pd.to_datetime(df[col], errors='coerce')   
## Did it work?
df.dtypes

#Step 2: use the converted variable to filter on
datex = df['Start Date'] > '2012-06-01'
datex.head(2)
df.loc[datex].head(2)

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date
0,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-AMI-HRRP,781,,0.9837,15.358,15.6121,119,2012-07-01,30-JUN-15
1,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-CABG-HRRP,273,,1.0618,13.8887,13.0809,40,2012-07-01,30-JUN-15
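
The filter above is open-ended (everything after a date). For a closed range, one option is `Series.between`. This is a sketch on toy data, with ISO-formatted strings chosen for simplicity rather than the format in the real file, and with illustrative names (`toy`, `start`, `end`, `in_2012`):

```python
import pandas as pd

toy = pd.DataFrame({"Start Date": ["2012-07-01", "2011-01-01", "not a date"]})

# Unparseable values become NaT instead of raising an error
toy["Start Date"] = pd.to_datetime(toy["Start Date"], errors="coerce")

start = pd.Timestamp("2012-01-01")
end = pd.Timestamp("2012-12-31")

# between() is inclusive on both ends by default; NaT never matches
in_2012 = toy[toy["Start Date"].between(start, end)]

print(in_2012.index.tolist())
print(toy["Start Date"].isna().sum())
```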


### Filter function(s)
using lists, functions and lambda

In [333]:
url = "https://data.medicare.gov/api/views/9n3s-kdb3/rows.csv?accessType=DOWNLOAD" 
life = pd.read_csv(url)
df =life
cols_to_convert = ['Number of Discharges', 'Expected Readmission Rate']
for col in cols_to_convert:
    df[col] = pd.to_numeric(df[col], errors='coerce')
    
df.head(2)

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date
0,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-AMI-HRRP,781.0,,0.9837,15.358,15.6121,119,01-JUL-12,30-JUN-15
1,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-CABG-HRRP,273.0,,1.0618,13.8887,13.0809,40,01-JUL-12,30-JUN-15


**Using Lists**

In [251]:
#Pandas DataFrame columns are Pandas Series when you pull them out, which
#you can then call .tolist() on to turn them into a Python list
my_list = df['Number of Discharges'].tolist()
f = lambda x: x<250
dat1 = filter(f, my_list)
dat2 = list(dat1)
# you cannot use head on a list, thus this won't work: dat2.head()

**Using a function ('def')**

In [255]:
#If you wanted to apply this to a pandas data frame!
df1 = df.copy()  #work on a copy so df itself is left untouched
def is_lt_250(x):
    return x <250

df1['is_lt_250'] = df1['Number of Discharges'].apply(is_lt_250)
df1.head(1)

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date,is_lt_250
0,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-AMI-HRRP,781.0,,0.9837,15.358,15.6121,119,01-JUL-12,30-JUN-15,False


**Using Lambda (not this λ or a sheep)**

In [257]:
#If you want to use a Lambda function (admittedly, I am still getting better at this!)
df1 = df.copy()  #work on a copy so df itself is left untouched
df1['is_lt_250'] =df1['Number of Discharges'].apply(lambda x: x <250)
df1.head(1)

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date,is_lt_250
0,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-AMI-HRRP,781.0,,0.9837,15.358,15.6121,119,01-JUL-12,30-JUN-15,False
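
For a simple comparison like this one, neither `def` nor `lambda` is strictly needed: comparing the column directly gives the same flags in one vectorized step. A sketch on toy numbers (including a missing value, which compares as False):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Number of Discharges": [781.0, 273.0, 120.0, np.nan]})

# Element by element, via apply + lambda
via_apply = toy["Number of Discharges"].apply(lambda x: x < 250)

# Whole column at once; usually faster on large data
vectorized = toy["Number of Discharges"] < 250

print(via_apply.tolist())
print(vectorized.tolist())
```

`apply` still earns its keep when the logic per value is too involved to express as a column operation.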


## 5. Merge and Joins

Intuitively speaking, I feel SQL has nailed this and it's easier to read. If you are working with RDBs or any big data environment, you'll probably use some flavor of SQL! You can run pass-through queries which can aggregate the data. At this point you may ask: why is he putting us through this painful ordeal of learning a new way to merge data?

1. It's not that complicated
2. Uses fewer lines of code
3. Just as intuitive
4. It's critical for data wrangling in Python :)

***

**SAS** uses the merge statement (both data sets must already be sorted by the key)
~~~
data final;
  merge a b;
  by key;
run;
~~~
**SQL** joins should be familiar, because concepts such as left, right and inner joins will be referenced.



### Let's use 2 data sources for this:

> *Heart Disease Mortality Data Among US Adults (35+) by State/Territory and County:*
  https://chronicdata.cdc.gov/api/views/r35g-znws/rows.csv?accessType=DOWNLOAD


> *Hospital Readmissions Reduction Program:*
"https://data.medicare.gov/api/views/9n3s-kdb3/rows.csv?accessType=DOWNLOAD" 


In [308]:
url1 = "https://chronicdata.cdc.gov/api/views/r35g-znws/rows.csv?accessType=DOWNLOAD" 
hdmd = pd.read_csv(url1)
df1 = hdmd
df1.head(1)

Unnamed: 0,Year,LocationAbbr,LocationDesc,GeographicLevel,DataSource,Class,Topic,Data_Value,Data_Value_Unit,Data_Value_Type,Data_Value_Footnote_Symbol,Data_Value_Footnote,StratificationCategory1,Stratification1,StratificationCategory2,Stratification2,TopicID,LocationID,Location 1
0,2013,AK,Aleutians East,County,NVSS,Cardiovascular Diseases,Heart Disease Mortality,147.4,"per 100,000 population","Age-adjusted, Spatially Smoothed, 3-year Avera...",,,Gender,Overall,Race/Ethnicity,Overall,T2,2013,"(55.440626, -161.962562)"


In [309]:
url2 = "https://data.medicare.gov/api/views/9n3s-kdb3/rows.csv?accessType=DOWNLOAD" 
hrrp = pd.read_csv(url2)
df2 = hrrp
df2.head(1)

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date
0,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-AMI-HRRP,781,,0.9837,15.358,15.6121,119,01-JUL-12,30-JUN-15


*Subset data and rename columns*

In [319]:
#rename LocationAbbr to State so both data frames share the same key name
x = df1[['LocationAbbr', 'Class']].rename(columns={'LocationAbbr': 'State'})
y = df2[['State', 'Hospital Name']]

**Stack Vertically**

In [326]:
stack_vert = pd.concat([x, y])
stack_vert.head()

Unnamed: 0,Class,Hospital Name,State
0,Cardiovascular Diseases,,AK
1,Cardiovascular Diseases,,AK
2,Cardiovascular Diseases,,AK
3,Cardiovascular Diseases,,AK
4,Cardiovascular Diseases,,AK


**Stack Horizontally**

In [325]:
stack_horiz = pd.concat([x, y], axis=1)
stack_horiz.head()

Unnamed: 0,State,Class,State.1,Hospital Name
0,AK,Cardiovascular Diseases,AL,SOUTHEAST ALABAMA MEDICAL CENTER
1,AK,Cardiovascular Diseases,AL,SOUTHEAST ALABAMA MEDICAL CENTER
2,AK,Cardiovascular Diseases,AL,SOUTHEAST ALABAMA MEDICAL CENTER
3,AK,Cardiovascular Diseases,AL,SOUTHEAST ALABAMA MEDICAL CENTER
4,AK,Cardiovascular Diseases,AL,SOUTHEAST ALABAMA MEDICAL CENTER
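
To see what `pd.concat` does with mismatched columns and row counts, here is a small sketch with made-up frames `a` and `b`. Passing `ignore_index=True` is my addition; it renumbers the rows instead of keeping each frame's original index.

```python
import pandas as pd

a = pd.DataFrame({"State": ["AK", "AL"], "Class": ["c1", "c2"]})
b = pd.DataFrame({"State": ["TX"], "Hospital Name": ["H1"]})

# Vertical: rows are appended; columns missing from one frame are filled with NaN
vertical = pd.concat([a, b], ignore_index=True, sort=False)

# Horizontal: frames are placed side by side, aligned on the row index
horizontal = pd.concat([a, b], axis=1)

print(vertical.shape)    # (3, 3)
print(horizontal.shape)  # (2, 4)
print(vertical["Hospital Name"].isna().sum())
```

Note that the horizontal result carries two columns named State, and nothing is matched on the values in them. The alignment is purely by row position/index, which is why a keyed merge is usually what you want instead.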


**Join on primary key**

In [323]:
#Join the first 1000 records of each data set:
test1 = x.iloc[:1000]
test2 = df2[['State', 'Hospital Name']].iloc[:1000]
z = pd.merge(test1, test2, on='State') ## to join on multiple keys: on = ['State', 'Year']
z.head()

# For the entire data set:
## z = pd.merge(x, y, on='State')
## z.head()

Unnamed: 0,State,Class,Hospital Name
0,AK,Cardiovascular Diseases,PROVIDENCE ALASKA MEDICAL CENTER
1,AK,Cardiovascular Diseases,PROVIDENCE ALASKA MEDICAL CENTER
2,AK,Cardiovascular Diseases,PROVIDENCE ALASKA MEDICAL CENTER
3,AK,Cardiovascular Diseases,PROVIDENCE ALASKA MEDICAL CENTER
4,AK,Cardiovascular Diseases,PROVIDENCE ALASKA MEDICAL CENTER


**LEFT and RIGHT joins**

In [328]:
left_join = pd.merge(test1, test2, on='State', how='left')
right_join = pd.merge(test1, test2, on='State', how='right')

In [332]:
print(right_join.shape)
print(left_join.shape)

(271654, 3)
(271200, 3)
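
The row counts above are far larger than the 1000-row inputs, which is what you would expect when the key repeats on both sides: every matching left row pairs with every matching right row. A toy sketch of that effect (invented data and names), with `indicator=True` added to show where each row came from:

```python
import pandas as pd

left = pd.DataFrame({"State": ["AK", "AK", "AL"], "Class": ["Cardio"] * 3})
right = pd.DataFrame({"State": ["AK", "AK", "TX"],
                      "Hospital Name": ["H1", "H2", "H3"]})

# Inner: 2 AK rows x 2 AK rows = 4 rows; AL and TX have no partner
inner = pd.merge(left, right, on="State")

# Left: the 4 matches plus the unmatched AL row
left_join = pd.merge(left, right, on="State", how="left", indicator=True)

# Right: the 4 matches plus the unmatched TX row
right_join = pd.merge(left, right, on="State", how="right")

print(inner.shape, left_join.shape, right_join.shape)
print((left_join["_merge"] == "left_only").sum())
```

If a join is meant to be one-to-one or one-to-many, `pd.merge` also has a `validate` argument that raises an error when the keys are not unique where you expected them to be.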


## 6. Create & Delete

**Create binary variable/indicator when discharge is greater than or equal to 1000**

In [344]:
df['too many'] = np.where(df['Number of Discharges']>=1000, 1, 0)
df.head(2) #binary/boolean variable

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date,too many
0,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-AMI-HRRP,781.0,,0.9837,15.358,15.6121,119,01-JUL-12,30-JUN-15,0
1,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-CABG-HRRP,273.0,,1.0618,13.8887,13.0809,40,01-JUL-12,30-JUN-15,0


**Drop columns that start with 'too'**

In [346]:
cols = [c for c in df.columns if not c.lower().startswith('too')]
df1=df[cols]
df1.head(2)

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date
0,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-AMI-HRRP,781.0,,0.9837,15.358,15.6121,119,01-JUL-12,30-JUN-15
1,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-CABG-HRRP,273.0,,1.0618,13.8887,13.0809,40,01-JUL-12,30-JUN-15


**Drop a record based on an index that the data set already has**

In [353]:
df.index.values #what indexes already exist
df1 = df.drop([0]) 
df1.head(2)

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date,too many
1,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-CABG-HRRP,273.0,,1.0618,13.8887,13.0809,40,01-JUL-12,30-JUN-15,0
2,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-COPD-HRRP,709.0,,1.0455,19.7525,18.8932,143,01-JUL-12,30-JUN-15,0


**Drop a column based on a column name**

In [356]:
df1 = df.drop('End Date', axis=1)
df1.head(2)

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,too many
0,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-AMI-HRRP,781.0,,0.9837,15.358,15.6121,119,01-JUL-12,0
1,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-CABG-HRRP,273.0,,1.0618,13.8887,13.0809,40,01-JUL-12,0


**Drop a row based on a certain value in a variable**

In [360]:
df1 = df[df['Hospital Name'] != 'SOUTHEAST ALABAMA MEDICAL CENTER']
df1.head(2)

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date,too many
6,MARSHALL MEDICAL CENTER SOUTH,10005,AL,READM-30-AMI-HRRP,,,0.9905,16.3958,16.5529,Too Few to Report,01-JUL-12,30-JUN-15,0
7,MARSHALL MEDICAL CENTER SOUTH,10005,AL,READM-30-CABG-HRRP,,5.0,Not Available,Not Available,,Not Available,01-JUL-12,30-JUN-15,0


In [366]:
#Drop the records at positions 1 and 10 (use df.index[1:11] to drop the whole range 1 - 10)
df1 = df.drop(df.index[[1, 10]])
df1.head(2)

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date,too many
0,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-AMI-HRRP,781.0,,0.9837,15.358,15.6121,119,01-JUL-12,30-JUN-15,0
2,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-COPD-HRRP,709.0,,1.0455,19.7525,18.8932,143,01-JUL-12,30-JUN-15,0
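
Passing a list of positions drops exactly those rows, while a slice drops a whole range. A short sketch on a 15-row toy frame (names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"value": range(15)})

# A list of positions: only rows 1 and 10 are removed
two_rows = toy.drop(toy.index[[1, 10]])

# A slice of positions: rows 1 through 10 are all removed
whole_range = toy.drop(toy.index[1:11])

print(len(two_rows))
print(whole_range.index.tolist())
```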


## 7. Duplicates

In [370]:
test1 = x.iloc[:1000]
test2 = df2[['State', 'Hospital Name']].iloc[:1000]
z = pd.merge(test1, test2, on='State')

**How does the data look now?**

In [371]:
z.head()

Unnamed: 0,State,Class,Hospital Name
0,AK,Cardiovascular Diseases,PROVIDENCE ALASKA MEDICAL CENTER
1,AK,Cardiovascular Diseases,PROVIDENCE ALASKA MEDICAL CENTER
2,AK,Cardiovascular Diseases,PROVIDENCE ALASKA MEDICAL CENTER
3,AK,Cardiovascular Diseases,PROVIDENCE ALASKA MEDICAL CENTER
4,AK,Cardiovascular Diseases,PROVIDENCE ALASKA MEDICAL CENTER


**Which records are duplicated?**

In [369]:
z.duplicated().head()

0    False
1     True
2     True
3     True
4     True
dtype: bool

**Drop duplicated records**

In [373]:
z1 = z.drop_duplicates()
z1.duplicated().head()

0     False
6     False
12    False
18    False
24    False
dtype: bool

**Drop duplicates based on a column**

In [374]:
z.drop_duplicates(['State'], keep='last')

Unnamed: 0,State,Class,Hospital Name
24191,AK,Cardiovascular Diseases,ALASKA NATIVE MEDICAL CENTER
271199,AL,Cardiovascular Diseases,ATMORE COMMUNITY HOSPITAL
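
The `keep` argument controls which copy survives, roughly in the spirit of nodupkey in SAS. A sketch on a four-row toy frame (invented data) comparing the options:

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["AK", "AK", "AK", "AL"],
    "Hospital Name": ["H1", "H1", "H2", "H3"],
})

# Default: the first occurrence is not flagged, later copies are
print(toy.duplicated().tolist())

# keep=False flags every member of a duplicated group
print(toy.duplicated(keep=False).tolist())

# One row per State: keep the first or the last occurrence
first_per_state = toy.drop_duplicates(subset="State", keep="first")
last_per_state = toy.drop_duplicates(subset="State", keep="last")

print(first_per_state.index.tolist())
print(last_per_state.index.tolist())
```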
