# PHO Database Project

### Background

* Monthly Master Medical Staff (MMS) list comes out;
* Craig manually adds any new providers to the CIN database;
* It's frustrating & error-prone; we want this to be easier

### What We Want

* Make a list of providers to add to the database
* Make a deduplicated list of healthcare entities and locations from the Master Medical Staff list
* Match those against the database


### Assumptions

* All entities come with a TIN; i.e. I haven't considered any rows without a TIN when working with entities.
* All locations come with a street address; i.e. I haven't considered any rows without a street address when working with locations. 


## 0. Basic Stuff

In [2]:
# import libraries
import numpy as np
import pandas as pd
import usaddress

# read files
entities_db = pd.read_csv('kn_entities.csv')
locations_db = pd.read_csv('kn_locations.csv')
providers_db = pd.read_csv('kn_providers.csv')
# For now, we need to specify the sheet names because the .xlsx file contains summary sheets 
sheets = ['CAD','CHX','GRY','KMHC','MAN','MMC','OMH','POMH'] # GO BACK TO MACK LATER; THEY DON'T HAVE NPI'S LOL
MMS = pd.read_excel('master_medical_staff_list_200601.xlsx',sheet_name=sheets)

For now, `MMS` is a dictionary whose keys are the sheet names and whose values are the corresponding dataframes.
The last several rows of each sheet form a summary table that is irrelevant to this project; they are dropped as we clean the dataframes.
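
A minimal sketch of working with a dictionary of sheets like `MMS`, using two made-up sheets instead of the real workbook. The name `sheets_demo` and its rows are illustrative assumptions; only the dict-of-dataframes shape mirrors what `pd.read_excel(..., sheet_name=[...])` returns.

```python
import pandas as pd

# Toy stand-in for the dict returned by pd.read_excel(..., sheet_name=[...]).
# Sheet contents are made up for illustration.
sheets_demo = {
    "CAD": pd.DataFrame({"Last Name": ["Adams", "Baker"], "NPI": [1111111111, 2222222222]}),
    "CHX": pd.DataFrame({"Last Name": ["Clark"], "NPI": [3333333333]}),
}

# Keys are sheet names, values are dataframes.
for name, frame in sheets_demo.items():
    print(name, frame.shape)

# Stack all sheets into one dataframe, keeping the sheet name as a column.
combined = pd.concat(
    [frame.assign(Affiliation=name) for name, frame in sheets_demo.items()],
    ignore_index=True,
)
```

Stacking the sheets once like this is an alternative to looping over `MMS` in every later step; the notebook keeps the per-sheet loop.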

## 1. Make a list of providers to add to the database

We'll make a list of distinct providers using __first name, last name, NPI__.

1. Rename three columns in `providers_db` (first name, last name, NPI) so that they match the ones in `MMS`; extract those three columns
2. For each sheet in `MMS`,
    1. Extract the key columns `Last Name`, `First Name`, `NPI` (plus the descriptive columns we want in the output)
    2. Find out who is in the sheet but not in `providers_db`
    3. Use the sheet name to populate a new column `Affiliation`
3. Export to a new .csv file `new_providers.csv`
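
The "in the sheet but not in the database" step is an anti-join. Below is a small sketch of it on made-up rows; `rows_missing_from_db`, `db_demo`, and `sheet_demo` are hypothetical names used only for this illustration.

```python
import pandas as pd

keys = ["First Name", "Last Name", "NPI"]

# Made-up rows for illustration only.
db_demo = pd.DataFrame(
    {"First Name": ["Ann", "Bob"], "Last Name": ["Lee", "Ray"], "NPI": [1000000001, 1000000002]}
)
sheet_demo = pd.DataFrame(
    {"First Name": ["Ann", "Cy"], "Last Name": ["Lee", "Poe"], "NPI": [1000000001, 1000000003]}
)

def rows_missing_from_db(sheet, db, on):
    """Return the rows of `sheet` whose key columns have no match in `db`."""
    merged = sheet.merge(db[on].drop_duplicates(), on=on, how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

# Tag the new rows with the sheet they came from.
new_demo = rows_missing_from_db(sheet_demo, db_demo, keys).assign(Affiliation="CAD")
```

The merge only matches rows whose key columns are equal in value and type, so an NPI stored as text on one side and as a number on the other would need to be converted first.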

In [3]:
# 1. rename three columns in providers & extract those three columns
colnames = {'Provider Name: First': 'First Name',
           'Provider Name: Last': 'Last Name',
           'Provider NPI': 'NPI'}
providers_db = providers_db.rename(columns=colnames)
providers_db = providers_db[['First Name','Last Name','NPI']]
# a lil preview
providers_db.head()

Unnamed: 0,First Name,Last Name,NPI
0,Cynthia,Aaron,1922057000.0
1,Sam,Abdul,1932188000.0
2,Bachu,Abraham,1225045000.0
3,Glen,Ackerman,1912061000.0
4,Craig,Adams,1982647000.0


__*(I renamed column G of CHX from 'Speciality' to 'Expertise' because that sheet's header differed from the others and I didn't want to write an if/else statement for it.)*__

In [4]:
# which information about the providers do we want to have at the end?
cols = ['First Name','Last Name','NPI','Expertise','Staff Category','Staff Status',
        'Department','Practice Phone','Practice Fax']
# (this will be a list of distinct providers who are in MMS but not in providers)
new_providers = pd.DataFrame(columns = cols)

# 2. iterate through each affiliation 
for affiliation in MMS:
    df = MMS[affiliation]
    # 2A. extract the relevant columns
    df = df[cols]
    # 2B. compare MMS and providers_db, and add any distinct provider to new_providers
    diff = df.merge(providers_db, indicator = True, on = ['First Name','Last Name','NPI'], how='left').loc[lambda x : x['_merge']!='both']
    # note: dropna() drops a row if ANY column in `cols` is missing, not only name or NPI
    diff = diff.dropna()
    # 2C. Use the sheet name to populate a new column 'Affiliation'
    diff['Affiliation'] = affiliation
    new_providers = pd.concat([new_providers, diff])
new_providers.drop('_merge',axis=1,inplace=True)

# 3. export to a new .csv file
new_providers.to_csv('new_providers.csv',index=False)

# a lil preview (haven't reindexed but it's going to csv file anyway so...)
new_providers.head()

Unnamed: 0,First Name,Last Name,NPI,Expertise,Staff Category,Staff Status,Department,Practice Phone,Practice Fax,Affiliation
8,Jacob,Ballard,1558990000.0,Emergency Medicine,Allied Health Prof.,Applicant,Pri. Care & Med. Spec.,2318767245,2318767625,CAD
10,Kareem,Bazzi,1639470000.0,Family Medicine,Active Staff,Applicant,Pri. Care & Med. Spec.,2318767200,2318766830,CAD
11,Dennis,Behler,1649590000.0,Physical Medicine & Rehabilitation,Allied Health Prof.,Active,Pri. Care & Med. Spec.,(231)592-1360,(231)592-1361,CAD
20,Mark,Clark,1205990000.0,Pain Medicine,Consulting,Active,Pri. Care & Med. Spec.,231 592 1360,231 592 1361,CAD
21,Alan,Conrad,1699770000.0,Family Medicine,Emeritus,Active,Pri. Care & Med. Spec.,(231) 775-2493,(231)775-2570,CAD


## 2. Make a deduplicated list of healthcare entities from the Master Medical Staff list; match with the DB

Similar to #1, but for entities...

1. Rename columns in the database to match MMS (`Entity TIN` --> `Tax ID`)
2. Extract relevant parts from `MMS`
    1. Drop all the rows that don't have a `Tax ID` value.
    2. Choose columns `Practice`, `Tax ID`, `Practice Address`, `City, State Zip`
    3. Compare the values with the TINs in `entities_db`
    4. Add whichever entities are in MMS but not in the database to a new dataframe (`new_entities`)
3. Export to a new .csv file (`new_entities.csv`)
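
Because TINs can arrive as hyphenated text in MMS and as numbers in the database export, both sides need the same 9-digit form before comparing. A hedged sketch in plain Python: `normalize_tin` is a hypothetical helper, the practice names are made up, and the TIN values are taken from the previews in this notebook.

```python
import re

def normalize_tin(value):
    """Return a 9-digit TIN string, or None if `value` does not hold exactly nine digits."""
    if value is None:
        return None
    text = str(value).strip()
    # A TIN read as a float shows up as e.g. '382137907.0'; drop the trailing '.0'.
    text = re.sub(r"\.0$", "", text)
    digits = re.sub(r"\D", "", text)
    return digits if len(digits) == 9 else None

# Database side: the same TIN may be stored as str, int, or float.
db_tins = {normalize_tin(t) for t in ["474680795", 208238099, 382137907.0]}

# MMS side: (practice name, raw Tax ID); names are placeholders.
mms_rows = [("Practice A", "38-2191390"), ("Practice B", "47-4680795"), ("Practice C", "n/a")]

new_rows = []
for name, raw_tin in mms_rows:
    tin = normalize_tin(raw_tin)
    if tin is not None and tin not in db_tins:
        new_rows.append((name, tin))
```

Rows without a usable TIN are skipped here, in line with the assumption stated at the top that only rows with a TIN are considered.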

In [5]:
# 1. rename columns in entities so that they match MMS
colnames = {'Entity TIN': 'Tax ID'} # add more later
entities_db = entities_db.rename(columns=colnames)
entities_db = entities_db[['Entity Legal Name','Tax ID']]
# typecast to str so that we can join with MMS later
entities_db['Tax ID'] = entities_db['Tax ID'].astype(str).str[0:9]
# a lil preview
entities_db.head()

Unnamed: 0,Entity Legal Name,Tax ID
0,Active Chiropractic of Cadillac,474680795
1,"Advance Pathology Services, PC",208238099
2,"Advanced Optometry, PLLC",382137907
3,Allergy and Asthma Specialists of Cadillac,383588887
4,"Andrew S. Riemer, DO PC",383156438


In [6]:
new_entities = pd.DataFrame(columns = ['Practice','Tax ID','Practice Address','City, State Zip','Practice Phone','Practice Fax'])

# 2. extract relevant parts from MMS
for affiliation in MMS:
    df = MMS[affiliation]
    # 2A. select rows that have TIN
    df = df[df['Tax ID'].notna()]
    # 2B. select relevant columns (Can add more later if needed)
    df = df[['Practice','Tax ID','Practice Address','City, State Zip','Practice Phone','Practice Fax']]
    if not df.empty: # if there is any entity in MMS to check for,
        df['Tax ID'] = df['Tax ID'].str.replace("-","")
        pattern = r"([0-9]{9})"
        df['TIN'] = df['Tax ID'].str.extract(pattern, expand=False)
        df = df[df['TIN'].notna()]
        # 2C. left join, leaving only the entities that are in MMS but not in database
        diff = df.merge(entities_db,indicator = True, how='left',on='Tax ID')
        diff = diff[diff['_merge']=='left_only']
        # 2D. add the entities not found in the database to the new dataframe
        #     (duplicates are dropped after the loop)
        new_entities = pd.concat([new_entities, diff], ignore_index=True)

new_entities = new_entities[['Practice','Tax ID','Practice Address','City, State Zip','Practice Phone','Practice Fax']].drop_duplicates()        

# 3. export the dataframe to a new .csv file    
new_entities.to_csv('new_entities.csv',index=False)

# a lil preview
new_entities.head(5)

Unnamed: 0,Practice,Tax ID,Practice Address,"City, State Zip",Practice Phone,Practice Fax
0,Munson Healthcare Cadillac Hospital Cardiopulm...,382191390,400 Hobart Street,"Cadillac, MI 49601",(231)876-7210,(231)876-7213
1,Crawford Continuing Care Center,382191390,1100 E Michigan Ave.,"Grayling, MI 49738",(989)348-0317,(989)348-0529
5,Behavioral Health,382191390,1105 Sixth St,"Traverse City, MI 49684",(231)935-6210,(231)935-7130
6,MMC Trauma & Acute Care Surgery Program,382191390,1105 Sixth St.,"Traverse City, MI 49684",(231)935-5000,(231)392-0039
7,Munson Neurosurgery,382191390,1221 Sixth St Ste 300,"Traverse City, MI 49684",2313920640,2313920643


### ISSUES:

* __*`MAN` doesn't have any Tax ID recorded; can I assume that there are no new entities there, or should I come up with a way to find deduplicated entities from `MAN` too?*__

## 3. Make a deduplicated list of locations from the Master Medical Staff list

#### GAME PLAN

__From `locations_db`:__
1. create a dataframe `db_df` that has address attributes + name on one axis 
2. flip `db_df` so that the index is `Name` and columns are address attributes

__From `MMS`:__
1. create a dataframe `mms_df` that has address attributes + name on one axis 
2. flip `mms_df` so that the index is `Name` and columns are address attributes


__With the cleaned dataframes:__


* Create a dataframe with `locations_db["Physical Address"]`
* Create a dataframe with `locations_MMS["Address"]`
* match with the practice name?



In [7]:
#to print everything...
#pd.set_option('display.max_columns', None)  
#pd.set_option('display.expand_frame_repr', False)
#pd.set_option('max_colwidth', -1)

# what does the parsed column look like?
locations_parsed = locations_db["Physical Address"].map(usaddress.tag)
print(dict(locations_parsed[0][0]))
print(dict(locations_parsed[1][0]))

{'AddressNumber': '119', 'StreetNamePreDirectional': 'N', 'StreetName': 'Shelby', 'StreetNamePostType': 'St.', 'PlaceName': 'Cadillac', 'StateName': 'MI', 'ZipCode': '49601'}
{'AddressNumber': '8805', 'StreetName': 'Pine Ridge', 'StreetNamePostType': 'Drive', 'PlaceName': 'Cadillac', 'StateName': 'MI', 'ZipCode': '49601'}


In [8]:
# 1. Make df using `locations_db`

# framework for final output
db_df = pd.DataFrame(columns=['Category','Values'])
categories = ['Name','AddressNumber','AddressNumberPrefix','AddressNumberSuffix','BuildingName',
              'CornerOf','IntersectionSeparator','LandmarkName','NotAddress','OccupancyType','OccupancyIdentifier',
              'PlaceName','Recipient','StateName','StreetName','StreetNamePreDirectional','StreetNamePreModifier',
              'StreetNamePreType','StreetNamePostDirectional','StreetNamePostModifier','StreetNamePostType',
              'SubaddressIdentifier','SubaddressType','USPSBoxGroupID','USPSBoxGroupType','USPSBoxID',
              'USPSBoxType','ZipCode']
db_df['Category'] = categories

# make a df of address attributes and populate with whatever info we have
for i in range(locations_parsed.size):
    dic = dict(locations_parsed[i][0])
    #print(dic)
    ls = list(dic.items())
    df = pd.DataFrame(ls,columns=['Category','Values'])
    new_row = pd.DataFrame([['Name','{}'.format(locations_db.iloc[i, 1])]],columns=['Category','Values']) 
    #print(new_row)
    df = pd.concat([new_row, df]).reset_index(drop = True)
    #print(df)
    db_df = db_df.merge(df, how='left',on='Category')
    #print(final_df)

    
# clean up
db_df.set_index('Category',inplace=True)
db_df.dropna(how='all',inplace=True)
    
# 2. transpose db_df so that the columns are the attribute types and the indices are the practices
db_df.columns = range(db_df.shape[1])
db_df = db_df.T

# lil preview
print(db_df.shape)
db_df.head()

(155, 14)


Category,Name,AddressNumber,OccupancyType,OccupancyIdentifier,PlaceName,StateName,StreetName,StreetNamePreDirectional,StreetNamePreType,StreetNamePostDirectional,StreetNamePostType,USPSBoxID,USPSBoxType,ZipCode
0,,,,,,,,,,,,,,
1,Active Chiropractic of Cadillac,119.0,,,Cadillac,MI,Shelby,N,,,St.,,,49601.0
2,Advanced Foot and Ankle Center - Cadillac,8805.0,,,Cadillac,MI,Pine Ridge,,,,Drive,,,49601.0
3,Advanced Foot and Ankle Center - Manistee [118...,1860.0,Ste,# 2,Manistee,MI,Parkdale,E,,,Ave,,,49660.0
4,Advanced Foot and Ankle Center - Traverse City,1225.0,Ste,200,Traverse City,MI,Front,,,,Street,,,49684.0
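
The cell above builds the table by merging one column per address and then transposing. A shorter route that should give the same layout is to build one dict per location and let pandas align the keys. The sketch below uses the two tagged addresses printed earlier; `tagged_demo` and `db_demo_df` are illustrative names, and this is an alternative sketch, not what the notebook ran.

```python
import pandas as pd

# The two tagged addresses printed earlier, paired with their location names.
tagged_demo = [
    ("Active Chiropractic of Cadillac",
     {"AddressNumber": "119", "StreetNamePreDirectional": "N", "StreetName": "Shelby",
      "StreetNamePostType": "St.", "PlaceName": "Cadillac", "StateName": "MI", "ZipCode": "49601"}),
    ("Advanced Foot and Ankle Center - Cadillac",
     {"AddressNumber": "8805", "StreetName": "Pine Ridge", "StreetNamePostType": "Drive",
      "PlaceName": "Cadillac", "StateName": "MI", "ZipCode": "49601"}),
]

# One record (dict) per location; pandas turns the keys into columns and
# fills attributes that a given address lacks with NaN.
records = [{"Name": name, **tags} for name, tags in tagged_demo]
db_demo_df = pd.DataFrame(records)
```

Built this way, the values stay strings (no `119.0` or `49601.0`) and there is no empty leading row to clean up.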


In [9]:
# Now work with MMS

# function to work with one sheet at a time
# IN: df, dataframe (one MMS sheet)
# OUT: add_df, dataframe that contains the practice name and the parsed address attributes
def make_address_df (df):
    # make a column that contains full addresses of each practice
    df['Address'] = df['Practice Address'] + df['City, State Zip']
    df = df.dropna(subset=['Address'])
    #print(df['Address'])
    # take address column; use usaddress.tag
    add_parsed = df['Address'].map(usaddress.tag)
    #print(add_parsed)
    
    # create add_df
    add_df = pd.DataFrame(columns=['Category','Values'])
    categories = ['Name','AddressNumber','AddressNumberPrefix','AddressNumberSuffix','BuildingName',
              'CornerOf','IntersectionSeparator','LandmarkName','NotAddress','OccupancyType','OccupancyIdentifier',
              'PlaceName','Recipient','StateName','StreetName','StreetNamePreDirectional','StreetNamePreModifier',
              'StreetNamePreType','StreetNamePostDirectional','StreetNamePostModifier','StreetNamePostType',
              'SubaddressIdentifier','SubaddressType','USPSBoxGroupID','USPSBoxGroupType','USPSBoxID',
              'USPSBoxType','ZipCode']
    add_df['Category'] = categories
    
    # go through each item in add_parsed and put in add_df
    for i in range(add_parsed.size):
        # positional lookup: rows may have been dropped above, so the labels can have gaps
        dic = dict(add_parsed.iloc[i][0])
        #print(dic)
        ls = list(dic.items())
        temp_df = pd.DataFrame(ls,columns=['Category','Values'])
        #print(temp_df)
        
        # add the name of the practice
        new_row = pd.DataFrame([['Name','{}'.format(df.iloc[i, df.columns.get_loc('Practice')])]],columns=['Category','Values']) 
        #print(new_row)
        temp_df = pd.concat([new_row, temp_df]).reset_index(drop = True)
        #print(temp_df)
        add_df = add_df.merge(temp_df, how='left',on='Category')
        #print(add_df)
        
    # clean up
    add_df.set_index('Category',inplace=True)
    add_df.dropna(how='all',inplace=True)
    
    # transpose
    add_df.columns = range(add_df.shape[1])
    add_df = add_df.T
    
    # return add_df
    return add_df

In [10]:
# 1. create a dataframe `mms_df` that has address attributes + name on one axis 
mms_df = pd.DataFrame()
for affiliation in MMS:
    df = MMS[affiliation]
    aff_df = make_address_df(df)
    mms_df = pd.concat([mms_df, aff_df])

mms_df.dropna(how='all',inplace=True)

# lil preview
print(mms_df.shape)
mms_df.head()
    

(1273, 15)


Unnamed: 0,Name,AddressNumber,OccupancyType,OccupancyIdentifier,PlaceName,StateName,StreetName,StreetNamePreDirectional,StreetNamePostType,USPSBoxID,USPSBoxType,ZipCode,StreetNamePreType,StreetNamePostDirectional,Recipient
1,Munson Healthcare Cadillac Anesthesia,400,,,StCadillac,MI,Hobart,,,,,49601,,,
2,Munson Healthcare Cadillac Anesthesia,400,,,StCadillac,MI,Hobart,,,,,49601,,,
3,"Chowdhury MD, PLLC",8795,,,Dr.Cadillac,MI,Pine,,Ridge,,,49601,,,
4,Munson Healthcare Cadillac Cancer & Infusion C...,400,,,StreetCadillac,MI,Hobart,,,,,49601,,,
5,"Family Practice of Cadillac, PC",827,,,DivisionCadillac,MI,,E.,,,,49601,,,


## Issues with `usaddress`

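My original plan was to build two dataframes, one from MMS and one from the DB, and iterate through each row of MMS to check for a (partial) match in the DB dataframe. That relied on the assumption that `usaddress` always parses any address string correctly. In the table above, though, the parse is clearly off: the street type ends up glued to the city in `PlaceName` (`StCadillac`, `Dr.Cadillac`, `StreetCadillac`), and parts of the street name land in the wrong columns (`Ridge` under `StreetNamePostType`, `Division` inside `PlaceName`). One likely contributor is that `make_address_df` concatenates `Practice Address` and `City, State Zip` with no separator, so `usaddress` receives strings in which the street type and the city run together. Adding a separator is the first thing to try; beyond that, we could standardize the addresses in the MMS first (as in the Alternative solution below), rerun the same steps, and see how it goes.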
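
A minimal sketch of how the two address columns are combined in `make_address_df`, on toy rows, with and without a separator. The column names follow the MMS sheets; the rows and the `", "` separator are assumptions, and whether `usaddress` then tags every row correctly has not been checked here.

```python
import pandas as pd

# Toy rows in the shape of the MMS sheets; values are illustrative.
demo = pd.DataFrame({
    "Practice Address": ["400 Hobart St", "8795 Pine Ridge Dr."],
    "City, State Zip": ["Cadillac, MI 49601", "Cadillac, MI 49601"],
})

# Plain concatenation, as in make_address_df: street type and city run together.
glued = demo["Practice Address"] + demo["City, State Zip"]

# Same columns joined with an explicit separator.
spaced = demo["Practice Address"].str.strip() + ", " + demo["City, State Zip"].str.strip()

print(glued[0])
print(spaced[0])
```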


## Alternative

If an alternative (imperfect) solution is ok, below is my initial approach with the locations files. I standardize the address on both MMS and DB and do a left-merge on the address columns (left being MMS). This solution is imperfect in that it only looks for perfect matches; so if a location is already in the DB but the same location in MMS has more information (floor number, suite number, etc.), the script will recognize this location as a new one. 


#### Game Plan outline for the Alternative solution:
__From `MMS`:__
1. Extract `Practice`, `Practice Address`, `City, State Zip` (for now) (*Drop NANs*)
2. Split the column `City, State Zip` into `City`, `State`, and `Zip`
3. Standardize `Practice Address`:
    1. Get rid of all the dots and commas
    2. Put it in all caps
    3. Avenue --> Ave; Street-->St; Drive-->Dr; Road--> Rd; Highway-->Hwy, etc.
4. Rename columns: `Practice` --> `Name_MMS`; `Practice Address`--> `Address_MMS`; `Zip`-->`Zip_MMS`; etc.
5. deduplicate
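
For step 2, splitting on the first comma works when every value looks like `City, ST 12345`, but a regular expression makes the expected shape explicit and flags rows that don't fit. A sketch in plain Python; `split_city_state_zip` is a hypothetical helper, and the pattern assumes a two-letter state and a 5-digit ZIP with an optional +4.

```python
import re

# City, then a two-letter state, then a 5-digit ZIP with an optional +4 suffix.
CITY_STATE_ZIP = re.compile(
    r"^\s*(?P<city>.+?)\s*,\s*(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)

def split_city_state_zip(text):
    """Return (city, state, zip), or (None, None, None) if the text does not fit the pattern."""
    match = CITY_STATE_ZIP.match(text) if isinstance(text, str) else None
    if match is None:
        return (None, None, None)
    return (match["city"], match["state"].upper(), match["zip"])

print(split_city_state_zip("Traverse City, MI 49684"))
```

Rows that come back as all `None` (missing values, summary rows, unusual formats) can then be inspected separately instead of silently producing odd columns.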

__From `locations_db`:__
1. Extract `Location Name`, `Physical Address: Street 1`, `Physical Address: Zip` (for now) (*Drop NaNs; __if a row has any NaN, drop the entire row, for now__*)
2. Standardize `Physical Address: Street 1` the same way we did for `MMS`
3. Rename columns: `Location Name` --> `Name_DB`; `Physical Address: Street 1`--> `Address_DB`; `Physical Address: Zip`--> `Zip_DB`; etc.
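
Since the same standardization is applied to both MMS and the DB, it can live in one function. The sketch below is a plain-Python variant; `standardize_address` and `ABBREVIATIONS` are hypothetical names. Note one deliberate difference from the cells below: abbreviations are applied to whole words only, so a street such as BROADWAY is not rewritten to BRDWAY.

```python
import re

# Abbreviations from the game plan, applied to whole words only.
ABBREVIATIONS = {
    "AVENUE": "AVE", "STREET": "ST", "DRIVE": "DR", "ROAD": "RD",
    "HIGHWAY": "HWY", "TRAIL": "TR", "SUITE": "STE",
}
_WORD = re.compile(r"\b(" + "|".join(ABBREVIATIONS) + r")\b")

def standardize_address(text):
    """Strip punctuation, collapse whitespace, upper-case, and abbreviate street types."""
    text = re.sub(r"\W+", " ", text)   # dots, commas, '#', runs of spaces -> one space
    text = text.strip().upper()
    return _WORD.sub(lambda m: ABBREVIATIONS[m.group(1)], text)

print(standardize_address("1860 E. Parkdale Ave, Ste # 2"))
```

With pandas, the function could be applied with `.map(standardize_address)` on the address column of each dataframe after dropping missing values.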

__With the cleaned dataframes:__
1. Extract rows from `MMS` that are not in `locations_db` based on `Address_MMS`/`Address_DB` and `Zip_MMS`/`Zip_DB`
2. Put in a new dataframe `new_locations` (*Columns: Name_MMS, Name_DB, Address, City, State, Zip (for now)*)
3. Export the dataframe into a new .csv file `new_locations.csv`
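
A sketch of step 1 with both the address and the ZIP in the join key. The rows are made up and `zip5` is a hypothetical helper; it is there because a ZIP column read from CSV is often numeric while the ZIP split out of a text column is a string, and the two would never compare equal.

```python
import pandas as pd

# Made-up rows; addresses are assumed to be standardized already.
mms_demo = pd.DataFrame({
    "Name_MMS": ["Practice A", "Practice B"],
    "Address_MMS": ["400 HOBART ST", "10 MAIN ST"],
    "Zip_MMS": ["49601", "49684"],
})
db_demo = pd.DataFrame({
    "Name_DB": ["Location A"],
    "Address_DB": ["400 HOBART ST"],
    "Zip_DB": [49601],  # numeric, as read_csv often produces
})

def zip5(series):
    """Coerce ZIP codes to 5-digit strings so both sides compare equal."""
    return series.astype(str).str.extract(r"(\d{5})", expand=False)

mms_demo["Zip_MMS"] = zip5(mms_demo["Zip_MMS"])
db_demo["Zip_DB"] = zip5(db_demo["Zip_DB"])

merged = mms_demo.merge(
    db_demo, how="left", indicator=True,
    left_on=["Address_MMS", "Zip_MMS"], right_on=["Address_DB", "Zip_DB"],
)
new_locations_demo = merged.loc[
    merged["_merge"] == "left_only", ["Name_MMS", "Address_MMS", "Zip_MMS"]
]
```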


In [13]:
# Working with DB - locations_db
locations_db = pd.read_csv('kn_locations.csv')

# 1. Extract `Location Name`, `Physical Address: Street 1`, `Physical Address: Zip` (for now) (*Drop NANs*)
locations_db = locations_db[['Location Name','Physical Address: Street 1','Physical Address: Zip']]
locations_db = locations_db.dropna(how='any')

# 2. Rename columns: `Location Name` --> `Name_DB`; `Physical Address: Street 1`--> `Address_DB`; `Physical Address: Zip`--> `Zip_DB`; etc.
newcols = {'Location Name':'Name_DB','Physical Address: Street 1':'Address_DB','Physical Address: Zip': 'Zip_DB'}
locations_db = locations_db.rename(columns=newcols)

# 3. Standardize `Address_DB` the same way we did for `MMS`
# Replace dots, commas, and other non-word characters with spaces
locations_db['Address_DB'] = locations_db['Address_DB'].str.replace(r'\W',' ',regex=True)
# Also collapse repeated whitespace
locations_db['Address_DB'] = locations_db['Address_DB'].str.replace(r'\s+',' ',regex=True).str.strip()
# Put it in all caps
locations_db['Address_DB'] = locations_db['Address_DB'].str.upper()
# Avenue --> Ave; Street-->St; Drive-->Dr; Road--> Rd; Highway-->Hwy, etc.
locations_db['Address_DB'] = locations_db['Address_DB'].str.replace('AVENUE','AVE')
locations_db['Address_DB'] = locations_db['Address_DB'].str.replace('STREET','ST')
locations_db['Address_DB'] = locations_db['Address_DB'].str.replace('DRIVE','DR')
locations_db['Address_DB'] = locations_db['Address_DB'].str.replace('ROAD','RD')
locations_db['Address_DB'] = locations_db['Address_DB'].str.replace('HIGHWAY','HWY')
locations_db['Address_DB'] = locations_db['Address_DB'].str.replace('TRAIL','TR')
locations_db['Address_DB'] = locations_db['Address_DB'].str.replace('SUITE','STE')

# 4. Clean location names
locations_db['Name_DB'] = locations_db['Name_DB'].str.replace(r'\W',' ',regex=True).str.replace(r'\s+',' ',regex=True).str.strip()

# preview
locations_db.head()

Unnamed: 0,Name_DB,Address_DB,Zip_DB
0,Active Chiropractic of Cadillac,119 N SHELBY ST,49601
1,Advanced Foot and Ankle Center Cadillac,8805 PINE RIDGE DR,49601
2,Advanced Foot and Ankle Center Manistee 118483...,1860 E PARKDALE AVE STE 2,49660
3,Advanced Foot and Ankle Center Traverse City,1225 FRONT ST STE 200,49684
4,Advanced Optometry,120 PALUSTER ST,49601


In [14]:
# Working with MMS

locations_MMS = pd.DataFrame(columns = ['Name_MMS','Address_MMS','City','State','Zip_MMS'])

for affiliation in MMS:
    df = MMS[affiliation]
    # 1. Extract `Practice`, `Practice Address`, `City, State Zip` (for now)
    df = df[['Practice', 'Practice Address', 'City, State Zip']]
    # 2. Split the column `City, State Zip` into `City`, `State`, and `Zip`
    df[['City','StateZip']] = df['City, State Zip'].str.split(',',expand=True)
    df[['State','Zip']] = df['StateZip'].str.strip().str.split(expand=True)
    df = df[['Practice', 'Practice Address', 'City', 'State', 'Zip']]
        #print(df.head())
    # 3. Standardize `Practice Address`:
        # 3A. Replace dots, commas, and other non-word characters with spaces
    df['Practice Address'] = df['Practice Address'].str.replace(r'\W',' ',regex=True)
    # Also collapse repeated whitespace
    df['Practice Address'] = df['Practice Address'].str.replace(r'\s+',' ',regex=True).str.strip()
        # 3B. Put it in all caps
    df['Practice Address'] = df['Practice Address'].str.upper()
        # 3C. Avenue --> Ave; Street-->St; Drive-->Dr; Road--> Rd; Highway-->Hwy, etc.
    df['Practice Address'] = df['Practice Address'].str.replace('AVENUE','AVE')
    df['Practice Address'] = df['Practice Address'].str.replace('STREET','ST')
    df['Practice Address'] = df['Practice Address'].str.replace('DRIVE','DR')
    df['Practice Address'] = df['Practice Address'].str.replace('ROAD','RD')
    df['Practice Address'] = df['Practice Address'].str.replace('HIGHWAY','HWY')
    df['Practice Address'] = df['Practice Address'].str.replace('TRAIL','TR')
    df['Practice Address'] = df['Practice Address'].str.replace('SUITE','STE')
    # 4. Rename columns: `Practice` --> `Name_MMS`; `Practice Address`--> `Address_MMS`; `Zip`-->`Zip_MMS`; etc.
    colnames = {'Practice': 'Name_MMS', 'Practice Address': 'Address_MMS', 'Zip': 'Zip_MMS'}
    df = df.rename(columns=colnames)
    locations_MMS = pd.concat([locations_MMS, df])

# 5. deduplicate    
locations_MMS = locations_MMS.drop_duplicates()    
    
print(locations_MMS.info())    
locations_MMS.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 381 entries, 0 to 25
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Name_MMS     380 non-null    object
 1   Address_MMS  380 non-null    object
 2   City         380 non-null    object
 3   State        380 non-null    object
 4   Zip_MMS      379 non-null    object
dtypes: object(5)
memory usage: 17.9+ KB
None


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


Unnamed: 0,Name_MMS,Address_MMS,City,State,Zip_MMS
0,Munson Healthcare Cadillac Anesthesia,400 HOBART ST,Cadillac,MI,49601
2,"Chowdhury MD, PLLC",8795 PINE RIDGE DR,Cadillac,MI,49601
3,Munson Healthcare Cadillac Cancer & Infusion C...,400 HOBART ST,Cadillac,MI,49601
4,"Family Practice of Cadillac, PC",827 E DIVISION,Cadillac,MI,49601
5,American Healthcare Staffing Association,10126 E CHERRY BEND RD,Traverse City,MI,49684


In [15]:
# __With the cleaned dataframes:__
# 1. Extract rows from `MMS` that are not in `locations_db` (currently matched on `Address_MMS`/`Address_DB` only; the Zip columns are not part of the join yet)
new_locations = locations_MMS.merge(locations_db, indicator=True, how='outer',
                                    left_on='Address_MMS',right_on='Address_DB')
# new_locations = new_locations[new_locations['_merge']=='both']
new_locations = new_locations[new_locations['_merge']=='left_only']
#new_locations = new_locations[(new_locations['_merge']=='left_only') | (new_locations['_merge']=='right_only')]
new_locations = new_locations.drop_duplicates(subset='Name_MMS')
#print(new_locations.info())
#print(new_locations.head(10))

#diff = df.merge(entities,indicator = True, how='left',on='Tax ID')
#        diff = diff[diff['_merge']=='left_only']

# 2. Put in a new dataframe `new_locations` (*Columns: Name_MMS, Name_DB, Address, City, State, Zip (for now)*)
new_locations = new_locations[['Name_MMS','Address_MMS','City','State','Zip_MMS']]

# 3. Export the dataframe into a new .csv file `new_locations.csv`
new_locations.to_csv('new_locations.csv',index=False)

Observations:

* Not every row in the list is a genuinely new location, because of minor variations in the address (e.g. some entries include a suite number and others don't, or the St/Rd suffix is missing).
    For example, Family Practice of Cadillac is already in the database but with a more specific address ('827 E DIVISION STE 2').


Possible solutions:

* When deduplicating `locations_MMS`, make the longest address/longest name absorb the shorter ones (i.e. leave only the entries with the most information)
* When merging the two dataframes `locations_MMS` and `locations_db`, use the street address but look for 'containment', not 'exact match'. 
    1. i.e. if the entry in `Address_MMS` contains that in `Address_DB`, then either
        1. record both address and include them 
        2. consider that location already in the DB and don't include in the unique lists
    2. i.e. if the entry in `Address_DB` contains that in `Address_MMS`, then 
        1. consider that location already in the DB and don't include in the unique lists
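
A sketch of the containment idea in plain Python, on standardized addresses. It compares whole tokens from the start of the address rather than using a raw substring test, so `27 E DIVISION` does not count as contained in `827 E DIVISION STE 2`. `tokens_contain` and `find_db_match` are hypothetical helpers, and the prefix rule is an assumption about what "contains" should mean here.

```python
def tokens_contain(longer, shorter):
    """True if the tokens of `shorter` are the leading tokens of `longer`."""
    long_tokens, short_tokens = longer.split(), shorter.split()
    return bool(short_tokens) and long_tokens[:len(short_tokens)] == short_tokens

def find_db_match(address_mms, db_addresses):
    """Return the first DB address that contains, or is contained in, the MMS address."""
    for address_db in db_addresses:
        if tokens_contain(address_db, address_mms) or tokens_contain(address_mms, address_db):
            return address_db
    return None

db_addresses = ["119 N SHELBY ST", "827 E DIVISION STE 2", "1225 FRONT ST"]
print(find_db_match("827 E DIVISION", db_addresses))
```

Whether a match in either direction should count as "already in the DB" or be recorded for review is the open choice listed in the bullets above.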

## Ending Note

* Run this thing but check `MACK` separately.
* The sheet affiliated with `MAN` doesn't have any Tax ID recorded for any of the practices; hence no entities were added from this affiliation. If we want to get entities from `MAN` we should come up with some other method. 

Also, it seems there is a way to compute the 'distance' between two strings (e.g. [Levenshtein distance](https://www.datacamp.com/community/tutorials/fuzzy-string-python)) and accept a row as a match if the distance between the two strings is short enough.
There's also the `fuzzywuzzy` package that computes the 'fuzz ratio' - how similar the two strings are. Great thing about this one is that it supports partial ratio (like a search) and mixed orders (i.e. 'US vs Canada' and 'Canada vs US' will have a token ratio of 100%).

But since we are looking for a definite solution (i.e. we would rather extract too many candidate rows than miss some), I didn't use these packages.
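
For reference, the edit distance itself needs no extra package. Below is a plain-Python sketch of the standard dynamic-programming Levenshtein distance, plus a hypothetical `closest` helper; the threshold of 3 is an arbitrary assumption and would need tuning against real addresses.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions turning `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute (or keep)
        previous = current
    return previous[-1]

def closest(address, candidates, max_distance=3):
    """Return the candidate with the smallest distance, or None if none is within `max_distance`."""
    best = min(candidates, key=lambda c: levenshtein(address, c), default=None)
    if best is None or levenshtein(address, best) > max_distance:
        return None
    return best

print(levenshtein("827 E DIVISION", "827 E DIVISION STE 2"))
```

The suite-number case from the observations above sits at distance 6, so a small threshold would still treat it as a new location; edit distance mainly helps with typos, not with missing address parts.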

## *++++below are just random stuff i've been trying++++*

In [None]:
# populate a new boolean column 'MMS>DB' that says 'True' if MMS address contains DB address
locations_MMS = locations_MMS[['Name_MMS','Address_MMS','City','State','Zip_MMS']].dropna(how='any')
# reset the index so that the positional counter i below matches the row labels used by .loc
locations_MMS = locations_MMS.reset_index(drop=True)
locations_MMS['MMS>DB'] = False

# for every row in Address_MMS, check if any Address_DB is a substring of it

for i in range(locations_MMS.shape[0]):
    address = str(locations_MMS.iloc[i,1])
    #print(address)
    for item in locations_db['Address_DB']:
        #print(item)
        if str(item) in address:
            locations_MMS.loc[i,'Address_DB'] = item
            locations_MMS.loc[i,'MMS>DB'] = True
            #print('Found it!')
            break  

# take a look at MMS locations whose address contains a DB address
print(locations_MMS['MMS>DB'].value_counts(dropna=False))
locations_MMS[locations_MMS['MMS>DB']==True].tail()

In [None]:
# populate a new boolean column 'DB>MMS' that says 'True' if DB address contains MMS address
locations_db = locations_db.dropna(how='any').reset_index(drop=True)
locations_db['DB>MMS'] = False

# for every Address_DB, check if any Address_MMS is a substring of it
for i in range(locations_db.shape[0]):
    item = str(locations_db.iloc[i,1])
    for address in locations_MMS['Address_MMS']:
        if str(address) in item:
            locations_db.loc[i,'Address_MMS'] = address
            locations_db.loc[i,'DB>MMS'] = True
            break

# take a look at DB locations whose address contains an MMS address
#print(locations_db[locations_db['DB>MMS']==False])
print(locations_db['DB>MMS'].value_counts(dropna=False))
locations_db.head(10)
locations_db[locations_db['DB>MMS']==True]


In [None]:
# outer merge with the street address
new_locations = locations_MMS.merge(locations_db, indicator=True, how='outer',
                                    left_on='Address_MMS',right_on='Address_DB')
# new_locations = new_locations[new_locations['_merge']=='both']
new_locations = new_locations[new_locations['_merge']=='left_only']
#new_locations = new_locations[(new_locations['_merge']=='left_only') | (new_locations['_merge']=='right_only')]
new_locations = new_locations.drop_duplicates(subset='Name_MMS')
new_locations.info()
new_locations.head(10)

#diff = df.merge(entities,indicator = True, how='left',on='Tax ID')
#        diff = diff[diff['_merge']=='left_only']
# 2. Put in a new dataframe `new_locations` (*Columns: Name_MMS, Name_DB, Address, City, State, Zip (for now)*)
# 3. Export the dataframe into a new .csv file `new_locations.csv`

## Note:



### Just...exploring. random notes.

#### MMS First Impression
* The sheet `MACK` is in a format that is different from all other sheets for individual hospitals. Why...
* For all other sheets, there is a mini-table at the bottom with some summary numbers.
* Minor variations in `Practice Address` (e.g. 'St' vs 'St.' vs 'Street'; 'Trail' vs 'Tr.'; 'Carmel St.' vs 'S. Carmel St.')
* `Specialty` seems to be clean and standardized
* Ooh there is a column `Primary Facility` that has the hospital code as its value; I can download everything (except `MACK`) into one dataframe. 
    * not all of these will be useful
* I'm just going to believe the provider names, NPI, Tax ID, and phone numbers.

__*The `Practice` column is both a location and an entity.*__

#### Providers list First Impression
* `Provider Name: Last`, `Provider Name: First`, and `Provider NPI` will be useful.
* `Provider Primary Specialty` has some empty cells but still usable; just need to use in conjunction with other things
* There are subtle variations between `Primary Employer`(healthcare entity - TIN) and `Primary Practice Location` (healthcare location - Group NPI); also some are empty cells.
    * these are going away in the next couple of weeks, though.  

#### Locations list First Impression
* pretty messy...
* `Location Name` is almost clean; some entries have NPI attached at the end but we can remove it.
* Addresses are all messed up. Most have the physical address in full (`Physical Address`), but not all. Some entries don't have physical address but only state and zip code; Some have state code under the column `Physical Address: City`; the only column without any missing values seems to be `Physical Address: Zip`. 
    * __Adam's note: throw these out :)__ 
* There is information about latitudes, longitudes, and phone numbers, but it isn't available for all locations.

#### Entities list First Impression
* `Entity Legal Name` looks nice and clean. No empty cells; no weird variations (at least on the first look). This could be used as a standard.
* There are many entities that literally have no information other than their name and TIN...what to do? 
    * probably 
* There's the `Billing Address` and there's the `Mailing Address`, and they are _different_. Geez
* The only columns that are fully populated are `Entity`,`Entity Legal Name`, and `Entity TIN`. If I am going to use any other columns I would have to be careful.

In [None]:
addr = '530 W Diversey Pkwy Chicago IL 60614'
usaddress.parse(addr)
