<font color=#2DA2F2>

**Azure Maps API**
- **Batch Geocoding:** Addresses are processed in chunks of 50 (`batch_size=50`).
- This implementation uses geopy's `AzureMaps` geocoder, which sends one request per address, so the chunks group requests for progress reporting and pacing rather than reducing the number of API calls. It respects rate limits by adding a short delay between individual requests and a longer delay between chunks.
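
The chunking step described above can be sketched on its own. This is a minimal illustration, not part of the original notebook: `chunked` is a hypothetical helper name and the address strings are placeholders.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical input: 120 placeholder address strings
sample_addresses = [f"address {n}" for n in range(120)]
print([len(batch) for batch in chunked(sample_addresses)])  # [50, 50, 20]
```

The last chunk is simply shorter when the number of addresses is not a multiple of the batch size.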


In [1]:
# import libraries
from geopy.geocoders import AzureMaps
from geopy.exc import GeocoderTimedOut, GeocoderServiceError
from tqdm import tqdm
import time
import pandas as pd
import os

# Set the Azure Maps API key and define a helper function to geocode addresses in batches
azure_maps_key = os.getenv('azure_maps_key')

def batch_geocode_azure(addresses, api_key, batch_size=50):
    """Geocode addresses in chunks; return (address, latitude, longitude) tuples in input order."""
    geolocator = AzureMaps(api_key)
    results = []
    
    for i in tqdm(range(0, len(addresses), batch_size), desc="Processing batches"):
        batch = addresses[i:i+batch_size]
        batch_results = []
        
        for address in batch:
            try:
                location = geolocator.geocode(address, exactly_one=True)
                if location:
                    batch_results.append((address, location.latitude, location.longitude))
                else:
                    batch_results.append((address, None, None))
            except (GeocoderTimedOut, GeocoderServiceError) as e:
                print(f"Error geocoding address '{address}': {e}")
                batch_results.append((address, None, None))
            
            time.sleep(0.1)  # Small delay between individual requests
        
        results.extend(batch_results)
        time.sleep(1)  # Add a small delay between batches to respect rate limits
    
    return results
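
The function above logs a failed lookup and moves on to the next address. If transient timeouts are common, one option is to retry with exponential backoff. The sketch below is an assumption rather than part of the original notebook: `geocode_with_retry` and its parameters are hypothetical names, and it accepts any callable so that a stub can stand in for the real geocoder.

```python
import time
from typing import Callable, Optional, Tuple, Type


def geocode_with_retry(
    geocode: Callable[[str], object],
    address: str,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[object]:
    """Call `geocode(address)`, retrying on the given exceptions with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return geocode(address)
        except retry_on as exc:
            if attempt == max_attempts - 1:
                print(f"Giving up on '{address}': {exc}")
                return None
            # Wait base_delay, then 2x, then 4x, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
    return None


# Stub geocoder (hypothetical) that fails once, then succeeds
_state = {"calls": 0}


def _flaky_geocode(addr: str):
    _state["calls"] += 1
    if _state["calls"] == 1:
        raise TimeoutError("simulated timeout")
    return (addr, 0.0, 0.0)


print(geocode_with_retry(_flaky_geocode, "placeholder address", sleep=lambda s: None))
```

With geopy, the call might look like `geocode_with_retry(geolocator.geocode, address, retry_on=(GeocoderTimedOut, GeocoderServiceError))`, using the exception classes already imported in this cell.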


<font color=#1DA1F2>

**Extract and Clean Addresses**

In [7]:
# Load the CSV file into a DataFrame
df = pd.read_csv('data/npidata_pfile.csv')

# Extract the 5-digit zip code as a nullable integer; values above 99999 are ZIP+4 codes, so drop their last 4 digits
df['Zip Code'] = df['Zip Code'].apply(lambda x: (int(x) // 10000 if x > 99999 else int(x)) if not pd.isna(x) else x).astype('Int64')
# Combine the provider name columns into a full name in "Last, First Middle" format; a missing middle name is left out
df['Full Name'] = df['Provider Last Name'] + ', ' + df['Provider First Name'] + (' ' + df['Provider Middle Name']).fillna('')

# Keep only the columns needed for geocoding
df = df[['NPI', 'Full Name', 'Street', 'City', 'State', 'Zip Code']]
# Drop rows with a missing street or full name
df = df.dropna(subset=['Street', 'Full Name'])
# Drop duplicate rows
df = df.drop_duplicates()
# Prepare the addresses for geocoding in a new Clean_address column
df['Clean_address'] = df.apply(lambda row: f"{row['Street']}, {row['City']}, {row['State']} {row['Zip Code']}", axis=1)
# print(df.shape)
# Randomly sample 1000 rows
df = df.sample(1000)
df.head()


| | NPI | Full Name | Street | City | State | Zip Code | Clean_address |
|---|---|---|---|---|---|---|---|
| 27454 | 1821077272 | CURTIS, SHERYL L | 101 S NEWELL DRIVE | GAINESVILLE | FL | 32611 | 101 S NEWELL DRIVE, GAINESVILLE, FL 32611 |
| 4254 | 1013596584 | SEKARAN, KANITHRA C | 675 N SAINT CLAIR ST STE 21-100 | CHICAGO | IL | 60611 | 675 N SAINT CLAIR ST STE 21-100, CHICAGO, IL 6... |
| 23667 | 1124512272 | QUENTRILL, STAYSHA SHAE | 37 BEECHWOOD AVE | WINFIELD | WV | 25213 | 37 BEECHWOOD AVE, WINFIELD, WV 25213 |
| 11647 | 1861223489 | MEEKS, SWAIZE REGAYNA SHELBY | 260 PEACHTREE ST NW STE 2200 | ATLANTA | GA | 30303 | 260 PEACHTREE ST NW STE 2200, ATLANTA, GA 30303 |
| 15093 | 1972355675 | HASSETT, KRISTIN M | 172 E SCHILLER ST | ELMHURST | IL | 60126 | 172 E SCHILLER ST, ELMHURST, IL 60126 |
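
Zip codes stored as integers lose any leading zero, and a missing city or state would appear as the text `nan` in the formatted address. A possible alternative is to build the address from string parts. The sketch below is only an illustration under those assumptions: `normalize_zip` and `build_clean_address` are hypothetical helpers, not functions from the notebook or from pandas.

```python
import math
import re
from typing import Optional


def _text(value) -> str:
    """Return a stripped string, or an empty string for None/NaN."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return str(value).strip()


def normalize_zip(value) -> Optional[str]:
    """Return the 5-digit zip code as a zero-padded string, or None if unusable."""
    if value is None:
        return None
    if isinstance(value, float):
        if math.isnan(value):
            return None
        value = int(value)
    digits = re.sub(r"\D", "", str(value))
    if not digits:
        return None
    if len(digits) > 5:
        # ZIP+4: restore a lost leading zero by padding to 9 digits, then keep the first 5
        return digits.zfill(9)[:5]
    return digits.zfill(5)


def build_clean_address(street, city, state, zip_code) -> str:
    """Join the non-missing parts as 'Street, City, State Zip'."""
    locality = ", ".join(p for p in (_text(street), _text(city)) if p)
    region = " ".join(p for p in (_text(state), normalize_zip(zip_code) or "") if p)
    return ", ".join(p for p in (locality, region) if p)


print(normalize_zip(606112345.0))  # 60611
print(normalize_zip(2134))         # 02134
print(build_clean_address("101 S NEWELL DRIVE", "GAINESVILLE", "FL", 32611))
```

Such a helper could be applied row by row with `df.apply(..., axis=1)`, in the same way the `Clean_address` column is built above.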


<font color=#1DA1F2>

**Batch Geocoding with Azure Maps API**

In [9]:
# Prepare addresses for geocoding
addresses = df['Clean_address'].tolist()

# Geocode addresses using Azure Maps
all_results_azure = batch_geocode_azure(addresses, azure_maps_key)

# Add the geocoding results to the DataFrame. batch_geocode_azure returns one
# (address, latitude, longitude) tuple per input address, in the same order as `addresses`.
results_df = pd.DataFrame(all_results_azure, columns=['Clean_address', 'latitude', 'longitude'])
df['latitude'] = results_df['latitude'].to_numpy()
df['longitude'] = results_df['longitude'].to_numpy()

# Save the updated DataFrame to a new CSV file
# df.to_csv('data/npidata_pfile_AzureMaps_with_coordinates.csv', index=False)

# Display the first 5 rows
df.head()

Processing batches: 100%|██████████| 1/1 [00:02<00:00,  2.51s/it]


| | NPI | Full Name | Street | City | State | Zip Code | Clean_address | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|
| 27454 | 1821077272 | CURTIS, SHERYL L | 101 S NEWELL DRIVE | GAINESVILLE | FL | 32611 | 101 S NEWELL DRIVE, GAINESVILLE, FL 32611 | 29.65154 | -82.3434 |
| 4254 | 1013596584 | SEKARAN, KANITHRA C | 675 N SAINT CLAIR ST STE 21-100 | CHICAGO | IL | 60611 | 675 N SAINT CLAIR ST STE 21-100, CHICAGO, IL 6... | 41.89436 | -87.62267 |
| 23667 | 1124512272 | QUENTRILL, STAYSHA SHAE | 37 BEECHWOOD AVE | WINFIELD | WV | 25213 | 37 BEECHWOOD AVE, WINFIELD, WV 25213 | 38.52524 | -81.90097 |
| 11647 | 1861223489 | MEEKS, SWAIZE REGAYNA SHELBY | 260 PEACHTREE ST NW STE 2200 | ATLANTA | GA | 30303 | 260 PEACHTREE ST NW STE 2200, ATLANTA, GA 30303 | 33.76149 | -84.38771 |
| 15093 | 1972355675 | HASSETT, KRISTIN M | 172 E SCHILLER ST | ELMHURST | IL | 60126 | 172 E SCHILLER ST, ELMHURST, IL 60126 | 41.90088 | -87.93677 |
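
Since each address costs one request, it may help to geocode every distinct address only once and to check how many lookups succeeded. The following is a rough sketch under that assumption: `geocode_unique` and `success_rate` are hypothetical helpers, and the stub lookup reuses two coordinates from the output above in place of a real API call.

```python
from typing import Callable, Dict, List, Optional, Tuple

Coords = Tuple[Optional[float], Optional[float]]
Result = Tuple[str, Optional[float], Optional[float]]


def geocode_unique(addresses: List[str], geocode_one: Callable[[str], Coords]) -> List[Result]:
    """Look up each distinct address once and return results in input order."""
    cache: Dict[str, Coords] = {}
    for address in addresses:
        if address not in cache:
            cache[address] = geocode_one(address)
    return [(address, *cache[address]) for address in addresses]


def success_rate(results: List[Result]) -> float:
    """Fraction of results that have both a latitude and a longitude."""
    if not results:
        return 0.0
    hits = sum(1 for _, lat, lon in results if lat is not None and lon is not None)
    return hits / len(results)


# Stub lookup standing in for the real geocoder (an assumption for illustration)
known = {
    "101 S NEWELL DRIVE, GAINESVILLE, FL 32611": (29.65154, -82.3434),
    "172 E SCHILLER ST, ELMHURST, IL 60126": (41.90088, -87.93677),
}


def stub_geocode(address: str) -> Coords:
    return known.get(address, (None, None))


sample = [
    "101 S NEWELL DRIVE, GAINESVILLE, FL 32611",
    "172 E SCHILLER ST, ELMHURST, IL 60126",
    "101 S NEWELL DRIVE, GAINESVILLE, FL 32611",
    "UNKNOWN PLACEHOLDER ADDRESS",
]
results = geocode_unique(sample, stub_geocode)
print(f"{success_rate(results):.0%} of addresses geocoded")  # 75% of addresses geocoded
```

The result list has the same shape as the output of `batch_geocode_azure`, so it can be written back to the DataFrame in the same way.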
