<font color=#2DA2F2>

**Azure Maps API**
- **Batch Geocoding:** Keep the batch size at 50 addresses per request. This stays within the per-request limit of the Azure Maps batch API; check the current Azure Maps documentation for the exact limit that applies to your endpoint.
- **Efficiency:** Batching sends many addresses in a single request, which reduces the number of API calls compared with geocoding one address at a time and can speed up the process. A short delay between batch requests helps stay within rate limits.
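
To make the saving concrete: with 1,000 addresses and a batch size of 50, the loop issues 20 requests instead of 1,000. The sketch below shows only the chunking step. `chunked` is a hypothetical helper name, not part of any library used in this notebook, and the addresses are placeholders.

```python
import math

def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical placeholder addresses, used only to count requests
addresses = [f"address {i}" for i in range(1000)]
batches = list(chunked(addresses, 50))

print(len(batches))                    # number of batch requests
print(math.ceil(len(addresses) / 50))  # the same count, computed directly
```

The last batch is simply shorter when the number of addresses is not a multiple of the batch size.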


In [9]:
# import libraries
import os
import time
from urllib.parse import quote

import pandas as pd
import requests
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut, GeocoderServiceError
from tqdm import tqdm

# Read the Azure Maps subscription key from the environment (os.getenv returns None if it is unset)
azure_maps_key = os.getenv('azure_maps_key')

# Send one batch of addresses to the Azure Maps Search Address Batch API.
# Assumes the synchronous v1.0 endpoint, where each batch item is a URL query string
# and the response items come back in the same order as the request items.
def batch_geocode_azure(addresses, api_key):
    url = "https://atlas.microsoft.com/search/address/batch/sync/json"
    params = {'api-version': '1.0', 'subscription-key': api_key}
    failed = [(address, None, None) for address in addresses]

    # Prepare batch request payload, e.g. {"query": "?query=<encoded address>&limit=1"}
    batch_request = {
        "batchItems": [{"query": f"?query={quote(address)}&limit=1"} for address in addresses]
    }

    try:
        response = requests.post(url, params=params, json=batch_request, timeout=60)
        if response.status_code != 200:
            print(f"Batch request failed with status code: {response.status_code}")
            return failed
        items = response.json().get('batchItems', [])
        if len(items) != len(addresses):
            print(f"Expected {len(addresses)} batch items, received {len(items)}")
            return failed
        results = []
        # Pair each response item with the address at the same position in the request
        for address, item in zip(addresses, items):
            matches = item.get('response', {}).get('results', [])
            if item.get('statusCode') == 200 and matches:
                coordinates = matches[0]['position']
                results.append((address, coordinates['lat'], coordinates['lon']))
            else:
                results.append((address, None, None))
        return results
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return failed
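
The parsing step can be exercised without a network call or an API key. The sketch below assumes that the batch response holds a `batchItems` list whose entries come back in request order, each with a `statusCode` and a `response` containing `results`. `parse_batch_items` is a hypothetical helper, and the payload is hand-written for illustration rather than captured from the service.

```python
def parse_batch_items(addresses, payload):
    """Pair each input address with its first result position, or (None, None)."""
    rows = []
    items = payload.get("batchItems", [])
    for index, address in enumerate(addresses):
        # A missing item is treated the same way as a failed lookup
        item = items[index] if index < len(items) else {}
        results = item.get("response", {}).get("results", [])
        if item.get("statusCode") == 200 and results:
            position = results[0]["position"]
            rows.append((address, position["lat"], position["lon"]))
        else:
            rows.append((address, None, None))
    return rows

# Hand-written payload shaped like a batch response; the coordinates are illustrative only
sample_payload = {
    "batchItems": [
        {"statusCode": 200,
         "response": {"summary": {"numResults": 1},
                      "results": [{"position": {"lat": 42.1, "lon": -88.3}}]}},
        {"statusCode": 400, "response": {"error": {"code": "400 BadRequest"}}},
    ]
}
parsed = parse_batch_items(["first address", "second address"], sample_payload)
print(parsed)
```

Matching by position rather than by an echoed query string means a failed item still maps back to the address that produced it.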

<font color=#1DA1F2>

**Extract and Clean Addresses**

In [10]:
# Load the CSV file into a DataFrame
df = pd.read_csv('data/npidata_pfile.csv')

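# Keep the first 5 digits of the zip code and store it as a nullable integer.
# Values of 100000 or more are treated as 9-digit ZIP+4 codes, so the last 4 digits are dropped.
# Note: integer storage loses leading zeros (for example, 02134 is stored as 2134).
df['Zip Code'] = df['Zip Code'].apply(lambda x: x if pd.isna(x) else (int(x) // 10**4 if int(x) >= 10**5 else int(x))).astype('Int64')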
# Combine the provider names into a full name column in "Last, First Middle" format.
# A missing middle name is treated as an empty string so that it does not turn the whole name into NaN.
df['Full Name'] = (df['Provider Last Name'] + ', ' + df['Provider First Name'] + ' ' + df['Provider Middle Name'].fillna('')).str.strip()

df = df[['NPI', 'Full Name', 'Street', 'City', 'State', 'Zip Code']]
# drop rows that are missing the street or the full name
df = df.dropna(subset=['Street', 'Full Name'])
# drop duplicates
df = df.drop_duplicates()
# prepare the addresses for geocoding. Create new column for Clean_address
df['Clean_address'] = df.apply(lambda row: f"{row['Street']}, {row['City']}, {row['State']} {row['Zip Code']}", axis=1)
# print(df.shape)
# randomly sample 1000 rows
df = df.sample(1000)
df.sample(15)


index,NPI,Full Name,Street,City,State,Zip Code,Clean_address
14330,1770172058,"REALEGENO-ARELLANO, ANGIE ABLEEN",1605 ROCK PRAIRIE RD STE 315,COLLEGE STATION,TX,77845,"1605 ROCK PRAIRIE RD STE 315, COLLEGE STATION,..."
22635,1740012947,"EGHAREVBA, CHRISTY O",9 SPARROW RD,CARPENTERSVILLE,IL,60110,"9 SPARROW RD, CARPENTERSVILLE, IL 60110"
3948,1952024952,"LARREA, AMANDA R",12004 S ROUTE 59 UNIT 100,PLAINFIELD,IL,60585,"12004 S ROUTE 59 UNIT 100, PLAINFIELD, IL 60585"
1754,1053142075,"QUINN, MAUREEN K",128 PARKER AVE S,BROOKLET,GA,30415,"128 PARKER AVE S, BROOKLET, GA 30415"
26165,1366274482,"CALLENDER, ASHLEY ELIZABETH",590 HARDING AVE,MILLS,WY,82604,"590 HARDING AVE, MILLS, WY 82604"
3073,1285465153,"SMITH, ASHLEY S",1 ELIZABETH PL,DAYTON,OH,45417,"1 ELIZABETH PL, DAYTON, OH 45417"
11693,1649458365,"FELDER, MICHELLE RENAAY HARRIS",8432 107TH ST,RICHMOND HILL,NY,11418,"8432 107TH ST, RICHMOND HILL, NY 11418"
1462,1033222732,"KLASS, BRENDA M",16550 VENTURA BLVD,ENCINO,CA,91436,"16550 VENTURA BLVD, ENCINO, CA 91436"
3712,1316778210,"SHEN, ANGIE M",11040 BOLLINGER CANYON RD # 155,SAN RAMON,CA,94582,"11040 BOLLINGER CANYON RD # 155, SAN RAMON, CA..."
13496,1841227568,"ONUCHUKWU-AZUONYE, CHINONYE OGECHI",1118 FERRY ST,RICHMOND,TX,77469,"1118 FERRY ST, RICHMOND, TX 77469"
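
Because the ZIP code is kept as an integer, codes that begin with zero lose their leading digits before they reach `Clean_address`. One alternative is to normalize the value as a string. The sketch below assumes that any value with more than five digits is a nine-digit ZIP+4 code that may have lost leading zeros. `zip5` is a hypothetical helper, not something the notebook defines.

```python
def zip5(raw):
    """Return the five-digit ZIP as a string, or None when the value is missing."""
    if raw is None:
        return None
    if isinstance(raw, float):
        if raw != raw:  # NaN is the only float that is not equal to itself
            return None
        raw = int(raw)
    digits = "".join(ch for ch in str(raw) if ch.isdigit())
    if not digits:
        return None
    # Assumption: more than five digits means ZIP+4, so pad to nine digits before slicing
    width = 9 if len(digits) > 5 else 5
    return digits.zfill(width)[:5]

for value in (778451234, 60585.0, 2134, 21341234, "60110-1234", float("nan")):
    print(repr(value), "->", zip5(value))
```

Applied to the column, this would look like `df['Zip Code'].apply(zip5)`, at the cost of storing the ZIP as text rather than as an integer.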


<font color=#1DA1F2>

**Batch Geocoding with Azure Maps API**

In [11]:
# Prepare addresses for geocoding
addresses = df['Clean_address'].tolist()

# Send addresses in batches
batch_size = 50  # 50 addresses per batch request; check the Azure Maps documentation for the current per-request limit
batches = [addresses[i:i + batch_size] for i in range(0, len(addresses), batch_size)]
all_results_azure = []

for batch in tqdm(batches, desc="Processing batches"):
    results = batch_geocode_azure(batch, azure_maps_key)
    all_results_azure.extend(results)
    time.sleep(1)  # Add a small delay between batches to respect rate limits

# Create new columns in the existing DataFrame
df['latitude'] = None
df['longitude'] = None

# Update the DataFrame with geocoding results
for address, lat, lon in all_results_azure:
    mask = df['Clean_address'] == address
    df.loc[mask, 'latitude'] = lat
    df.loc[mask, 'longitude'] = lon

# Display the first few rows to verify
print(df[['Clean_address', 'latitude', 'longitude']].head())

# Save the updated DataFrame to a new CSV file
df.to_csv('data/npidata_pfile_AzureMaps_with_coordinates.csv', index=False)

df.sample(15)

Processing batches:   0%|          | 0/20 [00:00<?, ?it/s]

Batch request failed with status code: 401

[... the same message was printed for each of the 13 batches attempted ...]

Processing batches:  60%|██████    | 12/20 [00:13<00:09,  1.13s/it]


KeyboardInterrupt: 
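
A 401 status generally means the request was not authenticated, for example because the subscription key is missing or invalid. Since `os.getenv` returns `None` when the variable is unset, one possible cause is that the key was never exported in the environment that runs the notebook. A small check before the loop makes that case fail immediately instead of once per batch. This is a sketch: `require_env` is a hypothetical helper, and the variable name and value in the demonstration are dummies.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or blank."""
    value = os.getenv(name)
    if value is None or not value.strip():
        raise RuntimeError(f"Environment variable {name!r} is not set; export it before geocoding.")
    return value.strip()

# Demonstration with a hypothetical variable name and a dummy value
os.environ["DEMO_MAPS_KEY"] = "dummy-key"
print(require_env("DEMO_MAPS_KEY"))
```

In the notebook, the equivalent call would be `azure_maps_key = require_env('azure_maps_key')`. A key that is set but invalid would still produce a 401, so this check covers only the missing-key case.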

In [None]:
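# Prepare unique addresses for geocoding, so the merge below does not duplicate rows
# when several providers share the same address
addresses = df['Clean_address'].drop_duplicates().tolist()

# Send addresses in batches
batch_size = 50  # 50 addresses per batch request; check the Azure Maps documentation for the current per-request limit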
batches = [addresses[i:i + batch_size] for i in range(0, len(addresses), batch_size)]
all_results_azure = []

for batch in tqdm(batches, desc="Processing batches"):
    results = batch_geocode_azure(batch, azure_maps_key)
    all_results_azure.extend(results)
    time.sleep(1)  # Add a small delay between batches to respect rate limits

# Convert results to a DataFrame
geocoded_data_azure = pd.DataFrame(all_results_azure, columns=['Clean_address', 'latitude', 'longitude'])

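# Merge geocoded data with original DataFrame.
# Drop coordinate columns left over from an earlier run first, so the merge does not create latitude_x / latitude_y.
df = df.drop(columns=['latitude', 'longitude'], errors='ignore')
df = df.merge(geocoded_data_azure, on='Clean_address', how='left')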

# Display the first few rows to verify
print(df[['Clean_address', 'latitude', 'longitude']].head())

# Save the updated DataFrame to a new CSV file
df.to_csv('data/npidata_pfile_small_with_coordinates_azure.csv', index=False)

print("The file with the new columns added has been saved as npidata_pfile_small_with_coordinates_azure.csv")
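
The loops above pause for a fixed second between batches. If the service throttles requests anyway, a common alternative is to retry a failed call and wait longer after each attempt. The sketch below illustrates exponential backoff in isolation. `call_with_backoff` and its parameters are hypothetical, and `func` is assumed to return `None` on failure, which is not how `batch_geocode_azure` reports errors, so a thin wrapper would be needed to combine the two.

```python
import time

def call_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `func` until it returns a non-None value, doubling the wait after each failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        result = func()
        if result is not None:
            return result
        if attempt < max_attempts:
            sleep(delay)
            delay *= 2
    return None

# Stand-in for a request that fails twice and then succeeds
attempts = {"count": 0}

def flaky():
    attempts["count"] += 1
    return "ok" if attempts["count"] >= 3 else None

# Record the waits instead of sleeping, so the demonstration runs instantly
waits = []
result = call_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the helper easy to test; in real use the default `time.sleep` applies.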