# Web Scraping Information for Maricopa County, AZ HIV Testing Sites

Data science offers many tools that could make public health research more efficient. For example, I am helping a professor gather data for follow-up analyses of a study they published this year. One of those datasets is a database of all the HIV testing and PrEP provider sites in the US. I was originally going to build the spreadsheet by hand, which is very time-consuming and error-prone. In my self-directed study of data science, I had run into the idea of web scraping but hadn't tackled it because I didn't feel prepared. This time, I decided to look up tutorials on web scraping and try it to save myself time.

I found a great article from [Towards Data Science](https://towardsdatascience.com/how-to-web-scrape-with-python-in-4-minutes-bc49186a8460) on implementing web scraping in Python. The tutorial uses the BeautifulSoup library to perform the web scraping. I decided to follow the tutorial to build that spreadsheet just for the HIV testing sites in Maricopa County, Arizona, as a test before scaling up to all sites from [PrEP Locator](https://preplocator.org/).

## Inspecting the Web Page

The first step, according to the article, is to open the website and inspect its underlying HTML. I inspected the AIDSVu website. The inspection shows all the headings and subheadings in the HTML, and under the subheadings the text I need is grouped together and organized.

## Importing Libraries

The following code was borrowed from the tutorial website.

In [1]:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

As I was attempting the first step of requesting the URL, I learned that I needed another way to extract the results you get when you use the web app. Posts on Stack Overflow pointed me toward the JSON (JavaScript Object Notation) that a page requests from its API (Application Programming Interface). When a page uses an embedded web app, the results are typically returned as JSON or XML rather than written into the page's HTML.

I decided to follow the web scraping tutorial found [here](http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/). In the developer tools panel that appears when you inspect the webpage, go to 'Network' and filter for XHR (XMLHttpRequest). After searching a few different counties in the web app, I was able to find the request that returns the results. Its name starts with "admin-ajax.php...". I then copied the request URL and ran the suggested code from the tutorial.

But I then learned that websites state their permissions for web scraping in their robots.txt file, and AIDSVu's did not allow it. So I reached a dead end in trying to scrape the AIDSVu website. Luckily, from looking around the code, it seems I may have access to the same dataset through a different website, so I decided to scrape the [PrEP Locator website](https://preplocator.org/), whose permissions allow me to web scrape.
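A minimal sketch of how a robots.txt check might look with Python's standard library. The rules and the `example.org` URLs below are hypothetical and written only for illustration; they are not copied from AIDSVu, PrEP Locator, or any other real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, made up for this example.
example_rules = """
User-agent: *
Disallow: /private/
Allow: /api/
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(example_rules)  # parse rules from a list of lines, no network call

# can_fetch(useragent, url) reports whether the rules permit fetching that URL
private_ok = parser.can_fetch("*", "https://example.org/private/data.json")
api_ok = parser.can_fetch("*", "https://example.org/api/organization")

print("private path allowed:", private_ok)
print("api path allowed:", api_ok)
```

For a real site, `set_url()` pointed at the site's robots.txt followed by `read()` would load the live rules instead of the hard-coded ones.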

I followed the same initial steps: inspecting the website after generating results and finding the file that contains the dataset I am trying to extract.

In [2]:
search_url = 'https://npin.cdc.gov/api/organization/proximity?prox%5Borigin%5D=33.2917968%2C-112.42914639999998&prox%5Bdistance%5D=50&svc_care=Pre-Exposure%20Prophylaxis%20(PrEP)'

response = requests.get(search_url, headers={'User-Agent': 'Edge/18.17763'})
maricopacounty = BeautifulSoup(response.text,"html.parser")
print(maricopacounty)

[{"field_org_nid":"331919","field_org_id":"331919","title_field":"CVS MinuteClinic","field_organization_name_2":"","field_organization_name_3":"","field_org_street1":"2840 N Dysart Rd","field_org_street2":"","field_org_city_name":"Goodyear","field_org_county":"","field_org_state":"AZ","field_org_zipcode":"85395","field_org_country":"United States","field_org_phone":"\n\n  \n  (866) 389-2727 (Main)\n\n","field_org_lat_long":"33.478177300000, -112.341833300000","field_org_distance":"13.82","field_org_last_updated":"","field_org_svc_testing":"Chlamydia Testing, Conventional Blood HIV Testing, Gonorrhea Testing, Hepatitis B Testing, Hepatitis C Testing, Syphilis Testing, TB Testing","field_org_svc_prevention":"","field_org_svc_capacity":"","field_org_svc_care":"Adult Hepatitis B Vaccine, Hepatitis A Vaccine, Human Papillomavirus Vaccine, Post Exposure Prophylaxis, Pre-Exposure Prophylaxis (PrEP), STD Treatment","field_org_svc_support":"","field_audiences":"General Public","field_organizati
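The request URL above is easier to read once its query string is decoded. The sketch below uses only the standard library and makes no network call. What each parameter means is an assumption based on its name: `prox[origin]` looks like a latitude/longitude pair, and `prox[distance]` looks like a search radius whose units are not confirmed here.

```python
from urllib.parse import urlsplit, parse_qs

# The same URL as in the cell above, split across lines for readability.
search_url = (
    "https://npin.cdc.gov/api/organization/proximity"
    "?prox%5Borigin%5D=33.2917968%2C-112.42914639999998"
    "&prox%5Bdistance%5D=50"
    "&svc_care=Pre-Exposure%20Prophylaxis%20(PrEP)"
)

parts = urlsplit(search_url)
params = parse_qs(parts.query)  # percent-decoding happens here

for name, values in params.items():
    print(name, "=", values[0])
```

Seeing the parameters separately makes it clearer which pieces would have to change to search a different area.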

The text stored in 'maricopacounty' looks very similar to what I've learned about in the intro data science courses I've been taking. It's JSON: a list of dictionaries, one per site! I can now create a clean dataset where I extract the variables of interest.
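One quick way to confirm the shape of the response is to parse a small piece of it. The sample below is a shortened copy of the first record in the output above, with most fields left out. The outer square brackets make it a list, and each element inside is a dictionary.

```python
import json

# A shortened copy of the first record from the response above;
# most fields are omitted to keep the example small.
sample_text = (
    '[{"field_org_nid":"331919","title_field":"CVS MinuteClinic",'
    '"field_org_city_name":"Goodyear","field_org_state":"AZ",'
    '"field_org_zipcode":"85395"}]'
)

records = json.loads(sample_text)

print(type(records).__name__)     # list
print(type(records[0]).__name__)  # dict
print(records[0]["title_field"])  # CVS MinuteClinic
```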

## Cleaning the Dictionaries

Because of the format of the data, I will need to import some additional libraries to output the information into a more manageable data format.

In [3]:
import pandas as pd
import json

maricopacountyjs = json.loads(response.text)
maricopacountydf = pd.DataFrame(maricopacountyjs)
pd.options.display.max_columns = 40
maricopacountydf.head()

Unnamed: 0,field_audiences,field_npin_link,field_org_city_name,field_org_country,field_org_county,field_org_distance,field_org_emails,field_org_fee,field_org_id,field_org_last_updated,field_org_lat_long,field_org_nid,field_org_phone,field_org_state,field_org_street1,field_org_street2,field_org_svc_capacity,field_org_svc_care,field_org_svc_prevention,field_org_svc_support,field_org_svc_testing,field_org_type,field_org_websites,field_org_zipcode,field_organization_eligibilty,field_organization_hours,field_organization_languages,field_organization_name_2,field_organization_name_3,last_updated,title_field
0,General Public,https://npin.cdc.gov/node/331919,Goodyear,United States,,13.82,,"lowcost,Fee,Insurance Accepted,Medicare Accept...",331919,,"33.478177300000, -112.341833300000",331919,\n\n \n (866) 389-2727 (Main)\n\n,AZ,2840 N Dysart Rd,,,"Adult Hepatitis B Vaccine, Hepatitis A Vaccine...",,,"Chlamydia Testing, Conventional Blood HIV Test...","CVS Partner,",https://www.cvs.com/minuteclinic||,85395,Please visit the website or call for eligibili...,"Monday, 8:30am To 7:30pm, Tuesday, 8:30am To 7...","English, Interpretation Services Available for...",,,5/20/19,CVS MinuteClinic
1,General Public,https://npin.cdc.gov/node/332107,Laveen,United States,,16.25,,"lowcost,Fee,Insurance Accepted,Medicare Accept...",332107,,"33.378318200000, -112.167348000000",332107,\n\n \n (866) 389-2727 (Main)\n\n,AZ,5050 W Baseline Rd,,,"Adult Hepatitis B Vaccine, Hepatitis A Vaccine...",,,"Chlamydia Testing, Conventional Blood HIV Test...","CVS Partner,",https://www.cvs.com/minuteclinic||,85339,Please visit the website or call for eligibili...,"Monday, 8:30am To 7:30pm, Tuesday, 8:30am To 7...","English, Interpretation Services Available for...",,,5/20/19,CVS MinuteClinic
2,General Public,https://npin.cdc.gov/node/332247,Phoenix,United States,,16.9,,"lowcost,Fee,Insurance Accepted,Medicare Accept...",332247,,"33.507588000000, -112.290886700000",332247,\n\n \n (866) 389-2727 (Main)\n\n,AZ,10707 W Camelback Rd,,,"Adult Hepatitis B Vaccine, Hepatitis A Vaccine...",,,"Chlamydia Testing, Conventional Blood HIV Test...","CVS Partner,",https://www.cvs.com/minuteclinic||,85037,Please visit the website or call for eligibili...,"Monday, 8:00am To 7:00pm, Tuesday, 8:00am To 7...","English, Interpretation Services Available for...",,,5/20/19,CVS MinuteClinic
3,"General Public, Adults, Adolescents/Youth/Teen...",https://npin.cdc.gov/node/308476,Phoenix,United States,Maricopa,20.99,education@ppaz.org,"lowcost,Donations Accepted,Fee,Insurance Accep...",111059,2017-11-03 00:00,"33.504887400000, -112.169754500000",308476,"\n\n \n (602) 277-7526 (Main)\n\n, \n\n \n ...",AZ,4616 N 51st Ave,Ste 210,,"Family Planning, Gynecological Care, Adult Hep...","Materials – Print/Audiovisual, Condom/Female C...",,"Conventional Blood HIV Testing, Rapid Blood HI...","Clinic,Social Service Organization",https://www.plannedparenthood.org/planned-pare...,85031,"If you are uninsured, you may qualify for a st...","Monday,9:00am To 7:00pm, Tuesday,9:00am To 7:0...","English, Interpretation Services Available for...",Maryvale Health Center,,11/12/18,Planned Parenthood Arizona Incorporated
4,General Public,https://npin.cdc.gov/node/332054,Surprise,United States,,23.09,,"lowcost,Fee,Insurance Accepted,Medicare Accept...",332054,,"33.624703500000, -112.393150700000",332054,\n\n \n (866) 389-2727 (Main)\n\n,AZ,15474 W Greenway Rd,,,"Adult Hepatitis B Vaccine, Hepatitis A Vaccine...",,,"Chlamydia Testing, Conventional Blood HIV Test...","CVS Partner,",https://www.cvs.com/minuteclinic||,85374,Please visit the website or call for eligibili...,"Monday, 8:30am To 7:30pm, Tuesday, 8:30am To 7...","English, Interpretation Services Available for...",,,5/20/19,CVS MinuteClinic


As you can see, pandas turns the parsed JSON into a beautiful table of all of our data! This is a much nicer way to go through the data and see what's available, as well as potential areas that need to be cleaned, imputed, or deleted.

We have 45 entries/sites and 30 variables. 
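To spot which fields need cleaning, it can help to count how often each one is blank. The sketch below works on plain dictionaries and uses only three fields from the five rows displayed above. It is an illustration of the idea, not a check that was run on the full dataset.

```python
from collections import Counter

# Three fields from the five rows shown above (values copied from the table).
records = [
    {"title_field": "CVS MinuteClinic", "field_org_county": "", "field_org_emails": ""},
    {"title_field": "CVS MinuteClinic", "field_org_county": "", "field_org_emails": ""},
    {"title_field": "CVS MinuteClinic", "field_org_county": "", "field_org_emails": ""},
    {"title_field": "Planned Parenthood Arizona Incorporated",
     "field_org_county": "Maricopa", "field_org_emails": "education@ppaz.org"},
    {"title_field": "CVS MinuteClinic", "field_org_county": "", "field_org_emails": ""},
]

# Count, per field, how many records hold an empty (or whitespace-only) value.
empty_counts = Counter(
    field
    for record in records
    for field, value in record.items()
    if value.strip() == ""
)

print(len(records), "rows")
for field, count in empty_counts.most_common():
    print(field, "is empty in", count, "rows")
```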

One thing I noticed is that the variable "field_org_phone" is a bit odd: it carries over newline characters and extra whitespace from the original response. To clean that variable up, I'll remove the chunks before and after the phone number so that only the phone number is left in the field. I will also perform a similar task for "field_org_websites", which ends with a stray "||".

In [4]:
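# Remove newline characters from every string value in the dataframe
maricopacountydf = maricopacountydf.replace(r'\n','', regex=True)
# Remove the "||" separators from the website field (raw string so the backslashes reach the regex intact)
maricopacountydf['field_org_websites'] = maricopacountydf['field_org_websites'].replace(r"\|\|", "", regex=True)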

maricopacountydf.head()

Unnamed: 0,field_audiences,field_npin_link,field_org_city_name,field_org_country,field_org_county,field_org_distance,field_org_emails,field_org_fee,field_org_id,field_org_last_updated,field_org_lat_long,field_org_nid,field_org_phone,field_org_state,field_org_street1,field_org_street2,field_org_svc_capacity,field_org_svc_care,field_org_svc_prevention,field_org_svc_support,field_org_svc_testing,field_org_type,field_org_websites,field_org_zipcode,field_organization_eligibilty,field_organization_hours,field_organization_languages,field_organization_name_2,field_organization_name_3,last_updated,title_field
0,General Public,https://npin.cdc.gov/node/331919,Goodyear,United States,,13.82,,"lowcost,Fee,Insurance Accepted,Medicare Accept...",331919,,"33.478177300000, -112.341833300000",331919,(866) 389-2727 (Main),AZ,2840 N Dysart Rd,,,"Adult Hepatitis B Vaccine, Hepatitis A Vaccine...",,,"Chlamydia Testing, Conventional Blood HIV Test...","CVS Partner,",https://www.cvs.com/minuteclinic,85395,Please visit the website or call for eligibili...,"Monday, 8:30am To 7:30pm, Tuesday, 8:30am To 7...","English, Interpretation Services Available for...",,,5/20/19,CVS MinuteClinic
1,General Public,https://npin.cdc.gov/node/332107,Laveen,United States,,16.25,,"lowcost,Fee,Insurance Accepted,Medicare Accept...",332107,,"33.378318200000, -112.167348000000",332107,(866) 389-2727 (Main),AZ,5050 W Baseline Rd,,,"Adult Hepatitis B Vaccine, Hepatitis A Vaccine...",,,"Chlamydia Testing, Conventional Blood HIV Test...","CVS Partner,",https://www.cvs.com/minuteclinic,85339,Please visit the website or call for eligibili...,"Monday, 8:30am To 7:30pm, Tuesday, 8:30am To 7...","English, Interpretation Services Available for...",,,5/20/19,CVS MinuteClinic
2,General Public,https://npin.cdc.gov/node/332247,Phoenix,United States,,16.9,,"lowcost,Fee,Insurance Accepted,Medicare Accept...",332247,,"33.507588000000, -112.290886700000",332247,(866) 389-2727 (Main),AZ,10707 W Camelback Rd,,,"Adult Hepatitis B Vaccine, Hepatitis A Vaccine...",,,"Chlamydia Testing, Conventional Blood HIV Test...","CVS Partner,",https://www.cvs.com/minuteclinic,85037,Please visit the website or call for eligibili...,"Monday, 8:00am To 7:00pm, Tuesday, 8:00am To 7...","English, Interpretation Services Available for...",,,5/20/19,CVS MinuteClinic
3,"General Public, Adults, Adolescents/Youth/Teen...",https://npin.cdc.gov/node/308476,Phoenix,United States,Maricopa,20.99,education@ppaz.org,"lowcost,Donations Accepted,Fee,Insurance Accep...",111059,2017-11-03 00:00,"33.504887400000, -112.169754500000",308476,"(602) 277-7526 (Main), (800) 230-7526 ...",AZ,4616 N 51st Ave,Ste 210,,"Family Planning, Gynecological Care, Adult Hep...","Materials – Print/Audiovisual, Condom/Female C...",,"Conventional Blood HIV Testing, Rapid Blood HI...","Clinic,Social Service Organization",https://www.plannedparenthood.org/planned-pare...,85031,"If you are uninsured, you may qualify for a st...","Monday,9:00am To 7:00pm, Tuesday,9:00am To 7:0...","English, Interpretation Services Available for...",Maryvale Health Center,,11/12/18,Planned Parenthood Arizona Incorporated
4,General Public,https://npin.cdc.gov/node/332054,Surprise,United States,,23.09,,"lowcost,Fee,Insurance Accepted,Medicare Accept...",332054,,"33.624703500000, -112.393150700000",332054,(866) 389-2727 (Main),AZ,15474 W Greenway Rd,,,"Adult Hepatitis B Vaccine, Hepatitis A Vaccine...",,,"Chlamydia Testing, Conventional Blood HIV Test...","CVS Partner,",https://www.cvs.com/minuteclinic,85374,Please visit the website or call for eligibili...,"Monday, 8:30am To 7:30pm, Tuesday, 8:30am To 7...","English, Interpretation Services Available for...",,,5/20/19,CVS MinuteClinic
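The cleanup step above can also be tried on single values to see exactly what it removes. The two helper functions below are hypothetical names introduced for this sketch. They are a slightly stricter variant of the cell above: the phone helper collapses every run of whitespace rather than only newlines, and the website helper strips pipe characters only from the ends of the string. The sample values are copied from the raw output.

```python
import re

def clean_phone(raw: str) -> str:
    """Collapse each run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()

def clean_website(raw: str) -> str:
    """Trim whitespace, then strip '|' characters from both ends."""
    return raw.strip().strip("|")

raw_phone = "\n\n  \n  (866) 389-2727 (Main)\n\n"
raw_site = "https://www.cvs.com/minuteclinic||"

print(repr(clean_phone(raw_phone)))
print(repr(clean_website(raw_site)))
```

In pandas, functions like these could be applied to a whole column with `Series.map`.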


With that fixed, the next problem is the variable "field_org_county", where the values are missing or inconsistent. Because I searched for "Maricopa County" in the search engine, I want all the results that came from it to be coded as "Maricopa" in that field.

In [5]:
maricopacountydf['field_org_county'] = maricopacountydf['field_org_county'].replace(["","Maricopa County"],"Maricopa")

maricopacountydf.head()

Unnamed: 0,field_audiences,field_npin_link,field_org_city_name,field_org_country,field_org_county,field_org_distance,field_org_emails,field_org_fee,field_org_id,field_org_last_updated,field_org_lat_long,field_org_nid,field_org_phone,field_org_state,field_org_street1,field_org_street2,field_org_svc_capacity,field_org_svc_care,field_org_svc_prevention,field_org_svc_support,field_org_svc_testing,field_org_type,field_org_websites,field_org_zipcode,field_organization_eligibilty,field_organization_hours,field_organization_languages,field_organization_name_2,field_organization_name_3,last_updated,title_field
0,General Public,https://npin.cdc.gov/node/331919,Goodyear,United States,Maricopa,13.82,,"lowcost,Fee,Insurance Accepted,Medicare Accept...",331919,,"33.478177300000, -112.341833300000",331919,(866) 389-2727 (Main),AZ,2840 N Dysart Rd,,,"Adult Hepatitis B Vaccine, Hepatitis A Vaccine...",,,"Chlamydia Testing, Conventional Blood HIV Test...","CVS Partner,",https://www.cvs.com/minuteclinic,85395,Please visit the website or call for eligibili...,"Monday, 8:30am To 7:30pm, Tuesday, 8:30am To 7...","English, Interpretation Services Available for...",,,5/20/19,CVS MinuteClinic
1,General Public,https://npin.cdc.gov/node/332107,Laveen,United States,Maricopa,16.25,,"lowcost,Fee,Insurance Accepted,Medicare Accept...",332107,,"33.378318200000, -112.167348000000",332107,(866) 389-2727 (Main),AZ,5050 W Baseline Rd,,,"Adult Hepatitis B Vaccine, Hepatitis A Vaccine...",,,"Chlamydia Testing, Conventional Blood HIV Test...","CVS Partner,",https://www.cvs.com/minuteclinic,85339,Please visit the website or call for eligibili...,"Monday, 8:30am To 7:30pm, Tuesday, 8:30am To 7...","English, Interpretation Services Available for...",,,5/20/19,CVS MinuteClinic
2,General Public,https://npin.cdc.gov/node/332247,Phoenix,United States,Maricopa,16.9,,"lowcost,Fee,Insurance Accepted,Medicare Accept...",332247,,"33.507588000000, -112.290886700000",332247,(866) 389-2727 (Main),AZ,10707 W Camelback Rd,,,"Adult Hepatitis B Vaccine, Hepatitis A Vaccine...",,,"Chlamydia Testing, Conventional Blood HIV Test...","CVS Partner,",https://www.cvs.com/minuteclinic,85037,Please visit the website or call for eligibili...,"Monday, 8:00am To 7:00pm, Tuesday, 8:00am To 7...","English, Interpretation Services Available for...",,,5/20/19,CVS MinuteClinic
3,"General Public, Adults, Adolescents/Youth/Teen...",https://npin.cdc.gov/node/308476,Phoenix,United States,Maricopa,20.99,education@ppaz.org,"lowcost,Donations Accepted,Fee,Insurance Accep...",111059,2017-11-03 00:00,"33.504887400000, -112.169754500000",308476,"(602) 277-7526 (Main), (800) 230-7526 ...",AZ,4616 N 51st Ave,Ste 210,,"Family Planning, Gynecological Care, Adult Hep...","Materials – Print/Audiovisual, Condom/Female C...",,"Conventional Blood HIV Testing, Rapid Blood HI...","Clinic,Social Service Organization",https://www.plannedparenthood.org/planned-pare...,85031,"If you are uninsured, you may qualify for a st...","Monday,9:00am To 7:00pm, Tuesday,9:00am To 7:0...","English, Interpretation Services Available for...",Maryvale Health Center,,11/12/18,Planned Parenthood Arizona Incorporated
4,General Public,https://npin.cdc.gov/node/332054,Surprise,United States,Maricopa,23.09,,"lowcost,Fee,Insurance Accepted,Medicare Accept...",332054,,"33.624703500000, -112.393150700000",332054,(866) 389-2727 (Main),AZ,15474 W Greenway Rd,,,"Adult Hepatitis B Vaccine, Hepatitis A Vaccine...",,,"Chlamydia Testing, Conventional Blood HIV Test...","CVS Partner,",https://www.cvs.com/minuteclinic,85374,Please visit the website or call for eligibili...,"Monday, 8:30am To 7:30pm, Tuesday, 8:30am To 7...","English, Interpretation Services Available for...",,,5/20/19,CVS MinuteClinic


Our data is looking better and better. I've standardized the values in "field_org_county" so that later, when we aggregate all the counties into one massive CSV file, the field is not only filled in but also in a standard form that lends itself to easier analysis.
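When the same step is repeated for other counties, the county name would need to be a parameter rather than a hard-coded string. A minimal sketch of that idea follows. The function name `standardize_county` is hypothetical, and "Pinal" is a made-up input used only to show that other non-blank values are left unchanged.

```python
def standardize_county(value: str, searched_county: str) -> str:
    """Fill blanks with the searched county and drop a trailing ' County'."""
    value = value.strip()
    if value == "":
        return searched_county
    if value.endswith(" County"):
        return value[: -len(" County")]
    return value  # any other non-blank value is kept as-is

print(standardize_county("", "Maricopa"))                 # Maricopa
print(standardize_county("Maricopa County", "Maricopa"))  # Maricopa
print(standardize_county("Pinal", "Maricopa"))            # Pinal
```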

I will now remove variables that I don't think will be necessary: the country, distance, eligibility, and hours.

In [6]:
maricopacountydf = maricopacountydf.drop(["field_org_country","field_org_distance","field_organization_eligibilty","field_organization_hours"], axis=1)

maricopacountydf.head()

Unnamed: 0,field_audiences,field_npin_link,field_org_city_name,field_org_county,field_org_emails,field_org_fee,field_org_id,field_org_last_updated,field_org_lat_long,field_org_nid,field_org_phone,field_org_state,field_org_street1,field_org_street2,field_org_svc_capacity,field_org_svc_care,field_org_svc_prevention,field_org_svc_support,field_org_svc_testing,field_org_type,field_org_websites,field_org_zipcode,field_organization_languages,field_organization_name_2,field_organization_name_3,last_updated,title_field
0,General Public,https://npin.cdc.gov/node/331919,Goodyear,Maricopa,,"lowcost,Fee,Insurance Accepted,Medicare Accept...",331919,,"33.478177300000, -112.341833300000",331919,(866) 389-2727 (Main),AZ,2840 N Dysart Rd,,,"Adult Hepatitis B Vaccine, Hepatitis A Vaccine...",,,"Chlamydia Testing, Conventional Blood HIV Test...","CVS Partner,",https://www.cvs.com/minuteclinic,85395,"English, Interpretation Services Available for...",,,5/20/19,CVS MinuteClinic
1,General Public,https://npin.cdc.gov/node/332107,Laveen,Maricopa,,"lowcost,Fee,Insurance Accepted,Medicare Accept...",332107,,"33.378318200000, -112.167348000000",332107,(866) 389-2727 (Main),AZ,5050 W Baseline Rd,,,"Adult Hepatitis B Vaccine, Hepatitis A Vaccine...",,,"Chlamydia Testing, Conventional Blood HIV Test...","CVS Partner,",https://www.cvs.com/minuteclinic,85339,"English, Interpretation Services Available for...",,,5/20/19,CVS MinuteClinic
2,General Public,https://npin.cdc.gov/node/332247,Phoenix,Maricopa,,"lowcost,Fee,Insurance Accepted,Medicare Accept...",332247,,"33.507588000000, -112.290886700000",332247,(866) 389-2727 (Main),AZ,10707 W Camelback Rd,,,"Adult Hepatitis B Vaccine, Hepatitis A Vaccine...",,,"Chlamydia Testing, Conventional Blood HIV Test...","CVS Partner,",https://www.cvs.com/minuteclinic,85037,"English, Interpretation Services Available for...",,,5/20/19,CVS MinuteClinic
3,"General Public, Adults, Adolescents/Youth/Teen...",https://npin.cdc.gov/node/308476,Phoenix,Maricopa,education@ppaz.org,"lowcost,Donations Accepted,Fee,Insurance Accep...",111059,2017-11-03 00:00,"33.504887400000, -112.169754500000",308476,"(602) 277-7526 (Main), (800) 230-7526 ...",AZ,4616 N 51st Ave,Ste 210,,"Family Planning, Gynecological Care, Adult Hep...","Materials – Print/Audiovisual, Condom/Female C...",,"Conventional Blood HIV Testing, Rapid Blood HI...","Clinic,Social Service Organization",https://www.plannedparenthood.org/planned-pare...,85031,"English, Interpretation Services Available for...",Maryvale Health Center,,11/12/18,Planned Parenthood Arizona Incorporated
4,General Public,https://npin.cdc.gov/node/332054,Surprise,Maricopa,,"lowcost,Fee,Insurance Accepted,Medicare Accept...",332054,,"33.624703500000, -112.393150700000",332054,(866) 389-2727 (Main),AZ,15474 W Greenway Rd,,,"Adult Hepatitis B Vaccine, Hepatitis A Vaccine...",,,"Chlamydia Testing, Conventional Blood HIV Test...","CVS Partner,",https://www.cvs.com/minuteclinic,85374,"English, Interpretation Services Available for...",,,5/20/19,CVS MinuteClinic


## Exporting the Clean Dataframe into a CSV File

Now that we've gone in and cleaned up the dataframe, we're going to want to export that into a CSV file so that we can later stack all the county CSV files into one file.

In [7]:
maricopacountydf.to_csv('HIVTesting_Maricopa.csv', index=False)
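Stacking the per-county files later could look roughly like the sketch below, which uses only the standard library. The function name `stack_csv_files`, the `HIVTesting_*.csv` naming pattern, the output name `AllCounties.csv`, and the two tiny demonstration files are all assumptions made for illustration. The output name is chosen so that it does not match the input pattern.

```python
import csv
import glob
import os
import tempfile

def stack_csv_files(pattern: str, out_path: str) -> int:
    """Append every CSV matching `pattern` into one file; return the row count."""
    rows, fieldnames = [], []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)  # keep first-seen column order
            rows.extend(reader)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        # restval fills columns that a given county file did not have
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)

# Demonstration with two tiny made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    for county, site in [("Maricopa", "Site A"), ("Pima", "Site B")]:
        path = os.path.join(folder, f"HIVTesting_{county}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows([["title_field", "field_org_county"], [site, county]])
    out = os.path.join(folder, "AllCounties.csv")
    total = stack_csv_files(os.path.join(folder, "HIVTesting_*.csv"), out)
    with open(out, newline="", encoding="utf-8") as f:
        combined = list(csv.DictReader(f))

print(total, "rows written")
```

With pandas already loaded, reading each file and combining them with `pd.concat` would be another reasonable route.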

## Conclusion

In this project, I attempted to speed up a data collection process by web scraping. I learned a lot from just this short project, from web scraping permissions to the underlying scaffolding of websites and their accompanying APIs. Web scraping can be really useful for extracting data from web pages and online databases, especially in public health, where a website's data may not be available by request.

My next step is to try to find a way to pass a list of the 3,000 or so counties in the US into the API, extract the JSON data from all of the results, and compile them all into one file. 
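One possible starting point for that next step is a helper that rebuilds the request URL from a search origin, so the same query can be generated for each county. The sketch below makes no network calls. The function name `build_search_url` and the `county_origins` lookup table are hypothetical, and only the Maricopa coordinates come from the request used earlier. A real run would need a trusted source of coordinates for every county.

```python
from urllib.parse import urlencode, quote

API_BASE = "https://npin.cdc.gov/api/organization/proximity"

def build_search_url(lat: str, lon: str, distance: int = 50) -> str:
    """Build a proximity query shaped like the one used for Maricopa County."""
    params = {
        "prox[origin]": f"{lat},{lon}",
        "prox[distance]": distance,
        "svc_care": "Pre-Exposure Prophylaxis (PrEP)",
    }
    # quote (not quote_plus) encodes spaces as %20; safe="()" leaves parentheses as-is
    return API_BASE + "?" + urlencode(params, safe="()", quote_via=quote)

# Hypothetical lookup table of county -> search origin.
# Only the Maricopa entry is taken from the request above.
county_origins = {
    "Maricopa": ("33.2917968", "-112.42914639999998"),
}

for county, (lat, lon) in county_origins.items():
    print(county, build_search_url(lat, lon))
```

Each generated URL could then be passed to `requests.get` as in the earlier cell, ideally with a pause between requests (for example `time.sleep`) to avoid overloading the server.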

## Acknowledgements 

I would like to thank Google Search, the various posters on Stack Overflow, and the many people who have written articles about how to address certain data wrangling problems. 