## Extracting Data from the Master Health Facility Registry (KMHFR)

In this exercise, we will use the following Python libraries:

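- **Beautiful Soup**: For parsing HTML and XML documents.
- **Requests**: For making HTTP requests to retrieve data from the web.
- **json**: Python's built-in module for parsing and serializing JSON data.
- **Pandas**: For data manipulation and analysis.
- **GeoPandas**, **Shapely**, and **Folium**: For building geometries and mapping the facilities.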

### About the Master Health Facility Registry (KMHFR)

The **Kenya Master Health Facility Registry (KMHFR)** is an application that contains information on all health facilities and community units in Kenya. Each health facility and community unit is identified by a unique code, along with details describing its geographical location, administrative location, ownership, type, and the services offered.

### Accessing the Data

The data from KMHFR can be accessed through an API, with documentation available [here](https://mfl-api-docs.readthedocs.io/en/latest/).

However, in this exercise, we will use **Beautiful Soup** to scrape the data from the KMHFR website. 


We'll be working with the following base URL, where `{}` is a placeholder for the page number:

```python
base_url = "https://kmhfr.health.go.ke/public/facilities?page={}"
```

**NB: Ethical Considerations of Web Scraping**

Web scraping is a powerful tool for extracting data from websites, but it comes with ethical and legal responsibilities. When scraping data from websites, it is important to:
- **Respect the website's robots.txt file**: This file specifies which parts of the site should not be accessed by automated tools. Ignoring this can lead to legal issues or your IP being blocked.
- **Avoid overloading the server**: Sending too many requests in a short period can strain the website's server, potentially causing disruptions in service.
- **Use data responsibly**: Ensure that the data you scrape is used in accordance with the terms of service of the website, and respect privacy and copyright laws.
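
As a minimal sketch of the first point, Python's standard-library `urllib.robotparser` can check whether a URL is permitted before it is requested. The rules and the `example.org` URLs below are hypothetical and only illustrate the mechanics; they are not the real KMHFR `robots.txt`. The helper name `is_allowed` is also an assumption made for this example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse rules that are already in memory, so no network call is made here
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only
sample_rules = [
    "User-agent: *",
    "Disallow: /admin/",
]

print(is_allowed(sample_rules, "*", "https://example.org/public/facilities?page=1"))
print(is_allowed(sample_rules, "*", "https://example.org/admin/users"))
```

Against a live site, the same parser can load the real file with `parser.set_url(...)` followed by `parser.read()` instead of `parser.parse(...)`.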


In this exercise, web scraping is used solely for demonstration purposes. The goal is to illustrate how web scraping works and how data can be extracted and processed using Python libraries. In practice, it is recommended to use official APIs when available, as they are designed to provide access to data in a controlled and ethical manner.

### Import necessary Libraries

In [1]:
import requests  # Used for making HTTP requests to fetch data from websites or APIs

from bs4 import BeautifulSoup as bs  # Used for parsing HTML and XML documents, enabling easy extraction of data from web pages

import pandas as pd  # Pandas is used for data manipulation and analysis, providing data structures like DataFrames to handle tabular data

import json  # JSON library allows for parsing JSON strings and converting Python objects to JSON format

import time  # Provides functions for time-related tasks like adding delays or measuring execution time

import sys  # Provides access to system-specific parameters and functions for interacting with the Python runtime environment

import geopandas as gpd  # Extends pandas to handle geospatial data, enabling operations and analysis on geographic objects

from shapely.geometry import Point  # Used for creating and manipulating geometric objects, such as points, lines, and polygons in 2D space

import matplotlib.pyplot as plt  # A plotting library used to generate static, animated, and interactive visualizations in Python

import folium  # Used for creating interactive maps using Leaflet.js, making it easy to visualize geospatial data

from folium.plugins import HeatMap  # Allows for the creation of heat maps to visualize the density of data points on a map


## Data Scraping

##### Generate the URL for every page

In [None]:
# Define the base URL for the KMHFR data. The placeholder `{}` will be replaced with page numbers.
base_url = "https://kmhfr.health.go.ke/public/facilities?page={}"

# Generate one URL per page for pages 1 to 494 (the upper bound of range() is exclusive).
urls = [base_url.format(page) for page in range(1, 495)]

# Print the number of URLs and the first few to verify them.
print(len(urls))
print(urls[:3])
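
The page range above is hard-coded. One possible alternative, sketched here, is to derive the number of pages from the total record count and the page size. The record count of 14805 is the row count reported for the combined DataFrame in this notebook; the page size of 30 is an assumption inferred from 14805 rows spread over 494 pages, and `page_count` is a hypothetical helper.

```python
import math


def page_count(total_records, page_size):
    """Number of pages needed to hold total_records at page_size records per page."""
    return math.ceil(total_records / page_size)


base_url = "https://kmhfr.health.go.ke/public/facilities?page={}"

total_records = 14805  # row count of the combined DataFrame in this notebook
page_size = 30         # ASSUMPTION: inferred from 14805 rows over 494 pages

n_pages = page_count(total_records, page_size)
urls = [base_url.format(page) for page in range(1, n_pages + 1)]

print(n_pages)
print(urls[-1])
```

If the registry grows, only `total_records` needs updating, which avoids silently missing new pages.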


##### Fetch the JSON data for each URL and combine it into a single DataFrame

In [4]:
# Function to fetch JSON data
def fetch_json_data(url, headers):
    """
    Fetch JSON data from a given URL and headers.
    
    Parameters:
        url (str): The URL to fetch data from.
        headers (dict): The headers to use for the request.
        
    Returns:
        dict or None: The parsed JSON data, or None if the page has no
        `__NEXT_DATA__` script tag.
    """
    # Send a GET request to the URL with the provided headers;
    # the timeout keeps a stalled connection from hanging the whole loop
    response = requests.get(url, headers=headers, timeout=30)
    
    # Parse the response content using BeautifulSoup
    soup = bs(response.text, "html.parser")
    
    # Find the script tag containing the JSON data
    script_tag = soup.find('script', id='__NEXT_DATA__')
    
    if script_tag and script_tag.string:
        # Load and return the JSON data
        data = json.loads(script_tag.string)
        return data
    return None

# Function to convert JSON data to DataFrame
def json_to_dataframe(json_data):
    """
    Convert JSON data to a DataFrame.
    
    Parameters:
        json_data (dict): The JSON data to convert.
        
    Returns:
        DataFrame: A DataFrame containing the JSON data.
    """
    # Extract the results from the JSON data
    results = json_data.get('props', {}).get('pageProps', {}).get('data', {}).get('results', [])
    
    # Convert the results list to a DataFrame
    df = pd.DataFrame(results)
    return df

# Define the base URL for fetching data, with a placeholder for page numbers
base_url = "https://kmhfr.health.go.ke/public/facilities?page={}"

# Generate a list of URLs for each page from 1 to 494
urls = [base_url.format(page) for page in range(1, 495)]

# Define headers for HTTP requests to mimic a browser request.
# Only gzip and deflate are advertised because requests decodes them out of the box;
# br and zstd need optional extra packages to be decoded.
HEADERS = {
    "accept": "*/*",
    "accept-encoding": "gzip, deflate",
    "accept-language": "en-US,en;q=0.9,fr;q=0.8",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"
}

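# Collect one DataFrame per page; concatenating once at the end avoids
# re-copying the accumulated data on every iteration
frames = []

# Loop through each URL and fetch data
for url in urls:
    data = fetch_json_data(url, HEADERS)
    if data:
        # Convert JSON data to DataFrame and keep it for the final concatenation
        frames.append(json_to_dataframe(data))
    # Pause between requests to avoid overloading the server
    time.sleep(1)

# Combine all pages into a single DataFrame
all_data = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()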
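
The extraction logic can be tried without contacting the server. The sketch below runs the same steps (find the `__NEXT_DATA__` script tag, parse its JSON, walk down to `results`) on a small inline HTML string. The HTML and the facility record in it are made up for illustration, and `extract_results` is a hypothetical helper that combines the two functions above.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical, heavily trimmed page; the facility record is invented
sample_html = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"data": {"results": [{"name": "Example Clinic", "beds": 4}]}}}}
</script>
</body></html>
"""


def extract_results(html):
    """Return the list of records embedded in the page, or [] if none is found."""
    soup = BeautifulSoup(html, "html.parser")
    script_tag = soup.find("script", id="__NEXT_DATA__")
    if script_tag is None or not script_tag.string:
        return []
    payload = json.loads(script_tag.string)
    # Walk the nested dictionaries, falling back to empty containers at each level
    return payload.get("props", {}).get("pageProps", {}).get("data", {}).get("results", [])


records = extract_results(sample_html)
print(records)
```

Testing against a saved or synthetic page like this makes it possible to debug the parsing step before sending any requests.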



In [None]:
# Print the shape of the DataFrame (number of rows and columns)
print(f"DataFrame shape: {all_data.shape}")

# Print the number of rows in the DataFrame
print(f"Number of rows: {all_data.shape[0]}")

# Print the number of columns in the DataFrame
print(f"Number of columns: {all_data.shape[1]}")

# Print the column names in the DataFrame
print(f"Column names: {all_data.columns.tolist()}")

# Check for null values in each column
print("\nNull values per column:")
print(all_data.isnull().sum())


DataFrame shape: (14805, 47)
Number of rows: 14805
Number of columns: 47

### Save the DataFrame to a local directory

In [6]:
# Save the combined DataFrame to a CSV file, creating the output directory if needed
import os
os.makedirs("data", exist_ok=True)
all_data.to_csv("data/all_facilities.csv", index=False)

In [None]:
all_data.columns

In [None]:
all_data.county_name.unique()

In [None]:
null_counts = all_data.isnull().sum()
print("Null values per column:")
print(null_counts)

## Data Visualization


In [None]:
all_data.columns

### Convert to a GeoDataFrame

In [14]:
# Drop rows where 'lat' or 'long' columns have null values
all_data_cleaned = all_data.dropna(subset=['lat', 'long'])

In [None]:
# Create a GeoDataFrame by converting longitude and latitude to geometry points.
# Shapely points are (x, y), i.e. (longitude, latitude), so 'long' must come first.
geometry = [Point(xy) for xy in zip(all_data_cleaned['long'], all_data_cleaned['lat'])]
gdf = gpd.GeoDataFrame(all_data_cleaned, geometry=geometry)

# Set the coordinate reference system (CRS) to WGS84 (EPSG:4326)
gdf.set_crs(epsg=4326, inplace=True)
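
Because point geometries are ordered (longitude, latitude), swapped columns place facilities far outside Kenya. A quick plausibility check on the raw columns can catch this before mapping. The sketch below is one way to do it; the bounding box is an approximate assumption, `within_kenya` is a hypothetical helper, and the two sample rows are invented.

```python
import pandas as pd

# ASSUMPTION: a generous, approximate bounding box for Kenya in decimal degrees
KENYA_BOUNDS = {"lat": (-5.0, 5.5), "long": (33.5, 42.0)}


def within_kenya(df):
    """Return a boolean Series that is True where lat/long fall inside the box."""
    # Coerce to numbers so that strings or malformed values become NaN (treated as False)
    lat = pd.to_numeric(df["lat"], errors="coerce")
    lon = pd.to_numeric(df["long"], errors="coerce")
    lat_ok = lat.between(*KENYA_BOUNDS["lat"])
    lon_ok = lon.between(*KENYA_BOUNDS["long"])
    return lat_ok & lon_ok


# Invented rows: one plausible point and one with latitude and longitude swapped
sample = pd.DataFrame({
    "name": ["Plausible point", "Swapped coordinates"],
    "lat": [-1.29, 36.82],
    "long": [36.82, -1.29],
})

mask = within_kenya(sample)
print(sample[~mask])
```

Applied to `all_data_cleaned`, rows where the mask is False are candidates for swapped or mistyped coordinates and are worth inspecting before they are plotted.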

In [None]:
gdf.columns

In [None]:
# Print the number of rows in the GeoDataFrame
print(f"Number of rows: {gdf.shape[0]}")
gdf.explore(
    column='facility_type_category',  # Column to color by
    cmap='viridis',            # Colormap
    legend=True,               # Show legend
    legend_kwds={
        'caption': 'Facility type category',  # Caption shown with the legend
    },
    tooltip='keph_level_name', # Column shown when hovering over a point
    style_kwds=dict(radius=4, fillOpacity=0.7)  # Style options
)


In [None]:
gdf.columns

### Drop unnecessary columns

In [None]:
# Optionally drop columns that are not needed (left commented out here)
# gdf = gdf.drop(columns=['county', 'constituency', 'ward'])
gdf

In [18]:
# Define the list of columns you want to display in the tooltip
tooltip_columns = ['name',  'beds', 'cots', 'search', 
       'operation_status', 'operation_status_name', 'admission_status_name',
       'open_whole_day', 'open_public_holidays', 'open_weekends',
       'open_late_night'] 

# Generate the interactive map with detailed tooltip
m = gdf.explore(
    column='facility_type_category',  # Column to color by
    cmap='viridis',            # Colormap
    legend=True,               # Show legend
    tooltip=tooltip_columns,   # List of columns to show in the tooltip
    style_kwds=dict(radius=4, fillOpacity=0.7)  # Style options
)

# Save the map as an HTML file
m.save('interactive_map.html')