# Feature Engineering Guide


<b>A how-to guide for creating machine learning features with PredictHQ's Features API</b>. <br>
<b>The aim of this notebook is to showcase how the Features API can be used to create features for a location and date range of your choice</b>.

- [Overview](#overview)
- [How to use event based features in your models](#usage)
- [The Features API features summary](#summary)
    - [Features for attendance and rank based events](#summary_attend_rank)
    - [Features for severe weather (retail only)](#summary_weather)
- [Setup](#setup)
- [Access token](#access_token)
- [SDK parameters](#setting_params)
- [Using the Features API to query features for forecasting](#features_api)
    - [Functions for formatting data frame](#functions)
    - [Attendance based features](#attend)
    - [Rank based features](#rank)
    - [Impact based features](#impact)
- [Using a longer date range](#wide_range)
    - [Common functions](#functions_wide)
    - [Attendance based features](#attend_wide)
    - [Rank based features](#rank_wide)
    - [Impact based features](#impact_wide)

<a id='overview'></a>
## Overview
Our customers find that factoring events into their demand forecasting improves the accuracy of their forecasts. For example, if events are occurring near a store, hotel, parking garage, or any other location, they can impact demand. Customers use our events for different types of demand forecasting including labor forecasting, inventory, dynamic pricing, and more.

The goal of the Feature Engineering Guide notebook is to help data science teams understand and get hands-on experience querying different event-based features from PredictHQ's Features API. This guide outlines, with clear and simple examples, recommended features per event category. Data science teams can use these examples to create features easily and include them in their own demand forecasting models or any other applicable models.

See the [API reference guide for the Features API](https://docs.predicthq.com/resources/features) for more information on the Features API. For an overview of categories see the [categories page](https://www.predicthq.com/intelligence/data-enrichment/event-categories), and for more background see the documentation on [PredictHQ's event categories](https://docs.predicthq.com/categoryinfo/introduction). This guide uses [PredictHQ's SDK](https://docs.predicthq.com/sdks/python) to access the Features API.

We also provide other [data science guides](https://docs.predicthq.com/datascience/introduction) that go into more detail on different categories, for example [Attendance Based Events](https://docs.predicthq.com/categoryinfo/attended-events). Each of these guides consists of a set of notebooks. Refer to them if you want to do more in-depth research on specific categories.

<a id='usage'></a>
## How to use event based features in your models

This guide provides details on how to create event based features using the SDK for the Features API. It assumes you are familiar with adding ML features to your models. Features are often used in demand forecasting models to forecast future demand but can be used in other types of models as well.

The features in this guide are designed to be used in a forecasting solution where you can forecast demand for each location. For example, you may forecast demand for a set of retail stores, hotels, restaurants, parking garages, or other types of locations. In this scenario you have a set of event-based features that have a value per location and these features are used in your model to adjust the forecast for that location.

For example, the process for using event based features may look as follows:
1. Review the different features in this guide and the other related documentation.
2. Data exploration - you may want to conduct data exploration using a sample data set around one of your locations.
3. Events feature testing - you may want to get a training data set for a location or locations (see below on using a longer date range), and conduct model R&D with events features using techniques such as model selection and cross validation.
4. Model evaluation - you may want to compare the model accuracy on historical data before adding the event features and after adding the event features.
5. Update your model to use the new features - get your production model ready for deployment.
6. Work with your engineering team to add the features to your production pipeline - the Features API can be used by your production pipeline.
7. Deploy your new model in production with the new event-based features.

<p>The diagram below gives a simple overview of how you can integrate event-based features into your system. For a production implementation, we suggest that you store or cache the features from the Features API rather than calling the API directly each time your model needs them. This makes your implementation more robust, as calling services over the internet always carries an element of risk. Then update your copy of the features regularly. For example, if you run your model once per day to update your forecast, you may want to refresh your copy of the features just before your forecast runs.</p>

<img src="./features-engineering-architecture-diagram.png">
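
The caching advice above can be sketched roughly as follows. This is a minimal illustration rather than a recommended implementation: `load_or_refresh`, the JSON file cache, and the `fetch_features` callable are all hypothetical names introduced here. In practice the callable would wrap your own Features API query.

```python
import json
import time
from pathlib import Path


def load_or_refresh(cache_path, fetch_features, max_age_seconds=24 * 3600):
    """Return cached features, refetching only when the cache is missing or stale.

    All names here are hypothetical. fetch_features is any zero-argument
    callable that returns JSON-serializable rows.
    """
    path = Path(cache_path)
    if path.exists():
        age = time.time() - path.stat().st_mtime
        if age < max_age_seconds:
            # Cache is fresh: serve the local copy without a remote call.
            return json.loads(path.read_text())
    try:
        rows = fetch_features()
    except Exception:
        # If the remote call fails, fall back to a stale copy when one exists.
        if path.exists():
            return json.loads(path.read_text())
        raise
    path.write_text(json.dumps(rows))
    return rows
```

Falling back to a stale copy is a design choice: a forecast built on yesterday's features is usually preferable to a failed run, but you may prefer to fail loudly instead.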


<a id='summary'></a>
## The Features API features summary
<p>Below is a summary of features from the Features API that you can try using in your models. The table shows the name of the feature, the Features API stat to use for the aggregation (e.g. sum is the total of the values on a given day), and any notes. Example code and details on using the features appear later in this guide.</p>


<a id='summary_attend_rank'></a>
### Features for attendance and rank based events
 
<table class="c28">
<tr class="c23"><td class="c22" colspan="1" rowspan="1"><p class="c5"><span class="c11"><strong>Category</strong></span></p></td><td class="c2" colspan="1" rowspan="1"><p class="c5"><span class="c11"><strong>Features API feature</strong></span></p></td><td class="c10" colspan="1" rowspan="1"><p class="c5"><span class="c11"><strong>Stats</strong></span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c11"><strong>Notes</strong></span></p></td></tr>
<tr class="c23"><td class="c22" colspan="1" rowspan="1"><p class="c5"><span class="c7">Community</span></p></td><td class="c2" colspan="1" rowspan="1"><p class="c5"><span class="c7">phq_attendance_community</span></p></td><td class="c10" colspan="1" rowspan="1"><p class="c5"><span class="c7">sum</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Choose a radius around your location</span></p></td></tr><tr class="c23"><td class="c22" colspan="1" rowspan="1"><p class="c5"><span class="c7">Concerts</span></p></td><td class="c2" colspan="1" rowspan="1"><p class="c5"><span class="c7">phq_attendance_concerts</span></p></td><td class="c10" colspan="1" rowspan="1"><p class="c5"><span class="c7">sum</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Choose a radius around your location</span></p></td></tr><tr class="c23"><td class="c22" colspan="1" rowspan="1"><p class="c5"><span class="c7">Conferences</span></p></td><td class="c2" colspan="1" rowspan="1"><p class="c5"><span class="c7">phq_attendance_conferences</span></p></td><td class="c10" colspan="1" rowspan="1"><p class="c5"><span class="c7">sum</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Choose a radius around your location</span></p></td></tr><tr class="c23"><td class="c22" colspan="1" rowspan="1"><p class="c5"><span class="c7">Expos</span></p></td><td class="c2" colspan="1" rowspan="1"><p class="c5"><span class="c7">phq_attendance_expos</span></p></td><td class="c10" colspan="1" rowspan="1"><p class="c5"><span class="c7">sum</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Choose a radius around your location</span></p></td></tr><tr class="c23"><td class="c22" colspan="1" rowspan="1"><p class="c5"><span class="c7">Festivals</span></p></td><td class="c2" colspan="1" rowspan="1"><p class="c5"><span class="c7">phq_attendance_festivals</span></p></td><td class="c10" 
colspan="1" rowspan="1"><p class="c5"><span class="c7">sum</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Choose a radius around your location</span></p></td></tr><tr class="c23"><td class="c22" colspan="1" rowspan="1"><p class="c5"><span class="c7">Performing Arts</span></p></td><td class="c2" colspan="1" rowspan="1"><p class="c5"><span class="c7">phq_attendance_performing_arts</span></p></td><td class="c10" colspan="1" rowspan="1"><p class="c5"><span class="c7">sum</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Choose a radius around your location</span></p></td></tr><tr class="c23"><td class="c22" colspan="1" rowspan="1"><p class="c5"><span class="c7">Sports</span></p></td><td class="c2" colspan="1" rowspan="1"><p class="c5"><span class="c7">phq_attendance_sports</span></p></td><td class="c10" colspan="1" rowspan="1"><p class="c5"><span class="c7">sum</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Choose a radius around your location</span></p></td></tr><tr class="c23"><td class="c22" colspan="1" rowspan="1"><p class="c5"><span class="c7">Observances</span></p></td><td class="c2" colspan="1" rowspan="1"><p class="c5"><span class="c7">phq_rank_observances</span></p></td><td class="c10" colspan="1" rowspan="1"><p class="c5"><span class="c7">n/a</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Radius can be set to&nbsp;1 mile to cover events in a region.</span></p></td></tr><tr class="c23"><td class="c22" colspan="1" rowspan="1"><p class="c5"><span class="c7">Public Holidays</span></p></td><td class="c2" colspan="1" rowspan="1"><p class="c5"><span class="c7">phq_rank_public_holidays</span></p></td><td class="c10" colspan="1" rowspan="1"><p class="c5"><span class="c7">n/a</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Radius can be set to&nbsp;1 mile to cover events in a 
region.</span></p></td></tr><tr class="c23"><td class="c22" colspan="1" rowspan="1"><p class="c5"><span class="c7">School Holidays</span></p></td><td class="c2" colspan="1" rowspan="1"><p class="c5"><span class="c7">phq_attendance_school_holidays</span></p></td><td class="c10" colspan="1" rowspan="1"><p class="c5"><span class="c7">sum</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Available for the US and UK only. Set the radius to 1 mile.</span></p></td></tr><tr class="c23"><td class="c22" colspan="1" rowspan="1"><p class="c5"><span class="c7">Academic Graduation</span></p></td><td class="c2" colspan="1" rowspan="1"><p class="c5"><span class="c7">phq_attendance_academic_graduation</span></p></td><td class="c10" colspan="1" rowspan="1"><p class="c5"><span class="c7">sum</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Choose a radius around your location</span></p></td></tr><tr class="c23"><td class="c22" colspan="1" rowspan="1"><p class="c5"><span class="c7">Academic Social</span></p></td><td class="c2" colspan="1" rowspan="1"><p class="c5"><span class="c7">phq_attendance_academic_social</span></p></td><td class="c10" colspan="1" rowspan="1"><p class="c5"><span class="c7">sum</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Choose a radius around your location</span></p></td></tr><tr class="c23"><td class="c22" colspan="1" rowspan="1"><p class="c5"><span class="c7">Academic Session</span></p></td><td class="c2" colspan="1" rowspan="1"><p class="c5"><span class="c7">phq_rank_academic_session</span></p></td><td class="c10" colspan="1" rowspan="1"><p class="c5"><span class="c7">n/a</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Choose a radius around your location</span></p></td></tr><tr class="c23"><td class="c22" colspan="1" rowspan="1"><p class="c5"><span class="c7">Academic Exam</span></p></td><td class="c2" 
colspan="1" rowspan="1"><p class="c5"><span class="c7">phq_rank_academic_exam</span></p></td><td class="c10" colspan="1" rowspan="1"><p class="c5"><span class="c7">n/a</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Choose a radius around your location</span></p></td></tr><tr class="c23"><td class="c22" colspan="1" rowspan="1"><p class="c5"><span class="c7">Academic Holiday</span></p></td><td class="c2" colspan="1" rowspan="1"><p class="c5"><span class="c7">phq_rank_academic_holiday</span></p></td><td class="c10" colspan="1" rowspan="1"><p class="c5"><span class="c7">n/a</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Choose a radius around your location</span></p></td></tr>
</table>

<a id='summary_weather'></a>
### Features for severe weather (retail only)
<p>The features below are for the retail industry only. The severe weather features use demand impact patterns, which calculate the impact duration of a severe weather event based on industry-specific information. Our severe weather features are currently designed and tested on data for the retail segment only. If your business is in an industry segment other than retail (e.g. accommodation or travel), the features below may not work for you or may be less effective.</p><p>

<table class="c28">
<tr class="c23"><td class="c22" colspan="1" rowspan="1"><p class="c5"><span class="c11"><strong>Category</strong></span></p></td><td class="c36" colspan="1" rowspan="1"><p class="c5"><span class="c11"><strong>Features API feature</strong></span></p></td><td class="c10" colspan="1" rowspan="1"><p class="c5"><span class="c11"><strong>Stats</strong></span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c11"><strong>Notes</strong></span></p></td></tr>
<tr class="c15"><td class="c25" colspan="1" rowspan="1"><p class="c5"><span class="c7">Severe Weather (Air quality)</span></p></td><td class="c19" colspan="1" rowspan="1"><p class="c5"><span class="c7">phq_impact_severe_weather_air_quality_retail</span></p></td><td class="c16" colspan="1" rowspan="1"><p class="c5"><span class="c7">max</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Retail only. Use a radius of 1 mile</span></p></td></tr><tr class="c15"><td class="c25" colspan="1" rowspan="1"><p class="c5"><span class="c7">Severe Weather (Blizzard)</span></p></td><td class="c19" colspan="1" rowspan="1"><p class="c5"><span class="c7">phq_impact_severe_weather_blizzard_retail</span></p></td><td class="c16" colspan="1" rowspan="1"><p class="c5"><span class="c7">max</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Retail only. Use a radius of 1 mile</span></p></td></tr><tr class="c15"><td class="c25" colspan="1" rowspan="1"><p class="c5"><span class="c7">Severe Weather (Cold wave)</span></p></td><td class="c19" colspan="1" rowspan="1"><p class="c14"><span class="c7">phq_impact_severe_weather_cold_wave_retail</span></p></td><td class="c16" colspan="1" rowspan="1"><p class="c5"><span class="c7">max</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Retail only. Use a radius of 1 mile</span></p></td></tr><tr class="c15"><td class="c25" colspan="1" rowspan="1"><p class="c5"><span class="c7">Severe Weather (Cold wave - snow)</span></p></td><td class="c19" colspan="1" rowspan="1"><p class="c14"><span class="c7">phq_impact_severe_weather_cold_wave_snow_retail</span></p></td><td class="c16" colspan="1" rowspan="1"><p class="c5"><span class="c7">max</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Retail only. 
Use a radius of 1 mile</span></p></td></tr><tr class="c15"><td class="c25" colspan="1" rowspan="1"><p class="c5"><span class="c7">Severe Weather (Cold wave - storm)</span></p></td><td class="c19" colspan="1" rowspan="1"><p class="c14"><span class="c7">phq_impact_severe_weather_cold_wave_storm_retail</span></p></td><td class="c16" colspan="1" rowspan="1"><p class="c5"><span class="c7">max</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Retail only. Use a radius of 1 mile</span></p></td></tr><tr class="c15"><td class="c25" colspan="1" rowspan="1"><p class="c5"><span class="c7">Severe Weather (Dust)</span></p></td><td class="c19" colspan="1" rowspan="1"><p class="c14"><span class="c7">phq_impact_severe_weather_dust_retail</span></p></td><td class="c16" colspan="1" rowspan="1"><p class="c5"><span class="c7">max</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Retail only. Use a radius of 1 mile</span></p></td></tr><tr class="c15"><td class="c25" colspan="1" rowspan="1"><p class="c5"><span class="c7">Severe Weather (Dust - Storm)</span></p></td><td class="c19" colspan="1" rowspan="1"><p class="c14"><span class="c7">phq_impact_severe_weather_dust_storm_retail</span></p></td><td class="c16" colspan="1" rowspan="1"><p class="c5"><span class="c7">max</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Retail only. Use a radius of 1 mile</span></p></td></tr><tr class="c15"><td class="c25" colspan="1" rowspan="1"><p class="c5"><span class="c7">Severe Weather (Flood)</span></p></td><td class="c19" colspan="1" rowspan="1"><p class="c14"><span class="c7">phq_impact_severe_weather_flood_retail</span></p></td><td class="c16" colspan="1" rowspan="1"><p class="c5"><span class="c7">max</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Retail only. 
Use a radius of 1 mile</span></p></td></tr><tr class="c15"><td class="c25" colspan="1" rowspan="1"><p class="c5"><span class="c7">Severe Weather (Heat wave)</span></p></td><td class="c19" colspan="1" rowspan="1"><p class="c14"><span class="c7">phq_impact_severe_weather_heat_wave_retail</span></p></td><td class="c16" colspan="1" rowspan="1"><p class="c5"><span class="c7">max</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Retail only. Use a radius of 1 mile</span></p></td></tr><tr class="c15"><td class="c25" colspan="1" rowspan="1"><p class="c5"><span class="c7">Severe Weather (Hurricane)</span></p></td><td class="c19" colspan="1" rowspan="1"><p class="c14"><span class="c7">phq_impact_severe_weather_hurricane_retail</span></p></td><td class="c16" colspan="1" rowspan="1"><p class="c5"><span class="c7">max</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Retail only. Use a radius of 1 mile</span></p></td></tr><tr class="c15"><td class="c25" colspan="1" rowspan="1"><p class="c5"><span class="c7">Severe Weather (Thunderstorm)</span></p></td><td class="c19" colspan="1" rowspan="1"><p class="c14"><span class="c7">phq_impact_severe_weather_thunderstorm_retail</span></p></td><td class="c16" colspan="1" rowspan="1"><p class="c5"><span class="c7">max</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Retail only. Use a radius of 1 mile</span></p></td></tr><tr class="c15"><td class="c25" colspan="1" rowspan="1"><p class="c5"><span class="c7">Severe Weather (Tornado)</span></p></td><td class="c19" colspan="1" rowspan="1"><p class="c14"><span class="c7">phq_impact_severe_weather_tornado_retail</span></p></td><td class="c16" colspan="1" rowspan="1"><p class="c5"><span class="c7">max</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Retail only. 
Use a radius of 1 mile</span></p></td></tr><tr class="c15"><td class="c25" colspan="1" rowspan="1"><p class="c5"><span class="c7">Severe Weather (Tropical Storm)</span></p></td><td class="c19" colspan="1" rowspan="1"><p class="c14"><span class="c7">phq_impact_severe_weather_tropical_storm_retail</span></p></td><td class="c16" colspan="1" rowspan="1"><p class="c5"><span class="c7">max</span></p></td><td class="c12" colspan="1" rowspan="1"><p class="c5"><span class="c7">Retail only. Use a radius of 1 mile</span></p></td></tr>
</table>
</p>

<a id='setup'></a>
## Setup

- If you're using Google Colab, uncomment and run the following code block.

In [6]:
# %%capture
# !git clone https://github.com/predicthq/phq-data-science-docs.git
# %cd phq-data-science-docs/feature-engineering-guide
# !pip install pandas==1.1.5 shapely==1.8.0 timezonefinder==5.2.0 predicthq==2.0.6 numpy==1.20.3

- Alternatively, if you're running this notebook on a local machine, set up a Python environment using the [requirements.txt](https://github.com/predicthq/phq-data-science-docs/blob/master/feature-engineering-guide/requirements.txt) file, which is shared alongside the notebook.
These requirements can be installed by running the command `pip install -r requirements.txt`.

In [9]:
import pandas as pd
from predicthq import Client
import requests
import collections
import numpy as np
from datetime import datetime, date, timedelta

# To display more columns and with a larger width in the DataFrame
pd.set_option("display.max_columns", 50)

<a id='access_token'></a>
## Access token
An Access Token is required to query the API.

The following link will guide you through creating an account and an access token. 

 - https://docs.predicthq.com/guides/quickstart/

In [8]:
# Replace the value below with your own access token.
ACCESS_TOKEN = 'EQc611eHIqyNFcbTJo_ejlT8PRVi4hKUVMt2J4y3'
phq = Client(access_token=ACCESS_TOKEN)


<a id='setting_params'></a>
## SDK parameters
To search for events, start by building a dictionary of SDK parameters and adding the required filters.

In [11]:
parameters = dict()

### Location
Specifying the location ensures you will use the relevant events for the location for calculating features.

The default location in this notebook is in New York. The latitude and longitude, 40.7079, -74.0115, correspond to a point on Wall Street. If you are running the notebook, you can change the latitude and longitude to one of your own locations.

Please note that **School Holiday** features are only available in the US and UK.

Location can be set in two ways:  

  1) Using the `location__geo` parameter, which contains `latitude`, `longitude` of the location of interest with a `radius` and a `unit` for the radius. Use this option when looking for events around a specific location like a store or hotel.
    * Available units:
        - m: meter
        - km: kilometer
        - mi: mile
  
  
  2) Using a `place_id`. Use this option when getting events in an entire city or a larger area.


When calling the Features API with a latitude and longitude you need to specify a radius to use. The Features API will then return aggregate features that reflect all events happening within the specified radius. Events outside the radius will not be included in the features generated. You can use the Suggested Radius API to find a radius for a location.

If you are using a `place_id` you do not need to set a radius, as this option returns all events in the specified area.
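
The two ways of setting the location can be sketched as a small helper. `build_location_params` and `example_parameters` are hypothetical names introduced here for illustration. The `location__geo` and `location__place_id` keys are the ones used in the cells of this notebook.

```python
def build_location_params(lat=None, lon=None, radius=None, place_ids=None):
    """Build the location part of the parameter dictionary.

    Hypothetical helper: pass lat, lon and radius (e.g. "1mi") for a point
    location, or place_ids for a whole city or region, but not both.
    """
    use_geo = lat is not None or lon is not None or radius is not None
    if use_geo and place_ids:
        raise ValueError("Use either lat/lon/radius or place_ids, not both.")
    if place_ids:
        # No radius is needed when querying by place ID.
        return {"location__place_id": list(place_ids)}
    if lat is None or lon is None or radius is None:
        raise ValueError("lat, lon and radius are all required for location__geo.")
    return {"location__geo": {"lat": lat, "lon": lon, "radius": radius}}


# A separate dictionary is used so the notebook's own `parameters` is untouched.
example_parameters = {}
example_parameters.update(
    build_location_params(lat=40.7079, lon=-74.0115, radius="1mi")
)
```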

In [2]:
# Using latitude, longitude and a radius
# Comment out this cell if you want to use place_ids
latitude, longitude = (40.7079, -74.0115) # lat, lon for Wall Street, New York City

##### Using Suggested Radius API to set radius
The Suggested Radius API is powered by a machine learning model that looks at factors like population density, the events around a location, the customer’s industry, and many other factors to determine the ideal radius.
The Suggested Radius API returns a radius that can be used to find attended events around a given location. When looking for events around a business location (such as a store, a hotel, or another business location) a key question is how far should you look for events. For example, should you look at events in a 0.5-mile radius, a 2-mile radius, or a 10-mile radius from your location? The Suggested Radius API answers this question by returning a radius based on a number of factors that can be used to retrieve events around a location.
If you've used the Suggested Radius API (beta) before, please note that this updated version now allows you to specify the radius unit. The previous response value was in ***meters***.

In [13]:
# set a location dictionary for suggested radius API to use
loc_radius = dict()
loc_radius.update(lat=latitude, lon=longitude)
loc_radius

{'lat': 40.7079, 'lon': -74.0115}

In [6]:
def get_suggested_radius(lat_lon, industry, radius_unit):
    """
    Returns the suggested radius for a given latitude and longitude.

    Args:
        lat_lon (str): The latitude and longitude of the location in the format "lat,lon".
        industry (str): The industry of interest that the radius will be calculated for.
        radius_unit (str): Unit in which the suggested radius will be returned.

    Returns:
        str: The suggested radius in your preferred unit.
    """
    # Set the URL for the API call
    url = "https://api.predicthq.com/v1/suggested-radius/"
    # Set the query parameters for the API call
    params = {
        "location.origin": lat_lon,
        "industry": industry,
        "radius_unit": radius_unit,
    }
    # Set the headers for the API call (including the access token)
    headers = {
        "Authorization": "Bearer " + ACCESS_TOKEN,
        "Accept": "application/json",
    }
    # Make the API call and get the JSON response
    response = requests.get(url, params=params, headers=headers).json()
    # Extract the radius and convert it to an int so the Features API can consume it
    radius = int(response['radius'])
    return f"{radius}{radius_unit}"
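
The function above assumes the response always contains a `radius` field. As a hedged sketch, the parsing step could be separated out so that a missing field produces a clear error. `format_radius` is a hypothetical helper. Only the `radius` key is taken from the code above, and the response body shown is made up for illustration.

```python
def format_radius(response_json, radius_unit):
    """Turn a Suggested Radius response body into a string such as "3km".

    Hypothetical helper: only the "radius" key comes from the function above;
    raising ValueError on a missing key is a choice made in this sketch.
    """
    if "radius" not in response_json:
        raise ValueError(f"No 'radius' in response: {response_json}")
    # Truncate to a whole number, as the function above does.
    radius = int(response_json["radius"])
    return f"{radius}{radius_unit}"


# Made-up response body, for illustration only.
print(format_radius({"radius": 3.2}, "km"))  # prints 3km
```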

To set parameters for the Suggested Radius API, you can refer to our documentation available at [Suggested Radius API](https://docs.predicthq.com/resources/suggested-radius). It provides detailed information on how to configure the API according to your requirements.

In [7]:
# Specify your own industry. The default industry is 'other'; see the documentation for more options.
industry = 'other'
# Specify your preferred unit here
radius_unit = 'km'
# Use the "lat,lon" format described in the get_suggested_radius docstring
lat_lon = f"{latitude},{longitude}"
# You can change this value to suit your requirements
radius_filter = get_suggested_radius(lat_lon, industry, radius_unit)
parameters.update(location__geo=dict(lat=latitude, lon=longitude, radius=radius_filter))


Alternatively, we could have used a `place_id` for our search (See our [Appendix on place_ids](#appendix) for detailed explanation).

In [9]:
## Keep commented if you want to use lat and lon
#place_ids = [5128638]
#parameters.update(location__place_id=place_ids) 

### Date "YYYY-MM-DD"

To define the period of time for which events will be returned, set the greater than or equal (`active__gte`) and less than or equal (`active__lte`) parameters. This will select all Attendance Based Events that are active within this period.

You could also use these parameters depending on your time period of interest:

`gte - Greater than or equal.` <br>
`gt - Greater than.`<br>
`lte - Less than or equal.`<br>
`lt - Less than.`<br>


Each request can currently fetch up to 90 days' worth of data. For longer date ranges, multiple requests must be made; this notebook includes examples of how to do that. There is no pagination in this API.
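
A minimal sketch of splitting a long date range into request-sized chunks is shown below. `split_date_range` is a hypothetical helper and not necessarily the approach used elsewhere in this notebook. It assumes the 90-day limit counts both the start and end dates.

```python
from datetime import date, timedelta


def split_date_range(start, end, max_days=90):
    """Split [start, end] (inclusive ISO dates) into chunks of at most max_days days."""
    if max_days < 1:
        raise ValueError("max_days must be at least 1")
    start_d = date.fromisoformat(start)
    end_d = date.fromisoformat(end)
    chunks = []
    while start_d <= end_d:
        # Each chunk covers up to max_days calendar days, counting both endpoints.
        chunk_end = min(start_d + timedelta(days=max_days - 1), end_d)
        chunks.append((start_d.isoformat(), chunk_end.isoformat()))
        start_d = chunk_end + timedelta(days=1)
    return chunks


for chunk_start, chunk_end in split_date_range("2021-01-01", "2021-12-31"):
    # Each pair can be used as active__gte / active__lte in its own request.
    print(chunk_start, chunk_end)
```

The chunks do not overlap, so the results of the individual requests can be concatenated without removing duplicate dates.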

Please note that **School Holiday** features are only available for the following periods:
* UK: mid-2017 onwards;
* US: mid-2018 onwards.


In [10]:
start_time = "2021-09-01"
end_time = "2021-11-28"
parameters.update(active__gte = start_time, active__lte = end_time)

<a id='features_api'></a>
## Using the Features API to query features for forecasting

<a id='functions'></a>
### Functions for formatting data frame
The default response from the Features API is in JSON format. The following functions are defined and used to convert it to a more usable data frame format.

In [11]:
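from collections.abc import MutableMapping
from functools import reduce


def dict_value_by_flatten_key(dict_record, flatten_key):
    # Look up a nested value using a dotted key such as "a.b.c"
    return reduce(lambda d, k: d.get(k) if isinstance(d, dict) else None,
                  flatten_key.split('.'),
                  dict_record)


def flatten_dict(d, parent_key='', sep='_'):
    # Flatten nested dictionaries, joining nested keys with `sep`
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        # MutableMapping lives in collections.abc (the collections alias was removed in Python 3.10)
        if isinstance(v, MutableMapping):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)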
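
To illustrate what the flattening does, here is a small self-contained sketch. `flatten_dict` is restated so the snippet runs on its own. The nested shape of `record` is an assumption, inferred from the flattened column names that appear later in this notebook, and the record itself is for illustration only.

```python
from collections.abc import MutableMapping


def flatten_dict(d, parent_key="", sep="_"):
    """Flatten nested mappings, joining keys with sep (same logic as above)."""
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, MutableMapping):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)


# Illustrative record; the nesting is an assumed shape for one day of results.
record = {
    "date": "2021-09-04",
    "phq_attendance_sports": {"stats": {"sum": 240.0}},
}
flat = flatten_dict(record)
print(flat)
# {'date': '2021-09-04', 'phq_attendance_sports_stats_sum': 240.0}
```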

<a id='attend'></a>
### Attendance based features

This group of features is based on PHQ Attendance. The following features are supported:

- `phq_attendance_sports`
- `phq_attendance_conferences`
- `phq_attendance_expos`
- `phq_attendance_concerts`
- `phq_attendance_festivals`
- `phq_attendance_performing_arts`
- `phq_attendance_community`
- `phq_attendance_academic_graduation`
- `phq_attendance_academic_social` (for Academic features, we recommend using the three rank based features shown in later sections)

Each of these features includes stats. You define which stats you need (or don't define any and receive the default set of stats). Supported stats are:

- `sum` (Recommended to start with)
- `count` 
- `min`
- `max`
- `avg`
- `median`
- `std_dev`

These features also support filtering by PHQ Rank as you'll see in the example below.

#### Setup SDK parameters
Specify a list of Attendance Based Events categories to return.

In [12]:
categories_attended = [
    "phq_attendance_sports",
    "phq_attendance_conferences",
    "phq_attendance_expos",
    "phq_attendance_concerts",
    "phq_attendance_festivals",
    "phq_attendance_performing_arts",
    "phq_attendance_community",
]

# only return sum of the attendance
stats = ["sum"]
#parameters.update(phq_attendance_school_holidays__stats=stats)
for i in categories_attended:
    parameters.update({f"{i}__stats": stats})

parameters

{'location__geo': {'lat': 40.7079, 'lon': -74.0115, 'radius': '3km'},
 'active__gte': '2021-09-01',
 'active__lte': '2021-11-28',
 'phq_attendance_sports__stats': ['sum'],
 'phq_attendance_conferences__stats': ['sum'],
 'phq_attendance_expos__stats': ['sum'],
 'phq_attendance_concerts__stats': ['sum'],
 'phq_attendance_festivals__stats': ['sum'],
 'phq_attendance_performing_arts__stats': ['sum'],
 'phq_attendance_community__stats': ['sum']}

#### Rank filter
Low rank or high rank events can be filtered out when calculating features if desired. To do so, set the greater than or equal / greater than (gte/gt) and less than or equal / less than (lte/lt) parameters for the desired features. For example, this allows you to filter out smaller events if you want to concentrate on larger events initially.

See PHQ Attendance under [General Category Information](https://docs.predicthq.com/categoryinfo/general-category-information) in the documentation for more information on how rank maps to attendance.

In [13]:
phq_rank_filter = 30

# Example 1, set rank filter for a single feature
parameters.update(phq_attendance_sports__phq_rank=dict(gte = phq_rank_filter))

# Example 2, set rank filter for a batch of features
for i in categories_attended:
    parameters.update({f"{i}__phq_rank":{'gte': phq_rank_filter}})

parameters

{'location__geo': {'lat': 40.7079, 'lon': -74.0115, 'radius': '3km'},
 'active__gte': '2021-09-01',
 'active__lte': '2021-11-28',
 'phq_attendance_sports__stats': ['sum'],
 'phq_attendance_conferences__stats': ['sum'],
 'phq_attendance_expos__stats': ['sum'],
 'phq_attendance_concerts__stats': ['sum'],
 'phq_attendance_festivals__stats': ['sum'],
 'phq_attendance_performing_arts__stats': ['sum'],
 'phq_attendance_community__stats': ['sum'],
 'phq_attendance_sports__phq_rank': {'gte': 30},
 'phq_attendance_conferences__phq_rank': {'gte': 30},
 'phq_attendance_expos__phq_rank': {'gte': 30},
 'phq_attendance_concerts__phq_rank': {'gte': 30},
 'phq_attendance_festivals__phq_rank': {'gte': 30},
 'phq_attendance_performing_arts__phq_rank': {'gte': 30},
 'phq_attendance_community__phq_rank': {'gte': 30}}
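
The two parameter patterns above (`<feature>__stats` and `<feature>__phq_rank`) can be wrapped in a single helper. The sketch below is only an illustration: `build_attendance_params` is a hypothetical name, not part of the SDK, and it only builds the dictionary without calling the API.

```python
def build_attendance_params(base_params, features, stats=("sum",), min_rank=None):
    """Return a copy of base_params with stats and optional rank filters added.

    Hypothetical helper for illustration; it only builds the dictionary.
    """
    params = dict(base_params)
    for feature in features:
        params[f"{feature}__stats"] = list(stats)
        if min_rank is not None:
            # keep only events with PHQ Rank >= min_rank
            params[f"{feature}__phq_rank"] = {"gte": min_rank}
    return params


base = {
    "location__geo": {"lat": 40.7079, "lon": -74.0115, "radius": "3km"},
    "active__gte": "2021-09-01",
    "active__lte": "2021-11-28",
}
example_params = build_attendance_params(
    base, ["phq_attendance_sports", "phq_attendance_concerts"], min_rank=30
)
```

Returning a new dictionary instead of updating `parameters` in place makes it easier to reuse the same location and date range across several feature groups.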

#### Query features

In [14]:
results = []

for feature in phq.features.obtain_features(parameters):
    results.append(flatten_dict(feature.to_dict(), '', '_'))

feature_df = pd.DataFrame(results)

feature_df.head()


Unnamed: 0,date,phq_attendance_community_stats_sum,phq_attendance_concerts_stats_sum,phq_attendance_conferences_stats_sum,phq_attendance_expos_stats_sum,phq_attendance_festivals_stats_sum,phq_attendance_performing_arts_stats_sum,phq_attendance_sports_stats_sum
0,2021-09-01,360.0,794.0,0.0,593.0,0.0,0.0,0.0
1,2021-09-02,540.0,962.0,0.0,436.0,0.0,0.0,0.0
2,2021-09-03,440.0,522.0,0.0,702.0,0.0,400.0,0.0
3,2021-09-04,960.0,105.0,0.0,691.0,43636.0,400.0,240.0
4,2021-09-05,520.0,581.0,0.0,425.0,36364.0,600.0,0.0


#### Features for School Holidays
`phq_attendance_school_holidays` is also an attendance based feature, but it needs a specific radius setting. School holidays are currently recorded at the district level for the US, so we recommend a radius of 1 mile for US locations.

In [15]:
# Reset SDK parameters
parameters = dict()

latitude, longitude = (40.7079, -74.0115) # LAT, LONG for centre of New York City
# Specify your desired radius_filter, or utilize the get_suggested_radius function to generate a Suggested radius.
radius_filter = "1mi" 
parameters.update(location__geo=dict(lat=latitude, lon=longitude,radius=radius_filter)) 

start_time = "2021-09-01"
end_time = "2021-11-28"
parameters.update(active__gte = start_time)
parameters.update(active__lte = end_time)

In [16]:
# only return sum of the attendance
stats = ["sum"]
parameters.update({'phq_attendance_school_holidays__stats': stats})

# add rank filter if required
phq_rank_filter = 30
parameters.update(phq_attendance_school_holidays__phq_rank=dict(gt = phq_rank_filter))
parameters

{'location__geo': {'lat': 40.7079, 'lon': -74.0115, 'radius': '1mi'},
 'active__gte': '2021-09-01',
 'active__lte': '2021-11-28',
 'phq_attendance_school_holidays__stats': ['sum'],
 'phq_attendance_school_holidays__phq_rank': {'gt': 30}}

In [17]:
# Query Features
results = []

for feature in phq.features.obtain_features(parameters):
    results.append(flatten_dict(feature.to_dict(), '', '_'))

feature_df = pd.DataFrame(results)
feature_df.head()

Unnamed: 0,date,phq_attendance_school_holidays_stats_sum
0,2021-09-01,1182931.0
1,2021-09-02,1182931.0
2,2021-09-03,1182931.0
3,2021-09-04,1182931.0
4,2021-09-05,1182931.0


<b>Two more useful features can be derived from</b> `phq_attendance_school_holidays_stats_sum`:
* `phq_school_holidays_first_day_flag`: a binary variable indicating whether that day is the first day of any school holiday within the selected radius of the selected location.
* `phq_school_holidays_last_day_flag`: a binary variable indicating whether that day is the last day of any school holiday within the selected radius of the selected location.
* Note that `phq_school_holidays_first_day_flag` in the first row and `phq_school_holidays_last_day_flag` in the last row will be <b>NaN</b>, because both features are derived from the selected time range, which is not guaranteed to cover an entire school holiday.

In [18]:
# creating shifted attendance
feature_df['temp_pre'] = feature_df['phq_attendance_school_holidays_stats_sum'].shift(1)
feature_df['temp_after'] = feature_df['phq_attendance_school_holidays_stats_sum'].shift(-1)

# first day flag
feature_df['phq_school_holidays_first_day_flag'] = feature_df[[
    'phq_attendance_school_holidays_stats_sum','temp_pre']].apply(
        lambda x: np.nan if pd.isna(x[1]) else 1 if  x[0] > x[1] else 0, axis=1)

# last day flag
feature_df['phq_school_holidays_last_day_flag'] = feature_df[[
    'phq_attendance_school_holidays_stats_sum','temp_after']].apply(
        lambda x: np.nan if pd.isna(x[1]) else 1 if  x[0] > x[1] else 0, axis=1)

# remove temporary features
feature_df = feature_df.drop(['temp_pre', 'temp_after'], axis=1)

feature_df.head()

Unnamed: 0,date,phq_attendance_school_holidays_stats_sum,phq_school_holidays_first_day_flag,phq_school_holidays_last_day_flag
0,2021-09-01,1182931.0,,0.0
1,2021-09-02,1182931.0,0.0,0.0
2,2021-09-03,1182931.0,0.0,0.0
3,2021-09-04,1182931.0,0.0,0.0
4,2021-09-05,1182931.0,0.0,0.0
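
The first/last day logic can also be written without a row-wise `apply`. The following is a minimal sketch, assuming the same column name as in the cell above; `add_holiday_boundary_flags` is a hypothetical helper and the attendance values are synthetic, not API output.

```python
import pandas as pd


def add_holiday_boundary_flags(df, col="phq_attendance_school_holidays_stats_sum"):
    """Add first/last day flags by comparing each day with its neighbours."""
    out = df.copy()
    prev_day = out[col].shift(1)
    next_day = out[col].shift(-1)
    # A rise versus the previous day marks a first day.
    out["phq_school_holidays_first_day_flag"] = (
        (out[col] > prev_day).astype(float).where(prev_day.notna())
    )
    # A drop on the following day marks a last day.
    out["phq_school_holidays_last_day_flag"] = (
        (out[col] > next_day).astype(float).where(next_day.notna())
    )
    return out


# Synthetic attendance values for illustration only.
demo = pd.DataFrame({
    "date": pd.date_range("2021-09-01", periods=6).strftime("%Y-%m-%d"),
    "phq_attendance_school_holidays_stats_sum": [0.0, 500.0, 500.0, 0.0, 0.0, 300.0],
})
demo_flags = add_holiday_boundary_flags(demo)
```

As in the cell above, the first row's first-day flag and the last row's last-day flag stay NaN because the neighbouring day lies outside the queried range.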


<a id='rank'></a>
### Rank based features

This group of features is based on PHQ Rank and covers non-attendance based events (mostly scheduled ones). The following features are supported:

- `phq_rank_public_holidays`
- `phq_rank_school_holidays` (for the US and UK we recommend using `phq_attendance_school_holidays`)
- `phq_rank_observances`
- `phq_rank_academic_session`
- `phq_rank_academic_exam`
- `phq_rank_academic_holiday`

Results are broken down by PHQ Rank Level (1 to 5). Rank Levels are groupings of Rank and are grouped as follows:

- 1 = between 0 and 20
- 2 = between 21 and 40
- 3 = between 41 and 60
- 4 = between 61 and 80
- 5 = between 81 and 100

Additional filtering for PHQ Rank features is not currently supported.
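
The mapping from PHQ Rank to Rank Level listed above can be expressed as a small function. This is a sketch based only on the ranges in this section; `rank_to_level` is a hypothetical name and not part of the SDK.

```python
def rank_to_level(rank):
    """Map a PHQ Rank (0-100) to its Rank Level (1-5) using the ranges listed above."""
    if not 0 <= rank <= 100:
        raise ValueError(f"rank must be between 0 and 100, got {rank}")
    if rank <= 20:
        return 1
    # Ranks 21-40 -> 2, 41-60 -> 3, 61-80 -> 4, 81-100 -> 5
    return (rank - 1) // 20 + 1


example_levels = [rank_to_level(r) for r in (0, 20, 21, 60, 61, 100)]
```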

#### Setup SDK parameters

In [19]:
parameters = dict()

latitude, longitude = (40.7079, -74.0115) # LAT, LONG for centre of New York City

loc_radius = dict()
loc_radius.update(lat=latitude, lon=longitude)

# Set parameters for querying suggested radius API
# Please specify your own industry, the default industry is 'other', see more options in the documentation. 
industry = 'other'
# Please specify your preferred unit here
radius_unit = 'km'
lat_lon = f"{latitude}, {longitude}"
# query suggested radius API
# Alternatively, you can modify the following line to use your preferred radius 
radius_filter = get_suggested_radius(lat_lon, industry,radius_unit)

parameters.update(location__geo=dict(lat=latitude, lon=longitude,radius=radius_filter)) 

start_time = "2021-09-01"
end_time = "2021-11-28"
parameters.update(active__gte = start_time)
parameters.update(active__lte = end_time)

Specify a list of Rank Based Events categories to return.

In [20]:
categories_rank = [
     "phq_rank_observances",
     "phq_rank_public_holidays",
     "phq_rank_school_holidays",
     "phq_rank_academic_session",
     "phq_rank_academic_exam",
     "phq_rank_academic_holiday",
]


for i in categories_rank:
    parameters.update({f"{i}": True})

parameters

{'location__geo': {'lat': 40.7079, 'lon': -74.0115, 'radius': '3km'},
 'active__gte': '2021-09-01',
 'active__lte': '2021-11-28',
 'phq_rank_observances': True,
 'phq_rank_public_holidays': True,
 'phq_rank_school_holidays': True,
 'phq_rank_academic_session': True,
 'phq_rank_academic_exam': True,
 'phq_rank_academic_holiday': True}

#### Query features

In [21]:
results = []

for feature in phq.features.obtain_features(parameters):
    results.append(flatten_dict(feature.to_dict(), '', '_'))

feature_df = pd.DataFrame(results)

feature_df.head()

Unnamed: 0,date,phq_rank_observances_rank_levels_1,phq_rank_observances_rank_levels_2,phq_rank_observances_rank_levels_3,phq_rank_observances_rank_levels_4,phq_rank_observances_rank_levels_5,phq_rank_public_holidays_rank_levels_1,phq_rank_public_holidays_rank_levels_2,phq_rank_public_holidays_rank_levels_3,phq_rank_public_holidays_rank_levels_4,phq_rank_public_holidays_rank_levels_5,phq_rank_school_holidays_rank_levels_1,phq_rank_school_holidays_rank_levels_2,phq_rank_school_holidays_rank_levels_3,phq_rank_school_holidays_rank_levels_4,phq_rank_school_holidays_rank_levels_5,phq_rank_academic_session_rank_levels_1,phq_rank_academic_session_rank_levels_2,phq_rank_academic_session_rank_levels_3,phq_rank_academic_session_rank_levels_4,phq_rank_academic_session_rank_levels_5,phq_rank_academic_exam_rank_levels_1,phq_rank_academic_exam_rank_levels_2,phq_rank_academic_exam_rank_levels_3,phq_rank_academic_exam_rank_levels_4,phq_rank_academic_exam_rank_levels_5,phq_rank_academic_holiday_rank_levels_1,phq_rank_academic_holiday_rank_levels_2,phq_rank_academic_holiday_rank_levels_3,phq_rank_academic_holiday_rank_levels_4,phq_rank_academic_holiday_rank_levels_5
0,2021-09-01,0,0,0,0,0,0,0,0,0,0,0,0,2,0,3,0,0,0,5,0,0,0,0,0,0,0,0,0,3,0
1,2021-09-02,0,0,0,0,0,0,0,0,0,0,0,0,2,0,3,0,0,0,6,0,0,0,0,0,0,0,0,0,2,0
2,2021-09-03,0,0,0,0,0,0,0,0,0,0,0,0,2,0,3,0,0,0,6,0,0,0,0,0,0,0,0,0,2,0
3,2021-09-04,0,0,1,0,0,0,0,0,0,0,0,0,2,0,3,0,0,0,6,0,0,0,0,0,0,0,0,0,2,0
4,2021-09-05,0,0,1,0,0,0,0,0,0,0,0,0,2,0,3,0,0,0,6,0,0,0,0,0,0,0,0,0,2,0


#### Aggregate impact of rank based features
As explained at the beginning of this section, each feature is returned as 5 rank-level counts, which can be aggregated into a single value in whatever way suits your model. The example here aggregates the 5 rank levels of `phq_rank_observances` as a weighted sum, multiplying the event count at each level by the level number:

In [22]:
column_names = list(feature_df.columns)
column_names = [i for i in column_names if "rank_observances_rank_levels" in i]
new_types = {i: 'int' for i in column_names}
aggregate_df = feature_df[column_names]

aggregate_df = aggregate_df.astype(new_types)
for i in column_names:
    multiply = int(i.split("_")[-1])
    aggregate_df[i] = aggregate_df[i]*multiply

aggregate_df['phq_rank_observancesphq_rank_observances_rank_agg'] = aggregate_df.sum(axis=1)
aggregate_df['date'] = feature_df['date']
aggregate_df[['date','phq_rank_observancesphq_rank_observances_rank_agg']]

Unnamed: 0,date,phq_rank_observancesphq_rank_observances_rank_agg
0,2021-09-01,0
1,2021-09-02,0
2,2021-09-03,0
3,2021-09-04,3
4,2021-09-05,3
...,...,...
84,2021-11-24,0
85,2021-11-25,3
86,2021-11-26,6
87,2021-11-27,0
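
The same weighted sum can be applied to several rank based features at once. The sketch below assumes the flattened column naming shown earlier (`<feature>_rank_levels_<n>`); `aggregate_rank_levels` and the `<feature>_rank_agg` output names are hypothetical choices, and the input values are synthetic.

```python
import re

import pandas as pd


def aggregate_rank_levels(df, features):
    """Return one weighted-sum column per feature (level number x event count)."""
    out = pd.DataFrame({"date": df["date"]})
    for feature in features:
        pattern = re.compile(rf"^{re.escape(feature)}_rank_levels_(\d)$")
        total = 0
        for col in df.columns:
            match = pattern.match(col)
            if match:
                # weight the count at each level by the level number
                total = total + df[col].astype(int) * int(match.group(1))
        out[f"{feature}_rank_agg"] = total
    return out


# Synthetic counts for illustration only.
demo = pd.DataFrame({
    "date": ["2021-09-04", "2021-11-26"],
    "phq_rank_observances_rank_levels_1": [0, 0],
    "phq_rank_observances_rank_levels_3": [1, 2],
    "phq_rank_public_holidays_rank_levels_5": [0, 1],
})
agg = aggregate_rank_levels(demo, ["phq_rank_observances", "phq_rank_public_holidays"])
```

Weighting by level number is only one option; depending on the model, a plain count of events or the highest active level may work better.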


<a id='impact'></a>
### Impact based features (severe weather)

<b>Severe weather</b> is currently the only category that covers impact based features. The severe weather features use demand impact patterns, which calculate the impact duration of a severe weather event and are based on industry-specific information. Our severe weather features are currently designed and tested on data for the retail segment only. For example, for a flood event the impact pattern may show that it typically impacts retail businesses 1 day before and 2 days after the event. That impact pattern information is used in the features below. If your business is in an industry segment other than retail (e.g. Accommodation or Travel), the features below may not work for you or may be less effective.

For more details on severe weather see the [data science guide](https://docs.predicthq.com/datascience/severe-weather-events).

There are 13 features available for the retail industry from the Features API:

- `phq_impact_severe_weather_air_quality_retail`
- `phq_impact_severe_weather_blizzard_retail`
- `phq_impact_severe_weather_cold_wave_retail`
- `phq_impact_severe_weather_cold_wave_snow_retail`
- `phq_impact_severe_weather_cold_wave_storm_retail`
- `phq_impact_severe_weather_dust_retail`
- `phq_impact_severe_weather_dust_storm_retail`
- `phq_impact_severe_weather_flood_retail`
- `phq_impact_severe_weather_heat_wave_retail`
- `phq_impact_severe_weather_hurricane_retail`
- `phq_impact_severe_weather_thunderstorm_retail`
- `phq_impact_severe_weather_tornado_retail`
- `phq_impact_severe_weather_tropical_storm_retail`

Similar to attendance based features, each of these features supports 7 stats:
- `sum`
- `count`
- `min`
- `max` (recommended to start with)
- `avg`
- `median`
- `std_dev`

For severe weather, we recommend starting with `max`, which is what this notebook uses.

#### Radius for Severe Weather

The distance between a store and an event is defined as the minimum distance between the store and the points of the event's polygon. When the store is inside the polygon, the distance between the store and the event is 0km. The default radius is set to 0km, i.e., the events used for aggregation and feature engineering are those whose polygons overlap with the store.

<b>Note: when using the `location__geo` parameter in the Features API to query for features around a radius, choose a radius of 1 meter (the `location__geo` parameter doesn’t support a radius of 0).</b>

#### Setup SDK parameters

In [23]:
parameters = dict()

latitude, longitude = (40.7079, -74.0115) # LAT, LONG for centre of New York City
# Specify your desired radius_filter, or utilize the get_suggested_radius function to generate a Suggested radius.
radius_filter = "1mi"
parameters.update(location__geo=dict(lat=latitude, lon=longitude,radius=radius_filter)) 

start_time = "2021-09-01"
end_time = "2021-11-28"
parameters.update(active__gte = start_time)
parameters.update(active__lte = end_time)

parameters

{'location__geo': {'lat': 40.7079, 'lon': -74.0115, 'radius': '1mi'},
 'active__gte': '2021-09-01',
 'active__lte': '2021-11-28'}

In [24]:
categories_impact = [
     "phq_impact_severe_weather_air_quality_retail",
     "phq_impact_severe_weather_blizzard_retail",
     "phq_impact_severe_weather_cold_wave_retail",
     "phq_impact_severe_weather_cold_wave_snow_retail",
     "phq_impact_severe_weather_cold_wave_storm_retail",
     "phq_impact_severe_weather_dust_retail",
     "phq_impact_severe_weather_dust_storm_retail",
     "phq_impact_severe_weather_flood_retail",
     "phq_impact_severe_weather_heat_wave_retail",
     "phq_impact_severe_weather_hurricane_retail",
     "phq_impact_severe_weather_thunderstorm_retail",
     "phq_impact_severe_weather_tornado_retail",
     "phq_impact_severe_weather_tropical_storm_retail",
]

# only return the max of the impact
stats = ["max"]
for i in categories_impact:
    parameters.update({f"{i}": {'stats': stats}})

# # Similar to Attend Events, low/high rank events can be excluded from calculating features.
# phq_rank_filter = 30

# for i in categories_impact:
#     parameters.update({f"{i}__phq_rank":{'gte':phq_rank_filter}})

parameters

{'location__geo': {'lat': 40.7079, 'lon': -74.0115, 'radius': '1mi'},
 'active__gte': '2021-09-01',
 'active__lte': '2021-11-28',
 'phq_impact_severe_weather_air_quality_retail': {'stats': ['max']},
 'phq_impact_severe_weather_blizzard_retail': {'stats': ['max']},
 'phq_impact_severe_weather_cold_wave_retail': {'stats': ['max']},
 'phq_impact_severe_weather_cold_wave_snow_retail': {'stats': ['max']},
 'phq_impact_severe_weather_cold_wave_storm_retail': {'stats': ['max']},
 'phq_impact_severe_weather_dust_retail': {'stats': ['max']},
 'phq_impact_severe_weather_dust_storm_retail': {'stats': ['max']},
 'phq_impact_severe_weather_flood_retail': {'stats': ['max']},
 'phq_impact_severe_weather_heat_wave_retail': {'stats': ['max']},
 'phq_impact_severe_weather_hurricane_retail': {'stats': ['max']},
 'phq_impact_severe_weather_thunderstorm_retail': {'stats': ['max']},
 'phq_impact_severe_weather_tornado_retail': {'stats': ['max']},
 'phq_impact_severe_weather_tropical_storm_retail': {'stats': ['max']}}

#### Query features

In [25]:
results = []

for feature in phq.features.obtain_features(parameters):
    results.append(flatten_dict(feature.to_dict(), '', '_'))

feature_df = pd.DataFrame(results)

feature_df.head()

Unnamed: 0,date,phq_impact_severe_weather_air_quality_retail_stats_max,phq_impact_severe_weather_blizzard_retail_stats_max,phq_impact_severe_weather_cold_wave_retail_stats_max,phq_impact_severe_weather_cold_wave_snow_retail_stats_max,phq_impact_severe_weather_cold_wave_storm_retail_stats_max,phq_impact_severe_weather_dust_retail_stats_max,phq_impact_severe_weather_dust_storm_retail_stats_max,phq_impact_severe_weather_flood_retail_stats_max,phq_impact_severe_weather_heat_wave_retail_stats_max,phq_impact_severe_weather_hurricane_retail_stats_max,phq_impact_severe_weather_thunderstorm_retail_stats_max,phq_impact_severe_weather_tornado_retail_stats_max,phq_impact_severe_weather_tropical_storm_retail_stats_max
0,2021-09-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,82.0,0.0,0.0,86.0,60.0,0.0
1,2021-09-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,80.0,0.0,0.0,34.0,0.0,0.0
2,2021-09-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,28.0,0.0,0.0,0.0,0.0,0.0
3,2021-09-04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2021-09-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<a id='wide_range'></a>
## Using a longer date range
As mentioned earlier, the Features API only allows a date range of up to 90 days per request, so a longer range requires multiple requests. This section provides examples of how to extract features for more than 90 days by splitting the range into 90-day windows and sending one request per window.

You may want to use this approach to download a data set of historic data to train your model.

<a id='functions_wide'></a>
### Common functions
Functions to split a wide date range into multiple 90-day ranges.

In [26]:
DATE_FORMAT = "%Y-%m-%d"
FEATURES_API_URL = "https://api.predicthq.com/v1/features"

phq = Client(access_token=ACCESS_TOKEN)

def get_date_groups(start, end):
    """
    The Features API allows a range of up to 90 days per request,
    so split [start, end] into consecutive windows of at most 90 days.
    """
    capacity = timedelta(days=90)
    one_day = timedelta(days=1)

    d1 = start
    while d1 <= end:
        # each window covers at most 90 days, inclusive of both ends,
        # and consecutive windows never share a date
        d2 = min(d1 + capacity - one_day, end)
        yield d1.strftime(DATE_FORMAT), d2.strftime(DATE_FORMAT)
        d1 = d2 + one_day
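
Before sending requests, it can help to sanity-check the windows a splitter such as `get_date_groups` produces. The sketch below is an assumption-laden illustration: `check_date_groups` is a hypothetical helper, and the windows are written out by hand for the date range used later in this section.

```python
from datetime import date, datetime, timedelta

DATE_FORMAT = "%Y-%m-%d"
MAX_DAYS = 90  # window size used in this guide


def check_date_groups(groups, start, end, max_days=MAX_DAYS):
    """Raise ValueError unless the windows tile [start, end] without gaps or overlaps."""
    parsed = [
        (datetime.strptime(a, DATE_FORMAT).date(), datetime.strptime(b, DATE_FORMAT).date())
        for a, b in groups
    ]
    if parsed[0][0] != start or parsed[-1][1] != end:
        raise ValueError("range not fully covered")
    for first, last in parsed:
        if first > last:
            raise ValueError("window ends before it starts")
        if (last - first).days + 1 > max_days:
            raise ValueError("window longer than max_days")
    for (_, prev_last), (next_first, _) in zip(parsed, parsed[1:]):
        if next_first != prev_last + timedelta(days=1):
            raise ValueError("gap or overlap between windows")
    return len(parsed)


# Hand-written windows for 2021-06-01 to 2022-07-04.
windows = [
    ("2021-06-01", "2021-08-29"),
    ("2021-08-30", "2021-11-27"),
    ("2021-11-28", "2022-02-25"),
    ("2022-02-26", "2022-05-26"),
    ("2022-05-27", "2022-07-04"),
]
n_windows = check_date_groups(windows, date(2021, 6, 1), date(2022, 7, 4))
```

A check like this catches duplicated dates at window boundaries, which would otherwise show up as repeated rows in the resulting DataFrame.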

<a id='attend_wide'></a>
### Attendance based features

In [27]:
categories_attended = [
    "phq_attendance_sports",
    "phq_attendance_conferences",
    "phq_attendance_expos",
    "phq_attendance_concerts",
    "phq_attendance_festivals",
    "phq_attendance_performing_arts",
    "phq_attendance_community",
    "phq_attendance_school_holidays",
]

def get_features_api_attended_data(lat, lon, start, end, radius=5, rank_threshold=30):
    start = datetime.strptime(start, DATE_FORMAT).date()
    end = datetime.strptime(end, DATE_FORMAT).date()

    result = []
    for gte, lte in get_date_groups(start, end):
        query = {
            "active__gte": gte,
            "active__lte": lte,
            "location__geo": {"lat": lat, "lon": lon, "radius": f"{radius}mi"},
        }

        query.update({f"{f}__stats": ["sum"] for f in categories_attended})
        query.update(
            {f"{f}__phq_rank": {"gte": rank_threshold} for f in categories_attended}
        )

        features = phq.features.obtain_features(**query)

        for feature in features:
            record = {}
            for k, v in feature.to_dict().items():
                if k == "date":
                    record[k] = v.strftime("%Y-%m-%d")
                elif k in categories_attended:
                    record[k] = v.get("stats", {}).get("sum")
            result.append(record)

    return result


res = get_features_api_attended_data(40.7079, -74.0115, "2021-06-01", "2022-07-04", 5, 30)
df_attended = pd.DataFrame(res)

df_attended.head()

Unnamed: 0,date,phq_attendance_community,phq_attendance_concerts,phq_attendance_conferences,phq_attendance_expos,phq_attendance_festivals,phq_attendance_performing_arts,phq_attendance_school_holidays,phq_attendance_sports
0,2021-06-01,0.0,169.0,0.0,36.0,1038.0,2217.0,0.0,4195.0
1,2021-06-02,0.0,418.0,0.0,196.0,62.0,3589.0,0.0,14898.0
2,2021-06-03,0.0,1376.0,0.0,36.0,62.0,2611.0,0.0,1746.0
3,2021-06-04,647.0,2137.0,324.0,43.0,78.0,3307.0,0.0,123.0
4,2021-06-05,473.0,3859.0,0.0,0.0,94.0,5662.0,0.0,4386.0


<a id='rank_wide'></a>
### Rank based features

In [28]:
categories_rank = [
     "phq_rank_health_warnings",
     "phq_rank_observances",
     "phq_rank_public_holidays",
     "phq_rank_school_holidays",
     "phq_rank_academic_session",
     "phq_rank_academic_exam",
     "phq_rank_academic_holiday"
]

def get_features_api_data(lat, lon, start, end, radius=5):
    start = datetime.strptime(start, DATE_FORMAT).date()
    end = datetime.strptime(end, DATE_FORMAT).date()

    result = []
    for gte, lte in get_date_groups(start, end):
        query = {
            "active__gte": gte,
            "active__lte": lte,
            "location__geo": {"lat": lat, "lon": lon, "radius": f"{radius}mi"},
        }

        query.update({f"{f}": True for f in categories_rank})

        features = phq.features.obtain_features(**query)

        for feature in features:
            record = {}
            for k, v in feature.to_dict().items():
                if k == "date":
                    record[k] = v.strftime("%Y-%m-%d")
                elif k in categories_rank:
                    for rank_level, level_count in v.get("rank_levels", {}).items():
                        record[f"{k}_level_{rank_level}"] = float(level_count)

            result.append(record)

    return result

res = get_features_api_data(40.7079, -74.0115, "2021-06-01", "2022-07-04", 5)
df_rank = pd.DataFrame(res)
df_rank.head()

Unnamed: 0,date,phq_rank_health_warnings_level_1,phq_rank_health_warnings_level_2,phq_rank_health_warnings_level_3,phq_rank_health_warnings_level_4,phq_rank_health_warnings_level_5,phq_rank_observances_level_1,phq_rank_observances_level_2,phq_rank_observances_level_3,phq_rank_observances_level_4,phq_rank_observances_level_5,phq_rank_public_holidays_level_1,phq_rank_public_holidays_level_2,phq_rank_public_holidays_level_3,phq_rank_public_holidays_level_4,phq_rank_public_holidays_level_5,phq_rank_school_holidays_level_1,phq_rank_school_holidays_level_2,phq_rank_school_holidays_level_3,phq_rank_school_holidays_level_4,phq_rank_school_holidays_level_5,phq_rank_academic_session_level_1,phq_rank_academic_session_level_2,phq_rank_academic_session_level_3,phq_rank_academic_session_level_4,phq_rank_academic_session_level_5,phq_rank_academic_exam_level_1,phq_rank_academic_exam_level_2,phq_rank_academic_exam_level_3,phq_rank_academic_exam_level_4,phq_rank_academic_exam_level_5,phq_rank_academic_holiday_level_1,phq_rank_academic_holiday_level_2,phq_rank_academic_holiday_level_3,phq_rank_academic_holiday_level_4,phq_rank_academic_holiday_level_5
0,2021-06-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12.0,0.0
1,2021-06-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12.0,0.0
2,2021-06-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12.0,0.0
3,2021-06-04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12.0,0.0
4,2021-06-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12.0,0.0


<a id='impact_wide'></a>
### Impact based features
<b> Severe weather features </b>

In [29]:
categories_impact = {
    "phq_impact_severe_weather_air_quality_retail",
    "phq_impact_severe_weather_blizzard_retail",
    "phq_impact_severe_weather_cold_wave_retail",
    "phq_impact_severe_weather_cold_wave_snow_retail",
    "phq_impact_severe_weather_cold_wave_storm_retail",
    "phq_impact_severe_weather_dust_retail",
    "phq_impact_severe_weather_dust_storm_retail",
    "phq_impact_severe_weather_flood_retail",
    "phq_impact_severe_weather_heat_wave_retail",
    "phq_impact_severe_weather_hurricane_retail",
    "phq_impact_severe_weather_thunderstorm_retail",
    "phq_impact_severe_weather_tornado_retail",
    "phq_impact_severe_weather_tropical_storm_retail",
}


def get_features_api_impact_events(lat, lon, start, end, rank_threshold=30):
    start = datetime.strptime(start, DATE_FORMAT).date()
    end = datetime.strptime(end, DATE_FORMAT).date()

    result = []
    for gte, lte in get_date_groups(start, end):
        query = {
            "active__gte": gte,
            "active__lte": lte,
            "location__geo": {"lat": lat, "lon": lon, "radius": "1mi"},
        }

        query.update({f"{f}__stats": ["max"] for f in categories_impact})
        query.update(
            {f"{f}__phq_rank": {"gte": rank_threshold} for f in categories_impact}
        )

        features = phq.features.obtain_features(**query)

        for feature in features:
            record = {}
            for k, v in feature.to_dict().items():
                if k == "date":
                    record[k] = v.strftime("%Y-%m-%d")
                else:
                    record[k] = v.get("stats", {}).get("max")

            result.append(record)

    return result



res = get_features_api_impact_events(
    40.7079, -74.0115, "2021-06-01", "2022-07-04", 60
)
df_impact_features = pd.DataFrame(res)

# drop features that only contain 0s
columns_constant = [
    col
    for col in df_impact_features.columns
    if col != "date" and df_impact_features[col].sum() == 0
]
df_impact_features.drop(columns=columns_constant, inplace=True)

df_impact_features

Unnamed: 0,date,phq_impact_severe_weather_cold_wave_retail,phq_impact_severe_weather_cold_wave_storm_retail,phq_impact_severe_weather_flood_retail,phq_impact_severe_weather_heat_wave_retail,phq_impact_severe_weather_thunderstorm_retail,phq_impact_severe_weather_tornado_retail,phq_impact_severe_weather_tropical_storm_retail
0,2021-06-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2021-06-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2021-06-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2021-06-04,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2021-06-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...
394,2022-06-30,0.0,0.0,0.0,0.0,0.0,0.0,0.0
395,2022-07-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0
396,2022-07-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0
397,2022-07-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0
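
Once the attendance, rank and impact features have been downloaded, they can be joined on `date` into a single table for model training. This is a minimal sketch, assuming each frame has a `date` column as in the outputs above; `combine_feature_frames` is a hypothetical helper, the values are synthetic, and filling missing values with 0 is an assumption that treats an absent value as "no events".

```python
from functools import reduce

import pandas as pd


def combine_feature_frames(frames, on="date"):
    """Outer-join feature frames on the date column and fill missing values with 0."""
    merged = reduce(
        lambda left, right: pd.merge(left, right, on=on, how="outer"), frames
    )
    return merged.sort_values(on).fillna(0).reset_index(drop=True)


# Synthetic frames for illustration only.
attended = pd.DataFrame(
    {"date": ["2021-06-01", "2021-06-02"], "phq_attendance_sports": [4195.0, 14898.0]}
)
rank = pd.DataFrame(
    {"date": ["2021-06-01", "2021-06-02"], "phq_rank_observances_level_3": [3.0, 0.0]}
)
impact = pd.DataFrame(
    {"date": ["2021-06-02"], "phq_impact_severe_weather_flood_retail": [80.0]}
)
combined = combine_feature_frames([attended, rank, impact])
```

An outer join keeps every date that appears in any frame, so a feature group that is missing a day does not silently drop that day from the training set.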

