# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span>

<span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 01: Feature Backfill</span>

**Note**: This tutorial does not support Google Colab.

## 🗒️ This notebook is divided into the following sections:

1. Fetch historical data.
2. Connect to the Hopsworks feature store.
3. Create feature groups and insert the data into the feature store.

![tutorial-flow](../../images/01_featuregroups.png)

## <span style='color:#ff5f27'> 📝 Imports </span>

In [None]:
!pip install -U 'hopsworks[python]' --quiet
!pip install geopy folium streamlit-folium --quiet

In [None]:
import json

import pandas as pd
import folium

from features import air_quality
from functions.common_functions import convert_date_to_unix

import warnings
warnings.filterwarnings("ignore")

## <span style='color:#ff5f27'> 🌍 Representing the Target cities </span>

In [None]:
# Open the 'target_cities.json' file in read mode
with open('target_cities.json') as json_file:
    # Load the JSON data from the file into a Python dictionary
    target_cities = json.load(json_file)

# Now, 'target_cities' contains the data from the JSON file
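The cells that follow index `target_cities` by region (`"EU"`, `"US"`, `"Seattle"`) and treat each value as a mapping from city name to a coordinate pair. The sketch below illustrates that assumed layout with made-up sample entries (the real file contents are not shown in this notebook) and a hypothetical helper, `validate_target_cities`, that checks each entry is a plausible `[latitude, longitude]` pair.

```python
import json

# Hypothetical sample mirroring the assumed layout of target_cities.json:
# region -> city name -> [latitude, longitude]
sample_json = """
{
  "EU": {"Amsterdam": [52.37, 4.89], "Athina": [37.98, 23.73]},
  "US": {"Chicago": [41.88, -87.62]}
}
"""


def validate_target_cities(cities):
    """Return the number of cities, raising ValueError on a malformed entry."""
    count = 0
    for region, region_cities in cities.items():
        for city_name, coords in region_cities.items():
            if len(coords) != 2:
                raise ValueError(f"{region}/{city_name}: expected [lat, lon]")
            lat, lon = coords
            if not (-90 <= lat <= 90 and -180 <= lon <= 180):
                raise ValueError(f"{region}/{city_name}: coordinates out of range")
            count += 1
    return count


sample_cities = json.loads(sample_json)
n_cities = validate_target_cities(sample_cities)
print(n_cities)
```

Running a check like this right after loading the file can surface swapped or missing coordinates before they reach the map.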

In [None]:
# Create a folium map centered on a fixed point (latitude 42.57, longitude -44.092)
my_map = folium.Map(location=[42.57, -44.092], zoom_start=3)

# Add a circle marker for every city in every region
for continent in target_cities:
    for city_name, coords in target_cities[continent].items():
        folium.CircleMarker(
            location=coords,
            popup=city_name,
        ).add_to(my_map)
# my_map

In [None]:
# # Save the map to an HTML file
# my_map.save("map_all_target_cities.html")
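The map in the cell above is centered on a fixed coordinate pair. One possible alternative, sketched here under the assumption that `target_cities` maps region to `{city: [latitude, longitude]}`, is to center on the arithmetic mean of all city coordinates. `map_center` is a hypothetical helper and the sample coordinates are illustrative only; a plain mean is a rough approximation that ignores the curvature of the Earth and misbehaves for cities on both sides of the antimeridian.

```python
def map_center(cities):
    """Return [lat, lon] as the arithmetic mean of all city coordinates."""
    coords = [c for region in cities.values() for c in region.values()]
    lat = sum(c[0] for c in coords) / len(coords)
    lon = sum(c[1] for c in coords) / len(coords)
    return [round(lat, 3), round(lon, 3)]


# Illustrative coordinates, not taken from target_cities.json
sample = {"EU": {"A": [50.0, 10.0]}, "US": {"B": [40.0, -90.0]}}
center = map_center(sample)
print(center)
```

The resulting pair could then be passed as `location=` when constructing `folium.Map`.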

## <span style='color:#ff5f27'> 🌫 Processing Air Quality data</span>

### [🇪🇺 EEA](https://discomap.eea.europa.eu/map/fme/AirQualityExport.htm)
#### EEA stands for European Environment Agency

In [None]:
# EU Cities 
target_cities["EU"]

In [None]:
# Read the CSV file from the specified URL into a pandas DataFrame
df_eu = pd.read_csv("https://repo.hops.works/dev/davit/air_quality/backfill_pm2_5_eu.csv")

# Print the size of the 'df_eu' DataFrame (number of rows and columns)
print("⛳️ Size of this dataframe:", df_eu.shape)

# Check for missing values in the 'df_eu' DataFrame
print(f'⛳️ Missing Values: {df_eu.isna().sum().sum()}')

# Display a random sample of three rows from the 'df_eu' DataFrame
df_eu.sample(3)

### [🇺🇸 USEPA](https://aqs.epa.gov/aqsweb/documents/data_api.html#daily)
#### USEPA means United States Environmental Protection Agency
[Manual downloading](https://www.epa.gov/outdoor-air-quality-data/download-daily-data)



In [None]:
# US Cities 
target_cities["US"]

In [None]:
# Read the CSV file from the specified URL into a pandas DataFrame
df_us = pd.read_csv("https://repo.hops.works/dev/davit/air_quality/backfill_pm2_5_us.csv")

# Print the size of the 'df_us' DataFrame (number of rows and columns)
print("⛳️ Size of this dataframe:", df_us.shape)

# Check for missing values in the 'df_us' DataFrame
print(f'⛳️ Missing Values: {df_us.isna().sum().sum()}')

# Display a random sample of three rows from the 'df_us' DataFrame
df_us.sample(3)

### <span style="color:#ff5f27;">🏢 Processing special city - `Seattle`</span>
#### We need data from several different stations across Seattle.
I downloaded the daily `PM2.5` data manually from [here](https://www.epa.gov/outdoor-air-quality-data/download-daily-data).

In [None]:
target_cities["Seattle"]

In [None]:
# Read the CSV file from the specified URL into a pandas DataFrame
df_seattle = pd.read_csv("https://repo.hops.works/dev/davit/air_quality/backfill_pm2_5_seattle.csv")

# Print the size of the 'df_seattle' DataFrame (number of rows and columns)
print("⛳️ Size of this dataframe:", df_seattle.shape)

# Check for missing values in the 'df_seattle' DataFrame
print(f'⛳️ Missing Values: {df_seattle.isna().sum().sum()}')

# Display a random sample of three rows
df_seattle.sample(3)

### <span style="color:#ff5f27;">🌟 All together</span>

In [None]:
# Concatenate the DataFrames df_eu, df_us, and df_seattle along the rows and reset the index
df_air_quality = pd.concat(
    [df_eu, df_us, df_seattle],
).reset_index(drop=True)

# Print the shape of the df_air_quality DataFrame
print(f'⛳️ DF shape: {df_air_quality.shape}')

# Display a random sample of five rows from the df_air_quality DataFrame
df_air_quality.sample(5)
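Each source DataFrame carries its own zero-based index, so stacking them repeats index labels unless the index is rebuilt. The toy example below shows the effect; the `pm2_5` column name and all values are assumptions made for illustration.

```python
import pandas as pd

# Toy frames standing in for the per-region DataFrames
df_a = pd.DataFrame({"city_name": ["Amsterdam", "Athina"], "pm2_5": [10.0, 12.5]})
df_b = pd.DataFrame({"city_name": ["Chicago"], "pm2_5": [8.0]})

# Plain concatenation keeps each frame's original index labels
stacked = pd.concat([df_a, df_b])

# Dropping the old index yields one contiguous range over all rows
combined = stacked.reset_index(drop=True)

print(list(stacked.index))
print(list(combined.index))
```

Passing `ignore_index=True` to `pd.concat` gives the same contiguous index in a single step.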

## <span style="color:#ff5f27;">🛠 Feature Engineering</span>

In [None]:
# Convert the 'date' column in the df_air_quality DataFrame to datetime format
df_air_quality['date'] = pd.to_datetime(df_air_quality['date'])

In [None]:
# Apply feature engineering to the df_air_quality DataFrame using the air_quality.feature_engineer_aq() function
df_air_quality = air_quality.feature_engineer_aq(df_air_quality)

# Drop rows with missing values in the df_air_quality DataFrame
df_air_quality = df_air_quality.dropna()

# Check and print the total number of missing values in the df_air_quality DataFrame
df_air_quality.isna().sum().sum()

In [None]:
# Print the shape (number of rows and columns) of the df_air_quality DataFrame
df_air_quality.shape

In [None]:
# Retrieve and display the column names of the df_air_quality DataFrame
df_air_quality.columns

## <span style='color:#ff5f27'> 🌦 Loading Weather Data from [Open Meteo](https://open-meteo.com/en/docs) </span>

In [None]:
# Read the CSV file from the specified URL into a pandas DataFrame for weather data
df_weather = pd.read_csv("https://repo.hops.works/dev/davit/air_quality/backfill_weather.csv")

# Display the first three rows of the df_weather DataFrame
df_weather.head(3)

---

In [None]:
# Apply the 'convert_date_to_unix' function to create a new 'unix_time' column in df_air_quality
df_air_quality["unix_time"] = pd.to_datetime(df_air_quality.date).apply(convert_date_to_unix)

# Apply the 'convert_date_to_unix' function to create a new 'unix_time' column in df_weather
df_weather["unix_time"] = pd.to_datetime(df_weather.date).apply(convert_date_to_unix)

# Convert the 'date' column in the df_air_quality DataFrame back to string format
df_air_quality.date = df_air_quality.date.astype(str)

# Convert the 'date' column in the df_weather DataFrame back to string format
df_weather.date = df_weather.date.astype(str)
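The `convert_date_to_unix` helper is imported from `functions.common_functions` and its body is not shown in this notebook. As a rough sketch of what such a conversion could look like, the hypothetical `date_to_unix_ms` below turns a date into milliseconds since the Unix epoch. Both the millisecond unit and the UTC interpretation of naive dates are assumptions, not confirmed behavior of the real helper.

```python
from datetime import datetime, timezone


def date_to_unix_ms(value):
    """Convert a 'YYYY-MM-DD' string or datetime to epoch milliseconds.

    Naive datetimes are interpreted as UTC (an assumption of this sketch).
    """
    if isinstance(value, str):
        value = datetime.strptime(value, "%Y-%m-%d")
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)
    return int(value.timestamp() * 1000)


print(date_to_unix_ms("2023-01-01"))
```

An integer timestamp like this gives both feature groups a common numeric column to use as event time.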

## <span style="color:#ff5f27;"> 🔮 Connecting to Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 

## <span style="color:#ff5f27;">🪄 Creating Feature Groups</span>

### <span style='color:#ff5f27'> 🌫 Air Quality Data </span>

In [None]:
# Get or create feature group
air_quality_fg = fs.get_or_create_feature_group(
    name='air_quality',
    description='Air Quality characteristics of each day',
    version=1,
    primary_key=['unix_time','city_name'],
    event_time="unix_time",
)   
# Insert data
air_quality_fg.insert(df_air_quality)
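Because the feature group declares `unix_time` and `city_name` as its primary key, it can be worth confirming that this combination is unique in the DataFrame before inserting. The sketch below uses a hypothetical helper, `find_duplicate_keys`, on toy data; the `pm2_5` column and all values are assumptions for illustration.

```python
import pandas as pd


def find_duplicate_keys(df, key_columns):
    """Return every row whose key combination appears more than once."""
    mask = df.duplicated(subset=key_columns, keep=False)
    return df[mask]


# Toy data: the first two rows share the same (unix_time, city_name) key
toy = pd.DataFrame({
    "unix_time": [1672531200000, 1672531200000, 1672617600000],
    "city_name": ["Seattle", "Seattle", "Seattle"],
    "pm2_5": [5.0, 6.0, 7.0],
})

duplicates = find_duplicate_keys(toy, ["unix_time", "city_name"])
print(len(duplicates))
```

An empty result means every row has a distinct key; otherwise the returned rows show which records would need to be aggregated or dropped first.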

### <span style='color:#ff5f27'> 🌦 Weather Data </span>

In [None]:
# Get or create feature group
weather_fg = fs.get_or_create_feature_group(
    name='weather',
    description='Weather characteristics of each day',
    version=1,
    primary_key=['unix_time','city_name'],
    event_time="unix_time",
) 
# Insert data
weather_fg.insert(df_weather)

---
## <span style="color:#ff5f27;">⏭️ **Next:** Part 02: Feature Pipeline</span>

In the following notebook you will parse data and insert it into Feature Groups.