# loadData.ipynb

##### Purpose:
Downloads, loads, and aggregates the data, then builds the final datasets

##### Steps:
1. Create data folders to house any data files
2. Download the data
3. Load the data
4. Pull data
5. Aggregate data
6. Build the final data sets

##### Remarks:
- **It's important to check the settings of each step before it is run; some steps may duplicate their data if run multiple times**

<br>

- Both the Copernicus data and the satellite soil moisture data require permission to access
- Soil data from 1978 - 2001, as well as the ergot data, are available in a OneDrive folder, which makes them easier to load
- [Loading data from other sources can be pretty straightforward](#loading-other-data)
    - Note that loading a data source into a database this way sometimes does not work, in which case the following workarounds are required:
        - committing the change via sqlalchemy
        - manually creating the database then inserting the data
- [The same can be said for aggregating other data](#aggregating-other-data)
- More information about these tables can be found on the [readme](https://github.com/ChromaticPanic/CGC_Grain_Outcome_Predictions)

<br>

![data download path](./../.github/img/loadData.PNG)

In [None]:
from Datasets.setCreator import SetCreator  # type: ignore
import urllib.request
import zipfile
import os

Purpose:  
Create the data subfolders inside each folder found within the source directory

Pseudocode:  
- Get the current directory (points to src)
- Create the would-be paths for each data folder
- If the data folder does not exist, create it

In [None]:
srcDir = os.getcwd()

# Create the would-be paths for each data folder
datasetsData = os.path.join(srcDir, "Datasets/data")
ergotData = os.path.join(srcDir, "Ergot/data")
modelsData = os.path.join(srcDir, "Models/data")
copernicusData = os.path.join(srcDir, "SatelliteCopernicus/data")
moistureData = os.path.join(srcDir, "SatelliteSoilMoisture/data")
sharedData = os.path.join(srcDir, "Shared/data")
soilData = os.path.join(srcDir, "Soil/data")
stationData = os.path.join(srcDir, "WeatherStation/data")

# If a data folder does not exist, create it (exist_ok makes reruns safe)
for dataDir in (
    datasetsData,
    ergotData,
    modelsData,
    copernicusData,
    moistureData,
    sharedData,
    soilData,
    stationData,
):
    os.makedirs(dataDir, exist_ok=True)

# Download data

Purpose:  
Download the province geometries required to load the borders and stations

Pseudocode:  
- Move to the stationData directory
- [Download the necessary file](https://www150.statcan.gc.ca/n1/pub/92-174-x/2007000/carboundary/gcar000b07a_e.zip)
- Extract its contents (since it's a zip file)
- Go back to src

In [None]:
os.chdir(stationData)
print(os.getcwd())

urllib.request.urlretrieve(
    url="https://www150.statcan.gc.ca/n1/pub/92-174-x/2007000/carboundary/gcar000b07a_e.zip",
    filename="gcar000b07a_e.zip",
)

with zipfile.ZipFile(f"{os.getcwd()}/gcar000b07a_e.zip", "r") as zip_ref:
    zip_ref.extractall(os.getcwd())

os.chdir(f"{stationData}/../..")
print(os.getcwd())

Purpose:  
Download the list of weather stations

Pseudocode:  
- Move to the stationData directory
- [Download the necessary file](https://dd.weather.gc.ca/climate/observations/climate_station_list.csv)
- Go back to src

In [None]:
os.chdir(stationData)
print(os.getcwd())

urllib.request.urlretrieve(
    url="https://dd.weather.gc.ca/climate/observations/climate_station_list.csv",
    filename="climate_station_list.csv",
)

os.chdir(f"{stationData}/../..")
print(os.getcwd())

Purpose:  
Download the soil data

Pseudocode:  
- Move to the soilData directory
- Download the necessary files:
    - [geometries and their data](https://sis.agr.gc.ca/nsdb/ca/cac003/cac003.20110308.v3.2/ca_all_slc_v3r2.zip)
        - Extract its contents (since it's a zip file)
    - [Manitoba soil names](https://sis.agr.gc.ca/soildata/mb/soil_name_mb_v2r20130705.dbf)
    - [Manitoba soil layers](https://sis.agr.gc.ca/soildata/mb/soil_layer_mb_v2r20130705.dbf)
    - [Alberta soil names](https://sis.agr.gc.ca/soildata/ab/soil_name_ab_v2r20140529.dbf)
    - [Alberta soil layers](https://sis.agr.gc.ca/soildata/ab/soil_layer_ab_v2r20140529.dbf)
    - [Saskatchewan soil names](https://sis.agr.gc.ca/soildata/sk/soil_name_sk_v2r20130705.dbf)
    - [Saskatchewan soil layers](https://sis.agr.gc.ca/soildata/sk/soil_layer_sk_v2r20130705.dbf)
- Go back to src

In [None]:
os.chdir(soilData)
print(os.getcwd())

urllib.request.urlretrieve(
    url="https://sis.agr.gc.ca/nsdb/ca/cac003/cac003.20110308.v3.2/ca_all_slc_v3r2.zip",
    filename="ca_all_slc_v3r2.zip",
)

with zipfile.ZipFile(f"{os.getcwd()}/ca_all_slc_v3r2.zip", "r") as zip_ref:
    zip_ref.extractall(os.getcwd())

urllib.request.urlretrieve(
    url="https://sis.agr.gc.ca/soildata/mb/soil_name_mb_v2r20130705.dbf",
    filename="soil_name_mb_v2r20130705.dbf",
)
urllib.request.urlretrieve(
    url="https://sis.agr.gc.ca/soildata/mb/soil_layer_mb_v2r20130705.dbf",
    filename="soil_layer_mb_v2r20130705.dbf",
)

urllib.request.urlretrieve(
    url="https://sis.agr.gc.ca/soildata/ab/soil_name_ab_v2r20140529.dbf",
    filename="soil_name_ab_v2r20140529.dbf",
)
urllib.request.urlretrieve(
    url="https://sis.agr.gc.ca/soildata/ab/soil_layer_ab_v2r20140529.dbf",
    filename="soil_layer_ab_v2r20140529.dbf",
)

urllib.request.urlretrieve(
    url="https://sis.agr.gc.ca/soildata/sk/soil_name_sk_v2r20130705.dbf",
    filename="soil_name_sk_v2r20130705.dbf",
)
urllib.request.urlretrieve(
    url="https://sis.agr.gc.ca/soildata/sk/soil_layer_sk_v2r20130705.dbf",
    filename="soil_layer_sk_v2r20130705.dbf",
)

os.chdir(f"{soilData}/../..")
print(os.getcwd())
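
The download cells above repeat the same pattern: change directory, download, extract, change back. The sketch below shows one way to fold that into a helper that writes straight into the target folder and skips files that are already present. The name `fetch` and its `retrieve` parameter are hypothetical and not part of this repository; `retrieve` exists only so the download call can be swapped out.

```python
import os
import urllib.request
import zipfile
from urllib.parse import urlparse


def fetch(url, dest_dir, retrieve=urllib.request.urlretrieve):
    """Download url into dest_dir unless the file is already there; return its path."""
    os.makedirs(dest_dir, exist_ok=True)

    # Name the local file after the last component of the URL path
    filename = os.path.basename(urlparse(url).path)
    target = os.path.join(dest_dir, filename)

    # Skip the download when a previous run already fetched the file
    if not os.path.exists(target):
        retrieve(url, target)

    # Zip archives are extracted next to the downloaded file
    if zipfile.is_zipfile(target):
        with zipfile.ZipFile(target, "r") as zip_ref:
            zip_ref.extractall(dest_dir)

    return target


# Possible usage in this notebook (stationData is defined in an earlier cell):
# fetch("https://dd.weather.gc.ca/climate/observations/climate_station_list.csv", stationData)
```

Because nothing calls `os.chdir`, a failed download cannot leave the notebook in the wrong working directory.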

Ergot data

This file was provided by the Canadian Grains Commission. Loading this data must be done manually:

1. Locate the file
2. Create a copy or move it into the path that is stored in the variable below (ergotData)
3. Rename the file to newErgot.csv or change the FILENAME inside of [importErgot](https://github.com/ChromaticPanic/CGC_Grain_Outcome_Predictions/blob/main/src/Ergot/importErgot.ipynb) so that they match

In [None]:
ergotData
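
Before running the import, it can help to confirm that the file is where the notebook expects it. A small sketch follows; `require_file` is a hypothetical helper, and `newErgot.csv` is the filename from step 3.

```python
import os


def require_file(folder: str, filename: str) -> str:
    """Return the full path to filename inside folder, or raise if it is missing."""
    path = os.path.join(folder, filename)
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"Expected {filename} in {folder}; copy or move it there first"
        )
    return path


# Possible usage in this notebook (ergotData is defined in an earlier cell):
# require_file(ergotData, "newErgot.csv")
```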

Soil Moisture data

These files were provided by the ESA via SFTP. Unfortunately, we have lost access to them, so they too must be loaded manually:

1. Obtain the files from the University of Manitoba, from either of these locations:
    - stored on woodswallow-01 /../../../../data/common/Images
    - stored on Dane's machine
2. Copy or move them into the path stored in the variable below (moistureData)
3. Update the variable MAIN_FOLDER_PATH inside of [PullMoistureData](https://github.com/ChromaticPanic/CGC_Grain_Outcome_Predictions/blob/main/src/SatelliteSoilMoisture/PullMoistureData.py) so that it matches where the files were saved


In [None]:
moistureData

# Load data

Purpose:  
Runs the following file to [set up the geometries and stations](https://github.com/ChromaticPanic/CGC_Grain_Outcome_Predictions/blob/main/src/WeatherStation/importBoundariesAndStations.ipynb)

Updating the data:
- this only needs to be run once
- if the areas or weather stations change, rerunning the script will overwrite whatever was already stored

In [None]:
%run WeatherStation/importBoundariesAndStations.ipynb

Purpose:  
Runs the following file to [set up the ergot data](https://github.com/ChromaticPanic/CGC_Grain_Outcome_Predictions/blob/main/src/Ergot/importErgot.ipynb)

Updating the data:
- this only needs to be run once
- if the ergot data changes, rerunning the script will overwrite whatever was already stored

In [None]:
%run Ergot/importErgot.ipynb

Purpose:  
Runs the following file to [set up the soil moisture data](https://github.com/ChromaticPanic/CGC_Grain_Outcome_Predictions/blob/main/src/SatelliteSoilMoisture/PullMoistureData.py)

Updating the data:
- this only needs to be run once (data gets appended to the existing table)
- to add more data, the [files must be accessed with permission from the European Space Agency](https://www.esa.int/Applications/Observing_the_Earth/Space_for_our_climate/Nearly_four_decades_of_soil_moisture_data_now_available)

Remarks: this file will duplicate its data if run multiple times, unless the data within the soil moisture data folder is moved afterwards

In [None]:
%run SatelliteSoilMoisture/PullMoistureData.py
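
Since a rerun appends whatever is still in the soil moisture data folder, one way to guard against duplicates is to move the files out once they have been loaded. The sketch below is an assumption about how that could be done, not something the repository provides: `archive_processed` and the archive location are hypothetical, and it only handles files sitting directly in the folder.

```python
import os
import shutil


def archive_processed(data_dir: str, archive_dir: str) -> list:
    """Move every regular file in data_dir into archive_dir; return the moved names."""
    os.makedirs(archive_dir, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(data_dir)):
        source = os.path.join(data_dir, name)
        # Subfolders (including the archive itself, if nested) are left alone
        if os.path.isfile(source):
            shutil.move(source, os.path.join(archive_dir, name))
            moved.append(name)
    return moved


# Possible usage after the load has finished (moistureData is defined earlier):
# archive_processed(moistureData, os.path.join(moistureData, "processed"))
```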

Purpose:  
Runs the following file to [set up the soil data](https://github.com/ChromaticPanic/CGC_Grain_Outcome_Predictions/blob/main/src/Soil/importSoil.ipynb)

Updating the data:
- this only needs to be run once
- if the soil data changes, rerunning the script will overwrite whatever was already stored

In [None]:
%run Soil/importSoil.ipynb

##### Loading other data  

This part (enclosed within the comment block) should never change  
1. Load the necessary environment variables and packages

<br>

This part will change depending on the data you load  

2. Locate where the data is relative to this file (store in PATH)
3. Give a name to the table that will hold the data you are loading (TABLENAME)
4. Load it in using one of the two commands: [read_file](https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html) or [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
5. Store it in the database using one of the two commands: [to_sql](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html) or [to_postgis](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_postgis.html)

Remarks: some data will not persist in the database when following these steps. In those instances, the following must be done:
- [committing the change via sqlalchemy](https://docs.sqlalchemy.org/en/20/orm/session_basics.html)
- manually creating the database then inserting the data, e.g. [CopernicusQueryBuilder.py](https://github.com/ChromaticPanic/CGC_Grain_Outcome_Predictions/blob/main/src/SatelliteCopernicus/CopernicusQueryBuilder.py)

Note that this does not include the labeling process. It can be added by applying the following function once the data is loaded:
```
import geopandas as gpd
import pandas as pd


def addRegions(df: pd.DataFrame, agRegions: gpd.GeoDataFrame) -> pd.DataFrame:
    # Creates geometry from df using lon and lat as coordinates to create points (points being the geometry)
    df = gpd.GeoDataFrame(
        df, crs="EPSG:4326", geometry=gpd.points_from_xy(df.lon, df.lat)
    )

    # Changes the points projection to match the agriculture regions of EPSG:3347
    df.to_crs(crs="EPSG:3347", inplace=True)  # type: ignore

    # Join the two dataframes based on which points fit within what agriculture regions
    df = gpd.sjoin(df, agRegions, how="left", predicate="within")

    df = pd.DataFrame(df.drop(columns=["index_right", "geometry"]))

    df = df[df["cr_num"].notna()]  # Take rows that are valid numbers
    df[["cr_num"]] = df[["cr_num"]].astype(int)

    return df
```

In [None]:
# -----------------------------------------------------------------------
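# from dotenv import load_dotenv
# from Shared.DataService import DataService  # assumption: adjust to DataService's actual module path
# import geopandas as gpd
# import pandas as pd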
#
# Load the database connection environment variables located in the docker folder
# load_dotenv("docker/.env")
# PG_USER = os.getenv("POSTGRES_USER")
# PG_PW = os.getenv("POSTGRES_PW")
# PG_DB = os.getenv("POSTGRES_DB")
# PG_ADDR = os.getenv("POSTGRES_ADDR")
# PG_PORT = os.getenv("POSTGRES_PORT")
#
# if (
#     PG_DB is None
#     or PG_ADDR is None
#     or PG_PORT is None
#     or PG_USER is None
#     or PG_PW is None
# ):
#     raise ValueError("Environment variables not set")
#
# db = DataService(PG_DB, PG_ADDR, int(PG_PORT), PG_USER, PG_PW)
# conn = db.connect()
# -----------------------------------------------------------------------

# PATH = ...
# TABLENAME = ...


# Data with geometry
# ----------------------
# data = gpd.read_file(PATH, encoding="utf-8")
# data.to_postgis(TABLENAME, conn, index=False, if_exists="replace")

# OR

# All other data
# ----------------------
# data = pd.read_csv(PATH)
# data.to_sql(TABLENAME, conn, index=False, if_exists="replace")
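
To make the non-geometry path concrete, here is a minimal, self-contained sketch of steps 2 to 5. It rests on several assumptions: an in-memory SQLite database stands in for the PostgreSQL connection `conn`, the CSV content and the table name `example_table` are made up, and the explicit `conn.commit()` illustrates the commit workaround at the DB-API level (with SQLAlchemy 2.0 the equivalent is calling `commit()` on the connection or session).

```python
import io
import sqlite3

import pandas as pd

# Made-up CSV content standing in for the file located at PATH
CSV_TEXT = "district,year,value\n1,2000,0.5\n2,2000,0.7\n"
TABLENAME = "example_table"  # hypothetical table name

# SQLite in memory stands in for the project's PostgreSQL connection
conn = sqlite3.connect(":memory:")

# Step 4: load the data
data = pd.read_csv(io.StringIO(CSV_TEXT))

# Step 5: store it, replacing any table left over from a previous run
data.to_sql(TABLENAME, conn, index=False, if_exists="replace")

# Explicit commit, so the rows persist even if the driver did not autocommit
conn.commit()

# Read the table back to confirm the rows were stored
stored = pd.read_sql(f"SELECT * FROM {TABLENAME}", conn)
print(stored)
```

Reading the table back right after the commit is a cheap way to notice the "did not persist" case described in the remarks.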

# Pull data

Purpose:  
Loads the daily weather station data

Updating the data: 
- This script always pulls data starting from the last year for which data was pulled, plus anything after that
- Stations can be manually disabled by updating their [is_active column](https://github.com/ChromaticPanic/CGC_Grain_Outcome_Predictions/tree/main#station_data_last_updated)
- The date of each station's latest data is stored in its [latest_data column](https://github.com/ChromaticPanic/CGC_Grain_Outcome_Predictions/tree/main#station_data_last_updated)

Remarks: 
- Data that was already stored gets deleted (no duplicates are ever stored)
- Logs are not set up; however, if an error is encountered, it is printed to the console and that particular station is skipped

In [None]:
%run WeatherStation/scrapeDaily.py
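
The "delete what was stored, then store again" behaviour described above is what makes reruns safe. The sketch below shows the general pattern, not the actual code in scrapeDaily.py: the table `daily_example`, its columns, and the helper `replace_station_rows` are all hypothetical, and SQLite stands in for the project database.

```python
import sqlite3


def replace_station_rows(conn, station_id, rows):
    """Delete a station's existing rows, then insert the fresh ones, in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM daily_example WHERE station_id = ?", (station_id,))
        conn.executemany(
            "INSERT INTO daily_example (station_id, date, mean_temp) VALUES (?, ?, ?)",
            [(station_id, date, temp) for date, temp in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_example (station_id TEXT, date TEXT, mean_temp REAL)")

rows = [("2022-01-01", -12.5), ("2022-01-02", -9.0)]
replace_station_rows(conn, "S1", rows)
replace_station_rows(conn, "S1", rows)  # rerun: old rows are deleted first

count = conn.execute("SELECT COUNT(*) FROM daily_example").fetchone()[0]
print(count)
```

Running the delete and the insert in the same transaction means a failure partway through leaves the previously stored rows untouched.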

Purpose:  
Loads the hourly weather station data

Updating the data: 
- This script always pulls all data across the full range for which a weather station has data

Remarks: 
- **Duplicate data may occur if run multiple times**
- Logs are set up; if an error is encountered, it is recorded and that particular station is skipped

In [None]:
%run WeatherStation/scrapeHourlyParallel.py

Purpose:  
Loads the Copernicus satellite weather data

Updating the data: 
- This script will only pull data that has not been stored in the database yet
- Copernicus needs an API key which, if access has been (or still is) granted, can be set up with the [following steps](https://cds.climate.copernicus.eu/api-how-to)

Remarks: 
- Logs are set up; if an error is encountered, it is recorded and that entry is skipped

In [None]:
%run SatelliteCopernicus/pullCopernicusData.py
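
The two behaviours described above (only pull what is not stored yet; log an error and skip that entry) can be sketched as a single loop. This is an illustration under assumptions, not the code in pullCopernicusData.py: `pull_missing`, its parameters, and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("pull_example")  # hypothetical logger name


def pull_missing(requested, stored, fetch):
    """Fetch every requested key that is not stored yet; log and skip failures."""
    pulled, failed = [], []
    for key in requested:
        if key in stored:
            continue  # already in the database, nothing to do
        try:
            pulled.append(fetch(key))
        except Exception:
            # Record the error with its traceback and move on to the next entry
            logger.exception("Failed to pull %s; skipping", key)
            failed.append(key)
    return pulled, failed
```

Returning the failed keys alongside the results makes it easy to retry just those entries on a later run.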

# Aggregation

Purpose:  
Aggregates the ergot data

Updating the data: 
- Rerunning this script will replace previous data

In [None]:
%run Ergot/aggregateErgot.py

Purpose:  
Runs feature engineering on the aggregated ergot data

Updating the data: 
- Rerunning this script will replace previous data

In [None]:
%run Ergot/featureEngErgot.py

Purpose:  
Aggregates the weather station data (daily and hourly)

Updating the data: 
- Rerunning this script will replace previous data

In [None]:
%run WeatherStation/CombineProvinceData.py

Purpose:  
Aggregates the soil data

Updating the data: 
- Rerunning this script will replace previous data

In [None]:
%run Soil/soilAggregation.ipynb

Purpose:  
Aggregates the soil moisture data

Updating the data: 
- Rerunning this script will replace previous data

In [None]:
%run SatelliteSoilMoisture/soilMoistureAggregation.ipynb

##### Aggregating other data  

This part (enclosed within the comment block) should never change  
1. Load the necessary environment variables and packages

<br>

This part will change depending on the data you load  

2. Create the query to load the data from the database (QUERY)
3. Decide if the data should be stored as a CSV file or in the database (this will most likely depend on how many columns appear in the aggregation)
    - If there are more than 1600 columns (the PostgreSQL per-table limit), use the CSV and set a location to store the aggregated data (PATH)
    - Otherwise create a name for the table (TABLENAME)
4. [Read the data you want to aggregate](https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html)
5. [Group the values](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) by whatever you'd like to aggregate the values by and [choose how they should be aggregated](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html)
6. Set the column names
7. Preprocess the data with the [aggregatorHelper](https://github.com/ChromaticPanic/CGC_Grain_Outcome_Predictions/blob/main/src/Shared/aggregatorHelper.py) if needed
8. Store the data
    - [CSV](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html)
    - [Database](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html)

Remarks: some data will not persist in the database when following these steps. In those instances, the following must be done:
- [committing the change via sqlalchemy](https://docs.sqlalchemy.org/en/20/orm/session_basics.html)
- manually creating the database then inserting the data, e.g. [CopernicusQueryBuilder.py](https://github.com/ChromaticPanic/CGC_Grain_Outcome_Predictions/blob/main/src/SatelliteCopernicus/CopernicusQueryBuilder.py)

In [None]:
# -----------------------------------------------------------------------
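# from dotenv import load_dotenv
# from Shared.DataService import DataService  # assumption: adjust to DataService's actual module path
# from Shared.aggregatorHelper import AggregatorHelper  # type: ignore
# import pandas as pd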
#
# Load the database connection environment variables located in the docker folder
# load_dotenv("docker/.env")
# PG_USER = os.getenv("POSTGRES_USER")
# PG_PW = os.getenv("POSTGRES_PW")
# PG_DB = os.getenv("POSTGRES_DB")
# PG_ADDR = os.getenv("POSTGRES_ADDR")
# PG_PORT = os.getenv("POSTGRES_PORT")
#
# if (
#     PG_DB is None
#     or PG_ADDR is None
#     or PG_PORT is None
#     or PG_USER is None
#     or PG_PW is None
# ):
#     raise ValueError("Environment variables not set")
#
# db = DataService(PG_DB, PG_ADDR, int(PG_PORT), PG_USER, PG_PW)
# conn = db.connect()
# -----------------------------------------------------------------------

# QUERY = ...
# TABLENAME = ...
# PATH = ...

# data = pd.read_sql(QUERY, conn)

# What you group by, as well as which values are aggregated, will depend on the data
# agg_df = (
#     data.groupby(["district", "year"])  # add further date columns as needed (e.g. "week", "day")
#     .agg({"attribute": ["min", "max", "mean"]})
#     .reset_index()
# )

# This sets the column names, which again depend on the data
# agg_df.columns = [  # type: ignore
#     "district",
#     "year",  # plus any further date columns...
#     "min_attribute",  # one name per aggregated attribute...
#     "max_attribute",
#     "mean_attribute",
# ]

# you can either store the aggregated data in the database (if below 1600 columns)
# -----------------------------------------------------------------------------------------
# agg_df.to_sql(TABLENAME, con=conn, schema="public", if_exists="replace", index=False)

# OR

# store the aggregated data in a csv file
# -----------------------------------------------------------------------------------------
# helper = AggregatorHelper()

# this example aggregates by individual dates
# dates = helper.getDatesInYr()
# agg_df = helper.reshapeDataByDates(dates, agg_df, data, "dates")

# agg_df.to_csv(path_or_buf=PATH, sep=",", columns=agg_df.columns.tolist(),)
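
As a concrete illustration of steps 4 to 6 and the storage decision in step 3, the sketch below runs the same groupby/agg pattern on a tiny made-up DataFrame. The data, the column names, and the helper `choose_storage` are assumptions for demonstration only; the 1600 figure is the column limit mentioned above.

```python
import pandas as pd

MAX_DB_COLUMNS = 1600  # column limit mentioned in step 3

# Made-up data standing in for the result of pd.read_sql(QUERY, conn)
data = pd.DataFrame(
    {
        "district": [1, 1, 2, 2],
        "year": [2000, 2000, 2000, 2001],
        "attribute": [1.0, 3.0, 5.0, 7.0],
    }
)

# Group by district and year, then compute three aggregates of the attribute
agg_df = (
    data.groupby(["district", "year"])
    .agg({"attribute": ["min", "max", "mean"]})
    .reset_index()
)

# agg() with several functions yields two-level column labels; flatten them
agg_df.columns = ["district", "year", "min_attribute", "max_attribute", "mean_attribute"]


def choose_storage(df: pd.DataFrame) -> str:
    """Return "csv" when df has too many columns for a table, else "database"."""
    return "csv" if df.shape[1] > MAX_DB_COLUMNS else "database"


print(agg_df)
print(choose_storage(agg_df))
```

Wide results typically appear only after reshaping by date (as with `reshapeDataByDates` in the template), which is when the CSV branch becomes relevant.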

# Build datasets

Purpose:  
Creates exploratory datasets and writes them to CSV files (they have too many columns for the database)

Remarks: these CSV files can be found within the Datasets/data folder (datasetsData)

In [None]:
SetCreator()

Purpose:  
Creates the final datasets and loads them into the database

In [None]:
%run Datasets/DatasetJS.py