# ergotVisualization.ipynb

Visualization for [ergot_sample](https://github.com/ChromaticPanic/CGC_Grain_Outcome_Predictions/tree/main#ergot_sample) and [agg_ergot_sample](https://github.com/ChromaticPanic/CGC_Grain_Outcome_Predictions/tree/main#agg_ergot_sample). The following notebook can be used to visualize the data.

## Output graphs:

- Correlation plot ([such as ...](https://github.com/ChromaticPanic/CGC_Grain_Outcome_Predictions/blob/main/.github/img/ergotCorr.png))


In [None]:
from matplotlib import pyplot as plt  # type: ignore
from matplotlib import cm, colors  # type: ignore
from matplotlib.colors import ListedColormap, Normalize  # type: ignore
from matplotlib.cm import ScalarMappable  # type: ignore
from dotenv import load_dotenv
import geopandas as gpd  # type: ignore
import sqlalchemy as sq
import pandas as pd
import numpy as np
import os, sys

sys.path.append("../")
from Shared.DataService import DataService

Pseudocode:  
- Load the environment database variables
- Connect to the database

In [None]:
load_dotenv()
PG_DB = os.getenv("POSTGRES_DB")
PG_ADDR = os.getenv("POSTGRES_ADDR")
PG_PORT = os.getenv("POSTGRES_PORT")
PG_USER = os.getenv("POSTGRES_USER")
PG_PW = os.getenv("POSTGRES_PW")

In [None]:
if (
    PG_DB is None
    or PG_ADDR is None
    or PG_PORT is None
    or PG_USER is None
    or PG_PW is None
):
    raise ValueError("Environment variables not set")

# connecting to database
db = DataService(PG_DB, PG_ADDR, int(PG_PORT), PG_USER, PG_PW)
conn = db.connect()

## Visualization before aggregation

Purpose: Self-contained retrieval of the ergot data before aggregation, for use in the visualizations that follow.

Pseudocode: 
- Create the ergot data SQL query
- [Load the data from the database directly into a DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html) 

In [None]:
query = sq.text("SELECT * FROM public.ergot_sample")
ergot_df = pd.read_sql(query, conn)

In [None]:
ergot_df

Purpose: 
- Derive a "district" code for each row from its "province" and "crop_district" values, so that district identifiers are consistent across provinces.

Pseudocode: 
- Update the "district" column in the DataFrame ergot_df based on the "province" and "crop_district" values for each row:
  - MB: crop_district + 4600
  - SK: (crop_district - 1) + 4700
  - AB: (crop_district * 10) + 4800

In [None]:
ergot_df.loc[ergot_df["province"] == "MB", "district"] = (
    ergot_df.loc[ergot_df["province"] == "MB", "crop_district"] + 4600
)
ergot_df.loc[ergot_df["province"] == "SK", "district"] = (
    ergot_df.loc[ergot_df["province"] == "SK", "crop_district"] - 1
) + 4700
ergot_df.loc[ergot_df["province"] == "AB", "district"] = (
    ergot_df.loc[ergot_df["province"] == "AB", "crop_district"] * 10
) + 4800
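
The three recoding rules can be checked on single values. The sketch below restates them as a plain function; `to_district_code` is a hypothetical helper written for illustration, not part of the notebook or the project.

```python
def to_district_code(province: str, crop_district: int) -> int:
    """Apply the same arithmetic as the cell above to one row (hypothetical helper)."""
    if province == "MB":
        return crop_district + 4600
    if province == "SK":
        return (crop_district - 1) + 4700
    if province == "AB":
        return crop_district * 10 + 4800
    raise ValueError(f"no recoding rule for province {province!r}")


print(to_district_code("MB", 3))  # 4603
print(to_district_code("SK", 5))  # 4704
print(to_district_code("AB", 2))  # 4820
```

Unlike this helper, the notebook cell does not raise for other provinces; it simply leaves those rows untouched.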

Purpose:  
- Drop the attributes that are no longer needed ("crop_district" and "sample_id")

In [None]:
ergot_df.drop(columns=["crop_district", "sample_id"], inplace=True)

Purpose:
- Store the "district" column as integers. The conversion only produces an integer type when every value is a whole number and none is missing.

Pseudocode:
- [Convert the district values to the smallest integer type that can hold them](https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html)


In [None]:
ergot_df["district"] = pd.to_numeric(ergot_df["district"], downcast="integer")
ergot_df
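
`downcast="integer"` only yields an integer dtype when every value is a whole number and none is missing. A small sketch with made-up values, assuming pandas is installed:

```python
import pandas as pd

# Made-up district codes stored as floats
complete = pd.Series([4601.0, 4702.0, 4810.0])
with_gap = pd.Series([4601.0, None, 4810.0])  # one row without a district

complete_dtype = pd.to_numeric(complete, downcast="integer").dtype
with_gap_dtype = pd.to_numeric(with_gap, downcast="integer").dtype

print(complete_dtype.kind)  # "i": downcast to an integer dtype
print(with_gap_dtype.kind)  # "f": NaN cannot be stored as an integer, so it stays float
```

If the second case shows up in the real data, the rows with a missing district need to be handled before the column can be treated as integer.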

Purpose:
- Visualize and compare the number of samples with ergot incidence in Manitoba, Alberta, and Saskatchewan over the years.

Pseudocode:
- [use plt.xlabel to label x-axis](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.xlabel.html)
- [use plt.ylabel to label y-axis](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.ylabel.html)
- [use plt.figure to determine the dimensions of graph](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html)
- [use plt.plot to plot the graph](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html)

In [None]:
# number of samples with incidence = True, per province and year
samples_df = (
    ergot_df[ergot_df["incidence"] == True]
    .groupby(["province", "year"])["incidence"]
    .count()
    .reset_index()
)
mb_df = samples_df[samples_df["province"] == "MB"]
ab_df = samples_df[samples_df["province"] == "AB"]
sk_df = samples_df[samples_df["province"] == "SK"]

# kept because a later cell reuses the Manitoba years as its x-axis
year = mb_df["year"].tolist()


plt.figure(figsize=(10, 5))
plt.xlabel("Year")
plt.ylabel("Samples with ergot incidence")
# plot each province against its own years so x and y always have the same length
plt.plot(mb_df["year"], mb_df["incidence"], color="blue")
plt.plot(ab_df["year"], ab_df["incidence"], color="green")
plt.plot(sk_df["year"], sk_df["incidence"], color="red")
plt.legend(["Manitoba", "Alberta", "Saskatchewan"])
plt.show()

Purpose:
- Create a DataFrame (ratio_df) containing the ergot incidence ratio for each province and year. The ratio is the number of samples with ergot divided by the total number of samples in the same province-year group.

In [None]:
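totals = ergot_df.groupby(["province", "year"])["incidence"].count().reset_index(name="total")
# match positive counts to totals by key (not row position); groups without positives get 0
ratio_df = totals.merge(samples_df, on=["province", "year"], how="left")
ratio_df["ratio"] = ratio_df["incidence"].fillna(0) / ratio_df["total"]
ratio_df = ratio_df[["province", "year", "ratio"]]
ratio_df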
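
The ratio for a group is the number of samples with ergot divided by the number of all samples in that same group. As a worked example on made-up rows, keyed explicitly by (province, year) so that each count is divided by its own total:

```python
from collections import Counter

# Made-up (province, year, incidence) rows for illustration only
rows = [
    ("MB", 1995, True),
    ("MB", 1995, False),
    ("MB", 1996, False),
    ("SK", 1995, True),
    ("SK", 1995, True),
    ("SK", 1995, False),
    ("SK", 1995, False),
]

total = Counter((province, year) for province, year, _ in rows)
positive = Counter((province, year) for province, year, hit in rows if hit)

# A group with no positive sample gets a ratio of 0 instead of being skipped
ratio = {key: positive[key] / total[key] for key in total}

print(ratio[("MB", 1995)])  # 0.5
print(ratio[("MB", 1996)])  # 0.0
print(ratio[("SK", 1995)])  # 0.5
```

Groups such as ("MB", 1996) are the reason to match the two counts by key rather than by row position: a count taken only over positive samples has no row for them.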

Purpose:
- Visualize and compare the ergot incidence ratios for each province over the years, then print the highest and lowest ratio per province together with the year in which it occurred.

In [None]:
# province code -> (legend name, line color)
provinces = {
    "MB": ("Manitoba", "blue"),
    "AB": ("Alberta", "green"),
    "SK": ("Saskatchewan", "red"),
}

plt.figure(figsize=(10, 5))
plt.xlabel("Year")
plt.ylabel("Incidence ratio")
for code, (name, line_color) in provinces.items():
    prov_df = ratio_df[ratio_df["province"] == code]
    # plot each province against its own years
    plt.plot(prov_df["year"], prov_df["ratio"], color=line_color, label=name)
plt.legend()
plt.show()

# Min, max for each province; the year is read from the data instead of
# being derived from the list position
for code in provinces:
    prov_df = ratio_df[ratio_df["province"] == code]
    highest = prov_df.loc[prov_df["ratio"].idxmax()]
    lowest = prov_df.loc[prov_df["ratio"].idxmin()]
    print(
        "Highest ratio in {}: {}, in year: {}".format(
            code, highest["ratio"], highest["year"]
        )
    )
    print(
        "Lowest ratio in {}: {}, in year: {}".format(
            code, lowest["ratio"], lowest["year"]
        )
    )

Purpose:
- The purpose of this code is to calculate the incidence ratio (as a percentage) of ergot for each region (province and district) over the years. 

Output:
- The resulting region_df DataFrame contains the "province", "year", "district", "incidence", and "ratio" columns, where "ratio" represents the calculated incidence ratio for each region.

In [None]:
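# total number of samples per region and year
total_df = (
    ergot_df.groupby(["province", "year", "district"])["incidence"]
    .count()
    .reset_index(name="total")
)
region_df = (
    ergot_df[ergot_df["incidence"] == True]
    .groupby(["province", "year", "district"])["incidence"]
    .count()
    .reset_index()
)
# join on the group keys so each positive count is divided by its own total
region_df = region_df.merge(total_df, on=["province", "year", "district"])
region_df["ratio"] = (region_df["incidence"] / region_df["total"]) * 100
region_df = region_df.drop(columns=["total"])
region_df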

Purpose:
- Fetch the geospatial data for the agricultural regions from the database and store it in a GeoDataFrame (agRegions).
- This GeoDataFrame can be used for operations such as plotting the regions on maps.

In [None]:
regionQuery = sq.text("select district, color, geometry FROM public.census_ag_regions")
agRegions = gpd.GeoDataFrame.from_postgis(
    regionQuery, conn, crs="EPSG:3347", geom_col="geometry"
)

Purpose:
- The purpose of this function is to convert a numerical value into a corresponding color representation based on a chosen colormap. 

Pseudocode:
- [Normalization](https://matplotlib.org/stable/api/_as_gen/matplotlib.colors.Normalize.html): `colors.Normalize()` rescales the input `value` linearly so that the minimum (`vmin`) maps to 0 and the maximum (`vmax`) maps to 1. Values inside that range therefore fall between 0 and 1.

- [Colormap Retrieval](https://matplotlib.org/stable/api/_as_gen/matplotlib.cm.get_cmap.html): The colormap is then looked up by name with `cm.get_cmap()`. This function was deprecated in Matplotlib 3.7 and removed in 3.9; `plt.get_cmap()` accepts the same name argument.

- Color Mapping: The colormap converts the normalized value to an RGBA color through `cmap(norm(abs(value)))`, and the first three components are kept as the RGB color.

- [Hexadecimal Color Conversion](https://matplotlib.org/stable/api/_as_gen/matplotlib.colors.rgb2hex.html): The RGB tuple is converted to a hexadecimal color string using `colors.rgb2hex()`.

In [None]:
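def color_map_color(
    value: float, cmap_name: str = "Wistia", vmin: float = 0, vmax: float = 100
) -> str:
    # rescale so that vmin maps to 0 and vmax maps to 1
    norm = colors.Normalize(vmin=vmin, vmax=vmax)
    # plt.get_cmap also works on Matplotlib 3.9+, where cm.get_cmap was removed
    cmap = plt.get_cmap(cmap_name)
    # the colormap returns RGBA; keep only the RGB components
    rgb = cmap(norm(abs(value)))[:3]
    color = colors.rgb2hex(rgb)

    return color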
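
To make the normalization and hex conversion steps concrete, the sketch below redoes them by hand without Matplotlib. `normalize` and `rgb_to_hex` are illustrative stand-ins written for this example, and the colormap lookup itself is left out because it depends on the colormap's color table.

```python
def normalize(value: float, vmin: float = 0, vmax: float = 100) -> float:
    """Linear rescaling: vmin maps to 0.0 and vmax maps to 1.0."""
    return (value - vmin) / (vmax - vmin)


def rgb_to_hex(rgb: tuple) -> str:
    """Turn an (r, g, b) tuple of floats in [0, 1] into a '#rrggbb' string."""
    return "#" + "".join(f"{round(channel * 255):02x}" for channel in rgb)


print(normalize(25))                # 0.25
print(normalize(150))               # 1.5: values above vmax are not clipped here
print(rgb_to_hex((1.0, 0.0, 0.0)))  # #ff0000
```

With the defaults of 0 and 100, a ratio expressed as a percentage maps directly onto the colormap's range.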

Purpose :
- The purpose of this function is to create a color map for a set of districts based on the provided ratio data.

In [None]:
def get_color(ratio_year: pd.DataFrame) -> pd.Series:
    color_map = []

    for district in agRegions["district"].tolist():
        if district in ratio_year["district"].tolist():
            ratio = ratio_year[ratio_year["district"] == district]["ratio"].tolist()[0]
            color_map.append(color_map_color(ratio))
        else:
            color_map.append(color_map_color(0))

    return pd.Series(color_map)
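
The lookup in `get_color` amounts to "use the district's ratio if the year has one, otherwise 0". A sketch of that rule with a dictionary, on made-up district codes and ratios:

```python
# Made-up data: all districts on the map, and the ratios available for one year
map_districts = [4601, 4602, 4700, 4810]
ratio_by_district = {4601: 12.5, 4810: 40.0}

# dict.get supplies 0 for districts that have no row in that year
ratios_in_map_order = [ratio_by_district.get(d, 0) for d in map_districts]

print(ratios_in_map_order)  # [12.5, 0, 0, 40.0]
```

The result follows the order of the map's districts, which is what lets the colors line up with the geometries row by row.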

Purpose:
- The purpose of this function is to visualize incident levels for each district on a map. It takes the incident ratio color map and the year as input to create a choropleth map. Each district is colored according to its incident level, and the district names are annotated on the map for easy identification.

Pseudocode:
- Get the bounding box (minimum and maximum coordinates). 
- Create a figure and axes.
- Set the y-axis and x-axis limits of the axes using the bounding box coordinates obtained earlier.
- Set the title of the plot to "Incident level for district in {year}".
- Plot the GeoDataFrame on the axes, passing color_map to color the districts based on the provided incident level data.
- Annotate each district with its district code and add a colorbar.
- [Display the plot](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.show.html).

In [None]:
def plot_map(color_map: pd.Series, year: int):
    minx, miny, maxx, maxy = agRegions.total_bounds
    fig, ax = plt.subplots(figsize=(20, 20))
    ax.set_ylim(miny, maxy)
    ax.set_xlim(minx, maxx)
    ax.set_title("Incident level for district in " + str(year))

    # The district colors come from color_map_color, so the colorbar uses the
    # same defaults: the "Wistia" colormap over ratios from 0 to 100
    scalar_mappable = ScalarMappable(cmap="Wistia", norm=Normalize(vmin=0, vmax=100))

    agRegions.plot(ax=ax, color=color_map, edgecolor="black")
    agRegions.apply(
        lambda x: ax.annotate(
            text=x["district"],
            xy=x.geometry.centroid.coords[0],
            ha="center",
            color="black",
            size=10,
        ),
        axis=1,
    )

    # Add a colorbar showing the ratio scale
    cbar = plt.colorbar(scalar_mappable, ax=ax, orientation="vertical", pad=0.02)
    cbar.set_label("Incident Level")

    plt.show()

Purpose:
- Loop through the years 1995 to 2022 and create a map visualization for each year from that year's incident ratio data.

In [None]:
for currYear in range(1995, 2023):
    ratio_year = region_df.loc[region_df["year"] == currYear]
    color = get_color(ratio_year)
    plot_map(color, currYear)

## Visualization for aggregated ergot

Purpose: Self-contained retrieval of the ergot data after aggregation, for use in the visualizations that follow.

Pseudocode: 
- Create the agg_ergot data SQL query
- [Load the data from the database directly into a DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html) 
- Obtain the unique values in the "severity" column of the DataFrame.

In [None]:
query = sq.text("SELECT * FROM public.agg_ergot_samples")
agg_ergot_df = pd.read_sql(query, conn)
agg_ergot_df

In [None]:
agg_ergot_df["severity"].unique()

Purpose:
- The purpose of this function is to provide basic descriptive statistics (first quartile, median, third quartile) of the given numerical data and identify outliers using the Interquartile Range (IQR) method.

In [None]:
def stats(data: list):
    # compute the quartiles, the interquartile range (IQR), and the outlier bounds
    q1, q2, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    print("iqr: ", iqr)
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr

    # create conditions to isolate the outliers
    outliers = [value for value in data if value < lower_bound or value > upper_bound]
    return q1, q2, q3, outliers
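
A worked example of the IQR rule on a small made-up list. It uses the standard library's `statistics.quantiles` with `method="inclusive"` rather than `np.percentile`; both interpolate linearly and agree on this list, but treat that equivalence as an assumption to re-check on other data.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 100]  # made-up values with one obvious extreme

q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_bound or v > upper_bound]

print(q1, q2, q3)                # 2.0 3.0 4.0
print(lower_bound, upper_bound)  # -1.0 7.0
print(outliers)                  # [100]
```

Anything outside the interval from `lower_bound` to `upper_bound` is reported as an outlier, so the single value 100 is flagged while the rest of the list is kept.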

Purpose:
- The purpose of this code is to analyze the "severity" data from the DataFrame. It calculates the quartiles (Q1, Q2, Q3) using the stats function and identifies any outlier values. 

In [None]:
data = np.array(agg_ergot_df["severity"])

q1, q2, q3, outliers = stats(data.tolist())
print("number of outliers:", len(outliers))
print("q1, q2, q3: {}, {}, {}".format(q1, q2, q3))
print(outliers)

Purpose:
- Remove the outliers identified above from the "severity" data, store the result in the DataFrame wo_outliers, and draw a box plot of it. The box plot shows the distribution of "severity" without the extreme values.

In [None]:
wo_outliers = pd.DataFrame(data, columns=["severity"])
wo_outliers = wo_outliers[~wo_outliers["severity"].isin(outliers)]
wo_outliers

In [None]:
wo_outliers.boxplot(column=["severity"], showfliers=False, autorange=True)

In [None]:
# presence of ergot by district: current year, previous years, and neighbors
present_df = agg_ergot_df[
    [
        "year",
        "has_ergot",
        "district",
        "present_prev1",
        "present_prev2",
        "present_prev3",
        "present_in_neighbor",
    ]
].drop_duplicates()
present_df

Purpose:
- The purpose of this code is to analyze the relationships between ergot incidents in the current year and different scenarios related to previous years and neighboring areas. By calculating the percentages, it provides insights into how the presence of ergot in specific periods or neighboring areas might influence the occurrence of ergot in the current year.

In [None]:
# For each condition, find the fraction of district-years that have ergot
# among those where the condition holds (a value between 0 and 1)
conditions = {
    "prev year had ergot": "present_prev1",
    "prev 2 year had ergot": "present_prev2",
    "prev 3 year had ergot": "present_prev3",
    "neighbor is having ergot": "present_in_neighbor",
}

for label, column in conditions.items():
    # keep only the rows where the condition is met, then average has_ergot
    condition_met = present_df[present_df[column] == True]
    fraction = (condition_met["has_ergot"] == True).mean()
    print("Fraction of having ergot when {}: {}".format(label, fraction))
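
How such a figure reads depends on the denominator. The worked example below, on made-up pairs, contrasts dividing by all rows with dividing only by the rows where the previous year had ergot. Both results are fractions between 0 and 1, so multiply by 100 for a percentage.

```python
# Made-up (has_ergot, present_prev1) pairs, one per district-year
pairs = [
    (True, True),
    (True, True),
    (False, True),
    (True, False),
    (False, False),
    (False, False),
    (False, False),
    (False, False),
]

both = sum(1 for has, prev in pairs if has and prev)
prev_total = sum(1 for _, prev in pairs if prev)

joint_share = both / len(pairs)        # ergot now AND in the previous year, over all rows
conditional_share = both / prev_total  # ergot now, among rows with ergot the previous year

print(joint_share)        # 0.25
print(conditional_share)  # 0.6666666666666666
```

The conditional share is the one that answers "how often does ergot follow ergot", while the joint share also shrinks whenever ergot is rare overall.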

In [None]:
# severity information by district
severity_df = agg_ergot_df[
    [
        "year",
        "district",
        "has_ergot",
        "sum_severity",
        "severity_prev1",
        "severity_prev2",
        "severity_prev3",
    ]
].drop_duplicates()
severity_df

Purpose:
- The purpose of this code is to analyze the severity levels in districts with no ergot incidents (has_ergot is False) but with a non-zero sum of severity levels (sum_severity > 0). It first filters the severity_df to select such rows and then calculates the quartiles and detects outliers in the "sum_severity" data.

In [None]:
severity_df[(~severity_df["has_ergot"]) & (severity_df["sum_severity"] > 0)]

In [None]:
# compute the quartiles and outliers of sum_severity
data = np.array(severity_df["sum_severity"])
q1, q2, q3, outliers = stats(data.tolist())
print("number of outliers:", len(outliers))
print("q1, q2, q3: {}, {}, {}".format(q1, q2, q3))
print(outliers)

Purpose:
- The purpose of this code is to visualize the distribution of severity levels across districts without displaying any outliers.

In [None]:
# plt.rcParams.update({'figure.figsize':(7,5), 'figure.dpi':100})

# plot
severity_df.plot.box(title="Severity", column=["sum_severity"], showfliers=False)