## Let's look at the results of the simulations

I want to check:

(A) That the centers are characterized correctly

(B) That the characterizations make sense

In [110]:
import polars as pl
from statistics import mean

In [111]:
from pathlib import Path

# Shared roots, so that each file path is only spelled out once.
mali_dir = Path("/home/leyregarrido/01_github_repos/VBR-template/country_pipelines/mali")
compiled_dir = mali_dir / "compile_vbr/workspace/pipelines/run_vbr/to_compile/compiled_data"
quantity_dir = mali_dir / "run_vbr/workspace/pipelines/initialize_vbr/data/quantity_data"

ver = pl.read_csv(compiled_dir / "verification_detailed.csv")
stats = pl.read_csv(compiled_dir / "simulation_statistics_db.csv")
quant = pl.read_csv(quantity_dir / "model_1911_cleaned.csv")
quant_not_cleaned = pl.read_csv(quantity_dir / "model_1911_not_cleaned.csv")

## Look at the ecarts calculation

The ecart calculation makes sense, and it does seem to me that taking the median over all the values, rather than the median of the medians, makes more sense.
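To make the difference between the two options concrete, here is a minimal sketch with made-up numbers. The service names and ecart values are hypothetical and are not taken from the Mali data; the grouping by service is also an assumption about what "the medians" refers to.

```python
from statistics import median

# Hypothetical ecart values for one center, grouped by service.
ecarts_by_service = {
    "service_a": [0.0, 0.0, 0.0],
    "service_b": [1.0],
    "service_c": [1.0],
}

# Median over every row of the center, ignoring the grouping.
all_values = [v for values in ecarts_by_service.values() for v in values]
overall_median = median(all_values)

# Median of the per-service medians: every service weighs the same,
# no matter how many rows it has.
median_of_medians = median(median(values) for values in ecarts_by_service.values())

print(overall_median, median_of_medians)
```

With these numbers the two approaches disagree completely, because the single-row services count as much as the three-row one in the median of medians.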

In [112]:
selected_center_id = "Af10JtDP39O"
selected_method = ["ecart_median", "ecart_moyen"]
selected_period = "2025Q4"

In [113]:
sel_ver = ver.filter(
    (pl.col("ID") == selected_center_id)
    & (pl.col("qtrisk").is_in(selected_method))
    & (pl.col("Periode") == selected_period)
)
sel_quant = quant.filter(pl.col("ou") == selected_center_id)
sel_quant_not_cleaned = quant_not_cleaned.filter(pl.col("ou") == selected_center_id)

In [114]:
benif_calc_median = sel_ver.select(pl.col("Median ecart - window")).to_series()[0]
benif_calc_mean = sel_ver.select(pl.col("Mean ecart - window")).to_series()[0]
ecart_quant_median = sel_quant.select(pl.col("ecart_dec_val")).median().to_series()[0]
ecart_quant_mean = sel_quant.select(pl.col("ecart_dec_val")).mean().to_series()[0]

display(f"The calculated ecart median is: {benif_calc_median}")
display(f"The quant ecart median is: {ecart_quant_median}")
display(f"The calculated ecart mean is: {benif_calc_mean}")
display(f"The quant ecart mean is: {ecart_quant_mean}")

'The calculated ecart median is: 1.0'

'The quant ecart median is: 1.0'

'The calculated ecart mean is: 0.617256930023341'

'The quant ecart mean is: 0.6172569300233409'

## Look at the verification gain calculation

The calculation is well done, and it seems to make sense.

In [115]:
selected_center_id = "yVDehuSo0cC"
selected_method = "verifgain"
selected_period = "2025Q4"
prix_verif = 100000

In [116]:
sel_ver = ver.filter(
    (pl.col("ID") == selected_center_id)
    & (pl.col("qtrisk") == selected_method)
    & (pl.col("Periode") == selected_period)
)
sel_quant = quant.filter(pl.col("ou") == selected_center_id)

In [117]:
taux_validation_par_service_window = sel_quant.group_by("service").agg(
    pl.col("taux_validation").median().alias("taux_validation_median")
)
sel_quant = sel_quant.join(taux_validation_par_service_window, on="service", how="left")
sel_quant = sel_quant.with_columns(
    (pl.col("taux_validation_median") * pl.col("dec") * pl.col("tarif")).alias("subsidy_verifgain")
)

In [118]:
list_benif_quant = []
for period in ["2025Q2", "2025Q3"]:
    benif_quant_period = prix_verif
    list_services = sel_quant.select(pl.col("service")).unique().to_series().sort()
    for service in list_services:
        avec_verif_period = (
            sel_quant.filter((pl.col("month") == period) & (pl.col("service") == service))
            .select(pl.col("subside_avec_verification"))
            .sum()
            .to_series()[0]
        )
        sans_verif_period = (
            sel_quant.filter((pl.col("month") == period) & (pl.col("service") == service))
            .select(pl.col("subsidy_verifgain"))
            .sum()
            .to_series()[0]
        )
        benif_quant_period += -sans_verif_period + avec_verif_period

    display(f"Quant benefice for {period} is: {benif_quant_period}")
    list_benif_quant.append(benif_quant_period)

benif_quant = mean(list_benif_quant)
benif_calc_median = sel_ver.select(pl.col("Median benefice complet VBR - window")).to_series()[0]
display(f"The calculated benefice is: {benif_calc_median}")
display(f"The quant benefice is: {benif_quant}")

'Quant benefice for 2025Q2 is: 317650.0'

'Quant benefice for 2025Q3 is: -327092.47965498385'

'The calculated benefice is: -4721.239827491925'

'The quant benefice is: -4721.239827491925'
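Written out, the check above computes, for each period, `prix_verif + sum(subside_avec_verification - taux_validation_median * dec * tarif)` over the rows of the center, and then averages the periods. Below is a small sketch of that arithmetic on invented rows. The `Row` type, the helper name and all the numbers are illustrative only, and the field descriptions in the comments are my reading of the column names.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Row:
    period: str
    dec: float  # assumed to be the declared quantity
    tarif: float  # assumed to be the unit price of the service
    taux_validation_median: float  # per-service median validation rate
    subside_avec_verification: float  # subsidy when the center is verified


def benefice_for_period(rows, period, prix_verif):
    """prix_verif + (subsidy with verification - subsidy without) for one period."""
    total = prix_verif
    for row in rows:
        if row.period != period:
            continue
        sans_verif = row.taux_validation_median * row.dec * row.tarif
        total += row.subside_avec_verification - sans_verif
    return total


# Invented rows, one per period.
rows = [
    Row("2025Q2", dec=100, tarif=50.0, taux_validation_median=0.5,
        subside_avec_verification=4000.0),
    Row("2025Q3", dec=80, tarif=50.0, taux_validation_median=0.5,
        subside_avec_verification=1000.0),
]
per_period = [benefice_for_period(rows, p, prix_verif=1000.0) for p in ["2025Q2", "2025Q3"]]
benefice = mean(per_period)
print(per_period, benefice)
```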

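## Let's see if the verification makes sense
Looks very nice: every high-risk center is verified, about half of the moderate-risk ones, and about 10% of the low-risk ones.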

In [119]:
verification_stats = ver.filter(pl.col("Periode") == "2025Q4")
# Centers per risk category, used as the denominator of the percentage.
totals = verification_stats.group_by("Risk category").agg(pl.len().alias("total"))
result = (
    verification_stats
    .group_by(["Risk category", "Is the center verified?"])
    # pl.len() replaces the deprecated GroupBy.count()
    .agg(pl.len().alias("count"))
    .join(totals, on="Risk category")
    .with_columns(
        (pl.col("count") / pl.col("total")).alias("percentage")
    )
    .select([
        "Risk category",
        "Is the center verified?",
        "count",
        "percentage"
    ])
)
display(result)


| Risk category (str) | Is the center verified? (i64) | count (u32) | percentage (f64) |
| --- | --- | --- | --- |
| "high" | 1 | 9462 | 1.0 |
| "moderate" | 1 | 158 | 0.504792 |
| "low" | 1 | 750 | 0.100739 |
| "moderate" | 0 | 155 | 0.495208 |
| "low" | 0 | 6695 | 0.899261 |

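The same breakdown can be turned into an automatic sanity check. The sketch below uses a handful of invented (risk category, verified) pairs rather than the real `ver` table. The expectation that every high-risk center is verified is read off the output above, not from the pipeline specification.

```python
from collections import Counter

# Hypothetical (risk category, is verified) pairs, one per center.
centers = [
    ("high", 1), ("high", 1),
    ("moderate", 1), ("moderate", 0),
    ("low", 1), ("low", 0), ("low", 0), ("low", 0),
]

totals = Counter(category for category, _ in centers)
verified = Counter(category for category, is_verified in centers if is_verified == 1)

# Share of verified centers per risk category.
verified_share = {category: verified[category] / totals[category] for category in totals}
print(verified_share)

# Sanity check: no high-risk center should be left unverified.
assert verified_share["high"] == 1.0
```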

## Now we want to look at whether the risk categorizations actually make sense: ecart median

The problem here is that a lot of centers have an ecart_median of exactly 0 or 1.
The 0 is okay: it means that for these centers most services are very good.
The 1 worries me a bit more... It happens because so many services have validated = 0, and that propagates to the median.
I think that the solution will be to
(a) Also calculate the moyen (mean)
(b) Use the cleaned data too.

Also, when looking at the risk categories, there are very few centers in the moderate category, which I do not like; I would like a bit more of a /distribution/. We can also see that the majority of low-risk centers have an ecart of 0. Again, I think I need something a bit more granular...
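A small illustration of why the median saturates at 1 while the mean stays granular. It assumes, as a guess at the definition, that the ecart of a service is `(declared - validated) / declared`, so that `validated = 0` gives an ecart of exactly 1. The quantities are invented.

```python
from statistics import mean, median


def ecart(declared, validated):
    # Assumed definition, not confirmed by the pipeline code.
    return (declared - validated) / declared


# Hypothetical (declared, validated) pairs for one center.
# Three of the five services have validated = 0.
services = [(10, 0), (20, 0), (5, 0), (40, 30), (8, 8)]
ecarts = [ecart(d, v) for d, v in services]

print(median(ecarts))  # saturates at 1.0 because most ecarts are 1
print(mean(ecarts))  # stays strictly between 0 and 1
```

As soon as more than half of the services have an ecart of 1, the median is pinned there, whereas the mean still reflects the services that were validated.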

In [120]:
columns_to_select = [
    "ID",
    "Risk category",
    "Median benefice complet VBR - window",
    "Median taux validation - window",
    "Mean taux validation - window",
    "Median ecart - window",
    "Mean ecart - window",
]

In [121]:
selected_method = ["ecart_median"]
selected_period = "2025Q4"

In [122]:
ver_ecart_median = ver.filter(
    (pl.col("qtrisk").is_in(selected_method)) & (pl.col("Periode") == selected_period)
)
ver_ecart_median_merge = ver_ecart_median.select(columns_to_select)

In [123]:
ecart_0 = ver_ecart_median.filter(pl.col("Median ecart - window").is_in([0])).height
ecart_1 = ver_ecart_median.filter(pl.col("Median ecart - window").is_in([1])).height
ecart_0_1 = ecart_0 + ecart_1
total_ecart = ver_ecart_median.height
display(f"Number of centers with ecart 0 or 1: {ecart_0_1} out of {total_ecart}")
display(f"Number of centers with ecart 0: {ecart_0} out of {total_ecart}")  # These are very good
display(f"Number of centers with ecart 1: {ecart_1} out of {total_ecart}")  # These are very bad
# The majority of them end up in 1 or 0, I do not like that at all...
display("The categorization of risks looks like: ")
display(ver_ecart_median["Risk category"].value_counts())
low_risk = ver_ecart_median.filter(pl.col("Risk category") == "low").height
low_risk_non_zero = ver_ecart_median.filter(
    (pl.col("Risk category") == "low") & (pl.col("Median ecart - window") > 0)
).height
display(f"Number of low risk centers with ecart > 0: {low_risk_non_zero} out of {low_risk}")

'Number of centers with ecart 0 or 1: 4828 out of 5740'

'Number of centers with ecart 0: 1069 out of 5740'

'Number of centers with ecart 1: 3759 out of 5740'

'The categorization of risks looks like: '

| Risk category (str) | count (u32) |
| --- | --- |
| "high" | 4260 |
| "low" | 1317 |
| "moderate" | 163 |


'Number of low risk centers with ecart > 0: 248 out of 1317'

## Now we want to look at whether the risk categorizations actually make sense: ecart moyen

In this case, we end up with more centers in high (5074, versus 4260 with the ecart median), but this might be because the ecart moyen needs parameters that are a bit more generous.
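To show what "more generous parameters" could mean, here is a sketch with a purely hypothetical categorization rule based on two cut-offs. The function, the cut-off values and the ecart values are all invented for illustration; the real rule lives in the pipeline and may look different.

```python
def risk_category(ecart, low_max, moderate_max):
    """Map an ecart to a risk category using two cut-offs (hypothetical rule)."""
    if ecart <= low_max:
        return "low"
    if ecart <= moderate_max:
        return "moderate"
    return "high"


# Invented mean ecarts for six centers.
mean_ecarts = [0.0, 0.08, 0.15, 0.3, 0.62, 1.0]

# Same centers, categorized with strict and with more generous cut-offs.
strict = [risk_category(e, low_max=0.05, moderate_max=0.1) for e in mean_ecarts]
generous = [risk_category(e, low_max=0.1, moderate_max=0.4) for e in mean_ecarts]

print(strict.count("high"), generous.count("high"))
```

Because the mean rarely lands on exactly 0, cut-offs tuned for the median would push most centers above them; widening the cut-offs moves some of them back into low and moderate.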

In [124]:
selected_method = ["ecart_moyen"]
selected_period = "2025Q4"

In [125]:
ver_ecart_moyen = ver.filter(
    (pl.col("qtrisk").is_in(selected_method)) & (pl.col("Periode") == selected_period)
)
ver_ecart_moyen_merge = ver_ecart_moyen.select(columns_to_select)

In [126]:
ecart_0 = ver_ecart_moyen.filter(pl.col("Mean ecart - window").is_in([0])).height
ecart_1 = ver_ecart_moyen.filter(pl.col("Mean ecart - window").is_in([1])).height
ecart_0_1 = ecart_0 + ecart_1
total_ecart = ver_ecart_moyen.height
display(f"Number of centers with ecart 0 or 1: {ecart_0_1} out of {total_ecart}")
display(f"Number of centers with ecart 0: {ecart_0} out of {total_ecart}")  # These are very good
display(f"Number of centers with ecart 1: {ecart_1} out of {total_ecart}")  # These are very bad
# With the mean, only a minority of centers end up at exactly 0 or 1
display("The categorization of risks looks like: ")
display(ver_ecart_moyen["Risk category"].value_counts())
low_risk = ver_ecart_moyen.filter(pl.col("Risk category") == "low").height
low_risk_non_zero = ver_ecart_moyen.filter(
    (pl.col("Risk category") == "low") & (pl.col("Mean ecart - window") > 0)
).height
display(f"Number of low risk centers with ecart > 0: {low_risk_non_zero} out of {low_risk}")

'Number of centers with ecart 0 or 1: 594 out of 5740'

'Number of centers with ecart 0: 217 out of 5740'

'Number of centers with ecart 1: 377 out of 5740'

'The categorization of risks looks like: '

| Risk category (str) | count (u32) |
| --- | --- |
| "high" | 5074 |
| "moderate" | 150 |
| "low" | 516 |


'Number of low risk centers with ecart > 0: 299 out of 516'

## Now we want to look at whether the risk categorizations actually make sense: verification gain

The majority of centers are low risk (5612 out of 5740), which is a big discrepancy with the previous categorization methods.

This method is not the best, because we are using the ecart from the same period in which we calculate the verif gain, so we are kind of cheating. It would be interesting to also launch this method paying using the complet.


In [127]:
selected_method = ["verifgain"]
selected_period = "2025Q4"

In [128]:
ver_verifgain = ver.filter(
    (pl.col("qtrisk").is_in(selected_method)) & (pl.col("Periode") == selected_period)
)
ver_verifgain_merge = ver_verifgain.select(columns_to_select)

In [129]:
display(ver_verifgain["Risk category"].value_counts())

| Risk category (str) | count (u32) |
| --- | --- |
| "high" | 128 |
| "low" | 5612 |


In [130]:
selected_center_id = "aKnxgRs5WWN"

In [131]:
sel_ver = ver.filter(
    (pl.col("ID") == selected_center_id)
    & (pl.col("qtrisk").is_in(selected_method))
    & (pl.col("Periode") == selected_period)
)
sel_quant = quant.filter(pl.col("ou") == selected_center_id)
taux_validation_par_service_window = sel_quant.group_by("service").agg(
    pl.col("taux_validation").median().alias("taux_validation_median")
)
sel_quant = sel_quant.join(taux_validation_par_service_window, on="service", how="left")
sel_quant = sel_quant.with_columns(
    (pl.col("taux_validation_median") * pl.col("dec") * pl.col("tarif")).alias("subsidy_verifgain")
)
list_benif_quant = []
for period in ["2025Q2", "2025Q3"]:
    benif_quant_period = prix_verif
    list_services = sel_quant.select(pl.col("service")).unique().to_series().sort()
    for service in list_services:
        avec_verif_period = (
            sel_quant.filter((pl.col("month") == period) & (pl.col("service") == service))
            .select(pl.col("subside_avec_verification"))
            .sum()
            .to_series()[0]
        )
        sans_verif_period = (
            sel_quant.filter((pl.col("month") == period) & (pl.col("service") == service))
            .select(pl.col("subsidy_verifgain"))
            .sum()
            .to_series()[0]
        )
        benif_quant_period += -sans_verif_period + avec_verif_period

    display(f"Quant benefice for {period} is: {benif_quant_period}")
    list_benif_quant.append(benif_quant_period)

benif_quant = mean(list_benif_quant)
benif_calc_median = sel_ver.select(pl.col("Median benefice complet VBR - window")).to_series()[0]
display(f"The calculated benefice is: {benif_calc_median}")
display(f"The quant benefice is: {benif_quant}")

'Quant benefice for 2025Q2 is: 100000.0'

'Quant benefice for 2025Q3 is: 100000.0'

'The calculated benefice is: 100000.0'

'The quant benefice is: 100000.0'

## Let's compare


In [135]:
merge_cols = [
    "ID",
    "Median benefice complet VBR - window",
    "Median taux validation - window",
    "Mean taux validation - window",
    "Median ecart - window",
    "Mean ecart - window",
]
comparing = ver_verifgain_merge.join(
    ver_ecart_median_merge, on=merge_cols, how="inner", suffix="_ecart_median"
).join(ver_ecart_moyen_merge, on=merge_cols, how="inner", suffix="_ecart_moyen")
comparing = comparing.with_columns(pl.col("Risk category").alias("Risk category_verifgain"))
comparing = comparing.select(
    [
        *merge_cols,
        "Risk category_verifgain",
        "Risk category_ecart_median",
        "Risk category_ecart_moyen",
    ]
)

In [139]:
rows_all_high = comparing.filter(
    (pl.col("Risk category_verifgain") == "high") &
    (pl.col("Risk category_ecart_median") == "high") &
    (pl.col("Risk category_ecart_moyen") == "high")
).height
rows_ecarts_high = comparing.filter(
    (pl.col("Risk category_ecart_median") == "high") &
    (pl.col("Risk category_ecart_moyen") == "high")
).height
rows_ecart_median_high_moyen_not = comparing.filter(
    (pl.col("Risk category_ecart_median") == "high") &
    (pl.col("Risk category_ecart_moyen") != "high")
).height
rows_ecart_median_not_moyen_high = comparing.filter(
    (pl.col("Risk category_ecart_median") != "high") &
    (pl.col("Risk category_ecart_moyen") == "high")
).height
display(f"Total number of rows compared: {comparing.height}")
display(f"Number of rows all high: {rows_all_high}")
display(f"Number of rows ecarts high: {rows_ecarts_high}")
display(f"Number of rows ecart median high, ecart moyen not high: {rows_ecart_median_high_moyen_not}")
display(f"Number of rows ecart median not high, ecart moyen high: {rows_ecart_median_not_moyen_high}")

'Total number of rows compared: 5740'

'Number of rows all high: 119'

'Number of rows ecarts high: 4260'

'Number of rows ecart median high, ecart moyen not high: 0'

'Number of rows ecart median not high, ecart moyen high: 814'
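The counts above say that every center flagged high by the ecart median is also flagged high by the ecart moyen (0 rows the other way round). A full cross-tabulation shows all the combinations at once. Here is a sketch on invented category pairs, not on the real `comparing` table.

```python
from collections import Counter

# Hypothetical (category by ecart_median, category by ecart_moyen) pairs.
pairs = [
    ("high", "high"), ("high", "high"), ("low", "high"),
    ("moderate", "high"), ("low", "low"), ("low", "moderate"),
]

# Number of centers for each combination of the two categorizations.
crosstab = Counter(pairs)
for (by_median, by_moyen), n in sorted(crosstab.items()):
    print(f"median={by_median:<8} moyen={by_moyen:<8} n={n}")

# High by median but not by mean: 0 means the median-high set is
# contained in the mean-high set.
median_only_high = sum(
    n for (by_median, by_moyen), n in crosstab.items()
    if by_median == "high" and by_moyen != "high"
)
print(median_only_high)
```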

I'd say I have done an okay job, and things make sense.
So, let's
(a) Make sure that the pipelines make sense
(b) Create a dashboard with multiple results
(c) Message JP