1. Read file_gpkg (generated by data_processing.ipynb).
2. Count the number of rows per class (PV_normal, PV_pool, PV_heater).
    - Some rows correspond to multiple columns (e.g. PV_pool & uncertflag).
3. Select sample rows and save them as 'sample_annotations_for_qc'.

In [5]:
import geopandas as gpd
final_annotation = "../db_pipeline/final_annotations.gpkg"
gdf = gpd.read_file(final_annotation)
gdf

print(gdf.dtypes)

id                                 int64
PV_normal                        float64
PV_heater                        float64
PV_pool                          float64
uncertflag                       float64
area                             float64
annotator                         object
centroid_latitude                float64
centroid_longitude               float64
image_name                        object
nw_corner_of_image_latitude       object
nw_corner_of_image_longitude      object
se_corner_of_image_latitude       object
se_corner_of_image_longitude      object
geometry                        geometry
dtype: object


In [None]:
print(f"The number of annotations is {len(gdf['id'])}")

columns_to_check = ['PV_normal', 'PV_heater', 'PV_pool', 'uncertflag']
counts = (gdf[columns_to_check] == 1.0).sum()
print(counts)


only_uncertflag = gdf[
    (gdf['uncertflag'] == 1.0) &
    (gdf[['PV_normal', 'PV_heater', 'PV_pool']] != 1.0).all(axis=1)
]
print(f"The number of annotations with only 'uncertflag' = 1.0 (no double counts): {len(only_uncertflag)}")


The number of annotations is 19735
PV_normal     10787
PV_heater      5216
PV_pool        2503
uncertflag     1920
dtype: int64
The number of annotations with only 'uncertflag' = 1.0 (no double counts): 1231


**Count rows corresponding to multiple columns**
 - PV_pool & uncertflag = 205
 - PV_heater & uncertflag = 484
 - PV_heater & PV_pool = 2 

In [4]:
import pandas as pd
from itertools import combinations

# Define the columns to check
columns_to_check = ['PV_normal', 'PV_heater', 'PV_pool', 'uncertflag']

# Dictionary to store results
combo_counts = {}

# Count for each combination of size >= 2
for r in range(2, len(columns_to_check) + 1):
    for combo in combinations(columns_to_check, r):
        condition = (gdf[list(combo)] == 1.0).all(axis=1)
        count = condition.sum()
        combo_counts[combo] = count

# Display results
for combo, count in combo_counts.items():
    print(f"{' & '.join(combo)} = {count}")


PV_normal & PV_heater = 0
PV_normal & PV_pool = 0
PV_normal & uncertflag = 0
PV_heater & PV_pool = 2
PV_heater & uncertflag = 484
PV_pool & uncertflag = 205
PV_normal & PV_heater & PV_pool = 0
PV_normal & PV_heater & uncertflag = 0
PV_normal & PV_pool & uncertflag = 0
PV_heater & PV_pool & uncertflag = 0
PV_normal & PV_heater & PV_pool & uncertflag = 0
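
As a quick sanity check, the printed counts can be reconciled with each other by inclusion-exclusion. The sketch below hard-codes the numbers from the outputs above (the variable names are illustrative, not part of the notebook) and relies on the fact that every combination other than the three non-zero pairs is 0.

```python
# Counts copied from the notebook outputs above.
total_rows = 19735
pv_normal, pv_heater, pv_pool, uncertflag = 10787, 5216, 2503, 1920
heater_and_pool = 2
heater_and_uncert = 484
pool_and_uncert = 205

# uncertflag rows that carry no PV class at all.
only_uncertflag = uncertflag - heater_and_uncert - pool_and_uncert
print(only_uncertflag)  # 1231

# Rows with at least one PV class; PV_heater & PV_pool is the only
# overlap among the three class columns, so subtract it once.
rows_with_pv_class = pv_normal + pv_heater + pv_pool - heater_and_pool
print(rows_with_pv_class)  # 18504

# Every row is either a PV-class row or an uncertflag-only row.
print(rows_with_pv_class + only_uncertflag == total_rows)  # True
```

The totals match, which suggests every row has at least one of the four columns set to 1.0.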


Sampling strategy
1. Check all double-annotated rows (484 + 205 = 689), then move on to the annotation accuracy check.
2. OR randomly sample from the rows that are not double-annotated.
3. Also, if a single tile contains wrongly annotated or missing PV alongside correct annotations, that would lower the model's accuracy.
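
One possible way to implement options 1 and 2 is sketched below. It is a minimal illustration, not the notebook's actual sampling code: `split_for_qc`, `n_random`, `seed`, and the toy DataFrame are all hypothetical, and the sample size is a free parameter because the notebook does not state one. Note that `flags_set >= 2` also picks up the 2 PV_heater & PV_pool rows, not only the 484 + 205 uncertflag overlaps.

```python
import pandas as pd

FLAG_COLUMNS = ["PV_normal", "PV_heater", "PV_pool", "uncertflag"]

def split_for_qc(df, n_random, seed=0):
    """Return (double_annotated, random_sample) for a QC pass."""
    # Number of flag columns equal to 1.0 per row (NaN counts as not set).
    flags_set = (df[FLAG_COLUMNS] == 1.0).sum(axis=1)
    double_annotated = df[flags_set >= 2]
    single_annotated = df[flags_set < 2]
    # Cap the sample size so we never request more rows than exist.
    n = min(n_random, len(single_annotated))
    random_sample = single_annotated.sample(n=n, random_state=seed)
    return double_annotated, random_sample

# Toy data standing in for gdf (hypothetical values).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "PV_normal": [1.0, None, None, 1.0, None, None],
    "PV_heater": [None, 1.0, None, None, 1.0, None],
    "PV_pool": [None, None, 1.0, None, None, None],
    "uncertflag": [None, 1.0, 1.0, None, None, 1.0],
})
double_rows, sampled_rows = split_for_qc(toy, n_random=2, seed=42)
print(sorted(double_rows["id"]))  # rows 2 and 3 carry two flags
print(len(sampled_rows))
```

Because a GeoDataFrame is a pandas DataFrame subclass, the same function should work on `gdf` directly; the result could then be written out with geopandas' `to_file`, for example `sampled_rows.to_file("sample_annotations_for_qc.gpkg", driver="GPKG")`, where the `.gpkg` file name is an assumption.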