# How to use this notebook (recommended for non-CLI users)

This notebook is designed for users who prefer working interactively rather than running command-line scripts.

- Step 1: Ensure you have activated the `pyagri-notebooks` environment and installed the package in editable mode (`pip install -e .`). See the top cell for environment checks.
- Step 2: Set the `PROJECT_ROOT` variable in the **Configuration** cell if you want to use an explicit absolute path; otherwise the notebook uses the parent of the current working directory. All other data paths (inputs and outputs) are derived from `PROJECT_ROOT`.
- Step 3: In the **Configuration** cell, confirm that `INPUT_TASKDATA` (list of TaskData folders), `HARVESTER_POINTS_DIR` and `FIELD_POLYGONS_GEOJSON` look correct.
- Step 4 (recommended): Run the small **Per-input API** cell for the specific input(s) you want to process — this calls the library functions directly and shows concise results inline.
- Step 5: Use the **Advanced: Batch run** cell only if you want to process multiple inputs in one go (this runs subprocesses and is better for batch jobs, not interactive exploration).

If a GeoJSON or CSV already exists, the notebook will append features or rows rather than overwrite, so you can safely run the same input multiple times for incremental exports.

Tip: to quickly set `PROJECT_ROOT` to the workspace root use:

```python
PROJECT_ROOT = Path(r"C:/dev/agri_analysis").resolve()
```
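
If the notebook may be started from different folders, an alternative to a hard-coded path is to walk upward until a folder containing `data` is found. The sketch below assumes the project root is the nearest ancestor with a `data` subdirectory; `find_project_root` and its `marker` parameter are hypothetical names and not part of `pyagri`.

```python
from pathlib import Path

def find_project_root(start, marker="data"):
    """Return the nearest folder at or above `start` that contains `marker`."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    # Assumption: fall back to the starting folder when no marker is found
    return start

PROJECT_ROOT = find_project_root(Path.cwd())
print(PROJECT_ROOT)
```

The explicit assignment shown in the tip remains the more predictable choice when the folder layout is fixed.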

In [1]:
# 01_Raw_read
# Friendly notebook that reproduces the CLI workflows:
#  - python -m pyagri.export data/TASKDATA_2 data/ledreborg_CSV
#  - python -m pyagri.geo data/TASKDATA data/harvest_fields_s.geojson

# NOTE: This notebook is intended to be run in the `pyagri-notebooks` environment
# (or another environment with the project installed in editable mode).
# Recommended quick setup (run in a shell once):
#   micromamba activate pyagri-notebooks
#   pip install -e .
#   pip install geopandas pandas

# Section 1: Quick environment check
import sys
from pathlib import Path
import logging

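logger = logging.getLogger('01_Raw_read')
logger.setLevel(logging.INFO)
# Attach the handler only once so re-running this cell does not duplicate log lines
if not logger.handlers:
    handler = logging.StreamHandler()
    handler.setLevel(logging.INFO)
    logger.addHandler(handler)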

logger.info('Python executable: %s', sys.executable)
try:
    import pyagri
    logger.info('pyagri import OK')
except Exception as e:
    logger.warning('pyagri import failed: %s', e)


Python executable: c:\Users\holmes\.local\share\mamba\envs\pyagri-notebooks\python.exe
pyagri import OK


In [3]:
# Configuration cell: set project root, inputs and outputs here
# PROJECT_ROOT is the folder that contains the `data` directory.
# Default: the parent of the current working directory (the notebook is assumed to run from a subfolder of the project).
PROJECT_ROOT = Path.cwd().resolve().parent

# Explicit override (uncomment and adjust if you run the notebook from a different working directory):
# PROJECT_ROOT = Path(r"C:/dev/agri_analysis").resolve()

DATA_DIR = PROJECT_ROOT / 'data'

# You can specify more than one TaskData folder in INPUT_TASKDATA list. These are paths relative to PROJECT_ROOT.
INPUT_TASKDATA = [
    DATA_DIR / 'TASKDATA_2',  # example input folder 1
    DATA_DIR / 'TASKDATA',  # add more as needed
]

# CSV output directory (one dir used for all inputs; files for each task will be appended)
HARVESTER_POINTS_DIR = PROJECT_ROOT / 'data' / 'ledreborg_CSV'
HARVESTER_POINTS_DIR.mkdir(parents=True, exist_ok=True)

# GeoJSON output file for task polygons (appends features if file exists)
FIELD_POLYGONS_GEOJSON = PROJECT_ROOT / 'data' / 'harvest_fields_s.geojson'

# --- Set target projected CRS for area/overlap analysis ---
# Use EPSG:25832 (ETRS89 / UTM zone 32N) for Denmark
EPSG = '25832'  # Change as needed for your region


logger.info('Project root: %s', PROJECT_ROOT)
logger.info('Configured %d input folders: %s', len(INPUT_TASKDATA), [str(p) for p in INPUT_TASKDATA])
logger.info('CSV output dir: %s', HARVESTER_POINTS_DIR)
logger.info('GeoJSON output file: %s', FIELD_POLYGONS_GEOJSON)
logger.info('Target projected CRS EPSG: %s', EPSG)


Project root: C:\dev\agri_analysis
Configured 2 input folders: ['C:\\dev\\agri_analysis\\data\\TASKDATA_2', 'C:\\dev\\agri_analysis\\data\\TASKDATA']
CSV output dir: C:\dev\agri_analysis\data\ledreborg_CSV
GeoJSON output file: C:\dev\agri_analysis\data\harvest_fields_s.geojson
Target projected CRS EPSG: 25832
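
Before running the conversion, it can help to confirm that every configured input folder exists and holds a task file. The following is a minimal sketch, assuming each TaskData folder contains a file named `TASKDATA.XML`; `check_taskdata_folders` is a hypothetical helper, not a `pyagri` function, and the demo uses a temporary folder instead of the real data.

```python
import tempfile
from pathlib import Path

def check_taskdata_folders(folders, xml_name="TASKDATA.XML"):
    """Map each folder to 'ok', 'missing folder' or 'missing <xml_name>'."""
    status = {}
    for folder in map(Path, folders):
        if not folder.is_dir():
            status[str(folder)] = "missing folder"
        elif not any(p.name.upper() == xml_name.upper() for p in folder.iterdir()):
            # File names are compared case-insensitively as a precaution
            status[str(folder)] = f"missing {xml_name}"
        else:
            status[str(folder)] = "ok"
    return status

# Demo on a throwaway folder layout (hypothetical content)
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "TASKDATA"
    good.mkdir()
    (good / "TASKDATA.XML").write_text("<ISO11783_TaskData/>", encoding="utf-8")
    report = check_taskdata_folders([good, Path(tmp) / "TASKDATA_2"])

print(list(report.values()))
```

In the notebook itself the call would be `check_taskdata_folders(INPUT_TASKDATA)`.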


## Loading binary data
The following cell loads the data from the XML and binary files and exports it to CSV and GeoJSON for further analysis. If you have already converted the data and only want to validate it, skip the following cell.
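
Whether the conversion can be skipped can also be checked programmatically. The sketch below treats the data as converted when the CSV folder holds at least one `.csv` file and the GeoJSON file exists; that criterion and the name `conversion_outputs_exist` are assumptions made for illustration, and the demo runs against a temporary folder.

```python
import tempfile
from pathlib import Path

def conversion_outputs_exist(csv_dir, geojson_path):
    """True when csv_dir holds at least one .csv file and geojson_path exists."""
    csv_dir, geojson_path = Path(csv_dir), Path(geojson_path)
    has_csv = csv_dir.is_dir() and any(csv_dir.glob("*.csv"))
    return has_csv and geojson_path.is_file()

# Demo with placeholder files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "ledreborg_CSV"
    out_dir.mkdir()
    geojson = Path(tmp) / "harvest_fields_s.geojson"
    before = conversion_outputs_exist(out_dir, geojson)
    (out_dir / "example.csv").write_text("placeholder\n", encoding="utf-8")
    geojson.write_text('{"type": "FeatureCollection", "features": []}', encoding="utf-8")
    after = conversion_outputs_exist(out_dir, geojson)

print(before, after)
```

In the notebook the call would be `conversion_outputs_exist(HARVESTER_POINTS_DIR, FIELD_POLYGONS_GEOJSON)`.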

In [8]:
# Per-input API example 
try:
    from pyagri import export as export_mod
    from pyagri import geo as geo_mod
    for src in INPUT_TASKDATA:
        srcp = Path(src)
        if not srcp.exists():
            logger.warning('Skipping missing input: %s', srcp)
            continue
        logger.info('Exporting (API) for: %s', srcp)
        written = export_mod.export_taskdata(str(srcp), str(HARVESTER_POINTS_DIR))
        logger.info('Wrote CSV files: %s', written)

        logger.info('Extracting GeoJSON (API) for: %s', srcp)
        rc = geo_mod.extract_taskdata_to_geojson(str(srcp), str(FIELD_POLYGONS_GEOJSON))
        logger.info('Geo extraction returned: %s (%s)', rc, 'success' if rc == 0 else 'error')
except Exception as e:
    logger.exception('Per-input API example failed: %s', e)


Exporting (API) for: C:\dev\agri_analysis\data\TASKDATA_2
Wrote CSV files: ['C:\\dev\\agri_analysis\\data\\ledreborg_CSV\\021-0_ Monica.csv', 'C:\\dev\\agri_analysis\\data\\ledreborg_CSV\\021-0_ Monica.csv', 'C:\\dev\\agri_analysis\\data\\ledreborg_CSV\\021-0_ Monica.csv', 'C:\\dev\\agri_analysis\\data\\ledreborg_CSV\\016-0_ Stendyssegård.csv', 'C:\\dev\\agri_analysis\\data\\ledreborg_CSV\\016-0_ Stendyssegård.csv', 'C:\\dev\\agri_analysis\\data\\ledreborg_CSV\\016-0_ Stendyssegård.csv', 'C:\\dev\\agri_analysis\\data\\ledreborg_CSV\\037-0_ Bispegård øst.csv', 'C:\\dev\\agri_analysis\\data\\ledreborg_CSV\\037-0_ Bispegård øst.csv', 'C:\\dev\\agri_analysis\\data\\ledreborg_CSV\\037-0_ Bispegård øst.csv', 'C:\\dev\\agri_analysis\\data\\ledreborg_CSV\\014-0_ Dellingemølle.csv', 'C:\\dev\\agri_analysis\\data\\ledreborg_CSV\\014-0_ Dellingemølle.csv', 'C:\\dev\\agri_analysis\\data\\ledreborg_CSV\\024-0_ Silvia.csv', 'C:\\dev\\agri_analysis\\data\\ledreborg_CSV\\024-0_ Silvia.csv', 'C:\\dev\\

In [9]:
# --- Display enhanced TASKDATA.XML report (with crop, TLG, sensors) ---
from pyagri.taskdata_report import taskdata_report

xml_path = PROJECT_ROOT / 'data' / 'taskdata' / 'TASKDATA.XML'
taskdata_report(xml_path)



Year: 2022
  Field: 002-0, St.Amalienborg/Litgoth
    Task TSK189: Product=PFD54 Device=DVC1 Details=4
      Start: 2022-08-04T18:06:44.000+02:00 End: 2022-08-13T22:30:35.000+02:00
    Task TSK190: Product=PFD54 Device=DVC4 Details=4
      Start: 2022-08-04T18:06:44.000+02:00 End: 2022-08-13T22:30:35.000+02:00
    Task TSK191: Product=PFD54 Device=DVC3 Details=4
      Start: 2022-08-04T18:06:44.000+02:00 End: 2022-08-13T22:30:35.000+02:00
    Task TSK192: Product=PFD54 Device=DVC6 Details=4
      Start: 2022-08-04T18:06:44.000+02:00 End: 2022-08-13T22:30:35.000+02:00
  Field: 010-0, Møllemark
    Task TSK182: Product=PFD52 Device=DVC1 Details=4
      Start: 2022-08-13T14:32:41.000+02:00 End: 2022-08-13T18:46:03.000+02:00
    Task TSK183: Product=PFD52 Device=DVC6 Details=4
      Start: 2022-08-13T14:32:41.000+02:00 End: 2022-08-13T18:46:03.000+02:00
  Field: 011-0, Akselbæk
    Task TSK27: Product=PFD20 Device=DVC1 Details=4
      Start: 2022-08-13T11:58:22.000+02:00 End: 2022-08-13T1

In [7]:
import importlib
import pyagri.taskdata_report
importlib.reload(pyagri.taskdata_report)
from pyagri.taskdata_report import taskdata_report_df

In [8]:
# --- Display TASKDATA.XML report as a DataFrame ---
from pyagri.taskdata_report import taskdata_report_df
from IPython.display import display

xml_path = PROJECT_ROOT / 'data' / 'taskdata' / 'TASKDATA.XML'
df_report = taskdata_report_df(xml_path)
display(df_report)


Unnamed: 0,Year,Field,TaskID,Product,Crop,Details,StartTime,EndTime,Device,TLG,Sensors
188,2022,"002-0, St.Amalienborg/Litgoth",TSK189,PFD54,DET10,4,2022-08-04T18:06:44.000+02:00,2022-08-13T22:30:35.000+02:00,DVC1,TLG00123,"005A, 0074, 0077, 0106, 013B"
189,2022,"002-0, St.Amalienborg/Litgoth",TSK190,PFD54,DET40,4,2022-08-04T18:06:44.000+02:00,2022-08-13T22:30:35.000+02:00,DVC4,TLG00124,"005A, 0074, 0077, 0106, 013B"
190,2022,"002-0, St.Amalienborg/Litgoth",TSK191,PFD54,DET30,4,2022-08-04T18:06:44.000+02:00,2022-08-13T22:30:35.000+02:00,DVC3,TLG00125,"005A, 0074, 0077, 0106, 013B"
191,2022,"002-0, St.Amalienborg/Litgoth",TSK192,PFD54,DET64,4,2022-08-04T18:06:44.000+02:00,2022-08-13T22:30:35.000+02:00,DVC6,,"005A, 0077, 0106, 013B"
181,2022,"010-0, Møllemark",TSK182,PFD52,DET10,4,2022-08-13T14:32:41.000+02:00,2022-08-13T18:46:03.000+02:00,DVC1,TLG00119,"005A, 0074, 0077, 0106, 013B"
...,...,...,...,...,...,...,...,...,...,...,...
81,2025,Import,TSK82,PFD13,DET10,4,2025-08-09T16:56:45.000+02:00,2025-08-09T17:31:00.000+02:00,DVC1,TLG00054,"005A, 0074, 0077, 0106, 013B"
82,2025,Import,TSK83,PFD13,DET64,4,2025-08-09T16:56:45.000+02:00,2025-08-09T17:31:00.000+02:00,DVC6,,"005A, 0077, 0106, 013B"
124,2025,Import,TSK125,PFD28,DET10,4,2025-08-10T13:00:07.000+02:00,2025-08-10T14:14:48.000+02:00,DVC1,TLG00082,"005A, 0074, 0077, 0106, 013B"
125,2025,Import,TSK126,PFD28,DET20,4,2025-08-10T13:00:07.000+02:00,2025-08-10T14:14:48.000+02:00,DVC2,TLG00083,"005A, 0074, 0077, 0106, 013B"


## Preliminary data analysis
The data is now loaded, and the following cells contain simple validation of the input.

In [18]:
# --- GeoJSON Year Investigation ---
import geopandas as gpd
import pandas as pd
from IPython.display import display

# Load the GeoJSON file
gdf = gpd.read_file(FIELD_POLYGONS_GEOJSON)

# List all unique years in the 'Year' attribute
years = sorted(gdf['Year'].dropna().unique())
print(f"Unique years in GeoJSON: {years}")
display(pd.DataFrame({'Year': years}))

Unique years in GeoJSON: [np.int32(2021), np.int32(2022), np.int32(2023), np.int32(2024), np.int32(2025)]


Unnamed: 0,Year
0,2021
1,2022
2,2023
3,2024
4,2025


In [23]:

# Reproject GeoDataFrame to the target CRS for analysis
gdf_proj = gdf.to_crs(epsg=EPSG)
print(f"Reprojected GeoDataFrame to EPSG:{EPSG} for area/overlap analysis.")


Reprojected GeoDataFrame to EPSG:25832 for area/overlap analysis.


In [24]:
# --- Check for overlapping polygons within each year (projected CRS) ---
from itertools import combinations

# Find pairs of polygons in one year's GeoDataFrame whose intersection has non-zero area
def find_overlaps(gdf_year):
    overlaps = []
    # combinations() visits each unordered pair exactly once, in index order
    for (i, geom1), (j, geom2) in combinations(gdf_year.geometry.items(), 2):
        if geom1.intersects(geom2):
            intersection_area = geom1.intersection(geom2).area
            # Polygons that only touch along an edge or at a point have zero area
            if intersection_area > 0:
                overlaps.append((i, j, intersection_area))
    return overlaps

# Analyze overlaps for each year using the projected GeoDataFrame
overlap_summary = {}
for year in years:
    gdf_year = gdf_proj[gdf_proj['Year'] == year]
    overlaps = find_overlaps(gdf_year)
    overlap_summary[year] = overlaps
    print(f"Year {year}: {len(overlaps)} overlapping pairs found.")
    if overlaps:
        for i, j, area in overlaps:
            print(f"  Overlap between index {i} and {j}, area: {area:.2f} m²")

# Optionally, display summary as a DataFrame
overlap_counts = pd.DataFrame({
    'Year': list(overlap_summary.keys()),
    'Num Overlaps': [len(v) for v in overlap_summary.values()]
})
display(overlap_counts)

Year 2021: 14 overlapping pairs found.
  Overlap between index 2 and 3, area: 114092.87 m²
  Overlap between index 8 and 9, area: 60366.70 m²
  Overlap between index 10 and 11, area: 43302.90 m²
  Overlap between index 10 and 29, area: 43302.90 m²
  Overlap between index 11 and 29, area: 43302.90 m²
  Overlap between index 12 and 13, area: 175356.26 m²
  Overlap between index 16 and 17, area: 109298.54 m²
  Overlap between index 18 and 19, area: 210481.65 m²
  Overlap between index 26 and 27, area: 269055.94 m²
  Overlap between index 28 and 41, area: 83364.05 m²
  Overlap between index 28 and 42, area: 83364.05 m²
  Overlap between index 32 and 33, area: 24295.68 m²
  Overlap between index 38 and 39, area: 102238.25 m²
  Overlap between index 41 and 42, area: 83364.05 m²
Year 2022: 12 overlapping pairs found.
  Overlap between index 0 and 1, area: 72749.90 m²
  Overlap between index 4 and 5, area: 24295.68 m²
  Overlap between index 22 and 23, area: 114092.87 m²
  Overlap between inde

Unnamed: 0,Year,Num Overlaps
0,2021,14
1,2022,12
2,2023,18
3,2024,31
4,2025,25


In [26]:
# --- Display features for a selected year ---
import ipywidgets as widgets
from IPython.display import display

# Dropdown widget for year selection
year_selector = widgets.Dropdown(
    options=years,
    description='Year:',
    value=years[0] if years else None
)

def show_features_for_year(selected_year):
    gdf_year = gdf[gdf['Year'] == selected_year]
    print(f"Features for year {selected_year} (count: {len(gdf_year)}):")
    display(gdf_year)
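    try:
        # explore() returns a map object; inside a widget callback it must be displayed explicitly
        display(gdf_year.explore())
    except Exception:
        print("Map display requires geopandas 0.10+ and folium.")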

widgets.interact(show_features_for_year, selected_year=year_selector)

interactive(children=(Dropdown(description='Year:', options=(np.int32(2021), np.int32(2022), np.int32(2023), n…

<function __main__.show_features_for_year(selected_year)>
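
The next cell reads `TASKDATA.XML` with `xml.etree.ElementTree` and relies on ISOXML's single-letter attributes (for example `A` for an element's ID, `E` on `TSK` for the field reference). As a minimal sketch of that lookup pattern, the snippet below parses a small inline document whose values are copied from the reports in this notebook. The helper `task_year` is a hypothetical addition that parses the `TIM` start time with `datetime.fromisoformat` instead of slicing the first four characters.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Reduced inline document; a real TASKDATA.XML carries many more elements
SAMPLE_XML = """<ISO11783_TaskData>
  <PFD A="PFD20" C="011-0, Akselbæk"/>
  <TSK A="TSK27" E="PFD20">
    <TIM A="2022-08-13T11:58:22.000+02:00"/>
    <TLG A="TLG00018"/>
  </TSK>
</ISO11783_TaskData>"""

def task_year(tsk):
    """Return the year of the task's first TIM start, or None if absent or invalid."""
    tim = tsk.find("TIM")
    if tim is None or "A" not in tim.attrib:
        return None
    try:
        return datetime.fromisoformat(tim.attrib["A"]).year
    except ValueError:
        return None

root = ET.fromstring(SAMPLE_XML)

# Map field ID -> name, preferring the designator (C) over the code (B)
fields = {pfd.get("A"): pfd.get("C") or pfd.get("B") for pfd in root.findall("PFD")}

rows = [
    (tsk.get("A"), fields.get(tsk.get("E")), task_year(tsk),
     [tlg.get("A") for tlg in tsk.findall("TLG")])
    for tsk in root.findall("TSK")
]
print(rows)
```

Parsing the timestamp rejects malformed start values, which string slicing would pass through as a bogus year.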

In [11]:

import xml.etree.ElementTree as ET
from collections import defaultdict

def parse_isoxml_taskdata(file_path):
    tree = ET.parse(file_path)
    root = tree.getroot()

    # 1. Extract Farm Name (FRM)
    # ISOXML structure: root -> FRM
    farms = {}
    for frm in root.findall('FRM'):
        # A=ID, B=Name
        farms[frm.attrib.get('A')] = frm.attrib.get('B', 'Unknown Farm')
    
    # 2. Extract Fields (PFD)
    # Map PFD ID -> Field Name
    fields = {}
    for pfd in root.findall('PFD'):
        pfd_id = pfd.attrib.get('A')
        # B is the partfield code and C the designator (name).
        # Prefer the designator and fall back to the code.
        name = pfd.attrib.get('C') or pfd.attrib.get('B') or "Unknown Field"
        fields[pfd_id] = name

    # 3. Extract Device Descriptions (DVC) for Danish Names (DPD)
    # Map DDI (hex) -> Danish Name (E attribute in DPD)
    # Also map Device ID -> Device Name
    devices = {}
    ddi_definitions = {} # Key: DDI (e.g., '0074'), Value: Name (e.g., 'Bearb. areal')
    
    for dvc in root.findall('DVC'):
        dvc_id = dvc.attrib.get('A')
        dvc_name = dvc.attrib.get('B', 'Unknown Machine')
        devices[dvc_id] = dvc_name
        
        # Look for DeviceProcessData (DPD) definitions inside the Device
        # DET -> DPD or direct DPD depending on version
        # Usually DVC -> DPD in flattened exports or DVC -> DET -> DPD
        for dpd in dvc.findall('.//DPD'):
            ddi = dpd.attrib.get('B') # The DDI Number (e.g., 0074)
            name = dpd.attrib.get('E') # The Designator (Danish Name)
            if ddi and name:
                ddi_definitions[ddi] = name

    # 4. Process Tasks (TSK)
    # Structure: Farm -> Year -> Field -> Task
    report_data = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

    for tsk in root.findall('TSK'):
        tsk_id = tsk.attrib.get('A')
        field_ref = tsk.attrib.get('E') # Reference to PFD
        
        # Determine Year from Time (TIM) start
        # Try to find a TIM tag
        tim = tsk.find('TIM')
        year = "Unknown Year"
        if tim is not None and 'A' in tim.attrib:
            # Format usually YYYY-MM-DDThh:mm:ss
            year = tim.attrib['A'][:4]
        
        # Resolve Field Name
        field_name = fields.get(field_ref, f"Field ID {field_ref}")

        # Resolve Machine
        # DAN (DeviceAllocation) -> C attribute is DVC ID
        dan = tsk.find('DAN')
        machine_name = "Unknown Machine"
        if dan is not None:
            dvc_ref = dan.attrib.get('C')
            machine_name = devices.get(dvc_ref, dvc_ref)

        # Get TLG Files
        tlg_files = [tlg.attrib.get('A') for tlg in tsk.findall('TLG')]

        # Get Registered Values (DLV)
        # We look for DLV elements inside the main TIM tag of the task (totals)
        logged_values = []
        if tim is not None:
            for dlv in tim.findall('DLV'):
                ddi = dlv.attrib.get('A') # DDI ID
                raw_value = dlv.attrib.get('B') # Raw integer value
                
                # Lookup Danish name
                prop_name = ddi_definitions.get(ddi, f"DDI_{ddi}")
                
                logged_values.append(f"{prop_name}: {raw_value}")

        # Store in structure
        task_info = {
            'TaskID': tsk_id,
            'Machine': machine_name,
            'TLGs': tlg_files,
            'Properties': logged_values
        }
        
        # Assume one farm per file: use the first FRM entry, or "Unknown Farm" if none exists.
        # (A per-task farm could instead be resolved through the TSK 'D' attribute, the farm reference.)
        farm_name = list(farms.values())[0] if farms else "Unknown Farm"
        
        report_data[farm_name][year][field_name].append(task_info)

    # 5. Print Report
    print("--- ISOXML Content Report ---\n")
    for farm, years in report_data.items():
        print(f"FARM: {farm}")
        for year, fields_in_year in sorted(years.items()):
            print(f"\n  YEAR: {year}")
            for field, tasks in fields_in_year.items():
                print(f"    FIELD: {field}")
                for task in tasks:
                    print(f"      TASK: {task['TaskID']}")
                    print(f"        Machine: {task['Machine']}")
                    print(f"        TLG Files: {', '.join(task['TLGs'])}")
                    if task['Properties']:
                        print("        Registered Values (Totals):")
                        for prop in task['Properties']:
                            print(f"          - {prop}")
                    print("")

if __name__ == "__main__":
    # Replace with the actual path to your downloaded TASKDATA.XML
    file_path = PROJECT_ROOT / 'data' / 'taskdata' / 'TASKDATA.XML'
    try:
        parse_isoxml_taskdata(file_path)
    except Exception as e:
        print(f"Error parsing file: {e}")

--- ISOXML Content Report ---

FARM: Ledreborg Gods

  YEAR: 2022
    FIELD: 016-2, Stendyssegård Højre
      TASK: TSK14
        Machine: Nr: 221  C8600703
        TLG Files: TLG00010

      TASK: TSK15
        Machine: Generic Implement
        TLG Files: 

    FIELD: 011-0, Akselbæk
      TASK: TSK27
        Machine: Nr: 223 C8600722
        TLG Files: TLG00018

      TASK: TSK28
        Machine: Generic Implement
        TLG Files: 

    FIELD: 035-0, Bispegård mark
      TASK: TSK38
        Machine: Nr: 220  C8600702
        TLG Files: TLG00025

      TASK: TSK39
        Machine: Nr: 221  C8600703
        TLG Files: TLG00026

      TASK: TSK40
        Machine: Generic Implement
        TLG Files: 

    FIELD: 022-0, Lidia
      TASK: TSK58
        Machine: Nr: 222 C8600704
        TLG Files: TLG00038

      TASK: TSK59
        Machine: Generic Implement
        TLG Files: 

    FIELD: 013-0, Hulegaard b.
      TASK: TSK87
        Machine: Nr: 223 C8600722
        TLG Files: TLG000