This notebook shows how to call the main Python functions to preprocess the raw data, extract data for training, train a model, and predict a GeoTIFF.

Using this notebook is generally not recommended over calling the data extraction, training, and prediction scripts from the command line.

Running the steps in the notebook takes about twice as long.



TODO: explain the config file and folder structure.

In [1]:
from rural_beauty.config import models_dir
import pathlib


from rural_beauty import preprocessing         # function to get from the raw data to cleaned data ready for extraction. 
from rural_beauty import get_data_for_training # the function to create model data frames
from rural_beauty import training_model        # the function to train a tree model
from rural_beauty import predict_generic       # the function to predict a tif based on the model

Preprocessing takes the raw data (as downloaded from the source) and converts it to GeoTIFFs at 1x1 km resolution.

Aggregations are usually the share of the area within a pixel. 

E.g.:

     Forest.tif: "Share of the pixel that is covered by forests"

     protected.tif: "Share of the pixel that is part of a protected area"
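
The "share of the pixel" aggregation can be illustrated with a small sketch. It assumes a fine-resolution binary mask (1 = forest) held as nested lists and aggregates it in square blocks. The function name `block_share` and the toy grid are hypothetical and not part of `rural_beauty`, which works on real raster files.

```python
def block_share(mask, factor):
    """Aggregate a binary fine-resolution mask into coarse pixels.

    Each coarse pixel holds the share (0..1) of its factor x factor
    fine cells that are set to 1.
    """
    n_rows, n_cols = len(mask), len(mask[0])
    if n_rows % factor or n_cols % factor:
        raise ValueError("mask dimensions must be multiples of factor")
    coarse = []
    for r in range(0, n_rows, factor):
        row = []
        for c in range(0, n_cols, factor):
            covered = sum(
                mask[i][j]
                for i in range(r, r + factor)
                for j in range(c, c + factor)
            )
            row.append(covered / (factor * factor))
        coarse.append(row)
    return coarse


# toy 4x4 forest mask aggregated to 2x2 coarse pixels (hypothetical data)
forest_mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
forest_share = block_share(forest_mask, factor=2)
print(forest_share)
```

In this toy grid the top-left coarse pixel is three-quarters forest and the bottom-right one is fully covered.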


`preprocessing.py` runs the calculations for every layer we consider in one go. Individual groups of inputs can be skipped using flags, e.g. `--skip_CLC`.

In [None]:
preprocessing.main()

# options to skip:
# preprocessing.main( --skip_DE --skip_UK --skip_CLC --skip_DEM --skip_OSM --skip_Hemerobie --no-skip_Protected --skip_Neighborhood)

# Right now some steps overwrite existing output files while others skip them if they are already present. This is a TODO.

### `get_data_for_training.py`

This script extracts data from preprocessed raster files based on a specified sampling method and country. The resulting data is formatted into tabular form and saved as CSV files, making it ready for ingestion by machine learning workflows.

### Key Features:
- **Flexible Sampling Methods**: Supports different sampling strategies such as extracting all raster pixels, random sampling, or using predefined points (e.g., UK scenic points).
- **Multi-Feature Extraction**: Combines raster-based predictors with outcome variables for a unified dataset.
- **Polygon Support**: Handles extractions within specific boundary polygons and supports buffering to include coastal areas.

### Sampling Methods:
For Germany there are `random_pixels` and `all_pixels`. 
For the UK we have implemented `pooled_pixels_random_points` and `pooled_pixels_all_points`. 

`pooled_pixels` refers to the fact that multiple scenic-or-not images can fall within the same 1 km grid cell.  
`pooled_pixels` pools the ratings of all images within a cell.  
Additionally, when we sample, we only sample pixels that contain images and do not interpolate.  
If needed, `all_pixels` and `random_pixels` could be extended to the UK by first interpolating the scenic-or-not rating to all pixels, but this seems worse, so it is not implemented.  

Possible future methods could treat images within the same pixel separately.  
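
A minimal sketch of the pooling idea, assuming image locations in projected metres and the mean as the pooling function. Both are assumptions for illustration: the text above does not say how ratings are combined, and `pool_ratings_by_cell` is a hypothetical helper, not a function of this package.

```python
from collections import defaultdict
from statistics import mean


def pool_ratings_by_cell(images, cell_size=1000):
    """Average image ratings per grid cell.

    images: iterable of (x, y, rating) with x and y in projected metres.
    Returns {(col, row): mean rating}, only for cells that contain images,
    so nothing is interpolated into empty cells.
    """
    ratings = defaultdict(list)
    for x, y, rating in images:
        cell = (int(x // cell_size), int(y // cell_size))
        ratings[cell].append(rating)
    return {cell: mean(values) for cell, values in ratings.items()}


# three hypothetical images: two share a cell, one sits alone
images = [(120.0, 340.0, 4.0), (870.0, 910.0, 6.0), (1500.0, 200.0, 3.0)]
pooled = pool_ratings_by_cell(images)
print(pooled)
```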

### Outputs:
The script saves the tabular data as CSV files under:

`data/models/__extracted_points/{country}/{target_variable}/{sampling_method}/`


### Files Created:
1. **`coords.csv`**: Coordinates of the sampled points.
2. **`predictors.csv`**: Raster-based explanatory variables extracted at the points.
3. **`outcome.csv`**: Target variable values extracted at the same points.
4. **`feature_paths.json`**: Metadata containing the paths to the raster files used for extraction.
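
As a sketch, the output folder can be located and checked like this. The folder layout and file names follow the log output printed by the extraction cell in this notebook; `extraction_dir` and `missing_files` are hypothetical helpers, not part of `rural_beauty`.

```python
from pathlib import Path

# file names as reported in the extraction log of this notebook
EXPECTED_FILES = ("coords.csv", "predictors.csv", "outcome.csv", "feature_paths.json")


def extraction_dir(base, country, target_variable, sampling_method):
    """Folder holding the extracted tables for one country/target/sampling combination."""
    return Path(base) / "__extracted_points" / country / target_variable / sampling_method


def missing_files(folder):
    """Return the expected output files that are not present in `folder`."""
    return [name for name in EXPECTED_FILES if not (Path(folder) / name).exists()]


folder = extraction_dir("data/models", "DE", "beauty", "random_pixels")
print(folder)
print(missing_files(folder))  # empty list once the extraction has been run
```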

---

#### Example Usage:

To extract data for Germany (`DE`) using the `all_pixels` sampling method for the `beauty` target variable:
```bash
python3 get_data_for_training.py DE beauty all_pixels
```




In [None]:
# parameters for data generation
country = 'DE'
target_variable = 'beauty'
sampling_method = 'all_pixels' # extracting all_pixels will take a long time. 60+ min on the IIASA VM101 server. 

get_data_for_training.main(country=country, target_variable=target_variable, sampling_method=sampling_method)
# python3 rural_beauty/rural_beauty/get_data_for_training.py DE beauty all_pixels


All files exist
Extracting beauty's raster values


Extracting explanatory raster values: 100%|██████████| 68/68 [01:25<00:00,  1.26s/it]


Coordinate file written to /h/u145/hofer/MyDocuments/Granular/beauty/data/models/__extracted_points/DE/beauty/random_pixels/coords.csv
Outcome file written to /h/u145/hofer/MyDocuments/Granular/beauty/data/models/__extracted_points/DE/beauty/random_pixels/outcome.csv
Predictors file written to /h/u145/hofer/MyDocuments/Granular/beauty/data/models/__extracted_points/DE/beauty/random_pixels/predictors.csv
Feature path json written to /h/u145/hofer/MyDocuments/Granular/beauty/data/models/__extracted_points/DE/beauty/random_pixels/feature_paths.json


### ML Model Training Script: `training_model.py`

This script trains machine learning models on preprocessed spatial and raster data for Germany (`DE`) or the United Kingdom (`UK`). The output includes trained models, evaluation metrics, and visualizations to support further analysis.

---

#### Key Features
- **Flexible Model Selection**: Supports various machine learning models, including:
  - `RandomForestClassifier`
  - `DecisionTreeClassifier`
  - `XGB` (XGBoost classifier)
- **Hyperparameter Tuning**: Optional grid search for optimal model parameters.
- **Data Balancing**: Handles unbalanced classes through oversampling or uses data as is.
- **Sampling Strategies**: Tailored for spatial data, offering:
  - `all_pixels`: Extracts all raster values.
  - `random_pixels`: Samples random raster values.
  - `pooled_pixels_all_points`: For UK, pools multiple scenic values within a grid cell.
  - `pooled_pixels_random_points`: For UK, randomly samples grid cells with images.

---

#### Inputs
1. **Country**: Specify the target country (`DE` or `UK`).
2. **Target Variable**: Options include `scenic`, `beauty`, `unique`, or `diverse`.
3. **Sampling Method**: Choose the strategy for spatial data sampling.
4. **Model Class**: Select the model to train (`RandomForestClassifier`, `DecisionTreeClassifier`, or `XGB`).
5. **Number of Classes**: Set the number of classes for classification.
6. **`sugar`**: Unique identifier for the model output folder.
7. **Class Balance**:
   - `asis`: Use data as is.
   - `oversampling`: Balance classes by oversampling underrepresented classes.
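
The `oversampling` option can be pictured with the following sketch of plain random oversampling. This is an assumption about the general technique, not a copy of what `training_model` does internally; `random_oversample` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # draw with replacement only as many extra rows as the class is short
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# hypothetical data: class 1 has three rows, class 2 has one
rows = [[0.1], [0.2], [0.3], [0.9]]
labels = [1, 1, 1, 2]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Oversampling is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.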

---

#### Outputs
All results are saved under the directory:  
`data/models/{country}__{target_variable}__{sampling_method}__{model_class}__{class_balance}__{sugar}/`
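
A short sketch of how such a folder name is assembled from the run parameters. `model_folder_name` is a hypothetical helper that only mirrors the naming pattern above.

```python
from pathlib import Path


def model_folder_name(country, target_variable, sampling_method,
                      model_class, class_balance, sugar):
    """Join the run parameters with double underscores, in the documented order."""
    parts = [country, target_variable, sampling_method, model_class, class_balance, sugar]
    return "__".join(str(part) for part in parts)


name = model_folder_name("DE", "beauty", "random_pixels", "XGB", "asis", "7_021224")
print(Path("data/models") / name)
```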

##### Files Created:
1. **Trained Model**:
   - Saved as `model.pkl` for later predictions.
2. **Confusion Matrix**:
   - Visualized and saved as `confusion_matrix.png`.
3. **Significant Coefficients**:
   - CSV file of important features (if supported by the model).
4. **Logfile**:
   - Appends training summaries (accuracy, F1 score, Kendall's Tau).

---

#### Evaluation Metrics
The script computes the following metrics for model evaluation:
- **Accuracy**: Overall prediction correctness.
- **F1 Score**: Weighted harmonic mean of precision and recall.
- **Kendall's Tau**: Measures rank correlation between predictions and true labels.
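
For reference, the three metrics can be written out in plain Python as below. This is a sketch under assumptions: the support-weighted F1 average and the tau-b variant of Kendall's Tau are guesses at what the script uses, and the labels are toy data.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true count."""
    total = 0.0
    for cls, n_true in Counter(y_true).items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        n_pred = sum(p == cls for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true
    return total / len(y_true)


def kendall_tau_b(x, y):
    """Kendall's tau-b: rank correlation with a correction for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: drops out of both denominator terms
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# toy labels: the two highest classes are swapped in the prediction
y_true = [1, 2, 3, 4]
y_pred = [1, 2, 4, 3]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred), kendall_tau_b(y_true, y_pred))
```

The toy case shows why a rank correlation is useful for ordered classes: half of the predictions are wrong, yet the ordering is mostly preserved, so Kendall's Tau stays well above the accuracy.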

---

#### Command-Line Usage
Run the script with the following arguments:

```bash
python training_model.py <country> <target_variable> <model_class> <sampling_method> <number_classes> <sugar> [--tune-hyperparameters] [--class_balance <method>]
```

In [None]:
# this is for training the model
model_class      = 'DecisionTreeClassifier'
class_balance    = 'asis'
number_classes   = 7
sugar            = f"{number_classes}_191224"



# python3 rural_beauty/rural_beauty/training_model.py DE beauty XGB all_pixels asis 7 7_123456
training_model.main(country          = country,
                    target_variable  = target_variable,
                    model_class      = model_class,
                    sampling_method  = sampling_method,
                    class_balance    = class_balance,
                    sugar            = sugar,
                    number_classes   = number_classes)

Model Accuracy:      0.78
Model F1:            0.77
Model Kendall's Tau: 0.83
Confusion matrix saved to: /h/u145/hofer/MyDocuments/Granular/beauty/data/models/DE__beauty__random_pixels__XGB__asis__7_021224/confusion_matrix.png


<Figure size 800x600 with 0 Axes>

Alternatively, we can train a whole group of models.
Let's set up a short list of models we want to try and then run all combinations of them.

In [None]:
countries = ['DE', 'UK']
# the target variable depends on the country, so we set up a dictionary
target_variables = {'DE': 'beauty', 'UK': 'scenic'}
model_classes = ['XGB', 'RandomForestClassifier', 'DecisionTreeClassifier']
class_balances = ['asis', 'oversampling']
# we leave the number of classes and the sugar as they are.
# sampling_method is reused from the extraction cell above; note that for the UK
# only the pooled_pixels_* methods are implemented (see Sampling Methods).

# each model is stored in its own new folder in the models directory;
# the training summaries are also appended to a logfile at data/models/logfile.txt

for country in countries:
    for model_class in model_classes:
        for class_balance in class_balances:
            training_model.main(country          = country,
                                target_variable  = target_variables[country],
                                model_class      = model_class,
                                sampling_method  = sampling_method,
                                class_balance    = class_balance,
                                sugar            = sugar,
                                number_classes   = number_classes)

# Predicting EU-Wide Values Using Trained Models

This script generates predictions for the EU (as defined by ) based on trained machine learning models and raster data. It aligns and normalizes input rasters, evaluates model features, and creates GeoTIFF outputs with predictions.

---

### Key Features
- **Flexible Model Parsing**: Dynamically extracts model metadata from folder names, enabling seamless integration with trained models.
- **Raster Alignment and Normalization**:
  - Ensures all input rasters have a common extent and resolution.
  - Subsets and aligns rasters to focus on areas of interest.
- **Boundary-Based Predictions**:
  - Predictions are constrained to specified regions (e.g., NUTS polygons) with optional buffering for coastal areas.
- **GeoTIFF Outputs**:
  - Generates prediction rasters in GeoTIFF format for easy visualization and integration with GIS tools.
- **Robust Validations**:
  - Checks for invalid values (e.g., NaNs, infinities) in predictor rasters.
  - Handles missing or misaligned data gracefully.
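
The metadata parsing can be sketched as splitting the folder name on double underscores. `parse_model_folder` and `FIELDS` are hypothetical names; the field order is taken from the folder naming pattern used by the training step.

```python
from pathlib import Path

FIELDS = ("country", "target_variable", "sampling_method",
          "model_class", "class_balance", "sugar")


def parse_model_folder(model_folder):
    """Split a model folder name on double underscores into its six fields."""
    parts = Path(model_folder).name.split("__")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}: {parts}")
    return dict(zip(FIELDS, parts))


meta = parse_model_folder("data/models/DE__unique__random_pixels__XGB__asis__7_271124")
print(meta)
```

Single underscores inside a field (for example `random_pixels` or `7_271124`) are left intact because only the double underscore acts as a separator.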

---

### Workflow
1. **Input Data Preparation**:
   - Loads trained model and feature rasters.
   - Aligns rasters to a common extent and resolution.
2. **Prediction Generation**:
   - Reads rasters and stacks them for input into the machine learning model.
   - Applies model predictions only within specified boundaries (e.g., polygons from NUTS data).
3. **Output**:
   - Creates a prediction GeoTIFF with nodata values (-99) for areas outside the polygon or with invalid inputs.
   - Saves prediction rasters and logs relevant information.
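
The masking logic of this workflow can be sketched on toy grids. The nested lists, the `inside` mask, and the stand-in model are hypothetical; the real script reads aligned rasters and applies the trained model, so this only illustrates where the nodata value ends up.

```python
from math import isfinite

NODATA = -99  # nodata value named in the workflow above


def predict_grid(layers, inside, predict_row):
    """Predict per pixel from stacked feature layers.

    layers: list of 2D grids, one per feature, all with the same shape.
    inside: 2D grid of booleans, True where the pixel lies in the polygon.
    predict_row: callable mapping one feature vector to a prediction.
    """
    n_rows, n_cols = len(inside), len(inside[0])
    out = [[NODATA] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            features = [layer[r][c] for layer in layers]
            # predict only inside the polygon and only for finite inputs
            if inside[r][c] and all(isfinite(v) for v in features):
                out[r][c] = predict_row(features)
    return out


forest = [[0.2, float("nan")], [0.8, 0.5]]
protected = [[0.0, 0.1], [1.0, 0.3]]
inside = [[True, True], [True, False]]

# stand-in for a trained model: class 1 if the feature sum exceeds 1, else 0
prediction = predict_grid([forest, protected], inside, lambda f: int(sum(f) > 1))
print(prediction)
```

The NaN pixel and the pixel outside the polygon both keep the nodata value, while the remaining pixels receive a class.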

---

### Inputs
1. **Model Folder**:
   - The folder containing the trained model, e.g., `data/models/DE__unique__random_pixels__XGB__asis__7_271124`.
   - Includes the model file (`model.pkl`), features, and training metadata.
2. **Boundary Data**:
   - GeoJSON file specifying polygons for the prediction region (e.g., EU boundaries with NUTS data).

---

### Outputs
1. **Prediction GeoTIFF**:
   - File path: `<model_folder>/prediction.tif`
   - Contains model predictions for valid regions, aligned to the input rasters.
2. **Aligned Rasters**:
   - Adjusted rasters saved under `data/forprediction/[boundary_name]`.
3. **Logs**:
   - Validation logs for raster alignment, invalid values, and prediction summaries.

---

### Command-Line Usage
Run the script with the following arguments:

```bash
python predict_generic.py <model_folder> <boundary_path>
```

In [6]:
# the prediction function takes a model folder (as created by the training function)
model_basename = "__".join([country, target_variable, sampling_method, model_class, class_balance, sugar])
model_folder   = models_dir / model_basename

from rural_beauty.config import NUTS_EU, NUTS_DE, NUTS_UK

# predict_generic.main(model_folder, NUTS_EU)
# alternatively, we can cross-predict each model on the other country for all the models in the grid above

for country in countries:
    for model_class in model_classes:
        for class_balance in class_balances:
            # use the country-specific target variable so the folder name matches the trained model
            model_basename = "__".join([country, target_variables[country], sampling_method, model_class, class_balance, sugar])
            model_folder   = models_dir / model_basename
            if country == 'DE':
                predict_generic.main(model_folder, NUTS_UK)
            else:
                predict_generic.main(model_folder, NUTS_DE)




Finished writing the prediction to /h/u145/hofer/MyDocuments/Granular/beauty/data/models/DE__beauty__random_pixels__XGB__asis__7_021224/prediction.tif
