# Soil Viability Prediction: Model Experimentation with XGBoost and MLflow

## **Objective**
This notebook experiments with several XGBoost configurations for predicting soil viability for farming in Manitoba. It uses MLflow to:
- Track and log experiments, including hyperparameters, metrics, and model artifacts.
- Facilitate reproducibility and collaboration.

---

## **Workflow Outline**
1. **Setup and Configuration**:
   - Import necessary libraries.
   - Initialize MLflow for experiment tracking.
2. **Data Loading and Preprocessing**:
   - Load and process the Manitoba soil dataset using the `DataProcessor` class.
   - Perform exploratory data analysis (EDA) if needed.
3. **Feature Engineering**:
   - Apply transformations to create features such as `MANCON` flags, weighted scores, and encoded classifications.
4. **Model Experimentation**:
   - Train XGBoost models with various hyperparameter configurations.
   - Evaluate model performance using metrics suited to the target (e.g., RMSE for a continuous score, accuracy for a class label).
   - Log experiments to MLflow.
5. **Analysis of Results**:
   - Compare model performance across experiments.
   - Visualize feature importance and residuals.
6. **Model Selection**:
   - Choose the best-performing model based on metrics logged in MLflow.
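
The experimentation and selection steps above could be sketched roughly as follows. This is a minimal sketch, not the notebook's actual implementation: it assumes a binary viability label, a pre-split training and validation set, and scikit-learn for the accuracy metric. The experiment name `soil-viability-xgboost`, the helper names `expand_grid` and `run_experiments`, and the grid values are all hypothetical. Model artifact logging is left out for brevity.

```python
from itertools import product


def expand_grid(grid):
    """Return one dict per combination of the values in `grid`."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def run_experiments(X_train, y_train, X_val, y_val, grid):
    """Train one XGBoost classifier per configuration and log each run to MLflow."""
    # Imported here so expand_grid stays usable without these packages installed.
    import mlflow
    import xgboost as xgb
    from sklearn.metrics import accuracy_score

    mlflow.set_experiment("soil-viability-xgboost")  # hypothetical experiment name
    results = []
    for params in expand_grid(grid):
        with mlflow.start_run():
            model = xgb.XGBClassifier(**params)  # assumes a binary class label
            model.fit(X_train, y_train)
            accuracy = accuracy_score(y_val, model.predict(X_val))
            mlflow.log_params(params)
            mlflow.log_metric("val_accuracy", accuracy)
            results.append((params, accuracy))
    # Sort so the best-scoring configuration comes first.
    return sorted(results, key=lambda item: item[1], reverse=True)


# Illustrative grid; these values are placeholders, not tuned recommendations.
example_grid = {"max_depth": [3, 6], "learning_rate": [0.05, 0.1], "n_estimators": [200]}
configs = expand_grid(example_grid)
```

Calling `run_experiments(X_train, y_train, X_val, y_val, example_grid)` would create one MLflow run per configuration and return the results sorted by validation accuracy. For a continuous target, `xgb.XGBRegressor` with an RMSE metric would take the place of the classifier.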

---

## **Tools and Technologies**
- **XGBoost**: A gradient boosting framework for training models.
- **MLflow**: A platform for managing machine learning experiments, including:
  - Tracking hyperparameters and metrics.
  - Logging models and artifacts for reproducibility.
  - Visualizing experiment results.
- **Pandas, NumPy**: For data manipulation and analysis.
- **Matplotlib, Seaborn**: For data visualization.
- **DataProcessor**: Custom class for data preprocessing and feature engineering.

---

## **Expected Outcomes**
By the end of this notebook:
- You will have a trained and evaluated XGBoost model.
- All experiments will be tracked and logged in MLflow, enabling easy comparison and reproducibility.
- Insights into the most influential features for predicting soil viability will be generated.

---

> **Note**: Before running this notebook, ensure that the MLflow tracking server is configured and the necessary dependencies are installed.
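
One way to check the dependency requirement before running anything else is sketched below. The `REQUIRED` list is an assumption based on the tools listed above, and `missing_packages` is a hypothetical helper name. Adjust both to match your environment.

```python
from importlib.util import find_spec

# Assumed import names for this notebook's dependencies; adjust to your environment.
REQUIRED = ["pandas", "numpy", "xgboost", "mlflow", "matplotlib", "seaborn"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Once the packages are present, `mlflow.set_tracking_uri(...)` can point the notebook at your tracking server. The URI depends on your setup and is not specified here.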


In [1]:
import pandas as pd
import numpy as np

# Read the sampled dataset into a pandas DataFrame
file_path = r'C:\Users\JP\Documents\Manitoba Soil Survey Data\MB soil segmentation project\manitoba_10_sample.csv'
agric = pd.read_csv(file_path, low_memory=False)


In [3]:
# The original dataset has about 100,000 rows. We have been working with a
# 10,000-row subset of it. The code below isolates the remaining ~90,000 rows
# that were not sampled.

file_path_2 = r'C:\Users\JP\Documents\Manitoba Soil Survey Data\MB soil segmentation project\Soil_Survey_Manitoba.csv'
total_100k_agric = pd.read_csv(file_path_2, low_memory=False)

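# Anti-join on OBJECTID: keep only the rows that are not in the 10k sample
remainder_agric = total_100k_agric[~total_100k_agric['OBJECTID'].isin(agric['OBJECTID'])]

# Write the unsampled rows to disk
remainder_agric.to_csv("remainder_agric.csv", index=False)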
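
The split above can be checked with a quick test that the sample and the remainder do not overlap and together cover the full dataset. The sketch below uses made-up toy data rather than the survey files, and it assumes `OBJECTID` is unique per row. If the real column contains duplicates, the length check would not hold.

```python
import pandas as pd

# Toy stand-ins for the full survey and the sampled subset (values are made up).
full = pd.DataFrame({"OBJECTID": [1, 2, 3, 4, 5], "value": list("abcde")})
sample = full[full["OBJECTID"].isin([2, 4])]

# Same anti-join as above: rows whose OBJECTID is not in the sample.
remainder = full[~full["OBJECTID"].isin(sample["OBJECTID"])]

# The two parts should not overlap and should add up to the full dataset.
overlap = set(sample["OBJECTID"]) & set(remainder["OBJECTID"])
is_partition = not overlap and len(sample) + len(remainder) == len(full)
print("Clean partition:", is_partition)
```

The same two checks can be run on `agric`, `remainder_agric`, and `total_100k_agric` before relying on `remainder_agric.csv`.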
