# Investigating Allegations of Soil Pollution in the Kalumbila Mineralized Zone 


## 1. Introduction and Background

- With the growing global demand for subsurface resources such as critical minerals, rare earth elements, geothermal energy, and groundwater, the **sustainable management of surface and subsurface systems** is a critical scientific and societal priority. 

- Soil geochemistry provides a direct window into surface and subsurface processes, mineralization, and environmental conditions, making it a key tool for **contamination mapping, mineral prospectivity assessment, field exploration geology, and environmental protection** (Carranza, 2009).  


- Mineral deposits produce **distinctive soil geochemical signatures**, which can extend several kilometers due to weathering, erosion, and sediment transport. These geochemical footprints can reach **up to 6 km**, leading to naturally elevated metal concentrations even in areas without direct mining activity.


- In mineralized terrains like Kalumbila, soil geochemistry reflects a **complex mixture of natural geological enrichment and potential anthropogenic inputs**, making the assessment of pollution allegations challenging.  


- **Kalumbila**, located in north-western Zambia, hosts active mining and is naturally enriched in metals due to:
  - Underlying lithology  
  - Hydrothermal alteration  
  - Long-term tropical weathering  


- Mining activities and land use changes raise **concerns about soil pollution**, especially from heavy metals and metalloids that are **persistent, toxic, and non-biodegradable**. 


- Addressing soil pollution here is critical for **environmental sustainability and social well-being**, ensuring safe land for communities, agriculture, and future development.  


- **Conventional environmental assessment methods**, including:
  - Single-element thresholds  
  - Deterministic yes/no classifications  
- These methods cannot account for **multivariate geochemical relationships**, ignore **spatial context**, and do not quantify **uncertainty**.  


- Thus, in this project, we will employ **modern data science and multivariate methods** to provide robust, defensible, and socially responsible interpretations.



## 2. Study Area and Rationale: Kalumbila Mineralized Zone

- **Location:** North-western Zambia; actively mined, highly mineralized.  
- **Natural enrichment:** Lithology, hydrothermal mineralization, and prolonged weathering.  
- **Complexity:** Elevated metals may originate from:
  - Natural geological enrichment  
  - Mining-related anthropogenic activities  
- **Importance:** Kalumbila offers a **realistic and complex case study** to apply **multivariate, uncertainty-aware, and spatial anomaly detection methods**.


## 3. Problem Statement

- Soil pollution allegations in mineralized regions present a **major environmental and social challenge**.  
- Elevated metal concentrations may **threaten ecosystem health, agricultural productivity, and community well-being**, making **environmental sustainability and public safety** central concerns.  
- Traditional methods relying on **single-element thresholds** and deterministic classification:
  - Fail to detect **subtle but important anomalies**  
  - Misclassify **natural geochemical enrichment as pollution**  
  - Lack defensibility in **complex mineralized terrains**  
- **Data science methods are critical** because they:
  - Capture **multivariate relationships** between elements  
  - Account for **spatial distribution** of anomalies  
  - Explicitly quantify **uncertainty**  
  - Provide **objective, reproducible, and socially defensible conclusions**  
- There is a **need for an integrated, data-driven framework** that prioritizes **environmental sustainability, social well-being, and scientific detail**, rather than relying on oversimplified threshold methods.


## 4. Aim of the Study

- To investigate **allegations of soil pollution in Kalumbila** using an **integrated data science and exploration geology framework** that prioritizes **multivariate outlier detection** and supports **environmental sustainability and social well-being**.



## 5. Objectives

1. Characterize the **multivariate geochemical background** of Kalumbila soils.  
2. Identify **multivariate outliers** associated with heavy metals and metalloids.  
3. Distinguish **natural geochemical enrichment** from **potential anthropogenic contamination**.  
4. Evaluate **mineral prospectivity signals** in soil geochemical data.  
5. Quantify **uncertainty in anomaly classification** using probabilistic and spatially aware methods.


## 6. Research Questions

- What is the **natural geochemical background** of soils in Kalumbila?  
- Which soil samples are **multivariate outliers**?  
- Are identified anomalies **spatially coherent and geologically meaningful**?  
- Can anomalies be explained by **natural mineralization**?  
- What is the **probability that anomalies are anthropogenic** rather than geogenic?  
- How can **data science methods** improve **environmental sustainability and social well-being** compared to traditional approaches?




## 7. Dataset Description

### 7.1 Soil Geochemical Data
- Collected systematically in the field (primary/secondary).  
- Analyzed for **major elements, trace elements, and heavy metals**.  
- Characteristics to work with:
  - Multivariate  
  - Compositional  
  - Spatially distributed

### 7.2 Borehole Data (If Available)
- Provides subsurface geochemistry and lithology.  
- Supports **linking surface anomalies to subsurface sources**.  
- Enhances interpretation but is **not mandatory**.



## 8. Methodological Framework

### 8.1 Compositional Data Analysis (CoDA)
- Soil geochemical data are **compositional**: element concentrations are parts of a whole that sum to a constant.  
- **Importance:** Prevents spurious correlations that could mislead interpretations.  
- **How it'll work in this study:**  
  - Decide how to handle samples that contain zeros
  - Ask a domain expert which ratios are of interest for the problem at hand
  - Apply log-ratio transformations and perform basic EDA
  - Calculate the variation array and look for small and large variances
  - Produce ternary diagrams for the ratios with small and large variances
  - Create the biplot
  - Interpret the biplot, looking for alignment and orthogonality of the rays and the length of the ray belonging to each element
  - Communicate the findings to the expert to help refine the analysis
- **Plots to make will include:**
  - EDA of alr and clr log-ratios
  - Ternary diagrams and variation array
  - Biplot
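
The log-ratio steps above can be sketched as follows. This is a minimal illustration, assuming strictly positive concentrations (zeros already handled) and using hypothetical values; `closure`, `clr`, and `variation_matrix` are illustrative helper names, not functions from any library.

```python
import numpy as np

def closure(X):
    """Rescale each row so that its parts sum to 1."""
    X = np.asarray(X, dtype=float)
    return X / X.sum(axis=1, keepdims=True)

def clr(X):
    """Centred log-ratio: log of each part minus the row mean of the logs."""
    logX = np.log(closure(X))
    return logX - logX.mean(axis=1, keepdims=True)

def variation_matrix(X):
    """Entry (i, j) is the variance across samples of log(x_i / x_j)."""
    logX = np.log(closure(X))
    # Pairwise log-ratios, shape (n_samples, n_parts, n_parts)
    log_ratios = logX[:, :, None] - logX[:, None, :]
    return log_ratios.var(axis=0, ddof=1)

# Hypothetical concentrations (ppm) of three elements in four samples
ppm = np.array([[120.0, 35.0, 8.0],
                [ 90.0, 40.0, 6.0],
                [300.0, 30.0, 9.0],
                [110.0, 38.0, 7.0]])

Z = clr(ppm)               # clr coordinates; each row sums to zero
T = variation_matrix(ppm)  # small entries suggest near-proportional elements
```

Small entries of `T` point to element pairs that vary almost proportionally, and large entries to pairs that behave independently, which is what the ternary diagrams are then meant to explore.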

### 8.2 Multivariate Outlier Detection (**Primary Method For This Project**)

- Soil pollution is difficult to assess where soils are naturally enriched, so simple thresholds fail.  

- **How it'll work in this study:**
  - Since we're working with compositional data, apply log-ratio transformations
  - Use quantile-quantile plots to detect univariate outliers
  - Make bivariate scatter plots of quantities of interest, including the biplot
  - Inspect the covariance determinant as a function of dropped outliers (if our code logic makes this option available)
  - Make the outlier detection plot, including a high quantile (e.g., 95%) of the robust Mahalanobis distance (MD)
  - Make the chi-square quantile plot of the MD to further refine the analysis
  - Decide on "outlier" vs "not outlier"
  - Mark the outliers in their context, e.g. spatial locations

- **Plots to make will include:**
  - QQ plot, scatter plot, biplot
  - Determinant as a function of outliers removed
  - Chi-square quantile plot of the robust MD
  - Scatter plot indicating the robust MD
  - Outliers marked on maps


- This is important because it helps us check whether anomalies are **natural or anthropogenic** and flags potential contamination.  
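
A minimal sketch of the robust-distance step, assuming scikit-learn and SciPy are available. It uses the minimum covariance determinant (MCD) estimator as one common way to obtain a robust Mahalanobis distance; the source does not name a specific estimator, so this choice is an assumption. The data are synthetic stand-ins for log-ratio coordinates with a non-singular covariance (for example ilr; clr coordinates sum to zero and would need one dimension removed first). `flag_outliers` and the 97.5% cutoff are illustrative choices.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def flag_outliers(Z, quantile=0.975, seed=0):
    """Return squared robust Mahalanobis distances and a boolean outlier mask."""
    mcd = MinCovDet(random_state=seed).fit(Z)
    d2 = mcd.mahalanobis(Z)  # squared distances to the robust centre
    cutoff = chi2.ppf(quantile, df=Z.shape[1])  # chi-square quantile as cutoff
    return d2, d2 > cutoff

# Synthetic data (assumption): 200 background samples plus 5 strongly shifted ones
rng = np.random.default_rng(42)
background = rng.normal(size=(200, 3))
shifted = rng.normal(loc=6.0, size=(5, 3))
Z = np.vstack([background, shifted])

d2, is_outlier = flag_outliers(Z)
```

Plotting the sorted `d2` against chi-square quantiles gives the chi-square quantile plot, and the mask can be joined to sample coordinates to mark outliers on a map.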


### 8.3 Principal Component Analysis (PCA)
- **Purpose:** Reduce dimensionality and reveal **dominant geochemical patterns**. Principal components (PCs) highlight **element associations** controlled by lithology, alteration, or contamination, and support outlier interpretation by showing **which elements drive anomalies**. This helps distinguish **natural geochemical variability** from potential pollution. 


- **How it'll work in this project:**  
  - Since we're working with compositional data, apply log-ratio transformations (preferably ilr)
  - Standardize the data so that all variables are on the same scale
  - Run the PCA
  - Make the scree plot to understand the variance contribution of each PC
  - Make loading plots to attribute meaning to the principal components
  - Make score plots, and color them with quantities of interest
  - Map the scores back to the locations where the samples were collected
  - When performing dimension reduction, map the lower-dimensional scores back to the original variables and assess the error made

- **Plots to make will include:**
  - Scree plot
  - Loadings plot
  - Score plots
  - For spatial data: maps with PC scores
  - For dimension reduction: error made for each sample
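
The PCA steps above can be sketched with NumPy alone. This is an illustrative outline on synthetic data, not the project's implementation; the function name `pca` and the synthetic one-factor structure are assumptions made for the example.

```python
import numpy as np

def pca(Z, n_components):
    """PCA via SVD of the standardized data.

    Returns scores, loadings, explained-variance ratios (for the scree plot),
    and the per-sample reconstruction error in standardized units.
    """
    S = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)   # standardize columns
    _, s, Vt = np.linalg.svd(S, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    loadings = Vt[:n_components].T                      # variables x components
    scores = S @ loadings                               # samples x components
    error = np.linalg.norm(S - scores @ loadings.T, axis=1)
    return scores, loadings, explained, error

# Synthetic stand-in for log-ratio data: one dominant factor plus noise
rng = np.random.default_rng(0)
factor = rng.normal(size=(100, 1))
Z = factor @ np.array([[1.0, 0.8, -0.6, 0.1]]) + 0.3 * rng.normal(size=(100, 4))

scores, loadings, explained, error = pca(Z, n_components=2)
```

Here `explained` feeds the scree plot, `loadings` the loadings plot, `scores` the score plots and maps, and `error` the per-sample dimension-reduction error.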

## 9. Other Supportive Methods Will Include:

### 9.1 **Extreme Value Analysis (EVA):**

- Identifies metals of regulatory concern that exceed safe limits.  

- **Protocol: predict magnitude of extreme events**
  - Calculate lognormal, Pareto and mean excess quantile plots
  - Use the quantile plots to diagnose the threshold
  - Estimate ξ using regression over the threshold
  - Run an uncertainty analysis of ξ using the bootstrap
  - Obtain final estimates of ξ and σ using a generalized Pareto distribution (GPD) fitting routine
  - Predict extreme events using the fitted GPD

- **Plots to make will include:**
  - Histogram of data with summary statistics
  - Lognormal, Pareto and mean excess quantile plots
  - Estimate of ξ as a function of the threshold
  - Histogram of bootstrap estimates of ξ
  - Histogram of data overlaid with the fitted GPD
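
A rough sketch of the threshold diagnostics and the bootstrap for the tail index ξ. As a simplification, the Hill estimator stands in here for the regression-based estimate named in the protocol, and the final GPD fit is not shown. The data are synthetic with a known ξ = 0.5, and all function names are illustrative.

```python
import numpy as np

def mean_excess(x, thresholds):
    """Mean of (x - u) over the observations exceeding each threshold u."""
    x = np.asarray(x, dtype=float)
    return np.array([(x[x > u] - u).mean() for u in thresholds])

def hill_xi(x, k):
    """Hill estimate of the tail index xi from the k largest observations."""
    xs = np.sort(np.asarray(x, dtype=float))
    return np.mean(np.log(xs[-k:] / xs[-k - 1]))

def bootstrap_xi(x, k, n_boot=200, seed=0):
    """Bootstrap distribution of the Hill estimate (resampling with replacement)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    return np.array([hill_xi(rng.choice(x, size=x.size, replace=True), k)
                     for _ in range(n_boot)])

# Synthetic heavy-tailed sample (assumption): classical Pareto with xi = 1/2
rng = np.random.default_rng(1)
x = rng.pareto(2.0, size=5000) + 1.0

me = mean_excess(x, thresholds=[1.5, 2.0, 3.0])  # roughly linear in u for a GPD tail
xi_hat = hill_xi(x, k=500)
xi_boot = bootstrap_xi(x, k=500)
ci_low, ci_high = np.percentile(xi_boot, [2.5, 97.5])
```

The histogram of `xi_boot` corresponds to the bootstrap plot listed above, and repeating `hill_xi` over a range of `k` shows how the estimate depends on the threshold.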

### 9.2 **Clustering:**

- Defines natural background soil populations in unlabelled data, for contextual comparison.

- **Protocol: Cluster Analysis**
  - Since we are working with compositional data, apply log-ratio transformations (preferably ilr)
  - Choose a distance that defines the difference between samples
  - Perform multidimensional scaling (MDS) and make the scree plot
  - Use the scree plot to assess the quality of the chosen distance
  - Plot the sample data in 2D and 3D using the scores provided by MDS
  - In the MDS score plots, color the points with a quantity of interest
  - Create the silhouette plot and decide on the number of clusters
  - Run k-means clustering and plot the cluster labels in the MDS score plot
  - Plot the cluster labels in the context of the data, e.g. the spatial location of the samples

- **Plots to make will include:**
  - Clr-biplot
  - MDS: scree plot & score plot
  - Score plots with dots colored by property of interest
  - Silhouette plot
  - MDS score plot with cluster labels
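
The clustering protocol above might look like the following sketch, assuming scikit-learn is available. Classical (Torgerson) MDS is written out with NumPy so that its eigenvalues can feed the scree plot. The data are synthetic stand-ins for ilr coordinates, on which Euclidean distance equals the Aitchison distance. `classical_mds` and `choose_k` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def classical_mds(D, n_components=2):
    """Classical MDS from a matrix of pairwise distances; returns scores and eigenvalues."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centred Gram matrix
    eigval, eigvec = np.linalg.eigh(B)
    order = np.argsort(eigval)[::-1]              # largest eigenvalues first
    eigval, eigvec = eigval[order], eigvec[:, order]
    scale = np.sqrt(np.clip(eigval[:n_components], 0.0, None))
    return eigvec[:, :n_components] * scale, eigval

def choose_k(X, k_values=range(2, 7), seed=0):
    """Pick the number of clusters with the highest mean silhouette width."""
    sil = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        sil[k] = silhouette_score(X, labels)
    return max(sil, key=sil.get), sil

# Synthetic data (assumption): two well-separated soil populations
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (60, 3)), rng.normal(4.0, 0.5, (60, 3))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

scores, eigval = classical_mds(D)
best_k, sil = choose_k(scores)
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(scores)
```

The mean silhouette width is used here as a compact summary; the full silhouette plot shows the per-sample widths behind it.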


### 9.3 **Spatial and Geostatistical Analysis**
- Variogram analysis and kriging of PCA scores.

- **Protocol:**
  - Make a histogram of the data and look at the variance
  - Calculate variograms for 4-6 directions
  - Determine the nugget effect
  - Determine the behavior at the origin (smoothness)
  - Plot the sill (equal to the data variance) on the variogram plot
  - Look for anisotropy: find the directions with the largest and smallest range
  - If a trend is present: remove the trend, then calculate the residual variogram
 

## 10. Expected Outcomes

1. **Objective identification of soil anomalies:**  
   - Using multivariate outlier detection, the study will flag soils with unusual elemental combinations, distinguishing them from the natural geochemical background.  

2. **Clear differentiation between natural enrichment and potential pollution:**  
   - PCA, CoDA, and spatial modeling will support interpretation, ensuring that naturally enriched soils are not misclassified as contaminated.  

3. **Quantified uncertainty in interpretations:**  
   - Bayesian inference and robust statistical methods will provide probabilistic estimates of whether anomalies are geogenic or anthropogenic.  

4. **Spatially coherent anomaly mapping:**  
   - Geostatistical modeling will allow visualization of contamination patterns versus geological variability, enhancing environmental monitoring and decision-making.  

5. **Identification of mineral prospectivity signals:**  
   - Multivariate analysis may highlight areas with exploration potential, linking environmental assessment with resource evaluation.  

6. **A transferable framework for other mineralized regions:**  
   - The integrated data science and exploration geology approach can be adapted for similar studies, promoting **evidence-based environmental management** and **sustainable resource use**.

7. **Support for environmental sustainability and social well-being:**  
   - By reliably distinguishing natural enrichment from pollution, communities, regulators, and mining companies can make informed decisions that protect human health, ecosystems, and livelihoods.

---

## 11. Conclusion

This study presents a **comprehensive, data-driven, and defensible approach** to investigate soil pollution allegations in the Kalumbila mineralized zone.  

By making **multivariate outlier detection** the core method and integrating **CoDA, PCA, CCA, KDE, spatial analysis, and Bayesian inference**, the study:  

- Provides **objective and reproducible results**, avoiding false classification of natural geochemical anomalies.  
- Quantifies **uncertainty** in anomaly interpretation, enabling **probabilistic statements** about the origin of anomalies.  
- Ensures that detected anomalies are **spatially and geologically meaningful**, supporting environmentally and socially responsible conclusions.  
- Bridges the gap between **environmental assessment and exploration geology**, highlighting both **potential contamination** and **mineral prospectivity signals**.  

Overall, this framework prioritizes **environmental sustainability, social well-being, and scientific rigor**, demonstrating the power and necessity of **modern data science methods** over traditional threshold-based approaches in complex, mineralized terrains.


# REFERENCES

- Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B (Methodological), 44(2), 139–160.

- Balestriero, R., Pesenti, J., & LeCun, Y. (2021). Learning in high dimension always amounts to extrapolation. arXiv preprint arXiv:2110.09485.

- Thompson, B. (1984). Canonical correlation analysis: Uses and interpretation. Sage, London.

- Wang, L., Kitanidis, P. K., & Caers, J. (2022). Hierarchical Bayesian inversion of global variables and large-scale spatial fields. Water Resources Research, 58, e2021WR031610.

- Carranza, E. J. M. (2009). Geochemical anomaly and mineral prospectivity mapping in GIS. Elsevier.

- Wang, L., Yin, D. Z., & Caers, J. (2023). Data science for the geosciences. Cambridge University Press.

- Zhang, C., Zuo, R., & Xiong, Y. (2021a). Detection of the multivariate geochemical anomalies associated with mineralization using a deep convolutional neural network and a pixel-pair feature method. Applied Geochemistry, 130, 104994.

- Zhang, C., Zuo, R., Xiong, Y., Zhao, X., & Zhao, K. (2022). A geologically-constrained deep learning algorithm for recognizing geochemical anomalies. Computers & Geosciences, 162, 105100.

- Wei, X., Yin, Z., Scheidt, C., Darnell, K., Wang, L., & Caers, J. (2024). Constructing priors for geophysical inversions constrained by surface and borehole geochemistry.
