# Notebook 1: Data Foundation & High-Fidelity Synthetic Generation

## Objective

A core challenge in mining analytics is the lack of public, high-quality data. Real-world geotechnical sensor data is proprietary and rarely shared.

To overcome this, we will create a **high-fidelity synthetic dataset** for this project. This approach gives us full control to:

1.  **Ensure Logical Soundness:** All features are relevant to rockfall prediction.
2.  **Engineer Statistical Realism:** "Driver" features like `rainfall` and `seismic_activity` are modeled on the statistical properties of real-world Kaggle datasets.
3.  **Build Purposeful Complexity:** We will create built-in correlations and logical rules, providing a rich, complex dataset for our DAV analysis.

```python
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

# Set a random seed for reproducibility
np.random.seed(42)
```

## Part A: "Driver" Data Analysis (Conceptual)

To keep our synthetic data realistic, we will not draw values from arbitrary distributions. Instead, we will model our "driver" features on the statistical distributions of real-world public data from Kaggle.

We identified two "driver" datasets:
1.  **Rainfall:** "Rainfall Dataset for Simple Time Series Analysis" (from Kaggle)
2.  **Seismic:** "All the Earthquakes Dataset : from 1990-2023" (from Kaggle)

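We will analyze their statistical properties and build our data generator to match. For example, rainfall is "zero-inflated" (many observations are exactly zero, with occasional positive amounts), while seismic data is "long-tailed" (most events are small, with rare large ones).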
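As a rough illustration of what "zero-inflated" and "long-tail" could look like in a generator, here is a minimal sketch. The distribution choices (a Bernoulli dry/wet switch combined with a gamma draw for rainfall, and a lognormal draw for seismic activity) and every parameter value are placeholder assumptions, not values fitted to the Kaggle datasets. The helper `summarize` is also hypothetical.

```python
import numpy as np

# Local generator so the sketch does not disturb the notebook's global seed.
rng = np.random.default_rng(42)
n_days = 1_000

# --- Assumed, illustrative parameters (NOT fitted to the Kaggle data) ---
p_dry = 0.7                           # probability of a day with zero rainfall
gamma_shape, gamma_scale = 0.8, 12.0  # shape/scale of rainfall on wet days
log_mean, log_sigma = 0.0, 0.75       # lognormal parameters for seismic activity

# Zero-inflated rainfall: exactly 0 on dry days, gamma-distributed on wet days.
is_wet = rng.random(n_days) >= p_dry
rainfall = np.where(is_wet, rng.gamma(gamma_shape, gamma_scale, n_days), 0.0)

# Long-tailed seismic activity: mostly small values, occasional large ones.
seismic_activity = rng.lognormal(mean=log_mean, sigma=log_sigma, size=n_days)


def summarize(values):
    """Return simple statistics that expose zero inflation and tail weight."""
    return {
        "zero_fraction": float(np.mean(values == 0)),
        "median": float(np.median(values)),
        "mean": float(np.mean(values)),
        "p99": float(np.quantile(values, 0.99)),
    }


rain_stats = summarize(rainfall)
seismic_stats = summarize(seismic_activity)
print("rainfall:", rain_stats)
print("seismic_activity:", seismic_stats)
```

In this sketch, a `zero_fraction` well above zero signals zero inflation, and a mean above the median together with a high 99th percentile signals a long right tail. The same summary could be computed on the real driver datasets and compared against the synthetic output when choosing parameters.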