# Notebook 1: Data Sourcing & High-Fidelity Generation

## 1. Objective

The foundation of any analytics project is its data. A significant challenge in mining analytics is the lack of public, high-quality data, because real-world geotechnical sensor data is usually proprietary.

To work around this, the notebook builds a **high-fidelity synthetic dataset** in three steps:

1.  **Sourcing:** We will load and analyze two real-world Kaggle datasets (one for rainfall, one for seismic activity) to understand their statistical properties.
2.  **Generation:** We will use these real-world properties as our "recipe" to build a new, clean dataset of 20,000 samples.
3.  **Logic:** We will engineer purposeful, complex relationships between the features to create a rich dataset for analysis.

This notebook's final output is the `rockfall_synthetic_data.csv` file, which the rest of the project builds on.
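
As a rough illustration of steps 2 and 3, the sketch below draws 20,000 samples from a small "recipe" of distribution parameters and then links the columns through one engineered relationship. Everything named here is an assumption made for illustration: the function `generate_synthetic`, the `PLACEHOLDER_RECIPE` values, the column names, the choice of gamma and normal distributions, and the weights in the risk score. None of it comes from the Kaggle datasets or from the project's actual generation code.

```python
import numpy as np
import pandas as pd

# Placeholder parameters, NOT measured from the real driver datasets.
PLACEHOLDER_RECIPE = {
    "rain_shape": 2.0,  # gamma shape for rainfall (assumed)
    "rain_scale": 5.0,  # gamma scale for rainfall (assumed)
    "mag_mean": 3.0,    # mean seismic magnitude (assumed)
    "mag_std": 1.0,     # spread of seismic magnitude (assumed)
}


def generate_synthetic(recipe, n_samples=20_000, seed=42):
    """Draw driver features from the recipe and derive one dependent column."""
    rng = np.random.default_rng(seed)

    # Step 2 (generation): sample each driver from its assumed distribution.
    rainfall = rng.gamma(recipe["rain_shape"], recipe["rain_scale"], size=n_samples)
    magnitude = rng.normal(recipe["mag_mean"], recipe["mag_std"], size=n_samples)
    magnitude = np.clip(magnitude, 0, None)  # magnitudes below zero are cut off at 0

    # Step 3 (logic): a hypothetical score that rises with both drivers, plus noise.
    noise = rng.normal(0.0, 0.05, size=n_samples)
    risk_score = (
        0.6 * rainfall / rainfall.max()
        + 0.4 * magnitude / magnitude.max()
        + noise
    )

    return pd.DataFrame(
        {
            "rainfall_mm": rainfall,
            "seismic_magnitude": magnitude,
            "risk_score": risk_score,
        }
    )


synthetic_df = generate_synthetic(PLACEHOLDER_RECIPE)
print(synthetic_df.shape)
```

Passing the seed into a local generator keeps the function reproducible on its own, independent of any global random state.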

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import os

# Set a random seed for reproducibility
np.random.seed(42)
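
To make the effect of seeding concrete, the minimal sketch below shows that two generators created with the same seed produce identical draws. It uses local generator objects (all variable names are illustrative) so that it does not disturb the notebook's global random state. `np.random.default_rng` is the interface NumPy recommends for new code, while `np.random.seed` configures the legacy global generator.

```python
import numpy as np

# Legacy interface: RandomState uses the same algorithm that
# np.random.seed(42) configures for the global generator.
legacy_a = np.random.RandomState(42)
legacy_b = np.random.RandomState(42)
legacy_match = np.array_equal(legacy_a.rand(5), legacy_b.rand(5))

# Newer Generator interface: same seed, same sequence.
modern_a = np.random.default_rng(42)
modern_b = np.random.default_rng(42)
modern_match = np.array_equal(modern_a.random(5), modern_b.random(5))

# A different seed gives a different sequence.
other = np.random.default_rng(7)
differs = not np.array_equal(np.random.default_rng(42).random(5), other.random(5))

print(legacy_match, modern_match, differs)
```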

## 2. Part A: Sourcing & Analyzing "Driver" Datasets

To build a realistic synthetic dataset, we must first understand the properties of real-world "driver" factors. We will load and analyze two datasets from Kaggle to create our "recipe".

**Datasets Used:**
1.  **Rainfall:** "Rainfall Dataset for Simple Time Series Analysis"
2.  **Seismic:** "All the Earthquakes Dataset : from 1990-2023"

We will analyze their statistical distributions so that our synthetic data is not merely random but statistically representative of real-world patterns.
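
One way to capture such a "recipe" is to reduce each driver column to a handful of summary statistics. The helper below is a minimal sketch under assumptions: `summarize_driver` is a hypothetical name, the chosen statistics are only one reasonable selection, and the tiny list in the example is made-up stand-in data rather than values from either Kaggle dataset.

```python
import pandas as pd


def summarize_driver(values):
    """Return basic distribution statistics for one numeric driver column."""
    s = pd.Series(values, dtype="float64").dropna()
    return {
        "count": int(s.size),
        "mean": float(s.mean()),
        "std": float(s.std()),
        "median": float(s.median()),
        "p95": float(s.quantile(0.95)),        # upper tail of the distribution
        "zero_share": float((s == 0).mean()),  # fraction of exact zeros (e.g. dry days)
        "skew": float(s.skew()),               # asymmetry of the distribution
    }


# Stand-in values for illustration only.
toy_values = [0.0, 0.0, 1.0, 2.0, 10.0]
toy_summary = summarize_driver(toy_values)
print(toy_summary)
```

Statistics such as the share of zeros and the skew matter here because a column like rainfall is typically far from symmetric, so a mean and standard deviation alone would describe it poorly.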

In [6]:
import kaggle
import zipfile
import time

# ---
# ### 2.1. Set Up File Paths & Download Data
# ---

# This is the path to our main project directory (one level up from 'notebooks')
BASE_DIR = '..'

# --- Define Data Directories ---
DATA_DIR = os.path.join(BASE_DIR, 'data')
os.makedirs(DATA_DIR, exist_ok=True) # Create 'data' folder if it doesn't exist
print(f"Data directory set to: {DATA_DIR}")


# --- Define Kaggle Dataset "Slugs" (from their URL) ---
RAINFALL_SLUG = 'sujithmandala/rainfall-dataset-for-simple-time-series-analysis'
SEISMIC_SLUG = 'alessandrolobello/the-ultimate-earthquake-dataset-from-1990-2023'

# --- Define our Standard File Names ---
RAINFALL_DRIVER_FILE = os.path.join(DATA_DIR, 'rainfall.csv')
SEISMIC_DRIVER_FILE = os.path.join(DATA_DIR, 'earthquake_data.csv')
SYNTHETIC_DATA_FILE = os.path.join(DATA_DIR, 'rockfall_synthetic_data.csv')

# --- Define the *actual* downloaded name for the seismic file ---
# This name (spelled exactly as shown) was found in the output of an earlier debug cell.
DOWNLOADED_SEISMIC_NAME = os.path.join(DATA_DIR, 'Eartquakes-1990-2023.csv')


# --- Download Logic ---
# This code will check if the file already exists. If not, it will download it.

# 1. Download Rainfall Data
if not os.path.exists(RAINFALL_DRIVER_FILE):
    print(f"Downloading {RAINFALL_SLUG}...")
    try:
        kaggle.api.dataset_download_files(RAINFALL_SLUG, path=DATA_DIR, unzip=True)
        print(f"Rainfall data downloaded and unzipped to {DATA_DIR}")
        # Confirm the archive actually contained the file name we expect
        if not os.path.exists(RAINFALL_DRIVER_FILE):
            print(f"Warning: '{os.path.basename(RAINFALL_DRIVER_FILE)}' was not found after download. Check the extracted file name.")
    except Exception as e:
        print(f"Error downloading rainfall data: {e}")
        print("Please check your Kaggle API setup (kaggle.json).")
else:
    print(f"Rainfall data ('{os.path.basename(RAINFALL_DRIVER_FILE)}') already exists. Skipping download.")

# 2. Download and Rename Seismic Data
if not os.path.exists(SEISMIC_DRIVER_FILE):
    print("Checking for seismic data...")
    
    # Check if the *original downloaded file* exists (e.g., Eartquakes-1990-2023.csv)
    if os.path.exists(DOWNLOADED_SEISMIC_NAME):
        print(f"Found '{os.path.basename(DOWNLOADED_SEISMIC_NAME)}'. Renaming...")
        os.rename(DOWNLOADED_SEISMIC_NAME, SEISMIC_DRIVER_FILE)
        print(f"Successfully renamed to '{os.path.basename(SEISMIC_DRIVER_FILE)}'.")
    
    # If neither file exists, then we need to download it
    else:
        print(f"Downloading {SEISMIC_SLUG}...")
        try:
            kaggle.api.dataset_download_files(SEISMIC_SLUG, path=DATA_DIR, unzip=True)
            print("Seismic data downloaded and unzipped.")
            
            # Pause briefly as a precaution before looking for the extracted file
            time.sleep(2)
            
            # Now, do the rename
            if os.path.exists(DOWNLOADED_SEISMIC_NAME):
                os.rename(DOWNLOADED_SEISMIC_NAME, SEISMIC_DRIVER_FILE)
                print(f"Successfully renamed '{os.path.basename(DOWNLOADED_SEISMIC_NAME)}' to '{os.path.basename(SEISMIC_DRIVER_FILE)}'.")
            else:
                print(f"Warning: Downloaded seismic data, but the file '{os.path.basename(DOWNLOADED_SEISMIC_NAME)}' was not found for renaming.")
        
        except Exception as e:
            print(f"Error downloading seismic data: {e}")
            print("Please check your Kaggle API setup (kaggle.json).")
else:
    print(f"Seismic data ('{os.path.basename(SEISMIC_DRIVER_FILE)}') already exists. Skipping download.")
    
print("\n--- Data Sourcing and Setup Complete ---")
print("We are ready to analyze:")
print(f"1. {os.path.basename(RAINFALL_DRIVER_FILE)}")
print(f"2. {os.path.basename(SEISMIC_DRIVER_FILE)}")

Data directory set to: ..\data
Rainfall data ('rainfall.csv') already exists. Skipping download.
Seismic data ('earthquake_data.csv') already exists. Skipping download.

--- Data Sourcing and Setup Complete ---
We are ready to analyze:
1. rainfall.csv
2. earthquake_data.csv
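
The two download branches above repeat the same pattern: skip the work if the standardized file exists, otherwise download, then rename. As a possible refactoring, that pattern can be folded into one helper. This is a sketch, not the notebook's code: `ensure_dataset` and its parameters are hypothetical names, and the downloader is passed in as a callable so the logic can be exercised without Kaggle credentials.

```python
import os


def ensure_dataset(target_path, download, downloaded_name=None):
    """Make sure target_path exists, downloading and renaming only if needed.

    download        -- zero-argument callable that fetches and unzips the data
    downloaded_name -- path of the file as extracted, if it differs from target_path
    """
    if os.path.exists(target_path):
        return target_path  # already standardized, nothing to do

    source = downloaded_name or target_path
    if not os.path.exists(source):
        download()

    if not os.path.exists(source):
        raise FileNotFoundError(
            f"Expected '{source}' after download, but it was not found."
        )

    if source != target_path:
        os.replace(source, target_path)  # rename to the standard file name
    return target_path


# Example wiring with the names defined in this notebook (not executed here):
# ensure_dataset(
#     SEISMIC_DRIVER_FILE,
#     lambda: kaggle.api.dataset_download_files(SEISMIC_SLUG, path=DATA_DIR, unzip=True),
#     downloaded_name=DOWNLOADED_SEISMIC_NAME,
# )
```

Unlike the cell above, this version raises an exception when the expected file is missing instead of printing a warning, so a failed download cannot go unnoticed by later cells. Whether that stricter behavior is preferable is a design choice.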
