# Notebook 2: Exploratory Data Analysis (EDA)

## 1. Objective

This notebook is the core of our **Data Analytics & Visualization (DAV)** work.

In Notebook 1, we successfully sourced, generated, and saved a high-quality synthetic dataset: `rockfall_synthetic_data.csv`.

The purpose of *this* notebook is to load that data and perform a deep Exploratory Data Analysis (EDA). We will "interrogate" the data visually, using a wide variety of plots to:

1.  **Analyze Distributions:** Understand the shape, center, and spread of each individual feature.
2.  **Identify Correlations:** Visually confirm the relationships we engineered (and discover new ones).
3.  **Find Outliers:** Identify any extreme or unusual data points.
4.  **Understand Feature-Target Relationships:** Visually analyze how each feature (e.g., `displacement_mm`) relates to the final `risk_level`.

The findings from this analysis will guide feature selection and modeling choices in Notebook 3.
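Each of the four goals above also has a simple numeric counterpart that can be checked before any plotting. The sketch below is a minimal illustration, not the analysis performed later in this notebook. It builds a small made-up frame that borrows three column names from the dataset. The helper names `make_demo_frame` and `iqr_outlier_mask`, the generating formula, and the risk thresholds are all assumptions chosen only for demonstration.

```python
import numpy as np
import pandas as pd


def make_demo_frame(n=200, seed=0):
    """Build a tiny illustrative frame (hypothetical values, not the real dataset)."""
    rng = np.random.default_rng(seed)
    rainfall = rng.gamma(2.0, 5.0, n)
    # Assumed relationship, for illustration only: displacement grows with rainfall.
    displacement = 5.0 + 0.4 * rainfall + rng.normal(0.0, 1.0, n)
    # Arbitrary thresholds; the real labeling rule lives in Notebook 1.
    risk = pd.cut(
        displacement,
        bins=[-np.inf, 9.0, 13.0, np.inf],
        labels=["Low", "Medium", "High"],
    )
    return pd.DataFrame(
        {
            "rainfall_mm_past_24h": rainfall,
            "displacement_mm": displacement,
            "risk_level": risk,
        }
    )


def iqr_outlier_mask(series, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


demo = make_demo_frame()
numeric = demo.select_dtypes("number")

summary = numeric.describe()                 # 1. distributions: center and spread
corr = numeric.corr()                        # 2. pairwise Pearson correlations
outliers = numeric.apply(iqr_outlier_mask)   # 3. per-column outlier flags
by_risk = demo.groupby("risk_level", observed=True)["displacement_mm"].median()  # 4. feature vs. target

print(summary.round(2))
print(corr.round(2))
print(outliers.sum())
print(by_risk.round(2))
```

The same four calls could be applied to the real `df` once it is loaded. The plots in this notebook then serve to confirm or challenge what these summary numbers suggest.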

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import os

# --- 1. Define File Paths ---
BASE_DIR = '..'
DATA_DIR = os.path.join(BASE_DIR, 'data')
DATA_FILE = os.path.join(DATA_DIR, 'rockfall_synthetic_data.csv')

# --- 2. Load the Data ---
df = None  # Stays None if loading fails, so the inspection step below is skipped
try:
    df = pd.read_csv(DATA_FILE)
    print(f"Successfully loaded '{os.path.basename(DATA_FILE)}'.")
    print(f"Dataset has {df.shape[0]} rows and {df.shape[1]} columns.")
except FileNotFoundError:
    print("--- ERROR ---")
    print(f"The file '{os.path.basename(DATA_FILE)}' was not found at: {DATA_FILE}")
    print("Please make sure Notebook 1 was run successfully and the file was saved.")
except Exception as e:
    print(f"An error occurred while reading the file: {e}")

# --- 3. Initial Data Inspection ---
if df is not None:
    print("\n--- Data Head (First 5 Rows) ---")
    print(df.head())

    print("\n--- Data Info (Column Types and Nulls) ---")
    # info() prints its report directly and returns None, so it is not wrapped in print()
    df.info()

Successfully loaded 'rockfall_synthetic_data.csv'.
Dataset has 20000 rows and 6 columns.

--- Data Head (First 5 Rows) ---
   rainfall_mm_past_24h  seismic_activity  joint_water_pressure_kPa  \
0             16.351962          1.042005                 41.542877   
1              4.133069          1.410756                 34.360071   
2              7.764729          1.554489                 38.339998   
3             14.857486          1.517141                 50.282894   
4              0.000000          0.941456                 31.325325   

   vibration_level  displacement_mm risk_level  
0         0.206142         6.991127        Low  
1         0.349532         8.686621        Low  
2         0.384975        10.007807        Low  
3         0.351243        13.029807     Medium  
4         0.266100         8.382537        Low  

--- Data Info (Column Types and Nulls) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 6 columns):
 #  