

---


# **Data Science and Machine Learning for Environmental Engineering**


---
# **Module 8**
---

# Before we start

![Saving to Google Drive in Google Colab](https://www.tutorialspoint.com/google_colab/images/saving_google_drive.jpg)

---
<!-- Slide 4: Role of Random Seed -->
# What Is a Random Seed?

- **Definition:** A random seed is the starting value of a pseudo-random number generator. The same seed always produces the same sequence of "random" numbers.
- **Effect:** Changing the random seed changes how the data is shuffled and therefore which samples land in the training and test sets.
- **Reproducibility:** Fixing the random seed gives the same split, and hence the same results, on every run of the same code and data.

**Example:**  
Random seed = 42 → One specific split  
Random seed = 10 → A different split  

---

<!-- Slide 5: Visualizing the Impact of Random Seed -->
# How Does the Split Change?
- **Observation:** The same dataset produces different training and test sets depending on the random seed.

---

In [12]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from ipywidgets import interact, IntSlider

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

def plot_dataset(X, y, axes, marker=('bo', 'g^'), label=None):
    """Plot class 0 and class 1 with separate markers; `label` prefixes the legend entries."""
    for cls, fmt in enumerate(marker):
        # Give each class its own legend entry instead of repeating one label twice.
        name = f'{label} (class {cls})' if label else f'class {cls}'
        plt.plot(X[y == cls, 0], X[y == cls, 1], fmt, label=name)
    plt.axis(axes)
    plt.grid(True, which='both')

def visualize_split(r):
    """Split the data 80/20 with seed `r` and plot the training and test points."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=r)
    plot_dataset(X_train, y_train, [-1.5, 2.5, -1, 1.5], label='Train')
    plot_dataset(X_test, y_test, [-1.5, 2.5, -1, 1.5], marker=('ro', 'r^'), label='Test')
    plt.title('Seed = ' + str(r))
    plt.legend(bbox_to_anchor=(1.1, 0.8))
    # Call tight_layout last so the title and legend are included in the layout.
    plt.tight_layout()
    plt.show()

interact(visualize_split, r=IntSlider(min=42, max=100, step=1, value=42, description='Random Seed'));


# Why Does the Random Seed Matter?

- **Model Performance:** Different splits can lead to different values for evaluation metrics (e.g., accuracy, RMSE).
- **Bias and Variance:** Some splits over-represent certain patterns in the data, so a score from a single split can be too optimistic or too pessimistic.
- **Consistency:** Fixing the random seed makes research results and experiments reproducible.
- **Environmental Engineering Context:**
  - When predicting pollutant levels, small changes in the data split can change how reliable a model appears to be.
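
To make the first point concrete, here is a minimal sketch that repeats the split for several seeds and records the test accuracy each time. It reuses the synthetic `make_moons` data from the earlier cell; the choice of `LogisticRegression` and the seed range 42 to 51 are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same synthetic dataset as in the visualization cell.
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

seeds = list(range(42, 52))  # illustrative range of seeds (assumption)
accuracies = []
for seed in seeds:
    # Only the split changes between iterations; the data and model settings do not.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression().fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))

accuracies = np.array(accuracies)
for seed, acc in zip(seeds, accuracies):
    print(f"seed {seed}: test accuracy = {acc:.2f}")
print(f"mean = {accuracies.mean():.3f}, std = {accuracies.std():.3f}")
```

Reporting the mean and spread over several seeds, rather than the score from one split, gives a fairer picture of how the model is likely to perform on new data.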

