![Image Description](./images/NormalityAndInferencing.png)


# Environment Setup
---

- Start by creating a Virtual Environment for your project

Before running any code in this notebook, it's important to set up a clean Python environment to manage dependencies. We recommend using a **VENV-type virtual environment** in **Visual Studio Code (VSC)**. Follow these steps:

### ✅ Steps to Create a Virtual Environment in VSC

1. **Open your project folder** in Visual Studio Code.

2. **Open the terminal**:
   - Go to `View` > `Terminal` or press `` Ctrl + ` `` (backtick).

3. **Create the virtual environment** by running:
   ```bash
   python -m venv venv
   ```

4. Once the environment is created, you need to **activate it** so that all Python packages you install are scoped to this project only.
    On Windows:
    ```bash
    .\venv\Scripts\activate
    ```

    On macOS/Linux:
    ```bash
    source venv/bin/activate
    ```

5. ⚠️ Why Activation Matters
Activating the virtual environment ensures that:

- All package installations using pip are local to your project.
- You avoid modifying the global Python environment, which could affect other projects or system tools.
- Your project remains portable and reproducible, especially when sharing with others or deploying.
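
A quick way to confirm that activation worked is to ask Python itself. The sketch below assumes a standard `venv`-style environment; the helper name `in_virtual_env` is hypothetical and only for illustration.

```python
import sys

def in_virtual_env() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment, while sys.base_prefix
    # points at the Python installation the environment was created from.
    return sys.prefix != sys.base_prefix

print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_env())
```

If the second line prints `False`, the environment is probably not activated in the current terminal or notebook kernel.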

### 📦 Installing Required Libraries
Once activated, install the required libraries using:

In [None]:
pip install -r requirements.txt

# 🎯 Accuracy vs. Precision

---

![Image Description](./images/AccuracyVsPrecision.png)

<div style="display: flex;">
    <div style="flex: 1; padding: 10px;">
        <h2>Accuracy</h2>
        <p><b>Definition:</b> Accuracy refers to how close a measurement or set of measurements is to the true or accepted value of the quantity being measured. It is about correctness.</p>
        <p><b>Target Analogy:</b> In the context of a dartboard, accuracy is how close your darts land to the bullseye.</p>
        <p><b>Statistical Measure:</b> Accuracy is often quantified by the <b>bias</b> of a measurement. A low bias indicates high accuracy. It can also be described by the mean error.</p>
    </div>
    <div style="flex: 1; padding: 10px;">
        <h2>Precision</h2>
        <p><b>Definition:</b> Precision refers to how close repeated measurements are to each other, regardless of whether they are close to the true value. It is about consistency and reproducibility.</p>
        <p><b>Target Analogy:</b> On a dartboard, precision is how tightly clustered your darts are, even if they are far from the bullseye.</p>
        <p><b>Statistical Measure:</b> Precision is typically quantified by the <b>spread</b> or <b>variability</b> of the measurements. Common statistical measures of precision include:</p>
        <ul>
            <li><b>Standard Deviation:</b> A measure of the average amount of variation or dispersion of a set of values. A smaller standard deviation indicates higher precision.</li>
            <li><b>Variance:</b> The square of the standard deviation.</li>
            <li><b>Range:</b> The difference between the highest and lowest values in a set of measurements.</li>
        </ul>
    </div>
</div>
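
The two ideas can be put into numbers with a small sketch. The measurements and the true value below are made up for illustration: instrument A is centred on the truth but scattered (accurate, not precise), while instrument B is tightly clustered but off target (precise, not accurate).

```python
from statistics import mean, stdev

TRUE_VALUE = 10.0  # assumed true length in cm (illustrative)

# Hypothetical repeated measurements from two instruments
instrument_a = [10.4, 9.5, 10.6, 9.4, 10.1]    # centred on the truth, scattered
instrument_b = [10.9, 11.0, 11.1, 11.0, 11.0]  # tightly clustered, off target

def bias(measurements, true_value):
    """Mean error: average measurement minus the true value (accuracy)."""
    return mean(measurements) - true_value

for name, data in [("A", instrument_a), ("B", instrument_b)]:
    print(f"Instrument {name}: bias = {bias(data, TRUE_VALUE):+.2f}, "
          f"std dev = {stdev(data):.2f}")
```

A bias near zero indicates accuracy; a small standard deviation indicates precision. The two are independent, which is why both numbers are reported.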

# 📊 Normalization and Standardization


These two terms are fundamental to the Data Exploration phase of the ML Cycle, specifically when it comes to cleaning data.

![Image Description](./images/MachineLearningOperationsLifeCycle.png)

### Key concepts

<table>
  <tr>
    <td style="vertical-align: top; padding-right: 20px;">
        <img src="./images/NormalizationDefined.png">
    </td>
    <td style="vertical-align: top; padding-right: 20px;">
        <img src="./images/StandardizationDefined.png">
    </td>
  </tr>
</table>

### Principles
![Image Description](./images/NormalizationVsStandarization-1.png)

### Examples

<table>
  <tr>
    <th>Normalization</th>
    <th>Standardization</th>
  </tr>
  <tr>
    <td style="vertical-align: top; padding-right: 20px;">
        <img src="./images/NormalizationExample-1.png">
    </td>
    <td style="vertical-align: top; padding-right: 20px;">
        <img src="./images/StandardizationExample-1.png">
    </td>
  </tr>
  <tr>
    <td style="vertical-align: top; padding-right: 20px;">
        <img src="./images/NormalizationExample-2.png">
    </td>
    <td style="vertical-align: top; padding-right: 20px;">
        <img src="./images/StandardizationExample-2.png">
    </td>
  </tr>
</table>
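
As a complement to the images, here is a minimal sketch of both transformations on a small list. The `ages` values and the helper names are hypothetical, and the code assumes the values are not all identical (otherwise the denominators would be zero).

```python
from statistics import mean, pstdev

ages = [18, 25, 32, 47, 60]  # hypothetical feature values

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Convert values to z-scores (mean 0, standard deviation 1)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

normalized = min_max_normalize(ages)
standardized = standardize(ages)
print("Normalized:  ", [round(v, 3) for v in normalized])
print("Standardized:", [round(v, 3) for v in standardized])
```

Normalization pins the smallest value to 0 and the largest to 1, while standardization centres the data on 0 and expresses each value in standard deviations.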


# 🔔 Normality and Probability Distributions


Experiment with the **Discrete and Continuous** simulator. Choose the **Continuous** class and then **Normal** from the list of distributions. 

<table>
  <tr>
    <td style="vertical-align: top; padding-right: 20px;">
      Go to Probability Distributions > Discrete and Continuous. <br>
      Then select the <b>Continuous</b> radio button. <br>
      Then, from the pull-down menu, choose <b>Normal</b>.
    </td>
  </tr>
  <tr>
    <td style="vertical-align: top; padding-right: 20px;">
      <iframe
        src="https://seeing-theory.brown.edu/probability-distributions/index.html?section2"
        width="800"
        height="400">
      </iframe>
    </td>
    <td style="width: 200px; height: 400px">
      <p>
        When the mean (μ) is greater than 0 and the variance (σ²) is close to zero, all the data points are tightly clustered around the mean.<br><br>
        When the mean (μ) is greater than 0 and the variance (σ²) is very large, the data points are spread far apart.
      </p>
    </td>
  </tr>
</table>
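
The effect of the variance can also be checked numerically. This sketch uses the standard library's `statistics.NormalDist` (chosen here so it runs without extra installs) and an arbitrary mean of 1.0; it reports how much probability lies within 0.5 of the mean for a few values of σ.

```python
from statistics import NormalDist

mu = 1.0  # arbitrary positive mean, for illustration
p_close_by_sigma = {}

for sigma in (0.1, 1.0, 5.0):
    dist = NormalDist(mu=mu, sigma=sigma)
    # Probability that a value lands within 0.5 of the mean
    p_close = dist.cdf(mu + 0.5) - dist.cdf(mu - 0.5)
    p_close_by_sigma[sigma] = p_close
    print(f"sigma = {sigma:>4}: variance = {sigma**2:>5.2f}, "
          f"P(|X - mu| < 0.5) = {p_close:.4f}")
```

The smaller the variance, the more of the probability sits close to the mean, which matches what the simulator shows as you drag the σ² slider.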


## 📊 Z-distribution

The standard normal distribution has **mean μ = 0** and **standard deviation σ = 1**.   
It is used to standardize data and calculate probabilities.  
Z-scores show how many standard deviations a value is from the mean.


<table>
  <tr>
    <td style="vertical-align: top; padding-right: 20px;">
      <strong>Steps to Use the Simulator</strong><br><br>
      1. Open the simulator below.<br>
      2. Adjust the Z₁ and Z₂ sliders to set the bounds.<br>
      3. Observe how the shaded area changes.<br>
      4. Use the area value to interpret probability.<br>
      5. Try different values to explore the distribution.
    </td>
  </tr>
  <tr>
    <td style="vertical-align: top; padding-right: 20px;">
      <iframe
        src="https://www.geogebra.org/m/zeF3hkXf"
        width="800"
        height="1000">
      </iframe>
    </td>
    <td  style="vertical-align: top; padding-right: 20px; width: 100; height: 800;">
      <p>
        Z₁ and Z₂ define the lower and upper bounds on the standard normal curve.  
        The shaded area between them represents the probability of a value falling within that range.  
        A larger area means higher likelihood; a smaller area means lower probability.
      </p>
      <p>
        The Z-score measures how many standard deviations a data point is from the mean of a dataset. It helps identify outliers and compare values across different distributions. A Z-score of 0 means the value is exactly at the mean, while positive or negative scores indicate how far above or below the mean the value lies.
      </p>
      <p>

**Formula:**
$$
Z = \frac{X - \mu}{\sigma}
$$

Where:
- Z is the z-score,
- X is the value of the element,
- μ is the population mean, and
- σ is the standard deviation.

**Example:**  
In a class where the average score is 70 and the standard deviation is 10,  
a student scoring 85 has:

$$
Z = \frac{85 - 70}{10} = 1.5
$$
This means the score is 1.5 standard deviations above the mean.
      </p>
    </td>
  </tr>
</table>
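
The shaded area in the simulator is the difference of two cumulative probabilities, $P(Z_1 < Z < Z_2) = \Phi(Z_2) - \Phi(Z_1)$. A minimal sketch, using the standard library's `statistics.NormalDist` and a hypothetical helper named `area_between`:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def area_between(z1: float, z2: float) -> float:
    """Return P(z1 < Z < z2) for the standard normal distribution."""
    lower, upper = sorted((z1, z2))
    return standard_normal.cdf(upper) - standard_normal.cdf(lower)

for z1, z2 in [(-1, 1), (-2, 2), (-3, 3)]:
    print(f"P({z1} < Z < {z2}) = {area_between(z1, z2):.4f}")
```

The three intervals reproduce the familiar 68-95-99.7 rule, and the same values should appear in the simulator when the sliders are set to those bounds.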


### 📐 Z-Scores: Proportions & Reverse Lookups (Example: Student Score = 85)


When we have a normal distribution (not standardized), say with mean $\mu$ and standard deviation $\sigma$, we often want to answer questions like:

> *“What proportion (or percentage) of students score less than $X = 85$?”*

This is where **standardization** with Z-scores comes in.


#### 1. Standardization: Transforming to the Standard Normal

The standard normal distribution has:

- mean $0$ (denoted $\mu = 0$),
- standard deviation $1$ (denoted $\sigma = 1$).

We convert any normal variable $X$ with mean $\mu$ and standard deviation $\sigma$ into a **Z-score** via:

$$
Z = \frac{X - \mu}{\sigma}
$$

This transformation shifts and scales the variable so we can use standard normal tables or functions.


#### 2. Example: Proportion Less Than a Value

Suppose:

- $X$ = test scores,
- $\mu = 70$,
- $\sigma = 10$,
- We want to find the proportion of students scoring **less than** $X = 85$.

**Step A.** Compute the Z-score for $X = 85$:

$$
Z = \frac{85 - 70}{10} = \frac{15}{10} = 1.5
$$

**Step B.** Ask: *What is $P(X < 85)$?*  
Because $X < 85 \;\;\Longleftrightarrow\;\; Z < 1.5$, this is the same as $P(Z < 1.5)$.

We can look up $P(Z < 1.5)$ in a standard normal table or use a calculator / programming language (e.g. `scipy.stats.norm.cdf(1.5)`).

That value is about:

$$
P(Z < 1.5) \approx 0.9332
$$

So about **93.32%** of students are expected to score less than 85.
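
The lookup above can be reproduced in a few lines. This sketch uses the standard library's `statistics.NormalDist` instead of `scipy` so it runs without extra installs; it computes the probability both ways to show that standardizing first gives the same answer as working on the original scale.

```python
from statistics import NormalDist

mu, sigma = 70, 10
x = 85

z = (x - mu) / sigma
p_via_z = NormalDist().cdf(z)                     # standard normal, after standardizing
p_direct = NormalDist(mu=mu, sigma=sigma).cdf(x)  # original scale, no standardizing

print(f"z = {z}")
print(f"P(Z < {z}) = {p_via_z:.4f}")
print(f"P(X < {x}) = {p_direct:.4f}")
```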


#### 3. Reverse Lookup: From Proportion to Value

Sometimes we want the opposite: *“What test score corresponds to the 95th percentile?”* In other words:

> Find $x$ such that $P(X < x) = 0.95$.

**Step A.** In standard normal world, find $z$ such that $P(Z < z) = 0.95$.  
From tables or functions, $z \approx 1.645$ (or 1.64 depending on precision).

**Step B.** Turn that back into the original $X$ scale by solving:

$$
z = \frac{x - \mu}{\sigma} 
\;\;\Longrightarrow\;\; 
x = \mu + z \sigma
$$

With $\mu = 70, \sigma = 10$:

$$
x = 70 + (1.645)(10) = 70 + 16.45 = 86.45
$$

So the 95th percentile score is about **86.45**.
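
The reverse lookup uses the inverse of the cumulative distribution function (the quantile function). A minimal sketch with the standard library's `statistics.NormalDist`; with `scipy`, the equivalent call would be `norm.ppf(0.95)`.

```python
from statistics import NormalDist

mu, sigma = 70, 10
p = 0.95

z = NormalDist().inv_cdf(p)  # z such that P(Z < z) = p
x = mu + z * sigma           # back to the test-score scale

print(f"z at p = {p}: {z:.3f}")
print(f"score at that percentile: {x:.2f}")
```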


#### 4. Why This Matters

- Any normal distribution, no matter its mean or standard deviation, can be converted into the standard normal via standardization.  
- This lets us use one table or one set of tools (cdf, quantile) instead of having a separate one for every combination of mean and standard deviation.


### 🎥 Explore Standardization Further

Want to deepen your understanding of **standardization** and how Z-scores are used to calculate probabilities?

Check out this short video lesson:

👉 [Watch on YouTube: Standardizing Normal Distributions](https://www.youtube.com/watch?v=2tuBREK_mgE)

The video walks you through:

- Why we standardize normal distributions,  
- How any normal distribution can be converted to the **standard normal** ($\mu = 0$, $\sigma = 1$),  
- Using Z-scores to find proportions (areas under the curve),  
- Doing reverse lookups (from a proportion to a Z-score).  

Take notes as you watch, and think about how the examples connect to the practice problems in this notebook!

### 🎥 Explore Normalization Further

Want to deepen your understanding of **normalization** and how to **normalize raw data**?

Check out this short video lesson:

👉 [Watch on YouTube: Min-Max Normalization](https://youtu.be/-LC_PKBoZfk?si=axDsYAqT6fKs-rJ-)

The video walks you through:

- Min-Max Normalization with a numerical example,  
- Normalization with a numerical example.  

Take notes as you watch, and think about how the examples connect to the practice problems in this notebook.

We will continue to use normalization during the remainder of the workshop (below).

### 📝 In-Class Activity: Stock Market Investment and Z-Scores

You are analyzing the daily returns of a particular stock.  
You know the following information:

- The stock’s mean daily return ($\mu$) is **0.5%**.  
- The daily returns’ standard deviation ($\sigma$) is **2%**.  

You want to find out how unusual a **daily return of 5%** is.


#### Your Task

1. **Write down the Z-score formula**:

   $$
   Z = \frac{X - \mu}{\sigma}
   $$

2. **Substitute the values** into the formula using:
   - $X = 5$ (the daily return in %),
   - $\mu = 0.5$,
   - $\sigma = 2$.

   Show your calculation for $Z$.

3. **Interpret the Z-score**:  
   - What does your calculated Z-score mean in terms of how many standard deviations the 5% return is from the mean?  
   - Is this return unusually high compared to the average?

4. **Find the probability**:  
   - Use the Z-score you found to determine the cumulative probability $P(X \leq 5)$.  
   - You can use a Z-table or a Python function such as:

     ```python
     from scipy.stats import norm
     norm.cdf(Z_value)
     ```

5. **Interpret the probability**:  
   - What percentage of daily returns fall below 5%?  
   - What percentage of daily returns exceed 5%?


#### 🔎 Reflection Questions
- Why is a Z-score useful when comparing returns to the average?  
- If the standard deviation were **larger**, how would that affect the Z-score for the same return of 5%?  


### 💻 **Now it’s your turn to code!**  
Write a short Python script that:  
- Calculates the Z-score for $X = 5$,  
- Uses `scipy.stats.norm.cdf()` to compute the probability,  
- Prints both results clearly.


In [1]:
# TODO: Write the code here

from scipy.stats import norm

# Given values
X = 5       # daily return in %
mu = 0.5    # mean return in %
sigma = 2   # standard deviation in %

# 1. Calculate Z-score
Z = (X - mu) / sigma

# 2. Compute cumulative probability P(X ≤ 5)
prob_below = norm.cdf(Z)         # probability of ≤ 5%
prob_above = 1 - prob_below      # probability of > 5%

# Print results
print(f"Z-score for a 5% return: {Z:.2f}")
print(f"Probability of return ≤ 5%: {prob_below:.4f} ({prob_below*100:.2f}%)")
print(f"Probability of return > 5%: {prob_above:.4f} ({prob_above*100:.2f}%)")


Z-score for a 5% return: 2.25
Probability of return ≤ 5%: 0.9878 (98.78%)
Probability of return > 5%: 0.0122 (1.22%)


#### 💭 Reflection: Add Your Talking Point

---
Z-scores are useful because they standardize values and let us compare results across different datasets.  
I learned that a 5% return is much higher than normal because most daily returns stay close to 0.5%.  
Almost all returns are below 5%, and only a very small share (1.22%) go higher.  
I also realized that if the stock's returns were more volatile (a larger standard deviation), a 5% return would not be as rare.  

---


<br>
<br>
<br>


### 📊 Statistical Inference: From Sample to Population

**Statistical inference** is the act of generalizing from a **sample** to a **population** with a calculated degree of certainty.

- We want to learn about **population parameters** …  
- …but we can only calculate **sample statistics**.


#### Example: Stock Market Context

- **Population** = Stock Market Performance  
- **Sample (Data)** = Company XYZ Stock  
- **Statistic** = Z-Score (calculated from the sample)  
- **Parameter** = Probability of daily return  

We use the **sample statistic** (Z-score from XYZ stock) to make an **inference** about the **population parameter** (probability of daily return across the market).

![Image Description](./images/InferenceFromSample.png)
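
The gap between a statistic and a parameter can be seen in a small simulation. Everything here is hypothetical: the population is simulated with a mean of 0.5 and a standard deviation of 2 (borrowed from the stock example), and a random sample of 50 values stands in for the data we actually get to observe.

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 daily returns (in %) with mean 0.5 and sd 2
population = [random.gauss(0.5, 2.0) for _ in range(100_000)]
sample = random.sample(population, 50)

sample_mean = mean(sample)
standard_error = stdev(sample) / len(sample) ** 0.5

print(f"Population mean (parameter): {mean(population):.3f}")
print(f"Sample mean (statistic):     {sample_mean:.3f}")
print(f"Standard error of the mean:  {standard_error:.3f}")
```

The sample mean rarely equals the population mean exactly; the standard error gives a rough sense of how far off it is likely to be, which is the "calculated degree of certainty" part of inference.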

### 🌍 Challenge #1: Climate Change and Z-Scores

Let’s consider an example where you are analyzing **annual temperature anomalies** to study climate change.

Suppose you have collected data on the annual temperature anomalies (differences from the long-term average temperature) for a particular region over the past 30 years.

- Mean annual temperature anomaly ($\mu$) = **0.5°C**  
- Standard deviation ($\sigma$) = **0.2°C**  

You want to understand how unusual a year with a **temperature anomaly of 0.9°C** is.


#### 🧩 Your Task

1. **Use the Z-score formula**:

   $$
   Z = \frac{X - \mu}{\sigma}
   $$

2. **Substitute the values** into the formula using:
   - $X = 0.9$,  
   - $\mu = 0.5$,  
   - $\sigma = 0.2$.  

   3. 💻 Code the Climate Change Z-Score in Python <br>
      - Complete the Python code scaffold below.  
      - Calculate the Z-score.  
      - Use `scipy.stats.norm.cdf()` to compute the cumulative probability $P(X \leq 0.9)$.  
      - Interpret the result:  
         - What percentage of years have anomalies less than or equal to 0.9°C?  
         - What percentage have anomalies greater than 0.9°C?  

```python
from scipy.stats import norm

# Given values
mu = 0.5   # mean anomaly
sigma = 0.2  # standard deviation
X = 0.9   # observed anomaly

# 1. Compute the Z-score
Z = (X - mu) / sigma
print("Z-score:", round(Z, 2))

# 2. Compute the cumulative probability
p_less = norm.cdf(Z)
print("P(X <= 0.9):", f"{p_less:.4f}", f"({p_less*100:.2f}%)")

# 3. Compute the probability above 0.9
p_greater = 1 - p_less
print("P(X > 0.9):", f"{p_greater:.4f}", f"({p_greater*100:.2f}%)")
```

      - Write the complete object-oriented Python code in the cell below.


In [2]:
# TODO: Write the code here
from dataclasses import dataclass
from scipy.stats import norm

@dataclass
class ClimateAnomalyAnalyzer:
    """Analyze temperature anomalies with Z-scores and probabilities."""
    mu: float     # mean anomaly
    sigma: float  # standard deviation

    def z_score(self, x: float) -> float:
        if self.sigma <= 0:
            raise ValueError("Standard deviation (sigma) must be positive")
        return (x - self.mu) / self.sigma

    def prob_leq(self, x: float) -> float:
        """Return P(X <= x)."""
        return norm.cdf(self.z_score(x))

    def prob_gt(self, x: float) -> float:
        """Return P(X > x)."""
        return 1 - self.prob_leq(x)

    def report(self, x: float) -> str:
        z = self.z_score(x)
        p_leq = self.prob_leq(x)
        p_gt = self.prob_gt(x)
        return (
            f"Z-score: {z:.2f}\n"
            f"P(X ≤ {x:.1f}°C): {p_leq:.4f} ({p_leq*100:.2f}%)\n"
            f"P(X > {x:.1f}°C): {p_gt:.4f} ({p_gt*100:.2f}%)"
        )

# Example usage
analyzer = ClimateAnomalyAnalyzer(mu=0.5, sigma=0.2)
print(analyzer.report(0.9))

Z-score: 2.00
P(X ≤ 0.9°C): 0.9772 (97.72%)
P(X > 0.9°C): 0.0228 (2.28%)


After you have written and tested the code:

4. **Interpret the Z-score**:  
   - How many standard deviations is 0.9°C above the mean?  
   - Is this anomaly unusually high compared to the average year?  

5. 📈 **Visualization Exercise: Climate Anomaly Probability**  
   - Let’s make the result more visual. We’ll plot the normal distribution curve for annual temperature anomalies and shade the probability of having a value **greater than 0.9°C**.
      - Run the code below.  
      - Observe the shaded region above $X = 0.9$.  
      - Compare the shaded probability with the number you calculated earlier using the Z-score.  


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Given values
mu = 0.5   # mean anomaly
sigma = 0.2  # standard deviation
X = 0.9   # observed anomaly

# Generate x values for the curve
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 1000)
y = norm.pdf(x, mu, sigma)

# Plot the normal distribution curve
plt.plot(x, y, label="Normal Distribution", linewidth=2)

# Shade the region above X = 0.9
x_fill = np.linspace(X, mu + 4*sigma, 500)
y_fill = norm.pdf(x_fill, mu, sigma)
plt.fill_between(x_fill, y_fill, alpha=0.5)

# Add vertical line at X = 0.9
plt.axvline(X, color="red", linestyle="--", label=f"X = {X}")

# Labels and legend
plt.title("Probability of Annual Temperature Anomaly > 0.9°C")
plt.xlabel("Temperature Anomaly (°C)")
plt.ylabel("Density")
plt.legend()
plt.show()


### 🚀 Challenge #2: Integrating Z-Scores, Visualization, and Web Services

You’ve now completed two separate tasks:  
1. **Coding the Climate Change Z-Score**: You calculated the Z-score and probabilities using Python code.  
2. **Visualizing the Climate Anomaly Probability**: You plotted the normal curve and shaded the probability region above $X = 0.9$.

Both tasks are useful on their own, but in real Data Science work we want our tools to be **reusable, organized, and accessible to others**.

### 🧩 Your Task

Step 1. Combine Both Codes Using Object-Oriented Python
- Create a Python **class** called `ClimateAnomalyAnalyzer`.  
- The class should:
  - Store the values of $\mu$, $\sigma$, and $X$ as attributes.  
  - Have a method `compute_zscore()` that calculates and returns the Z-score.  
  - Have a method `compute_probabilities()` that calculates and returns $P(\text{anomaly} \leq X)$ and $P(\text{anomaly} > X)$.  
  - Have a method `plot_distribution()` that produces the visualization you created earlier (curve + shaded area).  


Step 2. Build a Flask Web Service
- Create a **Flask app** that exposes a web service so that users can interact with your analysis.  
- The app should:
  - Provide a **form** or **query parameters** where the user enters values for $\mu$, $\sigma$, and $X$.  
  - Display the **calculated Z-score** and **probabilities**.  
  - Show the **distribution plot** as an image.  

*(Hint: you can save the Matplotlib plot as a `.png` file in memory and serve it in the Flask app.)*


Step 3. Test Your Web Service
Building a web service is only half the job — the other half is **testing it like a user would**. There are three tests for you to implement: 

> **Browser Test (quick check):**  
   - Run your Flask app with `python app.py`.  
   - Open a browser and visit:  
     ```
     http://127.0.0.1:5000/analyze?mu=0.5&sigma=0.2&X=0.9
     ```  
   - You should see a JSON response or an HTML page with results.  

> **Command Line Test with `curl`:**  
   - Run:  
     ```bash
     curl "http://127.0.0.1:5000/analyze?mu=0.5&sigma=0.2&X=0.9"
     ```  
   - This checks that the service responds correctly to HTTP requests.  

> **Python Test with `requests` library:**  
   ```python
   import requests

   response = requests.get(
       "http://127.0.0.1:5000/analyze",
       params={"mu": 0.5, "sigma": 0.2, "X": 0.9}
   )
   print(response.json())
   ```


Step 4. Reflection

Explain how using Object-Oriented Python made your code cleaner and easier to extend.

Reflect on how wrapping your analysis in a Flask web service could make it accessible to others, e.g. policymakers, researchers, or classmates.

#### 💭 Reflection: Add Your Talking Point

---
TODO: Your reflection goes here


---

<br>
<br>
<br>


## 📊 T-Score

The **t-score** is a type of standard score used when the **sample size is small** or when the **population standard deviation ($\sigma$) is unknown**.  

It works very much like the **z-score**, but with one important difference:  
- Instead of using the population standard deviation ($\sigma$), it uses the **sample standard deviation ($s$)**.  
- It also adjusts for the **sample size ($n$)**, since smaller samples add more uncertainty.


### Formula

$$
T = \frac{X - \bar{x}}{s / \sqrt{n}}
$$

Where:  
- $T$ = t-score  
- $X$ = value of the observation  
- $\bar{x}$ = sample mean  
- $s$ = sample standard deviation  
- $n$ = sample size  


### Why use the t-score?
- When $n$ is **large** and $\sigma$ is known → use the **z-score**.  
- When $n$ is **small** or $\sigma$ is unknown → use the **t-score**.  

The t-distribution is wider and has heavier tails than the normal distribution, reflecting more uncertainty with smaller samples. As $n$ grows larger, the t-distribution approaches the standard normal distribution.
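
The convergence toward the normal distribution can be seen by comparing critical values. This sketch assumes `scipy` is installed (as elsewhere in this notebook) and uses the 97.5th percentile, the cut-off for a two-sided 95% interval, as the point of comparison.

```python
from scipy.stats import norm, t

z_crit = norm.ppf(0.975)  # two-sided 95% critical value for the standard normal
print(f"normal:         {z_crit:.3f}")

for df in (2, 5, 10, 30, 100, 1000):
    t_crit = t.ppf(0.975, df)  # same percentile for a t-distribution with df degrees of freedom
    print(f"t (df = {df:>4}): {t_crit:.3f}")
```

The t critical values are larger for small degrees of freedom (heavier tails) and shrink toward the normal value as the degrees of freedom grow.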


### 📝 Exercise

You collected the following data on exam scores from a **small sample of $n = 10$ students**:

- Sample mean ($\bar{x}$) = 75  
- Sample standard deviation ($s$) = 8  
- One student scored $X = 90$  

**Task:**  
1. Calculate the t-score for the student’s score of 90.  
2. Interpret the result: how many standard errors above the sample mean is this student’s score?  
3. Discuss: why might using the z-score here be misleading?  

💻 *Challenge:* Write Python code using `scipy.stats.t.cdf()` to calculate the probability of observing a score at least this extreme with $n-1$ degrees of freedom.


In [4]:
# TODO: Write the code here
# ALL-IN-ONE: OOP + Visualization + Flask Service
# -----------------------------------------------
# Requirements (in your .venv):
#   pip install numpy matplotlib scipy flask requests

import io
import base64
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from flask import Flask, request, jsonify, send_file, Response

# ---------------------------
# Step 1 — Object-Oriented API
# ---------------------------
class ClimateAnomalyAnalyzer:
    """
    Analyze temperature anomalies with Z-scores, probabilities,
    and a normal-distribution plot with shaded right tail.
    """
    def __init__(self, mu: float, sigma: float, X: float):
        if sigma <= 0:
            raise ValueError("sigma must be positive")
        self.mu = float(mu)
        self.sigma = float(sigma)
        self.X = float(X)

    def compute_zscore(self) -> float:
        return (self.X - self.mu) / self.sigma

    def compute_probabilities(self) -> dict:
        Z = self.compute_zscore()
        p_leq = norm.cdf(Z)
        p_gt = 1 - p_leq
        return {
            "mu": self.mu,
            "sigma": self.sigma,
            "X": self.X,
            "Z": round(Z, 2),
            "p_leq": float(p_leq),  # P(X ≤ X)
            "p_gt": float(p_gt)     # P(X > X)
        }

    def make_plot(self):
        """
        Returns a Matplotlib Figure with the PDF and shaded tail for X>=threshold.
        """
        x = np.linspace(self.mu - 4*self.sigma, self.mu + 4*self.sigma, 600)
        pdf = norm.pdf(x, loc=self.mu, scale=self.sigma)

        fig, ax = plt.subplots(figsize=(8, 4.5))
        ax.plot(x, pdf, label=f"Normal PDF (μ={self.mu}, σ={self.sigma})")
        ax.axvline(self.X, color="red", linestyle="--", label=f"X = {self.X:.2f}°C")

        mask = x >= self.X
        ax.fill_between(x[mask], pdf[mask], 0, alpha=0.3,
                        label=f"P(X > {self.X:.2f})")
        ax.set_title("Climate Anomaly Distribution with Tail Probability")
        ax.set_xlabel("Temperature anomaly (°C)")
        ax.set_ylabel("Density")
        ax.legend()
        fig.tight_layout()
        return fig

    def plot_png_bytes(self) -> bytes:
        """Render the plot to PNG bytes (for Flask response)."""
        fig = self.make_plot()
        buf = io.BytesIO()
        fig.savefig(buf, format="png", dpi=150, bbox_inches="tight")
        plt.close(fig)
        buf.seek(0)
        return buf.read()

    def plot_png_base64(self) -> str:
        """Plot as base64 data URI (handy to embed in HTML)."""
        png = self.plot_png_bytes()
        return "data:image/png;base64," + base64.b64encode(png).decode("ascii")


# --------------------------------
# Step 2 — Flask Web Service (API)
# --------------------------------
app = Flask(__name__)

def _parse_floats():
    try:
        mu = float(request.args.get("mu", 0.5))
        sigma = float(request.args.get("sigma", 0.2))
        X = float(request.args.get("X", 0.9))
        return mu, sigma, X, None
    except (TypeError, ValueError):
        return None, None, None, jsonify({"error": "mu, sigma, and X must be numeric"})


@app.get("/analyze")
def analyze():
    """
    Returns JSON by default.
    - Add &format=png to get a PNG image of the plot.
    - Add &format=html for a simple HTML page showing results + embedded plot.
    Example:
      /analyze?mu=0.5&sigma=0.2&X=0.9
      /analyze?mu=0.5&sigma=0.2&X=0.9&format=png
      /analyze?mu=0.5&sigma=0.2&X=0.9&format=html
    """
    mu, sigma, X, err = _parse_floats()
    if err:
        return err, 400

    analyzer = ClimateAnomalyAnalyzer(mu, sigma, X)
    results = analyzer.compute_probabilities()
    fmt = request.args.get("format", "json").lower()

    if fmt == "png":
        png_bytes = analyzer.plot_png_bytes()
        return send_file(io.BytesIO(png_bytes), mimetype="image/png")

    if fmt == "html":
        img_b64 = analyzer.plot_png_base64()
        html = f"""
        <html>
        <head><title>Climate Anomaly Analysis</title></head>
        <body style="font-family: Arial, sans-serif; margin: 24px;">
          <h2>Climate Anomaly Analysis</h2>
          <p><b>Inputs:</b> μ={results['mu']}, σ={results['sigma']}, X={results['X']}</p>
          <p><b>Z-score:</b> {results['Z']}</p>
          <p><b>P(X ≤ X):</b> {results['p_leq']:.4f} ({results['p_leq']*100:.2f}%)<br>
             <b>P(X &gt; X):</b> {results['p_gt']:.4f} ({results['p_gt']*100:.2f}%)</p>
          <img src="{img_b64}" alt="Distribution plot" style="max-width: 800px; width: 100%; border: 1px solid #ddd;" />
        </body>
        </html>
        """
        return Response(html, mimetype="text/html")

    # default JSON
    out = {
        "mu": results["mu"],
        "sigma": results["sigma"],
        "X": results["X"],
        "Z": results["Z"],
        "P(X <= X)": round(results["p_leq"], 4),
        "P(X > X)": round(results["p_gt"], 4),
    }
    return jsonify(out)


# --------------------------------
# Step 3 — (Optional) local sanity check in notebook
# --------------------------------
if __name__ == "__main__":
    # Meant to be saved as app.py and run as a script:
    #   python app.py
    # In a notebook this cell also runs, and the debug reloader ends it with SystemExit.
    # Tests:
    # 1) Browser JSON:  http://127.0.0.1:5000/analyze?mu=0.5&sigma=0.2&X=0.9
    # 2) Browser PNG:   http://127.0.0.1:5000/analyze?mu=0.5&sigma=0.2&X=0.9&format=png
    # 3) Browser HTML:  http://127.0.0.1:5000/analyze?mu=0.5&sigma=0.2&X=0.9&format=html
    app.run(debug=True)

 * Serving Flask app '__main__'
 * Debug mode: on


 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
 * Restarting with stat


SystemExit: 1

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


### 🏠 Challenge #3: T-Scores, Housing Prices, and Degrees of Freedom

Let’s revisit the example of house prices in Elmira, Ontario.  
You collected the following sample of house prices (in thousands of dollars) from **10 houses**:

$$
[450, 470, 430, 490, 410, 460, 440, 480, 500, 455]
$$

- Sample mean ($\bar{x}$) = 458.5 (thousand dollars)  
- Sample standard deviation ($s$) = 25.21 (thousand dollars)  
- Sample size ($n$) = 10  
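
Summary statistics are easy to mistype, so it can be worth recomputing them from the raw list before carrying them into the t-score. A minimal sketch using the standard library; `statistics.stdev` divides by $n - 1$, which is assumed here to be the intended definition of $s$.

```python
from statistics import mean, stdev

prices = [450, 470, 430, 490, 410, 460, 440, 480, 500, 455]  # thousands of dollars

n = len(prices)
x_bar = mean(prices)
s = stdev(prices)  # sample standard deviation (divides by n - 1)

print(f"n = {n}, sample mean = {x_bar:.1f}, sample standard deviation = {s:.2f}")
```

For this list the call returns a mean of 458.5 and, with the $n - 1$ denominator, a standard deviation of about 27.69, so it is worth checking which value you carry into the t-score.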

Suppose you want to determine the **T-score** for a house priced at **500 thousand dollars**.


#### 📊 Statistical Background

The **t-distribution** is used when data are approximately normally distributed, but the **population variance is unknown**.  

- The variance is estimated using the **degrees of freedom (df)**, defined as:

$$
df = n - 1
$$

- Why $n-1$? Because when you calculate a sample mean, one degree of freedom is "used up" — the last value is constrained by the others and the mean.  

For this housing dataset:

$$
df = 10 - 1 = 9
$$


#### 🔢 Formula for the T-Score

The formula is:

$$
T = \frac{X - \bar{x}}{s / \sqrt{n}}
$$

Where:  
- $T$ = T-score  
- $X$ = observed value (here, 500)  
- $\bar{x}$ = sample mean (458.5)  
- $s$ = sample standard deviation (25.21)  
- $n$ = sample size (10)  


#### 📝 Your Task

1. **Calculate the T-score** for $X = 500$.  
2. **Interpret the result**: how unusual is this value compared to the sample mean?  
3. **Use the t-distribution** with $df = 9$ to calculate the probability of observing a value this high or higher.  
   - This is a **one-tailed probability**.  
   - Use Python’s `scipy.stats.t` distribution to do this.  


#### 💻 Python Scaffold

Here’s some starter code to guide you:

```python
import numpy as np
from scipy.stats import t

# Given values
x_bar = 458.5
s = 25.21
n = 10
X = 500

# 1. Compute the T-score
T = ( ___ - ___ ) / ( ___ / np.sqrt(n) )
print("T-score:", T)

# 2. Degrees of freedom
df = n - 1
print("Degrees of freedom:", df)

# 3. Compute one-tailed probability P(T >= observed)
p_value = 1 - t.cdf(T, df)
print("P-value (one-tailed):", p_value)
```



Write the code in the cell below. Don't forget to make it **object-oriented Python**.

In [None]:
# TODO: Write the code here

#### 💭 Reflection: Add Your Talking Point

---
TODO: Your reflection goes here


---

<br>
<br>
<br>


# 📈 Test for Normality

Before applying many statistical methods, we need to check whether the data follow a **normal distribution**.  

A **normality test** helps us decide if it is reasonable to assume the data are normally distributed.

**Why normalize a dataset?**  
In Machine Learning, features can have very different scales (e.g., age in years vs. income in dollars).  
If left unnormalized, algorithms that rely on distance (like K-Nearest Neighbors or k-Means) can be dominated by the features with larger values, and gradient descent in neural networks can converge more slowly.  

👉 Normalization rescales all features to the same range (commonly 0 to 1), ensuring that **each feature contributes fairly** to the model. Note that normalization is a separate idea from normality: rescaling changes a feature's range, not the shape of its distribution.
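A minimal sketch of min-max normalization, using made-up ages and incomes. It applies the rescaling $x' = (x - x_{min}) / (x_{max} - x_{min})$ by hand and compares the result with scikit-learn's `MinMaxScaler`. The column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on very different scales
people = pd.DataFrame({
    "age": [22, 35, 47, 58, 63],
    "income": [28_000, 52_000, 75_000, 61_000, 120_000],
})

# Manual min-max rescaling, column by column
manual = (people - people.min()) / (people.max() - people.min())

# The same rescaling with scikit-learn
scaled = MinMaxScaler().fit_transform(people)

print(manual.round(3))
```

Both columns now run from 0 to 1, but the ordering and relative spacing of the values within each column are unchanged.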

<br/>

### 🔑 Assumptions: Data Independence

When performing a test for normality, we assume that the data are **independent**.  
This means:

1. **Independent Collection:**  
   - Example: Water hardness data samples across the UK were collected by **source A**.  
   - Mortality rates across the UK were collected by **source B**.  

2. **No Pairing or Matching:**  
   - The two samples (water hardness and mortality rates) are not paired or matched.  

3. **No Dependence:**  
   - The two sets of measurements do not depend on each other.  


### 🗂️ Example Dataset

We are working with the following dataset (from `water.csv`):

| location | town        | mortality | hardness |
|----------|-------------|-----------|----------|
| South    | Bath        | 1247      | 10


### Import the data set.

#### Reading and Exporting Water Data from HSAUR Package in R

The `water` dataset is originally available in the `HSAUR` package in R. To use this dataset in a Jupyter Notebook, we first need to read it into the R programming environment and then export it to a CSV file. Here are the steps involved:

1. **Install and Load the HSAUR Package**: 
   - If the `HSAUR` package is not already installed, we need to install it using `install.packages("HSAUR")`.
   - Load the package using `library(HSAUR)`.

2. **Read the Water Data**:
   - The `water` dataset can be accessed directly from the `HSAUR` package using the command `data("water")`.

3. **Export the Data to a CSV File**:
   - Once the data is loaded into the R environment, we can use the `write.csv()` function to export it to a CSV file. For example, `write.csv(water, "water.csv")` will save the dataset as `water.csv` in the current working directory.

Below is the R code that performs these steps. We already executed it for you and stored the CSV file in the 'data' sub-folder.

```r
# Install and load the HSAUR package
if(!require(HSAUR)){install.packages("HSAUR")}
library(HSAUR)

# Load the water dataset
data("water")

# Export the dataset to a CSV file
write.csv(water, "water.csv")
```

#### Import all necessary libraries

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Read in Water Data
water = pd.read_csv("./data/water.csv")  # forward slashes work on Windows, macOS, and Linux
water.head()


#### Display Summary Statistics on the data we just imported.


In [None]:
# Summary statistics for the two numeric columns, selected by name
desc_water = water[['mortality', 'hardness']].describe()
print(desc_water.round(2))

# Additional summary of every numeric column
water.describe()


#### Graphical summary of the water data: mortality and hardness.


In [None]:
# Histograms and density plots
sns.histplot(water['mortality'], bins=12, kde=True)
plt.xlabel('Mortality')
plt.title('Distribution of Mortality')
plt.show()

sns.histplot(water['hardness'], bins=12, kde=True)
plt.xlabel('Hardness')
plt.title('Distribution of Water Hardness')
plt.show()


### 📉 Are these normal curves? Not really...

<br/>

#### 🔑 Why Normalize the Data?

You have two features in the `water.csv` dataset:

* **Mortality** (ranges around 1200–2000)
* **Hardness** (ranges around 5–120)

These features are on **very different scales**.
If you use them directly in **distance-based models** (like K-Nearest Neighbors) or **gradient descent–based models** (like Logistic Regression, Neural Networks), the larger-scaled feature (**Mortality**) will dominate.

👉 This means the model may **ignore water hardness** just because mortality values are numerically larger.

**Normalization** rescales the features so they’re comparable, ensuring **each feature contributes fairly** to the model.
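To make the "dominating feature" idea concrete, the sketch below compares Euclidean distances between three hypothetical towns before and after min-max scaling. The town values are made up, and the scaling bounds are the approximate ranges quoted above, not values computed from `water.csv`.

```python
import numpy as np

# Three hypothetical towns as (mortality, hardness); values are illustrative
towns = {
    "A": np.array([1500.0, 10.0]),
    "B": np.array([1560.0, 12.0]),   # similar hardness to A
    "C": np.array([1510.0, 60.0]),   # similar mortality to A
}

# Approximate feature ranges quoted above, used for min-max scaling
lo = np.array([1200.0, 5.0])
hi = np.array([2000.0, 120.0])

def nearest_to_a(points):
    """Return the town closest to A by Euclidean distance, plus all distances."""
    dists = {k: float(np.linalg.norm(points[k] - points["A"])) for k in ("B", "C")}
    return min(dists, key=dists.get), dists

raw_nearest, raw_dists = nearest_to_a(towns)
scaled_towns = {k: (v - lo) / (hi - lo) for k, v in towns.items()}
scaled_nearest, scaled_dists = nearest_to_a(scaled_towns)

print("Raw distances:   ", raw_dists, "-> nearest:", raw_nearest)
print("Scaled distances:", scaled_dists, "-> nearest:", scaled_nearest)
```

On the raw values, a 60-unit mortality gap outweighs a 50 ppm hardness gap, even though the hardness gap covers a much larger share of that feature's range. After scaling, the nearest neighbour of A changes.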

#### 🧪 Demonstration in Python

We’ll demonstrate this using **K-Nearest Neighbors (KNN)**, a distance-based classifier.
The experiment is simple: predict whether a town is in the **North or South** based on `mortality` and `hardness`.

#### 1. Without Normalization

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler

# Features and labels
X = water[['mortality', 'hardness']]
y = water['location']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)

# KNN without normalization
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("Accuracy WITHOUT normalization:", accuracy_score(y_test, y_pred))



#### 2. With Normalization (Min-Max Scaling)

In [None]:
# Normalize features to the [0, 1] range.
# Fit the scaler on the training split only so the test data stay unseen,
# then apply the same transformation to both splits.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# KNN with normalization
knn_scaled = KNeighborsClassifier(n_neighbors=3)
knn_scaled.fit(X_train_s, y_train)
y_pred_s = knn_scaled.predict(X_test_s)

print("Accuracy WITH normalization:", accuracy_score(y_test, y_pred_s))


#### 🎯 What to Observe

* The **accuracy without normalization** may be poor because the large scale of `mortality` overwhelms `hardness`.
* After **normalization**, `hardness` is on equal footing with `mortality`, so the classifier can use **both features effectively**.
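Because accuracy from a single split of a small dataset can swing with the random seed, one option is to wrap the scaler and classifier in a scikit-learn pipeline and cross-validate. The sketch below uses synthetic two-feature data (an assumption, standing in for `water`) so it runs on its own. The pipeline refits the scaler inside each fold, so no test-fold information leaks into the scaling.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)

# Synthetic stand-in data: one large-scale feature, one small-scale feature
n = 60
labels = np.repeat(["North", "South"], n // 2)
big = rng.normal(1500, 200, size=n)   # large scale, unrelated to the label
small = np.where(labels == "North", 20, 80) + rng.normal(0, 15, size=n)
X_demo = np.column_stack([big, small])

raw_knn = KNeighborsClassifier(n_neighbors=3)
scaled_knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))

# 5-fold cross-validated accuracy for each setup
raw_scores = cross_val_score(raw_knn, X_demo, labels, cv=5)
scaled_scores = cross_val_score(scaled_knn, X_demo, labels, cv=5)

print("Mean CV accuracy, raw features:   ", raw_scores.mean())
print("Mean CV accuracy, scaled features:", scaled_scores.mean())
```

Averaging over folds gives a steadier comparison than one split, though the exact numbers still depend on the data.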

💡 **Takeaway:**
Always check the **scale of your features**. If they differ significantly, normalize (or standardize) before running experiments. Otherwise, you risk biased models.
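Standardization is the z-score alternative mentioned above: subtract the mean and divide by the standard deviation. Here is a small sketch with made-up numbers, comparing a manual z-score with scikit-learn's `StandardScaler` (which uses the population standard deviation, i.e. `ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (mortality, hardness) rows, for illustration only
X_rows = np.array([
    [1250.0, 105.0],
    [1600.0, 15.0],
    [1450.0, 40.0],
    [1800.0, 8.0],
])

# Manual z-score: (x - mean) / std, computed per column
manual_z = (X_rows - X_rows.mean(axis=0)) / X_rows.std(axis=0)

# The same transformation with scikit-learn
sklearn_z = StandardScaler().fit_transform(X_rows)

print("Column means after scaling:", sklearn_z.mean(axis=0).round(6))
print("Column stds after scaling: ", sklearn_z.std(axis=0).round(6))
```

Unlike min-max scaling, standardized values are not confined to a fixed range, which makes this option less sensitive to a single extreme value setting the scale.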

### 🧪 Shapiro-Wilk Test: Beyond Visual Inspection

When we look at histograms or density plots, we can often *guess* whether the data look approximately normal.  
But in **machine learning workflows**, relying on **visual inspection** is not practical:

- It is subjective — two people may interpret the same plot differently.  
- It doesn’t scale — imagine inspecting thousands of features across hundreds of datasets.  


#### 🔑 Enter the Shapiro-Wilk Test
The **Shapiro-Wilk test** provides a **formal statistical test for normality**.

- **Null hypothesis ($H_0$):** The data come from a normal distribution.  
- **Alternative hypothesis ($H_1$):** The data do not come from a normal distribution.  

It produces a **p-value**:
- If `p > 0.05`: fail to reject $H_0$ → the data are consistent with a normal distribution.  
- If `p <= 0.05`: reject $H_0$ → the data show evidence of non-normality.  

What is a p-value anyway?

👉 A p-value tells us how surprising our data would be if nothing unusual was going on.

- A small p-value means: “This result would be very unlikely if nothing unusual was happening, so maybe something real is going on.”
- A big p-value means: “This result isn’t surprising at all; it could easily happen just by chance.”

👉 Imagine you flip a fair coin. Normally, you’d expect about half heads and half tails.

Now suppose you flip it 10 times and get 10 heads in a row.
The p-value answers the question:

“If this coin were truly fair, how likely would it be to see 10 heads in a row?”

That probability is $(1/2)^{10} = 1/1024$, or about 0.001.

- If that probability (the p-value) is very small, as it is here, you start to suspect the coin isn’t fair.
- If it’s not that small, then the result could just be normal chance.

### 🎲 Understanding the p-value

👉 **What is a p-value?**

1. **Simplest view:**  
   A p-value is a number that tells you how much your result looks like it could be a **random accident**.  
   - Small p-value → unlikely to be an accident.  
   - Large p-value → could easily be an accident.  

2. **Everyday analogy (coin flips):**  
   Imagine flipping a fair coin. If you get **10 heads in a row**, the p-value answers:  
   *“If the coin were really fair, how likely would it be to see this result?”*  

3. **Plain English version:**  
   The p-value is the chance of seeing results **at least as extreme as yours** if nothing unusual is happening.  

4. **Statistical phrasing (when we can’t avoid it):**  
   The p-value is the probability of observing your data, or something more extreme, **assuming the null hypothesis is true**.  

💡 **Takeaway:**  
The smaller the p-value, the stronger the evidence that your result is *not* just random chance.
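The coin-flip analogy can be checked numerically. This sketch assumes SciPy 1.7 or later, where `scipy.stats.binomtest` is available, and asks for the one-sided probability of at least 10 heads in 10 flips of a fair coin.

```python
from scipy.stats import binomtest

flips = 10
heads = 10

# H0: the coin is fair (p = 0.5); alternative: it favours heads
result = binomtest(heads, n=flips, p=0.5, alternative="greater")

# For this most extreme outcome the p-value is simply 0.5 ** 10
print("p-value from binomtest:", result.pvalue)
print("0.5 ** 10:             ", 0.5 ** flips)
```

Both numbers agree, and both fall well below the usual 0.05 threshold, which is why ten heads in a row makes a fair coin look doubtful.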


### 🔗 Connecting p-values and the Shapiro-Wilk Test

The **Shapiro-Wilk test** is a statistical test for checking whether a dataset is normally distributed.  
It produces a **p-value**, which we interpret just like any other p-value:

- **Null hypothesis ($H_0$):** The data come from a normal distribution.  
- **Alternative hypothesis ($H_1$):** The data do not come from a normal distribution.  

👉 Here the p-value tells us how likely it would be to see a sample that departs from normality at least this much if the data really came from a normal distribution.  

- If **p > 0.05** → the data do not provide strong evidence against $H_0$, so it is reasonable to assume the data are normal.  
- If **p ≤ 0.05** → the data are unlikely under $H_0$, so we conclude the data are not normally distributed.  

💡 In short: the **Shapiro-Wilk test uses the p-value to automate the decision** of whether a dataset is “normal enough” to justify using methods that assume normality.
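Before running the test on real data, it can help to see how it behaves on samples whose shape is known. The sketch below draws two synthetic samples (an assumption for illustration, unrelated to `water.csv`): one from a normal distribution and one from a right-skewed exponential distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two synthetic samples (illustrative only): one normal, one right-skewed
normal_sample = rng.normal(loc=50, scale=5, size=200)
skewed_sample = rng.exponential(scale=5, size=200)

for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    w_stat, p_value = stats.shapiro(sample)
    verdict = "consistent with normality" if p_value > 0.05 else "evidence of non-normality"
    print(f"{name}: W = {w_stat:.3f}, p = {p_value:.4f} -> {verdict}")
```

The W statistic is close to 1 for bell-shaped data and drops as the sample departs from normality, with the p-value shrinking accordingly.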


In [None]:
# Shapiro-Wilk test for normality
shapiro_hardness = stats.shapiro(water['hardness'])
shapiro_mortality = stats.shapiro(water['mortality'])
print(f"Shapiro-Wilk test for hardness: {shapiro_hardness}")
print(f"Shapiro-Wilk test for mortality: {shapiro_mortality}")

# Q-Q plots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Q-Q plot for hardness
stats.probplot(water['hardness'], dist="norm", plot=axes[0])
axes[0].set_title('Q-Q Plot for Hardness')

# Q-Q plot for mortality
stats.probplot(water['mortality'], dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot for Mortality')

plt.tight_layout()
plt.show()


### ⚙️ Why It Matters in ML Operations
- In automated ML pipelines, we cannot visually inspect every feature.  
- Tests like **Shapiro-Wilk** give us a **systematic, programmatic way** to decide whether normality assumptions hold.  
- This helps us choose the right tools:
  - If normal → use **parametric methods** (e.g., linear regression, t-tests).  
  - If not normal → consider **non-parametric methods** (e.g., Mann-Whitney test, tree-based models).  


👉 In short: **visual inspection is good for learning**, but **automated testing is essential for scaling machine learning operations.**
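One way such an automated check might look is sketched below. The helper name `screen_normality`, its `alpha` parameter, and the synthetic features are all hypothetical; the idea is simply to loop over numeric columns and record the Shapiro-Wilk outcome for each.

```python
import numpy as np
import pandas as pd
from scipy import stats

def screen_normality(df, alpha=0.05):
    """Run Shapiro-Wilk on each numeric column and flag the outcome."""
    rows = []
    for col in df.select_dtypes(include="number").columns:
        values = df[col].dropna()
        if len(values) < 3:   # Shapiro-Wilk needs at least 3 observations
            continue
        w_stat, p_value = stats.shapiro(values)
        rows.append({"feature": col, "W": w_stat, "p_value": p_value,
                     "looks_normal": bool(p_value > alpha)})
    return pd.DataFrame(rows)

# Synthetic features, for illustration only
rng = np.random.default_rng(1)
features = pd.DataFrame({
    "bell_shaped": rng.normal(0, 1, 150),
    "right_skewed": rng.exponential(2.0, 150),
    "label": ["a"] * 150,     # non-numeric columns are skipped
})

report = screen_normality(features)
print(report)
```

A pipeline could then route features flagged `looks_normal == False` to a transformation step or to non-parametric methods. With very large samples the test flags even tiny departures, so the threshold is worth reviewing case by case.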


## Comparing Mortality and Hardness by Location

In [None]:
# Box plots
sns.boxplot(x='location', y='hardness', data=water)
plt.title('Hardness by Location')
plt.xlabel('Regions')
plt.show()

sns.boxplot(x='location', y='mortality', data=water)
plt.title('Mortality by Location')
plt.xlabel('Regions')
plt.show()


### 🎥 Explore p-values Further

Want to deepen your understanding of **p-values** and **statistical hypothesis tests**?

Check out this short video lesson:

👉 [Watch on YouTube: What is a p-value?](https://youtu.be/ukcFrzt6cHk?si=tHVMt9vXkvXMWTX3)

The video walks you through:

- The formal definition of a p-value in the context of observations  
- The formal definition of a p-value in the context of probability  
- Its use in a drug administration use case  
- The concept of **Random Noise**  

Take notes as you watch, and think about how the examples connect to the practice problems in this notebook. 

### 🌍 Challenge #4 : Country Data — Normality & Normalization

Use two free APIs about **country statistics / demographics** to fetch data, test for normality, normalize/standardize, visualize, and reflect.

#### 📡 Suggested APIs to Use  
Here are two APIs that give country-level data, with numeric fields, and are free or have free endpoints:

1. **REST Countries API** — [https://restcountries.com/](https://restcountries.com/)  
   - Returns data on countries: population, area, region, etc.  
   - No API key needed.

2. **World Bank API** — [https://api.worldbank.org/v2/country](https://api.worldbank.org/v2/country)  
   - Provides country metadata (such as region and income level), plus statistical indicators such as population through its indicator endpoints.  
   - Free to use.

#### 💡 Tasks

1. **Fetch data** from both APIs:  
   - For each API, get at least **100 countries**, and pick **two or more numeric features** (e.g. population, area, GDP per capita, life expectancy).  
   - Load into DataFrames.

2. **Check for normality** using the **Shapiro-Wilk test** on each numeric feature:  
   - Compute p-values.  
   - Note which features are *not* normal (p ≤ 0.05).

3. **Transform (if needed)** those “not normal” features:  
   - Normalize via Min-Max scaling (to range 0-1).  
   - Standardize (z-score: mean = 0, standard deviation = 1).

4. **Visualize before & after transformations**:  
   - Use **box plots** for each feature before transformation.  
   - Box plots after normalization.  
   - Box plots after standardization.  
   - Display side by side for comparison.

5. **Reflection / Talking Points**: Write at least three short points about:  
   - Why normalization/standardization was needed (or not) in your datasets.  
   - How differences in scale showed up (outliers, spread).  
   - Implications for using these datasets in ML models (distance-based, etc.).

#### 🐍 Python Scaffold

```python
import requests
import pandas as pd
from scipy.stats import shapiro
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

def test_normality(df, cols):
    results = {}
    for c in cols:
        vals = df[c].dropna()
        if len(vals) >= 3:
            stat, p = shapiro(vals)
            results[c] = p
    return results

# API 1: REST Countries (request only the fields we need)
url1 = "https://restcountries.com/v3.1/all?fields=name,population,area"
resp1 = requests.get(url1, timeout=30)
resp1.raise_for_status()
data1 = pd.json_normalize(resp1.json())
# Example numeric features
num_cols1 = ['population', 'area']

# API 2: World Bank country data
url2 = "https://api.worldbank.org/v2/country?format=json&per_page=300"
resp2 = requests.get(url2, timeout=30)
resp2.raise_for_status()
# World Bank returns a list of two elements: metadata and actual data
wb_data = resp2.json()[1]
data2 = pd.DataFrame(wb_data)
# The country endpoint has no population field; longitude and latitude arrive
# as strings (empty for aggregates), so convert them to numbers first
num_cols2 = ['longitude', 'latitude']
data2[num_cols2] = data2[num_cols2].apply(pd.to_numeric, errors='coerce')

# 1. Normality before transformations
pvals1_before = test_normality(data1, num_cols1)
pvals2_before = test_normality(data2, num_cols2)
print("P-values before:", pvals1_before, pvals2_before)

# 2. Normalize / Standardize
minmax = MinMaxScaler()
stdscaler = StandardScaler()

data1_norm = data1.copy()
data1_norm[num_cols1] = minmax.fit_transform(data1[num_cols1])

data1_std = data1.copy()
data1_std[num_cols1] = stdscaler.fit_transform(data1[num_cols1])

# 3. Normality after transformations
pvals1_norm = test_normality(data1_norm, num_cols1)
pvals1_std = test_normality(data1_std, num_cols1)
print("P-values after Min-Max (REST Countries):", pvals1_norm)
print("P-values after Standardization (REST Countries):", pvals1_std)

# 4. Visualization
fig, axes = plt.subplots(3, len(num_cols1), figsize=(6 * len(num_cols1), 12))
for i, c in enumerate(num_cols1):
    sns.boxplot(y=data1[c], ax=axes[0, i])
    axes[0, i].set_title(f"{c} before")
    sns.boxplot(y=data1_norm[c], ax=axes[1, i])
    axes[1, i].set_title(f"{c} normalized")
    sns.boxplot(y=data1_std[c], ax=axes[2, i])
    axes[2, i].set_title(f"{c} standardized")
plt.tight_layout()
plt.show()
```


---

## T-Test and Variance Test

In [None]:
# Levene's test for equal variances between the North and South groups
res_ftest = stats.levene(water['mortality'][water['location'] == 'North'],
                         water['mortality'][water['location'] == 'South'])
print(f"Variance test result: {res_ftest}")

# T-Test
res_ttest = stats.ttest_ind(water['mortality'][water['location'] == 'North'],
                            water['mortality'][water['location'] == 'South'],
                            equal_var=True)
print(f"T-Test result: {res_ttest}")
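The cell above fixes `equal_var=True`. One way to connect the two tests, sketched here on synthetic groups (assumed values, not the `water` data), is to let the Levene p-value decide between the pooled-variance t-test and Welch's t-test, which does not assume equal variances.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic groups with clearly different spreads (illustrative only)
north = rng.normal(1600, 150, size=35)
south = rng.normal(1400, 40, size=26)

# Levene's test: H0 is that both groups have equal variances
levene_stat, levene_p = stats.levene(north, south)
equal_var = bool(levene_p > 0.05)

# Pooled t-test if variances look equal, Welch's t-test otherwise
t_stat, t_p = stats.ttest_ind(north, south, equal_var=equal_var)

test_name = "pooled t-test" if equal_var else "Welch's t-test"
print(f"Levene p = {levene_p:.4f} -> using {test_name}")
print(f"t = {t_stat:.3f}, p = {t_p:.4g}")
```

This is a common rule of thumb, not the only option: some analysts default to Welch's test regardless of the variance check.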


## Non-Parametric Test for Hardness

In [None]:
# Mann-Whitney U test (also called the Wilcoxon rank-sum test)
res_wilcox = stats.mannwhitneyu(water['hardness'][water['location'] == 'North'],
                                water['hardness'][water['location'] == 'South'])
print(f"Mann-Whitney U test result: {res_wilcox}")


## Correlation Tests

In [None]:
# Scatter plot with regression line
sns.lmplot(x='hardness', y='mortality', hue='location', data=water)
plt.xlabel('Calcium concentration (in parts per million)')
plt.ylabel('Averaged annual mortality per 100,000 males')
plt.title('Comparing Water Hardness to Mortality')
plt.show()

# Pearson correlation
pearson_corr = stats.pearsonr(water['hardness'], water['mortality'])
print(f"Pearson correlation: {pearson_corr}")

# Spearman correlation
spearman_corr = stats.spearmanr(water['hardness'], water['mortality'])
print(f"Spearman correlation: {spearman_corr}")


## Comparing Categorical Data

In [None]:
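# Chi-Square test for water data
# Mortality is continuous, so bin it into quartiles to make it categorical
mortality_band = pd.qcut(water['mortality'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Create a contingency table of location vs. mortality quartile
contingency_table = pd.crosstab(water['location'], mortality_band)

# Chi-Square test: returns the statistic, p-value, degrees of freedom, and expected counts
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"Chi-Square statistic: {chi2_stat:.3f}, p-value: {p_value:.4f}, dof: {dof}")

# Pearson residuals: (observed - expected) / sqrt(expected)
residuals = (contingency_table - expected) / np.sqrt(expected)
print(f"Residuals:\n{residuals}")

# Graphing the residuals
sns.heatmap(residuals, annot=True, cmap='coolwarm', center=0)
plt.title('Residuals Heatmap')
plt.show()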


# 📈 Normalization, part 2

### Most of the time, our raw data will not be normalized.

Min-max normalization scales data to a fixed range, typically [0, 1], so that no single feature dominates because of its scale. This matters when features have different units or ranges, since distance-based and gradient-based methods are sensitive to scale. It can also help gradient-based optimization converge faster and more stably, and it makes features easier to compare on a common axis when plotting. Keep in mind that min-max scaling is sensitive to outliers, because the minimum and maximum values set the range.

### Step 1 - graph the raw data (not normalized)

In [None]:
# Graph the unscaled data
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.histplot(water['mortality'], bins=16, kde=True, ax=axes[0])
axes[0].set_title('Distribution of Mortality (Unscaled)')
axes[0].set_xlabel('Mortality')

sns.histplot(water['hardness'], bins=16, kde=True, ax=axes[1])
axes[1].set_title('Distribution of Hardness (Unscaled)')
axes[1].set_xlabel('Hardness')

plt.tight_layout()
plt.show()


### Step 2 - Apply the Min-Max scaler

In [None]:
# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Select the columns to normalize
columns_to_normalize = ['mortality', 'hardness']

# Apply the scaler to the selected columns
water[columns_to_normalize] = scaler.fit_transform(water[columns_to_normalize])

# Display the first few rows of the normalized dataset
water.head()
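
Step 2 overwrites the original `mortality` and `hardness` columns. If the original units are needed later, a fitted `MinMaxScaler` can undo the scaling with `inverse_transform`. Here is a small sketch on a made-up frame (the values and the `demo` names are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up rows shaped like the water data
demo = pd.DataFrame({
    "mortality": [1247.0, 1668.0, 1466.0, 1800.0],
    "hardness": [105.0, 17.0, 5.0, 14.0],
})
cols = ["mortality", "hardness"]

demo_scaler = MinMaxScaler()
demo_scaled = demo.copy()
demo_scaled[cols] = demo_scaler.fit_transform(demo[cols])

# Map the scaled values back to the original units
restored = demo_scaler.inverse_transform(demo_scaled[cols])

print(demo_scaled.round(3))
print(restored)
```

Keeping the fitted scaler object around (or working on a copy of the DataFrame) avoids having to reload the CSV to recover the raw values.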


### Step 3 - Graph the normalized data

In [None]:
# Graph the scaled data
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.histplot(water['mortality'], bins=16, kde=True, ax=axes[0])
axes[0].set_title('Distribution of Mortality (Scaled)')
axes[0].set_xlabel('Mortality')

sns.histplot(water['hardness'], bins=16, kde=True, ax=axes[1])
axes[1].set_title('Distribution of Hardness (Scaled)')
axes[1].set_xlabel('Hardness')

plt.tight_layout()
plt.show()
