**Author:** Shahab Fatemi

**Email:** shahab.fatemi@umu.se   ;   shahab.fatemi@amitiscode.com

**Created:** 2025-06-01

**Last update:** 2025-09-23

**MIT License** — Shahab Fatemi (2025); For use in the *Machine Learning in Physics* course, Umeå University, Sweden; See the full license text in the parent folder.

<hr>

📢 <span style="color:red"><strong> Note for Students:</strong></span>

* Before working on the labs, review your lecture notes.

* Please read all sections, code blocks, and comments **carefully** to fully understand the material. Throughout the labs, my instructions are provided to you in written form, guiding you through the materials step-by-step.

* All concepts covered in this lab are part of the course and may be included in the final exam.

* I strongly encourage you to work in pairs and discuss your findings, observations, and reasoning with each other.

* If something is unclear, don't hesitate to ask.

* I have done my best to make the lab files as bug-free (and error-free) as possible, but remember: *there is no such thing as bug-free code.* If you observe any bugs, errors, typos, or other issues, I would greatly appreciate it if you reported them to me by email. Verbal notifications do not work, as I will likely forget 🙂

ENJOY WORKING ON THIS LAB.
***

# 🛠️ Purpose and Learning Outcomes:

The main focus of this lab is to guide you through the principles and practical application of data analysis and to show you how to apply Gaussian Naive Bayes (GNB) to classification tasks. This notebook is relatively rich in data analysis, which I am sure will be useful for purposes beyond this course. It covers various aspects of statistical data analysis and the ML workflow.
Learning goals include: 
- Understanding the theory behind Gaussian Naive Bayes, 
- exploring and learning feature distributions using Kernel Density Estimates (KDE), histograms, and QQ plots, 
- training and evaluating a GNB classifier with sklearn, 
- interpreting confusion matrices and classification reports, and 
- visualizing decision boundaries for non-linear high-dimensional data. 
***

In [None]:
import sys
import os
sys.path.append(os.path.abspath('../utils'))
from notebook_config import *

***
❗ <span style="color:red"><strong> Important:</strong></span>

Due to consistently low attendance in the lab sessions (<10 per session and I've never seen a few of you in the lab at all), I am unable to assess whether the lab material is being properly understood or not. Therefore, starting from this lab, submitting your answers to the sections marked as "⚡ Mandatory" will be required within the given deadline for each session. This is not applied to the earlier labs.

- Your answers for the "⚡ Mandatory" sections of each lab <span style="color:red"><strong>must be submitted before the start of the next lab session</strong></span> (e.g., for this lab the deadline is September 26 at 13:15).

- You may submit either a handwritten version (on paper) or an electronic version.

- **Submission:** Write all your answers in a single file (PDF, Word document, or plain text) and email it to me, or write them on paper and hand it to me in person.

***

# Naive Bayes: overview

Naive Bayes is a probabilistic `classification` algorithm based on Bayes' theorem. While it can, in principle, be applied to regression, it is primarily used for `classification` tasks. The key assumption is that each feature $x_k$ is conditionally independent of the others given the class label $y$. This "naive" assumption greatly simplifies the computation of the posterior probability $P(y \mid x)$, since it allows the joint likelihood to be expressed as a product of individual feature likelihoods. The algorithm also avoids computing the marginal probability $P(x)$ explicitly: it is the same for every class, so it acts only as a normalization constant and does not change which class has the highest posterior.

Gaussian Naive Bayes (GNB) is a widely used variant of Naive Bayes that assumes continuous features follow a Gaussian (normal) distribution within each class. This generative approach makes it particularly suitable for datasets with continuous variables, or for cases where the number of features is excessively large compared to the number of data samples. GNB is often applied in domains where efficient classification is required and the Gaussian assumption provides a reasonable approximation.
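
Written out, the two assumptions above combine as follows. This is a sketch in standard notation; the symbols $\mu_{k,c}$ and $\sigma_{k,c}^2$ (the mean and variance of feature $x_k$ within class $c$) are introduced here for illustration and may differ from the notation in your lecture notes.

$$
P(y=c \mid \mathbf{x}) \;\propto\; P(y=c)\prod_{k} P(x_k \mid y=c),
\qquad
P(x_k \mid y=c) = \frac{1}{\sqrt{2\pi\sigma_{k,c}^2}}
\exp\!\left(-\frac{(x_k-\mu_{k,c})^2}{2\sigma_{k,c}^2}\right)
$$

The predicted class is the value of $c$ that maximizes the right-hand side of the first expression.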

## Exploring Gaussian Naive Bayes (GNB)

In this lab, you will investigate the GNB algorithm using a dataset that contains three features (x1, x2, and x3) and one text-based label (y). We first need to convert the text label into a numeric format. Then we will visualize the distributions of the features, fit a GNB model, and evaluate its performance. The dataset is developed by me, located in the "dataset" folder, and we will load it from there for our analysis. 

**For now, forget the "Gaussian" distribution, and GNB. We will come back to it.**

Before doing anything, let's load the data using pandas and explore it. If you do not remember pandas, review `Python_Jumpstart/05-Data_Analysis.ipynb`.

In [None]:
import pandas as pd

# Load dataset with 3 features and 1 text-based label. Ignore loading comments in the CSV file.
df = pd.read_csv("../datasets/Gaussian_Naive_Bayes.csv", comment="#")

# Display the first few rows of the dataframe
df.head()

Open the file "../datasets/Gaussian_Naive_Bayes.csv" with your text editor and review its content. 

⚠️ Before beginning any analysis (both in this course and in general practice) carefully inspect the file: check the headers, labels, and any additional notes or housekeeping information it may contain.

In [None]:
# Print data shape and column names
print("Number of rows in the file:", df.shape[0])
print("Number of columns in the file:", df.shape[1])
print("Column names:", list(df.columns))

The "y" column contains text labels. Let's identify and display all unique label values.

In [None]:
# Find unique labels in column "y"
unique_labels = df["y"].unique()
print("Unique labels in 'y':", unique_labels)

We convert the text labels into numeric format using sklearn's `LabelEncoder`. This step is important because many ML algorithms require numeric input, not text labels. If you want, you can instead use NumPy's `unique` function, but it requires further machinery to use it in your model: https://numpy.org/devdocs/reference/generated/numpy.unique.html
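
For comparison, here is a minimal sketch of the NumPy alternative mentioned above. The label values are hypothetical and are not taken from the lab dataset; the point is only to show what the extra "machinery" consists of.

```python
import numpy as np

# Hypothetical text labels, used only for illustration
labels = np.array(["cat", "dog", "bird", "dog", "cat"])

# classes: sorted unique labels
# codes:   for each entry of `labels`, its integer index into `classes`
classes, codes = np.unique(labels, return_inverse=True)

print("Classes:", classes)
print("Codes:  ", codes)

# The extra machinery: you must keep `classes` yourself
# to map integer predictions back to the original text labels
decoded = classes[codes]
print("Decoded:", decoded)
```

`LabelEncoder` stores this mapping for you (in `classes_`) and offers `inverse_transform`, which is why it is more convenient inside a modeling workflow.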

In [None]:
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the labels
y_encoded = label_encoder.fit_transform(df['y'])

# Display the unique numeric labels
print("Unique numeric labels:", set(y_encoded))

In [None]:
# print label mapping
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Label mapping:", label_mapping)

The code comments explain what has been done! If you do not understand, ask your neighbor classmate, or you can ask me.

In [None]:
# Save original labels in a new column
df["y_original"] = df["y"]

# Convert categorical labels to integers and update the dataframe
df["y"] = y_encoded
df.head()

List all available features in the dataset excluding the target variable (labels).

In [None]:
feature_cols = [col for col in df.columns if col not in ["y", "y_original"]]
print(feature_cols)

Split the dataset into training and testing (validation) sets and form new dataframes for each. Note that I've used 30\% of the data for testing, with shuffle=True and stratify=y to ensure that the class labels are proportionally represented in both the training and testing sets. If you do not remember it from earlier labs, please review: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
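
To see what `stratify` does in practice, the sketch below splits a small, deliberately imbalanced toy dataset and compares the class fractions in both parts. The arrays `X_toy` and `y_toy` are hypothetical and unrelated to the lab data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # dummy single feature

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.3,
    stratify=y_toy,   # keep the 80/20 class ratio in both splits
    shuffle=True,
    random_state=42,
)

frac_train = (y_tr == 1).mean()
frac_test = (y_te == 1).mean()
print(f"Fraction of class 1 in train: {frac_train:.2f}")
print(f"Fraction of class 1 in test:  {frac_test:.2f}")
```

Try removing `stratify=y_toy` and changing `random_state` a few times: the two fractions are then free to drift apart, which matters most for small or imbalanced datasets.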

In [None]:
from sklearn.model_selection import train_test_split

X = df[feature_cols]    # Select feature columns
y = df["y"]             # Select the label column

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3, # 30% for testing
                                                    stratify=y,    # Ensure proportional representation of classes
                                                    shuffle=True,  # Shuffle the data
                                                    random_state=42)

# Combine X_train and y_train into one DataFrame (df_train). 
# This is for plotting purposes in seaborn, and not a necessary step for modeling.
df_train = X_train.copy()
df_train["y"] = y_train.copy()

# Combine X_test and y_test into one DataFrame (df_test) for later use
df_test = X_test.copy()
df_test["y"] = y_test.copy()

The function below visualizes the distributions of specified features in a given DataFrame by creating two plots for each feature: a Kernel Density Estimate (KDE) plot and a histogram overlaid with the KDE. Before that, let's review the KDE and how it differs from a classical histogram.

### KDE vs. Histogram
Kernel Density Estimation (KDE) is a non-parametric statistical method used to estimate the probability density function (PDF) of a random variable. Unlike parametric methods, KDE **does not** assume that the data follow a specific distribution (such as Gaussian), which makes it very flexible.

Compared to a histogram, which groups data into bins and can produce a discontinuous shape depending on the chosen bin size, KDE produces a smooth and continuous curve that better captures the structure of the data. The figure below taken from Wikipedia (https://en.wikipedia.org/wiki/Kernel_density_estimation) nicely shows the difference between a histogram and a KDE:
![My figure](../figures/Comparison_of_1D_histogram_and_KDE.png)

KDE is particularly useful when working with small or moderate sample sizes, where histograms may look irregular and fail to reveal meaningful patterns. While histograms remain a simple and valid tool, I recommend developing a habit of visualizing data with KDE.

Study the figure above, and compare the histogram with the KDE plot. Do you understand how the KDE works? Do you see the purpose of the dashed red lines in the KDE plot? Discuss it with your neighbor classmate, and if you are unsure, ask me.

Later in this course, we will discuss non-parametric models in more detail and explore the role of kernel functions beyond density estimation, especially in the context of algorithms such as Support Vector Machines.

You can also read more on Density Estimators from: https://scikit-learn.org/stable/modules/density.html
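
If you want to see the mechanics, a Gaussian KDE can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not what seaborn does internally: the function name `gaussian_kde_1d`, the toy samples, and the hand-picked bandwidth are all hypothetical choices (libraries normally select the bandwidth automatically).

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    # z[i, j]: distance between grid point i and sample j, in bandwidth units
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

samples = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])  # toy data
grid = np.linspace(-10.0, 15.0, 2001)
density = gaussian_kde_1d(samples, grid, bandwidth=1.5)

# A valid density should integrate to approximately 1
area = np.sum(density) * (grid[1] - grid[0])
print(f"Area under the KDE curve: {area:.4f}")
```

Re-run it with a much smaller and a much larger `bandwidth` and plot `density` against `grid` to see how the bandwidth plays a role similar to the bin size in a histogram.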

Let's visualize the distributions of specified features.

In [None]:
import seaborn as sns

# plot KDE and Histogram for each feature
def plot_distributions(df, features):
    fig, axes = plt.subplots(2, len(features), 
                    figsize=(5*len(features), 8), dpi=200)

    for i, feature in enumerate(features):
        # KDE plot
        sns.kdeplot(df[feature], fill=True, 
                    color="#4991d9", 
                    lw=2, ax=axes[0, i])
        axes[0, i].set_xlabel(feature)
        axes[0, i].set_ylabel("Density")
        axes[0, i].set_title(f"KDE {feature}")
        axes[0, i].grid(True)

        # Histogram with KDE
        sns.histplot(df[feature], kde=True, 
                     bins=25, 
                     color="#B25FB5", 
                     edgecolor="k", 
                     lw=2, ax=axes[1, i])
        axes[1, i].set_xlabel(feature)
        axes[1, i].set_ylabel("Counts")
        axes[1, i].set_title(f"Histogram {feature}")
        axes[1, i].grid(True)

    plt.suptitle("Feature distributions (Training Set)", fontsize=20, weight="bold")
    plt.show()

# Call on training data
plot_distributions(X_train, feature_cols)

***
### ✅ Check your understanding

- Examine the histogram and KDE plots for each feature. What do these plots show you about the data distribution? NOTE: The purple curve superimposed on each histogram is indeed the KDE line. However, note that the horizontal axis ranges in the top panels differ from those in the corresponding bottom panels.

- How do different features compare in shape and spread? Remember that axis ranges are different between features and between the KDE and histogram plots. Consider this difference in your interpretation.

***

The code below generates KDE plots for every one of the features (${x_1, x_2, x_3}$) within different classes of our target variable, $y$. It creates a series of KDEs for each feature, showing $P(x_i | y)$. The results look very similar to the plot you have in your lecture notes (see the last slides).

In [None]:
def plot_kde_by_class(df, features):
    fig, axes = plt.subplots(1, len(features), figsize=(6*len(features), 6), dpi=200)
    for i, feature in enumerate(features):
        ax = axes[i]

        # Filled KDE plot
        sns.kdeplot(data=df, x=feature, hue="y", fill=True, common_norm=False, alpha=0.2,
                    ax=ax, palette="muted")

        # Overlay line KDE plot for clarity
        sns.kdeplot(data=df, x=feature, hue="y", common_norm=False, lw=3,
                    ax=ax, palette="muted", legend=False)

        ax.set_xlabel(feature, fontsize=20)
        ax.set_ylabel(f"P({feature} | Y=y)", fontsize=20)
        ax.set_title(f"KDE {feature}", fontsize=22, weight="bold")
        ax.grid(True)

    plt.show()

# Call with training data
plot_kde_by_class(df_train, feature_cols)

***
### ⚡ Mandatory submission
- The plots above show a visual presentation of $P(x_k | y)$. What does it mean and how is it different from $P(x_k)$? Write your answer in a short paragraph (see the submission instruction at the beginning of this notebook). 

***

### Quantile-Quantile (QQ) Analysis

As mentioned earlier, our goal is to use Gaussian Naive Bayes (GNB) to train the model and provide probabilistic estimates for the data. In the previous step, we applied KDE to visualize the feature distributions. The next question is: how well does a "Gaussian" (normal) distribution describe our data? To evaluate this, we turn to Quantile–Quantile (QQ) analysis.

A **quantile** is a statistical term. It refers to values that divide a dataset into equal-sized intervals. For example, the median is a special case of a quantile that divides the dataset into two equal halves.

Quantiles are a fundamental statistical concept that provide valuable information on the distribution of data, such as identifying skewness or the presence of outliers. A Quantile-Quantile (QQ) plot shows the quantiles of the feature's values against the quantiles of the theoretical distribution (often a Gaussian distribution or a distribution obtained from a KDE). If the points on the QQ plot lie approximately along a straight 45-degree line, it suggests that the feature's distribution closely matches the theoretical distribution. If you have forgotten what the QQ plot is, visit https://www.geeksforgeeks.org/machine-learning/quantile-quantile-plots/ and 
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html

Side note: A commonly used quantile is the `percentile`, which divides data into 100 equal parts.
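
The relation between the median, quantiles, and percentiles can be checked numerically. The sample below is randomly generated for illustration only and is not the lab dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1000)  # hypothetical sample

median = np.median(x)
q50 = np.quantile(x, 0.5)                      # the 0.5 quantile is the median
quartiles = np.quantile(x, [0.25, 0.5, 0.75])  # split the data into four parts
p25 = np.percentile(x, 25)                     # 25th percentile = 0.25 quantile

print("Median:       ", median)
print("0.5 quantile: ", q50)
print("Quartiles:    ", quartiles)
print("25th percentile:", p25)
```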

Below, we show the QQ plot for the features in our dataset to visually assess how closely the distributions of these features approximate a Gaussian distribution. This allows us to identify deviations from normality, which is a very important analysis.
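
Besides the points themselves, `scipy.stats.probplot` also returns a least-squares fit through them, including a correlation coefficient `r`. As a rough numerical companion to the visual check, the sketch below compares `r` for two randomly generated samples (hypothetical data, not the lab features); a value closer to 1 means the points lie closer to a straight line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
gaussian_sample = rng.normal(size=500)     # drawn from a normal distribution
skewed_sample = rng.exponential(size=500)  # strongly right-skewed

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r))
(_, _), (_, _, r_gauss) = stats.probplot(gaussian_sample, dist="norm")
(_, _), (_, _, r_skew) = stats.probplot(skewed_sample, dist="norm")

print(f"r for the Gaussian sample: {r_gauss:.4f}")
print(f"r for the skewed sample:   {r_skew:.4f}")
```

Treat `r` as a summary only: it does not tell you where the deviations occur (tails or center), which is exactly what the plot is for.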

In [None]:
from scipy import stats

def plot_qq_plots(df, features):
    fig, axes = plt.subplots(1, len(features), figsize=(6*len(features), 5), dpi=200)

    for i, feature in enumerate(features):
        # Get theoretical quantiles and ordered values
        (osm, osr), (slope, intercept, r) = stats.probplot(df[feature], dist="norm")
        
        ## ----------------------------------------------------------------
        ## NOTE: If you want to test other distributions, you can change the dist parameter 
        #   to "uniform", "expon", "lognorm", etc.
        #   Find the full list from https://docs.scipy.org/doc/scipy/reference/stats.html
        #   However, for GNB, we are interested in "norm".
        ## ----------------------------------------------------------------

        ax = axes[i]
        ax.scatter(osm, osr, marker="o", s=70, color="tomato", alpha=0.7)
        ax.plot(osm, osm*slope + intercept, "--", color="k", lw=2)  # Reference line

        ax.set_xlabel("Theoretical Quantiles", fontsize=20)
        ax.set_ylabel("Ordered Values", fontsize=20)
        ax.set_title(f"Q-Q plot for {feature}", fontsize=22, weight="bold")
        ax.grid(True)

    plt.show()

# Call with training data
plot_qq_plots(X_train, feature_cols)

***
### ⚡ Mandatory submission
Carefully analyze the QQ plot above. What do the deviations from the dashed line mean? Write your answers in one short paragraph.
***

Let's perform further analysis of our data using pairplot and boxplot. We used them before in the Python Jumpstart, file 03-Data_Plotting.ipynb.

In [None]:
def plot_pair_and_box(df):
    # pairplot creates its own figure, so no plt.figure() call is needed here
    sns.pairplot(df, hue="y", diag_kind="kde", palette="bright")
    plt.show()

    plt.figure()
    sns.boxplot(data=df.drop(columns="y"))
    plt.title("Boxplot of features")
    plt.show()
    
plot_pair_and_box(df_train)

***
### ✅ Check your understanding

- The code section above is very simple to understand, and its presented outcomes are very important. What do you observe? Do you understand the distribution of the features and their relation?

***

### ⚡ Mandatory submission
- What does the box plot for the ${x_1}$ feature tell us about the data distribution? Explain all your observations in a short paragraph.

***

The code below shows you how to use `Seaborn` to visualize joint and marginal distributions. This is mainly for you to see different variations in data analysis and visualizations. It creates a customized jointplot of two features (`x1` and `x3`) from the training set, combining a scatter plot with a regression line in the center and marginal histograms with KDE curves. This setup allows you to explore both the joint distribution and the individual feature distributions in a visually informative way.

In [None]:
from scipy.stats import gaussian_kde

# Define color palette
palette_color = "#1fb435"  # Green for points
line_color    = "#ff7f0e"  # Orange for regression line
kde_color     = "#005f73"  # Deep blue for KDE
hist_color    = "#98c1d9"  # Light blue for hist bars

# Note: sns.jointplot creates its own figure (sized by `height`),
# so a separate plt.figure() call is not needed and would only add an empty figure

# Create jointplot
jn = sns.jointplot(
    data=df_train,
    x="x1",
    y="x3",
    kind="reg",
    height=8,
    ratio=5,
    space=0.3,
    scatter_kws={
        "s": 60,
        "color": palette_color,
        "marker": "o",
        "edgecolor": "k",
        "alpha": 0.7
    },
    line_kws={
        "color": line_color,
        "linewidth": 2,
        "linestyle": "--"
    },
    marginal_kws={"bins": 30, "color": hist_color}
)

jn.set_axis_labels("Feature x1", "Feature x3", fontsize=14)

plt.show()

Now we have explored our data and we want to train a Gaussian Naive Bayes model using the features and plot the confusion matrix (contingency table). I do not think you need my explanation on what has been done below! Study the code, run it, and carefully analyze the results.
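
Before using sklearn, it may help to see how little computation GNB actually involves. The sketch below is a bare-bones NumPy version written for illustration only: the function names `gnb_fit` and `gnb_predict` and the toy data are hypothetical, and it omits details that `GaussianNB` handles for you (for example the `var_smoothing` term that stabilizes very small variances).

```python
import numpy as np

def gnb_fit(X, y):
    """Estimate class priors and the per-class mean/variance of every feature."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, priors, means, variances

def gnb_predict(X, classes, priors, means, variances):
    """Return the class with the largest log-posterior for each row of X."""
    # diff[i, c, k] = x_ik - mean of feature k in class c
    diff = X[:, None, :] - means[None, :, :]
    # Sum of per-feature Gaussian log-likelihoods (the "naive" product, in logs)
    log_lik = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances)[None, :, :] + diff**2 / variances[None, :, :],
        axis=2,
    )
    log_post = np.log(priors)[None, :] + log_lik  # P(x) is omitted: same for all classes
    return classes[np.argmax(log_post, axis=1)]

# Hypothetical toy data: two well-separated Gaussian blobs in 2D
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-2.0, 1.0, (100, 2)),
                   rng.normal(2.0, 1.0, (100, 2))])
y_toy = np.array([0] * 100 + [1] * 100)

params = gnb_fit(X_toy, y_toy)
y_hat = gnb_predict(X_toy, *params)
accuracy = (y_hat == y_toy).mean()
print(f"Training accuracy on the toy data: {accuracy:.3f}")
```

Working with log-probabilities instead of multiplying many small numbers avoids numerical underflow, which is a common design choice for Naive Bayes implementations.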

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

def train_evaluate(df_train, df_test, features):
    X_train = df_train[features]  # Select features
    y_train = df_train["y"]          # Class Labels
    
    X_test = df_test[features]
    y_test = df_test["y"]

    # Standardize features and create a pipeline with Gaussian Naive Bayes
    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("gnb", GaussianNB())
    ])
    
    # Fit the model and make predictions
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    # Perform cross-validation to evaluate the model
    scores = cross_val_score(pipeline, X_train, y_train, cv=5)
    print(f"Cross-validation Accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
        
    print("---------------------------------------------")
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("---------------------------------------------\n")
        
    # Print/plot confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix: \n", cm)
        
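    # Use the classes seen by the fitted model instead of the global variable y
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, 
                                  display_labels=pipeline.named_steps["gnb"].classes_)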
    disp.plot(cmap="Blues")
    plt.title("Confusion Matrix")
    plt.show()    
    
    return pipeline

train_evaluate(df_train, df_test, feature_cols)

***
### ✅ Check your understanding
- Carefully analyze the classification report and the confusion matrix. How good is our classifier?
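
As an aid for reading the report, the sketch below shows how the main metrics follow from a confusion matrix. The matrix values are made up for illustration and are not results from this lab; the layout follows sklearn's convention (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class
cm = np.array([[50,  2,  3],
               [ 4, 40,  6],
               [ 1,  5, 44]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # of everything predicted as class c, fraction correct
recall = true_positives / cm.sum(axis=1)     # of everything truly class c, fraction recovered
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

print("Precision per class:", np.round(precision, 3))
print("Recall per class:   ", np.round(recall, 3))
print("F1 per class:       ", np.round(f1, 3))
print("Accuracy:           ", round(accuracy, 3))
```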

***

Below we visualize the decision boundaries of our GNB classifier, trained on two selected features from a dataset. We only do that for two of the provided features because we show the boundary in a 2D plot. 
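
The general recipe behind such a plot is to evaluate the classifier on a dense grid of points and color each grid cell by the predicted label. The sketch below shows only the grid bookkeeping on a tiny grid; the threshold rule standing in for a classifier is a hypothetical placeholder, not a trained model.

```python
import numpy as np

# A coarse grid with 4 points along x and 3 along y
xx, yy = np.meshgrid(np.linspace(0.0, 1.0, 4), np.linspace(0.0, 2.0, 3))

# Flatten to one (x, y) row per grid point, the shape a predict() method expects
grid = np.c_[xx.ravel(), yy.ravel()]

# Stand-in for model.predict(grid): label 1 where x + y > 1.5, else 0
labels = (grid[:, 0] + grid[:, 1] > 1.5).astype(int)

# Reshape back to the grid layout so that a contour plot can use it
label_map = labels.reshape(xx.shape)

print("xx shape:       ", xx.shape)
print("grid shape:     ", grid.shape)
print("label_map shape:", label_map.shape)
```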

In [None]:
from matplotlib.colors import ListedColormap

def plot_decision_boundaries(pipe, df, features):
    assert len(features) == 2, "Decision boundaries can only be plotted for 2D data."

    X = df[features].values
    y = df["y"].values
    
    scaler = pipe.named_steps["scaler"]    # Get the scaler from the pipeline
    gnb    = pipe.named_steps["gnb"]       # Get the Gaussian Naive Bayes model from the pipeline

    # Create a grid for plotting decision boundaries
    X_scaled = scaler.fit_transform(X)  # Scale the features
    
    # Define the grid limits with a margin for better visualization
    # Adjust the margin to ensure the decision boundaries are clearly visible
    x_min, x_max = X_scaled[:, 0].min() - 1, X_scaled[:, 0].max() + 1
    y_min, y_max = X_scaled[:, 1].min() - 1, X_scaled[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))
    grid = np.c_[xx.ravel(), yy.ravel()]
    
    # Predict the class labels for the grid points
    predicted = gnb.predict(grid).reshape(xx.shape)

    plt.figure()
    plt.contourf(xx, yy, predicted, cmap=ListedColormap(["#8F8FEE","#F49696", "#92F792"]), alpha=0.5)
    scatter = plt.scatter(X_scaled[:, 0], X_scaled[:, 1], s=30, c=y, edgecolors="k", cmap="brg", alpha=0.7)
    
    plt.xlabel(features[0])
    plt.ylabel(features[1])
    plt.title("Decision Boundaries (Scaled Data)")
    plt.legend(*scatter.legend_elements(), title="Label")
    plt.show()

# For 2D visualization, we will use only x1 and x2
features_2d = ["x1", "x2"]
pipe_2d = train_evaluate(df_train, df_test, features_2d)

plot_decision_boundaries(pipe_2d, df_test, features_2d)

We usually focus on decision boundaries from the test set, but it is also useful to look at the training set to check how well the model has learned from it.

In [None]:
plot_decision_boundaries(pipe_2d, df_train, features_2d)

***
### ✅ Check your understanding
- Carefully analyze the classification report and ensure you understand the meaning of each metric. 

- Modify the selected feature pair from $[x_1, x_2]$ to $[x_1, x_3]$ and to $[x_2, x_3]$, running the decision boundary code each time. You can execute them in separate code cells to more easily compare the results and observe how the classifier's performance and boundaries change with different feature combinations.

### ⚡ Mandatory submission

- What is your model accuracy when using the feature pair $[x_1, x_3]$, and how does it differ from the accuracy with $[x_1, x_2]$? Explain your observations in a short paragraph.

- Find the line 
```python
    X_scaled = scaler.fit_transform(X)  # Scale the features
```
in the code section above. What does the `fit_transform` method do in the context of the `StandardScaler`? Explain its purpose and how it differs from using `fit` and `transform` separately. Write your answer in a short paragraph.

***

# ⛷️ Exercise

### Stellar Data Analysis for Gaussian Naive Bayes (GNB)

In this exercise, you will analyze a dataset that describes the properties of stars based on astrophysical principles. The dataset contains stellar parameters, including **Mass**, **Radius**, **Temperature**, and **Luminosity**. Most of the values are normalized with respect to our Sun: the **Mass** is given in solar masses, the **Radius** in solar radii, the **Temperature** in Kelvin, and the **Luminosity** in solar luminosities.

Your objective is to analyze the data, explore the relationships between the given parameters, and develop regression models to predict the target (i.e., luminosity) from the given features (i.e., mass, radius, and temperature). You will also explore the use of Gaussian Naive Bayes (GNB), which is typically better suited to classification tasks, in a regression context.

**Note:** GNB is primarily a classification algorithm, but for the purpose of this exercise (i.e., education and practice), you will learn how to adapt it for regression tasks by discretizing the target variable (luminosity) into categorical classes. 

### Tasks

1. Begin with loading the stellar dataset using the code below. The luminosity is calculated using the Stefan-Boltzmann Law, and it contains noise.

   ```python
   # Load Luminosity Dataset
   import pandas as pd

   data = pd.read_csv("../datasets/stellar_luminosity.csv", comment="#")
   ```

2. Visualize the distributions of the data. Use pair plots, histograms, and correlation matrices to perform a comprehensive analysis. This will help you understand the data structure, identify patterns, and observe the relationships between parameters such as mass, radius, temperature, and luminosity. 

3. Start by implementing a regression model using techniques you have learned so far, such as polynomial regression or linear regression, to predict luminosity. Assess the model's performance using cross-validation to calculate the $R^2$ score, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE). These metrics will quantify the model's accuracy in predicting luminosity based on the other features.

4. Next, you will implement GNB. Since predicting a continuous target like luminosity is not suitable for GNB, you need to discretize the luminosity values into categories (e.g., low, medium, high). This transformation allows you to treat the problem as a `classification` task. After discretization, you can train the GNB model and evaluate its performance using accuracy metrics, confusion matrices, and classification reports. Compare the results of GNB with your regression model to analyze the differences in performance.

5. Finally, visualize your results by plotting the predicted versus actual luminosity values. Additionally, analyze the residuals (the difference between predicted and actual values) to assess the performance of your models. Create histograms and scatter plots of the residuals to interpret their distribution and identify any patterns. Discuss the accuracy of both models and potential biases that may influence their predictions.
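
As a reminder of what the regression metrics in task 3 measure, here is a minimal NumPy sketch. The helper name `regression_metrics` and the four toy values are hypothetical; in your solution you can equally use the corresponding functions from `sklearn.metrics`.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (R2, MAE, RMSE) for two equally long sequences of values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred            # the sign convention does not affect these metrics
    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals**2))
    ss_res = np.sum(residuals**2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean())**2)    # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    return r2, mae, rmse

# Toy values for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

r2, mae, rmse = regression_metrics(y_true, y_pred)
print(f"R2 = {r2:.3f}, MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```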

### Important to consider

The GNB is primarily designed for classification tasks. The key considerations include:

- Transforming the continuous target variable (luminosity) into discrete categories allows GNB to apply its probabilistic framework. 

- GNB operates under the assumption that the features are conditionally independent given the class label. This assumption may not hold in the context of your astrophysical data, where parameters like mass and radius can be correlated. Understanding these limitations is crucial for interpreting GNB's predictions.

- By comparing GNB's performance with a traditional regression model, you will learn the strengths and weaknesses of using a classification approach for predicting continuous outcomes. This analysis will highlight the importance of **selecting the appropriate model** based on the **nature of the data** and the specific task at hand.
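
One possible way to discretize a continuous target is to cut it at its quantiles, which gives classes with roughly equal sample counts. The sketch below assumes three classes and uses randomly generated stand-in values rather than the stellar dataset; the number of classes and the choice of quantile-based edges are illustrative assumptions, and other binning schemes are equally valid.

```python
import numpy as np

rng = np.random.default_rng(0)
luminosity = rng.lognormal(mean=0.0, sigma=1.0, size=300)  # hypothetical stand-in values

# Tercile edges: one third of the samples falls below each successive edge
edges = np.quantile(luminosity, [1 / 3, 2 / 3])

# np.digitize returns the bin index of each value: 0 = low, 1 = medium, 2 = high
lum_class = np.digitize(luminosity, edges)
counts = np.bincount(lum_class, minlength=3)

print("Bin edges:        ", edges)
print("Samples per class:", counts)
```

In a real workflow, compute the bin edges from the training set only and reuse them for the test set, so that no information from the test data leaks into the labels.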


***
END
***