**Advanced Astroinformatics (Semester 1 2025)**
# Supervised Classification, Data Processing Pipelines

**Advanced Astroinformatics Student Project**

*N. Hernitschek*



---
## Contents
* [Recap, Questions](#first-bullet)
* [Data Processing Pipelines](#fourth-bullet)
* [Summary](#fifth-bullet)


## 1. Recap, Questions <a class="anchor" id="first-bullet"></a>

Time for questions!

Your **tasks until this week** were:

Based on what you have seen so far, apply a multiclass supervised machine learning algorithm to the three TESS feature data sets, and produce diagnostic plots and classification scores.

Hint: Use the `scikit-learn` documentation (a good starting point: https://scikit-learn.org/stable/modules/multiclass.html). You can also search for code examples and reuse parts of them; reusing code is a great way to learn. As always, never reuse code without understanding what it does!
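
As a minimal sketch of what such a multiclass workflow can look like, the following example trains a random forest and prints a confusion matrix and per-class scores. The data are synthetic (generated with `make_classification`) and only stand in for a TESS feature table; the number of classes, features, and trees are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a feature table (assumption: 3 classes, 6 features).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Stratify so that every class appears in both splits in similar proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall and F1 score.
print(classification_report(y_test, y_pred))
```

To apply this to the real data, `X` and `y` would be replaced by the feature matrix and the encoded class labels of one of the three data sets.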

What are your **results** so far? Can you show plots?
 

If this works: 

Use the k-means algorithm on the three TESS feature data sets, including making diagnostic plots.

Try to interpret your results.
How do your results differ between a) `_TESS_lightcurves_outliercleaned`, b) `_TESS_lightcurves_median_after_detrended`, and c) `_TESS_lightcurves_raw`?
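
A minimal k-means sketch, assuming a synthetic two-feature data set from `make_blobs` in place of the TESS features, could look as follows. It reports two common diagnostics: the adjusted Rand index (which needs known classes) and the silhouette score (which does not). The number of clusters is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one feature table (assumption: 3 groups, 2 features).
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# k-means uses Euclidean distances, so put all features on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Adjusted Rand index: agreement with the known classes,
# independent of how the cluster labels are numbered.
ari = adjusted_rand_score(y_true, labels)

# Silhouette score: compactness and separation of the clusters,
# computed without any reference labels.
sil = silhouette_score(X_scaled, labels)

print(f"ARI = {ari:.2f}, silhouette = {sil:.2f}")
```

Running the same lines once per data version (raw, detrended, outlier-cleaned) gives three pairs of scores that can be compared directly.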

## 2. Data Processing Pipelines <a class="anchor" id="fourth-bullet"></a>

We have already seen *data processing pipelines* to some extent:

* TESS light-curve data were first detrended and then outlier-cleaned: `_TESS_lightcurves_raw` $\rightarrow$ `_TESS_lightcurves_median_after_detrended` $\rightarrow$ `_TESS_lightcurves_outliercleaned`
* features were calculated from the (outlier-cleaned) TESS light-curve data
* features were used for classification
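
The chain of steps above can be sketched as three small functions applied one after another. This is an illustration under stated assumptions, not the processing that produced the course data: the running-median detrending is a simple stand-in for the actual detrending, the clipping threshold and the three features are hypothetical choices, and the light curve is synthetic.

```python
import numpy as np

def detrend(flux, window=51):
    """Divide the flux by a running median (window length in samples, odd)."""
    half = window // 2
    padded = np.pad(flux, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, window)
    return flux / np.median(windows, axis=1)

def clip_outliers(time, flux, n_sigma=5.0):
    """Keep points within n_sigma robust standard deviations of the median."""
    deviation = np.abs(flux - np.median(flux))
    # 1.4826 * MAD approximates the standard deviation for Gaussian noise.
    sigma = 1.4826 * np.median(deviation)
    keep = deviation < n_sigma * sigma
    return time[keep], flux[keep]

def extract_features(flux):
    """Compute a few simple summary statistics of a light curve."""
    mean, std = np.mean(flux), np.std(flux)
    p5, p95 = np.percentile(flux, [5, 95])
    return {
        "std": std,
        "amplitude": 0.5 * (p95 - p5),
        "skew": np.mean(((flux - mean) / std) ** 3),
    }

# Synthetic light curve: slow trend + periodic signal + noise + three outliers.
rng = np.random.default_rng(0)
time = np.linspace(0.0, 27.0, 1000)
flux = 1.0 + 0.002 * time + 0.01 * np.sin(2 * np.pi * time / 0.5)
flux += rng.normal(0.0, 0.002, time.size)
outlier_idx = np.array([100, 500, 900])
flux[outlier_idx] += 0.5

# The pipeline: raw -> detrended -> outlier-cleaned -> features.
flux_detrended = detrend(flux)
time_clean, flux_clean = clip_outliers(time, flux_detrended)
features = extract_features(flux_clean)
print(len(flux), len(flux_clean), features)
```

Keeping each stage as its own function makes it easy to swap one stage (for example, a different detrending method) and check how the features and the classification downstream change.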

**Question:** 
Can you think of ways to improve the classification process?
What exactly does "improve" mean? What do we want to achieve?
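
One way to make "improve" concrete is to fix the evaluation before changing anything else. The sketch below, which assumes synthetic and deliberately imbalanced data, evaluates one model with three metrics under the same cross-validation splits. On imbalanced classes, plain accuracy and balanced accuracy can differ noticeably, so the choice of metric is part of the answer. The class weights and model settings are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for a feature table (assumption).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, weights=[0.7, 0.2, 0.1],
    random_state=0,
)

# Scaling and classification in one estimator, so that the scaler is
# refitted on the training part of every fold.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
metrics = ["accuracy", "balanced_accuracy", "f1_macro"]
scores = cross_validate(model, X, y, cv=cv, scoring=metrics)

for name in metrics:
    values = scores[f"test_{name}"]
    print(f"{name}: {values.mean():.3f} +/- {values.std():.3f}")
```

Any proposed change to the pipeline (other features, other cleaning, another classifier) can then be judged by rerunning exactly this evaluation.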



Large existing and upcoming surveys, such as LSST, rely (or will rely) heavily on data processing pipelines. For LSST, see for example:
    
    
https://www.lsst.org/about/dm/pipelines

https://antares.noirlab.edu/pipeline
    
Such surveys aim to:

* find "unknowns" and "unknown unknowns"
* find more examples of known types to build catalogs
    
**Question:**
How does this relate to the science you are currently doing or planning to do?
    

In [None]:
import pathlib
from pathlib import Path
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import pandas as pd
import os
from astropy.timeseries import LombScargle
import numpy as np
import corner
from scipy import stats
import matplotlib.pyplot as plt
import pylab as pl
from matplotlib.colors import ListedColormap
import seaborn as sns
import feets
from scipy.stats import gaussian_kde
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, adjusted_rand_score, ConfusionMatrixDisplay


le = LabelEncoder()


path = "/Users/guillermow/Documents/PhD/1st Semester/Advanced Astroinformatics/_data/"
lc_median= path + "_TESS_lightcurves_median_after_detrended/"
lc_raw=path + '_TESS_lightcurves_raw/'
lc_cleaned=path + '_TESS_lightcurves_outliercleaned/'
info_tess_data=path + 'info_tess_data.txt'

paths_raw = sorted([str(p) for p in Path(lc_raw).iterdir()])
paths_raw = paths_raw[1:] 

paths_median = sorted([str(p) for p in Path(lc_median).iterdir()])
paths_median = paths_median[1:]

paths_cleaned = sorted([str(p) for p in Path(lc_cleaned).iterdir()])
paths_cleaned = paths_cleaned[1:] 

suffix_cleaned = np.array([Path(a).name for a in paths_cleaned])
suffix_raw=suffix_cleaned
subtext_median= '_lc_median_after_cbv_detrended_'
suffix_median=np.array([subtext_median + a for a in suffix_raw])

X_clean_1ft = []
X_raw_1ft = []
X_median_1ft = []

X_clean_2ft = []
X_raw_2ft = []
X_median_2ft = []

y = []

a=4
b=2

for suffix in suffix_cleaned:
    folder_path = f'{path}features/{suffix}'
    df_clean = pd.read_csv(f'{folder_path}/clean_feats.csv', index_col=0)
    df_raw = pd.read_csv(f'{folder_path}/raw_feats.csv', index_col=0)
    df_median = pd.read_csv(f'{folder_path}/median_feats.csv', index_col=0)
    
    cols = df_clean.columns[:-1] 
    X_clean_1ft.append(df_clean[cols[a]].values)
    X_raw_1ft.append(df_raw[cols[a]].values)
    X_median_1ft.append(df_median[cols[a]].values)
    
    X_clean_2ft.append(df_clean[cols[b]].values)
    X_raw_2ft.append(df_raw[cols[b]].values)
    X_median_2ft.append(df_median[cols[b]].values)
    
    y.append([suffix for _ in range(len(df_clean[cols[0]].values))])
    
feats = df_clean.columns
X_1ft = np.concatenate(X_raw_1ft)
X_2ft = np.concatenate(X_raw_2ft)
X = np.vstack([X_1ft, X_2ft]).T

y_string = np.concatenate(y)
y = le.fit_transform(y_string) #Turning the labels into numerical values

def plot_confusion_matrix(conf_mat, title):
    plt.figure(figsize=(14, 10))
    norm_conf_mat = conf_mat.astype('float') / conf_mat.sum(axis=1, keepdims=True)
    if 'suffix_cleaned' in globals() and len(suffix_cleaned) == conf_mat.shape[0]:
        class_labels = suffix_cleaned
    else:
        class_labels = np.arange(conf_mat.shape[0])
    sns.heatmap(norm_conf_mat, 
                xticklabels=class_labels, 
                yticklabels=class_labels,
                cmap="Blues", annot=True, fmt=".2f", cbar=True)
    plt.title(title)
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.tight_layout()
    plt.show()


def plot_decision_boundary(X, y, title, num_label, ab):
    
    X_clean_1ft = []
    X_raw_1ft = []
    X_median_1ft = []

    X_clean_2ft = []
    X_raw_2ft = []
    X_median_2ft = []

    y = []

    a, b = ab

    for suffix in suffix_cleaned:
        folder_path = f'{path}features/{suffix}'
        df_clean = pd.read_csv(f'{folder_path}/clean_feats.csv', index_col=0)
        df_raw = pd.read_csv(f'{folder_path}/raw_feats.csv', index_col=0)
        df_median = pd.read_csv(f'{folder_path}/median_feats.csv', index_col=0)
        
        cols = df_clean.columns[:-1] 
        X_clean_1ft.append(df_clean[cols[a]].values)
        X_raw_1ft.append(df_raw[cols[a]].values)
        X_median_1ft.append(df_median[cols[a]].values)
        
        X_clean_2ft.append(df_clean[cols[b]].values)
        X_raw_2ft.append(df_raw[cols[b]].values)
        X_median_2ft.append(df_median[cols[b]].values)
        
        y.append([suffix for _ in range(len(df_clean[cols[0]].values))])
        
    feats = df_clean.columns
    X_1ft = np.concatenate(X_raw_1ft)
    X_2ft = np.concatenate(X_raw_2ft)
    X = np.vstack([X_1ft, X_2ft]).T

    y_string = np.concatenate(y)
    y = le.fit_transform(y_string) #Turning the labels into numerical values
    
    cmap = plt.get_cmap('gnuplot')
    folds = 10
    k_fold = KFold(folds, shuffle=True, random_state=1)

    # Store out-of-fold results at the positions of the test samples, so that
    # boolean masks built from them line up with the rows of X.
    predicted_targets = np.empty(len(y), dtype=int)
    actual_targets = np.empty(len(y), dtype=int)

    input_folds = np.empty(folds, dtype=object)
    result_folds = np.empty(folds, dtype=object)

    fold = 0

    for train_ix, test_ix in k_fold.split(X):
        train_x, train_y = X[train_ix], y[train_ix]
        test_x, test_y = X[test_ix], y[test_ix]

        # Fit the scaler on the training fold only to avoid information leakage.
        scaler = StandardScaler()
        train_x = scaler.fit_transform(train_x)
        test_x = scaler.transform(test_x)

        classifier = RandomForestClassifier(n_estimators=10, criterion='gini', random_state=0, bootstrap=True)
        classifier.fit(train_x, train_y)
        predicted_labels = classifier.predict(test_x)

        predicted_targets[test_ix] = predicted_labels
        actual_targets[test_ix] = test_y

        input_folds[fold] = test_y
        result_folds[fold] = predicted_labels

        fold += 1

    fig, ax = plt.subplots(1, 2, figsize=(10, 5))
    if num_label < len(le.classes_):  # the slider maximum selects all classes
        predicted_targets = predicted_targets.astype(int)
        mask1 = (predicted_targets == num_label)
        string_label = le.inverse_transform([num_label])[0] 

        idx = num_label
        unique_numerical_labels_test = np.unique(predicted_targets)

        ax[0].scatter(
            X[mask1, 0], X[mask1, 1],
            c=[cmap(idx / len(np.unique(y)))],
            alpha=0.6,
            label=string_label,
            edgecolors='k',
            s=10,
        )
        ax[0].set_title("Test Predictions")
        ax[0].legend()
        ax[0].set_xlim(X[:, 0].min() - 0.15, X[:, 0].max() + 0.15)
        ax[0].set_ylim(X[:, 1].min() - 0.15, X[:, 1].max() + 0.15)
        ax[0].set_ylabel(feats[b])

        mask2 = (y == num_label)
        unique_numerical_labels_train = np.unique(y)

        ax[1].scatter(
            X[mask2, 0], X[mask2, 1],
            c=[cmap(idx / len(unique_numerical_labels_train))],
            label=string_label,
            alpha=0.6,
            edgecolors='k',
            s=10,
        )
        ax[1].set_title("Training Samples")
        ax[1].legend()
        ax[1].set_xlim(X[:, 0].min() - 0.15, X[:, 0].max() + 0.15)
        ax[1].set_ylim(X[:, 1].min() - 0.15, X[:, 1].max() + 0.15)
        ax[1].set_xlabel(feats[a])

        plt.suptitle(f"Feature Comparison - {title}")
        plt.tight_layout()
        plt.show()
    
    else:
        unique_numerical_labels_test = np.unique(predicted_targets)
        unique_numerical_labels_train = np.unique(y)

        for idx, num in enumerate(unique_numerical_labels_train):
            mask1 = (predicted_targets == num)
            string_label = le.inverse_transform([int(num)])[0]

            ax[0].scatter(
                X[mask1, 0], X[mask1, 1],
                c=[cmap(idx / len(unique_numerical_labels_train))],
                label=string_label,
                alpha=0.5,
                edgecolors='k',
                s=10,
            )
            ax[0].set_title("Test Predictions")
            ax[0].set_ylabel(feats[b])
            #ax[0].legend()
            ax[0].set_xlim(X[:, 0].min() - 0.15, X[:, 0].max() + 0.15)
            ax[0].set_ylim(X[:, 1].min() - 0.15, X[:, 1].max() + 0.15)
            
            
        for idx, num in enumerate(unique_numerical_labels_train):
            mask2 = (y == num)
            string_label = le.inverse_transform([int(num)])[0] 

            ax[1].scatter(
                X[mask2, 0], X[mask2, 1],
                c=[cmap(idx / len(unique_numerical_labels_train))],
                alpha=0.5,
                label=string_label,
                edgecolors='k',
                s=10,
            )
            ax[1].set_title("Training Samples")
            ax[1].set_xlabel(feats[a])
            ax[1].legend(bbox_to_anchor=(1, 1.125))
            ax[1].set_xlim(X[:, 0].min() - 0.15, X[:, 0].max() + 0.15)
            ax[1].set_ylim(X[:, 1].min() - 0.15, X[:, 1].max() + 0.15)
            
        plt.suptitle(f"Feature Comparison - {title}")
        plt.tight_layout()
        plt.show()
        
        cm = confusion_matrix(actual_targets.astype(int), predicted_targets.astype(int))
        
        plot_confusion_matrix(cm, title=f"Confusion Matrix - {title}")

results = {}
versions = ['Raw', 'Clean', 'Median']

X_clean = np.vstack([np.concatenate(X_clean_1ft), np.concatenate(X_clean_2ft)]).T
X_raw = np.vstack([np.concatenate(X_raw_1ft), np.concatenate(X_raw_2ft)]).T
X_median = np.vstack([np.concatenate(X_median_1ft), np.concatenate(X_median_2ft)]).T

data_versions = [X_raw, X_clean, X_median]
x_data = data_versions[0]

num_label = widgets.IntSlider(
    value=len(np.unique(y)),  # the maximum value selects all classes
    min=0,
    max=len(np.unique(y)),
    step=1,
    description='Class Label:',
    continuous_update=False
)

ab_slider = widgets.IntRangeSlider(
    value=[4, 5],
    min=0,
    max=5,
    step=1,
    description='Types of Feature:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d',
)

#plot_decision_boundary(x_data, y, f"{versions[0]}", num_label=0, ab=(4, 2))
# x_data holds the raw features, so the title must be versions[0] ('Raw').
interact(plot_decision_boundary, X=fixed(x_data), y=fixed(y), title=fixed(versions[0]), num_label=num_label, ab=ab_slider);


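
The cell above builds cross-validated predictions by looping over the folds by hand. As a compact alternative, `cross_val_predict` from `scikit-learn` returns one out-of-fold prediction per sample in the original sample order, so masks such as `y_pred == 0` can be applied directly to `X`. The sketch below assumes synthetic two-feature data; the model settings mirror the cell above but are otherwise arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two selected features (assumption: 3 classes).
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# The pipeline refits the scaler on the training part of every fold.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
cv = KFold(n_splits=10, shuffle=True, random_state=1)

# y_pred[i] is the prediction for X[i], made by a model that never saw X[i].
y_pred = cross_val_predict(model, X, y, cv=cv)

cm = confusion_matrix(y, y_pred)
# Normalise each row so that it sums to 1 (fraction of each true class).
norm_cm = cm / cm.sum(axis=1, keepdims=True)
print(np.round(norm_cm, 2))

# Samples predicted as class 0, selected with a mask aligned to X.
X_predicted_as_0 = X[y_pred == 0]
print(X_predicted_as_0.shape)
```

Because the predictions keep the sample order, the same array can feed both the scatter plots and the confusion matrix.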

## 3. Summary <a class="anchor" id="fifth-bullet"></a>

At this point, you should have:


* seen how `scikit-learn` works in general
* seen complete examples of machine learning for unsupervised and supervised classification, for both binary and multiclass problems
* seen ways to verify machine learning results for unsupervised and supervised classification, for both binary and multiclass problems
* seen how machine learning, and data processing pipelines in general, fit into the larger picture of processing astronomical data sets