# <span style="color:blue;">**Asthma Detection from human breathing**</span>

## <span style="color:green;">**Course**</span>: Topological Data Analysis  

---

### <span style="color:orange;">**Group Members:**</span>
- **Aubrey UNDI PHIRI**  
- **Gaudence IRADUKUNDA**
- **Stephen TAIWO**   
- **Emile NIYITANGA**  

---

### <span style="color:purple;">**Institution:**</span> African Institute for Mathematical Sciences 
### <span style="color:purple;">**Supervisor:**</span> Olakunle S. Abawonse  
### <span style="color:purple;">**Date:**</span> 25 January 2025

---


## SCOPE OF THE PROJECT

In this project, we work with an audio dataset of normal and asthmatic breathing recordings. We want to classify each recording into its class by converting the audio to a signal and extracting topological features from it. The features include Persistence Entropy, Carlsson Coordinates, the number of points in each persistence diagram, and diagram amplitudes under several metrics (bottleneck, Wasserstein, Betti, landscape, and heat).
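As a small illustration of one of these features, the sketch below computes a persistence entropy by hand: the Shannon entropy of the bar lifetimes, normalised to sum to one. The function name `persistence_entropy` and the toy diagrams are made up for this example, and the base-2 logarithm is a choice of convention; the notebook itself relies on `gtda.diagrams.PersistenceEntropy`, which may differ in normalisation.

```python
import numpy as np

def persistence_entropy(diagram):
    """Shannon entropy (base 2) of the normalised lifetimes of a diagram.

    `diagram` is an (n, 2) array of (birth, death) pairs with finite deaths.
    Hypothetical helper for illustration only.
    """
    diagram = np.asarray(diagram, dtype=float)
    lifetimes = diagram[:, 1] - diagram[:, 0]
    lifetimes = lifetimes[lifetimes > 0]  # ignore zero-length bars
    if lifetimes.size == 0:
        return 0.0
    p = lifetimes / lifetimes.sum()       # lifetimes as a probability vector
    return float(-(p * np.log2(p)).sum())

# Toy diagrams (made-up values, not taken from the breathing data)
equal_bars = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
one_dominant = np.array([[0.0, 10.0], [0.0, 0.1], [0.0, 0.1], [0.0, 0.1]])

print(persistence_entropy(equal_bars))    # four equal bars: maximal entropy
print(persistence_entropy(one_dominant))  # one long bar dominates: lower entropy
```

The entropy is highest when all bars have the same length and drops when a few long bars dominate, which is why it can act as a one-number summary of a diagram.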


In [2]:
# Loading the necessary packages and modules

# data wrangling
import numpy as np
import pandas as pd
from pathlib import Path
from IPython.display import YouTubeVideo
from fastprogress import progress_bar

# tda magic
from gtda.homology import VietorisRipsPersistence, CubicalPersistence
from gtda.diagrams import PersistenceEntropy, Scaler, NumberOfPoints, Amplitude
from gtda.plotting import plot_heatmap, plot_point_cloud, plot_diagram
from gtda.pipeline import Pipeline
from gtda.time_series import TakensEmbedding, SingleTakensEmbedding
from ripser import ripser
from persim import plot_diagrams
from teaspoon.ML import feature_functions as Ff

# ml tools
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score, accuracy_score
from sklearn.pipeline import make_pipeline, make_union

# dataviz
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(color_codes=True)
sns.set_palette(sns.color_palette("muted"))

# standard library
import os
import random
import sys
import wave

Here, we split the dataset into training and test sets. From the Normal category, 73 recordings are sampled for training and 32 for testing; from the Asthma category, 67 are sampled for training and 28 for testing. This gives 140 training and 60 test recordings, roughly a 70/30 split. The file paths and labels of each set are then saved to separate `.csv` files for easy access.

In [3]:
random.seed(42)

normal_folder = '/home/stephen/Downloads/TDA/Project_3/Normal'
asthma_folder = '/home/stephen/Downloads/TDA/Project_3/Asthma/'

normal_files = [os.path.join(normal_folder, f) for f in os.listdir(normal_folder) if f.endswith(".wav")]
asthma_files = [os.path.join(asthma_folder, f) for f in os.listdir(asthma_folder) if f.endswith(".wav")]

# Randomly select files for training and test sets
normal_train = random.sample(normal_files, 73)
normal_test = random.sample([f for f in normal_files if f not in normal_train], 32)

asthma_train = random.sample(asthma_files, 67)
asthma_test = random.sample([f for f in asthma_files if f not in asthma_train], 28)

# Putting all file names and labels in a dataframe
data1 = {
    "file_path": normal_train + asthma_train,
    "label": [0] * len(normal_train) + [1] * len(asthma_train)}

df1 = pd.DataFrame(data1)


df1.to_csv("train_data.csv", index=False)

data2 = {
    "file_path": normal_test + asthma_test,
    "label": [0] * len(normal_test) + [1] * len(asthma_test)}

df2 = pd.DataFrame(data2)


df2.to_csv("test_data.csv", index=False)
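The same kind of split can also be sketched with scikit-learn's `train_test_split`, which keeps the class proportions when `stratify` is given. This is an alternative to the manual sampling above, not what the notebook runs. The file names below are hypothetical placeholders, and the counts of 105 and 95 files are simply the numbers of recordings sampled above (73 + 32 and 67 + 28).

```python
from sklearn.model_selection import train_test_split

# Hypothetical file names standing in for the real .wav paths
normal_files = [f"normal_{i}.wav" for i in range(105)]
asthma_files = [f"asthma_{i}.wav" for i in range(95)]

paths = normal_files + asthma_files
labels = [0] * len(normal_files) + [1] * len(asthma_files)

# Hold out 60 recordings, keeping the Normal/Asthma ratio in both sets
train_paths, test_paths, train_labels, test_labels = train_test_split(
    paths, labels, test_size=60, stratify=labels, random_state=42
)

print(len(train_paths), len(test_paths))
```

Because the split is stratified, the class sizes do not need to be hard-coded, and the two sets are disjoint by construction.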

### Functions to extract features

In [4]:
# Helper Functions

def convert_dgm(dgm):
    """Stack ripser diagrams [H0, H1] into one array of (birth, death, dimension) rows."""
    h0 = dgm[0][:-1]  # drop the last H0 point (the component with infinite death)
    h1 = dgm[1]

    # Append the homology dimension as a third column
    h0 = np.column_stack((h0, np.zeros(h0.shape[0])))
    h1 = np.column_stack((h1, np.ones(h1.shape[0])))
    return np.vstack((h0, h1))

def fit_embedder(embedder, y, verbose=True):
    y_embedded = embedder.fit_transform(y)

    if verbose:
        print(f"Shape of embedded time series: {y_embedded.shape}")
        print(f"Optimal embedding dimension is {embedder.dimension_} and time delay is {embedder.time_delay_}")

    return y_embedded

In [5]:
def extract_features(file):
    
    # Read all frames and close the file afterwards; assumes 16-bit PCM audio
    with wave.open(file, "r") as spf:
        signal = spf.readframes(-1)
    signal = np.frombuffer(signal, np.int16)
    
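    # Takens embedding parameters; with parameters_type="search" below,
    # the dimension and time delay act as upper bounds for the parameter search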
    embedding_dimension = 30
    embedding_time_delay = 100
    stride = 15

    # Create a SingleTakensEmbedding object
    embedder = SingleTakensEmbedding(
        parameters_type="search", n_jobs=2, time_delay=embedding_time_delay, dimension=embedding_dimension, stride=stride
    )
    
    # Fit and transform the signal
    y_noise_embedded = embedder.fit_transform(signal)

    # Compute persistent homology using Ripser
    res = ripser(y_noise_embedded, n_perm=700)
    dgms_sub = res['dgms']

    # Convert diagrams to (birth, death, dimension) rows with convert_dgm, defined above
    res = convert_dgm(dgms_sub)

    # Extract persistence diagrams
    test = dgms_sub[0][:-1]  # H0
    test_1 = dgms_sub[1]     # H1

    # Compute Carlsson Coordinates with teaspoon's Ff.F_CCoordinates
    FN = 5
    FeatureMatrix, TotalNumComb, CombList = Ff.F_CCoordinates(test[None, :, :], FN)
    X_cc_0 = FeatureMatrix[-4]
    
    FeatureMatrix, TotalNumComb, CombList = Ff.F_CCoordinates(test_1[None, :, :], FN)
    X_cc_1 = FeatureMatrix[-3]

    # Define metrics for additional features
    metrics = [
        {"metric": "bottleneck", "metric_params": {}},
        {"metric": "wasserstein", "metric_params": {"p": 2}},
        {"metric": "betti", "metric_params": {"p": 2, "n_bins": 100}},
        {"metric": "landscape", "metric_params": {"p": 2, "n_layers": 2, "n_bins": 100}},
        {"metric": "heat", "metric_params": {"p": 2, "sigma": 1.6, "n_bins": 100}},
        {"metric": "heat", "metric_params": {"p": 2, "sigma": 3.2, "n_bins": 100}},
    ]

    # Create a feature union with persistence diagram metrics
    feature_union = make_union(
        PersistenceEntropy(normalize=True),
        NumberOfPoints(n_jobs=-1),
        *[Amplitude(**metric, n_jobs=-1) for metric in metrics]
    )

    # Fit and transform persistence diagrams
    single_data = feature_union.fit_transform(res[None, :, :])
    X_metrics = single_data

    # Concatenate all features
    single_X_train = np.concatenate((X_cc_0, X_cc_1, X_metrics), axis=None)

    return single_X_train
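To make the embedding step in `extract_features` more concrete, here is a minimal NumPy sketch of a time-delay (Takens) embedding with fixed parameters. The helper `delay_embedding` and the sine-wave signal are hypothetical and only illustrate the idea; the notebook uses `SingleTakensEmbedding`, which additionally searches for the dimension and time delay.

```python
import numpy as np

def delay_embedding(signal, dimension, time_delay, stride=1):
    """Map a 1-D signal to points (x[t], x[t + d], ..., x[t + (m - 1) d]).

    Hypothetical helper: m = dimension, d = time_delay, and t advances by `stride`.
    """
    signal = np.asarray(signal, dtype=float)
    n_points = len(signal) - (dimension - 1) * time_delay
    if n_points <= 0:
        raise ValueError("signal too short for this dimension and time delay")
    starts = np.arange(0, n_points, stride)      # first index of each point
    offsets = np.arange(dimension) * time_delay  # lag of each coordinate
    return signal[starts[:, None] + offsets[None, :]]

# Toy periodic signal (not breathing data)
t = np.linspace(0, 4 * np.pi, 400)
toy_signal = np.sin(t)

cloud = delay_embedding(toy_signal, dimension=3, time_delay=25, stride=2)
print(cloud.shape)
```

Each row of `cloud` is one point of the point cloud on which persistent homology is then computed; a periodic signal traces out a closed loop, which shows up as a prominent H1 feature.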

In [6]:
# Testing the extraction function on a single file
test_file = "/home/stephen/Downloads/TDA/Project_3/Normal/BP30_N,N,P R M,18,F.wav"
features = extract_features(test_file)
features.shape

(24,)

### Extracting Features

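To extract features from the signal, we use Carlsson Coordinates (8), Persistence Entropy (2), Number of Points (2), and Amplitude (12). The amplitudes are computed with six metric settings (bottleneck, Wasserstein, Betti, landscape, and heat with two values of sigma) over the two homology dimensions H0 and H1, giving 24 features in total.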
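The layout of the 24-dimensional feature vector can be sketched by naming each position, following the concatenation order in `extract_features`. The names are invented for readability. Two points are assumptions rather than facts from the notebook: that the 8 Carlsson Coordinates split into 4 per homology dimension, and that each transformer returns its H0 value before its H1 value.

```python
# Metric settings in the order they appear in `metrics` (names are illustrative)
metric_names = ["bottleneck", "wasserstein", "betti", "landscape", "heat_1.6", "heat_3.2"]
homology_dims = [0, 1]

feature_names = (
    [f"carlsson_H0_{i}" for i in range(1, 5)]    # assumed: 4 coordinates for H0
    + [f"carlsson_H1_{i}" for i in range(1, 5)]  # assumed: 4 coordinates for H1
    + [f"entropy_H{d}" for d in homology_dims]
    + [f"n_points_H{d}" for d in homology_dims]
    + [f"amplitude_{m}_H{d}" for m in metric_names for d in homology_dims]
)

print(len(feature_names))  # 8 + 2 + 2 + 12
```

Such a list is handy later for labelling feature importances or the columns of a DataFrame built from the feature matrix.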

In [7]:
# Load the CSV files for the training and test sets
train_df = pd.read_csv("train_data.csv") 
test_df = pd.read_csv("test_data.csv")    

X_train, y_train = [], []
X_test, y_test = [], []

# Extract features for the training set
for index, row in train_df.iterrows():
    file_path = row['file_path']
    label = row['label']
    
    try:
        feature = extract_features(file_path)
        X_train.append(feature)
        y_train.append(label)
    except Exception as e:
        print(f"Error processing {file_path}: {e}")

In [8]:
# Extract features for the test set
for index, row in test_df.iterrows():
    file_path = row['file_path']
    label = row['label']
    
    try:
        feature = extract_features(file_path)
        X_test.append(feature)
        y_test.append(label)
    except Exception as e:
        print(f"Error processing {file_path}: {e}")

In [9]:
X_train = np.array(X_train)
y_train = np.array(y_train)
X_test = np.array(X_test)
y_test = np.array(y_test)

In [10]:
X_test.shape

(60, 24)

In [11]:
X_train.shape

(140, 24)

We obtained 24 features in total. These features will be used to train a classifier and evaluate its performance.

### Training a Classifier

Let's train a Random Forest Classifier to see how the features help us classify the signals.

In [16]:
rf = RandomForestClassifier(random_state=31)
rf.fit(X_train, y_train)

rf.score(X_test, y_test)

0.65

In [18]:
def print_scores(fitted_model):
    # scikit-learn metrics take (y_true, y_pred) in that order
    res = {
        "Accuracy on train:": accuracy_score(y_train, fitted_model.predict(X_train)),
        "ROC AUC on train:": roc_auc_score(y_train, fitted_model.predict_proba(X_train)[:, 1]),
        "Accuracy on valid:": accuracy_score(y_test, fitted_model.predict(X_test)),
        "ROC AUC on valid:": roc_auc_score(y_test, fitted_model.predict_proba(X_test)[:, 1]),
    }
    if hasattr(fitted_model, "oob_score_"):
        res["OOB accuracy:"] = fitted_model.oob_score_

    for k, v in res.items():
        print(k, round(v, 3))

In [19]:
rf = RandomForestClassifier(random_state=31)
rf.fit(X_train, y_train)
print_scores(rf)

Accuracy on train: 0.993
ROC AUC on train: 1.0
Accuracy on valid: 0.65
ROC AUC on valid: 0.624


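The classifier reaches 65% accuracy on the test set (ROC AUC 0.624), compared with 99.3% on the training set. The large gap indicates that the model overfits, and the test accuracy is only modestly above the roughly 53% obtained by always predicting the majority class (32 of the 60 test recordings are normal). There is therefore considerable room for improvement, for example by looking more carefully at feature selection.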
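One possible way to start on feature selection is sketched below: estimate accuracy with stratified cross-validation and rank the features by Random Forest importance. The data here is synthetic (`X_demo`, `y_demo`, with feature 0 made informative on purpose) so that the block runs on its own; it is not the breathing dataset, and the numbers it prints say nothing about the real features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with the same shape as X_train (140 samples, 24 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(140, 24))
y_demo = np.array([0] * 73 + [1] * 67)
X_demo[:, 0] += 1.5 * y_demo  # make feature 0 informative on purpose

rf_demo = RandomForestClassifier(n_estimators=100, random_state=31)

# 5-fold stratified cross-validation gives a less noisy estimate than one split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=31)
cv_scores = cross_val_score(rf_demo, X_demo, y_demo, cv=cv, scoring="accuracy")
print("CV accuracy: mean", cv_scores.mean().round(3), "std", cv_scores.std().round(3))

# Rank features by impurity-based importance, most important first
rf_demo.fit(X_demo, y_demo)
ranking = np.argsort(rf_demo.feature_importances_)[::-1]
print("Most important feature indices:", ranking[:5])
```

On the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`; features that consistently rank near the bottom are candidates for removal.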