# Information about the Notebook

The original dataset has imbalanced classes. Based on domain knowledge, the imbalance can be reduced by combining the original classes into higher-level classes.

- Aim:
    - Reduce overfitting risk on rare classes
    - Make the dataset more balanced
    - Keep clinical meaning intact

- ❗ Drawback of this approach:
    - Loses granularity: e.g., the relabelled data can't distinguish Pneumonia from Bronchiolitis, since both become Acute.

---
### Approach
- There are 8 classes:
    - 'URTI'
    - 'Healthy'
    - 'Asthma'
    - 'COPD'
    - 'LRTI'
    - 'Bronchiectasis'
    - 'Pneumonia'
    - 'Bronchiolitis'
- We will combine these 8 classes into three main classes, namely Chronic, Acute, and Healthy:
    - Chronic
        - 'COPD'
        - 'Asthma'
        - 'Bronchiectasis'
    - Acute
        - 'URTI'
        - 'LRTI'
        - 'Pneumonia'
        - 'Bronchiolitis'
    - Healthy
        - 'Healthy'
- Input:
    - `patient_diagnosis.csv` file:
        - Contains two columns, read as `pid` and `diagnosis`
        - Originally there are:
            - 126 pids
            - 8 unique classes in `diagnosis`

- Output:
    - `patient_diagnosis_relabelled.csv` file
        - A modified CSV file
            - Relabelled `diagnosis` column with only 3 unique classes: Chronic, Acute, and Healthy

# Code

In [5]:
import pandas as pd

In [6]:
# Load the CSV file; column names are supplied via `names`, so the first row is read as data
df = pd.read_csv("../data/Respiratory_Sound_Data/patient_diagnosis.csv", names=['pid','diagnosis'])

In [7]:
# Map each original diagnosis to its higher-level class
mapping = {
    'URTI': 'Acute',
    'LRTI': 'Acute',
    'Bronchiectasis': 'Chronic',
    'Pneumonia': 'Acute',
    'Bronchiolitis': 'Acute',
    'Asthma': 'Chronic',
    'COPD': 'Chronic',
    'Healthy': 'Healthy'
}

In [8]:
# Apply the mapping to relabel the 'diagnosis' column
# (any label missing from the mapping becomes NaN)
df['diagnosis'] = df['diagnosis'].map(mapping)

# Optional: save to a new CSV file
df.to_csv("patient_diagnosis_relabelled.csv", index=False)
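Because `Series.map` with a dictionary returns NaN for any label that has no key in the dictionary, a misspelled or unexpected diagnosis would silently disappear from the relabelled column. The following is a minimal sketch of a check for that, not part of the original notebook. It uses a toy DataFrame whose pids and malformed label are made up for illustration; the names `toy`, `relabelled`, and `unmapped` are hypothetical.

```python
import pandas as pd

# Toy stand-in for patient_diagnosis.csv (pids are invented for illustration).
toy = pd.DataFrame({
    "pid": [1, 2, 3, 4],
    # The last label is deliberately malformed to show what the check catches.
    "diagnosis": ["COPD", "URTI", "Healthy", "copd "],
})

# Same mapping as in the cell above.
mapping = {
    "URTI": "Acute",
    "LRTI": "Acute",
    "Bronchiectasis": "Chronic",
    "Pneumonia": "Acute",
    "Bronchiolitis": "Acute",
    "Asthma": "Chronic",
    "COPD": "Chronic",
    "Healthy": "Healthy",
}

# Labels without a key in `mapping` come back as NaN.
relabelled = toy["diagnosis"].map(mapping)

# Collect the original labels that failed to map.
unmapped = sorted(toy.loc[relabelled.isna(), "diagnosis"].unique())
print("Unmapped labels:", unmapped)
```

On the real data, an empty `unmapped` list would indicate that every diagnosis was covered by the mapping before the file is saved.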

In [9]:
# Show value counts to check new class distribution
df['diagnosis'].value_counts()

diagnosis
Chronic    72
Acute      28
Healthy    26
Name: count, dtype: int64
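The counts above show that the relabelled dataset is still not fully balanced. As a rough sketch (not part of the original notebook), the remaining imbalance can be quantified from those counts as class shares and as the ratio between the largest and smallest class. The names `counts`, `shares`, and `imbalance_ratio` are hypothetical.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"Chronic": 72, "Acute": 28, "Healthy": 26})

# Fraction of patients in each class.
shares = (counts / counts.sum()).round(3)

# Ratio of the largest class to the smallest class (1.0 would be perfectly balanced).
imbalance_ratio = counts.max() / counts.min()

print(shares)
print("Imbalance ratio:", round(imbalance_ratio, 2))
```

Running the same calculation on the original 8-class counts would give a direct before/after comparison.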