# **Dynamic Selection Hybrid Model for Advancing Thyroid Care With BOOST Balancing Method**

======================================================================================================================================

#### Overview of the code:

1. Data collection

2. Analyze the data

3. Data preprocessing
    * Drop unwanted columns and duplicates
    
    * Handle Null values
        * Drop if it contains more than 50% null values
        * Fill or replace null values with mean, median or mode.

    * Implement label encoding to convert object-type columns into numeric columns.

    * Implement BOOST balancing method to balance the data.

        BOOST :=>
        - BS(Boosting with Sample Weighting), 
        - SMOTE (Synthetic Minority Over-sampling Technique),
        - Tomek Links (TL)
        
            implementation:
            - Split dataset into training and test sets
            - Step 1: Apply SMOTE for oversampling minority class
            - Step 2: Apply Tomek Links to remove noisy samples
            - Step 3: Apply Boosting Stage (BS) => Model implementation

4. Build a Dynamic Selection Hybrid Model

    implementation:
    
    * Define the classifiers
    * Step 1: Train all classifiers and compute Permutation Feature Importance (PFI)
    * Step 2: Select Half-Most Effective Classifiers (HEC) based on PFI
    * Step 3: Define the ensemble methods using the selected classifiers
    * Step 4: Train each ensemble method and evaluate accuracy
    * Step 5: Select Most Efficient Ensemble Method (EEM)
    * Build final model

5. Save the model for deployment in the web application frontend

6. Sample predictions

======================================================================================================================================


### Import necessary libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from IPython.display import display, HTML


# # Data Collection

In [None]:
# Read the dataset and store it as a dataframe
df = pd.read_csv("DATASET/thyroidDF.csv")

### Analyze the data

In [None]:
# View the dataframe
df

In [None]:
# Count rows, columns
df.shape

In [None]:
# Information about data
df.info()

In [None]:
# Count of Object columns
num_object_columns = len(df.select_dtypes(include='object').columns)
print(f"Number of object columns: {num_object_columns}")

# Count of numeric columns
num_numeric_columns = len(df.select_dtypes(include='number').columns)
print(f"Number of numeric columns: {num_numeric_columns}")

In [None]:
# Summary statistics for the numeric columns
df.describe()

In [None]:
# count of duplicate rows
df.duplicated().sum()

In [None]:
# Count of null values per column
df.isnull().sum()

In [None]:
# Calculate total missing values and percentage of missing values
from IPython.display import display, HTML

train_total = df.isnull().sum()
train_percent = (train_total / df.shape[0]) * 100

# Create a DataFrame to hold this information
data_missing = pd.DataFrame({
    'Total nulls': train_total,
    'Percentage': train_percent
})

# Sort the DataFrame by the 'Total nulls' column in descending order
data_missing_sorted = data_missing.sort_values(by='Total nulls', ascending=False)

# Convert the DataFrame to HTML
html = data_missing_sorted.to_html()

# Display the HTML as a scrollable element
display(HTML(f"""
<div style="height:300px; overflow-y:scroll; border:1px solid black; padding:10px;">
    {html}
</div>
"""))

Analyze target column

In [None]:
df["target"].unique()

In [None]:
df["target"].value_counts()

In [None]:
# Distribution plot for target column
object_columns = ["target"]

plt.figure(figsize=(15, 10))
for i, feature in enumerate(object_columns):
    plt.subplot(2, 3, i + 1)
    sns.countplot(x=df[feature], data=df)
    plt.title(f'Distribution of {feature}')
    plt.xticks(rotation=0)

plt.tight_layout()
plt.show()

Dataset description:


| S.No. | Feature Name           | Description                                                 | Data Type |
|-------|------------------------|-------------------------------------------------------------|-----------|
| 1     | age                    | age of the patient                                          | int       |
| 2     | sex                    | Sex the patient identifies as                               | str       |
| 3     | on_thyroxine           | whether patient is on thyroxine                             | bool      |
| 5     | on_antithyroid_meds    | Whether patient is on antithyroid meds                      | bool      |
| 6     | sick                   | Whether patient is sick                                     | bool      |
| 7     | pregnant               | Whether patient is pregnant                                 | bool      |
| 8     | thyroid_surgery        | Whether patient has undergone thyroid surgery               | bool      |
| 9     | I131_treatment         | Whether patient is undergoing I131 treatment                | bool      |
| 10    | query_hypothyroid      | Whether patient believes they have hypothyroid              | bool      |
| 11    | query_hyperthyroid     | Whether patient believes they have hyperthyroid             | bool      |
| 12    | lithium                | Whether patient uses lithium, which can cause goitre and hypothyroidism | bool      |
| 13    | goitre                 | Whether patient has goitre. It is an abnormal enlargement of the thyroid gland | bool      |
| 14    | tumor                  | Whether patient has tumor                                   | bool      |
| 15    | hypopituitary          | Whether patient has a hypopituitary gland                   | float     |
| 16    | psych                  | Whether patient has psych issues                            | bool      |
| 17    | TSH_measured           | Whether TSH was measured in the blood                       | bool      |
| 18    | TSH                    | TSH level in blood from lab work                            | float     |
| 19    | T3_measured            | Whether T3 was measured in the blood                        | bool      |
| 20    | T3                     | T3 level in blood from lab work                             | float     |
| 21    | TT4_measured           | Whether TT4 was measured in the blood                       | bool      |
| 22    | TT4                    | TT4 level in blood from lab work                            | float     |
| 23    | T4U_measured           | Whether T4U was measured in the blood                       | bool      |
| 24    | T4U                    | T4U level in blood from lab work                            | float     |
| 25    | FTI_measured           | Whether FTI was measured in the blood                       | bool      |
| 26    | FTI                    | FTI level in blood from lab work                            | float     |
| 27    | TBG_measured           | Whether TBG was measured in the blood                       | bool      |
| 28    | TBG                    | TBG level in blood from lab work                            | float     |
| 29    | referral_source        | Referral source                                             | str       |
| 30    | patient_id             | Unique ID of the patient                                    | str       |


Target:
 	* target - hyperthyroidism medical diagnosis (str)

The goal is to predict whether a patient has hyperthyroidism or hypothyroidism, and which specific condition applies (8 classes in total).

            1. hyperthyroid conditions:
                1) A   Subclinical (initial)
                2) B   T3 toxic
                3) C   toxic goitre
                4) D   secondary toxic

            2. hypothyroid conditions:
                5) E   Subclinical (initial)
                6) F   primary hypothyroid
                7) G   compensated hypothyroid
                8) H   secondary hypothyroid


In our dataset:
* There are 920 rows and 31 columns,
* Number of object columns: 23,
* Number of numeric columns: 8,
* There is one duplicate row,
* Null values:

    | Column | Total Nulls | Percentage |
    |--------|-------------|------------|
    | TBG    | 912         | 99.130435  |
    | T3     | 200         | 21.739130  |
    | T4U    | 55          | 5.978261   |
    | FTI    | 54          | 5.869565   |
    | TSH    | 47          | 5.108696   |
    | sex    | 42          | 4.565217   |
    | TT4    | 9           | 0.978261   |



To preprocess the data we will:
1. Drop unwanted columns (patient_id) and duplicates.
2. Handle null values:
    * Drop the TBG column, because 912 of its values (99%) are null.
    * Fill the remaining nulls in the T3, TSH, T4U, FTI, TT4, and sex columns.
3. Implement label encoding to convert object-type columns into numeric columns.
4. Implement the BOOST balancing method to balance the data.

# # Data Preprocessing

### 1. Drop unwanted columns
The patient_id column is only an identifier and is not needed for modelling, so we drop it.

In [None]:
df.drop(["patient_id"], axis=1, inplace=True)

In [None]:
df.drop_duplicates(inplace=True)

### 2. Handle null values

Drop the TBG column, because 912 of its values (99%) are null.

In [None]:
df.drop(["TBG"], axis=1, inplace=True)
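
The overview states a rule of dropping any column with more than 50% null values, while the cell above drops `TBG` by name. A minimal sketch of applying that rule generically is shown below. The `toy` frame and the `max_null_fraction` name are illustrative assumptions, not part of the notebook or the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the thyroid data (values are made up).
toy = pd.DataFrame({
    "TSH": [1.2, np.nan, 0.8, 2.5],
    "TBG": [np.nan, np.nan, np.nan, 30.0],
    "sex": ["F", "M", None, "F"],
})

max_null_fraction = 0.5  # hypothetical name for the 50% rule in the overview

# isnull().mean() gives the fraction of missing values in each column.
null_fraction = toy.isnull().mean()
cols_to_drop = null_fraction[null_fraction > max_null_fraction].index.tolist()

toy_reduced = toy.drop(columns=cols_to_drop)
print("Dropped:", cols_to_drop)
```

Driving the drop from the null fraction avoids hard-coding column names if the dataset changes.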

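Fill the nulls in the T3, TSH, T4U, FTI, TT4, and sex columns. The cells below fill each numeric null with a random value drawn uniformly between the column's minimum and maximum, and each categorical null with a random pick from the column's observed values.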

In [None]:
import pandas as pd
import numpy as np

def fill_nulls_with_random(df):
    # Identify numerical columns
    numerical_columns = df.select_dtypes(include=[np.number]).columns
    
    # Loop through each numerical column
    for col in numerical_columns:
        # Check for null values
        if df[col].isnull().sum() > 0:
            # Get lower and upper bounds (min and max of non-null values)
            lower_bound = df[col].min()
            upper_bound = df[col].max()
            
            # Generate random numbers to fill the null values
            # We only generate numbers for the number of NaNs in the column
            random_values = np.random.uniform(lower_bound, upper_bound, df[col].isnull().sum())
            
            # Fill NaN values with generated random numbers
            df.loc[df[col].isnull(), col] = random_values
    
    return df

df_filled = fill_nulls_with_random(df)

In [None]:
# Filter for object (categorical) columns
object_df = df.select_dtypes(include='object')

# Fill null values with random picks from the column's observed (non-null) values
def fill_na_with_random_choice(column):
    if column.isnull().any():  # Check if there are NaNs in the column
        unique_values = column.dropna().unique()  # Get unique values, excluding NaNs
        if len(unique_values) > 0:
            # Generate one random choice per NaN, aligned to the NaN positions
            random_choices = np.random.choice(unique_values, size=column.isnull().sum())
            column = column.fillna(pd.Series(random_choices, index=column[column.isnull()].index))
    return column

# Apply the function to each object column and assign the result back,
# since an in-place fillna on df[col] may act on a copy and leave df unchanged
for col in object_df.columns:
    df[col] = fill_na_with_random_choice(df[col])
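
The overview mentions mean, median, or mode as fill values. As a hedged alternative to the random fill, the sketch below imputes numeric columns with the median and other columns with the mode. The `toy` frame and the `impute_median_mode` helper are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical column (values are made up).
toy = pd.DataFrame({
    "T3": [1.0, np.nan, 3.0, 100.0],
    "sex": ["F", None, "F", "M"],
})

def impute_median_mode(frame):
    """Return a copy with numeric nulls set to the median and other nulls to the mode."""
    out = frame.copy()
    for col in out.columns:
        if not out[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() ignores nulls; take the first value if there are ties
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

toy_imputed = impute_median_mode(toy)
print(toy_imputed)
```

The median is less sensitive than the mean to outliers such as the 100.0 in the toy `T3` column, and the fill is deterministic, unlike a random draw.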

### 3. Label Encoding

In [None]:
# Filter for object (categorical) columns
object_df = df.select_dtypes(include='object')

# Identify object columns with fewer than 2 unique values
columns_with_fewer_than_two_classes = [
    col for col in object_df.columns
    if object_df[col].nunique() < 2
]

print("Object columns with fewer than 2 unique values:\n", columns_with_fewer_than_two_classes)

In [None]:
# Label-encode the object columns

from sklearn.preprocessing import LabelEncoder

# Store original column names
original_columns = df.select_dtypes(include='object').columns

# Initialize LabelEncoder
label_encoders = {}

# Apply LabelEncoder to each categorical variable
for col in original_columns:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])

# Print the mapping between original categories and numerical labels
for col, encoder in label_encoders.items():
    print(f"Mapping for column '{col}':")
    for label, category in enumerate(encoder.classes_):
        print(f"Label {label}: {category}")
    print("===============================")
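
`LabelEncoder` assigns integer codes following the sorted order of the distinct values in a column, which is why the printed mapping should be checked before hard-coding a code-to-class dictionary later. The plain-Python sketch below reproduces that rule on made-up target letters; `toy_targets` is an illustrative assumption.

```python
# Made-up target letters, in arbitrary order
toy_targets = ["F", "A", "H", "A", "C"]

# Sorted distinct values define the codes: A -> 0, C -> 1, F -> 2, H -> 3
classes = sorted(set(toy_targets))
to_code = {label: code for code, label in enumerate(classes)}
to_label = {code: label for label, code in to_code.items()}

encoded = [to_code[t] for t in toy_targets]
decoded = [to_label[c] for c in encoded]

print(encoded)
print(decoded)
```

Note that the codes depend on which values are present: here `F` maps to 2 only because `B`, `D`, and `E` are absent from the toy data.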

### Correlation Matrix

In [None]:
# correlation matrix
correlation_matrix = df.corr()
print("Correlation Matrix:\n", correlation_matrix)

# Plot the correlation matrix
plt.figure(figsize=(20, 16))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

In [None]:
X = df.drop(["target"], axis=1)
y = df["target"]

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

# Initialize SelectKBest with the desired scoring function and k=10 (10 best features)
selector = SelectKBest(score_func=f_classif, k=10)

# Fit the selector
selector.fit(X, y)

# Get the feature scores for each feature
scores = selector.scores_
p_values = selector.pvalues_

# Create a DataFrame with feature scores
feature_scores = pd.DataFrame({
    'Feature': X.columns,
    'Score': scores,
    'P-Value': p_values
})

# Sort the DataFrame by Score in descending order
feature_scores_sorted = feature_scores.sort_values(by='Score', ascending=False)

# top 10 features
top_10_features = feature_scores_sorted.head(10)

# Retrieve the column names of the top 10 features
top_10_feature_names = top_10_features['Feature'].tolist()

# Filter the original DataFrame to include only the top 10 features
df_top_10 = df[top_10_feature_names]

In [None]:
# View the dataframe with the top 10 features
df_top_10

### 4. BOOST balancing method to balance the imbalanced data

BOOST :=>
* BS(Boosting with Sample Weighting), 
* SMOTE (Synthetic Minority Over-sampling Technique),
* Tomek Links (TL)

- First, apply SMOTE to oversample the minority class.
- Then, apply Tomek Links to remove noisy data points.
- Finally, train the boosting model on the resampled dataset.

In [None]:
# Use the 10 selected features as X; y (the encoded target) is unchanged
X = df_top_10

In [None]:
# Import required libraries
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [None]:
# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

#### Step 1: Apply SMOTE for oversampling minority class

In [None]:
smote = SMOTE(sampling_strategy='auto', random_state=42, k_neighbors=5)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
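
To make the oversampling step concrete, the sketch below shows the interpolation idea behind SMOTE: a synthetic point is placed on the line segment between a minority sample and one of its k nearest minority neighbours. This is a simplified NumPy illustration on made-up points, not imblearn's implementation; `synth_point` and `minority` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority-class points in two dimensions
minority = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [5.0, 5.0]])

def synth_point(points, i, k, rng):
    """Interpolate between points[i] and one of its k nearest neighbours."""
    dists = np.linalg.norm(points - points[i], axis=1)
    # Position 0 of the sorted order is the point itself (distance 0), so skip it.
    # This assumes there are no exact duplicate points.
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    return points[i] + lam * (points[j] - points[i])

new_point = synth_point(minority, i=0, k=2, rng=rng)
print(new_point)
```

Because new samples lie between existing minority samples, the method needs more minority samples than `k_neighbors` in each class being oversampled.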

#### Step 2: Apply Tomek Links to remove noisy samples

In [None]:
tomek = TomekLinks()
X_resampled, y_resampled = tomek.fit_resample(X_smote, y_smote)
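
A Tomek link is a pair of samples from different classes that are each other's nearest neighbour; such pairs sit on the class boundary and are treated as noise or overlap. The NumPy sketch below only detects the links on made-up one-dimensional data; which member of a pair gets removed is left to the library's sampling strategy. `tomek_links`, `X_toy`, and `y_toy` are hypothetical names.

```python
import numpy as np

# Made-up one-dimensional samples and their class labels
X_toy = np.array([[0.0], [1.0], [1.2], [3.0], [3.1]])
y_toy = np.array([0, 0, 1, 1, 1])

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    # Pairwise Euclidean distances between all samples
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    nearest = dists.argmin(axis=1)
    links = []
    for i, j in enumerate(nearest):
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

links = tomek_links(X_toy, y_toy)
print(links)
```

In the toy data, samples 1 and 2 are mutual nearest neighbours with different labels, so they form a link; samples 3 and 4 are also mutual nearest neighbours but share a label, so they do not.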

In [None]:
# Convert the resampled data into DataFrames
X_train_res_df = pd.DataFrame(X_resampled, columns=X_train.columns)
y_train_res_df = pd.DataFrame(y_resampled, columns=[y_train.name])  # y_train.name preserves the original target column name

# Combine the features and target into one DataFrame (preprocessed data)
df = pd.concat([X_train_res_df, y_train_res_df], axis=1)

In [None]:
# view the preprocessed data
df

In [None]:
# remove duplicates
print("Duplicates before drop:", df.duplicated().sum())
df.drop_duplicates(inplace=True)
print("Duplicates after drop:", df.duplicated().sum())

In [None]:
# null counts
df.isnull().sum()

In [None]:
# Distribution plot for the target column to check whether the data is balanced
object_columns = ["target"]

plt.figure(figsize=(15, 10))
for i, feature in enumerate(object_columns):
    plt.subplot(2, 3, i + 1)
    sns.countplot(x=df[feature], data=df)
    plt.title(f'Distribution of {feature}')
    plt.xticks(rotation=0)

plt.tight_layout()
plt.show()

Our preprocessed data is now a balanced dataset.

In [None]:
# Information about the preprocessed data
df.info()

In [None]:
# information about preprocessed data
df.describe()

After preprocessing the data we have:
* No duplicates,
* No null values,
* No object columns (Converted all object into numeric type using label encoding)
* Balanced dataset

Now we can go for Algorithm implementation

#### Step 3: Apply Boosting Stage (BS)

The balanced dataset is now ready, so we can train the boosting stage on the resampled data.
For that we implement the Dynamic Selection Hybrid Model.

Dynamic Selection Hybrid Model 

* Define the classifiers
* Step 1: Train all classifiers and compute Permutation Feature Importance (PFI)
* Step 2: Select Half-Most Effective Classifiers (HEC) based on PFI
* Step 3: Define the ensemble methods using the selected classifiers
* Step 4: Train each ensemble method and evaluate accuracy
* Step 5: Select Most Efficient Ensemble Method (EEM)
* Build final model

In [None]:
# # Save preprocessed data as a csv file for web application use
# df.to_csv("DATASET/Final_Dataset.csv")

# # Data Splitting

In [None]:
# Split the data into features and target
X = df.drop("target", axis=1) # Features
y = df["target"] # Target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

In [None]:
# Training features
X_train

In [None]:
# Training target
y_train

In [None]:
# testing features
X_test

In [None]:
# testing target
y_test

# # Algorithm Implementation

In [None]:
# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier, BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance

In [None]:
# Define the classifiers
classifiers = {
    'Decision Tree': DecisionTreeClassifier(),
    'SVM': SVC(probability=True),
    'KNN': KNeighborsClassifier(),
    'Random Forest': RandomForestClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'Gradient Boosting': GradientBoostingClassifier()
}

Permutation Feature Importance (PFI) evaluates how important a feature is to a trained model by measuring how much the model's score drops when that feature's values are randomly shuffled.

In [None]:
# Step 1: Train all classifiers and compute Permutation Feature Importance (PFI)
pfi_results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # Calculate testing accuracy
    test_accuracy = accuracy_score(y_test, y_pred)
    print(f"Testing Accuracy for {name}: {test_accuracy:.4f}")

    # Generate the classification report
    report = classification_report(y_test, y_pred)
    print(f"\nClassification Report for {name}:")
    print(report)

    # Generate the confusion matrix
    from sklearn.metrics import confusion_matrix
    conf_matrix = confusion_matrix(y_test, y_pred)

    # Plot the confusion matrix using seaborn heatmap
    plt.figure(figsize=(8, 6))
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False, xticklabels=clf.classes_, yticklabels=clf.classes_)
    plt.title(f'Confusion Matrix for {name}')
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
    plt.show()

    pfi = permutation_importance(clf, X_train, y_train, n_repeats=10, random_state=42)
    pfi_results[name] = np.mean(pfi.importances_mean)
    print(f"Classifier: {name}, PFI Score: {pfi_results[name]}")
    print("=============================================================")
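
The mechanism behind permutation importance, as used in Step 1, can be sketched with NumPy alone: shuffle one feature at a time and record the drop in accuracy. This is a simplified illustration on made-up data with a stand-in `predict` function, not scikit-learn's implementation; all names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: the label depends only on feature 0; feature 1 is noise.
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

def predict(X):
    """Stand-in for a fitted model: thresholds feature 0 and ignores feature 1."""
    return (X[:, 0] > 0).astype(int)

def permutation_importance_sketch(predict_fn, X, y, n_repeats, rng):
    """Mean drop in accuracy per feature after shuffling that feature."""
    baseline = np.mean(predict_fn(X) == y)
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops.append(baseline - np.mean(predict_fn(X_perm) == y))
        importances[col] = np.mean(drops)
    return importances

importances = permutation_importance_sketch(predict, X_toy, y_toy, n_repeats=10, rng=rng)
print(importances)
```

Shuffling the feature the model relies on costs a large share of its accuracy, while shuffling the ignored feature costs nothing. The notebook then averages these per-feature values into one score per classifier.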

In [None]:
# Step 2: Select Half-Most Effective Classifiers (HEC) based on PFI
sorted_classifiers = sorted(pfi_results.items(), key=lambda x: x[1], reverse=True)
hec_classifiers = [name for name, _ in sorted_classifiers[:len(sorted_classifiers) // 2]]
print(f"Selected HEC Classifiers: {hec_classifiers}")
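
A worked example of the half-selection rule with made-up scores is shown below. The `toy_pfi` values are illustrative assumptions, not results from this notebook.

```python
# Made-up PFI scores for five classifiers
toy_pfi = {
    "Decision Tree": 0.12,
    "SVM": 0.05,
    "KNN": 0.08,
    "Random Forest": 0.15,
    "AdaBoost": 0.03,
}

# Rank by score, highest first, then keep the top half
ranked = sorted(toy_pfi.items(), key=lambda item: item[1], reverse=True)
top_half = [name for name, _ in ranked[:len(ranked) // 2]]
print(top_half)
```

Floor division rounds down, so five classifiers yield two selections; with the six classifiers defined in this notebook, three are kept.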

In [None]:
# Step 3: Define the ensemble methods using the selected classifiers
estimators = [(name, classifiers[name]) for name in hec_classifiers]

# Define ensemble methods
boosting = AdaBoostClassifier(estimator=estimators[0][1])
bagging = BaggingClassifier(estimator=estimators[0][1])
voting = VotingClassifier(estimators=estimators, voting='soft')
stacking = StackingClassifier(estimators=estimators)

ensemble_methods = {
    'Boosting': boosting,
    'Bagging': bagging,
    'Voting': voting,
    'Stacking': stacking
}
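
Soft voting averages the class probabilities predicted by each estimator and picks the class with the highest average, which is why every estimator must expose `predict_proba` (hence `SVC(probability=True)` in the classifier definitions). A minimal NumPy sketch with made-up probabilities:

```python
import numpy as np

# Made-up class probabilities from three classifiers for two samples
proba_a = np.array([[0.6, 0.4], [0.2, 0.8]])
proba_b = np.array([[0.4, 0.6], [0.3, 0.7]])
proba_c = np.array([[0.7, 0.3], [0.6, 0.4]])

# Average the probabilities per sample and class, then take the most likely class
avg_proba = np.mean([proba_a, proba_b, proba_c], axis=0)
soft_votes = avg_proba.argmax(axis=1)

print(avg_proba)
print(soft_votes)
```

Unlike hard (majority) voting, a single confident estimator can outweigh two uncertain ones under this scheme.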

In [None]:
# Step 4: Train each ensemble method and evaluate accuracy
accuracy_results = {}
for name, ensemble in ensemble_methods.items():
    ensemble.fit(X_train, y_train)
    y_pred = ensemble.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_results[name] = accuracy
    print(f"Ensemble Method: {name}, Accuracy: {accuracy}")

In [None]:
# Step 5: Select Most Efficient Ensemble Method (EEM)
best_ensemble = max(accuracy_results, key=accuracy_results.get)
print(f"Most Efficient Ensemble Method: {best_ensemble}")

In [None]:
# Build the final model
from sklearn.metrics import classification_report, accuracy_score

# Retrieve the best model
best_model = ensemble_methods[best_ensemble]

# Make predictions with the best model on the test set
y_pred = best_model.predict(X_test)

# Calculate training accuracy
y_train_pred = best_model.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)
print(f"Training Accuracy for {best_ensemble}: {train_accuracy:.4f}")

# Calculate testing accuracy
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Testing Accuracy for {best_ensemble}: {test_accuracy:.4f}")

# Generate the classification report
report = classification_report(y_test, y_pred)
print(f"\nClassification Report for {best_ensemble}:")
print(report)

# Generate the confusion matrix
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix using seaborn heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False, xticklabels=best_model.classes_, yticklabels=best_model.classes_)
plt.title(f'Confusion Matrix for {best_ensemble}')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

## Save Model

In [None]:
# saving the model
import joblib

joblib.dump(best_model, 'MODELS/best_ensemble_model.pkl')
print("Best model saved as 'best_ensemble_model.pkl'")

# # Prediction Part

In [None]:
import joblib

# Load the saved model
model = joblib.load('MODELS/best_ensemble_model.pkl')

# Define class mappings
condition_class = {
    0: "Subclinical (initial level)", 
    1: "T3 toxic", 
    2: "toxic goitre", 
    3: "secondary toxic", 
    4: "Subclinical (initial level)",
    5: "primary hypothyroid",
    6: "compensated hypothyroid",
    7: "secondary hypothyroid",
}

disorder_class = {
    0: "hyperthyroid", 
    1: "hyperthyroid", 
    2: "hyperthyroid", 
    3: "hyperthyroid", 
    4: "hypothyroid",
    5: "hypothyroid",
    6: "hypothyroid",
    7: "hypothyroid",
}

# Prediction Function
def prediction_func(input_features):
    input_array = np.array([input_features])
    
    # Make prediction
    prediction = model.predict(input_array)
    
    # Convert prediction to class label
    predicted_class = prediction[0]

    # Map the class label to disorder and condition
    predicted_disorder = disorder_class[predicted_class]
    predicted_condition = condition_class[predicted_class]
    
    print(f"Predicted Disorder: {predicted_disorder}")
    print(f"Predicted Condition: {predicted_condition}")
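
Since the model is meant to back a web application, a variant that returns the decoded values instead of printing them may be easier to call from other code. The sketch below is one possible shape for that; `decode_prediction`, `HYPER_CODES`, and `toy_conditions` are hypothetical names, and the assumption that codes 0 to 3 are the hyperthyroid classes follows the `disorder_class` mapping above.

```python
# Codes assumed to be hyperthyroid, following the disorder_class mapping
HYPER_CODES = {0, 1, 2, 3}

def decode_prediction(label, conditions):
    """Return (disorder, condition) for an encoded class label."""
    if label not in conditions:
        raise KeyError(f"Unknown class label: {label}")
    disorder = "hyperthyroid" if label in HYPER_CODES else "hypothyroid"
    return disorder, conditions[label]

# Small excerpt of the condition mapping, for illustration only
toy_conditions = {0: "Subclinical (initial level)", 5: "primary hypothyroid"}

result = decode_prediction(5, toy_conditions)
print(result)
```

Raising on an unknown label makes a mismatch between the encoder's codes and the hard-coded dictionary visible instead of failing silently.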

#### Sample Outputs

In [None]:
# Sample input for hyperthyroid (Subclinical)
prediction_func([160.000000, 0, 204.000000, 1, 0, 0.030000, 0, 0.780000, 0, 0])

In [None]:
# Sample input for hyperthyroid (T3 toxic)
prediction_func([140.489084, 0, 117.644102, 1, 0, 0.030775, 0, 1.192227, 0, 0])

In [None]:
# Sample input for hyperthyroid (toxic goitre)
prediction_func([117.830119, 1, 106.245178, 1, 0, 0.865798, 0, 1.105849, 0, 0])

In [None]:
# Sample input for hyperthyroid (secondary toxic)
prediction_func([131.975559, 0, 487.814102, 1, 0, 7.953860, 0, 0.294808, 0, 0])

In [None]:
# Sample input for hypothyroid (Subclinical)
prediction_func([16.000000, 0, 15.000000, 0, 0, 298.456436, 0, 1.100000, 0, 1])

In [None]:
# Sample input for hypothyroid (primary hypothyroid)
prediction_func([3.900000, 0, 5.000000, 1, 0, 70.000000, 0, 0.830000, 0, 0])

In [None]:
# Sample input for hypothyroid (compensated hypothyroid)
prediction_func([78.000000, 0, 85.000000, 1, 0, 23.000000, 0, 0.920000, 0, 0])

In [None]:
# Sample input for hypothyroid (secondary hypothyroid)
prediction_func([47.263362, 0, 54.475693, 1, 0, 5.507901, 0, 0.874653, 0, 0])