## Group Project

## Importing the libraries

Let's import the necessary libraries to get started.

In [1]:
import pandas as pd

## The Data

Now we read the Heart Dataset into a dataframe.

In [2]:
heart_df = pd.read_csv('data/heart.dat', header=None, sep=" ")
heart_df.head()

,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,70.0,1.0,4.0,130.0,322.0,0.0,2.0,109.0,0.0,2.4,2.0,3.0,3.0,2
1,67.0,0.0,3.0,115.0,564.0,0.0,2.0,160.0,0.0,1.6,2.0,0.0,7.0,1
2,57.0,1.0,2.0,124.0,261.0,0.0,0.0,141.0,0.0,0.3,1.0,0.0,7.0,2
3,64.0,1.0,4.0,128.0,263.0,0.0,0.0,105.0,1.0,0.2,2.0,1.0,7.0,1
4,74.0,0.0,2.0,120.0,269.0,0.0,2.0,121.0,1.0,0.2,1.0,1.0,3.0,1


We now add the respective column names to the dataframe.

In [3]:
column_names = [
    "age",
    "sex",
    "cp_type",  # chest pain type
    "rest_bp",  # resting blood pressure
    "cholesterol",  # serum cholesterol in mg/dl
    "fast_bs",  # fasting blood sugar > 120 mg/dl
    "rest_ecg",  # resting electrocardiograph results
    "max_hr",  # maximum heart rate achieved
    "ex_angina",  # exercise induced angina
    "old_peak",  # old peak = ST depression induced by exercise relative to rest
    "slope",  # the slope of the peak exercise ST segment
    "num_vessels",  # number of major vessels colored by fluoroscopy
    "thal",  # thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
    "hd_presence"  # heart disease presence
]

heart_df.columns = column_names
heart_df.head()

,age,sex,cp_type,rest_bp,cholesterol,fast_bs,rest_ecg,max_hr,ex_angina,old_peak,slope,num_vessels,thal,hd_presence
0,70.0,1.0,4.0,130.0,322.0,0.0,2.0,109.0,0.0,2.4,2.0,3.0,3.0,2
1,67.0,0.0,3.0,115.0,564.0,0.0,2.0,160.0,0.0,1.6,2.0,0.0,7.0,1
2,57.0,1.0,2.0,124.0,261.0,0.0,0.0,141.0,0.0,0.3,1.0,0.0,7.0,2
3,64.0,1.0,4.0,128.0,263.0,0.0,0.0,105.0,1.0,0.2,2.0,1.0,7.0,1
4,74.0,0.0,2.0,120.0,269.0,0.0,2.0,121.0,1.0,0.2,1.0,1.0,3.0,1


In [4]:
y = heart_df['hd_presence']
X = heart_df.drop('hd_presence', axis=1)

## ReliefF Feature Ranking

We first ranked the features with ReliefF, using the Ranker search method. We kept the method's default parameters, which compare every instance with its five nearest neighbors. The highest-scoring features were thal, sex, and CP (`cp_type`), with scores of 0.0821, 0.0793, and 0.0790. Age, slope, depression (`old_peak`), and Ca (`num_vessels`) scored markedly lower than the other features: 0.0188, 0.0157, 0.0118, and 0.0114. We removed these four features and saved the remaining nine as the Heart-2 dataset.

In [5]:
from skrebate import ReliefF

reliefF = ReliefF(verbose=True, n_jobs=-1, n_features_to_select=len(X.columns), discrete_threshold=2)

reliefF.fit(X.to_numpy(), y.to_numpy())
feature_importance = reliefF.feature_importances_

feature_rank_df = pd.DataFrame({
    'Feature': X.columns,
    'feature_importance': feature_importance,
})

feature_rank_df = feature_rank_df.sort_values(by='feature_importance', ascending=False)
feature_rank_df

Created distance array in 0.0039997100830078125 seconds.
Feature scoring under way ...
Completed scoring in 1.9806015491485596 seconds.
        Feature  feature_importance
12         thal            0.281435
11  num_vessels            0.177778
8     ex_angina            0.175222
2       cp_type            0.157889
10        slope            0.139037
9      old_peak            0.108556
7        max_hr            0.101460
1           sex            0.073778
0           age            0.036615
6      rest_ecg            0.020259
4   cholesterol            0.012076
3       rest_bp            0.007401
5       fast_bs            0.003889
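To make the scoring idea concrete, here is a minimal sketch of the basic Relief update that ReliefF builds on. It is not the skrebate implementation: it assumes two classes and numeric features, and it uses a single nearest hit and miss per sample instead of averaging over several neighbors. The function name `relief_scores` and the toy data are hypothetical.

```python
import random


def relief_scores(X, y, n_iter=200, seed=0):
    """Basic Relief: one nearest hit and one nearest miss per sampled row.

    Assumption: two classes, numeric features. ReliefF generalises this by
    averaging over k nearest hits and misses.
    """
    rng = random.Random(seed)
    n_feat = len(X[0])

    # Feature ranges, used to scale every per-feature difference into [0, 1].
    spans = []
    for f in range(n_feat):
        col = [row[f] for row in X]
        spans.append((max(col) - min(col)) or 1.0)

    def diff(f, a, b):
        return abs(a[f] - b[f]) / spans[f]

    def distance(a, b):
        return sum(diff(f, a, b) for f in range(n_feat))

    weights = [0.0] * n_feat
    for _ in range(n_iter):
        i = rng.randrange(len(X))
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        hit = min(hits, key=lambda j: distance(X[i], X[j]))
        miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(n_feat):
            # Reward features that differ across classes and penalise
            # features that differ within a class.
            weights[f] += (diff(f, X[i], X[miss]) - diff(f, X[i], X[hit])) / n_iter
    return weights


# Toy data (hypothetical): feature 0 determines the label, feature 1 is noise.
toy_rng = random.Random(1)
X_toy = [[toy_rng.choice([0.0, 1.0]), toy_rng.random()] for _ in range(60)]
y_toy = [int(row[0]) for row in X_toy]

scores = relief_scores(X_toy, y_toy)
print(scores)
```

On this toy data the label-determining feature should receive a much larger weight than the noise feature, which is the same kind of ordering the importance table above expresses.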


## Fast Correlation-Based Filter (FCBF) Feature Ranking

We also ranked the features with the FCBF attribute evaluator, again using the Ranker search method. The highest-scoring features according to FCBF were cp (`cp_type`), heart rate (`max_hr`), and thal, with scores of 0.1728, 0.1701, and 0.1598. Again, the lowest-ranked features were age, slope, depression (`old_peak`), and Ca (`num_vessels`), with scores of 0.0341, 0.0228, 0.0116, and 0.0088. We removed these four features and kept the remaining nine to create the Heart-2 dataset.

In [6]:
from fcbf import fcbf

relevant_features, irrelevant_features, correlations = fcbf(X, y, base=2)
print('correlations:', correlations)

relevant_features: ['thal', 'cp_type', 'num_vessels', 'ex_angina', 'slope', 'rest_ecg'] (count: 6 )
irrelevant_features: ['fast_bs', 'age', 'rest_bp', 'sex', 'old_peak', 'max_hr', 'cholesterol'] (count: 7 )
correlations: {'thal': 0.1888023810360205, 'cp_type': 0.1416013911225963, 'num_vessels': 0.1371973858796752, 'ex_angina': 0.13634876704082757, 'slope': 0.09762728006147679, 'rest_ecg': 0.023604492346983152, 'fast_bs': 0.00024132042088733543, 'age': 0.050784796796006594, 'rest_bp': 0.05156329996334495, 'sex': 0.0704959014313005, 'old_peak': 0.09424607729730912, 'max_hr': 0.09924288899494042, 'cholesterol': 0.15297356642114715}
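FCBF scores each feature by its symmetrical uncertainty (SU) with the class label. The sketch below shows how SU can be computed for discrete variables. It is an illustration only, not the code of the `fcbf` package; the helper names and the toy vectors are hypothetical, and continuous features would need to be discretised first.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(x, y) = 2 * I(x; y) / (H(x) + H(y)), a value in [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0
    h_xy = entropy(list(zip(x, y)))  # joint entropy H(x, y)
    mutual_info = h_x + h_y - h_xy
    return 2.0 * mutual_info / (h_x + h_y)


# Toy vectors (hypothetical), not taken from the heart dataset.
label = [1, 1, 2, 2, 1, 2, 1, 2]
copy_of_label = [0, 0, 1, 1, 0, 1, 0, 1]  # carries the same information as label
unrelated = [0, 1, 0, 1, 1, 0, 0, 1]      # independent of label in this sample

su_copy = symmetrical_uncertainty(copy_of_label, label)
su_unrelated = symmetrical_uncertainty(unrelated, label)
print(su_copy, su_unrelated)
```

A feature that fully determines the label scores at the top of the range and an independent one at the bottom; the `correlations` dictionary above holds values on this same 0 to 1 scale.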


## Genetic Algorithm Test 2

In [72]:
import numpy as np
import pandas as pd
from random import randint

import warnings

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.metrics import accuracy_score
from statistics import median

warnings.filterwarnings("ignore")

In [73]:
models = {
    'Logistic Regression': LogisticRegression(n_jobs=-1),
    'Gaussian Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(),
    'SVM': SVC(kernel='linear'),
    'Gradient Boosting': GradientBoostingClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=100, n_jobs=-1, max_depth=10),
    'K-Nearest Neighbors': KNeighborsClassifier(n_jobs=-1),
}

In [74]:
def population_init(size, n_feat):
    """Create `size` random boolean masks with int(0.3 * n_feat) features switched off."""
    population = []

    for i in range(size):
        chromosome = np.ones(n_feat, dtype='bool')
        chromosome[:int(0.3 * n_feat)] = False
        np.random.shuffle(chromosome)
        population.append(chromosome)

    return population


def fitness_score(population, model, X, y):
    """Score each chromosome by the median accuracy over 10 shuffled 70/30 splits.

    Returns the scores and the chromosomes, both sorted from best to worst.
    """
    scores = []

    for chromosome in population:
        # The fixed random_state gives every chromosome the same splits.
        cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
        accuracy_scores = cross_val_score(model, X.iloc[:, chromosome], y, scoring='accuracy', cv=cv)
        median_score = median(accuracy_scores)
        scores.append(median_score)

    scores, population = np.array(scores), np.array(population)
    indices = np.argsort(scores)

    return list(scores[indices][::-1]), list(population[indices, :][::-1])


def selection(pop_after_fit, selection_prop=0.5):
    """Keep the top `selection_prop` fraction of an already sorted population."""
    n_select = int(len(pop_after_fit) * selection_prop)
    population_nextgen = pop_after_fit[:n_select]

    return population_nextgen


def crossover(pop_after_sel, size, crossover_prop=0.5):
    """Build size - 2 children by single-point crossover of randomly paired parents.

    The two free slots are filled by the elites that generations() appends,
    so this assumes n_elites == 2.
    """
    pop_nextgen = []

    while len(pop_nextgen) < size - 2:
        # Parents are drawn with replacement, so a chromosome may pair with itself.
        parent1 = pop_after_sel[randint(0, len(pop_after_sel) - 1)]
        parent2 = pop_after_sel[randint(0, len(pop_after_sel) - 1)]

        # Both children swap tails at the same split point.
        split_index = int(len(parent1) * crossover_prop)

        child1 = np.concatenate((parent1[:split_index], parent2[split_index:]))
        child2 = np.concatenate((parent2[:split_index], parent1[split_index:]))

        pop_nextgen.append(child1)
        pop_nextgen.append(child2)

    return pop_nextgen[:size - 2]


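def mutation(pop_after_cross, n_feat, mutation_rate=0.3):
    """Flip int(mutation_rate * n_feat) randomly chosen positions in each chromosome, in place."""
    # int() truncates, so when mutation_rate * n_feat < 1 (e.g. 0.033 * 9)
    # no position is flipped at all.
    mutation_range = int(mutation_rate * n_feat)
    pop_next_gen = []

    for chromo in pop_after_cross:
        # Positions are drawn with replacement; a position drawn twice flips back.
        rand_position = [randint(0, n_feat - 1) for _ in range(mutation_range)]

        for j in rand_position:
            chromo[j] = not chromo[j]

        pop_next_gen.append(chromo)

    return pop_next_gen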


def generations(X, y, model_name, model, size, n_feat, selection_prop=0.5, crossover_prop=0.5, mutation_rate=0.3,
                n_elites=2, n_gen=5, stall_gen=2):
    best_chromo = []
    best_score = []
    consecutive_same_chromo = 0
    last_best_chromo = None

    population_nextgen = population_init(size, n_feat)

    print('\nClassifier Running:', model_name)

    for i in range(n_gen):
        scores, pop_after_fit = fitness_score(
            population=population_nextgen,
            model=model,
            X=X,
            y=y,
        )

        print('Best score in generation', i + 1, ':', scores[:1], pop_after_fit[:1])

        if last_best_chromo is not None and np.array_equal(last_best_chromo, pop_after_fit[0]):
            consecutive_same_chromo += 1
        else:
            consecutive_same_chromo = 0

        last_best_chromo = pop_after_fit[0]

        if consecutive_same_chromo >= stall_gen:
            print(
                f'Stopping early at generation {i + 1} because the best chromosome has not changed for {stall_gen} generations.')
            break

        pop_after_sel = selection(pop_after_fit, selection_prop=selection_prop)
        pop_after_cross = crossover(pop_after_sel, size, crossover_prop=crossover_prop)
        population_nextgen = mutation(pop_after_cross, n_feat, mutation_rate=mutation_rate)
        population_nextgen.extend(pop_after_fit[:n_elites])

        best_chromo.append(pop_after_fit[0])
        best_score.append(scores[0])

    return best_chromo, best_score
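
The `mutation` function above flips `int(mutation_rate * n_feat)` positions per chromosome. With the settings used later in this notebook (`mutation_rate=0.033` and 9 features) that product truncates to 0, so no gene is ever flipped. One common alternative, shown here only as a sketch and not as the method used in this notebook, is to flip each gene independently with probability `mutation_rate`. The name `bit_flip_mutation` and the example chromosome are hypothetical.

```python
import numpy as np


def bit_flip_mutation(population, mutation_rate, rng):
    """Flip each gene independently with probability `mutation_rate`.

    Returns new arrays, so the chromosomes passed in are left untouched.
    """
    mutated = []
    for chromosome in population:
        # True where a gene should be flipped.
        flip_mask = rng.random(chromosome.shape[0]) < mutation_rate
        mutated.append(np.logical_xor(chromosome, flip_mask))
    return mutated


rng = np.random.default_rng(0)
parents = [np.array([True, False, True, True, False, True, False, True, True])]

# Number of flips the position-count rule yields for 9 features at rate 0.033.
flips_with_count_rule = int(0.033 * 9)

# Boundary cases: rate 0 changes nothing, rate 1 inverts every gene.
unchanged = bit_flip_mutation(parents, mutation_rate=0.0, rng=rng)
inverted = bit_flip_mutation(parents, mutation_rate=1.0, rng=rng)
print(flips_with_count_rule, unchanged[0], inverted[0])
```

With a per-gene probability, a small rate such as 0.033 still produces an occasional flip across a population of 50 chromosomes instead of none.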

In [75]:
heart_df = pd.read_csv("data/heart.dat", header=None, sep=" ")

column_names = [
    "age",
    "sex",
    "cp_type",  # chest pain type
    "rest_bp",  # resting blood pressure
    "cholesterol",  # serum cholesterol in mg/dl
    "fast_bs",  # fasting blood sugar > 120 mg/dl
    "rest_ecg",  # resting electrocardiograph results
    "max_hr",  # maximum heart rate achieved
    "ex_angina",  # exercise induced angina
    "old_peak",  # old peak = ST depression induced by exercise relative to rest
    "slope",  # the slope of the peak exercise ST segment
    "num_vessels",  # number of major vessels colored by fluoroscopy
    "thal",  # thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
    "hd_presence"  # heart disease presence
]

heart_df.columns = column_names
heart_df.head()

,age,sex,cp_type,rest_bp,cholesterol,fast_bs,rest_ecg,max_hr,ex_angina,old_peak,slope,num_vessels,thal,hd_presence
0,70.0,1.0,4.0,130.0,322.0,0.0,2.0,109.0,0.0,2.4,2.0,3.0,3.0,2
1,67.0,0.0,3.0,115.0,564.0,0.0,2.0,160.0,0.0,1.6,2.0,0.0,7.0,1
2,57.0,1.0,2.0,124.0,261.0,0.0,0.0,141.0,0.0,0.3,1.0,0.0,7.0,2
3,64.0,1.0,4.0,128.0,263.0,0.0,0.0,105.0,1.0,0.2,2.0,1.0,7.0,1
4,74.0,0.0,2.0,120.0,269.0,0.0,2.0,121.0,1.0,0.2,1.0,1.0,3.0,1


In [76]:
y = heart_df["hd_presence"]
X = heart_df.drop(["hd_presence", "age", "slope", "old_peak", "num_vessels"], axis=1)

print("Heart Disease dataset:", X.shape[0], "Records &", X.shape[1], "Features")

Heart Disease dataset: 270 Records & 9 Features


In [77]:
X.head()

,sex,cp_type,rest_bp,cholesterol,fast_bs,rest_ecg,max_hr,ex_angina,thal
0,1.0,4.0,130.0,322.0,0.0,2.0,109.0,0.0,3.0
1,0.0,3.0,115.0,564.0,0.0,2.0,160.0,0.0,7.0
2,1.0,2.0,124.0,261.0,0.0,0.0,141.0,0.0,7.0
3,1.0,4.0,128.0,263.0,0.0,0.0,105.0,1.0,7.0
4,0.0,2.0,120.0,269.0,0.0,2.0,121.0,1.0,3.0


In [78]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

selected_features = {}

for name, model in models.items():
    chromo_df_bc, score_bc = generations(
        X=X_train,
        y=y_train,
        model_name=name,
        model=model,
        size=50,
        n_feat=X.shape[1],
        selection_prop=0.5,
        crossover_prop=0.6,
        mutation_rate=0.033,
        n_elites=2,
        n_gen=5,
        stall_gen=2,
    )
    selected_features[name] = chromo_df_bc[-1]


Classifier Running: Logistic Regression
Best score in generation 1 : [0.8524590163934427] [array([ True,  True,  True, False,  True,  True,  True, False,  True])]
Best score in generation 2 : [0.860655737704918] [array([ True,  True,  True,  True, False, False,  True, False,  True])]
Best score in generation 3 : [0.860655737704918] [array([ True,  True,  True,  True, False, False,  True, False,  True])]
Best score in generation 4 : [0.860655737704918] [array([ True,  True,  True,  True, False, False,  True, False,  True])]
Stopping early at generation 4 because the best chromosome has not changed for 2 generations.

Classifier Running: Gaussian Naive Bayes
Best score in generation 1 : [0.8524590163934426] [array([ True,  True,  True,  True, False,  True,  True, False,  True])]
Best score in generation 2 : [0.860655737704918] [array([ True,  True,  True, False, False,  True,  True, False,  True])]
Best score in generation 3 : [0.860655737704918] [array([ True,  True,  True, False, Fals

In [80]:
selected_features

{'Logistic Regression': array([ True,  True,  True,  True, False, False,  True, False,  True]),
 'Gaussian Naive Bayes': array([ True,  True,  True, False, False,  True,  True, False,  True]),
 'Decision Tree': array([ True,  True,  True, False,  True,  True, False,  True,  True]),
 'SVM': array([ True,  True,  True,  True,  True,  True,  True, False,  True]),
 'Gradient Boosting': array([ True,  True,  True,  True, False,  True, False,  True,  True]),
 'Random Forest': array([ True,  True,  True,  True, False,  True,  True,  True,  True]),
 'K-Nearest Neighbors': array([ True,  True, False, False,  True,  True, False,  True,  True])}

In [83]:
base_models = []
for name, model in models.items():
    new_selected_features = X_train.columns[selected_features[name]]
    base_models.append((name, model.fit(X_train[new_selected_features], y_train)))

meta_X_train = np.column_stack(
    [model.predict(X_train[X_train.columns[selected_features[name]]]) for name, model in base_models])
meta_X_test = np.column_stack(
    [model.predict(X_test[X_test.columns[selected_features[name]]]) for name, model in base_models])

meta_chromosome, meta_score = generations(
    X=pd.DataFrame(meta_X_train),
    y=y_train,
    model_name='Meta-Classifier',
    model=AdaBoostClassifier(),
    size=50,
    n_feat=meta_X_train.shape[1],
    selection_prop=0.5,
    crossover_prop=0.6,
    mutation_rate=0.033,
    n_elites=2,
    n_gen=10,
    stall_gen=2,
)


Classifier Running: Meta-Classifier
Best score in generation 1 : [1.0] [array([ True,  True, False,  True, False,  True,  True])]
Best score in generation 2 : [1.0] [array([ True,  True, False, False,  True,  True,  True])]
Best score in generation 3 : [1.0] [array([ True,  True, False,  True, False,  True,  True])]
Best score in generation 4 : [1.0] [array([False, False,  True,  True,  True,  True, False])]
Best score in generation 5 : [1.0] [array([ True, False,  True, False,  True,  True,  True])]
Best score in generation 6 : [1.0] [array([ True, False,  True,  True, False,  True,  True])]
Best score in generation 7 : [1.0] [array([ True, False,  True,  True,  True,  True,  True])]
Best score in generation 8 : [1.0] [array([ True,  True, False, False,  True,  True, False])]
Best score in generation 9 : [1.0] [array([ True, False,  True, False,  True,  True,  True])]
Best score in generation 10 : [1.0] [array([ True, False,  True, False,  True,  True,  True])]
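
In the cell above, the meta-classifier's training inputs are predictions that the base models make on the same rows they were fitted on, which may contribute to the perfect 1.0 fitness scores. A common alternative, sketched here under stated assumptions and not what this notebook does, is to build the training meta-features from out-of-fold predictions with scikit-learn's `cross_val_predict`. The synthetic data, the two-model dictionary, and all variable names below are hypothetical stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), not the heart dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

demo_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

# Training meta-features: each prediction comes from a fold model that
# did not see that row during fitting.
oof_train = np.column_stack([
    cross_val_predict(model, X_tr, y_tr, cv=5) for model in demo_models.values()
])

# Test meta-features: refit each base model on the full training split.
meta_test = np.column_stack([
    model.fit(X_tr, y_tr).predict(X_te) for model in demo_models.values()
])

print(oof_train.shape, meta_test.shape)
```

The resulting arrays have one column per base model, like `meta_X_train` and `meta_X_test` above, so they could be passed to the same meta-classifier search.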


In [82]:
meta_model = AdaBoostClassifier()
selected_meta_features = meta_chromosome[-1]
meta_model.fit(meta_X_train[:, selected_meta_features], y_train)

stacking_predictions = meta_model.predict(meta_X_test[:, selected_meta_features])
accuracy = np.mean(stacking_predictions == y_test)
print('Stacking Classifier with Genetic Algorithm Optimized Meta-Classifier Accuracy:', accuracy)

Stacking Classifier with Genetic Algorithm Optimized Meta-Classifier Accuracy: 0.8235294117647058
