## Group Project

## Importing the libraries

Let's import the necessary libraries to get started.

In [1]:
import pandas as pd

## The Data

Now we read the Heart Dataset into a dataframe.

In [2]:
heart_df = pd.read_csv('data/heart.dat', header=None, sep=" ")
heart_df.head()

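|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70.0 | 1.0 | 4.0 | 130.0 | 322.0 | 0.0 | 2.0 | 109.0 | 0.0 | 2.4 | 2.0 | 3.0 | 3.0 | 2 |
| 1 | 67.0 | 0.0 | 3.0 | 115.0 | 564.0 | 0.0 | 2.0 | 160.0 | 0.0 | 1.6 | 2.0 | 0.0 | 7.0 | 1 |
| 2 | 57.0 | 1.0 | 2.0 | 124.0 | 261.0 | 0.0 | 0.0 | 141.0 | 0.0 | 0.3 | 1.0 | 0.0 | 7.0 | 2 |
| 3 | 64.0 | 1.0 | 4.0 | 128.0 | 263.0 | 0.0 | 0.0 | 105.0 | 1.0 | 0.2 | 2.0 | 1.0 | 7.0 | 1 |
| 4 | 74.0 | 0.0 | 2.0 | 120.0 | 269.0 | 0.0 | 2.0 | 121.0 | 1.0 | 0.2 | 1.0 | 1.0 | 3.0 | 1 |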


We now add the respective column names to the dataframe.

In [3]:
column_names = [
    "age",
    "sex",
    "cp_type",  # chest pain type
    "rest_bp",  # resting blood pressure
    "cholesterol",  # serum cholesterol in mg/dl
    "fast_bs",  # fasting blood sugar > 120 mg/dl
    "rest_ecg",  # resting electrocardiograph results
    "max_hr",  # maximum heart rate achieved
    "ex_angina",  # exercise induced angina
    "old_peak",  # old peak = ST depression induced by exercise relative to rest
    "slope",  # the slope of the peak exercise ST segment
    "num_vessels",  # number of major vessels colored by fluoroscopy
    "thal",  # thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
    "hd_presence"  # heart disease presence
]

heart_df.columns = column_names
heart_df.head()

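|   | age | sex | cp_type | rest_bp | cholesterol | fast_bs | rest_ecg | max_hr | ex_angina | old_peak | slope | num_vessels | thal | hd_presence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70.0 | 1.0 | 4.0 | 130.0 | 322.0 | 0.0 | 2.0 | 109.0 | 0.0 | 2.4 | 2.0 | 3.0 | 3.0 | 2 |
| 1 | 67.0 | 0.0 | 3.0 | 115.0 | 564.0 | 0.0 | 2.0 | 160.0 | 0.0 | 1.6 | 2.0 | 0.0 | 7.0 | 1 |
| 2 | 57.0 | 1.0 | 2.0 | 124.0 | 261.0 | 0.0 | 0.0 | 141.0 | 0.0 | 0.3 | 1.0 | 0.0 | 7.0 | 2 |
| 3 | 64.0 | 1.0 | 4.0 | 128.0 | 263.0 | 0.0 | 0.0 | 105.0 | 1.0 | 0.2 | 2.0 | 1.0 | 7.0 | 1 |
| 4 | 74.0 | 0.0 | 2.0 | 120.0 | 269.0 | 0.0 | 2.0 | 121.0 | 1.0 | 0.2 | 1.0 | 1.0 | 3.0 | 1 |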


In [4]:
y = heart_df['hd_presence']
X = heart_df.drop('hd_presence', axis=1)

## ReliefF Feature Ranking

In the first step, we ranked the features with ReliefF using the Ranker search method. We left the method's parameters at their defaults, under which every instance is compared with its five nearest neighbors. The highest-ranked features were thal, sex, and CP (chest pain type), with scores of 0.0821, 0.0793, and 0.0790. Age, slope, depression (`old_peak`), and Ca (`num_vessels`) scored markedly lower than the other features: 0.0188, 0.0157, 0.0118, and 0.0114. We removed these four features and saved the remaining nine as the Heart-2 dataset. Note that the `skrebate` output below does not reproduce these figures: both its scores and its ranking differ.

In [5]:
from skrebate import ReliefF

reliefF = ReliefF(verbose=True, n_jobs=-1, n_features_to_select=len(X.columns), discrete_threshold=2)

reliefF.fit(X.to_numpy(), y.to_numpy())
feature_importance = reliefF.feature_importances_

feature_rank_df = pd.DataFrame({
    'Feature': X.columns,
    'feature_importance': feature_importance,
})

feature_rank_df = feature_rank_df.sort_values(by='feature_importance', ascending=False)
print(feature_rank_df)

Created distance array in 0.0039997100830078125 seconds.
Feature scoring under way ...
Completed scoring in 1.9806015491485596 seconds.
        Feature  feature_importance
12         thal            0.281435
11  num_vessels            0.177778
8     ex_angina            0.175222
2       cp_type            0.157889
10        slope            0.139037
9      old_peak            0.108556
7        max_hr            0.101460
1           sex            0.073778
0           age            0.036615
6      rest_ecg            0.020259
4   cholesterol            0.012076
3       rest_bp            0.007401
5       fast_bs            0.003889
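
To make the nearest-neighbor idea behind ReliefF concrete, the following is a minimal sketch of a simplified Relief-style scorer for numeric features. It is not `skrebate`'s implementation: the function name `relief_scores`, the min-max scaling, the Manhattan distance, and the synthetic data are all assumptions made for illustration, and the sketch assumes every class has more than `k` instances.

```python
import numpy as np


def relief_scores(X, y, k=5):
    """Simplified Relief-style feature scores (illustrative, not skrebate's algorithm)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Scale every feature to [0, 1] so per-feature differences are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    X_scaled = (X - X.min(axis=0)) / span

    n_samples, n_features = X_scaled.shape
    scores = np.zeros(n_features)

    for i in range(n_samples):
        diffs = np.abs(X_scaled - X_scaled[i])  # per-feature differences to instance i
        dist = diffs.sum(axis=1)                # Manhattan distance to instance i
        dist[i] = np.inf                        # never pick the instance itself

        same_class = y == y[i]
        hits = np.argsort(np.where(same_class, dist, np.inf))[:k]
        misses = np.argsort(np.where(~same_class, dist, np.inf))[:k]

        # Reward features that differ on near misses, penalize those that differ on near hits.
        scores += diffs[misses].mean(axis=0) - diffs[hits].mean(axis=0)

    return scores / n_samples


# Synthetic check (assumed data): feature 0 tracks the class, feature 1 is noise.
rng = np.random.default_rng(0)
y_demo = np.tile([0, 1], 20)
X_demo = np.column_stack([
    y_demo + rng.normal(scale=0.05, size=y_demo.size),
    rng.uniform(size=y_demo.size),
])
demo_scores = relief_scores(X_demo, y_demo, k=5)
print(demo_scores)
```

In this toy setting the class-tracking feature should receive a clearly higher score than the noise feature, which is the behavior a ranking such as the one above relies on.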


## Fast Correlation-Based Filter (FCBF) Feature Ranking

We also ranked the features with the FCBF attribute evaluator, again using the Ranker search method. The highest-ranked features according to FCBF were cp, heart rate, and thal, with scores of 0.1728, 0.1701, and 0.1598. The lowest-ranked features were again age, slope, depression, and Ca, with 0.0341, 0.0228, 0.0116, and 0.0088. Because these are the same four features that ReliefF ranked lowest, removing them yields the same nine-feature Heart-2 dataset.

In [6]:
from fcbf import fcbf

relevant_features, irrelevant_features, correlations = fcbf(X, y, base=2)
print('relevant_features:', relevant_features, '(count:', len(relevant_features), ')')
print('irrelevant_features:', irrelevant_features, '(count:', len(irrelevant_features), ')')
print('correlations:', correlations)

relevant_features: ['thal', 'cp_type', 'num_vessels', 'ex_angina', 'slope', 'rest_ecg'] (count: 6 )
irrelevant_features: ['fast_bs', 'age', 'rest_bp', 'sex', 'old_peak', 'max_hr', 'cholesterol'] (count: 7 )
correlations: {'thal': 0.1888023810360205, 'cp_type': 0.1416013911225963, 'num_vessels': 0.1371973858796752, 'ex_angina': 0.13634876704082757, 'slope': 0.09762728006147679, 'rest_ecg': 0.023604492346983152, 'fast_bs': 0.00024132042088733543, 'age': 0.050784796796006594, 'rest_bp': 0.05156329996334495, 'sex': 0.0704959014313005, 'old_peak': 0.09424607729730912, 'max_hr': 0.09924288899494042, 'cholesterol': 0.15297356642114715}
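
FCBF, as originally described, scores features with symmetrical uncertainty (SU), a normalized form of mutual information. The sketch below shows how SU can be computed for two discrete variables. It assumes, without having checked the package source, that the `fcbf` call above follows the original definition; the helper names `entropy` and `symmetrical_uncertainty` and the toy inputs are illustrative only.

```python
from collections import Counter
from math import log


def entropy(values, base=2):
    """Shannon entropy of a sequence of discrete values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * log(c / total, base) for c in counts.values())


def symmetrical_uncertainty(x, y, base=2):
    """SU(x, y) = 2 * I(x; y) / (H(x) + H(y)), which lies in [0, 1]."""
    h_x, h_y = entropy(x, base), entropy(y, base)
    h_xy = entropy(list(zip(x, y)), base)  # joint entropy H(x, y)
    mutual_information = h_x + h_y - h_xy
    denominator = h_x + h_y
    return 0.0 if denominator == 0 else 2.0 * mutual_information / denominator


feature = [0, 0, 1, 1]
print(symmetrical_uncertainty(feature, feature))       # identical variables
print(symmetrical_uncertainty(feature, [0, 1, 0, 1]))  # independent variables
```

SU is defined on discrete values, so continuous columns such as `age` or `cholesterol` would need to be discretized before a computation like this one applies.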


## Genetic Algorithm Test 1

In [7]:
'''
import random
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from deap import creator, base, tools, algorithms
import warnings

warnings.filterwarnings('ignore')


def avg(l):
    """
    Returns the average between list elements
    """
    return sum(l) / float(len(l))


def get_fitness(individual, X, y):
    """
    Feature subset fitness function
    """

    if individual.count(0) != len(individual):
        # get index with value 0
        cols = [index for index in range(
            len(individual)) if individual[index] == 0]

        # get features subset
        X_parsed = X.drop(X.columns[cols], axis=1)
        X_subset = pd.get_dummies(X_parsed)

        # apply classification algorithm
        clf = MLPClassifier(random_state=1, max_iter=300)

        return (avg(cross_val_score(clf, X_subset, y, cv=2, scoring='accuracy', n_jobs=-1)),)
    else:
        return (0,)


def genetic_algorithm(X, y, n_population, n_generation):
    """
    Deap global variables
    Initialize variables to use eaSimple
    """
    # create individual
    creator.create("FitnessMax", base.Fitness, weights=(1.0,))
    creator.create("Individual", list, fitness=creator.FitnessMax)

    # create toolbox
    toolbox = base.Toolbox()
    toolbox.register("attr_bool", random.randint, 0, 1)
    toolbox.register("individual", tools.initRepeat,
                     creator.Individual, toolbox.attr_bool, len(X.columns))
    toolbox.register("population", tools.initRepeat, list,
                     toolbox.individual)
    toolbox.register("evaluate", get_fitness, X=X, y=y)
    toolbox.register("mate", tools.cxOnePoint)
    toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
    toolbox.register("select", tools.selTournament, tournsize=3)

    # initialize parameters
    pop = toolbox.population(n=n_population)
    hof = tools.HallOfFame(n_population * n_generation)
    stats = tools.Statistics(lambda ind: ind.fitness.values)
    stats.register("avg", np.mean)
    stats.register("min", np.min)
    stats.register("max", np.max)

    # genetic algorithm
    pop, log = algorithms.eaSimple(pop, toolbox, cxpb=0.2, mutpb=0.1,
                                   ngen=n_generation, stats=stats, halloffame=hof,
                                   verbose=True)

    # return hall of fame
    return hof


def best_individual(hof, X, y):
    """
    Get the best individual
    """
    # The hall of fame is kept sorted best-first, so the best individual is at index 0
    _individual = hof[0]

    _individualHeader = [list(X)[i] for i in range(
        len(_individual)) if _individual[i] == 1]
    return _individual.fitness.values, _individual, _individualHeader


if __name__ == '__main__':
    # read dataframe from csv
    df = pd.read_csv('data/heart.dat', header=None, sep=" ")
    n_pop = 8
    n_gen = 8

    # encode labels column to numbers
    le = LabelEncoder()
    le.fit(df.iloc[:, -1])
    y = le.transform(df.iloc[:, -1])
    X = df.iloc[:, :-1]

    # get accuracy with all features
    individual = [1 for i in range(len(X.columns))]
    print("Accuracy with all features: \t" +
          str(get_fitness(individual, X, y)) + "\n")

    # apply genetic algorithm
    hof = genetic_algorithm(X, y, n_pop, n_gen)

    # select the best individual
    accuracy, individual, header = best_individual(hof, X, y)
    print('\n\nBest Accuracy: \t' + str(accuracy))
    print('Number of Features in Subset: \t' + str(individual.count(1)))
    print('Individual: \t\t' + str(individual))
    print('Feature Subset\t: ' + str(header))

    print('\n\nCreating a new classifier with the result')

    # read dataframe from csv one more time
    df = pd.read_csv('data/heart.dat', header=None, sep=" ")

    # with feature subset
    X = df[header]

    clf = MLPClassifier(random_state=1, max_iter=300)

    scores = cross_val_score(clf, X, y, cv=2, scoring='accuracy', n_jobs=-1)
    print("Accuracy with Feature Subset: \t" + str(avg(scores)) + "\n")
    
'''



## Genetic Algorithm Test 2

In [13]:
import numpy as np
import pandas as pd
from random import randint

import warnings

warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import cross_val_score, ShuffleSplit
from statistics import median
from sklearn.metrics import accuracy_score

from mlxtend.classifier import StackingClassifier

classifiers = [
    'Logistic Regression',
    'Gaussian Naive Bayes',
    'Decision Tree',
    'SVM',
    'Gradient Boosting',
    'Random Forest',
    'K-Nearest Neighbors'
]

models = {
    'Logistic Regression': LogisticRegression(),
    'Gaussian Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(),
    'SVM': svm.SVC(kernel='linear'),
    'Gradient Boosting': GradientBoostingClassifier(),
    'Random Forest': RandomForestClassifier(),
    'K-Nearest Neighbors': KNeighborsClassifier(),
}


# def acc_score(df, label):
#     X_train, X_test, Y_train, Y_test = split(df, label)
#     acc = []
#
#     for model in models.values():
#         model.fit(X_train, Y_train)
#         predictions = model.predict(X_test)
#         acc.append(accuracy_score(Y_test, predictions))
#
#     Score = pd.DataFrame({"Classifier": classifiers, "Accuracy": acc})
#     Score.sort_values(by="Accuracy", ascending=False, inplace=True)
#     Score.reset_index(drop=True, inplace=True)
#
#     return Score


# def plot(score, x, y, c="b"):
#     gen = [1, 2, 3, 4, 5]
#     plt.figure(figsize=(6, 4))
#     ax = sns.pointplot(x=gen, y=score, color=c)
#     ax.set(xlabel="Generation", ylabel="Accuracy")
#     ax.set(ylim=(x, y))

In [14]:
def split(df, label):
    X_tr, X_te, Y_tr, Y_te = train_test_split(df, label, test_size=0.25, random_state=42)

    return X_tr, X_te, Y_tr, Y_te


def initilization_of_population(size, n_feat):
    population = []

    for i in range(size):
        chromosome = np.ones(n_feat, dtype='bool')
        chromosome[:int(0.3 * n_feat)] = False
        np.random.shuffle(chromosome)
        population.append(chromosome)

    return population


def fitness_score(population, model, X, y):
    scores = []

    for chromosome in population:
        cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
        accuracy_scores = cross_val_score(model, X.iloc[:, chromosome], y, scoring='accuracy', cv=cv)
        median_score = median(accuracy_scores)
        scores.append(median_score)

    scores, population = np.array(scores), np.array(population)
    indices = np.argsort(scores)

    return list(scores[indices][::-1]), list(population[indices, :][::-1])


def selection(pop_after_fit, n_parents):
    population_nextgen = []

    for i in range(n_parents):
        population_nextgen.append(pop_after_fit[i])

    return population_nextgen


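def crossover(pop_after_sel):
    # Copy the list so the caller's list of parents is not extended in place.
    pop_nextgen = list(pop_after_sel)

    # Pair parents (0, 1), (2, 3), ...; an unpaired last parent is skipped.
    for i in range(0, len(pop_after_sel) - 1, 2):
        parent_1, parent_2 = pop_after_sel[i], pop_after_sel[i + 1]

        # Single-point crossover: the first 60% of genes come from parent_1, the rest from parent_2.
        split_index = int(len(parent_1) * 0.6)
        child = np.concatenate((parent_1[:split_index], parent_2[split_index:]))

        pop_nextgen.append(child)

    return pop_nextgen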


def mutation(pop_after_cross, mutation_rate, n_feat):
    # Number of genes to flip in each chromosome.
    mutation_range = int(mutation_rate * n_feat)
    pop_next_gen = []

    for chromosome in pop_after_cross:
        # Work on a copy so chromosomes shared with the previous generation
        # (for example the elites kept by `generations`) are not changed in place.
        chromo = chromosome.copy()

        # Positions are drawn with replacement, so a gene picked twice flips back.
        for _ in range(mutation_range):
            pos = randint(0, n_feat - 1)
            chromo[pos] = not chromo[pos]

        pop_next_gen.append(chromo)

    return pop_next_gen


def generations(X, y, model_name, model, size, n_feat, n_parents=48, mutation_rate=0.3, n_gen=5, stop_gen=20):
    best_chromo = []
    best_score = []
    population_nextgen = initilization_of_population(size, n_feat)
    consecutive_same_chromo = 0
    last_best_chromo = None

    print('\nClassifier Running:', model_name)

    for i in range(n_gen):
        scores, pop_after_fit = fitness_score(
            population=population_nextgen,
            model=model,
            X=X,
            y=y,
        )

        print('Best score in generation', i + 1, ':', scores[:1], pop_after_fit[:1])

        if last_best_chromo is not None and np.array_equal(last_best_chromo, pop_after_fit[0]):
            consecutive_same_chromo += 1
        else:
            consecutive_same_chromo = 0

        last_best_chromo = pop_after_fit[0]

        if consecutive_same_chromo >= stop_gen:
            print(
                f'Stopping early at generation {i + 1} because the best chromosome has not changed for {stop_gen} generations.')
            break

        pop_after_sel = selection(pop_after_fit, n_parents)
        pop_after_cross = crossover(pop_after_sel)
        population_nextgen = mutation(pop_after_cross, mutation_rate, n_feat)
        population_nextgen.extend(pop_after_fit[:2])

        best_chromo.append(pop_after_fit[0])
        best_score.append(scores[0])

    return best_chromo, best_score
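
One detail of `mutation` is worth spelling out: it flips `int(mutation_rate * n_feat)` genes per chromosome, and `int()` truncates. With `mutation_rate=0.033` and 9 features (the values used later in this notebook), the count is 0, so no gene is flipped. A common alternative is to flip each gene independently with probability `mutation_rate`. The sketch below illustrates that variant; the helper name `mutate_per_gene` and the use of a NumPy `Generator` are assumptions for illustration, not part of the notebook's code.

```python
import numpy as np


def mutate_per_gene(chromosome, mutation_rate, rng):
    """Return a copy of a boolean chromosome with each gene flipped with probability mutation_rate."""
    flips = rng.random(chromosome.shape[0]) < mutation_rate
    return np.logical_xor(chromosome, flips)


# Truncation in the count-based scheme: 0.033 * 9 = 0.297, which int() turns into 0.
print(int(0.033 * 9))

rng = np.random.default_rng(42)
parent = np.array([True, False, True, True, False, True, True, False, True])
child = mutate_per_gene(parent, mutation_rate=0.033, rng=rng)
print(child)
```

Because the helper returns a new array, the parent chromosome is left untouched, which keeps any elites carried over between generations intact.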

In [15]:
data_bc = pd.read_csv("data/heart.dat", header=None, sep=" ")

column_names = [
    "age",
    "sex",
    "cp_type",  # chest pain type
    "rest_bp",  # resting blood pressure
    "cholesterol",  # serum cholesterol in mg/dl
    "fast_bs",  # fasting blood sugar > 120 mg/dl
    "rest_ecg",  # resting electrocardiograph results
    "max_hr",  # maximum heart rate achieved
    "ex_angina",  # exercise induced angina
    "old_peak",  # old peak = ST depression induced by exercise relative to rest
    "slope",  # the slope of the peak exercise ST segment
    "num_vessels",  # number of major vessels colored by fluoroscopy
    "thal",  # thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
    "hd_presence"  # heart disease presence
]

data_bc.columns = column_names
data_bc.head()

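|   | age | sex | cp_type | rest_bp | cholesterol | fast_bs | rest_ecg | max_hr | ex_angina | old_peak | slope | num_vessels | thal | hd_presence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70.0 | 1.0 | 4.0 | 130.0 | 322.0 | 0.0 | 2.0 | 109.0 | 0.0 | 2.4 | 2.0 | 3.0 | 3.0 | 2 |
| 1 | 67.0 | 0.0 | 3.0 | 115.0 | 564.0 | 0.0 | 2.0 | 160.0 | 0.0 | 1.6 | 2.0 | 0.0 | 7.0 | 1 |
| 2 | 57.0 | 1.0 | 2.0 | 124.0 | 261.0 | 0.0 | 0.0 | 141.0 | 0.0 | 0.3 | 1.0 | 0.0 | 7.0 | 2 |
| 3 | 64.0 | 1.0 | 4.0 | 128.0 | 263.0 | 0.0 | 0.0 | 105.0 | 1.0 | 0.2 | 2.0 | 1.0 | 7.0 | 1 |
| 4 | 74.0 | 0.0 | 2.0 | 120.0 | 269.0 | 0.0 | 2.0 | 121.0 | 1.0 | 0.2 | 1.0 | 1.0 | 3.0 | 1 |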


In [16]:
label_bc = data_bc["hd_presence"]
data_bc.drop(["hd_presence", "age", "slope", "old_peak", "num_vessels"], axis=1, inplace=True)

print("Heart Disease dataset:", data_bc.shape[0], "Records &", data_bc.shape[1], "Features")

Heart Disease dataset: 270 Records & 9 Features


In [17]:
data_bc.head()

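|   | sex | cp_type | rest_bp | cholesterol | fast_bs | rest_ecg | max_hr | ex_angina | thal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 4.0 | 130.0 | 322.0 | 0.0 | 2.0 | 109.0 | 0.0 | 3.0 |
| 1 | 0.0 | 3.0 | 115.0 | 564.0 | 0.0 | 2.0 | 160.0 | 0.0 | 7.0 |
| 2 | 1.0 | 2.0 | 124.0 | 261.0 | 0.0 | 0.0 | 141.0 | 0.0 | 7.0 |
| 3 | 1.0 | 4.0 | 128.0 | 263.0 | 0.0 | 0.0 | 105.0 | 1.0 | 7.0 |
| 4 | 0.0 | 2.0 | 120.0 | 269.0 | 0.0 | 2.0 | 121.0 | 1.0 | 3.0 |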


In [18]:
# score1 = acc_score(data_bc, label_bc)
# score1

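All features in this dataset are stored as numeric (floating-point) values, although several of them, such as `sex`, `cp_type`, and `thal`, are categorical codes rather than continuous measurements.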

In [19]:
X_train, X_test, y_train, y_test = split(data_bc, label_bc)

selected_features = {}

for name, model in models.items():
    chromo_df_bc, score_bc = generations(
        X=X_train,
        y=y_train,
        model_name=name,
        model=model,
        size=50,
        n_feat=data_bc.shape[1],
        n_parents=50,
        mutation_rate=0.033,
        n_gen=20
    )
    selected_features[name] = chromo_df_bc[-1]  # Best chromosome of the last recorded generation for this classifier


Classifier Running: Logistic Regression
Best score in generation 1 : [0.8524590163934427] [array([ True,  True,  True, False,  True,  True,  True, False,  True])]
Best score in generation 2 : [0.8524590163934427] [array([ True,  True,  True, False,  True,  True,  True, False,  True])]
Best score in generation 3 : [0.8524590163934427] [array([ True,  True,  True, False,  True,  True,  True, False,  True])]
Best score in generation 4 : [0.8524590163934427] [array([ True,  True,  True, False,  True,  True,  True, False,  True])]
Best score in generation 5 : [0.8524590163934427] [array([ True,  True,  True, False,  True,  True,  True, False,  True])]
Best score in generation 6 : [0.8524590163934427] [array([ True,  True,  True, False,  True,  True,  True, False,  True])]
Best score in generation 7 : [0.8524590163934427] [array([ True,  True,  True, False,  True,  True,  True, False,  True])]
Best score in generation 8 : [0.8524590163934427] [array([ True,  True,  True, False,  True,  True

In [9]:
# plot(score_bc, 0.9, 1.0)
# chromo_df_bc

In [10]:
'''
model_1 = KNeighborsClassifier()
model_2 = RandomForestClassifier()
model_3 = GradientBoostingClassifier()

model_meta = AdaBoostClassifier()

stackingModel = StackingClassifier(classifiers=[model_1, model_2, model_3], meta_classifier=model_meta)

for model, model_name in zip([model_1, model_2, model_3, stackingModel], ['KNN', 'RF', 'GB', 'Stacking Classifier']):
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    print(f'Accuracy: {scores.mean()}, +/- {scores.std()} - {model_name}')
'''
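
When base-model predictions are used as meta-features for stacking, predicting on the same rows a base model was fitted on tends to give over-optimistic meta-features. A common remedy is to build the training meta-features from out-of-fold predictions. The following is a minimal sketch using scikit-learn's `cross_val_predict` on synthetic data; the dataset, the two base models, and the variable names are assumptions for illustration, not the notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption: not the heart dataset).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

base_models = [LogisticRegression(max_iter=1000), GaussianNB()]

train_meta = np.zeros((X_tr.shape[0], len(base_models)))
test_meta = np.zeros((X_te.shape[0], len(base_models)))

for i, model in enumerate(base_models):
    # Out-of-fold probabilities: each training row is scored by a model that never saw it.
    train_meta[:, i] = cross_val_predict(
        model, X_tr, y_tr, cv=5, method="predict_proba"
    )[:, 1]
    # For the test rows, refit on the full training split and predict as usual.
    test_meta[:, i] = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Simple meta-classifier trained only on out-of-fold meta-features.
meta_model = LogisticRegression().fit(train_meta, y_tr)
print("Held-out accuracy:", meta_model.score(test_meta, y_te))
```

With this layout the meta-classifier only ever sees predictions made for rows the base model was not trained on, so its cross-validated score is a fairer estimate than one computed on in-sample probabilities.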

In [20]:
base_classifiers = [
    ('Logistic Regression', LogisticRegression(), selected_features['Logistic Regression']),
    ('Gaussian Naive Bayes', GaussianNB(), selected_features['Gaussian Naive Bayes']),
    ('Decision Tree', DecisionTreeClassifier(), selected_features['Decision Tree']),
    ('SVM', svm.SVC(kernel='linear', probability=True), selected_features['SVM']),
    ('Gradient Boosting', GradientBoostingClassifier(), selected_features['Gradient Boosting']),
    ('Random Forest', RandomForestClassifier(), selected_features['Random Forest']),
    ('K-Nearest Neighbors', KNeighborsClassifier(), selected_features['K-Nearest Neighbors']),
]

train_meta_features = np.zeros((X_train.shape[0], len(base_classifiers)))
test_meta_features = np.zeros((X_test.shape[0], len(base_classifiers)))

for i, (name, model, chromosome) in enumerate(base_classifiers):
    model.fit(X_train.iloc[:, chromosome], y_train)
    train_meta_features[:, i] = model.predict_proba(X_train.iloc[:, chromosome])[:, 1]
    test_meta_features[:, i] = model.predict_proba(X_test.iloc[:, chromosome])[:, 1]

meta_feat_chromo, meta_feat_score = generations(
    X=pd.DataFrame(train_meta_features),
    y=y_train,
    model_name='Meta Classifier',
    model=AdaBoostClassifier(),
    size=50,
    n_feat=train_meta_features.shape[1],
    n_parents=50,
    mutation_rate=0.033,
    n_gen=20
)

best_meta_features = meta_feat_chromo[-1]

meta_classifier = AdaBoostClassifier()

meta_classifier.fit(train_meta_features[:, best_meta_features], y_train)

y_pred = meta_classifier.predict(test_meta_features[:, best_meta_features])
accuracy = accuracy_score(y_test, y_pred)
print('Stacking Classifier with GA-selected Meta-Features Accuracy:', accuracy)


Classifier Running: Meta Classifier
Best score in generation 1 : [1.0] [array([ True, False,  True,  True, False,  True,  True])]
Best score in generation 2 : [1.0] [array([ True,  True, False, False,  True,  True,  True])]
Best score in generation 3 : [1.0] [array([ True,  True,  True, False, False,  True,  True])]
Best score in generation 4 : [1.0] [array([False, False,  True,  True,  True,  True,  True])]
Best score in generation 5 : [1.0] [array([ True,  True,  True, False,  True,  True, False])]
Best score in generation 6 : [1.0] [array([False, False,  True,  True,  True,  True,  True])]
Best score in generation 7 : [1.0] [array([ True,  True, False,  True,  True,  True, False])]
Best score in generation 8 : [1.0] [array([ True, False,  True,  True,  True,  True, False])]
Best score in generation 9 : [1.0] [array([False, False,  True,  True,  True,  True,  True])]
Best score in generation 10 : [1.0] [array([False,  True,  True,  True,  True,  True, False])]
Best score in generati