<a href="https://colab.research.google.com/github/AbrahamOtero/MLiB/blob/main/6_MetaModels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Combination of models
We import the libraries that we are going to need:

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Voting

We will start by implementing a voting strategy using **VotingClassifier**. We will use three models (you can add more if you want): a decision tree, k-nearest neighbors, and Gaussian naive Bayes. We will apply them to the diabetes dataset and evaluate the result with 10-fold cross-validation.

In [131]:
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, train_test_split

url = 'https://raw.githubusercontent.com/AbrahamOtero/MLiB/main/datasets/diabetes.csv'

diabetes = pd.read_csv(url)

# The features
X = diabetes.iloc[:, :-1]
# The class
y = diabetes.iloc[:,-1]

# Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# We reserve 40% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# We build the voting-based classifier
voting_clf = VotingClassifier(
    # List of (name, classifier) pairs that "vote"; the name can be any string
    estimators=[
        ('dt', DecisionTreeClassifier()),
        ('knn', KNeighborsClassifier()),
        ('gnb', GaussianNB())
    ]
)
voting_clf.fit(X_train, y_train)

scores = cross_val_score(voting_clf, X_train, y_train, cv=10, scoring="accuracy")
# Calculate and display the average accuracy value and standard deviation for the scores of each fold
print("Mean accuracy:", scores.mean())
print("Standard deviation:", scores.std())

Mean accuracy: 0.7586956521739131
Standard deviation: 0.05185591496468073


Let's see how each of the three classifiers performs, as well as the performance of the voting model composed of the three classifiers. We will use the 40% of the data that we saved for testing:

In [132]:
for name, clf in voting_clf.named_estimators_.items():
    print(name, "=", clf.score(X_test, y_test))

print("Voting: ", voting_clf.score(X_test, y_test))

dt = 0.7207792207792207
knn = 0.7142857142857143
gnb = 0.7597402597402597
Voting:  0.7792207792207793


We see that the performance of the model composed of the three classifiers is slightly higher than the performance of each of them. In practice this does not always have to be the case (especially if the errors of the different classifiers are correlated, or if one classifier is much better than the others). But we can often improve the performance of each individual model by this aggregation.
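
To see why uncorrelated errors matter, a small calculation can help. The sketch below assumes an idealised case that the diabetes models almost certainly do not satisfy: `n` classifiers whose errors are fully independent, each correct with probability `p`. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent classifiers,
    each correct with probability p, votes for the right class (n odd)."""
    k_min = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# Accuracy of the vote as the number of (hypothetical) independent voters grows
for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(0.7, n), 3))
```

With three independent voters that are each right 70% of the time, the vote is right 78.4% of the time. Real classifiers trained on the same data make correlated errors, so the gain is usually smaller.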



## Bagging

A natural variation on voting is to train multiple copies of the same classifier, for example a decision tree, on different versions of the dataset obtained by sampling. We can do this easily with **BaggingClassifier**:

In [133]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# To build each model we will use a number of samples equal to half of those available.
# shape[0] is the number of rows (samples); shape[1] would be the number of features.
n_samples = round(X_train.shape[0] / 2)

# We create a Bagging classifier composed of 100 decision trees
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, max_samples=n_samples)
scores = cross_val_score(bag_clf, X_train, y_train, cv=10, scoring="accuracy")

# Calculate and display the average accuracy value and standard deviation for the scores of each fold
print("Mean accuracy:", scores.mean())
print("Standard deviation:", scores.std())

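To make the idea behind bagging concrete, the following is a hand-written sketch of the procedure. It is a simplification and not how `BaggingClassifier` is implemented: it uses a synthetic dataset from `make_classification` instead of the diabetes data, 25 trees, and a hard majority vote. All variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem (an assumption, used only to keep the sketch self-contained)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25                    # odd, so a two-class vote cannot tie
sample_size = len(X_tr) // 2    # half of the training rows per tree

trees = []
for _ in range(n_trees):
    # Draw row indices with replacement: one bootstrap-style resample per tree
    idx = rng.integers(0, len(X_tr), size=sample_size)
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Each row of votes holds one tree's predictions for the test set
votes = np.array([tree.predict(X_te) for tree in trees])
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)  # majority vote for labels 0/1

bagged_acc = np.mean(bagged_pred == y_te)
single_acc = trees[0].score(X_te, y_te)
print("Single tree:", single_acc)
print("Bagged vote:", bagged_acc)
```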

### Random Forests

A Random Forest is a bag of decision trees with an extra source of randomness. As in Bagging, each tree is trained on a different dataset created by sampling. In addition, each split of each tree considers only a random subset of the features. This makes the trees less correlated with one another than in plain Bagging, where every tree can choose among all the features at every split.


In [134]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=100)

scores = cross_val_score(rnd_clf, X_train, y_train, cv=10, scoring="accuracy")

# Calculate and display the average accuracy value and standard deviation for the scores of each fold
print("Mean accuracy:", scores.mean())
print("Standard deviation:", scores.std())

Mean accuracy: 0.7543478260869565
Standard deviation: 0.06152596390471699
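
The relationship between a forest and a bag of trees can be checked with a short experiment. The sketch below assumes a synthetic dataset and compares `RandomForestClassifier` with a `BaggingClassifier` whose trees pick a random subset of features at each split (`max_features="sqrt"`). The two are roughly, not exactly, equivalent, so the scores should be similar rather than identical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only so the example runs on its own
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)

# Bagging of trees that look at a random subset of features at every split
bagged_random_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=50,
    random_state=0,
)

forest_scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="accuracy")
bagging_scores = cross_val_score(bagged_random_trees, X_demo, y_demo, cv=5, scoring="accuracy")

print("Random forest:", forest_scores.mean())
print("Bagged random trees:", bagging_scores.mean())
```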


An interesting property of the scikit-learn random forest classifier is that it can estimate how important each attribute is, based on how much the splits that use that attribute reduce impurity, averaged over all the trees. To obtain the importances we have to fit the classifier on the data, because cross_val_score fits copies of the classifier and leaves the original unfitted:

In [67]:
# Fit the classifier

rnd_clf.fit(X_train, y_train)

# Feature relevance
for score, name in zip(rnd_clf.feature_importances_, diabetes.columns):
    print(round(score, 2), name)

0.1 Pregnancies
0.23 Glucose
0.08 BloodPressure
0.07 SkinThickness
0.12 Insulin
0.16 BMI
0.12 DiabetesPedigreeFunction
0.13 Age
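
The importances are relative values. As a hedged illustration on synthetic data (not the diabetes set), the sketch below checks two properties: the importances of a fitted forest add up to 1, and they match the normalised average of the importances of the individual trees stored in `estimators_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, an assumption made to keep the example self-contained
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

importances = forest.feature_importances_

# Average the impurity-based importances of the individual trees by hand
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

print("Sum of importances:", importances.sum())
print("Matches per-tree average:", np.allclose(importances, manual))
```

Because the values sum to 1, an importance of 0.23 for Glucose reads as roughly 23% of the total impurity reduction attributed to that feature.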


## Boosting

For Boosting we can use **AdaBoostClassifier**, here with shallow decision trees (maximum depth 2) as the base models:

In [83]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=100, algorithm='SAMME')

scores = cross_val_score(ada_clf, X_train, y_train, cv=10, scoring="accuracy")

# Calculate and display the average accuracy value and standard deviation for the scores of each fold
print("Mean accuracy:", scores.mean())
print("Standard deviation:", scores.std())



Mean accuracy: 0.675
Standard deviation: 0.13334396216138905
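
What distinguishes boosting from bagging is that the models are trained one after another, each one paying more attention to the examples the previous ones got wrong. The following is a minimal sketch of the classic two-class AdaBoost loop, written by hand with decision stumps on synthetic data. It is meant to show the reweighting idea and is not the scikit-learn implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; labels are recoded to -1/+1, which simplifies the update rule
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
y_pm = np.where(y_demo == 1, 1, -1)

n_rounds = 20
weights = np.full(len(y_pm), 1 / len(y_pm))  # start with uniform example weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_pm, sample_weight=weights)
    pred = stump.predict(X_demo)

    # Weighted error of this stump, clipped to avoid division by zero
    err = np.clip(np.sum(weights[pred != y_pm]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # say of this stump in the final vote

    # Increase the weight of misclassified examples, decrease the rest
    weights = weights * np.exp(-alpha * y_pm * pred)
    weights = weights / weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of the stump votes
ensemble_score = sum(a * s.predict(X_demo) for a, s in zip(alphas, stumps))
train_acc = np.mean(np.sign(ensemble_score) == y_pm)
print("Training accuracy of the boosted stumps:", train_acc)
```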


## Stacking

Finally, we will apply Stacking with **StackingClassifier**:

In [135]:
from sklearn.ensemble import StackingClassifier

stacking_clf = StackingClassifier(
     # Map of level 0 models
    estimators=[
        ('dt', DecisionTreeClassifier()),
        ('knn', KNeighborsClassifier()),
        ('gnb', GaussianNB())
    ],
     # Level 1 model
    final_estimator=DecisionTreeClassifier()
)


scores = cross_val_score(stacking_clf, X_train, y_train, cv=10, scoring="accuracy")

# Calculate and display the average accuracy value and standard deviation for the scores of each fold
print("Mean accuracy:", scores.mean())
print("Standard deviation:", scores.std())

Mean accuracy: 0.658695652173913
Standard deviation: 0.06877953051981033
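
The two levels of a stacked model can also be built by hand, which may help to see what is going on. The sketch below is an approximation under several assumptions: synthetic data, 5-fold out-of-fold predictions obtained with `cross_val_predict`, and a logistic regression as the level 1 model instead of the decision tree used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

level0 = [DecisionTreeClassifier(random_state=0), KNeighborsClassifier(), GaussianNB()]

# Level 1 training features: out-of-fold probability of class 1 from each level 0 model,
# so the level 1 model never sees predictions made on data a model was trained on
meta_train = np.column_stack([
    cross_val_predict(model, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for model in level0
])
level1 = LogisticRegression().fit(meta_train, y_tr)

# For new data, the level 0 models are refitted on the whole training set
for model in level0:
    model.fit(X_tr, y_tr)
meta_test = np.column_stack([model.predict_proba(X_te)[:, 1] for model in level0])

stack_acc = level1.score(meta_test, y_te)
print("Stacked accuracy:", stack_acc)
```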


# Evaluation of classifiers based on cost

Unfortunately, scikit-learn has no built-in functionality for training classifiers based on cost. We will rely on the following code, taken from the repository https://github.com/Treers/MetaCost. You need to run it to define the cost-based classifier. If you use it in your project, you will also need to load this class into your environment.

In [89]:
# Just run this code

import pandas as pd
import numpy as np
from sklearn.base import clone

class MetaCost(object):

    """A procedure for making error-based classifiers cost-sensitive

    >>> from sklearn.datasets import load_iris
    >>> from sklearn.linear_model import LogisticRegression
    >>> import pandas as pd
    >>> import numpy as np
    >>> S = pd.DataFrame(load_iris().data)
    >>> S['target'] = load_iris().target
    >>> LR = LogisticRegression(solver='lbfgs', multi_class='multinomial')
    >>> C = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
    >>> model = MetaCost(S, LR, C).fit('target', 3)
    >>> model.predict_proba(load_iris().data[[2]])
    >>> model.score(S[[0, 1, 2, 3]].values, S['target'])

    .. note:: The form of the cost matrix C must be as follows:
    +---------------+----------+----------+----------+
    |  actual class |          |          |          |
    +               |          |          |          |
    |   +           | y(x)=j_1 | y(x)=j_2 | y(x)=j_3 |
    |       +       |          |          |          |
    |           +   |          |          |          |
    |predicted class|          |          |          |
    +---------------+----------+----------+----------+
    |   h(x)=j_1    |    0     |    a     |     b    |
    |   h(x)=j_2    |    c     |    0     |     d    |
    |   h(x)=j_3    |    e     |    f     |     0    |
    +---------------+----------+----------+----------+
    | C = np.array([[0, a, b],[c, 0 , d],[e, f, 0]]) |
    +------------------------------------------------+
    """
    def __init__(self, S, L, C, m=50, n=1, p=True, q=True):
        """
        :param S: The training set
        :param L: A classification learning algorithm
        :param C: A cost matrix
        :param q: Is True iff all resamples are to be used for each example
        :param m: The number of resamples to generate
        :param n: The number of examples in each resample
        :param p: Is True iff L produces class probabilities
        """
        if not isinstance(S, pd.DataFrame):
            raise ValueError('S must be a DataFrame object')
        new_index = list(range(len(S)))
        S.index = new_index
        self.S = S
        self.L = L
        self.C = C
        self.m = m
        self.n = len(S) * n
        self.p = p
        self.q = q

    def fit(self, flag, num_class):
        """
        :param flag: The name of classification labels
        :param num_class: The number of classes
        :return: Classifier
        """
        col = [col for col in self.S.columns if col != flag]
        S_ = {}
        M = []

        for i in range(self.m):
            # Let S_[i] be a resample of S with self.n examples
            S_[i] = self.S.sample(n=self.n, replace=True)

            X = S_[i][col].values
            y = S_[i][flag].values

            # Let M[i] = model produced by applying L to S_[i]
            model = clone(self.L)
            M.append(model.fit(X, y))

        label = []
        S_array = self.S[col].values
        for i in range(len(self.S)):
            if not self.q:
                k_th = [k for k, v in S_.items() if i not in v.index]
                M_ = list(np.array(M)[k_th])
            else:
                M_ = M

            if self.p:
                P_j = [model.predict_proba(S_array[[i]]) for model in M_]
            else:
                P_j = []
                for model in M_:
                    # Build a fresh one-hot vector for the class predicted by each model
                    vector = [0] * num_class
                    vector[int(model.predict(S_array[[i]])[0])] = 1
                    P_j.append(vector)

            # Calculate P(j|x)
            P = np.array(np.mean(P_j, 0)).T

            # Relabel
            label.append(np.argmin(self.C.dot(P)))

        # Model produced by applying L to S with relabeled y
        X_train = self.S[col].values
        y_train = np.array(label)
        model_new = clone(self.L)
        model_new.fit(X_train, y_train)

        return model_new
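
The key step of `fit` is the relabelling line `np.argmin(self.C.dot(P))`: each training example receives the class with the lowest expected cost. A small worked sketch follows. It assumes hypothetical costs (missing a diabetic patient costs 5, a false alarm costs 1) written in the MetaCost orientation, with rows as predicted class and columns as actual class. The helper `min_cost_class` is made up for this illustration.

```python
import numpy as np

# MetaCost orientation: C[i, j] = cost of predicting class i when the actual class is j
# Class 0 = healthy, class 1 = diabetic (hypothetical costs)
C = np.array([[0, 5],
              [1, 0]])


def min_cost_class(prob_diabetic: float) -> int:
    """Return the class with the lowest expected cost for one example."""
    P = np.array([1 - prob_diabetic, prob_diabetic])  # P(j | x) for j = 0, 1
    expected_cost = C.dot(P)  # expected_cost[i] = expected cost of predicting i
    return int(np.argmin(expected_cost))


for p in (0.10, 0.20, 0.50):
    print(p, min_cost_class(p))
```

With these costs, predicting healthy costs 5p on average and predicting diabetic costs 1 - p, so an example is relabelled as diabetic as soon as the estimated probability p exceeds 1/6 instead of the usual 1/2.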

We will start by training a decision tree without considering the cost. We will show the accuracy and the confusion matrix:

In [172]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# X_train, X_test, y_train and y_test were defined in the Voting section
# Create a Decision Tree classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

# Train the classifier
dt_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dt_classifier.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate the confusion matrix
confusion_mat_tree = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(confusion_mat_tree)

Accuracy: 0.7305194805194806
Confusion Matrix:
[[152  54]
 [ 29  73]]


Now we train MetaCost, but with a cost matrix where all errors have the same cost.

In [173]:
# Define the cost matrix
cost_matrix = np.array([[0, 1], [1, 0]])

# Combine X_train and y_train into a single DataFrame
# MetaCost needs the data in a DataFrame
train_df = pd.DataFrame(data=X_train)
train_df['Outcome'] = pd.array(y_train)

# Create a MetaCost object with a DecisionTreeClassifier
metacost_model = MetaCost(train_df, DecisionTreeClassifier(random_state=42), cost_matrix)

# Fit the model to the training data.
# We need to indicate the column of the class ('Outcome') and the number of classes (2)
model = metacost_model.fit('Outcome', 2)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate the confusion matrix
confusion_mat = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(confusion_mat)


Accuracy: 0.7012987012987013
Confusion Matrix:
[[145  61]
 [ 31  71]]


Now we will carry out the training considering that classifying a diabetic patient as healthy has a cost 5 times higher than classifying a healthy patient as diabetic.

In [175]:
# Define the cost matrix, oriented like scikit-learn's confusion matrix:
# rows are the actual class, columns are the predicted class.
# A diabetic patient (1) classified as healthy (0) costs 5; the opposite error costs 1.
cost_matrix = np.array([[0, 1], [5, 0]])

# Combine X_train and y_train into a single DataFrame
# MetaCost needs the data in a DataFrame
train_df = pd.DataFrame(data=X_train)
train_df['Outcome'] = pd.array(y_train)

# Create a MetaCost object with a DecisionTreeClassifier.
# MetaCost expects rows = predicted class and columns = actual class, so we pass the transpose.
metacost_model = MetaCost(train_df, DecisionTreeClassifier(random_state=42), cost_matrix.T)

# Fit the model to the training data.
# We need to indicate the column of the class ('Outcome') and the number of classes (2)
model = metacost_model.fit('Outcome', 2)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate the confusion matrix
confusion_mat = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(confusion_mat)

# Calculate the cost of the classifier.
# Both matrices use rows = actual and columns = predicted, so they can be multiplied element-wise.
cost = np.sum(confusion_mat * cost_matrix)

# Print the cost
print("Cost:", cost)


In [176]:
# Confusion matrix of the plain decision tree (trained without costs), for comparison
print(confusion_mat_tree)

# Calculate the cost of the classifier
cost = np.sum(confusion_mat_tree * cost_matrix)

# Print the cost
print("Cost:", cost)

[[152  54]
 [ 29  73]]
Cost: 199
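
The element-wise product only gives the right total when the cost matrix and the confusion matrix share the same orientation. The short check below reuses the numbers printed above and assumes scikit-learn's convention (rows are the actual class, columns the predicted class) to recompute the cost by hand.

```python
import numpy as np

# Confusion matrix of the plain decision tree, as printed above
# rows = actual class, columns = predicted class
confusion = np.array([[152, 54],
                      [29, 73]])

# Same orientation: a diabetic patient predicted as healthy costs 5, the opposite error costs 1
cost_actual_by_predicted = np.array([[0, 1],
                                     [5, 0]])

false_positives = confusion[0, 1]  # healthy patients classified as diabetic
false_negatives = confusion[1, 0]  # diabetic patients classified as healthy

total = int(np.sum(confusion * cost_actual_by_predicted))
by_hand = int(1 * false_positives + 5 * false_negatives)

print("Total cost:", total)          # 54 * 1 + 29 * 5 = 199
print("Same as by hand:", total == by_hand)
```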


In [170]:
train_df

Unnamed: 0,0,1,2,3,4,5,6,7,Outcome
0,-1.141852,-0.841722,-3.572597,-1.288212,-0.692891,-4.060474,-0.651972,-0.701198,0
1,1.233880,0.128489,1.390387,-1.288212,-0.692891,-4.060474,-0.724455,1.766346,1
2,-0.844885,-0.309671,0.873409,-0.096379,-0.692891,-0.240205,-0.993245,-0.871374,0
3,0.936914,2.350587,1.080200,-1.288212,-0.692891,0.990912,-0.063049,0.660206,1
4,-0.547919,1.161295,1.080200,-1.288212,-0.692891,-0.049826,1.006073,2.787399,1
...,...,...,...,...,...,...,...,...,...
455,0.342981,0.566649,-0.263941,0.907270,0.522715,-0.430583,-0.183854,-0.616111,0
456,-0.844885,-0.779128,2.734528,-1.288212,-0.692891,-1.217483,-0.799958,-0.531023,0
457,1.827813,-0.622642,0.873409,1.032726,-0.692891,1.727044,2.005732,0.404942,1
458,-1.141852,0.629244,-3.572597,-1.288212,-0.692891,1.320902,-0.805998,-0.360847,1
