# Self-composed ensemble

An ensemble combines the predictions of two or more machine learning models. This often improves the results, at the cost of a longer training time. For classification, the individual predictions can be combined in two ways:
- **Hard Voting:** The label that is predicted by the majority of the models is chosen.
- **Soft Voting:** Every model returns a probability for each label. These probabilities are combined across the models (using the mean or median), and the label with the highest combined probability is predicted (Simic, 2024).
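
The difference between the two voting schemes can be illustrated with a minimal sketch. The probabilities below are made up for illustration, and `hard_vote` and `soft_vote` are hypothetical helper names, not part of scikit-learn. The sketch assumes the mean as the soft-voting aggregate.

```python
import numpy as np

# Hypothetical class probabilities from three models for a single sample
# (rows: models, columns: classes 0 and 1). Values are illustrative only.
probs = np.array([
    [0.55, 0.45],  # model A slightly prefers class 0
    [0.60, 0.40],  # model B slightly prefers class 0
    [0.10, 0.90],  # model C strongly prefers class 1
])

def hard_vote(probs):
    """Return the label predicted by the majority of the models."""
    votes = probs.argmax(axis=1)  # each model's predicted label
    return int(np.bincount(votes, minlength=probs.shape[1]).argmax())

def soft_vote(probs, weights=None):
    """Return the label with the highest (optionally weighted) mean probability."""
    mean_probs = np.average(probs, axis=0, weights=weights)
    return int(mean_probs.argmax())

hard_label = hard_vote(probs)
soft_label = soft_vote(probs)
print("Hard voting:", hard_label)
print("Soft voting:", soft_label)
```

In this example, two of the three models vote for class 0, so hard voting picks class 0. Soft voting picks class 1, because the third model is far more confident than the other two and pulls the mean probability of class 1 above 0.5.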

Because we optimize for the best possible F1 score, the grid search selects soft voting. This approach takes each model's predicted probabilities into account instead of only the predicted labels, which results in better performance.

For each model we use the hyperparameters chosen by its individual grid search. We include all seven models because they add diversity and have different strengths. This also reduces the risk of a false prediction, because no single model's weakness decides the outcome on its own.
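
The idea of reusing individually tuned models in one ensemble can be sketched as follows. This is a reduced illustration under stated assumptions, not the notebook's actual setup: it uses synthetic data from `make_classification`, only two of the seven model types, and small hypothetical parameter grids.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (assumption: stands in for the real dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# One grid search per model; the grids are illustrative only.
searches = {
    "KNN": GridSearchCV(
        KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}, scoring="f1", cv=3
    ),
    "LogisticRegression": GridSearchCV(
        LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, scoring="f1", cv=3
    ),
}

# Collect the best estimator of each search as a (name, model) pair.
tuned = []
for name, search in searches.items():
    search.fit(X_tr, y_tr)
    tuned.append((name, search.best_estimator_))

# Combine the tuned models; VotingClassifier refits clones of them on fit().
ensemble = VotingClassifier(estimators=tuned, voting="soft")
ensemble.fit(X_tr, y_tr)

ensemble_f1 = f1_score(y_te, ensemble.predict(X_te))
print(f"Ensemble F1: {ensemble_f1:.3f}")
```

Soft voting requires every estimator to expose `predict_proba`, which both models in this sketch do.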


Simic, M. (2024, March 18). Hard vs. Soft Voting Classifiers. Baeldung. https://www.baeldung.com/cs/hard-vs-soft-voting-classifiers

In [None]:
from sklearn.ensemble import VotingClassifier

from sklearn.metrics import confusion_matrix, f1_score
from sklearn.metrics import classification_report

from sklearn.model_selection import KFold, GridSearchCV

# Tuned models from the individual grid searches, as (name, model) pairs
estimator = [
    ('KNN', best_model_knn),
    ('LogisticRegression', best_model_lg),
    ('SVC', best_model_svc),
    ('DT', best_model_dt),
    ('RandomForest', best_model_rf),
    ('GradientBoostingClassifier', best_model_gbc),
    ('XGB', best_model_xgb),
]

ce = VotingClassifier(estimators=estimator)

param_grid_ce = {
    'voting': ['soft', 'hard'],
    'weights': [
        None,
        [2, 1, 1, 1, 1, 1, 1],
        [1, 2, 1, 1, 1, 1, 1],
        [1, 1, 2, 1, 1, 1, 1],
        [1, 1, 1, 2, 1, 1, 1],
        [1, 1, 1, 1, 2, 1, 1],
        [1, 1, 1, 1, 1, 2, 1],
        [1, 1, 1, 1, 1, 1, 2],
    ]
}

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

grid_search = GridSearchCV(
    estimator=ce, param_grid=param_grid_ce, cv=kfold, scoring='f1', n_jobs=-1
)
grid_search.fit(X_train, y_train)

best_model_ce = grid_search.best_estimator_
best_params = grid_search.best_params_
print("Best Parameters:\n", best_params)

y_pred = best_model_ce.predict(X_test)
f1 = f1_score(y_test, y_pred)
print("\nF1 Score:\n", f1)

print("\nClassification Report:\n", classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)


NameError: name 'best_model_knn' is not defined

In [None]:
from sklearn.ensemble import VotingClassifier

from sklearn.metrics import confusion_matrix, f1_score
from sklearn.metrics import classification_report

from sklearn.model_selection import KFold, GridSearchCV

import numpy as np

# Tuned models from the individual grid searches, as (name, model) pairs
estimator = [
    ('KNN', best_model_knn),
    ('LogisticRegression', best_model_lg),
    ('SVC', best_model_svc),
    ('DT', best_model_dt),
    ('RandomForest', best_model_rf),
    ('GradientBoostingClassifier', best_model_gbc),
    ('XGB', best_model_xgb),
]

ce = VotingClassifier(estimators=estimator)

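np.random.seed(42)

# Candidate weights: None (equal weights) plus 200 random weight vectors,
# one weight per model, each drawn uniformly from [0.5, 2)
weights = [None] + [
    [np.random.uniform(0.5, 2) for _ in range(7)] for _ in range(200)
]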

param_grid_ce = {
    'voting': ['soft', 'hard'],
    'weights': weights
}

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

grid_search = GridSearchCV(
    estimator=ce, param_grid=param_grid_ce, cv=kfold, scoring='f1', n_jobs=-1
)
grid_search.fit(X_train, y_train)

best_model_ce = grid_search.best_estimator_
best_params = grid_search.best_params_
print("Best Parameters:\n", best_params)

y_pred = best_model_ce.predict(X_test)
f1 = f1_score(y_test, y_pred)
print("\nF1 Score:\n", f1)

print("\nClassification Report:\n", classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)
