# Setup

This project requires Python 3.7 or above:

In [1]:
%pip install matplotlib
%pip install numpy
%pip install scikit-learn
%pip install pandas
%pip install scipy
%pip install seaborn
%pip install xgboost


Note: you may need to restart the kernel to use updated packages.


In [2]:
import sys

assert sys.version_info >= (3, 7)

It also requires Scikit-Learn ≥ 1.0.1:

In [3]:
from packaging import version
import sklearn

assert version.parse(sklearn.__version__) >= version.parse("1.0.1")



As we did in previous chapters, let's define the default font sizes to make the figures prettier:

In [4]:
import matplotlib.pyplot as plt

plt.rc('font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

And let's create the `images/ensembles` folder (if it doesn't already exist), and define the `save_fig()` function, which is used throughout this notebook to save the figures in high resolution for the book:

In [5]:
from pathlib import Path

IMAGES_PATH = Path() / "images" / "ensembles"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Voting Classifiers

In [6]:
import matplotlib.pyplot as plt
import numpy as np

Let's build a voting classifier:

In [7]:
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30)
X_train, X_test, y_train, y_test = train_test_split(X, y)

voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('rf', RandomForestClassifier()),
        ('svc', SVC())
    ]
)
voting_clf.fit(X_train, y_train)

In [8]:
for name, clf in voting_clf.named_estimators_.items():
    print(name, "=", clf.score(X_test, y_test))

lr = 0.856
rf = 0.928
svc = 0.912


In [9]:
voting_clf.predict(X_test[:1])

array([0])

In [10]:
[clf.predict(X_test[:1]) for clf in voting_clf.estimators_]

[array([0]), array([0]), array([0])]

In [11]:
voting_clf.score(X_test, y_test)

0.912
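
With the default hard voting, the ensemble predicts the class that receives the most votes from its fitted sub-estimators. The sketch below reproduces that by hand as a sanity check. The `random_state=42` arguments are an assumption added here for reproducibility (the cells above do not set a seed), so the scores will not match the outputs shown above.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# random_state values are assumptions, added only for reproducibility
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(random_state=42)),
    ]
)
voting_clf.fit(X_train, y_train)

# One row per fitted classifier, one column per test instance
votes = np.array([clf.predict(X_test) for clf in voting_clf.estimators_])

# Majority vote per instance (np.bincount counts the votes for each class)
manual_pred = np.array([np.bincount(col).argmax() for col in votes.T])

print(np.array_equal(manual_pred, voting_clf.predict(X_test)))
```

This works directly here because the labels are already the integers 0 and 1; with other label types, the votes would need to be mapped back through `voting_clf.classes_`.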

Now let's use soft voting:

In [12]:
voting_clf.voting = "soft"
voting_clf.named_estimators["svc"].probability = True
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)

0.912
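
Soft voting averages the class probabilities returned by each classifier and picks the class with the highest mean probability. A minimal sketch of that computation, assuming unweighted voting and seeded data (the `random_state=42` values are not in the cells above):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# random_state values are assumptions, added only for reproducibility
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

soft_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        # SVC only exposes predict_proba() when probability=True
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft",
)
soft_clf.fit(X_train, y_train)

# Shape: (n_classifiers, n_instances, n_classes)
probas = np.array([clf.predict_proba(X_test) for clf in soft_clf.estimators_])

# Average over classifiers, then take the most probable class
avg_proba = probas.mean(axis=0)
manual_pred = soft_clf.classes_[avg_proba.argmax(axis=1)]

print(np.array_equal(manual_pred, soft_clf.predict(X_test)))
```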

# Random Forests

In [13]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

## Feature Importance

In [14]:
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
rnd_clf = RandomForestClassifier(n_estimators=500)
rnd_clf.fit(iris.data, iris.target)
for score, name in zip(rnd_clf.feature_importances_, iris.data.columns):
    print(round(score, 2), name)

0.09 sepal length (cm)
0.02 sepal width (cm)
0.43 petal length (cm)
0.46 petal width (cm)
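
The importances are normalized so that they sum to 1, which makes them easy to rank. As a small sketch, they can be wrapped in a pandas `Series` and sorted; the `random_state=42` seed is an assumption added here, so the exact values may differ slightly from the output above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris(as_frame=True)
# random_state is an assumption, added only for reproducibility
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)

# Label each importance with its feature name and sort, most important first
importances = pd.Series(
    rnd_clf.feature_importances_, index=iris.data.columns
).sort_values(ascending=False)

print(importances.round(2))
print("sum:", round(importances.sum(), 6))
```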


# Ensemble Learning with Voting Classifier - Implementation Notes

## Overview
This implementation combines multiple classification algorithms using a voting classifier to improve model robustness and performance. The approach leverages different learning algorithms' strengths while mitigating their individual weaknesses.

## Key Components

### 1. Base Classifiers
- **Random Forest**: Handles non-linear relationships and feature interactions
- **Gradient Boosting**: Excellent for complex patterns and feature importance
- **SVM**: Effective for high-dimensional spaces and non-linear classification
- **Neural Network (MLP)**: Captures complex patterns and relationships
- **KNN**: Simple but effective for local pattern recognition
- **Decision Tree**: Interpretable and handles mixed feature types
- **Naive Bayes**: Works well with high-dimensional data
- **Logistic Regression**: Simple, interpretable baseline model

### 2. Data Preprocessing
- Feature scaling using StandardScaler
- Train-test split for proper evaluation
- Handling of missing values and outliers where needed (not implemented in the code below; the synthetic moons data has no missing values)
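
One way to sketch these preprocessing steps is a single scikit-learn pipeline, so that imputation and scaling are learned from the training split only. The median imputation strategy, the artificially injected `NaN` values, and the choice of `LogisticRegression` as the final step are all assumptions for illustration; outlier handling is not covered here.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)

# Knock out a few values to simulate missing data (illustration only)
rng = np.random.default_rng(42)
X_missing = X.copy()
X_missing[rng.choice(len(X), size=20, replace=False), 0] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_missing, y, test_size=0.2, random_state=42
)

model = make_pipeline(
    SimpleImputer(strategy="median"),  # fill NaNs with the training median
    StandardScaler(),                  # zero mean, unit variance per feature
    LogisticRegression(random_state=42),
)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```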

### 3. Voting Strategy
- Using 'soft' voting (probability-based)
- Leverages probability estimates from all classifiers
- More nuanced than hard voting for complex decisions
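
To see why soft voting can be more nuanced, consider a toy case where two classifiers lean weakly toward one class and a third is confident in the other. The probabilities below are made up for illustration.

```python
import numpy as np

# Hypothetical probability estimates from three classifiers for one
# instance (columns: class 0, class 1)
probas = np.array([
    [0.55, 0.45],  # classifier A leans slightly toward class 0
    [0.60, 0.40],  # classifier B leans slightly toward class 0
    [0.10, 0.90],  # classifier C is confident in class 1
])

# Hard voting: each classifier casts one vote for its top class
hard_votes = probas.argmax(axis=1)
hard_pred = np.bincount(hard_votes, minlength=2).argmax()

# Soft voting: average the probabilities, then pick the top class
avg_proba = probas.mean(axis=0)
soft_pred = avg_proba.argmax()

print(hard_pred, soft_pred)  # prints "0 1"
```

Hard voting follows the two lukewarm votes, while soft voting lets the confident classifier outweigh them.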

## Implementation Details

### Feature Engineering Considerations
- Standardization important for SVM and Neural Networks
- Categorical encoding if needed
- Feature selection based on importance scores

### Model Training
- Cross-validation for robust evaluation
- Parallel processing for efficiency
- Hyperparameter optimization opportunities
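
As a sketch of the cross-validation point, `cross_val_score` can evaluate a whole ensemble across several folds. The three base classifiers and `cv=5` are arbitrary choices for illustration, not the configuration used in the code below.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)

# Scaling lives inside the pipeline so it is refit on each training fold
ensemble = make_pipeline(
    StandardScaler(),
    VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
            ("nb", GaussianNB()),
            ("lr", LogisticRegression(random_state=42)),
        ],
        voting="soft",
    ),
)

# Accuracy on each of the 5 held-out folds
scores = cross_val_score(ensemble, X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Passing `n_jobs=-1` to `cross_val_score` would run the folds in parallel on all available cores.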

### Performance Evaluation
- Individual classifier performance tracking
- Ensemble performance comparison
- Potential for confusion matrix analysis
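
A confusion matrix breaks accuracy down by class. The sketch below uses a single random forest for brevity; the same call works with any fitted classifier, including a voting ensemble.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
clf.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)

# The diagonal holds the correct predictions, so accuracy = trace / total
accuracy = cm.trace() / cm.sum()
print(f"accuracy: {accuracy:.4f}")
```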

## Advantages of This Approach

1. **Robustness**
   - Multiple algorithms reduce overfitting risk
   - Better generalization to unseen data

2. **Flexibility**
   - Easy to add/remove classifiers
   - Can adjust voting weights if needed

3. **Performance**
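   - Often matches or outperforms the individual classifiers, though not always: in the run below, the ensemble (0.9150) trails the MLP (0.9350) and KNN (0.9250)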
   - More stable predictions
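
The flexibility point can be illustrated with `set_params()`: setting a named estimator to `"drop"` removes it from the ensemble, and `weights` can be changed in the same call. The three classifiers and the weight values below are arbitrary choices for this sketch.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

voting_clf = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("nb", GaussianNB()),
        ("lr", LogisticRegression(random_state=42)),
    ],
    voting="soft",
)
voting_clf.fit(X_train, y_train)
n_before = len(voting_clf.estimators_)

# Drop Naive Bayes and give the random forest twice the weight.
# weights keeps one entry per listed estimator; the dropped one is ignored.
voting_clf.set_params(nb="drop", weights=[2, 1, 1])
voting_clf.fit(X_train, y_train)  # refit so the change takes effect
n_after = len(voting_clf.estimators_)

print(n_before, n_after)  # prints "3 2"
```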

## Potential Improvements

1. **Hyperparameter Tuning**
   - GridSearchCV for individual classifiers
   - Optimization of voting weights

2. **Feature Selection**
   - Random Forest feature importance
   - Recursive feature elimination

3. **Model Selection**
   - Add/remove classifiers based on performance
   - Adjust ensemble composition
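
As a rough sketch of the tuning ideas above, `GridSearchCV` can search over the voting weights and over a sub-estimator's hyperparameters at the same time, using the `name__param` syntax. The grid values and the small three-model ensemble are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

voting_clf = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("nb", GaussianNB()),
        ("lr", LogisticRegression(random_state=42)),
    ],
    voting="soft",
)

param_grid = {
    "weights": [[1, 1, 1], [2, 1, 1], [3, 1, 1]],  # ensemble-level parameter
    "rf__max_depth": [5, 10],                      # parameter of the "rf" estimator
}

# 3-fold cross-validation on the training set only
search = GridSearchCV(voting_clf, param_grid, cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"best CV accuracy: {search.best_score_:.4f}")
```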

## Notes on Computational Efficiency
- Parallel processing implementation
- Memory usage considerations
- Scalability for larger datasets

In [17]:
# Import necessary libraries
import os
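# Cap Intel MKL at the AVX instruction set; this only takes effect if it is
# set before MKL is first loaded (i.e. before NumPy is first imported)
os.environ['MKL_ENABLE_INSTRUCTIONS'] = 'AVX'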

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import seaborn as sns

# Data preprocessing
def prepare_data(X, y):
    """Prepare data for modeling by splitting and scaling"""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Scale features - important for SVM and neural networks
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    return X_train_scaled, X_test_scaled, y_train, y_test

# Initialize and fit classifiers with fixed, hand-picked hyperparameters
def create_classifiers(X_train, y_train):
    """Create and fit a dictionary of classifiers (hyperparameters are preset, not searched)"""
    classifiers = {
        'rf': RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42),
        'gb': GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42),
        'svm': SVC(kernel='rbf', C=1.0, probability=True, random_state=42),
        'mlp': MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=1000, random_state=42),
        'knn': KNeighborsClassifier(n_neighbors=5, weights='distance'),
        'dt': DecisionTreeClassifier(max_depth=5, random_state=42),
        'nb': GaussianNB(),
        'lr': LogisticRegression(C=1.0, random_state=42)
    }
    
    # Fit each classifier
    for clf in classifiers.values():
        clf.fit(X_train, y_train)
        
    return classifiers

# Create and train voting classifier
def build_voting_classifier(classifiers, X_train, y_train):
    """Build and train a soft-voting classifier from the given base classifiers.

    VotingClassifier fits clones of the estimators it is given, so the
    already-fitted objects in `classifiers` are left untouched.
    """
    estimators = list(classifiers.items())
    
    voting_clf = VotingClassifier(
        estimators=estimators,
        voting='soft',  # Using probability estimates
        n_jobs=-1  # Parallel processing
    )
    
    voting_clf.fit(X_train, y_train)
    return voting_clf

# Evaluate models
def evaluate_models(voting_clf, classifiers, X_test, y_test):
    """Evaluate individual classifiers and voting classifier"""
    results = {}
    
    # Evaluate voting classifier
    voting_score = voting_clf.score(X_test, y_test)
    results['Voting Classifier'] = voting_score
    
    # Evaluate individual classifiers
    for name, clf in classifiers.items():
        score = clf.score(X_test, y_test)
        results[name] = score
    
    return results

# Main execution
if __name__ == "__main__":
    # Generate sample data
    X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
    
    # Prepare data
    X_train, X_test, y_train, y_test = prepare_data(X, y)
    
    # Create and train models
    classifiers = create_classifiers(X_train, y_train)  # Pass training data to fit classifiers
    voting_clf = build_voting_classifier(classifiers, X_train, y_train)
    
    # Evaluate models
    results = evaluate_models(voting_clf, classifiers, X_test, y_test)
    
    # Print results
    for model, score in results.items():
        print(f"{model} Score: {score:.4f}")

Voting Classifier Score: 0.9150
rf Score: 0.9150
gb Score: 0.9050
svm Score: 0.9100
mlp Score: 0.9350
knn Score: 0.9250
dt Score: 0.8850
nb Score: 0.8200
lr Score: 0.8250


In [18]:
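# Weighted soft-voting classifier built from the four highest-scoring models above.
# Note: the models and weights were picked from test-set scores, so evaluate this
# ensemble on fresh data (or with cross-validation) for an unbiased estimate.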
optimized_voting_clf = VotingClassifier(
    estimators=[
        ('mlp', MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=1000, random_state=42)),
        ('knn', KNeighborsClassifier(n_neighbors=5, weights='distance')),
        ('rf', RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)),
        ('svm', SVC(kernel='rbf', C=1.0, probability=True, random_state=42))
    ],
    voting='soft',
    weights=[2, 1.5, 1, 1]  # Giving higher weight to better performers
)
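
The cell above only defines the weighted ensemble. A minimal sketch of fitting and scoring it follows, reusing the same data generation, split, and scaling as `prepare_data()`. Because the members and weights were chosen by looking at test-set scores, the resulting test accuracy should be read as optimistic.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same data, split, and scaling as in the main script above
X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

optimized_voting_clf = VotingClassifier(
    estimators=[
        ("mlp", MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=1000, random_state=42)),
        ("knn", KNeighborsClassifier(n_neighbors=5, weights="distance")),
        ("rf", RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)),
        ("svm", SVC(kernel="rbf", C=1.0, probability=True, random_state=42)),
    ],
    voting="soft",
    weights=[2, 1.5, 1, 1],
)

optimized_voting_clf.fit(X_train_scaled, y_train)
test_accuracy = optimized_voting_clf.score(X_test_scaled, y_test)
print(f"Weighted Voting Classifier Score: {test_accuracy:.4f}")
```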