Use the Sonar dataset, fetched from OpenML through `sklearn.datasets.fetch_openml`, which contains sonar signals for classifying objects as either "rock" or "mine."

In [1]:
from sklearn.datasets import fetch_openml
import numpy as np
import pandas as pd
sonar = fetch_openml(name="sonar", version=1)

X = sonar.data  # Features
y = sonar.target  # Target (rock or mine)

a) Begin by creating training and testing sets from the dataset, with an 80-20 ratio and random_state=1. **1 pt**

In [2]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
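
On a small dataset, a purely random split can leave the two classes unevenly represented in the test set. A minimal sketch, on synthetic stand-in data (the arrays `X_demo` and `y_demo` are assumptions, not the Sonar data), of how the `stratify` argument of `train_test_split` keeps the class proportions similar in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration, not the Sonar data)
rng = np.random.default_rng(0)
X_demo = rng.random((200, 60))
y_demo = np.array(["Rock"] * 80 + ["Mine"] * 120)

# stratify=y_demo asks for the same Rock/Mine ratio in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

rock_share_train = np.mean(y_tr == "Rock")
rock_share_test = np.mean(y_te == "Rock")
print(f"Rock share: train={rock_share_train:.2f}, test={rock_share_test:.2f}")
```

The assignment fixes only `test_size` and `random_state`, so stratification is an optional refinement rather than part of the required answer.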

b) Train a KNN classifier on the training set to classify sonar signals as either "Rock" or "Mine." Use cross-validation to find an appropriate value of K. Evaluate and print the model's performance on the testing set using accuracy. **-- 9 points**

In [3]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# Define the range of k values
k_values = list(range(1, 21))

# Dictionary to store cross-validation accuracies for each k
cv_accuracies = {}

# Loop through different k values and perform cross-validation
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores = cross_val_score(knn, X_train, y_train, cv=5)  # 5-fold cross-validation
    cv_accuracies[k] = cv_scores.mean()

# Find the best k based on cross-validation accuracy
best_k = max(cv_accuracies, key=cv_accuracies.get)
best_accuracy = cv_accuracies[best_k]

print("Cross-validation accuracies for different k values:")
for k, accuracy in cv_accuracies.items():
    print(f"K = {k}: Mean Accuracy = {accuracy:.4f}")

print(f"\nBest k value based on cross-validation: {best_k} with accuracy {best_accuracy:.4f}")

# Training the KNN model with the best k on the entire training set
best_knn = KNeighborsClassifier(n_neighbors=best_k)
best_knn.fit(X_train, y_train)

# Predicting on the test set using the best model
y_pred = best_knn.predict(X_test)

# Calculate accuracy on the test set
test_accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy on the test set using the best k ({best_k}): {test_accuracy:.4f}")


Cross-validation accuracies for different k values:
K = 1: Mean Accuracy = 0.8137
K = 2: Mean Accuracy = 0.7891
K = 3: Mean Accuracy = 0.8012
K = 4: Mean Accuracy = 0.7652
K = 5: Mean Accuracy = 0.7717
K = 6: Mean Accuracy = 0.7234
K = 7: Mean Accuracy = 0.7358
K = 8: Mean Accuracy = 0.6873
K = 9: Mean Accuracy = 0.6875
K = 10: Mean Accuracy = 0.6934
K = 11: Mean Accuracy = 0.6811
K = 12: Mean Accuracy = 0.6451
K = 13: Mean Accuracy = 0.6512
K = 14: Mean Accuracy = 0.6148
K = 15: Mean Accuracy = 0.6390
K = 16: Mean Accuracy = 0.6209
K = 17: Mean Accuracy = 0.6449
K = 18: Mean Accuracy = 0.6267
K = 19: Mean Accuracy = 0.6330
K = 20: Mean Accuracy = 0.6269

Best k value based on cross-validation: 1 with accuracy 0.8137

Accuracy on the test set using the best k (1): 0.7619
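
The manual loop above can also be written as a grid search. The sketch below is an illustration on synthetic data from `make_classification` (an assumption, so that the block runs without downloading anything). It wraps a `MinMaxScaler` and the KNN classifier in a `Pipeline`, so the scaler is refit inside each cross-validation fold, and lets `GridSearchCV` choose `n_neighbors`. The names ending in `_demo` are introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the Sonar data (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=8, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
knn_pipeline = Pipeline([("scale", MinMaxScaler()), ("knn", KNeighborsClassifier())])
knn_search = GridSearchCV(
    knn_pipeline,
    {"knn__n_neighbors": list(range(1, 21))},
    cv=5,
    scoring="accuracy",
)
knn_search.fit(X_tr, y_tr)

best_k_demo = knn_search.best_params_["knn__n_neighbors"]
test_accuracy_demo = knn_search.score(X_te, y_te)  # uses the refit best pipeline
print(f"Best k: {best_k_demo}, test accuracy: {test_accuracy_demo:.4f}")
```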


In [4]:
X_train

array([[0.0139, 0.0222, 0.0089, ..., 0.0059, 0.0039, 0.0048],
       [0.0411, 0.0277, 0.0604, ..., 0.005 , 0.0085, 0.0044],
       [0.0731, 0.1249, 0.1665, ..., 0.0194, 0.0207, 0.0057],
       ...,
       [0.0208, 0.0186, 0.0131, ..., 0.0019, 0.0049, 0.0023],
       [0.0412, 0.1135, 0.0518, ..., 0.0225, 0.0098, 0.0085],
       [0.0333, 0.0221, 0.027 , ..., 0.0132, 0.0051, 0.0041]])

In [5]:
sonar

{'data': array([[0.02  , 0.0371, 0.0428, ..., 0.0084, 0.009 , 0.0032],
        [0.0453, 0.0523, 0.0843, ..., 0.0049, 0.0052, 0.0044],
        [0.0262, 0.0582, 0.1099, ..., 0.0164, 0.0095, 0.0078],
        ...,
        [0.0522, 0.0437, 0.018 , ..., 0.0138, 0.0077, 0.0031],
        [0.0303, 0.0353, 0.049 , ..., 0.0079, 0.0036, 0.0048],
        [0.026 , 0.0363, 0.0136, ..., 0.0036, 0.0061, 0.0115]]),
 'target': array(['Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
        'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
        'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
        'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
        'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
        'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
        'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
        'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock', 'Rock',
        'Rock

In [6]:
X

array([[0.02  , 0.0371, 0.0428, ..., 0.0084, 0.009 , 0.0032],
       [0.0453, 0.0523, 0.0843, ..., 0.0049, 0.0052, 0.0044],
       [0.0262, 0.0582, 0.1099, ..., 0.0164, 0.0095, 0.0078],
       ...,
       [0.0522, 0.0437, 0.018 , ..., 0.0138, 0.0077, 0.0031],
       [0.0303, 0.0353, 0.049 , ..., 0.0079, 0.0036, 0.0048],
       [0.026 , 0.0363, 0.0136, ..., 0.0036, 0.0061, 0.0115]])

c) Using any combination of the classification tools we've discussed in class:

- KNN
- Naive Bayes
- SVM
- Decision Tree (including Random Forests)
- Ensemble Methods (AdaBoost, Bagging)

You may also use feature extraction tools like PCA. Train and tune a model on the training set and evaluate its performance on the test set using accuracy. **-- 30 points**

 * accuracy > .95 **-- 30 points**
 * accuracy between 0.94 and 0.95 **-- 25 points**
 * accuracy between 0.92 and 0.94 **-- 20 points**
 * accuracy between 0.9 and 0.92 **-- 15 points**
 * accuracy between 0.85 and 0.9 **-- 10 points**
 * accuracy between 0.8 and 0.85 **-- 7 points**
 * accuracy between 0.7 and 0.8 **-- 5 points**
 * accuracy < 0.7 **-- 3 points**

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              BaggingClassifier, VotingClassifier)
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

# Scale each feature to [0, 1]; fit the scaler on the training set only
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Keep the smallest number of principal components that explain 98% of the variance
pca = PCA(n_components=0.98)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

In [8]:
X_train_pca

array([[ 0.08238329, -0.8415542 , -0.37129052, ...,  0.01706236,
         0.08270075, -0.02232914],
       [ 1.31114804,  0.04786938,  0.35541227, ..., -0.00252458,
        -0.12711949,  0.02057906],
       [ 0.18601859,  0.20724731,  0.68373819, ...,  0.04693734,
        -0.11561944,  0.17685122],
       ...,
       [ 0.03711126, -0.94687492, -0.41512858, ..., -0.15250896,
        -0.06377815,  0.00890625],
       [ 0.76279092,  0.54315745,  1.44873934, ..., -0.12771044,
         0.08777093,  0.04116873],
       [ 0.60750661, -1.05395462, -0.31667639, ..., -0.0421534 ,
        -0.06452961, -0.0213803 ]])
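
Passing a float between 0 and 1 as `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction. A small sketch on synthetic correlated data (an assumption for illustration, not the Sonar features) showing how to inspect what was kept:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 30 observed features driven by 5 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(150, 5))
mixing = rng.normal(size=(5, 30))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(150, 30))

pca_demo = PCA(n_components=0.98)
X_demo_pca = pca_demo.fit_transform(X_demo)

# Cumulative share of variance explained by the retained components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(f"Components kept: {pca_demo.n_components_} of {X_demo.shape[1]}")
print(f"Cumulative explained variance: {cumulative[-1]:.4f}")
```

The same two attributes, `n_components_` and `explained_variance_ratio_`, are available on the fitted `pca` object in the notebook.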

# The model with the highest test accuracy (0.9524) is placed directly below this comment

In [9]:
# Train the KNN model with the chosen k on the PCA-transformed training set
best_k = 1
best_knn = KNeighborsClassifier(n_neighbors=best_k)
best_knn.fit(X_train_pca, y_train)

# Predicting on the test set using the best model
y_pred = best_knn.predict(X_test_pca)

# Calculate accuracy on the test set
test_accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy on the test set using the knn with k = {best_k}: {test_accuracy:.4f}")



Accuracy on the test set using the knn with k = 1: 0.9524
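
The test set here is small: every accuracy printed in this notebook is a multiple of 1/42 (0.9524 = 40/42, 0.7619 = 32/42), so a single misclassified signal moves the score by about 2.4 percentage points. As a rough illustration of how much uncertainty that implies, the sketch below computes a 95% Wilson score interval for 40 correct out of 42. The helper `wilson_interval` is a hypothetical name introduced here, and the test-set size of 42 is inferred from the printed accuracies rather than stated in the notebook.

```python
from math import sqrt

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p_hat = correct / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width

low, high = wilson_interval(correct=40, total=42)
print(f"Accuracy 40/42 = {40 / 42:.4f}, 95% interval about [{low:.3f}, {high:.3f}]")
```

The interval is wide, so differences of one or two test samples between models should be read with caution.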


In [10]:
# Create a KNN base classifier
base_knn = KNeighborsClassifier(n_neighbors=5)

# Create Bagging classifier with KNN as base estimator
bagging_knn = BaggingClassifier(base_estimator=base_knn, n_estimators=100, random_state=42)

# Train the Bagging model
bagging_knn.fit(X_train, y_train)

# Make predictions on the test set
y_pred = bagging_knn.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Bagging with KNN Classifier: {accuracy:.4f}")

Accuracy of Bagging with KNN Classifier: 0.6905
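
A note on library versions: in scikit-learn 1.2 the `base_estimator` argument of `BaggingClassifier` and `AdaBoostClassifier` was renamed to `estimator`, and the old name was later removed. The sketch below picks whichever keyword the installed version accepts. The helper `make_bagging` is a hypothetical name, and the data are synthetic (both assumptions for illustration).

```python
import inspect

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

def make_bagging(base, **kwargs):
    """Build a BaggingClassifier with the keyword the installed version supports."""
    params = inspect.signature(BaggingClassifier.__init__).parameters
    key = "estimator" if "estimator" in params else "base_estimator"
    return BaggingClassifier(**{key: base}, **kwargs)

# Synthetic stand-in data (assumption, not the Sonar data)
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=0)

bagging_demo = make_bagging(
    KNeighborsClassifier(n_neighbors=5), n_estimators=10, random_state=42
)
bagging_demo.fit(X_demo, y_demo)
train_accuracy_demo = bagging_demo.score(X_demo, y_demo)
print(f"Training accuracy of the bagged KNN demo: {train_accuracy_demo:.4f}")
```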


In [11]:
# Create a Multinomial Naive Bayes classifier
nb_classifier = MultinomialNB()

# Define the grid of parameters to search
param_grid = {'alpha': [0.1, 0.5, 1.0, 1.5, 2.0]}

# Perform GridSearchCV to tune the 'alpha' parameter
grid_search = GridSearchCV(nb_classifier, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best parameter value
best_alpha = grid_search.best_params_['alpha']

# Train the model using the best alpha value
best_nb_classifier = MultinomialNB(alpha=best_alpha)
best_nb_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = best_nb_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Tuned Naive Bayes Classifier: {accuracy:.4f}")
print(f"Best alpha value: {best_alpha}")


Accuracy of Tuned Naive Bayes Classifier: 0.6667
Best alpha value: 0.1
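
`MultinomialNB` is designed for count-like features such as word frequencies, whereas the sonar features are continuous values. `GaussianNB`, which is imported above but not used, models each feature with a per-class normal distribution and may be a more natural fit. A hedged sketch on synthetic data (an assumption, not the Sonar data) that tunes its `var_smoothing` parameter:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic continuous features (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=8, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# var_smoothing adds a fraction of the largest feature variance to all variances
gnb_search = GridSearchCV(
    GaussianNB(),
    {"var_smoothing": np.logspace(-9, -1, 9)},
    cv=5,
    scoring="accuracy",
)
gnb_search.fit(X_tr, y_tr)

gnb_test_accuracy = gnb_search.score(X_te, y_te)
print(f"Best var_smoothing: {gnb_search.best_params_['var_smoothing']:.1e}")
print(f"Test accuracy of the Gaussian Naive Bayes demo: {gnb_test_accuracy:.4f}")
```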


In [12]:
# Create a base classifier (Multinomial Naive Bayes)
base_classifier_nb = MultinomialNB()

# Create AdaBoost classifier with a base estimator and set random_state
adaboost_classifier = AdaBoostClassifier(base_estimator = base_classifier_nb, n_estimators=100, random_state=42)

# Train the AdaBoost model
adaboost_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = adaboost_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of AdaBoost Classifier: {accuracy:.4f}")

Accuracy of AdaBoost Classifier: 0.7143


In [13]:
# Create Bagging classifier with nb as base estimator
bagging_nb = BaggingClassifier(base_estimator=base_classifier_nb, n_estimators=100, random_state=42)

# Train the Bagging model
bagging_nb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = bagging_nb.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Bagging with Naive Bayes Classifier: {accuracy:.4f}")

Accuracy of Bagging with Naive Bayes Classifier: 0.6667


In [14]:
# Create an SVM classifier
svm_classifier = SVC()

# Define the grid of parameters to search
param_grid = {
    'C': [0.1, 1, 10, 100],  # Penalty parameter C
    'gamma': [1, 0.1, 0.01, 0.001],  # Kernel coefficient for 'rbf'
    'kernel': ['rbf', 'poly', 'sigmoid']  # Kernel type
}

# Perform GridSearchCV to tune hyperparameters
grid_search = GridSearchCV(svm_classifier, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best parameter values
best_C = grid_search.best_params_['C']
best_gamma = grid_search.best_params_['gamma']
best_kernel = grid_search.best_params_['kernel']
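
# Train the model using the best C and gamma.
# Note: the kernel is hard-coded to 'poly' rather than best_kernel, so it may
# differ from the kernel that C and gamma were tuned with.
best_svm_classifier = SVC(C=best_C, gamma=best_gamma, kernel='poly')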
best_svm_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = best_svm_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Tuned SVM Classifier: {accuracy:.4f}")
print(f"Best C value: {best_C}")
print(f"Best gamma value: {best_gamma}")

Accuracy of Tuned SVM Classifier: 0.8095
Best C value: 10
Best gamma value: 0.1


In [15]:
# Create an SVM classifier
svm_classifier = SVC()

# Define the grid of parameters to search
param_grid = {
    'C': [0.1, 1, 10, 100],  # Penalty parameter C
    'gamma': [1, 0.1, 0.01, 0.001],  # Kernel coefficient for 'rbf'
    'kernel': ['rbf', 'poly', 'sigmoid']  # Kernel type
}

# Perform GridSearchCV to tune hyperparameters
grid_search = GridSearchCV(svm_classifier, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_pca, y_train)

# Get the best parameter values
best_C = grid_search.best_params_['C']
best_gamma = grid_search.best_params_['gamma']
best_kernel = grid_search.best_params_['kernel']
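
# Train the model using the best C and gamma.
# Note: the kernel is hard-coded to 'poly' rather than best_kernel, so it may
# differ from the kernel that C and gamma were tuned with.
best_svm_classifier = SVC(C=best_C, gamma=best_gamma, kernel='poly')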
best_svm_classifier.fit(X_train_pca, y_train)

# Make predictions on the test set
y_pred = best_svm_classifier.predict(X_test_pca)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Tuned SVM Classifier: {accuracy:.4f}")
print(f"Best C value: {best_C}")
print(f"Best gamma value: {best_gamma}")

Accuracy of Tuned SVM Classifier: 0.5238
Best C value: 10
Best gamma value: 0.1
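
In the two SVM cells above, the kernel passed to the final `SVC` is fixed to `'poly'` even though the grid search also tunes the kernel. With the default `refit=True`, `GridSearchCV` retrains the best parameter combination on the full training set and exposes it as `best_estimator_`, so the winning kernel, `C`, and `gamma` stay together without being copied by hand. A minimal sketch on synthetic data (an assumption for illustration, with a smaller grid than the one above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption, not the Sonar data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=8, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

svm_grid = {
    "C": [0.1, 1, 10],
    "gamma": [0.1, 0.01],
    "kernel": ["rbf", "poly", "sigmoid"],
}
svm_search = GridSearchCV(SVC(), svm_grid, cv=5, scoring="accuracy")
svm_search.fit(X_tr, y_tr)

# best_estimator_ already carries the jointly tuned kernel, C, and gamma
best_svm_demo = svm_search.best_estimator_
svm_test_accuracy = best_svm_demo.score(X_te, y_te)
print(f"Best parameters: {svm_search.best_params_}")
print(f"Test accuracy of the SVM demo: {svm_test_accuracy:.4f}")
```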


In [16]:
# Create a base classifier (SVM with probability estimates enabled)
base_classifier_svm = SVC(probability= True)

# Create AdaBoost classifier with a base estimator and set random_state
adaboost_classifier = AdaBoostClassifier(base_estimator = base_classifier_svm, n_estimators=100, random_state=42)

# Train the AdaBoost model
adaboost_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = adaboost_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of AdaBoost Classifier: {accuracy:.4f}")

Accuracy of AdaBoost Classifier: 0.7381


In [17]:
# Create a base classifier (SVM with probability estimates enabled)
base_classifier_svm = SVC(probability= True)

# Create AdaBoost classifier with a base estimator and set random_state
adaboost_classifier = AdaBoostClassifier(base_estimator = base_classifier_svm, n_estimators=100, random_state=42)

# Train the AdaBoost model
adaboost_classifier.fit(X_train_pca, y_train)

# Make predictions on the test set
y_pred = adaboost_classifier.predict(X_test_pca)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of AdaBoost Classifier: {accuracy:.4f}")

Accuracy of AdaBoost Classifier: 0.8571


In [18]:
# Create Bagging classifier with SVM as base estimator
bagging_svm = BaggingClassifier(base_estimator=base_classifier_svm, n_estimators=100, random_state=42)

# Train the Bagging model
bagging_svm.fit(X_train, y_train)

# Make predictions on the test set
y_pred = bagging_svm.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Bagging with SVM Classifier: {accuracy:.4f}")

Accuracy of Bagging with SVM Classifier: 0.8333


In [19]:
# Create Bagging classifier with SVM as base estimator (PCA features)
bagging_svm = BaggingClassifier(base_estimator=base_classifier_svm, n_estimators=100, random_state=42)

# Train the Bagging model
bagging_svm.fit(X_train_pca, y_train)

# Make predictions on the test set
y_pred = bagging_svm.predict(X_test_pca)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Bagging with SVM Classifier: {accuracy:.4f}")

Accuracy of Bagging with SVM Classifier: 0.8571


In [20]:
tree_classifier = DecisionTreeClassifier()

# Define the grid of parameters to search
param_grid = {
    'criterion': ['gini', 'entropy'],  # Function to measure the quality of a split
    'max_depth': [None, 5, 10, 15, 20, 25],  # Maximum depth of the tree
    'min_samples_split': [2, 5, 10]  # Minimum number of samples required to split a node
}

# Perform GridSearchCV to tune hyperparameters
grid_search = GridSearchCV(tree_classifier, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best parameter values
best_criterion = grid_search.best_params_['criterion']
best_max_depth = grid_search.best_params_['max_depth']
best_min_samples_split = grid_search.best_params_['min_samples_split']

# Train the model using the best parameters
best_tree_classifier = DecisionTreeClassifier(criterion = best_criterion, 
                                              max_depth=best_max_depth, 
                                              min_samples_split = best_min_samples_split)
best_tree_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = best_tree_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Tuned Decision Tree Classifier: {accuracy:.4f}")
print(f"Best criterion: {best_criterion}")
print(f"Best max_depth: {best_max_depth}")
print(f"Best min_samples_split: {best_min_samples_split}")

Accuracy of Tuned Decision Tree Classifier: 0.6905
Best criterion: entropy
Best max_depth: None
Best min_samples_split: 5


In [21]:
tree_classifier = DecisionTreeClassifier()

# Define the grid of parameters to search
param_grid = {
    'criterion': ['gini', 'entropy'],  # Function to measure the quality of a split
    'max_depth': [None, 5, 10, 15, 20, 25],  # Maximum depth of the tree
    'min_samples_split': [2, 5, 10]  # Minimum number of samples required to split a node
}

# Perform GridSearchCV to tune hyperparameters
grid_search = GridSearchCV(tree_classifier, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_pca, y_train)

# Get the best parameter values
best_criterion = grid_search.best_params_['criterion']
best_max_depth = grid_search.best_params_['max_depth']
best_min_samples_split = grid_search.best_params_['min_samples_split']

# Train the model using the best parameters
best_tree_classifier = DecisionTreeClassifier(criterion = best_criterion, 
                                              max_depth=best_max_depth, 
                                              min_samples_split = best_min_samples_split)
best_tree_classifier.fit(X_train_pca, y_train)

# Make predictions on the test set
y_pred = best_tree_classifier.predict(X_test_pca)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Tuned Decision Tree Classifier: {accuracy:.4f}")
print(f"Best criterion: {best_criterion}")
print(f"Best max_depth: {best_max_depth}")
print(f"Best min_samples_split: {best_min_samples_split}")

Accuracy of Tuned Decision Tree Classifier: 0.6667
Best criterion: entropy
Best max_depth: 5
Best min_samples_split: 10


In [22]:
# Create a base classifier (a depth-1 decision tree, i.e. a decision stump)
base_classifier_dc = DecisionTreeClassifier(max_depth=1)

# Create AdaBoost classifier with a base estimator and set random_state
adaboost_classifier = AdaBoostClassifier(base_estimator=base_classifier_dc, random_state=42)

# Train the AdaBoost model
adaboost_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = adaboost_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of AdaBoost Classifier: {accuracy:.4f}")

Accuracy of AdaBoost Classifier: 0.8571


In [23]:
# Create a base classifier (a depth-1 decision tree, i.e. a decision stump)
base_classifier_dc = DecisionTreeClassifier(max_depth=1)

# Create AdaBoost classifier with a base estimator and set random_state
adaboost_classifier = AdaBoostClassifier(base_estimator=base_classifier_dc, random_state=42)

# Train the AdaBoost model
adaboost_classifier.fit(X_train_pca, y_train)

# Make predictions on the test set
y_pred = adaboost_classifier.predict(X_test_pca)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of AdaBoost Classifier: {accuracy:.4f}")

Accuracy of AdaBoost Classifier: 0.6667


In [24]:
# Create Bagging classifier with a decision stump as base estimator
bagging_dc = BaggingClassifier(base_estimator=base_classifier_dc, n_estimators=100, random_state=42)

# Train the Bagging model
bagging_dc.fit(X_train, y_train)

# Make predictions on the test set
y_pred = bagging_dc.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Bagging with Decision Tree Classifier: {accuracy:.4f}")

Accuracy of Bagging with Decision Tree Classifier: 0.7381


In [25]:
# Create Bagging classifier with a decision stump as base estimator (PCA features)
bagging_dc = BaggingClassifier(base_estimator=base_classifier_dc, n_estimators=100, random_state=42)

# Train the Bagging model
bagging_dc.fit(X_train_pca, y_train)

# Make predictions on the test set
y_pred = bagging_dc.predict(X_test_pca)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Bagging with Decision Tree Classifier: {accuracy:.4f}")

Accuracy of Bagging with Decision Tree Classifier: 0.7619


In [26]:
# Define the grid of parameters to search

forest_classifier = RandomForestClassifier(random_state = 42)
param_grid = {
    'n_estimators': [100, 200, 300],  # Number of trees in the forest
    'criterion': ['gini', 'entropy'],  # Function to measure the quality of a split
    'max_depth': [None, 5, 10, 15],  # Maximum depth of the trees
    'min_samples_split': [2, 5, 10]  # Minimum number of samples required to split a node
}

# Perform GridSearchCV to tune hyperparameters
grid_search = GridSearchCV(forest_classifier, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best parameter values
best_n_estimators = grid_search.best_params_['n_estimators']
best_criterion = grid_search.best_params_['criterion']
best_max_depth = grid_search.best_params_['max_depth']
best_min_samples_split = grid_search.best_params_['min_samples_split']

# Train the model using the best parameters
best_forest_classifier = RandomForestClassifier(n_estimators=best_n_estimators, criterion=best_criterion,
                                                max_depth=best_max_depth,
                                                min_samples_split=best_min_samples_split)
best_forest_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = best_forest_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Tuned Random Forest Classifier: {accuracy:.4f}")
print(f"Best n_estimators: {best_n_estimators}")
print(f"Best criterion: {best_criterion}")
print(f"Best max_depth: {best_max_depth}")
print(f"Best min_samples_split: {best_min_samples_split}")

Accuracy of Tuned Random Forest Classifier: 0.8095
Best n_estimators: 100
Best criterion: entropy
Best max_depth: None
Best min_samples_split: 5


In [27]:
# Define the grid of parameters to search

forest_classifier = RandomForestClassifier(random_state = 42)
param_grid = {
    'n_estimators': [100, 200, 300],  # Number of trees in the forest
    'criterion': ['gini', 'entropy'],  # Function to measure the quality of a split
    'max_depth': [None, 5, 10, 15],  # Maximum depth of the trees
    'min_samples_split': [2, 5, 10]  # Minimum number of samples required to split a node
}

# Perform GridSearchCV to tune hyperparameters
grid_search = GridSearchCV(forest_classifier, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_pca, y_train)

# Get the best parameter values
best_n_estimators = grid_search.best_params_['n_estimators']
best_criterion = grid_search.best_params_['criterion']
best_max_depth = grid_search.best_params_['max_depth']
best_min_samples_split = grid_search.best_params_['min_samples_split']

# Train the model using the best parameters
best_forest_classifier = RandomForestClassifier(n_estimators=best_n_estimators, criterion=best_criterion,
                                                max_depth=best_max_depth,
                                                min_samples_split=best_min_samples_split)
best_forest_classifier.fit(X_train_pca, y_train)

# Make predictions on the test set
y_pred = best_forest_classifier.predict(X_test_pca)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Tuned Random Forest Classifier: {accuracy:.4f}")
print(f"Best n_estimators: {best_n_estimators}")
print(f"Best criterion: {best_criterion}")
print(f"Best max_depth: {best_max_depth}")
print(f"Best min_samples_split: {best_min_samples_split}")

Accuracy of Tuned Random Forest Classifier: 0.7619
Best n_estimators: 200
Best criterion: gini
Best max_depth: None
Best min_samples_split: 2


In [28]:
# Create a base classifier (Random Forest)
base_classifier_rfm = RandomForestClassifier()

# Create AdaBoost classifier with a base estimator and set random_state
adaboost_classifier = AdaBoostClassifier(base_estimator=base_classifier_rfm, random_state=42)

# Train the AdaBoost model
adaboost_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = adaboost_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of AdaBoost Classifier: {accuracy:.4f}")

Accuracy of AdaBoost Classifier: 0.8333


In [29]:
# Create a base classifier (Random Forest)
base_classifier_rfm = RandomForestClassifier()

# Create AdaBoost classifier with a base estimator and set random_state
adaboost_classifier = AdaBoostClassifier(base_estimator=base_classifier_rfm, random_state=42)

# Train the AdaBoost model
adaboost_classifier.fit(X_train_pca, y_train)

# Make predictions on the test set
y_pred = adaboost_classifier.predict(X_test_pca)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of AdaBoost Classifier: {accuracy:.4f}")

Accuracy of AdaBoost Classifier: 0.7857


In [30]:
bagging_rfm = BaggingClassifier(base_estimator = base_classifier_rfm, n_estimators=100, random_state=42)

# Train the Bagging model
bagging_rfm.fit(X_train, y_train)

# Make predictions on the test set
y_pred = bagging_rfm.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Bagging with rfm Classifier: {accuracy:.4f}")

Accuracy of Bagging with rfm Classifier: 0.7381


In [31]:
bagging_rfm = BaggingClassifier(base_estimator = base_classifier_rfm, n_estimators=100, random_state=42)

# Train the Bagging model
bagging_rfm.fit(X_train_pca, y_train)

# Make predictions on the test set
y_pred = bagging_rfm.predict(X_test_pca)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Bagging with rfm Classifier: {accuracy:.4f}")

Accuracy of Bagging with rfm Classifier: 0.7619
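
Choosing among many models by their test accuracy makes the winning score optimistic, since the test set has effectively been used for selection. One common alternative, sketched below on synthetic data with a few candidates chosen for illustration (both are assumptions), is to compare models by cross-validation on the training set and score only the selected one on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (assumption, not the Sonar data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=8, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Hypothetical shortlist of candidate models
candidates = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "svm": SVC(),
    "forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Compare candidates using only the training data
cv_means = {
    name: cross_val_score(model, X_tr, y_tr, cv=5).mean()
    for name, model in candidates.items()
}
chosen = max(cv_means, key=cv_means.get)

# Touch the test set once, with the selected model only
final_model = candidates[chosen].fit(X_tr, y_tr)
final_test_accuracy = final_model.score(X_te, y_te)
print(f"Selected by CV: {chosen} (mean CV accuracy {cv_means[chosen]:.4f})")
print(f"Test accuracy of the selected model: {final_test_accuracy:.4f}")
```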


# Bonus (15pts)

In this bonus we will implement a 1-dimensional GMM clustering algorithm from scratch. A GMM distribution is composed of `k` components, each characterized by:

1. A mixture proportion
2. A mean for its Normal Distribution
3. A variance for its Normal Distribution

So, to generate a dataset that follows a GMM distribution, we need a list of those parameters. In this exercise we will use a class called `Component` to capture the parameters of a given component, and a GMM will be a list of `Component`s.

In [32]:
class Component:
    def __init__(self, mixture_prop, mean, variance):
        self.mixture_prop = mixture_prop
        self.mean = mean
        self.variance = variance

example_gmm = [Component(.5, 5, 1), Component(.5, 8, 1)]
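
In symbols, writing $\pi_j$, $\mu_j$, and $\sigma_j^2$ for the mixture proportion, mean, and variance of component $j$ (notation introduced here, not used elsewhere in the assignment), the density of the mixture is:

$$
p(x) = \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x \mid \mu_j, \sigma_j^2), \qquad \sum_{j=1}^{k} \pi_j = 1, \quad \pi_j \ge 0 .
$$

For `example_gmm`, this reads $p(x) = 0.5\,\mathcal{N}(x \mid 5, 1) + 0.5\,\mathcal{N}(x \mid 8, 1)$.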


a) Complete the function below to validate and generate a dataset following a GMM distribution, given a specified set of GMM parameters as above and a size. You may only use the methods already imported in the cell. (10pts)

In [33]:
from numpy.random import normal, uniform

def generate_gmm_dataset(gmm_params, size):
    if not is_valid_gmm(gmm_params):
        raise ValueError("GMM parameters are invalid")
    
    dataset = []
    for _ in range(size):
        comp = get_random_component(gmm_params)
        dataset += ...
    return dataset

def is_valid_gmm(gmm_params):
    '''
        Checks that the sum of the mixture
        proportions is 1
    '''
    return True

def get_random_component(gmm_params):
    '''
        returns component with prob
        proportional to mixture_prop
    '''
    ...
    return 

# test your code: this should return a list of 10 numbers similar to worksheet 8
data = generate_gmm_dataset(example_gmm, 10)

TypeError: 'ellipsis' object is not iterable
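
One possible way to fill in the template, shown as a sketch rather than the reference solution: draw `u` from `uniform(0, 1)`, walk the cumulative mixture proportions until they exceed `u`, then sample from that component's normal distribution. Note that `numpy.random.normal` takes a standard deviation, so the variance is square-rooted. The tolerance `1e-9`, the extra checks in `is_valid_gmm`, and the fallback to the last component are choices made here, not requirements of the assignment.

```python
from numpy.random import normal, uniform

class Component:
    def __init__(self, mixture_prop, mean, variance):
        self.mixture_prop = mixture_prop
        self.mean = mean
        self.variance = variance

def is_valid_gmm(gmm_params):
    """Proportions are non-negative and sum to 1; variances are positive."""
    total = sum(comp.mixture_prop for comp in gmm_params)
    well_formed = all(
        comp.mixture_prop >= 0 and comp.variance > 0 for comp in gmm_params
    )
    return well_formed and abs(total - 1) < 1e-9

def get_random_component(gmm_params):
    """Return a component with probability equal to its mixture_prop."""
    u = uniform(0, 1)
    cumulative = 0.0
    for comp in gmm_params:
        cumulative += comp.mixture_prop
        if u < cumulative:
            return comp
    return gmm_params[-1]  # guards against floating-point round-off

def generate_gmm_dataset(gmm_params, size):
    if not is_valid_gmm(gmm_params):
        raise ValueError("GMM parameters are invalid")
    dataset = []
    for _ in range(size):
        comp = get_random_component(gmm_params)
        # normal() expects the standard deviation, i.e. sqrt(variance)
        dataset.append(normal(comp.mean, comp.variance ** 0.5))
    return dataset

example_gmm = [Component(.5, 5, 1), Component(.5, 8, 1)]
data = generate_gmm_dataset(example_gmm, 10)
print(data)
```

The original `dataset += ...` raises the `TypeError` shown above because `+=` on a list expects an iterable; appending a single sampled value avoids that.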

b) Finish the implementation below of the Expectation-Maximization Algorithm. Only use methods that have been imported in the cell. Visualize the output of your code by plotting the original mixture distribution curves and the ones learned by the EM algorithm. (15pts)

In [None]:
from scipy.stats import norm
from numpy import array, argmax
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def gmm_init(k, dataset):
    kmeans = KMeans(k, init='k-means++').fit(X=array(dataset).reshape(-1, 1))
    gmm_params = []
    ...
    return gmm_params


def compute_gmm(k, dataset, probs):
    '''
        Compute P(C_j), mean_j, var_j
    '''
    gmm_params = []
    ...
    return gmm_params


def compute_probs(k, dataset, gmm_params):
    '''
        For all x_i in dataset, compute P(C_j | X_i)
        = P(X_i | C_j)P(C_j) / P(X_i) for all C_j
        return the list of lists of all P(C_j | X_i)
        for all x_i in dataset.
    '''
    probs = []
    ...
    return probs


def expectation_maximization(k, dataset, iterations):
    '''
        Repeat for a set number of iterations.
    '''
    gmm_params = gmm_init(k, dataset)
    for _ in range(iterations):
        # expectation step
        probs = compute_probs(k, dataset, gmm_params)

        # maximization step
        gmm_params = compute_gmm(k, dataset, probs)

    return probs, gmm_params
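
For reference, the two steps in the loop correspond to the standard EM updates for a one-dimensional Gaussian mixture. Here $\gamma_{ij}$ stands for $P(C_j \mid x_i)$, $P(C_j)$ is the mixture proportion of component $j$, and $n$ is the dataset size (the symbols $\gamma_{ij}$, $N_j$, and $n$ are notation introduced here).

Expectation step:

$$
\gamma_{ij} = \frac{P(C_j)\,\mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}{\sum_{l=1}^{k} P(C_l)\,\mathcal{N}(x_i \mid \mu_l, \sigma_l^2)}
$$

Maximization step:

$$
N_j = \sum_{i=1}^{n} \gamma_{ij}, \qquad
P(C_j) = \frac{N_j}{n}, \qquad
\mu_j = \frac{1}{N_j}\sum_{i=1}^{n} \gamma_{ij}\,x_i, \qquad
\sigma_j^2 = \frac{1}{N_j}\sum_{i=1}^{n} \gamma_{ij}\,(x_i - \mu_j)^2
$$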

Notes:

1. Your code should work with any number of components, each with reasonable parameters.
2. Your code should work for 1 to about 5 iterations of the EM algorithm. It may not work beyond roughly 10 iterations because the math we are doing may overflow and create `nan`s; that's OK, don't worry about it.
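
A possible completion of the EM template, offered as a sketch under stated assumptions rather than the expected solution. Initialization takes each component's proportion, mean, and variance from one k-means cluster; the variance floor of `1e-6`, the fixed `random_state`, and the demo data drawn with `default_rng` are choices made here. Plotting is left out to keep the block short.

```python
from numpy import array
from numpy.random import default_rng
from scipy.stats import norm
from sklearn.cluster import KMeans

class Component:
    def __init__(self, mixture_prop, mean, variance):
        self.mixture_prop = mixture_prop
        self.mean = mean
        self.variance = variance

def gmm_init(k, dataset):
    """Initialize each component from one k-means cluster."""
    points = array(dataset, dtype=float)
    kmeans = KMeans(k, init="k-means++", n_init=10, random_state=0)
    labels = kmeans.fit(points.reshape(-1, 1)).labels_
    gmm_params = []
    for j in range(k):
        cluster = points[labels == j]
        # Floor keeps the variance positive for degenerate clusters (assumption)
        variance = max(cluster.var(), 1e-6)
        gmm_params.append(Component(len(cluster) / len(points), cluster.mean(), variance))
    return gmm_params

def compute_probs(k, dataset, gmm_params):
    """E-step: P(C_j | x_i) for every x_i, as a list of lists."""
    probs = []
    for x in dataset:
        weighted = [
            comp.mixture_prop * norm.pdf(x, loc=comp.mean, scale=comp.variance ** 0.5)
            for comp in gmm_params
        ]
        total = sum(weighted)
        probs.append([w / total for w in weighted])
    return probs

def compute_gmm(k, dataset, probs):
    """M-step: re-estimate P(C_j), mean_j, var_j from the responsibilities."""
    n = len(dataset)
    gmm_params = []
    for j in range(k):
        resp = [p[j] for p in probs]
        n_j = sum(resp)
        mean_j = sum(r * x for r, x in zip(resp, dataset)) / n_j
        var_j = sum(r * (x - mean_j) ** 2 for r, x in zip(resp, dataset)) / n_j
        gmm_params.append(Component(n_j / n, mean_j, var_j))
    return gmm_params

def expectation_maximization(k, dataset, iterations):
    gmm_params = gmm_init(k, dataset)
    for _ in range(iterations):
        probs = compute_probs(k, dataset, gmm_params)   # expectation step
        gmm_params = compute_gmm(k, dataset, probs)     # maximization step
    return probs, gmm_params

# Demo data drawn from the example mixture 0.5*N(5, 1) + 0.5*N(8, 1)
rng = default_rng(0)
dataset = list(rng.normal(5, 1, 200)) + list(rng.normal(8, 1, 200))

probs, learned = expectation_maximization(2, dataset, 5)
learned_means = sorted(comp.mean for comp in learned)
for comp in sorted(learned, key=lambda c: c.mean):
    print(f"prop={comp.mixture_prop:.3f}, mean={comp.mean:.3f}, var={comp.variance:.3f}")
```

Working with log-probabilities instead of raw densities is one common way to reduce the numerical problems mentioned in note 2, though it is not needed for a handful of iterations.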