# Assignment 2
Melissa Butler and Emma Franz Hughes
\
CoSci 5010 AutoML
\
April 7, 2023

## Introduction
The Franz Butler Vineyard (TM) would like to predict how much they can charge for a bottle of wine from their vineyard. The local market has three main price points: Boxed, Good, and Fancy.  Each of these categories contains two sub-categories that can affect pricing, albeit not as significantly. They would like to compare various categorical Machine Learning models with optimized hyperparameters, to decide the most accurate model for predicting the quality of a wine, based on 11 testable features.

## Dataset Description
Our dataset consists of quantitative descriptions of various wines. There are 1599 wines sampled. We have 11 features for each observation, each with a numeric rating of a physiochemical property of the wine such as alcohol content and acidity. The output variable we seek to predict corresponds to the quality rating of each wine, from sensory data. There are no missing values. A closer inspection of the quality outputs shows a total of 6 integer categories (3-8) as no wine on the list scored a 1, 2, 9, or 10. So, our input values are 11 numerical valued features and our output is a categorical rating ranging from 3-8.

## Experimental Setup 
We select six estimators implemented in the scikit learn library: a Decision Tree classifier, a Random Forest classifier, an AdaBoost classifier, a K-Nearest Neighbors classifier, a Support Vector classifier, and a Support Vector Regression. The cost function we seek to maximize is accuracy, which was custom coded to compare classifier and regression estimators. For the Support Vector Regression, we define accuracy to be the proportion of predicted values that, once rounded to the nearest integer, match the test values. The parameter space we search includes the above classifiers and a corresponding hyperparameter space for each. We choose hyperparameter ranges based off of documentation from AutoSklearn and the ranges for each hyperparameter can be seen in the code.

All features are included and the data is normalized. We use a 3 by 3 nested resampling.  This ensures that each data point is used in both testing and training and prevents overfitting to a single test-train split. We use scikit's KFold to create test-train split indices (after shuffling the data) and for each test-train split obtain optimized hyperparameters for each estimator. We then use the test portion of data to test each estimator with its optimized hyperparameters to obtain an unbiased accuracy estimate.

The hyperparameter optimization was performed using BayesSearchCV from scikit-optimize. Thus on each outer fold, we obtained optimized hyperparameters from a Bayesian search with our customized accuracy scoring, a 3-fold cross validation, and 50 iterations for each estimator type. The default surrogate and aqcuisition functions were used.

In [None]:
table_train_acc

In [None]:
table_test_acc

In [None]:
table_assignment1_acc

## Results
Above are tables for the obtained training and testing accuracy scores for each fold. Also given is a table of the accuracy values for the same estimators using default hyperparameters, copied from Assignment 1. For the hyperparameters obtained for each fold, see the appendix.

Regardless of parameter choices, we still see that the Support Vector Regression has a better accuracy than the classifiers we tested. For Fold 1, our best estimator had hyperparameters C=36.07, gamma=6.25, kernel=rbf, and degree=3. For Fold 2, the best estimator had hyperparameters C=1000.0, gamma=0.01, kernel=linear, and degree=4. For Fold 3, the best estimator had hyperparameters C=1000.0, gamma=9.93, kernel=poly, and degree=2. Of course, since the polynomial kernel was not chosen for the first two folds, the degree was ignored. Since our value for C was chosen to be 1000.0 for two of the folds, it is possible our upper bound for C should be increased. However, we did not see an increase in accuracy from the values we obtained in Assignment 1 using the default parameters for Support Vector Regression.

We noticed that for the Decision Tree Classifier and Random Forest Classifier our performance after hyperparameter optimization was on average worse than the accuracy in Assignment 1 with the defaul hyperparameters. For the AdaBoost, K Nearest Neighbors, and Support Vector Classifiers we had a slight improvement in accuracy using our optimized hyperparameters over the default hyperparameters. The Support Vector Regression was about the same.

## Resources Used
https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html

https://scikit-learn.org/stable/modules/neighbors.html#classification

## Appendix

In [None]:
# Run for required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import collections
from sklearn import preprocessing
from sklearn import tree
from sklearn.model_selection import KFold
from sklearn.svm import SVC, SVR
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from warnings import filterwarnings 
filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from skopt.space import Real, Integer, Categorical
from skopt import BayesSearchCV
from sklearn.metrics import make_scorer

In [75]:
#Read in data
filename = './winequality-red.csv'
df = pd.read_csv(filename, delimiter = ";")
X = df.values[:,0:-1]
y = df.values[:,-1]

#Normalize Feature Data
#X = preprocessing.normalize(X, axis = 0)

#Print dataset
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [77]:
def my_normalize(X):
    X_l1 = preprocessing.normalize(X, norm='l1')
    X_l2 = preprocessing.normalize(X, norm='l2')
    X_max = preprocessing.normalize(X, norm='max')

    return X_l1, X_l2, X_max

In [68]:
def my_hpo(X):
    # Accuracy scorer
    def accuracy(y_pred, y_test):
            temp = [abs(y_pred - y_test) < 1] 
            return sum(sum(temp))/len(y_test)
    acc_scorer = make_scorer(accuracy)

    num_kfolds = 3 # number of kfolds for both outer and inner loops

    # Use scikit learn KFold to create the test-train split indices for the outer sampling, with data shuffled
    kf_outer = KFold(n_splits = num_kfolds, shuffle = True) # outer test-train splits
    outer_est = [] # outer list of estimators
    outer_acc_train = [] # outer list of accuracy scores (from train split)
    outer_acc_test = [] # outer list of accuracy scores (from test split)
    outer_params = [] #outer list of parameters


    # Set estimators and parameters to test
    names = ["Decision\n Tree",
            "Random\n Forest",
            "AdaBoost",
            "Knn"
            #"SVM",
            #"SVR"
            ]

    estimators = [tree.DecisionTreeClassifier(),
                RandomForestClassifier(),
                AdaBoostClassifier(),
                KNeighborsClassifier()
                #SVC(),
                #SVR()
                ]

    bayes_param_spaces = [{"max_depth": Integer(6, 20), # values of max_depth are integers from 6 to 20
            "max_features": Categorical(['sqrt','log2']), 
            "min_samples_leaf": Integer(2, 10),
            "min_samples_split": Integer(2, 10)
        }, # Decision tree search space
        {"bootstrap": Categorical([True, False]), # values for boostrap can be either True or False
            "max_depth": Integer(6, 20),
            "max_features": Categorical(['sqrt','log2']), 
            "min_samples_leaf": Integer(2, 10),
            "min_samples_split": Integer(2, 10),
            "n_estimators": Integer(100, 500)
        }, # Random forest search space
        {"n_estimators": Integer(1e1, 1e3),
            "learning_rate": Real(1e-2,10.0)
        }, #AdaBoost search space
        {"n_neighbors": Integer(1,10),
            "weights": Categorical(['uniform', 'distance']),
            "algorithm": Categorical(['auto','ball_tree', 'kd_tree', 'brute']),
        }, #KNN search space
        {"kernel": Categorical(['linear', 'poly', 'rbf', 'sigmoid']),
            "C": Real(1e-2, 1e3, prior = "log-uniform"),
            "gamma": Real(1e-2,1e3, prior = "log-uniform"),
            "degree": Integer(2,7)
        }, #SVC search space
        {"kernel": Categorical(['linear', 'poly', 'rbf', 'sigmoid']),
            "C": Real(1e-2, 1e3, prior = "log-uniform"),
            "gamma": Real(1e-2,1e3, prior = "log-uniform"),
            "degree": Integer(2,7)
        } #SVM search space
            ]
        
    random_param_spaces = [{'max_depth':(6, 20), # values of max_depth are integers from 6 to 20
            'max_features': ['sqrt','log2'], 
            'min_samples_leaf': (2, 10),
            'min_samples_split': (2, 10)
        }, # Decision tree search space
        {'bootstrap': [True, False], # values for boostrap can be either True or False
            'max_depth': (6, 20),
            'max_features': ['sqrt','log2'], 
            'min_samples_leaf': (2, 10),
            'min_samples_split': (2, 10),
            'n_estimators': (100, 500)
        }, # Random forest search space
        {'n_estimators': (1, 100),
            'learning_rate': (1e-2,10.0)
        }, #AdaBoost search space
        {'n_neighbors': (1,10),
            'weights': ['uniform', 'distance'],
            'algorithm': ['auto','ball_tree', 'kd_tree', 'brute'],
        }] #KNN search space
    """
        {'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
            'C': (1e-2, 1e3, prior = "log-uniform"),
            'gamma': (1e-2,1e3, prior = "log-uniform"),
            'degree': (2,7)
        }, #SVC search space
        {'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
            'C': (1e-2, 1e3, prior = "log-uniform"),
            'gamma': (1e-2,1e3, prior = "log-uniform"),
            'degree': (2,7)
        } #SVM search space
            ]
    """
    for i, (train_index, test_index) in enumerate(kf_outer.split(X)):
        best_est = []
        best_acc_train = []
        best_acc_test = []
        best_params = []
        X_train = X[train_index, :]
        X_test = X[test_index, :]
        y_train = y[train_index]
        y_test = y[test_index]
        kf_inner = KFold(n_splits = num_kfolds, shuffle = True)
        
        # Use a random search for hyperparameters for each estimator
        for j in range(len(estimators)):
            classifier = estimators[j]
            param_space = random_param_spaces[j]
            random_search = RandomizedSearchCV(classifier, param_space, cv=kf_inner,
                                            n_iter=1,
                                            scoring=acc_scorer,
                                            verbose=False,
                                            n_jobs=-1)
            random_search.fit(X_train, y_train)
            best_est.append(random_search.best_estimator_)
            best_acc_train.append(random_search.best_score_)
            best_params.append(random_search.best_params_)
            estimator_test = random_search.best_estimator_.fit(X_train, y_train)
            y_pred = estimator_test.predict(X_test)
            acc = accuracy(y_pred, y_test)
            best_acc_test.append(acc)

        outer_est.append(best_est)
        outer_acc_train.append(best_acc_train)
        outer_acc_test.append(best_acc_test)
        outer_params.append(best_params)

        table_train_acc = pd.DataFrame(data = outer_acc_train, index =['Fold 1', 'Fold 2', 'Fold 3'], columns = names2)
        table_test_acc = pd.DataFrame(data = outer_acc_test, index =['Fold 1', 'Fold 2', 'Fold 3'], columns = names2)

        return table_train_acc, table_test_acc


In [88]:
X_list = my_normalize(X)
norm_names = ['l1','l2','l3']
for n in range(len(X_list)):
    print('-----'+norm_names[n]+'-----')
    X = X_list[n]
    train_df, test_df = my_hpo(X_list[n])
    print(train_df)


-----l1-----
        Decision Tree  Random Forest  AdaBoost       Knn
Fold 1       0.515907       0.581606  0.498162  0.469977
Fold 2       0.515907       0.581606  0.498162  0.469977
Fold 3       0.515907       0.581606  0.498162  0.469977
-----l2-----
        Decision Tree  Random Forest  AdaBoost       Knn
Fold 1       0.496236       0.585367  0.473722  0.501859
Fold 2       0.496236       0.585367  0.473722  0.501859
Fold 3       0.496236       0.585367  0.473722  0.501859
-----l3-----
        Decision Tree  Random Forest  AdaBoost       Knn
Fold 1       0.496241       0.560046  0.485931  0.493435
Fold 2       0.496241       0.560046  0.485931  0.493435
Fold 3       0.496241       0.560046  0.485931  0.493435


In [None]:



def my_table():
    # Create table of accuracy for each estimator for each fold
    #acc_means = np.average(acc, axis = 1)
    names2 = ["Decision Tree",
            "Random Forest",
            "AdaBoost",
            "Knn"
            #"SVM",
            #"SVR"
            ]
    #means = np.column_stack((acc_means, acc_means_threecat))
    table_train_acc = pd.DataFrame(data = outer_acc_train, index =['Fold 1', 'Fold 2', 'Fold 3'], columns = names2)
    table_test_acc = pd.DataFrame(data = outer_acc_test, index =['Fold 1', 'Fold 2', 'Fold 3'], columns = names2)
    assignment1_data = [0.625366, 0.697952, 0.524104, 0.567897]
    #, 0.579721, 0.888671]
    table_assignment1_acc = pd.DataFrame(data = assignment1_data, columns =['Default hyperparameters accuracy'], index = names2)

In [None]:
table_train_acc

In [None]:
outer_params