In [1]:
import sklearn.model_selection
from sklearn.datasets import fetch_openml
import sklearn.metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder



X, y = fetch_openml(data_id=40691, as_frame=True, return_X_y=True)

columns = X.select_dtypes(include=['category', 'int']).columns # Determine if any columns in feature data are categorical/integer 

if not columns.empty:
    enc = OneHotEncoder()  # One-hot encoding should only be done for categorical/integer values
    X = enc.fit_transform(X)  
    
    

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(random_state=42)
clf = clf.fit(X_train, y_train)
y_hat = clf.predict(X_test)
print("RF Accuracy", sklearn.metrics.accuracy_score(y_test, y_hat))

RF Accuracy 0.67


In [2]:
from autosklearn.classification import AutoSklearnClassifier

automl = AutoSklearnClassifier(time_left_for_this_task=300,
                              resampling_strategy='cv',
                              seed=349)
automl.fit(X_train, y_train)
y_hat = automl.predict(X_test)
print("AutoML Accuracy", sklearn.metrics.accuracy_score(y_test, y_hat))



AutoML Accuracy 0.6725


In [None]:
I noticed two significant issues with this example. First, one-hot encoding appears to have been applied without regard to whether the dataset would benefit from it. The dataset used here is one of the wine quality
datasets, and its feature data consists entirely of continuous values. Encoding these features is likely harmful: every distinct value becomes its own 0/1 indicator column, so the ordering and magnitude of the original
measurements are lost. One-hot encoding is better suited to categorical or integer-coded features whose values do not indicate a greater or lesser quality or quantity relative to one another.
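
To make the encoding point concrete, here is a minimal sketch of encoding only the categorical columns and passing continuous ones through unchanged. The toy DataFrame and its column names are hypothetical stand-ins (not the actual wine data), and `ColumnTransformer` is one possible way to do this, not necessarily what the original example intended.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame: two continuous columns and one categorical column.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 9.4],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "color": pd.Categorical(["red", "white", "red", "red"]),
})

# Blanket encoding, as in the cell above: every distinct value in every
# column becomes its own indicator column, including the continuous ones.
blanket = OneHotEncoder().fit_transform(toy)

# Selective encoding: only columns with the 'category' dtype are one-hot
# encoded; the remaining (continuous) columns are passed through as-is.
selective_encoder = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_include="category")),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array for easy inspection
)
selective = selective_encoder.fit_transform(toy)

print("blanket encoding columns:  ", blanket.shape[1])
print("selective encoding columns:", selective.shape[1])
```

With the blanket approach the continuous measurements are replaced by indicators for each distinct value, while the selective approach keeps them numeric. On a frame with no categorical columns, the selective transformer would simply return the features unchanged.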

The second issue is that no hyperparameter search space is specified for the auto-sklearn model. This may not be a problem in itself, as the auto-sklearn documentation states that a large search space is used by
default. However, it is generally better and more computationally efficient to restrict the search to ranges that are known to produce good results. With only the resampling method specified, the model shows only
marginally better performance than its default counterpart.
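
As a rough sketch of what narrowing the search could look like, the snippet below restricts auto-sklearn to a few tree-based classifiers and disables feature preprocessing. It assumes auto-sklearn 0.14 or later, where the constructor accepts an `include` argument. The particular component list and the helper name `build_restricted_automl` are my own illustrative choices, not something tested in this notebook. Note that this limits which components are searched rather than the hyperparameter ranges within them.

```python
# Hypothetical restriction of the search space: which components to search
# over is an assumption made for illustration, not a tuned recommendation.
restricted_search_space = {
    "classifier": ["random_forest", "extra_trees", "gradient_boosting"],
    "feature_preprocessor": ["no_preprocessing"],
}


def build_restricted_automl(seed=349):
    """Return an unfitted AutoSklearnClassifier limited to the components above.

    Assumes auto-sklearn >= 0.14, where the `include` argument is available.
    The import lives inside the function so the configuration can be
    inspected even where auto-sklearn is not installed.
    """
    from autosklearn.classification import AutoSklearnClassifier

    return AutoSklearnClassifier(
        time_left_for_this_task=300,   # same time budget as the cell above
        resampling_strategy="cv",      # same resampling strategy as above
        include=restricted_search_space,
        seed=seed,
    )
```

Calling `build_restricted_automl().fit(X_train, y_train)` would then run the same 300-second search over the smaller space. Whether that improves accuracy on this dataset would still need to be checked.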

While both issues contribute to poor performance, in my experience the factor that caused the auto-sklearn model to perform worse than its default counterpart is the lack of a defined hyperparameter search
space. One-hot encoding did drag down the performance of both models quite significantly, but the auto-sklearn model only showed better performance when the resampling strategy was specified.

In [None]:
Sources: 
    Lekhana_Ganji. (2023, April 18). One hot encoding in machine learning. GeeksforGeeks. https://www.geeksforgeeks.org/ml-one-hot-encoding-of-datasets-in-python/# 
    Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., & Hutter, F. (2015). Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems 28 (pp. 2962-2970).