# Mount Drive

In [0]:
from google.colab import drive
drive.mount('/content/drive')
basedir = '/content/drive/My Drive/PSDA Group 4/Data/' 

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Dependencies



In [0]:
!pip install gama
!apt-get install swig -y
!pip install Cython numpy
!pip install auto-sklearn
!pip freeze > req.txt

In [0]:
from gama import GamaClassifier
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss, accuracy_score
import logging
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import autosklearn.classification

# Read Data

In [0]:
# This dataset records which users did or did not purchase a particular product
df = pd.read_csv(basedir+'Social_Network_Ads.csv')
df.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


# Data Partitioning

In [0]:
X = df.iloc[:, [2, 3]].values
y = df.iloc[:, 4].values

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Auto ML

Choose an AutoML package. Justify your choice.

We chose autosklearn because it is a drop-in replacement for a scikit-learn estimator, an interface most ML researchers already know. It also provides APIs for inspecting the search statistics and the models it selected, which gives insight into the behaviour of the search procedure.

We also tested gama, another AutoML package, which is very similar to autosklearn.

Perform the classification task from Task 7 with AutoML. Compare the results with the results from Task 7.

In [0]:
automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)
y_hat = automl.predict(X_test)
print("Accuracy score", accuracy_score(y_test, y_hat))

In [0]:
# configure auto-sklearn
automl = autosklearn.classification.AutoSklearnClassifier(
          time_left_for_this_task=900, # run auto-sklearn for at most 15 min (900 s)
          per_run_time_limit=100, # spend at most 100 s on each model training
          )

# train model(s)
automl.fit(X_train, y_train)

# evaluate
y_hat = automl.predict(X_test)
test_acc = accuracy_score(y_test, y_hat)
print("Test Accuracy score {0}".format(test_acc))

Test Accuracy score 0.8625


In [0]:
print(automl.sprint_statistics())

In [0]:
print(automl.show_models())

In [0]:
automl = GamaClassifier(max_total_time=180, keep_analysis_log=None)
print("Starting `fit` which will take roughly 3 minutes.")
automl.fit(X_train, y_train)

label_predictions = automl.predict(X_test)
probability_predictions = automl.predict_proba(X_test)

print('accuracy:', accuracy_score(y_test, label_predictions))
print('log loss:', log_loss(y_test, probability_predictions))
# the `score` function outputs the score on the metric optimized towards (by default, `log_loss`)
print('log_loss', automl.score(X_test, y_test))

Starting `fit` which will take roughly 3 minutes.
accuracy: 0.8625
log loss: 0.3590357913678016
log_loss 0.3590357913678016
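
To make the `log loss` figure above more concrete, here is a minimal sketch of binary log loss computed by hand. The function name `binary_log_loss` and the sample labels and probabilities are hypothetical and not taken from the notebook; the clipping constant `eps` is an assumption used only to avoid `log(0)`.

```python
import math

def binary_log_loss(y_true, p_positive, eps=1e-15):
    """Mean negative log-likelihood of the true labels given predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_positive):
        p = min(max(p, eps), 1 - eps)  # clip so that log() never receives 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels and predicted probabilities (illustration only)
y_true = [1, 0, 1, 0]
p_pos = [0.9, 0.2, 0.6, 0.4]

loss = binary_log_loss(y_true, p_pos)
print("log loss:", loss)
```

Lower values are better: confident, correct probabilities push the loss towards 0, while confident mistakes are penalised heavily. This is why two models with the same accuracy can still differ in log loss.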


In [0]:
automl = GamaClassifier(
    max_total_time=400, # in seconds
    n_jobs=2,  # use two parallel processes
    scoring='accuracy',  # metric to optimize for
    verbosity=logging.INFO,  # to get printed updates about search progress
    keep_analysis_log="trees.log",  # name for a log file to record search output in
)

Using GAMA version 20.2.0.
GamaClassifier(cache=None,post_processing_method=BestFitPostProcessing(),search_method=AsyncEA(),keep_analysis_log=/content/trees.log,verbosity=20,n_jobs=2,max_eval_time=None,max_total_time=400,random_state=None,max_pipeline_length=None,regularize_length=True,scoring=accuracy)


In [0]:
automl.fit(X_train, y_train)

preprocessing took 0.0051s.
Starting EA with new population.
Search phase evaluated 946 individuals.
search took 359.0476s.
postprocess took 0.1441s.


In [0]:
automl.score(X_test, y_test)

0.8625

# Best Model of Task 7

In [0]:
# Fit the scaler on the training data only and reuse its statistics for the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
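
As a rough illustration of what standardization does, the sketch below learns the mean and standard deviation from a training column and applies the same statistics to a test column. The helper names `fit_standardizer` and `standardize` and the sample values are hypothetical; the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Return the mean and population standard deviation of one feature column."""
    return mean(values), pstdev(values)

def standardize(values, mu, sigma):
    """Scale values using the given mean and standard deviation."""
    return [(v - mu) / sigma for v in values]

# Hypothetical feature values (illustration only, not taken from the dataset)
train_col = [20.0, 30.0, 40.0]
test_col = [30.0, 50.0]

mu, sigma = fit_standardizer(train_col)          # statistics come from the training split only
train_scaled = standardize(train_col, mu, sigma)
test_scaled = standardize(test_col, mu, sigma)   # the test split reuses them unchanged

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics keeps both splits on the same scale, so a value of 30 maps to the same scaled value whether it appears in the training or the test data.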

In [0]:
param_grid_svc = [{'C': [1,10,100, 1000], 'kernel': ['linear'], },
              {'C': [1,10,100, 1000], 'kernel': ['rbf'], 'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]}]
search_svm = GridSearchCV(SVC(random_state=1), param_grid_svc, scoring='accuracy', cv=10, n_jobs=-1)
search_svm = search_svm.fit(X_train, y_train)
best_score_svm = search_svm.best_score_
best_parameters_svm = search_svm.best_params_
scores_svm = search_svm.cv_results_
Data_svm = pd.DataFrame(scores_svm)
# Data_svm
# Data_svm[['params', 'mean_test_score', 'mean_score_time', 'rank_test_score' ]].to_csv("./svm.dat")

In [0]:
best_parameters_svm

{'C': 1, 'gamma': 0.6, 'kernel': 'rbf'}

In [0]:
best_score_svm

0.928125

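Comparing the scores, our model from Task 7 outperforms the AutoML packages: the tuned SVM reaches an accuracy of 0.928, whereas autosklearn achieved a test accuracy of only 0.8625 even when run for more than 20 minutes. Note that the SVM score is the mean 10-fold cross-validation accuracy on the training set (`best_score_`), while the AutoML scores are measured on the held-out test set, so the two figures are not strictly comparable.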
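
For a like-for-like comparison, the tuned model can also be scored on the held-out test set. The sketch below shows one way to do this, assuming a synthetic stand-in dataset from `make_classification` and a smaller, hypothetical parameter grid; it does not reproduce the numbers reported in this notebook. Wrapping the scaler and the SVM in a pipeline means the scaler is refit on the training folds only during cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): two numeric features, binary target
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Scaling is part of the pipeline, so it is fitted inside each CV fold
pipe = make_pipeline(StandardScaler(), SVC(random_state=1))
param_grid = {"svc__C": [1, 10, 100], "svc__gamma": [0.1, 0.5]}  # hypothetical grid

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

cv_accuracy = search.best_score_                 # mean CV accuracy on the training set
test_accuracy = search.score(X_test, y_test)     # accuracy on the held-out test set

print("CV accuracy:  ", cv_accuracy)
print("test accuracy:", test_accuracy)
```

The test accuracy is the figure that can be set directly against the `accuracy_score(y_test, ...)` values reported for the AutoML runs.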

# Our Opinion on Auto ML

What is your opinion of AutoML?

In general, we think AutoML packages are useful in ML projects. They give a quick overview of what scores are achievable. They are also well suited to researchers with limited experience in building pipelines that include pre-processing steps such as scaling. Hence, they are a good fit when the only goal is to predict the outcome. If researchers want further insight into the data, such as its distribution or collinearity, "traditional" ML models may be more informative. Furthermore, our example shows that AutoML packages (we tried autosklearn and gama) do not necessarily deliver the best achievable results, even when run for a longer time.