# **CIS 5200: Machine Learning**

## **Auto Machine Learning**


- **Content Creator:** Shaozhe Lyu, Xiaonan Liu
- **Content Checkers:** Aditya Pratap Singh
- **Acknowledgements:** This notebook contains an excerpt from the [auto-sklearn](https://automl.github.io/auto-sklearn/master/) documentation.
- **Objectives:** This worksheet shows how to use `auto-sklearn` to complete a classification task.


In [1]:
%%capture
!pip install penngrader

In [2]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import os
import sys  # needed by the sys.exit() call in the autograder check below

In [3]:
# For autograder only, do not modify this cell. 
# True for Google Colab, False for autograder
NOTEBOOK = (os.getenv('IS_AUTOGRADER') is None)
if NOTEBOOK:
    print("[INFO, OK] Google Colab.")
else:
    print("[INFO, OK] Autograder.")
    sys.exit()

[INFO, OK] Google Colab.


In [4]:
# PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER WON'T KNOW WHO
# TO ASSIGN POINTS TO IN OUR BACKEND.
STUDENT_ID = 57931095  # YOUR PENN-ID GOES HERE AS AN INTEGER

In [5]:
import penngrader.grader

grader = penngrader.grader.PennGrader(homework_id = 'CIS_5200_202230_HW_Auto_ML_WS', student_id = STUDENT_ID)

PennGrader initialized with Student ID: 57931095

Make sure this correct or we will not be able to store your grade


In [None]:
# A helper function for the grading utilities
import base64

import dill  # third-party package; must be installed in the environment


def grader_serialize(obj):
    '''Serialize a Python object with dill and return it as a base64-encoded UTF-8 string.'''
    byte_serialized = dill.dumps(obj, recurse=True)
    return base64.b64encode(byte_serialized).decode("utf-8")
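
To see what this kind of helper produces, here is a minimal round-trip sketch. It assumes the standard-library `pickle` as a stand-in for `dill` (which can also handle objects such as lambdas), and `serialize` / `deserialize` are illustrative names, not part of PennGrader.

```python
import base64
import pickle


def serialize(obj):
    """Pickle an object and encode the resulting bytes as base64 text."""
    return base64.b64encode(pickle.dumps(obj)).decode("utf-8")


def deserialize(text):
    """Invert serialize(): decode the base64 text and unpickle the bytes."""
    return pickle.loads(base64.b64decode(text.encode("utf-8")))


# Illustrative payload; any picklable object works.
payload = {"answer": "0.95", "scores": [1, 2, 3]}
encoded = serialize(payload)
restored = deserialize(encoded)
print(type(encoded).__name__, restored == payload)
```

The base64 step matters because raw pickled bytes are not valid text, whereas the encoded string can be sent safely inside a text-based request.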


Auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning. It leverages recent advances in Bayesian optimization, meta-learning, and ensemble construction. Learn more about the technology behind auto-sklearn by reading the paper published at [NIPS 2015](https://proceedings.neurips.cc/paper/2015/file/11d0e6287202fced83f79975ec59a3a6-Paper.pdf).

We will start by installing the packages swig and auto-sklearn:

In [6]:
!sudo apt-get install swig -y
!pip install Cython numpy
!pip install pipelineprofiler
!pip install auto-sklearn


In [8]:
import pandas as pd
import numpy as np
import PipelineProfiler
import autosklearn.classification
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

## **Auto-sklearn Classification**

After installing all dependencies and libraries, let's try it out!

### Data Loading



In [9]:
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=1)

### Building and Fitting

Below are the two most important parameters:

- **time_left_for_this_task**, optional (default=3600):
Time limit in seconds for the search of appropriate models. By increasing this value, auto-sklearn has a higher chance of finding better models.

- **per_run_time_limit**, optional (default=1/10 of time_left_for_this_task):
Time limit for a single call to the machine learning model. Model fitting will be terminated if the machine learning algorithm runs over the time limit. Set this value high enough so that typical machine learning algorithms can be fit on the training data.
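
To get a feel for how the two limits interact, the sketch below computes how many model fits would fit into the budget if every fit used its full per-run limit. This is a rough, hypothetical estimate (`max_full_length_runs` is not an auto-sklearn function): it ignores the time auto-sklearn spends on other steps such as ensemble construction, and in practice many fits finish well before the limit.

```python
def max_full_length_runs(time_left_for_this_task, per_run_time_limit):
    """Count sequential runs that fit if each one uses its full time limit."""
    if per_run_time_limit <= 0:
        raise ValueError("per_run_time_limit must be positive")
    return time_left_for_this_task // per_run_time_limit


# A 120 s total budget with a 30 s cap per model.
print(max_full_length_runs(120, 30))    # 4
# The default 3600 s budget with the default per-run limit of one tenth of it.
print(max_full_length_runs(3600, 360))  # 10
```

A very small ratio between the two values leaves little room for the search to explore different pipelines, which is one reason to raise the total budget before raising the per-run limit.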

In [10]:
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30
)
automl.fit(X_train, y_train, dataset_name='breast_cancer')

AutoSklearnClassifier(ensemble_class=<class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>,
                      per_run_time_limit=30, time_left_for_this_task=120)

Now we print the final ensemble constructed by auto-sklearn. The output shows which kinds of classifiers auto-sklearn combined into the ensemble.

In [11]:
print(automl.show_models())

{2: {'model_id': 2, 'rank': 1, 'cost': 0.028368794326241176, 'ensemble_weight': 0.04, 'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7fb212681c40>, 'balancing': Balancing(random_state=1), 'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7fb212090af0>, 'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7fb212090cd0>, 'sklearn_classifier': RandomForestClassifier(max_features=5, n_estimators=512, n_jobs=1,
                       random_state=1, warm_start=True)}, 3: {'model_id': 3, 'rank': 2, 'cost': 0.028368794326241176, 'ensemble_weight': 0.04, 'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice object at 0x7fb212668790>, 'balancing': Balancing(random_state=1), 'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7

In [12]:
print(automl.leaderboard())

          rank  ensemble_weight                type      cost   duration
model_id                                                                
7            1             0.04         extra_trees  0.014184   2.224094
30           2             0.06         extra_trees  0.021277  15.601692
2            3             0.04       random_forest  0.028369   1.985536
3            4             0.04                 mlp  0.028369   1.377759
6            5             0.02                 mlp  0.028369   2.255550
10           6             0.06       random_forest  0.028369   2.316599
11           7             0.06       random_forest  0.028369   2.653618
13           8             0.02   gradient_boosting  0.028369   1.760876
14           9             0.04                 mlp  0.028369   2.366750
19          10             0.04         extra_trees  0.028369   3.484913
5           11             0.02       random_forest  0.035461   2.430756
31          12             0.12       random_forest
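
The `ensemble_weight` column indicates how much each model contributes to the final prediction. A common way to combine models with such weights is weighted soft voting, where each model's class probabilities are averaged using the weights. The sketch below illustrates that idea in plain Python; the probabilities are made-up numbers and `weighted_soft_vote` is a hypothetical helper, not auto-sklearn's API.

```python
def weighted_soft_vote(member_probs, weights):
    """Average per-class probabilities across models using ensemble weights.

    member_probs: one list of class probabilities per model.
    weights: one non-negative weight per model.
    """
    total = sum(weights)
    n_classes = len(member_probs[0])
    averaged = [
        sum(w * probs[c] for probs, w in zip(member_probs, weights)) / total
        for c in range(n_classes)
    ]
    # The predicted class is the one with the highest averaged probability.
    predicted_class = max(range(n_classes), key=averaged.__getitem__)
    return averaged, predicted_class


# Made-up probabilities for one sample from three hypothetical models.
member_probs = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
weights = [0.5, 0.25, 0.25]
averaged, predicted_class = weighted_soft_vote(member_probs, weights)
print(averaged, predicted_class)
```

In this toy case two of the three models favor class 1, yet the heavily weighted first model pulls the averaged prediction to class 0, which shows why the weights matter as much as the individual models.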

### Get the Score of the Final Ensemble

In [13]:
predictions = automl.predict(X_test)
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))

Accuracy score: 0.958041958041958
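
Accuracy is simply the fraction of test samples whose predicted label matches the true label. The reported value is consistent with 137 correct predictions out of 143 test samples (143 being the default 25% test split of the 569-sample dataset). The sketch below recomputes that ratio by hand; the label lists are synthetic stand-ins chosen to reproduce it, not the actual predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Synthetic labels: 143 test samples, 6 of them misclassified.
y_true = [1] * 143
y_pred = [1] * 137 + [0] * 6
score = accuracy(y_true, y_pred)
print("Accuracy score:", score)
```

With a test set this small, a single additional error changes the score by about 0.007, so small differences between runs should not be over-interpreted.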


In [14]:
#@markdown Report the accuracy score (up to two decimal places, e.g., 0.90)
accuracy = '0.95' #@param {type:"string"}
grader.grade(test_case_id = 'test_accuracy', answer = accuracy)

Correct! You earned 1.0/1.0 points. You are a star!

Your submission has been successfully recorded in the gradebook.



AutoML can also be extended to use text descriptions of problems to pick hyperparameters:
* Use vector embeddings of dataset title, description and keywords
* For each new dataset, find the most similar prior dataset and use its hyperparameters
* The similarity metric is learned (supervised)
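
The lookup described in the list above can be sketched as a nearest-neighbor search over dataset embeddings. This is a minimal illustration under stated assumptions: the embeddings, dataset names, and hyperparameters are invented, and plain cosine similarity stands in for the learned similarity metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def suggest_hyperparameters(new_embedding, prior_datasets):
    """Return the name and hyperparameters of the most similar prior dataset."""
    best = max(
        prior_datasets,
        key=lambda d: cosine_similarity(new_embedding, d["embedding"]),
    )
    return best["name"], best["hyperparameters"]


# Hypothetical prior datasets with made-up text embeddings.
prior_datasets = [
    {
        "name": "tumor_measurements",
        "embedding": [0.9, 0.1, 0.0],
        "hyperparameters": {"classifier": "random_forest", "n_estimators": 512},
    },
    {
        "name": "movie_reviews",
        "embedding": [0.0, 0.2, 0.9],
        "hyperparameters": {"classifier": "mlp", "alpha": 0.0001},
    },
]

# Made-up embedding of a new dataset's title, description, and keywords.
new_embedding = [0.8, 0.3, 0.1]
name, params = suggest_hyperparameters(new_embedding, prior_datasets)
print(name, params)
```

A learned metric would replace `cosine_similarity` with a function trained so that datasets sharing good hyperparameters score as similar; the lookup itself stays the same.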


## *Questions*




In [17]:
#@markdown Name the most accurate classifier found by AutoML (copy the name from the leaderboard).
most_accurate_model = 'random_forest' #@param {type:"string"}
grader.grade(test_case_id = 'test_accurate_model', answer = most_accurate_model)


Correct! You earned 1.0/1.0 points. You are a star!

Your submission has been successfully recorded in the gradebook.


In [18]:
#@markdown What else can you do to improve the accuracy score?
comparison = "extra_trees" #@param {type:"string"}
grader.grade(test_case_id = 'test_imp_acc', answer = comparison)


Correct! You earned 1.0/1.0 points. You are a star!

Your submission has been successfully recorded in the gradebook.
