## Dataset
Regression task on the 'Titanic' dataset.
Source: https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/fa71405126017e6a37bea592440b4bee94bf7b9e/titanic.csv
Predict survival ('Survived') from passenger data.

## AutoML with the UI
Step by step
* Resource group -> Create
* Machine Learning -> create
* ml.azure.com
* Manage -> Compute
* create compute instance CPU Standard_DS3_v2
* create compute clusters CPU Standard_DS3_v2
* Assets -> Datasets -> create dataset from web
* Column headers: use headers from the first file
* Automated ML -> New automated ML run
* New experiment name
* Type Regression
* Done: the results can now be inspected

## Jupyter Notebook

### Azure ML Workspace

In [1]:
import azureml.core
print("SDK version:", azureml.core.VERSION)

SDK version: 1.19.0


In [3]:
from azureml.core import Workspace, Dataset
ws = Workspace.from_config()

### Load data

In [4]:
aml_dataset = ws.datasets['titanic']

full_df = aml_dataset.to_pandas_dataframe()

full_df.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
full_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


### Split dataset into train and test

In [6]:
train_dataset, test_dataset = aml_dataset.random_split(0.9, seed=1)

train_dataset_df = train_dataset.to_pandas_dataframe()
test_dataset_df = test_dataset.to_pandas_dataframe()

print(train_dataset_df.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   796.000000  796.000000  796.000000  635.000000  796.000000   
mean    443.673367    0.374372    2.312814   29.855118    0.518844   
std     258.717152    0.484265    0.835614   14.724893    1.102722   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     214.750000    0.000000    2.000000   20.000000    0.000000   
50%     445.500000    0.000000    3.000000   29.000000    0.000000   
75%     666.250000    1.000000    3.000000   39.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  796.000000  796.000000  
mean     0.373116   31.958856  
std      0.794782   49.798224  
min      0.000000    0.000000  
25%      0.000000    7.895800  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  
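
The training frame above has 796 rows, not exactly 90% of 891 (about 802). That is consistent with `random_split` producing an approximate split rather than an exact one. The sketch below illustrates one way such a split can behave: each row is sent to the first part with probability `percentage`. It is a plain-Python illustration under that assumption, not Azure ML's implementation, and `approximate_random_split` is a hypothetical helper name.

```python
import random


def approximate_random_split(rows, percentage, seed):
    """Send each row to the first part with probability `percentage`.

    Illustration only: the resulting sizes are close to, but usually not
    exactly, percentage * len(rows).
    """
    rng = random.Random(seed)  # seeded generator makes the split reproducible
    first, second = [], []
    for row in rows:
        if rng.random() < percentage:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(891))  # stand-in for the 891 passenger records
train_rows, test_rows = approximate_random_split(rows, 0.9, seed=1)
print(len(train_rows), len(test_rows))
```

Re-running with the same seed returns the same two parts, which is why the notebook passes `seed=1`.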


### List remote AML compute targets available

In [7]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

ComputeTarget.list(ws)

[AmlCompute(workspace=Workspace.create(name='izabela', subscription_id='2c4a7b68-f24b-4160-8590-5bde3308718d', resource_group='autoaml_iza'), name=cpu-cluster, id=/subscriptions/2c4a7b68-f24b-4160-8590-5bde3308718d/resourceGroups/autoaml_iza/providers/Microsoft.MachineLearningServices/workspaces/izabela/computes/cpu-cluster, type=AmlCompute, provisioning_state=Succeeded, location=northeurope, tags=None),
 {
   "id": "/subscriptions/2c4a7b68-f24b-4160-8590-5bde3308718d/resourceGroups/autoaml_iza/providers/Microsoft.MachineLearningServices/workspaces/izabela/computes/cpu-instances",
   "name": "cpu-instances",
   "location": "northeurope",
   "tags": null,
   "properties": {
     "description": null,
     "computeType": "ComputeInstance",
     "computeLocation": "northeurope",
     "resourceId": null,
     "provisioningErrors": null,
     "provisioningState": "Succeeded",
     "properties": {
       "vmSize": "STANDARD_DS3_V2",
       "applications": [
         {
           "displayName"

### Connect to Remote AML Compute

In [8]:
amlcompute_cluster_name = "cpu-cluster"

found = False
# Check if this compute target already exists in the workspace.
cts = ws.compute_targets

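if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
    found = True
    print('Found existing training cluster.')
    # Get existing cluster
    aml_remote_compute = cts[amlcompute_cluster_name]
else:
    # Fail early with a clear message instead of a NameError further down
    raise ValueError('Compute cluster "{0}" not found in the workspace.'.format(amlcompute_cluster_name))

print('Checking cluster status...')
aml_remote_compute.wait_for_completion(show_output = True, min_node_count = 0, timeout_in_minutes = 20)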

Found existing training cluster.
Checking cluster status...
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [9]:
aml_remote_compute.get_status().serialize()

{'currentNodeCount': 0,
 'targetNodeCount': 0,
 'nodeStateCounts': {'preparingNodeCount': 0,
  'runningNodeCount': 0,
  'idleNodeCount': 0,
  'unusableNodeCount': 0,
  'leavingNodeCount': 0,
  'preemptedNodeCount': 0},
 'allocationState': 'Steady',
 'allocationStateTransitionTime': '2021-01-25T18:34:46.683000+00:00',
 'errors': [{'error': {'code': 'ClusterCoreQuotaReached',
    'message': 'Operation results in exceeding quota limits of Total Cluster Dedicated Regional vCPUs. Maximum allowed: 6, Current in use: 4, Additional requested: 4. Please contact support to increase the quota for resource type Total Cluster Dedicated Regional vCPUs'}}],
 'creationTime': '2021-01-25T18:34:40.633293+00:00',
 'modifiedTime': '2021-01-25T18:34:56.743759+00:00',
 'provisioningState': 'Succeeded',
 'provisioningStateTransitionTime': None,
 'scaleSettings': {'minNodeCount': 0,
  'maxNodeCount': 1,
  'nodeIdleTimeBeforeScaleDown': 'PT10800S'},
 'vmPriority': 'Dedicated',
 'vmSize': 'STANDARD_DS3_V2'}
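
Two values in the status above are easy to misread: the idle timeout `'PT10800S'` is an ISO 8601 duration, and the `ClusterCoreQuotaReached` error is plain arithmetic on the vCPU numbers in its message. The sketch below decodes both using only the standard library. The helper names are hypothetical, and the duration parser handles only the seconds-only `PT<n>S` form seen here.

```python
import re
from datetime import timedelta


def parse_seconds_duration(value):
    """Parse a seconds-only ISO 8601 duration such as 'PT10800S'."""
    match = re.fullmatch(r"PT(\d+)S", value)
    if match is None:
        raise ValueError("unsupported duration: " + value)
    return timedelta(seconds=int(match.group(1)))


def exceeds_quota(max_allowed, in_use, requested):
    """Return True when the request would push usage over the quota."""
    return in_use + requested > max_allowed


# Values copied from the status output above
idle_timeout = parse_seconds_duration("PT10800S")
print(idle_timeout)  # 3:00:00

# Maximum allowed: 6, Current in use: 4, Additional requested: 4
print(exceeds_quota(max_allowed=6, in_use=4, requested=4))  # True
```

So the cluster scales idle nodes down after three hours, and a new node cannot be allocated until the quota is raised or other vCPUs in the region are released.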

### List and select primary metric to drive the AutoML regression problem

In [10]:
from azureml.train import automl
# Get a list of valid metrics for your given task
automl.utilities.get_primary_metrics('regression')

['r2_score',
 'normalized_mean_absolute_error',
 'normalized_root_mean_squared_error',
 'spearman_correlation']
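
To make the two "normalized" metrics above concrete, here is a minimal plain-Python sketch. It assumes the normalization divides the error by the range of the true values (max minus min), which is how the Azure ML documentation describes these metrics as far as I know. Treat it as an illustration rather than the library's own code; `normalized_errors` is a hypothetical helper.

```python
import math


def normalized_errors(y_true, y_pred):
    """Return (normalized MAE, normalized RMSE), dividing by the range of y_true."""
    n = len(y_true)
    value_range = max(y_true) - min(y_true)  # assumed normalization factor
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return mae / value_range, rmse / value_range


# Made-up labels and predictions, for illustration only
y_true = [0, 1, 1, 0]
y_pred = [0.1, 0.8, 0.6, 0.3]
nmae, nrmse = normalized_errors(y_true, y_pred)
print(nmae, nrmse)
```

For a 0/1 label such as 'Survived' the range is 1, so the normalized and unnormalized errors coincide.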

### Define AutoML Experiment settings

In [11]:
import logging
import os

from azureml.train.automl import AutoMLConfig

project_folder = './automl'
os.makedirs(project_folder, exist_ok=True)

automl_config = AutoMLConfig(compute_target=aml_remote_compute,
                             task='regression',
                             primary_metric='r2_score',
                             experiment_timeout_minutes=15,                            
                             training_data=train_dataset,
                             label_column_name="Survived",
                             n_cross_validations=5,                                                   
                             enable_early_stopping=True,
                             featurization='auto',
                             debug_log='automated_ml_errors.log',
                             verbosity=logging.INFO,
                             path=project_folder
                             )

### Run the AutoML experiment

In [None]:
from azureml.core import Experiment
from datetime import datetime

now = datetime.now()
time_string = now.strftime("%m-%d-%Y-%H")
experiment_name = "classif-automl-remote-{0}".format(time_string)
print(experiment_name)

experiment = Experiment(workspace=ws, name=experiment_name)

import time
start_time = time.time()
            
run = experiment.submit(automl_config, show_output=True)

print('Manual run timing: --- %s seconds needed for running the whole Remote AutoML Experiment ---' % (time.time() - start_time))

classif-automl-remote-01-25-2021-21
Running on remote.
No run_configuration provided, running on cpu-cluster with default configuration
Running on remote compute: cpu-cluster
Parent Run ID: AutoML_91de3862-bedb-4834-b44c-17278a901277



### Explore results with Widget

In [None]:
from azureml.widgets import RunDetails
RunDetails(run).show()

### Measure Parent Run Time needed for the whole AutoML process

In [None]:
import time
from datetime import datetime

run_details = run.get_details()

end_time_utc_str = run_details['endTimeUtc'].split(".")[0]
start_time_utc_str = run_details['startTimeUtc'].split(".")[0]
timestamp_end = time.mktime(datetime.strptime(end_time_utc_str, "%Y-%m-%dT%H:%M:%S").timetuple())
timestamp_start = time.mktime(datetime.strptime(start_time_utc_str, "%Y-%m-%dT%H:%M:%S").timetuple())

parent_run_time = timestamp_end - timestamp_start
print('Run Timing: --- %s seconds needed for running the whole Remote AutoML Experiment ---' % (parent_run_time))
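
An alternative sketch for the same calculation: `time.mktime` interprets the parsed value as local time, so subtracting two `datetime` objects directly is a little more robust. The timestamp strings below are hypothetical sample values in the shape the cell above expects (fractional seconds after a dot); they are not taken from a real run.

```python
from datetime import datetime


def elapsed_seconds(start_utc, end_utc):
    """Seconds between two UTC timestamp strings such as '2021-01-25T20:05:10.123456Z'."""
    fmt = "%Y-%m-%dT%H:%M:%S"

    def parse(ts):
        # Drop fractional seconds and a trailing 'Z', if present
        return datetime.strptime(ts.split(".")[0].rstrip("Z"), fmt)

    return (parse(end_utc) - parse(start_utc)).total_seconds()


# Hypothetical stand-in for run.get_details()
sample_details = {
    "startTimeUtc": "2021-01-25T20:05:10.123456Z",
    "endTimeUtc": "2021-01-25T20:22:40.654321Z",
}
print(elapsed_seconds(sample_details["startTimeUtc"], sample_details["endTimeUtc"]))  # 1050.0
```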

### Retrieve the 'Best Model'

In [None]:
best_run, fitted_model = run.get_output()
print(best_run)
print(fitted_model)

### Make predictions

### Extract X values (feature columns) from the test dataset for predicting

In [None]:
import pandas as pd

# Remove the label column (y) so only feature columns remain
if 'Survived' in test_dataset_df.columns:
    y_test_df = test_dataset_df.pop('Survived')

x_test_df = test_dataset_df

### Make the actual Predictions

In [None]:
# Try the best model
y_predictions = fitted_model.predict(x_test_df)

print('10 predictions: ')
print(y_predictions[:10])

In [None]:
y_predictions.shape

### Calculate the R2 score with Test Dataset

In [None]:
from sklearn.metrics import r2_score

print('R2 score:')
r2_score(y_test_df, y_predictions)
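
For reference, the R2 score compares the model's squared error with that of always predicting the mean of the true values: R2 = 1 - SS_res / SS_tot. The sketch below computes it in plain Python on made-up numbers. It is an illustration of the formula, not scikit-learn's implementation, and it does not handle the edge case where all true values are equal.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # error of the mean baseline
    return 1.0 - ss_res / ss_tot


# Made-up labels and predictions, for illustration only
y_true = [0, 1, 1, 0]
y_pred = [0.1, 0.8, 0.6, 0.3]
print(r2(y_true, y_pred))  # about 0.7
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values mean worse than that baseline.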