# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [1]:
import azureml.core
from azureml.core import Experiment, Workspace, Dataset, Datastore
from azureml.train.automl import AutoMLConfig

import time
import logging


from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice
from azureml.core.webservice import Webservice
from azureml.core.model import Model
from azureml.core.environment import Environment

from azureml.automl.core.shared import constants


## Dataset

### Overview
TODO: In this markdown cell, give an overview of the dataset you are using. Also mention the task you will be performing.

This dataset is part of the CNNpred data from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/CNNpred%3A+CNN-based+stock+market+prediction+using+a+diverse+set+of+variables

The original study used not just the Dow but also the Nasdaq and other indexes in an attempt to build a convolutional neural network for cross-market prediction.
In this study I set a slightly less ambitious goal and tried to predict only the movement of the Dow.

I pre-processed the data a little bit by creating a new binary indicator feature that measured whether the stock index would go up or down the following trading day.
This indicator was then used as the label I tried to predict.
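
To make the label definition concrete, here is a minimal sketch on made-up closing prices (the numbers are purely illustrative and not taken from the dataset). It uses the same `diff(periods=-1)` idea as the preprocessing code in this notebook: the difference is today's close minus the next day's close, so a negative value means the index goes up.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices for four consecutive trading days
close = pd.Series([10.0, 12.0, 11.0, 11.0])

# close[t] - close[t+1]; the last day has no "next day", so it is NaN
close_diff = close.diff(periods=-1)

# 1 if the next close is higher than today's, otherwise 0
label = np.where(close_diff < 0, 1, 0)

print(close_diff.tolist())
print(label.tolist())
```

Two details are worth noticing in this sketch: a day followed by an unchanged close gets label 0, and so does the final row, because a comparison with NaN evaluates to False.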


The task is to predict the overall movement of the stock market, as measured by the Dow. The dataset contains 82 pre-calculated features, the full list of
which can be found in the appendix of this paper: https://arxiv.org/pdf/1810.08923.pdf. In summary, these features include things like  
* Relative change of volume
* 10-day Exponential Moving Average
* Relative change of oil price (Brent)
* Relative change in the US dollar to Japanese yen exchange rate

and so on, for each of the days. In total, the dataset contains data for 1984 days.
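
As a rough illustration of what such features look like, the sketch below computes a relative change and a 10-day exponential moving average with pandas on made-up prices. The exact definitions used for the dataset are given in the paper's appendix; the formulas here (simple percentage change, and an EMA with smoothing factor 2/(span+1)) are common conventions and only an assumption on my part.

```python
import pandas as pd

# Hypothetical closing prices, not taken from the dataset
close = pd.Series([100.0, 110.0, 99.0, 104.0])

# Relative change: (x[t] - x[t-1]) / x[t-1]; the first value is NaN
rel_change = close.pct_change()

# 10-day EMA, recursive form: ema[t] = (1 - a) * ema[t-1] + a * x[t], with a = 2 / (10 + 1)
ema_10 = close.ewm(span=10, adjust=False).mean()

print(rel_change.round(4).tolist())
print(ema_10.round(4).tolist())
```

Both quantities only look backwards in time, which is what makes them usable as inputs for next-day prediction.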

One important thing to keep in mind when doing prediction is not to accidentally "look into the future". To avoid using future
data, I made sure that even the pre-processing steps only used data from the past. That is, when scaling the
features, I did not use statistics of the whole series. Instead, I calculated the minimum and maximum from the first
month of data (20 trading days) and used them to min-max scale all the data (the 20 reference rows are dropped afterwards):

```
normalized_df=(df3-df3.iloc[0:20].min())/(df3.iloc[0:20].max()-df3.iloc[0:20].min())
```
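
The effect of scaling with a fixed reference window can be seen on a toy example. The sketch below uses a hypothetical helper, `scale_with_reference_window`, and a made-up column with a 3-row window instead of 20 rows; it is meant as an illustration of the expression above, not as a replacement for it.

```python
import pandas as pd


def scale_with_reference_window(frame, n_reference_rows):
    """Min-max scale every column using only the first n_reference_rows rows."""
    reference = frame.iloc[:n_reference_rows]
    lowest = reference.min()
    value_range = reference.max() - lowest
    return (frame - lowest) / value_range


# Hypothetical feature; the reference window covers the values 1.0 to 3.0
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 6.0]})
scaled = scale_with_reference_window(toy, n_reference_rows=3)

print(scaled["x"].tolist())
```

Rows after the reference window can fall outside the range 0 to 1, which is expected with this approach. A column that is constant within the reference window would lead to a division by zero, so such columns would need separate handling.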

Similarly, when imputing missing features, I did not use statistics such as the median of all values of the feature (which would include
the future), but a padding strategy where missing data points are filled with the last previously seen value. That way the
same approach could be used in a real-life scenario where only the past is known.
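
The difference between the two imputation strategies is easy to see on a small made-up series. This is only a sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical feature with gaps, ordered by time
feature = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Padding (forward fill): each gap takes the last value seen before it
padded = feature.ffill()

# Median fill: the median (2.5) depends on the later value 4.0,
# so the early gaps are filled using information from the future
median_filled = feature.fillna(feature.median())

print(padded.tolist())
print(median_filled.tolist())
```

Note that padding cannot fill a gap at the very beginning of the series, since there is no earlier value to carry forward. Such rows stay missing and have to be dropped separately.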

TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [2]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'uda3experiment'

experiment=Experiment(ws, experiment_name)

import pandas as pd
import numpy as np
data = pd.read_csv('data/Processed_DJI.csv')
# calculate the difference to next date
df = pd.concat([data,data[['Close']].diff(periods=-1).rename({'Close':'Close Diff'}, axis=1)], axis=1)
# create label 1 if stock goes up, or 0 if it goes down
df['y']= np.where(df['Close Diff'] < 0, 1, 0)
# Use padding (forward fill) to fill NA: this way I am not using future data for the missing data handling
df2 = df.ffill()
# Remove Close and Close Diff, as we are now only interested in the binary indicator of up/down.
# Also remove the EMAs that are calculated from many past values: these would generate
# several NaN values at the beginning of the data, as EMA_200 can only be calculated once we have
# observed 200 days. We only have about 2000 data points, so having 200 of them with NaN in a feature
# is not worth it.
# Also remove Date, as it seems like a variable that, even if it is predictive, would not generalize
# to the future.
df3 = df2.drop(columns=['EMA_50', 'EMA_200', 'Close Diff', 'Close', 'Date', 'Name'])
# and then remove any data points with missing features
df3 = df3.dropna()
# Min-max scale using only the first 20 rows as the reference window
# (the scaling is also applied to 'y', which is why it is cast back to int below)
normalized_df = (df3 - df3.iloc[0:20].min()) / (df3.iloc[0:20].max() - df3.iloc[0:20].min())
# Drop the 20 reference rows
normalized_df = normalized_df.drop(normalized_df.index[0:20])
normalized_df['y'] = normalized_df['y'].astype(int)
# Note: the row index is written to the CSV as an unnamed first column
normalized_df.to_csv('data/normalized_data.csv')

I now have the dataset as a pandas dataframe. Next, I upload it to the default datastore and register it as a Tabular dataset.

In [3]:
datastore = ws.get_default_datastore()


# upload the local file from src_dir to the target_path in datastore
datastore.upload(src_dir='data', target_path='data')
# create a dataset referencing the cloud location
dataset = Dataset.Tabular.from_delimited_files(datastore.path('data/normalized_data.csv'))

Uploading an estimated of 2 files
Target already exists. Skipping upload for data/normalized_data.csv
Target already exists. Skipping upload for data/Processed_DJI.csv
Uploaded 0 files


In [4]:
dji_ds = dataset.register(workspace=ws,
                          name='dji_ds',
                          description='DJI closing prices')

In [5]:
train_ds, test_ds = dji_ds.random_split(percentage=0.7, seed=42)

## For the AutoML we need a training cluster:

In [6]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "cheap-compute"

try:
    # Check for existing compute target
    training_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If it doesn't already exist, create it
    try:
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2', max_nodes=2)
        training_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
        training_cluster.wait_for_completion(show_output=True)
    except Exception as ex:
        print(ex)

Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## AutoML Configuration

TODO: Explain why you chose the automl settings and configuration you used below.

I chose AUC_weighted as the primary metric, as it is suitable for a classification task.
Other settings, such as the 20-minute experiment timeout and the 10-minute iteration timeout, were picked as values that felt sensible rather than tuned.
The task is classification because I try to distinguish between situations where the Dow goes
up and where it goes down; this label is stored in the column 'y', which is set as the label_column_name.
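
For intuition about the metric, the sketch below computes a class-weighted AUC in plain Python on made-up labels and scores. It assumes that AUC_weighted is the per-class AUC averaged with weights equal to the number of true instances of each class; the function names `binary_auc` and `weighted_auc` are hypothetical, and this is not how AutoML computes the metric internally.

```python
def binary_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive is scored higher; ties count as 0.5."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def weighted_auc(labels, prob_up):
    """Average the AUC of class 'up' and class 'down', weighted by class counts."""
    n_up = sum(labels)
    n_down = len(labels) - n_up
    auc_up = binary_auc(labels, prob_up)
    # For class 'down', flip the labels and use the complementary probability
    auc_down = binary_auc([1 - l for l in labels], [1.0 - p for p in prob_up])
    return (n_up * auc_up + n_down * auc_down) / len(labels)


# Hypothetical labels (1 = up) and predicted probabilities of 'up'
labels = [1, 1, 0, 0]
prob_up = [0.8, 0.4, 0.5, 0.1]

print(weighted_auc(labels, prob_up))
```

In a two-class problem with complementary probabilities, both per-class AUC values coincide, so the weighted average equals the ordinary AUC. A value of 0.5 corresponds to a ranking that is no better than random.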

In [7]:
# TODO: Put your automl settings here
automl_settings = {
    "name": 'project 3 automl experiment_{0}'.format(time.time()),
    "experiment_timeout_minutes": 20,
    "enable_early_stopping": True,
    "iteration_timeout_minutes": 10,
    "n_cross_validations": 5,
    "primary_metric": 'AUC_weighted',
    "max_concurrent_iterations": 10,
}


# TODO: Put your automl config here
automl_config = AutoMLConfig(compute_target=training_cluster,
                             task='classification',
                             training_data=train_ds,
                             validation_data=test_ds,
                             label_column_name='y',
                             featurization='auto',
                             **automl_settings)

In [8]:
# TODO: Submit your experiment
remote_run = experiment.submit(automl_config)

Running on remote.


## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [9]:
from azureml.widgets import RunDetails
RunDetails(remote_run).show()


## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [25]:
best_run, fitted_model = remote_run.get_output()

Package:azureml-automl-runtime, training version:1.23.0, current version:1.22.0
Package:azureml-core, training version:1.23.0, current version:1.22.0
Package:azureml-dataprep, training version:2.10.1, current version:2.9.1
Package:azureml-dataprep-native, training version:30.0.0, current version:29.0.0
Package:azureml-dataprep-rslex, training version:1.8.0, current version:1.7.0
Package:azureml-dataset-runtime, training version:1.23.0, current version:1.22.0
Package:azureml-defaults, training version:1.23.0, current version:1.22.0
Package:azureml-interpret, training version:1.23.0, current version:1.22.0
Package:azureml-mlflow, training version:1.23.0, current version:1.22.0
Package:azureml-pipeline-core, training version:1.23.0, current version:1.22.0
Package:azureml-telemetry, training version:1.23.0, current version:1.22.0
Package:azureml-train-automl-client, training version:1.23.0, current version:1.22.0
Package:azureml-train-automl-runtime, training version:1.23.0, current versio

In [31]:
print(best_run)

Run(Experiment: uda3experiment,
Id: AutoML_d8a958f3-4fb5-4989-baa2-51e6a2a6d1ea_38,
Type: azureml.scriptrun,
Status: Completed)


In [32]:
print(best_run.properties['score'])

0.5681762754853185


In [33]:
print(fitted_model)

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                                  l1_ratio=0.7959183673469387,
                                                                                                  learning_rate='constant',
                                                                                                  loss='log',
                               

In [34]:
fitted_model.steps[1][1].estimators

[('18',
  Pipeline(memory=None,
           steps=[('maxabsscaler', MaxAbsScaler(copy=True)),
                  ('randomforestclassifier',
                   RandomForestClassifier(bootstrap=False, ccp_alpha=0.0,
                                          class_weight='balanced',
                                          criterion='gini', max_depth=None,
                                          max_features=0.05, max_leaf_nodes=None,
                                          max_samples=None,
                                          min_impurity_decrease=0.0,
                                          min_impurity_split=None,
                                          min_samples_leaf=0.06157894736842105,
                                          min_samples_split=0.056842105263157895,
                                          min_weight_fraction_leaf=0.0,
                                          n_estimators=50, n_jobs=1,
                                          oob_score=False, rando

In [43]:
best_run.properties['run_properties']

"classification_labels=None,\n                              estimators=[('18',\n                                           Pipeline(memory=None,\n                                                    steps=[('maxabsscaler',\n                                                            MaxAbsScaler(copy=True"

In [36]:
#TODO: Save the best model
model_name = best_run.properties["model_name"]

registered_model = remote_run.register_model(model_name=model_name, description="DJI AutoML")
best_model = Model(ws, model_name)
best_model.download(exist_ok=True)

'model.pkl'

## Model Deployment

Remember you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [42]:
best_run.download_file("outputs/scoring_file_v_1_0_0.py", "inference/score.py")

best_run.download_file(constants.CONDA_ENV_FILE_PATH, "myenv.yml")
env = Environment.from_conda_specification(name="myenv", file_path="myenv.yml")

inference_config = InferenceConfig(entry_script="inference/score.py", environment=env)
aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1, description="DJI Prediction", enable_app_insights= True)
service = Model.deploy(ws, "automl-dji", [registered_model], inference_config, aciconfig, overwrite=True)
service.wait_for_deployment(True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running...........................................
Succeeded
ACI service creation operation finished, operation "Succeeded"


TODO: In the cell below, send a request to the web service you deployed to test it.

In [38]:
import json

test_df = test_ds.to_pandas_dataframe()
data = {}
data["data"] = json.loads(test_df.iloc[101:104,:].drop('y', axis=1).to_json(orient="records"))
data

{'data': [{'Column1': 408,
   'Volume': 0.1305078756,
   'mom': 1.0507590541,
   'mom1': 0.8714380537,
   'mom2': 1.5286026927,
   'mom3': -0.4686418116,
   'ROC_5': 1.407135642,
   'ROC_10': -0.1696156822,
   'ROC_15': -0.20950716,
   'ROC_20': -0.0598636721,
   'EMA_10': 6.0394001868,
   'EMA_20': 6.8606916156,
   'DTB4WK': -0.5,
   'DTB3': -1.75,
   'DTB6': -2.0,
   'DGS5': -4.96,
   'DGS10': -6.1904761905,
   'Oil': 0.919204956,
   'Gold': 0.5119415377,
   'DAAA': -2.8928571429,
   'DBAA': -2.5666666667,
   'GBP': 0.9696076544,
   'JPY': 0.6580117137,
   'CAD': 0.0583643028,
   'CNY': -2.5106086133,
   'AAPL': 0.924610071,
   'AMZN': 0.5322905941,
   'GE': 0.9361519103,
   'JNJ': 1.1725930792,
   'JPM': 0.929344597,
   'MSFT': 1.0304254556,
   'WFC': 1.2185841617,
   'XOM': 1.0838482703,
   'FCHI': 0.7828818606,
   'FTSE': 0.6620325256,
   'GDAXI': 0.6343438635,
   'GSPC': 1.0769996618,
   'HSI': 1.1447951281,
   'IXIC': 1.0668425413,
   'SSEC': 0.7489837306,
   'RUT': 1.1387563068

In [45]:
import requests
import json

# URL for the web service (the deployed service object also exposes it as service.scoring_uri)
scoring_uri = "http://021291c5-3542-4a9b-869d-c41ff7554579.westeurope.azurecontainer.io/score"
# Create a test dataset, 3 rows from the test set:
# Converting the pandas dataframe into the correct shape got a bit convoluted... maybe there are easier ways
data = {}
data["data"] = json.loads(test_df.iloc[101:104,:].drop('y', axis=1).to_json(orient="records"))
input_data = str.encode(json.dumps(data))

# Set the content type
headers = {'Content-Type': 'application/json'}

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print(resp.text)

"{\"result\": [1.0, 1.0, 1.0]}"

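The response printed above is a JSON string whose content is itself JSON, so it has to be decoded twice before the predictions can be used. Below is a minimal sketch with a hypothetical helper, `decode_scoring_response`; the response text is copied from the output above. Whether every deployment double-encodes its response is an assumption, so the helper checks the type after the first pass.

```python
import json


def decode_scoring_response(text):
    """Decode the response body; decode a second time if the first pass yields a string."""
    decoded = json.loads(text)
    if isinstance(decoded, str):
        decoded = json.loads(decoded)
    return decoded


# Response body as printed by the request cell
resp_text = r'"{\"result\": [1.0, 1.0, 1.0]}"'

payload = decode_scoring_response(resp_text)
predictions = [int(p) for p in payload["result"]]

print(predictions)
```

Here all three test rows are predicted as class 1, that is, the Dow going up on the following trading day.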

TODO: In the cell below, print the logs of the web service and delete the service

In [46]:
service.get_logs()



In [None]:

service.delete()
training_cluster.delete()

In [None]:
# above message because I already deleted the cl