# Wind Turbine: Azure ML with scikit-learn

In this notebook, we'll build and analyze a new model to predict wind turbine wake winds.

It is important to consider the two main conditions that influence the presence of wind wake:
1. The overall wind farm direction and the turbine wind direction are both between 40° and 45°.
1. A large difference (greater than one minute) between `TurbineSpeedStdDev` and `WindSpeedStdDev`.

These conditions are well-known indicators that `Wind Wake` is affecting the wind turbine.
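
As a rough illustration, the two conditions above can be written as a simple rule over a single reading. This is only a sketch: the 40° to 45° range comes from the list above, but the `farm_direction` field name and the `stddev_gap_threshold` value are assumptions made for the example, not values taken from the lab dataset.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    farm_direction: float          # hypothetical field: overall wind farm direction, degrees
    turbine_wind_direction: float  # corresponds to TurbineWindDirection, degrees
    turbine_speed_std_dev: float   # corresponds to TurbineSpeedStdDev
    wind_speed_std_dev: float      # corresponds to WindSpeedStdDev


def looks_like_wake(reading, low=40.0, high=45.0, stddev_gap_threshold=1.0):
    """Return True when both wake conditions hold for one reading."""
    directions_in_range = (
        low <= reading.farm_direction <= high
        and low <= reading.turbine_wind_direction <= high
    )
    # stddev_gap_threshold is an illustrative value, not one from the lab
    large_gap = (
        reading.turbine_speed_std_dev - reading.wind_speed_std_dev
    ) > stddev_gap_threshold
    return directions_in_range and large_gap


print(looks_like_wake(Reading(42.0, 43.5, 3.2, 0.4)))   # both conditions hold
print(looks_like_wake(Reading(42.0, 120.0, 3.2, 0.4)))  # turbine direction out of range
```

The model trained later in this notebook learns a decision boundary from the data instead of relying on hand-picked thresholds like these.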

# Instructions

Before you begin with this lab, please make sure to follow the steps below:
1. Locate the default datastore for the workspace. You can do this by authenticating against the workspace (cell #2) and executing the following command: `ws.get_default_datastore()`
1. Locate the dataset parquet file in the lab materials: `TrainingDataset.parquet`
1. Upload the dataset for this lab to the default datastore for the workspace. You can do this via Azure Portal or via Microsoft Azure Storage Explorer.
1. Ensure you have the correct versions of `scikit-learn` and `joblib` installed. To install these dependencies, execute the cell below. Skip this step if the dependencies are already installed.
1. Restart your kernel.

In [None]:
!pip install scikit-learn==0.22.1 joblib==0.14.1
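
After restarting the kernel, it can help to confirm that the pinned versions are the ones actually installed. The helper below is a minimal sketch, not part of the lab materials: `find_version_mismatches` is a hypothetical name, and it relies on `importlib.metadata`, which assumes Python 3.8 or later.

```python
from importlib.metadata import PackageNotFoundError, version


def find_version_mismatches(expected):
    """Map each mismatching package to (expected, installed-or-None)."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


# Versions pinned by the install cell above
print(find_version_mismatches({"scikit-learn": "0.22.1", "joblib": "0.14.1"}))
```

An empty dictionary means both packages match the pinned versions.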

## Setup Azure ML

In the next cell, we will connect to the Workspace identified by `<subscription_id>`, `<resource_group_name>`, and `<workspace_name>`, and write its settings to a local config file. The cell will prompt you for authentication. Please click on the link and input the provided details.

For more information on **Workspace**, please visit: [Microsoft Workspace Documentation](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py)

`<subscription_id>` = You can get this ID from the landing page of your Resource Group.

`<resource_group_name>` = This is the name of your Resource Group.

`<workspace_name>` = This is the name of your Workspace.

`<tenant_id>` = This is the ID of the Azure Active Directory tenant you sign in with.

In [None]:
from azureml.core.workspace import Workspace
from azureml.core.authentication import InteractiveLoginAuthentication
project_folder = './scripts'

try:    
    interactive_auth = InteractiveLoginAuthentication(tenant_id="<tenant_id>")
    # Get instance of the Workspace and write it to config file
    ws = Workspace(
        subscription_id = '<subscription_id>', 
        resource_group = '<resource_group_name>', 
        workspace_name = '<workspace_name>',
        auth = interactive_auth)

    # Writes workspace config file
    ws.write_config()
    
    print('Library configuration succeeded')
except Exception as e:
    print(e)
    print('Workspace not found')
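
Because the cell above catches every exception and prints `Workspace not found`, a forgotten placeholder is easy to miss. A small check like the following, run before the cell, can flag values that were never filled in. This is a sketch only: `unfilled_placeholders` is a hypothetical helper and the filled-in values are made up.

```python
import re

# Matches values that are still a bare '<...>' placeholder
PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def unfilled_placeholders(settings):
    """Return the names of settings that still hold a '<...>' placeholder."""
    return [name for name, value in settings.items() if PLACEHOLDER.match(value)]


settings = {
    "tenant_id": "<tenant_id>",
    "subscription_id": "<subscription_id>",
    "resource_group": "my-resource-group",  # hypothetical example value
    "workspace_name": "my-workspace",       # hypothetical example value
}
print(unfilled_placeholders(settings))
```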

## Fetch our data

Let's retrieve our dataset from the default workspace Datastore.

In [None]:
from azureml.core import Dataset
from azureml.data.datapath import DataPath
from azureml.core import Datastore

datastore = ws.get_default_datastore()

datastore_path = [DataPath(datastore, '*.parquet')]

tabular = Dataset.Tabular.from_parquet_files(path=datastore_path)
tabular = tabular.register(workspace=ws, 
                           name='wind_turbine_training', 
                           create_new_version=True)
tabular = Dataset.get_by_name(ws, name='wind_turbine_training')
print(tabular.version)
data = tabular.to_pandas_dataframe()
data.head(5)

Next, we'll take a subset of our data and then proceed to visualize it to better understand any patterns and trends that might exist to drive good ML models.

In [None]:
subset = tabular.take_sample(probability=0.4, seed=123).to_pandas_dataframe()
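
The `probability=0.4` argument means the subset holds roughly, not exactly, 40% of the rows. The sketch below illustrates the idea in plain Python, assuming each row is kept independently with the given probability. It is an illustration of probabilistic sampling, not the Azure ML implementation.

```python
import random


def bernoulli_sample(rows, probability, seed):
    """Keep each row independently with the given probability."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return [row for row in rows if rng.random() < probability]


rows = list(range(10_000))  # stand-in for dataset rows
sample = bernoulli_sample(rows, probability=0.4, seed=123)
print(len(sample) / len(rows))  # close to 0.4, but not exactly 0.4
```

Fixing the seed keeps the charts in the following sections reproducible across runs.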

## Dataset Description

Describe our current dataset. The table below shows the different statistical values for our training subset.

In [None]:
subset.describe()

## Turbine Wind Direction

Let's take a look at the Turbine Wind Direction distribution against the Wind Direction Angle. As we can see, the distribution changes considerably between 40° and 50°.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

hstyle={"rwidth":0.75,'edgecolor':'black'}

# Analyze distribution of TurbineWindDirection in the dataset
fig, ax = plt.subplots()
sns.distplot(subset[['TurbineWindDirection']], ax=ax, 
             hist_kws=hstyle).set_title("Turbine Wind Direction Distribution")
ax.set_xlim(0,360)
ax.set(xlabel="Wind Direction Angle")
plt.show()

## Turbine Wind Direction vs Alter Blades

Let's take a look at how our training dataset behaves for `Alter Blades` against the `Wind Direction Angle`. Between 40° and 50°, there is a clear spike of `True` values for `Alter Blades`. Keep in mind that the target column for our prediction is `Alter Blades`; this column enables us to identify a wake condition.

In [None]:
g = sns.FacetGrid(subset, col='AlterBlades')
g.map(sns.distplot, 'TurbineWindDirection', hist_kws=hstyle)
g.set(xlabel="Wind Direction Angle")

## Turbine Speed

Let's take a look at the Turbine Speed distribution. In the chart, we can observe the distribution has values between 10 and 25 km/h.

In [None]:
fig, ax = plt.subplots()
sns.distplot(subset[['TurbineSpeedAverage']], ax=ax, 
             hist_kws=hstyle).set_title("Average Turbine Speed Distribution")
ax.set(xlabel="Average Turbine Speed")
plt.show()

## Turbine Speed Standard Deviation vs Alter Blades

Let's take a look at how our training dataset behaves for `Alter Blades` against the `Turbine Speed Standard Deviation`. 

In [None]:
# Compare the TurbineSpeedStdDev distribution for each AlterBlades value
g = sns.FacetGrid(subset, col='AlterBlades')
g.map(sns.distplot, 'TurbineSpeedStdDev', hist_kws=hstyle)
g.set(xlabel="Turbine Speed Std Dev")

## Wind Speed

Let's take a look at the Wind Speed distribution. In the chart, we can observe the distribution has values between 10 and 25 km/h.

In [None]:
fig, ax = plt.subplots()
sns.distplot(subset[['WindSpeedAverage']], ax=ax, 
             hist_kws=hstyle).set_title("Average Wind Speed Distribution")
ax.set(xlabel="Average Wind Speed")
plt.show()

## Wind Speed Standard Deviation vs Alter Blades

Let's take a look at how our training dataset behaves for `Alter Blades` against the `Wind Speed Standard Deviation`.

In [None]:
# Compare the WindSpeedStdDev distribution for each AlterBlades value
g = sns.FacetGrid(subset, col='AlterBlades')
g.map(sns.distplot, 'WindSpeedStdDev', hist_kws=hstyle)
g.set(xlabel="Wind Speed Std Dev")

## Isolate AlterBlades rows true values

Let's create a Facet Grid to understand how the rows where `Alter Blades` is `True` are distributed across other features in the dataset, such as:
1. Turbine Speed Standard Deviation
1. Turbine Wind Direction
1. Wind Speed Standard Deviation

As we can see, a `Turbine Wind Direction` of around 40° to 45° is a very good indication of an `Alter Blades: True` value. We can also see that a high `Turbine Speed Standard Deviation` combined with a low `Wind Speed Standard Deviation` is another key indicator of a `True` value in the `Alter Blades` column.

In [None]:
alterBlades = subset.loc[subset.AlterBlades]
g = sns.FacetGrid(alterBlades, col='AlterBlades')
g.map(plt.hist, 'TurbineSpeedStdDev')
g.set(xlabel="Turbine Speed Std Dev")

display(g)

g = sns.FacetGrid(alterBlades, col='AlterBlades')
g.map(plt.hist, 'TurbineWindDirection')
g.set(xlabel="Turbine Wind Direction Angle")

display(g)

g = sns.FacetGrid(alterBlades, col='AlterBlades')
g.map(plt.hist, 'WindSpeedStdDev')
g.set(xlabel="Wind Speed Std Dev")

display(g)

## Pairplot Wind Speed Std Dev, Turbine Speed Std Dev and Alter Blades

Let's place our key features in a Pair plot to analyze their trends.

In [None]:
# Analyze how WindSpeedStdDev and TurbineSpeedStdDev relate to each other, colored by AlterBlades
sns.pairplot(subset, vars=['WindSpeedStdDev', 'TurbineSpeedStdDev'], hue='AlterBlades')

## Create experiment

In our script, there are three distinct sections:
1. Setting up the scikit-learn logistic regression model pipeline (including encoding our features).
1. Analyzing and logging the results of the model training.
1. Running the model explainer to understand the key model drivers.

In [None]:
%%writefile $project_folder/train.py

from azureml.core import Run

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

from utils import *

# Fetch current run
run = Run.get_context()
    
# Fetch dataset from the run by name
dataset = run.input_datasets['training']

# Convert the dataset to pandas and split it into train and test sets
X_train, X_test, y_train, y_test = split_dataset(dataset)

# Setup scikit-learn pipeline
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, list(X_train.columns.values))])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

model = clf.fit(X_train, y_train)

# Analyze model performance
analyze_model(clf, X_test, y_test)

# Save model
model_id = save_model(clf)
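
The `StandardScaler` step in the script rescales each numeric column to zero mean and unit variance before the logistic regression sees it, so that features measured on different scales contribute comparably. The following is a minimal pure-Python sketch of that transformation for a single column, using made-up values. The real script applies it through scikit-learn to every column of `X_train`.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Scale values to zero mean and unit variance (population std dev)."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # Constant column: only center it, leave the scale unchanged
        return [value - mean for value in values]
    return [(value - mean) / std for value in values]


wind_speed_std_dev = [0.5, 1.0, 1.5, 2.0]  # made-up sample values
print(standardize(wind_speed_std_dev))
```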

## Create a Workspace Experiment

The `Experiment` constructor creates an experiment instance. It takes the current workspace, which is fetched by calling `Workspace.from_config()`, and an experiment name.

For more information on **Experiment**, please visit: [Microsoft Experiment Documentation](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment.experiment?view=azure-ml-py)

In [None]:
from azureml.core.experiment import Experiment

# Get an instance of the Workspace from the config file
ws = Workspace.from_config()

experiment_name = 'wake-detection-experiment'

# Create Experiment
experiment = Experiment(ws, experiment_name)

## Create Automated ML Compute cluster

First, check whether the cluster already exists. If it does, we can reuse it. To check, call the `ComputeTarget()` constructor with the current workspace and the name of the cluster.

If the cluster does not exist, the next step is to define a configuration for the new AML cluster by calling `AmlCompute.provisioning_configuration()`. It takes as parameters the VM size and the maximum number of nodes that the cluster can scale up to. Once the configuration object has been created, call `ComputeTarget.create()` with the workspace object, the cluster name, and the configuration object.

For more information on **ComputeTarget**, please visit: [Microsoft ComputeTarget Documentation](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.computetarget?view=azure-ml-py)

For more information on **AmlCompute**, please visit: [Microsoft AmlCompute Documentation](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.compute.akscompute?view=azure-ml-py)


**Note:** Please wait for the execution of the cell to finish before moving forward.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Create AML CPU Compute Cluster
try:
    compute_target = ComputeTarget(workspace=ws, name='cpucluster')
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_DS12_v2',
                                                           max_nodes=4)

    compute_target = ComputeTarget.create(ws, 'cpucluster', compute_config)
    compute_target.wait_for_completion(show_output=True)
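
The cell above follows a get-or-create pattern: try to look the cluster up, and create it only when the lookup reports that it does not exist. The sketch below shows the same control flow with in-memory stand-ins so it can run without a workspace. `NotFoundError`, `get_or_create`, and the `clusters` dictionary are all hypothetical names for illustration.

```python
class NotFoundError(Exception):
    """Stand-in for ComputeTargetException in this sketch."""


def get_or_create(lookup, create):
    """Return (resource, created): reuse an existing resource or create one."""
    try:
        return lookup(), False
    except NotFoundError:
        return create(), True


# In-memory stand-in for the workspace's compute targets
clusters = {}


def lookup_cluster():
    if "cpucluster" not in clusters:
        raise NotFoundError("cpucluster")
    return clusters["cpucluster"]


def create_cluster():
    clusters["cpucluster"] = {"vm_size": "Standard_DS12_v2", "max_nodes": 4}
    return clusters["cpucluster"]


print(get_or_create(lookup_cluster, create_cluster))  # first call creates
print(get_or_create(lookup_cluster, create_cluster))  # second call reuses
```

Catching only the not-found exception means that other failures, such as an authentication error, surface as errors instead of silently triggering a new cluster.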

## Submit Experiment

We'll use remote compute for this job. We need to install a couple of extra libraries, including those required for model interpretability.

The `experiment.submit()` function is called to send the experiment for execution. The only parameter received by this function is the `Estimator` object.

In [None]:
from azureml.train.sklearn import SKLearn

estimator = SKLearn(source_directory=project_folder,
                    compute_target=compute_target,
                    entry_script='train.py',
                    inputs=[tabular.as_named_input('training')],
                    pip_packages=['azureml-dataprep[fuse,pandas]','joblib==0.14.1','azureml-interpret','azureml-contrib-interpret','matplotlib','scikit-learn==0.22.1','seaborn'])

run = experiment.submit(estimator)
run

## Monitor Experiment

The `Run` object lets us observe the experiment’s progress and results. It is created by calling the `Run()` constructor with the experiment and the identifier of the run to fetch. The `RunDetails` widget then retrieves the progress, metrics, and tasks for that run and displays them when its `show()` method is called.

**Note:** Please wait for the execution of the cell to finish before moving forward. (Status should be **Completed**)

In [None]:
from azureml.core import Run
from azureml.widgets import RunDetails

run = Run(experiment, run.id)
RunDetails(run).show()
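
When the widget is not available (for example, outside a notebook front end), the run can also be watched by polling its status. The helper below is a sketch under assumptions: `wait_until_finished` is a hypothetical name, the terminal state names are assumed to be `Completed`, `Failed`, and `Canceled`, and the demo uses a fake status sequence instead of a real run. In the notebook, `run.get_status` could be passed as `get_status`.

```python
import time

# Assumed terminal states for a run
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_until_finished(get_status, poll_seconds=30.0, timeout_seconds=3600.0,
                        sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or the timeout passes."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Run still in state {status!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Demo with a fake sequence of statuses instead of a real run
fake_statuses = iter(["Queued", "Running", "Running", "Completed"])
print(wait_until_finished(lambda: next(fake_statuses), poll_seconds=0.0))
```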

## Encode dataset and download trained model

The first step is to encode our training data into the shape expected by the ONNX converter. Next, download the trained model produced by the run by calling the `download_model()` function.

In [None]:
from utils import *
from scripts.utils import *

# Convert the dataset to pandas and split it into train and test sets
X_train, X_test, y_train, y_test = split_dataset(tabular)
model = download_model(run)

## Convert model to ONNX format

Export the scikit-learn model to ONNX format by using `skl2onnx`. This step outputs an ONNX model that we can publish to the Azure SQL Edge Database Instance and use with our `PREDICT` statement.

In [None]:
import skl2onnx
import onnxmltools

# Convert the scikit model to onnx format
onnx_model = skl2onnx.convert_sklearn(model, 'Wind Turbine Dataset', convert_dataframe_schema(X_train))
# Save the onnx model locally
onnx_model_path = 'windturbinewake.model.onnx'
onnxmltools.utils.save_model(onnx_model, onnx_model_path)
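
The `convert_dataframe_schema` helper comes from the lab's `utils` module and is not shown in this notebook. A common approach for pipelines that take a DataFrame is to declare one ONNX input per column, with a tensor type chosen from the column's dtype. The sketch below illustrates that mapping without any dependencies: it returns type names as strings, which correspond to classes in `skl2onnx.common.data_types`. The function name, the dtype mapping, and the sample dtypes are assumptions, not the lab's actual implementation.

```python
def schema_from_dtypes(dtypes):
    """Build (column, tensor_type_name, shape) entries, one per input column."""
    schema = []
    for column, dtype in dtypes.items():
        if dtype == "int64":
            tensor_type = "Int64TensorType"
        elif dtype == "float64":
            tensor_type = "FloatTensorType"
        else:
            tensor_type = "StringTensorType"
        # Shape [None, 1]: any number of rows, one value per row
        schema.append((column, tensor_type, [None, 1]))
    return schema


# Made-up dtypes for columns named in this notebook
example_dtypes = {
    "TurbineWindDirection": "float64",
    "TurbineSpeedStdDev": "float64",
    "WindSpeedStdDev": "float64",
}
for entry in schema_from_dtypes(example_dtypes):
    print(entry)
```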

## Save model to Azure Blob Storage

Let's save our ONNX model to the default workspace Datastore.

In [None]:
datastore.upload_files(files=[onnx_model_path],
                       overwrite=True,
                       show_progress=True)