Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Introduction to labeled datasets

Labeled datasets are the output of Azure Machine Learning [labeling projects](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-create-labeling-projects). They capture references to the data (e.g. image files) and the corresponding labels.

This tutorial introduces the capabilities of labeled datasets and shows how to use them in training.

Learn how to:

> * Set up your development environment
> * Explore labeled datasets
> * Train a simple deep learning neural network

## Prerequisites
* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning
* Go through Azure Machine Learning [labeling projects](https://docs.microsoft.com/azure/machine-learning/service/how-to-create-labeling-projects) and export the labels as an Azure Machine Learning dataset
* Go through the [configuration notebook](../../../configuration.ipynb) to:
    * install the latest version of azureml-sdk
    * install the latest version of azureml-contrib-dataset
    * create a workspace and its configuration file (`config.json`)

## Set up

In [None]:
from PIL import Image
import numpy as np
import cv2
import keras
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, Dropout

In [None]:
import os
import azureml.core
import azureml.contrib.dataset
from azureml.core import Dataset, Workspace, Experiment
from azureml.contrib.dataset import FileHandlingOption

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)
print("Azure ML Contrib Version: ", azureml.contrib.dataset.VERSION)

### Connect to workspace

Create a workspace object from the existing workspace. `Workspace.from_config()` reads the file **config.json** and loads the details into an object named `workspace`.

In [None]:
# load workspace
workspace = Workspace.from_config()
print('Workspace name: ' + workspace.name, 
      'Azure region: ' + workspace.location, 
      'Subscription id: ' + workspace.subscription_id, 
      'Resource group: ' + workspace.resource_group, sep='\n')

### Create experiment

Create an experiment to track the runs in your workspace.

In [None]:
# create an ML experiment
experiment = Experiment(workspace=workspace, name='labeled-datasets')

## Explore labeled datasets

**Note**: This tutorial does not cover how to create labeled datasets. To create one, go through [labeling projects](https://docs.microsoft.com/azure/machine-learning/service/how-to-create-labeling-projects) and export the output labels as an Azure Machine Learning dataset.

`malaria_labels` used in this tutorial section is the output from a labeling project, with the task type of "Object Identification".

In [None]:
# get the malaria_labels dataset from the workspace
malaria_labels = Dataset.get_by_name(workspace, 'malaria_20200928_173440')

### Labeled dataset to pandas
You can load labeled datasets into a pandas DataFrame. There are three file handling options for loading the data files referenced by a labeled dataset:
* Streaming: The default option to load data files.
* Download: Download your data files to a local path.
* Mount: Mount your data files to a mount point. Mount only works for Linux-based compute, including Azure Machine Learning notebook VM and Azure Machine Learning Compute.

In [None]:
malaria_pd = malaria_labels.to_pandas_dataframe(file_handling_option=FileHandlingOption.MOUNT)
malaria_pd

In [None]:
#malaria_pd = malaria_labels.to_pandas_dataframe(file_handling_option=FileHandlingOption.DOWNLOAD, target_path='./download/', overwrite_download=True)
#malaria_pd

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# read the first image from its mounted (or downloaded) path
img = mpimg.imread(malaria_pd.loc[0,'image_url'])
imgplot = plt.imshow(img)

### Labeled dataset to torchvision
You can also load labeled datasets into [torchvision datasets](https://pytorch.org/docs/stable/torchvision/datasets.html), so that you can use the open-source libraries provided by PyTorch for image transformation and training.

In [None]:
from torchvision.transforms import functional as F

# load the malaria_labels dataset into a torchvision dataset
pytorch_dataset = malaria_labels.to_torchvision()
img = pytorch_dataset[0][0]
print(type(img))

# use methods from torchvision to transform the img into grayscale
pil_image = F.to_pil_image(img)
gray_image = F.to_grayscale(pil_image, num_output_channels=3)

imgplot = plt.imshow(gray_image)

## Data preparation

We build two lists, `data` and `labels`. Each entry in `data` is an image converted to an array of RGB values and resized to 50x50 pixels. Each entry in `labels` is the class of that image, encoded as 1 for "Malaria" and 0 otherwise.

In [None]:
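data = []
labels = []
for row in malaria_pd.itertuples():
    # cv2.imread returns None instead of raising when a file cannot be read
    image = cv2.imread(row.image_url)
    if image is None:
        print("Skipping unreadable image:", row.image_url)
        continue
    # OpenCV loads pixels in BGR order; convert to RGB before handing them to PIL
    image_from_array = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    size_image = image_from_array.resize((50, 50))
    data.append(np.array(size_image))
    # encode the label as 1 for "Malaria" and 0 for everything else
    labels.append(1 if row.label == "Malaria" else 0)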

In [None]:
Cells=np.array(data)
labels=np.array(labels)

In [None]:
s=np.arange(Cells.shape[0])
np.random.shuffle(s)
Cells=Cells[s]
labels=labels[s]
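
Shuffling an index array, rather than the arrays themselves, lets the same permutation be applied to both `Cells` and `labels`, so each image stays paired with its label. A minimal sketch on toy data (`toy_cells` and `toy_labels` are made up for illustration and are not part of the tutorial's dataset):

```python
import numpy as np

# Hypothetical toy data: five one-pixel "images"; each label equals its pixel value
toy_cells = np.array([[10], [11], [12], [13], [14]])
toy_labels = np.array([10, 11, 12, 13, 14])

# Shuffle a single index array, then apply it to both arrays
rng = np.random.default_rng(seed=0)  # seeded only to make the sketch repeatable
order = rng.permutation(len(toy_cells))
shuffled_cells = toy_cells[order]
shuffled_labels = toy_labels[order]

# Every image is still paired with its own label after shuffling
print(np.array_equal(shuffled_cells[:, 0], shuffled_labels))
```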

In [None]:
num_classes=len(np.unique(labels))
len_data=len(Cells)

### Label encoding
The problem has two classes, so the output layer of the neural network has two neurons, one per class. One-hot encoding converts each integer label into a binary vector with a single 1 at the index of its class. For example, with two classes, label 0 becomes [1 0] and label 1 becomes [0 1]; with four classes, label 2 would become [0 0 1 0].

In [None]:
labels = keras.utils.to_categorical(labels, num_classes)
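
To see what one-hot encoding produces, the following sketch reproduces it with plain NumPy on a made-up label vector. `one_hot` is a hypothetical helper written for illustration; it is not the Keras implementation, although for integer labels it is expected to give the same values as `keras.utils.to_categorical`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype="float32")[labels]

toy_labels = [0, 1, 1, 0]  # made-up labels: 1 = "Malaria", 0 = anything else
encoded = one_hot(toy_labels, num_classes=2)
print(encoded)
```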

### Split data
Split the data and labels prepared in the previous section into training and test sets: the first 10% of the shuffled samples are held out for testing and the remaining 90% are used for training. The classes are the unique labels in the data. There are two here: "Malaria" is mapped to 1 and every other label is mapped to 0.

In [None]:
# hold out the first 10% of the shuffled images for testing
x_train, x_test = Cells[int(0.1 * len_data):], Cells[:int(0.1 * len_data)]
# normalize the 8-bit pixel values to the range [0, 1] by dividing by 255
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
train_len = len(x_train)
test_len = len(x_test)

In [None]:
# split the labels at the same index so they stay aligned with the images
y_train, y_test = labels[int(0.1 * len_data):], labels[:int(0.1 * len_data)]
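
The slicing used for this split amounts to holding out the first 10% of the shuffled samples for testing and training on the rest. A small sketch of the same idea on made-up arrays (`split_holdout` and `toy_images` are illustrative names, not part of the tutorial):

```python
import numpy as np

def split_holdout(array, test_fraction=0.1):
    """Split off the first `test_fraction` of rows as the test set."""
    n_test = int(test_fraction * len(array))
    return array[n_test:], array[:n_test]

# Made-up data: 20 tiny 2x2 RGB "images" with 8-bit pixel values
toy_images = np.full((20, 2, 2, 3), 255, dtype=np.uint8)
toy_train, toy_test = split_holdout(toy_images)

# Scale the 8-bit pixel values from [0, 255] into [0, 1]
toy_train = toy_train.astype("float32") / 255
toy_test = toy_test.astype("float32") / 255
print(toy_train.shape, toy_test.shape)
```

Because the same slice boundary is used for images and labels, applying `split_holdout` to both arrays would keep them aligned.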

In [None]:
# the labels were already one-hot encoded before the split, so no further encoding is needed here
#y_train=keras.utils.to_categorical(y_train,num_classes)
#y_test=keras.utils.to_categorical(y_test,num_classes)

## Train an image classification model

### Create a sequential model
The hidden layers use the ReLU activation, max(0, z). You can also try tanh, sigmoid, or Leaky ReLU to compare how different activation functions perform. The output layer uses a softmax activation rather than sigmoid because it has one neuron per class. For an output vector z, softmax computes e^(z_i) / sum_j e^(z_j), so the outputs can be read as class probabilities that sum to 1.
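
For reference, both activation functions are simple enough to write in a few lines of NumPy. This is only an illustrative sketch of the math; Keras uses its own backend implementations, and `relu` and `softmax` below are local helper names.

```python
import numpy as np

def relu(z):
    """ReLU: max(0, z), applied element-wise."""
    return np.maximum(0, z)

def softmax(z):
    """Softmax over the last axis: exp(z_i) / sum_j exp(z_j)."""
    # Subtracting the row maximum leaves the result unchanged but avoids overflow
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

scores = np.array([[2.0, -1.0], [0.0, 0.0]])  # made-up pre-activation values
print(relu(scores))
print(softmax(scores))  # each row sums to 1
```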

In [None]:

model=Sequential()
model.add(Conv2D(filters=16,kernel_size=2,padding="same",activation="relu",input_shape=(50,50,3)))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=32,kernel_size=2,padding="same",activation="relu"))
model.add(MaxPooling2D(pool_size=2))
model.add(Conv2D(filters=64,kernel_size=2,padding="same",activation="relu"))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(500,activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(2,activation="softmax"))  # 2 output neurons, one per class
model.summary()


In [None]:
# compile the model with binary cross-entropy loss and the Adam optimizer; you can also try RMSprop or SGD with momentum
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
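
A note on the loss choice: with one-hot labels and a two-neuron softmax output, binary cross-entropy works out to the same number as categorical cross-entropy, assuming the per-output binary losses are averaged. The sketch below checks this numerically with NumPy on made-up predictions; the helper functions are written for illustration and are not Keras code.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred):
    """Element-wise binary cross-entropy, averaged over the output neurons."""
    per_output = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return per_output.mean(axis=-1)

def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy between one-hot labels and predicted class probabilities."""
    return -(y_true * np.log(y_pred)).sum(axis=-1)

# Made-up one-hot labels and softmax outputs (each row sums to 1)
y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
y_pred = np.array([[0.8, 0.2], [0.3, 0.7]])

bce = binary_crossentropy(y_true, y_pred)
cce = categorical_crossentropy(y_true, y_pred)
print(np.allclose(bce, cce))
```

The equality relies on the two softmax outputs summing to 1; it does not carry over to problems with more than two classes.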

### Fit model

In [None]:
# fit the model with a mini-batch size of 20 (batch size is commonly tuned over powers of 2)
history = model.fit(x_train,y_train,batch_size=20,epochs=20,verbose=1, validation_data=(x_test, y_test))

### Log Experiment info

In [None]:
run = experiment.start_logging()

#### Check the accuracy on Test data

In [None]:
accuracy = model.evaluate(x_test, y_test, verbose=1)
print('\n', 'Test_Accuracy:-', accuracy[1])
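
`model.evaluate` returns the loss followed by the metrics passed to `compile`, which is why the accuracy is read from index 1. The accuracy itself can be reproduced by comparing the arg-max of the predicted probabilities with the arg-max of the one-hot labels. A minimal sketch on made-up predictions, standing in for the output of `model.predict(x_test)`:

```python
import numpy as np

# Made-up softmax outputs and one-hot ground truth for four test images
toy_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
toy_truth = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])

predicted_class = toy_probs.argmax(axis=1)  # most probable class per image
true_class = toy_truth.argmax(axis=1)       # undo the one-hot encoding
toy_accuracy = float((predicted_class == true_class).mean())
print("accuracy:", toy_accuracy)
```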

In [None]:
run.log("Accuracy", accuracy[1], description='Test_Accuracy')

#### Log model summary

In [None]:
from contextlib import redirect_stdout
model_summary_filename = 'modelsummary.txt'

# write the printed model summary to a text file, then attach it to the run
with open(model_summary_filename, 'w') as f:
    with redirect_stdout(f):
        model.summary()

run.upload_file(model_summary_filename, model_summary_filename)

#### Analyze model Accuracy and Loss 

In [None]:
plt.figure(figsize=(12,8))
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model Accuracy vs Loss')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.xlabel('Epoch')
plt.ylabel('Acc/Loss')
plt.legend(['Acc Train','Acc Validation', 'Loss Train','Loss Validation'],loc='upper left')
plt.plot()
run.log_image('Accuracy_Loss', path=None, plot=plt, description='Model Accuracy vs Loss')

#### Save the model

In [None]:
# save the full model (architecture and weights) in HDF5 format
model_filename = 'cells.h5'
model.save(model_filename)
run.upload_file(model_filename, model_filename)

In [None]:
run.complete()