# Working with data in Azure ML

### Introduction to Datastores

Definition: a datastore encapsulates the information needed to connect to a data source.

Possible uses:
- bring data into an experiment
- write/store the results of an experiment

##### Types of datastores

Datastores can be created for several types of data source:
- Azure Storage (blob and file containers)
- Azure Data Lake Storage
- Azure SQL Database
- Azure Databricks file system (DBFS)

##### Using datastores

##### Registering datastores

There are two ways to create a datastore:
- 1 With a one-click button
- 2 Via the Python SDK

The code below registers an Azure Storage blob container as a datastore named blob_data:

In [None]:
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

# Register a new datastore
blob_ds = Datastore.register_azure_blob_container(workspace=ws,
    datastore_name='blob_data',  
    container_name='data_container',
    account_name='az_store_acct',
    account_key='123456abcde789…')
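
The registration call above hard-codes the storage account key in the notebook. As a hedged alternative, the sketch below gathers the same arguments from environment variables. The helper `blob_datastore_settings` and the variable names `AZ_STORAGE_ACCOUNT` and `AZ_STORAGE_KEY` are hypothetical, chosen only for illustration.

```python
import os


def blob_datastore_settings(env=os.environ):
    """Collect keyword arguments for Datastore.register_azure_blob_container.

    The environment variable names are hypothetical; use whatever your setup defines.
    """
    return {
        "datastore_name": "blob_data",
        "container_name": "data_container",
        "account_name": env["AZ_STORAGE_ACCOUNT"],  # hypothetical variable name
        "account_key": env["AZ_STORAGE_KEY"],       # hypothetical variable name
    }


# Demo with a fake environment; in a notebook, call the helper with no argument.
settings = blob_datastore_settings({"AZ_STORAGE_ACCOUNT": "az_store_acct",
                                    "AZ_STORAGE_KEY": "not-a-real-key"})
print(sorted(settings))

# With the SDK installed and a workspace loaded, the call would then be:
# Datastore.register_azure_blob_container(workspace=ws, **settings)
```

A missing variable raises a `KeyError`, which surfaces a configuration problem before any call to Azure is made.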

### Managing datastores via Azure ML Studio

List the names of all the datastores in the workspace

In [None]:
for ds_name in ws.datastores:
    print(ds_name)

Get a reference to a datastore

In [None]:
blob_store = Datastore.get(ws, datastore_name='blob_data')


Retrieve the default datastore, named workspaceblobstore, that is created with the workspace.

In [None]:
default_store = ws.get_default_datastore()


Change the default datastore

In [None]:
ws.set_default_datastore('blob_data')


### Using a datastore

1- Upload and download data

You can upload files from your local file system to a datastore so that they are accessible to experiments running in the workspace, regardless of where the experiment script is actually being run.

In [None]:
# Upload a whole local folder to the datastore
blob_ds.upload(src_dir='/files',
               target_path='/data/files',
               overwrite=True, show_progress=True)

# Upload specific files to the workspace's default datastore
default_ds = ws.get_default_datastore()
default_ds.upload_files(files=['./data/diabetes.csv', './data/diabetes2.csv'], # Upload the diabetes csv files in /data
                       target_path='diabetes-data/', # Put it in a folder path in the datastore
                       overwrite=True, # Replace existing files of the same name
                       show_progress=True)

# Download the files stored under a given prefix to a local folder
blob_ds.download(target_path='downloads',
                 prefix='/data',
                 show_progress=True)

Note: uploading one or more files returns a data reference, for example:
$AZUREML_DATAREFERENCE_53837193eee242a9afef31806ecd1277

A data reference provides a way to pass the path to a folder in a datastore to a script, regardless of where the script is being run, so that the script can access data in the datastore location.


The following code gets a reference to the **diabetes-data** folder where you uploaded the diabetes CSV files, and specifically configures the data reference for *download* - in other words, it can be used to download the contents of the folder to the compute context where the data reference is being used. Downloading data works well for small volumes of data that will be processed on local compute. When working with remote compute, you can also configure a data reference to *mount* the datastore location and read data directly from the data source.

To use a datastore in an experiment, you must pass a data reference to the script.
The data reference is configured according to the access mode:

**Download**: The contents of the path associated with the data reference are downloaded to the compute context where the experiment is running.

**Upload**: The files generated by your experiment script are uploaded to the datastore after the run completes.

**Mount**: The path on the datastore is mounted as remote storage in the experiment compute context, enabling the contents to be accessed remotely (note that this mode is only available when the experiment is run on a remote compute target - you cannot use this mode with local compute).
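
The choice between download and mount described above can be summarized as a small decision rule. This is only a sketch of the guidance in the text; the function `choose_access_mode` and its parameters are hypothetical and are not part of the Azure ML SDK.

```python
def choose_access_mode(remote_compute: bool, large_volume: bool) -> str:
    """Return 'download' or 'mount' following the guidance in the text."""
    if not remote_compute:
        # Mount is only available on a remote compute target.
        return "download"
    # On remote compute, mounting avoids copying large volumes of data.
    return "mount" if large_volume else "download"


for remote in (False, True):
    for large in (False, True):
        print(f"remote={remote}, large={large} -> {choose_access_mode(remote, large)}")
```

The returned string corresponds to the `as_download` and `as_mount` methods of a data reference.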

In [None]:
from azureml.train.sklearn import SKLearn

# Data reference to the 'data/files' folder, downloaded to 'training_data' on the compute.
# This is the data reference mentioned earlier.
data_ref = blob_ds.path('data/files').as_download(path_on_compute='training_data')

# Pass the data reference to the script as a parameter
estimator = SKLearn(source_directory='experiment_folder',
                    entry_script='training_script.py',
                    compute_target='local',
                    script_params={'--data_folder': data_ref})

In [None]:
import os
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--data_folder', type=str, dest='data_folder')
args = parser.parse_args()

data_files = os.listdir(args.data_folder) # argparse gives the script quick access to the data reference defined above
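
One detail worth checking when wiring a data reference to a script: the key used in `script_params` must be exactly the flag string declared with `add_argument`, while `dest` only sets the attribute name on `args`. The sketch below runs locally against a temporary folder that stands in for the downloaded data; the helper `list_data_files` and the file names are hypothetical.

```python
import argparse
import os
import tempfile


def list_data_files(argv):
    """Parse a --data-folder argument and list the files in that folder."""
    parser = argparse.ArgumentParser()
    # The flag uses a hyphen; dest stores the value as args.data_folder.
    parser.add_argument('--data-folder', type=str, dest='data_folder')
    args = parser.parse_args(argv)
    return sorted(os.listdir(args.data_folder))


# A temporary folder plays the role of the folder behind the data reference.
with tempfile.TemporaryDirectory() as folder:
    for name in ('a.csv', 'b.csv'):
        open(os.path.join(folder, name), 'w').close()
    files = list_data_files(['--data-folder', folder])

print(files)
```

Passing `--data_folder` (underscore) to this parser would be rejected as an unrecognized argument, so the spelling in the estimator and in the script has to agree.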

Use the data reference in a training script

To use the data reference in a training script, you must define a parameter for it. Run the following two code cells to create:

1. A folder named **diabetes_training_from_datastore**
2. A script that trains a classification model by using the training data in all of the CSV files in the folder referenced by the data reference parameter passed to it.

###### Train a model from a datastore

In [None]:
import os

# Create a folder for the experiment files
experiment_folder = 'diabetes_training_from_datastore' # Folder for the experiment
os.makedirs(experiment_folder, exist_ok=True)
print(experiment_folder, 'folder created.')

In [None]:
%%writefile $experiment_folder/diabetes_training.py
# Import libraries
import os
import argparse
from azureml.core import Run
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Get parameters
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder reference')
args = parser.parse_args()
reg = args.reg_rate

# Get the experiment run context
run = Run.get_context()

# load the diabetes data from the data reference
data_folder = args.data_folder
print("Loading data from", data_folder)
# Load all files and concatenate their contents as a single dataframe
all_files = os.listdir(data_folder)
diabetes = pd.concat((pd.read_csv(os.path.join(data_folder,csv_file)) for csv_file in all_files))


# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate',  float(reg))
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# Calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# Calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

os.makedirs('outputs', exist_ok=True)
# Note: a file saved in the outputs folder is automatically uploaded into the experiment record
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()
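
The training script passes `C=1/reg` to `LogisticRegression`. In scikit-learn, `C` is the inverse of the regularization strength, so a larger `--regularization` value gives a smaller `C` and a more strongly regularized model. The short sketch below only illustrates that arithmetic; the helper name `inverse_regularization` is hypothetical.

```python
def inverse_regularization(reg_rate: float) -> float:
    """Convert a regularization rate into the C value used by the script (C = 1 / reg)."""
    if reg_rate <= 0:
        raise ValueError("regularization rate must be positive")
    return 1.0 / reg_rate


# The script's default rate is 0.01; the estimator in this notebook passes 0.1.
for reg_rate in (0.01, 0.1, 1.0):
    print(f"reg_rate={reg_rate} -> C={inverse_regularization(reg_rate):g}")
```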

In [None]:
from azureml.train.sklearn import SKLearn
from azureml.core import Experiment
from azureml.widgets import RunDetails

# Set up the parameters
script_params = {
    '--regularization': 0.1, # regularization rate
    '--data-folder': data_ref # data reference to download files from datastore
}



# Create an estimator
estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='diabetes_training.py',
                    script_params=script_params,
                    compute_target = 'local'
                   )

# Create an experiment
experiment_name = 'diabetes-training'
experiment = Experiment(workspace = ws, name = experiment_name)

# Run the experiment
run = experiment.submit(config=estimator)
# Show the run details while running
RunDetails(run).show()
run.wait_for_completion()

### Introduction to Datasets

Datasets are versioned packaged data objects that can be easily consumed in experiments and pipelines. Datasets are the recommended way to work with data.

##### Types of datasets

- 1) Tabular = data read as a structured table
- 2) File = unstructured data

##### Creating and registering datasets

**1) Creating datasets**

--> create a dataset from individual files or from relative paths

Note: paths can include wildcards, for example: /files/*.csv
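
To get a feel for what a wildcard such as `/files/*.csv` selects, the sketch below filters a list of made-up paths with Python's `pathlib`. It is a local illustration only: the paths are hypothetical, and Azure ML's own matching rules may differ in detail from `PurePosixPath.match`.

```python
from pathlib import PurePosixPath

# Hypothetical paths as they might appear in a datastore.
stored_paths = [
    'data/files/current_data.csv',
    'data/files/archive/2019.csv',
    'data/files/archive/2020.csv',
    'data/files/archive/readme.txt',
    'data/files/archive/old/2018.csv',
]


def matching(paths, pattern):
    """Return the paths that match the wildcard pattern."""
    return [p for p in paths if PurePosixPath(p).match(pattern)]


archive_csv = matching(stored_paths, 'data/files/archive/*.csv')
print(archive_csv)
```

In this local check, `*` stays within one folder level, so the file under `archive/old/` is not selected.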


**2) Registering the dataset in the workspace**

--> makes it available for use in an experiment or in a data processing pipeline

######  <font color='green'>Creating + registering a tabular dataset</font>

In [None]:
from azureml.core import Dataset

blob_ds = ws.get_default_datastore()
csv_paths = [(blob_ds, 'data/files/current_data.csv'),  # The dataset in this example includes data from two file paths
             (blob_ds, 'data/files/archive/*.csv')]
tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)
tab_ds = tab_ds.register(workspace=ws, name='csv_table')  # After creating the dataset, the code
                                                          # registers it in the workspace under the name csv_table.



###### <font color='green'>Creating + registering a file dataset</font>

In [None]:
from azureml.core import Dataset

blob_ds = ws.get_default_datastore()
file_ds = Dataset.File.from_files(path=(blob_ds, 'data/files/images/*.jpg'))
file_ds = file_ds.register(workspace=ws, name='img_files')

##### Retrieving registered datasets

There are two ways to retrieve a registered dataset:


1- The **datasets** dictionary attribute of a **Workspace** object
2- The **get_by_name** or **get_by_id** method of the **Dataset** class

In [None]:
import azureml.core
from azureml.core import Workspace, Dataset

# Load the workspace from the saved config file
ws = Workspace.from_config()

# Get a dataset from the workspace datasets collection (method 1)
ds1 = ws.datasets['csv_table']

# Get a dataset by name from the Dataset class (method 2)
ds2 = Dataset.get_by_name(ws, 'img_files')

##### Dataset versioning

In [None]:
# Versioning: redefine the dataset with an extra path
img_paths = [(blob_ds, 'data/files/images/*.jpg'),
             (blob_ds, 'data/files/images/*.png')]
file_ds = Dataset.File.from_files(path=img_paths)

file_ds = file_ds.register(workspace=ws, name='img_files', create_new_version=True)  # Register it as a new version


# Retrieve a specific version
img_ds = Dataset.get_by_name(workspace=ws, name='img_files', version=2)  # Version 2
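
The versioning behaviour can be pictured with a toy in-memory registry. This is not how Azure ML is implemented; `ToyDatasetRegistry` is a hypothetical stand-in that only mimics the numbering (first registration is version 1, each `create_new_version=True` adds one) and the default lookup of the latest version. Raising an error on a duplicate name is an assumption made for the sketch, not a description of the SDK.

```python
class ToyDatasetRegistry:
    """In-memory stand-in for a workspace's dataset registry (illustration only)."""

    def __init__(self):
        self._versions = {}  # name -> list of definitions; index 0 holds version 1

    def register(self, name, definition, create_new_version=False):
        versions = self._versions.setdefault(name, [])
        if versions and not create_new_version:
            # Assumption for this sketch: refuse to overwrite silently.
            raise ValueError(f"{name!r} is already registered")
        versions.append(definition)
        return len(versions)  # version number just assigned

    def get_by_name(self, name, version='latest'):
        versions = self._versions[name]
        index = len(versions) if version == 'latest' else version
        return versions[index - 1]


registry = ToyDatasetRegistry()
v1 = registry.register('img_files', ['*.jpg'])
v2 = registry.register('img_files', ['*.jpg', '*.png'], create_new_version=True)

print(v1, v2)
print(registry.get_by_name('img_files'))             # latest definition
print(registry.get_by_name('img_files', version=1))  # original definition
```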

### Using a dataset

##### Working with a dataset directly

In [None]:
df = tab_ds.to_pandas_dataframe()  # tab_ds is the tabular dataset loaded beforehand
# code to work with dataframe goes here

Use the to_path() method to return a list of the file paths encapsulated by the dataset

In [None]:
for file_path in file_ds.to_path():
    print(file_path)

##### Passing a dataset to an experiment script

Including it in an estimator

Note: because the script will need to work with a Dataset object, you must include either the full **azureml-sdk** package or the **azureml-dataprep** package with the **pandas** extra **in the script's compute environment.**

In [None]:
from azureml.train.sklearn import SKLearn

estimator = SKLearn(source_directory='experiment_folder',
                    entry_script='training_script.py',
                    compute_target='local',
                    inputs=[tab_ds.as_named_input('csv_data')],
                    pip_packages=['azureml-dataprep[pandas]'])

Converting the dataset to a pandas dataframe inside the experiment script

In [None]:
from azureml.core import Run

run = Run.get_context()
data = run.input_datasets['csv_data'].to_pandas_dataframe()

When passing a **file dataset**, you must also **specify the access mode**:

In [None]:
from azureml.train.estimator import Estimator

estimator = Estimator(source_directory='experiment_folder',
                      entry_script='training_script.py',
                      compute_target='local',
                      inputs=[img_ds.as_named_input('img_data').as_download(path_on_compute='data')],
                      pip_packages=['azureml-dataprep[pandas]'])

##### Exercise

###### Creating datasets

<font color='green'>1- Creating a tabular dataset from a datastore</font>

In [None]:
from azureml.core import Dataset

# Get the default datastore
default_ds = ws.get_default_datastore()

# Create a TABULAR dataset from the path on the DATASTORE (this may take a short while)
tab_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'diabetes-data/*.csv'))

# Display the first 20 rows as a Pandas dataframe
tab_data_set.take(20).to_pandas_dataframe()

<font color='green'>2- Creating a file dataset from a datastore</font>

In [None]:
# Create a FILE dataset from the path on the DATASTORE (this may take a short while)
file_data_set = Dataset.File.from_files(path=(default_ds, 'diabetes-data/*.csv'))

# Get the files in the dataset
for file_path in file_data_set.to_path():
    print(file_path)

###### Registration

Registering data as datasets makes it more easily accessible to experiments run in the workspace.


Here:
- tabular dataset = **diabetes dataset**
- file dataset = **diabetes file dataset**

In [None]:
# Register the tabular dataset
try:
    tab_data_set = tab_data_set.register(workspace=ws, 
                                        name='diabetes dataset',
                                        description='diabetes data',
                                        tags = {'format':'CSV'},
                                        create_new_version=True)
except Exception as ex:
    print(ex)

# Register the file dataset
try:
    file_data_set = file_data_set.register(workspace=ws,
                                            name='diabetes file dataset',
                                            description='diabetes files',
                                            tags = {'format':'CSV'},
                                            create_new_version=True)
except Exception as ex:
    print(ex)

print('Datasets registered')

#### Viewing datasets

1- via the Azure platform (Datasets section)
2- via the Python SDK, to get the list of datasets:

In [None]:
print("Datasets:")
for dataset_name in list(ws.datasets.keys()):
    dataset = Dataset.get_by_name(ws, dataset_name)
    print("\t", dataset.name, 'version', dataset.version)

To get a specific version of a dataset:

```python
dataset_v1 = Dataset.get_by_name(ws, 'diabetes dataset', version = 1)
```

######  Train a Model from a Tabular Dataset


You can pass datasets to scripts as *inputs* in the estimator being used to run the script.

1- Create the experiment folder

Note: nothing changes here.

In [None]:
import os

# Create a folder for the experiment files
experiment_folder = 'diabetes_training_from_tab_dataset'
os.makedirs(experiment_folder, exist_ok=True)
print(experiment_folder, 'folder created')

2- Create the script

This script uses the dataset, but the dataset is actually passed to it in the next step.

In [None]:
%%writefile $experiment_folder/diabetes_training.py
# Import libraries
import os
import argparse
from azureml.core import Run
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Set regularization hyperparameter (passed as an argument to the script)
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')
args = parser.parse_args()
reg = args.reg_rate

# Get the experiment run context
run = Run.get_context()

# load the diabetes data (passed as an input dataset)
print("Loading Data...")
diabetes = run.input_datasets['diabetes'].to_pandas_dataframe()

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate',  float(reg))
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# Calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# Calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

os.makedirs('outputs', exist_ok=True)
# Note: a file saved in the outputs folder is automatically uploaded into the experiment record
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()

3- Run the experiment with the **inputs** parameter

In [None]:
from azureml.train.sklearn import SKLearn
from azureml.core import Experiment
from azureml.widgets import RunDetails

# Set the script parameters
script_params = {
    '--regularization': 0.1
}

# Get the training dataset
diabetes_ds = ws.datasets.get("diabetes dataset")

# Create an estimator
estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='diabetes_training.py',
                    script_params=script_params,
                    compute_target = 'local',
                    inputs=[diabetes_ds.as_named_input('diabetes')], # Pass the Dataset object as an input...
                    pip_packages=['azureml-dataprep[pandas]'] # ...so you need the dataprep package
                   )


# Create an experiment
experiment_name = 'diabetes-training'
experiment = Experiment(workspace = ws, name = experiment_name)

# Run the experiment
run = experiment.submit(config=estimator)
# Show the run details while running
RunDetails(run).show()
run.wait_for_completion()

###### Train a Model from a File Dataset

Same as before, apart from the data loading step:


```python
import glob

# Load the diabetes dataset
print("Loading Data...")
data_path = run.input_datasets['diabetes'] # Get the training data from the estimator input
all_files = glob.glob(data_path + "/*.csv")
diabetes = pd.concat((pd.read_csv(f) for f in all_files))
```
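
The loading step above can be tried without Azure ML by pointing it at a local folder. The sketch below is an assumption-laden stand-in: a temporary folder plays the role of the downloaded dataset path, the standard `csv` module replaces pandas so that it runs anywhere, and the helper `load_rows`, the file contents, and `notes.txt` are hypothetical.

```python
import csv
import glob
import os
import tempfile


def load_rows(data_path):
    """Read every CSV file in data_path and return all rows as one list of dicts."""
    rows = []
    for csv_file in sorted(glob.glob(os.path.join(data_path, '*.csv'))):
        with open(csv_file, newline='') as handle:
            rows.extend(csv.DictReader(handle))
    return rows


with tempfile.TemporaryDirectory() as folder:
    # Two small CSV files with the same header, plus one file that is not a CSV.
    for name, patient in (('diabetes.csv', '1'), ('diabetes2.csv', '2')):
        with open(os.path.join(folder, name), 'w', newline='') as handle:
            writer = csv.writer(handle)
            writer.writerow(['PatientID', 'Diabetic'])
            writer.writerow([patient, '0'])
    with open(os.path.join(folder, 'notes.txt'), 'w') as handle:
        handle.write('not a CSV file')

    rows = load_rows(folder)

print(len(rows), [row['PatientID'] for row in rows])
```

Filtering on `*.csv` skips stray files in the folder, which a plain `os.listdir` would hand to the CSV reader as well.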

To run the experiment using the file dataset:
For large volumes of data, you'd generally use the **as_mount** method to stream the files directly from the dataset source; but when running on local compute (as we are in this example), you need to use the **as_download** option to download the dataset files to a local folder.

```python
# Get the training dataset
diabetes_ds = ws.datasets.get("diabetes file dataset")

# Create an estimator
estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='diabetes_training.py',
                    script_params=script_params,
                    compute_target = 'local',
                    inputs=[diabetes_ds.as_named_input('diabetes').as_download(path_on_compute='diabetes_data')], # Pass the Dataset object as an input
                    pip_packages=['azureml-dataprep[pandas]'] # so we need the dataprep package
                   )
```