# DataPreparing

The first step in our notebooks is to run a DataPreparing script.  
It contains all the code needed to transform our original sequences into data that is ready for AI training.  

To benefit from the perks of our Azure cloud service, we will be creating a new dataset to store our processed sequences.

## Setup

Our virtual machine might not have all packages installed yet, so let's install them first.  
We can use cell magic for this, which lets us stay inside this notebook and simply execute the cells.  

Later on, these cells might not be necessary anymore, which is why we include them at the top. During other builds, you can just ignore them.

As a best practice, let's pin each package to a version we know works. This is a great way to organise our AI projects: with the versions pinned like this, an unexpected new release cannot break our code!
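
One way to check such pins is sketched below with the standard library only. The package name and version in `PINNED` are hypothetical placeholders, not the versions this notebook was built with, and the helpers `requirement_lines` and `find_mismatches` are made up for illustration.

```python
from importlib import metadata

# Hypothetical pin for illustration; replace with the versions you have tested.
PINNED = {"numpy": "1.23.5"}

def requirement_lines(pins):
    """Turn {'numpy': '1.23.5'} into ['numpy==1.23.5'], ready for a requirements file."""
    return [f"{name}=={version}" for name, version in sorted(pins.items())]

def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (wanted, installed)} for every package that deviates from its pin.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches

print(requirement_lines(PINNED))
print(find_mismatches(PINNED))
```

In a notebook, each line from `requirement_lines` could be passed to the `%pip install` magic, so that the installed version matches the pin.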

In [4]:
# This cell can be used to fill in some values that you will be referring to in the coming cells
train_test_split_factor = 0.20

In [5]:
# Importing the default packages for data processing and visualisation
import numpy as np # Used to process our sequences in a data-format
from shutil import rmtree

import os,math
from glob import glob

import warnings
warnings.filterwarnings("ignore") # Warnings that can be ignored will be ignored

import random
SEED = 42 # Every time you want to randomize items, call `random.seed(SEED)` first. This way, you always get the same randomization as I do.
random.seed(SEED)
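
To see what the seed buys us, here is a minimal sketch of a reproducible train/test split. The helper `split_items` is hypothetical and not part of this notebook; it only illustrates how `SEED` and `train_test_split_factor` could work together.

```python
import random

SEED = 42
train_test_split_factor = 0.20  # fraction of the items that goes to the test set

def split_items(items, test_fraction, seed):
    """Shuffle a copy of `items` with a fixed seed and split it into (train, test)."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_items(range(10), train_test_split_factor, SEED)
train_again, test_again = split_items(range(10), train_test_split_factor, SEED)

print(len(train), len(test))
print(test == test_again)  # the same seed gives the same split
```

Because the seed is fixed, anyone who reruns the split gets the same training and test items, which makes results comparable between runs.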

In [6]:
# Import AzureML packages
from azureml.core import Workspace
from azureml.core import Dataset
from azureml.data.datapath import DataPath
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

## Step 1: Connecting to the Azure ML Workspace

The Azure SDK connects to Azure Machine Learning through a Workspace object, which holds all the information inside this 'laboratory'.

The information below should reflect your own Azure setup. If you followed my instructions on HackMD, you should have a resource group called '04_AzureML' and a workspace called 'segersnathan'.
The subscription ID, however, is generated by Azure itself.

Luckily, this ML studio gives us a quick way to find this information.
Click on the \/-arrow in the upper-right corner over there ↗️, next to your profile picture.

Most of your information is in there as well, but you still can't find your subscription_**id** there ...

Choose the 'Download config' option, and you'll get a file with this information:

```json
{
    "subscription_id": "7c50f9c3-289b-4ae0-a075-08784b3b9042",
    "resource_group": "NathanReserve",
    "workspace_name": "segersnathan"
}
```

Which gives you exactly the information you need 🥰

You can also use this configuration file directly. See the documentation for how to do it: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace(class)?view=azure-ml-py
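
As a rough sketch of that option: the SDK's `Workspace.from_config()` looks for a downloaded `config.json` and builds the workspace from it. The snippet below shows only the file-reading part with the standard library, so it runs without Azure. The helper `load_workspace_config` and the placeholder values are hypothetical.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")

def load_workspace_config(path):
    """Read a downloaded config.json and return its three workspace settings."""
    with open(path) as config_file:
        config = json.load(config_file)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config file is missing: {', '.join(missing)}")
    return {key: config[key] for key in REQUIRED_KEYS}

# Demo with a throwaway file that holds placeholder values
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "config.json")
    with open(demo_path, "w") as demo_file:
        json.dump({"subscription_id": "<your-subscription-id>",
                   "resource_group": "<your-resource-group>",
                   "workspace_name": "<your-workspace>"}, demo_file)
    settings = load_workspace_config(demo_path)

print(settings["workspace_name"])
```

With the real file next to the notebook, calling `Workspace.from_config()` should give you the same workspace without typing the three values by hand.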


In [7]:
## Read each value from an environment variable, or use the fallback value given as the second parameter.
## For now, fill in the fallback values. Later on, we will work with environment variables, so we're already preparing for that here!
workspace_name = os.environ.get('WORKSPACE', 'hermans-cedric-ml')
subscription_id = os.environ.get('SUBSCRIPTION_ID', 'REDACTED')
resource_group = os.environ.get('RESOURCE_GROUP', '04-AzureML')

In [8]:
ws = Workspace.get(name=workspace_name,
               subscription_id=subscription_id,
               resource_group=resource_group)

## Step 2 -- Data preparing

The initial dataset contains two types of sequences: polyketide synthases (PKS) and non-ribosomal peptide synthetases (NRPS).

### Step 2.1 -- Checking our data

Let us first explore what the data looks like. We'll create two subdirectories under a data directory, one for each gene.
If you want to extend this to more genes later, simply adapt the `GENES` list. Because PKS contains about 16000 sequences, and the webpage kept crashing when uploading the data, we have added some code to read the two types of sequences in different ways. The results are the same, but the PKS sequences were concatenated into one file, while the NRPS sequences are separate files.
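
Both layouts are plain FASTA, so one reader can handle them. Below is a minimal sketch, assuming each record starts with a `>` header line followed by one or more sequence lines. `parse_fasta` is a hypothetical helper, not the code this notebook uses further on, and the two sample strings are toy data.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

single = ">seq1 example\nMKV\nLAA\n"       # one record per file, like NRPS
multi = ">seq1\nMKV\n>seq2\nGG\nHH\n"      # many records in one file, like PKS

print(parse_fasta(single))
print(parse_fasta(multi))
```

Treating every file as a list of records means a one-record file and a concatenated file go through the same code path.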

In [9]:
GENES = ['NRPS', 'PKS']

We will need to create temporary directories to store the sequences while we process them.
This script will create a `data` folder, and then make a subdirectory for each gene.

In [11]:
data_folder = os.path.join(os.getcwd(), 'data')
os.makedirs(data_folder, exist_ok=True)
for gene_name in GENES:
    os.makedirs(os.path.join(data_folder, 'genes', gene_name), exist_ok=True)

In [12]:
# Get all the datasets that were registered in the UI
# We can then easily select the ones we need
datasets = Dataset.get_all(workspace=ws) # Make sure to give our workspace with it
print(datasets)

{ 'NRPS': DatasetRegistration(id='cd0e7d4c-fd07-4538-8df1-5344ac913e2b', name='NRPS', version=1, description='', tags={}),
  'PKS': DatasetRegistration(id='019c4c3d-8a82-4260-98ce-2e3047dc8170', name='PKS', version=1, description='', tags={}),
  'animals-testing-set': DatasetRegistration(id='b8ec1966-c9da-4ca9-b07d-e41ee35f5e07', name='animals-testing-set', version=1, description='The Animal Images to test, resized tot 64, 64', tags={'animals': 'cats,dogs,pandas', 'AI-Model': 'CNN', 'Split size': '0.2', 'type': 'testing'}),
  'animals-training-set': DatasetRegistration(id='02ab0351-03ab-4e0e-b42f-620fa183bd4d', name='animals-training-set', version=1, description='The Animal Images to train, resized tot 64, 64', tags={'animals': 'cats,dogs,pandas', 'AI-Model': 'CNN', 'Split size': '0.8', 'type': 'training'}),
  'cats': DatasetRegistration(id='db847d56-0389-43e8-93aa-6cd3df7507c7', name='cats', version=1, description='', tags={}),
  'dogs': DatasetRegistration(id='1f40c908-e568-4434-a620

Let's check that all the datasets we need are registered.

In [13]:
# Check that every gene in GENES has a registered dataset
nomissing=True
for gene in GENES:
    if gene in datasets.keys():
        continue
    nomissing=False
    print("Missing dataset %s"%gene)
if nomissing:
    print("All datasets present!")

All datasets present!


### Step 2.2 Processing and uploading the processed sequences

We need to process our sequences so we can use them in our model, which is just a plain neural network with 20 inputs (one count for each of the 20 standard amino acids).
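
To make the model input concrete, here is a small sketch of the feature vector: one count per letter in `AMINO_ACIDS`, always in the same order. The function name `composition` and the toy sequence are made up for illustration; the letter order matches the list used in the processing code.

```python
AMINO_ACIDS = ["A", "R", "N", "D", "C", "E", "Q", "G", "H", "I",
               "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"]

def composition(sequence):
    """Count how often each amino acid letter occurs in `sequence`."""
    return [sequence.count(aa) for aa in AMINO_ACIDS]

features = composition("MKVAAR")  # toy sequence, not real data
print(dict(zip(AMINO_ACIDS, features)))
```

Letters that are not in the list, such as `X` for an unknown residue, are simply not counted, so every sequence maps to a vector of the same length regardless of how long it is.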

In [14]:
# Let's create a mount point. Think of it like your D:/ drive on your PC
mount_path = os.path.join(os.getcwd(), 'mount')
os.makedirs(mount_path, exist_ok=True)

In [33]:
def mountProcessSequences(gene_name):
    # Define a path to store the processed sequences. We'll choose `data/processed/genes` this time.
    processed_path = os.path.join(os.getcwd(), 'data', 'processed', 'genes')
    os.makedirs(processed_path, exist_ok=True)

    # The mount context loads the dataset into our mount directory.
    # Get the dataset for this gene, then mount it to the directory.
    mounted_context = datasets[gene_name].mount(mount_path)
    print('Starting the Mount context, to get all the original sequences.')
    mounted_context.start()

    aminoacids=["A","R","N","D","C","E","Q","G","H","I","L","K","M","F","P","S","T","W","Y","V"]
    X = []  # one row of amino acid counts per sequence
    y = []  # one-hot label per sequence: [1,0] for NRPS, [0,1] for PKS
    try:
        # Get all the sequence paths with the `glob()` method.
        print(f'Processing all sequences for {gene_name} ...')
        sequencePaths = glob(f"{mount_path}/*.fasta")
        print(len(sequencePaths))
        if gene_name == 'NRPS':
            # One sequence per file: skip the header line and join the rest
            for seq in sequencePaths:
                with open(seq) as ifile:
                    text = ifile.readlines()
                sequence = ''.join([l.strip() for l in text[1:]])
                X.append([sequence.count(aa) for aa in aminoacids])
                y.append([1,0])
        elif gene_name == 'PKS':
            # All sequences in one file: split into records on the '>' header marker
            for seq in sequencePaths:
                print(seq)
                with open(seq) as ifile:
                    text = ifile.read().split(">")[1:]
                for s in text:
                    # Skip the header line of each record, so its letters are not counted
                    sequence = ''.join([l.strip() for l in s.splitlines()[1:]])
                    X.append([sequence.count(aa) for aa in aminoacids])
                    y.append([0,1])
    finally:
        # Stop the context now, even if processing failed.
        mounted_context.stop()
        print("... Context stopped and freed.")

    print(len(X))
    np.savetxt(os.path.join(processed_path, f"{gene_name}_X.np"), np.array(X))
    np.savetxt(os.path.join(processed_path, f"{gene_name}_y.np"), np.array(y))
    print('... Done')

def uploadGeneSequences():
    processed_path = os.path.join(os.getcwd(), 'data', 'processed', 'genes')
    # Upload the directory as a new dataset
    print('Uploading directory now ...')
    processed_dataset = Dataset.File.upload_directory(
                        # The source directory on our machine where the processed sequences are
                        src_dir = processed_path,
                        # Create a DataPath reference for where to store our sequences. We'll use the default datastore of our workspace.
                        target = DataPath(datastore=ws.get_default_datastore(), path_on_datastore='processed_genes'),
                        overwrite=True)
    # Make sure to register the dataset once everything is uploaded.
    processed_dataset.register(ws,
                            name='processed_genes',
                            description='Gene sequences processed to amino acid counts',
                            tags={'genes': str(GENES), 'AI-Model': 'NN'}, # Optional tags, can always be interesting to keep track of these!
                            create_new_version=True)


In [35]:
%%time
# Process all the gene sequences now.
# We'll use Cell magic once more, to time how long this takes!
mountProcessSequences('NRPS')
mountProcessSequences('PKS')
uploadGeneSequences()

Uploading directory now ...
Validating arguments.
Arguments validated.
Uploading file to processed_genes
Uploading an estimated of 6 files
Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/compute-ch/code/Users/cedric.hermans2/AzureML-assignment/data/processed/genes/.amlignore
Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/compute-ch/code/Users/cedric.hermans2/AzureML-assignment/data/processed/genes/.amlignore, 1 files out of an estimated total of 6
Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/compute-ch/code/Users/cedric.hermans2/AzureML-assignment/data/processed/genes/.amlignore.amltmp
Uploaded /mnt/batch/tasks/shared/LS_root/mounts/clusters/compute-ch/code/Users/cedric.hermans2/AzureML-assignment/data/processed/genes/.amlignore.amltmp, 2 files out of an estimated total of 6
Uploading /mnt/batch/tasks/shared/LS_root/mounts/clusters/compute-ch/code/Users/cedric.hermans2/AzureML-assignment/data/processed/genes/NRPS_X.np
Uploaded /mnt/batch/tasks/shared/LS

Our processed sequences are now registered as a dataset in Azure.