## IDC Classification Trainer

This notebook trains a CaffeNet model to classify Invasive Ductal Carcinoma (IDC). The model is trained on the latest Skylake cluster (c009) on Intel A.I. DevCloud (Colfax Cluster) and will be used with Intel Movidius.

This tutorial is part of the BreastCancerAI Project by Adam Milton-Barker.

# Create DataSorter job

The first step is to write a job script that the A.I. DevCloud can run to sort the data ready for training.

Before you run the following block, make sure you have followed all of the steps at the beginning of the README file in the home directory of this project.

In [22]:
%%writefile IDC-Classifier-Trainer
cd $PBS_O_WORKDIR
echo "* Hello world from compute server `hostname` on the A.I. DevCloud!"
echo "* The current directory is ${PWD}."
echo "* Compute server's CPU model and number of logical CPUs:"
lscpu | grep 'Model name\|^CPU(s)'
echo "* Python available to us:"
export PATH=/glob/intel-python/python3/bin:$PATH;
which python
python --version 
echo "* Starting IDC-Classifier-DataSorter job"
echo "* This job sorts the data for the IDC Classifier"
python Trainer.py
sleep 10
echo "*Adios"
# Remember to have an empty line at the end of the file; otherwise the last command will not run


Writing IDC-Classifier-Trainer
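
The comment at the end of the job script warns that the last command will not run unless the file ends with an empty line. A minimal sketch of how that could be checked from the notebook before submitting is shown here; the helper name `ends_with_newline` is hypothetical and not part of this project.

```python
from pathlib import Path


def ends_with_newline(path):
    """Return True if the file at `path` is non-empty and ends with a newline."""
    data = Path(path).read_bytes()
    return data.endswith(b"\n")
```

Calling `ends_with_newline("IDC-Classifier-Trainer")` after the `%%writefile` cell should return `True` if the script was written with its trailing blank line.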


# Check the job script was created

Now check that the job script was created successfully by executing the following block, which prints the files in the current directory. If all was successful, you should see the file "IDC-Classifier-Trainer". You can also open this file to confirm that its contents are correct.

In [23]:
%ls

[0m[01;34mcomponents[0m/  [01;34mdata[0m/  IDC-Classifier-Trainer  [01;34mmodel[0m/  Trainer.ipynb  Trainer.py


# Submit the job script

Now submit the training job script. This queues the script for execution and returns your job ID. The command sets the walltime to 24 hours, which should give the script enough time to complete without being killed.

In [25]:
!qsub -l walltime=24:00:00 IDC-Classifier-Trainer

150717.c009
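
The value printed by `qsub` is the job ID, made up of a job number and the server name. If you capture that output in Python, a small helper can split it so the number can be reused in later `qstat` calls. This is only a sketch; `split_job_id` is a hypothetical name, and it assumes the `<number>.<server>` form shown above.

```python
def split_job_id(qsub_output):
    """Split qsub output such as '150717.c009' into (number, server)."""
    job_id = qsub_output.strip()
    number, _, server = job_id.partition(".")
    if not number.isdigit():
        # Anything else is probably an error message rather than a job ID.
        raise ValueError(f"unexpected qsub output: {qsub_output!r}")
    return number, server
```

For the job above, `split_job_id("150717.c009")` gives the number `"150717"` and the server `"c009"`.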


# Check the status of the job

Now you can monitor the status of the job by executing the following block. You may need to do this a number of times before the job completes. 

JOB STATUSES

R: Running  
Q: Waiting in queue

In [26]:
!qstat

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
150666.c009                ...ub-singleuser u13339          00:00:23 R jupyterhub     
150717.c009                ...ifier-Trainer u13339                 0 R batch          
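
If you would rather read the job states programmatically than scan the table by eye, the `qstat` listing can be parsed as whitespace-separated columns. The sketch below assumes the six-column layout shown above (the state is the second-to-last column); `parse_qstat` is a hypothetical helper, not part of this project.

```python
def parse_qstat(text):
    """Map each job ID in a qstat table to its one-letter state (e.g. R or Q)."""
    states = {}
    in_table = False
    for line in text.splitlines():
        if line.startswith("---"):
            # The dashed rule separates the header from the job rows.
            in_table = True
            continue
        parts = line.split()
        if in_table and len(parts) >= 6:
            states[parts[0]] = parts[-2]
    return states
```

You could pass it the captured output of `qstat` and check whether your job ID still maps to `R` or `Q`.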


You can also get a full list of stats for the job by executing the following block, replacing the ID with your job ID:

In [27]:
!qstat -f 150717

Job Id: 150717.c009
    Job_Name = IDC-Classifier-Trainer
    Job_Owner = u13339@c009-n011
    job_state = R
    queue = batch
    server = c009
    Checkpoint = u
    ctime = Sat Aug 11 13:00:02 2018
    Error_Path = c009-n011:/home/u13339/CaffeNet/IDC-Classifier-Trainer.e15071
	7
    exec_host = c009-n022/0-1
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = n
    mtime = Sat Aug 11 13:00:03 2018
    Output_Path = c009-n011:/home/u13339/CaffeNet/IDC-Classifier-Trainer.o1507
	17
    Priority = 0
    qtime = Sat Aug 11 13:00:02 2018
    Rerunable = True
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=2
    Resource_List.walltime = 24:00:00
    session_id = 198910
    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/home/u13339,
	PBS_O_LOGNAME=u13339,
	PBS_O_PATH=/glob/intel-python/python3/bin/:/glob/intel-python/python3
	/bin/:/glob/intel-python/python2/bin/:/glob/development-tools/versions
	/intel-parallel-studio-2018-upda
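
Note that `qstat -f` wraps long values onto continuation lines that start with a tab, which is why the error and output paths above appear split. A minimal sketch for reading this output into a dictionary, rejoining the wrapped values, might look like the following. `parse_qstat_full` is a hypothetical helper, and it assumes the `key = value` layout with tab-indented continuations seen in the output above.

```python
def parse_qstat_full(text):
    """Parse `qstat -f` output into a dict, rejoining tab-wrapped values."""
    fields = {}
    key = None
    for line in text.splitlines():
        if line.startswith("\t") and key is not None:
            # Continuation of the previous value: append without a space.
            fields[key] += line.strip()
        elif " = " in line:
            key, _, value = line.strip().partition(" = ")
            fields[key] = value
        elif line.startswith("Job Id:"):
            fields["Job Id"] = line.partition(":")[2].strip()
            key = None
    return fields
```

With the output above, the `Error_Path` entry would end in `IDC-Classifier-Trainer.e150717` once the wrapped `7` is rejoined.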

# Check error and output files

After the job has finished, you will see two new files in your current directory. As my job ID was 150717, my error file ends with e150717 and my output file ends with o150717. In this case the error file contained a FutureWarning. The output file shows the full output of your program.

In [36]:
%ls

[0m[01;34mcomponents[0m/             IDC-Classifier-Trainer.e150717  Trainer.ipynb
[01;34mdata[0m/                   IDC-Classifier-Trainer.o150717  Trainer.py
IDC-Classifier-Trainer  [01;34mmodel[0m/


# Create Training job

The next step is to create a job on the A.I. DevCloud for training the IDC Classifier. Open a new terminal in Notebooks, go to the CaffeNet folder on the A.I. DevCloud, and execute the following command, noting down the job ID that is returned.

__echo caffe train --solver ~/IDC-Classifier/CaffeNet/data/solver.prototxt | qsub -o ~/IDC-Classifier/CaffeNet/model/output.txt -e ~/IDC-Classifier/CaffeNet/model/train.log__
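
The command above pipes a one-line script into `qsub`, with `-o` and `-e` redirecting the job's output and error streams. If you prefer to submit from Python instead of a terminal, the same pieces could be assembled as in the sketch below. `build_training_submission` is a hypothetical helper; it only builds the argument list and the script text, and expands `~` itself because no shell is involved.

```python
import os


def build_training_submission(solver, output_path, error_path):
    """Return (argv, script) equivalent to `echo caffe train ... | qsub -o ... -e ...`."""
    solver, output_path, error_path = (
        os.path.expanduser(p) for p in (solver, output_path, error_path)
    )
    # The script is what `echo` would write to qsub's standard input.
    script = f"caffe train --solver {solver}\n"
    argv = ["qsub", "-o", output_path, "-e", error_path]
    return argv, script
```

On the DevCloud, the result could then be passed to `subprocess.run(argv, input=script, text=True, capture_output=True)`, whose standard output would contain the job ID.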

## Check Training Job

Check the status of the training job by executing the following block, replacing the ID with the job ID you were given:

In [1]:
!qstat -f 150967

Job Id: 150967.c009
    Job_Name = STDIN
    Job_Owner = u13339@c009-n021
    resources_used.cput = 00:24:55
    resources_used.energy_used = 0
    resources_used.mem = 5551248kb
    resources_used.vmem = 1959492480kb
    resources_used.walltime = 00:02:09
    job_state = R
    queue = batch
    server = c009
    Checkpoint = u
    ctime = Sun Aug 12 04:17:36 2018
    Error_Path = c009-n021:/home/u13339/CaffeNet/model/train.log
    exec_host = c009-n043/0-1
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = n
    mtime = Sun Aug 12 04:17:37 2018
    Output_Path = c009-n021:/home/u13339/CaffeNet/model/output.txt
    Priority = 0
    qtime = Sun Aug 12 04:17:36 2018
    Rerunable = True
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=2
    Resource_List.walltime = 06:00:00
    session_id = 41002
    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/home/u13339,
	PBS_O_LOGNAME=u13339,
	PBS_O_PATH=/glob/intel-python/python3/bin/

# Plot learning curve

Here we will use our training logs to plot a graph of how the network performed during training and evaluation:

In [3]:
!python components/plot_learning_curve.py /home/u13339/CaffeNet/model/train.log /home/u13339/CaffeNet/model/train.png
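
The contents of `components/plot_learning_curve.py` are not shown in this notebook, so the following is only a sketch of the kind of parsing such a script needs to do, not the project's actual implementation. It assumes the training log contains lines of the form `Iteration <n>, loss = <value>` (optionally with timing information in parentheses after the iteration number). The name `parse_training_loss` is hypothetical.

```python
import re

# Matches e.g. "Iteration 100, loss = 0.52" or
# "Iteration 100 (1.6 iter/s, 60s/100 iters), loss = 0.52".
LOSS_RE = re.compile(
    r"Iteration (\d+)\b.*?, loss = ([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)"
)


def parse_training_loss(log_text):
    """Return a list of (iteration, loss) pairs found in a training log."""
    points = []
    for line in log_text.splitlines():
        match = LOSS_RE.search(line)
        if match:
            points.append((int(match.group(1)), float(match.group(2))))
    return points
```

The resulting pairs could be handed to any plotting library to draw training loss against iteration.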