# Model Development

Once a Data Science Project has been created and linked to a Data Science User by the Data Science Admin, the Data Science User can log in to SageMaker Studio to develop machine learning models for that project.

The MLOps platform deployment includes seed code and sample datasets that demonstrate data processing and model development. Which seed code is provided depends on the type of machine learning problem, which is defined during SageMaker project creation. The seed code is meant to teach data scientists how to use SageMaker and to shorten the path from ad-hoc experimentation in SageMaker Jupyter notebooks to models in production.

This section describes the typical steps a data scientist takes to develop models for a project on the MLOps platform.

1. Clone the remote project repository from AWS CodeCommit into the Data Scientist's Amazon SageMaker Studio workspace.
2. Upload data for model training to the S3 bucket associated with the project.
3. Copy a template for model development from the seedcode-repo.
4. Run data pre-processing using an Amazon SageMaker Processing job.
5. Train the model using an Amazon SageMaker Training job.
6. Run a batch inference job or set up and test a real-time inference endpoint.
7. Optionally run a hyperparameter tuning job.

## Workflow

1. Assume the Data Scientist role using the switch-role page of the [AWS console](https://signin.aws.amazon.com/switchrole). The role was created for you during the deployment, and its name follows the pattern **region-datascientist-account_id**. If you do not know the role name, navigate to IAM in the AWS Console first to retrieve it. After assuming the role, log in to SageMaker Studio as the SageMaker user _email alias-project name_. For detailed instructions see section **Setting up the platform and the data science environment**.
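The role name pattern above can be turned into a pre-filled switch-role link. The following is a minimal sketch, assuming that `region` is an AWS region code and `account_id` is the 12-digit account number; the helper names and the example values are hypothetical, and the `displayName` label is an arbitrary choice.

```python
from urllib.parse import urlencode


def data_scientist_role_name(region: str, account_id: str) -> str:
    """Build the role name from the pattern region-datascientist-account_id."""
    return f"{region}-datascientist-{account_id}"


def switch_role_url(region: str, account_id: str) -> str:
    """Return a console switch-role link with the account and role pre-filled."""
    query = urlencode(
        {
            "account": account_id,
            "roleName": data_scientist_role_name(region, account_id),
            # Label shown in the console after switching; any short text works.
            "displayName": "datascientist",
        }
    )
    return f"https://signin.aws.amazon.com/switchrole?{query}"


# Placeholder values for illustration only; substitute your own region and account.
print(switch_role_url("eu-west-1", "123456789012"))
```

If the role in your account does not follow the pattern exactly, look up the actual name in IAM and pass it to the link manually.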

2. In SageMaker Studio, click on SageMaker resources, choose Projects and select the project you are working on. If you are assigned to multiple projects, you might see more than one project in the list. A few seconds after you click on a project, the repository associated with that project is shown on the main screen. Clone the repository as shown below.

 ![ds_clone_repo.gif](images/ds_clone_repo.gif)

3. In the file system visible in the SageMaker Studio UI you will see the seedcode directory alongside the cloned repository. First, open the seedcode directory. There you will find a Jupyter notebook, `0_upload_data.ipynb`. The first section of the notebook contains code to load a few necessary libraries and to get the name of the S3 bucket associated with the project. **Make a note of the bucket name as you will need it later** (you can also come back to this notebook to retrieve the bucket name again). The other two sections contain instructions to upload either tabular or text data for a binary classification problem. Run all cells as shown below by going to Run -> Run All Cells.

![ds_upload_data.gif](images/ds_upload_data.gif)
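For readers who want to upload their own file outside the notebook, the upload step can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the helper name `upload_dataset` and the `data` prefix are hypothetical, and the S3 client is passed in so the function can be tried with any object that offers an `upload_file` method.

```python
from pathlib import Path


def upload_dataset(s3_client, bucket: str, local_path: str, prefix: str = "data") -> str:
    """Upload one local file to s3://bucket/prefix/filename and return its S3 URI."""
    key = f"{prefix.strip('/')}/{Path(local_path).name}"
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client call for local files.
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Usage inside SageMaker Studio (needs boto3 and the project bucket name noted earlier):
#   import boto3
#   upload_dataset(boto3.client("s3"), "<project-bucket-name>", "train.csv")
```

Passing the client as an argument keeps the helper free of a hard boto3 dependency, so it can be exercised with a stub before touching the real bucket.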

4. To learn how to develop a model using the sklearn library, use the code in the seedcode/sklearn directory (follow the same steps for other templates such as xgboost, tensorflow and Hugging Face). We recommend copying the code from the seedcode/sklearn directory into your project repository by following these steps:

* Open the terminal `(File->New->Terminal)`
* Navigate to the directory with the cloned project repository
* Copy the seed code using the following command: `cp -r ~/seedcode/sklearn/* ./`
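The same copy can be done from a notebook cell instead of the terminal. The sketch below is one possible equivalent, assuming Python 3.8 or later; the function name `copy_seed_code` is hypothetical. Note one difference from the shell command: `cp -r ~/seedcode/sklearn/* ./` skips hidden files because of the `*` glob, while `shutil.copytree` copies them too.

```python
import shutil
from pathlib import Path


def copy_seed_code(template_dir: Path, repo_dir: Path) -> list:
    """Copy a seed-code template into the project repository.

    Returns the sorted names of the top-level entries now present in repo_dir.
    """
    # dirs_exist_ok=True lets the copy merge into the already cloned repository.
    shutil.copytree(template_dir, repo_dir, dirs_exist_ok=True)
    return sorted(entry.name for entry in repo_dir.iterdir())


# Example call; the repository path is a placeholder for your cloned project:
#   copy_seed_code(Path.home() / "seedcode" / "sklearn", Path.home() / "<project-repo>")
```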

5. Start the data preprocessing job by opening the Jupyter notebook `1_preprocess.ipynb` and running all cells (in the menu go to Run -> Run All Cells). Make sure to examine the content of the cells to understand the steps taken to prepare the data for model training. The job takes a few minutes to complete.

![ds_cp_code.gif](images/ds_cp_code.gif)

To monitor progress, navigate to the [SageMaker Console](https://console.aws.amazon.com/sagemaker/home) and open **Processing Jobs**.

![ds_preprocessing_job5.png](images/ds_preprocessing_job5.png)
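Job status can also be checked programmatically rather than in the console. The following is a minimal polling sketch built on the boto3 SageMaker call `describe_processing_job`; the function name, the polling interval and the job name in the usage comment are assumptions for illustration.

```python
import time

# States after which a processing job will not change any more.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_processing_job(sm_client, job_name: str, poll_seconds: float = 30.0) -> str:
    """Poll a processing job until it reaches a terminal state and return that state."""
    while True:
        response = sm_client.describe_processing_job(ProcessingJobName=job_name)
        status = response["ProcessingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        # "InProgress" or "Stopping": wait before asking again.
        time.sleep(poll_seconds)


# Usage (needs boto3 and credentials for the Data Scientist role):
#   import boto3
#   wait_for_processing_job(boto3.client("sagemaker"), "<processing-job-name>")
```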

After the processing job is completed, you will find three new datasets uploaded to the S3 bucket associated with the project (`df_train`, `df_test`, `df_val`) as shown below.

![ds_preprocessing_job6_studio.png](images/ds_preprocessing_job6_studio.png)
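A quick way to confirm that the three datasets arrived is to list the bucket and look for their names. This sketch assumes that each dataset name appears somewhere in an object key; the exact key layout produced by the seed code is not specified here, and the helper name `missing_datasets` is hypothetical.

```python
EXPECTED_DATASETS = ("df_train", "df_test", "df_val")


def missing_datasets(s3_client, bucket: str, prefix: str = "") -> list:
    """Return the expected dataset names that do not appear in any object key."""
    # list_objects_v2 returns at most 1000 keys per call, which is enough for a
    # small project bucket; larger buckets would need pagination.
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    keys = [obj["Key"] for obj in response.get("Contents", [])]
    return [name for name in EXPECTED_DATASETS if not any(name in key for key in keys)]


# Usage (needs boto3 and the project bucket name):
#   import boto3
#   missing_datasets(boto3.client("s3"), "<project-bucket-name>")
```

An empty list means all three datasets were found.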

6. Once the data is prepared for training, open the notebook `2_training.ipynb` to run an example training job. Run the cells that set up the libraries and job parameters, then run all cells under the section **Train**.

![ds_training_job.gif](images/ds_training_job.gif)

The training job takes a few minutes to complete.

![ds_training_job4_studio.png](images/ds_training_job4_studio.png)
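For orientation, a scikit-learn training job launched with the SageMaker Python SDK typically looks like the sketch below. It is not the content of `2_training.ipynb`: it assumes SageMaker Python SDK v2, and the entry point `train.py`, the instance type, the framework version, the `models` output prefix and both helper names are illustrative assumptions.

```python
def training_job_settings(role_arn: str, bucket: str, entry_point: str = "train.py") -> dict:
    """Collect keyword arguments for the sagemaker.sklearn.SKLearn estimator."""
    return {
        "entry_point": entry_point,        # hypothetical training script name
        "role": role_arn,                  # execution role used by the training job
        "instance_type": "ml.m5.large",    # assumed instance type
        "instance_count": 1,
        "framework_version": "1.2-1",      # assumed scikit-learn container version
        "output_path": f"s3://{bucket}/models",
    }


def launch_training(settings: dict, train_uri: str):
    """Start the training job; requires the sagemaker package and AWS credentials."""
    # Imported here so the settings helper can be used without the SDK installed.
    from sagemaker.sklearn import SKLearn

    estimator = SKLearn(**settings)
    # The "train" channel name is an assumption; it must match what the script reads.
    estimator.fit({"train": train_uri})
    return estimator
```

Keeping the settings in a plain dictionary makes it easy to review or log the job configuration before anything is launched.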

To monitor progress, navigate to the [SageMaker Console](https://console.aws.amazon.com/sagemaker/home), then **Training** and then **Training jobs**.

![ds_training_job5.png](images/ds_training_job5.png)
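The same information shown in the console can be read with the boto3 call `describe_training_job`. The sketch below pulls out the status, the model artifact location and any failure reason; the function name and the shape of the returned dictionary are assumptions for illustration.

```python
def summarize_training_job(sm_client, job_name: str) -> dict:
    """Return the status, model artifact URI and failure reason of a training job."""
    response = sm_client.describe_training_job(TrainingJobName=job_name)
    return {
        "status": response["TrainingJobStatus"],
        # Read defensively in case a field is absent from the response.
        "model_artifact": response.get("ModelArtifacts", {}).get("S3ModelArtifacts"),
        "failure_reason": response.get("FailureReason"),
    }


# Usage (needs boto3 and credentials):
#   import boto3
#   summarize_training_job(boto3.client("sagemaker"), "<training-job-name>")
```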

7. The same notebook `2_training.ipynb` also demonstrates hyperparameter optimization and inference (batch transform and real-time inference). To see how each one works, run the cells in the corresponding section.

Batch transform:

![ds_training_job6_studio.png](images/ds_training_job6_studio.png)

To monitor progress, navigate to the [SageMaker Console](https://console.aws.amazon.com/sagemaker/home), then **Inference** and then **Batch transform jobs**.

![ds_training_job7.png](images/ds_training_job7.png)

Real-time inference:

![ds_training_job8_studio.png](images/ds_training_job8_studio.png)
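Once an endpoint is running, it can be tested from any Python session through the `sagemaker-runtime` client. This is a minimal sketch that assumes the deployed model accepts `text/csv` input; the function name, the endpoint name and the sample rows are hypothetical.

```python
def predict_csv(runtime_client, endpoint_name: str, rows) -> str:
    """Send rows of feature values as CSV to a real-time endpoint and return the raw reply."""
    body = "\n".join(",".join(str(value) for value in row) for row in rows)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=body,
    )
    # The response body is a stream; read() returns bytes.
    return response["Body"].read().decode("utf-8")


# Usage (needs boto3, credentials and a deployed endpoint):
#   import boto3
#   predict_csv(boto3.client("sagemaker-runtime"), "<endpoint-name>", [[0.1, 0.2, 0.3]])
```

Remember to delete test endpoints when you are done, since a running endpoint is billed for as long as it exists.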


Hyperparameter optimization:

![ds_training_job9_studio.png](images/ds_training_job9_studio.png)

To monitor progress, navigate to the [SageMaker Console](https://console.aws.amazon.com/sagemaker/home), then **Training** and then **Hyperparameter tuning jobs**.

![ds_training_job10.png](images/ds_training_job10.png)
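The best configuration found by a tuning job can be retrieved with the boto3 call `describe_hyper_parameter_tuning_job`. The sketch below is an illustration; the function name and the shape of the returned dictionary are assumptions, while the response fields it reads are part of the SageMaker API.

```python
def best_tuning_result(sm_client, tuning_job_name: str) -> dict:
    """Return the best training job of a tuning job with its metric and hyperparameters."""
    response = sm_client.describe_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=tuning_job_name
    )
    result = {"status": response["HyperParameterTuningJobStatus"]}
    best = response.get("BestTrainingJob")
    if best is None:
        # No training job has finished yet, so there is nothing to report.
        return result
    metric = best["FinalHyperParameterTuningJobObjectiveMetric"]
    result.update(
        training_job=best["TrainingJobName"],
        metric_name=metric["MetricName"],
        metric_value=metric["Value"],
        hyperparameters=best["TunedHyperParameters"],
    )
    return result


# Usage (needs boto3 and credentials):
#   import boto3
#   best_tuning_result(boto3.client("sagemaker"), "<tuning-job-name>")
```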

These jobs are recorded as [SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html), so they can be tracked together with their associated artifacts. You can view all the experiment training jobs from the Experiments tab in the Project panel as shown below:

![ds_experiment.gif](images/ds_experiment.gif)
