# 5. Convert notebooks to scripts

A machine learning project starts with experimentation, where hypotheses are tested against real datasets using agile tools like Jupyter Notebook. Once the model is ready for production, the model code should be placed in a production code repository. In some cases, the model code must first be converted to Python scripts. This tutorial covers a recommended approach for exporting experimentation code to Python scripts.

In this tutorial, you learn how to:
- Clean nonessential code
- Refactor Jupyter Notebook code into functions
- Create Python scripts for related tasks

**First, accept your adsc3910_worksheet5 on Moodle**

## 1. Remove all nonessential code

Some code written during experimentation is only intended for exploratory purposes. Therefore, the first step to convert experimental code into production code is to remove this nonessential code. Removing nonessential code will also make the code more maintainable. In this section, you'll remove code from the `experimentation/Diabetes Ridge Regression Training.ipynb` notebook.
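As a rough illustration of what counts as nonessential, the sketch below separates display-only statements from statements that later steps depend on. The tiny data frame, its column names, and the specific calls are hypothetical; the cells in your notebook will differ.

```python
import pandas as pd

# Tiny hypothetical frame standing in for the notebook's data.
df = pd.DataFrame(
    {"AGE": [59, 48, 72], "BMI": [32.1, 21.6, 30.5], "Y": [151, 75, 141]}
)

# Exploratory: these only display information, and nothing later uses
# their output, so they are candidates for removal.
print(df.describe())
print(df.head())

# Essential: later steps depend on X and y, so these lines stay.
X = df.drop("Y", axis=1).values
y = df["Y"].values
```

A useful rule of thumb is to ask whether any later cell reads a value the statement produces. If not, the statement can usually go.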

## 2. Refactor code into functions
Second, the Jupyter code needs to be refactored into functions. Refactoring code into functions makes unit testing easier and makes the code more maintainable. In this section, you'll refactor:
- The Diabetes Ridge Regression Training notebook (`experimentation/Diabetes Ridge Regression Training.ipynb`)
- The Diabetes Ridge Regression Scoring notebook (`experimentation/Diabetes Ridge Regression Scoring.ipynb`)

### Refactor Diabetes Ridge Regression Training notebook into functions

In `experimentation/Diabetes Ridge Regression Training.ipynb`, complete the following steps:
- Create a function called `split_data` to split the data frame into test and train data. The function should take the data frame `df` as a parameter and return a dictionary containing the keys `train` and `test`.
  
    Move the code under the "Split Data into Training and Validation Sets" heading into the `split_data` function and modify it to return the `data` object.
- Create a function called `train_model`, which takes the parameters `data` and `args` and returns a trained model.
  
    Move the code under the "Training Model on Training Set" heading into the `train_model` function and modify it to return the `reg_model` object. Remove the `args` dictionary; the values will come from the `args` parameter.
- Create a function called `get_model_metrics`, which takes the parameters `reg_model` and `data`, evaluates the model, and returns a dictionary of metrics for the trained model.
  
    Move the code under the "Validate Model on Validation Set" heading into the `get_model_metrics` function and modify it to return the `metrics` object.
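The three functions described above might be sketched as follows. This is one possible shape, not the worksheet's reference solution: the target column name `Y`, the 80/20 split, the nested layout of the `data` dictionary, and the `mse` metric key are all assumptions, so adapt them to the code already in your notebook.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def split_data(df):
    # Assumption: the target column is named "Y"; adjust to match the notebook.
    X = df.drop("Y", axis=1).values
    y = df["Y"].values
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    data = {
        "train": {"X": X_train, "y": y_train},
        "test": {"X": X_test, "y": y_test},
    }
    return data


def train_model(data, args):
    # args holds the Ridge hyperparameters, for example {"alpha": 0.5}.
    reg_model = Ridge(**args)
    reg_model.fit(data["train"]["X"], data["train"]["y"])
    return reg_model


def get_model_metrics(reg_model, data):
    preds = reg_model.predict(data["test"]["X"])
    mse = mean_squared_error(data["test"]["y"], preds)
    metrics = {"mse": mse}
    return metrics


# Quick check. This loading code is an assumption about the "Load Data" cell.
X, y = load_diabetes(return_X_y=True)
df = pd.DataFrame(X)
df["Y"] = y
data = split_data(df)
reg = train_model(data, {"alpha": 0.5})
metrics = get_model_metrics(reg, data)
print(metrics)
```

Because each function takes its inputs as parameters and returns its result, each one can be called and checked on its own, which is what makes unit testing easier.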

Still in `experimentation/Diabetes Ridge Regression Training.ipynb`, complete the following steps:
1. Create a new function called `main`, which takes no parameters and returns nothing.
2. Move the code under the "Load Data" heading into the main function.
3. Add invocations for the newly written functions into the main function:
```python
# Split Data into Training and Validation Sets
data = split_data(df)
# Train Model on Training Set
args = {
    "alpha": 0.5
}
reg = train_model(data, args)
# Validate Model on Validation Set
metrics = get_model_metrics(reg, data)
```
4. Move the code under the "Save Model" heading into the main function.

At this stage, there should be no code remaining in the notebook that isn't in a function, other than import statements in the first cell.

Add a statement that calls the main function.

```python
main()
```

After refactoring, `experimentation/Diabetes Ridge Regression Training.ipynb` should look like the following code without the markdown:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import pandas as pd
import joblib


# Split the dataframe into test and train data
def split_data(df):
    # YOUR CODE HERE
    return data


# Train the model, return the model
def train_model(data, args):
    # YOUR CODE HERE
    return reg_model


# Evaluate the metrics for the model
def get_model_metrics(reg_model, data):
    # YOUR CODE HERE
    return metrics


def main():
    # YOUR CODE HERE

main()
```

### Refactor Diabetes Ridge Regression Scoring notebook into functions

In `experimentation/Diabetes Ridge Regression Scoring.ipynb`, complete the following steps:

1. Create a new function called `init`, which takes no parameters and returns nothing.
2. Copy the code under the "Load Model" heading into the init function.
3. Once the init function has been created, replace all the code under the heading "Load Model" with a single call to init as follows:
```python
init()
```

Still in `experimentation/Diabetes Ridge Regression Scoring.ipynb`, complete the following steps:
1. Create a new function called `run`, which takes `raw_data` and `request_headers` as parameters and returns a dictionary of results.
2. Copy the code under the "Prepare Data" and "Score Data" headings into the `run` function.

Remember to remove the statements that set the variables `raw_data` and `request_headers`, because those values are passed in later when `run` is called. The `run` function should look like the following code:

```python
def run(raw_data, request_headers):
    # YOUR CODE HERE
    return {"result": result.tolist()}
```

Once the run function has been created, replace all the code under the "Prepare Data" and "Score Data" headings with the following code:

```python
raw_data = '{"data":[[1,2,3,4,5,6,7,8,9,10],[10,9,8,7,6,5,4,3,2,1]]}'
request_header = {}
prediction = run(raw_data, request_header)
print("Test result: ", prediction)
```

The previous code sets the variables `raw_data` and `request_header`, calls the `run` function with them, and prints the prediction.
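To make the data flow through `run` concrete, here is a minimal sketch with a stand-in model. `StubModel` is hypothetical and simply sums each row; in the notebook, `model` is the trained Ridge model loaded by `init`. The parsing step assumes the payload is a JSON string with a `data` key, which matches the `raw_data` value shown above.

```python
import json

import numpy


class StubModel:
    """Hypothetical stand-in for the trained model; it sums each input row."""

    def predict(self, X):
        return X.sum(axis=1)


model = StubModel()


def run(raw_data, request_headers):
    # Prepare data: parse the JSON payload into a 2-D array.
    data = json.loads(raw_data)["data"]
    data = numpy.array(data)
    # Score data: one prediction per input row.
    result = model.predict(data)
    return {"result": result.tolist()}


raw_data = '{"data":[[1,2,3,4,5,6,7,8,9,10],[10,9,8,7,6,5,4,3,2,1]]}'
prediction = run(raw_data, {})
print("Test result: ", prediction)
```

Calling `tolist()` converts the NumPy array into plain Python values, so the returned dictionary can be serialized back to JSON.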

After refactoring, `experimentation/Diabetes Ridge Regression Scoring.ipynb` should look like the following code without the markdown:

```python
import json
import numpy
import os
import joblib

def init():
    # YOUR CODE HERE

def run(raw_data, request_headers):
    # YOUR CODE HERE

    return {"result": result.tolist()}

init()
raw_data = '{"data":[[1,2,3,4,5,6,7,8,9,10],[10,9,8,7,6,5,4,3,2,1]]}'
request_header = {}
prediction = run(raw_data, request_header)
print("Test result: ", prediction)
```

## 3. Combine related functions in Python files

Third, related functions need to be merged into Python files to make the code easier to reuse. In this section, you'll create Python files for the following notebooks:
- The Diabetes Ridge Regression Training notebook (`experimentation/Diabetes Ridge Regression Training.ipynb`)
- The Diabetes Ridge Regression Scoring notebook (`experimentation/Diabetes Ridge Regression Scoring.ipynb`)

### Create Python file for the Diabetes Ridge Regression Training notebook

In Visual Studio Code, right-click the `.ipynb` file and choose **Import Notebook to Script**. This creates a Python script from your notebook.

Save the script as `train.py`.

Once the notebook has been converted to `train.py`, remove any unwanted comments. Replace the call to `main()` at the end of the file with a conditional invocation like the following code:


```python
if __name__ == '__main__':
    main()
```

`train.py` can now be invoked from a terminal by running `python train.py`. The functions from `train.py` can also be called from other files.
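The effect of the `if __name__ == '__main__'` guard can be seen in the following self-contained sketch. It writes a small stand-in file (`train_demo.py`, a hypothetical name) to a temporary folder, then imports it and runs it as a script. The real `train.py` contains your refactored notebook code instead.

```python
import importlib.util
import runpy
import tempfile
from pathlib import Path

# Hypothetical stand-in for train.py.
SOURCE = '''
calls = []

def main():
    calls.append("main ran")

if __name__ == '__main__':
    main()
'''

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "train_demo.py"
    script.write_text(SOURCE)

    # Imported from another file: __name__ is "train_demo", so main() is skipped.
    spec = importlib.util.spec_from_file_location("train_demo", script)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    imported_calls = list(module.calls)

    # Run as a script: __name__ is "__main__", so main() is called.
    script_globals = runpy.run_path(str(script), run_name="__main__")
    script_calls = list(script_globals["calls"])

print(imported_calls)  # []
print(script_calls)    # ['main ran']
```

Without the guard, importing the file from elsewhere would trigger the whole training run as a side effect of the import.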

### Create Python file for the Diabetes Ridge Regression Scoring notebook

In Visual Studio Code, right-click the `.ipynb` file and choose **Import Notebook to Script**. This creates a Python script from your notebook.

Save the script as `score.py`.

Once the notebook has been converted to `score.py`, remove any unwanted comments.

The model variable needs to be global so that it's visible throughout the script. Add the following statement at the beginning of the `init` function:

```python
global model
```
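The reason for the `global` statement is that an assignment inside a function normally creates a local variable that disappears when the function returns. The sketch below shows the difference, using a plain string as a hypothetical stand-in for the loaded model. The function `init_without_global` exists only for comparison and is not part of `score.py`.

```python
model = None  # module-level name that run() reads


def init_without_global():
    # Binds a local name only; the module-level model is unchanged.
    model = "loaded model"


def init():
    global model
    # Rebinds the module-level name, so run() can see the loaded model.
    model = "loaded model"


def run():
    return model


init_without_global()
before = run()
init()
after = run()

print(before)  # None
print(after)   # loaded model
```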

## 4. Run the script in the terminal

1. Open a terminal and navigate to the `adsc3910_worksheet5/experimentation` folder.
2. Activate the adsc_3610 environment by typing `conda activate adsc_3610`.
3. Run the `train.py` script by typing `python train.py`. After the script finishes, you should see a newly generated `.pkl` file. That file is your trained model.
4. Run the `score.py` script by typing `python score.py`. After the script finishes, you should see the prediction result and the MSE values.
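If you want to confirm that a saved `.pkl` file really contains a usable model, a save-and-load round trip with `joblib` might look like the sketch below. The file name `model_demo.pkl`, the temporary folder, and fitting on the full dataset are assumptions made for illustration; use the file name that your own "Save Model" code writes.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

# Train a small model so the example is self-contained.
X, y = load_diabetes(return_X_y=True)
reg_model = Ridge(alpha=0.5).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name; the worksheet's script chooses its own.
    model_path = Path(tmp) / "model_demo.pkl"
    joblib.dump(reg_model, str(model_path))
    loaded = joblib.load(str(model_path))

# The loaded model should give the same predictions as the original.
same = bool((loaded.predict(X[:2]) == reg_model.predict(X[:2])).all())
print(same)
```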