# Task 2: Prepare for SageMaker Training

## Setup and Prepare Data

We need local folders for the data and for the training script. We also store the dataset locally as a CSV file.

In [6]:
!mkdir -p src
!mkdir -p data
import sys
sys.path.append('src')

In [2]:
import pandas as pd
from sklearn import datasets
import os
import sys

sys.path.append('src')

digits = datasets.load_digits()
digits_df = pd.DataFrame(digits.data)
digits_df['y'] = digits.target
digits_df.to_csv(os.path.join('data', 'digits.csv'), index=False)
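The label column `y` is appended after the feature columns, so any code that reads this CSV can treat the last column as the target. A minimal sketch of that convention, using a tiny made-up frame and an in-memory buffer instead of the digits data (both are assumptions for illustration only):

```python
import io

import pandas as pd

# Hypothetical 3-row stand-in for the digits table: two feature columns plus label 'y'.
toy_df = pd.DataFrame([[0.0, 5.0], [1.0, 9.0], [2.0, 3.0]])
toy_df['y'] = [0, 1, 0]

# Round-trip through CSV text, mirroring the to_csv call used for the real data.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Features are every column except the last; the label is the last column.
features = reloaded.iloc[:, :-1]
labels = reloaded.iloc[:, -1]
```

Because `y` is always the final column, `iloc[:, -1]` selects the label regardless of how many feature columns precede it.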

## Preparing for Training

Defining a `requirements.txt` file is optional and is not required to launch a SageMaker Training job. It is supported, however, and can be used to manage dependencies. SageMaker automatically uses the specified source directory, which contains our `train.py` and, optionally, a `requirements.txt`. For more information, see the [documentation here](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#using-third-party-libraries).

In [4]:
%%writefile src/requirements.txt

# Not necessary for our training but we may define additional libraries here as required
#[optional-additional-libraries]

Overwriting src/requirements.txt


The training script creates a new Random Forest classifier. Input arguments are handled so that the script can run both locally and in a SageMaker Training container.

In [3]:
%%writefile src/train.py

import argparse
import pandas as pd
import joblib
import os

from pathlib import Path
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support


def fit(train_dir, n_estimators, max_depth):
    digits_df = pd.read_csv(Path(train_dir)/'digits.csv')
    X_train, X_test, y_train, y_test = train_test_split(digits_df.iloc[:, :-1], digits_df.iloc[:, -1], test_size=0.2)

    # Create and train Random Forest model
    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    model.fit(X_train, y_train)

    # Perform validation
    y_pred = model.predict(X_test)
    # pos_label only applies to average='binary', so it is omitted for the weighted average
    pre, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='weighted')

    print(f'pre: {pre:5.3f} rec: {rec:5.3f} f1: {f1:5.3f}')

    return model


if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    parser.add_argument('--train', type=str)
    parser.add_argument('--model-dir', type=str)
    parser.add_argument('--n-estimators', type=int, default=100)
    parser.add_argument('--max-depth', type=int, default=10)

    args, _ = parser.parse_known_args()
    trained_model = fit(train_dir=args.train,
                        n_estimators=args.n_estimators,
                        max_depth=args.max_depth)

    joblib.dump(trained_model, os.path.join(args.model_dir, 'model.joblib'))

Writing src/train.py
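One common way to let the same script run locally and in a container is to fall back to environment variables when a path is not given on the command line. The sketch below assumes the SageMaker variables `SM_CHANNEL_TRAIN` (set when the input channel is named `train`) and `SM_MODEL_DIR`; the `parse_args` helper and the local defaults `data` and `model` are hypothetical and not part of the script above.

```python
import argparse
import os


def parse_args(argv=None, environ=None):
    """Parse training arguments, falling back to SageMaker environment variables."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    # SM_CHANNEL_TRAIN and SM_MODEL_DIR are set inside SageMaker training containers;
    # the 'data' and 'model' fallbacks are assumed local defaults for illustration.
    parser.add_argument('--train', type=str,
                        default=environ.get('SM_CHANNEL_TRAIN', 'data'))
    parser.add_argument('--model-dir', type=str,
                        default=environ.get('SM_MODEL_DIR', 'model'))
    parser.add_argument('--n-estimators', type=int, default=100)
    parser.add_argument('--max-depth', type=int, default=10)
    # parse_known_args ignores any extra arguments passed to the script.
    args, _ = parser.parse_known_args(argv)
    return args


# Local run: no SageMaker variables, one hyperparameter given explicitly.
local_args = parse_args(['--n-estimators', '50'], environ={})

# Simulated container run: both paths come from the environment.
container_args = parse_args([], environ={
    'SM_CHANNEL_TRAIN': '/opt/ml/input/data/train',
    'SM_MODEL_DIR': '/opt/ml/model',
})
```

Passing `environ` explicitly keeps the helper easy to test without modifying the real process environment.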


## Testing Training Script

Let's test the `fit` function.

In [7]:
from train import fit

fit('data', 100, 10)

FileNotFoundError: [Errno 2] No such file or directory: 'data/digits.csv'
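
The `FileNotFoundError` above means `data/digits.csv` was not found relative to the current working directory, for example because the data preparation cell had not been run in this session or the notebook was started from a different folder. A small check like the following can make that easier to spot before calling `fit`. Note that `find_training_file` is a hypothetical helper, not part of the original notebook, and the temporary directory only serves to demonstrate it.

```python
import tempfile
from pathlib import Path


def find_training_file(train_dir, filename='digits.csv'):
    """Return the resolved path to the training file, or raise a descriptive error."""
    path = Path(train_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(
            f'{path} not found (working directory: {Path.cwd()}). '
            'Re-run the data preparation cell or adjust train_dir.'
        )
    return path.resolve()


# Demonstration in a throwaway directory (stand-in for the real 'data' folder).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'digits.csv').write_text('0,y\n1.0,0\n')
    found_name = find_training_file(tmp).name
    try:
        find_training_file(tmp, filename='missing.csv')
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True
```

In the notebook, `find_training_file('data')` could be called just before `fit('data', 100, 10)` to confirm the file is where the script expects it.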