# Python - `scikit-learn` and `sasviya` Random Forests

Before starting, the `sasctl` package has to be installed. To do this, let's open a Terminal in VSCode and run the following command:
```bash
pip install sasctl
```

<div style="text-align: center;">
    <img src='https://raw.githubusercontent.com/Mat-Gug/workbench-session/main/img/new_terminal.png' width=50%>
</div>
<div style="text-align: center;">
    <img src='https://raw.githubusercontent.com/Mat-Gug/workbench-session/main/img/sasctl_installation.png' width=50%>
</div>

## 1. Importing Packages

In [1]:
import os
import json
import pickle
import requests
import warnings

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

import sasviya
from sasviya.ml.tree import ForestClassifier

from sasctl import Session, pzmm
from sasctl.services import model_repository as mr

## 2. Model training

In [4]:
df = pd.read_csv('data/CUSTOMS.csv')
print(f"Number of observations: {df.shape[0]}")
print(f"Number of variables: {df.shape[1]}")
df.head()

Number of observations: 7043
Number of variables: 21


| | CertificateOfOrigin | EUCitizen | Perishable | Fragile | Volume | PreDeclared | MultiplePackage | Category | OnlineDeclaration | ExporterValidation | ... | LithiumBatteries | ExpressDelivery | EntryPoint | Origin | PaperlessBilling | PaymentMethod | Weight | Price | Inspection | packageID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No | 0 | Yes | No | 2 | No | | Clothing | No | Yes | ... | No | No | Antwerp | China | Yes | Electronic check | 29.85 | 29.85 | No | 7590-VHVEG |
| 1 | Yes | 0 | No | No | 35 | Yes | No | Clothing | Yes | No | ... | No | No | Antwerp | US | No | Mailed check | 56.95 | 1889.5 | No | 5575-GNVDE |
| 2 | Yes | 0 | No | No | 3 | Yes | No | Clothing | Yes | Yes | ... | No | No | Antwerp | China | Yes | Mailed check | 53.85 | 108.15 | Yes | 3668-QPYBK |
| 3 | Yes | 0 | No | No | 46 | No | | Clothing | Yes | No | ... | Yes | No | Antwerp | US | No | Bank transfer (automatic) | 42.3 | 1840.75 | No | 7795-CFOCW |
| 4 | No | 0 | No | No | 3 | Yes | No | Electronics | No | No | ... | No | No | Antwerp | China | Yes | Electronic check | 70.7 | 151.65 | Yes | 9237-HQITU |


In [5]:
target = 'Inspection'
X = df.drop([target, "packageID"], axis=1)
y = df[target]

X.dtypes

CertificateOfOrigin     object
EUCitizen                int64
Perishable              object
Fragile                 object
Volume                   int64
PreDeclared             object
MultiplePackage         object
Category                object
OnlineDeclaration       object
ExporterValidation      object
SecuredDelivery         object
LithiumBatteries        object
ExpressDelivery         object
EntryPoint              object
Origin                  object
PaperlessBilling        object
PaymentMethod           object
Weight                 float64
Price                  float64
dtype: object

### 2.1 `scikit-learn` Model

In [6]:
binary_cols = ['CertificateOfOrigin', 'Perishable', 'Fragile',
               'PreDeclared', 'PaperlessBilling']
ohe_cols = ['MultiplePackage', 'OnlineDeclaration',
            'ExporterValidation', 'SecuredDelivery', 'LithiumBatteries',
            'ExpressDelivery', 'Category', 'EntryPoint', 'Origin', 'PaymentMethod']
binary_mapping = ['No', 'Yes']  # ordinal order: 'No' -> 0, 'Yes' -> 1

# Define the preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('binary', OrdinalEncoder(categories=[binary_mapping] * len(binary_cols)), binary_cols),
        ('ohe', OneHotEncoder(dtype='int64', handle_unknown='ignore', sparse_output=False), ohe_cols),
        ('impute', SimpleImputer(), ['Price'])
    ],
    remainder='passthrough',  # Keep the remaining columns as they are
    force_int_remainder_cols=False
)

# Define the pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=50, random_state=12345))
])

In [7]:
pipeline.fit(X, y)
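
To make the two encoders in the preprocessing step more concrete, here is a plain-Python sketch of what they compute for a handful of values. It is an illustration only, not scikit-learn's implementation: the helper names `ordinal_encode` and `one_hot_encode` are hypothetical, and `"Japan"` is a made-up unseen category used to show the effect of `handle_unknown='ignore'`.

```python
def ordinal_encode(values, categories=("No", "Yes")):
    """Map each value to its position in `categories` ('No' -> 0, 'Yes' -> 1)."""
    position = {category: i for i, category in enumerate(categories)}
    return [position[value] for value in values]


def one_hot_encode(values, categories):
    """One indicator column per known category; unknown values give all zeros."""
    return [[int(value == category) for category in categories] for value in values]


binary_encoded = ordinal_encode(["No", "Yes", "Yes"])
origin_encoded = one_hot_encode(["China", "US", "Japan"], ["China", "US"])

print(binary_encoded)  # [0, 1, 1]
print(origin_encoded)  # [[1, 0], [0, 1], [0, 0]]
```

The all-zero row for the unseen category is why the pipeline can still score records whose nominal values did not appear in the training data.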

### 2.2 `sasviya` Model

In [8]:
nominal_cols = ['CertificateOfOrigin', 'EUCitizen', 
                'Perishable', 'Fragile', 'PreDeclared',
                'MultiplePackage', 'OnlineDeclaration',
                'ExporterValidation', 'SecuredDelivery', 'LithiumBatteries',
                'ExpressDelivery', 'PaperlessBilling', 'Category', 'EntryPoint',
                'Origin', 'PaymentMethod']

sasviya_rf = ForestClassifier(n_estimators=50, random_state=12345)
sasviya_rf.fit(X, y, nominals=nominal_cols)

ForestClassifier(n_estimators=50, random_state=12345)

## 3. Integrating models into SAS Viya Platform

Common steps for the 2 models:

1. Obtaining the authorization code to the Viya instance of interest
2. Establishing a connection to the Viya server
3. Creating a project in SAS Model Manager from Workbench

#### Obtain the authorization code to the Viya instance of interest (`create.demo.sas` in our case)

[Follow this link to get the authorization code](https://create.demo.sas.com/SASLogon/oauth/authorize?client_id=sas.cli&response_type=code). If you are using a different Viya instance, set the `server` variable to your server URL instead of `"https://create.demo.sas.com/"`, and get the code from `https://your-server-url.com/SASLogon/oauth/authorize?client_id=sas.cli&response_type=code`.

To verify the server certificate, request a `.pem` file with the necessary permissions from your Viya administrator and set the `verification_file` variable to its path, replacing `"your-pem-file-name"`. The cell below skips certificate verification by passing `verify=False`; to enable it, uncomment the `verification_file` lines and pass `verify=verification_file` instead:

In [14]:
# paste the authorization code here
auth_code = "your-auth-code"

server = "https://create.demo.sas.com/"

# URL to obtain the access token (strip the trailing slash to avoid "//" in the path)
url = f"{server.rstrip('/')}/SASLogon/oauth/token"

# Payload and headers for the request
auth_payload = f'grant_type=authorization_code&code={auth_code}'
headers = {
    'Accept': 'application/json',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Authorization': 'Basic c2FzLmNsaTo='
}
# verification_file = "your-pem-file-name"

# Send the POST request to obtain the access and refresh tokens
response = requests.post(url, headers=headers, data=auth_payload, verify=False)
# response = requests.post(url, headers=headers, data=auth_payload, verify=verification_file)
response.raise_for_status()  # fail early if the request was rejected
response_json = response.json()

# Extract the access and the refresh tokens from the response
access_token = response_json['access_token']
refresh_token = response_json['refresh_token']

# Save the refresh token to a .txt file:
with open('refresh_token.txt', 'w') as file:
    file.write(refresh_token)
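
The `Authorization` header and the form payload in the cell above are hard-coded strings. As a minimal sketch of where they come from, the helpers below rebuild them with the standard library. The function names are hypothetical, and the sketch assumes that the `sas.cli` client is used with an empty client secret, which is what the Base64 value `c2FzLmNsaTo=` decodes to (`sas.cli:`).

```python
import base64
from urllib.parse import urlencode


def authorize_url(server, client_id="sas.cli"):
    """Build the URL that returns an authorization code."""
    query = urlencode({"client_id": client_id, "response_type": "code"})
    return f"{server.rstrip('/')}/SASLogon/oauth/authorize?{query}"


def basic_auth_header(client_id="sas.cli", client_secret=""):
    """Encode 'client_id:client_secret' as an HTTP Basic credential."""
    raw = f"{client_id}:{client_secret}".encode("ascii")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def token_payload(auth_code):
    """URL-encode the form body for the authorization-code grant."""
    return urlencode({"grant_type": "authorization_code", "code": auth_code})


print(authorize_url("https://create.demo.sas.com/"))
print(basic_auth_header())           # Basic c2FzLmNsaTo=
print(token_payload("abc/123+xyz"))  # special characters are percent-encoded
```

Building the payload with `urlencode` rather than an f-string keeps the request valid even if the authorization code contains characters such as `+` or `/`.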



#### Establishing a connection to the Viya server

In [15]:
# Establish a connection to SAS Viya using the access token
st = Session(server, token=access_token, verify_ssl=False)
# To verify certificates instead, point to the .pem file and drop verify_ssl=False:
# os.environ['CAS_CLIENT_SSL_CA_LIST'] = verification_file
# st = Session(server, token=access_token)
st

<sasctl.core.Session at 0x7f3184b46fd0>

The **access token** expires after about 1 hour by default (`response_json['expires_in']`, in seconds). The **refresh token** can be used to issue a new access token when the current one expires; it is valid for about 14 days (`response_json['refresh_expires_in']`, also in seconds):

In [16]:
response_json['expires_in'], response_json['refresh_expires_in']

(3599, 1209599)
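
Both values are expressed in seconds. The sketch below converts them into readable durations and shows one possible way to decide when a refresh is due. The helper `needs_refresh`, its 60-second safety margin, and the example timestamps are assumptions made for illustration; they are not part of `sasctl` or the SAS Viya API.

```python
from datetime import datetime, timedelta, timezone


def needs_refresh(issued_at, expires_in, now, margin_seconds=60):
    """Return True once `now` is within `margin_seconds` of the token's expiry."""
    expiry = issued_at + timedelta(seconds=expires_in)
    return now >= expiry - timedelta(seconds=margin_seconds)


access_lifetime = str(timedelta(seconds=3599))
refresh_lifetime = str(timedelta(seconds=1209599))
print(access_lifetime)   # 0:59:59
print(refresh_lifetime)  # 13 days, 23:59:59

# Hypothetical issue time, used only to exercise the helper
issued = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(needs_refresh(issued, 3599, issued + timedelta(minutes=30)))              # False
print(needs_refresh(issued, 3599, issued + timedelta(minutes=59, seconds=30)))  # True
```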

A new access token can be obtained by means of the following procedure, without repeating all the previous steps:

In [17]:
# Get a new access token for the Viya environment using the refresh token
server = "https://create.demo.sas.com/"
url = f"{server.rstrip('/')}/SASLogon/oauth/token"
headers = {
    'Accept': 'application/json',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Authorization': 'Basic c2FzLmNsaTo='
}

with open('refresh_token.txt', 'r') as token_file:
    refresh_token = token_file.read().strip()

refresh_payload = f'grant_type=refresh_token&refresh_token={refresh_token}'

# response = requests.post(url, headers=headers, data=refresh_payload, verify=verification_file)
response = requests.post(url, headers=headers, data=refresh_payload, verify=False)
response.raise_for_status()  # fail early if the refresh token was rejected
new_access_token = response.json()['access_token']

# Establish a connection to SAS Viya using the new access token
st = Session(server, token=new_access_token, verify_ssl=False)
# To verify certificates instead, point to the .pem file and drop verify_ssl=False:
# os.environ['CAS_CLIENT_SSL_CA_LIST'] = verification_file
# st = Session(server, token=new_access_token)
st



<sasctl.core.Session at 0x7f317cca8910>

#### Creating a project in SAS Model Manager from Workbench

In [18]:
# Update the project_name and repository_name values to match your preferences
project_name = "Open Source Models - Workshop"
repository_name = "Public"

repository = mr.get_repository(repository_name)

try:
    project = mr.create_project(project_name, repository)
except Exception:
    # Creation fails, for example, when the project already exists: reuse it
    project = mr.get_project(project_name)

# Save the project id to a .txt file:
with open('project_id.txt', 'w') as file:
    file.write(project.id)

### 3.1 `sasviya` Model

The steps to follow are:

1. Obtaining the authorization code to the Viya instance of interest ✅
2. Establishing a connection to the Viya server ✅
3. Creating a project in SAS Model Manager from Workbench ✅
4. Specifying model parameters and pushing the model into SAS Model Manager

#### Specifying model parameters and pushing the model into SAS Model Manager

In [19]:
model_params = {
    "name": "sasviya_randomForest",
    "projectId": project.id,
    "type": "ASTORE",
}

astore = mr.post(
    "/models",
    files={"files": ("model_export.astore", sasviya_rf.export())},
    data=model_params,
)

### 3.2 `scikit-learn` Model

The steps to follow are:

1. Obtaining the authorization code to the Viya instance of interest ✅
2. Establishing a connection to the Viya server ✅
3. Creating a project in SAS Model Manager from Workbench ✅
4. Saving the pipeline to a pickle file and information about the model to JSON files
5. Importing the JSON files into the project
6. Creating the scoring code and adding both the pickle file and the scoring code to the model in SAS Model Manager

#### Saving the pipeline to a pickle file and information about the model to JSON files

In [20]:
# Define the model name
model_name = "sklearn_randomForest"

# Create the 'sklearn_mm_assets' folder in the current working directory (where the Jupyter Notebook is located)
current_directory = os.getcwd()
new_folder_name = 'sklearn_mm_assets'
new_folder_path = os.path.join(current_directory, new_folder_name)
os.makedirs(new_folder_path, exist_ok=True)

# Save the trained pipeline as a pickle file in the new folder
with open(os.path.join(new_folder_path, 'sklearnPipeline.pkl'), 'wb') as file:
    pickle.dump(pipeline, file)

# Example output row that defines the names and types of the output variables
target_df = pd.DataFrame(data=[[0.8, 0.2, "No"]],
                         columns=['P_InspectionNo', 'P_InspectionYes', 'I_Inspection'])

pzmm.JSONFiles.write_var_json(X, is_input=True, json_path="sklearn_mm_assets/")
pzmm.JSONFiles.write_var_json(target_df, is_input=False, json_path="sklearn_mm_assets/")
pzmm.JSONFiles.write_model_properties_json(
    model_name=model_name,
    model_desc="scikit-learn Random Forest Classification model",
    target_variable="Inspection",
    model_algorithm="sklearn.ensemble.RandomForestClassifier",
    target_values=["No", "Yes"],
    json_path="sklearn_mm_assets/",
    modeler='Mattia'
)

inputVar.json was successfully written and saved to sklearn_mm_assets/inputVar.json
outputVar.json was successfully written and saved to sklearn_mm_assets/outputVar.json
ModelProperties.json was successfully written and saved to sklearn_mm_assets/ModelProperties.json
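
The output variable names used in `target_df` follow a visible pattern: one `P_<target><level>` probability column per target level, followed by one `I_<target>` column for the predicted label. The sketch below derives them from the target name and its levels. The helper name `output_variable_names` is hypothetical, and the pattern is simply the one used in this notebook, not a statement about what SAS Model Manager requires.

```python
def output_variable_names(target, levels):
    """Return P_<target><level> for each level, then I_<target>."""
    probability_columns = [f"P_{target}{level}" for level in levels]
    return probability_columns + [f"I_{target}"]


columns = output_variable_names("Inspection", ["No", "Yes"])
print(columns)  # ['P_InspectionNo', 'P_InspectionYes', 'I_Inspection']
```

Deriving the names this way keeps `target_df`, the `"Output: ..."` docstring of the score function, and the returned DataFrame consistent if the target or its levels change.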


#### Importing the JSON files into the project

In [21]:
warnings.filterwarnings("ignore", message="The following arguments are required for the automatic generation of score code")
warnings.filterwarnings("ignore", message="This model's properties are different from the project's.")

import_model = pzmm.ImportModel.import_model(
    overwrite_model=True,
    model_files="sklearn_mm_assets/",
    model_prefix=model_name,
    project=project.id
)

All model files were zipped to sklearn_mm_assets.


#### Creating the scoring code and adding both the pickle file and the scoring code to the model in SAS Model Manager

- For more information on the format requirements for the score code, check out the [documentation](https://go.documentation.sas.com/doc/en/mdlmgrcdc/v_054/mdlmgrug/n04i7s6bdu7ilgn1e350am3byuxx.htm#p1sooft3tx23sgn1qr4puq0koo92).

In [22]:
%%writefile ./sklearn_mm_assets/sklearnPipelineScore.py
import settings
import pickle
import pandas as pd
import numpy as np

with open(settings.pickle_path+'/sklearnPipeline.pkl', "rb") as _pickle_file:
    _thisModelFit = pd.read_pickle(_pickle_file)

def scoreModel(CertificateOfOrigin, EUCitizen, Perishable, Fragile, Volume,
                 PreDeclared, MultiplePackage, Category, OnlineDeclaration,
                 ExporterValidation, SecuredDelivery, LithiumBatteries,
                 ExpressDelivery, EntryPoint, Origin, PaperlessBilling,
                 PaymentMethod, Weight, Price):
    "Output: P_InspectionNo, P_InspectionYes, I_Inspection"

    try:
        global _thisModelFit
    except NameError:
        with open(settings.pickle_path+'/sklearnPipeline.pkl', "rb") as _pickle_file:
            _thisModelFit = pd.read_pickle(_pickle_file)

    # Check if inputs are pandas Series, otherwise create a DataFrame with index [0]
    index = None
    if not isinstance(CertificateOfOrigin, pd.Series):
        index = [0]
        
    # Create the input DataFrame
    df = pd.DataFrame({
        'CertificateOfOrigin': CertificateOfOrigin,
        'EUCitizen': EUCitizen,
        'Perishable': Perishable,
        'Fragile': Fragile,
        'Volume': Volume,
        'PreDeclared': PreDeclared,
        'MultiplePackage': MultiplePackage,
        'Category': Category,
        'OnlineDeclaration': OnlineDeclaration,
        'ExporterValidation': ExporterValidation,
        'SecuredDelivery': SecuredDelivery,
        'LithiumBatteries': LithiumBatteries,
        'ExpressDelivery': ExpressDelivery,
        'EntryPoint': EntryPoint,
        'Origin': Origin,
        'PaperlessBilling': PaperlessBilling,
        'PaymentMethod': PaymentMethod,
        'Weight': Weight,
        'Price': Price
    }, index=index)
    
    # Generate predictions
    y_pred_prob = _thisModelFit.predict_proba(df)
    y_pred = _thisModelFit.predict(df)

    # Build the output DataFrame; this handles single and multiple predictions alike
    results = pd.DataFrame({
        'P_InspectionNo': [float(prob[0]) for prob in y_pred_prob],
        'P_InspectionYes': [float(prob[1]) for prob in y_pred_prob],
        'I_Inspection': [str(pred) for pred in y_pred]
    })
    return results

Writing ./sklearn_mm_assets/sklearnPipelineScore.py
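
The score script imports a `settings` module and reads `settings.pickle_path`, which is presumably supplied by the scoring environment rather than by this notebook. To smoke-test a score script locally before uploading it, one option is to register a stand-in `settings` module first. The sketch below shows the idea on a toy script; the helper `load_score_module`, the toy file `toyScore.py`, and its `where()` function are hypothetical and exist only for this demonstration.

```python
import importlib.util
import sys
import tempfile
import types
from pathlib import Path


def load_score_module(score_path, pickle_dir):
    """Import a score script locally, with a stand-in `settings` module."""
    # Stand-in module exposing only the attribute the score code reads
    stub = types.ModuleType("settings")
    stub.pickle_path = str(pickle_dir)
    sys.modules["settings"] = stub

    spec = importlib.util.spec_from_file_location("score_under_test", str(score_path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Toy script that, like the real one, resolves a path from settings.pickle_path
with tempfile.TemporaryDirectory() as tmp:
    toy_script = Path(tmp) / "toyScore.py"
    toy_script.write_text(
        "import settings\n\n"
        "def where():\n"
        "    return settings.pickle_path + '/sklearnPipeline.pkl'\n"
    )
    toy_module = load_score_module(toy_script, tmp)
    resolved = toy_module.where()
    expected = str(tmp) + "/sklearnPipeline.pkl"

print(resolved == expected)  # True
```

With the real files, the same helper could be pointed at `sklearn_mm_assets/sklearnPipelineScore.py` and the `sklearn_mm_assets` folder, after which `scoreModel` can be called with one row of input values.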


In [23]:
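model = mr.get_model(model_name)
pkl_path = 'sklearn_mm_assets/sklearnPipeline.pkl'
score_path = 'sklearn_mm_assets/sklearnPipelineScore.py'

# Upload the score code; the context manager closes the file afterwards
with open(score_path, 'rb') as score_file:
    scorefile = mr.add_model_content(
        model,
        score_file,
        name='sklearnPipelineScore.py',
        role='score'
    )

# Upload the pickle under its bare file name, since the score code
# loads 'sklearnPipeline.pkl' from settings.pickle_path
with open(pkl_path, 'rb') as pkl_file:
    python_pickle = mr.add_model_content(
        model,
        pkl_file,
        name=os.path.basename(pkl_path),
        role='python pickle'
    )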

