A recommender system has to be retrained whenever new items or users join the platform, so it makes sense to build a continuous training pipeline with Kubeflow.

# Setting Up GCP

## Setting global variables

In [None]:
# Read the project ID from the active gcloud configuration
PROJECT_ID = !(gcloud config get-value core/project)
PROJECT_ID = PROJECT_ID[0]
REGION = 'us-central1'  # Define your region here
BUCKET = 'gs://' + PROJECT_ID  # Our bucket; feel free to change it to fit your project

Execute the next cell if you have not created your bucket yet.

In [None]:
!gsutil mb -p $PROJECT_ID gs://$PROJECT_ID
# Grants public read access to the bucket; skip this line if the data should stay private
!gsutil acl ch -u AllUsers:R gs://$PROJECT_ID

PRIOR TO STARTING THE LAB: Make sure you create a new instance with AI Platform Pipelines. Once the GKE cluster is spun up, copy the endpoint because you will need it in this lab.

In [9]:
END_POINT = ""  # Your cluster endpoint (GKE host URL)
PIPELINE_NAME = "Rec_Anime_NNFC_TF"

All required APIs must be enabled. Run the next cell to enable them; the same command can also be run in Cloud Shell.

In [7]:
%%bash
gcloud services enable \
  serviceusage.googleapis.com \
  compute.googleapis.com \
  container.googleapis.com \
  iam.googleapis.com \
  servicemanagement.googleapis.com \
  cloudresourcemanager.googleapis.com \
  ml.googleapis.com \
  iap.googleapis.com \
  sqladmin.googleapis.com \
  meshconfig.googleapis.com \
  krmapihosting.googleapis.com \
  servicecontrol.googleapis.com \
  endpoints.googleapis.com

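The shell cells in this notebook only work when the `gcloud` and `gsutil` CLIs are on the `PATH` of the notebook kernel. A minimal sketch of a pre-flight check using only the standard library; the helper name `missing_tools` is hypothetical and not part of the original notebook:

```python
import shutil


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# The CLIs this notebook shells out to.
required = ["gcloud", "gsutil"]
missing = missing_tools(required)
if missing:
    print("Install or add to PATH before continuing:", ", ".join(missing))
else:
    print("All required CLIs are available.")
```

Running this before the `gcloud services enable` cell turns an exit status 127 into an explicit message about which tool is missing.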

In [None]:
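%%bash
# Bash cells do not see the notebook's Python variables, so look up the project ID again
PROJECT_ID=$(gcloud config get-value core/project)
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
CLOUD_BUILD_SERVICE_ACCOUNT="${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com"
# Give the Cloud Build service account the Editor role on the project
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member serviceAccount:$CLOUD_BUILD_SERVICE_ACCOUNT \
  --role roles/editor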

## Setting Up Anthos Service Mesh

Anthos Service Mesh is an Anthos-tested and supported distribution of Istio. It lets you create and deploy a service mesh on GKE on Google Cloud and on other platforms with full Google support. It is needed here for Google Kubernetes Engine.

In [2]:
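%%bash
# Bash cells do not see the notebook's Python variables, so look up the project ID again
PROJECT_ID=$(gcloud config get-value core/project)
curl --request POST \
  --header "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data '' \
  https://meshconfig.googleapis.com/v1alpha1/projects/${PROJECT_ID}:initialize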

# Training Script

The training script is defined below and written to the `NNFC` directory.

In [11]:
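import os
train_script_dir = "NNFC"  # Our training scripts will be saved in this directory
pipeline_dir = "pipeline"  # The pipeline definition is written to this directory later on
# %%writefile does not create missing directories, so create both up front
for directory in (train_script_dir, pipeline_dir):
    os.makedirs(directory, exist_ok=True)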

In [12]:
%%writefile ./NNFC/train.py

import fire
import pandas as pd
import os
import config
from Utils.data.preprocess import preprocessor_anime_data,preprocess_colabritive
from train.colabritive_system import NNCollaborativeFiltering
import pickle
import subprocess

def save_model(save_dir, model, save_tf: bool = True):
    """Save the trained model under `save_dir`.

    Args:
        save_dir: The model registry path to save the model to.
        model: The trained model we want to save.
        save_tf: If True, save in TensorFlow (HDF5) format; otherwise save as a pickle.
    """
    model_name = os.path.join(save_dir, "anime_recommender")
    if save_tf:
        model.save(f"{model_name}.h5")
    else:
        with open(f"{model_name}.pkl", "wb") as model_file:
            pickle.dump(model, model_file)

def train(training_dataset_path, output_dir, epochs):
    path_anime = os.path.join(training_dataset_path, "anime.csv")
    path_anime_list = os.path.join(training_dataset_path, "animelist.csv")

    # I want to test on user_id 0
    r_anime = pd.read_csv(path_anime, low_memory=True)  # That's the maximum for that

    anime_data = preprocessor_anime_data(r_anime).get_transformed_data()
    # Data behaves differently when loading the next row
    prepro_nnfc_class = preprocess_colabritive(path_anime_list, load_rows=88)  # That's why SQL is important

    # Assumption: `get_users_for_item` is a method of the preprocessing object
    x_user, x_item, y = prepro_nnfc_class.get_x_y_data_NNCF(
        prepro_nnfc_class.get_users_for_item(), anime_data)
    n_users, n_items = prepro_nnfc_class.get_num_user_items()

    model = NNCollaborativeFiltering(n_users=n_users, n_items=n_items)
    # The rest of the training parameters could be exposed as function parameters as well
    history, model = model.train_model(
        x_user, x_item, y, epochs=epochs, embedding_dims=10, d_layers=[10])

    save_model(output_dir, model, True)


if __name__ == '__main__':
    fire.Fire(train)

Writing ./NNFC/train.py
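The pickle branch of `save_model` is easy to check locally without a trained network. The sketch below is a simplified stand-in, not the notebook's function: `save_pickle` is a hypothetical helper, and a plain dictionary takes the place of the model. It only covers local paths, because the built-in `open` cannot write to a `gs://` location.

```python
import os
import pickle
import tempfile


def save_pickle(save_dir, model, name="anime_recommender"):
    """Pickle `model` to <save_dir>/<name>.pkl and return the file path."""
    os.makedirs(save_dir, exist_ok=True)
    path = os.path.join(save_dir, f"{name}.pkl")
    with open(path, "wb") as model_file:
        pickle.dump(model, model_file)
    return path


# A plain dict stands in for a trained model (illustrative only).
dummy_model = {"embedding_dims": 10, "d_layers": [10]}

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_pickle(tmp_dir, dummy_model)
    # Read the file back to confirm the round trip preserves the object.
    with open(saved_path, "rb") as model_file:
        restored = pickle.load(model_file)

print(restored == dummy_model)
```

The key points are that the file handle, not a file name, is passed to `pickle.dump`, and that the directory argument is used as a variable rather than as a quoted string.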


In [None]:
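%%writefile ./Dockerfile

# Kept at the repository root: the COPY paths below are relative to the build context
FROM gcr.io/deeplearning-platform-release/base-cpu
RUN pip install -U fire tensorflow==2.10 pandas scikit-learn
WORKDIR /app
COPY NNFC/train.py .
# Preserve the package layout so `Utils.data.preprocess` and `train.colabritive_system` stay importable.
# `train/` needs an `__init__.py`, otherwise `train.py` shadows the package.
COPY Utils/ ./Utils/
COPY train/ ./train/

ENTRYPOINT ["python", "train.py"]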

In [13]:
RECOM_NNFC_IMAGE_NAME='recomm_nnfc_image'
RECOM_NNFC_IMAGE_TAG='latest'
RECOM_NNFC_IMAGE_URI=f'gcr.io/{PROJECT_ID}/{RECOM_NNFC_IMAGE_NAME}:{RECOM_NNFC_IMAGE_TAG}'
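The image URI is only valid if `PROJECT_ID` really holds a project ID; when `gcloud` is missing, the variable can end up holding an error message instead. A minimal sketch of a guarded URI builder follows. The function `image_uri` is hypothetical, and the regular expression assumes the standard project ID format (6 to 30 characters, lowercase letters, digits and hyphens, starting with a letter), so it would reject legacy domain-scoped IDs.

```python
import re

# Assumption: standard (non domain-scoped) Google Cloud project IDs only.
PROJECT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def image_uri(project_id, name, tag="latest", registry="gcr.io"):
    """Build '<registry>/<project_id>/<name>:<tag>' after validating the project ID."""
    if not PROJECT_ID_PATTERN.fullmatch(project_id):
        raise ValueError(f"Not a valid project ID: {project_id!r}")
    return f"{registry}/{project_id}/{name}:{tag}"


# 'my-project' is a placeholder project ID.
print(image_uri("my-project", "recomm_nnfc_image"))
```

Failing early here is cheaper than discovering a malformed URI when Cloud Build or the pipeline tries to pull the image.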

In [None]:
# Build from the repository root so the COPY paths in the Dockerfile resolve
!gcloud builds submit --tag $RECOM_NNFC_IMAGE_URI .

# Pipeline script

In [None]:
%%writefile ./pipeline/recom_anime_tf_pipeline.py

import os
import kfp
from kfp.dsl.types import GCPProjectID
from kfp.dsl.types import GCPRegion
from kfp.dsl.types import GCSPath
from kfp.dsl.types import String
from kfp.gcp import use_gcp_secret
import kfp.components as comp
from kfp.components import create_component_from_func
import kfp.dsl as dsl
import kfp.gcp as gcp

TF_TRAINER_IMAGE = os.getenv('TF_TRAINER_IMAGE')
BUCKET = os.getenv('BUCKET')

# Paths to export the training/validation data from bigquery
TRAINING_OUTPUT_PATH = BUCKET + '/census/data/training.csv'
VALIDATION_OUTPUT_PATH = BUCKET + '/census/data/validation.csv'

COMPONENT_URL_SEARCH_PREFIX = 'https://raw.githubusercontent.com/kubeflow/pipelines/0.2.5/components/gcp/'

# Create component factories
component_store = kfp.components.ComponentStore(
    local_search_paths=None, url_search_prefixes=[COMPONENT_URL_SEARCH_PREFIX])

# Load the AI Platform Training op
mlengine_train_op = component_store.load_component('ml_engine/train')



@dsl.pipeline(
    name='Recom_Anime_NNFC_Pipeline',
    description='Pipeline that continuously trains the NNFC anime recommender model'
)
def pipeline(
    project_id,
    training_dataset_path,
    region='us-central1'
):
    # Output directory where the trained model will be saved
    tf_output_dir = BUCKET + '/Rec-Anime/models/tf'

    # Training arguments to be passed to the TF trainer.
    # `training_dataset_path` is the folder that holds anime.csv and animelist.csv
    tf_args = [
        '--training_dataset_path', training_dataset_path,
        '--output_dir', tf_output_dir,
        '--epochs', '130',
    ]

    # AI Platform Training job that runs the trainer image
    train_tf = mlengine_train_op(
        project_id=project_id,
        region=region,
        master_image_uri=TF_TRAINER_IMAGE,
        args=tf_args).set_display_name('Tensorflow Anime Recommender NNFC Model - AI Platform Training')
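The `tf_args` list reaches the container as command-line arguments, and `fire.Fire(train)` maps each `--flag value` pair onto the matching parameter of `train`, converting literal-looking values such as `130` to integers. The sketch below imitates that mapping with `argparse` from the standard library instead of `fire`, so it is only an approximation of the real parsing. The `train` stub and the `gs://example-bucket` paths are placeholders.

```python
import argparse


def train(training_dataset_path, output_dir, epochs):
    """Stub that echoes its arguments instead of training a model."""
    return {
        "training_dataset_path": training_dataset_path,
        "output_dir": output_dir,
        "epochs": epochs,
    }


def parse_and_call(argv):
    """Parse `--flag value` pairs and forward them to `train` as keyword arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--training_dataset_path", required=True)
    parser.add_argument("--output_dir", required=True)
    parser.add_argument("--epochs", type=int, required=True)
    args = parser.parse_args(argv)
    return train(**vars(args))


# Same shape as the tf_args list in the pipeline, with placeholder paths.
tf_args = [
    "--training_dataset_path", "gs://example-bucket/data",
    "--output_dir", "gs://example-bucket/Rec-Anime/models/tf",
    "--epochs", "130",
]
result = parse_and_call(tf_args)
print(result)
```

Because every flag name must match a parameter name of `train`, renaming a parameter in the training script also requires updating the argument list in the pipeline.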

We must set the training image name so the pipeline can use it.

In [14]:
# Reuse the image built earlier as the trainer image
TF_TRAINER_IMAGE = RECOM_NNFC_IMAGE_URI

In [15]:
%env TF_TRAINER_IMAGE={TF_TRAINER_IMAGE}
%env BUCKET={BUCKET}



# Compile Pipeline

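Compile the pipeline definition into a YAML package with `dsl-compile`, patch the launcher command, and upload the result to the cluster.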

In [None]:
!dsl-compile --py ./pipeline/recom_anime_tf_pipeline.py --output recom_anime_tf_pipeline.yaml


In [None]:
!head recom_anime_tf_pipeline.yaml


In [None]:
!sed -i 's/\"command\": \[\]/\"command\": \[python, -u, -m, kfp_component.launcher\]/g' recom_anime_tf_pipeline.yaml


In [None]:
!cat recom_anime_tf_pipeline.yaml | grep "component.launcher"
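The `sed` command above replaces every empty `"command": []` entry in the compiled file with the `kfp_component.launcher` invocation. A minimal Python sketch of the same substitution applied to a string; the function name `patch_launcher_command` and the sample text are illustrative only:

```python
def patch_launcher_command(yaml_text):
    """Replace empty command lists with the kfp_component launcher invocation."""
    empty = '"command": []'
    patched = '"command": [python, -u, -m, kfp_component.launcher]'
    return yaml_text.replace(empty, patched)


# Tiny stand-in for a fragment of the compiled pipeline file.
sample = '{"image": "example", "command": [], "args": ["--x"]}'
print(patch_launcher_command(sample))
```

Entries that already carry a command are left untouched, so applying the patch twice gives the same result as applying it once.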


In [None]:
!kfp --endpoint $END_POINT pipeline upload \
-p $PIPELINE_NAME \
./recom_anime_tf_pipeline.yaml

# Deploy the model

In [None]:
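%%bash
# Set necessary variables (bash cells do not see the notebook's Python variables):
PROJECT_ID=$(gcloud config get-value core/project)
REGION="us-central1"
BUCKET="gs://${PROJECT_ID}"
TFVERSION="2.10"  # Assumption: matches the tensorflow==2.10 pin in the Dockerfile
# Model and version names use underscores only; AI Platform does not accept hyphens or dots
MODEL_NAME="Rec_Anime_NNFC_TF"
MODEL_VERSION="v1_0"
# Pick the last entry in the directory the pipeline writes its models to
MODEL_LOCATION=$(gsutil ls ${BUCKET}/Rec-Anime/models/tf/ | tail -1)

# Set the region to global by executing the following command:
gcloud config set ai_platform/region global

echo "Deploying the model '$MODEL_NAME', version '$MODEL_VERSION' from $MODEL_LOCATION"
echo "... this will take a few minutes"

# Deploy trained model:
gcloud ai-platform models create ${MODEL_NAME} --regions $REGION
# Create a new AI Platform version.
gcloud ai-platform versions create ${MODEL_VERSION} \
  --model ${MODEL_NAME} \
  --origin ${MODEL_LOCATION} \
  --runtime-version $TFVERSION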