In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# [TODO] Add your H1 title heading here

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.10.13

## Overview

{TODO: Include a paragraph or two explaining what this example demonstrates, who should be interested in it, and what you need to know before you get started.}

Learn more about [web-doc-title](linkback-to-webdoc-page). {TODO: if more than one primary feature, add tag/linkback for each one}

### Objective

In this tutorial, you learn how to {TODO: Complete the sentence explaining briefly what you will learn from the notebook, such as
training, hyperparameter tuning, or serving}:

This tutorial uses the following Google Cloud ML services and resources:

- *{TODO: Add high level bullets for the services/resources demonstrated; e.g., Vertex AI Training}*


The steps performed include:

- *{TODO: Add high level bullets for the steps performed in the notebook}*

### Dataset

{TODO: Include a paragraph with Dataset information and where to obtain it.} 

{TODO: Make sure the dataset is accessible to the public. **Googlers**: Add your dataset to the [public samples bucket](http://goto/cloudsamples#sample-storage-bucket) within gs://cloud-samples-data/vertex-ai, if it doesn't already exist there.}

### Costs 

{TODO: Update the list of billable products that your tutorial uses.}

This tutorial uses billable components of Google Cloud:

* Vertex AI
* {TODO: BigQuery}
* Cloud Storage

{TODO: Include links to pricing documentation for each product you listed above.
 NOTE: If you use BigQuery or Dataflow, you need to add this to the pricing.
}

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
{ TODO: [BigQuery pricing](https://cloud.google.com/bigquery/pricing), }
and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), 
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following packages required to execute this notebook. 

{TODO: Suggest using the latest major GA version of each package; i.e., --upgrade}

In [1]:
! pip3 install --upgrade --quiet google-cloud-aiplatform

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com). {TODO: Update the APIs needed for your tutorial. Edit the API names, and update the link to append the API IDs, separating each one with a comma. For example, container.googleapis.com,cloudbuild.googleapis.com}

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [2]:
PROJECT_ID = "mlops-473205"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

Updated property [core/project].


#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [1]:
REGION = "us-central1"  # @param {type: "string"}

### Authenticate your Google Cloud account

The Cloud SDK, code and other libraries currently run as the service account identity of the Workbench Instance running this notebook.

**- Authenticate the Cloud SDK with your credentials:**

In [2]:
# ! gcloud auth login

**- Authenticate code and libraries with your credentials:**

In [None]:
# ! gcloud auth application-default login

**- Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

- *{Note to notebook author: For any user-provided strings that need to be unique (like bucket names or model IDs), append "-unique" to the end so proper testing can occur}*

In [None]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}

### Import libraries

In [None]:
from google.cloud import aiplatform

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project.

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

{TODO: Include commands to delete individual resources below}

In [1]:
import os

# Delete endpoint resource
# e.g. `endpoint.delete()`

# Delete model resource
# e.g. `model.delete()`

# Delete Cloud Storage objects that were created
delete_bucket = False
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI

In [4]:
import os
import pandas as pd
import numpy as np

file_path = "data.csv"
print("Checking file:", file_path, "exists->", os.path.exists(file_path))

Checking file: data.csv exists-> True


In [7]:
df = pd.read_csv(file_path)

df.head()

Unnamed: 0,sno,age,gender,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,0,63,male,3,145.0,233.0,1,0,150.0,0,2.3,0,0,1,yes
1,1,37,male,2,130.0,250.0,0,1,187.0,0,3.5,0,0,2,yes
2,2,41,female,1,130.0,204.0,0,0,172.0,0,1.4,2,0,2,yes
3,3,56,male,1,120.0,236.0,0,1,178.0,0,0.8,2,0,2,yes
4,4,57,female,0,,354.0,0,1,163.0,1,0.6,2,0,2,yes


In [8]:
print(df.dtypes)

sno           int64
age           int64
gender       object
cp            int64
trestbps    float64
chol        float64
fbs           int64
restecg       int64
thalach     float64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target       object
dtype: object


In [9]:
print(df.head(10).to_string(index=False))

 sno  age gender  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal target
   0   63   male   3     145.0 233.0    1        0    150.0      0      2.3      0   0     1    yes
   1   37   male   2     130.0 250.0    0        1    187.0      0      3.5      0   0     2    yes
   2   41 female   1     130.0 204.0    0        0    172.0      0      1.4      2   0     2    yes
   3   56   male   1     120.0 236.0    0        1    178.0      0      0.8      2   0     2    yes
   4   57 female   0       NaN 354.0    0        1    163.0      1      0.6      2   0     2    yes
   5   57   male   0     140.0 192.0    0        1    148.0      0      0.4      1   0     1    yes
   6   56 female   1     140.0 294.0    0        0    153.0      0      1.3      1   0     2    yes
   7   44   male   1     120.0 263.0    0        1    173.0      0      0.0      2   0     3    yes
   8   52   male   2     172.0 199.0    1        1    162.0      0      0.5      2   0     3    yes


In [10]:
missing = df.isna().sum().sort_values(ascending=False)
print("Missing values per column (top 20):\n", missing.head(20).to_string())

Missing values per column (top 20):
 thalach     5
trestbps    4
chol        1
gender      0
age         0
cp          0
sno         0
fbs         0
restecg     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
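
The counts above show a few missing values in `thalach`, `trestbps`, and `chol`. The preprocessing cell later in this notebook fills numeric gaps with `SimpleImputer(strategy='median')`. As a minimal sketch of what that step does, the example below imputes a tiny made-up frame; the values are illustrative assumptions and are not taken from `data.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with made-up values (assumption); column names mirror the dataset.
toy = pd.DataFrame({
    "trestbps": [120.0, np.nan, 140.0, 130.0],
    "chol": [200.0, 250.0, np.nan, 240.0],
})

# Learn one median per column from the observed values, then fill the gaps.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print("Missing before:", toy.isna().sum().to_dict())
print("Missing after: ", filled.isna().sum().to_dict())
print("Learned medians:", imputer.statistics_)
```

Because the medians are learned in `fit`, placing the imputer inside a pipeline means they are computed from the training split only and then reused on the test split.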


In [11]:
print("\nNumeric summary (describe):")
print(df.describe(include=[np.number]).transpose().to_string())


Numeric summary (describe):
          count        mean        std    min    25%    50%     75%    max
sno       303.0  151.000000  87.612784    0.0   75.5  151.0  226.50  302.0
age       303.0   54.366337   9.082101   29.0   47.5   55.0   61.00   77.0
cp        303.0    0.966997   1.032052    0.0    0.0    1.0    2.00    3.0
trestbps  299.0  131.712375  17.629032   94.0  120.0  130.0  140.00  200.0
chol      302.0  246.317881  51.908285  126.0  211.0  240.5  274.75  564.0
fbs       303.0    0.148515   0.356198    0.0    0.0    0.0    0.00    1.0
restecg   303.0    0.528053   0.525860    0.0    0.0    1.0    1.00    2.0
thalach   298.0  149.865772  22.563687   71.0  134.5  152.5  166.00  202.0
exang     303.0    0.326733   0.469794    0.0    0.0    0.0    1.00    1.0
oldpeak   303.0    1.039604   1.161075    0.0    0.0    0.8    1.60    6.2
slope     303.0    1.399340   0.616226    0.0    1.0    1.0    2.00    2.0
ca        303.0    0.729373   1.022606    0.0    0.0    0.0    1.00    

In [12]:
cat_cols = df.select_dtypes(include=['object','category','bool']).columns.tolist()
for c in cat_cols:
    uniques = df[c].nunique(dropna=False)
    sample = df[c].unique()[:10]
    print(f" - {c}: {uniques} unique; sample: {sample}")

 - gender: 2 unique; sample: ['male' 'female']
 - target: 2 unique; sample: ['yes' 'no']


In [19]:
# Preprocessing cell (self-contained)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import numpy as np
import pandas as pd  # needed below for pd.api.types

print("\n--- Preprocessing ---")
# assume df is already loaded and contains 'target'
# caution: X keeps 'sno', which looks like a row serial number rather than a clinical feature
X = df.drop('target', axis=1)
y = df['target']

# identify numeric and categorical feature columns
numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = X.select_dtypes(include=['object','category','bool']).columns.tolist()

print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)

# infer task type (regression vs classification)
if pd.api.types.is_numeric_dtype(y):
    unique_vals = y.nunique(dropna=True)
    print("Target is numeric with", unique_vals, "unique values.")
    # heuristic: treat as regression when many unique numeric values
    is_regression = unique_vals > 20
else:
    print("Target is non-numeric -> classification.")
    is_regression = False

print("Inferred task type:", "Regression" if is_regression else "Classification")

# simple pipelines
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)
    ],
    remainder='drop'
)

# train/test split: stratify only for classification
stratify_arg = None if is_regression else y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=stratify_arg
)

print("Train shape:", X_train.shape, "Test shape:", X_test.shape)



--- Preprocessing ---
Numeric columns: ['sno', 'age', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
Categorical columns: ['gender']
Target is non-numeric -> classification.
Inferred task type: Classification
Train shape: (242, 14) Test shape: (61, 14)
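
The split above passes `stratify=y` for classification so that both splits keep roughly the same class balance. A minimal sketch of that effect, using synthetic labels (an assumption for illustration, not the real data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption): 30 "yes" and 70 "no".
y_demo = pd.Series(["yes"] * 30 + ["no"] * 70)
X_demo = pd.DataFrame({"feature": np.arange(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of the "yes" class in each split; stratification keeps them aligned.
train_share = (y_tr == "yes").mean()
test_share = (y_te == "yes").mean()
print(f"yes share in train: {train_share:.2f}, in test: {test_share:.2f}")
```

Without `stratify`, a small test set can end up with a noticeably different class mix, which makes accuracy and per-class metrics harder to compare across runs.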


In [20]:
# Cell 5: Model selection, training, and evaluation
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             mean_squared_error, r2_score)
from sklearn.pipeline import make_pipeline

print("\n--- Training model ---")
if is_regression:
    model = RandomForestRegressor(n_estimators=100, random_state=42)
else:
    model = RandomForestClassifier(n_estimators=200, random_state=42, class_weight='balanced')

pipeline = make_pipeline(preprocessor, model)
pipeline.fit(X_train, y_train)
print("Model trained.")

# Predictions and evaluation
y_pred = pipeline.predict(X_test)

if is_regression:
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"Regression results: MSE = {mse:.4f}, RMSE = {mse**0.5:.4f}, R2 = {r2:.4f}")
else:
    acc = accuracy_score(y_test, y_pred)
    print(f"Classification accuracy on test set: {acc:.4f}")
    print("\nClassification report:")
    print(classification_report(y_test, y_pred, zero_division=0))
    print("\nConfusion matrix (rows=true, cols=pred):")
    print(confusion_matrix(y_test, y_pred))



--- Training model ---
Model trained.
Classification accuracy on test set: 1.0000

Classification report:
              precision    recall  f1-score   support

          no       1.00      1.00      1.00        28
         yes       1.00      1.00      1.00        33

    accuracy                           1.00        61
   macro avg       1.00      1.00      1.00        61
weighted avg       1.00      1.00      1.00        61


Confusion matrix (rows=true, cols=pred):
[[28  0]
 [ 0 33]]


In [21]:
# Cell 6: Feature importance (approximate) for tree models
print("\n--- Feature importance (approximate) ---")
try:
    # Need to extract feature names after preprocessing
    feature_names = []
    if numeric_cols:
        feature_names.extend(numeric_cols)
    if categorical_cols:
        # Extract categories from fitted OneHotEncoder
        ohe = pipeline.named_steps['columntransformer'].named_transformers_['cat'].named_steps['onehot']
        ohe_feature_names = ohe.get_feature_names_out(categorical_cols).tolist()
        feature_names.extend(ohe_feature_names)
    # Extract feature importances from the final estimator
    importances = pipeline[-1].feature_importances_
    fi = pd.Series(importances, index=feature_names).sort_values(ascending=False).head(30)
    print(fi.to_string())
except Exception as e:
    print("Could not compute feature importances:", e)



--- Feature importance (approximate) ---
sno              0.539582
cp               0.082221
thal             0.067006
oldpeak          0.045896
thalach          0.043717
ca               0.041247
slope            0.039380
exang            0.036297
chol             0.029058
age              0.028326
trestbps         0.019960
gender_female    0.009754
gender_male      0.007617
restecg          0.006956
fbs              0.002982
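
The table ranks `sno` first with an importance of about 0.54, and the test accuracy reported earlier is 1.0000. `sno` looks like a row serial number. If the rows in `data.csv` happen to be ordered by `target` (the first rows shown are all `yes`), the identifier would leak the label and inflate the score. That is a hypothesis to check, not a confirmed fact about this dataset. The sketch below uses synthetic data to show the effect and one possible remedy, which is to drop the identifier column before training. The helper `holdout_accuracy` and the `noise` column are hypothetical names introduced for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300

# Synthetic data (assumption): rows sorted by label, standing in for a file
# where a serial-number column happens to track the target.
demo = pd.DataFrame({
    "sno": np.arange(n),                     # row identifier
    "noise": rng.normal(size=n),             # feature unrelated to the label
    "target": ["yes"] * 150 + ["no"] * 150,  # labels follow row order
})


def holdout_accuracy(feature_cols):
    """Train a random forest on the given columns and return test accuracy."""
    X = demo[feature_cols]
    y = demo["target"]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)


acc_with_id = holdout_accuracy(["sno", "noise"])
acc_without_id = holdout_accuracy(["noise"])
print(f"with sno: {acc_with_id:.2f}, without sno: {acc_without_id:.2f}")
```

For the notebook itself, the equivalent change would be to build the features with `df.drop(['target', 'sno'], axis=1)` and re-run training, then compare the new accuracy and importance table against the ones shown here.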


In [25]:
# Cell 7: Save pipeline to disk (optional)
import joblib

# name of the target column, used to label the prediction column below
target = "target"

model_path = "./trained_pipeline.joblib"
joblib.dump(pipeline, model_path)
print("Saved trained pipeline to:", model_path)

# Provide a small predict example on the first 5 test rows
print("\nExample predictions for first 5 test samples:")
example_X = X_test.head(5)
example_preds = pipeline.predict(example_X)
example_out = example_X.copy()
example_out['predicted_' + target] = example_preds

example_out.head()


Saved trained pipeline to: ./trained_pipeline.joblib

Example predictions for first 5 test samples:


Unnamed: 0,sno,age,gender,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,predicted_target
179,179,57,male,0,150.0,276.0,0,0,112.0,1,0.6,1,1,1,no
197,197,67,male,0,125.0,254.0,1,1,163.0,0,0.2,1,2,3,no
285,285,46,male,0,140.0,311.0,0,1,120.0,1,1.8,1,2,3,no
194,194,60,male,2,140.0,185.0,0,0,155.0,0,3.0,1,0,2,no
188,188,50,male,2,140.0,233.0,0,1,163.0,0,0.6,1,1,3,no
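
A saved pipeline is only useful if it can be loaded back and produce the same predictions. The sketch below shows that round trip with `joblib.dump` and `joblib.load`. It trains a small stand-in pipeline on synthetic data (an assumption, so the example runs without `data.csv`) and writes to a temporary directory rather than `./trained_pipeline.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): the label depends on the sign of "a".
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"a": rng.normal(size=40), "b": rng.normal(size=40)})
y_demo = np.where(X_demo["a"] > 0, "yes", "no")

demo_pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=10, random_state=42),
)
demo_pipeline.fit(X_demo, y_demo)

# Save to a temporary file, then load it back as a separate object.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_pipeline.joblib")
    joblib.dump(demo_pipeline, demo_path)
    restored = joblib.load(demo_path)

# The restored pipeline should reproduce the original predictions.
same_predictions = np.array_equal(
    demo_pipeline.predict(X_demo), restored.predict(X_demo)
)
print("Predictions match after reload:", same_predictions)
```

Joblib files are pickle-based, so load them only from trusted sources, and preferably with the same scikit-learn version that was used to save them.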
