In [4]:
import sagemaker
import boto3
import os

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Use an existing bucket (this cell does not create one)
s3 = boto3.client('s3')
bucket_name = "c160506a4117400l11163953t1w772864730-sandboxbucket-qaengkzb6snp"
print(f"Using bucket: {bucket_name}")




sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
Using bucket: c160506a4117400l11163953t1w772864730-sandboxbucket-qaengkzb6snp


**Purpose:**  
This cell sets up the SageMaker and S3 environment for the notebook.

**Step-by-step:**
- **Imports**:  
  - `sagemaker` → to interact with Amazon SageMaker services.  
  - `boto3` → AWS SDK for Python, used to interact with S3 and other AWS services.  
  - `os` → for handling local file paths and environment variables (imported here but not used in this cell).  

- **SageMaker session & role**:  
  - `session = sagemaker.Session()` → creates a SageMaker session, the object used later to upload data to S3.  
  - `role = sagemaker.get_execution_role()` → retrieves the IAM role that grants permissions to access AWS resources.

- **S3 setup**:  
  - `s3 = boto3.client('s3')` → creates an S3 client to upload/download data.  
  - `bucket_name` → stores the S3 bucket name where data and models will be saved.  
  - `print(...)` → confirms the bucket being used.


In [5]:
local_data = 'health_data.csv'

prefix = 'insurance-rf'
data_s3_path = session.upload_data(local_data, bucket=bucket_name, key_prefix=f'{prefix}/data')

print("Full dataset S3 path:", data_s3_path)


Full dataset S3 path: s3://c160506a4117400l11163953t1w772864730-sandboxbucket-qaengkzb6snp/insurance-rf/data/health_data.csv


**Purpose:**  
This cell uploads the local dataset to the specified S3 bucket so it can be accessed in AWS.

**Step-by-step:**
- `local_data = 'health_data.csv'` → specifies the path/name of the local CSV file containing the dataset.
- `prefix = 'insurance-rf'` → sets a folder prefix in the S3 bucket to organize related files.
- `data_s3_path = session.upload_data(...)` → uploads the local file to S3 under `<prefix>/data` inside the chosen bucket and returns the resulting S3 URI.  
  - `bucket=bucket_name` → target bucket.  
  - `key_prefix=f'{prefix}/data'` → folder structure in S3.
- `print(...)` → displays the full S3 path where the dataset is stored.
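
The printed path follows directly from the bucket name, the key prefix, and the local file name. A minimal sketch of that composition, assuming the upload keeps the base name of the local file as the object name (the helper `expected_s3_uri` is hypothetical and not part of the SageMaker SDK):

```python
import os

def expected_s3_uri(local_path, bucket, key_prefix):
    """Compose the S3 URI that an upload with this bucket and prefix should produce."""
    # Only the file's base name is kept; local directories are dropped.
    filename = os.path.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{filename}"

bucket_name = "c160506a4117400l11163953t1w772864730-sandboxbucket-qaengkzb6snp"
prefix = "insurance-rf"

uri = expected_s3_uri("health_data.csv", bucket_name, f"{prefix}/data")
print(uri)
```

Comparing this string with the value returned by `upload_data` is a quick way to confirm the file landed where later cells expect it.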


In [6]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np


df = pd.read_csv(f's3://{bucket_name}/{prefix}/data/health_data.csv')


target_column = 'charges'  # change if your target is named differently
X = df.drop(target_column, axis=1)
y = df[target_column]


categorical_cols = X.select_dtypes(include=['object']).columns
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
], remainder='passthrough')


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


model = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('rf', RandomForestRegressor(
        n_estimators=100,
        max_depth=10,
        random_state=42
    ))
])

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=kf, scoring='r2')
print(f"Cross-validation R² scores: {scores}")
print(f"Mean CV R²: {np.mean(scores):.4f}")


model.fit(X_train, y_train)


print(f"Final Train R²: {model.score(X_train, y_train):.4f}")
print(f"Final Test R²: {model.score(X_test, y_test):.4f}")





Cross-validation R² scores: [0.81049434 0.82718637 0.87472705 0.81790796 0.8486359 ]
Mean CV R²: 0.8358
Final Train R²: 0.9652
Final Test R²: 0.8641


**Purpose:**  
This cell loads the dataset from S3, preprocesses it, builds a Random Forest regression model, evaluates it with cross-validation, and fits the final model.

**Step-by-step:**
1. **Imports**:  
   - `pandas` for data handling.  
   - `RandomForestRegressor` for building the prediction model.  
   - `train_test_split`, `KFold`, `cross_val_score` for model evaluation.  
   - `OneHotEncoder` for encoding categorical variables.  
   - `ColumnTransformer` for applying transformations to specific columns.  
   - `Pipeline` for combining preprocessing and modeling steps.  
   - `numpy` for numerical operations.

2. **Load dataset**:  
   - `pd.read_csv(f's3://...')` → loads the CSV file directly from the S3 bucket into a DataFrame `df`.

3. **Separate features and target**:  
   - `target_column = 'charges'` → defines the column to predict.  
   - `X = df.drop(target_column, axis=1)` → all input features.  
   - `y = df[target_column]` → target variable.

4. **Preprocessing setup**:  
   - `categorical_cols = X.select_dtypes(include=['object']).columns` → finds columns with string/object data.  
   - `preprocessor = ColumnTransformer([...], remainder='passthrough')` → applies One-Hot Encoding to categorical columns and keeps all other columns as-is.

5. **Train-test split**:  
   - Splits the dataset into 80% training and 20% testing (`test_size=0.2`); `random_state=42` makes the split reproducible.

6. **Pipeline creation**:  
   - Combines preprocessing and Random Forest model into a single `Pipeline` for easier training and prediction.

7. **Cross-validation**:  
   - Uses 5-fold cross-validation (`KFold` with shuffling and `random_state=42`) to evaluate model performance on the training data only.  
   - `cross_val_score` returns one R² score per fold.  
   - Prints the per-fold scores and their mean.

8. **Model training & evaluation**:  
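   - `model.fit(...)` trains the pipeline on the full training set.  
   - Prints final R² scores for both train and test sets. Here the train R² (0.9652) is noticeably higher than the test R² (0.8641), which suggests some overfitting; the test score is roughly in line with the mean cross-validation R² (0.8358).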
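
The R² values reported in this cell all use the same definition, 1 − SS_res / SS_tot. The sketch below is only an illustration: it computes R² by hand on a few made-up numbers (the toy `y_true` and `y_pred` values are hypothetical), then summarizes the per-fold scores copied from the cell output.

```python
from statistics import mean, stdev

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical toy values, used only to exercise the formula.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
toy_r2 = r_squared(y_true, y_pred)

# Per-fold scores copied from the cross-validation output of this cell.
cv_scores = [0.81049434, 0.82718637, 0.87472705, 0.81790796, 0.8486359]
cv_mean = mean(cv_scores)
cv_spread = stdev(cv_scores)  # sample standard deviation across folds

print(f"Toy R²: {toy_r2:.4f}")
print(f"Mean CV R²: {cv_mean:.4f} (spread across folds: {cv_spread:.4f})")
```

Reporting the spread alongside the mean gives a sense of how much the score depends on which rows fall in each fold.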


In [8]:
sample_input = pd.DataFrame([{
    'age': 18,
    'sex': 'male',
    'bmi': 33.77,
    'children': 1,
    'smoker': 'no',
    'region': 'southeast'
}])

predicted_charge = model.predict(sample_input)
print(f"Predicted charge: {predicted_charge[0]:.2f}")


Predicted charge: 3884.79


**Purpose:**  
This cell creates a sample input for prediction and uses the trained model to estimate insurance charges.

**Step-by-step:**
1. **Create sample input**:  
   - `pd.DataFrame([{...}])` → creates a one-row DataFrame containing the same columns as the training data (`age`, `sex`, `bmi`, `children`, `smoker`, `region`).  
   - This simulates a new customer’s data.

2. **Make prediction**:  
   - `model.predict(sample_input)` → passes the new data through the preprocessing pipeline and the Random Forest model to generate a prediction.  
   - The result is a NumPy array with one value.

3. **Display result**:  
   - `print(...)` → prints the predicted insurance charge, formatted to two decimal places.
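
Because the pipeline was fitted on a DataFrame, a missing or misspelled column in a new record typically surfaces only as an error at predict time. A minimal sketch of a pre-check, assuming the six feature names shown in the sample input (the helper `check_record` is hypothetical):

```python
EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region"]

def check_record(record, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names for one input record."""
    missing = [col for col in expected if col not in record]
    unexpected = [col for col in record if col not in expected]
    return missing, unexpected

sample = {
    "age": 18, "sex": "male", "bmi": 33.77,
    "children": 1, "smoker": "no", "region": "southeast",
}
missing, unexpected = check_record(sample)
print("missing:", missing, "unexpected:", unexpected)

# A record with a misspelled key and several absent fields.
bad = {"age": 18, "sex": "male", "BMI": 33.77}
print(check_record(bad))
```

Running the check before building the DataFrame makes it easier to report which field is wrong instead of relying on the pipeline's error message.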


In [7]:
import joblib
import tarfile
import boto3
import os

# Save model locally as .joblib
model_filename = "rf_insurance_model.joblib"
joblib.dump(model, model_filename)

# Create tar.gz containing the model
tar_filename = "model.tar.gz"
with tarfile.open(tar_filename, "w:gz") as tar:
    tar.add(model_filename, arcname=model_filename)

# Upload to S3 in correct format
model_s3_path = f"s3://{bucket_name}/{prefix}/model/{tar_filename}"
s3.upload_file(tar_filename, bucket_name, f"{prefix}/model/{tar_filename}")

print("Model packaged & uploaded to:", model_s3_path)


Model packaged & uploaded to: s3://c160506a4117400l11163953t1w772864730-sandboxbucket-qaengkzb6snp/insurance-rf/model/model.tar.gz


**Purpose:**  
This cell saves the trained model, packages it for storage, and uploads it to the S3 bucket.

**Step-by-step:**
1. **Save the model locally**:  
   - `joblib.dump(model, model_filename)` → stores the trained pipeline (preprocessing + model) in a `.joblib` file so it can be reused without retraining.

2. **Package the model**:  
   - Opens a `.tar.gz` archive in write mode.  
   - Adds the `.joblib` model file to the archive.  
   - SageMaker expects model artifacts as a gzipped tar archive (`model.tar.gz`), which is why the model is packaged this way.

3. **Upload to S3**:  
   - `s3.upload_file(...)` → uploads the packaged model to the S3 bucket under the `<prefix>/model/` key prefix (here `insurance-rf/model/`).  
   - `model_s3_path` stores the full S3 path for reference.

4. **Confirmation**:  
   - Prints the S3 location where the model is saved.
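
Before uploading, it can help to confirm that the archive contains the model file at its root. The sketch below reproduces the packaging step in a temporary directory, with placeholder bytes standing in for the real `.joblib` file; the helper names `package_artifact` and `list_members` are hypothetical.

```python
import os
import tarfile
import tempfile

def package_artifact(artifact_path, tar_path):
    """Write a gzipped tar whose only member is the artifact, stored at the archive root."""
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(artifact_path, arcname=os.path.basename(artifact_path))

def list_members(tar_path):
    """Return the member names stored in a gzipped tar archive."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return tar.getnames()

with tempfile.TemporaryDirectory() as workdir:
    # Placeholder bytes stand in for the real joblib file.
    artifact = os.path.join(workdir, "rf_insurance_model.joblib")
    with open(artifact, "wb") as fh:
        fh.write(b"placeholder")

    archive = os.path.join(workdir, "model.tar.gz")
    package_artifact(artifact, archive)
    members = list_members(archive)

print(members)
```

Using `arcname` with the base name keeps local directory paths out of the archive, so the file extracts to the same relative location everywhere.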


In [11]:
import boto3
import joblib
import pandas as pd


bucket_name = "c160506a4117400l11163953t1w772864730-sandboxbucket-aci4rul7kn8r"
model_key = "insurance-rf/model/model.tar.gz"  # Path in S3


local_tar = "model.tar.gz"
local_model_filename = "rf_insurance_model.joblib"


s3 = boto3.client('s3')
s3.download_file(bucket_name, model_key, local_tar)


import tarfile
with tarfile.open(local_tar, "r:gz") as tar:
    tar.extractall()

model = joblib.load(local_model_filename)


sample_input = pd.DataFrame([{
    'age': 24,
    'sex': 'female',
    'bmi': 26.6,
    'children': 0,
    'smoker': 'no',
    'region': 'northeast'
}])


predicted_charge = model.predict(sample_input)
print(f"Predicted charge: {predicted_charge[0]:.2f}")


Predicted charge: 4493.04


**Purpose:**  
This cell downloads the trained model from S3, loads it locally, and uses it to make a prediction without deploying to SageMaker.

**Step-by-step:**
1. **S3 configuration**:  
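   - `bucket_name` → name of the S3 bucket where the model is stored. The value set in this cell differs from the bucket used in the earlier cells, so it must point at the bucket that actually holds the uploaded archive.  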
   - `model_key` → path to the packaged `.tar.gz` model file in S3.  
   - `local_tar` and `local_model_filename` → local file names for the downloaded archive and extracted model.

2. **Download from S3**:  
   - Creates an S3 client with `boto3.client('s3')`.  
   - `download_file(...)` → retrieves the model archive from S3 and saves it locally as `model.tar.gz`.

3. **Extract model**:  
   - Opens the `.tar.gz` archive in read mode and extracts its contents (the `.joblib` model file) into the current directory.

4. **Load model**:  
   - `joblib.load(local_model_filename)` → loads the trained pipeline into memory for inference.

5. **Prepare input data**:  
   - Creates a one-row DataFrame with the same feature columns as the training dataset to simulate a new prediction request.

6. **Make prediction**:  
   - `model.predict(sample_input)` → runs the preprocessing and prediction steps on the new data.  
   - Prints the predicted insurance charge, formatted to two decimal places.
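
The extraction step can be made a little more defensive by checking the archive before unpacking it. The sketch below is one possible approach rather than the notebook's method: it refuses members that would land outside the destination folder and confirms the expected model file is present. The helper `extract_checked` is hypothetical, and the demo uses a placeholder file instead of a real model.

```python
import os
import tarfile
import tempfile

def extract_checked(tar_path, dest_dir, expected_name):
    """Extract the archive into dest_dir and return the path to expected_name."""
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(dest_root, member.name))
            # Refuse members that would be written outside dest_dir.
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        if expected_name not in tar.getnames():
            raise FileNotFoundError(f"{expected_name} not found in {tar_path}")
        tar.extractall(path=dest_root)
    return os.path.join(dest_root, expected_name)

with tempfile.TemporaryDirectory() as workdir:
    # Build a stand-in archive with placeholder content.
    src = os.path.join(workdir, "rf_insurance_model.joblib")
    with open(src, "wb") as fh:
        fh.write(b"placeholder")
    archive = os.path.join(workdir, "model.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="rf_insurance_model.joblib")

    out_dir = os.path.join(workdir, "model_extracted")
    model_path = extract_checked(archive, out_dir, "rf_insurance_model.joblib")
    extracted_ok = os.path.isfile(model_path)

print(os.path.basename(model_path), extracted_ok)
```

In the notebook, the returned path would then be passed to `joblib.load`; that call is left out here because the placeholder file is not a real model.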


In [9]:
%pip install streamlit


Collecting streamlit
  Using cached streamlit-1.48.1-py3-none-any.whl.metadata (9.5 kB)
Collecting altair!=5.4.0,!=5.4.1,<6,>=4.0 (from streamlit)
  Using cached altair-5.5.0-py3-none-any.whl.metadata (11 kB)
Collecting cachetools<7,>=4.0 (from streamlit)
  Using cached cachetools-6.1.0-py3-none-any.whl.metadata (5.4 kB)
Collecting gitpython!=3.1.19,<4,>=3.0.7 (from streamlit)
  Using cached gitpython-3.1.45-py3-none-any.whl.metadata (13 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit)
  Using cached pydeck-0.9.1-py2.py3-none-any.whl.metadata (4.1 kB)
Collecting gitdb<5,>=4.0.1 (from gitpython!=3.1.19,<4,>=3.0.7->streamlit)
  Using cached gitdb-4.0.12-py3-none-any.whl.metadata (1.2 kB)
Collecting smmap<6,>=3.0.1 (from gitdb<5,>=4.0.1->gitpython!=3.1.19,<4,>=3.0.7->streamlit)
  Using cached smmap-5.0.2-py3-none-any.whl.metadata (4.3 kB)
Using cached streamlit-1.48.1-py3-none-any.whl (9.9 MB)
Using cached altair-5.5.0-py3-none-any.whl (731 kB)
Using cached cachetools-6.1.0-py3-none-any

In [11]:
import boto3

s3 = boto3.client('s3')

bucket_name = "c160506a4117400l11163953t1w772864730-sandboxbucket-qaengkzb6snp"
key = "insurance-rf/model/model.tar.gz"
local_file = "model.tar.gz"

s3.download_file(bucket_name, key, local_file)
print("✅ Model downloaded from S3")


✅ Model downloaded from S3


In [14]:
import tarfile
import os

# Path to your tar.gz file
tar_path = "model.tar.gz"
extract_path = "model_extracted"

# Create a folder to extract contents
os.makedirs(extract_path, exist_ok=True)

# Extract tar.gz
with tarfile.open(tar_path, "r:gz") as tar:
    tar.extractall(path=extract_path)

print("✅ Model extracted")

# Check extracted files
print("Contents:", os.listdir(extract_path))


✅ Model extracted
Contents: ['rf_insurance_model.joblib']


In [18]:
import joblib

# Path to your extracted model file
model_path = "model_extracted/rf_insurance_model.joblib"

# Load the model
model = joblib.load(model_path)

print("✅ Model loaded successfully")


✅ Model loaded successfully
