# User Segmentation with SageMaker

This notebook walks through building a user segmentation model with Amazon SageMaker. It uses the built-in K-Means algorithm to cluster users on features drawn from the user and CRM tables in Redshift.

## 1. Setup and Initialization

Import the necessary libraries, initialize the SageMaker session, and define the execution role and S3 buckets.

In [None]:
import sagemaker
import boto3
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role()

# TODO: Replace with your S3 bucket names
data_bucket = 'my-awesome-project-csv-bucket-dev' 
model_artifacts_bucket = 'my-awesome-project-sagemaker-artifacts-dev'

print(f'SageMaker role is: {role}')
print(f'Using data bucket: {data_bucket}')

## 2. Data Extraction and Preprocessing

Load the user and CRM tables from Redshift, then preprocess them for training. K-Means operates on a fully numeric feature matrix, so this step typically involves handling missing values, scaling features, and one-hot encoding categorical columns.
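
As a minimal sketch of those preprocessing steps, the function below merges two tables on `user_id`, fills missing values, standardizes the numeric columns, and one-hot encodes the remaining ones using pandas only. The column names (`age`, `plan`, `spend`), the median and `'unknown'` fill choices, and the `preprocess` helper itself are illustrative assumptions, not this project's actual schema.

```python
import pandas as pd


def preprocess(user_data: pd.DataFrame, crm_data: pd.DataFrame) -> pd.DataFrame:
    """Merge on user_id and return an all-numeric float32 feature matrix."""
    combined = pd.merge(user_data, crm_data, on="user_id", how="inner")
    raw = combined.drop(columns=["user_id"])

    numeric_cols = raw.select_dtypes(include="number").columns
    categorical_cols = raw.columns.difference(numeric_cols)

    # Numeric columns: fill gaps with the column median, then standardize.
    numeric = raw[numeric_cols]
    numeric = numeric.fillna(numeric.median())
    std = numeric.std(ddof=0).replace(0, 1.0)  # avoid dividing by zero
    numeric = (numeric - numeric.mean()) / std

    # Categorical columns: fill gaps with a sentinel, then one-hot encode.
    categorical = pd.get_dummies(
        raw[categorical_cols].fillna("unknown"),
        columns=list(categorical_cols),
        dtype=float,
    )
    return pd.concat([numeric, categorical], axis=1).astype("float32")


# Hypothetical sample rows standing in for the Redshift tables.
user_data = pd.DataFrame(
    {"user_id": [1, 2, 3], "age": [20.0, None, 40.0], "plan": ["free", "pro", None]}
)
crm_data = pd.DataFrame({"user_id": [1, 2, 3], "spend": [10.0, 20.0, 30.0]})

features = preprocess(user_data, crm_data)
print(features)
```

The number of columns in the result is the value the K-Means `feature_dim` hyperparameter needs, so one-hot encoding changes it whenever a new category appears.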

In [None]:
import pandas as pd  # needed by the read_sql / merge examples below

# TODO: Write code to connect to Redshift and load data
# You can use libraries like psycopg2 or SQLAlchemy

# Example: Load data into a pandas DataFrame
# crm_data = pd.read_sql('SELECT * FROM crm_data', redshift_connection)
# user_data = pd.read_sql('SELECT * FROM user_data', redshift_connection)

# TODO: Merge and preprocess the data
# combined_data = pd.merge(user_data, crm_data, on='user_id')
# preprocessed_data = ...

# TODO: Upload the preprocessed data to S3 for training
# preprocessed_data.to_csv('train.csv', header=False, index=False)
# sagemaker_session.upload_data('train.csv', bucket=data_bucket, key_prefix='user-segmentation/training')

## 3. Model Training

Train the clustering model with SageMaker's built-in K-Means algorithm, run through a generic `Estimator` that points at the K-Means container image.
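
The `feature_dim` hyperparameter has to match the number of feature columns in the training file. A minimal sketch of deriving it from the data instead of hard-coding it, assuming a headerless CSV that holds only feature columns (no label column); `infer_feature_dim` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io
from typing import Iterable


def infer_feature_dim(lines: Iterable[str]) -> int:
    """Return the column count of a headerless CSV, checking that all rows agree."""
    widths = {len(row) for row in csv.reader(lines) if row}
    if len(widths) != 1:
        raise ValueError(f"expected one consistent row width, found {sorted(widths)}")
    return widths.pop()


# Stand-in for the contents of train.csv.
sample = io.StringIO("0.1,0.2,0.3\n0.4,0.5,0.6\n")
feature_dim = infer_feature_dim(sample)
print(feature_dim)
```

For a file on disk, pass the open file handle: `with open('train.csv', newline='') as f: infer_feature_dim(f)`.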

In [None]:
from sagemaker import image_uris
from sagemaker.estimator import Estimator

# Get the pre-built K-Means image for the current region.
# SageMaker Python SDK v2 API; it replaces the deprecated get_image_uri.
container = image_uris.retrieve(framework='kmeans', region=boto3.Session().region_name)

# Create the Estimator
kmeans = Estimator(container,
                   role,
                   instance_count=1,              # train_instance_count in SDK v1
                   instance_type='ml.c4.xlarge',  # train_instance_type in SDK v1
                   output_path=f's3://{model_artifacts_bucket}/user-segmentation/output',
                   sagemaker_session=sagemaker_session)

# Set hyperparameters
kmeans.set_hyperparameters(k=5, # Example: 5 clusters
                           feature_dim=10) # TODO: Set the correct feature dimension

# Train the model. The content type declares a headerless CSV with no label
# column, since K-Means is unsupervised.
# from sagemaker.inputs import TrainingInput
# train_data_path = f's3://{data_bucket}/user-segmentation/training'
# kmeans.fit({'train': TrainingInput(train_data_path, content_type='text/csv;label_size=0')})

## 4. Model Deployment

Deploy the trained model to a SageMaker endpoint for real-time inference.

In [None]:
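# Deploy the model; send rows as CSV and parse the JSON response (SDK v2)
# from sagemaker.serializers import CSVSerializer
# from sagemaker.deserializers import JSONDeserializer
# kmeans_predictor = kmeans.deploy(initial_instance_count=1, instance_type='ml.t2.medium',
#                                  serializer=CSVSerializer(), deserializer=JSONDeserializer())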

## 5. Inference and Cleanup

Make predictions with the deployed endpoint and then delete the endpoint to avoid incurring costs.
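
The built-in K-Means endpoint is documented as returning JSON with one entry per input row, each carrying `closest_cluster` and `distance_to_cluster`. A minimal sketch of turning such a response into per-user segment assignments, assuming it has already been deserialized into a dict; the sample values and the `assign_segments` helper are made up for illustration, so check the shape against what your endpoint actually returns.

```python
from collections import Counter


def assign_segments(user_ids, response):
    """Map each user ID to the integer cluster label predicted for its row."""
    predictions = response["predictions"]
    if len(predictions) != len(user_ids):
        raise ValueError("expected exactly one prediction per user")
    return {
        uid: int(pred["closest_cluster"])
        for uid, pred in zip(user_ids, predictions)
    }


# Hypothetical response for three users.
sample_response = {
    "predictions": [
        {"closest_cluster": 2.0, "distance_to_cluster": 0.8},
        {"closest_cluster": 0.0, "distance_to_cluster": 1.5},
        {"closest_cluster": 2.0, "distance_to_cluster": 0.3},
    ]
}

segments = assign_segments(["u1", "u2", "u3"], sample_response)
segment_sizes = Counter(segments.values())
print(segments)
print(segment_sizes)
```

Rows must be sent in the same order as `user_ids`, because the response carries no identifiers of its own.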

In [None]:
# Example prediction
# test_data = [...] 
# result = kmeans_predictor.predict(test_data)
# print(result)

# Clean up the endpoint (predictor.endpoint was renamed to endpoint_name in SDK v2,
# so call the predictor's own delete method instead)
# kmeans_predictor.delete_endpoint()