### Practical Exam: Customer Purchase Prediction
RetailTech Solutions is a fast-growing international e-commerce platform operating in over 20 countries across Europe, North America, and Asia. They specialize in fashion, electronics, and home goods, with a unique business model that combines traditional retail with a marketplace for independent sellers.

The company has seen rapid growth. A key part of their success has been their data-driven approach to personalization. However, as they plan their expansion into new markets, they need to improve their ability to predict customer behavior.

Their marketing team wants to predict which customers are most likely to make a purchase based on their browsing behavior.

As an AI Engineer, you will help build this prediction system. Your work will directly impact RetailTech's growth strategy and their goal of increasing revenue.

Task 1
The marketing team has collected customer session data in raw_customer_data.csv, but it contains missing values and inconsistencies that need to be addressed. Create a cleaned version of the DataFrame:

* Start with the data in the file raw_customer_data.csv
* Your output should be a DataFrame named clean_data
* All column names and values should match the table below.

In [1]:
import pandas as pd

# Load the raw data
df = pd.read_csv("raw_customer_data.csv")

# Handle missing values according to specifications
df['time_spent'] = df['time_spent'].fillna(df['time_spent'].median())
df['pages_viewed'] = df['pages_viewed'].fillna(df['pages_viewed'].mean())  # Mean is left unrounded at this step
df['basket_value'] = df['basket_value'].fillna(0)
df['device_type'] = df['device_type'].fillna("Unknown")
df['customer_type'] = df['customer_type'].fillna("New")

# Convert data types
df['customer_id'] = df['customer_id'].astype(int)
df['pages_viewed'] = df['pages_viewed'].astype(int)  # Page counts are whole numbers; astype(int) truncates a fractional mean

# Create the cleaned dataframe
clean_data = df.copy()

# Verify no missing values remain
assert clean_data.isna().sum().sum() == 0, "There are still missing values in the data"
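
The imputation rules in the cell above can be sanity-checked on a small frame before running them on the real file. The sketch below assumes the same column names as the cell; the values are invented for illustration. It also shows that `astype(int)` truncates rather than rounds, which matters when the mean of `pages_viewed` is fractional.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one gap per column
toy = pd.DataFrame({
    "time_spent": [10.0, np.nan, 30.0, 50.0],
    "pages_viewed": [2.0, 3.0, np.nan, 6.0],
    "basket_value": [20.0, np.nan, 0.0, 45.5],
    "device_type": ["Mobile", None, "Desktop", "Tablet"],
    "customer_type": ["Returning", "New", None, "New"],
})

# Same fill rules as the cleaning cell
toy["time_spent"] = toy["time_spent"].fillna(toy["time_spent"].median())
toy["pages_viewed"] = toy["pages_viewed"].fillna(toy["pages_viewed"].mean())
toy["basket_value"] = toy["basket_value"].fillna(0)
toy["device_type"] = toy["device_type"].fillna("Unknown")
toy["customer_type"] = toy["customer_type"].fillna("New")

# The mean of [2, 3, 6] is about 3.67: truncation gives 3, rounding gives 4
truncated = toy["pages_viewed"].astype(int)
rounded = toy["pages_viewed"].round().astype(int)
print(truncated.tolist(), rounded.tolist())
```

Which of the two conversions is wanted depends on the expected-values table for the task, so treat the choice as an assumption to verify.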

Task 2
The pre-cleaned dataset model_data.csv needs to be prepared for our neural network. Create the model features:

* Start with the data in the file model_data.csv
* Scale numerical features (time_spent, pages_viewed, basket_value) to 0-1 range
* Apply one-hot encoding to the categorical features (device_type, customer_type)
* The column names should have the following format: variable_name_category_name (e.g., device_type_Desktop)
* Your output should be a DataFrame named model_feature_set, with all column names from model_data.csv except for the columns where one-hot encoding was applied.
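
Scaling to the 0-1 range means applying x' = (x - min) / (max - min) to each column. The following is a minimal sketch with made-up values, written by hand so the arithmetic is visible, and compared against scikit-learn's `MinMaxScaler` with its default `feature_range=(0, 1)`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values, chosen so the scaled results are easy to read
values = pd.DataFrame({
    "time_spent": [5.0, 15.0, 25.0],
    "basket_value": [0.0, 50.0, 200.0],
})

# Manual min-max scaling, column by column
manual = (values - values.min()) / (values.max() - values.min())

# The same transformation via scikit-learn
library = MinMaxScaler().fit_transform(values)

print(manual)
print(np.allclose(manual.to_numpy(), library))
```

Note that the manual form divides by zero for a column whose values are all identical, so it is only meant as an illustration of the formula.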


In [2]:
from sklearn.preprocessing import MinMaxScaler

# Load the model data
df = pd.read_csv("model_data.csv")

# Identify columns
numerical_cols = ['time_spent', 'pages_viewed', 'basket_value']
categorical_cols = ['device_type', 'customer_type']
target_col = 'purchase'

# Scale numerical features (0-1 range)
scaler = MinMaxScaler()
scaled_numerical = pd.DataFrame(
    scaler.fit_transform(df[numerical_cols]),
    columns=numerical_cols,
    index=df.index  # keep the original index so pd.concat aligns rows correctly
)

# One-hot encode categorical features
encoded_categorical = pd.get_dummies(
    df[categorical_cols],
    prefix=categorical_cols,
    dtype=int  # emit 0/1 integers; recent pandas versions default to booleans
)

# Combine features and drop original categorical columns
model_feature_set = pd.concat([
    df[['customer_id']],  # Keep customer_id
    scaled_numerical,
    encoded_categorical,
    df[target_col]  # Keep target variable
], axis=1)

# Verify all original categorical columns are gone
assert all(col not in model_feature_set.columns for col in categorical_cols), "Original categorical columns still exist"
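
One point worth checking when encoded features are reused on a second file: `pd.get_dummies` only creates columns for the categories present in the data it is given, so a file with fewer categories produces fewer columns. The sketch below uses made-up rows and shows one possible way to align a new frame to the training columns with `reindex`; the frame names are hypothetical.

```python
import pandas as pd

# Hypothetical training rows and a smaller new batch with fewer categories
train = pd.DataFrame({"device_type": ["Desktop", "Mobile", "Tablet"]})
new = pd.DataFrame({"device_type": ["Mobile", "Mobile"]})

train_enc = pd.get_dummies(train, columns=["device_type"], dtype=int)
new_enc = pd.get_dummies(new, columns=["device_type"], dtype=int)

# Align to the training columns; categories absent from `new` become all-zero columns
new_aligned = new_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(new_enc.columns))
print(new_aligned)
```

The column names follow the variable_name_category_name format from the task, for example device_type_Desktop.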

Task 3
Now that all preparatory work has been done, create and train a neural network that would allow the company to predict purchases.

* Using PyTorch, create a network with:
    * At least one hidden layer with 8 units
    * ReLU activation for hidden layer
    * Sigmoid activation for the output layer
* Using the prepared features in input_model_features.csv, train the model to predict purchases.
* Use the validation dataset validation_features.csv to predict new values based on the trained model.
* Your model should be named purchase_model and your output should be a DataFrame named validation_predictions with columns customer_id and purchase. The purchase column must be your predicted values.
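
The architecture described in the bullets can be sketched compactly with `nn.Sequential`. The input width `n_features = 7` is a made-up placeholder, and the model is named `sketch_model` so that it does not stand in for the required `purchase_model`.

```python
import torch
import torch.nn as nn

n_features = 7  # hypothetical; use the number of columns in the real feature matrix

sketch_model = nn.Sequential(
    nn.Linear(n_features, 8),  # hidden layer with 8 units
    nn.ReLU(),                 # ReLU activation for the hidden layer
    nn.Linear(8, 1),           # single output unit
    nn.Sigmoid(),              # maps the output to a probability-like value in [0, 1]
)

# Shape check on a random batch of 4 rows
batch = torch.rand(4, n_features)
with torch.no_grad():
    probs = sketch_model(batch)

print(probs.shape)
```

Because the last layer is a sigmoid, the outputs can be thresholded (for example at 0.5) to obtain 0/1 purchase labels.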

In [3]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Prepare training data from the Task 2 output (assumed to match the contents of input_model_features.csv)
X = model_feature_set.drop(columns=['customer_id', 'purchase']).values.astype('float32')
y = model_feature_set['purchase'].values.astype('float32')

# Split into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert to PyTorch tensors
train_dataset = TensorDataset(
    torch.tensor(X_train), 
    torch.tensor(y_train).unsqueeze(1)
)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Define the model
class PurchaseModel(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.layer1 = nn.Linear(input_dim, 8)
        self.relu = nn.ReLU()
        self.output = nn.Linear(8, 1)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x):
        x = self.relu(self.layer1(x))
        x = self.sigmoid(self.output(x))
        return x

# Initialize model
input_dim = X_train.shape[1]
purchase_model = PurchaseModel(input_dim)

# Loss and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(purchase_model.parameters(), lr=0.001)

# Training loop
epochs = 50
for epoch in range(epochs):
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        outputs = purchase_model(batch_x)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
    
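    # Print training progress every 10 epochs.
    # `loss` here is the loss of the last mini-batch only, so it can fluctuate between reports.
    if (epoch + 1) % 10 == 0:
        print(f'Epoch {epoch+1}/{epochs}, Loss: {loss.item():.4f}')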

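# Validation on the held-out 20% split
purchase_model.eval()  # inference mode; no effect on this architecture, but good practice
with torch.no_grad():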
    val_tensor = torch.tensor(X_val)
    predictions = purchase_model(val_tensor).numpy().flatten()
    predicted_labels = (predictions >= 0.5).astype(int)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_val, predicted_labels)
    print(f'Validation Accuracy: {accuracy:.4f}')

# Load validation features (as per task instructions)
val_df = pd.read_csv('validation_features.csv')
X_validation = val_df.drop(columns=['customer_id']).values.astype('float32')
customer_ids = val_df['customer_id'].values

# Make predictions on the provided validation set
with torch.no_grad():
    val_tensor = torch.tensor(X_validation)
    predictions = purchase_model(val_tensor).numpy().flatten()
    predicted_labels = (predictions >= 0.5).astype(int)

# Create final output DataFrame
validation_predictions = pd.DataFrame({
    'customer_id': customer_ids,
    'purchase': predicted_labels
})

Epoch 10/50, Loss: 0.7867
Epoch 20/50, Loss: 0.5926
Epoch 30/50, Loss: 0.3489
Epoch 40/50, Loss: 0.5754
Epoch 50/50, Loss: 0.4622
Validation Accuracy: 0.8000
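
The loss printed above is that of the last mini-batch in each reported epoch, which is one reason it moves up and down between reports. Below is a sketch of tracking a sample-weighted mean loss per epoch instead. It runs on synthetic data, and every name in it (`X_demo`, `demo_model`, and so on) is hypothetical rather than part of the exam solution.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 64 rows, 5 features, binary target
X_demo = torch.rand(64, 5)
y_demo = (X_demo.sum(dim=1, keepdim=True) > 2.5).float()
demo_loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=16, shuffle=True)

demo_model = nn.Sequential(nn.Linear(5, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
demo_criterion = nn.BCELoss()
demo_optimizer = torch.optim.Adam(demo_model.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(5):
    running, seen = 0.0, 0
    for batch_x, batch_y in demo_loader:
        demo_optimizer.zero_grad()
        loss = demo_criterion(demo_model(batch_x), batch_y)
        loss.backward()
        demo_optimizer.step()
        # Weight each batch loss by its size so a short final batch is not over-counted
        running += loss.item() * batch_x.size(0)
        seen += batch_x.size(0)
    epoch_losses.append(running / seen)

print([round(value, 4) for value in epoch_losses])
```

An epoch-level average is usually a steadier signal than a single batch, though it still mixes losses computed before and after each weight update within the epoch.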
