# Task 1

The marketing team has collected customer session data in `raw_customer_data.csv`, but it contains missing values and inconsistencies that need to be addressed.
Create a cleaned version of the dataframe:

- Start with the data in the file `raw_customer_data.csv`
- Your output should be a DataFrame named `clean_data`
- All column names and values should match the table below.

| Column Name | Criteria |
|------------|----------|
| customer_id | Integer. Unique identifier for each customer. No missing values. |
| time_spent | Float. Minutes spent on website per session. Missing values should be replaced with median. |
| pages_viewed | Integer. Number of pages viewed in session. Missing values should be replaced with mean. |
| basket_value | Float. Value of items in basket. Missing values should be replaced with 0. |
| device_type | String. One of: Mobile, Desktop, Tablet. Missing values should be replaced with "Unknown". |
| customer_type | String. One of: New, Returning. Missing values should be replaced with "New". |
| purchase | Binary. Whether customer made a purchase (1) or not (0). Target variable. |
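
The criteria in the table can be turned into a quick self-check. The following is a minimal sketch, not part of the original task: `check_clean_data` is a hypothetical helper name, and the sample frame is made up for illustration. It assumes the cleaned frame uses the column names listed above.

```python
import pandas as pd

# Allowed categories, taken from the criteria table above.
VALID_DEVICES = ["Mobile", "Desktop", "Tablet", "Unknown"]
VALID_CUSTOMER_TYPES = ["New", "Returning"]


def check_clean_data(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if df.isna().any().any():
        problems.append("missing values remain")
    if not df["customer_id"].is_unique:
        problems.append("customer_id is not unique")
    for col in ("customer_id", "pages_viewed"):
        if not pd.api.types.is_integer_dtype(df[col]):
            problems.append(f"{col} is not an integer column")
    for col in ("time_spent", "basket_value"):
        if not pd.api.types.is_float_dtype(df[col]):
            problems.append(f"{col} is not a float column")
    if not df["device_type"].isin(VALID_DEVICES).all():
        problems.append("unexpected device_type value")
    if not df["customer_type"].isin(VALID_CUSTOMER_TYPES).all():
        problems.append("unexpected customer_type value")
    if not df["purchase"].isin([0, 1]).all():
        problems.append("purchase is not binary")
    return problems


# Tiny hand-made frame (not the real data) to show the checker in action.
sample = pd.DataFrame({
    "customer_id": [1, 2],
    "time_spent": [1.5, 2.0],
    "pages_viewed": [3, 4],
    "basket_value": [0.0, 10.0],
    "device_type": ["Mobile", "Unknown"],
    "customer_type": ["New", "Returning"],
    "purchase": [0, 1],
})
print(check_clean_data(sample))  # []
```

Running `check_clean_data(clean_data)` after the cleaning cell should return an empty list if every criterion is met.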

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [20]:
import pandas as pd

# Paste the copied path here
path = '/content/drive/MyDrive/Colab Notebooks/AI Engineering/For Project/CustomerPurchasePrediction/raw_customer_data.csv'
df = pd.read_csv(path)

In [22]:

import pandas as pd
import numpy as np

# Path File
path = '/content/drive/MyDrive/Colab Notebooks/AI Engineering/For Project/CustomerPurchasePrediction/raw_customer_data.csv'
df = pd.read_csv(path)

# 1. Handle missing values according to specifications
# For time_spent: Replace missing values with median
time_spent_median = df['time_spent'].median()
df['time_spent'] = df['time_spent'].fillna(time_spent_median)

# For pages_viewed: Replace missing values with mean (after converting to float if needed)
if df['pages_viewed'].dtype == 'object':
    # Handle any non-numeric values if they exist
    df['pages_viewed'] = pd.to_numeric(df['pages_viewed'], errors='coerce')
pages_viewed_mean = df['pages_viewed'].mean()
df['pages_viewed'] = df['pages_viewed'].fillna(pages_viewed_mean).round().astype(int)  # Round (rather than truncate) the imputed mean, then convert to int as required

# For basket_value: Replace missing values with 0
df['basket_value'] = df['basket_value'].fillna(0)

# For device_type: Replace missing values with "Unknown"
df['device_type'] = df['device_type'].fillna('Unknown')

# For customer_type: Replace missing values with "New"
df['customer_type'] = df['customer_type'].fillna('New')

# 2. Ensure data types are correct
df['customer_id'] = df['customer_id'].astype(int)
df['time_spent'] = df['time_spent'].astype(float)
df['pages_viewed'] = df['pages_viewed'].astype(int)
df['basket_value'] = df['basket_value'].astype(float)
df['purchase'] = df['purchase'].astype(int)

# 3. Standardize categorical values (in case there are variations in case or extra spaces)
df['device_type'] = df['device_type'].str.title().str.strip()
df['customer_type'] = df['customer_type'].str.title().str.strip()

# 4. Validate categorical values
valid_devices = ['Mobile', 'Desktop', 'Tablet', 'Unknown']
valid_customer_types = ['New', 'Returning']

# Keep values that are in the allowed list; replace anything else with the default
df['device_type'] = df['device_type'].where(df['device_type'].isin(valid_devices), 'Unknown')
df['customer_type'] = df['customer_type'].where(df['customer_type'].isin(valid_customer_types), 'New')

# Create the clean_data DataFrame
clean_data = df.copy()

# Display the first few rows of the cleaned data
print("First few rows of cleaned data:")
print(clean_data.head())

# Display information about the cleaned data
print("\nDataFrame info:")
print(clean_data.info())

# Display summary statistics
print("\nSummary statistics:")
print(clean_data.describe())

# Save the cleaned data to a CSV file (optional)
clean_data.to_csv('cleaned_customer_data.csv', index=False)
print("\nCleaned data has been saved to 'cleaned_customer_data.csv'")

# Note: The clean_data DataFrame is now ready for use in the notebook


First few rows of cleaned data:
   customer_id  time_spent  pages_viewed  basket_value device_type  \
0            1   23.097867             7     50.574647      Mobile   
1            2   57.092144             3     56.891022      Mobile   
2            3   44.187643            14      8.348296      Mobile   
3            4   36.320851            10     43.481489      Mobile   
4            5   10.205100            16      0.000000      Mobile   

  customer_type  purchase  
0     Returning         0  
1     Returning         1  
2     Returning         0  
3           New         1  
4     Returning         1  

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   customer_id    500 non-null    int64  
 1   time_spent     500 non-null    float64
 2   pages_viewed   500 non-null    int64  
 3   basket_value   500 non-null    fl

You can mount your Google Drive to access files stored there directly from this notebook.

Run the following cell to mount your Drive. This will prompt you to authorize Colab to access your Google Drive files.

Once your drive is mounted, you can navigate to the desired directory and copy files using standard Python file operations or shell commands.

For example, to copy a file named `my_file.csv` from a folder named `MyFolder` in your Google Drive to the current directory in Colab, you can use this command:
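
A minimal sketch of such a copy, assuming Drive is mounted at `/content/drive` as in the mount cell earlier in this notebook. `copy_from_drive` is a hypothetical helper name used for illustration, not a Colab API; it only wraps the standard-library `shutil.copy`.

```python
import shutil
from pathlib import Path


def copy_from_drive(filename, folder="MyFolder",
                    drive_root="/content/drive/MyDrive", dest="."):
    """Copy one file from a mounted Drive folder into dest and return the new path."""
    source = Path(drive_root) / folder / filename
    if not source.exists():
        raise FileNotFoundError(f"{source} not found; is Drive mounted?")
    # shutil.copy returns the path of the newly created file.
    return Path(shutil.copy(source, dest))


# Example call (needs a mounted Drive, so it is left commented out):
# copy_from_drive("my_file.csv")
```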

# Task 2
The pre-cleaned dataset `model_data.csv` needs to be prepared for our neural network.
Create the model features:

- Start with the data in the file `model_data.csv`
- Scale numerical features (`time_spent`, `pages_viewed`, `basket_value`) to 0-1 range
- Apply one-hot encoding to the categorical features (`device_type`, `customer_type`)
    - The column names should have the following format: variable_name_category_name (e.g., `device_type_Desktop`)
- Your output should be a DataFrame named `model_feature_set`, with all column names from `model_data.csv` except for the columns where one-hot encoding was applied.
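
To make the two transformations concrete, here is a small worked sketch on made-up rows (not the contents of `model_data.csv`). It computes min-max scaling by hand as `(x - min) / (max - min)` and shows the column names that `pd.get_dummies` produces with the default `_` separator.

```python
import pandas as pd

# Toy rows standing in for model_data.csv (values are invented for illustration).
toy = pd.DataFrame({
    "time_spent": [10.0, 20.0, 30.0],
    "device_type": ["Desktop", "Mobile", "Desktop"],
})

# Min-max scaling: (x - min) / (max - min) maps the column onto the 0-1 range.
col = toy["time_spent"]
toy["time_spent"] = (col - col.min()) / (col.max() - col.min())

# One-hot encoding: the column name is used as prefix, giving names such as
# device_type_Desktop.
encoded = pd.get_dummies(toy, columns=["device_type"])

print(encoded.columns.tolist())
# ['time_spent', 'device_type_Desktop', 'device_type_Mobile']
print(encoded["time_spent"].tolist())
# [0.0, 0.5, 1.0]
```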


In [25]:
# Write your answer to Task 2 here
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from pathlib import Path

# Path File
path = '/content/drive/MyDrive/Colab Notebooks/AI Engineering/For Project/CustomerPurchasePrediction/model_data.csv'
df = pd.read_csv(path)

# Columns per task specification
numeric_features = ['time_spent', 'pages_viewed', 'basket_value']
categorical_features = ['device_type', 'customer_type']

# 1) Scale numeric features to 0-1 range (Min-Max)
scaler = MinMaxScaler()
df[numeric_features] = scaler.fit_transform(df[numeric_features])

# 2) One-hot encode categorical features with the required naming format
#    This will create columns like 'device_type_Desktop', 'customer_type_New', etc.
df_dummies = pd.get_dummies(df[categorical_features], prefix=categorical_features)

# 3) Build the final feature set: all original columns except the categorical ones, plus the one-hot columns
model_feature_set = pd.concat([
    df.drop(columns=categorical_features),
    df_dummies
], axis=1)

# Optional: persist to disk for downstream use
output_path = Path.cwd() / 'model_feature_set.csv'
model_feature_set.to_csv(output_path, index=False)

# Quick sanity outputs
print('model_feature_set preview:')
print(model_feature_set.head())
print('\nColumns in model_feature_set:')
print(list(model_feature_set.columns))
print(f"\nSaved feature set to: {output_path}")

model_feature_set preview:
   customer_id  time_spent  pages_viewed  basket_value  purchase  \
0          501    0.664167      0.500000      0.000000         1   
1          502    0.483681      0.222222      0.524981         1   
2          503    0.231359      0.111111      0.457291         0   
3          504    0.792944      0.277778      0.000000         1   
4          505    0.649210      0.166667      0.484283         1   

   device_type_Desktop  device_type_Mobile  device_type_Tablet  \
0                 True               False               False   
1                False                True               False   
2                False                True               False   
3                False               False               False   
4                False               False                True   

   device_type_Unknown  customer_type_New  customer_type_Returning  
0                False               True                    False  
1                False       

# Task 3

Now that all preparatory work has been done, create and train a neural network that would allow the company to predict purchases.

- Using PyTorch, create a network with:
   - At least one hidden layer with 8 units
   - ReLU activation for hidden layer
   - Sigmoid activation for the output layer
- Using the prepared features in `input_model_features.csv`, train the model to predict purchases.
- Use the validation dataset `validation_features.csv` to predict new values based on the trained model.
- Your model should be named `purchase_model` and your output should be a DataFrame named `validation_predictions` with columns `customer_id` and `purchase`. The `purchase` column must be your predicted values.
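
As a rough illustration of what the requested architecture computes, the sketch below writes the forward pass in plain NumPy rather than PyTorch. The feature count and the random weights are assumptions for illustration only; a trained network would learn its weights from the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 5  # assumed feature count, for illustration only

# Randomly initialised parameters: input -> 8 hidden units -> 1 output.
W1, b1 = rng.normal(size=(n_features, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)


def forward(x):
    """Forward pass: linear -> ReLU -> linear -> sigmoid."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU on the hidden layer
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid maps logits to probabilities


batch = rng.normal(size=(4, n_features))
probs = forward(batch)
print(probs.shape)  # (4, 1)
```

Each row of `probs` is a value between 0 and 1 that can be read as a predicted purchase probability; thresholding it at 0.5 gives a 0/1 prediction.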


In [33]:
!pip install torch torchvision torchaudio



In [35]:
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split

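# Note: the task names `input_model_features.csv` as the training file and
# `validation_features.csv` as the validation file; this cell instead reuses
# model_feature_set from Task 2 and holds out 20% of it for validation.
# If model_feature_set is not already in memory, load it first:
# model_feature_set = pd.read_csv('model_feature_set.csv')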

# Separate features and target
# Drop customer_id before splitting as it's not a feature for training
X = model_feature_set.drop(columns=["customer_id", "purchase"])
y = model_feature_set["purchase"]
customer_ids = model_feature_set["customer_id"] # Keep customer_ids for the final output

# Convert boolean columns to integers for PyTorch compatibility
for col in X.columns:
    if X[col].dtype == 'bool':
        X[col] = X[col].astype(int)


# Split data into training and validation sets
X_train, X_val, y_train, y_val, train_ids, val_ids = train_test_split(
    X, y, customer_ids, test_size=0.2, random_state=42
)


# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values.reshape(-1, 1), dtype=torch.float32)
X_val_tensor = torch.tensor(X_val.values, dtype=torch.float32)


# Define the neural network
class PurchaseNet(nn.Module):
    def __init__(self, input_dim):
        super(PurchaseNet, self).__init__()
        self.fc1 = nn.Linear(input_dim, 8)
        self.relu = nn.ReLU()
        self.output = nn.Linear(8, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.output(x)
        return self.sigmoid(x)

# Initialize model
input_dim = X_train.shape[1]
purchase_model = PurchaseNet(input_dim)

# Training setup
criterion = nn.BCELoss()
optimizer = optim.Adam(purchase_model.parameters(), lr=0.01)
epochs = 100

# Training loop
losses = []
for epoch in range(epochs):
    purchase_model.train()
    optimizer.zero_grad()
    outputs = purchase_model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

    # Print loss every 10 epochs
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}')


# Predict on validation set
purchase_model.eval()
with torch.no_grad():
    predictions = purchase_model(X_val_tensor).numpy().flatten()

# Create output DataFrame
validation_predictions = pd.DataFrame({'customer_id': val_ids, 'purchase': predictions})
validation_predictions["purchase"] = (validation_predictions["purchase"] > 0.5).astype(int)

# Display the first few rows of the predictions
print("\nFirst few rows of validation predictions:")
print(validation_predictions.head())

# The purchase_model is now trained and validation_predictions DataFrame is created.

Epoch [10/100], Loss: 0.5492
Epoch [20/100], Loss: 0.5048
Epoch [30/100], Loss: 0.5009
Epoch [40/100], Loss: 0.4904
Epoch [50/100], Loss: 0.4826
Epoch [60/100], Loss: 0.4749
Epoch [70/100], Loss: 0.4673
Epoch [80/100], Loss: 0.4602
Epoch [90/100], Loss: 0.4548
Epoch [100/100], Loss: 0.4512

First few rows of validation predictions:
     customer_id  purchase
361          862         1
73           574         1
374          875         1
155          656         0
104          605         1
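
The task statement names `validation_features.csv` as the prediction set, whereas the cell above predicts on a 20% hold-out of `model_feature_set`. If a separate validation file is used, its columns need to line up with the training features before it is converted to a tensor. The sketch below shows one way to do that with pandas; `align_features` is a hypothetical helper, and the column names and rows are made up, since the actual contents of the file are not shown here.

```python
import pandas as pd


def align_features(frame, feature_columns):
    """Return frame restricted to feature_columns, adding absent columns as 0."""
    aligned = frame.reindex(columns=feature_columns, fill_value=0)
    # Booleans from get_dummies become 0/1 so the matrix converts cleanly to float32.
    return aligned.astype("float32")


# Assumed training feature order (illustrative, not read from the real files).
train_cols = ["time_spent", "device_type_Desktop", "device_type_Mobile"]

# Made-up validation rows: one training column is absent, customer_id is extra.
val = pd.DataFrame({
    "customer_id": [901, 902],
    "time_spent": [0.2, 0.8],
    "device_type_Mobile": [True, False],
})

X_val_aligned = align_features(val, train_cols)
print(X_val_aligned.columns.tolist())
# ['time_spent', 'device_type_Desktop', 'device_type_Mobile']
```

The `customer_id` column is dropped from the feature matrix by the reindex, so it should be kept separately (for example as `val["customer_id"]`) to build the `validation_predictions` output.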
