<a href="https://colab.research.google.com/github/Jack-ki1/MARTIAL_JENGA_PROJECT_AI_ENGINEERING/blob/main/FINAL_PROJECT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Phase 1: Problem Definition and Data Collection

Define the Goal: Build a regression model that predicts a customer's purchase amount (in USD) from their demographic and behavioral data.

Collect Data: Download the "Customer Shopping Trends Dataset" from Kaggle and load it into your chosen environment (e.g., Python with Pandas).

Phase 2: Exploratory Data Analysis (EDA) and Data Preparation

Understand the Data: Examine the dataset's structure, identify data types (numerical, categorical), look for missing values, and analyze the distribution of key variables like "Age," "Previous Purchases," and the target variable "Purchase Amount (USD)".

Handle Missing Values: Decide how to manage any missing or null data points. This might involve removing rows with missing values or filling them in (imputation) based on statistical measures (e.g., mean, median) or other predictive methods.
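
As a minimal sketch of the two options above (dropping rows versus imputing), the snippet below uses a small made-up frame. The column names mirror the dataset, but the values and the chosen strategy (median for numeric, mode for categorical) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: values are invented, not taken from the real dataset.
toy = pd.DataFrame({
    "Age": [25, np.nan, 40, 31, 52],
    "Gender": ["Male", "Female", None, "Female", "Male"],
    "Purchase Amount (USD)": [53.0, 64.0, 90.0, 77.0, np.nan],
})

# Count missing values per column before choosing a strategy.
missing_counts = toy.isna().sum()
print(missing_counts)

# A row without a target value cannot be used for supervised training, so drop it.
cleaned = toy.dropna(subset=["Purchase Amount (USD)"]).copy()

# Numeric feature: fill with the median. Categorical feature: fill with the mode.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
cleaned["Gender"] = cleaned["Gender"].fillna(cleaned["Gender"].mode()[0])
print(cleaned)
```

In a real workflow the imputation statistics would be computed on the training split only and then reused on the validation and test splits.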

Encode Categorical Variables: Machine learning models typically require numerical input. Convert categorical features like "Gender," "Item Purchased," "Category," "Payment Method," "Color," "Season," and "Location" into a numerical format using techniques like one-hot encoding or label encoding.
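
To make the difference between the two encodings concrete, here is a small pandas-only sketch on a made-up "Season" column. It is an illustration of the idea, not the preprocessing used later in this notebook.

```python
import pandas as pd

# Hypothetical toy column with three distinct categories.
toy = pd.DataFrame({"Season": ["Winter", "Summer", "Winter", "Fall"]})

# One-hot encoding: one 0/1 column per category, with no implied order.
one_hot = pd.get_dummies(toy["Season"], prefix="Season", dtype=int)
print(one_hot)

# Label encoding: a single integer code per category.
# Codes follow the sorted category order here (Fall=0, Summer=1, Winter=2).
label_codes = toy["Season"].astype("category").cat.codes
print(label_codes.tolist())
```

Label codes imply an order (Fall < Summer < Winter), which a linear model will treat as meaningful, so one-hot encoding is usually the safer choice for unordered categories.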

Feature Engineering (Optional but Recommended): Create new, more informative features from existing ones if possible (e.g., a "Total Purchases per Year" metric if raw dates are available).

Split the Data: Divide your dataset into two or three parts: a training set (for teaching the model), a validation set (for tuning the model), and a test set (for evaluating the final, trained model on unseen data).
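
A three-way split can be sketched with two calls to `train_test_split`. The 60/20/20 ratio and the synthetic arrays below are assumptions chosen for illustration; the notebook itself uses a simpler 80/20 train/test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Step 1: hold out 20% of all rows as the final test set.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Step 2: take 25% of the remaining 80% as validation (20% of the total).
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```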

Phase 3: Model Building and Training
Select a Model: Choose appropriate regression algorithms suitable for predicting continuous numerical values. Popular choices for this type of problem include:

Linear Regression (simple baseline model)

Random Forest Regressor (captures non-linear relationships and feature interactions with little tuning)

XGBoost Regressor (gradient-boosted trees; often a strong performer on tabular data)

Support Vector Regressor (SVR)

Train the Model: Use the training data to train the selected algorithm(s) to find patterns and relationships between the input features and the target variable "Purchase Amount (USD)".
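
One way to compare several of the candidate algorithms is to fit each on the same training split and score it on the same held-out split. The sketch below uses synthetic data with a made-up linear relationship, so the scores say nothing about the real dataset. XGBoost is left out only because it needs a separate package.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic data: the target is a noisy linear function of two of four features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
    "SVR": SVR(),
}

# Fit each model and record its R² on the held-out split.
scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    scores[name] = estimator.score(X_te, y_te)

for name, r2 in scores.items():
    print(f"{name}: R² = {r2:.3f}")
```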

Phase 4: Model Evaluation and Tuning

Evaluate Performance: Assess how well your model makes predictions using appropriate regression metrics. Key metrics include:

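Root Mean Square Error (RMSE): The square root of the average squared difference between predicted and actual amounts. It is expressed in the original units (USD) and penalizes large errors more heavily than small ones; lower is better.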

Mean Absolute Error (MAE): Another measure of average error, often easier to interpret in the original units (USD).

R-squared (R²): Indicates the proportion of the variance in the target variable that is predictable from the features.
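
The three metrics can be worked through by hand on a few made-up values and then checked against scikit-learn. The numbers below are invented purely to show the arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted purchase amounts in USD.
y_true = np.array([50.0, 60.0, 80.0, 100.0])
y_pred = np.array([55.0, 58.0, 70.0, 105.0])

errors = y_pred - y_true                          # [5, -2, -10, 5]

mae = np.mean(np.abs(errors))                     # (5 + 2 + 10 + 5) / 4 = 5.5
rmse = np.sqrt(np.mean(errors ** 2))              # sqrt(154 / 4), about 6.20

ss_res = np.sum(errors ** 2)                      # residual sum of squares = 154
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares = 1475
r2 = 1.0 - ss_res / ss_tot                        # about 0.896

# The hand-computed values should agree with scikit-learn's implementations.
print(mae, mean_absolute_error(y_true, y_pred))
print(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
print(r2, r2_score(y_true, y_pred))
```

The single 10 USD miss pushes RMSE above MAE, which is the sense in which RMSE is more sensitive to large errors.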

Tune Hyperparameters: Adjust the internal settings (hyperparameters) of your chosen model(s) to optimize performance.

Cross-validation gives a more reliable estimate of how each hyperparameter setting generalizes, which helps detect overfitting to a single train/validation split.
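
Hyperparameter tuning with cross-validation could look like the following sketch, which uses scikit-learn's `GridSearchCV` with 5 folds. The data is synthetic and the parameter grid is a small, arbitrary assumption; a real search would cover values suited to the actual dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data with a mildly non-linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = 3.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.2, size=150)

# Hypothetical grid: 2 x 2 = 4 combinations, each evaluated with 5-fold CV.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, so RMSE is negated
)
search.fit(X_demo, y_demo)

best_rmse = -search.best_score_
print("Best parameters:", search.best_params_)
print(f"Cross-validated RMSE: {best_rmse:.3f}")
```

The search should be run on the training data only, so that the test set stays untouched until the final evaluation.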

Phase 5: Deployment and Further Steps

Make Predictions: Once you are satisfied with your model's performance, use it to make predictions on your held-out test set or new, unseen customer data.

Interpret and Deploy: Understand the insights gained from the model (e.g., which features most influence purchase amount). You can then integrate the model into a practical application or a business decision-making process.
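
One model-agnostic way to see which features influence the prediction is permutation importance: shuffle one feature at a time on held-out data and measure how much the score drops. The sketch below assumes synthetic data in which "Age" is deliberately made the dominant driver; the feature names are borrowed from the dataset, but the relationship is invented.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic features; the target depends mostly on "Age" by construction.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(300, 3)),
    columns=["Age", "Previous Purchases", "Review Rating"],
)
y_demo = 5.0 * X_demo["Age"] + 0.5 * X_demo["Review Rating"] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the held-out split and average the drop in R².
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)
importance = pd.Series(result.importances_mean, index=X_demo.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```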


## DEFINE TASK:

### Building a regression model to predict the numerical value of a customer's purchase amount (in USD) based on their demographic and behavioral data.

In [1]:
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [2]:
df2=pd.read_csv("shopping_trends.csv")
df=df2.copy()

display("A look IN THE SHAPE OF THE DATASET:", df.shape)
display("A SAMPLE LOOK INTO THE DATASET:", df.sample(20))
display("A DESCRIPTION OF THE DATASET:", df.describe())
# df.info() prints its summary directly and returns None, so call it on its own
print("INFORMATION ON THE DATASET:")
df.info()

FileNotFoundError: [Errno 2] No such file or directory: 'shopping_trends.csv'

In [None]:
df.columns

In [None]:
df.drop(['Item Purchased','Location', 'Size', 'Color', 'Season','Shipping Type', 'Discount Applied', 'Promo Code Used','Preferred Payment Method'], axis=1, inplace=True)
display("SHAPE OF THE DATASET:",df.shape)
# df.info() prints its summary directly and returns None, so call it on its own
print("INFORMATION ON THE DATASET:")
df.info()

In [None]:
# Encode the non-numeric columns with one-hot and ordinal encoders
# (OneHotEncoder, OrdinalEncoder, ColumnTransformer and Pipeline are imported in the first cell)

# --- 2. Define Columns by Encoding Type ---

# Columns for One-Hot Encoding (Nominal, Unordered)
one_hot_cols = ['Gender', 'Category', 'Subscription Status', 'Payment Method']

# Column for Ordinal Encoding (Ordered)
ordinal_col = ['Frequency of Purchases']

# Numerical columns (to be passed through without encoding)
# 'Customer ID' is a row identifier, not a predictor, so it is left out and dropped by remainder='drop'
numerical_cols = ['Age', 'Previous Purchases', 'Review Rating', 'Purchase Amount (USD)'] # Include target variable here to pass through initially

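# --- 3. Define the Ordinal Categories in the Correct Order ---
# This step is crucial to maintain the ranking (least frequent to most frequent).
# 'Monthly' is less frequent than 'Bi-Weekly'/'Fortnightly', so it must come before them.
# Note: synonymous labels ('Every 3 Months'/'Quarterly', 'Bi-Weekly'/'Fortnightly')
# still receive different integer codes with this encoding.
frequency_order = [
    'Rarely',
    'Annually',
    'Every 3 Months',
    'Quarterly',
    'Monthly',
    'Bi-Weekly',
    'Fortnightly',
    'Weekly'
]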

ordinal_categories = [frequency_order]

# --- 4. Create Preprocessing Pipelines ---

# One-hot encoder pipeline
one_hot_transformer = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Ordinal encoder pipeline
ordinal_transformer = OrdinalEncoder(categories=ordinal_categories)

# --- 5. Combine Transformers using ColumnTransformer ---
# This applies the correct transformation to each column group simultaneously
preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', one_hot_transformer, one_hot_cols),
        ('ordinal', ordinal_transformer, ordinal_col),
        ('numeric', 'passthrough', numerical_cols) # Keep numerical columns as they are
    ],
    remainder='drop' # Drops any columns not specified
)

# --- 6. Apply the transformations and create the final encoded DataFrame ---

# Fit and transform the data
df_encoded_np = preprocessor.fit_transform(df)

# Get the new column names for the one-hot encoded features
one_hot_feature_names = preprocessor.named_transformers_['onehot'].get_feature_names_out(one_hot_cols)

# Combine all new column names in the correct order
all_feature_names = list(one_hot_feature_names) + ordinal_col + numerical_cols

# Convert the resulting numpy array back into a pandas DataFrame
df_encoded = pd.DataFrame(df_encoded_np, columns=all_feature_names)

# --- 7. Display the result ---
print(df_encoded.head())
print("\nEncoded DataFrame shape:", df_encoded.shape)

In [None]:
# Assuming 'df_encoded' is your final DataFrame from the previous step

# Define the target variable (y)
y = df_encoded['Purchase Amount (USD)']

# Define the features (X) by dropping the target column from the DataFrame
X = df_encoded.drop('Purchase Amount (USD)', axis=1)

# Verify the shapes of your X and y
print(f"Shape of X (features): {X.shape}")
print(f"Shape of y (target): {y.shape}")

# View the first few rows of X to confirm the structure
print("\nFirst 5 rows of X:")
print(X.head())

In [None]:
# Split the data with an 80/20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the size of each new set
print(f"\nTraining set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
print(f"y_train size: {y_train.shape[0]}")
print(f"y_test size: {y_test.shape[0]}")


In [None]:
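# Preprocessing: standardize the features so the fitted coefficients are comparable in magnitude
# (ordinary least-squares predictions do not depend on feature scale, but coefficient sizes do)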
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# Fit a linear regression model on the scaled training data
model=LinearRegression()
model.fit(X_train_scaled, y_train)

In [None]:
#MAKE PREDICTIONS
y_train_pred = model.predict(X_train_scaled)
y_test_pred = model.predict(X_test_scaled)

In [None]:
# --- MODEL EVALUATION ---
# Calculate key regression metrics on both train and test sets
# This helps diagnose overfitting (large gap between train and test performance)
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)
train_mae = mean_absolute_error(y_train, y_train_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)


print("=== LINEAR REGRESSION PERFORMANCE ===")
# The target is already in USD, so the errors are reported without any rescaling
print(f"Train RMSE: ${train_rmse:,.2f} | Test RMSE: ${test_rmse:,.2f}")
print(f"Train MAE:  ${train_mae:,.2f} | Test MAE:  ${test_mae:,.2f}")
print(f"Train R²:   {train_r2:.4f} | Test R²: {test_r2:.4f}")

In [None]:
# --- 7. MODEL INTERPRETATION ---
# Extract feature names and their corresponding coefficients (weights)
# The model was trained on the scaled columns of X, so X.columns gives the matching names
feature_names = X.columns
coefficients = model.coef_
# Create a DataFrame for easy sorting and visualization
coef_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})
coef_df = coef_df.sort_values(by='Coefficient', key=abs, ascending=False)  # Sort by absolute magnitude
# Plot the feature importances (coefficients)
plt.figure(figsize=(10, 6))
plt.barh(coef_df['Feature'], coef_df['Coefficient'], color='skyblue')
plt.xlabel('Coefficient Value')
plt.title('Linear Regression: Feature Coefficients (Impact on Purchase Amount)')
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)  # Vertical line at zero
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()