Phase 1: Problem Definition and Data Collection

Define the Goal: Build a regression model that predicts a customer's purchase amount (in USD) from their demographic and behavioral data.

Collect Data: Download the "Customer Shopping Trends Dataset" from Kaggle and load it into your chosen environment (e.g., Python with Pandas). 

Phase 2: Exploratory Data Analysis (EDA) and Data Preparation

Understand the Data: Examine the dataset's structure, identify data types (numerical, categorical), look for missing values, and analyze the distribution of key variables like "Age," "Previous Purchases," "Review Rating," and the target variable "Purchase Amount (USD)".

Handle Missing Values: Decide how to manage any missing or null data points. This might involve removing rows with missing values or filling them in (imputation) based on statistical measures (e.g., mean, median) or other predictive methods.
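
As a minimal sketch of this step, the snippet below counts missing values and applies simple imputation. The small `df_demo` frame and its values are hypothetical stand-ins for the real data; the `df.info()` output later in this notebook reports no null values in the listed columns, so this step may turn out to be a no-op for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with a few gaps (not the real dataset).
df_demo = pd.DataFrame({
    "Age": [25, np.nan, 40, 58],
    "Gender": ["Male", "Female", None, "Female"],
    "Purchase Amount (USD)": [30, 55, 80, np.nan],
})

# 1. Count missing values per column.
missing_counts = df_demo.isna().sum()
print(missing_counts)

# 2. Rows without a target value cannot be used for supervised training.
df_clean = df_demo.dropna(subset=["Purchase Amount (USD)"]).copy()

# 3. Impute numeric gaps with the median and categorical gaps with the mode.
df_clean["Age"] = df_clean["Age"].fillna(df_clean["Age"].median())
df_clean["Gender"] = df_clean["Gender"].fillna(df_clean["Gender"].mode()[0])

print(df_clean)
```

In a real workflow the median and mode would be computed on the training split only and then reused for the validation and test splits.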

Encode Categorical Variables: Machine learning models typically require numerical input. Convert categorical features like "Gender," "Item Purchased," "Category," "Payment Method," "Color," "Season," and "Location" into a numerical format using techniques like one-hot encoding or label encoding.

Feature Engineering (Optional but Recommended): Create new, more informative features from existing ones if possible (e.g., a "Total Purchases per Year" metric if raw dates are available).

Split the Data: Divide your dataset into two or three parts: a training set (for teaching the model), a validation set (for tuning the model), and a test set (for evaluating the final, trained model on unseen data). 
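
One common way to get three parts is to call `train_test_split` twice. The sketch below assumes a 60/20/20 split and uses randomly generated placeholder arrays (`X_demo`, `y_demo`) rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 3 features, target in the 20-100 USD range.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.uniform(20, 100, size=100)

# Step 1: hold out 20% of all rows as the final test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Step 2: take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2 overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

The notebook below uses a simpler two-way 80/20 split; a separate validation set matters mainly once hyperparameters are tuned.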

Phase 3: Model Building and Training
Select a Model: Choose appropriate regression algorithms suitable for predicting continuous numerical values. Popular choices for this type of problem include:

Linear Regression (simple baseline model)

Random Forest Regressor (captures non-linear relationships and feature interactions)

XGBoost Regressor (gradient-boosted trees; often strong on tabular data)

Support Vector Regressor (SVR)

Train the Model: Use the training data to train the selected algorithm(s) to find patterns and relationships between the input features and the target variable "Purchase Amount (USD)". 

Phase 4: Model Evaluation and Tuning

Evaluate Performance: Assess how well your model makes predictions using appropriate regression metrics. Key metrics include:

Root Mean Square Error (RMSE): The square root of the mean squared difference between predicted and actual amounts. It is expressed in the target's units (USD) and penalizes large errors more heavily; lower is better.

Mean Absolute Error (MAE): The mean absolute difference between predicted and actual amounts, also in USD and often easier to interpret; lower is better.

R-squared (R²): Indicates the proportion of the variance in the target variable that is predictable from the features.
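
To make the three definitions concrete, here is a small worked example that computes them by hand with NumPy. The five actual and predicted amounts are made-up numbers chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical purchase amounts (USD) and predictions, for illustration only.
y_true = np.array([20.0, 40.0, 60.0, 80.0, 100.0])
y_pred = np.array([30.0, 40.0, 50.0, 90.0, 90.0])

errors = y_pred - y_true  # [10, 0, -10, 10, -10]

# RMSE: square the errors, average them, take the square root.
rmse = np.sqrt(np.mean(errors ** 2))  # sqrt(400 / 5) = sqrt(80)

# MAE: average of the absolute errors.
mae = np.mean(np.abs(errors))  # 40 / 5 = 8

# R^2: 1 - (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(errors ** 2)                     # 400
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 4000
r2 = 1 - ss_res / ss_tot                         # 0.9

print(f"RMSE: {rmse:.2f}  MAE: {mae:.2f}  R^2: {r2:.2f}")
```

RMSE is larger than MAE here because squaring gives extra weight to the larger errors; the two are equal only when every error has the same magnitude.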

Tune Hyperparameters: Adjust the internal settings (hyperparameters) of your chosen model(s) to optimize performance.

Techniques like cross-validation give a more reliable performance estimate during tuning and help detect overfitting.
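
A possible shape for this step is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV` with a small Random Forest; the synthetic data, the grid values, and the 5-fold setting are illustrative assumptions rather than tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: the target depends linearly on the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = 60 + 10 * X_demo[:, 0] + rng.normal(scale=5, size=150)

# Candidate hyperparameter values (an arbitrary, deliberately small grid).
param_grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=25, random_state=0),
    param_grid=param_grid,
    cv=5,                                   # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the sign
)
search.fit(X_demo, y_demo)

best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print("Best parameters:", search.best_params_)
print(f"Cross-validated RMSE: {best_rmse:.2f}")
```

The search should be run on the training data only, so that the test set stays untouched until the final evaluation.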

Phase 5: Deployment and Further Steps

Make Predictions: Once you are satisfied with your model's performance, use it to make predictions on your held-out test set or new, unseen customer data.

Interpret and Deploy: Understand the insights gained from the model (e.g., which features most influence purchase amount). You can then integrate the model into a practical application or a business decision-making process. 


## DEFINE TASK:

### Building a regression model to predict the numerical value of a customer's purchase amount (in USD) based on their demographic and behavioral data.

In [24]:
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score  # regression metrics only
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [3]:
df2=pd.read_csv("shopping_trends.csv")
df=df2.copy()

display("A look IN THE SHAPE OF THE DATASET:",df.shape)
display("A SAMPLE LOOK INTO THE DATASET:",df.sample(20))
display("A DESCRIPTION OF THE DATASET:",df.describe())
# df.info() prints its report directly and returns None, so display() shows None after the label.
display("INFORMATION ON THE DATASET:",df.info())

'A look IN THE SHAPE OF THE DATASET:'

(3900, 19)

'A SAMPLE LOOK INTO THE DATASET:'

Unnamed: 0,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Payment Method,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Preferred Payment Method,Frequency of Purchases
2685,2686,56,Female,Boots,Footwear,25,Virginia,M,Gray,Winter,2.7,No,Venmo,Express,No,No,49,Credit Card,Quarterly
2006,2007,58,Male,Scarf,Accessories,97,New Mexico,S,Pink,Fall,4.4,No,Debit Card,Express,No,No,2,Credit Card,Annually
3046,3047,60,Female,Shoes,Footwear,49,Pennsylvania,M,White,Spring,3.2,No,Bank Transfer,Store Pickup,No,No,25,Cash,Every 3 Months
3252,3253,69,Female,Sandals,Footwear,21,Montana,L,Brown,Spring,2.7,No,Bank Transfer,Store Pickup,No,No,48,Credit Card,Fortnightly
2673,2674,67,Female,Hoodie,Clothing,21,Nevada,M,Olive,Summer,2.6,No,Cash,2-Day Shipping,No,No,4,Cash,Monthly
3750,3751,42,Female,Hoodie,Clothing,88,Arizona,M,Gold,Fall,3.1,No,Debit Card,Free Shipping,No,No,5,PayPal,Monthly
865,866,64,Male,Blouse,Clothing,26,Colorado,L,Charcoal,Winter,4.8,Yes,Bank Transfer,Store Pickup,Yes,Yes,23,Venmo,Annually
2775,2776,37,Female,Socks,Clothing,25,Montana,L,Red,Fall,3.6,No,Venmo,Free Shipping,No,No,16,Debit Card,Weekly
837,838,25,Male,Hoodie,Clothing,28,Illinois,L,Brown,Spring,4.9,Yes,Credit Card,Express,Yes,Yes,33,Cash,Every 3 Months
3736,3737,42,Female,Jeans,Clothing,56,Idaho,M,Red,Spring,4.6,No,Cash,Store Pickup,No,No,2,PayPal,Every 3 Months


'A DESCRIPTION OF THE DATASET:'

Unnamed: 0,Customer ID,Age,Purchase Amount (USD),Review Rating,Previous Purchases
count,3900.0,3900.0,3900.0,3900.0,3900.0
mean,1950.5,44.068462,59.764359,3.749949,25.351538
std,1125.977353,15.207589,23.685392,0.716223,14.447125
min,1.0,18.0,20.0,2.5,1.0
25%,975.75,31.0,39.0,3.1,13.0
50%,1950.5,44.0,60.0,3.7,25.0
75%,2925.25,57.0,81.0,4.4,38.0
max,3900.0,70.0,100.0,5.0,50.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Customer ID               3900 non-null   int64  
 1   Age                       3900 non-null   int64  
 2   Gender                    3900 non-null   object 
 3   Item Purchased            3900 non-null   object 
 4   Category                  3900 non-null   object 
 5   Purchase Amount (USD)     3900 non-null   int64  
 6   Location                  3900 non-null   object 
 7   Size                      3900 non-null   object 
 8   Color                     3900 non-null   object 
 9   Season                    3900 non-null   object 
 10  Review Rating             3900 non-null   float64
 11  Subscription Status       3900 non-null   object 
 12  Payment Method            3900 non-null   object 
 13  Shipping Type             3900 non-null   object 
 14  Discount

'INFORMATION ON THE DATASET:'

None

In [4]:
df.columns

Index(['Customer ID', 'Age', 'Gender', 'Item Purchased', 'Category',
       'Purchase Amount (USD)', 'Location', 'Size', 'Color', 'Season',
       'Review Rating', 'Subscription Status', 'Payment Method',
       'Shipping Type', 'Discount Applied', 'Promo Code Used',
       'Previous Purchases', 'Preferred Payment Method',
       'Frequency of Purchases'],
      dtype='object')

In [53]:
df.drop(['Item Purchased','Location', 'Size', 'Color', 'Season','Shipping Type', 'Discount Applied', 'Promo Code Used','Preferred Payment Method'], axis=1, inplace=True)
display("SHAPE OF THE DATASET:",df.shape)
# df.info() prints its report directly and returns None, so display() shows None after the label.
display("INFORMATION ON THE DATASET:",df.info())

'SHAPE OF THE DATASET:'

(3900, 10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Customer ID             3900 non-null   int64  
 1   Age                     3900 non-null   int64  
 2   Gender                  3900 non-null   object 
 3   Category                3900 non-null   object 
 4   Purchase Amount (USD)   3900 non-null   int64  
 5   Review Rating           3900 non-null   float64
 6   Subscription Status     3900 non-null   object 
 7   Payment Method          3900 non-null   object 
 8   Previous Purchases      3900 non-null   int64  
 9   Frequency of Purchases  3900 non-null   object 
dtypes: float64(1), int64(4), object(5)
memory usage: 304.8+ KB


'INFORMATION ON THE DATASET:'

None

In [14]:
# Encode the non-numeric columns: one-hot encoding for nominal features,
# ordinal encoding for the ordered 'Frequency of Purchases' column.
# OneHotEncoder, OrdinalEncoder, ColumnTransformer and Pipeline are imported in the first cell.

# --- 2. Define Columns by Encoding Type ---

# Columns for One-Hot Encoding (Nominal, Unordered)
one_hot_cols = ['Gender', 'Category', 'Subscription Status', 'Payment Method']

# Column for Ordinal Encoding (Ordered)
ordinal_col = ['Frequency of Purchases']

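# Numerical columns (passed through without encoding).
# 'Purchase Amount (USD)' is the target; it is passed through here and split off later.
# Note: 'Customer ID' is a row identifier rather than a customer attribute; consider excluding it from the features.
numerical_cols = ['Customer ID', 'Age', 'Previous Purchases', 'Review Rating', 'Purchase Amount (USD)']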

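# --- 3. Define the Ordinal Categories in the Correct Order ---
# This step is crucial to maintain the ranking (least frequent to most frequent).
# Caveat: 'Every 3 Months' and 'Quarterly' describe the same interval, as do
# 'Bi-Weekly' and 'Fortnightly' when read as "every two weeks", yet each label gets its own rank here.
frequency_order = [
    'Rarely',
    'Annually',
    'Every 3 Months',
    'Quarterly',
    'Bi-Weekly',
    'Fortnightly',
    'Monthly',
    'Weekly'
]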

ordinal_categories = [frequency_order]

# --- 4. Create Preprocessing Pipelines ---

# One-hot encoder pipeline
one_hot_transformer = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Ordinal encoder pipeline
ordinal_transformer = OrdinalEncoder(categories=ordinal_categories)

# --- 5. Combine Transformers using ColumnTransformer ---
# This applies the correct transformation to each column group simultaneously
preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', one_hot_transformer, one_hot_cols),
        ('ordinal', ordinal_transformer, ordinal_col),
        ('numeric', 'passthrough', numerical_cols) # Keep numerical columns as they are
    ],
    remainder='drop' # Drops any columns not specified
)

# --- 6. Apply the transformations and create the final encoded DataFrame ---

# Fit and transform the data
df_encoded_np = preprocessor.fit_transform(df)

# Get the new column names for the one-hot encoded features
one_hot_feature_names = preprocessor.named_transformers_['onehot'].get_feature_names_out(one_hot_cols)

# Combine all new column names in the correct order
all_feature_names = list(one_hot_feature_names) + ordinal_col + numerical_cols

# Convert the resulting numpy array back into a pandas DataFrame
df_encoded = pd.DataFrame(df_encoded_np, columns=all_feature_names)

# --- 7. Display the result ---
print(df_encoded.head())
print("\nEncoded DataFrame shape:", df_encoded.shape)

   Gender_Female  Gender_Male  Category_Accessories  Category_Clothing  \
0            0.0          1.0                   0.0                1.0   
1            0.0          1.0                   0.0                1.0   
2            0.0          1.0                   0.0                1.0   
3            0.0          1.0                   0.0                0.0   
4            0.0          1.0                   0.0                1.0   

   Category_Footwear  Category_Outerwear  Subscription Status_No  \
0                0.0                 0.0                     0.0   
1                0.0                 0.0                     0.0   
2                0.0                 0.0                     0.0   
3                1.0                 0.0                     0.0   
4                0.0                 0.0                     0.0   

   Subscription Status_Yes  Payment Method_Bank Transfer  Payment Method_Cash  \
0                      1.0                           0.0         
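
One alternative to ranking the `frequency_order` labels is to map each label to an approximate number of purchases per year, so that labels describing the same interval receive the same value. The sketch below is only an illustration: the `PURCHASES_PER_YEAR` mapping is an assumption introduced here, and it reads "Bi-Weekly" as "every two weeks".

```python
import pandas as pd

# Assumed approximate purchases per year for each label.
# These numbers are illustrative and are not part of the dataset.
PURCHASES_PER_YEAR = {
    "Annually": 1,
    "Every 3 Months": 4,
    "Quarterly": 4,
    "Monthly": 12,
    "Bi-Weekly": 26,    # interpreted as "every two weeks"
    "Fortnightly": 26,
    "Weekly": 52,
}

# A few sample labels standing in for the 'Frequency of Purchases' column.
freq = pd.Series(
    ["Quarterly", "Every 3 Months", "Fortnightly", "Bi-Weekly", "Weekly", "Annually"],
    name="Frequency of Purchases",
)

# Labels missing from the mapping would become NaN and need explicit handling.
purchases_per_year = freq.map(PURCHASES_PER_YEAR)
print(purchases_per_year.tolist())
```

A label such as 'Rarely' has no obvious rate, so it is left out of the mapping on purpose; whether to assign it a value or treat it as missing is a modeling decision.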

In [20]:
# Assuming 'df_encoded' is your final DataFrame from the previous step

# Define the target variable (y)
y = df_encoded['Purchase Amount (USD)']

# Define the features (X) by dropping the target column from the DataFrame
X = df_encoded.drop('Purchase Amount (USD)', axis=1)

# Verify the shapes of your X and y
print(f"Shape of X (features): {X.shape}")
print(f"Shape of y (target): {y.shape}")

# View the first few rows of X to confirm the structure
print("\nFirst 5 rows of X:")
print(X.head())

Shape of X (features): (3900, 19)
Shape of y (target): (3900,)

First 5 rows of X:
   Gender_Female  Gender_Male  Category_Accessories  Category_Clothing  \
0            0.0          1.0                   0.0                1.0   
1            0.0          1.0                   0.0                1.0   
2            0.0          1.0                   0.0                1.0   
3            0.0          1.0                   0.0                0.0   
4            0.0          1.0                   0.0                1.0   

   Category_Footwear  Category_Outerwear  Subscription Status_No  \
0                0.0                 0.0                     0.0   
1                0.0                 0.0                     0.0   
2                0.0                 0.0                     0.0   
3                1.0                 0.0                     0.0   
4                0.0                 0.0                     0.0   

   Subscription Status_Yes  Payment Method_Bank Transfer  Payme

In [22]:
# Split the data with an 80/20 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42 )

# Print the size of each new set
print(f"\nTraining set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
print(f"y_train size: {y_train.shape[0]}")
print(f"y_test size: {y_test.shape[0]}")



Training set size: 3120 samples
Testing set size: 780 samples
y_train size: 3120
y_test size: 780


In [26]:
# Preprocessing: standardize the features so the coefficients are on a comparable scale
# (fit the scaler on the training set only, then apply it to the test set)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
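
The encoding, scaling, and model steps can also be chained in a single scikit-learn `Pipeline`, which guarantees that every transformer is fitted on the training rows only. The sketch below is one possible arrangement, not the notebook's actual workflow; the `demo` frame is randomly generated and borrows only a few column names from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Randomly generated stand-in data (not the real shopping dataset).
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "Age": rng.integers(18, 71, size=n),
    "Previous Purchases": rng.integers(1, 51, size=n),
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Category": rng.choice(["Clothing", "Footwear", "Accessories", "Outerwear"], size=n),
    "Purchase Amount (USD)": rng.integers(20, 101, size=n),
})

X_demo = demo.drop(columns="Purchase Amount (USD)")
y_demo = demo["Purchase Amount (USD)"]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scale numeric columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Previous Purchases"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender", "Category"]),
])

# fit() learns the scaling statistics and categories from the training rows only.
pipe = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
pipe.fit(X_tr, y_tr)

test_pred = pipe.predict(X_te)
print(test_pred[:5])
```

Because the pipeline is a single estimator, it can be passed directly to tools such as `cross_val_score` or `GridSearchCV`.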

In [28]:
# Fit a linear regression model on the standardized training features
model = LinearRegression()
model.fit(X_train_scaled, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, positive=False, tol=1e-06)


In [30]:
#MAKE PREDICTIONS
y_train_pred = model.predict(X_train_scaled) 
y_test_pred = model.predict(X_test_scaled)

In [34]:
# --- MODEL EVALUATION ---
# Calculate key regression metrics on both the train and test sets.
# This helps diagnose overfitting (a large gap between train and test performance).
train_mse = mean_squared_error(y_train, y_train_pred) 
test_mse = mean_squared_error(y_test, y_test_pred) 
train_rmse = np.sqrt(train_mse) 
test_rmse = np.sqrt(test_mse) 
train_mae = mean_absolute_error(y_train, y_train_pred) 
test_mae = mean_absolute_error(y_test, y_test_pred) 
train_r2 = r2_score(y_train, y_train_pred) 
test_r2 = r2_score(y_test, y_test_pred)


print("=== LINEAR REGRESSION PERFORMANCE ===")
# 'Purchase Amount (USD)' is already in dollars, so the errors need no rescaling.
print(f"Train RMSE: ${train_rmse:,.2f} | Test RMSE: ${test_rmse:,.2f}")
print(f"Train MAE:  ${train_mae:,.2f} | Test MAE:  ${test_mae:,.2f}")
print(f"Train R²:   {train_r2:.4f} | Test R²: {test_r2:.4f}")

=== LINEAR REGRESSION PERFORMANCE ===
Train RMSE: $23.60 | Test RMSE: $23.77
Train MAE:  $20.46 | Test MAE:  $20.75
Train R²:   0.0060 | Test R²: -0.0099


### 🔍 **1. Understanding the Metrics**

- **RMSE (Root Mean Squared Error)**: Measures the typical magnitude of prediction errors, penalizing larger errors more heavily because the errors are squared before averaging.
- **MAE (Mean Absolute Error)**: Measures the average absolute prediction error; it is less sensitive to outliers than RMSE.
- **R² (R-squared / Coefficient of Determination)**:
  - Ranges from **-∞ to 1**.
  - **R² = 1**: Perfect fit.
  - **R² = 0**: Model does no better than always predicting the mean of the target.
  - **R² < 0**: Model performs **worse than a horizontal line (mean predictor)**.

> 💡 **Note**: `Purchase Amount (USD)` is already in dollars (it ranges from 20 to 100 in `df.describe()`), so RMSE and MAE need no rescaling. Multiplying them by **100,000**, a leftover from a template whose target was stored in units of $100,000, inflates the errors into the millions; the actual values are roughly **$23.60 / $23.77** (train / test RMSE) and **$20.46 / $20.75** (train / test MAE).

---

### 📊 **2. Interpreting Your Results**

#### ✅ **Train vs. Test Performance**
| Metric | Train | Test | Gap |
|-------|--------|--------|------|
| **RMSE** | $23.60 | $23.77 | Very small (~$0.17) |
| **MAE** | $20.46 | $20.75 | Small (~$0.29) |
| **R²** | **+0.0060** | **-0.0099** | Slight drop |

Errors are in USD, the target's own unit, with no 100,000 scaling factor applied.

- **The train and test errors are nearly identical**, so **overfitting is not the problem**.
- However, the **R² is close to zero (train) and slightly negative (test)**, which is a **major red flag**. For scale, the target's standard deviation is about $23.69, so an RMSE near $23.6 is essentially what a constant mean prediction would achieve.

---

### 🚩 **3. Key Interpretation: The Model Is Not Learning**

- An **R² of 0.006** on the training set means your model explains **only 0.6%** of the variance in the target, which is **almost nothing**.
- A **negative R² on the test set (-0.0099)** means your model is **slightly worse than simply predicting the mean** of the target variable.
- In practical terms: **your linear regression model is no better than a constant baseline**.

This suggests **severe underfitting**: the model is **too simple**, or the **features lack predictive power** for the target.

---

### 🛠️ **4. Likely Causes & Recommendations**

#### ❌ Possible Issues:
1. **Poor feature selection**: Input features may not be correlated with the target. Several columns (for example `Item Purchased`, `Season`, and `Discount Applied`) were dropped before modeling, while `Customer ID`, a row identifier, was kept.
2. **Non-linear relationships**: Linear regression assumes linear relationships; real-world data might be non-linear.
3. **Missing important features**: Key predictors may be absent.
4. **Data quality issues**: Outliers or incorrect preprocessing.
5. **Inadequate feature engineering**: Raw features may need transformation (e.g., log, polynomial, interaction terms).

#### ✅ Recommended Next Steps:
1. **Baseline comparison**:
   Compute the RMSE/MAE of a **dummy regressor** that always predicts the mean of `y_train`. Your model should beat this, but currently it does not on the test set.

2. **Exploratory Data Analysis (EDA)**:
   - Check correlations between the numeric features and the target (`df.corr(numeric_only=True)`).
   - Visualize relationships (scatter plots, pair plots).

3. **Try more powerful models**:  
   Test **Random Forest**, **Gradient Boosting (XGBoost)**, or **polynomial regression** to see if performance improves.

4. **Feature engineering**:  
   - Add interaction terms, log transforms, or binning.
   - Handle categorical variables properly (one-hot encoding, target encoding).

5. **Rescale features** (if not done):  
   While linear regression is scale-invariant for prediction, regularization (e.g., Ridge/Lasso) requires scaling.

6. **Check for data leakage or time-series issues**:  
   If your data has a temporal component, random train-test splits may be invalid.
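
The baseline comparison in step 1 could look like the sketch below, which uses scikit-learn's `DummyRegressor`. The data here is synthetic and deliberately built so that the target is unrelated to the features, an assumption chosen to mimic a near-zero R²; it is not the shopping dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: the target is drawn independently of the features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 5))
y_demo = rng.uniform(20, 100, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linear = LinearRegression().fit(X_tr, y_tr)

for name, estimator in [("Mean baseline", baseline), ("Linear regression", linear)]:
    pred = estimator.predict(X_te)
    print(f"{name:>17}: MAE = {mean_absolute_error(y_te, pred):.2f}, "
          f"R^2 = {r2_score(y_te, pred):.4f}")
```

If a candidate model's test MAE and RMSE are not clearly below the baseline's, the features are contributing little, and a more complex model alone is unlikely to help.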

---

### 📌 **Summary**

> Your linear regression model **fails to capture meaningful patterns** in the data. The near-zero train R² and slightly negative test R² indicate it **performs no better than a naive mean predictor**. Because train and test performance are similar, the issue is **underfitting, not overfitting**. Focus on **better features, non-linear models, or deeper data exploration**.


In [None]:
# --- 7. MODEL INTERPRETATION ---
# Extract feature names and their corresponding coefficients (weights).
# X is the encoded feature DataFrame built above; the model was fit on
# standardized features, so coefficient magnitudes are comparable.
feature_names = X.columns
coefficients = model.coef_

# Create a DataFrame for easy sorting and visualization
coef_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})
coef_df = coef_df.sort_values(by='Coefficient', key=abs, ascending=False)  # Sort by absolute magnitude

# Plot the coefficients
plt.figure(figsize=(10, 6))
plt.barh(coef_df['Feature'], coef_df['Coefficient'], color='skyblue')
plt.xlabel('Coefficient Value')
plt.title('Linear Regression: Feature Coefficients (Impact on Purchase Amount)')
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)  # Vertical line at zero
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()