# Aurelion Sales Amount Prediction — End-to-End ML Pipeline

**Author:** Isabel Feraudo  
**Date:** 2025-11  

## Summary
This project implements an end-to-end Machine Learning pipeline to predict the total amount of a sales transaction line using structured retail data from Aurelion Store.
The workflow covers data ingestion from multiple relational sources, preprocessing, feature engineering, model training, evaluation, and inference. Sales, products, and customer datasets are merged into a unified analytical dataset, where the target variable (importe) is derived from quantity and unit price.
Two regression models, Linear Regression and Random Forest Regressor, are trained and compared using a reproducible preprocessing pipeline that includes numerical scaling and categorical encoding. Model performance is evaluated with MAE, RMSE, and R² metrics.
The project follows a production-oriented structure with modular data loading, reusable pipelines, and environment reproducibility, demonstrating practical skills in machine learning engineering, data preparation, and model evaluation.
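
As a rough illustration of the merge and target derivation described above, the sketch below joins three tiny in-memory tables and computes `importe` as quantity times unit price. Every column name other than `importe` (`id_venta`, `id_cliente`, `id_producto`, `cantidad`, `precio_unitario`, `categoria`, `ciudad`), the sample values, and the helper `build_dataset` are assumptions made for the example; the actual Aurelion schemas may differ.

```python
import pandas as pd

# Hypothetical miniature tables standing in for the real sources.
sales = pd.DataFrame({
    "id_venta": [1, 1, 2],
    "id_cliente": [10, 10, 11],
    "id_producto": [100, 101, 100],
    "cantidad": [2, 1, 3],
})
products = pd.DataFrame({
    "id_producto": [100, 101],
    "categoria": ["Alimentos", "Limpieza"],
    "precio_unitario": [150.0, 320.0],
})
customers = pd.DataFrame({
    "id_cliente": [10, 11],
    "ciudad": ["Cordoba", "Rio Cuarto"],
})


def build_dataset(sales, products, customers):
    """Left-join sales lines with product and customer attributes, then derive importe."""
    merged = (
        sales
        # validate= raises if a lookup table has duplicate keys,
        # which would silently duplicate sales lines otherwise.
        .merge(products, on="id_producto", how="left", validate="many_to_one")
        .merge(customers, on="id_cliente", how="left", validate="many_to_one")
    )
    # Target variable: quantity times unit price for each sales line.
    merged["importe"] = merged["cantidad"] * merged["precio_unitario"]
    return merged


dataset = build_dataset(sales, products, customers)
print(dataset[["id_venta", "categoria", "cantidad", "precio_unitario", "importe"]])
```

Left joins keep every sales line even when a product or customer lookup is missing, so unmatched rows show up as missing values that the preprocessing step can handle explicitly.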

### Tech stack
Python, Pandas, Scikit-learn, Matplotlib, Seaborn

### Architecture
Modular ML pipeline with reproducible project structure

### Goal
Demonstrate applied machine learning and pipeline design for real-world retail data.

## Imports & Config

In [None]:
# Standard libraries
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt

# ML tools
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error  # regression metric; accuracy_score applies to classifiers only

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

### Load Data

In [None]:
# Load dataset
df = pd.read_csv("data.csv")

# Quick preview
df.head()

### Exploratory Data Analysis (EDA)

In [None]:
# Basic info
df.info()

# Statistics (printed explicitly: a notebook cell only displays its last expression)
print(df.describe())

# Missing values per column
print(df.isnull().sum())

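# Example visualization: one histogram per numeric column
# (df.hist() creates its own figure, so no separate plt.figure() call is needed)
df.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()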

### Data Preprocessing

In [None]:
# Example preprocessing steps

# Drop missing values
df = df.dropna()

# Feature / target split
# "target" is a placeholder column name; in this project the target variable is importe
X = df.drop(columns="target")
y = df["target"]

# Train test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

### Model Training

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
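
The Summary mentions a preprocessing pipeline with numerical scaling and categorical encoding, and a comparison of two regressors. A minimal sketch of one way to wire that up with scikit-learn is shown below. The synthetic data, the column names (`cantidad`, `precio_unitario`, `categoria`), and the hyperparameters are assumptions for illustration, not the project's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

RANDOM_STATE = 42
rng = np.random.default_rng(RANDOM_STATE)

# Synthetic stand-in data (hypothetical columns and values).
n_rows = 200
X_demo = pd.DataFrame({
    "cantidad": rng.integers(1, 6, size=n_rows),
    "precio_unitario": rng.uniform(50, 500, size=n_rows).round(2),
    "categoria": rng.choice(["Alimentos", "Limpieza"], size=n_rows),
})
y_demo = X_demo["cantidad"] * X_demo["precio_unitario"]

# Scale numeric columns and one-hot encode categorical ones.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["cantidad", "precio_unitario"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["categoria"]),
])

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(
        n_estimators=50, random_state=RANDOM_STATE
    ),
}

fitted = {}
for name, estimator in candidates.items():
    # clone() gives each pipeline its own unfitted copy of the preprocessor.
    pipe = Pipeline([
        ("preprocess", clone(preprocessor)),
        ("model", estimator),
    ])
    fitted[name] = pipe.fit(X_demo, y_demo)

print({name: type(p.named_steps["model"]).__name__ for name, p in fitted.items()})
```

Bundling preprocessing and model in one `Pipeline` means the scaler and encoder are fitted only on the data passed to `fit`, and the same transformations are replayed automatically at prediction time.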

### Evaluation

In [None]:
predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
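
The Summary lists MAE, RMSE, and R² as the evaluation metrics, while the cell above reports only MSE. A small helper along the following lines could compute all three; the function name `regression_report` and the sample values are hypothetical and used only to make the example runnable on its own.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, RMSE, and R² for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mae": float(mean_absolute_error(y_true, y_pred)),
        # RMSE is the square root of MSE, so it is in the same units as the target.
        "rmse": float(np.sqrt(mse)),
        "r2": float(r2_score(y_true, y_pred)),
    }


# Hypothetical values for illustration only.
y_true_demo = [100.0, 200.0, 300.0]
y_pred_demo = [110.0, 190.0, 300.0]

report = regression_report(y_true_demo, y_pred_demo)
print(report)
```

In the notebook, the same helper would be called as `regression_report(y_test, predictions)`.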

### Experiment Tracking

In [None]:
# Save metrics
results = {
    "model": "LinearRegression",
    "mse": mse
}

pd.DataFrame([results]).to_csv("results.csv", index=False)

### Model Export

In [None]:
import joblib
joblib.dump(model, "model.joblib")
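
For the inference step mentioned in the Summary, an exported model can be reloaded with `joblib.load` and used for prediction. The sketch below trains a stand-in model on made-up numbers and writes it to a temporary directory so that it runs on its own; the data, the variable names, and the temporary path are assumptions, not part of the project.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model trained on made-up data (y = 2 * x).
X_demo = np.array([[1.0], [2.0], [3.0]])
y_demo = np.array([2.0, 4.0, 6.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(demo_model, path)   # export
    restored = joblib.load(path)    # reload, e.g. in a separate inference script

# Predict on unseen input with the restored model.
new_rows = np.array([[4.0]])
restored_pred = restored.predict(new_rows)
print(restored_pred)
```

A model loaded this way should generally be used with the same scikit-learn version that produced it, which is one reason to pin dependencies for environment reproducibility.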

# Conclusions
## Results
The Random Forest Regressor achieved the best predictive performance, outperforming the Linear Regression model across all evaluation metrics (MAE, RMSE, and R²). This indicates that the relationship between the predictors and the target variable is not purely linear and benefits from a non-linear modeling approach.
The most influential factors in predicting the total transaction amount are quantity and unit price, which directly determine the monetary value of each sales line. The product category introduces additional variability that improves model accuracy when properly encoded.
From a data perspective, the preprocessing pipeline successfully handled heterogeneous features by combining numerical scaling and categorical encoding in a reproducible workflow. The modular project structure enables consistent training, evaluation, and inference across environments.
Overall, the results demonstrate that a structured machine learning pipeline can effectively model retail transaction behavior and provide reliable predictions for operational or analytical use cases.
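
One way to check which features drive a model's predictions, as discussed above for quantity and unit price, is permutation importance. The sketch below is an assumption about how such a check could look, not the analysis used in this project: it runs on synthetic data with hypothetical columns (`cantidad`, `precio_unitario`, and a pure-noise column `ruido`).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

RANDOM_STATE = 42
rng = np.random.default_rng(RANDOM_STATE)

# Synthetic stand-in data: the target depends on two columns, not on "ruido".
n_rows = 300
X_demo = pd.DataFrame({
    "cantidad": rng.integers(1, 6, size=n_rows),
    "precio_unitario": rng.uniform(50, 500, size=n_rows),
    "ruido": rng.normal(size=n_rows),
})
y_demo = X_demo["cantidad"] * X_demo["precio_unitario"]

forest = RandomForestRegressor(n_estimators=30, random_state=RANDOM_STATE)
forest.fit(X_demo, y_demo)

# Shuffle one column at a time and measure how much the score drops.
result = permutation_importance(
    forest, X_demo, y_demo, n_repeats=5, random_state=RANDOM_STATE
)
importances = pd.Series(
    result.importances_mean, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

For brevity the importances here are computed on the training data; on a real project they would normally be measured on a held-out split.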

## Business Impact
This predictive model can support data-driven decision making in retail operations by providing reliable estimates of transaction amounts before purchase completion.
Potential applications include:

**Revenue estimation and planning:**
The model enables early prediction of sales value, helping forecast revenue trends and support financial planning.

**Pricing and demand analysis:**
By capturing the relationship between quantity, price, and category, the model can assist in identifying price-sensitive products and evaluating pricing strategies.

**Operational optimization:**
Predicted transaction values can inform inventory management, promotional strategies, and product prioritization.

**Scalable analytics foundation:**
The modular pipeline design allows easy integration into dashboards, APIs, or monitoring systems, making it suitable for production-oriented environments.

This project demonstrates how machine learning pipelines can transform raw transactional data into actionable insights with direct business relevance.