### 👋 **SalesNexus: Business Understanding**

![SalesNexus Banner](bg.jpg)

---

## 🎯 Goal

To **predict daily sales** for each item across stores, enabling:

* ✅ **Dynamic Pricing**: Adjust pricing based on demand and promotions
* ✅ **Inventory Optimization**: Reduce waste, prevent stockouts
* ✅ **Better Profitability**: Minimize surplus, maximize sales efficiency

---

## 🧐 Why This Matters

Modern retailers operate in highly competitive environments. Manual forecasting often fails to:

* Anticipate **seasonal fluctuations** (holidays, promotions)
* Respond to **regional differences** across stores
* Adjust to **external influences** (economic changes, events)

**Resulting Problems:**

* Overstocking ➔ Increased waste and holding costs
* Understocking ➔ Missed sales and dissatisfied customers
* Suboptimal pricing ➔ Reduced margins and competitive disadvantage

---

## 💡 Objective

Develop a robust sales forecasting system that predicts future sales per item, per store, and allows:

* Automated, data-driven pricing adjustments
* Smarter inventory planning
* Insights into sales drivers (trend, seasonality, promotions)

---

## 🎯 Stakeholders

* **Retail Chains**: Streamline supply chain and pricing strategies.
* **Pricing Teams**: Identify revenue optimization opportunities.
* **Store Operations**: Maintain optimal inventory levels.
* **Customers**: Enjoy better availability and pricing.

---

## 📊 Evaluation Metrics

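* **RMSLE** (Root Mean Squared Logarithmic Error): Primary metric for this Kaggle-style prediction task. It penalizes relative rather than absolute error, so it requires non-negative sales values.
* **Additional metrics**: MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error), used for deeper error analysis in the original sales units.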

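The three metrics above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the project's evaluation code: the function names `rmsle`, `mae`, and `rmse` and the sample values are assumptions made for this example. RMSLE is taken here as the square root of the mean squared difference between `log(1 + predicted)` and `log(1 + actual)`.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error; inputs must be non-negative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(1 + x), which stays finite when sales are zero
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


def mae(y_true, y_pred):
    """Mean Absolute Error in the original sales units."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))


def rmse(y_true, y_pred):
    """Root Mean Squared Error in the original sales units."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


# Hypothetical daily sales for three items (illustrative values only)
actual = [0.0, 2.0, 10.0]
predicted = [0.0, 3.0, 8.0]

print("RMSLE:", rmsle(actual, predicted))
print("MAE:  ", mae(actual, predicted))
print("RMSE: ", rmse(actual, predicted))
```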
---

## 🚀 Impact

* **Increase Profitability**: Minimize waste and surplus inventory.
* **Optimize Operations**: Reduce storage and logistics costs.
* **Better Experience**: Improve product availability and pricing precision for customers.

---



In [1]:
import os

In [2]:
%pwd

'c:\\Arjun_Works\\SalesNexus\\research'

In [3]:
os.chdir('../')

In [4]:
%pwd

'c:\\Arjun_Works\\SalesNexus'

In [5]:
from dataclasses import dataclass
from pathlib import Path
from typing import Dict

@dataclass(frozen=True)
class DataAcquisitionConfig:
    """Config for downloading and accessing raw data files."""
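    root_dir: Path              # Root artifact directory for this stage
    source: str                 # Kaggle dataset URL
    dataset_name: str           # Kaggle dataset name (folder name after download)
    local_dir: Path             # Directory where the raw files reside
    data_files: Dict[str, str]  # Mapping of dataset names to filenames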


In [6]:
from ml_service.constants import CONFIG_FILE_PATH
from ml_service.utils.main_utils import read_yaml, create_directories

In [7]:
class ConfigurationManager:
    def __init__(self, config_filepath: str):
        """Initialize the configuration manager.

        Args:
            config_filepath (str): Path to the main configuration file (YAML).
        """
        self.config = read_yaml(config_filepath)
        create_directories([self.config.artifacts_root])

    def get_data_acquisition_config(self) -> DataAcquisitionConfig:
        """Get the configuration for data acquisition.

        Returns:
            DataAcquisitionConfig: Paths and source details for data acquisition.
        """
        config = self.config.data_acquisition
        create_directories([config.root_dir])

        return DataAcquisitionConfig(
            root_dir=Path(config.root_dir),
            source=config.source,
            dataset_name=config.dataset_name,
            local_dir=Path(config.local_dir),
            data_files=dict(config.data_files)  
        )

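To make the shape of the `data_acquisition` section concrete, here is a minimal, self-contained sketch that builds the same dataclass from a plain dictionary instead of the YAML file. The directory and file names are taken from the log and directory listing shown later in this notebook; the `source` URL and the dictionary name `data_acquisition_section` are assumptions for illustration, since the actual `config.yaml` is not shown here.

```python
from dataclasses import dataclass, FrozenInstanceError
from pathlib import Path
from typing import Dict


@dataclass(frozen=True)
class DataAcquisitionConfig:
    """Config for downloading and accessing raw data files."""
    root_dir: Path
    source: str
    dataset_name: str
    local_dir: Path
    data_files: Dict[str, str]


# Hypothetical stand-in for the `data_acquisition` section of config.yaml
data_acquisition_section = {
    "root_dir": "artifacts/data_acquisition",
    # Assumed URL; the real value lives in config.yaml
    "source": "https://www.kaggle.com/competitions/store-sales-time-series-forecasting",
    "dataset_name": "store-sales-time-series-forecasting",
    "local_dir": "artifacts/data_acquisition/store-sales-time-series-forecasting",
    "data_files": {
        "train": "train.csv",
        "test": "test.csv",
        "oil": "oil.csv",
        "stores": "stores.csv",
        "transactions": "transactions.csv",
        "holidays_events": "holidays_events.csv",
    },
}

cfg = DataAcquisitionConfig(
    root_dir=Path(data_acquisition_section["root_dir"]),
    source=data_acquisition_section["source"],
    dataset_name=data_acquisition_section["dataset_name"],
    local_dir=Path(data_acquisition_section["local_dir"]),
    data_files=dict(data_acquisition_section["data_files"]),
)

# frozen=True makes the config read-only after construction
frozen_error_raised = False
try:
    cfg.source = "somewhere-else"
except FrozenInstanceError:
    frozen_error_raised = True

print(cfg.local_dir / cfg.data_files["train"])
print("Config is immutable:", frozen_error_raised)
```

Freezing the dataclass guards against a later cell silently changing a path partway through the pipeline. Note that the `data_files` dictionary itself is still mutable.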

### Data Acquisition 

In [8]:
from ml_service.logging import logger

logger.info(f"Current working directory: {Path.cwd()}")

[2025-06-23 21:52:05,767: INFO: 306878826: Current working directory: c:\Arjun_Works\SalesNexus]


In [None]:
import opendatasets as od


# Load configuration
config_manager = ConfigurationManager(CONFIG_FILE_PATH)
data_acquisition_config = config_manager.get_data_acquisition_config()

# Download dataset from Kaggle (the target directory is passed as a string path)
od.download(data_acquisition_config.source, str(data_acquisition_config.root_dir))

[2025-06-23 21:52:05,872: INFO: main_utils: yaml file: config\config.yaml loaded successfully]
[2025-06-23 21:52:05,874: INFO: main_utils: created directory at: artifacts]
[2025-06-23 21:52:05,876: INFO: main_utils: created directory at: artifacts/data_acquisition]
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username:

### Data Loading

In [None]:
import pandas as pd

# Check available files in the data directory
print("Files in data directory:", list(data_acquisition_config.local_dir.glob("*")))

# Resolve each configured file name to its full path under local_dir
data_dir = data_acquisition_config.local_dir
data_files = data_acquisition_config.data_files

train_path = data_dir / data_files["train"]
test_path = data_dir / data_files["test"]
oil_path = data_dir / data_files["oil"]
stores_path = data_dir / data_files["stores"]
transactions_path = data_dir / data_files["transactions"]
holidays_events_path = data_dir / data_files["holidays_events"]

# Load each CSV into its own DataFrame
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)
oil_df = pd.read_csv(oil_path)
stores_df = pd.read_csv(stores_path)
transactions_df = pd.read_csv(transactions_path)
holidays_events_df = pd.read_csv(holidays_events_path)


Files in data directory: [WindowsPath('artifacts/data_acquisition/store-sales-time-series-forecasting/holidays_events.csv'), WindowsPath('artifacts/data_acquisition/store-sales-time-series-forecasting/oil.csv'), WindowsPath('artifacts/data_acquisition/store-sales-time-series-forecasting/sample_submission.csv'), WindowsPath('artifacts/data_acquisition/store-sales-time-series-forecasting/stores.csv'), WindowsPath('artifacts/data_acquisition/store-sales-time-series-forecasting/test.csv'), WindowsPath('artifacts/data_acquisition/store-sales-time-series-forecasting/train.csv'), WindowsPath('artifacts/data_acquisition/store-sales-time-series-forecasting/transactions.csv')]


In [None]:
train_df.head()

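|   | id | date | store_nbr | family | sales | onpromotion |
|---|----|------|-----------|--------|-------|-------------|
| 0 | 0 | 2013-01-01 | 1 | AUTOMOTIVE | 0.0 | 0 |
| 1 | 1 | 2013-01-01 | 1 | BABY CARE | 0.0 | 0 |
| 2 | 2 | 2013-01-01 | 1 | BEAUTY | 0.0 | 0 |
| 3 | 3 | 2013-01-01 | 1 | BEVERAGES | 0.0 | 0 |
| 4 | 4 | 2013-01-01 | 1 | BOOKS | 0.0 | 0 |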

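By default `pd.read_csv` loads the `date` column as plain strings, which gets in the way of time-based operations later. A minimal sketch of parsing it at load time, using two rows copied from the `train.csv` preview and an in-memory buffer in place of the real file (the names `sample_csv` and `sample_df` are made up for this example):

```python
import io

import pandas as pd

# Two rows copied from the train.csv preview, held in memory for illustration
sample_csv = io.StringIO(
    "id,date,store_nbr,family,sales,onpromotion\n"
    "0,2013-01-01,1,AUTOMOTIVE,0.0,0\n"
    "1,2013-01-01,1,BABY CARE,0.0,0\n"
)

# parse_dates converts the listed columns to datetime64 while reading
sample_df = pd.read_csv(sample_csv, parse_dates=["date"])

print(sample_df.dtypes)
```

The same `parse_dates=["date"]` argument could be passed when reading the real files, assuming each of them exposes a `date` column as the previews here suggest (`stores.csv` does not).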

In [None]:
test_df.head()

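|   | id | date | store_nbr | family | onpromotion |
|---|----|------|-----------|--------|-------------|
| 0 | 3000888 | 2017-08-16 | 1 | AUTOMOTIVE | 0 |
| 1 | 3000889 | 2017-08-16 | 1 | BABY CARE | 0 |
| 2 | 3000890 | 2017-08-16 | 1 | BEAUTY | 2 |
| 3 | 3000891 | 2017-08-16 | 1 | BEVERAGES | 20 |
| 4 | 3000892 | 2017-08-16 | 1 | BOOKS | 0 |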


In [None]:
oil_df.head()

|   | date | dcoilwtico |
|---|------|------------|
| 0 | 2013-01-01 | NaN |
| 1 | 2013-01-02 | 93.14 |
| 2 | 2013-01-03 | 92.97 |
| 3 | 2013-01-04 | 93.12 |
| 4 | 2013-01-07 | 93.2 |


In [None]:
transactions_df.head()

|   | date | store_nbr | transactions |
|---|------|-----------|--------------|
| 0 | 2013-01-01 | 25 | 770 |
| 1 | 2013-01-02 | 1 | 2111 |
| 2 | 2013-01-02 | 2 | 2358 |
| 3 | 2013-01-02 | 3 | 3487 |
| 4 | 2013-01-02 | 4 | 1922 |


In [None]:
holidays_events_df.head()

|   | date | type | locale | locale_name | description | transferred |
|---|------|------|--------|-------------|-------------|-------------|
| 0 | 2012-03-02 | Holiday | Local | Manta | Fundacion de Manta | False |
| 1 | 2012-04-01 | Holiday | Regional | Cotopaxi | Provincializacion de Cotopaxi | False |
| 2 | 2012-04-12 | Holiday | Local | Cuenca | Fundacion de Cuenca | False |
| 3 | 2012-04-14 | Holiday | Local | Libertad | Cantonizacion de Libertad | False |
| 4 | 2012-04-21 | Holiday | Local | Riobamba | Cantonizacion de Riobamba | False |


In [None]:
stores_df.head()

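|   | store_nbr | city | state | type | cluster |
|---|-----------|------|-------|------|---------|
| 0 | 1 | Quito | Pichincha | D | 13 |
| 1 | 2 | Quito | Pichincha | D | 13 |
| 2 | 3 | Quito | Pichincha | D | 8 |
| 3 | 4 | Quito | Pichincha | D | 9 |
| 4 | 5 | Santo Domingo | Santo Domingo de los Tsachilas | D | 4 |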


In [None]:
from pathlib import Path
from typing import Dict
import pandas as pd
import opendatasets as od
import shutil

class DataLoader:
    """
    Handles downloading, flattening, and loading raw data files from Kaggle.
    """
    def __init__(self, data_dir: Path, source: str, data_files: Dict[str, str], dataset_name: str) -> None:
        """
        Args:
            data_dir (Path): Directory where raw files reside.
            source (str): Kaggle dataset URL.
            data_files (dict): Mapping of dataset names to filenames.
            dataset_name (str): Name of the Kaggle dataset (folder name after download).
        """
        self.data_dir = data_dir
        self.source = source
        self.data_files = data_files
        self.dataset_name = dataset_name

    def download(self) -> None:
        """Check if files already exist. If not, download and flatten."""
        if self._all_files_exist():
            print(f"✅ All files already present in: {self.data_dir}. Skipping download.")
            return

        # Otherwise, proceed with downloading
        self.data_dir.mkdir(parents=True, exist_ok=True)
        print("🚀 Downloading dataset from Kaggle...")
        od.download(self.source, str(self.data_dir))
        print("✅ Download complete.")

        self._flatten_download()

    def _all_files_exist(self) -> bool:
        """Check if all required files already exist in the data_dir."""
        return all((self.data_dir / filename).exists() for filename in self.data_files.values())

    def _flatten_download(self) -> None:
        """Move files from Kaggle's subfolder into data_dir root."""
        kaggle_subdir = self.data_dir / self.dataset_name
        if kaggle_subdir.exists():
            for file in kaggle_subdir.iterdir():
                shutil.move(str(file), str(self.data_dir))
            kaggle_subdir.rmdir()
            print(f"📂 Flattened downloaded files into: {self.data_dir}")

    def load(self, name: str) -> pd.DataFrame:
        """Load a dataset by its configured name."""
        if name not in self.data_files:
            raise ValueError(f"Dataset '{name}' not found in configuration.")
        path = self.data_dir / self.data_files[name]
        if not path.exists():
            raise FileNotFoundError(f"File not found at {path}")
        print(f"📥 Loading: {path}")
        return pd.read_csv(path)

    def load_all(self) -> Dict[str, pd.DataFrame]:
        """Load all configured datasets as a dict of DataFrames."""
        return {name: self.load(name) for name in self.data_files.keys()}

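The flattening step can be tried in isolation, without Kaggle credentials, by simulating the nested folder in a temporary directory. The sketch below is a standalone mirror of `_flatten_download`, not the class method itself: the function name `flatten_subdir` and the placeholder file contents are assumptions, and the directory listing is materialized with `list()` before any file is moved.

```python
import shutil
import tempfile
from pathlib import Path


def flatten_subdir(data_dir: Path, dataset_name: str) -> None:
    """Move files from data_dir/dataset_name up into data_dir, then drop the subfolder."""
    subdir = data_dir / dataset_name
    if subdir.exists():
        # Snapshot the listing first so moving files cannot disturb iteration
        for file in list(subdir.iterdir()):
            shutil.move(str(file), str(data_dir))
        subdir.rmdir()  # only succeeds once the subfolder is empty


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)

    # Simulate the layout left behind by a download: data_dir/<dataset_name>/*.csv
    nested = data_dir / "store-sales-time-series-forecasting"
    nested.mkdir()
    (nested / "train.csv").write_text("id,date\n0,2013-01-01\n")
    (nested / "test.csv").write_text("id,date\n3000888,2017-08-16\n")

    flatten_subdir(data_dir, nested.name)

    flattened = sorted(p.name for p in data_dir.iterdir())
    subdir_removed = not nested.exists()

print("Files in data_dir:", flattened)
print("Subfolder removed:", subdir_removed)
```

One caveat of this approach: `shutil.move` raises an error if a file with the same name already exists in the destination, so a partially populated `data_dir` would need to be cleaned up before re-running the download.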

In [None]:
from pathlib import Path

# Assuming data_acquisition_config is already obtained from ConfigurationManager
loader = DataLoader(
    data_dir=Path(data_acquisition_config.local_dir),
    source=data_acquisition_config.source,
    data_files=data_acquisition_config.data_files,
    dataset_name=data_acquisition_config.dataset_name
)



In [None]:

# 1️⃣ Download + flatten files
loader.download()

# 2️⃣ Load individual files
train_df = loader.load("train")
test_df = loader.load("test")

# 3️⃣ Load all at once
all_data = loader.load_all()

# Example: see loaded train data
print(train_df.head())

In [None]:
train_df.head()

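|   | id | date | store_nbr | family | sales | onpromotion |
|---|----|------|-----------|--------|-------|-------------|
| 0 | 0 | 2013-01-01 | 1 | AUTOMOTIVE | 0.0 | 0 |
| 1 | 1 | 2013-01-01 | 1 | BABY CARE | 0.0 | 0 |
| 2 | 2 | 2013-01-01 | 1 | BEAUTY | 0.0 | 0 |
| 3 | 3 | 2013-01-01 | 1 | BEVERAGES | 0.0 | 0 |
| 4 | 4 | 2013-01-01 | 1 | BOOKS | 0.0 | 0 |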
