# **(RETAIL SALES ETL NOTEBOOK)**

## Objectives

The objective of this notebook is to perform Extract, Transform, Load (ETL) operations on retail sales data to prepare it for analysis.

## Inputs

- Raw retail sales dataset stored in dataset/raw-data/
- Data fields: Weekly sales, store info, promotional markdowns, holidays, etc.
- Python libraries: pandas, numpy

## Outputs

- Cleaned and transformed dataset saved to dataset/clean-data/cleaned_sales_data.csv
- Engineered features like Total_MarkDown

# Change working directory

* This notebook is assumed to live in a subfolder of the project, so the working directory must be changed before relative paths such as dataset/raw-data/ will resolve.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
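
Re-running the os.chdir() cell above moves the working directory up one more level each time. The following is a minimal sketch of a guard against that, assuming the notebooks live in a subfolder named `jupyter_notebooks`. Both that folder name and the helper `move_to_project_root` are hypothetical and should be adapted to the actual project layout.

```python
import os


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Move to the parent directory only while still inside the notebook folder.

    `notebook_folder` is a hypothetical folder name used for illustration.
    """
    current_dir = os.getcwd()
    # Only step up if the current folder is the notebook subfolder
    if os.path.basename(current_dir) == notebook_folder:
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()


print(move_to_project_root())
```

Because the check compares the folder name before changing directory, running the cell a second time leaves the working directory where it is.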

# Section 1

## ETL Process

In this section, we perform Extract, Transform, Load (ETL) operations on the retail sales data:

- **Extract**: Load raw CSV files (sales, stores, features).
- **Transform**: Merge datasets, handle missing values, convert data types, engineer features.
- **Load**: Save the cleaned dataset to the clean-data directory.
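
One thing worth checking before the merge step: if two raw files share a column besides the merge keys, pandas keeps both copies and renames them with `_x` and `_y` suffixes. The sketch below uses toy data to show how to detect and drop such overlaps. The shared `IsHoliday` column is an assumption for illustration, not something confirmed about the raw files.

```python
import pandas as pd

# Toy frames; the shared IsHoliday column is an assumption for illustration
sales = pd.DataFrame({
    "Store": [1, 1],
    "Date": ["05/02/2010", "12/02/2010"],
    "Weekly_Sales": [100.0, 120.0],
    "IsHoliday": [False, True],
})
features = pd.DataFrame({
    "Store": [1, 1],
    "Date": ["05/02/2010", "12/02/2010"],
    "MarkDown1": [None, 5.0],
    "IsHoliday": [False, True],
})

keys = ["Store", "Date"]

# Columns present in both frames that are not merge keys
overlap = sorted((set(sales.columns) & set(features.columns)) - set(keys))
print(overlap)

# Dropping the duplicates from the right-hand frame avoids _x/_y suffixes
merged = sales.merge(features.drop(columns=overlap), on=keys, how="left")
print(merged.columns.tolist())
```

If the overlapping columns are known to hold identical values, adding them to the merge keys is an alternative to dropping them.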

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import os

# Extract: Load raw data
sales_df = pd.read_csv('dataset/raw-data/sales-data-set.csv')
stores_df = pd.read_csv('dataset/raw-data/stores-data-set.csv')
features_df = pd.read_csv('dataset/raw-data/Features-data-set.csv')

# Transform: Merge datasets
# Merge sales with stores
merged_df = pd.merge(sales_df, stores_df, on='Store', how='left')

# Merge with features on Store and Date
merged_df = pd.merge(merged_df, features_df, on=['Store', 'Date'], how='left')

# Convert Date to datetime
merged_df['Date'] = pd.to_datetime(merged_df['Date'], format='%d/%m/%Y')

# Handle missing values: pd.read_csv already parses 'NA' as NaN by default,
# so this replace is only a safeguard; then fill missing MarkDowns with 0
merged_df.replace('NA', np.nan, inplace=True)
markdown_cols = ['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']
merged_df[markdown_cols] = merged_df[markdown_cols].fillna(0)

# Feature engineering: Total MarkDown
merged_df['Total_MarkDown'] = merged_df[markdown_cols].sum(axis=1)

# Ensure clean-data directory exists
os.makedirs('dataset/clean-data', exist_ok=True)

# Load: Save cleaned data
merged_df.to_csv('dataset/clean-data/cleaned_sales_data.csv', index=False)

print("ETL completed. Cleaned data saved to dataset/clean-data/cleaned_sales_data.csv")
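
The ETL cell above does not verify its own results. The sketch below shows two checks on toy data that could be adapted: that a left merge leaves the number of sales rows unchanged, and that Total_MarkDown is computed as expected once missing values are filled. The `Type` and `Size` columns and all values are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames; Type and Size are hypothetical store attributes
sales = pd.DataFrame({"Store": [1, 1, 2], "Weekly_Sales": [10.0, 20.0, 30.0]})
stores = pd.DataFrame({"Store": [1, 2], "Type": ["A", "B"], "Size": [1000, 2000]})

# validate raises pandas.errors.MergeError if a Store appears twice in `stores`
merged = sales.merge(stores, on="Store", how="left", validate="many_to_one")

# A left merge against unique keys must not add or drop sales rows
assert len(merged) == len(sales)

# Missing markdowns become 0, so rows without promotions total 0
markdowns = pd.DataFrame({"MarkDown1": [1.0, np.nan], "MarkDown2": [np.nan, np.nan]})
cols = ["MarkDown1", "MarkDown2"]
markdowns[cols] = markdowns[cols].fillna(0)
markdowns["Total_MarkDown"] = markdowns[cols].sum(axis=1)
print(markdowns["Total_MarkDown"].tolist())  # [1.0, 0.0]
```

The `validate` argument is useful here because duplicate keys in the right-hand frame would otherwise silently multiply sales rows.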