# **Retail Sales Analysis Notebook**

## Objectives

* Fetch the retail dataset from Kaggle, save the CSV files as raw data, build a data dictionary, and split the data into general and promotional analysis datasets

## Inputs

* The Kaggle dataset `manjeetsingh/retaildataset`, downloaded with `kagglehub`

## Outputs

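* CSV files copied to `../data` (`sales-data-set.csv`, `stores-data-set.csv`, `features-data-set.csv`), the `data_dictionary` dataframe, and the `promo_sales` and `promo_features` dataframes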

---

# Section 1

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Data Acquisition

In [2]:
import os           # Import os for data directory management
import shutil       # Import shutil for file operations
import kagglehub    # Import Kaggle Hub to Download retail datasets

# Download dataset 
path = kagglehub.dataset_download("manjeetsingh/retaildataset")
print(f'Files downloaded to: {path}')

# Copy CSV files to data directory
os.makedirs("../data", exist_ok=True)
for file in os.listdir(path):
    if file.endswith(".csv"):
        shutil.copy2(os.path.join(path, file), f"../data/{file.lower().replace(' ', '-')}")

print('backlog feat: remove files from the kaggle cache folder on copy see https://github.com/users/Julian-Elliott/projects/3/views/1?pane=issue&itemId=115149029&issue=Julian-Elliott%7Cretail-sales-analysis%7C13')

print("Files copied to ../data")

Files downloaded to: /Users/julianelliott/.cache/kagglehub/datasets/manjeetsingh/retaildataset/versions/2
backlog feat: remove files from the kaggle cache folder on copy see https://github.com/users/Julian-Elliott/projects/3/views/1?pane=issue&itemId=115149029&issue=Julian-Elliott%7Cretail-sales-analysis%7C13
Files copied to ../data
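
The copy step above normalises each file name (lower case, spaces replaced with hyphens) before writing it to the data directory. A minimal, self-contained sketch of that step is shown below. It uses temporary directories in place of the real Kaggle cache and `../data` folder; the helper name `copy_csvs` and the example file names are hypothetical.

```python
import os
import shutil
import tempfile


def copy_csvs(src_dir, dest_dir):
    """Copy every CSV in src_dir to dest_dir with a normalised file name.

    Hypothetical helper for illustration only; not part of kagglehub.
    """
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(src_dir)):
        if name.endswith(".csv"):
            # Lower-case the name and replace spaces with hyphens
            new_name = name.lower().replace(" ", "-")
            shutil.copy2(os.path.join(src_dir, name), os.path.join(dest_dir, new_name))
            copied.append(new_name)
    return copied


# Demonstrate with throwaway directories instead of the real cache folder
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dest:
    for name in ["Sales Data Set.csv", "notes.txt"]:  # made-up file names
        with open(os.path.join(src, name), "w") as fh:
            fh.write("Store,Dept\n1,1\n")
    copied_files = copy_csvs(src, dest)
    print(copied_files)
```

Wrapping the loop in a function keeps the naming rule in one place, which may make it easier to add a later clean-up step for the cache folder.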


In [3]:
# Load the retail datasets into dataframes
sales_df = pd.read_csv("../data/sales-data-set.csv")
stores_df = pd.read_csv("../data/stores-data-set.csv") 
features_df = pd.read_csv("../data/features-data-set.csv")

### Inspecting the Data and Building the Data Dictionary

In [4]:
# Custom function to build a data dictionary for multiple datasets
# (written because of issues experienced with ydata-profiling)
def create_data_dictionary(datasets_dict):
    # Quick reference descriptions for each known column (from the Kaggle data card https://www.kaggle.com/datasets/manjeetsingh/retaildataset/data)
    descriptions = {
        'Store': 'The store number',
        'Date': 'The week start date',
        'Temperature': 'Average temperature in the region',
        'Fuel_Price': 'Cost of fuel in the region',
        'MarkDown1': 'Anonymized promotional markdown data (after Nov 2011)',
        'MarkDown2': 'Anonymized promotional markdown data (after Nov 2011)',
        'MarkDown3': 'Anonymized promotional markdown data (after Nov 2011)',
        'MarkDown4': 'Anonymized promotional markdown data (after Nov 2011)',
        'MarkDown5': 'Anonymized promotional markdown data (after Nov 2011)',
        'CPI': 'The consumer price index',
        'Unemployment': 'The unemployment rate',
        'IsHoliday': 'Whether the week is a special holiday week',
        'Dept': 'The department number',
        'Weekly_Sales': 'Sales for the given department in the given store',
        'Type': 'Store type classification',
        'Size': 'Store size in square feet'
    }
    
    dictionary_data = []
    for dataset_name, df in datasets_dict.items():
        for column in df.columns:
            # Get sample values (non-null)
            sample_values = df[column].dropna().head(3).tolist()
            sample_str = ', '.join([str(x) for x in sample_values])
            
            dictionary_data.append({
                'Dataset': dataset_name,
                'Column': column,
                'Data Type': str(df[column].dtype),
                'Missing Values': df[column].isnull().sum(),
                'Missing %': round((df[column].isnull().sum() / len(df)) * 100, 2),
                'Unique Values': df[column].nunique(),
                'Sample Values': sample_str,
                'Description': descriptions.get(column, 'Description needed')
            })
    return pd.DataFrame(dictionary_data)

# Create datasets dictionary
datasets = {
    'Sales': sales_df,
    'Stores': stores_df,
    'Features': features_df
}

# Generate data dictionary that can be called upon later
data_dictionary = create_data_dictionary(datasets)

# Display data dictionary
data_dictionary

| | Dataset | Column | Data Type | Missing Values | Missing % | Unique Values | Sample Values | Description |
|---|---|---|---|---|---|---|---|---|
| 0 | Sales | Store | int64 | 0 | 0.0 | 45 | 1, 1, 1 | The store number |
| 1 | Sales | Dept | int64 | 0 | 0.0 | 81 | 1, 1, 1 | The department number |
| 2 | Sales | Date | object | 0 | 0.0 | 143 | 05/02/2010, 12/02/2010, 19/02/2010 | The week start date |
| 3 | Sales | Weekly_Sales | float64 | 0 | 0.0 | 359464 | 24924.5, 46039.49, 41595.55 | Sales for the given department in the given store |
| 4 | Sales | IsHoliday | bool | 0 | 0.0 | 2 | False, True, False | Whether the week is a special holiday week |
| 5 | Stores | Store | int64 | 0 | 0.0 | 45 | 1, 2, 3 | The store number |
| 6 | Stores | Type | object | 0 | 0.0 | 3 | A, A, B | Store type classification |
| 7 | Stores | Size | int64 | 0 | 0.0 | 40 | 151315, 202307, 37392 | Store size in square feet |
| 8 | Features | Store | int64 | 0 | 0.0 | 45 | 1, 1, 1 | The store number |
| 9 | Features | Date | object | 0 | 0.0 | 182 | 05/02/2010, 12/02/2010, 19/02/2010 | The week start date |


#### Potential Issues Shown in the Data Dictionary

The `MarkDown1` through `MarkDown5`, `CPI` and `Unemployment` columns in the `Features` dataset present notable data quality challenges:

- **Missing Data:**  
  Each MarkDown column contains a significant percentage of missing values (ranging from 50% to over 64%). Part of this is expected, as promotional markdown data is only available after November 2011, so earlier records do not include these values.  
  To a lesser extent, the `CPI` (Consumer Price Index) and `Unemployment` columns are each missing about 7% of their values. While this is less dramatic, it may still affect analyses that involve the economic features of the dataset.
  
- **Time-Dependent Availability:**  
  Because MarkDown data is only present for dates after November 2011 (as explained in the Kaggle [data card](https://www.kaggle.com/datasets/manjeetsingh/retaildataset/data)), analyses involving these columns must account for their partial availability. Any models or insights involving markdowns will be biased toward the later part of the dataset and may not generalise to earlier periods.

- **Other Features:**  
  Other columns such as `Store`, `Date`, `Temperature`, `IsHoliday` and `Weekly_Sales` have no missing values and can be used with greater confidence in general analyses.
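
To check how much of the missingness is explained by the November 2011 cut-off, the share of missing values can be compared before and after that date. The sketch below does this on a small made-up frame; the values, the exact `cutoff` date and the `period` labels are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for features_df; all values are made up for illustration
toy_features = pd.DataFrame({
    "Date": pd.to_datetime(["2011-10-07", "2011-10-14", "2011-11-11", "2011-11-18"]),
    "MarkDown1": [np.nan, np.nan, 10382.9, np.nan],
    "CPI": [217.9, 218.0, 218.2, 218.4],
})

# Label each row as falling before or after the assumed cut-off date
cutoff = pd.Timestamp("2011-11-01")
period = pd.Series(
    np.where(toy_features["Date"] < cutoff, "before", "after"),
    index=toy_features.index,
    name="period",
)

# Percentage of missing values per column within each period
missing_pct = toy_features[["MarkDown1", "CPI"]].isna().groupby(period).mean() * 100
print(missing_pct)
```

If most of the missing MarkDown values fall in the "before" group, the gap is structural rather than a sign of faulty records.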

In [5]:
# Create temporary dataframes to avoid modifying the original data before transform stage
temp_sales_df = sales_df.copy()
temp_features_df = features_df.copy()

# Convert temp date columns with correct format (DD/MM/YYYY)
temp_sales_df['Date'] = pd.to_datetime(sales_df['Date'], format='%d/%m/%Y')
temp_features_df['Date'] = pd.to_datetime(features_df['Date'], format='%d/%m/%Y')

# Check the date columns range of values
print("\nSales Date range:", temp_sales_df['Date'].min(), "to", temp_sales_df['Date'].max())
print("Features Date range:", temp_features_df['Date'].min(), "to", temp_features_df['Date'].max())

# Clean up temporary objects
del temp_sales_df, temp_features_df


Sales Date range: 2010-02-05 00:00:00 to 2012-10-26 00:00:00
Features Date range: 2010-02-05 00:00:00 to 2013-07-26 00:00:00
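
The ranges above show the Features data running later than the Sales data. One way to list the dates that appear in one frame but not the other is sketched below on made-up frames (the toy dates are illustrative only).

```python
import pandas as pd

# Toy stand-ins for sales_df and features_df; dates are made up
toy_sales = pd.DataFrame({
    "Date": pd.to_datetime(["05/02/2010", "12/02/2010"], format="%d/%m/%Y"),
})
toy_features = pd.DataFrame({
    "Date": pd.to_datetime(["05/02/2010", "12/02/2010", "19/02/2010"], format="%d/%m/%Y"),
})

# Dates present in the features frame but absent from the sales frame
features_only = sorted(set(toy_features["Date"]) - set(toy_sales["Date"]))
print(features_only)
```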


### Dataset Splitting Strategy

Based on the data quality issues identified above, I'll create **two analysis datasets** to maximize data utility:

#### **Dataset 1: General Analysis (Full Timeline)**
- **Purpose:** Sales trends, seasonality, store performance analysis
- **Timeline:** Complete dataset (2010-2012)
- **Features:** Store, Date, Temperature, Fuel_Price, CPI, Unemployment, IsHoliday, Weekly_Sales
- **Advantage:** Maximum data coverage for robust trend analysis

#### **Dataset 2: Promotional Analysis (Nov 2011 onwards)**
- **Purpose:** Impact of markdowns and promotional strategies
- **Timeline:** From November 2011 when MarkDown data becomes available
- **Features:** All features including MarkDown1-5 columns
- **Advantage:** Complete feature set for promotional impact analysis

This approach allows me to:
- **Maximize data usage** - Use full timeline where appropriate
- **Maintain data quality** - Focus on complete records for markdown analysis
- **Enable comprehensive insights** - Compare pre/post promotional periods
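
Dataset 1 draws on columns that live in different source files, so at some point the frames need to be joined. The sketch below shows one possible way to do that, assuming `Store` and `Date` as the join keys between sales and features, and `Store` alone for the stores table. The toy values and the name `general_df` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the three source frames; values are made up
toy_sales = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 2, 1],
    "Date": pd.to_datetime(["2010-02-05"] * 3),
    "Weekly_Sales": [24924.5, 50605.27, 35034.06],
})
toy_features = pd.DataFrame({
    "Store": [1, 2],
    "Date": pd.to_datetime(["2010-02-05"] * 2),
    "Temperature": [42.31, 40.19],
})
toy_stores = pd.DataFrame({
    "Store": [1, 2],
    "Type": ["A", "A"],
    "Size": [151315, 202307],
})

# Left joins keep every sales row; validate guards against duplicate keys
general_df = (
    toy_sales
    .merge(toy_features, on=["Store", "Date"], how="left", validate="many_to_one")
    .merge(toy_stores, on="Store", how="left", validate="many_to_one")
)
print(general_df)
```

A left join from the sales frame keeps the row count equal to the number of sales records, and `validate="many_to_one"` raises an error if the right-hand frame has duplicate keys.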

#### Implementing the Split

**Naming Conventions:**
- General analysis datasets: Keep existing names (`sales_df`, `features_df`, `stores_df`)
- Promotional analysis datasets: Use `promo_` prefix (`promo_sales`, `promo_features`)

In [10]:
# Convert date columns to datetime for proper filtering
features_df['Date'] = pd.to_datetime(features_df['Date'], format='%d/%m/%Y')
sales_df['Date'] = pd.to_datetime(sales_df['Date'], format='%d/%m/%Y')

# Find the first date where any MarkDown data is available
markdown_cols = ['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']
first_markdown_date = features_df[features_df[markdown_cols].notna().any(axis=1)]['Date'].min()
print(f"First MarkDown data available from: {first_markdown_date}")

# Create promotional datasets (from first markdown date onwards)
promo_features = features_df[features_df['Date'] >= first_markdown_date].copy()
promo_sales = sales_df[sales_df['Date'] >= first_markdown_date].copy()

# Create promotional datasets dictionary
promo_datasets = {
    'Promo_Sales': promo_sales,
    'Promo_Features': promo_features
}

# Generate data dictionary for promotional datasets
promo_data_dictionary = create_data_dictionary(promo_datasets)

print(f"Promo Date range: {first_markdown_date} to {promo_features['Date'].max()}")

# Display promotional data dictionary filtered for rows with missing values
print('\nMissing values after promo_split')
promo_data_dictionary[promo_data_dictionary['Missing Values'] > 0]

First MarkDown data available from: 2011-11-11 00:00:00
Promo Date range: 2011-11-11 00:00:00 to 2013-07-26 00:00:00

Missing values after promo_split


| | Dataset | Column | Data Type | Missing Values | Missing % | Unique Values | Sample Values | Description |
|---|---|---|---|---|---|---|---|---|
| 9 | Promo_Features | MarkDown1 | float64 | 18 | 0.44 | 4023 | 10382.9, 6074.12, 410.31 | Anonymized promotional markdown data (after No... |
| 10 | Promo_Features | MarkDown2 | float64 | 1129 | 27.88 | 2715 | 6115.67, 254.39, 98.0 | Anonymized promotional markdown data (after No... |
| 11 | Promo_Features | MarkDown3 | float64 | 437 | 10.79 | 2885 | 215.07, 51.98, 55805.51 | Anonymized promotional markdown data (after No... |
| 12 | Promo_Features | MarkDown4 | float64 | 586 | 14.47 | 3405 | 2406.62, 427.39, 8.0 | Anonymized promotional markdown data (after No... |
| 14 | Promo_Features | CPI | float64 | 585 | 14.44 | 1125 | 217.9980849, 218.2205088, 218.4676211 | The consumer price index |
| 15 | Promo_Features | Unemployment | float64 | 585 | 14.44 | 206 | 7.866, 7.866, 7.866 | The unemployment rate |
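
To see whether the remaining gaps are clustered in time, missing values can be counted per date. The sketch below runs on a made-up frame, not the real `promo_features`; the dates and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for promo_features; all values are made up
toy_promo = pd.DataFrame({
    "Date": pd.to_datetime(["2012-01-06", "2012-01-06", "2012-01-13", "2012-01-13"]),
    "CPI": [218.4, 190.1, np.nan, np.nan],
    "Unemployment": [7.8, 8.1, np.nan, np.nan],
})

# Count missing values per column for each date
missing_by_date = toy_promo.set_index("Date").isna().groupby(level="Date").sum()

# Keep only the dates where at least one column has a gap
dates_with_gaps = missing_by_date[missing_by_date.any(axis=1)]
print(dates_with_gaps)
```

If the gaps sit in one contiguous block of dates, trimming the date range may be a simpler fix than imputing values.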


---
