# **Retail Sales Analysis Notebook**

## Objectives

* Download the retail dataset from Kaggle, save the CSV files as raw data, and build a data dictionary for the sales, stores and features datasets

## Inputs

* The Kaggle dataset `manjeetsingh/retaildataset`, downloaded with `kagglehub`

## Outputs

* The CSV files `sales-data-set.csv`, `stores-data-set.csv` and `features-data-set.csv` in `../data`, and a `data_dictionary` dataframe describing their columns

## Additional Comments

* If you have any additional comments that don't fit in the previous bullets, please state them here. 



---

# Section 1

Section 1 content

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Data Acquisition

In [2]:
import os           # Import os for data directory management
import shutil       # Import shutil for file operations
import kagglehub    # Import Kaggle Hub to Download retail datasets

# Download dataset 
path = kagglehub.dataset_download("manjeetsingh/retaildataset")
print(f'Files downloaded to: {path}')

# Copy CSV files to data directory
os.makedirs("../data", exist_ok=True)
for file in os.listdir(path):
    if file.endswith(".csv"):
        shutil.copy2(os.path.join(path, file), f"../data/{file.lower().replace(' ', '-')}")

print('backlog feat: remove files from the kaggle cache folder on copy, see https://github.com/users/Julian-Elliott/projects/3/views/1?pane=issue&itemId=115149029&issue=Julian-Elliott%7Cretail-sales-analysis%7C13')

print("Files copied to ../data")


Files downloaded to: /Users/julianelliott/.cache/kagglehub/datasets/manjeetsingh/retaildataset/versions/2
backlog feat: remove files from the kaggle cache folder on copy, see https://github.com/users/Julian-Elliott/projects/3/views/1?pane=issue&itemId=115149029&issue=Julian-Elliott%7Cretail-sales-analysis%7C13
Files copied to ../data
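
The backlog note in the cell above (removing files from the download cache once they are copied) could be approached along the following lines. This is a minimal sketch, not the project's implementation: `copy_csvs` and its `remove_source` flag are hypothetical names, the file `Sales Data Set.csv` is a made-up stand-in, and the demonstration runs on temporary directories rather than the real `kagglehub` cache.

```python
import os
import shutil
import tempfile


def copy_csvs(src_dir, dest_dir, remove_source=False):
    """Copy CSV files from src_dir to dest_dir with normalised names.

    Hypothetical helper (not part of kagglehub). When remove_source is True,
    each source file is deleted once its copy exists in dest_dir.
    """
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(src_dir)):
        if not name.endswith(".csv"):
            continue
        src = os.path.join(src_dir, name)
        # Same naming rule as the notebook cell: lower case, spaces to hyphens.
        dest = os.path.join(dest_dir, name.lower().replace(" ", "-"))
        shutil.copy2(src, dest)
        # Only delete the source after confirming the copy is in place.
        if remove_source and os.path.isfile(dest):
            os.remove(src)
        copied.append(dest)
    return copied


# Demonstration on throwaway directories, so no real cache is touched.
with tempfile.TemporaryDirectory() as tmp:
    fake_cache = os.path.join(tmp, "cache")
    data_dir = os.path.join(tmp, "data")
    os.makedirs(fake_cache)
    with open(os.path.join(fake_cache, "Sales Data Set.csv"), "w") as fh:
        fh.write("Store,Dept\n1,1\n")
    demo_copied = [
        os.path.basename(p)
        for p in copy_csvs(fake_cache, data_dir, remove_source=True)
    ]
    demo_remaining = os.listdir(fake_cache)

print(demo_copied, demo_remaining)
```

In the notebook, `src_dir` would be the `path` returned by `kagglehub.dataset_download` and `dest_dir` would be `"../data"`. Deleting cached files means a later run may download the dataset again, so keeping `remove_source=False` as the default seems the safer choice.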


In [3]:
# Load the retail datasets into dataframes
sales_df = pd.read_csv("../data/sales-data-set.csv")
stores_df = pd.read_csv("../data/stores-data-set.csv") 
features_df = pd.read_csv("../data/features-data-set.csv")

### Inspecting the Datasets and Building the Data Dictionary

In [None]:
# Custom function that builds a data dictionary for multiple datasets
# (written after running into issues with ydata-profiling)
def create_data_dictionary(datasets_dict):
    # Descriptions for each known column, taken from the Kaggle data card:
    # https://www.kaggle.com/datasets/manjeetsingh/retaildataset/data
    descriptions = {
        'Store': 'The store number',
        'Date': 'The week start date',
        'Temperature': 'Average temperature in the region',
        'Fuel_Price': 'Cost of fuel in the region',
        'MarkDown1': 'Anonymized promotional markdown data (after Nov 2011)',
        'MarkDown2': 'Anonymized promotional markdown data (after Nov 2011)',
        'MarkDown3': 'Anonymized promotional markdown data (after Nov 2011)',
        'MarkDown4': 'Anonymized promotional markdown data (after Nov 2011)',
        'MarkDown5': 'Anonymized promotional markdown data (after Nov 2011)',
        'CPI': 'The consumer price index',
        'Unemployment': 'The unemployment rate',
        'IsHoliday': 'Whether the week is a special holiday week',
        'Dept': 'The department number',
        'Weekly_Sales': 'Sales for the given department in the given store',
        'Type': 'Store type classification',
        'Size': 'Store size in square feet'
    }
    
    dictionary_data = []
    for dataset_name, df in datasets_dict.items():
        for column in df.columns:
            # Get sample values (non-null)
            sample_values = df[column].dropna().head(3).tolist()
            sample_str = ', '.join([str(x) for x in sample_values])
            
            dictionary_data.append({
                'Dataset': dataset_name,
                'Column': column,
                'Data Type': str(df[column].dtype),
                'Missing Values': df[column].isnull().sum(),
                'Missing %': round((df[column].isnull().sum() / len(df)) * 100, 2),
                'Unique Values': df[column].nunique(),
                'Sample Values': sample_str,
                'Description': descriptions.get(column, 'Description needed')
            })
    return pd.DataFrame(dictionary_data)

# Create datasets dictionary
datasets = {
    'Sales': sales_df,
    'Stores': stores_df,
    'Features': features_df
}

# Generate data dictionary that can be called upon later
data_dictionary = create_data_dictionary(datasets)

# Display data dictionary
data_dictionary

| | Dataset | Column | Data Type | Missing Values | Missing % | Unique Values | Sample Values | Description |
|---|---|---|---|---|---|---|---|---|
| 0 | Sales | Store | int64 | 0 | 0.0 | 45 | 1, 1, 1 | The store number |
| 1 | Sales | Dept | int64 | 0 | 0.0 | 81 | 1, 1, 1 | The department number |
| 2 | Sales | Date | object | 0 | 0.0 | 143 | 05/02/2010, 12/02/2010, 19/02/2010 | The week start date |
| 3 | Sales | Weekly_Sales | float64 | 0 | 0.0 | 359464 | 24924.5, 46039.49, 41595.55 | Sales for the given department in the given store |
| 4 | Sales | IsHoliday | bool | 0 | 0.0 | 2 | False, True, False | Whether the week is a special holiday week |
| 5 | Stores | Store | int64 | 0 | 0.0 | 45 | 1, 2, 3 | The store number |
| 6 | Stores | Type | object | 0 | 0.0 | 3 | A, A, B | Store type classification |
| 7 | Stores | Size | int64 | 0 | 0.0 | 40 | 151315, 202307, 37392 | Store size in square feet |
| 8 | Features | Store | int64 | 0 | 0.0 | 45 | 1, 1, 1 | The store number |
| 9 | Features | Date | object | 0 | 0.0 | 182 | 05/02/2010, 12/02/2010, 19/02/2010 | The week start date |
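
The data dictionary shows both `Date` columns loading as `object`, with sample values such as 05/02/2010, 12/02/2010 and 19/02/2010. Because these are weekly records, the values look day-first (5, 12 and 19 February). A minimal sketch of parsing them with an explicit format follows. It runs on a small stand-in frame named `sample_df` (a hypothetical name) built from those sample values, and the day-first reading is an assumption that the seven-day gap check is meant to support.

```python
import pandas as pd

# Stand-in frame using the sample values shown in the data dictionary.
sample_df = pd.DataFrame({"Date": ["05/02/2010", "12/02/2010", "19/02/2010"]})

# Parse with an explicit day-first format so 05/02/2010 is read as 5 February.
sample_df["Date"] = pd.to_datetime(sample_df["Date"], format="%d/%m/%Y")

# If the day-first reading is right, consecutive weekly rows are 7 days apart.
gaps = sample_df["Date"].diff().dropna().dt.days.tolist()
print(sample_df["Date"].dtype, gaps)
```

The same `pd.to_datetime(..., format="%d/%m/%Y")` call could be applied to `sales_df["Date"]` and `features_df["Date"]`. An explicit format raises an error on values that do not match, which surfaces bad rows early instead of silently misreading them.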


---

# Section 2

Section 2 content

---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should run top-down. Don't create a workflow in which, at some point, you need to go back to an earlier cell to run a task, such as refreshing a variable's content.

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [5]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder is created
except Exception as e:
  print(e)