# Retail Sales Analysis

The Retail Sales Analysis Project aims to leverage data analytics techniques to derive actionable insights from a retail company's sales data. Retail businesses operate in a dynamic environment where understanding consumer behavior, optimizing pricing strategies, and identifying trends are crucial for success. This project focuses on analyzing various aspects of sales data to help businesses make informed decisions that drive revenue growth, enhance operational efficiency, and improve customer satisfaction.

In this project, we will explore and analyze historical sales data, which includes information such as product sales, pricing, customer demographics, promotional activities, and other related variables. By utilizing data analysis tools and techniques, we will uncover key patterns and trends that influence retail performance. The goal is to provide a comprehensive understanding of the sales dynamics within a retail business, enabling stakeholders to improve inventory management, optimize marketing efforts, and forecast future sales trends.

## Objectives

1. **Exploratory Data Analysis (EDA)**: Summarize the key features of the data, generate insights, and visualize each sales-related measure with a suitable chart.

2. **Statistical Analysis**: Perform hypothesis testing with the one-way ANOVA test (`f_oneway`) from the SciPy module, supported by functions from the NumPy module.

3. **Data Cleaning and Preprocessing**: Clean and transform the raw data with the pandas module so that the analysis is accurate.


## Pre-requisites

To follow this project, you should be familiar with the following:

* Functions
* Control Statements
* Exception Handling
* File Handling
* NumPy Module
* Pandas Module
* Matplotlib Module
* Seaborn Module
* SciPy Module

## Step 1 - Importing Modules

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import f_oneway

## Step 2 - Loading Data and Preprocessing

In [2]:
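def load_data(path='sales_data.csv'):
    """Load the sales CSV, parse the Date column, and run basic quality checks."""
    try:
        data = pd.read_csv(path)
        pd.set_option('display.max_columns', None)
        data['Date'] = pd.to_datetime(data['Date'])
        data.info()
        # Report quality problems explicitly; a bare expression inside a function shows nothing.
        missing = int(data.isnull().sum().sum())
        duplicates = int(data.duplicated().sum())
        if missing or duplicates:
            print(f"Warning: {missing} missing values, {duplicates} duplicate rows found.")
        return data
    except FileNotFoundError:
        print("Error: File not found. Please check the file path.")
        raise SystemExit(1)  # works in scripts and notebooks without relying on the site-provided exit()
    except pd.errors.ParserError:
        print("Error: File format is incorrect. Please check the file.")
        raise SystemExit(1)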
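
The quality checks in `load_data` can also be collected into a small report. The following is a minimal sketch, assuming a hypothetical helper named `summarize_quality` and a few made-up rows; neither comes from `sales_data.csv` or the original project.

```python
import pandas as pd


def summarize_quality(data: pd.DataFrame) -> dict:
    """Return simple data-quality counts (hypothetical helper, not part of the project)."""
    return {
        "rows": len(data),
        "missing_per_column": data.isnull().sum().to_dict(),
        "duplicate_rows": int(data.duplicated().sum()),
    }


# Illustrative rows only; they are not taken from sales_data.csv.
sample = pd.DataFrame({
    "Date": ["2024-01-05", "2024-01-05", "2024-02-10"],
    "Product": ["Product A", "Product A", "Product B"],
    "Quantity": [3, 3, None],
})
sample["Date"] = pd.to_datetime(sample["Date"])

report = summarize_quality(sample)
print(report)
```

Returning counts rather than the raw boolean frames keeps the result short enough to print or write to the results file.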

## Step 3 - Exploratory Data Analysis Using NumPy, Pandas, File Handling, Matplotlib and Seaborn

In [3]:
def perform_eda(data, output_file):
    with open(output_file, 'a') as file:
        data['Total_Sales'] = data['Quantity'] * data['Price']

        monthly_sales = data.groupby(data['Date'].dt.to_period('M'))['Total_Sales'].apply(np.sum)
        file.write(f"Monthly Sales Trend:\n{monthly_sales}\n\n")

        top_products = (
            data.groupby('Product')['Quantity']
            .apply(np.sum)
            .sort_values(ascending=False)
            .head(5)
        )
        file.write(f"Top 5 Products by Quantity Sold:\n{top_products}\n\n")

        region_performance = (
            data.groupby('Region')['Total_Sales']
            .apply(np.sum)
        )
        file.write(f"Sales by Region:\n{region_performance}\n\n")

        sns.boxplot(x='Discount', y='Total_Sales', data=data)
        plt.title('Impact of Discount on Total Sales')
        plt.savefig("discount_impact_sales.png")  # Save visualization as an image
        plt.close()

        # Treat zero sales as missing so the margin is NaN rather than inf.
        data['Profit_Margin'] = np.divide(data['Profit'], data['Total_Sales'].replace(0, np.nan))
        sns.barplot(x='Category', y='Profit_Margin', data=data)
        plt.title('Profit Margin by Category')
        plt.savefig("profit_margin_category.png")  # Save visualization as an image
        plt.close()

        sns.countplot(x='Category', data=data)
        plt.title('Count of Category')
        plt.savefig("Category.png")  # Save visualization as an image
        plt.close()
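
The monthly aggregation and the per-row profit margin from this step can be tried on a few made-up rows. This is a sketch under stated assumptions: the rows are illustrative, the built-in `.sum()` stands in for `.apply(np.sum)` (it gives the same totals), and replacing zero sales with `NaN` is one possible way to avoid an infinite margin.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the column names follow the ones used in this project.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-10"]),
    "Quantity": [2, 0, 5],
    "Price": [100, 50, 40],
    "Profit": [30, 0, 50],
})
sample["Total_Sales"] = sample["Quantity"] * sample["Price"]

# Sum Total_Sales within each calendar month.
monthly_sales = sample.groupby(sample["Date"].dt.to_period("M"))["Total_Sales"].sum()

# Replace zero sales with NaN so the division yields NaN instead of inf.
sample["Profit_Margin"] = sample["Profit"] / sample["Total_Sales"].replace(0, np.nan)

print(monthly_sales)
print(sample["Profit_Margin"])
```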

## Step 4 - Statistical Analysis using ANOVA Test and NumPy

In [4]:
def perform_statistical_analysis(data, output_file):
    with open(output_file, 'a') as file:
        regions = np.unique(data['Region'])
        region_sales = [data[data['Region'] == region]['Total_Sales'].values for region in regions]

        # Performing ANOVA test
        f_stat, p_val = f_oneway(*region_sales)
        file.write(f"ANOVA Test Result: F-Statistic = {f_stat}, P-Value = {p_val}\n")

        if p_val < 0.05:
            file.write("There is a significant difference in sales between regions.\n\n")
        else:
            file.write("There is no significant difference in sales between regions.\n\n")
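
To make the F-statistic less of a black box, the sketch below computes the one-way ANOVA F-statistic by hand in plain Python. The function name `one_way_f_statistic` and the three small groups are hypothetical and serve only as illustration; the project itself relies on `f_oneway`, which additionally returns the p-value.

```python
def one_way_f_statistic(groups):
    """Compute the one-way ANOVA F-statistic for a list of numeric groups."""
    k = len(groups)                                   # number of groups
    n_total = sum(len(g) for g in groups)             # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    # Ratio of the between-group mean square to the within-group mean square.
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Made-up groups standing in for per-region Total_Sales values.
example_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = one_way_f_statistic(example_groups)
print(f_value)
```

A large F-value means the group means differ by more than the spread within each group would explain; the p-value then converts that ratio into a probability under the assumption that all group means are equal.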

## Step 5 - Creating Module Functions for Each Data Insight

In [5]:
def calculate_total_sales(data):
    return np.sum(data['Total_Sales'])

def get_top_products(data, top_n=5):
    grouped = data.groupby('Product')['Quantity'].apply(np.sum)
    return grouped.sort_values(ascending=False).head(top_n)

def calculate_profit_margin(data):
    return np.sum(data['Profit']) / np.sum(data['Total_Sales'])
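
Note that `calculate_profit_margin` divides total profit by total sales, whereas the bar plot in Step 3 shows the mean of the per-row `Profit_Margin` values (the mean is seaborn's default estimator for `barplot`). The two numbers can differ noticeably, as this small sketch with made-up figures suggests.

```python
import pandas as pd

# Illustrative rows only: one small high-margin sale and one large low-margin sale.
sample = pd.DataFrame({
    "Total_Sales": [100, 900],
    "Profit": [50, 90],
})

# Ratio of totals, as in calculate_profit_margin.
overall_margin = sample["Profit"].sum() / sample["Total_Sales"].sum()

# Mean of per-row margins, as a bar plot of Profit_Margin would show.
mean_row_margin = (sample["Profit"] / sample["Total_Sales"]).mean()

print(f"Overall margin: {overall_margin:.2f}")
print(f"Mean of row margins: {mean_row_margin:.2f}")
```

The ratio of totals weights each sale by its size, so it is usually the better figure for an overall margin.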

## Step 6 - Main Execution

In [6]:
def main():
    input_file = 'sales_data.csv'
    output_file = 'analysis_results.txt'

    with open(output_file, 'w') as file:
        file.write("Sales Data Analysis Results\n")
        file.write("=" * 50 + "\n\n")

    data = load_data()

    # Perform EDA
    perform_eda(data, output_file)

    # Perform Statistical Analysis
    perform_statistical_analysis(data, output_file)

    with open(output_file, 'a') as file:
        # Calculate total sales
        total_sales = calculate_total_sales(data)
        file.write(f"Total Sales: {total_sales}\n\n")

        # Get top products
        top_products = get_top_products(data)
        file.write(f"Top Products by Quantity Sold:\n{top_products}\n\n")

        # Calculate profit margin
        profit_margin = calculate_profit_margin(data)
        file.write(f"Overall Profit Margin: {profit_margin:.2f}\n\n")

    print(f"Analysis completed. Results have been saved to {output_file}.")
    print("Visualizations saved as images.")

if __name__ == "__main__":
    main()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Date      70 non-null     datetime64[ns]
 1   Product   70 non-null     object        
 2   Category  70 non-null     object        
 3   Region    70 non-null     object        
 4   Quantity  70 non-null     int64         
 5   Price     70 non-null     int64         
 6   Discount  70 non-null     int64         
 7   Profit    70 non-null     int64         
dtypes: datetime64[ns](1), int64(4), object(3)
memory usage: 4.5+ KB
Analysis completed. Results have been saved to analysis_results.txt.
Visualizations saved as images.


## Retail Sales Analysis: Final Combined Code of All the Steps

In [7]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import f_oneway

# Data Loading and Preprocessing
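def load_data(path='sales_data.csv'):
    """Load the sales CSV, parse the Date column, and run basic quality checks."""
    try:
        data = pd.read_csv(path)
        pd.set_option('display.max_columns', None)
        data['Date'] = pd.to_datetime(data['Date'])
        data.info()
        # Report quality problems explicitly; a bare expression inside a function shows nothing.
        missing = int(data.isnull().sum().sum())
        duplicates = int(data.duplicated().sum())
        if missing or duplicates:
            print(f"Warning: {missing} missing values, {duplicates} duplicate rows found.")
        return data
    except FileNotFoundError:
        print("Error: File not found. Please check the file path.")
        raise SystemExit(1)  # works in scripts and notebooks without relying on the site-provided exit()
    except pd.errors.ParserError:
        print("Error: File format is incorrect. Please check the file.")
        raise SystemExit(1)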

# Exploratory Data Analysis using NumPy
def perform_eda(data, output_file):
    with open(output_file, 'a') as file:
        data['Total_Sales'] = data['Quantity'] * data['Price']

        monthly_sales = data.groupby(data['Date'].dt.to_period('M'))['Total_Sales'].apply(np.sum)
        file.write(f"Monthly Sales Trend:\n{monthly_sales}\n\n")

        top_products = (
            data.groupby('Product')['Quantity']
            .apply(np.sum)
            .sort_values(ascending=False)
            .head(5)
        )
        file.write(f"Top 5 Products by Quantity Sold:\n{top_products}\n\n")

        region_performance = (
            data.groupby('Region')['Total_Sales']
            .apply(np.sum)
        )
        file.write(f"Sales by Region:\n{region_performance}\n\n")

        sns.boxplot(x='Discount', y='Total_Sales', data=data)
        plt.title('Impact of Discount on Total Sales')
        plt.savefig("discount_impact_sales.png")  # Save visualization as an image
        plt.close()

        # Treat zero sales as missing so the margin is NaN rather than inf.
        data['Profit_Margin'] = np.divide(data['Profit'], data['Total_Sales'].replace(0, np.nan))
        sns.barplot(x='Category', y='Profit_Margin', data=data)
        plt.title('Profit Margin by Category')
        plt.savefig("profit_margin_category.png")  # Save visualization as an image
        plt.close()

        sns.countplot(x='Category', data=data)
        plt.title('Count of Category')
        plt.savefig("Category.png")  # Save visualization as an image
        plt.close()

# Statistical Analysis with ANOVA using NumPy
def perform_statistical_analysis(data, output_file):
    with open(output_file, 'a') as file:
        regions = np.unique(data['Region'])
        region_sales = [data[data['Region'] == region]['Total_Sales'].values for region in regions]

        # Performing ANOVA test
        f_stat, p_val = f_oneway(*region_sales)
        file.write(f"ANOVA Test Result: F-Statistic = {f_stat}, P-Value = {p_val}\n")

        if p_val < 0.05:
            file.write("There is a significant difference in sales between regions.\n\n")
        else:
            file.write("There is no significant difference in sales between regions.\n\n")

# Custom Module Functions
def calculate_total_sales(data):
    return np.sum(data['Total_Sales'])

def get_top_products(data, top_n=5):
    grouped = data.groupby('Product')['Quantity'].apply(np.sum)
    return grouped.sort_values(ascending=False).head(top_n)

def calculate_profit_margin(data):
    return np.sum(data['Profit']) / np.sum(data['Total_Sales'])

# Main Execution
def main():
    input_file = 'sales_data.csv'
    output_file = 'analysis_results.txt'

    with open(output_file, 'w') as file:
        file.write("Sales Data Analysis Results\n")
        file.write("=" * 50 + "\n\n")

    data = load_data()

    perform_eda(data, output_file)

    perform_statistical_analysis(data, output_file)

    with open(output_file, 'a') as file:
        # Calculate total sales
        total_sales = calculate_total_sales(data)
        file.write(f"Total Sales: {total_sales}\n\n")

        # Get top products
        top_products = get_top_products(data)
        file.write(f"Top Products by Quantity Sold:\n{top_products}\n\n")

        # Calculate profit margin
        profit_margin = calculate_profit_margin(data)
        file.write(f"Overall Profit Margin: {profit_margin:.2f}\n\n")

    print(f"Analysis completed. Results have been saved to {output_file}.")
    print("Visualizations saved as images.")

if __name__ == "__main__":
    main()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Date      70 non-null     datetime64[ns]
 1   Product   70 non-null     object        
 2   Category  70 non-null     object        
 3   Region    70 non-null     object        
 4   Quantity  70 non-null     int64         
 5   Price     70 non-null     int64         
 6   Discount  70 non-null     int64         
 7   Profit    70 non-null     int64         
dtypes: datetime64[ns](1), int64(4), object(3)
memory usage: 4.5+ KB
Analysis completed. Results have been saved to analysis_results.txt.
Visualizations saved as images.


## Conclusions of the Project

The analysis of the retail sales data produced the following key insights, which can inform business strategy and decisions.

* **Monthly Sales Trend**: Sales increased steadily from January to May, with a notable peak of **12830** in total sales. This rise may reflect successful promotional activity during the period, although the analysis does not test that link directly.

* **Top Products by Quantity Sold**: **Product C** emerged as the highest-selling product and **Product B** as the lowest-selling product. This ranking highlights the products that account for the majority of units sold.

* **Sales by Region**: The South region generated the highest sales at **14830**, followed by East (**13970**) and North (**12180**). West reported the lowest sales at **8040**, which marks it as the region with the most room for growth.

* **Regional Sales Comparison**: The ANOVA test returned an F-statistic of **1.188** and a p-value of **0.321**. Because the p-value is above the 0.05 threshold, there is no statistically significant difference in mean sales between the regions.

* **Profit Margin**: The overall profit margin was calculated as **17%**, indicating room for improvement in areas such as cost optimization and pricing strategy.

* **Total Sales**: Total sales for the analyzed period amounted to **49020**, which provides a baseline against which future sales can be compared.

* The visualizations are saved as **.png** files for reference, and the analysis results are written to a **.txt** file.
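
As a quick consistency check on the figures above, the regional totals can be summed and expressed as shares of total sales. This is a small arithmetic sketch in plain Python; the numbers are the ones reported in the conclusions, and the variable names are illustrative.

```python
# Regional totals as reported in the conclusions.
region_sales = {"South": 14830, "East": 13970, "North": 12180, "West": 8040}

# The regional totals should add up to the reported total sales.
total_sales = sum(region_sales.values())

# Share of total sales per region, in percent, rounded to one decimal place.
shares = {
    region: round(100 * value / total_sales, 1)
    for region, value in region_sales.items()
}

print(f"Total sales: {total_sales}")
for region, share in shares.items():
    print(f"{region}: {share}%")
```

The sum matches the reported total of 49020, so the regional breakdown and the overall figure agree.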

