# Business Problem Overview

## FreshMart: Maximising Total Sales Revenue through Smarter Retail Decisions

FreshMart is a fast-growing grocery retail chain based in the United States, serving thousands of customers across various cities and countries. Known for its wide product range and affordable pricing, FreshMart has built a strong presence in both urban and suburban markets.

As the company prepares for its next phase of growth, leadership wants to focus not just on adding new stores, but on increasing Total Sales Revenue from its existing network. This requires a better understanding of what drives revenue, from which products perform well to how different regions, customer segments, and sales staff contribute to the bottom line.

The company believes there are many untapped opportunities to grow sales. These may lie in:

- how product categories perform across months,
- which types of customers spend more,
- how employees contribute to store-level sales,
- which cities or countries have higher or lower sales, and
- how discounts are influencing buying behavior.

### Key Metric: Total Sales Revenue

$\text{Total Sales Revenue} = \text{Unit Price} \times \text{Quantity} \times (1 - \text{Discount})$. This metric helps measure how much money FreshMart is making from selling products after applying discounts.
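
As a quick illustration of the formula, the sketch below applies it to a small hand-made table. The column names mirror the `products` and `sales` tables described later in this document; the helper name `total_sales_revenue` and the sample values are assumptions made purely for illustration.

```python
import pandas as pd

def total_sales_revenue(unit_price, quantity, discount):
    """Revenue per line: Unit Price x Quantity x (1 - Discount)."""
    return unit_price * quantity * (1 - discount)

# Hypothetical sample rows (not taken from the FreshMart data)
sample = pd.DataFrame({
    "Price": [3.50, 10.00],
    "Quantity": [3, 2],
    "Discount": [0.10, 0.00],
})

# Apply the formula row by row, then add up the line revenues
sample["Revenue"] = total_sales_revenue(
    sample["Price"], sample["Quantity"], sample["Discount"]
)
total_revenue = sample["Revenue"].sum()

print(sample)
print(f"Total Sales Revenue: {total_revenue:.2f}")
```

The first sample row uses the same price, quantity, and discount as the examples in the column definitions (3.50, 3, and 0.10), which gives the 9.45 shown there for `TotalPrice`.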

### Objective

The objective of this project is to:

- Break down Total Sales Revenue into its core components such as product price, quantity sold, and discounts applied.
- Explore how different product categories, classifications, and features contribute to revenue across time.
- Segment customers based on order value, quantity purchased, and spending behavior.
- Evaluate the performance of individual sales employees and understand how their contribution varies by region or time.
- Compare city-wise and country-wise sales patterns to highlight high- and low-performing areas.
- Analyse trends over time to understand when sales peak or drop, and how this differs by product or location.
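
The time-based breakdowns listed above could be prototyped along the following lines. This is a minimal sketch on made-up rows, assuming sales have already been joined to category names and that a `Revenue` column has been computed; the frame `joined` and all of its values are hypothetical.

```python
import pandas as pd

# Hypothetical, already-joined rows (values invented for illustration)
joined = pd.DataFrame({
    "SalesDate": pd.to_datetime([
        "2018-01-05", "2018-01-20", "2018-02-03", "2018-02-14",
    ]),
    "CategoryName": ["Dairy", "Beverages", "Dairy", "Dairy"],
    "Revenue": [10.0, 5.0, 7.5, 2.5],
})

# Bucket each sale into its calendar month, then sum revenue per category
monthly = (
    joined
    .assign(Month=joined["SalesDate"].dt.to_period("M"))
    .groupby(["Month", "CategoryName"], as_index=False)["Revenue"]
    .sum()
)

print(monthly)
```

The same pattern should extend to other groupings (city, country, employee) by swapping the grouping columns.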

# Dataset Overview

- **Dataset Name:** FreshMart Analytics Dataset
- **Number of Tables:** 7

## Table Overviews

### categories

- **Table Name:** categories
- **Number of rows:** 11
- **Number of columns:** 2
- **Description:** This table gives the different product categories that FreshMart sells.

### cities

- **Table Name:** cities
- **Number of rows:** 96
- **Number of columns:** 4
- **Description:** This table gives a list of cities and zipcodes FreshMart operates in.

### countries

- **Table Name:** countries
- **Number of rows:** 206
- **Number of columns:** 3
- **Description:** This table gives a list of countries FreshMart operates in.

### customers

- **Table Name:** customers
- **Number of rows:** 98759
- **Number of columns:** 6
- **Description:** This table gives details of FreshMart customers. 

### employees

- **Table Name:** employees
- **Number of rows:** 23
- **Number of columns:** 8
- **Description:** This table gives details of FreshMart employees. 

### products

- **Table Name:** products
- **Number of rows:** 452
- **Number of columns:** 9
- **Description:** This table gives details of the products sold by FreshMart. 

### sales

- **Table Name:** sales
- **Number of rows:** 6758125
- **Number of columns:** 9
- **Description:** This table gives detailed transaction history of FreshMart.

## Column Definitions

### categories

* **CategoryID**
    * Description: Unique identifier for each product category.
    * Example: 1
* **CategoryName**
    * Description: Name of the product category.
    * Example: Beverages

### cities

* **CityID**
    * Description: Unique identifier for each city.
    * Example: 101
* **CityName**
    * Description: Name of the city.
    * Example: San Diego
* **Zipcode**
    * Description: Represents the zipcode the city is in.
    * Example: 500000
* **CountryID**
    * Description: Reference to the corresponding country from countries.
    * Example: 1

### countries

  * **CountryID**
    * Description: Unique identifier for each country.
    * Example: 1
  * **CountryName**
    * Description: Name of the country.
    * Example: United States
  * **CountryCode**
    * Description: Two-letter country code.
    * Example: US


### customers

  * **CustomerID**
    * Description: Unique identifier for each customer.
    * Example: 1001
  * **FirstName**
    * Description: First name of the customer.
    * Example: Emma
  * **MiddleInitial**
    * Description: Middle initial of the customer.
    * Example: A
  * **LastName**
    * Description: Last name of the customer.
    * Example: Johnson
  * **CityID**
    * Description: City of the customer. Refers to cities.
    * Example: 101
  * **Address**
    * Description: Residential address of the customer.
    * Example: 123 Elm Street


### employees

  * **EmployeeID**
    * Description: Unique identifier for each employee.
    * Example: 501
  * **FirstName**
    * Description: First name of the employee.
    * Example: Michael
  * **MiddleInitial**
    * Description: Middle initial of the employee.
    * Example: B
  * **LastName**
    * Description: Last name of the employee.
    * Example: Davis
  * **BirthDate**
    * Description: Date of birth of the employee in YYYY-MM-DD format.
    * Example: 1985-07-14
  * **Gender**
    * Description: Gender of the employee.
    * Example: Male
  * **CityID**
    * Description: City where the employee is based. Refers to cities.
    * Example: 103
  * **HireDate**
    * Description: Date when the employee was hired.
    * Example: 2021-04-01


### products

  * **ProductID**
    * Description: Unique identifier for each product.
    * Example: 301
  * **ProductName**
    * Description: Name of the product.
    * Example: Organic Apple
  * **Price**
    * Description: Unit price of the product in USD.
    * Example: 3.50
  * **CategoryID**
    * Description: Category reference for the product. Refers to categories.
    * Example: 2
  * **Class**
    * Description: Classification type of the product (e.g., Standard, Premium).
    * Example: Premium
  * **ModifyDate**
    * Description: Date when the product information was last updated.
    * Example: 2023-06-01
  * **Resistant**
    * Description: Product resistance category.
    * Example: Water-resistant
  * **IsAllergic**
    * Description: Indicates whether the item contains allergens.
    * Example: No
  * **VitalityDays**
    * Description: Indicates the product's shelf life or freshness period.
    * Example: 7


### sales

  * **SalesID**
    * Description: Unique identifier for each sale.
    * Example: 7001
  * **SalesPersonID**
    * Description: Employee responsible for the sale. Refers to employees.
    * Example: 501
  * **CustomerID**
    * Description: Customer making the purchase. Refers to customers.
    * Example: 1001
  * **ProductID**
    * Description: Product being sold. Refers to products.
    * Example: 301
  * **Quantity**
    * Description: Number of product units sold.
    * Example: 3
  * **Discount**
    * Description: Discount applied to this sale, shown as a decimal.
    * Example: 0.10
  * **TotalPrice**
    * Description: Final sale price after applying discount.
    * Example: 9.45
  * **SalesDate**
    * Description: Date and time of the sale in YYYY-MM-DD HH:MM:SS format.
    * Example: 2024-05-15 14:32:00
  * **TransactionNumber**
    * Description: Unique identifier for the transaction.
    * Example: TXN-20240515-0001
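
Because the tables reference one another through ID columns, it can be worth confirming that every reference resolves before joining. The sketch below shows one way to do that on tiny hypothetical frames; the helper `orphan_keys` is an illustrative name and not part of the notebook.

```python
import pandas as pd

def orphan_keys(child, child_col, parent, parent_col):
    """Return the distinct values of child[child_col] absent from parent[parent_col]."""
    mask = ~child[child_col].isin(parent[parent_col])
    return sorted(child.loc[mask, child_col].unique().tolist())

# Hypothetical miniature tables using the column names defined above
products = pd.DataFrame({"ProductID": [301, 302], "CategoryID": [2, 9]})
categories = pd.DataFrame({"CategoryID": [1, 2], "CategoryName": ["Beverages", "Dairy"]})
sales = pd.DataFrame({"SalesID": [7001, 7002], "ProductID": [301, 999]})

# Each call checks one reference: (child table, child column, parent table, parent column)
missing_categories = orphan_keys(products, "CategoryID", categories, "CategoryID")
missing_products = orphan_keys(sales, "ProductID", products, "ProductID")

print(missing_categories)
print(missing_products)
```

Once the real tables are loaded, the same helper could be pointed at each reference in turn, for example `SalesPersonID` against `EmployeeID`, or `CityID` and `CountryID` against their parent tables.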


# Analysis & Visualisation

## Importing and Cleaning Data

### Importing the Necessary Libraries

#### Install the libraries

In [1]:
# To download files from Google Drive
%pip install gdown

# To get output in markdown format
%pip install ipython

# To work with dataframes
%pip install pandas

# To display dataframes in markdown text
%pip install tabulate


Note: you may need to restart the kernel to use updated packages.


#### Import the libraries

In [1]:
import gdown
import os
from IPython.display import display, Markdown
import pandas as pd

### Loading the Dataset from Google Drive

In [2]:
# The Google Drive folder URL
FOLDER_URL = 'https://drive.google.com/drive/folders/1FcrdY8uZLE04-U5K2dqnjhDJVenFwGlk'

# The directory where the downloaded CSVs will be stored. You can change it,
# but all further analysis assumes this directory structure.
DOWNLOAD_DIRECTORY = os.path.abspath(os.path.join(os.getcwd(), "..", "Dataset"))

In [None]:
# This snippet will download all the contents from the Google Drive Folder

# Make the download directory if it doesn't exist
os.makedirs(DOWNLOAD_DIRECTORY, exist_ok=True)

# Download all the contents of the Google Drive folder
gdown.download_folder(url=FOLDER_URL, output=DOWNLOAD_DIRECTORY, quiet=False, use_cookies=False)

# Delete all of the non-CSV files
for filename in os.listdir(DOWNLOAD_DIRECTORY):
    file_path = os.path.join(DOWNLOAD_DIRECTORY, filename)
    # Skip sub-directories: os.remove() only works on files
    if os.path.isfile(file_path) and not filename.lower().endswith(".csv"):
        os.remove(file_path)
        display(Markdown(f"Removed non-CSV file: {filename}"))

In [3]:
# This snippet will load all the CSVs into dataframes

dataset = {} # The dataframes will be stored in a dictionary to make it easier to loop through all of them

for filename in os.listdir(DOWNLOAD_DIRECTORY):
    if filename.lower().endswith('.csv'): # Case-insensitive, matching the check used when cleaning the download directory
        key = os.path.splitext(filename)[0] # Extracting the key for each file where the key is the filename without the .csv
        file_path = os.path.join(DOWNLOAD_DIRECTORY, filename) # Determining the path of each file in the DOWNLOAD_DIRECTORY
        df = pd.read_csv(file_path) # Reading each CSV into a dataframe
        dataset[key] = df
        display(Markdown(f"Loaded: `{filename}` -> key=`'{key}'`"))

Loaded: `categories.csv` -> key=`'categories'`

Loaded: `cities.csv` -> key=`'cities'`

Loaded: `countries.csv` -> key=`'countries'`

Loaded: `customers.csv` -> key=`'customers'`

Loaded: `employees.csv` -> key=`'employees'`

Loaded: `products.csv` -> key=`'products'`

Loaded: `sales.csv` -> key=`'sales'`

#### Viewing the First Few Rows of each Table in the Dataset

In [None]:
display(Markdown("The first 5 rows of each table in the dataset:"))
for key, df in dataset.items():
    display(Markdown(f"##### {key}"))
    display(Markdown(df.head().to_markdown(index=False)))

The first 5 rows of each table in the dataset:

##### categories

|   CategoryID | CategoryName   |
|-------------:|:---------------|
|            1 | Confections    |
|            2 | Shell fish     |
|            3 | Cereals        |
|            4 | Dairy          |
|            5 | Beverages      |

##### cities

|   CityID | CityName       |   Zipcode |   CountryID |
|---------:|:---------------|----------:|------------:|
|        1 | Dayton         |     80563 |          32 |
|        2 | Buffalo        |     17420 |          32 |
|        3 | Chicago        |     44751 |          32 |
|        4 | Fremont        |     20641 |          32 |
|        5 | Virginia Beach |     62389 |          32 |

##### countries

|   CountryID | CountryName   | CountryCode   |
|------------:|:--------------|:--------------|
|           1 | Armenia       | AN            |
|           2 | Canada        | FO            |
|           3 | Belize        | MK            |
|           4 | Uganda        | LV            |
|           5 | Thailand      | VI            |

##### customers

|   CustomerID | FirstName   | MiddleInitial   | LastName   |   CityID | Address                      |
|-------------:|:------------|:----------------|:-----------|---------:|:-----------------------------|
|            1 | Stefanie    | Y               | Frye       |       79 | 97 Oak Avenue                |
|            2 | Sandy       | T               | Kirby      |       96 | 52 White First Freeway       |
|            3 | Lee         | T               | Zhang      |       55 | 921 White Fabien Avenue      |
|            4 | Regina      | S               | Avery      |       40 | 75 Old Avenue                |
|            5 | Daniel      | S               | Mccann     |        2 | 283 South Green Hague Avenue |

##### employees

|   EmployeeID | FirstName   | MiddleInitial   | LastName   | BirthDate               | Gender   |   CityID | HireDate                |
|-------------:|:------------|:----------------|:-----------|:------------------------|:---------|---------:|:------------------------|
|            1 | Nicole      | T               | Fuller     | 1981-03-07 00:00:00.000 | F        |       80 | 2011-06-20 07:15:36.920 |
|            2 | Christine   | W               | Palmer     | 1968-01-25 00:00:00.000 | F        |        4 | 2011-04-27 04:07:56.930 |
|            3 | Pablo       | Y               | Cline      | 1963-02-09 00:00:00.000 | M        |       70 | 2012-03-30 18:55:23.270 |
|            4 | Darnell     | O               | Nielsen    | 1989-02-06 00:00:00.000 | M        |       39 | 2014-03-06 06:55:02.780 |
|            5 | Desiree     | L               | Stuart     | 1963-05-03 00:00:00.000 | F        |       23 | 2014-11-16 22:59:54.720 |

##### products

|   ProductID | ProductName                |   Price |   CategoryID | Class   | ModifyDate              | Resistant   | IsAllergic   |   VitalityDays |
|------------:|:---------------------------|--------:|-------------:|:--------|:------------------------|:------------|:-------------|---------------:|
|           1 | Flour - Whole Wheat        | 74.2988 |            3 | Medium  | 2018-02-16 08:21:49.190 | Durable     | Unknown      |              0 |
|           2 | Cookie Chocolate Chip With | 91.2329 |            3 | Medium  | 2017-02-12 11:39:10.970 | Unknown     | Unknown      |              0 |
|           3 | Onions - Cippolini         |  9.1379 |            9 | Medium  | 2018-03-15 08:11:51.560 | Weak        | False        |            111 |
|           4 | Sauce - Gravy, Au Jus, Mix | 54.3055 |            9 | Medium  | 2017-07-16 00:46:28.880 | Durable     | Unknown      |              0 |
|           5 | Artichokes - Jerusalem     | 65.4771 |            2 | Low     | 2017-08-16 14:13:35.430 | Durable     | True         |             27 |

##### sales

|   SalesID |   SalesPersonID |   CustomerID |   ProductID |   Quantity |   Discount |   TotalPrice | SalesDate               | TransactionNumber    |
|----------:|----------------:|-------------:|------------:|-----------:|-----------:|-------------:|:------------------------|:---------------------|
|         1 |               6 |        27039 |         381 |          7 |        0   |            0 | 2018-02-05 07:38:25.430 | FQL4S94E4ME1EZFTG42G |
|         2 |              16 |        25011 |          61 |          7 |        0   |            0 | 2018-02-02 16:03:31.150 | 12UGLX40DJ1A5DTFBHB8 |
|         3 |              13 |        94024 |          23 |         24 |        0   |            0 | 2018-05-03 19:31:56.880 | 5DT8RCPL87KI5EORO7B0 |
|         4 |               8 |        73966 |         176 |         19 |        0.2 |            0 | 2018-04-07 14:43:55.420 | R3DR9MLD5NR76VO17ULE |
|         5 |              10 |        32653 |         310 |          9 |        0   |            0 | 2018-02-12 15:37:03.940 | 4BGS0Z5OMAZ8NDAFHHP3 |

### Checking the Shape of the Dataset

In [None]:
display(Markdown("The following table shows how many rows and columns are in each table of the dataset:"))
markdown_text = "|Table|# Rows|# Columns|\n|-----|-------|---------|\n"
for key, df in dataset.items():
    markdown_text += f"|{key}|{df.shape[0]}|{df.shape[1]}|\n" # The shape attribute is a (rows, columns) tuple
display(Markdown(markdown_text))

The following table shows how many rows and columns are in each table of the dataset:

|Table|# Rows|# Columns|
|-----|-------|---------|
|categories|11|2|
|cities|96|4|
|countries|206|3|
|customers|98759|6|
|employees|23|8|
|products|452|9|
|sales|6758125|9|


### Displaying Dataset Information

In [11]:
for key, df in dataset.items():
    display(Markdown(f'Table information for {key}:'))
    df.info()

Table information for categories:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   CategoryID    11 non-null     int64 
 1   CategoryName  11 non-null     object
dtypes: int64(1), object(1)
memory usage: 308.0+ bytes


Table information for cities:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   CityID     96 non-null     int64 
 1   CityName   96 non-null     object
 2   Zipcode    96 non-null     int64 
 3   CountryID  96 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 3.1+ KB


Table information for countries:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206 entries, 0 to 205
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   CountryID    206 non-null    int64 
 1   CountryName  206 non-null    object
 2   CountryCode  205 non-null    object
dtypes: int64(1), object(2)
memory usage: 5.0+ KB


Table information for customers:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98759 entries, 0 to 98758
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   CustomerID     98759 non-null  int64 
 1   FirstName      98759 non-null  object
 2   MiddleInitial  97782 non-null  object
 3   LastName       98759 non-null  object
 4   CityID         98759 non-null  int64 
 5   Address        98759 non-null  object
dtypes: int64(2), object(4)
memory usage: 4.5+ MB


Table information for employees:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   EmployeeID     23 non-null     int64 
 1   FirstName      23 non-null     object
 2   MiddleInitial  23 non-null     object
 3   LastName       23 non-null     object
 4   BirthDate      23 non-null     object
 5   Gender         23 non-null     object
 6   CityID         23 non-null     int64 
 7   HireDate       23 non-null     object
dtypes: int64(2), object(6)
memory usage: 1.6+ KB


Table information for products:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452 entries, 0 to 451
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ProductID     452 non-null    int64  
 1   ProductName   452 non-null    object 
 2   Price         452 non-null    float64
 3   CategoryID    452 non-null    int64  
 4   Class         452 non-null    object 
 5   ModifyDate    452 non-null    object 
 6   Resistant     452 non-null    object 
 7   IsAllergic    452 non-null    object 
 8   VitalityDays  452 non-null    float64
dtypes: float64(2), int64(2), object(5)
memory usage: 31.9+ KB


Table information for sales:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6758125 entries, 0 to 6758124
Data columns (total 9 columns):
 #   Column             Dtype  
---  ------             -----  
 0   SalesID            int64  
 1   SalesPersonID      int64  
 2   CustomerID         int64  
 3   ProductID          int64  
 4   Quantity           int64  
 5   Discount           float64
 6   TotalPrice         float64
 7   SalesDate          object 
 8   TransactionNumber  object 
dtypes: float64(2), int64(5), object(2)
memory usage: 464.0+ MB
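
The `info()` output above reports the date columns (`BirthDate`, `HireDate`, `ModifyDate`, `SalesDate`) as `object`, meaning they are held as strings. One possible next step, sketched here on hypothetical values in the same format as the preview, is to convert them with `pd.to_datetime`. Passing `errors="coerce"` is an assumption about how unparseable values should be treated: they become `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical rows in the timestamp format seen in the preview, plus one missing value
sales = pd.DataFrame({
    "SalesDate": ["2018-02-05 07:38:25.430", "2018-05-03 19:31:56.880", None],
})

# Convert text timestamps to datetime64; missing or malformed entries become NaT
sales["SalesDate"] = pd.to_datetime(sales["SalesDate"], errors="coerce")

print(sales.dtypes)
print(sales)
```

After conversion, the `.dt` accessor (for example `.dt.month` or `.dt.to_period("M")`) becomes available for time-based grouping.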


### Checking for Duplicate Values in the Dataset

In [None]:
display(Markdown("The table below shows the number of duplicate rows for each table in the dataset:"))
markdown_text = '|Table|Duplicate Rows|\n|----|------|\n'
for key, df in dataset.items():
    markdown_text += f'|{key}|{df.duplicated().sum()}| \n' # duplicated() flags rows that repeat an earlier row; sum() counts them
display(Markdown(markdown_text))

The table below shows the number of duplicate rows for each table in the dataset:

|Table|Duplicate Rows|
|----|------|
|categories|0| 
|cities|0| 
|countries|0| 
|customers|0| 
|employees|0| 
|products|0| 
|sales|0| 
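
`duplicated()` with no arguments only flags rows that repeat in every column. A complementary check, sketched below on hypothetical rows, is whether each table's identifier column is unique on its own; the helper name `duplicate_key_count` is illustrative.

```python
import pandas as pd

def duplicate_key_count(df, key):
    """Count rows whose key value already appeared earlier in the frame."""
    return int(df.duplicated(subset=[key]).sum())

# Hypothetical rows: no fully duplicated row, but SalesID 2 appears twice
sales = pd.DataFrame({
    "SalesID": [1, 2, 2],
    "Quantity": [7, 7, 24],
})

full_row_duplicates = int(sales.duplicated().sum())
key_duplicates = duplicate_key_count(sales, "SalesID")

print(full_row_duplicates)
print(key_duplicates)
```

On the real data, a non-zero key count for a column such as `SalesID` or `CustomerID` would suggest the column cannot safely be used as a join key without further cleaning.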


### Checking for Missing / Null Values

In [None]:
display(Markdown("The tables below show how many missing values there are in each column of each table:"))
for key, df in dataset.items():
    display(Markdown(f"##### {key}"))
    missing_values = df.isnull().sum()
    markdown_table = missing_values.to_frame().reset_index()  # This is done so the output can be shown in markdown format
    markdown_table.columns = ['Column Name', '# of Missing Values']
    display(Markdown(markdown_table.to_markdown(index=False)))

The tables below show how many missing values there are in each column of each table:

##### categories

| Column Name   |   # of Missing Values |
|:--------------|----------------------:|
| CategoryID    |                     0 |
| CategoryName  |                     0 |

##### cities

| Column Name   |   # of Missing Values |
|:--------------|----------------------:|
| CityID        |                     0 |
| CityName      |                     0 |
| Zipcode       |                     0 |
| CountryID     |                     0 |

##### countries

| Column Name   |   # of Missing Values |
|:--------------|----------------------:|
| CountryID     |                     0 |
| CountryName   |                     0 |
| CountryCode   |                     1 |

##### customers

| Column Name   |   # of Missing Values |
|:--------------|----------------------:|
| CustomerID    |                     0 |
| FirstName     |                     0 |
| MiddleInitial |                   977 |
| LastName      |                     0 |
| CityID        |                     0 |
| Address       |                     0 |

##### employees

| Column Name   |   # of Missing Values |
|:--------------|----------------------:|
| EmployeeID    |                     0 |
| FirstName     |                     0 |
| MiddleInitial |                     0 |
| LastName      |                     0 |
| BirthDate     |                     0 |
| Gender        |                     0 |
| CityID        |                     0 |
| HireDate      |                     0 |

##### products

| Column Name   |   # of Missing Values |
|:--------------|----------------------:|
| ProductID     |                     0 |
| ProductName   |                     0 |
| Price         |                     0 |
| CategoryID    |                     0 |
| Class         |                     0 |
| ModifyDate    |                     0 |
| Resistant     |                     0 |
| IsAllergic    |                     0 |
| VitalityDays  |                     0 |

##### sales

| Column Name       |   # of Missing Values |
|:------------------|----------------------:|
| SalesID           |                     0 |
| SalesPersonID     |                     0 |
| CustomerID        |                     0 |
| ProductID         |                     0 |
| Quantity          |                     0 |
| Discount          |                     0 |
| TotalPrice        |                     0 |
| SalesDate         |                 67526 |
| TransactionNumber |                     0 |
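
The single missing `CountryCode` may deserve a closer look. By default `pd.read_csv` treats several strings, including `NA`, as missing, so a legitimate two-letter code `NA` would be read as null. Whether that is what happened here is an assumption to verify against the raw CSV. The sketch below uses an in-memory CSV with made-up rows to show the effect and one way to avoid it.

```python
import io
import pandas as pd

# Made-up CSV text: the second row uses the literal code "NA", the third is truly empty
raw = (
    "CountryID,CountryName,CountryCode\n"
    "1,Canada,CA\n"
    "2,Namibia,NA\n"
    "3,Example Land,\n"
)

# Default parsing: both "NA" and the empty field are read as missing
default_df = pd.read_csv(io.StringIO(raw))

# Treat only empty fields as missing, so the string "NA" survives
strict_df = pd.read_csv(io.StringIO(raw), keep_default_na=False, na_values=[""])

default_missing = int(default_df["CountryCode"].isnull().sum())
strict_missing = int(strict_df["CountryCode"].isnull().sum())

print(default_missing)
print(strict_missing)
```

If the raw `countries.csv` does contain a literal `NA` code, reloading that one file with these options would be one way to recover it; the other missing values (`MiddleInitial`, `SalesDate`) would need separate treatment.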

### Summary of Dataset Observations