# 'Kwanza Tukule' Sales Data Analysis and Insights.
*By Muniu Paul ™️*

## 1. BUSINESS UNDERSTANDING.
### Overview.
Kwanza Tukule is a cashless B2B business that uses technology and an efficient supply chain to make nutritious food accessible and affordable for many people in Kenya. It focuses on street food vendors and kiosks, which serve as critical food access points in low-income areas, and addresses the affordability and availability of their ingredients. The company sources essential products directly from manufacturers and delivers them through last-mile distribution. Cutting out intermediaries lets it offer vendors lower prices, which in turn reduces food costs for the local population.

### Problem Statement.
This project aims to analyze anonymized sales data to derive actionable insights that will help Kwanza Tukule optimize its operations, better serve its vendors, and improve access to affordable food. The insights will focus on understanding sales trends, customer behavior, and inventory management to inform business decisions.

### Objectives.
1. Data Cleaning and Preparation: Assessing data quality and preparing the dataset for analysis.
2. Sales Analysis: Analyzing trends and performance to understand the business context.
3. Forecasting and Segmentation: Performing advanced analysis like forecasting and customer segmentation.
4. Recommendations: Providing actionable recommendations to improve operations.
5. Dashboard Creation: Creating an interactive dashboard that summarizes the findings.

### Stakeholder.
**Kwanza Tukule Foods Limited** - A B2B company focused on providing accessible, affordable, and nutritious food to street food vendors and kiosks in low-income areas of Kenya.

## 2. DATA UNDERSTANDING.
The dataset provided by **Kwanza Tukule** contains anonymized sales data. You can access it [here](dataset/Case%20Study%20Data%20-%20Read%20Only%20-%20case_study_data_2025-01-16T06_49_12.19881Z.csv).

##### Importing packages/libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [9]:
# Load the dataset and preview the first 5 rows
data = pd.read_csv('dataset/Case Study Data.csv')
data.head()

| | DATE | ANONYMIZED CATEGORY | ANONYMIZED PRODUCT | ANONYMIZED BUSINESS | ANONYMIZED LOCATION | QUANTITY | UNIT PRICE |
|---|---|---|---|---|---|---|---|
| 0 | August 18, 2024, 9:32 PM | Category-106 | Product-21f4 | Business-de42 | Location-1ba8 | 1 | 850 |
| 1 | August 18, 2024, 9:32 PM | Category-120 | Product-4156 | Business-de42 | Location-1ba8 | 2 | 1910 |
| 2 | August 18, 2024, 9:32 PM | Category-121 | Product-49bd | Business-de42 | Location-1ba8 | 1 | 3670 |
| 3 | August 18, 2024, 9:32 PM | Category-76 | Product-61dd | Business-de42 | Location-1ba8 | 1 | 2605 |
| 4 | August 18, 2024, 9:32 PM | Category-119 | Product-66e0 | Business-de42 | Location-1ba8 | 5 | 1480 |


In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333405 entries, 0 to 333404
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   DATE                 333405 non-null  object
 1   ANONYMIZED CATEGORY  333405 non-null  object
 2   ANONYMIZED PRODUCT   333405 non-null  object
 3   ANONYMIZED BUSINESS  333405 non-null  object
 4   ANONYMIZED LOCATION  333405 non-null  object
 5   QUANTITY             333405 non-null  int64 
 6   UNIT PRICE           333397 non-null  object
dtypes: int64(1), object(6)
memory usage: 17.8+ MB


The dataset consists of 333,405 rows and 7 columns. The features include:

`DATE`: The date of the transaction.

`ANONYMIZED CATEGORY`: The category of the product sold.

`ANONYMIZED PRODUCT`: The specific product that was sold.

`ANONYMIZED BUSINESS`: The business involved in the transaction (given the B2B model, most likely the purchasing vendor).

`ANONYMIZED LOCATION`: The location of the transaction.

`QUANTITY`: The number of units sold in the transaction.

`UNIT PRICE`: The price per unit of the product.
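
The `data.info()` output above reports `DATE` and `UNIT PRICE` as `object` columns, so both would need converting before any time-based or revenue calculation. The following is a minimal sketch of one way to do that, not necessarily the approach taken later in this notebook. The `sample` frame and the `coerce_types` helper are hypothetical, and the thousands separator in `"1,910"` is an assumed cause of the `object` dtype, not something confirmed by the data.

```python
import pandas as pd

# Hypothetical sample mirroring the DATE format shown in data.head();
# the comma in "1,910" is an assumed formatting quirk, not confirmed.
sample = pd.DataFrame({
    "DATE": ["August 18, 2024, 9:32 PM", "August 18, 2024, 9:32 PM"],
    "QUANTITY": [1, 2],
    "UNIT PRICE": ["850", "1,910"],
})

def coerce_types(df):
    """Return a copy with DATE parsed as datetime and UNIT PRICE as numeric."""
    out = df.copy()
    # Explicit format matching strings like "August 18, 2024, 9:32 PM"
    out["DATE"] = pd.to_datetime(out["DATE"], format="%B %d, %Y, %I:%M %p")
    # Strip assumed thousands separators; unparseable values become NaN
    out["UNIT PRICE"] = pd.to_numeric(
        out["UNIT PRICE"].astype(str).str.replace(",", "", regex=False),
        errors="coerce",
    )
    return out

typed = coerce_types(sample)
print(typed.dtypes)
```

Using `errors="coerce"` turns any value that still fails to parse into `NaN`, so it would surface in the missing-value check instead of raising an error.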

# Section 1: Data Cleaning and Preparation.

## (a) Data Quality Assessment.

#### i. Check for missing values


In [None]:
missing_values = data.isnull().sum()
missing_values

DATE                   0
ANONYMIZED CATEGORY    0
ANONYMIZED PRODUCT     0
ANONYMIZED BUSINESS    0
ANONYMIZED LOCATION    0
QUANTITY               0
UNIT PRICE             8
dtype: int64
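
Raw counts are easier to judge next to the share of rows they represent. As a rough sketch, the missing-value check can report both; the `toy` frame and the `missing_summary` helper below are hypothetical stand-ins, and the same function could be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`; values are illustrative only.
toy = pd.DataFrame({
    "QUANTITY": [1, 2, 5, 3],
    "UNIT PRICE": ["850", np.nan, "1480", "2605"],
})

def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    return pd.DataFrame({
        "missing": df.isnull().sum(),
        "percent": df.isnull().mean() * 100,
    })

summary = missing_summary(toy)
print(summary)
```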

Only 8 of the 333,405 rows (about 0.002%) are missing a `UNIT PRICE`, and no other column has missing values. Because the affected share is negligible, I dropped these rows rather than imputing prices, which would have introduced assumed values into the analysis.

In [12]:
data.dropna(subset=['UNIT PRICE'], inplace=True)

In [13]:
# Verify that no missing values remain
missing_values_after = data.isnull().sum()
missing_values_after


DATE                   0
ANONYMIZED CATEGORY    0
ANONYMIZED PRODUCT     0
ANONYMIZED BUSINESS    0
ANONYMIZED LOCATION    0
QUANTITY               0
UNIT PRICE             0
dtype: int64
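
As an alternative to dropping in place, the same step can be sketched as a small function that returns a new frame together with the number of rows removed, which makes the effect of the drop easy to confirm. The `toy` frame and the `drop_missing` helper are hypothetical; resetting the index is a choice made here for illustration and is not something the cells above do.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`; values are illustrative only.
toy = pd.DataFrame({
    "QUANTITY": [1, 2, 5, 3],
    "UNIT PRICE": ["850", np.nan, "1480", "2605"],
})

def drop_missing(df, column):
    """Return a copy without rows where `column` is missing, plus the count removed."""
    cleaned = df.dropna(subset=[column]).reset_index(drop=True)
    return cleaned, len(df) - len(cleaned)

cleaned, removed = drop_missing(toy, "UNIT PRICE")
print(f"Removed {removed} row(s); {len(cleaned)} remain.")
```

Applied to the full dataset, a check like this should report 8 rows removed and 333,397 remaining, which matches the non-null count that `data.info()` shows for `UNIT PRICE`.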