# Exploratory Data Analysis (EDA): Footwear Sales (2018–2026)

Before building models or drawing conclusions, it is essential to develop a strong understanding of the dataset. Exploratory Data Analysis (EDA) helps us examine its structure, quality, and potential problems.

In this section, we will:
- Review the overall structure of the dataset (features, data types, and size)
- Identify inconsistent entries
- Detect potential outliers
- Assess whether any variables require cleaning, transformation, or further investigation

By performing this initial analysis, we ensure that the dataset is reliable and well-prepared for deeper statistical analysis and visualization.

--------------------------------------------------------------

## 1. Import the necessary packages
Import the Python packages used to explore the data. <br>
First create a `.venv` from `requirements.txt`.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings as ws
ws.filterwarnings("ignore")

## 2. Explore the dataset and import the CSV
Import the data from the CSV file and start to analyze it.

In [9]:
df_sales = pd.read_csv("../../Databases/global_sports_footwear_sales_2018_2026.csv")
df_sales.sample(5)

Unnamed: 0,order_id,order_date,brand,model_name,category,gender,size,color,base_price_usd,discount_percent,final_price_usd,units_sold,revenue_usd,payment_method,sales_channel,country,customer_income_level,customer_rating
5665,ORD105665,2026-06-25,ASICS,Model-686,Gym,Women,11,Black,62,10,55.8,1,55.8,Card,Retail Store,USA,Low,4.6
22174,ORD122174,2025-01-05,Adidas,Model-442,Running,Men,10,Red,178,30,124.6,4,498.4,Wallet,Retail Store,UAE,Medium,3.1
13514,ORD113514,2022-04-26,Puma,Model-988,Gym,Women,7,Blue,210,20,168.0,4,672.0,Wallet,Retail Store,UAE,Low,4.7
9876,ORD109876,2025-04-27,Puma,Model-627,Running,Women,8,White,167,0,167.0,2,334.0,Bank Transfer,Online,India,Low,4.2
1548,ORD101548,2018-09-14,ASICS,Model-319,Running,Men,6,Blue,94,15,79.9,4,319.6,Card,Retail Store,USA,High,3.6


Next, we check the data type of each column and summarize the numerical columns to decide on our next steps.

In [10]:
df_sales.info()

<class 'pandas.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   order_id               30000 non-null  str    
 1   order_date             30000 non-null  str    
 2   brand                  30000 non-null  str    
 3   model_name             30000 non-null  str    
 4   category               30000 non-null  str    
 5   gender                 30000 non-null  str    
 6   size                   30000 non-null  int64  
 7   color                  30000 non-null  str    
 8   base_price_usd         30000 non-null  int64  
 9   discount_percent       30000 non-null  int64  
 10  final_price_usd        30000 non-null  float64
 11  units_sold             30000 non-null  int64  
 12  revenue_usd            30000 non-null  float64
 13  payment_method         30000 non-null  str    
 14  sales_channel          30000 non-null  str    
 15  country      

### Key Data Integrity Observations

- No null values were detected in any of the columns.
- Every listed column has 30,000 non-null records out of 30,000 rows, so the dataset has no missing entries.
- The majority of features, including `order_date`, are stored as the string (`str`) data type, which should be carefully considered when designing SQL tables and defining appropriate column types.
- No columns use the generic `object` dtype, which is good for data integrity and keeps the workflow consistent.
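
Because `order_date` is stored as a string, a natural follow-up is to parse it into a datetime column and re-check for missing or duplicated entries. The sketch below is only an illustration: it uses a small stand-in frame named `sample`, built from three rows of the sample output above, and it assumes that `order_id` is meant to be unique and that dates follow the `YYYY-MM-DD` format seen in the sample. In the notebook, the same calls would be applied to `df_sales`.

```python
import pandas as pd

# Stand-in frame (hypothetical name) built from rows shown in the sample output;
# in the notebook, df_sales would be used instead.
sample = pd.DataFrame({
    "order_id": ["ORD105665", "ORD122174", "ORD113514"],
    "order_date": ["2026-06-25", "2025-01-05", "2022-04-26"],
    "brand": ["ASICS", "Adidas", "Puma"],
})

# Parse the date strings; errors="coerce" turns unparseable values into NaT
# so they show up in the null count instead of raising an exception.
sample["order_date"] = pd.to_datetime(
    sample["order_date"], format="%Y-%m-%d", errors="coerce"
)

# Per-column count of missing values (includes dates that failed to parse)
null_counts = sample.isna().sum()

# Number of repeated order identifiers (assumes order_id should be unique)
duplicate_ids = sample["order_id"].duplicated().sum()

print(sample.dtypes)
print(null_counts)
print("duplicate order ids:", duplicate_ids)
```

A non-zero null count for `order_date` after parsing would point to malformed date strings that `info()` alone cannot reveal.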

In [11]:
df_sales.describe()

Unnamed: 0,size,base_price_usd,discount_percent,final_price_usd,units_sold,revenue_usd,customer_rating
count,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0
mean,8.500867,139.634633,13.332167,121.029035,2.5002,302.714948,4.001543
std,1.710896,46.062549,9.864198,42.511586,1.121149,179.149272,0.577546
min,6.0,60.0,0.0,42.0,1.0,42.0,3.0
25%,7.0,100.0,5.0,85.0,1.0,156.75,3.5
50%,9.0,140.0,10.0,119.2,3.0,268.2,4.0
75%,10.0,180.0,20.0,153.6,4.0,414.0,4.5
max,11.0,219.0,30.0,219.0,4.0,876.0,5.0


### Key Data Description Observations

- Each record appears to be a single order dated to the day, with `units_sold` ranging from 1 to 4 per order; the Model-686 sample below is consistent with this.
- From a business standpoint, discounts range from 0% to a maximum of 30%.
- The customer rating scale presumably runs from 0 to 5, but the observed ratings only range from 3.0 to 5.0, with a mean of about 4.0 and a standard deviation of about 0.58, so ratings are concentrated in the upper part of the scale.
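
To connect these summary statistics to outlier detection, one common heuristic is Tukey's 1.5 × IQR rule. The sketch below is an assumption about how the check could be done, not the method used in this notebook. It copies the quartiles, minimum, and maximum of three columns from the `describe()` output above into a hypothetical `quartiles` frame and tests whether the observed extremes fall outside the fences.

```python
import pandas as pd

# Values copied from the describe() output above (25%, 75%, min, max)
quartiles = pd.DataFrame(
    {
        "q1": [156.75, 85.0, 1.0],
        "q3": [414.0, 153.6, 4.0],
        "min": [42.0, 42.0, 1.0],
        "max": [876.0, 219.0, 4.0],
    },
    index=["revenue_usd", "final_price_usd", "units_sold"],
)

# Interquartile range and Tukey fences at 1.5 * IQR
iqr = quartiles["q3"] - quartiles["q1"]
quartiles["lower_fence"] = quartiles["q1"] - 1.5 * iqr
quartiles["upper_fence"] = quartiles["q3"] + 1.5 * iqr

# A column has potential outliers if its extreme value lies beyond a fence
quartiles["has_high_outliers"] = quartiles["max"] > quartiles["upper_fence"]
quartiles["has_low_outliers"] = quartiles["min"] < quartiles["lower_fence"]

print(quartiles)
```

Under this rule, only `revenue_usd` has values beyond its upper fence (799.875 versus a maximum of 876.0). Since revenue is the product of price and units, these are more likely large but legitimate orders than data errors, and they would need to be inspected on the full data before being treated as outliers.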

In [13]:
df_sales[df_sales["model_name"] == "Model-686"].sample(5)

Unnamed: 0,order_id,order_date,brand,model_name,category,gender,size,color,base_price_usd,discount_percent,final_price_usd,units_sold,revenue_usd,payment_method,sales_channel,country,customer_income_level,customer_rating
26223,ORD126223,2025-10-17,Reebok,Model-686,Training,Men,9,Red,107,15,90.95,2,181.9,Bank Transfer,Online,Germany,Medium,4.4
12325,ORD112325,2025-01-27,ASICS,Model-686,Basketball,Unisex,7,White,106,15,90.1,1,90.1,Cash,Retail Store,UK,High,3.3
3653,ORD103653,2024-12-16,ASICS,Model-686,Training,Men,6,Red,126,15,107.1,4,428.4,Card,Retail Store,India,Low,4.4
7443,ORD107443,2018-10-12,New Balance,Model-686,Gym,Unisex,7,Blue,109,15,92.65,1,92.65,Card,Online,India,Low,3.3
29659,ORD129659,2020-03-12,ASICS,Model-686,Gym,Women,6,Grey,75,30,52.5,4,210.0,Cash,Retail Store,India,High,3.7
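
The sampled rows suggest two relationships between columns: `final_price_usd` looks like `base_price_usd` reduced by `discount_percent`, and `revenue_usd` looks like `final_price_usd` times `units_sold`. Both formulas are inferred from the samples and are not documented anywhere in the dataset, so the sketch below treats them as assumptions. It checks them on the five Model-686 rows above, using a hypothetical frame `rows` and an assumed tolerance of one cent.

```python
import numpy as np
import pandas as pd

# The five Model-686 rows shown above (numeric columns only)
rows = pd.DataFrame({
    "base_price_usd": [107, 106, 126, 109, 75],
    "discount_percent": [15, 15, 15, 15, 30],
    "final_price_usd": [90.95, 90.1, 107.1, 92.65, 52.5],
    "units_sold": [2, 1, 4, 1, 4],
    "revenue_usd": [181.9, 90.1, 428.4, 92.65, 210.0],
})

# Assumed relationships, inferred from the samples rather than documented
expected_final = rows["base_price_usd"] * (1 - rows["discount_percent"] / 100)
expected_revenue = rows["final_price_usd"] * rows["units_sold"]

# Compare with a one-cent tolerance to absorb floating-point rounding
rows["price_ok"] = np.isclose(rows["final_price_usd"], expected_final, atol=0.01)
rows["revenue_ok"] = np.isclose(rows["revenue_usd"], expected_revenue, atol=0.01)

# Rows violating either relationship would be candidates for inconsistent entries
inconsistent = rows[~(rows["price_ok"] & rows["revenue_ok"])]
print("inconsistent rows:", len(inconsistent))
```

Running the same comparison on the full `df_sales` frame would show whether the relationships hold for all 30,000 records or only for the rows that happened to be sampled.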
