In [4]:
import pandas as pd

In [5]:
sales = pd.read_csv('./data/sales_data.csv')

In [13]:
print("First 5 rows of the data:")
sales.head()

First 5 rows of the data:


,order_date,time,aging,customer_id,gender,device_type,customer_login_type,product_category,product,sales,quantity,discount,profit,shipping_cost,order_priority,payment_method
0,2018-01-01,10:11:40,5.0,29317,Male,Web,Member,Auto & Accessories,Car Media Players,140.0,4.0,0.3,43.2,4.3,Medium,e_wallet
1,2018-01-01,22:30:44,7.0,42270,Male,Web,Member,Auto & Accessories,Car Pillow & Neck Rest,231.0,5.0,0.1,139.5,13.9,High,money_order
2,2018-01-01,21:55:31,10.0,14563,Male,Web,Member,Auto & Accessories,Car Speakers,211.0,5.0,0.1,120.5,12.0,High,credit_card
3,2018-01-01,13:57:15,9.0,58601,Male,Web,Member,Auto & Accessories,Tyre,250.0,4.0,0.2,150.0,15.0,Critical,credit_card
4,2018-01-01,15:17:41,2.0,48342,Male,Web,Member,Auto & Accessories,Tyre,250.0,1.0,0.1,165.0,16.5,High,credit_card


In [14]:
print('Dataset information:')
sales.info()

Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51290 entries, 0 to 51289
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   order_date           51290 non-null  object 
 1   time                 51290 non-null  object 
 2   aging                51289 non-null  float64
 3   customer_id          51290 non-null  int64  
 4   gender               51290 non-null  object 
 5   device_type          51290 non-null  object 
 6   customer_login_type  51290 non-null  object 
 7   product_category     51290 non-null  object 
 8   product              51290 non-null  object 
 9   sales                51289 non-null  float64
 10  quantity             51288 non-null  float64
 11  discount             51289 non-null  float64
 12  profit               51290 non-null  float64
 13  shipping_cost        51289 non-null  float64
 14  order_priority       51288 non-null  object 
 15  payment_method       51290 non-null  object 

In [10]:
print(f'Columns: {sales.columns.tolist()}')

Columns: ['order_date', 'time', 'aging', 'customer_id', 'gender', 'device_type', 'customer_login_type', 'product_category', 'product', 'sales', 'quantity', 'discount', 'profit', 'shipping_cost', 'order_priority', 'payment_method']


In [15]:
print('Summary statistics for numeric columns:')
sales.describe()

Summary statistics for numeric columns:


,aging,customer_id,sales,quantity,discount,profit,shipping_cost
count,51289.0,51290.0,51289.0,51288.0,51289.0,51290.0,51289.0
mean,5.255035,58155.758764,152.340872,2.502983,0.303821,70.407226,7.041557
std,2.959948,26032.215826,66.495419,1.511859,0.131027,48.729488,4.871745
min,1.0,10000.0,33.0,1.0,0.1,0.5,0.1
25%,3.0,35831.25,85.0,1.0,0.2,24.9,2.5
50%,5.0,61018.0,133.0,2.0,0.3,59.9,6.0
75%,8.0,80736.25,218.0,4.0,0.4,118.4,11.8
max,10.5,99999.0,250.0,5.0,0.5,167.5,16.8


In [20]:
sales.isnull().sum()

order_date             0
time                   0
aging                  1
customer_id            0
gender                 0
device_type            0
customer_login_type    0
product_category       0
product                0
sales                  1
quantity               2
discount               1
profit                 0
shipping_cost          1
order_priority         2
payment_method         0
dtype: int64

In [26]:
sales.select_dtypes(include='object').nunique()

order_date               356
time                   35275
gender                     2
device_type                2
customer_login_type        4
product_category           4
product                   42
order_priority             4
payment_method             5
dtype: int64

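The dataset appears to be mostly clean. Six columns (aging, sales, quantity, discount, shipping_cost, order_priority) have one or two null values each, out of 51,290 rows. We could either drop the rows where any field is null or impute the missing values: the column mean for the numeric columns and the most frequent value for order_priority, which is categorical and has no average.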
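
The two options for handling nulls can be sketched as follows. This is a minimal illustration on a toy frame, not the real data: the `sample` values and the names `dropped` and `imputed` are made up for the example, and only the column names come from the dataset.

```python
import pandas as pd

# Toy frame standing in for `sales`; the values are illustrative only.
sample = pd.DataFrame({
    "sales": [140.0, None, 211.0, 250.0],
    "quantity": [4.0, 5.0, None, 1.0],
    "order_priority": ["Medium", "High", None, "High"],
})

# Option 1: drop every row that contains at least one null.
dropped = sample.dropna()

# Option 2: impute. Numeric columns get the column mean;
# the categorical column gets its most frequent value (the mode).
imputed = sample.copy()
numeric_cols = imputed.select_dtypes(include="number").columns
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].mean())
imputed["order_priority"] = imputed["order_priority"].fillna(
    imputed["order_priority"].mode()[0]
)

print(len(sample), len(dropped), int(imputed.isnull().sum().sum()))
```

With so few nulls relative to the row count, dropping the affected rows loses very little data, so either option is probably acceptable here.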

I also suggest converting the order_date and time columns to pandas datetime format. To prepare the data for modeling, we should convert the categorical (object) values to numbers. One-hot encoding fits the low-cardinality columns: gender, device_type, customer_login_type, product_category, and payment_method. If model performance isn’t a major concern, we could also include the product column, although its 42 unique values would add 42 columns. For order_priority, ordinal encoding is more suitable because its levels have an inherent order.
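
A rough sketch of these conversions using pandas alone, on a toy frame. Several things here are assumptions: the `Female`, `Mobile`, and `Low` values do not appear in the rows shown above, the ranking in `priority_rank` is a guess at how the four priority levels order, and `order_datetime` is a hypothetical column name.

```python
import pandas as pd

# Toy rows using column names from the dataset; values are illustrative.
sample = pd.DataFrame({
    "order_date": ["2018-01-01", "2018-01-02", "2018-01-02"],
    "time": ["10:11:40", "22:30:44", "13:57:15"],
    "gender": ["Male", "Female", "Male"],
    "device_type": ["Web", "Mobile", "Web"],
    "order_priority": ["Medium", "Critical", "Low"],
})

# Combine the date and time strings into a single datetime column.
sample["order_datetime"] = pd.to_datetime(
    sample["order_date"] + " " + sample["time"], format="%Y-%m-%d %H:%M:%S"
)

# Ordinal encoding: map each level to its rank (assumed ordering).
priority_rank = {"Low": 0, "Medium": 1, "High": 2, "Critical": 3}
sample["order_priority"] = sample["order_priority"].map(priority_rank)

# One-hot encoding: one 0/1 indicator column per category value.
encoded = pd.get_dummies(sample, columns=["gender", "device_type"], dtype=int)

print(encoded.columns.tolist())
```

Any level missing from `priority_rank` would map to NaN, so the mapping should be checked against the actual unique values of order_priority before use.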

I'm also unsure whether customer_id should be included as a feature. It is an identifier rather than a measurement, so its numeric ordering carries no information; for example, customer_id 60000 isn't “less than” customer_id 60001 in any meaningful way.
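
If we decide to leave the identifier out, one simple approach is to set it aside before building the feature matrix. The names `customer_ids` and `features` below are hypothetical, and the toy rows reuse values from the first rows shown above.

```python
import pandas as pd

sample = pd.DataFrame({
    "customer_id": [29317, 42270, 14563],
    "sales": [140.0, 231.0, 211.0],
    "profit": [43.2, 139.5, 120.5],
})

# Keep the identifier separately (e.g. for grouping or joins),
# but exclude it from the columns passed to a model.
customer_ids = sample["customer_id"]
features = sample.drop(columns=["customer_id"])

print(features.columns.tolist())
```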

If we use a model that is sensitive to feature scaling, we might consider standardizing the profit and shipping_cost columns, for instance with standard scaling (subtract the column mean, then divide by the column standard deviation).
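
As a sketch, standard scaling can be written directly in pandas. The toy values below are the profit and shipping_cost figures from the first five rows shown earlier; `cols` and `scaled` are names introduced only for this example.

```python
import pandas as pd

sample = pd.DataFrame({
    "profit": [43.2, 139.5, 120.5, 150.0, 165.0],
    "shipping_cost": [4.3, 13.9, 12.0, 15.0, 16.5],
})

cols = ["profit", "shipping_cost"]

# z-score: subtract the column mean and divide by the column standard deviation.
# ddof=0 uses the population standard deviation.
scaled = sample.copy()
scaled[cols] = (sample[cols] - sample[cols].mean()) / sample[cols].std(ddof=0)

print(scaled.round(3))
```

When the data is split into training and test sets, the mean and standard deviation should be computed on the training portion only and then reused for the test portion.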