# EDA notebook for getting a basic understanding of what the data contains

## The files and the columns they contain, with descriptions ([see the dataset page on Kaggle](https://www.kaggle.com/datasets/larysa21/retail-data-american-football-gear-sales)):


### 1. AF_online_sales_dataset.csv
1. tmstmp - Timestamp of the transaction
2. product_category - Broad category of the product
3. product_subcategory - Subcategory for finer product classification
4. product - Name of the sold product
5. brand_name - Brand of product
6. product_price - Price per unit of the product
7. quantity_sold - Number of units sold
8. total_amount - Total revenue generated from the transaction
9. total_costs - Total cost incurred for the transaction
10. payment_type - Method of payment used
11. shipping_method - Shipping service used for delivery
12. coupon_discount - Discount applied through a coupon, if any
13. customer_firstname - First name of the customer
14. customer_lastname - Last name of the customer
15. customer_gender - Gender of the customer
16. customer_age - Age of the customer
17. customer_shirtsize - Shirt size of the customer
18. customer_email - Email of the customer
19. customer_phone - Phone number of the customer
20. customer_address - Customer's address (primary)
21. address_details - Additional address details
22. customer_city - City of the customer
23. customer_state - State of the customer
24. store_website - Website where the transaction occurred
25. employee_firstname - First name of the employee handling the transaction
26. employee_lastname - Last name of the employee handling the transaction
27. employee_email - Email of the employee
28. employee_skill - Skillset of the employee
29. employee_education - Educational background of the employee

### 2. AF_offline_sales_dataset.csv
1. product_name - Name of sold product
2. brand - Brand of product
3. category - Broad category of the product
4. subcategory - Subcategory for finer product classification
5. supplier - Supplier or vendor name
6. date - Timestamp of the transaction
7. price - Price per unit of the product
8. quantity_sold - Number of units sold
9. amount_sold - Total revenue generated from the transaction
10. cost_amount - Total cost incurred for the transaction
11. payment_method - Method of payment used
12. customer_firstname - First name of the customer
13. customer_lastname - Last name of the customer
14. customer_gender - Gender of the customer
15. customer_email - Email of the customer
16. customer_phone - Phone number of the customer
17. store_type - Type of the store
18. store_street - Street address of the store
19. store_city - City where store is located
20. store_state - State where the store is located

### Conclusions:
1. The two files have a different number of columns (29 in the online file, 20 in the offline file)
2. Somewhat surprisingly, the online file contains employee data while the offline file does not
3. Several columns hold the same kind of data under different names (tmstmp and date, for example)
4. Although the two files do not share the same structure, they are close to one another, since both datasets represent sales (online sales and in-store sales)
5. After the datasets are combined, some values in the newly created columns will have to be filled in using some logic (covered later)

## Plan of the EDA:
1. Load each file (online and offline sales)
2. Understand the structure of both datasets
3. Unite datasets
4. Add a column to each dataset for online/offline sales (sales_channel)
5. Check unique values for both parts and draw conclusions on what should be done to prepare the df for normalization

### 1. Load the files and get the first impression of them

In [1]:
# all required imports
import pandas as pd
import os

# show all columns
pd.set_option('display.max_columns', None)

In [2]:
# Define the base path
base_path = os.path.join('..', 'raw_data')

In [3]:
# Read the files
df_online = pd.read_csv(os.path.join(base_path, 'AF_online_sales_dataset.csv'))
df_offline = pd.read_csv(os.path.join(base_path, 'AF_offline_sales_dataset.csv'))

In [4]:
df_online.head(2)

Unnamed: 0,tmstmp,product_category,product_subcategory,product,brand_name,product_price,quantity_sold,total_amount,total_costs,payment_type,shipping_method,coupon_discount,customer_firstname,customer_lastname,customer_gender,customer_age,customer_shirtsize,customer_email,customer_phone,customer_address,address_details,customer_city,customer_state,store_website,employee_firstname,employee_lastname,employee_email,employee_skill,employee_education
0,2023-02-02 14:41:12,Protection,Padding,All Star KP2500 Small Adult Knee Pad (Pairs),All Star,18.69,5,93.45,23.85,credit card,DHL,0,Maribelle,Crickmer,Female,34,2XL,mcrickmer0@cbsnews.com,7045187192,810 Elgar Terrace,PO Box 62435,Charlotte,NC,helmetheroshop.com,Orelee,Curmi,povill5o@admin.ch,Rubber,Kardan University
1,2024-01-09 00:58:17,Footwear,Socks,Multi-Sport Sock,Champro,10.63,1,7.97,3.85,PayPal,USPS,25,Katerina,Baird,Female,28,XS,kbaird1@vkontakte.ru,7343341391,022 Cardinal Point,PO Box 88829,Detroit,MI,pigskinprovisions.com,Zeb,Absalom,crolingson3b@berkeley.edu,Live Events,National Taras Shevchenko University of Kiev


In [5]:
df_offline.head(2)

Unnamed: 0,product_name,brand,category,subcategory,supplier,date,price,quantity_sold,amount_sold,cost_amount,payment_method,customer_firstname,customer_lastname,customer_gender,customer_email,customer_phone,store_type,store_street,store_city,store_state
0,Riddell Victor-I Inflation Air Bladder (R91229...,Riddell,Helmets,Helmet Components,Balistreri Inc.,2023-07-31 16:54:01,16.11,4,64.44,29.76,cash,Farleigh,Geach,Male,fgeach55@aol.com,296-345-9732,specialized,20 Fairfield Plaza,Portland,OR
1,McDavid Thigh Support,McDavid,Protective Gear,Thigh,NFL Properties LLC,2023-03-16 20:06:34,20.0,6,120.0,81.36,debit card,Stesha,Peiser,Female,speiserl@squarespace.com,562-102-4205,superstore,654 Pine View Place,New Orleans,LA


### 2. Understand the structure of both datasets

In [6]:
# columns in both dfs:
print(df_online.columns)
print(df_offline.columns)

Index(['tmstmp', 'product_category', 'product_subcategory', 'product',
       'brand_name', 'product_price', 'quantity_sold', 'total_amount',
       'total_costs', 'payment_type', 'shipping_method', 'coupon_discount',
       'customer_firstname', 'customer_lastname', 'customer_gender',
       'customer_age', 'customer_shirtsize', 'customer_email',
       'customer_phone', 'customer_address', 'address_details',
       'customer_city', 'customer_state', 'store_website',
       'employee_firstname', 'employee_lastname', 'employee_email',
       'employee_skill', 'employee_education'],
      dtype='object')
Index(['product_name', 'brand', 'category', 'subcategory', 'supplier', 'date',
       'price', 'quantity_sold', 'amount_sold', 'cost_amount',
       'payment_method', 'customer_firstname', 'customer_lastname',
       'customer_gender', 'customer_email', 'customer_phone', 'store_type',
       'store_street', 'store_city', 'store_state'],
      dtype='object')
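To make the comparison of the two column lists less manual, the overlap can be computed with plain sets. The following is a minimal sketch, not part of the original notebook: the short column lists are an illustrative subset, and `rename_map` is a hypothetical partial mapping that mirrors the renames applied in the next step.

```python
# Illustrative subset of the real column lists (assumption: shortened for readability).
online_cols = ['tmstmp', 'product', 'brand_name', 'shipping_method', 'customer_email']
offline_cols = ['date', 'product_name', 'brand', 'supplier', 'customer_email']

# Hypothetical partial mapping from offline names to their online equivalents.
rename_map = {'date': 'tmstmp', 'product_name': 'product', 'brand': 'brand_name'}

# Apply the mapping, then compare the two sets of names.
offline_renamed = {rename_map.get(col, col) for col in offline_cols}
shared = sorted(set(online_cols) & offline_renamed)
online_only = sorted(set(online_cols) - offline_renamed)
offline_only = sorted(offline_renamed - set(online_cols))

print("Shared columns:", shared)
print("Online only:", online_only)
print("Offline only:", offline_only)
```

With the real data, `online_cols` and `offline_cols` would be `df_online.columns` and `df_offline.columns`.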


### 3. Unite the datasets
### 4. Add a column to each dataset for online/offline sales (sales_channel)

In [7]:
# Standardize column names (rename offline columns that have the same business meaning as an online column)
df_offline.rename(columns={
    'product_name': 'product',
    'brand': 'brand_name',
    'category': 'product_category',
    'subcategory': 'product_subcategory',
    'date': 'tmstmp',
    'price': 'product_price',
    'amount_sold': 'total_amount',
    'cost_amount': 'total_costs',
    'payment_method': 'payment_type'
}, inplace=True)

# Add missing columns and fill with None/NaN where necessary
missing_columns = set(df_online.columns) - set(df_offline.columns)
for col in missing_columns:
    df_offline[col] = None  # Fill missing columns in offline with None

missing_columns = set(df_offline.columns) - set(df_online.columns)
for col in missing_columns:
    df_online[col] = None  # Fill missing columns in online with None

# Add a sales channel column
df_online['sales_channel'] = 'Online'
df_offline['sales_channel'] = 'Offline'

# Concatenate both DataFrames
df_combined = pd.concat([df_online, df_offline], ignore_index=True)
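As a side note, `pd.concat` aligns frames on column names and fills columns that are missing from one frame with NaN, so the explicit loops that add empty columns are optional. A minimal sketch on two toy frames (both are made up for illustration):

```python
import pandas as pd

# Toy frames with one shared column and one channel-specific column each.
online = pd.DataFrame({'product': ['Multi-Sport Sock'], 'shipping_method': ['USPS']})
offline = pd.DataFrame({'product': ['McDavid Thigh Support'], 'supplier': ['EP Sports']})

# assign() returns copies, so the original frames stay untouched.
combined = pd.concat(
    [online.assign(sales_channel='Online'), offline.assign(sales_channel='Offline')],
    ignore_index=True,
)

print(combined)
```

Columns that exist in only one frame (`shipping_method`, `supplier`) come out as NaN for the rows of the other frame.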

In [8]:
df_combined.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1034000 entries, 0 to 1033999
Data columns (total 35 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   tmstmp               1034000 non-null  object 
 1   product_category     1034000 non-null  object 
 2   product_subcategory  1034000 non-null  object 
 3   product              1034000 non-null  object 
 4   brand_name           1034000 non-null  object 
 5   product_price        1034000 non-null  float64
 6   quantity_sold        1034000 non-null  int64  
 7   total_amount         1034000 non-null  float64
 8   total_costs          1034000 non-null  float64
 9   payment_type         1034000 non-null  object 
 10  shipping_method      533000 non-null   object 
 11  coupon_discount      533000 non-null   object 
 12  customer_firstname   1034000 non-null  object 
 13  customer_lastname    1034000 non-null  object 
 14  customer_gender      1034000 non-null  object 
 15

#### Review what we've got

#### Observations
1. The combined dataset has 1,034,000 records: 533,000 came from the online sales dataset and 501,000 from the offline dataset
2. A number of columns have values from only one of the two datasets:

    - Columns containing data from the online sales dataset only:
        - shipping_method      
        - coupon_discount      
        - customer_age         
        - customer_shirtsize   
        - customer_address     
        - address_details      
        - customer_city        
        - customer_state       
        - store_website        
        - employee_firstname   
        - employee_lastname    
        - employee_email       
        - employee_skill     
        - employee_education 
    
    - Columns containing data from the offline sales dataset only:
        - store_street         
        - supplier             
        - store_city           
        - store_state          
        - store_type  
3. How we handle this is up to us. Some of these columns can be filled by applying some logic
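Instead of reading the one-sided columns off the `info()` output, they can be derived from the share of missing values per channel. A minimal sketch on a toy frame (the rows are illustrative only):

```python
import pandas as pd

# Toy stand-in for df_combined (assumption: a handful of made-up rows).
df = pd.DataFrame({
    'sales_channel': ['Online', 'Online', 'Offline', 'Offline'],
    'shipping_method': ['DHL', 'USPS', None, None],
    'supplier': [None, None, 'EP Sports', 'Forelle Inc.'],
    'product': ['a', 'b', 'c', 'd'],
})

# Share of missing values per column, computed separately for each channel.
null_share = (
    df.drop(columns='sales_channel')
      .isna()
      .groupby(df['sales_channel'])
      .mean()
)

# Columns that are completely empty for at least one channel.
fully_empty = (null_share == 1.0).any()
one_sided = fully_empty[fully_empty].index.tolist()

print(null_share)
print("Columns filled for one channel only:", one_sided)
```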

### 5. Check unique values for both parts and draw conclusions on what should be done to prepare the df for normalization

#### 1. tmstmp

In [9]:
# Convert tmstmp column to datetime
df_combined['tmstmp'] = pd.to_datetime(df_combined['tmstmp'], errors='coerce')

# Group by sales_channel and get min/max dates
date_summary = df_combined.groupby('sales_channel')['tmstmp'].agg(['min', 'max'])

date_summary

Unnamed: 0_level_0,min,max
sales_channel,Unnamed: 1_level_1,Unnamed: 2_level_1
Offline,2022-01-01 00:00:24,2024-01-16 23:57:11
Online,2022-01-01 00:08:53,2024-01-16 23:59:38
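Because `errors='coerce'` silently turns unparseable values into `NaT`, it can be worth counting how many values were lost in the conversion. A minimal sketch on a toy Series (the two timestamps come from the head output above; the other values are made up):

```python
import pandas as pd

# Raw values before conversion; one entry is deliberately not a date.
raw = pd.Series(['2023-02-02 14:41:12', 'not a date', None, '2024-01-09 00:58:17'])
parsed = pd.to_datetime(raw, errors='coerce')

# Values that were present in the raw data but could not be parsed.
failed = raw.notna() & parsed.isna()

print("Unparseable values:", int(failed.sum()))
print(raw[failed].tolist())
```

In the notebook, the same check would require keeping a copy of the original `tmstmp` column before it is overwritten.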


#### 2. product_category     

In [10]:
# Group by sales_channel and get product_categories
product_category_summary = df_combined.groupby(['sales_channel', 'product_category'])['product_category'].agg(['count'])

product_category_summary

Unnamed: 0_level_0,Unnamed: 1_level_0,count
sales_channel,product_category,Unnamed: 2_level_1
Offline,Accessories,21870
Offline,Clothing & Apparel,34763
Offline,Footwear,37513
Offline,Gloves,46986
Offline,Helmets,195460
Offline,Protection,81325
Offline,Protective Gear,13187
Offline,Shoulder Pads,69896
Online,Accessories,22855
Online,Clothing & Apparel,37073


In [11]:
# We are completely ok with product_category

#### 3. product_subcategory  

In [12]:
# Group by sales_channel and get product_subcategories
product_subcategory_summary = df_combined.groupby(['sales_channel', 'product_subcategory'])['product_category'].agg(['count'])

product_subcategory_summary

Unnamed: 0_level_0,Unnamed: 1_level_0,count
sales_channel,product_subcategory,Unnamed: 2_level_1
Offline,Accessories,6816
Offline,Adult,98407
Offline,Arm,7903
Offline,Chin Straps,13881
Offline,Compression,4334
Offline,Detachable Cleats,5293
Offline,Elbow,1724
Offline,Eye Black,4463
Offline,Face Masks,46208
Offline,Forearm,2646


In [13]:
# Get unique product subcategories per sales channel
subcategories_per_channel = df_combined.groupby('sales_channel')['product_subcategory'].unique()

# Convert to sets for easy comparison
online_subcategories = set(subcategories_per_channel.get('Online', []))
offline_subcategories = set(subcategories_per_channel.get('Offline', []))

# Find differences
only_in_online = online_subcategories - offline_subcategories
only_in_offline = offline_subcategories - online_subcategories

# Display results
print("Subcategories only in Online sales:", only_in_online)
print("Subcategories only in Offline sales:", only_in_offline)

Subcategories only in Online sales: set()
Subcategories only in Offline sales: set()
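The same "unique values per channel, then set difference" comparison is repeated for several columns below. It could be wrapped in a small helper. This is a sketch only: the function name `values_only_in_one_channel` is hypothetical and the toy frame is made up.

```python
import pandas as pd

def values_only_in_one_channel(df, column, channel_col='sales_channel'):
    """Return {channel: set of `column` values that occur in that channel only}."""
    # Collect the distinct non-null values per channel.
    per_channel = {
        channel: set(group[column].dropna())
        for channel, group in df.groupby(channel_col)
    }
    result = {}
    for channel, values in per_channel.items():
        # Union of the values seen in every other channel.
        others = set().union(*(v for c, v in per_channel.items() if c != channel))
        result[channel] = values - others
    return result

# Toy stand-in for df_combined.
toy = pd.DataFrame({
    'sales_channel': ['Online', 'Online', 'Offline', 'Offline'],
    'brand_name': ['Adams', 'Wilson', 'Wilson', 'Vettex'],
})

only = values_only_in_one_channel(toy, 'brand_name')
print(only)
```

On the real data, the call would look like `values_only_in_one_channel(df_combined, 'product_subcategory')`.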


In [14]:
# We are completely ok with subcategories column

#### 4. product              

In [15]:
# Get unique products per sales channel
products_per_channel = df_combined.groupby('sales_channel')['product'].unique()

# Convert to sets for easy comparison
online_products = set(products_per_channel.get('Online', []))
offline_products = set(products_per_channel.get('Offline', []))

# Find differences
only_in_online_products = online_products - offline_products
only_in_offline_products = offline_products - online_products

# Display results
print("Products only in Online sales:", only_in_online_products)
print("Products only in Offline sales:", only_in_offline_products)

Products only in Online sales: set()
Products only in Offline sales: set()


In [16]:
# We are ok with products column

#### 5. brand_name           

In [17]:
# Group by sales_channel and get brand counts
brand_summary = df_combined.groupby(['sales_channel', 'brand_name'])['brand_name'].agg(['count'])

brand_summary

Unnamed: 0_level_0,Unnamed: 1_level_0,count
sales_channel,brand_name,Unnamed: 2_level_1
Offline,Adams,22797
Offline,Adidas,12341
Offline,All Star,11339
Offline,Champro,3434
Offline,Crep Protect,6816
...,...,...
Online,Under Armour,23791
Online,Vettex,1866
Online,Warrior Shield,1858
Online,Wilson,928


In [18]:
# Get unique brands per sales channel
brands_per_channel = df_combined.groupby('sales_channel')['brand_name'].unique()

# Convert to sets for easy comparison
online_brands = set(brands_per_channel.get('Online', []))
offline_brands = set(brands_per_channel.get('Offline', []))

# Find differences
only_in_online_brands = online_brands - offline_brands
only_in_offline_brands = offline_brands - online_brands

# Display results
print("Brands only in Online sales:", only_in_online_brands)
print("Brands only in Offline sales:", only_in_offline_brands)

Brands only in Online sales: set()
Brands only in Offline sales: set()


In [19]:
# We are ok with the brand_name column. Neither channel has values that the other one doesn't have

#### 6. product_price        

In [20]:
# Group by sales_channel and get min/max/average/median product_price
product_price_summary = df_combined.groupby('sales_channel')['product_price'].agg(['min', 'max', 'mean', 'median'])

product_price_summary

Unnamed: 0_level_0,min,max,mean,median
sales_channel,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Offline,1.99,800.96,76.995706,42.04
Online,1.99,800.95,77.47048,42.08


In [21]:
# Ok

#### 7. quantity_sold        

In [22]:
# Group by sales_channel and get min/max/average/median quantity_sold
quantity_sold_summary = df_combined.groupby('sales_channel')['quantity_sold'].agg(['min', 'max', 'mean', 'median'])

quantity_sold_summary

Unnamed: 0_level_0,min,max,mean,median
sales_channel,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Offline,1,10,5.502864,6.0
Online,1,10,5.504255,6.0


In [23]:
# Ok

#### 8. total_amount         

In [24]:
# Group by sales_channel and get min/max/average/median total_amount
total_amount_summary = df_combined.groupby('sales_channel')['total_amount'].agg(['min', 'max', 'mean', 'median'])

total_amount_summary

Unnamed: 0_level_0,min,max,mean,median
sales_channel,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Offline,2.05,8001.6,424.335285,187.15
Online,1.64,8007.4,404.625096,177.63
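The online mean is lower than the offline one even though prices and quantities are almost identical in both channels. One possible explanation, assumed here rather than confirmed by the dataset description, is that `coupon_discount` is a percentage taken off `product_price * quantity_sold`. The two online rows shown by `df_online.head(2)` are consistent with that reading, and the sketch below checks it on exactly those two rows:

```python
import pandas as pd

# The two online rows from the head() output above.
rows = pd.DataFrame({
    'product_price': [18.69, 10.63],
    'quantity_sold': [5, 1],
    'coupon_discount': [0, 25],   # assumption: percent off the line total
    'total_amount': [93.45, 7.97],
})

# Expected total under the percentage-discount assumption.
expected = (
    rows['product_price'] * rows['quantity_sold'] * (1 - rows['coupon_discount'] / 100)
).round(2)

# Flag rows where the recorded total deviates by more than half a cent.
mismatch = (expected - rows['total_amount']).abs() > 0.005

print(expected.tolist())
print("Rows that do not match:", int(mismatch.sum()))
```

Running the same check on the full online part would show whether the assumption holds beyond these two rows.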


#### 9. total_costs

In [25]:
# Group by sales_channel and get min/max/average/median total_costs
total_costs_summary = df_combined.groupby('sales_channel')['total_costs'].agg(['min', 'max', 'mean', 'median'])

total_costs_summary

Unnamed: 0_level_0,min,max,mean,median
sales_channel,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Offline,0.8,3203.1,169.547726,74.88
Online,0.82,3203.3,170.137032,75.12


#### 10. payment_type

In [26]:
# Group by sales_channel and get payment_type summary
payment_type_summary = df_combined.groupby(['sales_channel', 'payment_type'])['payment_type'].agg(['count'])

payment_type_summary

Unnamed: 0_level_0,Unnamed: 1_level_0,count
sales_channel,payment_type,Unnamed: 2_level_1
Offline,bank transfer,71297
Offline,cash,107336
Offline,credit card,107305
Offline,debit card,107697
Offline,gift card,35698
Offline,mobile payment,71667
Online,PayPal,96469
Online,bank wire,48743
Online,cash,48456
Online,credit card,96867


In [27]:
# Ok, both the online and the offline data contain a variety of payment types, which is logical; we can move on

#### 11. shipping_method      

In [28]:
# Group by sales_channel and get shipping_method summary
shipping_method_summary = df_combined.groupby(['sales_channel', 'shipping_method'])['shipping_method'].agg(['count'])

shipping_method_summary

Unnamed: 0_level_0,Unnamed: 1_level_0,count
sales_channel,shipping_method,Unnamed: 2_level_1
Online,DHL,133087
Online,DSV,133542
Online,USPS,133488
Online,dpd,132883


In [29]:
# This is logical as well: only the online data had this column, so there are no values for the offline data. We will add
# a value such as 'instore_pickup' for the offline part of the data
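Filling the offline rows could look roughly like this. It is a sketch on a toy frame; the label `'instore_pickup'` is just the placeholder suggested in the comment, not a value from the dataset.

```python
import pandas as pd

# Toy stand-in for df_combined.
df = pd.DataFrame({
    'sales_channel': ['Online', 'Offline', 'Offline'],
    'shipping_method': ['DHL', None, None],
})

# Only touch offline rows that have no shipping method yet.
needs_fill = (df['sales_channel'] == 'Offline') & df['shipping_method'].isna()
df.loc[needs_fill, 'shipping_method'] = 'instore_pickup'

print(df)
```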

#### 12. coupon_discount      

In [30]:
# Group by sales_channel and get min/max/average/median coupon_discount
coupon_discount_summary = df_combined.groupby('sales_channel')['coupon_discount'].agg(['min', 'max', 'mean', 'median'])

coupon_discount_summary


Unnamed: 0_level_0,min,max,mean,median
sales_channel,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Offline,,,,
Online,0.0,25.0,4.9806,0.0


In [31]:
# The offline data had no 'coupon_discount' column; we will fill the offline part of the data with 0
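The `info()` output lists `coupon_discount` with dtype `object`, so converting the column to a numeric dtype before filling is probably a good idea. A minimal sketch on a toy frame (the three rows are made up):

```python
import pandas as pd

# Toy stand-in for df_combined; object dtype mimics the column after the concat.
df = pd.DataFrame({
    'sales_channel': ['Online', 'Online', 'Offline'],
    'coupon_discount': pd.Series([0, 25, None], dtype=object),
})

# Convert to a numeric dtype first, then fill only the offline rows with 0.
df['coupon_discount'] = pd.to_numeric(df['coupon_discount'], errors='coerce')
is_offline = df['sales_channel'] == 'Offline'
df.loc[is_offline, 'coupon_discount'] = df.loc[is_offline, 'coupon_discount'].fillna(0)

print(df)
print(df['coupon_discount'].dtype)
```

Restricting the fill to offline rows keeps any genuinely missing online values visible instead of turning them into zeros.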

#### 13. customer_firstname

In [32]:
# Get unique customer_firstname per sales channel
first_names_per_channel = df_combined.groupby('sales_channel')['customer_firstname'].unique()

# Convert to sets for easy comparison
online_firstname = set(first_names_per_channel.get('Online', []))
offline_firstname = set(first_names_per_channel.get('Offline', []))

# Find differences
only_in_online_firstnames = online_firstname - offline_firstname
only_in_offline_firstnames = offline_firstname - online_firstname

# Display results
print("First names only in Online sales:", only_in_online_firstnames)
print("First names only in Offline sales:", only_in_offline_firstnames)

First names only in Online sales: set()
First names only in Offline sales: set()


In [33]:
# ok, nothing surprising, there aren't any first names in one part which don't exist in another part of the data

#### 14. customer_lastname

In [34]:
# Get unique customer_lastname per sales channel
last_names_per_channel = df_combined.groupby('sales_channel')['customer_lastname'].unique()

# Convert to sets for easy comparison
online_lastname = set(last_names_per_channel.get('Online', []))
offline_lastname = set(last_names_per_channel.get('Offline', []))

# Find differences
only_in_online_lastnames = online_lastname - offline_lastname
only_in_offline_lastnames = offline_lastname - online_lastname

# Display results
print("Last names only in Online sales:", only_in_online_lastnames)
print("Last names only in Offline sales:", only_in_offline_lastnames)

Last names only in Online sales: {'Romero', 'Studd', 'Souness', 'Woodland', 'Tomes', 'Alyoshin', 'Vigers', 'Brazer', 'Mitkcov', 'Eilhart', 'Wawer', 'Arsey', 'McHaffy', 'Gerhartz', 'Keri', 'Haswall', 'Lauthian', 'Glassard', 'Bather', 'Bill', 'Widocks', 'Alwell', 'Parmeter', 'Brocking', 'Royson', 'Purkiss', 'Klamman', 'Branche', 'Glazer', 'Caustic', 'Goundsy', 'Battson'}
Last names only in Offline sales: {'Barwell', 'Trobey', 'Hodge'}


In [35]:
# ok, some last names appear in one channel only, so different customers bought online and offline, which is logical too

#### 15. customer_gender

In [36]:
# Group by sales_channel and get customer_gender summary
gender_summary = df_combined.groupby(['sales_channel', 'customer_gender'])['customer_gender'].agg(['count'])

gender_summary

Unnamed: 0_level_0,Unnamed: 1_level_0,count
sales_channel,customer_gender,Unnamed: 2_level_1
Offline,Agender,8759
Offline,Bigender,8476
Offline,Female,226773
Offline,Genderfluid,8107
Offline,Genderqueer,8498
Offline,Male,223997
Offline,Non-binary,8224
Offline,Polygender,8166
Online,Agender,8819
Online,Bigender,8879


#### 16. customer_age         

In [37]:
# Group by sales_channel and get min/max/average/median customer_age
customer_age_summary = df_combined.groupby('sales_channel')['customer_age'].agg(['min', 'max', 'mean', 'median'])

customer_age_summary


Unnamed: 0_level_0,min,max,mean,median
sales_channel,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Offline,,,,
Online,16.0,42.0,29.009642,29.0


In [38]:
# The offline data had no 'customer_age' column; unfortunately, we will probably not fill it with any values

#### 17. customer_shirtsize   

In [39]:
# Group by sales_channel and get customer_shirtsize summary
shirtsize_summary = df_combined.groupby(['sales_channel', 'customer_shirtsize'])['customer_shirtsize'].agg(['count'])

shirtsize_summary

Unnamed: 0_level_0,Unnamed: 1_level_0,count
sales_channel,customer_shirtsize,Unnamed: 2_level_1
Online,2XL,75688
Online,3XL,76521
Online,L,75932
Online,M,76121
Online,S,75863
Online,XL,76333
Online,XS,76542


In [40]:
# The offline data had no 'customer_shirtsize' column; unfortunately, we will probably not fill it with any
# values

#### 18. customer_email       

In [41]:
# Get unique customer_email per sales channel
email_per_channel = df_combined.groupby('sales_channel')['customer_email'].unique()

# Convert to sets for easy comparison
online_email = set(email_per_channel.get('Online', []))
offline_email = set(email_per_channel.get('Offline', []))

# Find differences
only_in_online_email = online_email - offline_email
only_in_offline_email = offline_email - online_email

# Display results
print("Email only in Online sales:", len(only_in_online_email))
print("Email only in Offline sales:", len(only_in_offline_email))

Email only in Online sales: 438000
Email only in Offline sales: 365000


In [42]:
# Ok, it is also pretty logical that in most cases we have different customers in online and offline data
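The set differences above count emails that occur in one channel only. The complementary question, how many customers bought through both channels, can be answered by counting channels per email. A minimal sketch on a toy frame (the addresses are made up):

```python
import pandas as pd

# Toy stand-in for df_combined.
df = pd.DataFrame({
    'sales_channel': ['Online', 'Online', 'Offline', 'Offline'],
    'customer_email': ['a@example.com', 'b@example.com', 'b@example.com', 'c@example.com'],
})

# Number of distinct channels each email appears in.
channels_per_email = df.groupby('customer_email')['sales_channel'].nunique()
in_both = channels_per_email[channels_per_email > 1].index.tolist()

print("Emails present in both channels:", in_both)
```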

#### 19. customer_phone (redundant to check, but let's take a look at the formats at least)

In [43]:
# Get sample phone numbers for each sales channel
online_samples = df_combined[df_combined['sales_channel'] == 'Online']['customer_phone'].sample(5, random_state=42)
offline_samples = df_combined[df_combined['sales_channel'] == 'Offline']['customer_phone'].sample(5, random_state=42)

# Print results
print("Online Sales Channel - Sample Phone Numbers:")
print(online_samples.to_list())

print("\nOffline Sales Channel - Sample Phone Numbers:")
print(offline_samples.to_list())

Online Sales Channel - Sample Phone Numbers:
[4046162577, 8151163527, 7574642820, 5174233916, 7031178726]

Offline Sales Channel - Sample Phone Numbers:
['602-343-3062', '549-913-3670', '579-641-1913', '652-380-5468', '450-298-1317']


In [44]:
# they look good, but it would be better to transform the phone numbers into the same format in both parts
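In the samples, the online numbers are plain integers while the offline ones are dash-separated strings. One way to unify them is sketched below. The helper name `normalize_phone` is hypothetical and the target format `NNN-NNN-NNNN` is an assumption. The sketch also assumes 10-digit numbers, as in the samples above.

```python
import re

import pandas as pd

def normalize_phone(value):
    """Return a 10-digit phone number as 'NNN-NNN-NNNN', or None if it does not fit."""
    if pd.isna(value):
        return None
    # Keep digits only, regardless of whether the input is an int or a string.
    digits = re.sub(r'\D', '', str(value))
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

# One online-style and one offline-style value from the samples above.
phones = pd.Series([4046162577, '602-343-3062'], dtype=object)
normalized = phones.map(normalize_phone)

print(normalized.tolist())
```

If the phone column was read as floats, the values would need converting to integers first, since `str(4046162577.0)` keeps the trailing `.0`.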

#### 20. customer_address

In [45]:
# this column exists in the online part of the data only, let's take a look at it

In [46]:
online_samples_customer_address =\
    df_combined[df_combined['sales_channel'] == 'Online']['customer_address'].sample(5, random_state=42)

online_samples_customer_address

511050      85 Barnett Circle
360840         59 Judy Center
320052        44 Moland Point
279735         6 Elka Parkway
516430    970 Cherokee Street
Name: customer_address, dtype: object

In [47]:
# ok, nothing to do here

#### 21. address_details

In [48]:
# this column exists in the online part of the data only, let's take a look at it

In [49]:
online_samples_address_details =\
    df_combined[df_combined['sales_channel'] == 'Online']['address_details'].sample(5, random_state=42)

online_samples_address_details.tolist()

['Room 1052', 'Room 1996', 'Apt 1513', 'Apt 14', 'Room 1393']

#### 22. customer_city        

In [50]:
# this column exists in the online part of the data only, let's take a look at it

In [51]:
online_samples_customer_city =\
    df_combined[df_combined['sales_channel'] == 'Online']['customer_city'].sample(5, random_state=42)

online_samples_customer_city.tolist()

['Atlanta', 'Rockford', 'Virginia Beach', 'Lansing', 'Washington']

In [52]:
# To fill this column for the offline part of the data, we will take values from the store_city column, assuming customers
# live in the cities where they physically purchase things

#### 23. customer_state

In [53]:
# this column exists in the online part of the data only, let's take a look at it

In [54]:
online_samples_customer_state =\
    df_combined[df_combined['sales_channel'] == 'Online']['customer_state'].sample(5, random_state=42)

online_samples_customer_state.tolist()

['GA', 'IL', 'VA', 'MI', 'DC']

In [55]:
# To fill this column for the offline part of the data, we will take values from the store_state column, assuming customers
# live in the states where they physically purchase things
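The planned fill for `customer_city` and `customer_state` could be sketched as follows, on a toy frame with made-up rows. It relies on the assumption stated in the comments that offline customers live where the store is located.

```python
import pandas as pd

# Toy stand-in for df_combined.
df = pd.DataFrame({
    'sales_channel': ['Online', 'Offline'],
    'customer_city': ['Atlanta', None],
    'customer_state': ['GA', None],
    'store_city': [None, 'Portland'],
    'store_state': [None, 'OR'],
})

is_offline = df['sales_channel'] == 'Offline'
for customer_col, store_col in [('customer_city', 'store_city'),
                                ('customer_state', 'store_state')]:
    # Keep the existing value for online rows, take the store value for offline rows.
    df[customer_col] = df[customer_col].where(~is_offline, df[store_col])

print(df[['sales_channel', 'customer_city', 'customer_state']])
```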

#### 24. store_website

In [56]:
# We know this column exists only in the online part of the data, and we can't fill the offline part with any values;
# we'll just take a look at a few examples

In [57]:
online_samples_store_website =\
    df_combined[df_combined['sales_channel'] == 'Online']['store_website'].sample(5, random_state=42)

online_samples_store_website.tolist()

['snaproutesports.com',
 'endzoneexpress.com',
 'tackle treasures.com',
 'spikestrategies.com',
 'helmethavenusa.com']

#### 25. employee_firstname

In [58]:
# We know this column exists only in the online part of the data, and we can't fill the offline part with any values;
# we'll just take a look at a few examples

In [59]:
online_samples_employee_firstname =\
    df_combined[df_combined['sales_channel'] == 'Online']['employee_firstname'].sample(5, random_state=42)

online_samples_employee_firstname.tolist()

['Ryann', 'Dare', 'Innis', 'Dana', 'Johnna']

#### 26. employee_lastname

In [60]:
# We know this column exists only in the online part of the data, and we can't fill the offline part with any values;
# we'll just take a look at a few examples

In [61]:
online_samples_employee_lastname =\
    df_combined[df_combined['sales_channel'] == 'Online']['employee_lastname'].sample(5, random_state=42)

online_samples_employee_lastname.tolist()

['Spendlove', 'Wrightham', 'Whawell', 'Lacasa', 'Litel']

In [62]:
# weird names but ok

#### 27. employee_email

In [63]:
# We know this column exists only in the online part of the data, and we can't fill the offline part with any values;
# we'll just take a look at a few examples

In [64]:
online_samples_employee_email =\
    df_combined[df_combined['sales_channel'] == 'Online']['employee_email'].sample(5, random_state=42)

online_samples_employee_email.tolist()

['phainsworth4f@economist.com',
 'mgoodley4x@scientificamerican.com',
 'kspeck6e@imgur.com',
 'nghioni6d@salon.com',
 'atowsie5x@samsung.com']

In [65]:
# ok, invented emails but this is our data

#### 28. employee_skill

In [66]:
# We know this column exists only in the online part of the data, and we can't fill the offline part with any values;
# we'll just take a look at a few examples

In [67]:
online_samples_employee_skill =\
    df_combined[df_combined['sales_channel'] == 'Online']['employee_skill'].sample(20, random_state=42)

online_samples_employee_skill.tolist()

['Scripting',
 'HGV',
 'Urban Infill',
 'Microsoft Exchange',
 'NFL',
 'Hyperion',
 'Group Work',
 'EEOC',
 'Ducting',
 'LTL',
 'Mechanical Engineering',
 'Whiplash',
 'DS3',
 'DS4000',
 'DMX',
 'DS4000',
 'NSA',
 'GSE',
 'DBMS',
 'JCR']

In [68]:
# looks like it is something random, but ok, never mind

#### 29. employee_education

In [69]:
# We know this column exists only in the online part of the data, and we can't fill the offline part with any values;
# we'll just take a look at a few examples

In [70]:
online_samples_employee_education =\
    df_combined[df_combined['sales_channel'] == 'Online']['employee_education'].sample(20, random_state=42)

online_samples_employee_education.tolist()

['University of the Southern Caribbean',
 'Windsor University School of Medicine',
 'Zetech College',
 'CollegeAmerica, Denver',
 'Institut National de la Recherche Scientifique, Université du Québec',
 'Lubbock Christian University',
 'Dar al Hekma College',
 'Universitas Pembangunan Nasional "Veteran" East Java',
 'North University of Baia Mare',
 'Charisma University',
 'Hogeschool voor Wetenschap en Kunst (VLEKHO), Brussel',
 'National Chung Cheng University',
 'Technological University (Magway)',
 'China USA Business University',
 'Norwegian School of Management BI',
 'China USA Business University',
 'University "Petre Andrei" Iasi',
 'Harris-Stowe State University',
 'University of Wisconsin - River Falls',
 'Campbell University']

In [71]:
# ok, it also looks like it is quite a random list of places to study...

#### 30. store_state

In [72]:
# We know this column exists only in the offline part of the data, and we can't fill the online part with any values;
# we'll just take a look at a few examples and use this column to fill the missing 'customer_state' values
# for the offline part of the data

In [73]:
offline_samples_store_state =\
    df_combined[df_combined['sales_channel'] == 'Offline']['store_state'].sample(20, random_state=42)

offline_samples_store_state.tolist()

['AL',
 'TX',
 'FL',
 'NV',
 'UT',
 'MT',
 'OH',
 'MI',
 'VA',
 'NY',
 'IN',
 'CA',
 'VA',
 'LA',
 'OH',
 'OH',
 'CA',
 'IL',
 'NV',
 'VA']

#### 31. store_city

In [74]:
# We know this column exists only in the offline part of the data, and we can't fill the online part with any values;
# we'll just take a look at a few examples and use this column to fill the missing 'customer_city' values
# for the offline part of the data

In [75]:
offline_samples_store_city =\
    df_combined[df_combined['sales_channel'] == 'Offline']['store_city'].sample(20, random_state=42)

offline_samples_store_city.tolist()

['Mobile',
 'Dallas',
 'Orlando',
 'Reno',
 'Salt Lake City',
 'Helena',
 'Cincinnati',
 'Lansing',
 'Manassas',
 'Rochester',
 'Anderson',
 'San Bernardino',
 'Manassas',
 'Shreveport',
 'Cleveland',
 'Cincinnati',
 'Redwood City',
 'Peoria',
 'Reno',
 'Manassas']

#### 32. supplier

In [76]:
# this is an interesting column. It exists only in the offline part of the data, not in the online one.
# We should check whether certain suppliers serve certain brands, or whether there is any other dependency or insight

##### 32.1 Checking suppliers by brand

In [77]:
# Group by sales_channel and get supplier summary
supplier_summary = df_combined.groupby(['sales_channel', 'supplier'])['supplier'].agg(['count'])

supplier_summary

Unnamed: 0_level_0,Unnamed: 1_level_0,count
sales_channel,supplier,Unnamed: 2_level_1
Offline,ANTA Sports Products Ltd.,14253
Offline,BRG Sports Inc.,28695
Offline,Balistreri Inc.,14368
Offline,Certor Sports LLC,14130
Offline,EP Sports,14399
Offline,EZ GARD Industries Inc.,14317
Offline,Fanatics Retail Group North LLC,14439
Offline,Forelle Inc.,14253
Offline,Franklin Sports,14246
Offline,Grady and Sons,14181
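Only `Offline` rows appear in the summary because `groupby` drops groups whose key is missing by default, and online rows have no supplier. A minimal sketch of how the online rows could be kept visible, using `dropna=False` (available in pandas 1.1 and later) and `size()` instead of `count`. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_combined: online rows have no supplier.
toy = pd.DataFrame({
    "sales_channel": ["Online", "Online", "Offline", "Offline", "Offline"],
    "supplier": [None, None, "EP Sports", "EP Sports", "Franklin Sports"],
})

# size() counts rows per group; dropna=False keeps the missing-supplier group.
supplier_sizes = (
    toy.groupby(["sales_channel", "supplier"], dropna=False)
       .size()
       .rename("rows")
)
print(supplier_sizes)
```

`count` would report 0 for the missing-supplier group because it counts non-null values only, which is why `size()` is used here.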


In [78]:
# So, we checked the 'brand_name' column and saw that every brand in the online part of the data also appears in the offline part
# and vice versa. Now we need to check whether two or more suppliers supply the same brand of products.

In [79]:
# Group by sales_channel, supplier and brand_name to get the row count per supplier-brand pair
supplier_brand_summary = df_combined.groupby(['sales_channel', 'supplier', 'brand_name'])['supplier'].agg(['count'])

supplier_brand_summary

sales_channel,supplier,brand_name,count
Offline,ANTA Sports Products Ltd.,Adams,634
Offline,ANTA Sports Products Ltd.,Adidas,327
Offline,ANTA Sports Products Ltd.,All Star,310
Offline,ANTA Sports Products Ltd.,Champro,101
Offline,ANTA Sports Products Ltd.,Crep Protect,195
Offline,...,...,...
Offline,XTECH Protective Equipment LLC,Under Armour,615
Offline,XTECH Protective Equipment LLC,Vettex,52
Offline,XTECH Protective Equipment LLC,Warrior Shield,60
Offline,XTECH Protective Equipment LLC,Wilson,29


In [80]:
# Count the number of unique suppliers for each brand
brand_supplier_counts = df_combined.groupby('brand_name')['supplier'].nunique()

# Find brands that have multiple suppliers
brands_with_multiple_suppliers = brand_supplier_counts[brand_supplier_counts > 1]

# Find brands with unique suppliers only
brands_with_unique_suppliers = brand_supplier_counts[brand_supplier_counts == 1]

# Display results
print(f"Number of brands with multiple suppliers: {len(brands_with_multiple_suppliers)}")
print(f"Number of brands with unique suppliers: {len(brands_with_unique_suppliers)}")

# Display some brands with multiple suppliers
print("\nSample brands with multiple suppliers:")
print(brands_with_multiple_suppliers.head(10))  # Display first 10 for readability


Number of brands with multiple suppliers: 31
Number of brands with unique suppliers: 0

Sample brands with multiple suppliers:
brand_name
Adams           34
Adidas          34
All Star        34
Champro         34
Crep Protect    34
Cutters         34
Douglas         34
Evoshield       34
Forelle         34
Franklin        34
Name: supplier, dtype: int64


In [81]:
# Conclusion: suppliers are not unique per brand. All 31 brands have more than one supplier, so the same brand is supplied by many different suppliers.
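The printed sample shows 34 suppliers for each of the first ten brands, which suggests that every brand may be linked to every supplier. A small sketch of how that could be checked for all brands at once, on a made-up frame `toy` (the supplier labels `S1` and `S2` are placeholders, not real suppliers):

```python
import pandas as pd

# Toy stand-in for the offline part of df_combined.
toy = pd.DataFrame({
    "brand_name": ["Adams", "Adams", "Adidas", "Adidas", "Wilson"],
    "supplier":   ["S1", "S2", "S1", "S2", "S1"],
})

n_suppliers = toy["supplier"].nunique()
suppliers_per_brand = toy.groupby("brand_name")["supplier"].nunique()

# Brands that are linked to every supplier present in the data.
fully_covered = suppliers_per_brand[suppliers_per_brand == n_suppliers]
print(f"{len(fully_covered)} of {len(suppliers_per_brand)} brands "
      f"are linked to all {n_suppliers} suppliers")
```

If every brand turned out to be fully covered on the real data, the supplier column would carry no brand-level information.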

##### 32.2 Checking the uniqueness of suppliers per product.

In [82]:
# We want to understand if each supplier supplies specific products

In [83]:
# Count the number of unique suppliers for each product
product_supplier_counts = df_combined.groupby('product')['supplier'].nunique()

# Find products that have multiple suppliers
products_with_multiple_suppliers = product_supplier_counts[product_supplier_counts > 1]

# Find products with unique suppliers only
products_with_unique_suppliers = product_supplier_counts[product_supplier_counts == 1]

# Display results
print(f"Number of products with multiple suppliers: {len(products_with_multiple_suppliers)}")
print(f"Number of products with unique suppliers: {len(products_with_unique_suppliers)}")

# Display some products with multiple suppliers
print("\nSample products with multiple suppliers:")
print(products_with_multiple_suppliers.head(10))  # Display first 10 for readability


Number of products with multiple suppliers: 573
Number of products with unique suppliers: 0

Sample products with multiple suppliers:
product
3DX Jaw Guard Upgrade Set Lower jaw overlays         34
3DX Jaw Guard Xenith                                 34
Adams Chinstrap 4-point Lo (S) Gel25                 34
Adams FBFM-NOPO                                      34
Adams Hip Pad 3PC Set 1/2" Vinyl High Rise (1206)    34
Adams Hip Pad 3PC Slotted High Rise (1303)           34
Adams Hip Pad Set 3PC Intermediate Snap-In (1326)    34
Adams Spine Pad (1205)                               34
Adams Spine Pad (1303)                               34
Adams T-ANJOP-D                                      34
Name: supplier, dtype: int64


In [84]:
# So no product has a single dedicated supplier: all 573 products have more than one supplier. That is unexpected, but it is what the data shows.

### Conclusion:

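    - Unfortunately, this column won't be very useful: every brand and every product is linked to multiple suppliers (34 in each sampled case), so the supplier appears to have been assigned independently of brand and product.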
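One way to probe whether the supplier really is unrelated to the brand is to compare each brand's supplier shares with the overall supplier shares. The sketch below does this on a made-up frame `toy` with placeholder suppliers `S1` and `S2`. It is only a rough descriptive check, not a statistical test, and what counts as a "small" gap is left open.

```python
import pandas as pd

# Toy stand-in for the offline part of df_combined.
toy = pd.DataFrame({
    "brand_name": ["Adams"] * 4 + ["Adidas"] * 4,
    "supplier":   ["S1", "S1", "S2", "S2", "S1", "S2", "S2", "S2"],
})

# Share of each supplier within each brand (every row sums to 1).
shares = pd.crosstab(toy["brand_name"], toy["supplier"], normalize="index")

# Overall supplier share, ignoring brand.
overall = toy["supplier"].value_counts(normalize=True)

# Largest absolute gap between a brand's supplier share and the overall share.
max_gap = shares.sub(overall, axis="columns").abs().max().max()
print(shares)
print(f"largest deviation from the overall share: {max_gap:.3f}")
```

A gap close to zero for every brand would be consistent with suppliers being assigned at random.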

#### 33. store_street

In [85]:
# We know this column exists only in the offline part of the data, so we can't fill the online part with any values

In [86]:
offline_samples_store_street =\
    df_combined[df_combined['sales_channel'] == 'Offline']['store_street'].sample(20, random_state=42)

offline_samples_store_street.tolist()

['674 Westerfield Street',
 '146 Bobwhite Street',
 '9 Twin Pines Junction',
 '52 Grover Pass',
 '31338 Center Alley',
 '14315 Melby Park',
 '28 Basil Drive',
 '05227 Grasskamp Park',
 '4 Dawn Parkway',
 '88 Harbort Alley',
 '235 Cody Pass',
 '10 Hansons Hill',
 '4 Dawn Parkway',
 '47631 Lukken Circle',
 '059 Arizona Place',
 '28 Basil Drive',
 '5 Tennyson Way',
 '9559 Evergreen Alley',
 '52 Grover Pass',
 '4 Dawn Parkway']

In [87]:
# So we have 100 unique store street addresses, i.e. 100 stores if each store has its own street address
df_combined[df_combined['sales_channel'] == 'Offline']['store_street'].nunique()

100
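Counting unique `store_street` values equals the number of stores only if no two stores share a street address. A hedged sketch of a cross-check that counts unique (street, city, state) combinations instead. The frame `toy` is made up, and its last offline row deliberately reuses a street in a different city to show the difference.

```python
import pandas as pd

# Toy stand-in for df_combined; the 'Springfield, IL' row is invented
# to illustrate two stores sharing a street address.
toy = pd.DataFrame({
    "sales_channel": ["Offline"] * 4 + ["Online"],
    "store_street":  ["4 Dawn Parkway", "4 Dawn Parkway", "52 Grover Pass",
                      "4 Dawn Parkway", None],
    "store_city":    ["Manassas", "Manassas", "Reno", "Springfield", None],
    "store_state":   ["VA", "VA", "NV", "IL", None],
})

store_cols = ["store_street", "store_city", "store_state"]
offline = toy[toy["sales_channel"] == "Offline"]

n_streets = offline["store_street"].nunique()
n_stores = len(offline[store_cols].drop_duplicates())
print(f"unique streets: {n_streets}, unique (street, city, state): {n_stores}")
```

If both numbers agree on the real data, the street address alone is enough to identify a store.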

#### 34. store_type

In [88]:
# We know this column exists only in the offline part of the data; we can fill the online rows with a placeholder such as 'online'

In [89]:
# Group by sales_channel and get store_type summary
store_type_summary = df_combined.groupby(['sales_channel', 'store_type'])['store_type'].agg(['count'])

store_type_summary

sales_channel,store_type,count
Offline,franchise,83593
Offline,megastore,129588
Offline,specialized,63662
Offline,superstore,224157
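The idea of labelling online rows with a placeholder store type could be sketched as follows. The frame `toy` is made up, and the label `"online"` is an arbitrary choice for illustration, not a value that exists in the dataset.

```python
import pandas as pd

# Toy stand-in for df_combined: online rows have no store_type.
toy = pd.DataFrame({
    "sales_channel": ["Online", "Offline", "Offline", "Online"],
    "store_type":    [None, "megastore", "franchise", None],
})

online = toy["sales_channel"] == "Online"
# "online" is a placeholder label chosen here, not a value from the dataset.
toy.loc[online, "store_type"] = toy.loc[online, "store_type"].fillna("online")

print(toy["store_type"].value_counts())
```

Filling only the online rows means a genuinely missing `store_type` in an offline row would still show up as missing.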


#### 35. sales_channel

In [90]:
# Group by sales_channel and get sales_channel count
sales_channel_summary = df_combined.groupby('sales_channel')['sales_channel'].agg(['count'])

sales_channel_summary

sales_channel,count
Offline,501000
Online,533000


In [91]:
# Count the number of unique brands for each product
product_brand_counts = df_combined.groupby('product')['brand_name'].nunique()

# Find products that have multiple brands
products_with_multiple_brands = product_brand_counts[product_brand_counts > 1]

# Find products with unique brand only
products_with_unique_brand = product_brand_counts[product_brand_counts == 1]

# Display results
print(f"Number of products with multiple brands: {len(products_with_multiple_brands)}")
print(f"Number of products with a unique brand: {len(products_with_unique_brand)}")

# Display some products with multiple brands
print("\nSample products with multiple brands:")
print(products_with_multiple_brands.head(10))


Number of products with multiple brands: 1
Number of products with a unique brand: 572

Sample products with multiple brands:
product
Eye Black Stickers    2
Name: brand_name, dtype: int64


In [92]:
df_combined[df_combined['product']=='Eye Black Stickers']

Unnamed: 0,tmstmp,product_category,product_subcategory,product,brand_name,product_price,quantity_sold,total_amount,total_costs,payment_type,shipping_method,coupon_discount,customer_firstname,customer_lastname,customer_gender,customer_age,customer_shirtsize,customer_email,customer_phone,customer_address,address_details,customer_city,customer_state,store_website,employee_firstname,employee_lastname,employee_email,employee_skill,employee_education,store_street,store_state,store_type,store_city,supplier,sales_channel
390,2022-05-31 12:58:50,Accessories,Eye Black,Eye Black Stickers,Franklin,10.36,6,62.16,9.30,credit card,dpd,0,Kristoffer,Demeter,Male,18,L,kdemeterau@typepad.com,7274164798,5928 Ramsey Center,Room 439,Saint Petersburg,FL,snaproutesports.com,Burk,Hauch,ghonsch2q@bing.com,LLVM,Virginia Military Institute,,,,,,Online
845,2022-03-02 17:54:01,Accessories,Eye Black,Eye Black Stickers,Rawlings,6.02,9,54.18,30.42,bank wire,DSV,0,Ginny,Poston,Female,25,M,gpostonnh@si.edu,4155245104,238 Hooker Court,12th Floor,San Francisco,CA,passplaypalace.com,Nealson,Cooch,mlongford1n@posterous.com,KPI Implementation,Universidad Nacional de Entre Ríos,,,,,,Online
978,2022-08-25 09:26:08,Accessories,Eye Black,Eye Black Stickers,Franklin,2.45,2,4.90,8.42,PayPal,USPS,0,Cosetta,Vasilischev,Female,39,3XL,cvasilischevr6@storify.com,8167010404,14157 Kropf Hill,10th Floor,Lees Summit,MO,endzoneemporium.com,Ardella,Edler,aliesj@freewebs.com,Simulations,Yonok University,,,,,,Online
1412,2022-09-18 18:58:27,Accessories,Eye Black,Eye Black Stickers,Rawlings,6.92,10,62.28,14.90,cash,dpd,10,Cathryn,Beetham,Female,18,M,cbeethambg@1688.com,3179717343,7391 Warrior Street,15th Floor,Indianapolis,IN,quarterbackquarters.com,Kerry,Guynemer,mhansford5t@forbes.com,CFD,Universität des Saarlandes,,,,,,Online
1996,2022-04-05 20:38:55,Accessories,Eye Black,Eye Black Stickers,Franklin,10.24,1,10.24,1.93,cash,dpd,0,Birdie,Houldin,Female,22,XS,bhouldinro@nba.com,2256640670,880 Melody Drive,PO Box 42533,Baton Rouge,LA,fieldforgefashions.com,Mair,Blacklidge,gcoastq@dailymotion.com,ABS,European Carolus Magnus University,,,,,,Online
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1033385,2023-02-19 13:11:57,Accessories,Eye Black,Eye Black Stickers,Franklin,5.70,4,22.80,12.56,mobile payment,,,Chrisse,Lenthall,Male,,,clenthallap@livejournal.com,222-845-2506,,,,,,,,,,,46043 Golf Terrace,HI,megastore,Honolulu,Forelle Inc.,Offline
1033777,2023-12-16 06:33:58,Accessories,Eye Black,Eye Black Stickers,Rawlings,8.07,3,24.21,11.13,credit card,,,Riobard,Heinl,Genderfluid,,,rheinlll@xing.com,935-524-7865,,,,,,,,,,,7 Transport Court,FL,superstore,Clearwater,Grady and Sons,Offline
1033795,2022-10-26 23:50:33,Accessories,Eye Black,Eye Black Stickers,Franklin,8.52,7,59.64,25.27,debit card,,,Kelci,Thewys,Female,,,kthewysm3@woothemes.com,957-723-9365,,,,,,,,,,,247 Mariners Cove Alley,TX,superstore,Irving,VOIT Corp.,Offline
1033887,2023-07-07 23:02:04,Accessories,Eye Black,Eye Black Stickers,Franklin,3.33,8,26.64,31.52,cash,,,Enrichetta,Klainer,Female,,,eklaineron@google.fr,706-134-8480,,,,,,,,,,,442 Charing Cross Terrace,CA,megastore,Mountain View,XTECH Protective Equipment LLC,Offline
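Scanning the full rows works, but a compact per-brand row count for the multi-brand products may be easier to read. A minimal sketch on a made-up frame `toy`; the row counts in it are invented and only the product and brand names come from the output above.

```python
import pandas as pd

# Toy stand-in for df_combined with one multi-brand product.
toy = pd.DataFrame({
    "product":    ["Eye Black Stickers", "Eye Black Stickers",
                   "Eye Black Stickers", "Adams Spine Pad (1205)"],
    "brand_name": ["Franklin", "Rawlings", "Franklin", "Adams"],
})

# Products that appear under more than one brand.
brand_counts = toy.groupby("product")["brand_name"].nunique()
multi_brand_products = brand_counts[brand_counts > 1].index

# Number of rows per (product, brand) pair for those products only.
rows_per_brand = (
    toy[toy["product"].isin(multi_brand_products)]
    .groupby(["product", "brand_name"])
    .size()
)
print(rows_per_brand)
```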


In [93]:
unique_combinations = df_combined[['product', 'brand_name', 'product_category', 'product_subcategory']].drop_duplicates()

print(f"Number of unique combinations: {len(unique_combinations)}")


Number of unique combinations: 574


In [94]:
unique_combinations = df_combined[['product']].drop_duplicates()

print(f"Number of unique combinations: {len(unique_combinations)}")

Number of unique combinations: 573


In [95]:
unique_combinations = df_combined[['product', 'brand_name']].drop_duplicates()

print(f"Number of unique combinations: {len(unique_combinations)}")

Number of unique combinations: 574


In [96]:
unique_combinations = df_combined[['product_category']].drop_duplicates()

print(f"Number of unique combinations: {len(unique_combinations)}")

Number of unique combinations: 8


In [97]:
unique_combinations = df_combined[['product_subcategory']].drop_duplicates()

print(f"Number of unique combinations: {len(unique_combinations)}")

Number of unique combinations: 28
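The per-column counts above can also be obtained in one call, which avoids reassigning `unique_combinations` in every cell. A short sketch on a made-up frame `toy`; the category and subcategory of the second product are invented for illustration.

```python
import pandas as pd

# Toy stand-in for df_combined.
toy = pd.DataFrame({
    "product":             ["Eye Black Stickers", "Eye Black Stickers",
                            "Adams Spine Pad (1205)"],
    "brand_name":          ["Franklin", "Rawlings", "Adams"],
    "product_category":    ["Accessories", "Accessories", "Protective Gear"],
    "product_subcategory": ["Eye Black", "Eye Black", "Spine Pads"],
})

cols = ["product", "brand_name", "product_category", "product_subcategory"]

# Distinct values per column, all at once.
per_column = toy[cols].nunique()

# Distinct combinations across several columns.
n_product_brand = len(toy[["product", "brand_name"]].drop_duplicates())
n_all = len(toy[cols].drop_duplicates())

print(per_column)
print(f"product-brand pairs: {n_product_brand}, full combinations: {n_all}")
```

Note that `nunique()` ignores missing values by default, while `drop_duplicates()` keeps a missing value as its own row, so the two approaches can differ by one on columns that contain NaN.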
