# EXPLORATORY DATA ANALYSIS

- The dataset used here is a cleaned version of the original dataset.
- The data cleaning and preparation steps are documented in [cleaningProcess.ipynb](cleaningProcess.ipynb).

# Objectives of this Exploration

1. Customer Segmentation:
- What are the distinct customer segments based on their purchasing behavior?
- How can we categorize customers as new, returning, or loyal?

2. Customer Retention:
- What is the customer retention rate, and how can we improve it?
- What strategies can be implemented to reduce churn among customers?

3. Product and Department Insights:
- Analyze the distribution of products across different departments.
- Identify which departments have the highest sales and reordering rates.
- Investigate which products and departments contribute the most to overall revenue.

4. Market Basket Analysis:
- Perform market basket analysis to identify product associations and frequently co-purchased products. This can inform cross-selling and bundling strategies.

5. Reordering Patterns:
- Calculate the reordering rate for products.
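
As a rough illustration of the last objective, the reordering rate of a product can be taken as the share of its order lines where `reordered` is 1. The sketch below assumes that definition and runs on a small hypothetical sample rather than the real dataset; only the column names `product_name` and `reordered` come from the data.

```python
import pandas as pd

# Hypothetical sample: one row per order line, reordered is 0 or 1
sample = pd.DataFrame({
    "product_name": ["butter", "butter", "butter", "soup", "soup"],
    "reordered":    [1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the fraction of rows equal to 1,
# which here is the assumed definition of the reordering rate
reorder_rate = (
    sample.groupby("product_name")["reordered"]
    .mean()
    .sort_values(ascending=False)
)

print(reorder_rate)
```

Grouping by `department` instead of `product_name` would give the same rate per department.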

In [1]:
# Import the packages needed for the subsequent stages
import pandas as pd

In [2]:
# pd.read_excel does not support chunked reading, so the whole sheet is loaded at once
df = pd.read_excel(r'C:\Users\GARETH TIROP\Desktop\Data Science and Analytics Docs\Cleaned_Ecommerce_Dataset.xlsx')

In [3]:
df

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,department_id,department,product_name
0,2425083,49125,1,2,18,0,17,1,0,13,pantry,baking ingredients
1,2425083,49125,1,2,18,0,91,2,0,16,dairy eggs,soy lactosefree
2,2425083,49125,1,2,18,0,36,3,0,16,dairy eggs,butter
3,2425083,49125,1,2,18,0,83,4,0,4,produce,fresh vegetables
4,2425083,49125,1,2,18,0,83,5,0,4,produce,fresh vegetables
...,...,...,...,...,...,...,...,...,...,...,...,...
1048570,671875,33287,5,6,15,30,69,12,1,15,canned goods,soup broth bouillon
1048571,671875,33287,5,6,15,30,91,13,1,16,dairy eggs,soy lactosefree
1048572,671875,33287,5,6,15,30,61,14,1,19,snacks,cookies cakes
1048573,671875,33287,5,6,15,30,61,15,1,19,snacks,cookies cakes
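
The `Unnamed: 0` column in the preview above is the name pandas gives to a column with an empty header, which usually happens when a DataFrame's index was written to the file along with the data. A minimal sketch of dropping it is shown below; the in-memory CSV is a hypothetical stand-in for the real file, and the variable names `raw` and `tidy` are illustrative only.

```python
import io

import pandas as pd

# Hypothetical two-row sample whose first header cell is empty,
# mimicking a file that was saved together with its index
sample_csv = io.StringIO(
    ",order_id,user_id,reordered\n"
    "0,2425083,49125,0\n"
    "1,2425083,49125,0\n"
)

raw = pd.read_csv(sample_csv)

# Drop the leftover index column; errors="ignore" keeps this safe
# when the column is not present
tidy = raw.drop(columns=["Unnamed: 0"], errors="ignore")

print(tidy.columns.tolist())
```

Passing `index_col=0` when loading the file is another way to get the same result, since it tells pandas to treat the first column as the index.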


# Customer Segmentation

- What are the distinct customer segments based on their purchasing behavior?

In [6]:
# Categorizing customers as new, returning, or loyal,
# based on the days_since_prior_order column

# First, check the unique values and how often each one occurs

uniqueValuesCount = df["days_since_prior_order"].value_counts()
print("The Unique values in 'days_since_prior_order' column are: \n", uniqueValuesCount)

# Working definitions:
# - New customers have 0 days since their prior purchase
# - Returning customers have only two occurrences (two orders) in the data
# - Loyal customers are those with few days since their prior purchase


The Unique values in 'days_since_prior_order' column are: 
 30    110629
7     109259
6      80867
0      78837
5      67932
4      64923
8      60943
3      58765
2      46384
9      39477
14     33239
10     31811
1      30295
11     27014
13     27000
12     25137
15     21590
16     14706
21     14518
20     12816
17     12580
18     12048
22     10183
19     10080
28      8532
23      7620
24      6687
27      6384
29      6367
26      6076
25      5876
Name: days_since_prior_order, dtype: int64
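
The counts above are per order line, not per customer, so they do not assign a segment to each user on their own. One possible way to do that is to count distinct orders per user and label users by that count. The sketch below assumes this approach; the function `label_segment`, the `loyal_min` threshold of 4 orders, and the sample data are all hypothetical and would need tuning against the real dataset.

```python
import pandas as pd

# Hypothetical sample: one row per order line item
sample = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
    "order_id": [10, 10, 20, 21, 21, 30, 31, 32, 33, 34],
})

# Number of distinct orders placed by each user
orders_per_user = sample.groupby("user_id")["order_id"].nunique()


def label_segment(n_orders, loyal_min=4):
    """Map an order count to a segment label (thresholds are assumptions)."""
    if n_orders == 1:
        return "new"
    if n_orders < loyal_min:
        return "returning"
    return "loyal"


segments = orders_per_user.apply(label_segment)

print(segments)
```

The average of `days_since_prior_order` per user could be added as a second criterion for the loyal segment, in line with the definition based on few days between purchases.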



In [10]:
# To segment customers properly:
# Group the dataset by order_id. When a customer orders several items at once, they all share one order_id for that purchase.
# A user_id with more than one unique order_id has therefore ordered more than once.

order_groups = df.groupby('order_id')['user_id'].agg(user_ids=list).reset_index()
# Count the rows in each order. Note that user_count is the number of line items in the order, not the number of users.
order_groups['user_count'] = order_groups['user_ids'].apply(len)

# Display the DataFrame showing order_id, user_ids, and user_count
print(order_groups)

# Identify user_ids with a single order and those with multiple orders
# by counting the distinct order_id values per user
orders_per_user = df.groupby('user_id')['order_id'].nunique()
single_orders = orders_per_user[orders_per_user == 1].index.to_numpy()
multiple_orders = orders_per_user[orders_per_user > 1].index.to_numpy()

print("User IDs with a single order:", single_orders)
print("User IDs with multiple orders:", multiple_orders)


        order_id                                           user_ids  \
0             11           [143742, 143742, 143742, 143742, 143742]   
1             64                     [5430, 5430, 5430, 5430, 5430]   
2            110  [73241, 73241, 73241, 73241, 73241, 73241, 732...   
3            192  [189954, 189954, 189954, 189954, 189954, 18995...   
4            205                           [190709, 190709, 190709]   
...          ...                                                ...   
103756   3420964  [69985, 69985, 69985, 69985, 69985, 69985, 699...   
103757   3420977                   [109561, 109561, 109561, 109561]   
103758   3420991  [186459, 186459, 186459, 186459, 186459, 18645...   
103759   3421027  [51127, 51127, 51127, 51127, 51127, 51127, 511...   
103760   3421074                   [167185, 167185, 167185, 167185]   

        user_count  
0                5  
1                5  
2               10  
3               15  
4                3  
...            ...  
