# E-Commerce EDA & Modeling
____

**E-Commerce Data**

We use a dataset of real online sales that a company made to wholesalers (customers) from 2010.12.01 to 2011.12.09.

The E-Commerce Data contains 8 data columns: Transaction Number (InvoiceNo), Product Code (StockCode), Product Name (Description), Purchase Quantity (Quantity), Purchase Date (InvoiceDate), Product Price (UnitPrice), Customer ID (CustomerID), and Purchaser's Country (Country).

___
**1. Data Preparation**

**2. EDA (Exploratory Data Analysis)**

   - 2.1 Quantity
   - 2.2 StockCode
   - 2.3 CustomerID
   - 2.4 Description
   - 2.5 InvoiceDate
   - 2.6 UnitPrice
   - 2.7 Amount
   - 2.8 Country
   - 2.9 Reorder Item  

**3. Modeling**

   - 3.1 Data Mining
   - 3.2 Customer Data 
   - 3.3 Customer Classification 
   - 3.4 K-means Clustering

**4. Conclusion**
___

# 1. Data Preparation

**Preprocessing**

Importing libraries and data.csv file.

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline
color = sns.color_palette()

import warnings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
import missingno as msno 
import pandas_profiling
import datetime

In [None]:
import matplotlib.font_manager as fm
[(f.name, f.fname) for f in fm.fontManager.ttflist if 'Apple' in f.name]

In [None]:
import matplotlib.pyplot as plt
plt.rc('font', family="AppleGothic")

In [None]:
from pandas import DataFrame
data = pd.read_csv(r"/kaggle/input/ecommerce-data/data.csv", encoding = 'ISO-8859-1')
data.head()

   **Meaning of each column**
   - **InvoiceNo**: Invoice number.<br>
   - **StockCode**: Product (item) code.<br>
   - **Description**: Product (item) name. <br>
   - **Quantity**: The quantities of each product (item) per transaction.<br>
   - **InvoiceDate**: Invoice date and time.<br>
   - **UnitPrice**: Unit price.<br>
   - **CustomerID**: Customer number.<br>
   - **Country**: Country name. <br>

Check the structure of the data.

In [None]:
data.info()

In [None]:
data.shape

In [None]:
data.describe()

### Missing values

Checking the missing values and dropping the Nulls.

In [None]:
# Checked where the missing values are = On Description and CustomerID

data.isnull().sum() / data.shape[0]

In [None]:
# Checked how many null values each column has.

data.isnull().sum().sort_values(ascending=False)

In [None]:
# Checking where Description is NaN

data[data.Description.isnull()].head()

In [None]:
# Checking where CustomerID is NaN

data[data.CustomerID.isnull()].head()

In [None]:
# 'Description' NaN => 'CustomerID' NaN: confirmed that every row with a NaN 'Description' also has a NaN 'CustomerID'.

In [None]:
# Number of rows where 'Description' is NaN and 'CustomerID' is also NaN: 1454

data[data.Description.isnull()].CustomerID.isnull().value_counts()
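
The claim that every missing `Description` coincides with a missing `CustomerID` can also be reduced to a single boolean. A minimal sketch on a made-up three-row frame (the `toy` data is hypothetical and not taken from data.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for data.csv
toy = pd.DataFrame({
    "Description": ["MUG", None, None],
    "CustomerID": [17850.0, np.nan, np.nan],
})

# Rows where Description is missing
desc_missing = toy["Description"].isnull()

# True only if every such row is also missing CustomerID
all_overlap = bool(toy.loc[desc_missing, "CustomerID"].isnull().all())
print(all_overlap)
```

On the real data, the same expression with `data` in place of `toy` would give one answer instead of a table of counts to read.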

### Dropping the NaN

In [None]:
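# Removed the rows that contain NaN values.
# .copy() makes data_n an independent frame, so columns can be added to it later without a SettingWithCopyWarning.

data_n = data.dropna().copy()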

In [None]:
data_n.isnull().sum() / data.shape[0]

In [None]:
data_n.isnull().sum().sort_values(ascending=False)

In [None]:
# Checking again to make sure there are no null values

data_n.info()

In [None]:
data_ntype = {'CustomerID': str,'InvoiceNo': str}

In [None]:
# Checked the duplicate rows in the data: 5,225

print('number of duplicates: {}'.format(data_n.duplicated().sum()))
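
The notebook only counts the duplicate rows and keeps them. If they were judged to be true duplicates rather than legitimately repeated order lines (an assumption that the data alone cannot confirm), they could be removed as in this sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical order lines; the first two rows are identical
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366"],
    "StockCode": ["85123A", "85123A", "22633"],
    "Quantity": [6, 6, 6],
})

n_dupes = int(toy.duplicated().sum())   # rows identical to an earlier row
deduped = toy.drop_duplicates()         # keeps the first occurrence of each row
print(n_dupes, len(deduped))
```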

Check 'Description' and 'InvoiceNo' for outliers by counting the number of characters in each value.

In [None]:
# 'Description' length => Des_len

data_n['Des_len'] = data_n.Description.apply(lambda x: len(x))
data_n.head()

In [None]:
data_n.Des_len.describe()

In [None]:
# 'InvoiceNo' length => Invo_len

data_n['Invo_len'] = data_n.InvoiceNo.apply(lambda x: len(x))
data_n.head()

In [None]:
data_n.Invo_len.describe()

In [None]:
# After confirming that there is no problem with 'Description' and 'InvoiceNo', drop the helper columns.

data_n = data_n.drop(columns = ['Des_len', 'Invo_len'])
data_n.head()

# 2. EDA (Exploratory Data Analysis)

### 2.1 Quantity

Quantity : Purchased Quantity by each person

In [None]:
# Checking the 'Quantity' values shows that the minimum is below -80,000.
# Looking at the data, this is because refunds are recorded as negative values.

data_n.Quantity.describe()

In [None]:
# Keep only rows with a positive 'Quantity' (drops the refunds) and store the result in data_n.

data_n = data_n[data_n.Quantity > 0]

In [None]:
# Checking again that the minimum purchase quantity is now positive.

data_n.Quantity.describe()

In [None]:
data_n.head()

In [None]:
# Visualized the most frequently ordered quantities by counting 'InvoiceNo' per 'Quantity'.
# The most ordered Quantity is 1.

qt=data_n.groupby('Quantity')['InvoiceNo'].count().sort_values(ascending=False).iloc[0:30]
plt.figure(figsize=(40,10))
sns.barplot(x=qt.index, y=qt.values, palette="YlOrRd")
plt.xlabel('Quantity',fontsize=15)
plt.ylabel("Number of Orders",fontsize=15)
plt.title("Quantity",fontsize=20);
plt.xticks(fontsize=15);

In [None]:
# The graph above shows that very large orders do occur, but they are rare, so we focus on quantities below 25.
# Distribution of Quantity below 25: orders of roughly 1 to 15 units are the most common.

plt.figure(figsize=(20,5))
sns.histplot(x=data_n[data_n['Quantity'] < 25]['Quantity'].values, kde=True, stat='density', bins=10, color='orange')

### 2.2 StockCode

StockCode : Code for each product

In [None]:
# Checked the 'StockCode' values that start with letters instead of digits.
# Since this is transaction data, it includes not only product sales but also postage, manual adjustments, and bank charges.

data_n[data_n['StockCode'].str.contains('^[a-zA-Z]+', regex=True)]['StockCode'].unique()

In [None]:
#POST            -> POSTAGE                      
#D               -> Discount                     
#C2              -> CARRIAGE                    
#M               -> Manual                     
#BANK CHARGES    -> Bank Charges            
#PADS            -> PADS TO MATCH ALL CUSHIONS 
#DOT             -> DOTCOM POSTAGE 

In [None]:
# Delete the rows whose 'StockCode' is one of these text codes, since they are not product transactions.

data_n=data_n[~data_n['StockCode'].isin(['POST', 'C2', 'M', 'BANK CHARGES', 'PADS', 'DOT'])].copy()
data_n

In [None]:
# Checked again that the text codes are gone from 'StockCode'.

data_n[data_n['StockCode'].str.contains('^[a-zA-Z]+', regex=True)]['StockCode'].unique()

In [None]:
# Visualized the 'StockCode' values that appear in the most order lines (using the cleaned data_n).

stockcode_c = data_n.StockCode.value_counts().sort_values(ascending=False)
plt.figure(figsize=(20,5))
sns.barplot(x=stockcode_c.iloc[0:30].index,
            y=stockcode_c.iloc[0:30].values,
            palette="Greens_r")
plt.ylabel("Counts")
plt.xlabel("Stockcode")
plt.title("Stockcode");

In [None]:
# Looked at the 10 most ordered (StockCode, Country) pairs.
# For all of them the country is the UK.

print('The TOP 10 Stockcodes with most number of orders') 
stockCode_best = data_n.groupby(by=['StockCode','Country'], as_index=False)['InvoiceNo'].count()
stockCode_best.sort_values(by='InvoiceNo', ascending=False).head(10)

In [None]:
# Sorted the rows of the five most ordered 'StockCode' values by 'Quantity' in descending order.
# Among these five codes, 85123A and 85099B are the ones sold in large quantities at once.

data_n[data_n['StockCode'].isin(['85123A','22423','85099B','47566','20725'])].sort_values(by=['Quantity'], axis=0, ascending=False).head(20)

### 2.3 CustomerID

CustomerID : Customer's ID

In [None]:
data_n.CustomerID=data_n.CustomerID.astype('int64')

In [None]:
# Visualized the customers with the most order lines (using the cleaned data_n).

customer_c = data_n.CustomerID.value_counts().sort_values(ascending=False).iloc[0:30]
plt.figure(figsize=(20,5))
sns.barplot(x=customer_c.index, y=customer_c.values, order=customer_c.index,palette="Reds_r")
plt.ylabel("Counts")
plt.xlabel("CustomerID")
plt.title("Which customers are most common?");

In [None]:
# Found out the 'CustomerID' that placed the most orders.

print('The TOP 10 customers with most number of orders') 
customer_best = data_n.groupby(by=['CustomerID','Country'], as_index=False)['InvoiceNo'].count()
customer_best.sort_values(by='InvoiceNo', ascending=False).head(10)

In [None]:
# Sorted the rows of the five most active 'CustomerID' values by 'Quantity' in descending order.
# Most of these customers buy low-priced items in large quantities.
# 'CustomerID' was cast to int64 above, so the IDs are compared as integers, not strings.

data_n[data_n['CustomerID'].isin([17841, 14911, 14096, 12748, 14606])].sort_values(by=['Quantity'], axis=0, ascending=False).head(20)

### 2.4 Description

Description : Product name of the product being sold

In [None]:
# Changed the whole 'Description' into uppercase

data_n['Description'] = data_n.Description.str.upper()

In [None]:
# Checked the 10 most frequently ordered products.

data_n.Description.value_counts()[:10]

In [None]:
# Visualized which products appear in the most order lines.

description_c = data_n.Description.value_counts().sort_values(ascending=False).iloc[0:30]
plt.figure(figsize=(20,5))
sns.barplot(x=description_c.index, y=description_c.values, palette="Blues_r")
plt.ylabel("Counts")
plt.title("Description");
plt.xticks(rotation=90);

In [None]:
# The most ordered product is "WHITE HANGING HEART T-LIGHT HOLDER",
# and most of those orders came from the UK.

print('The TOP 10 Description with most number of orders') 
description_best = data_n.groupby(by=['Description','Country'], as_index=False)['InvoiceNo'].count()
description_best.sort_values(by='InvoiceNo', ascending=False).head(10)

In [None]:
# 'Description': Analyzed the product name.

description=[data_n.Description.value_counts().index]
description

In [None]:
# Split the 'Description' by spaces.

description_most=data_n['Description'].str.split(expand=True).stack().value_counts()
df=pd.DataFrame(description_most)
df

In [None]:
# Visualized the most frequent words in 'Description'.
# Keywords such as SET, BAG, RETROSPOT, and VINTAGE appear often,
# which suggests that this e-commerce site sells a wide range of product groups.

df1=description_most.sort_values(ascending=False).iloc[0:50]
plt.figure(figsize=(20,5))
sns.barplot(x=df1.index, y=df1.values, palette="autumn_r")
plt.ylabel("Counts")
plt.title("Description Frequency");
plt.xticks(rotation=90);

### 2.5 InvoiceDate

InvoiceDate : Order date and time

In [None]:
# Convert 'InvoiceDate' from strings such as 12/1/2010 11:52 to datetimes such as 2010-12-01 11:52:00,
# which are easier to work with.


data_n['InvoiceDate'] = pd.to_datetime(data_n.InvoiceDate, format='%m/%d/%Y %H:%M')

In [None]:
# Split InvoiceDate into year, month, day of week, and hour, to compute statistics for each.
# Since the data spans two years (2010 and 2011), the 'year' column holds a combined year-month key (e.g. 201012).


data_n.insert(loc=2, column='year',value=data_n['InvoiceDate'].map(lambda x: 100*x.year + x.month))
data_n.insert(loc=3, column='month', value=data_n.InvoiceDate.dt.month)
# +1 to make Monday=1.....until Sunday=7
data_n.insert(loc=4, column='day', value=(data_n.InvoiceDate.dt.dayofweek)+1)
data_n.insert(loc=5, column='hour', value=data_n.InvoiceDate.dt.hour)
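
To make the derived columns concrete, here is a small sketch that applies the same arithmetic to two hypothetical timestamps in the raw format. The frame `parts` and the column name `year_month` are illustrative only; the notebook stores the combined key in a column called `year`.

```python
import pandas as pd

# Two hypothetical timestamps in the raw 'm/d/Y H:M' format
raw = pd.Series(["12/1/2010 8:26", "12/9/2011 12:50"])
ts = pd.to_datetime(raw, format="%m/%d/%Y %H:%M")

parts = pd.DataFrame({
    "year_month": ts.dt.year * 100 + ts.dt.month,  # combined key, e.g. 201012
    "month": ts.dt.month,
    "day": ts.dt.dayofweek + 1,                    # Monday=1 ... Sunday=7
    "hour": ts.dt.hour,
})
print(parts)
```

The vectorized `ts.dt.year * 100 + ts.dt.month` gives the same key as the row-wise `map(lambda x: 100*x.year + x.month)` used above.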

In [None]:
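# Preview the table without 'InvoiceDate'. The result is not assigned, because the column is still needed later.
data_n.drop(['InvoiceDate'], axis=1)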

In [None]:
# Visualized the hours of the day with the most order lines: around 12 o'clock.

df4=data_n['hour'].value_counts().sort_values(ascending=False).iloc[0:50]
plt.figure(figsize=(20,5))
sns.barplot(x=df4.index, y=df4.values, palette="coolwarm_r")
plt.ylabel("Counts")
plt.title("Hour");

In [None]:
# Visualized the months with the most order lines:
# November has the most, and activity increases as the year-end approaches.

df4=data_n['month'].value_counts().sort_values(ascending=False).iloc[0:50]
plt.figure(figsize=(20,5))
sns.barplot(x=df4.index, y=df4.values, palette="coolwarm_r")
plt.ylabel("Counts")
plt.title("Month");

### 2.6 UnitPrice

UnitPrice : Price per product

In [None]:
# Checked the unit price of the items and found that some rows have a price of zero.

data_n.UnitPrice.describe()

In [None]:
# A total of 33 rows have a 'UnitPrice' of 0.

data_n.loc[data_n.UnitPrice == 0].sort_values(by="Quantity", ascending=False).count()

In [None]:
# 'UnitPrice' of 0: it is unknown whether these are free products or promotions.

data_n.loc[data_n.UnitPrice == 0].sort_values(by="Quantity", ascending=False).head()

In [None]:
# Visualized the distribution of UnitPrice below 10: the lower the price, the more it sells.
# This helps explain why wholesalers use this e-commerce site.

plt.figure(figsize=(12,4))
sns.histplot(x=data_n[data_n['UnitPrice'] < 10]['UnitPrice'].values, kde=True, stat='density', bins=10, color='red')

### 2.7 Amount

In [None]:
# Since 'Quantity' and 'UnitPrice' are separate columns, it is hard to see the actual money flow of a transaction line.
# Added an 'Amount' column, computed as 'Quantity' * 'UnitPrice', that shows the value of each purchase line.

data_n['Amount'] = data_n['Quantity'] * data_n['UnitPrice']

In [None]:
data_n.head()

In [None]:
data_n.info()

In [None]:
# Checked the sales per month.
# Caution: the dataset runs from December 1, 2010 to December 9, 2011, so it is not exactly one year of data,
# and December 2011 is not a complete month.

data_n.groupby('year')['Amount'].sum()

In [None]:
# Visualized monthly sales: November 2011 shows the highest sales.
# The December 2011 data covers only about 1/3 of the month; roughly tripling it
# suggests that monthly sales kept rising as the year went on.

df2=data_n.groupby('year')['Amount'].sum()
plt.figure(figsize=(20,5))
ax = plt.subplot()
sns.barplot(x=df2.index, y=df2.values, palette='PRGn_r', ax=ax)
ax.get_yaxis().get_major_formatter().set_scientific(False)
ax.set_xlabel('Year-Month')
ax.set_ylabel('Amount')
ax.set_title('Monthly Sales')
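
The "multiply by about 3" adjustment for December 2011 can be written out as a small calculation. This is a rough sketch that assumes a constant daily sales rate (so it ignores weekends and seasonality); `extrapolate_month` is a hypothetical helper and the partial total of 300.0 is made up:

```python
import calendar

def extrapolate_month(partial_total, year, month, days_observed):
    """Scale a partial-month total to a full month, assuming a constant daily rate."""
    days_in_month = calendar.monthrange(year, month)[1]
    return partial_total * days_in_month / days_observed

# Hypothetical partial total for 1-9 December 2011 (9 observed days out of 31)
estimate = extrapolate_month(300.0, 2011, 12, days_observed=9)
print(round(estimate, 2))
```

With 9 observed days the scale factor is 31/9, or about 3.4, so a factor of 3 is a slightly conservative estimate.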

In [None]:
# Looked for outliers in 'Amount' with a scatterplot

plt.figure(figsize=(20,5))
plt.scatter(x=data_n.index, y=data_n['Amount'])

In [None]:
# Removed the outliers with an 'Amount' of 25000 or more (so that they do not distort averages in the later modeling)

data_n = data_n[data_n['Amount'] < 25000]
plt.figure(figsize=(20,5))
plt.scatter(x=data_n.index, y=data_n['Amount'])
plt.xticks(rotation=90)

### 2.8 Country

Country : Country of purchaser

In [None]:
# Check what 'Countries' are in the data.

data_n.Country.unique()

In [None]:
# Check how many order lines each 'Country' has.
# The UK has by far the most.

data_n.Country.value_counts()

The country names could be shortened, but we did not do so because renaming every country adds little. If you want to, we recommend abbreviating only the UK or the top 3 countries.

In [None]:
#data_n = data_n.replace({'United Kingdom':'UK','France':'FR','Germany':'DE'}) 

In [None]:
# Number of orders by country: The UK is the highest.

df7=data_n.groupby('Country')['InvoiceNo'].count().sort_values(ascending=False)
plt.figure(figsize=(30,10))
sns.barplot(x=df7.index, y=df7.values, palette="inferno_r")
plt.xlabel('Country',fontsize=15)
plt.ylabel("Number of Orders",fontsize=15)
plt.title("Country",fontsize=20);
plt.xticks(rotation=90,fontsize=20);

In [None]:
# With the UK removed, the top 3 countries by number of orders are Germany, France, and Ireland.

df7=data_n.groupby('Country')['InvoiceNo'].count().sort_values(ascending=False)
del df7['United Kingdom']
plt.figure(figsize=(30,10))
sns.barplot(x=df7.index, y=df7.values, palette="inferno_r")
plt.xlabel('Country',fontsize=15)
plt.ylabel("Number of Orders",fontsize=15)
plt.title("Country",fontsize=20);
plt.xticks(rotation=90,fontsize=20);

In [None]:
# Total sales by country: The UK is the highest.

df3=data_n.groupby('Country')['Amount'].sum().sort_values(ascending=False)
plt.figure(figsize=(30,10))
sns.barplot(x=df3.index, y=df3.values, palette="inferno_r")
plt.xlabel('Country',fontsize=15)
plt.ylabel("Amount",fontsize=15)
plt.title("Total amount by Country",fontsize=20);
plt.xticks(rotation=90,fontsize=20);

In [None]:
# The countries with the highest average amount per order line are the Netherlands, Australia and Japan.
# This shows that total sales and average amount are not closely related.

df8=data_n.groupby('Country')['Amount'].mean().sort_values(ascending=False)
plt.figure(figsize=(30,10))
sns.barplot(x=df8.index, y=df8.values, palette="inferno_r")
plt.xlabel('Country',fontsize=15)
plt.ylabel("Amount",fontsize=15)
plt.title("Average amount by Country",fontsize=20);
plt.xticks(rotation=90,fontsize=20);

In [None]:
# The UK's share of all order lines: 89.19%.

uk_count = data_n[data_n['Country'] == 'United Kingdom']['Country'].count()
all_count = data_n['Country'].count()
uk_perc = uk_count/all_count
print('UK : {0:.2f}%'.format(uk_perc*100))

### 2.9 Reorder Item

Identified the most reordered products using 'CustomerID', 'StockCode', and 'InvoiceDate'.

In [None]:
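# Identified the most repurchased items.
# A row counts as a reorder when the previous row (after sorting) has the same customer and the same product.

df_sort = data_n.sort_values(['CustomerID', 'StockCode', 'InvoiceDate'])
df_sort_shift1 = df_sort.shift(1)
df_sort_reorder = df_sort.copy()
df_sort_reorder['Reorder'] = np.where((df_sort['StockCode'] == df_sort_shift1['StockCode']) &
                                      (df_sort['CustomerID'] == df_sort_shift1['CustomerID']), 1, 0)
df_sort_reorder.head(5)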
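
An alternative way to build the same kind of flag is to number each customer's purchases of a product and treat everything after the first as a reorder. The sketch below uses hypothetical rows and is a variation on the idea, not the code the notebook ran:

```python
import pandas as pd

# Hypothetical purchases: customer 1 buys A twice, customer 2 buys A once
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "StockCode": ["A", "B", "A", "A"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-01-05", "2011-02-10", "2011-03-01"]),
})

toy = toy.sort_values(["CustomerID", "StockCode", "InvoiceDate"])
# cumcount() numbers rows within each (customer, product) pair from 0,
# so anything after the first purchase counts as a reorder
toy["Reorder"] = (toy.groupby(["CustomerID", "StockCode"]).cumcount() > 0).astype(int)
print(toy)
```

Grouping by both keys means a row can never be compared with a different customer's purchase.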

In [None]:
# The most reordered product is "WHITE HANGING HEART T-LIGHT HOLDER".

pd.DataFrame((df_sort_reorder.groupby(['Description'])['Reorder'].sum())).sort_values('Reorder', ascending = False).head(10)

In [None]:
# Visualized first purchases and reorders per month: November 2011 has the most reorders.
# Over time, the number of reorders increases, and so does the share of repeat purchases.

notreorder = (df_sort_reorder[df_sort_reorder['Reorder'] == 0 ].groupby(['year'])['Amount'].sum())
reorder = (df_sort_reorder[df_sort_reorder['Reorder'] == 1 ].groupby(['year'])['Amount'].sum())
yearmonth = pd.DataFrame([notreorder , reorder], index=['First Buy', 'Reorder']).transpose()
yearmonth.plot.bar(stacked=True)

#  3. Modeling

### 3.1 Data Mining

Rearrange the data for the modeling that follows.

In [None]:
data_n.info()

In [None]:
# Create data2 to use for data mining: one row per invoice, with the numeric columns summed.
# Drop the columns that are meaningless once summed.

data2 = data_n.groupby(['InvoiceNo','InvoiceDate','CustomerID']).sum(numeric_only=True)
data2 = data2.drop(columns = ['UnitPrice', 'year', 'month', 'day', 'hour'])
data2.head()

In [None]:
data2.describe()

In [None]:
# To decide how to handle outliers, check the skewness.

In [None]:
# If skewness is between -0.5 and 0.5, the data is fairly symmetric.
# If skewness is between -1 and -0.5 or between 0.5 and 1, the data is moderately skewed.
# If skewness is less than -1 or greater than 1, the data is highly skewed.

In [None]:
from scipy.stats import skew

In [None]:
skew(data2.Amount)
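
The rule of thumb in the comments above can be turned into a small helper. This sketch computes the biased Fisher-Pearson coefficient with NumPy, which is the estimator `scipy.stats.skew` uses by default; `sample_skewness`, `describe_skew`, and the sample values are hypothetical names and data chosen for illustration.

```python
import numpy as np

def sample_skewness(values):
    """Biased Fisher-Pearson skewness: m3 / m2**1.5."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)   # second central moment
    m3 = np.mean(centered ** 3)   # third central moment
    return m3 / m2 ** 1.5

def describe_skew(g1):
    """Apply the -0.5/0.5 and -1/1 thresholds to a skewness value."""
    if abs(g1) <= 0.5:
        return "fairly symmetric"
    if abs(g1) <= 1:
        return "moderately skewed"
    return "highly skewed"

symmetric = [1, 2, 3, 4, 5]
long_tail = [1, 1, 1, 1, 50]   # hypothetical values with one very large order
print(describe_skew(sample_skewness(symmetric)))
print(describe_skew(sample_skewness(long_tail)))
```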

In [None]:
# Because the data is skewed, remove outliers by keeping 0 <= Amount <= mean + stddev
# (the two constants are the mean and standard deviation of Amount from data2.describe() above).
# Amount skew reduction

data2 = data2.query('Amount >= 0 and Amount <= 458.583140 + 939.357035') #mean+std
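
The hard-coded constants go stale whenever the upstream cleaning changes. A minimal sketch of computing the same "mean + one standard deviation" cutoff from the data itself, on hypothetical invoice totals (`toy`, `upper`, and `trimmed` are illustrative names):

```python
import pandas as pd

# Hypothetical invoice totals with one extreme value
toy = pd.DataFrame({"Amount": [10.0, 20.0, 30.0, 40.0, 900.0]})

# pandas .std() uses the sample standard deviation (ddof=1), the same value describe() reports
upper = toy["Amount"].mean() + toy["Amount"].std()
trimmed = toy[(toy["Amount"] >= 0) & (toy["Amount"] <= upper)]
print(round(upper, 2), len(trimmed))
```

Applied to `data2`, this would reproduce the filter above without copying numbers from a previous run.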

In [None]:
data2.describe()

In [None]:
# Visualized data2 with a boxplot.

plt.figure(figsize=(20,5))
sns.boxplot(x=data2.Amount)

In [None]:
# Visualized data2 with a displot (a figure-level function, so it takes no ax argument).

sns.displot(data=data2, x='Amount', height = 5, aspect = 3)

In [None]:
# 'Quantity' is also highly skewed, so its outliers need to be trimmed in the same way.

skew(data2.Quantity)

In [None]:
# 'Quantity' skew reduction

data2 = data2.query('Quantity >= 0  and Quantity <= 204.244463 + 231.298757')
data2.describe()

In [None]:
print(skew(data2.Amount))
print(skew(data2.Quantity))

In [None]:
plt.figure(figsize=(20,5))
sns.boxplot(x=data2.Amount)

In [None]:
# Use data2 with reduced skewness (outliers removed).

### 3.2 Customer Data

Review the CustomerID data and check the trend in the number of users.

In [None]:
# Look at 'CustomerID' using the invoices kept in data2

data2 = data2.reset_index()
invoice = data2['InvoiceNo'].tolist()

In [None]:
data_n = data_n[data_n.InvoiceNo.isin(invoice)]

In [None]:
data_n.head()

In [None]:
data_n.describe()

In [None]:
# CustomerID: Guest (null) => a null CustomerID would be classified as Guest, anything else as a customer.
# Rows with a null CustomerID were already dropped during preprocessing, so no Guest rows remain here.
value = {'CustomerID':'Guest'}
data_guest = data_n.fillna(value = value)
data_guest[data_guest.CustomerID == 'Guest'].head()

In [None]:
data_n.CustomerID.nunique()

In [None]:
user_month = data_n.groupby('year').CustomerID.nunique().reset_index()
user_month.columns = ['month','total_user']
user_month
# Note: December appears in both 2010 and 2011, so the grouping uses the combined year-month column ('year'), not 'month'.

In [None]:
# Check that the data is between 2010-12-01 08:26:00 and 2011-12-09 12:50:00
data_n.InvoiceDate 

In [None]:
# Trend in the number of unique users per month
# The sharp drop in the last month is because the data for December 2011 only covers December 1 to December 9.

f, ax = plt.subplots(figsize=(20, 5))
sns.lineplot(x=user_month['month'].astype(str), y=user_month['total_user'], ax=ax)
plt.xlabel('Month')
plt.ylabel('Unique User')
plt.title('Unique User by Month')
plt.xticks(rotation=90)

Most of our customers are wholesalers (resellers). The raw data also contains many Guest rows (null CustomerID), which were dropped during preprocessing.

### 3.3 Customer Classification

   - Customers are grouped by clustering, in a way similar to the RFM customer value analysis commonly used in marketing. RFM has three factors: Recency, Frequency and Monetary.

   - Recency: how recently a customer made their last purchase. Customers who purchased recently are considered to have a stronger current relationship.
   - Frequency: how often a customer purchases during a specified period. The more purchases in the same period, the higher the score, which makes it possible to judge the customer's purchase/use activity.
   - Monetary: the total purchase amount of a customer over a certain period. A higher purchase amount gives a higher score, but when excessively high amounts exist, an upper limit is applied when computing the RFM index so that they do not distort it.
   
##### The RFM score is given by a\*Recency + b\*Frequency + c\*Monetary, where the weights a, b, c reflect which factor matters most in the industry. Here, however, no score was computed; the factors were only used for customer classification. The following five items were used for customer clustering and classification.
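
Although the notebook does not compute an RFM score, the weighted formula can be illustrated with a short sketch. Everything here is assumed for illustration: the three customers, the min-max rescaling of each factor to [0, 1], the inversion of recency so that recent buyers score high, and the weights `a`, `b`, `c`.

```python
import pandas as pd

# Hypothetical per-customer values
rfm = pd.DataFrame({
    "recency_days": [2, 30, 200],       # days since last purchase
    "frequency": [12, 4, 1],            # number of purchases
    "monetary": [5000.0, 800.0, 50.0],  # total spend
}, index=["c1", "c2", "c3"])

def min_max(s):
    """Rescale a column to the range [0, 1]."""
    return (s - s.min()) / (s.max() - s.min())

# Recent buyers should score high, so recency is inverted
r = 1 - min_max(rfm["recency_days"])
f = min_max(rfm["frequency"])
m = min_max(rfm["monetary"])

a, b, c = 0.2, 0.3, 0.5   # assumed weights, chosen only for illustration
rfm["score"] = a * r + b * f + c * m
print(rfm["score"].round(3))
```

In practice the weights would be chosen per industry, and the monetary factor might be capped first, as described in the Monetary bullet.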

In [None]:
# 'cust_id': customer ID
# 'total_product': Number of distinct products purchased per customer
# 'total_trx': Total transaction amount per customer
# 'recent_trx': Days since the customer's last transaction
# 'freq': Transaction frequency within the data period

In [None]:
# Saved the following 6 columns in data_cust.

data_cust = data_n[['CustomerID','InvoiceDate','Quantity','UnitPrice','Amount','StockCode']]
data_cust.head()

In [None]:
# Computed total_product (number of distinct products) per customer.

total_bought = data_cust.groupby('CustomerID').StockCode.nunique().reset_index()
total_bought.columns = ['cust_id','total_product']
total_bought.head()

In [None]:
# Computed the total transaction amount per customer as 'total_trx'.

total_trx = data_cust.groupby('CustomerID').Amount.sum().reset_index()
total_trx.columns = ['cust_id','total_trx']
total_trx.head()

In [None]:
data_n.InvoiceDate.max()

In [None]:
# Computed 'LastTrx' as the number of days between each InvoiceDate and the last order date in the data (days since the transaction).

data_n['LastTrx'] = (pd.to_datetime('2011/12/09 12:50:00') -data_n.InvoiceDate).dt.days
data_n.tail()

In [None]:
# Computed 'freq' as the number of distinct purchase timestamps per customer within the data period.
cus_frequency = data_cust.groupby('CustomerID').InvoiceDate.nunique().reset_index()
cus_frequency.columns = ['cust_id','freq']
cus_frequency.head()

In [None]:
data_n.describe()

In [None]:
# Set 'recent_trx' to the customer's most recent purchase. A low number means a recent purchase; a high number means it has been a while.
cus_recent_trx = data_n.groupby('CustomerID').LastTrx.min().reset_index()
cus_recent_trx.columns = ['cust_id','recent_trx']
cus_recent_trx.head()

In [None]:
# Merged with 'cust_id'

cust = pd.DataFrame()
cust['cust_id'] = cus_recent_trx.cust_id
cust = cust.merge(total_bought, on='cust_id')
cust = cust.merge(total_trx, on='cust_id')
cust = cust.merge(cus_recent_trx, on='cust_id')
cust = cust.merge(cus_frequency, on='cust_id')
cust.head()
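
The four per-customer tables and the chain of merges can also be produced in one pass with named aggregation. The sketch below runs on hypothetical transactions; `snapshot` is an assumed reference date standing in for the last order date in the data, and `cust_toy` is an illustrative name.

```python
import pandas as pd

# Hypothetical transactions
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "StockCode": ["A", "B", "A"],
    "Amount": [10.0, 5.0, 7.5],
    "InvoiceDate": pd.to_datetime(["2011-12-01", "2011-12-08", "2011-11-09"]),
})
snapshot = pd.Timestamp("2011-12-09")   # assumed reference date
toy["LastTrx"] = (snapshot - toy["InvoiceDate"]).dt.days

cust_toy = (
    toy.groupby("CustomerID")
       .agg(total_product=("StockCode", "nunique"),   # distinct products
            total_trx=("Amount", "sum"),              # total spend
            recent_trx=("LastTrx", "min"),            # days since last purchase
            freq=("InvoiceDate", "nunique"))          # distinct purchase timestamps
       .reset_index()
       .rename(columns={"CustomerID": "cust_id"})
)
print(cust_toy)
```

This yields the same five columns as `cust`, with the column definitions kept next to each other.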

### 3.4 K-means Clustering

In [None]:
from sklearn.cluster import KMeans

In [None]:
# Collect the within-cluster sum of squared distances (inertia) for k = 1 to 9

ssd = []
K = range(1,10)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(cust)
    ssd.append(km.inertia_)

In [None]:
# Elbow method: find the number of clusters after which adding more clusters no longer reduces the within-cluster variability sharply.
# A sharp drop in within-cluster variability means that similar customers are being grouped together well.

plt.figure(figsize=(20,5))
plt.plot(K, ssd, 'bx-')
plt.xlabel('k')
plt.ylabel('ssd')
plt.title('Elbow Method For Optimal k')
plt.show()

In [None]:
kmeans = KMeans(n_clusters=4)
model = kmeans.fit(cust)

In [None]:
pred = model.labels_
cust['Cluster'] = pred
cust.head()
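
K-means relies on Euclidean distance, so columns with large ranges (such as `total_trx`) dominate the result, and the model above is fitted on `cust` as-is, which includes the `cust_id` column. One optional variation, which the notebook did not run and which would change the resulting clusters, is to drop the identifier and standardize the features first. A sketch on a hypothetical table shaped like `cust`:

```python
import pandas as pd

# Hypothetical customer table shaped like `cust`
cust_toy = pd.DataFrame({
    "cust_id": [12346, 12347, 12348],
    "total_product": [1, 100, 20],
    "total_trx": [50.0, 4000.0, 900.0],
    "recent_trx": [300, 2, 75],
    "freq": [1, 7, 4],
})

# Leave the identifier out so it cannot influence distances
features = cust_toy.drop(columns="cust_id")

# z-score each column: zero mean, unit (population) standard deviation
scaled = (features - features.mean()) / features.std(ddof=0)
print(scaled.round(2))
```

The `scaled` frame could then be passed to `KMeans(...).fit(...)` in place of `cust`.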

In [None]:
# Checked the distribution of the 4 clusters by total transaction amount and recency

plt.figure(figsize=(20,5))

sns.scatterplot(data=cust, x="total_trx", y="recent_trx", hue="Cluster")
plt.title('Cluster by Total Transaction and Recency')
plt.show()

In [None]:
# Checked 'total_trx' in customer cluster

customers = cust.groupby('Cluster').mean().reset_index()
customers.sort_values('total_trx')

In [None]:
contribution = cust.groupby('Cluster').total_trx.sum().reset_index()
contribution['Contribution (%)'] = (contribution.total_trx/contribution.total_trx.sum())*100
contribution

In [None]:
# Group 0 purchased less frequently (low freq), spent less (low total_trx), purchased long ago (high recent_trx), and accounts for a large share of sales (high contribution).
# Seasonal Customer

# Group 1 purchased less frequently (low freq), spent less (low total_trx), purchased long ago (high recent_trx), and accounts for a large share of sales (high contribution).
# Seasonal Customer

# Group 2 purchased fairly frequently (medium freq), spent a moderate amount (medium total_trx), purchased recently (low recent_trx), and accounts for a small share of sales (low contribution).
# Loyal Customer

# Group 3 purchased frequently (high freq), spent a lot (very high total_trx), purchased recently (low recent_trx), and accounts for a small share of sales (low contribution).
# Dropshipper

In [None]:
# Visualized the number of customers in each cluster

sns.displot(data=cust, x='Cluster', height = 5, aspect = 3)

# 4. Conclusion

## EDA (Exploratory Data Analysis)

 1. Classifying products by selling price range showed what is sold on the site and how frequent each price range is.


 2. The data includes not only product transactions but also other entries such as POSTAGE, Discount, CARRIAGE, Manual, Bank Charges, PADS TO MATCH ALL CUSHIONS, DOTCOM POSTAGE, Amazon fee, etc.


 3. WHITE HANGING HEART T-LIGHT HOLDER was the bestseller and was also reordered the most. The sales ranking makes it possible to identify the main products with a high purchase rate, to reorganize products with a low purchase rate, and to revise the product composition plan.


 4. Analyzing the product names word by word showed that popular words include 'SET', 'RETROSPOT', 'VINTAGE', 'DESIGN', and 'CHRISTMAS'. This does not identify exact product categories, but it suggests that this e-commerce site sells a variety of product groups rather than one specific group.


 5. Most transactions took place around 12 o'clock. November had the highest transaction volume and sales, and the number of transactions increased toward the end of the year. Over time, the rate of repurchases and regular purchases also increased.


 6. We do not know for certain whether this is an online business based in the UK or in a neighboring country, but the UK is the likely base: it accounted for 89% of the transaction volume and had the highest total sales. The average amount per item purchased was highest in the Netherlands, which indicates that customers in the Netherlands place high-value orders even though their total sales are not high.


 7. More than 50% of the products sold are priced below $2, and fewer than 70% are priced below $3.


 8. It was possible to separate, month by month, purchases by first-time customers from reorders. In the first half of the year (January to June), many customers make their first purchases; in the second half (July to December), both first-time purchases and reorders are high.

## Modeling
#### The EDA covered several analyses, such as sales per customer and the purchase rate of products. Building on it, customers were clustered into four groups using customer data (purchase frequency, sales volume per purchase, and recency of purchase) and labeled as seasonal customers, loyal customers, or dropshippers.
1.  Group 0 : Low frequency, low consumption, and purchased long ago: Seasonal customer group
2.  Group 1 : Low frequency, low consumption, and purchased long ago: Seasonal customer group
3.  Group 2 : Medium frequency, medium consumption, and recently purchased: Loyal customer group
4.  Group 3 : High frequency, high consumption, and recently purchased: Dropshipper

   - Seasonal Customer : 99% of all customers, 60% of all sales : Group 0, Group 1
   - Loyal Customer : 0.4% of all customers, 30% of all sales : Group 2
   - Dropshipper : 0.4% of all customers, 10% of all sales : Group 3

## Business

1. Refunds were excluded from the data during preprocessing, but the original data contains a case where about 80,000 pieces were refunded in a single purchase. If this happens repeatedly, we propose asking the purchasing customer to confirm the refund.
2. The EDA showed that the repurchase rate and the regular purchase rate are well maintained, which suggests that the site (shopping mall) is operating well. To grow further, however, we need to run promotions for seasonal customers, who make up 99% of our customers. Here are some options.
   - Widen the product range offered to seasonal customers in specific months, and broaden what they buy by gradually introducing products similar to those they have already purchased.
   - Increase the discount applied per product, and when a purchase exceeds a certain quantity, contact the customer (for example by email) with offers such as free shipping, discounts, seasonal promotion coupons, and individual promotions, to increase loyalty to the site.
3. Create a tier for each customer and offer promotions such as coupons and discount events per tier. Name the basic tier VIP so that the lowest group does not feel alienated: a customer with an ID receives the VIP tier as soon as they make a purchase. First-time purchasers should also receive promotions such as a first-purchase discount and a free shipping coupon. Tiers reflect purchase frequency, which makes customized promotions per customer possible. Loyal customers (Silver VIP) account for 30% of sales, so they should receive many discounts and services. Since Silver VIP and Gold VIP customers are few, each of them can also be given a dedicated contact person. One caveat concerns the dropshippers who become Gold VIPs: because they resell, they live on the margin between purchase and sale and can switch immediately to a cheaper seller. For them, delivery times and delivery conditions must be kept at their best, prices must be compared periodically to make sure competitors are not listing the same products cheaper, and unique, special items must be kept on the site.
   - Group 0,1(Seasonal Customer) : VIP
   - Group 2 (Loyal Customer) : Silver VIP
   - Group 3 (Dropshipper) : Gold VIP
4. Run seasonal, holiday, and anniversary promotions by country. If the delivery dates and sales start dates of the products needed for each national event are planned in advance, these products can become the top-selling group of the month even though they do not sell continuously. Repeating the same promotion every year builds an image of a site where event products can be bought quickly, easily, and at low prices.
5. Since many customers are wholesalers, keep records of previously purchased items and show them in a data-based UI/UX on the site, to make repurchases more convenient.
