## Product Cost of Customer Acquisition
---
#### Another Practical Data Science Primer

The marketing team for an eCommerce platform has asked you to help determine how much they should spend to acquire one new customer.

This eCommerce site charges a fee of 10% on its customers' sales.

You are given three tables:
1. Invoice Table: information on every transaction
2. Product Table: contains details about the individual products sold.
3. Customer Table: details about the customer and their location.

##### Questions:
---
1. What is the eCommerce company's customer acquisition cost (CAC)?

    1.1 CAC = (Sales and Marketing Expense) / (Number of New Customers)

2. What is the average lifetime value (LTV) of a customer?

    2.1 What is the LTV to CAC ratio?
    
    2.2 Can the company afford to spend more to acquire a new customer?

3. What is the return rate, and which product is returned the most?

    3.1 Return rate = (total items returned) / (total items sold)

4. If the company decides to expand into another country, which country is the most feasible choice, and why?

5. Which was the most successful quarter in acquiring new customers?

    5.1 Note that this depends on multiple factors.

6. Devise a recommendation system based on the purchase data:

    6.1 If a customer buys product A and B, what is the probability that the customer will buy product C?
    
    6.2 What are the most purchased items by people who purchased product D? Hint: consider collaborative filtering methods.
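
The ratio formulas in questions 1 to 3 can be sketched as small helper functions. This is only an illustration under stated assumptions: the function names are hypothetical, the numbers in the usage lines are made up and are not taken from the three tables, and the LTV value is passed in directly because the brief does not define how it should be computed.

```python
def customer_acquisition_cost(sales_marketing_expense, new_customers):
    """CAC = (sales and marketing expense) / (number of new customers)."""
    if new_customers <= 0:
        raise ValueError("new_customers must be positive")
    return sales_marketing_expense / new_customers


def ltv_to_cac_ratio(ltv, cac):
    """Ratio of customer lifetime value to customer acquisition cost."""
    if cac <= 0:
        raise ValueError("cac must be positive")
    return ltv / cac


def return_rate(items_returned, items_sold):
    """Return rate = (total items returned) / (total items sold)."""
    if items_sold <= 0:
        raise ValueError("items_sold must be positive")
    return items_returned / items_sold


# Hypothetical inputs, for illustration only (not from the dataset)
cac = customer_acquisition_cost(50_000.0, 400)
ratio = ltv_to_cac_ratio(375.0, cac)
rate = return_rate(30, 1_200)
print(f'CAC: {cac:.2f}, LTV/CAC: {ratio:.2f}, return rate: {rate:.1%}')
```

Keeping each formula in its own function makes it easier to swap in the real expense, customer and quantity figures once they have been derived from the tables.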

In [1]:
# import needed libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
# import local files for analysis

df_customer = pd.read_csv('./Customer_info_table.csv')
df_prod = pd.read_csv('./Product_info_table.csv')
df_inv = pd.read_csv('./Invoice_info_table.csv')

In [15]:
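# View the Customer Info table and summary stats.
# Note: DataFrame.info() prints its summary itself and returns None,
# which is why "None" appears in the output below.

print('Customer Info:')
print(f'{df_customer.info()}, \n')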
print(f'Size of Cust Info Table: {df_customer.shape}')
print(f'# of unique customers: {df_customer.CustomerID.nunique()}')
print(f'Are null values listed: {df_customer.CustomerID.isnull().any()}\n')
df_customer.head()

Customer Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4389 entries, 0 to 4388
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   CustomerID  4380 non-null   float64
 1   Country     4389 non-null   object 
dtypes: float64(1), object(1)
memory usage: 68.7+ KB
None, 

Size of Cust Info Table: (4389, 2)
# of unique customers: 4372
Are null values listed: True



|   | CustomerID | Country |
|---|------------|---------|
| 0 | 16143.0 | United Kingdom |
| 1 | 13983.0 | United Kingdom |
| 2 | 15854.0 | United Kingdom |
| 3 | 17634.0 | United Kingdom |
| 4 | 12933.0 | United Kingdom |
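
The summary above reports 4389 rows, 4380 non-null `CustomerID` values and 4372 unique IDs, so the customer table contains both missing and repeated IDs. A minimal sketch of one way to clean it follows. It runs on a toy frame with invented values, and keeping the first row per customer is an assumption: a repeated ID may legitimately appear with more than one country, and that case deserves a closer look.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_customer (invented values, not from the dataset)
toy_customer = pd.DataFrame({
    'CustomerID': [1.0, 2.0, np.nan, 2.0],
    'Country': ['United Kingdom', 'United Kingdom', 'France', 'Germany'],
})

clean_customer = (
    toy_customer
    .dropna(subset=['CustomerID'])                       # drop rows without an ID
    .drop_duplicates(subset='CustomerID', keep='first')  # assumption: keep first row per ID
    .astype({'CustomerID': 'int64'})                     # IDs are whole numbers once NaN is gone
    .reset_index(drop=True)
)

print(clean_customer)
print(f'Rows before: {len(toy_customer)}, rows after: {len(clean_customer)}')
```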


In [11]:
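# Take a look at the Product table.
# Note: DataFrame.info() prints its summary itself and returns None,
# which is why "None" appears in the output below.

print('Product Info:')
print(f'{df_prod.info()}, \n')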
print(f'Size of Product Table: {df_prod.shape}')
print(f'# of unique Products: \n{df_prod.nunique()}\n')
df_prod.head()

Product Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18053 entries, 0 to 18052
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   StockCode    18053 non-null  object 
 1   Description  17093 non-null  object 
 2   UnitPrice    18053 non-null  float64
dtypes: float64(1), object(2)
memory usage: 423.2+ KB
None, 

Size of Product Table: (18053, 3)
# of unique Products: 
StockCode      4070
Description    4211
UnitPrice      1630
dtype: int64



|   | StockCode | Description | UnitPrice |
|---|-----------|-------------|-----------|
| 0 | 22027 | TEA PARTY BIRTHDAY CARD | 0.42 |
| 1 | 90214C | "LETTER ""C"" BLING KEY RING" | 0.85 |
| 2 | 84748 | FOLK FELT HANGING MULTICOL GARLAND | 2.51 |
| 3 | 47585A | PINK FAIRY CAKE CUSHION COVER | 4.21 |
| 4 | 90018A | SILVER M.O.P ORBIT DROP EARRINGS | 4.24 |
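
The product table has 18053 rows but only 4070 unique `StockCode` values, so many products appear with several prices or descriptions. Joining it to the invoices as-is would repeat invoice lines. The sketch below collapses a toy product table to one row per `StockCode`. Using the median price and the first non-null description is an assumption made for illustration, not a rule taken from the source, and the toy values are invented.

```python
import pandas as pd

# Toy stand-in for df_prod (invented values, not from the dataset)
toy_prod = pd.DataFrame({
    'StockCode': ['A1', 'A1', 'A1', 'B2'],
    'Description': ['RED MUG', None, 'RED MUG', 'BLUE PLATE'],
    'UnitPrice': [1.0, 1.5, 2.0, 4.0],
})

prod_unique = (
    toy_prod
    .groupby('StockCode', as_index=False)
    .agg(
        Description=('Description', 'first'),  # first non-null description per code
        UnitPrice=('UnitPrice', 'median'),      # assumption: median as the representative price
    )
)

print(prod_unique)
```

A different price rule (latest, most frequent, or mean) may suit the revenue calculation better; the point is only that each `StockCode` should map to a single price before the tables are joined.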


In [5]:
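# Take a look at the Invoice table.
# Note: DataFrame.info() prints its summary itself and returns None,
# so "Invoice Data: None" is printed after the summary.
print(f'Invoice Data: {df_inv.info()}\n')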
print(f'Size of Invoice Table: {df_inv.shape}')
print(f'# of unique customers: {df_inv.CustomerID.nunique()}\n')
df_inv.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536480 entries, 0 to 536479
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    536480 non-null  object 
 1   StockCode    536480 non-null  object 
 2   Quantity     536480 non-null  int64  
 3   InvoiceDate  536480 non-null  object 
 4   CustomerID   401549 non-null  float64
dtypes: float64(1), int64(1), object(3)
memory usage: 20.5+ MB
Invoice Data: None

Size of Invoice Table: (536480, 5)
# of unique customers: 4372



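|   | InvoiceNo | StockCode | Quantity | InvoiceDate | CustomerID |
|---|-----------|-----------|----------|-------------|------------|
| 0 | 536408 | 22706 | 25 | 12/1/2010 11:41 | 14307.0 |
| 1 | 536528 | 22634 | 1 | 12/1/2010 13:17 | 15525.0 |
| 2 | 536529 | 22164 | 6 | 12/1/2010 13:20 | 14237.0 |
| 3 | 536544 | 22111 | 2 | 12/1/2010 14:32 | NaN |
| 4 | 536544 | 21238 | 4 | 12/1/2010 14:32 | NaN |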


In [6]:
# Note that both tables contain the same number of unique CustomerIDs.

print(f'# of unique customers by invoice: {df_inv.CustomerID.nunique()}')
print(f'# of unique customers by customer database: {df_customer.CustomerID.nunique()}')

# of unique customers by invoice: 4372
# of unique customers by customer database: 4372
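
Matching counts do not by themselves show that both tables hold the same customers. A short sketch of a stricter check compares the sets of IDs directly. The toy Series below are invented stand-ins for `df_inv.CustomerID` and `df_customer.CustomerID`.

```python
import pandas as pd

# Toy stand-ins for the two CustomerID columns (invented values)
toy_inv_ids = pd.Series([1.0, 2.0, 2.0, None])
toy_cust_ids = pd.Series([1.0, 3.0, None])

inv_ids = set(toy_inv_ids.dropna())
cust_ids = set(toy_cust_ids.dropna())

same_count = len(inv_ids) == len(cust_ids)  # what comparing nunique() checks
same_ids = inv_ids == cust_ids              # what is actually needed
only_in_invoices = inv_ids - cust_ids
only_in_customers = cust_ids - inv_ids

print(f'Same number of unique IDs: {same_count}')
print(f'Same set of IDs: {same_ids}')
print(f'Only in invoices: {only_in_invoices}, only in customers: {only_in_customers}')
```

In this toy case the counts agree while the sets differ, which is the situation the count comparison alone cannot detect.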


#### Question 1: Customer Acquisition Cost
---

In [7]:
# The Customer Info and Invoice tables have the same number of unique CustomerIDs,
# so merge them on the CustomerID column.
# reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

df_ci = pd.merge(df_customer, df_inv, how='left', left_on='CustomerID', right_on='CustomerID')
print(f'Size of df_ci: {df_ci.shape}')

# reference for duplicates: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
print(f'Check for duplicated values: {df_ci.duplicated().value_counts()}\n')
df_ci.head()

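# The merged frame has 1,616,852 rows, about three times the 536,480 rows
# in the invoice table, so the merge has multiplied invoice rows rather than
# matching each one once. duplicated() finding no fully identical rows does
# not rule this out; null or repeated CustomerID keys are a likely cause and
# should be checked before this frame is used.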

Size of df_ci: (1616852, 6)
Check for duplicated values: False    1616852
dtype: int64



|   | CustomerID | Country | InvoiceNo | StockCode | Quantity | InvoiceDate |
|---|------------|---------|-----------|-----------|----------|-------------|
| 0 | 16143.0 | United Kingdom | 537211 | 22666 | 6 | 12/5/2010 15:18 |
| 1 | 16143.0 | United Kingdom | 564190 | 23240 | 6 | 8/23/2011 16:01 |
| 2 | 16143.0 | United Kingdom | 552711 | 21218 | 6 | 5/11/2011 8:32 |
| 3 | 16143.0 | United Kingdom | 552711 | 21174 | 12 | 5/11/2011 8:32 |
| 4 | 16143.0 | United Kingdom | 543538 | 22667 | 6 | 2/9/2011 13:57 |
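
The merged frame above has 1,616,852 rows although the invoice table has only 536,480. One plausible explanation, offered here as an assumption rather than a confirmed diagnosis, is that pandas matches null keys to each other during a merge and that repeated `CustomerID` rows in the customer table duplicate their invoices. The toy example below reproduces the effect and sketches a guarded merge using the `validate` argument of `pd.merge`. All values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df_customer and df_inv (invented values)
toy_customer = pd.DataFrame({
    'CustomerID': [1.0, np.nan, np.nan],
    'Country': ['United Kingdom', 'France', 'Germany'],
})
toy_inv = pd.DataFrame({
    'InvoiceNo': ['100', '101', '102', '103'],
    'CustomerID': [1.0, np.nan, np.nan, np.nan],
})

# Naive merge: each NaN customer row pairs with every NaN invoice row
naive = pd.merge(toy_customer, toy_inv, how='left', on='CustomerID')

# Guarded merge: drop null keys, keep one row per customer, and let
# validate raise an error if the customer side is not unique
customers_ok = (
    toy_customer
    .dropna(subset=['CustomerID'])
    .drop_duplicates(subset='CustomerID')
)
invoices_ok = toy_inv.dropna(subset=['CustomerID'])
checked = pd.merge(customers_ok, invoices_ok, how='left',
                   on='CustomerID', validate='one_to_many')

print(f'Invoice rows: {len(toy_inv)}')
print(f'Naive merge rows: {len(naive)}')
print(f'Guarded merge rows: {len(checked)}')
```

Whether invoices without a `CustomerID` should be dropped or analysed separately depends on the question being answered; for per-customer metrics such as CAC and LTV they cannot be attributed to anyone.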


In [23]:
# Count merged rows per invoice timestamp (value_counts sorts from most to least frequent)

df_ci.InvoiceDate.value_counts()

10/31/2011 14:41    10017
12/8/2011 9:28       6741
12/9/2011 10:03      6579
12/5/2011 17:24      6489
6/29/2011 15:58      6345
                    ...  
4/28/2011 18:27         1
7/28/2011 13:40         1
9/2/2011 12:41          1
10/6/2011 12:37         1
12/2/2011 16:32         1
Name: InvoiceDate, Length: 23260, dtype: int64

In [35]:
# pull all invoice dates from the dataframe into new variables
dates = df_ci.InvoiceDate

print(type(dates))
print(type(dates[0]))  # each element of the pandas Series is a str

print(f'\n {dates}')
print(f'\n1st array index for \'dates\' = {dates[0]}')

<class 'pandas.core.series.Series'>
<class 'str'>

 0          12/5/2010 15:18
1          8/23/2011 16:01
2           5/11/2011 8:32
3           5/11/2011 8:32
4           2/9/2011 13:57
                ...       
1616847    11/6/2011 13:53
1616848    12/1/2011 10:38
1616849    11/6/2011 13:53
1616850    12/1/2011 10:38
1616851    12/1/2011 10:38
Name: InvoiceDate, Length: 1616852, dtype: object

1st array index for 'dates' = 12/5/2010 15:18


In [36]:
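# Remove the time from the date/time stamp (not working yet).
# The TypeError below suggests that strftime from datetime.date was called
# on the whole Series rather than on individual date objects.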
# https://stackoverflow.com/questions/29310116/removing-time-from-datetime-variable-in-pandas



TypeError: descriptor 'strftime' for 'datetime.date' objects doesn't apply to a 'Series' object
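
One way around this error is to convert the strings to pandas datetimes first and then use the vectorised `.dt` accessor instead of `datetime.date.strftime`. The sketch below is a hedged example, not the notebook's final approach: it uses three timestamps copied from the output above and assumes every `InvoiceDate` follows the same month/day/year hour:minute layout.

```python
import pandas as pd

# Sample timestamps copied from the InvoiceDate output above
dates = pd.Series(['12/5/2010 15:18', '8/23/2011 16:01', '5/11/2011 8:32'])

# Assumption: all values use month/day/year hour:minute
parsed = pd.to_datetime(dates, format='%m/%d/%Y %H:%M')

date_only = parsed.dt.normalize()         # datetime64 values with the time set to 00:00
as_text = parsed.dt.strftime('%Y-%m-%d')  # plain strings, if text is preferred
quarter = parsed.dt.to_period('Q')        # calendar quarter, useful for question 5

print(pd.DataFrame({'date': date_only, 'text': as_text, 'quarter': quarter}))
```

Keeping the column as `datetime64` (via `normalize`) rather than converting to strings preserves sorting and grouping by date, and the quarter period gives a direct grouping key for comparing customer acquisition across quarters.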