
An examination of the retail transaction dataset available on Kaggle:
https://www.kaggle.com/datasets/regivm/retailtransactiondata

Common reasons for data requests:


*   Surveying (identify our **X** customers)
*   Investing (allocate resources for a digital marketing initiative)
*   Creating (create a loyalty program for **X** customers)

Establishing the reason for the request is critical because a data request is usually a means to an end.


In [None]:
import pandas as pd
import datetime as dt

We can parse the dates while reading the file into a pandas DataFrame. Alternatively, we can convert them with pd.to_datetime() after loading.
**Before parsing, the dates were of object (string) type.**
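
As a minimal sketch of the two options mentioned above, the following compares parsing on load with converting afterwards. The three-row sample and the variable names are hypothetical stand-ins for the real file, and the ISO date format is an assumption about how the dates are stored.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same column layout as the dataset.
sample_csv = io.StringIO(
    "customer_id,trans_date,tran_amount\n"
    "FM5295,2017-11-11,35\n"
    "FM4768,2019-12-15,39\n"
    "FM2122,2017-11-26,52\n"
)

# Option 1: parse the date column while reading.
parsed_on_load = pd.read_csv(sample_csv, parse_dates=["trans_date"])

# Option 2: read the column as plain strings, then convert afterwards.
sample_csv.seek(0)
parsed_after = pd.read_csv(sample_csv)
parsed_after["trans_date"] = pd.to_datetime(parsed_after["trans_date"])

# Both routes should end with a datetime column holding the same values.
same_values = (parsed_on_load["trans_date"] == parsed_after["trans_date"]).all()
print(parsed_on_load["trans_date"].dtype, parsed_after["trans_date"].dtype)
print(same_values)
```

Parsing on load saves a step, while converting afterwards makes it easier to inspect the raw strings first if the format is in doubt.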

In [None]:
data = pd.read_csv("/content/drive/MyDrive/Dataset/rfm_xmas19.txt", parse_dates=["trans_date"])

The dataset has 125,000 rows and 3 columns.

In [None]:
print(data.shape)

(125000, 3)


In [None]:
print(data.head())
print(data.tail())

  customer_id trans_date  tran_amount
0      FM5295 2017-11-11           35
1      FM4768 2019-12-15           39
2      FM2122 2017-11-26           52
3      FM1217 2016-08-16           99
4      FM1850 2018-08-20           78
       customer_id trans_date  tran_amount
124995      FM8433 2016-03-26           64
124996      FM7232 2019-05-19           38
124997      FM8731 2019-08-28           42
124998      FM8133 2018-09-14           13
124999      FM7996 2019-09-13           36


In [None]:
group_by_customer = data.groupby("customer_id")
print(group_by_customer.head())

       customer_id trans_date  tran_amount
0           FM5295 2017-11-11           35
1           FM4768 2019-12-15           39
2           FM2122 2017-11-26           52
3           FM1217 2016-08-16           99
4           FM1850 2018-08-20           78
...            ...        ...          ...
124305      FM8376 2017-07-22           69
124326      FM8077 2017-10-13           37
124395      FM7856 2019-10-10           46
124602      FM7822 2019-04-08           74
124878      FM7856 2017-03-28           50

[34439 rows x 3 columns]


In [None]:
last_transaction = group_by_customer['trans_date'].max()
print(last_transaction.head())

customer_id
FM1112   2019-10-14
FM1113   2019-11-09
FM1114   2019-11-12
FM1115   2019-12-05
FM1116   2019-05-25
Name: trans_date, dtype: datetime64[ns]


In [None]:
last_transaction.shape

(6889,)

In [None]:
best_churn = pd.DataFrame(last_transaction)
print(best_churn.head())

            trans_date
customer_id           
FM1112      2019-10-14
FM1113      2019-11-09
FM1114      2019-11-12
FM1115      2019-12-05
FM1116      2019-05-25


In [None]:
cut_off_date = dt.datetime(2019, 9, 16)

In [None]:
# 1 if the customer's last transaction was before the cut-off date, else 0
best_churn['churned'] = (best_churn['trans_date'] < cut_off_date).astype(int)

In [None]:
best_churn['churned'].value_counts()

0    4695
1    2194
Name: churned, dtype: int64

In [None]:
churned_customers = best_churn[best_churn['churned'] == 1]

In [None]:
churned_customers.shape

(2194, 2)

In [None]:
print(best_churn)

            trans_date  churned
customer_id                    
FM1112      2019-10-14        0
FM1113      2019-11-09        0
FM1114      2019-11-12        0
FM1115      2019-12-05        0
FM1116      2019-05-25        1
...                ...      ...
FM8996      2019-09-09        1
FM8997      2019-03-28        1
FM8998      2019-09-22        0
FM8999      2019-04-02        1
FM9000      2019-11-28        0

[6889 rows x 2 columns]


In [None]:
best_churn["nr_of_transactions"] = group_by_customer.size()

In [None]:
print(best_churn['nr_of_transactions'])

customer_id
FM1112    15
FM1113    20
FM1114    19
FM1115    22
FM1116    13
          ..
FM8996    13
FM8997    14
FM8998    13
FM8999    12
FM9000    13
Name: nr_of_transactions, Length: 6889, dtype: int64


In [None]:
print(group_by_customer.head())

       customer_id trans_date  tran_amount
0           FM5295 2017-11-11           35
1           FM4768 2019-12-15           39
2           FM2122 2017-11-26           52
3           FM1217 2016-08-16           99
4           FM1850 2018-08-20           78
...            ...        ...          ...
124305      FM8376 2017-07-22           69
124326      FM8077 2017-10-13           37
124395      FM7856 2019-10-10           46
124602      FM7822 2019-04-08           74
124878      FM7856 2017-03-28           50

[34439 rows x 3 columns]


In [None]:
print(group_by_customer.size())

customer_id
FM1112    15
FM1113    20
FM1114    19
FM1115    22
FM1116    13
          ..
FM8996    13
FM8997    14
FM8998    13
FM8999    12
FM9000    13
Length: 6889, dtype: int64


In [None]:
best_churn['total_spent'] = group_by_customer['tran_amount'].sum()


In [None]:
print(best_churn.head())

            trans_date  churned  nr_of_transactions  total_spent
customer_id                                                     
FM1112      2019-10-14        0                  15         1012
FM1113      2019-11-09        0                  20         1490
FM1114      2019-11-12        0                  19         1432
FM1115      2019-12-05        0                  22         1659
FM1116      2019-05-25        1                  13          857


In [None]:
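# The date column is no longer needed; pop() removes it in place (the returned Series is discarded here)
best_churn.pop('trans_date')
print(best_churn.head())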

             churned  nr_of_transactions  total_spent
customer_id                                          
FM1112             0                  15         1012
FM1113             0                  20         1490
FM1114             0                  19         1432
FM1115             0                  22         1659
FM1116             1                  13          857


In [None]:
best_churn['rank'] = (best_churn['nr_of_transactions'] / 2) + (best_churn['total_spent'] / 2)

The date column was popped from **best_churn** because it was no longer relevant after the transactions were aggregated into a transaction count and a spend total per customer. I should have stored the popped column in a variable, and I need to explore when drop is a better choice than pop.

.pop():

This method is used to remove a single column from a DataFrame and return that column. The changes are done in-place, which means the original DataFrame is modified. It's most commonly used when you want to use a column of data separately, but you don't need it in the original DataFrame anymore.



Ideal use case for .pop(): When you want to remove a single column and use the removed column for further operations.

.drop():

This method is used to remove rows or columns by specifying labels. You can remove multiple columns or rows at once, and by default, it does not modify the original DataFrame; it returns a new one instead.

If you want to drop in-place (i.e., modify the original DataFrame), you can use df.drop('column_name', axis=1, inplace=True).

Ideal use case for .drop(): When you want to remove one or more columns or rows and don't need them for further operations.

Core differences:

*   .pop() operates in-place and can only operate on a single column, returning the popped column. .drop() operates on a copy by default, can remove multiple rows or columns at once, and returns the modified DataFrame.
*   .pop() only operates on columns, while .drop() can operate on either rows or columns.
*   They are not completely interchangeable because of these differences. For example, you can't use .pop() to remove a row or multiple columns, and you can't use .drop() to remove a column and return it for further use (unless you take additional steps to keep track of the dropped data).
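
The difference described above can be sketched on a small frame. The two-row DataFrame and the variable names here are hypothetical; only the column names are borrowed from this notebook.

```python
import pandas as pd

# Hypothetical frame mirroring the columns used in this notebook.
df = pd.DataFrame(
    {
        "trans_date": pd.to_datetime(["2019-10-14", "2019-05-25"]),
        "churned": [0, 1],
        "total_spent": [1012, 857],
    },
    index=pd.Index(["FM1112", "FM1116"], name="customer_id"),
)

# .pop() removes the column from df itself and hands it back as a Series.
last_dates = df.pop("trans_date")

# .drop() returns a new frame; df keeps its columns unless it is reassigned.
without_spend = df.drop(columns=["total_spent"])

print(list(df.columns))             # ['churned', 'total_spent']
print(list(without_spend.columns))  # ['churned']
print(last_dates.name)              # trans_date
```

Storing the result of .pop() in a variable, as done here with `last_dates`, keeps the removed column available for later use.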


In [None]:
print(best_churn.head())

             churned  nr_of_transactions  total_spent   rank
customer_id                                                 
FM1112             0                  15         1012  513.5
FM1113             0                  20         1490  755.0
FM1114             0                  19         1432  725.5
FM1115             0                  22         1659  840.5
FM1116             1                  13          857  435.0


In [None]:
best_churn[
    ["nr_of_transactions", "total_spent"]
].describe().loc[["min", "max"]]

Unnamed: 0,nr_of_transactions,total_spent
min,4.0,149.0
max,39.0,2933.0


In [None]:
best_churn[
    ["nr_of_transactions", "total_spent"]
].describe().loc[["mean"]]

Unnamed: 0,nr_of_transactions,total_spent
mean,18.144869,1179.269705


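The backslashes are line continuations. They are required wherever the expression breaks onto a new line outside of parentheses; inside parentheses they are optional.

This formula takes each value in the column, subtracts the column's minimum from it, and divides the result by the column's range (the maximum minus the minimum).

This scales every value relative to the range, so the smallest value becomes 0 and the largest becomes 1.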
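
One way to avoid repeating the formula for each column is a small helper. This is only a sketch: the function name `min_max_scale` and the zero-range fallback are assumptions, not part of the original notebook. The sample values are the dataset's minimum (4) and maximum (39) transaction counts plus the count for FM1112 (15).

```python
import pandas as pd


def min_max_scale(column: pd.Series) -> pd.Series:
    """Rescale a numeric column to the 0-1 range relative to its own min and max."""
    col_min = column.min()
    col_range = column.max() - col_min
    if col_range == 0:
        # All values are identical, so a relative position is undefined; return zeros.
        return pd.Series(0.0, index=column.index, name=column.name)
    return (column - col_min) / col_range


# Minimum, FM1112, and maximum transaction counts from the outputs above.
counts = pd.Series([4, 15, 39], name="nr_of_transactions")
scaled = min_max_scale(counts)
print(scaled.round(6).tolist())  # [0.0, 0.314286, 1.0]
```

The middle value matches the scaled_tran figure shown for FM1112 in the output further down.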

In [None]:
best_churn["scaled_tran"] = (best_churn["nr_of_transactions"] \
                             - best_churn["nr_of_transactions"].min()) \
                             / (best_churn["nr_of_transactions"].max() \
                             - best_churn["nr_of_transactions"].min())

In [None]:
best_churn['scaled_spent'] = (best_churn['total_spent'] \
                              - best_churn['total_spent'].min()) \
                              / (best_churn['total_spent'].max() \
                              - best_churn['total_spent'].min())



In [None]:
print(best_churn.head())

             churned  nr_of_transactions  total_spent   rank  scaled_tran  \
customer_id                                                                 
FM1112             0                  15         1012  513.5     0.314286   
FM1113             0                  20         1490  755.0     0.457143   
FM1114             0                  19         1432  725.5     0.428571   
FM1115             0                  22         1659  840.5     0.514286   
FM1116             1                  13          857  435.0     0.257143   

             scaled_spent  
customer_id                
FM1112           0.309986  
FM1113           0.481681  
FM1114           0.460848  
FM1115           0.542385  
FM1116           0.254310  


Now that both amounts are scaled, we combine them into a new score column so that we can compare the ranking before and after scaling.

In [None]:
best_churn['score'] = 100*(.5*best_churn['scaled_tran'] + .5*best_churn['scaled_spent'])

In [None]:
print(best_churn.head())

             churned  nr_of_transactions  total_spent   rank  scaled_tran  \
customer_id                                                                 
FM1112             0                  15         1012  513.5     0.314286   
FM1113             0                  20         1490  755.0     0.457143   
FM1114             0                  19         1432  725.5     0.428571   
FM1115             0                  22         1659  840.5     0.514286   
FM1116             1                  13          857  435.0     0.257143   

             scaled_spent      score  
customer_id                           
FM1112           0.309986  31.213567  
FM1113           0.481681  46.941195  
FM1114           0.460848  44.470956  
FM1115           0.542385  52.833539  
FM1116           0.254310  25.572660  


Because we wanted to weigh both variables evenly, scaling balances them and gives us a 100-point ranking system.
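
To illustrate why scaling matters for an even weighting, here is a sketch on three hypothetical customers (A, B, C), chosen so that the unscaled and scaled measures disagree. The numbers are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical customers: A shops often, B spends slightly more in far fewer visits.
toy = pd.DataFrame(
    {"nr_of_transactions": [30, 10, 4], "total_spent": [1000, 1100, 149]},
    index=pd.Index(["A", "B", "C"], name="customer_id"),
)
features = toy[["nr_of_transactions", "total_spent"]]

# Unscaled average: spend is in the hundreds, so it swamps the transaction count.
toy["rank"] = features["nr_of_transactions"] / 2 + features["total_spent"] / 2

# Min-max scale each column, then average with equal weights on a 0-100 scale.
scaled = (features - features.min()) / (features.max() - features.min())
toy["score"] = 100 * scaled.mean(axis=1)

print(toy[["rank", "score"]])
print(toy["rank"].idxmax(), toy["score"].idxmax())  # B A
```

In this toy case the unscaled rank puts B first on spend alone, while the scaled score puts A first because transaction frequency now carries equal weight.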

In [None]:
val_range = best_churn[
    ['rank', 'score']
].describe().loc[['min', 'max', 'mean']]    
print(val_range)

             rank       score
min     76.500000    0.000000
max   1486.000000  100.000000
mean   598.707287   38.710362


In [None]:
best_churn.sort_values('score', inplace=True, ascending=False)

The top customers are still the top customers, but scaling our rank to a 100-point system makes it easier to understand.

In [None]:
print(best_churn)

             churned  nr_of_transactions  total_spent    rank  scaled_tran  \
customer_id                                                                  
FM4424             0                  39         2933  1486.0     1.000000   
FM4320             0                  38         2647  1342.5     0.971429   
FM3799             1                  36         2513  1274.5     0.914286   
FM5109             0                  35         2506  1270.5     0.885714   
FM3805             1                  35         2453  1244.0     0.885714   
...              ...                 ...          ...     ...          ...   
FM7716             1                   4          221   112.5     0.000000   
FM7224             1                   4          191    97.5     0.000000   
FM8504             0                   4          190    97.0     0.000000   
FM8559             1                   4          157    80.5     0.000000   
FM7333             1                   4          149    76.5   

In [None]:
print(data.head())

  customer_id trans_date  tran_amount
0      FM5295 2017-11-11           35
1      FM4768 2019-12-15           39
2      FM2122 2017-11-26           52
3      FM1217 2016-08-16           99
4      FM1850 2018-08-20           78


In [None]:
mean_tran_amount = data['tran_amount'].mean()
print(mean_tran_amount)

64.991912


Based on a $1000 coupon budget, we set the coupon value to 30% of the average transaction amount (a reduction of roughly two-thirds) and then divide the budget by the coupon value. This gives the number of customers we can offer a reasonable discount to entice them to shop.

In [None]:
coupon = mean_tran_amount * 0.3  # coupon value: 30% of the average transaction
nr_of_customers = 1000 / coupon  # how many coupons the $1000 budget covers

In [None]:
print(coupon, nr_of_customers)
print(coupon, nr_of_customers, sep="\n")

19.4975736 51.28843314123969
19.4975736
51.28843314123969
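
Since only whole coupons can be issued, the customer count can be rounded down. The sketch below reuses the figures printed above; the names `coupon_budget`, `coupon_value`, `whole_coupons`, and `leftover_budget` are introduced here for illustration only.

```python
import math

coupon_budget = 1000          # total budget in dollars, as stated above
mean_tran_amount = 64.991912  # average transaction amount printed earlier
coupon_value = mean_tran_amount * 0.3

# Only whole coupons can be issued, so round the customer count down.
whole_coupons = math.floor(coupon_budget / coupon_value)
leftover_budget = coupon_budget - whole_coupons * coupon_value

print(whole_coupons)  # 51
print(round(leftover_budget, 2))
```

The budget covers 51 coupons at this value, so selecting the top 50 churned customers in the cells below stays within it.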


In [None]:
top_50_churned = best_churn[best_churn['churned'] == 1]
top_50_churned = top_50_churned.head(50)
top_50_churned.to_csv('/content/drive/MyDrive/Dataset/retail_data.txt')

In [None]:
top_50_churnedb = best_churn.loc[best_churn['churned'] == 1].head(50)
top_50_churnedb.to_csv('/content/drive/MyDrive/Dataset/retail_datab.csv')

In [None]:
print(top_50_churned.head())

In [None]:
print(top_50_churnedb.head())