#### TODO: 
- [x] Import purchaseDatesPerUser.csv
- [x] Calculate timedeltas between purchase dates (not for specific product)
- [x] get last purchase date of user (not for specific product)
- [x] add multiples of the mean gap to the most recent purchase date to estimate purchase weeks in February (four weeks)
- [ ] items: partOfMonth purchase of item per user? -> for a specific item over all users there is too much variance
- [ ] calculate **score** for whether a user is more or less active? actual number of purchase days / mean(purchaseDaysPerUser)
- [ ] calculate **trend** of timedeltas between purchase days (shrinking or widening gaps indicate the activity level)?
- [ ] calculate both means for daysToPurchaseBefore (specific product & over all items, but for specific user)
- [ ] Clustering Users by purchases of category (split categories); calculate means like above only for users of one cluster
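
The open **score** and **trend** items could be sketched roughly as below. This is only one possible reading of the TODO notes: the function names `activity_score` and `gap_trend` are hypothetical, and using the slope of a least-squares line as the trend measure is an assumption, not something decided in this notebook.

```python
import numpy as np

def activity_score(purchase_day_count, overall_mean_purchase_days):
    """Ratio of a user's purchase days to the mean over all users (> 1 means more active)."""
    return purchase_day_count / overall_mean_purchase_days

def gap_trend(gaps_in_days):
    """Slope of a least-squares line through the gaps (assumed trend measure).

    A negative slope means the gaps are shrinking (user gets more active),
    a positive slope means the gaps are widening.
    """
    gaps = np.asarray(gaps_in_days, dtype=float)
    if len(gaps) < 2:
        return 0.0  # not enough gaps to fit a line
    slope, _intercept = np.polyfit(np.arange(len(gaps)), gaps, deg=1)
    return float(slope)

# Example values: 13 purchase days against an overall mean of 7.790281
score = activity_score(13, 7.790281)
trend_widening = gap_trend([10, 20, 30, 40])
trend_shrinking = gap_trend([40, 30, 20, 10])
```

A score above 1 would mark a user as more active than average; whether a linear slope is a good enough trend signal for short, noisy gap lists would still need to be checked.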

---

In [40]:
# import libraries
from datetime import datetime, timedelta, date
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 500)
pd.set_option('display.min_rows', 500)
pd.set_option('display.max_columns', None)  # full option name; the short 'max_columns' is ambiguous in newer pandas

### purchaseDatesPerUserAndItem
This DataFrame has one row per user and item, covering purchases from June 2020 through January 2021.

- **userID:**  unique user identifier
- **itemID:** unique item identifier
- **purchaseDates:** list of dates on which the user purchased the item
- **ordersPerPurchase:** list of quantities ordered on each purchase date (same order as purchaseDates)
- **purchaseDaysCount:** number of days on which the item was purchased
- **daysToPurchaseBefore:** number of days since the previous purchase date (same order as purchaseDates; the first entry is 0)
- **lastPurchaseDate:** most recent date on which the item was purchased

In [2]:
filePath = r"C:\Users\LEAND\Coding\knime-workspace\DMC2022\Leander\csv\purchaseDatesPerUserAndItem.csv"

# read file; convert purchaseDates to a list of Timestamps and ordersPerPurchase to a list of ints
purchInfoUserItem = pd.read_csv(filePath, sep="|", converters={
    'purchaseDates': lambda x: [pd.to_datetime(date, format="%Y-%m-%d") for date in x[1:-1].split(',')], # 1:-1 -> don't include brackets, convert to datetime 
    'ordersPerPurchase': lambda x: [int(i) for i in x[1:-1].split(',')] # split String, convert to int
})
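
The inline lambdas can also be written as small named helpers, which makes them easier to test and reuse for the other CSV. The sketch below assumes a cell looks like `[2020-12-11, 2021-01-15]`; the raw CSV is not shown here, so that format, the helper names `parse_date_list` and `parse_int_list`, and the whitespace stripping are assumptions.

```python
import pandas as pd

def parse_date_list(raw):
    """Turn a string such as '[2020-12-11, 2021-01-15]' into a list of Timestamps."""
    inner = raw.strip()[1:-1]  # drop the surrounding brackets
    if not inner.strip():
        return []              # empty list in the CSV cell
    return [pd.to_datetime(part.strip(), format="%Y-%m-%d") for part in inner.split(",")]

def parse_int_list(raw):
    """Turn a string such as '[1, 1]' into a list of ints."""
    inner = raw.strip()[1:-1]
    if not inner.strip():
        return []
    return [int(part) for part in inner.split(",")]

dates = parse_date_list("[2020-12-11, 2021-01-15]")
orders = parse_int_list("[1, 1]")
```

These helpers could then be passed directly as `converters={'purchaseDates': parse_date_list, 'ordersPerPurchase': parse_int_list}`.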

In [3]:
# Conversion-Test
purchInfoUserItem['purchaseDates'][1], purchInfoUserItem['ordersPerPurchase'][1]

([Timestamp('2020-12-11 00:00:00'), Timestamp('2021-01-15 00:00:00')], [1, 1])

In [4]:
# function for calculating days between purchase dates of one item
def calcTimeDeltaToNextIndex(listOfDatetime):
    timeDeltas = [0]
    if len(listOfDatetime) == 1:
        return timeDeltas
    for i in range(0, len(listOfDatetime)-1):
        timeDeltaDays = (listOfDatetime[i+1] - listOfDatetime[i]).days
        timeDeltas.append(timeDeltaDays)
    return timeDeltas
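
As a cross-check, the same gaps can be computed with pandas' own `diff`. This is only a sketch of an alternative, not a replacement for the function above; the name `gaps_in_days` is hypothetical, and the leading 0 is kept to match the convention used in this notebook.

```python
import pandas as pd

def gaps_in_days(dates):
    """Days since the previous purchase; the first entry is 0 by convention."""
    series = pd.Series(pd.to_datetime(dates))
    # diff() yields NaT for the first element, which becomes NaN days and is filled with 0
    return series.diff().dt.days.fillna(0).astype(int).tolist()

# Purchase dates spaced like those of item 20664 for user 0
example = gaps_in_days(["2020-06-05", "2020-10-09", "2020-12-11"])
single = gaps_in_days(["2020-09-01"])
```

For these dates the result should agree with `calcTimeDeltaToNextIndex`, i.e. a list starting with 0 followed by the day differences.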

In [5]:
# calculate timedeltas for every row
purchInfoUserItem['daysToPurchaseBefore'] = purchInfoUserItem['purchaseDates'].apply(calcTimeDeltaToNextIndex)
purchInfoUserItem['lastPurchaseDate'] = purchInfoUserItem['purchaseDates'].apply(lambda x: x[-1])
purchInfoUserItem

Unnamed: 0,userID,itemID,purchaseDates,ordersPerPurchase,purchaseDaysCount,daysToPurchaseBefore,lastPurchaseDate
0,0,1505,[2020-09-01 00:00:00],[1],1,[0],2020-09-01
1,0,6446,"[2020-12-11 00:00:00, 2021-01-15 00:00:00]","[1, 1]",2,"[0, 35]",2021-01-15
2,0,9325,[2020-11-20 00:00:00],[1],1,[0],2020-11-20
3,0,12468,[2020-08-03 00:00:00],[1],1,[0],2020-08-03
4,0,12505,[2020-08-18 00:00:00],[1],1,[0],2020-08-18
5,0,13146,[2021-01-25 00:00:00],[2],1,[0],2021-01-25
6,0,15083,[2020-08-03 00:00:00],[1],1,[0],2020-08-03
7,0,20664,"[2020-06-05 00:00:00, 2020-10-09 00:00:00, 202...","[1, 1, 1]",3,"[0, 126, 63]",2020-12-11
8,0,26387,[2020-10-09 00:00:00],[1],1,[0],2020-10-09
9,0,28231,"[2020-11-20 00:00:00, 2020-12-11 00:00:00, 202...","[1, 1, 2]",3,"[0, 21, 45]",2021-01-25


---

### purchaseDatesPerUser

This DataFrame has one row per user and purchase date, covering June 2020 through January 2021.

- **userID:** unique user identifier
- **dateOfPurchase:** the date the user made a purchase
- **itemsPurchased:** list of items purchased on that date
- **itemcountOnPurchaseDay:** number of items purchased on that date
- **OverallMean(itemcountPerPurchaseDay):** mean number of items purchased per purchase day (all users included)
- **purchaseDatesOfUser:** list of all dates on which the user purchased something
- **TotalPurchaseDaysOfUser:** number of dates on which the user purchased something
- **OverallMean(purchaseDaysPerUser):** mean number of purchase days per user, over all users (Jun20-Jan21)
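
The columns above were computed outside this notebook, so how exactly they were derived is not shown. The sketch below illustrates one plausible way to obtain them from a flat purchase log with pandas. The toy data and the `purchases` / `per_day` names are hypothetical, and treating the overall means as plain averages over user-days and over users is an assumption.

```python
import pandas as pd

# Hypothetical toy log: one row per purchased item
purchases = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2],
    "dateOfPurchase": pd.to_datetime(
        ["2020-06-01", "2020-06-01", "2020-06-07", "2020-06-01", "2020-06-03"]
    ),
    "itemID": [10, 11, 10, 12, 13],
})

# One row per user and purchase date, with the items bought on that day
grouped = purchases.groupby(["userID", "dateOfPurchase"])["itemID"]
per_day = grouped.agg(list).rename("itemsPurchased").reset_index()
per_day["itemcountOnPurchaseDay"] = per_day["itemsPurchased"].apply(len)

# Number of purchase days per user, repeated on each of the user's rows
per_day["TotalPurchaseDaysOfUser"] = (
    per_day.groupby("userID")["dateOfPurchase"].transform("nunique")
)

# Overall means (assumed definitions)
overall_mean_items = per_day["itemcountOnPurchaseDay"].mean()
overall_mean_days = per_day.groupby("userID")["dateOfPurchase"].nunique().mean()
```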

#### Dates per User-Purchase

In [7]:
filePath = r"C:\Users\LEAND\Coding\knime-workspace\DMC2022\Leander\csv\purchaseDatesPerUser.csv"

purchInfoUser = pd.read_csv(filePath, sep="|", converters={
    'dateOfPurchase': lambda date: pd.to_datetime(date, format="%Y-%m-%d"),
    'itemsPurchased': lambda x: [int(item) for item in x[1:-1].split(',')],
    'purchaseDatesOfUser': lambda x: [pd.to_datetime(date, format="%Y-%m-%d") for date in x[1:-1].split(',')]
})

In [8]:
# Show the per-purchase-day table; timedeltas between a user's purchase dates are computed in the grouped table below
purchInfoUser

Unnamed: 0,userID,dateOfPurchase,itemsPurchased,itemcountOnPurchaseDay,OverallMean(itemcountPerPurchaseDay),purchaseDatesOfUser,TotalPurchaseDaysOfUser,OverallMean(purchaseDaysPerUser)
0,4,2020-06-01,"[18860, 30779]",2,2.981231,"[2020-06-01 00:00:00, 2020-06-07 00:00:00, 202...",13,7.790281
1,20,2020-06-01,[18613],1,2.981231,"[2020-06-01 00:00:00, 2020-06-22 00:00:00, 202...",7,7.790281
2,55,2020-06-01,"[9547, 10844, 17912, 24763]",4,2.981231,"[2020-06-01 00:00:00, 2020-06-23 00:00:00, 202...",10,7.790281
3,76,2020-06-01,"[2787, 23050, 26645]",3,2.981231,"[2020-06-01 00:00:00, 2020-06-13 00:00:00, 202...",13,7.790281
4,89,2020-06-01,[6287],1,2.981231,"[2020-06-01 00:00:00, 2020-08-23 00:00:00, 202...",4,7.790281
5,116,2020-06-01,"[9408, 10360, 25677, 29809, 31548]",5,2.981231,"[2020-06-01 00:00:00, 2020-09-18 00:00:00, 202...",5,7.790281
6,135,2020-06-01,"[13660, 14930, 22174]",3,2.981231,"[2020-06-01 00:00:00, 2020-06-26 00:00:00, 202...",10,7.790281
7,202,2020-06-01,"[5654, 26940]",2,2.981231,"[2020-06-01 00:00:00, 2020-06-27 00:00:00, 202...",9,7.790281
8,204,2020-06-01,"[16006, 28812]",2,2.981231,"[2020-06-01 00:00:00, 2020-09-27 00:00:00, 202...",5,7.790281
9,240,2020-06-01,"[7318, 18843, 18884, 26645]",4,2.981231,"[2020-06-01 00:00:00, 2020-06-21 00:00:00, 202...",15,7.790281


#### Grouped by User; all Purchase Dates + Info in one Row

In [97]:
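def getWeekOfFebruary(date):
    # return the week of February (1-4) that the date falls in, or None for any other month
    if date.month != 2:
        return None
    # 28 days / 4 weeks = 7 days per week; a Feb 29 in a leap year would map to week 5
    return int(np.ceil(date.day / 7))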
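
To make the mapping concrete, here is a small standalone version using only the standard library, applied to a few example dates. The name `week_of_february` is hypothetical, and it is meant as an illustration of the 7-day bucketing rather than a replacement for the function above.

```python
from datetime import date
import math

def week_of_february(d):
    """Return 1-4 for the 7-day block of February that d falls in, else None."""
    if d.month != 2:
        return None
    return math.ceil(d.day / 7)

examples = [date(2021, 2, 4), date(2021, 2, 17), date(2021, 2, 23), date(2021, 3, 12)]
weeks = [week_of_february(d) for d in examples]
```

Days 1-7 fall into week 1, 8-14 into week 2, 15-21 into week 3 and 22-28 into week 4; any date outside February yields `None`.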

In [140]:
# read csv again, no conversion of strings

filePath = r"C:\Users\LEAND\Coding\knime-workspace\DMC2022\Leander\csv\purchaseDatesPerUser.csv"
userPurchaseDates = pd.read_csv(filePath, sep="|")

# keep one row per user: group by userID & purchaseDatesOfUser (identical on every row of a user), then drop the helper count column
userPurchaseDates = userPurchaseDates.groupby(by=['userID', 'purchaseDatesOfUser']).size().reset_index(name='count')
userPurchaseDates = userPurchaseDates.drop(columns = ['count'])

# convert string to list of datetimes
userPurchaseDates['purchaseDatesOfUser'] = userPurchaseDates['purchaseDatesOfUser'].apply(
    lambda x: [pd.to_datetime(date, format="%Y-%m-%d") for date in x[1:-1].split(',')]
)

# calculate timedeltas between purchase-days of user
userPurchaseDates['daysToPurchaseDateBefore'] = userPurchaseDates['purchaseDatesOfUser'].apply(calcTimeDeltaToNextIndex)

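# calculate mean of days between purchases; note that the leading 0 placeholder is part of the list,
# so this divides by the number of purchase days rather than the number of gaps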
userPurchaseDates['mean(DaysBetweenPurchases)'] = userPurchaseDates['daysToPurchaseDateBefore'].apply(
    lambda listOfDays: sum(listOfDays)/len(listOfDays)
)

# calculate the standard deviation of the days between purchases to gauge how reliable the mean is as an estimate
userPurchaseDates['std(DaysBetweenPurchases)'] = userPurchaseDates['daysToPurchaseDateBefore'].apply(
    lambda listOfDays: np.std(listOfDays)
)

# get last (recent) purchase date
userPurchaseDates['recentPurchaseDate'] = userPurchaseDates['purchaseDatesOfUser'].apply(
    lambda dates: dates[-1]
)

# round meanDays
userPurchaseDates['meanDaysBetweenPur'] = userPurchaseDates['mean(DaysBetweenPurchases)'].apply(
    lambda mean: round(mean)
)

# project the next four purchase dates by adding 1x to 4x the rounded mean gap to the most recent purchase date
userPurchaseDates['recentPurchase+1'] = userPurchaseDates['recentPurchaseDate'] + pd.to_timedelta(userPurchaseDates['meanDaysBetweenPur'] * 1, unit='D')
userPurchaseDates['recentPurchase+2'] = userPurchaseDates['recentPurchaseDate'] + pd.to_timedelta(userPurchaseDates['meanDaysBetweenPur'] * 2, unit='D')
userPurchaseDates['recentPurchase+3'] = userPurchaseDates['recentPurchaseDate'] + pd.to_timedelta(userPurchaseDates['meanDaysBetweenPur'] * 3, unit='D')
userPurchaseDates['recentPurchase+4'] = userPurchaseDates['recentPurchaseDate'] + pd.to_timedelta(userPurchaseDates['meanDaysBetweenPur'] * 4, unit='D')

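# drop() returns a new DataFrame; the result is not assigned here, so the rounded
# meanDaysBetweenPur column stays in userPurchaseDates and still appears in the output.
# To actually remove it: userPurchaseDates = userPurchaseDates.drop(columns=['meanDaysBetweenPur'])
userPurchaseDates.drop(columns=['meanDaysBetweenPur'])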

# map each projected date to its week of February (None if the date falls outside February)
userPurchaseDates['estPurchaseWeeksFeb'] = userPurchaseDates[[
    'recentPurchase+1',
    'recentPurchase+2',
    'recentPurchase+3',
    'recentPurchase+4']].values.tolist()
userPurchaseDates['estPurchaseWeeksFeb'] = userPurchaseDates['estPurchaseWeeksFeb'].apply(
    lambda x: [getWeekOfFebruary(pd.to_datetime(date, format="%Y-%m-%d")) for date in x]
)

userPurchaseDates

Unnamed: 0,userID,purchaseDatesOfUser,daysToPurchaseDateBefore,mean(DaysBetweenPurchases),std(DaysBetweenPurchases),recentPurchaseDate,meanDaysBetweenPur,recentPurchase+1,recentPurchase+2,recentPurchase+3,recentPurchase+4,estPurchaseWeeksFeb
0,0,"[2020-06-05 00:00:00, 2020-08-03 00:00:00, 202...","[0, 59, 15, 14, 38, 42, 14, 7, 35, 10]",23.400000,17.900838,2021-01-25,23,2021-02-17,2021-03-12,2021-04-04,2021-04-27,"[3, None, None, None]"
1,1,"[2020-07-07 00:00:00, 2020-09-01 00:00:00, 202...","[0, 56, 6, 22, 19, 22, 20, 10, 24, 14]",19.300000,14.311184,2021-01-16,19,2021-02-04,2021-02-23,2021-03-14,2021-04-02,"[1, 4, None, None]"
2,2,"[2020-06-29 00:00:00, 2020-07-19 00:00:00, 202...","[0, 20, 22, 44, 9, 17, 34, 6, 21, 37]",21.000000,13.349157,2021-01-25,21,2021-02-15,2021-03-08,2021-03-29,2021-04-19,"[3, None, None, None]"
3,3,"[2020-06-10 00:00:00, 2020-06-12 00:00:00, 202...","[0, 2, 18, 21, 30, 3, 3, 45, 2, 8, 5, 27, 24, ...",14.500000,12.634279,2021-01-28,14,2021-02-11,2021-02-25,2021-03-11,2021-03-25,"[2, 4, None, None]"
4,4,"[2020-06-01 00:00:00, 2020-06-07 00:00:00, 202...","[0, 6, 10, 27, 33, 14, 19, 41, 18, 13, 49, 6, 6]",18.615385,14.285757,2021-01-29,19,2021-02-17,2021-03-08,2021-03-27,2021-04-15,"[3, None, None, None]"
5,5,"[2020-11-18 00:00:00, 2020-11-30 00:00:00, 202...","[0, 12, 56]",22.666667,24.073960,2021-01-25,23,2021-02-17,2021-03-12,2021-04-04,2021-04-27,"[3, None, None, None]"
6,6,"[2020-06-21 00:00:00, 2020-07-07 00:00:00, 202...","[0, 16, 64, 4, 1, 88, 15, 2, 20]",23.333333,29.518356,2021-01-17,23,2021-02-09,2021-03-04,2021-03-27,2021-04-19,"[2, None, None, None]"
7,7,"[2020-08-16 00:00:00, 2020-10-05 00:00:00, 202...","[0, 50, 69, 42]",40.250000,25.222758,2021-01-24,40,2021-03-05,2021-04-14,2021-05-24,2021-07-03,"[None, None, None, None]"
8,8,"[2020-06-08 00:00:00, 2020-06-18 00:00:00, 202...","[0, 10, 38, 17, 140]",41.000000,51.045078,2020-12-30,41,2021-02-09,2021-03-22,2021-05-02,2021-06-12,"[2, None, None, None]"
9,9,"[2020-07-18 00:00:00, 2020-10-19 00:00:00, 202...","[0, 93, 12, 80]",46.250000,40.733125,2021-01-19,46,2021-03-06,2021-04-21,2021-06-06,2021-07-22,"[None, None, None, None]"
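
Because `daysToPurchaseDateBefore` starts with a 0 placeholder, the mean in the table is taken over one more element than there are actual gaps. The sketch below compares both variants for the gap list of userID 0; whether the placeholder should be excluded is a modelling decision that this notebook has not made, so the second variant is only an assumption about what might be intended.

```python
import numpy as np

# Gap list of userID 0, copied from the output above (first entry is the placeholder)
gaps = [0, 59, 15, 14, 38, 42, 14, 7, 35, 10]

# Variant used in the notebook: placeholder included
mean_with_placeholder = sum(gaps) / len(gaps)

# Alternative: only the real gaps between consecutive purchase days
real_gaps = gaps[1:]
mean_gap = float(np.mean(real_gaps))
std_gap = float(np.std(real_gaps))  # population std, same default as np.std in the notebook
```

For this user the difference is a few days per projected step, which can accumulate enough over four steps to move a projected date into a different week of February.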
