# Purpose of this file:
Authors: User team | 16.6.2022

0. All features are calculated from the orders placed **before 04.01.2021**. The source data is **new_orders_aggregation.csv**.
1. Feature **u_EF** (engagement feature): the number of distinct months in which a userID placed at least one order during the period.
    - How it is calculated:
        * For example, userID 10 placed the following orders over time:
            * date : order
            * 2.6.21 : 1
            * 13.6.21 : 2
            * 27.6.21 : 1
            * 10.10.21 : 3
        --> **u_EF = 2** (the orders fall in two distinct months, June and October)
2. Feature **u_avg_orders**: the average order quantity of each userID over time. It is calculated as the sum of the `order` column for a userID divided by the number of that userID's order rows.
3. Feature **u_avg_period**: the average cycle, i.e. the average number of days between consecutive order days of a userID.
    - Preprocessing data: 
        * keeping only userID and date
        * dropping duplicate rows (rows with the same userID and date)
    - Then calculating:
        * For example, suppose user A ordered on the days normalized_day_list = [23, 41, 50]:
            * **u_avg_period = ((50 - 41) + (41 - 23)) / (len(normalized_day_list) - 1) = 13.5**
            * In other words, we calculate the average distance between consecutive elements of normalized_day_list
4. Features **u_first_bought** and **u_last_bought**: the first and the last normalized day on which a userID placed an order
5. Feature **u_std_avg_period**: the standard deviation of the distances between consecutive order days, i.e. of the values that u_avg_period averages
    - It tells us how spread out the purchase intervals of each user are
    - Method of calculation:
        * For example, suppose user A ordered on the days normalized_day_list = [23, 41, 50]:
            * **u_avg_period = ((50 - 41) + (41 - 23)) / (len(normalized_day_list) - 1) = 13.5**
            * the distances between consecutive days in normalized_day_list are 18 and 9
            * **u_std_avg_period** = $\sqrt{\tfrac{1}{2}\left((9 - 13.5)^2 + (18 - 13.5)^2\right)} = 4.5$
6. Feature **u_mean_bought**: the mean of all normalized days on which a user placed an order during the period
    - For example, suppose user A ordered on the days normalized_day_list = [23, 41, 50]:
        * **u_mean_bought = (23 + 41 + 50) / len(normalized_day_list) = 38**
7. Feature **u_std_bought**: the standard deviation of the days in normalized_day_list around u_mean_bought.
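
The day-based definitions above can be checked on the example user A. This is a minimal sketch, assuming NumPy's default population standard deviation (`ddof=0`), which is what `np.std` uses later in this notebook; the variable names simply mirror the feature names.

```python
import numpy as np

# Hypothetical user A from the examples above
normalized_day_list = np.array([23, 41, 50])

# Gaps between consecutive order days: [18, 9]
gaps = np.diff(normalized_day_list)

u_avg_period = gaps.mean()                  # (18 + 9) / 2 = 13.5
u_std_avg_period = gaps.std()               # population std of [18, 9] = 4.5
u_first_bought = normalized_day_list[0]     # 23
u_last_bought = normalized_day_list[-1]     # 50
u_mean_bought = normalized_day_list.mean()  # 114 / 3 = 38.0
u_std_bought = normalized_day_list.std()    # sqrt(126), about 11.22

print(u_avg_period, u_std_avg_period, u_mean_bought, round(u_std_bought, 2))
```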


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import math

## I. Importing and preprocessing data

### 1. Import dataset

In [2]:
orders = pd.read_csv('new_orders_aggregation.csv', index_col = 0, sep = '|')
orders



Unnamed: 0,userID,itemID,date,order
0,0,1505,2020-09-01,1
1,0,6446,2020-12-11,1
2,0,6446,2021-01-15,1
3,0,9325,2020-11-20,1
4,0,12468,2020-08-03,1
...,...,...,...,...
1071015,46137,22403,2021-01-18,1
1071016,46137,22583,2021-01-31,1
1071017,46137,28343,2020-08-08,1
1071018,46137,28900,2020-08-08,2


### 2. Normalizing dates from 01.06.2020 to 31.01.2021 in orders
- normalized_day counts the days starting from 1, where 1 corresponds to 01.06.2020 in the date column

In [3]:
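# Reference date: normalized_day 1 corresponds to 01.06.2020
date = datetime.strptime("01.06.2020", "%d.%m.%Y")
# The date column is in ISO format (YYYY-MM-DD), which pd.to_datetime parses directly
orders['date'] = pd.to_datetime(orders['date'])
orders['month'] = orders['date'].dt.month
# Vectorized, 1-based day offset from the reference date
orders['normalized_day'] = (orders['date'] - date).dt.days + 1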
orders

Unnamed: 0,userID,itemID,date,order,month,normalized_day
0,0,1505,2020-09-01,1,9,93
1,0,6446,2020-12-11,1,12,194
2,0,6446,2021-01-15,1,1,229
3,0,9325,2020-11-20,1,11,173
4,0,12468,2020-08-03,1,8,64
...,...,...,...,...,...,...
1071015,46137,22403,2021-01-18,1,1,232
1071016,46137,22583,2021-01-31,1,1,245
1071017,46137,28343,2020-08-08,1,8,69
1071018,46137,28900,2020-08-08,2,8,69


In [4]:
# Export the preprocessed orders for later use when creating features for the submission file
#orders.to_csv('preprocessed_orders_till_31.csv')

### 3. Copying all data before 4.1.2021 to a new dataframe and calculating all features based on the new dataframe

In [5]:
# normalized_day 218 corresponds to 04.01.2021, so this keeps all orders placed before that date
df_orders = orders[orders['normalized_day'] < 218]
df_orders

Unnamed: 0,userID,itemID,date,order,month,normalized_day
0,0,1505,2020-09-01,1,9,93
1,0,6446,2020-12-11,1,12,194
3,0,9325,2020-11-20,1,11,173
4,0,12468,2020-08-03,1,8,64
5,0,12505,2020-08-18,1,8,79
...,...,...,...,...,...,...
1071011,46137,2667,2020-09-17,1,9,109
1071014,46137,20209,2020-08-08,1,8,69
1071017,46137,28343,2020-08-08,1,8,69
1071018,46137,28900,2020-08-08,2,8,69
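
The cutoff `normalized_day < 218` can be verified directly from the two dates. The sketch below only uses the standard library; `start` and `cutoff` are illustrative names, not variables from the notebook.

```python
from datetime import datetime

start = datetime.strptime("01.06.2020", "%d.%m.%Y")   # normalized_day 1
cutoff = datetime.strptime("04.01.2021", "%d.%m.%Y")  # first excluded day

# Same 1-based convention as the normalized_day column
cutoff_day = (cutoff - start).days + 1
print(cutoff_day)  # 218
```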


## II. Creating the features u_EF (Engagement Feature) and u_avg_orders

### 1. Calculating the number of distinct order months of each userID in df_orders

In [6]:
user_month = df_orders.drop(columns = ['itemID', 'date', 'order', 'normalized_day'])
user_month

Unnamed: 0,userID,month
0,0,9
1,0,12
3,0,11
4,0,8
5,0,8
...,...,...
1071011,46137,9
1071014,46137,8
1071017,46137,8
1071018,46137,8


In [7]:
# Group by userID and count the number of unique values in column month, then rename the column month to u_EF
user_ef = user_month.groupby('userID').nunique().reset_index()
user_ef.rename(columns = {'month': 'u_EF'}, inplace = True)
user_ef

Unnamed: 0,userID,u_EF
0,0,6
1,1,6
2,2,7
3,3,6
4,4,6
...,...,...
46039,46133,3
46040,46134,5
46041,46135,3
46042,46136,6
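
As a small check of the u_EF logic, the sketch below rebuilds the userID 10 example from the introduction. The `year_month` variant is an assumption added for illustration: the month number alone is enough here because the data window covers fewer than 12 months, but year-month periods would stay distinct over a longer window.

```python
import pandas as pd

# Toy data for the hypothetical userID 10 from the introduction
toy = pd.DataFrame({
    "userID": [10, 10, 10, 10],
    "date": pd.to_datetime(["2021-06-02", "2021-06-13", "2021-06-27", "2021-10-10"]),
    "order": [1, 2, 1, 3],
})

# Same approach as the notebook: count distinct month numbers per user
toy["month"] = toy["date"].dt.month
u_ef = toy.groupby("userID")["month"].nunique()

# Assumed variant: count distinct (year, month) periods instead
toy["year_month"] = toy["date"].dt.to_period("M")
u_ef_period = toy.groupby("userID")["year_month"].nunique()

print(u_ef.loc[10], u_ef_period.loc[10])  # 2 2
```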


### 2. Calculating u_avg_orders

In [8]:
user_order = df_orders.drop(columns = ['itemID', 'date', 'month', 'normalized_day'])
user_order

Unnamed: 0,userID,order
0,0,1
1,0,1
3,0,1
4,0,1
5,0,1
...,...,...
1071011,46137,1
1071014,46137,1
1071017,46137,1
1071018,46137,2


In [9]:
user_avg_order = user_order.groupby('userID').order.mean().reset_index()
user_avg_order.head()

Unnamed: 0,userID,order
0,0,1.0
1,1,1.230769
2,2,1.2
3,3,1.289474
4,4,1.705882


In [10]:
# Round the per-user mean order quantity to 2 decimals
user_avg_order['u_avg_orders'] = user_avg_order['order'].round(2)
user_avg_order.head()

Unnamed: 0,userID,order,u_avg_orders
0,0,1.0,1.0
1,1,1.230769,1.23
2,2,1.2,1.2
3,3,1.289474,1.29
4,4,1.705882,1.71


### 3. Merging the features u_EF and u_avg_orders into df_orders

In [11]:
df = (df_orders.merge(user_ef, how = 'left', on = 'userID')).merge(user_avg_order.drop(columns=['order']), how = 'left', on ='userID')
df

Unnamed: 0,userID,itemID,date,order,month,normalized_day,u_EF,u_avg_orders
0,0,1505,2020-09-01,1,9,93,6,1.0
1,0,6446,2020-12-11,1,12,194,6,1.0
2,0,9325,2020-11-20,1,11,173,6,1.0
3,0,12468,2020-08-03,1,8,64,6,1.0
4,0,12505,2020-08-18,1,8,79,6,1.0
...,...,...,...,...,...,...,...,...
919701,46137,2667,2020-09-17,1,9,109,2,1.4
919702,46137,20209,2020-08-08,1,8,69,2,1.4
919703,46137,28343,2020-08-08,1,8,69,2,1.4
919704,46137,28900,2020-08-08,2,8,69,2,1.4
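
The two per-user features above could also be computed in a single pass with pandas named aggregation. This is only a sketch on toy data; `toy` and `user_feats` are hypothetical names, and the column names are chosen to match the features in this notebook.

```python
import pandas as pd

# Toy order rows: one row per (userID, item, date) with a month and a quantity
toy = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2],
    "month":  [6, 6, 7, 8, 8],
    "order":  [1, 2, 1, 3, 1],
})

user_feats = (
    toy.groupby("userID")
       .agg(u_EF=("month", "nunique"), u_avg_orders=("order", "mean"))
       .round({"u_avg_orders": 2})   # round only the average, as in the notebook
       .reset_index()
)
print(user_feats)
```

The result has one row per userID, so it can be merged with `how = 'left'` on `userID` in the same way as above.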


## III. Creating features u_avg_period, u_first_bought, u_last_bought, u_std_avg_period, u_mean_bought, u_std_bought

### 1. Calculating u_avg_period

#### a. Keeping only userID and normalized_day from the df_orders dataframe

In [12]:
user_day = df_orders.drop(columns = ['itemID', 'date', 'order', 'month'])
user_day

Unnamed: 0,userID,normalized_day
0,0,93
1,0,194
3,0,173
4,0,64
5,0,79
...,...,...
1071011,46137,109
1071014,46137,69
1071017,46137,69
1071018,46137,69


#### b. Removing duplicate rows with the same (userID, normalized_day), keeping only the first occurrence

In [13]:
user_day.drop_duplicates(keep='first', inplace = True)
# Sort by normalized_day in ascending order so that each user's day list is chronological
user_day.sort_values(by = ['normalized_day'], inplace = True)
user_day

Unnamed: 0,userID,normalized_day
304016,13081,1
31102,1371,1
210997,9068,1
344563,14861,1
799518,34500,1
...,...,...
277915,11951,217
384623,16577,217
139950,5994,217
983342,42355,217


#### c. Building normalized_day_list, which contains all the days on which a user placed orders

In [14]:
# groupby userID and then get the list of all normalized day of each userID
user_day_list = user_day.groupby('userID').normalized_day.unique().reset_index()
# rename the column normalized_day into normalized_day_list
user_day_list.rename(columns={'normalized_day': 'normalized_day_list'}, inplace=True)
user_day_list

Unnamed: 0,userID,normalized_day_list
0,0,"[5, 64, 79, 93, 131, 173, 187, 194]"
1,1,"[37, 93, 99, 121, 140, 162, 182, 192, 216]"
2,2,"[29, 49, 71, 115, 124, 141, 175, 181, 202]"
3,3,"[10, 12, 30, 51, 81, 84, 87, 132, 134, 142, 14..."
4,4,"[1, 7, 17, 44, 77, 91, 110, 151, 169, 182]"
...,...,...
46039,46133,"[80, 120, 121, 167, 175]"
46040,46134,"[56, 107, 127, 167, 196]"
46041,46135,"[126, 162, 166, 190]"
46042,46136,"[2, 27, 45, 62, 109, 120, 123, 151, 185, 211]"


#### d. calculating the u_avg_period
- For example, suppose user A ordered on the days normalized_day_list = [23, 41, 50]:
    * **u_avg_period = ((50 - 41) + (41 - 23)) / (len(normalized_day_list) - 1) = 13.5**
    * In other words, we calculate the average distance between consecutive elements of normalized_day_list

In [15]:
# Calculate the average distance between consecutive elements of a sorted list
def calculate_distance(list_distance):
    avg_cycle = 0
    # With at most one element (the user ordered on a single day), the average cycle is defined as 0
    if len(list_distance) <= 1:
        return 0
    else:
        for i in range(len(list_distance) - 1, 0, -1): # i goes from the last index down to 1
            j = i - 1
            avg_cycle = avg_cycle + list_distance[i] - list_distance[j]
        return round(avg_cycle / (len(list_distance) - 1), 2)  # result is rounded to 2 decimals

In [16]:
# Apply calculate_distance to every row of user_day_list and save the result in the column u_avg_period
user_day_list['u_avg_period'] = user_day_list.apply(lambda row: calculate_distance(row['normalized_day_list']), axis = 1)
user_day_list

Unnamed: 0,userID,normalized_day_list,u_avg_period
0,0,"[5, 64, 79, 93, 131, 173, 187, 194]",27.00
1,1,"[37, 93, 99, 121, 140, 162, 182, 192, 216]",22.38
2,2,"[29, 49, 71, 115, 124, 141, 175, 181, 202]",21.62
3,3,"[10, 12, 30, 51, 81, 84, 87, 132, 134, 142, 14...",15.54
4,4,"[1, 7, 17, 44, 77, 91, 110, 151, 169, 182]",20.11
...,...,...,...
46039,46133,"[80, 120, 121, 167, 175]",23.75
46040,46134,"[56, 107, 127, 167, 196]",35.00
46041,46135,"[126, 162, 166, 190]",21.33
46042,46136,"[2, 27, 45, 62, 109, 120, 123, 151, 185, 211]",23.22
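
Because consecutive differences telescope, the sum of all gaps equals the last day minus the first day. The sketch below uses that shortcut; `avg_period` is a hypothetical helper, not part of the notebook, and it assumes the day list is sorted in ascending order.

```python
def avg_period(days):
    """Average gap between consecutive sorted order days; 0 for a single day."""
    if len(days) <= 1:
        return 0
    # (d2 - d1) + (d3 - d2) + ... collapses to d_last - d_first
    return round((days[-1] - days[0]) / (len(days) - 1), 2)

print(avg_period([23, 41, 50]))                         # 13.5
print(avg_period([5, 64, 79, 93, 131, 173, 187, 194]))  # 27.0
```

The second call reproduces the value shown for userID 0 in the table above.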


#### e. Getting the first and the last day on which a userID bought something

In [17]:
user_day_list['u_first_bought'] = user_day_list.apply(lambda row: row['normalized_day_list'][0], axis=1)
user_day_list['u_last_bought'] = user_day_list.apply(lambda row: row['normalized_day_list'][-1], axis=1)
user_day_list

Unnamed: 0,userID,normalized_day_list,u_avg_period,u_first_bought,u_last_bought
0,0,"[5, 64, 79, 93, 131, 173, 187, 194]",27.00,5,194
1,1,"[37, 93, 99, 121, 140, 162, 182, 192, 216]",22.38,37,216
2,2,"[29, 49, 71, 115, 124, 141, 175, 181, 202]",21.62,29,202
3,3,"[10, 12, 30, 51, 81, 84, 87, 132, 134, 142, 14...",15.54,10,212
4,4,"[1, 7, 17, 44, 77, 91, 110, 151, 169, 182]",20.11,1,182
...,...,...,...,...,...
46039,46133,"[80, 120, 121, 167, 175]",23.75,80,175
46040,46134,"[56, 107, 127, 167, 196]",35.00,56,196
46041,46135,"[126, 162, 166, 190]",21.33,126,190
46042,46136,"[2, 27, 45, 62, 109, 120, 123, 151, 185, 211]",23.22,2,211


### 2. Calculating u_std_avg_period

#### a. getting the list of distance between elements in normalized_day_list

In [18]:
# Get the list of distances between consecutive elements of normalized_day_list
def get_distance_list(list_distance):
    distances = []
    # With at most one element (the user ordered on a single day), the only distance is defined as 0
    if len(list_distance) <= 1:
        distances.append(0)
    else:
        # i goes from the last index down to 1, so the distances are collected in reverse chronological order
        for i in range(len(list_distance) - 1, 0, -1):
            j = i - 1
            distances.append(list_distance[i] - list_distance[j])
    return distances

In [19]:
user_day_list['distance_normalized_day'] = user_day_list.apply(lambda row: get_distance_list(row['normalized_day_list']), axis=1)
user_day_list

Unnamed: 0,userID,normalized_day_list,u_avg_period,u_first_bought,u_last_bought,distance_normalized_day
0,0,"[5, 64, 79, 93, 131, 173, 187, 194]",27.00,5,194,"[7, 14, 42, 38, 14, 15, 59]"
1,1,"[37, 93, 99, 121, 140, 162, 182, 192, 216]",22.38,37,216,"[24, 10, 20, 22, 19, 22, 6, 56]"
2,2,"[29, 49, 71, 115, 124, 141, 175, 181, 202]",21.62,29,202,"[21, 6, 34, 17, 9, 44, 22, 20]"
3,3,"[10, 12, 30, 51, 81, 84, 87, 132, 134, 142, 14...",15.54,10,212,"[14, 24, 27, 5, 8, 2, 45, 3, 3, 30, 21, 18, 2]"
4,4,"[1, 7, 17, 44, 77, 91, 110, 151, 169, 182]",20.11,1,182,"[13, 18, 41, 19, 14, 33, 27, 10, 6]"
...,...,...,...,...,...,...
46039,46133,"[80, 120, 121, 167, 175]",23.75,80,175,"[8, 46, 1, 40]"
46040,46134,"[56, 107, 127, 167, 196]",35.00,56,196,"[29, 40, 20, 51]"
46041,46135,"[126, 162, 166, 190]",21.33,126,190,"[24, 4, 36]"
46042,46136,"[2, 27, 45, 62, 109, 120, 123, 151, 185, 211]",23.22,2,211,"[26, 34, 28, 3, 11, 47, 17, 18, 25]"
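
The same gaps can be obtained with `np.diff`. This is a sketch of an alternative, not the notebook's function: `gap_list` is a hypothetical name, and it returns the gaps in chronological order, whereas the lists above are in reverse order. The standard deviation does not depend on the order.

```python
import numpy as np

def gap_list(days):
    """Gaps between consecutive order days in chronological order; [0] for a single day."""
    if len(days) <= 1:
        return [0]
    return np.diff(days).tolist()

print(gap_list([5, 64, 79, 93, 131, 173, 187, 194]))  # [59, 15, 14, 38, 42, 14, 7]
```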


#### b. using np.std to calculate the standard deviation of distance_normalized_day

In [20]:
user_day_list['u_std_avg_period'] = user_day_list.apply(lambda row: round(np.std(row['distance_normalized_day']), 2), axis=1)
user_day_list

Unnamed: 0,userID,normalized_day_list,u_avg_period,u_first_bought,u_last_bought,distance_normalized_day,u_std_avg_period
0,0,"[5, 64, 79, 93, 131, 173, 187, 194]",27.00,5,194,"[7, 14, 42, 38, 14, 15, 59]",17.94
1,1,"[37, 93, 99, 121, 140, 162, 182, 192, 216]",22.38,37,216,"[24, 10, 20, 22, 19, 22, 6, 56]",14.02
2,2,"[29, 49, 71, 115, 124, 141, 175, 181, 202]",21.62,29,202,"[21, 6, 34, 17, 9, 44, 22, 20]",11.63
3,3,"[10, 12, 30, 51, 81, 84, 87, 132, 134, 142, 14...",15.54,10,212,"[14, 24, 27, 5, 8, 2, 45, 3, 3, 30, 21, 18, 2]",12.91
4,4,"[1, 7, 17, 44, 77, 91, 110, 151, 169, 182]",20.11,1,182,"[13, 18, 41, 19, 14, 33, 27, 10, 6]",10.77
...,...,...,...,...,...,...,...
46039,46133,"[80, 120, 121, 167, 175]",23.75,80,175,"[8, 46, 1, 40]",19.52
46040,46134,"[56, 107, 127, 167, 196]",35.00,56,196,"[29, 40, 20, 51]",11.64
46041,46135,"[126, 162, 166, 190]",21.33,126,190,"[24, 4, 36]",13.20
46042,46136,"[2, 27, 45, 62, 109, 120, 123, 151, 185, 211]",23.22,2,211,"[26, 34, 28, 3, 11, 47, 17, 18, 25]",12.20
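
Note that `np.std` defaults to the population standard deviation (`ddof=0`), while pandas' `Series.std` defaults to the sample version (`ddof=1`). The sketch below contrasts the two on the gaps of the example user A and re-checks the value shown for userID 0.

```python
import numpy as np

gaps = [9, 18]
pop_std = np.std(gaps)             # divides by n, as used in this notebook
sample_std = np.std(gaps, ddof=1)  # divides by n - 1

# Gaps of userID 0 from the table above
user0_std = round(float(np.std([7, 14, 42, 38, 14, 15, 59])), 2)

print(pop_std, round(float(sample_std), 2), user0_std)
```

With only two gaps the two conventions differ noticeably, so the choice matters most for users with few order days.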


### 3. Calculating u_mean_bought and u_std_bought

#### a. using np.mean() to calculate mean of normalized_day_list

In [21]:
user_day_list['u_mean_bought'] = user_day_list.apply(lambda row: round(np.mean(row['normalized_day_list']), 2), axis=1)
user_day_list.head()

Unnamed: 0,userID,normalized_day_list,u_avg_period,u_first_bought,u_last_bought,distance_normalized_day,u_std_avg_period,u_mean_bought
0,0,"[5, 64, 79, 93, 131, 173, 187, 194]",27.0,5,194,"[7, 14, 42, 38, 14, 15, 59]",17.94,115.75
1,1,"[37, 93, 99, 121, 140, 162, 182, 192, 216]",22.38,37,216,"[24, 10, 20, 22, 19, 22, 6, 56]",14.02,138.0
2,2,"[29, 49, 71, 115, 124, 141, 175, 181, 202]",21.62,29,202,"[21, 6, 34, 17, 9, 44, 22, 20]",11.63,120.78
3,3,"[10, 12, 30, 51, 81, 84, 87, 132, 134, 142, 14...",15.54,10,212,"[14, 24, 27, 5, 8, 2, 45, 3, 3, 30, 21, 18, 2]",12.91,106.71
4,4,"[1, 7, 17, 44, 77, 91, 110, 151, 169, 182]",20.11,1,182,"[13, 18, 41, 19, 14, 33, 27, 10, 6]",10.77,84.9


#### b. using np.std() to calculate standard deviation of normalized_day_list

In [22]:
user_day_list['u_std_bought'] = user_day_list.apply(lambda row: round(np.std(row['normalized_day_list']), 2), axis=1)
user_day_list.head()

Unnamed: 0,userID,normalized_day_list,u_avg_period,u_first_bought,u_last_bought,distance_normalized_day,u_std_avg_period,u_mean_bought,u_std_bought
0,0,"[5, 64, 79, 93, 131, 173, 187, 194]",27.0,5,194,"[7, 14, 42, 38, 14, 15, 59]",17.94,115.75,62.77
1,1,"[37, 93, 99, 121, 140, 162, 182, 192, 216]",22.38,37,216,"[24, 10, 20, 22, 19, 22, 6, 56]",14.02,138.0,53.28
2,2,"[29, 49, 71, 115, 124, 141, 175, 181, 202]",21.62,29,202,"[21, 6, 34, 17, 9, 44, 22, 20]",11.63,120.78,57.42
3,3,"[10, 12, 30, 51, 81, 84, 87, 132, 134, 142, 14...",15.54,10,212,"[14, 24, 27, 5, 8, 2, 45, 3, 3, 30, 21, 18, 2]",12.91,106.71,63.85
4,4,"[1, 7, 17, 44, 77, 91, 110, 151, 169, 182]",20.11,1,182,"[13, 18, 41, 19, 14, 33, 27, 10, 6]",10.77,84.9,64.02
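
The first day, last day, mean and standard deviation could also be computed without building Python lists, directly from the de-duplicated (userID, normalized_day) table. This is a minimal sketch on toy data; `toy_user_day` and `day_feats` are hypothetical names, and `ddof=0` is passed explicitly to match `np.std`.

```python
import pandas as pd

# Toy stand-in for the de-duplicated (userID, normalized_day) table
toy_user_day = pd.DataFrame({
    "userID": [7, 7, 7, 8],
    "normalized_day": [41, 23, 50, 12],
})

day_feats = (
    toy_user_day.groupby("userID")["normalized_day"]
    .agg(
        u_first_bought="min",
        u_last_bought="max",
        u_mean_bought="mean",
        u_std_bought=lambda s: s.std(ddof=0),  # population std, like np.std
    )
    .round(2)
    .reset_index()
)
print(day_feats)
```

Using `min` and `max` also removes the dependency on the rows being sorted beforehand.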


### 4. Merging features to df dataframe

In [23]:
short_user_day_list = user_day_list.drop(columns = ['normalized_day_list', 'distance_normalized_day'])
short_user_day_list.head()

Unnamed: 0,userID,u_avg_period,u_first_bought,u_last_bought,u_std_avg_period,u_mean_bought,u_std_bought
0,0,27.0,5,194,17.94,115.75,62.77
1,1,22.38,37,216,14.02,138.0,53.28
2,2,21.62,29,202,11.63,120.78,57.42
3,3,15.54,10,212,12.91,106.71,63.85
4,4,20.11,1,182,10.77,84.9,64.02


In [24]:
df = df.merge(short_user_day_list, how = 'left', on = 'userID')
df

Unnamed: 0,userID,itemID,date,order,month,normalized_day,u_EF,u_avg_orders,u_avg_period,u_first_bought,u_last_bought,u_std_avg_period,u_mean_bought,u_std_bought
0,0,1505,2020-09-01,1,9,93,6,1.0,27.0,5,194,17.94,115.75,62.77
1,0,6446,2020-12-11,1,12,194,6,1.0,27.0,5,194,17.94,115.75,62.77
2,0,9325,2020-11-20,1,11,173,6,1.0,27.0,5,194,17.94,115.75,62.77
3,0,12468,2020-08-03,1,8,64,6,1.0,27.0,5,194,17.94,115.75,62.77
4,0,12505,2020-08-18,1,8,79,6,1.0,27.0,5,194,17.94,115.75,62.77
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
919701,46137,2667,2020-09-17,1,9,109,2,1.4,40.0,69,109,0.00,89.00,20.00
919702,46137,20209,2020-08-08,1,8,69,2,1.4,40.0,69,109,0.00,89.00,20.00
919703,46137,28343,2020-08-08,1,8,69,2,1.4,40.0,69,109,0.00,89.00,20.00
919704,46137,28900,2020-08-08,2,8,69,2,1.4,40.0,69,109,0.00,89.00,20.00


In [29]:
final_dataset = df.drop(columns = ['itemID', 'date', 'order', 'month', 'normalized_day']).drop_duplicates(subset=['userID'], keep = 'first')
final_dataset

Unnamed: 0,userID,u_EF,u_avg_orders,u_avg_period,u_first_bought,u_last_bought,u_std_avg_period,u_mean_bought,u_std_bought
0,0,6,1.00,27.00,5,194,17.94,115.75,62.77
14,1,6,1.23,22.38,37,216,14.02,138.00,53.28
27,2,7,1.20,21.62,29,202,11.63,120.78,57.42
47,3,6,1.29,15.54,10,212,12.91,106.71,63.85
85,4,6,1.71,20.11,1,182,10.77,84.90,64.02
...,...,...,...,...,...,...,...,...,...
919611,46133,3,1.10,23.75,80,175,19.52,132.60,34.76
919632,46134,5,1.22,35.00,56,196,11.64,130.60,48.45
919659,46135,3,1.86,21.33,126,190,13.20,161.00,22.87
919666,46136,6,1.97,23.22,2,211,12.20,103.50,65.14


- **Interpreting one observation (one row per userID)**: For each userID, we know:
    * the number of distinct months in which the user placed at least one order (u_EF)
    * the average quantity per order row of the user (u_avg_orders)
    * the average number of days between consecutive order days (u_avg_period)
    * the first and last normalized days on which the user bought something (u_first_bought, u_last_bought)
    * the standard deviation of the gaps between consecutive order days (u_std_avg_period)
    * the mean of the normalized days on which the user bought something (u_mean_bought) and the standard deviation around this mean (u_std_bought)
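
Since every feature is defined per user, the per-user tables could also be merged with each other directly instead of being joined onto the order rows and de-duplicated afterwards. The sketch below uses toy tables with hypothetical names (`ef`, `avg`, `days`) standing in for the per-user dataframes, and adds two simple sanity checks.

```python
import pandas as pd

# Toy per-user tables standing in for the per-user feature dataframes
ef = pd.DataFrame({"userID": [1, 2], "u_EF": [2, 1]})
avg = pd.DataFrame({"userID": [1, 2], "u_avg_orders": [1.33, 2.0]})
days = pd.DataFrame({"userID": [1, 2], "u_avg_period": [13.5, 0.0]})

user_features = (
    ef.merge(avg, on="userID", how="left")
      .merge(days, on="userID", how="left")
)

# Sanity checks: exactly one row per user and no missing feature values
assert user_features["userID"].is_unique
assert not user_features.isna().any().any()
print(user_features)
```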

In [30]:
# extract dataframe
#final_dataset.to_csv('U_FEAT_till_3_1.csv')