# What do I want to find out?
- How many products does a user buy / view / put in cart on average?
- How many products does a user interact with in one session on average?
- How many sessions does a user have on average?
- How many users are there?
- How many sessions are there?
- How did the indicators change from October to November?

### Imports

In [1]:
import pandas as pd
import numpy as np
import dask.dataframe as dd
import time

### Loading data

In [2]:
t1 = time.time()
# dd.read_csv is lazy: it only builds the task graph, the files are read on .compute()
df_oct = dd.read_csv("./data/2019-Oct.csv")
df_nov = dd.read_csv("./data/2019-Nov.csv")

# Runtime calculations
t2 = time.time()
print(f"\nFinished operation in {round(t2-t1, 2)}s")


Finished operation in 0.1s
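
The 0.1s above reflects only the lazy setup, since Dask defers the actual file reads until `.compute()` is called. The `t1`/`t2` pattern repeated in each cell could also be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook; the name `timed` and its `label` parameter are made up for illustration.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label="operation"):
    """Print how long the enclosed block took, in seconds (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"\nFinished {label} in {round(elapsed, 2)}s")

# Usage: wrap the body of a cell instead of tracking t1/t2 by hand.
with timed("toy sum"):
    total = sum(range(1_000_000))
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic clock intended for measuring durations.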


In [3]:
# Without .compute() this only shows the lazy structure (column names and dtypes), not the statistics
df_oct.describe()


              product_id category_id    price  user_id
npartitions=1
                 float64     float64  float64  float64
                     ...         ...      ...      ...


## How many unique users are there?

In [4]:
t1 = time.time()
# Amount of users
amount_usr_oct = df_oct["user_id"].nunique().compute()
amount_usr_nov = df_nov["user_id"].nunique().compute()
print(f"Amount of unique users in October: {amount_usr_oct}")
print(f"Amount of unique users in November: {amount_usr_nov}")
print(f"Delta from October to November: {amount_usr_nov - amount_usr_oct}")

# Runtime calculations
t2 = time.time()
print(f"\nFinished operation in {round(t2-t1, 2)}s")

Amount of unique users in October: 3022290
Amount of unique users in November: 3696117
Delta from October to November: 673827

Finished operation in 133.61s


## How many sessions are there?

In [9]:
t1 = time.time()
# Amount of sessions (reduce in Dask first, so the full column is never loaded into memory)
amount_sess_oct = df_oct["user_session"].nunique().compute()
amount_sess_nov = df_nov["user_session"].nunique().compute()
print(f"Amount of unique sessions in October: {amount_sess_oct}")
print(f"Amount of unique sessions in November: {amount_sess_nov}")
print(f"Delta from October to November: {amount_sess_nov - amount_sess_oct}")

# Runtime calculations
t2 = time.time()
print(f"\nFinished operation in {round(t2-t1, 2)}s")

Amount of unique sessions in October: 9244421
Amount of unique sessions in November: 13776050
Delta from October to November: 4531629
Finished operation in 197.93s


## How many sessions does a user have on average?

In [10]:
t1 = time.time()
avrg_sess_oct = amount_sess_oct/amount_usr_oct
avrg_sess_nov = amount_sess_nov/amount_usr_nov
print(f"The average number of sessions a user had in October: {avrg_sess_oct}")
print(f"The average number of sessions a user had in November: {avrg_sess_nov}")
print(f"Delta from October to November: {avrg_sess_nov - avrg_sess_oct}")

# Runtime calculations
t2 = time.time()
print(f"\nFinished operation in {round(t2-t1, 2)}s")

The average number of sessions a user had in October: 3.058747175155263
The average number of sessions a user had in November: 3.727168268753397
Delta from October to November: 0.6684210935981341
Finished operation in 0.04s
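
The absolute deltas printed so far are hard to compare across indicators, so it may help to express them as relative changes as well. The sketch below reuses the user and session counts from the outputs above; the helper name `relative_change` is an assumption introduced here for illustration.

```python
def relative_change(old, new):
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100

# Figures copied from the cell outputs above
users_oct, users_nov = 3022290, 3696117
sessions_oct, sessions_nov = 9244421, 13776050

users_pct = relative_change(users_oct, users_nov)
sessions_pct = relative_change(sessions_oct, sessions_nov)

print(f"Unique users, October to November: {users_pct:+.1f}%")
print(f"Unique sessions, October to November: {sessions_pct:+.1f}%")
```

Sessions grew faster than users, which is consistent with the higher average number of sessions per user in November.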


In [None]:
# A groupby needs an aggregation before .compute(); here: distinct sessions per user in October
var_sess_oct = df_oct.groupby("user_id")["user_session"].nunique().compute()
var_sess_oct.head()

## How many products does a user buy / view / put in cart on average?

In [14]:
t1 = time.time()

# Count the events of each type in October
amount_events_oct = df_oct[["user_id", "event_type"]].groupby("event_type").count().compute().reset_index()
#amount_events_nov = df_nov[["user_id", "event_type"]].groupby("event_type").count().compute().reset_index()
print(amount_events_oct)
#print(amount_events_nov)
# Runtime calculations
t2 = time.time()
print(f"\nFinished operation in {round(t2-t1, 2)}s")

            event_time
event_type            
cart           3028930
purchase        916939
view          63556110


In [8]:
# Divide each event count by the number of unique users; keep the event_type labels from the groupby result
amount_interaction_oct = amount_events_oct[["user_id", "event_type"]].copy()
amount_interaction_oct["user_id"] = amount_interaction_oct["user_id"].div(amount_usr_oct)
print(amount_interaction_oct)

     user_id event_type
0   0.306561       cart
1   0.245790   purchase
2  13.492881       view
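
Note that the averages above divide by all unique users, including users who never put anything in the cart or bought anything. The following sketch contrasts that with an average over only the users who performed the event at least once. It runs on a made-up toy frame in plain pandas, not on the real data; only the column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical toy event log; column names follow the dataset used above
events = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2, 3],
    "event_type": ["view", "view", "cart", "view", "purchase", "view"],
})

n_users = events["user_id"].nunique()

# Number of events of each type
counts = events.groupby("event_type")["user_id"].count()

# Average over all users (the approach taken in the cell above)
per_user_all = counts / n_users

# Average over only the users who performed that event at least once
active_users = events.groupby("event_type")["user_id"].nunique()
per_active_user = counts / active_users

summary = pd.DataFrame({
    "per_user_all": per_user_all,
    "per_active_user": per_active_user,
})
print(summary)
```

Which denominator is appropriate depends on the question: the first describes a typical visitor, the second a typical buyer or cart user.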


In [9]:
t1 = time.time()
# Count events per user once, then look up the user with the highest count
tt = df_oct.groupby("user_id")["event_type"].count().compute()
top_user = tt.idxmax()
top_count = tt.max()
print(f"The most active user has the ID {top_user} with a total of {top_count} interactions.")

# Runtime calculations
t2 = time.time()
print(f"\nFinished operation in {round(t2-t1, 2)}s")


The most active user has the ID 512475445 with a total of 7436 interactions.

Finished operation in 89.12s
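
The "most active user" lookup can be sanity-checked on a small in-memory frame before running it on the full month. The sketch below uses made-up toy data and plain pandas; `nlargest` is shown as one possible way to extend the lookup to the top N users.

```python
import pandas as pd

# Hypothetical toy event log; only the column names mirror the real data
events = pd.DataFrame({
    "user_id":    [10, 20, 20, 30, 20, 10],
    "event_type": ["view", "view", "cart", "view", "purchase", "view"],
})

# One row per event, so counting rows per user gives the interactions per user
events_per_user = events["user_id"].value_counts()

top_user = events_per_user.idxmax()   # user ID with the most events
top_count = events_per_user.max()     # number of events of that user
top_two = events_per_user.nlargest(2) # the two most active users

print(f"The most active user has the ID {top_user} with a total of {top_count} interactions.")
print(top_two)
```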
