Finding User Purchases

Write a query that'll identify returning active users. 

A returning active user is a user that has made a second purchase within 7 days of any other of their purchases.

Output a list of user_ids of these returning active users.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

In [3]:
amazon_transactions = pd.read_csv("../CSV/amazon_transactions.csv")
columns_to_keep = ["id", "user_id", "item", "created_at", "revenue"]
amazon_transactions = amazon_transactions[columns_to_keep]
amazon_transactions.head(3)

Unnamed: 0,id,user_id,item,created_at,revenue
0,1,109,milk,2020-03-03,123
1,2,139,biscuit,2020-03-18,421
2,3,120,milk,2020-03-18,176


In [4]:
amazon_transactions["created_at"] = pd.to_datetime(amazon_transactions["created_at"]).dt.strftime('%m-%d-%Y')
amazon_transactions.head(3)

Unnamed: 0,id,user_id,item,created_at,revenue
0,1,109,milk,03-03-2020,123
1,2,139,biscuit,03-18-2020,421
2,3,120,milk,03-18-2020,176
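
One caveat about the cell above: after `.dt.strftime('%m-%d-%Y')` the column holds text, so any later sort compares strings month-first instead of comparing dates. A minimal sketch with made-up dates (not taken from the dataset) of how the two orderings can differ once the data spans more than one year:

```python
import pandas as pd

# Hypothetical dates spanning two years (not from amazon_transactions.csv).
dates = pd.Series(pd.to_datetime(["2020-12-30", "2021-01-02", "2020-03-05"]))

# Sorting the formatted strings compares text, so the month decides first.
as_text_sorted = dates.dt.strftime("%m-%d-%Y").sort_values().tolist()

# Sorting the datetime values compares actual points in time.
as_dates_sorted = dates.sort_values().dt.strftime("%Y-%m-%d").tolist()

print(as_text_sorted)
print(as_dates_sorted)
```

In the text ordering the 2021 purchase comes first. Keeping `created_at` as a datetime column and formatting it only for display would avoid this.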


In [5]:
df = amazon_transactions.sort_values(by=['user_id', 'created_at'], ascending=[True, True])
df

Unnamed: 0,id,user_id,item,created_at,revenue
94,95,100,bread,03-07-2020,410
73,74,100,banana,03-13-2020,175
19,20,100,banana,03-18-2020,599
28,29,100,milk,03-29-2020,410
26,27,101,milk,03-01-2020,449
...,...,...,...,...,...
58,59,149,biscuit,03-11-2020,827
18,19,149,banana,03-29-2020,382
49,50,150,banana,03-04-2020,299
37,38,150,banana,03-20-2020,927


In [6]:
df['prev_value'] = df.groupby('user_id')['created_at'].shift()
df

Unnamed: 0,id,user_id,item,created_at,revenue,prev_value
94,95,100,bread,03-07-2020,410,
73,74,100,banana,03-13-2020,175,03-07-2020
19,20,100,banana,03-18-2020,599,03-13-2020
28,29,100,milk,03-29-2020,410,03-18-2020
26,27,101,milk,03-01-2020,449,
...,...,...,...,...,...,...
58,59,149,biscuit,03-11-2020,827,
18,19,149,banana,03-29-2020,382,03-11-2020
49,50,150,banana,03-04-2020,299,
37,38,150,banana,03-20-2020,927,03-04-2020


In [7]:
df['days'] = (pd.to_datetime(df['created_at']) - pd.to_datetime(df['prev_value'])).dt.days
df

Unnamed: 0,id,user_id,item,created_at,revenue,prev_value,days
94,95,100,bread,03-07-2020,410,,
73,74,100,banana,03-13-2020,175,03-07-2020,6.0
19,20,100,banana,03-18-2020,599,03-13-2020,5.0
28,29,100,milk,03-29-2020,410,03-18-2020,11.0
26,27,101,milk,03-01-2020,449,,
...,...,...,...,...,...,...,...
58,59,149,biscuit,03-11-2020,827,,
18,19,149,banana,03-29-2020,382,03-11-2020,18.0
49,50,150,banana,03-04-2020,299,,
37,38,150,banana,03-20-2020,927,03-04-2020,16.0


The `.shift()` method in pandas shifts the values of a Series (or of each group in a grouped Series) by a given number of positions along the index. Let's look at it in more detail:

```python
df['prev_value'] = df.groupby('user_id')['created_at'].shift()
```

1. **`df.groupby('user_id')['created_at']`**: Groups the DataFrame `df` by the 'user_id' column and selects only the 'created_at' column. The result is a grouped Series object (`SeriesGroupBy`).

2. **`.shift()`**: This method is applied to each group separately. It shifts the values within each group by the given number of positions. With the default `periods=1`, each value moves down one row, so every row receives the value from the row before it.

3. **`df['prev_value'] = ...`**: Creates a new column 'prev_value' in the original DataFrame `df` and assigns the shifted result to it. 'prev_value' now holds the previous 'created_at' value for each user within that user's group.

Example:

Input data:
```
user_id | created_at
--------|----------------
1       | 2022-01-01
1       | 2022-01-02
2       | 2022-01-03
2       | 2022-01-04
```

After applying `.shift()`:
```
user_id | created_at  | prev_value
--------|--------------|----------------
1       | 2022-01-01   | NaN
1       | 2022-01-02   | 2022-01-01
2       | 2022-01-03   | NaN
2       | 2022-01-04   | 2022-01-03
```

Note that for the first record of each group (here, each user) the value in 'prev_value' is NaN, because that record has no previous value.
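
The example table above can be reproduced with a short runnable sketch. The `toy` DataFrame is made up for illustration, and it stores 'created_at' as datetimes, so pandas shows the missing values as `NaT` rather than `NaN`:

```python
import pandas as pd

# Toy data mirroring the example table (not the real dataset).
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "created_at": pd.to_datetime(
        ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04"]
    ),
})

# shift() with the default periods=1 moves each value down one row
# within its own group, so every row sees the user's previous date.
toy["prev_value"] = toy.groupby("user_id")["created_at"].shift()

# Subtracting two datetime columns gives timedeltas;
# .dt.days extracts the whole number of days.
toy["days"] = (toy["created_at"] - toy["prev_value"]).dt.days

print(toy)
```

The first row of each user has no previous purchase, so both 'prev_value' and 'days' are missing there.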

In [8]:
result = df[df['days'] <= 7]['user_id'].unique()

Let's walk through the code step by step:

1. `df['days'] <= 7`: Creates a boolean Series (a mask) that is True where the value in the 'days' column is less than or equal to 7.

2. `df[df['days'] <= 7]`: Uses that mask to select the rows of `df` where the condition holds, that is, where 'days' is less than or equal to 7.

3. `['user_id']`: Selects only the 'user_id' column from the filtered DataFrame.

4. `.unique()`: Returns the unique values in the 'user_id' column.

So the code builds an array of the unique 'user_id' values for the rows of `df` where 'days' is less than or equal to 7. Rows where 'days' is NaN (each user's first purchase) fail the comparison and are dropped, so the result contains exactly the users with two consecutive purchases at most 7 days apart.
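
To see how the mask treats missing values, here is a small sketch on made-up 'days' values (the numbers are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical 'days' values; NaN marks a user's first purchase.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "days": [np.nan, 6.0, 5.0, np.nan, 18.0],
})

# Comparing NaN with a number evaluates to False, so first purchases drop out.
mask = toy["days"] <= 7

# Keep the matching rows, then collapse repeated user ids.
result = toy[mask]["user_id"].unique()

print(mask.tolist())
print(result)
```

User 1 matches twice but appears once in `result` because of `.unique()`; user 2 has only an 18-day gap and is excluded.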

In [9]:
result

array([100, 103, 105, 109, 110, 111, 112, 114, 117, 120, 122, 128, 129,
       130, 131, 133, 141, 143, 150])

Solution Walkthrough
This walkthrough will explain how to identify returning active users using the provided code. The code imports the necessary libraries, manipulates the date column in the dataset, sorts the dataset, calculates the number of days between purchases for each user, and finally filters and outputs a list of user IDs for returning active users.

Understanding The Data
The data is a dataset called "amazon_transactions" with the columns "id", "user_id", "item", "created_at", and "revenue". Each row represents a transaction made by a user on Amazon, and the "created_at" column contains the date on which the transaction took place.

The Problem Statement
The goal is to identify returning active users: users who made a second purchase within 7 days of any other of their purchases. In other words, we want to find users who made at least two purchases within a one-week period.
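
The definition compares a purchase with any other purchase by the same user, while the code in this notebook only looks at the immediately preceding one. The two agree: if any two purchases are at most 7 days apart, the consecutive purchases between them are at most that far apart too. A rough sketch on made-up data that cross-checks this against a brute-force comparison of all pairs; the helper names `by_consecutive_gap` and `by_all_pairs` are illustrative and do not appear in the original code:

```python
import pandas as pd

# Made-up transactions, not taken from amazon_transactions.csv.
tx = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "user_id": [1, 1, 1, 2, 2, 3],
    "created_at": pd.to_datetime([
        "2020-03-01", "2020-03-20", "2020-03-25",
        "2020-03-02", "2020-03-15", "2020-03-09",
    ]),
})


def by_consecutive_gap(df):
    """User ids with at most 7 days between two consecutive purchases."""
    df = df.sort_values(["user_id", "created_at"])
    prev = df.groupby("user_id")["created_at"].shift()
    gap = (df["created_at"] - prev).dt.days
    return set(df.loc[gap <= 7, "user_id"])


def by_all_pairs(df):
    """Same question by brute force: compare every pair of a user's purchases."""
    pairs = df.merge(df, on="user_id", suffixes=("_a", "_b"))
    pairs = pairs[pairs["id_a"] != pairs["id_b"]]  # drop each row paired with itself
    gap = (pairs["created_at_a"] - pairs["created_at_b"]).abs().dt.days
    return set(pairs.loc[gap <= 7, "user_id"])


print(by_consecutive_gap(tx), by_all_pairs(tx))
```

The self-join grows quadratically with the number of purchases per user, which is one reason to prefer the sort-and-shift approach.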

Breaking Down The Code
Let's dissect the provided code step by step:

The first three lines import pandas, NumPy, and the "datetime" class from the standard-library "datetime" module. Only pandas is actually used in the code that follows.

The next line of code converts the "created_at" column in the "amazon_transactions" dataset to datetime format using the "pd.to_datetime" function, then converts it back to strings in the format '%m-%d-%Y' using the ".dt.strftime()" method. After this step the column holds text rather than datetime values, which is why the later 'days' calculation calls "pd.to_datetime" again.

The "df" variable takes the "amazon_transactions" dataset and sorts it based on two columns: 'user_id' in ascending order and 'created_at' in ascending order.

The next line of code creates a new column called 'prev_value' in the "df" dataset. This column contains the previous value of the 'created_at' column for each user. It uses the ".groupby()" function to group the dataset by 'user_id' and selects the 'created_at' column. The ".shift()" function then shifts the values by one row to get the previous value.

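Another new column called 'days' is created in the "df" dataset. It holds the number of days between the current 'created_at' date and the previous 'created_at' date for each user. Both columns are parsed back to datetimes with "pd.to_datetime", the previous date is subtracted from the current date, and the ".dt.days" accessor extracts the whole number of days from the resulting timedelta. A user's first purchase has no previous date, so its 'days' value is NaN.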

The next line of code filters the "df" dataset on the condition that the 'days' column is less than or equal to 7. It then selects the 'user_id' column and gets the unique values using the ".unique()" method. This keeps only the users who made a purchase within 7 days of their previous purchase.

Bringing It All Together
The complete code combines these steps to identify returning active users:

import pandas as pd
import numpy as np
from datetime import datetime

amazon_transactions["created_at"] = pd.to_datetime(
    amazon_transactions["created_at"]
).dt.strftime("%m-%d-%Y")
df = amazon_transactions.sort_values(
    by=["user_id", "created_at"], ascending=[True, True]
)
df["prev_value"] = df.groupby("user_id")["created_at"].shift()
df["days"] = (
    pd.to_datetime(df["created_at"])
    - pd.to_datetime(df["prev_value"])
).dt.days
result = df[df["days"] <= 7]["user_id"].unique()

This code imports the libraries, converts the date column to the desired format, sorts the dataset, calculates the number of days between consecutive purchases for each user, and filters the dataset to find returning active users based on the specified condition. The resulting array of user IDs is stored in "result".

Conclusion
The provided code effectively identifies returning active users by filtering the dataset based on the condition of having a second purchase within 7 days of any other previous purchase.