# Problem Statement

## Tables Description

### **Visits Table**
| Column Name   | Type    |
|---------------|---------|
| user_id       | int     |
| visit_date    | date    |

- `(user_id, visit_date)` is the primary key, ensuring unique rows.
- Each row represents a user's visit to the bank on a specific date.

### **Transactions Table**
| Column Name      | Type    |
|------------------|---------|
| user_id          | int     |
| transaction_date | date    |
| amount           | int     |

- The `Transactions` table may contain duplicate rows.
- Each row represents a transaction performed by a user on a specific date with a specified `amount`.
- It is guaranteed that every `(user_id, transaction_date)` pair in this table has a matching `(user_id, visit_date)` row in the `Visits` table; that is, a user can only make a transaction during a visit.

---

## Goal

The bank wants to create a chart showing:
1. **`transactions_count`**: The number of transactions performed during a single visit.
2. **`visits_count`**: The number of visits corresponding to each `transactions_count`.

The output should include all possible values of `transactions_count` (from 0 to the maximum observed) and should be sorted by `transactions_count`.

---

## Output Format

The output table should have two columns:
1. `transactions_count`: Number of transactions during a single visit.
2. `visits_count`: Number of visits during which exactly this many transactions were performed.

---

## Example

### Input:

#### Visits Table:
| user_id | visit_date  |
|---------|-------------|
| 1       | 2020-01-01  |
| 2       | 2020-01-02  |
| 12      | 2020-01-01  |
| 19      | 2020-01-03  |
| 1       | 2020-01-02  |
| 2       | 2020-01-03  |
| 1       | 2020-01-04  |
| 7       | 2020-01-11  |
| 9       | 2020-01-25  |
| 8       | 2020-01-28  |

#### Transactions Table:
| user_id | transaction_date | amount |
|---------|------------------|--------|
| 1       | 2020-01-02       | 120    |
| 2       | 2020-01-03       | 22     |
| 7       | 2020-01-11       | 232    |
| 1       | 2020-01-04       | 7      |
| 9       | 2020-01-25       | 33     |
| 9       | 2020-01-25       | 66     |
| 8       | 2020-01-28       | 1      |
| 9       | 2020-01-25       | 99     |

---

### Output:

| transactions_count | visits_count |
|--------------------|--------------|
| 0                  | 4            |
| 1                  | 5            |
| 2                  | 0            |
| 3                  | 1            |

---

### Explanation:

1. **`transactions_count = 0`**:  
   The visits `(1, "2020-01-01")`, `(2, "2020-01-02")`, `(12, "2020-01-01")`, and `(19, "2020-01-03")` did not result in any transactions.  
   Therefore, `visits_count = 4`.

2. **`transactions_count = 1`**:  
   The visits `(2, "2020-01-03")`, `(7, "2020-01-11")`, `(8, "2020-01-28")`, `(1, "2020-01-02")`, and `(1, "2020-01-04")` resulted in exactly one transaction.  
   Therefore, `visits_count = 5`.

3. **`transactions_count = 2`**:  
   No visits resulted in exactly two transactions.  
   Therefore, `visits_count = 0`.

4. **`transactions_count = 3`**:  
   The visit `(9, "2020-01-25")` resulted in exactly three transactions.  
   Therefore, `visits_count = 1`.

5. **Stop at `transactions_count = 3`**:  
   Since there are no visits with more than three transactions, the result table ends here.
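
The counting logic in the explanation above can be checked without pandas. The following is a minimal sketch using only the standard library; the variable names (`per_visit`, `histogram`, `result`) are illustrative and not part of the original problem.

```python
from collections import Counter

# Example data copied from the tables above.
visits = [
    (1, "2020-01-01"), (2, "2020-01-02"), (12, "2020-01-01"),
    (19, "2020-01-03"), (1, "2020-01-02"), (2, "2020-01-03"),
    (1, "2020-01-04"), (7, "2020-01-11"), (9, "2020-01-25"),
    (8, "2020-01-28"),
]
transactions = [
    (1, "2020-01-02", 120), (2, "2020-01-03", 22), (7, "2020-01-11", 232),
    (1, "2020-01-04", 7), (9, "2020-01-25", 33), (9, "2020-01-25", 66),
    (8, "2020-01-28", 1), (9, "2020-01-25", 99),
]

# Number of transactions for each (user_id, date) pair.
per_visit = Counter((user_id, date) for user_id, date, _ in transactions)

# How many visits saw 0, 1, 2, ... transactions.
# A Counter returns 0 for a missing key, so visits without transactions land in bucket 0.
histogram = Counter(per_visit[visit] for visit in visits)

# Emit every count from 0 up to the maximum observed, including empty buckets.
result = [(n, histogram[n]) for n in range(max(histogram) + 1)]
print(result)
```

The sketch assumes at least one visit exists; with an empty `visits` list, `max(histogram)` would raise a `ValueError`.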


In [50]:
import pandas as pd
import numpy as np

data = [[1, '2020-01-01'], 
        [2, '2020-01-02'], 
        [12, '2020-01-01'], 
        [19, '2020-01-03'], 
        [1, '2020-01-02'], 
        [2, '2020-01-03'], 
        [1, '2020-01-04'], 
        [7, '2020-01-11'], 
        [9, '2020-01-25'], 
        [8, '2020-01-28']]
visits = pd.DataFrame(
    data, 
    columns=['user_id', 
             'visit_date']).astype({'user_id':'Int64', 
             'visit_date':'datetime64[ns]'})
display(visits)

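|   | user_id | visit_date |
|---|---------|------------|
| 0 | 1       | 2020-01-01 |
| 1 | 2       | 2020-01-02 |
| 2 | 12      | 2020-01-01 |
| 3 | 19      | 2020-01-03 |
| 4 | 1       | 2020-01-02 |
| 5 | 2       | 2020-01-03 |
| 6 | 1       | 2020-01-04 |
| 7 | 7       | 2020-01-11 |
| 8 | 9       | 2020-01-25 |
| 9 | 8       | 2020-01-28 |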


In [51]:
data = [[1, '2020-01-02', 120], 
        [2, '2020-01-03', 22], 
        [7, '2020-01-11', 232], 
        [1, '2020-01-04', 7], 
        [9, '2020-01-25', 33], 
        [9, '2020-01-25', 66], 
        [8, '2020-01-28', 1], 
        [9, '2020-01-25', 99]]
transactions = pd.DataFrame(
    data, 
    columns=['user_id', 
             'transaction_date', 
             'amount']).astype({'user_id':'Int64', 
             'transaction_date':'datetime64[ns]', 
             'amount':'Int64'})
display(transactions)

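|   | user_id | transaction_date | amount |
|---|---------|------------------|--------|
| 0 | 1       | 2020-01-02       | 120    |
| 1 | 2       | 2020-01-03       | 22     |
| 2 | 7       | 2020-01-11       | 232    |
| 3 | 1       | 2020-01-04       | 7      |
| 4 | 9       | 2020-01-25       | 33     |
| 5 | 9       | 2020-01-25       | 66     |
| 6 | 8       | 2020-01-28       | 1      |
| 7 | 9       | 2020-01-25       | 99     |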


**Step 1. Merge Visits and Transactions**
- Left-joins `visits` with `transactions`, matching `user_id` and `visit_date` from `visits` against `user_id` and `transaction_date` from `transactions`.
- Result: a DataFrame (`df`) with one row per matching transaction, plus one row with missing `transaction_date` and `amount` for each visit that has no transaction.

In [52]:
df = visits.merge(transactions, 
                  how="left",
                  left_on=["user_id", "visit_date"],
                  right_on=["user_id", "transaction_date"])
display(df)

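|    | user_id | visit_date | transaction_date | amount |
|----|---------|------------|------------------|--------|
| 0  | 1       | 2020-01-01 | NaT              |        |
| 1  | 2       | 2020-01-02 | NaT              |        |
| 2  | 12      | 2020-01-01 | NaT              |        |
| 3  | 19      | 2020-01-03 | NaT              |        |
| 4  | 1       | 2020-01-02 | 2020-01-02       | 120.0  |
| 5  | 2       | 2020-01-03 | 2020-01-03       | 22.0   |
| 6  | 1       | 2020-01-04 | 2020-01-04       | 7.0    |
| 7  | 7       | 2020-01-11 | 2020-01-11       | 232.0  |
| 8  | 9       | 2020-01-25 | 2020-01-25       | 33.0   |
| 9  | 9       | 2020-01-25 | 2020-01-25       | 66.0   |
| 10 | 9       | 2020-01-25 | 2020-01-25       | 99.0   |
| 11 | 8       | 2020-01-28 | 2020-01-28       | 1.0    |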
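
To see which visits found a matching transaction, `DataFrame.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses a reduced, three-visit subset of the example data (an assumption made to keep it short), not the full tables.

```python
import pandas as pd

# Reduced subset of the example data (illustrative only).
visits = pd.DataFrame({
    "user_id": [1, 9, 8],
    "visit_date": pd.to_datetime(["2020-01-01", "2020-01-25", "2020-01-28"]),
})
transactions = pd.DataFrame({
    "user_id": [9, 9, 8],
    "transaction_date": pd.to_datetime(["2020-01-25", "2020-01-25", "2020-01-28"]),
    "amount": [33, 66, 1],
})

merged = visits.merge(
    transactions,
    how="left",
    left_on=["user_id", "visit_date"],
    right_on=["user_id", "transaction_date"],
    indicator=True,  # adds "_merge": "both" for matches, "left_only" otherwise
)
print(merged[["user_id", "visit_date", "_merge"]])
```

A visit with several transactions appears once per transaction, so the merged frame can have more rows than `visits`.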


**Step 2. Flag Transactions Matching Visit Date**
- Creates a column `transaction` that is 1 if `visit_date` equals `transaction_date` and 0 otherwise. Because the merge already matched on the date, the flag is 0 exactly for the rows where `transaction_date` is `NaT`.
- Result: a binary flag indicating whether the row carries a transaction for that visit.

In [53]:
df["transaction"] = np.where(df["visit_date"] == df["transaction_date"], 1, 0)
display(df)

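|    | user_id | visit_date | transaction_date | amount | transaction |
|----|---------|------------|------------------|--------|-------------|
| 0  | 1       | 2020-01-01 | NaT              |        | 0           |
| 1  | 2       | 2020-01-02 | NaT              |        | 0           |
| 2  | 12      | 2020-01-01 | NaT              |        | 0           |
| 3  | 19      | 2020-01-03 | NaT              |        | 0           |
| 4  | 1       | 2020-01-02 | 2020-01-02       | 120.0  | 1           |
| 5  | 2       | 2020-01-03 | 2020-01-03       | 22.0   | 1           |
| 6  | 1       | 2020-01-04 | 2020-01-04       | 7.0    | 1           |
| 7  | 7       | 2020-01-11 | 2020-01-11       | 232.0  | 1           |
| 8  | 9       | 2020-01-25 | 2020-01-25       | 33.0   | 1           |
| 9  | 9       | 2020-01-25 | 2020-01-25       | 66.0   | 1           |
| 10 | 9       | 2020-01-25 | 2020-01-25       | 99.0   | 1           |
| 11 | 8       | 2020-01-28 | 2020-01-28       | 1.0    | 1           |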
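
Since unmatched visits are the only rows with a missing `transaction_date`, the same flag can plausibly be derived from `notna()` alone. This is an alternative sketch on a small hand-built frame, not the notebook's own approach.

```python
import pandas as pd

# Small stand-in for the merged frame: one visit without a transaction,
# one visit with two transactions.
df = pd.DataFrame({
    "user_id": [1, 9, 9],
    "visit_date": pd.to_datetime(["2020-01-01", "2020-01-25", "2020-01-25"]),
    "transaction_date": pd.to_datetime([None, "2020-01-25", "2020-01-25"]),
})

# 1 where the left join found a transaction, 0 where it left NaT.
df["transaction"] = df["transaction_date"].notna().astype(int)
print(df)
```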


**Step 3. Drop Unnecessary Columns**
- Removes the `transaction_date` and `amount` columns, which are no longer needed once the flag exists.
- Result: a DataFrame with only `user_id`, `visit_date`, and `transaction`.

**Step 4. Count Transactions Per Visit**
- Adds a column `transactions_count` holding the sum of `transaction` within each `(user_id, visit_date)` group. `transform("sum")` broadcasts the group total back to every row of the group.
- Result: every row carries the total number of transactions for its visit.

In [54]:
df = df.drop(columns=["transaction_date", "amount"])
df["transactions_count"] = df.groupby(["user_id", "visit_date"])["transaction"].transform("sum")
display(df)

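|    | user_id | visit_date | transaction | transactions_count |
|----|---------|------------|-------------|--------------------|
| 0  | 1       | 2020-01-01 | 0           | 0                  |
| 1  | 2       | 2020-01-02 | 0           | 0                  |
| 2  | 12      | 2020-01-01 | 0           | 0                  |
| 3  | 19      | 2020-01-03 | 0           | 0                  |
| 4  | 1       | 2020-01-02 | 1           | 1                  |
| 5  | 2       | 2020-01-03 | 1           | 1                  |
| 6  | 1       | 2020-01-04 | 1           | 1                  |
| 7  | 7       | 2020-01-11 | 1           | 1                  |
| 8  | 9       | 2020-01-25 | 1           | 3                  |
| 9  | 9       | 2020-01-25 | 1           | 3                  |
| 10 | 9       | 2020-01-25 | 1           | 3                  |
| 11 | 8       | 2020-01-28 | 1           | 1                  |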
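
As a possible shortcut for Steps 2 to 4, `groupby(...).count()` skips missing values, so counting `transaction_date` per visit yields 0 for visits without transactions and one row per visit. The frame below is a hand-built stand-in for the merged data, and `per_visit` is an illustrative name.

```python
import pandas as pd

# Stand-in for the merged frame from Step 1 (subset of the example).
df = pd.DataFrame({
    "user_id": [1, 9, 9, 9, 8],
    "visit_date": pd.to_datetime(
        ["2020-01-01", "2020-01-25", "2020-01-25", "2020-01-25", "2020-01-28"]
    ),
    "transaction_date": pd.to_datetime(
        [None, "2020-01-25", "2020-01-25", "2020-01-25", "2020-01-28"]
    ),
})

per_visit = (
    df.groupby(["user_id", "visit_date"])["transaction_date"]
      .count()  # count() ignores NaT, so unmatched visits get 0
      .reset_index(name="transactions_count")
)
print(per_visit)
```

Unlike `transform("sum")`, this collapses each visit to a single row, so no separate deduplication is needed afterwards. Rows come back sorted by the group keys.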


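**Step 5. Remove Duplicates**
- Removes duplicate rows, keeping the first occurrence. A visit with several transactions appears once per transaction after the merge, and those rows are now identical.
- Result: a DataFrame with exactly one row per visit.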

In [55]:
df = df.drop_duplicates(keep="first")
display(df)

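|    | user_id | visit_date | transaction | transactions_count |
|----|---------|------------|-------------|--------------------|
| 0  | 1       | 2020-01-01 | 0           | 0                  |
| 1  | 2       | 2020-01-02 | 0           | 0                  |
| 2  | 12      | 2020-01-01 | 0           | 0                  |
| 3  | 19      | 2020-01-03 | 0           | 0                  |
| 4  | 1       | 2020-01-02 | 1           | 1                  |
| 5  | 2       | 2020-01-03 | 1           | 1                  |
| 6  | 1       | 2020-01-04 | 1           | 1                  |
| 7  | 7       | 2020-01-11 | 1           | 1                  |
| 8  | 9       | 2020-01-25 | 1           | 3                  |
| 11 | 8       | 2020-01-28 | 1           | 1                  |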
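
Deduplicating on all columns works here because the remaining columns are identical within a visit. A slightly more explicit variant, sketched below on a small stand-in frame, names the visit key through the `subset` parameter of `drop_duplicates`.

```python
import pandas as pd

# Stand-in frame: user 9's visit appears three times, user 8's once.
df = pd.DataFrame({
    "user_id": [9, 9, 9, 8],
    "visit_date": pd.to_datetime(["2020-01-25"] * 3 + ["2020-01-28"]),
    "transactions_count": [3, 3, 3, 1],
})

# Keep one row per (user_id, visit_date), regardless of other columns.
per_visit = df.drop_duplicates(subset=["user_id", "visit_date"], keep="first")
print(per_visit)
```

The surviving rows keep their original index labels, which is why the output above jumps from 8 to 11.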


**Step 6. Aggregate by Transaction Counts**
- Counts how many visits share each value of `transactions_count`, producing a frequency table ordered by descending frequency.
- Result: a DataFrame with `transactions_count` and the number of visits for each value.

**Step 7. Rename Count Column**
- Renames the count column generated by `value_counts` (named `count` here) to `visits_count`.
- Result: a DataFrame whose column names match the required output.

In [56]:
df = df[["transactions_count"]].value_counts().reset_index()
df = df.rename(columns={"count": "visits_count"})
display(df)

|   | transactions_count | visits_count |
|---|--------------------|--------------|
| 0 | 1                  | 5            |
| 1 | 0                  | 4            |
| 2 | 3                  | 1            |
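
The name of the generated count column depends on the pandas version (it is `count` in pandas 2.x), so renaming `"count"` may silently do nothing on older releases. A sketch that sets both column names explicitly, assuming a `per_visit` frame with one row per visit:

```python
import pandas as pd

# One row per visit, as after Step 5 (values taken from the example).
per_visit = pd.DataFrame({"transactions_count": [0, 0, 0, 0, 1, 1, 1, 1, 1, 3]})

freq = (
    per_visit["transactions_count"]
    .value_counts()
    .rename_axis("transactions_count")   # name the index explicitly
    .reset_index(name="visits_count")    # name the counts explicitly
)
print(freq)
```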


**Step 8. Identify Missing Transaction Counts**
- Collects every value `i` from 0 up to the maximum `transactions_count` that does not appear in `df`. The loop stops before the maximum itself, which is present by definition.
- Result: a list (`not_present`) of the missing transaction counts.


In [57]:
not_present = []
for i in range(df["transactions_count"].max()):
    if i not in df["transactions_count"].values:
        not_present.append(i)
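
The same list can be obtained with a set difference. This is a small standard-library sketch; `present` stands in for the values of `df["transactions_count"]` after Step 7.

```python
# Transaction counts that occur in the frequency table (from the example).
present = [1, 0, 3]

# Every count in 0..max that has no visits.
not_present = sorted(set(range(max(present) + 1)) - set(present))
print(not_present)
```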

**Step 9. Create Zero-Frequency DataFrame**
- Builds a DataFrame from the missing transaction counts and assigns each a `visits_count` of 0.
- Result: a DataFrame (`df_zero`) holding the transaction counts with zero frequency.


In [58]:
df_zero = pd.DataFrame()
df_zero["transactions_count"] = not_present
df_zero["visits_count"] = 0
display(df_zero)

|   | transactions_count | visits_count |
|---|--------------------|--------------|
| 0 | 2                  | 0            |


**Step 10. Concatenate DataFrames**
- Appends the zero-frequency DataFrame (`df_zero`) to the frequency table.
- Result: a single DataFrame covering every transaction count from 0 to the maximum. Each input keeps its own index labels, so the label 0 appears twice.

In [59]:
df = pd.concat([df, df_zero], axis=0)
display(df)

|   | transactions_count | visits_count |
|---|--------------------|--------------|
| 0 | 1                  | 5            |
| 1 | 0                  | 4            |
| 2 | 3                  | 1            |
| 0 | 2                  | 0            |


**Step 11. Sort by Transaction Count**
- Sorts the DataFrame in ascending order of `transactions_count`.
- Result: the final table in the required order, matching the expected output.

In [60]:
df = df.sort_values(by=["transactions_count"])
display(df)

|   | transactions_count | visits_count |
|---|--------------------|--------------|
| 1 | 0                  | 4            |
| 0 | 1                  | 5            |
| 0 | 2                  | 0            |
| 2 | 3                  | 1            |
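
Steps 8 to 11 can plausibly be condensed into a single `reindex` call, which fills the gaps and orders the rows in one pass. The sketch below starts from the frequency table of Step 7; `freq`, `full_range`, and `result` are illustrative names.

```python
import pandas as pd

# Frequency table as produced by Steps 6 and 7.
freq = pd.DataFrame({"transactions_count": [1, 0, 3], "visits_count": [5, 4, 1]})

# Every transaction count from 0 to the maximum observed, inclusive.
full_range = range(freq["transactions_count"].max() + 1)

result = (
    freq.set_index("transactions_count")
        .reindex(full_range, fill_value=0)   # missing counts get visits_count 0
        .rename_axis("transactions_count")   # make sure the index name survives
        .reset_index()
)
print(result)
```

Because the new index follows `full_range`, the rows come out sorted and the final frame has a clean 0..n index rather than the repeated labels left by `pd.concat`.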


References: [1] https://leetcode.com/problems/number-of-transactions-per-visit/?lang=pythondata