

| Column Name  | Type |
|--------------|------|
| **player_id**    | int  |
| **device_id**    | int  |
| **event_date**   | date |
| **games_played** | int  |

Table: Activity
- **(player_id, event_date)** is the primary key (a combination of columns with unique values).
- This table shows the activity of players of some games.
- Each row represents a **player** who logged in and played a number of games (possibly 0) before logging out, on some day, using some device.

---

### Install Date
- The **install date** of a player is the **first login day** of that player.

### Day One Retention
- We define the **day one retention** of a date x as the number of players whose install date is x and who logged back in on the day right after x, divided by the number of players whose install date is x, rounded to 2 decimal places.
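
The same definition can be written as a formula. The symbols are introduced here only for illustration and are not part of the problem statement: \( I_x \) is the set of players whose install date is \( x \), and \( R_x \subseteq I_x \) is the subset of those players who also logged in on the day after \( x \).

\[
\text{Day1\_retention}(x) = \mathrm{round}\left( \frac{|R_x|}{|I_x|},\ 2 \right)
\]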

---

### Task
Write a solution to report, for each install date, the number of players that installed the game on that day and the day one retention.

Return the result table in any order.

The result format should follow this example:

---

### Example

#### Input
**Activity** table:

| player_id | device_id | event_date | games_played |
|-----------|-----------|------------|--------------|
| 1         | 2         | 2016-03-01 | 5            |
| 1         | 2         | 2016-03-02 | 6            |
| 2         | 3         | 2017-06-25 | 1            |
| 3         | 1         | 2016-03-01 | 0            |
| 3         | 4         | 2018-07-03 | 5            |

#### Output
| install_dt | installs | Day1_retention |
|------------|----------|----------------|
| 2016-03-01 | 2        | 0.50           |
| 2017-06-25 | 1        | 0.00           |

#### Explanation
- **Player 1** and **Player 3** both installed the game on **2016-03-01**.
  - Only **Player 1** logged back in on **2016-03-02**, so the day one retention for **2016-03-01** is \( 1 / 2 = 0.50 \).
- **Player 2** installed the game on **2017-06-25** but did **not** log back in on **2017-06-26**, so the day one retention for **2017-06-25** is \( 0 / 1 = 0.00 \).


In [43]:
import pandas as pd
import numpy as np

data = [[1, 2, '2016-03-01', 5], 
        [1, 2, '2016-03-02', 6], 
        [2, 3, '2017-06-25', 1], 
        [3, 1, '2016-03-01', 0], 
        [3, 4, '2018-07-03', 5]]

activity = pd.DataFrame(data, 
                        columns=['player_id', 'device_id', 'event_date', 'games_played']).astype(
    {'player_id':'Int64', 'device_id':'Int64', 'event_date':'datetime64[ns]', 'games_played':'Int64'})


**Step 1: Sort the DataFrame**

- Sort rows first by player_id (ascending) and then by event_date (ascending).
- This sorting is crucial for the next step, where .transform("first") picks each player's earliest date and .shift(-1) finds the next event date.

In [44]:
activity = activity.sort_values(by=["player_id", "event_date"])

display(activity)

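|   | player_id | device_id | event_date | games_played |
|---|-----------|-----------|------------|--------------|
| 0 | 1         | 2         | 2016-03-01 | 5            |
| 1 | 1         | 2         | 2016-03-02 | 6            |
| 2 | 2         | 3         | 2017-06-25 | 1            |
| 3 | 3         | 1         | 2016-03-01 | 0            |
| 4 | 3         | 4         | 2018-07-03 | 5            |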
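
To see why the sort matters, here is a minimal sketch on a hypothetical two-row frame (the names `unsorted` and `ordered` are illustrative and not part of the notebook). It compares what .shift(-1) returns before and after sorting.

```python
import pandas as pd

# Hypothetical two-row frame, deliberately listed out of date order.
unsorted = pd.DataFrame({
    "player_id": [1, 1],
    "event_date": pd.to_datetime(["2016-03-02", "2016-03-01"]),
})

# Without sorting, shift(-1) pairs 2016-03-02 with an earlier date.
next_unsorted = unsorted.groupby("player_id")["event_date"].shift(-1)

# After sorting, the shifted value is the player's next login.
ordered = unsorted.sort_values(by=["player_id", "event_date"])
next_sorted = ordered.groupby("player_id")["event_date"].shift(-1)

print(next_unsorted.tolist())
print(next_sorted.tolist())
```

In the unsorted case the "next" date precedes the current one, so any later comparison against the day after install would be meaningless.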


**Step 2: Calculate install date, next login date, and expected day 1 login date**
- install_dt: We group by player_id, select the event_date column, and use .transform("first"). Because the rows are sorted by date, install_dt holds, for every row of a player, the first date that player appeared in the data (i.e., their “installation date”).
- login_date: Again, group by player_id and select event_date, but use .shift(-1). This takes the next row’s event date within the same player, so each row shows when the player next logged in after that day. The last row of each player has no next row and gets NaT.
- day1_login_date: We add one day (pd.Timedelta(days=1)) to install_dt. This is the “Day 1” date for each player (the day after install).

In [45]:
activity["install_dt"] = activity.groupby(["player_id"])["event_date"].transform("first")
activity["login_date"] = activity.groupby(["player_id"])["event_date"].shift(-1)
activity["day1_login_date"] = activity["install_dt"] + pd.Timedelta(days=1)

display(activity)

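|   | player_id | device_id | event_date | games_played | install_dt | login_date | day1_login_date |
|---|-----------|-----------|------------|--------------|------------|------------|-----------------|
| 0 | 1         | 2         | 2016-03-01 | 5            | 2016-03-01 | 2016-03-02 | 2016-03-02      |
| 1 | 1         | 2         | 2016-03-02 | 6            | 2016-03-01 | NaT        | 2016-03-02      |
| 2 | 2         | 3         | 2017-06-25 | 1            | 2017-06-25 | NaT        | 2017-06-26      |
| 3 | 3         | 1         | 2016-03-01 | 0            | 2016-03-01 | 2018-07-03 | 2016-03-02      |
| 4 | 3         | 4         | 2018-07-03 | 5            | 2016-03-01 | NaT        | 2016-03-02      |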
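
Since .transform("first") depends on the row order from Step 1, one possible variation is .transform("min"), which picks the earliest date whatever the order. This is a sketch of that alternative, not the notebook's method. The frame `sample` is rebuilt here from the example rows (shuffled on purpose) so the snippet runs on its own.

```python
import pandas as pd

# Example rows rebuilt for a standalone run, listed out of date order on purpose.
sample = pd.DataFrame({
    "player_id": [1, 1, 2, 3, 3],
    "event_date": pd.to_datetime(
        ["2016-03-02", "2016-03-01", "2017-06-25", "2018-07-03", "2016-03-01"]
    ),
})

# "min" returns each player's earliest event_date regardless of row order.
sample["install_dt"] = sample.groupby("player_id")["event_date"].transform("min")

print(sample)
```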


**Step 3: Create the is_login flag**

- is_login: We want to see if the player actually logged in on “Day 1”. We compare day1_login_date to login_date with np.where(condition, value_if_true, value_if_false): if they match (==), is_login is 1; otherwise 0. A comparison with NaT is never true, so rows without a later login get 0.

In [46]:
activity["is_login"] = np.where(activity["day1_login_date"] == activity["login_date"], 1, 0)

display(activity)

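|   | player_id | device_id | event_date | games_played | install_dt | login_date | day1_login_date | is_login |
|---|-----------|-----------|------------|--------------|------------|------------|-----------------|----------|
| 0 | 1         | 2         | 2016-03-01 | 5            | 2016-03-01 | 2016-03-02 | 2016-03-02      | 1        |
| 1 | 1         | 2         | 2016-03-02 | 6            | 2016-03-01 | NaT        | 2016-03-02      | 0        |
| 2 | 2         | 3         | 2017-06-25 | 1            | 2017-06-25 | NaT        | 2017-06-26      | 0        |
| 3 | 3         | 1         | 2016-03-01 | 0            | 2016-03-01 | 2018-07-03 | 2016-03-02      | 0        |
| 4 | 3         | 4         | 2018-07-03 | 5            | 2016-03-01 | NaT        | 2016-03-02      | 0        |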
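
The flag can also be built without NumPy. The sketch below uses a hypothetical three-row frame named `flags` (one match, one missing next login, one mismatch) and relies on the fact that comparing a date with NaT yields False.

```python
import pandas as pd

# Hypothetical mini-frame: a match, a missing next login (NaT), and a mismatch.
flags = pd.DataFrame({
    "day1_login_date": pd.to_datetime(["2016-03-02", "2017-06-26", "2016-03-02"]),
    "login_date": pd.to_datetime(["2016-03-02", None, "2018-07-03"]),
})

# .eq() gives a boolean Series; NaT never compares equal, so it maps to 0.
flags["is_login"] = flags["day1_login_date"].eq(flags["login_date"]).astype(int)

print(flags["is_login"].tolist())  # [1, 0, 0]
```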


**Step 4: Grouping and aggregating data**

We now group the DataFrame by install_dt (the installation date). .agg(...) is an aggregation method:
- installs=("player_id", "nunique"): For each installation date, count the number of unique players (because multiple rows might be the same player with different events).
- logins=("is_login", "sum"): Sum all the is_login flags to see how many actual Day 1 logins occurred.

In [47]:
activity = activity.groupby(["install_dt"]).agg(
    installs=("player_id", "nunique"),
    logins=("is_login", "sum")
)

display(activity)

| install_dt | installs | logins |
|------------|----------|--------|
| 2016-03-01 | 2        | 1      |
| 2017-06-25 | 1        | 0      |


**Step 5: Calculate the Day 1 retention**

- Day1_retention: We divide the number of Day 1 logins by the number of installs for that date. This gives the proportion of players who installed on that date and came back on Day 1.
- custom_round function: Takes a number x and the number of decimals to keep. It multiplies x by 10^decimals, adds 0.5, truncates to an integer, then divides by 10^decimals. This is a common technique to perform “round half up” on non-negative values. We need it because **pandas**, like Python's built-in round, uses a strategy known as “Bankers Rounding” or “round half to even”: 0.125 is rounded to 0.12 instead of 0.13. Interestingly, 0.135 is still rounded to 0.14, because 14 is the even neighbour. Floating-point representation under the hood can also influence how such ties are resolved.
- Apply function: We apply this custom rounding to the Day1_retention column, so the result is rounded to 2 decimals in a “round half up” manner.

In [48]:
activity["Day1_retention"] = activity["logins"] / activity["installs"]

def custom_round(x, decimals=2):
    # Round half up for non-negative x: scale, add 0.5, truncate, scale back.
    offset = 10 ** decimals
    return int(x * offset + 0.5) / offset

activity["Day1_retention"] = activity["Day1_retention"].apply(lambda x: custom_round(x, 2))

display(activity)

| install_dt | installs | logins | Day1_retention |
|------------|----------|--------|----------------|
| 2016-03-01 | 2        | 1      | 0.5            |
| 2017-06-25 | 1        | 0      | 0.0            |
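
The difference between the two rounding strategies can be checked directly. The sketch below uses the standard-library decimal module as one possible alternative to custom_round. The helper name `round_half_up` is hypothetical, and converting through str() is an assumption made so that the Decimal sees the printed value rather than the binary float.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(x, decimals=2):
    # Hypothetical helper: quantize to 10**-decimals with ties rounded up.
    quantum = Decimal(10) ** -decimals  # Decimal("0.01") for decimals=2
    return float(Decimal(str(x)).quantize(quantum, rounding=ROUND_HALF_UP))

print(round(0.125, 2))       # built-in round: half to even -> 0.12
print(round_half_up(0.125))  # half up -> 0.13
```

For the retention values in this example (0.5 and 0.0) both strategies agree. The difference only shows up on ties at the third decimal place.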


**Step 6: Final cleanup**
- Drop logins: We remove the intermediate logins column since we only want to keep Day1_retention in the final dataset.
- .reset_index(): Because we grouped by install_dt previously, install_dt became the index. Now we want it back as a normal column, so we reset the index.

Finally, we display(activity) to show the resulting DataFrame, which now contains:

- install_dt: The installation date.
- installs: Number of new installs on that date (unique players).
- Day1_retention: The ratio (between 0 and 1) of players who returned on Day 1, rounded to 2 decimal places.

In [49]:
activity = activity.drop(columns=["logins"]).reset_index()

display(activity)

|   | install_dt | installs | Day1_retention |
|---|------------|----------|----------------|
| 0 | 2016-03-01 | 2        | 0.5            |
| 1 | 2017-06-25 | 1        | 0.0            |
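
For reuse, the steps can be folded into one function. The following is a sketch under a few assumptions: the name `day_one_retention` is hypothetical, event_date is already a datetime column, and, instead of .shift(-1), each row's event_date is compared directly with the day after install. The two approaches give the same counts because (player_id, event_date) is unique.

```python
import pandas as pd

def day_one_retention(activity: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical wrapper; expects event_date to be datetime64 already.
    df = activity[["player_id", "event_date"]].copy()
    df["install_dt"] = df.groupby("player_id")["event_date"].transform("min")
    # A row is a day-one return when it falls exactly one day after install.
    day1 = df["install_dt"] + pd.Timedelta(days=1)
    df["is_login"] = (df["event_date"] == day1).astype(int)
    out = df.groupby("install_dt").agg(
        installs=("player_id", "nunique"),
        logins=("is_login", "sum"),
    )
    # Same int(x * 100 + 0.5) / 100 rounding as in Step 5.
    out["Day1_retention"] = (out["logins"] / out["installs"]).apply(
        lambda x: int(x * 100 + 0.5) / 100
    )
    return out.drop(columns=["logins"]).reset_index()

# Example rows from the problem statement.
sample = pd.DataFrame(
    [[1, 2, "2016-03-01", 5], [1, 2, "2016-03-02", 6], [2, 3, "2017-06-25", 1],
     [3, 1, "2016-03-01", 0], [3, 4, "2018-07-03", 5]],
    columns=["player_id", "device_id", "event_date", "games_played"],
).astype({"event_date": "datetime64[ns]"})

result = day_one_retention(sample)
print(result)
```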
