# 511. Game Play Analysis I

### Difficulty
**Easy**

---

## Problem Statement

Given an `Activity` table, write a **SQL query** to find the **first login date** for each player.

Return the result table in **any order**.

---

## Table Schema

### **Table: Activity**
| Column Name   | Type    |
|---------------|---------|
| `player_id`   | `int`   |
| `device_id`   | `int`   |
| `event_date`  | `date`  |
| `games_played`| `int`   |

- **Primary Key**: `(player_id, event_date)` (combination of columns with unique values).
- Each row represents:
  - A player who logged in and played a certain number of games (possibly `0`) on a specific date using a specific device.

---

## Example

### **Input**
#### **Activity table:**
| player_id | device_id | event_date | games_played |
|-----------|-----------|------------|--------------|
| 1         | 2         | 2016-03-01 | 5            |
| 1         | 2         | 2016-05-02 | 6            |
| 2         | 3         | 2017-06-25 | 1            |
| 3         | 1         | 2016-03-02 | 0            |
| 3         | 4         | 2018-07-03 | 5            |

---

### **Output**
| player_id | first_login |
|-----------|-------------|
| 1         | 2016-03-01  |
| 2         | 2017-06-25  |
| 3         | 2016-03-02  |

---

### **Explanation**
- For `player_id = 1`, the first login date is **2016-03-01**.
- For `player_id = 2`, the first login date is **2017-06-25**.
- For `player_id = 3`, the first login date is **2016-03-02**.

---

## **Constraints**
- Each `player_id` can appear in the table multiple times with different `event_date`.
- The `event_date` column contains **unique dates** for a given `player_id`.
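
The uniqueness constraint above can be checked before running a solution. The following is a minimal sketch, assuming the data is already in a pandas DataFrame; the helper name `has_unique_key` and the two small frames are hypothetical and exist only for illustration.

```python
import pandas as pd

def has_unique_key(activity: pd.DataFrame) -> bool:
    """Return True when no (player_id, event_date) pair is repeated."""
    # duplicated() flags every row that repeats an earlier key combination.
    return not activity.duplicated(subset=['player_id', 'event_date']).any()

# Hypothetical frames: one satisfies the constraint, one violates it.
valid = pd.DataFrame({
    'player_id': [1, 1, 2],
    'event_date': ['2016-03-01', '2016-05-02', '2017-06-25'],
})
invalid = pd.DataFrame({
    'player_id': [1, 1],
    'event_date': ['2016-03-01', '2016-03-01'],
})

valid_ok = has_unique_key(valid)
invalid_ok = has_unique_key(invalid)
```

The first-login result does not depend on this constraint, since a minimum is unaffected by duplicates, but the check is a cheap way to confirm that the input matches the stated schema.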

---


# Solution

In [1]:
import pandas as pd

In [2]:
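def first_login(activity: pd.DataFrame) -> pd.DataFrame:
    # Earliest event_date per player; as_index=False keeps player_id as a regular column.
    result = activity.groupby('player_id', as_index=False)['event_date'].min()
    # Rename the aggregated column to the name the problem expects.
    result.rename(columns={'event_date': 'first_login'}, inplace=True)
    return result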
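
Since the problem statement asks for a SQL query, the same logic can be written as a `GROUP BY` with `MIN`. The sketch below runs that query from Python against an in-memory SQLite database so it can be tested locally. The SQLite setup, the helper name `first_login_sql`, and storing dates as ISO text are assumptions made for illustration, not part of the original solution.

```python
import sqlite3

# The query itself: one row per player with the earliest event_date.
FIRST_LOGIN_SQL = """
SELECT player_id, MIN(event_date) AS first_login
FROM Activity
GROUP BY player_id
"""

def first_login_sql(rows):
    """Load rows into an in-memory Activity table and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed schema: SQLite has no DATE type, so ISO dates are stored as TEXT.
        conn.execute(
            "CREATE TABLE Activity ("
            "player_id INTEGER, device_id INTEGER, "
            "event_date TEXT, games_played INTEGER, "
            "PRIMARY KEY (player_id, event_date))"
        )
        conn.executemany("INSERT INTO Activity VALUES (?, ?, ?, ?)", rows)
        return conn.execute(FIRST_LOGIN_SQL).fetchall()
    finally:
        conn.close()

# The sample rows from the problem's example input.
sample_rows = [
    (1, 2, "2016-03-01", 5),
    (1, 2, "2016-05-02", 6),
    (2, 3, "2017-06-25", 1),
    (3, 1, "2016-03-02", 0),
    (3, 4, "2018-07-03", 5),
]
sql_result = first_login_sql(sample_rows)
```

The result is a list of `(player_id, first_login)` tuples. The query has no `ORDER BY`, so the row order is not guaranteed, which matches the "any order" requirement.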

### **Time & Space Complexity**
| **Operation** | **Time Complexity** | **Space Complexity** | **Why?** |
|---------------|---------------------|---------------------|----------|
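| **Grouping (`groupby`)** | **O(n)** on average | **O(k)** | Hashes `n` rows into `k` unique players (`k ≤ n`). With the default `sort=True`, ordering the `k` group keys can add up to O(k log k). |
| **Aggregation (`min`)** | **O(n)** | **O(k)** | Scans each row once and keeps one minimum per group. |
| **Renaming Columns** | **O(k)** | **O(1)** | Relabels a column of the `k`-row result. |
| **Total Complexity** | **O(n)** | **O(k)** | Grouping and aggregation dominate. |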

---

### **Why Is `groupby` Often O(n) in Pandas but Sometimes O(n log n) in SQL?**

The difference comes from the grouping strategy each system uses, not from the `GROUP BY` operation itself.

---

### **1️⃣ Pandas `groupby`**
- In Pandas, `groupby` uses **hash-based grouping**:
  - Each value of the `groupby` key (e.g., `player_id`) is hashed and used to assign rows to groups.
  - Hash-based grouping runs in **O(n)** time on average.
- The rows themselves are **not sorted**. By default (`sort=True`) Pandas does sort the `k` group keys in the result; pass `sort=False` to skip that step and keep the keys in order of first appearance.
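
To make the hash-based idea concrete, here is a minimal sketch that groups with a plain Python `dict` in one pass and then compares key ordering under `sort=True` and `sort=False`. The helper `hash_group_min` and the shuffled sample data are hypothetical; this illustrates the technique and is not pandas' internal implementation.

```python
import pandas as pd

def hash_group_min(keys, values):
    """Single-pass grouping with a dict: O(n) average time, O(k) space."""
    best = {}
    for key, value in zip(keys, values):
        # Keep the smallest value seen so far for each key.
        if key not in best or value < best[key]:
            best[key] = value
    return best

# Hypothetical rows, deliberately not ordered by player_id.
keys = [3, 1, 3, 2, 1]
dates = ['2018-07-03', '2016-05-02', '2016-03-02', '2017-06-25', '2016-03-01']

manual = hash_group_min(keys, dates)

df = pd.DataFrame({'player_id': keys, 'event_date': dates})
# Default sort=True: group keys come back in sorted order.
sorted_keys = df.groupby('player_id')['event_date'].min()
# sort=False: group keys come back in order of first appearance.
unsorted_keys = df.groupby('player_id', sort=False)['event_date'].min()
```

Both pandas calls return the same minimum per player as `manual`. Only the order of the group keys differs: `[1, 2, 3]` with the default and `[3, 1, 2]` with `sort=False`.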

---

### **2️⃣ SQL `GROUP BY`**
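- In SQL, the query planner chooses the `GROUP BY` strategy. Both **sorting-based** and **hash-based** aggregation are common, and the choice depends on the engine, the available indexes, and the data size.
- Sorting-based grouping involves:
  1. Sorting rows by the `GROUP BY` column(s) → **O(n log n)**.
  2. Scanning the sorted rows to aggregate them → **O(n)**.
- **Worst case:** when the planner picks a sort-based plan and no index already supplies the order, grouping costs **O(n log n)**. A hash-based plan is **O(n)** on average, the same as in Pandas. Here the primary key on `(player_id, event_date)` may already provide the required order.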
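
The two sort-based steps above can be sketched in plain Python with `sorted` and `itertools.groupby`. The helper `sort_group_min` and the sample rows are hypothetical; the sketch mirrors the sort-then-scan idea and does not reproduce any particular database engine.

```python
from itertools import groupby
from operator import itemgetter

def sort_group_min(rows):
    """Sort-then-scan aggregation over (player_id, event_date) pairs."""
    # Step 1: sort by the grouping key, O(n log n).
    ordered = sorted(rows, key=itemgetter(0))
    result = []
    # Step 2: one linear scan; equal keys are now adjacent, O(n).
    for player_id, group in groupby(ordered, key=itemgetter(0)):
        result.append((player_id, min(date for _, date in group)))
    return result

# Hypothetical rows, deliberately not ordered by player_id.
rows = [
    (3, '2018-07-03'),
    (1, '2016-05-02'),
    (3, '2016-03-02'),
    (2, '2017-06-25'),
    (1, '2016-03-01'),
]
sort_based = sort_group_min(rows)
```

A side effect of this strategy is that the output is ordered by the grouping key, whereas a hash-based plan makes no such promise.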

---

### **3️⃣ Why the Difference?**
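- Pandas is designed for in-memory data processing, where a hash table over the group keys fits in RAM and is usually the fastest option.
- SQL engines must also handle tables larger than memory. A sort-based plan can spill to disk and can reuse an existing index order, so the planner sometimes prefers it. Note that `GROUP BY` alone does not guarantee the order of the output; that requires `ORDER BY`.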

---

# Example:

In [3]:
data = {
    'player_id': [1, 1, 2, 3, 3],
    'device_id': [2, 2, 3, 1, 4],
    'event_date': ['2016-03-01', '2016-05-02', '2017-06-25', '2016-03-02', '2018-07-03'],
    'games_played': [5, 6, 1, 0, 5]
}

activity = pd.DataFrame(data)


In [4]:
activity

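|   | player_id | device_id | event_date | games_played |
|---|-----------|-----------|------------|--------------|
| 0 | 1         | 2         | 2016-03-01 | 5            |
| 1 | 1         | 2         | 2016-05-02 | 6            |
| 2 | 2         | 3         | 2017-06-25 | 1            |
| 3 | 3         | 1         | 2016-03-02 | 0            |
| 4 | 3         | 4         | 2018-07-03 | 5            |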


In [5]:
activity.groupby('player_id', as_index=False)['event_date'].min()

|   | player_id | event_date |
|---|-----------|------------|
| 0 | 1         | 2016-03-01 |
| 1 | 2         | 2017-06-25 |
| 2 | 3         | 2016-03-02 |


In [6]:
first_login(activity)

|   | player_id | first_login |
|---|-----------|-------------|
| 0 | 1         | 2016-03-01  |
| 1 | 2         | 2017-06-25  |
| 2 | 3         | 2016-03-02  |
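
In the example above, `event_date` holds strings rather than dates. The minimum is still correct there because ISO `YYYY-MM-DD` strings sort in chronological order, but that does not hold for other formats. The sketch below converts the column with `pd.to_datetime` first. The helper `first_login_typed`, its `date_format` parameter, and the `MM/DD/YYYY` sample frame are hypothetical and serve only to show the difference.

```python
import pandas as pd

def first_login_typed(activity: pd.DataFrame, date_format=None) -> pd.DataFrame:
    """Same aggregation as first_login, but on real datetime values."""
    out = activity.copy()  # leave the caller's frame untouched
    out['event_date'] = pd.to_datetime(out['event_date'], format=date_format)
    result = out.groupby('player_id', as_index=False)['event_date'].min()
    return result.rename(columns={'event_date': 'first_login'})

# Hypothetical data in MM/DD/YYYY format, where text order != date order.
us_style = pd.DataFrame({
    'player_id': [1, 1],
    'event_date': ['12/31/2015', '01/01/2016'],
})

# Comparing as text picks the lexicographically smaller string.
text_min = us_style.groupby('player_id')['event_date'].min().iloc[0]
# Comparing as dates picks the chronologically earlier day.
typed_min = first_login_typed(us_style, '%m/%d/%Y')['first_login'].iloc[0]
```

Here `text_min` is `'01/01/2016'`, while `typed_min` is the timestamp for 2015-12-31, which is the actual first login.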
