# 1050. Actors and Directors Who Cooperated At Least Three Times

### Difficulty
**Easy**

---

## Problem Statement

Given an `ActorDirector` table, write a **SQL query** to find all the pairs `(actor_id, director_id)` where the **actor has cooperated with the director at least three times**.

Return the result table in **any order**.

---

## Table Schema

### **Table: ActorDirector**
| Column Name  | Type  |
|-------------|-------|
| `actor_id`  | `int` |
| `director_id` | `int` |
| `timestamp` | `int` |

- `timestamp` is the **primary key** (unique values).
- Each row represents a **collaboration** between an **actor** and a **director** at a specific point in time.

---

## Example

### **Input**
#### **ActorDirector table:**
| actor_id | director_id | timestamp |
|----------|------------|-----------|
| 1        | 1          | 0         |
| 1        | 1          | 1         |
| 1        | 1          | 2         |
| 1        | 2          | 3         |
| 1        | 2          | 4         |
| 2        | 1          | 5         |
| 2        | 1          | 6         |

---

### **Output**
| actor_id | director_id |
|----------|------------|
| 1        | 1          |

---

### **Explanation**
- The pair **(1, 1)** appears **three times**, so it is included in the output.
- The pair **(1, 2)** appears **twice**, so it is **not included**.
- The pair **(2, 1)** appears **twice**, so it is **not included**.

---


# Solution

```python
import pandas as pd

def actors_and_directors(actor_director: pd.DataFrame) -> pd.DataFrame:
    # Count the rows for each (actor_id, director_id) pair
    count = actor_director.groupby(['actor_id', 'director_id']).size().reset_index(name='count')

    # Keep pairs that appear at least three times (count > 2)
    result = count[count['count'] > 2][['actor_id', 'director_id']]

    return result
```

---

### **Time & Space Complexity**
| **Operation** | **Time Complexity** | **Space Complexity** |
|--------------|----------------|----------------|
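| `groupby().size()` | **O(n)** on average (pandas groups by hashing the keys), plus **O(k log k)** to sort the group keys | **O(k)** (storing unique pairs) |
| `reset_index()` | **O(k)** | **O(k)** |
| `Filtering (count > 2)` | **O(k)** | **O(k)** |
| **Overall Complexity** | **O(n + k log k)**, which is at most **O(n log n)** | **O(k)** |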

- **`n` = number of rows in the `actor_director` table**.
- **`k` = number of unique `(actor_id, director_id)` pairs**.

---

# Example Implementation

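```python
data = {'actor_id': [1, 1, 1, 1, 1, 2, 2],
        'director_id': [1, 1, 1, 2, 2, 1, 1],
        'timestamp': [0, 1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data)
```

```python
df
```

|   | actor_id | director_id | timestamp |
|---|----------|-------------|-----------|
| 0 | 1        | 1           | 0         |
| 1 | 1        | 1           | 1         |
| 2 | 1        | 1           | 2         |
| 3 | 1        | 2           | 3         |
| 4 | 1        | 2           | 4         |
| 5 | 2        | 1           | 5         |
| 6 | 2        | 1           | 6         |

```python
actors_and_directors(df)
```

|   | actor_id | director_id |
|---|----------|-------------|
| 0 | 1        | 1           |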


# Comparing LeetCode 596 to 1050

---

## **1️⃣ Difference Between `.size()` and `.count()`**
- **`.size()`** → Counts the **total number of rows** in each group (regardless of missing values).
- **`.count()`** → Counts **non-null values** in a specific column.
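
To make the difference concrete, here is a minimal sketch on made-up data (the `courses` frame and its values are hypothetical, chosen only so that one `student` entry is missing):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one student name is missing in the Math class
courses = pd.DataFrame({
    'class': ['Math', 'Math', 'Math', 'Art'],
    'student': ['A', 'B', np.nan, 'C'],
})

sizes = courses.groupby('class').size()               # rows per group, NaN rows included
counts = courses.groupby('class')['student'].count()  # non-null students per group

print(sizes['Math'], counts['Math'])  # 3 rows, but only 2 non-null students
```

The two results only diverge when the counted column contains missing values; on fully populated data they agree.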

---

## **2️⃣ Why Did I Use `.count()` for `find_classes()`?**
```python
grouped = courses.groupby('class')['student'].count()
```
- We **specifically counted the number of students** in each class.
- Since each student belongs to a class, counting the `"student"` column was **meaningful**.
- If there were missing values (`NaN`) in `"student"`, `.count()` **ignores them**.

✅ **This approach works well because:**  
- We **only care about student counts per class**.
- The `"student"` column is the correct thing to count.

---

## **3️⃣ Why Did I Use `.size()` for `actors_and_directors()`?**
```python
count = actor_director.groupby(['actor_id', 'director_id']).size()
```
- This is because we are **counting the number of times an actor worked with a director**.
- With `.count()`, I would either have to **pick a column to count** (e.g., `timestamp`) or get back one count per remaining column. Since every row is a valid collaboration, **`.size()` is the more direct choice**.

✅ **This approach works well because:**  
- We just need to count **the number of times** each actor-director pair appears in the dataset.
- **All rows represent valid collaborations**, so counting rows directly makes more sense.
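
As a quick check of that reasoning, the sketch below compares the options on the example table from this problem. The variable names (`pairs`, `by_size`, `by_count`, `all_counts`) are just illustrative:

```python
import pandas as pd

actor_director = pd.DataFrame({
    'actor_id': [1, 1, 1, 1, 1, 2, 2],
    'director_id': [1, 1, 1, 2, 2, 1, 1],
    'timestamp': [0, 1, 2, 3, 4, 5, 6],
})

pairs = actor_director.groupby(['actor_id', 'director_id'])

by_size = pairs.size()                 # Series: rows per pair
by_count = pairs['timestamp'].count()  # Series: non-null timestamps per pair
all_counts = pairs.count()             # DataFrame: one count column per non-key column

print((by_size == by_count).all())     # the two Series agree when nothing is missing
print(list(all_counts.columns))        # only the non-key column remains
```

Because `timestamp` has no missing values here, both approaches give the same numbers; `.size()` simply avoids having to choose a column.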

---

## **4️⃣ Summary of When to Use `.size()` vs `.count()`**
| Scenario | Use `.size()` | Use `.count()` |
|----------|-------------|--------------|
| Counting occurrences of a group | ✅ Yes | ❌ No |
| Counting **non-null** values in a column | ❌ No | ✅ Yes |
| Counting **rows** regardless of missing values | ✅ Yes | ❌ No |
| Counting relationships (like actor-director pairs) | ✅ Yes | ❌ No |
| Counting class enrollments (students per class) | ❌ No | ✅ Yes |

---

## **5️⃣ Final Rule of Thumb**
- Use `.size()` **when counting occurrences of an event (like collaborations)**.
- Use `.count()` **when counting valid entries in a column (like students in a class)**.

---

### **Why Did I Use `count[count['count'] > 2]` Instead of `grouped[grouped > 4]`?**

### **1️⃣ The Difference Between the Two**
Both solutions filter groups **based on how many times they appear**, but there are key differences in how **grouped data is structured**:

| **Function**      | **Grouping** | **Resulting Object** | **How We Filtered** |
|------------------|-------------|----------------|----------------|
| `find_classes()` | `groupby('class')['student'].count()` | Pandas **Series** | `grouped[grouped > 4]` (boolean mask on the Series values) |
| `actors_and_directors()` | `groupby(['actor_id', 'director_id']).size()` | Pandas **Series**, but converted into **DataFrame** | `count[count['count'] > 2]` (boolean mask on the `count` column) |

---

### **2️⃣ Why Didn’t I Need `.reset_index(name='count')` for `find_classes()`?**
In `find_classes()`, we directly counted students per class:
```python
grouped = courses.groupby('class')['student'].count()
result = grouped[grouped > 4].reset_index()
```
- **Why it works directly:** `.count()` returns a **Series** whose index is `class` and whose values are already named `student`. We can **filter it** with `grouped > 4` and then turn it back into a DataFrame with a plain `.reset_index()`, with no `name=` needed.
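
For reference, a self-contained sketch of how those two lines might sit inside `find_classes()`. The function wrapper, the final `[['class']]` column selection, and the sample data are assumptions added for illustration, not part of the snippet above:

```python
import pandas as pd

def find_classes(courses: pd.DataFrame) -> pd.DataFrame:
    # Series indexed by class; values are the non-null student counts
    grouped = courses.groupby('class')['student'].count()
    # Boolean mask on the counts, then move `class` back into a column
    result = grouped[grouped > 4].reset_index()
    # Assumption: only the class name is wanted in the output
    return result[['class']]

# Hypothetical sample: Math has five students, Art has two
courses = pd.DataFrame({
    'student': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'class': ['Math', 'Math', 'Math', 'Math', 'Math', 'Art', 'Art'],
})
big_classes = find_classes(courses)
print(big_classes)
```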

---

### **3️⃣ Why Did I Need `.reset_index(name='count')` in `actors_and_directors()`?**
In `actors_and_directors()`, the `groupby()` operation results in a **multi-indexed Series**, meaning:
```python
count = actor_director.groupby(['actor_id', 'director_id']).size()
```
- This gives a **Series**, not a DataFrame.
- The `index` contains **both actor_id and director_id**, and the values are the counts.
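- **Filtering directly is still possible** (`count[count > 2]` works on a multi-indexed Series too), but the pair stays in the index and the counts have no column name, while the output needs `actor_id` and `director_id` as **columns**.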

#### **To make filtering easier:**
```python
count = actor_director.groupby(['actor_id', 'director_id']).size().reset_index(name='count')
```
- `.reset_index(name='count')` converts the **multi-indexed Series** into a **DataFrame** with explicit columns:  
  - `actor_id`
  - `director_id`
  - `count` (new column containing the count)

Then, filtering is easier:
```python
result = count[count['count'] > 2][['actor_id', 'director_id']]
```
- **Filtering now works using column names**, rather than relying on the index.
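
For comparison, here is a sketch of the other order of operations: filter the multi-indexed Series first and reset the index afterwards. This is not the approach used in the solution, just an equivalent variant on the example data (the name `frequent` is illustrative):

```python
import pandas as pd

actor_director = pd.DataFrame({
    'actor_id': [1, 1, 1, 1, 1, 2, 2],
    'director_id': [1, 1, 1, 2, 2, 1, 1],
    'timestamp': [0, 1, 2, 3, 4, 5, 6],
})

# Multi-indexed Series: index = (actor_id, director_id), values = row counts
count = actor_director.groupby(['actor_id', 'director_id']).size()

# The boolean mask is built from the values; the two-level index is kept as-is
frequent = count[count > 2]

# Move the index levels into columns and drop the counts
result = frequent.reset_index(name='count')[['actor_id', 'director_id']]
print(result)
```

Both orders give the same pairs; resetting the index first mainly makes the intermediate table easier to read and inspect.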

---

### **4️⃣ Why Didn’t I Use `reset_index()` Before Filtering in `find_classes()`?**
Because in `find_classes()`, the `groupby('class')['student'].count()` **returns a Series with `class` as the index**.
```python
grouped[grouped > 4].reset_index()
```
- Here, `grouped > 4` **builds a boolean mask from the Series values** (the counts), and the `class` index simply comes along with the rows that are kept.
- **In contrast**, `actors_and_directors()` had a multi-indexed Series with unnamed values, so I reset the index first to get **named columns** to filter on and select from.

---

### **🔍 Summary of the Key Differences**
| **Aspect** | **find_classes()** | **actors_and_directors()** |
|------------|----------------|----------------------|
| **Grouping** | `groupby('class')['student'].count()` | `groupby(['actor_id', 'director_id']).size()` |
| **Result Type** | **Series** (index = `class`) | **Multi-indexed Series** (index = `(actor_id, director_id)`) |
| **How We Filter** | `grouped[grouped > 4]` (boolean mask on the Series values) | `count[count['count'] > 2]` (boolean mask on a column after `reset_index()`) |
| **Why We Used `.reset_index()`** | Only after filtering, to turn the Series into a DataFrame | Before filtering, to get named columns for the pair and the count |

---

### **✅ Takeaways**
- A boolean mask such as `s[s > n]` works on **any grouped Series**, whether its index has one level (like in `find_classes()`) or several.
- If **grouping results in a multi-indexed Series** and the output needs the keys as columns, it is **convenient to call `.reset_index()` first** and filter by column name (like in `actors_and_directors()`).
