# Identifying Consecutive High Attendance Records in Stadium Visits Using Pandas

Analyzing attendance data can reveal the popularity and trends of events held at a stadium. One common task is identifying periods of high attendance, in particular runs of consecutive visits where the number of attendees meets or exceeds a threshold. In this tutorial, we'll use Python's Pandas library to find the records that belong to a run of three or more consecutive visit IDs in which every visit had 100 or more people.

## Problem Statement

You are provided with a **Stadium** table that records daily visits to a stadium. Your task is to display the records that belong to a run of three or more consecutive visit IDs in which every visit had a number of people **greater than or equal to 100**.

### Stadium Table

| Column Name | Type  |
|-------------|-------|
| id          | int   |
| visit_date  | date  |
| people      | int   |

- **id**: Unique identifier for each stadium visit. As the `id` increases, the `visit_date` also increases.
- **visit_date**: The date of the stadium visit.
- **people**: The number of people who attended the stadium on that visit date.

**Primary Key**: The `id` column ensures that each record is unique.

Each row in the table represents a single visit to the stadium, including the date of the visit and the number of attendees.

### Objective

Write a solution to display the records that satisfy the following conditions:

1. **Consecutive IDs**: The records must have three or more consecutive `id` values.
2. **High Attendance**: Each of these records must have `people` greater than or equal to `100`.
3. **Exclusion of Non-Consecutive Visits**: Visits that do not form a consecutive sequence of `id`s should not be included, even if they meet the attendance criteria individually.

### Output Format

Return the result table ordered by `visit_date` in ascending order. The table should include the following columns:

| Column Name | Type  |
|-------------|-------|
| id          | int   |
| visit_date  | date  |
| people      | int   |

## Example

### Input

**Stadium Table:**

| id | visit_date | people |
|----|------------|--------|
| 1  | 2017-01-01 | 10     |
| 2  | 2017-01-02 | 109    |
| 3  | 2017-01-03 | 150    |
| 4  | 2017-01-04 | 99     |
| 5  | 2017-01-05 | 145    |
| 6  | 2017-01-06 | 1455   |
| 7  | 2017-01-07 | 199    |
| 8  | 2017-01-09 | 188    |

### Output

| id | visit_date | people |
|----|------------|--------|
| 5  | 2017-01-05 | 145    |
| 6  | 2017-01-06 | 1455   |
| 7  | 2017-01-07 | 199    |
| 8  | 2017-01-09 | 188    |

### Explanation

- **Records with IDs 5, 6, 7, and 8**:
  - These records have consecutive `id` values (5, 6, 7, 8).
  - Each of these records has `people` ≥ 100.
  - Although the `visit_date` of `id` 8 (2017-01-09) is not the day after that of `id` 7 (2017-01-07), the two `id` values are consecutive, which is all the condition requires.
  
- **Records with IDs 2 and 3**:
  - Although they have high attendance (`people` ≥ 100), they do not form a sequence of three consecutive `id`s.
  
- **Record with ID 4**:
  - It has `people` < 100, so it does not meet the attendance criterion.

As a result, only the records with IDs 5, 6, 7, and 8 are included in the output.


In [3]:
import pandas as pd

# Sample rows matching the example Stadium table.
data = [[1, '2017-01-01', 10],
        [2, '2017-01-02', 109],
        [3, '2017-01-03', 150],
        [4, '2017-01-04', 99],
        [5, '2017-01-05', 145],
        [6, '2017-01-06', 1455],
        [7, '2017-01-07', 199],
        [8, '2017-01-09', 188]]
# Nullable Int64 for the integer columns, datetime64 for the dates.
stadium = pd.DataFrame(data, columns=['id', 'visit_date', 'people']).astype(
    {'id': 'Int64', 'visit_date': 'datetime64[ns]', 'people': 'Int64'})
display(stadium)

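|   | id | visit_date | people |
|---|----|------------|--------|
| 0 | 1  | 2017-01-01 | 10     |
| 1 | 2  | 2017-01-02 | 109    |
| 2 | 3  | 2017-01-03 | 150    |
| 3 | 4  | 2017-01-04 | 99     |
| 4 | 5  | 2017-01-05 | 145    |
| 5 | 6  | 2017-01-06 | 1455   |
| 6 | 7  | 2017-01-07 | 199    |
| 7 | 8  | 2017-01-09 | 188    |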


**Step 1. Filtering Records with High Attendance**

- Filtering: Selects only the records where `people` is greater than or equal to 100.

In [5]:
# .copy() avoids SettingWithCopyWarning when columns are added later.
stadium = stadium[stadium["people"] >= 100].copy()
display(stadium)

|   | id | visit_date | people |
|---|----|------------|--------|
| 1 | 2  | 2017-01-02 | 109    |
| 2 | 3  | 2017-01-03 | 150    |
| 4 | 5  | 2017-01-05 | 145    |
| 5 | 6  | 2017-01-06 | 1455   |
| 6 | 7  | 2017-01-07 | 199    |
| 7 | 8  | 2017-01-09 | 188    |


**Step 2. Assigning a Rank to Each Record and Calculating the Difference Between id and rank**

- Ranking: Assigns a sequential rank (starting from 0) to each record in the filtered DataFrame, following its current row order. This relies on the rows being sorted by `id`, as they are in the example.
- Identifying Consecutive Sequences: Comparing each rank with its `id` reveals where runs of consecutive `id`s begin and end.
- Grouping Key: Calculates a new column `dif` by subtracting `rank` from `id`.
- Consecutive Identification: Within a run of consecutive `id`s, both `id` and `rank` increase by 1 per row, so `id - rank` stays constant. Whenever an `id` is skipped, the difference grows, which starts a new group.
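
To make the `id - rank` property concrete, here is a minimal plain-Python sketch. The list `ids` is a hypothetical stand-in for the `id` values that survive the attendance filter in the example, and the variable names are illustrative only.

```python
# Hypothetical ids left after the attendance filter, already sorted.
ids = [2, 3, 5, 6, 7, 8]

# Pair each id with its 0-based position (the "rank") and take the difference.
difs = [id_ - rank for rank, id_ in enumerate(ids)]
print(difs)  # [2, 2, 3, 3, 3, 3]

# Group ids by their difference: each group is one run of consecutive ids.
runs = {}
for id_, dif in zip(ids, difs):
    runs.setdefault(dif, []).append(id_)
print(runs)  # {2: [2, 3], 3: [5, 6, 7, 8]}
```

The gap at `id` 4 shifts every later difference from 2 to 3, which is what separates the two runs.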


In [7]:
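# Sort by id so that the positional rank lines up with id order.
stadium = stadium.sort_values("id")
stadium["rank"] = range(len(stadium))
# id - rank is constant within a run of consecutive ids.
stadium["dif"] = stadium["id"] - stadium["rank"]
display(stadium)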

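|   | id | visit_date | people | rank | dif |
|---|----|------------|--------|------|-----|
| 1 | 2  | 2017-01-02 | 109    | 0    | 2   |
| 2 | 3  | 2017-01-03 | 150    | 1    | 2   |
| 4 | 5  | 2017-01-05 | 145    | 2    | 3   |
| 5 | 6  | 2017-01-06 | 1455   | 3    | 3   |
| 6 | 7  | 2017-01-07 | 199    | 4    | 3   |
| 7 | 8  | 2017-01-09 | 188    | 5    | 3   |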


**Step 3. Counting Consecutive IDs within Each Group**

- Counting: For each group defined by the `dif` value, counts the number of records. `transform("count")` writes that count back onto every row of the group, so the result has the same length as the DataFrame.
- Identifying Valid Groups: A group whose count is three or more is a run of at least three consecutive `id`s.
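
The following sketch contrasts a plain per-group count with `transform("count")`. The `toy` frame is a hypothetical stand-in that reuses the `id` and `dif` values from the example.

```python
import pandas as pd

# Hypothetical toy frame: ids 2-3 form one run, ids 5-8 another.
toy = pd.DataFrame({"id": [2, 3, 5, 6, 7, 8],
                    "dif": [2, 2, 3, 3, 3, 3]})

# Aggregating returns one value per group.
per_group = toy.groupby("dif")["id"].count()
print(per_group.tolist())  # [2, 4]

# transform returns one value per original row: the size of that row's group.
per_row = toy.groupby("dif")["id"].transform("count")
print(per_row.tolist())  # [2, 2, 4, 4, 4, 4]
```

Because the transformed result is aligned with the original rows, it can be stored as a new column and used directly in a row filter.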

In [9]:
stadium["consecutive_count"] = stadium.groupby(["dif"])["id"].transform("count")
display(stadium)

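|   | id | visit_date | people | rank | dif | consecutive_count |
|---|----|------------|--------|------|-----|-------------------|
| 1 | 2  | 2017-01-02 | 109    | 0    | 2   | 2                 |
| 2 | 3  | 2017-01-03 | 150    | 1    | 2   | 2                 |
| 4 | 5  | 2017-01-05 | 145    | 2    | 3   | 4                 |
| 5 | 6  | 2017-01-06 | 1455   | 3    | 3   | 4                 |
| 6 | 7  | 2017-01-07 | 199    | 4    | 3   | 4                 |
| 7 | 8  | 2017-01-09 | 188    | 5    | 3   | 4                 |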


**Step 4. Filtering Groups with Three or More Consecutive IDs**

- Filtering: Retains only the records whose `consecutive_count` is three or more.
- Condition Satisfaction: This drops high-attendance records that belong to shorter runs, such as `id`s 2 and 3, as the problem requires.

In [11]:
stadium = stadium[stadium["consecutive_count"] >= 3]
display(stadium)

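|   | id | visit_date | people | rank | dif | consecutive_count |
|---|----|------------|--------|------|-----|-------------------|
| 4 | 5  | 2017-01-05 | 145    | 2    | 3   | 4                 |
| 5 | 6  | 2017-01-06 | 1455   | 3    | 3   | 4                 |
| 6 | 7  | 2017-01-07 | 199    | 4    | 3   | 4                 |
| 7 | 8  | 2017-01-09 | 188    | 5    | 3   | 4                 |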


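**Step 5. Selecting the Required Columns and Sorting by Date**

- Column Selection: Retains only the `id`, `visit_date`, and `people` columns, removing the intermediary columns (`rank`, `dif`, `consecutive_count`) used for processing.
- Sorting: Orders the result by `visit_date` in ascending order, as the output format requires.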

In [13]:
stadium = stadium[["id", "visit_date", "people"]]

stadium = stadium.sort_values(by="visit_date", 
                              ascending=True)
display(stadium)

|   | id | visit_date | people |
|---|----|------------|--------|
| 4 | 5  | 2017-01-05 | 145    |
| 5 | 6  | 2017-01-06 | 1455   |
| 6 | 7  | 2017-01-07 | 199    |
| 7 | 8  | 2017-01-09 | 188    |
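
For reuse, the five steps can be collected into a single function. This is a minimal sketch, not a definitive implementation: the function name `high_attendance_runs` and its parameters `min_people` and `min_run` are hypothetical and do not appear in the original notebook.

```python
import pandas as pd


def high_attendance_runs(stadium, min_people=100, min_run=3):
    """Return rows in runs of >= min_run consecutive ids with >= min_people each."""
    # Step 1: keep high-attendance rows; sort by id so positions follow id order.
    busy = stadium[stadium["people"] >= min_people].sort_values("id").copy()
    # Step 2: id minus 0-based position is constant within a run of consecutive ids.
    busy["rank"] = range(len(busy))
    busy["dif"] = busy["id"] - busy["rank"]
    # Step 3: size of each run, repeated on every row of that run.
    run_size = busy.groupby("dif")["id"].transform("count")
    # Steps 4-5: keep long-enough runs, drop helper columns, order by date.
    result = busy.loc[run_size >= min_run, ["id", "visit_date", "people"]]
    return result.sort_values("visit_date")


# Usage with the example data from this tutorial.
data = [[1, "2017-01-01", 10], [2, "2017-01-02", 109],
        [3, "2017-01-03", 150], [4, "2017-01-04", 99],
        [5, "2017-01-05", 145], [6, "2017-01-06", 1455],
        [7, "2017-01-07", 199], [8, "2017-01-09", 188]]
stadium = pd.DataFrame(data, columns=["id", "visit_date", "people"])
stadium["visit_date"] = pd.to_datetime(stadium["visit_date"])

print(high_attendance_runs(stadium)["id"].tolist())  # [5, 6, 7, 8]
```

Unlike the step-by-step cells, this version leaves the input DataFrame unmodified, so it can be called repeatedly with different thresholds.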


Reference: [1] https://leetcode.com/problems/human-traffic-of-stadium/?lang=pythondata