When managing automated systems, understanding how tasks perform over time is crucial for identifying patterns, optimizing processes, and ensuring reliability. In a system where tasks are executed daily, each task can independently succeed or fail. However, analyzing performance on a day-to-day basis can quickly become overwhelming, especially over extended periods. Instead, grouping consecutive days with the same performance state (success or failure) provides a clearer picture of system behavior.

In this post, we'll analyze daily task results over one year and group them into continuous intervals of success or failure. Working from two small tables of dates, we'll build a summary report that shows how the system performed during each period.

This approach not only simplifies complex datasets but also offers actionable insights into system performance trends. Whether you're a data analyst, a developer, or a system manager, understanding this method can help you better interpret operational data and make informed decisions.

Let’s get started by defining the problem and building a solution step by step!

### Problem Description

#### Table: `Failed`

| Column Name | Type |
|-------------|------|
| fail_date   | date |

- `fail_date` is the primary key (column with unique values) for this table.
- This table contains the days of failed tasks.

#### Table: `Succeeded`

| Column Name  | Type |
|--------------|------|
| success_date | date |

- `success_date` is the primary key (column with unique values) for this table.
- This table contains the days of succeeded tasks.

A system runs one task every day. Each task is independent of the previous tasks, and it can either fail or succeed.

#### Task

Write a solution to report the `period_state` for each continuous interval of days in the period from `2019-01-01` to `2019-12-31`. 

- `period_state` is **`failed`** if tasks in the interval failed.
- `period_state` is **`succeeded`** if tasks in the interval succeeded.
- For each interval, retrieve:
  - `start_date`
  - `end_date`

The result should be ordered by `start_date`.

---

### Example

#### Input:

**Failed table:**

| fail_date   |
|-------------|
| 2018-12-28  |
| 2018-12-29  |
| 2019-01-04  |
| 2019-01-05  |

**Succeeded table:**

| success_date |
|--------------|
| 2018-12-30   |
| 2018-12-31   |
| 2019-01-01   |
| 2019-01-02   |
| 2019-01-03   |
| 2019-01-06   |

#### Output:

| period_state | start_date | end_date   |
|--------------|------------|------------|
| succeeded    | 2019-01-01 | 2019-01-03 |
| failed       | 2019-01-04 | 2019-01-05 |
| succeeded    | 2019-01-06 | 2019-01-06 |

#### Explanation:

- The report ignores the system state in 2018 as the period of interest is `2019-01-01` to `2019-12-31`.
- From `2019-01-01` to `2019-01-03`, all tasks succeeded, so the system state is **`succeeded`**.
- From `2019-01-04` to `2019-01-05`, all tasks failed, so the system state is **`failed`**.
- On `2019-01-06`, the task succeeded, so the system state is **`succeeded`**.
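
Before bringing in pandas, the grouping idea from the explanation can be sketched in plain Python. This is only an illustration, not the solution built below: the `days` list and the `group_runs` helper are hypothetical names introduced here, and the sketch assumes the input is already sorted by date with exactly one entry per day.

```python
from itertools import groupby


def group_runs(days):
    """Collapse (date, state) pairs, sorted by date, into (state, start, end) runs."""
    runs = []
    # groupby starts a new group every time the key (the state) changes.
    for state, group in groupby(days, key=lambda pair: pair[1]):
        dates = [date for date, _ in group]
        runs.append((state, dates[0], dates[-1]))
    return runs


# The 2019 rows from the example above.
days = [
    ("2019-01-01", "succeeded"),
    ("2019-01-02", "succeeded"),
    ("2019-01-03", "succeeded"),
    ("2019-01-04", "failed"),
    ("2019-01-05", "failed"),
    ("2019-01-06", "succeeded"),
]

for run in group_runs(days):
    print(run)
```

Each printed tuple corresponds to one row of the expected output: a state followed by the first and last date of its run.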


In [1]:
import warnings

import numpy as np
import pandas as pd

# Keep the notebook output free of library warnings.
warnings.filterwarnings("ignore")

# Days on which the task failed.
data = [['2018-12-28'], 
        ['2018-12-29'], 
        ['2019-01-04'], 
        ['2019-01-05']]
failed = pd.DataFrame(data, 
                      columns=['fail_date']).astype({
                      'fail_date':'datetime64[ns]'})

# Days on which the task succeeded.
data = [['2018-12-30'], 
        ['2018-12-31'], 
        ['2019-01-01'], 
        ['2019-01-02'], 
        ['2019-01-03'], 
        ['2019-01-06']]
succeeded = pd.DataFrame(data, 
                         columns=['success_date']).astype({
                         'success_date':'datetime64[ns]'})

display(failed, succeeded)

# Reporting window: all of 2019.
range_start = pd.Timestamp("2019-01-01")
range_end = pd.Timestamp("2019-12-31")

|   | fail_date  |
|---|------------|
| 0 | 2018-12-28 |
| 1 | 2018-12-29 |
| 2 | 2019-01-04 |
| 3 | 2019-01-05 |

|   | success_date |
|---|--------------|
| 0 | 2018-12-30   |
| 1 | 2018-12-31   |
| 2 | 2019-01-01   |
| 3 | 2019-01-02   |
| 4 | 2019-01-03   |
| 5 | 2019-01-06   |


**Step 1: Add a `state` Column to the `failed` DataFrame**
- A new column, `state`, is added to the `failed` DataFrame with the value `'failed'` for all rows.

**Step 2: Rename the `fail_date` Column in the `failed` DataFrame**
- The column `fail_date` in the `failed` DataFrame is renamed to `date`.

In [2]:
failed['state'] = 'failed'
failed = failed.rename(columns={'fail_date': 'date'})
display(failed)

|   | date       | state  |
|---|------------|--------|
| 0 | 2018-12-28 | failed |
| 1 | 2018-12-29 | failed |
| 2 | 2019-01-04 | failed |
| 3 | 2019-01-05 | failed |


**Step 3: Add a `state` Column to the `succeeded` DataFrame**
- A new column, `state`, is added to the `succeeded` DataFrame with the value `'succeeded'` for all rows.

**Step 4: Rename the `success_date` Column in the `succeeded` DataFrame**
- The column `success_date` in the `succeeded` DataFrame is renamed to `date`.


In [3]:
succeeded['state'] = 'succeeded'
succeeded = succeeded.rename(columns={'success_date': 'date'})
display(succeeded)

|   | date       | state     |
|---|------------|-----------|
| 0 | 2018-12-30 | succeeded |
| 1 | 2018-12-31 | succeeded |
| 2 | 2019-01-01 | succeeded |
| 3 | 2019-01-02 | succeeded |
| 4 | 2019-01-03 | succeeded |
| 5 | 2019-01-06 | succeeded |


**Step 5: Combine the `failed` and `succeeded` DataFrames**
- The two DataFrames, `failed` and `succeeded`, are concatenated into a single DataFrame, `df`, stacking their rows.

**Step 6: Sort the Combined DataFrame by the `date` Column**
- The combined DataFrame `df` is sorted in ascending order by the `date` column.

In [4]:
df = pd.concat([failed, succeeded])
df = df.sort_values(by='date', ascending=True)
display(df)

|   | date       | state     |
|---|------------|-----------|
| 0 | 2018-12-28 | failed    |
| 1 | 2018-12-29 | failed    |
| 0 | 2018-12-30 | succeeded |
| 1 | 2018-12-31 | succeeded |
| 2 | 2019-01-01 | succeeded |
| 3 | 2019-01-02 | succeeded |
| 4 | 2019-01-03 | succeeded |
| 2 | 2019-01-04 | failed    |
| 3 | 2019-01-05 | failed    |
| 5 | 2019-01-06 | succeeded |
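
Each source table uses its date column as a primary key, so a date appears at most once per table, but nothing in the schema stops the same date from landing in both tables. If that is a concern with real data, a quick check on the combined frame might look like the sketch below. The `find_conflicting_dates` helper and the `combined` sample are hypothetical and exist only for this illustration.

```python
import pandas as pd


def find_conflicting_dates(df):
    """Return the rows whose date occurs more than once in the combined frame."""
    # keep=False marks every occurrence of a repeated date, not just the later ones.
    return df[df["date"].duplicated(keep=False)]


# Made-up sample in which 2019-01-02 is recorded as both succeeded and failed.
combined = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-02"]),
    "state": ["succeeded", "succeeded", "failed"],
})

conflicts = find_conflicting_dates(combined)
print(conflicts)
```

An empty result means every day has exactly one recorded state, which is what the rest of the walkthrough assumes.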


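**Step 7: Filter Rows by Date Range**
- Rows in `df` are filtered to keep only dates from `2019-01-01` to `2019-12-31`, inclusive, using the `range_start` and `range_end` timestamps defined in the first cell.
- `.copy()` makes the filtered result an independent DataFrame, so the columns added in Step 8 are written to it directly.

In [5]:
df = df[df['date'].between(range_start, range_end)].copy()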
display(df)

|   | date       | state     |
|---|------------|-----------|
| 2 | 2019-01-01 | succeeded |
| 3 | 2019-01-02 | succeeded |
| 4 | 2019-01-03 | succeeded |
| 2 | 2019-01-04 | failed    |
| 3 | 2019-01-05 | failed    |
| 5 | 2019-01-06 | succeeded |
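
The filter depends on `Series.between` including both endpoints, which is its default behavior. Here is a small check on made-up boundary dates; the `dates` series is hypothetical and not part of the dataset above.

```python
import pandas as pd

# Two dates just outside the 2019 window and the two dates on its edges.
dates = pd.Series(pd.to_datetime([
    "2018-12-31",
    "2019-01-01",
    "2019-12-31",
    "2020-01-01",
]))

# between() is inclusive on both ends by default.
mask = dates.between(pd.Timestamp("2019-01-01"), pd.Timestamp("2019-12-31"))
print(mask.tolist())  # [False, True, True, False]
```

Both `2019-01-01` and `2019-12-31` are kept, so the first and last days of the year are not silently dropped.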


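**Step 8: Assign a Unique Period Identifier for Consecutive States**
- `period_state_previous` holds the previous row's `state`, obtained with `shift`.
- `period_switch` is 1 where `state` differs from the previous row (including the first row, which has no predecessor) and 0 otherwise.
- `period` is the cumulative sum of `period_switch`, so every run of consecutive rows with the same `state` shares one identifier.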

In [6]:
df["period_state_previous"] = df["state"].shift(periods = 1)
df["period_switch"] = np.where(df["state"] != df["period_state_previous"], 1, 0)
df["period"] = df["period_switch"].cumsum()
display(df)

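|   | date       | state     | period_state_previous | period_switch | period |
|---|------------|-----------|-----------------------|---------------|--------|
| 2 | 2019-01-01 | succeeded | NaN                   | 1             | 1      |
| 3 | 2019-01-02 | succeeded | succeeded             | 0             | 1      |
| 4 | 2019-01-03 | succeeded | succeeded             | 0             | 1      |
| 2 | 2019-01-04 | failed    | succeeded             | 1             | 2      |
| 3 | 2019-01-05 | failed    | failed                | 0             | 2      |
| 5 | 2019-01-06 | succeeded | failed                | 1             | 3      |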
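
The period labels above rely on the problem's guarantee that a task runs every day. If some days were missing from both tables, two runs of the same state separated by a gap would receive the same period number. The sketch below shows one possible way to guard against that, assuming a gap should always start a new period. The `label_periods` function and the `sample` frame are hypothetical and not part of the original solution.

```python
import pandas as pd


def label_periods(df):
    """Add a 'period' column; a new period starts when the state changes
    or when the date is not exactly one day after the previous row."""
    df = df.sort_values("date").reset_index(drop=True)
    state_changed = df["state"] != df["state"].shift()
    # diff() is NaT on the first row, which compares as "not one day" and
    # therefore opens the first period.
    gap = df["date"].diff() != pd.Timedelta(days=1)
    df["period"] = (state_changed | gap).cumsum()
    return df


# Made-up sample: 2019-01-03 and 2019-01-04 are missing.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-05", "2019-01-06"]),
    "state": ["succeeded", "succeeded", "succeeded", "failed"],
})

labeled = label_periods(sample)
print(labeled["period"].tolist())  # [1, 1, 2, 3]
```

With state changes alone, the first three rows would share one period; the extra date check splits them at the gap.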


**Step 9: Group by `period` and `state` and Aggregate Start/End Dates**
- The DataFrame is grouped by `period` and `state`, and the `date` column is aggregated to find:
  - The earliest date (`min`) as `start_date`.
  - The latest date (`max`) as `end_date`.

**Step 10: Reset Index and Rename Columns**
- The index is reset, and the `state` column is renamed to `period_state`.

In [7]:
df = df.groupby(['period', 
                 'state']).agg(start_date=('date', 'min'), 
                               end_date=('date', 'max'))
df = df.reset_index().rename(columns={'state': 'period_state'})
display(df)

|   | period | period_state | start_date | end_date   |
|---|--------|--------------|------------|------------|
| 0 | 1      | succeeded    | 2019-01-01 | 2019-01-03 |
| 1 | 2      | failed       | 2019-01-04 | 2019-01-05 |
| 2 | 3      | succeeded    | 2019-01-06 | 2019-01-06 |


**Step 11: Select and Reorganize Columns**
- The DataFrame is reduced to three columns: `period_state`, `start_date`, and `end_date`.

In [8]:
df = df[['period_state', 'start_date', 'end_date']]
display(df)

|   | period_state | start_date | end_date   |
|---|--------------|------------|------------|
| 0 | succeeded    | 2019-01-01 | 2019-01-03 |
| 1 | failed       | 2019-01-04 | 2019-01-05 |
| 2 | succeeded    | 2019-01-06 | 2019-01-06 |
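
For reuse outside the notebook, the eleven steps can be folded into a single function. The version below is a condensed sketch of the same approach rather than a reference implementation. The function name `report_contiguous_dates` and its `start` and `end` parameters are assumptions made for this example, and it still assumes that every day in the window appears in exactly one of the two tables.

```python
import pandas as pd


def report_contiguous_dates(failed, succeeded, start="2019-01-01", end="2019-12-31"):
    """Group daily task results into contiguous succeeded/failed periods."""
    # Steps 1-4: give both frames the same shape (date, state).
    failed = failed.rename(columns={"fail_date": "date"}).assign(state="failed")
    succeeded = succeeded.rename(columns={"success_date": "date"}).assign(state="succeeded")

    # Steps 5-7: stack, sort, and keep only the reporting window (inclusive).
    df = pd.concat([failed, succeeded], ignore_index=True).sort_values("date")
    df = df[df["date"].between(pd.Timestamp(start), pd.Timestamp(end))].copy()

    # Step 8: the period number increases each time the state changes.
    df["period"] = (df["state"] != df["state"].shift()).cumsum()

    # Steps 9-11: first and last date of every period, then the final columns.
    result = (
        df.groupby(["period", "state"], as_index=False)
          .agg(start_date=("date", "min"), end_date=("date", "max"))
          .rename(columns={"state": "period_state"})
    )
    return result[["period_state", "start_date", "end_date"]]


# The example input from the problem statement.
failed = pd.DataFrame({"fail_date": pd.to_datetime(
    ["2018-12-28", "2018-12-29", "2019-01-04", "2019-01-05"])})
succeeded = pd.DataFrame({"success_date": pd.to_datetime(
    ["2018-12-30", "2018-12-31", "2019-01-01", "2019-01-02", "2019-01-03", "2019-01-06"])})

result = report_contiguous_dates(failed, succeeded)
print(result)
```

On the example input this produces the same three rows as the step-by-step version. Because period numbers grow with the date, the grouped result is already ordered by `start_date`.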


References:
[1] https://leetcode.com/problems/report-contiguous-dates/description/?lang=pythondata