# Calculating Cumulative Salary Summaries for Employees Using Pandas

Understanding employee compensation over time is crucial for organizations to analyze salary trends, budget effectively, and make informed financial decisions. In this blog post, we'll explore how to calculate the cumulative salary summary for each employee over a rolling three-month window using Python's Pandas library. We'll walk through the problem statement, examine a detailed example, and implement a step-by-step solution.

## Problem Statement

You are provided with an **Employee** table that records the monthly salaries of employees throughout the year 2020. Your task is to calculate a cumulative salary summary for each employee, which involves summing up the salaries for each month and the two preceding months. This summary excludes the most recent month each employee has worked.

### Employee Table

| Column Name | Type |
|-------------|------|
| id          | int  |
| month       | int  |
| salary      | int  |

- **id**: Unique identifier for each employee.
- **month**: The month of the year (1 to 12) during which the salary was recorded.
- **salary**: The salary amount for the employee in that particular month.

**Primary Key**: The combination of `(id, month)` ensures that each record is unique for an employee in a specific month.

Each row in the table represents an employee's salary for a particular month in 2020.

### Objective

Write a solution to calculate the cumulative salary summary for every employee in a single unified table.

The cumulative salary summary for an employee can be calculated as follows:

- For each month that the employee worked, sum up the salaries in that month and the previous two months. This is their 3-month sum for that month. If an employee did not work for the company in previous months, their effective salary for those months is `0`.
- Do not include the 3-month sum for the most recent month that the employee worked for in the summary.
- Do not include the 3-month sum for any month the employee did not work.

Return the result table ordered by `id` in ascending order. In case of a tie, order it by `month` in descending order.

## Example

### Input

**Employee Table:**

| id | month | salary |
|----|-------|--------|
| 1  | 1     | 20     |
| 2  | 1     | 20     |
| 1  | 2     | 30     |
| 2  | 2     | 30     |
| 3  | 2     | 40     |
| 1  | 3     | 40     |
| 3  | 3     | 60     |
| 1  | 4     | 60     |
| 3  | 4     | 70     |
| 1  | 7     | 90     |
| 1  | 8     | 90     |

### Output

| id | month | Salary |
|----|-------|--------|
| 1  | 7     | 90     |
| 1  | 4     | 130    |
| 1  | 3     | 90     |
| 1  | 2     | 50     |
| 1  | 1     | 20     |
| 2  | 1     | 20     |
| 3  | 3     | 100    |
| 3  | 2     | 40     |

### Explanation

Let's break down the cumulative salary calculations for each employee:

#### Employee 1

Employee `1` has salary records for months `1`, `2`, `3`, `4`, `7`, and `8`. We exclude the most recent month (`8`) from the summary.

1. **Month 7**:
   - **Salaries Considered**: Month `7`, `6` (did not work, treated as `0`), and `5` (did not work, treated as `0`).
   - **Cumulative Salary**: `90 + 0 + 0 = 90`

2. **Month 4**:
   - **Salaries Considered**: Month `4`, `3`, and `2`.
   - **Cumulative Salary**: `60 + 40 + 30 = 130`

3. **Month 3**:
   - **Salaries Considered**: Month `3`, `2`, and `1`.
   - **Cumulative Salary**: `40 + 30 + 20 = 90`

4. **Month 2**:
   - **Salaries Considered**: Month `2`, `1`, and `0` (did not work, treated as `0`).
   - **Cumulative Salary**: `30 + 20 + 0 = 50`

5. **Month 1**:
   - **Salaries Considered**: Month `1`, `0`, and `-1` (both treated as `0`).
   - **Cumulative Salary**: `20 + 0 + 0 = 20`

#### Employee 2

Employee `2` has salary records for months `1` and `2`. We exclude the most recent month (`2`) from the summary.

1. **Month 1**:
   - **Salaries Considered**: Month `1`, `0`, and `-1` (both treated as `0`).
   - **Cumulative Salary**: `20 + 0 + 0 = 20`

#### Employee 3

Employee `3` has salary records for months `2`, `3`, and `4`. We exclude the most recent month (`4`) from the summary.

1. **Month 3**:
   - **Salaries Considered**: Month `3`, `2`, and `1` (did not work, treated as `0`).
   - **Cumulative Salary**: `60 + 40 + 0 = 100`

2. **Month 2**:
   - **Salaries Considered**: Month `2`, `1`, and `0` (both treated as `0`).
   - **Cumulative Salary**: `40 + 0 + 0 = 40`

The final cumulative salary summary excludes the most recent month each employee worked and includes only the months they actively worked, summing salaries over the current and two preceding months, treating non-working months as `0`.
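
The rules walked through above can be cross-checked with a small pure-Python reference. This is only a sketch for verification, not the Pandas solution: the function name `cumulative_salary_reference` and the tuple-based input format are assumptions made for illustration.

```python
from collections import defaultdict


def cumulative_salary_reference(rows):
    """Hypothetical helper: rows are (id, month, salary) tuples."""
    # Map each employee to a {month: salary} dictionary.
    by_employee = defaultdict(dict)
    for emp_id, month, salary in rows:
        by_employee[emp_id][month] = salary

    result = []
    for emp_id in sorted(by_employee):            # id ascending
        salaries = by_employee[emp_id]
        last_month = max(salaries)                # most recent month worked
        for month in sorted(salaries, reverse=True):  # month descending
            if month == last_month:
                continue                          # excluded from the summary
            # Months with no record contribute 0 to the 3-month sum.
            total = sum(salaries.get(month - k, 0) for k in range(3))
            result.append((emp_id, month, total))
    return result


rows = [(1, 1, 20), (2, 1, 20), (1, 2, 30), (2, 2, 30), (3, 2, 40), (1, 3, 40),
        (3, 3, 60), (1, 4, 60), (3, 4, 70), (1, 7, 90), (1, 8, 90)]
summary = cumulative_salary_reference(rows)
print(summary)
# [(1, 7, 90), (1, 4, 130), (1, 3, 90), (1, 2, 50), (1, 1, 20),
#  (2, 1, 20), (3, 3, 100), (3, 2, 40)]
```

The printed tuples match the rows of the expected output table, which gives a baseline to compare the Pandas result against.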


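## Solution

We start by loading the example data into a DataFrame.

```python
import pandas as pd

# One row per (id, month) salary record, matching the example input.
data = [[1, 1, 20],
        [2, 1, 20],
        [1, 2, 30],
        [2, 2, 30],
        [3, 2, 40],
        [1, 3, 40],
        [3, 3, 60],
        [1, 4, 60],
        [3, 4, 70],
        [1, 7, 90],
        [1, 8, 90]]
employee = pd.DataFrame(data, columns=['id', 'month', 'salary']).astype(
    {'id': 'Int64', 'month': 'Int64', 'salary': 'Int64'}
)
display(employee)  # display() is available in Jupyter/IPython sessions
```

First 10 rows of the output:

|    | id | month | salary |
|----|----|-------|--------|
| 0  | 1  | 1     | 20     |
| 1  | 2  | 1     | 20     |
| 2  | 1  | 2     | 30     |
| 3  | 2  | 2     | 30     |
| 4  | 3  | 2     | 40     |
| 5  | 1  | 3     | 40     |
| 6  | 3  | 3     | 60     |
| 7  | 1  | 4     | 60     |
| 8  | 3  | 4     | 70     |
| 9  | 1  | 7     | 90     |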


**Step 1. Identifying the Most Recent Month for Each Employee**

- **Grouping**: `employee.groupby('id')` groups the DataFrame by the `id` column, so each employee's rows are handled separately.
- **Identifying the maximum month**: For each group, `['month'].idxmax()` returns the index label of the row with the highest `month` value. That row is the employee's most recent month.

```python
# Index label of the row holding each employee's latest (maximum) month.
most_recent = employee.groupby('id')['month'].idxmax()

print(most_recent)
```

```
id
1    10
2     3
3     8
Name: month, dtype: int64
```
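
One detail worth making explicit is that `idxmax()` returns index labels rather than row positions. The sketch below uses a hypothetical toy frame (`toy`, with string labels `a`, `b`, `c` chosen only for illustration) so that the difference is visible.

```python
import pandas as pd

# Hypothetical toy data with a non-default index, so labels stand out.
toy = pd.DataFrame(
    {"id": [1, 1, 2], "month": [1, 2, 1], "salary": [20, 30, 20]},
    index=["a", "b", "c"],
)

# For each id, the label of the row with the largest month.
latest_labels = toy.groupby("id")["month"].idxmax()

# Because these are labels, .loc (not .iloc) retrieves the matching rows.
latest_rows = toy.loc[latest_labels]
print(latest_rows)
```

In the walkthrough the DataFrame has the default integer index, so labels and positions happen to coincide, but the label-based behaviour is what makes an index membership test a natural next step.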


**Step 2. Removing the Most Recent Month for Each Employee**

- **`employee.index.isin(most_recent)`**: Creates a boolean mask that is `True` for every row whose index label appears in the `most_recent` Series.
- **Negation `~`**: Inverts the mask, so `True` becomes `False` and vice versa.
- **Filtering**: `employee[...]` keeps the rows where the inverted mask is `True`, which removes the row for each employee's most recent month.

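```python
# Keep every row except each employee's most recent month.
df = employee[~employee.index.isin(most_recent)]

display(df)
```

|    | id | month | salary |
|----|----|-------|--------|
| 0  | 1  | 1     | 20     |
| 1  | 2  | 1     | 20     |
| 2  | 1  | 2     | 30     |
| 4  | 3  | 2     | 40     |
| 5  | 1  | 3     | 40     |
| 6  | 3  | 3     | 60     |
| 7  | 1  | 4     | 60     |
| 9  | 1  | 7     | 90     |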
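
As a possible alternative, the first two steps can be folded into a single comparison against each employee's maximum month. This is a sketch of a different technique (`groupby(...).transform('max')`) rather than the approach used in this post, and the variable names `last_month` and `trimmed` are illustrative.

```python
import pandas as pd

employee = pd.DataFrame(
    [[1, 1, 20], [2, 1, 20], [1, 2, 30], [2, 2, 30], [3, 2, 40], [1, 3, 40],
     [3, 3, 60], [1, 4, 60], [3, 4, 70], [1, 7, 90], [1, 8, 90]],
    columns=["id", "month", "salary"],
)

# Broadcast each employee's latest month back onto every one of their rows.
last_month = employee.groupby("id")["month"].transform("max")

# Keep only the rows that come strictly before that latest month.
trimmed = employee[employee["month"] < last_month]
print(trimmed.index.tolist())  # [0, 1, 2, 4, 5, 6, 7, 9]
```

The surviving index labels are the same ones left after the `idxmax()` and `isin()` steps, since `(id, month)` is unique and each employee therefore has exactly one row at their maximum month.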


**Step 3. Self-Joining the DataFrame Based on Employee ID**

- **Self-join**: `merge` joins `df` with itself on the `id` column, so each row is paired with every row that has the same `id`, including itself.
- **Resulting columns**: After the merge, overlapping columns from the left DataFrame get the suffix `_x` and those from the right get `_y`, giving `month_x`, `salary_x`, `month_y`, and `salary_y`.

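```python
# Pair every month of an employee with every month of the same employee.
df = df.merge(df, on='id')

display(df)
```

First 10 rows of the output:

|    | id | month_x | salary_x | month_y | salary_y |
|----|----|---------|----------|---------|----------|
| 0  | 1  | 1       | 20       | 1       | 20       |
| 1  | 1  | 1       | 20       | 2       | 30       |
| 2  | 1  | 1       | 20       | 3       | 40       |
| 3  | 1  | 1       | 20       | 4       | 60       |
| 4  | 1  | 1       | 20       | 7       | 90       |
| 5  | 1  | 2       | 30       | 1       | 20       |
| 6  | 1  | 2       | 30       | 2       | 30       |
| 7  | 1  | 2       | 30       | 3       | 40       |
| 8  | 1  | 2       | 30       | 4       | 60       |
| 9  | 1  | 2       | 30       | 7       | 90       |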
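
To get a feel for how large the self-join becomes, note that an employee with `n` remaining rows contributes `n * n` pairs. The sketch below rebuilds the filtered data from Step 2 under the illustrative name `trimmed` and checks that count.

```python
import pandas as pd

# Rows remaining after each employee's most recent month is removed.
trimmed = pd.DataFrame(
    [[1, 1, 20], [2, 1, 20], [1, 2, 30], [3, 2, 40],
     [1, 3, 40], [3, 3, 60], [1, 4, 60], [1, 7, 90]],
    columns=["id", "month", "salary"],
)

pairs = trimmed.merge(trimmed, on="id")

# Employee 1 has 5 rows, employee 2 has 1, employee 3 has 2.
rows_per_employee = trimmed.groupby("id").size()
expected_rows = int((rows_per_employee ** 2).sum())

print(len(pairs), expected_rows)  # 30 30
print(list(pairs.columns))
```

With at most 12 months per employee the quadratic growth stays small here, but it is worth keeping in mind if the same pattern is applied to longer histories.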


**Step 4. Filtering to Keep Only the Most Recent 3 Months' Data**

- **Callable indexer**: The lambda passed to `.loc` receives the whole DataFrame and returns a boolean mask, so the condition is evaluated for all rows at once.
- **`x.month_x - x.month_y`**: Computes the gap in months between `month_x` and `month_y`.
- **`.isin([0, 1, 2])`**: Checks whether that gap is `0`, `1`, or `2`.
- **Filtering**: Keeps only the rows where `month_y` is `month_x` itself or one of the two months immediately before it.

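```python
# Keep pairs where month_y falls in the 3-month window ending at month_x.
df = df.loc[lambda x: (x.month_x - x.month_y).isin([0, 1, 2]), :]

display(df)
```

First 10 rows of the output:

|    | id | month_x | salary_x | month_y | salary_y |
|----|----|---------|----------|---------|----------|
| 0  | 1  | 1       | 20       | 1       | 20       |
| 5  | 1  | 2       | 30       | 1       | 20       |
| 6  | 1  | 2       | 30       | 2       | 30       |
| 10 | 1  | 3       | 40       | 1       | 20       |
| 11 | 1  | 3       | 40       | 2       | 30       |
| 12 | 1  | 3       | 40       | 3       | 40       |
| 16 | 1  | 4       | 60       | 2       | 30       |
| 17 | 1  | 4       | 60       | 3       | 40       |
| 18 | 1  | 4       | 60       | 4       | 60       |
| 24 | 1  | 7       | 90       | 7       | 90       |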
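
The same window condition can also be written as an explicit boolean mask, which some readers may find easier to inspect. The sketch below uses a hypothetical four-row slice of the joined data (`pairs`) and `Series.between`, whose bounds are inclusive by default, then compares it with the lambda form.

```python
import pandas as pd

# Hypothetical slice of the self-joined frame for employee 1.
pairs = pd.DataFrame({
    "id": [1, 1, 1, 1],
    "month_x": [4, 4, 4, 7],
    "salary_x": [60, 60, 60, 90],
    "month_y": [1, 2, 4, 4],
    "salary_y": [20, 30, 60, 60],
})

# Gap in months between the summary month and the paired month.
gap = pairs["month_x"] - pairs["month_y"]   # 3, 2, 0, 3

# Explicit mask: keep gaps of 0, 1, or 2 (between is inclusive on both ends).
window = pairs[gap.between(0, 2)]

# Lambda form from the walkthrough, for comparison.
same = pairs.loc[lambda x: (x.month_x - x.month_y).isin([0, 1, 2]), :]

print(window.index.tolist())  # [1, 2]
print(window.equals(same))    # True
```

Pairs with a negative gap (a later `month_y`) or a gap of three or more fall outside the window and are dropped by either form.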


**Step 5. Calculating the Sum of Salaries Over the Recent 3 Months**

- **Grouping**: `df.groupby(['id', 'month_x'])` groups the DataFrame by both `id` and `month_x`, so each group corresponds to one month of one employee.
- **Aggregation**: `['salary_y'].sum()` sums the `salary_y` values within each group. Since `salary_y` holds the salaries for the months in the three-month window ending at `month_x`, this sum covers `month_x` and the two preceding months.
- **Reset index**: `.reset_index()` converts the resulting Series back into a DataFrame with `id` and `month_x` as columns.

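```python
# Sum the windowed salaries for each (employee, month) pair.
df = df.groupby(['id', 'month_x'])['salary_y'].sum().reset_index()

display(df)
```

|    | id | month_x | salary_y |
|----|----|---------|----------|
| 0  | 1  | 1       | 20       |
| 1  | 1  | 2       | 50       |
| 2  | 1  | 3       | 90       |
| 3  | 1  | 4       | 130      |
| 4  | 1  | 7       | 90       |
| 5  | 2  | 1       | 20       |
| 6  | 3  | 2       | 40       |
| 7  | 3  | 3       | 100      |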
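
If preferred, the aggregation can name its output column directly through Pandas named aggregation, which avoids a separate rename of `salary_y`. This is a sketch of an alternative spelling on a hypothetical five-row sample (`window`), not the code used in this post.

```python
import pandas as pd

# Hypothetical sample of window-filtered pairs.
window = pd.DataFrame({
    "id": [1, 1, 1, 3, 3],
    "month_x": [2, 2, 1, 3, 3],
    "month_y": [1, 2, 1, 2, 3],
    "salary_y": [20, 30, 20, 40, 60],
})

totals = (
    window.groupby(["id", "month_x"], as_index=False)  # keep keys as columns
          .agg(Salary=("salary_y", "sum"))             # name the sum "Salary"
          .rename(columns={"month_x": "month"})
)
print(totals)
```

Here `as_index=False` plays the role of `.reset_index()`, and the `Salary=("salary_y", "sum")` pair says which column to aggregate and how.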


**Step 6. Renaming Columns and Sorting the DataFrame**

- **Renaming columns**:
  - `month_x` is renamed to `month` for clarity.
  - `salary_y` is renamed to `Salary` to reflect the aggregated salary values.
- **Sorting**: The DataFrame is sorted by `id` in ascending order and then by `month` in descending order (`ascending=[True, False]`), so each employee's most recent months appear first.

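```python
# Match the required column names and ordering.
df = (
    df.rename(columns={'month_x': 'month', 'salary_y': 'Salary'})
      .sort_values(by=['id', 'month'], ascending=[True, False])
)

display(df)
```

|    | id | month | Salary |
|----|----|-------|--------|
| 4  | 1  | 7     | 90     |
| 3  | 1  | 4     | 130    |
| 2  | 1  | 3     | 90     |
| 1  | 1  | 2     | 50     |
| 0  | 1  | 1     | 20     |
| 5  | 2  | 1     | 20     |
| 7  | 3  | 3     | 100    |
| 6  | 3  | 2     | 40     |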
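
For convenience, the six steps can be gathered into one function. The sketch below assumes the function name `cumulative_salary` (hypothetical) and adds a final `reset_index(drop=True)` that is not part of the walkthrough, purely so the returned rows are numbered from zero.

```python
import pandas as pd


def cumulative_salary(employee: pd.DataFrame) -> pd.DataFrame:
    # Steps 1-2: drop each employee's most recent month.
    most_recent = employee.groupby("id")["month"].idxmax()
    df = employee[~employee.index.isin(most_recent)]

    # Step 3: pair every month with every month of the same employee.
    df = df.merge(df, on="id")

    # Step 4: keep pairs inside the 3-month window ending at month_x.
    df = df[(df["month_x"] - df["month_y"]).isin([0, 1, 2])]

    # Step 5: sum the windowed salaries per (id, month_x).
    df = df.groupby(["id", "month_x"])["salary_y"].sum().reset_index()

    # Step 6: rename and sort (id ascending, month descending).
    return (
        df.rename(columns={"month_x": "month", "salary_y": "Salary"})
          .sort_values(by=["id", "month"], ascending=[True, False])
          .reset_index(drop=True)  # assumption: renumber rows for tidy output
    )


employee = pd.DataFrame(
    [[1, 1, 20], [2, 1, 20], [1, 2, 30], [2, 2, 30], [3, 2, 40], [1, 3, 40],
     [3, 3, 60], [1, 4, 60], [3, 4, 70], [1, 7, 90], [1, 8, 90]],
    columns=["id", "month", "salary"],
)
result = cumulative_salary(employee)
print(result)
```

Running it on the example input should reproduce the expected output table from the Example section, row for row.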


References:
[1] https://leetcode.com/problems/find-cumulative-salary-of-an-employee/?envType=list&envId=p4ampudi