# Compare Departmental Average Salaries to Company Average

Analyzing salary distributions across departments is essential for organizations to ensure fair compensation, identify disparities, and make informed financial decisions. In this exercise, you'll compare the average salaries of each department to the company's overall average salary for each pay month.

### Salary Table

| Column Name | Type | Description |
|-------------|------|-------------|
| id          | int  | Primary key. Unique identifier for each salary record. |
| employee_id | int  | Foreign key referencing the Employee table. Identifies the employee receiving the salary. |
| amount      | int  | The salary amount paid to the employee for the month. |
| pay_date    | date | The date when the salary was paid. |

**Notes:**
- Each row represents a salary payment to an employee for a specific month.
- The `id` column uniquely identifies each salary record.
- `employee_id` links the salary to an employee in the Employee table.

### Employee Table

| Column Name   | Type | Description |
|---------------|------|-------------|
| employee_id   | int  | Primary key. Unique identifier for each employee. |
| department_id | int  | Identifies the department to which the employee belongs. |

**Notes:**
- Each row represents an employee and their associated department.
- `employee_id` uniquely identifies each employee.

## Objective

For each pay month, determine whether the **average salary** of employees in each department is **higher than**, **lower than**, or **the same as** the **company's average salary** for that month.

### Requirements:

1. **Comparison Result**: For each department and pay month, indicate if the department's average salary is:
   - `"higher"` than the company's average.
   - `"lower"` than the company's average.
   - `"same"` as the company's average.

2. **Output Columns**:
   - `pay_month`: The month and year of the salary payments in `YYYY-MM` format.
   - `department_id`: The identifier of the department.
   - `comparison`: The result of the comparison (`"higher"`, `"lower"`, or `"same"`).

3. **Ordering**: The result table can be in any order.

## Example

### Input

**Salary Table:**

| id | employee_id | amount | pay_date   |
|----|-------------|--------|------------|
| 1  | 1           | 9000   | 2017-03-31 |
| 2  | 2           | 6000   | 2017-03-31 |
| 3  | 3           | 10000  | 2017-03-31 |
| 4  | 1           | 7000   | 2017-02-28 |
| 5  | 2           | 6000   | 2017-02-28 |
| 6  | 3           | 8000   | 2017-02-28 |

**Employee Table:**

| employee_id | department_id |
|-------------|---------------|
| 1           | 1             |
| 2           | 2             |
| 3           | 2             |

### Output

| pay_month | department_id | comparison |
|-----------|---------------|------------|
| 2017-02   | 1             | same       |
| 2017-03   | 1             | higher     |
| 2017-02   | 2             | same       |
| 2017-03   | 2             | lower      |

### Explanation

**February 2017 (`2017-02`):**

- **Company's Average Salary**:
  - `(7000 + 6000 + 8000) / 3 = 7000`
  
- **Department 1**:
  - Only employee `1` with a salary of `7000`.
  - **Average Department Salary**: `7000`
  - **Comparison**: `"same"` (`7000` vs. `7000`)
  
- **Department 2**:
  - Employees `2` and `3` with salaries of `6000` and `8000` respectively.
  - **Average Department Salary**: `(6000 + 8000) / 2 = 7000`
  - **Comparison**: `"same"` (`7000` vs. `7000`)

**March 2017 (`2017-03`):**

- **Company's Average Salary**:
  - `(9000 + 6000 + 10000) / 3 ≈ 8333.33`
  
- **Department 1**:
  - Only employee `1` with a salary of `9000`.
  - **Average Department Salary**: `9000`
  - **Comparison**: `"higher"` (`9000` vs. `8333.33`)
  
- **Department 2**:
  - Employees `2` and `3` with salaries of `6000` and `10000` respectively.
  - **Average Department Salary**: `(6000 + 10000) / 2 = 8000`
  - **Comparison**: `"lower"` (`8000` vs. `8333.33`)
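
The arithmetic above can be checked with a few lines of plain Python. This is only a sketch for verifying the worked example; the `salaries` dictionary and the `compare` helper are illustrative names, not part of the original problem.

```python
from statistics import mean

# Salaries from the example, keyed by pay month and then by department.
salaries = {
    "2017-02": {1: [7000], 2: [6000, 8000]},
    "2017-03": {1: [9000], 2: [6000, 10000]},
}

def compare(dept_avg, company_avg):
    """Return 'higher', 'lower', or 'same' for a department average."""
    if dept_avg > company_avg:
        return "higher"
    if dept_avg < company_avg:
        return "lower"
    return "same"

results = {}
for month, by_dept in salaries.items():
    # The company average is taken over all individual payments in the month,
    # not over the department averages.
    all_amounts = [amt for amounts in by_dept.values() for amt in amounts]
    company_avg = mean(all_amounts)
    for dept, amounts in by_dept.items():
        results[(month, dept)] = compare(mean(amounts), company_avg)

print(results)
```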


In [3]:
import pandas as pd

data = [[1, 1, 9000, '2017/03/31'], 
        [2, 2, 6000, '2017/03/31'], 
        [3, 3, 10000, '2017/03/31'], 
        [4, 1, 7000, '2017/02/28'], 
        [5, 2, 6000, '2017/02/28'], 
        [6, 3, 8000, '2017/02/28']]
salary = pd.DataFrame(data, columns=['id', 
                                     'employee_id', 
                                     'amount', 
                                     'pay_date']).astype({'id':'Int64', 
                                                          'employee_id':'Int64', 
                                                          'amount':'Int64', 
                                                          'pay_date':'datetime64[ns]'})
data = [[1, 1], 
        [2, 2], 
        [3, 2]]
employee = pd.DataFrame(data, 
                        columns=['employee_id', 
                                 'department_id']).astype({'employee_id':'Int64', 
                                                           'department_id':'Int64'})
display(salary)
display(employee)

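|   | id | employee_id | amount | pay_date   |
|---|----|-------------|--------|------------|
| 0 | 1  | 1           | 9000   | 2017-03-31 |
| 1 | 2  | 2           | 6000   | 2017-03-31 |
| 2 | 3  | 3           | 10000  | 2017-03-31 |
| 3 | 4  | 1           | 7000   | 2017-02-28 |
| 4 | 5  | 2           | 6000   | 2017-02-28 |
| 5 | 6  | 3           | 8000   | 2017-02-28 |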


|   | employee_id | department_id |
|---|-------------|---------------|
| 0 | 1           | 1             |
| 1 | 2           | 2             |
| 2 | 3           | 2             |


**Step 1. Merging the Salary and Employee DataFrames**

- Combine salary records with corresponding employee details to enrich the salary data with departmental information.
- `how="left"`: Performs a left join. Every row of `salary` is kept and the matching `department_id` from `employee` is attached. A salary row with no matching employee gets a missing value in `department_id` (`<NA>` for the nullable `Int64` columns used here).
- `on="employee_id"`: Joins on the `employee_id` column, which is present in both DataFrames.

In [5]:
df = salary.merge(employee, how="left", on="employee_id")
display(df)

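|   | id | employee_id | amount | pay_date   | department_id |
|---|----|-------------|--------|------------|---------------|
| 0 | 1  | 1           | 9000   | 2017-03-31 | 1             |
| 1 | 2  | 2           | 6000   | 2017-03-31 | 2             |
| 2 | 3  | 3           | 10000  | 2017-03-31 | 2             |
| 3 | 4  | 1           | 7000   | 2017-02-28 | 1             |
| 4 | 5  | 2           | 6000   | 2017-02-28 | 2             |
| 5 | 6  | 3           | 8000   | 2017-02-28 | 2             |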
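
In the example data every salary row finds a matching employee, so the left join never produces a missing department. The sketch below uses hypothetical data (`salary_demo`, `employee_demo`, and an unmatched employee `4` are made up for illustration) to show what an unmatched row looks like and how the `indicator=True` option of `merge` can flag it.

```python
import pandas as pd

# Hypothetical data: employee 4 has a salary record but no row in employee_demo.
salary_demo = pd.DataFrame(
    {"employee_id": [1, 4], "amount": [9000, 5000]}
).astype("Int64")
employee_demo = pd.DataFrame(
    {"employee_id": [1], "department_id": [1]}
).astype("Int64")

# indicator=True adds a "_merge" column describing where each row came from.
merged_demo = salary_demo.merge(
    employee_demo, how="left", on="employee_id", indicator=True
)
print(merged_demo)

# Rows whose employee_id had no match in employee_demo.
unmatched = merged_demo[merged_demo["_merge"] == "left_only"]
print(unmatched["employee_id"].tolist())
```

Rows like this have no department to group by, so it is worth deciding explicitly whether to drop them before computing any averages.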


**Step 2. Converting Payment Dates to Monthly Format**

- Derive a `pay_month` label from each payment date so that salaries can later be grouped by month rather than by day.
- `.dt`: The accessor that exposes vectorized datetime properties and methods on a datetime column.
- `.strftime('%Y-%m')`: Formats each date as a string in `YYYY-MM` format, such as `2017-03`.

In [7]:
df["pay_month"] = df["pay_date"].dt.strftime('%Y-%m')
display(df)

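|   | id | employee_id | amount | pay_date   | department_id | pay_month |
|---|----|-------------|--------|------------|---------------|-----------|
| 0 | 1  | 1           | 9000   | 2017-03-31 | 1             | 2017-03   |
| 1 | 2  | 2           | 6000   | 2017-03-31 | 2             | 2017-03   |
| 2 | 3  | 3           | 10000  | 2017-03-31 | 2             | 2017-03   |
| 3 | 4  | 1           | 7000   | 2017-02-28 | 1             | 2017-02   |
| 4 | 5  | 2           | 6000   | 2017-02-28 | 2             | 2017-02   |
| 5 | 6  | 3           | 8000   | 2017-02-28 | 2             | 2017-02   |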
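
As an aside, the same month labels can also be produced with a monthly period instead of string formatting. This is an alternative sketch, not the approach used in this notebook; the `pay_dates` series is a small stand-in for the `pay_date` column.

```python
import pandas as pd

# Stand-in for the pay_date column.
pay_dates = pd.Series(pd.to_datetime(["2017-03-31", "2017-02-28"]))

# Route 1: format each date as a YYYY-MM string.
via_strftime = pay_dates.dt.strftime("%Y-%m")

# Route 2: convert to a monthly Period, then to its string form.
via_period = pay_dates.dt.to_period("M").astype(str)

print(via_strftime.tolist())
print(via_period.tolist())
```

Both routes yield the same labels here. The string form matches the `YYYY-MM` output format the problem asks for, which is presumably why the notebook uses `strftime` directly.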


**Step 3. Calculating the Average Salary per Department per Month**

- Determine the average salary paid within each department for each month.
- `groupby(["pay_month", "department_id"])`: Groups the DataFrame by both `pay_month` and `department_id`, so the average is calculated within each unique combination of month and department.
- `["amount"].transform("mean")`: Calculates the mean salary for each group and broadcasts the result back to every row of that group, keeping the original index and row count.

**Step 4. Calculating the Average Salary per Company per Month**
- Determine the overall average salary paid across the entire company for each month. This is the mean of all individual payments in the month, not the mean of the department averages.
- `groupby(["pay_month"])`: Groups the DataFrame solely by `pay_month`, pooling rows from all departments.
- `["amount"].transform("mean")`: Calculates the mean salary for each month and broadcasts the result back to every row of that month.

In [9]:
df["avg_dept_salary"] = df.groupby(["pay_month", "department_id"])["amount"].transform("mean")
df["avg_company_salary"] = df.groupby(["pay_month"])["amount"].transform("mean")
display(df)

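|   | id | employee_id | amount | pay_date   | department_id | pay_month | avg_dept_salary | avg_company_salary |
|---|----|-------------|--------|------------|---------------|-----------|-----------------|--------------------|
| 0 | 1  | 1           | 9000   | 2017-03-31 | 1             | 2017-03   | 9000.0          | 8333.333333        |
| 1 | 2  | 2           | 6000   | 2017-03-31 | 2             | 2017-03   | 8000.0          | 8333.333333        |
| 2 | 3  | 3           | 10000  | 2017-03-31 | 2             | 2017-03   | 8000.0          | 8333.333333        |
| 3 | 4  | 1           | 7000   | 2017-02-28 | 1             | 2017-02   | 7000.0          | 7000.0             |
| 4 | 5  | 2           | 6000   | 2017-02-28 | 2             | 2017-02   | 7000.0          | 7000.0             |
| 5 | 6  | 3           | 8000   | 2017-02-28 | 2             | 2017-02   | 7000.0          | 7000.0             |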
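
To make the role of `transform` in Steps 3 and 4 concrete, the following sketch contrasts it with a plain group mean on the March rows only. The `demo` frame and the variable names are illustrative.

```python
import pandas as pd

# March 2017 rows from the example, reduced to the relevant columns.
demo = pd.DataFrame({
    "pay_month": ["2017-03", "2017-03", "2017-03"],
    "department_id": [1, 2, 2],
    "amount": [9000, 6000, 10000],
})

grouped = demo.groupby(["pay_month", "department_id"])["amount"]

# transform: one value per original row (same length and index as demo).
per_row = grouped.transform("mean")

# mean: one value per group, indexed by the group keys.
per_group = grouped.mean()

print(per_row.tolist())
print(per_group)
```

Because `transform` keeps one value per row, the result can be assigned straight back as a new column. The trade-off is that departments with several employees appear on several rows, which is why duplicates have to be removed at the end.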


**Step 5. Comparing Departmental Averages to Company Averages**
- Determine whether each department's average salary is higher than, lower than, or the same as the company's average salary for that month.
- `.apply()`: Applies a function along an axis of the DataFrame.
- `axis=1`: Specifies that the function should be applied to each row.
- Lambda function conditions:
  1. If `avg_dept_salary < avg_company_salary`: assign `"lower"`.
  2. If `avg_dept_salary > avg_company_salary`: assign `"higher"`.
  3. Else: assign `"same"`.


In [11]:
df["comparison"] = df.apply(lambda row: "lower" if row["avg_dept_salary"] < row["avg_company_salary"] 
                            else ("higher" if row["avg_dept_salary"] > row["avg_company_salary"] 
                                  else "same"), 
                            axis=1)
display(df)

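|   | id | employee_id | amount | pay_date   | department_id | pay_month | avg_dept_salary | avg_company_salary | comparison |
|---|----|-------------|--------|------------|---------------|-----------|-----------------|--------------------|------------|
| 0 | 1  | 1           | 9000   | 2017-03-31 | 1             | 2017-03   | 9000.0          | 8333.333333        | higher     |
| 1 | 2  | 2           | 6000   | 2017-03-31 | 2             | 2017-03   | 8000.0          | 8333.333333        | lower      |
| 2 | 3  | 3           | 10000  | 2017-03-31 | 2             | 2017-03   | 8000.0          | 8333.333333        | lower      |
| 3 | 4  | 1           | 7000   | 2017-02-28 | 1             | 2017-02   | 7000.0          | 7000.0             | same       |
| 4 | 5  | 2           | 6000   | 2017-02-28 | 2             | 2017-02   | 7000.0          | 7000.0             | same       |
| 5 | 6  | 3           | 8000   | 2017-02-28 | 2             | 2017-02   | 7000.0          | 7000.0             | same       |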
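
A row-wise `apply` is easy to read but calls a Python function once per row. As a possible alternative, the same three-way labeling can be written as a vectorized expression with `numpy.select`. This is a sketch on a small stand-in frame (`demo`), not a change to the notebook's approach.

```python
import numpy as np
import pandas as pd

# Stand-in for the two average columns computed in the previous steps.
demo = pd.DataFrame({
    "avg_dept_salary": [9000.0, 8000.0, 7000.0],
    "avg_company_salary": [25000 / 3, 25000 / 3, 7000.0],
})

# Conditions are evaluated in order; the first one that holds picks the label.
conditions = [
    demo["avg_dept_salary"] > demo["avg_company_salary"],
    demo["avg_dept_salary"] < demo["avg_company_salary"],
]
choices = ["higher", "lower"]

# Rows matching neither condition fall through to the default.
demo["comparison"] = np.select(conditions, choices, default="same")
print(demo["comparison"].tolist())
```

Either way, the `"same"` label relies on exact equality of two floating-point means, which works for this data but may deserve a tolerance if the averages come from less tidy amounts.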


**Step 6. Selecting the Required Columns for the Final Output**
- Reduce the DataFrame to the three columns required in the output.
- Column selection `["pay_month", "department_id", "comparison"]`: Retains only these three columns and drops all others (`id`, `employee_id`, `amount`, `pay_date`, `avg_dept_salary`, `avg_company_salary`).

In [13]:
df = df[["pay_month", "department_id", "comparison"]]
display(df)

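|   | pay_month | department_id | comparison |
|---|-----------|---------------|------------|
| 0 | 2017-03   | 1             | higher     |
| 1 | 2017-03   | 2             | lower      |
| 2 | 2017-03   | 2             | lower      |
| 3 | 2017-02   | 1             | same       |
| 4 | 2017-02   | 2             | same       |
| 5 | 2017-02   | 2             | same       |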


**Step 7. Removing Duplicate Rows**
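- Ensure that the final output contains exactly one row per month and department. Because `transform` kept one row per salary record, a department with several employees appears several times with identical values.
- `drop_duplicates()`: Removes duplicate rows from the DataFrame.
- `keep="first"`: Retains the first occurrence of each duplicate row and drops the subsequent ones. The surviving rows keep their original index labels.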


In [15]:
df = df.drop_duplicates(keep="first")
display(df)

|   | pay_month | department_id | comparison |
|---|-----------|---------------|------------|
| 0 | 2017-03   | 1             | higher     |
| 1 | 2017-03   | 2             | lower      |
| 3 | 2017-02   | 1             | same       |
| 4 | 2017-02   | 2             | same       |
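
For reference, the seven steps can be collected into a single function. The sketch below is one possible arrangement under a few assumptions: the function name `compare_department_salaries` is made up for illustration, `pay_date` is already a datetime column, and the labels are assigned with boolean masks rather than the row-wise `apply` shown earlier.

```python
import pandas as pd

def compare_department_salaries(salary: pd.DataFrame,
                                employee: pd.DataFrame) -> pd.DataFrame:
    """Label each (pay_month, department_id) as higher, lower, or same."""
    # Step 1: attach department_id to every salary record.
    df = salary.merge(employee, how="left", on="employee_id")
    # Step 2: derive the YYYY-MM label.
    df["pay_month"] = df["pay_date"].dt.strftime("%Y-%m")
    # Steps 3 and 4: per-row department and company averages.
    dept = df.groupby(["pay_month", "department_id"])["amount"].transform("mean")
    company = df.groupby("pay_month")["amount"].transform("mean")
    # Step 5: start from "same" and overwrite where the averages differ.
    df["comparison"] = "same"
    df.loc[dept > company, "comparison"] = "higher"
    df.loc[dept < company, "comparison"] = "lower"
    # Steps 6 and 7: keep the output columns and one row per group.
    return (df[["pay_month", "department_id", "comparison"]]
            .drop_duplicates()
            .reset_index(drop=True))

# Example data from the problem statement.
salary = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "employee_id": [1, 2, 3, 1, 2, 3],
    "amount": [9000, 6000, 10000, 7000, 6000, 8000],
    "pay_date": pd.to_datetime(["2017-03-31"] * 3 + ["2017-02-28"] * 3),
})
employee = pd.DataFrame({"employee_id": [1, 2, 3], "department_id": [1, 2, 2]})

result = compare_department_salaries(salary, employee)
print(result)
```

The final `reset_index(drop=True)` is optional; it only renumbers the rows after duplicates are dropped.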


References: [1] https://leetcode.com/problems/average-salary-departments-vs-company/?lang=pythondata