In [120]:
import polars as pl
print(pl.__version__)

0.20.31


### Nth Highest Salary

#### Question

DataFrame: Employee

| Column Name | Type |
|:-----------:|:----:|
| id          | int  |
| salary      | int  |

id is the primary key (column with unique values) for this table. <br>
Each row of this table contains information about the salary of an employee.
 

Write a solution to find the nth highest salary from the Employee table. If there is no nth highest salary, return null.

Example 1:

Input:<br>
Employee dataframe:

| id | salary |
|:--:|:------:|
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |

n = 2<br>
Output: 

| getNthHighestSalary(2) |
|:----------------------:|
| 200                    |


Example 2:

Input:<br>
Employee dataframe:

| id | salary |
|:--:|:------:|
| 1  | 100    |

n = 2<br>
Output: 

| getNthHighestSalary(2) |
|:----------------------:|
| null                   |

#### Testcase

In [121]:
# Test data
data = [[1, 100], [2, 200], [3, 300]]

# Create the DataFrame
employee = pl.DataFrame(
    data,
    schema=['Id', 'Salary']
)

# Display the DataFrame
print(employee)

shape: (3, 2)
┌─────┬────────┐
│ Id  ┆ Salary │
│ --- ┆ ---    │
│ i64 ┆ i64    │
╞═════╪════════╡
│ 1   ┆ 100    │
│ 2   ┆ 200    │
│ 3   ┆ 300    │
└─────┴────────┘


#### Solution

In [122]:
def nth_highest_salary(employee: pl.DataFrame, N: int) -> pl.DataFrame:

    # Sort the DataFrame by salary in descending order
    sorted_df = employee.sort('Salary', descending=True)
    
    # Check that N is a valid 1-based position within the number of rows
    if 1 <= N <= len(sorted_df):
        # Retrieve the Nth highest salary
        nth_salary = sorted_df[N - 1, "Salary"]
    else:
        # There is no Nth highest salary
        nth_salary = None

    return pl.DataFrame({'Nth Highest Salary': [nth_salary]})
    
# Display the result
n = 2
print(nth_highest_salary(employee=employee, N=n))

shape: (1, 1)
┌────────────────────┐
│ Nth Highest Salary │
│ ---                │
│ i64                │
╞════════════════════╡
│ 200                │
└────────────────────┘


### Second Highest Salary

#### Question

DataFrame: Employee

| Column Name | Type |
|:-----------:|:----:|
| id          | int  |
| salary      | int  |

id is the primary key (column with unique values) for this table.<br>
Each row of this table contains information about the salary of an employee.
 

Write a solution to find the second highest salary from the Employee table. If there is no second highest salary, return null.

Example 1:

Input:<br>
Employee dataframe:

| id | salary |
|:--:|:------:|
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |

Output: 

| SecondHighestSalary |
|:-------------------:|
| 200                 |

Example 2:

Input:<br>
Employee dataframe:

| id | salary |
|:--:|:------:|
| 1  | 100    |

Output: 

| SecondHighestSalary |
|:-------------------:|
| null                |


#### Testcase

In [123]:
# Test data
data = [[1, 100], [2, 200], [3, 300]]

# Create the DataFrame
employee = pl.DataFrame(
    data,
    schema=['id', 'salary']
)

# Display the DataFrame
print(employee)

shape: (3, 2)
┌─────┬────────┐
│ id  ┆ salary │
│ --- ┆ ---    │
│ i64 ┆ i64    │
╞═════╪════════╡
│ 1   ┆ 100    │
│ 2   ┆ 200    │
│ 3   ┆ 300    │
└─────┴────────┘


#### Solution

In [124]:
def second_highest_salary(employee: pl.DataFrame) -> pl.DataFrame:
    
    # Drop duplicate salaries to ensure unique values
    unique_salaries_df = employee.unique(subset=['salary'])
    
    # Sort the DataFrame by salary in descending order
    sorted_df = unique_salaries_df.sort('salary', descending=True)
    
    # Check if there is a second highest salary
    if len(sorted_df) >= 2:
        # Retrieve the second highest salary
        second_salary = sorted_df[1, 'salary']
        return pl.DataFrame({'SecondHighestSalary':[second_salary]})
    else:
        second_salary = None
        return pl.DataFrame({'SecondHighestSalary':[second_salary]})

# Display the result
print(second_highest_salary(employee=employee))

shape: (1, 1)
┌─────────────────────┐
│ SecondHighestSalary │
│ ---                 │
│ i64                 │
╞═════════════════════╡
│ 200                 │
└─────────────────────┘


### Department Highest Salary

#### Question

DataFrame: Employee

| Column Name  | Type    |
|:------------:|:-------:|
| id           | int     |
| name         | varchar |
| salary       | int     |
| departmentId | int     |

id is the primary key (column with unique values) for this table.<br>
departmentId is a foreign key (reference columns) of the ID from the Department table.<br>
Each row of this table indicates the ID, name, and salary of an employee. It also contains the ID of their department.
 

DataFrame: Department

| Column Name | Type    |
|:-----------:|:-------:|
| id          | int     |
| name        | varchar |

id is the primary key (column with unique values) for this table. It is guaranteed that department name is not NULL.<br>
Each row of this table indicates the ID of a department and its name.
 

Write a solution to find employees who have the highest salary in each of the departments.

Return the result table in any order.

Example:

Input:<br>
Employee dataframe:

| id | name  | salary | departmentId |
|:--:|:-----:|:------:|:------------:|
| 1  | Joe   | 70000  | 1            |
| 2  | Jim   | 90000  | 1            |
| 3  | Henry | 80000  | 2            |
| 4  | Sam   | 60000  | 2            |
| 5  | Max   | 90000  | 1            |

Department dataframe:

| id | name  |
|:--:|:-----:|
| 1  | IT    |
| 2  | Sales |

Output:

| Department | Employee | Salary |
|:----------:|:--------:|:------:|
| IT         | Jim      | 90000  |
| Sales      | Henry    | 80000  |
| IT         | Max      | 90000  |

Explanation: Max and Jim both have the highest salary in the IT department and Henry has the highest salary in the Sales department.

#### Testcase

In [125]:
# Test data
data_emp = [[1, 'Joe', 70000, 1], [2, 'Jim', 90000, 1], [3, 'Henry', 80000, 2], [4, 'Sam', 60000, 2], [5, 'Max', 90000, 1]]
data_dept = [[1, 'IT'], [2, 'Sales']]

# Create the DataFrame
employee = pl.DataFrame(
    data_emp,
    schema=['id', 'name', 'salary', 'departmentId']
)

department = pl.DataFrame(
    data_dept,
    schema=['id', 'name']
)

# Display the DataFrame
print('employee df:', employee)
print('department df', department)

employee df: shape: (5, 4)
┌─────┬───────┬────────┬──────────────┐
│ id  ┆ name  ┆ salary ┆ departmentId │
│ --- ┆ ---   ┆ ---    ┆ ---          │
│ i64 ┆ str   ┆ i64    ┆ i64          │
╞═════╪═══════╪════════╪══════════════╡
│ 1   ┆ Joe   ┆ 70000  ┆ 1            │
│ 2   ┆ Jim   ┆ 90000  ┆ 1            │
│ 3   ┆ Henry ┆ 80000  ┆ 2            │
│ 4   ┆ Sam   ┆ 60000  ┆ 2            │
│ 5   ┆ Max   ┆ 90000  ┆ 1            │
└─────┴───────┴────────┴──────────────┘
department df shape: (2, 2)
┌─────┬───────┐
│ id  ┆ name  │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 1   ┆ IT    │
│ 2   ┆ Sales │
└─────┴───────┘


#### Solution

In [126]:
def department_highest_salary(employee: pl.DataFrame, department: pl.DataFrame) -> pl.DataFrame:

    # Join both dataframes; the department's 'name' column receives the '_department' suffix
    merged_df = employee.join(department, left_on='departmentId', right_on='id', suffix='_department')
    # Keep the employees whose salary equals the maximum salary within their own department
    result = (
        merged_df
            .filter(pl.col('salary') == pl.col('salary').max().over('departmentId'))
            .select(['name_department', 'name', 'salary'])
            .rename({'name_department': 'Department', 'name': 'Employee', 'salary': 'Salary'})
    )

    return result

# Display the result
print(department_highest_salary(employee=employee, department=department))

shape: (3, 3)
┌────────────┬──────────┬────────┐
│ Department ┆ Employee ┆ Salary │
│ ---        ┆ ---      ┆ ---    │
│ str        ┆ str      ┆ i64    │
╞════════════╪══════════╪════════╡
│ IT         ┆ Jim      ┆ 90000  │
│ Sales      ┆ Henry    ┆ 80000  │
│ IT         ┆ Max      ┆ 90000  │
└────────────┴──────────┴────────┘


### Rank Scores

#### Question

DataFrame: Scores

| Column Name | Type    |
|:-----------:|:-------:|
| id          | int     |
| score       | decimal |

id is the primary key (column with unique values) for this table.<br>
Each row of this table contains the score of a game. Score is a floating point value with two decimal places.
 

Write a solution to find the rank of the scores. The ranking should be calculated according to the following rules:

The scores should be ranked from the highest to the lowest.<br>
If there is a tie between two scores, both should have the same ranking.<br>
After a tie, the next ranking number should be the next consecutive integer value. In other words, there should be no holes between ranks.<br>
Return the result table ordered by score in descending order.

Example:

Input:<br>
Scores dataframe:

| id | score |
|:--:|:-----:|
| 1  | 3.50  |
| 2  | 3.65  |
| 3  | 4.00  |
| 4  | 3.85  |
| 5  | 4.00  |
| 6  | 3.65  |

Output: 

| score | rank |
|:-----:|:----:|
| 4.00  | 1    |
| 4.00  | 1    |
| 3.85  | 2    |
| 3.65  | 3    |
| 3.65  | 3    |
| 3.50  | 4    |  

#### Testcase

In [127]:
# Test data
data = [[1, 3.5], [2, 3.65], [3, 4.0], [4, 3.85], [5, 4.0], [6, 3.65]]

# Create the DataFrame
scores = pl.DataFrame(
    data,
    schema=['id', 'score']
)

# Display the DataFrame
print(scores)

shape: (6, 2)
┌─────┬───────┐
│ id  ┆ score │
│ --- ┆ ---   │
│ i64 ┆ f64   │
╞═════╪═══════╡
│ 1   ┆ 3.5   │
│ 2   ┆ 3.65  │
│ 3   ┆ 4.0   │
│ 4   ┆ 3.85  │
│ 5   ┆ 4.0   │
│ 6   ┆ 3.65  │
└─────┴───────┘


#### Solution

In [128]:
def order_scores(scores: pl.DataFrame) -> pl.DataFrame:
    
    # Calculate the dense rank of the scores in descending order
    scores = scores.with_columns(
        pl.col('score').rank(method='dense', descending=True).alias('rank')
    )
    
    # Drop the 'id' column and sort by 'score' in descending order
    result = scores.drop('id').sort('score', descending=True)
    
    return result

# Display the result
print(order_scores(scores=scores))

shape: (6, 2)
┌───────┬──────┐
│ score ┆ rank │
│ ---   ┆ ---  │
│ f64   ┆ u32  │
╞═══════╪══════╡
│ 4.0   ┆ 1    │
│ 4.0   ┆ 1    │
│ 3.85  ┆ 2    │
│ 3.65  ┆ 3    │
│ 3.65  ┆ 3    │
│ 3.5   ┆ 4    │
└───────┴──────┘


### Delete Duplicate Emails

#### Question

DataFrame: Person

| Column Name | Type    |
|:-----------:|:-------:|
| id          | int     |
| email       | varchar |

id is the primary key (column with unique values) for this table.<br>
Each row of this table contains an email. The emails will not contain uppercase letters.
 

Write a solution to delete all duplicate emails, keeping only one unique email with the smallest id.

For SQL users, please note that you are supposed to write a DELETE statement and not a SELECT one.

For Pandas users, please note that you are supposed to modify Person in place.

After running your script, the answer shown is the Person table. The driver will first compile and run your piece of code and then show the Person table. The final order of the Person table does not matter.

Example:

Input:<br>
Person dataframe:

| id | email            |
|:--:|:----------------:|
| 1  | john@example.com |
| 2  | bob@example.com  |
| 3  | john@example.com |

Output: 

| id | email            |
|:--:|:----------------:|
| 1  | john@example.com |
| 2  | bob@example.com  |

Explanation: john@example.com is repeated two times. We keep the row with the smallest Id = 1.

#### Testcase

In [129]:
# Test data
data = [[1, 'john@example.com'], [2, 'bob@example.com'], [3, 'john@example.com']]

# Create the DataFrame
person = pl.DataFrame(
    data,
    schema=['id', 'email']
)

# Display the DataFrame
print(person)

shape: (3, 2)
┌─────┬──────────────────┐
│ id  ┆ email            │
│ --- ┆ ---              │
│ i64 ┆ str              │
╞═════╪══════════════════╡
│ 1   ┆ john@example.com │
│ 2   ┆ bob@example.com  │
│ 3   ┆ john@example.com │
└─────┴──────────────────┘


#### Solution

In [130]:
def delete_duplicate_emails(person: pl.DataFrame) -> pl.DataFrame:
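    # Sort by id so that the first occurrence of each email has the smallest id
    person = person.sort('id', descending=False)
    # Keep the first occurrence of each email; maintain_order makes the row order deterministic
    person = person.unique(subset=['email'], keep='first', maintain_order=True)
    return person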

# Display the result
print(delete_duplicate_emails(person=person))

shape: (2, 2)
┌─────┬──────────────────┐
│ id  ┆ email            │
│ --- ┆ ---              │
│ i64 ┆ str              │
╞═════╪══════════════════╡
│ 1   ┆ john@example.com │
│ 2   ┆ bob@example.com  │
└─────┴──────────────────┘


### Rearrange Products Table

#### Question

DataFrame: Products

| Column Name | Type    |
|:-----------:|:-------:|
| product_id  | int     |
| store1      | int     |
| store2      | int     |
| store3      | int     |

product_id is the primary key (column with unique values) for this table.<br>
Each row in this table indicates the product's price in 3 different stores: store1, store2, and store3.<br>
If the product is not available in a store, the price will be null in that store's column.
 

Write a solution to rearrange the Products table so that each row has (product_id, store, price). If a product is not available in a store, do not include a row with that product_id and store combination in the result table.

Return the result table in any order.

Example:

Input:<br>
Products dataframe:

| product_id | store1 | store2 | store3 |
|:----------:|:------:|:------:|:------:|
| 0          | 95     | 100    | 105    |
| 1          | 70     | null   | 80     |

Output: 

| product_id | store  | price |
|:----------:|:------:|:-----:|
| 0          | store1 | 95    |
| 0          | store2 | 100   |
| 0          | store3 | 105   |
| 1          | store1 | 70    |
| 1          | store3 | 80    |

Explanation:<br>
Product 0 is available in all three stores with prices 95, 100, and 105 respectively.<br>
Product 1 is available in store1 with price 70 and store3 with price 80. The product is not available in store2.

#### Testcase

In [131]:
# Test data
data = [[0, 95, 100, 105], [1, 70, None, 80]]

# Create the DataFrame
products = pl.DataFrame(
    data,
    schema=['product_id', 'store1', 'store2', 'store3']
)

# Display the DataFrame
print(products)

shape: (2, 4)
┌────────────┬────────┬────────┬────────┐
│ product_id ┆ store1 ┆ store2 ┆ store3 │
│ ---        ┆ ---    ┆ ---    ┆ ---    │
│ i64        ┆ i64    ┆ i64    ┆ i64    │
╞════════════╪════════╪════════╪════════╡
│ 0          ┆ 95     ┆ 100    ┆ 105    │
│ 1          ┆ 70     ┆ null   ┆ 80     │
└────────────┴────────┴────────┴────────┘


#### Solution

In [132]:
def rearrange_products_table(products: pl.DataFrame) -> pl.DataFrame:
    
    # Use the melt function to reshape the DataFrame
    melted_df = products.melt(
        id_vars='product_id', 
        variable_name='store', 
        value_name='price'
    )
    
    # Drop rows with null values
    result = melted_df.drop_nulls()
    
    return result

# Display the result
print(rearrange_products_table(products=products))

shape: (5, 3)
┌────────────┬────────┬───────┐
│ product_id ┆ store  ┆ price │
│ ---        ┆ ---    ┆ ---   │
│ i64        ┆ str    ┆ i64   │
╞════════════╪════════╪═══════╡
│ 0          ┆ store1 ┆ 95    │
│ 1          ┆ store1 ┆ 70    │
│ 0          ┆ store2 ┆ 100   │
│ 0          ┆ store3 ┆ 105   │
│ 1          ┆ store3 ┆ 80    │
└────────────┴────────┴───────┘
