<pre>
Table: Employee

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| id          | int     |
| name        | varchar |
| department  | varchar |
| managerId   | int     |
+-------------+---------+
id is the primary key (column with unique values) for this table.
Each row of this table indicates the name of an employee, their department, and the id of their manager.
If managerId is null, then the employee does not have a manager.
No employee will be their own manager.
 

Write a solution to find managers with at least five direct reports.

Return the result table in any order.

The result format is in the following example.

 

Example 1:

Input: 
Employee table:
+-----+-------+------------+-----------+
| id  | name  | department | managerId |
+-----+-------+------------+-----------+
| 101 | John  | A          | null      |
| 102 | Dan   | A          | 101       |
| 103 | James | A          | 101       |
| 104 | Amy   | A          | 101       |
| 105 | Anne  | A          | 101       |
| 106 | Ron   | B          | 101       |
+-----+-------+------------+-----------+
Output: 
+------+
| name |
+------+
| John |
+------+
</pre>
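
Before turning to Spark, the requirement can be sanity-checked in plain Python. The following is a minimal sketch, not part of the original notebook: the helper name `managers_with_min_reports` and its `min_reports` parameter are hypothetical, and the rows are copied from Example 1.

```python
from collections import Counter

# Rows from Example 1 as (id, name, department, managerId); None means "no manager".
employees = [
    (101, "John", "A", None),
    (102, "Dan", "A", 101),
    (103, "James", "A", 101),
    (104, "Amy", "A", 101),
    (105, "Anne", "A", 101),
    (106, "Ron", "B", 101),
]


def managers_with_min_reports(rows, min_reports=5):
    """Return the names of employees who have at least `min_reports` direct reports."""
    # Count how many rows point at each manager id, skipping rows without a manager.
    report_counts = Counter(
        manager_id for _, _, _, manager_id in rows if manager_id is not None
    )
    # A Counter returns 0 for ids that never appear as a manager.
    return [name for emp_id, name, _, _ in rows if report_counts[emp_id] >= min_reports]


print(managers_with_min_reports(employees))
```

Matching on `id` rather than `name` matters because only `id` is guaranteed to be unique; two managers could share a name.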

In [0]:
spark

In [0]:
# Import only the PySpark SQL functions used below
# (a wildcard import would shadow Python built-ins such as sum, min and max)
from pyspark.sql.functions import col, count

# Import the SQL types needed for the Employee schema
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Import SparkSession
from pyspark.sql import SparkSession


In [0]:
# Create (or reuse) the Spark session and set the app name
spark = SparkSession.builder.appName("leetcode-top-50-sql-solution-with-pyspark").getOrCreate()

In [0]:
# Define the schema for the Employee table (id is the primary key, so it is not nullable)
employee_schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("department", StringType(), True),
    StructField("managerId", IntegerType(), True)
])





In [0]:
# Create the Employee DataFrame from the rows of Example 1
employee_df = spark.createDataFrame([
    (101, "John", "A", None),
    (102, "Dan", "A", 101),
    (103, "James", "A", 101),
    (104, "Amy", "A", 101),
    (105, "Anne", "A", 101),
    (106, "Ron", "B", 101)
], schema=employee_schema)







In [0]:
employee_df.display()

id,name,department,managerId
101,John,A,
102,Dan,A,101
103,James,A,101
104,Amy,A,101
105,Anne,A,101
106,Ron,B,101


In [0]:
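# LeetCode solution in Spark SQL
# Register the employee DataFrame as a temporary view so it can be queried with SQL
employee_df.createOrReplaceTempView('employee')


# The subquery returns the ids of managers with at least five direct reports.
# COUNT(managerId) ignores NULLs, so employees without a manager never qualify.
sql_result = spark.sql(
    '''
    SELECT name
    FROM employee
    WHERE id IN (
        SELECT managerId
        FROM employee
        GROUP BY managerId
        HAVING COUNT(managerId) >= 5
    )
    '''
)

# Display the result
sql_result.display()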

name
John
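
The query relies on `COUNT(managerId)` skipping NULL values, so the group of employees without a manager gets a count of 0. As a rough cross-check outside Spark, the sketch below runs the same statement against an in-memory SQLite database from the Python standard library. SQLite is a stand-in chosen here for convenience, not the engine used in this notebook; the standard behaviour of `COUNT(column)` on NULLs is the only thing being illustrated.

```python
import sqlite3

# Rows from Example 1; None is stored as SQL NULL.
rows = [
    (101, "John", "A", None),
    (102, "Dan", "A", 101),
    (103, "James", "A", 101),
    (104, "Amy", "A", 101),
    (105, "Anne", "A", 101),
    (106, "Ron", "B", 101),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT, department TEXT, managerId INTEGER)"
)
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)

# Same statement as the Spark SQL cell above.
query = """
    SELECT name
    FROM employee
    WHERE id IN (
        SELECT managerId
        FROM employee
        GROUP BY managerId
        HAVING COUNT(managerId) >= 5
    )
"""
names = [row[0] for row in conn.execute(query)]
print(names)

# Per-group counts: COUNT(managerId) skips NULLs, COUNT(*) counts every row.
group_counts = conn.execute(
    "SELECT managerId, COUNT(managerId), COUNT(*) "
    "FROM employee GROUP BY managerId ORDER BY managerId"
).fetchall()
print(group_counts)
conn.close()
```

If the `HAVING` clause used `COUNT(*)` instead, the NULL group would be counted too, although it still could not match any `id` in the outer `IN` test.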


In [0]:
# LeetCode solution using the PySpark DataFrame API

# Count the direct reports of each manager (rows without a manager are skipped)
manager_df = (
    employee_df
    .filter(col("managerId").isNotNull())
    .groupBy("managerId")
    .agg(count("*").alias("direct_reports"))
)
manager_df.show()

# Keep only the managers with at least five direct reports
# (filter before select, so direct_reports is still available)
filter_df = manager_df.filter(col("direct_reports") >= 5).select("managerId")

# Join back to the employee table to look up each manager's name
join_df = employee_df.join(filter_df, employee_df.id == filter_df.managerId, "inner").select("name")
join_df.display()

+---------+--------------+
|managerId|direct_reports|
+---------+--------------+
|      101|             5|
+---------+--------------+



name
John
