In [0]:
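from pyspark.sql import SparkSession
# The wildcard import below replaces Python built-ins such as round and sum
# with their Spark column functions; the later cells rely on that.
from pyspark.sql.functions import *
from pyspark.sql.types import *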

spark = SparkSession.builder.appName("read").getOrCreate()


Problem 1683

In [0]:
tweets = [(1, "let us code"),
          (2, "More than fifteen chars are here!")]

tweets_schema = """ tweet_id int , content string"""

df_tweets = spark.createDataFrame(tweets, tweets_schema)
df_tweets.display()

tweet_id,content
1,let us code
2,More than fifteen chars are here!


In [0]:
df_result = df_tweets.filter(length("content") <= 15)

df_result.display()

tweet_id,content
1,let us code


In [0]:
# Same filter, passing an explicit Column built with col()
df_tweets.filter(length(col("content")) <= 15).display()


tweet_id,content
1,let us code


Problem 1075


Solution Walkthrough 


The first step was to create DataFrames for the project and employee tables.

To work with these tables using PySpark's SQL functionality, I registered both DataFrames as temporary views. This allowed me to use SQL queries to calculate the average experience years for each project.

The query sums experience_years and divides by the count of employee_id, grouped by project_id, which gives the average work experience for each project. Because project assignments and employee details are stored in separate temporary views, I used an inner join on the employee_id field to bring each employee's experience_years alongside the matching project_id.
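As a quick cross-check that does not need Spark, the same join-then-average logic can be sketched in plain Python. The function name average_experience and the *_rows variables are illustrative and not part of the notebook. builtins.round is called explicitly on the assumption that the wildcard import from pyspark.sql.functions has replaced round in the notebook namespace.

```python
import builtins
from collections import defaultdict

# Same rows as the project and employee tables used in this problem.
project_rows = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 4)]
employee_rows = [(1, "Khaled", 3), (2, "ALi", 2), (3, "John", 1), (4, "Doe", 2)]


def average_experience(projects, employees):
    """Return {project_id: average experience_years, rounded to 2 digits}."""
    years_by_employee = {emp_id: years for emp_id, _name, years in employees}
    totals = defaultdict(int)
    counts = defaultdict(int)
    for project_id, emp_id in projects:
        # Inner join: an assignment without a matching employee is skipped.
        if emp_id in years_by_employee:
            totals[project_id] += years_by_employee[emp_id]
            counts[project_id] += 1
    # SUM(experience_years) / COUNT(employee_id) per project.
    return {pid: builtins.round(totals[pid] / counts[pid], 2) for pid in totals}


expected = average_experience(project_rows, employee_rows)
print(expected)  # {1: 2.0, 2: 2.5}
```

Project 1 has employees with 3, 2 and 1 years (6 / 3 = 2.0) and project 2 has employees with 3 and 2 years (5 / 2 = 2.5), which is what the Spark query should return.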

Once the SQL query produced the desired result, I made a second attempt using PySpark DataFrame functions. First, I joined the employee and project DataFrames on employee_id. I then selected only the columns needed for the calculation and used the alias function to rename employee_id to Id, because the join leaves an employee_id column from each DataFrame and referring to it by name alone would be ambiguous. After that, I used groupBy to group by project_id and computed the rounded average of experience_years for each project. Finally, I displayed the resulting DataFrame containing the calculated average values.


In [0]:

project = [(1, 1),
           (1,2),
           (1,3),
           (2,1),
           (2,4)]

project_schema = ''' project_id int, employee_id int'''

df_project = spark.createDataFrame(project , project_schema)
df_project.display()

employee = [(1, "Khaled", 3),
            (2, "ALi", 2),
            (3, "John", 1),
            (4, "Doe", 2)]

employee_schema = ''' employee_id int, name string , experience_years int'''

df_employee = spark.createDataFrame(employee, employee_schema)
df_employee.display()

project_id,employee_id
1,1
1,2
1,3
2,1
2,4


employee_id,name,experience_years
1,Khaled,3
2,ALi,2
3,John,1
4,Doe,2


Write an SQL query that reports the average experience years of all the employees for each project, rounded to 2 digits.

Return the result table in any order.

SQL Solution 

In [0]:
df_project.createOrReplaceTempView("project")
df_employee.createOrReplaceTempView("employee")


In [0]:
df_result = spark.sql("""
SELECT
    p.project_id, 
    ROUND(SUM(e.experience_years) / COUNT(e.employee_id), 2) AS avg_years
FROM 
    employee e
JOIN 
    project p
ON 
    e.employee_id = p.employee_id
GROUP BY 
    p.project_id

    """)

df_result.display()

project_id,avg_years
1,2.0
2,2.5


In [0]:
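# Join employees to projects; alias employee_id so the selected column name is unambiguous.
df_joined = df_employee.join(df_project, df_employee.employee_id == df_project.employee_id, "inner") \
    .select(df_employee.employee_id.alias("Id"), df_employee.experience_years, df_project.project_id)

# Average experience per project: total years / number of employees, rounded to 2 digits.
df_avg = df_joined.groupBy("project_id").agg(round(sum("experience_years") / count("Id"), 2).alias("average_years"))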

df_avg.display()

project_id,average_years
1,2.0
2,2.5


Problem 577

Solution Walkthrough 

To begin, I created two DataFrames: df_employee1 and df_bonus. 
Next, I created temporary views for both DataFrames using createOrReplaceTempView. This allowed me to perform SQL queries on the data, treating these DataFrames as tables. For the SQL query, I needed to list the employees with a bonus less than 1000, including those without any bonus. To achieve this, I used a LEFT JOIN to join the employee1 and bonus tables on empId. The LEFT JOIN ensures that all employees are included in the result, even if they don’t have a corresponding record in the bonus table.
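To make the effect of the LEFT JOIN concrete, here is a minimal plain-Python sketch of the two join types on this problem's data. It only mimics the join semantics with a dictionary lookup and is not how Spark executes the query; the variable names are illustrative.

```python
# Same rows as the employee1 and bonus tables used in this problem.
employee_rows = [(3, "Brad", None, 4000), (1, "John", 3, 1000),
                 (2, "Dan", 3, 2000), (4, "Thomas", 3, 4000)]
bonus_rows = [(2, 500), (4, 2000)]

bonus_by_emp = dict(bonus_rows)

# LEFT JOIN: every employee is kept; a missing bonus becomes None (NULL).
left_joined = [(name, bonus_by_emp.get(emp_id))
               for emp_id, name, _supervisor, _salary in employee_rows]

# INNER JOIN: employees without a bonus row are dropped entirely.
inner_joined = [(name, bonus_by_emp[emp_id])
                for emp_id, name, _supervisor, _salary in employee_rows
                if emp_id in bonus_by_emp]

print(left_joined)   # [('Brad', None), ('John', None), ('Dan', 500), ('Thomas', 2000)]
print(inner_joined)  # [('Dan', 500), ('Thomas', 2000)]
```

With an inner join, Brad and John would never reach the filter step, so they could not appear in the result.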

The SQL query selected the name and bonus columns, and filtered the results to only include rows where the bonus was less than 1000 or where the bonus was NULL. This was done using the WHERE clause in the query.
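The explicit NULL check matters because in SQL a comparison with NULL evaluates to NULL rather than TRUE, and WHERE keeps only rows whose predicate is TRUE. The sketch below imitates that behavior in plain Python; sql_less_than is a hypothetical helper written for this illustration, and joined_rows stands for the left-join result on this problem's data.

```python
def sql_less_than(value, limit):
    """Mimic SQL's `value < limit`: a NULL operand yields NULL (None)."""
    if value is None:
        return None
    return value < limit


# (name, bonus) pairs as they look after the left join.
joined_rows = [("Brad", None), ("John", None), ("Dan", 500), ("Thomas", 2000)]

# WHERE bonus < 1000: rows with a NULL bonus are silently dropped.
only_comparison = [row for row in joined_rows
                   if sql_less_than(row[1], 1000) is True]

# WHERE bonus < 1000 OR bonus IS NULL: rows with no bonus are kept.
with_null_check = [row for row in joined_rows
                   if sql_less_than(row[1], 1000) is True or row[1] is None]

print(only_comparison)  # [('Dan', 500)]
print(with_null_check)  # [('Brad', None), ('John', None), ('Dan', 500)]
```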

After getting the result using SQL, I replicated the same logic using PySpark's DataFrame functions. I joined the df_employee1 and df_bonus DataFrames with a left join on the empId field. Then, I applied a filter to keep only the rows where the bonus was less than 1000 or NULL. To display the final result with only the required columns, I used the select function to choose the name and bonus columns.

Finally, I displayed the resulting DataFrame to see the names and bonus values of employees who meet the specified conditions.

In [0]:
employee_1= [(3, "Brad", None, 4000),
             (1, "John", 3, 1000),
             (2, "Dan", 3, 2000),
             (4, "Thomas", 3, 4000)]
employee_1_schema= """ empId int, name string, supervisor int, salary int"""

df_employee1= spark.createDataFrame(employee_1, employee_1_schema)
df_employee1.display()


bonus = [(2,500),
         (4,2000)]
bonus_schema = """empId int, bonus int """
df_bonus= spark.createDataFrame(bonus, bonus_schema)
df_bonus.display()

empId,name,supervisor,salary
3,Brad,,4000
1,John,3.0,1000
2,Dan,3.0,2000
4,Thomas,3.0,4000


empId,bonus
2,500
4,2000


Write a solution to report the name and bonus amount of each employee with a bonus less than 1000.

Return the result table in any order.

In [0]:
df_employee1.createOrReplaceTempView("employee1")
df_bonus.createOrReplaceTempView("bonus")

In [0]:
df_result1 = spark.sql("""
SELECT e.name, b.bonus
FROM employee1 e
LEFT JOIN bonus b
    ON e.empId = b.empId
WHERE b.bonus < 1000 OR b.bonus IS NULL
""")
df_result1.display()


name,bonus
Brad,
John,
Dan,500.0


In [0]:
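# Left join keeps every employee; bonus is NULL where no bonus row matches.
df_joined1 = df_employee1.join(df_bonus, df_employee1.empId == df_bonus.empId, "left")

# Keep bonuses under 1000 as well as employees with no bonus at all.
df_solution = df_joined1.filter((col("bonus") < 1000) | col("bonus").isNull()).select("name", "bonus")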
df_solution.display()


name,bonus
Brad,
John,
Dan,500.0
