Question 2
Create the following tables as per cardinality:

emp: eno, ename, gender, designation, city, salary, dno

dept: dno, dname, location

One department can have many employees.

Insert 5 records and perform tasks below:

In [1]:
import findspark
findspark.init()
import pandas as pd
import pyspark
from pyspark.sql import SparkSession
# Note: sum, max and min imported here shadow Python's built-ins of the same name.
from pyspark.sql.functions import (
    col, lit, when, udf,
    avg, sum, count, max, min,
    concat_ws, upper, lower, length, substring, instr, trim, lpad, rpad,
)
from pyspark.sql.types import StringType, IntegerType, DoubleType


In [2]:
# Initialize Spark session
spark = SparkSession.builder.appName("EmployeeDepartment").getOrCreate()

1. Create emp and dept DataFrames.

In [3]:
emp_data = [
    (1, 'Ravi', 'M', 'Manager', 'Pune', 50000, 1),
    (2, 'Neha', 'F', 'Analyst', 'Pune', 30000, 1),
    (3, 'Ayesha', 'F', 'Clerk', 'Nagpur', 18000, 2),
    (4, 'Vijay', 'M', 'Analyst', 'Mumbai', 25000, 2),
    (5, 'Tina', 'F', 'Clerk', 'Pune', 17000, 1)
]

emp_columns = ['eno', 'ename', 'gender', 'designation', 'city', 'salary', 'dno']
emp = spark.createDataFrame(emp_data, emp_columns)

dept_data = [
    (1, 'HR', 'Pune'),
    (2, 'Finance', 'Mumbai')
]

dept_columns = ['dno', 'dname', 'location']
dept = spark.createDataFrame(dept_data, dept_columns)


Defines the employee and department DataFrames. Each department can have many employees, so the dno column in emp refers to a row in dept (a one-to-many relationship).

2. Print schema for both dataframes.

In [4]:
emp.printSchema()
dept.printSchema()


root
 |-- eno: long (nullable = true)
 |-- ename: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- designation: string (nullable = true)
 |-- city: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- dno: long (nullable = true)

root
 |-- dno: long (nullable = true)
 |-- dname: string (nullable = true)
 |-- location: string (nullable = true)



Confirms each DataFrame’s columns and inferred data types: Python integers are inferred as long and text values as string.

3. Filter employees by designation ‘Analyst’ and salary > 20000.

In [5]:
print("Employees with designation Analyst:")
emp.filter(col("designation") == "Analyst").show()

print("Employees with salary > 20000:")
emp.filter(col("salary") > 20000).show()


Employees with designation Analyst:
+---+-----+------+-----------+------+------+---+
|eno|ename|gender|designation|  city|salary|dno|
+---+-----+------+-----------+------+------+---+
|  2| Neha|     F|    Analyst|  Pune| 30000|  1|
|  4|Vijay|     M|    Analyst|Mumbai| 25000|  2|
+---+-----+------+-----------+------+------+---+

Employees with salary > 20000:
+---+-----+------+-----------+------+------+---+
|eno|ename|gender|designation|  city|salary|dno|
+---+-----+------+-----------+------+------+---+
|  1| Ravi|     M|    Manager|  Pune| 50000|  1|
|  2| Neha|     F|    Analyst|  Pune| 30000|  1|
|  4|Vijay|     M|    Analyst|Mumbai| 25000|  2|
+---+-----+------+-----------+------+------+---+



The two filters are applied separately: the first keeps employees whose designation is Analyst, the second keeps employees earning more than 20000.

4. Show departments for female employees.

In [6]:
emp.filter(col("gender") == "F").join(dept, "dno", "left").show()


+---+---+------+------+-----------+------+------+-------+--------+
|dno|eno| ename|gender|designation|  city|salary|  dname|location|
+---+---+------+------+-----------+------+------+-------+--------+
|  1|  2|  Neha|     F|    Analyst|  Pune| 30000|     HR|    Pune|
|  2|  3|Ayesha|     F|      Clerk|Nagpur| 18000|Finance|  Mumbai|
|  1|  5|  Tina|     F|      Clerk|  Pune| 17000|     HR|    Pune|
+---+---+------+------+-----------+------+------+-------+--------+



Joins the female employees to dept on dno, so each row also shows the department name and location. The left join keeps an employee even if her dno has no match in dept.

5. Show all employees grouped department-wise.

In [7]:
emp.join(dept, "dno").orderBy("dno").show()


+---+---+------+------+-----------+------+------+-------+--------+
|dno|eno| ename|gender|designation|  city|salary|  dname|location|
+---+---+------+------+-----------+------+------+-------+--------+
|  1|  1|  Ravi|     M|    Manager|  Pune| 50000|     HR|    Pune|
|  1|  2|  Neha|     F|    Analyst|  Pune| 30000|     HR|    Pune|
|  1|  5|  Tina|     F|      Clerk|  Pune| 17000|     HR|    Pune|
|  2|  3|Ayesha|     F|      Clerk|Nagpur| 18000|Finance|  Mumbai|
|  2|  4| Vijay|     M|    Analyst|Mumbai| 25000|Finance|  Mumbai|
+---+---+------+------+-----------+------+------+-------+--------+



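Lists every employee with department details, ordered by dno so that each department’s rows appear together. This is an inner join followed by a sort, not an aggregation.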

6. Calculate average salary per department.

In [8]:
emp.groupBy("dno").avg("salary").show()


+---+------------------+
|dno|       avg(salary)|
+---+------------------+
|  1|32333.333333333332|
|  2|           21500.0|
+---+------------------+



Shows the mean salary per department: department 1 (HR) averages about 32333 and department 2 (Finance) averages 21500.

7. Count male employees per department.

In [9]:
emp.filter(col("gender") == "M").groupBy("dno").count().show()


+---+-----+
|dno|count|
+---+-----+
|  1|    1|
|  2|    1|
+---+-----+



Each department has exactly one male employee: Ravi in department 1 and Vijay in department 2.

8. List employees with salary less than 20000 and designation ‘Clerk’.

In [10]:
emp.filter((col("salary") < 20000) & (col("designation") == "Clerk")).show()


+---+------+------+-----------+------+------+---+
|eno| ename|gender|designation|  city|salary|dno|
+---+------+------+-----------+------+------+---+
|  3|Ayesha|     F|      Clerk|Nagpur| 18000|  2|
|  5|  Tina|     F|      Clerk|  Pune| 17000|  1|
+---+------+------+-----------+------+------+---+



Both conditions must hold for a row to be kept. Each comparison is wrapped in parentheses because & binds more tightly than < and == in Python.

9. Create UDF to classify salary level and display.

In [11]:
def salary_level(sal):
    if sal >= 40000:
        return 'High'
    elif sal >= 20000:
        return 'Medium'
    else:
        return 'Low'

salary_level_udf = udf(salary_level, StringType())
emp_with_salary_level = emp.withColumn("salary_level", salary_level_udf(col("salary")))
emp_with_salary_level.select("ename", "salary", "salary_level").show()


+------+------+------------+
| ename|salary|salary_level|
+------+------+------------+
|  Ravi| 50000|        High|
|  Neha| 30000|      Medium|
|Ayesha| 18000|         Low|
| Vijay| 25000|      Medium|
|  Tina| 17000|         Low|
+------+------+------------+



Adds a salary_level column: salaries of 40000 or more are High, 20000 to 39999 are Medium, and anything below 20000 is Low.