1. How do you create a PySpark DataFrame from a list or a dictionary?

From a List:

You can create a PySpark DataFrame from a list of tuples or lists using spark.createDataFrame().

In [1]:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("interview_questions").getOrCreate()

# List of tuples
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]

# Create DataFrame
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()


+-----+---+
| Name|Age|
+-----+---+
|Alice| 34|
|  Bob| 45|
|Cathy| 29|
+-----+---+



From a Dictionary:

You can create a DataFrame from a dictionary by converting it into a list of tuples or using pandas.DataFrame as an intermediate step.


In [2]:
# Dictionary
data = {"Name": ["Alice", "Bob", "Cathy"], "Age": [34, 45, 29]}

# Using pandas as intermediate
import pandas as pd
pandas_df = pd.DataFrame(data)
df = spark.createDataFrame(pandas_df)
df.show()


+-----+---+
| Name|Age|
+-----+---+
|Alice| 34|
|  Bob| 45|
|Cathy| 29|
+-----+---+
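
The other route mentioned above, converting the dictionary into a list of tuples, avoids the pandas dependency. The following is a minimal sketch, assuming every value list in the dictionary has the same length. The helper names dict_to_rows and dict_to_spark_df are illustrative and not part of PySpark.

```python
def dict_to_rows(columns):
    """Turn {"col": [v1, v2, ...]} into (column_names, [row_tuple, ...])."""
    names = list(columns)
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of values")
    # zip(*...) transposes the column lists into row tuples
    rows = list(zip(*columns.values()))
    return names, rows


def dict_to_spark_df(spark, columns):
    """Build a PySpark DataFrame from a column-oriented dict (needs an active session)."""
    names, rows = dict_to_rows(columns)
    return spark.createDataFrame(rows, names)


data = {"Name": ["Alice", "Bob", "Cathy"], "Age": [34, 45, 29]}
names, rows = dict_to_rows(data)
print(names)  # ['Name', 'Age']
print(rows)   # [('Alice', 34), ('Bob', 45), ('Cathy', 29)]
```

With the session created in the first cell, dict_to_spark_df(spark, data).show() should print the same table as the pandas-based version.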



2. What are the different ways to filter records in a PySpark DataFrame?

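    You can filter records in a PySpark DataFrame using:

    •	filter() or where(); where() is an alias of filter(), so the two behave identically.

    •	Either a column-based condition (e.g., df["Age"] > 30) or a SQL expression string (e.g., "Age > 30").

    •	Combined column conditions joined with &, | and ~, with each comparison wrapped in parentheses.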


In [3]:
# Using filter() with a condition
df.filter(df["Age"] > 30).show()

# Using where() with a condition
df.where(df["Age"] > 30).show()

# Using SQL-like expression
df.filter("Age > 30").show()

# Multiple conditions
df.filter((df["Age"] > 30) & (df["Name"] == "Alice")).show()


                                                                                

+-----+---+
| Name|Age|
+-----+---+
|Alice| 34|
|  Bob| 45|
+-----+---+

+-----+---+
| Name|Age|
+-----+---+
|Alice| 34|
|  Bob| 45|
+-----+---+

+-----+---+
| Name|Age|
+-----+---+
|Alice| 34|
|  Bob| 45|
+-----+---+

+-----+---+
| Name|Age|
+-----+---+
|Alice| 34|
+-----+---+



3. How do you perform pivot and unpivot operations in PySpark?

Pivot is used to transform rows into columns based on unique values in a column.

In [5]:
# Sample DataFrame
data = [("Alice", "Math", 85), ("Alice", "Science", 90), ("Bob", "Math", 75)]
df = spark.createDataFrame(data, ["Name", "Subject", "Score"])

# Pivot operation
pivot_df = df.groupBy("Name").pivot("Subject").avg("Score")
pivot_df.show()


                                                                                

+-----+----+-------+
| Name|Math|Science|
+-----+----+-------+
|  Bob|75.0|   NULL|
|Alice|85.0|   90.0|
+-----+----+-------+



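Unpivot is used to transform columns into rows.

Before Spark 3.4, PySpark had no built-in unpivot method, so the usual approach is the SQL stack() function inside selectExpr(). Spark 3.4 and later also provide DataFrame.unpivot() (with melt() as an alias).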

In [6]:
# Unpivot using the SQL stack() function inside selectExpr()
unpivot_df = pivot_df.selectExpr(
    "Name", "stack(2, 'Math', Math, 'Science', Science) as (Subject, Score)"
)
unpivot_df.show()


+-----+-------+-----+
| Name|Subject|Score|
+-----+-------+-----+
|  Bob|   Math| 75.0|
|  Bob|Science| NULL|
|Alice|   Math| 85.0|
|Alice|Science| 90.0|
+-----+-------+-----+



                                                                                

4. Explain the difference between withColumn() and select() when modifying columns.

withColumn():

•	Used to add or replace a single column in a DataFrame.

•	Returns a new DataFrame with the added/modified column.

•	Syntax: df.withColumn("new_column", expression)


select():

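•	Used to project specific columns or expressions; the result contains only the columns you list, unlike withColumn(), which keeps all existing columns.

•	Can add or modify multiple columns in a single call.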

•	Syntax: df.select("col1", "col2", expr("col3 + 1").alias("new_col"))
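
The difference is easiest to see side by side. The sketch below assumes a DataFrame with Name and Age columns, like the one built in question 1; the function names are hypothetical and exist only to label the three variants.

```python
def add_column_with_withcolumn(df):
    """Keep every existing column and append AgeNextYear."""
    return df.withColumn("AgeNextYear", df["Age"] + 1)


def add_column_with_select(df):
    """Same result via select(): "*" must be listed to keep the existing columns."""
    return df.select("*", (df["Age"] + 1).alias("AgeNextYear"))


def project_with_select(df):
    """select() without "*" returns only the listed expressions."""
    return df.select("Name", (df["Age"] + 1).alias("AgeNextYear"))
```

The first two functions should return DataFrames with columns Name, Age and AgeNextYear, while the third drops Age. When many columns are added in a loop, a single select() is generally preferable, because each withColumn() call adds another projection to the query plan.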


5. What are the key differences between a PySpark DataFrame and a pandas DataFrame?

Aspect        PySpark DataFrame                                pandas DataFrame
Execution     Lazy evaluation (optimized execution plan).      Eager evaluation (immediate execution).
Scalability   Distributed and scalable (handles big data).     Single-node (limited to memory size).
API           SQL-like, functional programming.                Pythonic, object-oriented.
Performance   Optimized for large datasets.                    Optimized for small to medium datasets.
Immutability  Immutable (operations return new DataFrame).     Mutable (in-place modifications allowed).
Ease of Use   Requires understanding of distributed systems.   Easier for small-scale data manipulation.




6. How do you create an RDD from an external file in PySpark?

You can create an RDD from an external file (e.g., text file, CSV) using the textFile() or wholeTextFiles() method in PySpark.


In [None]:
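from pyspark import SparkContext

# Reuse the active SparkContext if one exists (e.g., from the SparkSession above);
# constructing a second SparkContext in the same process raises a ValueError
sc = SparkContext.getOrCreate()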

# Create an RDD from a text file
rdd = sc.textFile("path/to/file.txt")

# Display the first few lines
print(rdd.take(5))


•	textFile(): Reads a file and returns an RDD where each element is a line from the file.

•	wholeTextFiles(): Reads a directory of text files and returns an RDD of key-value pairs, where the key is the file path and the value is that file's entire content.


7. Explain the difference between map() and flatMap() in PySpark RDDs with examples.

map():

•	Applies a function to each element of the RDD and returns a new RDD with the transformed elements.

•	The output RDD has the same number of elements as the input RDD.


In [None]:
rdd = sc.parallelize([1, 2, 3, 4])
mapped_rdd = rdd.map(lambda x: x * 2)
print(mapped_rdd.collect())  # Output: [2, 4, 6, 8]


flatMap():

•	Applies a function that returns an iterable for each element, then flattens the results into a single RDD.

•	Each input element can produce zero, one, or many outputs, so the output RDD can have more or fewer elements than the input RDD.


In [None]:
rdd = sc.parallelize(["Hello World", "PySpark is awesome"])
flat_mapped_rdd = rdd.flatMap(lambda x: x.split(" "))
print(flat_mapped_rdd.collect())  # Output: ['Hello', 'World', 'PySpark', 'is', 'awesome']
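
The output size can differ from the input size because flatMap() lets each input yield zero or more outputs. The sketch below models that contract in plain Python, without Spark; local_map and local_flat_map are hypothetical stand-ins for rdd.map() and rdd.flatMap(), not PySpark functions.

```python
from itertools import chain


def local_map(func, items):
    """One output per input, like rdd.map(func)."""
    return [func(x) for x in items]


def local_flat_map(func, items):
    """func returns an iterable per input; results are concatenated, like rdd.flatMap(func)."""
    return list(chain.from_iterable(func(x) for x in items))


lines = ["Hello World", "", "PySpark is awesome"]

# map keeps one (nested) result per line, including the empty one
print(local_map(str.split, lines))
# flatMap flattens one level; the empty line contributes nothing
print(local_flat_map(str.split, lines))
# Returning [] for unwanted elements makes flatMap act like a filter
print(local_flat_map(lambda n: [n] if n % 2 == 0 else [], [1, 2, 3, 4]))
```

Three input lines become three lists under map but five words under flatMap, and the last call shrinks four inputs to two outputs.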


8. What are the advantages and disadvantages of using RDDs over DataFrames?

Advantages of RDDs:

1.	Fine-Grained Control: RDDs provide low-level APIs for precise control over data transformations and actions.

2.	Flexibility: Supports complex, custom operations that may not be easily expressible in DataFrames.

3.	Immutability: RDDs are immutable, which keeps data consistent in distributed environments (DataFrames share this property, so it is a common trait rather than a differentiator).

Disadvantages of RDDs:

1.	Performance: RDDs lack the optimizations (e.g., Catalyst optimizer, Tungsten execution engine) available in DataFrames.

2.	Ease of Use: RDDs require more manual effort for common operations compared to DataFrames.

3.	No Schema: RDDs do not have a schema, making it harder to work with structured data.

When to Use RDDs:

•	For unstructured data (e.g., text, graphs).

•	When you need low-level control over transformations.


9. How do you sort an RDD based on a specific column value?

To sort an RDD, use the sortBy() method, which sorts the RDD based on a key function.


In [None]:
# Sample RDD
rdd = sc.parallelize([("Alice", 34), ("Bob", 45), ("Cathy", 29)])

# Sort by age (second column)
sorted_rdd = rdd.sortBy(lambda x: x[1])
print(sorted_rdd.collect())  # Output: [('Cathy', 29), ('Alice', 34), ('Bob', 45)]
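
For a simple descending sort, sortBy() accepts ascending=False. When ties need a secondary order, the key function can return a tuple. The sketch below is one possible key, with the hypothetical names age_desc_then_name and sort_rdd; it assumes each element is a (name, age) tuple.

```python
def age_desc_then_name(row):
    """Sort key: highest age first, ties broken alphabetically by name."""
    name, age = row
    return (-age, name)


def sort_rdd(rdd):
    """Apply the key with RDD.sortBy (needs a live RDD of (name, age) tuples)."""
    return rdd.sortBy(age_desc_then_name)


rows = [("Alice", 34), ("Bob", 45), ("Cathy", 29), ("Dan", 34)]
# Python's sorted() accepts the same key function, so the ordering can be checked locally
print(sorted(rows, key=age_desc_then_name))
# [('Bob', 45), ('Alice', 34), ('Dan', 34), ('Cathy', 29)]
```

Calling sort_rdd(sc.parallelize(rows)).collect() should give the same ordering on a running cluster.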


10. Describe how data is distributed across partitions in an RDD and how it impacts performance.

Data Distribution in RDDs:

•	An RDD is divided into partitions, which are distributed across nodes in a cluster.

•	Each partition contains a subset of the data.

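•	The number of partitions depends on the data source and configuration: textFile() creates roughly one partition per input split (with minPartitions as a lower bound), while parallelize() uses its numSlices argument or falls back to spark.default.parallelism.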

Impact on Performance:

1.	Parallelism: More partitions increase parallelism, allowing more tasks to run concurrently.

2.	Load Balancing: Even distribution of data across partitions ensures balanced workloads.

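3.	Shuffling: Wide operations such as groupByKey() or reduceByKey() move data between partitions (a shuffle), which is expensive in network and disk I/O. reduceByKey() combines values within each partition before shuffling, so it usually moves less data than groupByKey().

4.	Partition Count: Too many small partitions add task-scheduling overhead, while too few leave cores idle and make each task hold more data in memory.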

Optimizing Partitions:

•	Use repartition() to increase or rebalance partitions (it performs a full shuffle) and coalesce() to reduce their number without a full shuffle.

•	Aim for partitions of roughly equal size to avoid skew.
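
To make skew concrete, the sketch below models key-based partitioning in plain Python. It is a simplified model, not Spark's implementation: Spark uses its own hash function, so real placements will differ, and the names assign_partition and partition_sizes are hypothetical.

```python
from collections import Counter


def assign_partition(key, num_partitions):
    """Simplified hash partitioning: the same key always lands in the same partition."""
    return hash(key) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many records each partition would receive."""
    counts = Counter(assign_partition(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Small non-negative integers hash to themselves in CPython, so this is deterministic
balanced_keys = list(range(1000))
skewed_keys = [7] * 900 + list(range(100))  # one hot key dominates

print(partition_sizes(balanced_keys, 4))  # [250, 250, 250, 250]
print(partition_sizes(skewed_keys, 4))    # [25, 25, 25, 925]
```

In the skewed case one partition holds 925 of the 1000 records, so the task processing it would run far longer than the other three. Adding partitions does not help here, because all records with the hot key still land together.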


11. How do you perform a left anti join and left semi join in PySpark SQL?

Left Anti Join:

•	Returns only the rows from the left DataFrame that do not have a match in the right DataFrame.

•	Syntax: df1.join(df2, on="key", how="left_anti")


In [11]:
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df2 = spark.createDataFrame([(1, "HR")], ["id", "dept"])

# Left Anti Join
result = df1.join(df2, on="id", how="left_anti")
result.show()


                                                                                

+---+----+
| id|name|
+---+----+
|  2| Bob|
+---+----+



Left Semi Join:

•	Returns only the rows from the left DataFrame that have a match in the right DataFrame.

•	Syntax: df1.join(df2, on="key", how="left_semi")


In [12]:
# Left Semi Join
result = df1.join(df2, on="id", how="left_semi")
result.show()




+---+-----+
| id| name|
+---+-----+
|  1|Alice|
+---+-----+
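
Both join types act as filters on the left DataFrame: they return only its columns and never duplicate its rows. The plain-Python sketch below models that behavior under simplifying assumptions (rows are tuples, the join key sits at the same index on both sides, and NULL handling is ignored). The functions left_semi and left_anti are illustrative, not PySpark APIs.

```python
def left_semi(left_rows, right_rows, key=0):
    """Left rows whose key appears on the right; each left row is returned at most once."""
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] in right_keys]


def left_anti(left_rows, right_rows, key=0):
    """Left rows whose key does not appear on the right."""
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] not in right_keys]


left = [(1, "Alice"), (2, "Bob")]
# id 1 matches twice on the right; an inner join would emit Alice twice
right = [(1, "HR"), (1, "Finance")]

print(left_semi(left, right))  # [(1, 'Alice')]
print(left_anti(left, right))  # [(2, 'Bob')]
```

Together the two results partition the left side: every left row appears in exactly one of them.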



                                                                                