# Spark Essentials

## What is PySpark in Apache Spark and how is it useful?

- A) PySpark is a Python library for parallel processing of data, and it is useful because it provides an easy-to-use programming interface for data processing and analysis
- B) PySpark is a Python package for distributed computing, and it is useful because it allows Python developers to write Spark jobs in Python instead of Java or Scala ***
- C) PySpark is a Python-based cluster manager for Apache Spark, and it is useful because it simplifies the deployment and management of Spark clusters
- D) PySpark is a Python-based machine learning library, and it is useful because it provides a powerful set of tools for machine learning and data analysis in Python

## What is `findspark` in PySpark and how is it useful?

- A) `findspark` is a package that allows users to locate and download Spark libraries, and it is useful because it simplifies the setup process for Spark on local machines
- B) `findspark` is a module that automatically finds the location of Spark and sets the necessary environment variables, and it is useful because it makes it easier to integrate PySpark with existing Python environments ***
- C) `findspark` is a Python-based cluster manager for Apache Spark, and it is useful because it simplifies the deployment and management of Spark clusters
- D) `findspark` is a machine learning library for PySpark, and it is useful because it provides a powerful set of tools for machine learning and data analysis in PySpark

## Which of the following code snippets correctly shows how to create and configure a SparkConf object in PySpark?

- A)

`from pyspark import SparkConf, SparkContext`

`conf = SparkConf().setAppName("MyApp")`
`sc = SparkContext(conf=conf)` ***

- B)

`from pyspark import SparkConf, SparkContext`

`sc = SparkContext("local", "MyApp")`

- C)

`from pyspark import SparkConf, SparkContext`

`conf = SparkConf().setAppName("MyApp")`
`sc = SparkContext.getOrCreate(conf=conf)`

- D)

`from pyspark import SparkConf`

`conf = SparkConf().setAppName("MyApp").setMaster("local")`
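
The snippets above can be tied together in a minimal sketch of how a configured `SparkConf` is typically handed to a `SparkContext`. The application name, the `local[2]` master, and the helper name `make_context` are illustrative assumptions, not part of the quiz. The import sits inside the function only so the snippet can be loaded on a machine without PySpark.

```python
# Illustrative values (assumptions): "local[2]" runs Spark in-process with two threads.
APP_NAME = "MyApp"
MASTER = "local[2]"


def make_context(app_name=APP_NAME, master=MASTER):
    """Build a SparkConf, then return a SparkContext that uses it."""
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name).setMaster(master)
    # getOrCreate() returns an already-running context if there is one,
    # in which case these settings may not be applied to it.
    return SparkContext.getOrCreate(conf=conf)
```

After `sc = make_context()`, reading `sc.appName` and `sc.master` is one way to check which settings are in effect. Call `sc.stop()` when finished.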



## What is the purpose of `SparkContext` in PySpark?

- A) To provide a connection to a Spark cluster and coordinate the execution of Spark jobs. ***
- B) To define a Spark application's configuration settings, such as the application name and the location of input data.
- C) To provide a high-level API for working with distributed data in PySpark.
- D) To manage the execution of tasks within a Spark job and handle failures and retries.

## In PySpark, what is the purpose of the `getOrCreate()` method of `SparkSession`?

- A) To create a new `SparkSession` instance if one does not already exist, or to return an existing `SparkSession` instance if one has already been created. ***
- B) To create a new RDD from a given data source, such as a file or database table.
- C) To create a new DataFrame by applying a schema to a given data source, such as a file or database table.
- D) To create a new `StreamingContext` that can consume data from various sources, such as Kafka or HDFS.
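
The reuse behaviour of `getOrCreate()` can be sketched as follows, assuming a classic (non-Connect) local session. The helper name `same_session_twice` and the `local[1]` master are assumptions made for illustration.

```python
def same_session_twice(app_name="example"):
    """Call getOrCreate() twice and report whether both calls gave the same object."""
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql import SparkSession

    first = SparkSession.builder.appName(app_name).master("local[1]").getOrCreate()
    # No session is stopped in between, so this call should reuse `first`.
    second = SparkSession.builder.getOrCreate()
    try:
        return first is second
    finally:
        first.stop()
```

In a local PySpark installation, `same_session_twice()` should return `True`, because the second call finds the active session instead of building a new one.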

## What is the main purpose of `SparkSession` in PySpark?

- A) To create a connection to a Spark cluster and coordinate the execution of Spark jobs.
- B) To define a Spark application's configuration settings, such as the application name and the location of input data.
- C) To provide a high-level API for working with distributed data in PySpark. ***
- D) To manage the execution of tasks within a Spark job and handle failures and retries.

## Which of the following data structures are available in Apache Spark?

- A) RDDs
- B) DataFrames
- C) Datasets
- D) All of the above ***

## What is an RDD in Apache Spark?

- A) A type of database that stores structured data in a distributed manner.
- B) A distributed collection of objects that can be processed in parallel. ***
- C) A type of machine learning algorithm used in Spark.
- D) A tool used to manage Spark clusters.

## What is a DataFrame in Apache Spark?

- A) A database management system used for storing and querying large-scale data.
- B) A distributed collection of objects that can be processed in parallel.
- C) A tabular view of data with named columns, similar to a relational database table. ***
- D) A machine learning algorithm used for clustering data.

## What is a Dataset in Apache Spark?

- A) A distributed collection of objects that can be processed in parallel.
- B) A collection of data organized into named columns, similar to a relational database table.
- C) A distributed collection of objects that are strongly typed, providing a more efficient and convenient API for working with structured data. ***
- D) A type of machine learning algorithm used in Spark.



## Which of the following code snippets creates an RDD in PySpark?

- A) `rdd = sc.parallelize(["apple", "banana", "orange"])` ***
- B) `rdd = spark.read.text("data.txt")`
- C) 
`rdd = Seq(("apple", 2), ("banana", 4), ("orange", 1))`

`spark.createDataFrame(rdd)`
- D) 
`rdd = [1, 2, 3, 4, 5]`

`sc.parallelize(rdd)`
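
As a rough sketch, `parallelize` can be wrapped in a helper like the one below. It assumes an existing `SparkContext` is passed in as `sc`; the names `make_fruit_rdd` and `describe` are hypothetical.

```python
FRUITS = ["apple", "banana", "orange"]


def make_fruit_rdd(sc, num_slices=2):
    """Distribute a local Python list as an RDD split into `num_slices` partitions."""
    return sc.parallelize(FRUITS, num_slices)


def describe(rdd):
    """Return the partition count and the elements gathered back on the driver."""
    return rdd.getNumPartitions(), rdd.collect()
```

With a running context, `describe(make_fruit_rdd(sc))` should give `(2, ['apple', 'banana', 'orange'])`. Note that `collect()` pulls every element to the driver, so it is only suitable for small RDDs.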

## What is lazy evaluation in PySpark?

- A) Lazy evaluation is a feature in PySpark that allows us to chain transformations on an RDD or DataFrame without actually executing them until an action is called. ***

- B) Lazy evaluation is a feature in PySpark that allows us to execute transformations on an RDD or DataFrame without chaining them.

- C) Lazy evaluation is a feature in PySpark that allows us to execute actions on an RDD or DataFrame without transforming them.

- D) Lazy evaluation is a feature in PySpark that automatically optimizes transformations and actions for faster execution.
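
One way to see lazy evaluation in practice is to separate building a pipeline from running it. This is a minimal sketch that assumes an existing `SparkContext` named `sc`; the function names are made up for the example.

```python
def double(x):
    return x * 2


def is_multiple_of_four(x):
    return x % 4 == 0


def build_pipeline(sc):
    """Chain two transformations. Nothing has been computed when this returns."""
    numbers = sc.parallelize(range(1, 11))
    return numbers.map(double).filter(is_multiple_of_four)


def run_pipeline(sc):
    """collect() is an action, so this is the point where Spark runs the job."""
    return build_pipeline(sc).collect()
```

`build_pipeline(sc)` returns immediately with an RDD that only describes the work. `run_pipeline(sc)` triggers it and should return `[4, 8, 12, 16, 20]`.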

## What does the `persist()` method in PySpark do?

- A) The `persist()` method marks a PySpark RDD or DataFrame to be kept in memory or on disk once it has been computed, so that later actions can reuse it instead of recomputing it. ***

- B) The `persist()` method allows you to delete a PySpark RDD or DataFrame from memory or disk.

- C) The `persist()` method allows you to rename a PySpark RDD or DataFrame.

- D) The `persist()` method allows you to filter a PySpark RDD or DataFrame.
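
A small sketch of `persist()` in use, assuming an existing `SparkContext` named `sc`. The choice of `StorageLevel.MEMORY_AND_DISK` and the helper name `count_and_sum_squares` are illustrative assumptions.

```python
def square(x):
    return x * x


def count_and_sum_squares(sc, n=1000):
    """Run two actions over the same persisted RDD."""
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark import StorageLevel

    squares = sc.parallelize(range(n)).map(square)
    squares.persist(StorageLevel.MEMORY_AND_DISK)
    try:
        total = squares.count()  # first action computes and stores the partitions
        summed = squares.sum()   # second action can reuse the stored partitions
    finally:
        squares.unpersist()      # release the storage once it is no longer needed
    return total, summed
```

For example, `count_and_sum_squares(sc, n=5)` should return `(5, 30)`. Without `persist()`, the `map` step would be recomputed for each of the two actions.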

## Which of the following code snippets can be used to create a DataFrame using Spark SQL in PySpark?

- A)

`from pyspark.sql import SparkSession`

`spark = SparkSession.builder.appName("example").getOrCreate()`
`data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]`
`df = spark.createDataFrame(data, ["Name", "Age"])`
`df.show()` ***

- B)

`from pyspark.sql import SparkSession`

`spark = SparkSession.builder.appName("example").getOrCreate()`
`data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]`
`rdd = spark.sparkContext.parallelize(data)`
`df = spark.createDataFrame(rdd, ["Name", "Age"])`
`df.show()`

- C)

`from pyspark.sql import SparkSession`

`spark = SparkSession.builder.appName("example").getOrCreate()`
`data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]`
`df = spark.sql("SELECT _1 as Name, _2 as Age FROM VALUES %s" % str(data))`
`df.show()`

- D)

`from pyspark.sql import SparkSession`

`spark = SparkSession.builder.appName("example").getOrCreate()`
`data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]`
`rdd = spark.sparkContext.parallelize(data)`
`df = rdd.toDF(["Name", "Age"])`
`df.show()`






## Which of the following statements is true about Spark SQL?

- A) Spark SQL allows you to execute SQL queries and perform analysis on structured data within Spark. ***

- B) Spark SQL is only compatible with relational databases like MySQL and PostgreSQL.

- C) Spark SQL is a separate engine from Spark and requires separate installation.

- D) Spark SQL is used only for processing unstructured data.
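
To illustrate running SQL inside Spark, the sketch below registers a DataFrame as a temporary view and queries it. It assumes an existing `SparkSession` named `spark`; the view name `people` and the helper name are hypothetical, and the sample rows are the ones used in the previous question.

```python
PEOPLE = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
QUERY = "SELECT Name FROM people WHERE Age >= 30 ORDER BY Name"


def names_aged_30_plus(spark):
    """Register the sample rows as a temp view and query them with SQL."""
    df = spark.createDataFrame(PEOPLE, ["Name", "Age"])
    df.createOrReplaceTempView("people")  # makes the DataFrame visible to spark.sql()
    return [row["Name"] for row in spark.sql(QUERY).collect()]
```

With a running session, `names_aged_30_plus(spark)` should return `['Bob', 'Charlie']`. No external database is involved: the query runs on Spark's own engine.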

## What does the `map` function in PySpark do?

- A) It applies a function to each element of an RDD and returns a new RDD. ***

- B) It returns the first element of an RDD.

- C) It removes all the duplicates from an RDD.

- D) It sorts the elements of an RDD in descending order.
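
A short sketch of `map`, assuming an existing `SparkContext` named `sc`; the helper names are made up for the example.

```python
def word_length(word):
    """Pair a word with its length."""
    return (word, len(word))


def fruit_lengths(sc):
    """Apply word_length to every element and collect the new RDD's contents."""
    fruits = sc.parallelize(["apple", "banana", "orange"])
    return fruits.map(word_length).collect()
```

`fruit_lengths(sc)` should return `[('apple', 5), ('banana', 6), ('orange', 6)]`: one output element per input element, with the original RDD left unchanged.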

## What does the `filter` function in PySpark do?

- A) It removes all the duplicates from an RDD.

- B) It returns the first element of an RDD.

- C) It applies a function to each element of an RDD and returns a new RDD.

- D) It selects the elements of an RDD that satisfy a given condition and returns a new RDD. ***
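
A short sketch of `filter`, assuming an existing `SparkContext` named `sc`; the helper names are made up for the example.

```python
def is_even(n):
    return n % 2 == 0


def even_numbers(sc):
    """Keep only the elements for which the predicate returns True."""
    return sc.parallelize(range(10)).filter(is_even).collect()
```

`even_numbers(sc)` should return `[0, 2, 4, 6, 8]`.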

## What does the `sortBy` function in PySpark do?

- A) It sorts an RDD based on a key function, in ascending order by default. ***

- B) It returns the first element of an RDD.

- C) It applies a function to each element of an RDD and returns a new RDD.

- D) It selects the elements of an RDD that satisfy a given condition and returns a new RDD.
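
A short sketch of `sortBy`, assuming an existing `SparkContext` named `sc`. The sample pairs and helper names are illustrative.

```python
PAIRS = [("apple", 2), ("banana", 4), ("orange", 1)]


def by_count(pair):
    """Key function: sort on the second item of each pair."""
    return pair[1]


def sorted_pairs(sc, ascending=True):
    """Sort the pairs by count; pass ascending=False to reverse the order."""
    return sc.parallelize(PAIRS).sortBy(by_count, ascending=ascending).collect()
```

`sorted_pairs(sc)` should return `[('orange', 1), ('apple', 2), ('banana', 4)]`, and `sorted_pairs(sc, ascending=False)` should return the same pairs in reverse.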