
# 1. What is PySpark?

PySpark is the Python API for Apache Spark. It enables Python developers to harness the power of Spark's distributed computing capabilities for big data processing and analytics while using Python, one of the most popular programming languages. PySpark provides easy access to Spark's core functionalities, including data processing, machine learning, and real-time analytics.


# 2. What are the industrial benefits of PySpark?

Almost every industry now uses big data to assess where it stands and how to grow, and Apache Spark is one of the most widely used engines for processing that data. The following industries benefit from Spark, and therefore from PySpark:

# Media streaming:
Spark can process streams in real time to provide personalized recommendations to subscribers. Netflix is one example: it reportedly processes around 450 billion events per day that flow to its server-side apps.

# Finance:
Banks use Spark to access and analyze social media profiles and gain insights that support decisions on customer segmentation, credit risk assessment, early fraud detection, and so on.

# Healthcare:
Providers use Spark to analyze patients' past records and identify health issues the patients might face after discharge. Spark is also used in genome sequencing to reduce the time required to process genome data.

# Travel Industry:
Companies like TripAdvisor use Spark to help users plan trips and to provide personalized recommendations to travelers by comparing data and reviews from hundreds of websites about places, hotels, etc.

# Retail and e-commerce:
Retail and e-commerce rely on big data analysis for targeted advertising. Companies like Alibaba run Spark jobs that analyze petabytes of data to enhance customer experience, provide targeted offers and sales, and optimize overall performance.

# 3. How can you handle missing values in a PySpark DataFrame? Provide an example to demonstrate your approach.

In PySpark, missing values can be handled using the following methods provided by the DataFrame API:

fillna(): Replace missing values with a specified value.

dropna(): Remove rows with missing values.

replace(): Replace specific values with others.

Custom Imputation: Use a calculated value (like mean or median) to fill missing data.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import mean

# Initialize Spark session
spark = SparkSession.builder.appName("MissingValuesExample").getOrCreate()

# Create a sample DataFrame
data = [
    (1, "Alice", 50),
    (2, "Bob", None),
    (3, None, 70),
    (4, "David", 80),
    (5, None, None)
]
columns = ["ID", "Name", "Score"]

df = spark.createDataFrame(data, columns)

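# Show original DataFrame
print("Original DataFrame:")
df.show()

# fillna(): replace missing values with a fixed default per column
df.fillna({"Name": "Unknown", "Score": 0}).show()

# dropna(): remove rows that contain any missing value
df.dropna().show()

# replace(): swap one specific value for another
df.replace("Bob", "Robert", subset=["Name"]).show()

# Custom imputation: fill missing scores with the column mean
mean_score = df.select(mean("Score")).first()[0]
df.fillna({"Score": mean_score}).show()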


# 4. What is the difference between RDD, DataFrame, and Dataset?

RDD: Low-level, unstructured, and requires manual serialization/deserialization.

DataFrame: High-level, structured, optimized API with support for SQL queries.

Dataset: Combines the best of RDDs and DataFrames with strong typing (available only in Scala and Java).

# 5. How can you perform filtering in a PySpark DataFrame?

In [None]:
# Keep rows where Score is greater than 60 (df is the ID/Name/Score DataFrame from question 3)
df.filter(df["Score"] > 60).show()


# 6. What are transformations and actions in PySpark?

Transformations: Lazy operations that define a computation plan (e.g., map(), filter()).

Actions: Trigger the execution and return results (e.g., collect(), count()).

# 7. How do you join two DataFrames in PySpark?


In [None]:
# df1 and df2 are assumed to be existing DataFrames that share an ID column
df1.join(df2, df1["ID"] == df2["ID"], "inner").show()


# 8. What is SparkSession?

SparkSession is the entry point for PySpark applications. It provides methods to create DataFrames, access Spark functionality, and manage the Spark application lifecycle.

# 9. How do you read and write data in PySpark?



In [None]:
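# READ: load a CSV file, treating the first line as column names
df = spark.read.csv("file_path.csv", header=True)

# WRITE: save the DataFrame as CSV; Spark creates a directory of part files at this path
df.write.csv("output_path", header=True)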



# 10. What are the benefits of using PySpark over Pandas?

Handles large datasets that don't fit into memory.

Optimized for distributed computing.

Supports fault tolerance and parallel processing.


# 11. Explain PySpark's partitioning.

Partitioning divides data into smaller chunks (partitions) to process them in parallel, improving performance.

# 12. How do you handle null values in PySpark?

In [None]:
# Fill null values:

df.fillna({"column_name": "default_value"}).show()


# Drop rows with nulls:

df.dropna().show()


# 13. How can you monitor and debug a PySpark application?


Use the Spark UI for monitoring jobs, stages, and tasks.

Enable event logs to debug performance issues.

Utilize explain() to view execution plans:

In [None]:
df.explain()


# 14. What is PySpark MLlib?


PySpark MLlib is a library for scalable machine learning in Spark, providing tools for classification, regression, clustering, and collaborative filtering.



# 15. What is a PySpark UDF?

UDF stands for User Defined Function. In PySpark, a UDF is created by writing a Python function and wrapping it with PySpark SQL’s udf() function; it can then be used on DataFrames or, once registered, in SQL. UDFs are generally written when PySpark's built-in functions do not provide the required functionality and custom logic must be applied to the data. A UDF can be reused across any number of DataFrames and SQL expressions.

# 16. What do you understand about PySpark DataFrames?

A PySpark DataFrame is a distributed collection of data organized into named columns, equivalent to a table in a relational database. Its operations are planned by Spark's Catalyst optimizer, so it is better optimized than data frames in R or Python (pandas). DataFrames can be created from different sources such as Hive tables, structured data files, existing RDDs, and external databases.

# 17. Is PySpark faster than pandas?

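PySpark runs operations in parallel in a distributed environment, i.e. across multiple cores and multiple machines, which pandas does not do. This makes PySpark faster on large datasets. For small datasets that fit comfortably in memory, pandas is often faster because it avoids Spark's job scheduling and serialization overhead.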