**1.** Working with RDDs:

a) Write a Python program to create an RDD from a local data source.

b) Implement transformations and actions on the RDD to perform data processing tasks.

c) Analyze and manipulate data using RDD operations such as map, filter, reduce, or
aggregate.

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.textFile("path/to/local/data/file.txt")

# Example usage (collect() pulls every record to the driver; prefer take(n) on large files)
rdd.collect()

In [None]:
# Example transformations
filtered_rdd = rdd.filter(lambda line: "error" in line)
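# Skip blank lines first: line.split()[0] raises IndexError on an empty line
mapped_rdd = rdd.filter(lambda line: line.strip()).map(lambda line: (line.split()[0], 1))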

# Example actions
count = rdd.count()
first_element = rdd.first()

# Example chaining transformations and actions
result = rdd.filter(lambda line: "error" in line).count()
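
Transformations such as `filter` and `map` are lazy: Spark only runs them once an action such as `count` asks for a result. The sketch below is a plain-Python analogy that uses a generator, not Spark itself; the sample lines and the `is_error` helper are made up for illustration.

```python
# Plain-Python analogy (no Spark involved): a generator is lazy in the same
# spirit as an RDD transformation, and consuming it plays the role of an action.
calls = []  # records every line the predicate actually inspects


def is_error(line):
    calls.append(line)
    return "error" in line


# Hypothetical sample data standing in for the lines of a text file
sample_lines = ["ok start", "error: disk full", "ok done", "error: timeout"]

# "Transformation": building the generator does not run the predicate
errors = (line for line in sample_lines if is_error(line))
inspected_before = len(calls)

# "Action": consuming the generator runs the predicate on every line
error_count = sum(1 for _ in errors)
inspected_after = len(calls)

print(inspected_before, inspected_after, error_count)
```

Nothing is inspected until the count is requested, which is why chaining several transformations before a single action costs nothing up front.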

In [None]:
# Example map operation
mapped_rdd = rdd.map(lambda line: line.upper())

# Example filter operation
filtered_rdd = rdd.filter(lambda line: "error" in line)

# Example reduce operation
total_length = rdd.map(lambda line: len(line)).reduce(lambda a, b: a + b)

# Example aggregate operation
agg_result = rdd.aggregate(0, lambda a, line: a + len(line), lambda a, b: a + b)
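
The three arguments to `aggregate` are a zero value, a function that folds one element into a per-partition accumulator, and a function that merges two accumulators. The sketch below models that behaviour in plain Python, assuming the data is already split into two partitions. The `aggregate` helper and the sample lines are illustrative only and are not part of the PySpark API.

```python
from functools import reduce


def aggregate(partitions, zero, seq_op, comb_op):
    """Plain-Python model of RDD.aggregate over a list of partitions."""
    # seq_op folds each partition's elements into an accumulator, starting from zero
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # comb_op merges the per-partition accumulators, again starting from zero
    return reduce(comb_op, partials, zero)


# Hypothetical data: two partitions of text lines
partitions = [["error: disk", "ok"], ["warn", "error: net"]]

# Same computation as the aggregate call in the cell above: total characters
total_length = aggregate(
    partitions, 0, lambda acc, line: acc + len(line), lambda a, b: a + b
)

# Unlike reduce, the accumulator may have a different type from the elements:
# here it is a (total_characters, line_count) pair
length_and_count = aggregate(
    partitions,
    (0, 0),
    lambda acc, line: (acc[0] + len(line), acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)

print(total_length, length_and_count)
```

The second call shows the main reason to reach for `aggregate` rather than `reduce`: the result, a pair of integers, is not the same type as the input strings.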

**2.** Spark DataFrame Operations:
    
a) Write a Python program to load a CSV file into a Spark DataFrame.

b) Perform common DataFrame operations such as filtering, grouping, or joining.

c) Apply Spark SQL queries on the DataFrame to extract insights from the data.

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("path/to/csv/file.csv", header=True, inferSchema=True)

# Example usage
df.show()

In [None]:
# Example filtering
filtered_df = df.filter(df["age"] > 30)

# Example grouping
grouped_df = df.groupBy("department").agg({"salary": "mean"})

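# Example joining
# department_df is a second DataFrame that shares the "department" column;
# the path below is a hypothetical placeholder
department_df = spark.read.csv("path/to/departments.csv", header=True, inferSchema=True)
joined_df = df.join(department_df, on="department", how="inner")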

In [None]:
# Register DataFrame as a temporary view
df.createOrReplaceTempView("employees")

# Example SQL query
result = spark.sql("SELECT department, AVG(salary) FROM employees GROUP BY department")
result.show()

**3.** Spark Streaming:
    
a) Write a Python program to create a Spark Streaming application.

b) Configure the application to consume data from a streaming source (e.g., Kafka or a
socket).

c) Implement streaming transformations and actions to process and analyze the incoming
data stream.

In [None]:
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = SparkSession.builder.getOrCreate()
ssc = StreamingContext(spark.sparkContext, batchDuration=1)  # 1-second batches

# Example usage
lines = ssc.socketTextStream("localhost", 9999)
lines.pprint()

The example above consumes data from a socket on localhost, port 9999. You can change the source to suit your requirements.

In [None]:
# Example streaming transformations and actions
filtered_lines = lines.filter(lambda line: "error" in line)
filtered_lines.pprint()

# Example word count
word_counts = (
    lines.flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
word_counts.pprint()

# Start the stream only after every transformation and output operation is defined;
# a started StreamingContext does not accept new ones
ssc.start()
ssc.awaitTermination()

**4.** Spark SQL and Data Source Integration:

a) Write a Python program to connect Spark with a relational database (e.g., MySQL,
PostgreSQL).

b) Perform SQL operations on the data stored in the database using Spark SQL.

c) Explore the integration capabilities of Spark with other data sources, such as Hadoop
Distributed File System (HDFS) or Amazon S3.

In [None]:
from pyspark.sql import SparkSession

# spark.jars.packages is read when the session is first created, so set it
# before any other SparkSession exists (restart the kernel if one already does).
# Both JDBC drivers are listed; the PostgreSQL driver version is an example.
spark = SparkSession.builder \
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.26,org.postgresql:postgresql:42.6.0") \
    .getOrCreate()

# Example usage with MySQL
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://localhost:3306/database") \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", "table") \
    .option("user", "username") \
    .option("password", "password") \
    .load()

# Example usage with PostgreSQL
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/database") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "table") \
    .option("user", "username") \
    .option("password", "password") \
    .load()

In [None]:
# Register DataFrame as a temporary view
# (named "db_table" here because "table" is a SQL keyword)
df.createOrReplaceTempView("db_table")

# Example SQL query
result = spark.sql("SELECT * FROM db_table WHERE age > 30")
result.show()

In [None]:
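# Example usage with HDFS (header=True assumes the first row holds column names)
df = spark.read.csv("hdfs://localhost:9000/path/to/file.csv", header=True, inferSchema=True)

# Example usage with Amazon S3
# (the s3a:// scheme needs the hadoop-aws module on the classpath and AWS credentials configured)
df = spark.read.csv("s3a://bucket-name/path/to/file.csv", header=True, inferSchema=True)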