![image](https://user-images.githubusercontent.com/57321948/196933065-4b16c235-f3b9-4391-9cfe-4affcec87c35.png)

# Submitted by: Mohammad Wasiq

## Email: `gl0427@myamu.ac.in`

# Pre-Placement Training Assignment - `Big Data` 

## Apache Spark

**Q1. Working with RDDs:**

**a) Write a Python program to create an RDD from a local data source.**

In [None]:
from pyspark import SparkContext

# Create a SparkContext
sc = SparkContext("local", "RDD Example")

# Create an RDD from a local data source
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Perform operations on the RDD
result = rdd.map(lambda x: x * 2).collect()

# Print the result
print(result)

# Stop the SparkContext
sc.stop()

**b) Implement transformations and actions on the RDD to perform data processing tasks.**

In [None]:
from pyspark import SparkContext

# Create a SparkContext
sc = SparkContext("local", "RDD Example")

# Create an RDD from a local data source
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Transformations
squared_rdd = rdd.map(lambda x: x ** 2)
filtered_rdd = squared_rdd.filter(lambda x: x > 10)

# Actions
result = filtered_rdd.collect()
count = filtered_rdd.count()
total = filtered_rdd.sum()  # named `total` to avoid shadowing the built-in sum()

# Print the results
print("Filtered RDD: ", result)
print("Count: ", count)
print("Sum: ", total)

# Stop the SparkContext
sc.stop()
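
The split between transformations and actions above matters because Spark evaluates transformations lazily: `map` and `filter` only describe the work, and nothing runs until an action such as `collect()` is called. The following is a minimal plain-Python analogy, not Spark code. It relies on Python's built-in `map` and `filter` being lazy iterators, and the `calls` list and `square` helper are hypothetical names added only to make the evaluation order visible. One difference to keep in mind: a Python iterator can be consumed once, whereas an uncached RDD is recomputed for each action.

```python
calls = []  # records every value the mapping function actually processes


def square(x):
    calls.append(x)
    return x ** 2


data = [1, 2, 3, 4, 5]

squared = map(square, data)                    # lazy, like rdd.map(...)
filtered = filter(lambda x: x > 10, squared)   # lazy, like squared_rdd.filter(...)

calls_before_action = list(calls)  # snapshot taken before anything is consumed

result = list(filtered)  # consuming the iterator plays the role of collect()

print("Calls before the action:", calls_before_action)
print("Result:", result)
print("Calls after the action:", calls)
```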

**c) Analyze and manipulate data using RDD operations such as map, filter, reduce, or aggregate.**

In [None]:
from pyspark import SparkContext

# Create a SparkContext
sc = SparkContext("local", "RDD Operations")

# Create an RDD from a local data source
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Map operation to square each element
squared_rdd = rdd.map(lambda x: x ** 2)

# Filter operation to select even numbers
even_rdd = squared_rdd.filter(lambda x: x % 2 == 0)

# Reduce operation to compute the sum
total = even_rdd.reduce(lambda x, y: x + y)  # named `total` to avoid shadowing the built-in sum()

# Aggregate operation to compute sum and count
# First lambda: folds one value into a partition's (sum, count) accumulator
# Second lambda: merges the accumulators of two partitions
sum_count = even_rdd.aggregate((0, 0),
                              lambda acc, value: (acc[0] + value, acc[1] + 1),
                              lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))

# Print the results
print("Squared RDD: ", squared_rdd.collect())
print("Even RDD: ", even_rdd.collect())
print("Sum using Reduce: ", total)
print("Sum and Count using Aggregate: ", sum_count)

# Stop the SparkContext
sc.stop()
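
The two functions passed to `aggregate` play different roles: the first folds values into an accumulator within a partition, and the second merges the accumulators of different partitions. The sketch below imitates that two-step process in plain Python so the result can be checked by hand. `simulate_aggregate` is a hypothetical helper, and the split of the even squares `[4, 16]` into two partitions is an assumption; Spark decides the real partitioning.

```python
from functools import reduce


def simulate_aggregate(partitions, zero, seq_op, comb_op):
    """Imitate RDD.aggregate on a list of partitions (each a list of values)."""
    # Step 1: fold each partition on its own, starting from the zero value.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # Step 2: merge the per-partition accumulators, again starting from zero.
    return reduce(comb_op, partials, zero)


# Assumed partitioning of the even squares of [1, 2, 3, 4, 5].
partitions = [[4], [16]]

sum_count = simulate_aggregate(
    partitions,
    (0, 0),
    lambda acc, value: (acc[0] + value, acc[1] + 1),
    lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]),
)

print("Sum and count:", sum_count)  # (20, 2)
```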

**Q2. Spark DataFrame Operations:**

**a) Write a Python program to load a CSV file into a Spark DataFrame.**

In [None]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("CSV to DataFrame") \
    .getOrCreate()

# Load CSV file into DataFrame
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

# Print schema and show data
df.printSchema()
df.show()

# Stop the SparkSession (skip this if you go on to run the cells in parts b and c, which reuse `spark` and `df`)
spark.stop()

**b) Perform common DataFrame operations such as filtering, grouping, or joining.**



In [None]:
# Filtering

# Filter rows where age is greater than 30
filtered_df = df.filter(df.age > 30)

# Filter rows where country is 'USA'
filtered_df = df.filter(df.country == 'USA')

In [None]:
# Grouping And Aggregation

# Group by country and calculate average age
grouped_df = df.groupBy('country').avg('age')

# Group by country and count the number of records
grouped_df = df.groupBy('country').count()
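
To see what the two grouped results contain, the same grouping can be worked through in plain Python on a handful of rows. This is only an illustration of the semantics, not how Spark executes it; the rows below are hypothetical stand-ins for the `country` and `age` columns.

```python
from collections import defaultdict

# Hypothetical (country, age) rows.
rows = [("USA", 34), ("USA", 28), ("India", 41), ("India", 25), ("India", 30)]

# Collect the ages that belong to each country.
ages_by_country = defaultdict(list)
for country, age in rows:
    ages_by_country[country].append(age)

# Mirrors groupBy('country').avg('age')
avg_age = {c: sum(ages) / len(ages) for c, ages in ages_by_country.items()}

# Mirrors groupBy('country').count()
counts = {c: len(ages) for c, ages in ages_by_country.items()}

print("Average age:", avg_age)
print("Record count:", counts)
```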

In [None]:
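# Joining
# df1 and df2 are assumed to be two DataFrames that share an 'id' column

# Join on the common column name 'id' (the result has a single 'id' column)
joined_df = df1.join(df2, on='id', how='inner')

# Join on an explicit column expression (the result keeps both 'id' columns)
joined_df = df1.join(df2, df1.id == df2.id, 'inner')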
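
An inner join keeps only the keys that appear on both sides. The sketch below shows that behaviour with a small hash join in plain Python; the rows and field names are hypothetical, and merging the two dictionaries yields one `id` field per row, which corresponds to joining by column name.

```python
# Hypothetical rows standing in for two DataFrames that share an "id" column.
left = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Bo"}, {"id": 3, "name": "Cy"}]
right = [{"id": 2, "city": "Pune"}, {"id": 3, "city": "Oslo"}, {"id": 4, "city": "Lima"}]

# Index the right side by key so each left row can find its matches quickly.
right_by_id = {}
for row in right:
    right_by_id.setdefault(row["id"], []).append(row)

# Keep a left row only if the right side has at least one row with the same id.
joined = [
    {**left_row, **right_row}
    for left_row in left
    for right_row in right_by_id.get(left_row["id"], [])
]

print(joined)
```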

**c) Apply Spark SQL queries on the DataFrame to extract insights from the data.**

In [None]:
# Register the DataFrame as a temporary view
df.createOrReplaceTempView("people")

# Perform SQL queries on the DataFrame
result = spark.sql("SELECT name, age FROM people WHERE age > 30")

# Show the result
result.show()
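
Before running a query on a large DataFrame, it can help to check the SQL text against a few rows. One way, sketched here under the assumption that the query uses only syntax common to Spark SQL and SQLite, is Python's built-in `sqlite3` module. The table contents are hypothetical, and the `ORDER BY` clause is added only to make the output order deterministic.

```python
import sqlite3

# In-memory database holding a few hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ana", 34), ("Bo", 28), ("Cy", 41)],
)

# Same filter as the Spark SQL query, plus ORDER BY for a stable result.
rows = conn.execute(
    "SELECT name, age FROM people WHERE age > 30 ORDER BY name"
).fetchall()
conn.close()

print(rows)
```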

**Q3. Spark Streaming:**

**a) Write a Python program to create a Spark Streaming application.**


In [None]:
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two execution threads and a batch interval of 1 second
spark = SparkSession.builder.master("local[2]").appName("StreamingExample").getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 1)

# Create a DStream that reads data from a text file stream
lines = ssc.textFileStream("path/to/input/directory")

# Perform transformations on the DStream
words = lines.flatMap(lambda line: line.split(" "))
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# Print the word counts
wordCounts.pprint()

# Start the streaming context
ssc.start()
ssc.awaitTermination()


**b) Configure the application to consume data from a streaming source (e.g., Kafka or a socket).**


In [None]:
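# Consuming data from Kafka
# Note: pyspark.streaming.kafka is available only in Spark 2.x (with the
# spark-streaming-kafka-0-8 package on the classpath); it was removed in Spark 3.0.

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Create a local StreamingContext with two execution threads and a batch interval of 1 second
spark = SparkSession.builder.master("local[2]").appName("KafkaStreamingExample").getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 1)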

# Define the Kafka consumer parameters
kafkaParams = {"bootstrap.servers": "localhost:9092", "group.id": "group-id"}

# Create a DStream that consumes data from Kafka
kafkaStream = KafkaUtils.createDirectStream(ssc, ["topic"], kafkaParams)

# Extract the values from the Kafka messages
lines = kafkaStream.map(lambda x: x[1])

# Perform transformations on the DStream
# ...

# Start the streaming context
ssc.start()
ssc.awaitTermination()

In [None]:
# Consuming data from a socket

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two execution threads and a batch interval of 1 second
spark = SparkSession.builder.master("local[2]").appName("SocketStreamingExample").getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 1)

# Create a DStream by connecting to a socket
lines = ssc.socketTextStream("localhost", 9999)

# Perform transformations on the DStream
# ...

# Start the streaming context
ssc.start()
ssc.awaitTermination()

**c) Implement streaming transformations and actions to process and analyze the incoming data stream.**

In [None]:
# Word count

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two execution threads and a batch interval of 1 second
spark = SparkSession.builder.master("local[2]").appName("StreamingWordCount").getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 1)

# Create a DStream by connecting to a socket
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)

# Print the word counts
wordCounts.pprint()

# Start the streaming context
ssc.start()
ssc.awaitTermination()
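
Note that `reduceByKey` on a DStream counts words within each batch separately, so the printed counts start from zero every batch interval. The plain-Python sketch below imitates that per-batch behaviour with `collections.Counter`; the `batches` list is a hypothetical stand-in for the lines arriving in two consecutive one-second batches.

```python
from collections import Counter

# Hypothetical input: each inner list holds the lines received in one batch.
batches = [
    ["spark streaming spark"],
    ["hello spark", "hello world"],
]

per_batch_counts = []
for batch in batches:
    # flatMap step: split every line of the batch into words.
    words = [word for line in batch for word in line.split(" ")]
    # map + reduceByKey step: count the words of this batch only.
    per_batch_counts.append(dict(Counter(words)))

for index, counts in enumerate(per_batch_counts):
    print(f"Batch {index}: {counts}")
```

Running totals across batches need a stateful operation such as `updateStateByKey`, which in turn requires a checkpoint directory.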

In [None]:
# Filtering

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two execution threads and a batch interval of 1 second
spark = SparkSession.builder.master("local[2]").appName("StreamingFilter").getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 1)

# Create a DStream by connecting to a socket
lines = ssc.socketTextStream("localhost", 9999)

# Filter lines based on a condition
filteredLines = lines.filter(lambda line: "error" in line)

# Print the filtered lines
filteredLines.pprint()

# Start the streaming context
ssc.start()
ssc.awaitTermination()

In [None]:
# Windowed operations

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two execution threads and a batch interval of 1 second
spark = SparkSession.builder.master("local[2]").appName("StreamingWindow").getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 1)

# Create a DStream by connecting to a socket
lines = ssc.socketTextStream("localhost", 9999)

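# Windowed word count over the last 10 seconds, recomputed every 5 seconds
# countByValueAndWindow updates counts incrementally, which requires a checkpoint directory (placeholder path)
ssc.checkpoint("path/to/checkpoint/directory")
wordCounts = lines.flatMap(lambda line: line.split(" ")).countByValueAndWindow(windowDuration=10, slideDuration=5)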

# Print the word counts for each window
wordCounts.pprint()

# Start the streaming context
ssc.start()
ssc.awaitTermination()
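
With a 1-second batch interval, a window duration of 10 and a slide duration of 5 mean that every 5 batches the counts are computed over the most recent 10 batches, so consecutive windows overlap. The following plain-Python sketch reproduces that arithmetic on hypothetical data. `windowed_counts` is an illustrative helper, not a Spark API, and it ignores real timing, treating each list element as one batch.

```python
from collections import Counter


def windowed_counts(batches, window, slide):
    """Count values over the last `window` batches, once every `slide` batches."""
    results = []
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window)  # early windows hold fewer batches
        counts = Counter()
        for batch in batches[start:end]:
            counts.update(batch)
        results.append((end, dict(counts)))
    return results


# 15 hypothetical one-second batches: "a" arrives every second,
# "b" arrives only during the first 5 seconds.
batches = [["a", "b"] if i < 5 else ["a"] for i in range(15)]

results = windowed_counts(batches, window=10, slide=5)
for end, counts in results:
    print(f"After {end} batches: {counts}")
```

The third window no longer contains `"b"`, because the batches that carried it have slid out of the 10-batch range.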

**Q4. Spark SQL and Data Source Integration:**

**a) Write a Python program to connect Spark with a relational database (e.g., MySQL, PostgreSQL).**

In [None]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("SparkSQLDatabaseIntegration") \
    .config("spark.driver.extraClassPath", "/path/to/mysql-connector-java.jar") \
    .getOrCreate()

# Configure the MySQL JDBC connection properties
jdbc_url = "jdbc:mysql://localhost:3306/mydatabase"
connection_properties = {
    "user": "username",
    "password": "password",
    # Class name for Connector/J 5.x; Connector/J 8.x and later use "com.mysql.cj.jdbc.Driver"
    "driver": "com.mysql.jdbc.Driver"
}

# Read data from a MySQL table
df = spark.read.jdbc(url=jdbc_url, table="mytable", properties=connection_properties)

# Perform operations on the DataFrame
df.show()

# Write data to a MySQL table
df.write.jdbc(url=jdbc_url, table="newtable", mode="append", properties=connection_properties)

# Close the SparkSession
spark.stop()
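
The connection settings above are hard-coded for brevity. A common alternative is to read them from environment variables so that credentials stay out of the notebook. The sketch below builds the same URL and properties dictionary that way. `jdbc_settings` and the variable names `DB_HOST`, `DB_PORT`, `DB_NAME`, `DB_USER`, and `DB_PASSWORD` are hypothetical choices, not a Spark convention.

```python
import os


def jdbc_settings(env=os.environ):
    """Build a MySQL JDBC URL and properties dict from environment-style settings."""
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "3306")
    database = env.get("DB_NAME", "mydatabase")
    url = f"jdbc:mysql://{host}:{port}/{database}"
    properties = {
        "user": env["DB_USER"],          # raises KeyError if the variable is unset
        "password": env["DB_PASSWORD"],
        "driver": "com.mysql.jdbc.Driver",
    }
    return url, properties


# A plain dict stands in for os.environ so the example runs anywhere.
url, props = jdbc_settings({"DB_USER": "username", "DB_PASSWORD": "password"})
print(url)
print(sorted(props))
```

The returned pair can then be passed as `url=` and `properties=` to `spark.read.jdbc`, as in the cell above.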

**b) Perform SQL operations on the data stored in the database using Spark SQL.**


In [None]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("SparkSQLDatabaseIntegration") \
    .config("spark.driver.extraClassPath", "/path/to/mysql-connector-java.jar") \
    .getOrCreate()

# Configure the MySQL JDBC connection properties
jdbc_url = "jdbc:mysql://localhost:3306/mydatabase"
connection_properties = {
    "user": "username",
    "password": "password",
    # Class name for Connector/J 5.x; Connector/J 8.x and later use "com.mysql.cj.jdbc.Driver"
    "driver": "com.mysql.jdbc.Driver"
}

# Read data from a MySQL table into a DataFrame
df = spark.read.jdbc(url=jdbc_url, table="mytable", properties=connection_properties)

# Register the DataFrame as a temporary view
df.createOrReplaceTempView("mytable")

# Execute SQL queries on the table
result = spark.sql("SELECT * FROM mytable WHERE column1 > 100")

# Show the result
result.show()

# Close the SparkSession
spark.stop()

**c) Explore the integration capabilities of Spark with other data sources, such as Hadoop Distributed File System (HDFS) or Amazon S3.**


In [None]:
# Hadoop Distributed File System (HDFS)

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("SparkHDFSIntegration") \
    .getOrCreate()

# Read data from HDFS into a DataFrame
df = spark.read.csv("hdfs://localhost:9000/path/to/file.csv")

# Perform operations on the DataFrame
# ...

# Write data to HDFS
df.write.csv("hdfs://localhost:9000/output/path")

# Close the SparkSession
spark.stop()


In [None]:
# Amazon S3

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("SparkS3Integration") \
    .getOrCreate()

# Read data from Amazon S3 into a DataFrame
df = spark.read.csv("s3a://bucket-name/path/to/file.csv")

# Perform operations on the DataFrame
# ...

# Write data to Amazon S3
df.write.csv("s3a://bucket-name/output/path")

# Close the SparkSession
spark.stop()