**AIM**: To build a Spark session and application using PySpark, perform operations on a sample dataset, and verify the setup.

**STEPS**:

**Step 1: Install and Setup Dependencies**

1. Install Java:

Apache Spark runs on the JVM, so Java is required. The first command in the cell below installs the headless OpenJDK 8 JDK.

2. Download Apache Spark:

Spark 3.0.0, prebuilt for Hadoop 2.7, is downloaded from the Apache archive.

The downloaded .tgz file is then extracted into the current directory.

3. Install Findspark and PySpark:

Findspark: Locates the Spark installation and makes it importable from Python.

PySpark: The Python bindings for Apache Spark. Version 3.0.0 is installed to match the downloaded Spark version.


In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null  # Suppress output
!wget -q http://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
!tar xf spark-3.0.0-bin-hadoop2.7.tgz
!pip install findspark
!pip install pyspark==3.0.0  # Install PySpark matching Spark version
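
Because the install commands above suppress most of their output, a failed download or extraction can go unnoticed. The following is a minimal sketch of a sanity check, assuming the archive was extracted into the current working directory. The helper name `check_spark_prerequisites` is hypothetical and not part of Spark or Findspark.

```python
import os
import shutil


def check_spark_prerequisites(spark_dir="spark-3.0.0-bin-hadoop2.7"):
    """Report whether Java and the extracted Spark directory appear to be present.

    `spark_dir` is assumed to be the directory created by the tar command,
    relative to the current working directory.
    """
    return {
        # True if a `java` executable is found on the PATH
        "java_on_path": shutil.which("java") is not None,
        # True if the Spark archive was extracted where we expect it
        "spark_dir_exists": os.path.isdir(spark_dir),
        # True if the launcher script that ships with Spark is in place
        "spark_submit_exists": os.path.isfile(
            os.path.join(spark_dir, "bin", "spark-submit")
        ),
    }


print(check_spark_prerequisites())
```

If any value is False, the corresponding install step probably needs to be rerun before continuing.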



**Step 2: Configure Environment Variables**

1. Set Java and Spark Home Paths:

JAVA_HOME and SPARK_HOME are set to the Java and Spark installation directories.

2. Initialize Findspark:

findspark.init() uses SPARK_HOME to add Spark's Python libraries to sys.path, so that pyspark can be imported.


In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop2.7"
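
A mistyped path in either variable usually only shows up later as a confusing startup error. As a rough sketch, the paths can be checked right after they are set. The helper `missing_home_dirs` is a hypothetical name introduced here for illustration.

```python
import os


def missing_home_dirs(names=("JAVA_HOME", "SPARK_HOME"), environ=None):
    """Return a dict of problems found for the given environment variables.

    An empty dict means every variable is set and points to an existing directory.
    """
    environ = os.environ if environ is None else environ
    problems = {}
    for name in names:
        path = environ.get(name)
        if not path:
            problems[name] = "not set"
        elif not os.path.isdir(path):
            problems[name] = f"not a directory: {path}"
    return problems


print(missing_home_dirs())
```

An empty dictionary indicates that both paths exist; anything else names the variable to fix.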

In [None]:
import findspark
findspark.init()
from pyspark.sql import SparkSession

**Step 3: Build a SparkSession**

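1. Create a Spark Session:

A SparkSession is built explicitly with the following parameters:

master("local[*]"): Runs the application locally, using as many worker threads as there are logical CPU cores.

appName("MySparkApp"): Names the Spark application "MySparkApp".

config(...): Sets an additional configuration property. Here, spark.driver.extraClassPath adds the JARs in the jars directory under SPARK_HOME to the driver's classpath.

getOrCreate(): Returns the active SparkSession if one already exists; otherwise creates a new one with the settings above.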
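
To make the meaning of the local master setting concrete: "local" uses one worker thread, "local[K]" uses K threads, and "local[*]" uses one thread per logical core. The sketch below only illustrates that mapping in plain Python. It is not Spark's own parser, the function name `local_thread_count` is hypothetical, and `os.cpu_count()` is assumed to approximate the core count that the JVM would report.

```python
import os
import re


def local_thread_count(master):
    """Illustrative mapping from a local master URL to a worker thread count."""
    if master == "local":
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        # Other forms (for example cluster URLs) are outside this sketch
        raise ValueError(f"not a simple local master URL: {master!r}")
    value = match.group(1)
    if value == "*":
        # Assumption: the Python core count matches what the JVM sees
        return os.cpu_count() or 1
    return int(value)


for url in ("local", "local[2]", "local[*]"):
    print(url, "->", local_thread_count(url))
```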


In [None]:
# Build SparkSession explicitly with configuration
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("MySparkApp") \
    .config("spark.driver.extraClassPath", os.path.join(os.environ["SPARK_HOME"], "jars/*")) \
    .getOrCreate()

**Verify Spark Session Creation**

In [None]:
# Verify SparkSession creation
print(spark)

<pyspark.sql.session.SparkSession object at 0x793fa83425f0>
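
Printing the session object only shows its memory address. A slightly more informative check, sketched below, reads a few attributes that PySpark exposes on the session and its SparkContext. The helper name `describe_session` is hypothetical; it is defined without being called so that the block does not depend on a running session.

```python
def describe_session(session):
    """Summarize a SparkSession using attributes of the session and its context."""
    sc = session.sparkContext
    return {
        "version": session.version,            # Spark version string
        "master": sc.master,                   # master URL the application runs on
        "app_name": sc.appName,                # name given via appName(...)
        "default_parallelism": sc.defaultParallelism,
    }
```

Calling `describe_session(spark)` in the notebook should report version 3.0.0, the master "local[*]", and the application name "MySparkApp" if the steps above succeeded.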


**Step 4: Define and Load a Dataset**

1. Define a Sample Dataset:

A Python list of dictionaries is created to represent the dataset.

In [None]:
# Define a sample dataset
data = [
    {"Name": "Alice", "Age": 25, "City": "New York"},
    {"Name": "Bob", "Age": 30, "City": "San Francisco"},
    {"Name": "Cathy", "Age": 27, "City": "Seattle"},
    {"Name": "David", "Age": 35, "City": "Austin"}
]

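2. Define a Schema:

A StructType schema is created to define the column names, data types, and nullability of the DataFrame. The third argument of each StructField (True) marks the column as nullable.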

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define a schema for the dataset
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True)
])

3. Create a DataFrame:

The dataset is converted into a Spark DataFrame using the defined schema.

In [None]:
# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)
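
With an explicit schema, values that do not match the declared types are reported as errors when the DataFrame is created. For larger inputs it can help to find offending rows beforehand. The following is a simplified plain-Python sketch of such a pre-check. The names `expected_types` and `find_schema_violations` are hypothetical, and the check is only an approximation of what Spark verifies.

```python
# Sample rows, repeated here so the block runs on its own
data = [
    {"Name": "Alice", "Age": 25, "City": "New York"},
    {"Name": "Bob", "Age": 30, "City": "San Francisco"},
    {"Name": "Cathy", "Age": 27, "City": "Seattle"},
    {"Name": "David", "Age": 35, "City": "Austin"},
]

# Python types assumed to correspond to StringType and IntegerType
expected_types = {"Name": str, "Age": int, "City": str}


def find_schema_violations(rows, expected_types):
    """Return (row index, field, actual type name) for every mismatched value."""
    violations = []
    for index, row in enumerate(rows):
        for field, expected in expected_types.items():
            value = row.get(field)
            if value is None:
                continue  # all fields in the schema are nullable
            wrong_type = not isinstance(value, expected)
            # bool is a subclass of int in Python; treat it as a mismatch here
            bool_as_int = expected is int and isinstance(value, bool)
            if wrong_type or bool_as_int:
                violations.append((index, field, type(value).__name__))
    return violations


print(find_schema_violations(data, expected_types))  # prints []
```

An empty list means every value has the expected Python type or is missing.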

4. Show the DataFrame:

The contents of the DataFrame are displayed.

In [None]:
# Show the DataFrame
print("Dataset loaded into Spark DataFrame:")
df.show()

Dataset loaded into Spark DataFrame:
+-----+---+-------------+
| Name|Age|         City|
+-----+---+-------------+
|Alice| 25|     New York|
|  Bob| 30|San Francisco|
|Cathy| 27|      Seattle|
|David| 35|       Austin|
+-----+---+-------------+



**Step 5: Perform Data Transformations**

1. Filter Operation:

Rows with an Age strictly greater than 30 are selected, so Bob (Age 30) is excluded.

In [None]:
# Perform a simple transformation
print("Filter rows where Age > 30:")
df_filtered = df.filter(df["Age"] > 30)
df_filtered.show()

Filter rows where Age > 30:
+-----+---+------+
| Name|Age|  City|
+-----+---+------+
|David| 35|Austin|
+-----+---+------+
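
The filter result can be cross-checked against a plain-Python computation on the same rows. This is a minimal sketch, not a Spark feature; the helper `rows_older_than` is a hypothetical name. Rows with a missing Age are skipped, which mirrors how a Spark filter drops rows whose condition evaluates to null.

```python
# Sample rows, repeated here so the block runs on its own
data = [
    {"Name": "Alice", "Age": 25, "City": "New York"},
    {"Name": "Bob", "Age": 30, "City": "San Francisco"},
    {"Name": "Cathy", "Age": 27, "City": "Seattle"},
    {"Name": "David", "Age": 35, "City": "Austin"},
]


def rows_older_than(rows, min_age):
    """Plain-Python counterpart of filtering on Age > min_age."""
    return [
        row for row in rows
        if row["Age"] is not None and row["Age"] > min_age
    ]


expected = rows_older_than(data, 30)
print(expected)  # prints [{'Name': 'David', 'Age': 35, 'City': 'Austin'}]
```

In the notebook, the same rows should come back from `[row.asDict() for row in df_filtered.collect()]`.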



2. Group-By and Count Operation:

Rows are grouped by City, and the number of rows in each group is counted.

In [None]:
# Perform a group-by and count operation
print("Group by City and count:")
df_grouped = df.groupBy("City").count()
df_grouped.show()

Group by City and count:
+-------------+-----+
|         City|count|
+-------------+-----+
|San Francisco|    1|
|       Austin|    1|
|      Seattle|    1|
|     New York|    1|
+-------------+-----+
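
The row order of a grouped result is not guaranteed, so the cities may appear in a different order on another run. A simple way to check the counts independently of order is to compute them in plain Python, as sketched below with `collections.Counter`. The variable name `city_counts` is introduced only for this illustration.

```python
from collections import Counter

# Sample rows, repeated here so the block runs on its own
data = [
    {"Name": "Alice", "Age": 25, "City": "New York"},
    {"Name": "Bob", "Age": 30, "City": "San Francisco"},
    {"Name": "Cathy", "Age": 27, "City": "Seattle"},
    {"Name": "David", "Age": 35, "City": "Austin"},
]

# Count how many rows fall into each city
city_counts = Counter(row["City"] for row in data)

# Sort by city name so the printed order is stable
for city, count in sorted(city_counts.items()):
    print(city, count)
```

For a stable order in Spark itself, the grouped DataFrame can be sorted before display, for example with `df_grouped.orderBy("City").show()`.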



**RESULT**:
The code successfully demonstrates:

1. Installing and configuring Spark and Java.
2. Building a SparkSession with explicit configurations.
3. Creating a DataFrame from a Python dataset.
4. Performing transformations like filtering and grouping.