# Welcome to PySpark Playground! 🎉

This notebook will help you get started with PySpark in a cluster environment.

## What is PySpark?

PySpark is the Python API for Apache Spark, a unified analytics engine for large-scale data processing. It allows you to leverage Spark's distributed computing capabilities using Python.

## Prerequisites

Make sure your Spark cluster is running, and that its master is reachable at `spark://spark-master:7077` (the URL used in Step 1), before executing the cells below.

## Step 1: Initialize Spark Session

First, let's create a Spark session connected to our cluster.

In [None]:
from pyspark.sql import SparkSession

# Create Spark Session
spark = SparkSession.builder \
    .appName('PySpark Playground - Welcome') \
    .master('spark://spark-master:7077') \
    .config('spark.executor.memory', '1g') \
    .config('spark.executor.cores', '1') \
    .getOrCreate()

print(f'✅ Spark Version: {spark.version}')
print(f'✅ Spark Master: {spark.sparkContext.master}')
print(f'✅ App Name: {spark.sparkContext.appName}')
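
The master URL above is hard-coded. As a minimal sketch, assuming you want the same notebook to run against a different master, the URL could be read from an environment variable with the cluster address as the fallback. The variable name `SPARK_MASTER_URL` and both helper functions are hypothetical; they are not part of PySpark or of this playground.

```python
import os

# Fallback: the master URL used in the cell above.
DEFAULT_MASTER = 'spark://spark-master:7077'


def resolve_master(env=None):
    """Return the master URL from SPARK_MASTER_URL (hypothetical name), else the default."""
    env = os.environ if env is None else env
    return env.get('SPARK_MASTER_URL') or DEFAULT_MASTER


def build_session(app_name, master=None):
    """Create (or reuse) a SparkSession for the resolved master."""
    # Imported here so resolve_master() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .master(master or resolve_master())
        .getOrCreate()
    )


print(resolve_master({}))                                # spark://spark-master:7077
print(resolve_master({'SPARK_MASTER_URL': 'local[2]'}))  # local[2]
```

Setting `SPARK_MASTER_URL=local[*]` before starting the notebook would then run everything in local mode, which can be handy when the cluster is down. Note that `getOrCreate()` returns an already-running session if one exists, so the master only takes effect for a fresh session.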

## Step 2: Check Cluster Resources

Let's verify that we're connected to the cluster and see what resources are available.

In [None]:
# Get cluster information
sc = spark.sparkContext

print(f'Default Parallelism: {sc.defaultParallelism}')
print(f'Application ID: {sc.applicationId}')

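# Port 8080 is the default for the standalone master's web UI (http://localhost:8080 here).
# The UI for this application (default port 4040) is reported by the context:
print(f'Application UI: {sc.uiWebUrl}')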

## Step 3: Create Your First DataFrame

Let's create a simple DataFrame to test our cluster.

In [None]:
# Sample data
data = [
    ('Alice', 34, 'Data Engineer'),
    ('Bob', 45, 'Data Scientist'),
    ('Catherine', 29, 'ML Engineer'),
    ('David', 38, 'Analytics Engineer'),
    ('Eve', 32, 'Data Analyst')
]

columns = ['Name', 'Age', 'Job Title']

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Display the DataFrame
df.show()

# Show schema
df.printSchema()
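
When only column names are given, Spark infers the types from the data, and Python integers are inferred as `long`. A minimal sketch, assuming you would rather pin the types yourself: `createDataFrame` also accepts a DDL-formatted schema string. The `FIELDS` list and the `to_ddl` and `create_people_df` helpers are hypothetical names used for illustration; the backticks quote column names so that one containing a space is accepted.

```python
# (column name, Spark SQL type) pairs for the sample data above.
FIELDS = [('Name', 'string'), ('Age', 'int'), ('Job Title', 'string')]


def to_ddl(fields):
    """Build a DDL schema string, back-quoting names so spaces are allowed."""
    return ', '.join(f'`{name}` {dtype}' for name, dtype in fields)


def create_people_df(spark, rows):
    """Create the DataFrame with explicit types instead of inferred ones."""
    return spark.createDataFrame(rows, schema=to_ddl(FIELDS))


print(to_ddl(FIELDS))
```

In the notebook, `create_people_df(spark, data).printSchema()` should then report `Age` as `integer` rather than `long`.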

## Step 4: Perform Basic Operations

Let's try some basic transformations and actions.

In [None]:
# Filter: Get people older than 30
print('People older than 30:')
df.filter(df.Age > 30).show()

# Select specific columns
print('\nNames and Job Titles:')
df.select('Name', 'Job Title').show()

# Count
print(f'\nTotal records: {df.count()}')
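
To know what the cell above should print, the same three operations can be sketched on the plain Python list, without Spark. This is only a local cross-check, not how Spark executes the query; the variable names are illustrative.

```python
# Same sample rows as in Step 3.
data = [
    ('Alice', 34, 'Data Engineer'),
    ('Bob', 45, 'Data Scientist'),
    ('Catherine', 29, 'ML Engineer'),
    ('David', 38, 'Analytics Engineer'),
    ('Eve', 32, 'Data Analyst'),
]

# Filter: rows with Age > 30 (Catherine, 29, is excluded).
older_than_30 = [row for row in data if row[1] > 30]

# Select: keep only Name and Job Title.
names_and_titles = [(name, title) for name, _age, title in data]

print([name for name, _age, _title in older_than_30])  # ['Alice', 'Bob', 'David', 'Eve']
print(len(older_than_30), len(data))                   # 4 5
```

Note that attribute access such as `df.Age` only works for names that are valid Python identifiers; a column like `Job Title` has to be referenced as `df['Job Title']`.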

## Step 5: Work with RDDs

PySpark also exposes RDDs (Resilient Distributed Datasets), the lower-level API that DataFrames are built on.

In [None]:
# Create an RDD
numbers = sc.parallelize(range(1, 101))

# Map, then reduce: square each number, then add the squares together
sum_of_squares = numbers.map(lambda x: x ** 2).reduce(lambda a, b: a + b)

print(f'Sum of squares from 1 to 100: {sum_of_squares}')
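
The result can be checked without a cluster. The sketch below repeats the same map and reduce steps with Python's built-ins and compares them with the closed-form identity `1^2 + 2^2 + ... + n^2 = n(n + 1)(2n + 1) / 6`. It is a local cross-check, not a replacement for the RDD code.

```python
from functools import reduce

n = 100

# Same two steps as the RDD version, run locally on a plain range.
squares = map(lambda x: x ** 2, range(1, n + 1))
local_sum = reduce(lambda a, b: a + b, squares)

# Closed form for 1^2 + 2^2 + ... + n^2.
closed_form = n * (n + 1) * (2 * n + 1) // 6

print(local_sum, closed_form)  # 338350 338350
```

For the distributed version, the function passed to `reduce` needs to be associative and commutative, because Spark combines partial results from the partitions in no guaranteed order. Addition satisfies both.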

## Next Steps

Now that you've successfully initialized Spark and run some basic operations, try exploring:

- **DataFrame Basics**: Learn more about DataFrame operations
- **SQL Queries**: Use Spark SQL for data analysis
- **ML Example**: Build machine learning models with MLlib

Create a new notebook from these templates in the PySpark Playground UI!

## Useful Resources

- [PySpark Documentation](https://spark.apache.org/docs/latest/api/python/)
- [Spark SQL Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)
- [MLlib Guide](https://spark.apache.org/docs/latest/ml-guide.html)

In [None]:
# Don't forget to stop the Spark session when you're done!
# spark.stop()