# PySpark Beginner Tutorial
This notebook introduces PySpark, the Python API for Apache Spark, a distributed computing framework. PySpark lets Python programs run work on Spark, either on a local machine or on a cluster, to process large-scale data.

## 1. Initializing SparkContext and SparkSession
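`SparkSession` is the entry point used in this notebook. Building it also creates the underlying `SparkContext`, which is available as `spark.sparkContext`.

**Explanation of SparkSession Builder Settings**:
- `master`: The master URL to connect to (e.g., `local` for local mode with a single worker thread, `local[*]` for local mode with one worker thread per available core, `spark://host:port` for a standalone Spark cluster).
- `appName`: The name of your application. This will appear on the Spark UI.
- `config`: Sets an additional Spark configuration property as a key-value pair. The properties used below are:
  - `spark.executor.memory`: Amount of memory to use per executor process (e.g., `2g`).
  - `spark.executor.cores`: Number of CPU cores to use per executor.
  - `spark.driver.memory`: Amount of memory to use for the driver (e.g., `1g`). In local mode the driver and the executor run in the same JVM, so this is the setting that mainly determines the available memory.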

In [None]:
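# Optional: uncomment if PySpark is not on the Python path
# (findspark locates a Spark installation and adds it to sys.path)
#import findspark
#findspark.init()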
from pyspark.sql import SparkSession

# Initialize SparkSession with additional configurations
spark = SparkSession.builder \
    .appName('PySpark Beginner Tutorial') \
    .master('local[*]') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.driver.memory', '1g') \
    .getOrCreate()

# Check SparkSession version
spark.version

'3.1.2'

## 2. Creating a DataFrame
We'll create a small DataFrame with three columns (`Name`, `Age`, `Sex`) from a list of Python tuples. Spark infers the column types from the data.

In [None]:
data = [('John', 28, 'M'), ('Anna', 23, 'F'), ('Mike', 35, 'M'), ('Sara', 31, 'F')]
columns = ['Name', 'Age', 'Sex']

# Create a DataFrame from the rows, using the list above as column names
df = spark.createDataFrame(data, columns)

# Show DataFrame
df.show()

+----+---+---+
|Name|Age|Sex|
+----+---+---+
|John| 28|  M|
|Anna| 23|  F|
|Mike| 35|  M|
|Sara| 31|  F|
+----+---+---+


## 3. DataFrame Operations
This section covers common DataFrame operations: inspecting the schema, selecting columns, filtering rows, and computing summary statistics.

In [None]:
# Check the schema of the DataFrame
df.printSchema()

# Select the 'Name' column
df.select('Name').show()

# Filter rows where age > 30
df.filter(df.Age > 30).show()

# Compute basic statistics for the numeric 'Age' column
df.describe('Age').show()

root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- Sex: string (nullable = true)

+----+
|Name|
+----+
|John|
|Anna|
|Mike|
|Sara|
+----+


+----+---+---+
|Name|Age|Sex|
+----+---+---+
|Mike| 35|  M|
|Sara| 31|  F|
+----+---+---+


+-------+-----------------+
|summary|              Age|
+-------+-----------------+
|  count|                4|
|   mean|            29.25|
| stddev|5.057996968497839|
|    min|               23|
|    max|               35|
+-------+-----------------+
