# Spark Basics Tutorial

## 1. Initializing SparkSession
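The entry point to Spark's DataFrame API is the `SparkSession` class. Calling `getOrCreate()` on its builder returns the active session if one exists and creates a new one otherwise.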

In [5]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

spark = SparkSession \
    .builder \
    .appName("Spark Basics") \
    .getOrCreate()

print("Spark Session Created!")
print(f"Spark UI is available at: {spark.sparkContext.uiWebUrl}")

Spark Session Created!
Spark UI is available at: http://e861d3569819:4040


## 2. Creating a DataFrame
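You can create a DataFrame from an in-memory list of rows, as shown here, or by reading files. When only column names are passed as the schema, Spark infers the column types from the data, which is why the Python integers in `salary` appear as `long` in the printed schema.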

In [6]:
data = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
]

columns = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data=data, schema=columns)

df.show()
df.printSchema()

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
|        Maria|   Finance|  3000|
|        James|     Sales|  3000|
|        Scott|   Finance|  3300|
|          Jen|   Finance|  3900|
|         Jeff| Marketing|  3000|
|        Kumar| Marketing|  2000|
|         Saif|     Sales|  4100|
+-------------+----------+------+

root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)



## 3. Basic Transformations
Transformations are lazy: each one returns a new DataFrame that describes the computation, and nothing is executed until an action such as `show()` is called.

In [7]:
# Select specific columns
df.select("employee_name", "salary").show(5)

# Filter data
df.filter(df.salary > 4000).show()

# Add a new column
df.withColumn("bonus", col("salary") * 0.1).show()

+-------------+------+
|employee_name|salary|
+-------------+------+
|        James|  3000|
|      Michael|  4600|
|       Robert|  4100|
|        Maria|  3000|
|        James|  3000|
+-------------+------+
only showing top 5 rows

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
|         Saif|     Sales|  4100|
+-------------+----------+------+

+-------------+----------+------+-----+
|employee_name|department|salary|bonus|
+-------------+----------+------+-----+
|        James|     Sales|  3000|300.0|
|      Michael|     Sales|  4600|460.0|
|       Robert|     Sales|  4100|410.0|
|        Maria|   Finance|  3000|300.0|
|        James|     Sales|  3000|300.0|
|        Scott|   Finance|  3300|330.0|
|          Jen|   Finance|  3900|390.0|
|         Jeff| Marketing|  3000|300.0|
|        Kumar| Marketing|  2000|200.0|
|         Saif|     Sales|  4100|410.0|
+-------------+----------+------+-----+

## 4. Aggregations
`groupBy()` groups rows by one or more columns; `agg()`, or a shortcut such as `count()`, then computes a metric for each group.

In [8]:
print("Salary average by department:")
df.groupBy("department").agg(avg("salary")).show()

print("Employee count by department:")
df.groupBy("department").count().show()

Salary average by department:
+----------+-----------+
|department|avg(salary)|
+----------+-----------+
|     Sales|     3760.0|
|   Finance|     3400.0|
| Marketing|     2500.0|
+----------+-----------+

Employee count by department:
+----------+-----+
|department|count|
+----------+-----+
|     Sales|    5|
|   Finance|    3|
| Marketing|    2|
+----------+-----+

