# Introduction to PySpark

## Course Overview

This notebook provides a comprehensive introduction to PySpark, the Python API for Apache Spark. We'll cover:
- What is Apache Spark?
- Setting up a Spark session
- Working with DataFrames
- Data manipulation and transformations
- Basic analytics and machine learning
- Performance and distributed computing concepts

## 1. Installation and Setup

Before running this notebook, ensure you have PySpark installed:
```
pip install pyspark
```

In [None]:
# Importing SparkSession
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Create a Spark Session
spark = SparkSession.builder \
    .appName("PySpark Introduction") \
    .getOrCreate()

print("Spark Version:", spark.version)
print("PySpark is ready to use!")

## 2. Creating DataFrames

PySpark offers multiple ways to create DataFrames:

In [None]:
# From a list of tuples
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show()

# From a dictionary of columns
data_dict = {
    "Name": ["David", "Eve"],
    "City": ["New York", "San Francisco"],
    "Salary": [75000, 85000]
}
# createDataFrame expects rows, so zip(*...) transposes the column lists into row tuples
rows = list(zip(*data_dict.values()))
df_dict = spark.createDataFrame(rows, list(data_dict.keys()))
df_dict.show()

## 3. Reading Data from Different Sources

In [None]:
# Reading CSV (replace with your file path)
# df_csv = spark.read.csv('path/to/your/file.csv', header=True, inferSchema=True)

# Example with sample data
sales_data = spark.createDataFrame([
    ("Laptop", "Electronics", 1000, 50),
    ("Phone", "Electronics", 500, 100),
    ("Book", "Media", 20, 200),
    ("Tablet", "Electronics", 300, 75)
], ["Product", "Category", "Price", "Stock"])

sales_data.show()

# Print schema
sales_data.printSchema()

## 4. DataFrame Operations

In [None]:
# Filter rows where Category is "Electronics"
electronics = sales_data.filter(sales_data.Category == "Electronics")
print("Electronics Products:")
electronics.show()

# Add a new column
sales_with_total = sales_data.withColumn("Total_Value", F.col("Price") * F.col("Stock"))
print("\nSales with Total Value:")
sales_with_total.show()

## 5. Aggregations and Group By

In [None]:
# Group by and aggregate
category_summary = sales_data.groupBy("Category") \
    .agg(
        F.sum("Stock").alias("Total_Stock"),
        F.avg("Price").alias("Average_Price")
    )

print("Category Summary:")
category_summary.show()

## 6. Window Functions

In [None]:
from pyspark.sql.window import Window

# Window function to rank products within category
window_spec = Window.partitionBy("Category").orderBy(F.col("Price").desc())

ranked_products = sales_data.withColumn(
    "Price_Rank", 
    F.dense_rank().over(window_spec)
)

print("Ranked Products:")
ranked_products.show()

## 7. Basic Machine Learning with PySpark

A simple example of linear regression:

In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

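# Prepare data for ML
# Note: feature2 = feature1 + 1 in this toy data, so the features are perfectly
# collinear and the fitted coefficients are not uniquely determined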
ml_data = spark.createDataFrame([
    (1, 2, 10),
    (2, 3, 15),
    (3, 4, 20),
    (4, 5, 25)
], ["feature1", "feature2", "label"])

# Assemble features
assembler = VectorAssembler(
    inputCols=["feature1", "feature2"],
    outputCol="features"
)

# Prepare data
training_data = assembler.transform(ml_data)

# Create and train linear regression model
lr = LinearRegression(featuresCol="features", labelCol="label")
model = lr.fit(training_data)

# Print model coefficients
print("Coefficients:", model.coefficients)
print("Intercept:", model.intercept)

## 8. Performance and Distributed Computing

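PySpark is designed for large-scale data processing. Three ideas explain most of its behavior:
- Lazy evaluation: transformations such as `filter` only build up a query plan; nothing runs until an action such as `count` or `show` is called
- Distributed computing: a DataFrame is split into partitions that are processed in parallel across executors
- Optimization: the Catalyst optimizer rewrites the query plan before execution (for example, by pushing filters closer to the data source)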

In [None]:
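# Demonstrate lazy evaluation
large_data = spark.range(0, 10000000)
# filter() is a transformation: it returns immediately without processing any rows
filtered_data = large_data.filter(large_data.id % 2 == 0)
# count() is an action: it triggers the actual computation
result = filtered_data.count()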

print(f"Number of even numbers: {result}")

## 9. Closing the Spark Session

In [None]:
# Stop the Spark session when you are done to release its resources
spark.stop()

## 10. Next Steps and Further Learning

To continue learning PySpark:
- Work through the Spark SQL and DataFrame guides in the official documentation
- Practice on larger, real-world datasets
- Explore MLlib pipelines and Structured Streaming

Recommended resources:
- Apache Spark official documentation
- Online big data and distributed computing courses
- Data engineering and big data processing books