<a href="https://colab.research.google.com/github/Denuka1993/DataScienceExercise/blob/main/learning_lab_solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Learning Lab: Solutions & Explanations

**Course:** GIK2Q3 Applied Big Data and Cloud Computing  
**Duration:** ~3 hours  
**Week:** 4

---

## 📚 About This Notebook

This is the **solutions notebook** for the Learning Lab. It contains:

- ✅ **Worked solutions** for all exercises
- 📖 **Detailed explanations** of key concepts
- 💡 **Common pitfalls** and how to avoid them
- 🎯 **Discussion answers** for group activities

Use this as a reference for self-study after the lab session.

---

## Section 1: Environment Setup

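### 📖 Troubleshooting Tips

**Common Issues:**
- **Java not found:** Most common on Windows. Solution: use Google Colab as a backup.
- **PySpark import or startup fails:** An `ImportError` means PySpark is not installed. A failure when the Spark session starts usually means Java isn't configured, so check `JAVA_HOME`.
- **Memory errors:** Lower `spark.driver.memory` if your machine is short on RAM, or use Colab (about 12 GB available).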

**Tip:** If you're having trouble with local setup, Google Colab is a reliable alternative that works out of the box.

### 🐍 Note: BrokenPipeError Messages

You may see `BrokenPipeError: [Errno 32] Broken pipe` messages after some cells. **These are harmless**: your code worked fine! Just ignore the traceback if you see the expected output before the error.

In [None]:
# Environment setup (same as main notebook)
import sys
print(f"Python version: {sys.version}")

try:
    import pyspark
    print(f"✓ PySpark already installed: version {pyspark.__version__}")
except ImportError:
    print("Installing PySpark...")
    %pip install pyspark -q
    import pyspark
    print(f"✓ PySpark installed: version {pyspark.__version__}")

In [None]:
import pandas as pd
import numpy as np
import time
import os
from collections import defaultdict

# Java setup for local installations
java_paths = [
    "/opt/homebrew/opt/openjdk@17",
    "/usr/local/opt/openjdk@17",
    "/usr/lib/jvm/java-17-openjdk-amd64",
]

java_home = os.environ.get("JAVA_HOME")
if not java_home:
    for path in java_paths:
        if os.path.exists(path):
            os.environ["JAVA_HOME"] = path
            java_home = path
            break

if java_home:
    print(f"✓ JAVA_HOME set to: {java_home}")
else:
    print("⚠️ Java not found locally. Colab users can ignore this.")

print(f"✓ Pandas version: {pd.__version__}")
print(f"✓ NumPy version: {np.__version__}")

---

## Section 2: Understanding MapReduce

### 📖 Why MapReduce Matters

MapReduce is the mental model that underlies much of distributed data processing. Even though Spark abstracts it away, understanding Map → Shuffle → Reduce helps you:

1. **Reason about parallelism:** What can run in parallel? What requires coordination?
2. **Understand performance:** Why is shuffle expensive? Why does data skew hurt?
3. **Debug Spark jobs:** The Spark UI shows stages that map directly to this model

**Key Insight:**
- Map operations are "embarrassingly parallel": each document is independent
- Shuffle is the expensive part: data must move across the network
- Reduce can also be parallel if the operation is associative/commutative
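
To make the last point concrete, here is a minimal pure-Python sketch (no Spark involved) of why associativity and commutativity matter for a parallel reduce. The two-partition split and the names `partition_a`, `partition_b`, and `count_words` are illustrative assumptions, not part of the lab code.

```python
from collections import Counter

# Two hypothetical partitions of documents (illustrative data only)
partition_a = ["big data is big", "data is growing"]
partition_b = ["big models need data"]

def count_words(docs):
    """Map + local reduce: count the words within one partition."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts

# Each partition can be counted independently (the parallel part)
partial_a = count_words(partition_a)
partial_b = count_words(partition_b)

# Addition is associative and commutative, so the merge order is irrelevant
merged_ab = partial_a + partial_b
merged_ba = partial_b + partial_a
single_pass = count_words(partition_a + partition_b)

print(merged_ab == merged_ba == single_pass)
print(merged_ab["big"], merged_ab["data"])
```

If the combining step were order-dependent (string concatenation, for example), the two merge orders could disagree and the partial results could not be combined freely.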

In [None]:
# Sample text data for MapReduce examples
text_data = """
Big data is transforming how we understand the world
Data science and machine learning rely on big data
The world of data is growing every day
Machine learning models need data to learn
Big models require big data and big compute
"""

lines = text_data.strip().split('\n')
print(f"We have {len(lines)} 'documents' to process")

In [None]:
# Standard MapReduce implementation (for reference)
def map_function(document):
    words = document.lower().split()
    return [(word, 1) for word in words]

def shuffle(mapped_pairs):
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reduce_function(key, values):
    return (key, sum(values))

# Execute MapReduce
mapped_results = []
for doc in lines:
    mapped_results.extend(map_function(doc))

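shuffled = shuffle(mapped_results)
# Apply the reduce step defined above to every (word, [1, 1, ...]) group
final_counts = dict(reduce_function(word, ones) for word, ones in shuffled.items())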

print("Word counts:")
for word, count in sorted(final_counts.items(), key=lambda x: -x[1])[:8]:
    print(f"  {word}: {count}")

---

## 🏋️ Exercise A: MapReduce Challenge (SOLUTIONS)

### Challenge 1: Filter Words > 3 Characters

**Key Concept:** Filtering during the map phase is more efficient than filtering afterwards: it means less data to shuffle across the network.

In [None]:
# SOLUTION: Challenge 1 - Count only words with more than 3 characters

def map_function_filtered(document):
    """MAP: Emit (word, 1) only for words with more than 3 characters."""
    words = document.lower().split()
    # KEY INSIGHT: Filter during map = less data to shuffle!
    return [(word, 1) for word in words if len(word) > 3]

# Test the solution
mapped_filtered = []
for doc in lines:
    mapped_filtered.extend(map_function_filtered(doc))

shuffled_filtered = shuffle(mapped_filtered)
filtered_counts = {k: sum(v) for k, v in shuffled_filtered.items()}

print("✅ SOLUTION: Words with >3 characters:")
print("-" * 40)
for word, count in sorted(filtered_counts.items(), key=lambda x: -x[1])[:8]:
    print(f"  {word}: {count}")

print(f"\n📊 We filtered out: {len(final_counts) - len(filtered_counts)} distinct short words")
print(f"   Distinct words before filter: {len(final_counts)}")
print(f"   Distinct words after filter:  {len(filtered_counts)}")

### 💡 Common Mistakes - Challenge 1

| Mistake | Why It's Wrong | Correct Approach |
|---------|----------------|------------------|
| `len(word) >= 3` | Includes 3-letter words | Use `len(word) > 3` |
| Filtering after reduce | Works but inefficient | Filter in map phase |
| Forgetting `.lower()` | "Big" and "big" counted separately | Normalize case first |
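
The "filtering after reduce" row can be checked with a small experiment. The sketch below uses plain Python and two of the sample sentences; the helper names `map_all` and `map_filtered` are made up for this illustration. It compares how many pairs each approach would hand to the shuffle.

```python
from collections import Counter

docs = [
    "big data is transforming how we understand the world",
    "the world of data is growing every day",
]

def map_all(doc):
    """Emit a (word, 1) pair for every word."""
    return [(word, 1) for word in doc.lower().split()]

def map_filtered(doc):
    """Emit a (word, 1) pair only for words longer than 3 characters."""
    return [(word, 1) for word in doc.lower().split() if len(word) > 3]

pairs_all = [pair for doc in docs for pair in map_all(doc)]
pairs_filtered = [pair for doc in docs for pair in map_filtered(doc)]

# Option 1: filter after the reduce (every pair is shuffled first)
counts_late = Counter(word for word, _ in pairs_all)
counts_late = {w: c for w, c in counts_late.items() if len(w) > 3}

# Option 2: filter in the map (only surviving pairs are shuffled)
counts_early = dict(Counter(word for word, _ in pairs_filtered))

print(f"Pairs shuffled without map-side filter: {len(pairs_all)}")
print(f"Pairs shuffled with map-side filter:    {len(pairs_filtered)}")
print(f"Same final counts: {counts_late == counts_early}")
```

Both options produce the same counts, but the map-side filter sends fewer pairs through the shuffle, which is where the saving comes from.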

### Challenge 2: Count Capital Words

**Key Concept:** The choice of what to emit in map depends on what you're trying to count. You can't always lowercase: whether case matters depends on your analysis goals.

In [None]:
# SOLUTION: Challenge 2 - Count words starting with capital letters

original_text = """
Big data is transforming how we understand the world
Data science and machine learning rely on big data
The world of data is growing every day
Machine learning models need data to learn
Big models require big data and big compute
"""

def map_capitals(document):
    """MAP: Emit (word, 1) only for words starting with a capital letter."""
    words = document.split()
    # KEY INSIGHT: Check first character, but keep original case for the key
    return [(word, 1) for word in words if word and word[0].isupper()]

# Test the solution
original_lines = original_text.strip().split('\n')
mapped_capitals = []
for doc in original_lines:
    mapped_capitals.extend(map_capitals(doc))

shuffled_capitals = shuffle(mapped_capitals)
capital_counts = {k: sum(v) for k, v in shuffled_capitals.items()}

print("✅ SOLUTION: Capital words:")
print("-" * 40)
for word, count in sorted(capital_counts.items(), key=lambda x: -x[1])[:8]:
    print(f"  {word}: {count}")

### 💡 Discussion Point - Challenge 2

**Question:** "Big" appears twice, but lowercase "big" appears three times. Should they be combined?

**Answer:** It depends on your use case!
- For finding proper nouns → keep them separate
- For general word frequency → combine them with `.lower()`

This is a great example of how data processing decisions depend on business requirements.
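
As a quick check of the two options, the following sketch counts the three sample sentences that contain "Big" or "big", once case-sensitively and once after lowercasing. It uses `collections.Counter` rather than the lab's `map`/`shuffle` helpers, purely to keep the example short.

```python
from collections import Counter

text = """Big data is transforming how we understand the world
Data science and machine learning rely on big data
Big models require big data and big compute"""

words = text.split()

# Keep case: "Big" and "big" are separate keys
case_sensitive = Counter(words)
# Normalize case: both spellings collapse into "big"
case_insensitive = Counter(w.lower() for w in words)

print(case_sensitive["Big"], case_sensitive["big"])
print(case_insensitive["big"])
```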

### Challenge 3: Find the Longest Word

**Key Concept:** This is a great example of thinking about what's parallelizable. Finding a maximum is associative: you can find the max of maxes!

In [None]:
# SOLUTION: Challenge 3 - Find the longest word using MapReduce pattern

def map_longest(document):
    """MAP: Find the longest word in this document."""
    words = document.split()
    if not words:
        return []
    # Emit (dummy_key, longest_word_in_this_doc)
    # We use a single key so all words go to one reducer
    longest = max(words, key=len)
    return [("longest", longest)]

def reduce_longest(words):
    """REDUCE: Compare words from all mappers, keep the longest."""
    return max(words, key=len)

# Execute
mapped_longest = []
for doc in lines:
    result = map_longest(doc)
    if result:
        mapped_longest.append(result[0][1])  # Extract just the word

print("Words from each document (map phase):")
for i, word in enumerate(mapped_longest):
    print(f"  Doc {i}: '{word}' ({len(word)} chars)")

final_longest = reduce_longest(mapped_longest)
print(f"\n✅ SOLUTION: Longest word is '{final_longest}' ({len(final_longest)} characters)")

### 📖 Why This Works in Parallel

Finding a maximum is an **associative operation**:
```
max(max(a, b), max(c, d)) = max(a, b, c, d)
```

This means we can:
1. **Map phase:** Find the longest word in each partition independently
2. **Reduce phase:** Compare the "local maximums" to find the global maximum

**Not all operations parallelize this well!** For example, an exact median cannot be assembled from per-partition medians; it needs a view of the whole dataset.
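
The contrast between max and median can be shown with a few numbers. This is a small sketch with made-up partitions; the values are chosen only to make the mismatch visible.

```python
import statistics

# Hypothetical partitions of values (illustrative numbers)
partitions = [[1, 2, 9], [3, 4, 5], [6, 7, 8, 10, 11]]
all_values = [v for part in partitions for v in part]

# max: combining per-partition results gives the exact global answer
local_maxes = [max(part) for part in partitions]
print(max(local_maxes) == max(all_values))

# median: combining per-partition medians generally does NOT
local_medians = [statistics.median(part) for part in partitions]
print(statistics.median(local_medians), statistics.median(all_values))
```

Here the median of the local medians differs from the true median, while the max of the local maxes is always the global max.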

---

## Section 3: SparkSession

### 📖 Key Points

1. **SparkSession** is the single entry point to Spark. It replaces the older SQLContext and HiveContext and wraps the SparkContext, which is still reachable as `spark.sparkContext`
2. `local[*]` means use all available CPU cores, which is great for development
3. The **Spark UI** is invaluable for debugging and understanding performance

**Note for Colab users:** The Spark UI link (localhost:4040) won't work in Colab. That's expected.

### 🧠 Can Everything Be Parallelized?

A key conceptual question from the learning lab: **If parallelism is so powerful, why not always use Spark?**

**The Answer:**

1. **Not all problems are "embarrassingly parallel"**
   - Word count: ✅ Each document is independent
   - Sorting: ❌ Need to see all elements
   - Iterative algorithms: ❌ Each step depends on the previous

2. **The Shuffle is Expensive**
   - In a cluster, shuffle = sending data across the network
   - Networks are 100-1000x slower than RAM

3. **Amdahl's Law**
   - If 10% of your task is sequential, you can never get more than 10x speedup
   - Even with infinite parallel resources!

4. **Coordination Overhead**
   - Tracking data locations, detecting failures, collecting results
   - For small tasks, overhead exceeds the work itself
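
The Amdahl's Law claim above can be verified numerically. With sequential fraction `s` and `n` workers, the theoretical speedup is `1 / (s + (1 - s) / n)`. The function name `amdahl_speedup` and the worker counts below are illustrative choices.

```python
def amdahl_speedup(sequential_fraction, workers):
    """Theoretical speedup when only the non-sequential part is parallelized."""
    parallel_fraction = 1.0 - sequential_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / workers)

# 10% sequential work, increasing numbers of workers
for workers in (2, 8, 64, 1024):
    print(f"{workers:>5} workers: {amdahl_speedup(0.10, workers):.2f}x")

# The limit as workers grows without bound is 1 / sequential_fraction
print(f"Upper bound: {1 / 0.10:.0f}x")
```

The speedup creeps toward 10x but never reaches it, no matter how many workers are added.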

**Practical Wisdom:**

| Data Size | Best Tool |
|-----------|-----------|
| < 1 GB | Pandas |
| 1-100 GB | Spark (single machine) |
| 100+ GB | Spark (cluster) |
| Real-time | Streaming (Kafka, Flink) |

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName("LearningLab-Solutions") \
    .master("local[*]") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()

print(f"✓ Spark version: {spark.version}")
print(f"✓ Using {spark.sparkContext.defaultParallelism} CPU cores")
print(f"\n🌐 Spark UI: {spark.sparkContext.uiWebUrl}")
print("\n🔥 Spark is ready to go!")
print("\n   (Colab users: The Spark UI link won't work, which is expected. Skip UI steps.)")

---

## Section 4-5: Pandas to Spark DataFrames

### 📖 Key Syntax Differences

| Pandas | Spark | Note |
|--------|-------|------|
| `df.head()` | `df.show()` | Spark returns None, prints to stdout |
| `df[['col']]` | `df.select('col')` | Spark uses method chaining |
| `df[df['x'] > 5]` | `df.filter(df['x'] > 5)` | Very similar! |
| `df['new'] = ...` | `df.withColumn(...)` | Spark DataFrames are immutable; `withColumn` returns a new DataFrame |
| `len(df)` | `df.count()` | count() triggers computation! |

In [None]:
# Create sample data
np.random.seed(42)
n_rows = 100_000

cities = ['Stockholm', 'Gothenburg', 'Malmö', 'Uppsala', 'Lund',
          'Västerås', 'Örebro', 'Linköping', 'Helsingborg', 'Jönköping']

large_data = {
    'id': range(n_rows),
    'age': np.random.randint(18, 70, n_rows),
    'city': np.random.choice(cities, n_rows),
    'salary': np.random.randint(30000, 100000, n_rows),
    'years_experience': np.random.randint(0, 40, n_rows)
}

df_large_pandas = pd.DataFrame(large_data)
df_large_spark = spark.createDataFrame(df_large_pandas)

print(f"Created dataset: {n_rows:,} rows")
print(f"Partitions: {df_large_spark.rdd.getNumPartitions()}")

---

## 🏋️ Exercise B: Partition Experiment (SOLUTIONS)

### 📖 Understanding Partitions

**Why partitions matter:**
- Too few → not enough parallelism, CPU cores sit idle
- Too many → scheduling overhead, many small tasks
- Rule of thumb: 2-4 partitions per core for CPU-bound work

**repartition() vs coalesce():**
- `repartition(n)`: full shuffle, can increase or decrease partition count
- `coalesce(n)`: no shuffle, can only decrease (more efficient for reducing partitions)
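
The difference is easier to see in a toy model. The sketch below uses plain Python lists as stand-ins for partitions. The functions `toy_coalesce` and `toy_repartition` are hypothetical simplifications written for this note: they do not reproduce Spark's actual algorithms, only the idea that coalescing merges whole partitions while repartitioning redistributes every record.

```python
def toy_coalesce(partitions, n):
    """Merge whole partitions into n groups; records are never redistributed."""
    n = min(n, len(partitions))  # like coalesce, this can only decrease the count
    merged = [[] for _ in range(n)]
    for index, part in enumerate(partitions):
        merged[index * n // len(partitions)].extend(part)
    return merged

def toy_repartition(partitions, n):
    """Redistribute every record round-robin across n partitions (full shuffle)."""
    records = [record for part in partitions for record in part]
    return [records[i::n] for i in range(n)]

# Four hypothetical partitions with uneven sizes: 70, 10, 10, 10 records
partitions = [list(range(0, 70)), list(range(70, 80)),
              list(range(80, 90)), list(range(90, 100))]

print([len(p) for p in toy_coalesce(partitions, 2)])
print([len(p) for p in toy_repartition(partitions, 2)])
print([len(p) for p in toy_repartition(partitions, 8)])
```

In this model the coalesced result inherits the imbalance of its inputs, while the repartitioned result is even. That is the trade-off: coalescing is cheaper, repartitioning balances the data.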

In [None]:
# SOLUTION: Partition Experiment

n_rows_experiment = 1_000_000

experiment_data = {
    'id': range(n_rows_experiment),
    'category': np.random.choice(['A', 'B', 'C', 'D', 'E'], n_rows_experiment),
    'value': np.random.randint(1, 1000, n_rows_experiment),
}

df_experiment_pandas = pd.DataFrame(experiment_data)
df_experiment = spark.createDataFrame(df_experiment_pandas)

print(f"Dataset: {n_rows_experiment:,} rows")
print(f"Default partitions: {df_experiment.rdd.getNumPartitions()}")
print(f"Available cores: {spark.sparkContext.defaultParallelism}")

In [None]:
def time_aggregation(df, label):
    """Time a groupBy aggregation."""
    start = time.time()
    result = df.groupBy('category').agg(
        F.avg('value').alias('avg'),
        F.sum('value').alias('sum'),
        F.count('value').alias('count')
    ).collect()
    duration = time.time() - start
    print(f"{label}: {duration*1000:.0f} ms ({df.rdd.getNumPartitions()} partitions)")
    return duration

# Run experiments
print("\n⏱️ Timing experiments:")
print("-" * 50)

time_default = time_aggregation(df_experiment, "Default")

df_2 = df_experiment.coalesce(2)
time_2 = time_aggregation(df_2, "2 partitions")

# Note: repartition() is lazy, so the full shuffle it causes is included in this timing
df_100 = df_experiment.repartition(100)
time_100 = time_aggregation(df_100, "100 partitions")

In [None]:
# SOLUTION: Analysis
print("\n📊 Performance Summary:")
print("=" * 50)
print(f"Default ({df_experiment.rdd.getNumPartitions()} partitions): {time_default*1000:.0f} ms")
print(f"2 partitions:   {time_2*1000:.0f} ms")
print(f"100 partitions: {time_100*1000:.0f} ms")

print("\n✅ EXPECTED OBSERVATIONS:")
print("-" * 50)
print("• 2 partitions: Often SLOWER (only 2 tasks, other cores idle)")
print("• 100 partitions: May be SLOWER (scheduling overhead, small tasks)")
print(f"• Sweet spot: Usually around {spark.sparkContext.defaultParallelism * 2}-{spark.sparkContext.defaultParallelism * 4} partitions")
print("\n💡 Key insight: More partitions ≠ always better!")

### 💡 Discussion Answers - Exercise B

**Q: What happens with too few partitions?**
- A: Cores sit idle. With 2 partitions on 8 cores, 6 cores do nothing.

**Q: What happens with too many partitions?**
- A: Scheduling overhead. Each task has startup cost. With 1000 tiny tasks, you spend more time scheduling than computing.

**Q: When might you want MORE partitions than cores?**
- A: When data is skewed! If one partition has 90% of the data, you want to split it further so the work is balanced.
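
The first two answers can be reproduced with a back-of-the-envelope model. The sketch below assumes equally sized tasks that run in "waves" across the cores and a fixed per-task overhead. The function `toy_job_time_ms` and all numbers are hypothetical, and the model deliberately ignores skew.

```python
import math

def toy_job_time_ms(total_work_ms, partitions, cores, task_overhead_ms):
    """Estimate job time for equally sized tasks that run in waves across the cores."""
    task_time = total_work_ms / partitions + task_overhead_ms
    waves = math.ceil(partitions / cores)
    return waves * task_time

# Hypothetical numbers: 8 s of total work, 8 cores, 5 ms overhead per task
for partitions in (2, 8, 16, 32, 100, 1000):
    estimate = toy_job_time_ms(8000, partitions, cores=8, task_overhead_ms=5)
    print(f"{partitions:>5} partitions: {estimate:7.0f} ms")
```

With 2 partitions most cores idle and the estimate is about four times worse than with 8. With 1000 partitions the per-task overhead starts to dominate. The flat region in between is the "sweet spot" the exercise points at.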

---

## 🏋️ Exercise C: Spark UI Scavenger Hunt (SOLUTIONS)

### 📖 What to Look For in the Spark UI

After running an aggregation, open the Spark UI:

| Item | Where to Find | What You'll See |
|------|---------------|----------------|
| Jobs | Jobs tab | 1-2 jobs per query |
| Stages | Click on a job | 2 stages (read → shuffle → aggregate) |
| Tasks | Click on a stage | # tasks = # partitions |
| Shuffle | Stage details | Shuffle Read/Write in bytes |
| Timeline | Event Timeline | Colored bars showing parallel execution |

### 📊 Key Visualizations

**Event Timeline** (in stage details):
- Shows tasks as horizontal bars on a timeline
- Overlapping bars = parallel execution
- Gaps = waiting (shuffle, scheduling)

**DAG Visualization** (click "DAG Visualization"):
- Shows how stages connect and depend on each other
- Arrows between stages = shuffles (data movement)
- Visual version of `.explain()` output

In [None]:
# Run the scavenger hunt query
scavenger_result = df_large_spark.groupBy('city').agg(
    F.avg('salary').alias('avg_salary'),
    F.max('years_experience').alias('max_experience'),
    F.count('*').alias('employee_count')
).orderBy(F.desc('avg_salary'))

scavenger_result.show()

print(f"\n🌐 Spark UI: {spark.sparkContext.uiWebUrl}")
print("\nGo to Jobs tab → Click latest job → Explore stages!")

In [None]:
# For Colab users: Examine the query plan
# explain(True) prints the logical plans as well as the physical plan
print("Query Plans (logical and physical):")
print("=" * 60)
scavenger_result.explain(True)

### 📖 How to Read the Query Plan

The output above shows **4 different views** of the same query, from abstract to concrete:

**1. Parsed Logical Plan**: What you wrote, parsed into a tree structure
- Just translates your code, no optimizations yet

**2. Analyzed Logical Plan**: Same as above, but with resolved column types
- Now Spark knows `avg_salary` is a `double`, `employee_count` is a `bigint`, etc.

**3. Optimized Logical Plan**: Spark's optimizer has improved your query!
- Notice the `Project` step: Spark realized it only needs 3 columns (`city`, `salary`, `years_experience`), so it drops `id` and `age` early to save memory

**4. Physical Plan**: The actual execution strategy (this is what runs!)

### 🔍 Reading the Physical Plan (Bottom to Top!)

In the `explain()` output, read from the **bottom up**, because that's the order of execution. The diagram below lists the same operators in execution order, with the first step at the top:

```
Scan ExistingRDD           ← 1. Read the data
   ↓
Project [city, salary, years_experience]  ← 2. Keep only needed columns
   ↓
HashAggregate (partial)    ← 3. Partial aggregation BEFORE shuffle (optimization!)
   ↓
Exchange hashpartitioning  ← 4. SHUFFLE! Send data to reducers by city
   ↓
HashAggregate (final)      ← 5. Final aggregation after shuffle
   ↓
Exchange rangepartitioning ← 6. Another shuffle for sorting
   ↓
Sort                       ← 7. Sort by avg_salary descending
```

### 🎯 Key Insights

- **Two shuffles!** (`Exchange` = shuffle). One for grouping by city, one for sorting.
- **Partial aggregation**: Spark computes partial aggregates (running sums and counts) *before* shuffling (see `partial_avg`, `partial_count`). This reduces network traffic. Smart!
- **AdaptiveSparkPlan**: Spark can adjust the plan at runtime based on actual data sizes.
- **200 partitions**: The default shuffle partition count (you can tune this with `spark.sql.shuffle.partitions`).

### ✅ Scavenger Hunt Answers

| Question | Typical Answer | Explanation |
|----------|---------------|-------------|
| How many jobs? | 1-2 | One for the aggregation, sometimes one for show() |
| Stages per job? | 2 for the aggregation | Stage 1: Partial aggregate, Stage 2: Final aggregate. The `orderBy` sort adds one more shuffle, and with it one more stage |
| Tasks per stage? | = partition count | One task per partition |
| Concurrent execution? | Yes | Look at the timeline: parallel colored bars |
| Shuffle data size? | Few KB | Aggregates are small (just counts/sums per city) |

**Why 2 stages?**
- Stage 1: Each partition computes partial aggregates (local sum, count)
- Shuffle: Send partial results to reducer
- Stage 2: Combine partial results into final aggregates

This is the same Map-Shuffle-Reduce pattern we saw earlier!
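
The two-stage average can be mimicked in plain Python. The sketch below assumes two hypothetical partitions of `(city, salary)` rows. The function names `partial_aggregate` and `final_aggregate` are illustrative, not Spark APIs. The point is that each partition ships only a `(sum, count)` pair per city, not its raw rows.

```python
# Hypothetical salary records split across two partitions
partition_1 = [("Lund", 40000), ("Uppsala", 50000), ("Lund", 60000)]
partition_2 = [("Lund", 80000), ("Uppsala", 90000)]

def partial_aggregate(rows):
    """Stage 1: per-partition (sum, count) for each city."""
    partial = {}
    for city, salary in rows:
        total, count = partial.get(city, (0, 0))
        partial[city] = (total + salary, count + 1)
    return partial

def final_aggregate(partials):
    """Stage 2: merge the partial (sum, count) pairs, then divide."""
    merged = {}
    for partial in partials:
        for city, (total, count) in partial.items():
            old_total, old_count = merged.get(city, (0, 0))
            merged[city] = (old_total + total, old_count + count)
    return {city: total / count for city, (total, count) in merged.items()}

# Only these small dictionaries would cross the "shuffle"
partials = [partial_aggregate(partition_1), partial_aggregate(partition_2)]
print(final_aggregate(partials))
```

Averaging the per-partition averages would be wrong here (Lund would come out as 65000 instead of 60000), which is why the partial result carries the sum and the count separately.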

---

## 🏋️ Exercise D: Distributed Thinking (SOLUTIONS)

### Scenario 1: Web Log Analysis (500 GB)

**Q1: Would this fit in Pandas?**
- A: No! 500 GB requires ~500 GB RAM (plus overhead)
- Most laptops have 8-32 GB RAM
- You'd need a very expensive machine or Spark on a cluster

**Q2: How many partitions for a 10-node cluster?**
- Rule of thumb: 2-4 partitions per core
- If each node has 8 cores: 10 × 8 × 3 = 240 partitions
- Or: 500 GB / 128 MB per partition ≈ 4000 partitions
- The larger number is often better for large data
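
Both rules of thumb are simple arithmetic. The helper names below are made up for this illustration, and the size-based rule assumes 1 GB = 1024 MB.

```python
import math

def partitions_by_cores(nodes, cores_per_node, tasks_per_core):
    """Core-based rule of thumb: a few partitions per available core."""
    return nodes * cores_per_node * tasks_per_core

def partitions_by_size(data_gb, target_partition_mb):
    """Size-based rule of thumb: aim for a fixed partition size (1 GB = 1024 MB)."""
    return math.ceil(data_gb * 1024 / target_partition_mb)

print(partitions_by_cores(nodes=10, cores_per_node=8, tasks_per_core=3))
print(partitions_by_size(data_gb=500, target_partition_mb=128))
```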

**Q3: What if one file is 400 GB?**
- This is **data skew**!
- If the file cannot be split (for example, a gzip-compressed file), it is read as one 400 GB partition, and that one task will take forever
- Others finish quickly, then wait
- Solution: Repartition the data after reading, or use salting techniques

### Scenario 2: Data Pipeline Failure

**Q1: What is this symptom called?**
- A: **Data skew** or **straggler task**
- One task has much more data than others

**Q2: What might cause it?**
- Uneven data distribution (e.g., 90% of orders from one customer)
- Join on a skewed key
- One corrupt/huge input file
- Null values all going to one partition

**Q3: Investigation and fix:**
1. Check Spark UI → Stages → Look for uneven task durations
2. Check input data distribution
3. Fixes:
   - Salting: Add random prefix to skewed keys
   - Broadcast join for small tables
   - Filter out bad data before processing
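
The effect of salting can be simulated without a cluster. The sketch below is a plain-Python model: `partition_for` is a hypothetical stand-in for a hash partitioner (it uses `zlib.crc32` so the result is repeatable), and the data, partition count, and salt count are invented for illustration.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 10
NUM_SALTS = 10

def partition_for(key):
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

# Hypothetical skewed data: 900 of 1,000 orders belong to one customer
keys = ["customer_42"] * 900 + [f"customer_{i}" for i in range(100, 200)]

# Salting: append a random suffix so the hot key becomes several distinct keys
rng = random.Random(0)  # fixed seed so the run is repeatable
salted_keys = [f"{key}#{rng.randrange(NUM_SALTS)}" for key in keys]

unsalted_load = Counter(partition_for(key) for key in keys)
salted_load = Counter(partition_for(key) for key in salted_keys)

print(f"Largest partition without salting: {max(unsalted_load.values())} rows")
print(f"Largest partition with salting:    {max(salted_load.values())} rows")
```

Salting has a cost: after aggregating on the salted key you need a second, much smaller aggregation on the original key, and in a join the other side has to be replicated once per salt value.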

### Scenario 3: Real-time vs Batch

**Trade-offs:**

| Aspect | Batch | Real-time |
|--------|-------|----------|
| Latency | Hours | Seconds |
| Complexity | Simpler | More complex |
| Cost | Cheaper (can use spot instances) | More expensive (always on) |
| Fault tolerance | Easy (restart job) | Harder (need checkpointing) |

**Real-time essential:**
- Fraud detection (need to block transaction immediately)
- Stock trading
- Real-time recommendations
- Alerting/monitoring

**Batch sufficient:**
- Daily sales reports
- Monthly billing
- Training ML models
- Historical analysis

---

## Section 8: Lazy Evaluation

### 📖 Why Lazy Evaluation is Powerful

**1. Optimization:** Spark can see the entire pipeline and optimize it
   - Push filters down (filter early = less data to process)
   - Combine operations
   - Avoid unnecessary shuffles

**2. Efficiency:** Spark doesn't compute what you don't need
   - If you call `filter().select().take(10)`, why process all rows?

**Important:** Transformations like `filter()` don't run immediately. Nothing happens until you call an action like `show()` or `collect()`!
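
Python generators give a rough feel for this behaviour. The sketch below is only an analogy, not how Spark is implemented: Spark also rewrites and optimizes the plan, which generators do not. The `read_rows` function and the `rows_read` counter are invented for the illustration.

```python
from itertools import islice

rows_read = 0

def read_rows(n):
    """Generator standing in for a data source; counts how many rows are produced."""
    global rows_read
    for i in range(n):
        rows_read += 1
        yield {"id": i, "age": 18 + i % 50}

# "Transformations": these lines only describe the pipeline
source = read_rows(1_000_000)
adults_over_30 = (row for row in source if row["age"] > 30)
ids = (row["id"] for row in adults_over_30)

print(f"Rows read after building the pipeline: {rows_read}")

# "Action": pulling 10 results forces just enough work to produce them
first_ten = list(islice(ids, 10))
print(f"First ten ids: {first_ten}")
print(f"Rows read after the action: {rows_read}")
```

Building the pipeline reads nothing, and asking for ten results reads only a handful of the million rows. This is the same reason `take(10)` on a lazy Spark pipeline can avoid processing the full dataset.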

In [None]:
# Demonstrating lazy evaluation
start = time.time()

# This is INSTANT - just building a plan
result = df_large_spark \
    .filter(df_large_spark['age'] > 30) \
    .filter(df_large_spark['salary'] > 50000) \
    .select('city', 'salary') \
    .groupBy('city') \
    .agg(F.avg('salary').alias('avg_salary'))

planning_time = time.time() - start
print(f"⚡ Planning took: {planning_time*1000:.2f} ms")
print(f"   Nothing computed yet! Just built a plan.")
print(f"\nQuery plan:")
result.explain()

In [None]:
# Now trigger execution
start = time.time()
actual_result = result.collect()
execution_time = time.time() - start

print(f"⏱️ Execution took: {execution_time*1000:.2f} ms")
# Note: show() is a separate action, so Spark executes the plan a second time here
result.show()

---

## Wrap-up & Key Takeaways

### 📖 What You Learned Today

1. **Environment setup**: How to verify and configure PySpark
2. **MapReduce**: The foundational paradigm (Map → Shuffle → Reduce)
3. **SparkSession**: The entry point to Spark functionality
4. **MapReduce vs Spark**: Same word count in 5 lines instead of 30!
5. **Pandas → Spark**: Converting DataFrames and the similar syntax
6. **Scaling**: Why Spark matters for large datasets
7. **Partitions**: How Spark distributes data for parallel processing
8. **Lazy evaluation**: Transformations vs Actions
9. **Spark UI**: Monitoring what Spark is doing
10. **Distributed thinking**: Reasoning about data at scale

### Key Concepts to Remember

- **MapReduce** is the foundational pattern: Map (transform) → Shuffle (group) → Reduce (combine)
- Spark is designed for **distributed** computing, so it shines with big data
- Spark keeps intermediate data **in memory** where it can, instead of writing it to disk between steps like Hadoop MapReduce
- **Lazy evaluation** lets Spark optimize your entire pipeline
- **Partitions** determine parallelism, and balance is key
- **Data skew** (uneven partitions) is a common performance killer
- The **Spark UI** is your best friend for debugging and optimization

### Common Pitfalls to Avoid

- Collecting too much data to the driver (`collect()` on large data)
- Assuming `filter()` runs immediately (it doesn't, because it's lazy!)
- Ignoring data skew
- Using the wrong partition count

### What's Next?

- **Next week (Week 5):** Lab 1 will be published. It's your first graded assignment!
- Lab 1 dives deeper into transformations, actions, and the execution model
- You'll work with larger datasets and submit your completed notebook

In [None]:
# Clean up
spark.stop()
print("✓ Spark session stopped.")
print("\n🎉 Learning Lab Solutions Complete!")