# 🚀 Google Colab Setup for PySpark Tutorial

**⚠️ IMPORTANT**: Run all cells in this notebook before starting any PySpark tutorial module in Google Colab.

This notebook will:
- ✅ Install Java (required for PySpark)
- ✅ Install PySpark and all dependencies
- ✅ Configure environment variables
- ✅ Test the installation
- ✅ Provide troubleshooting tips

**Estimated time**: 2-3 minutes

## 🔍 Environment Detection

In [None]:
import sys
import os
import platform

print("🔍 Environment Detection:")
print(f"   Python Version: {sys.version.split()[0]}")
print(f"   Operating System: {platform.system()}")
print(f"   Platform: {platform.platform()}")

# Check if running in Google Colab
IN_COLAB = 'google.colab' in sys.modules
print(f"   Google Colab: {'✅ Yes' if IN_COLAB else '❌ No (Local environment)'}")

if not IN_COLAB:
    print("\n💻 You're running locally - PySpark should already be installed!")
    print("   Skip to the 'Test PySpark Installation' section below.")
else:
    print("\n🚀 Google Colab detected - proceeding with setup...")

## ☕ Java Installation (Required for PySpark)

In [None]:
if IN_COLAB:
    print("☕ Installing Java 8 (required for PySpark)...")
    
    # Update package list and install Java
    !apt-get update -qq > /dev/null 2>&1
    !apt-get install openjdk-8-jdk-headless -qq > /dev/null 2>&1
    
    # Set JAVA_HOME environment variable
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
    
    print("✅ Java installation complete!")
    
    # Verify Java installation
    print("\n🔍 Verifying Java installation:")
    !java -version
    
else:
    print("💻 Local environment - checking existing Java installation...")
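    import shutil

    # A failing `!java` shell command never raises a Python exception,
    # so check for the executable on PATH instead of using try/except
    if shutil.which("java"):
        !java -version
        print("✅ Java is already installed!")
    else:
        print("❌ Java not found. Please install Java 8 or 11 manually.")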

## 🐍 PySpark and Dependencies Installation

In [None]:
if IN_COLAB:
    print("📦 Installing PySpark and dependencies...")
    
    # Install core packages
    !pip install -q pyspark==3.5.0
    print("   ✅ PySpark 3.5.0 installed")
    
    # Install data manipulation libraries
    !pip install -q pandas numpy
    print("   ✅ Data libraries (pandas, numpy) installed")
    
    # Install visualization libraries
    !pip install -q matplotlib seaborn plotly
    print("   ✅ Visualization libraries installed")
    
    # Install ML and utility libraries
    !pip install -q scikit-learn faker
    print("   ✅ ML and utility libraries installed")
    
    print("\n🎉 All packages installed successfully!")
    
else:
    print("💻 Local environment - checking existing installations...")
    
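    import importlib

    # Map each import name to its pip package name (they differ for scikit-learn)
    packages = {'pyspark': 'pyspark', 'pandas': 'pandas', 'numpy': 'numpy',
                'matplotlib': 'matplotlib', 'seaborn': 'seaborn', 'plotly': 'plotly',
                'sklearn': 'scikit-learn', 'faker': 'faker'}
    for module_name, pip_name in packages.items():
        try:
            importlib.import_module(module_name)
            print(f"   ✅ {pip_name} is installed")
        except ImportError:
            print(f"   ❌ {pip_name} is missing - install with: pip install {pip_name}")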

## ⚙️ Environment Configuration

In [None]:
if IN_COLAB:
    print("⚙️ Configuring PySpark environment variables...")
    
    # Set PySpark environment variables; derive SPARK_HOME from the installed
    # package so the path does not depend on Colab's Python version
    import pyspark
    os.environ["SPARK_HOME"] = os.path.dirname(pyspark.__file__)
    os.environ["PYSPARK_PYTHON"] = "python3"
    os.environ["PYSPARK_DRIVER_PYTHON"] = "python3"
    
    # pip already places PySpark on the Python path, so sys.path needs no changes
    
    print("✅ Environment configuration complete!")
    
    # Display environment variables
    print("\n🔍 Environment Variables:")
    print(f"   JAVA_HOME: {os.environ.get('JAVA_HOME', 'Not set')}")
    print(f"   SPARK_HOME: {os.environ.get('SPARK_HOME', 'Not set')}")
    print(f"   PYSPARK_PYTHON: {os.environ.get('PYSPARK_PYTHON', 'Not set')}")
    
else:
    print("💻 Local environment - environment should be pre-configured")

## 🧪 Test PySpark Installation

In [None]:
print("🧪 Testing PySpark installation...")

try:
    # Import PySpark
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, count, sum as spark_sum
    print("   ✅ PySpark imports successful")
    
    # Create Spark session
    spark = SparkSession.builder \
        .appName("ColabSetupTest") \
        .config("spark.driver.memory", "2g") \
        .config("spark.executor.memory", "1g") \
        .config("spark.sql.shuffle.partitions", "4") \
        .getOrCreate()
    
    print("   ✅ Spark session created successfully")
    print(f"   📊 Spark version: {spark.version}")
    print(f"   📝 Application name: {spark.sparkContext.appName}")
    
    # Create test DataFrame
    test_data = [(1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", 35)]
    columns = ["id", "name", "age"]
    df = spark.createDataFrame(test_data, columns)
    
    print("   ✅ Test DataFrame created")
    
    # Test basic operations
    row_count = df.count()
    avg_age = df.agg(spark_sum("age")).collect()[0][0] / row_count
    
    print(f"   ✅ Basic operations work - {row_count} rows, avg age: {avg_age}")
    
    # Display sample data
    print("\n📊 Sample DataFrame:")
    df.show()
    
    print("\n🎉 PySpark is working perfectly!")
    print("🚀 You're ready to start the PySpark tutorial!")
    
    # Clean up
    spark.stop()
    
except Exception as e:
    print(f"❌ Error testing PySpark: {str(e)}")
    print("\n🔧 Troubleshooting suggestions:")
    print("   1. Restart the runtime: Runtime > Restart runtime")
    print("   2. Re-run all cells in this notebook")
    print("   3. Check the troubleshooting section below")
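
In the test cell above, `spark.stop()` runs only when every step succeeds; if an operation fails after the session is created, the session stays alive. The following is a minimal sketch of one way to guarantee cleanup. The helper name `managed_session` is hypothetical and not part of PySpark; it accepts any zero-argument callable that returns an object with a `stop()` method.

```python
from contextlib import contextmanager


@contextmanager
def managed_session(create_session):
    """Yield the object returned by create_session() and always call its stop()."""
    session = create_session()
    try:
        yield session
    finally:
        # Runs whether the body finished normally or raised an exception
        session.stop()


# Possible usage with PySpark (assumes pyspark is installed):
#
# from pyspark.sql import SparkSession
#
# with managed_session(
#     lambda: SparkSession.builder.appName("ColabSetupTest").getOrCreate()
# ) as spark:
#     spark.range(3).show()
```

Passing a factory rather than a finished session keeps session creation inside the helper, so cleanup is paired with it in one place.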

## 🔧 Troubleshooting Guide

### Common Issues and Solutions

#### 1. Java Not Found Error
```
Error: JAVA_HOME is not set
```
**Solution**: Re-run the Java installation cell above
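
Before re-running the cell, it can help to see which part of the Java setup is missing. A minimal sketch using only the standard library; `diagnose_java` is an illustrative name, and the two checks it makes (the `JAVA_HOME` variable and a `java` executable on `PATH`) are assumptions about the most likely causes, not an exhaustive list.

```python
import os
import shutil


def diagnose_java(env=None, which=shutil.which):
    """Return a list of problems found with the Java setup (empty if none)."""
    env = os.environ if env is None else env
    problems = []

    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not os.path.isdir(java_home):
        problems.append(f"JAVA_HOME points to a missing directory: {java_home}")

    if which("java") is None:
        problems.append("no 'java' executable found on PATH")

    return problems


problems = diagnose_java()
if problems:
    for problem in problems:
        print(f"   ❌ {problem}")
else:
    print("   ✅ Java setup looks consistent")
```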

#### 2. PySpark Import Error
```
ModuleNotFoundError: No module named 'pyspark'
```
**Solution**: 
- Restart runtime: `Runtime > Restart runtime`
- Re-run installation cells
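
To confirm whether the install actually reached the interpreter the notebook is running, the installed versions can be queried without importing the packages. A small sketch using the standard library's `importlib.metadata`; `installed_version` is an illustrative helper name.

```python
import sys
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# The interpreter path shows which environment pip needs to install into
print(f"Interpreter: {sys.executable}")
for dist in ("pyspark", "pandas", "numpy"):
    print(f"   {dist}: {installed_version(dist) or 'not installed'}")
```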

#### 3. Memory Issues
```
OutOfMemoryError or session crashes
```
**Solution**: Use smaller datasets, or lower `spark.driver.memory` and `spark.executor.memory` in the session builder if the Colab session itself runs out of RAM
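
One way to pick a memory setting is to look at how much RAM is currently free. The sketch below reads `/proc/meminfo`, which exists on Linux hosts such as Colab. The 50% fraction and the 512 MiB floor in `suggest_driver_memory` are hypothetical rules of thumb, not values recommended by Spark or Colab.

```python
def parse_mem_available_mb(meminfo_text):
    """Extract MemAvailable (in MiB) from the text of /proc/meminfo, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) // 1024  # the file reports kB
    return None


def suggest_driver_memory(available_mb, fraction=0.5, floor_mb=512):
    """Hypothetical heuristic: give the driver a fraction of the free RAM."""
    return f"{max(floor_mb, int(available_mb * fraction))}m"


try:
    with open("/proc/meminfo") as f:
        available = parse_mem_available_mb(f.read())
except OSError:
    available = None  # not Linux, or /proc is unavailable

if available is not None:
    print(f"Available RAM: {available} MiB")
    print(f"Suggested spark.driver.memory: {suggest_driver_memory(available)}")
```

The returned string (for example `1024m`) uses the size format that `.config("spark.driver.memory", ...)` accepts.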

#### 4. Session Timeout
```
Session disconnected after inactivity
```
**Solution**: Re-run this setup notebook (a recycled runtime loses installed packages and environment variables), then continue from where you left off

### 📞 Getting Help
- Check the main tutorial README for more troubleshooting tips
- Open an issue on GitHub if problems persist
- Use the Colab "Help" menu for Colab-specific issues

## 🎯 Next Steps

### ✅ Setup Complete!

You can now proceed to any PySpark tutorial module:

1. **[Module 1: Foundation & Setup](01_pyspark_foundation_setup.ipynb)** - Start here for basics
2. **[Module 2: DataFrame Operations](02_dataframe_operations.ipynb)** - Core data operations
3. **[Module 6: Machine Learning](06_machine_learning_mllib.ipynb)** - ML with PySpark
4. **[Module 10: End-to-End Project](10_end_to_end_project.ipynb)** - Complete project

### 💡 Pro Tips for Colab:

1. **Save frequently**: Colab sessions can time out
2. **Use smaller datasets**: Colab has memory limitations
3. **Monitor resources**: Check RAM/Disk usage in the sidebar
4. **Keep this setup handy**: Bookmark this notebook for future sessions
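
Disk usage can also be checked from a cell. A small sketch using only the standard library; `format_gib` and `disk_report` are illustrative names, not Colab APIs.

```python
import shutil


def format_gib(num_bytes):
    """Format a byte count as gibibytes with one decimal place."""
    return f"{num_bytes / 1024**3:.1f} GiB"


def disk_report(path="/"):
    """Summarize used, total, and free space for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return (
        f"Disk {path}: {format_gib(usage.used)} used of "
        f"{format_gib(usage.total)} ({format_gib(usage.free)} free)"
    )


print(disk_report())
```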

### 🚀 Happy Learning!

You're all set to master PySpark with this comprehensive tutorial series!