# SemanticLayer Troubleshooting Guide

This notebook provides step-by-step solutions for common issues when setting up and running the semantic layer project.

**Contents:**
1. Java/JDK Issues
2. PySpark Installation & Configuration
3. Missing or Corrupted Data Files
4. Virtual Environment Problems
5. Test Failures
6. Colab-Specific Issues

## Issue 1: Java/JDK Not Found or Misconfigured

### Symptoms
- Error: `Java gateway process exited`
- Error: `JAVA_HOME not set`
- Error: `Unable to locate Java`

### Solution by Operating System

#### macOS (Homebrew)
```bash
# Install Java
brew install openjdk@11

# Find installation path
brew --prefix openjdk@11
# Returns /usr/local/opt/openjdk@11 on Intel Macs
# or /opt/homebrew/opt/openjdk@11 on Apple Silicon

# Add to shell profile (brew --prefix resolves the correct path at shell startup)
echo 'export JAVA_HOME="$(brew --prefix openjdk@11)"' >> ~/.zshrc
echo 'export PATH="$JAVA_HOME/bin:$PATH"' >> ~/.zshrc

# Reload shell
source ~/.zshrc

# Verify
java -version
```

#### Ubuntu/Debian Linux
```bash
# Update package manager
sudo apt-get update

# Install OpenJDK
sudo apt-get install -y openjdk-11-jdk

# Verify
java -version

# If JAVA_HOME is not set, add it to ~/.bashrc
# (path shown is for amd64; check /usr/lib/jvm/ on other architectures)
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc
source ~/.bashrc
```

#### Windows (PowerShell)
```powershell
# Install via Chocolatey (if available)
choco install openjdk11

# Or download from https://adoptium.net/
# Then set the environment variable for the current user
# (replace jdk-11.0.x with the name of your installed JDK folder):
$env:JAVA_HOME = "C:\Program Files\Java\jdk-11.0.x"
[System.Environment]::SetEnvironmentVariable("JAVA_HOME", $env:JAVA_HOME, "User")

# Verify
java -version
```

### Verify Installation
After setup, test with:
```bash
python -c "from pyspark.sql import SparkSession; SparkSession.builder.master('local').getOrCreate()"
```
If no errors appear, Java is properly configured!
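
If the command above does fail, it helps to see what the Python process itself can find. The following is a minimal sketch using only the standard library; the helper name `locate_java` is hypothetical and not part of the project.

```python
import os
import shutil


def locate_java(env=None):
    """Return (JAVA_HOME value, path to the java executable) for the given environment."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")  # None when the variable is unset
    # shutil.which searches the PATH directories for an executable named "java"
    java_on_path = shutil.which("java", path=env.get("PATH"))
    return java_home, java_on_path


java_home, java_on_path = locate_java()
print(f"JAVA_HOME: {java_home or 'not set'}")
print(f"java on PATH: {java_on_path or 'not found'}")
if java_home and not os.path.isdir(java_home):
    print("JAVA_HOME points to a directory that does not exist")
```

If `JAVA_HOME` is set in your shell profile but shows as "not set" here, the profile has probably not been reloaded in the terminal (or notebook kernel) that runs Python.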

## Issue 2: PySpark Installation Fails or Is Very Slow

### Symptoms
- `ERROR: Could not find a version that satisfies the requirement pyspark`
- Installation hangs for >10 minutes
- Disk space errors

### Solutions

#### Clear pip cache and retry
```bash
pip cache purge
pip install --upgrade pip setuptools wheel
pip install -r SemanticLayer/requirements.txt
```

#### Install specific PySpark version
```bash
pip install pyspark==3.4.0
```
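
To confirm which PySpark version (if any) the active environment ended up with, a small standard-library check can be used. This is a sketch; `installed_version` is a hypothetical helper name, and `importlib.metadata` requires Python 3.8+.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("pyspark")
print(f"pyspark: {version or 'not installed in this environment'}")
```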

#### Check available disk space
```bash
# macOS/Linux
df -h

# Windows PowerShell
Get-PSDrive C
```
Ensure at least 1 GB is free.
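
The same check can be done from Python on any operating system. This is a minimal sketch; `free_gb` is a hypothetical helper, and the 1 GB threshold simply mirrors the guidance above.

```python
import shutil


def free_gb(path="."):
    """Free space on the filesystem containing `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


available = free_gb()
print(f"Free space: {available:.1f} GB")
if available < 1.0:
    print("Less than 1 GB free; the PySpark install may fail")
```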

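#### Use a pre-built wheel (faster)
```bash
# Fails with "Could not find a version" if PyPI offers no wheel for your
# platform; in that case, drop the flag and let pip build from source
pip install --only-binary :all: pyspark
```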

## Issue 3: Missing or Corrupted Data Files

### Symptoms
- `FileNotFoundError: gold_view.csv not found`
- CSV files are empty or have 0 rows
- `metadata.json` is malformed

### Solution: Re-run ETL

#### Option A: Full PySpark ETL (recommended)
```bash
python SemanticLayer/scripts/process_data_spark.py
```

#### Option B: Pandas fallback (no Java needed)
```bash
python SemanticLayer/scripts/process_data.py
```

#### Option C: Manual reset
```bash
# Remove old files
rm -rf SemanticLayer/data/silver/*
rm -rf SemanticLayer/data/gold/*
rm -f SemanticLayer/data/metadata.json  # -f: no error if the file is already missing

# Re-run ETL
python SemanticLayer/scripts/process_data_spark.py
```

#### Verify files were created
```bash
ls -lah SemanticLayer/data/silver/
ls -lah SemanticLayer/data/gold/
cat SemanticLayer/data/metadata.json
```
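
Listing the files shows that they exist, but not whether they are empty or malformed. The sketch below checks both symptoms with the standard library; the helper names `count_csv_rows` and `is_valid_json` are hypothetical, and it assumes the CSV has a single header row.

```python
import csv
import json
import os


def count_csv_rows(path):
    """Number of data rows in a CSV file, excluding the header line."""
    with open(path, newline="") as f:
        rows = sum(1 for _ in csv.reader(f))
    return max(rows - 1, 0)


def is_valid_json(path):
    """True if the file parses as JSON, False if it is malformed or empty."""
    try:
        with open(path) as f:
            json.load(f)
        return True
    except json.JSONDecodeError:
        return False


gold = "SemanticLayer/data/gold/gold_view.csv"
meta = "SemanticLayer/data/metadata.json"

if os.path.exists(gold):
    print(f"{gold}: {count_csv_rows(gold)} data rows")
else:
    print(f"{gold}: missing")

if os.path.exists(meta):
    print(f"{meta}: {'valid JSON' if is_valid_json(meta) else 'malformed'}")
else:
    print(f"{meta}: missing")
```

A row count of 0 or a "malformed" result means the ETL should be re-run.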

## Issue 4: Virtual Environment Problems

### Symptoms
- `ModuleNotFoundError: No module named 'pyspark'`
- Dependencies installed but not found
- Wrong Python version

### Solution: Recreate Virtual Environment

```bash
# Deactivate current venv (if active)
deactivate

# Remove old venv
rm -rf .venv

# Create fresh venv
python -m venv .venv

# Activate (choose your OS):
# macOS/Linux:
source .venv/bin/activate

# Windows PowerShell:
.venv\Scripts\Activate.ps1

# Windows Command Prompt:
.venv\Scripts\activate.bat

# Install dependencies
pip install --upgrade pip
pip install -r SemanticLayer/requirements.txt

# Verify
python -c "import pyspark; print(pyspark.__version__)"
```

### Check Python version
```bash
python --version  # Should be 3.8+
which python      # macOS/Linux: should show a path inside .venv (Windows: where python)
```
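
The interpreter can also report whether it is running inside a virtual environment, which works the same way on every OS. A minimal sketch, assuming the environment was created with `python -m venv` as above; `in_virtualenv` is a hypothetical helper name.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtualenv()}")
```

If this prints `False` after activation, the `python` on your PATH is not the one from `.venv`.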

## Issue 5: Tests Fail with Assertion Errors

### Symptoms
- `AssertionError: Expected 145.49, got 145.48`
- Test times out after 60 seconds
- `ImportError` in test file

### Solution: Debug ETL Output

```bash
# 1. Check if gold_view.csv exists and has data
head -20 SemanticLayer/data/gold/gold_view.csv

# 2. View as pandas to check values
python -c "import pandas as pd; print(pd.read_csv('SemanticLayer/data/gold/gold_view.csv').to_string())"

# 3. Check metadata
cat SemanticLayer/data/metadata.json

# 4. Re-run ETL with verbose output
python -u SemanticLayer/scripts/process_data_spark.py

# 5. Run specific test with verbose output
pytest -vv SemanticLayer/tests/test_etl.py::test_gold_view_values
```

### Common Assertion Causes
- Floating-point precision: Expected `145.50`, got `145.4999999` (use `pytest.approx()`)
- Data not re-generated: Run `process_data_spark.py` again
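- Stale generated data: Delete the generated files under `SemanticLayer/data/silver/` and `SemanticLayer/data/gold/` plus `metadata.json` (see Issue 3, Option C), then rebuild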
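
The floating-point case can be illustrated with a tolerance-based comparison. The sketch below uses the standard library's `math.isclose` so it runs without pytest; the helper `values_match` and the tolerance of 0.005 are assumptions chosen for illustration, not values taken from the project's tests.

```python
import math


def values_match(actual, expected, tol=0.005):
    """Compare two floats using an absolute tolerance instead of exact equality."""
    return math.isclose(actual, expected, rel_tol=0.0, abs_tol=tol)


expected = 145.50
actual = 145.4999999  # value from the floating-point example above

print("exact match:", actual == expected)
print("within tolerance:", values_match(actual, expected))

# The equivalent assertion inside a pytest test would be:
#   assert actual == pytest.approx(expected, abs=0.005)
```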

## Issue 6: Colab-Specific Issues

### Symptom 1: "gold_view.csv not found" in Colab

**Solution:** Mount Google Drive
```python
from google.colab import drive
drive.mount('/content/drive')

# Then read from mounted drive
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/gold_view.csv')
```

Or upload the file manually:
```python
from google.colab import files
uploaded = files.upload()
# Then select gold_view.csv from your computer
```

### Symptom 2: Java errors in Colab

**Solution:** Use DuckDB instead (no Java)
```python
# Install DuckDB
!pip install duckdb

# No need for PySpark in Colab; use DuckDB for SQL queries
import duckdb
duckdb.sql("SELECT * FROM 'gold_view.csv' LIMIT 5").show()
```

### Symptom 3: Dependencies not installing

```python
# Upgrade pip first
!pip install --upgrade pip

# Then install all dependencies
!pip install pandas duckdb pyspark
```

## Quick Diagnostic Script

Run this script to check your setup:

```python
import sys
import os
import subprocess

print("=" * 60)
print("SEMANTIC LAYER SETUP DIAGNOSTIC")
print("=" * 60)

# Check Python version
print(f"\n✓ Python version: {sys.version}")
if sys.version_info < (3, 8):
    print("  ⚠ WARNING: Python 3.8+ required")

# Check Java
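try:
    # `java -version` writes its banner to stderr, so merge it into stdout
    java_version = subprocess.check_output(['java', '-version'], stderr=subprocess.STDOUT).decode()
    print(f"✓ Java installed: {java_version.strip().splitlines()[0]}")
except (FileNotFoundError, subprocess.CalledProcessError):
    # CalledProcessError covers a `java` stub that exists but exits with an error
    print("✗ Java NOT found (required for PySpark)")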

# Check required packages
packages = ['pandas', 'pyspark', 'duckdb', 'pytest']
for pkg in packages:
    try:
        __import__(pkg)
        print(f"✓ {pkg} installed")
    except ImportError:
        print(f"✗ {pkg} NOT installed")

# Check data files
data_files = [
    'SemanticLayer/data/silver/customers_silver.csv',
    'SemanticLayer/data/silver/transactions_silver.csv',
    'SemanticLayer/data/gold/gold_view.csv',
    'SemanticLayer/data/metadata.json'
]

print("\nData files:")
for f in data_files:
    if os.path.exists(f):
        size = os.path.getsize(f)
        print(f"✓ {f} ({size} bytes)")
    else:
        print(f"✗ {f} (missing)")

print("\n" + "=" * 60)
```

Save as `diagnostic.py` and run:
```bash
python diagnostic.py
```

## Still Stuck?

1. **Post the error message** with context (OS, Python version, command)
2. **Run the diagnostic script** above and share output
3. **Check the main SETUP_GUIDE.md** in the repo root
4. **Open an issue** on GitHub with: OS, error logs, and diagnostic output

---

**Resources:**
- [Apache Spark Documentation](https://spark.apache.org/docs/latest/)
- [Java Installation Guide](https://www.java.com/en/download/help/download_options.html)
- [DuckDB Documentation](https://duckdb.org/)