# Data Pipeline

This notebook orchestrates the complete data processing pipeline:

1. **Clean CSVs** - Cleans the raw listings data
2. **Setup Database** - Creates the SQLite database and schema
3. **Populate Database** - Loads cleaned data into the database
4. **View Database** - Displays database contents and statistics

Run all cells sequentially to execute the complete pipeline.
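
The run-in-order, stop-on-first-failure behavior described above can be sketched in a few lines. This is only an illustration: `run_steps` and the four lambda stand-ins are hypothetical names, not part of this project, and the real cells below call notebooks and a shell script instead.

```python
import time
from typing import Callable, List, Tuple


def run_steps(steps: List[Tuple[str, Callable[[], None]]]) -> List[Tuple[str, float]]:
    """Run named steps in order and return (name, elapsed_seconds) per step.

    Any exception raised by a step propagates, so later steps never run.
    """
    timings = []
    for name, step in steps:
        start = time.perf_counter()
        step()
        timings.append((name, time.perf_counter() - start))
    return timings


# Hypothetical stand-ins for the four stages of this notebook.
log = []
timings = run_steps([
    ("clean", lambda: log.append("clean")),
    ("setup", lambda: log.append("setup")),
    ("populate", lambda: log.append("populate")),
    ("view", lambda: log.append("view")),
])
print([name for name, _ in timings])
```

Letting exceptions propagate mirrors the `raise` statements used in the cells below: a failed step halts everything after it.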


In [24]:
import subprocess
import sys
import os
from pathlib import Path
import time

# Set up paths
project_root = Path().resolve()
print(f"Project root: {project_root}")

# Define notebook paths
notebooks_dir = project_root / "notebooks"
sql_dir = project_root / "sql"

clean_notebook = notebooks_dir / "clean_csvs.ipynb"
populate_notebook = sql_dir / "etl" / "populate_database.ipynb"
view_notebook = sql_dir / "view" / "view_database.ipynb"
setup_script = sql_dir / "setup_local_db_sqlite.sh"

# Verify all files exist
print("\n📋 Pipeline Components:")
print(f"  ✓ Clean CSV notebook: {clean_notebook.exists()}")
print(f"  ✓ Setup script: {setup_script.exists()}")
print(f"  ✓ Populate database notebook: {populate_notebook.exists()}")
print(f"  ✓ View database notebook: {view_notebook.exists()}")

if not all([clean_notebook.exists(), setup_script.exists(), 
            populate_notebook.exists(), view_notebook.exists()]):
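    print("\n⚠️  Warning: Some required files are missing!")
    # Raise instead of calling sys.exit(), so the failure is explicit in a notebook kernel
    raise FileNotFoundError("Some required pipeline files are missing; see the checklist above.")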


Project root: /Users/anishj29/Desktop/Github Projects/Airbnb-Price-Predictor

📋 Pipeline Components:
  ✓ Clean CSV notebook: True
  ✓ Setup script: True
  ✓ Populate database notebook: True
  ✓ View database notebook: True


## Step 1: Clean CSV Files

This step cleans the raw listings data and saves it to `data/processed/listings_cleaned.csv`.
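
The cell below changes into the notebook's directory so that its relative paths resolve, then restores the original directory in a `finally` block. As a minimal sketch, the same pattern can be wrapped in a reusable context manager; `working_directory` is a hypothetical helper name, and the CSV contents are made up for the demonstration.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def working_directory(path):
    """Temporarily change the current working directory, then restore it."""
    original = Path.cwd()
    os.chdir(path)
    try:
        yield Path.cwd()
    finally:
        # Runs even if the body raises, like the try/finally in the cell below.
        os.chdir(original)


before = Path.cwd()
with tempfile.TemporaryDirectory() as tmp:
    with working_directory(tmp) as inside:
        # Relative paths now resolve against the temporary directory.
        Path("listings_cleaned.csv").write_text("id,price\n1,100\n")
        created = (inside / "listings_cleaned.csv").exists()
after = Path.cwd()
print(created, before == after)
```

On Python 3.11 and later, `contextlib.chdir` in the standard library provides the same behavior.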


In [25]:
print("=" * 60)
print("STEP 1: Cleaning CSV Files")
print("=" * 60)

# Execute the clean_csvs notebook using nbclient (avoids event loop issues)
try:
    from nbclient import NotebookClient
    import nbformat
    import os
    
    print("Loading notebook...")
    with open(clean_notebook, 'r', encoding='utf-8') as f:
        nb = nbformat.read(f, as_version=4)
    
    # Set working directory to the notebook's directory for relative paths
    notebook_dir = clean_notebook.parent
    original_cwd = os.getcwd()
    
    try:
        os.chdir(str(notebook_dir))
        print(f"Changed working directory to: {notebook_dir}")
        
        print("Executing notebook cells...")
        client = NotebookClient(nb, timeout=300, kernel_name='python3')
        client.execute()
        
        print("✓ Successfully cleaned CSV files")
        print(f"  Output: {project_root / 'data' / 'processed' / 'listings_cleaned.csv'}")
    finally:
        os.chdir(original_cwd)
    
except ImportError:
    print("nbclient not available. Please install it:")
    print("  pip install nbclient")
    raise
except Exception as e:
    print(f"✗ Error cleaning CSV files: {e}")
    raise


STEP 1: Cleaning CSV Files
Loading notebook...
Changed working directory to: /Users/anishj29/Desktop/Github Projects/Airbnb-Price-Predictor/notebooks
Executing notebook cells...


✓ Successfully cleaned CSV files
  Output: /Users/anishj29/Desktop/Github Projects/Airbnb-Price-Predictor/data/processed/listings_cleaned.csv


## Step 2: Setup Database Schema

This step creates the SQLite database and applies the schema.
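
The actual schema is applied by `setup_local_db_sqlite.sh` and is not shown in this notebook. As a rough sketch of what creating a database and applying a schema looks like from Python, the example below uses the standard `sqlite3` module. The two table names match the ones counted in the final verification cell, but every column here is an assumption made for illustration.

```python
import sqlite3

# Hypothetical schema: the real one lives in the project's SQL files.
SCHEMA = """
CREATE TABLE IF NOT EXISTS neighbourhood (
    neighbourhood_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS listing (
    listing_id INTEGER PRIMARY KEY,
    neighbourhood_id INTEGER NOT NULL REFERENCES neighbourhood(neighbourhood_id),
    price REAL
);
"""


def create_schema(conn):
    """Apply the schema and return the names of the tables that now exist."""
    conn.executescript(SCHEMA)
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


# An in-memory database keeps the sketch from touching data/airbnb.db.
conn = sqlite3.connect(":memory:")
tables = create_schema(conn)
print(tables)
conn.close()
```

Because the statements use `IF NOT EXISTS`, calling `create_schema` twice is harmless. The shell script's `--force` mode instead deletes the database file and recreates it.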


In [26]:
print("=" * 60)
print("STEP 2: Setting up Database Schema")
print("=" * 60)

# Make script executable
os.chmod(setup_script, 0o755)

# Execute the setup script with --force flag to skip interactive prompt
try:
    result = subprocess.run(
        ["bash", str(setup_script), "--force"],
        cwd=str(project_root),
        capture_output=True,
        text=True,
        check=True
    )
    print("✓ Successfully created database and schema")
    if result.stdout:
        print(result.stdout)
except subprocess.CalledProcessError as e:
    print("✗ Error setting up database:")
    if e.stderr:
        print(e.stderr)
    if e.stdout:
        print(e.stdout)
    raise


STEP 2: Setting up Database Schema
✓ Successfully created database and schema
Setting up local SQLite database...
Removed existing database file (--force mode)
Creating database and running schema...

✓ Database setup complete!

Database file: /Users/anishj29/Desktop/Github Projects/Airbnb-Price-Predictor/sql/../data/airbnb.db

You can now run the populate_database.ipynb notebook to load data.



## Step 3: Populate Database

This step loads the cleaned data into the database.
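
The loading logic itself lives in `populate_database.ipynb`, which is not reproduced here. The following is a minimal sketch of one way to load a cleaned CSV into SQLite with parameterized inserts inside a single transaction. The schema, the CSV columns, the sample rows, and the `load_listings` helper are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical schema and columns, for illustration only.
SCHEMA = """
CREATE TABLE neighbourhood (
    neighbourhood_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE listing (
    listing_id INTEGER PRIMARY KEY,
    neighbourhood_id INTEGER NOT NULL REFERENCES neighbourhood(neighbourhood_id),
    price REAL
);
"""

# Stand-in for data/processed/listings_cleaned.csv (made-up rows).
CSV_TEXT = """listing_id,neighbourhood,price
1,Downtown,120.0
2,Harbor,95.5
3,Downtown,210.0
"""


def load_listings(conn, csv_file):
    """Insert neighbourhoods first, then the listings that reference them."""
    rows = list(csv.DictReader(csv_file))
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT OR IGNORE INTO neighbourhood (name) VALUES (?)",
            [(r["neighbourhood"],) for r in rows],
        )
        # Map each neighbourhood name to its generated primary key.
        ids = dict(conn.execute("SELECT name, neighbourhood_id FROM neighbourhood"))
        conn.executemany(
            "INSERT INTO listing (listing_id, neighbourhood_id, price) VALUES (?, ?, ?)",
            [(int(r["listing_id"]), ids[r["neighbourhood"]], float(r["price"])) for r in rows],
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
inserted = load_listings(conn, io.StringIO(CSV_TEXT))
n_neighbourhoods = conn.execute("SELECT COUNT(*) FROM neighbourhood").fetchone()[0]
n_listings = conn.execute("SELECT COUNT(*) FROM listing").fetchone()[0]
print(f"Inserted {inserted} listings across {n_neighbourhoods} neighbourhoods")
conn.close()
```

Inserting parent rows before child rows keeps the foreign-key references valid, and `executemany` avoids building SQL strings from data values.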


In [27]:
print("=" * 60)
print("STEP 3: Populating Database")
print("=" * 60)
print("This may take a few minutes...")

# Execute the populate_database notebook using nbclient (avoids event loop issues)
try:
    from nbclient import NotebookClient
    import nbformat
    import os
    
    print("Loading notebook...")
    with open(populate_notebook, 'r', encoding='utf-8') as f:
        nb = nbformat.read(f, as_version=4)
    
    # Set working directory to project root for relative paths
    original_cwd = os.getcwd()
    
    try:
        os.chdir(str(project_root))
        print(f"Changed working directory to: {project_root}")
        
        print("Executing notebook cells...")
        client = NotebookClient(nb, timeout=600, kernel_name='python3')
        client.execute()
        
        print("✓ Successfully populated database")
        
        # Show any important output from the last cell
        if nb.cells:
            last_cell = nb.cells[-1]
            if last_cell.cell_type == 'code' and last_cell.outputs:
                for output in last_cell.outputs[-3:]:  # Show last 3 outputs
                    if output.output_type == 'stream' and 'stdout' in output:
                        lines = output['text'].split('\n')
                        important = [l for l in lines if any(kw in l.lower() for kw in ['inserted', 'populated', 'success', 'error'])]
                        if important:
                            print("\n".join(important))
    finally:
        os.chdir(original_cwd)
                        
except ImportError:
    print("nbclient not available, trying alternative method...")
    # Fall back to a separate Python process. This only helps when the
    # interpreter found on PATH is a different environment with nbclient installed.
    import shutil
    import tempfile
    
    python_cmd = shutil.which("python3") or shutil.which("python") or "python3"
    
    # Create a simple script to execute the notebook.
    # Paths are embedded with repr() so quotes and backslashes stay valid.
    script_content = f"""
import sys
from pathlib import Path
sys.path.insert(0, str(Path({str(project_root)!r}).resolve()))

from nbclient import NotebookClient
import nbformat

notebook_path = Path({str(populate_notebook)!r}).resolve()
with open(notebook_path, 'r', encoding='utf-8') as f:
    nb = nbformat.read(f, as_version=4)

client = NotebookClient(nb, timeout=600, kernel_name='python3')
client.execute()
"""
    
    with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
        f.write(script_content)
        script_path = f.name
    
    try:
        env = os.environ.copy()
        env['PYTHONUNBUFFERED'] = '1'
        result = subprocess.run(
            [python_cmd, script_path],
            cwd=str(project_root),
            capture_output=True,
            text=True,
            check=True,
            env=env,
            timeout=600
        )
        print("✓ Successfully populated database")
        if result.stdout:
            lines = result.stdout.split('\n')
            important = [l for l in lines if any(kw in l.lower() for kw in ['inserted', 'populated', 'success', 'error'])]
            if important:
                print("\n".join(important[-10:]))
    except subprocess.CalledProcessError as e:
        print("✗ Error populating database:")
        if e.stderr:
            stderr_lines = e.stderr.split('\n')
            # Find and show error context
            for i, line in enumerate(stderr_lines):
                if 'Error' in line or 'Traceback' in line or 'Exception' in line:
                    start = max(0, i - 2)
                    end = min(len(stderr_lines), i + 15)
                    print("\n".join(stderr_lines[start:end]))
                    break
            else:
                print("\n".join(stderr_lines[-50:]))
        raise
    finally:
        os.unlink(script_path)
except Exception as e:
    print(f"✗ Error executing notebook: {e}")
    raise


STEP 3: Populating Database
This may take a few minutes...
Loading notebook...
Changed working directory to: /Users/anishj29/Desktop/Github Projects/Airbnb-Price-Predictor
Executing notebook cells...
✓ Successfully populated database


## Step 4: View Database

This step displays the database contents and statistics.
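
The statistics themselves are produced inside `view_database.ipynb`. As a small sketch of how such a summary can be built, the helper below lists every user table from `sqlite_master` and counts its rows. `table_counts` is a hypothetical name, and the demo tables and rows are made up.

```python
import sqlite3


def table_counts(conn):
    """Return {table_name: row_count} for every user table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    # Table names come from sqlite_master, not user input; quoted defensively.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE neighbourhood (name TEXT);
    CREATE TABLE listing (price REAL);
    INSERT INTO neighbourhood VALUES ('Downtown');
    INSERT INTO listing VALUES (120.0), (95.5);
    """
)
counts = table_counts(conn)
for name, count in counts.items():
    print(f"  {name:20s}: {count:>10,} rows")
conn.close()
```

Discovering tables from `sqlite_master` avoids hard-coding the table list, so the summary stays accurate if the schema gains tables.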


In [28]:
print("=" * 60)
print("STEP 4: Viewing Database")
print("=" * 60)

# Execute the view_database notebook using nbclient
try:
    from nbclient import NotebookClient
    import nbformat
    import os
    
    print("Loading notebook...")
    with open(view_notebook, 'r', encoding='utf-8') as f:
        nb = nbformat.read(f, as_version=4)
    
    # Set working directory to project root for relative paths
    original_cwd = os.getcwd()
    
    try:
        os.chdir(str(project_root))
        print(f"Changed working directory to: {project_root}")
        
        print("Executing notebook cells...")
        client = NotebookClient(nb, timeout=300, kernel_name='python3')
        client.execute()
        
        print("✓ Successfully generated database view")
        print(f"\n📊 View the results in: {view_notebook}")
    finally:
        os.chdir(original_cwd)
    
except ImportError:
    print("nbclient not available. Please install it:")
    print("  pip install nbclient")
    raise
except Exception as e:
    print(f"✗ Error viewing database: {e}")
    raise


STEP 4: Viewing Database
Loading notebook...
Changed working directory to: /Users/anishj29/Desktop/Github Projects/Airbnb-Price-Predictor
Executing notebook cells...
✓ Successfully generated database view

📊 View the results in: /Users/anishj29/Desktop/Github Projects/Airbnb-Price-Predictor/sql/view/view_database.ipynb


## Pipeline Complete! ✅

All steps have been executed successfully. The database is ready for use.

**Summary:**
- ✅ CSV files cleaned
- ✅ Database schema created
- ✅ Database populated with data
- ✅ Database view generated

**Next Steps:**
- Open `sql/view/view_database.ipynb` to explore the database
- Use the database for analysis or modeling


In [29]:
# Final verification
import sqlite3

db_path = project_root / "data" / "airbnb.db"

if db_path.exists():
    conn = sqlite3.connect(str(db_path))
    cur = conn.cursor()
    
    # Get table counts
    tables = ['neighbourhood', 'listing']
    print("\n📊 Final Database Statistics:")
    print("-" * 60)
    
    for table in tables:
        cur.execute(f"SELECT COUNT(*) FROM {table};")
        count = cur.fetchone()[0]
        print(f"  {table:20s}: {count:>10,} rows")
    
    cur.close()
    conn.close()
    
    print("\n✅ Pipeline completed successfully!")
    print(f"   Database location: {db_path}")
else:
    print("\n⚠️  Warning: Database file not found!")



📊 Final Database Statistics:
------------------------------------------------------------
  neighbourhood       :        230 rows
  listing             :     14,436 rows

✅ Pipeline completed successfully!
   Database location: /Users/anishj29/Desktop/Github Projects/Airbnb-Price-Predictor/data/airbnb.db
