# PyCaret comparison notebook

This notebook runs PyCaret's regression auto-experiment (`compare_models`) on the same preprocessed dataset used in the main notebook.

Goals:
- Load the preprocessed dataset (same as `rio_iqr` from the main notebook) or read `data_preprocessed.csv` if you exported it.
- Run PyCaret regression `setup` and `compare_models`.
- Save the PyCaret leaderboard to `pycaret_leaderboard.csv` for later merging with other results.

Notes:
- Do NOT run the install cells automatically inside a production kernel unless you understand the environment. Use the pip commands provided below in a terminal (venv) or in Colab.
- PyCaret can be installed and run inside this environment, but it is usually safer to install and run it in your own `.venv`.

## 1) Quick environment check
This cell prints Python and key package versions. Run it to confirm your environment before installing PyCaret.

In [None]:
import sys, importlib
# Collapse any newlines in the version string so it prints on one line
print('Python:', sys.version.replace('\n', ' '))
libs = ['pandas','numpy','sklearn','pycaret','torch']
for lib in libs:
    try:
        m = importlib.import_module(lib)
        print(lib, 'version ->', getattr(m, '__version__', 'unknown'))
    except Exception as e:
        print(lib, 'not installed or import failed:', e)

Python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
pandas not installed or import failed: No module named 'pandas'
numpy not installed or import failed: No module named 'numpy'
sklearn not installed or import failed: No module named 'sklearn'
pycaret not installed or import failed: No module named 'pycaret'
torch not installed or import failed: No module named 'torch'
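
Importing a package only to read its version can be slow for heavy libraries such as `torch`. As an alternative, the sketch below reads the installed distribution metadata with the standard-library `importlib.metadata`, without importing the packages. The helper name `installed_version` is hypothetical, and the list uses PyPI distribution names (so `scikit-learn` rather than `sklearn`).

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as published on PyPI (note `scikit-learn`, not `sklearn`)
for dist in ['pandas', 'numpy', 'scikit-learn', 'pycaret', 'torch']:
    print(dist, '->', installed_version(dist) or 'not installed')
```

This only reports what is installed; it does not confirm that the package imports cleanly, so the import-based check above is still worth running once after installation.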


## 2) Installation guidance (run in terminal / Colab)

If you're using a local virtualenv (.venv), run in your terminal (do NOT run inside a production notebook kernel unless you know what you're doing):

```bash
# Activate your venv first, then run:
pip install "pycaret[full]==3.0.0rc1"  # or a compatible stable version; the quotes stop shells such as zsh from expanding the brackets
pip install lightgbm xgboost catboost
```
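
Before running the commands above, it can help to confirm that the interpreter you are about to install into really belongs to a virtual environment. The sketch below uses only `sys`; the helper name `in_virtualenv` is hypothetical. It detects `venv`/`virtualenv` environments only, so a conda environment would report `False`.

```python
import sys


def in_virtualenv():
    """True when the running interpreter belongs to a venv/virtualenv."""
    # Inside a venv, sys.prefix points at the environment and
    # sys.base_prefix at the interpreter it was created from.
    return sys.prefix != getattr(sys, 'base_prefix', sys.prefix)


print('Interpreter:', sys.executable)
print('Inside a virtual environment:', in_virtualenv())
```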

For Google Colab use this cell (uncomment and run in Colab):

```python
# !pip install -q pycaret[full]==3.0.0rc1 lightgbm xgboost catboost
```

PyCaret 3 requires scikit-learn >= 1.2 and Python 3.8 to 3.11, so a Python 3.12 environment may not be compatible with the latest stable PyCaret. The exact bounds can differ between releases; check the compatibility notes for the version you pin before installing.
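
The Python-version part of that check can be sketched as a small guard. The bounds below are simply copied from the note above and are an assumption, not an authoritative statement of what any particular PyCaret release supports; the helper name `python_in_supported_range` is hypothetical.

```python
import sys

# Assumed bounds, copied from the compatibility note; adjust for the release you pin.
MIN_PY = (3, 8)
MAX_PY = (3, 11)


def python_in_supported_range(version=None, lo=MIN_PY, hi=MAX_PY):
    """Compare the (major, minor) part of a version tuple against inclusive bounds."""
    major_minor = tuple((version or sys.version_info)[:2])
    return lo <= major_minor <= hi


current = '.'.join(str(part) for part in sys.version_info[:3])
if python_in_supported_range():
    print('Python', current, 'is inside the assumed supported range')
else:
    print('Python', current, 'is outside the assumed supported range; check before installing')
```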

## 3) Load preprocessed data

This cell tries several fallbacks to load the same preprocessed dataframe used in your main notebook:
1) If you exported `rio_iqr` to `data_preprocessed.csv` in the main notebook, it will load it.
2) If you have the main notebook variables saved in a pickle `preprocessed_vars.pickle`, you can load them.
3) If you run this notebook inside the same kernel/session as the main notebook, you can import variables directly (not recommended).
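
The cell below only implements the first fallback. A minimal sketch of the second one (the pickle) could look like the following. It assumes, without confirmation from the main notebook, that `preprocessed_vars.pickle` holds a dict keyed by variable name with the dataframe stored under `'rio_iqr'`. The helper `load_pickled_vars` and the variable `df_from_pickle` are hypothetical names. Only unpickle files you created yourself, because unpickling can execute arbitrary code.

```python
import os
import pickle


def load_pickled_vars(path='preprocessed_vars.pickle'):
    """Return the unpickled object, or None when the file does not exist."""
    if not os.path.exists(path):
        return None
    with open(path, 'rb') as fh:
        return pickle.load(fh)


saved = load_pickled_vars()
# Assumption: the pickle is a dict that stores the dataframe under 'rio_iqr'
if isinstance(saved, dict) and 'rio_iqr' in saved:
    df_from_pickle = saved['rio_iqr']
    print('Loaded `rio_iqr` from pickle')
else:
    print('No usable pickle found; fall back to the CSV/parquet candidates')
```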

In [5]:
import os, pandas as pd
candidates = ['data_preprocessed.csv','rio_iqr.csv','rio_iqr.parquet']
df = None
for fn in candidates:
    if os.path.exists(fn):
        print('Loading', fn)
        df = pd.read_csv(fn) if fn.endswith('.csv') else pd.read_parquet(fn)
        break
if df is None:
    print('No local preprocessed file found. Please either:')
    print(' - Export `rio_iqr` from your main notebook to data_preprocessed.csv and re-run this cell')
    print(' - Or place the preprocessed CSV in the same folder as this notebook')
else:
    print('Loaded dataframe shape:', df.shape)
    display(df.head())

[0m[31mERROR: Exception:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pip/_internal/cli/base_command.py", line 165, in exc_logging_wrapper
    status = run_func(*args)
  File "/usr/lib/python3/dist-packages/pip/_internal/cli/req_command.py", line 205, in wrapper
    return func(self, options, args)
  File "/usr/lib/python3/dist-packages/pip/_internal/commands/install.py", line 389, in run
    to_install = resolver.get_installation_order(requirement_set)
  File "/usr/lib/python3/dist-packages/pip/_internal/resolution/resolvelib/resolver.py", line 188, in get_installation_order
    weights = get_topological_weights(
  File "/usr/lib/python3/dist-packages/pip/_internal/resolution/resolvelib/resolver.py", line 276, in get_topological_weights
    assert len(weights) == expected_node_count
AssertionError

ModuleNotFoundError: No module named 'pandas'

## 4) PyCaret setup and compare_models

The cell below demonstrates a standard PyCaret regression workflow. It expects the dataframe `df` to exist and the target column to be named `price` (change if your target is different).

IMPORTANT: Do not auto-install packages inside the kernel unless you accept the risk. If PyCaret isn't installed, follow the instructions in the installation cell and come back here.

In [1]:
# Run this after installing pycaret in your environment and ensuring `df` is loaded.
# The in-kernel install is left commented out on purpose (see the installation section):
# !pip install pycaret -q
try:
    from pycaret.regression import setup, compare_models, pull
except ImportError as e:
    print('PyCaret import failed:', e)
    raise

TARGET = 'price'
if 'df' not in globals() or df is None:
    raise RuntimeError('Dataframe `df` not loaded. Run the data-loading cell first or export `rio_iqr` to data_preprocessed.csv')

# Basic PyCaret setup. `silent=True` is a PyCaret 2.x argument that 3.x no longer accepts,
# so it is omitted here; `verbose=False` suppresses the setup summary.
s = setup(df, target=TARGET, session_id=42, fold=5, verbose=False)


PyCaret import failed: No module named 'pycaret'


ModuleNotFoundError: No module named 'pycaret'
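
The setup cell stops before the comparison and the leaderboard export listed in the goals. A minimal sketch of those remaining steps follows, assuming PyCaret 3.x is installed and `setup(...)` has already run in this session. The helper name `compare_and_save` is hypothetical, and sorting by `'RMSE'` is an illustrative choice rather than something the main notebook prescribes.

```python
def compare_and_save(out_path='pycaret_leaderboard.csv', sort='RMSE'):
    """Run compare_models on the active experiment and save the score grid.

    Assumes `setup(...)` has already been called in this session.
    """
    # Imported lazily so this cell can be defined even before PyCaret is installed
    from pycaret.regression import compare_models, pull

    best_model = compare_models(sort=sort)  # trains and cross-validates the available models
    leaderboard = pull()                    # last score grid, returned as a DataFrame
    leaderboard.to_csv(out_path, index=False)
    return best_model, leaderboard


# Usage, once the setup cell has run successfully:
# best, leaderboard = compare_and_save()
```

Writing the grid with `index=False` keeps the CSV limited to the model name and metric columns, which should make the later merge with other results simpler.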