# Health and Sleep Analysis: A Comparative Study with TabPFN

This notebook demonstrates and compares a traditional classification model (Logistic Regression) with the TabPFN (Prior-Data Fitted Network) classifier on the OpenML 'sleep' dataset (ID: 205). The primary goal is to showcase the end-to-end process from data loading and preprocessing to model training, evaluation, and comparison, highlighting TabPFN's capabilities on small tabular datasets.

In [3]:
from sklearn.impute import SimpleImputer
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report
from tabpfn import TabPFNClassifier # Corrected import name
import openml

### 1. Load the Dataset

We will load the 'sleep' dataset (ID: 205) from OpenML. This dataset was identified during the EDA phase. We will also separate features (X) and the target variable (y), which is 'danger_index'.

In [4]:
print("Loading 'sleep' dataset (ID: 205) from OpenML...")
dataset = openml.datasets.get_dataset(205, download_data=True, download_qualities=True, download_features_meta_data=True)

target_column = 'danger_index' 

X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format='dataframe',
    target=target_column 
)

df = X.copy()
if y is not None:
    df[target_column] = y

print("Dataset loaded successfully.")
print("First 5 rows of the combined DataFrame (for inspection):")
print(df.head())
print("\nShape of features X:", X.shape)
print("Shape of target y:", y.shape)
print("\nTarget variable column name:", target_column)
print("\nValue counts for the target variable:")
if y is not None:
    print(y.value_counts())
else:
    print("Target variable 'y' could not be loaded.")

categorical_features = [attribute_names[i] for i, is_cat in enumerate(categorical_indicator) 
                        if is_cat and attribute_names[i] in X.columns]
numerical_features = [attribute_names[i] for i, is_cat in enumerate(categorical_indicator) 
                      if not is_cat and attribute_names[i] in X.columns]

print(f"\nIdentified categorical features in X: {categorical_features}")
print(f"Identified numerical features in X: {numerical_features}")

print("\nMissing values in X before imputation:")
print(X.isnull().sum()[X.isnull().sum() > 0])
print("\nMissing values in y before imputation (if any):")
if y is not None:
    print(y.isnull().sum())
else:
    print("y is None")

Loading 'sleep' dataset (ID: 205) from OpenML...
Dataset loaded successfully.
First 5 rows of the combined DataFrame (for inspection):
   body_weight  brain_weight  max_life_span  gestation_time  predation_index  \
0     6654.000        5712.0           38.6           645.0                3   
1        1.000           6.6            4.5            42.0                3   
2        3.385          44.5           14.0            60.0                1   
3        0.920           5.7            NaN            25.0                5   
4     2547.000        4603.0           69.0           624.0                3   

   sleep_exposure_index  total_sleep  danger_index  
0                     5          3.3             3  
1                     1          8.3             3  
2                     1         12.5             1  
3                     2         16.5             3  
4                     5          3.9             4  

Shape of features X: (62, 7)
Shape of target y: (62,)

Target var

### 2. Handle Missing Values

We will impute missing values using `SimpleImputer` from scikit-learn.
- For **numerical features**, we'll use the 'median' strategy.
- For **categorical features**, we'll use the 'most_frequent' strategy.

In [5]:
numerical_imputer = SimpleImputer(strategy='median')
categorical_imputer = SimpleImputer(strategy='most_frequent')

X_imputed = X.copy()

if numerical_features:
    print("\nImputing numerical features...")
    X_imputed[numerical_features] = numerical_imputer.fit_transform(X[numerical_features])
else:
    print("\nNo numerical features to impute.")

if categorical_features:
    print("Imputing categorical features...")
    X_imputed[categorical_features] = categorical_imputer.fit_transform(X[categorical_features])
else:
    print("No categorical features to impute.")

print("\nMissing values in X after imputation:")
print(X_imputed.isnull().sum()[X_imputed.isnull().sum() > 0])

if y is not None and y.isnull().any():
    print(f"\nTarget variable '{target_column}' has {y.isnull().sum()} missing values.")
else:
    print(f"\nTarget variable '{target_column}' has no missing values or y is None.")



Imputing numerical features...
No categorical features to impute.

Missing values in X after imputation:
Series([], dtype: int64)

Target variable 'danger_index' has no missing values or y is None.
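
Note that the cell above fits the imputer on the full feature matrix, so rows that later land in the test set also influence the learned medians. One possible alternative, sketched here on made-up toy values rather than the real data, is to place `SimpleImputer` inside a `Pipeline` so that imputation statistics are learned from training rows only. The toy frames and the name `impute_then_scale` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): the test frame holds an extreme value
# that would shift the median if the imputer were fitted on all rows.
train = pd.DataFrame({"max_life_span": [4.0, np.nan, 10.0, 6.0]})
test = pd.DataFrame({"max_life_span": [np.nan, 100.0]})

impute_then_scale = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# fit_transform learns the median and the scaling from the training rows only.
train_out = impute_then_scale.fit_transform(train)
# transform reuses those training statistics on the test rows.
test_out = impute_then_scale.transform(test)

learned_median = impute_then_scale.named_steps["imputer"].statistics_[0]
print(learned_median)
```

In the notebook, the same idea would mean adding an imputer step to `numerical_pipeline` and passing the un-imputed `X` to `train_test_split`.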


### 3. Feature Engineering Pipeline

We'll create a `ColumnTransformer`, which applies different preprocessing steps to numerical and categorical features.
- **Numerical features**: Will be scaled using `StandardScaler` (after median imputation).
- **Categorical features**: Will be encoded using `OneHotEncoder` (after most_frequent imputation). `handle_unknown='ignore'` is used to prevent errors during transform if test data has new categories.

In [6]:
numerical_pipeline = Pipeline([
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ],
    remainder='passthrough' 
)

print("ColumnTransformer created successfully.")
print("Numerical features for transformer:", numerical_features)
print("Categorical features for transformer:", categorical_features)


ColumnTransformer created successfully.
Numerical features for transformer: ['body_weight', 'brain_weight', 'max_life_span', 'gestation_time', 'predation_index', 'sleep_exposure_index', 'total_sleep']
Categorical features for transformer: []


### 4. Split the Data

The dataset (features `X_imputed` and target `y`) will be split into training and testing sets. We'll use an 80/20 split, stratified on the target, and set a `random_state` for reproducibility.

In [7]:
if y is not None:
    X_train, X_test, y_train, y_test = train_test_split(
        X_imputed, y, 
        test_size=0.2, 
        random_state=42, 
        stratify=y if y.nunique() > 1 else None 
    )
    print("Data split into training and testing sets.")
    print("X_train shape:", X_train.shape)
    print("X_test shape:", X_test.shape)
    print("y_train shape:", y_train.shape)
    print("y_test shape:", y_test.shape)
else:
    print("Cannot split data as target variable y is not available.")
    X_train, X_test, y_train, y_test = None, None, None, None


Data split into training and testing sets.
X_train shape: (49, 7)
X_test shape: (13, 7)
y_train shape: (49,)
y_test shape: (13,)
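
The 49/13 split follows from the test size being rounded up. The sketch below reproduces that arithmetic and shows, on made-up toy labels, how `stratify` carries the class ratio into both subsets. The toy arrays are assumptions for illustration only.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

# The test size is rounded up: ceil(0.2 * 62) = 13, leaving 49 training rows.
n_test = math.ceil(0.2 * 62)

# Toy labels (hypothetical): 30 rows of class 0 and 20 rows of class 1.
y_toy = np.array([0] * 30 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, the 60/40 class ratio carries over to both subsets.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(n_test, train_counts, test_counts)
```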


### 5. Apply Preprocessing

The `ColumnTransformer` (preprocessor) will be fitted on the training data (`X_train`) only, to prevent data leakage from the test set. Then, both `X_train` and `X_test` will be transformed.

In [8]:
if X_train is not None:
    print("Fitting preprocessor on X_train and transforming X_train...")
    X_train_processed = preprocessor.fit_transform(X_train)
    print("Transforming X_test...")
    X_test_processed = preprocessor.transform(X_test)

    try:
        feature_names_out = preprocessor.get_feature_names_out()
        X_train_processed_df = pd.DataFrame(X_train_processed, columns=feature_names_out, index=X_train.index)
        X_test_processed_df = pd.DataFrame(X_test_processed, columns=feature_names_out, index=X_test.index)
        
        print("\nShape of X_train_processed:", X_train_processed_df.shape)
        print("First 5 rows of X_train_processed_df:")
        print(X_train_processed_df.head())
        
        print("\nShape of X_test_processed:", X_test_processed_df.shape)
    except Exception as e:
        print(f"Could not get feature names out or convert to DataFrame: {e}")
        print("X_train_processed and X_test_processed are likely NumPy arrays.")
        print("\nShape of X_train_processed (array):", X_train_processed.shape)
        print("Shape of X_test_processed (array):", X_test_processed.shape)
else:
    print("Skipping preprocessing application as X_train is not available.")
    X_train_processed, X_test_processed = None, None



Fitting preprocessor on X_train and transforming X_train...
Transforming X_test...

Shape of X_train_processed: (49, 7)
First 5 rows of X_train_processed_df:
    num__body_weight  num__brain_weight  num__max_life_span  \
4           2.311033           4.210652            3.665551   
34         -0.239046          -0.297476           -0.230886   
24         -0.154061           0.018093            1.641428   
37         -0.239120          -0.300093           -1.177886   
10         -0.238743          -0.294144           -0.816436   

    num__gestation_time  num__predation_index  num__sleep_exposure_index  \
4              3.245179              0.144338                   1.564490   
34            -0.759153             -0.562917                  -0.908413   
24             1.128411             -1.270171                   0.328038   
37            -0.759153              0.851592                  -0.908413   
10            -0.206366              1.558846                   0.946264   

    nu

### 6. Traditional Model: Logistic Regression

We will now train a traditional classification model, Logistic Regression, using the preprocessed training data. This will serve as a baseline model.

In [9]:
# Ensure X_train_processed and y_train are available from the previous preprocessing steps
if 'X_train_processed' in locals() and 'y_train' in locals() and X_train_processed is not None and y_train is not None:
    print("Training Logistic Regression model...")
    
    # Initialize the baseline. max_iter is raised to give the solver room to converge;
    # 'liblinear' is set explicitly and handles multi-class targets one-vs-rest.
    lr_model = LogisticRegression(random_state=42, max_iter=1000, solver='liblinear')

    # Train on the NumPy array returned by preprocessor.fit_transform
    # (X_train_processed_df holds the same values with column names).
    lr_model.fit(X_train_processed, y_train)
    
    print("Logistic Regression model trained successfully.")
    print("Model details:", lr_model)
else:
    print("Skipping Logistic Regression model training as X_train_processed or y_train are not available.")
    lr_model = None # Ensure lr_model exists even if training is skipped


Training Logistic Regression model...
Logistic Regression model trained successfully.
Model details: LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')


### 7. TabPFN Model

Next, we will train a `TabPFNClassifier`. TabPFN is a pre-trained model that can achieve good performance on small tabular datasets without extensive hyperparameter tuning. It's designed to be efficient for datasets up to a certain size (typically around 1000 samples, 100 features, and 10 classes).

In [10]:
# Ensure X_train_processed and y_train are available from the previous preprocessing steps
if 'X_train_processed' in locals() and 'y_train' in locals() and X_train_processed is not None and y_train is not None:
    print("Training TabPFN model...")
    
    # Initialize TabPFNClassifier.
    # device='cpu' for broader compatibility; use 'cuda' if a GPU is available.
    # N_ensemble_configurations is not passed: the installed TabPFN release rejects it
    # with a TypeError, so the library's default ensemble size is used instead.
    # TabPFN has constraints on dataset size (check the documentation for specifics).
    # The sleep dataset is small (49 training samples after the 80/20 split, 7 features) and should fit well.
    print(f"Shape of X_train_processed for TabPFN: {X_train_processed.shape}")
    print(f"Number of unique classes in y_train for TabPFN: {y_train.nunique()}")

    if X_train_processed.shape[0] > 1000 or X_train_processed.shape[1] > 100 or y_train.nunique() > 10:
        print("Warning: The dataset dimensions might exceed TabPFN's typical optimal range (1000 samples, 100 features, 10 classes).")
        print("Performance might vary, or it might require specific configurations if using a larger pre-trained model variant.")

    tabpfn_model = TabPFNClassifier(device='cpu')

    # Train the model.
    # X_train_processed from the ColumnTransformer is a NumPy array; y_train is a pandas Series.
    tabpfn_model.fit(X_train_processed, y_train)

    print("TabPFN model trained successfully.")
    print("Model details:", tabpfn_model)
else:
    print("Skipping TabPFN model training as X_train_processed or y_train are not available.")
    tabpfn_model = None # Ensure tabpfn_model exists even if training is skipped


Training TabPFN model...
Shape of X_train_processed for TabPFN: (49, 7)
Number of unique classes in y_train for TabPFN: 5
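
Constructor keyword arguments can differ between TabPFN releases; `N_ensemble_configurations`, for example, is not accepted by every version. One way to guard against this is to check the constructor's signature before passing optional arguments. The sketch below is illustrative only: `supported_kwargs` is a hypothetical helper, and `StandInClassifier` is a made-up stand-in class, not the TabPFN API.

```python
import inspect


def supported_kwargs(cls, **candidates):
    """Keep only the keyword arguments that cls.__init__ accepts."""
    params = inspect.signature(cls.__init__).parameters
    # A **kwargs parameter means the constructor accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(candidates)
    return {name: value for name, value in candidates.items() if name in params}


class StandInClassifier:
    """Hypothetical stand-in for an estimator; not the TabPFN API."""

    def __init__(self, device="cpu", n_estimators=4):
        self.device = device
        self.n_estimators = n_estimators


# The unsupported keyword is filtered out instead of raising a TypeError.
kwargs = supported_kwargs(
    StandInClassifier, device="cpu", N_ensemble_configurations=32
)
model = StandInClassifier(**kwargs)
print(kwargs)
```

Passing `TabPFNClassifier` in place of the stand-in would apply the same check to the installed release. Dropping an argument silently means the library default is used, so it is worth printing the filtered dictionary as done here.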

### Logistic Regression Evaluation

We'll evaluate the performance of the trained Logistic Regression model on the test set (`X_test_processed` and `y_test`).

In [None]:
# Ensure lr_model, X_test_processed, and y_test are available
if (
    'lr_model' in locals() and lr_model is not None
    and 'X_test_processed' in locals() and X_test_processed is not None
    and 'y_test' in locals() and y_test is not None
):
    
    print("Evaluating Logistic Regression model on the test set...")
    y_pred_lr = lr_model.predict(X_test_processed)
    
    lr_accuracy = accuracy_score(y_test, y_pred_lr)
    lr_f1_weighted = f1_score(y_test, y_pred_lr, average='weighted', zero_division=0)
    
    print(f"\nLogistic Regression - Accuracy: {lr_accuracy:.4f}")
    print(f"Logistic Regression - F1 Score (Weighted): {lr_f1_weighted:.4f}")
    
    print("\nLogistic Regression - Classification Report:")
    # Report on the labels present in the test set; passing labels= keeps
    # target_names aligned even if a prediction falls outside that set.
    class_labels_lr = np.unique(y_test)
    print(classification_report(y_test, y_pred_lr, labels=class_labels_lr,
                                target_names=class_labels_lr.astype(str), zero_division=0))
else:
    print("Skipping Logistic Regression evaluation as the model or test data is not available.")
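
The weighted F1 score used above averages the per-class F1 scores with weights equal to each class's support (its number of true instances). A minimal sketch on made-up toy labels, not the notebook's data, checks that definition against scikit-learn:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels (hypothetical), three classes with supports 3, 2 and 1.
y_true = np.array([1, 1, 1, 2, 2, 3])
y_hat = np.array([1, 1, 2, 2, 2, 1])

labels = np.unique(y_true)
per_class_f1 = f1_score(y_true, y_hat, labels=labels, average=None, zero_division=0)
support = np.array([(y_true == label).sum() for label in labels])

# Weighted F1: per-class F1 averaged with weights equal to class support.
manual_weighted = float((per_class_f1 * support).sum() / support.sum())
sklearn_weighted = f1_score(y_true, y_hat, average="weighted", zero_division=0)
print(manual_weighted, sklearn_weighted)
```

Because the weights follow class frequency, a class that is never predicted correctly lowers the score in proportion to how common it is. A macro average would weight all classes equally instead.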


### TabPFN Model Evaluation

Now, let's evaluate the performance of the trained TabPFN model on the same test set.

In [None]:
# Ensure tabpfn_model, X_test_processed, and y_test are available
if (
    'tabpfn_model' in locals() and tabpfn_model is not None
    and 'X_test_processed' in locals() and X_test_processed is not None
    and 'y_test' in locals() and y_test is not None
):
    
    print("Evaluating TabPFN model on the test set...")
    y_pred_tabpfn = tabpfn_model.predict(X_test_processed)
    
    tabpfn_accuracy = accuracy_score(y_test, y_pred_tabpfn)
    tabpfn_f1_weighted = f1_score(y_test, y_pred_tabpfn, average='weighted', zero_division=0)
    
    print(f"\nTabPFN Model - Accuracy: {tabpfn_accuracy:.4f}")
    print(f"TabPFN Model - F1 Score (Weighted): {tabpfn_f1_weighted:.4f}")
    
    print("\nTabPFN Model - Classification Report:")
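    # Report on the labels present in the test set; passing labels= keeps
    # target_names aligned even if a prediction falls outside that set.
    class_labels_tabpfn = np.unique(y_test)
    print(classification_report(y_test, y_pred_tabpfn, labels=class_labels_tabpfn,
                                target_names=class_labels_tabpfn.astype(str), zero_division=0))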
else:
    print("Skipping TabPFN model evaluation as the model or test data is not available.")


### 8. Model Comparison Summary

Based on the evaluation metrics (accuracy, F1-score, and classification reports) from the preceding cells, we can compare the performance of Logistic Regression and TabPFN on the 'sleep' dataset.

*[Insert observations here after running the notebook. Key points to consider include:
- Which model achieved higher accuracy and F1-score (weighted or macro)?
- Were there significant differences in performance for specific classes (see classification reports)?
- How does the training time (not explicitly measured here, but can be inferred) compare? TabPFN is pre-trained but inference still takes time.
- Considering TabPFN requires minimal tuning, how does its out-of-the-box performance stack up against the baseline Logistic Regression?
Actual results will be filled in when the notebook is executed.]*
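
Since training and inference time are not measured in the notebook, one simple way to collect them is a small wall-clock wrapper. The sketch below uses synthetic data of roughly the same size as the training split; `timed` is a hypothetical helper, and the same wrapper could be applied to `tabpfn_model.fit` and `tabpfn_model.predict`.

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def timed(fn, *args, **kwargs):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Synthetic stand-in data, roughly the size of the training split.
X_demo, y_demo = make_classification(
    n_samples=49, n_features=7, n_informative=4, random_state=0
)

demo_model = LogisticRegression(max_iter=1000)
_, fit_seconds = timed(demo_model.fit, X_demo, y_demo)
_, predict_seconds = timed(demo_model.predict, X_demo)
print(f"fit: {fit_seconds:.4f}s, predict: {predict_seconds:.4f}s")
```

A single call is a noisy measurement on data this small, so repeating the call and reporting the median would give a steadier figure.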

**Placeholder for Results:**

-   **Logistic Regression:**
    -   Accuracy: `[To be filled from output]`
    -   F1 Score (Weighted): `[To be filled from output]`
-   **TabPFN Model:**
    -   Accuracy: `[To be filled from output]`
    -   F1 Score (Weighted): `[To be filled from output]`
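
Once both models have produced predictions, the placeholders above could be filled from a small summary table. The following sketch uses made-up toy labels and predictions (not results from this notebook); `summarize` is a hypothetical helper.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score


def summarize(y_true, predictions):
    """Build one row of accuracy and weighted F1 per model."""
    rows = {
        name: {
            "accuracy": accuracy_score(y_true, y_pred),
            "f1_weighted": f1_score(y_true, y_pred, average="weighted", zero_division=0),
        }
        for name, y_pred in predictions.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy labels and predictions (hypothetical, not results from this notebook).
y_true_demo = [1, 1, 2, 2, 3, 3]
summary = summarize(
    y_true_demo,
    {"model_a": [1, 1, 2, 2, 3, 3], "model_b": [1, 2, 2, 2, 3, 1]},
)
print(summary)
```

In the notebook, the call would take `y_test` and a dictionary such as `{"Logistic Regression": y_pred_lr, "TabPFN": y_pred_tabpfn}`.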

**Further Considerations:**
- For this small dataset (Sleep - ID 205), were the results as expected?
- Would hyperparameter tuning for Logistic Regression potentially change the outcome? (TabPFN generally doesn't require it).
- How might these models perform on larger, more complex datasets?
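
To explore the tuning question for Logistic Regression, a small cross-validated grid search over the regularization strength `C` is one option. The sketch below runs on synthetic stand-in data; the grid values and the fold count are assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with X_train and y_train in the notebook.
X_demo, y_demo = make_classification(
    n_samples=60, n_features=7, n_informative=4, n_classes=3, random_state=0
)

# Scaling lives inside the pipeline so each fold is scaled on its own training part.
search = GridSearchCV(
    estimator=Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]),
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1_weighted",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

On the real training split, the smallest class may contain very few rows, which limits how many stratified folds are practical.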

### 9. Conclusion

This notebook walked through a complete workflow for a multi-class classification task (five `danger_index` classes in the training split) using the OpenML 'sleep' dataset (ID: 205). We performed the following key steps:
1.  **Data Loading**: Fetched the dataset from OpenML.
2.  **Preprocessing**: Handled missing values via median imputation, identified categorical and numerical features, and built a `ColumnTransformer` with feature scaling (StandardScaler) and encoding (OneHotEncoder). All seven features turned out to be numerical, so only the scaling branch had columns to transform.
3.  **Data Splitting**: Divided the data into training and testing sets.
4.  **Model Training**:
    *   Trained a traditional Logistic Regression model as a baseline.
    *   Trained a TabPFNClassifier, leveraging its pre-trained capabilities for tabular data.
5.  **Model Evaluation**: Assessed both models on the test set using accuracy, F1-score (weighted), and detailed classification reports.

Once the evaluation cells have been run, their results will illustrate the comparative performance of these two distinct approaches. TabPFN often provides a strong, quick-to-train baseline, particularly effective for smaller datasets like the one used here, without requiring extensive hyperparameter tuning. Logistic Regression, while simpler, provides a well-understood benchmark.

This exercise highlights the utility of TabPFN as a valuable tool in the data scientist's toolkit for rapidly developing effective models on tabular data, alongside traditional, interpretable models like Logistic Regression.