# Out-of-Core Machine Learning with Vaex

## 1. What is Out-of-Core Machine Learning?

Out-of-core machine learning refers to techniques that allow you to train models on datasets that are too large to fit entirely in your system’s RAM. Instead of loading the whole dataset into memory, out-of-core methods process the data in smaller chunks or batches. This approach is especially useful for:

- **Big Data:** When datasets exceed available memory.
- **Streaming Data:** For real-time data that arrives continuously.
- **Resource Efficiency:** Reducing memory footprint and enabling scalability.
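
To make the chunk-by-chunk idea concrete, here is a minimal sketch of out-of-core training in plain Python. It is not taken from Vaex or any other library: `stream_rows`, `iter_chunks`, and `fit_out_of_core` are hypothetical helper names, the data is synthetic, and the model is a one-feature linear regression updated with one mini-batch gradient step per chunk. Only one chunk is held in memory at any time.

```python
import random
from itertools import islice

def stream_rows(n_rows, seed=0):
    """Yield (x, y) pairs one at a time, standing in for rows read from disk."""
    rng = random.Random(seed)
    for _ in range(n_rows):
        x = rng.uniform(-1.0, 1.0)
        # Hypothetical ground truth for the demo: y = 3x + 0.5 plus small noise.
        y = 3.0 * x + 0.5 + rng.gauss(0.0, 0.01)
        yield x, y

def iter_chunks(rows, chunk_size):
    """Group a row iterator into lists of at most chunk_size rows."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk

def fit_out_of_core(rows, chunk_size=500, lr=0.5):
    """Fit y = w*x + b with one mini-batch gradient step per chunk."""
    w, b = 0.0, 0.0
    for chunk in iter_chunks(rows, chunk_size):
        grad_w = grad_b = 0.0
        for x, y in chunk:
            err = (w * x + b) - y
            grad_w += err * x
            grad_b += err
        # Average the gradient over the chunk, then update the parameters.
        w -= lr * grad_w / len(chunk)
        b -= lr * grad_b / len(chunk)
    return w, b

w, b = fit_out_of_core(stream_rows(100_000))
print(f"w={w:.2f}, b={b:.2f}")
```

The same loop shape applies when the rows come from a file reader or a database cursor instead of a generator: the model state (`w`, `b`) persists across chunks while the data does not.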

## 2. How Vaex Supports Out-of-Core ML

[Vaex](https://vaex.io/) is a high-performance Python library designed for out-of-core DataFrame operations and big data analytics. It enables efficient data manipulation, transformation, and visualization on datasets that are larger than your machine's memory. Key features include:

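- **Memory Mapping:** Reads data from binary files (such as HDF5 or Arrow) on demand instead of loading the whole file into memory.
- **Lazy Evaluation:** Expressions and derived (virtual) columns are evaluated only when a result is requested, so intermediate results are not materialized in memory.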
- **Fast Aggregations:** Optimized for computing statistical summaries, groupbys, and filtering on large datasets.
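
The memory-mapping idea can be illustrated without Vaex, using only Python's standard `mmap` and `struct` modules. This is a simplified sketch, not how Vaex is implemented: the file layout (a flat sequence of little-endian float64 values) and the names `column.bin`, `N_VALUES`, and `ITEM` are assumptions made for the demo.

```python
import mmap
import os
import struct
import tempfile

N_VALUES = 100_000
ITEM = struct.Struct("<d")  # one little-endian float64 per value

# Write a column of float64 values to a binary file (hypothetical layout).
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    for i in range(N_VALUES):
        f.write(ITEM.pack(float(i)))

# Map the file instead of reading it; the OS loads pages only when accessed.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    n_mapped = len(mm) // ITEM.size

    # Random access touches only the bytes at this offset.
    value_at_12345 = ITEM.unpack_from(mm, 12_345 * ITEM.size)[0]

    # A lazy generator: no value is decoded until sum() pulls it.
    lazy_values = (ITEM.unpack_from(mm, i * ITEM.size)[0] for i in range(n_mapped))
    total = sum(lazy_values)

print(n_mapped, value_at_12345, total)
```

Reading a single value does not require scanning the file, and the full-column sum streams through the mapping without building a list of all values.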

## 3. Practical Implementation Using Vaex

Below is an example of out-of-core data processing with Vaex. We generate a synthetic CSV file to stand in for a large dataset, load it with Vaex, apply some transformations, and compute aggregations without holding the full dataset in RAM during the Vaex steps.

### Example: Out-of-Core Data Processing with Vaex

```python
import vaex
import numpy as np
import pandas as pd

# For demonstration, we'll create a large synthetic CSV file.
# In practice, you would have your own large CSV file.
# Create a DataFrame with 10 million rows and 5 columns
n_rows = 10_000_000
data = {
    'feature1': np.random.rand(n_rows),
    'feature2': np.random.rand(n_rows),
    'feature3': np.random.randint(0, 100, n_rows),
    'feature4': np.random.normal(0, 1, n_rows),
    'category': np.random.choice(['A', 'B', 'C'], n_rows)
}
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file (simulate a large file)
csv_path = 'large_dataset.csv'
df.to_csv(csv_path, index=False)

# Load the large CSV file with Vaex.
# With convert=True, Vaex reads the CSV in chunks of chunk_size rows and writes an
# HDF5 copy; that binary file is memory-mapped and reused on subsequent loads.
df_vaex = vaex.from_csv(csv_path, convert=True, chunk_size=1_000_000)

# Preview the dataset: printing a Vaex DataFrame shows its first and last rows.
# (df_vaex.info() displays its own output and returns None, so print() would show "None".)
print(df_vaex)

# Perform some out-of-core operations:
# 1. Filter rows where feature3 is greater than 50
df_filtered = df_vaex[df_vaex.feature3 > 50]

# 2. Compute the mean of feature1 and feature4 for each category
aggregated = df_filtered.groupby('category', agg={
    'mean_feature1': vaex.agg.mean('feature1'),
    'mean_feature4': vaex.agg.mean('feature4')
})
print(aggregated)

# 3. Add a new computed column (e.g., ratio of feature1 to feature2)
df_vaex['ratio'] = df_vaex.feature1 / (df_vaex.feature2 + 1e-6)  # Avoid division by zero

# Compute the mean ratio. The 'ratio' column is virtual (stored as an expression);
# calling .mean() is what triggers the pass over the data.
mean_ratio = df_vaex['ratio'].mean()
print("Mean ratio of feature1/feature2:", mean_ratio)
```

### Explanation

- **Creating & Saving Data:**  
  We simulate a large dataset by generating a pandas DataFrame with 10 million rows and saving it to CSV. This step does hold the data in memory and exists only to produce a test file; in real scenarios, you would already have a large file on disk.

- **Loading Data with Vaex:**  
  With `chunk_size` set, `vaex.from_csv()` reads the CSV in chunks rather than all at once. The `convert=True` parameter writes the data to a binary HDF5 file, which Vaex memory-maps and which speeds up subsequent loads.

- **Filtering and Aggregation:**  
  We filter the data and perform a group-by aggregation on a categorical column. Vaex evaluates both in chunks, so the entire dataset is never held in memory at once.

- **Lazy Computation:**  
  The computed column (`ratio`) is stored as an expression rather than as data, so defining it costs almost nothing. The values are evaluated only when a result is requested, here when `.mean()` is called.
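
The filter-then-aggregate pattern described above can also be sketched in plain Python to show why it needs so little memory. This is an illustration of the idea, not Vaex's implementation: the tiny in-memory CSV and the helper names `read_rows`, `keep_feature3_above`, and `grouped_mean` are hypothetical. The generators form a lazy pipeline, and the aggregation keeps only a running sum and count per category.

```python
import csv
import io
from collections import defaultdict

# A tiny in-memory CSV standing in for a file too large to load at once.
CSV_TEXT = """feature1,feature3,category
0.2,60,A
0.4,10,A
0.6,80,A
0.1,55,B
0.9,99,B
0.5,51,C
"""

def read_rows(handle):
    """Lazily yield one parsed row at a time."""
    for row in csv.DictReader(handle):
        yield row["category"], float(row["feature1"]), int(row["feature3"])

def keep_feature3_above(rows, threshold):
    """Lazily drop rows whose feature3 is not above the threshold."""
    return (r for r in rows if r[2] > threshold)

def grouped_mean(rows):
    """Single pass over the rows, keeping only a sum and count per category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, feature1, _ in rows:
        sums[category] += feature1
        counts[category] += 1
    return {c: sums[c] / counts[c] for c in sums}

# Building the pipeline reads nothing; rows flow only once grouped_mean iterates.
pipeline = keep_feature3_above(read_rows(io.StringIO(CSV_TEXT)), 50)
means = grouped_mean(pipeline)
print(means)
```

Memory use here depends on the number of categories, not the number of rows, which is the property that makes this kind of aggregation workable on very large files.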

## 4. Conclusion

Out-of-core machine learning is crucial for working with datasets that exceed memory limits. Vaex provides an efficient, scalable solution by combining memory mapping and lazy evaluation to process data in chunks. With Vaex, you can perform complex transformations, filtering, and aggregations on big data on a single machine, without needing enough RAM to hold the whole dataset.

> **Reference:**
> - NYC Cab Dataset Project - https://vaex.io/blog/ml-impossible-train-a-1-billion-sample-model-in-20-minutes-with-vaex-and-scikit-learn-on-your