# Dataset Comparison: OHLC vs Lunar Feature Sets

## Section 1: Thesis Context

This notebook accompanies a thesis on **quaternion-valued neural networks for cryptocurrency price prediction**. The core idea is to encode 4-dimensional feature vectors as quaternions and process them through quaternion-valued LSTM networks.

All data originates from the **LunarCrush API v2**, which provides 18 columns of social, market, and proprietary score data for major cryptocurrencies. From these 18 columns, we define two distinct 4-feature subsets:

| Config | Features | Purpose |
|--------|----------|---------|
| **OHLC** | close, high, low, open | Pure price action — traditional financial time-series |
| **Lunar** | interactions, sentiment, close, galaxy_score | Social-augmented — blends market and crowd signals |

Both subsets have exactly 4 features, one for each basis element of a quaternion ($1, \mathbf{i}, \mathbf{j}, \mathbf{k}$).

This notebook loads real cached data and walks through each dataset configuration to explain what they contain, why those features were selected, and how they differ.

In [None]:
import os
import glob

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display  # display() is used in later cells; import it explicitly

# Column definitions (from src/data/lunarcrush_api.py lines 22-41)
LUNARCRUSH_ALL_COLUMNS = [
    "contributors_active",    # 0
    "contributors_created",   # 1
    "interactions",           # 2
    "posts_active",           # 3
    "posts_created",          # 4
    "sentiment",              # 5
    "spam",                   # 6
    "alt_rank",               # 7
    "circulating_supply",     # 8
    "close",                  # 9  <- target
    "galaxy_score",           # 10
    "high",                   # 11
    "low",                    # 12
    "market_cap",             # 13
    "market_dominance",       # 14
    "open",                   # 15
    "social_dominance",       # 16
    "volume_24h",             # 17
]

# Feature index constants from the two config files
OHLC_FEATURE_COLS = [9, 11, 12, 15]   # close, high, low, open
LUNAR_FEATURE_COLS = [2, 5, 9, 10]    # interactions, sentiment, close, galaxy_score

OHLC_FEATURES = [LUNARCRUSH_ALL_COLUMNS[i] for i in OHLC_FEATURE_COLS]
LUNAR_FEATURES = [LUNARCRUSH_ALL_COLUMNS[i] for i in LUNAR_FEATURE_COLS]

CACHE_DIR = os.path.join('..', 'data', 'cache')

print(f"OHLC features:  {OHLC_FEATURES}  (indices {OHLC_FEATURE_COLS})")
print(f"Lunar features: {LUNAR_FEATURES}  (indices {LUNAR_FEATURE_COLS})")

---
## Section 2: LunarCrush API Overview

### What is LunarCrush?

[LunarCrush](https://lunarcrush.com) is a social intelligence platform for cryptocurrency markets. It aggregates data from major social media platforms (Twitter/X, Reddit, YouTube, TikTok, news sites, etc.) and combines it with on-chain and market data to produce real-time metrics.

### What the Time Series API provides

The **Time Series v2** endpoint returns 18 columns per coin per time bucket (hourly or daily). These columns fall into three categories:

1. **Social engagement metrics**: raw counts of posts, interactions, contributors, spam
2. **Price & market data** (labelled `Price` and `Market` in the reference table): OHLC prices, market cap, volume, supply, dominance
3. **Proprietary composite scores**: LunarCrush's own derived indicators (galaxy_score, alt_rank, sentiment)

### Full 18-Column Reference

| Index | Column | Category | Description |
|------:|--------|----------|-------------|
| 0 | `contributors_active` | Social | Unique users actively posting about this coin |
| 1 | `contributors_created` | Social | New contributors posting for the first time |
| 2 | `interactions` | Social | Total engagements (likes, shares, comments) across all posts |
| 3 | `posts_active` | Social | Posts that received engagement in this period |
| 4 | `posts_created` | Social | New posts created in this period |
| 5 | `sentiment` | Proprietary | Crowd sentiment score (0–100 scale) |
| 6 | `spam` | Social | Posts flagged as spam |
| 7 | `alt_rank` | Proprietary | Relative social rank among all tracked coins (lower = more buzz) |
| 8 | `circulating_supply` | Market | Circulating token supply |
| 9 | `close` | Price | Closing price (USD) for the period |
| 10 | `galaxy_score` | Proprietary | Composite score blending social activity, sentiment, and market data (0–100) |
| 11 | `high` | Price | Highest price (USD) during the period |
| 12 | `low` | Price | Lowest price (USD) during the period |
| 13 | `market_cap` | Market | Market capitalization (USD) |
| 14 | `market_dominance` | Market | Percentage of total crypto market cap |
| 15 | `open` | Price | Opening price (USD) for the period |
| 16 | `social_dominance` | Social | Percentage of total social volume attributed to this coin |
| 17 | `volume_24h` | Market | 24-hour trading volume (USD) |
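
The category labels in this table can also be kept as a small lookup, which makes it easy to pull out every column of one kind. This is only an illustrative sketch: `COLUMN_CATEGORY` and `columns_in` are hypothetical names and are not part of the project code.

```python
# Category labels copied from the reference table above.
# COLUMN_CATEGORY and columns_in are illustrative names, not project code.
COLUMN_CATEGORY = {
    "contributors_active": "Social",
    "contributors_created": "Social",
    "interactions": "Social",
    "posts_active": "Social",
    "posts_created": "Social",
    "sentiment": "Proprietary",
    "spam": "Social",
    "alt_rank": "Proprietary",
    "circulating_supply": "Market",
    "close": "Price",
    "galaxy_score": "Proprietary",
    "high": "Price",
    "low": "Price",
    "market_cap": "Market",
    "market_dominance": "Market",
    "open": "Price",
    "social_dominance": "Social",
    "volume_24h": "Market",
}


def columns_in(category):
    """Return the column names that the table assigns to `category`."""
    return [name for name, cat in COLUMN_CATEGORY.items() if cat == category]


print(columns_in("Price"))
print(columns_in("Proprietary"))
```

With the table as written, the `Price` group is exactly the OHLC feature set, while the Lunar feature set draws on the `Social`, `Proprietary`, and `Price` groups.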

In [None]:
# Load the full 18-column BTC daily dataset
csv_path = os.path.join(CACHE_DIR, 'lunarcrush_btc_day_full.csv')
df_full = pd.read_csv(csv_path, index_col='Datetime', parse_dates=True)

print(f"Shape: {df_full.shape}")
print(f"Date range: {df_full.index.min().date()} to {df_full.index.max().date()}")
print(f"Columns ({len(df_full.columns)}): {list(df_full.columns)}")

In [None]:
# First 5 rows of the full dataset
df_full.head()

In [None]:
# Descriptive statistics (transposed for readability)
df_full.describe().T.style.format('{:,.2f}')

---
## Section 3: OHLC Dataset

### Configuration

From `configs/data/daily/btc_ohlc.yaml`:

```yaml
feature_cols: [9, 11, 12, 15]   # close, high, low, open
target_col: 0                    # close (index 0 after selection)
window_size: 20
```

This configuration selects the four standard OHLC price columns. The **target** is `close` (the first column after feature selection, index 0).
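
To make the three config keys concrete, the sketch below shows one plausible way they could turn the 18-column table into training samples. It is not the project's data loader: `make_windows` is a hypothetical helper, the input is synthetic, and predicting the value one step after each window is an assumption of this example.

```python
import numpy as np


def make_windows(data, feature_cols, target_col, window_size):
    """Slice a (T, n_columns) array into (X, y) pairs.

    X[n] holds `window_size` consecutive rows of the selected features;
    y[n] is the target feature on the row right after that window.
    One-step-ahead targets are an assumption of this sketch.
    """
    selected = data[:, feature_cols]            # (T, 4) after selection
    n_samples = selected.shape[0] - window_size
    X = np.stack([selected[i:i + window_size] for i in range(n_samples)])
    y = selected[window_size:, target_col]      # target_col indexes the selected columns
    return X, y


# Synthetic stand-in for the 18-column table: 30 rows, value = row * 100 + column.
toy = np.arange(30)[:, None] * 100.0 + np.arange(18)[None, :]
X, y = make_windows(toy, feature_cols=[9, 11, 12, 15], target_col=0, window_size=20)
print(X.shape, y.shape)   # (10, 20, 4) (10,)
print(y[0])               # 2009.0 -> row 20, original column 9 (close)
```

The point to notice is that `target_col` is applied after `feature_cols`, so `0` here refers to original column 9.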

### Why OHLC?

OHLC (Open, High, Low, Close) is the canonical representation of price action in financial markets. Each daily bar captures:
- **Open** — the first trade price of the day
- **High** — the maximum price reached
- **Low** — the minimum price reached
- **Close** — the last trade price of the day

Together, these four values summarize each day's price range and direction in a compact 4-dimensional vector, which maps directly onto the four components of a quaternion.
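
The bar definitions above can be illustrated by aggregating finer-grained prices with pandas. The cached LunarCrush data already arrives as daily bars, so this step is not part of the pipeline; the prices below are made up purely to show how the four values arise.

```python
import pandas as pd

# Made-up prices at 4-hour spacing covering two days (illustration only).
idx = pd.date_range("2024-01-01", periods=12, freq=pd.Timedelta(hours=4))
prices = pd.Series(
    [100.0, 104.0, 98.0, 101.0, 103.0, 102.0,    # day 1
     102.0, 99.0, 105.0, 107.0, 106.0, 104.0],   # day 2
    index=idx,
)

# Resample to daily bars: first, max, min, and last price of each day.
bars = prices.resample("D").ohlc()
print(bars)

# Every bar satisfies low <= min(open, close) and max(open, close) <= high.
body = bars[["open", "close"]]
is_valid = (bars["low"] <= body.min(axis=1)) & (body.max(axis=1) <= bars["high"])
print(is_valid.all())   # True
```

The same validity check could be run on `df_ohlc` as a quick sanity test of the cached data.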

In [None]:
# Extract the OHLC subset
df_ohlc = df_full[OHLC_FEATURES].copy()
print(f"OHLC subset shape: {df_ohlc.shape}")
print(f"Columns: {list(df_ohlc.columns)}")
print()
df_ohlc.head(10)

### Quaternion Encoding of OHLC

Each daily OHLC bar is encoded as a quaternion. Assuming the components follow the column order of `feature_cols` (close, high, low, open):

$$q_t = C_t + H_t \cdot \mathbf{i} + L_t \cdot \mathbf{j} + O_t \cdot \mathbf{k}$$

where:
- $C_t$ is the **close** price (real part)
- $H_t$ is the **high** price ($\mathbf{i}$ component)
- $L_t$ is the **low** price ($\mathbf{j}$ component)
- $O_t$ is the **open** price ($\mathbf{k}$ component)

This mapping is natural because:
1. All four values share the same unit (USD) and scale
2. They satisfy an ordering constraint: $L_t \leq \min(O_t, C_t) \leq \max(O_t, C_t) \leq H_t$
3. The Hamilton product in quaternion algebra can capture cross-feature interactions that element-wise real-valued operations cannot
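
The third point can be made concrete with a minimal NumPy sketch of the Hamilton product. This is the standard textbook formula, not the thesis implementation; `hamilton_product` is an illustrative name, and the weight and input values are arbitrary.

```python
import numpy as np


def hamilton_product(p, q):
    """Hamilton product of two quaternions given as (real, i, j, k)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return np.array([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,   # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,   # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,   # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,   # k component
    ])


i = np.array([0.0, 1.0, 0.0, 0.0])
j = np.array([0.0, 0.0, 1.0, 0.0])
print(hamilton_product(i, j))   # [0. 0. 0. 1.]   i * j =  k
print(hamilton_product(j, i))   # [ 0.  0.  0. -1.]  j * i = -k

# An arbitrary quaternion weight applied to an arbitrary 4-feature input.
w = np.array([0.5, 0.5, 0.5, 0.5])
x = np.array([1.0, 2.0, 3.0, 4.0])
print(hamilton_product(w, x))   # [-4.  2.  1.  3.]
```

Each output component is a signed sum over all four input components, weighted by the same four numbers in `w`. An element-wise product, by contrast, would only ever pair component `n` of the weight with component `n` of the input.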

In [None]:
fig, axes = plt.subplots(2, 1, figsize=(14, 8), gridspec_kw={'height_ratios': [3, 1]})

# Top: OHLC price time series
ax = axes[0]
ax.plot(df_ohlc.index, df_ohlc['close'], label='Close', linewidth=1.2, color='#2196F3')
ax.plot(df_ohlc.index, df_ohlc['open'], label='Open', linewidth=0.8, alpha=0.7, color='#4CAF50')
ax.fill_between(df_ohlc.index, df_ohlc['low'], df_ohlc['high'], alpha=0.15, color='#2196F3', label='High-Low range')
ax.set_title('BTC Daily OHLC Prices', fontsize=14, fontweight='bold')
ax.set_ylabel('Price (USD)')
ax.legend(loc='upper left')
ax.grid(True, alpha=0.3)

# Bottom: Daily spread (high - low)
ax2 = axes[1]
spread = df_ohlc['high'] - df_ohlc['low']
ax2.fill_between(df_ohlc.index, 0, spread, alpha=0.5, color='#FF9800')
ax2.set_title('Daily Price Spread (High - Low)', fontsize=12)
ax2.set_ylabel('Spread (USD)')
ax2.set_xlabel('Date')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nOHLC Descriptive Statistics:")
df_ohlc.describe()

---
## Section 4: Lunar Dataset

### Configuration

From `configs/data/daily/btc_lunar.yaml`:

```yaml
feature_cols: [2, 5, 9, 10]   # interactions, sentiment, close, galaxy_score
target_col: 2                  # close (index 2 after selection)
window_size: 20
```

This configuration selects a diverse mix of social engagement, sentiment, price, and composite score features. The **target** is still `close`, but it is now at index 2 within the selected features.
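
Because `target_col` is a position within `feature_cols` rather than an index into the original 18 columns, it is easy to misread. The short check below resolves both configurations back to a column name; `target_name` is a hypothetical helper written for this example, and the column list is copied from `LUNARCRUSH_ALL_COLUMNS`.

```python
# Column order copied from LUNARCRUSH_ALL_COLUMNS earlier in this notebook.
ALL_COLUMNS = [
    "contributors_active", "contributors_created", "interactions",
    "posts_active", "posts_created", "sentiment", "spam", "alt_rank",
    "circulating_supply", "close", "galaxy_score", "high", "low",
    "market_cap", "market_dominance", "open", "social_dominance",
    "volume_24h",
]


def target_name(feature_cols, target_col):
    """Resolve target_col (a position within feature_cols) to a column name."""
    selected = [ALL_COLUMNS[i] for i in feature_cols]
    return selected[target_col]


print(target_name([9, 11, 12, 15], 0))   # close (OHLC config)
print(target_name([2, 5, 9, 10], 2))     # close (Lunar config)
```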

### Why these four features?

Unlike OHLC (which captures only price), the Lunar feature set blends four complementary signals:
- **Social engagement** (`interactions`) — market attention
- **Crowd mood** (`sentiment`) — directional signal from social media
- **Market price** (`close`) — the prediction target, enabling autoregressive learning
- **Composite signal** (`galaxy_score`) — LunarCrush's meta-feature combining social + technical indicators

In [None]:
# Extract the Lunar subset
df_lunar = df_full[LUNAR_FEATURES].copy()
print(f"Lunar subset shape: {df_lunar.shape}")
print(f"Columns: {list(df_lunar.columns)}")
print()
df_lunar.head(10)

In [None]:
fig, axes = plt.subplots(4, 1, figsize=(14, 12), sharex=True)

configs = [
    ('interactions', 'Total Interactions', '#9C27B0'),
    ('sentiment', 'Sentiment Score', '#E91E63'),
    ('close', 'Close Price (USD)', '#2196F3'),
    ('galaxy_score', 'Galaxy Score', '#FF9800'),
]

for ax, (col, title, color) in zip(axes, configs):
    ax.plot(df_lunar.index, df_lunar[col], linewidth=1.0, color=color)
    ax.set_ylabel(col, fontsize=10)
    ax.set_title(title, fontsize=11, fontweight='bold', loc='left')
    ax.grid(True, alpha=0.3)

axes[-1].set_xlabel('Date')
fig.suptitle('BTC Daily — Lunar Feature Set (4 panels)', fontsize=14, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

---
## Section 5: Feature Selection Rationale

### Why these 4 Lunar features?

The Lunar feature set was chosen through domain-knowledge reasoning, selecting features that together cover social engagement, crowd sentiment, price, and a composite score:

#### 1. `interactions` (index 2) — Proxy for Market Attention
- Measures total engagements (likes, shares, comments) across all social posts about the coin
- Preferred over `posts_created` or `contributors_active` on the assumption that **engagement depth** matters more than raw post count: a single viral post with 100k likes likely carries more signal than 1,000 low-engagement posts
- Working hypothesis: spikes in interactions precede or coincide with price volatility

#### 2. `sentiment` (index 5) — Direct Measure of Crowd Mood
- LunarCrush's aggregated sentiment score (0–100) derived from NLP analysis of social posts
- Captures the **direction** of social attention — `interactions` tells us *how much* people are talking, `sentiment` tells us *how they feel*
- Pairing engagement volume with sentiment direction provides complementary signals

#### 3. `close` (index 9) — Prediction Target & Autoregressive Anchor
- The closing price is the prediction target in both configurations
- Including it as an input feature enables **autoregressive learning** — the model sees past closing prices to predict future ones
- Provides the price-scale anchor that grounds the social features

#### 4. `galaxy_score` (index 10) — Composite Meta-Feature
- LunarCrush's proprietary score that blends social activity, engagement, sentiment, and market technicals into a single 0–100 indicator
- Acts as a **pre-fused summary** of multiple signals — effectively a feature-engineered input that captures cross-domain interactions
- Provides information compression: one number that reflects the overall "health" of a coin's social + market standing

### Why the other 14 columns were excluded

| Column | Index | Reason for Exclusion |
|--------|------:|---------------------|
| `contributors_active` | 0 | Redundant with `interactions` — counts users, not engagement depth |
| `contributors_created` | 1 | Noisy metric; new contributors don't indicate directional sentiment |
| `posts_active` | 3 | Overlaps with `interactions`; active posts ⊂ posts with interactions |
| `posts_created` | 4 | Raw post count is noisy; bots and spam inflate this without signal |
| `spam` | 6 | Measures noise, not signal; anti-correlated with data quality |
| `alt_rank` | 7 | Relative ranking metric; ordinal scale loses magnitude information |
| `circulating_supply` | 8 | Near-constant for BTC over short periods; adds no predictive variance |
| `high` | 11 | Part of OHLC set only; redundant with `close` for Lunar config |
| `low` | 12 | Part of OHLC set only; redundant with `close` for Lunar config |
| `market_cap` | 13 | Derived from `close × circulating_supply`; nearly proportional to `close` while supply is near-constant |
| `market_dominance` | 14 | Relative metric across coins; not predictive of absolute price |
| `open` | 15 | Part of OHLC set only; highly correlated with previous day's close |
| `social_dominance` | 16 | Relative share metric; zero-sum across coins, not intrinsic to BTC |
| `volume_24h` | 17 | Candidate feature but excluded to stay at 4 dimensions; could be explored in future work |
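
Some of these exclusion reasons are redundancy claims that can be checked numerically. The sketch below does so for `market_cap` on synthetic data, assuming a random-walk price and a slowly growing supply; none of the numbers come from the cached dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_days = 365

# Synthetic inputs (assumptions of this sketch, not LunarCrush data).
close = 30_000 * np.exp(np.cumsum(rng.normal(0.0, 0.02, n_days)))
supply = 19_000_000 + 900 * np.arange(n_days)    # changes by under 2% over the year

toy = pd.DataFrame({
    "close": close,
    "circulating_supply": supply,
    "market_cap": close * supply,
})

# Pearson correlation of every column with close.
corr_with_close = toy.corr()["close"]
print(corr_with_close.round(4))
```

Running `df_full.corr()["close"]` on the real data would show whether the same near-perfect relationship holds there.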

---
## Section 6: Side-by-Side Comparison

In [None]:
# Configuration comparison table
comparison = pd.DataFrame({
    'Property': [
        'Config file',
        'feature_cols',
        'Feature names',
        'target_col (after selection)',
        'Target feature',
        'Information type',
        'Primary purpose',
    ],
    'OHLC': [
        'configs/data/daily/btc_ohlc.yaml',
        '[9, 11, 12, 15]',
        'close, high, low, open',
        '0 (close)',
        'close',
        'Pure price action',
        'Traditional financial time-series baseline',
    ],
    'Lunar': [
        'configs/data/daily/btc_lunar.yaml',
        '[2, 5, 9, 10]',
        'interactions, sentiment, close, galaxy_score',
        '2 (close)',
        'close',
        'Social + market hybrid',
        'Test whether social signals improve prediction',
    ],
})
comparison.set_index('Property', inplace=True)
comparison.style.set_properties(**{'text-align': 'left'}).set_table_styles(
    [{'selector': 'th', 'props': [('text-align', 'left')]}]
)

### Data Samples

In [None]:
# Side-by-side head()
print("=" * 80)
print("OHLC Dataset — First 5 Rows")
print("=" * 80)
display(df_ohlc.head())

print()
print("=" * 80)
print("Lunar Dataset — First 5 Rows")
print("=" * 80)
display(df_lunar.head())

In [None]:
# Side-by-side describe()
print("=" * 80)
print("OHLC — Descriptive Statistics")
print("=" * 80)
display(df_ohlc.describe().style.format('{:,.2f}'))

print()
print("=" * 80)
print("Lunar — Descriptive Statistics")
print("=" * 80)
display(df_lunar.describe().style.format('{:,.2f}'))

In [None]:
# 2x4 histogram grid: OHLC (top), Lunar (bottom)
fig, axes = plt.subplots(2, 4, figsize=(16, 7))

# Top row: OHLC features
for i, col in enumerate(OHLC_FEATURES):
    ax = axes[0, i]
    df_ohlc[col].dropna().hist(bins=50, ax=ax, color='#2196F3', alpha=0.7, edgecolor='white')
    ax.set_title(f'OHLC: {col}', fontsize=10, fontweight='bold')
    ax.tick_params(labelsize=8)

# Bottom row: Lunar features
colors_lunar = ['#9C27B0', '#E91E63', '#2196F3', '#FF9800']
for i, (col, color) in enumerate(zip(LUNAR_FEATURES, colors_lunar)):
    ax = axes[1, i]
    df_lunar[col].dropna().hist(bins=50, ax=ax, color=color, alpha=0.7, edgecolor='white')
    ax.set_title(f'Lunar: {col}', fontsize=10, fontweight='bold')
    ax.tick_params(labelsize=8)

fig.suptitle('Feature Distributions: OHLC (top) vs Lunar (bottom)', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmaps side-by-side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# OHLC correlation
corr_ohlc = df_ohlc.corr()
im1 = ax1.imshow(corr_ohlc.values, cmap='RdYlBu_r', vmin=-1, vmax=1, aspect='auto')
ax1.set_xticks(range(len(OHLC_FEATURES)))
ax1.set_yticks(range(len(OHLC_FEATURES)))
ax1.set_xticklabels(OHLC_FEATURES, rotation=45, ha='right')
ax1.set_yticklabels(OHLC_FEATURES)
ax1.set_title('OHLC Correlation Matrix', fontsize=12, fontweight='bold')
# Annotate
for i in range(len(OHLC_FEATURES)):
    for j in range(len(OHLC_FEATURES)):
        ax1.text(j, i, f'{corr_ohlc.values[i, j]:.3f}', ha='center', va='center', fontsize=9,
                color='white' if abs(corr_ohlc.values[i, j]) > 0.5 else 'black')

# Lunar correlation
corr_lunar = df_lunar.corr()
im2 = ax2.imshow(corr_lunar.values, cmap='RdYlBu_r', vmin=-1, vmax=1, aspect='auto')
ax2.set_xticks(range(len(LUNAR_FEATURES)))
ax2.set_yticks(range(len(LUNAR_FEATURES)))
ax2.set_xticklabels(LUNAR_FEATURES, rotation=45, ha='right')
ax2.set_yticklabels(LUNAR_FEATURES)
ax2.set_title('Lunar Correlation Matrix', fontsize=12, fontweight='bold')
# Annotate
for i in range(len(LUNAR_FEATURES)):
    for j in range(len(LUNAR_FEATURES)):
        ax2.text(j, i, f'{corr_lunar.values[i, j]:.3f}', ha='center', va='center', fontsize=9,
                color='white' if abs(corr_lunar.values[i, j]) > 0.5 else 'black')

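fig.colorbar(im2, ax=[ax1, ax2], label='Pearson Correlation', shrink=0.8)

# Report the measured off-diagonal range instead of hard-coding a value
off_diag = ~np.eye(len(OHLC_FEATURES), dtype=bool)
ohlc_vals, lunar_vals = corr_ohlc.values[off_diag], corr_lunar.values[off_diag]
fig.suptitle(f'Off-diagonal correlations: OHLC {ohlc_vals.min():.2f} to {ohlc_vals.max():.2f}, '
             f'Lunar {lunar_vals.min():.2f} to {lunar_vals.max():.2f}',
             fontsize=11, style='italic', y=1.02)
# tight_layout() is skipped: it does not support a colorbar shared across several axes
plt.show()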

In [None]:
# Missing value analysis
print("Missing Values Analysis")
print("=" * 60)

missing_ohlc = df_ohlc.isnull().sum()
missing_lunar = df_lunar.isnull().sum()

missing_df = pd.DataFrame({
    'OHLC Features': pd.Series({col: f"{missing_ohlc[col]} ({missing_ohlc[col]/len(df_ohlc)*100:.1f}%)" for col in OHLC_FEATURES}),
    'Lunar Features': pd.Series({col: f"{missing_lunar[col]} ({missing_lunar[col]/len(df_lunar)*100:.1f}%)" for col in LUNAR_FEATURES}),
})

display(missing_df)

print(f"\nTotal rows: {len(df_full)}")
print(f"OHLC complete rows: {df_ohlc.dropna().shape[0]} / {len(df_ohlc)}")
print(f"Lunar complete rows: {df_lunar.dropna().shape[0]} / {len(df_lunar)}")

---
## Section 7: Summary

### Comparison Summary

| Aspect | OHLC | Lunar |
|--------|------|-------|
| **Information type** | Homogeneous (all price) | Heterogeneous (social + price + composite) |
| **Feature diversity** | Very low — all features ~0.99 correlated | High — features from distinct domains |
| **Quaternion suitability** | Natural fit (OHLC is a standard 4-tuple) | Novel encoding (mixed-domain quaternion) |
| **Scale consistency** | All in USD, same magnitude | Mixed scales (count, score, USD) — requires normalization |
| **Hypothesis** | Baseline performance; quaternion algebra captures price geometry | Social signals provide orthogonal information that improves prediction |
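
The normalization mentioned under scale consistency could look like the following min-max sketch. The scaler actually used in the thesis is not stated here, so treat this as one possible choice. `fit_minmax` and `apply_minmax` are hypothetical helpers and the rows are made-up values in Lunar column order.

```python
import numpy as np


def fit_minmax(train):
    """Per-feature minimum and range, computed on the training rows only."""
    lo = train.min(axis=0)
    span = train.max(axis=0) - lo
    span[span == 0] = 1.0            # guard against constant columns
    return lo, span


def apply_minmax(x, lo, span):
    """Scale each feature using the statistics returned by fit_minmax."""
    return (x - lo) / span


# Made-up rows in Lunar order: interactions, sentiment, close, galaxy_score.
rows = np.array([
    [2.0e7, 80.0, 30_000.0, 60.0],
    [5.0e7, 75.0, 32_000.0, 70.0],
    [3.0e7, 85.0, 31_000.0, 65.0],
    [8.0e7, 70.0, 35_000.0, 72.0],
])
train, test = rows[:3], rows[3:]

lo, span = fit_minmax(train)
print(apply_minmax(train, lo, span))   # every column now spans [0, 1]
print(apply_minmax(test, lo, span))    # [[ 2.  -0.5  2.5  1.2]]
```

Fitting on the training rows only avoids leaking future information into the scaler, at the cost that later values can fall outside [0, 1], as the last row shows.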

### Experimental Design

Both configurations are run through the **same model architecture** (quaternion LSTM) with identical hyperparameters. This controlled comparison isolates the effect of **feature selection** on prediction accuracy:

- If **OHLC outperforms Lunar**: this suggests price dynamics alone are sufficient and the social signals mostly add noise
- If **Lunar outperforms OHLC**: this suggests social and sentiment data carry predictive value beyond price action
- If **performance is similar**: the social signals neither help nor hurt, and the model is insensitive to this choice of features

### Key Research Question

> *Does incorporating social media sentiment and engagement features into quaternion-valued neural networks improve cryptocurrency price prediction compared to using traditional OHLC price data alone?*

In [None]:
# List all available cached datasets with row counts and date ranges
csv_files = sorted(glob.glob(os.path.join(CACHE_DIR, 'lunarcrush_*.csv')))

records = []
for f in csv_files:
    try:
        tmp = pd.read_csv(f, index_col='Datetime', parse_dates=True)
        records.append({
            'File': os.path.basename(f),
            'Rows': len(tmp),
            'Columns': tmp.shape[1],
            'Start': str(tmp.index.min().date()),
            'End': str(tmp.index.max().date()),
        })
    except Exception as e:
        records.append({
            'File': os.path.basename(f),
            'Rows': 'ERROR',
            'Columns': '-',
            'Start': str(e)[:40],
            'End': '-',
        })

df_cache = pd.DataFrame(records)
print(f"Found {len(csv_files)} cached datasets in {CACHE_DIR}:\n")
display(df_cache.style.hide(axis='index'))