In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("adhoppin/financial-data")

print("Path to dataset files:", path)



Downloading from https://www.kaggle.com/api/v1/datasets/download/adhoppin/financial-data?dataset_version_number=1...


100%|██████████| 4.22M/4.22M [00:00<00:00, 73.7MB/s]

Extracting files...
Path to dataset files: /home/de7281/.cache/kagglehub/datasets/adhoppin/financial-data/versions/1





### Exercise 1: Parameter Optimization
**Goal**: Find the optimal EWMA spans for better performance

```python
# Hint: Try different combinations of span_short and span_long
# Compare Sharpe ratios across different parameter combinations
```
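As a possible starting point, the sketch below runs a small grid search over span pairs. It uses a synthetic random-walk price series rather than the AAPL data, and the helper name `crossover_sharpe` and the candidate span values are illustrative assumptions, not part of the workshop code.

```python
import itertools
import numpy as np
import pandas as pd

def crossover_sharpe(prices, span_short, span_long):
    """Annualized Sharpe ratio (0% risk-free rate) of an EWMA crossover on `prices`."""
    fast = prices.ewm(span=span_short, adjust=False).mean()
    slow = prices.ewm(span=span_long, adjust=False).mean()
    signal = np.sign(fast - slow)                      # +1 long, -1 short, 0 flat
    strat = (signal.shift(1) * prices.pct_change()).dropna()  # lag signal by one day
    vol = strat.std()
    return np.sqrt(252) * strat.mean() / vol if vol > 0 else 0.0

# Synthetic prices (geometric random walk), for illustration only
rng = np.random.default_rng(42)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0.0003, 0.015, 1000))))

results = []
for s, l in itertools.product([5, 10, 20], [30, 50, 100]):
    results.append({"span_short": s, "span_long": l,
                    "sharpe": crossover_sharpe(prices, s, l)})

grid = pd.DataFrame(results).sort_values("sharpe", ascending=False)
print(grid)
```

The best pair found this way is chosen in-sample, so it is likely to look better than it really is; Exercise 2 addresses that.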

### Exercise 2: Train-Test Split
**Goal**: Split data into training and testing periods to avoid overfitting

```python
# Hint: Use first 70% of data for training, last 30% for testing
# Develop strategy parameters on training set only
# Then test on out-of-sample test set
```
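A minimal sketch of a chronological split, assuming a date-sorted DataFrame. The synthetic frame `df` and its column names stand in for `df_aapl` and are assumptions for illustration. The rows are not shuffled, because shuffling a time series would leak future information into the training set.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a date-sorted price table
rng = np.random.default_rng(7)
dates = pd.bdate_range("2020-01-01", periods=500)
df = pd.DataFrame({
    "Date": dates,
    "Close": 100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(dates)))),
})

# First 70% of rows for training, last 30% for testing (integer arithmetic avoids rounding surprises)
split_idx = len(df) * 7 // 10
train = df.iloc[:split_idx]
test = df.iloc[split_idx:]

print(f"Train: {train['Date'].min().date()} to {train['Date'].max().date()} ({len(train)} rows)")
print(f"Test:  {test['Date'].min().date()} to {test['Date'].max().date()} ({len(test)} rows)")
```

Parameters such as the EWMA spans would then be chosen using `train` only and evaluated once on `test`.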

### Exercise 3: Different Stocks
**Goal**: Test the same strategy on other stocks

```python
# Hint: Try stocks with different characteristics:
# - High volatility (e.g., TSLA)
# - Stable dividend payers (e.g., KO, JNJ)
# - Tech stocks (e.g., GOOGL, MSFT)
```

### Exercise 4: Alternative Indicators
**Goal**: Create signals using different technical indicators

```python
# Ideas:
# - RSI (Relative Strength Index)
# - Bollinger Bands (mean reversion)
# - MACD (Moving Average Convergence Divergence)
# - Volume-weighted indicators
```

### Exercise 5: Risk Management
**Goal**: Add stop-loss and position sizing rules

```python
# Ideas:
# - Implement a maximum drawdown threshold (exit if down X%)
# - Use volatility-based position sizing
# - Add maximum position limits
# - Implement trailing stops
```
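One way to approach volatility-based position sizing is sketched below on synthetic returns. The target volatility, leverage cap, and window length are hypothetical values chosen for illustration; the idea is to hold a smaller position when recent volatility is high.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns, for illustration only
rng = np.random.default_rng(1)
returns = pd.Series(rng.normal(0.0005, 0.02, 500))

target_vol = 0.15     # hypothetical annualized volatility target
max_leverage = 1.0    # hypothetical cap on position size
window = 20           # hypothetical lookback in trading days

# Annualized rolling volatility estimate
realized_vol = returns.rolling(window).std() * np.sqrt(252)

# Shift by one day so today's size uses only information available yesterday
size = (target_vol / realized_vol).shift(1).clip(upper=max_leverage).fillna(0.0)

sized_returns = size * returns
print(f"Average position size: {size[size > 0].mean():.2f}")
print(f"Annualized vol, unsized: {returns.std() * np.sqrt(252):.2%}")
print(f"Annualized vol, sized:   {sized_returns.std() * np.sqrt(252):.2%}")
```

In the workshop's strategy, `size` would multiply the lagged signal before computing strategy returns.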

### Exercise 6: Multiple Timeframes
**Goal**: Use signals from multiple timeframes (daily, weekly, monthly)

```python
# Hint: Resample data to different frequencies
# Combine signals from different timeframes
# Weight them appropriately
```
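The sketch below shows one possible way to combine a daily and a weekly crossover signal. It runs on synthetic prices; the helper `crossover`, the weekly spans, and the equal weighting are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily closes on business days (illustrative only)
rng = np.random.default_rng(3)
dates = pd.bdate_range("2022-01-03", periods=260)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(dates)))), index=dates)

def crossover(prices, span_short, span_long):
    """+1 when the fast EWMA is above the slow EWMA, -1 when below, 0 when equal."""
    fast = prices.ewm(span=span_short, adjust=False).mean()
    slow = prices.ewm(span=span_long, adjust=False).mean()
    return np.sign(fast - slow)

daily_signal = crossover(close, 10, 50)

# Weekly bars labelled on Friday; each bar only uses prices up to its label date
weekly_close = close.resample("W-FRI").last()
weekly_signal = crossover(weekly_close, 4, 12)

# Carry each weekly signal forward onto the daily calendar (0 before the first weekly bar)
weekly_on_daily = weekly_signal.reindex(close.index, method="ffill").fillna(0.0)

# Hypothetical equal weighting of the two timeframes, lagged one day before trading
combined = 0.5 * daily_signal + 0.5 * weekly_on_daily
position = combined.shift(1)
print(combined.value_counts().sort_index())
```

Forward-filling from the Friday label means Monday to Thursday use the previous week's signal, which keeps the weekly input free of look-ahead.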

---

## Additional Resources

- **Books**:
  - "Quantitative Trading" by Ernest Chan
  - "Machine Learning for Algorithmic Trading" by Stefan Jansen
  - "Advances in Financial Machine Learning" by Marcos López de Prado

- **Libraries to Explore**:
  - `backtrader`: Full-featured backtesting framework
  - `zipline`: Backtesting engine originally developed by Quantopian (now community-maintained as `zipline-reloaded`)
  - `vectorbt`: Fast backtesting with vectorized operations
  - `ta-lib`: Technical analysis library

- **Next Topics**:
  - Machine learning for trading
  - Portfolio optimization
  - Risk modeling (VaR, CVaR)
  - Market microstructure
  - High-frequency trading

## Summary and Key Takeaways

Congratulations! You've built your first quantitative trading strategy. Here's what we covered:

### 1. **Exploratory Data Analysis**
   - Examined price trends and return distributions
   - Calculated basic statistics (mean, volatility, skewness, kurtosis)
   
### 2. **Autocorrelation**
   - Tested whether past returns predict future returns
   - Learned that most autocorrelations are small but potentially exploitable
   
### 3. **EWMA Signals**
   - Created momentum signals using exponentially weighted moving averages
   - Visualized when to enter/exit positions
   
### 4. **Statistical Testing**
   - Used t-statistics to test if signals are statistically significant
   - Compared returns during long vs short signals
   
### 5. **Backtesting**
   - Simulated historical performance
   - Calculated key metrics: returns, Sharpe ratio, drawdown
   - Compared to buy-and-hold benchmark
   
### 6. **Signal to Trades**
   - Converted signals into actual positions
   - Accounted for transaction costs
   - Tracked portfolio value over time

### Important Caveats:
- **Overfitting**: We tested on the same data used to develop the strategy
- **Transaction Costs**: Real costs may be higher and include market impact
- **Market Conditions**: Past performance doesn't guarantee future results
- **Risk Management**: We didn't implement position sizing or stop losses
- **Data Issues**: Survivorship bias, look-ahead bias, etc.

---

## Exercises for Further Exploration

Try these exercises to deepen your understanding:

In [None]:
# Simulate portfolio with transaction costs
initial_capital = 100000  # Start with $100,000
transaction_cost = 0.001  # 0.1% per trade (commissions + slippage)

prices = df_aapl[price_col].to_numpy()
# The lagged signal is NaN on the first row; treat that as flat (no position)
positions = df_aapl['position'].fillna(0).to_numpy()

cash = float(initial_capital)
shares = 0.0

portfolio_values = []
cash_values = []
holdings_values = []

for idx in range(len(df_aapl)):
    current_price = prices[idx]
    position = positions[idx]
    prev_position = positions[idx - 1] if idx > 0 else 0.0

    # Check if position changed (trade occurred); trades execute at the current close
    if position != prev_position:
        # Close old position: selling a long nets (1 - cost), covering a short pays (1 + cost)
        cash += shares * current_price * (1 - transaction_cost if shares > 0 else 1 + transaction_cost)
        shares = 0.0

        # Open new position, sized so that notional plus costs fits within available cash
        if position != 0:
            target_value = cash * abs(position)
            n_shares = target_value / (current_price * (1 + transaction_cost))
            if position > 0:
                shares = n_shares
                cash -= n_shares * current_price * (1 + transaction_cost)
            else:
                # Short sale: proceeds (net of costs) are added to cash
                shares = -n_shares
                cash += n_shares * current_price * (1 - transaction_cost)

    # Update portfolio value
    holdings_value = shares * current_price
    portfolio_value = cash + holdings_value

    portfolio_values.append(portfolio_value)
    cash_values.append(cash)
    holdings_values.append(holdings_value)

df_aapl['portfolio_value'] = portfolio_values
df_aapl['cash'] = cash_values
df_aapl['holdings_value'] = holdings_values
df_aapl['portfolio_returns'] = df_aapl['portfolio_value'].pct_change()

# Plot portfolio value
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Portfolio value over time
ax1 = axes[0]
ax1.plot(df_aapl[date_col], df_aapl['portfolio_value'], label='Portfolio Value', linewidth=2)
ax1.axhline(y=initial_capital, color='black', linestyle='--', alpha=0.5, label='Initial Capital')
ax1.set_title('Portfolio Value Over Time (with Transaction Costs)', fontsize=14, fontweight='bold')
ax1.set_ylabel('Portfolio Value ($)')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Cash vs Holdings
ax2 = axes[1]
ax2.fill_between(df_aapl[date_col], 0, df_aapl['cash'], alpha=0.5, label='Cash')
ax2.fill_between(df_aapl[date_col], df_aapl['cash'], df_aapl['portfolio_value'], alpha=0.5, label='Holdings')
ax2.set_title('Portfolio Composition', fontsize=14, fontweight='bold')
ax2.set_xlabel('Date')
ax2.set_ylabel('Value ($)')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate final metrics
final_value = df_aapl['portfolio_value'].iloc[-1]
total_return = (final_value - initial_capital) / initial_capital
print(f"\nPortfolio Performance (with {transaction_cost*100}% transaction costs):")
print(f"  Initial Capital: ${initial_capital:,.2f}")
print(f"  Final Value: ${final_value:,.2f}")
print(f"  Total Return: {total_return*100:.2f}%")
print(f"  Number of Trades: {n_trades}")
# Rough estimate: assumes every trade turns over the initial capital once
print(f"  Estimated Transaction Costs (approx.): ${n_trades * initial_capital * transaction_cost:,.2f}")

In [None]:
# Convert signals to positions
# Position represents how much capital we allocate (as fraction of total portfolio)
# Signal = 1 → Position = 1.0 (100% long)
# Signal = -1 → Position = -1.0 (100% short)
# Signal = 0 → Position = 0.0 (no position/cash)

df_aapl['position'] = df_aapl['signal_lagged']

# Identify trades (when position changes)
df_aapl['position_change'] = df_aapl['position'].diff()
df_aapl['trade'] = (df_aapl['position_change'] != 0) & (df_aapl['position_change'].notna())

# Count trades
trades_df = df_aapl[df_aapl['trade']].copy()
n_trades = len(trades_df)

print(f"Total number of trades: {n_trades}")
print(f"\nFirst 10 trades:")
print(trades_df[[date_col, price_col, 'position', 'position_change']].head(10))

## Step 8: From Signals to Trades

So far we've been working with **signals** (-1, 0, 1). Now let's convert these into actual **trading positions** and track our portfolio.

### Key Concepts:
- **Position**: The fraction of the portfolio allocated to the stock (+1.0 = fully long, -1.0 = fully short, 0.0 = all cash)
- **Position Changes**: When we enter/exit positions (these are actual trades)
- **Transaction Costs**: In reality, every trade has costs (commissions, slippage)
- **Portfolio Value**: Track how our total capital changes over time

Let's build a simple position tracking system.
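To make the distinction between signals, positions, and trades concrete, here is a small sketch on a made-up signal series. The `turnover` measure is an addition for illustration: it counts how much of the portfolio changes hands, so a flip from long to short counts as 2.

```python
import pandas as pd

# Made-up signal series, for illustration only
signal = pd.Series([0, 1, 1, -1, -1, 0, 1])

# Position held today is yesterday's signal (one-day lag)
position = signal.shift(1)

# A trade happens whenever the position changes
position_change = position.diff()
trade = position_change.notna() & (position_change != 0)

# Turnover: total absolute change in position, in units of portfolio value
turnover = position_change.abs().sum()

print(pd.DataFrame({"signal": signal, "position": position,
                    "change": position_change, "trade": trade}))
print(f"Trades: {int(trade.sum())}, turnover: {turnover:.0f}x portfolio")
```

Because a long-to-short flip moves twice the notional of a simple entry, cost estimates based on turnover are usually closer to reality than estimates based on the trade count alone.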

In [None]:
# Visualize drawdowns
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Calculate drawdowns
def calculate_drawdown(returns):
    cumulative = (1 + returns).cumprod()
    running_max = cumulative.expanding().max()
    drawdown = (cumulative - running_max) / running_max
    return drawdown

strategy_dd = calculate_drawdown(df_aapl['strategy_returns'].fillna(0))
buyhold_dd = calculate_drawdown(df_aapl['buy_hold_returns'].fillna(0))

# Plot strategy drawdown
ax1 = axes[0]
ax1.fill_between(df_aapl[date_col], strategy_dd * 100, 0, alpha=0.5, color='red')
ax1.set_title('EWMA Strategy Drawdown', fontsize=14, fontweight='bold')
ax1.set_ylabel('Drawdown (%)')
ax1.grid(True, alpha=0.3)

# Plot buy & hold drawdown
ax2 = axes[1]
ax2.fill_between(df_aapl[date_col], buyhold_dd * 100, 0, alpha=0.5, color='blue')
ax2.set_title('Buy & Hold Drawdown', fontsize=14, fontweight='bold')
ax2.set_xlabel('Date')
ax2.set_ylabel('Drawdown (%)')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Calculate performance metrics
def calculate_metrics(returns, name="Strategy"):
    """Calculate common performance metrics"""
    returns_clean = returns.dropna()
    
    # Total return
    total_return = (1 + returns_clean).prod() - 1
    
    # Annualized return (assuming 252 trading days)
    n_days = len(returns_clean)
    annualized_return = (1 + total_return) ** (252 / n_days) - 1
    
    # Volatility (annualized)
    volatility = returns_clean.std() * np.sqrt(252)
    
    # Sharpe ratio (assuming 0% risk-free rate)
    sharpe_ratio = annualized_return / volatility if volatility > 0 else 0
    
    # Maximum drawdown
    cumulative = (1 + returns_clean).cumprod()
    running_max = cumulative.expanding().max()
    drawdown = (cumulative - running_max) / running_max
    max_drawdown = drawdown.min()
    
    # Win rate
    win_rate = (returns_clean > 0).sum() / len(returns_clean)
    
    # Average win/loss
    wins = returns_clean[returns_clean > 0]
    losses = returns_clean[returns_clean < 0]
    avg_win = wins.mean() if len(wins) > 0 else 0
    avg_loss = losses.mean() if len(losses) > 0 else 0
    
    print(f"\n{name}:")
    print(f"  Total Return: {total_return*100:.2f}%")
    print(f"  Annualized Return: {annualized_return*100:.2f}%")
    print(f"  Annualized Volatility: {volatility*100:.2f}%")
    print(f"  Sharpe Ratio: {sharpe_ratio:.4f}")
    print(f"  Maximum Drawdown: {max_drawdown*100:.2f}%")
    print(f"  Win Rate: {win_rate*100:.2f}%")
    print(f"  Average Win: {avg_win*100:.4f}%")
    print(f"  Average Loss: {avg_loss*100:.4f}%")
    print(f"  Number of Observations (days): {len(returns_clean)}")
    
    return {
        'total_return': total_return,
        'annualized_return': annualized_return,
        'volatility': volatility,
        'sharpe_ratio': sharpe_ratio,
        'max_drawdown': max_drawdown,
        'win_rate': win_rate
    }

# Calculate metrics for both strategies
strategy_metrics = calculate_metrics(df_aapl['strategy_returns'], "EWMA Strategy")
buyhold_metrics = calculate_metrics(df_aapl['buy_hold_returns'], "Buy & Hold")

print("\n" + "=" * 60)

In [None]:
# Calculate strategy returns
# Strategy return on day t = signal from day t-1 * return on day t
# Using the lagged signal avoids look-ahead bias
df_aapl['strategy_returns'] = df_aapl['signal_lagged'] * df_aapl['returns']

# Calculate cumulative returns
df_aapl['cumulative_returns'] = (1 + df_aapl['returns']).cumprod() - 1
df_aapl['cumulative_strategy_returns'] = (1 + df_aapl['strategy_returns']).cumprod() - 1

# Buy and hold benchmark
df_aapl['buy_hold_returns'] = df_aapl['returns']
df_aapl['cumulative_buy_hold'] = (1 + df_aapl['buy_hold_returns']).cumprod() - 1

# Plot cumulative returns
fig, ax = plt.subplots(figsize=(14, 7))

ax.plot(df_aapl[date_col], df_aapl['cumulative_strategy_returns'] * 100, 
        label='EWMA Strategy', linewidth=2)
ax.plot(df_aapl[date_col], df_aapl['cumulative_buy_hold'] * 100, 
        label='Buy & Hold', linewidth=2, alpha=0.7)

ax.set_title('Strategy Performance vs Buy & Hold', fontsize=14, fontweight='bold')
ax.set_xlabel('Date')
ax.set_ylabel('Cumulative Returns (%)')
ax.legend(loc='best')
ax.grid(True, alpha=0.3)
ax.axhline(y=0, color='black', linestyle='-', alpha=0.3)

plt.tight_layout()
plt.show()

# Print summary statistics
print("=" * 60)
print("BACKTEST PERFORMANCE SUMMARY")
print("=" * 60)

## Step 7: Backtesting the Strategy

Now let's see how our strategy would have performed historically. **Backtesting** simulates trading based on historical data.

### Key Metrics:
- **Cumulative Returns**: Total return over the period
- **Sharpe Ratio**: Risk-adjusted return (return per unit of volatility)
- **Maximum Drawdown**: Largest peak-to-trough decline
- **Win Rate**: Percentage of days with a positive return (computed per day here, not per round-trip trade)
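A tiny worked example may help with these definitions. The returns below are made up and deliberately exaggerated so the arithmetic is easy to follow. The Sharpe ratio shown uses the common mean-over-standard-deviation convention, which is one of several in use.

```python
import numpy as np
import pandas as pd

# Made-up daily returns, exaggerated for readability
daily = pd.Series([0.10, -0.20, 0.05, 0.10])

# Growth of $1: 1.10, 0.88, 0.924, 1.0164
cumulative = (1 + daily).cumprod()
total_return = cumulative.iloc[-1] - 1

# Drawdown: distance below the highest value reached so far
running_max = cumulative.cummax()
drawdown = cumulative / running_max - 1
max_drawdown = drawdown.min()

# Annualized Sharpe ratio with a 0% risk-free rate (mean / std convention)
sharpe = np.sqrt(252) * daily.mean() / daily.std()

print(f"Total return:     {total_return:.4f}")
print(f"Maximum drawdown: {max_drawdown:.4f}")
print(f"Sharpe ratio:     {sharpe:.2f}")
```

Here the portfolio peaks at 1.10 and falls to 0.88, so the maximum drawdown is 20% even though the total return over the period is slightly positive.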

### Important Note:
⚠️ Backtesting on the same data used to develop the strategy can lead to **overfitting**. In practice, you should:
1. Split data into training and testing sets
2. Develop strategy on training set
3. Test on out-of-sample data

In [None]:
# Visualize returns distribution by signal
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot
ax1 = axes[0]
data_to_plot = [returns_long, returns_short]
labels = ['Long (1)', 'Short (-1)']
bp = ax1.boxplot(data_to_plot, patch_artist=True)
ax1.set_xticklabels(labels)  # set tick labels separately; the `labels` keyword is deprecated in newer Matplotlib
bp['boxes'][0].set_facecolor('green')
bp['boxes'][0].set_alpha(0.5)
bp['boxes'][1].set_facecolor('red')
bp['boxes'][1].set_alpha(0.5)
ax1.axhline(y=0, color='black', linestyle='--', alpha=0.5)
ax1.set_title('Returns Distribution by Signal', fontsize=14, fontweight='bold')
ax1.set_ylabel('Returns')
ax1.grid(True, alpha=0.3)

# Histogram comparison
ax2 = axes[1]
ax2.hist(returns_long, bins=30, alpha=0.5, label='Long', color='green', density=True)
ax2.hist(returns_short, bins=30, alpha=0.5, label='Short', color='red', density=True)
ax2.axvline(x=returns_long.mean(), color='green', linestyle='--', linewidth=2, label=f'Long mean: {returns_long.mean():.4f}')
ax2.axvline(x=returns_short.mean(), color='red', linestyle='--', linewidth=2, label=f'Short mean: {returns_short.mean():.4f}')
ax2.set_title('Returns Distribution Comparison', fontsize=14, fontweight='bold')
ax2.set_xlabel('Returns')
ax2.set_ylabel('Density')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Pair each signal with the NEXT period's return to avoid look-ahead bias.
# forward_returns[t] is the return from t to t+1, so it lines up with signal[t].
df_aapl['forward_returns'] = df_aapl['returns'].shift(-1)
# signal_lagged[t] = signal[t-1]; the backtest multiplies it by returns[t]
df_aapl['signal_lagged'] = df_aapl['signal'].shift(1)

# Remove NaN values. Use the unlagged signal here: lagging the signal AND
# leading the returns would skip a day between signal and return.
df_test = df_aapl[['signal', 'forward_returns']].dropna()

# Split returns by signal
returns_long = df_test[df_test['signal'] == 1]['forward_returns']
returns_short = df_test[df_test['signal'] == -1]['forward_returns']
returns_neutral = df_test[df_test['signal'] == 0]['forward_returns']

print("=" * 60)
print("STATISTICAL ANALYSIS OF SIGNALS")
print("=" * 60)

# Long signal statistics
if len(returns_long) > 0:
    mean_long = returns_long.mean()
    std_long = returns_long.std()
    t_stat_long = mean_long / (std_long / np.sqrt(len(returns_long)))
    p_value_long = stats.t.sf(abs(t_stat_long), len(returns_long) - 1) * 2
    
    print(f"\nLONG SIGNALS (signal = 1):")
    print(f"  Number of observations: {len(returns_long)}")
    print(f"  Mean return: {mean_long:.6f} ({mean_long*100:.4f}%)")
    print(f"  Std dev: {std_long:.6f}")
    print(f"  T-statistic: {t_stat_long:.4f}")
    print(f"  P-value: {p_value_long:.4f}")
    print(f"  Significant at 95%? {'YES' if abs(t_stat_long) > 1.96 else 'NO'}")

# Short signal statistics
if len(returns_short) > 0:
    mean_short = returns_short.mean()
    std_short = returns_short.std()
    t_stat_short = mean_short / (std_short / np.sqrt(len(returns_short)))
    p_value_short = stats.t.sf(abs(t_stat_short), len(returns_short) - 1) * 2
    
    print(f"\nSHORT SIGNALS (signal = -1):")
    print(f"  Number of observations: {len(returns_short)}")
    print(f"  Mean return: {mean_short:.6f} ({mean_short*100:.4f}%)")
    print(f"  Std dev: {std_short:.6f}")
    print(f"  T-statistic: {t_stat_short:.4f}")
    print(f"  P-value: {p_value_short:.4f}")
    print(f"  Significant at 95%? {'YES' if abs(t_stat_short) > 1.96 else 'NO'}")

# Test difference between long and short
if len(returns_long) > 0 and len(returns_short) > 0:
    t_stat_diff, p_value_diff = stats.ttest_ind(returns_long, returns_short)
    print(f"\nDIFFERENCE BETWEEN LONG AND SHORT:")
    print(f"  Mean difference: {mean_long - mean_short:.6f}")
    print(f"  T-statistic: {t_stat_diff:.4f}")
    print(f"  P-value: {p_value_diff:.4f}")
    print(f"  Significantly different? {'YES' if p_value_diff < 0.05 else 'NO'}")

print("\n" + "=" * 60)

## Step 6: Statistical Testing with T-Statistics

Before we backtest our strategy, let's test whether our signals are statistically significant. We want to know:
**"Do returns when we have a buy signal differ significantly from returns when we have a sell signal?"**

### The T-Test:
The **t-statistic** measures how many standard errors a sample mean is from zero:

$$t = \frac{\bar{x}}{\text{SE}} = \frac{\bar{x}}{s / \sqrt{n}}$$

where:
- $\bar{x}$ is the sample mean
- $s$ is the sample standard deviation
- $n$ is the sample size
- As a rule of thumb, **|t| > 2** (more precisely 1.96 for large samples) is considered statistically significant at the 5% level (95% confidence)
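The formula can be checked by hand against SciPy's one-sample t-test. The sample below is a made-up set of daily returns used only for illustration. Note that NumPy's `std` defaults to the population formula, so `ddof=1` is needed to get the sample standard deviation $s$.

```python
import numpy as np
from scipy import stats

# Made-up daily returns, for illustration only
sample = np.array([0.012, -0.004, 0.007, 0.015, -0.009, 0.003, 0.011, -0.002])

n = len(sample)
mean = sample.mean()
s = sample.std(ddof=1)                  # sample standard deviation
t_manual = mean / (s / np.sqrt(n))      # t = x_bar / (s / sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)  # two-sided p-value

# Same test via SciPy (null hypothesis: mean return is zero)
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=0.0)

print(f"Manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"SciPy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

With only eight observations the t-distribution has heavy tails, so the critical value is noticeably larger than 1.96; the rule of thumb applies to large samples.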

In [None]:
# Generate trading signal
# Signal = 1 when fast EWMA > slow EWMA (bullish/buy)
# Signal = -1 when fast EWMA < slow EWMA (bearish/sell)
# Signal = 0 when no clear signal

df_aapl['signal'] = 0
df_aapl.loc[df_aapl['ewma_short'] > df_aapl['ewma_long'], 'signal'] = 1
df_aapl.loc[df_aapl['ewma_short'] < df_aapl['ewma_long'], 'signal'] = -1

# Alternative: signal based on price vs single EWMA
df_aapl['signal_simple'] = 0
df_aapl.loc[df_aapl[price_col] > df_aapl['ewma_short'], 'signal_simple'] = 1
df_aapl.loc[df_aapl[price_col] < df_aapl['ewma_short'], 'signal_simple'] = -1

# Visualize signals
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Plot 1: Price with EWMA crossover signals
ax1 = axes[0]
ax1.plot(df_aapl[date_col], df_aapl[price_col], label='Price', alpha=0.6, linewidth=1)
ax1.plot(df_aapl[date_col], df_aapl['ewma_short'], label=f'EWMA {span_short}', alpha=0.7, linewidth=1.5)
ax1.plot(df_aapl[date_col], df_aapl['ewma_long'], label=f'EWMA {span_long}', alpha=0.7, linewidth=1.5)

# Highlight buy/sell signals
buy_signals = df_aapl[df_aapl['signal'].diff() == 2]  # Changed from -1 to 1
sell_signals = df_aapl[df_aapl['signal'].diff() == -2]  # Changed from 1 to -1

ax1.scatter(buy_signals[date_col], buy_signals[price_col], 
           color='green', marker='^', s=100, label='Buy Signal', zorder=5)
ax1.scatter(sell_signals[date_col], sell_signals[price_col], 
           color='red', marker='v', s=100, label='Sell Signal', zorder=5)

ax1.set_title('EWMA Crossover Strategy Signals', fontsize=14, fontweight='bold')
ax1.set_xlabel('Date')
ax1.set_ylabel('Price ($)')
ax1.legend(loc='best')
ax1.grid(True, alpha=0.3)

# Plot 2: Signal over time
ax2 = axes[1]
ax2.fill_between(df_aapl[date_col], df_aapl['signal'], 0, 
                 where=(df_aapl['signal'] > 0), color='green', alpha=0.3, label='Long')
ax2.fill_between(df_aapl[date_col], df_aapl['signal'], 0, 
                 where=(df_aapl['signal'] < 0), color='red', alpha=0.3, label='Short')
ax2.set_title('Trading Signal Over Time', fontsize=14, fontweight='bold')
ax2.set_xlabel('Date')
ax2.set_ylabel('Signal')
ax2.set_ylim(-1.5, 1.5)
ax2.axhline(y=0, color='black', linestyle='-', alpha=0.5)
ax2.legend(loc='best')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print signal statistics
print(f"Signal distribution:")
print(df_aapl['signal'].value_counts())
print(f"\nPercentage of time long: {(df_aapl['signal'] == 1).sum() / len(df_aapl) * 100:.2f}%")
print(f"Percentage of time short: {(df_aapl['signal'] == -1).sum() / len(df_aapl) * 100:.2f}%")

In [None]:
# Calculate EWMA with different spans
span_short = 10  # Fast EWMA (more reactive)
span_long = 50   # Slow EWMA (smoother)

df_aapl['ewma_short'] = df_aapl[price_col].ewm(span=span_short, adjust=False).mean()
df_aapl['ewma_long'] = df_aapl[price_col].ewm(span=span_long, adjust=False).mean()

# Visualize EWMA
fig, ax = plt.subplots(figsize=(14, 7))

ax.plot(df_aapl[date_col], df_aapl[price_col], label='Actual Price', alpha=0.7, linewidth=1.5)
ax.plot(df_aapl[date_col], df_aapl['ewma_short'], label=f'EWMA (span={span_short})', alpha=0.8, linewidth=1.5)
ax.plot(df_aapl[date_col], df_aapl['ewma_long'], label=f'EWMA (span={span_long})', alpha=0.8, linewidth=1.5)

ax.set_title('AAPL Price with EWMA Indicators', fontsize=14, fontweight='bold')
ax.set_xlabel('Date')
ax.set_ylabel('Price ($)')
ax.legend(loc='best')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Step 5: Creating a Trading Signal with EWMA

**Exponentially Weighted Moving Average (EWMA)** gives more weight to recent observations. Unlike a simple moving average, EWMA reacts faster to recent price changes.

### The EWMA Formula:
$$EWMA_t = \alpha \cdot X_t + (1-\alpha) \cdot EWMA_{t-1}$$

where:
- $\alpha$ is the smoothing factor (0 < α < 1)
- Higher α means more weight on recent values
- Common alternative parameterization: $\alpha = 2/(span + 1)$
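As a quick check of the recursion, the sketch below implements it directly and compares the result with pandas' `ewm(span=..., adjust=False)`. The prices are made up, and the helper `ewma_recursive` is an illustrative name; seeding the recursion with the first observation matches what pandas does when `adjust=False`.

```python
import numpy as np
import pandas as pd

def ewma_recursive(values, span):
    """EWMA_t = alpha * X_t + (1 - alpha) * EWMA_{t-1}, seeded with the first value."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for x in values[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return np.array(out)

# Made-up prices, for illustration only
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0, 106.0])

manual = ewma_recursive(prices.to_numpy(), span=3)   # span=3 gives alpha = 0.5
pandas_ewma = prices.ewm(span=3, adjust=False).mean().to_numpy()

print("Manual:", manual)
print("pandas:", pandas_ewma)
print("Match: ", np.allclose(manual, pandas_ewma))
```

With `span=3` the smoothing factor is 0.5, so each new value is simply the average of the latest price and the previous EWMA.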

### Our Strategy:
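We'll create a **momentum signal** based on two EWMAs, a fast one (short span) and a slow one (long span):
- If fast EWMA > slow EWMA → **Buy signal** (upward momentum)
- If fast EWMA < slow EWMA → **Sell signal** (downward momentum)

A simpler variant, also computed as `signal_simple`, compares the price itself with a single EWMA.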

### Interpreting Autocorrelation:

- Values **outside the confidence bands** (red dashed lines) are statistically significant
- Most financial return series show **very low autocorrelation** (close to zero)
- Even small autocorrelations can potentially be exploited for profit (though transaction costs matter!)
- This supports the idea that markets are relatively **efficient** but not perfectly so

In [None]:
# Calculate autocorrelation for different lags
from pandas.plotting import autocorrelation_plot

# Remove NaN values
returns_clean = df_aapl['returns'].dropna()

# Plot autocorrelation
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Autocorrelation plot
ax1 = axes[0]
autocorrelation_plot(returns_clean, ax=ax1)
ax1.set_title('Autocorrelation of AAPL Returns', fontsize=14, fontweight='bold')
ax1.set_xlabel('Lag')
ax1.set_ylabel('Autocorrelation')
ax1.set_xlim(0, 50)
ax1.grid(True, alpha=0.3)

# Manual autocorrelation for specific lags
ax2 = axes[1]
lags = range(1, 21)
autocorr_values = [returns_clean.autocorr(lag=i) for i in lags]
ax2.bar(lags, autocorr_values, alpha=0.7)
ax2.axhline(y=0, color='r', linestyle='-', alpha=0.5)
# Add confidence intervals (95% confidence: ±1.96/sqrt(n))
conf_interval = 1.96 / np.sqrt(len(returns_clean))
ax2.axhline(y=conf_interval, color='r', linestyle='--', alpha=0.5, label='95% confidence')
ax2.axhline(y=-conf_interval, color='r', linestyle='--', alpha=0.5)
ax2.set_title('Autocorrelation at Different Lags', fontsize=14, fontweight='bold')
ax2.set_xlabel('Lag (days)')
ax2.set_ylabel('Autocorrelation')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print specific lag autocorrelations
print("Autocorrelation values:")
for lag in [1, 2, 3, 5, 10]:
    acf = returns_clean.autocorr(lag=lag)
    print(f"  Lag {lag}: {acf:.6f}")

## Step 4: Autocorrelation Analysis

**Autocorrelation** measures how much a time series is correlated with itself at different time lags. In the context of stock returns:
- **Positive autocorrelation** at lag 1 means that positive returns tend to be followed by positive returns (momentum)
- **Negative autocorrelation** means positive returns tend to be followed by negative returns (mean reversion)
- **Zero autocorrelation** suggests returns are independent (random walk hypothesis)

Let's investigate whether AAPL returns exhibit autocorrelation.
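Before looking at real data, it can help to see the measure on a series with known dependence. The sketch below builds synthetic AR(1) "returns" with coefficient 0.3 (an arbitrary illustrative choice) and computes the lag-1 autocorrelation two ways: with pandas and as a plain correlation between the series and its one-step-shifted copy.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
noise = rng.normal(0, 0.01, size=2000)

# Synthetic AR(1) series: r_t = phi * r_{t-1} + noise_t (phi chosen for illustration)
phi = 0.3
r = np.zeros_like(noise)
for t in range(1, len(noise)):
    r[t] = phi * r[t - 1] + noise[t]
returns = pd.Series(r)

# Lag-1 autocorrelation: correlation of the series with itself shifted by one step
lag1_pandas = returns.autocorr(lag=1)
lag1_manual = np.corrcoef(r[1:], r[:-1])[0, 1]

# Approximate 95% band under the null of no autocorrelation
band = 1.96 / np.sqrt(len(returns))

print(f"pandas autocorr: {lag1_pandas:.4f}")
print(f"manual corrcoef: {lag1_manual:.4f}")
print(f"95% band:        +/-{band:.4f}")
```

The estimate lands near the true value of 0.3 and well outside the band, which is what a statistically significant momentum effect would look like. Real daily stock returns typically sit much closer to zero.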

### Key Observations from EDA:

1. **Price Trend**: We can see the overall price trajectory of AAPL
2. **Returns Distribution**: Daily returns appear to be roughly normally distributed (though often with fat tails)
3. **Volatility**: The standard deviation of returns gives us a measure of volatility
4. **Skewness & Kurtosis**: Tell us about asymmetry and tail thickness in the distribution

**Question to think about**: Do you notice any patterns or trends that might be exploitable for prediction?

In [None]:
# Calculate returns
# Returns are the percentage change in price from one period to the next
df_aapl['returns'] = df_aapl[price_col].pct_change()

# Visualize returns distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Returns over time
ax1 = axes[0]
ax1.plot(df_aapl[date_col], df_aapl['returns'], linewidth=0.8, alpha=0.7)
ax1.axhline(y=0, color='r', linestyle='--', alpha=0.5)
ax1.set_title('AAPL Daily Returns Over Time', fontsize=14, fontweight='bold')
ax1.set_xlabel('Date')
ax1.set_ylabel('Returns')
ax1.grid(True, alpha=0.3)

# Returns distribution
ax2 = axes[1]
ax2.hist(df_aapl['returns'].dropna(), bins=50, edgecolor='black', alpha=0.7)
ax2.set_title('Distribution of Daily Returns', fontsize=14, fontweight='bold')
ax2.set_xlabel('Returns')
ax2.set_ylabel('Frequency')
ax2.axvline(x=0, color='r', linestyle='--', alpha=0.5)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print return statistics
print(f"Mean daily return: {df_aapl['returns'].mean():.6f}")
print(f"Std dev of returns: {df_aapl['returns'].std():.6f}")
print(f"Skewness: {df_aapl['returns'].skew():.4f}")
print(f"Excess kurtosis: {df_aapl['returns'].kurtosis():.4f}")  # pandas reports excess kurtosis (normal = 0)

In [None]:
# Visualize price over time
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Price chart
ax1 = axes[0]
price_col = 'Close' if 'Close' in df_aapl.columns else 'close' if 'close' in df_aapl.columns else df_aapl.select_dtypes(include=[np.number]).columns[0]
ax1.plot(df_aapl[date_col], df_aapl[price_col], linewidth=1.5)
ax1.set_title('AAPL Closing Price Over Time', fontsize=14, fontweight='bold')
ax1.set_xlabel('Date')
ax1.set_ylabel('Price ($)')
ax1.grid(True, alpha=0.3)

# Volume chart (if available)
ax2 = axes[1]
volume_col = 'Volume' if 'Volume' in df_aapl.columns else 'volume' if 'volume' in df_aapl.columns else None
if volume_col:
    ax2.bar(df_aapl[date_col], df_aapl[volume_col], alpha=0.7, width=1.0)
    ax2.set_title('AAPL Trading Volume Over Time', fontsize=14, fontweight='bold')
    ax2.set_xlabel('Date')
    ax2.set_ylabel('Volume')
    ax2.grid(True, alpha=0.3)
else:
    ax2.text(0.5, 0.5, 'Volume data not available', ha='center', va='center', fontsize=12)
    ax2.set_xticks([])
    ax2.set_yticks([])

plt.tight_layout()
plt.show()

In [None]:
# Basic statistics
print("Summary Statistics:")
print(df_aapl.describe())

# Check for missing values
print(f"\nMissing values in AAPL data:")
print(df_aapl.isnull().sum())

## Step 3: Exploratory Data Analysis (EDA)

Let's explore the AAPL stock data to understand its characteristics, trends, and patterns.

In [None]:
# Filter for AAPL
# Adapt this based on the actual column name in the dataset
ticker_col = 'Symbol' if 'Symbol' in df_raw.columns else 'Ticker' if 'Ticker' in df_raw.columns else df_raw.columns[0]
date_col = 'Date' if 'Date' in df_raw.columns else df_raw.columns[0]

df_aapl = df_raw[df_raw[ticker_col] == 'AAPL'].copy()

# Convert date column to datetime
df_aapl[date_col] = pd.to_datetime(df_aapl[date_col])
df_aapl = df_aapl.sort_values(date_col).reset_index(drop=True)

print(f"AAPL data shape: {df_aapl.shape}")
print(f"Date range: {df_aapl[date_col].min()} to {df_aapl[date_col].max()}")
print(f"\nFirst few rows:")
df_aapl.head(10)

## Step 2: Extract AAPL Data

Now let's filter the dataset to focus on Apple (AAPL) stock and prepare it for analysis.

In [None]:
# Explore the dataset structure
print("Column names:")
print(df_raw.columns.tolist())
print(f"\nData types:")
print(df_raw.dtypes)
print(f"\nMissing values:")
print(df_raw.isnull().sum())

# Check unique tickers if available
if 'Symbol' in df_raw.columns or 'Ticker' in df_raw.columns:
    ticker_col = 'Symbol' if 'Symbol' in df_raw.columns else 'Ticker'
    print(f"\nUnique tickers: {df_raw[ticker_col].nunique()}")
    print(f"Tickers include: {df_raw[ticker_col].unique()[:10]}")  # Show first 10

In [None]:
# List files in the dataset directory
dataset_files = os.listdir(path)
print("Files in dataset:")
for file in dataset_files:
    print(f"  - {file}")
    
# Load the main CSV file
# Sort so the choice is deterministic; os.listdir returns files in arbitrary order
data_file = sorted(f for f in dataset_files if f.endswith('.csv'))[0]
df_raw = pd.read_csv(os.path.join(path, data_file))

print(f"\nLoaded {data_file}")
print(f"Shape: {df_raw.shape}")
print(f"\nFirst few rows:")
df_raw.head()

## Step 1: Load and Explore the Dataset

Let's start by loading the financial data from Kaggle and exploring what's available.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import os

# Set style for better-looking plots
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

print("Libraries imported successfully!")

# Workshop 1: Introduction to Quantitative Trading
## Predicting Stock Prices with Simple Statistical Methods

Welcome to this introductory workshop on quantitative finance! In this workshop, we'll explore:

1. **Exploratory Data Analysis (EDA)** - Understanding our stock price data
2. **Autocorrelation** - How past returns relate to future returns
3. **Exponentially Weighted Moving Averages (EWMA)** - Creating trading signals
4. **Statistical Testing** - Validating our signals with t-statistics
5. **Backtesting** - Testing our strategy on historical data
6. **Signal to Trades** - Converting predictions into actual trading decisions

We'll focus on a single stock (AAPL) and try to predict the next timestep's price movement using only historical price data.