# 🚀 LSTM Stock Prediction Training
## Complete Training Pipeline

**Before running this notebook:**
1. Create a folder named `stock_lstm_project` in your Google Drive
2. Inside it, create three subfolders: `data/`, `scripts/`, `outputs/`
3. Upload the training script as `scripts/lstm_training.py`
4. Upload the data files to the `data/` folder:
   - train_sequences.npz
   - val_sequences.npz
   - test_sequences.npz
   - scalers.pkl
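
The folder layout from steps 1 and 2 can also be created from a notebook cell instead of the Drive web interface. A minimal sketch, assuming Drive is already mounted at `/content/drive`; the helper name `create_project_layout` is hypothetical and is not part of the training script:

```python
from pathlib import Path

# Subfolders named in the setup steps above.
SUBFOLDERS = ("data", "scripts", "outputs")


def create_project_layout(project_root):
    """Create the project folder and its subfolders.

    Folders that already exist are left untouched.
    Returns the list of subfolder paths.
    """
    root = Path(project_root)
    created = []
    for name in SUBFOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Example call in Colab (assumes the default Drive mount point):
# create_project_layout("/content/drive/MyDrive/stock_lstm_project")
```

The data files and the training script still have to be uploaded by hand afterwards.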

**Recommended Runtime:** GPU (T4)
- Go to: **Runtime → Change runtime type → GPU**

---
## 📂 Step 1: Mount Google Drive

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

print("\n" + "="*60)
print("✅ Google Drive mounted successfully!")
print("="*60)

---
## 🔧 Step 2: Setup Project Environment

In [None]:
# Navigate to project directory
PROJECT_ROOT = '/content/drive/MyDrive/stock_lstm_project'
os.chdir(PROJECT_ROOT)

print(f"📂 Current directory: {os.getcwd()}")
print("\n📁 Project structure:")
print("="*60)

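for root, dirs, files in os.walk('.', topdown=True):
    # Skip hidden directories (pruning in place stops os.walk from descending into them)
    dirs[:] = sorted(d for d in dirs if not d.startswith('.'))
    # Depth below the project root: '.' is 0, './data' is 1, './data/plots' is 2
    level = 0 if root == '.' else root.count(os.sep)
    if level >= 2:
        dirs[:] = []  # Limit the listing to 2 levels below the project root
    indent = ' ' * 2 * level
    # Show the project folder name for the root instead of '.'
    name = os.path.basename(os.getcwd()) if root == '.' else os.path.basename(root)
    print(f'{indent}📁 {name}/')
    subindent = ' ' * 2 * (level + 1)
    for file in sorted(files):
        if not file.startswith('.'):
            print(f'{subindent}📄 {file}')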

print("="*60)

---
## ✅ Step 3: Verify Data Files
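
Beyond checking that the files exist, it can help to confirm that each `.npz` archive holds the arrays that later cells read (`X` and `y` in Step 7). An `.npz` file is a zip archive with one `.npy` member per array, so the array names can be listed without loading the data. A sketch; the helper names `npz_keys` and `missing_keys` are hypothetical, and it is an assumption that the train and validation files use the same `X` and `y` keys as the test file:

```python
import zipfile


def npz_keys(path):
    """Return the array names stored in an .npz archive without loading the arrays."""
    with zipfile.ZipFile(path) as archive:
        # Each array is stored as "<key>.npy" inside the zip archive.
        return sorted(
            name[: -len(".npy")]
            for name in archive.namelist()
            if name.endswith(".npy")
        )


def missing_keys(path, expected=("X", "y")):
    """Return the expected array names that the archive does not contain."""
    present = set(npz_keys(path))
    return [key for key in expected if key not in present]
```

For example, `missing_keys('data/train_sequences.npz')` returns an empty list when both arrays are present.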

In [None]:
# Check if all required data files exist
data_dir = 'data'
required_files = [
    'train_sequences.npz',
    'val_sequences.npz', 
    'test_sequences.npz',
    'scalers.pkl'
]

print("🔍 Checking required data files...\n")
print("="*60)

all_present = True
for file in required_files:
    filepath = os.path.join(data_dir, file)
    exists = os.path.exists(filepath)
    status = "✅" if exists else "❌"
    size = ""
    if exists:
        size_mb = os.path.getsize(filepath) / (1024*1024)
        size = f"({size_mb:.2f} MB)"
    print(f"{status} {file:25} {'Found' if exists else 'MISSING':10} {size}")
    if not exists:
        all_present = False

print("="*60)

if all_present:
    print("\n✅ All data files present! Ready to train.")
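else:
    print("\n❌ ERROR: Some files are missing!")
    print("Please upload them to the data/ folder before proceeding.")
    # Stop here so that "Run all" does not continue into training without data
    raise FileNotFoundError("Required data files are missing from data/")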

---
## 📦 Step 4: Install/Verify Packages

In [None]:
# Install required packages (most are pre-installed in Colab)
!pip install -q tensorflow pandas numpy matplotlib seaborn scikit-learn

# Verify installations
import tensorflow as tf
import keras  # imported directly: tf.keras may not expose __version__ in recent TensorFlow releases
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

print("✅ All packages installed and imported successfully!")
print("\n📊 Package Versions:")
print(f"   TensorFlow: {tf.__version__}")
print(f"   Keras:      {keras.__version__}")
print(f"   NumPy:      {np.__version__}")
print(f"   Pandas:     {pd.__version__}")

# Check GPU availability
print(f"\n🖥️  GPU Status:")
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    print(f"   ✅ {len(gpus)} GPU(s) available")
    for gpu in gpus:
        print(f"      - {gpu.name}")
else:
    print(f"   ⚠️  No GPU detected - training on CPU")
    print(f"   💡 To enable GPU: Runtime → Change runtime type → GPU")

---
## 🚀 Step 5: RUN TRAINING!

This cell will:
1. Load all preprocessed data
2. Build LSTM model
3. Train the model (with early stopping)
4. Evaluate on test set
5. Generate visualizations
6. Save model and results

**Expected training time:**
- With GPU: 10-30 minutes
- Without GPU: 1-3 hours
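
Early stopping (item 3 above) ends training once the validation loss stops improving, which is why a run can finish well before its maximum number of epochs. The training script's actual settings are not shown in this notebook; the following is a framework-free sketch of the patience rule, with `patience` and `min_delta` as illustrative parameters. In Keras, the same idea is provided by the `keras.callbacks.EarlyStopping` callback.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) as 0-based indices.

    Training stops at the first epoch where the validation loss has failed to
    beat the best value so far by more than `min_delta` for `patience`
    consecutive epochs. If that never happens, stop_epoch is the last epoch.
    """
    if len(val_losses) == 0:
        raise ValueError("val_losses must contain at least one value")

    best_loss = float("inf")
    best_epoch = -1
    waited = 0  # epochs since the last improvement

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch

    return len(val_losses) - 1, best_epoch
```

A larger `patience` tolerates longer plateaus at the cost of extra training time.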

In [None]:
# Change to data directory (where the script expects files)
os.chdir(f'{PROJECT_ROOT}/data')

print(f"📂 Working directory: {os.getcwd()}")
print("\n" + "="*80)
print("🎯 STARTING TRAINING PIPELINE")
print("="*80)
print("\n⏳ This may take 10-30 minutes with GPU...\n")

# Run the training script
%run ../scripts/lstm_training.py

---
## 📊 Step 6: View Training Results

In [None]:
from IPython.display import Image, display
import os

# Ensure we're in the right directory
os.chdir(f'{PROJECT_ROOT}/data')

plot_dir = 'plots'

print("="*80)
print("📊 TRAINING RESULTS VISUALIZATION")
print("="*80)

# Training History
print("\n📈 1. Training History (Loss & MAE)")
print("-" * 80)
if os.path.exists(f'{plot_dir}/training_history.png'):
    display(Image(f'{plot_dir}/training_history.png'))
else:
    print("❌ Plot not found!")

# Predictions
print("\n📈 2. Predicted vs Actual Values")
print("-" * 80)
if os.path.exists(f'{plot_dir}/predictions.png'):
    display(Image(f'{plot_dir}/predictions.png'))
else:
    print("❌ Plot not found!")

# Error Distribution
print("\n📉 3. Error Distribution Analysis")
print("-" * 80)
if os.path.exists(f'{plot_dir}/error_distribution.png'):
    display(Image(f'{plot_dir}/error_distribution.png'))
else:
    print("❌ Plot not found!")

print("\n" + "="*80)

---
## 🔍 Step 7: Load and Test Model

In [None]:
import tensorflow as tf
from tensorflow import keras
import pickle
import numpy as np

# Ensure we're in data directory
os.chdir(f'{PROJECT_ROOT}/data')

print("="*80)
print("🔍 LOADING TRAINED MODEL")
print("="*80)

# Load the trained model
model_path = 'final_lstm_model.keras'
model = keras.models.load_model(model_path)

print(f"\n✅ Model loaded successfully from: {model_path}")
print(f"\n📋 Model Architecture:")
print("-" * 80)
model.summary()
print("-" * 80)

# Load scalers
with open('scalers.pkl', 'rb') as f:
    scalers = pickle.load(f)
print(f"\n✅ Scalers loaded successfully")

# Load test data
test_data = np.load('test_sequences.npz')
X_test = test_data['X']
y_test = test_data['y']

print(f"✅ Test data loaded: {X_test.shape[0]:,} sequences")
print("="*80)

---
## 🎯 Step 8: Make Sample Predictions
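
The cell in this step reports MAE, MSE, and RMSE for a handful of samples. For reference, a dependency-free sketch of those three metrics on plain Python sequences (the function name `regression_metrics` is hypothetical). The results are in whatever units the targets in `test_sequences.npz` use:

```python
import math


def regression_metrics(predicted, actual):
    """Return MAE, MSE, and RMSE for two equal-length sequences of numbers."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(predicted) == 0:
        raise ValueError("at least one sample is required")

    errors = [p - a for p, a in zip(predicted, actual)]
    mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
    mse = sum(e * e for e in errors) / len(errors)    # mean squared error
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse)}
```

RMSE penalizes large errors more heavily than MAE, so a wide gap between the two points to a few large misses.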

In [None]:
import numpy as np

print("="*80)
print("🎯 SAMPLE PREDICTIONS")
print("="*80)

# Make predictions on first 10 test samples
num_samples = 10
predictions = model.predict(X_test[:num_samples])

print(f"\n📊 First {num_samples} Predictions:\n")
print(f"{'Sample':>8} | {'Predicted':>12} | {'Actual':>12} | {'Error':>12} | {'Error %':>10}")
print("-" * 80)

# Take the first output column so the comparison stays element-wise
# even if y_test is stored with shape (N, 1) instead of (N,)
y_pred = predictions[:num_samples, 0]
y_true = np.asarray(y_test[:num_samples]).reshape(num_samples, -1)[:, 0]

for i in range(num_samples):
    pred = y_pred[i]
    actual = y_true[i]
    error = pred - actual
    error_pct = (error / (abs(actual) + 1e-8)) * 100

    print(f"{i+1:>8} | {pred:>12.4f} | {actual:>12.4f} | {error:>12.4f} | {error_pct:>9.2f}%")

# Calculate metrics on these samples
mae = np.mean(np.abs(y_pred - y_true))
mse = np.mean((y_pred - y_true)**2)
rmse = np.sqrt(mse)

print("-" * 80)
print(f"\n📈 Metrics for these {num_samples} samples:")
print(f"   MAE:  {mae:.6f}")
print(f"   MSE:  {mse:.6f}")
print(f"   RMSE: {rmse:.6f}")
print("="*80)

---
## 📥 Step 9: Download Model (Optional)

If you want to download the model to your local machine:

In [None]:
from google.colab import files
import os

os.chdir(f'{PROJECT_ROOT}/data')

print("📥 Downloading model files...\n")

# Download the model, the scalers, and the training history
for name in ['final_lstm_model.keras', 'scalers.pkl', 'training_history.pkl']:
    if os.path.exists(name):
        print(f"Downloading: {name}")
        files.download(name)
    else:
        print(f"⚠️  Skipping {name} (not found)")

print("\n✅ Downloads complete!")
print("\n💡 These files are also saved in your Google Drive at:")
print(f"   {PROJECT_ROOT}/data/")

---
## 📋 Summary

### ✅ What was created:
1. **Models:**
   - `best_lstm_model.keras` - Best model during training
   - `final_lstm_model.keras` - Final trained model

2. **Data:**
   - `training_history.pkl` - Complete training history

3. **Visualizations:**
   - `plots/training_history.png` - Loss curves
   - `plots/predictions.png` - Prediction analysis
   - `plots/error_distribution.png` - Error analysis
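
`training_history.pkl` can be reopened later to find the epoch with the lowest validation loss. A sketch, assuming the file holds a dict of per-epoch lists in the style of Keras's `History.history` (for example, keys `loss` and `val_loss`). The structure actually written by the training script is not shown in this notebook, so inspect the keys first. The helper names are hypothetical:

```python
import pickle


def load_history(path):
    """Load a pickled training history. Only unpickle files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)


def best_epoch(history, metric="val_loss"):
    """Return (epoch, value) for the lowest value of `metric`; epochs are 1-based."""
    values = list(history[metric])
    if not values:
        raise ValueError(f"no values recorded for {metric!r}")
    index = min(range(len(values)), key=lambda i: values[i])
    return index + 1, values[index]


# Example (assumes the keys described above):
# history = load_history("training_history.pkl")
# print(sorted(history))       # check which metrics were recorded
# print(best_epoch(history))   # epoch with the lowest validation loss
```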

### 🚀 Next Steps:
1. Analyze the plots to understand model performance
2. If metrics are good, use the model for real predictions
3. If metrics are poor, consider:
   - Collecting more data
   - Adjusting model architecture
   - Tuning hyperparameters
   - Adding more features

### 💾 All files are saved in Google Drive:
`/content/drive/MyDrive/stock_lstm_project/data/`