# Trading on Trends [An Advanced Sentiment Analysis] - Documentation

## Project Requirements and Implementation

### Required Criteria

#### 1. Using APIs in the project (Criterion 11)
   * **Reddit API**: Social media sentiment data collection using PRAW library
     - Implementation: `StockSentimentAnalyzer.fetch_reddit_data()`
     - Features: Subreddit search, post collection, sentiment scoring
   * **Yahoo Finance API**: Historical stock market data retrieval using yfinance
     - Implementation: `StockSentimentAnalyzer.fetch_stock_data()`
     - Features: OHLCV data, technical indicators, price history
   * **Flask API**: Web application endpoints for analysis requests
     - Implementation: `/analyze` route in Flask application
     - Features: Real-time analysis and visualization generation

#### 2. Data cleaning and/or Data transformation (Criterion 3)
   * **Text preprocessing**:
     - URL removal, special character handling, whitespace normalization
     - Implementation: `StockSentimentAnalyzer.clean_text()`
   * **Data merging**:
     - Stock and sentiment data fusion with date alignment
     - Implementation: `StockSentimentAnalyzer.merge_stock_and_sentiment()`
   * **Missing value handling**:
     - NaN and infinity value replacement with appropriate defaults
     - Forward/backward fill for price data, zero-fill for sentiment data
   * **Feature engineering**:
     - Technical indicators (RSI, MACD, Bollinger Bands)
     - Lag features, rolling statistics, interaction features
     - Implementation: `StockSentimentAnalyzer.add_technical_indicators()`
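
The missing-value rules listed above (forward/backward fill for prices, zero-fill for sentiment) can be illustrated with a small dependency-free sketch. The helper name `clean_series` and the sample values are hypothetical; the project itself works on pandas DataFrames, so this only shows the intended behaviour on plain lists.

```python
import math


def clean_series(values, fill="ffill_bfill"):
    """Replace None/NaN/inf entries in a list of numbers.

    fill="ffill_bfill": carry the last valid value forward, then fill
                        leading gaps from the first valid value (prices).
    fill="zero":        replace missing entries with 0.0 (sentiment).
    """
    # Treat None, NaN and +/-inf uniformly as "missing"
    cleaned = [
        v if (v is not None and math.isfinite(v)) else None
        for v in values
    ]

    if fill == "zero":
        return [0.0 if v is None else v for v in cleaned]

    # Forward fill: reuse the most recent valid value
    last = None
    for i, v in enumerate(cleaned):
        if v is None:
            cleaned[i] = last
        else:
            last = v

    # Backward fill: only leading gaps are still missing at this point
    nxt = None
    for i in range(len(cleaned) - 1, -1, -1):
        if cleaned[i] is None:
            cleaned[i] = nxt
        else:
            nxt = cleaned[i]
    return cleaned


prices = [float("nan"), 101.0, float("inf"), 103.5, None]
sentiment = [0.2, float("nan"), -0.1, None, 0.4]

print(clean_series(prices))             # [101.0, 101.0, 101.0, 103.5, 103.5]
print(clean_series(sentiment, "zero"))  # [0.2, 0.0, -0.1, 0.0, 0.4]
```

Zero is a natural default for sentiment because a day without posts carries no signal, whereas a zero price would distort every return and indicator derived from it.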

#### 3. Logistic Regression and variants (Criterion 9)
   * **Binary classification** for stock price direction prediction (up/down)
   * **Models compared**:
     - Standard Logistic Regression with L2 regularization
     - Random Forest Classifier (tree-based comparison model)
     - XGBoost Classifier (tree-based comparison model)
   * **Implementation**: `StockSentimentAnalyzer.train_models()`
   * **Evaluation**: `StockSentimentAnalyzer._evaluate_logistic_model()`
     - Metrics: Accuracy, Precision, Recall, F1-score, AUC-ROC
     - Confusion matrix visualization
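
As a rough illustration of how the listed classification metrics relate to the confusion matrix, the sketch below computes them by hand for up/down labels. The function name and the sample labels are assumptions for illustration; the project presumably relies on the `sklearn.metrics` functions for the real evaluation.

```python
def classification_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and derived metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        # Rows are true classes, columns are predicted classes (0 then 1)
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# 1 = price went up the next day, 0 = it did not (made-up labels)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics["confusion_matrix"])  # [[3, 1], [1, 3]]
print(metrics["accuracy"])          # 0.75
```

AUC-ROC is left out because it needs predicted probabilities rather than hard labels.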

### Additional Criteria Implemented

#### 1. Object-oriented code (Criterion 1)
   * **Main class**: `StockSentimentAnalyzer`
   * **Methods organized by functionality**:
     - Data collection: `fetch_stock_data()`, `fetch_reddit_data()`
     - Data processing: `clean_text()`, `merge_stock_and_sentiment()`
     - Analysis: `analyze_data()`, `train_models()`
     - Visualization: Private methods for plotting
   * **Encapsulation**: Private helper methods, state management

#### 2. Regular Expression (Criterion 5)
   * **Text cleaning patterns**:
     - URL removal: `r'https?://\S+|www\.\S+'`
     - Username removal: `r'@\w+'`
     - Hashtag symbol removal (tag text is kept): `r'#'`
     - Special character filtering: `r'[^\w\s\.,!?]'`
   * **Implementation**: `StockSentimentAnalyzer.clean_text()`
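
To show what the patterns above do to a typical post, here is a standalone sketch that applies them in the listed order and then collapses whitespace. The function name `apply_cleaning_patterns` and the sample string are made up for illustration.

```python
import re


def apply_cleaning_patterns(text):
    """Apply the documented regex patterns in order, then tidy whitespace."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'@\w+', '', text)                   # drop @usernames
    text = re.sub(r'#', '', text)                      # drop '#', keep the tag text
    text = re.sub(r'[^\w\s\.,!?]', '', text)           # drop other special characters
    return re.sub(r'\s+', ' ', text).strip()           # collapse whitespace


sample = "Check https://example.com @trader_joe says #TSLA to the moon!!! $$$"
print(apply_cleaning_patterns(sample))  # Check says TSLA to the moon!!!
```

Order matters here: URLs have to be removed before the special-character filter, otherwise the filter would strip `:` and `/` and leave URL fragments behind.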

#### 3. Linear Regression and variants (Criterion 8)
   * **Continuous target prediction** for stock returns percentage
   * **Models compared**:
     - Ridge Regression (linear model with L2 regularization)
     - Gradient Boosting Regressor (tree-based comparison model)
     - XGBoost Regressor (tree-based comparison model)
   * **Implementation**: `StockSentimentAnalyzer.train_models()`
   * **Evaluation**: `StockSentimentAnalyzer._evaluate_linear_model()`
     - Metrics: RMSE, MAE, R², Directional Accuracy
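
The regression metrics above can be sketched in a few lines of plain Python. The definition of directional accuracy used here (the share of samples where predicted and actual returns have the same sign) is an assumption, since the documentation does not spell it out; the function name and sample returns are likewise hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, R^2 and directional accuracy for return predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot if ss_tot else 0.0

    # Assumed definition: prediction and outcome agree on "up" vs "not up"
    hits = sum(1 for t, p in zip(y_true, y_pred) if (t > 0) == (p > 0))
    return {
        "rmse": rmse,
        "mae": mae,
        "r2": r2,
        "directional_accuracy": hits / n,
    }


# Made-up daily returns in percent
actual = [1.0, -0.5, 2.0, -1.5]
predicted = [0.5, 0.5, 1.0, -1.0]

scores = regression_metrics(actual, predicted)
print(scores["mae"])                   # 0.75
print(scores["directional_accuracy"])  # 0.75
```

Directional accuracy is useful alongside RMSE because a model can have small errors and still pick the wrong side of zero, which is what matters for a trading signal.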

### New Features and Enhancements

#### 1. Interactive Web Application
   * **Flask-based API**: Real-time analysis endpoint
   * **Dynamic visualizations**: Plotly charts with interactive elements
   * **Multi-panel dashboard**: Price charts, volume, technical indicators

#### 2. Advanced Sentiment Analysis
   * **Financial lexicon enhancement**: Custom financial terms for VADER
   * **Multi-source aggregation**: Combined sentiment from multiple subreddits
   * **Temporal features**: Sentiment momentum, rolling averages
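
The temporal features can be pictured with a minimal sketch on a list of daily mean compound scores. It assumes momentum is the day-over-day change and uses a 3-day rolling mean; the helper names and sample scores are hypothetical, and early positions without enough history are reported as `None`.

```python
def rolling_mean(values, window):
    """Mean over the trailing `window` values; None until enough history exists."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window:i + 1]
            out.append(sum(chunk) / window)
    return out


def momentum(values):
    """Day-over-day change; the first day has no previous value."""
    if not values:
        return []
    return [None] + [values[i] - values[i - 1] for i in range(1, len(values))]


# Made-up daily mean compound sentiment scores
daily_compound = [0.10, 0.30, 0.20, -0.10, 0.40]

roll3 = rolling_mean(daily_compound, window=3)
mom = momentum(daily_compound)

print([None if v is None else round(v, 3) for v in roll3])
print([None if v is None else round(v, 3) for v in mom])
```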

#### 3. Comprehensive Technical Analysis
   * **Indicators implemented**:
     - Simple/Exponential Moving Averages (SMA/EMA)
     - Relative Strength Index (RSI)
     - MACD (Moving Average Convergence Divergence)
     - Bollinger Bands
     - Average True Range (ATR)
     - Stochastic Oscillator
   * **Visualization**: Multi-panel technical charts
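
As one worked example of the indicators above, the sketch below computes Bollinger Bands on a plain list of closing prices: a rolling mean as the middle band, with upper and lower bands a fixed number of standard deviations away. The function name, the short window, and the prices are illustrative assumptions; the sample standard deviation is used, which matches the pandas `rolling().std()` default.

```python
import statistics


def bollinger_bands(closes, window=20, num_std=2.0):
    """Return one dict of band values per close; None until the window is full."""
    bands = []
    for i in range(len(closes)):
        if i + 1 < window:
            bands.append(None)
            continue
        chunk = closes[i + 1 - window:i + 1]
        middle = sum(chunk) / window
        std = statistics.stdev(chunk)  # sample standard deviation (n - 1)
        upper = middle + num_std * std
        lower = middle - num_std * std
        bands.append({
            "middle": middle,
            "upper": upper,
            "lower": lower,
            # Band width relative to the middle band
            "width": (upper - lower) / middle,
        })
    return bands


# Made-up closing prices with a short window so the numbers are easy to check
bands = bollinger_bands([10, 11, 12, 11, 10], window=3)
print(bands[2])  # middle 11.0, upper 13.0, lower 9.0
```

A narrow relative width signals low recent volatility, which is why the width is often used as a model feature in its own right.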

#### 4. Enhanced Model Training Pipeline
   * **Feature selection**: Random Forest-based importance ranking
   * **Cross-validation**: Time series split for temporal data
   * **Class balancing**: SMOTE for handling imbalanced datasets
   * **Model selection**: Automated comparison of multiple algorithms
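
The point of a time series split is that every training fold ends before its test fold begins, so the model never sees the future. The sketch below is a simplified, hypothetical helper that reproduces the expanding-window idea behind scikit-learn's `TimeSeriesSplit`; it is not the library's code.

```python
def time_series_splits(n_samples, n_splits=3):
    """Yield (train_indices, test_indices) pairs with an expanding training window."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")

    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train_idx = list(range(0, test_start))                      # everything before the test fold
        test_idx = list(range(test_start, test_start + test_size))  # the next block in time
        yield train_idx, test_idx


splits = list(time_series_splits(10, n_splits=3))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
# [0, 1, 2, 3] [4, 5]
# [0, 1, 2, 3, 4, 5] [6, 7]
# [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```

A shuffled split would mix later days into the training set, which inflates scores on data where neighbouring days are correlated.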

#### 5. Robust Error Handling and Data Validation
   * **NaN/infinity handling**: Comprehensive cleaning before analysis
   * **API error management**: Graceful fallbacks for data collection
   * **Visualization safety**: Validated inputs for plotting functions

#### 6. Performance Monitoring
   * **Model metrics tracking**: JSON storage of evaluation results
   * **Feature importance analysis**: Visual comparison across models
   * **Backtesting framework**: Historical performance validation

#### 7. Scalable Architecture
   * **Modular design**: Separate concerns for data, models, visualization
   * **Configurable parameters**: Easy adjustment of analysis parameters
   * **Extensible framework**: Simple addition of new data sources or models

### Project Structure
stock_sentiment_analysis/
├── data/
│   ├── raw/              # Original API data
│   ├── processed/        # Cleaned and merged data
│   └── final/            # Analysis-ready datasets
├── models/               # Trained model artifacts
├── visualizations/       # Generated charts and plots
├── app.py               # Flask web application
├── StockSentimentAnalyzer.py  # Main analysis class
└── requirements.txt      # Project dependencies


### Future Enhancements
1. Real-time streaming data integration
2. Advanced deep learning models (LSTM, Transformer)
3. Multi-asset portfolio analysis
4. Risk management and position sizing
5. Deployment on cloud infrastructure
6. Mobile-responsive frontend interface

In [None]:
!pip install praw yfinance pandas numpy matplotlib seaborn scikit-learn nltk textblob plotly joblib flask ta xgboost imbalanced-learn pyngrok flask_ngrok

Collecting praw
  Downloading praw-7.8.1-py3-none-any.whl.metadata (9.4 kB)
Collecting ta
  Downloading ta-0.11.0.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting pyngrok
  Downloading pyngrok-7.2.8-py3-none-any.whl.metadata (10 kB)
Collecting flask_ngrok
  Downloading flask_ngrok-0.0.25-py3-none-any.whl.metadata (1.8 kB)
Collecting prawcore<3,>=2.4 (from praw)
  Downloading prawcore-2.4.0-py3-none-any.whl.metadata (5.0 kB)
Collecting update_checker>=0.18 (from praw)
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Downloading praw-7.8.1-py3-none-any.whl (189 kB)
Downloading pyngrok-7.2.8-py3-none-any.whl (25 kB)
Downloading flask_ngrok-0.0.25-py3-none-any.whl (3.1 kB)
Downloading prawcore-2.4.0-py3-none-any.whl (17 kB)
Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Building wheels for collected pa

In [None]:
!pip install pyngrok
!pip install flask_ngrok
!pip install --upgrade yfinance

Collecting yfinance
  Downloading yfinance-0.2.61-py2.py3-none-any.whl.metadata (5.8 kB)
Downloading yfinance-0.2.61-py2.py3-none-any.whl (117 kB)
Installing collected packages: yfinance
  Attempting uninstall: yfinance
    Found existing installation: yfinance 0.2.59
    Uninstalling yfinance-0.2.59:
      Successfully uninstalled yfinance-0.2.59
Successfully installed yfinance-0.2.61


In [None]:
# Read the ngrok token from the NGROK_AUTHTOKEN environment variable instead of hard-coding it
!ngrok config add-authtoken "$NGROK_AUTHTOKEN"

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [None]:
# Create directory structure
import os
os.makedirs("data/raw", exist_ok=True)
os.makedirs("data/processed", exist_ok=True)
os.makedirs("data/final", exist_ok=True)
os.makedirs("visualizations", exist_ok=True)
os.makedirs("models", exist_ok=True)

## **Bollinger Bands: updated accuracy with XGBoost for R-squared**

In [None]:
# Download NLTK resources
import nltk
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import time
import yfinance as yf
import praw
from textblob import TextBlob
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import re
from sklearn.model_selection import train_test_split, GridSearchCV, TimeSeriesSplit
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor, RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    confusion_matrix, mean_squared_error, mean_absolute_error, r2_score
)
from sklearn.feature_selection import SelectFromModel
from imblearn.over_sampling import SMOTE
import plotly.graph_objects as go
import plotly.express as px
import joblib
import json
import ta
import xgboost as xgb
from pyngrok import ngrok
from flask import Flask, render_template, request, jsonify
from flask_ngrok import run_with_ngrok

# Reddit API credentials: read from environment variables rather than hard-coding secrets
REDDIT_CLIENT_ID = os.environ.get("REDDIT_CLIENT_ID", "")
REDDIT_CLIENT_SECRET = os.environ.get("REDDIT_CLIENT_SECRET", "")
REDDIT_USER_AGENT = "tradingOntrends1.0"

class StockSentimentAnalyzer:
    def __init__(self, output_dir="data"):
        """
        Initialize the stock sentiment analyzer

        Parameters:
        output_dir (str): Directory to save data files
        """
        self.output_dir = output_dir
        self.raw_dir = os.path.join(output_dir, "raw")
        self.processed_dir = os.path.join(output_dir, "processed")
        self.final_dir = os.path.join(output_dir, "final")
        self.model_dir = os.path.join("models")
        self.viz_dir = os.path.join("visualizations")

        # Create directories if they don't exist
        os.makedirs(self.raw_dir, exist_ok=True)
        os.makedirs(self.processed_dir, exist_ok=True)
        os.makedirs(self.final_dir, exist_ok=True)
        os.makedirs(self.model_dir, exist_ok=True)
        os.makedirs(self.viz_dir, exist_ok=True)

        # Initialize sentiment analyzer with enhanced financial lexicon
        try:
            self.sia = SentimentIntensityAnalyzer()
            self._enhance_financial_lexicon()
            print("VADER sentiment analyzer initialized with financial lexicon")
        except Exception as e:
            print(f"Error initializing VADER: {e}")
            print("Will use TextBlob for sentiment analysis")
            self.sia = None

        # Initialize Reddit client
        try:
            self.reddit = praw.Reddit(
                client_id=REDDIT_CLIENT_ID,
                client_secret=REDDIT_CLIENT_SECRET,
                user_agent=REDDIT_USER_AGENT
            )
            print("Reddit client initialized successfully")
        except Exception as e:
            print(f"Error initializing Reddit client: {e}")
            self.reddit = None

    def _enhance_financial_lexicon(self):
        """Add finance-specific terms to VADER lexicon for better sentiment analysis"""
        if not self.sia:
            return

        # Positive financial terms
        positive_terms = {
            'bullish': 3.0, 'outperform': 2.5, 'buy': 2.0, 'upgrade': 2.0,
            'beat': 1.5, 'exceeds': 1.5, 'growth': 1.0, 'profit': 1.0,
            'surge': 1.8, 'rally': 1.7, 'gain': 1.2, 'upside': 1.3,
            'momentum': 0.8, 'opportunity': 0.7, 'strong': 0.9, 'higher': 0.7,
            'support': 0.6, 'confidence': 0.7, 'positive': 0.8, 'potential': 0.6
        }

        # Negative financial terms
        negative_terms = {
            'bearish': -3.0, 'underperform': -2.5, 'sell': -2.0, 'downgrade': -2.0,
            'miss': -1.5, 'below': -1.5, 'decline': -1.0, 'loss': -1.0,
            'plunge': -1.8, 'crash': -2.5, 'drop': -1.2, 'downside': -1.3,
            'weak': -0.9, 'risk': -0.8, 'concern': -0.7, 'lower': -0.7,
            'resistance': -0.6, 'recession': -1.5, 'negative': -0.8, 'caution': -0.6
        }

        # Update the lexicon
        self.sia.lexicon.update(positive_terms)
        self.sia.lexicon.update(negative_terms)

    def clean_text(self, text):
        """
        Clean text by removing URLs, special characters, etc.

        Parameters:
        text (str): Text to clean

        Returns:
        str: Cleaned text
        """
        if not isinstance(text, str):
            return ""

        # Remove URLs
        text = re.sub(r'https?://\S+|www\.\S+', '', text)

        # Remove usernames
        text = re.sub(r'@\w+', '', text)

        # Remove hashtags symbol (but keep the text)
        text = re.sub(r'#', '', text)

        # Remove special characters (keep letters, spaces, and basic punctuation)
        text = re.sub(r'[^\w\s\.,!?]', '', text)

        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text).strip()

        return text

    def get_sentiment(self, text):
        """
        Get sentiment score using VADER or TextBlob

        Parameters:
        text (str): Text to analyze

        Returns:
        dict: Sentiment scores and classification
        """
        cleaned_text = self.clean_text(text)

        if not cleaned_text:
            return {
                'compound': 0,
                'positive': 0,
                'negative': 0,
                'neutral': 0,
                'textblob': 0,
                'sentiment': 'neutral'
            }

        # Get TextBlob sentiment
        blob_sentiment = TextBlob(cleaned_text).sentiment.polarity

        # Get VADER sentiment if available
        if self.sia:
            vader_scores = self.sia.polarity_scores(cleaned_text)
            compound = vader_scores['compound']
            positive = vader_scores['pos']
            negative = vader_scores['neg']
            neutral = vader_scores['neu']
        else:
            # Fallback to TextBlob for sentiment
            compound = blob_sentiment
            positive = max(0, compound)
            negative = max(0, -compound)
            neutral = 1 - (positive + negative)

        # Classify sentiment
        if compound >= 0.05:
            sentiment = 'positive'
        elif compound <= -0.05:
            sentiment = 'negative'
        else:
            sentiment = 'neutral'

        # Return combined results
        return {
            'compound': compound,
            'positive': positive,
            'negative': negative,
            'neutral': neutral,
            'textblob': blob_sentiment,
            'sentiment': sentiment
        }

    def add_technical_indicators(self, stock_data):
        """
        Add technical analysis indicators to stock data

        Parameters:
        stock_data (DataFrame): DataFrame with stock price data

        Returns:
        DataFrame: DataFrame with added technical indicators
        """
        try:
            # Make a copy to avoid modifying the original
            df = stock_data.copy()

            # Simple Moving Averages
            df['sma_5'] = df['Close'].rolling(window=5).mean()
            df['sma_10'] = df['Close'].rolling(window=10).mean()
            df['sma_20'] = df['Close'].rolling(window=20).mean()

            # Exponential Moving Averages
            df['ema_5'] = df['Close'].ewm(span=5, adjust=False).mean()
            df['ema_10'] = df['Close'].ewm(span=10, adjust=False).mean()
            df['ema_20'] = df['Close'].ewm(span=20, adjust=False).mean()

            # Moving Average Convergence Divergence (MACD)
            try:
                macd = ta.trend.MACD(df['Close'])
                df['macd'] = macd.macd()
                df['macd_signal'] = macd.macd_signal()
                df['macd_diff'] = macd.macd_diff()
            except Exception:
                # Calculate MACD manually if ta library fails
                df['macd'] = df['Close'].ewm(span=12, adjust=False).mean() - df['Close'].ewm(span=26, adjust=False).mean()
                df['macd_signal'] = df['macd'].ewm(span=9, adjust=False).mean()
                df['macd_diff'] = df['macd'] - df['macd_signal']

            # Relative Strength Index (RSI)
            try:
                df['rsi_14'] = ta.momentum.RSIIndicator(df['Close'], window=14).rsi()
            except Exception:
                # Simplified RSI calculation if ta library fails
                delta = df['Close'].diff()
                gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
                loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
                rs = gain / loss
                df['rsi_14'] = 100 - (100 / (1 + rs))

            # Bollinger Bands
            try:
                bollinger = ta.volatility.BollingerBands(df['Close'], window=20, window_dev=2)
                df['bb_upper'] = bollinger.bollinger_hband()
                df['bb_middle'] = bollinger.bollinger_mavg()
                df['bb_lower'] = bollinger.bollinger_lband()
                df['bb_width'] = (df['bb_upper'] - df['bb_lower']) / df['bb_middle']
            except Exception:
                # Manual Bollinger Bands calculation
                df['bb_middle'] = df['Close'].rolling(window=20).mean()
                df['bb_std'] = df['Close'].rolling(window=20).std()
                df['bb_upper'] = df['bb_middle'] + 2 * df['bb_std']
                df['bb_lower'] = df['bb_middle'] - 2 * df['bb_std']
                df['bb_width'] = (df['bb_upper'] - df['bb_lower']) / df['bb_middle']

            # Average True Range (ATR) - Volatility indicator
            try:
                df['atr'] = ta.volatility.AverageTrueRange(df['High'], df['Low'], df['Close'], window=14).average_true_range()
            except Exception:
                # Simplified ATR calculation
                high_low = df['High'] - df['Low']
                high_close = np.abs(df['High'] - df['Close'].shift())
                low_close = np.abs(df['Low'] - df['Close'].shift())
                ranges = pd.concat([high_low, high_close, low_close], axis=1)
                true_range = ranges.max(axis=1)
                df['atr'] = true_range.rolling(14).mean()

            # Stochastic Oscillator
            try:
                stoch = ta.momentum.StochasticOscillator(df['High'], df['Low'], df['Close'], window=14, smooth_window=3)
                df['stoch_k'] = stoch.stoch()
                df['stoch_d'] = stoch.stoch_signal()
            except Exception:
                # Manual Stochastic calculation
                df['stoch_k'] = 100 * (df['Close'] - df['Low'].rolling(window=14).min()) / (df['High'].rolling(window=14).max() - df['Low'].rolling(window=14).min())
                df['stoch_d'] = df['stoch_k'].rolling(window=3).mean()

            # Rate of Change (ROC)
            df['roc_5'] = df['Close'].pct_change(periods=5) * 100
            df['roc_10'] = df['Close'].pct_change(periods=10) * 100

            # Price rate of change
            df['close_pct_change'] = df['Close'].pct_change() * 100
            df['volume_pct_change'] = df['Volume'].pct_change() * 100

            # Price to SMA ratios
            df['price_to_sma5'] = df['Close'] / df['sma_5']
            df['price_to_sma20'] = df['Close'] / df['sma_20']

            # EMA position flags (1 while the shorter EMA is above the longer one, -1 while below, 0 when equal)
            df['ema_5_10_cross'] = np.where(df['ema_5'] > df['ema_10'], 1, np.where(df['ema_5'] < df['ema_10'], -1, 0))
            df['ema_10_20_cross'] = np.where(df['ema_10'] > df['ema_20'], 1, np.where(df['ema_10'] < df['ema_20'], -1, 0))

            # Add volatility features
            df['volatility_daily'] = (df['High'] - df['Low']) / df['Open'] * 100
            df['volatility_5d'] = df['volatility_daily'].rolling(window=5).mean()

            # Gap features
            df['gap'] = (df['Open'] - df['Close'].shift(1)) / df['Close'].shift(1) * 100

            # Advanced rolling window features
            df['close_max_5d'] = df['Close'].rolling(window=5).max()
            df['close_min_5d'] = df['Close'].rolling(window=5).min()
            df['close_mean_5d'] = df['Close'].rolling(window=5).mean()
            df['close_std_5d'] = df['Close'].rolling(window=5).std()

            # Sentiment rolling features if available
            if 'compound_score_mean' in df.columns:
                df['sent_roll_mean_3'] = df['compound_score_mean'].rolling(3).mean()
                df['sent_roll_std_3'] = df['compound_score_mean'].rolling(3).std()

                # Sentiment momentum
                df['sent_momentum'] = df['compound_score_mean'] - df['compound_score_mean'].shift(1)
                df['sent_shift1'] = df['compound_score_mean'].shift(1)
                df['sent_shift2'] = df['compound_score_mean'].shift(2)

                # Interaction features
                df['sent_price_interaction'] = df['compound_score_mean'] * df['close_pct_change']
                df['sent_volume_interaction'] = df['compound_score_mean'] * df['volume_pct_change']

            # Fill NaNs created by the rolling windows: backward fill first,
            # then replace anything still missing with 0.
            # Note: bfill copies later values into earlier rows, which can leak
            # future information into features used for training.
            df = df.bfill()
            df = df.fillna(0)

            return df

        except Exception as e:
            print(f"Error adding technical indicators: {e}")
            # Return original data if technical indicators fail
            return stock_data

    def fetch_stock_data(self, ticker_symbol, period="1mo", interval="1d"):
        """
        Fetch stock data using Yahoo Finance

        Parameters:
        ticker_symbol (str): Stock ticker symbol
        period (str): Period to fetch data for (e.g., "1mo", "3mo", "1y")
        interval (str): Data interval (e.g., "1d", "1h")

        Returns:
        DataFrame: Stock price data
        """
        print(f"Fetching stock data for {ticker_symbol}...")

        try:
            # Get stock data from Yahoo Finance
            stock = yf.Ticker(ticker_symbol)
            stock_data = stock.history(period=period, interval=interval)

            if stock_data.empty:
                print(f"No data found for {ticker_symbol}")
                return None

            # Add datetime index as a column
            stock_data['Date'] = stock_data.index

            # Calculate daily returns
            stock_data['daily_return'] = stock_data['Close'].pct_change() * 100

            # Calculate target variable: 1 if price goes up next day, 0 otherwise
            stock_data['price_up_next_day'] = stock_data['daily_return'].shift(-1) > 0
            stock_data['price_up_next_day'] = stock_data['price_up_next_day'].astype(int)

            # Add technical indicators
            stock_data = self.add_technical_indicators(stock_data)

            # Save to CSV
            csv_filename = os.path.join(self.raw_dir, f"{ticker_symbol}_stock.csv")
            stock_data.to_csv(csv_filename)
            print(f"✅ Stock data saved to {csv_filename}")

            return stock_data

        except Exception as e:
            print(f"Error fetching stock data for {ticker_symbol}: {e}")
            return None

    def fetch_reddit_data(self, ticker_symbol, subreddits=None, limit=100, days_back=30):
        """
        Fetch Reddit posts related to a stock ticker

        Parameters:
        ticker_symbol (str): Stock ticker symbol
        subreddits (list): List of subreddit names to search (default: ["stocks", "investing", "wallstreetbets", "stockmarket"])
        limit (int): Maximum number of posts to collect per subreddit
        days_back (int): Number of days to look back

        Returns:
        DataFrame: Processed Reddit data with sentiment
        """
        if self.reddit is None:
            print("Reddit client not initialized. Cannot fetch Reddit data.")
            return None

        if subreddits is None:
            subreddits = ["stocks", "investing", "wallstreetbets", "stockmarket"]

        print(f"Fetching Reddit data for {ticker_symbol} from {subreddits}...")

        all_posts = []
        # Use timezone-aware UTC timestamps so they match the utc=True parsing applied at merge time
        past_date = pd.Timestamp.now(tz='UTC') - timedelta(days=days_back)

        for subreddit_name in subreddits:
            try:
                print(f"Searching r/{subreddit_name} for posts about {ticker_symbol}...")
                subreddit = self.reddit.subreddit(subreddit_name)
                search_query = ticker_symbol

                # Search for posts containing the ticker symbol
                for post in subreddit.search(search_query, limit=limit):
                    # Skip posts older than days_back
                    post_date = pd.to_datetime(post.created_utc, unit='s', utc=True)
                    if post_date < past_date:
                        continue

                    # Combine title and text for content
                    content = f"{post.title} {post.selftext}"
                    cleaned_content = self.clean_text(content)

                    # Get sentiment scores
                    sentiment = self.get_sentiment(content)

                    all_posts.append({
                        'subreddit': subreddit_name,
                        'title': post.title,
                        'content': content,
                        'cleaned_content': cleaned_content,
                        'upvotes': post.score,
                        'upvote_ratio': post.upvote_ratio,
                        'num_comments': post.num_comments,
                        'created_at': post_date,
                        'author': str(post.author),
                        'compound_score': sentiment['compound'],
                        'positive_score': sentiment['positive'],
                        'negative_score': sentiment['negative'],
                        'neutral_score': sentiment['neutral'],
                        'textblob_score': sentiment['textblob'],
                        'sentiment': sentiment['sentiment'],
                        'ticker': ticker_symbol
                    })

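                # all_posts accumulates across subreddits, so count only this subreddit's posts
                subreddit_count = sum(1 for p in all_posts if p['subreddit'] == subreddit_name)
                print(f"Found {subreddit_count} posts about {ticker_symbol} in r/{subreddit_name}")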

            except Exception as e:
                print(f"Error fetching data from r/{subreddit_name}: {e}")

        if not all_posts:
            print(f"No Reddit posts found for {ticker_symbol}")
            return None

        # Create DataFrame
        reddit_df = pd.DataFrame(all_posts)

        # Save to CSV
        csv_filename = os.path.join(self.raw_dir, f"{ticker_symbol}_reddit.csv")
        reddit_df.to_csv(csv_filename, index=False)
        print(f"✅ Reddit data saved to {csv_filename}")

        return reddit_df

    def merge_stock_and_sentiment(self, ticker_symbol):
        """
        Merge stock data with social media sentiment data

        Parameters:
        ticker_symbol (str): Stock ticker symbol

        Returns:
        DataFrame: Merged data
        """
        try:
            # Load stock data
            stock_file = os.path.join(self.raw_dir, f"{ticker_symbol}_stock.csv")
            if not os.path.exists(stock_file):
                print(f"Stock data file not found: {stock_file}")
                return None

            # Read stock data and handle datetime conversion safely
            stock_df = pd.read_csv(stock_file)
            try:
                # Convert to datetime with explicit UTC handling
                stock_df['Date'] = pd.to_datetime(stock_df['Date'], utc=True)
                stock_df['date_only'] = stock_df['Date'].dt.date
            except Exception as e:
                print(f"Error converting stock dates for {ticker_symbol}: {e}")
                return None

            # Load Reddit data if available
            reddit_file = os.path.join(self.raw_dir, f"{ticker_symbol}_reddit.csv")
            if os.path.exists(reddit_file):
                reddit_df = pd.read_csv(reddit_file)
                try:
                    # Convert to datetime with explicit UTC handling
                    reddit_df['created_at'] = pd.to_datetime(reddit_df['created_at'], utc=True)
                    reddit_df['date_only'] = reddit_df['created_at'].dt.date
                except Exception as e:
                    print(f"Error converting Reddit dates for {ticker_symbol}: {e}")
                    return None

                # Aggregate Reddit sentiment by date with enhanced metrics
                daily_sentiment = reddit_df.groupby('date_only').agg({
                    'compound_score': ['mean', 'count', 'std', 'min', 'max'],
                    'positive_score': ['mean', 'sum'],
                    'negative_score': ['mean', 'sum'],
                    'neutral_score': 'mean',
                    'textblob_score': ['mean', 'std'],
                    'upvotes': ['sum', 'mean'],
                    'num_comments': ['sum', 'mean'],
                    'upvote_ratio': 'mean'
                }).reset_index()

                # Flatten multi-level columns
                daily_sentiment.columns = ['_'.join(col).strip('_') for col in daily_sentiment.columns.values]

                # Rename columns for clarity (the flattening step above already yields 'date_only')
                daily_sentiment = daily_sentiment.rename(columns={
                    'compound_score_count': 'post_count'
                })

                # Calculate sentiment bias (difference between positive and negative)
                daily_sentiment['sentiment_bias'] = daily_sentiment['positive_score_sum'] - daily_sentiment['negative_score_sum']

                # Calculate sentiment dispersion (ratio of standard deviation to the magnitude of the mean);
                # the 0.001 offset is added to |mean| so the denominator can never reach zero
                daily_sentiment['sentiment_dispersion'] = (daily_sentiment['compound_score_std'] /
                                                        (daily_sentiment['compound_score_mean'].abs() + 0.001))

                # Calculate engagement ratio (comments per post); clip guards against a zero count
                daily_sentiment['engagement_ratio'] = daily_sentiment['num_comments_sum'] / daily_sentiment['post_count'].clip(lower=1)

                # Merge with stock data
                merged_df = pd.merge(
                    stock_df,
                    daily_sentiment,
                    on='date_only',
                    how='left'
                )

                # Fill NaN sentiment values
                sentiment_columns = merged_df.columns[merged_df.columns.str.contains('score|post_count|upvotes|comments|sentiment|engagement')]
                merged_df[sentiment_columns] = merged_df[sentiment_columns].fillna(0)

            else:
                print(f"No Reddit data found for {ticker_symbol}, using stock data only")
                merged_df = stock_df

            # Verify data integrity
            if merged_df.empty:
                print(f"No valid data after merging for {ticker_symbol}")
                return None

            # Save merged data
            output_file = os.path.join(self.processed_dir, f"{ticker_symbol}_merged.csv")
            merged_df.to_csv(output_file, index=False)
            print(f"✅ Merged data saved to {output_file}")

            return merged_df

        except Exception as e:
            print(f"Error in merge_stock_and_sentiment for {ticker_symbol}: {e}")
            return None

    def analyze_data(self, ticker_symbol):
        """
        Analyze the merged data and create visualizations

        Parameters:
        ticker_symbol (str): Stock ticker symbol

        Returns:
        dict: Analysis statistics
        """
        try:
            # Load merged data
            merged_file = os.path.join(self.processed_dir, f"{ticker_symbol}_merged.csv")
            if not os.path.exists(merged_file):
                print(f"Merged data file not found: {merged_file}")
                return None

            merged_df = pd.read_csv(merged_file)

            # Handle datetime conversion safely
            try:
                merged_df['Date'] = pd.to_datetime(merged_df['Date'], utc=True)
            except Exception as e:
                print(f"Error converting dates in analysis for {ticker_symbol}: {e}")
                return None

            print(f"Analyzing data for {ticker_symbol}...")

            # ===== Handle NaN values before any analysis =====
            # Fill NaN values in key columns to avoid the "Input y contains NaN" error downstream
            columns_to_check = ['daily_return', 'compound_score_mean', 'Close', 'Open', 'High', 'Low', 'Volume']
            for col in columns_to_check:
                if col in merged_df.columns and merged_df[col].isna().any():
                    print(f"Found {merged_df[col].isna().sum()} NaN values in {col}, filling appropriately")
                    if col in ['daily_return', 'compound_score_mean']:
                        # For sentiment and returns, fill with 0
                        merged_df[col] = merged_df[col].fillna(0)
                    elif col in ['Close', 'Open', 'High', 'Low']:
                        # For price data, forward fill then backward fill
                        merged_df[col] = merged_df[col].ffill().bfill()
                    elif col == 'Volume':
                        # For volume, fill with median
                        merged_df[col] = merged_df[col].fillna(merged_df[col].median())

            # ===== Statistical analysis (create early to avoid another error) =====
            stats_file = os.path.join(self.processed_dir, f"{ticker_symbol}_stats.json")
            stats = {
                'ticker': ticker_symbol,
                'data_points': len(merged_df),
                'date_range': [merged_df['Date'].min().strftime('%Y-%m-%d'),
                              merged_df['Date'].max().strftime('%Y-%m-%d')],
                'avg_close': float(merged_df['Close'].mean()),
                'min_close': float(merged_df['Close'].min()),
                'max_close': float(merged_df['Close'].max()),
                'stddev_close': float(merged_df['Close'].std()),
                'avg_volume': float(merged_df['Volume'].mean()),
                'avg_daily_return': float(merged_df['daily_return'].mean()),
                'stddev_daily_return': float(merged_df['daily_return'].std()),
                'up_days_pct': float((merged_df['daily_return'] > 0).mean() * 100)
            }

            # Add technical indicator statistics if available
            if 'rsi_14' in merged_df.columns:
                stats.update({
                    'avg_rsi': float(merged_df['rsi_14'].mean()),
                    'overbought_days_pct': float((merged_df['rsi_14'] > 70).mean() * 100),
                    'oversold_days_pct': float((merged_df['rsi_14'] < 30).mean() * 100)
                })

            if 'volatility_daily' in merged_df.columns:
                stats.update({
                    'avg_volatility': float(merged_df['volatility_daily'].mean()),
                    'max_volatility': float(merged_df['volatility_daily'].max())
                })

            if 'compound_score_mean' in merged_df.columns:
                # Add sentiment statistics
                stats.update({
                    'avg_sentiment': float(merged_df['compound_score_mean'].mean()),
                    'min_sentiment': float(merged_df['compound_score_mean'].min()),
                    'max_sentiment': float(merged_df['compound_score_mean'].max()),
                    'stddev_sentiment': float(merged_df['compound_score_mean'].std()),
                    'positive_days_pct': float((merged_df['compound_score_mean'] > 0).mean() * 100),
                    'avg_posts_per_day': float(merged_df['post_count'].mean() if 'post_count' in merged_df.columns else 0)
                })

                # Safely calculate correlation between sentiment and returns
                # This is where NaN values can cause problems
                if 'daily_return' in merged_df.columns:
                    # Create a temporary dataframe with no NaN values for correlation calculation
                    temp_df = merged_df[['compound_score_mean', 'daily_return']].copy()
                    temp_df['next_day_return'] = temp_df['daily_return'].shift(-1)
                    temp_df = temp_df.dropna()

                    if len(temp_df) > 2:  # Need at least 3 points for correlation
                        corr = temp_df['compound_score_mean'].corr(temp_df['next_day_return'])
                    else:
                        corr = np.nan

                    # corr is NaN when either series is constant (e.g. all-zero sentiment);
                    # fall back to 0.0, as in the too-few-points case
                    stats.update({
                        'sentiment_return_corr': float(corr) if pd.notna(corr) else 0.0
                    })

                # Add engagement statistics if available
                if 'engagement_ratio' in merged_df.columns:
                    stats.update({
                        'avg_engagement': float(merged_df['engagement_ratio'].mean()),
                        'max_engagement': float(merged_df['engagement_ratio'].max())
                    })

            # ===== Plot 1: Stock price and sentiment over time =====
            if 'compound_score_mean' in merged_df.columns:
                try:
                    fig = plt.figure(figsize=(14, 10))

                    # Create 3 vertically stacked subplots
                    gs = fig.add_gridspec(3, 1, height_ratios=[3, 1, 1], hspace=0.1)

                    # Stock price subplot
                    ax1 = fig.add_subplot(gs[0])
                    ax1.set_title(f'{ticker_symbol} Stock Price vs. Reddit Sentiment', fontsize=14)
                    ax1.plot(merged_df['Date'], merged_df['Close'], color='blue', linewidth=2, label='Close Price')
                    ax1.set_ylabel('Stock Price ($)', color='blue', fontsize=12)
                    ax1.tick_params(axis='y', labelcolor='blue')
                    ax1.grid(True, alpha=0.3)

                    # Add sentiment line on secondary y-axis
                    ax2 = ax1.twinx()
                    ax2.plot(merged_df['Date'], merged_df['compound_score_mean'], color='green',
                          linestyle='--', linewidth=2, label='Sentiment Score')
                    ax2.set_ylabel('Sentiment Score', color='green', fontsize=12)
                    ax2.tick_params(axis='y', labelcolor='green')

                    # Create legend with both price and sentiment
                    lines1, labels1 = ax1.get_legend_handles_labels()
                    lines2, labels2 = ax2.get_legend_handles_labels()
                    ax1.legend(lines1 + lines2, labels1 + labels2, loc='upper left')

                    # Volume subplot
                    ax3 = fig.add_subplot(gs[1], sharex=ax1)
                    ax3.bar(merged_df['Date'], merged_df['Volume'], color='gray', alpha=0.6, label='Volume')
                    ax3.set_ylabel('Volume', color='gray', fontsize=12)
                    ax3.tick_params(axis='y', labelcolor='gray')
                    ax3.grid(True, alpha=0.3)

                    # Post count subplot
                    ax4 = fig.add_subplot(gs[2], sharex=ax1)
                    if 'post_count' in merged_df.columns:
                        ax4.bar(merged_df['Date'], merged_df['post_count'], color='orange', alpha=0.6, label='Post Count')
                        ax4.set_ylabel('Post Count', color='orange', fontsize=12)
                        ax4.tick_params(axis='y', labelcolor='orange')
                        ax4.grid(True, alpha=0.3)

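                    # Hide x tick labels on the upper subplots; tick_params is used because
                    # set_xticklabels([]) on a shared x-axis would also blank the bottom subplot
                    ax1.tick_params(axis='x', labelbottom=False)
                    ax3.tick_params(axis='x', labelbottom=False)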
                    ax4.set_xlabel('Date', fontsize=12)

                    plt.savefig(os.path.join(self.viz_dir, f"{ticker_symbol}_price_vs_sentiment.png"))
                    plt.close(fig)  # Explicitly close the figure
                except Exception as e:
                    print(f"Error creating price vs sentiment plot: {e}")

                # Also create interactive Plotly version
                try:
                    fig = go.Figure()

                    # Add price candlestick
                    fig.add_trace(
                        go.Candlestick(
                            x=merged_df['Date'],
                            open=merged_df['Open'],
                            high=merged_df['High'],
                            low=merged_df['Low'],
                            close=merged_df['Close'],
                            name='OHLC',
                            increasing_line_color='green',
                            decreasing_line_color='red'
                        )
                    )

                    # Add moving averages if available
                    if 'sma_20' in merged_df.columns:
                        fig.add_trace(
                            go.Scatter(
                                x=merged_df['Date'],
                                y=merged_df['sma_20'],
                                name='20-day MA',
                                line=dict(color='purple', width=1)
                            )
                        )

                    # Add sentiment overlay
                    fig.add_trace(
                        go.Scatter(
                            x=merged_df['Date'],
                            y=merged_df['compound_score_mean'],
                            name='Sentiment Score',
                            line=dict(color='green', width=2, dash='dash'),
                            yaxis='y2'
                        )
                    )

                    # Add RSI if available
                    if 'rsi_14' in merged_df.columns:
                        fig.add_trace(
                            go.Scatter(
                                x=merged_df['Date'],
                                y=merged_df['rsi_14'],
                                name='RSI (14)',
                                line=dict(color='orange', width=1),
                                visible='legendonly',
                                yaxis='y3'
                            )
                        )

                    # Configure layout with multiple y-axes
                    fig.update_layout(
                        title=f'{ticker_symbol} Stock Price vs. Reddit Sentiment',
                        xaxis_title='Date',
                        yaxis_title='Stock Price ($)',
                        # title=dict(text=..., font=...) replaces the deprecated titlefont attribute
                        yaxis2=dict(
                            title=dict(text='Sentiment Score', font=dict(color='green')),
                            tickfont=dict(color='green'),
                            overlaying='y',
                            side='right',
                            range=[-1, 1]
                        ),
                        yaxis3=dict(
                            title=dict(text='RSI', font=dict(color='orange')),
                            tickfont=dict(color='orange'),
                            anchor='free',
                            overlaying='y',
                            side='right',
                            position=0.95,
                            range=[0, 100],
                            showgrid=False
                        ),
                        legend=dict(x=0.01, y=0.99, bgcolor='rgba(255,255,255,0.8)'),
                        template='plotly_white',
                        margin=dict(l=50, r=70, t=50, b=50),
                        height=700
                    )

                    fig.write_html(os.path.join(self.viz_dir, f"{ticker_symbol}_interactive_analysis.html"))
                except Exception as e:
                    print(f"Error creating interactive plot: {e}")

            # ===== Plot 2: Trading volume and social media activity =====
            if 'post_count' in merged_df.columns:
                try:
                    fig = plt.figure(figsize=(14, 6))

                    ax1 = plt.gca()
                    ax1.set_xlabel('Date')
                    ax1.set_ylabel('Trading Volume', color='blue')
                    ax1.bar(merged_df['Date'], merged_df['Volume'], color='blue', alpha=0.5)
                    ax1.tick_params(axis='y', labelcolor='blue')

                    ax2 = ax1.twinx()
                    ax2.set_ylabel('Number of Reddit Posts', color='red')
                    ax2.plot(merged_df['Date'], merged_df['post_count'], color='red', marker='o', linestyle='-')
                    ax2.tick_params(axis='y', labelcolor='red')

                    plt.title(f'{ticker_symbol} Trading Volume vs. Reddit Activity')
                    plt.grid(True, alpha=0.3)
                    plt.savefig(os.path.join(self.viz_dir, f"{ticker_symbol}_volume_vs_activity.png"))
                    plt.close(fig)  # Explicitly close the figure
                except Exception as e:
                    print(f"Error creating volume vs activity plot: {e}")

            # ===== Plot 3: Correlation between sentiment and next day returns =====
            if 'compound_score_mean' in merged_df.columns:
                try:
                    fig = plt.figure(figsize=(10, 8))

                    # Create a clean DataFrame for the scatter plot
                    plot_df = merged_df[['Date', 'compound_score_mean', 'daily_return']].copy()

                    # Create next day return (shift current day returns back one day)
                    plot_df['next_day_return'] = plot_df['daily_return'].shift(-1)

                    # Drop rows with NaN in either plotted column so the scatter and regression get clean input
                    plot_df = plot_df.dropna(subset=['compound_score_mean', 'next_day_return'])

                    # Only create plot if we have enough data points
                    if len(plot_df) >= 3:
                        # Add post_count if available
                        if 'post_count' in merged_df.columns:
                            # Join the post_count column
                            plot_df = pd.merge(
                                plot_df,
                                merged_df[['Date', 'post_count']],
                                on='Date',
                                how='left'
                            )
                            # Fill NaN values with a default size value
                            plot_df['post_count'] = plot_df['post_count'].fillna(1)

                            # Make sure all values are positive numbers for the size parameter
                            plot_df['marker_size'] = plot_df['post_count'].clip(lower=1) * 20  # Scale for visibility

                            # Create scatter plot with explicit marker size
                            scatter = plt.scatter(
                                x=plot_df['compound_score_mean'],
                                y=plot_df['next_day_return'],
                                alpha=0.7,
                                c=plot_df['post_count'],  # This sets the color
                                cmap='viridis',
                                s=plot_df['marker_size']  # This sets the size and ensures it's always valid
                            )
                        else:
                            # Use a fixed size if post_count is not available
                            scatter = plt.scatter(
                                x=plot_df['compound_score_mean'],
                                y=plot_df['next_day_return'],
                                alpha=0.7,
                                color='blue',
                                s=70  # Fixed size
                            )

                        plt.axhline(y=0, color='r', linestyle='-', alpha=0.3)
                        plt.axvline(x=0, color='r', linestyle='-', alpha=0.3)

                        plt.xlabel('Sentiment Score', fontsize=12)
                        plt.ylabel('Next Day Return (%)', fontsize=12)
                        plt.title(f'{ticker_symbol} Sentiment vs. Next Day Returns', fontsize=14)
                        plt.grid(True, alpha=0.3)

                        # Add colorbar if using post count for coloring and we have at least 2 different values
                        if 'post_count' in plot_df.columns and plot_df['post_count'].nunique() > 1:
                            cbar = plt.colorbar(scatter)
                            cbar.set_label('Number of Posts')

                        # Add regression line
                        try:
                            from sklearn.linear_model import LinearRegression

                            # Prepare data for regression
                            X = plot_df['compound_score_mean'].values.reshape(-1, 1)
                            y = plot_df['next_day_return'].values

                            # Train model
                            model = LinearRegression()
                            model.fit(X, y)

                            # Calculate R-squared
                            r_squared = model.score(X, y)

                            # Get predictions for plotting
                            x_line = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
                            y_line = model.predict(x_line)

                            plt.plot(
                                x_line.flatten(),
                                y_line,
                                'r-',
                                label=f'Slope: {model.coef_[0]:.4f}, R²: {r_squared:.4f}'
                            )
                            plt.legend(fontsize=10)
                        except Exception as e:
                            print(f"Error creating regression line: {e}")

                        plt.savefig(os.path.join(self.viz_dir, f"{ticker_symbol}_sentiment_vs_returns.png"))
                    else:
                        print(f"Not enough valid data points to create sentiment vs returns scatter plot for {ticker_symbol}")

                    plt.close(fig)  # Explicitly close the figure
                except Exception as e:
                    print(f"Error creating sentiment vs returns plot: {e}")
                    import traceback
                    traceback.print_exc()

            # ===== Plot 4: Enhanced distribution of sentiment scores =====
            if 'compound_score_mean' in merged_df.columns:
                try:
                    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

                    # Create histogram of sentiment scores - no NaN values
                    sentiment_values = merged_df['compound_score_mean'].dropna()
                    if len(sentiment_values) > 0:
                        ax1.hist(sentiment_values, bins=20, alpha=0.7, color='green', edgecolor='black')
                        ax1.axvline(x=0, color='r', linestyle='--')
                        mean_sentiment = sentiment_values.mean()
                        ax1.axvline(x=mean_sentiment, color='blue', linestyle='-',
                                  label=f'Mean: {mean_sentiment:.3f}')
                        ax1.set_xlabel('Sentiment Score')
                        ax1.set_ylabel('Frequency')
                        ax1.set_title('Distribution of Reddit Sentiment Scores')
                        ax1.legend()
                        ax1.grid(True, alpha=0.3)

                    # Create sentiment categories and boxplot - no NaN values
                    # (.copy() so the new column is added to an independent frame, not a slice)
                    cat_data = merged_df.dropna(subset=['compound_score_mean', 'daily_return']).copy()

                    if len(cat_data) > 0:
                        # Create sentiment categories (include_lowest keeps a score of exactly -1)
                        cat_data['sentiment_category'] = pd.cut(
                            cat_data['compound_score_mean'],
                            bins=[-1, -0.5, -0.2, 0.2, 0.5, 1],
                            labels=['Very Negative', 'Negative', 'Neutral', 'Positive', 'Very Positive'],
                            include_lowest=True
                        )

                        # Plot returns by sentiment category if at least one category has enough data
                        category_counts = cat_data['sentiment_category'].value_counts()
                        if (category_counts > 2).any():  # At least one category has more than 2 data points
                            sns.boxplot(x='sentiment_category', y='daily_return', data=cat_data, ax=ax2)
                            ax2.set_xlabel('Sentiment Category')
                            ax2.set_ylabel('Daily Return (%)')
                            ax2.set_title('Return Distribution by Sentiment Category')
                            ax2.grid(True, alpha=0.3)
                            plt.setp(ax2.get_xticklabels(), rotation=45)

                    plt.tight_layout()
                    plt.savefig(os.path.join(self.viz_dir, f"{ticker_symbol}_sentiment_analysis.png"))
                    plt.close(fig)  # Explicitly close the figure
                except Exception as e:
                    print(f"Error creating sentiment distribution plot: {e}")

            # ===== Plot 5: Technical indicators visualization =====
            if 'rsi_14' in merged_df.columns and 'macd' in merged_df.columns:
                try:
                    # Check if we have enough valid data
                    if merged_df['rsi_14'].notna().sum() > 5 and merged_df['macd'].notna().sum() > 5:
                        fig = plt.figure(figsize=(14, 10))

                        # Create 4 vertically stacked subplots
                        gs = fig.add_gridspec(4, 1, height_ratios=[3, 1, 1, 1], hspace=0.1)

                        # Price subplot
                        ax1 = fig.add_subplot(gs[0])
                        ax1.set_title(f'{ticker_symbol} Technical Indicators', fontsize=14)
                        ax1.plot(merged_df['Date'], merged_df['Close'], color='black', linewidth=2, label='Close Price')

                        # Add moving averages
                        if 'sma_20' in merged_df.columns:
                            ax1.plot(merged_df['Date'], merged_df['sma_20'], color='blue', linewidth=1.5, label='SMA 20')
                        if 'ema_10' in merged_df.columns:
                            ax1.plot(merged_df['Date'], merged_df['ema_10'], color='purple', linewidth=1.5, label='EMA 10')

                        # Add Bollinger Bands; any NaN values (e.g. from a rolling-window warm-up)
                        # are left blank by matplotlib, so only require some valid data
                        if ('bb_upper' in merged_df.columns and
                            'bb_lower' in merged_df.columns and
                            merged_df['bb_upper'].notna().any() and
                            merged_df['bb_lower'].notna().any()):
                            ax1.plot(merged_df['Date'], merged_df['bb_upper'], 'r--', linewidth=1, label='Bollinger Upper')
                            ax1.plot(merged_df['Date'], merged_df['bb_lower'], 'r--', linewidth=1, label='Bollinger Lower')
                            ax1.fill_between(merged_df['Date'], merged_df['bb_upper'], merged_df['bb_lower'],
                                          color='gray', alpha=0.1)

                        ax1.set_ylabel('Price ($)', fontsize=12)
                        ax1.grid(True, alpha=0.3)
                        ax1.legend(loc='upper left')

                        # Volume subplot
                        ax2 = fig.add_subplot(gs[1], sharex=ax1)
                        ax2.bar(merged_df['Date'], merged_df['Volume'], color='gray', alpha=0.5)
                        ax2.set_ylabel('Volume', fontsize=12)
                        ax2.grid(True, alpha=0.3)

                        # RSI subplot
                        ax3 = fig.add_subplot(gs[2], sharex=ax1)
                        ax3.plot(merged_df['Date'], merged_df['rsi_14'], color='green', linewidth=1.5)
                        ax3.axhline(y=70, color='r', linestyle='--', alpha=0.5)
                        ax3.axhline(y=30, color='g', linestyle='--', alpha=0.5)
                        ax3.set_ylabel('RSI (14)', fontsize=12)
                        ax3.set_ylim(0, 100)
                        ax3.grid(True, alpha=0.3)

                        # MACD subplot
                        ax4 = fig.add_subplot(gs[3], sharex=ax1)
                        ax4.plot(merged_df['Date'], merged_df['macd'], color='blue', linewidth=1.5, label='MACD')
                        ax4.plot(merged_df['Date'], merged_df['macd_signal'], color='red', linewidth=1.5, label='Signal')

                        # Add MACD histogram (MACD minus signal), skipping rows where either value is NaN
                        macd_hist = merged_df['macd'] - merged_df['macd_signal']
                        valid = macd_hist.notna()
                        if valid.any():
                            # Green where MACD is above the signal line, red otherwise
                            bar_colors = np.where(macd_hist[valid] > 0, 'green', 'red').tolist()
                            ax4.bar(
                                merged_df.loc[valid, 'Date'],
                                macd_hist[valid],
                                color=bar_colors,
                                alpha=0.5,
                                width=1
                            )

                        ax4.set_ylabel('MACD', fontsize=12)
                        ax4.grid(True, alpha=0.3)
                        ax4.legend(loc='upper left')

                        # Set x-axis label only for bottom subplot
                        ax4.set_xlabel('Date', fontsize=12)

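                        # Hide x-axis tick labels on the upper subplots. With sharex, set_xticklabels([])
                        # can also blank the labels on the bottom subplot, so toggle label visibility instead.
                        for ax in (ax1, ax2, ax3):
                            ax.tick_params(axis='x', labelbottom=False)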

                        # Save figure without using tight_layout (which causes the warning)
                        plt.subplots_adjust(hspace=0.3)
                        plt.savefig(os.path.join(self.viz_dir, f"{ticker_symbol}_technical_indicators.png"))
                        plt.close(fig)  # Explicitly close the figure
                except Exception as e:
                    print(f"Error creating technical indicators plot: {e}")

            # Save statistics to JSON
            try:
                with open(stats_file, 'w') as f:
                    json.dump(stats, f, indent=2)

                print(f"✅ Analysis visualizations saved to {self.viz_dir}")
                print(f"✅ Statistics saved to {stats_file}")
            except Exception as e:
                print(f"Error saving statistics to file: {e}")

            return stats

        except Exception as e:
            print(f"Error in analyze_data for {ticker_symbol}: {e}")
            import traceback
            traceback.print_exc()  # Print the full traceback for debugging
            return None

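    def train_models(self, ticker_symbols=None):
        """
        Train linear and logistic regression models on the merged stock and sentiment data

        Parameters:
        ticker_symbols (list, optional): Ticker symbols to use for training. If None,
            every ticker with a *_merged.csv file in self.processed_dir is used.

        Returns:
        tuple: (linear_model, logistic_model, scaler, features), or
            (None, None, None, None) when no training data is available
        """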
        try:
            # If no tickers specified, use all available merged data files
            if ticker_symbols is None:
                merged_files = [f for f in os.listdir(self.processed_dir) if f.endswith('_merged.csv')]
                # Strip the file suffix rather than splitting on '_', so tickers containing '_' survive
                ticker_symbols = [f[:-len('_merged.csv')] for f in merged_files]

            if not ticker_symbols:
                print("No data files found for model training")
                return None, None, None, None

            print(f"Training models using data from: {ticker_symbols}")

            # Combine data from all tickers
            all_data = []
            for ticker in ticker_symbols:
                try:
                    merged_file = os.path.join(self.processed_dir, f"{ticker}_merged.csv")
                    if os.path.exists(merged_file):
                        ticker_df = pd.read_csv(merged_file)

                        # Ensure we have the required target columns
                        if 'daily_return' not in ticker_df.columns or 'price_up_next_day' not in ticker_df.columns:
                            print(f"Skipping {ticker} - missing required target columns")
                            continue

                        # Convert Date to datetime
                        if 'Date' in ticker_df.columns:
                            ticker_df['Date'] = pd.to_datetime(ticker_df['Date'])

                        ticker_df['ticker'] = ticker
                        all_data.append(ticker_df)
                except Exception as e:
                    print(f"Error loading data for {ticker}: {e}")
                    continue

            if not all_data:
                print("No data available for model training")
                return None, None, None, None

            # Combine all data
            combined_df = pd.concat(all_data, ignore_index=True)

            # Handle any string 'Date' columns that might cause isfinite errors
            if 'Date' in combined_df.columns:
                combined_df = combined_df.drop(columns=['Date'])

            if 'date_only' in combined_df.columns:
                combined_df = combined_df.drop(columns=['date_only'])

            # Define potential features based on what's available
            # Base features (price data)
            base_features = ['Open', 'High', 'Low', 'Close', 'Volume']

            # Technical indicator features
            tech_features = [col for col in combined_df.columns if col in [
                'sma_5', 'sma_10', 'sma_20', 'ema_5', 'ema_10', 'ema_20',
                'rsi_14', 'macd', 'macd_diff', 'bb_width', 'stoch_k',
                'volatility_daily', 'price_to_sma20', 'gap', 'atr',
                'ema_5_10_cross', 'ema_10_20_cross', 'roc_5', 'close_pct_change'
            ]]

            # Sentiment features
            sentiment_features = [col for col in combined_df.columns if col in [
                'compound_score_mean', 'positive_score_mean', 'negative_score_mean',
                'sentiment_bias', 'sentiment_dispersion', 'post_count',
                'upvotes_sum', 'num_comments_sum', 'engagement_ratio',
                'sent_momentum', 'sent_roll_mean_3', 'sent_price_interaction'
            ]]

            # Combine all available features
            available_features = base_features + tech_features + sentiment_features

            # Remove any columns that aren't numeric
            numeric_cols = combined_df.select_dtypes(include=['number']).columns
            available_features = [col for col in available_features if col in numeric_cols]

            # Remove target variables from features
            available_features = [col for col in available_features if col not in ['daily_return', 'price_up_next_day']]

            # Check if we have enough features
            if len(available_features) <= 5:
                print("Not enough features available for model training. Using base features only.")
                available_features = [col for col in base_features if col in numeric_cols]

            # Ensure all required columns exist
            for feature in available_features:
                if feature not in combined_df.columns:
                    print(f"Warning: {feature} not found in data, adding with zeros")
                    combined_df[feature] = 0

            # Handle missing values and infinities
            combined_df = combined_df.replace([np.inf, -np.inf], np.nan)
            combined_df[available_features] = combined_df[available_features].fillna(0)

            # Create additional features
            if all(col in combined_df.columns for col in ['High', 'Low']):
                combined_df['price_range'] = combined_df['High'] - combined_df['Low']
                available_features.append('price_range')

            if 'Volume' in combined_df.columns:
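                # Compute the change within each ticker so it never spans two different stocks;
                # zero-volume days can produce inf/NaN, which are reset to 0
                combined_df['volume_change'] = (
                    combined_df.groupby('ticker')['Volume'].pct_change()
                    .replace([np.inf, -np.inf], np.nan)
                    .fillna(0)
                )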
                available_features.append('volume_change')

            if all(col in combined_df.columns for col in ['Close', 'Open']):
                combined_df['close_to_open'] = (combined_df['Close'] - combined_df['Open']) / combined_df['Open'] * 100
                combined_df['close_to_open'] = combined_df['close_to_open'].fillna(0)
                available_features.append('close_to_open')

            # Create lag features, shifted within each ticker so values never cross tickers
            for feature in ['Close', 'Volume', 'daily_return']:
                if feature in combined_df.columns:
                    for lag in (1, 2):
                        lag_col = f'{feature}_lag{lag}'
                        combined_df[lag_col] = combined_df.groupby('ticker')[feature].shift(lag).fillna(0)
                        available_features.append(lag_col)

            # Add sentiment lag features if available (shifted within each ticker)
            if 'compound_score_mean' in combined_df.columns:
                sentiment_by_ticker = combined_df.groupby('ticker')['compound_score_mean']
                for lag in (1, 2):
                    lag_col = f'sentiment_lag{lag}'
                    combined_df[lag_col] = sentiment_by_ticker.shift(lag).fillna(0)
                    available_features.append(lag_col)

            # Make feature list unique while keeping a stable, reproducible column order
            available_features = list(dict.fromkeys(available_features))

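            # Create dummy variables for ticker symbols
            # (dtype=int: pandas >= 2.0 returns boolean dummies by default,
            # which the numeric check below would otherwise skip)
            ticker_dummies = pd.get_dummies(combined_df['ticker'], prefix='ticker', dtype=int)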
            combined_df = pd.concat([combined_df, ticker_dummies], axis=1)
            ticker_features = ticker_dummies.columns.tolist()
            available_features.extend(ticker_features)

            # Verify all features are numeric
            numeric_features = []
            for feature in available_features:
                if feature in combined_df.columns:
                    if np.issubdtype(combined_df[feature].dtype, np.number):
                        numeric_features.append(feature)
                    else:
                        print(f"Skipping non-numeric feature: {feature}")

            available_features = numeric_features

            # Prepare features and targets
            X = combined_df[available_features].values

            # Ensure target variables are numeric and don't have NaN values
            combined_df['daily_return'] = pd.to_numeric(combined_df['daily_return'], errors='coerce')
            combined_df['price_up_next_day'] = pd.to_numeric(combined_df['price_up_next_day'], errors='coerce')

            combined_df['daily_return'] = combined_df['daily_return'].fillna(0)
            combined_df['price_up_next_day'] = combined_df['price_up_next_day'].fillna(0)
            combined_df['price_up_next_day'] = combined_df['price_up_next_day'].astype(int)

            y_linear = combined_df['daily_return'].values
            y_logistic = combined_df['price_up_next_day'].values

            # Check for NaN or infinite values in X and y
            if np.any(np.isnan(X)) or np.any(np.isinf(X)):
                print("Warning: NaN or infinite values found in feature matrix X. Replacing with zeros.")
                X = np.nan_to_num(X, nan=0.0, posinf=0.0, neginf=0.0)

            if np.any(np.isnan(y_linear)) or np.any(np.isinf(y_linear)):
                print("Warning: NaN or infinite values found in y_linear. Replacing with zeros.")
                y_linear = np.nan_to_num(y_linear, nan=0.0, posinf=0.0, neginf=0.0)

            if np.any(np.isnan(y_logistic)) or np.any(np.isinf(y_logistic)):
                print("Warning: NaN or infinite values found in y_logistic. Replacing with zeros.")
                y_logistic = np.nan_to_num(y_logistic, nan=0.0, posinf=0.0, neginf=0.0)

            # Use time series split for validation
            tscv = TimeSeriesSplit(n_splits=5)

            # Take the last split to have a separate test set
            train_index = None
            test_index = None

            try:
                for train_idx, test_idx in tscv.split(X):
                    train_index = train_idx
                    test_index = test_idx
            except ValueError as e:
                # TimeSeriesSplit raises when there are too few samples for n_splits
                print(f"Time series split failed: {e}, falling back to an 80/20 split")

            if train_index is None or test_index is None:
                # Fall back to simple split if time series split fails
                train_size = int(len(X) * 0.8)
                train_index = np.arange(train_size)
                test_index = np.arange(train_size, len(X))

            # Scale features: fit the scaler on the training rows only so that
            # test-set statistics do not leak into training
            scaler = StandardScaler()
            X_train = scaler.fit_transform(X[train_index])
            X_test = scaler.transform(X[test_index])
            y_linear_train, y_linear_test = y_linear[train_index], y_linear[test_index]
            y_logistic_train, y_logistic_test = y_logistic[train_index], y_logistic[test_index]

            print(f"Training data shape: {X_train.shape}")
            print(f"Testing data shape: {X_test.shape}")

            # Use a simpler approach for feature selection to avoid errors
            try:
                print("Performing feature selection...")
                rf_selector = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)
                rf_selector.fit(X_train, y_logistic_train)

                # Get feature importances
                importances = rf_selector.feature_importances_

                # Select top 70% of features (at least 10, capped at the number available)
                num_features_to_keep = min(max(int(X_train.shape[1] * 0.7), 10), X_train.shape[1])
                indices = np.argsort(importances)[::-1][:num_features_to_keep]

                # Print top features
                print("\nTop 10 features by importance:")
                for i in range(min(10, len(indices))):
                    print(f"{available_features[indices[i]]}: {importances[indices[i]]:.4f}")

                # Create mask
                mask = np.zeros(X_train.shape[1], dtype=bool)
                mask[indices] = True

                # Apply mask
                X_train_selected = X_train[:, mask]
                X_test_selected = X_test[:, mask]

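                # Get selected feature names in column order so they line up
                # with the columns of X_train_selected (the mask keeps the
                # original column order, not the importance order)
                selected_feature_names = [name for name, keep in zip(available_features, mask) if keep]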

                print(f"Selected {X_train_selected.shape[1]} of {X_train.shape[1]} features")
            except Exception as e:
                print(f"Feature selection failed: {e}, using all features")
                X_train_selected = X_train
                X_test_selected = X_test
                mask = np.ones(X_train.shape[1], dtype=bool)
                selected_feature_names = available_features

            # Check for class imbalance
            class_counts = np.bincount(y_logistic_train.astype(int))
            class_balance = min(class_counts) / max(class_counts)
            print(f"Class distribution: {class_counts}, balance ratio: {class_balance:.2f}")

            # Apply SMOTE for class imbalance if needed and if we have enough samples
            # (SMOTE's default k_neighbors=5 needs at least 6 minority-class samples)
            X_train_logistic = X_train_selected
            if class_balance < 0.7 and min(class_counts) > 5:
                try:
                    smote = SMOTE(random_state=42)
                    X_train_logistic, y_logistic_train_resampled = smote.fit_resample(X_train_selected, y_logistic_train)
                    print(f"After SMOTE: {np.bincount(y_logistic_train_resampled.astype(int))}")
                    y_logistic_train = y_logistic_train_resampled
                except Exception as e:
                    print(f"SMOTE failed: {e}, continuing with original data")
                    X_train_logistic = X_train_selected

            # Train regression model for returns
            print("\nTraining regression model for stock returns...")
            try:
                # Try Ridge regression
                ridge = Ridge(alpha=0.1, random_state=42)
                ridge.fit(X_train_selected, y_linear_train)
                # Score on the training data; test-set metrics are computed below
                ridge_score = ridge.score(X_train_selected, y_linear_train)
                print(f"Ridge regression training R² score: {ridge_score:.4f}")

                # Use Ridge as our final model
                linear_model = ridge

            except Exception as e:
                print(f"Error training Ridge regression: {e}")
                # Fallback to simple linear regression
                from sklearn.linear_model import LinearRegression
                linear_model = LinearRegression()
                linear_model.fit(X_train_selected, y_linear_train)
                print("Fallback to simple LinearRegression")

            # Evaluate regression model
            try:
                linear_metrics = self._evaluate_linear_model(linear_model, X_test_selected, y_linear_test, selected_feature_names)
            except Exception as e:
                print(f"Error evaluating regression model: {e}")
                linear_metrics = {"error": str(e)}

            # Train classification model
            print("\nTraining classification model for price direction...")
            try:
                # Use logistic regression which is more stable
                logistic = LogisticRegression(C=1.0, class_weight='balanced',
                                          penalty='l2', solver='liblinear',
                                          random_state=42, max_iter=1000)
                logistic.fit(X_train_logistic, y_logistic_train)
                # Accuracy on the training data; test-set metrics are computed below
                logistic_score = logistic.score(X_train_logistic, y_logistic_train)
                print(f"Logistic regression training accuracy: {logistic_score:.4f}")

                # Use Logistic as our final model
                logistic_model = logistic

            except Exception as e:
                print(f"Error training Logistic regression: {e}")
                # Fallback to a very simple model
                from sklearn.dummy import DummyClassifier
                logistic_model = DummyClassifier(strategy='most_frequent', random_state=42)
                logistic_model.fit(X_train_logistic, y_logistic_train)
                print("Fallback to DummyClassifier")

            # Evaluate classification model
            try:
                logistic_metrics = self._evaluate_logistic_model(logistic_model, X_test_selected, y_logistic_test, selected_feature_names)
            except Exception as e:
                print(f"Error evaluating classification model: {e}")
                logistic_metrics = {"error": str(e)}

            # Save models and metadata
            try:
                joblib.dump(linear_model, os.path.join(self.model_dir, "linear_model.joblib"))
                joblib.dump(logistic_model, os.path.join(self.model_dir, "logistic_model.joblib"))
                joblib.dump(scaler, os.path.join(self.model_dir, "feature_scaler.joblib"))
                joblib.dump(available_features, os.path.join(self.model_dir, "all_feature_names.joblib"))
                joblib.dump(selected_feature_names, os.path.join(self.model_dir, "selected_feature_names.joblib"))
                joblib.dump(mask, os.path.join(self.model_dir, "feature_mask.joblib"))

                # Save metrics
                with open(os.path.join(self.model_dir, "linear_metrics.json"), 'w') as f:
                    json.dump(linear_metrics, f, indent=2)

                with open(os.path.join(self.model_dir, "logistic_metrics.json"), 'w') as f:
                    json.dump(logistic_metrics, f, indent=2)

                # Create feature importance visualization if possible
                try:
                    self._plot_feature_importance(linear_model, logistic_model, selected_feature_names)
                except Exception as e:
                    print(f"Error creating feature importance plot: {e}")

                print(f"✅ Models and metrics saved to {self.model_dir}")
            except Exception as e:
                print(f"Error saving models and metrics: {e}")

            return linear_model, logistic_model, scaler, selected_feature_names, mask

        except Exception as e:
            print(f"Error in train_models: {e}")
            import traceback
            traceback.print_exc()
            return None, None, None, None, None
            """
            Train both linear and logistic regression models with improved methodology

            Parameters:
            ticker_symbols (list): List of ticker symbols to use for training

            Returns:
            tuple: (linear_model, logistic_model, scaler, features)
            """
            try:
                # If no tickers specified, use all available merged data files
                if ticker_symbols is None:
                    merged_files = [f for f in os.listdir(self.processed_dir) if f.endswith('_merged.csv')]
                    ticker_symbols = [f.split('_')[0] for f in merged_files]

                if not ticker_symbols:
                    print("No data files found for model training")
                    return None, None, None, None

                print(f"Training models using data from: {ticker_symbols}")

                # Combine data from all tickers
                all_data = []
                for ticker in ticker_symbols:
                    try:
                        merged_file = os.path.join(self.processed_dir, f"{ticker}_merged.csv")
                        if os.path.exists(merged_file):
                            ticker_df = pd.read_csv(merged_file)
                            ticker_df['ticker'] = ticker
                            all_data.append(ticker_df)
                    except Exception as e:
                        print(f"Error loading data for {ticker}: {e}")
                        continue

                if not all_data:
                    print("No data available for model training")
                    return None, None, None, None

                # Combine all data
                combined_df = pd.concat(all_data, ignore_index=True)
                combined_df['Date'] = pd.to_datetime(combined_df['Date'], utc=True)

                # Define potential features based on what's available
                base_features = ['Open', 'High', 'Low', 'Close', 'Volume']

                # Technical indicator features
                tech_features = [col for col in combined_df.columns if col in [
                    'sma_5', 'sma_10', 'sma_20', 'ema_5', 'ema_10', 'ema_20',
                    'rsi_14', 'macd', 'macd_diff', 'bb_width', 'stoch_k',
                    'volatility_daily', 'price_to_sma20', 'gap', 'atr',
                    'ema_5_10_cross', 'ema_10_20_cross', 'roc_5', 'close_pct_change'
                ]]

                # Sentiment features
                sentiment_features = [col for col in combined_df.columns if col in [
                    'compound_score_mean', 'positive_score_mean', 'negative_score_mean',
                    'sentiment_bias', 'sentiment_dispersion', 'post_count',
                    'upvotes_sum', 'num_comments_sum', 'engagement_ratio',
                    'sent_momentum', 'sent_roll_mean_3', 'sent_price_interaction'
                ]]

                # Combine all available features
                available_features = base_features + tech_features + sentiment_features

                # Check if we have enough features
                if len(available_features) <= 5:
                    print("Not enough features available for model training. Using base features only.")
                    available_features = base_features

                # Ensure all required columns exist
                for feature in available_features:
                    if feature not in combined_df.columns:
                        print(f"Warning: {feature} not found in data, adding with zeros")
                        combined_df[feature] = 0

                # Handle missing values and infinities
                combined_df = combined_df.replace([np.inf, -np.inf], np.nan)
                combined_df[available_features] = combined_df[available_features].fillna(0)

                # Create additional features
                combined_df['price_range'] = combined_df['High'] - combined_df['Low']
                combined_df['volume_change'] = combined_df['Volume'].pct_change()
                combined_df['close_to_open'] = (combined_df['Close'] - combined_df['Open']) / combined_df['Open'] * 100

                # Add these to available features
                available_features.extend(['price_range', 'volume_change', 'close_to_open'])

                # Create lag features for each ticker separately
                for ticker in combined_df['ticker'].unique():
                    mask = combined_df['ticker'] == ticker
                    for feature in ['Close', 'Volume', 'daily_return']:
                        if feature in combined_df.columns:
                            combined_df.loc[mask, f'{feature}_lag1'] = combined_df.loc[mask, feature].shift(1)
                            combined_df.loc[mask, f'{feature}_lag2'] = combined_df.loc[mask, feature].shift(2)
                            combined_df.loc[mask, f'{feature}_lag3'] = combined_df.loc[mask, feature].shift(3)
                            available_features.extend([f'{feature}_lag1', f'{feature}_lag2', f'{feature}_lag3'])

                # Add sentiment lag features if available
                if 'compound_score_mean' in combined_df.columns:
                    for ticker in combined_df['ticker'].unique():
                        mask = combined_df['ticker'] == ticker
                        combined_df.loc[mask, 'sentiment_lag1'] = combined_df.loc[mask, 'compound_score_mean'].shift(1)
                        combined_df.loc[mask, 'sentiment_lag2'] = combined_df.loc[mask, 'compound_score_mean'].shift(2)
                        available_features.extend(['sentiment_lag1', 'sentiment_lag2'])

                # Fill NaN values created by lags
                combined_df = combined_df.fillna(0)

                # Make feature list unique
                available_features = list(set(available_features))

                # Create dummy variables for ticker symbols
                ticker_dummies = pd.get_dummies(combined_df['ticker'], prefix='ticker')
                combined_df = pd.concat([combined_df, ticker_dummies], axis=1)
                ticker_features = ticker_dummies.columns.tolist()
                available_features.extend(ticker_features)

                # Prepare features and targets
                X = combined_df[available_features].values
                y_linear = combined_df['daily_return'].values
                y_logistic = combined_df['price_up_next_day'].values

                # Remove any remaining NaN or infinite values
                mask = np.isfinite(X).all(axis=1) & np.isfinite(y_linear) & np.isfinite(y_logistic)
                X = X[mask]
                y_linear = y_linear[mask]
                y_logistic = y_logistic[mask]

                # Scale features
                scaler = StandardScaler()
                X_scaled = scaler.fit_transform(X)

                # Use time series split for validation
                tscv = TimeSeriesSplit(n_splits=5)

                # Take the last split to have a separate test set
                for train_index, test_index in tscv.split(X_scaled):
                    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
                    y_linear_train, y_linear_test = y_linear[train_index], y_linear[test_index]
                    y_logistic_train, y_logistic_test = y_logistic[train_index], y_logistic[test_index]

                print(f"Training data shape: {X_train.shape}")
                print(f"Testing data shape: {X_test.shape}")

                # Select most relevant features using RandomForest
                print("Performing feature selection...")
                rf_selector = RandomForestClassifier(n_estimators=100, random_state=42)
                rf_selector.fit(X_train, y_logistic_train)

                # Get feature importances and select top features
                importances = rf_selector.feature_importances_
                indices = np.argsort(importances)[::-1]

                # Print top 20 features
                print("\nTop 20 features by importance:")
                for i in range(min(20, X_train.shape[1])):
                    print(f"{available_features[indices[i]]}: {importances[indices[i]]:.4f}")

                # Create a mask to select the top 70% most important features
                num_features_to_keep = max(int(X_train.shape[1] * 0.7), 10)  # Keep at least 10 features
                mask = np.zeros(X_train.shape[1], dtype=bool)
                mask[indices[:num_features_to_keep]] = True

                # Filter the features
                X_train_selected = X_train[:, mask]
                X_test_selected = X_test[:, mask]
                selected_feature_names = [available_features[i] for i in range(len(available_features)) if mask[i]]

                print(f"Selected {X_train_selected.shape[1]} of {X_train.shape[1]} features")

                # Check for class imbalance in classification target
                class_counts = np.bincount(y_logistic_train)
                class_balance = min(class_counts) / max(class_counts)
                print(f"Class distribution: {class_counts}, balance ratio: {class_balance:.2f}")

                # Apply SMOTE for class imbalance if needed
                if class_balance < 0.7:
                    print("Applying SMOTE to balance classes...")
                    try:
                        smote = SMOTE(random_state=42)
                        X_train_selected_resampled, y_logistic_train_resampled = smote.fit_resample(X_train_selected, y_logistic_train)
                        print(f"After SMOTE: {np.bincount(y_logistic_train_resampled)}")

                        # Use resampled data for classification
                        X_train_logistic = X_train_selected_resampled
                        y_logistic_train = y_logistic_train_resampled
                    except Exception as e:
                        print(f"SMOTE failed: {e}, continuing with original data")
                        X_train_logistic = X_train_selected
                else:
                    X_train_logistic = X_train_selected

                # Train and evaluate models
                print("\nTraining regression model for stock returns...")
                regression_models = {
                    'ridge': Ridge(alpha=0.01, random_state=42),
                    'gbr': GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=4, random_state=42),
                    'xgb': xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=4, random_state=42)
                }

                best_regression_score = -float('inf')
                linear_model = None

                for name, model in regression_models.items():
                    model.fit(X_train_selected, y_linear_train)
                    score = model.score(X_train_selected, y_linear_train)
                    print(f"{name} R² score: {score:.4f}")

                    if score > best_regression_score:
                        best_regression_score = score
                        linear_model = model

                linear_metrics = self._evaluate_linear_model(linear_model, X_test_selected, y_linear_test, selected_feature_names)

                # Train classification models
                print("\nTraining classification model for price direction...")
                classification_models = {
                    'logistic': LogisticRegression(C=1.0, class_weight='balanced', penalty='l2',
                                                  solver='liblinear', random_state=42, max_iter=1000),
                    'random_forest': RandomForestClassifier(n_estimators=100, max_depth=10,
                                                          class_weight='balanced', random_state=42),
                    'xgb': xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=4,
                                            use_label_encoder=False, eval_metric='logloss', random_state=42)
                }

                best_classification_score = -float('inf')
                logistic_model = None

                for name, model in classification_models.items():
                    model.fit(X_train_logistic, y_logistic_train)

                    # For evaluation, use F1 score which is better for imbalanced classes
                    from sklearn.metrics import f1_score
                    y_pred = model.predict(X_train_selected)
                    score = f1_score(y_logistic_train, y_pred)

                    print(f"{name} F1 score: {score:.4f}")

                    if score > best_classification_score:
                        best_classification_score = score
                        logistic_model = model

                logistic_metrics = self._evaluate_logistic_model(logistic_model, X_test_selected, y_logistic_test, selected_feature_names)

                # Save models and metadata
                joblib.dump(linear_model, os.path.join(self.model_dir, "linear_model.joblib"))
                joblib.dump(logistic_model, os.path.join(self.model_dir, "logistic_model.joblib"))
                joblib.dump(scaler, os.path.join(self.model_dir, "feature_scaler.joblib"))
                joblib.dump(available_features, os.path.join(self.model_dir, "all_feature_names.joblib"))
                joblib.dump(selected_feature_names, os.path.join(self.model_dir, "selected_feature_names.joblib"))
                joblib.dump(mask, os.path.join(self.model_dir, "feature_mask.joblib"))

                # Save metrics
                with open(os.path.join(self.model_dir, "linear_metrics.json"), 'w') as f:
                    json.dump(linear_metrics, f, indent=2)

                with open(os.path.join(self.model_dir, "logistic_metrics.json"), 'w') as f:
                    json.dump(logistic_metrics, f, indent=2)

                # Create feature importance visualization
                self._plot_feature_importance(linear_model, logistic_model, selected_feature_names)

                print(f"✅ Models and metrics saved to {self.model_dir}")

                return linear_model, logistic_model, scaler, selected_feature_names, mask

            except Exception as e:
                print(f"Error in train_models: {e}")
                import traceback
                traceback.print_exc()
                return None, None, None, None, None

    def _train_linear_model(self, X_train, y_train):
        """
        Train a linear regression model (Ridge) to predict stock returns
        """
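        from sklearn.model_selection import cross_val_score

        # Simple grid search to find best alpha, scored with time-ordered
        # cross-validation (training R² would always favour the smallest alpha)
        alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
        best_score = -float('inf')
        best_alpha = 1.0
        cv = TimeSeriesSplit(n_splits=3)

        for alpha in alphas:
            model = Ridge(alpha=alpha, random_state=42)
            score = cross_val_score(model, X_train, y_train, cv=cv, scoring='r2').mean()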

            if score > best_score:
                best_score = score
                best_alpha = alpha

        print(f"Best alpha for Ridge Regression: {best_alpha}")

        # Train final model with best alpha
        final_model = Ridge(alpha=best_alpha, random_state=42)
        final_model.fit(X_train, y_train)

        return final_model

    def _train_logistic_model(self, X_train, y_train):
        """
        Train a classification model to predict stock direction
        """
        print("Training classification model with multiple algorithms...")

        # Check for class imbalance (np.bincount requires integer labels)
        y_train = np.asarray(y_train).astype(int)
        class_counts = np.bincount(y_train)
        print(f"Class distribution before balancing: {class_counts}")

        # Apply SMOTE if there's class imbalance
        class_balance = min(class_counts) / max(class_counts)
        if class_balance < 0.7:
            try:
                smote = SMOTE(random_state=42)
                X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
                print(f"Applied SMOTE: {X_train.shape} -> {X_train_resampled.shape}")
                print(f"Balanced class distribution: {np.bincount(y_train_resampled)}")
                X_train, y_train = X_train_resampled, y_train_resampled
            except Exception as e:
                print(f"SMOTE failed: {e}, continuing with original data")

        # Try multiple models
        models = {
            'logistic': LogisticRegression(C=1.0, class_weight='balanced',
                                          penalty='l2', solver='liblinear',
                                          random_state=42, max_iter=1000),
            'random_forest': RandomForestClassifier(n_estimators=100, max_depth=8,
                                                  class_weight='balanced',
                                                  random_state=42),
            'xgboost': xgb.XGBClassifier(n_estimators=100, learning_rate=0.1,
                                        max_depth=4, random_state=42,
                                        use_label_encoder=False, eval_metric='logloss')
        }

        # Simple cross-validation to select best model
        from sklearn.model_selection import cross_val_score

        best_score = -float('inf')
        best_model = None
        best_model_name = None

        for name, model in models.items():
            print(f"Training {name}...")

            # Use cross-validated F1 score for selection; training-set F1
            # would favour whichever model overfits the most
            score = float(np.mean(cross_val_score(model, X_train, y_train, cv=3, scoring='f1')))
            print(f"{name} cross-validated F1 score: {score:.4f}")

            # Refit on the full training set so the returned model is fitted
            model.fit(X_train, y_train)

            if best_model is None or score > best_score:
                best_score = score
                best_model = model
                best_model_name = name

        print(f"Selected {best_model_name} as best classification model with cross-validated F1 score: {best_score:.4f}")
        return best_model

    def _evaluate_linear_model(self, model, X_test, y_test, features):
        """
        Evaluate regression model and generate metrics

        Parameters:
        model: Trained regression model
        X_test: Test feature data
        y_test: Test target data
        features: List of feature names

        Returns:
        dict: Evaluation metrics
        """
        # Make predictions
        y_pred = model.predict(X_test)

        # Calculate metrics
        mse = mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        mae = mean_absolute_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)

        # Get directional accuracy (did we predict the direction correctly?)
        direction_actual = y_test > 0
        direction_pred = y_pred > 0
        directional_accuracy = accuracy_score(direction_actual, direction_pred)

        print("\n=== Regression Model Evaluation ===")
        print(f"RMSE: {rmse:.4f}")
        print(f"MAE: {mae:.4f}")
        print(f"R²: {r2:.4f}")
        print(f"Directional Accuracy: {directional_accuracy:.4f}")

        # Get feature importance
        if hasattr(model, 'feature_importances_'):
            # For tree-based models (no signed coefficients, so reuse the importances)
            importances = model.feature_importances_
            coefficients = importances
        elif hasattr(model, 'coef_'):
            # For linear models
            coefficients = np.ravel(model.coef_)
            importances = np.abs(coefficients)
        else:
            importances = np.zeros(len(features))
            coefficients = np.zeros(len(features))

        feature_importance = pd.DataFrame({
            'Feature': features,
            'Importance': importances,
            'Coefficient': coefficients
        }).sort_values('Importance', ascending=False)

        print("\nTop 10 features by importance (Regression model):")
        print(feature_importance.head(10))

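        # Create plot of actual vs. predicted
        plt.figure(figsize=(10, 6))
        plt.scatter(y_test, y_pred, alpha=0.5)
        # Draw the y = x reference line over the range actually covered by the data
        lims = [float(min(np.min(y_test), np.min(y_pred))), float(max(np.max(y_test), np.max(y_pred)))]
        plt.plot(lims, lims, 'r--')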
        plt.xlabel('Actual Return (%)')
        plt.ylabel('Predicted Return (%)')
        plt.title('Regression Model: Actual vs. Predicted Returns')
        plt.grid(True, alpha=0.3)
        plt.savefig(os.path.join(self.viz_dir, "regression_actual_vs_predicted.png"))
        plt.close()

        # Create residuals plot
        plt.figure(figsize=(10, 6))
        residuals = y_test - y_pred
        plt.scatter(y_pred, residuals, alpha=0.5)
        plt.axhline(y=0, color='r', linestyle='--')
        plt.xlabel('Predicted Return (%)')
        plt.ylabel('Residuals')
        plt.title('Regression Model: Residuals Plot')
        plt.grid(True, alpha=0.3)
        plt.savefig(os.path.join(self.viz_dir, "regression_residuals.png"))
        plt.close()

        # Save metrics
        metrics = {
            'rmse': float(rmse),
            'mae': float(mae),
            'r2': float(r2),
            'directional_accuracy': float(directional_accuracy),
            'top_features': feature_importance.head(10).to_dict('records')
        }

        return metrics

    def _evaluate_logistic_model(self, model, X_test, y_test, features):
        """
        Evaluate classification model and generate metrics

        Parameters:
        model: Trained classification model
        X_test: Test feature data
        y_test: Test target data
        features: List of feature names

        Returns:
        dict: Evaluation metrics
        """
        # Make predictions
        y_pred = model.predict(X_test)

        # Get probability predictions if the model supports it
        try:
            y_prob = model.predict_proba(X_test)[:, 1]
            has_proba = True
        except Exception:
            y_prob = y_pred
            has_proba = False

        # Calculate metrics
        accuracy = accuracy_score(y_test, y_pred)
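        # zero_division=0 avoids warnings when one class is never predicted
        precision = precision_score(y_test, y_pred, zero_division=0)
        recall = recall_score(y_test, y_pred, zero_division=0)
        f1 = f1_score(y_test, y_pred, zero_division=0)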

        try:
            auc = roc_auc_score(y_test, y_prob) if has_proba else 0
        except ValueError:  # e.g. y_test contains only one class, so AUC is undefined
            auc = 0

        cm = confusion_matrix(y_test, y_pred)

        print("\n=== Classification Model Evaluation ===")
        print(f"Accuracy: {accuracy:.4f}")
        print(f"Precision: {precision:.4f}")
        print(f"Recall: {recall:.4f}")
        print(f"F1 Score: {f1:.4f}")
        print(f"AUC-ROC: {auc:.4f}")
        print("Confusion Matrix:")
        print(cm)

        # Get feature importance
        if hasattr(model, 'feature_importances_'):
            # For tree-based models
            importances = model.feature_importances_
            coefficients = importances
        elif hasattr(model, 'coef_'):
            # For linear models
            importances = np.abs(model.coef_[0])
            coefficients = model.coef_[0]
        else:
            importances = np.zeros(len(features))
            coefficients = np.zeros(len(features))

        feature_importance = pd.DataFrame({
            'Feature': features,
            'Importance': importances,
            'Coefficient': coefficients
        }).sort_values('Importance', ascending=False)

        print("\nTop 10 features by importance (Classification model):")
        print(feature_importance.head(10))

        # Plot confusion matrix
        plt.figure(figsize=(8, 6))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                   xticklabels=['Down', 'Up'],
                   yticklabels=['Down', 'Up'])
        plt.xlabel('Predicted')
        plt.ylabel('Actual')
        plt.title('Classification Model: Confusion Matrix')
        plt.tight_layout()
        plt.savefig(os.path.join(self.viz_dir, "classification_confusion_matrix.png"))
        plt.close()

        # Plot ROC curve if probabilities are available
        if has_proba and auc > 0:
            from sklearn.metrics import roc_curve
            fpr, tpr, _ = roc_curve(y_test, y_prob)

            plt.figure(figsize=(8, 6))
            plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {auc:.2f})')
            plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
            plt.xlim([0.0, 1.0])
            plt.ylim([0.0, 1.05])
            plt.xlabel('False Positive Rate')
            plt.ylabel('True Positive Rate')
            plt.title('Classification Model: ROC Curve')
            plt.legend(loc="lower right")
            plt.grid(True, alpha=0.3)
            plt.savefig(os.path.join(self.viz_dir, "classification_roc_curve.png"))
            plt.close()

        # Save metrics
        metrics = {
            'accuracy': float(accuracy),
            'precision': float(precision),
            'recall': float(recall),
            'f1': float(f1),
            'auc': float(auc),
            'confusion_matrix': cm.tolist(),
            'top_features': feature_importance.head(10).to_dict('records')
        }

        return metrics

    def _plot_feature_importance(self, linear_model, logistic_model, features):
        """
        Create visualization comparing feature importance between models

        Parameters:
        linear_model: Trained regression model
        logistic_model: Trained classification model
        features: List of feature names
        """
        # Get feature importance for both models
        if hasattr(linear_model, 'feature_importances_'):
            linear_importances = linear_model.feature_importances_
        elif hasattr(linear_model, 'coef_'):
            linear_importances = np.abs(linear_model.coef_)
        else:
            linear_importances = np.zeros(len(features))

        linear_importance_df = pd.DataFrame({
            'Feature': features,
            'Importance': linear_importances,
            'Model': 'Regression'
        }).sort_values('Importance', ascending=False)

        if hasattr(logistic_model, 'feature_importances_'):
            logistic_importances = logistic_model.feature_importances_
        elif hasattr(logistic_model, 'coef_'):
            # coef_ is (1, n_features) for binary classifiers; atleast_2d also covers 1-D coefficients
            logistic_importances = np.abs(np.atleast_2d(logistic_model.coef_)[0])
        else:
            logistic_importances = np.zeros(len(features))

        logistic_importance_df = pd.DataFrame({
            'Feature': features,
            'Importance': logistic_importances,
            'Model': 'Classification'
        }).sort_values('Importance', ascending=False)

        # Combine and normalize for comparison
        combined = pd.concat([linear_importance_df, logistic_importance_df])

        # Normalize importances within each model
        for model in ['Regression', 'Classification']:
            mask = combined['Model'] == model
            max_importance = combined.loc[mask, 'Importance'].max()
            if max_importance > 0:
                combined.loc[mask, 'Importance'] = combined.loc[mask, 'Importance'] / max_importance

        # Get top 15 features from both models
        top_features = set(linear_importance_df.head(15)['Feature']).union(
            set(logistic_importance_df.head(15)['Feature']))

        # Filter to only top features
        plot_data = combined[combined['Feature'].isin(top_features)]

        # Create grouped bar chart
        plt.figure(figsize=(14, 10))
        sns.set_style("whitegrid")
        chart = sns.barplot(x='Feature', y='Importance', hue='Model', data=plot_data)
        plt.xticks(rotation=45, ha='right')
        plt.title('Feature Importance Comparison: Regression vs. Classification Model', fontsize=14)
        plt.xlabel('Feature', fontsize=12)
        plt.ylabel('Normalized Importance', fontsize=12)
        plt.legend(title='Model Type')
        plt.tight_layout()
        plt.savefig(os.path.join(self.viz_dir, "feature_importance_comparison.png"))
        plt.close()

        # Create individual feature importance plots for each model
        for model_type, df in [('Regression', linear_importance_df), ('Classification', logistic_importance_df)]:
            plt.figure(figsize=(12, 8))
            top_n = min(20, len(df))
            sns.barplot(x='Importance', y='Feature', data=df.head(top_n), palette='viridis')
            plt.title(f'Top {top_n} Features for {model_type} Model', fontsize=14)
            plt.xlabel('Importance', fontsize=12)
            plt.ylabel('Feature', fontsize=12)
            plt.tight_layout()
            plt.savefig(os.path.join(self.viz_dir, f"{model_type.lower()}_feature_importance.png"))
            plt.close()
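
The per-model scaling used in `_plot_feature_importance` can be illustrated in isolation. The following is a minimal sketch, not the project's implementation: `normalize_importances` is a hypothetical helper, and the feature names and coefficient values are made up for illustration. It assumes importances are supplied as one sequence per model and divides each model's values by that model's maximum, so that regression coefficients and classifier importances can share one axis.

```python
import numpy as np
import pandas as pd


def normalize_importances(importances_by_model, features):
    """Return a long-format frame with importances scaled to [0, 1] per model."""
    frames = [
        pd.DataFrame({"Feature": features, "Importance": np.abs(values), "Model": name})
        for name, values in importances_by_model.items()
    ]
    combined = pd.concat(frames, ignore_index=True)
    # Divide by each model's maximum; all-zero groups are left unchanged
    group_max = combined.groupby("Model")["Importance"].transform("max")
    combined["Importance"] = combined["Importance"] / group_max.where(group_max > 0, 1.0)
    return combined


# Made-up coefficients, for illustration only
features = ["rsi_14", "sma_20", "compound_score_mean"]
result = normalize_importances(
    {"Regression": [0.5, -2.0, 1.0], "Classification": [0.02, 0.01, 0.04]},
    features,
)
print(result)
```

After scaling, the largest feature in each model has importance 1.0, which is why the comparison chart labels its y-axis "Normalized Importance".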


In [None]:
def run_analysis(tickers, days_back=180, reddit_limit=500):
    """
    Run the complete analysis workflow for a list of tickers

    Parameters:
    tickers (list): List of stock ticker symbols
    days_back (int): Number of days to look back for data
    reddit_limit (int): Maximum number of Reddit posts to fetch per subreddit

    Returns:
    tuple: (analyzer, models_dict). models_dict is None if no ticker could be
    analyzed or model training fails; (None, None) is returned on a fatal error
    """
    try:
        # Initialize analyzer
        analyzer = StockSentimentAnalyzer()
        successful_tickers = []

        # Process each ticker
        for ticker in tickers:
            print(f"\n{'='*50}\nProcessing {ticker}\n{'='*50}")
            try:
                # 1. Fetch stock data
                stock_data = analyzer.fetch_stock_data(ticker, period=f"{days_back}d")
                if stock_data is None:
                    print(f"Skipping {ticker} due to missing stock data")
                    continue

                # 2. Fetch Reddit data
                reddit_data = analyzer.fetch_reddit_data(
                    ticker,
                    subreddits=["stocks", "investing", "wallstreetbets", "stockmarket"],
                    limit=reddit_limit,
                    days_back=days_back
                )

                # 3. Merge data
                merged_data = analyzer.merge_stock_and_sentiment(ticker)
                if merged_data is None:
                    print(f"Skipping {ticker} due to issues with data merging")
                    continue

                # 4. Analyze data
                stats = analyzer.analyze_data(ticker)
                if stats is not None:
                    successful_tickers.append(ticker)

            except Exception as e:
                print(f"Error processing {ticker}: {str(e)}")
                continue

        # Only train models if we have successful data
        if successful_tickers:
            try:
                # 5. Train models using successful tickers
                linear_model, logistic_model, scaler, features, feature_mask = analyzer.train_models(successful_tickers)

                models_dict = {
                    'linear_model': linear_model,
                    'logistic_model': logistic_model,
                    'scaler': scaler,
                    'features': features,
                    'feature_mask': feature_mask
                }

                return analyzer, models_dict
            except Exception as e:
                print(f"Error training models: {str(e)}")
                return analyzer, None
        else:
            print("No successful ticker analysis to train models")
            return analyzer, None

    except Exception as e:
        print(f"Fatal error in run_analysis: {str(e)}")
        return None, None

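# Web application for the project
# json, joblib and plotly are used by the /analyze route below
import json

import joblib
import plotly.graph_objects as go
from flask import Flask, render_template, request, jsonify
from flask_ngrok import run_with_ngrok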

app = Flask(__name__)
run_with_ngrok(app)

# Global variables to store the analyzer and models
global_analyzer = None
global_models = None

def initialize_analyzer():
    """Initialize the analyzer and models if not already done"""
    global global_analyzer, global_models
    if global_analyzer is None:
        global_analyzer, global_models = run_analysis(
            tickers=["AAPL", "MSFT", "TSLA", "AMZN", "NVDA"],
            days_back=30,
            reddit_limit=100
        )

@app.route('/')
def home():
    """Render the home page"""
    initialize_analyzer()
    return render_template('index.html')

@app.route('/analyze', methods=['POST'])
def analyze():
    """Handle stock analysis request"""
    try:
        ticker = request.form.get('ticker', '').upper()
        if not ticker:
            return jsonify({'error': 'No ticker provided'})

        print(f"Processing analysis request for {ticker}")

        # Initialize analyzer if not already done
        initialize_analyzer()

        if global_analyzer is None:
            return jsonify({'error': 'Failed to initialize analyzer'})

        # Fetch new data for the ticker
        stock_data = global_analyzer.fetch_stock_data(ticker)
        if stock_data is None:
            return jsonify({'error': f'Failed to fetch stock data for {ticker}'})

        reddit_data = global_analyzer.fetch_reddit_data(ticker)
        merged_data = global_analyzer.merge_stock_and_sentiment(ticker)
        stats = global_analyzer.analyze_data(ticker)

        if merged_data is None or stats is None:
            return jsonify({'error': f'Failed to analyze {ticker}'})

        # Create interactive chart using Plotly
        fig = go.Figure()

        # Add candlestick chart
        fig.add_trace(go.Candlestick(
            x=merged_data['Date'],
            open=merged_data['Open'],
            high=merged_data['High'],
            low=merged_data['Low'],
            close=merged_data['Close'],
            name='OHLC'
        ))

        # Add moving averages if available
        if 'sma_20' in merged_data.columns:
            # Clean sma_20 data - remove NaN and inf values
            sma_20_clean = merged_data['sma_20'].replace([np.inf, -np.inf], np.nan)
            fig.add_trace(go.Scatter(
                x=merged_data['Date'],
                y=sma_20_clean,
                name='20-day MA',
                line=dict(color='purple', width=1.5)
            ))

        if 'ema_10' in merged_data.columns:
            # Clean ema_10 data
            ema_10_clean = merged_data['ema_10'].replace([np.inf, -np.inf], np.nan)
            fig.add_trace(go.Scatter(
                x=merged_data['Date'],
                y=ema_10_clean,
                name='10-day EMA',
                line=dict(color='orange', width=1.5),
                visible='legendonly'  # Hidden by default
            ))

        # Add sentiment overlay if available
        if 'compound_score_mean' in merged_data.columns:
            # Clean sentiment data
            sentiment_clean = merged_data['compound_score_mean'].replace([np.inf, -np.inf], np.nan)
            fig.add_trace(go.Scatter(
                x=merged_data['Date'],
                y=sentiment_clean,
                name='Sentiment Score',
                yaxis='y2',
                line=dict(color='green', width=2, dash='dash')
            ))

            # Add post counts as bubbles if available
            if 'post_count' in merged_data.columns:
                # Clean post_count data and create safe bubble sizes
                post_count_clean = merged_data['post_count'].fillna(0).replace([np.inf, -np.inf], 0)

                # Create bubble size with minimum threshold
                bubble_size = post_count_clean * 5  # Scale for visibility

                # Set minimum size and maximum size to prevent extremes
                bubble_size = bubble_size.clip(lower=5, upper=50)

                # Only show bubbles where we actually have posts
                mask = post_count_clean > 0

                if mask.any():  # Only add trace if we have valid data
                    fig.add_trace(go.Scatter(
                        x=merged_data['Date'][mask],
                        y=sentiment_clean[mask],
                        mode='markers',
                        marker=dict(
                            size=bubble_size[mask],
                            color='rgba(0, 100, 80, 0.5)',
                            line=dict(width=1, color='rgba(0, 100, 80, 1)')
                        ),
                        name='Post Volume',
                        yaxis='y2',
                        hovertemplate='Date: %{x}<br>Posts: %{text}<br>Sentiment: %{y:.2f}<extra></extra>',
                        text=post_count_clean[mask]
                    ))

        # Update layout with custom design
        fig.update_layout(
            title=f'{ticker} Stock Price and Sentiment Analysis',
            yaxis=dict(
                # title.font replaces the deprecated `titlefont` attribute
                title=dict(text='Stock Price ($)', font=dict(color='black')),
                tickfont=dict(color='black')
            ),
            yaxis2=dict(
                title=dict(text='Sentiment Score', font=dict(color='green')),
                tickfont=dict(color='green'),
                overlaying='y',
                side='right',
                range=[-1, 1],
                showgrid=False
            ),
            xaxis_title='Date',
            legend=dict(
                orientation="h",
                yanchor="bottom",
                y=0.15,
                xanchor="right",
                x=1
            ),
            template='plotly_white',
            height=800,
            margin=dict(l=70, r=100, t=80, b=100),
            hovermode='x unified',
            showlegend=True,
            autosize=True
        )

        # Add volume as a bar chart at the bottom
        fig2 = go.Figure()

        # Clean volume data
        volume_clean = merged_data['Volume'].replace([np.inf, -np.inf], 0).fillna(0)
        fig2.add_trace(go.Bar(
            x=merged_data['Date'],
            y=volume_clean,
            name='Volume',
            marker_color='rgba(100, 100, 200, 0.7)'
        ))

        # Add RSI indicator if available
        if 'rsi_14' in merged_data.columns:
            fig3 = go.Figure()

            # Clean RSI data
            rsi_clean = merged_data['rsi_14'].replace([np.inf, -np.inf], np.nan)
            fig3.add_trace(go.Scatter(
                x=merged_data['Date'],
                y=rsi_clean,
                name='RSI (14)',
                line=dict(color='purple', width=1.5)
            ))

            # Add RSI reference lines
            fig3.add_shape(
                type="line",
                x0=merged_data['Date'].min(),
                y0=70,
                x1=merged_data['Date'].max(),
                y1=70,
                line=dict(color="red", width=1, dash="dash"),
            )

            fig3.add_shape(
                type="line",
                x0=merged_data['Date'].min(),
                y0=30,
                x1=merged_data['Date'].max(),
                y1=30,
                line=dict(color="green", width=1, dash="dash"),
            )

            fig3.update_layout(
                title="RSI (14-day)",
                yaxis=dict(
                    title='RSI',
                    range=[0, 100]
                ),
                height=200,
                margin=dict(l=50, r=10, t=30, b=10),
                showlegend=False
            )
        else:
            fig3 = None

        # Update Volume chart layout
        fig2.update_layout(
            title="Trading Volume",
            yaxis=dict(title='Volume'),
            height=200,
            margin=dict(l=50, r=10, t=30, b=30),
            showlegend=False
        )

        # Generate predictions
        predictions = None
        if global_models and global_models.get('features') and global_models.get('feature_mask') is not None:
            try:
                # Create feature vector with all available features
                required_features = global_models['features']
                all_features = joblib.load(os.path.join(global_analyzer.model_dir, "all_feature_names.joblib"))
                feature_mask = global_models['feature_mask']

                # Prepare the latest data point for prediction
                feature_row = {}

                # Start with all zeros
                for feature in all_features:
                    feature_row[feature] = 0

                # Fill in the values we have from merged_data
                for feature in all_features:
                    if feature in merged_data.columns:
                        # Clean the value before using it
                        value = merged_data[feature].iloc[-1]
                        if pd.isna(value) or np.isinf(value):
                            value = 0
                        feature_row[feature] = float(value)

                # Set the one-hot ticker column, if this ticker was seen in training
                ticker_col = f'ticker_{ticker}'
                if ticker_col in feature_row:
                    feature_row[ticker_col] = 1

                # Convert to numpy array in the correct order
                feature_vector = np.array([feature_row[f] for f in all_features]).reshape(1, -1)

                # Check for any remaining NaN or inf values
                feature_vector = np.nan_to_num(feature_vector, nan=0, posinf=0, neginf=0)

                # Scale the features
                feature_vector_scaled = global_models['scaler'].transform(feature_vector)

                # Apply feature mask for selected features
                feature_vector_selected = feature_vector_scaled[:, feature_mask]

                # Generate predictions
                if global_models['linear_model'] is not None:
                    linear_pred = global_models['linear_model'].predict(feature_vector_selected)
                    linear_return = float(linear_pred[0])
                else:
                    linear_return = 0

                if global_models['logistic_model'] is not None:
                    try:
                        logistic_prob = global_models['logistic_model'].predict_proba(feature_vector_selected)[:, 1]
                        prob_up = float(logistic_prob[0])
                    except Exception:
                        # Model has no predict_proba; fall back to the hard 0/1 label
                        prob_up = float(global_models['logistic_model'].predict(feature_vector_selected)[0])
                else:
                    prob_up = 0.5

                predictions = {
                    'returns': linear_return,
                    'probability_up': prob_up
                }
            except Exception as e:
                print(f"Error generating predictions: {str(e)}")
                import traceback
                traceback.print_exc()
                predictions = {'error': str(e)}

        # Prepare the final response
        response_data = {
            'success': True,
            'stats': stats,
            'predictions': predictions,
            'price_chart': json.loads(fig.to_json()),
            'volume_chart': json.loads(fig2.to_json()),
            'rsi_chart': json.loads(fig3.to_json()) if fig3 else None
        }

        print(f"Analysis completed for {ticker}")
        return jsonify(response_data)

    except Exception as e:
        print(f"Error in analyze(): {str(e)}")
        import traceback
        traceback.print_exc()
        return jsonify({'error': str(e)})

# Create the templates directory and HTML template
def create_template_directory():
    """Create templates directory and index.html"""
    import os
    os.makedirs('templates', exist_ok=True)
    # INDEX_HTML is defined below; it is resolved when this function is called
    with open('templates/index.html', 'w', encoding='utf-8') as f:
        f.write(INDEX_HTML)


# HTML template for the web application
INDEX_HTML = """
<!DOCTYPE html>
<html>
<head>
    <title>Trading On Trends</title>
    <script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
    <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/three.js/r128/three.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/gsap/3.9.1/gsap.min.js"></script>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.4/css/all.min.css">
    <style>
        @import url('https://fonts.googleapis.com/css2?family=Poppins:wght@300;400;500;600;700&display=swap');

        :root {
            --primary: #6c5ce7;
            --secondary: #a29bfe;
            --success: #00cec9;
            --danger: #e74c3c;
            --warning: #fdcb6e;
            --text: #2d3436;
            --light: #f9f9f9;
            --dark: #262837;
            --card-bg: rgba(255, 255, 255, 0.9);
            --card-shadow: 0 8px 30px rgba(0, 0, 0, 0.12);
            --gradient: linear-gradient(135deg, #6c5ce7, #00cec9);
        }

        * {
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }

        body {
            font-family: 'Poppins', sans-serif;
            background-color: var(--light);
            color: var(--text);
            overflow-x: hidden;
        }

        canvas {
            position: fixed;
            top: 0;
            left: 0;
            width: 100%;
            height: 100%;
            z-index: -1;
        }

        .container {
            width: 100%;
            min-height: 100vh;
            display: flex;
            justify-content: center;
            align-items: center;
            padding: 20px;
        }

        .app-wrapper {
            width: 100%;
            max-width: 1200px;
            background: var(--card-bg);
            border-radius: 24px;
            box-shadow: var(--card-shadow);
            padding: 40px;
            position: relative;
            overflow: hidden;
            backdrop-filter: blur(10px);
        }

        .app-header {
            text-align: center;
            margin-bottom: 40px;
            transform: translateY(30px);
            opacity: 0;
        }

        .app-header h1 {
            font-size: 2.5rem;
            font-weight: 700;
            margin-bottom: 10px;
            background: var(--gradient);
            -webkit-background-clip: text;
            -webkit-text-fill-color: transparent;
        }

        .app-header p {
            color: var(--text);
            opacity: 0.7;
        }

        .search-box {
            display: flex;
            justify-content: center;
            margin-bottom: 40px;
            transform: translateY(30px);
            opacity: 0;
        }

        .search-container {
            position: relative;
            width: 100%;
            max-width: 500px;
        }

        .search-container input {
            width: 100%;
            padding: 18px 24px;
            padding-right: 70px;
            background: white;
            border: none;
            border-radius: 50px;
            font-size: 1rem;
            box-shadow: 0 4px 15px rgba(0, 0, 0, 0.1);
            transition: all 0.3s ease;
        }

        .search-container input:focus {
            outline: none;
            box-shadow: 0 4px 20px rgba(108, 92, 231, 0.25);
        }

        .search-btn {
            position: absolute;
            top: 5px;
            right: 5px;
            height: 44px;
            width: 44px;
            border-radius: 50%;
            background: var(--gradient);
            border: none;
            color: white;
            cursor: pointer;
            font-size: 16px;
            transition: all 0.3s ease;
        }

        .search-btn:hover {
            transform: scale(1.05);
        }

        #loading {
            display: none;
            text-align: center;
            margin: 20px 0;
        }

        .loader {
            display: inline-block;
            width: 80px;
            height: 80px;
        }

        .loader:after {
            content: " ";
            display: block;
            width: 64px;
            height: 64px;
            margin: 8px;
            border-radius: 50%;
            border: 6px solid var(--primary);
            border-color: var(--primary) transparent var(--primary) transparent;
            animation: loader 1.2s linear infinite;
        }

        @keyframes loader {
            0% {
                transform: rotate(0deg);
            }
            100% {
                transform: rotate(360deg);
            }
        }

        #results {
            opacity: 0;
            transform: translateY(30px);
        }

        .chart-container {
            margin-bottom: 30px;
            border-radius: 16px;
            overflow: hidden;
            box-shadow: var(--card-shadow);
            background: white;
        }

        #price-chart, #volume-chart, #rsi-chart {
            width: 100%;
            background: white;
        }

        #price-chart {
            height: 500px;
        }

        #volume-chart, #rsi-chart {
            height: 200px;
        }

        .stats-container {
            display: grid;
            grid-template-columns: repeat(auto-fill, minmax(200px, 1fr));
            gap: 20px;
            margin: 30px 0;
        }

        .stat-card {
            background: white;
            padding: 20px;
            border-radius: 16px;
            box-shadow: 0 4px 15px rgba(0, 0, 0, 0.05);
            transition: all 0.3s ease;
            opacity: 0;
            transform: translateY(20px);
        }

        .stat-card:hover {
            transform: translateY(-5px);
            box-shadow: 0 8px 25px rgba(0, 0, 0, 0.1);
        }

        .stat-card strong {
            display: block;
            margin-bottom: 10px;
            color: var(--primary);
            font-size: 0.9rem;
            text-transform: uppercase;
            letter-spacing: 1px;
        }

        .stat-card div {
            font-size: 1.2rem;
            word-break: break-word;
            font-weight: 600;
        }

        .predictions-container {
            background: white;
            padding: 25px;
            border-radius: 16px;
            box-shadow: var(--card-shadow);
            margin-top: 30px;
            transform: translateY(20px);
            opacity: 0;
        }

        .predictions-container h3 {
            color: var(--primary);
            margin-bottom: 25px;
            font-size: 1.3rem;
            text-align: center;
        }

        .prediction-cards {
            display: flex;
            justify-content: space-around;
            flex-wrap: wrap;
            gap: 20px;
        }

        .prediction-card {
            flex: 1;
            min-width: 200px;
            padding: 20px;
            border-radius: 12px;
            text-align: center;
            box-shadow: 0 4px 12px rgba(0, 0, 0, 0.08);
            transition: all 0.3s ease;
        }

        .prediction-card:hover {
            transform: translateY(-5px);
            box-shadow: 0 8px 20px rgba(0, 0, 0, 0.15);
        }

        .prediction-card.up {
            background: linear-gradient(135deg, rgba(39, 174, 96, 0.1), rgba(39, 174, 96, 0.3));
            border-left: 4px solid #27ae60;
        }

        .prediction-card.down {
            background: linear-gradient(135deg, rgba(231, 76, 60, 0.1), rgba(231, 76, 60, 0.3));
            border-left: 4px solid #e74c3c;
        }

        .prediction-card .value {
            font-size: 2rem;
            font-weight: 700;
            margin: 10px 0;
        }

        .prediction-card .label {
            font-size: 0.9rem;
            opacity: 0.7;
            margin-bottom: 5px;
        }

        .prediction-card.up .value {
            color: #27ae60;
        }

        .prediction-card.down .value {
            color: #e74c3c;
        }

        .prediction-card i {
            font-size: 2.5rem;
            margin-bottom: 15px;
            opacity: 0.8;
        }

        .prediction-card.up i {
            color: #27ae60;
        }

        .prediction-card.down i {
            color: #e74c3c;
        }

        .indicator-container {
            margin-top: 40px;
            padding: 20px;
            background: white;
            border-radius: 16px;
            box-shadow: var(--card-shadow);
        }

        .indicator-header {
            display: flex;
            justify-content: space-between;
            align-items: center;
            margin-bottom: 20px;
        }

        .indicator-header h3 {
            color: var(--primary);
            font-size: 1.3rem;
            margin: 0;
        }

        .indicator-tabs {
            display: flex;
            gap: 10px;
        }

        .tab-btn {
            padding: 8px 16px;
            border: none;
            background: var(--light);
            border-radius: 20px;
            cursor: pointer;
            transition: all 0.3s ease;
        }

        .tab-btn.active {
            background: var(--primary);
            color: white;
        }

        .tab-content {
            display: none;
        }

        .tab-content.active {
            display: block;
        }

        @media (max-width: 768px) {
            .app-wrapper {
                padding: 20px;
            }

            .app-header h1 {
                font-size: 1.8rem;
            }

            .stats-container {
                grid-template-columns: 1fr;
            }

            .prediction-cards {
                flex-direction: column;
            }
        }
    </style>
</head>
<body>
    <canvas id="bg-canvas"></canvas>
    <div class="container">
        <div class="app-wrapper">
            <div class="app-header">
                <h1>Trading On Trends</h1>
                <p>Advanced Stock Sentiment Analysis with AI</p>
            </div>

            <div class="search-box">
                <div class="search-container">
                    <input type="text" id="ticker" placeholder="Enter stock ticker (e.g., AAPL, MSFT, TSLA)">
                    <button class="search-btn" onclick="analyzeStock()">
                        <i class="fas fa-search"></i>
                    </button>
                </div>
            </div>

            <div id="loading">
                <div class="loader"></div>
                <p>Analyzing market data and social sentiment...</p>
            </div>

            <div id="results">
                <div class="chart-container">
                    <div id="price-chart"></div>
                </div>

                <div class="chart-container">
                    <div id="volume-chart"></div>
                </div>

                <div id="rsi-container" class="chart-container" style="display:none;">
                    <div id="rsi-chart"></div>
                </div>

                <div id="predictions" class="predictions-container">
                    <!-- Predictions will be inserted here -->
                </div>

                <div id="stats" class="stats-container">
                    <!-- Stats will be inserted here -->
                </div>

                <div class="indicator-container">
                    <div class="indicator-header">
                        <h3>Technical Indicators</h3>
                        <div class="indicator-tabs">
                            <button class="tab-btn active" onclick="showTab('trend')">Trend</button>
                            <button class="tab-btn" onclick="showTab('momentum')">Momentum</button>
                            <button class="tab-btn" onclick="showTab('volatility')">Volatility</button>
                        </div>
                    </div>

                    <div id="trend-tab" class="tab-content active">
                        <!-- Trend indicators will be shown here -->
                    </div>

                    <div id="momentum-tab" class="tab-content">
                        <!-- Momentum indicators will be shown here -->
                    </div>

                    <div id="volatility-tab" class="tab-content">
                        <!-- Volatility indicators will be shown here -->
                    </div>
                </div>
            </div>
        </div>
    </div>

    <script>
        // Three.js background
        let scene, camera, renderer;
        let particles;

        function initThreeJS() {
            scene = new THREE.Scene();
            camera = new THREE.PerspectiveCamera(75, window.innerWidth / window.innerHeight, 0.1, 1000);

            renderer = new THREE.WebGLRenderer({
                canvas: document.querySelector('#bg-canvas'),
                antialias: true,
                alpha: true
            });
            renderer.setPixelRatio(window.devicePixelRatio);
            renderer.setSize(window.innerWidth, window.innerHeight);

            camera.position.z = 30;

            // Create particles
            const particlesGeometry = new THREE.BufferGeometry();
            const particlesCount = 2000;

            const posArray = new Float32Array(particlesCount * 3);

            for(let i = 0; i < particlesCount * 3; i++) {
                posArray[i] = (Math.random() - 0.5) * 100;
            }

            particlesGeometry.setAttribute('position', new THREE.BufferAttribute(posArray, 3));

            const particlesMaterial = new THREE.PointsMaterial({
                size: 0.2,
                color: '#6c5ce7',
                transparent: true,
                opacity: 0.8
            });

            particles = new THREE.Points(particlesGeometry, particlesMaterial);
            scene.add(particles);

            window.addEventListener('resize', () => {
                camera.aspect = window.innerWidth / window.innerHeight;
                camera.updateProjectionMatrix();
                renderer.setSize(window.innerWidth, window.innerHeight);
            });

            animate();
        }

        function animate() {
            requestAnimationFrame(animate);

            particles.rotation.x += 0.0005;
            particles.rotation.y += 0.0005;

            renderer.render(scene, camera);
        }

        // Initialize animations
        function initAnimations() {
            gsap.to('.app-header', {
                opacity: 1,
                y: 0,
                duration: 1,
                ease: 'power3.out'
            });

            gsap.to('.search-box', {
                opacity: 1,
                y: 0,
                duration: 1,
                delay: 0.3,
                ease: 'power3.out'
            });
        }

        // Initialize app
        window.onload = function() {
            initThreeJS();
            initAnimations();
        };

        // Tab switching functionality
        function showTab(tabName) {
            // Update button states
            document.querySelectorAll('.tab-btn').forEach(btn => {
                btn.classList.remove('active');
            });
            document.querySelector(`.tab-btn[onclick="showTab('${tabName}')"]`).classList.add('active');

            // Update content visibility
            document.querySelectorAll('.tab-content').forEach(content => {
                content.classList.remove('active');
            });
            document.getElementById(`${tabName}-tab`).classList.add('active');
        }

        function analyzeStock() {
            const ticker = document.getElementById('ticker').value.trim();
            if (!ticker) {
                alert('Please enter a ticker symbol');
                return;
            }

            // Show loading indicator
            document.getElementById('loading').style.display = 'block';

            // Clear previous results
            document.getElementById('price-chart').innerHTML = '';
            document.getElementById('volume-chart').innerHTML = '';
            document.getElementById('rsi-chart').innerHTML = '';
            document.getElementById('stats').innerHTML = '';
            document.getElementById('predictions').innerHTML = '';
            document.getElementById('trend-tab').innerHTML = '';
            document.getElementById('momentum-tab').innerHTML = '';
            document.getElementById('volatility-tab').innerHTML = '';
            document.getElementById('rsi-container').style.display = 'none';

            // Hide results container
            gsap.to('#results', {
                opacity: 0,
                y: 30,
                duration: 0.5
            });

            $.ajax({
                url: '/analyze',
                method: 'POST',
                data: { ticker: ticker },
                success: function(response) {
                    // Hide loading indicator
                    document.getElementById('loading').style.display = 'none';

                    if (response.error) {
                        alert(`Error: ${response.error}`);
                        return;
                    }

                    // Render price chart
                    Plotly.newPlot('price-chart', response.price_chart.data, response.price_chart.layout);

                    // Render volume chart
                    Plotly.newPlot('volume-chart', response.volume_chart.data, response.volume_chart.layout);

                    // Render RSI chart if available
                    if (response.rsi_chart) {
                        document.getElementById('rsi-container').style.display = 'block';
                        Plotly.newPlot('rsi-chart', response.rsi_chart.data, response.rsi_chart.layout);
                    }

                    // Render statistics
                    let statsHtml = '';
                    const displayableStats = {
                        'ticker': 'Symbol',
                        'avg_close': 'Avg. Close Price',
                        'up_days_pct': 'Up Days %',
                        'avg_daily_return': 'Avg. Daily Return %',
                        'avg_volume': 'Avg. Volume',
                        'max_close': 'Max Close Price',
                        'min_close': 'Min Close Price',
                        'stddev_daily_return': 'Return Volatility %'
                    };

                    // Add sentiment stats if available
                    if (response.stats.avg_sentiment !== undefined) {
                        Object.assign(displayableStats, {
                            'avg_sentiment': 'Avg. Sentiment',
                            'positive_days_pct': 'Positive Sentiment Days %',
                            'avg_posts_per_day': 'Avg. Posts Per Day',
                            'sentiment_return_corr': 'Sentiment-Return Correlation'
                        });
                    }

                    // Add technical stats if available
                    if (response.stats.avg_rsi !== undefined) {
                        Object.assign(displayableStats, {
                            'avg_rsi': 'Avg. RSI',
                            'overbought_days_pct': 'Overbought Days %',
                            'oversold_days_pct': 'Oversold Days %',
                            'avg_volatility': 'Avg. Volatility %'
                        });
                    }

                    // Generate stat cards
                    let statCount = 0;
                    for (const [key, label] of Object.entries(displayableStats)) {
                        if (response.stats[key] !== undefined) {
                            let value = response.stats[key];

                            // Format numbers for better readability
                            if (typeof value === 'number') {
                                if (key.includes('corr')) {
                                    // Check correlations first: 'sentiment_return_corr' also contains 'return'
                                    value = value.toFixed(3);
                                } else if (key.includes('pct') || key.includes('return') || key.includes('volatility')) {
                                    value = value.toFixed(2) + '%';
                                } else if (key.includes('volume')) {
                                    value = value.toLocaleString();
                                } else if (key.includes('sentiment') && !key.includes('days')) {
                                    value = value.toFixed(3);
                                } else if (key.includes('price') || key.includes('close')) {
                                    value = '$' + value.toFixed(2);
                                } else {
                                    value = value.toFixed(2);
                                }
                            }

                            statsHtml += `
                                <div class="stat-card stat-${statCount}">
                                    <strong>${label}</strong>
                                    <div>${value}</div>
                                </div>`;
                            statCount++;
                        }
                    }
                    document.getElementById('stats').innerHTML = statsHtml;

                    // Render predictions
                    if (response.predictions && response.predictions.returns !== undefined) {
                        const returnValue = response.predictions.returns;
                        const formattedReturn = returnValue.toFixed(2);
                        // Keep the probability numeric for comparisons; format it only for display
                        const probabilityPct = response.predictions.probability_up * 100;
                        const probabilityValue = probabilityPct.toFixed(1);

                        let predictionsHtml = `<h3>AI Price Predictions</h3>
                        <div class="prediction-cards">
                            <div class="prediction-card ${returnValue > 0 ? 'up' : 'down'}">
                                <i class="fas fa-${returnValue > 0 ? 'chart-line' : 'chart-line fa-flip-vertical'}"></i>
                                <div class="label">Expected Return</div>
                                <div class="value">${formattedReturn}%</div>
                                <div class="confidence">Based on historical patterns</div>
                            </div>
                            <div class="prediction-card ${probabilityPct > 50 ? 'up' : 'down'}">
                                <i class="fas fa-${probabilityPct > 50 ? 'arrow-up' : 'arrow-down'}"></i>
                                <div class="label">Probability of Price Increase</div>
                                <div class="value">${probabilityValue}%</div>
                                <div class="confidence">Based on market sentiment</div>
                            </div>
                        </div>`;

                        document.getElementById('predictions').innerHTML = predictionsHtml;
                    }

                    // Render technical indicators
                    if (response.stats) {
                        // Trend tab
                        let trendHtml = '<div style="padding: 20px;">';

                        // Moving Average signals (only when the price-to-SMA ratio is available)
                        if (response.stats.price_to_sma20 !== undefined) {
                            const smaSignal = response.stats.price_to_sma20 > 1 ?
                                '<span style="color: #27ae60;">Above SMA (Bullish)</span>' :
                                '<span style="color: #e74c3c;">Below SMA (Bearish)</span>';

                            trendHtml += `
                                <div style="margin-bottom: 15px;">
                                    <h4 style="margin-bottom: 10px;">Moving Average Analysis</h4>
                                    <p>Current Price to 20-day MA Ratio: ${smaSignal}</p>
                                </div>`;
                        }

                        // MACD signals if available
                        if (response.stats.macd_diff !== undefined) {
                            const macdSignal = response.stats.macd_diff > 0 ?
                                '<span style="color: #27ae60;">MACD Above Signal (Bullish)</span>' :
                                '<span style="color: #e74c3c;">MACD Below Signal (Bearish)</span>';

                            trendHtml += `
                                <div style="margin-bottom: 15px;">
                                    <h4 style="margin-bottom: 10px;">MACD Analysis</h4>
                                    <p>${macdSignal}</p>
                                </div>`;
                        }

                        trendHtml += '</div>';
                        document.getElementById('trend-tab').innerHTML = trendHtml;

                        // Momentum tab
                        let momentumHtml = '<div style="padding: 20px;">';

                        // RSI signals if available
                        if (response.stats.avg_rsi !== undefined) {
                            let rsiSignal;
                            const rsiValue = response.stats.avg_rsi;

                            if (rsiValue > 70) {
                                rsiSignal = '<span style="color: #e74c3c;">Overbought (Bearish)</span>';
                            } else if (rsiValue < 30) {
                                rsiSignal = '<span style="color: #27ae60;">Oversold (Bullish)</span>';
                            } else {
                                rsiSignal = '<span style="color: #7f8c8d;">Neutral</span>';
                            }

                            momentumHtml += `
                                <div style="margin-bottom: 15px;">
                                    <h4 style="margin-bottom: 10px;">RSI Analysis</h4>
                                    <p>Average RSI(14): ${rsiValue.toFixed(2)} - ${rsiSignal}</p>
                                    <div style="height: 20px; width: 100%; background: linear-gradient(to right, #27ae60, #f1c40f, #e74c3c); border-radius: 10px; margin-top: 10px; position: relative;">
                                        <div style="position: absolute; left: ${Math.min(100, Math.max(0, rsiValue))}%; transform: translateX(-50%); top: -15px;">
                                            <i class="fas fa-caret-down" style="color: #2c3e50;"></i>
                                        </div>
                                        <div style="display: flex; justify-content: space-between; margin-top: 25px; color: #7f8c8d; font-size: 0.8rem;">
                                            <span>Oversold</span>
                                            <span>Neutral</span>
                                            <span>Overbought</span>
                                        </div>
                                    </div>
                                </div>`;
                        }

                        momentumHtml += '</div>';
                        document.getElementById('momentum-tab').innerHTML = momentumHtml;

                        // Volatility tab
                        let volatilityHtml = '<div style="padding: 20px;">';

                        // Volatility signals if available
                        if (response.stats.avg_volatility !== undefined) {
                            const volatilityValue = response.stats.avg_volatility;
                            const volatilityMax = response.stats.max_volatility || volatilityValue * 2;
                            const volatilityRatio = volatilityMax > 0 ? (volatilityValue / volatilityMax) * 100 : 0;

                            volatilityHtml += `
                                <div style="margin-bottom: 15px;">
                                    <h4 style="margin-bottom: 10px;">Volatility Analysis</h4>
                                    <p>Average Daily Volatility: ${volatilityValue.toFixed(2)}%</p>
                                    <div style="height: 10px; width: 100%; background: #f1f2f6; border-radius: 5px; margin-top: 10px;">
                                        <div style="height: 100%; width: ${Math.min(100, volatilityRatio)}%; background: linear-gradient(to right, #6c5ce7, #a29bfe); border-radius: 5px;"></div>
                                    </div>
                                </div>`;
                        }

                        // Bollinger Band signals if available
                        if (response.stats.bb_width !== undefined) {
                            const bbWidthValue = response.stats.bb_width;

                            let bbSignal;
                            if (bbWidthValue > 0.1) {
                                bbSignal = 'High Volatility Expected';
                            } else if (bbWidthValue < 0.05) {
                                bbSignal = 'Low Volatility - Breakout Potential';
                            } else {
                                bbSignal = 'Normal Volatility';
                            }

                            volatilityHtml += `
                                <div style="margin-bottom: 15px;">
                                    <h4 style="margin-bottom: 10px;">Bollinger Bands</h4>
                                    <p>Band Width: ${bbWidthValue.toFixed(3)} - ${bbSignal}</p>
                                </div>`;
                        }

                        volatilityHtml += '</div>';
                        document.getElementById('volatility-tab').innerHTML = volatilityHtml;
                    }

                    // Show results with animation
                    gsap.to('#results', {
                        opacity: 1,
                        y: 0,
                        duration: 0.8,
                        onComplete: function() {
                            // Resize charts to ensure they render correctly
                            window.dispatchEvent(new Event('resize'));
                        }
                    });

                    // Animate stats cards
                    document.querySelectorAll('.stat-card').forEach((card, index) => {
                        gsap.to(card, {
                            opacity: 1,
                            y: 0,
                            duration: 0.5,
                            delay: 0.1 * index,
                            ease: 'power3.out'
                        });
                    });

                    // Animate predictions
                    if (document.getElementById('predictions').innerHTML !== '') {
                        gsap.to('#predictions', {
                            opacity: 1,
                            y: 0,
                            duration: 0.5,
                            delay: 0.3,
                            ease: 'power3.out'
                        });
                    }
                },
                error: function(xhr, status, error) {
                    // Hide loading indicator
                    document.getElementById('loading').style.display = 'none';
                    alert(`Error: ${error}`);
                }
            });
        }

        // Add mouse parallax effect to particles
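        document.addEventListener('mousemove', (event) => {
            // particles is created in initThreeJS() on window load; ignore earlier mouse moves
            if (!particles) return;
            const mouseX = event.clientX / window.innerWidth - 0.5;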
            const mouseY = event.clientY / window.innerHeight - 0.5;

            gsap.to(particles.rotation, {
                x: mouseY * 0.5,
                y: mouseX * 0.5,
                duration: 2
            });
        });

        // Handle Enter key in search box
        document.getElementById('ticker').addEventListener('keypress', function(event) {
            if (event.key === 'Enter') {
                event.preventDefault();
                analyzeStock();
            }
        });
    </script>
</body>
</html>
"""

# Set up ngrok tunnel
from pyngrok import ngrok

if __name__ == "__main__":
    # Open the tunnel only when run as a script, not on import
    public_url = ngrok.connect(5000)
    print(f"Public URL: {public_url}")
    create_template_directory()
    # Initialize the analyzer
    initialize_analyzer()
    # Run the Flask app on the port the tunnel forwards to
    app.run(port=5000)


Public URL: NgrokTunnel: "https://0fa8-35-204-77-46.ngrok-free.app" -> "http://localhost:5000"
VADER sentiment analyzer initialized with financial lexicon
Reddit client initialized successfully

Processing AAPL
Fetching stock data for AAPL...


It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



✅ Stock data saved to data/raw/AAPL_stock.csv
Fetching Reddit data for AAPL from ['stocks', 'investing', 'wallstreetbets', 'stockmarket']...
Searching r/stocks for posts about AAPL...
Found 2 posts about AAPL in r/stocks
Searching r/investing for posts about AAPL...
Found 3 posts about AAPL in r/investing
Searching r/wallstreetbets for posts about AAPL...
Found 6 posts about AAPL in r/wallstreetbets
Searching r/stockmarket for posts about AAPL...
Found 9 posts about AAPL in r/stockmarket
✅ Reddit data saved to data/raw/AAPL_reddit.csv
✅ Merged data saved to data/processed/AAPL_merged.csv
Analyzing data for AAPL...
✅ Analysis visualizations saved to visualizations
✅ Statistics saved to data/processed/AAPL_stats.json

Processing MSFT
Fetching stock data for MSFT...
✅ Stock data saved to data/raw/MSFT_stock.csv
Fetching Reddit data for MSFT from ['stocks', 'investing', 'wallstreetbets', 'stockmarket']...
Searching r/stocks for posts about MSFT...
Found 4 posts about MSFT in r/stocks
Searching r/investing for posts about MSFT...
Found 5 posts about MSFT in r/investing
Searching r/wallstreetbets for posts about MSFT...
Found 9 posts about MSFT in r/wallstreetbets
Searching r/stockmarket for posts about MSFT...
Found 10 posts about MSFT in r/stockmarket
✅ Reddit data saved to data/raw/MSFT_reddit.csv
✅ Merged data saved to data/processed/MSFT_merged.csv
Analyzing data for MSFT...
✅ Analysis visualizations saved to visualizations
✅ Statistics saved to data/processed/MSFT_stats.json

Processing TSLA
Fetching stock data for TSLA...
✅ Stock data saved to data/raw/TSLA_stock.csv
Fetching Reddit data for TSLA from ['stocks', 'investing', 'wallstreetbets', 'stockmarket']...
Searching r/stocks for posts about TSLA...
Found 9 posts about TSLA in r/stocks
Searching r/investing for posts about TSLA...
Found 14 posts about TSLA in r/investing
Searching r/wallstreetbets for posts about TSLA...
Found 23 posts about TSLA in r/wallstreetbets
Searching r/stockmarket for posts about TSLA...
Found 30 posts about TSLA in r/stockmarket
✅ Reddit data saved to data/raw/TSLA_reddit.csv
✅ Merged data saved to data/processed/TSLA_merged.csv
Analyzing data for TSLA...
✅ Analysis visualizations saved to visualizations
✅ Statistics saved to data/processed/TSLA_stats.json

Processing AMZN
Fetching stock data for AMZN...
✅ Stock data saved to data/raw/AMZN_stock.csv
Fetching Reddit data for AMZN from ['stocks', 'investing', 'wallstreetbets', 'stockmarket']...
Searching r/stocks for posts about AMZN...
Found 5 posts about AMZN in r/stocks
Searching r/investing for posts about AMZN...
Found 7 posts about AMZN in r/investing
Searching r/wallstreetbets for posts about AMZN...
Found 11 posts about AMZN in r/wallstreetbets
Searching r/stockmarket for posts about AMZN...
Found 12 posts about AMZN in r/stockmarket
✅ Reddit data saved to data/raw/AMZN_reddit.csv
✅ Merged data saved to data/processed/AMZN_merged.csv
Analyzing data for AMZN...
✅ Analysis visualizations saved to visualizations
✅ Statistics saved to data/processed/AMZN_stats.json

Processing NVDA
Fetching stock data for NVDA...
✅ Stock data saved to data/raw/NVDA_stock.csv
Fetching Reddit data for NVDA from ['stocks', 'investing', 'wallstreetbets', 'stockmarket']...
Searching r/stocks for posts about NVDA...
Found 8 posts about NVDA in r/stocks
Searching r/investing for posts about NVDA...
Found 9 posts about NVDA in r/investing
Searching r/wallstreetbets for posts about NVDA...
Found 14 posts about NVDA in r/wallstreetbets
Searching r/stockmarket for posts about NVDA...
Found 18 posts about NVDA in r/stockmarket
✅ Reddit data saved to data/raw/NVDA_reddit.csv
✅ Merged data saved to data/processed/NVDA_merged.csv
Analyzing data for NVDA...
✅ Analysis visualizations saved to visualizations
✅ Statistics saved to data/processed/NVDA_stats.json
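
In the log above, the "Found N posts" figures never decrease from one subreddit to the next within a ticker (for example 2, 3, 6, 9 for AAPL). That pattern would be consistent with the message printing a running total rather than a per-subreddit count. This is only an inference from the output, since the collection code is not shown here. A minimal sketch, assuming the logged values really are running totals, of recovering per-subreddit counts by differencing (`per_subreddit_counts` is a hypothetical helper, not part of the project):

```python
def per_subreddit_counts(running_totals):
    """Convert running totals into per-subreddit counts by differencing."""
    counts = []
    previous = 0
    for total in running_totals:
        counts.append(total - previous)
        previous = total
    return counts


subreddits = ["stocks", "investing", "wallstreetbets", "stockmarket"]
aapl_logged = [2, 3, 6, 9]  # values printed for AAPL in the log above

# Assumption: each logged value is the cumulative number of posts collected so far.
for name, count in zip(subreddits, per_subreddit_counts(aapl_logged)):
    print(f"r/{name}: {count} new posts")
```

If the assumption holds, logging the per-subreddit figure alongside the running total would make the output easier to interpret.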
Training models using data from: ['AAPL', 'MSFT', 'TSLA', 'AMZN', 'NVDA']
Skipping non-numeric feature: ticker_AAPL
Skipping non-numeric feature: ticker_AMZN
Skipping non-numeric feature: ticker_MSFT
Skipping non-numeric feature: ticker_NVDA
Skipping non-numeric feature: ticker_TSLA
Training data shape: (125, 44)
Testing data shape: (25, 44)
Performing feature selection...

Top 10 features by importance:
atr: 0.0543
gap: 0.0520
Volume_lag2: 0.0494
daily_return_lag2: 0.0468
daily_return_lag1: 0.0421
close_pct_change: 0.0410
volume_change: 0.0403
Close_lag2: 0.0389
close_to_open: 0.0355
volatility_daily: 0.0346
Selected 30 of




Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

✅ Models and metrics saved to models
 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit


 * Running on http://0fa8-35-204-77-46.ngrok-free.app
 * Traffic stats available on http://127.0.0.1:4040


INFO:werkzeug:127.0.0.1 - - [12/May/2025 20:24:40] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [12/May/2025 20:24:41] "GET /favicon.ico HTTP/1.1" 404 -
INFO:werkzeug:127.0.0.1 - - [12/May/2025 20:25:06] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [12/May/2025 20:25:06] "GET /favicon.ico HTTP/1.1" 404 -
INFO:werkzeug:127.0.0.1 - - [12/May/2025 20:25:11] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [12/May/2025 21:05:32] "GET / HTTP/1.1" 200 -


Processing analysis request for AAPL
Fetching stock data for AAPL...
✅ Stock data saved to data/raw/AAPL_stock.csv
Fetching Reddit data for AAPL from ['stocks', 'investing', 'wallstreetbets', 'stockmarket']...
Searching r/stocks for posts about AAPL...
Found 2 posts about AAPL in r/stocks
Searching r/investing for posts about AAPL...
Found 3 posts about AAPL in r/investing
Searching r/wallstreetbets for posts about AAPL...
Found 6 posts about AAPL in r/wallstreetbets
Searching r/stockmarket for posts about AAPL...
Found 9 posts about AAPL in r/stockmarket
✅ Reddit data saved to data/raw/AAPL_reddit.csv
✅ Merged data saved to data/processed/AAPL_merged.csv
Analyzing data for AAPL...


INFO:werkzeug:127.0.0.1 - - [12/May/2025 21:06:04] "POST /analyze HTTP/1.1" 200 -


✅ Analysis visualizations saved to visualizations
✅ Statistics saved to data/processed/AAPL_stats.json
Analysis completed for AAPL


INFO:werkzeug:127.0.0.1 - - [12/May/2025 21:08:52] "GET / HTTP/1.1" 200 -
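
Werkzeug colors some request lines with ANSI escape sequences when it logs to a terminal. When the output is captured as plain text, these can survive as fragments such as `[33m` and `[0m`. The following is a minimal sketch, using only the standard library, of stripping those sequences and parsing the access-log lines into structured records. The helper name `parse_access_line` and both regular expressions are illustrative assumptions, not part of the project or of Werkzeug.

```python
import re
from datetime import datetime

# ANSI color sequences, with or without the leading ESC byte
# (the ESC byte is often lost when terminal output is pasted into a document).
ANSI_RE = re.compile(r"\x1b?\[\d+(?:;\d+)*m")

# Shape of a werkzeug access-log line as it appears in the output above.
ACCESS_RE = re.compile(
    r'^INFO:werkzeug:(?P<host>\S+) - - \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)


def parse_access_line(line):
    """Return a dict for an access-log line, or None for any other line."""
    match = ACCESS_RE.match(ANSI_RE.sub("", line.strip()))
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # %b parses English month abbreviations under the default C locale.
    record["ts"] = datetime.strptime(record["ts"], "%d/%b/%Y %H:%M:%S")
    return record


sample = [
    'INFO:werkzeug:127.0.0.1 - - [12/May/2025 20:24:41] "[33mGET /favicon.ico HTTP/1.1[0m" 404 -',
    'INFO:werkzeug:127.0.0.1 - - [12/May/2025 21:05:32] "GET / HTTP/1.1" 200 -',
    "Processing analysis request for AAPL",
    'INFO:werkzeug:127.0.0.1 - - [12/May/2025 21:06:04] "POST /analyze HTTP/1.1" 200 -',
]

records = [r for r in map(parse_access_line, sample) if r is not None]
for r in records:
    print(r["ts"].isoformat(), r["method"], r["path"], r["status"])
```

Non-request lines, such as the pipeline's own progress messages, return `None` and are filtered out, so the same function can be run over the full captured output.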
