# 🧠 Stock Price Sentiment Analysis using ML & Deep Learning (LSTM)

This notebook demonstrates sentiment analysis on stock-related text data using:
1. **Machine Learning Approach**: TF-IDF + Logistic Regression
2. **Deep Learning Approach**: LSTM Neural Network

We'll compare the performance of both methods.

## 📦 Step 1: Load Dataset and Dependencies

In [None]:
import pandas as pd
import numpy as np
import re
import string
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Load data
df = pd.read_csv('/kaggle/input/stock-sentiment-data/stock_sentiment.csv')
print("\nData Sample:")
print(df.head())

# Filter columns
df = df[['text', 'sentiment']].dropna()
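
Before modeling, it can help to check how balanced the sentiment classes are, since accuracy alone can be misleading when one class dominates. A minimal sketch, assuming a `sentiment` column like the one above; the toy rows are hypothetical and only stand in for the real CSV:

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (same column names as above)
toy_df = pd.DataFrame({
    'text': ['Shares surge', 'Profit warning issued',
             'Stock flat today', 'Record earnings'],
    'sentiment': ['positive', 'negative', 'neutral', 'positive'],
})

# Share of each class; a heavily skewed result suggests looking at
# per-class precision and recall rather than accuracy alone
class_share = toy_df['sentiment'].value_counts(normalize=True)
print(class_share)
```

On the real data, the same call would be `df['sentiment'].value_counts(normalize=True)`.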

## 📃 Step 2: Text Preprocessing

In [None]:
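def clean_text(text):
    """Lowercase text and strip URLs, mentions, '#', punctuation, and digits."""
    text = str(text).lower()
    # Remove URLs (the http pattern already covers https links)
    text = re.sub(r"http\S+|www\S+", '', text)
    # Remove @mentions and the '#' symbol (the hashtag word itself is kept)
    text = re.sub(r'@\w+|#', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)
    # Collapse the whitespace runs left behind by the removals above
    return re.sub(r'\s+', ' ', text).strip()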

df['clean_text'] = df['text'].apply(clean_text)
df['label'] = df['sentiment'].map({'negative': 0, 'neutral': 1, 'positive': 2})
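
Note that `Series.map` with a dictionary returns `NaN` for any value that is not a key, so a label spelled differently from the three keys above would silently become missing. A small sketch of how to detect this, using hypothetical label values (the real dataset may not have this problem):

```python
import pandas as pd

label_map = {'negative': 0, 'neutral': 1, 'positive': 2}

# Hypothetical labels, including one with different capitalization
toy_sentiment = pd.Series(['positive', 'negative', 'Neutral', 'neutral'])

# Values missing from the dictionary are mapped to NaN
labels = toy_sentiment.map(label_map)
unmapped = toy_sentiment[labels.isna()].unique()
print("Unmapped label values:", list(unmapped))

# One possible remedy: normalize case and whitespace before mapping
labels_fixed = toy_sentiment.str.strip().str.lower().map(label_map)
print(labels_fixed.tolist())
```

If `unmapped` is empty on the real data, the mapping in the cell above is safe to use as written.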

## 📊 Step 3: Machine Learning Approach (TF-IDF + Logistic Regression)

In [None]:
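# Train-test split on the raw text, so the vectorizer never sees test documents
text_train_ml, text_test_ml, y_train_ml, y_test_ml = train_test_split(
    df['clean_text'], df['label'], test_size=0.2, random_state=42)

# TF-IDF vectorization: fit on the training split only, then transform both
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_ml = vectorizer.fit_transform(text_train_ml)
X_test_ml = vectorizer.transform(text_test_ml)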

# Train model
logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train_ml, y_train_ml)

# Evaluate
y_pred_ml = logreg.predict(X_test_ml)
print("\n🔍 Logistic Regression Accuracy:", accuracy_score(y_test_ml, y_pred_ml))
print(classification_report(y_test_ml, y_pred_ml))

# Confusion matrix
plt.figure(figsize=(6,4))
sns.heatmap(confusion_matrix(y_test_ml, y_pred_ml), 
            annot=True, cmap='Blues', fmt='d',
            xticklabels=['Neg', 'Neutral', 'Pos'],
            yticklabels=['Neg', 'Neutral', 'Pos'])
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Logistic Regression Confusion Matrix")
plt.show()
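
The vectorizer and classifier can also be bundled into a single scikit-learn pipeline, which keeps the TF-IDF fit tied to the training data and lets raw strings be passed at prediction time. A minimal sketch under that assumption; the texts and labels below are hypothetical, not taken from the dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned training texts (0=neg, 1=neutral, 2=pos)
train_texts = [
    'shares plunge after weak earnings', 'profit warning sends stock lower',
    'company holds annual meeting', 'stock unchanged in quiet trading',
    'shares surge on record profit', 'strong earnings lift stock higher',
]
train_labels = [0, 0, 1, 1, 2, 2]

# The pipeline fits TF-IDF on whatever is passed to fit(), so the
# vocabulary is learned from the training texts only
pipe = make_pipeline(
    TfidfVectorizer(stop_words='english', max_features=5000),
    LogisticRegression(max_iter=200),
)
pipe.fit(train_texts, train_labels)

# Raw strings go straight in at prediction time
new_text = ['record profit lifts shares']
pred = pipe.predict(new_text)
proba = pipe.predict_proba(new_text)
print(pred, proba.round(2))
```

With only six training sentences the predicted probabilities are not meaningful; the point is the fit and predict workflow.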

## 🤖 Step 4: Deep Learning Approach (LSTM)
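
The cell below converts each text into a sequence of integer word indices and pads it to `max_len`. With its default settings, `pad_sequences` pads and truncates at the start of the sequence. A pure-Python sketch of that behavior, where `pad_pre` is a hypothetical helper written for illustration and not part of Keras:

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` tokens, then left-pad with `value`."""
    truncated = seq[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

short = pad_pre([5, 8, 2], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # → [0, 0, 5, 8, 2]
print(long)   # → [3, 4, 5, 6, 7]
```

Left-padding places the real tokens at the end of the sequence, closest to the LSTM's final state.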

In [None]:
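# Tokenization settings
max_words = 10000  # vocabulary size
max_len = 100      # tokens per sequence after padding/truncation

# Split the raw text first so the vocabulary comes from training data only
text_train_dl, text_test_dl, y_train_dl, y_test_dl = train_test_split(
    df['clean_text'], df['label'], test_size=0.2, random_state=42)

tokenizer = Tokenizer(num_words=max_words, oov_token='<OOV>')
tokenizer.fit_on_texts(text_train_dl)

# Convert text to integer sequences and pad them to a fixed length
X_train_dl = pad_sequences(tokenizer.texts_to_sequences(text_train_dl), maxlen=max_len)
X_test_dl = pad_sequences(tokenizer.texts_to_sequences(text_test_dl), maxlen=max_len)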

# Build model
model = Sequential([
    Embedding(input_dim=max_words, output_dim=128, input_length=max_len),
    LSTM(64, return_sequences=False),
    Dropout(0.5),
    Dense(3, activation='softmax')
])

# Compile and train
model.compile(loss='sparse_categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

# Hold out 10% of the training data for validation so the test set stays unseen
history = model.fit(X_train_dl, np.asarray(y_train_dl),
                    epochs=5,
                    batch_size=64,
                    validation_split=0.1)

# Evaluate
loss, acc = model.evaluate(X_test_dl, y_test_dl)
print("\n📈 LSTM Test Accuracy:", acc)

# Plot training
plt.figure(figsize=(8,4))
plt.plot(history.history['accuracy'], label='Train Acc')
plt.plot(history.history['val_accuracy'], label='Val Acc')
plt.title("LSTM Accuracy Over Epochs")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.grid(True)
plt.show()
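
To compare the LSTM with Logistic Regression on the same per-class metrics, the softmax outputs can be converted to class predictions with `argmax` and passed to `classification_report`. A sketch with hypothetical probabilities; in the notebook they would come from `model.predict(X_test_dl)`:

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical softmax outputs for five samples (columns: neg, neutral, pos)
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
    [0.1, 0.3, 0.6],
])
y_true = np.array([0, 1, 2, 1, 2])

# The predicted class is the index of the largest probability in each row
y_pred = probs.argmax(axis=1)

print(classification_report(y_true, y_pred,
                            labels=[0, 1, 2],
                            target_names=['Neg', 'Neutral', 'Pos'],
                            zero_division=0))
```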

## 🏁 Conclusion

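**Comparison**:
- Logistic Regression: typically trains quickly on sparse TF-IDF features and makes a good baseline, but it ignores word order.
- LSTM: reads tokens in order, so it can capture phrasing and negation that bag-of-words features miss. It is slower to train and can overfit on small datasets, so whether it is more accurate here depends on the test accuracies printed in Steps 3 and 4.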
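
One way to make the comparison concrete is to collect the same metrics for both models in a single table. A minimal sketch, where `summarize` is a hypothetical helper and the labels and predictions are made-up placeholders, not results from this notebook:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

def summarize(name, y_true, y_pred):
    """Return accuracy and macro-averaged F1 for one model (hypothetical helper)."""
    return {
        'model': name,
        'accuracy': accuracy_score(y_true, y_pred),
        'macro_f1': f1_score(y_true, y_pred, average='macro'),
    }

# Placeholder labels and predictions; in the notebook, use y_test_ml with
# y_pred_ml, and y_test_dl with the argmax of model.predict(X_test_dl)
y_true = [0, 1, 2, 1, 2, 0]
pred_a = [0, 1, 2, 1, 2, 0]
pred_b = [0, 1, 2, 0, 2, 0]

summary = pd.DataFrame([
    summarize('model_a', y_true, pred_a),
    summarize('model_b', y_true, pred_b),
]).set_index('model')
print(summary)
```

Macro F1 weights each class equally, which makes it a useful companion to accuracy when the sentiment classes are imbalanced.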

**Next Steps**:
- Try transformer models (BERT, etc.)
- Tune hyperparameters (e.g., `max_features`, LSTM units, dropout rate, number of epochs)
- Deploy the better-performing model