### Step 1: Data Scraping (Using Reddit with PRAW)

1) Scrape posts from multiple subreddits to capture a range of opinions.

2) Include both post titles and comments to capture detailed sentiments.

3) Use additional fields like post score and comment count for better features.
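
The three points above can be sketched as a single collection function. This is a minimal sketch, not the notebook's own method: `collect_records`, `post_limit`, and `comment_limit` are hypothetical names, and `reddit` is assumed to be an authenticated `praw.Reddit` instance. It relies on PRAW's `replace_more(limit=0)` to drop "load more comments" placeholders before flattening the comment tree.

```python
def collect_records(reddit, subreddit_names, post_limit=100, comment_limit=20):
    """Collect hot posts, their metadata, and comment text from several subreddits.

    `reddit` is assumed to be an authenticated praw.Reddit instance.
    """
    records = []
    for name in subreddit_names:
        for post in reddit.subreddit(name).hot(limit=post_limit):
            # Remove "load more comments" placeholders so only real comments remain.
            post.comments.replace_more(limit=0)
            # Flatten the comment tree and keep at most `comment_limit` comments.
            comments = post.comments.list()[:comment_limit]
            records.append({
                "subreddit": name,
                "title": post.title,
                "body": post.selftext,
                "score": post.score,
                "num_comments": post.num_comments,
                "created_utc": post.created_utc,
                "comments": " ".join(comment.body for comment in comments),
            })
    return records
```

The result can be passed to `pd.DataFrame(...)`, for example `pd.DataFrame(collect_records(reddit, ["stocks", "investing"]))`, where the subreddit list is only an illustration.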

In [13]:
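import os

import praw
import pandas as pd

# Reddit API credentials, read from environment variables so that secrets
# are not stored in the notebook (the variable names here are illustrative)
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="StockSentimentApp/1.0 by u/Fun-Teach-2904"
)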

# Function to scrape Reddit posts
def scrape_reddit(subreddit_name, limit=100):
    subreddit = reddit.subreddit(subreddit_name)
    posts = []

    for post in subreddit.hot(limit=limit):
        posts.append({
            "title": post.title,
            "body": post.selftext,
            "score": post.score,
            "num_comments": post.num_comments,
            "created_utc": post.created_utc
        })

    return pd.DataFrame(posts)

# Scrape data from r/stocks
data = scrape_reddit("stocks", limit=200)
data.to_csv("reddit_data.csv", index=False)
print("Data scraped and saved to 'reddit_data.csv'") 

Data scraped and saved to 'reddit_data.csv'


In [14]:
import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Download the stopword list and tokenizer models
# (recent NLTK releases may also need nltk.download("punkt_tab"))
nltk.download("stopwords")
nltk.download("punkt")

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_text(text):
    text = re.sub(r"http\S+|www\S+", "", text)  # Remove URLs
    text = re.sub(r"\W", " ", text)  # Replace non-word characters with spaces
    text = text.lower()  # Convert to lowercase
    tokens = word_tokenize(text)  # Tokenize
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]  # Drop stopwords, then stem
    return " ".join(tokens)

data["cleaned_title"] = data["title"].apply(clean_text)
data["cleaned_body"] = data["body"].apply(clean_text)
data.to_csv("cleaned_reddit_data.csv", index=False)
print("Cleaned data saved to 'cleaned_reddit_data.csv'")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\91955\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\91955\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Cleaned data saved to 'cleaned_reddit_data.csv'


### Step 2: Preprocessing and Sentiment Analysis

1) Use VADER Sentiment Analysis for polarity scores (positive, negative, neutral).

2) Include TF-IDF vectors for text features to capture context and relevance.

3) Clean data thoroughly by removing:

   - Links, special characters, and stop words.

   - Non-stock-related posts, i.e. posts that contain neither keywords like "stock" or "market" nor ticker symbols.
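
The relevance filter described in the last point could look roughly like the sketch below. The keyword list, the cashtag pattern, and the name `is_stock_related` are all assumptions made for illustration; the check runs on raw text, because a cashtag such as `$AAPL` no longer exists once punctuation has been stripped.

```python
import re

# Hypothetical keyword list; extend it for the subreddits being scraped.
KEYWORDS = ("stock", "market", "shares", "earnings")

# Cashtags such as $AAPL: a dollar sign followed by 1-5 uppercase letters.
CASHTAG = re.compile(r"\$[A-Z]{1,5}\b")

def is_stock_related(text):
    """Return True if the raw text contains a finance keyword or a cashtag."""
    if CASHTAG.search(text):
        return True
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)
```

Applied to the scraped table, this might be used as `data = data[(data["title"] + " " + data["body"]).apply(is_stock_related)]`.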

In [15]:
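import re

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer

# VADER needs its lexicon to be available locally
nltk.download("vader_lexicon")

# Lighter text cleaning for sentiment scoring: no stemming or stopword removal.
# This redefines clean_text and overwrites the cleaned columns from the previous cell.
def clean_text(text):
    text = re.sub(r"http\S+|www\S+", "", text)  # Remove URLs (http\S+ also covers https)
    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation
    text = text.lower()
    return text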

data["cleaned_title"] = data["title"].apply(clean_text)
data["cleaned_body"] = data["body"].apply(clean_text)

# Sentiment Analysis
analyzer = SentimentIntensityAnalyzer()
data["title_sentiment"] = data["cleaned_title"].apply(lambda x: analyzer.polarity_scores(x)["compound"])
data["body_sentiment"] = data["cleaned_body"].apply(lambda x: analyzer.polarity_scores(x)["compound"])

# TF-IDF features
vectorizer = TfidfVectorizer(max_features=500)
tfidf_matrix = vectorizer.fit_transform(data["cleaned_title"] + " " + data["cleaned_body"])

# Save TF-IDF as a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
data = pd.concat([data.reset_index(drop=True), tfidf_df], axis=1)
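
Step 2 mentions positive, negative, and neutral polarity, while the cell above keeps only the compound score. A small helper can bucket that score into the three classes. This is a sketch under the commonly used convention of a ±0.05 cutoff on the compound score; `label_from_compound` and its `threshold` parameter are hypothetical names.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

For instance, `data["title_sentiment"].apply(label_from_compound)` would give a categorical column alongside the numeric one.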

### Step 3: Feature Engineering

1) Include post popularity (score, num_comments) as features.

2) Extract mentions of stock tickers to track relevance to specific stocks.
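
One way to track relevance to specific stocks is to count mentions of a fixed set of tickers in the raw text. The sketch below is illustrative only: `TRACKED_TICKERS` is a hypothetical watchlist and `ticker_counts` is a hypothetical helper. Matching against a watchlist avoids treating ordinary capitalised words such as "I" or "CEO" as tickers.

```python
import re
from collections import Counter

# Hypothetical watchlist; replace with the tickers of interest.
TRACKED_TICKERS = {"AAPL", "TSLA", "NVDA"}

def ticker_counts(text, tracked=TRACKED_TICKERS):
    """Count mentions of tracked tickers, with or without a leading '$'."""
    # Candidate tokens: standalone runs of 1-5 uppercase letters.
    candidates = re.findall(r"\b[A-Z]{1,5}\b", text)
    return Counter(token for token in candidates if token in tracked)
```

Each tracked ticker could then become its own feature column, for example `data["body"].apply(lambda t: ticker_counts(t)["AAPL"])`.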

In [16]:
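# Feature: mentions of "stock" or a cashtag such as $AAPL.
# Cashtags are matched on the raw body because clean_text strips the "$" character.
data["mentions_stock"] = (
    data["cleaned_body"].str.contains("stock")
    | data["body"].str.contains(r"\$[A-Z]{1,5}\b")
).astype(int)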

# Final feature set
features = data[["title_sentiment", "body_sentiment", "score", "num_comments", "mentions_stock"] + list(tfidf_df.columns)]

In [17]:
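# Simulate stock price movement labels (1 for up, 0 for down).
# These labels are random placeholders, so a model trained on them
# should score close to chance (about 0.5 accuracy).
import numpy as np
data["stock_movement"] = np.random.choice([0, 1], size=len(data))

# Select features and labels.
# Note: this replaces the feature set built in the previous cell, so
# mentions_stock and the TF-IDF columns are not used by the model below.
features = data[["title_sentiment", "body_sentiment", "score", "num_comments"]]
labels = data["stock_movement"]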
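
The labels above are simulated. With real price data, an up/down label could be derived from consecutive closing prices, as in the sketch below. The price series itself is an assumption (the notebook does not load one), and `movement_labels` is a hypothetical helper.

```python
def movement_labels(closes):
    """Label each day 1 if the next close is higher than the current one, else 0.

    Returns len(closes) - 1 labels, since the last day has no following close.
    """
    return [1 if nxt > cur else 0 for cur, nxt in zip(closes, closes[1:])]
```

Each post would then take the label of the trading day it was created on, matched through `created_utc`.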

### Step 4: Building and Improving the Prediction Model

Model Selection:

1) Start with a simple model (e.g., Logistic Regression), then move to more advanced models:

   - Random Forest

   - XGBoost

   - LSTM or Transformer models (for deep learning on text data)

2) Use GridSearchCV for hyperparameter tuning.
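
A simple baseline of the kind suggested above might look like the following sketch. It uses synthetic data from `make_classification` as a stand-in for the notebook's four numeric features, so the printed accuracy says nothing about the Reddit data; the pipeline with `StandardScaler` is one reasonable choice, not the only one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature table (4 numeric columns, binary labels).
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale the features, then fit a logistic regression as the baseline.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

More complex models are only worth their extra tuning cost if they beat a baseline like this on the same split.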

In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score

# Labels: 1 = price up, 0 = price down (simulated in the previous cell)
labels = data["stock_movement"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Hyperparameter tuning for Random Forest
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5, 10]
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, scoring="accuracy")
grid.fit(X_train, y_train)

# Best model and evaluation
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.55
Classification Report:
               precision    recall  f1-score   support

           0       0.60      0.43      0.50        21
           1       0.52      0.68      0.59        19

    accuracy                           0.55        40
   macro avg       0.56      0.56      0.55        40
weighted avg       0.56      0.55      0.54        40
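
To put the 0.55 accuracy in context, it helps to compare it with a majority-class baseline computed from the class supports in the report (21 rows of class 0, 19 of class 1). The helper name `majority_baseline_accuracy` below is hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Test-set class supports taken from the classification report above.
test_labels = [0] * 21 + [1] * 19
baseline = majority_baseline_accuracy(test_labels)
print(f"Majority-class baseline: {baseline:.3f}")  # 21 / 40 = 0.525
```

Always predicting class 0 would already reach 0.525, so 0.55 on 40 test rows is within a single prediction of that baseline, which is what one would expect from randomly generated labels.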

