**Stock Movement Analysis Based on Social Media Sentiment**

Objective: Develop a machine learning model that predicts stock price movements from Reddit data.

**Step 1: Scrape Data from Reddit**

In [1]:
pip install praw




In [3]:
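import os

import praw
import pandas as pd

# Create a Reddit instance. Credentials are read from environment variables
# (the variable names are placeholders) so that secrets stay out of the notebook.
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent='Aditi Mandal')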

# Get posts from the 'stocks' subreddit
posts = []
for post in reddit.subreddit('stocks').hot(limit=100):
    posts.append([post.title, post.selftext])

# Convert to DataFrame
df = pd.DataFrame(posts, columns=['title', 'text'])


It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.
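
The rows collected above hold only the title and body, while the later steps need a date to line posts up with trading days. The following is a minimal sketch, assuming each post object exposes `title`, `selftext`, and `created_utc` (Unix seconds), as PRAW submissions do. `post_to_row` is a hypothetical helper, and the stand-in object only imitates a submission so the example runs offline.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def post_to_row(post):
    """Return [title, text, date] for one submission-like object."""
    # created_utc is a Unix timestamp; convert it to a UTC calendar date
    created = datetime.fromtimestamp(post.created_utc, tz=timezone.utc)
    return [post.title, post.selftext, created.date().isoformat()]


# Stand-in for a PRAW submission (hypothetical values, for illustration only)
fake_post = SimpleNamespace(
    title="AAPL earnings thread",
    selftext="Thoughts on the quarter?",
    created_utc=1641220200.0,  # 2022-01-03 14:30:00 UTC
)

row = post_to_row(fake_post)
print(row)
```

In the scraping loop this could replace the inline list, for example `posts.append(post_to_row(post))`, with a third column name such as `'date'` added to the DataFrame.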



In [4]:
pip install textblob



In [5]:
pip install scikit-learn



In [6]:
pip install yfinance



**Step 2: Data Preprocessing**

In [8]:
# Combine title and text for sentiment analysis
df['content'] = df['title'] + ' ' + df['text']

# Clean the data (remove NaNs, etc.)
df.dropna(subset=['content'], inplace=True)
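
Reddit posts often carry URLs and irregular whitespace that add noise to the later steps. The following is one possible cleaning pass, not the notebook's own method. `clean_text` is a hypothetical helper, and the two patterns are assumptions about what counts as noise.

```python
import re

URL_RE = re.compile(r"https?://\S+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Strip URLs and collapse runs of whitespace; None becomes an empty string."""
    if text is None:
        return ""
    text = URL_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


sample = "Check this out:\nhttps://example.com/chart   AAPL looks strong"
cleaned = clean_text(sample)
print(cleaned)
```

With the DataFrame above this could be applied as `df['content'] = df['content'].apply(clean_text)`. Link posts have an empty `selftext` rather than a missing one, so filtering out empty strings afterwards may catch rows that `dropna` leaves in place.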


**Step 3: Perform Sentiment Analysis**

In [9]:
from textblob import TextBlob

# Sentiment Analysis using TextBlob
df['sentiment'] = df['content'].apply(lambda x: TextBlob(x).sentiment.polarity)
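
TextBlob's polarity is a float between -1.0 (most negative) and 1.0 (most positive). If a coarse label is easier to inspect than the raw score, the score can be bucketed as in the sketch below. `label_sentiment` is a hypothetical helper and the 0.05 neutral band is an arbitrary assumption, not a value taken from this notebook.

```python
def label_sentiment(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Made-up polarity scores for illustration
scores = [0.6, -0.3, 0.01]
labels = [label_sentiment(s) for s in scores]
print(labels)
```

Applied to the DataFrame, this would look like `df['sentiment_label'] = df['sentiment'].apply(label_sentiment)`, where the column name is again only a suggestion.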


**Step 4: Extract Key Features**

In [10]:
# Frequency of mentions
df['mentions'] = df['content'].apply(lambda x: x.lower().count('stock'))

# Save preprocessed data
df.to_csv('preprocessed_reddit_posts.csv', index=False)
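
Counting the substring `'stock'` also counts words such as "stocks" and "stockholders", and it says nothing about which company a post discusses. A sketch of a stricter count, assuming whole-word matches and an optional `$` cashtag prefix are what is wanted, is shown below. `count_mentions` is a hypothetical helper and the sample text is made up.

```python
import re


def count_mentions(text, term):
    """Count case-insensitive whole-word occurrences of term, allowing a leading '$'."""
    pattern = re.compile(
        r"(?<![A-Za-z0-9])\$?" + re.escape(term) + r"(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    return len(pattern.findall(text))


text = "Stock picks: $AAPL and aapl again. Stocks, stockholders, stock."

whole_word = count_mentions(text, "stock")   # matches "Stock" and "stock." only
ticker = count_mentions(text, "AAPL")        # matches "$AAPL" and "aapl"
substring = text.lower().count("stock")      # the notebook's original approach

print(whole_word, ticker, substring)
```

Whether the looser or the stricter count is the better feature depends on the model; the substring version is kept here only for comparison.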


**Step 5: Topic Modeling**

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_modeling(data, n_topics=5):
    vectorizer = CountVectorizer(stop_words='english')
    dtm = vectorizer.fit_transform(data)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(dtm)
    topics = lda.transform(dtm)
    return topics

# Extract topics
df['topics'] = list(topic_modeling(df['content']))


**Step 6: Build the Prediction Model**

In [12]:
import yfinance as yf
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Example: Fetch historical stock data for a specific stock
stock_data = yf.download('AAPL', start='2022-01-01', end='2022-12-31')

# Placeholder: the sentiment data still has to be merged with stock_data by date
# (additional preprocessing required) before real price movements can be used.

# Placeholder features and target. Note that y is derived from the 'sentiment'
# feature itself, so the model can recover it exactly; replace y with actual
# stock movement once the two datasets are aligned.
X = df[['sentiment', 'mentions']].values
y = [1 if sentiment > 0 else 0 for sentiment in df['sentiment']]  # Placeholder target variable

# Splitting data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training the model (fixed seed so that repeated runs give the same forest)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Predicting
predictions = model.predict(X_test)

# Evaluation
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')

[*********************100%***********************]  1 of 1 completed


Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
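
The perfect scores above follow from the placeholder target: `y` is computed from `sentiment`, which is also one of the two features, so the classifier only has to relearn the threshold at zero. One way to obtain a target that reflects actual price movement, sketched here under the assumption that closing prices are available as a date-ordered list, is to label each day by whether the next close is higher. `next_day_movement` is a hypothetical helper and the prices are made up.

```python
def next_day_movement(closes):
    """Label day t with 1 if close[t+1] > close[t], else 0.

    The last day has no following close, so it receives no label and the
    result is one element shorter than the input.
    """
    return [1 if nxt > cur else 0 for cur, nxt in zip(closes, closes[1:])]


# Made-up closing prices in chronological order
closes = [100.0, 101.5, 101.0, 101.0, 103.2]
labels = next_day_movement(closes)
print(labels)
```

Because such labels are ordered in time, a chronological split (train on earlier days, test on later ones) is usually a safer evaluation than the shuffled `train_test_split` used above.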


In [13]:
import os
import re

import numpy as np
import pandas as pd
import praw
import yfinance as yf
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from textblob import TextBlob

# Reddit Data Processing
# Credentials are read from environment variables (the variable names are
# placeholders) so that secrets stay out of the notebook.
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent='Hrsht')

posts = []
for post in reddit.subreddit('stocks').hot(limit=100):
    posts.append([post.title, post.selftext])

# Convert to DataFrame
df = pd.DataFrame(posts, columns=['title', 'text'])
df['content'] = df['title'] + ' ' + df['text']
df.dropna(subset=['content'], inplace=True)

# Sentiment Analysis
df['sentiment'] = df['content'].apply(lambda x: TextBlob(x).sentiment.polarity)
df['mentions'] = df['content'].apply(lambda x: x.lower().count('stock'))

# Extract Date from Content (if present) or use Scrape Date
def extract_date(text):
    try:
        # Regex to match YYYY-MM-DD
        match = re.search(r'\d{4}-\d{2}-\d{2}', text)
        if match:
            return pd.to_datetime(match.group(), format='%Y-%m-%d')
    except Exception:
        pass
    return pd.NaT  # Return NaT (Not a Time) if no date found

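df['date'] = df['content'].apply(extract_date)
# Fall back to the scrape date; assign the result instead of calling fillna
# with inplace=True on a column selection, which pandas warns about
df['date'] = df['date'].fillna(pd.Timestamp.now().normalize())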

# Stock Data Processing
stock_data = yf.download('AAPL', start='2022-01-01', end='2022-12-31')
stock_data['daily_return'] = stock_data['Adj Close'].pct_change()
stock_data['10d_ma'] = stock_data['Adj Close'].rolling(window=10).mean()
stock_data['50d_ma'] = stock_data['Adj Close'].rolling(window=50).mean()
stock_data['movement'] = np.where(stock_data['daily_return'] > 0, 1, 0)  # Upward movement = 1, else 0

# Move the Date index into a regular column so it can serve as the merge key
stock_data.reset_index(inplace=True)

# Align sentiment data with stock data
try:
    # Ensure both DataFrames have compatible structures
    print(stock_data.head())  # Debug: Check stock_data structure
    print(df.head())  # Debug: Check df structure

    if 'Date' in stock_data.columns and 'date' in df.columns:
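        # Flatten the (Price, Ticker) MultiIndex columns returned by yfinance;
        # pd.merge cannot join frames whose columns have different numbers of levels
        if isinstance(stock_data.columns, pd.MultiIndex):
            stock_data.columns = stock_data.columns.get_level_values(0)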

        # Ensure 'Date' in stock_data is a datetime object
        stock_data['Date'] = pd.to_datetime(stock_data['Date'])

        # Ensure 'date' in df is also a datetime object
        df['date'] = pd.to_datetime(df['date'])

        # Merge DataFrames
        merged_data = pd.merge(stock_data, df, left_on='Date', right_on='date', how='inner')

        if merged_data.empty:
            raise ValueError("Merged DataFrame is empty. Check if 'Date' and 'date' columns align.")
    else:
        raise ValueError("Missing 'Date' or 'date' columns in the dataframes.")
except ValueError as e:
    print(f"MergeError: {e}")
    print("Ensure both dataframes have compatible index or column structure before merging.")
    merged_data = None  # Signal to later cells that the merge failed




[*********************100%***********************]  1 of 1 completed

Price        Date   Adj Close       Close        High         Low        Open  \
Ticker                   AAPL        AAPL        AAPL        AAPL        AAPL   
0      2022-01-03  179.076614  182.009995  182.880005  177.710007  177.830002   
1      2022-01-04  176.803802  179.699997  182.940002  179.119995  182.630005   
2      2022-01-05  172.100830  174.919998  180.169998  174.639999  179.610001   
3      2022-01-06  169.227936  172.000000  175.300003  171.639999  172.699997   
4      2022-01-07  169.395187  172.169998  174.139999  171.029999  172.889999   

Price      Volume daily_return 10d_ma 50d_ma movement  
Ticker       AAPL                                      
0       104487900          NaN    NaN    NaN        0  
1        99310400    -0.012692    NaN    NaN        0  
2        94537600    -0.026600    NaN    NaN        0  
3        96904000    -0.016693    NaN    NaN        0  
4        86709100     0.000988    NaN    NaN        1  
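
The printed frame shows two header rows (`Price` and `Ticker`), while the Reddit frame has ordinary single-level columns. In addition, posts without an embedded date fall back to the scrape date, whereas the prices cover 2022, so an inner merge on the date can easily come back empty. The sketch below uses small synthetic frames (all values made up) to show one way to align the two: drop the ticker level, average sentiment per day, and left-merge so that every trading day is kept. It is an illustration under those assumptions, not the notebook's final pipeline.

```python
import pandas as pd

# Synthetic stand-in for yfinance output with (Price, Ticker) column levels
prices = pd.DataFrame(
    [[182.0], [179.7], [174.9]],
    index=pd.DatetimeIndex(["2022-01-03", "2022-01-04", "2022-01-05"], name="Date"),
    columns=pd.MultiIndex.from_tuples([("Close", "AAPL")], names=["Price", "Ticker"]),
)

# Synthetic posts with made-up sentiment scores
posts = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-03", "2022-01-03", "2022-01-05"]),
    "sentiment": [0.4, 0.0, -0.2],
})

# 1. Keep only the price level so both frames have single-level columns
flat = prices.copy()
flat.columns = flat.columns.get_level_values("Price")
flat = flat.reset_index()

# 2. Collapse the posts to one row per calendar day
daily = posts.groupby("date", as_index=False)["sentiment"].mean()

# 3. Left-merge keeps every trading day; days without posts get NaN sentiment
merged = flat.merge(daily, left_on="Date", right_on="date", how="left")
print(merged[["Date", "Close", "sentiment"]])
```

A left merge makes missing days visible as NaN instead of silently dropping them, which helps when checking whether the post dates and the price range overlap at all.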
                                        


