# Sentiment Analysis based Stock Price Prediction and Trading Model

This project uses Natural Language Processing (NLP) and Sentiment Analysis (SA) on Tweets to predict stock price movements. The predictions are then fed into a simple trading model to test whether they are accurate enough to be profitable.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
import seaborn as sns
import re
import string
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV,StratifiedKFold,RandomizedSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
import scrapy
import requests
from bs4 import BeautifulSoup
from IPython.display import display, HTML

In [2]:
df = pd.read_csv("hf://datasets/suchkow/twitter-sentiment-stock/tweets.csv")

## Preprocessing Stage
To avoid wasting computing power and time while processing the Tweets in the dataset, we first remove the items in each Tweet that carry little information: Stopwords, Punctuation, and symbols such as '$', '*' and '?'.

In [3]:
from nltk.corpus import stopwords
cachedwords = set(stopwords.words("english"))  # a set makes membership tests fast

In [4]:
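def clean_text(text):
    """Lowercase a Tweet and strip bracketed text, URLs, HTML tags, punctuation and digit-bearing words."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>+', '', text)                 # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation, including '$', '*', '?'
    text = re.sub(r'\n', '', text)                     # newlines
    text = re.sub(r'\w*\d\w*', '', text)               # words containing digits
    return text

def remove_stopwords(tokens):
    """Drop English stopwords from a list of tokens."""
    return [w for w in tokens if w not in cachedwords]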

In [5]:
df['Tweet'] = df['Tweet'].apply(lambda x: clean_text(x))

## Sentiment Analysis

Here we use two lexicon-based types of Sentiment Analysis: NLTK's Opinion Lexicon and the VADER Lexicon. Both are used because each Lexicon calculates Sentiment differently. Some points of clarification:
1) Typically only one type of Sentiment would be used, partly because using two introduces some degree of Multicollinearity. However, I think a broader view of the Tweets matters more than relying on what a single Lexicon judges to be positive or negative.
2) VADER is typically run on text that still includes Stopwords, but including Stopwords made the Sentiment Analysis take much longer than needed.
3) The Loughran-McDonald Lexicon was not used here, as it also slowed the code down drastically.
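
To make the VADER labelling concrete, here is a minimal sketch of the thresholding applied to a compound score. The helper name `label_from_compound` is hypothetical; the ±0.05 cut-offs are the ones used in the notebook's own labelling cell.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER-style compound score in [-1, 1] to a coarse label."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"


# Illustrative scores only, not values taken from the dataset
for score in (0.42, -0.30, 0.01):
    print(score, label_from_compound(score))
```

Scores inside the ±0.05 band are treated as neutral, so small amounts of lexicon noise do not flip a Tweet's label.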

In [6]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to C:\Users\Jason
[nltk_data]     Lam\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [7]:
df['VADER_compound'] = df['Tweet'].apply(lambda x: sia.polarity_scores(x)['compound'])
df['VADER_sentiment'] = df['VADER_compound'].apply(lambda x: 'positive' if x > 0.05 else ('negative' if x < -0.05 else 'neutral'))

In [8]:
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
df['Tweet_tokens'] = df['Tweet'].apply(lambda x: tokenizer.tokenize(x.lower()))
df['Tweet_tokens'] = df['Tweet_tokens'].apply(lambda x: remove_stopwords(x))

In [9]:
import nltk
from nltk.corpus import opinion_lexicon

nltk.download('opinion_lexicon')
positive_words = set(opinion_lexicon.positive())
negative_words = set(opinion_lexicon.negative())

[nltk_data] Downloading package opinion_lexicon to C:\Users\Jason
[nltk_data]     Lam\AppData\Roaming\nltk_data...
[nltk_data]   Package opinion_lexicon is already up-to-date!


In [10]:
def get_sentiment(token_list):
    general_pos = general_neg = 0
    
    for token in token_list:
        # NLTK Opinion Lexicon
        if token in positive_words:
            general_pos += 1
        elif token in negative_words:
            general_neg += 1
    
    # Calculate overall sentiment
    total_pos = general_pos
    total_neg = general_neg
    
    if total_pos > total_neg:
        return "positive"
    elif total_neg > total_pos:
        return "negative"
    else:
        return "neutral"

In [11]:
df["opinion_sentiment"] = df["Tweet_tokens"].apply(get_sentiment)

In [12]:
df['Date'] = pd.to_datetime(df['Date'])

In [13]:
from sklearn.preprocessing import OrdinalEncoder

# Define the order of categories
categories_order = [['negative', 'neutral', 'positive']]

# Initialize and fit the OrdinalEncoder
encoder = OrdinalEncoder(categories=categories_order)
df['opinion_sentiment'] = encoder.fit_transform(df[['opinion_sentiment']])

In [14]:
df.drop(columns='VADER_sentiment',inplace=True)

## Custom Sentiment Analysis

NLTK's Opinion Lexicon does not provide a compound Sentiment Score, so I had to construct one. Each Tweet's label is ordinally encoded (Negative = 0, Neutral = 1, Positive = 2), and the encoded values of all Tweets on a given day are averaged. This gives a daily average sentiment between 0 and 2, where 1 is neutral.
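
As a rough illustration of this averaging, the sketch below does the same thing in plain Python, using the 0/1/2 encoding that the `OrdinalEncoder` cell above produces. The function name and the sample rows are made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Same ordering as categories_order in the notebook: negative < neutral < positive
ENCODING = {"negative": 0, "neutral": 1, "positive": 2}


def daily_mean_sentiment(rows):
    """Average the encoded sentiment of all (day, label) pairs for each day."""
    by_day = defaultdict(list)
    for day, label in rows:
        by_day[day].append(ENCODING[label])
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Hypothetical Tweets, for illustration only
sample = [
    ("2015-01-07", "positive"),
    ("2015-01-07", "neutral"),
    ("2015-01-07", "negative"),
    ("2015-01-08", "positive"),
    ("2015-01-08", "positive"),
]
result = daily_mean_sentiment(sample)
print(result)
```

A daily value above 1 means positive Tweets outnumbered negative ones on that day, and a value below 1 means the opposite.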

In [15]:
# Group by date and calculate mean sentiment
general_daily_sentiment = df.groupby("Date")[["opinion_sentiment","VADER_compound"]].mean().reset_index()

In [16]:
df_aapl = df.copy()
df_aapl = df_aapl[df_aapl['Ticker'] == 'AAPL']
print(f"Original shape: {df.shape}, AAPL-only shape: {df_aapl.shape}")

Original shape: (1739993, 6), AAPL-only shape: (530186, 6)


In [17]:
specific_daily_sentiment = df_aapl.groupby("Date")[["opinion_sentiment","VADER_compound"]].mean().reset_index()

## Data Visualisation

Here, we make interactive visualisations with Plotly to make some elementary observations about daily Sentiment, first across all Tweets and then for Tweets about AAPL only.

In [19]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add VADER trace (left axis)
fig.add_trace(
    go.Scatter(
        x=general_daily_sentiment['Date'], y=general_daily_sentiment['VADER_compound'],
        name="VADER (-1 to 1)", line=dict(color='blue'),
        mode='lines+markers'
    ),
    secondary_y=False
)

# Add Custom trace (right axis)
fig.add_trace(
    go.Scatter(
        x=general_daily_sentiment['Date'], y=general_daily_sentiment['opinion_sentiment'],
        name="Opinion (0 to 2)", line=dict(color='red'),
        mode='lines+markers'
    ),
    secondary_y=True
)

# Set axis ranges
fig.update_yaxes(range=[-1, 1], secondary_y=False)
fig.update_yaxes(range=[0, 2], secondary_y=True)

# Layout customization
fig.update_layout(
    title="Dual-Scale Sentiment Comparison",
    hovermode="x unified",
    width=1200,  # Fixed width in pixels
    height=600,
    template="plotly_white"  # Clean white-background theme
)

fig.show()
# Save as a standalone interactive HTML file
fig.write_html("sentiment_plot.html")

In [20]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add VADER trace (left axis)
fig.add_trace(
    go.Scatter(
        x=specific_daily_sentiment['Date'], y=specific_daily_sentiment['VADER_compound'],
        name="VADER (-1 to 1)", line=dict(color='blue'),
        mode='lines+markers'
    ),
    secondary_y=False
)

# Add Custom trace (right axis)
fig.add_trace(
    go.Scatter(
        x=specific_daily_sentiment['Date'], y=specific_daily_sentiment['opinion_sentiment'],
        name="Opinion (0 to 2)", line=dict(color='red'),
        mode='lines+markers'
    ),
    secondary_y=True
)

# Set axis ranges
fig.update_yaxes(range=[-1, 1], secondary_y=False)
fig.update_yaxes(range=[0, 2], secondary_y=True)

# Layout customization
fig.update_layout(
    title="Dual-Scale Sentiment Comparison (AAPL Tweets Only)",
    hovermode="x unified",
    width=1200,  # Fixed width in pixels
    height=600,
    template="plotly_white"  # Clean white-background theme
)

fig.show()
# Save as a standalone interactive HTML file (separate name so the first plot is not overwritten)
fig.write_html("sentiment_plot_aapl.html")

In the two graphs above, there is a large gap in the data between the start of 2020 and late 2021. The two groups of data on either side of it (pre-2020 and post-September 2021) can therefore be used as the Training and Test sets respectively.
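
The date-based split can be sketched as follows. The boundary dates match the filters used later in the notebook (`<= '2019-12-31'` for training, `>= '2020-01-01'` for testing); the function name and sample dates are illustrative.

```python
from datetime import date

TRAIN_END = date(2019, 12, 31)   # last day of the training window
TEST_START = date(2020, 1, 1)    # first day of the test window


def split_by_date(days):
    """Split a list of dates into (train, test) without shuffling."""
    train = [d for d in days if d <= TRAIN_END]
    test = [d for d in days if d >= TEST_START]
    return train, test


# Hypothetical dates on either side of the gap
days = [date(2019, 12, 30), date(2019, 12, 31), date(2021, 9, 30)]
train, test = split_by_date(days)
print(len(train), len(test))
```

Splitting on time rather than at random keeps every test observation strictly after the training data, which matters for a forecasting task.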

## Trading Model

### Importing Data and Using Sentiment

Here, we look at two types of Sentiment: General Sentiment, which is the overall sentiment across all Tweets on the day, and Specific Sentiment, which is based only on Tweets that specifically mention AAPL.

In [23]:
import yfinance as yf
# Download AAPL stock data
aapl = yf.download("AAPL", start="2015-01-01", end="2022-12-31")
aapl = aapl[['Close']].reset_index()  # Keep only 'Date' and 'Close'
aapl.columns = ['Date', 'aapl_close']  # Rename columns


YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed


In [24]:
# Restrict the sentiment data to the training window (2015-2019)
train_general_sentiment = general_daily_sentiment[
    (general_daily_sentiment['Date'] >= '2015-01-01') & 
    (general_daily_sentiment['Date'] <= '2019-12-31')
]

train_specific_sentiment = specific_daily_sentiment[
    (specific_daily_sentiment['Date'] >= '2015-01-01') & 
    (specific_daily_sentiment['Date'] <= '2019-12-31')
]

In [27]:
# Merge on date (ensure both have datetime dtype)
merged_general_sentiment = pd.merge(train_general_sentiment, aapl, on='Date', how='left')

In [28]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add AAPL closing price (left axis)
fig.add_trace(
    go.Scatter(
        x=merged_general_sentiment['Date'], y=merged_general_sentiment['aapl_close'],
        name="AAPL Close Price", line=dict(color='green'),
        mode='lines'
    ),
    secondary_y=False
)

# Add VADER sentiment (right axis, upper half)
fig.add_trace(
    go.Scatter(
        x=merged_general_sentiment['Date'], y=merged_general_sentiment['VADER_compound'],
        name="VADER (-1 to 1)", line=dict(color='blue'),
        mode='lines+markers'
    ),
    secondary_y=True
)

# Axis customization
fig.update_yaxes(
    title_text="AAPL Close Price ($)", 
    secondary_y=False
)
fig.update_yaxes(
    title_text="Sentiment Score", 
    range=[-1, 1],  # Accommodates both sentiment scales
    secondary_y=True
)

# Layout adjustments
fig.update_layout(
    title="AAPL Stock Price vs. Sentiment (2015-2019)",
    hovermode="x unified",
    width=1200,
    height=600,
    template="plotly_white"
)

fig.show()
# Save as HTML
fig.write_html("aapl_sentiment_comparison.html")


The behavior of DatetimeProperties.to_pydatetime is deprecated, in a future version this will return a Series containing python datetime objects instead of an ndarray. To retain the old behavior, call `np.array` on the result



In [32]:
# Merge on date (ensure both have datetime dtype)
merged_specific_sentiment = pd.merge(train_specific_sentiment, aapl, on='Date', how='left')

In [33]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add AAPL closing price (left axis)
fig.add_trace(
    go.Scatter(
        x=merged_specific_sentiment['Date'], y=merged_specific_sentiment['aapl_close'],
        name="AAPL Close Price", line=dict(color='green'),
        mode='lines'
    ),
    secondary_y=False
)

# Add VADER sentiment (right axis, upper half)
fig.add_trace(
    go.Scatter(
        x=merged_specific_sentiment['Date'], y=merged_specific_sentiment['VADER_compound'],
        name="VADER (-1 to 1)", line=dict(color='blue'),
        mode='lines+markers'
    ),
    secondary_y=True
)

# Axis customization
fig.update_yaxes(
    title_text="AAPL Close Price ($)", 
    secondary_y=False
)
fig.update_yaxes(
    title_text="Sentiment Score", 
    range=[-1, 1],  # Accommodates both sentiment scales
    secondary_y=True
)

# Layout adjustments
fig.update_layout(
    title="AAPL Stock Price vs. Sentiment (2015-2019)",
    hovermode="x unified",
    width=1200,
    height=600,
    template="plotly_white"
)

fig.show()
# Save as HTML
fig.write_html("aapl_sentiment_comparison.html")





From the two graphs above, we can see that Sentiment and the closing price are not closely connected, particularly when we look at the overall change between 2015 and 2020. As such, we need to approach the problem from a different angle.

## Custom Analysis

Here, instead of looking at the price itself, I examine whether the change between consecutive closing prices is related to Sentiment on the day.
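
For reference, the percentage change used here is `(today - yesterday) / yesterday * 100`. The sketch below computes it in plain Python; the function name is hypothetical, and the sample closes are the first three `aapl_close` values shown in the notebook's `head()` output.

```python
def pct_changes(closes):
    """Percentage change between consecutive closes, in percent units."""
    return [
        (today - yesterday) / yesterday * 100
        for yesterday, today in zip(closes, closes[1:])
    ]


closes = [23.910433, 24.829124, 24.855761]
changes = pct_changes(closes)
print(changes)
```

Because the result is in percent units, a value of 1.5 means +1.5%, which needs to be divided by 100 before it is compounded as a return.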

In [44]:
merged_general_sentiment['daily_change'] = merged_general_sentiment['aapl_close'].diff()  # Today's price - yesterday's price

In [45]:
merged_general_sentiment.head()

| | Date | opinion_sentiment | VADER_compound | aapl_close | daily_change | pct_change |
|---|---|---|---|---|---|---|
| 6 | 2015-01-07 | 1.121951 | 0.072085 | 23.910433 | NaN | 1.402221 |
| 7 | 2015-01-08 | 1.132456 | 0.119754 | 24.829124 | 0.918692 | 3.842221 |
| 8 | 2015-01-09 | 1.119143 | 0.091736 | 24.855761 | 0.026636 | 0.107278 |
| 12 | 2015-01-13 | 1.071846 | 0.090243 | 24.458536 | -0.397224 | -1.598118 |
| 13 | 2015-01-14 | 0.997093 | 0.065653 | 24.365341 | -0.093195 | -0.381032 |


In [46]:
merged_general_sentiment['pct_change'] = merged_general_sentiment['aapl_close'].pct_change() * 100  # Multiply by 100 for %

In [47]:
merged_general_sentiment.dropna(inplace=True)

In [48]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add AAPL daily % change (left axis)
fig.add_trace(
    go.Scatter(
        x=merged_general_sentiment['Date'], y=merged_general_sentiment['pct_change'],
        name="AAPL Close % Change", line=dict(color='green'),
        mode='lines'
    ),
    secondary_y=False
)

# Add VADER sentiment (right axis, upper half)
fig.add_trace(
    go.Scatter(
        x=merged_general_sentiment['Date'], y=merged_general_sentiment['VADER_compound'],
        name="VADER (-1 to 1)", line=dict(color='blue'),
        mode='lines+markers'
    ),
    secondary_y=True
)

# Axis customization
fig.update_yaxes(
    title_text="AAPL Close % Change", 
    secondary_y=False
)
fig.update_yaxes(
    title_text="Sentiment Score", 
    secondary_y=True
)

# Layout adjustments
fig.update_layout(
    title="AAPL Daily % Change vs. General Sentiment (2015-2019)",
    hovermode="x unified",
    width=1200,
    height=600,
    template="plotly_white"
)

fig.show()
# Save as HTML
fig.write_html("aapl_sentiment_comparison.html")





In [49]:
merged_specific_sentiment['daily_change'] = merged_specific_sentiment['aapl_close'].diff()  # Today's price - yesterday's price

In [50]:
merged_specific_sentiment['pct_change'] = merged_specific_sentiment['aapl_close'].pct_change() * 100  # Multiply by 100 for %


The default fill_method='pad' in Series.pct_change is deprecated and will be removed in a future version. Either fill in any non-leading NA values prior to calling pct_change or specify 'fill_method=None' to not fill NA values.



In [51]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add AAPL daily % change (left axis)
fig.add_trace(
    go.Scatter(
        x=merged_specific_sentiment['Date'], y=merged_specific_sentiment['pct_change'],
        name="AAPL Close % Change", line=dict(color='green'),
        mode='lines'
    ),
    secondary_y=False
)

# Add VADER sentiment (right axis, upper half)
fig.add_trace(
    go.Scatter(
        x=merged_specific_sentiment['Date'], y=merged_specific_sentiment['VADER_compound'],
        name="VADER (-1 to 1)", line=dict(color='blue'),
        mode='lines+markers'
    ),
    secondary_y=True
)

# Axis customization
fig.update_yaxes(
    title_text="AAPL Close % Change", 
    secondary_y=False
)
fig.update_yaxes(
    title_text="Sentiment Score", 
    secondary_y=True
)

# Layout adjustments
fig.update_layout(
    title="AAPL Daily % Change vs. AAPL-Specific Sentiment (2015-2019)",
    hovermode="x unified",
    width=1200,
    height=600,
    template="plotly_white"
)

fig.show()
# Save as HTML
fig.write_html("aapl_sentiment_comparison.html")





In the graphs above, the daily percentage change appears to line up with Sentiment better than the raw price did, which leads me to believe that I may be on the right track.

## Data Manipulation and Model Testing

The following cells check whether the prediction model works, that is, whether it can correctly predict the stock price's movements on the Test set.

In [69]:
test_general_sentiment = general_daily_sentiment[
    general_daily_sentiment['Date'] >= '2020-01-01'
]

test_specific_sentiment = specific_daily_sentiment[
    specific_daily_sentiment['Date'] >= '2020-01-01'
]

In [70]:
aapl_new = yf.download("AAPL", start="2015-01-01", end="2022-12-31")
aapl_new = aapl_new[['Close']].reset_index()  # Keep only 'Date' and 'Close'
aapl_new.columns = ['Date', 'aapl_close']  # Rename columns


YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed


In [71]:
# Merge on date (ensure both have datetime dtype)
merged_test_specific = pd.merge(aapl_new, test_specific_sentiment, on='Date', how='right')

In [72]:
merged_test_general = pd.merge(aapl_new, test_general_sentiment, on='Date', how='right')

In [73]:
merged_test_specific.dropna(inplace=True)
merged_test_general.dropna(inplace=True)

In [74]:
merged_test_specific

| | Date | aapl_close | opinion_sentiment | VADER_compound |
|---|---|---|---|---|
| 0 | 2021-09-30 | 138.524841 | 0.800000 | 0.126760 |
| 1 | 2021-10-01 | 139.650650 | 1.000000 | 0.111117 |
| 3 | 2021-10-04 | 136.214478 | 1.000000 | -0.043525 |
| 4 | 2021-10-05 | 138.143036 | 0.666667 | 0.263533 |
| 5 | 2021-10-06 | 139.014297 | 1.666667 | 0.262767 |
| ... | ... | ... | ... | ... |
| 352 | 2022-09-23 | 148.092270 | 0.600000 | 0.010440 |
| 355 | 2022-09-26 | 148.427017 | 1.272727 | 0.278464 |
| 356 | 2022-09-27 | 149.401627 | 1.333333 | 0.278700 |
| 357 | 2022-09-28 | 147.511429 | 0.786885 | -0.064649 |


In [75]:
merged_test_general['daily_change'] = merged_test_general['aapl_close'].diff()  # Today's price - yesterday's price
merged_test_general['pct_change'] = merged_test_general['aapl_close'].pct_change() * 100  # Multiply by 100 for %

In [76]:
merged_test_specific['daily_change'] = merged_test_specific['aapl_close'].diff()  # Today's price - yesterday's price
merged_test_specific['pct_change'] = merged_test_specific['aapl_close'].pct_change() * 100  # Multiply by 100 for %

In [77]:
merged_test_specific.dropna(inplace=True)
merged_test_general.dropna(inplace=True)

In [78]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()

X_train_general = merged_general_sentiment[['opinion_sentiment','VADER_compound']]
y_train_general = merged_general_sentiment['pct_change']

X_test_general = merged_test_general[['opinion_sentiment','VADER_compound']]
y_test_general = merged_test_general['pct_change']

lin_reg.fit(X_train_general, y_train_general)

In [79]:
y_pred_general = lin_reg.predict(X_test_general)

In [80]:
merged_test_general['Predictions - Linear Regression'] = y_pred_general

In [81]:
merged_test_general

| | Date | aapl_close | opinion_sentiment | VADER_compound | daily_change | pct_change | Predictions - Linear Regression |
|---|---|---|---|---|---|---|---|
| 1 | 2021-10-01 | 139.650650 | 1.090000 | 0.190960 | 1.125809 | 0.812713 | 1.839640 |
| 4 | 2021-10-04 | 136.214478 | 1.016667 | 0.103523 | -3.436172 | -2.460549 | 0.133871 |
| 5 | 2021-10-05 | 138.143036 | 1.067416 | 0.095519 | 1.928558 | 1.415825 | 0.176352 |
| 6 | 2021-10-06 | 139.014297 | 1.320988 | 0.200360 | 0.871262 | 0.630695 | 2.794185 |
| 7 | 2021-10-07 | 140.277191 | 1.147059 | 0.174430 | 1.262894 | 0.908463 | 1.762311 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 358 | 2022-09-23 | 148.092270 | 1.085106 | 0.119922 | -2.274124 | -1.512389 | 0.642826 |
| 361 | 2022-09-26 | 148.427017 | 1.182927 | 0.158670 | 0.334747 | 0.226040 | 1.624512 |
| 362 | 2022-09-27 | 149.401627 | 1.261364 | 0.182922 | 0.974609 | 0.656625 | 2.298444 |
| 363 | 2022-09-28 | 147.511429 | 1.008065 | 0.043586 | -1.890198 | -1.265179 | -0.891372 |


In [82]:
print('MAE:', metrics.mean_absolute_error(y_test_general, y_pred_general))
print('MSE:', metrics.mean_squared_error(y_test_general, y_pred_general))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test_general, y_pred_general)))

MAE: 1.9536446003154782
MSE: 5.875250483914026
RMSE: 2.4238915990435763


In [83]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Calculate global min/max for both datasets
global_min = min(
    merged_test_general['pct_change'].min(),
    merged_test_general['Predictions - Linear Regression'].min()
)
global_max = max(
    merged_test_general['pct_change'].max(),
    merged_test_general['Predictions - Linear Regression'].max()
)

# Add padding to avoid edge values touching the axis
padding = 0.1 * (global_max - global_min)  # 10% padding
y_range = [global_min - padding, global_max + padding]

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add ACTUAL percentage change (left axis)
fig.add_trace(
    go.Scatter(
        x=merged_test_general['Date'], 
        y=merged_test_general['pct_change'],
        name="Actual Percentage Change", 
        line=dict(color='green'),
        mode='lines'
    ),
    secondary_y=False
)

# Add PREDICTED percentage change (right axis)
fig.add_trace(
    go.Scatter(
        x=merged_test_general['Date'], 
        y=merged_test_general['Predictions - Linear Regression'],
        name="Predictions", 
        line=dict(color='blue'),
        mode='lines+markers'
    ),
    secondary_y=True
)

# Set identical y-axis ranges
fig.update_yaxes(
    title_text="Percentage Change", 
    range=y_range,
    secondary_y=False
)
fig.update_yaxes(
    title_text="Percentage Change", 
    range=y_range,
    secondary_y=True
)

# Layout adjustments
fig.update_layout(
    title="Predicted vs Actual Percentage Change",
    hovermode="x unified",
    width=1200,
    height=600,
    template="plotly_white"
)

fig.show()
fig.write_html("aapl_sentiment_comparison.html")





In [84]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()

X_train_specific = merged_specific_sentiment[['opinion_sentiment','VADER_compound']]
y_train_specific = merged_specific_sentiment['pct_change']

X_test_specific = merged_test_specific[['opinion_sentiment','VADER_compound']]
y_test_specific = merged_test_specific['pct_change']

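# NOTE: the model is fit on the general-sentiment training data here, so X_train_specific
# and y_train_specific are unused. Fitting on them would first require dropping the rows
# with missing values from merged_specific_sentiment.
lin_reg.fit(X_train_general, y_train_general)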

In [85]:
y_pred_specific = lin_reg.predict(X_test_specific)
merged_test_specific['Predictions - Linear Regression'] = y_pred_specific

In [86]:
print('MAE:', metrics.mean_absolute_error(y_test_specific, y_pred_specific))
print('MSE:', metrics.mean_squared_error(y_test_specific, y_pred_specific))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test_specific, y_pred_specific)))

MAE: 3.116621142438219
MSE: 17.904732395905427
RMSE: 4.2313983972093


In [87]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Calculate global min/max for both datasets
global_min = min(
    merged_test_specific['pct_change'].min(),
    merged_test_specific['Predictions - Linear Regression'].min()
)
global_max = max(
    merged_test_specific['pct_change'].max(),
    merged_test_specific['Predictions - Linear Regression'].max()
)

# Add padding to avoid edge values touching the axis
padding = 0.1 * (global_max - global_min)  # 10% padding
y_range = [global_min - padding, global_max + padding]

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add ACTUAL percentage change (left axis)
fig.add_trace(
    go.Scatter(
        x=merged_test_specific['Date'], 
        y=merged_test_specific['pct_change'],
        name="Actual Percentage Change", 
        line=dict(color='green'),
        mode='lines'
    ),
    secondary_y=False
)

# Add PREDICTED percentage change (right axis)
fig.add_trace(
    go.Scatter(
        x=merged_test_specific['Date'], 
        y=merged_test_specific['Predictions - Linear Regression'],
        name="Predictions", 
        line=dict(color='blue'),
        mode='lines+markers'
    ),
    secondary_y=True
)

# Set identical y-axis ranges
fig.update_yaxes(
    title_text="Percentage Change", 
    range=y_range,
    secondary_y=False
)
fig.update_yaxes(
    title_text="Percentage Change", 
    range=y_range,
    secondary_y=True
)

# Layout adjustments
fig.update_layout(
    title="Predicted vs Actual Percentage Change",
    hovermode="x unified",
    width=1200,
    height=600,
    template="plotly_white"
)

fig.show()
fig.write_html("aapl_sentiment_comparison.html")





In [92]:
from sklearn.linear_model import Ridge
model = Ridge(alpha=0.1).fit(X_train_general, y_train_general)

In [93]:
y_pred_ridge = model.predict(X_test_general)
merged_test_general['Predictions - Ridge Regression'] = y_pred_ridge

In [94]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Calculate global min/max for both datasets
global_min = min(
    merged_test_general['pct_change'].min(),
    merged_test_general['Predictions - Ridge Regression'].min()
)
global_max = max(
    merged_test_general['pct_change'].max(),
    merged_test_general['Predictions - Ridge Regression'].max()
)

# Add padding to avoid edge values touching the axis
padding = 0.1 * (global_max - global_min)  # 10% padding
y_range = [global_min - padding, global_max + padding]

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add ACTUAL percentage change (left axis)
fig.add_trace(
    go.Scatter(
        x=merged_test_general['Date'], 
        y=merged_test_general['pct_change'],
        name="Actual Percentage Change", 
        line=dict(color='green'),
        mode='lines'
    ),
    secondary_y=False
)

# Add PREDICTED percentage change (right axis)
fig.add_trace(
    go.Scatter(
        x=merged_test_general['Date'], 
        y=merged_test_general['Predictions - Ridge Regression'],
        name="Predictions", 
        line=dict(color='blue'),
        mode='lines+markers'
    ),
    secondary_y=True
)

# Set identical y-axis ranges
fig.update_yaxes(
    title_text="Percentage Change", 
    range=y_range,
    secondary_y=False
)
fig.update_yaxes(
    title_text="Percentage Change", 
    range=y_range,
    secondary_y=True
)

# Layout adjustments
fig.update_layout(
    title="Predicted vs Actual Percentage Change",
    hovermode="x unified",
    width=1200,
    height=600,
    template="plotly_white"
)

fig.show()
fig.write_html("aapl_sentiment_comparison.html")





In [95]:
print('MAE:', metrics.mean_absolute_error(y_test_general, y_pred_ridge))
print('MSE:', metrics.mean_squared_error(y_test_general, y_pred_ridge))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test_general, y_pred_ridge)))

MAE: 1.87214639959982
MSE: 5.435668436335621
RMSE: 2.3314520017224503


## Prediction Results Summary

From the last few predictions, we can see that the model is far from ideal: there are still points of large over- and underestimation. However, for the sake of the experiment, we continue with these predictions.

In [96]:
# Predict returns for the test set
df_empty = pd.DataFrame()

df_empty['predicted_return'] = model.predict(X_test_general)

# Trading rules
# predicted_return is in percent units, so 0.00005 means 0.00005%, not 0.5%
df_empty['signal'] = np.where(
    df_empty['predicted_return'] > 0.00005,  # Buy threshold
    1,  # Buy signal
    np.where(
        df_empty['predicted_return'] < -0.00005,  # Sell threshold
        -1,  # Sell signal
        0    # Hold
    )
)

# Shift signals to avoid look-ahead bias
df_empty['signal'] = df_empty['signal'].shift(1)
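
The thresholding and one-day shift above can be sketched without pandas as follows. The function name, threshold and sample predictions are illustrative assumptions; `None` plays the role of the `NaN` that `shift(1)` leaves in the first row.

```python
def signals_from_predictions(predicted, threshold):
    """Turn predicted returns into buy (1), sell (-1) or hold (0) signals,
    then delay them by one day so each position uses only earlier information."""
    raw = [
        1 if p > threshold else -1 if p < -threshold else 0
        for p in predicted
    ]
    # Day t trades on the signal computed for day t-1; day 0 has no signal
    return [None] + raw[:-1]


# Hypothetical predicted returns in percent units
shifted = signals_from_predictions([0.8, -0.3, 0.0, -1.2], threshold=0.5)
print(shifted)
```

Without the shift, each day's return would be multiplied by a signal derived from that same day's sentiment, which could not have been traded on in advance.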

In [97]:
merged_test_general.reset_index(inplace=True)

In [98]:
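# Calculate strategy returns
# Note: pct_change is in percent units (it was multiplied by 100 earlier), so strategy_return is too
df_empty['strategy_return'] = df_empty['signal'] * merged_test_general['pct_change']

# Cumulative returns
# Note: 1 + pct_change compounds percent values as if they were fractions
# (1.5 is treated as +150%, not +1.5%), which inflates the totals printed below
df_empty['cumulative_market'] = (1 + merged_test_general['pct_change']).cumprod()
df_empty['cumulative_strategy'] = (1 + df_empty['strategy_return']).cumprod()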

# Performance metrics
total_return = df_empty['cumulative_strategy'].iloc[-1] - 1
annualized_return = (1 + total_return) ** (252/len(df_empty)) - 1
sharpe_ratio = np.sqrt(252) * df_empty['strategy_return'].mean() / df_empty['strategy_return'].std()

print(f"Total Strategy Return: {total_return:.2%}")
print(f"Annualized Return: {annualized_return:.2%}")
print(f"Sharpe Ratio: {sharpe_ratio:.2f}")

Total Strategy Return: 1682038804872685944832.00%
Annualized Return: 2006471369035138727936.00%
Sharpe Ratio: -0.12
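
The totals printed above are far outside any plausible range. One likely contributor, given that `pct_change` was multiplied by 100 when it was created, is that `(1 + pct_change).cumprod()` compounds percentage points as though they were fractions. The sketch below is an assumption about the intended calculation, not a rerun of the notebook: it converts percent to fractions before compounding and contrasts that with the unconverted version. The function names and sample numbers are illustrative.

```python
def cumulative_return(pct_changes_percent, signals):
    """Compound signal-weighted daily returns that are given in percent units."""
    growth = 1.0
    for pct, signal in zip(pct_changes_percent, signals):
        daily_fraction = pct / 100.0          # e.g. 1.5 (%) -> 0.015
        growth *= 1.0 + signal * daily_fraction
    return growth - 1.0


def cumulative_return_unconverted(pct_changes_percent, signals):
    """Same loop, but compounding percent values directly as fractions."""
    growth = 1.0
    for pct, signal in zip(pct_changes_percent, signals):
        growth *= 1.0 + signal * pct
    return growth - 1.0


# Hypothetical daily % changes and positions (1 = long, -1 = short, 0 = flat)
pct = [1.0, -2.0, 0.5]
signals = [1, -1, 0]
total = cumulative_return(pct, signals)
naive = cumulative_return_unconverted(pct, signals)
print(f"converted: {total:.2%}, unconverted: {naive:.2%}")
```

Even over three days the unconverted version is larger by two orders of magnitude, and the difference grows multiplicatively over a full test period.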


In [99]:
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=df_empty.index, y=df_empty['cumulative_strategy'],
    name='Strategy',
    line=dict(color='green')
))
fig.update_layout(
    title="Strategy Cumulative Returns (Fixed)",
    yaxis_title="Cumulative Returns",
    yaxis_type="log"  # Log scale to tame large values
)
fig.show()

In [100]:
# Assuming you've already calculated daily strategy returns:
df_empty['cumulative_strategy'] = (1 + df_empty['strategy_return']).cumprod()

# Total return (from inception to end)
total_strategy_return = df_empty['cumulative_strategy'].iloc[-1] - 1  # As decimal (e.g., 1.42 for 142%)
df_empty['cumulative_market'] = (1 + merged_test_general['pct_change']).cumprod()
total_market_return = df_empty['cumulative_market'].iloc[-1] - 1
alpha = total_strategy_return - total_market_return
# Number of trading days in the test period (typically ~252/year)
n_days = len(df_empty)
n_years = n_days / 252

annualized_strategy = (1 + total_strategy_return) ** (1/n_years) - 1
annualized_market = (1 + total_market_return) ** (1/n_years) - 1


invalid value encountered in scalar power
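
A warning like the one above is typically what NumPy emits when a negative number is raised to a fractional power, which happens here whenever `1 + total_return` is negative. The helper below is a hypothetical sketch of an annualisation step that guards against that case; the 252-day year is the same convention used in the cell above.

```python
def annualize(total_return, n_days, periods_per_year=252):
    """Annualise a total return earned over n_days trading days.

    Returns None when 1 + total_return is negative, because a fractional
    power of a negative growth factor is not a real number.
    """
    growth = 1.0 + total_return
    if growth < 0:
        return None
    return growth ** (periods_per_year / n_days) - 1.0


# Illustrative values: +21% over two trading years, and an impossible -150%
two_year = annualize(0.21, n_days=504)
undefined = annualize(-1.5, n_days=252)
print(two_year, undefined)
```

Returning `None` makes the undefined case explicit rather than letting a `nan` propagate silently into later comparisons.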



## Conclusion

From the above graph it is evident that this trading model is not realistic and requires further engineering. Its failure to make proper predictions may be attributed to the following factors:
1) Insufficient quantity/quality of data. The dataset may have contained unfiltered spam, and the resulting sentiment of each Tweet was extremely close to 0 most of the time.
2) No clear signal for the Trading Model, because a change in Sentiment did not always imply a change in Stock Price. Additionally, the predicted Stock Price changes were not ideal.

If I were to attempt a similar project again, I would implement the following where possible:
1) Better and more varied data (for example, using financial newspaper headlines/content as a reference to check whether the sentiment on various social media websites is accurate).
2) Computing Correlation Coefficients between the Stock Price and Sentiment first, to ascertain whether a connection exists.
3) Building a better trading model with more advanced logic.
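
As a starting point for the correlation check in point 2, a Pearson correlation coefficient can be computed in a few lines of plain Python. This is a minimal sketch with a hypothetical function name and made-up sample values; it has not been run on the notebook's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily sentiment scores and percentage changes
sentiment = [0.10, 0.05, 0.20, -0.05, 0.15]
returns = [0.8, -0.2, 1.1, -0.9, 0.3]
print(round(pearson(sentiment, returns), 3))
```

A coefficient near 0 on the training period would suggest that a linear model on same-day sentiment has little to work with, before any trading logic is built on top of it.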