# **5450 Group Project**
# **Investor Toolkit: Asset Clustering, Sentiment-Based Trading Bot, and Price Action Model for Short Term Scalping**

Davit Barseghyan, Qixiu Quan, Jeff Grant, Eng Wei Jie Joseph


# Introduction


In this project, we aim to create a series of tools that an investor can use to (1) build a weighted portfolio of stocks optimized to maximize returns based on daily performance, (2) predict market movements using sentiment analysis of tweets about various stocks, and (3) "scalp" stocks by predicting cyclic dips and peaks in prices in order to buy at low prices and sell at high prices.

This project is divided into three major parts:


1.   Asset Clustering: Grouping S&P 500 stocks based on annualized returns and volatility to form optimized investment portfolios.
2.   Sentiment-Based Trading Bot: Building an autonomous trading bot driven by sentiment analysis of stock-related tweets.
3.   Price Action Model: Predicting dips in the market for short-term scalping.






# **Part 1: Asset Clustering**
In this part of our project we use the Yahoo Finance API to retrieve historical adjusted closing price data for stocks in the S&P 500 over the five-year period from 2014 to 2018. (We chose this time frame to avoid uncharacteristic turbulence during and immediately after the Covid-19 pandemic.)  We then group stocks based on their annualized returns and volatility.  By analyzing these clusters we can create a portfolio that, based on historical returns, is expected to offer both sector diversification and maximized returns.



Phase 1: Data Acquisition and Cleaning
We retrieved historical adjusted closing price data for stocks in the S&P 500 using the Yahoo Finance API.

Import required libraries



In [1]:
pip install --upgrade yfinance

Collecting yfinance
  Downloading yfinance-0.2.51-py2.py3-none-any.whl.metadata (5.5 kB)
Downloading yfinance-0.2.51-py2.py3-none-any.whl (104 kB)
Installing collected packages: yfinance
  Attempting uninstall: yfinance
    Found existing installation: yfinance 0.2.50
    Uninstalling yfinance-0.2.50:
      Successfully uninstalled yfinance-0.2.50
Successfully installed yfinance-0.2.51


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import yfinance as yf
import logging

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

Create list of stock tickers for S&P 500 companies by collecting and cleaning data from Wikipedia

In [3]:
sp500_url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

# Read in the url and scrape ticker data
data_table = pd.read_html(sp500_url)
tickers = data_table[0]['Symbol'].values.tolist()
tickers = [s.replace('\n', '') for s in tickers]
tickers = [s.replace('.', '-') for s in tickers]
tickers = [s.replace(' ', '') for s in tickers]

print(tickers)

['MMM', 'AOS', 'ABT', 'ABBV', 'ACN', 'ADBE', 'AMD', 'AES', 'AFL', 'A', 'APD', 'ABNB', 'AKAM', 'ALB', 'ARE', 'ALGN', 'ALLE', 'LNT', 'ALL', 'GOOGL', 'GOOG', 'MO', 'AMZN', 'AMCR', 'AEE', 'AEP', 'AXP', 'AIG', 'AMT', 'AWK', 'AMP', 'AME', 'AMGN', 'APH', 'ADI', 'ANSS', 'AON', 'APA', 'APO', 'AAPL', 'AMAT', 'APTV', 'ACGL', 'ADM', 'ANET', 'AJG', 'AIZ', 'T', 'ATO', 'ADSK', 'ADP', 'AZO', 'AVB', 'AVY', 'AXON', 'BKR', 'BALL', 'BAC', 'BAX', 'BDX', 'BRK-B', 'BBY', 'TECH', 'BIIB', 'BLK', 'BX', 'BK', 'BA', 'BKNG', 'BWA', 'BSX', 'BMY', 'AVGO', 'BR', 'BRO', 'BF-B', 'BLDR', 'BG', 'BXP', 'CHRW', 'CDNS', 'CZR', 'CPT', 'CPB', 'COF', 'CAH', 'KMX', 'CCL', 'CARR', 'CAT', 'CBOE', 'CBRE', 'CDW', 'CE', 'COR', 'CNC', 'CNP', 'CF', 'CRL', 'SCHW', 'CHTR', 'CVX', 'CMG', 'CB', 'CHD', 'CI', 'CINF', 'CTAS', 'CSCO', 'C', 'CFG', 'CLX', 'CME', 'CMS', 'KO', 'CTSH', 'CL', 'CMCSA', 'CAG', 'COP', 'ED', 'STZ', 'CEG', 'COO', 'CPRT', 'GLW', 'CPAY', 'CTVA', 'CSGP', 'COST', 'CTRA', 'CRWD', 'CCI', 'CSX', 'CMI', 'CVS', 'DHR', 'DRI', 'DV

## Data Collection and Initial Analysis
Download the adjusted closing price for all S&P 500 stocks for each day in the five year window between Jan 1, 2014 and December 31, 2018, inclusive.

In [4]:
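# Download prices from Yahoo Finance
# IMPORTANT NOTE: Dates are set for pre-pandemic to avoid any unexplained turbulence.
# auto_adjust=False keeps the separate 'Adj Close' column. Newer yfinance releases
# appear to adjust prices by default, in which case 'Adj Close' is only present
# for the tickers whose download failed.
data = yf.download(tickers, start='2014-01-01', end='2019-01-01', auto_adjust=False)['Adj Close']
data.head()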

[*********************100%***********************]  503 of 503 completed
ERROR:yfinance:
17 Failed downloads:
ERROR:yfinance:['KVUE', 'GEV', 'GEHC', 'SW', 'CEG', 'FOX', 'VLTO', 'DOW', 'FOXA', 'UBER', 'PLTR', 'CARR', 'ABNB', 'CTVA', 'CRWD', 'OTIS', 'SOLV']: YFPricesMissingError('$%ticker%: possibly delisted; no price data found  (1d 2014-01-01 -> 2019-01-01) (Yahoo error = "Data doesn\'t exist for startDate = 1388552400, endDate = 1546318800")')


Ticker,ABNB,CARR,CEG,CRWD,CTVA,DOW,FOX,FOXA,GEHC,GEV,KVUE,OTIS,PLTR,SOLV,SW,UBER,VLTO
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
2014-01-02,,,,,,,,,,,,,,,,,
2014-01-03,,,,,,,,,,,,,,,,,
2014-01-06,,,,,,,,,,,,,,,,,
2014-01-07,,,,,,,,,,,,,,,,,
2014-01-08,,,,,,,,,,,,,,,,,


In our exploratory data analysis, we see that our stock prices dataset has 1258 rows (the number of trading days in the five-year period) and 485 columns. Data for some of the tickers was unavailable for our chosen period, as some stocks in the current S&P 500 list were not yet listed, or were listed or delisted partway through this time frame.

We also noted that some tickers had null values for each date, so we decided to exclude these from our dataset.
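To make the exclusion rule concrete, here is a minimal sketch on a made-up price table (the tickers `OLD`, `NEW`, and `AAA` are hypothetical). It shows that `how='all'` removes only columns that are entirely null, while a column with a few missing days is kept.

```python
import numpy as np
import pandas as pd

# Toy price table: "NEW" has no data in the window, "OLD" has one missing day.
prices = pd.DataFrame(
    {
        "OLD": [10.0, np.nan, 10.5],
        "NEW": [np.nan, np.nan, np.nan],
        "AAA": [20.0, 20.2, 20.1],
    },
    index=pd.to_datetime(["2014-01-02", "2014-01-03", "2014-01-06"]),
)

# how='all' drops a column only when every value in it is missing.
cleaned = prices.dropna(axis=1, how="all")

print(cleaned.columns.tolist())
print(cleaned.isnull().sum())
```

Partially missing columns such as `OLD` survive this step, so any remaining gaps still need to be handled when returns are computed.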

In [5]:
# Perform some EDA and learn about the data
print(data.info())
print(data.describe())
print(data.shape)
print(data.isnull().sum())

# isnull().sum() found some tickers to be completely null, so we drop those assets and check again
data = data.dropna(axis=1, how='all')
print(data.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1258 entries, 2014-01-02 to 2018-12-31
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ABNB    0 non-null      float64
 1   CARR    0 non-null      float64
 2   CEG     0 non-null      float64
 3   CRWD    0 non-null      float64
 4   CTVA    0 non-null      float64
 5   DOW     0 non-null      float64
 6   FOX     0 non-null      float64
 7   FOXA    0 non-null      float64
 8   GEHC    0 non-null      float64
 9   GEV     0 non-null      float64
 10  KVUE    0 non-null      float64
 11  OTIS    0 non-null      float64
 12  PLTR    0 non-null      float64
 13  SOLV    0 non-null      float64
 14  SW      0 non-null      float64
 15  UBER    0 non-null      float64
 16  VLTO    0 non-null      float64
dtypes: float64(17)
memory usage: 176.9 KB
None
Ticker  ABNB  CARR  CEG  CRWD  CTVA  DOW  FOX  FOXA  GEHC  GEV  KVUE  OTIS  \
count    0.0   0.0  0.0   0.0   0.0  0.0  0.0 

We now use our cleaned data to create a table that matches stock tickers with their annualized returns and volatility.  In the process of doing this, we look at the percent change of the adjusted closing price, effectively normalizing the data.
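As a small worked example of the annualization, the sketch below uses a hypothetical four-day price series (not taken from our dataset). It also illustrates why percent changes act as a normalization: multiplying every price by a constant leaves the daily returns unchanged.

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252  # same convention as in this notebook

# Hypothetical adjusted closing prices for a single stock.
prices = pd.Series([100.0, 101.0, 99.99, 102.0])

# Daily percent change; the first value is NaN and is dropped.
daily = prices.pct_change().dropna()

annualized_return = daily.mean() * TRADING_DAYS
annualized_vol = daily.std() * np.sqrt(TRADING_DAYS)

# A stock trading at 10x the price has identical daily returns.
scaled = (prices * 10).pct_change().dropna()
same_returns = np.allclose(daily, scaled)

print(annualized_return, annualized_vol, same_returns)
```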

In [7]:
# Create a new table with rows for assets, and columns for returns and volatility
daily_returns = data.pct_change().dropna()
annual_trading_days = 252
annualized_returns = daily_returns.mean() * annual_trading_days
annualized_volatility = daily_returns.std() * np.sqrt(annual_trading_days)

# Create the returns DataFrame
returns = pd.DataFrame({
    'Ticker': data.columns,
    'Returns': annualized_returns.values,  # Use .values to avoid index alignment issues
    'Volatility': annualized_volatility.values
})

print(data['AAPL'].head())  # First few rows
print(data['AAPL'].tail())
daily_returns.describe()
returns.head()

KeyError: 'AAPL'

## Data Visualization
Now let's take a deeper look at the dataset that we are dealing with, and see if we can draw any interesting insights from it.

In [None]:
# distribution of annualized returns
plt.figure(figsize=(10, 6))
sns.histplot(returns['Returns'], kde=True, bins=30, color='blue', edgecolor='black', alpha=0.7)
plt.title('Distribution of Annualized Returns Across All Stocks', fontsize=16, weight='bold')
plt.xlabel('Annualized Returns', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

This histogram displays the distribution of annualized returns for all stocks in the dataset. It provides an overview of how the returns are spread, with a superimposed kernel density estimate (KDE) line to highlight the shape of the distribution.

Key Insights:

1. Concentration Around Negative Returns: Most stocks exhibit annualized returns clustered between -2% and -1%, suggesting a predominance of underperforming stocks during the analyzed period.

2. Presence of Outliers: While the majority of returns are near the mean, there are a few stocks with significantly positive or negative returns, highlighting potential high-risk/high-reward investment opportunities.

In [None]:
# scatter Plot to show Annualized Returns vs Volatility
plt.figure(figsize=(12, 8))
scatter = plt.scatter(
    returns['Volatility'],
    returns['Returns'],
    c=returns['Returns'],
    cmap='viridis',
    s=100,
    edgecolor='black'
)
plt.colorbar(scatter, label="Annualized Returns")
plt.title("Annualized Returns vs Volatility for Stocks", fontsize=14)
plt.xlabel("Volatility (Annualized)", fontsize=12)
plt.ylabel("Returns (Annualized)", fontsize=12)
plt.grid(alpha=0.3)

# ticker Annotations
for i, ticker in enumerate(returns['Ticker']):
    if i % 20 == 0:
        plt.annotate(ticker, (returns['Volatility'][i], returns['Returns'][i]), fontsize=8, ha='right')

plt.show()

The scatter plot visualizes the relationship between annualized returns (y-axis) and annualized volatility (x-axis) for various stocks, highlighting the risk-reward trade-off. Each dot represents a stock, with its position showing its risk level (volatility) and performance (returns). The color gradient adds a layer of insight, indicating the magnitude of returns, with brighter colors representing higher returns. This visualization helps identify clusters, high-return performers, and outliers in the dataset.

Key Insights:

1.   Risk-Reward Trade-off: Stocks like EQT and GE exhibit higher volatility with varying returns, reinforcing the classic risk-reward relationship where higher risk can lead to either higher or significantly lower returns.

2.   High-Return Outliers: Stocks like AVGO and PANW demonstrate exceptionally high annualized returns with moderate volatility, making them potential candidates for favorable risk-adjusted investment opportunities.






In [None]:
#Bar Chart of Top Performers
top_performers = returns.sort_values('Returns', ascending=False).head(10)

plt.figure(figsize=(12, 6))
sns.barplot(x='Ticker', y='Returns', data=top_performers, palette='coolwarm')
plt.title('Top 10 Stocks by Annualized Returns', fontsize=16)
plt.xlabel('Ticker', fontsize=12)
plt.ylabel('Annualized Returns', fontsize=12)
plt.xticks(rotation=45)
plt.show()

This bar chart visualizes the top 10 stocks with the highest annualized returns from the dataset. Each bar represents a stock (identified by its ticker symbol) and its corresponding annualized return, showcasing the best-performing assets in terms of growth over the analyzed period.

Key Insights:

1. Top Performer: Broadcom Inc. (AVGO) significantly outperformed other stocks with the highest annualized return, exceeding 2.0. This indicates its exceptional growth potential compared to its peers.

2. Consistent High Returns: General Electric (GE) and Dollar Tree (DLTR) follow as strong performers, both showing annualized returns above 1.5, highlighting their consistent performance in the market.

3. Sector Representation: The top 10 stocks represent diverse sectors, suggesting opportunities for high returns across industries rather than concentration in a single market segment.

## Clustering
In this part of our analysis we use K-Means clustering to group the data, trying every number of clusters from 1 to 10.

To determine the appropriate number of clusters, we plot the within-cluster sum of squares as a function of the number of clusters and look for the elbow.

In [None]:
#Apply K-Means clustering, first use elbow method to determine clusters
clustering_data = returns[['Ticker', 'Returns', 'Volatility']].copy()

# List to store the within-cluster sum of squares (WCSS) for each number of clusters
wcss = []

# Calculate WCSS for different numbers of clusters
for k in range(1, 11):  # Try 1 to 10 clusters
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
    kmeans.fit(clustering_data[['Returns', 'Volatility']])
    wcss.append(kmeans.inertia_)

# Plot the Elbow Curve
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), wcss, linestyle='--')
plt.title('Elbow Method for Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.xticks(range(1, 11))
plt.grid()
plt.show()


From the plot we can see that there is an elbow at three clusters, indicating that we should group our data into this number of clusters.
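Reading an elbow off a plot is somewhat subjective. One rough heuristic, sketched below, is to pick the number of clusters where the WCSS curve bends most sharply, i.e. where its discrete second difference is largest. The helper name `elbow_by_second_difference` and the WCSS values are made up for illustration; this is not the method used above, only a possible cross-check.

```python
import numpy as np

def elbow_by_second_difference(wcss, k_values):
    """Return the k at which the WCSS curve has the largest second difference."""
    wcss = np.asarray(wcss, dtype=float)
    # second_diff[j] measures the bend at k_values[j + 1]
    second_diff = wcss[:-2] - 2 * wcss[1:-1] + wcss[2:]
    return k_values[int(np.argmax(second_diff)) + 1]

# Made-up WCSS values for k = 1..6 (steep drop until k = 3, then flat).
k_values = list(range(1, 7))
wcss = [100.0, 60.0, 25.0, 20.0, 17.0, 15.0]

best_k = elbow_by_second_difference(wcss, k_values)
print(best_k)
```

The heuristic can be fooled by noisy curves, so it is best treated as a sanity check alongside the plot rather than a replacement for it.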

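We next use the silhouette score, which measures how close each data point is to the other points in its own cluster compared with the points in the nearest neighboring cluster, to verify the optimal number of clusters. Scores range from -1 to 1, and higher values indicate better-separated clusters.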
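For intuition, the silhouette value of a single point is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster. The sketch below computes this by hand with NumPy on four made-up points; the helper `silhouette_values` is written for illustration and is not part of scikit-learn.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette values using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # singleton cluster: value 0 by convention
        a = dists[i, same].mean()
        b = min(
            dists[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores

# Two tight, well-separated groups of two points each.
X = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
labels = [0, 0, 1, 1]

scores = silhouette_values(X, labels)
mean_score = scores.mean()
print(mean_score)
```

The average of these per-point values is what a single silhouette score summarizes for a given clustering.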

In [None]:
# Also try to confirm with the silhouette_score

from sklearn.metrics import silhouette_score

silhouette_scores = []
for k in range(2, 11):  # Silhouette score is undefined for k=1
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
    kmeans.fit(clustering_data[['Returns', 'Volatility']])
    score = silhouette_score(clustering_data[['Returns', 'Volatility']], kmeans.labels_)
    silhouette_scores.append(score)

# Plot the silhouette scores
plt.figure(figsize=(10, 6))
plt.plot(range(2, 11), silhouette_scores, marker='o', linestyle='--')
plt.title('Silhouette Score for Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.xticks(range(2, 11))
plt.grid()
plt.show()


This plot further confirms our conclusion that the optimal number of clusters for our data is 3.

Next we visualize the clusters in a scatter plot.

In [None]:
# Optimal number of clusters is 3, confirmed by the elbow and silhouette chart
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)
kmeans.fit(clustering_data[['Returns', 'Volatility']])
clustering_data['Cluster'] = kmeans.labels_

plt.figure(figsize=(10, 6))
sns.scatterplot(data=clustering_data, x='Returns', y='Volatility', hue='Cluster', palette='viridis')
plt.title('K-Means Clustering Results')

# Overlay the cluster centroids (column 0 = Returns, column 1 = Volatility)
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            s=300, c='red', marker='X', label='Centroids')
plt.legend()
plt.show()

clustering_data['Cluster'].value_counts()


## Cluster Analysis and Portfolio Optimization

Here we analyze each cluster in terms of its makeup and performance, in order to set parameters to create a diversified but high performing portfolio.

We use the yfinance API to retrieve the sector for each stock ticker and merge it to our clustering_data dataframe.

### **Disclaimer: yfinance can be unstable when pulling data. If the code takes too long, please restart the runtime and try again. Thank you!**
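Since the sector lookups depend on network calls, one possible way to make them more robust is a small retry wrapper with exponential backoff. The sketch below is an assumption about how this could be done, not part of our pipeline: `fetch_with_retry` and `flaky_sector_lookup` are hypothetical names, and the flaky function simply simulates a call that fails twice before succeeding.

```python
import time

def fetch_with_retry(fetch, ticker, retries=3, delay=0.0, default="Unknown"):
    """Call fetch(ticker) up to `retries` times; return `default` if all attempts fail."""
    for attempt in range(retries):
        try:
            return fetch(ticker)
        except Exception:
            if attempt < retries - 1:
                time.sleep(delay * (2 ** attempt))  # exponential backoff
    return default

# Stand-in for a flaky network call: fails twice, then succeeds.
calls = {"n": 0}

def flaky_sector_lookup(ticker):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "Technology"

result = fetch_with_retry(flaky_sector_lookup, "AAPL")
always_fails = fetch_with_retry(lambda t: 1 / 0, "XYZ", retries=2)

print(result, calls["n"], always_fails)
```

In this notebook the `fetch` argument could be something like `lambda t: yf.Ticker(t).info.get('sector', 'Unknown')`, with a non-zero `delay` to avoid hammering the API.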

In [None]:
# Suppress yfinance logging
yf_logger = logging.getLogger("yfinance")
yf_logger.setLevel(logging.CRITICAL)

# Fetch sectors without printing errors; tickers whose lookup fails are labeled 'Unknown' (will only take 6-7 mins max)
sectors = {}
for ticker in tickers:
    try:
        stock = yf.Ticker(ticker)
        info = stock.info
        sectors[ticker] = info.get('sector', 'Unknown')
    except Exception:
        sectors[ticker] = 'Unknown'

sector_df = pd.DataFrame(list(sectors.items()), columns=['Ticker', 'Sector'])
sector_df.head()

In [None]:
clustering_data = clustering_data.reset_index().merge(sector_df, on='Ticker', how='left')
clustering_data.head()

In [None]:
# Count the sectors in each cluster
sector_counts = clustering_data.groupby(['Cluster', 'Sector']).size().reset_index(name='Count')


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))
sns.barplot(data=sector_counts, x='Cluster', y='Count', hue='Sector')
plt.title('Sector Distribution by Cluster')
plt.xlabel('Cluster')
plt.ylabel('Number of Stocks')
plt.legend(title='Sector')
plt.show()


The K-Means clustering showed that cluster 0 was the best performing while cluster 2 was the worst performing.

We will now analyze the sector characteristics within each cluster. Understanding how each sector performs within a cluster is very helpful when building a portfolio.

In [None]:
sector_performance = clustering_data.groupby(['Cluster', 'Sector']).agg(
    Avg_Returns=('Returns', 'mean'),
    Avg_Volatility=('Volatility', 'mean')
).reset_index()


In [None]:
# make a heatmap to compare average returns and volatility by sector by cluster
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
sns.heatmap(sector_performance.pivot(index='Cluster', columns='Sector', values='Avg_Returns'), annot=True, cmap='coolwarm')
plt.title('Average Returns by Sector and Cluster')
plt.xlabel('Sector')
plt.ylabel('Cluster')


As mentioned earlier, cluster 2 holds the poorest performing stocks, and this holds true across every sector. It would be ideal to avoid this cluster altogether when choosing an investment strategy, since the stocks in this segment performed poorly throughout the period analyzed.

Conversely, cluster 0 performed the best in all sectors, so it would be wise to select assets that fall into that cluster.

Based on this heatmap, for the time frame selected it would be ideal to select assets that are within the Consumer Cyclical, Industrials, and Consumer Defensive sectors within cluster 0.

It is also important to note that while the Consumer Defensive sector performed the best in cluster 0, it was one of the poorest in both cluster 1 and 2.

In [None]:
cluster_0_stocks = clustering_data[clustering_data['Cluster'] == 0]['Ticker'].tolist()
cluster_0_stocks

## Portfolio Optimization

Now that we have identified the best cluster to invest in, we will build a tool that creates a weighted portfolio designed to optimize returns according to user-set parameters: the investor's risk tolerance and the maximum share a single stock may have in the portfolio.

First we download the data for stocks in our desired cluster and calculate their mean returns over our chosen time period.

In [None]:
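# auto_adjust=False keeps the 'Adj Close' column available in newer yfinance releases
data = yf.download(cluster_0_stocks, start='2014-01-01', end='2019-01-01', auto_adjust=False)['Adj Close']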
returns  = data.pct_change().dropna()
mean_returns = returns.mean().values
cov_matrix = returns.cov()

Here we use the CVXPY Python library to solve the constrained optimization problem that allocates investments across the stocks in our optimal cluster, maximizing expected returns while penalizing risk according to a risk-aversion level set by the investor.
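Written out, the problem being solved is a standard mean-variance program. The notation here is our own shorthand rather than anything defined by CVXPY: $w$ is the vector of portfolio weights, $\mu$ the vector of mean daily returns (`mean_returns`), $\Sigma$ the covariance matrix of daily returns (`cov_matrix`), $\gamma$ the risk-aversion parameter (`risk_aversion`, set to 2), and $w_{\max}$ the maximum weight of a single asset (0.2).

```latex
\begin{aligned}
\max_{w \in \mathbb{R}^n} \quad & \mu^\top w - \gamma \, w^\top \Sigma \, w \\
\text{subject to} \quad & \mathbf{1}^\top w = 1, \\
& 0 \le w_i \le w_{\max}, \quad i = 1, \dots, n
\end{aligned}
```

A larger $\gamma$ penalizes variance more heavily and tends to spread the weights across more stocks, while the lower bound of zero rules out short positions.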

In [None]:
import cvxpy as cp
import numpy as np
import pandas as pd

# Number of assets
num_assets = len(cluster_0_stocks)

# Ensure mean_returns is a NumPy array
mean_returns = np.asarray(mean_returns)

# Define optimization variables
weights = cp.Variable(num_assets)

# Define risk and return
risk = cp.quad_form(weights, cov_matrix.values)  # Ensure cov_matrix is a NumPy array
expected_return = mean_returns.T @ weights  # Matrix multiplication

# Optimization setup
risk_aversion = 2  # risk tolerance, raise to diversify to more stocks
objective = cp.Maximize(expected_return - risk_aversion * risk)

constraints = [
    cp.sum(weights) == 1,  # Weights sum to 1
    weights >= 0,          # Long-only portfolio
    weights <= 0.2        # Maximum weight constraint of single asset
]

# Solve the optimization problem
problem = cp.Problem(objective, constraints)
problem.solve()

# Get optimal weights
optimal_weights = weights.value

# Create a pandas Series for the weights
portfolio_weights = pd.Series(optimal_weights, index=cluster_0_stocks)
portfolio_weights[portfolio_weights < 0] = 0
threshold = 1e-6
portfolio_weights[portfolio_weights < threshold] = 0
portfolio_weights /= portfolio_weights.sum()
print(portfolio_weights[portfolio_weights>0])


Now that we have weighted our portfolio to be optimized for returns based on our desired parameters, we calculate the portfolio's daily returns by weighting each stock's daily returns by its optimal weight (i.e. the fraction of the portfolio it makes up), and annualize the mean daily return by multiplying it by the number of trading days in a year, 252.

We also calculate the volatility as the standard deviation of the daily portfolio returns (multiplied by the square root of the number of trading days in order to annualize it), and the Sharpe ratio, which here is the ratio of the portfolio's annualized return to its annualized volatility (i.e. assuming a risk-free rate of zero).
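The sketch below works through these three metrics on a made-up series of five daily portfolio returns. The helper `annualized_metrics` and its `risk_free_rate` parameter are illustrative additions; with the default of zero it matches the simple return-over-volatility ratio described here.

```python
import numpy as np

def annualized_metrics(daily_returns, risk_free_rate=0.0, trading_days=252):
    """Annualized return, volatility, and Sharpe ratio from daily returns."""
    daily_returns = np.asarray(daily_returns, dtype=float)
    annual_return = daily_returns.mean() * trading_days
    # np.std uses the population standard deviation (ddof=0) by default
    annual_vol = daily_returns.std() * np.sqrt(trading_days)
    sharpe = (annual_return - risk_free_rate) / annual_vol
    return annual_return, annual_vol, sharpe

# Hypothetical daily portfolio returns.
daily = [0.01, -0.005, 0.002, 0.004, -0.001]

ret, vol, sharpe = annualized_metrics(daily)
_, _, sharpe_rf = annualized_metrics(daily, risk_free_rate=0.02)

print(f"Annual Return: {ret:.2%}")
print(f"Annual Volatility: {vol:.2%}")
print(f"Sharpe Ratio: {sharpe:.2f} (with 2% risk-free rate: {sharpe_rf:.2f})")
```

Subtracting a positive risk-free rate always lowers the ratio, so figures computed with a zero rate should be read as an upper bound on the conventional Sharpe ratio.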

In [None]:
# Portfolio returns
portfolio_returns = (returns @ optimal_weights)

# Annualized metrics
annual_return = np.mean(portfolio_returns) * 252
annual_volatility = np.std(portfolio_returns) * np.sqrt(252)
sharpe_ratio = annual_return / annual_volatility

print(f"Annual Return: {annual_return:.2%}")
print(f"Annual Volatility: {annual_volatility:.2%}")
print(f"Sharpe Ratio: {sharpe_ratio:.2f}")


To visualize our portfolio, we create a pie plot showing the stocks in our portfolio along with the percentage of the portfolio each asset comprises.

In [None]:
positive_portfolio_weights = portfolio_weights[portfolio_weights>0]

plt.figure(figsize=(8, 8))
positive_portfolio_weights.plot.pie(
    title="Portfolio Allocation",
    ylabel='',          # Hide the y-axis label
    fontsize=12,        # Adjust font size
    autopct=lambda p: round(p, 1)
)

# Display the chart
plt.show()

The above chart shows the optimized portfolio allocation across cluster 0 stocks. Under our chosen risk aversion and maximum-weight constraint, it is the allocation that best trades off risk against reward over the 2014-2018 period that we are analyzing.

# Part Two: Sentiment Analysis

The second portion of our project is the implementation of an autonomous trading bot based on sentiment analysis.

We will combine two datasets found on Kaggle (links provided below).  One dataset has daily stock market data on the most watched stocks on Yahoo Finance over a one-year time period, and the other has the text of over 80,000 tweets about different stocks, along with the date of each tweet and the ticker of the stock that it references.

We will attempt to build a model that can predict whether a stock's price will rise or fall on a given day based on the average sentiment of tweets about this stock on that particular day.



## Data Cleaning and Sentiment Analysis
Link to datasets:
https://www.kaggle.com/datasets/equinxx/stock-tweets-for-sentiment-analysis-and-prediction?resource=download&select=stock_yfinance_data.csv

In [None]:
# Get files from Drive
from google.colab import drive
import pandas as pd
drive.mount('/content/drive')

tweet_data_path = '/content/drive/MyDrive/545 Project/stock_tweets.csv'
stock_data_path = '/content/drive/MyDrive/545 Project/stock_yfinance_data.csv'

# The data came from Kaggle: https://www.kaggle.com/datasets/equinxx/stock-tweets-for-sentiment-analysis-and-prediction?resource=download&select=stock_yfinance_data.csv


Here we read in the two datasets.  We see that the tweet data has over 80,000 rows, and the stock data has over 6000 rows.

In [None]:
stock_data = pd.read_csv(stock_data_path)
tweet_data = pd.read_csv(tweet_data_path)
print(tweet_data.shape)
print(stock_data.shape)
tweet_data.head()

In [None]:
print(stock_data.shape)
stock_data.head()

To clean the stock data, we take the following steps:


1.   Cast the 'Date' column into a datetime object and sort by date
2.   Rename columns
3.   Drop missing values.  Because day-to-day stock movements are not predictable and do not necessarily form any kind of pattern, we decided this is a better option than filling these values with something like a mean or median value.



In [None]:
# Clean Stock Data
stock_data['Date'] = pd.to_datetime(stock_data['Date'])
stock_data = stock_data.sort_values(by='Date')
stock_data.rename(columns={'Stock Name': 'Ticker'}, inplace=True)

# Drop missing values
stock_data.dropna(inplace=True)

To clean the tweet data, we take the following steps:


1.   Cast the 'Date' column to a datetime object
2.   Drop null values.  Since we are interested in the sentiment of the tweet, the date it was made, and the associated stock ticker, if a row is missing any information it will not help us in building our model.
3.   Clean the tweet text and put it in a new column 'clean_text'

    1.   Remove URLs
    2.   Remove mentions (@user) and the hashtag symbol (#)
    3.   Remove any other special characters and extra whitespace
4.   Drop rows with empty string tweets or duplicate tweets.
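To show what the text-cleaning step does to a single tweet, here is a self-contained sketch using the same regular expressions as our cleaning function. The sample tweet is invented for illustration.

```python
import re

def clean_tweet(tweet: str) -> str:
    """Strip URLs, mentions, the '#' symbol, punctuation, and extra whitespace."""
    tweet = re.sub(r"http\S+|www\S+", "", tweet)   # URLs
    tweet = re.sub(r"@\w+|#", "", tweet)           # whole mentions, '#' symbol only
    tweet = re.sub(r"[^\w\s]", "", tweet)          # remaining special characters
    return re.sub(r"\s+", " ", tweet).strip()      # collapse whitespace

sample = "@elonmusk $TSLA to the moon!!! #stocks https://t.co/abc123"
cleaned = clean_tweet(sample)
print(cleaned)
```

Note that the hashtag word itself ("stocks") is kept, since only the "#" symbol is removed, while the mentioned username is dropped entirely.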







In [None]:

# Clean Tweet Data
import re

tweet_data['Date'] = pd.to_datetime(tweet_data['Date'], errors = 'coerce')
tweet_data.dropna(inplace=True)
def clean_tweet(tweet):
    # Remove URLs
    tweet = re.sub(r"http\S+|www\S+|https\S+", '', tweet, flags=re.MULTILINE)

    # Remove mentions (@user) and the '#' symbol of hashtags
    tweet = re.sub(r'\@\w+|\#','', tweet)

    # Remove special characters
    tweet = re.sub(r'[^\w\s]', '', tweet)

    # Remove extra whitespaces
    tweet = re.sub(r'\s+', ' ', tweet).strip()

    return tweet.strip()

tweet_data['clean_text'] = tweet_data['Tweet'].apply(clean_tweet)

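# Drop missing values, tweets that are empty after cleaning, and duplicate tweets
tweet_data.dropna(inplace=True)
tweet_data = tweet_data[tweet_data['clean_text'] != ''].copy()
tweet_data.drop_duplicates(subset='clean_text', inplace=True)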
tweet_data.rename(columns={'Stock Name': 'Ticker'}, inplace=True)
tweet_data.head()

Next we use the SentimentIntensityAnalyzer from NLTK's VADER module to compute the polarity score of each tweet. We keep the compound score, which ranges from -1 (most negative) to +1 (most positive).

In [None]:
# Calculate sentiment scores for tweets
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')

sid = SentimentIntensityAnalyzer()
tweet_data['Sentiment'] = tweet_data['clean_text'].apply(lambda x: sid.polarity_scores(x)['compound'])
tweet_data.head()

In [None]:
print(len(tweet_data))
cleaned_tweet_data = tweet_data.dropna()
print(len(cleaned_tweet_data))

Now we average the sentiment scores by stock ticker and date, so that our final dataframe has one average sentiment value per ticker per day.

In [None]:
# Aggregate sentiments to daily
tweet_data['Date'] = pd.to_datetime(tweet_data['Date']).dt.date
daily_sentiment = tweet_data.groupby(['Ticker', 'Date'])['Sentiment'].mean().reset_index()
print(daily_sentiment.shape)
tweet_data.head()

Finally we merge the stock data with the daily twitter sentiment data.  We do a left merge, to include ticker/day combinations where there are no associated tweets and therefore no associated sentiment.  We also add a column for the next day's closing price, which we then use to determine whether the stock's price rose or fell on a given date.  We then make a new column named 'Target' that has a value of 1 if the price rose and 0 if the price did not (which we will use for logistic regression), and a column named 'One_Day_Change', which calculates the percentage change in stock price over the next day (which we will use for linear regression).
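The construction of the two target columns can be traced on a tiny made-up example. The tickers, dates, and values below are hypothetical; the column names follow the ones described above.

```python
import pandas as pd

prices = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Date": pd.to_datetime(
        ["2022-01-03", "2022-01-04", "2022-01-05", "2022-01-03", "2022-01-04"]
    ),
    "Adj Close": [10.0, 11.0, 10.5, 50.0, 50.0],
})
sentiment = pd.DataFrame({
    "Ticker": ["AAA", "BBB"],
    "Date": pd.to_datetime(["2022-01-03", "2022-01-04"]),
    "Sentiment": [0.4, -0.2],
})

# Left merge keeps every ticker/day pair, with NaN sentiment where no tweets exist.
merged = prices.merge(sentiment, on=["Ticker", "Date"], how="left")

# Next day's close within each ticker; the last day of each ticker has no next day.
merged["Next_Day_Adj_Close"] = merged.groupby("Ticker")["Adj Close"].shift(-1)
merged["One_Day_Change"] = (
    merged["Next_Day_Adj_Close"] - merged["Adj Close"]
) / merged["Adj Close"]

# Drop the rows without a next-day price, then label up-moves with 1.
merged = merged.dropna(subset=["Next_Day_Adj_Close"])
merged["Target"] = (merged["Next_Day_Adj_Close"] > merged["Adj Close"]).astype(int)

print(merged)
```

A flat day (BBB here) is labeled 0, which matches the "rose versus did not rise" definition of the target.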

In [None]:
stock_data['Date'] = pd.to_datetime(stock_data['Date']).dt.date
stock_data = stock_data[['Ticker', 'Date', 'Adj Close', 'Volume']].dropna()
print(stock_data.shape)

In [None]:
# Merge the twitter sentiment with the stock data
combined_data = pd.merge(stock_data, daily_sentiment, on=['Ticker', 'Date'], how='left')

# Create a new col for next day adjusted close by shifting by -1
combined_data['Next_Day_Adj_Close'] = combined_data.groupby('Ticker')['Adj Close'].shift(-1)

# Create a new col to track if the price went up. Our bot assumes only long positions (no short selling)
combined_data['Target'] = (combined_data['Next_Day_Adj_Close'] > combined_data['Adj Close']).astype(int)
combined_data['One_Day_Change'] = ((combined_data['Next_Day_Adj_Close'] - combined_data['Adj Close']) / combined_data['Adj Close'])

# Drop any rows with NaN targets
combined_data.dropna(subset=['Target'], inplace=True)
combined_data.dropna(subset=['One_Day_Change'], inplace=True)


combined_data.head()

Next, we replace any NaN sentiment values with 0, which represents a neutral sentiment.

In [None]:
combined_data = combined_data.fillna(0)

In [None]:
combined_data.head()

## Visualizing the Data

We plot a frequency plot of the sentiment scores of our data to see how they are distributed, which will inform our decision of how to scale our data in our models.

In [None]:
sent = plt.hist(combined_data['Sentiment'])
plt.title('Tweet Sentiment Distribution')
plt.xlabel("Tweet Sentiment")
plt.ylabel("Number of Ticker/Day Pairs")

In [None]:
adj_close = plt.hist(combined_data['Adj Close'])
plt.title('Adjusted Closing Price Distribution')
plt.xlabel("Adjusted Closing Price")
plt.ylabel("Number of Ticker/Day Pairs")

In [None]:
volume = plt.hist(combined_data['Volume'])
plt.title('Volume Distribution')
plt.xlabel("Volume")
plt.ylabel("Number of Ticker/Day Pairs")

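We can see that only the sentiment scores are, more or less, normally distributed, and that the other features are heavily skewed towards lower values.  Because of this, we will use the MinMaxScaler instead of the StandardScaler: standardizing to zero mean and unit variance is most natural for roughly normal features, while min-max scaling simply maps each feature onto the range [0, 1] and preserves the shape of its distribution.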
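As a quick check of how scaling treats a skewed feature, the sketch below applies the min-max and standardization formulas by hand to a synthetic right-skewed sample (a lognormal draw standing in for volume; not our actual data). Both are linear rescalings, so the skewness of the feature is unchanged and only its range differs.

```python
import numpy as np

def skewness(x):
    """Sample skewness: mean of the cubed z-scores."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

rng = np.random.default_rng(0)
volume = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # synthetic skewed feature

# Same formulas as MinMaxScaler (default range [0, 1]) and StandardScaler.
min_max = (volume - volume.min()) / (volume.max() - volume.min())
standard = (volume - volume.mean()) / volume.std()

print(skewness(volume), skewness(min_max), skewness(standard))
print(min_max.min(), min_max.max())
```

If the skew itself is a concern for a model, a nonlinear transform such as a logarithm would be needed; neither scaler removes it.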

## Building the models

Now that daily sentiments for each ticker are calculated, we can begin to put together a model that uses the sentiment scores of the tweets to predict the movement of a stock's price. We will test various models, including logistic regression, random forest regression, and a neural network built with TensorFlow and the Adam optimizer.

In [None]:
# Import necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import ConfusionMatrixDisplay

import sklearn.metrics

Now we create our feature set, comprising adjusted closing price, trading volume, and average daily sentiment. For our target variable, we use 'One_Day_Change' for our linear regression models and 'Target' for our logistic regression models.  We then split the data into training and test sets, with the test set being 20% of the original dataset, and scale the data.
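A minimal sketch of the split-then-scale pattern on synthetic data is shown below. The feature matrix is random and only stands in for our three features. The key point is that the scaler learns its minimum and maximum from the training split alone and then applies that same transformation to the test split, so no test-set statistics leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X = rng.lognormal(size=(100, 3))     # synthetic stand-in for three features
y = rng.integers(0, 2, size=100)     # synthetic binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max

print(X_train_scaled.min(), X_train_scaled.max())
print(X_test_scaled.min(), X_test_scaled.max())
```

Because the test split is scaled with the training range, its values can fall slightly outside [0, 1]; that is expected and mirrors what happens with genuinely unseen data.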

In [None]:
print(len(combined_data))
combined_data_cleaned = combined_data.dropna()
print(len(combined_data_cleaned))
X_log = combined_data_cleaned[['Adj Close', 'Volume', 'Sentiment']]
y_log = combined_data_cleaned['Target']
X_lin = combined_data_cleaned[['Adj Close', 'Volume', 'Sentiment']]
y_lin = combined_data_cleaned['One_Day_Change']

scaler = MinMaxScaler()
# scaler= StandardScaler()

# Split the data
X_train_lin, X_test_lin, y_train_lin, y_test_lin = train_test_split(X_lin, y_lin, test_size=0.2, random_state=42)

X_train_log, X_test_log, y_train_log, y_test_log = train_test_split(X_log, y_log, test_size=0.2, random_state=42)


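# Scale the data: fit the scaler on the training split only, then apply the
# same transformation to the test split to avoid leaking test-set statistics
X_train_lin = scaler.fit_transform(X_train_lin)
X_test_lin = scaler.transform(X_test_lin)

X_train_log = scaler.fit_transform(X_train_log)
X_test_log = scaler.transform(X_test_log)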

### **Linear Regression**

A Linear Regression model is fitted on the scaled data to predict One_Day_Change. The performance is evaluated using the coefficient of determination.

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X_train_lin, y_train_lin)

y_pred = lin_reg.predict(X_test_lin)

To evaluate our linear regression, we use the `score` method of scikit-learn's linear regression model. This calculates the coefficient of determination, defined as one minus the ratio of the residual sum of squares to the total sum of squares.

In our model, the coefficient is negative, which is a poor result: it says that the model is (in this case slightly) worse than simply assigning the mean of the data to each data point.

Given this poor result, we will try to create a better model using other statistical methods.

In [None]:
lin_reg_score = lin_reg.score(X_test_lin, y_test_lin)
print(lin_reg_score)
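
As a worked illustration of the definition above, the following sketch computes the coefficient of determination by hand on a small set of hypothetical one-day changes. The function name `r_squared` and the numbers are assumptions for illustration; they are not taken from our dataset.

```python
def r_squared(y_true, y_pred):
    """One minus the ratio of residual sum of squares to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical one-day changes whose mean is exactly 0
y_true = [1.0, -2.0, 0.5, 1.5, -1.0]

mean_pred = [0.0] * len(y_true)            # always predict the mean
noisy_pred = [0.5, 0.5, -0.5, -0.5, 0.5]   # predictions unrelated to the target

print(r_squared(y_true, mean_pred))    # predicting the mean scores exactly 0
print(r_squared(y_true, noisy_pred))   # worse than the mean, so the score is negative
```

A score of zero therefore corresponds to "no better than predicting the mean", and any negative score means the model's squared error is larger than that of the mean.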

### **Logistic Regression**

A Logistic Regression model is trained to classify price movement (up or down). The confusion matrix and accuracy score evaluate its performance.

In [None]:
log_clf = LogisticRegression()
log_clf.fit(X_train_log, y_train_log)

After fitting the model, we use it to make predictions on our test data, create a confusion matrix, and calculate the accuracy of our model.

Looking at the confusion matrix, we can see that our model is predicting a value of 0 for every datapoint (i.e. that for all data points the price of the stock will go down or remain constant), making the model completely unusable.

In [None]:
prediction = log_clf.predict(X_test_log)

log_confusion = sklearn.metrics.confusion_matrix(y_test_log, prediction)
print(log_confusion)
disp = ConfusionMatrixDisplay(confusion_matrix=log_confusion,
                              display_labels=log_clf.classes_)
disp.plot()

log_acc = sklearn.metrics.accuracy_score(y_test_log, prediction)  # (y_true, y_pred) order
print(log_acc)
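
When a classifier predicts a single class for every data point, its accuracy is simply the share of that class in the test set. The sketch below shows this with hypothetical labels; the helper names are illustrative, and the 6-to-4 class split is an assumption rather than a figure from our data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and truth agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical test labels: six 0s (down or flat) and four 1s (up)
y_test = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0]

label, baseline = majority_baseline_accuracy(y_test)
always_zero = [0] * len(y_test)

print(label, baseline)
print(accuracy(y_test, always_zero))
```

Comparing a model's accuracy with this majority-class baseline, rather than with 0.5, is a stricter check of whether the model has learned anything.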

At this point, after noting the low performance of our linear and logistic regression models, we discussed the idea of using PCA, lasso, or ridge regression to improve performance, but as we only had a total of 3 features being used to predict our target, we decided that in this instance these models would likely not be beneficial.

### Random Forest Classifier

Having noted the extremely poor performance of our vanilla logistic regression classifier, we will try to use a random forest classifier to create a better model.  It will also have the advantage of providing us the importance it assigns to different variables, which could aid us in removing variables that may be making our data noisy and harder to predict.

We fit a random forest with 200 trees and a maximum depth of 60 in the hope that this large number of deep trees will create a better classifier.

In [None]:
rf_clf = RandomForestClassifier(random_state=42, n_estimators=200, max_depth=60, class_weight='balanced')

rf_clf.fit(X_train_log,y_train_log)

y_pred = rf_clf.predict(X_test_log)

rf_acc = rf_clf.score(X_test_log, y_test_log)
print(rf_acc)

Unfortunately, our random forest classifier had an accuracy even lower than our regular logistic regression classifier. Upon analyzing the outputs, however, we note one improvement: the classifier is now assigning values of both 1 and 0 to our test data.

The random forest classifier also allows us to look at the importance of each variable in the model, and all variables have roughly the same importance. Given that we expect, from our real-world knowledge of markets, sentiment to have a larger influence on stock prices than raw closing price, this is a sign that our model is not finding any correlation between our features and targets.

In [None]:
rf_confusion = sklearn.metrics.confusion_matrix(y_test_log, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=rf_confusion,
                              display_labels=rf_clf.classes_)
disp.plot()

importances = rf_clf.feature_importances_
print(importances)

### **Neural Network**

A simple neural network is built using TensorFlow/Keras with two hidden layers and trained using the Adam optimizer in an attempt to increase the accuracy of the prediction.

In [None]:
X = combined_data[['Adj Close', 'Volume', 'Sentiment']]
y = combined_data['Target']

# Scaling Feature
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Building the neural network
model = Sequential([
    Dense(16, input_dim=X_train.shape[1], activation='relu'),
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid')
])

In [None]:
model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_test, y_test))

loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f}")

In [None]:
print(np.any(np.isnan(X_scaled)), np.any(np.isinf(X_scaled)))  # Check features
print(np.any(np.isnan(y)), np.any(np.isinf(y)))               # Check labels


Our neural network trained with the Adam optimizer received a slightly lower accuracy score than our random forest model, with an accuracy of only 0.5108.

### **XGBoost Classifier**

The XGBoost Classifier is tested and optimized using grid search for hyperparameter tuning.

In [None]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import GridSearchCV

X_log_clean = combined_data_cleaned[['Adj Close', 'Volume', 'Sentiment']]
y_log_clean = combined_data_cleaned['Target']

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_log_clean)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_log_clean, test_size=0.2, random_state=42)

In [None]:
xgb_clf = XGBClassifier(
    use_label_encoder=False,
    eval_metric='logloss',
    random_state=42
)
xgb_clf.fit(X_train, y_train)

In [None]:
y_pred = xgb_clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.4f}")

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=xgb_clf.classes_)
disp.plot()

In [None]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

grid_search = GridSearchCV(
    estimator=XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42),
    param_grid=param_grid,
    scoring='accuracy',
    cv=3,
    verbose=1
)

grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print("Best Parameters:", best_params)
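
To get a feel for the cost of this search, the sketch below enumerates the same parameter grid with `itertools.product` and counts the model fits. It assumes only what the cell above configures (the grid and `cv=3`) plus scikit-learn's default `refit=True`, which adds one final fit on the full training set.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}
cv_folds = 3

# Every combination of one value per parameter
combinations = [
    dict(zip(param_grid.keys(), values))
    for values in product(*param_grid.values())
]

n_candidates = len(combinations)        # 3 * 3 * 3 * 2 * 2
n_cv_fits = n_candidates * cv_folds     # one fit per candidate per fold
n_total_fits = n_cv_fits + 1            # plus the final refit (default refit=True)

print(n_candidates, n_cv_fits, n_total_fits)
print(combinations[0])
```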

In [None]:
optimized_xgb = XGBClassifier(**best_params, use_label_encoder=False, eval_metric='logloss', random_state=42)
optimized_xgb.fit(X_train, y_train)

y_pred_opt = optimized_xgb.predict(X_test)
acc_opt = accuracy_score(y_test, y_pred_opt)
print(f"Optimized Accuracy: {acc_opt:.4f}")

cm_opt = confusion_matrix(y_test, y_pred_opt)
disp_opt = ConfusionMatrixDisplay(confusion_matrix=cm_opt, display_labels=optimized_xgb.classes_)
disp_opt.plot()

### Model Performance Summary and comments

In [None]:
# Model performance comparison
import matplotlib.pyplot as plt
models = ['Logistic Regression', 'Random Forest', 'Neural Network', 'XGBoost (Vanilla)', 'XGBoost (Optimized)']
accuracies = [0.5179, 0.5187, 0.5211, 0.5195, 0.5116]

plt.figure(figsize=(10, 6))
plt.bar(models, accuracies, color='skyblue')
plt.title('Model Performance Comparison')
plt.xlabel('Models')
plt.ylabel('Accuracy')
plt.xticks(rotation=45, ha='right')
plt.axhline(y=0.5, color='r', linestyle='--', label='Random Baseline')
plt.legend()
plt.tight_layout()
plt.show()

** Note that we do not include our linear regression model here, as it is scored not on accuracy but by the coefficient of determination ($R^2$). $R^2$ is at most 1 and can be arbitrarily negative, so it is not comparable to the accuracy scores of the classification models.

| **Model**                     | **Coefficient of Determination** |
|-------------------------------|--------------------|
| **Linear Regression**        |   -0.0011          |

The result of our linear regression shows that our model is slightly worse than simply predicting the mean one-day change for every data point in our dataset.

| **Model**                     | **Accuracy** |
|-------------------------------|--------------------|
| **Logistic Regression**        | 0.5179            |
| **Random Forest**              | 0.5187            |
| **Neural Network (NN)**        | 0.5211            |
| **XGBoost (Vanilla)**          | 0.5195            |
| **XGBoost (Optimized)**        | 0.5116            |

The results of the models demonstrate that, while there may be *some* underlying correlation between tweet sentiment and price movements, the models we implemented present at best a foundation for further research and development in leveraging sentiment analysis and market data to predict stock price movements, rather than a usable tool for a serious or profit-conscious investor. With accuracies just above that of a coin flip, they do not reveal reliable insights into the predictive potential of social media sentiment analysis in stock price forecasting. Below is a summary of key conclusions drawn from each model:


1. **Logistic Regression**  
   Logistic regression achieved an accuracy of **51.79%**. This result suggests that this basic model cannot extract meaningfully predictive signals from a combination of market data and sentiment analysis.

2. **Random Forest**  
   With an accuracy of **51.87%**, the random forest model was about as inaccurate as the naive logistic regression. This result indicates that, at least based on the data we have used, there is likely no significant correlation between tweet sentiment and price movements.

3. **Neural Network (NN)**  
   Although it achieved the highest accuracy of **52.11%**, the neural network underscores that our data is not showing a correlation between our variables and price changes, as even an advanced architecture did not uncover correlation in the data.

4. **XGBoost Models**  
   Both vanilla and optimized XGBoost models performed similarly to our other, more naive models, achieving accuracies of **51.95%** and **51.16%**, respectively. Despite XGBoost’s flexibility and robustness in handling complex data, the low accuracy of our results indicates a lack of connection between tweet sentiment and price changes.
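
One way to judge whether an accuracy slightly above 50% differs meaningfully from chance is a one-sided test using the normal approximation to the binomial distribution. The sketch below is illustrative only: the test-set size `n_test` is a hypothetical placeholder (it is not taken from our notebook), and the chance level of 0.5 assumes roughly balanced classes.

```python
import math


def one_sided_p_value(accuracy, n_samples, chance=0.5):
    """Normal-approximation test of H0: true accuracy equals the chance level."""
    standard_error = math.sqrt(chance * (1 - chance) / n_samples)
    z = (accuracy - chance) / standard_error
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


n_test = 1000  # HYPOTHETICAL test-set size, for illustration only
z, p = one_sided_p_value(0.5211, n_test)
print(f"z = {z:.3f}, one-sided p = {p:.3f}")
```

Under these assumptions the p-value stays above the conventional 0.05 threshold, so an accuracy of 52.11% on a test set of that size would not be enough to reject the hypothesis that the model is guessing.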


An analysis of the feature importances calculated by our random forest classifier also reinforces our failure to reject the null hypothesis.

In [None]:
feature_importance = [0.33, 0.34, 0.33]  # Example values for 'Adj Close', 'Volume', 'Sentiment'
features = ['Adj Close', 'Volume', 'Sentiment']

plt.figure(figsize=(8, 5))
plt.barh(features, feature_importance, color='lightgreen')
plt.title('Feature Importance')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

The fact that the importance values across these three features are almost identical would seem to indicate that all three contribute meaningfully to the model's predictions; however, when paired with the accuracy of our models being near that of a coin flip, it is in fact an indication that all of the features are equally unimportant in predicting the direction of price movements of our stocks.


Based on the wide variety of models we have used to try to ascertain a correlation between tweet sentiment and stock price, we are not able to reject the null hypothesis; that is, our data does not support the existence of a correlation between tweet sentiment and stock price. We can say this because our best model for predicting price changes based on sentiment has an accuracy of 52.11%. Moreover, that model, being a binary classifier, only predicts the direction of the price change, not its magnitude, meaning that we do not know whether the increases it correctly predicts are larger or smaller than the decreases it incorrectly predicts to be increases.

In this case, even a model that is correct more than half of the time in predicting the increase in price could lose money by often incorrectly predicting 'increase' for stocks that actually have large losses.
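
The point can be made concrete with a small sketch. The trades below are hypothetical (predicted direction, realized return) pairs chosen for illustration: the classifier is right 60% of the time, yet buying whenever it predicts "up" loses money because the misses are larger than the hits.

```python
def directional_accuracy(trades):
    """Share of trades where the predicted direction matched the realized one."""
    hits = sum((ret > 0) == predicted_up for predicted_up, ret in trades)
    return hits / len(trades)


def total_return_if_buying_on_up(trades):
    """Sum of realized returns over all trades where the model predicted 'up'."""
    return sum(ret for predicted_up, ret in trades if predicted_up)


# Hypothetical (predicted_up, realized_return) pairs
trades = [
    (True, 0.01), (True, 0.01), (True, 0.01),   # correct calls, small gains
    (True, -0.04), (True, -0.03),               # wrong calls, large losses
]

acc = directional_accuracy(trades)
pnl = total_return_if_buying_on_up(trades)
print(f"accuracy = {acc:.0%}, total return = {pnl:+.2%}")
```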

Although it is possible that there is in reality no correlation at all between tweet sentiment and stock price movements, we believe that, with stocks being assets often driven by investor emotion, it is more likely that we need a more nuanced model and more complete data in order to accurately quantify this correlation. Our suggestions for further study include:



1.   Using a different sentiment analysis tool. Twitter is well known for its sarcasm and use of double meanings, and the sentiment analysis model we used may be ill-equipped to parse this data.
2.   Using a dataset of tweets about a larger number of stocks. Because our dataset contained only a small subset of S&P 500 assets, it may contain bias, particularly as the stocks in our dataset are well-known companies that might have been tweeted about in reference to matters not directly related to investment (e.g. someone could tweet at Microsoft to complain about a new feature in Excel, not about its profitability or revenue).
3.   Eliminating stocks that are known to be the subject of much Twitter activity by non-investors. In particular, Tesla and Microsoft, as shown below, represent by far the largest number of tweets in our dataset, more than their relative market capitalizations would justify. These highly tweeted stocks may be creating noise that is negatively affecting our model.
4.   Attempting a similar model using sentiment analysis of a different media source, e.g. LinkedIn posts or financial news stories (e.g. Bloomberg or CNBC). These other data sources, because of the types of posts/stories they tend to produce, might more closely track sentiment than tweets do, particularly the sentiment of large investors that can cause greater swings in stock price.


Below, we plot a histogram of the frequencies of the tickers in our tweet data.  We can see that tweets about Microsoft and Tesla far outnumber those about other tickers.

In [None]:
plt.hist(tweet_data['Ticker'])
plt.title('Tweet Distribution by Ticker')
plt.xlabel('Ticker')
plt.ylabel('Number of Tweets')
plt.xticks(rotation=45, ha='right')

As an experiment, we try removing the tickers TSLA and MSFT from our data and refit our random forest classifier on this new data.

(You can comment or uncomment the lines below to run this model yourself on a different subset of the data.)

In [None]:
selected_data = combined_data_cleaned.copy()

# remove TSLA from dataset
# selected_data = selected_data[selected_data['Ticker'] != 'TSLA']

# remove MSFT from dataset
selected_data = selected_data[selected_data['Ticker'] != 'MSFT']

We create new train and test sets with our data.

In [None]:
print(len(combined_data))
combined_data_cleaned = combined_data.dropna()
print(len(combined_data_cleaned))
X_log = selected_data[['Adj Close', 'Volume', 'Sentiment']]
y_log = selected_data['Target']

scaler = MinMaxScaler()
# scaler= StandardScaler()

# Split the data
X_train_log, X_test_log, y_train_log, y_test_log = train_test_split(X_log, y_log, test_size=0.2, random_state=42)


# Scale the data: fit on the training split only, then apply to the test split
X_train_log = scaler.fit_transform(X_train_log)
X_test_log = scaler.transform(X_test_log)

Running the random forest classifier on data without TSLA, we find that our accuracy improves slightly (by almost one percentage point, to 53%), but that removing both MSFT and TSLA produces approximately the same accuracy as our original model (52%), and removing only MSFT produces an even lower accuracy of 49%.

Thus, although certain tickers appear to reduce the accuracy of our model, removing a single ticker, even one with a lot of tweet activity, does not drastically increase (and may in fact decrease) the accuracy.

In [None]:
rf_clf = RandomForestClassifier(random_state=42, n_estimators=200, max_depth=60, class_weight='balanced')

rf_clf.fit(X_train_log,y_train_log)

y_pred = rf_clf.predict(X_test_log)

rf_acc = rf_clf.score(X_test_log, y_test_log)
print(rf_acc)

### Sentiment Analysis Conclusion
This section of our project did not successfully create a tool for using machine learning models to predict stock price movements based on sentiment analysis and market data. Although the models, particularly the **neural network** and **XGBoost**, showed that further refinement and better data could potentially be used to create a useful tool, our current models fall far short of the standards for real-world deployment.

#### Insights from Feature Importance:
The random forest assigned nearly identical importance to all three features, so the analysis does not single out any one of them as a strong predictor. Each remains a plausible input for future models:
1. **Sentiment Analysis**: Intended to capture the impact of public opinion and media on stock performance, although our models did not extract a reliable signal from it.
2. **Trading Volume**: Captures market dynamics and investor activity.
3. **Adjusted Close Prices**: Reflects historical price levels and performance trends.

#### Future Directions:
- **Advanced Feature Engineering**: Incorporating additional market indicators, alternative data sources (e.g., news articles, macroeconomic data), and refined sentiment metrics for greater precision.
- **Hyperparameter Optimization**: Leveraging grid search and automated tuning to improve model performance.
- **Real-time Deployment**: Transitioning the models for real-time stock prediction and decision-making, with enhanced pipelines for data collection and model retraining.

Although our current model does not meet the standards we would need to implement this tool in trading, we hope that, with ongoing refinement and exploration, these models could evolve into robust systems capable of delivering actionable insights in the fast-paced world of financial markets.

# Part Three: Price Action Model for Short Term Scalping

## What is Price Action

According to [Investopedia](https://www.investopedia.com/terms/p/price-action.asp), price action is the movement of a security's price plotted over time. Price action forms the basis for all technical analyses of a stock, commodity, or other asset charts.

![](https://www.investopedia.com/thmb/po6XWr9rdPVQ3770NhFBqFVDly4=/750x0/filters:no_upscale():max_bytes(150000):strip_icc():format(webp)/Price-action-aad4a749432f45b3ab1eb927e69177f6.jpg)

Many short-term traders rely exclusively on price action and the formations and trends extrapolated from it to make trading decisions. Technical analysis as a practice is a derivative of price action since it uses past prices in calculations that can then be used to inform trading decisions.

## Why and How Does Price Action Work?

Of course, nothing works all the time, but some approaches work some of the time. Short-term traders (scalpers) use price action theory to read market movement and profit from it, based on the following general principles or assumptions.

### 1. Market Discounts Everything
- Price reflects all available information, including news, earnings, economic data, and market sentiment.
- This assumption aligns with the Efficient Market Hypothesis (EMH) in that any new information will be quickly absorbed and reflected in the market price.

### 2. Human Behavior and Market Psychology are Predictable
- Traders believe patterns emerge in price movements due to consistent human behaviors such as fear, greed, and herd mentality.
- Support and resistance levels, trends, and patterns like head-and-shoulders or flags are thought to be visual representations of these behaviors.

### 3. Price Moves in Trends
- Markets do not move randomly but tend to trend in a certain direction (uptrend, downtrend, or sideways).
- Identifying these trends allows traders to position themselves advantageously, assuming that trends are more likely to continue than reverse.

### 4. History Repeats Itself
- Patterns observed in historical price data are assumed to repeat over time because market participants react in similar ways under similar circumstances.
- Chart patterns, candlestick formations, and other recurring setups are used to anticipate future price movements.

### 5. Supply and Demand Drive Prices
- Price action is believed to be a reflection of the balance (or imbalance) between supply and demand.
- Price rises when demand exceeds supply and falls when supply exceeds demand. This balance is visualized through candlesticks, volume, and price levels.

### 6. Risk Can Be Managed Without Prediction
- Rather than trying to predict exact outcomes, price action traders assume they can manage risk effectively by focusing on probabilities and reacting to what the market is currently doing.
- Stop losses and position sizing are essential tools in this approach.
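
As an illustration of position sizing, the sketch below computes how many contracts keep the loss at the stop within a fixed fraction of the account. The account size, risk fraction, and stop distance are hypothetical; the point value of 50 reflects the E-mini S&P 500 contract multiplier of $50 per index point, and the function name `position_size` is an assumption for illustration.

```python
import math


def position_size(account_equity, risk_fraction, stop_distance_points, point_value):
    """Largest whole number of contracts whose loss at the stop fits the risk budget."""
    risk_budget = account_equity * risk_fraction            # dollars we accept losing
    risk_per_contract = stop_distance_points * point_value  # dollars lost per contract at the stop
    return math.floor(risk_budget / risk_per_contract)


contracts = position_size(
    account_equity=100_000,     # hypothetical account size in dollars
    risk_fraction=0.01,         # hypothetical: risk 1% of the account per trade
    stop_distance_points=4.0,   # hypothetical stop placed 4 index points from entry
    point_value=50.0,           # E-mini S&P 500: $50 per index point
)
print(contracts)
```

With these numbers the risk budget is $1,000 and each contract risks $200, so the position is capped at 5 contracts regardless of how confident the trader is in the setup.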

These principles did not come out of thin air; several authors have written thorough treatments of price action in trading. If you are interested, the following books provide more information.

- [*Trading Price Action series* by Al Brooks](https://www.amazon.ca/stores/Al-Brooks/author/B001JSEI4Q?ref=ap_rdr&isDramIntegrated=true&shoppingPortalEnabled=true)
- [*Understanding Price Action: practical analysis of the 5-minute time frame* by Bob Volman](https://www.amazon.ca/Understanding-Price-Action-practical-analysis/dp/908227860X?ref_=ast_author_dp)

## Goal of this Sub-project

We want to implement a strategy that focuses on leveraging price action patterns, particularly pullbacks within trends, to identify reliable trade entries. Coupled with disciplined risk management, the goal is to maintain a win rate of at least 60% while adhering to a 1:1 risk-reward ratio. Through this exercise we show the effectiveness of human discretion in trading, demonstrating that simple, intuitive methods can still outperform complex and conceptually demanding models, such as those based on machine learning, in practical application.
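
To see why a 60% win rate matters at a 1:1 risk-reward ratio, the following sketch computes the break-even win rate and the expected profit per trade, measured in units of risk (R). It ignores commissions and slippage, which is a simplifying assumption, and the function names are illustrative.

```python
def breakeven_win_rate(reward_to_risk):
    """Win rate at which expected profit is zero for a given reward-to-risk ratio."""
    return 1.0 / (1.0 + reward_to_risk)


def expectancy_in_r(win_rate, reward_to_risk):
    """Expected profit per trade in units of the amount risked (R)."""
    return win_rate * reward_to_risk - (1.0 - win_rate)


print(breakeven_win_rate(1.0))     # win rate needed to break even at 1:1
print(expectancy_in_r(0.6, 1.0))   # expected R per trade at a 60% win rate
```

At 1:1 the break-even win rate is 50%, so a 60% win rate leaves an expected gain of roughly 0.2R per trade before costs.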

## Install and Import

In [None]:
!pip install mplfinance

In [None]:
from google.colab import drive
import pandas as pd
import numpy as np
import mplfinance as mpf
import matplotlib.pyplot as plt
import json

## Load 5 minute interval price data for ES ([E-mini S&P 500 Futures](https://www.cmegroup.com/markets/equities/sp/e-mini-sandp500.html))

In [None]:
drive.mount('/content/drive')
es_path = '/content/drive/MyDrive/545 Project/stock_data_5_min_interval_10000_bars/CME_MINI_ES1!, 5_5a8c9.csv'

In [None]:
df_es = pd.read_csv(es_path)

## Data preprocessing

- We will need to rename the columns so that they work with `mplfinance` for visualization.
- We will also need to set index as the time, and convert the timezone to `America/New_York`

In [None]:
# Data preprocessing
def data_preprocessor(df):
    """
    Preprocesses a DataFrame for financial data analysis, ensuring proper column names,
    timezone conversion, and setting the time column as the index.

    Parameters:
    -----------
    df : pandas.DataFrame
        A DataFrame containing financial data with columns: 'open', 'high', 'low', 'close', 'Volume', and 'time'.
        The 'time' column should be a string or datetime object representing the timestamp for each row.

    Returns:
    --------
    pandas.DataFrame
        A new DataFrame with the following transformations applied:
        - Columns renamed to: 'Open', 'High', 'Low', 'Close', 'Volume'.
        - 'time' column converted to a datetime object, adjusted to the 'America/New_York' timezone.
        - 'time' column set as the index of the DataFrame.
    """
    # Create a copy of the original DataFrame to avoid modifying it
    ret = df.copy()

    # Rename columns to match expected format for financial analysis
    ret.rename(columns={
        'open': 'Open',
        'high': 'High',
        'low': 'Low',
        'close': 'Close',
        'Volume': 'Volume'
    }, inplace=True)

    # Convert the 'time' column to datetime and adjust for the 'America/New_York' timezone
    ret['time'] = pd.to_datetime(ret['time'], utc=True).dt.tz_convert('America/New_York')

    # Set the 'time' column as the index of the DataFrame
    ret.set_index('time', inplace=True)

    return ret


## Create indicators

### Pivots
A high pivot is a price point that forms when the market creates a peak. It signifies a temporary resistance level where prices have been rejected and start to move lower.

Similarly, a low pivot is a price point where the market forms a trough or bottom. It represents a temporary support level where prices stop falling and start to rise.

Pivots are useful for:
1. Trend Identification:

  - A series of higher pivot highs and higher pivot lows signals an uptrend.
  - A series of lower pivot highs and lower pivot lows signals a downtrend.
2. Support and Resistance:

  - High pivots often act as resistance levels, while low pivots act as support levels.
3. Reversal Points:

  - Pivots can mark potential reversal zones, helping traders enter or exit positions.
4. Pattern Formation:
  - Pivot highs and lows contribute to technical patterns like double tops/bottoms, head and shoulders, and more.

However, to reduce complexity, we didn't directly use Pivots in the strategy.

### MACD
![](https://www.keenbase-trading.com/wp-content/uploads/2024/06/how-to-find-the-best-macd-settings.jpg)

[Moving average convergence/divergence (MACD)](https://www.investopedia.com/terms/m/macd.asp#:~:text=Key%20Takeaways,EMA%20of%20the%20MACD%20line.) is a technical indicator to help investors identify entry points for buying or selling. The MACD line is calculated by subtracting the 26-period exponential moving average (EMA) from the 12-period EMA. The signal line is a nine-period EMA of the MACD line. Formula:

`MACD = 12-Period EMA − 26-Period EMA`

`Signal = 9-Period EMA of MACD`
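
The formulas above can be sketched in plain Python. The recursion below uses a smoothing factor of `2 / (span + 1)` and seeds the average with the first value, which is the same recursion pandas uses for `ewm(span=..., adjust=False)`. The helper names and the steadily rising price series are hypothetical and only for illustration.

```python
def ema(values, span):
    """Recursive exponential moving average seeded with the first value."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for value in values[1:]:
        out.append(alpha * value + (1 - alpha) * out[-1])
    return out


def macd_lines(closes, fast=12, slow=26, signal_span=9):
    """Return the MACD line, the signal line, and their difference (histogram)."""
    fast_ema = ema(closes, fast)
    slow_ema = ema(closes, slow)
    macd = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal = ema(macd, signal_span)
    hist = [m - s for m, s in zip(macd, signal)]
    return macd, signal, hist


# Hypothetical closes that rise by one point per bar
closes = [100.0 + i for i in range(60)]
macd, signal, hist = macd_lines(closes)

print(round(macd[-1], 3), round(signal[-1], 3))
```

Because the fast average follows rising prices more closely than the slow one, the MACD line is positive in a sustained uptrend, which is what the zero-line rule relies on.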

We can use MACD to identify trend and measure the strength of the trend:

1. Crossover Signals:

  - Bullish Crossover: When the MACD Line crosses above the Signal Line, it may signal an upward momentum and a potential buy signal.
  - Bearish Crossover: When the MACD Line crosses below the Signal Line, it may indicate downward momentum and a potential sell signal.
2. Zero Line Cross:

  - When the MACD Line crosses above the zero line, it signals a bullish trend.
  - When the MACD Line crosses below the zero line, it signals a bearish trend.
3. Divergence:

  - Bullish Divergence: When the price makes lower lows but the MACD makes higher lows, it suggests a potential reversal to the upside.
  - Bearish Divergence: When the price makes higher highs but the MACD makes lower highs, it indicates a potential reversal to the downside.
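
The crossover and zero-line rules listed above share one mechanic: one series moving from at-or-below a reference to above it, or the reverse. A minimal sketch, using hypothetical MACD and signal values and an illustrative helper named `crossings`:

```python
def crossings(series, reference):
    """Return (index, 'up' | 'down') for each bar where series crosses reference."""
    events = []
    for i in range(1, len(series)):
        prev_diff = series[i - 1] - reference[i - 1]
        curr_diff = series[i] - reference[i]
        if prev_diff <= 0 < curr_diff:
            events.append((i, "up"))      # bullish cross
        elif prev_diff >= 0 > curr_diff:
            events.append((i, "down"))    # bearish cross
    return events


# Hypothetical values for illustration
macd = [-0.5, -0.2, 0.1, 0.4, 0.3, -0.1]
signal = [-0.3, -0.3, -0.1, 0.2, 0.35, 0.1]
zero = [0.0] * len(macd)

print(crossings(macd, signal))   # MACD line vs. signal line
print(crossings(macd, zero))     # MACD line vs. zero line
```

In this toy series the signal-line crossovers (bars 1 and 4) occur before the corresponding zero-line crosses (bars 2 and 5), which is typical because the signal line sits closer to the MACD line than zero does during a trend.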

In our model/strategy, we focus on 2 and 3: we use the MACD line to identify the trend and temporary pullbacks and, most importantly, to find an entry point when a pullback reverses and the trend resumes.

### Trend and pullback detection based on MACD

![](https://patternswizard.com/wp-content/uploads/2021/09/pullback.png.webp)

Simply put, during a trending market we look for a pattern in which the MACD moves against the trend (monotonically increasing or decreasing) for 5 consecutive bars (25 minutes) and then, on the most recent (6th) bar, starts to move in the direction of the market trend again.


1. Pullback bounce up in upward trend:

  When (a) the MACD is monotonically decreasing for at least 5 bars, (b) the most recent bar closed with a MACD value greater than that of the bar before it, and (c) the second most recent MACD value is above the zero line (this value is the lowest point of the pullback, so all of the MACD values compared are above zero, indicating an upward market trend), we call the current bar the signal bar and the next bar the entry bar, and we enter at the open price of the entry bar. Note that the expression below, as written, compares four consecutive MACD values (`shift(5)` through `shift(2)`), i.e. three consecutive declines, before the upturn at `shift(1)`.

```
((macd.shift(5) > macd.shift(4)) &
  (macd.shift(4) > macd.shift(3)) &
  (macd.shift(3) > macd.shift(2)) &
  (macd.shift(2) < macd.shift(1)) &
  (macd.shift(2) > 0))
```

2. Pullback bounce down in downward trend:

  This case is similar to the previous one, but in the opposite direction.

We will use this pattern recognition to create two series of boolean variables, one value per bar, indexed by timestamp. For example, when the upward-trend pullback signal series is true at a timestamp, the corresponding bar is a signal bar and we make a long entry on the next bar. We do the opposite for the downward-trend pullback signal series.

However, because the underlying S&P 500 index has tended to be bullish over longer time frames, we found the strategy to be more effective for long entries than for short entries. We therefore keep only long trades in this exercise.
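
The shifted comparisons in the upward pullback expression can be traced by hand with a plain-Python equivalent, where `macd.shift(k)` at bar `t` corresponds to `macd[t - k]`. The MACD values below are hypothetical and the function is a sketch for illustration, not the notebook's implementation.

```python
def pullback_bounce_up(macd):
    """Flag bar t when MACD fell over t-5..t-2, turned up at t-1, and the low at t-2 is above zero."""
    flags = [False] * len(macd)
    for t in range(5, len(macd)):
        falling = macd[t - 5] > macd[t - 4] > macd[t - 3] > macd[t - 2]
        turned_up = macd[t - 2] < macd[t - 1]
        above_zero = macd[t - 2] > 0
        flags[t] = falling and turned_up and above_zero
    return flags


# Hypothetical MACD values: a decline from 2.0 to 0.8, then an upturn
macd = [2.0, 1.8, 1.5, 1.1, 0.8, 1.0, 1.3, 1.6]
flags = pullback_bounce_up(macd)

print(flags)
print([t for t, flag in enumerate(flags) if flag])
```

In this example the low is at index 4 and the upturn at index 5, and the flag is raised at index 6, because every comparison looks back at least one bar from the flagged position.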



In [None]:

# Function to calculate pivot high and low
def calculate_pivots(df, lb=5, rb=5):
    """
    Identifies pivot highs and lows in the dataframe.

    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame with 'High' and 'Low' columns.
    lb : int
        Number of bars to the left of the pivot to consider.
    rb : int
        Number of bars to the right of the pivot to consider.

    Returns:
    --------
    pivots_high : pandas.Series
        Series containing the pivot high values.
    pivots_low : pandas.Series
        Series containing the pivot low values.
    """
    pivots_high = df['High'][(df['High'] == df['High'].rolling(window=lb+rb+1, center=True).max())]
    pivots_low = df['Low'][(df['Low'] == df['Low'].rolling(window=lb+rb+1, center=True).min())]
    return pivots_high, pivots_low

# Function to calculate MACD
def calculate_macd(df, fast_length=12, slow_length=26, signal_length=9):
    """
    Calculates the MACD and Signal Line.

    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame with 'Close' column.
    fast_length : int
        The length of the fast moving average.
    slow_length : int
        The length of the slow moving average.
    signal_length : int
        The length of the signal line.

    Returns:
    --------
    macd : pandas.Series
        The MACD line.
    signal : pandas.Series
        The signal line.
    hist : pandas.Series
        The MACD histogram (difference between MACD and signal line).
    """
    fast_ma = df['Close'].ewm(span=fast_length, adjust=False).mean()
    slow_ma = df['Close'].ewm(span=slow_length, adjust=False).mean()
    macd = fast_ma - slow_ma
    signal = macd.ewm(span=signal_length, adjust=False).mean()
    hist = macd - signal
    return macd, signal, hist

# Function to identify trend changes based on the MACD line
def identify_trend_changes(macd):
    """
    Identifies pullback bounces (trend resumptions) based on the MACD line.

    Parameters:
    -----------
    macd : pandas.Series
        The MACD values.

    Returns:
    --------
    pullback_bounce_up : pandas.Series
        Series indicating pullback bounce ups.
    pullback_bounce_dn : pandas.Series
        Series indicating pullback bounce downs.
    """
    pullback_bounce_up = ((macd.shift(5) > macd.shift(4)) &
                          (macd.shift(4) > macd.shift(3)) &
                          (macd.shift(3) > macd.shift(2)) &
                          (macd.shift(2) < macd.shift(1)) &
                          (macd.shift(2) > 0))

    pullback_bounce_dn = ((macd.shift(5) < macd.shift(4)) &
                          (macd.shift(4) < macd.shift(3)) &
                          (macd.shift(3) < macd.shift(2)) &
                          (macd.shift(2) > macd.shift(1)) &
                          (macd.shift(2) < 0))

    return pullback_bounce_up, pullback_bounce_dn

## Visualize indicators with candlestick chart

Here, we use mplfinance to visualize the price as a candlestick chart together with the indicators we created above.

In [None]:
def plot_candlestick_with_indicators(df, pivots_high, pivots_low, pullback_bounce_up, pullback_bounce_dn, macd, signal, hist, ticker_name):
    """
    Plots the candlestick chart along with support/resistance lines, MACD, and trend signals using mplfinance.

    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame containing the price data.
    pivots_high : pandas.Series
        Series containing the pivot highs.
    pivots_low : pandas.Series
        Series containing the pivot lows.
    pullback_bounce_up : pandas.Series
        Series indicating pullback bounce ups.
    pullback_bounce_dn : pandas.Series
        Series indicating pullback bounce downs.
    macd : pandas.Series
        The MACD line.
    signal : pandas.Series
        The signal line.
    hist : pandas.Series
        The MACD histogram.
    ticker_name : str
        Ticker symbol shown in the chart title.
    """

    # Align the indices of the indicators with df's index
    pivots_high = pivots_high.reindex(df.index)
    pivots_low = pivots_low.reindex(df.index)
    pullback_bounce_up = pullback_bounce_up.reindex(df.index)
    pullback_bounce_dn = pullback_bounce_dn.reindex(df.index)
    macd = macd.reindex(df.index)
    signal = signal.reindex(df.index)
    hist = hist.reindex(df.index)

    # Define the additional plots for the indicators
    apds = []

    # Pivot High and Low
    apds.append(mpf.make_addplot(pivots_high, type='scatter', markersize=50, marker='$H$', color='black'))
    apds.append(mpf.make_addplot(pivots_low, type='scatter', markersize=50, marker='$L$', color='black'))

    # Pullback Bounce Up and Down: markers sit 5 points below the low (bounce up)
    # or 5 points above the high (bounce down); bars without a signal stay NaN
    pullbacks = df.copy()
    pullbacks['bounce_up'] = (pullbacks['Low'] - 5).where(pullback_bounce_up)
    pullbacks['bounce_dn'] = (pullbacks['High'] + 5).where(pullback_bounce_dn)
    apds.append(mpf.make_addplot(pullbacks['bounce_up'], type='scatter', markersize=50, marker='^', color='green'))
    apds.append(mpf.make_addplot(pullbacks['bounce_dn'], type='scatter', markersize=50, marker='v', color='red'))

    # MACD and Signal line
    apds.append(mpf.make_addplot(macd, panel=2, color='#8CFF9E'))
    apds.append(mpf.make_addplot(signal, panel=2, color='#FF7779'))
    apds.append(mpf.make_addplot(hist, panel=2, type='bar', color='gray', alpha=0.3))

    # Plot using mplfinance with the additional indicators
    fig, axes = mpf.plot(
      df, type='candle', style='charles',
      volume=True, title='Five-Minute Candlestick with Indicators for: ' + ticker_name,
      ylabel="Price", addplot=apds, returnfig=True, figsize=(24, 16)
    )


## Execute strategy based on the indicators - back testing

Here we define the back testing helper function, which simulates and evaluates a pullback trading strategy based on price bounces, using the Open, High, Low, Close price data and the pullback signals.

1. Entry Conditions:

 - The strategy looks for pullback bounce signals:
  
    - pullback_bounce_up: Indicates a potential buy (long) opportunity.
  
    - pullback_bounce_dn: Could be used for sell (short) trades; this branch is currently commented out in the code.
 - If a signal is triggered, the strategy enters a trade at the next bar's open price.
2. Exit Conditions (checked against each bar's closing price):
  - Take Profit: If the close is at least the take-profit amount above the entry price (Default: 15 points).
  - Stop Loss: If the close is at least the stop-loss amount below the entry price (Default: 15 points).
3. Profit Calculation:

  - Each trade's profit/loss is calculated and added to the total profit. A multiplier (Default: 50 for E-Mini S&P 500 Futures) adjusts the profit/loss for the trading instrument.
4. Tracking Metrics:
  - Entry/Exit times and prices.
  - Profit/Loss.
  - Duration in bars.
  - Number of trades.
  - Percent profitable trades.
  - Net profit.
  - Maximum drawdown.
  - Average trade profit.
  - Average trade duration.
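
As a rough illustration of the exit and profit rules above, the sketch below evaluates a single long trade against one bar's closing price. The helper name `check_long_exit` and the sample prices are assumptions made for this example; the thresholds and multiplier mirror the defaults described above.

```python
def check_long_exit(entry_price, close_price, take_profit=15, stop_loss=15, multiplier=50):
    """Return (exit_reason, profit_loss) for a long trade, or (None, 0.0) if it stays open."""
    move = close_price - entry_price
    if move >= take_profit:
        return "take_profit", move * multiplier
    if move <= -stop_loss:
        return "stop_loss", move * multiplier
    return None, 0.0

# Sample prices are made up for illustration
print(check_long_exit(5000.0, 5016.0))   # ('take_profit', 800.0)
print(check_long_exit(5000.0, 4984.5))   # ('stop_loss', -775.0)
print(check_long_exit(5000.0, 5007.0))   # (None, 0.0)
```

Because exits are evaluated on the bar's close, the realized move can exceed the 15-point threshold, as in the first two calls.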





In [None]:
def execute_trading_strategy(df, pullback_bounce_up, pullback_bounce_dn, multiplier=50, stop_loss=15, take_profit=15):
    """
    Executes a trading strategy based on pullback bounces and tracks performance.

    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame containing the OHLC data with 'Open', 'High', 'Low', 'Close' columns and Time as the index.
    pullback_bounce_up : pandas.Series
        A boolean Series indicating where pullback bounce up is true.
    pullback_bounce_dn : pandas.Series
        A boolean Series indicating where pullback bounce down is true.
    multiplier : int, optional
        The multiplier for profit/loss calculation (default is 50 for E-MINI S&P 500 Futures).
    stop_loss : float, optional
        The stop loss threshold in price points (default is 15).
    take_profit : float, optional
        The take profit threshold in price points (default is 15).

    Returns:
    --------
    performance_metrics : dict
        A dictionary containing trade performance metrics such as net profit, max drawdown,
        percentage profitable, etc.
    """
    # Initialize variables
    trades = []  # Store trades as (entry_time, entry_price, exit_time, exit_price, profit_loss)
    in_trade = False  # To track whether we are in a trade
    entry_price = 0  # The price at which the trade is entered
    entry_time = None  # The time at which the trade is entered
    entry_bar = None  # The bar index at which the trade is entered
    trade_direction = None  # 1 for long, -1 for short
    total_profit = 0  # Total profit from trades
    max_drawdown = 0  # Maximum drawdown observed during the trades
    max_balance = 0  # Track the highest balance (to calculate max drawdown)
    trade_durations = []  # Track the duration of each trade in bars
    profit_losses = []  # Track profit and losses for each trade

    # Iterate through the dataframe to simulate trading strategy
    for i in range(1, len(df)):
        # If currently in a trade, check if we should exit
        if in_trade:
            # Check for exit conditions
            current_price = df['Close'].iloc[i]
            if trade_direction == 1:  # Long trade
                if current_price >= entry_price + take_profit:
                    # Take profit
                    exit_price = current_price
                    exit_time = df.index[i]
                    profit_loss = (exit_price - entry_price) * multiplier
                    trades.append((entry_time, entry_price, exit_time, exit_price, profit_loss))
                    in_trade = False
                    total_profit += profit_loss
                    profit_losses.append(profit_loss)
                    trade_durations.append(i - entry_bar)
                    max_balance = max(max_balance, total_profit)
                    max_drawdown = min(max_drawdown, total_profit - max_balance)
                elif current_price <= entry_price - stop_loss:
                    # Stop loss
                    exit_price = current_price
                    exit_time = df.index[i]
                    profit_loss = (exit_price - entry_price) * multiplier
                    trades.append((entry_time, entry_price, exit_time, exit_price, profit_loss))
                    in_trade = False
                    total_profit += profit_loss
                    profit_losses.append(profit_loss)
                    trade_durations.append(i - entry_bar)
                    max_balance = max(max_balance, total_profit)
                    max_drawdown = min(max_drawdown, total_profit - max_balance)
            # elif trade_direction == -1:  # Short trade
            #     if current_price <= entry_price - take_profit:
            #         # Take profit
            #         exit_price = current_price
            #         exit_time = df.index[i]
            #         profit_loss = (entry_price - exit_price) * multiplier
            #         trades.append((entry_time, entry_price, exit_time, exit_price, profit_loss))
            #         in_trade = False
            #         total_profit += profit_loss
            #         profit_losses.append(profit_loss)
            #         trade_durations.append(i - entry_bar)
            #         max_balance = max(max_balance, total_profit)
            #         max_drawdown = min(max_drawdown, total_profit - max_balance)
            #     elif current_price >= entry_price + stop_loss:
            #         # Stop loss
            #         exit_price = current_price
            #         exit_time = df.index[i]
            #         profit_loss = (entry_price - exit_price) * multiplier
            #         trades.append((entry_time, entry_price, exit_time, exit_price, profit_loss))
            #         in_trade = False
            #         total_profit += profit_loss
            #         profit_losses.append(profit_loss)
            #         trade_durations.append(i - entry_bar)
            #         max_balance = max(max_balance, total_profit)
            #         max_drawdown = min(max_drawdown, total_profit - max_balance)

        # Enter a new trade if not already in one. This check also runs on a bar where
        # a trade has just been closed, so a new position may be opened at that bar's open.
        if not in_trade:
            if pullback_bounce_up.iloc[i-1]:
                # Long trade entry (next bar)
                entry_price = df['Open'].iloc[i]
                entry_time = df.index[i]
                trade_direction = 1  # Long trade
                entry_bar = i
                in_trade = True
            # elif pullback_bounce_dn.iloc[i-1]:
            #     # Short trade entry (next bar)
            #     entry_price = df['Open'].iloc[i]
            #     entry_time = df.index[i]
            #     trade_direction = -1  # Short trade
            #     entry_bar = i
            #     in_trade = True

    # Calculate performance metrics
    num_trades = len(trades)
    num_profitable_trades = len([x for x in profit_losses if x > 0])
    percent_profitable = (num_profitable_trades / num_trades) * 100 if num_trades > 0 else 0
    avg_trade_profit = np.mean(profit_losses) if num_trades > 0 else 0
    avg_trade_duration = np.mean(trade_durations) if num_trades > 0 else 0
    net_profit = total_profit

    performance_metrics = {
        'num_trades': num_trades,
        'percent_profitable': percent_profitable,
        'max_drawdown': max_drawdown,
        'avg_trade_profit': avg_trade_profit,
        'avg_trade_duration': avg_trade_duration,
        'net_profit': net_profit
    }

    # Plot performance metrics
    plt.figure(figsize=(14, 6))
    plt.subplot(2, 1, 1)
    plt.plot([x[2] for x in trades], [x[4] for x in trades], marker='o', color='b', label='Trade PnL')
    plt.xlabel('Exit Time')
    plt.ylabel('Profit/Loss')
    plt.title('Trade PnL Over Time')
    plt.grid(True)

    plt.subplot(2, 1, 2)
    plt.plot([x[2] for x in trades], np.cumsum([x[4] for x in trades]), marker='o', color='g', label='Cumulative Profit')
    plt.xlabel('Exit Time')
    plt.ylabel('Cumulative Profit')
    plt.title('Cumulative Profit Over Time')
    plt.grid(True)

    plt.tight_layout()
    plt.show()

    return performance_metrics

In [None]:
df_es_processed = data_preprocessor(df_es)
df_es_processed.head()

In [None]:
df_es_processed.info()

## Put things together

### Visualizing the price and indicators for the last 360 bars

In [None]:
def main(df, ticker_name):
    # Step 1: Calculate pivots (Highs and Lows)
    pivots_high, pivots_low = calculate_pivots(df)

    # Step 2: Calculate MACD and histogram
    macd, signal, hist = calculate_macd(df)

    # Step 3: Identify pullback bounces
    pullback_bounce_up, pullback_bounce_dn = identify_trend_changes(macd)

    # Step 4: Plot the candlestick chart with indicators
    plot_candlestick_with_indicators(df, pivots_high, pivots_low, pullback_bounce_up, pullback_bounce_dn, macd, signal, hist, ticker_name)


# Run the code with your dataframe
main(df_es_processed.tail(360), 'ES')

### Back testing using all price bars from Dec 2023 to Dec 2024.

In [None]:
macd, signal, hist = calculate_macd(df_es_processed)

pullback_bounce_up, pullback_bounce_dn = identify_trend_changes(macd)

performance_metrics = execute_trading_strategy(df_es_processed, pullback_bounce_up, pullback_bounce_dn)


### Performance Summary

In [None]:
print(json.dumps(performance_metrics, sort_keys=True, indent=4))
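
For reference, the `max_drawdown` value in the summary is the largest drop of cumulative profit below its running peak, with the peak starting at zero. The sketch below reproduces that calculation in vectorized form on a made-up list of trade results; `max_drawdown_from_pnl` is a hypothetical helper, not a function from the notebook.

```python
import numpy as np

def max_drawdown_from_pnl(profit_losses):
    """Largest drop of cumulative profit below its running peak (peak starts at 0)."""
    if len(profit_losses) == 0:
        return 0.0
    cumulative = np.cumsum(profit_losses)
    # The running peak never goes below 0, matching max_balance = 0 at the start
    running_peak = np.maximum.accumulate(np.maximum(cumulative, 0))
    return float(min(0.0, (cumulative - running_peak).min()))

# Made-up trade results: one win, two losses, two wins
sample_pnl = [750, -750, -750, 750, 750]
print(max_drawdown_from_pnl(sample_pnl))   # -1500.0
```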

## Conclusion

With this exercise, we found a simple and fairly effective strategy that achieves a 58% win rate with a 1:1 risk-to-reward ratio. This translates into a net profit of 46K USD over a one-year back test trading E-MINI S&P 500 Futures (each index point is worth $50 per contract, according to the contract specs). This shows that a simple rule can sometimes generate strong results in trading.
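
As a back-of-the-envelope check, the expected profit per trade follows from the win rate and the 1:1 risk-to-reward ratio. The sketch below assumes every winner gains exactly 15 points and every loser gives back exactly 15 points, which is a simplification because the back test exits on bar closes; `expectancy_per_trade` is a hypothetical helper.

```python
def expectancy_per_trade(win_rate, reward_points=15, risk_points=15, multiplier=50):
    """Expected profit per trade in USD under fixed win and loss sizes."""
    win = reward_points * multiplier
    loss = risk_points * multiplier
    return win_rate * win - (1 - win_rate) * loss

print(round(expectancy_per_trade(0.58), 2))   # 120.0 at a 58% win rate
print(round(expectancy_per_trade(0.50), 2))   # 0.0 at the break-even win rate
```

At a 1:1 ratio, any win rate above 50% gives a positive expectancy before costs such as commissions and slippage, which this sketch ignores.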

### Areas to improve

However, the strategy is far from perfect and leaves considerable room for improvement.

1. We need more rigorous testing

  - We should back test over more historical data (5Y or 10Y+) to increase confidence in the strategy.
  - On top of that, we should forward test with paper money on the new data that arrives each trading day, over an extended period, to show that the strategy is consistently profitable.

2. In the back testing itself, we should make adjustments for the underlying asset we trade.
  - E-MINI S&P 500 Futures contracts expire every 3 months. During the contract switch period, the ES ticker sees a large price jump that is purely due to the switch. We should exclude any profit made during such periods.
3. We should also adjust back testing and forward testing to avoid news/event-driven market trends.
  - To test the pure price action strategy, we should exclude price trends driven by news events.
  - The easiest ones to exclude are the upward trends driven by the recent FOMC meetings (rate cuts) and the market movements around election day.
  - Such events are unlikely to repeat often, so we should leave them out of the strategy's evaluation period.
4. Strategy Improvements
  - We can make the strategy more selective in entries by further incorporating the price pivot indicator.
  - In the pullback recognition, we used strict value comparisons over consecutive recent MACD values to identify a pullback trend. Instead, we could fit a linear regression to the MACD values so that the trend detection tolerates small fluctuations.
  - Many hyperparameters were set at our discretion. We could instead run a grid search to fit them over a longer period of historical data.
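
One possible shape for the regression-based pullback detection mentioned above is sketched below. It replaces the three strict comparisons with the sign of a least-squares slope over a short window ending at the trough. The function names, the 4-bar window, and the toy MACD values are assumptions chosen for illustration, not tuned settings.

```python
import numpy as np
import pandas as pd

def rolling_slope(series: pd.Series, window: int = 4) -> pd.Series:
    """Least-squares slope of the last `window` values at each bar."""
    x = np.arange(window)
    return series.rolling(window).apply(lambda y: np.polyfit(x, y, 1)[0], raw=True)

def bounce_up_regression(macd: pd.Series, window: int = 4) -> pd.Series:
    """Bounce-up signal: falling trend into a positive trough, then an uptick."""
    falling = rolling_slope(macd, window).shift(2) < 0   # downtrend ending at the trough
    uptick = macd.shift(2) < macd.shift(1)               # first rise after the trough
    above_zero = macd.shift(2) > 0                       # trough stays above zero
    return falling & uptick & above_zero

# Toy MACD values with one small blip (4.0 -> 4.2) inside the decline
noisy_macd = pd.Series([5.0, 4.0, 4.2, 3.0, 2.0, 3.0, 4.0])
print(bounce_up_regression(noisy_macd).tolist())
# [False, False, False, False, False, False, True]
```

The strict comparison rule would not fire anywhere on this series because of the blip, whereas the slope-based rule still flags the final bar.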

# Project Conclusion

The Investor Toolkit integrates several techniques to address three areas of investment strategy: portfolio optimization, sentiment-driven trading, and technical analysis-based scalping. Through asset clustering, we identified diversified, high-performing investment opportunities in the S&P 500, providing a framework for portfolio construction. Sentiment analysis explored the potential of alternative data sources like social media to inform trading decisions, albeit with statistically insignificant predictive accuracy. Finally, the price action model for short-term scalping showed that a simple, rule-based strategy can be profitable in a one-year back test, highlighting the usefulness of technical indicators in dynamic trading environments.

Together, these components underscore the versatility of combining machine learning, sentiment analysis, and traditional technical analysis to empower investors with actionable insights and scalable tools for decision-making.

# Future Work


1. Enhanced Back-Testing and Forward-Testing: Expand historical back-tests and implement forward-testing with real-time data to validate and improve the scalability of all models.

2. Sentiment Data Refinement: Incorporate advanced NLP models and integrate alternative sentiment sources like financial news and professional networks to enhance predictive accuracy.

3. Dynamic Optimization: Introduce adaptive parameter tuning for clustering, trading strategies, and portfolio weights to improve model robustness and market responsiveness.













