The multi-class classifier is built with the following pipeline.

1. Split the data into train, validation, and test sets
2. Load the data into DataLoaders
3. Define the model
4. Train the model
5. Evaluate and visualize training
6. Fine-tune the model


Loading the data and splitting it into train, validation, and test sets (80/10/10).

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import re

# Load the balanced dataset
df = pd.read_csv('data/processed/balanced_tweets_stock_data.csv')

print(f"Total samples: {len(df)}")
print(f"\nDataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()


Total samples: 36031

Dataset shape: (36031, 11)

First few rows:


Unnamed: 0,Date,cleaned_tweet,Stock Name,Company Name,Open,High,Low,Close,Volume,daily_return,price_range
0,2021-10-18 17:07:43+00:00,ok my 2 cents im 100% sure tweeted about $aaa ...,TSLA,"Tesla, Inc.",283.929993,291.753326,283.823334,290.036682,72621600,0.032122,7.929993
1,2022-04-12 22:55:34+00:00,this is how tesla is clobbering everybody else...,TSLA,"Tesla, Inc.",332.546661,340.396667,325.533325,328.983337,65976000,0.011292,14.863342
2,2022-03-04 15:00:28+00:00,at some point investors will put two and two t...,TSLA,"Tesla, Inc.",283.033325,285.216675,275.053345,279.429993,66999600,-0.001192,10.16333
3,2022-09-12 16:28:23+00:00,this is why less people start buying tesla shi...,TSLA,"Tesla, Inc.",300.720001,305.48999,300.399994,304.420013,48674600,0.015817,5.089996
4,2022-04-08 16:52:54+00:00,tesla $tsla block $sq and blockstream team up ...,TSLA,"Tesla, Inc.",347.736664,349.480011,340.813324,341.829987,55013700,-0.030049,8.666687


In [2]:
# Split into train (80%), temp (20%)
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42, shuffle=True)

# Split temp into validation (10%) and test (10%)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42, shuffle=True)

# Print the sizes
print(f"Training set: {len(train_df)} samples ({len(train_df)/len(df)*100:.1f}%)")
print(f"Validation set: {len(val_df)} samples ({len(val_df)/len(df)*100:.1f}%)")
print(f"Test set: {len(test_df)} samples ({len(test_df)/len(df)*100:.1f}%)")




Training set: 28824 samples (80.0%)
Validation set: 3603 samples (10.0%)
Test set: 3604 samples (10.0%)
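
The two-stage split works because holding out 20% and then halving that held-out part gives 10% each. The sketch below reproduces the sizes printed above with plain arithmetic. It assumes the held-out share is rounded up, which is my understanding of how `train_test_split` treats a fractional `test_size`; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_kept, n_held_out), rounding the held-out share up (assumed)."""
    n_held_out = math.ceil(test_fraction * n_samples)
    return n_samples - n_held_out, n_held_out


n_total = 36031  # total samples reported by the notebook

# Stage 1: keep 80% for training, hold out 20%
n_train, n_temp = split_sizes(n_total, 0.2)

# Stage 2: split the held-out 20% in half for validation and test
n_val, n_test = split_sizes(n_temp, 0.5)

print(n_train, n_val, n_test)  # 28824 3603 3604
```

The test set ends up one sample larger than the validation set because the 7207 held-out rows cannot be halved evenly.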


Creating DataLoaders for each split

In [3]:
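from torch.utils.data import DataLoader

batch_size = 32

# DataLoader fetches samples by integer position, which a DataFrame does not
# support (df[i] is a column lookup), so pass each split as a list of row dicts.
train_loader = DataLoader(train_df.to_dict("records"), batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_df.to_dict("records"), batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_df.to_dict("records"), batch_size=batch_size, shuffle=False)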
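
A `DataLoader` reads samples through `__len__` and integer-indexed `__getitem__`, so one common way to feed it DataFrame rows is a small map-style `Dataset`. The sketch below is only an illustration: the `TweetDataset` class, the choice of `Stock Name` as the label column, and the tiny demo frame are all assumptions, not something the notebook defines.

```python
import pandas as pd
from torch.utils.data import DataLoader, Dataset


class TweetDataset(Dataset):
    """Hypothetical wrapper that serves (text, label_id) pairs by position."""

    def __init__(self, frame, label_to_id,
                 text_col="cleaned_tweet", label_col="Stock Name"):
        # Materialise plain Python lists so lookups do not depend on the frame's index
        self.texts = frame[text_col].astype(str).tolist()
        self.labels = [label_to_id[name] for name in frame[label_col]]

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]


# Made-up rows, for illustration only
demo_df = pd.DataFrame({
    "cleaned_tweet": ["tesla $tsla up", "apple $aapl flat", "more $tsla news"],
    "Stock Name": ["TSLA", "AAPL", "TSLA"],
})

# Map each class name to an integer id (assumed labelling scheme)
label_to_id = {name: i for i, name in enumerate(sorted(demo_df["Stock Name"].unique()))}

demo_loader = DataLoader(TweetDataset(demo_df, label_to_id), batch_size=2, shuffle=False)
texts, labels = next(iter(demo_loader))  # strings stay strings, ids become a tensor
```

With the default collate function, the text entries of a batch stay a sequence of strings and the integer ids are stacked into a tensor, so tokenization would still have to happen in the model code or in a custom `collate_fn`.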


Model definition: classifying stocks based on sentiment.

The approach combines ticker extraction with multi-class text classification using a suitable pre-trained model.


In [4]:
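def label_from_ticker(text, ticker_map):
    """Return the mapped label when exactly one cashtag is present,
    "MULTI" when several are, and None when there is none or it is unmapped."""
    # Cashtags look like $TSLA: a dollar sign followed by 1 to 5 letters
    tickers = re.findall(r'\$[A-Z]{1,5}', text.upper())
    if len(tickers) == 1:
        return ticker_map.get(tickers[0][1:])  # drop the leading '$'
    elif len(tickers) > 1:
        return "MULTI"
    else:
        return None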
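
To make the labelling rule concrete, here is a small usage sketch. The function is repeated so the snippet runs on its own; `ticker_map` and the sample strings are hypothetical, since the notebook does not show the real mapping.

```python
import re


def label_from_ticker(text, ticker_map):
    # Same logic as the cell above: find cashtags such as $TSLA
    tickers = re.findall(r'\$[A-Z]{1,5}', text.upper())
    if len(tickers) == 1:
        return ticker_map.get(tickers[0][1:], None)
    elif len(tickers) > 1:
        return "MULTI"
    else:
        return None


# Hypothetical mapping from ticker symbol to class label
ticker_map = {"TSLA": "Tesla", "AAPL": "Apple"}

one = label_from_ticker("loading up on $tsla today", ticker_map)
several = label_from_ticker("tesla $tsla block $sq team up", ticker_map)
repeated = label_from_ticker("$tsla to the moon $tsla", ticker_map)
unmapped = label_from_ticker("$nvda looks strong", ticker_map)
missing = label_from_ticker("this is how tesla is clobbering everybody", ticker_map)

print(one, several, repeated, unmapped, missing)
# Tesla MULTI MULTI None None
```

Two behaviours are worth noticing: a tweet that repeats the same cashtag is labelled "MULTI" because matches are counted rather than deduplicated, and a company name without a cashtag yields no label at all. Whether either is intended depends on how the labels are used downstream.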
