**<h1 align="center">Text Mining</h1>**
**<h2 align="center">Stock Sentiment: Predicting market behavior from tweets</h2>**

This notebook presents the final solution for our Text Mining project on market sentiment classification based on tweets. Our approach uses the DistilBERT transformer model, fine-tuned on the labeled training data to classify each tweet as Bearish (0), Bullish (1), or Neutral (2). The solution includes preprocessing, tokenization, dataset preparation, model training, evaluation, and prediction. For a detailed overview of the experimentation process and alternative methods tested, please refer to the tm_tests_05 notebook and the accompanying report.

<a class="anchor" id="chapter1"></a>

# 1. Imports

</a>

In [1]:
# Standard Libraries
import re
import string
import numpy as np
import pandas as pd

# Text Preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem import WordNetLemmatizer

# Model Evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_score, recall_score

# PyTorch Core
import torch
from torch.utils.data import Dataset

# Transformers (Hugging Face)
from transformers import (
    DistilBertTokenizer, DistilBertTokenizerFast, DistilBertForSequenceClassification,
    Trainer, TrainingArguments
)

  from .autonotebook import tqdm as notebook_tqdm





In [2]:
# Load the datasets
df_train = pd.read_csv('../Data/train.csv')
df_test = pd.read_csv('../Data/test.csv')

<a class="anchor" id="chapter2"></a>

# 2. Data Split

</a>

In [3]:
# Hold-out method with stratification
train_df, val_df = train_test_split(df_train, test_size=0.2, stratify=df_train['label'], random_state=42)

<a class="anchor" id="chapter3"></a>

# 3. Data Preprocessing

</a>

In [4]:
# Download required resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ruben\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ruben\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ruben\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ruben\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [5]:
# Initialize tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
tokenizer = TreebankWordTokenizer()

In [6]:
# Preprocessing function
def preprocess_text(text):
    # 1. Lowercase
    text = text.lower()
    
    # 2. Regex Cleaning
    text = re.sub(r"http\S+|www\S+", '', text)                         # Remove URLs
    text = re.sub(r"@\w+|#\w+|rt", '', text)                           # Remove mentions, hashtags, RT
    text = re.sub(r"br", "", text)                                     # Remove 'br' (e.g. <br> tags)
    text = re.sub(f"[{re.escape(string.punctuation)}]", '', text)      # Remove punctuation
    text = re.sub(r"[^a-zA-Z\s]", ' ', text)                           # Remove numbers and special characters
    text = re.sub(r"\s+", " ", text).strip()                           # Remove extra whitespace

    # 3. Tokenize using Treebank tokenizer
    tokens = tokenizer.tokenize(text)

    # 4. Remove stopwords and short tokens, then lemmatize
    clean_tokens = [
        lemmatizer.lemmatize(token)
        for token in tokens
        if token not in stop_words and len(token) > 2
    ]

    return " ".join(clean_tokens)

In [7]:
# Apply function to train, validation and test datasets
train_df['clean_text'] = train_df['text'].fillna('').apply(preprocess_text)
val_df['clean_text']   = val_df['text'].fillna('').apply(preprocess_text)
df_test['clean_text']  = df_test['text'].fillna('').apply(preprocess_text)

In [8]:
# Check before and after cleaning
print("Original tweet:\n", train_df['text'].iloc[6])
print("Cleaned tweet:\n", train_df['clean_text'].iloc[6])

Original tweet:
 Could Applied DNA Sciences, Inc. (APDN) See a Reversal After Breaking Its 52 Week Low? - The Lamp News
Cleaned tweet:
 could applied dna science inc apdn see reversal eaking week low lamp news


In [9]:
# Add cleaned text to the datasets
train_df['clean_text'] = train_df['text'].fillna('').apply(preprocess_text)
val_df['clean_text']   = val_df['text'].fillna('').apply(preprocess_text)
df_test['clean_text']  = df_test['text'].fillna('').apply(preprocess_text)

In [10]:
# Get the labels for training and validation sets
y_train = train_df['label']
y_val = val_df['label']

In [11]:
# Get the cleaned text for training, validation, and test sets
X_train_cleaned = train_df['clean_text']
X_val_cleaned = val_df['clean_text']
X_test_cleaned = df_test['clean_text']

In [12]:
# Check the columns of the test DataFrame
df_test.columns

Index(['id', 'text', 'clean_text'], dtype='object')

<a class="anchor" id="chapter4"></a>

# 4. Feature Engineering: Distil BERT

</a>

In [13]:
# Load pretrained DistilBERT model and tokenizer
bert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

In [14]:
# Custom PyTorch Dataset for training/validation data
class FinDataset(Dataset):
    def __init__(self, texts, labels):
        self.encodings = bert_tokenizer(
            texts.tolist(),
            truncation=True,
            padding=True,
            max_length=64
        )
        self.labels = labels.tolist()

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Custom PyTorch Dataset for inference (test set), no labels
class InferenceDataset(Dataset): 
    def __init__(self, texts):
        self.encodings = bert_tokenizer(
            texts.tolist(),
            truncation=True,
            padding=True,
            max_length=64
        )

    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}

    def __len__(self):
        return len(self.encodings['input_ids'])

# Instantiate datasets for training, validation, and test sets
train_dataset = FinDataset(X_train_cleaned, y_train)
val_dataset   = FinDataset(X_val_cleaned, y_val)
test_dataset = InferenceDataset(X_test_cleaned)

<a class="anchor" id="chapter5"></a>

# 5. Transformer

</a>

**Distil BERT Fine-tuned**

In [15]:
# Load tokenizer and model
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
# Define metrics
def compute_metrics(pred):
    preds = np.argmax(pred.predictions, axis=1)
    labels = pred.label_ids
    return {
        "precision": precision_score(labels, preds, average="macro"),
        "recall": recall_score(labels, preds, average="macro"),
        "f1": f1_score(labels, preds, average="macro"),
    }

In [17]:
# Define training configuration and hyperparameters for the DistilBERT model
training_args = TrainingArguments(
    output_dir="./bert_output",       
    do_train=True, # Training mode
    do_eval=True,
    num_train_epochs=2,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    save_strategy="no",                   
    logging_strategy="no",              
    report_to=[],                          
)

In [18]:
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()


Step,Training Loss


TrainOutput(global_step=478, training_loss=0.5034384946942828, metrics={'train_runtime': 500.4012, 'train_samples_per_second': 30.512, 'train_steps_per_second': 0.955, 'total_flos': 201464773197336.0, 'train_loss': 0.5034384946942828, 'epoch': 2.0})

<a class="anchor" id="chapter6"></a>

# 6. Predictions on Test

</a>

In [19]:
# Predict on the test dataset
test_preds = trainer.predict(test_dataset)
y_test_pred = np.argmax(test_preds.predictions, axis=1)

# Save predictions to CSV
df_test['prediction'] = y_test_pred
df_test[['id', 'prediction']].to_csv("pred_05.csv", index=False)