# Trading Signal Generation
- Labels were originally assigned by observing whether the USD/BRL exchange rate increased, decreased, or showed low volatility (neutral).
- A 0.2% threshold was used to manually determine neutral days. This no longer matters, because that approach has been dropped in favor of Eli's labels.
- A word2vec model vectorizes the text, and new articles are then classified to generate buy/sell/hold signals.
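
For reference, the threshold-based labeling described in the first two bullets could look roughly like the sketch below. It is only an illustration of the earlier approach (the notebook now uses Eli's labels). The function name `label_from_rates`, the use of day-over-day relative change, and the mapping of an increase to +1 are all assumptions, and the rates are made up.

```python
import numpy as np

def label_from_rates(rates, threshold=0.002):
    """Label each day-over-day move in a rate series as +1, 0, or -1.

    rates: sequence of daily USD/BRL rates (hypothetical input).
    threshold: relative change below which a day counts as neutral
               (0.002 corresponds to the 0.2% mentioned above).
    """
    rates = np.asarray(rates, dtype=float)
    # Relative change from one day to the next.
    pct_change = np.diff(rates) / rates[:-1]
    # Start with every day neutral, then mark the larger moves.
    labels = np.zeros(pct_change.shape, dtype=int)
    labels[pct_change > threshold] = 1    # assumed: increase -> +1
    labels[pct_change < -threshold] = -1  # assumed: decrease -> -1
    return labels

# Made-up rates, for illustration only.
example_rates = [4.90, 4.95, 4.951, 4.90]
print(label_from_rates(example_rates))
```

The second move in the example is about 0.02%, well under the threshold, so it is the only one labeled neutral.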

Notes (in-progress)
- Avoid data leakage: don't train and test on the same data
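
One way to keep train and test data apart is to split strictly by date, which matches the task later in this notebook (train on January, test on early February). The sketch below uses tiny made-up frames in place of the real files. The cutoff of 15 February is an assumption about what "first 2 weeks" means. The two explicit date formats reflect the fact that the January file stores two-digit years while the February file stores four-digit years.

```python
import pandas as pd

# Tiny made-up frames standing in for the January and February files.
df_jan = pd.DataFrame({
    "date": ["09/01/24", "31/01/24"],
    "label": [0, -1],
})
df_feb = pd.DataFrame({
    "date": ["01/02/2024", "14/02/2024", "28/02/2024"],
    "label": [1, 0, 0],
})

# The two files use different year formats (dd/mm/yy vs dd/mm/yyyy).
df_jan["date"] = pd.to_datetime(df_jan["date"], format="%d/%m/%y")
df_feb["date"] = pd.to_datetime(df_feb["date"], format="%d/%m/%Y")

train = df_jan
# Assumption: "first 2 weeks of February" means 1-14 February 2024.
cutoff = pd.Timestamp("2024-02-15")
test = df_feb[df_feb["date"] < cutoff]

# Sanity check: every training date precedes every test date.
assert train["date"].max() < test["date"].min()
print(len(train), len(test))
```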

In [1]:
import numpy as np
import pandas as pd
import matplotlib

## January Preprocessing

In [2]:
file_path = "data/labeled_january_data.csv"

# Read the labeled January articles
df_jan = pd.read_csv(file_path, encoding="utf-8")

In [3]:
import spacy

# spacy PT model
nlp = spacy.load('pt_core_news_sm')

#preprocessing
def preprocess_text_spacy(text):
    doc = nlp(text)
    
    # lemmatization and stopwords removal
    tokens = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
    
    #tokens back to 1 string
    return ' '.join(tokens)

# preprocess ALL articles in df (the file)
df_jan['preprocessed_article'] = df_jan['article'].apply(preprocess_text_spacy)

# Display the processed articles
display(df_jan)

Unnamed: 0,date,article,label,preprocessed_article
0,09/01/24,"O petróleo testava reação moderada (+0,50%) no...",0,petróleo testar reação moderar pregão asiático...
1,09/01/24,Circularam comentários de que a reunião de Pac...,0,circularam comentário reunião Pacheco líder se...
2,09/01/24,"De qualquer modo, seis senadores estão com a p...",0,modo senador presença confirmar Único indicado...
3,09/01/24,"Nos EUA, sai a balança comercial de novembro (...",-1,EUA sair balança comercial novembro Fed boy Mi...
4,09/01/24,"O investidor cumpre a espera pela 5ªF, que pro...",-1,investidor cumprir espera prometer emoção CPI ...
...,...,...,...,...
1096,31/01/24,Emissão é de apenas uma série e já tem valor d...,0,Emissão série definir revelar executivo
1097,31/01/24,"ROMI teve lucro líquido de R$ 51,340 milhões n...",0,ROMI lucro líquido milhão queda
1098,31/01/24,ENEVA. Citi manteve recomendação de compra par...,0,ENEVA Citi manter recomendação compra ação ban...
1099,31/01/24,OI. Nova versão do plano de recuperação judici...,0,OI versão plano recuperação judicial concluir ...


In [8]:
df_jan.to_csv("saves/dataframes/january_df.csv", index=False)

## February Preprocessing

In [5]:
file_path = "data/labeled_february_data.csv"

# Read the labeled February articles
df_feb = pd.read_csv(file_path, encoding="utf-8")

In [6]:
# Reuse nlp and preprocess_text_spacy as defined in the January section

# preprocess ALL articles in df (the file)
df_feb['preprocessed_article'] = df_feb['article'].apply(preprocess_text_spacy)

# Display the processed articles
display(df_feb)

Unnamed: 0,date,article,label,preprocessed_article
0,01/02/2024,… O PMI industrial chinês medido pelo setor pr...,0,PMI industrial chinês meder setor privado fica...
1,01/02/2024,"… O texto do BC, praticamente igual ao anterio...",0,texto BC praticamente igual anterior dezembro ...
2,01/02/2024,"… Depois de baixar a Selic para 11,25%, o BC n...",1,baixar Selic BC mexeu quase comunicado parágra...
3,01/02/2024,"… O Copom não encurtou o horizonte de cortes, ...",1,Copom encurtar horizonte corte manter barra al...
4,01/02/2024,… Isso significa que março continua dado e que...,1,significar março continuar dar maio reservar s...
...,...,...,...,...
914,28/02/2024,CPFL PAULISTA. Conselho aprovou 14ª emissão de...,0,CPFL PAULISTA aprovar emissão debêntur montant...
915,28/02/2024,CPFL PIRATININGA. Conselho aprovou 16ª emissão...,0,CPFL PIRATININGA aprovar emissão debêntur mont...
916,28/02/2024,"UNIPAR informou a renúncia de Antonio Rabello,...",0,UNIPAR informar renúncia Antonio Rabello diret...
917,28/02/2024,GRUPO MATEUS concluiu venda de cinco imóveis p...,0,MATEUS concluir venda imóvel fundo TRX real mi...


In [9]:
df_feb.to_csv("saves/dataframes/february_df.csv", index=False)

## Applying the Word2Vec approach
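
The notebook does not show how word vectors are turned into article-level features. A common choice, assumed here, is to average the vectors of the tokens in each preprocessed article. The sketch below uses a plain dictionary of toy vectors in place of a trained Word2Vec model; `document_vector` and `toy_vectors` are hypothetical names.

```python
import numpy as np

def document_vector(text, word_vectors, dim):
    """Average the vectors of the known tokens in a preprocessed article."""
    vectors = [word_vectors[tok] for tok in text.split() if tok in word_vectors]
    if not vectors:
        # No known tokens: fall back to a zero vector of the right size.
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Toy 3-dimensional embeddings; a trained Word2Vec model would supply these.
toy_vectors = {
    "petróleo": np.array([1.0, 0.0, 2.0]),
    "queda": np.array([3.0, 2.0, 0.0]),
}

# "pregão" is not in the toy vocabulary, so it is skipped.
vec = document_vector("petróleo queda pregão", toy_vectors, dim=3)
print(vec)
```

Averaging discards word order but gives every article a fixed-length vector, which is what a logistic regression classifier needs as input.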

## Visualising the distribution of the Target variable
- Reveals class imbalance across the three labels
- Worth keeping track of
- TODO: move this into the preprocessing file eventually

In [None]:
# January

import pandas as pd

file_path = 'data/labeled_january_data.csv'
df_jan = pd.read_csv(file_path)

# Count how many articles carry each label (1, -1, 0)
counts = df_jan['label'].value_counts()

# .get() avoids a KeyError if a label is missing from the data
count_1 = counts.get(1, 0)
count_minus_1 = counts.get(-1, 0)
count_0 = counts.get(0, 0)

print("January\n")
print(f"Count of 1: {count_1}")
print(f"Count of -1: {count_minus_1}")
print(f"Count of 0: {count_0}")
print(f"Total Number of Articles: {count_1 + count_minus_1 + count_0}")

January

Count of 1: 182
Count of -1: 363
Count of 0: 556
Total Number of Articles: 1101


In [None]:
# February

import pandas as pd

file_path = 'data/labeled_february_data.csv'
df = pd.read_csv(file_path)

# Count how many articles carry each label (1, -1, 0)
counts = df['label'].value_counts()

# .get() avoids a KeyError if a label is missing from the data
count_1 = counts.get(1, 0)
count_minus_1 = counts.get(-1, 0)
count_0 = counts.get(0, 0)

print("February\n")
print(f"Count of 1: {count_1}")
print(f"Count of -1: {count_minus_1}")
print(f"Count of 0: {count_0}")
print(f"Total Number of Articles: {count_1 + count_minus_1 + count_0}")

February

Count of 1: 113
Count of -1: 157
Count of 0: 649
Total Number of Articles: 919
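
The counts above show a clear skew toward the neutral class: about half of the January articles and about 71% of the February articles are labeled 0. As a rough sketch of how to quantify this, the code below computes class shares and "balanced" class weights using the formula n_samples / (n_classes * class_count), the same heuristic that scikit-learn's `class_weight="balanced"` option applies. Whether reweighting is appropriate for this project is an open question; the helper name `balanced_weights` is hypothetical.

```python
# Label counts taken from the outputs above.
counts = {
    "January": {1: 182, -1: 363, 0: 556},
    "February": {1: 113, -1: 157, 0: 649},
}

def balanced_weights(label_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(label_counts.values())
    n_classes = len(label_counts)
    return {label: total / (n_classes * n) for label, n in label_counts.items()}

for month, label_counts in counts.items():
    total = sum(label_counts.values())
    # Fraction of articles in each class.
    shares = {label: round(n / total, 3) for label, n in label_counts.items()}
    # Rarer classes receive larger weights.
    weights = {label: round(w, 3) for label, w in balanced_weights(label_counts).items()}
    print(month, "shares:", shares, "weights:", weights)
```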


## Multinomial Logistic Regression Model with Custom Word2Vec Model

Task: 
- Train on January, test on the first 2 weeks of February
    - Train / test with 3 classes (+1, 0, -1), yielding a 3 x 3 confusion matrix
    - Train / test with 2 classes (+1, -1), yielding a 2 x 2 confusion matrix
- If accuracy is poor, try applying softmax to improve it
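
To make the softmax step concrete, here is a minimal sketch of what it does to a set of class scores. The scores are made up and the class order (-1, 0, +1) is an assumption. Note that a multinomial logistic regression already applies softmax internally when it produces class probabilities.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: turn class scores into probabilities that sum to 1."""
    scores = np.asarray(scores, dtype=float)
    # Subtract the row maximum for numerical stability.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up scores for two articles over the classes (-1, 0, +1).
classes = np.array([-1, 0, 1])
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 3.0]])

probs = softmax(scores)
# Predicted class = the one with the highest probability.
y_pred = classes[probs.argmax(axis=1)]
print(probs.round(3))
print(y_pred)
```

Because softmax preserves the ordering of the scores, the predicted class is the same as picking the largest raw score. The probabilities are mainly useful for inspecting confidence or applying a custom decision threshold.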

In [46]:
# Confusion Matrix (3 classes)

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# y_test holds the actual labels and y_pred the predicted labels
# Fix the label order so rows and columns match the tick labels below
cm = confusion_matrix(y_test, y_pred, labels=[-1, 0, 1])

# heatmap
class_names = ['Negative', 'Neutral', 'Positive']
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)

plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')

# display plot
plt.show()

In [51]:
# Confusion Matrix (2 classes)

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# y_test holds the actual labels and y_pred the predicted labels
# Fix the label order so rows and columns match the tick labels below
cm = confusion_matrix(y_test, y_pred, labels=[-1, 1])

# heatmap
class_names = ['Negative', 'Positive']
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)

plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')

# display plot
plt.show()