# Boring stuff: setting everything up

*Warning: run this section only once*

Connect to your Google Drive so that your work is not lost when your session ends.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Change the working directory to the Colab Notebooks folder in your Google Drive.

In [None]:
%cd /content/drive/MyDrive/Colab Notebooks

/content/drive/MyDrive/Colab Notebooks


Create the main directory for the laboratory inside your Google Drive

In [None]:
!mkdir NLP_MASTER

mkdir: cannot create directory ‘NLP_MASTER’: File exists
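
The `File exists` message above is harmless, but it shows that a plain `mkdir` fails when the directory is already there. The shell form `mkdir -p NLP_MASTER` avoids the error. A minimal Python sketch of the same idea follows; the helper name `ensure_dir` is hypothetical and not part of the laboratory material:

```python
from pathlib import Path


def ensure_dir(path):
    """Create the directory (and any missing parents) if it does not exist yet."""
    directory = Path(path)
    # exist_ok=True turns the call into a no-op when the directory is already there
    directory.mkdir(parents=True, exist_ok=True)
    return directory


# Safe to run more than once: the second call changes nothing
ensure_dir("NLP_MASTER")
ensure_dir("NLP_MASTER")
```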


Remove unwanted directories (on your first run these directories do not exist, so the following two commands have no effect).

In [None]:
!rm -rf "/content/drive/MyDrive/Colab Notebooks/NLP_MASTER/finance"

In [None]:
!rm -rf "/content/drive/MyDrive/Colab Notebooks/NLP_MASTER/spacy-projects"

Now let's install all the dependencies for the laboratory

In [None]:
!pip install -U pip setuptools wheel

In [None]:
#!pip install -U spacy-nightly --pre

In [None]:
!pip install -U spacy transformers

Collecting spacy
  Downloading spacy-3.7.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Downloading spacy-3.7.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 20.3 MB/s eta 0:00:00
ERROR: Operation cancelled by user

Now that everything is set up, change working directory to the newly created directory NLP_MASTER in your Google Drive

In [None]:
%cd /content/drive/MyDrive/Colab Notebooks/NLP_MASTER/

/content/drive/MyDrive/Colab Notebooks/NLP_MASTER


Clone the official projects from the spaCy repository. You are going to start from [this one](https://github.com/explosion/projects/tree/v3/tutorials/textcat_goemotions) and adapt it to the sentiment classification of financial news headlines.

In [None]:
!git clone https://github.com/explosion/projects.git spacy-projects

fatal: destination path 'spacy-projects' already exists and is not an empty directory.


Now create a subdirectory "finance" inside NLP_MASTER and copy into it the textcat_goemotions tutorial you just cloned with git.

In [None]:
!mkdir finance

mkdir: cannot create directory ‘finance’: File exists


In [None]:
!cp -r spacy-projects/tutorials/textcat_goemotions/* finance/

cp: cannot stat 'spacy-projects/tutorials/textcat_goemotions/*': No such file or directory
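
The `cannot stat` message above means the source path did not exist when the copy ran, so nothing was copied. A small sketch that checks the source before copying can make this kind of failure easier to spot; `copy_project` is a hypothetical helper, not part of spaCy or of the tutorial:

```python
import shutil
from pathlib import Path


def copy_project(src, dst):
    """Copy the contents of src into dst, failing early if src is missing."""
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        raise FileNotFoundError(f"Source directory not found: {src}")
    # dirs_exist_ok=True (Python 3.8+) lets the copy merge into an existing dst
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst


# Intended use in this notebook (paths taken from the cells above):
# copy_project("spacy-projects/tutorials/textcat_goemotions", "finance")
```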


In [None]:
%cd /content/drive/MyDrive/Colab Notebooks/NLP_MASTER/finance/

/content/drive/MyDrive/Colab Notebooks/NLP_MASTER/finance


spaCy command line in action: now that we have moved into the root directory of the project, we tell spaCy to download everything the project needs in order to run.

In [None]:
!spacy project assets

ℹ Fetching 4 asset(s)
✔ Downloaded asset /content/drive/MyDrive/Colab Notebooks/NLP_MASTER/finance/assets/categories.txt
✔ Downloaded asset /content/drive/MyDrive/Colab Notebooks/NLP_MASTER/finance/assets/train.tsv
✔ Downloaded asset /content/drive/MyDrive/Colab Notebooks/NLP_MASTER/finance/assets/dev.tsv
✔ Downloaded asset /content/drive/MyDrive/Colab Notebooks/NLP_MASTER/finance/assets/test.tsv


# Sentiment analysis: Reddit Posts Dataset

You can take a look at the dataset [here](https://drive.google.com/file/d/118kEBuOXikDJhlAvDVmAVxNBymtQ5MKb/view?usp=sharing).

*Example records [TEXT_CONTENT, EMOTION_ID, TEXT_ID]:*

*   My favourite food is anything I didn't have to cook myself.	27	eebbqej
*   Thank you friend	15	eeqd04y
*   It's crazy how far Photoshop has come. Underwater bridges?!! NEVER!!!	7,13	efanc6t


Check out **assets/categories.txt** to explore the labels for this dataset. *The first row corresponds to the emotion_id 0, the second row to the emotion_id 1 and so on.*
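
To make the record layout concrete, here is a minimal sketch that splits one tab-separated line and resolves its emotion IDs by row position. The function name `parse_record` is hypothetical, and the label names are placeholders: in the project they would be the rows of assets/categories.txt (28 placeholders are enough to cover the IDs in the examples above).

```python
def parse_record(line, categories):
    """Split one TSV line into (text, label names, text id)."""
    text, emotion_ids, text_id = line.rstrip("\n").split("\t")
    # A record can carry several comma-separated emotion IDs, e.g. "7,13"
    labels = [categories[int(i)] for i in emotion_ids.split(",")]
    return text, labels, text_id


# Placeholder label names standing in for the rows of assets/categories.txt
categories = [f"label_{i}" for i in range(28)]

record = "Thank you friend\t15\teeqd04y"
print(parse_record(record, categories))
```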

---



***Edit project.yml and change gpu_id from -1 to 0 in order to take advantage of the Colab GPU***
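
If you prefer not to edit the file by hand, the change can also be scripted. The sketch below works on the text of the file and assumes the setting appears on its own line as `gpu_id: -1`; the sample YAML fragment and the helper name `set_gpu_id` are hypothetical.

```python
import re


def set_gpu_id(project_yml_text, gpu_id):
    """Return the project.yml text with the gpu_id value replaced."""
    # Match a line such as "  gpu_id: -1" and keep its indentation and key
    pattern = r"^([ \t]*gpu_id:[ \t]*)-?\d+"
    return re.sub(
        pattern,
        lambda m: f"{m.group(1)}{gpu_id}",
        project_yml_text,
        flags=re.MULTILINE,
    )


# Hypothetical fragment, only to illustrate the substitution
sample = 'vars:\n  config: "cnn"\n  gpu_id: -1\n'
print(set_gpu_id(sample, 0))
```

To apply it to the real file, read the text with `Path("project.yml").read_text()`, pass it through the function and write the result back.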

Let spaCy **preprocess the Reddit Posts Dataset** (assets/train.tsv, assets/dev.tsv, assets/test.tsv and assets/categories.txt) and convert it to the format it needs internally.

In [114]:
!spacy project run preprocess

Running command: /usr/bin/python3 scripts/convert_corpus.py


Now that the dataset has been processed, **let's train the model** on the Reddit posts!

In [115]:
!spacy project run train

Running command: /usr/bin/python3 -m spacy train ./configs/cnn.cfg -o training/cnn --gpu-id 0
✔ Created output directory: training/cnn
ℹ Saving to output directory: training/cnn
ℹ Using GPU: 0
✔ Initialized pipeline
ℹ Pipeline: ['textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTCAT  CATS_SCORE  SCORE 
---  ------  ------------  ----------  ------
  0       0          0.26       56.58    0.57
 16     200          5.60       84.48    0.84
 33     400          0.03       84.34    0.84
 50     600          0.00       85.51    0.86
 66     800          0.00       85.62    0.86
 83    1000          0.00       85.53    0.86
100    1200          0.00       85.74    0.86
116    1400          0.00       85.78    0.86
133    1600          0.00       85.76    0.86
150    1800          0.00       85.48    0.85
166    2000          0.00       85.37    0.85
183    2200          0.00       85.25    

Automatic spaCy evaluation of the model you just trained:

In [116]:
!spacy project run evaluate

Running command: /usr/bin/python3 -m spacy evaluate ./training/cnn/model-best ./corpus/test.spacy --output ./metrics/cnn.json
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

TOK                   100.00
TEXTCAT (macro AUC)   88.94
SPEED                 20465

               P       R       F
negative   67.33   41.74   51.53
neutral    86.07   91.67   88.78
positive   66.75   58.71   62.47

           ROC AUC
negative      0.88
neutral       0.93
positive      0.86

✔ Saved results to metrics/cnn.json

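To read the P/R/F table, it helps to recompute the last column. The numbers in the report are consistent with F being the harmonic mean of precision and recall (the F1 score); the sketch below recomputes it from the P and R values printed above. The helper name `f_score` is hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the evaluation report above
report = {
    "negative": (67.33, 41.74),
    "neutral": (86.07, 91.67),
    "positive": (66.75, 58.71),
}

for label, (p, r) in report.items():
    print(f"{label:<10}{f_score(p, r):.2f}")
```

The low recall on the negative class (41.74) is what pulls its F value down, even though its precision is close to that of the positive class.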

Let's test the model on some examples; **feel free to change them to whatever you want**!

In [153]:
import pandas as pd

# Load the two headline files and stack them into a single DataFrame
df1 = pd.read_csv("/content/drive/MyDrive/Colab_Notebooks/hist_fx_09_04_2020_02_04_2021.csv")
df2 = pd.read_csv("/content/drive/MyDrive/Colab_Notebooks/hist_fx_30_03_2021_05_06_2024.csv")

df_final = pd.concat([df1, df2])

In [154]:
import spacy
nlp = spacy.load("/content/drive/MyDrive/Colab Notebooks/NLP_MASTER/finance/training/cnn/model-best")


negative = []
neutral = []
positive = []
for doc in nlp.pipe(df_final["txt"]):
    # doc.cats maps each label to the score predicted by the text classifier
    pred = doc.cats
    negative.append(pred["negative"])
    neutral.append(pred["neutral"])
    positive.append(pred["positive"])

df_final["negative"] = negative
df_final["neutral"] = neutral
df_final["positive"] = positive
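
The loop above stores one score per label. When a single predicted class is needed instead, the highest-scoring label can be picked from the same dictionary. A minimal sketch, with the hypothetical helper name `top_label` and scores copied from one row of the results table in this notebook:

```python
def top_label(cats):
    """Return the label with the highest score together with that score."""
    label = max(cats, key=cats.get)
    return label, cats[label]


# Scores copied from one row of the results table in this notebook
example_cats = {"negative": 0.000055, "neutral": 0.985181, "positive": 0.010035}
print(top_label(example_cats))
```

Some rows of the results table (for example 0.045591 / 0.347507 / 0.002401) do not sum to 1, so a low top score may simply mean that no label fits the text well.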



In [158]:
# Average the sentiment scores of all rows that share the same "date" value
df_finalf = df_final.groupby("date").agg(negative=('negative', 'mean'), neutral=('neutral', 'mean'), positive=('positive', 'mean'))

In [151]:
df_finalf.index

Index(['2021-01-01', '2021-01-01', '2024-01-01', '2024-01-01', '2021-02-01',
       '2021-02-01', '2022-02-01', '2022-02-01', '2023-02-01', '2023-02-01',
       ...
       '2023-08-31', '2023-08-31', '2022-10-31', '2022-10-31', '2023-10-31',
       '2023-10-31', '2020-12-31', '2020-12-31', '2021-12-31', '2021-12-31'],
      dtype='object', name='date', length=2131)

In [160]:
# Drop the time of day so that the index keeps only the date (DD-MM-YYYY)
df_finalf.index = pd.to_datetime(df_finalf.index, format='%d-%m-%Y %H:%M').strftime('%d-%m-%Y')

In [161]:
df_finalf

| date | negative | neutral | positive |
|---|---|---|---|
| 01-01-2021 | 0.045591 | 0.347507 | 0.002401 |
| 01-01-2021 | 0.010174 | 0.233930 | 0.006298 |
| 01-01-2024 | 0.000055 | 0.985181 | 0.010035 |
| 01-01-2024 | 0.000189 | 0.999684 | 0.000021 |
| 01-02-2021 | 0.001550 | 0.841206 | 0.002725 |
| ... | ... | ... | ... |
| 31-10-2023 | 0.002908 | 0.985406 | 0.000067 |
| 31-12-2020 | 0.000833 | 0.208067 | 0.131062 |
| 31-12-2020 | 0.054524 | 0.350759 | 0.003904 |
| 31-12-2021 | 0.000121 | 0.979307 | 0.003975 |

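The `groupby("date")` call groups on the full timestamp, so once the time of day is stripped the index can still contain the same day more than once (see the two 01-01-2021 rows). If one row per day is wanted, the date can be truncated before averaging. A standard-library sketch of that idea follows; the helper name `daily_mean` is hypothetical and the scores are made up, while the timestamp format is the one used in this notebook:

```python
from collections import defaultdict
from datetime import datetime


def daily_mean(rows):
    """Average the scores of all rows that fall on the same calendar day.

    rows: iterable of (timestamp string "DD-MM-YYYY HH:MM", score) pairs.
    """
    by_day = defaultdict(list)
    for timestamp, score in rows:
        # Parse the timestamp and keep only the date part as the grouping key
        day = datetime.strptime(timestamp, "%d-%m-%Y %H:%M").strftime("%d-%m-%Y")
        by_day[day].append(score)
    return {day: sum(scores) / len(scores) for day, scores in by_day.items()}


# Made-up scores: two headlines from the same day collapse into one entry
rows = [("01-01-2021 00:20", 0.2), ("01-01-2021 03:14", 0.4), ("31-12-2021 00:57", 0.9)]
print(daily_mean(rows))
```
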
# Data Preparation: from the Reddit Posts Dataset to the Financial News Dataset
**TODO: Upload the Financial News Dataset file FinancialPhraseBank_AllAgree.txt to the assets folder. You can find the dataset [here](https://drive.google.com/file/d/1WXM2t8sh-myIEUZt37zIXC2McNrCyS2l/view?usp=sharing).**\
Financial news dataset example records [TEXT_CONTENT, SENTIMENT_LABEL]:


*   According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .@neutral
*   Finnish Talentum reports its operating profit increased to EUR 20.5 mn in 2005 from EUR 9.3 mn in 2004 , and net sales totaled EUR 103.3 mn , up from EUR 96.4 mn .@positive
*   Pharmaceuticals group Orion Corp reported a fall in its third-quarter earnings that were hit by larger expenditures on R&D and marketing .@negative
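
Each record is a sentence followed by `@` and the label. A minimal sketch of turning one such line into the `text<TAB>label_id<TAB>text_id` layout of the Reddit files is shown below. The label-to-ID mapping is a choice (it matches the one used later in this notebook), and both the helper name `to_tsv_row` and the ID `example-id-0` are hypothetical.

```python
LABEL_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}


def to_tsv_row(line, text_id):
    """Convert one 'sentence@label' line into 'text<TAB>label_id<TAB>text_id'."""
    # Split on the last "@" only, in case the sentence itself contains one
    text, label = line.rstrip("\n").rsplit("@", 1)
    return f"{text.strip()}\t{LABEL_TO_ID[label.strip()]}\t{text_id}"


line = (
    "Pharmaceuticals group Orion Corp reported a fall in its third-quarter "
    "earnings that were hit by larger expenditures on R&D and marketing .@negative"
)
print(to_tsv_row(line, "example-id-0"))
```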



---

Now you have to **format the Financial News Dataset like the Reddit Posts Dataset**, in order to retrain the sentiment classifier on the new financial dataset.

Remember to split the dataset into train (70%), validation (10%) and test (20%) sets, **saving the respective TSV files (train.tsv, dev.tsv, test.tsv) in the assets folder**.
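
As a sanity check on the proportions, here is a small standard-library sketch of a 70/10/20 split. It is not the method used in the cells below (those rely on scikit-learn); the helper name `split_70_10_20` and the seed are assumptions made for illustration.

```python
import random


def split_70_10_20(records, seed=42):
    """Shuffle the records and cut them into train (70%), dev (10%) and test (20%)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point surprises in the cut points
    n_train = len(shuffled) * 7 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # the remainder, about 20%
    return train, dev, test


train, dev, test = split_70_10_20(range(1000))
print(len(train), len(dev), len(test))
```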



In [129]:
df_finalf

| date | negative | neutral | positive |
|---|---|---|---|
| 01-01-2021 00:20 | 0.045591 | 0.347507 | 0.002401 |
| 01-01-2021 03:14 | 0.010174 | 0.233930 | 0.006298 |
| 01-01-2024 00:37 | 0.000055 | 0.985181 | 0.010035 |
| 01-01-2024 01:30 | 0.000189 | 0.999684 | 0.000021 |
| 01-02-2021 02:26 | 0.001550 | 0.841206 | 0.002725 |
| ... | ... | ... | ... |
| 31-10-2023 01:16 | 0.002908 | 0.985406 | 0.000067 |
| 31-12-2020 02:07 | 0.000833 | 0.208067 | 0.131062 |
| 31-12-2020 03:15 | 0.054524 | 0.350759 | 0.003904 |
| 31-12-2021 00:57 | 0.000121 | 0.979307 | 0.003975 |


In [162]:
import pandas as pd

In [163]:
data = pd.read_csv("/content/drive/MyDrive/Colab_Notebooks/FinancialPhraseBank_AllAgree (1).txt", sep="@", header=None, encoding="ISO-8859-1")

In [103]:
data

|  | 0 | 1 |
|---|---|---|
| 0 | According to Gran , the company has no plans t... | neutral |
| 1 | For the last quarter of 2010 , Componenta 's n... | positive |
| 2 | In the third quarter of 2010 , net sales incre... | positive |
| 3 | Operating profit rose to EUR 13.1 mn from EUR ... | positive |
| 4 | Operating profit totalled EUR 21.1 mn , up fro... | positive |
| ... | ... | ... |
| 2259 | Operating result for the 12-month period decre... | negative |
| 2260 | HELSINKI Thomson Financial - Shares in Cargote... | negative |
| 2261 | LONDON MarketWatch -- Share prices ended lower... | negative |
| 2262 | Operating profit fell to EUR 35.4 mn from EUR ... | negative |


In [104]:
import uuid

In [105]:
# Give every sentence a unique text ID, mirroring the TEXT_ID column of the Reddit files
data["uuid"] = [str(uuid.uuid4()) for _ in range(len(data))]

In [106]:
data

|  | 0 | 1 | uuid |
|---|---|---|---|
| 0 | According to Gran , the company has no plans t... | neutral | c6f2c946-ea14-46a7-b01b-c75afd042c7d |
| 1 | For the last quarter of 2010 , Componenta 's n... | positive | 94833092-4809-48a1-8f5a-2859881ff753 |
| 2 | In the third quarter of 2010 , net sales incre... | positive | dfb0e2fe-266c-4c7f-af53-30d81f090579 |
| 3 | Operating profit rose to EUR 13.1 mn from EUR ... | positive | 23ae4067-abb4-43f6-9060-0e01e6bf6798 |
| 4 | Operating profit totalled EUR 21.1 mn , up fro... | positive | 3233d966-2b81-4cb9-abe4-7519c96c3c2e |
| ... | ... | ... | ... |
| 2259 | Operating result for the 12-month period decre... | negative | 59fe3791-8712-4b8d-89ea-8a7b73b87ac4 |
| 2260 | HELSINKI Thomson Financial - Shares in Cargote... | negative | 4541721c-5b19-418d-b156-a93684dc1b0d |
| 2261 | LONDON MarketWatch -- Share prices ended lower... | negative | 99473c1f-be8c-4dd9-a0f0-1283c4d65d13 |
| 2262 | Operating profit fell to EUR 35.4 mn from EUR ... | negative | e556749b-888c-465d-8df3-f284a845207b |


In [107]:
# Encode the labels as integer IDs: negative=0, neutral=1, positive=2
data.loc[data[1] == "neutral", 1] = 1
data.loc[data[1] == "positive", 1] = 2
data.loc[data[1] == "negative", 1] = 0
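
Since the first row of assets/categories.txt corresponds to ID 0, the categories file presumably has to list the three sentiment labels in the same order as the IDs assigned above; otherwise the IDs would be resolved against the Reddit emotion names. A minimal sketch, assuming the file holds one label per line; the helper name `write_categories` and the target path are hypothetical:

```python
from pathlib import Path

# Order matters: row 0 -> ID 0, row 1 -> ID 1, row 2 -> ID 2
LABELS = ["negative", "neutral", "positive"]


def write_categories(path, labels=LABELS):
    """Write one label per line so that the row position equals the label ID."""
    Path(path).write_text("\n".join(labels) + "\n", encoding="utf-8")


# Hypothetical target path; adjust it to where the project keeps its assets
write_categories("categories.txt")
```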

In [108]:
data

|  | 0 | 1 | uuid |
|---|---|---|---|
| 0 | According to Gran , the company has no plans t... | 1 | c6f2c946-ea14-46a7-b01b-c75afd042c7d |
| 1 | For the last quarter of 2010 , Componenta 's n... | 2 | 94833092-4809-48a1-8f5a-2859881ff753 |
| 2 | In the third quarter of 2010 , net sales incre... | 2 | dfb0e2fe-266c-4c7f-af53-30d81f090579 |
| 3 | Operating profit rose to EUR 13.1 mn from EUR ... | 2 | 23ae4067-abb4-43f6-9060-0e01e6bf6798 |
| 4 | Operating profit totalled EUR 21.1 mn , up fro... | 2 | 3233d966-2b81-4cb9-abe4-7519c96c3c2e |
| ... | ... | ... | ... |
| 2259 | Operating result for the 12-month period decre... | 0 | 59fe3791-8712-4b8d-89ea-8a7b73b87ac4 |
| 2260 | HELSINKI Thomson Financial - Shares in Cargote... | 0 | 4541721c-5b19-418d-b156-a93684dc1b0d |
| 2261 | LONDON MarketWatch -- Share prices ended lower... | 0 | 99473c1f-be8c-4dd9-a0f0-1283c4d65d13 |
| 2262 | Operating profit fell to EUR 35.4 mn from EUR ... | 0 | e556749b-888c-465d-8df3-f284a845207b |


In [109]:
from sklearn.model_selection import train_test_split

In [110]:
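# Hold out 20% of the data as the test set
X_temp, X_test = train_test_split(data, test_size=0.2, random_state=42)

# Split the remaining 80% into train and validation: 0.125 * 0.8 = 0.1 of the full dataset
X_train, X_val = train_test_split(X_temp, test_size=0.125, random_state=42)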


In [111]:
X_train.to_csv('/content/drive/MyDrive/Colab Notebooks/NLP_MASTER/finance/assets/train.tsv', index=False, header=False, sep="\t")

In [112]:
X_test.to_csv('/content/drive/MyDrive/Colab Notebooks/NLP_MASTER/finance/assets/test.tsv', index=False, header=False, sep="\t")

In [113]:
X_val.to_csv('/content/drive/MyDrive/Colab Notebooks/NLP_MASTER/finance/assets/dev.tsv', index=False, header=False, sep="\t")
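
Before retraining, it can be worth checking that the saved files really have the three-column layout. The sketch below validates tab-separated rows from any file-like object; the helper name `check_tsv` is hypothetical, and the sample row (text and ID) is made up to stand in for a file such as assets/train.tsv.

```python
import csv
import io


def check_tsv(handle, valid_labels=("0", "1", "2")):
    """Count the rows and fail on the first one that is not text/label/id."""
    count = 0
    for row in csv.reader(handle, delimiter="\t"):
        assert len(row) == 3, f"expected 3 columns, got {len(row)}: {row}"
        assert row[1] in valid_labels, f"unexpected label {row[1]!r}"
        count += 1
    return count


# In-memory stand-in for a real file; the text and the ID are made up
sample = io.StringIO("Operating profit rose .\t2\texample-id-1\n")
print(check_tsv(sample))
```

For a real file, pass an open handle instead, for example `open(path, newline="", encoding="utf-8")`.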

In [164]:
df_finalf.to_csv("/content/drive/MyDrive/Colab_Notebooks/sentiment.csv")