## Natural Language Processing (NLP) II in Finance

This notebook provides an example of Topic Modeling in finance.

#### NLP Packages

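* NLTK - a natural language toolkit for Python (see https://www.nltk.org)
* spaCy - an NLP library, used here for lemmatization (see https://spacy.io)
* Gensim - a topic modeling library for Python (see https://radimrehurek.com/gensim/)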


In [1]:
import re
import pandas as pd

# NLP Toolkits
import nltk
import en_core_web_sm

In [2]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))

nlp = en_core_web_sm.load()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [4]:
!pip install numpy==1.26.4
!pip install pandas==1.5.3


In [5]:
!pip install gensim



In [7]:
from gensim import corpora, models
from gensim.utils import simple_preprocess

#### Text Data

We'll load:
* **The Financial Phrase Bank** (Malo et al., 2014): This dataset comprises sentences from financial news that have been annotated with sentiment labels (positive, neutral, or negative). You can download this dataset from the Hugging Face datasets repository:


> https://huggingface.co/datasets/financial_phrasebank/tree/main/data

In [8]:
df_phrases = pd.read_csv('Sentences_AllAgree.txt', delimiter='@', encoding='latin-1', on_bad_lines='skip', names=['Phrase', 'Sentiment'])
df_phrases.head()



| | Phrase | Sentiment |
|---|---|---|
| 0 | According to Gran , the company has no plans t... | neutral |
| 1 | For the last quarter of 2010 , Componenta 's n... | positive |
| 2 | In the third quarter of 2010 , net sales incre... | positive |
| 3 | Operating profit rose to EUR 13.1 mn from EUR ... | positive |
| 4 | Operating profit totalled EUR 21.1 mn , up fro... | positive |
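The `read_csv` call above splits each line on `@` and, with `on_bad_lines='skip'`, silently drops any line that contains more than one `@`. As a hedged alternative, the sketch below splits only on the last `@` so that such lines would be kept. The helper name `parse_phrasebank_line` and the two sample lines are made up for illustration; they are not taken from the dataset.

```python
def parse_phrasebank_line(line):
    """Split one 'sentence@label' line on the LAST '@' only (hypothetical helper)."""
    sentence, sep, label = line.rstrip("\n").rpartition("@")
    if not sep:
        raise ValueError(f"no '@' separator in line: {line!r}")
    return sentence.strip(), label.strip()


# Made-up lines in the same 'sentence@label' layout as the dataset file
sample_lines = [
    "Net sales rose in the quarter .@positive\n",
    "Contact ir@example.com for the full report .@neutral\n",
]

parsed = [parse_phrasebank_line(line) for line in sample_lines]
for sentence, label in parsed:
    print(label, "|", sentence)
```

The resulting list of `(sentence, label)` tuples could then be passed to `pd.DataFrame(parsed, columns=['Phrase', 'Sentiment'])` if this route is preferred over `read_csv`.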


### Text as Data

#### Preprocessing text data

The following function, `preprocess_text`, prepares a text for natural language processing (NLP) tasks in several steps:
* Converting all characters to lowercase, so that matching is case-insensitive.
* Removing punctuation: a regular expression replaces every run of non-word characters (anything other than letters, digits, and underscores) with a single space.
* Tokenizing the text with NLTK and, when the `stopwrds` flag is true, filtering out stopwords, i.e., common words like "is", "and", "the" that carry little meaning on their own.
* Lemmatization with spaCy: reducing words to their base or dictionary form. For example, "running" becomes "run".
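The cleaning steps in the list above (everything except lemmatization) can be sketched with the standard library alone, which makes their effect easy to inspect. This is only an illustration under stated assumptions: `clean_text` and `DEMO_STOPWORDS` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, and whitespace splitting stands in for `word_tokenize`.

```python
import re

# Tiny illustrative stopword set (an assumption; the notebook uses NLTK's English list)
DEMO_STOPWORDS = {"the", "to", "from", "of", "in", "a", "is", "and"}


def clean_text(text, remove_stopwords=True):
    """Lowercase, replace non-word characters with spaces, optionally drop stopwords."""
    text = text.lower()
    # \W matches anything that is not a letter, digit, or underscore
    text = re.sub(r"\W+", " ", text)
    tokens = text.split()  # whitespace split stands in for NLTK's word_tokenize
    if remove_stopwords:
        tokens = [tok for tok in tokens if tok not in DEMO_STOPWORDS]
    return " ".join(tokens)


print(clean_text("Profit rose to EUR 13.1 mn, up from EUR 8.7 mn."))
# profit rose eur 13 1 mn up eur 8 7 mn
```

Note that decimal numbers are split at the decimal point ("13.1" becomes "13 1"), which matches the pre-processed output shown later in this notebook.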

In [9]:
def preprocess_text(text, stopwrds):
  """Lowercase, strip punctuation, optionally drop stopwords, and lemmatize.

  `stopwrds` is a boolean flag: when True, tokens found in the global
  `stop_words` set are removed before lemmatization.
  """
  # Convert text to lowercase
  text = text.lower()

  # Replace runs of non-word characters (punctuation, symbols) with a space
  text = re.sub(r'\W+', ' ', text)

  # Remove stop words
  if stopwrds:
    # Tokenize text using NLTK
    word_tokens = word_tokenize(text)
    filtered_tokens = [word for word in word_tokens if word not in stop_words]

    # Reconstruct the text without stop words
    text = ' '.join(filtered_tokens)

  # Use spaCy for lemmatization
  doc = nlp(text)
  lemmatized_text = " ".join([token.lemma_ for token in doc])

  return lemmatized_text

In [10]:
df_phrases['Pre_Processed'] = df_phrases['Phrase'].apply(lambda x: preprocess_text(x, True))
df_phrases['Pre_Processed_w_stopwords'] = df_phrases['Phrase'].apply(lambda x: preprocess_text(x, False))
df_phrases.head()



| | Phrase | Sentiment | Pre_Processed | Pre_Processed_w_stopwords |
|---|---|---|---|---|
| 0 | According to Gran , the company has no plans t... | neutral | accord gran company plan move production russi... | accord to gran the company have no plan to mov... |
| 1 | For the last quarter of 2010 , Componenta 's n... | positive | last quarter 2010 componenta net sale double e... | for the last quarter of 2010 componenta s net ... |
| 2 | In the third quarter of 2010 , net sales incre... | positive | third quarter 2010 net sale increase 5 2 eur 2... | in the third quarter of 2010 net sale increase... |
| 3 | Operating profit rose to EUR 13.1 mn from EUR ... | positive | operating profit rise eur 13 1 mn eur 8 7 mn c... | operating profit rise to eur 13 1 mn from eur ... |
| 4 | Operating profit totalled EUR 21.1 mn , up fro... | positive | operating profit total eur 21 1 mn eur 18 6 mn... | operating profit total eur 21 1 mn up from eur... |


#### Bag of Words (document-term matrix)

The following code block builds a Bag of Words (BoW) representation of the pre-processed phrases: Gensim's `Dictionary` maps each unique token to an integer id, and `doc2bow` converts each phrase into a list of (token id, count) pairs.

In [14]:
texts = df_phrases['Pre_Processed'].apply(simple_preprocess)

# Create a dictionary representation of the documents
dictionary = corpora.Dictionary(texts)

# Convert dictionary to a Bag of Words corpus
corpus = [dictionary.doc2bow(text) for text in texts]
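To make the Bag of Words representation concrete, here is a minimal standard-library sketch of the idea behind `Dictionary` and `doc2bow`. It is not Gensim's implementation: the names `token2id` and `to_bow` and the toy documents are assumptions for illustration, and Gensim may assign ids in a different order.

```python
from collections import Counter

# Two toy tokenized documents (made up for illustration)
docs = [
    ["net", "sale", "increase"],
    ["operating", "profit", "increase", "profit"],
]

# Assign integer ids in order of first appearance
# (Gensim's actual id assignment may differ)
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))


def to_bow(doc):
    """Return sorted (token_id, count) pairs, ignoring tokens not in the vocabulary."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


bow_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(bow_corpus)
# [[(0, 1), (1, 1), (2, 1)], [(2, 1), (3, 1), (4, 2)]]
```

Word order is discarded: each document is reduced to which vocabulary entries it contains and how often, which is all that LDA needs as input.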

Fit an LDA (Latent Dirichlet Allocation) model with two topics, making 20 passes over the corpus:

In [15]:
lda_model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=20)

### Visualizing the Topics


In [16]:
!pip install pyLDAvis


In [17]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

In [18]:
pyLDAvis.enable_notebook()
prepared_vis = gensimvis.prepare(lda_model, corpus, dictionary)

# Visualizing
pyLDAvis.display(prepared_vis)