# Topic Modeling
Topic modeling is an unsupervised machine learning technique that identifies groups of words that tend to occur together in text. Applied to a collection of documents, it uncovers recurring themes and groups the documents by them; in a business archive, for example, the themes might correspond to contracts, invoices, and complaints.

In this project, I use Latent Dirichlet Allocation (LDA) as the primary topic modeling method. LDA is applied to the text of FOMC meeting minutes to identify the recurring themes in each meeting, which gives a clearer picture of the topics that dominated the discussion.


I follow an approach similar to the one described in this [blog](https://highdemandskills.com/topic-trends-fomc/#h2-1) post and implement topic modeling with the [gensim](https://radimrehurek.com/gensim/models/ldamulticore.html#module-gensim.models.ldamulticore) package. The methodology in that post strongly shapes the analysis in this notebook.
 

In [1]:
import pandas as pd
import gensim
import gensim.corpora as corpora
from gensim import models
from pprint import pprint
from itertools import chain
import time

In [2]:
fomc = pd.read_pickle('../data/fomc_data.pkl')
fomc.head(3)

Unnamed: 0,minutes_paragraphs,paragraphs_length,minutes_text,text_length
1993-02-03,"[[meeting, federal, open, market, committee, h...","[12, 15, 24, 29, 12, 32, 37, 32, 14, 16, 82, 5...",meeting federal open market committee hold off...,4437
1993-03-23,"[[meeting, federal, open, market, committee, h...","[11, 13, 64, 23, 24, 28, 60, 51, 64, 56, 100, ...",meeting federal open market committee hold off...,2789
1993-05-18,"[[meeting, federal, open, market, committee, h...","[11, 26, 19, 25, 27, 62, 46, 54, 37, 89, 56, 6...",meeting federal open market committee hold off...,2354


To train the LDA model, we first create a dictionary from the corpus that maps each unique word to an integer ID. Next, we convert the corpus into a bag-of-words representation, which records how often each word occurs in each document; here, each document is a single paragraph of the minutes. Finally, we apply TF-IDF (Term Frequency-Inverse Document Frequency) to the bag-of-words corpus, which reweights the counts so that words appearing in many documents count for less. Together, these steps are known as text representation.

Text representation is a critical step in preparing the data for LDA, because it converts the text into a numerical format that the model can process. The dictionary, bag-of-words, and TF-IDF objects built here are reused later, when the trained model is applied to each paragraph of the FOMC minutes.
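The three steps can be sketched in plain Python to show what each representation contains. This is an illustration only, not gensim's implementation: the toy documents, the `token2id` mapping, and the helpers `to_bow` and `to_tfidf` are hypothetical names, and the weighting shown (raw count times log2 inverse document frequency, followed by L2 normalization) is one common TF-IDF variant assumed for the example.

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical data, not taken from the FOMC minutes).
docs = [
    ["inflation", "rate", "rise"],
    ["inflation", "inflation", "policy"],
    ["rate", "policy", "policy", "committee"],
]

# Step 1: dictionary. Map each unique token to an integer ID.
vocab = sorted({tok for doc in docs for tok in doc})
token2id = {tok: i for i, tok in enumerate(vocab)}


def to_bow(doc):
    """Step 2: bag of words, a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())


bow_corpus = [to_bow(doc) for doc in docs]

# Step 3: TF-IDF. Count how many documents contain each token ID.
n_docs = len(bow_corpus)
doc_freq = Counter(tid for bow in bow_corpus for tid, _ in bow)


def to_tfidf(bow):
    """Weight each count by log2(N / document frequency), then L2-normalize."""
    weights = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in bow]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []
    return [(tid, w / norm) for tid, w in weights if w > 0]


tfidf_corpus = [to_tfidf(bow) for bow in bow_corpus]

print(token2id)
print(bow_corpus[1])
print(tfidf_corpus[0])
```

In the first toy document, "rise" appears in only one document, so it receives a larger weight than "inflation" and "rate", which each appear in two.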

In [3]:
# Concatenate all paragraphs from the 'fomc' DataFrame into a single list using chain.from_iterable
# This creates a comprehensive list of all paragraphs from the FOMC meeting minutes
fomcminute_full_list = list(chain.from_iterable(fomc['minutes_paragraphs']))

# Create a Dictionary (ID2word) to map unique words in the paragraphs to unique IDs
# This step is essential for subsequent processing and analysis
ID2word = corpora.Dictionary(fomcminute_full_list)

# Create the Bag of Words (BoW) corpus for all documents in the fomcminute_full_list
# BoW representation is a numerical representation of the paragraphs, capturing word frequencies in each document
corpus = [ID2word.doc2bow(doc) for doc in fomcminute_full_list]

# Alternatively, we can define 'corpus' in this way if we have previously added 'doc2bow' to the 'fomc' DataFrame
# corpus = list(chain.from_iterable(fomc['doc2bow']))

# Fit the Term Frequency-Inverse Document Frequency (TF-IDF) model to the BoW corpus
TFIDF = models.TfidfModel(corpus)

# Apply the TF-IDF model to the BoW corpus to transform it into a TF-IDF representation
trans_TFIDF = TFIDF[corpus]

In [4]:
def apply_doc2bow(x):
    """
    Apply the Gensim Dictionary.doc2bow() function to convert a list of documents
    into Bag-of-Words (BoW) representation.

    The function takes a list of documents, where each document is represented as
    a list of tokens. It applies the Gensim Dictionary.doc2bow() function to each
    document to convert it into a BoW representation, which is a list of tuples
    (word_id, word_frequency) for each word in the document.

    Parameters:
        x (list): A list of documents, where each document is represented as a list of tokens.

    Returns:
        list: A list of BoW representations for each document in the input list.
              Each BoW representation is a list of tuples (word_id, word_frequency).

    Example:
        Given the input list of documents 'x':
        [
            ['apple', 'orange', 'banana'],
            ['apple', 'apple', 'grape', 'grape'],
            ['orange', 'orange', 'orange', 'apple']
        ]

        The function will return:
        [
            [(0, 1), (1, 1), (2, 1)],
            [(0, 2), (3, 2)],
            [(0, 1), (2, 3)]
        ]


        In this example, the Gensim Dictionary object (ID2word) is assumed to be defined
        outside this function, containing the mapping of words to word IDs.

    Note:
        The ID2word dictionary should be created using the Gensim corpora.Dictionary class,
        and it should be shared across all the functions in the pipeline for consistency.
    """
    
    return [ID2word.doc2bow(sublist) for sublist in x]

In [5]:
fomc['doc2bow'] = fomc['minutes_paragraphs'].apply(apply_doc2bow)
fomc.head(3)

Unnamed: 0,minutes_paragraphs,paragraphs_length,minutes_text,text_length,doc2bow
1993-02-03,"[[meeting, federal, open, market, committee, h...","[12, 15, 24, 29, 12, 32, 37, 32, 14, 16, 82, 5...",meeting federal open market committee hold off...,4437,"[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, ..."
1993-03-23,"[[meeting, federal, open, market, committee, h...","[11, 13, 64, 23, 24, 28, 60, 51, 64, 56, 100, ...",meeting federal open market committee hold off...,2789,"[[(0, 1), (1, 1), (3, 1), (4, 1), (5, 1), (6, ..."
1993-05-18,"[[meeting, federal, open, market, committee, h...","[11, 26, 19, 25, 27, 62, 46, 54, 37, 89, 56, 6...",meeting federal open market committee hold off...,2354,"[[(0, 1), (1, 1), (3, 1), (4, 1), (5, 1), (6, ..."


In [6]:
start_time = time.time()

SEED = 130 # Set random seed
NUM_topics = 6 # Set number of topics
ALPHA = 0.15 # Set alpha
ETA = 1.25 # Set eta

# Train the LDA model on the TF-IDF-transformed corpus
lda_model = gensim.models.LdaMulticore(corpus=trans_TFIDF, 
                                       num_topics=NUM_topics, 
                                       id2word=ID2word, 
                                       random_state=SEED, 
                                       alpha=ALPHA, 
                                       eta=ETA, 
                                       passes=100)

# Print topics generated from the training corpus
pprint(lda_model.print_topics(num_words=10))

end_time = time.time()
execution_time = end_time - start_time
print('\n')
print("Execution time:", execution_time, "seconds")

[(0,
  '0.011*"foreign" + 0.011*"system" + 0.011*"open" + 0.011*"operation" + '
  '0.010*"currency" + 0.010*"transaction" + 0.009*"security" + 0.008*"agency" '
  '+ 0.008*"account" + 0.008*"manager"'),
 (1,
  '0.007*"consumer" + 0.006*"quarter" + 0.006*"spending" + 0.006*"price" + '
  '0.006*"business" + 0.005*"sale" + 0.005*"month" + 0.005*"inventory" + '
  '0.004*"production" + 0.004*"inflation"'),
 (2,
  '0.008*"yield" + 0.007*"export" + 0.006*"dollar" + 0.006*"foreign" + '
  '0.005*"import" + 0.005*"period" + 0.005*"trade" + 0.004*"equity" + '
  '0.004*"index" + 0.004*"intermeeting"'),
 (3,
  '0.010*"inflation" + 0.008*"participant" + 0.007*"policy" + '
  '0.007*"committee" + 0.007*"member" + 0.006*"economic" + 0.006*"risk" + '
  '0.005*"fund" + 0.005*"percent" + 0.005*"monetary"'),
 (4,
  '0.011*"loan" + 0.008*"secretary" + 0.008*"credit" + 0.007*"economist" + '
  '0.006*"issuance" + 0.006*"commercial" + 0.006*"bank" + 0.005*"mortgage" + '
  '0.005*"nonfinancial" + 0.005*"general"

In [7]:
def topic_weight(x):
    """
    Get the topic weight for a list of documents using the trained Latent Dirichlet Allocation (LDA) model.

    The function takes a list of documents, where each document is represented as a list of tokens. Each document
    is converted to a Bag-of-Words vector with the Dictionary (ID2word), reweighted with the TF-IDF model (TFIDF),
    and passed to the trained LDA model (lda_model) to obtain its topic probabilities. Setting minimum_probability=0
    lowers the reporting threshold so that low-probability topics are kept in the output.

    Parameters:
        x (list): A list of documents, where each document is represented as a list of tokens.

    Returns:
        list: A list of topic weight distributions for each document in the input list.
              Each topic weight distribution is a list of tuples (topic_id, probability) representing the
              probability of each topic in the document.

    Example:
        Given the input list of documents 'x':
        [
            ['apple', 'orange', 'banana'],
            ['apple', 'apple', 'grape', 'grape'],
            ['orange', 'orange', 'orange', 'apple']
        ]

        Assuming that the LDA model 'lda_model' and the corresponding TF-IDF and Dictionary objects 'TFIDF' and 'ID2word'
        are available and properly trained, the function will return (assuming the specified number of topics in lda_model
        is 3):
        [
            [(0, 0.15), (1, 0.8), (2, 0.05)],
            [(0, 0.4), (1, 0.1), (2, 0.5)],
            [(0, 0.05), (1, 0.9), (2, 0.05)]
        ]

        In this example, each list inside the main list represents the topic weight distribution for each document.

    Note:
        The 'lda_model', 'TFIDF', and 'ID2word' objects must be defined before this function is called and should be
        the same objects used to train the model, so that word IDs and weights stay consistent. Each returned
        distribution is expected to contain one entry per topic in the trained model.
    """
    
    return [lda_model.get_document_topics(TFIDF[ID2word.doc2bow(sublist)], minimum_probability=0) for sublist in x]


The `topic_weight` function computes the weight of each topic within each paragraph of the FOMC minutes, which shows how prominent each topic is at the paragraph level.

The output is a list of lists, where each nested list corresponds to one paragraph. Each nested list contains tuples of the form `(topic, weight)`; a higher weight means that the topic has a stronger presence in that paragraph.

With this structure, we can examine how topics are distributed across the paragraphs of a meeting and which themes each paragraph emphasizes.
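As a small illustration of how this output can be read, the sketch below picks the highest-weight topic for each paragraph. The weights are toy numbers for a hypothetical three-topic model (the same ones used in the docstring example), and `dominant_topic` is a hypothetical helper that is not part of the notebook.

```python
# Toy output in the same shape as topic_weight(): one list of
# (topic, weight) tuples per paragraph. Values are illustrative only.
paragraph_weights = [
    [(0, 0.15), (1, 0.80), (2, 0.05)],
    [(0, 0.40), (1, 0.10), (2, 0.50)],
    [(0, 0.05), (1, 0.90), (2, 0.05)],
]


def dominant_topic(weights):
    """Return the (topic, weight) tuple with the largest weight."""
    return max(weights, key=lambda pair: pair[1])


dominant = [dominant_topic(weights) for weights in paragraph_weights]

for i, (topic, weight) in enumerate(dominant):
    print(f"paragraph {i}: topic {topic} (weight {weight:.2f})")
```

Looking only at the dominant topic discards the rest of the distribution, so it is best treated as a quick sanity check rather than a substitute for the full weights.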

In [8]:
fomc['topic_weight'] = fomc['minutes_paragraphs'].apply(topic_weight)
fomc.head(3)

Unnamed: 0,minutes_paragraphs,paragraphs_length,minutes_text,text_length,doc2bow,topic_weight
1993-02-03,"[[meeting, federal, open, market, committee, h...","[12, 15, 24, 29, 12, 32, 37, 32, 14, 16, 82, 5...",meeting federal open market committee hold off...,4437,"[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, ...","[[(0, 0.036949728), (1, 0.036866453), (2, 0.03..."
1993-03-23,"[[meeting, federal, open, market, committee, h...","[11, 13, 64, 23, 24, 28, 60, 51, 64, 56, 100, ...",meeting federal open market committee hold off...,2789,"[[(0, 1), (1, 1), (3, 1), (4, 1), (5, 1), (6, ...","[[(0, 0.03791663), (1, 0.03780504), (2, 0.0377..."
1993-05-18,"[[meeting, federal, open, market, committee, h...","[11, 26, 19, 25, 27, 62, 46, 54, 37, 89, 56, 6...",meeting federal open market committee hold off...,2354,"[[(0, 1), (1, 1), (3, 1), (4, 1), (5, 1), (6, ...","[[(0, 0.03791647), (1, 0.03780504), (2, 0.0377..."


Next, I calculate topic scores for each set of FOMC minutes. A meeting's score for a topic is the length-weighted average of that topic's weight across the meeting's paragraphs: each paragraph's topic weight is multiplied by the paragraph's share of the total text length, and the results are summed.
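A worked example with toy numbers may make the weighting concrete. The function name `meeting_topic_scores` and the data are hypothetical, and the sketch assumes that the total text length equals the sum of the paragraph lengths (in the notebook, the total is read from the `text_length` column instead).

```python
from collections import defaultdict


def meeting_topic_scores(topic_weights, paragraph_lengths):
    """Length-weighted average of paragraph-level topic weights.

    topic_weights: one list of (topic_id, weight) tuples per paragraph.
    paragraph_lengths: token count of each paragraph, in the same order.
    """
    # Assumption: the document length is the sum of its paragraph lengths.
    doc_length = sum(paragraph_lengths)
    scores = defaultdict(float)
    for weights, para_length in zip(topic_weights, paragraph_lengths):
        for topic_id, weight in weights:
            scores[topic_id] += weight * para_length / doc_length
    return dict(scores)


# Two paragraphs, two topics (toy values).
topic_weights = [
    [(0, 0.2), (1, 0.8)],  # short paragraph, mostly topic 1
    [(0, 0.9), (1, 0.1)],  # long paragraph, mostly topic 0
]
paragraph_lengths = [10, 30]

scores = meeting_topic_scores(topic_weights, paragraph_lengths)
print(scores)
```

Topic 0 scores 0.2 × 10/40 + 0.9 × 30/40 = 0.725 and topic 1 scores 0.8 × 10/40 + 0.1 × 30/40 = 0.275. Because each paragraph's weights sum to 1 and the length shares sum to 1, the meeting-level scores also sum to 1.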

In [9]:
# Initialize an empty dictionary to store the topic scores for each set of FOMC minutes.
dic = {}

# Loop through each index (date) in the FOMC minutes DataFrame.
for index in fomc.index:

    # Create an empty dictionary to store the cumulative topic scores for the paragraphs within the current date.
    sum_dict = {}

    # Iterate through each paragraph's index within the current date.
    for i in range(len(fomc.loc[index, 'topic_weight'])):

        # Get the length of the entire FOMC minutes text and the length of the current paragraph.
        doc_length = fomc.loc[index, 'text_length']
        para_length = fomc.loc[index, 'paragraphs_length'][i]

        # For each topic-weight tuple within the current paragraph, calculate its contribution to the topic score.
        for tup in fomc.loc[index, 'topic_weight'][i]:
            idx, val = tup
            score_contribution = val / doc_length * para_length

            # Update the cumulative topic scores in the sum_dict for each topic.
            if idx not in sum_dict:
                sum_dict[idx] = score_contribution
            else:
                sum_dict[idx] += score_contribution

    # Store the resulting sum_dict in the dic dictionary with the date (index) as the key.
    dic[index] = sum_dict

# Assign the topic scores as a new column named 'topic_score' in the fomc DataFrame.
# Wrapping dic in a Series aligns each meeting's scores with its date in the index.
fomc['topic_score'] = pd.Series(dic)


In [10]:
fomc.head(3)

Unnamed: 0,minutes_paragraphs,paragraphs_length,minutes_text,text_length,doc2bow,topic_weight,topic_score
1993-02-03,"[[meeting, federal, open, market, committee, h...","[12, 15, 24, 29, 12, 32, 37, 32, 14, 16, 82, 5...",meeting federal open market committee hold off...,4437,"[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, ...","[[(0, 0.036949728), (1, 0.036866453), (2, 0.03...","{0: 0.2171418778484963, 1: 0.22202803608318067..."
1993-03-23,"[[meeting, federal, open, market, committee, h...","[11, 13, 64, 23, 24, 28, 60, 51, 64, 56, 100, ...",meeting federal open market committee hold off...,2789,"[[(0, 1), (1, 1), (3, 1), (4, 1), (5, 1), (6, ...","[[(0, 0.03791663), (1, 0.03780504), (2, 0.0377...","{0: 0.05243485473769568, 1: 0.3377698638401964..."
1993-05-18,"[[meeting, federal, open, market, committee, h...","[11, 26, 19, 25, 27, 62, 46, 54, 37, 89, 56, 6...",meeting federal open market committee hold off...,2354,"[[(0, 1), (1, 1), (3, 1), (4, 1), (5, 1), (6, ...","[[(0, 0.03791647), (1, 0.03780504), (2, 0.0377...","{0: 0.04440606846066607, 1: 0.3259464979374318..."


In [11]:
# Create a DataFrame 'topics' from the 'dic' dictionary.
# The 'topics' DataFrame holds the topic scores for each set of FOMC minutes, with dates as rows and topics as columns.
topics = pd.DataFrame(dic).transpose()
topics.columns = 'topic 1, consumption, foreign_exchange_rate, inflation, financial_market, topic 6'.split(', ')
# Alternatively, give each topic a generic title:
# topics.columns = [f'topic_{i}' for i in range(1, NUM_topics + 1)]

topics.head()

Unnamed: 0,topic 1,consumption,foreign_exchange_rate,inflation,financial_market,topic 6
1993-02-03,0.217142,0.222028,0.071233,0.354325,0.077838,0.057434
1993-03-23,0.052435,0.33777,0.102628,0.422744,0.049573,0.034851
1993-05-18,0.044406,0.325946,0.120336,0.428469,0.042354,0.038488
1993-07-07,0.031498,0.281643,0.069809,0.522031,0.053767,0.041252
1993-08-17,0.036917,0.411438,0.112277,0.312171,0.081001,0.046197


In [12]:
fomc = pd.concat([fomc, topics], axis=1)
fomc.head(3)

Unnamed: 0,minutes_paragraphs,paragraphs_length,minutes_text,text_length,doc2bow,topic_weight,topic_score,topic 1,consumption,foreign_exchange_rate,inflation,financial_market,topic 6
1993-02-03,"[[meeting, federal, open, market, committee, h...","[12, 15, 24, 29, 12, 32, 37, 32, 14, 16, 82, 5...",meeting federal open market committee hold off...,4437,"[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, ...","[[(0, 0.036949728), (1, 0.036866453), (2, 0.03...","{0: 0.2171418778484963, 1: 0.22202803608318067...",0.217142,0.222028,0.071233,0.354325,0.077838,0.057434
1993-03-23,"[[meeting, federal, open, market, committee, h...","[11, 13, 64, 23, 24, 28, 60, 51, 64, 56, 100, ...",meeting federal open market committee hold off...,2789,"[[(0, 1), (1, 1), (3, 1), (4, 1), (5, 1), (6, ...","[[(0, 0.03791663), (1, 0.03780504), (2, 0.0377...","{0: 0.05243485473769568, 1: 0.3377698638401964...",0.052435,0.33777,0.102628,0.422744,0.049573,0.034851
1993-05-18,"[[meeting, federal, open, market, committee, h...","[11, 26, 19, 25, 27, 62, 46, 54, 37, 89, 56, 6...",meeting federal open market committee hold off...,2354,"[[(0, 1), (1, 1), (3, 1), (4, 1), (5, 1), (6, ...","[[(0, 0.03791647), (1, 0.03780504), (2, 0.0377...","{0: 0.04440606846066607, 1: 0.3259464979374318...",0.044406,0.325946,0.120336,0.428469,0.042354,0.038488


In [13]:
fomc.to_pickle('../data/fomc_topic_modeling.pkl')