In [1]:
pip install rank-bm25

Collecting rank-bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank-bm25
Successfully installed rank-bm25-0.2.2


In [30]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from rank_bm25 import BM25Okapi
import pandas as pd
from collections import defaultdict

In [3]:
company_data_preprocessed=pd.read_csv('/content/drive/MyDrive/SEM9/Information Retrieval/Financial_Search_Engine/package datasets/Company_Data_Preprocessed', index_col=0)

In [4]:
company_data_preprocessed

Unnamed: 0,Symbol,Company_Name,Current_Price,Earnings_per_Share,Growth_Rate,Intrinsic_Value,Long Business Summary,preprocessed_description
0,RRX,Regal Rexnord Corporation,118.2800,9.947,-0.774,0.177498,Regal Rexnord Corporation manufactures and sel...,regal rexnord corporation manufactures sells i...
1,APTMU,Alpha Partners Technology Merger Corp.,10.6200,0.013,0.311,0.000021,Alpha Partners Technology Merger Corp. does no...,alpha partners technology merger corp signific...
2,WSFS,WSFS Financial Corporation,35.2350,9.927,0.047,2.967547,WSFS Financial Corporation operates as the sav...,wsfs financial corporation operates savings lo...
3,FRAF,Franklin Financial Services Corporation,32.0000,6.153,-0.162,0.961493,Franklin Financial Services Corporation operat...,franklin financial services corporation operat...
4,WF,Woori Financial Group Inc.,26.7500,63314.470,-0.342,121665.258441,"Woori Financial Group Inc., together with its ...",woori financial group inc together subsidiarie...
...,...,...,...,...,...,...,...,...
230,WEYS,"Weyco Group, Inc.",28.5300,2.452,0.064,0.225463,"Weyco Group, Inc. designs and distributes foot...",weyco group inc designs distributes footwear m...
231,FE,FirstEnergy Corp.,36.0085,0.206,0.196,0.001411,"FirstEnergy Corp., through its subsidiaries, g...",firstenergy corp subsidiaries generates transm...
232,SHEL,Shell plc,65.3350,6.695,-0.808,0.121649,Shell plc operates as an energy and petrochemi...,shell plc operates energy petrochemical compan...
233,TXT,Textron Inc.,75.4400,8.525,0.274,1.266532,"Textron Inc. operates in the aircraft, defense...",textron inc operates aircraft defense industri...


# LDA topic modelling

In [5]:

def topic_modelling(company_data,num_topics=5):
  # Convert the long business summaries to a list
  documents = company_data['preprocessed_description']

  # Vectorize the text using CountVectorizer
  vectorizer = CountVectorizer(stop_words='english')
  X = vectorizer.fit_transform(documents)

  # Apply LDA (Latent Dirichlet Allocation) to the vectorized data
  lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
  lda.fit(X)

  # Assign topics to each business summary
  topics = []
  for idx, document in enumerate(documents):
      # Transform the document to its topic distribution
      topic_distribution = lda.transform(vectorizer.transform([document]))[0]

      # Get the index of the topic with the highest probability
      dominant_topic_idx = topic_distribution.argmax()

      # Append the dominant topic index to the list of topics
      topics.append(f'Topic {dominant_topic_idx + 1}')

  # Add the 'Topic' column to the original data
  company_data['topic'] = topics
  return company_data,vectorizer,lda

In [6]:
def get_query_topic(query, vectorizer, lda_model):
    query_vector = vectorizer.transform([query])
    topic_distribution = lda_model.transform(query_vector)
    return topic_distribution.argmax()

In [15]:
def find_most_relevant_companies(query_topic, lda_model, vectorizer, documents, company_names, top_n=10):
    topic_document_scores = lda_model.transform(vectorizer.transform(documents))
    relevant_document_indices = topic_document_scores[:, query_topic].argsort()[-top_n:][::-1]
    relevant_companies = [company_names[idx]for idx in relevant_document_indices]
    relevant_documents = [documents[idx]for idx in relevant_document_indices]
    return relevant_companies,relevant_documents

In [10]:
topic_data,vectorizer,lda_model=topic_modelling(company_data_preprocessed,10)

In [11]:
topic_data

Unnamed: 0,Symbol,Company_Name,Current_Price,Earnings_per_Share,Growth_Rate,Intrinsic_Value,Long Business Summary,preprocessed_description,topic
0,RRX,Regal Rexnord Corporation,118.2800,9.947,-0.774,0.177498,Regal Rexnord Corporation manufactures and sel...,regal rexnord corporation manufactures sells i...,Topic 1
1,APTMU,Alpha Partners Technology Merger Corp.,10.6200,0.013,0.311,0.000021,Alpha Partners Technology Merger Corp. does no...,alpha partners technology merger corp signific...,Topic 4
2,WSFS,WSFS Financial Corporation,35.2350,9.927,0.047,2.967547,WSFS Financial Corporation operates as the sav...,wsfs financial corporation operates savings lo...,Topic 3
3,FRAF,Franklin Financial Services Corporation,32.0000,6.153,-0.162,0.961493,Franklin Financial Services Corporation operat...,franklin financial services corporation operat...,Topic 3
4,WF,Woori Financial Group Inc.,26.7500,63314.470,-0.342,121665.258441,"Woori Financial Group Inc., together with its ...",woori financial group inc together subsidiarie...,Topic 3
...,...,...,...,...,...,...,...,...,...
230,WEYS,"Weyco Group, Inc.",28.5300,2.452,0.064,0.225463,"Weyco Group, Inc. designs and distributes foot...",weyco group inc designs distributes footwear m...,Topic 5
231,FE,FirstEnergy Corp.,36.0085,0.206,0.196,0.001411,"FirstEnergy Corp., through its subsidiaries, g...",firstenergy corp subsidiaries generates transm...,Topic 8
232,SHEL,Shell plc,65.3350,6.695,-0.808,0.121649,Shell plc operates as an energy and petrochemi...,shell plc operates energy petrochemical compan...,Topic 10
233,TXT,Textron Inc.,75.4400,8.525,0.274,1.266532,"Textron Inc. operates in the aircraft, defense...",textron inc operates aircraft defense industri...,Topic 9


In [34]:
query_LDA="finance"
query_topic_dist=get_query_topic(query_LDA,vectorizer,lda_model)

In [35]:
query_topic_dist

9

In [36]:
relevant_companies,relevant_documents=find_most_relevant_companies(query_topic_dist, lda_model, vectorizer, company_data_preprocessed['preprocessed_description'], company_data_preprocessed['Symbol'])
for i in range(len(relevant_companies)):
  print(relevant_companies[i], '  ', relevant_documents[i])

ANDE    andersons inc operates trade renewables plant nutrient sectors united states internationally operates three segments trade renewables plant nutrient companys trade segment operates grain elevators stores commodities provides grain marketing risk management origination services well sells commodities corn soybeans wheat oats corn oil segment also engages commodity merchandising business well offers logistics physical commodities whole grains grain products feed ingredients domestic fuel products agricultural commodities renewables segment produces purchases sells ethanol coproducts well offers facility operations risk management marketing services ethanol plants invests operates companys plant nutrient segment manufactures distributes retails agricultural related plant nutrients liquid industrial products corncobbased products pelleted lime gypsum products well turf fertilizer pesticide herbicide products crop nutrients crop protection chemicals seed products well provides appli

# BM25 model

In [18]:
def bm25_model(documents):
  tokenized_documents = [doc.split() for doc in documents]
  bm25 = BM25Okapi(tokenized_documents)
  return bm25

In [21]:
def get_query_results(query, bm25_model, documents, company_names, top_n=10):
    tokenized_query = query.split()
    scores = bm25_model.get_scores(tokenized_query)
    sorted_company_results = sorted(company_names, reverse=True)[:top_n]
    sorted_docs = sorted(documents, reverse=True)[:top_n]
    scores = sorted(scores, reverse=True)[:top_n]
    return sorted_company_results,sorted_docs,scores

In [37]:
bm_model=bm25_model(company_data_preprocessed['preprocessed_description'])

In [42]:
query_bm25="financial"
company_names,docs,scores=get_query_results(query_bm25,bm_model,company_data_preprocessed['preprocessed_description'],company_data_preprocessed['Symbol'])
for i in range(len(company_names)):
  print(scores[i],": ",company_names[i], '  ', docs[i])

2.6575286577165316 :  ZLSWU    zalatoris ii acquisition corp significant operations intends effect merger amalgamation share exchange asset acquisition share purchase reorganization similar business combination one businesses brazil company formerly known xpac acquisition corp changed name zalatoris ii acquisition corp july 2023 company incorporated 2021 based new york new york
2.62061500573951 :  ZJYL    yiren digital ltd subsidiaries operates online consumer finance marketplace connects borrowers investors peoples republic china company operates wealth credit segments offers loan facilitation services postorigination services cash processing collection sms services also distributes shortterm cash management insurance products addition company offers consultancy information technology support referral system maintenance customer support services involved provision services financing lease insurance brokerage electronic commerce businesses company offers products wealth management webs

# Ranking the results

In [33]:
def rank_stocks_by_intrinsic_value(stock_list):
    stock_list.sort(key=lambda stock: company_data_preprocessed[company_data_preprocessed['Symbol'] == stock]['Intrinsic_Value'].iloc[0], reverse=True)
    ranked_Output=defaultdict()
    for stock in stock_list:
      ranked_Output[company_data_preprocessed[company_data_preprocessed['Symbol'] == stock]['Company_Name'].iloc[0]]=company_data_preprocessed[company_data_preprocessed['Symbol'] == stock]['Long Business Summary'].iloc[0]
    return ranked_Output

In [39]:
print("Ranked Output for LDA model\nQuery: ",query_LDA,"\n")
LDA_model_ranks=rank_stocks_by_intrinsic_value(relevant_companies)
i=0
for company_name,summary in LDA_model_ranks.items():
  print(i+1,".", company_name)
  print(summary)
  print("\n")
  i+=1

Ranked Output for LDA model
Query:  finance 

1 . AMC Networks Inc.
AMC Networks Inc., an entertainment company, owns and operates a suite of video entertainment products that are delivered to audiences and a platform to distributors and advertisers in the United States and internationally. The company operates through Domestic Operations, and International and Other. The Domestic Operations segment operates various national programming networks, including the AMC, WE tv, BBC AMERICA, IFC, and SundanceTV; provides subscription streaming services comprising Acorn TV, Shudder, Sundance Now, ALLBLK, and HIDIVE, as well as AMC+ and other streaming initiatives; and engages in film distribution business under the IFC Films name. This segment also produces and licenses original programming for various programming networks, as well as services the national programming networks. The International and Other segment operates a portfolio of channels under the AMCNI name; and production and comedy 

In [43]:
print("Ranked Output for BM25 model\nQuery: ",query_bm25,"\n")
bm25_model_ranks=rank_stocks_by_intrinsic_value(company_names)
i=0
for company_name,summary in bm25_model_ranks.items():
  print(i+1,".", company_name)
  print(summary)
  print("\n")
  i+=1

Ranked Output for BM25 model
Query:  financial 

1 . Wintrust Financial Corporation
Wintrust Financial Corporation operates as a financial holding company. It operates in three segments: Community Banking, Specialty Finance, and Wealth Management. The Community Banking segment offers non-interest bearing deposits, non-brokered interest-bearing transaction accounts, and savings and domestic time deposits; home equity, consumer, and real estate loans; safe deposit facilities; and automatic teller machine (ATM), online and mobile banking, and other services. It also engages in the retail origination and purchase of residential mortgages for sale into the secondary market; and provision of lending, deposits, and treasury management services to condominium, homeowner, and community associations, as well as asset-based lending for middle-market companies. In addition, this segment offers loan and deposit services to mortgage brokerage companies; lending to restaurant franchisees; direct leas