This project aims to classify financial documents into predefined categories and extract meaningful topics from the text using Latent Dirichlet Allocation (LDA) and Random Forest classifiers. An interactive Streamlit application was developed to allow users to upload HTML files, preprocess the text, and obtain predictions on document classes and topics.
The dataset used for this project can be downloaded from the following link: Financial Documents Dataset
- `finacplus.ipynb`: Jupyter notebook containing the code and analysis.
- `finacplus.py`: Python script to run the Streamlit application.
- `finacplus_LDA.pkl`: Pickle file containing the trained LDA model.
- `finacplus_rfmodel.pkl`: Pickle file containing the trained Random Forest model.
- `finacplus_vectoriser.pkl`: Pickle file containing the CountVectorizer.
- `finacplus_encoder.pkl`: Pickle file containing the LabelEncoder.
To run the project locally, follow these steps:
1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/finacplus.git
   cd finacplus
   ```

2. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

3. Download the dataset from the link provided above and place it in the project directory.
To run the Streamlit application, execute the following command:
```bash
streamlit run finacplus.py
```
- Extract text from HTML files: Use BeautifulSoup to parse HTML files and extract text.
- Preprocess the extracted text:
- Convert text to lowercase.
- Remove punctuation, newlines, URLs, and numbers.
- Remove stopwords using NLTK's stopwords corpus and additional domain-specific stopwords.
- Perform lemmatization using NLTK's WordNetLemmatizer to reduce words to their base forms.
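The extraction and preprocessing steps above can be sketched as follows. The real pipeline uses BeautifulSoup for tag stripping and NLTK's stopword corpus and `WordNetLemmatizer`; to keep this sketch self-contained and runnable without downloads, tags are stripped with a regex and the stopword set is a tiny illustrative stand-in.

```python
import re
import string

# Tiny stand-in for NLTK's stopword corpus (illustrative only).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "on", "is"}

def preprocess_html(html: str) -> str:
    # Strip HTML tags (BeautifulSoup's get_text() in the real pipeline).
    text = re.sub(r"<[^>]+>", " ", html)
    # Lowercase, then remove URLs, punctuation, and numbers; newlines
    # collapse in the whitespace split below.
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", " ", text)
    # Drop stopwords; NLTK's WordNetLemmatizer would lemmatize each token here.
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(preprocess_html("<p>The Revenue grew 12% in 2023: see https://example.com</p>"))
```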
- CountVectorizer: Transform the text data into numerical format by capturing word frequencies.
- Latent Dirichlet Allocation (LDA): Apply LDA for topic modeling to identify latent topics in the text data.
- Truncated Singular Value Decomposition (SVD): Use Truncated SVD for dimensionality reduction and to extract the most significant topics.
- Random Forest Classifier: Train a Random Forest Classifier to classify documents into categories such as Balance Sheets, Cash Flow, Income Statement, Notes, and Others.
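The feature-extraction and modeling stack can be sketched on toy data as below; the documents, labels, and hyperparameters are illustrative, not the project's actual configuration.

```python
# Toy sketch of the feature-extraction and modeling stack; the documents,
# labels, and hyperparameters below are illustrative only.
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

docs = [
    "total assets liabilities equity",        # Balance Sheets
    "cash operating investing financing",     # Cash Flow
    "revenue expenses net income profit",     # Income Statement
    "accounting policy disclosure estimate",  # Notes
]
labels = ["Balance Sheets", "Cash Flow", "Income Statement", "Notes"]

# CountVectorizer turns each document into a vector of word frequencies.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# LDA uncovers latent topics; Truncated SVD gives a compact topic space.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
svd = TruncatedSVD(n_components=2, random_state=0).fit(X)

# LabelEncoder maps class names to integers for the classifier.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# Random Forest classifies documents from the count features.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

new_doc = vectorizer.transform(["net income revenue expenses"])
print(encoder.inverse_transform(rf.predict(new_doc))[0])
```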
- Train the models: Train the models on the preprocessed text data.
- Evaluate models:
- Use log-likelihood and perplexity for evaluating LDA.
- Use accuracy scores for evaluating the Random Forest classifier.
- Optimize hyperparameters: Perform grid search to find the best configuration for the models.
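The evaluation and tuning steps above can be sketched as follows; the toy corpus and parameter grid are illustrative, not the notebook's actual configuration.

```python
# Sketch of model evaluation and hyperparameter tuning on toy data; the
# corpus and parameter grid below are illustrative only.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

docs = ["assets liabilities equity", "cash operating financing",
        "revenue net income", "policy disclosure estimate"] * 5
y = [0, 1, 2, 3] * 5

X = CountVectorizer().fit_transform(docs)

# LDA: log-likelihood (higher is better) and perplexity (lower is better).
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print("log-likelihood:", lda.score(X))
print("perplexity:", lda.perplexity(X))

# Random Forest: grid search over hyperparameters, scored by accuracy.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    scoring="accuracy",
    cv=5,
)
grid.fit(X, y)
print("best params:", grid.best_params_)
print("cross-validated accuracy:", grid.best_score_)
```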
- Develop a user-friendly application: Create a Streamlit application to upload HTML files, preprocess text, and display predictions.
- Display results:
- Show the predicted class from the Random Forest model.
- Display topics from the SVD model along with their distributions.
This project demonstrates the successful application of machine learning techniques for the classification and topic modeling of financial documents. The Streamlit application provides an interactive and easy-to-use interface for document classification and topic extraction, making it a valuable tool for financial analysts and researchers.
- KADAMBI V KASHYAP