1. Import Libraries

In [1]:
import os
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


os for interacting with the operating system (e.g., file paths).
pandas for data manipulation and analysis.
BeautifulSoup from the bs4 library for parsing HTML content.
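Several scikit-learn imports for the machine learning steps: train_test_split for data splitting, TfidfVectorizer for feature extraction, LogisticRegression for model training, and accuracy_score and classification_report for performance evaluation.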

In [2]:

# Define the path to the main folder using a raw string literal (so backslashes are not treated as escape characters)
main_folder_path = r'C:\Users\Mohan.S\Desktop\Mohan_Project\Finac\data'

# Define sub-folder names and create a dictionary to map them to labels
sub_folders = ['Income Statement', 'Balance Sheets', 'Cash Flow', 'Notes', 'Others']
label_dict = {sub_folder: idx for idx, sub_folder in enumerate(sub_folders)}



main_folder_path specifies the location of the main data folder.
sub_folders lists the names of subfolders within the main folder, each representing a category of financial documents.
label_dict creates a dictionary to map each subfolder name to a unique numeric label. This will be used later to label the data.
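
To make the mapping concrete, the short sketch below rebuilds label_dict and also derives the reverse lookup. The name id_to_name is introduced here for illustration only and is not part of the original notebook.

```python
# Rebuild the folder-to-label mapping exactly as in the cell above.
sub_folders = ['Income Statement', 'Balance Sheets', 'Cash Flow', 'Notes', 'Others']
label_dict = {sub_folder: idx for idx, sub_folder in enumerate(sub_folders)}

# Hypothetical reverse mapping: numeric label -> folder name (not in the original).
id_to_name = {idx: name for name, idx in label_dict.items()}

print(label_dict)
# {'Income Statement': 0, 'Balance Sheets': 1, 'Cash Flow': 2, 'Notes': 3, 'Others': 4}
print(id_to_name[2])
# Cash Flow
```

Because the labels come from enumerate, they depend on the order of sub_folders; reordering that list would change which number each category receives.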

In [3]:

# Function to extract text from .html files

def extract_text_from_html(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'html.parser')
        return soup.get_text(separator=' ', strip=True)



This function reads an HTML file as UTF-8, parses it with BeautifulSoup, and returns only the text content: the tags are dropped, whitespace around each text fragment is trimmed, and the fragments are joined with single spaces.
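
As a quick illustration of what get_text(separator=' ', strip=True) returns, the sketch below parses a small in-memory HTML string instead of a file. The HTML snippet is made up for the example.

```python
from bs4 import BeautifulSoup

# A tiny in-memory HTML snippet (invented for illustration).
html = "<html><body><h1>Balance Sheet</h1><p>Total assets: <b>100</b></p></body></html>"

soup = BeautifulSoup(html, 'html.parser')
# strip=True trims each text fragment; separator=' ' joins them with one space.
text = soup.get_text(separator=' ', strip=True)
print(text)
# Balance Sheet Total assets: 100
```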

In [5]:
# Create a DataFrame to store the data and labels

data = []
labels = []

for sub_folder in sub_folders:
    folder_path = os.path.join(main_folder_path, sub_folder)
    if not os.path.exists(folder_path):
        print(f"Subfolder not found: {folder_path}")
        continue
    for file_name in os.listdir(folder_path):
        if file_name.endswith('.html'):
            file_path = os.path.join(folder_path, file_name)
            text = extract_text_from_html(file_path)
            data.append(text)
            labels.append(label_dict[sub_folder])
            

# Check if we have collected any data
# if not data:
#     raise FileNotFoundError("No .html files found in the specified subfolders.")

df = pd.DataFrame({'text': data, 'label': labels})



Initializes empty lists data and labels.
Iterates over each subfolder, skipping (with a printed message) any that does not exist, and reads every file whose name ends in .html.
Uses extract_text_from_html to extract text content from each HTML file.
Appends the extracted text and corresponding label (based on the subfolder) to the data and labels lists.
Converts the lists into a pandas DataFrame df with two columns: text and label.
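
The folder walk can be tried without the real data. The sketch below is a simplified variant under stated assumptions: collect_html_paths is a hypothetical helper (not in the notebook) that returns file paths and labels rather than extracted text, and the folder tree is a temporary one created just for the example.

```python
import os
import tempfile
from collections import Counter

def collect_html_paths(main_folder, sub_folders):
    """Return (file_path, label) pairs for every .html file in the given sub-folders."""
    pairs = []
    for label, sub_folder in enumerate(sub_folders):
        folder_path = os.path.join(main_folder, sub_folder)
        if not os.path.isdir(folder_path):
            continue  # mirror the notebook: skip missing sub-folders
        for file_name in sorted(os.listdir(folder_path)):
            if file_name.endswith('.html'):
                pairs.append((os.path.join(folder_path, file_name), label))
    return pairs

# Build a throwaway folder tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    layout = {'Notes': ['a.html', 'b.html', 'skip.txt'], 'Others': ['c.html']}
    for folder, names in layout.items():
        os.makedirs(os.path.join(tmp, folder))
        for name in names:
            with open(os.path.join(tmp, folder, name), 'w', encoding='utf-8') as fh:
                fh.write('<p>placeholder</p>')
    # 'Cash Flow' is intentionally missing to exercise the skip branch.
    pairs = collect_html_paths(tmp, ['Notes', 'Cash Flow', 'Others'])

# Count how many files each label received.
label_counts = Counter(label for _, label in pairs)
print(label_counts)
# Counter({0: 2, 2: 1})
```

Counting labels this way before training is a cheap check for empty or heavily imbalanced categories.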

In [6]:

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)



X_train and X_test contain the training and test text data, respectively.
y_train and y_test contain the corresponding labels.
test_size=0.2 indicates that 20% of the data is reserved for testing.
random_state=42 ensures reproducibility by setting a seed for the random number generator.
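
The split above is purely random, so a rare class can end up with very few test samples. One optional variation, not used in the original notebook, is the stratify argument of train_test_split, which keeps the class proportions the same in both splits. The sketch below uses made-up toy labels to show the effect.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy, deliberately imbalanced labels (made up): 40 samples of class 0, 10 of class 1.
texts = [f"doc {i}" for i in range(50)]
labels = [0] * 40 + [1] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# With stratify, the 80/20 class ratio is preserved in both splits.
train_counts = Counter(y_train)
test_counts = Counter(y_test)
print(train_counts[0], train_counts[1])  # 32 8
print(test_counts[0], test_counts[1])    # 8 2
```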

In [7]:
# Use TF-IDF vectorizer to transform the text data

vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)



Initializes a TfidfVectorizer to convert text data into TF-IDF features. max_features=5000 caps the vocabulary at the 5000 terms with the highest frequency across the training corpus.
Fits the vectorizer to the training data (X_train) and transforms it into TF-IDF features (X_train_tfidf).
Transforms the test data (X_test) into TF-IDF features using the already fitted vectorizer (X_test_tfidf).
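
The difference between fit_transform and transform is easiest to see on a tiny corpus. In the sketch below the documents are invented for illustration; the vectorizer settings match the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example documents.
train_docs = ["total assets and liabilities", "net cash from operating activities"]
test_docs = ["total cash and goodwill"]

vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
X_test_tfidf = vectorizer.transform(test_docs)        # reuses them, learns nothing new

print(X_train_tfidf.shape)  # (2, 9): two documents, nine distinct terms
print(X_test_tfidf.shape)   # (1, 9): same columns as the training matrix
print('goodwill' in vectorizer.vocabulary_)  # False: unseen at fit time, so ignored
```

Fitting only on the training data keeps information about the test documents out of the vocabulary and the IDF weights.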

In [8]:
# Initialize and train the Logistic Regression model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_tfidf, y_train)



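random_state=42 fixes the random seed. In scikit-learn this setting is only used by the sag, saga, and liblinear solvers, so with the default lbfgs solver of recent versions it has no effect on the result.
max_iter=1000 raises the solver's iteration limit from the default of 100, giving it more room to converge.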
The model is trained using the training TF-IDF features (X_train_tfidf) and their corresponding labels (y_train).
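
As an alternative arrangement, the vectorizing and training steps can be chained in a scikit-learn Pipeline so that raw text goes in and predictions come out. This is a sketch of a design option, not what the original notebook does; the training texts and labels are invented toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy data: label 0 = income-statement-like, label 1 = balance-sheet-like.
train_texts = [
    "revenue net income earnings",
    "revenue income expenses",
    "assets liabilities equity",
    "total assets equity",
]
train_labels = [0, 0, 1, 1]

clf = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('logreg', LogisticRegression(random_state=42, max_iter=1000)),
])
clf.fit(train_texts, train_labels)

# The pipeline vectorizes new text with the fitted vocabulary before predicting.
predictions = clf.predict(["net income and revenue", "liabilities and equity"])
print(predictions.tolist())
```

A pipeline makes it harder to accidentally fit the vectorizer on test data, because fit and predict always apply the steps in the same order.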

In [9]:
# Make predictions on the test set
y_pred = model.predict(X_test_tfidf)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Print classification report
all_labels = sorted(label_dict.values())
report = classification_report(y_test, y_pred, labels=all_labels, target_names=sub_folders)
print(report)



Accuracy: 0.9306930693069307
                  precision    recall  f1-score   support

Income Statement       1.00      0.89      0.94        57
  Balance Sheets       1.00      0.98      0.99        58
       Cash Flow       1.00      0.50      0.67         8
           Notes       0.91      0.86      0.89       123
          Others       0.91      0.97      0.94       259

        accuracy                           0.93       505
       macro avg       0.96      0.84      0.89       505
    weighted avg       0.93      0.93      0.93       505



The model makes predictions on the test TF-IDF features (X_test_tfidf).
accuracy_score computes the accuracy of the predictions.
classification_report generates a detailed report showing the precision, recall, F1-score, and support for each class. The target_names parameter maps the numeric labels back to their respective subfolder names.
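
The numbers in the report can be reproduced by hand. The sketch below works through the Cash Flow row and the two averages of the recall column, using only values printed in the report above. The counts tp, fp, and fn are inferred from the rounded figures (support 8, recall 0.50, precision 1.00) rather than read from a confusion matrix.

```python
# Counts for the "Cash Flow" row, inferred from the report:
# support 8 and recall 0.50 give 4 true positives and 4 false negatives;
# precision 1.00 means no false positives.
tp, fp, fn = 4, 0, 4

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))
# 1.0 0.5 0.67

# Macro average = unweighted mean of the per-class recalls in the report.
recalls = [0.89, 0.98, 0.50, 0.86, 0.97]
macro_recall = sum(recalls) / len(recalls)
print(round(macro_recall, 2))
# 0.84

# Weighted average = mean weighted by each class's support.
supports = [57, 58, 8, 123, 259]
weighted_recall = sum(r * s for r, s in zip(recalls, supports)) / sum(supports)
print(round(weighted_recall, 2))
# 0.93
```

The gap between the macro recall (0.84) and the weighted recall (0.93) comes from the Cash Flow class: it has only 8 test samples, so its recall of 0.50 barely moves the weighted figure but pulls the macro figure down.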

In [10]:
# Print the classification report again with zero_division=1, which reports 1.0
# (instead of warning and reporting 0.0) for any metric whose denominator is zero
report = classification_report(y_test, y_pred, labels=all_labels, target_names=sub_folders, zero_division=1)
print(report)


                  precision    recall  f1-score   support

Income Statement       1.00      0.89      0.94        57
  Balance Sheets       1.00      0.98      0.99        58
       Cash Flow       1.00      0.50      0.67         8
           Notes       0.91      0.86      0.89       123
          Others       0.91      0.97      0.94       259

        accuracy                           0.93       505
       macro avg       0.96      0.84      0.89       505
    weighted avg       0.93      0.93      0.93       505
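
This second report is identical to the first because every class in the test set has at least one true and one predicted sample, so no division by zero occurs. The sketch below uses made-up toy labels, in which one class is never predicted, to show where zero_division actually changes the result.

```python
from sklearn.metrics import precision_score

# Toy labels (made up): class 2 occurs in y_true but is never predicted.
y_true = [0, 1, 2]
y_pred = [0, 1, 1]

# Precision for class 2 is 0 / 0; zero_division decides what to report.
p_as_one = precision_score(y_true, y_pred, labels=[0, 1, 2], average=None, zero_division=1)
p_as_zero = precision_score(y_true, y_pred, labels=[0, 1, 2], average=None, zero_division=0)

print(p_as_one.tolist())   # [1.0, 0.5, 1.0]
print(p_as_zero.tolist())  # [1.0, 0.5, 0.0]
```

Setting zero_division=1 silences the warning but can make an unpredicted class look perfect, so the choice of value is worth stating alongside the reported numbers.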

