
07-06-2024

In [69]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [70]:
import joblib
import pandas as pd
from google.colab import drive
import os
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

Load and Preprocess the Data

In [71]:
# Define functions to load and preprocess data
def load_html_tables(directory):
    tables = []
    labels = []
    for label in os.listdir(directory):
        subdirectory = os.path.join(directory, label)
        if os.path.isdir(subdirectory):
            for file in os.listdir(subdirectory):
                if file.endswith('.html'):
                    file_path = os.path.join(subdirectory, file)
                    tables.append(pd.read_html(file_path))
                    labels.append(label)
    return tables, labels

def tables_to_text(tables):
    text_data = []
    for table_list in tables:
        for table in table_list:
            text_data.append(table.to_string())
    return text_data

# Load and preprocess data
dataset_directory = '/content/drive/MyDrive/Projects/Phonepe/data-mohan'
tables, labels = load_html_tables(dataset_directory)
text_data = tables_to_text(tables)

Explanation:
Initialize lists: tables and labels are initialized as empty lists to store the HTML tables and their corresponding labels.
List the directory contents: os.listdir(directory) lists all files and subdirectories in the specified directory.
Loop through each item in the directory:
Identify subdirectories: For each item (label), os.path.join(directory, label) forms the path to the item, and os.path.isdir(subdirectory) checks if it is a directory.
Loop through files in subdirectory: For each file in the subdirectory, the code checks if the file ends with .html.
Read HTML file: If it is an HTML file, pd.read_html(file_path) parses every table in the file and returns a list of DataFrames (even when the file holds a single table); that list is appended to tables as one entry.
Store label: The label (subdirectory name) is appended to labels.
Return the tables and labels: The function returns two lists: tables containing the HTML tables, and labels containing the corresponding labels.

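tables_to_text(tables):
Initialize list: text_data is initialized as an empty list to store the string representations of the tables.
Loop through each table list: Each entry of tables is the list of DataFrames parsed from one file, so the code loops through each table_list and then through each table inside it.
Convert table to string: table.to_string() converts the DataFrame to a string representation and appends it to text_data.
Return the text data: The function returns text_data, a list of string representations of the tables.
Note: text_data receives one entry per table, while labels holds one entry per file. The two lists stay aligned only if every HTML file contains exactly one table, which the sample counts reported later (505 test samples, 2,525 files) suggest is the case here.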
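
If a file ever contained more than one table, the alignment between text_data and labels could be kept by producing one string per file. The following is a minimal sketch of that idea, not the notebook's method: tables_to_text_per_file and the toy DataFrames are hypothetical, and the DataFrames stand in for what pd.read_html would return. Note that the prediction function used later reads only the first table of a file, so this variant would change the preprocessing and would need the same change at prediction time.

```python
import pandas as pd


def tables_to_text_per_file(tables):
    """Return one string per file by joining all of that file's tables."""
    text_data = []
    for table_list in tables:
        # Each table_list holds every DataFrame parsed from a single file.
        text_data.append("\n".join(table.to_string() for table in table_list))
    return text_data


# Hypothetical stand-in for the output of load_html_tables:
# the first "file" has two tables, the second has one.
file_with_two_tables = [
    pd.DataFrame({"Item": ["Cash"], "Amount": [100]}),
    pd.DataFrame({"Item": ["Debt"], "Amount": [40]}),
]
file_with_one_table = [pd.DataFrame({"Item": ["Revenue"], "Amount": [80]})]

toy_tables = [file_with_two_tables, file_with_one_table]
toy_labels = ["Balance Sheets", "Income Statement"]

toy_text = tables_to_text_per_file(toy_tables)
print(len(toy_text), len(toy_labels))  # both lists have one entry per file
```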



Vectorize the Text Data :
Use the TF-IDF vectorizer to convert the text data to a numerical format

In [72]:
# Vectorize text data
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text_data)
y = labels

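Explanation:
Create the vectorizer: TfidfVectorizer() is created with its default settings.
Fit and transform: vectorizer.fit_transform(text_data) learns the vocabulary and IDF weights from text_data and returns a sparse matrix X with one row per table and one column per vocabulary term.
Assign labels: y = labels uses the subdirectory names as the target values.

Recap of the loading step:
Specify the dataset directory: dataset_directory is set to the path where the HTML files are stored.
Load tables and labels: load_html_tables(dataset_directory) is called to load the HTML tables and their labels from the specified directory, storing the results in tables and labels.
Convert tables to text: tables_to_text(tables) is called to convert the loaded HTML tables into text format, storing the result in text_data.
Summary
load_html_tables(directory): This function reads HTML files from subdirectories within the specified directory, storing the tables and their labels in lists.
tables_to_text(tables): This function converts the list of DataFrame tables into a list of their string representations.
Load and preprocess data: The dataset directory is specified, and the two functions are used to load the HTML tables and convert them to text format.
Vectorize: TfidfVectorizer converts the text into the numerical matrix X used for training.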
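
To make the vectorization step concrete, here is a small sketch on three made-up strings. The strings and the names toy_text, toy_vectorizer and X_toy are assumptions for illustration; only TfidfVectorizer and fit_transform come from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical table strings standing in for text_data.
toy_text = [
    "total assets 100 total liabilities 40",
    "net cash from operating activities 25",
    "revenue 80 cost of sales 50",
]

toy_vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights, then encodes the text.
X_toy = toy_vectorizer.fit_transform(toy_text)

print(X_toy.shape)  # (number of documents, vocabulary size)
print(sorted(toy_vectorizer.vocabulary_)[:5])  # a few learned terms
```

Each row of the sparse matrix corresponds to one input string, and each column to one term of the learned vocabulary.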

Train-Test Split :
Split the data into training and testing sets

In [73]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Explanation:
Purpose: The train-test split is a common step in machine learning to evaluate the performance of a model. The dataset is divided into two subsets:

Training Set (X_train, y_train): Used to train the model.
Testing Set (X_test, y_test): Used to evaluate the model's performance on unseen data.
train_test_split function: This function is part of the sklearn.model_selection module in the Scikit-learn library. It splits arrays or matrices into random train and test subsets.

Parameters:

X: The features or input data (e.g., the text data in this context).
y: The labels or target data (e.g., the labels corresponding to the text data).
test_size=0.2: This specifies the proportion of the dataset to include in the test split. Here, 20% of the data will be used for testing, and 80% will be used for training.
random_state=42: This is a seed value for the random number generator. Setting a random state ensures reproducibility of the results. Using the same seed will always produce the same split.
Output:

X_train: The training subset of the input data.
X_test: The testing subset of the input data.
y_train: The training subset of the target data.
y_test: The testing subset of the target data.
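
Because the classes in this dataset are imbalanced (for example, only 10 Cash Flow tables end up in the test set), one option worth considering is the stratify argument of train_test_split, which keeps class proportions similar in both subsets. The notebook does not use it; the sketch below shows the effect on hypothetical toy data.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 40 "Others" samples and 10 "Cash Flow" samples.
toy_X = [[i] for i in range(50)]
toy_y = ["Others"] * 40 + ["Cash Flow"] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratify, both subsets keep the 4:1 class ratio.
print(Counter(y_tr))
print(Counter(y_te))
```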

Train a Classification Model :
Use a Naive Bayes classifier for the classification task

In [74]:
# Train the model
model = MultinomialNB()
model.fit(X_train, y_train)

Explanation

Model Initialization:

model = MultinomialNB()
This line initializes an instance of the MultinomialNB class.
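MultinomialNB is a Naive Bayes classifier suitable for classification with discrete features (e.g., word counts for text classification). It is designed for integer counts, but fractional values such as the TF-IDF weights used here also work in practice.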
Model Training:

model.fit(X_train, y_train)
The fit method is used to train the model on the training data.
X_train is the feature data used for training.
y_train is the target data (labels) used for training.
During training, the model learns the relationship between the features and the labels.
Steps Involved:
Initialize the Model:

An instance of the MultinomialNB classifier is created.
This classifier is particularly effective for text classification tasks where the features represent word counts or term frequencies.
Fit the Model:

The fit method is called on the training data.
The model learns from X_train (features) and y_train (labels) to understand how to classify new, unseen data based on this training.
Example:
Suppose X_train contains text data that has been transformed into numerical features (e.g., term frequency-inverse document frequency (TF-IDF) vectors), and y_train contains the corresponding labels for each text entry.
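
The example described above can be sketched in a few lines. The four training strings, their labels, and the names toy_vectorizer and toy_model are made up for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training text and labels.
toy_text = [
    "total assets cash equivalents",
    "total liabilities equity assets",
    "revenue cost of sales profit",
    "revenue expenses net profit",
]
toy_labels = [
    "Balance Sheets",
    "Balance Sheets",
    "Income Statement",
    "Income Statement",
]

toy_vectorizer = TfidfVectorizer()
X_toy_train = toy_vectorizer.fit_transform(toy_text)  # TF-IDF features

toy_model = MultinomialNB()
toy_model.fit(X_toy_train, toy_labels)  # learn per-class term likelihoods

# New text must go through the same fitted vectorizer before predict.
toy_pred = toy_model.predict(toy_vectorizer.transform(["total assets"]))
print(toy_pred[0])
```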

Evaluate the Model :
Evaluate the model using accuracy and other relevant metrics

In [75]:
# Evaluate the model
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.7089108910891089
Classification Report:
                   precision    recall  f1-score   support

  Balance Sheets       1.00      0.52      0.68        50
       Cash Flow       0.00      0.00      0.00        10
Income Statement       1.00      0.06      0.12        62
           Notes       0.94      0.64      0.76       146
          Others       0.62      0.99      0.76       237

        accuracy                           0.71       505
       macro avg       0.71      0.44      0.47       505
    weighted avg       0.79      0.71      0.66       505





Explanation:
Make Predictions:

y_pred = model.predict(X_test)
This line uses the trained model to make predictions on the test data.
X_test is the feature data for the test set.
predict method returns the predicted labels for X_test.
Evaluate Accuracy:

print("Accuracy:", accuracy_score(y_test, y_pred))
accuracy_score(y_test, y_pred) calculates the accuracy of the model.
Accuracy is the ratio of correctly predicted labels to the total number of labels.
y_test is the true labels for the test set.
y_pred is the predicted labels.
The accuracy score is printed.
Generate a Classification Report:

print("Classification Report:\n", classification_report(y_test, y_pred))
classification_report(y_test, y_pred) generates a detailed report showing the precision, recall, F1-score, and support for each class.
This report provides more detailed insights into the performance of the model across different classes.
The classification report is printed.
Steps Involved:
Make Predictions:

Use the predict method of the trained model to generate predictions for the test data (X_test).
Calculate Accuracy:

Use accuracy_score to calculate the overall accuracy of the model by comparing the true labels (y_test) to the predicted labels (y_pred).
Generate Classification Report:

Use classification_report to obtain a detailed performance summary for each class, including precision, recall, and F1-score.


Why Use These Metrics?
Accuracy:

Accuracy is a straightforward and widely used metric that measures the proportion of correct predictions out of all predictions made.
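It is a good starting point for evaluating model performance but can be misleading if the data is imbalanced. In the report above, overall accuracy is 0.71 even though recall is 0.00 for Cash Flow and 0.06 for Income Statement, because the model predicts the large Others class for most tables.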
Classification Report:

The classification report provides a comprehensive view of the model's performance, including:
Precision: The ratio of true positive predictions to the total predicted positives.
Recall: The ratio of true positive predictions to the total actual positives.
F1-score: The harmonic mean of precision and recall, providing a single measure of performance.
Support: The number of actual occurrences of each class in the dataset.
This detailed breakdown helps to understand how the model performs across different classes, which is particularly useful for imbalanced datasets.
Summary:
Make Predictions: Use the trained model to predict labels for the test data (X_test).
Calculate Accuracy: Use accuracy_score to compute the overall accuracy of the model.
Generate Classification Report: Use classification_report to obtain a detailed performance analysis across different classes.
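
The definitions above can be checked by hand on a tiny example. The labels below are hypothetical and chosen so that one class (Cash Flow) is never predicted, which mirrors the 0.00 row in the report; the zero_division=0 argument tells scikit-learn to report 0 for such an undefined precision instead of emitting a warning.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true and predicted labels for six tables.
y_true = ["Others", "Others", "Others", "Notes", "Notes", "Cash Flow"]
y_hat = ["Others", "Others", "Others", "Notes", "Others", "Others"]

class_names = ["Cash Flow", "Notes", "Others"]

acc = accuracy_score(y_true, y_hat)  # 4 correct out of 6
prec = precision_score(
    y_true, y_hat, labels=class_names, average=None, zero_division=0
)
rec = recall_score(
    y_true, y_hat, labels=class_names, average=None, zero_division=0
)

for name, p, r in zip(class_names, prec, rec):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={acc:.2f}")
```

For Others, five tables are predicted and three are correct, so precision is 3/5 = 0.60, while all three true Others tables are found, so recall is 1.00.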

In [76]:
import joblib
import pandas as pd

Save the Model :
Save the trained model and the vectorizer so you can reuse them without retraining

In [77]:
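# Step 1: Save the model and vectorizer
# Note: best_rf_model is not the MultinomialNB `model` trained above. It is the
# tuned random forest loaded in the "Loading and Predicting with the Best Model"
# section further below, so that section must have been run first.
joblib.dump(best_rf_model, '/content/drive/MyDrive/Projects/Phonepe/best_rf_model.pkl')
joblib.dump(vectorizer, '/content/drive/MyDrive/Projects/Phonepe/vectorizer.pkl')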

['/content/drive/MyDrive/Projects/Phonepe/vectorizer.pkl']

Explanation:
Efficiency: joblib is more efficient than the standard pickle module for large NumPy arrays, which are common in machine learning models and vectorizers.
Ease of Use: joblib provides simple functions to save and load models, making it easy to integrate into machine learning workflows.
Compatibility: joblib is compatible with most Python objects used in machine learning, such as models, vectorizers, and other data structures.

Summary:
Import joblib: Import the joblib library for serialization.
Save the Model: Use joblib.dump to save the trained model to a file.
Save the Vectorizer: Use joblib.dump to save the vectorizer to a file.
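
A save-and-load round trip can be sketched as follows. The toy data, the file names and the temporary directory are assumptions for illustration; the notebook itself writes to Google Drive. Since joblib files are pickle-based, they should only be loaded from trusted sources.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the real tables.
toy_text = ["total assets cash", "revenue net profit"]
toy_labels = ["Balance Sheets", "Income Statement"]

toy_vectorizer = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_text), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "toy_vectorizer.pkl")
    joblib.dump(toy_model, model_path)            # serialize the model
    joblib.dump(toy_vectorizer, vectorizer_path)  # serialize the vectorizer
    restored_model = joblib.load(model_path)
    restored_vectorizer = joblib.load(vectorizer_path)

new_text = ["total assets"]
original_pred = toy_model.predict(toy_vectorizer.transform(new_text))
restored_pred = restored_model.predict(restored_vectorizer.transform(new_text))
print(original_pred[0], restored_pred[0])  # the restored pair behaves the same
```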

Load and Use the Model :
Load the saved model and vectorizer for future use

In [78]:
# Load the model and vectorizer
loaded_model = joblib.load('/content/drive/MyDrive/Projects/Phonepe/best_rf_model.pkl')
loaded_vectorizer = joblib.load('/content/drive/MyDrive/Projects/Phonepe/vectorizer.pkl')

Explanation

loaded_model = joblib.load('/content/drive/MyDrive/Projects/Phonepe/best_rf_model.pkl')
This line loads the previously saved model from the file best_rf_model.pkl.
joblib.load is a method from the joblib library, which deserializes (loads) the saved object from the file.
The file path to the saved model is specified as the argument.
The loaded model is stored in the variable loaded_model.
Load the Vectorizer:

loaded_vectorizer = joblib.load('/content/drive/MyDrive/Projects/Phonepe/vectorizer.pkl')
This line loads the previously saved vectorizer from the file vectorizer.pkl.
As with the model, joblib.load is used to load the vectorizer from the specified file path.
The loaded vectorizer is stored in the variable loaded_vectorizer.


Why Use joblib?
Efficiency: joblib is efficient for loading large objects, such as machine learning models and vectorizers.
Ease of Use: joblib provides simple functions to save and load objects, making it easy to integrate into machine learning workflows.
Compatibility: joblib works well with most Python objects used in machine learning, such as models, vectorizers, and other data structures.

In [79]:
print("Loaded Vectorizer Vocabulary Size:", len(loaded_vectorizer.vocabulary_))

Loaded Vectorizer Vocabulary Size: 39417


Consistent Preprocessing :
Make sure the text preprocessing steps are consistent between training and prediction. The function below applies the same table-to-string conversion and the saved vectorizer before predicting:

In [80]:
# Define the prediction function
def predict_category(html_content):
    # Extract table text from HTML content
    table = pd.read_html(html_content)[0]
    table_text = table.to_string()

    # Transform the text using the loaded vectorizer
    X_new = loaded_vectorizer.transform([table_text])

    # Predict the category using the loaded model
    prediction = loaded_model.predict(X_new)
    return prediction[0]

Explanation:
Function Definition:

The function predict_category is defined to take html_content as input.
Extract Table Text from HTML Content:

table = pd.read_html(html_content)[0]
pd.read_html is used to read the HTML content and extract tables into DataFrames.
[0] selects the first table found in the HTML content.
table_text = table.to_string()
to_string() converts the DataFrame into a string representation.
Transform the Text Using the Loaded Vectorizer:

X_new = loaded_vectorizer.transform([table_text])
The extracted table text is transformed into a numerical format using the loaded vectorizer.
transform method converts the text data into the format expected by the model (e.g., a TF-IDF matrix).
The text data is wrapped in a list to match the expected input shape.
Predict the Category Using the Loaded Model:

prediction = loaded_model.predict(X_new)
The transformed text data (X_new) is fed into the loaded model to predict the category.
return prediction[0]
The prediction result (an array) is returned as a single value.
Steps Involved:
Extract Table Text:

Use pd.read_html to parse the HTML content and extract the table.
Convert the table to a string format suitable for vectorization.
Transform the Text:

Use the loaded vectorizer to transform the extracted table text into a numerical format.
This step prepares the text data for input into the trained model.
Predict the Category:

Use the loaded model to predict the category of the transformed text data.
Return the predicted category.
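
Recent pandas versions (2.1 and later) deprecate passing a literal HTML string to pd.read_html and recommend wrapping it in StringIO. The sketch below shows a variant that does this and also closes the file with a context manager. The function name predict_category_from_path, its explicit vectorizer and model parameters, and the utf-8 encoding are assumptions for illustration, not part of the notebook.

```python
from io import StringIO

import pandas as pd


def predict_category_from_path(file_path, vectorizer, model):
    """Predict the category of the first table in an HTML file."""
    # Assumption: the files are UTF-8 encoded; adjust if they are not.
    with open(file_path, "r", encoding="utf-8") as handle:
        html_content = handle.read()

    # Wrap the string in StringIO so read_html treats it as file-like input.
    tables = pd.read_html(StringIO(html_content))
    table_text = tables[0].to_string()  # same conversion as in training

    X_new = vectorizer.transform([table_text])
    return model.predict(X_new)[0]
```

It could be called as predict_category_from_path(path, loaded_vectorizer, loaded_model), which also avoids relying on global variables.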

Ensure you test the prediction function with an HTML file

In [81]:
# Example usage
html_content = open('/content/drive/MyDrive/Projects/Phonepe/data-mohan/Income Statement/18445487_1.html', 'r').read()
predicted_category = predict_category(html_content)
print("Predicted Category:", predicted_category)

Predicted Category: Income Statement


Explanation:
Load the HTML Content from a File:

html_content = open('/content/drive/MyDrive/Projects/Phonepe/data-mohan/Income Statement/18445487_1.html', 'r').read()
open is used to open the specified HTML file in read mode ('r').
read reads the entire content of the file into a string.
html_content now contains the HTML content of the file.
Predict the Category Using the Defined Function:

predicted_category = predict_category(html_content)
The predict_category function is called with html_content as the argument.
This function extracts the table text, transforms it using the loaded vectorizer, and predicts the category using the loaded model.
The predicted category is stored in the variable predicted_category.
Print the Predicted Category:

print("Predicted Category:", predicted_category)
This line prints the predicted category to the console.

In [82]:
# Check the number of features expected by the model
print("Model expects features:", loaded_model.n_features_in_)

Model expects features: 39417


In [83]:
# Check the number of features in the new transformed data
table = pd.read_html(html_content)[0]
table_text = table.to_string()
X_new = loaded_vectorizer.transform([table_text])
print("New data features:", X_new.shape[1])


New data features: 39417


In this code snippet:

pd.read_html(html_content)[0] reads the HTML content and extracts the first table into a DataFrame.
table.to_string() converts the DataFrame into a string format.
loaded_vectorizer.transform([table_text]) transforms the string representation of the table using the loaded vectorizer, resulting in a new data array X_new.
X_new.shape[1] accesses the number of features (columns) in the transformed data array.
This code will print the number of features (columns) in the new transformed data array obtained after applying the loaded vectorizer.

In [84]:
import os

In [85]:
def batch_predict(directory, loaded_model, loaded_vectorizer):
    # Note: loaded_model and loaded_vectorizer are accepted as arguments, but
    # predict_category reads the global objects of the same names.
    predictions = []
    actual_labels = []  # Ground truth labels, taken from the subdirectory names
    for label in os.listdir(directory):
        subdirectory = os.path.join(directory, label)
        if os.path.isdir(subdirectory):
            for file in os.listdir(subdirectory):
                if file.endswith('.html'):
                    file_path = os.path.join(subdirectory, file)
                    # Use a context manager so the file handle is closed promptly
                    with open(file_path, 'r') as f:
                        html_content = f.read()
                    predicted_category = predict_category(html_content)
                    predictions.append((file, predicted_category))
                    actual_labels.append(label)

    return predictions, actual_labels

Summary of what the function does:

Inputs:
directory: The directory path containing one subdirectory of HTML files per label.
loaded_model: The trained machine learning model.
loaded_vectorizer: The loaded vectorizer used for text transformation.
Output:
predictions: A list of tuples where each tuple contains the filename and the predicted category.
actual_labels: A list of ground truth labels (the subdirectory names) corresponding to the predictions.
This function iterates through the label subdirectories of the specified directory, reads each HTML file, predicts its category using the predict_category function, and collects the predictions along with the actual labels.

Note that predict_category takes only html_content and uses the global loaded_model and loaded_vectorizer, so the two arguments passed to batch_predict are not used directly. Make sure predict_category is defined and works with the loaded model and vectorizer before calling batch_predict, and consider handling exceptions for files that cannot be read or that contain no table.
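
One way to handle such failures is to record the problem files and continue with the rest. The sketch below is a hypothetical variant, not the notebook's function: batch_predict_safe takes the prediction function as an argument, and stub_predict stands in for predict_category so the example runs without a trained model. It relies on the fact that pd.read_html raises ValueError when no table is found.

```python
import os
import tempfile


def batch_predict_safe(directory, predict_fn):
    """Like batch_predict, but records files that fail instead of stopping."""
    predictions, actual_labels, failures = [], [], []
    for label in sorted(os.listdir(directory)):
        subdirectory = os.path.join(directory, label)
        if not os.path.isdir(subdirectory):
            continue
        for file in sorted(os.listdir(subdirectory)):
            if not file.endswith(".html"):
                continue
            file_path = os.path.join(subdirectory, file)
            try:
                with open(file_path, "r") as handle:
                    predicted = predict_fn(handle.read())
            except (OSError, ValueError) as error:
                # Unreadable file or no table found: note it and move on.
                failures.append((file, str(error)))
                continue
            predictions.append((file, predicted))
            actual_labels.append(label)
    return predictions, actual_labels, failures


def stub_predict(html_content):
    # Stand-in for predict_category; mimics the "no table" ValueError.
    if "<table" not in html_content:
        raise ValueError("No tables found")
    return "Notes"


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "Notes"))
    with open(os.path.join(root, "Notes", "good.html"), "w") as handle:
        handle.write("<table><tr><td>1</td></tr></table>")
    with open(os.path.join(root, "Notes", "empty.html"), "w") as handle:
        handle.write("<p>no table here</p>")
    predictions, actual_labels, failures = batch_predict_safe(root, stub_predict)

print(predictions)
print(failures)
```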

In [86]:
# Example usage
predictions, actual_labels = batch_predict('/content/drive/MyDrive/Projects/Phonepe/data-mohan', loaded_model, loaded_vectorizer)

'/content/drive/MyDrive/Projects/Phonepe/data-mohan' is the directory path containing HTML files.
loaded_model is the trained machine learning model.
loaded_vectorizer is the loaded vectorizer used for text transformation.
The batch_predict function will iterate through the HTML files in the specified directory, predict the category for each file using the predict_category function (which uses the loaded model and vectorizer), and collect the predictions in the predictions list. If you have ground truth labels, the actual labels will also be collected in the actual_labels list.

This example demonstrates how to apply your machine learning model to a batch of files, making predictions efficiently and collecting the results for further analysis or evaluation.

In [87]:
for file, category in predictions:
    print(f"File: {file} - Predicted Category: {category}")

File: 18723339_5.html - Predicted Category: Others
File: 18836203_9.html - Predicted Category: Others
File: 18853015_6.html - Predicted Category: Others
File: 18582946_7.html - Predicted Category: Others
File: 18563939_49.html - Predicted Category: Others
File: 18882293_2.html - Predicted Category: Others
File: 18460658_65.html - Predicted Category: Others
File: 18853061_5.html - Predicted Category: Others
File: 18460640_1.html - Predicted Category: Others
File: 18776682_20.html - Predicted Category: Others
File: 18666988_10.html - Predicted Category: Others
File: 18932217_3.html - Predicted Category: Others
File: 18756086_6.html - Predicted Category: Others
File: 18929978_59.html - Predicted Category: Others
File: 18586124_3.html - Predicted Category: Others
File: 18853011_1.html - Predicted Category: Others
File: 18756043_6.html - Predicted Category: Others
File: 18460658_45.html - Predicted Category: Others
File: 18619233_2.html - Predicted Category: Others
File: 18586111_21.html - 

In [88]:
# Evaluate the predictions if you have actual labels
if actual_labels:
    predicted_categories = [category for _, category in predictions]
    print("Accuracy:", accuracy_score(actual_labels, predicted_categories))
    print("Classification Report:\n", classification_report(actual_labels, predicted_categories))


Accuracy: 0.9833663366336634
Classification Report:
                   precision    recall  f1-score   support

  Balance Sheets       1.00      0.99      1.00       270
       Cash Flow       1.00      0.97      0.99        36
Income Statement       1.00      0.99      0.99       305
           Notes       0.99      0.96      0.97       690
          Others       0.97      0.99      0.98      1224

        accuracy                           0.98      2525
       macro avg       0.99      0.98      0.99      2525
    weighted avg       0.98      0.98      0.98      2525



actual_labels: This list contains the actual labels corresponding to the predictions.
predictions: This list contains tuples with filenames and predicted categories obtained from the batch_predict function.
The evaluation process includes the following steps:

Extract Predicted Categories:

predicted_categories = [category for _, category in predictions]
This list comprehension extracts only the predicted categories from the predictions list.
Calculate Accuracy:

accuracy_score(actual_labels, predicted_categories):
This calculates the accuracy of the predictions by comparing them with the actual labels (actual_labels).
The accuracy_score function is part of Scikit-learn's metrics module and computes the accuracy metric.
Generate Classification Report:

classification_report(actual_labels, predicted_categories):
This generates a detailed classification report that includes precision, recall, F1-score, and support for each category.
The classification report provides a comprehensive view of the model's performance across different categories.
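These steps help evaluate the predictive performance of your model and provide insights into its accuracy and ability to classify data correctly. Keep in mind that batch_predict was run on the same data-mohan directory that supplied the training data, so most of these 2,525 files were seen during training. The 0.98 accuracy therefore overstates performance on unseen tables; the scores on the 505-sample held-out test set are the better estimate.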

Just ensure that you have imported the necessary modules (accuracy_score and classification_report) from Scikit-learn's metrics module for this code to work.

Experimenting with Different Machine Learning Models

In [65]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
import joblib

In [66]:
# Define the models
models = {
    'RandomForest': RandomForestClassifier(),
    'SVM': SVC(),
    'NeuralNetwork': MLPClassifier(max_iter=1000)
}


You've defined a dictionary models that contains three different machine learning models: RandomForestClassifier, SVC (Support Vector Classifier), and MLPClassifier (Multi-layer Perceptron Classifier). Here's a breakdown of the models you've defined:

RandomForestClassifier:

Ensemble learning method based on decision trees.
Random forests can handle both classification and regression tasks; RandomForestClassifier is the classification variant.
Uses multiple decision trees to improve accuracy and control overfitting.
SVC (Support Vector Classifier):

A type of support vector machine (SVM) that can be used for classification tasks.
Effective in high-dimensional spaces and suitable for cases where the number of features exceeds the number of samples.
Uses kernel functions to transform data into higher-dimensional space for classification.
MLPClassifier (Multi-layer Perceptron Classifier):

A type of artificial neural network model.
Consists of multiple layers of nodes (neurons) with nonlinear activation functions.
Multi-layer perceptrons suit both classification and regression tasks and can learn complex relationships in data; MLPClassifier is the classification variant.
Each model has its strengths and weaknesses, and the choice of model depends on factors such as the nature of the data, the complexity of the problem, and the desired performance metrics.

You can further customize these models by specifying hyperparameters, such as max_iter for the MLPClassifier, to fine-tune their performance and improve accuracy on specific tasks.

In [67]:
# Define the hyperparameters for each model
param_grids = {
    'RandomForest': {
        'n_estimators': [100, 200, 300],
        'max_depth': [10, 20, 30]
    },
    'SVM': {
        'C': [0.1, 1, 10],
        'gamma': ['scale', 'auto'],
        'kernel': ['linear', 'rbf']
    },
    'NeuralNetwork': {
        'hidden_layer_sizes': [(100,), (150,), (200,)],
        'activation': ['relu', 'tanh'],
        'solver': ['adam', 'sgd']
    }
}

You've defined hyperparameter grids for each of the three machine learning models in your param_grids dictionary. Here's a breakdown of the hyperparameters you've specified for each model:

RandomForestClassifier:

n_estimators: Number of decision trees in the forest.
max_depth: Maximum depth of each decision tree.
SVC (Support Vector Classifier):

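C: Regularization parameter; the strength of the regularization is inversely proportional to C.
gamma: Kernel coefficient for the 'rbf', 'poly' and 'sigmoid' kernels; it has no effect when kernel='linear'.
kernel: Kernel type; the grid tries 'linear' and 'rbf' (other options include 'poly' and 'sigmoid').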
MLPClassifier (Multi-layer Perceptron Classifier):

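hidden_layer_sizes: A tuple giving the number of neurons in each hidden layer; (100,) means one hidden layer of 100 neurons.
activation: Activation function for the hidden layers; the grid tries 'relu' and 'tanh' ('logistic' and 'identity' are also available).
solver: Solver algorithm for weight optimization; the grid tries 'adam' and 'sgd' ('lbfgs' is also available).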
Each hyperparameter grid contains multiple values for the respective hyperparameters, allowing you to perform grid search or hyperparameter tuning to find the best combination of hyperparameters for each model.

For example, you can use techniques like GridSearchCV or RandomizedSearchCV from Scikit-learn to search through these hyperparameter grids and find the optimal set of hyperparameters that results in the best model performance.

Hyperparameter tuning is essential for optimizing model performance, reducing overfitting, and improving generalization on unseen data.
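
The size of each grid can be checked before running the search. The sketch below copies the param_grids dictionary from the cell above and counts the combinations with scikit-learn's ParameterGrid; the name candidate_counts is introduced only for this illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grids = {
    'RandomForest': {
        'n_estimators': [100, 200, 300],
        'max_depth': [10, 20, 30]
    },
    'SVM': {
        'C': [0.1, 1, 10],
        'gamma': ['scale', 'auto'],
        'kernel': ['linear', 'rbf']
    },
    'NeuralNetwork': {
        'hidden_layer_sizes': [(100,), (150,), (200,)],
        'activation': ['relu', 'tanh'],
        'solver': ['adam', 'sgd']
    }
}

# Number of hyperparameter combinations each grid search will try.
candidate_counts = {
    name: len(ParameterGrid(grid)) for name, grid in param_grids.items()
}
print(candidate_counts)  # {'RandomForest': 9, 'SVM': 12, 'NeuralNetwork': 12}
```

With 5-fold cross-validation, each candidate is fitted five times, so the random forest grid needs 45 fits and the other two grids need 60 fits each.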

In [None]:
best_models = {}

# Perform Grid Search for each model
for model_name in models:
    print(f"Performing Grid Search for {model_name}...")
    grid_search = GridSearchCV(models[model_name], param_grids[model_name], cv=5, n_jobs=-1, verbose=1)
    grid_search.fit(X_train, y_train)
    best_models[model_name] = grid_search.best_estimator_

    # Evaluate the best model
    y_pred = best_models[model_name].predict(X_test)
    print(f"Best {model_name} Model Parameters: {grid_search.best_params_}")
    print(f"Accuracy for {model_name}: {accuracy_score(y_test, y_pred)}")
    print(f"Classification Report for {model_name}:\n{classification_report(y_test, y_pred)}")

    # Save the best model
    joblib.dump(best_models[model_name], f'/content/drive/MyDrive/Projects/Phonepe/best_{model_name}_model.pkl')

# Save the vectorizer
joblib.dump(vectorizer, '/content/drive/MyDrive/Projects/Phonepe/vectorizer.pkl')

Performing Grid Search for RandomForest...
Fitting 5 folds for each of 9 candidates, totalling 45 fits
Best RandomForest Model Parameters: {'max_depth': 30, 'n_estimators': 300}
Accuracy for RandomForest: 0.9128712871287129
Classification Report for RandomForest:
                  precision    recall  f1-score   support

  Balance Sheets       1.00      0.90      0.95        50
       Cash Flow       1.00      0.20      0.33        10
Income Statement       1.00      0.85      0.92        62
           Notes       0.94      0.86      0.90       146
          Others       0.87      0.99      0.93       237

        accuracy                           0.91       505
       macro avg       0.96      0.76      0.81       505
    weighted avg       0.92      0.91      0.91       505

Performing Grid Search for SVM...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best SVM Model Parameters: {'C': 10, 'gamma': 'scale', 'kernel': 'linear'}
Accuracy for SVM: 0.9465346534653465
Clas


Your code snippet is performing grid search for hyperparameter tuning and selecting the best models for each algorithm defined in models. Here's an overview of what each part of your code does:

Initialization:

best_models = {}: This dictionary will store the best models for each algorithm after grid search.
Grid Search and Model Selection:

A loop iterates over each model defined in models.
For each model, grid search (GridSearchCV) is performed using the hyperparameter grid specified in param_grids.
The best estimator (model with the best hyperparameters) from grid search is stored in best_models under the corresponding model name.
Model Evaluation and Saving:

After grid search, the best model's parameters are printed along with the accuracy and classification report on the test data (X_test, y_test).
The best model is saved using joblib.dump with a file name indicating the model type (best_{model_name}_model.pkl).
Vectorizer Saving:

Lastly, the vectorizer used for text transformation is also saved using joblib.dump.
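
The same search pattern can be tried quickly on synthetic data before committing to the full run. In the sketch below, make_classification stands in for the TF-IDF matrix and labels, and small_grid is a reduced, hypothetical grid; only GridSearchCV, SVC and the attributes used come from scikit-learn as in the notebook. By default GridSearchCV refits the best parameter combination on the whole training set, which is why best_estimator_ can be used for prediction directly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF features and labels (assumption).
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

small_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), small_grid, cv=5)
search.fit(X_tr, y_tr)  # cross-validates every combination on the training set

print(search.best_params_)
print(len(search.cv_results_["params"]))  # number of candidates tried

# best_estimator_ has already been refit on all of X_tr with the best parameters.
test_accuracy = search.best_estimator_.score(X_te, y_te)
print(test_accuracy)
```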

Loading and Predicting with the Best Model

In [None]:
# Load the vectorizer and the best model
vectorizer = joblib.load('/content/drive/MyDrive/Projects/Phonepe/vectorizer.pkl')
best_rf_model = joblib.load('/content/drive/MyDrive/Projects/Phonepe/best_RandomForest_model.pkl')
best_svm_model = joblib.load('/content/drive/MyDrive/Projects/Phonepe/best_SVM_model.pkl')
best_nn_model = joblib.load('/content/drive/MyDrive/Projects/Phonepe/best_NeuralNetwork_model.pkl')

In [None]:
# Define the prediction function
def predict_category(html_content, model):
    # Extract table text from HTML content
    table = pd.read_html(html_content)[0]
    table_text = table.to_string()

    # Transform the text using the loaded vectorizer
    X_new = vectorizer.transform([table_text])

    # Predict the category using the loaded model
    prediction = model.predict(X_new)
    return prediction[0]

Inputs:

html_content: The HTML content to be processed and predicted.
model: The trained model to be used for prediction.
Process:

Extract Table Text:

pd.read_html(html_content)[0]: Extracts tables from HTML content. Assumes the first table is relevant.
table.to_string(): Converts the DataFrame to a string format. Assumes the table contains text relevant for prediction.
Text Transformation:

vectorizer.transform([table_text]): Transforms the table text into a numerical format using the loaded vectorizer (vectorizer).
Prediction:

model.predict(X_new): Uses the provided model (model) to predict the category based on the transformed text (X_new).
Output:

Returns the predicted category as the output of the function.
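This function assumes that vectorizer and model are defined and properly trained before calling predict_category. It's important to ensure that the vectorizer used for transforming text data during training is the same one used during prediction. Note that this definition replaces the earlier one-argument predict_category, so batch_predict, which calls predict_category(html_content), would need to pass a model as well if it is run after this cell.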

You can call this function with HTML content and the desired model to get predictions for the category of the content.
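
One way to guarantee that training and prediction share the same vectorizer is to bundle it with the classifier in a scikit-learn Pipeline. This is an alternative to the notebook's approach, shown here as a sketch on hypothetical toy text; MultinomialNB is only a placeholder for whichever model is chosen. A side effect is that the vectorizer is fitted on the training portion only, whereas the notebook fits it on all of text_data before the split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical raw table strings and labels.
toy_text = [
    "total assets cash equivalents",
    "total assets inventory",
    "total assets and liabilities",
    "total assets equity",
    "revenue cost of sales",
    "revenue operating profit",
    "revenue net profit tax",
    "revenue expenses profit",
]
toy_labels = ["Balance Sheets"] * 4 + ["Income Statement"] * 4

# Split the raw text first so the vectorizer never sees the test strings.
text_train, text_test, y_train_toy, y_test_toy = train_test_split(
    toy_text, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),  # fitted on text_train only
    ("clf", MultinomialNB()),      # placeholder classifier
])
pipeline.fit(text_train, y_train_toy)

# The pipeline accepts raw text and applies the vectorizer internally.
pipeline_pred = pipeline.predict(["total assets"])
print(pipeline_pred[0])
```

A fitted pipeline can be saved with a single joblib.dump call, so the model and its vectorizer cannot get out of sync.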

In [None]:
# Example usage
html_content = open('/content/drive/MyDrive/Projects/Phonepe/data-mohan/Income Statement/18445487_1.html', 'r').read()
predicted_category_rf = predict_category(html_content, best_rf_model)
predicted_category_svm = predict_category(html_content, best_svm_model)
predicted_category_nn = predict_category(html_content, best_nn_model)

'/content/drive/MyDrive/Projects/Phonepe/data-mohan/Income Statement/18445487_1.html': The path to the HTML file containing the content to be predicted.
best_rf_model, best_svm_model, best_nn_model: These are assumed to be the best models obtained after grid search or hyperparameter tuning for Random Forest, SVM, and Neural Network, respectively.
The predict_category function is called three times with the HTML content and each of the best models (best_rf_model, best_svm_model, best_nn_model) to obtain predictions for the category of the content using different models.

This example demonstrates how to use your trained models for predicting categories of HTML content, allowing you to leverage different algorithms to make predictions based on the specific characteristics of the data and the model's performance.

In [None]:
print("Predicted Category (Random Forest):", predicted_category_rf)
print("Predicted Category (SVM):", predicted_category_svm)
print("Predicted Category (Neural Network):", predicted_category_nn)