# Training and Testing Finance Data Set

### ✅ Step 1: Installing required packages


In [15]:
# Install the packages used in this notebook if they are not already installed
%pip install scikit-learn pandas datasets


Note: you may need to restart the kernel to use updated packages.


### ✅ Step 2: Preparing the data

In [16]:
import pandas as pd
from datasets import load_dataset

# Load the dataset with both splits
ds = load_dataset("gretelai/synthetic_pii_finance_multilingual")

# Convert to DataFrame and filter for English
df_train = pd.DataFrame(ds["train"])
df_test = pd.DataFrame(ds["test"])

df_train = df_train[df_train["language"] == "English"].reset_index(drop=True)
df_test = df_test[df_test["language"] == "English"].reset_index(drop=True)

# Use text as input (X), and document type as label (y)
X_train = df_train["generated_text"]
y_train = df_train["document_type"]
X_test = df_test["generated_text"]
y_test = df_test["document_type"]

print("Training set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)


Training set shape: (25948,) (25948,)
Test set shape: (2962,) (2962,)
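
Because both splits are filtered by language after loading, it can be worth confirming that every label in the test split still appears in the training split, since the classifier cannot predict a class it never saw. The following is a minimal sketch using only the standard library; `unseen_test_labels` is a hypothetical helper and the toy labels are invented for illustration. In the notebook, `y_train` and `y_test` could be passed in directly.

```python
from collections import Counter

def unseen_test_labels(train_labels, test_labels):
    """Return test labels (with their counts) that never occur in the training labels."""
    train_seen = set(train_labels)
    test_counts = Counter(test_labels)
    return {label: n for label, n in test_counts.items() if label not in train_seen}

# Toy labels for illustration only (not taken from the dataset).
toy_train = ["Invoice", "Invoice", "Bank Statement", "Audit Report"]
toy_test = ["Invoice", "Loan Application", "Loan Application"]

missing = unseen_test_labels(toy_train, toy_test)
print(missing)  # {'Loan Application': 2}
```

An empty dictionary means every test label is covered by the training split.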


### ✅ Step 3: Building & training the model

In [17]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

model = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=10000)),
    ("clf", LogisticRegression(max_iter=200))
])

model.fit(X_train, y_train)


## Why do we use TF-IDF instead of BERT? ##

We use **TF-IDF** instead of BERT for several practical reasons:

- **Simplicity and Speed:**  
  TF-IDF is easy to implement and very fast to compute, making it ideal for quick experiments and prototyping. It does not require a GPU or large memory.

- **Lower Resource Requirements:**  
  BERT and other transformer models are computationally intensive and require more memory and processing power, which may not be available or necessary for your current task.

- **Interpretability:**  
  With TF-IDF, each feature corresponds to a specific word or n-gram, making it easier to understand what the model is learning. BERT embeddings are dense and less interpretable.

- **Strong Baseline:**  
  For many text classification problems, especially with well-structured and formal text (like finance documents), TF-IDF combined with a simple classifier (such as logistic regression) provides a strong and reliable baseline.

- **No Need for Pretrained Models:**  
  BERT requires downloading large pretrained models and possibly fine-tuning, which adds complexity. TF-IDF works out-of-the-box with your data.

**In summary:**  
TF-IDF is a practical, efficient, and interpretable choice for this dataset and task, especially in the early stages of experimentation or when computational resources are limited. If we later need higher accuracy or want to capture deeper semantic meaning, we can experiment with BERT or other transformer-based embeddings.

### ✅ Step 4: Evaluating the model


In [12]:
from sklearn.metrics import classification_report, accuracy_score

# Predict on the test set
y_pred = model.predict(X_test)

# Print accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# Print detailed classification report
print(classification_report(y_test, y_pred))

Accuracy: 0.9572254335260115
                                        precision    recall  f1-score   support

                         Annual Report       0.94      0.92      0.93        49
                          Audit Report       0.93      0.98      0.95        53
                            BAI Format       0.94      0.93      0.93       169
                        Bank Statement       0.96      0.82      0.88        56
                        Bill of Lading       1.00      0.99      1.00       154
                         Business Plan       0.94      0.98      0.96        48
                                   CSV       0.88      0.99      0.93       165
                Compliance Certificate       0.93      0.90      0.91        41
       Corporate Governance Guidelines       0.95      0.98      0.96        53
                  Corporate Tax Return       0.94      0.88      0.91        74
                    Credit Application       0.92      0.81      0.86        57
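
The `f1-score` column is the harmonic mean of the `precision` and `recall` columns. As a quick sanity check, the sketch below recomputes it for the Bank Statement row. The inputs are the rounded values printed in the report, so the result is approximate, and `f1_score_from` is a hypothetical helper name.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the "Bank Statement" row of the report.
bank_statement_f1 = f1_score_from(0.96, 0.82)
print(round(bank_statement_f1, 2))  # 0.88, matching the f1-score column
```

The gap between precision (0.96) and recall (0.82) for this class indicates that the model rarely mislabels other documents as bank statements but misses a noticeable share of the real ones.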
          

### ✅ Step 5: Making a Test Prediction

Let's use our trained model to predict the document type for a new finance-related document.


In [13]:
# Example: Predict the document type for a new document
new_text = ["This is a sample finance contract about a loan agreement."]
predicted_type = model.predict(new_text)
print("Predicted document type:", predicted_type[0])

Predicted document type: Loan Application
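
A single predicted label hides how confident the model is. Because logistic regression exposes class probabilities, the top few candidates can be listed instead. The following is a minimal sketch on a toy pipeline; the corpus, the `demo_model` name, and the `top_k_predictions` helper are assumptions for illustration and are not part of the notebook above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data invented for illustration; not drawn from the dataset.
demo_texts = [
    "loan application applicant income requested loan amount",
    "loan application applicant credit history",
    "bank statement account balance transactions",
    "bank statement opening balance deposits withdrawals",
    "audit report auditor opinion",
    "audit report auditor findings",
]
demo_labels = ["Loan Application", "Loan Application", "Bank Statement",
               "Bank Statement", "Audit Report", "Audit Report"]

demo_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=200)),
])
demo_model.fit(demo_texts, demo_labels)

def top_k_predictions(pipeline, text, k=3):
    """Return the k most probable (label, probability) pairs for one document."""
    probabilities = pipeline.predict_proba([text])[0]
    classes = pipeline.named_steps["clf"].classes_
    order = np.argsort(probabilities)[::-1][:k]
    return [(classes[i], float(probabilities[i])) for i in order]

for label, prob in top_k_predictions(demo_model, "loan application from a new applicant"):
    print(f"{label}: {prob:.2f}")
```

Applied to the `model` from Step 3, the same helper would show whether "Loan Application" wins by a wide margin or only narrowly over similar document types.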


## Summary

- The model was trained to classify English-language finance documents into 60 different document types using TF-IDF features and logistic regression.

- Data was loaded directly from the Hugging Face `gretelai/synthetic_pii_finance_multilingual` dataset, and filtered to include only English-language documents.

- We did not split the data ourselves; instead, we used the pre-split training and test sets provided by the Hugging Face dataset.

- Model evaluation included accuracy and a detailed classification report, providing insights into precision, recall, and F1-score for each class.

- This workflow provides a strong, interpretable baseline for finance document classification and can be extended with more advanced