# Goal of script

Final version created on 21/03/2023 by Claire S

This script trains SetFit on sentence fragments. Required packages:

- spacy_sentence_bert (if using)
- huggingface
- sentence-transformers
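
Before running the notebook it can help to confirm that the packages are importable. The following is a minimal sketch using only the standard library. The import names are taken from the import cells further down, and `missing_packages` is a hypothetical helper, not part of any of these libraries.

```python
from importlib.util import find_spec

# Top-level import names used later in this notebook
# (assumed to match what is installed in your environment).
REQUIRED = ["setfit", "sentence_transformers", "datasets", "sklearn", "torch"]


def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


print("Missing packages:", missing_packages(REQUIRED) or "none")
```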


Some sources:
- https://levelup.gitconnected.com/introduction-to-setfit-few-shot-text-classification-3fbf3a5b9b90
- https://www.youtube.com/watch?v=8h27lV8v8BU&ab_channel=HuggingFace

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import sys
import ipynb
import re
import numpy as np

# make my own helper functions importable
sys.path.append('/project/Xelix_Project/utils')

In [3]:
from ipynb.fs.full.Regex_html_Functions import clean_websites
from ipynb.fs.full.Loop_Functions import sentence_by_sentence

In [4]:
# pandas options
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)

# Examples V3

Updates made to the training set:

- Removed all greetings (e.g. "Good morning", "Hello Jane", "Hi Alex").
- Made the training set more distinct, because overly similar sentences were harming model performance. Target examples now emphasise "new bank details" (e.g. "see our new bank details on the attached invoice"), while negative examples include sentences such as "click here for our bank details" and "the bank details are on the invoice".
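
The greeting removal described above could be sketched as follows. The regular expression is an assumption made for illustration (a salutation optionally followed by a single name); the original filtering was presumably done by hand or by other means, and `drop_greetings` is a hypothetical helper.

```python
import re

# Hypothetical pattern: a salutation, optionally followed by one name,
# and nothing else on the line.
GREETING = re.compile(
    r"^\s*(good (morning|afternoon|evening)|hello|hi|dear)(\s+[A-Za-z]+)?\s*[,!.]?\s*$",
    re.IGNORECASE,
)


def drop_greetings(sentences):
    """Keep only the sentences that are not a bare greeting."""
    return [s for s in sentences if not GREETING.match(s)]


examples = ["Good morning", "Hello Jane", "Hi Alex", "we have updated our bank details."]
print(drop_greetings(examples))
```

Sentences that merely start with a salutation, such as "Dear ACCOUNTS PAYABLE, Please find attached a statement of your account", are kept because the pattern must match the whole fragment.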

Outstanding question: could we make the model even more sensitive?
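
One way to approach the sensitivity question is to lower the decision threshold on the positive-class probability, which trades precision for recall. The sketch below uses made-up probabilities and labels. In this notebook the probabilities would presumably come from the model's `predict_proba` method, whose exact output shape is not shown here, so treat the input format as an assumption.

```python
def recall_at_threshold(probs, labels, threshold):
    """Recall (sensitivity) when predicting positive for prob >= threshold."""
    true_pos = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    actual_pos = sum(labels)
    return true_pos / actual_pos if actual_pos else 0.0


def precision_at_threshold(probs, labels, threshold):
    """Precision when predicting positive for prob >= threshold."""
    predicted_pos = sum(1 for p in probs if p >= threshold)
    true_pos = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    return true_pos / predicted_pos if predicted_pos else 0.0


# Made-up positive-class probabilities and gold labels, for illustration only.
probs = [0.95, 0.70, 0.45, 0.30, 0.60, 0.10]
labels = [1, 1, 1, 1, 0, 0]

for t in (0.5, 0.25):
    print(t, recall_at_threshold(probs, labels, t), precision_at_threshold(probs, labels, t))
```

On this toy data, dropping the threshold from 0.5 to 0.25 catches every positive at the cost of one extra false alarm, which is the trade-off to weigh for bank-change detection.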

In [5]:
target_email_examples = [
    
    ### True positive examples
    # short human-like emails
    'please find our updated bank details attached.',
    'we have updated our bank details.',
    'we have changed bank since last month.',
    'we have switched banks and we would kindly ask for our details to be updated.',
    'we have updated our bank information.',
    'we have switched banks and now have new bank information.',
    'Please find attached our bank letter with our bank details, please let us know if you need anything else.',
    'Our bank details changed in May 2021!',
#     'Please find attached a letter containing new contact details as there has been an Aquaid branch, and your account has been transferred to Aquaid (South Coast).',
    'As part of this change I would be very grateful if you could please also change the bank account details you use to pay all Aquaid invoices to the new bank account given in the letter.',
    'please find attached our updated bank details.',
    'we have switched to a new bank.',
    'Please find attached our bank letter with updated information.',
    'please find attached our updated bank details.',
    'we have updated our bank information.',
    'please see the attached document for our new bank information.',
    'attached is a document with our new bank details.',
    'we have switched banks.',
    'please find attached our bank letter with updated bank details.',
    'our bank details have changed.',
    'Please find attached the letter with new bank details as there has been an Aquaid branch transfer.',
    'Please update our vendor profile with our new bank information.',
    'We have changed banks and would like you to pay invoices into the new account as the old one is closing soon.',
    
    # updating more generic
    'Could you please update our bank information according to the attached letter?',
    
    # more generic emails
    'we have changed our bank.',
    'Please find our new bank details below.',
    'This email is to notify you that we have changed our banking information.',
    'Attention: We have updated our bank details.',
    'we wanted to let you know that we have changed our bank.',
    'please note that our bank information has changed.',
    'Please note we have new bank details so please update your systems accordingly',
    'Please would you arrange for all future BACS and CHAPS payments to be forwarded to this account',
    'We have changed banks as of 11/14/22.',
    'Please find attached our new bank details',
    'Can you please send me a form so I can give you our new bank account information?',
    'Please find the attached requested signed letter in regards to our recent change of bank details.',
    'Would you be able to help me update our vendor profile along with our bank information?',
    'We would ask that you now pay invoices into the Action24 bank account as the Wilson one is due to close shortly.',
    'Please find attached Active Security Group Ltd, Re Branding Letter, which also has our New Bank Details.',
    'However, we are requesting you update our banking details in your system to that of our Ahead, Inc. bank account with Regions Bank.',
    'The Ahead, Inc. Regions Bank account information is on page 2 of the attachment within a letter from Regions Bank certifying the account.',
    'OUR BANK DETAILS HAVE RECENTLY CHANGED.',
    'OUR BANK DETAILS HAVE CHANGED RECENTLY.',
    'Please see the attached statement for new bank information.',
    'PLEASE SEE RECENT STATEMENT / INVOICES SHOWING OUR NEW BANK DETAILS.',
    'NEW BANK DETAILS:',
    '*** PLEASE NOTE OUR NEW BANK DETAILS WITH IMMEDIATE AFFECT ***',
    'Please find below a formal letter stating our new company details and our NEW bank account.',
    'Please ensure all future payments are made to our new Starling Account',
    'We recently mailed out a notification regarding a change of banking for your lease(s) with InSite, which were acquired by American Tower in March 2021.',
    'We sent out a letter in October regarding a banking change for leases that American Tower acquired from IWG/InSite last year.',
    'Our records show that E. W. Scripps is still sending payments to our old bank account',
    'Please note our bank details have recently changed, these can be found at the bottom of the attachment.',
    'Going forward can you please make all payments to our new AIB bank account, details of our AIB bank account are attached.',
    'please find attached our Rebranding Letter with new bank details for Active Security Group Ltd.',
    'We have changed our bank account to the Ahead, Inc. account with Regions Bank.',
    'The bank information is in the attached letter.',
    'New Account Number: 31354760 New Sort Code: 40-11-60',
    
    #complicated
    'we have merged with Company, and as a result, our financial administration will change.', # sort of in a bank change email
    
    # bank-change combo with another topic
    'Please can you confirm Settlement date of our 2 invoices that you have updated our bank details to the new ones supplied (as attached)',
    'Please note: Due to our current banking provider (Ulster Bank) leaving the Irish Market our bank details have changed.',
    'Our New Bank Details (AIB) are noted at the top of the invoice.',
    'Please note our new bank details on the invoice.',
    'Please note that our bank details have changed, see attached',
    'Can you amend your records for all future payments',
    'Can you please make this change to our account on your systems with immediate effect for all future payments',
    'As part of this change I would be very grateful if you could please also change the bank account details you use to pay all Aquaid invoices to the new bank account given in the letter',
    'It does not look as though our information has been updated as I just received another deposit in our old bank account under our old name',
    'NEW BANK DETAILS: ACCOUNT NAME ALLAN REEDER LTD BANK LLOYDS BANK SORT CODE 30-95-89 ACCOUNT NO 10909872',
    'Please note we have changed our Bank details: ',
    'Please ensure you pay via our new bank details of : Bank Name: HSBC Sort Code : 40-61-35 Account Number : 03006513',
    'Please ensure you pay to our updated bank details of : Bank Name: Nationwide Sort Code : 36-80-56 Account Number : 19217583',
    'Please check our new bank details ',
    'All invoices from the 1st of July will reflect the changes to FourNet, and payments from this date will need to be made to the account shown above.',
    'Please note we have changed our Bank details: ',
    'Please update according to our new Bank details: ',
    'Please find our new bank details on top of the attached invoice',
    'We are trying to migrate all of our payment to the Citi Bank Account',
    'Please let me know if I can help with anything in order to update our payment information on your side',
    'Please make the payments via our new bank details',
    'I just received another deposit in our old bank account under our old name',
    'Please can you make these updates to your records of our old bank account',
    'Please see attached invoice due for payment and new bank details ',
    'However, bank details are different to what we have on GPD and needs to be updated',
    'See attached document with updated payment information',
    'Please update our payment information',
    'we have new payment information',
    'We are in the process of closing our SVB bank account and so are reaching out to customers that pay us through this account',
    'We have notified your company on several occasions that the PAYEE bank account details do need amending',
    'your company persistently pay funds into the wrong bank account',
    'The bank account you are currently paying monies into closes this coming Friday and any deposits then made after this date are at risk of being lost',
    'you are paying into the wrong bank account',
    'when we receive written confirmation that the bank account details have been changed',
    'Please update your records to make payments to this account going forward',
    'Please find your statement of account attached and a Change of Bank Details letter',
    'Please find attached a Change of Bank Details letter',
    'Please can you confirm that you have updated our bank details to the new ones supplied (as attached) '  # NOTE: no trailing comma, so Python joins this string with the next one into a single example
    '**OUR USD / EURO BANK DETAILS HAVE CHANGED**',
    'From the 1st of July, we will also need you to amend any records in the system to reflect the changes',
    'Please can you change any company information held for C>Ways to reflect the following.',
    'Change of Bank Details',
    "Please find enclosed a letter regarding Benjamin West's change of bank details.",
    'We have a NEW BANK ACCOUNT which is Windsor Tourist Guides Ltd Sort Code 23-05-80 Account Number 39717905',
    'Please note our new Bank Account Our bank account is',
    'It is against Chess Valley as the supplier but they have new bank details as they have been bought out',
    'We have had a message from our bank informing us that you have changed your bank details (sort code)',
    'It does not look as though our information has been updated as I just received another deposit in our old bank account under our old name',
    'See the attached document with updated payment information',
    'This information is also included on all of our invoices',
    'See the attached invoice with the new information',
    'We have updated our financial administration',
    'Please let me know if you need more assistance to update our bank details',
    'We are in the process of closing our SVB bank account and so are reaching out to customers that pay us through this account',
    'PLEASE NOTE OUR CHANGE OF BANK DETAILS ATTACHED',
    "we're requesting you forward November's payment to our JPM Chase bank account",
    
    # Chat GPT
    'Please check our new bank details',
    'Enclosed, you will find our updated bank details',
    'Our updated bank details are attached for your reference',
    'Attached, please find the updated information for our new bank account',
    'Our banking information has been updated',
    'Our account details have been revised',
    'We have made changes to our bank details',
    'Our bank information has been recently updated',
    'We recently switched to a different bank',
    'Our banking institution has changed',
    'We have transferred our accounts to a new bank',
    'As of last month, we are now banking with a different institution',
    'Our accounts have been transferred to a new banking institution, and we now have new information to share',
    'We recently changed banks, and our new bank information is as follows',
    'We have switched to a different bank, and we have new account information to share',
    'As of now, we are banking with a new institution, and we have updated information to share',
    'Enclosed is a letter with our bank details. Please let us know if you require any additional information',
    'Attached, you will find a letter with our banking information. If you require any further details, please let us know',
    'We have attached a letter with our bank details. Please do not hesitate to contact us if you require any additional information',
    'Our bank letter, which includes our account details, is attached. Please let us know if you need anything else',
    'Please see the attached document for the new information and update your payment records accordingly',
    'Kindly update your payment information to reflect the change',
    'Please find attached a letter with the new bank details',
    'We would like to inform you that we have switched banks and have new bank details',
    'Please be informed that our bank information has changed',
    'To ensure prompt payment, please update your records accordingly',
    'The new bank details are included in the attached document',
    'Due to the recent acquisition of our company by XYZ, we have merged with their financial administration',
    'As a result, our bank details have been updated',
    'Please see the attached letter with the new bank details',
    'Please note that our bank details have changed due to our current banking provider leaving the market',
    'The new bank details are noted at the top of the invoice',
    'Kindly update your payment records accordingly',
    'We have recently rebranded and as part of the process, changed our bank account information',
    'Please find attached our rebranding letter with the new bank details',
    
    # new
    'Please see attached invoice due for payment and new bank details',
    
]
print(len(target_email_examples))

149
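
Several sentences appear more than once in the list above (for example `'please find attached our updated bank details.'`), and repeated examples may end up on both sides of a later train/test split. The following is a small sketch for surfacing exact duplicates, shown on a toy list. `find_duplicates` is a hypothetical helper, and normalising by stripping whitespace and lowercasing is an assumption about what should count as "the same" sentence.

```python
from collections import Counter


def find_duplicates(sentences):
    """Return {normalised sentence: count} for sentences seen more than once."""
    counts = Counter(s.strip().lower() for s in sentences)
    return {s: n for s, n in counts.items() if n > 1}


toy = [
    "please find attached our updated bank details.",
    "we have switched to a new bank.",
    "please find attached our updated bank details.",
    "Please check our new bank details ",
    "Please check our new bank details",
]
print(find_duplicates(toy))
```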


In [6]:
neg_email_examples = [
    ### True negative examples
    # short human-like emails
    'Please find attached an updated statement for the Coworth Park account',
    
    # fraud/cybercrime warnings
    'Please contact us if you receive an email notifying you of any change.',
    'Fraud/cybercrime warning: we will never notify you of bank detail changes via email.',
    'We want to make you aware of a recent fraud attempt regarding bank details.',
    'Please let us know if you have received a potentially fraudulent email.',
    'Attention: We have received reports of fraudulent emails regarding bank detail changes.',
    'Please disregard any such emails and contact us immediately if you receive one.',
    'we wanted to make you aware of recent fraud attempts regarding bank details.',
    'Please be vigilant and contact us immediately if you receive any suspicious emails.',
    'Please be aware of potential phishing attempts and do not click on suspicious links.',
    'We would like to emphasize that we will never ask you to disclose your personal information via email.',
    'Fraud alert: we have received reports of fraudulent emails posing as our company.',
    'Please be wary of emails asking for gift card payments or wire transfers.',
    'Please be cautious of emails that request urgent action or threaten consequences.',
    'Attention: we have noticed an increase in fraudulent emails posing as our company.',
    'Please be vigilant.',
    
    # Call to action different topics
    'Please respond to the statement as fast as possible and pay any outstanding invoices.',
    'Please can you pay your outstanding invoices as soon as possible.',
    'Please be extra vigilant when opening attachments or clicking links.',
    'Please consider the environment before printing this email message.',
    'We kindly request if you could arrange payment as soon as possible?',
    'We would ask you to focus, in any fresh text that you may care to send us, on your private dining room and its capacity/ambience as this is the information that visitors to are specifically looking for.',
    'Please additionally advise us of your opening hours every day for lunch and dinner as well as advising us of the number of seated dining guests and standing event guests that your private dining room(s) can  accommodate to enable us to add this information to your listing.',
    'We would ask that any new content you send us is original, not to be found elsewhere on the internet, as Google favours original content.',
    'We are waiting to send your first purchase order from our Scripps location KOAA, but first we need some information from you.',
    'If you are interested in this payment method, please contact US Bank (our bank provider) at 1-855-268-5386, or by email to paymentplus@access-online.com ',
    'I am following up on our invoice 409292 for PS314.88 that is now overdue for payment',
    'Please could you confirm if this is now approved and when we can expect to receive payment?',
    'Please confirm your attendance for the upcoming conference by the end of the week.',
    'If you have not already done so, please submit your application by the deadline.',
    'We would appreciate your feedback on our recent product launch. Please take a moment to complete our survey.',
    'Please visit our website for more information on our services and pricing.',
    'We encourage you to share your experience with us on social media.',
    'Please consider making a donation to support our charitable cause.',
    'Please RSVP for the company holiday party by December 1st.',
    'If you have any questions or concerns, please do not hesitate to contact us.',
    'Please save the date for our upcoming charity fundraiser on March 31st.',
#     'PLEASE NOTE OUR BANK DETAILS FOR OUR STERLING ACCOUNT ARE AS BELOW: ROYAL BANK OF SCOTLAND SORT CODE 16-26-32 ACCOUNT NUMBER 11526700',

    # other types of updating
    'Please update your contact information in our database to ensure you receive our latest updates.',
    'Please find your updated invoice attached.',
    
    # negative change
    'The Company have not changed their bank details',
    'ArtSystems have not changed their bank details',
    'Please note that we will never ask you to provide your bank details via email.',
    'We will never ask you to change our bank details.',
    'Please note that we will never ask you to provide your bank details or any other sensitive information via email.',
    'Should you receive any emails from me regarding a change in bank details, please call the office to check the validity of the request prior to making any change to your systems',
    'We want to remind you that we will never ask for your password via email.',
    'Please note that we will never ask for your credit card details over email.',
    'We want to stress that we will never request sensitive information via email.',
    'Please do not change the bank details and continue to use the account below for all payments to us.',
    'Please also note that we will never ask you to provide sensitive information via email.',
    
    # other topics, e.g. invoices, statements
    'we would like to hereby send you your statement.',
    'Your statement is attached. Please review it and let us know if you have any questions or concerns.',
    'please find your order confirmation attached.',
    'Dear ACCOUNTS PAYABLE, Please find attached a statement of your account',
    'Invoice 21110019589 paid 9/23 via EFT/ACH',
    'This statement is correct as at 01/06/2022',
    'The attached statement is accurate as of 29/12/2023',
    'On checking your account, we see that you have PS238.17 unpaid from April and prior, details attached.',
    
    # completely different emails 
    "I currently don't see a payment for $1100.00 for the month of March",
    'Remittance below.',
    'If you are not the intended recipient, please delete this e-mail immediately',
    'In order for you or your company to be added to our Procure-to-Pay system, we will need you to complete our Supplier Self-Registration',
    'Per the attached final audit we all agreed that the credits listed below were correct since you had not paid the invoices',
    'Today I downloaded the details for the payment processed on 27th Transaction ref.# 287447',
    'Now, you need to pay back the amount of $21,287.57',
    'Activity and use of our systems is monitored to secure its effective use and operation and for other lawful business purposes',
    'Communications using these systems will also be monitored and may be recorded to secure effective use and operation and for other lawful business purposes',
    'If you are not the addressee, please notify the sender immediately by return e-mail and delete this message',
    'Company and its subsidiaries',
    'All product and company names referenced herein are trademarks of their respective owners',
    'Limited is registered in England and Wales',
    'The Company is registered in England and Wales',
    'This ensures the smooth running of all accounts going into the New Year',
    "Please find attached the report on our company's sustainability efforts.",
    'Please see attached for our latest white paper on industry trends.',
    'However, we have two other payment options that will allow you to get your payments',
    'The two invoices expired still not paid are nr 23 and 38 issued by Company on Feb/march 2022, see them attached Amounts are in EUR and we should receive the bank transfer on our Country bank account',
    'I have just checked our bank account again and the EFT has not posted and is not pending in our account(which normally would not take this long if you said it was to pay on 5/13/22)',
    'The above invoices were paid on the payment date above and will be transferred to your bank account via Bacs / Swift',
    'This change reflects a positive shift in our capabilities and services, and all current staff remain in place',
    "We believe the most important details, though, are yours - and to keep the information you share with us as safe as possible, we've updated our privacy policy in line with the requirements set out by the General Data Protection Regulation",
    'Additional information regarding your bill, individual charge service details, and your account history are available on the Billing & Cost Management Page.',
    
    # uncertain about whether to include in training
    '**Please note that we now take Credit/Debit Cards for payment and we have a Direct Debit facility that all our Clients can sign up to',
    'Details are shown in the Subject line',
    'The AWS Customer Agreement was updated on April 30, 2019',
    'You can see more information about these changes at /agreement/recent-changes/',
    
    # new additions
#     'Bank Details Euro Sterling Name: A M Dunne & Son Ltd',
    'Click here for our bank details',
    'Yes, we would change all payments to EFT',
#     'It is against Chess Valley as the supplier but they have new bank details as they have been bought out',
    
    'Kindly update your systems/records with our new email address details: ',
    'The bank details are the same, however, invoices have different numbers',
    'will be updating some of our email addresses and as such the email address from which you receive your invoices and statements from us will change',
    'The bank details are still the same',
    'The bank details have not changed',
    'There are no changes to the bank details',
    'Further to our email communication sent last month, we can now confirm the legal name change of Dorma UK Ltd to dormakaba UK Ltd',
    'What is the new companies house number',
    'In continuing our continuous improvement plan, we are upgrading our core computer systems to SAP',
    "We've rebranded",
    'The new and updated Pierre Frey website now provides you with more information on dimensions, FR rating, composition and more',
    'The assets of Map Marketing Ltd have been purchased by a new company All Jigsaw Puzzles Ltd and the factory in the UK with its existing staff will continue to manufacture the same products as before, without any interruption to production',
    'WE HAVE MOVED 1 st floor, 1 Europa Drive, Sheffield, S9 1XT',
    "In case new linked accounts are added, click 'Enable Tax Settings Inheritance' so the tax registration information will automatically be inherited from your Payer account to your Linked account",
    'If you can please reply and confirm as per the attached statement that these invoices will be paid as per the agreed terms we can get the new orders dispatched',
    'We constantly look at ways to improve the service we offer, and as part of this we have moved to utilise a new system called Sidetrade which will deliver improvements in the way we communicate with you',
    "I understand that you require our full data to set us up on your system as a supplier, so could you please send me your 'New Suppliers Account' form if applicable",
    'As of August, Bottomline Technologies will be updating some of our email addresses and as such the email address from which you receive your invoices and statements from us will change',
    'We are writing to you as an existing customer of Wilson Security to let you know that as of 22nd December 2022 Action Alarms Limited t/a Action24 acquired Wilson Security Limited',
    'The Dorma name change does not yet mark the full corporate merge of the two businesses',
    "The assets of Map Marketing Ltd have now been purchased by a new company 'All Jigsaw Puzzles Ltd' that has taken over the trade accounts (including yours) and the factory in Hatherleigh with its existing staff and will continue to manufacture the same map based goods, laminated planners and",
    'Change of Name/ Brand/ Trading Style As you are aware, Irongate Group was acquired by Complete Business Solutions Group Ltd on the 1st November 2018 and Irongate Group became a trading name of Complete Business Solutions Group Ltd',
    'The template for our Statements and Dunning correspondence will change',
    'Direct Debit payment collections dates will now be in line with the due date listed on our Invoices',
    'The name change is a consequence of our rebrand to position us better in the market in line with the products and service we now offer',
    'Please accept this communication by way of informing you that we have recently changed our registered company name from ACS Business Supplies Ltd to ACS Technology Group Ltd',
    'As part of our ongoing commitment to invest in innovation and continuous improvement, towards the end of last year, we announced a business restructure, with Swift360 joining the SMI Group family',
    'From 15th August 2022, all employees will be physically located in the same office, with Swift360 team members joining the SMI team in our new head office in Petersfield, Hampshire',
    'We converted to a new business model as at the close of play yesterday (we\'re now a "John Lewis" style of ownership) and I\'m afraid our accounts department is completely overrun with the switch over',
    'If you have any questions regarding your statement please contact us using the details below',
    'If you could please check your bank records and advise accordingly',
    'Could you confirm your bank details and send us new statement',
    'The BACs payment list was changed yesterday',
    'If you receive any notification regarding a change to our bank details, please immediately contact by phone your account manager or, if not available, another member of our staff',
    'Likewise, if we receive any notification regarding a change to your bank details, we will be contacting a representative at your company directly for confirmation of both old and new bank details prior to accepting the change',
    'Please be aware that we do not notify changes (or accept notification of changes) to bank details by email',
    'Please amend your accounts to all Remittance advices are emailed to',
    'Please find attached overdue invoice',
    'Please find attached your latest invoice from Company.',
    'Please contact your accounts department re payment of your overdue Account',
    'If you have received new information, do let us know',
    'Please let me know if I can assist with anything else',
    'A copy of the invoice is attached as requested',
    'The following payment has been remitted',
    'See attached invoice relating to your recently dispatched order',
    'Can you tell me which bank account this money was deposited in',
    'I am reaching out to inquire on the status of this invoice',
    'Can you please provide a payment update on the overdue balance highlighted',
    'Can you please give us an update on when can we expect the payment to hit our bank account',
    'I need help',
    'This issue with the formatting has now been fixed',
    'Also can you please look into invoice from 19',
    'Please find attached outstanding invoices',
    'It has come to our attention that a number of invoices appear as overdue',
    'Please provide invoice copy',
    'please let me know if we need to go over anything else',
    'Due to a system error, this e-mail may be a duplicate of an earlier Invoice',
    'Note:- there are 3 invoices on the attachment sent yesterday',
    'Our records indicate that the account has a past due in the amount of $ 4,607',
    'We require immediate payment to bring the account current up to date',
    'Please find attached invoices',
    'We have attached your latest statement which shows that your January account is now overdue for payment',
    
    # Chat GPT
    'We have prepared an updated statement for the Coworth Park account, which is attached for your review',
    'Your account statement has been updated and is attached herewith for your reference',
    'Please let us know if you receive any emails about changes that you were not expecting',
    'In case of any email notifications about changes, please get in touch with us without delay',
    'Beware of fraud and cybercrime: we will never inform you of bank detail changes through email',
    'To protect yourself from fraud, please be aware that we will never send emails about bank detail changes',
    'We would like to caution you that we will not inform you of any changes in bank details through email',
    'If you receive an email that appears suspicious, please notify us immediately',
    'As requested, we have attached an updated statement for the account',
    
    # more
    'Likewise, if we receive any notification regarding a change to your bank details, we will be contacting a representative at your company directly for confirmation of both old and new bank details prior to accepting the change',
    
    
]

print(len(neg_email_examples))


160


### Combine the examples

In [7]:
email_bodies = target_email_examples + neg_email_examples

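# 1 = bank-detail change (positive), 0 = anything else (negative)
pos = [1]*len(target_email_examples)
negs = [0]*len(neg_email_examples)

# despite the name, this holds the gold label for every example
true_positives = pos + negs

my_dict = {'text':email_bodies,'label':true_positives}
# shuffle the rows; no random_state is set, so the order differs between runs
df = pd.DataFrame(my_dict).sample(frac=1)
df["label_text"] = df.apply(lambda row: "positive" if row["label"] else "negative", axis = 1)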

print(len(df))

309
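
Because `sample(frac=1)` is called without a `random_state`, the row order changes on every run. The following is a pure-Python sketch of the same combine, label and shuffle step with a fixed seed. The toy lists stand in for the real example lists, the seed value is arbitrary, and `build_labelled_rows` is a hypothetical helper.

```python
import random


def build_labelled_rows(positives, negatives, seed=0):
    """Combine both lists into labelled rows and shuffle them reproducibly."""
    rows = [{"text": t, "label": 1, "label_text": "positive"} for t in positives]
    rows += [{"text": t, "label": 0, "label_text": "negative"} for t in negatives]
    random.Random(seed).shuffle(rows)  # fixed seed: same order on every run
    return rows


toy_pos = ["we have switched banks."]
toy_neg = ["Please find attached invoices", "I need help"]
rows = build_labelled_rows(toy_pos, toy_neg)
print(len(rows), [r["label"] for r in rows])
```

In pandas the equivalent is to pass a `random_state` value to `DataFrame.sample`.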


In [8]:
df

| | text | label | label_text |
|---|---|---|---|
| 167 | Please be extra vigilant when opening attachme... | 0 | negative |
| 98 | Change of Bank Details | 1 | positive |
| 55 | The bank information is in the attached letter. | 1 | positive |
| 166 | Please can you pay your outstanding invoices a... | 0 | negative |
| 219 | Limited is registered in England and Wales | 0 | negative |
| 104 | It does not look as though our information has... | 1 | positive |
| 202 | Dear ACCOUNTS PAYABLE, Please find attached a ... | 0 | negative |
| 300 | Your account statement has been updated and is... | 0 | negative |
| 213 | Now, you need to pay back the amount of $21,28... | 0 | negative |
| 105 | See the attached document with updated payment... | 1 | positive |


# Initialise Setfit Model

In [9]:
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer
from sentence_transformers import SentenceTransformer
from transformers import Trainer
import huggingface
from sklearn.model_selection import train_test_split
import datasets
import torch
import tensorflow as tf

In [10]:
# split into train and test set and put in dataset format
train_data, test_data = train_test_split(df, test_size=0.25, random_state=42)

train_ds = datasets.Dataset.from_dict(train_data.to_dict(orient="list"))
test_ds = datasets.Dataset.from_dict(test_data.to_dict(orient="list"))

print(len(train_ds))
print(len(test_ds))

231
78
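
The split above is purely random, so the positive/negative ratio in the 78-example test set can drift from the ratio in the full set. A minimal sketch of a stratified alternative, assuming a toy frame (`toy_df` and its contents are illustrative, not the real examples) and scikit-learn's `stratify` argument:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df: 12 positives and 8 negatives (illustrative data only)
toy_df = pd.DataFrame({
    "text": [f"example {i}" for i in range(20)],
    "label": [1] * 12 + [0] * 8,
})

# stratify keeps the 60% / 40% label proportions the same in both splits
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.25, random_state=42, stratify=toy_df["label"]
)

print(toy_train["label"].value_counts().to_dict())
print(toy_test["label"].value_counts().to_dict())
```

With only a few hundred examples, keeping the class ratio fixed makes accuracy figures from different runs easier to compare.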


Choose a pretrained sentence-transformers checkpoint from the Hugging Face Hub to use as the SetFit body, e.g. "sentence-transformers/paraphrase-mpnet-base-v2"

In [11]:
# Name of the checkpoint to load from the hub
model_name = "sentence-transformers/paraphrase-mpnet-base-v2"

Load model

In [12]:
model = SetFitModel.from_pretrained(model_name)

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


In [13]:
dir(model)

['__call__',
 '__class__',
 '__dataclass_fields__',
 '__dataclass_params__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_freeze_or_not',
 '_from_pretrained',
 '_prepare_dataloader',
 '_prepare_optimizer',
 '_save_pretrained',
 'create_model_card',
 'fit',
 'freeze',
 'from_pretrained',
 'has_differentiable_head',
 'l2_weight',
 'model_body',
 'model_head',
 'multi_target_strategy',
 'normalize_embeddings',
 'predict',
 'predict_proba',
 'push_to_hub',
 'save_pretrained',
 'to',
 'unfreeze']

In [None]:
# check GPU is available
# print(torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu"))

In [None]:
# specify to use GPU (maybe not necessary)
# model.to("cuda")

In [None]:
# del model

# Initialise SetFit trainer

Modelled on the Hugging Face transformers Trainer.

e.g. use the following parameters:

- loss_class=CosineSimilarityLoss
- batch_size=16
- num_iterations=20 (number of text pairs to generate for contrastive learning)
- num_epochs=5

In [14]:
# Create trainer (TODO: add early stopping; none is configured here)
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_iterations=20, # Number of text pairs to generate for contrastive learning
    num_epochs=2,  # Number of epochs to use for contrastive learning (previously tried: 1, 5)
)

In [15]:
%%time
# after instantiating the trainer, train it
trainer.train()
metrics = trainer.evaluate()
metrics

***** Running training *****
  Num examples = 9240
  Num epochs = 2
  Total optimization steps = 1156
  Total train batch size = 16


Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/578 [00:00<?, ?it/s]

Iteration:   0%|          | 0/578 [00:00<?, ?it/s]

***** Running evaluation *****


CPU times: user 7min 11s, sys: 2min 24s, total: 9min 36s
Wall time: 9min 43s


{'accuracy': 0.9487179487179487}
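
The figures in the training log above can be reproduced from the trainer settings. A short sketch of the arithmetic; the factor of 2 (one positive and one negative pair per sentence per iteration) is an assumption, though it is consistent with the logged `Num examples = 9240`:

```python
import math

n_train = 231        # size of train_ds
num_iterations = 20
batch_size = 16
num_epochs = 2

# Assumption: each sentence yields one positive and one negative pair per iteration
num_pairs = n_train * num_iterations * 2
steps_per_epoch = math.ceil(num_pairs / batch_size)
total_steps = steps_per_epoch * num_epochs

print(num_pairs, steps_per_epoch, total_steps)  # 9240 578 1156
```

Training time scales roughly with `total_steps`, so halving `num_iterations` or `num_epochs` should roughly halve the wall time.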

### Quick summary

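Model now reaches 100% accuracy...? (Not reproduced in the run above, which scores 0.949.)

Model used to reach 0.974 accuracy (97.4%), the highest I could get it. Note that `trainer.evaluate()` scores `eval_dataset`, i.e. the test set, not the training data.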
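
With a test set this small, each misclassified sentence moves the accuracy by more than a percentage point. A quick sketch that converts the accuracies mentioned here into error counts, assuming the 0.974 figure was measured on a test split of the same size (78 examples):

```python
n_test = 78  # size of test_ds

# Accuracy from the run above, and the earlier best (assumed same test-set size)
accuracies = (0.9487179487179487, 0.974)

# Number of misclassified test examples implied by each accuracy
errors = {acc: n_test - round(acc * n_test) for acc in accuracies}

for acc, n_wrong in errors.items():
    print(f"accuracy {acc:.3f} -> {n_wrong} of {n_test} misclassified")
```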

In [16]:
model

SetFitModel()

# Save model

In [17]:
# save model
model.save_pretrained('trained_models/SetFit_Trained7')  # public method; _save_pretrained is the private hook behind it

In [None]:
# load model
# model = SetFitModel.from_pretrained('trained_models/SetFit_Trained4')

# Assess training results in more details

https://discuss.huggingface.co/t/confidence-score-in-setfit-fine-tuned-model/25957

In [None]:
# a = model.predict_proba(['hello'])
# a

In [None]:
lab_highest_probs = trainer.model.predict(test_ds['text'])  # predicted label (the one with the highest probability)
lab_probs = trainer.model.predict_proba(test_ds['text'])  # probabilities for every label, one row per text
high_label = np.argmax(lab_probs, axis=1)  # row-wise argmax: index of the most probable label for each text
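
One detail worth checking when taking the most probable label: `np.argmax` without an `axis` argument returns a single index into the flattened array, not one label per row. A small sketch with made-up probabilities (`toy_probs` is illustrative only):

```python
import numpy as np

# Illustrative probabilities for three texts: columns are P(label 0), P(label 1)
toy_probs = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.6],
])

flat_index = np.argmax(toy_probs)          # index into the flattened array
row_labels = np.argmax(toy_probs, axis=1)  # one predicted label per row

print(flat_index)   # 0 (position of 0.9 in the flattened array)
print(row_labels)   # [0 1 1]
```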

In [None]:
probs = lab_probs.tolist()
pred_labs = lab_highest_probs.tolist()

In [None]:
# split the per-text probability pairs into one list per label
p_0 = [p[0] for p in probs]
p_1 = [p[1] for p in probs]

In [None]:
prob_dict = {'text':test_ds['text'],'y_pred':pred_labs,'prob_0':p_0,'prob_1':p_1}
out_df = pd.DataFrame.from_dict(prob_dict)

In [None]:
out_df
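
To see where the model goes wrong, the predictions can be compared against the true labels. A minimal sketch on a toy frame; `toy_out`, its rows, and the added `y_true` column are illustrative assumptions (in the notebook, `y_true` would presumably come from `test_ds['label']`):

```python
import pandas as pd

# Toy stand-in for out_df with an added y_true column (illustrative values only)
toy_out = pd.DataFrame({
    "text": [
        "we have new bank details",
        "see attached statement",
        "our bank has changed",
        "beware of fraud",
    ],
    "y_true": [1, 0, 1, 0],
    "y_pred": [1, 0, 0, 0],
    "prob_1": [0.93, 0.08, 0.41, 0.12],
})

# Confusion table: rows are true labels, columns are predicted labels
confusion = pd.crosstab(toy_out["y_true"], toy_out["y_pred"])

# Rows where the prediction disagrees with the true label
misclassified = toy_out[toy_out["y_true"] != toy_out["y_pred"]]

print(confusion)
print(misclassified[["text", "y_true", "y_pred", "prob_1"]])
```

Sorting the misclassified rows by `prob_1` would show which errors are near the decision boundary and which are confident mistakes.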

In [None]:
dir(model.model_body)

## Extract embeddings (not working at the moment)

https://github.com/huggingface/setfit/issues/245

In [None]:
# embeddings = model.model_body.encode(test_ds['text'], convert_to_tensor=True)  # encode the strings, not the Dataset object

In [None]:
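def predict_proba(model: SetFitModel, texts):
    # encode the raw strings rather than the Dataset object; settings are read from `model` (there is no `self` here)
    embeddings = model.model_body.encode(texts, normalize_embeddings=model.normalize_embeddings,
                                         convert_to_tensor=model.has_differentiable_head)
    return model.model_head.predict_proba(embeddings)


In [None]:
# probs = predict_proba(model, test_ds['text'])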