Sparknlp 534 Introducing BART Transformer for text-to-text generation tasks like translation and summarization #13731

Merged
27 commits
d3acb6f
WIP: Added Bart transformer scala files
prabod Mar 10, 2023
b0d0e88
WIP: Added BART tokenizer test and BART is locally working
prabod Mar 12, 2023
e288c05
WIP: Added BART tokenizer test and BART is locally working
prabod Mar 12, 2023
788e08c
WIP: Added Beam Hypothesis and Beam Scorer implementations
prabod Mar 13, 2023
38e2fcd
WIP: Added Logit Processors
prabod Mar 15, 2023
5a97228
WIP: Added Beam Search implementation
prabod Mar 15, 2023
fb9dff4
WIP: Completed Beam Search implementation
prabod Mar 16, 2023
fc2d132
WIP: fixed a bug in Beam search algorithm
prabod Mar 23, 2023
1955504
WIP: changed BartTransformer methods to include beam size and added d…
prabod Mar 23, 2023
6d3e802
WIP: changed BartTransformer test methods
prabod Mar 23, 2023
187621f
WIP: fixed errors in BeamSearch
prabod Mar 23, 2023
db339fe
WIP: Updated to use separate encoder decoder model
prabod Mar 23, 2023
696250c
WIP: Changed model to handle the int64 version of the model weights
prabod Mar 26, 2023
5efc186
WIP: Added python API implementation
prabod Mar 26, 2023
3fa3f81
Pass session and encoder state as a parameter
prabod Mar 28, 2023
0daa244
Update TopK Logit Warper Logic
prabod Mar 28, 2023
a83b9ae
Code clean up
prabod Mar 28, 2023
3a0e07f
Update Tests
prabod Mar 28, 2023
7368aa4
Update documentation
prabod Mar 28, 2023
9907bf6
Update documentation and python tests
prabod Mar 28, 2023
902e406
Update python tests
prabod Mar 28, 2023
b39b028
SPARKNLP-534 move BartTokenizer to the Bart backend
maziyarpanahi Apr 4, 2023
fccb319
SPARKNLP-534 Fix the copyright year
maziyarpanahi Apr 4, 2023
1950cea
SPARKNLP-534 Add BartTransformer to annotator and ResourceDownloader
maziyarpanahi Apr 4, 2023
f85ecc2
SPARKNLP-534 Fix BartTransformer in annotator
maziyarpanahi Apr 4, 2023
9c4a03b
Merge branch 'master' into SPARKNLP-534-Import-and-Implement-BART-arc…
maziyarpanahi Apr 4, 2023
34aa823
Merge branch 'release/440-release-candidate' into SPARKNLP-534-Import…
maziyarpanahi Apr 6, 2023
1 change: 1 addition & 0 deletions python/sparknlp/annotator/seq2seq/__init__.py
@@ -16,3 +16,4 @@
from sparknlp.annotator.seq2seq.gpt2_transformer import *
from sparknlp.annotator.seq2seq.marian_transformer import *
from sparknlp.annotator.seq2seq.t5_transformer import *
from sparknlp.annotator.seq2seq.bart_transformer import *
402 changes: 402 additions & 0 deletions python/sparknlp/annotator/seq2seq/bart_transformer.py

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions python/sparknlp/internal/__init__.py
@@ -220,6 +220,12 @@ def __init__(self, path, jspark):
            "com.johnsnowlabs.nlp.annotators.seq2seq.T5Transformer.loadSavedModel", path, jspark)


class _BartLoader(ExtendedJavaWrapper):
    def __init__(self, path, jspark):
        super(_BartLoader, self).__init__(
            "com.johnsnowlabs.nlp.annotators.seq2seq.BartTransformer.loadSavedModel", path, jspark)


class _USELoader(ExtendedJavaWrapper):
    def __init__(self, path, jspark, loadsp):
        super(_USELoader, self).__init__("com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder.loadSavedModel",
280 changes: 280 additions & 0 deletions python/test/annotator/seq2seq/bart_transformer_test.py
@@ -0,0 +1,280 @@
# Copyright 2017-2023 John Snow Labs
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest

import pytest

from sparknlp.annotator import *
from sparknlp.base import *
from test.util import SparkContextForTest


@pytest.mark.slow
class BartTransformerQATestSpec(unittest.TestCase):
    def setUp(self):
        self.spark = SparkContextForTest.spark

    def runTest(self):
        data = self.spark.createDataFrame([
            [1, """PG&E stated it scheduled the blackouts in response to forecasts for high winds
amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were
scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow.
""".strip().replace("\n", " ")],
        ]).toDF("id", "text")

        document_assembler = DocumentAssembler() \
            .setInputCol("text") \
            .setOutputCol("documents")

        bart = BartTransformer.pretrained("bart_large_cnn") \
            .setInputCols(["documents"]) \
            .setOutputCol("answers")

        pipeline = Pipeline().setStages([document_assembler, bart])
        results = pipeline.fit(data).transform(data)

        results.select("answers.result").show(truncate=False)


@pytest.mark.slow
class BartTransformerSummaryTestSpec(unittest.TestCase):
    def setUp(self):
        self.spark = SparkContextForTest.spark

    def runTest(self):
        data = self.spark.createDataFrame([
            [1, """
Heat oven to 200C/180C fan/gas 6. Line each hole of a 12-hole muffin tin with a thin strip of baking
parchment across the middle that’s long enough so the ends stick out a centimetre or two – use a dab of
butter to stick in place. Roll out two thirds of the pastry on a lightly floured surface and stamp out
12 x 10cm circles (you may need to re-roll trimmings). Press a circle into each hole to line.

Sprinkle 1 tsp of breadcrumbs into the base of each pie. Tip the rest of the crumbs into a mixing bowl.
Squeeze in the sausage meat, discarding the skins, along with the bacon, mace, pepper, sage and just a
little salt. Get your hands in and mash and squish everything together until the breadcrumbs have just
about disappeared. Divide mixture between the holes, packing in firmly and shaping to a dome in the middle.

Roll out the remaining pastry and stamp out 12 x 7cm circles. Brush with a little egg and add a top to
each pie, egg-side down to stick, carefully pressing pastry edges together to seal. Brush with more egg
(don’t throw away leftovers) and sprinkle with sesame seeds. Bake for 30 mins until golden then carefully
remove the pies from the tin, using the parchment ends to help you lift them out. Sit on a parchment lined
baking tray, brush all round the sides with more egg and put back in the oven for 8 mins. Cool completely
then eat with piccalilli, or your favourite pickle.
""".strip().replace("\n", " ")]]).toDF("id", "text")

        document_assembler = DocumentAssembler() \
            .setInputCol("text") \
            .setOutputCol("documents")

        bart = BartTransformer.pretrained("bart_large_cnn") \
            .setTask("summarize:") \
            .setMaxOutputLength(200) \
            .setInputCols(["documents"]) \
            .setOutputCol("summaries")

        pipeline = Pipeline().setStages([document_assembler, bart])
        results = pipeline.fit(data).transform(data)

        results.select("summaries.result").show(truncate=False)


@pytest.mark.slow
class BartTransformerSummaryWithRepetitionPenaltyTestSpec(unittest.TestCase):
    def setUp(self):
        self.spark = SparkContextForTest.spark

    def runTest(self):
        data = self.spark.createDataFrame([
            [1, """Preheat the oven to 220°C/ fan200°C/gas 7. Trim the lamb fillet of fat and cut into slices the thickness
of a chop. Cut the kidneys in half and snip out the white core. Melt a knob of dripping or 2 tablespoons
of vegetable oil in a heavy large pan. Fry the lamb fillet in batches for 3-4 minutes, turning once, until
browned. Set aside. Fry the kidneys and cook for 1-2 minutes, turning once, until browned. Set aside.
Wipe the pan with kitchen paper, then add the butter. Add the onions and fry for about 10 minutes until
softened. Sprinkle in the flour and stir well for 1 minute. Gradually pour in the stock, stirring all the
time to avoid lumps. Add the herbs. Stir the lamb and kidneys into the onions. Season well. Transfer to a
large 2.5-litre casserole. Slice the peeled potatoes thinly and arrange on top in overlapping rows. Brush
with melted butter and season. Cover and bake for 30 minutes. Reduce the oven temperature to 160°C
/fan140°C/gas 3 and cook for a further 2 hours. Then increase the oven temperature to 200°C/ fan180°C/gas 6,
uncover, and brush the potatoes with more butter. Cook uncovered for 15-20 minutes, or until golden.
""".strip().replace("\n", " ")]]).toDF("id", "text")

        document_assembler = DocumentAssembler() \
            .setInputCol("text") \
            .setOutputCol("documents")

        bart = BartTransformer.pretrained("bart_large_cnn") \
            .setTask("summarize:") \
            .setMaxOutputLength(50) \
            .setInputCols(["documents"]) \
            .setOutputCol("summaries") \
            .setRepetitionPenalty(2)

        pipeline = Pipeline().setStages([document_assembler, bart])
        results = pipeline.fit(data).transform(data)

        results.select("summaries.result").show(truncate=False)


@pytest.mark.slow
class BartTransformerSummaryWithSamplingAndDeactivatedTopKTestSpec(unittest.TestCase):
    def setUp(self):
        self.spark = SparkContextForTest.spark

    def runTest(self):
        data = self.spark.createDataFrame([
            [1, """Preheat the oven to 220°C/ fan200°C/gas 7. Trim the lamb fillet of fat and cut into slices the thickness
of a chop. Cut the kidneys in half and snip out the white core. Melt a knob of dripping or 2 tablespoons
of vegetable oil in a heavy large pan. Fry the lamb fillet in batches for 3-4 minutes, turning once, until
browned. Set aside. Fry the kidneys and cook for 1-2 minutes, turning once, until browned. Set aside.
Wipe the pan with kitchen paper, then add the butter. Add the onions and fry for about 10 minutes until
softened. Sprinkle in the flour and stir well for 1 minute. Gradually pour in the stock, stirring all the
time to avoid lumps. Add the herbs. Stir the lamb and kidneys into the onions. Season well. Transfer to a
large 2.5-litre casserole. Slice the peeled potatoes thinly and arrange on top in overlapping rows. Brush
with melted butter and season. Cover and bake for 30 minutes. Reduce the oven temperature to 160°C
/fan140°C/gas 3 and cook for a further 2 hours. Then increase the oven temperature to 200°C/ fan180°C/gas 6,
uncover, and brush the potatoes with more butter. Cook uncovered for 15-20 minutes, or until golden.
""".strip().replace("\n", " ")]]).toDF("id", "text")

        document_assembler = DocumentAssembler() \
            .setInputCol("text") \
            .setOutputCol("documents")

        bart = BartTransformer.pretrained("bart_large_cnn") \
            .setTask("summarize:") \
            .setMaxOutputLength(50) \
            .setDoSample(True) \
            .setInputCols(["documents"]) \
            .setOutputCol("summaries") \
            .setTopK(0)

        pipeline = Pipeline().setStages([document_assembler, bart])
        results = pipeline.fit(data).transform(data)

        results.select("summaries.result").show(truncate=False)


@pytest.mark.slow
class BartTransformerSummaryWithSamplingAndTemperatureTestSpec(unittest.TestCase):
    def setUp(self):
        self.spark = SparkContextForTest.spark

    def runTest(self):
        data = self.spark.createDataFrame([
            [1, """Preheat the oven to 220°C/ fan200°C/gas 7. Trim the lamb fillet of fat and cut into slices the thickness
of a chop. Cut the kidneys in half and snip out the white core. Melt a knob of dripping or 2 tablespoons
of vegetable oil in a heavy large pan. Fry the lamb fillet in batches for 3-4 minutes, turning once, until
browned. Set aside. Fry the kidneys and cook for 1-2 minutes, turning once, until browned. Set aside.
Wipe the pan with kitchen paper, then add the butter. Add the onions and fry for about 10 minutes until
softened. Sprinkle in the flour and stir well for 1 minute. Gradually pour in the stock, stirring all the
time to avoid lumps. Add the herbs. Stir the lamb and kidneys into the onions. Season well. Transfer to a
large 2.5-litre casserole. Slice the peeled potatoes thinly and arrange on top in overlapping rows. Brush
with melted butter and season. Cover and bake for 30 minutes. Reduce the oven temperature to 160°C
/fan140°C/gas 3 and cook for a further 2 hours. Then increase the oven temperature to 200°C/ fan180°C/gas 6,
uncover, and brush the potatoes with more butter. Cook uncovered for 15-20 minutes, or until golden.
""".strip().replace("\n", " ")]]).toDF("id", "text")

        document_assembler = DocumentAssembler() \
            .setInputCol("text") \
            .setOutputCol("documents")

        bart = BartTransformer.pretrained("bart_large_cnn") \
            .setTask("summarize:") \
            .setMaxOutputLength(50) \
            .setDoSample(True) \
            .setInputCols(["documents"]) \
            .setOutputCol("summaries") \
            .setTopK(50) \
            .setTemperature(0.7)

        pipeline = Pipeline().setStages([document_assembler, bart])
        results = pipeline.fit(data).transform(data)

        results.select("summaries.result").show(truncate=False)


@pytest.mark.slow
class BartTransformerSummaryWithSamplingAndTopPTestSpec(unittest.TestCase):
    def setUp(self):
        self.spark = SparkContextForTest.spark

    def runTest(self):
        data = self.spark.createDataFrame([
            [1, """Preheat the oven to 220°C/ fan200°C/gas 7. Trim the lamb fillet of fat and cut into slices the thickness
of a chop. Cut the kidneys in half and snip out the white core. Melt a knob of dripping or 2 tablespoons
of vegetable oil in a heavy large pan. Fry the lamb fillet in batches for 3-4 minutes, turning once, until
browned. Set aside. Fry the kidneys and cook for 1-2 minutes, turning once, until browned. Set aside.
Wipe the pan with kitchen paper, then add the butter. Add the onions and fry for about 10 minutes until
softened. Sprinkle in the flour and stir well for 1 minute. Gradually pour in the stock, stirring all the
time to avoid lumps. Add the herbs. Stir the lamb and kidneys into the onions. Season well. Transfer to a
large 2.5-litre casserole. Slice the peeled potatoes thinly and arrange on top in overlapping rows. Brush
with melted butter and season. Cover and bake for 30 minutes. Reduce the oven temperature to 160°C
/fan140°C/gas 3 and cook for a further 2 hours. Then increase the oven temperature to 200°C/ fan180°C/gas 6,
uncover, and brush the potatoes with more butter. Cook uncovered for 15-20 minutes, or until golden.
""".strip().replace("\n", " ")]]).toDF("id", "text")

        document_assembler = DocumentAssembler() \
            .setInputCol("text") \
            .setOutputCol("documents")

        bart = BartTransformer.pretrained("bart_large_cnn") \
            .setTask("summarize:") \
            .setMaxOutputLength(50) \
            .setDoSample(True) \
            .setInputCols(["documents"]) \
            .setOutputCol("summaries") \
            .setTopK(0) \
            .setTopP(0.7)

        pipeline = Pipeline().setStages([document_assembler, bart])
        results = pipeline.fit(data).transform(data)

        results.select("summaries.result").show(truncate=False)


@pytest.mark.slow
class BartTransformerSummaryWithSamplingTestSpec(unittest.TestCase):
    def setUp(self):
        self.spark = SparkContextForTest.spark

    def runTest(self):
        data = self.spark.createDataFrame([
            [1, """Preheat the oven to 220°C/ fan200°C/gas 7. Trim the lamb fillet of fat and cut into slices the thickness
of a chop. Cut the kidneys in half and snip out the white core. Melt a knob of dripping or 2 tablespoons
of vegetable oil in a heavy large pan. Fry the lamb fillet in batches for 3-4 minutes, turning once, until
browned. Set aside. Fry the kidneys and cook for 1-2 minutes, turning once, until browned. Set aside.
Wipe the pan with kitchen paper, then add the butter. Add the onions and fry for about 10 minutes until
softened. Sprinkle in the flour and stir well for 1 minute. Gradually pour in the stock, stirring all the
time to avoid lumps. Add the herbs. Stir the lamb and kidneys into the onions. Season well. Transfer to a
large 2.5-litre casserole. Slice the peeled potatoes thinly and arrange on top in overlapping rows. Brush
with melted butter and season. Cover and bake for 30 minutes. Reduce the oven temperature to 160°C
/fan140°C/gas 3 and cook for a further 2 hours. Then increase the oven temperature to 200°C/ fan180°C/gas 6,
uncover, and brush the potatoes with more butter. Cook uncovered for 15-20 minutes, or until golden.
""".strip().replace("\n", " ")]]).toDF("id", "text")

        document_assembler = DocumentAssembler() \
            .setInputCol("text") \
            .setOutputCol("documents")

        bart = BartTransformer.pretrained("bart_large_cnn") \
            .setTask("summarize:") \
            .setMaxOutputLength(50) \
            .setDoSample(True) \
            .setInputCols(["documents"]) \
            .setOutputCol("summaries")

        pipeline = Pipeline().setStages([document_assembler, bart])
        results = pipeline.fit(data).transform(data)

        results.select("summaries.result").show(truncate=False)
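
Reviewer note: the tests above exercise `setTemperature`, `setTopK`, `setTopP`, and `setRepetitionPenalty`, but only print the output. As a rough illustration of what such settings conventionally do to next-token logits, here is a minimal pure-Python sketch. It is not the Scala implementation added in this PR. All function names are hypothetical, and the repetition-penalty rule (divide positive logits, multiply negative ones) is an assumption based on the common convention, not something confirmed by this diff. Treating `k <= 0` as "disabled" mirrors the `setTopK(0)` test named "DeactivatedTopK".

```python
import math
import random

NEG_INF = float("-inf")


def apply_temperature(logits, temperature):
    """Divide logits by the temperature; values below 1 sharpen the distribution."""
    return [x / temperature for x in logits]


def apply_repetition_penalty(logits, generated_ids, penalty):
    """Assumed convention: make already generated token ids less likely."""
    out = list(logits)
    for i in set(generated_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def top_k_filter(logits, k):
    """Keep the k largest logits; k <= 0 disables the filter.

    Ties at the threshold are all kept, so more than k entries may survive.
    """
    if k <= 0 or k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else NEG_INF for x in logits]


def softmax(logits):
    """Numerically stable softmax; masked (-inf) entries get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(logits, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    if p >= 1.0:
        return list(logits)
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep = set()
    cumulative = 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return [x if i in keep else NEG_INF for i, x in enumerate(logits)]


def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Warp the logits, then draw one token id from the remaining distribution."""
    rng = rng or random.Random()
    warped = apply_temperature(logits, temperature)
    warped = top_k_filter(warped, top_k)
    warped = top_p_filter(warped, top_p)
    probs = softmax(warped)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With `top_k=0` and `top_p=1.0` both filters are no-ops, so `sample_next_token` draws from the full temperature-scaled distribution, which is roughly the scenario `BartTransformerSummaryWithSamplingAndDeactivatedTopKTestSpec` sets up.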
