Models hub #13876

Merged
merged 31 commits into from
Jul 3, 2023
57d855e
Merge branch 'master' into models_hub
maziyarpanahi Nov 21, 2022
41cda2d
Merge branch 'models_hub' of https://github.com/JohnSnowLabs/spark-nl…
maziyarpanahi Nov 25, 2022
6c39602
Merge branch 'master' into models_hub
maziyarpanahi Dec 15, 2022
bed4adb
Merge branch 'master' into models_hub
maziyarpanahi Dec 21, 2022
cf0b08f
Merge branch 'master' into models_hub
maziyarpanahi Feb 7, 2023
93d6753
Merge branch 'master' into models_hub
maziyarpanahi Mar 14, 2023
afb700e
Add model 2023-04-13-CyberbullyingDetection_ClassifierDL_tfhub_en (#1…
jsl-models Apr 13, 2023
bb9a155
2023-04-20-distilbert_base_uncased_mnli_en (#13761)
jsl-models Apr 20, 2023
ea0ba05
2023-04-20-distilbert_base_zero_shot_classifier_turkish_cased_multinl…
jsl-models Apr 21, 2023
9afffb1
2023-05-04-roberta_base_zero_shot_classifier_nli_en (#13781)
jsl-models May 4, 2023
f4356e5
2023-05-09-distilbart_xsum_6_6_en (#13788)
jsl-models May 10, 2023
04149fb
Merge branch 'master' into models_hub
maziyarpanahi May 10, 2023
de3e19e
2023-05-11-distilbart_cnn_12_6_en (#13795)
jsl-models May 11, 2023
71de0f7
2023-05-19-match_pattern_en (#13805)
jsl-models May 21, 2023
f28ea8e
2023-05-22-explain_document_md_fr (#13811)
jsl-models May 23, 2023
4049881
2023-05-24-explain_document_md_fr (#13821)
jsl-models May 25, 2023
e4e465e
Add model 2023-05-25-explain_document_md_fr (#13827)
jsl-models May 25, 2023
e8e01a5
2023-05-25-dependency_parse_en (#13828)
jsl-models May 26, 2023
9c0a24e
Merge branch 'master' into models_hub
maziyarpanahi May 26, 2023
2fd64c3
2023-05-25-distilcamembert_french_legal_fr (#13826)
jsl-models May 26, 2023
795ebf8
Update title for 2023-05-25-distilcamembert_french_legal_fr.md (#13831)
Mary-Sci May 26, 2023
c04ca51
2023-05-27-explain_document_md_fr (#13836)
jsl-models May 27, 2023
4d64d1b
2023-05-28-longformer_base_english_legal_en (#13838)
jsl-models May 28, 2023
02a9afb
2023-05-28-xlm_longformer_base_english_legal_en (#13839)
jsl-models May 29, 2023
d054074
2023-06-21-bert_embeddings_distil_clinical_en (#13861)
jsl-models Jun 21, 2023
43ab794
2023-06-26-distilbert_embeddings_finetuned_sarcasm_classification_en …
jsl-models Jun 26, 2023
7cde44f
2023-06-27-roberta_embeddings_robertinh_gl (#13868)
jsl-models Jun 27, 2023
ced98b6
Add model 2023-06-29-xlmroberta_embeddings_paraphrase_mpnet_base_v2_x…
jsl-models Jun 30, 2023
dfaabd4
2023-06-08-instructor_base_en (#13850)
jsl-models Jul 1, 2023
59113cd
2023-06-28-roberta_base_en (#13871)
jsl-models Jul 1, 2023
740f4fb
Merge branch 'master' into models_hub
maziyarpanahi Jul 3, 2023
@@ -0,0 +1,100 @@
---
layout: model
title: Multilingual XLMRoBerta Embeddings Cased Model
author: John Snow Labs
name: xlmroberta_embeddings_paraphrase_mpnet_base_v2
date: 2023-06-29
tags: [xx, embeddings, xlmroberta, open_source, transformer, tensorflow]
task: Embeddings
language: xx
edition: Spark NLP 4.4.4
spark_version: 3.0
supported: true
engine: tensorflow
annotator: XlmRoBertaEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

A pretrained multilingual XLM-RoBERTa embeddings model, adapted from the Hugging Face model `paraphrase-multilingual-mpnet-base-v2` (published by `sentence-transformers`) and packaged for scalable, production-ready use in Spark NLP.


{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_paraphrase_mpnet_base_v2_xx_4.4.4_3.0_1688073546075.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlmroberta_embeddings_paraphrase_mpnet_base_v2_xx_4.4.4_3.0_1688073546075.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
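
The archive name in the links above appears to encode the card's metadata: model name, language, Spark NLP edition, Spark version, and a millisecond timestamp. The following is a minimal sketch of splitting it apart, assuming the layout `<name>_<lang>_<edition>_<spark_version>_<timestamp>.zip` holds. The helper `parse_model_archive` is hypothetical and not part of Spark NLP.

```python
import re
from datetime import datetime, timezone

# Assumed layout: <name>_<lang>_<edition>_<spark_version>_<epoch_millis>.zip
ARCHIVE_RE = re.compile(
    r"^(?P<name>.+)_(?P<lang>[a-z]{2,3})_(?P<edition>\d+\.\d+\.\d+)"
    r"_(?P<spark_version>\d+\.\d+)_(?P<millis>\d{13})\.zip$"
)


def parse_model_archive(filename):
    """Split a model archive file name into its metadata fields."""
    match = ARCHIVE_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected archive name: {filename!r}")
    fields = match.groupdict()
    millis = int(fields.pop("millis"))
    # Interpret the trailing number as milliseconds since the Unix epoch (UTC).
    fields["uploaded"] = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return fields


info = parse_model_archive(
    "xlmroberta_embeddings_paraphrase_mpnet_base_v2_xx_4.4.4_3.0_1688073546075.zip"
)
print(info["name"], info["lang"], info["edition"], info["spark_version"])
print(info["uploaded"].date())
```

Under that assumption, the parsed fields line up with the `name`, `language`, `edition`, `spark_version`, and `date` entries in the front matter.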

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from pyspark.ml import Pipeline
from sparknlp.annotator import Tokenizer, XlmRoBertaEmbeddings
from sparknlp.base import DocumentAssembler

# Assumes an active SparkSession named `spark`, e.g. from sparknlp.start().
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_paraphrase_mpnet_base_v2", "xx") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")
result = pipeline.fit(data).transform(data)
```
```scala
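import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.XlmRoBertaEmbeddings
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for .toDS; assumes a SparkSession named `spark`

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = XlmRoBertaEmbeddings.pretrained("xlmroberta_embeddings_paraphrase_mpnet_base_v2", "xx")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))

val data = Seq("I love Spark NLP").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)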
```
</div>

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlmroberta_embeddings_paraphrase_mpnet_base_v2|
|Compatibility:|Spark NLP 4.4.4+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|xx|
|Size:|1.0 GB|
|Case sensitive:|true|

## References

https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2
@@ -0,0 +1,97 @@
---
layout: model
title: English Legal Longformer Base Embeddings Model
author: John Snow Labs
name: longformer_base_english_legal
date: 2023-05-28
tags: [en, longformerformaskedlm, transformer, open_source, legal, tensorflow]
task: Embeddings
language: en
edition: Spark NLP 4.4.2
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LongformerEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

A pretrained legal Longformer embeddings model, adapted from Hugging Face and packaged for scalable, production-ready use in Spark NLP. `legal-longformer-base` is an English model originally trained by `lexlms`.

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_base_english_legal_en_4.4.2_3.0_1685282124579.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_base_english_legal_en_4.4.2_3.0_1685282124579.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
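from pyspark.ml import Pipeline
from sparknlp.annotator import LongformerEmbeddings, Tokenizer
from sparknlp.base import DocumentAssembler

# Assumes an active SparkSession named `spark`, e.g. from sparknlp.start().
# DocumentAssembler takes a single column: setInputCol / setOutputCol.
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = LongformerEmbeddings.pretrained("longformer_base_english_legal", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)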
```
```scala
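import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.LongformerEmbeddings
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for .toDS; assumes a SparkSession named `spark`

// DocumentAssembler takes a single column: setInputCol / setOutputCol.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = LongformerEmbeddings.pretrained("longformer_base_english_legal", "en")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)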
```
</div>

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|longformer_base_english_legal|
|Compatibility:|Spark NLP 4.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|561.6 MB|
|Case sensitive:|true|
|Max sentence length:|4096|
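
The table lists a maximum sentence length of 4096. As a rough illustration of working within such a limit, the sketch below splits a longer token list into consecutive windows of at most that size. It is plain Python, the helper `split_into_windows` is hypothetical, and it counts whole tokens rather than the subword pieces the model may actually count, so treat the window size as an upper-bound assumption.

```python
MAX_SENTENCE_LENGTH = 4096  # value taken from the Model Information table


def split_into_windows(tokens, max_len=MAX_SENTENCE_LENGTH):
    """Yield consecutive slices of `tokens`, each at most `max_len` long."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    for start in range(0, len(tokens), max_len):
        yield tokens[start:start + max_len]


# Illustrative input: 10,000 placeholder tokens.
tokens = [f"tok{i}" for i in range(10000)]
windows = list(split_into_windows(tokens))
print([len(w) for w in windows])
```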

## References

https://huggingface.co/lexlms/legal-longformer-base
@@ -0,0 +1,97 @@
---
layout: model
title: English Legal Longformer Large Embeddings Model
author: John Snow Labs
name: longformer_large_english_legal
date: 2023-05-28
tags: [en, longformerformaskedlm, transformer, open_source, legal, tensorflow]
task: Embeddings
language: en
edition: Spark NLP 4.4.2
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LongformerEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

A pretrained legal Longformer Large embeddings model, adapted from Hugging Face and packaged for scalable, production-ready use in Spark NLP. `legal-longformer-large` is an English model originally trained by `lexlms`.

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/longformer_large_english_legal_en_4.4.2_3.0_1685289330980.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/longformer_large_english_legal_en_4.4.2_3.0_1685289330980.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
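
The Download link and the S3 URI above point at the same object: in the path-style HTTPS URL, the bucket is the first path segment. The sketch below converts one form into the other with the standard library, assuming the `s3.amazonaws.com/<bucket>/<key>` form shown here. The function name `https_to_s3_uri` is hypothetical.

```python
from urllib.parse import urlparse


def https_to_s3_uri(url):
    """Convert a path-style S3 HTTPS URL into an s3:// URI."""
    parsed = urlparse(url)
    if parsed.netloc != "s3.amazonaws.com":
        raise ValueError(f"not a path-style S3 URL: {url!r}")
    # In path-style URLs the first path segment is the bucket; the rest is the key.
    bucket, _, key = parsed.path.lstrip("/").partition("/")
    if not bucket or not key:
        raise ValueError(f"missing bucket or key: {url!r}")
    return f"s3://{bucket}/{key}"


uri = https_to_s3_uri(
    "https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/"
    "longformer_large_english_legal_en_4.4.2_3.0_1685289330980.zip"
)
print(uri)
```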

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
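from pyspark.ml import Pipeline
from sparknlp.annotator import LongformerEmbeddings, Tokenizer
from sparknlp.base import DocumentAssembler

# Assumes an active SparkSession named `spark`, e.g. from sparknlp.start().
# DocumentAssembler takes a single column: setInputCol / setOutputCol.
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = LongformerEmbeddings.pretrained("longformer_large_english_legal", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)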
```
```scala
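import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.LongformerEmbeddings
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for .toDS; assumes a SparkSession named `spark`

// DocumentAssembler takes a single column: setInputCol / setOutputCol.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = LongformerEmbeddings.pretrained("longformer_large_english_legal", "en")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)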
```
</div>

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|longformer_large_english_legal|
|Compatibility:|Spark NLP 4.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|1.6 GB|
|Case sensitive:|true|
|Max sentence length:|4096|

## References

https://huggingface.co/lexlms/legal-longformer-large
@@ -0,0 +1,97 @@
---
layout: model
title: English Legal XLM-Longformer Base Embeddings Model
author: John Snow Labs
name: xlm_longformer_base_english_legal
date: 2023-05-28
tags: [en, longformerformaskedlm, transformer, open_source, legal, tensorflow]
task: Embeddings
language: en
edition: Spark NLP 4.4.2
spark_version: 3.0
supported: true
engine: tensorflow
annotator: LongformerEmbeddings
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

A pretrained legal XLM-Longformer embeddings model, adapted from Hugging Face and packaged for scalable, production-ready use in Spark NLP. `legal-xlm-longformer-base` is an English model originally trained by `joelito`.

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/xlm_longformer_base_english_legal_en_4.4.2_3.0_1685286936656.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/xlm_longformer_base_english_legal_en_4.4.2_3.0_1685286936656.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}

```python
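from pyspark.ml import Pipeline
from sparknlp.annotator import LongformerEmbeddings, Tokenizer
from sparknlp.base import DocumentAssembler

# Assumes an active SparkSession named `spark`, e.g. from sparknlp.start().
# DocumentAssembler takes a single column: setInputCol / setOutputCol.
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = LongformerEmbeddings.pretrained("xlm_longformer_base_english_legal", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings])

data = spark.createDataFrame([["I love Spark NLP"]]).toDF("text")

result = pipeline.fit(data).transform(data)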
```
```scala
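import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.LongformerEmbeddings
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for .toDS; assumes a SparkSession named `spark`

// DocumentAssembler takes a single column: setInputCol / setOutputCol.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = LongformerEmbeddings.pretrained("xlm_longformer_base_english_legal", "en")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))

val data = Seq("I love Spark NLP").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)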
```
</div>

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|xlm_longformer_base_english_legal|
|Compatibility:|Spark NLP 4.4.2+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[sentence, token]|
|Output Labels:|[embeddings]|
|Language:|en|
|Size:|788.6 MB|
|Case sensitive:|true|
|Max sentence length:|4096|
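
The Model Information rows follow a simple `|Key:|value|` layout, which makes them easy to read programmatically, for example when comparing several cards. The sketch below parses such rows into a dictionary; the helper `parse_model_table` is hypothetical and only handles the two-column shape used here.

```python
def parse_model_table(markdown):
    """Return {key: value} for rows shaped like `|Key:|value|`."""
    info = {}
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the `|---|---|` separator and anything that is not a key/value row.
        if len(cells) != 2 or not cells[0].endswith(":"):
            continue
        info[cells[0].rstrip(":")] = cells[1]
    return info


# A few rows copied from the table above.
table = """\
|---|---|
|Model Name:|xlm_longformer_base_english_legal|
|Compatibility:|Spark NLP 4.4.2+|
|Language:|en|
|Size:|788.6 MB|
|Max sentence length:|4096|
"""
info = parse_model_table(table)
print(info["Model Name"], info["Size"], int(info["Max sentence length"]))
```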

## References

https://huggingface.co/joelito/legal-xlm-longformer-base