Models hub finance (#1316)

* Add model 2023-08-03-finner_bert_subpoenas_sm_en (#493)
Co-authored-by: gadde5300 <gadde5300@gmail.com>
* Delete subpoenas ner finance
* Add model 2023-08-30-finpipe_deid_en (#566)
Co-authored-by: Meryem1425 <vildansarikaya25@gmail.com>
* Add model 2023-08-30-finpipe_deid_en (#570)
Co-authored-by: SKocer <samedkocer22@gmail.com>
* Add model 2023-08-30-finpipe_deid_en (#571)
Co-authored-by: SKocer <samedkocer22@gmail.com>
* Delete 2023-08-30-finpipe_deid_en.md
* Add model 2023-08-30-finpipe_deid_en (#572)
Co-authored-by: gokhanturer <mgturer@gmail.com>
* Add model 2023-08-30-finpipe_deid_en (#574)
Co-authored-by: SKocer <samedkocer22@gmail.com>
* Add model 2023-09-01-finpipe_deid_en (#586)
Co-authored-by: Meryem1425 <vildansarikaya25@gmail.com>
* Add model 2023-09-01-finpipe_deid_en (#589)
Co-authored-by: SKocer <samedkocer22@gmail.com>
* Add model 2023-09-01-finpipe_deid_en (#593)
Co-authored-by: gokhanturer <mgturer@gmail.com>
* 2023-10-06-finembedding_e5_base_en (#685)
* Add model 2023-10-06-finembedding_e5_base_en
* Add model 2023-10-06-finner_absa_sm_en
* Add model 2023-10-06-finassertion_absa_sm_en
---------
Co-authored-by: dcecchini <dadachini@hotmail.com>
* Add model 2023-11-09-finembedding_e5_large_en (#745)
Co-authored-by: dcecchini <dadachini@hotmail.com>
* 2023-11-11-finner_aspect_based_sentiment_md_en (#754)
* Add model 2023-11-11-finner_aspect_based_sentiment_md_en
* Add model 2023-11-11-finassertion_aspect_based_sentiment_md_en
* Update 2023-11-11-finner_aspect_based_sentiment_md_en.md
* Update 2023-11-11-finassertion_aspect_based_sentiment_md_en.md
---------
Co-authored-by: Mary-Sci <meryemyildiz366@gmail.com>
Co-authored-by: Merve Ertas Uslu <67653613+Mary-Sci@users.noreply.github.com>
* Add model 2023-12-07-finembeddings_bge_base_en (#812)
Co-authored-by: dcecchini <dadachini@hotmail.com>
* 2024-05-17-finner_sec_edgar_fe_en (#1211)
* Add model 2024-05-17-finner_sec_edgar_fe_en
* Add model 2024-05-17-finner_deid_sec_fe_en
* Update 2024-05-17-finner_deid_sec_fe_en.md
* Add model 2024-05-21-finner_aspect_based_sentiment_fe_en
* Add model 2024-05-21-finance_word_embeddings_en
* Add model 2024-06-07-finner_financial_xlarge_fe_en
* Update 2024-06-07-finner_financial_xlarge_fe_en.md
* Add model 2024-06-10-finel_nasdaq_company_name_stock_screener_fe_en
* Add model 2024-06-10-finel_edgar_company_name_fe_en
* Add model 2024-06-10-finance_bge_base_embeddings_en
* Add model 2024-06-11-finel_names2tickers_fe_en
* Add model 2024-06-12-finel_tickers2names_fe_en
* Add model 2024-06-21-finassertion_aspect_based_sentiment_md_fe_en
---------
Co-authored-by: gadde5300 <gadde5300@gmail.com>
Co-authored-by: GADDE SAI SHAILESH <69344247+gadde5300@users.noreply.github.com>
---------
Co-authored-by: jsl-models <74001263+jsl-models@users.noreply.github.com>
Co-authored-by: gadde5300 <gadde5300@gmail.com>
Co-authored-by: Meryem1425 <vildansarikaya25@gmail.com>
Co-authored-by: SKocer <samedkocer22@gmail.com>
Co-authored-by: Merve Ertas Uslu <67653613+Mary-Sci@users.noreply.github.com>
Co-authored-by: gokhanturer <mgturer@gmail.com>
Co-authored-by: Mary-Sci <meryemyildiz366@gmail.com>
Co-authored-by: GADDE SAI SHAILESH <69344247+gadde5300@users.noreply.github.com>
JohnSnowLabs · Jul 10, 2024 · 8fd3da2
1 parent 4550767
commit 8fd3da2
Showing 11 changed files with 1,203 additions and 0 deletions.
diff --git a/docs/_posts/gadde5300/2024-05-17-finner_deid_sec_fe_en.md b/docs/_posts/gadde5300/2024-05-17-finner_deid_sec_fe_en.md
@@ -0,0 +1,138 @@
+---
+layout: model
+title: Generic Deidentification NER (Finance)
+author: John Snow Labs
+name: finner_deid_sec_fe
+date: 2024-05-17
+tags: [deid, deidentification, anonymization, en, licensed]
+task: Named Entity Recognition
+language: en
+edition: Finance NLP 1.0.0
+spark_version: 3.0
+supported: true
+annotator: FinanceNerModel
+article_header:
+  type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+This NER model, trained with custom finance embeddings, detects generic entities that may need to be masked or obfuscated to comply with regulations such as GDPR and CCPA. It is only an NER model; for end-to-end de-identification, try the full de-identification pipelines available in Models Hub.
+
+## Predicted Entities
+
+`AGE`, `CITY`, `COUNTRY`, `DATE`, `EMAIL`, `LOCATION-OTHER`, `FAX`, `ORG`, `PERSON`, `PHONE`, `PROFESSION`, `STATE`, `STREET`, `URL`, `ZIP`
+
+{:.btn-box}
+<button class="button button-orange" disabled>Live Demo</button>
+<button class="button button-orange" disabled>Open in Colab</button>
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_deid_sec_fe_en_1.0.0_3.0_1715953927003.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
+[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_deid_sec_fe_en_1.0.0_3.0_1715953927003.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
+
+## How to use
+
+
+
+<div class="tabs-box" markdown="1">
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
+documentAssembler = nlp.DocumentAssembler()\
+        .setInputCol("text")\
+        .setOutputCol("document")
+
+sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
+        .setInputCols(["document"])\
+        .setOutputCol("sentence")
+
+tokenizer = nlp.Tokenizer()\
+        .setInputCols(["sentence"])\
+        .setOutputCol("token")
+
+embeddings = nlp.WordEmbeddingsModel.pretrained("finance_word_embeddings", "en", "finance/models")\
+            .setInputCols(["sentence","token"])\
+            .setOutputCol("embeddings")
+
+ner_model = finance.NerModel.pretrained("finner_deid_sec_fe", "en", "finance/models")\
+      .setInputCols(["sentence", "token", "embeddings"])\
+      .setOutputCol("ner")
+
+ner_converter = nlp.NerConverter()\
+        .setInputCols(["sentence","token","ner"])\
+        .setOutputCol("ner_chunk")
+
+nlpPipeline = nlp.Pipeline(stages=[
+        documentAssembler,
+        sentenceDetector,
+        tokenizer,
+        embeddings,
+        ner_model,
+        ner_converter])
+
+empty_data = spark.createDataFrame([[""]]).toDF("text")
+
+model = nlpPipeline.fit(empty_data)
+
+text = [""" This LICENSE AND DEVELOPMENT AGREEMENT (this Agreement) is entered into effective as of Nov. 02, 2019 (the Effective Date) by and between Bioeq IP AG, having its principal place of business at 333 Twin Dolphin Drive, Suite 600, Redwood City, CA, 94065, USA (Licensee). """]
+
+res = model.transform(spark.createDataFrame([text]).toDF("text"))
+```
+
+</div>
+
+## Results
+
+```bash
++----------------------+------+
+|chunk                 |label |
++----------------------+------+
+|Nov. 02, 2019         |DATE  |
+|333 Twin Dolphin Drive|STREET|
+|Redwood City          |CITY  |
+|CA                    |STATE |
+|94065                 |ZIP   |
++----------------------+------+
+
+```
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|finner_deid_sec_fe|
+|Compatibility:|Finance NLP 1.0.0+|
+|License:|Licensed|
+|Edition:|Official|
+|Input Labels:|[sentence, token, embeddings]|
+|Output Labels:|[ner]|
+|Language:|en|
+|Size:|14.6 MB|
+
+## References
+
+In-house annotated documents with protected information
+
+## Benchmarking
+
+```bash
+                precision    recall  f1-score   support
+           AGE       0.97      0.95      0.96       266
+          CITY       0.86      0.80      0.83       120
+       COUNTRY       0.86      0.63      0.73        38
+          DATE       0.98      0.98      0.98      2206
+         EMAIL       1.00      1.00      1.00         1
+           FAX       0.00      0.00      0.00         2
+LOCATION-OTHER       1.00      0.33      0.50         6
+           ORG       0.82      0.55      0.66        42
+        PERSON       0.95      0.95      0.95      1295
+         PHONE       0.89      0.89      0.89        62
+    PROFESSION       0.75      0.55      0.64        76
+         STATE       0.90      0.92      0.91        90
+        STREET       0.92      0.89      0.91        81
+           URL       0.00      0.00      0.00         1
+           ZIP       0.97      0.94      0.95        67
+     micro-avg       0.96      0.94      0.95      4353
+     macro-avg       0.79      0.69      0.73      4353
+  weighted-avg       0.96      0.94      0.95      4353
+```
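Note on the benchmark above: the macro average sits well below the micro and weighted averages. A minimal sketch of the arithmetic, assuming the macro row is the unweighted mean of the per-label scores (the usual convention, not something the model card states), shows that the two zero-scored labels with almost no support (`FAX`, `URL`) account for most of the gap. The figures are copied from the table; the helper name `macro_average` is illustrative only.

```python
from statistics import mean

# (precision, recall, f1, support) per label, copied from the
# finner_deid_sec_fe benchmarking table.
per_label = {
    "AGE": (0.97, 0.95, 0.96, 266),
    "CITY": (0.86, 0.80, 0.83, 120),
    "COUNTRY": (0.86, 0.63, 0.73, 38),
    "DATE": (0.98, 0.98, 0.98, 2206),
    "EMAIL": (1.00, 1.00, 1.00, 1),
    "FAX": (0.00, 0.00, 0.00, 2),
    "LOCATION-OTHER": (1.00, 0.33, 0.50, 6),
    "ORG": (0.82, 0.55, 0.66, 42),
    "PERSON": (0.95, 0.95, 0.95, 1295),
    "PHONE": (0.89, 0.89, 0.89, 62),
    "PROFESSION": (0.75, 0.55, 0.64, 76),
    "STATE": (0.90, 0.92, 0.91, 90),
    "STREET": (0.92, 0.89, 0.91, 81),
    "URL": (0.00, 0.00, 0.00, 1),
    "ZIP": (0.97, 0.94, 0.95, 67),
}


def macro_average(rows):
    """Unweighted mean of precision, recall and F1 across labels."""
    columns = zip(*(scores[:3] for scores in rows.values()))
    return tuple(round(mean(column), 2) for column in columns)


macro = macro_average(per_label)
total_support = sum(scores[3] for scores in per_label.values())

# Labels with fewer than 10 test examples carry the same weight as DATE
# in the macro average, so their scores deserve extra caution.
low_support = sorted(name for name, s in per_label.items() if s[3] < 10)

print(macro, total_support, low_support)
```

Because the inputs are already rounded to two decimals, a recomputation like this can in general differ from a reported row in the last digit.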
diff --git a/docs/_posts/gadde5300/2024-05-17-finner_sec_edgar_fe_en.md b/docs/_posts/gadde5300/2024-05-17-finner_sec_edgar_fe_en.md
@@ -0,0 +1,130 @@
+---
+layout: model
+title: Financial NER on EDGAR Documents
+author: John Snow Labs
+name: finner_sec_edgar_fe
+date: 2024-05-17
+tags: [en, licensed, finance, ner, sec]
+task: Named Entity Recognition
+language: en
+edition: Finance NLP 1.0.0
+spark_version: 3.0
+supported: true
+annotator: FinanceNerModel
+article_header:
+  type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+This Financial NER model extracts ORG, INST, LAW, COURT, PER, LOC, MISC, ALIAS, and TICKER entities from US SEC EDGAR documents. It was trained using custom finance word embeddings.
+
+## Predicted Entities
+
+`ORG`, `INST`, `LAW`, `COURT`, `PER`, `LOC`, `MISC`, `ALIAS`, `TICKER`
+
+{:.btn-box}
+<button class="button button-orange" disabled>Live Demo</button>
+<button class="button button-orange" disabled>Open in Colab</button>
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finner_sec_edgar_fe_en_1.0.0_3.0_1715948751469.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
+[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finner_sec_edgar_fe_en_1.0.0_3.0_1715948751469.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
+
+## How to use
+
+
+
+<div class="tabs-box" markdown="1">
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
+document_assembler = nlp.DocumentAssembler()\
+      .setInputCol("text")\
+      .setOutputCol("document")
+
+sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
+      .setInputCols(["document"])\
+      .setOutputCol("sentence")
+
+tokenizer = nlp.Tokenizer()\
+      .setInputCols(["sentence"])\
+      .setOutputCol("token")
+
+embeddings = nlp.WordEmbeddingsModel.pretrained("finance_word_embeddings", "en", "finance/models")\
+            .setInputCols(["sentence","token"])\
+            .setOutputCol("embeddings")
+
+ner_model = finance.NerModel.pretrained("finner_sec_edgar_fe", "en", "finance/models")\
+      .setInputCols(["sentence", "token", "embeddings"])\
+      .setOutputCol("ner")
+
+ner_converter = nlp.NerConverter()\
+      .setInputCols(["sentence", "token", "ner"])\
+      .setOutputCol("ner_chunk")
+
+nlpPipeline = nlp.Pipeline(stages=[
+        document_assembler,
+        sentence_detector,
+        tokenizer,
+        embeddings,
+        ner_model,
+        ner_converter])
+
+empty_data = spark.createDataFrame([[""]]).toDF("text")
+
+model = nlpPipeline.fit(empty_data)
+
+text = ["""In our opinion, the accompanying consolidated balance sheets and the related consolidated statements of operations, of changes in stockholders' equity, and of cash flows present fairly, in all material respects, the financial position of SunGard Capital Corp. II and its subsidiaries ( SCC II ) at December 31, 2010, and 2009, and the results of their operations and their cash flows for each of the three years in the period ended December 31, 2010, in conformity with accounting principles generally accepted in the United States of America."""]
+
+result = model.transform(spark.createDataFrame([text]).toDF("text"))
+```
+
+</div>
+
+## Results
+
+```bash
++----------------------------------------+-----+
+|chunk                                   |label|
++----------------------------------------+-----+
+|SunGard Capital Corp                    |ORG  |
+|SCC II                                  |ALIAS|
+|accounting principles generally accepted|LAW  |
+|United States of America                |LOC  |
++----------------------------------------+-----+
+```
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|finner_sec_edgar_fe|
+|Compatibility:|Finance NLP 1.0.0+|
+|License:|Licensed|
+|Edition:|Official|
+|Input Labels:|[sentence, token, embeddings]|
+|Output Labels:|[ner]|
+|Language:|en|
+|Size:|2.2 MB|
+
+## References
+
+In-house annotations
+
+## Benchmarking
+
+```bash
+          precision    recall  f1-score   support
+ALIAS         0.91      0.80      0.85        84
+COURT         1.00      1.00      1.00         6
+INST          0.92      0.76      0.83        76
+LAW           0.89      0.86      0.87       166
+LOC           0.87      0.87      0.87       140
+MISC          0.86      0.75      0.80       226
+ORG           0.88      0.91      0.89       430
+PER           0.89      0.88      0.89        66
+TICKER        1.00      0.86      0.92         7
+micro-avg     0.88      0.85      0.87      1201
+macro-avg     0.91      0.85      0.88      1201
+weighted-avg  0.88      0.85      0.86      1201
+```
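Note on the benchmark above: the weighted-average row can be cross-checked against the per-label rows. The sketch below assumes the weighted average is the support-weighted mean of each metric (the common convention, not confirmed by the model card) and uses the rounded figures from the table; `weighted_average` is an illustrative helper, not part of any library.

```python
# (label, precision, recall, f1, support), copied from the
# finner_sec_edgar_fe benchmarking table.
rows = [
    ("ALIAS", 0.91, 0.80, 0.85, 84),
    ("COURT", 1.00, 1.00, 1.00, 6),
    ("INST", 0.92, 0.76, 0.83, 76),
    ("LAW", 0.89, 0.86, 0.87, 166),
    ("LOC", 0.87, 0.87, 0.87, 140),
    ("MISC", 0.86, 0.75, 0.80, 226),
    ("ORG", 0.88, 0.91, 0.89, 430),
    ("PER", 0.89, 0.88, 0.89, 66),
    ("TICKER", 1.00, 0.86, 0.92, 7),
]


def weighted_average(table):
    """Support-weighted mean of precision, recall and F1."""
    total = sum(row[4] for row in table)
    return tuple(
        round(sum(row[i] * row[4] for row in table) / total, 2)
        for i in (1, 2, 3)
    )


weighted = weighted_average(rows)
total_support = sum(row[4] for row in rows)

print(weighted, total_support)
```

With `ORG` and `MISC` contributing more than half of the 1,201 test entities, the weighted scores track those two labels closely, while `COURT` and `TICKER` (6 and 7 examples) barely move them.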
diff --git a/docs/_posts/gadde5300/2024-05-21-finance_word_embeddings_en.md b/docs/_posts/gadde5300/2024-05-21-finance_word_embeddings_en.md
@@ -0,0 +1,66 @@
+---
+layout: model
+title: Finance Word Embeddings
+author: John Snow Labs
+name: finance_word_embeddings
+date: 2024-05-21
+tags: [en, finance, licensed, word_embeddings]
+task: Embeddings
+language: en
+edition: Finance NLP 1.0.0
+spark_version: 3.0
+supported: true
+annotator: WordEmbeddingsModel
+article_header:
+  type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+This word embeddings model is based on Word2Vec and was trained on a mix of public data and in-house annotated documents.
+
+## Predicted Entities
+
+
+
+{:.btn-box}
+<button class="button button-orange" disabled>Live Demo</button>
+<button class="button button-orange" disabled>Open in Colab</button>
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/finance/models/finance_word_embeddings_en_1.0.0_3.0_1716300545868.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
+[Copy S3 URI](s3://auxdata.johnsnowlabs.com/finance/models/finance_word_embeddings_en_1.0.0_3.0_1716300545868.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
+
+## How to use
+
+
+
+<div class="tabs-box" markdown="1">
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
+model = nlp.WordEmbeddingsModel.pretrained("finance_word_embeddings", "en", "finance/models")\
+    .setInputCols(["sentence", "token"])\
+    .setOutputCol("embeddings")
+```
+
+</div>
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|finance_word_embeddings|
+|Type:|embeddings|
+|Compatibility:|Finance NLP 1.0.0+|
+|License:|Licensed|
+|Edition:|Official|
+|Input Labels:|[document, token]|
+|Output Labels:|[word_embeddings]|
+|Language:|en|
+|Size:|103.4 MB|
+|Case sensitive:|false|
+|Dimension:|200|
+
+## References
+
+Public data and in-house annotated documents