fixies #1313

Merged
merged 1 commit into from
Jul 8, 2024

2 changes: 1 addition & 1 deletion docs/_data/navigation.yml
@@ -229,7 +229,7 @@ sparknlp-healthcare:
url: /licensed/api/python
- title: Wiki
url: /docs/en/wiki
- title: Speed Benchmarks
- title: Benchmarks
url: /docs/en/benchmark
- title: Best Practices Using Pretrained Models Together
url: /docs/en/best_practices_pretrained_models
11 changes: 11 additions & 0 deletions docs/_includes/docs-sparckocr-pagination.html
@@ -1,3 +1,14 @@
<ul class="pagination">
<li>
<a href="#">Version <strong id="previosver"></strong></a>
</li>
<li>
<strong>Version <strong id="currversion"></strong></strong>
</li>
<li>
<a href="#">Version <strong id="nextver"></strong></a>
</li>
</ul>
<ul class="pagination owl-carousel pagination_big">
<li><a href="release_notes_5_3_1">5.3.1</a></li>
<li><a href="release_notes_5_3_0">5.3.0</a></li>
2 changes: 0 additions & 2 deletions docs/_includes/scripts/article.js
@@ -30,8 +30,6 @@ $(document).ready(function () {
$('.pagination_big').owlCarousel({
margin:10,
nav:true,
center: true,
loop: true,
dots:false,
responsive:{
0:{
2 changes: 1 addition & 1 deletion docs/en/legal_release_notes.md
@@ -15,7 +15,7 @@ sidebar:

## Releases log


{:.table-model-big}
| | | | |
|--------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
| [1.0.0](https://medium.com/spark-nlp/spark-nlp-for-legal-1-0-0-over-300-new-state-of-the-art-models-in-multiple-languages-f3bae55c32e1) | [1.1.0](https://medium.com/@muhendisbp/legal-nlp-1-1-0-for-spark-nlp-has-been-released-89de7f099bdc) | [1.2.0](https://medium.com/spark-nlp/legal-nlp-1-2-0-for-spark-nlp-has-been-released-%EF%B8%8F-8d060b3391ef) | [1.3.0](https://gaddesaishailesh.medium.com/spark-nlp-for-legal-1-3-0-over-100-new-state-of-the-art-models-%EF%B8%8F-b069207ce77f) |
4 changes: 2 additions & 2 deletions docs/en/licensed_version_compatibility.md
@@ -13,7 +13,7 @@ sidebar:

<div class="h3-box" markdown="1">


{:.table-model-big}
| Spark NLP for Healthcare | Spark NLP (Public) |
|---------------------------|--------------------|
| 5.3.3 | 5.3.2 |
@@ -95,7 +95,7 @@ sidebar:
| 2.3.4 | 2.3.4 |



{:.table-model-big}
| Spark NLP for Healthcare | Spark OCR |
|---------------------------|--------------------|
| 4.3.0 | 4.3.1 |
2 changes: 1 addition & 1 deletion docs/en/ocr_benchmark.md
@@ -33,7 +33,7 @@ sidebar:

#### Benchmark Table

{:.table-model-big.db}
{:.table-model-big}
| Instance | memory | cores | input\_data\_pages| partition | second per page | timing |
| ------------- | ------ | ----- | ----------------- | ------------- | --------------- | ------- |
| m5n.4xlarge | 64 GB | 16 | 1000 | 10 | 0.24 | 4 mins |
1 change: 1 addition & 0 deletions docs/en/ocr_structures.md
@@ -367,6 +367,7 @@ Show single image with metadata in Jupyter notebook.

{:.table-model-big}
| Param name | Type | Default | Description |
|------------|------|---------|-------------|
| width | string | "600" | width of image |
| show_meta | boolean | true | enable/disable displaying metadata of image |

Original file line number Diff line number Diff line change
@@ -79,6 +79,7 @@ Output:

{:.table-model-big}
| | chunks | begin | end | code | resolutions
|----|-----------------------------|---------|-------|-----------|------------|
| 2 | COPD | 113 | 116 | 13645005 | copd - chronic obstructive pulmonary disease
| 8 | PTCA | 324 | 327 | 373108000 | post percutaneous transluminal coronary angioplasty (finding)
| 16 | close monitoring | 519 | 534 | 417014005 | on examination - vigilance
6 changes: 3 additions & 3 deletions docs/en/spark_nlp_healthcare_versions/release_notes_4_4_0.md
@@ -88,7 +88,7 @@ Our clinical summarizer models with only 250M parameters perform 30-35% better t

🔎 Benchmark on MtSamples Summarization Dataset

{:.table-model-big}
{:.table-model-big.db}
| model_name | model_size | Rouge | Bleu | bertscore_precision | bertscore_recall: | bertscore_f1 |
|--|--|--|--|--|--|--|
philschmid/flan-t5-base-samsum | 250M | 0.1919 | 0.1124 | 0.8409 | 0.8964 | 0.8678 |
@@ -100,7 +100,7 @@ transformersbook/pegasus-samsum | 500M | 0.1924 | 0.0965 | 0.8920 | 0.8149 | 0.8

🔎 Benchmark on MIMIC Summarization Dataset

{:.table-model-big}
{:.table-model-big.db}
| model_name | model_size | Rouge | Bleu | bertscore_precision | bertscore_recall: | bertscore_f1 |
|--|--|--|--|--|--|--|
philschmid/flan-t5-base-samsum | 250M | 0.1910 | 0.1037 | 0.8708 | 0.9056 | 0.8879 |
@@ -110,7 +110,7 @@ transformersbook/pegasus-samsum | 570M | 0.1425 | 0.0582 | 0.9171 | 0.8682 | 0.8
**summarizer_clinical_jsl** | **250M** | **0.395** | **0.2962** | **0.895** | **0.9316** | **0.913** |
**summarizer_clinical_jsl_augmented** | **250M** | **0.3964** | **0.307** | **0.9109** | **0.9452** | **0.9227** |

![image](https://user-images.githubusercontent.com/64752006/230899745-3a67d142-1bdf-4f4b-83cb-d012953b1e09.png)
![Benchmark on MIMIC Summarization Dataset](https://user-images.githubusercontent.com/64752006/230899745-3a67d142-1bdf-4f4b-83cb-d012953b1e09.png)

</div><div class="h3-box" markdown="1">

6 changes: 2 additions & 4 deletions docs/en/spark_ocr_versions/ocr_release_notes.md
@@ -39,6 +39,7 @@ Release date: 11-04-2024
## Improved table extraction capabilities in HocrToTextTable
Many issues related to column detection in our Table Extraction pipelines are addressed in this release, compared to previous Visual NLP version the metrics have improved. Table below shows F1-score(CAR or Cell Adjacency Relationship) performances on ICDAR 19 Track B dataset for different IoU values of our two versions in comparison with [other results](https://paperswithcode.com/paper/multi-type-td-tsr-extracting-tables-from/review/).

{:.table-model-big}
| Model | 0.6 | 0.7 | 0.8 | 0.9 |
| ------------- | ------------- |------------- |------------- |------------- |
| CascadeTabNet | 0.438 | 0.354 | 0.19 | 0.036 |
@@ -88,7 +89,7 @@ ocr = ImageToTextV2.pretrained("ocr_base_printed_v2_opt", "en", "clinical/ocr")
.setIncludeConfidence(True)
```

![image](/assets/images/ocr/confidence_score.png)
![Confidence scores in ImageToTextV2](/assets/images/ocr/confidence_score.png)

Check this [updated notebook](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/TextRecognition/SparkOcrImageToTextV2.ipynb) for an end-to-end example.

@@ -131,9 +132,6 @@ ImageDrawRegions is the annotator used for rendering regions into images so we c
### Bug Fixes
+ PdfToImage resetting page information when used in the same pipeline as PdfToText: When the sequence {PdfToText, PdfToImage} was used the original pages computed at PdfToText where resetted to zero by PdfToImage.




</div><div class="prev_ver h3-box" markdown="1">

## Previous versions
4 changes: 4 additions & 0 deletions docs/en/spark_ocr_versions/release_notes_1_10_0.md
@@ -21,6 +21,8 @@ Release date: 20-01-2021

Support Microsoft Docx documents.

</div><div class="h3-box" markdown="1">

#### New Features

* Added [DocToText](/docs/en/ocr_pipeline_components#doctotext) transformer for extract text
@@ -30,6 +32,8 @@ table data from DOCX documents.
* Added [DocToPdf](/docs/en/ocr_pipeline_components#doctopdf) transformer for convert DOCX
documents to PDF format.

</div><div class="h3-box" markdown="1">

#### Bugfixes

* Fixed issue with loading model data on some cluster configurations
8 changes: 8 additions & 0 deletions docs/en/spark_ocr_versions/release_notes_1_11_0.md
@@ -22,22 +22,30 @@ Release date: 25-02-2021
Support German, French, Spanish and Russian languages.
Improving [PositionsFinder](/docs/en/ocr_pipeline_components#positionsfinder) and ImageToText for better support de-identification.

</div><div class="h3-box" markdown="1">

#### New Features

* Loading model data from S3 in [ImageToText](/docs/en/ocr_pipeline_components#imagetotext).
* Added support German, French, Spanish, Russian languages in [ImageToText](/docs/en/ocr_pipeline_components#imagetotext).
* Added different OCR model types: Base, Best, Fast in [ImageToText](/docs/en/ocr_pipeline_components#imagetotext).

</div><div class="h3-box" markdown="1">

#### Enhancements

* Added spaces symbols to the output positions in the [ImageToText](/docs/en/ocr_pipeline_components#imagetotext) transformer.
* Eliminate python-levensthein from dependencies for simplify installation.

</div><div class="h3-box" markdown="1">

#### Bugfixes

* Fixed issue with extracting coordinates in in [ImageToText](/docs/en/ocr_pipeline_components#imagetotext).
* Fixed loading model data on cluster in yarn mode.

</div><div class="h3-box" markdown="1">

#### New notebooks

* [Languages Support](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/1.11.0/jupyter/SparkOcrLanguagesSupport.ipynb)
4 changes: 4 additions & 0 deletions docs/en/spark_ocr_versions/release_notes_1_2_0.md
@@ -22,13 +22,17 @@ Release date: 08-04-2020

Improved support Databricks and processing selectable pdfs.

</div><div class="h3-box" markdown="1">

#### Enhancements

* Adapted Spark OCR for run on Databricks.
* Added rewriting positions in [ImageToText](/docs/en/ocr_pipeline_components#imagetotext) when run together with PdfToText.
* Added 'positionsCol' param to [ImageToText](/docs/en/ocr_pipeline_components#imagetotext).
* Improved support Spark NLP. Changed [start](/docs/en/ocr_install#using-start-function) function.

</div><div class="h3-box" markdown="1">

#### New Features

* Added [showImage](/docs/en/ocr_structures#showimages) implicit to Dataframe for display images in Scala Databricks notebooks.
4 changes: 4 additions & 0 deletions docs/en/spark_ocr_versions/release_notes_1_3_0.md
@@ -21,12 +21,16 @@ Release date: 22-05-2020

New functionality for de-identification problem.

</div><div class="h3-box" markdown="1">

#### Enhancements

* Renamed TesseractOCR to ImageToText.
* Simplified installation.
* Added check license from `SPARK_NLP_LICENSE` env varibale.

</div><div class="h3-box" markdown="1">

#### New Features

* Support storing for binaryFormat. Added support storing Image and PDF files.
8 changes: 7 additions & 1 deletion docs/en/spark_ocr_versions/release_notes_1_4_0.md
@@ -21,19 +21,25 @@ Release date: 23-06-2020

Added support Dicom format and improved support image morphological operations.

</div><div class="h3-box" markdown="1">

#### Enhancements

* Updated [start](/docs/en/ocr_install#using-start-function) function. Improved support Spark NLP internal.
* `ImageMorphologyOpening` and `ImageErosion` are removed.
* Improved existing transformers for support de-identification Dicom documents.
* Added possibility to draw filled rectangles to [ImageDrawRegions](/docs/en/ocr_pipeline_components#imagedrawregions).

</div><div class="h3-box" markdown="1">

#### New Features

* Support reading and writing Dicom documents.
* Added [ImageMorphologyOperation](/docs/en/ocr_pipeline_components#imagemorphologyoperation) transformer which support:
erosion, dilation, opening and closing operations.


</div><div class="h3-box" markdown="1">

#### Bugfixes

* Fixed issue in [ImageToText](/docs/en/ocr_pipeline_components#imagetotext) related to extraction coordinates.
1 change: 1 addition & 0 deletions docs/en/spark_ocr_versions/release_notes_1_6_0.md
@@ -21,6 +21,7 @@ Release date: 05-09-2020

Support parsing data from tables for selectable PDFs.

</div><div class="h3-box" markdown="1">

#### New Features

3 changes: 3 additions & 0 deletions docs/en/spark_ocr_versions/release_notes_1_8_0.md
@@ -22,6 +22,8 @@ Release date: 20-11-2020
Optimisation performance for processing multipage PDF documents.
Support up to 10k pages per document.

</div><div class="h3-box" markdown="1">

#### New Features

* Added [ImageAdaptiveBinarizer](/docs/en/ocr_pipeline_components#imageadaptivebinarizer) Scala transformer with support:
@@ -30,6 +32,7 @@ Support up to 10k pages per document.
- Sauvola local thresholding
* Added possibility to split pdf to small documents for optimize processing in [PdfToImage](/docs/en/ocr_pipeline_components#pdftoimage).

</div><div class="h3-box" markdown="1">

#### Enhancements

2 changes: 2 additions & 0 deletions docs/en/spark_ocr_versions/release_notes_1_9_0.md
@@ -21,6 +21,8 @@ Release date: 11-12-2020

Extension of FoundationOne report parser and support HOCR output format.

</div><div class="h3-box" markdown="1">

#### New Features

* Added [ImageToHocr](/docs/en/ocr_pipeline_components#imagetohocr) transformer for recognize text from image and store it to HOCR format.
4 changes: 4 additions & 0 deletions docs/en/spark_ocr_versions/release_notes_3_0_0.md
@@ -25,6 +25,8 @@ Spark OCR 3.0.0 extends the support for Apache Spark 3.0.x and 3.1.x major relea

Spark OCR started to support Tensorflow models. First model is [VisualDocumentClassifier](/docs/en/ocr_pipeline_components#visualdocumentclassifier).

</div><div class="h3-box" markdown="1">

#### New Features

* Support for Apache Spark and PySpark 3.0.x on Scala 2.12
@@ -47,6 +49,8 @@ Spark OCR started to support Tensorflow models. First model is [VisualDocumentCl
* [VisualDocumentClassifier](/docs/en/ocr_pipeline_components#visualdocumentclassifier) model for classification documents using text and layout data.
* Added support Vietnamese language.

</div><div class="h3-box" markdown="1">

#### New notebooks

* [Visual Document Classifier](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/SparkOCRVisualDocumentClassifier.ipynb)
3 changes: 3 additions & 0 deletions docs/en/spark_ocr_versions/release_notes_3_10_0.md
@@ -22,6 +22,7 @@ Release date: 10-01-2022

Form recognition using LayoutLMv2 and text detection.

</div><div class="h3-box" markdown="1">

#### New Features

@@ -30,12 +31,14 @@ Form recognition using LayoutLMv2 and text detection.
* Support rotated regions in [ImageSplitRegions](/docs/en/ocr_pipeline_components#imagesplitregions)
* Support rotated regions in [ImageDrawRegions](/docs/en/ocr_pipeline_components#imagedrawregions)

</div><div class="h3-box" markdown="1">

#### New Models

* LayoutLMv2 fine-tuned on FUNSD dataset
* Text detection model based on CRAFT architecture

</div><div class="h3-box" markdown="1">

#### New notebooks

8 changes: 7 additions & 1 deletion docs/en/spark_ocr_versions/release_notes_3_11_0.md
@@ -20,9 +20,11 @@ Release date: 28-02-2022

#### Overview

We are glad to announce that Spark OCR 3.11.0 has been released!.
We are glad to announce that Spark OCR 3.11.0 has been released!
This release comes with new models, new features, bug fixes, and notebook examples.

</div><div class="h3-box" markdown="1">

#### New Features

* Added [ImageTextDetectorV2](/docs/en/ocr_object_detection#imagetextdetectorv2) Python Spark-OCR Transformer for detecting printed and handwritten text
@@ -32,11 +34,15 @@ This release comes with new models, new features, bug fixes, and notebook exampl
* Added [FormRelationExtractor](/docs/en/ocr_visual_document_understanding#formrelationextractor) for detecting relations between key and value entities in forms.
* Added the capability of fine tuning VisualDocumentNerV2 models for key-value pairs extraction.

</div><div class="h3-box" markdown="1">

#### New Models

* ImageTextDetectorV2: this extends the ImageTextDetectorV1 character level text detection model with a refiner net architecture.
* ImageTextRecognizerV2: Text recognition for printed text based on the Deep Learning Transformer Architecture.

</div><div class="h3-box" markdown="1">

#### New notebooks

* [SparkOcrImageToTextV2](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/3110-release-candidate/jupyter/TextRecognition/SparkOcrImageToTextV2.ipynb)
8 changes: 8 additions & 0 deletions docs/en/spark_ocr_versions/release_notes_3_12_0.md
@@ -21,6 +21,8 @@ Release date: 14-04-2022
We're glad to announce that Spark OCR 3.12.0 has been released!
This release comes with new models for Handwritten Text Recognition, Spark 3.2 support, bug fixes, and notebook examples.

</div><div class="h3-box" markdown="1">

#### New Features

* Added to the ImageTextDetectorV2:
@@ -57,16 +59,22 @@ spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False)

* Improved documentation on the website.

</div><div class="h3-box" markdown="1">

#### New Models

ocr_small_printed: Text recognition small model for printed text based on ImageToTextV2
ocr_small_handwritten: Text recognition small model for handwritten text based on ImageToTextV2
ocr_base_handwritten: Text recognition base model for handwritten text based on ImageToTextV2

</div><div class="h3-box" markdown="1">

#### Bug Fixes

* display_table() function failing to display tables coming from digital PDFs.

</div><div class="h3-box" markdown="1">

#### New notebooks

* [SparkOcrImageToTextV2OutputFormats.ipynb](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/3120-release-candidate/jupyter/TextRecognition/SparkOcrImageToTextV2OutputFormats.ipynb), different output formats for ImageToTextV2.