# Spark
- Open-source Natural Language Processing Library
- Pre-trained models and pipelines
- [Finance NLP (released Fall 2022)](https://nlp.johnsnowlabs.com/classify_financial_documents)
	- [Intro to new library](https://medium.com/spark-nlp/spark-nlp-for-finance-is-released-cfa3cc7b9faa)
	- [Spark Tutorials on GH](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Public)
	- [Intro to Spark NLP Foundations](https://towardsdatascience.com/introduction-to-spark-nlp-foundations-and-basic-components-part-i-c83b7629ed59)
> [Main Table of Contents](../../../README.md)

## Transfer Learning
- Transfer learning extracts knowledge from a source setting and applies it to a different target setting. It is a highly effective way to improve the accuracy of NLP models and to get reliable results even with small datasets, because it leverages existing labelled data from a related task or domain. As a result, there is no need to amass millions of data points to train a state-of-the-art model.
- Word vectors long reigned as NLP's core representation technique, but they now face an exciting line of challengers such as ELMo, BERT, RoBERTa, ALBERT, XLNet, ERNIE, ULMFiT, and the OpenAI transformer. These are all open source, ship with pre-trained models, and can be tuned or reused without a major computing effort. These works made headlines by demonstrating that pre-trained language models can achieve state-of-the-art results on a wide range of NLP tasks, sometimes even surpassing human-level benchmarks.

## In This Notebook
- AnnotatorTypes
- Annotators
- Transformers
- Pipelines
- LightPipeline for small dataset
	- LightPipeline Methods
- Finance NLP Features

## AnnotatorTypes

AnnotatorType |
--- |
document |
token |
chunk |
pos |
word_embeddings |
date |
entity |
sentiment |
named_entity |
dependency |
labeled_dependency |

## Annotators
- In Spark NLP, all Annotators are either Estimators or Transformers as we see in Spark ML. An Estimator in Spark ML is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model. A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer that transforms a DataFrame with features into a DataFrame with predictions.

	Type of annotator | Description
	--- | ---
	AnnotatorApproach<br>(trainable annotator) | Extends Estimators from Spark ML, which train on a df and produce a model<br>Call `fit()` then `transform()`
	AnnotatorModel<br>(trained annotator) | Extends Transformers, which transform one df into another df using a trained model<br>Call `transform()`
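
The `fit()` / `transform()` contract in the table above can be sketched in plain Python. This is a toy model under stated assumptions, not Spark NLP code: `ToyVocabularyApproach`, `ToyVocabularyModel`, and the list-of-dicts rows are hypothetical stand-ins for an AnnotatorApproach, an AnnotatorModel, and a DataFrame.

```python
from collections import Counter


class ToyVocabularyModel:
    """Transformer-like object (hypothetical): holds learned state and only transforms."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        # Return new rows with an extra "known_tokens" field; input rows are not mutated.
        return [
            {**row, "known_tokens": [t for t in row["tokens"] if t in self.vocabulary]}
            for row in rows
        ]


class ToyVocabularyApproach:
    """Estimator-like object (hypothetical): fit() learns from data and returns a model."""

    def __init__(self, min_count=2):
        self.min_count = min_count

    def fit(self, rows):
        # Keep every token that appears at least min_count times in the training rows.
        counts = Counter(token for row in rows for token in row["tokens"])
        vocabulary = {t for t, c in counts.items() if c >= self.min_count}
        return ToyVocabularyModel(vocabulary)


train_rows = [
    {"tokens": ["revenue", "grew", "strongly"]},
    {"tokens": ["revenue", "fell"]},
]
model = ToyVocabularyApproach(min_count=2).fit(train_rows)  # trainable -> trained
predictions = model.transform([{"tokens": ["revenue", "fell", "again"]}])
print(predictions[0]["known_tokens"])  # ['revenue']
```

The trainable object has no `transform()` of its own and the trained object has no `fit()`, which mirrors the "call `fit()` then `transform()`" versus "call `transform()`" distinction.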

In [4]:
# Load the ClassifierDL model trained on bertwiki finance sentiment
from sparknlp.annotator import ClassifierDLModel

# df must already contain a "sentence_embeddings" annotation column
sentimentClassifier = ClassifierDLModel.pretrained("classifierdl_bertwiki_finance_sentiment", "en").setInputCols(["sentence_embeddings"]).setOutputCol("class")
sentimentClassifier.transform(df)

In [None]:
# Tokenizer is an AnnotatorApproach: call fit(), then transform()
from sparknlp.annotator import Tokenizer

tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
tokenizer.fit(df).transform(df)

In [None]:
# Stemmer is an AnnotatorModel: just call transform()
from sparknlp.annotator import Stemmer

stemmer = Stemmer().setInputCols(["token"]).setOutputCol("stem")
stemmer.transform(df)

In [None]:
# Load the pretrained pipeline for the ClassifierDL model trained on bertwiki finance sentiment
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('classifierdl_bertwiki_finance_sentiment_pipeline', lang='en')
print(pipeline.model.stages)
# A pretrained pipeline is already fitted, so it can be used like the result of calling fit() on a manual `Pipeline()`
# annotate() returns a dict for a single string (text_line)
# and a list of dicts for a list of strings (text_list)
result = pipeline.annotate(text_list)

## Transformers
- Transformers are used to get data in or to transform data from one AnnotatorType to another
- Transformers help address the question: 
	- What to do if my df doesn't have columns in the types that each Annotator accepts or outputs?
- 5 Transformer types
- Transformers do not produce a model, so you only need to call `transform()` on them


	Transformer | Description
	--- | ---
	DocumentAssembler |  To get through the NLP process, we need to get raw data annotated. This is a special transformer that does this for us; it creates the first annotation of type Document which may be used by annotators down the road
	TokenAssembler | Reconstructs a Document type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., so that the document annotation can be used by further annotators
	Doc2Chunk | Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol
	Chunk2Doc | Converts a CHUNK type column back into DOCUMENT. Useful when trying to re-tokenize or do further analysis on a CHUNK result
	Finisher | Once we have our NLP pipeline ready to go, we might want to use our annotation results somewhere else where it is easy to use. The Finisher outputs annotation(s) values into a string
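
How data moves from raw text to a `document` annotation, on to `token` annotations, and finally to plain strings can be sketched in plain Python. This is a toy illustration, not the Spark NLP implementation: `ToyAnnotation`, its field names, the inclusive `end` offset, and the three `toy_*` functions are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAnnotation:
    annotator_type: str          # e.g. "document" or "token"
    begin: int                   # index of the first character
    end: int                     # index of the last character (inclusive, by assumption)
    result: str                  # the annotated text
    metadata: dict = field(default_factory=dict)


def toy_document_assembler(text):
    """Wrap raw text in a single 'document' annotation."""
    return [ToyAnnotation("document", 0, len(text) - 1, text)]


def toy_tokenizer(documents):
    """Split each document on whitespace into 'token' annotations."""
    tokens = []
    for doc in documents:
        position = 0
        for word in doc.result.split():
            begin = doc.result.index(word, position)
            end = begin + len(word) - 1
            tokens.append(ToyAnnotation("token", begin, end, word))
            position = end + 1
    return tokens


def toy_finisher(annotations):
    """Keep only the string values, dropping offsets and metadata."""
    return [a.result for a in annotations]


document = toy_document_assembler("Profits rose sharply")
tokens = toy_tokenizer(document)
print(toy_finisher(tokens))  # ['Profits', 'rose', 'sharply']
```

Each step consumes one annotation type and emits another, which is why a pipeline needs the assembler first and can use a finisher last.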

## Pipelines
- A Spark pipeline is a sequence of stages (Estimators or Transformers)
- The first stage of most pipelines is to create a Spark document through a `DocumentAssembler`
- Subsequent stages take df column(s) of certain AnnotatorType(s) as input and output df column(s) of certain AnnotatorType(s)
	- As the document passes through the pipeline stages, annotations are added to it
- `Pipeline(stages=[...]).fit(train_df)` returns a `PipelineModel`, on which you can call `.transform(test_df)` to make predictions on test_df
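
The way a pipeline chains its stages during `fit()` and replays them during `transform()` can be sketched in plain Python. This is a simplified toy, not Spark ML code: `ToyPipeline`, `ToyPipelineModel`, the two example stages, and the list-of-dicts rows standing in for a df are all hypothetical.

```python
class Lowercaser:
    """Transformer-like stage: needs no training."""

    def transform(self, rows):
        return [{**r, "text": r["text"].lower()} for r in rows]


class KeywordFlagger:
    """Estimator-like stage: fit() learns keywords and returns a fitted stage."""

    def fit(self, rows):
        keywords = {w for r in rows if r["label"] == "positive" for w in r["text"].split()}
        return FittedKeywordFlagger(keywords)


class FittedKeywordFlagger:
    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [
            {**r, "prediction": "positive" if set(r["text"].split()) & self.keywords else "other"}
            for r in rows
        ]


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimator-like stages are fitted first; transformer-like stages are reused as-is.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)  # the output feeds the next stage
            fitted.append(model)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train_rows = [
    {"text": "Record PROFIT", "label": "positive"},
    {"text": "Heavy losses", "label": "negative"},
]
test_rows = [{"text": "Profit warning"}, {"text": "Flat quarter"}]

pipeline_model = ToyPipeline([Lowercaser(), KeywordFlagger()]).fit(train_rows)
print([r["prediction"] for r in pipeline_model.transform(test_rows)])  # ['positive', 'other']
```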

In [None]:
# Get list of all public pipelines
from sparknlp.pretrained import ResourceDownloader
ResourceDownloader.showPublicPipelines(lang="en")

### Pipeline Methods/Attributes

Method/Attribute |
--- |
.getStages() |

## LightPipeline for small dataset
- Distributed processing works best on very large datasets
- `Pipeline()` is too heavyweight for small datasets (<=50k sentences); use `LightPipeline` instead
	- Spark NLP LightPipelines are Spark ML pipelines converted to run on a single machine as a multi-threaded task, becoming more than 10x faster for smaller amounts of data
- Usage:
	- Useful when working with small datasets, debugging results, or running either training or prediction from an API that serves one-off requests
	- Simply plug in a trained (fitted) pipeline and then annotate plain text (it doesn't have to be a df)
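
The idea of running already-fitted stages directly on plain strings, with a dict returned for one string and a list of dicts for a list of strings, can be sketched in plain Python. This is a toy under stated assumptions, not the Spark NLP `LightPipeline`: `ToyLightPipeline`, its `(output_column, function)` stage format, and the two example stages are hypothetical.

```python
class ToyLightPipeline:
    """Applies already-fitted stages to plain Python strings, with no DataFrame."""

    def __init__(self, stages):
        # Each stage is (output_column, function taking the results so far).
        self.stages = stages

    def _annotate_one(self, text):
        result = {"document": [text]}
        for output_col, func in self.stages:
            result[output_col] = func(result)
        return result

    def annotate(self, target):
        # A single string gives one dict; a list of strings gives a list of dicts.
        if isinstance(target, str):
            return self._annotate_one(target)
        return [self._annotate_one(text) for text in target]


stages = [
    ("token", lambda cols: cols["document"][0].split()),
    ("lower", lambda cols: [t.lower() for t in cols["token"]]),
]
light = ToyLightPipeline(stages)

single = light.annotate("Shares Fell")
batch = light.annotate(["Shares Fell", "Bonds Rallied"])
print(single["lower"])  # ['shares', 'fell']
print(len(batch))       # 2
```

Because nothing here is distributed, there is no job scheduling overhead per call, which is the intuition behind preferring this style for one-off requests.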

### LightPipeline Methods/Attributes

Method/Attribute |
--- |
.annotate() |
.fullAnnotate() |
.getStages() |
.transform() |
.setIgnoreUnsupported() |
.getIgnoreUnsupported() |
.pipelineModel |

In [None]:
from sparknlp.base import LightPipeline

# annotate() accepts a plain string or a list of strings; the input does not have to be a df
LightPipeline(someTrainedPipelineModel).annotate(someStringOrArrayDoesNotHaveToBeDF)

lightModel = LightPipeline(someTrainedPipelineModel, parse_embeddings=True)
lightModel.annotate('Hello there, Nice to finally meet you')

## [Finance NLP (released Fall 2022)](https://nlp.johnsnowlabs.com/classify_financial_documents)

Available Features | Description
--- | ---
Sentiment analysis |
Financial NER models | Extract organizations, products, revenue, profit, losses, trading symbols, SEC 10-K information, ...
Text classification | Classify texts into specific financial categories
Entity linking | Normalize NER entities and link them to public databases/data sources such as Edgar, Crunchbase, and Nasdaq. By doing so, you can, for example, augment company names with externally available data about them
Pattern matching | Use context-aware symbolic components to model patterns and combine them with Deep Learning information extraction techniques
Financial embeddings | Produce an n-dimensional embedding matrix, not interpretable by the human eye, that encodes the meaning of the text within the financial domain
Assertion status | Infer temporality (present, past, future), probability (possible), or other conditions in the context of the extracted entities
Relation extraction | Infer relations between the extracted entities, for example the relations between the parties in an agreement
Question answering | Financial Question Answering model, fine-tuned on proprietary financial questions and answers
Knowledge graphs | Create knowledge bases combining entities and relations in a graph, which can be exploited afterward in graph databases
Table understanding | Use state-of-the-art Deep Learning architectures to query tables (extracted, for example, with Spark OCR) with natural language. No training is needed
Zero-shot learning | Spark NLP for Finance includes zero-shot NER and zero-shot relation extraction, to create your information extraction models without any training data, just with examples (prompts)
Deidentification | Detect privacy-related entities in text, such as person names, emails, and contact data