# Financial Word and Sentence Embeddings

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Finance/2.Embeddings.ipynb)

# Finance Word and Sentence Embeddings visualization using PCA (Principal Component Analysis)

Modern NLP models work with numerical representations of texts and their meaning. Token classification problems (inferring a class for each token, for example Named Entity Recognition) require Word Embeddings. For sentence, paragraph, or document classification, we use Sentence Embeddings.

In this notebook, we obtain token embeddings with the Spark NLP Finance Word Embeddings model (**bert_embeddings_sec_bert_base**) and then pool them into sentence embeddings with the Spark NLP `SentenceEmbeddings` annotator. These embeddings are numerical representations of the semantics of the texts. Each embedding is a 768-dimensional vector, which is impossible to interpret by eye.
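
To make the pooling step concrete, here is a minimal pure-Python sketch of average pooling: several token vectors are collapsed into one sentence vector by taking the mean of each dimension. The function name `average_pool` and the toy 4-dimensional values are illustrative assumptions, not part of Spark NLP; the real embeddings in this notebook have 768 dimensions.

```python
from typing import List


def average_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average a list of equal-length token vectors into a single vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vector) != dim for vector in token_vectors):
        raise ValueError("all token vectors must have the same length")
    count = len(token_vectors)
    # Mean of each dimension across all tokens.
    return [sum(vector[i] for vector in token_vectors) / count for i in range(dim)]


# Toy 4-dimensional "token embeddings" for a three-token text (made-up values).
tokens = [
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 2.0, 0.0, 1.0],
    [2.0, 4.0, 1.0, 0.0],
]
sentence_vector = average_pool(tokens)
print(sentence_vector)  # one vector with the same dimensionality as each token vector
```

The pooled vector keeps the dimensionality of the token vectors, which is why both word and sentence embeddings can later be fed to the same PCA stage.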

There are many techniques for visualizing embeddings. We use one of them: Principal Component Analysis (PCA), a dimensionality reduction technique, carried out by Spark MLLib. Both kinds of embeddings have 768 dimensions, so we will reduce them from **768** to **3** (X, Y, Z) and use color to show the label of each word / sentence.
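
As a conceptual illustration of what PCA does, the sketch below reduces a handful of toy 4-dimensional points to 3 dimensions in pure Python. It is not Spark's implementation: it centers the data, builds the covariance matrix, and finds the leading directions with power iteration and deflation, which is only one of several ways to compute principal components. All function names and the sample values are assumptions made for this example.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def center(rows: Matrix) -> Matrix:
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - mean[j] for j in range(d)] for r in rows]


def covariance(centered: Matrix) -> Matrix:
    """Sample covariance matrix (d x d) of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenpair(matrix: Matrix, iterations: int = 500) -> Tuple[List[float], float]:
    """Power iteration: dominant eigenvector and eigenvalue of a symmetric PSD matrix."""
    d = len(matrix)
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    v = [rng.random() + 0.1 for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left in any direction
            return v, 0.0
        v = [x / norm for x in w]
        eigenvalue = norm
    return v, eigenvalue


def pca(rows: Matrix, k: int) -> Tuple[Matrix, Matrix]:
    """Return (projected rows, components) for the top-k principal components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(k):
        v, lam = top_eigenpair(cov)
        components.append(v)
        # Deflation: remove the variance explained by v before finding the next one.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    projected = [[sum(r[j] * c[j] for j in range(d)) for c in components]
                 for r in centered]
    return projected, components


# Toy data: six 4-dimensional points (made-up values) reduced to 3 dimensions.
points = [
    [2.0, 0.0, 1.0, 3.0],
    [1.0, 1.0, 0.0, 2.0],
    [4.0, 3.0, 1.0, 0.0],
    [0.0, 2.0, 5.0, 1.0],
    [3.0, 1.0, 2.0, 2.0],
    [1.0, 4.0, 3.0, 0.0],
]
reduced, components = pca(points, k=3)
for row in reduced:
    print([round(value, 3) for value in row])
```

Each printed row has three values, the analogue of the X, Y, Z coordinates plotted later. For 768-dimensional embeddings and larger datasets, a library implementation such as the Spark one used in this notebook is the practical choice.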

# Installation

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install johnsnowlabs 

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Saving latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json to latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json


In [None]:
from johnsnowlabs import * 
# After uploading your license, run this to install all licensed Python wheels and pre-download the Jars for the Spark Session JVM
# Make sure to restart your notebook afterwards for changes to take effect
jsl.install()

👌 Detected license file /content/latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up John Snow Labs home in /home/ckl/.johnsnowlabs this might take a few minutes.
Downloading 🐍+🚀 Python Library Spark-NLP-4.1.0-wheel-for-spark-3.x.x.whl
Downloading 🐍+💊 Python Library hc
Downloading 🐍+🕶 Python Library Spark-OCR-4.0.1-wheel-for-spark-3.x.x.whl
Downloading 🫘+🚀 Java Library Spark-NLP-4.1.0-cpu-for-spark-3.x.x.jar
Downloading 🫘+💊 Java Library hc
Downloading 🫘+🕶 Java Library Spark-OCR-4.0.1-cpu-for-spark-3.x.x.jar
🙆 JSL Home setup in /root/.johnsnowlabs
Running "/usr/bin/python3 -m pip install https://pypi.johnsnowlabs.com/[LIBRARY_SECRET]spark-ocr/spark_ocr-4.0.1-py3-none-any.whl --force-reinstall"
Running "/usr/bin/python3 -m pip install https://pypi.johnsnowlabs.com/[LIBRARY_SECRET]spark-nlp-internal/spark_nlp_internal-4.1.0-py3-none-any.whl --force-reinst

## Start Spark Session

In [None]:
from johnsnowlabs import * 
# Automatically load the license data and start a session with all the Jars the user has access to
spark = jsl.start()

👌 Detected license file /content/latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json
📋 Stored new John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_2_for_Spark-Healthcare_Spark-OCR.json
👌 Launched SparkSession with Jars for: 🚀Spark-NLP, 💊Spark-Healthcare, 🕶Spark-OCR


In [5]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
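# Optional: start the session with custom parameters instead of jsl.start() above.
# PUBLIC_VERSION and JSL_VERSION are not defined in this notebook; set them for your installation first.
from pyspark.sql import SparkSession

def start(SECRET):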
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+PUBLIC_VERSION) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+SECRET+"/spark-nlp-jsl-"+JSL_VERSION+".jar")
      
    return builder.getOrCreate()

#spark = start(SECRET)


# Get sample text

In [6]:
! pip install plotly

# Downloading sample datasets.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Finance/data/finance_pca_samples.csv

In [7]:
df = pd.read_csv("finance_pca_samples.csv")

In [8]:
# Create spark dataframe
sdf = spark.createDataFrame(df)
sdf.show()

+--------------------+----------------+
|                text|           label|
+--------------------+----------------+
|I called Huntingt...|        Accounts|
|I opened an citi ...|        Accounts|
|I have been a lon...|    Credit Cards|
|My credit limit w...|    Credit Cards|
|I am filing this ...|Credit Reporting|
|The Credit Bureau...|Credit Reporting|
|I noticed an arti...| Debt Collection|
|A bank account wa...| Debt Collection|
|I was contacted v...|           Loans|
|My husband recent...|           Loans|
|I wire transfered...| Money Transfers|
|PayPal holds fund...| Money Transfers|
|We have requested...|        Mortgage|
|I filled out a co...|        Mortgage|
+--------------------+----------------+



# Pipeline with Spark NLP and Spark MLLIB

In [9]:
# Define a generic pipeline shared by the word and sentence embeddings

def generic_pipeline():
  document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

  tokenizer = nlp.Tokenizer()\
      .setInputCols("document")\
      .setOutputCol("token")

  word_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
      .setInputCols(["document", "token"])\
      .setOutputCol("word_embeddings")

  pipeline = Pipeline(stages = [
      document_assembler,
      tokenizer,
      word_embeddings
  ])

  return pipeline



## Sentence Embeddings

In [10]:
embeddings_sentence = nlp.SentenceEmbeddings()\
    .setInputCols(["document", "word_embeddings"])\
    .setOutputCol("sentence_embeddings")\
    .setPoolingStrategy("AVERAGE")
# The Spark NLP SentenceEmbeddings annotator averages the token embeddings of each document into one sentence embedding

# Custom transform to retrieve the numerical embeddings from Spark NLP and pass them to Spark MLLib

In [12]:
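# This class extracts the embeddings from the Spark NLP Annotation object.
# Explicit imports, in case `from johnsnowlabs import *` did not already provide these names.
from pyspark import keyword_only, ml as ML
from pyspark.ml import Transformer
from pyspark.sql import functions as F, types as T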
class EmbeddingsUDF(
    Transformer, ML.param.shared.HasInputCol, ML.param.shared.HasOutputCol,
    ML.util.DefaultParamsReadable, ML.util.DefaultParamsWritable
):
    @keyword_only
    def __init__(self):
        super(EmbeddingsUDF, self).__init__()

        def _sum(r):
            result = 0.0
            for e in r:
                result += e
            return result

        self.udfs = {
            'convertToVectorUDF': F.udf(lambda vs: ML.linalg.Vectors.dense(vs), ML.linalg.VectorUDT()),
            'sumUDF': F.udf(lambda r: _sum(r), T.FloatType())
        }

    def _transform(self, dataset):

        results = dataset.select(
            "*", F.explode("sentence_embeddings.embeddings").alias("embeddings")
        )
        results = results.withColumn(
            "features",
            self.udfs['convertToVectorUDF'](F.col("embeddings"))
        )
        results = results.withColumn(
            "emb_sum",
            self.udfs['sumUDF'](F.col("embeddings"))
        )
        # Remove those with embeddings all zeroes (so we can calculate cosine distance)
        results = results.where(F.col("emb_sum")!=0.0)

        return results
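
The filter above uses the sum of the components as a proxy for "all zeroes". The pure-Python sketch below shows why zero vectors matter for cosine similarity (the norm in the denominator is zero) and how a sum-based check differs from a strict element-wise check. The helper names `is_zero_vector` and `cosine_similarity` are hypothetical and not part of Spark or Spark NLP; with real 768-dimensional float embeddings, a non-zero vector summing exactly to zero is unlikely, so this is an edge case rather than a reported problem.

```python
import math
from typing import List, Optional


def is_zero_vector(vec: List[float]) -> bool:
    """True only when every component is exactly zero."""
    return all(x == 0.0 for x in vec)


def cosine_similarity(a: List[float], b: List[float]) -> Optional[float]:
    """Cosine similarity, or None when either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return None  # undefined: division by zero
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


# A non-zero vector whose components cancel out (made-up values).
balanced = [0.5, -0.5, 0.25, -0.25]
print(sum(balanced) == 0.0)      # the sum-based check would drop this row
print(is_zero_vector(balanced))  # the element-wise check would keep it
print(cosine_similarity([0.0, 0.0], [1.0, 2.0]))  # undefined for a zero vector
```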

In [13]:
embeddings_for_pca = EmbeddingsUDF()

In [14]:
DIMENSIONS  = 3

In [15]:
pca = ML.feature.PCA(k=DIMENSIONS, inputCol="features", outputCol="pca_features")

### Full Spark NLP + Spark MLLib pipeline

In [16]:
# Run the whole process in a single pipeline
pipeline = Pipeline().setStages([generic_pipeline(), embeddings_sentence, embeddings_for_pca, pca])

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [17]:
model = pipeline.fit(sdf)

In [18]:
result = model.transform(sdf)

In [19]:
result.select('pca_features', 'label').show(truncate=False)

+-------------------------------------------------------------+----------------+
|pca_features                                                 |label           |
+-------------------------------------------------------------+----------------+
|[3.3957665030600284,-1.0603627660799937,-1.5687926035355226] |Accounts        |
|[2.3660879256962373,0.8591939157072354,-0.8066160065217086]  |Accounts        |
|[0.6867755242066036,1.4823952348657883,0.006593475906606655] |Credit Cards    |
|[-0.2883394962686842,1.0031545461252709,-0.7963772156325533] |Credit Cards    |
|[-0.5037780043184444,-1.3771577732968636,0.44497422838484085]|Credit Reporting|
|[1.0397592316146262,-1.7194163703349976,1.8539397995641316]  |Credit Reporting|
|[2.7731725080459433,1.1680247576296174,1.394945023145095]    |Debt Collection |
|[-0.45950692064643583,0.8339685731704523,0.5051715356279506] |Debt Collection |
|[0.2703108926695578,1.106941385143897,-0.4247536194057045]   |Loans           |
|[0.8662548164505044,1.14352

In [20]:
df = result.select('pca_features', 'label').toPandas()

df
# As you see, dimension values are inside a list

Unnamed: 0,pca_features,label
0,"[3.3957665030600284, -1.0603627660799937, -1.5...",Accounts
1,"[2.3660879256962373, 0.8591939157072354, -0.80...",Accounts
2,"[0.6867755242066036, 1.4823952348657883, 0.006...",Credit Cards
3,"[-0.2883394962686842, 1.0031545461252709, -0.7...",Credit Cards
4,"[-0.5037780043184444, -1.3771577732968636, 0.4...",Credit Reporting
5,"[1.0397592316146262, -1.7194163703349976, 1.85...",Credit Reporting
6,"[2.7731725080459433, 1.1680247576296174, 1.394...",Debt Collection
7,"[-0.45950692064643583, 0.8339685731704523, 0.5...",Debt Collection
8,"[0.2703108926695578, 1.106941385143897, -0.424...",Loans
9,"[0.8662548164505044, 1.1435248866274272, 0.870...",Loans


In [21]:
# We extract the dimension values out of the list

df["x"] = df["pca_features"].apply(lambda x: x[0])

df["y"] = df["pca_features"].apply(lambda x: x[1])

df["z"] = df["pca_features"].apply(lambda x: x[2])

df = df[["x", "y", "z", "label"]]

df

Unnamed: 0,x,y,z,label
0,3.395767,-1.060363,-1.568793,Accounts
1,2.366088,0.859194,-0.806616,Accounts
2,0.686776,1.482395,0.006593,Credit Cards
3,-0.288339,1.003155,-0.796377,Credit Cards
4,-0.503778,-1.377158,0.444974,Credit Reporting
5,1.039759,-1.719416,1.85394,Credit Reporting
6,2.773173,1.168025,1.394945,Debt Collection
7,-0.459507,0.833969,0.505172,Debt Collection
8,0.270311,1.106941,-0.424754,Loans
9,0.866255,1.143525,0.870356,Loans
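
The three `apply` calls each pull one coordinate out of the vector. The same idea can be expressed as a single transpose, sketched here in plain Python without pandas. The helper `split_coordinates` is a hypothetical name introduced for this example; the sample rows are the first two PCA rows from the output above, rounded to six decimals.

```python
from typing import Dict, List, Sequence


def split_coordinates(vectors: Sequence[Sequence[float]],
                      names: Sequence[str] = ("x", "y", "z")) -> Dict[str, List[float]]:
    """Turn a list of equal-length vectors into one list per coordinate name."""
    for vector in vectors:
        if len(vector) != len(names):
            raise ValueError("every vector needs exactly one value per name")
    if not vectors:
        return {name: [] for name in names}
    # zip(*vectors) transposes rows into columns.
    return {name: [float(v) for v in column]
            for name, column in zip(names, zip(*vectors))}


# First two PCA rows from the output above, rounded to six decimals.
pca_rows = [
    [3.395767, -1.060363, -1.568793],
    [2.366088, 0.859194, -0.806616],
]
columns = split_coordinates(pca_rows)
print(columns["x"])  # the first coordinate of every row
```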


In [22]:
import plotly.express as px

fig = px.scatter_3d(df, x = 'x', y = 'y', z = 'z', color = 'label', width=800, height=600)

fig.show()

## Word Embeddings

We can also visualize the semantics of individual words, instead of full texts, by using Word Embeddings. The generic pipeline already contains a Tokenizer and a BertEmbeddings model that produce those embeddings, so we apply PCA to them as before. First, we split the pipeline in two to get all the token embeddings.

In [23]:
model = generic_pipeline().fit(sdf)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [24]:
result = model.transform(sdf)

In [25]:
result_df = result.select("label", F.explode(F.arrays_zip("token.result", "word_embeddings.embeddings")).alias("cols"))\
      .select(F.expr("cols['0']").alias("token"),
              "label",
              F.expr("cols['1']").alias("embeddings"))

result_df.show(truncate = 80)


+----------+--------+--------------------------------------------------------------------------------+
|     token|   label|                                                                      embeddings|
+----------+--------+--------------------------------------------------------------------------------+
|         I|Accounts|[-0.29679197, 0.80952483, 0.026026454, 0.08434165, 0.74346274, -0.02694714, -...|
|    called|Accounts|[0.28905857, -0.2922978, -0.42990327, -0.3833459, 0.026178878, -0.1272839, -0...|
|Huntington|Accounts|[0.20684443, -0.010130229, -0.25902456, -0.3755847, 0.4579211, 0.3114918, -0....|
|      Bank|Accounts|[-0.034711868, 0.460474, -0.6221116, -0.011170343, 0.29385087, 0.31341177, -0...|
|        to|Accounts|[-0.4045788, -0.37686452, -0.08015355, -0.5890965, -0.33856547, -0.39321235, ...|
|     close|Accounts|[0.35089272, 0.9568468, 0.8632823, -0.43343982, 0.11386755, -0.48837718, -0.8...|
|        my|Accounts|[-0.3659193, 0.26555997, -0.32495028, -0.5081898, -0
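
The `arrays_zip` plus `explode` combination pairs each token with its embedding and then emits one row per pair. The plain-Python sketch below mirrors that logic on a toy input. The dictionary keys (`tokens`, `embeddings`, `label`) and the 2-dimensional vectors are made up for illustration and do not reflect the actual Spark NLP annotation schema.

```python
from typing import Dict, List


def explode_tokens(rows: List[Dict]) -> List[Dict]:
    """Emit one output row per (token, embedding) pair, carrying the label along."""
    exploded = []
    for row in rows:
        # zip pairs the i-th token with the i-th embedding, like arrays_zip.
        for token, vector in zip(row["tokens"], row["embeddings"]):
            exploded.append({"token": token, "label": row["label"], "embeddings": vector})
    return exploded


# One document with three tokens and toy 2-dimensional embeddings (made-up values).
documents = [
    {
        "label": "Accounts",
        "tokens": ["I", "called", "Huntington"],
        "embeddings": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    }
]
token_rows = explode_tokens(documents)
for item in token_rows:
    print(item["token"], item["label"], item["embeddings"])
```

After this step every row holds a single token and a single vector, which is why the word-level transformer that follows does not need to explode anything itself.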

In [26]:
# Subclass of the EmbeddingsUDF class defined earlier
class WordEmbeddingsUDF(EmbeddingsUDF):
    def _transform(self, dataset):

        # No explode needed here: the embeddings column is already exploded
        results = dataset.select('token', 'label', 'embeddings')

        results = results.withColumn(
            "features",
            self.udfs['convertToVectorUDF'](F.col("embeddings"))
        )
        results = results.withColumn(
            "emb_sum",
            self.udfs['sumUDF'](F.col("embeddings"))
        )
        # Remove those with embeddings all zeroes (so we can calculate cosine distance)
        results = results.where(F.col("emb_sum")!=0.0)

        return results

In [27]:
embeddings_for_pca = WordEmbeddingsUDF()

In [28]:
DIMENSIONS  = 3

In [29]:
pca = ML.feature.PCA(k=DIMENSIONS, inputCol="features", outputCol="pca_features")

### Full Spark NLP + Spark MLLib pipeline

In [30]:
# Run the second part of the pipeline, which reduces the 768 dimensions to 3

pipeline = Pipeline().setStages([embeddings_for_pca, pca])


In [31]:
model = pipeline.fit(result_df)

In [32]:
result = model.transform(result_df)

In [33]:
result.select("token", "label", "pca_features").show(truncate = 60)

+----------+--------+------------------------------------------------------------+
|     token|   label|                                                pca_features|
+----------+--------+------------------------------------------------------------+
|         I|Accounts|  [9.850465344870052,0.021824808185499117,1.712889610377415]|
|    called|Accounts| [0.5703221766039556,0.34666339931905005,-2.867727585221515]|
|Huntington|Accounts|  [8.635449026171822,0.8802331103753408,-0.8417092198995026]|
|      Bank|Accounts|   [9.391060490437583,0.4506684475981722,-1.215744009319277]|
|        to|Accounts| [-2.0937840432194608,-1.1261841265579964,4.473376590663871]|
|     close|Accounts|[-2.8977641355061525,-0.16329873928869695,2.631654680384121]|
|        my|Accounts|  [3.5422389955907287,-2.7214870668758673,2.847899036412741]|
|   account|Accounts|[-1.2533282844778362,0.006484996411762444,1.9023196258439...|
|         ,|Accounts|[-1.3713437013933456,0.16044007744083294,2.2361486511639304]|
|   

In [34]:
df = result.select('token', 'label', 'pca_features').toPandas()

df

Unnamed: 0,token,label,pca_features
0,I,Accounts,"[9.850465344870052, 0.021824808185499117, 1.71..."
1,called,Accounts,"[0.5703221766039556, 0.34666339931905005, -2.8..."
2,Huntington,Accounts,"[8.635449026171822, 0.8802331103753408, -0.841..."
3,Bank,Accounts,"[9.391060490437583, 0.4506684475981722, -1.215..."
4,to,Accounts,"[-2.0937840432194608, -1.1261841265579964, 4.4..."
...,...,...,...
1364,the,Mortgage,"[0.207832347767541, 1.2121747437322248, 2.3456..."
1365,company,Mortgage,"[0.9758761373328054, 1.1525684955599131, 1.548..."
1366,never,Mortgage,"[-0.009451121973367997, -1.3605053701459033, -..."
1367,responds,Mortgage,"[-1.3105386067531253, -0.3951970841405002, -1...."


In [35]:
df["x"] = df["pca_features"].apply(lambda x: x[0])

df["y"] = df["pca_features"].apply(lambda x: x[1])

df["z"] = df["pca_features"].apply(lambda x: x[2])

df = df[["token", "label", "x", "y", "z"]]

df

Unnamed: 0,token,label,x,y,z
0,I,Accounts,9.850465,0.021825,1.712890
1,called,Accounts,0.570322,0.346663,-2.867728
2,Huntington,Accounts,8.635449,0.880233,-0.841709
3,Bank,Accounts,9.391060,0.450668,-1.215744
4,to,Accounts,-2.093784,-1.126184,4.473377
...,...,...,...,...,...
1364,the,Mortgage,0.207832,1.212175,2.345684
1365,company,Mortgage,0.975876,1.152568,1.548877
1366,never,Mortgage,-0.009451,-1.360505,-0.080953
1367,responds,Mortgage,-1.310539,-0.395197,-1.634089


In [36]:
import plotly.express as px

fig = px.scatter_3d(df, x = 'x', y = 'y', z = 'z', color = "label", width=1000, height = 800, hover_data = ["token", "label"])

fig.show()