# Legal Word and Sentence Embeddings

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/2.Embeddings.ipynb)

# Legal Word and Sentence Embeddings visualization using PCA (Principal Component Analysis)

Modern NLP models work with numerical representations of texts and their meaning. Token classification problems (inferring a class for each token, as in Named Entity Recognition) require Word Embeddings. For sentence, paragraph, or document classification, we use Sentence Embeddings.

In this notebook, we use the Spark NLP Legal Word (**roberta_embeddings_legal_roberta_base**) and Sentence (**sent_bert_base_uncased_legal**) Embeddings to get those numerical representations of the semantics of the texts. Each embedding is a vector with 768 dimensions, which is impossible to interpret by eye.

There are many techniques for visualizing those embeddings. We use one of them: Principal Component Analysis (PCA), a dimensionality reduction technique, carried out here by Spark MLLib. Both embeddings have 768 dimensions, so we will reduce them from **768** to **3** (X, Y, Z) and use color to show the label of each word or sentence.
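To make the reduction concrete, here is a minimal NumPy sketch of PCA on random stand-in data. It is an illustration only, not Spark MLLib's implementation: the helper `pca_project` and the random `fake_embeddings` array are assumptions made for this example, and because the sketch centers the data, its numbers should not be expected to match the Spark output shown in this notebook.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto their first k principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:k].T, explained[:k]

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(10, 768))  # stand-in for 10 sentence embeddings
points, explained = pca_project(fake_embeddings, k=3)

print(points.shape)  # (10, 3)
print(explained)     # share of the total variance kept by each of the 3 components
```

The `explained` values give a rough idea of how much information survives when going from 768 dimensions to 3, which is worth keeping in mind when reading the 3D charts.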

## Install Required Libraries

In [None]:
# Installing plotly
! pip install plotly

# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==4.0.0


## Start Spark Session

In [None]:
import json
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp

import pyspark
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel
from pyspark.sql import functions as F

from pyspark import keyword_only
from pyspark.ml import Pipeline
from pyspark.ml import Transformer
from pyspark.ml.feature import PCA
from pyspark.ml.functions import vector_to_array
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import DataFrame
from pyspark.sql import Window
from pyspark.sql import types as T

import pandas as pd
import warnings
warnings.filterwarnings('ignore')

print ("Spark NLP Version :", sparknlp.version())

spark = sparknlp.start()

spark

Spark NLP Version : 4.0.0


# Get sample text

In [None]:
# Downloading sample datasets.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Legal/data/legal_pca_samples.csv

In [None]:
# Read the file downloaded in the previous cell
df = pd.read_csv('legal_pca_samples.csv')

df.head()

Unnamed: 0,text,label
0,The fiscal year of the Company (herein called ...,fiscal-year
1,Each of the Borrower and each other member of ...,fiscal-year
2,Purchaser shall pay as the total Purchase Pric...,purchase-price
3,The purchase price to be paid by Purchaser to ...,purchase-price
4,The Guarantor hereby unconditionally and irrev...,guarantee


In [None]:
# Create spark dataframe
sdf = spark.createDataFrame(df)
sdf.show()

+--------------------+--------------+
|                text|         label|
+--------------------+--------------+
|The fiscal year o...|   fiscal-year|
|Each of the Borro...|   fiscal-year|
|Purchaser shall p...|purchase-price|
|The purchase pric...|purchase-price|
|The Guarantor her...|     guarantee|
|The Holding Compa...|     guarantee|
|GFS will bear its...|      expenses|
|Each party shall ...|      expenses|
|Failure by either...|        waiver|
|Failure of any pa...|        waiver|
+--------------------+--------------+



# Sentence Embeddings

In [None]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_uncased_legal", "en") \
    .setInputCols("document") \
    .setOutputCol("document_embeddings")

sent_bert_base_uncased_legal download started this may take some time.
Approximate size to download 390.8 MB
[OK!]


# Custom transformer to retrieve the numerical embeddings from Spark NLP and pass them to Spark MLLib

In [None]:
# This class extracts the embeddings from the Spark NLP Annotation object
class EmbeddingsUDF(
    Transformer, HasInputCol, HasOutputCol,
    DefaultParamsReadable, DefaultParamsWritable
):
    @keyword_only
    def __init__(self):
        super(EmbeddingsUDF, self).__init__()

        def _sum(r):
            result = 0.0
            for e in r:
                result += e
            return result

        self.udfs = {
            'convertToVectorUDF': F.udf(lambda vs: Vectors.dense(vs), VectorUDT()),
            'sumUDF': F.udf(lambda r: _sum(r), T.FloatType())
        }

    def _transform(self, dataset):

        results = dataset.select(
            "*", F.explode("document_embeddings.embeddings").alias("embeddings")
        )
        results = results.withColumn(
            "features",
            self.udfs['convertToVectorUDF'](F.col("embeddings"))
        )
        results = results.withColumn(
            "emb_sum",
            self.udfs['sumUDF'](F.col("embeddings"))
        )
        # Drop rows whose embedding values sum to zero; this is meant to remove
        # all-zero vectors, for which cosine distance is undefined
        results = results.where(F.col("emb_sum")!=0.0)

        return results
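
The steps inside `_transform` can be sketched in plain Python, without Spark. The `rows` data and the `explode` helper below are hypothetical and exist only for this illustration. The sketch also shows a subtlety of the filter: a vector whose values sum to zero is dropped even when it is not an all-zero vector, so a stricter check (shown as `by_any`) may be preferable in some cases.

```python
# Hypothetical rows shaped like the "document_embeddings.embeddings" column:
# each row holds a list of embedding vectors (one per annotation).
rows = [
    {"label": "fiscal-year", "embeddings": [[0.1, -0.2, 0.3]]},
    {"label": "waiver", "embeddings": [[0.0, 0.0, 0.0]]},     # all zeroes
    {"label": "expenses", "embeddings": [[0.5, -0.5, 0.0]]},  # sums to zero, not a zero vector
]

def explode(rows):
    """Yield one (label, vector) pair per embedding vector, like F.explode."""
    for row in rows:
        for vector in row["embeddings"]:
            yield row["label"], vector

# Filter used in the transformer: keep vectors whose values do not sum to zero.
by_sum = [(label, v) for label, v in explode(rows) if sum(v) != 0.0]
# Stricter alternative: keep vectors that contain at least one non-zero value.
by_any = [(label, v) for label, v in explode(rows) if any(x != 0.0 for x in v)]

print([label for label, _ in by_sum])  # ['fiscal-year']
print([label for label, _ in by_any])  # ['fiscal-year', 'expenses']
```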

In [None]:
embeddings_for_pca = EmbeddingsUDF()

In [None]:
DIMENSIONS  = 3

In [None]:
import pyspark
pca = pyspark.ml.feature.PCA(k=DIMENSIONS, inputCol="features", outputCol="pca_features")

### Full Spark NLP + Spark MLLib pipeline

In [None]:
# Run the whole process in a single pipeline

pipeline = Pipeline().setStages([document_assembler, embeddings, embeddings_for_pca, pca])

In [None]:
pipeline.getStages()

[DocumentAssembler_c3f4f011c634,
 BERT_SENTENCE_EMBEDDINGS_dae49a767331,
 EmbeddingsUDF_383d506f9bab,
 PCA_7027dead6607]

In [None]:
model = pipeline.fit(sdf)

In [None]:
result = model.transform(sdf)

In [None]:
result.select('pca_features', 'label').show(truncate=False)

+------------------------------------------------------------+--------------+
|pca_features                                                |label         |
+------------------------------------------------------------+--------------+
|[-11.77244716366699,-3.189947356388765,4.4911623229315785]  |fiscal-year   |
|[-11.401110660624248,-3.7697414158197438,3.240553405264415] |fiscal-year   |
|[-4.7833146129210355,-0.4942508065537561,2.8697833776569577]|purchase-price|
|[-5.455980892626395,-1.3412495828462316,3.3317290563506945] |purchase-price|
|[-8.84166062140954,-1.8203561928036376,0.13392067779165123] |guarantee     |
|[-11.532893021254765,-2.4995736938288324,0.8141664251810958]|guarantee     |
|[-5.731707629428153,-3.815863951248384,3.513596288588748]   |expenses      |
|[-3.8010514486947096,-4.345370568255292,1.677434102133743]  |expenses      |
|[-6.783526937000946,-5.815459791835026,3.2361589693958557]  |waiver        |
|[-7.139201238503951,-6.340440847367817,1.2347492798270938]  |waiver        |
+------------------------------------------------------------+--------------+

In [None]:
df = result.select('pca_features', 'label').toPandas()

df
# As you can see, the dimension values are stored in a list

Unnamed: 0,pca_features,label
0,"[-11.77244716366699, -3.189947356388765, 4.491...",fiscal-year
1,"[-11.401110660624248, -3.7697414158197438, 3.2...",fiscal-year
2,"[-4.7833146129210355, -0.4942508065537561, 2.8...",purchase-price
3,"[-5.455980892626395, -1.3412495828462316, 3.33...",purchase-price
4,"[-8.84166062140954, -1.8203561928036376, 0.133...",guarantee
5,"[-11.532893021254765, -2.4995736938288324, 0.8...",guarantee
6,"[-5.731707629428153, -3.815863951248384, 3.513...",expenses
7,"[-3.8010514486947096, -4.345370568255292, 1.67...",expenses
8,"[-6.783526937000946, -5.815459791835026, 3.236...",waiver
9,"[-7.139201238503951, -6.340440847367817, 1.234...",waiver


In [None]:
# Extract the dimension values out of the list

df["x"] = df["pca_features"].apply(lambda x: x[0])

df["y"] = df["pca_features"].apply(lambda x: x[1])

df["z"] = df["pca_features"].apply(lambda x: x[2])

df = df[["x", "y", "z", "label"]]

df

Unnamed: 0,x,y,z,label
0,-11.772447,-3.189947,4.491162,fiscal-year
1,-11.401111,-3.769741,3.240553,fiscal-year
2,-4.783315,-0.494251,2.869783,purchase-price
3,-5.455981,-1.34125,3.331729,purchase-price
4,-8.841661,-1.820356,0.133921,guarantee
5,-11.532893,-2.499574,0.814166,guarantee
6,-5.731708,-3.815864,3.513596,expenses
7,-3.801051,-4.345371,1.677434,expenses
8,-6.783527,-5.81546,3.236159,waiver
9,-7.139201,-6.340441,1.234749,waiver
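
As an alternative to the three `apply` calls, the list column can be expanded in a single step. This is a minimal sketch on a hypothetical two-row frame named `demo`; plain Python lists stand in for the vectors returned by `toPandas()`, and we assume each vector can be iterated with `list(...)`.

```python
import pandas as pd

# Hypothetical frame with the same layout as the one above.
demo = pd.DataFrame({
    "pca_features": [[-11.77, -3.19, 4.49], [-4.78, -0.49, 2.87]],
    "label": ["fiscal-year", "purchase-price"],
})

# Build the three coordinate columns at once, keeping the original index.
coords = pd.DataFrame(
    [list(v) for v in demo["pca_features"]],
    columns=["x", "y", "z"],
    index=demo.index,
)
flat = pd.concat([coords, demo["label"]], axis=1)

print(flat.columns.tolist())  # ['x', 'y', 'z', 'label']
```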


In [None]:
import plotly.express as px

fig = px.scatter_3d(df, x='x', y='y', z='z', color='label', width=800, height=600)

fig.show()

# Word Embeddings

We can also visualize the semantics of words, instead of full texts, by using Word Embeddings. We will add a Tokenizer and a word embeddings model (RoBertaEmbeddings) to get those embeddings, and then apply PCA as before.

In [None]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document")\
    .setOutputCol("token")

embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["document", "token"])\
    .setOutputCol("document_embeddings")

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]


In [None]:
# First, we split the pipeline in two so that we can get all token embeddings

pipeline = Pipeline().setStages([document_assembler, tokenizer, embeddings])

In [None]:
model = pipeline.fit(sdf)

In [None]:
result = model.transform(sdf)

In [None]:
result_df = result.select("label", F.explode(F.arrays_zip("token.result", "document_embeddings.embeddings")).alias("cols"))\
                   .select(F.expr("cols['0']").alias("token"),
                           F.expr("cols['1']").alias("embeddings"),
                           "label")

result_df.show(truncate = 80)


+--------+--------------------------------------------------------------------------------+-----------+
|   token|                                                                      embeddings|      label|
+--------+--------------------------------------------------------------------------------+-----------+
|     The|[0.04391094, -0.028177992, 0.11459787, -0.022955947, 0.7428129, 0.4352008, -0...|fiscal-year|
|  fiscal|[-0.19364583, 0.14353976, 0.22895497, -0.48883635, -0.41577122, -0.18882717, ...|fiscal-year|
|    year|[0.08520961, 0.21057254, 0.22785076, -0.43965444, 0.62087715, -0.23067635, 0....|fiscal-year|
|      of|[-0.14763746, 0.1524232, 0.24129547, -0.24562353, 0.8748058, 0.50721204, 0.10...|fiscal-year|
|     the|[-0.047555808, -0.07707906, 0.3023237, 0.13319425, -0.015342899, 0.9450758, -...|fiscal-year|
| Company|[-0.04946076, 0.26861066, 0.06913849, -0.39064062, 0.7189984, 0.7938846, 0.05...|fiscal-year|
|       (|[0.10544683, -0.21629706, 0.114817925, 0.07629284, 0.8
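
The `arrays_zip` + `explode` combination above pairs each token with its vector and then emits one row per pair. A plain-Python sketch of the same idea follows; the `row` dictionary is hypothetical, and its vectors are shortened to two values for readability.

```python
# Hypothetical single row: parallel arrays of tokens and their vectors.
row = {
    "label": "fiscal-year",
    "tokens": ["The", "fiscal", "year"],
    "embeddings": [[0.04, -0.03], [-0.19, 0.14], [0.09, 0.21]],
}

# zip pairs the i-th token with the i-th vector (the arrays_zip step);
# the comprehension emits one output row per pair (the explode step).
exploded = [
    {"token": token, "embeddings": vector, "label": row["label"]}
    for token, vector in zip(row["tokens"], row["embeddings"])
]

for r in exploded:
    print(r["token"], r["embeddings"], r["label"])
```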

In [None]:
# Subclass of the EmbeddingsUDF class defined above
class WordEmbeddingsUDF(EmbeddingsUDF):
    def _transform(self, dataset):

        # No explode here: the embeddings column was already exploded into one row per token
        results = dataset.select('token', 'label', 'embeddings')

        results = results.withColumn(
            "features",
            self.udfs['convertToVectorUDF'](F.col("embeddings"))
        )
        results = results.withColumn(
            "emb_sum",
            self.udfs['sumUDF'](F.col("embeddings"))
        )
        # Drop rows whose embedding values sum to zero; this is meant to remove
        # all-zero vectors, for which cosine distance is undefined
        results = results.where(F.col("emb_sum")!=0.0)

        return results

In [None]:
embeddings_for_pca = WordEmbeddingsUDF()

In [None]:
DIMENSIONS  = 3

In [None]:
import pyspark
pca = pyspark.ml.feature.PCA(k=DIMENSIONS, inputCol="features", outputCol="pca_features")

### Full Spark NLP + Spark MLLib pipeline

In [None]:
# We run the second part of the pipeline

pipeline = Pipeline().setStages([embeddings_for_pca, pca])


In [None]:
model = pipeline.fit(result_df)

In [None]:
result = model.transform(result_df)

In [None]:
result.select("token", "embeddings", "pca_features", "label").show(truncate = 60)

+--------+------------------------------------------------------------+------------------------------------------------------------+-----------+
|   token|                                                  embeddings|                                                pca_features|      label|
+--------+------------------------------------------------------------+------------------------------------------------------------+-----------+
|     The|[0.04391094, -0.028177992, 0.11459787, -0.022955947, 0.74...|   [4.228120509444965,7.827736003364483,-1.3588865878154572]|fiscal-year|
|  fiscal|[-0.19364583, 0.14353976, 0.22895497, -0.48883635, -0.415...| [5.170710489800116,0.25310763654270096,-1.4800026405586908]|fiscal-year|
|    year|[0.08520961, 0.21057254, 0.22785076, -0.43965444, 0.62087...|   [9.016012554607927,1.6321133397405232,0.6355111816631596]|fiscal-year|
|      of|[-0.14763746, 0.1524232, 0.24129547, -0.24562353, 0.87480...|   [6.321855231350654,3.628017709379505,-1.1226350294441834

In [None]:
df = result.select('token', 'pca_features',  'label').toPandas()

df

Unnamed: 0,token,pca_features,label
0,The,"[4.228120509444965, 7.827736003364483, -1.3588...",fiscal-year
1,fiscal,"[5.170710489800116, 0.25310763654270096, -1.48...",fiscal-year
2,year,"[9.016012554607927, 1.6321133397405232, 0.6355...",fiscal-year
3,of,"[6.321855231350654, 3.628017709379505, -1.1226...",fiscal-year
4,the,"[7.346548330521915, 8.460093687284157, -2.2529...",fiscal-year
...,...,...,...
672,constitute,"[6.169835369763037, -1.2662338826600077, -0.77...",waiver
673,a,"[8.734050499114321, 1.4806559743451906, 0.2484...",waiver
674,continuing,"[4.1603599395059865, -0.5176250624644265, -1.2...",waiver
675,waiver,"[7.907894854126703, -0.919223431094611, -3.686...",waiver


In [None]:
df["x"] = df["pca_features"].apply(lambda x: x[0])

df["y"] = df["pca_features"].apply(lambda x: x[1])

df["z"] = df["pca_features"].apply(lambda x: x[2])

df = df[["token", "x", "y", "z", "label"]]

df

Unnamed: 0,token,x,y,z,label
0,The,4.228121,7.827736,-1.358887,fiscal-year
1,fiscal,5.170710,0.253108,-1.480003,fiscal-year
2,year,9.016013,1.632113,0.635511,fiscal-year
3,of,6.321855,3.628018,-1.122635,fiscal-year
4,the,7.346548,8.460094,-2.252963,fiscal-year
...,...,...,...,...,...
672,constitute,6.169835,-1.266234,-0.777761,waiver
673,a,8.734050,1.480656,0.248475,waiver
674,continuing,4.160360,-0.517625,-1.257246,waiver
675,waiver,7.907895,-0.919223,-3.686117,waiver


In [None]:
import plotly.express as px

fig = px.scatter_3d(df, x = 'x', y = 'y', z = 'z', color = "label", width=1000, height = 800, hover_data = ["token", "label"])

fig.show()

This chart is useful because it shows how the same token gets different embeddings depending on its context.
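
One way to put a number on how close two points are is cosine similarity. The sketch below is a plain-Python helper (`cosine_similarity` is a name chosen for this example, not part of Spark NLP). The two sample points are the 3-D PCA coordinates of `The` and `the` from the table above, used here only as sample input; the same function could be applied to two occurrences of an identical token, or to the full 768-dimensional vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Undefined for zero vectors, which is why they are filtered out earlier.
        raise ValueError("cosine similarity is undefined for a zero vector")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)

# PCA coordinates copied from rows 0 ("The") and 4 ("the") of the table above.
the_upper = [4.228121, 7.827736, -1.358887]
the_lower = [7.346548, 8.460094, -2.252963]

similarity = cosine_similarity(the_upper, the_lower)
print(round(similarity, 4))
```

Values close to 1 mean the two points lie in nearly the same direction from the origin; lower values indicate that the model placed the tokens in more distinct contexts.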