# Legal Word and Sentence Embeddings

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/2.Embeddings.ipynb)

# Legal Word and Sentence Embeddings visualization using PCA (Principal Component Analysis)

Modern NLP models work with numerical representations of texts and their meaning. Token classification problems (inferring a class for each token, for example Named Entity Recognition) require Word Embeddings. For sentence, paragraph, or document classification, we use Sentence Embeddings.

In this notebook, we use Spark NLP Legal Word (**roberta_embeddings_legal_roberta_base**) and Sentence (**sent_bert_base_uncased_legal**) Embeddings to get numerical representations of the semantics of the texts. Each embedding is a 768-dimensional vector, which is impossible to interpret by eye.

There are many techniques for visualizing embeddings. Here we use one of them: Principal Component Analysis (PCA), a dimensionality reduction technique, carried out by Spark MLlib. Both embeddings have 768 dimensions, so we will reduce them from **768** to **3** (X, Y, Z) and use color to show the label of each word or sentence.
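To make the reduction from 768 to 3 dimensions concrete, here is a minimal NumPy sketch of textbook PCA via SVD. It is not the Spark MLlib implementation used later in this notebook, and the function name `pca_reduce` and the random input are assumptions made for illustration only. Details such as centering and component sign may differ between implementations, so the coordinates should not be expected to match the Spark output.

```python
import numpy as np

def pca_reduce(vectors, k=3):
    """Project row vectors onto their top-k principal components (textbook PCA)."""
    X = np.asarray(vectors, dtype=float)
    # Center each dimension so the components describe variance around the mean.
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:k].T
    # Fraction of the total variance captured by each kept component.
    explained = (singular_values ** 2) / np.sum(singular_values ** 2)
    return projected, explained[:k]

# Random stand-in for 10 embeddings of 768 dimensions (not real model output).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(10, 768))

points, explained = pca_reduce(fake_embeddings, k=3)
print(points.shape)  # (10, 3)
```

The `explained` array gives a rough idea of how much information is kept after the projection. With real embeddings, a low total would suggest that the 3D chart hides a lot of the original structure.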

## Import Required Library

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install johnsnowlabs 

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [3]:
from johnsnowlabs import * 
# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
# Make sure to restart your notebook afterwards for changes to take effect
jsl.install()

👌 Detected license file /content/4.1.0.spark_nlp_for_healthcare.json
🚨 Outdated Medical Secrets in license file. Version=4.1.0 but should be Version=0.1.14
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up if John Snow Labs home exists in /root/.johnsnowlabs this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.1.0-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library internal_with_finleg-0.1.14-py3-none-any.whl
Downloading 🐍+🕶 Python Library spark_ocr-4.1.0-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.1.0.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-assembly-4.1.0.jar
Downloading 🫘+🕶 Java Library spark-ocr-assembly-4.1.0.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/4.1.0.spark_nlp_for_healthcare.json
Installing /root/.johnsnowlabs/py_installs/internal_with_finleg-0.1.14-py3-none-any.whl to /usr/bin/python3
Running: /usr/bin/python3 -

## Start Spark Session

In [2]:
from johnsnowlabs import * 
# Automatically load license data and start a session with all jars user has access to
spark = jsl.start()

🚨 Your Spark-OCR is outdated, installed==4.0.0a1 but latest version==4.1.0
You can run jsl.install() to update Spark-OCR
👌 Detected license file /content/4.1.0.spark_nlp_for_healthcare.json
🚨 Outdated Medical Secrets in license file. Version=4.1.0 but should be Version=0.1.14
👌 Launched cpu-Optimized JVM with SparkSession with Jars for: 🚀Spark-NLP==4.1.0, 💊Spark-Healthcare==4.0.0a1, 🕶Spark-OCR==4.1.0, running on ⚡ PySpark==3.1.2


In [5]:
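import pandas as pd
import warnings
from pyspark.sql import SparkSession

warnings.filterwarnings('ignore')

# Optional: start the session with custom parameters instead of jsl.start() above.
# PUBLIC_VERSION and JSL_VERSION are not defined in this notebook; set them before calling start().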
def start(SECRET):
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+PUBLIC_VERSION) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+SECRET+"/spark-nlp-jsl-"+JSL_VERSION+".jar")
      
    return builder.getOrCreate()

#spark = start(SECRET)


# Get sample text

In [6]:
! pip install plotly

# Downloading sample datasets.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Legal/data/legal_pca_samples.csv



Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [7]:
df = pd.read_csv('legal_pca_samples.csv')

df.head()

Unnamed: 0,text,label
0,The fiscal year of the Company (herein called ...,fiscal-year
1,Each of the Borrower and each other member of ...,fiscal-year
2,Purchaser shall pay as the total Purchase Pric...,purchase-price
3,The purchase price to be paid by Purchaser to ...,purchase-price
4,The Guarantor hereby unconditionally and irrev...,guarantee


In [8]:
# Create spark dataframe
sdf = spark.createDataFrame(df)
sdf.show()

+--------------------+--------------+
|                text|         label|
+--------------------+--------------+
|The fiscal year o...|   fiscal-year|
|Each of the Borro...|   fiscal-year|
|Purchaser shall p...|purchase-price|
|The purchase pric...|purchase-price|
|The Guarantor her...|     guarantee|
|The Holding Compa...|     guarantee|
|GFS will bear its...|      expenses|
|Each party shall ...|      expenses|
|Failure by either...|        waiver|
|Failure of any pa...|        waiver|
+--------------------+--------------+



# Sentence Embeddings

In [9]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_uncased_legal", "en") \
    .setInputCols("document") \
    .setOutputCol("document_embeddings")

sent_bert_base_uncased_legal download started this may take some time.
Approximate size to download 390.8 MB
[OK!]


# Custom transformer to retrieve the numerical embeddings from Spark NLP and pass them to Spark MLlib

In [10]:
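# This class extracts the embeddings from the Spark NLP Annotation object.
# These names may already be provided by `from johnsnowlabs import *`;
# importing them explicitly keeps the cell self-contained.
from pyspark import keyword_only, ml as ML
from pyspark.ml import Pipeline, Transformer
from pyspark.sql import functions as F, types as T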

class EmbeddingsUDF(
    Transformer, ML.param.shared.HasInputCol,  ML.param.shared.HasOutputCol,
    ML.util.DefaultParamsReadable, ML.util.DefaultParamsWritable
):
    @keyword_only
    def __init__(self):
        super(EmbeddingsUDF, self).__init__()

        def _sum(r):
            result = 0.0
            for e in r:
                result += e
            return result

        self.udfs = {
            'convertToVectorUDF': F.udf(lambda vs: ML.linalg.Vectors.dense(vs), ML.linalg.VectorUDT()),
            'sumUDF': F.udf(lambda r: _sum(r), T.FloatType())
        }

    def _transform(self, dataset):

        results = dataset.select(
            "*", F.explode("document_embeddings.embeddings").alias("embeddings")
        )
        results = results.withColumn(
            "features",
            self.udfs['convertToVectorUDF'](F.col("embeddings"))
        )
        results = results.withColumn(
            "emb_sum",
            self.udfs['sumUDF'](F.col("embeddings"))
        )
        # Drop rows whose embedding components sum to zero; this is meant to
        # remove all-zero vectors (so we can calculate cosine distance)
        results = results.where(F.col("emb_sum")!=0.0)

        return results
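The transformer above does three things for each row: it explodes the array of embeddings into one row per vector, converts each vector into an MLlib dense vector, and drops rows whose components sum to zero. The following plain-Python sketch mimics the explode and filter steps so the logic can be followed without Spark. The function name `explode_and_filter` and the toy rows are hypothetical and are not part of the notebook's pipeline.

```python
def explode_and_filter(rows):
    """Mimic the transformer's explode and zero-sum filter on plain Python data.

    `rows` is a hypothetical list of dicts, each with a "label" and an
    "embeddings" list holding one vector per annotation.
    """
    out = []
    for row in rows:
        for vector in row["embeddings"]:   # explode: one output row per vector
            emb_sum = sum(vector)           # plays the role of sumUDF
            if emb_sum != 0.0:              # same condition as emb_sum != 0.0
                out.append({
                    "label": row["label"],
                    "features": list(vector),
                    "emb_sum": emb_sum,
                })
    return out

# Toy values, not real model output.
rows = [
    {"label": "waiver", "embeddings": [[0.5, -0.25, 0.125]]},
    {"label": "expenses", "embeddings": [[0.0, 0.0, 0.0]]},    # all zeros: dropped
    {"label": "guarantee", "embeddings": [[1.0, -1.0, 0.0]]},  # non-zero, but sums to 0: also dropped
]

kept = explode_and_filter(rows)
print([r["label"] for r in kept])  # ['waiver']
```

The third toy row shows a limitation of the sum-based check: a vector can be non-zero and still sum to exactly zero. With real 768-dimensional float embeddings this is unlikely, but a check on the vector norm would be a stricter alternative.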

In [11]:
embeddings_for_pca = EmbeddingsUDF()

In [12]:
DIMENSIONS  = 3

In [13]:
# import pyspark
pca = ML.feature.PCA(k=DIMENSIONS, inputCol="features", outputCol="pca_features")

## Full Spark NLP + Spark MLlib pipeline

In [14]:
# We run the whole process in a single pipeline

pipeline = Pipeline().setStages([document_assembler, embeddings, embeddings_for_pca, pca])

In [15]:
pipeline.getStages()

[DocumentAssembler_6362b85f6157,
 BERT_SENTENCE_EMBEDDINGS_dae49a767331,
 EmbeddingsUDF_5343b539aa93,
 PCA_4b206c5ac337]

In [16]:
model = pipeline.fit(sdf)

In [17]:
result = model.transform(sdf)

In [18]:
result.select('pca_features', 'label').show(truncate=False)

+------------------------------------------------------------+--------------+
|pca_features                                                |label         |
+------------------------------------------------------------+--------------+
|[-11.772445094865585,-3.189949908927504,4.491164875939447]  |fiscal-year   |
|[-11.401111489956085,-3.7697420690450114,3.2405557866380605]|fiscal-year   |
|[-4.783314072273469,-0.4942509144215836,2.8697848017573997] |purchase-price|
|[-5.455985763815361,-1.341251980692156,3.331733577142756]   |purchase-price|
|[-8.841659351711087,-1.8203592163383913,0.13392225588710466]|guarantee     |
|[-11.53289158260848,-2.499578166241808,0.8141681752789257]  |guarantee     |
|[-5.731705300065157,-3.8158641943470264,3.513598491386224]  |expenses      |
|[-3.801049176283123,-4.345370453824397,1.6774366963116676]  |expenses      |
|[-6.783523982011747,-5.815460999997304,3.2361621070405153]  |waiver        |
|[-7.139197885591457,-6.34044271366872,1.2347534304223584]   |wa

In [19]:
df = result.select('pca_features', 'label').toPandas()

df
# As you can see, the dimension values are inside a list

Unnamed: 0,pca_features,label
0,"[-11.772445094865585, -3.189949908927504, 4.49...",fiscal-year
1,"[-11.401111489956085, -3.7697420690450114, 3.2...",fiscal-year
2,"[-4.783314072273469, -0.4942509144215836, 2.86...",purchase-price
3,"[-5.455985763815361, -1.341251980692156, 3.331...",purchase-price
4,"[-8.841659351711087, -1.8203592163383913, 0.13...",guarantee
5,"[-11.53289158260848, -2.499578166241808, 0.814...",guarantee
6,"[-5.731705300065157, -3.8158641943470264, 3.51...",expenses
7,"[-3.801049176283123, -4.345370453824397, 1.677...",expenses
8,"[-6.783523982011747, -5.815460999997304, 3.236...",waiver
9,"[-7.139197885591457, -6.34044271366872, 1.2347...",waiver


In [20]:
# We extract the dimension values out of the list

df["x"] = df["pca_features"].apply(lambda x: x[0])

df["y"] = df["pca_features"].apply(lambda x: x[1])

df["z"] = df["pca_features"].apply(lambda x: x[2])

df = df[["x", "y", "z", "label"]]

df

Unnamed: 0,x,y,z,label
0,-11.772445,-3.18995,4.491165,fiscal-year
1,-11.401111,-3.769742,3.240556,fiscal-year
2,-4.783314,-0.494251,2.869785,purchase-price
3,-5.455986,-1.341252,3.331734,purchase-price
4,-8.841659,-1.820359,0.133922,guarantee
5,-11.532892,-2.499578,0.814168,guarantee
6,-5.731705,-3.815864,3.513598,expenses
7,-3.801049,-4.34537,1.677437,expenses
8,-6.783524,-5.815461,3.236162,waiver
9,-7.139198,-6.340443,1.234753,waiver


In [21]:
import plotly.express as px

fig = px.scatter_3d(df, x='x', y='y', z='z', color='label', width=800, height=600)

fig.show()

# Word Embeddings

We can also visualize the semantics of words, instead of full texts, by using Word Embeddings. We will add a Tokenizer and a word embeddings model (RoBertaEmbeddings) to get those embeddings, and then apply PCA as before.

In [22]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols("document")\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["document", "token"])\
    .setOutputCol("document_embeddings")

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]


In [23]:
# First, we split the pipeline in two so that we can collect all token embeddings

pipeline = Pipeline().setStages([document_assembler, tokenizer, embeddings])

In [24]:
model = pipeline.fit(sdf)

In [25]:
result = model.transform(sdf)

In [26]:
result_df = result.select("label", F.explode(F.arrays_zip("token.result", "document_embeddings.embeddings")).alias("cols"))\
                   .select(F.expr("cols['0']").alias("token"),
                           F.expr("cols['1']").alias("embeddings"),
                           "label")

result_df.show(truncate = 80)


+--------+--------------------------------------------------------------------------------+-----------+
|   token|                                                                      embeddings|      label|
+--------+--------------------------------------------------------------------------------+-----------+
|     The|[-0.1905832, 0.02907233, 0.1323556, 0.19562279, 0.7778327, 0.2899082, -0.1439...|fiscal-year|
|  fiscal|[-0.19621783, 0.14509645, 0.23111376, -0.5060165, -0.38397065, -0.16950995, 0...|fiscal-year|
|    year|[0.080063194, 0.22008342, 0.2320247, -0.4419714, 0.58936, -0.23692241, 0.1419...|fiscal-year|
|      of|[-0.14090316, 0.15613584, 0.24000195, -0.24493258, 0.8977557, 0.4878105, 0.09...|fiscal-year|
|     the|[-0.060954522, -0.0823207, 0.3149926, 0.12840015, -0.014585942, 0.97888047, -...|fiscal-year|
| Company|[-0.06074677, 0.27488348, 0.07146067, -0.39569902, 0.7331536, 0.8051565, 0.05...|fiscal-year|
|       (|[0.11098474, -0.23188351, 0.11235487, 0.07458271, 0.86
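The `arrays_zip` plus `explode` step above pairs each token with its embedding and emits one row per pair. A plain-Python analogue, using the hypothetical helper `zip_tokens_with_embeddings` and toy two-dimensional vectors instead of real 768-dimensional model output, might look like this:

```python
def zip_tokens_with_embeddings(tokens, embeddings, label):
    """Pair each token with its vector and return one row (dict) per token."""
    return [
        {"token": tok, "embeddings": emb, "label": label}
        for tok, emb in zip(tokens, embeddings)
    ]

# Toy values, not real model output.
tokens = ["The", "fiscal", "year"]
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

rows = zip_tokens_with_embeddings(tokens, embeddings, "fiscal-year")
print(len(rows), rows[1]["token"])  # 3 fiscal
```

Note that Python's `zip` stops at the shorter input, so this sketch assumes the two lists have the same length, as they do when every token has exactly one embedding.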

In [27]:
# Here we define a subclass of the EmbeddingsUDF class defined previously
class WordEmbeddingsUDF(EmbeddingsUDF):    
    def _transform(self, dataset):
        
        results = dataset.select('token', 'label', 'embeddings') # We changed this line because our embeddings column is already exploded

        results = results.withColumn(
            "features",
            self.udfs['convertToVectorUDF'](F.col("embeddings"))
        )
        results = results.withColumn(
            "emb_sum",
            self.udfs['sumUDF'](F.col("embeddings"))
        )
        # Drop rows whose embedding components sum to zero; this is meant to
        # remove all-zero vectors (so we can calculate cosine distance)
        results = results.where(F.col("emb_sum")!=0.0)

        return results

In [28]:
embeddings_for_pca = WordEmbeddingsUDF()

In [29]:
DIMENSIONS  = 3

In [30]:
# import pyspark
pca = ML.feature.PCA(k=DIMENSIONS, inputCol="features", outputCol="pca_features")

## Full Spark NLP + Spark MLlib pipeline

In [31]:
# We run the second part of the pipeline

pipeline = Pipeline().setStages([embeddings_for_pca, pca])


In [32]:
model = pipeline.fit(result_df)

In [33]:
result = model.transform(result_df)

In [34]:
result.select("token", "embeddings", "pca_features", "label").show(truncate = 60)

+--------+------------------------------------------------------------+------------------------------------------------------------+-----------+
|   token|                                                  embeddings|                                                pca_features|      label|
+--------+------------------------------------------------------------+------------------------------------------------------------+-----------+
|     The|[-0.1905832, 0.02907233, 0.1323556, 0.19562279, 0.7778327...|   [6.574771574766921,7.652864938222547,-0.9220384239015803]|fiscal-year|
|  fiscal|[-0.19621783, 0.14509645, 0.23111376, -0.5060165, -0.3839...| [5.227396638653371,-0.06940065957977028,-1.397650442752261]|fiscal-year|
|    year|[0.080063194, 0.22008342, 0.2320247, -0.4419714, 0.58936,...|    [9.185597536500767,1.0014793252960874,1.099837490970727]|fiscal-year|
|      of|[-0.14090316, 0.15613584, 0.24000195, -0.24493258, 0.8977...|  [6.397928467286442,3.1917339278452252,-1.0405452621515632

In [35]:
df = result.select('token', 'pca_features',  'label').toPandas()

df

Unnamed: 0,token,pca_features,label
0,The,"[6.574771574766921, 7.652864938222547, -0.9220...",fiscal-year
1,fiscal,"[5.227396638653371, -0.06940065957977028, -1.3...",fiscal-year
2,year,"[9.185597536500767, 1.0014793252960874, 1.0998...",fiscal-year
3,of,"[6.397928467286442, 3.1917339278452252, -1.040...",fiscal-year
4,the,"[7.772298933784342, 8.005474265704821, -2.2717...",fiscal-year
...,...,...,...
672,constitute,"[6.129944862870592, -1.5722384559240419, -0.58...",waiver
673,a,"[8.764674396927632, 1.0030866890588166, 0.4532...",waiver
674,continuing,"[4.15035925368287, -0.7722874001107889, -1.152...",waiver
675,waiver,"[7.824427829455883, -1.4749751170255523, -3.41...",waiver


In [36]:
df["x"] = df["pca_features"].apply(lambda x: x[0])

df["y"] = df["pca_features"].apply(lambda x: x[1])

df["z"] = df["pca_features"].apply(lambda x: x[2])

df = df[["token", "x", "y", "z", "label"]]

df

Unnamed: 0,token,x,y,z,label
0,The,6.574772,7.652865,-0.922038,fiscal-year
1,fiscal,5.227397,-0.069401,-1.397650,fiscal-year
2,year,9.185598,1.001479,1.099837,fiscal-year
3,of,6.397928,3.191734,-1.040545,fiscal-year
4,the,7.772299,8.005474,-2.271738,fiscal-year
...,...,...,...,...,...
672,constitute,6.129945,-1.572238,-0.582969,waiver
673,a,8.764674,1.003087,0.453226,waiver
674,continuing,4.150359,-0.772287,-1.152193,waiver
675,waiver,7.824428,-1.474975,-3.411842,waiver


In [37]:
import plotly.express as px

fig = px.scatter_3d(df, x = 'x', y = 'y', z = 'z', color = "label", width=1000, height = 800, hover_data = ["token", "label"])

fig.show()

This chart is especially useful because it shows how the same token gets different embeddings depending on its context.
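One way to put a number on that context effect is to compare the vectors of two occurrences of the same token with cosine similarity. The sketch below uses made-up three-dimensional points rather than values taken from the notebook output, and the helper `cosine_similarity` is written here for illustration. In practice you would likely compare the original 768-dimensional embeddings, since PCA discards information.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up points standing in for the same token seen in two different contexts.
token_in_context_1 = [3.0, 4.0, 0.0]
token_in_context_2 = [4.0, 3.0, 0.0]

similarity = cosine_similarity(token_in_context_1, token_in_context_2)
print(round(similarity, 2))  # 0.96
```

A value close to 1.0 means the two occurrences point in nearly the same direction. Lower values indicate that the surrounding context shifted the representation more strongly.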