![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Legal Word and Sentence Embeddings

# Legal Word and Sentence Embeddings visualization using PCA (Principal Component Analysis)

Modern NLP models work with a numerical representation of texts and their meaning. Token classification problems (inferring a class for each token, for example Named Entity Recognition) require Word Embeddings. For sentence, paragraph, or document classification, we use Sentence Embeddings.

In this notebook, we use the Spark NLP Legal Word (**roberta_embeddings_legal_roberta_base**) and Sentence (**sent_bert_base_uncased_legal**) Embeddings to get those numerical representations of the semantics of the texts. Each embedding is a 768-dimensional vector, which is impossible to interpret by eye.

There are many techniques for visualizing those embeddings. Here we use one of them: Principal Component Analysis (PCA), a dimensionality reduction technique, carried out by Spark MLLib. Both embeddings have 768 dimensions, so we will reduce them from **768** to **3** (X, Y, Z) and use color to show the label of each word / sentence.
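
To make the reduction step concrete, here is a minimal NumPy sketch of PCA on random stand-in data. It is not the Spark MLLib implementation used in this notebook: the helper name `pca_reduce` and the random vectors are assumptions for illustration, and Spark's output may differ in details such as centering or the sign of each component.

```python
import numpy as np

def pca_reduce(X, k=3):
    """Project the rows of X onto their first k principal components."""
    X = np.asarray(X, dtype=float)
    # Center each dimension so the components describe variance around the mean.
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:k].T
    # Share of the total variance kept by each of the k components.
    explained = (s[:k] ** 2) / np.sum(s ** 2)
    return projected, explained

# Stand-in for 10 embeddings of 768 dimensions (random, not real embeddings).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(10, 768))

points, explained = pca_reduce(vectors, k=3)
print(points.shape)      # one (x, y, z) point per input vector
print(explained.sum())   # fraction of variance retained by the 3 components
```

The retained-variance figure is worth checking on real data: when three components keep only a small share of the variance, distances in the 3D plot are a rough guide at best.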

## Import Required Libraries

In [0]:
from johnsnowlabs import * 
import pandas as pd

# Get sample text

In [0]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings_JSL/Legal/data/legal_pca_samples.csv
df = pd.read_csv('legal_pca_samples.csv')

df.head()

Unnamed: 0,text,label
0,The fiscal year of the Company (herein called ...,fiscal-year
1,Each of the Borrower and each other member of ...,fiscal-year
2,Purchaser shall pay as the total Purchase Pric...,purchase-price
3,The purchase price to be paid by Purchaser to ...,purchase-price
4,The Guarantor hereby unconditionally and irrev...,guarantee


In [0]:
# Create a Spark DataFrame (assumes an active SparkSession named `spark`, as in Databricks notebooks)
sdf = spark.createDataFrame(df)
sdf.show()

# Sentence Embeddings

In [0]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_uncased_legal", "en") \
    .setInputCols("document") \
    .setOutputCol("document_embeddings")

# Custom transformer to retrieve the numerical embeddings from Spark NLP and pass them to Spark MLLib

In [0]:
# This class extracts the embeddings from the Spark NLP Annotation object

from pyspark.ml import Transformer
from pyspark import ml as ML
from pyspark import keyword_only 
from pyspark.sql import functions as F
import pyspark.sql.types as T

class EmbeddingsUDF(
    Transformer, ML.param.shared.HasInputCol,  ML.param.shared.HasOutputCol,
    ML.util.DefaultParamsReadable, ML.util.DefaultParamsWritable
):
    @keyword_only
    def __init__(self):
        super(EmbeddingsUDF, self).__init__()

        def _sum(r):
            result = 0.0
            for e in r:
                result += e
            return result

        self.udfs = {
            'convertToVectorUDF': F.udf(lambda vs: ML.linalg.Vectors.dense(vs), ML.linalg.VectorUDT()),
            'sumUDF': F.udf(lambda r: _sum(r), T.FloatType())
        }

    def _transform(self, dataset):

        results = dataset.select(
            "*", F.explode("document_embeddings.embeddings").alias("embeddings")
        )
        results = results.withColumn(
            "features",
            self.udfs['convertToVectorUDF'](F.col("embeddings"))
        )
        results = results.withColumn(
            "emb_sum",
            self.udfs['sumUDF'](F.col("embeddings"))
        )
        # Remove rows whose embeddings sum to zero (a proxy for all-zero vectors),
        # so that cosine distance can be calculated
        results = results.where(F.col("emb_sum")!=0.0)

        return results
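
The filter at the end of `_transform` exists because cosine similarity divides by the length of each vector, so it is undefined for an all-zero vector. The plain-Python sketch below illustrates this. The function name `cosine_similarity` and the sample vectors are assumptions for illustration, not part of Spark NLP or Spark MLLib.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine similarity of a and b, or None if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Division by zero: the similarity is undefined for all-zero vectors.
        return None
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_similarity([0.0, 0.0], [1.0, 2.0]))  # undefined, returns None

# The sum-based check is only a proxy: this vector sums to zero but is not all zeroes.
cancelling = [1.0, -1.0]
print(sum(cancelling), cosine_similarity(cancelling, [1.0, 0.0]))
```

With real 768-dimensional float embeddings an exact zero sum from cancellation is unlikely, which is presumably why the simpler sum check is good enough here. Checking the norm instead would be the stricter test.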

In [0]:
embeddings_for_pca = EmbeddingsUDF()

In [0]:
DIMENSIONS  = 3

In [0]:
pca = ML.feature.PCA(k=DIMENSIONS, inputCol="features", outputCol="pca_features")

### Full Spark NLP + Spark MLLib pipeline

In [0]:
# The whole process runs in a single pipeline

pipeline = nlp.Pipeline().setStages([document_assembler, embeddings, embeddings_for_pca, pca])

In [0]:
pipeline.getStages()

In [0]:
model = pipeline.fit(sdf)

In [0]:
result = model.transform(sdf)

In [0]:
result.select('pca_features', 'label').show(truncate=False)

In [0]:
df = result.select('pca_features', 'label').toPandas()

df
# As you can see, the dimension values are inside a list

Unnamed: 0,pca_features,label
0,"[-11.772477034904608, -3.189962369128527, 4.49...",fiscal-year
1,"[-11.401123661713259, -3.7697457285371905, 3.2...",fiscal-year
2,"[-4.783320443150728, -0.4942616667908408, 2.86...",purchase-price
3,"[-5.455994051962746, -1.341260074480057, 3.331...",purchase-price
4,"[-8.841677620092424, -1.8203607493804024, 0.13...",guarantee
5,"[-11.532878941614142, -2.4995702800175503, 0.8...",guarantee
6,"[-5.7317047066223745, -3.8158763820398773, 3.5...",expenses
7,"[-3.8010565515445687, -4.3453814052287045, 1.6...",expenses
8,"[-6.783557847085066, -5.815457902520404, 3.236...",waiver
9,"[-7.139204069336535, -6.340446439270053, 1.234...",waiver


In [0]:
# We extract the dimension values out of the list

df["x"] = df["pca_features"].apply(lambda x: x[0])

df["y"] = df["pca_features"].apply(lambda x: x[1])

df["z"] = df["pca_features"].apply(lambda x: x[2])

df = df[["x", "y", "z", "label"]]

df

Unnamed: 0,x,y,z,label
0,-11.772477,-3.189962,4.491138,fiscal-year
1,-11.401124,-3.769746,3.240525,fiscal-year
2,-4.78332,-0.494262,2.869768,purchase-price
3,-5.455994,-1.34126,3.331716,purchase-price
4,-8.841678,-1.820361,0.133903,guarantee
5,-11.532879,-2.49957,0.814128,guarantee
6,-5.731705,-3.815876,3.513563,expenses
7,-3.801057,-4.345381,1.677418,expenses
8,-6.783558,-5.815458,3.236153,waiver
9,-7.139204,-6.340446,1.234714,waiver


In [0]:
import plotly.express as px

fig = px.scatter_3d(df, x='x', y='y', z='z', color='label', width=800, height=600)

fig.show()

# Word Embeddings

We can also visualize the semantics of words, instead of full texts, by using Word Embeddings. We will add a Tokenizer and a Word Embeddings model to get those embeddings, and then apply PCA as before.

In [0]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols("document")\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["document", "token"])\
    .setOutputCol("document_embeddings")

In [0]:
# First, we split the pipeline in two to get all token embeddings

pipeline = nlp.Pipeline().setStages([document_assembler, tokenizer, embeddings])

In [0]:
model = pipeline.fit(sdf)

In [0]:
result = model.transform(sdf)

In [0]:
from pyspark.sql import functions as F

result_df = result.select("label", F.explode(F.arrays_zip(result.token.result, result.document_embeddings.embeddings)).alias("cols"))\
                   .select(F.expr("cols['0']").alias("token"),
                           F.expr("cols['1']").alias("embeddings"),
                           "label")

result_df.show(truncate = 80)
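
The `explode(arrays_zip(...))` step above pairs each token with its embedding and emits one row per pair, keeping the label of the original text. A plain-Python analogue is sketched below. The function name `explode_tokens` and the two-dimensional vectors are made up for illustration; the real embeddings have 768 dimensions.

```python
def explode_tokens(row):
    """Turn one document row into one row per (token, embedding) pair."""
    return [
        {"token": token, "embeddings": vector, "label": row["label"]}
        for token, vector in zip(row["tokens"], row["embeddings"])
    ]

# One document with parallel lists of tokens and (made-up) token vectors.
document_row = {
    "label": "fiscal-year",
    "tokens": ["The", "fiscal", "year"],
    "embeddings": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
}

token_rows = explode_tokens(document_row)
for token_row in token_rows:
    print(token_row)
```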


In [0]:
# Here we define a subclass of the EmbeddingsUDF class defined above
class WordEmbeddingsUDF(EmbeddingsUDF):
    def _transform(self, dataset):

        # No explode here: the embeddings column is already exploded (one row per token)
        results = dataset.select('token', 'label', 'embeddings')

        results = results.withColumn(
            "features",
            self.udfs['convertToVectorUDF'](F.col("embeddings"))
        )
        results = results.withColumn(
            "emb_sum",
            self.udfs['sumUDF'](F.col("embeddings"))
        )
        # Remove rows whose embeddings sum to zero (a proxy for all-zero vectors),
        # so that cosine distance can be calculated
        results = results.where(F.col("emb_sum")!=0.0)

        return results

In [0]:
embeddings_for_pca = WordEmbeddingsUDF()

In [0]:
DIMENSIONS  = 3

In [0]:
pca = ML.feature.PCA(k=DIMENSIONS, inputCol="features", outputCol="pca_features")

## Full Spark NLP + Spark MLLib pipeline

In [0]:
# We run the second part of the pipeline

pipeline = nlp.Pipeline().setStages([embeddings_for_pca, pca])


In [0]:
model = pipeline.fit(result_df)

In [0]:
result = model.transform(result_df)

In [0]:
result.select("token", "embeddings", "pca_features", "label").show(truncate = 60)

In [0]:
df = result.select('token', 'pca_features',  'label').toPandas()

df

Unnamed: 0,token,pca_features,label
0,The,"[6.574773139044961, 7.65286502110281, -0.92203...",fiscal-year
1,fiscal,"[5.227395982797811, -0.06940068997057545, -1.3...",fiscal-year
2,year,"[9.185602029229745, 1.0014820713259496, 1.0998...",fiscal-year
3,of,"[6.397932523970772, 3.1917363167944006, -1.040...",fiscal-year
4,the,"[7.772305113086351, 8.005479063316065, -2.2717...",fiscal-year
...,...,...,...
672,constitute,"[6.129947148631695, -1.5722397283753062, -0.58...",waiver
673,a,"[8.76467597454233, 1.0030852667560528, 0.45322...",waiver
674,continuing,"[4.1503600980893305, -0.7722882042241939, -1.1...",waiver
675,waiver,"[7.824426762455292, -1.4749742620145976, -3.41...",waiver


In [0]:
df["x"] = df["pca_features"].apply(lambda x: x[0])

df["y"] = df["pca_features"].apply(lambda x: x[1])

df["z"] = df["pca_features"].apply(lambda x: x[2])

df = df[["token", "x", "y", "z", "label"]]

df

Unnamed: 0,token,x,y,z,label
0,The,6.574773,7.652865,-0.922037,fiscal-year
1,fiscal,5.227396,-0.069401,-1.397651,fiscal-year
2,year,9.185602,1.001482,1.099838,fiscal-year
3,of,6.397933,3.191736,-1.040545,fiscal-year
4,the,7.772305,8.005479,-2.271736,fiscal-year
...,...,...,...,...,...
672,constitute,6.129947,-1.572240,-0.582971,waiver
673,a,8.764676,1.003085,0.453228,waiver
674,continuing,4.150360,-0.772288,-1.152194,waiver
675,waiver,7.824427,-1.474974,-3.411841,waiver


In [0]:
import plotly.express as px

fig = px.scatter_3d(df, x = 'x', y = 'y', z = 'z', color = "label", width=1000, height = 800, hover_data = ["token", "label"])

fig.show()

This chart is useful because it shows how the same token gets different embeddings depending on its context.
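
The same comparison can be made numerically. The sketch below takes the first five rows of the table above and measures how far apart repeated occurrences of a word land in the 3D PCA space. The helper name `context_spread` is hypothetical, and lower-casing tokens so that "The" and "the" count as the same word is an assumption made for this illustration.

```python
import math
from collections import defaultdict
from itertools import combinations

# (token, x, y, z) rows copied from the first lines of the table above.
rows = [
    ("The", 6.574773, 7.652865, -0.922037),
    ("fiscal", 5.227396, -0.069401, -1.397651),
    ("year", 9.185602, 1.001482, 1.099838),
    ("of", 6.397933, 3.191736, -1.040545),
    ("the", 7.772305, 8.005479, -2.271736),
]

def context_spread(rows):
    """Largest pairwise distance between occurrences of each repeated word."""
    groups = defaultdict(list)
    for token, x, y, z in rows:
        # Assumption: fold case so "The" and "the" are grouped together.
        groups[token.lower()].append((x, y, z))
    return {
        token: max(math.dist(p, q) for p, q in combinations(points, 2))
        for token, points in groups.items()
        if len(points) > 1
    }

spread = context_spread(rows)
print(spread)
```

A spread well above zero for a word means its occurrences were embedded differently. Keep in mind that these distances are measured after the reduction to three dimensions, so they only approximate distances in the original 768-dimensional space.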