![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/03.Word_Sentence_Embeddings.ipynb)

# Legal Word and Sentence Embeddings

# Legal Word and Sentence Embeddings visualization using PCA (Principal Component Analysis)

Modern NLP models work with numerical representations of texts and their meaning. Token classification problems (inferring a class for each token, for example Named Entity Recognition) require Word Embeddings. For sentence, paragraph, or document classification, we use Sentence Embeddings.

In this notebook, we use Spark NLP Legal Word (**roberta_embeddings_legal_roberta_base**) and Sentence (**sent_bert_base_uncased_legal**) Embeddings to get those numerical representations of the semantics of the texts. The result is a 768-dimensional vector for each token or text, far too many dimensions to inspect by eye.

There are many techniques we can use to visualize those embeddings. Here we use one of them: Principal Component Analysis (PCA), a dimensionality reduction technique, carried out by Spark MLlib. Both embeddings have 768 dimensions, so we will reduce them from **768** to **3** (X, Y, Z) and use color to show the label of each word / sentence.
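To make the idea of PCA more concrete, here is a minimal pure-Python sketch that reduces made-up 4-dimensional vectors to 2 dimensions. It is a conceptual illustration only, not the Spark MLlib implementation used later in this notebook: the helper names (`mean_center`, `covariance`, `top_component`, `pca`) and the toy data are assumptions, and the eigenvectors are found with simple power iteration plus deflation.

```python
import math

def mean_center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, iterations=500):
    """Power iteration: approximates the dominant eigenvector of cov."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # nothing left to explain
            break
        v = [x / norm for x in w]
    # Rayleigh quotient: the variance captured along v.
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, variance

def pca(rows, k):
    """Return (components, variances, projected rows) for the top-k components."""
    centered = mean_center(rows)
    cov = covariance(centered)
    d = len(cov)
    components, variances = [], []
    for _ in range(k):
        v, lam = top_component(cov)
        components.append(v)
        variances.append(lam)
        # Deflation: remove the variance already explained by v.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    projected = [[sum(r[j] * c[j] for j in range(d)) for c in components]
                 for r in centered]
    return components, variances, projected

# Made-up data: columns 1 and 2 move together, column 3 varies on its own,
# column 4 is constant and therefore carries no information.
toy = [
    [8.0, 3.0, 1.0, 7.0],
    [9.0, 4.0, -2.0, 7.0],
    [10.0, 5.0, 2.0, 7.0],
    [11.0, 6.0, -2.0, 7.0],
    [12.0, 7.0, 1.0, 7.0],
]
components, variances, reduced = pca(toy, k=2)
for row in reduced:
    print([round(value, 3) for value in row])
```

In this toy case the first component merges the two correlated columns into one axis and the second keeps the independent column, while the constant column is discarded. The same principle lets us compress 768 embedding dimensions into the 3 that carry the most variance.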

# Installation

In [None]:
! pip install -q johnsnowlabs

## Automatic Installation
Using my.johnsnowlabs.com SSO

In [None]:
from johnsnowlabs import nlp, legal

# nlp.install(force_browser=True)

## Manual downloading
If you are not registered at my.johnsnowlabs.com, you received a license via e-mail, or you are using Safari, you may need to install the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

# Starting

In [None]:
spark = nlp.start()

# Get sample text

In [None]:
! pip install plotly

# Downloading sample datasets.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/legal_pca_samples.csv



In [None]:
import pandas as pd

df = pd.read_csv('legal_pca_samples.csv')

df.head()

Unnamed: 0,text,label
0,The fiscal year of the Company (herein called ...,fiscal-year
1,Each of the Borrower and each other member of ...,fiscal-year
2,Purchaser shall pay as the total Purchase Pric...,purchase-price
3,The purchase price to be paid by Purchaser to ...,purchase-price
4,The Guarantor hereby unconditionally and irrev...,guarantee


In [None]:
# Create spark dataframe
sdf = spark.createDataFrame(df)
sdf.show()

+--------------------+--------------+
|                text|         label|
+--------------------+--------------+
|The fiscal year o...|   fiscal-year|
|Each of the Borro...|   fiscal-year|
|Purchaser shall p...|purchase-price|
|The purchase pric...|purchase-price|
|The Guarantor her...|     guarantee|
|The Holding Compa...|     guarantee|
|GFS will bear its...|      expenses|
|Each party shall ...|      expenses|
|Failure by either...|        waiver|
|Failure of any pa...|        waiver|
+--------------------+--------------+



# Sentence Embeddings

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_uncased_legal", "en") \
    .setInputCols("document") \
    .setOutputCol("document_embeddings")

sent_bert_base_uncased_legal download started this may take some time.
Approximate size to download 390.8 MB
[OK!]


# Custom transformer to retrieve the numerical embeddings from Spark NLP and pass them to Spark MLlib

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

In [None]:
# This class extracts the embeddings from the Spark NLP Annotation object
# from pyspark import ml as ML

class EmbeddingsUDF(
    nlp.Transformer, nlp.ML.param.shared.HasInputCol,  nlp.ML.param.shared.HasOutputCol,
    nlp.ML.util.DefaultParamsReadable, nlp.ML.util.DefaultParamsWritable
):
    @keyword_only
    def __init__(self):
        super(EmbeddingsUDF, self).__init__()

        def _sum(r):
            result = 0.0
            for e in r:
                result += e
            return result

        self.udfs = {
            'convertToVectorUDF': F.udf(lambda vs: nlp.ML.linalg.Vectors.dense(vs), nlp.ML.linalg.VectorUDT()),
            'sumUDF': F.udf(lambda r: _sum(r), T.FloatType())
        }

    def _transform(self, dataset):

        results = dataset.select(
            "*", F.explode("document_embeddings.embeddings").alias("embeddings")
        )
        results = results.withColumn(
            "features",
            self.udfs['convertToVectorUDF'](F.col("embeddings"))
        )
        results = results.withColumn(
            "emb_sum",
            self.udfs['sumUDF'](F.col("embeddings"))
        )
        # Drop rows whose embeddings sum to zero (typically all-zero vectors), so cosine distance stays defined
        results = results.where(F.col("emb_sum")!=0.0)

        return results
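The `emb_sum` filter in `_transform` treats a vector as empty when its components sum to zero. The plain-Python sketch below mirrors that check on made-up vectors and compares it with a stricter variant that looks for any nonzero component. The stricter function is an assumption added for illustration, not part of the notebook's transformer; it only shows that a nonzero vector whose components cancel out would also be dropped by the sum-based check.

```python
def keeps_by_sum(vector):
    """Mirror of the `emb_sum != 0.0` filter used in `_transform`."""
    return sum(vector) != 0.0

def keeps_by_any_nonzero(vector):
    """Hypothetical stricter check: keep the row if any component is nonzero."""
    return any(component != 0.0 for component in vector)

# Made-up 4-dimensional vectors, for illustration only.
samples = {
    "all_zero": [0.0, 0.0, 0.0, 0.0],
    "regular": [0.2, -0.1, 0.4, 0.3],
    "cancels_out": [0.5, -0.5, 0.25, -0.25],
}

for name, vector in samples.items():
    print(name, keeps_by_sum(vector), keeps_by_any_nonzero(vector))
```

With real 768-dimensional float embeddings an exact cancellation is very unlikely, so the sum-based filter is a reasonable shortcut in practice.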

In [None]:
embeddings_for_pca = EmbeddingsUDF()

In [None]:
DIMENSIONS  = 3

In [None]:
# import pyspark
pca = nlp.ML.feature.PCA(k=DIMENSIONS, inputCol="features", outputCol="pca_features")

### Full Spark NLP + Spark MLlib pipeline

In [None]:
# The whole process runs in a single pipeline

pipeline = nlp.Pipeline().setStages([document_assembler, embeddings, embeddings_for_pca, pca])

In [None]:
pipeline.getStages()

[DocumentAssembler_5bf685394361,
 BERT_SENTENCE_EMBEDDINGS_dae49a767331,
 EmbeddingsUDF_6cfa4744c08f,
 PCA_5be1eba924fe]

In [None]:
model = pipeline.fit(sdf)

In [None]:
result = model.transform(sdf)

In [None]:
result.select('pca_features', 'label').show(truncate=False)

+-----------------------------------------------------------+--------------+
|pca_features                                               |label         |
+-----------------------------------------------------------+--------------+
|[-11.772444786459502,-3.1899426160082354,4.491155988811607]|fiscal-year   |
|[-11.401114580806887,-3.769737534360105,3.2405607816256445]|fiscal-year   |
|[-4.783313864981387,-0.4942406903061649,2.8697672467860604]|purchase-price|
|[-5.455985032697512,-1.341242764724348,3.3317233401952815] |purchase-price|
|[-8.841676655795247,-1.820364584047892,0.13392410273877264]|guarantee     |
|[-11.532871191686567,-2.499575934119743,0.8141606865382864]|guarantee     |
|[-5.731693818493872,-3.815858392691706,3.513588271042771]  |expenses      |
|[-3.8010454974614007,-4.345361920061271,1.6774524605760994]|expenses      |
|[-6.783526297480714,-5.815450054424824,3.236173498846246]  |waiver        |
|[-7.13918115635957,-6.34043989712841,1.234729257880866]    |waiver        |
+-----------------------------------------------------------+--------------+

In [None]:
df = result.select('pca_features', 'label').toPandas()

df
# As you can see, the three PCA values of each row are packed into a single list-like vector

Unnamed: 0,pca_features,label
0,"[-11.772444786459502, -3.1899426160082354, 4.4...",fiscal-year
1,"[-11.401114580806887, -3.769737534360105, 3.24...",fiscal-year
2,"[-4.783313864981387, -0.4942406903061649, 2.86...",purchase-price
3,"[-5.455985032697512, -1.341242764724348, 3.331...",purchase-price
4,"[-8.841676655795247, -1.820364584047892, 0.133...",guarantee
5,"[-11.532871191686567, -2.499575934119743, 0.81...",guarantee
6,"[-5.731693818493872, -3.815858392691706, 3.513...",expenses
7,"[-3.8010454974614007, -4.345361920061271, 1.67...",expenses
8,"[-6.783526297480714, -5.815450054424824, 3.236...",waiver
9,"[-7.13918115635957, -6.34043989712841, 1.23472...",waiver


In [None]:
# Extract the dimension values out of the list into separate x, y, z columns

df["x"] = df["pca_features"].apply(lambda x: x[0])

df["y"] = df["pca_features"].apply(lambda x: x[1])

df["z"] = df["pca_features"].apply(lambda x: x[2])

df = df[["x", "y", "z", "label"]]

df

Unnamed: 0,x,y,z,label
0,-11.772445,-3.189943,4.491156,fiscal-year
1,-11.401115,-3.769738,3.240561,fiscal-year
2,-4.783314,-0.494241,2.869767,purchase-price
3,-5.455985,-1.341243,3.331723,purchase-price
4,-8.841677,-1.820365,0.133924,guarantee
5,-11.532871,-2.499576,0.814161,guarantee
6,-5.731694,-3.815858,3.513588,expenses
7,-3.801045,-4.345362,1.677452,expenses
8,-6.783526,-5.81545,3.236173,waiver
9,-7.139181,-6.34044,1.234729,waiver


In [None]:
import plotly.express as px

fig = px.scatter_3d(df, x='x', y='y', z='z', color='label', width=800, height=600)

fig.show()

# Word Embeddings

We can also visualize the semantics of words, instead of full texts, by using Word Embeddings. We will add a Tokenizer and a word embeddings model (RoBertaEmbeddings) to get those embeddings, and then apply PCA as before.

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols("document")\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["document", "token"])\
    .setOutputCol("document_embeddings")

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]


In [None]:
# First, we split the pipeline in two so that we can retrieve all token embeddings

pipeline = nlp.Pipeline().setStages([document_assembler, tokenizer, embeddings])

In [None]:
model = pipeline.fit(sdf)

In [None]:
result = model.transform(sdf)

In [None]:
result_df = result.select("label", F.explode(F.arrays_zip("token.result", "document_embeddings.embeddings")).alias("cols"))\
                   .select(F.expr("cols['0']").alias("token"),
                           F.expr("cols['1']").alias("embeddings"),
                           "label")

result_df.show(truncate = 80)


+--------+--------------------------------------------------------------------------------+-----------+
|   token|                                                                      embeddings|      label|
+--------+--------------------------------------------------------------------------------+-----------+
|     The|[-0.19058353, 0.02907179, 0.13235606, 0.19562247, 0.77783114, 0.28990984, -0....|fiscal-year|
|  fiscal|[-0.19621773, 0.14509664, 0.23111394, -0.50601673, -0.38397044, -0.16950981, ...|fiscal-year|
|    year|[0.08006305, 0.22008368, 0.23202448, -0.4419725, 0.58936155, -0.23692255, 0.1...|fiscal-year|
|      of|[-0.14090274, 0.1561361, 0.24000195, -0.2449323, 0.897756, 0.4878102, 0.09172...|fiscal-year|
|     the|[-0.060954493, -0.08232107, 0.31499305, 0.12840052, -0.014585197, 0.97888094,...|fiscal-year|
| Company|[-0.06074696, 0.27488312, 0.07146063, -0.39569926, 0.73315394, 0.80515677, 0....|fiscal-year|
|       (|[0.110985324, -0.23188369, 0.11235473, 0.07458283, 0.8
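The `arrays_zip` + `explode` step above can be easier to follow in plain Python. The sketch below uses a single made-up row with shortened 2-dimensional vectors; the dictionary layout and the `zip_and_explode` helper are assumptions for illustration and are not part of Spark.

```python
# One made-up "row": parallel arrays of tokens and (shortened) embeddings.
row = {
    "label": "fiscal-year",
    "tokens": ["The", "fiscal", "year"],
    "embeddings": [[-0.19, 0.03], [-0.20, 0.15], [0.08, 0.22]],
}

def zip_and_explode(row):
    """Pair the i-th token with the i-th embedding, one output row per pair."""
    return [
        {"token": token, "embeddings": vector, "label": row["label"]}
        for token, vector in zip(row["tokens"], row["embeddings"])
    ]

exploded = zip_and_explode(row)
for item in exploded:
    print(item)
```

One difference worth keeping in mind: Python's `zip` stops at the shorter list, whereas Spark's `arrays_zip` pads the shorter array with nulls. Here both arrays come from the same tokenization, so they are expected to have the same length.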

In [None]:
# Subclass of the EmbeddingsUDF class defined earlier; only _transform is overridden
class WordEmbeddingsUDF(EmbeddingsUDF):
    def _transform(self, dataset):

        # No explode needed here: the embeddings column is already exploded (one row per token)
        results = dataset.select('token', 'label', 'embeddings')

        results = results.withColumn(
            "features",
            self.udfs['convertToVectorUDF'](F.col("embeddings"))
        )
        results = results.withColumn(
            "emb_sum",
            self.udfs['sumUDF'](F.col("embeddings"))
        )
        # Drop rows whose embeddings sum to zero (typically all-zero vectors), so cosine distance stays defined
        results = results.where(F.col("emb_sum")!=0.0)

        return results

In [None]:
embeddings_for_pca = WordEmbeddingsUDF()

In [None]:
DIMENSIONS  = 3

In [None]:
# import pyspark
pca = nlp.ML.feature.PCA(k=DIMENSIONS, inputCol="features", outputCol="pca_features")

### Full Spark NLP + Spark MLlib pipeline

In [None]:
# We run the second part of the pipeline

pipeline = nlp.Pipeline().setStages([embeddings_for_pca, pca])


In [None]:
model = pipeline.fit(result_df)

In [None]:
result = model.transform(result_df)

In [None]:
result.select("token", "embeddings", "pca_features", "label").show(truncate = 60)

+--------+------------------------------------------------------------+------------------------------------------------------------+-----------+
|   token|                                                  embeddings|                                                pca_features|      label|
+--------+------------------------------------------------------------+------------------------------------------------------------+-----------+
|     The|[-0.19058353, 0.02907179, 0.13235606, 0.19562247, 0.77783...|  [6.574778635662339,7.6528643152564015,-0.9220368677054432]|fiscal-year|
|  fiscal|[-0.19621773, 0.14509664, 0.23111394, -0.50601673, -0.383...| [5.227397542553495,-0.0694029672524412,-1.3976501037958822]|fiscal-year|
|    year|[0.08006305, 0.22008368, 0.23202448, -0.4419725, 0.589361...|    [9.185603253708326,1.001476454698103,1.0998418807415162]|fiscal-year|
|      of|[-0.14090274, 0.1561361, 0.24000195, -0.2449323, 0.897756...|   [6.3979324524305445,3.191733260939924,-1.040544103553439

In [None]:
df = result.select('token', 'pca_features',  'label').toPandas()

df

In [None]:
df["x"] = df["pca_features"].apply(lambda x: x[0])

df["y"] = df["pca_features"].apply(lambda x: x[1])

df["z"] = df["pca_features"].apply(lambda x: x[2])

df = df[["token", "x", "y", "z", "label"]]

df

In [None]:
import plotly.express as px

fig = px.scatter_3d(df, x = 'x', y = 'y', z = 'z', color = "label", width=1000, height = 800, hover_data = ["token", "label"])

fig.show()

You can see how the same token gets different embeddings depending on the context.
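One way to put a number on how far apart two occurrences of the same token are is cosine similarity. The sketch below is a minimal pure-Python version; the two vectors are hypothetical 3-D coordinates chosen for easy arithmetic, not values produced by this notebook.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # This is why all-zero embeddings are filtered out beforehand.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical coordinates of the same token in two different clauses.
token_in_clause_a = [3.0, 4.0, 0.0]
token_in_clause_b = [4.0, 3.0, 0.0]

similarity = cosine_similarity(token_in_clause_a, token_in_clause_b)
print(round(similarity, 2))
```

A value of 1.0 would mean the two occurrences point in exactly the same direction; lower values indicate that the surrounding context shifted the representation.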