![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/medium-cognitive-search/tutorials/blogposts/medium/cognitive-search/medlineplus_sparknlp.ipynb)

In [None]:
%%capture
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

In [None]:
import sparknlp
import xml.etree.ElementTree as ET
import pandas as pd
import urllib.request

from sparknlp.annotator import *
from sparknlp.base import *
import pyspark.sql.functions as F

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.ml.feature import BucketedRandomProjectionLSH, BucketedRandomProjectionLSHModel

In [None]:
%%time
spark = sparknlp.start(gpu=True)

CPU times: user 258 ms, sys: 42.4 ms, total: 300 ms
Wall time: 56.3 s


In [None]:
print("Spark NLP version: {}".format(sparknlp.version()))
print("Apache Spark version: {}".format(spark.version))

Spark NLP version: 3.1.2
Apache Spark version: 3.0.3


In [None]:
%%capture
!wget https://github.com/JohnSnowLabs/spark-nlp-workshop/raw/master/tutorials/blogposts/medium/cognitive-search/corpus/mplus_topics_2021-06-01.txt

In [None]:
medlineplusDF = spark.read.option("header","true").csv("mplus_topics_2021-06-01.txt")

In [None]:
%%time
medlineplusDF.show(5, truncate=True)

+--------------+--------------------+--------------------+--------------------+
|         title|                 url|            metadesc|        norm_summary|
+--------------+--------------------+--------------------+--------------------+
|           A1C|https://medlinepl...|If you are being ...| A1C is a blood t...|
|Abdominal Pain|https://medlinepl...|Stomach aches can...| Your abdomen ext...|
|      Abortion|https://medlinepl...|An abortion is a ...| An abortion is a...|
|       Abscess|https://medlinepl...|Abscesses are fil...| An abscess is a ...|
|          Acne|https://medlinepl...|Looking for ways ...| Acne is a common...|
+--------------+--------------------+--------------------+--------------------+
only showing top 5 rows

CPU times: user 1.59 ms, sys: 0 ns, total: 1.59 ms
Wall time: 202 ms


In [None]:
medlineplusDF = medlineplusDF.withColumn(
    "text", F.concat(F.col("metadesc"), F.lit(" "), F.col("norm_summary"))
).select("title", "url", "text")

In [None]:
medlineplusDF.persist()
medlineplusDF.show(5, truncate=100)

+--------------+------------------------------------------+----------------------------------------------------------------------------------------------------+
|         title|                                       url|                                                                                                text|
+--------------+------------------------------------------+----------------------------------------------------------------------------------------------------+
|           A1C|          https://medlineplus.gov/a1c.html|If you are being tested for Type 2 diabetes, your doctor gives you an A1C test. The test is also ...|
|Abdominal Pain|https://medlineplus.gov/abdominalpain.html|Stomach aches can be painful. Find out what might be the cause of your abdominal pain.   Your abd...|
|      Abortion|     https://medlineplus.gov/abortion.html|An abortion is a medical procedure to end a pregnancy. It uses medicine or surgery to remove the ...|
|       Abscess|      https://medl

In [None]:
docass = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

In [None]:
sentence_detector_dl = SentenceDetectorDLModel \
  .pretrained("sentence_detector_dl", "xx") \
  .setInputCols(["document"]) \
  .setOutputCol("sentence")

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]


In [None]:
emb_use = UniversalSentenceEncoder.pretrained("tfhub_use_multi", "xx") \
      .setInputCols("sentence") \
      .setOutputCol("use_embeddings")


tfhub_use_multi download started this may take some time.
Approximate size to download 247.6 MB
[OK!]


In [None]:
pipeline_use = Pipeline(stages=[
  docass, sentence_detector_dl, emb_use
])
model_use = pipeline_use.fit(medlineplusDF)
medlineplusSentencesDF = model_use.transform(medlineplusDF)

In [None]:
%%time
medlineplusSentencesDF.show(5)

+--------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|         title|                 url|                text|            document|            sentence|      use_embeddings|
+--------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|           A1C|https://medlinepl...|If you are being ...|[[document, 0, 12...|[[document, 0, 12...|[[sentence_embedd...|
|Abdominal Pain|https://medlinepl...|Stomach aches can...|[[document, 0, 83...|[[document, 0, 28...|[[sentence_embedd...|
|      Abortion|https://medlinepl...|An abortion is a ...|[[document, 0, 47...|[[document, 0, 53...|[[sentence_embedd...|
|       Abscess|https://medlinepl...|Abscesses are fil...|[[document, 0, 84...|[[document, 0, 65...|[[sentence_embedd...|
|          Acne|https://medlinepl...|Looking for ways ...|[[document, 0, 11...|[[document, 0, 40...|[[sentence_embedd...|
+--------------+--------

In [None]:
medlineplusSentencesDF = medlineplusSentencesDF.select(
    F.col("title"),
    F.col("url"),
    F.arrays_zip(
        F.col("sentence.result").alias("sentence"), 
        F.col("sentence.begin").alias("begin"), 
        F.col("sentence.end").alias("end"), 
        F.col("use_embeddings.embeddings")
        ).alias("zip")
).select(
    F.col("title"),
    F.col("url"),
    F.explode(F.col("zip")).alias("zip")
).select(
    F.col("title"),
    F.col("url"),
    F.col("zip")['0'].alias("sentence"),
    F.col("zip")['1'].alias("begin"),
    F.col("zip")['2'].alias("end"),
    F.col("zip")['3'].alias("embeddings")
)
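
The cell above zips the parallel annotation arrays (sentence text, offsets, embeddings) and explodes them so that each sentence becomes its own row. As a rough plain-Python illustration of the same reshaping, here is a minimal sketch; the sample record and the `explode_sentences` helper are hypothetical and not part of the notebook:

```python
def explode_sentences(record):
    """Turn one document record with parallel lists into one row per sentence."""
    rows = []
    zipped = zip(record["sentences"], record["begin"], record["end"], record["embeddings"])
    for sentence, begin, end, embedding in zipped:
        rows.append({
            "title": record["title"],
            "sentence": sentence,
            "begin": begin,
            "end": end,
            "embeddings": embedding,
        })
    return rows

# Hypothetical record standing in for one row of the pipeline output.
doc = {
    "title": "Example Topic",
    "sentences": ["First sentence.", "Second one."],
    "begin": [0, 16],
    "end": [14, 26],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
}
rows = explode_sentences(doc)
```

Each resulting row carries the document title alongside one sentence, which is the shape the LSH step later indexes.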

In [None]:
myudf = F.udf(lambda vs: Vectors.dense(vs), VectorUDT())
medlineplusSentencesDF = medlineplusSentencesDF.select("title", "url", "sentence", "begin", "end", myudf("embeddings").alias("embeddings"))

In [None]:
%%time
medlineplusSentencesDF.persist()
medlineplusSentencesDF.show(5)

+--------------------+--------------------+--------------------+------------+-----+---+--------------------+
|               title|                 url|            sentence|sentence_seq|begin|end|          embeddings|
+--------------------+--------------------+--------------------+------------+-----+---+--------------------+
|Ankle Injuries an...|https://medlinepl...|Ankle injuries an...|           1|    0| 67|[-0.0795291587710...|
|Ankle Injuries an...|https://medlinepl...|Learn about diffe...|           2|   69|146|[-0.0731623098254...|
|Ankle Injuries an...|https://medlinepl...|Your ankle bone a...|           3|  149|229|[-0.0435278527438...|
|Ankle Injuries an...|https://medlinepl...|Your ligaments, w...|           4|  231|307|[-0.0642164871096...|
|Ankle Injuries an...|https://medlinepl...|Your muscles and ...|           5|  309|341|[-0.0617586746811...|
+--------------------+--------------------+--------------------+------------+-----+---+--------------------+
only showing top 5 

In [None]:
%%time
brp = BucketedRandomProjectionLSH(
    inputCol="embeddings", 
    outputCol="hashes",
    bucketLength=10, 
    numHashTables=5
    )
brp_model = brp.fit(medlineplusSentencesDF)
hashesDF = brp_model.transform(medlineplusSentencesDF)

CPU times: user 18.3 ms, sys: 2.75 ms, total: 21 ms
Wall time: 994 ms


In [None]:
%%time 
hashesDF.persist()
hashesDF.select("title", "sentence", "embeddings", "hashes").show(5, truncate=60)

+----------------------------+------------------------------------------------------------+------------------------------------------------------------+--------------------------------------+
|                       title|                                                    sentence|                                                  embeddings|                                hashes|
+----------------------------+------------------------------------------------------------+------------------------------------------------------------+--------------------------------------+
|Ankle Injuries and Disorders|Ankle injuries and ankle disorders can affect tendons and...|[-0.07952915877103806,-0.0331236831843853,0.0117932194843...|  [[0.0], [-1.0], [0.0], [0.0], [0.0]]|
|Ankle Injuries and Disorders|Learn about different kinds of ankle problems including s...|[-0.07316230982542038,-0.013313423842191696,-0.0048491135...|[[0.0], [-1.0], [-1.0], [0.0], [-1.0]]|
|Ankle Injuries and Disorders|Your ankle
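
The `hashes` column above only ever shows 0.0 and -1.0. A likely explanation, assuming the hash for each table is `floor(x · v / bucketLength)` with a random unit vector `v` (as the Spark MLlib guide describes bucketed random projection) and assuming the sentence embeddings are close to unit length: the dot product then lies in [-1, 1], so dividing by `bucketLength=10` and flooring can only give -1 or 0. The sketch below reproduces that behaviour in plain Python; the helper names and the toy dimension are illustrative only:

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a Gaussian vector and scale it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def brp_hash(vector, projections, bucket_length):
    """One hash value per table: floor(dot(vector, projection) / bucket_length)."""
    return [
        math.floor(sum(a * b for a, b in zip(vector, p)) / bucket_length)
        for p in projections
    ]

rng = random.Random(0)
dim = 8  # toy dimension, far smaller than a real sentence embedding
projections = [random_unit_vector(dim, rng) for _ in range(5)]  # mirrors numHashTables=5
embedding = random_unit_vector(dim, rng)  # stand-in for a unit-length embedding
hashes = brp_hash(embedding, projections, bucket_length=10)
```

Under these assumptions, a smaller `bucketLength` would spread sentences over more buckets, trading some recall for less candidate filtering work. Whether that helps on this corpus would need to be measured.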

In [None]:
brp_model.write().overwrite().save("brp_model.parquet")
brp_model = BucketedRandomProjectionLSHModel.load("brp_model.parquet")

In [None]:
def get_key(query, model):
  """Embed the query with the fitted pipeline and return its vector as the LSH key."""
  queryDF = spark.createDataFrame([[query]]).toDF("text")
  queryDF = model.transform(queryDF)
  queryDF = queryDF.select(
    F.explode(
      F.arrays_zip(
        F.col("sentence.result"),
        F.col("use_embeddings.embeddings")
      )
    ).alias("zip")
  ).select(
    F.col("zip")['0'].alias("sentence"),
    myudf(F.col("zip")['1']).alias("embeddings")
  )
  # Only the first detected sentence of the query is used as the key
  key = queryDF.select("embeddings").take(1)[0].embeddings
  return key

def find_close_sentences(query, emb_model, brp_model, hashesDF, k):
  """Return the k approximate nearest corpus sentences; distCol holds the Euclidean distance."""
  key = get_key(query, emb_model)
  resultsDF = brp_model.approxNearestNeighbors(hashesDF, key, k)
  return resultsDF.select("title", "url", "sentence", "distCol", "hashes")
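
Because `approxNearestNeighbors` is approximate, it can be useful to compare its results against an exact search on a small sample. The following is a minimal brute-force sketch in plain Python, not something the notebook itself does; `exact_nearest` and the toy vectors are hypothetical:

```python
import heapq
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_nearest(key, items, k):
    """Return the k (distance, label) pairs closest to key, nearest first."""
    scored = ((euclidean(key, vec), label) for label, vec in items)
    return heapq.nsmallest(k, scored)

# Toy 2-D vectors standing in for sentence embeddings.
toy_items = [("a", [0.0, 0.0]), ("b", [1.0, 0.0]), ("c", [0.0, 3.0]), ("d", [4.0, 4.0])]
top2 = exact_nearest([0.9, 0.1], toy_items, 2)
```

Comparing the labels returned here with those from the LSH search on the same sample gives a rough recall estimate for the chosen `bucketLength` and `numHashTables`.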


In [None]:
key = get_key("How to treat depression?", model_use)
key

DenseVector([-0.0362, -0.0433, 0.0, -0.003, -0.0727, 0.0306, -0.0043, -0.0086, 0.0023, 0.0172, -0.0565, -0.0489, -0.0281, -0.0027, 0.0503, -0.0295, -0.007, -0.0666, 0.0137, -0.0501, 0.0104, -0.0448, 0.0635, 0.0414, 0.0713, -0.0257, 0.0442, 0.0083, -0.0444, -0.038, -0.029, -0.0343, 0.0051, -0.0687, -0.0067, 0.0608, 0.0028, 0.0737, 0.053, -0.0758, -0.0299, 0.039, -0.0447, 0.0519, -0.0129, -0.0688, 0.0681, 0.0235, -0.008, -0.0416, 0.0109, -0.0124, 0.0172, -0.0023, 0.0075, -0.0294, 0.0469, -0.0668, -0.0443, -0.0107, 0.0135, -0.0573, 0.0234, 0.0174, -0.0262, 0.0761, -0.0727, 0.0054, 0.048, -0.0246, 0.0525, -0.0485, 0.049, 0.021, 0.0576, 0.0375, -0.0, 0.0339, -0.0456, 0.0764, -0.0756, -0.0188, -0.051, -0.0659, -0.0298, -0.0266, -0.0146, -0.0483, -0.0291, -0.004, -0.0586, -0.0398, -0.0075, -0.0172, -0.035, 0.0219, -0.0295, 0.0356, -0.0007, 0.0201, -0.0199, 0.005, -0.0522, -0.0128, 0.0536, 0.0772, -0.0006, -0.0169, -0.0159, 0.0576, -0.0389, -0.0028, 0.055, -0.0314, -0.0601, 0.0066, -0.059, 0.0

In [None]:
%%time
find_close_sentences("How to treat depression?", model_use, brp_model, hashesDF, 5).show(truncate=False)

+----------------+--------------------------------------------+------------------------------------------------------------------+------------+------------------+--------------------------------------+
|title           |url                                         |sentence                                                          |sentence_seq|distCol           |hashes                                |
+----------------+--------------------------------------------+------------------------------------------------------------------+------------+------------------+--------------------------------------+
|Teen Depression |https://medlineplus.gov/teendepression.html |Learn about diagnosis and treatment " What is depression in teens?|3           |0.8463346765863106|[[-1.0], [0.0], [0.0], [-1.0], [-1.0]]|
|Depression      |https://medlineplus.gov/depression.html     |Learn about treatments. " Depression is a serious medical illness.|4           |0.8992738325893537|[[-1.0], [0.0], [0.0], [-1.0],
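
The `distCol` values are Euclidean distances. If the embeddings are unit length (an assumption, not verified in this notebook), the identity ||a - b||² = 2 - 2·cos(a, b) lets you read them as cosine similarities. A small sketch, with `cosine_from_euclidean` as a hypothetical helper:

```python
def cosine_from_euclidean(distance):
    """For unit-length vectors, ||a - b||^2 = 2 - 2*cos(a, b)."""
    return 1.0 - (distance ** 2) / 2.0

# First distCol value from the depression query output above.
sim = cosine_from_euclidean(0.8463346765863106)
```

Under that assumption, the top hit for the depression query corresponds to a cosine similarity of roughly 0.64.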

In [None]:
%%time
find_close_sentences("How to treat diabetes?", model_use, brp_model, hashesDF, 10).show(truncate=False)

+-----------------------+--------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+-------------------+---------------------------------------+
|title                  |url                                               |sentence                                                                                                                                                                     |sentence_seq|distCol            |hashes                                 |
+-----------------------+--------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+-------------------+---------------------------------------+
|Diabetes Medicines     |htt

In [None]:
question = "How can I prevent cancer?"
candidatesDF = find_close_sentences(question, model_use, brp_model, hashesDF, 20)

In [None]:
%%time
candidatesDF.persist()
candidatesDF.show(20, truncate=80)

+----------------------------+-----------------------------------------------------+--------------------------------------------------------------------+------------+------------------+---------------------------------------+
|                       title|                                                  url|                                                            sentence|sentence_seq|           distCol|                                 hashes|
+----------------------------+-----------------------------------------------------+--------------------------------------------------------------------+------------+------------------+---------------------------------------+
|                 Lung Cancer|              https://medlineplus.gov/lungcancer.html|                            What are the treatments for lung cancer?|          27|0.8336666147052888|  [[-1.0], [0.0], [0.0], [-1.0], [0.0]]|
|How to Prevent Heart Disease|https://medlineplus.gov/howtopreventheartdisease.html|            

In [None]:
candidateSourcesDF = candidatesDF.groupBy("title", "url").count().orderBy("count", ascending=False)
candidateSourcesDF.show(20, truncate=False)

+----------------------------+-----------------------------------------------------+-----+
|title                       |url                                                  |count|
+----------------------------+-----------------------------------------------------+-----+
|Lung Cancer                 |https://medlineplus.gov/lungcancer.html              |3    |
|Cervical Cancer Screening   |https://medlineplus.gov/cervicalcancerscreening.html |2    |
|How to Prevent Heart Disease|https://medlineplus.gov/howtopreventheartdisease.html|1    |
|Cancer--Living with Cancer  |https://medlineplus.gov/cancerlivingwithcancer.html  |1    |
|Reproductive Hazards        |https://medlineplus.gov/reproductivehazards.html     |1    |
|Tumors and Pregnancy        |https://medlineplus.gov/tumorsandpregnancy.html      |1    |
|Leukemia                    |https://medlineplus.gov/leukemia.html                |1    |
|Prostate Cancer Screening   |https://medlineplus.gov/prostatecancerscreening.html |1    |

In [None]:
candidate_titles = [row.title for row in candidatesDF.select("title").distinct().collect()]

In [None]:
candidate_sources_pd = medlineplusDF.where(F.col("title").isin(candidate_titles)).toPandas()

In [None]:
pd.set_option('display.max_colwidth', None)
candidate_sources_pd.head(4)

Unnamed: 0,title,url,text
0,Adrenal Gland Cancer,https://medlineplus.gov/adrenalglandcancer.html,"Tumors can affect adrenal glands. Most adrenal gland tumors are benign. Types of tumors include Neuroblastoma and Pheochromocytoma. Your adrenal, or suprarenal, glands are located on the top of each kidney. These glands produce hormones that you can't live without, including sex hormones and cortisol, which helps you respond to stress and has many other functions. A number of disorders can affect the adrenal glands , including tumors. Tumors can be either benign or malignant. Benign tumors aren't cancer. Malignant ones are. Most adrenal gland tumors are benign. They usually do not cause symptoms and may not require treatment. Malignant adrenal gland cancers are uncommon. Types of tumors include Adrenocortical carcinoma - cancer in the outer part of the gland Neuroblastoma , a type of childhood cancer Pheochromocytoma - a rare tumor that is usually benign Symptoms depend on the type of cancer you have. Treatments may include surgery, chemotherapy, or radiation therapy."
1,Benign Tumors,https://medlineplus.gov/benigntumors.html,"You may be relieved when your doctor tells you a tumor is benign. But they may need to be removed. Find out more about benign tumors. Tumors are abnormal growths in your body. They can be either benign or malignant. Benign tumors aren't cancer. Malignant ones are. Benign tumors grow only in one place. They cannot spread or invade other parts of your body. Even so, they can be dangerous if they press on vital organs, such as your brain. Tumors are made up of extra cells. Normally, cells grow and divide to form new cells as your body needs them. When cells grow old, they die, and new cells take their place. Sometimes, this process goes wrong. New cells form when your body does not need them, and old cells do not die when they should. These extra cells can divide without stopping and may form tumor. Treatment often involves surgery. Benign tumors usually don't grow back. NIH: National Cancer Institute"
2,Cancer Chemotherapy,https://medlineplus.gov/cancerchemotherapy.html,"Chemotherapy may help you fight cancer. Find out about the types of chemotherapy, side effects, and the latest news about chemotherapy. Normally, your cells grow and die in a controlled way. Cancer cells keep growing without control. Chemotherapy is drug therapy for cancer. It works by killing the cancer cells, stopping them from spreading, or slowing their growth. However, it can also harm healthy cells, which causes side effects. You may have a lot of side effects, some, or none at all. It depends on the type and amount of chemotherapy you get and how your body reacts. Some common side effects are fatigue, nausea, vomiting, pain, and hair loss. There are ways to prevent or control some side effects. Talk with your health care provider about how to manage them. Healthy cells usually recover after chemotherapy is over, so most side effects gradually go away. Your treatment plan will depend on the cancer type, the chemotherapy drugs used, the treatment goal, and how your body responds. Chemotherapy may be given alone or with other treatments. You may get treatment every day, every week, or every month. You may have breaks between treatments so that your body has a chance to build new healthy cells. You might take the drugs by mouth, in a shot, as a cream, or intravenously (by IV). NIH: National Cancer Institute"
3,Cancer--Living with Cancer,https://medlineplus.gov/cancerlivingwithcancer.html,"Living with cancer is not easy. It can take a physical and emotional toll on your health. Learn how to cope with cancer in your daily life. Cancer is common. Half of all men and a third of women will get a diagnosis of cancer in their lifetime. Many people with cancer do survive. Millions of Americans alive today have a history of cancer. For most people with cancer, living with the disease is the biggest challenge they have ever faced. It can change your routines, roles and relationships. It can cause money and work problems. The treatment can change the way you feel and look. Learning more about ways you can help yourself may ease some of your concerns. Support from others is important. All cancer survivors should have follow-up care. Knowing what to expect after cancer treatment can help you and your family make plans, lifestyle changes, and important decisions. NIH: National Cancer Institute"
