## Introduction


1. Based on the results from part 1, the system **searches Google** for **treatments** for the desired stage and takes the **first 10 websites** (any result from my own domain is skipped, because my domain data is the reference used to compare similarity with the other websites).

2. A **Scrapy script** then **crawls** all reachable **links** and **stores** them in a JSON file **(Links.json)**.

3. **BeautifulSoup and requests** fetch **all text from these links (distinct links only)**; the text is **stored** in **ScrapedDataFrame.json** (for later use) and returned as a pandas DataFrame **(df)**.

4. **All paragraphs or sentences of each link are concatenated** into one paragraph **(df_WholeWeb)**, which is compared with my data.

5. **BeautifulSoup and requests** fetch the **specific content for stages I, II and III of NSCLC**; the content of each stage is **concatenated into one paragraph** and returned as a DataFrame **(df_prepTreatmentData)**.

6. **Compare the similarity** with my data using the **Google Universal Sentence Encoder model**. The **5 most similar links** (if their similarity is **higher** than **80%**) are kept.

7. **Process my data** and use **LSA, LDA and NMF** to extract the important information and **get 50 keywords**.

8. Take **all sentences and paragraphs from the 5 most similar links** above and reuse the **Google Universal Sentence Encoder model** to **find** the **highest-ranked sentences and paragraphs**.

9. **Take the top-ranked half (if their similarity > 80%)**, **combine** it into **one document**, then use **TextRank** to **summarize the content with a ratio ranging from 10% to 70%; the summary must be longer than 1000 words**.


**NOTE: All text is preprocessed before computing similarity, using convert_lower_case, remove_punctuation, remove_apostrophe, convert_numbers, remove_stop_words and stemming.**
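
Step 9 does not say how the ratio is chosen. One possible reading, sketched below, is to try ratios from 10% to 70% and keep the first summary that exceeds 1000 words. This is an assumption, not the notebook's confirmed method: `pick_summary` is an illustrative name, and `naive_summarize` is a stand-in for a real TextRank summarizer.

```python
import re

def naive_summarize(text, ratio):
    """Stand-in summarizer (assumption): keeps the first `ratio` share of sentences.
    A real TextRank implementation would rank the sentences instead."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    keep = max(1, int(len(sentences) * ratio))
    return " ".join(sentences[:keep])

def pick_summary(text, summarize, min_words=1000,
                 ratios=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)):
    """Return (ratio, summary) for the smallest ratio whose summary has more
    than `min_words` words; fall back to the largest ratio if none does."""
    ratio, summary = None, ""
    for ratio in ratios:
        summary = summarize(text, ratio)
        if len(summary.split()) > min_words:
            break
    return ratio, summary
```

Any callable taking `(text, ratio)` can be passed as `summarize`, so the loop stays independent of the summarization library.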



In [1]:
cd ..

/


In [2]:
cd "/content/drive/My Drive/CBD_Robotic/Part2/Part2_2/Multisite-Python-Crawler-master/scraper"

/content/drive/My Drive/CBD_Robotic/Part2/Part2_2/Multisite-Python-Crawler-master/scraper


In [3]:
!pip install google




In [4]:
!pip install scrapy



In [0]:
import os

In [7]:
try:
    from googlesearch import search
except ImportError:
    print("No module named 'google' found")

# Search query
query = "Treatment Choices for Non-Small Cell Lung Cancer by Stage 2 or II"

myDomainData = "www.cancer.ca"  # my own domain: skip it in the search results
googleLinks = []
count = 0
for j in search(query, tld="co.in", num=1, stop=None, pause=2):
    if count == 5:  # stop once 5 external links have been collected
        break
    print(j)
    if myDomainData == j.split('/')[2]:
        continue
    googleLinks.append(j)
    count += 1

https://www.cancer.org/cancer/lung-cancer/treating-non-small-cell/by-stage.html
https://www.verywellhealth.com/stage-2-non-small-cell-lung-cancer-2249379
https://www.webmd.com/lung-cancer/lung-cancer-stage-2-overview
https://www.texasoncology.com/types-of-cancer/lung-cancer/non-small-cell-lung-cancer/stage-ii-non-small-cell-lung-cancer
https://www.healthline.com/health/stage-2-lung-cancer
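
The loop above takes the domain with `j.split('/')[2]`, which assumes every result starts with a scheme such as `https://`. A minimal sketch of the same filtering with `urllib.parse` follows; `filter_links` is a hypothetical helper name, not part of the notebook.

```python
from urllib.parse import urlparse

def filter_links(links, own_domain, limit=5):
    """Keep up to `limit` links whose host differs from `own_domain`."""
    kept = []
    for link in links:
        host = urlparse(link).netloc.lower()
        if host == own_domain.lower():
            continue  # skip the reference site itself
        kept.append(link)
        if len(kept) == limit:
            break
    return kept
```

It could be applied to any iterable of result URLs, for example `filter_links(search(query, ...), "www.cancer.ca")`.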


In [0]:
import json
fullAllLinks = {}
for link in googleLinks:
    spiderName = "mySpider"
    url = "-a url=" + str(link)
    domainName = str(link).split("/")[2]
    domain = "-a domain=" + domainName
    jsonName = "-o " + domainName.replace(".", "_") + ".json"  # built but not passed to scrapy below
    os.system("scrapy crawl {} {} {}".format(spiderName, url, domain))
    # Read the links of this crawl from allLinks.json
    with open("allLinks.json", 'r') as f:
        json_object = json.load(f)
    fullAllLinks[domainName] = json_object

with open("Links/Links.json", "w") as outfile:
    json.dump(fullAllLinks, outfile)
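
The crawl command above is assembled by string formatting and passed to the shell, so a URL containing characters such as `&` or `?` may be misread. A hedged sketch of building the same command as an argument list is shown below; `build_crawl_command` is a hypothetical helper, and the optional `-o` output file is an assumption, since the notebook does not pass it.

```python
from urllib.parse import urlparse

def build_crawl_command(link, spider_name="mySpider", output_file=None):
    """Return the scrapy command as a list of arguments (nothing is executed)."""
    domain = urlparse(link).netloc
    cmd = ["scrapy", "crawl", spider_name,
           "-a", f"url={link}",
           "-a", f"domain={domain}"]
    if output_file is not None:
        cmd += ["-o", output_file]  # optional feed export file
    return cmd
```

The resulting list can be run with `subprocess.run(cmd, check=True)`, which avoids shell quoting and raises an error if the crawl fails.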




In [0]:

def GetConvertedDomainNametxt(links):
  domainNameList = []
  jsonFileList = []
  for link in links:
    domainNameList.append(str(link).split("/")[2])
    jsonFileList.append(str(link).split("/")[2].replace(".", "_") + ".txt")
    
  return list(set(domainNameList)), list(set(jsonFileList))

In [0]:

def GetConvertedDomainNameJson(links):
  domainNameList = []
  jsonFileList = []
  for link in links:
    domainNameList.append(str(link).split("/")[2])
    jsonFileList.append(str(link).split("/")[2].replace(".", "_") + ".json")
    
  return list(set(domainNameList)), list(set(jsonFileList))

In [0]:
domainNameList , txtFileList= GetConvertedDomainNametxt(googleLinks)

In [0]:
domainNameList , jsonFileList= GetConvertedDomainNameJson(googleLinks)

In [13]:
domainNameList

['www.verywellhealth.com',
 'www.healthline.com',
 'www.webmd.com',
 'www.cancer.org',
 'www.texasoncology.com']

In [14]:
txtFileList

['www_texasoncology_com.txt',
 'www_healthline_com.txt',
 'www_verywellhealth_com.txt',
 'www_cancer_org.txt',
 'www_webmd_com.txt']

In [15]:
jsonFileList

['www_webmd_com.json',
 'www_healthline_com.json',
 'www_verywellhealth_com.json',
 'www_cancer_org.json',
 'www_texasoncology_com.json']
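
The two helpers above differ only in the file extension, and because they return `list(set(...))`, the domain list and the file list come back in unrelated orders (as the three outputs above show). A possible single helper that keeps both lists aligned is sketched here; `domain_file_names` is an illustrative name, not part of the notebook.

```python
from urllib.parse import urlparse

def domain_file_names(links, extension):
    """Return (domains, file_names) with duplicates removed and order preserved,
    so that file_names[i] always belongs to domains[i]."""
    domains = list(dict.fromkeys(urlparse(link).netloc for link in links))
    file_names = [d.replace(".", "_") + extension for d in domains]
    return domains, file_names
```

Calling it with `".json"` or `".txt"` would cover both existing functions.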

In [0]:
import requests, re
from bs4 import BeautifulSoup
import pandas as pd

titleList = []
# urls = ["https://www.cancer.org/cancer/lung-cancer/treating-non-small-cell/by-stage.html"]
# text = 'downloaded'
def GenerateSrapedDataFrame(allLinksJson, path, googleLinks):
  #Get all domain name list
  domainNameList , jsonFileList= GetConvertedDomainNameJson(googleLinks)
  
  # Tags that may contain text
  #lista = ['h1', 'h2', 'h3', 'p', 'a', 'ul', 'span', 'input']
  # lista = ['h1', 'h2', 'h3', 'p']
  # h1, h2, h3 and p cannot easily be arranged into one paragraph, so only <p> tags are taken; left for future improvement
  lista = ['p']

  # Generate Pandas Data Frame
  df = pd.DataFrame(columns = ['Domains', 'links', 'title','content'])
  dictionary = {}

  links = [] # container of all links got from scrapy
  for domain in domainNameList:
    for i in allLinksJson[domain]:
      links.append(allLinksJson[domain][i])

  with open(path + "ScrapedDataFrame.json", 'w', encoding='utf-8') as outfile:
    for index, link in enumerate(links):
      website = requests.get(link)
      soup = BeautifulSoup(website.content, 'lxml')
      tags = soup.find_all(lista)
      title = soup.find('title').text
      if title in titleList: # avoid repeating pages
        continue
      else:
        titleList.append(title)
      text = [''.join(s.findAll(text=True)) for s in tags]
      text_len = len(text)

      # for item in text:
      #   print(item, file=outfile)
      d = {'Domains':link.split('/')[2], 'links':link, 'title':title,'content': text}
      df = df.append(d, ignore_index = True)
      dictionary[index] = d
    json.dump(dictionary, outfile)
  print('Done! File is saved where you have your scrape-website.py')
  return df

In [0]:
with open('Links/Links.json', 'r', encoding='utf-8') as outfile:
  allLinks = json.load(outfile)

In [18]:
path = 'Links/ScrapedLinks/'

df = GenerateSrapedDataFrame(allLinks, path, googleLinks)


Done! File is saved where you have your scrape-website.py


In [19]:
df

Unnamed: 0,Domains,links,title,content
0,www.verywellhealth.com,https://www.verywellhealth.com/stage-2-non-sma...,Stage 2 Non-Small Cell Lung Cancer,[Stage 2 non-small cell lung cancer is defined...
1,www.verywellhealth.com,https://www.verywellhealth.com/,Verywell Health - Know More. Feel Better.,[\nLearn to navigate these difficult topics wi...
2,www.verywellhealth.com,https://www.verywellhealth.com/lung-cancer-ove...,"Lung Cancer - Symptoms, Treatment, and More",[Lung cancer is the leading cause of cancer de...
3,www.verywellhealth.com,https://www.verywellhealth.com/non-small-cell-...,Non-Small Cell Lung Cancer,[Different lung cancer subtypes fall under the...
4,www.verywellhealth.com,https://www.verywellhealth.com/small-cell-lung...,"Small Cell Lung Cancer - Symptoms, Treatment, ...",[Small cell lung cancer is a fast-growing form...
...,...,...,...,...
670,www.texasoncology.com,https://www.texasoncology.com/newly-diagnosed-...,\n\tPatient Forms | Texas Oncology\n,"[At your first appointment, you’ll need to com..."
671,www.texasoncology.com,https://www.texasoncology.com/patients/resourc...,\n\tMyCare Plus Patient Portal | Texas Oncology\n,[Texas Oncology I Can Newsletter is published ...
672,www.texasoncology.com,https://www.texasoncology.com/patients/resourc...,\n\tOnline Bill Pay | Texas Oncology\n,"[Texas Oncology offers a secure, quick and eas..."
673,www.texasoncology.com,https://www.texasoncology.com/newly-diagnosed-...,\r\n\tInsurance Information | Texas Oncology\r\n,[The list below reflects some of the plans we ...


#### Concatenate all paragraphs into one paragraph for page ranking

In [20]:
' '.join(map(str, df.content[0]))

'Stage 2 non-small cell lung cancer\xa0is defined as a “localized cancer,” that is, it refers to a tumor that is present in the lung and may have spread to local lymph nodes, but has not spread further. Tumors that have spread beyond these areas are referred to as "advanced cancers." About 30 percent of lung cancers are diagnosed when they are at stage 1 or 2,\ufeff\ufeff and the prognosis\xa0(long-term outcome) is significantly better than with later stages\xa0of the disease. How is it determined that a lung cancer is stage 2 (stage II), and how this stage of lung cancer is treated? Determining the stage of a lung cancer is very important in choosing the most appropriate treatment. Stage II is divided into stages IIA and IIB. Stage IIA and IIB are each divided into two sections depending on the size of the tumor, where the tumor is found, and whether there is cancer in the lymph nodes. Stage IIA (1) Cancer has spread to lymph nodes on the same side of the chest as the tumor. The lymph
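
The joined text above still contains non-breaking spaces (`\xa0`) and zero-width characters (`\ufeff`) left over from the HTML. A minimal sketch of joining the paragraphs while normalizing such characters is given below; `join_paragraphs` is a hypothetical helper and is not used elsewhere in the notebook.

```python
import re

def join_paragraphs(paragraphs):
    """Join paragraphs into one string and normalize stray whitespace."""
    text = " ".join(str(p) for p in paragraphs)
    text = text.replace("\ufeff", "")   # drop zero-width no-break spaces
    text = text.replace("\xa0", " ")    # non-breaking space -> regular space
    return re.sub(r"\s+", " ", text).strip()
```

It could replace the lambda in the next cell, for example `df.content.apply(join_paragraphs)`.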

In [0]:
df_WholeWeb = df.copy()
df_WholeWeb.content = df.content.apply(lambda x : ' '.join(map(str, x)) )

In [22]:
df_WholeWeb

Unnamed: 0,Domains,links,title,content
0,www.verywellhealth.com,https://www.verywellhealth.com/stage-2-non-sma...,Stage 2 Non-Small Cell Lung Cancer,Stage 2 non-small cell lung cancer is defined ...
1,www.verywellhealth.com,https://www.verywellhealth.com/,Verywell Health - Know More. Feel Better.,\nLearn to navigate these difficult topics wit...
2,www.verywellhealth.com,https://www.verywellhealth.com/lung-cancer-ove...,"Lung Cancer - Symptoms, Treatment, and More",Lung cancer is the leading cause of cancer dea...
3,www.verywellhealth.com,https://www.verywellhealth.com/non-small-cell-...,Non-Small Cell Lung Cancer,Different lung cancer subtypes fall under the ...
4,www.verywellhealth.com,https://www.verywellhealth.com/small-cell-lung...,"Small Cell Lung Cancer - Symptoms, Treatment, ...",Small cell lung cancer is a fast-growing form ...
...,...,...,...,...
670,www.texasoncology.com,https://www.texasoncology.com/newly-diagnosed-...,\n\tPatient Forms | Texas Oncology\n,"At your first appointment, you’ll need to comp..."
671,www.texasoncology.com,https://www.texasoncology.com/patients/resourc...,\n\tMyCare Plus Patient Portal | Texas Oncology\n,Texas Oncology I Can Newsletter is published m...
672,www.texasoncology.com,https://www.texasoncology.com/patients/resourc...,\n\tOnline Bill Pay | Texas Oncology\n,"Texas Oncology offers a secure, quick and easy..."
673,www.texasoncology.com,https://www.texasoncology.com/newly-diagnosed-...,\r\n\tInsurance Information | Texas Oncology\r\n,The list below reflects some of the plans we a...


### My Data

In [0]:
data_base_url = "https://www.cancer.ca/en/cancer-information/cancer-type/lung/treatment/"

In [0]:
def parse_data(full_url):
  stagesDict = {'stage-1': 'Treating stage I NSCLC', 'stage-2': 'Treating stage II NSCLC', 'stage-3': 'Treating stage III NSCLC'}
  df = pd.DataFrame(columns = [ 'stages','pargraphs'])
  for s in list(stagesDict.keys()): 
    base_url = full_url + s + "/?region=on"
    base_url = requests.get(base_url, timeout = 5)
    page_content = BeautifulSoup(base_url.content, 'html.parser')
    title = page_content.findAll('h1')[1].text
    containers = page_content.find('div', { 'class': "Section1"})
    print(s)
    flag = True
    # pargraphs = title
    for section in containers.findAll('h2'):
      nextNode = section
      # print(nextNode)
      pargraphs =   " "
      if section.text == 'Clinical trials':
        continue
      if flag:
        flag = False
        pargraphs =   title + " "
      # if section.text not in ['Surgery', 'Radiation therapy', 'Chemotherapy', 'Clinical trials']:
      #   pargraphs = section.text + " "
      while True:
          nextNode = nextNode.nextSibling
          if nextNode == "\n":
                continue
          try:            
              tag_name = nextNode.name
          except AttributeError:
              tag_name = ""
          if tag_name == "p":
              # print(nextNode.text)
              #print(nextNode.string)# error not show all <p>
              pargraphs +=  nextNode.text + " "
              
          else:
              df = df.append({'stages' : stagesDict[s],'pargraphs': pargraphs }, ignore_index = True)
              # print ("*****")
              break
  return df

In [25]:
df_TreatmentData = parse_data(data_base_url)

stage-1
stage-2
stage-3


In [26]:
df_TreatmentData


Unnamed: 0,stages,pargraphs
0,Treating stage I NSCLC,Treatments for stage 1 non–small cell lung can...
1,Treating stage I NSCLC,Radiation therapy is offered for stage 1 non–...
2,Treating stage I NSCLC,Chemotherapy may be offered after surgery to ...
3,Treating stage II NSCLC,Treatments for stage 2 non–small cell lung can...
4,Treating stage II NSCLC,External beam radiation therapy is offered fo...
5,Treating stage II NSCLC,Chemotherapy may be offered after surgery for...
6,Treating stage II NSCLC,Chemoradiation may be offered as a treatment ...
7,Treating stage III NSCLC,Treatments for stage 3 non–small cell lung can...
8,Treating stage III NSCLC,The type of surgery done depends on where the...
9,Treating stage III NSCLC,The most common chemotherapy drug combination...


## Prepare Data 

In [27]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [28]:
!pip install num2words



In [0]:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from collections import Counter
from num2words import num2words

import os
import string
import numpy as np
import copy
import pandas as pd
import pickle
import re
import math

In [0]:
def convert_lower_case(data):
    return np.char.lower(data)

In [0]:
def remove_stop_words(data):
    stop_words = stopwords.words('english')
    words = word_tokenize(str(data))
    new_text = ""
    for w in words:
        if w not in stop_words and len(w) > 1:
            new_text = new_text + " " + w
    return new_text

In [0]:
def remove_punctuation(data):
    symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"
    for i in range(len(symbols)):
        data = np.char.replace(data, symbols[i], ' ')
        data = np.char.replace(data, "  ", " ")
    data = np.char.replace(data, ',', '')
    return data

In [0]:
def remove_apostrophe(data):
    return np.char.replace(data, "'", "")

In [0]:

def stemming(data):
    stemmer= PorterStemmer()
    
    tokens = word_tokenize(str(data))
    new_text = ""
    for w in tokens:
        new_text = new_text + " " + stemmer.stem(w)
    return new_text

In [0]:
def convert_numbers(data):
    tokens = word_tokenize(str(data))
    new_text = ""
    for w in tokens:
        try:
            w = num2words(int(w))
        except Exception:
            pass  # token is not a convertible integer: keep it unchanged
        new_text = new_text + " " + w
    new_text = np.char.replace(new_text, "-", " ")
    return new_text

In [0]:
def preprocess(data):
    data = convert_lower_case(data)
    data = remove_punctuation(data) # commas are removed separately
    data = remove_apostrophe(data)
    data = convert_numbers(data) # must run before remove_stop_words
    data = remove_stop_words(data)
    data = stemming(data)
    # data = remove_punctuation(data)
    # data = convert_numbers(data)
    # data = stemming(data) # needed again as we need to stem the words
    # data = remove_punctuation(data) # needed again as num2words produces a few hyphens and commas, e.g. forty-one
    # data = remove_stop_words(data) # needed again as num2words produces stop words, e.g. 101 -> one hundred and one
    return data

In [37]:
preprocess(df_TreatmentData.pargraphs[0])

' treatment stage one non–smal cell lung cancer surgeri standard treatment stage one non–smal cell lung cancer peopl well enough surgeri lobectomi remov lobe lung main type surgeri stage one non–smal cell lung cancer offer best chanc cancer complet remov wedg segment resect use remov tumour along margin healthi lung tissu type surgeri may offer stage one non–smal cell lung cancer peopl good lung function sleev resect use remov tumour larg airway lung bronchu surgeri non–smal cell lung cancer lymph node chest around lung remov surgeri may stop cancer lymph node shown diagnost test done cancer spread far surgeri help treatment surgeri may done lab report cancer margin posit margin tissu remov'
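
In the output above the en dash in "non–smal" survives, because it is not in the `symbols` string of `remove_punctuation`. The sketch below shows the same order of steps (lower case, punctuation, apostrophes, numbers, stop words) with `str.translate` from the standard library. It is a simplified illustration under stated assumptions: the stop-word set and number map are tiny stand-ins for NLTK's list and `num2words`, stemming is omitted, and `simple_preprocess` is a hypothetical name.

```python
import string

# Tiny stand-ins (assumptions) for NLTK's stop words and for num2words.
STOP_WORDS = {"the", "is", "a", "an", "of", "for", "and", "to", "in"}
NUMBER_WORDS = {"1": "one", "2": "two", "3": "three"}

# ASCII punctuation, the en dash and newlines become spaces;
# apostrophes and commas are deleted, as in the notebook's functions.
_TO_SPACE = {ord(ch): " " for ch in string.punctuation + "\u2013\n" if ch not in "',"}
TRANSLATION = {**_TO_SPACE, ord("'"): None, ord(","): None}

def simple_preprocess(text):
    """Lower-case, strip punctuation, spell out small numbers, drop stop words."""
    text = text.lower().translate(TRANSLATION)
    tokens = [NUMBER_WORDS.get(t, t) for t in text.split()]  # numbers before stop words
    return " ".join(t for t in tokens if t not in STOP_WORDS and len(t) > 1)
```

Adding the en dash to the translation table is a design choice of this sketch; whether "non–small" should be split into two tokens depends on how the downstream encoder treats it.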

In [0]:
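#df_TreatmentData
#df_WholeWeb # df
df_prepTreatmentData = df_TreatmentData.copy()
df_prepWholeWeb = df_WholeWeb.copy()
df_prepTreatmentData.pargraphs = df_TreatmentData.pargraphs.apply(lambda x: preprocess(x))
# NOTE: this stores the preprocessed web text in df_WholeWeb, so df_prepWholeWeb
# (copied above) still holds the raw text, as its head() output below shows.
df_WholeWeb.content = df_WholeWeb.content.apply(lambda x: preprocess(x))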

In [39]:
df_prepWholeWeb.head()

Unnamed: 0,Domains,links,title,content
0,www.verywellhealth.com,https://www.verywellhealth.com/stage-2-non-sma...,Stage 2 Non-Small Cell Lung Cancer,Stage 2 non-small cell lung cancer is defined ...
1,www.verywellhealth.com,https://www.verywellhealth.com/,Verywell Health - Know More. Feel Better.,\nLearn to navigate these difficult topics wit...
2,www.verywellhealth.com,https://www.verywellhealth.com/lung-cancer-ove...,"Lung Cancer - Symptoms, Treatment, and More",Lung cancer is the leading cause of cancer dea...
3,www.verywellhealth.com,https://www.verywellhealth.com/non-small-cell-...,Non-Small Cell Lung Cancer,Different lung cancer subtypes fall under the ...
4,www.verywellhealth.com,https://www.verywellhealth.com/small-cell-lung...,"Small Cell Lung Cancer - Symptoms, Treatment, ...",Small cell lung cancer is a fast-growing form ...


In [40]:
df_prepTreatmentData.head()

Unnamed: 0,stages,pargraphs
0,Treating stage I NSCLC,treatment stage one non–smal cell lung cancer...
1,Treating stage I NSCLC,radiat therapi offer stage one non–smal cell ...
2,Treating stage I NSCLC,chemotherapi may offer surgeri peopl stage on...
3,Treating stage II NSCLC,treatment stage two non–smal cell lung cancer...
4,Treating stage II NSCLC,extern beam radiat therapi offer stage two no...


## Use Google Universal Sentence Encoder to find similar websites

In [41]:
%tensorflow_version 1.x
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np

TensorFlow 1.x selected.


In [42]:
!mkdir test_SimlarTxt
# Download the module, and uncompress it to the destination folder. 
!curl -L "https://tfhub.dev/google/universal-sentence-encoder-large/3?tf-hub-format=compressed" |   tar -zxvC test_SimlarTxt

mkdir: cannot create directory ‘test_SimlarTxt’: File exists
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
./
./tfhub_module.pb
./variables/
./variables/variables.data-00000-of-00001
 97  745M   97  726M    0     0  36.9M      0  0:00:20  0:00:19  0:00:01 40.5M./variables/variables.index
./assets/
./saved_model.pb
100  745M  100  745M    0     0  36.8M      0  0:00:20  0:00:20 --:--:-- 40.0M


In [43]:
#Function so that one session can be called multiple times. 
#Useful while multiple calls need to be done for embedding. 
import tensorflow as tf
import tensorflow_hub as hub
def embed_useT(module):
    with tf.Graph().as_default():
        sentences = tf.placeholder(tf.string)
        embed = hub.Module(module)
        embeddings = embed(sentences)
        session = tf.train.MonitoredSession()
    return lambda x: session.run(embeddings, {sentences: x})

embed_fn = embed_useT("test_SimlarTxt")

INFO:tensorflow:Saver not created because there are no variables in the graph to restore
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


In [44]:
convert_numbers("Treating stage I NSCLC")

array(' Treating stage I NSCLC', dtype='<U23')

In [0]:
stageI = ""
stageII = ""
stageIII = ""
def CombineStage(stageName, df):
  string = ""
  for s in df[df.stages == stageName].pargraphs:
    string += s
  return string
stageI =  CombineStage('Treating stage I NSCLC', df_prepTreatmentData)
stageII =  CombineStage('Treating stage II NSCLC', df_prepTreatmentData)
stageIII =  CombineStage('Treating stage III NSCLC', df_prepTreatmentData)


In [0]:
webContentList = []
for web in df_prepWholeWeb.content:
  webContentList.append(web)

In [47]:
messages = [stageII] +  webContentList
print(len(messages))
encoding_matrix = embed_fn(messages)
result = np.inner(encoding_matrix, encoding_matrix)[0]
print(len(result))
result5Highest = list(sorted(result[1:]))[-5:]
print("result5Highest {}".format(result5Highest))

index = np.argsort(result[1:])[-5:]
# for i in range(len(index)):
#   index[i] =  index[i] + 1
index = [x+1 for x in index] # shift by 1 because the first element (my data) was sliced off

print("result[index] {}".format(result[index]))

# filter all result greater than 80%
finalIndex = []
for i in index:
  if result[i] > 0.8:
    finalIndex.append(i)

finalIndex

676
676
result5Highest [0.86293, 0.8650019, 0.86702067, 0.87356174, 0.87814975]
result[index] [0.86293    0.8650019  0.86702067 0.87356174 0.87814975]


[54, 119, 432, 626, 624]
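
The cell above relies on `np.inner`, which equals cosine similarity only when the embeddings have unit length. The sketch below spells out the selection rule (top 5, kept only above 0.8) with an explicit cosine, using plain Python so it runs without the encoder; `cosine` and `top_matches` are illustrative names, and the vectors in the usage note are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_matches(query_vec, doc_vecs, k=5, threshold=0.8):
    """Return [(doc_index, score)] for the k best documents scoring above threshold,
    ordered from most to least similar."""
    scores = [(cosine(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    best = sorted(scores, reverse=True)[:k]
    return [(i, s) for s, i in best if s > threshold]
```

For example, `top_matches([1, 0], [[1, 0], [0, 1], [1, 1], [2, 0.1]])` keeps documents 0 and 3 and drops the other two, whose similarity is below the threshold. The returned indices refer to the document list directly, so no `+1` / `-1` shifting is needed.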

In [48]:
messages[finalIndex[-1]]

' Accurate staging of non-small cell lung (NSCLC) cancer is essential before definitive therapy can begin. Staging is performed according to the tumor, node, metastasis (TNM), staging system. Select the following general stage of cancer in order to learn more about treatment options.\xa0 In addition to stage other tests may be performed on the biopsy in order to further classify the cancer and determine the optimal treatment strategy. Based on the stage of the cancer and the results of these tests, treatment of lung cancer is individualized. Testing the cancer for specific characteristics: An important advance in the treatment of NSCLC is the development of targeted therapies—drugs that target specific biological pathways involved in the growth or spread of cancer. Patients should endure that their cancer is teste in order to determine the best overall treatment strategy. EGFR gene: Mutations in the epidermal growth factor receptor (EGFR) gene may affect how NSCLC responds to certain d

In [49]:
messages[finalIndex[0]]

'The treatment of non-small cell lung cancer depends on the stage of the disease,\ufeff\ufeff as well as the subtype and molecular profile. Early-stage cancers may be treated with surgery or a specialized form of radiation therapy if surgery is not possible. Advanced lung cancers are most often treated with targeted therapies, immunotherapy (checkpoint inhibitors), or chemotherapy. In addition to these treatments, local treatments designed to eradicate the sites of spread (metastasis) are sometimes used. When you\'ve been diagnosed with non-small cell lung cancer, the most important step you can take to maximize your outcome is to find a good doctor and cancer center. With surgery, studies have shown that outcomes of lung cancer surgery are better at cancer centers that perform large volumes of these surgeries.\ufeff\ufeff Once you have met with a lung cancer specialist, it\'s also very helpful to get a second opinion. With so many options now available to treat non-small cell lung can

In [50]:
finalIndex_df = [x - 1 for x in finalIndex]
df_prepWholeWeb.content[finalIndex_df]

53     The treatment of non-small cell lung cancer de...
118    Lung cancer is cancer that begins in the lungs...
431     A stage II non-small cell lung cancer is loca...
625     Stage I non-small cell lung cancer (NSCLC) is...
623     Accurate staging of non-small cell lung (NSCL...
Name: content, dtype: object

## Use Google Universal Sentence Encoder to find similar sentences or paragraphs in the 5 most similar pages, then summarize the treatment


### My Keywords

In [51]:
!pip install --upgrade gensim


Requirement already up-to-date: gensim in /usr/local/lib/python3.6/dist-packages (3.8.2)


In [0]:
from gensim import corpora
from gensim.models import CoherenceModel, LsiModel, LdaModel, nmf
from nltk.tokenize import RegexpTokenizer


In [0]:
def preprocess_data(doc_set):
    """
    Input  : document list
    Purpose: preprocess text (tokenize and remove stop words; stemming is currently disabled)
    Output : preprocessed text
    """
    # initialize regex tokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    # create English stop words list
    en_stop = set(stopwords.words('english'))
    # Create p_stemmer of class PorterStemmer (only used by the disabled stemming step)
    p_stemmer = PorterStemmer()
    # list for tokenized documents in loop
    texts = []
    # loop through document list
    for doc in doc_set:
        # clean and tokenize document string
        raw = doc.lower()
        tokens = tokenizer.tokenize(raw)
        # remove stop words from tokens
        stopped_tokens = [t for t in tokens if t not in en_stop]
        # stem tokens
        # stemmed_tokens = [p_stemmer.stem(t) for t in stopped_tokens]
        # add tokens to list
        texts.append(stopped_tokens)
    return texts

In [0]:
def prepare_corpus(doc_clean):
    """
    Input  : clean document
    Purpose: create the term dictionary of our corpus and convert the list of documents (corpus) into a Document Term Matrix
    Output : term dictionary and Document Term Matrix
    """
    # Create the term dictionary of our corpus, where every unique term is assigned an index.
    dictionary = corpora.Dictionary(doc_clean)
    # Convert the list of documents (corpus) into a Document Term Matrix using the dictionary prepared above.
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
    return dictionary,doc_term_matrix

In [0]:
def create_gensim_lsa_model(doc_clean,number_of_topics,words):
    """
    Input  : clean document, number of topics and number of words associated with each topic
    Purpose: create LSA model using gensim
    Output : return LSA model
    """
    dictionary,doc_term_matrix=prepare_corpus(doc_clean)
    # generate LSA model
    lsamodel = LsiModel(doc_term_matrix, num_topics=number_of_topics, id2word = dictionary)  # train model
    print(lsamodel.print_topics(num_topics=number_of_topics, num_words=words))
    return lsamodel

In [0]:
def create_gensim_lda_model(doc_clean,number_of_topics,words):
    """
    Input  : clean document, number of topics and number of words associated with each topic
    Purpose: create LDA model using gensim
    Output : return LDA model
    """
    dictionary,doc_term_matrix=prepare_corpus(doc_clean)
    # generate LDA model
    ldamodel = LdaModel(doc_term_matrix, num_topics=number_of_topics, id2word = dictionary)  # train model
    print(ldamodel.print_topics(num_topics=number_of_topics, num_words=words))
    return ldamodel

In [0]:
def create_gensim_nmf_model(doc_clean,number_of_topics,words):
    """
    Input  : clean document, number of topics and number of words associated with each topic
    Purpose: create nmf model using gensim
    Output : return nmf model
    """
    dictionary,doc_term_matrix=prepare_corpus(doc_clean)
    # generate nmf model
    nmfmodel = nmf.Nmf(doc_term_matrix, num_topics=number_of_topics, id2word = dictionary)  # train model
    print(nmfmodel.print_topics(num_topics=number_of_topics, num_words=words))
    return nmfmodel

#### Determine the number of topics

In [0]:
def compute_coherence_values(dictionary, doc_term_matrix, doc_clean, stop, start=2, step=3):
    """
    Input   : dictionary : Gensim dictionary
              doc_term_matrix : Gensim corpus (Document Term Matrix)
              doc_clean : List of tokenized input texts
              stop : Max num of topics
    purpose : Compute c_v coherence for various number of topics
    Output  : model_list : List of LSA topic models
              coherence_values : Coherence values corresponding to the LSA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, stop, step):
        # generate LSA model with the current number of topics
        model = LsiModel(doc_term_matrix, num_topics=num_topics, id2word = dictionary)  # train model
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=doc_clean, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values
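
`compute_coherence_values` returns one coherence score per candidate topic count, but the notebook does not show how a count is picked from them. A minimal sketch, assuming the count with the highest score is wanted, is given below; `best_num_topics` is a hypothetical helper, and its `start` and `step` must match the values passed to `compute_coherence_values`.

```python
def best_num_topics(coherence_values, start=2, step=3):
    """Return (num_topics, coherence) for the highest coherence value.
    Position i in the list corresponds to start + i * step topics."""
    best_pos = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return start + best_pos * step, coherence_values[best_pos]
```

With made-up scores `[0.31, 0.42, 0.38]` and the default `start=2, step=3`, the candidates are 2, 5 and 8 topics, and the helper selects 5.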

In [59]:
stageII

' treatment stage two non–smal cell lung cancer surgeri standard treatment stage two non–smal cell lung cancer peopl well enough surgeri lobectomi remov lobe lung main type surgeri stage two non–smal cell lung cancer offer best chanc cancer complet remov wedg segment resect use remov tumour along margin healthi lung tissu type surgeri may offer stage two non–smal cell lung cancer peopl good lung function sleev resect use remov tumour larg airway bronchu lung extend pulmonari resect chest wall resect may done stage two non–smal cell lung cancer spread chest wall tissu around lung surgeri non–smal cell lung cancer lymph node chest around lung remov cancer lymph node shown diagnost test surgeri may stop done cancer spread far surgeri help treatment surgeri may done lab report cancer found margin posit margin tissu remov extern beam radiat therapi offer stage two non–smal cell lung cancer peopl well enough surgeri choos surgeri radiat therapi may given surgeri posit margin tissu remov surg

In [60]:
# LSA Model
number_of_topics=1
words=50
document_list = [stageII]
clean_text=preprocess_data(document_list)
lsamodel=create_gensim_lsa_model(clean_text,number_of_topics,words)
ldamodel = create_gensim_lda_model(clean_text,number_of_topics,words)
nmfmodel = create_gensim_nmf_model(clean_text,number_of_topics,words)

[(0, '0.408*"lung" + 0.360*"cancer" + 0.336*"surgeri" + 0.264*"may" + 0.216*"cell" + 0.216*"smal" + 0.216*"non" + 0.192*"peopl" + 0.192*"offer" + 0.168*"two" + 0.168*"remov" + 0.168*"stage" + 0.144*"radiat" + 0.144*"treatment" + 0.120*"spread" + 0.096*"tissu" + 0.096*"therapi" + 0.096*"lymph" + 0.096*"node" + 0.096*"resect" + 0.096*"given" + 0.096*"margin" + 0.096*"done" + 0.096*"chemotherapi" + 0.072*"use" + 0.072*"enough" + 0.072*"tumour" + 0.072*"sbrt" + 0.072*"chest" + 0.048*"type" + 0.048*"cisplatin" + 0.048*"wall" + 0.048*"three" + 0.048*"healthi" + 0.048*"paraplatin" + 0.048*"hypofraction" + 0.048*"posit" + 0.048*"around" + 0.048*"well" + 0.024*"aq" + 0.024*"bodi" + 0.024*"airway" + 0.024*"test" + 0.024*"complet" + 0.024*"chanc" + 0.024*"chemoradi" + 0.024*"lobe" + 0.024*"function" + 0.024*"carboplatin" + 0.024*"diagnost"')]
[(0, '0.048*"lung" + 0.043*"cancer" + 0.040*"surgeri" + 0.032*"may" + 0.027*"smal" + 0.027*"non" + 0.027*"cell" + 0.024*"offer" + 0.024*"peopl" + 0.021*"two

In [61]:
lsa_words_topic = [x[0] for x in lsamodel.show_topic(0,topn=words) ]
' '.join(lsa_words_topic)

'lung cancer surgeri may cell smal non peopl offer two remov stage radiat treatment spread tissu therapi lymph node resect given margin done chemotherapi use enough tumour sbrt chest type cisplatin wall three healthi paraplatin hypofraction posit around well aq bodi airway test complet chanc chemoradi lobe function carboplatin diagnost'

In [62]:
lda_words_topic = [x[0] for x in ldamodel.show_topic(0,topn=words) ]
' '.join(lda_words_topic)

'lung cancer surgeri may smal non cell offer peopl two remov stage treatment radiat spread node therapi resect done given lymph chemotherapi tissu margin tumour chest sbrt enough use cisplatin hypofraction type wall three posit well healthi paraplatin around person lobectomi toler bronchu extend navelbin healthcar found along complet imrt'

In [63]:
nmf_words_topic = [x[0] for x in nmfmodel.show_topic(0,topn=words) ]
' '.join(nmf_words_topic)

'lung cancer surgeri may cell smal non offer peopl remov two stage radiat treatment spread resect chemotherapi given lymph tissu margin node done therapi use tumour enough sbrt chest healthi well three type posit around paraplatin wall hypofraction cisplatin radiotherapi extend complet help shown cm stereotact stop vinorelbin outsid wedg'
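The three keyword strings above overlap heavily and differ mainly in their tails. A minimal sketch of how that overlap could be measured is shown here; `keyword_votes` and the shortened `*_demo` lists are illustrative names and data, not part of the notebook.

```python
from collections import Counter

def keyword_votes(*keyword_lists):
    """Count how many models proposed each keyword."""
    votes = Counter()
    for keywords in keyword_lists:
        votes.update(set(keywords))  # set(): at most one vote per model
    return votes

# Shortened, illustrative lists (a few stems taken from the outputs above).
lsa_demo = "lung cancer surgeri radiat diagnost".split()
lda_demo = "lung cancer surgeri radiat lobectomi".split()
nmf_demo = "lung cancer surgeri radiat wedg".split()

votes = keyword_votes(lsa_demo, lda_demo, nmf_demo)
consensus = sorted(word for word, n in votes.items() if n == 3)
model_specific = sorted(word for word, n in votes.items() if n == 1)
print(consensus)       # stems that all three models agree on
print(model_specific)  # stems proposed by a single model
```

With the full lists, the same call would be `keyword_votes(lsa_words_topic, lda_words_topic, nmf_words_topic)`.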

#### Combine the passages from the 5 most similar links


In [64]:
fiveLinks = df_prepWholeWeb.links[finalIndex_df]
fiveLinks = list(fiveLinks)
fiveLinks

['https://www.verywellhealth.com/non-small-cell-lung-cancer-treatment-2824979',
 'https://www.healthline.com/health/lung-cancer/stage-1-lung-cancer',
 'https://www.texasoncology.com/types-of-cancer/lung-cancer/non-small-cell-lung-cancer/stage-ii-non-small-cell-lung-cancer',
 'https://www.texasoncology.com/cancer-and-blood-disorders/cancer-types/lung-cancer/non-small-cell-lung-cancer/stage-i-non-small-cell-lung-cancer',
 'https://www.texasoncology.com/cancer-and-blood-disorders/cancer-types/lung-cancer/non-small-cell-lung-cancer']

In [0]:
senParFiveLinks = []
for l in fiveLinks: # 5 links
  for senPar in list(df.content[df_prepWholeWeb.links == l])[0]: # all sentences or paragraphs in the link
    senParFiveLinks.append(senPar)

In [66]:
senParFiveLinks[:5]

['The treatment of non-small cell lung cancer depends on the stage of the disease,\ufeff\ufeff as well as the subtype and molecular profile. Early-stage cancers may be treated with surgery or a specialized form of radiation therapy if surgery is not possible. Advanced lung cancers are most often treated with targeted therapies, immunotherapy (checkpoint inhibitors), or chemotherapy. In addition to these treatments, local treatments designed to eradicate the sites of spread (metastasis) are sometimes used.',
 "When you've been diagnosed with non-small cell lung cancer, the most important step you can take to maximize your outcome is to find a good doctor and cancer center. With surgery, studies have shown that outcomes of lung cancer surgery are better at cancer centers that perform large volumes of these surgeries.\ufeff\ufeff Once you have met with a lung cancer specialist, it's also very helpful to get a second opinion.",
 "With so many options now available to treat non-small cell l

In [67]:
prep_senParFiveLinks = [preprocess(x) for x in senParFiveLinks ]
print(len(prep_senParFiveLinks))
prep_senParFiveLinks[:5]

204


[' treatment non small cell lung cancer depend stage disease\ufeff\ufeff well subtyp molecular profil earli stage cancer may treat surgeri special form radiat therapi surgeri possibl advanc lung cancer often treat target therapi immunotherapi checkpoint inhibitor chemotherapi addit treatment local treatment design erad site spread metastasi sometim use',
 ' youv diagnos non small cell lung cancer import step take maxim outcom find good doctor cancer center surgeri studi shown outcom lung cancer surgeri better cancer center perform larg volum surgeri \ufeff\ufeff met lung cancer specialist also help get second opinion',
 ' mani option avail treat non small cell lung cancer nsclc help break two major approach primari approach taken depend stage lung cancer',
 ' treatment option broken',
 ' stage cancer local therapi may suffici treat tumor stage iv tumor system therapi treatment choic stage ii stage iii lung cancer usual treat combin local system therapi']
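Three of the five links come from the same site, so identical passages may be collected more than once. A possible way to drop exact repeats before encoding is sketched below; `unique_in_order` is a hypothetical helper, and treating passages as equal after whitespace normalisation is an assumption.

```python
def unique_in_order(passages):
    """Drop repeated passages, keeping the first occurrence of each."""
    seen = {}
    for passage in passages:
        key = " ".join(passage.split())  # normalise whitespace for comparison
        seen.setdefault(key, passage)    # dicts preserve insertion order
    return list(seen.values())

demo = ["stage one treatment", "stage  one treatment", "stage two treatment"]
print(unique_in_order(demo))
```

It could be applied as `senParFiveLinks = unique_in_order(senParFiveLinks)` before the preprocessing step.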

In [68]:
ldaString = ""
for s in lda_words_topic:
  ldaString += s + " "
ldaString

'lung cancer surgeri may smal non cell offer peopl two remov stage treatment radiat spread node therapi resect done given lymph chemotherapi tissu margin tumour chest sbrt enough use cisplatin hypofraction type wall three posit well healthi paraplatin around person lobectomi toler bronchu extend navelbin healthcar found along complet imrt '

In [69]:
messages2 = [ldaString] +  prep_senParFiveLinks
print(len(messages2))
encoding_matrix = embed_fn(messages2)
result_senPar = np.inner(encoding_matrix, encoding_matrix)[0]
print(len(result_senPar))
halfLen = int(len(prep_senParFiveLinks)/2) # keep the top-scoring half of the passages
result_senPar5Highest = list(sorted(result_senPar[1:]))[-halfLen:]
print("result_senPar5Highest {}".format(result_senPar5Highest))

index_senPar = np.argsort(result_senPar[1:])[-halfLen:]
index_senPar = [x+1 for x in index_senPar] # shift by 1: position 0 of result_senPar is the keyword query (my data)

print("result_senPar[index_senPar] {}".format(result_senPar[index_senPar]))

# keep only the passages whose similarity to the keyword query exceeds 0.8
finalIndex_senPar = []
for i in index_senPar:
  if result_senPar[i] > 0.8:
    finalIndex_senPar.append(i)

print(len(finalIndex_senPar))

205
205
result_senPar5Highest [0.7365817, 0.7374294, 0.74134433, 0.74134433, 0.7414775, 0.74754214, 0.7477176, 0.75158286, 0.75158286, 0.7520707, 0.7520707, 0.75337446, 0.7538832, 0.7538832, 0.7548232, 0.7592883, 0.7605932, 0.76079, 0.76097655, 0.762339, 0.7627104, 0.76642466, 0.76642466, 0.76692086, 0.77242327, 0.77242327, 0.7734767, 0.7734767, 0.7744908, 0.7751358, 0.7756368, 0.7770207, 0.7770207, 0.77819943, 0.77819943, 0.78434896, 0.7879374, 0.7879374, 0.7902738, 0.7902738, 0.79071355, 0.79071355, 0.79219925, 0.7932299, 0.79445904, 0.794951, 0.7963377, 0.7963377, 0.79768485, 0.8022673, 0.8047347, 0.80620337, 0.8079082, 0.8120262, 0.81391907, 0.8142965, 0.81467676, 0.8187161, 0.818979, 0.81994295, 0.82011485, 0.8227416, 0.8227416, 0.82331, 0.8243702, 0.8243702, 0.8275318, 0.8287605, 0.83027655, 0.8312187, 0.8312187, 0.83427846, 0.84015805, 0.843755, 0.84770614, 0.84770614, 0.8532014, 0.85838497, 0.8610033, 0.8616117, 0.8616117, 0.8669523, 0.8669523, 0.86764026, 0.8676794, 0.86780745
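The selection rule used in this cell (score every passage against the keyword query, keep the top half, then keep only scores above 0.8) can be sketched without any dependencies. The code below is an illustration, not the notebook's implementation: it uses plain cosine similarity on toy 2-D vectors in place of `np.inner` on sentence embeddings, which is equivalent only if the embeddings are unit length. `top_half_above` is a hypothetical name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_half_above(query_vec, passage_vecs, threshold=0.8):
    """Return (index, score) for the top-scoring half, filtered by threshold."""
    scores = [cosine(query_vec, vec) for vec in passage_vecs]
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    keep = order[: len(scores) // 2]
    return [(i, scores[i]) for i in keep if scores[i] > threshold]

# Toy vectors standing in for sentence embeddings.
query = [1.0, 0.0]
passages = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.6, 0.8]]
selected = top_half_above(query, passages)
print(selected)
```

Unlike the cell above, this sketch returns passage indices directly (best first), so no shift by one is needed afterwards.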

In [70]:
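# Shift back by 1 so the indices point into senParFiveLinks (which has no query at position 0).
# Run this cell only once: re-running it shifts the indices again.
finalIndex_senPar = [x - 1 for x in finalIndex_senPar]
finalTreatment = ""
for s in finalIndex_senPar:
  finalTreatment += senParFiveLinks[s] 
print(finalTreatment)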

The treatment options for non-small cell lung cancer have increased dramatically in even the past few years, and many additional therapies are being evaluated in clinical trials. Instead of treating lung cancer as a single disease, it is now recognized and treated as a condition made up of many diseases. Fortunately, along with advances in treatment have come greater social support. Patient-led groups are now available for many of the common mutations (such as the ROS2ders and EGFR resisters) that also include oncologists, surgeons, pathologists, researchers, and more.Radiofrequency ablation uses high-energy radio waves to heat the tumor. Guided by imaging scans, a small probe is inserted through the skin and to the tumor. It can be performed under local anesthesia as an outpatient procedure.One category of immunotherapy drugs is checkpoint inhibitors, of which four drugs are currently available for treating non-small cell lung cancer (with different indications):﻿﻿Recurrent/Relapsed: 
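In the printed text, consecutive passages run together without a space (for example "more.Radiofrequency"), which may prevent a sentence splitter from seeing the boundary between them. One possible fix is to join the selected passages with an explicit separator; `join_passages` below is a hypothetical helper, not the notebook's code.

```python
def join_passages(passages, indices, sep=" "):
    """Concatenate the selected passages with a separator between them."""
    return sep.join(passages[i].strip() for i in indices)

demo = ["First passage ends here.", "Second passage starts here."]
joined = join_passages(demo, [0, 1])
print(joined)
```

In the notebook this would read `finalTreatment = join_passages(senParFiveLinks, finalIndex_senPar)`.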

In [71]:
from gensim.summarization.summarizer import summarize
sum_finalTreatment = summarize(finalTreatment, ratio = 0.5)
print(sum_finalTreatment)

The treatment options for non-small cell lung cancer have increased dramatically in even the past few years, and many additional therapies are being evaluated in clinical trials.
It can be performed under local anesthesia as an outpatient procedure.One category of immunotherapy drugs is checkpoint inhibitors, of which four drugs are currently available for treating non-small cell lung cancer (with different indications):﻿﻿Recurrent/Relapsed: Cancer has progressed or returned (recurred/relapsed) following an initial treatment with surgery, radiation therapy and/or chemotherapy.Small cell lung cancer is a very aggressive form of lung cancer.
Learn about small cell lung cancer symptoms, risk factors, diagnosis, and treatment.A variety of factors ultimately influence a patient’s decision to receive treatment of cancer.
Stage 1 is further divided into 1A and 1B.With so many options now available to treat non-small cell lung cancer (NSCLC), it's helpful to break these down into two major app

In [78]:
from gensim.summarization.summarizer import summarize
# Try ratios 0.1 to 0.7 and stop at the first summary that is long enough.
ratioList = [x*0.1 for x in range(1, 8)]
sum_finalTreatment = ""
for i in ratioList:
  sum_finalTreatment = summarize(finalTreatment, ratio = i)
  if len(sum_finalTreatment) >= 1000: # note: len() counts characters, not words
    print("ratio: {}".format(i))
    print("len(sum_finalTreatment): {}".format(len(sum_finalTreatment)))
    break
print(sum_finalTreatment)

ratio: 0.1
len(sum_finalTreatment): 4056
It can be performed under local anesthesia as an outpatient procedure.One category of immunotherapy drugs is checkpoint inhibitors, of which four drugs are currently available for treating non-small cell lung cancer (with different indications):﻿﻿Recurrent/Relapsed: Cancer has progressed or returned (recurred/relapsed) following an initial treatment with surgery, radiation therapy and/or chemotherapy.Small cell lung cancer is a very aggressive form of lung cancer.
The information on this website is intended to help educate patients about their treatment options and to facilitate a mutual or shared decision-making process with their treating cancer physician.Researchers from the City of Hope National Medical Center recently determined that annual CT scans and chest x-rays three times per year may detect early second cancers in patients with previously treated stage IA NSCLC who appeared to be cured.
The average diameter of cancers detected by CT 
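The introduction asks for a summary of at least 1000 words, whereas the loop above measures length in characters. A sketch of a word-based check follows. The summarizer is passed in as a parameter because the `gensim.summarization` module is, to my knowledge, no longer shipped in Gensim 4.x. `summarize_to_min_words` and the stand-in `leading_words` are hypothetical names used only for illustration.

```python
def summarize_to_min_words(text, summarize_fn, min_words=1000, ratios=None):
    """Return (ratio, summary) for the first ratio whose summary is long enough."""
    if ratios is None:
        ratios = [r / 10 for r in range(1, 8)]  # 0.1 up to 0.7
    summary = ""
    for ratio in ratios:
        summary = summarize_fn(text, ratio)
        if len(summary.split()) >= min_words:  # count words, not characters
            return ratio, summary
    return ratios[-1], summary  # fall back to the largest ratio tried

def leading_words(text, ratio):
    """Stand-in summarizer for the demo: keep the leading share of the words."""
    words = text.split()
    return " ".join(words[: int(len(words) * ratio)])

demo_text = "word " * 100
ratio, summary = summarize_to_min_words(demo_text, leading_words, min_words=25)
print(ratio, len(summary.split()))
```

With Gensim 3.x, the TextRank summarizer could be plugged in as `lambda t, r: summarize(t, ratio=r)`.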

## Conclusion

- Limit the crawl depth: the more links are followed, the longer the collected content becomes.
- Choose better reference data for the similarity comparison.
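For the first point, Scrapy exposes settings that bound a crawl. The fragment below is only a sketch: `DEPTH_LIMIT` and `CLOSESPIDER_PAGECOUNT` are Scrapy setting names, while the dictionary name `CRAWL_LIMITS` and both values are assumptions to be tuned per site.

```python
# Candidate limits only; the values are illustrative, not tested on this crawl.
CRAWL_LIMITS = {
    "DEPTH_LIMIT": 2,              # do not follow links more than 2 hops from the start URLs
    "CLOSESPIDER_PAGECOUNT": 200,  # stop the spider after about 200 crawled pages
}
print(CRAWL_LIMITS)
```

Such a dictionary can be assigned to a spider's `custom_settings` attribute, or the same keys can be set in the project's `settings.py`.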

https://stackoverflow.com/questions/13437402/how-to-run-scrapy-from-within-a-python-script

In [72]:
cd ..

/content/drive/My Drive/CBD_Robotic/Part2/Part2_2/Multisite-Python-Crawler-master





https://medium.com/@venali

https://towardsdatascience.com/deep-learning-for-specific-information-extraction-from-unstructured-texts-12c5b9dceada

https://www.analyticsvidhya.com/blog/2019/09/introduction-information-extraction-python-spacy/

https://www.intechopen.com/books/efficient-decision-support-systems-practice-and-challenges-in-biomedical-related-domain/information-extraction-approach-for-clinical-practice-guidelines-representation-in-a-medical-decisio


https://link.springer.com/article/10.1007/s13721-019-0216-2

http://airccse.org/journal/avc/papers/1114avc03.pdf

https://www.sciencedirect.com/science/article/pii/S1532046417302563

https://medium.com/@andreasherman/different-ways-of-doing-relation-extraction-from-text-7362b4c3169e

