# **<center>Text Similarity Project</center>**

**Author:**  Hoang-Minh Nguyen

**Email:**  hoangminh.nhm2008@gmail.com

**Mobile:** (+61) 416 156 867

**Version:**  2.0

# **Configuration**

* This piece of code allows Google Drive File System to access the current Drive system containing this program.

In [1]:
# Mount Drive system
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


* The following chunk of code imports all of the necessary libraries supporting the operation of this program.

In [2]:
%%capture

!pip install bs4
!pip install sentence_transformers
!pip install swifter

In [3]:
import pandas as pd
import swifter
import requests
import lxml.html as html
from bs4 import *

import nltk
from nltk import tokenize
from nltk.tokenize import RegexpTokenizer 

import spacy as sp
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

In [4]:
%%capture

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Preliminary

Here's the summary of my approach:
1. Use `requests`, `xpath`, and `BeautifulSoup` to crawl content from the Wikipedia pages.
2. Tokenize the text to only keep plain words (remove numbers and other special symbols).
3. Use `SpaCy` to get the Adjectives from the text.
4. Split the text into sentences and use `SBERT` to Encode them (using Multiprocessing mode).
5. Extract the sentences into a new dataframe then Calculate the pairwise cosine similarity for each sentence against all other.
6. Use Self-Join to get a dataframe that contains the similar sentences together with their links.
 

# Execution

## 1. Initialize the static variables

* The following chunk initializes static variables used throughout entire Part 3. 

In [5]:
%%capture

WIKI_URL = "https://en.wikipedia.org/wiki/Cabinet_of_Australia"

# Load SBERT 
SBERT = SentenceTransformer('paraphrase-MiniLM-L3-v2') 
SBERT_POOL = SBERT.start_multi_process_pool() # Multiprocessing would help improve SBERT's performance

# Load SpaCy 
NLP = sp.load("en_core_web_sm")

## 2. Crawl and Analyze text

* The following function reads all the text in the `<body>` tag of a HTML page. The text is then tokenized and lemmatized before stopwords are filtered out.

In [6]:
# Read HTML page
def get_page_content(url):
  # Fetch URL Content
  res = requests.get(url)
  
  # Get content in the <body> tag
  parser = BeautifulSoup(res.text, 'html.parser').select('body')[0]
  
  # Get all contents from the <p> tags as a string
  paragraphs = list()
  paragraphs = [tag.text.strip() for tag in parser.find_all() if tag.name=="p"]
  text = " ".join(paragraphs)

  # Initialize tokenizer
  tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?(?:[.!?])?")

  # Clean-up the string
  unigrams = tokenizer.tokenize(text.lower())

  # Return the new string
  return " ".join(unigrams)

* The following function get all Adjectives from a text using SpaCy.

In [7]:
# Get Adj(s)
def get_adj(target):
  # Process the text
  doc = NLP(target)

  # Get Adj
  adj_list = [word for word in doc if (word.pos_=="ADJ" and len(word)>1)] 
  adj_list = list(set(adj_list)) # Only get unique words
  return adj_list

* The following chunk crawl data from the Wikipedia page for each of the Australia Cabinet member.

In [40]:
# Store all cabinet members' data in this list
CABINET = list()

try:  
  res = requests.get(WIKI_URL)
  
  if res.status_code == 200: 
    # Parse text using XPath
    html_page = html.fromstring(res.text)
    cabinet_table = html_page.xpath('//div[@class="mw-parser-output"]/table')[1]
    cabinet_table_rows = cabinet_table.xpath('tbody/tr')

    # For each cabinet member
    for row in cabinet_table_rows[1:]:
      name = row.xpath('td[last()-2]/a[2]/text()')[0]
      link = row.xpath('td[last()-2]/a[2]/@href')[0]
      content = get_page_content("https://en.wikipedia.org/{}".format(link))
      
      # Split the text into a list of sentences
      sentences = tokenize.sent_tokenize(content)

      print("Name:{}".format(name), " --- Link:{}".format(link))

      CABINET.append({
          "Name": name,
          "Link": "https://en.wikipedia.org/{}".format(link),
          "Content": content,
          "Adj": get_adj(content),
          "Sentence": sentences,
          # Use SBERT in multiprocessing mode to encode the sentences
          "Encoded": SBERT.encode_multi_process(sentences, SBERT_POOL).tolist() 
      })
    
  else: # Just in case something happens to Wikipedia
    print("Status: ", res.status_code, ", cannot get data")

except Exception as e: # Just in case something happens to Wikipedia
  print(e)

Name:Anthony Albanese  --- Link:/wiki/Anthony_Albanese
Name:Richard Marles  --- Link:/wiki/Richard_Marles
Name:Penny Wong  --- Link:/wiki/Penny_Wong
Name:Jim Chalmers  --- Link:/wiki/Jim_Chalmers
Name:Katy Gallagher  --- Link:/wiki/Katy_Gallagher
Name:Don Farrell  --- Link:/wiki/Don_Farrell
Name:Tony Burke  --- Link:/wiki/Tony_Burke
Name:Mark Butler  --- Link:/wiki/Mark_Butler
Name:Chris Bowen  --- Link:/wiki/Chris_Bowen
Name:Tanya Plibersek  --- Link:/wiki/Tanya_Plibersek
Name:Catherine King  --- Link:/wiki/Catherine_King_(politician)
Name:Amanda Rishworth  --- Link:/wiki/Amanda_Rishworth
Name:Bill Shorten  --- Link:/wiki/Bill_Shorten
Name:Linda Burney  --- Link:/wiki/Linda_Burney
Name:Mark Dreyfus  --- Link:/wiki/Mark_Dreyfus
Name:Brendan O'Connor  --- Link:/wiki/Brendan_O%27Connor_(politician)
Name:Jason Clare  --- Link:/wiki/Jason_Clare
Name:Julie Collins  --- Link:/wiki/Julie_Collins
Name:Michelle Rowland  --- Link:/wiki/Michelle_Rowland
Name:Madeleine King  --- Link:/wiki/Madelei

## 3. Outcome (Cabinet members and their adjectives)

* The adjectives of the cabinet members would be displayed below.

In [41]:
# Cabinet members with their adjectives
pd.DataFrame(CABINET)[["Name", "Link", "Adj"]]

Unnamed: 0,Name,Link,Adj
0,Anthony Albanese,https://en.wikipedia.org//wiki/Anthony_Albanese,"[semi, albanese, maritime, australian, first, ..."
1,Richard Marles,https://en.wikipedia.org//wiki/Richard_Marles,"[victorian, prime, private, u., quadrilateral,..."
2,Penny Wong,https://en.wikipedia.org//wiki/Penny_Wong,"[industrial, rudd, national, indigenous, first..."
3,Jim Chalmers,https://en.wikipedia.org//wiki/Jim_Chalmers,"[positive, new, strong, sound, youngest, forme..."
4,Katy Gallagher,https://en.wikipedia.org//wiki/Katy_Gallagher,"[poor, dangerous, ted, elder, second, national..."
5,Don Farrell,https://en.wikipedia.org//wiki/Don_Farrell,"[federal, special, former, less, indulgent, fa..."
6,Tony Burke,https://en.wikipedia.org//wiki/Tony_Burke,"[offensive, extensive, largest, safe, more, en..."
7,Mark Butler,https://en.wikipedia.org//wiki/Mark_Butler,"[miscellaneous, prime, federal, old, electoral..."
8,Chris Bowen,https://en.wikipedia.org//wiki/Chris_Bowen,"[small, other, federal, financial, regional, p..."
9,Tanya Plibersek,https://en.wikipedia.org//wiki/Tanya_Plibersek,"[australian, australian, other, domestic, seri..."


## 4. Compute Cosine similarity

* Data of sentences is extracted to a different list.

In [42]:
# Extract a smaller subset of the data
SENTENCES = [
  {
    "Name": element["Name"],
    "Link": element["Link"], 
    "Sentence": element["Sentence"],
    "Encoded": element["Encoded"]
  } 
  for element in CABINET 
]

* A smaller data frame is created only for the sentences.

In [43]:
# Convert the list into a DataFrame, each row contains a list of sentences
sentences_df = pd.DataFrame(SENTENCES)

# Explode the table to have each row represent a single sentence
sentences_df = sentences_df.set_index(["Name", "Link"]).swifter.apply(pd.Series.explode).reset_index()
sentences_df["ID"] = range(0, sentences_df.shape[0]) # Set ID for the rows

sentences_df

Pandas Apply:   0%|          | 0/2 [00:00<?, ?it/s]

Unnamed: 0,Name,Link,Sentence,Encoded,ID
0,Anthony Albanese,https://en.wikipedia.org//wiki/Anthony_Albanese,anthony norman albanese lb ni zi al-b neez-ee ...,"[-0.17799575626850128, 0.05270959809422493, 0....",0
1,Anthony Albanese,https://en.wikipedia.org//wiki/Anthony_Albanese,he attended st mary's cathedral college before...,"[0.11786865442991257, -0.05322521924972534, 0....",1
2,Anthony Albanese,https://en.wikipedia.org//wiki/Anthony_Albanese,he joined the labor party as a student and bef...,"[-0.04393822327256203, 0.07338488101959229, 0....",2
3,Anthony Albanese,https://en.wikipedia.org//wiki/Anthony_Albanese,albanese was elected to the house of represent...,"[0.08946056663990021, 0.10150539129972458, 0.1...",3
4,Anthony Albanese,https://en.wikipedia.org//wiki/Anthony_Albanese,he was first appointed to the shadow cabinet i...,"[-0.16330453753471375, -0.09657547622919083, 0...",4
...,...,...,...,...,...
1047,Clare O'Neil,https://en.wikipedia.org//wiki/Clare_O%27Neil,i want my generation to be the last to have to...,"[-0.09499959647655487, 0.10766898840665817, 0....",1047
1048,Clare O'Neil,https://en.wikipedia.org//wiki/Clare_O%27Neil,o'neil lives with her partner brendan an anaes...,"[0.08744799345731735, -0.302818238735199, 0.11...",1048
1049,Clare O'Neil,https://en.wikipedia.org//wiki/Clare_O%27Neil,o'neil has two sons elvis and louis and a daug...,"[0.07609084248542786, -0.41609659790992737, -0...",1049
1050,Clare O'Neil,https://en.wikipedia.org//wiki/Clare_O%27Neil,while living in the northern territory o'neil ...,"[0.2947000563144684, -0.3881482779979706, 0.40...",1050


* The following function calculate the pairwise Cosine similarity between a sentences and all of its counterparts. Only when the score is higher than 0.8 that the pair is recorded.

In [31]:
# Calculate pairwise cosine similarity
def get_simi(df, id):
  
  # List of similar sentences
  result = list()

  # Calculate the similarities between a given sentence and all other
  simi_array = list(cosine_similarity([df["Encoded"][id]], df["Encoded"].to_list())[0])

  # Only keep similarity score > 0.8
  for i in range(len(simi_array)):
    if simi_array[i]>0.8 and i!=id:
      result.append(i)

  # Set None for empty lists for filtering convenience
  if len(result)<1:
    return None

  return result

In [44]:
# Calculate similarity scores for all sentences
sentences_df["Simi"] = sentences_df.swifter.apply(lambda row: get_simi(sentences_df, row["ID"]), axis=1)

sentences_df

Pandas Apply:   0%|          | 0/1052 [00:00<?, ?it/s]

Unnamed: 0,Name,Link,Sentence,Encoded,ID,Simi
0,Anthony Albanese,https://en.wikipedia.org//wiki/Anthony_Albanese,anthony norman albanese lb ni zi al-b neez-ee ...,"[-0.17799575626850128, 0.05270959809422493, 0....",0,
1,Anthony Albanese,https://en.wikipedia.org//wiki/Anthony_Albanese,he attended st mary's cathedral college before...,"[0.11786865442991257, -0.05322521924972534, 0....",1,
2,Anthony Albanese,https://en.wikipedia.org//wiki/Anthony_Albanese,he joined the labor party as a student and bef...,"[-0.04393822327256203, 0.07338488101959229, 0....",2,
3,Anthony Albanese,https://en.wikipedia.org//wiki/Anthony_Albanese,albanese was elected to the house of represent...,"[0.08946056663990021, 0.10150539129972458, 0.1...",3,
4,Anthony Albanese,https://en.wikipedia.org//wiki/Anthony_Albanese,he was first appointed to the shadow cabinet i...,"[-0.16330453753471375, -0.09657547622919083, 0...",4,
...,...,...,...,...,...,...
1047,Clare O'Neil,https://en.wikipedia.org//wiki/Clare_O%27Neil,i want my generation to be the last to have to...,"[-0.09499959647655487, 0.10766898840665817, 0....",1047,
1048,Clare O'Neil,https://en.wikipedia.org//wiki/Clare_O%27Neil,o'neil lives with her partner brendan an anaes...,"[0.08744799345731735, -0.302818238735199, 0.11...",1048,
1049,Clare O'Neil,https://en.wikipedia.org//wiki/Clare_O%27Neil,o'neil has two sons elvis and louis and a daug...,"[0.07609084248542786, -0.41609659790992737, -0...",1049,
1050,Clare O'Neil,https://en.wikipedia.org//wiki/Clare_O%27Neil,while living in the northern territory o'neil ...,"[0.2947000563144684, -0.3881482779979706, 0.40...",1050,


* Only sentences with similar counterpart(s) are kept.

In [45]:
# Filter out sentences with similar counterparts 
simi_df = sentences_df[sentences_df['Simi'].notnull()][["Name", "Link", "Sentence", "Simi", "ID"]].explode('Simi')

* Use self-join to get pairs of similar sentences together with their links.

In [46]:
# Self merge to get data for the matching pairs
result_df = pd.merge(simi_df, simi_df, left_on="Simi", right_on="ID")[["Name_x", "Link_x", "Sentence_x", "ID_x", "ID_y", "Sentence_y", "Link_y", "Name_y"]]
# Filter out reverse duplication (e.g. A-B and B-A)
result_df["temp"] = result_df.swifter.apply(lambda row: str(sorted([row["ID_x"], row["ID_y"]])), axis=1)
result_df = result_df.drop_duplicates(subset="temp").drop(["temp"], axis=1)

Pandas Apply:   0%|          | 0/48 [00:00<?, ?it/s]

## 5. Outcome (Similar sentences)

* The similar sentences would be displayed below.

In [47]:
result_df

Unnamed: 0,Name_x,Link_x,Sentence_x,ID_x,ID_y,Sentence_y,Link_y,Name_y
0,Anthony Albanese,https://en.wikipedia.org//wiki/Anthony_Albanese,after labor's surprise defeat in the election ...,9,699,he led labor to a narrow loss at the election ...,https://en.wikipedia.org//wiki/Bill_Shorten,Bill Shorten
1,Anthony Albanese,https://en.wikipedia.org//wiki/Anthony_Albanese,albanese is the first italian-australian to be...,11,98,albanese is the first italian-australian prime...,https://en.wikipedia.org//wiki/Anthony_Albanese,Anthony Albanese
3,Katy Gallagher,https://en.wikipedia.org//wiki/Katy_Gallagher,she was appointed to bill shorten's shadow min...,284,1023,in she was appointed as a shadow minister by o...,https://en.wikipedia.org//wiki/Clare_O%27Neil,Clare O'Neil
6,Madeleine King,https://en.wikipedia.org//wiki/Madeleine_King,after the election she was appointed shadow mi...,968,1023,in she was appointed as a shadow minister by o...,https://en.wikipedia.org//wiki/Clare_O%27Neil,Clare O'Neil
9,Clare O'Neil,https://en.wikipedia.org//wiki/Clare_O%27Neil,following the election o'neil was appointed to...,1037,1023,in she was appointed as a shadow minister by o...,https://en.wikipedia.org//wiki/Clare_O%27Neil,Clare O'Neil
12,Katy Gallagher,https://en.wikipedia.org//wiki/Katy_Gallagher,she was subsequently elected manager of opposi...,285,345,she was additionally appointed manager of gove...,https://en.wikipedia.org//wiki/Katy_Gallagher,Katy Gallagher
14,Don Farrell,https://en.wikipedia.org//wiki/Don_Farrell,appointed the parliamentary secretary for sust...,349,356,on march farrell was promoted into the outer m...,https://en.wikipedia.org//wiki/Don_Farrell,Don Farrell
15,Don Farrell,https://en.wikipedia.org//wiki/Don_Farrell,on july farrell was appointed the minister for...,350,357,on july as part of the second rudd ministry fa...,https://en.wikipedia.org//wiki/Don_Farrell,Don Farrell
18,Mark Butler,https://en.wikipedia.org//wiki/Mark_Butler,he is a member of the australian labor party a...,446,855,he is a member of the australian labor party a...,https://en.wikipedia.org//wiki/Brendan_O%27Con...,Brendan O'Connor
19,Chris Bowen,https://en.wikipedia.org//wiki/Chris_Bowen,in bowen was appointed to the labor front benc...,489,502,bowen was later appointed shadow treasurer by ...,https://en.wikipedia.org//wiki/Chris_Bowen,Chris Bowen
