This notebook is a mix of a practical exercise and research I did online. It contains a short tutorial on text processing tools and a walkthrough for finding similar sentences (matching by meaning, i.e. semantically).

# First work practice in NLP

Subject:

Go to Google News and select two press articles about the same topic.
Copy and paste the text of each article into a separate file.

**The goal is to find the two sentences, one from each article, that are closest in meaning.**
The tools available to achieve this are: **tokenization, regular expressions and string distances**.


# Libraries :

In [None]:
import nltk
import pandas as pd
from nltk.stem.snowball import EnglishStemmer
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.metrics import edit_distance, jaccard_distance
import textdistance as td

stemmer = EnglishStemmer()

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

# Course

The homework and its conclusion are at the bottom of the document.

## Tools example :

### Tokenization of text into sentences :

In [None]:
tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
text=" Hello everyone. Hope all are fine and doing well. Hope you find the book interesting"
tokenizer.tokenize(text)

[' Hello everyone.',
 'Hope all are fine and doing well.',
 'Hope you find the book interesting']

### Tokenization of text in other languages :

In [None]:
french_tokenizer=nltk.data.load('tokenizers/punkt/french.pickle')
french_tokenizer.tokenize("Bonjour à tous. J'espère que tout le monde va bien et que tout va bien. J'espère que vous trouverez le livre intéressant")

['Bonjour à tous.',
 "J'espère que tout le monde va bien et que tout va bien.",
 "J'espère que vous trouverez le livre intéressant"]

### Tokenization of sentences into words :

In [None]:
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize("Have a nice day. I hope you find the book interesting")

['Have',
 'a',
 'nice',
 'day.',
 'I',
 'hope',
 'you',
 'find',
 'the',
 'book',
 'interesting']

**As the examples below show, the result depends on the tokenizer and on the shape of the words (contractions, punctuation).**

TreebankWordTokenizer follows the conventions of the Penn Treebank corpus. One of them is to split contractions, as nltk.word_tokenize (which applies Treebank-style rules) shows here:

In [None]:
text=nltk.word_tokenize(" Don't hesitate to ask questions")
print(text)


['Do', "n't", 'hesitate', 'to', 'ask', 'questions']


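Another word tokenizer is PunktWordTokenizer. It splits on punctuation but keeps the punctuation attached to the word instead of creating a separate token. It is no longer available in recent NLTK releases, so the cell below uses WordPunctTokenizer, which gives a different result: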

In [None]:
from nltk.tokenize import WordPunctTokenizer
WordPunctTokenizer().tokenize(" Don't hesitate to ask questions")

['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions']

WordPunctTokenizer splits text by making each run of punctuation a separate token. It is also exposed at the top level of the nltk package:

In [None]:
tokenizer= nltk.WordPunctTokenizer()
tokenizer.tokenize(" Don't hesitate to ask questions")


['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions']

You can also do it "manually" with a regular expression that describes what a token looks like:

In [None]:
tokenizer=RegexpTokenizer(r"[\w']+")
tokenizer.tokenize("Don't hesitate to ask questions")

["Don't", 'hesitate', 'to', 'ask', 'questions']

In [None]:
sent="Don't hesitate to ask questions"
print(regexp_tokenize(sent, pattern=r'\w+|\$[\d\.]+|\S+'))


['Don', "'t", 'hesitate', 'to', 'ask', 'questions']
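To see why the character class matters, here is a minimal sketch that uses only the standard re module. It assumes re.findall as a stand-in for RegexpTokenizer, which in its default mode also returns the matches of the pattern. With the apostrophe inside the class, contractions stay whole; without it, they are cut in two.

```python
import re

sentence = "Don't hesitate to ask questions"

# Apostrophe allowed inside a token: the contraction stays in one piece.
with_apostrophe = re.findall(r"[\w']+", sentence)

# Word characters only: the apostrophe acts as a separator.
word_chars_only = re.findall(r"\w+", sentence)

print(with_apostrophe)   # ["Don't", 'hesitate', 'to', 'ask', 'questions']
print(word_chars_only)   # ['Don', 't', 'hesitate', 'to', 'ask', 'questions']
```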


### Deal with stopwords :

Stop words are commonly used in Text Mining and Natural Language Processing (NLP) to eliminate words that are so widely used that they carry very little useful information.

In [None]:
stops=set(stopwords.words('english'))
words=["Don't", 'hesitate','to','ask','questions']
[word for word in words if word not in stops]

["Don't", 'hesitate', 'ask', 'questions']
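Note that "Don't" survives the filter above: the membership test is case-sensitive, and the NLTK English list stores its entries in lowercase. A minimal sketch of a case-insensitive filter follows; the tiny hand-written set toy_stops is an assumption, used so the example runs without NLTK data.

```python
# Hypothetical, tiny stopword set standing in for stopwords.words('english').
toy_stops = {"to", "the", "a", "don't"}

words = ["Don't", "hesitate", "to", "ask", "questions"]

# Case-sensitive test, as in the cell above: "Don't" is kept.
case_sensitive = [w for w in words if w not in toy_stops]

# Lowercasing each word only for the comparison removes it as well.
case_insensitive = [w for w in words if w.lower() not in toy_stops]

print(case_sensitive)    # ["Don't", 'hesitate', 'ask', 'questions']
print(case_insensitive)  # ['hesitate', 'ask', 'questions']
```

Lowercasing only inside the comparison keeps the original spelling of the words that remain.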

In [None]:
# list of languages for which a stopword list is available
stopwords.fileids()


['arabic',
 'azerbaijani',
 'basque',
 'bengali',
 'catalan',
 'chinese',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hebrew',
 'hinglish',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

### Example of text replacement with another text :

In [None]:
import re
from nltk.tokenize import word_tokenize

# (regex, replacement) pairs used to expand contractions
replacement_patterns = [
    (r'don\'t', 'do not'),
    (r'didn\'t', 'did not'),
    (r'can\'t', 'cannot'),
    (r"must've", "must have")
]

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        # compile each pattern once, when the replacer is created
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        # apply every replacement in turn
        for (pattern, repl) in self.patterns:
            s = pattern.sub(repl, s)
        return s

replacer=RegexpReplacer()
print(replacer.replace("don't hesitate to ask questions."))
print(replacer.replace("She must've gone to the market but she didn't go."))
print(word_tokenize(replacer.replace("don't hesitate to ask questions")))


do not hesitate to ask questions.
She must have gone to the market but she did not go.
['do', 'not', 'hesitate', 'to', 'ask', 'questions']


### Lemmatization :

Lemmatization reduces a word to its lemma, i.e. its dictionary form. Unlike a stem, the lemma is always an existing word. The pos argument gives the part-of-speech category of the input word. By default pos is 'n' (noun), which is why 'working' is returned unchanged below unless pos='v' is given.

WordNetLemmatizer is a wrapper around the WordNet corpus: it uses the morphy() function of WordNetCorpusReader to extract a lemma. If the word is not found in WordNet, it is returned in its original form. For example, for works, the lemma returned is the singular form, work.

In [None]:
lemmatizer_output=WordNetLemmatizer()

In [None]:
print("lemmatization of 'working': ",lemmatizer_output.lemmatize('working'))
print("lemmatization of 'working' as a verb:",lemmatizer_output.lemmatize('working',pos='v'))
print("lemmatization of 'works':",lemmatizer_output.lemmatize('works'))


lemmatization of 'working':  working
lemmatization of 'working' as a verb: work
lemmatization of 'works': work


### Stemming :

Stemming is a text preprocessing technique in natural language processing (NLP). Specifically, it is the process of reducing the inflected forms of a word to a single so-called "stem", or root form.

The practical distinction between stemming and lemmatization is that stemming merely removes common suffixes from the end of word tokens, whereas lemmatization ensures that the output is an existing normalized form of the word (its lemma) that can be found in a dictionary.

In [None]:
stemmer_output=PorterStemmer()
print("stemming of 'happiness':",stemmer_output.stem('happiness'))
print("lemmatization of 'happiness':",lemmatizer_output.lemmatize('happiness'))

stemming of 'happiness': happi
lemmatization of 'happiness': happiness


### Similarity measure

Some of the most common ways to capture similarity between text units are:
- Longest Common Substring (LCS)
- Levenshtein Edit Distance
- Hamming Distance
- Cosine Similarity
- Jaccard Distance
- Euclidean Distance

As an example, we will use the Levenshtein edit distance between two strings s1 and s2. The edit distance is the number of characters that need to be substituted, inserted, or deleted to transform s1 into s2. For example, transforming "rain" into "shine" requires three steps, consisting of two substitutions and one insertion: "rain" -> "sain" -> "shin" -> "shine". These operations could be done in another order, but at least three steps are needed.

In [None]:
print("edit distance between 'relate' and 'relation':",edit_distance("relate","relation"))
print("edit distance between 'suggestion' and 'calculation':",edit_distance("suggestion","calculation"))


edit distance between 'relate' and 'relation': 3
edit distance between 'suggestion' and 'calculation': 7
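To make the definition concrete, here is a minimal pure-Python sketch of the classic dynamic-programming computation with unit costs. The function name levenshtein is an assumption for illustration; this is not the NLTK implementation.

```python
def levenshtein(s1, s2):
    """Edit distance with unit cost for insertion, deletion and substitution."""
    # previous[j] holds the distance between s1[:i-1] and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

print(levenshtein("rain", "shine"))       # 3
print(levenshtein("relate", "relation"))  # 3
```

Only two rows of the table are kept in memory, since each cell depends only on the current and the previous row.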


In [None]:
X=set([10,20,30,40])
Y=set([20,30,60])
print("Jaccard distance between X=[10,20,30,40] and Y=[20,30,60]:",jaccard_distance(X,Y))
print("Jaccard similarity between 'similarity with Jaccard coef' and 'similarity with Tanimoto':",td.jaccard('similarity with Jaccard coef'.split(), "similarity with Tanimoto".split()))

Jaccard distance between X=[10,20,30,40] and Y=[20,30,60]: 0.6
Jaccard similarity between 'similarity with Jaccard coef' and 'similarity with Tanimoto': 0.4
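The two calls above do not return the same kind of value: jaccard_distance from NLTK returns a distance (0 means identical), while calling td.jaccard returns a similarity (1 means identical). A minimal sketch on plain sets shows how the two relate; the helper names are assumptions for illustration.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

def jaccard_distance_sketch(a, b):
    """Share of the union that is not common to both sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return (len(a | b) - len(a & b)) / len(a | b)

X = {10, 20, 30, 40}
Y = {20, 30, 60}
print(jaccard_similarity(X, Y))       # 0.4
print(jaccard_distance_sketch(X, Y))  # 0.6

words_1 = "similarity with Jaccard coef".split()
words_2 = "similarity with Tanimoto".split()
print(jaccard_similarity(words_1, words_2))  # 0.4
```

When ranking pairs, a similarity is sorted in descending order and a distance in ascending order.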


### Frequency word analysis

Word frequencies are typically used for TF-IDF (Term Frequency-Inverse Document Frequency), a commonly used weighting scheme in Natural Language Processing (NLP) that quantifies the importance of a term in a document relative to a collection of documents. TF-IDF takes into account both the frequency of a term within a document (term frequency) and its rarity across the entire document collection (inverse document frequency).
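As a rough preview, one common variant of this weighting can be sketched in a few lines. The function name tf_idf, the toy corpus, and the choice of tf as a relative frequency and idf as log(N / df) are assumptions made for illustration; other variants exist.

```python
import math
from collections import Counter

def tf_idf(term, document, corpus):
    """tf = relative frequency in the document, idf = log(N / document frequency)."""
    tf = Counter(document)[term] / len(document)
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # term absent from the whole corpus
    return tf * math.log(len(corpus) / containing)

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["the", "queen", "reigned", "for", "seventy", "years"],
    ["the", "king", "abdicated"],
    ["the", "queen", "married", "the", "prince"],
]

print(tf_idf("the", corpus[0], corpus))        # 0.0, "the" is in every document
print(tf_idf("abdicated", corpus[1], corpus))  # rare term, positive weight
```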

We will see TF-IDF in another notebook.

Here we will write a small tool by hand to analyze word frequency.

In [None]:
text1 = "Elizabeth II (Elizabeth Alexandra Mary; 21 April 1926 – 8 September 2022) was Queen of the United Kingdom and other Commonwealth realms from 6 February 1952 until her death in 2022. She was queen regnant of 32 sovereign states during of his life and 15 at the time of his death. [a] Her reign of 70 years and 214 days is the longest of any British monarch and the longest recorded of any female head of state in history. "+"Elizabeth was born in Mayfair, London, as the first child of the Duke and Duchess of York (later King George VI and Queen Elizabeth). Her father came to the throne in 1936 upon the abdication of his brother, King Edward VIII, making Elizabeth heir presumptive. She was educated privately at home and began taking public office during World War II, serving in the Auxiliary Territorial Service. In November 1947, she married Philip Mountbatten, a former prince of Greece and Denmark, and their marriage lasted 73 years until his death in April 2021. They had four children: Charles, Anne, Andrew and Edward. "+"When her father died in February 1952, Elizabeth, then aged 25, became queen of seven independent Commonwealth countries: the United Kingdom, Canada, Australia, New Zealand, South Africa , Pakistan and Ceylon (known today as Sri Lanka). as well as head of the Commonwealth. Elizabeth reigned as a constitutional monarch through major political changes such as the Troubles in Northern Ireland, devolution in the United Kingdom, decolonization of Africa, and the United Kingdom's membership of the European Communities and withdrawal of the European Union. The number of its kingdoms varied over time as territories gained independence and some kingdoms became republics. His many historic visits and meetings include state visits to China in 1986, Russia in 1994 and the Republic of Ireland in 2011, as well as meetings with five popes. 
" +"Significant events include Elizabeth's coronation in 1953 and the celebrations of her silver, gold, diamond and platinum jubilees in 1977, 2002, 2012 and 2022, respectively. Elizabeth was the longest-serving British monarch and the second longest-reigning sovereign in world history, behind only Louis XIV of France. She sometimes faced republican sentiment and media criticism of her family, particularly after the breakdown of her children's marriages, her annus horribilis in 1992, and the death of her former daughter-in-law Diana, Princess of Wales, in 1997. However, support as the monarchy in the United Kingdom has remained consistently high, as has his personal popularity. Elizabeth died aged 96 at Balmoral Castle, Aberdeenshire in 2022, months after her platinum jubilee, and was succeeded by her eldest son, Charles III."
list_word1 =[]
text1=text1.lower()
english_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
tok1 = english_tokenizer.tokenize(text1)
for i in tok1:
    list_word1+=nltk.word_tokenize(i)



In [None]:
freq = dict((i, list_word1.count(i)) for i in set(list_word1))

In [None]:
too_freq_word = {k:v for k,v in freq.items() if v>=3 }#and len(k)<7}
too_freq_word

{'her': 11,
 "'s": 3,
 ',': 40,
 'was': 6,
 ')': 3,
 'a': 3,
 'monarch': 3,
 'of': 25,
 'kingdom': 5,
 'she': 4,
 'as': 11,
 'in': 20,
 '2022': 3,
 'his': 6,
 'commonwealth': 3,
 'elizabeth': 10,
 'queen': 4,
 'the': 25,
 '.': 15,
 'at': 3,
 'and': 23,
 'united': 5,
 'death': 4,
 '(': 3}
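The dictionary above calls list.count once per distinct word, so the token list is scanned many times. A sketch of the same count with collections.Counter from the standard library, on a short hypothetical token list:

```python
from collections import Counter

# Hypothetical token list standing in for list_word1.
tokens = ["the", "queen", "of", "the", "united", "kingdom", ",", "the", "queen"]

freq = Counter(tokens)  # single pass over the tokens

# Same kind of filter as above, with a threshold of 2 for this short list.
frequent = {word: count for word, count in freq.items() if count >= 2}

print(frequent)             # {'the': 3, 'queen': 2}
print(freq.most_common(1))  # [('the', 3)]
```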

In [None]:
too_freq_word = {k:v for k,v in freq.items() if v>=3 and len(k)>5 and not k.isnumeric()}
print("important and long words (that are generally meaningful) :",too_freq_word)


important and long words (that are generally meaningful) : {'monarch': 3, 'kingdom': 5, 'commonwealth': 3, 'elizabeth': 10, 'united': 5}


In [None]:
# frequent word sequences (bigrams)
print("frequent word sequences (bigrams):")
suite_word = list(nltk.bigrams(list_word1))
suite_word_pop = dict((i, suite_word.count(i)) for i in set(suite_word))
too_freq_suite_word = {k:v for k,v in suite_word_pop.items() if v>=3}
too_freq_suite_word

frequent word sequences (bigrams):


{('united', 'kingdom'): 5,
 ('.', 'elizabeth'): 4,
 ('of', 'the'): 5,
 ('and', 'the'): 6,
 ('of', 'his'): 3,
 ('in', 'the'): 3,
 ('of', 'her'): 4,
 ('as', 'the'): 3,
 (',', 'and'): 4,
 (',', 'as'): 3,
 ('the', 'united'): 5}
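Most of the frequent pairs above contain a stopword or a punctuation mark. A minimal sketch, on a hypothetical token list and with a hand-written stopword set, of counting bigrams with the standard library and keeping only pairs made of content words:

```python
from collections import Counter

tokens = ["the", "united", "kingdom", "and", "the", "united", "states"]

# Pair each token with its successor, in the spirit of nltk.bigrams.
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)

# Hypothetical stopword set; drop any pair that contains one of its words.
toy_stops = {"the", "and"}
content_bigrams = {
    pair: count for pair, count in bigram_counts.items()
    if not (set(pair) & toy_stops)
}

print(bigram_counts.most_common(1))  # [(('the', 'united'), 2)]
print(content_bigrams)  # {('united', 'kingdom'): 1, ('united', 'states'): 1}
```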

# Homework

Load two similar texts (text1 and text2).

In [None]:
text1 = "Elizabeth II (Elizabeth Alexandra Mary; 21 April 1926 – 8 September 2022) was Queen of the United Kingdom and other Commonwealth realms from 6 February 1952 until her death in 2022. She was queen regnant of 32 sovereign states during of his life and 15 at the time of his death. [a] Her reign of 70 years and 214 days is the longest of any British monarch and the longest recorded of any female head of state in history. "+"Elizabeth was born in Mayfair, London, as the first child of the Duke and Duchess of York (later King George VI and Queen Elizabeth). Her father came to the throne in 1936 upon the abdication of his brother, King Edward VIII, making Elizabeth heir presumptive. She was educated privately at home and began taking public office during World War II, serving in the Auxiliary Territorial Service. In November 1947, she married Philip Mountbatten, a former prince of Greece and Denmark, and their marriage lasted 73 years until his death in April 2021. They had four children: Charles, Anne, Andrew and Edward. "+"When her father died in February 1952, Elizabeth, then aged 25, became queen of seven independent Commonwealth countries: the United Kingdom, Canada, Australia, New Zealand, South Africa , Pakistan and Ceylon (known today as Sri Lanka). as well as head of the Commonwealth. Elizabeth reigned as a constitutional monarch through major political changes such as the Troubles in Northern Ireland, devolution in the United Kingdom, decolonization of Africa, and the United Kingdom's membership of the European Communities and withdrawal of the European Union. The number of its kingdoms varied over time as territories gained independence and some kingdoms became republics. His many historic visits and meetings include state visits to China in 1986, Russia in 1994 and the Republic of Ireland in 2011, as well as meetings with five popes. 
" +"Significant events include Elizabeth's coronation in 1953 and the celebrations of her silver, gold, diamond and platinum jubilees in 1977, 2002, 2012 and 2022, respectively. Elizabeth was the longest-serving British monarch and the second longest-reigning sovereign in world history, behind only Louis XIV of France. She sometimes faced republican sentiment and media criticism of her family, particularly after the breakdown of her children's marriages, her annus horribilis in 1992, and the death of her former daughter-in-law Diana, Princess of Wales, in 1997. However, support as the monarchy in the United Kingdom has remained consistently high, as has his personal popularity. Elizabeth died aged 96 at Balmoral Castle, Aberdeenshire in 2022, months after her platinum jubilee, and was succeeded by her eldest son, Charles III."

In [None]:
text1

"Elizabeth II (Elizabeth Alexandra Mary; 21 April 1926 – 8 September 2022) was Queen of the United Kingdom and other Commonwealth realms from 6 February 1952 until her death in 2022. She was queen regnant of 32 sovereign states during of his life and 15 at the time of his death. [a] Her reign of 70 years and 214 days is the longest of any British monarch and the longest recorded of any female head of state in history. Elizabeth was born in Mayfair, London, as the first child of the Duke and Duchess of York (later King George VI and Queen Elizabeth). Her father came to the throne in 1936 upon the abdication of his brother, King Edward VIII, making Elizabeth heir presumptive. She was educated privately at home and began taking public office during World War II, serving in the Auxiliary Territorial Service. In November 1947, she married Philip Mountbatten, a former prince of Greece and Denmark, and their marriage lasted 73 years until his death in April 2021. They had four children: Charl

In [None]:
text2 = "Elizabeth II (pronounced in French /elizabɛt/a; in English: Elizabeth II, pronounced /əˈlɪzəbəθ/b), born April 21, 1926 in Mayfair (London) and died September 8, 2022 at Balmoral Castle (Scotland), is queen of the United Kingdom of Great Britain and Northern Ireland and the other Commonwealth Realms from 6 February 1952 until his death. At birth, she was third in line to the throne after her uncle and her father. In 1936, his uncle became king but abdicated a few months later, leaving the throne to his younger brother. Princess Elizabeth then became, at the age of 10, heir presumptive to the British Crown. During the Second World War, she enlisted in the Auxiliary Territorial Service. On November 20, 1947, she married Philip Mountbatten, Prince of Greece and Denmark, with whom she had four children: Charles, Anne, Andrew and Edward. She acceded to the British throne on February 6, 1952, at the age of 25, on the death of George VI. His coronation, on June 2, 1953, was the first to be broadcast on television. She becomes the sovereign of seven independent Commonwealth states: South Africa, Australia, Canada, Ceylon, New Zealand, Pakistan and the United Kingdom. Between 1956 and 2021, the number of its kingdoms increases and decreases at the same time: colonies of the British Empire gain independence, choose whether or not to recognize Queen Elizabeth II as the symbolic sovereign of their new independent state ; some kingdoms also became republics. In the year of her death, in addition to the aforementioned Australia, Canada, New Zealand and the United Kingdom, Elizabeth II was Queen of Antigua and Barbuda, the Bahamas, Belize, Grenada , Jamaica, Papua New Guinea, Saint Kitts and Nevis, Saint Vincent and the Grenadines, Saint Lucia, Solomon Islands and Tuvalu. 
During a long reign in which she saw fifteen different British Prime Ministers succeed one another, she made numerous historic visits and oversaw several constitutional changes in her kingdoms, such as the devolution of power to the United Kingdom and the patriation of the Constitution of Canada . She also experienced difficult times, including the assassination of Prince Philip's uncle and mentor, Lord Mountbatten, in 1979, and the separations and divorce of three of her children in 1992 (a year she describes as annus horribilis). , the death of her daughter-in-law, Diana Spencer, in 1997, the almost simultaneous deaths of her sister and mother in 2002, as well as the death of her husband in 2021 after more than 73 years of marriage. Furthermore, the Queen has sometimes had to face harsh criticism of the royal family from the press, but support for the monarchy and her personal popularity remain high among the British population. On September 9, 2015, she became the longest-reigning British sovereign. She reigned for 70 years, 7 months and 2 days, exceeding the reign of her great-great-grandmother Queen Victoria (63 years, 7 months and 2 days). On October 13, 2016, following the death of Thailand's King Rama IX, she became the longest-reigning and oldest sovereign in office. At the start of June 2022, she becomes the first monarch in the history of the United Kingdom to celebrate her platinum jubilee, which marks the 70th anniversary of her accession to the throne. She then became the second monarch of the modern era to have had the longest reign, behind the King of France Louis XIV. She died three months later, on September 8, 2022, at the age of 96; his eldest son succeeded him under the name of Charles III."

In [None]:
text2

"Elizabeth II (pronounced in French /elizabɛt/a; in English: Elizabeth II, pronounced /əˈlɪzəbəθ/b), born April 21, 1926 in Mayfair (London) and died September 8, 2022 at Balmoral Castle (Scotland), is queen of the United Kingdom of Great Britain and Northern Ireland and the other Commonwealth Realms from 6 February 1952 until his death. At birth, she was third in line to the throne after her uncle and her father. In 1936, his uncle became king but abdicated a few months later, leaving the throne to his younger brother. Princess Elizabeth then became, at the age of 10, heir presumptive to the British Crown. During the Second World War, she enlisted in the Auxiliary Territorial Service. On November 20, 1947, she married Philip Mountbatten, Prince of Greece and Denmark, with whom she had four children: Charles, Anne, Andrew and Edward. She acceded to the British throne on February 6, 1952, at the age of 25, on the death of George VI. His coronation, on June 2, 1953, was the first to be b

In [None]:
text1=text1.lower()
text2=text2.lower()


In [None]:
english_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
tok1 = english_tokenizer.tokenize(text1)
tok2 = english_tokenizer.tokenize(text2)

In [None]:
tok1


['elizabeth ii (elizabeth alexandra mary; 21 april 1926 – 8 september 2022) was queen of the united kingdom and other commonwealth realms from 6 february 1952 until her death in 2022. she was queen regnant of 32 sovereign states during of his life and 15 at the time of his death.',
 '[a] her reign of 70 years and 214 days is the longest of any british monarch and the longest recorded of any female head of state in history.',
 'elizabeth was born in mayfair, london, as the first child of the duke and duchess of york (later king george vi and queen elizabeth).',
 'her father came to the throne in 1936 upon the abdication of his brother, king edward viii, making elizabeth heir presumptive.',
 'she was educated privately at home and began taking public office during world war ii, serving in the auxiliary territorial service.',
 'in november 1947, she married philip mountbatten, a former prince of greece and denmark, and their marriage lasted 73 years until his death in april 2021. they had
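In the output above, the first item still contains two sentences ("... in 2022. she was queen regnant ..."). One plausible cause is that the text was lowercased before sentence tokenization, which removes the capital letter that usually marks the start of a sentence. The sketch below illustrates the effect with a deliberately naive regex splitter (an assumption: it is a stand-in, not the Punkt algorithm) and shows the alternative order, splitting first and lowercasing afterwards.

```python
import re

def naive_split(text):
    """Split after ., ! or ? when the next word starts with a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

text = "She reigned until her death in 2022. She was queen regnant of 32 states."

# Lowercasing first removes the capitals the splitter relies on.
lower_then_split = naive_split(text.lower())

# Splitting first keeps the cue, then each sentence is lowercased.
split_then_lower = [sentence.lower() for sentence in naive_split(text)]

print(len(lower_then_split))  # 1
print(len(split_then_lower))  # 2
```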

## Find the two nearest sentences in texts 1 and 2

### First try with Levenshtein Edit Distance only

In [None]:
df = pd.DataFrame()
cal = pd.DataFrame()
# compare every sentence of text 1 with every sentence of text 2
for i in tok1 :
    for j in tok2 :
        # one-row frame holding the current pair and its edit distance
        cal["1_sentences"]= [i]
        cal["2_sentences"]= [j]
        cal["comparison_score_edit"] =[edit_distance(i,j)]
        # prepend the row, so the last pair compared ends up first
        df=pd.concat([cal,df], ignore_index=True)


df


Unnamed: 0,1_sentences,2_sentences,comparison_score_edit
0,"elizabeth died aged 96 at balmoral castle, abe...","she died three months later, on september 8, 2...",90
1,"elizabeth died aged 96 at balmoral castle, abe...",she then became the second monarch of the mode...,112
2,"elizabeth died aged 96 at balmoral castle, abe...","at the start of june 2022, she becomes the fir...",132
3,"elizabeth died aged 96 at balmoral castle, abe...","on october 13, 2016, following the death of th...",115
4,"elizabeth died aged 96 at balmoral castle, abe...","she reigned for 70 years, 7 months and 2 days,...",123
...,...,...,...
310,elizabeth ii (elizabeth alexandra mary; 21 apr...,"during the second world war, she enlisted in t...",228
311,elizabeth ii (elizabeth alexandra mary; 21 apr...,"princess elizabeth then became, at the age of ...",221
312,elizabeth ii (elizabeth alexandra mary; 21 apr...,"in 1936, his uncle became king but abdicated a...",218
313,elizabeth ii (elizabeth alexandra mary; 21 apr...,"at birth, she was third in line to the throne ...",220
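Building the frame by concatenating one row at a time works, but every concat copies the rows accumulated so far. A sketch of an alternative is to collect all pairs in a list of dictionaries first. The helper names and the shared_words score are assumptions for illustration; any scoring function, including an edit distance, could be passed instead.

```python
from itertools import product

def pairwise_rows(sentences_1, sentences_2, score):
    """One dictionary per (sentence of text 1, sentence of text 2) pair."""
    return [
        {"1_sentences": s1, "2_sentences": s2, "comparison_score": score(s1, s2)}
        for s1, s2 in product(sentences_1, sentences_2)
    ]

def shared_words(s1, s2):
    """Hypothetical stand-in score: number of distinct words in common."""
    return len(set(s1.split()) & set(s2.split()))

tok_a = ["she married philip", "she died in 2022"]
tok_b = ["philip died in 2021", "she married a prince", "the queen died"]

rows = pairwise_rows(tok_a, tok_b, shared_words)
print(len(rows))  # 6, one row per pair
# pd.DataFrame(rows) would turn this list into a frame in a single step.
```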


In [None]:
df.sort_values(by=["comparison_score_edit"], ascending=False).head(10)

Unnamed: 0,1_sentences,2_sentences,comparison_score_edit
26,she sometimes faced republican sentiment and m...,"on september 9, 2015, she became the longest-r...",313
34,she sometimes faced republican sentiment and m...,"his coronation, on june 2, 1953, was the first...",310
37,she sometimes faced republican sentiment and m...,"during the second world war, she enlisted in t...",308
167,as well as head of the commonwealth.,elizabeth ii (pronounced in french /elizabɛt/a...,304
40,she sometimes faced republican sentiment and m...,"at birth, she was third in line to the throne ...",304
38,she sometimes faced republican sentiment and m...,"princess elizabeth then became, at the age of ...",301
39,she sometimes faced republican sentiment and m...,"in 1936, his uncle became king but abdicated a...",298
35,she sometimes faced republican sentiment and m...,she acceded to the british throne on february ...,289
24,she sometimes faced republican sentiment and m...,"on october 13, 2016, following the death of th...",289
41,she sometimes faced republican sentiment and m...,elizabeth ii (pronounced in french /elizabɛt/a...,288


In [None]:
print("According to Levenshtein Edit Distance the two nearest sentence in the text are:")
print("-",df.sort_values(by=["comparison_score_edit"], ascending=False).head(1)["1_sentences"].values[0])
print("-",df.sort_values(by=["comparison_score_edit"], ascending=False).head(1)["2_sentences"].values[0])

According to Levenshtein Edit Distance the two nearest sentence in the text are:
- she sometimes faced republican sentiment and media criticism of her family, particularly after the breakdown of her children's marriages, her annus horribilis in 1992, and the death of her former daughter-in-law diana, princess of wales, in 1997. however, support as the monarchy in the united kingdom has remained consistently high, as has his personal popularity.
- on september 9, 2015, she became the longest-reigning british sovereign.


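It doesn't work. Note that the cells above sort with ascending=False, so they list the pairs with the largest edit distance, that is, the least similar ones; under this metric the nearest pair is the one with the smallest distance. Even then, a character-level edit distance is strongly influenced by sentence length and exact wording rather than meaning. So we will try another method.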
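A small sketch of the sort direction, using the first five scores from the table above (sentences abbreviated exactly as displayed). With a distance, the nearest pair is the one with the smallest score, which corresponds to ascending=True in sort_values.

```python
# First five rows of the table above (sentence text truncated as displayed).
rows = [
    {"2_sentences": "she died three months later, on september 8, 2...", "comparison_score_edit": 90},
    {"2_sentences": "she then became the second monarch of the mode...", "comparison_score_edit": 112},
    {"2_sentences": "at the start of june 2022, she becomes the fir...", "comparison_score_edit": 132},
    {"2_sentences": "on october 13, 2016, following the death of th...", "comparison_score_edit": 115},
    {"2_sentences": "she reigned for 70 years, 7 months and 2 days,...", "comparison_score_edit": 123},
]

# A distance is small when the strings are close: the nearest pair is the minimum.
nearest = min(rows, key=lambda row: row["comparison_score_edit"])
farthest = max(rows, key=lambda row: row["comparison_score_edit"])

print(nearest["comparison_score_edit"], nearest["2_sentences"])
print(farthest["comparison_score_edit"], farthest["2_sentences"])
```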

### Second try : successful

I analyze the sentences by splitting them into lists of keywords, which I then compare.
Process:
- build the list of words of each sentence, removing symbols and useless words, to keep only the keywords
- stem each keyword, so that grammar does not influence the comparison
- compare the keywords of each sentence of text 1 with the keywords of each sentence of text 2

In [None]:
df["1_sentences_processing"] = df["1_sentences"].apply(lambda text : text.replace("(","").replace(")","").replace(",","").replace(":","").replace(";",""))
df["2_sentences_processing"] = df["2_sentences"].apply(lambda text : text.replace("(","").replace(")","").replace(",","").replace(":","").replace(";",""))

tokenizer=RegexpTokenizer("[a-zA-Z]+")
df["1_sentences_processing"] = df["1_sentences_processing"].apply(lambda text : tokenizer.tokenize(text))
df["2_sentences_processing"] = df["2_sentences_processing"].apply(lambda text : tokenizer.tokenize(text))

stopsen = set(stopwords.words("english"))
df["1_sentences_processing"] = df["1_sentences_processing"].apply(lambda words : [word for word in words if word not in stopsen])
df["2_sentences_processing"] = df["2_sentences_processing"].apply(lambda words : [word for word in words if word not in stopsen])

df["1_sentences_processing"] = df["1_sentences_processing"].apply(lambda words : [stemmer.stem(x) for x in words])
df["2_sentences_processing"] = df["2_sentences_processing"].apply(lambda words : [stemmer.stem(x) for x in words])

df

Unnamed: 0,1_sentences,2_sentences,comparison_score_edit,1_sentences_processing,2_sentences_processing
0,"elizabeth died aged 96 at balmoral castle, abe...","she died three months later, on september 8, 2...",90,"[elizabeth, die, age, balmor, castl, aberdeens...","[die, three, month, later, septemb, age, eldes..."
1,"elizabeth died aged 96 at balmoral castle, abe...",she then became the second monarch of the mode...,112,"[elizabeth, die, age, balmor, castl, aberdeens...","[becam, second, monarch, modern, era, longest,..."
2,"elizabeth died aged 96 at balmoral castle, abe...","at the start of june 2022, she becomes the fir...",132,"[elizabeth, die, age, balmor, castl, aberdeens...","[start, june, becom, first, monarch, histori, ..."
3,"elizabeth died aged 96 at balmoral castle, abe...","on october 13, 2016, following the death of th...",115,"[elizabeth, die, age, balmor, castl, aberdeens...","[octob, follow, death, thailand, king, rama, i..."
4,"elizabeth died aged 96 at balmoral castle, abe...","she reigned for 70 years, 7 months and 2 days,...",123,"[elizabeth, die, age, balmor, castl, aberdeens...","[reign, year, month, day, exceed, reign, great..."
...,...,...,...,...,...
310,elizabeth ii (elizabeth alexandra mary; 21 apr...,"during the second world war, she enlisted in t...",228,"[elizabeth, ii, elizabeth, alexandra, mari, ap...","[second, world, war, enlist, auxiliari, territ..."
311,elizabeth ii (elizabeth alexandra mary; 21 apr...,"princess elizabeth then became, at the age of ...",221,"[elizabeth, ii, elizabeth, alexandra, mari, ap...","[princess, elizabeth, becam, age, heir, presum..."
312,elizabeth ii (elizabeth alexandra mary; 21 apr...,"in 1936, his uncle became king but abdicated a...",218,"[elizabeth, ii, elizabeth, alexandra, mari, ap...","[uncl, becam, king, abdic, month, later, leav,..."
313,elizabeth ii (elizabeth alexandra mary; 21 apr...,"at birth, she was third in line to the throne ...",220,"[elizabeth, ii, elizabeth, alexandra, mari, ap...","[birth, third, line, throne, uncl, father]"


In [None]:
df["Jaccard_score"] = df.apply(lambda x : td.jaccard(x["1_sentences_processing"],x["2_sentences_processing"]),axis=1)
df = df.sort_values(by=["Jaccard_score"], ascending=False).reset_index(drop=True)
print("Nearest sentences with Jaccard score")
for i in range(3):
  print(f"-{i+1}-")
  print("- ",df.iloc[i,0])
  print("- ",df.iloc[i,1])

Nearest sentences with Jaccard score
-1-
-  in november 1947, she married philip mountbatten, a former prince of greece and denmark, and their marriage lasted 73 years until his death in april 2021. they had four children: charles, anne, andrew and edward.
-  on november 20, 1947, she married philip mountbatten, prince of greece and denmark, with whom she had four children: charles, anne, andrew and edward.
-2-
-  when her father died in february 1952, elizabeth, then aged 25, became queen of seven independent commonwealth countries: the united kingdom, canada, australia, new zealand, south africa , pakistan and ceylon (known today as sri lanka).
-  she becomes the sovereign of seven independent commonwealth states: south africa, australia, canada, ceylon, new zealand, pakistan and the united kingdom.
-3-
-  elizabeth died aged 96 at balmoral castle, aberdeenshire in 2022, months after her platinum jubilee, and was succeeded by her eldest son, charles iii.
-  she died three months late

This works quite well.
Jaccard is good at matching the nearest sentences by meaning because it only looks at which keywords the sentences share, not at where the words appear in the sentence.
So it is rather practical when the authors do not write things the same way.
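This order-independence can be checked with a minimal sketch that contrasts a set-based Jaccard score with a position-by-position comparison. Both helpers and the keyword lists are assumptions for illustration; the notebook itself uses td.jaccard.

```python
def keyword_jaccard(words_1, words_2):
    """Jaccard similarity between two keyword lists, ignoring order and repeats."""
    a, b = set(words_1), set(words_2)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def positional_match(words_1, words_2):
    """Share of positions holding the same word (order-sensitive)."""
    pairs = list(zip(words_1, words_2))
    return sum(w1 == w2 for w1, w2 in pairs) / len(pairs) if pairs else 0.0

# Same hypothetical stemmed keywords, written in a different order.
sentence_a = ["marri", "philip", "mountbatten", "princ", "greec", "denmark"]
sentence_b = ["princ", "greec", "denmark", "philip", "mountbatten", "marri"]

print(keyword_jaccard(sentence_a, sentence_b))   # 1.0
print(positional_match(sentence_a, sentence_b))  # 0.0
```

The set-based score treats the two sentences as identical, while the positional score sees no match at all.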