# Search system for contextually close texts

In [1]:
from PyPDF2 import PdfReader

import pandas as pd
import numpy as np
import random

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances
import re

#### Download books of various genres from open sources, create a dataset

In [2]:
reader = PdfReader("books/English-Phonetics-and-Phonology.-An-Introduction-PDFDrive-.pdf")
  
# getting a specific page from the pdf file
page = reader.pages[10]
  
# extracting text from page
text = page.extract_text()
print(text)

Wiley also publishes its books in a variety of electronic
formats. Some content that appears in print may not be
available in electronic books.
Designations used by companies to distinguish their products
are often claimed as trademarks. All brand names and product
names used in this book are trade names, service marks,
trademarks or registered trademarks of their respective owners.
The publisher is not associated with any product or vendor
mentioned in this book. This publication is designed to provide
accurate and authoritative information in regard to the subject
matter covered. It is sold on the understanding that the
publisher is not engaged in rendering professional services. If
professional advice or other expert assistance is required, the
services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data
Carr, Philip, 1953–
English phonetics and phonology : an introduction / Philip Carr.
— Second edition.
pages cm
Includes bibliographical

In [3]:
books_path = "books/"

data = {
    "genre": [
        "Non-fiction",
        "Non-fiction",
        "Fantasy",
        "Novel",
        "Novel",
        "Non-fiction",
        "Non-fiction",
        "Novel",
        "Adventure",
        "Novel",
        "Religion",
        "Religion",
        "Religion",
        "Religion",
        "Romance",
        "Romance",
        "Romance",
        "Romance",
        "Fiction",
        "Fiction",
        "Non-fiction"
    ],
    "description": [
        "English Phonetics and Phonology is an excellent text for individuals who have no prior understanding of the topic as it is an introductory text that is straightforward to understand regarding the phonological structure of the English language.\nThis leading textbook in the market teaches undergraduate students and those whose first language is not English the fundamentals of articulatory phonetics and phonology in an interesting and straightforward way.",
        "The Second Edition of Lobsters: Biology, Management, Fisheries, and Aquaculture, which Bruce Phillips is editing, delivers exhaustive coverage of these fascinating creatures, stretching from growth and development to management and conservation. The book is being published under Lobsters: Biology, Management, Fisheries, and Aquaculture. Several chapters included in the First Edition and covered topics such as Growth, Reproduction, Diseases, Behaviour, Nutrition, Larval and Post-Larval Ecology, and Juvenile and Adult Ecology have been removed and replaced with new chapters.",
        "Lissa Dragomir is a Moroi princess: a mortal vampire with a rare aptitude for channelling the earth’s magic. Strigoi are the deadliest vampires and must always be kept away from her. Lissa’s best friend, Rose Hathaway, is a dhampir due to the potent combination of human and vampire blood that courses through her veins. Rose has devoted her life to the perilous mission of shielding Lissa from the Strigoi, who are intent on assimilating her into their kind.",
        "Toni was adamant in her protest, and it persisted. I have no idea what type of a person you think I am or what you believe you have the power to do to me. You are not going to be successful in keeping me here!",
        "Ruthless. Meticulous. Arrogant. \nControl is essential to Dante Russo’s happiness, both in his personal and professional life. \nThe prospect of blackmail compels the rich CEO to enter into an engagement with a woman he knows very little about, despite the fact that he had no intention of ever getting married.",
        "The book Digital Marketing: Strategy, Implementation, and Practice, which is now in its sixth edition, gives thorough and useful advice to businesses on how they can fulfill their marketing objectives by making the most of the opportunities presented by digital media and technology. \nStudents will gain a better understanding of how digital marketing functions in the real world by participating in case studies and interviews with representatives from forward-thinking companies like eBay and Facebook, which are covered in the course Digital Marketing, which brings together the theory and practice of marketing with actual business experience.",
        "More than 15 million copies of How to Win Friends and Influence People have been sold since the book’s initial publication in 1936. The first book published by Dale Carnegie is an all-time best-seller packed with rock-solid wisdom that has helped thousands of people climb the ladder of success in their personal and professional lives. \n The teachings of Dale Carnegie are timeless and will assist you in reaching your full potential in this complex and cutthroat era. They are as relevant today as they were when Carnegie first published them.",
        "Lilli DeForrest has no idea what to anticipate when her longtime friend Caroline extends an invitation to visit Exotica for a week of being pampered and having fun there. But as soon as Rajan enters her suite, their attraction is powerful, immediate, and overpowering. \nHe is the perfect lover for her in equal parts sensitive and commanding. However, Lilli is about to learn that Rajan’s deft touch is only the beginning of what he has in store for her.",
        "The Aeneid (Oxford World’s Classics): The new translation that was done by Frederick Ahl accomplishes something that has never been accomplished before: it captures the enthusiasm, poetic vitality, and intellectual drive of the original.",
        "Its all right, child, she consoled him. Now tell me who it is. When Aureliano told her, Pilar Ternera let out a deep laugh, the old expansive laugh that ended up as a cooing of doves.\nThere was no mystery in the heart of a Buendía that was impenetrable for her because a century of cards and experience had taught her that the history of the family was a machine with unavoidable repetitions, a turning wheel that would have gone on spilling into eternity were it not for the progressive and irremediable wearing of the axle.",
        "The book Secrets of the Secret Place by Bob Sorge has one goal: to stoke your fire to find the secret spot with God.",
        "Heaven Is for Real describes the events that took place during Colton’s trip to the afterlife as well as his family’s struggle to come to terms with the fact that their son. Colton Burpo comes out of surgery that saved his life with astonishing stories of his visit to paradise, his family is at a loss for what to believe. He was 4-year-old when this happened.",
        "guided by the example of Christ in the treatment of enemies; therefore they cannot be agreeable to the will of God, and therefore their overthrow by a spiritual regeneration of their subjects is inevitable.\n“We regard as unchristian and unlawful not only all wars, whether offensive or defensive, but all preparations for war; every naval ship, every arsenal, every fortification, we regard as unchristian and unlawful; the existence of any kind of standing army, all military chieftains, all monuments commemorative of victory over a fallen foe, all trophies won in battle, all celebrations in honor of military exploits, all appropriations for defense by arms; we regard as unchristian and unlawful every edict of government requiring of its subjects military service.",
        "Someone needs to speak the unadulterated truth about the Bible. It is not in the preachers’ best interest to do so because they fear being removed from their pulpits. College professors cannot risk doing so since doing so would result in a reduction in their income. Politicians dare not.",
        "A conscious marriage is created by bringing into awareness the unconscious directives and purposes of a romantic or love marriage. Love marriage is defined as a voluntary union of two individuals based upon a romantic attraction that is stirred by unconscious needs that have their roots in unresolved childhood issues.\nLove marriages have existed throughout history, but they have not been the dominant cultural form of marriage until the latter part of the nineteenth century, and then largely in the Western world.",
        "…uty clung to her father in terror, which became all the greater when she saw how frightened he was. But when the Beast really appeared, though she trembled at the sight of him, she made a great effort to hide her horror, and saluted him respectfully.\nThis evidently pleased the Beast. After looking at her he said, in a tone that might have struck terror into the boldest heart, though he did not seem to be angry:\n“Good-evening, old man. Good-evening, Beauty.”",
        "“An incredible tale of love and intrigue that is on par with, if not superior to, the other exciting tales that Mrs. Rinehart has written in terms of fascination.\nThe book is one of the liveliest to be published this year, and it will contribute to the author’s growing reputation as a creator of unconventional storylines.” That would be the Philadelphia Record.",
        "Dr. Gabe Allen is a man of his word when it comes to not dating his coworkers, but after meeting ER nurse Larissa Brockman, he is faced with the decision of whether or not to breach his promise.\nGabe is drawn back to the church he had abandoned because of Larissa’s strong religious beliefs; yet, when their lives are in danger, he realizes that it is Larissa who has the most to gain from having a deeper understanding of what it means to forgive.",
        "In this breathtaking ninth book of the Keeper of the Lost Cities series, which has been a bestseller both in the New York Times and in USA TODAY, Sophie and her friends learn the real meaning of power—and of evil.\nThe game was altered because of Sophie Foster. Now she is forced to make a choice between two evils",
        "Percy Jackson, who is only twelve years old, is currently on the most perilous journey of his life. Percy must travel around the United States with the assistance of a satyr and a daughter of Athena in order to apprehend a thief who has made off with the original weapon of mass destruction, which is Zeus’ master bolt.",
        "Rich Dad Poor Dad is Robert Kiyosaki’s autobiography about his upbringing with two fathers — his biological father and the father of his best friend, whom he refers to as his “rich dad” — and the ways in which the perspectives of both fathers influenced Robert’s perspectives on financial matters and the stock market. This book debunks the notion that one must have a significant income in order to become wealthy. It also clarifies the distinction between working for money and allowing your money to work for you."
    ],
    "filename": [
        "English-Phonetics-and-Phonology.-An-Introduction-PDFDrive-.pdf",
        "Lobsters_ Biology, Management, Aquaculture & Fisheries ( PDFDrive ).pdf",
        "Vampire-Academy-Vampire-Academy-Book-1-PDFDrive-.pdf",
        "Sweet-Revenge-PDFDrive-.pdf",
        "King-of-Wrath-by-Ana-Huang.pdf",
        "Digital Marketing_ Strategy, Implementation and practice ( PDFDrive ).pdf",
        "How-To-Win-Friends-and-Influence-People-PDFDrive-1.pdf",
        "Exotica-Seven-Days-of-Kama-Sutra-9-Days-of-Arabian-Nights-PDFDrive-.pdf",
        "The-Aeneid-Oxford-Worlds-Classics-PDFDrive-.pdf",
        "One-Hundred-Years-of-Solitude.pdf",
        "Secrets-of-the-Secret-Place-Sorge.pdf",
        "Heaven-is-for-Real-_-A-Little-Boys-Astounding-Story-of-His-Trip-to-Heaven-and-Back-PDFDrive-.pdf",
        "The-Kingdom-of-God-Is-Within-You-1.pdf",
        "About-the-Holy-Bible.pdf",
        "Getting-The-Love-You-Want-PDF-Download-Free.pdf",
        "Beauty-and-the-Beast-3.pdf",
        "The-After-House.pdf",
        "Healing-Her-Heart.pdf",
        "Keeper-Of-The-Lost-Cities-PDFDrive-.pdf",
        "THE-LIGHTNING-THIEF-Percy-Jackson-and-the-Olympians-Book-1-Rick-PDFDrive-.pdf",
        "Rich Dad Poor Dad ( 13streamDrive ).pdf"
    ],
    "text": []
}

In [6]:
def extract_book_text(fpath):
    """Return the concatenated text of all pages of the PDF at fpath."""
    reader = PdfReader(fpath)
    text = ""

    for page in reader.pages:
        page_text = page.extract_text() or ""  # treat pages with no extractable text as empty
        try:
            page_text.encode("utf-8")  # skip pages with characters that cannot be encoded as UTF-8
            text += page_text + "\n"
        except UnicodeEncodeError:
            pass
    return text


for fname in data["filename"]:
    text = extract_book_text(books_path + fname)
    data["text"].append(text)
    print(fname)

English-Phonetics-and-Phonology.-An-Introduction-PDFDrive-.pdf
Lobsters_ Biology, Management, Aquaculture & Fisheries ( PDFDrive ).pdf
Vampire-Academy-Vampire-Academy-Book-1-PDFDrive-.pdf
Sweet-Revenge-PDFDrive-.pdf
King-of-Wrath-by-Ana-Huang.pdf
Digital Marketing_ Strategy, Implementation and practice ( PDFDrive ).pdf
How-To-Win-Friends-and-Influence-People-PDFDrive-1.pdf
Exotica-Seven-Days-of-Kama-Sutra-9-Days-of-Arabian-Nights-PDFDrive-.pdf
The-Aeneid-Oxford-Worlds-Classics-PDFDrive-.pdf
One-Hundred-Years-of-Solitude.pdf
Secrets-of-the-Secret-Place-Sorge.pdf
Heaven-is-for-Real-_-A-Little-Boys-Astounding-Story-of-His-Trip-to-Heaven-and-Back-PDFDrive-.pdf
The-Kingdom-of-God-Is-Within-You-1.pdf
About-the-Holy-Bible.pdf
Getting-The-Love-You-Want-PDF-Download-Free.pdf
Beauty-and-the-Beast-3.pdf
The-After-House.pdf
Healing-Her-Heart.pdf
Keeper-Of-The-Lost-Cities-PDFDrive-.pdf
THE-LIGHTNING-THIEF-Percy-Jackson-and-the-Olympians-Book-1-Rick-PDFDrive-.pdf
Rich Dad Poor Dad ( 13streamDrive ).

In [7]:
data["text"][0][:1000]

'\nContents\nSound Recordings\nPrefaces to the First Edition\nPreface to the Second Edition\nAcknowledgements\nFigure 1 The organs of speech\nFigure 2 The International Phonetic\nAlphabet\n1 English Phonetics: Consonants (i)\n1.1 Airstream and Articulation\n1.2 Place of Articulation\n1.3 Manner of Articulation Stops, Fricatives\nand Approximants\nNotes\nExercises\n2 English Phonetics: Consonants (ii)\n2.1 Central vs Lateral\n2.2 Taps and Trills\n2.3 Secondary Articulation\n2.4 Affricates\n2.5 Aspiration\n2.6 Nasal Stops\nNotes\nExercises\n3 English Phonetics: Vowels (i)\n3.1 The Primary Cardinal Vowels\n3.2 RP and GA Short Vowels\nExercises\n4 English Phonetics: Vowels (ii)\n4.1 RP and GA Long Vowels\n4.2 RP and GA Diphthongs\nNotes\nExercises\n5 The Phonemic Principle\n5.1 Introduction Linguistic Knowledge\n5.2 Contrast vs Predictability: The Phoneme\n5.3 Phonemes, Allophones and Contexts\n5.4 Summing Up\nNotes\nExercises\n6 English Phonemes\n6.1 English Consonant Phonemes\n6.2 The Ph

In [10]:
# save the dataset to a file
df = pd.DataFrame.from_dict(data)
df.to_csv("books_dataset.csv", index=False)
df

Unnamed: 0,genre,description,filename,text
0,Non-fiction,English Phonetics and Phonology is an excellen...,English-Phonetics-and-Phonology.-An-Introducti...,\nContents\nSound Recordings\nPrefaces to the ...
1,Non-fiction,"The Second Edition of Lobsters: Biology, Manag...","Lobsters_ Biology, Management, Aquaculture & F...","LOBSTERS: BIOLOGY , \nMANAGEMENT, AQUACULTURE ..."
2,Fantasy,Lissa Dragomir is a Moroi princess: a mortal v...,Vampire-Academy-Vampire-Academy-Book-1-PDFDriv...,\nTable\tof\tContents\n\t\nTitle\tPage\nCopyri...
3,Novel,"Toni was adamant in her protest, and it persis...",Sweet-Revenge-PDFDrive-.pdf,\nSWEET\tREVENGE\n\t\nAnne\tMather\n\t\n\t\nTh...
4,Novel,Ruthless. Meticulous. Arrogant. \nControl is e...,King-of-Wrath-by-Ana-Huang.pdf,"\nAna Huang is a USA Today, Publishers Weekly..."
5,Non-fiction,"The book Digital Marketing: Strategy, Implemen...","Digital Marketing_ Strategy, Implementation an...",\nDigital Marketing\nA01_CHAF7611_06_SE_FM.ind...
6,Non-fiction,More than 15 million copies of How to Win Frie...,How-To-Win-Friends-and-Influence-People-PDFDri...,\nCONTENTS\nCover\nAbout\tthe\tAuthor\nAlso\tb...
7,Novel,Lilli DeForrest has no idea what to anticipate...,Exotica-Seven-Days-of-Kama-Sutra-9-Days-of-Ara...,\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n ...
8,Adventure,The Aeneid (Oxford World’s Classics): The new ...,The-Aeneid-Oxford-Worlds-Classics-PDFDrive-.pdf,\nAeneid\nThis page intentionally left blank \...
9,Novel,"Its all right, child, she consoled him. Now te...",One-Hundred-Years-of-Solitude.pdf,\nONE\tHUNDRED\nYEARS\tOF\nSOLITUDE\n\t\t\t\t\...


#### Clean the text

In [16]:
# Requires the NLTK tokenizer, stop-word and WordNet resources to be downloaded.
STOP_WORDS = set(stopwords.words("english"))  # build the stop-word set once, not once per token
lemmatizer = nltk.WordNetLemmatizer()


def preprocess_text(raw_text):
    text = raw_text.lower()
    text = re.sub("[^a-zA-Z]", " ", text)  # keep only alphabetic characters

    # tokenize into words
    words = word_tokenize(text)

    # remove stop words
    words = [word for word in words if word not in STOP_WORDS]

    # lemmatize each word once and drop single-letter leftovers
    lemmas = (lemmatizer.lemmatize(word) for word in words)
    return " ".join(lemma for lemma in lemmas if len(lemma) > 1)

In [17]:
preprocess_text(df["text"][0][:1000])

'content sound recording preface first edition preface second edition acknowledgement figure organ speech figure international phonetic alphabet english phonetics consonant airstream articulation place articulation manner articulation stop fricative approximants note exercise english phonetics consonant ii central lateral tap trill secondary articulation affricate aspiration nasal stop note exercise english phonetics vowel primary cardinal vowel rp ga short vowel exercise english phonetics vowel ii rp ga long vowel rp ga diphthong note exercise phonemic principle introduction linguistic knowledge contrast predictability phoneme phoneme allophone context summing note exercise english phoneme english consonant phoneme phonological form morpheme english vow'

In [18]:
preprocessed_texts = [preprocess_text(df["text"][i]) for i in range(len(df))]

#### Create the TF-IDF "bag of words" model for single words

In [19]:
count = TfidfVectorizer(use_idf=True,      # weight each term by its inverse document frequency, so words found in many books count less
                        smooth_idf=False,  # use the unsmoothed IDF
                        ngram_range=(1, 1), stop_words='english')

matrix = count.fit_transform(preprocessed_texts).toarray()
matrix

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.00018882, 0.        , 0.        , ..., 0.00045573, 0.        ,
        0.00159507],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.00122431, 0.00122431, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [20]:
matrix.shape

(21, 50160)
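
For reference, the weighting that `TfidfVectorizer` applies with `use_idf=True`, `smooth_idf=False` and its default `norm='l2'` can be sketched by hand in NumPy. The tiny count matrix below is made up for illustration and is not taken from the books dataset; the sketch also ignores tokenization and stop-word handling.

```python
import numpy as np

# Hypothetical term counts: 3 documents (rows) x 4 terms (columns).
counts = np.array([
    [2, 1, 0, 0],
    [1, 0, 3, 0],
    [1, 0, 0, 4],
], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)    # number of documents containing each term
idf = np.log(n_docs / doc_freq) + 1.0  # unsmoothed IDF, as with smooth_idf=False

tfidf = counts * idf                                    # term frequency times IDF
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)   # L2-normalise each row

print(np.round(idf, 3))
print(np.round(tfidf, 3))
```

The first term occurs in every document, so its IDF is exactly 1 and it contributes the least to each row; terms unique to one document get the largest weight.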

#### Compare the Euclidean distances between the selected book and the others

In [29]:
book_idx = random.randint(0, len(df) - 1)
print("Randomly chose the book number", book_idx + 1)
distances = euclidean_distances(np.array(matrix), np.array([matrix[book_idx]]))
distances

Randomly chose the book number 6


array([[1.38506707],
       [1.35947227],
       [1.39137737],
       [1.40578048],
       [1.37201102],
       [0.        ],
       [1.29969291],
       [1.38551058],
       [1.38831708],
       [1.39396315],
       [1.37644894],
       [1.39965669],
       [1.35400786],
       [1.38949753],
       [1.33000749],
       [1.3678152 ],
       [1.38462056],
       [1.39265449],
       [1.39679159],
       [1.3868723 ],
       [1.34057669]])
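
All non-zero distances above fall between roughly 1.30 and 1.41, just under √2 ≈ 1.414. This is most likely a consequence of the L2 normalisation that `TfidfVectorizer` applies by default: for unit-length vectors the squared Euclidean distance equals 2 − 2·cos, so books that share little weighted vocabulary end up close to √2 apart. A small NumPy check of that identity, using made-up non-negative vectors rather than the real matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up non-negative vectors standing in for TF-IDF rows.
a = rng.random(1000)
b = rng.random(1000)

# L2-normalise, as TfidfVectorizer does with its default norm="l2".
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = float(a @ b)
euclid = float(np.linalg.norm(a - b))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = bool(np.isclose(euclid ** 2, 2 - 2 * cosine))

# Orthogonal unit vectors (no shared terms) are sqrt(2) apart.
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
max_distance = float(np.linalg.norm(e1 - e2))

print(identity_holds, max_distance)
```

Under this assumption, ranking books by Euclidean distance gives the same order as ranking them by cosine similarity.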

In [30]:
pd.set_option('display.max_colwidth', None)  # so that the text is displayed in full

print("MY BOOK: " + df["filename"][book_idx])
print("GENRE: " + df["genre"][book_idx])
print("DESCRIPTION:")
print(df["description"][book_idx])
print("-----------------------------------------------------------------------------------\n\n")
print("                                     SIMILAR BOOKS")

idxs = np.argsort(distances, axis=0)[1:6].ravel()  # skip position 0: the nearest book is the selected book itself
for i in idxs:
    print("BOOK: " + df["filename"][i])
    print("GENRE: " + df["genre"][i])
    print("DESCRIPTION:")
    print(df["description"][i])
    print("\n\n")

MY BOOK: Digital Marketing_ Strategy, Implementation and practice ( PDFDrive ).pdf
GENRE: Non-fiction
DESCRIPTION:
The book Digital Marketing: Strategy, Implementation, and Practice, which is now in its sixth edition, gives thorough and useful advice to businesses on how they can fulfill their marketing objectives by making the most of the opportunities presented by digital media and technology. 
Students will gain a better understanding of how digital marketing functions in the real world by participating in case studies and interviews with representatives from forward-thinking companies like eBay and Facebook, which are covered in the course Digital Marketing, which brings together the theory and practice of marketing with actual business experience.
-----------------------------------------------------------------------------------


                                     SIMILAR BOOKS
BOOK: How-To-Win-Friends-and-Influence-People-PDFDrive-1.pdf
GENRE: Non-fiction
DESCRIPTION:
More th

#### Create the TF-IDF "bag of words" model for bigrams

In [31]:
count2 = TfidfVectorizer(use_idf=True, 
                        smooth_idf=False,  
                        ngram_range=(2,2), stop_words='english') 

matrix2 = count2.fit_transform(preprocessed_texts).toarray()
matrix2.shape

(21, 725984)

In [32]:
distances = euclidean_distances(np.array(matrix2), np.array([matrix2[book_idx]]))
distances

array([[1.41261686e+00],
       [1.38197851e+00],
       [1.41333554e+00],
       [1.41388509e+00],
       [1.41122832e+00],
       [5.27467430e-07],
       [1.41120465e+00],
       [1.41323048e+00],
       [1.41350674e+00],
       [1.41381660e+00],
       [1.41348616e+00],
       [1.41330165e+00],
       [1.41262222e+00],
       [1.41402424e+00],
       [1.41103387e+00],
       [1.41286109e+00],
       [1.41353653e+00],
       [1.41371372e+00],
       [1.41297565e+00],
       [1.41328801e+00],
       [1.41186962e+00]])
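
The self-distance above is about 5e-07 rather than exactly 0, presumably because of floating-point rounding. Slicing `[1:6]` after the sort assumes the selected book always sorts first, which is a little fragile. One possible alternative, sketched here with a hypothetical helper `top_k_similar` that is not part of the original notebook, removes the query index explicitly:

```python
import numpy as np

def top_k_similar(distances, query_idx, k=5):
    """Return indices of the k nearest items, excluding the query itself."""
    flat = np.asarray(distances).ravel()     # accept (n, 1) or (n,) input
    order = np.argsort(flat, kind="stable")  # ascending distance
    order = order[order != query_idx]        # drop the query explicitly
    return order[:k].tolist()

# Made-up distances for six items; item 2 is the query.
demo = np.array([[1.41], [1.38], [5.3e-07], [1.40], [1.30], [1.39]])
print(top_k_similar(demo, query_idx=2, k=3))  # → [4, 1, 5]
```

With the notebook's variables this would be called as `top_k_similar(distances, book_idx)`.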

In [33]:
print("MY BOOK: " + df["filename"][book_idx])
print("GENRE: " + df["genre"][book_idx])
print("DESCRIPTION:")
print(df["description"][book_idx])
print("-----------------------------------------------------------------------------------\n\n")
print("                                     SIMILAR BOOKS")

idxs = np.argsort(distances, axis=0)[1:6].ravel()  # skip position 0: the nearest book is the selected book itself
for i in idxs:
    print("BOOK: " + df["filename"][i])
    print("GENRE: " + df["genre"][i])
    print("DESCRIPTION:")
    print(df["description"][i])
    print("\n\n")

MY BOOK: Digital Marketing_ Strategy, Implementation and practice ( PDFDrive ).pdf
GENRE: Non-fiction
DESCRIPTION:
The book Digital Marketing: Strategy, Implementation, and Practice, which is now in its sixth edition, gives thorough and useful advice to businesses on how they can fulfill their marketing objectives by making the most of the opportunities presented by digital media and technology. 
Students will gain a better understanding of how digital marketing functions in the real world by participating in case studies and interviews with representatives from forward-thinking companies like eBay and Facebook, which are covered in the course Digital Marketing, which brings together the theory and practice of marketing with actual business experience.
-----------------------------------------------------------------------------------


                                     SIMILAR BOOKS
BOOK: Lobsters_ Biology, Management, Aquaculture & Fisheries ( PDFDrive ).pdf
GENRE: Non-fiction
DES