**To begin, we need to upload the book file to Google Colab. This code allows us to manually select and upload a file from our local system.**

In [1]:
from google.colab import files
uploaded = files.upload()


Saving pg1513-images.html to pg1513-images.html
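
In Colab, `files.upload()` returns a dictionary that maps each uploaded file name to its contents as bytes. The sketch below shows one way to derive the saved path from that dictionary instead of hard-coding it. The helper name `uploaded_paths` and its `base_dir` parameter are hypothetical, and the `/content` default assumes the notebook's working directory has not been changed.

```python
import posixpath


def uploaded_paths(uploaded, base_dir="/content"):
    """Return the expected saved path for each uploaded file name.

    `uploaded` is a dict keyed by file name, like the one returned by
    files.upload(). `base_dir` is an assumption: Colab's default working directory.
    """
    return [posixpath.join(base_dir, name) for name in uploaded]


# Stand-in for the dict returned by files.upload() (file name -> bytes)
example_upload = {"pg1513-images.html": b"<html></html>"}
html_paths = uploaded_paths(example_upload)
print(html_paths[0])
```

In the notebook itself, `uploaded_paths(uploaded)[0]` could then be used wherever the path to the book file is needed.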


**Extracting Text from HTML File
After uploading the book file, we need to extract the text from it. This code uses BeautifulSoup to parse the HTML file and retrieve the text from all paragraph tags.**


In [2]:
from bs4 import BeautifulSoup

# Path where Colab saved the uploaded file
html_path = "/content/pg1513-images.html"
with open(html_path, encoding="utf-8") as file:
    soup = BeautifulSoup(file, "html.parser")

# Join the text of every <p> tag, one paragraph per line
book_text = "\n".join([p.get_text() for p in soup.find_all("p")])

# Preview the first 1000 characters
print(book_text[:1000])


Title: Romeo and Juliet
Author: William Shakespeare
Release date: November 1, 1998 [eBook #1513]
                Most recently updated: June 19, 2024
Language: English
Credits: the PG Shakespeare Team, a team of about twenty Project Gutenberg volunteers

ESCALUS, Prince of Verona.
MERCUTIO, kinsman to the Prince, and friend to Romeo.
PARIS, a young Nobleman, kinsman to the Prince.
Page to Paris.


MONTAGUE, head of a Veronese family at feud with the Capulets.
LADY MONTAGUE, wife to Montague.
ROMEO, son to Montague.
BENVOLIO, nephew to Montague, and friend to Romeo.
ABRAM, servant to Montague.
BALTHASAR, servant to Romeo.


CAPULET, head of a Veronese family at feud with the Montagues.
LADY CAPULET, wife to Capulet.
JULIET, daughter to Capulet.
TYBALT, nephew to Lady Capulet.
CAPULET’S COUSIN, an old man.
NURSE to Juliet.
PETER, servant to Juliet’s Nurse.
SAMPSON, servant to Capulet.
GREGORY, servant to Capulet.
Servants.


FRIAR LAWRENCE, a Franciscan.
FRIAR JOHN, of the same Order.
An
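
The preview above contains blank entries and lines padded with leading spaces, because `get_text()` keeps the whitespace of the original HTML. A minimal sketch of one possible cleanup step follows. The function name `normalize_paragraphs` is hypothetical, and the sample strings are illustrative stand-ins rather than the exact paragraph contents of the file.

```python
def normalize_paragraphs(paragraphs):
    """Collapse runs of whitespace in each paragraph and drop empty ones."""
    cleaned = []
    for text in paragraphs:
        # str.split() with no arguments splits on any whitespace, including newlines
        collapsed = " ".join(text.split())
        if collapsed:
            cleaned.append(collapsed)
    return cleaned


# Illustrative input resembling the extracted paragraphs
raw = [
    "Release date: November 1, 1998 [eBook #1513]\n                Most recently updated: June 19, 2024",
    "   ",
    "ESCALUS, Prince of Verona.",
]
clean_text = "\n".join(normalize_paragraphs(raw))
print(clean_text)
```

With the notebook's variables, the input list would be `[p.get_text() for p in soup.find_all("p")]`.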

In [4]:
import os
print(os.listdir("/content/"))


['.config', 'pg1513-images.html', 'sample_data']


**Reading and Extracting Text from an HTML File
After confirming that the file is present in `/content/`, this cell repeats the extraction, this time opening the file through the `html_path` variable. Only the text inside paragraph tags is kept, so markup and any content outside `<p>` elements are left out.**

In [6]:
from bs4 import BeautifulSoup

html_path = "/content/pg1513-images.html"
with open(html_path, "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file, "html.parser")

# Join the text of every <p> tag, one paragraph per line
book_text = "\n".join([p.get_text() for p in soup.find_all("p")])
print(book_text[:1000])


Title: Romeo and Juliet
Author: William Shakespeare
Release date: November 1, 1998 [eBook #1513]
                Most recently updated: June 19, 2024
Language: English
Credits: the PG Shakespeare Team, a team of about twenty Project Gutenberg volunteers

ESCALUS, Prince of Verona.
MERCUTIO, kinsman to the Prince, and friend to Romeo.
PARIS, a young Nobleman, kinsman to the Prince.
Page to Paris.


MONTAGUE, head of a Veronese family at feud with the Capulets.
LADY MONTAGUE, wife to Montague.
ROMEO, son to Montague.
BENVOLIO, nephew to Montague, and friend to Romeo.
ABRAM, servant to Montague.
BALTHASAR, servant to Romeo.


CAPULET, head of a Veronese family at feud with the Montagues.
LADY CAPULET, wife to Capulet.
JULIET, daughter to Capulet.
TYBALT, nephew to Lady Capulet.
CAPULET’S COUSIN, an old man.
NURSE to Juliet.
PETER, servant to Juliet’s Nurse.
SAMPSON, servant to Capulet.
GREGORY, servant to Capulet.
Servants.


FRIAR LAWRENCE, a Franciscan.
FRIAR JOHN, of the same Order.
An

**Extracting Text from a Wikipedia Page
This code downloads the Wikipedia article on Romeo and Juliet with `requests` and extracts the text of all paragraph tags using BeautifulSoup. Note that it assigns the result to `book_text`, replacing the text extracted from the uploaded book, so the following steps operate on the Wikipedia article.**

In [8]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Romeo_and_Juliet"

# Download the page and parse its HTML
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Join the text of every <p> tag; this overwrites the earlier book_text
book_text = "\n".join([p.get_text() for p in soup.find_all("p")])

print(book_text[:1000])




The Tragedy of Romeo and Juliet, often shortened to Romeo and Juliet, is a tragedy written by William Shakespeare early in his career about the romance between two Italian youths from feuding families. It was among Shakespeare's most popular plays during his lifetime and, along with Hamlet, is one of his most frequently performed. Today, the title characters are regarded as archetypal young lovers.

Romeo and Juliet belongs to a tradition of tragic romances stretching back to antiquity. The plot is based on an Italian tale written by Matteo Bandello and translated into verse as The Tragical History of Romeus and Juliet by Arthur Brooke in 1562 and retold in prose in Palace of Pleasure by William Painter in 1567. Shakespeare borrowed heavily from both but expanded the plot by developing a number of supporting characters, in particular Mercutio and Paris. Believed to have been written between 1591 and 1595, the play was first published in a quarto version in 1597. The text of the first

**Splitting the Extracted Text into Chunks
A question-answering model can only read a limited amount of text at a time, so we split the extracted text into smaller chunks. The function below splits the text on whitespace and groups the words into segments of at most `chunk_size` words (500 by default).**

In [9]:
def chunk_text(text, chunk_size=500):
    words = text.split()
    return [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

chunks = chunk_text(book_text, chunk_size=500)

print(f"Total Chunks: {len(chunks)}")
print(f"Sample Chunk: {chunks[0]}")


Total Chunks: 21
Sample Chunk: The Tragedy of Romeo and Juliet, often shortened to Romeo and Juliet, is a tragedy written by William Shakespeare early in his career about the romance between two Italian youths from feuding families. It was among Shakespeare's most popular plays during his lifetime and, along with Hamlet, is one of his most frequently performed. Today, the title characters are regarded as archetypal young lovers. Romeo and Juliet belongs to a tradition of tragic romances stretching back to antiquity. The plot is based on an Italian tale written by Matteo Bandello and translated into verse as The Tragical History of Romeus and Juliet by Arthur Brooke in 1562 and retold in prose in Palace of Pleasure by William Painter in 1567. Shakespeare borrowed heavily from both but expanded the plot by developing a number of supporting characters, in particular Mercutio and Paris. Believed to have been written between 1591 and 1595, the play was first published in a quarto version in
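
One limitation of fixed-size chunks is that a sentence containing the answer can be cut in half at a chunk boundary. A common variation, sketched below, lets consecutive chunks share a few words. The function name `chunk_text_with_overlap` and the `overlap` parameter are hypothetical additions, not part of the notebook.

```python
def chunk_text_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into chunks of up to chunk_size words that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# Small demonstration with ten placeholder words
sample = " ".join(f"w{i}" for i in range(10))
demo = chunk_text_with_overlap(sample, chunk_size=4, overlap=1)
print(demo)
```

With `overlap=0` the function produces the same chunks as `chunk_text`.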

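**Implementing a Question-Answering Model
To extract answers from the text, we use a pre-trained transformer model through the `pipeline` helper. The pipeline takes a question and a chunk of text as context and returns the span of the context it considers the most likely answer, together with a confidence score. Note that `distilbert-base-uncased` is a base checkpoint that has not been fine-tuned for question answering: the warning in the output below reports that the `qa_outputs` weights are newly initialized, so the answers it produces are not meaningful. A checkpoint fine-tuned on a question-answering dataset, such as `distilbert-base-uncased-distilled-squad`, is needed for usable results. The test question is also unrelated to the text, so no sensible answer exists in the context.**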

In [10]:
from transformers import pipeline
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased")
question = "What is machine learning?"
context = chunks[0]
answer = qa_pipeline(question=question, context=context)

print(f"Answer: {answer['answer']}")



Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cuda:0


Answer: . The play, set in Verona,
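
The answer above is essentially arbitrary because the question-answering head of the base checkpoint is untrained. A minimal sketch of loading a checkpoint that was fine-tuned on SQuAD follows. The names `QA_MODEL_NAME`, `build_qa_pipeline`, and `describe_answer` are hypothetical, and the result dictionary in the demonstration is an illustrative stand-in, not real model output.

```python
# DistilBERT checkpoint fine-tuned on SQuAD (swapped in for the base checkpoint)
QA_MODEL_NAME = "distilbert-base-uncased-distilled-squad"


def build_qa_pipeline(model_name=QA_MODEL_NAME):
    """Create a question-answering pipeline; downloads the model on first use."""
    # Imported here so the rest of the module loads without transformers installed
    from transformers import pipeline

    return pipeline("question-answering", model=model_name)


def describe_answer(result):
    """Format a pipeline result dict (keys "answer" and "score") for display."""
    return f"Answer: {result['answer']} (score: {result['score']:.2f})"


# Illustrative result dict, not real model output
print(describe_answer({"answer": "Verona", "score": 0.9, "start": 0, "end": 6}))
```

In the notebook, `qa_pipeline = build_qa_pipeline()` would replace the original pipeline call, ideally with a question that the text can answer, for example "Where is the play set?".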


**Answering Questions from the Text
This function lets the user ask a question about the extracted text. It runs the model on each chunk in order and returns the first answer whose confidence score exceeds 0.5. If no chunk produces an answer above that threshold, it asks the user to rephrase the question.**

In [11]:
def answer_question(question, book_chunks):
    for chunk in book_chunks:
        answer = qa_pipeline(question=question, context=chunk)
        if answer["score"] > 0.5:
            return answer["answer"]
    return "Try rephrasing your question.."

user_question = input("Ask a question: ")
response = answer_question(user_question, chunks)
print(f"Answer: {response}")


Ask a question: Who is Romeo


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Answer: Try rephrasing your question..


**This cell prints the first of the existing 500-word chunks, then re-splits the book text into chunks of 700 words. The `answer_question` function is redefined to compare confidence scores across all chunks: instead of returning the first answer that passes a threshold, it keeps the answer with the highest score. Because `best_score` starts at 0, any answer with a positive score is accepted, so the fallback message appears only when no chunk yields a non-empty answer with a score above 0.**

In [None]:
print(chunks[0])
chunks = chunk_text(book_text, chunk_size=700)
def answer_question(question, book_chunks):
    best_answer = None
    best_score = 0
    for chunk in book_chunks:
        answer = qa_pipeline(question=question, context=chunk)
        if answer["score"] > best_score:
            best_score = answer["score"]
            best_answer = answer["answer"]
    return best_answer if best_answer else "Sorry, I couldn't find an answer."
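
Selecting the highest score alone means that even a very weak answer is returned. One possible refinement, sketched below, combines the best-score search with a minimum threshold. The function `best_answer`, the `min_score` default of 0.3, and the stand-in `fake_qa` are all assumptions made for illustration; `fake_qa` only mimics the shape of the pipeline's result so the logic can be run without a model.

```python
def best_answer(question, chunks, qa, min_score=0.3):
    """Return the highest-scoring result dict, or None if it is below min_score."""
    best = None
    for chunk in chunks:
        result = qa(question=question, context=chunk)
        if best is None or result["score"] > best["score"]:
            best = result
    if best is None or best["score"] < min_score:
        return None
    return best


def fake_qa(question, context):
    """Stand-in for qa_pipeline: favors chunks that mention 'Verona'."""
    score = 0.9 if "Verona" in context else 0.1
    return {"answer": context.split(".")[0], "score": score}


demo_chunks = [
    "Romeo is a Montague. He loves Juliet.",
    "The play is set in Verona. It is a tragedy.",
]
found = best_answer("Where is the play set?", demo_chunks, fake_qa)
print(found)
```

With the real model, the call would be `best_answer(user_question, chunks, qa_pipeline)`, and a `None` result can be turned into the fallback message.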


**This cell starts a loop that lets the user ask multiple questions. The loop ends only when the user types "exit" (in any letter case); any other input is passed to the `answer_question` function, which searches the text chunks for an answer, and the response is printed. In the run shown below, the user typed "stop", which is not the exit keyword, so it was treated as a question.**

In [None]:
while True:
    user_question = input("\nAsk a question (or type 'exit' to stop): ")
    if user_question.lower() == "exit":
        break
    response = answer_question(user_question, chunks)
    print(f"Answer: {response}")


Ask a question (or type 'exit' to stop): stop
Answer: Try rephrasing your question..
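
As the run above shows, typing "stop" does not end the loop. A small sketch of a more forgiving loop follows, assuming it is acceptable to treat several words as exit commands and to skip empty input. The names `EXIT_WORDS` and `run_qa_loop` are hypothetical, and the `read` and `write` parameters exist only so the loop can be exercised with scripted input.

```python
EXIT_WORDS = {"exit", "quit", "stop"}


def run_qa_loop(answer_fn, read=input, write=print):
    """Ask questions until an exit word is entered; return how many were answered."""
    asked = 0
    while True:
        user_question = read("\nAsk a question (or type 'exit' to stop): ").strip()
        if user_question.lower() in EXIT_WORDS:
            break
        if not user_question:
            continue  # ignore empty input instead of querying the model
        write(f"Answer: {answer_fn(user_question)}")
        asked += 1
    return asked


# Scripted demonstration: one question, one empty line, then "stop"
scripted = iter(["Who is Romeo", "", "stop"])
outputs = []
count = run_qa_loop(
    lambda q: f"(answer to: {q})",
    read=lambda prompt: next(scripted),
    write=outputs.append,
)
print(count, outputs)
```

In the notebook, the loop could be started with `run_qa_loop(lambda q: answer_question(q, chunks))`.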
