<a href="https://colab.research.google.com/github/RonitShetty/NLP-Labs/blob/main/C070_RonitShetty_NLPLab4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Lab 4
**Roll No.:** C070  
**Name:** Ronit Shetty  
**SAP ID:** 70322000128  
**Division:** C  
**Batch:** C1  

### B.1 Tasks

In [13]:
# Step 1: Install necessary libraries in the Colab environment
!pip install requests beautifulsoup4 pandas

# Step 2: Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 3: Define the URL and send the HTTP request
# quotes.toscrape.com is a sandbox site built for scraping practice.
URL = "http://quotes.toscrape.com"
webpage = requests.get(URL, timeout=10)
webpage.raise_for_status()  # Stop early on an HTTP error (4xx/5xx)
soup = BeautifulSoup(webpage.content, "html.parser")

# Step 4: Find all the containers for each quote
# By inspecting the page, we can see each quote is in a <div class="quote">
quote_containers = soup.find_all("div", class_="quote")

# Step 5: Initialize lists to store the scraped data
all_quotes = []
all_authors = []
all_tags = []

# Step 6: Loop through each quote container and extract the relevant information
for quote in quote_containers:
    # Extract the quote's text, which is in a <span class="text">
    text = quote.find("span", class_="text").get_text(strip=True)

    # Extract the author's name, which is in a <small class="author">
    author = quote.find("small", class_="author").get_text(strip=True)

    # Find the container for tags, which is a <div class="tags">
    tags_div = quote.find("div", class_="tags")
    # Inside that div, find all the <a> tags and get their text
    tags = [tag.get_text(strip=True) for tag in tags_div.find_all("a", class_="tag")]
    # Join the list of tags into a single comma-separated string
    tags_str = ", ".join(tags)

    # Add the extracted data to our lists
    all_quotes.append(text)
    all_authors.append(author)
    all_tags.append(tags_str)


# Step 7: Create a Pandas DataFrame to display the data neatly
df = pd.DataFrame({
    'Quote': all_quotes,
    'Author': all_authors,
    'Tags': all_tags
})

# Display the results
print("--- Scraped Quotes from http://quotes.toscrape.com ---")
print(df.to_string())

--- Scraped Quotes from http://quotes.toscrape.com ---
                                                                                                                                 Quote             Author                                          Tags
0                  “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”    Albert Einstein        change, deep-thoughts, thinking, world
1                                                “It is our choices, Harry, that show what we truly are, far more than our abilities.”       J.K. Rowling                            abilities, choices
2  “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”    Albert Einstein  inspirational, life, live, miracle, miracles
3                             “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”        Jan

In [14]:
# Step 1: Install Tesseract OCR engine and the Python wrapper in Colab
!sudo apt install tesseract-ocr
!pip install pytesseract

# Step 2: Import necessary libraries
from io import BytesIO  # Wrap the downloaded bytes in a file-like object

import pytesseract
from PIL import Image  # Pillow library for image handling
import requests  # To download the image from a URL

# Step 3: Download an image with text to process
image_url = 'https://www.aplustopper.com/media/images/articles/Merchant-of-Venice-Act-1-Scene-1-Translation-Meaning-Annotations-1.png'

response = requests.get(image_url, timeout=10)
response.raise_for_status()  # Stop early if the download failed
img = Image.open(BytesIO(response.content))

# Display the image (optional, for verification in a notebook)
# display(img)  # In Colab/Jupyter, display() renders a PIL image inline

# Step 4: Use pytesseract to perform OCR on the image
# The image_to_string function takes an image object and returns the extracted text
extracted_text = pytesseract.image_to_string(img)

# Step 5: Print the extracted text
print("--- Extracted Text ---")
print(extracted_text)

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
--- Extracted Text ---
[Venice - A street]

Enter Antonio, SALARINO, and SALANIO.

ANTONIO: Jn sooth, I know not why I am so
sad;

It wearies me; you say it wearies you;

But how I caught it, found it, or came by it,
What stuff ’tis made of, whereof it is born,
Tam to learn;(5)

And such a want-wit sadness makes of me,
That I have much ado to know myself.

SALARINO: Your mind is tossing on the ocean;
There, where your argosies, with portly sail,
Like signiors and rich burghers on the flood, (10)
Or, as it were, the pageants of the sea,

Do overpeer the petty traffickers,

That curt sy to them, do them reverence,

As they fly by them with their woven wings.


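The extracted text above contains a few misreads, such as `Jn sooth` where the play reads "In sooth". One way to put a number on this is to compare an OCR line against a reference transcription. The following is a minimal sketch using Python's standard `difflib`; the helper name `char_similarity` and the hand-typed reference line are assumptions for illustration and were not part of the original notebook run.

```python
from difflib import SequenceMatcher


def char_similarity(ocr_text: str, reference: str) -> float:
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, ocr_text, reference).ratio()


# One line copied from the OCR output, and a hand-typed reference (assumption).
ocr_line = "ANTONIO: Jn sooth, I know not why I am so"
reference_line = "ANTONIO: In sooth, I know not why I am so"

score = char_similarity(ocr_line, reference_line)
print(f"Character similarity: {score:.3f}")
```

A ratio close to 1 means the OCR line is nearly identical to the reference; lower values point to lines that may need a better source image.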

### B.2 Observations and Learning

Text acquisition depends heavily on the structure of the source. For web scraping with **Beautiful Soup**, success hinges on correctly identifying **HTML tags** and **attributes** with a browser's inspection tools, and I learned that scrapers are fragile: a lookup such as `find_all("div", class_="quote")` stops matching as soon as the site's layout changes. Some sites block requests that lack browser-like headers, so setting a `User-Agent` header is often necessary, although quotes.toscrape.com did not require one in this lab. For OCR with **Tesseract**, I observed that the **quality of the input image is paramount**. It worked well on this clean, machine-printed page, with only a few misreads (for example `Jn sooth` and `Tam to learn`), but noisy or complex images would need significant preprocessing to ensure accurate text extraction.
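The kind of image preprocessing referred to here could look like the minimal sketch below, written with Pillow. The helper name `preprocess_for_ocr`, the scale factor of 2, and the threshold of 150 are illustrative assumptions, not values used in this lab; real images usually need their own tuning. A blank synthetic image stands in for a downloaded page so the block runs without network access.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(img, scale=2, threshold=150):
    """Grayscale, upscale, and binarize an image before OCR (hypothetical helper)."""
    gray = ImageOps.grayscale(img)
    # Upscaling can help when characters are only a few pixels tall.
    resized = gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)
    # Global threshold: pixels brighter than `threshold` become white, the rest black.
    return resized.point(lambda p: 255 if p > threshold else 0)


# Synthetic stand-in for a scanned page (assumption, not the lab's image).
sample = Image.new("RGB", (40, 20), color=(200, 200, 200))
cleaned = preprocess_for_ocr(sample)
print(cleaned.mode, cleaned.size)
```

The returned image can be passed to `pytesseract.image_to_string(...)` in place of the raw image. A fixed global threshold is the simplest choice and tends to struggle with uneven lighting, so it is only a starting point.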

### B.3 Conclusion

I learned how to apply fundamental text acquisition techniques and successfully achieved the experiment's outcomes. I implemented a solution using the **Beautiful Soup** library to parse and scrape structured data from a live website and utilized the **Tesseract OCR engine** to extract and digitize text from an image file. This lab provided direct, hands-on experience in gathering raw text from different sources, solidifying my understanding of the initial data collection phase of an NLP project.

### B.4 Questions of Curiosity

1.  **How does one parse the HTML into a `BeautifulSoup` object given a `response` object?**
    
    *Ans: To parse an HTML `response` object, I create a `BeautifulSoup` object by passing the response content and a parser: `soup = BeautifulSoup(response.content, 'html.parser')`.*

2.  **Given that `soup.find_all(class_='items')` returns a list, in order to get the first item, all you need to do is index. (True/False)**
    
    *Ans: **True**. `soup.find_all()` returns a list-like `ResultSet`, so I can get the first item with index `[0]`. If nothing matched, the result is empty and `[0]` raises an `IndexError`.*

3.  **Given the below html, how would this tag type be described in web scraping code? `<h1 class='sports'>Sports News</h1>`**
    
    *Ans: The tag `<h1 class='sports'>Sports News</h1>` is an `<h1>` tag with a `class` attribute of `'sports'`. I would locate it in code with a command like `soup.find('h1', class_='sports')`.*

4.  **List 3 Disadvantages of using Tesseract.**
    
    *Ans: Three disadvantages of Tesseract are: 1) High sensitivity to image quality, performing poorly on low-resolution or noisy images. 2) Difficulty interpreting complex page layouts like multiple columns. 3) Poor accuracy on non-standard fonts or handwritten text.*
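The answers to questions 1 to 3 can be tried out without a network request. In this minimal sketch, the `html` string is a made-up stand-in for `response.content`, and the `items` elements are hypothetical; only the `<h1 class='sports'>` tag comes from the question itself.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for `response.content`.
html = """
<html><body>
  <h1 class='sports'>Sports News</h1>
  <div class='items'>First item</div>
  <div class='items'>Second item</div>
</body></html>
"""

# Q1: parse the HTML into a BeautifulSoup object.
soup = BeautifulSoup(html, "html.parser")

# Q2: find_all() returns a list-like ResultSet, so indexing works.
items = soup.find_all(class_="items")
first_item = items[0] if items else None  # Guard against an empty result

# Q3: locate the <h1> tag by its class attribute.
headline = soup.find("h1", class_="sports")

print(first_item.get_text(strip=True))
print(headline.get_text(strip=True))
```

The `if items else None` guard is optional, but it avoids an `IndexError` when a page has no matching elements.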