# Level 1
## Task 1: Web Scraping with Python

This notebook demonstrates how to scrape structured quote data from
http://quotes.toscrape.com using the helper functions in `scraper.py`.

It includes:
- Extracting quote data
- Keyword searching
- Dynamic page detection
- Saving data to CSV


In [1]:
import os
import sys

notebook_dir = os.getcwd()  # current notebook working directory
root_dir = os.path.abspath(os.path.join(notebook_dir, ".."))  # project root (parent of notebooks/)

print("Notebook dir:", notebook_dir)
print("Root dir:", root_dir)

if root_dir not in sys.path:
    sys.path.append(root_dir)
    print(f"Added {root_dir} to sys.path")
else:
    print(f"{root_dir} already in sys.path")

print("Current sys.path:")
for p in sys.path:
    print(" ", p)


Notebook dir: e:\CODveda\codveda-internship\notebooks
Root dir: e:\CODveda\codveda-internship
Added e:\CODveda\codveda-internship to sys.path
Current sys.path:
  C:\Users\Kabelo Matlakala\AppData\Local\Programs\Python\Python311\python311.zip
  C:\Users\Kabelo Matlakala\AppData\Local\Programs\Python\Python311\DLLs
  C:\Users\Kabelo Matlakala\AppData\Local\Programs\Python\Python311\Lib
  C:\Users\Kabelo Matlakala\AppData\Local\Programs\Python\Python311
  e:\CODveda\codveda-internship\codveda-env
  
  e:\CODveda\codveda-internship\codveda-env\Lib\site-packages
  e:\CODveda\codveda-internship\codveda-env\Lib\site-packages\win32
  e:\CODveda\codveda-internship\codveda-env\Lib\site-packages\win32\lib
  e:\CODveda\codveda-internship\codveda-env\Lib\site-packages\Pythonwin
  e:\CODveda\codveda-internship
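
The same path setup can be sketched with `pathlib`. This is only an illustration: the helper name `add_to_sys_path` is hypothetical, and the sketch assumes the notebook runs from a folder one level below the project root, as the output above suggests.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory: Path) -> bool:
    """Append directory to sys.path if it is missing; return True when added."""
    resolved = str(directory.resolve())
    if resolved in sys.path:
        return False
    sys.path.append(resolved)
    return True


# Assumption: the working directory sits one level below the project root.
root_dir = Path.cwd().parent
added = add_to_sys_path(root_dir)
print(f"{root_dir} {'added to' if added else 'already in'} sys.path")
```

Returning a boolean keeps the helper safe to call repeatedly, for example when a cell is re-run.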


In [2]:
from Level1_Basic.Task1_Scraping.scraper import (
    scrape_quotes,
    search_quotes,
    is_dynamic,
    extract_structured_content,
    save_scraped_text_to_csv
)

## Scrape Quotes from Demo Site

In [3]:
# Scrape quotes from the demo website
df_quotes = scrape_quotes()

print(f"Total quotes scraped: {len(df_quotes)}")
df_quotes.head()

Total quotes scraped: 100


| | Quote | Author | Tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | change, deep-thoughts, thinking, world |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | abilities, choices |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | inspirational, life, live, miracle, miracles |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | aliteracy, books, classic, humor |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | be-yourself, inspirational |
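
The source of `scrape_quotes` is not shown in this notebook. As a rough idea of the parsing step, here is a minimal standard-library sketch. It assumes each quote sits in a `div.quote` block containing a `span.text`, a `small.author`, and `a.tag` links; that markup, the `QuoteParser` class, the `parse_quotes` function, and the sample HTML are all assumptions made for illustration, not the notebook's actual implementation.

```python
from html.parser import HTMLParser

# Hypothetical sample that mimics the assumed markup; it is not fetched from the site.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">“A sample quote.”</span>
  <span>by <small class="author">Jane Doe</small></span>
  <div class="tags">
    <a class="tag" href="/tag/life/">life</a>
    <a class="tag" href="/tag/truth/">truth</a>
  </div>
</div>
"""


class QuoteParser(HTMLParser):
    """Collect quote text, author, and tags from the assumed markup."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # field that the current text node belongs to

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "quote" in classes:
            self.rows.append({"Quote": "", "Author": "", "Tags": []})
        elif not self.rows:
            return  # ignore everything before the first quote block
        elif tag == "span" and "text" in classes:
            self._field = "Quote"
        elif tag == "small" and "author" in classes:
            self._field = "Author"
        elif tag == "a" and "tag" in classes:
            self._field = "Tags"

    def handle_endtag(self, tag):
        if tag in ("span", "small", "a"):
            self._field = None

    def handle_data(self, data):
        if self._field is None:
            return
        row = self.rows[-1]
        if self._field == "Tags":
            row["Tags"].append(data.strip())
        else:
            row[self._field] += data


def parse_quotes(html):
    """Return one dict per quote, with tags joined into a single string."""
    parser = QuoteParser()
    parser.feed(html)
    parser.close()
    return [{**row, "Tags": ", ".join(row["Tags"])} for row in parser.rows]


rows = parse_quotes(SAMPLE_HTML)
print(rows)
```

A list of dicts like this can be passed straight to `pandas.DataFrame(rows)` to obtain the `Quote`, `Author`, and `Tags` columns shown above.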


## Search Within Quotes

In [4]:
# Search the scraped quotes for a keyword (for example, an author name or a tag)
search_term = "truth"  # Replace with any term, such as "life" or "friendship"
search_results = search_quotes(df_quotes, search_term)

print(f"Results for search term: '{search_term}'")
search_results.head()

Results for search term: 'truth'


| | Quote | Author | Tags |
|---|---|---|---|
| 10 | “This life is what you make it. No matter what... | Marilyn Monroe | friends, heartbreak, inspirational, life, love... |
| 32 | “The truth is, everyone is going to hurt you. ... | Bob Marley | friendship |
| 75 | “The reason I talk to myself is because I’m th... | George Carlin | humor, insanity, lies, lying, self-indulgence,... |
| 88 | “A lie can travel half way around the world wh... | Mark Twain | misattributed-mark-twain, truth |
| 90 | “The truth." Dumbledore sighed. "It is a beaut... | J.K. Rowling | truth |
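
Which columns `search_quotes` inspects is not shown here. A case-insensitive substring match across all three columns is one plausible reading, sketched below on plain dicts. The function name `search_rows` is hypothetical, and the sample rows reuse the truncated previews from the outputs above.

```python
def search_rows(rows, term, fields=("Quote", "Author", "Tags")):
    """Return rows where any of the given fields contains term, ignoring case."""
    needle = term.casefold()
    return [
        row
        for row in rows
        if any(needle in str(row.get(field, "")).casefold() for field in fields)
    ]


# Truncated previews copied from the notebook output, used only as sample data.
sample_rows = [
    {"Quote": "“The world as we have created it is a process ...",
     "Author": "Albert Einstein",
     "Tags": "change, deep-thoughts, thinking, world"},
    {"Quote": "“The truth is, everyone is going to hurt you. ...",
     "Author": "Bob Marley",
     "Tags": "friendship"},
    {"Quote": "“A lie can travel half way around the world wh...",
     "Author": "Mark Twain",
     "Tags": "misattributed-mark-twain, truth"},
]

matches = search_rows(sample_rows, "Truth")
print([row["Author"] for row in matches])
```

With a DataFrame, the comparable per-column check is `Series.str.contains(term, case=False, regex=False)`, combined across columns with `|`.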


## Save Scraped Quotes to CSV

In [5]:
# Define output directory and file path
output_dir = os.path.join(root_dir, "data", "raw")
os.makedirs(output_dir, exist_ok=True)  # Ensure folder exists

output_path = os.path.join(output_dir, "quotes_scraped.csv")

# Save dataframe to CSV
df_quotes.to_csv(output_path, index=False)

print(f"Quotes saved to: {output_path}")


Quotes saved to: e:\CODveda\codveda-internship\data\raw\quotes_scraped.csv
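
Because the `Tags` values contain commas, it can be worth confirming that the file round-trips cleanly. The sketch below uses the standard `csv` module and a temporary directory rather than the project's `data/raw` folder; the single sample row is taken from the preview above. `DataFrame.to_csv` applies the same minimal quoting by default, so comma-containing fields should be wrapped in double quotes in both cases.

```python
import csv
import tempfile
from pathlib import Path

# One row from the notebook preview, used only as sample data.
rows = [
    {"Quote": "“It is our choices, Harry, that show what we t...",
     "Author": "J.K. Rowling",
     "Tags": "abilities, choices"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "quotes_scraped.csv"

    # newline="" lets the csv module control line endings on every platform.
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Quote", "Author", "Tags"])
        writer.writeheader()
        writer.writerows(rows)

    raw_text = path.read_text(encoding="utf-8")

    with path.open(newline="", encoding="utf-8") as f:
        restored = list(csv.DictReader(f))

print(raw_text)
```

Reading the file back and comparing it with the original rows is a cheap check that quoting and encoding survived the write.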


---

## Check if URL is Dynamic (JS-rendered)

In [6]:
# Optional: check whether the URL is dynamic (JS-rendered)
custom_url = "https://quotes.toscrape.com/js"  # Swap in another URL to test, for example:
# custom_url = "https://en.wikipedia.org/wiki/Natural_language_processing"

if is_dynamic(custom_url):
    print(f"The page at {custom_url} appears to be dynamic (requires Selenium).")
else:
    print(f"The page at {custom_url} is static and safe to scrape with requests.")


The page at https://quotes.toscrape.com/js appears to be dynamic (requires Selenium).
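
How `is_dynamic` reaches its verdict is not shown in this notebook. One simple heuristic, sketched here purely as an assumption, is to compare the amount of inline script with the amount of visible text in the raw HTML: a page that ships its data inside `<script>` tags tends to have far more script than text. The class `TextVsScript`, the function `looks_dynamic`, and both sample documents are hypothetical.

```python
from html.parser import HTMLParser


class TextVsScript(HTMLParser):
    """Count characters of visible text versus inline script."""

    def __init__(self):
        super().__init__()
        self.visible_chars = 0
        self.script_chars = 0
        self._current = None  # "script" or "style" while inside such a tag

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        size = len(data.strip())
        if self._current == "script":
            self.script_chars += size
        elif self._current is None:
            self.visible_chars += size


def looks_dynamic(html):
    """Heuristic: more inline script than visible text suggests JS rendering."""
    parser = TextVsScript()
    parser.feed(html)
    parser.close()
    return parser.script_chars > parser.visible_chars


# Hypothetical documents: one server-rendered, one that builds its content in JS.
STATIC_HTML = (
    "<html><body><h1>Sample page</h1>"
    "<p>Sample quote text that is already present in the server-rendered HTML.</p>"
    "<script>var a = 1;</script></body></html>"
)
JS_HTML = (
    "<html><body><h1>Sample page</h1>"
    '<script>var data = [{"text": "Sample quote text", "author": "Jane Doe"}];'
    "data.forEach(function (q) { document.write(q.text); });</script>"
    "</body></html>"
)

print(looks_dynamic(STATIC_HTML), looks_dynamic(JS_HTML))
```

This check only sees inline scripts. A page that loads its content from an external script file would slip through, so a real detector would likely need additional signals.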


In [7]:
# Extract structured text content from the URL
try:
    extracted_text = extract_structured_content(custom_url)

    if extracted_text:
        print("Successfully extracted visible text.")
        print("\n Preview (First 500 characters):\n")
        print(extracted_text[:500])

        # Save as CSV in your output_dir
        saved_path = save_scraped_text_to_csv(custom_url, extracted_text, save_dir=output_dir)
        
        print(f"\n Saved structured content to: {saved_path}")
    else:
        print("No visible text found on the page.")
except Exception as e:
    print(f"Error extracting content: {e}")


Successfully extracted visible text.

 Preview (First 500 characters):

# Quotes to Scrape

Login

- Next→
Quotes by:GoodReads.com

Made with❤byZyte

 Saved structured content to: e:\CODveda\codveda-internship\data\raw\quotes_to_scrape_scraped.csv
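
The saved file name looks as if it were derived from the page title, "Quotes to Scrape". How `save_scraped_text_to_csv` actually builds the name is not shown, so the sketch below is only one plausible approach; the function `csv_name_from_title` and its `suffix` parameter are hypothetical.

```python
import re


def csv_name_from_title(title, suffix="_scraped.csv"):
    """Turn a page title into a lowercase, underscore-separated file name."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
    # Fall back to a generic name when the title has no usable characters.
    return (slug or "page") + suffix


print(csv_name_from_title("Quotes to Scrape"))  # quotes_to_scrape_scraped.csv
```

Restricting the slug to `a-z`, `0-9`, and underscores avoids characters that are invalid in Windows file names, which matters for paths like the one above.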


## Summary

- Scraped quotes from http://quotes.toscrape.com
- Searched quotes using keywords
- Saved structured data to CSV
- Detected dynamic websites
- Extracted and saved visible text from a custom URL
