
# 🔹 Part A: Regex Practice
Basic and advanced regex problems.

In [1]:
# Import re module
import re

### Example 1: Extract email addresses


In [2]:
text = "Email: alice@mail.com, bob@company.org"

# Regex for emails
pattern = r"\b[\w.%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"

emails = re.findall(pattern, text)
print("Extracted emails:", emails)

Extracted emails: ['alice@mail.com', 'bob@company.org']


### Example 2: Find numbers in text


In [3]:
text = "Order number: 12345"

match = re.search(r"\d+", text)
if match:
    print("Matched:", match.group())
    print("Start index:", match.start())
    print("End index:", match.end())

Matched: 12345
Start index: 14
End index: 19
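
`re.search` stops at the first match. When a text holds several numbers and their positions matter, `re.finditer` is one way to walk through all of them. The following is a minimal sketch; the sample sentence is made up for illustration and is not part of the lab data.

```python
import re

# Hypothetical sample text containing several numbers
text = "Order 12345 shipped with 2 items on day 7"

# re.finditer yields one match object per non-overlapping match, left to right
matches = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]

for value, (start, end) in matches:
    # Slicing the text with the span gives back the matched value
    print(value, start, end, text[start:end])
```

Each match object exposes the same `group()`, `start()`, `end()`, and `span()` methods used with `re.search` above.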


### Example 3: Split text using multiple delimiters


In [4]:
text = "one,two;three four|five"

parts = re.split(r"[,;| ]", text)
print("Split parts:", parts)

Split parts: ['one', 'two', 'three', 'four', 'five']
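
The character class above matches exactly one delimiter at a time, so two delimiters in a row produce empty strings in the result. A small sketch of the difference, using a made-up input string with repeated delimiters:

```python
import re

# Hypothetical input with adjacent delimiters (", " and ";;" and two spaces)
messy = "one, two;;three  four|five"

# One delimiter per split: empty strings appear between adjacent delimiters
naive = re.split(r"[,;| ]", messy)

# Adding + treats each run of delimiters as a single separator
clean = re.split(r"[,;| ]+", messy)

print(naive)
print(clean)
```

If the input can start or end with a delimiter, the `+` version still returns an empty string at that edge, so a final filter such as `[p for p in clean if p]` may be needed.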


### Example 4: Mask phone numbers


In [5]:
text = "My phone number is 9876543210"

# Replace all digits with '*'
masked = re.sub(r"\d", "*", text)
print("Masked:", masked)

Masked: My phone number is **********
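
Masking every digit hides the whole number. A common variation, sketched here as an assumption rather than as part of the lab task, keeps the last four digits visible by using a lookahead.

```python
import re

text = "My phone number is 9876543210"

# \d(?=\d{4}) matches a digit only when at least four more digits follow it,
# so the final four digits of the number are never replaced
masked_partial = re.sub(r"\d(?=\d{4})", "*", text)

print("Partially masked:", masked_partial)
```

The lookahead `(?=...)` checks what follows without consuming it, which is why the trailing digits stay in the output.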


### Example 5: Match and find positions


In [6]:
text = "Python is fun"

match = re.search(r"Python", text)
if match:
    print("Match:", match.group())
    print("Span:", match.span())
    print("Start:", match.start())
    print("End:", match.end())

Match: Python
Span: (0, 6)
Start: 0
End: 6


## 🔹 Customer Log Regex Problems
Now we solve tasks based on a given customer log.


In [7]:
customer_log = """OrderID: 45892 | Name: Alice | Email: alice.wonder@mail.com | Phone: 9876543210
OrderID: 45893 | Name: Bob | Email: bob@company.org | Phone: 8765432109
OrderID: 45894 | Name: Charlie | Email: charlie_99@mail.net | Phone: 7654321098"""

### Task 1: Extract all Order IDs


In [8]:
order_ids = re.findall(r"OrderID: (\d+)", customer_log)
print("Order IDs:", order_ids)

Order IDs: ['45892', '45893', '45894']


### Task 2: Extract all email addresses


In [9]:
emails = re.findall(r"[\w.%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", customer_log)
print("Emails:", emails)

Emails: ['alice.wonder@mail.com', 'bob@company.org', 'charlie_99@mail.net']


### Task 3: Find all phone numbers


In [10]:
phones = re.findall(r"\d{10}", customer_log)
print("Phone Numbers:", phones)

Phone Numbers: ['9876543210', '8765432109', '7654321098']


### Task 4: Search for first OrderID


In [11]:
match = re.search(r"OrderID: \d+", customer_log)
if match:
    print("Matched:", match.group())
    print("Start:", match.start())
    print("End:", match.end())

Matched: OrderID: 45892
Start: 0
End: 14


### Task 5: Check if text contains the word "Charlie"


In [12]:
if re.search(r"Charlie", customer_log):
    print("Found")
else:
    print("Not Found")

Found
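
A plain pattern such as `Charlie` is case-sensitive and also matches inside longer tokens. The sketch below, run on a single line of the log, shows one way to match the name as a whole word regardless of case.

```python
import re

log_line = "OrderID: 45894 | Name: Charlie | Email: charlie_99@mail.net"

# Substring search, ignoring case: also hits the "charlie" inside the email
loose = re.findall(r"charlie", log_line, flags=re.IGNORECASE)

# \b requires a word boundary on both sides; "_" counts as a word character,
# so "charlie_99" does not qualify as a whole-word match
whole_word = re.findall(r"\bcharlie\b", log_line, flags=re.IGNORECASE)

print(loose)
print(whole_word)
```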


### Task 6: Split wherever `|`


In [13]:
parts = re.split(r"\|", customer_log)
print(parts[:5])  # printing first few for readability

['OrderID: 45892 ', ' Name: Alice ', ' Email: alice.wonder@mail.com ', ' Phone: 9876543210\nOrderID: 45893 ', ' Name: Bob ']
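
The output above keeps the spaces around each field and lets a newline leak into one element, because the split ignores line boundaries. One possible refinement, shown on a shortened two-line log invented for this sketch, is to split line by line and absorb the surrounding whitespace in the pattern.

```python
import re

# Shortened stand-in for customer_log (two fields per line)
log = """OrderID: 45892 | Name: Alice
OrderID: 45893 | Name: Bob"""

# \s*\|\s* consumes the pipe together with any whitespace around it
rows = [re.split(r"\s*\|\s*", line) for line in log.splitlines()]

for row in rows:
    print(row)
```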


### Task 7: Split text into lines

In [14]:
lines = re.split(r"\n", customer_log)
print("Orders:", lines)

Orders: ['OrderID: 45892 | Name: Alice | Email: alice.wonder@mail.com | Phone: 9876543210', 'OrderID: 45893 | Name: Bob | Email: bob@company.org | Phone: 8765432109', 'OrderID: 45894 | Name: Charlie | Email: charlie_99@mail.net | Phone: 7654321098']


### Task 8: Mask all digits in phone numbers

In [15]:
# Mask only the digits that follow "Phone: ", leaving order IDs and emails intact
masked_log = re.sub(r"(?<=Phone: )\d+", lambda m: "*" * len(m.group()), customer_log)
print(masked_log)

OrderID: 45892 | Name: Alice | Email: alice.wonder@mail.com | Phone: **********
OrderID: 45893 | Name: Bob | Email: bob@company.org | Phone: **********
OrderID: 45894 | Name: Charlie | Email: charlie_99@mail.net | Phone: **********


### Task 9: Replace "OrderID:" with "ID-"

In [16]:
updated_log = re.sub(r"OrderID:", "ID-", customer_log)
print(updated_log)

ID- 45892 | Name: Alice | Email: alice.wonder@mail.com | Phone: 9876543210
ID- 45893 | Name: Bob | Email: bob@company.org | Phone: 8765432109
ID- 45894 | Name: Charlie | Email: charlie_99@mail.net | Phone: 7654321098


### Task 10: Search for first email with positions

In [17]:
match = re.search(r"[\w.%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", customer_log)
if match:
    print("Matched email:", match.group())
    print("Start:", match.start())
    print("End:", match.end())
    print("Span:", match.span())

Matched email: alice.wonder@mail.com
Start: 38
End: 59
Span: (38, 59)
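
The tasks above pull out one field at a time. As a possible next step, a single pattern with named groups can turn each log line into a dictionary. This is a sketch that assumes every line follows the exact `OrderID | Name | Email | Phone` layout of the sample log; the group names are chosen here for illustration.

```python
import re

customer_log = """OrderID: 45892 | Name: Alice | Email: alice.wonder@mail.com | Phone: 9876543210
OrderID: 45893 | Name: Bob | Email: bob@company.org | Phone: 8765432109
OrderID: 45894 | Name: Charlie | Email: charlie_99@mail.net | Phone: 7654321098"""

# (?P<name>...) gives each captured field a name instead of a position
record_pattern = re.compile(
    r"OrderID: (?P<order_id>\d+) \| "
    r"Name: (?P<name>\w+) \| "
    r"Email: (?P<email>\S+) \| "
    r"Phone: (?P<phone>\d{10})"
)

# groupdict() returns the named groups of one match as a plain dict
records = [m.groupdict() for m in record_pattern.finditer(customer_log)]

for record in records:
    print(record)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame` if a tabular view is wanted.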


# 🔹 Part B: Web Scraping Practice
We will use `requests` and `BeautifulSoup` for scraping.


In [18]:
# Install required packages (only needed once in Colab)
!pip install beautifulsoup4 requests



### Example: Scraping from quotes.toscrape.com

In [19]:
import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com/"
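response = requests.get(url, timeout=10)
response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page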

soup = BeautifulSoup(response.text, "html.parser")

# Extract all quotes and authors
quotes = soup.find_all("span", class_="text")
authors = soup.find_all("small", class_="author")

print("Quotes and Authors:\n")
for quote, author in zip(quotes, authors):
    print(f"{quote.text}  — {author.text}")

Quotes and Authors:

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”  — Albert Einstein
“It is our choices, Harry, that show what we truly are, far more than our abilities.”  — J.K. Rowling
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”  — Albert Einstein
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”  — Jane Austen
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”  — Marilyn Monroe
“Try not to become a man of success. Rather become a man of value.”  — Albert Einstein
“It is better to be hated for what you are than to be loved for what you are not.”  — André Gide
“I have not failed. I've just found 10,000 ways that won't work.”  — Thomas A. Edison
“A woman is like a tea bag; you never know how strong it is until it's 

### Example: Extract all links

In [20]:
print("\nLinks on the page:")
for link in soup.find_all("a"):
    print(link.get("href"))


Links on the page:
/
/login
/author/Albert-Einstein
/tag/change/page/1/
/tag/deep-thoughts/page/1/
/tag/thinking/page/1/
/tag/world/page/1/
/author/J-K-Rowling
/tag/abilities/page/1/
/tag/choices/page/1/
/author/Albert-Einstein
/tag/inspirational/page/1/
/tag/life/page/1/
/tag/live/page/1/
/tag/miracle/page/1/
/tag/miracles/page/1/
/author/Jane-Austen
/tag/aliteracy/page/1/
/tag/books/page/1/
/tag/classic/page/1/
/tag/humor/page/1/
/author/Marilyn-Monroe
/tag/be-yourself/page/1/
/tag/inspirational/page/1/
/author/Albert-Einstein
/tag/adulthood/page/1/
/tag/success/page/1/
/tag/value/page/1/
/author/Andre-Gide
/tag/life/page/1/
/tag/love/page/1/
/author/Thomas-A-Edison
/tag/edison/page/1/
/tag/failure/page/1/
/tag/inspirational/page/1/
/tag/paraphrased/page/1/
/author/Eleanor-Roosevelt
/tag/misattributed-eleanor-roosevelt/page/1/
/author/Steve-Martin
/tag/humor/page/1/
/tag/obvious/page/1/
/tag/simile/page/1/
/page/2/
/tag/love/
/tag/inspirational/
/tag/life/
/tag/humor/
/tag/books/
/t
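
The `href` values printed above are relative paths. To request them, they first need to be combined with the site's base URL. A minimal sketch using the standard library, with a handful of paths copied from the output:

```python
from urllib.parse import urljoin

base_url = "https://quotes.toscrape.com/"

# A few relative links taken from the output above
hrefs = ["/", "/login", "/author/Albert-Einstein", "/page/2/"]

# urljoin resolves each relative path against the base URL
absolute_links = [urljoin(base_url, href) for href in hrefs]

for link in absolute_links:
    print(link)
```

`urljoin` leaves links that are already absolute unchanged, so it is safe to apply to every `href` on a page.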

### Example: Extract heading

In [21]:
heading = soup.find("h1")
print("Heading:", heading.get_text(strip=True))  # strip=True drops surrounding whitespace

Heading: Quotes to Scrape



### 🖥️ Task: Scraping Laptops Data from WebScraper.io
We will scrape the following details from the laptops category page:

- Product Title  
- Product Price  
- Product Description  
- Number of Reviews  

Finally, we will store results in a Pandas DataFrame.


In [22]:
# Step 1: Import required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [23]:
# Step 2: Send request to the laptops category page
url = "https://webscraper.io/test-sites/e-commerce/static/computers/laptops"
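response = requests.get(url, timeout=10)
response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page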

# Parse HTML content
soup = BeautifulSoup(response.text, "html.parser")

In [24]:
# Step 3: Find all product containers
products = soup.find_all("div", class_="thumbnail")
print("Total products found on this page:", len(products))

Total products found on this page: 6


In [25]:
# Step 4: Extract data (Title, Price, Description, Reviews) safely
data = []

for product in products:
    # Title
    title = product.find("a", class_="title")
    title = title.text.strip() if title else "N/A"

    # Price (using CSS selector so it works even if multiple classes exist)
    price = product.select_one("h4.price")
    price = price.text.strip() if price else "N/A"

    # Description
    description = product.find("p", class_="description")
    description = description.text.strip() if description else "N/A"

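    # Reviews (the Reviews column in the output is empty, so this selector
    # likely does not match the page's markup; verify the class name)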
    reviews = product.find("p", class_="pull-right")
    reviews = reviews.text.strip() if reviews else "N/A"

    # Append to list
    data.append([title, price, description, reviews])

In [26]:
# Step 5: Convert to Pandas DataFrame
df = pd.DataFrame(data, columns=["Title", "Price", "Description", "Reviews"])

In [27]:
# Step 6: Display the first 10 rows
print("✅ Scraping completed successfully!")
print("Total products scraped:", len(df))
df.head(10)

✅ Scraping completed successfully!
Total products scraped: 6


| | Title | Price | Description | Reviews |
|---|---|---|---|---|
| 0 | Packard 255 G2 | $416.99 | 15.6", AMD E2-3800 1.3GHz, 4GB, 500GB, Windows... | |
| 1 | Aspire E1-510 | $306.99 | 15.6", Pentium N3520 2.16GHz, 4GB, 500GB, Linux | |
| 2 | ThinkPad T540p | $1178.99 | 15.6", Core i5-4200M, 4GB, 500GB, Win7 Pro 64bit | |
| 3 | ProBook | $739.99 | 14", Core i5 2.6GHz, 4GB, 500GB, Win7 Pro 64bit | |
| 4 | ThinkPad X240 | $1311.99 | 12.5", Core i5-4300U, 8GB, 240GB SSD, Win7 Pro... | |
| 5 | Aspire E1-572G | $581.99 | 15.6", Core i5-4200U, 8GB, 1TB, Radeon R7 M265... | |
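
The scraped prices are strings such as `$416.99`, so they cannot be sorted or averaged as numbers yet. The sketch below ties Part A to Part B by using a regex to convert them. The helper name `parse_price` is hypothetical, and the sample values are copied from the table above.

```python
import re

# Price strings as they come out of the scraper, plus the "N/A" fallback
prices = ["$416.99", "$306.99", "$1178.99", "N/A"]

def parse_price(raw):
    """Return the numeric part of a price string as a float, or None if absent."""
    # Remove thousands separators, then look for digits with an optional decimal part
    match = re.search(r"\d+(?:\.\d+)?", raw.replace(",", ""))
    return float(match.group()) if match else None

numeric_prices = [parse_price(p) for p in prices]
print(numeric_prices)
```

With a DataFrame, the same helper could be applied to a column, for example `df["Price"].apply(parse_price)`.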


### 🖥️ Task: Scraping All Laptops Data with Pagination
We will:
- Loop through all pages of laptops
- Extract Title, Price, Description, Reviews
- Stop when no more products are found
- Store results in a Pandas DataFrame

In [28]:
# Step 1: Import required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [29]:
# Step 2: Initialize variables
all_data = []   # to store all laptops data
page = 1        # start from page 1

In [30]:
# Step 3: Loop through all pages until no products are found
while True:
    url = f"https://webscraper.io/test-sites/e-commerce/static/computers/laptops?page={page}"
    response = requests.get(url, timeout=10)  # avoid hanging indefinitely on a stalled connection
    soup = BeautifulSoup(response.text, "html.parser")

    products = soup.find_all("div", class_="thumbnail")
    if not products:   # stop if no products on page
        break

    for product in products:
        # Title
        title = product.find("a", class_="title")
        title = title.text.strip() if title else "N/A"

        # Price (safe selector)
        price = product.select_one("h4.price")
        price = price.text.strip() if price else "N/A"

        # Description
        description = product.find("p", class_="description")
        description = description.text.strip() if description else "N/A"

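        # Reviews (the Reviews column in the output is empty, so this selector
        # likely does not match the page's markup; verify the class name)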
        reviews = product.find("p", class_="pull-right")
        reviews = reviews.text.strip() if reviews else "N/A"

        # Append to data list
        all_data.append([title, price, description, reviews])

    print(f"✅ Page {page} scraped successfully with {len(products)} products.")
    page += 1   # move to next page

✅ Page 1 scraped successfully with 6 products.
✅ Page 2 scraped successfully with 6 products.
✅ Page 3 scraped successfully with 6 products.
✅ Page 4 scraped successfully with 6 products.
✅ Page 5 scraped successfully with 6 products.
✅ Page 6 scraped successfully with 6 products.
✅ Page 7 scraped successfully with 6 products.
✅ Page 8 scraped successfully with 6 products.
✅ Page 9 scraped successfully with 6 products.
✅ Page 10 scraped successfully with 6 products.
✅ Page 11 scraped successfully with 6 products.
✅ Page 12 scraped successfully with 6 products.
✅ Page 13 scraped successfully with 6 products.
✅ Page 14 scraped successfully with 6 products.
✅ Page 15 scraped successfully with 6 products.
✅ Page 16 scraped successfully with 6 products.
✅ Page 17 scraped successfully with 6 products.
✅ Page 18 scraped successfully with 6 products.
✅ Page 19 scraped successfully with 6 products.
✅ Page 20 scraped successfully with 3 products.
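
The `while True` loop above relies on the site eventually returning a page with no products. A slightly more defensive shape, sketched here under assumptions, caps the number of pages and pauses between requests. `scrape_all_pages`, `fetch_page`, `max_pages`, and `delay_seconds` are hypothetical names introduced for this example; `fetch_page` stands in for the request-and-parse step so the sketch runs without network access.

```python
import time

def scrape_all_pages(fetch_page, max_pages=50, delay_seconds=0.0):
    """Collect rows page by page until a page comes back empty.

    fetch_page is a hypothetical callable: it takes a page number and
    returns a list of rows (an empty list means there are no more pages).
    """
    rows = []
    for page in range(1, max_pages + 1):   # hard upper bound on requests
        page_rows = fetch_page(page)
        if not page_rows:                  # same stop condition as the loop above
            break
        rows.extend(page_rows)
        time.sleep(delay_seconds)          # pause between requests to be polite
    return rows

# Stand-in for a real fetcher: three fake pages, then nothing
fake_site = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
result = scrape_all_pages(lambda page: fake_site.get(page, []))
print(result)
```

In the notebook, `fetch_page` would wrap the `requests.get` and `BeautifulSoup` calls from Step 3 and return the extracted rows for one page.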


In [31]:
# Step 4: Convert collected data to Pandas DataFrame
df_all = pd.DataFrame(all_data, columns=["Title", "Price", "Description", "Reviews"])

In [32]:
# Step 5: Display results
print("Total products scraped across all pages:", len(df_all))
df_all.head(10)

Total products scraped across all pages: 117


| | Title | Price | Description | Reviews |
|---|---|---|---|---|
| 0 | Packard 255 G2 | $416.99 | 15.6", AMD E2-3800 1.3GHz, 4GB, 500GB, Windows... | |
| 1 | Aspire E1-510 | $306.99 | 15.6", Pentium N3520 2.16GHz, 4GB, 500GB, Linux | |
| 2 | ThinkPad T540p | $1178.99 | 15.6", Core i5-4200M, 4GB, 500GB, Win7 Pro 64bit | |
| 3 | ProBook | $739.99 | 14", Core i5 2.6GHz, 4GB, 500GB, Win7 Pro 64bit | |
| 4 | ThinkPad X240 | $1311.99 | 12.5", Core i5-4300U, 8GB, 240GB SSD, Win7 Pro... | |
| 5 | Aspire E1-572G | $581.99 | 15.6", Core i5-4200U, 8GB, 1TB, Radeon R7 M265... | |
| 6 | ThinkPad Yoga | $1033.99 | 12.5" Touch, Core i3-4010U, 4GB, 500GB + 16GB ... | |
| 7 | Pavilion | $609.99 | 15.6", Core i5-4200U, 6GB, 750GB, Windows 8.1 | |
| 8 | Inspiron 15 | $745.99 | Moon Silver, 15.6", Core i7-4510U, 8GB, 1TB, R... | |
| 9 | Dell XPS 13 | $1281.99 | 13.3" Touch, Core i5-4210U, 8GB, 128GB SSD, Wi... | |
