# **Approach / Logic**

1. **Input:** Two company website URLs. Any public site works; this notebook uses https://www.pw.live and https://www.labelnest.in.

2. **Fetch** the HTML using requests.

3. **Parse** the content using BeautifulSoup.

4. **Extract** emails and phone numbers from the page text using regular expressions (regex), and collect links from the page's `<a href>` tags.

5. **Clean and validate** data (remove duplicates and discard values that fail the format check).

6. **Save output** to CSV using pandas.

# **Website Contact Scraper**

In [15]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

In [16]:
# ---------- CONFIGURATION ----------
websites = [
    "https://www.pw.live",
    "https://www.labelnest.in"
]

In [17]:
# ---------- REGEX PATTERNS ----------
# Email pattern
email_pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

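# Phone pattern: general numbers (optional +, then digits mixed with spaces, hyphens, dots, or parentheses)
# plus explicit toll-free 1800/1860 forms. Whitespace is allowed inside a match, so adjacent numbers can merge.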
phone_pattern = r"(\+?\d[\d\s\-().]{7,}\d|1800[\s\-]?\d{3}[\s\-]?\d{4}|1860[\s\-]?\d{3}[\s\-]?\d{4})"

# Validation patterns
valid_email_pattern = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
valid_phone_pattern = re.compile(r"^(\+?\d[\d\s\-().]{7,}\d|1800[\s\-]?\d{3}[\s\-]?\d{4}|1860[\s\-]?\d{3}[\s\-]?\d{4})$")
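
To see what these patterns actually capture, the short check below runs them on two strings built from values that appear in this notebook's output. It is only an illustrative sketch: `sample_text`, `found_emails`, `found_phones`, and `merged` are names introduced here, not part of the scraper.

```python
import re

# Patterns copied unchanged from the cell above
email_pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
phone_pattern = r"(\+?\d[\d\s\-().]{7,}\d|1800[\s\-]?\d{3}[\s\-]?\d{4}|1860[\s\-]?\d{3}[\s\-]?\d{4})"

# Hypothetical sample text for illustration
sample_text = "Write to contact@labelnest.in or call +91 9731474655 today."
found_emails = re.findall(email_pattern, sample_text)
found_phones = re.findall(phone_pattern, sample_text)

# The first phone alternative allows whitespace inside a number, so two
# numbers separated by a single space come back as ONE match.
merged = re.findall(phone_pattern, "08448982616 08448982616")

print(found_emails)  # ['contact@labelnest.in']
print(found_phones)  # ['+91 9731474655']
print(merged)        # ['08448982616 08448982616']
```

The last line shows how a repeated number can end up as a single entry in the `Phone` column.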

In [18]:
# ---------- FUNCTIONS ----------
def is_valid_email(email):
    return bool(valid_email_pattern.match(email))

def is_valid_phone(phone):
    return bool(valid_phone_pattern.match(phone))
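
Because `is_valid_phone` re-applies the same pattern that was used for extraction, it accepts every extracted candidate, including merged values such as `08448982616 08448982616`. One possible stricter cleanup is sketched below. The helper names (`split_merged_phones`, `normalize_phone`, `clean_phones`) and the 10 to 13 digit bounds are assumptions made for illustration, not part of the original notebook, and the bounds may need adjusting for other countries.

```python
import re

def split_merged_phones(candidate, min_digits=10):
    """Split a match that looks like several full numbers joined by spaces."""
    parts = candidate.split()
    # Split only when every space-separated part is long enough to be a
    # number on its own; "+91 9731474655" therefore stays in one piece.
    if len(parts) > 1 and all(len(re.sub(r"\D", "", p)) >= min_digits for p in parts):
        return parts
    return [candidate]

def normalize_phone(phone):
    """Keep digits only, preserving a leading '+' if present."""
    digits = re.sub(r"\D", "", phone)
    return "+" + digits if phone.strip().startswith("+") else digits

def clean_phones(candidates, min_digits=10, max_digits=13):
    """Split, normalize, length-check, and de-duplicate phone candidates."""
    seen = {}  # dict keys keep first-seen order
    for candidate in candidates:
        for part in split_merged_phones(candidate, min_digits):
            normalized = normalize_phone(part)
            digit_count = len(normalized.lstrip("+"))
            if min_digits <= digit_count <= max_digits:  # assumed bounds
                seen.setdefault(normalized, None)
    return list(seen)

print(clean_phones(["08448982616 08448982616", "+91 9731474655"]))
# ['08448982616', '+919731474655']
```

Normalizing to digits also means that `+91 9731474655` and `+91-9731474655` are treated as the same number when removing duplicates.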

In [19]:
def clean_links(links):
    """Normalize and remove duplicate links."""
    cleaned = []
    for link in links:
        # Strip fragments (#), query params, trailing slashes
        link = re.sub(r'#.*$', '', link)
        link = re.sub(r'\?.*$', '', link)
        link = link.rstrip('/')
        if link not in cleaned:
            cleaned.append(link)
    return cleaned
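
`clean_links` works on the `href` strings as they appear in the page, so a relative link such as `/contact-us` stays relative. A variant that first resolves each link against the page URL might look like the sketch below. `resolve_and_clean_links` is a hypothetical helper, and `https://www.example.com/` is a placeholder URL, not one of the scraped sites.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def resolve_and_clean_links(base_url, hrefs):
    """Resolve hrefs against base_url, drop query and fragment, de-duplicate in order."""
    cleaned = {}  # dict keys keep first-seen order
    for href in hrefs:
        absolute = urljoin(base_url, href)
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript: and similar
        # Rebuild the URL without query string or fragment, and without a trailing slash
        normalized = urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))
        cleaned.setdefault(normalized, None)
    return list(cleaned)

links = resolve_and_clean_links(
    "https://www.example.com/",
    [
        "/contact-us",
        "contact-us/",
        "https://www.example.com/contact-us?ref=nav#form",
        "mailto:hi@example.com",
    ],
)
print(links)  # ['https://www.example.com/contact-us']
```

All three spellings of the contact page collapse to one absolute URL, and the `mailto:` link is left out of the page list.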

In [20]:
def scrape_contact_info(url):
    """Fetch one page and return a dict with its emails, phones, contact-page links, and notes."""
    notes = ""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, 'html.parser')
        text = soup.get_text(" ", strip=True)
        links = [a['href'] for a in soup.find_all('a', href=True)]

        # Extract contact info (dict.fromkeys removes duplicates while keeping first-seen order)
        emails = list(dict.fromkeys(re.findall(email_pattern, text)))
        phones = list(dict.fromkeys(re.findall(phone_pattern, text)))

        # Validate
        valid_emails = [e for e in emails if is_valid_email(e)]
        valid_phones = [p for p in phones if is_valid_phone(p)]

        # Find and clean contact page links
        contact_pages = [link for link in links if 'contact' in link.lower()]
        contact_pages = clean_links(contact_pages)

        if not valid_emails and not valid_phones:
            notes = "No valid contact info found on main page"

        return {
            "Website": url,
            "Email": ", ".join(valid_emails) if valid_emails else "-",
            "Phone": ", ".join(valid_phones) if valid_phones else "-",
            "Page Found": ", ".join(contact_pages) if contact_pages else url,
            "Notes": notes
        }

    except Exception as e:
        return {
            "Website": url,
            "Email": "-",
            "Phone": "-",
            "Page Found": "-",
            "Notes": f"Error: {e}"
        }
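
`scrape_contact_info` searches only the visible page text, so an address that exists solely inside a `mailto:` or `tel:` link is missed. The sketch below shows one way the already-collected `links` list could be mined for such values. `contacts_from_hrefs` is a hypothetical helper, and the addresses and number in the example are placeholders.

```python
from urllib.parse import unquote

def contacts_from_hrefs(hrefs):
    """Collect addresses and numbers from mailto: and tel: hrefs, in first-seen order."""
    emails, phones = {}, {}  # dict keys keep first-seen order
    for href in hrefs:
        scheme, sep, rest = href.strip().partition(":")
        if not sep:
            continue  # no scheme at all, e.g. "/contact-us"
        scheme = scheme.lower()
        # Drop "?subject=..." style parameters, then decode %-escapes
        target = unquote(rest.split("?", 1)[0]).strip()
        if scheme == "mailto":
            # A mailto: link may list several comma-separated addresses
            for address in target.split(","):
                if address.strip():
                    emails.setdefault(address.strip(), None)
        elif scheme == "tel" and target:
            phones.setdefault(target, None)
    return list(emails), list(phones)

href_emails, href_phones = contacts_from_hrefs([
    "mailto:contact@example.com?subject=Hello",
    "tel:+1-202-555-0100",
    "/contact-us",
    "MAILTO:contact@example.com",
])
print(href_emails)  # ['contact@example.com']
print(href_phones)  # ['+1-202-555-0100']
```

The returned lists could be merged with the regex results before validation, so that both sources go through the same checks.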

In [21]:
# ---------- MAIN SCRIPT ----------
print("🔍 Starting Website Contact Scraper...\n")

results = [scrape_contact_info(site) for site in websites]
df = pd.DataFrame(results)

🔍 Starting Website Contact Scraper...

In [22]:
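# Remove duplicate rows. Website is part of the key so that sites with no
# contacts (Email and Phone both "-") are not collapsed into a single row.
df.drop_duplicates(subset=["Website", "Email", "Phone"], inplace=True)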

In [23]:
# Save to CSV
output_file = "contacts.csv"
df.to_csv(output_file, index=False)

print("✅ Scraping Complete!")
print("📁 Data saved to:", output_file)
print("\n📊 Extracted Contact Information:\n")
print(df.to_string(index=False))

✅ Scraping Complete!
📁 Data saved to: contacts.csv

📊 Extracted Contact Information:

                 Website                Email                   Phone                          Page Found Notes
     https://www.pw.live                    - 08448982616 08448982616      https://www.pw.live/contact-us      
https://www.labelnest.in contact@labelnest.in          +91 9731474655 https://www.labelnest.in/contact-us      
