# Task 1: Web Scraping & Data Processing

**Objective:**  
Scrape 30,000 data points from [Trustpilot](https://www.trustpilot.com/) and process the data into a structured CSV file.

**Key Requirements:**

- **Data Points:**  
  - Extract 30,000 records by targeting multiple categories (e.g., *beauty_wellbeing*, *food_beverages_tobacco*, *electronics_technology*). I scrape from the 'Categories' pages because their listings expose the business contact emails needed for the upcoming email task; Trustpilot does not disclose reviewer emails, in line with its privacy policies. Each category contains up to 10,000 listings, so the three categories with the most entries yield 30,000 data points with email information while staying within privacy guidelines. For testing, I set `max_items=100` per category; for the full dataset, raise it to 10,000 per category (30,000 in total). Scraping takes about 2 seconds per lead on average, because the email is hidden behind a pop-up that appears only after clicking the "Contact" button and takes a moment to load.
  

- **Fields to Extract:**  
  - **Business Name:** Extracted from a dedicated business card.
  - **Website:** Refined URL (removing UTM parameters) from the contact details.
  - **Location:** Address information from the contact details.
  - **Email:** Extracted from the tooltip that appears after clicking the "Contact" button.
  - **Category:** The category from which the data point was scraped.

- **Automation:**  
  - Fully automated Python script using Selenium (with webdriver-manager for driver setup) to navigate and scrape data without manual intervention.
  - Implements pagination to traverse multiple pages.

- **Error Handling & Deduplication:**  
  - Comprehensive exception handling for element retrieval, timeouts, and network issues.
  - Retry mechanisms to ensure data extraction even if some elements are delayed.
  - Deduplication logic to remove any duplicate records before saving.
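
For the Website field described above, the notebook's `clean_website` drops the entire query string. A minimal alternative sketch, assuming you want to remove only `utm_*` parameters and keep any others, could look like this. The function name `strip_utm` and the example URL are illustrative and not part of the notebook.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_utm(url: str) -> str:
    """Return url without utm_* query parameters; other parameters are kept.

    The fragment is dropped as well (an assumption made for this sketch).
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


# Hypothetical URL for illustration only.
print(strip_utm("https://example.com/shop?utm_source=trustpilot&lang=en"))
# prints: https://example.com/shop?lang=en
```

Whether to keep non-UTM parameters is a design choice: dropping the whole query gives cleaner keys for deduplication, while keeping them preserves links that depend on a parameter.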

**Outcome:**  
A reproducible, automated script that outputs a CSV file (`trustpilot_data.csv`) containing the deduplicated records with fields: Category, Business Name, Website, Location, and Email.
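
A quick sanity check on the output file can confirm the header and show how many rows carry a usable email. The sketch below assumes the five fields listed above and treats `N/A` as "no email"; the helper name `summarize` and the sample rows are hypothetical.

```python
import csv
import io

EXPECTED_FIELDS = ["Category", "Business Name", "Website", "Location", "Email"]


def summarize(csv_file):
    """Return (total_rows, rows_with_email) for an open CSV file object."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != EXPECTED_FIELDS:
        raise ValueError(f"Unexpected header: {reader.fieldnames}")
    total = with_email = 0
    for row in reader:
        total += 1
        email = (row["Email"] or "").strip()
        # The scraper writes "N/A" when no email was found.
        if email and email.upper() != "N/A":
            with_email += 1
    return total, with_email


# Hypothetical sample data standing in for trustpilot_data.csv.
sample = io.StringIO(
    "Category,Business Name,Website,Location,Email\n"
    "beauty_wellbeing,Example Spa,https://example.com,Berlin,info@example.com\n"
    "beauty_wellbeing,Example Salon,N/A,N/A,N/A\n"
)
print(summarize(sample))  # prints: (2, 1)
```

To run it against the real file, pass a handle opened with `open("trustpilot_data.csv", newline="", encoding="utf-8")` instead of the in-memory sample.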


In [1]:
# Cell 1: Install required libraries
%pip install selenium webdriver-manager

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Cell 2: Full scraping script with deduplication

import time
import csv
import urllib.parse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

def clean_website(display_text, href):
    """
    Refine the website URL by using the display text if available 
    and removing UTM parameters.
    """
    text = display_text.strip() if display_text else ""
    website = text if text else href.strip()
    # Remove query parameters (like UTM)
    parsed = urllib.parse.urlparse(website)
    if parsed.scheme and parsed.netloc:
        website = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    return website if website else "N/A"

def get_contact_details(driver, contact_card, max_attempts=2):
    """
    Try clicking the "Contact" button and extracting details.
    Returns (website, email, location). Retries if necessary.
    """
    website = "N/A"
    email = "N/A"
    location = "N/A"
    
    for attempt in range(max_attempts):
        try:
            contact_button = contact_card.find_element(By.XPATH, './/button[@aria-label="Contact"]')
            driver.execute_script("arguments[0].scrollIntoView(true);", contact_button)
            driver.execute_script("arguments[0].click();", contact_button)
            # Increase sleep time on retry
            time.sleep(2 if attempt else 1)
            
            wait = WebDriverWait(driver, 10)
            tooltip = wait.until(EC.visibility_of_element_located(
                (By.CSS_SELECTOR, 'div.tooltip_tooltip-inner___wDGV')
            ))
            
            li_elements = tooltip.find_elements(By.XPATH, './/li[contains(@class, "styles_item__9pNrw")]')
            # Reset values for this attempt
            website, email, location = "N/A", "N/A", "N/A"
            for li in li_elements:
                try:
                    a_tag = li.find_element(By.TAG_NAME, "a")
                    # An <a> without an href yields None; treat it as empty.
                    href = a_tag.get_attribute("href") or ""
                    if href.startswith("mailto:"):
                        email = href.replace("mailto:", "").strip() or "N/A"
                    elif href.startswith("tel:"):
                        # Skip phone numbers.
                        continue
                    elif href.startswith("http"):
                        website = clean_website(a_tag.text, href)
                except Exception:
                    text_val = li.text.strip()
                    if text_val:
                        location = text_val
            if website != "N/A" or email != "N/A" or location != "N/A":
                break
        except Exception as e:
            print(f"Attempt {attempt+1} failed to extract contact details: {e}")
            time.sleep(2)
    return website, email, location

def scrape_current_page(driver):
    """
    Scrape all listings from the current page.
    Returns a list of dictionaries with keys: Business Name, Website, Location, Email.
    """
    page_data = []
    
    # Collect business name cards and contact cards separately.
    business_cards = driver.find_elements(By.CSS_SELECTOR, 
        "div.paper_paper__EGeEb.paper_outline__bqVmn.card_card__yyGgu.card_noPadding__OOiac.styles_wrapper__Jg8fe")
    contact_cards = driver.find_elements(By.CSS_SELECTOR, 
        "div.card_cardContent__4Js_A.styles_footerWrapper__fzSEA")
    
    total = min(len(business_cards), len(contact_cards))
    for i in range(total):
        # Extract business name.
        try:
            bn_elem = business_cards[i].find_element(By.CSS_SELECTOR, 
                        "p.typography_heading-xs__osRhC.typography_appearance-default__t8iAq")
            business_name = bn_elem.text.strip()
            if i == 0 and not business_name:
                # The first card can render late; wait briefly and re-read it.
                time.sleep(2)
                business_name = bn_elem.text.strip()
            business_name = business_name or "N/A"
        except Exception:
            business_name = "N/A"
        
        # Extract contact details from the corresponding contact card.
        website, email, location = get_contact_details(driver, contact_cards[i])
        
        page_data.append({
            "Business Name": business_name,
            "Website": website,
            "Location": location,
            "Email": email
        })
    return page_data

def scrape_category(driver, category, max_items=40):
    """
    Scrape listings from a category across multiple pages until max_items are collected.
    """
    data = []
    base_url = f"https://www.trustpilot.com/categories/{category}"
    driver.get(base_url)
    time.sleep(2)
    
    while len(data) < max_items:
        current_page_data = scrape_current_page(driver)
        data.extend([{"Category": category, **d} for d in current_page_data])
        print(f"Collected {len(data)} listings so far for {category}...")
        if len(data) >= max_items:
            break
        
        try:
            next_page_button = driver.find_element(By.CSS_SELECTOR, 'a[name="pagination-button-next"]')
            next_href = next_page_button.get_attribute("href")
            if next_href:
                next_url = urllib.parse.urljoin("https://www.trustpilot.com", next_href)
                driver.get(next_url)
                time.sleep(2)
            else:
                break
        except Exception as e:
            print(f"No next page found or error navigating: {e}")
            break
    
    return data[:max_items]

def deduplicate_records(records):
    """
    Deduplicate records based on a unique key created from all fields.
    """
    seen = set()
    deduped = []
    for rec in records:
        key = (
            rec["Category"].strip().lower(),
            rec["Business Name"].strip().lower(),
            rec["Website"].strip().lower(),
            rec["Email"].strip().lower(),
            rec["Location"].strip().lower()
        )
        if key not in seen:
            seen.add(key)
            deduped.append(rec)
    return deduped

def main():
    categories = ["beauty_wellbeing", "food_beverages_tobacco", "electronics_technology"]
    all_data = []
    
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=options)
    
    # Loop through categories and scrape up to max_items for each.
    # The finally block closes the browser even if scraping raises.
    try:
        for cat in categories:
            print(f"Scraping category: {cat}")
            cat_data = scrape_category(driver, cat, max_items=100)
            all_data.extend(cat_data)
            time.sleep(1)
    finally:
        driver.quit()
    
    # Deduplicate records.
    all_data = deduplicate_records(all_data)
    
    # Write the collected data to CSV.
    csv_file = "trustpilot_data.csv"
    fieldnames = ["Category", "Business Name", "Website", "Location", "Email"]
    try:
        with open(csv_file, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            for row in all_data:
                writer.writerow(row)
        print(f"Scraping complete. {len(all_data)} deduplicated records saved to {csv_file}")
    except Exception as e:
        print("Error writing CSV:", e)

if __name__ == "__main__":
    main()


Scraping category: beauty_wellbeing
Attempt 1 failed to extract contact details: Message: 
Stacktrace:
	GetHandleVerifier [0x00F70B43+25139]
	(No symbol) [0x00F013F4]
	(No symbol) [0x00DE04E3]
	(No symbol) [0x00E283D7]
	(No symbol) [0x00E2872B]
	(No symbol) [0x00E71002]
	(No symbol) [0x00E4D014]
	(No symbol) [0x00E6E778]
	(No symbol) [0x00E4CDC6]
	(No symbol) [0x00E1BDE9]
	(No symbol) [0x00E1D124]
	GetHandleVerifier [0x01274373+3185251]
	GetHandleVerifier [0x0129291A+3309578]
	GetHandleVerifier [0x0128CF42+3286578]
	GetHandleVerifier [0x01007AE0+643536]
	(No symbol) [0x00F0A20D]
	(No symbol) [0x00F070B8]
	(No symbol) [0x00F07257]
	(No symbol) [0x00EF9E00]
	BaseThreadInitThunk [0x74FEFCC9+25]
	RtlGetAppContainerNamedObjectPath [0x771682AE+286]
	RtlGetAppContainerNamedObjectPath [0x7716827E+238]

Collected 20 listings so far for beauty_wellbeing...
Collected 40 listings so far for beauty_wellbeing...
Collected 60 listings so far for beauty_wellbeing...
Collected 80 listings so far for be

# Task 2: Email Marketing Automation

**Objective:**  
Automate the process of sending 20–50 personalized emails using the scraped Trustpilot data. For this implementation, we send 7 emails per target category (beauty_wellbeing, food_beverages_tobacco, and electronics_technology), totaling 21 emails.

**Key Features:**

- **Personalized Emails:**  
  Each email is customized using dynamic fields from the scraped data (e.g., Business Name, Location, Category). The Groq API is used to generate concise, professional emails that include a subject line, a clear call-to-action, and a proper sign-off.

- **Fixed Special Offer:**  
  A predefined offer ("a special 10% discount on our services for new clients") is included in every email to ensure consistency across the campaign.

- **Automated Delivery:**  
  The script sends emails through Gmail SMTP without any manual intervention. It pauses 2 seconds between sends as a simple form of rate limiting for both the Groq API and the SMTP server.

- **CSV Integration & Updates:**  
  The script reads the scraped data from `trustpilot_data.csv` (produced in Task 1) and groups records by category. After sending an email, it updates the CSV by adding two new fields:
  - **Email Sent:** Marked "Yes" if an email was successfully sent, otherwise "No".
  - **Email Body:** Stores the exact email content sent.

- **Robust Error Handling:**  
  API requests and SMTP operations are wrapped in exception handling. If the Groq request fails, the script falls back to a fixed template email so the run can continue.
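
The notebook paces sends with a fixed `time.sleep(2)`. As a hedged alternative, the sketch below enforces a minimum interval between calls and sleeps only for the time still remaining. The class name `MinIntervalLimiter` and its injectable `clock` and `sleep` parameters are assumptions made for illustration and testing.

```python
import time


class MinIntervalLimiter:
    """Ensure at least min_interval seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Short interval so the demo finishes quickly; the notebook uses 2 seconds.
limiter = MinIntervalLimiter(0.01)
for n in range(3):
    limiter.wait()
    print(f"send #{n + 1}")
```

Compared with a fixed sleep, this does not add delay when generating the email already took longer than the interval.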

**Workflow:**

1. **Data Loading:**  
   The script reads the scraped data from `trustpilot_data.csv` and groups records by target categories.

2. **Email Generation:**  
   For each selected business (7 per category), the script generates a personalized email via the Groq API. The prompt instructs the model to produce a concise email with a subject line, body, and professional sign-off (ending with "Best regards, Muhib Al Muntakim, Ray Advertising Limited") that includes the fixed offer.

3. **Email Sending:**  
   The generated email is sent using Gmail-SMTP. If an email is successfully sent, the record is updated to reflect this status along with the email body.

4. **CSV Update:**  
   Once all emails are sent, the updated records (with the new fields "Email Sent" and "Email Body") are written back to `trustpilot_data.csv` for complete traceability.
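
Step 2 relies on the model returning a `Subject:` line followed by the body. A minimal parsing sketch, assuming that format and reusing the notebook's default subject as the fallback, might look like this. The helper name `split_subject_body` and the sample reply are hypothetical.

```python
# Default subject taken from the notebook's fallback email.
DEFAULT_SUBJECT = "Exclusive Invitation from Ray Advertising Limited"


def split_subject_body(text, default_subject=DEFAULT_SUBJECT):
    """Split a model reply into (subject, body).

    If no "Subject:" line is found, the whole reply becomes the body and
    the default subject is used.
    """
    lines = text.strip().splitlines()
    for i, line in enumerate(lines):
        if line.lower().startswith("subject:"):
            subject = line[len("subject:"):].strip()
            # strip() removes the blank separator lines after the subject.
            body = "\n".join(lines[i + 1:]).strip()
            return subject or default_subject, body
    return default_subject, text.strip()


# Hypothetical reply for illustration only.
reply = "Subject: Hello Example Co\n\nDear team,\nWe can help.\n\nBest regards,\nMuhib Al Muntakim"
subject, body = split_subject_body(reply)
print(subject)
print(body)
```

Taking everything after the subject line and stripping it avoids losing the first paragraph when the model omits the blank separator line.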

**Outcome:**  
A fully automated email marketing script that uses the scraped Trustpilot data to send 21 customized B2B emails and records the send status and email body for each recipient in the dataset.


In [15]:
import os
import time
import requests
import smtplib
import urllib.parse
import csv
from collections import defaultdict
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from dotenv import load_dotenv

# Load environment variables from .env
load_dotenv()

# --- Configuration from environment variables ---
GMAIL_SENDER = os.getenv("GMAIL_SENDER")
GMAIL_APP_PASSWORD = os.getenv("GMAIL_APP_PASSWORD")
SENDER_NAME = os.getenv("SENDER_NAME")
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
GROQ_API_URL = "https://api.groq.com/openai/v1/chat/completions"

# Define a fixed special offer
OFFER = "a special 10% discount on our services for new clients"

def generate_email_content(data):
    """
    Generate a concise, friendly, and personalized B2B marketing email using the Groq API.
    The email should include a subject on the first line, a blank line, then the email body,
    ending with a professional sign-off:
    
    Best regards,
    Muhib Al Muntakim
    Ray Advertising Limited
    
    The email includes the fixed offer defined by OFFER.
    
    :param data: A dict with dynamic fields (e.g., Business Name, Location, Category)
    :return: (subject, email_body) tuple
    """
    prompt = (
        f"Write a final, concise, friendly, and personalized B2B marketing email for {data['Business Name']} "
        f"located in {data['Location']}. The email is sent from Muhib Al Muntakim on behalf of Ray Advertising Limited. "
        f"Ray Advertising has extensive experience supporting businesses in the {data['Category']} category. "
        f"Explain briefly how our performance marketing solutions can boost their growth and include the following offer: {OFFER}. "
        "Keep the email short and to the point. Ensure the email ends with a professional sign-off in the following format:\n\n"
        "Best regards,\nMuhib Al Muntakim\nRay Advertising Limited\n\n"
        "Return the email in the following format:\n"
        "Subject: <Your generated subject line>\n\n"
        "<Your email body here>\n\n"
        "Do not include any internal chain-of-thought or reasoning."
    )
    
    headers = {
        "Authorization": f"Bearer {GROQ_API_KEY}",
        "Content-Type": "application/json"
    }
    payload = {
        "model": "llama-3.3-70b-versatile",
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.7,
        "max_tokens": 300,
        "top_p": 0.9
    }
    
    try:
        response = requests.post(GROQ_API_URL, json=payload, headers=headers, timeout=10)
        response.raise_for_status()
        result = response.json()
        full_output = result['choices'][0]['message']['content'].strip()
        if not full_output:
            # Treat an empty reply as a failure so the fallback email is used.
            raise ValueError("Empty response from Groq API")
        # Parse output: expect a "Subject:" line, optional blank lines, then the body.
        lines = full_output.splitlines()
        subject = ""
        body = ""
        for i, line in enumerate(lines):
            if line.lower().startswith("subject:"):
                subject = line[len("Subject:"):].strip()
                # Everything after the subject line is the body; strip() drops
                # the blank separator lines without skipping body text.
                body = "\n".join(lines[i + 1:]).strip()
                break
        if not subject:
            # No usable subject line: use a default subject and the full output.
            subject = "Exclusive Invitation from Ray Advertising Limited"
            body = full_output
        return subject, body
    except requests.exceptions.HTTPError as http_err:
        print(f"HTTP error occurred: {http_err}")
    except requests.exceptions.RequestException as req_err:
        print(f"Request error occurred: {req_err}")
    except Exception as e:
        print(f"An error occurred: {e}")
    
    # Fallback content if API call fails
    fallback_subject = "Exclusive Invitation from Ray Advertising Limited"
    fallback_body = (
        f"Dear {data['Business Name']},\n\n"
        "We at Ray Advertising Limited are excited to introduce our high-performance marketing solutions. "
        f"Our team, led by Muhib Al Muntakim, has extensive experience supporting businesses in the {data['Category']} category, "
        f"and we are currently offering {OFFER} to help you boost your growth. "
        "We would love to discuss how our tailored services can support your unique needs.\n\n"
        "Best regards,\nMuhib Al Muntakim\nRay Advertising Limited"
    )
    return fallback_subject, fallback_body

def send_email(recipient, subject, content):
    """
    Send an email using Gmail SMTP.
    Returns True if the message was accepted by the server, False otherwise.
    """
    msg = MIMEMultipart()
    msg["From"] = f"{SENDER_NAME} <{GMAIL_SENDER}>"
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.attach(MIMEText(content, "plain"))
    
    try:
        with smtplib.SMTP("smtp.gmail.com", 587) as server:
            server.starttls()
            server.login(GMAIL_SENDER, GMAIL_APP_PASSWORD)
            server.send_message(msg)
        print(f"Email sent to {recipient}")
        return True
    except Exception as e:
        print(f"Error sending email to {recipient}: {e}")
        return False

def main():
    # Define target categories and number of emails per category
    target_categories = ["beauty_wellbeing", "food_beverages_tobacco", "electronics_technology"]
    emails_per_category = 7  # Total emails will be 3 * 7 = 21

    # Read all records from trustpilot_data.csv and ensure new fields exist
    all_records = []
    with open("trustpilot_data.csv", "r", encoding="utf-8") as csvfile:
        reader = csv.DictReader(csvfile)
        # Add each new field if it doesn't exist yet
        fieldnames = list(reader.fieldnames or [])
        for extra in ("Email Sent", "Email Body"):
            if extra not in fieldnames:
                fieldnames.append(extra)
        for row in reader:
            # Initialize the new fields if not present
            if "Email Sent" not in row:
                row["Email Sent"] = "No"
            if "Email Body" not in row:
                row["Email Body"] = ""
            all_records.append(row)

    # Group records by target category
    records_by_category = defaultdict(list)
    for record in all_records:
        cat = record.get("Category", "").strip()
        if cat in target_categories:
            records_by_category[cat].append(record)
    
    total_emails_sent = 0
    # For each target category, attempt emails to up to emails_per_category records
    for cat in target_categories:
        count = 0
        for record in records_by_category.get(cat, []):
            if count >= emails_per_category:
                break
            email_address = record.get("Email", "").strip()
            if not email_address or email_address.upper() == "N/A":
                continue
            print(f"\nGenerating email content for: {record['Business Name']} (Category: {cat})")
            subject, email_content = generate_email_content(record)
            print("Generated Email Subject:")
            print(subject)
            print("\nGenerated Email Content:")
            print(email_content)
            sent = send_email(email_address, subject, email_content)
            if sent:
                # Mark the record only when the send actually succeeded
                record["Email Sent"] = "Yes"
                record["Email Body"] = email_content
                total_emails_sent += 1
            # Count attempts, so a failing SMTP setup cannot loop over every record
            count += 1
            time.sleep(2)  # Pause between emails to manage rate limits

    print(f"Total emails sent: {total_emails_sent}")

    # Write updated records back to CSV (overwrite the existing file or create a new one)
    updated_fieldnames = fieldnames  # includes our new fields
    with open("trustpilot_data.csv", "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=updated_fieldnames)
        writer.writeheader()
        for rec in all_records:
            writer.writerow(rec)
    print("CSV file updated with email send status.")

if __name__ == "__main__":
    main()



Generating email content for: PHARX (Category: beauty_wellbeing)
Generated Email Subject:
Unlock Growth for PHARX with Our Expert Solutions

Generated Email Content:
Dear PHARX Team,

I hope this email finds you well. My name is Muhib Al Muntakim, and I'm reaching out from Ray Advertising Limited. We've helped numerous businesses in the beauty and wellbeing category thrive, and I believe our performance marketing solutions can do the same for PHARX.

Our expertise can enhance your online presence, drive more sales, and ultimately boost your growth. As a new client, we're offering you a special 10% discount on our services. This is a great opportunity to elevate your brand and reach new heights.

If you're interested in learning more, I'd be happy to schedule a call to discuss how we can support your business goals.

Best regards,
Muhib Al Muntakim
Ray Advertising Limited
Email sent to support@pharx.de

Generating email content for: Lindywell (Category: beauty_wellbeing)
Generated Emai