# Web scraping to generate RSS feed for new positions in economics
This application retrieves new job postings from three main sources:
1. [NBER](https://www.nber.org/career-resources/research-assistant-positions-not-nber)
2. [Predoc](https://predoc.org/opportunities)
3. [EconJobMarket](https://econjobmarket.org/market)
   
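The main third-party packages needed are **requests** and **beautifulsoup4** (the imports below also use pandas, certifi, python-dotenv, and Jinja2); `MIMEText` comes from Python's standard `email` package. As a first step, we import them: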


In [94]:
import xml.etree.ElementTree as ET  # For XML handling
import requests  # For HTTP requests
import certifi  # For SSL certification verification
from bs4 import BeautifulSoup  # For web scraping
import re  # For regular expressions
import os  # For file and environment variable management
import pandas as pd  # For data manipulation
import smtplib  # For sending emails
from email.mime.text import MIMEText  # For constructing email messages
from email.mime.multipart import MIMEMultipart  # For handling email attachments
from IPython.display import Markdown, display  # For displaying tables in Jupyter
from dotenv import load_dotenv  # For loading environment variables
import urllib3  # For managing HTTP connections
from jinja2 import Environment, FileSystemLoader
import datetime
# Suppress SSL warnings for sites with invalid certificates (if necessary)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Load environment variables from .env file
load_dotenv()  # For email credentials (SENDER_EMAIL, SENDER_PASSWORD)


True

In [95]:

PREDOC_URL = "https://predoc.org/opportunities"
NBER_URL = "https://www.nber.org/career-resources/research-assistant-positions-not-nber"
EJM_URL = "https://econjobmarket.org/market"
XML_FILE = "jobs.xml"

# Define your GitHub repository link
GITHUB_REPO_URL = "https://github.com/RickyJ99/RA-rss"
GITHUB_ISSUE_URL = f"{GITHUB_REPO_URL}/issues"

## Downloading the HTML
The following cells download the HTML content from each source and save it in the `sources` folder.
For Predoc there is a certificate issue, so it is easier to use curl (bash on macOS).

In [96]:
!mkdir -p sources
!curl -L "https://predoc.org/opportunities" -o "sources/predoc.html"





In [97]:
def download_html(url, filename):
    """
    Downloads the HTML content from the given URL and saves it to the specified filename.
    """
    try:
        response = requests.get(url, verify=certifi.where())
        response.raise_for_status()
        with open(filename, "w", encoding="utf-8") as f:
            f.write(response.text)
        print(f"Downloaded HTML from {url} to {filename}")
    except Exception as e:
        print(f"Error downloading {url}: {e}")

# Ensure the 'sources' folder exists.
os.makedirs("sources", exist_ok=True)

# Download HTML content for each source.
download_html(PREDOC_URL, "sources/predoc.html")
download_html(NBER_URL, "sources/nber.html")
download_html(EJM_URL, "sources/ejm.html")


Error downloading https://predoc.org/opportunities: HTTPSConnectionPool(host='predoc.org', port=443): Max retries exceeded with url: /opportunities (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1010)')))
Downloaded HTML from https://www.nber.org/career-resources/research-assistant-positions-not-nber to sources/nber.html
Downloaded HTML from https://econjobmarket.org/market to sources/ejm.html
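
The output above shows the certificate error that motivates the separate curl cell for Predoc. One way to keep both approaches in a single step is to try a list of fetchers in order and save the first response that succeeds. The following is a minimal sketch under assumptions: `fetch_with_curl` and `save_with_fallback` are hypothetical helper names that are not part of the notebook, and the sketch assumes `curl` is available on the system path.

```python
import subprocess
from pathlib import Path


def fetch_with_curl(url):
    """Hypothetical helper: fetch a page by shelling out to curl, like the `!curl -L` cell."""
    result = subprocess.run(
        ["curl", "-L", "-sS", url],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def save_with_fallback(url, filename, fetchers):
    """Try each fetcher in order; save the first successful response.

    Returns the name of the fetcher that worked, or None if all of them failed.
    """
    for fetch in fetchers:
        try:
            html = fetch(url)
        except Exception as exc:  # broad on purpose: any failure moves on to the next fetcher
            print(f"{fetch.__name__} failed for {url}: {exc}")
            continue
        path = Path(filename)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(html, encoding="utf-8")
        return fetch.__name__
    return None
```

A call could then look like `save_with_fallback(PREDOC_URL, "sources/predoc.html", [fetch_with_requests, fetch_with_curl])`, where `fetch_with_requests` would be a small wrapper (also hypothetical) around `requests.get(...).text`.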


## Extract Main Field Helper Function 🔑

The `extract_main_field` function analyzes a given text to determine which research fields are mentioned. It searches for multiple keywords in a **case-insensitive** manner. If one or more keywords are found, it returns them as a comma‑separated string. If none are found, it returns `None`.

### Keywords Included:
- **Economics**
- **Macroeconomics**
- **Microeconomics**
- **Labour**
- **Industrial Organization**
- **Entrepreneurship**
- **Healthcare**
- **Discrimination**
- **Finance**
- **Public Policy**

You can extend this list with additional fields in economics as needed.
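
Because the matching is a case-insensitive substring test, two side effects are worth keeping in mind when extending the list: a short keyword also matches inside a longer one ("Economics" inside "Macroeconomics"), and spelling variants are not covered ("Labour" does not match "Labor"). The sketch below illustrates both with a hypothetical `find_keywords` helper and a shortened keyword list; it is not the notebook's function.

```python
def find_keywords(text, keywords):
    """Return the keywords that occur in `text` as case-insensitive substrings."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


# Shortened, illustrative keyword list.
sample_keywords = ["Economics", "Macroeconomics", "Labour", "Finance"]

# "Economics" is reported as well, because it is a substring of "Macroeconomics".
print(find_keywords("Macroeconomics and corporate finance", sample_keywords))
# → ['Economics', 'Macroeconomics', 'Finance']

# The American spelling "Labor" is not matched by the keyword "Labour".
print(find_keywords("Labor, Crime, Housing", sample_keywords))
# → []
```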

In [98]:
def extract_main_field(text):
    """
    Looks for keywords in the provided text.
    Keywords: Economics, Macroeconomics, Microeconomics, Microeconomic theory,
    Macroeconomic theory, Labour, Industrial Organization, Entrepreneurship,
    Healthcare, Discrimination, Finance, Public Policy, Local Economic Policy, Climate.
    Returns a comma-separated string of all found keywords, or None if none are found.
    """
    keywords = [
        # Note the comma after "Macroeconomic theory": without it, Python silently
        # concatenates it with the next string into "Macroeconomic theoryLabour".
        "Economics", "Macroeconomics", "Microeconomics", "Microeconomic theory", "Macroeconomic theory",
        "Labour", "Industrial Organization", "Entrepreneurship",
        "Healthcare", "Discrimination", "Finance", "Public Policy", "Local Economic Policy", "Climate"
    ]
    
    found = []
    for keyword in keywords:
        if keyword.lower() in text.lower():
            found.append(keyword)
    
    if found:
        # Remove duplicates while preserving order and return as a comma-separated string.
        unique_keywords = list(dict.fromkeys(found))
        return ", ".join(unique_keywords)
    else:
        return None

**Function: `extract_program_type(text)`**

- **Purpose:**  
  This function takes a string as input (which might be a job title or description) and determines the program type based on certain keywords.

- **How It Works:**  
  1. **Convert to Lowercase:**  
     The input text is converted to lowercase to ensure case-insensitive matching.
  2. **Keyword Checks:**  
     - If the text contains any variation of "predoctoral" (e.g., "predoctoral", "pre doc", "pre-doc", "predoc"), it returns **"PreDoctoral Program"**.
     - If the text contains any variation of "postdoc" (e.g., "postdoc", "post doc", "post-doc", "postdoctoral", "post doctoral"), it returns **"Post Doc"**.
     - If the text contains "phd" or "ph.d", it returns **"PhD"**.
     - If the text mentions "research assistant" or the abbreviation "ra", it returns **"Research Assistant"**.
  3. **Default Category:**  
     If none of the keywords are found, the function defaults to returning **"Research Assistant"**.
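
One detail of the keyword checks: a plain substring test for "ra" also fires inside ordinary words such as "graduate" or "program". Since the default category is also "Research Assistant", this does not change the result here, but it would matter if the default were ever changed. The sketch below shows a word-boundary variant; `mentions_ra` is a hypothetical helper, not part of the notebook.

```python
import re


def mentions_ra(text):
    """True if `text` contains "research assistant" or the standalone token "ra"."""
    lowered = text.lower()
    return "research assistant" in lowered or re.search(r"\bra\b", lowered) is not None


print("ra" in "graduate program")            # → True (plain substring match)
print(mentions_ra("graduate program"))       # → False
print(mentions_ra("Full-time RA position"))  # → True
```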





In [None]:
def extract_program_type(text):
    """
    Analyzes the provided text (e.g., a job title or description) to determine the program type.
    
    Keywords used:
      - "PreDoctoral Program" if the text includes variations like "predoctoral", "predoc", etc.
      - "Post Doc" if the text includes variations like "postdoc", "post-doctoral", etc.
      - "PhD" if the text includes "phd" or "ph.d".
      - "Research Assistant" (RA) for all other cases.
    
    Returns:
      A string representing the program type.
    """
    # Convert the text to lowercase for case-insensitive matching. 🔍
    text_lower = text.lower()
    
    # Check for PreDoctoral indicators. 🎓
    if any(kw in text_lower for kw in ["predoctoral", "pre doc", "pre-doc", "predoc"]):
        return "PreDoctoral Program"
    
    # Check for Post Doc indicators. 📚
    elif any(kw in text_lower for kw in ["postdoc", "post doc", "post-doc", "postdoctoral", "post doctoral"]):
        return "Post Doc"
    
    # Check for PhD indicators. 🎓
    elif "phd" in text_lower or "ph.d" in text_lower:
        return "PhD"
    
    # Check for Research Assistant indicators. 💼
    elif "research assistant" in text_lower or "ra" in text_lower:
        return "Research Assistant"
    
    # Default category is Research Assistant (RA). 🔄
    else:
        return "Research Assistant"





## Web Scraping Section 🚀

In this section, we set up our web scraping functionality. Our goal is to **extract job details** from pre-doctoral opportunities pages (in this example, from [predoc.org](https://predoc.org/opportunities)). We assume that the HTML content has already been downloaded and saved locally in the `sources` folder.

### Predoc
What This Section Does:
- **Reads the Local HTML File 📂:**  
  We read the downloaded HTML file (`sources/predoc.html`). If the file isn't found, the code prompts you to download it first.
  
- **Parses the HTML Content 🥣:**  
  Using BeautifulSoup, the code parses the HTML to locate the container that holds the opportunity details.
  
- **Extracts Key Information 🔍:**  
  For each job posting, the function extracts:
  - **Program Title** and **Link** from the `<h2>` element.
  - Additional details (like **sponsor**, **institution**, **fields of research**, and **deadline**) from the "copy" `<div>`.
  
- **Determines the Main Field 🔑:**  
  It combines several text fields and passes them to an auxiliary function (`extract_main_field()`) that determines the primary focus (e.g., Economics, Microeconomics, Finance, etc.).

- **Returns the Data as a List 📤:**  
  Each job is stored as a dictionary, and the function returns a list of these dictionaries.

> **Note:**  
> Make sure to download the HTML file before running the scraper, that is, run the previous cells first.
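
The detail extraction relies on the labels that appear in each posting's text ("Sponsoring Researcher(s):", "Sponsoring Institution:", "Fields of Research", "Deadline:"). As a rough illustration of how such label-based regular expressions behave, the sketch below applies the same patterns the scraper uses to a made-up sample string. The `parse_copy_text` helper, the `FIELD_PATTERNS` dictionary, and the sample text are assumptions for illustration only.

```python
import re

# Label-based patterns; each captures the text between one label and the next.
FIELD_PATTERNS = {
    "sponsor": r"Sponsoring Researcher\(s\):\s*(.*?)\s*Sponsoring Institution:",
    "institution": r"Sponsoring Institution:\s*(.*?)\s*Fields of Research",
    "fields": r"Fields of Research\s*:\s*(.*?)\s*Deadline:",
    "deadline": r"Deadline:\s*(.*)",
}


def parse_copy_text(text):
    """Return a dict of extracted details, using "N/A" when a label is missing."""
    details = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text)
        details[name] = match.group(1).strip() if match else "N/A"
    return details


# Made-up sample text in the expected layout.
sample = (
    "Sponsoring Researcher(s): Jane Doe Sponsoring Institution: Example University "
    "Fields of Research: Labor, Housing Deadline: Rolling"
)
print(parse_copy_text(sample))
```

Note that the deadline pattern captures everything up to the end of the text, so any sentence that follows the deadline in the same paragraph ends up in that field as well.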


In [100]:
def scrape_predoc():
    """
    Scrapes the pre-doctoral opportunities page from the local HTML file
    and extracts job details.
    """
    jobs = []
    
    # Attempt to read the local HTML file. 📂
    try:
        with open("sources/predoc.html", "r", encoding="utf-8") as f:
            html = f.read()
    except Exception as e:
        print("Error reading sources/predoc.html. Please download the HTML from predoc before proceeding. 🚫")
        return jobs  # Return an empty list if the file can't be read.
    
    # Parse the HTML content using BeautifulSoup. 🥣
    soup = BeautifulSoup(html, "html.parser")
    
    # Find the container holding the opportunities using a regex on the class name. 🔍
    container = soup.find("div", class_=re.compile("Opportunities"))
    if not container:
        print("No Predoc container found. 😢")
        return jobs
    
    # Loop over each article element within the container. 📝
    articles = container.find_all("article")
    for article in articles:
        job = {}
        job["source"] = "Predoc"  # Mark the source as 'predoc'. 🌟
        
        # Extract the title and link from the <h2> element. 🏷️
        h2 = article.find("h2")
        if h2:
            a_tag = h2.find("a")
            if a_tag:
                job["program_title"] = a_tag.get_text(strip=True)
                job["link"] = a_tag.get("href", "N/A").strip()
            else:
                job["program_title"] = "N/A"
                job["link"] = "N/A"
        else:
            job["program_title"] = "N/A"
            job["link"] = "N/A"
        
        # Extract details from the "copy" div. 🗒️
        copy_div = article.find("div", class_="copy")
        if copy_div:
            p = copy_div.find("p")
            if p:
                text = p.get_text(separator=" ", strip=True)
                # Use regex to capture specific fields from the text. 🔍
                researcher_match = re.search(r"Sponsoring Researcher\(s\):\s*(.*?)\s*Sponsoring Institution:", text)
                institution_match = re.search(r"Sponsoring Institution:\s*(.*?)\s*Fields of Research", text)
                fields_match = re.search(r"Fields of Research\s*:\s*(.*?)\s*Deadline:", text)
                deadline_match = re.search(r"Deadline:\s*(.*)", text)
                job["sponsor"] = researcher_match.group(1).strip() if researcher_match else "N/A"
                job["institution"] = institution_match.group(1).strip() if institution_match else "N/A"
                job["fields"] = fields_match.group(1).strip() if fields_match else "N/A"
                job["deadline"] = deadline_match.group(1).strip() if deadline_match else "N/A"
            else:
                job["sponsor"] = "N/A"
                job["institution"] = "N/A"
                job["fields"] = "N/A"
                job["deadline"] = "N/A"
        else:
            job["sponsor"] = "N/A"
            job["institution"] = "N/A"
            job["fields"] = "N/A"
            job["deadline"] = "N/A"
        
        # Add additional fields for consistency. 🛠️
        job["university"] = "N/A"
        job["program_type"] = "N/A"
        job["publication_date"] = "N/A"
        
        # Determine the main field by combining text from various fields. 🔑
        text_to_search = " ".join([job.get("fields", "N/A"), job.get("program_title", "N/A"), job.get("institution", "N/A")])
        job["main_field"] = extract_main_field(text_to_search)
        
        # Append the extracted job details to the jobs list. ✅
        jobs.append(job)
    
    # Return the list of all extracted job details. 📤
    return jobs
df  = pd.DataFrame(scrape_predoc()).head(10).to_markdown(index=False)
display(Markdown(df))

| source   | program_title                                | link                   | sponsor                                                    | institution                                                                             | fields                                                                            | deadline                                                                                                                                                                                                            | university   | program_type   | publication_date   | main_field                  |
|:---------|:---------------------------------------------|:-----------------------|:-----------------------------------------------------------|:----------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------|:---------------|:-------------------|:----------------------------|
| Predoc   | Research Associates                          | https://bit.ly/3Xhfcvq | N/A                                                        | N/A                                                                                     | Education, Finance, Labor, Macro, Public Policy, and Urban                        | Rolling In addition to a multi-institutional job pool, this website also provides general information about opportunities within the Federal Reserve System as well as up-to-date information re specific openings. | N/A          | N/A            | N/A                | Finance, Public Policy      |
| Predoc   | Pre-Doctoral Positions                       | https://bit.ly/3CM4IxV | Abhishek Nagaraj (UC Berkeley) and Matteo Tranchero (Penn) | UC Berkeley’s Haas School of Business; Wharton School of the University of Pennsylvania | Applied economics, innovation, entrepreneurship, data science and tech innovation | Rolling                                                                                                                                                                                                             | N/A          | N/A            | N/A                | Economics, Entrepreneurship |
| Predoc   | Pre-Doctoral Technical Associate             | https://bit.ly/4hBFgtU | N/A                                                        | N/A                                                                                     | N/A                                                                               | N/A                                                                                                                                                                                                                 | N/A          | N/A            | N/A                |                             |
| Predoc   | Pre-Doctoral Technical Associate             | https://bit.ly/4er9iOY | N/A                                                        | N/A                                                                                     | N/A                                                                               | N/A                                                                                                                                                                                                                 | N/A          | N/A            | N/A                |                             |
| Predoc   | Pre-Doctoral Research Associate in Economics | https://bit.ly/4jQlfBs | Matthew Pecenco                                            | Brown University                                                                        | Labor, Crime, Housing                                                             | Applications will be accepted and reviewed on a rolling basis.                                                                                                                                                      | N/A          | N/A            | N/A                | Economics                   |
| Predoc   | Full-Time Research Assistant                 | https://bit.ly/4hwY3a0 | Matthew Baron                                              | National Bureau of Economic Research                                                    | Banking, Financial Crises, Financial History                                      | Rolling                                                                                                                                                                                                             | N/A          | N/A            | N/A                |                             |
| Predoc   | Pre-Doctoral Fellowship                      | https://bit.ly/3WXU7GY | Hans-Joachim Voth                                          | University of Zurich                                                                    | Economic History, Political Economy, Cultural Economics                           | Applications will be reviewed immediately and are welcome until all positions are filled.                                                                                                                           | N/A          | N/A            | N/A                | Economics                   |
| Predoc   | Research Professional in Accounting          | https://bit.ly/4hyJouK | Professor Philip Berger                                    | The University of Chicago Booth School of Business                                      | Accounting, corporate finance, and labor economics.                               | Applications are reviewed on a rolling basis; the initial full review will be March 15, 2025.                                                                                                                       | N/A          | N/A            | N/A                | Economics, Finance          |
| Predoc   | Research Professional in Marketing           | https://bit.ly/3Eupeng | Professor Andreas Kraft                                    | University of Chicago Booth School of Business                                          | N/A                                                                               | N/A                                                                                                                                                                                                                 | N/A          | N/A            | N/A                |                             |
| Predoc   | Research Professional in Behavioral Science  | https://bit.ly/3QkttnL | Professor Alexander Todorov                                | University of Chicago Booth School of Business                                          | Behavioral Science                                                                | Applications are reviewed on a rolling basis; the initial full review will be March 1, 2025.                                                                                                                        | N/A          | N/A            | N/A                |                             |

### Web Scraping Section for NBER (Local HTML) 🔎

In this section, we extract job details from the locally saved NBER page HTML file. The function follows these steps:

- **📂 Read the Local HTML File:**  
  The function attempts to read `sources/nber.html`. If the file isn't found, it prints an error message and returns an empty list.

- **🥣 Parse HTML with BeautifulSoup:**  
  The HTML content is parsed so we can navigate and extract the data.

- **🔍 Locate the Container:**  
  It finds the `<div>` with class `page-header__intro-inner` that holds the job details.

- **✂️ Skip Header Paragraphs:**  
  The first two `<p>` elements are skipped (the code slices `paragraphs[2:]`), as they contain header information.

- **📋 Extract Job Details:**  
  For each job posting, the function extracts:
  - Program title  
  - Sponsor  
  - Institution  
  - Fields of research  
  - Job link  
  Paragraphs that do not contain all of these parts are skipped; the deadline and publication date are set to `"N/A"`, and a missing link becomes an empty string.

- **🔑 Determine Main Field:**  
  It passes the fields of research to the helper function `extract_main_field()` (defined earlier in this notebook) to determine the primary research area.

- **✅ Return the Jobs List:**  
  Finally, all extracted job entries are stored in a list and returned.
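
The fields line needs some cleanup before it is usable, because the raw HTML fragment can contain the `&amp;` entity and semicolon separators. As an alternative illustration of that cleanup step (not the notebook's own implementation, which uses chained `split` calls), the sketch below uses the standard-library `html.unescape` and a single regular-expression split. The `normalize_fields` name is hypothetical.

```python
import html
import re


def normalize_fields(raw):
    """Turn a raw "Field(s) of Research: ..." fragment into a comma-separated string."""
    text = html.unescape(raw)                        # "&amp;" becomes "&"
    text = text.replace("Field(s) of Research:", "")
    parts = re.split(r"[;&]", text)                  # treat ";" and "&" as separators
    return ", ".join(part.strip() for part in parts if part.strip())


print(normalize_fields("Field(s) of Research: State &amp; Local Economic Policy"))
# → State, Local Economic Policy
print(normalize_fields("Field(s) of Research: Entrepreneurship; Finance"))
# → Entrepreneurship, Finance
```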


In [101]:
def scrape_nber():
    """
    Scrapes the NBER research assistant positions page from a local HTML file
    and extracts job details.
    """
    jobs = []
    
    # Attempt to read the local HTML file. 📂
    try:
        with open("sources/nber.html", "r", encoding="utf-8") as f:
            html = f.read()
    except Exception as e:
        print("Error reading sources/nber.html. Please download the HTML from NBER before proceeding. 🚫")
        return jobs  # Return an empty list if the file can't be read.
    
    # Parse the HTML content using BeautifulSoup. 🥣
    soup = BeautifulSoup(html, "html.parser")
    
    # Find the container holding the job details using its class name. 🔍
    container = soup.find("div", class_="page-header__intro-inner")
    if container:
        # Get all <p> elements inside the container. 📝
        paragraphs = container.find_all("p")
        # Skip the first two header paragraphs. ✂️
        for p in paragraphs[2:]:
            job = {}
            job["source"] = "NBER"  # Mark the source as NBER. 🌟
            parts = p.decode_contents().split("<br>")[0].split("<br/>")


            # parts[4] (the link) is read below, so at least five parts are required.
            if len(parts) >= 5:
                job["program_title"]        = parts[0].strip()
                job["sponsor"]              = parts[1].replace("NBER Sponsoring Researcher(s):","").strip()
                job["institution"]          = parts[2].replace("Institution:","").strip()
                if len(parts[3].replace("Field(s) of Research:","").strip().split("&amp"))>1:
                    fields = "".join(field.strip() for field in parts[3].replace("Field(s) of Research:","").strip().split("&amp"))
                else:
                    fields         = parts[3].replace("Field(s) of Research:","").strip()
                if len(fields.split(";")) > 1:
                    fields = ", ".join(field.strip() for field in fields.split(";"))

                if len(fields.split(":")) > 1:
                    fields = fields.split(":")[1]
                job["fields"]               =   fields
                job["program_type"]         = extract_program_type(job["program_title"])
                # Use the cleaned fields text to determine the main field. 🔑
                text_to_search =  fields
                job["main_field"] = extract_main_field(text_to_search)
                # Extract the job link from the HTML in the last part. 🔗
                link_soup = BeautifulSoup(parts[4], "html.parser")
                a_tag = link_soup.find("a")
                job["link"]                 = a_tag["href"] if a_tag else ""
                job["deadline"]             = "N/A"  # Deadline not provided. ⏰
                job["publication_date"]     = "N/A"
                # Append the extracted job to our list. ✅
                jobs.append(job)
    else:
        print("NBER container not found. 😢")
    
    # Return the list of all extracted job details. 📤
    return jobs
df  = pd.DataFrame(scrape_nber()).head(10).to_markdown(index=False)
display(Markdown(df))

| source   | program_title                                                 | sponsor                       | institution                                                                        | fields                                                                     | program_type        | main_field                | link                                                                                                                                      | deadline   | publication_date   |
|:---------|:--------------------------------------------------------------|:------------------------------|:-----------------------------------------------------------------------------------|:---------------------------------------------------------------------------|:--------------------|:--------------------------|:------------------------------------------------------------------------------------------------------------------------------------------|:-----------|:-------------------|
| NBER     | Predoctoral Research Analyst                                  | Josh Rauh                     | Hoover Institution                                                                 | State, Local Economic Policy                                               | PreDoctoral Program | Local Economic Policy     | https://careersearch.stanford.edu/jobs/research-analyst-for-state-and-local-government-initiative-27190                                   | N/A        | N/A                |
| NBER     | Postdoctoral Research Fellowship                              | Josh Rauh                     | Hoover Institution                                                                 | State, Local Economic Policy                                               | Post Doc            | Local Economic Policy     | https://applyrtf.hoover.org/                                                                                                              | N/A        | N/A                |
| NBER     | Predoctoral Research Associate                                | Paul A. Gompers               | Harvard Business School                                                            | Entrepreneurship, Finance                                                  | PreDoctoral Program | Entrepreneurship, Finance | https://www.dropbox.com/scl/fi/71ook3ulz90ncx67qvf8d/Gompers_predoc_job_posting_2025.pdf?rlkey=nqoh9hutr0gprkm8xyukhrm00&st=9h2t9x7v&dl=0 | N/A        | N/A                |
| NBER     | Predoctoral Fellow                                            | Amanda Pallais                | Harvard University                                                                 | Labor Economics                                                            | PreDoctoral Program | Economics                 | https://academicpositions.harvard.edu/postings/14690                                                                                      | N/A        | N/A                |
| NBER     | Full-Time Pre-Doctoral Fellow/Research Assistant              | Stephan Heblich               | University of Toronto, Department of Economics and Rotman School of Management     | Applied Economics                                                          | PreDoctoral Program | Economics                 | https://foslab.org/join-our-team/pre-doc-fellowships/pre-doctoral-fellowship-in-applied-economics/                                        | N/A        | N/A                |
| NBER     | Pre-doctoral Research Fellows                                 | Gordon Hanson and Dani Rodrik | Harvard Kennedy School                                                             | Regional dimensions of inequality in the United States and other countries | PreDoctoral Program |                           | https://www.hks.harvard.edu/centers/wiener/programs/economy/about/opportunities/pre-doc-fellow-2025                                       | N/A        | N/A                |
| NBER     | Research Professional                                         | Michael Greenstone            | University of Chicago, Climate Impact Lab                                          | Environment, Climate                                                       | Research Assistant  | Climate                   | https://job-boards.greenhouse.io/uchicagoepic/jobs/6349570003                                                                             | N/A        | N/A                |
| NBER     | Pre-doctoral Fellow                                           | Lisa B. Kahn                  | University of Rochester                                                            | Labor Economics                                                            | PreDoctoral Program | Economics                 | https://docs.google.com/forms/d/e/1FAIpQLSd2VitZ6wmUBL0Qyqiyg-zj-hSOjXIuMVke4XSJI4bLri0VDg/viewform?pli=1                                 | N/A        | N/A                |
| NBER     | Monitoring &amp; Evaluation Specialist, EPIC Air Quality Fund | Michael Greenstone            | University of Chicago (but not a direct hire, will work with an EOR)               | Air Quality, Environment, Climate                                          | Research Assistant  | Climate                   | https://epic.uchicago.edu/opportunities/monitoring-evaluation-specialist-epic-air-quality-fund/                                           | N/A        | N/A                |
| NBER     | Research Analyst                                              | Dean Karlan, Christopher Udry | Global Poverty Research Lab, Kellogg School of Management, Northwestern University | Development economics                                                      | Research Assistant  | Economics                 | https://www.povertyactionlab.org/careers/research-analyst-global-poverty-research-lab-kellogg-school-management-northwestern              | N/A        | N/A                |

### Web Scraping Section for EJM (Econ Job Market) 🔎

This function is designed to scrape job postings from the Econ Job Market (EJM) page. It performs the following tasks:

- **🌐 Fetching the Page:**  
  It sends an HTTP GET request to the EJM URL using the `requests` library.

- **🥣 Parsing HTML:**  
  The response content is parsed with BeautifulSoup to create a DOM structure for extraction.

- **🔍 Locating Job Panels:**  
  It finds all `<div>` elements with the classes `"panel panel-info"`, each representing a job posting.

- **🏷️ Extracting Job Details:**  
  For each panel, it extracts:
  - **Job Title & Link:** Located within an `<a>` tag with an ID starting with "title-".  
  - **University & Program Type:** Extracted from `<div>` elements with class `"col-md-4"` and `"col-md-2"`, respectively.
  - **Publication Date & Deadline:** Extracted from `<div>` elements with class `"col-md-2"`.
  - **Default Values:** Fields such as **sponsor**, **institution**, and **fields** are set to `"N/A"` since they're not provided.
  
- **🔑 Determining the Main Field:**  
  It combines the program title and university information to deduce the primary research field using the helper function `extract_main_field()`.

- **✅ Building the Result List:**  
  Each job is stored as a dictionary, and all such dictionaries are appended to a list which is then returned.
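
Before a job is stored, `scrape_ejm()` also tidies the raw fields text with three regular-expression passes. The sketch below isolates that step so it can be tried without hitting the website; the helper name `clean_fields` and the sample string are illustrative assumptions, not part of the notebook.

```python
import re


def clean_fields(fields_raw):
    """Apply the same three clean-up passes that scrape_ejm() uses on 'fields'."""
    cleaned = re.sub(r"[•;]", "", fields_raw)   # drop bullet dots and semicolons
    cleaned = re.sub(r",\s*,", ",", cleaned)    # merge a doubled comma into one
    cleaned = re.sub(r"\s+", " ", cleaned)      # squeeze runs of whitespace
    cleaned = cleaned.strip(" ,")               # trim stray spaces and commas at the edges
    return cleaned if cleaned else "N/A"


# Hypothetical raw text, shaped like what the third column can contain
sample = "Development • Growth;, , Econometrics,"
print(clean_fields(sample))  # Development Growth, Econometrics
```

Text that is empty after cleaning falls back to `"N/A"`, which keeps the column consistent with the other defaults.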



In [102]:
def scrape_ejm():
    """
    Scrapes the Econ Job Market (EJM) page and extracts detailed job information
    from the newer HTML structure. Post-processing steps include:
      - Merging single/multiple salaries into one cell.
      - Parsing sponsor(s) from text referencing "Professors ..."
      - Removing extraneous punctuation in 'fields' (bullet '•', semicolon ';', repeated commas).
      - Replacing 'link' with the final application link (or "N/A" if missing).
      - Inheriting 'start_date' if 'Flexible' from a previous non-Flexible record.
    
    Returns a list of dictionaries.
    """
    EJM_URL = "https://econjobmarket.org/market"
    jobs = []
    
    try:
        # A timeout keeps the notebook from hanging if the site does not respond
        response = requests.get(EJM_URL, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "html.parser")
        
        # Each job listing is typically under <div class="panel panel-info">
        panels = soup.find_all("div", class_="panel panel-info")
        for panel in panels:
            job = {}
            job["source"] = "ejm"
            title_a = None  # Reset per panel so the collapse block never reuses a stale tag
            
            # ---------- MAIN ROW (col-md-4, col-md-4, col-md-2, col-md-2) ----------
            main_row = panel.find("div", class_="row")
            if not main_row:
                continue
            
            cols = main_row.find_all("div", recursive=False)
            
            # --- FIRST COLUMN: title, location, start_date, duration ---
            if len(cols) >= 1:
                first_col = cols[0]
                title_a = first_col.find("a", id=lambda x: x and x.startswith("title-"))
                
                if title_a:
                    job["program_title"] = title_a.get_text(strip=True)
                    # We'll store a temporary link here; final link will become 'application_link'
                    job["temp_link"] = title_a.get("href", "").strip()
                else:
                    job["program_title"] = "N/A"
                    job["temp_link"] = ""
                
                col_text = first_col.get_text(separator="\n", strip=True).split("\n")
                # Often line 2 is location
                job["location"] = col_text[1].strip() if len(col_text) >= 2 else "N/A"
                
                job["start_date"] = "N/A"
                job["duration"] = "N/A"
                for line in col_text:
                    lower_line = line.lower()
                    if lower_line.startswith("starts"):
                        clean_line = line.replace("Starts", "").replace(".", "").strip()
                        job["start_date"] = clean_line if clean_line else "N/A"
                    elif lower_line.startswith("duration"):
                        clean_line = line.replace("Duration:", "").strip()
                        job["duration"] = clean_line if clean_line else "N/A"
            
            # --- SECOND COLUMN: department, university ---
            if len(cols) >= 2:
                second_col = cols[1]
                lines_2 = second_col.get_text(separator="\n", strip=True).split("\n")
                job["department"] = lines_2[0].strip() if len(lines_2) >= 1 else "N/A"
                job["university"] = lines_2[1].strip() if len(lines_2) >= 2 else "N/A"
            
            # --- THIRD COLUMN: program_type, fields ---
            if len(cols) >= 3:
                third_col = cols[2]
                program_text = third_col.get_text(separator="\n", strip=True).split("\n", 1)
                job["program_type"] = program_text[0].strip() if program_text else "N/A"
                
                fields_div = third_col.find("div", id=re.compile(r"cats-\d+"))
                if fields_div:
                    fields_raw = fields_div.get_text(separator=", ", strip=True)
                else:
                    fields_raw = program_text[1].strip() if len(program_text) > 1 else ""
                
                # Clean fields: remove bullet dots, semicolons, repeated commas
                fields_clean = re.sub(r"[•;]", "", fields_raw)
                fields_clean = re.sub(r",\s*,", ",", fields_clean)
                fields_clean = re.sub(r"\s+", " ", fields_clean).strip(" ,")
                job["fields"] = fields_clean if fields_clean else "N/A"
            
            # --- FOURTH COLUMN: publication_date, deadline ---
            if len(cols) >= 4:
                fourth_col = cols[3]
                spans = fourth_col.find_all("span")
                
                job["publication_date"] = spans[0].get_text(strip=True) if len(spans) > 0 else "N/A"
                job["deadline"] = spans[1].get_text(strip=True) if len(spans) > 1 else "N/A"
            else:
                job["program_type"] = job.get("program_type") or "N/A"
                job["publication_date"] = "N/A"
                job["deadline"] = "N/A"
                job["fields"] = job.get("fields") or "N/A"
            
            # Placeholders for collapsed info
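            job["sponsor"] = "N/A"
            # The second column may be missing, so fall back to "N/A" instead of raising KeyError
            job.setdefault("department", "N/A")
            job.setdefault("university", "N/A")
            job["institution"] = job["university"]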
            job["main_field"] = extract_main_field(job["fields"])
            job["degree_required"] = "N/A"
            job["salary_range"] = "N/A"
            job["application_link"] = "N/A"
            
            # ---------- COLLAPSE BLOCK (extended info) ----------
            if title_a:
                collapse_id = title_a.get("href", "")
                if collapse_id.startswith("#"):
                    collapse_div_id = collapse_id[1:]
                    collapse_div = panel.find("div", id=collapse_div_id)
                    if collapse_div:
                        # We'll parse the entire collapse text in one go
                        collapse_text = collapse_div.get_text(separator="\n", strip=True)
                        
                        # Parse sponsor(s) from text with "Professors" ...
                        # We'll look for a pattern: "Professors (.*?)." or "Professor (.*?)."
                        # This is a heuristic; adjust to your content.
                        prof_match = re.search(
                            r'(?:[Pp]rofessors?\s+)(.*?)(?:\.|$)', collapse_text
                        )
                        if prof_match:
                            sponsor_str = prof_match.group(1)
                            # Replace ' and ' with comma
                            sponsor_str = sponsor_str.replace(" and ", ", ")
                            # Split by commas
                            sponsor_list = [x.strip() for x in sponsor_str.split(",") if x.strip()]
                            # Re-join with commas
                            job["sponsor"] = ", ".join(sponsor_list)
                        
                        # We'll search within <div> tags with <strong> for structured data
                        additional_divs = collapse_div.find_all("div")
                        for div_item in additional_divs:
                            strong_tag = div_item.find("strong")
                            if strong_tag:
                                label = strong_tag.get_text(strip=True).lower()
                                val = div_item.get_text(separator="\n", strip=True)
                                # remove the strong text from val
                                val = val.replace(strong_tag.get_text(strip=True), "").strip(": \n")
                                
                                if "degree required" in label:
                                    job["degree_required"] = val if val else "N/A"
                                elif "job start date" in label:
                                    job["start_date"] = val if val else "N/A"
                                elif "job duration" in label:
                                    job["duration"] = val if val else "N/A"
                                elif "salary" in label:
                                    # unify multiple lines for salary
                                    raw_lines = val.split("\n")
                                    unified = " ".join(x.strip() for x in raw_lines if x.strip())
                                    job["salary_range"] = unified if unified else "N/A"
                        
                        # Try to parse "To Apply" link
                        # 'string=' replaces the deprecated 'text=' argument in BeautifulSoup
                        apply_paragraph = collapse_div.find("p", string=re.compile(r"To\s+Apply", re.IGNORECASE))
                        if apply_paragraph:
                            next_link = apply_paragraph.find_next("a", href=True)
                            if next_link:
                                job["application_link"] = next_link.get("href", "N/A")
                        else:
                            # or search any <a> with 'apply' in text
                            apply_a = collapse_div.find("a", href=True, string=re.compile(r"apply", re.IGNORECASE))
                            if apply_a:
                                job["application_link"] = apply_a.get("href", "N/A")
            
            jobs.append(job)
    
    except Exception as e:
        print("Error during EJM scraping:", e)
    
    # ---------- POST-PROCESSING ----------
    # (1) Replace 'link' with final 'application_link', or "N/A" if missing/'https://econjobmarket.org'
    # (2) If 'start_date' == 'Flexible', copy from the nearest preceding non-Flexible record
    for i, job in enumerate(jobs):
        # (1) link substitution
        app_link = job["application_link"]
        if not app_link or app_link.strip() == "" or app_link.strip() == "https://econjobmarket.org":
            job["link"] = "N/A"
        else:
            job["link"] = app_link
        
        # (2) if 'start_date' is 'Flexible', inherit from previous
        sd = job.get("start_date")
        if isinstance(sd, str) and sd.lower() == "flexible":
            new_date = "N/A"
            # look upward
            for j in range(i-1, -1, -1):
                prev_sd = jobs[j].get("start_date")
                if prev_sd and isinstance(prev_sd, str) and prev_sd.lower() != "flexible":
                    new_date = prev_sd
                    break
            job["start_date"] = new_date  # can be "N/A" if never found
    
    # Remove temp columns
    for job in jobs:
        if "temp_link" in job:
            del job["temp_link"]
        if "application_link" in job:
            del job["application_link"]
    
    return jobs


df  = pd.DataFrame(scrape_ejm()).head(10).to_markdown(index=False)

display(Markdown(df))



| source   | program_title                                                                   | location              | start_date   | duration             | department                                                       | university                            | program_type                 | fields                                                                                                                                                                                                                                                                                          | publication_date   | deadline    | sponsor                                                                                                                          | institution                           | main_field                                                               | degree_required   | salary_range         | link                                                                                                                                                                                                                             |
|:---------|:--------------------------------------------------------------------------------|:----------------------|:-------------|:---------------------|:-----------------------------------------------------------------|:--------------------------------------|:-----------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------|:------------|:---------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------|:-------------------------------------------------------------------------|:------------------|:---------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ejm      | Faculty of Economics at Plaksha University, India                               | Mohali,               | N/A          | Continuing/permanent | Plaksha University                                               | N/A                                   | Various position types       | Development Growth, Econometrics, Environmental Ag. Econ., Experimental Economics, Finance, International Finance/Macro, Macroeconomics Monetary, Behavioral Economics, Any field, Business Economics, Marketing, Operations Research, Statistics, Microeconomic theory, Applied microeconomics | 19 Nov 2024        | 30 Jul 2025 | N/A                                                                                                                              | N/A                                   | Economics, Macroeconomics, Microeconomics, Microeconomic theory, Finance | Doctorate         | N/A                  | https://apply.interfolio.com/39648/positions                                                                                                                                                                                     |
| ejm      | Predoctoral Research Analyst -- Applied Microeconomics                          | Philadelphia,         | 2025-07-01   | 2 years              | Business Economics and Public Policy, Wharton School             | University of Pennsylvania            | Research Assistant (Pre-Doc) | Development Growth, Environmental Ag. Econ., Experimental Economics, Labor Demographic Economics, Public Economics, Behavioral Economics                                                                                                                                                        | 9 Oct 2024         | 1 Jul 2025  | Arthur van Benthem, Susanna Berkouwer, Benjamin Lockwood, Corinne Low, Judd Kessler, Alex Rees-Jones                             | University of Pennsylvania            | Economics                                                                | Bachelors         | 55,000 to 60,000 USD | https://wd1.myworkdaysite.com/en-US/recruiting/upenn/careers-at-penn/details/Faculty-Research-Analyst--Business--Economics--and-Public-Policy-Department--Wharton-School_JR00098140-1?jobFamily=ac2a3e0e9a860145e03c7bdc4209c207 |
| ejm      | Research Assistant                                                              | (see ad for location) | 2025-07-01   | 6 months             | Animal Welfare Economics Working Group                           | N/A                                   | Research Assistant           | Any field                                                                                                                                                                                                                                                                                       | 15 Dec 2024        | 15 Feb 2025 | N/A                                                                                                                              | N/A                                   |                                                                          | Bachelors         | 50 USD               | https://docs.google.com/forms/d/e/1FAIpQLSfi18yuyB6O4KiefLlYMqYTvD2HSNtqwpZnW_ZPPUz1cUp4yA/viewform?usp=sf_link                                                                                                                  |
| ejm      | Research Assistant (CAGE)                                                       | Coventry,             | 2025-07-01   | 18 months            | Economics                                                        | University of Warwick                 | Research Assistant           | Any field                                                                                                                                                                                                                                                                                       | 13 Feb 2025        | 17 Mar 2025 | N/A                                                                                                                              | University of Warwick                 |                                                                          | Masters           | N/A                  | N/A                                                                                                                                                                                                                              |
| ejm      | PhD scholarships in Economics and Management on Ports as Energy Transition Hubs | Frederiksberg,        | 2025-09-01   | 36 months            | Department of Economics                                          | Copenhagen Business School            | Doctoral student             | Business Economics Management, General                                                                                                                                                                                                                                                          | 22 Jan 2025        | 17 Mar 2025 | and professors, supported by research related PhD courses                                                                        | Copenhagen Business School            | Economics                                                                | Masters           | N/A                  | https://candidate.hr-manager.net/ApplicationInit.aspx?cid=1309&ProjectId=147437&DepartmentId=18993&MediaId=4614&SkipAdvertisement=true                                                                                           |
| ejm      | 1 Research Grant, IGIER Research Center                                         | Milano,               | 2025-05-01   | 36 months            | Economics                                                        | Bocconi University                    | Postdoctoral Scholar         | Labor Demographic Economics, Political Economy, Management, Information Technology                                                                                                                                                                                                              | 16 Dec 2024        | 9 Mar 2025  | N/A                                                                                                                              | Bocconi University                    | Economics                                                                | Doctorate         | N/A                  | N/A                                                                                                                                                                                                                              |
| ejm      | Post-Doctoral Associate in the Division of Social Science [Economics]           | Abu Dhabi,            | 2025-05-01   | 2 years              | Economics                                                        | New York University Abu Dhabi         | Research Assistant           | Macroeconomics Monetary                                                                                                                                                                                                                                                                         | 16 Nov 2024        | 1 Dec 2024  | Jean Imbs, Laurent Pauwels                                                                                                       | New York University Abu Dhabi         | Economics, Macroeconomics                                                | Doctorate         | N/A                  | https://nyuad.nyu.edu/content/dam/nyuad/about/careers/magazines/NYU-Abu-Dhabi-Compensation-and-Benefits.pdf                                                                                                                      |
| ejm      | Research Assistant                                                              | Boston,               | 2025-05-01   | 1 year               | FutureTech                                                       | Massachusetts Institute of Technology | Research Assistant           | Research Assistant (Pre-Doc) Econometrics                                                                                                                                                                                                                                                       | 15 Jan 2025        | 1 Mar 2025  | of economics, international studies at Boston College, holding the White Family assistant professorship chair between 2020, 2023 | Massachusetts Institute of Technology |                                                                          | Bachelors         | N/A                  | N/A                                                                                                                                                                                                                              |
| ejm      | Research Engineer or Postdoctoral position in Health Economics                  | Cergy-Pontoise cedex, | 2025-05-01   | 1 year               | Information Systems, Decision Sciences and Statistics Department | ESSEC Business School                 | Postdoctoral Scholar         | Other academic Research Assistant Econometrics Health Education Welfare                                                                                                                                                                                                                         | 13 Aug 2024        | 15 Sep 2024 | N/A                                                                                                                              | ESSEC Business School                 |                                                                          | Masters           | N/A                  | mailto:lamiraud@essec.edu                                                                                                                                                                                                        |
| ejm      | Several postdoctoral researchers in microeconomic theory                        | Bonn,                 | 2025-08-01   | 3 years              | Department of Economics                                          | University of Bonn                    | Postdoctoral Scholar         | Research Assistant Microeconomic theory                                                                                                                                                                                                                                                         | 5 Nov 2024         | 28 Feb 2025 | N/A                                                                                                                              | University of Bonn                    | Microeconomic theory                                                     | Doctorate         | N/A                  | https://econjobmarket.org/login                                                                                                                                                                                                  |

## XML & Email Handling Section 📊✉️

This section contains helper functions to manage your job database and send email notifications when new opportunities are detected.

### 1. Reading Existing Jobs from an XML File 📂

The `read_existing_jobs()` function reads the XML file that contains saved job listings and returns a **dictionary** that maps each source (e.g., "predoc", "nber", "ejm") to a **set** of job signatures that are already recorded.  
- It checks if the file exists and returns an empty dictionary otherwise.  
- It uses `xml.etree.ElementTree` to iterate over the `<entry>` elements and turns the fields of each entry into a `frozenset` signature.  

### 2. Appending New Jobs to the XML File 💾

The `append_jobs_to_xml()` function takes a list of job dictionaries and appends them to the specified XML file.
- If the XML file doesn't exist, it creates a new `<jobs>` root element.  
- It then appends each job that is not already stored as a new `<entry>`, and writes the file only when at least one entry was added.  
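
The code in this notebook stores jobs as `<entry>` elements in XML and recognises a job it has seen before by comparing `frozenset` signatures grouped by source. The following is a minimal sketch of that idea on an in-memory document; the helper names `signature`, `signatures_by_source`, and `is_new`, as well as the sample records, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stored file content with a single saved job
SAMPLE_XML = """<jobs>
  <entry>
    <source>ejm</source>
    <program_title>Research Assistant</program_title>
    <link>N/A</link>
  </entry>
</jobs>"""


def signature(job):
    """Order-independent fingerprint of a job: a frozenset of (key, value) pairs."""
    pairs = []
    for key, value in job.items():
        text = "" if value is None else str(value).strip()
        pairs.append((str(key).strip().lower(), text if text else "N/A"))
    return frozenset(pairs)


def signatures_by_source(root):
    """Group the signatures of all <entry> elements by their 'source' field."""
    by_source = {}
    for entry in root.findall("entry"):
        job = {child.tag: child.text for child in entry}
        by_source.setdefault(job.get("source", "Unknown"), set()).add(signature(job))
    return by_source


def is_new(job, existing):
    """A job is new when its signature is absent from the set stored for its source."""
    return signature(job) not in existing.get(job.get("source", "Unknown"), set())


existing = signatures_by_source(ET.fromstring(SAMPLE_XML))
seen_job = {"source": "ejm", "program_title": "Research Assistant", "link": "N/A"}
new_job = {"source": "ejm", "program_title": "Postdoctoral Scholar", "link": "N/A"}
print(is_new(seen_job, existing), is_new(new_job, existing))  # False True
```

Because the signature covers every field, a posting whose deadline or salary text changes is treated as a new job; comparing on the link alone would be a stricter alternative.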

### 3. Sending Email Notifications ✉️

- **Purpose:**  
  The `send_email_new_jobs()` function sends an email notification whenever new job records are found. The subject states how many new jobs were found, e.g., "New Job Opportunities Found (3)".

- **Email Body:**  
  The email body is an HTML document rendered from the Jinja2 template `templates/email.html`. Each job record passed to the template carries fields such as:
  - **Source** (e.g., "predoc", "nber", "ejm")
  - **Program Title**
  - **Link** (shown as an "Apply" button instead of a raw URL)
  - **Sponsor**
  - **Institution**
  - **Fields**
  - **Main Field**
  - **Deadline**
  - **University**
  - **Program Type**
  - **Publication Date**

- **How It Works:**  
  1. **Subject Creation:**  
     The function counts the new job records and puts that number in the subject line.
  
  2. **HTML Rendering:**  
     The template is rendered with the job records, the time of the latest update, and the GitHub repository and issue URLs.
  
  3. **Email Assembly:**  
     The email is composed as a multipart message with both plain text and HTML parts.
  
  4. **Sending the Email:**  
     Using Python's `smtplib`, the function connects to the SMTP server (defaulting to Gmail on port 587), upgrades the connection with STARTTLS, logs in, and sends the email.
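
The assembly step can be tried without an SMTP server or a template file. The sketch below builds a multipart message with the standard library only; it swaps the notebook's Jinja2 template for a small inline HTML table, and the function name `build_message`, the addresses, and the sample job are assumptions for illustration.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from html import escape


def build_message(new_jobs, sender, receiver):
    """Assemble (but do not send) a plain-text + HTML notification email."""
    num_jobs = len(new_jobs)

    # Inline HTML stands in for the rendered template; values are escaped for safety
    rows = []
    for job in new_jobs:
        title = escape(job["program_title"])
        link = escape(job["link"], quote=True)
        rows.append(f"<tr><td>{title}</td><td><a href='{link}'>Apply</a></td></tr>")
    html_body = "<table>" + "".join(rows) + "</table>"
    text_body = f"{num_jobs} new research positions found."

    msg = MIMEMultipart("alternative")  # clients pick the richest part they can show
    msg["Subject"] = f"New Job Opportunities Found ({num_jobs})"
    msg["From"] = sender
    msg["To"] = receiver
    msg.attach(MIMEText(text_body, "plain"))  # fallback part first
    msg.attach(MIMEText(html_body, "html"))   # preferred part last
    return msg


# Hypothetical record and addresses
jobs = [{"program_title": "Research Assistant", "link": "https://example.org/apply?a=1&b=2"}]
message = build_message(jobs, "sender@example.org", "receiver@example.org")
print(message["Subject"])  # New Job Opportunities Found (1)
```

In a `multipart/alternative` message the parts are ordered from least to most preferred, which is why the plain-text fallback is attached before the HTML part.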



In [103]:
def replace_none_or_empty_in_list_of_dicts(jobs):
    """
    Ensures all job dictionaries have consistent formatting:
    - Replace None or empty values with "N/A"
    - Strip extra whitespace
    - Convert keys to lowercase for consistency
    """
    cleaned_jobs = []
    for job in jobs:
        cleaned_job = {}
        for k, v in job.items():
            # str() guards against non-string values (e.g. numbers), which have no .strip()
            text = "" if v is None else str(v).strip()
            cleaned_job[str(k).strip().lower()] = text if text else "N/A"
        cleaned_jobs.append(cleaned_job)
    return cleaned_jobs



def read_existing_jobs(xml_file):
    """
    Reads existing job entries from XML and returns a dictionary of frozenset job signatures,
    categorized by 'source' (e.g., Predoc, NBER, EJM).
    
    If the XML file does not exist, returns an empty dictionary.
    """
    existing_signatures = {}

    if os.path.exists(xml_file):
        tree = ET.parse(xml_file)
        root = tree.getroot()

        for entry in root.findall("entry"):
            job_data = {child.tag.strip(): child.text.strip() if child.text else "N/A" for child in entry}
            job_signature = frozenset(sorted(job_data.items()))  # Sort keys to ensure consistency
            
            source = job_data.get("source", "Unknown")  # Extract source
            
            if source not in existing_signatures:
                existing_signatures[source] = set()  # Create a set for this source if it doesn’t exist
            
            existing_signatures[source].add(job_signature)  # Store the signature under the correct source

    print("\n🔍 Debug: Existing Job Signatures by Source from XML")
    for src, sigs in existing_signatures.items():
        print(f"📁 {src}: {len(sigs)} jobs stored")
    
    return existing_signatures







def append_jobs_to_xml(xml_file, jobs):
    """
    Saves a list of job dictionaries into an XML file.
    
    - If the file does not exist, it creates a new XML structure.
    - If the file exists, it appends only new job entries while avoiding duplicates.
    """
    # Load existing XML or create a new root if the file doesn't exist
    if os.path.exists(xml_file):
        tree = ET.parse(xml_file)
        root = tree.getroot()
    else:
        root = ET.Element("jobs")  # Create root element

    # Read existing jobs to avoid duplicates
    existing_signatures = read_existing_jobs(xml_file)

    new_entries_count = 0  # Track new records added

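    for job in jobs:
        # Build the signature the same way read_existing_jobs() does, keeping "N/A" values
        job_str_dict = {
            str(k).strip(): (str(v).strip() if v is not None and str(v).strip() else "N/A")
            for k, v in job.items()
        }
        job_signature = frozenset(job_str_dict.items())
        job_source = job_str_dict.get("source", "Unknown")

        # existing_signatures maps source -> set of signatures, so look inside that set
        if job_signature not in existing_signatures.get(job_source, set()):
            # This is a new job! Add it to XML.
            entry = ET.SubElement(root, "entry")
            
            for key, value in job_str_dict.items():
                field = ET.SubElement(entry, key)
                field.text = value  # Already normalized; empty values became "N/A"
            
            # Remember the signature so duplicates within this batch are skipped too
            existing_signatures.setdefault(job_source, set()).add(job_signature)
            new_entries_count += 1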

    # Only save if new entries were added
    if new_entries_count > 0:
        tree = ET.ElementTree(root)
        tree.write(xml_file, encoding="utf-8", xml_declaration=True)
        print(f"✅ {new_entries_count} new job(s) added to {xml_file}")
    else:
        print("🔹 No new jobs found; XML file remains unchanged.")



def send_email_new_jobs(new_jobs, sender_email, sender_password, receiver_email, smtp_server="smtp.gmail.com", smtp_port=587):
    """
    Sends an email with new job records using an HTML template.
    
    - Uses Jinja2 for templating.
    - Loads the email HTML template from `templates/email.html`.
    - Displays "Apply" buttons instead of raw links.
    - Shows the latest update timestamp.
    - Provides links to contribute or report issues on GitHub.
    """

    # Count new jobs for the email subject
    num_jobs = len(new_jobs)
    subject = f"New Job Opportunities Found ({num_jobs})"

    # Get the current date & time
    update_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    # Load the email template using Jinja2
    env = Environment(loader=FileSystemLoader("templates"))
    template = env.get_template("email.html")
    
    # Render the template with dynamic values
    html_body = template.render(
        new_jobs=new_jobs,
        update_time=update_time,
        github_repo_url=GITHUB_REPO_URL,
        github_issue_url=GITHUB_ISSUE_URL
    )

    # Create a multipart email message (plain text and HTML)
    msg = MIMEMultipart("alternative")
    msg["Subject"] = subject
    msg["From"] = sender_email
    msg["To"] = receiver_email

    # Plain text fallback
    text_body = f"{num_jobs} new research positions found. Please view this email in an HTML-compatible client."

    part1 = MIMEText(text_body, "plain")
    part2 = MIMEText(html_body, "html")

    msg.attach(part1)
    msg.attach(part2)

    # Send the email via SMTP
    try:
        with smtplib.SMTP(smtp_server, smtp_port) as server:
            server.starttls()  # Secure the connection
            server.login(sender_email, sender_password)
            server.send_message(msg)
        print("Email sent successfully!")
    except Exception as e:
        print("Failed to send email:", e)


def find_new_jobs():
    """
    Scrapes jobs from each source, checks for duplicates using XML storage,
    and returns a list of newly detected jobs.
    """
    # Scrape jobs from each source.
    predoc_jobs = scrape_predoc()
    nber_jobs = scrape_nber()
    ejm_jobs = scrape_ejm()

    # Combine all job records into a single list.
    all_jobs = predoc_jobs + nber_jobs + ejm_jobs
    all_jobs = replace_none_or_empty_in_list_of_dicts(all_jobs)

    if not all_jobs:
        print("No jobs were scraped.")
        return []

    # Read existing job signatures from XML.
    existing_signatures = read_existing_jobs(XML_FILE)

    print("\n🔍 Debug: Checking New Jobs Against Filtered Existing Records")
    
    new_jobs = []
    for job in all_jobs:
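        # Normalize keys and values; str(v) guards against non-string values
        # (e.g. NaN floats), which would otherwise raise on .strip().
        job_str_dict = {
            str(k).strip().lower(): str(v).strip() if v and str(v).strip() else "N/A"
            for k, v in job.items()
        }
        # A frozenset is unordered, so sorting the items first is unnecessary.
        job_signature = frozenset(job_str_dict.items())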

        # Use `source` to filter existing records before comparison
        job_source = job.get("source", "Unknown")

        if job_source in existing_signatures and job_signature in existing_signatures[job_source]:
            print(f"✅ Job Already Exists in XML ({job_source})")
        else:
            print(f"❌ New Job Detected! Adding to list. ({job_source})")
            new_jobs.append(job)

    print(f"\nFound {len(new_jobs)} new job(s).")
    return new_jobs  # Return list of new jobs


def debug_email_with_existing_jobs(existing_jobs):
    """
    Debug function to render the email using existing jobs.
    """
    from jinja2 import Environment, FileSystemLoader
    import datetime

    # Flatten jobs from all sources into a single list
    all_jobs = []
    for source, jobs in existing_jobs.items():
        for job in jobs:
            # Convert frozenset back to dict if necessary
            if isinstance(job, frozenset):
                job = dict(job)
            all_jobs.append(job)

    if not all_jobs:
        print("⚠️ No existing jobs found in the XML file.")
        return

    # Take only the first 5 jobs for debugging
    sample_jobs = all_jobs[:5]

    print("\n🔍 DEBUG: First 5 Jobs for Email Rendering")
    for job in sample_jobs:
        print(job)

    # Load the email template using Jinja2
    env = Environment(loader=FileSystemLoader("templates"))
    template = env.get_template("email.html")

    # Get the current timestamp
    update_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    # Render the email with the sample job data
    email_content = template.render(
        new_jobs=sample_jobs,
        update_time=update_time,
        github_repo_url=GITHUB_REPO_URL,
        github_issue_url=GITHUB_ISSUE_URL
    )

    print("\n📝 DEBUG: Rendered Email Content (Raw HTML):\n", email_content)

    # Save the output to a file for testing
    with open("debug_email_output.html", "w", encoding="utf-8") as f:
        f.write(email_content)

    print("\n✅ Email template successfully rendered and saved as 'debug_email_output.html'.")



# Load existing jobs from XML (assuming you have a function for this)
#existing_jobs = read_existing_jobs(XML_FILE)

# Call the debug function
#debug_email_with_existing_jobs(existing_jobs)
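
The message assembly inside `send_email_new_jobs()` can be checked without contacting an SMTP server. The following is a minimal sketch, assuming a hypothetical helper `build_message()` (not part of the notebook) and placeholder `example.com` addresses. It only builds the `multipart/alternative` message and inspects its parts.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_message(subject, sender, receiver, text_body, html_body):
    """Hypothetical helper: assemble a plain-text + HTML message."""
    msg = MIMEMultipart("alternative")
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = receiver
    # Mail clients prefer the last alternative they can render,
    # so the plain-text fallback goes first and the HTML part last.
    msg.attach(MIMEText(text_body, "plain"))
    msg.attach(MIMEText(html_body, "html"))
    return msg


# Placeholder values for illustration only.
msg = build_message(
    "New Job Opportunities Found (1)",
    "sender@example.com",
    "receiver@example.com",
    "1 new research position found.",
    "<p>1 new research position found.</p>",
)
content_types = [part.get_content_type() for part in msg.get_payload()]
print(content_types)  # ['text/plain', 'text/html']
```

Keeping the assembly separate from the `smtplib` call makes it possible to test the email structure in isolation, while the sending step stays a thin wrapper.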



## Main Function: Scrape, Update, and Notify 🚀📊✉️

This **main()** function orchestrates the complete workflow of the project. It:

- **Scrapes Job Data:**  
  Calls `find_new_jobs()`, which runs the scraping functions for all three sources (Predoc, NBER, EJM) to collect job postings.

- **Filters New Jobs:**  
  Reads the existing XML file (`jobs.xml`, acting as a simple database) to get the signatures of already recorded jobs, grouped by source. Jobs whose signature is already stored are filtered out.

- **Updates the XML Database:**  
  Appends the new job entries to the XML file for future reference.

- **Sends Notifications:**  
  Sends a single email that lists all the new jobs found in this run.

- **Displays a Preview:**  
  Shows the first 10 new jobs as a Markdown table or, if nothing is new, the first 10 records already stored in the XML file.
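
The filtering step can be sketched as follows. This is a minimal illustration that mirrors the signature comparison in `find_new_jobs()`. The helper names `job_signature()` and `filter_new()` and the `example.org` links are assumptions made for the example, not part of the notebook.

```python
def job_signature(job):
    """Normalize keys and values, then freeze the items into a hashable signature."""
    return frozenset(
        (str(k).strip().lower(), str(v).strip() if v and str(v).strip() else "N/A")
        for k, v in job.items()
    )


def filter_new(jobs, existing_by_source):
    """Keep only jobs whose signature is not already stored for their source."""
    new_jobs = []
    for job in jobs:
        source = job.get("source", "Unknown")
        if job_signature(job) not in existing_by_source.get(source, set()):
            new_jobs.append(job)
    return new_jobs


# Hypothetical stored record and freshly scraped records.
stored = {
    "Predoc": {
        job_signature({"source": "Predoc", "program_title": "RA", "link": "https://example.org/a"})
    }
}
scraped = [
    {"source": "Predoc", "program_title": " RA ", "link": "https://example.org/a"},
    {"source": "Predoc", "program_title": "Fellow", "link": "https://example.org/b"},
]
new = filter_new(scraped, stored)
print([job["program_title"] for job in new])  # ['Fellow']
```

Because whitespace is stripped before the signature is built, the first scraped record matches the stored one even though its title is padded with spaces.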

> **Note:**  
> Ensure that your SMTP credentials (i.e. `SENDER_EMAIL` and `SENDER_PASSWORD`) are set in the `.env` file and that the scraping functions (`scrape_predoc()`, `scrape_nber()`, and `scrape_ejm()`), along with the XML and email helper functions, are defined before running `main()`.
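
One way to catch missing credentials early is to validate them before any scraping starts. The sketch below assumes a hypothetical helper `load_smtp_credentials()`, which the notebook does not define. It takes the environment as a parameter so that it can be exercised with a plain dictionary.

```python
import os


def load_smtp_credentials(env=os.environ):
    """Return (sender, password); raise a clear error if either is missing."""
    sender = env.get("SENDER_EMAIL")
    password = env.get("SENDER_PASSWORD")
    missing = [
        name
        for name, value in (("SENDER_EMAIL", sender), ("SENDER_PASSWORD", password))
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return sender, password


# Example with a stand-in environment (placeholder values).
fake_env = {"SENDER_EMAIL": "sender@example.com", "SENDER_PASSWORD": "app-password"}
print(load_smtp_credentials(fake_env)[0])  # sender@example.com
```

Calling it with no argument reads the real environment, which `load_dotenv()` has already populated from the `.env` file.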

In [104]:
def main():
    """
    Main execution function. Calls find_new_jobs, saves new jobs to XML,
    and optionally sends email notifications.
    """
    new_jobs = find_new_jobs()  # Call the new function

    if new_jobs:
        # Save new jobs to XML instead of CSV. 💾
        append_jobs_to_xml(XML_FILE, new_jobs)
        
        # Retrieve SMTP credentials from environment variables. 🔒
        sender_email    = os.getenv('SENDER_EMAIL')
        sender_password = os.getenv('SENDER_PASSWORD')
        receiver_email  = os.getenv('SENDER_EMAIL')
        
        # Convert new jobs to a DataFrame for better visualization.
        df_new = pd.DataFrame(new_jobs).head(10)
        md_table = df_new.to_markdown(index=False)
        
        # Send one email notification listing all new jobs (comment out to disable).
        send_email_new_jobs(new_jobs, sender_email, sender_password, receiver_email)
        
    else:
        print("No new jobs found.")
        if os.path.exists(XML_FILE):
            df_new = pd.read_xml(XML_FILE).head(10)
            md_table = df_new.to_markdown(index=False)
        else:
            md_table = "No XML file found."

    # Display the table in the notebook (either new jobs or existing XML).
    display(Markdown(md_table))



In [105]:
if __name__ == "__main__":
    main()
    




🔍 Debug: Existing Job Signatures by Source from XML

🔍 Debug: Checking New Jobs Against Filtered Existing Records
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to list. (Predoc)
❌ New Job Detected! Adding to 

| source   | program_title                                | link                   | sponsor                                                    | institution                                                                             | fields                                                                            | deadline                                                                                                                                                                                                            | university   | program_type   | publication_date   | main_field                  |   location |   start_date |   duration |   department |   degree_required |   salary_range |
|:---------|:---------------------------------------------|:-----------------------|:-----------------------------------------------------------|:----------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------|:---------------|:-------------------|:----------------------------|-----------:|-------------:|-----------:|-------------:|------------------:|---------------:|
| Predoc   | Research Associates                          | https://bit.ly/3Xhfcvq | N/A                                                        | N/A                                                                                     | Education, Finance, Labor, Macro, Public Policy, and Urban                        | Rolling In addition to a multi-institutional job pool, this website also provides general information about opportunities within the Federal Reserve System as well as up-to-date information re specific openings. | N/A          | N/A            | N/A                | Finance, Public Policy      |        nan |          nan |        nan |          nan |               nan |            nan |
| Predoc   | Pre-Doctoral Positions                       | https://bit.ly/3CM4IxV | Abhishek Nagaraj (UC Berkeley) and Matteo Tranchero (Penn) | UC Berkeley’s Haas School of Business; Wharton School of the University of Pennsylvania | Applied economics, innovation, entrepreneurship, data science and tech innovation | Rolling                                                                                                                                                                                                             | N/A          | N/A            | N/A                | Economics, Entrepreneurship |        nan |          nan |        nan |          nan |               nan |            nan |
| Predoc   | Pre-Doctoral Technical Associate             | https://bit.ly/4hBFgtU | N/A                                                        | N/A                                                                                     | N/A                                                                               | N/A                                                                                                                                                                                                                 | N/A          | N/A            | N/A                | N/A                         |        nan |          nan |        nan |          nan |               nan |            nan |
| Predoc   | Pre-Doctoral Technical Associate             | https://bit.ly/4er9iOY | N/A                                                        | N/A                                                                                     | N/A                                                                               | N/A                                                                                                                                                                                                                 | N/A          | N/A            | N/A                | N/A                         |        nan |          nan |        nan |          nan |               nan |            nan |
| Predoc   | Pre-Doctoral Research Associate in Economics | https://bit.ly/4jQlfBs | Matthew Pecenco                                            | Brown University                                                                        | Labor, Crime, Housing                                                             | Applications will be accepted and reviewed on a rolling basis.                                                                                                                                                      | N/A          | N/A            | N/A                | Economics                   |        nan |          nan |        nan |          nan |               nan |            nan |
| Predoc   | Full-Time Research Assistant                 | https://bit.ly/4hwY3a0 | Matthew Baron                                              | National Bureau of Economic Research                                                    | Banking, Financial Crises, Financial History                                      | Rolling                                                                                                                                                                                                             | N/A          | N/A            | N/A                | N/A                         |        nan |          nan |        nan |          nan |               nan |            nan |
| Predoc   | Pre-Doctoral Fellowship                      | https://bit.ly/3WXU7GY | Hans-Joachim Voth                                          | University of Zurich                                                                    | Economic History, Political Economy, Cultural Economics                           | Applications will be reviewed immediately and are welcome until all positions are filled.                                                                                                                           | N/A          | N/A            | N/A                | Economics                   |        nan |          nan |        nan |          nan |               nan |            nan |
| Predoc   | Research Professional in Accounting          | https://bit.ly/4hyJouK | Professor Philip Berger                                    | The University of Chicago Booth School of Business                                      | Accounting, corporate finance, and labor economics.                               | Applications are reviewed on a rolling basis; the initial full review will be March 15, 2025.                                                                                                                       | N/A          | N/A            | N/A                | Economics, Finance          |        nan |          nan |        nan |          nan |               nan |            nan |
| Predoc   | Research Professional in Marketing           | https://bit.ly/3Eupeng | Professor Andreas Kraft                                    | University of Chicago Booth School of Business                                          | N/A                                                                               | N/A                                                                                                                                                                                                                 | N/A          | N/A            | N/A                | N/A                         |        nan |          nan |        nan |          nan |               nan |            nan |
| Predoc   | Research Professional in Behavioral Science  | https://bit.ly/3QkttnL | Professor Alexander Todorov                                | University of Chicago Booth School of Business                                          | Behavioral Science                                                                | Applications are reviewed on a rolling basis; the initial full review will be March 1, 2025.                                                                                                                        | N/A          | N/A            | N/A                | N/A                         |        nan |          nan |        nan |          nan |               nan |            nan |