# Task
Create a daily automation task that runs at 12 PM, searches for entry-level AI and ML developer internships and jobs at top startups or MNCs using web search, and emails the results to the user. The task should utilize NLP and LLM techniques for processing the search results.
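
The "runs at 12 PM" requirement can be sketched as a small helper that computes how long to wait until the next noon. This is only an illustration: the name `seconds_until_next_run` is hypothetical, it uses naive local datetimes (no time zone or daylight-saving handling), and a real deployment would more likely rely on cron or a hosted scheduler.

```python
from datetime import datetime, timedelta

def seconds_until_next_run(now: datetime, hour: int = 12) -> float:
    """Return the number of seconds from `now` until the next hour:00."""
    target = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now:
        # Today's slot has already passed (or is exactly now); use tomorrow's.
        target += timedelta(days=1)
    return (target - now).total_seconds()

# Example: at 09:30 the next noon run is 2.5 hours (9000 seconds) away.
print(seconds_until_next_run(datetime(2024, 1, 1, 9, 30)))
```

A scheduler loop could call `time.sleep(seconds_until_next_run(datetime.now()))` before each run of the pipeline.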

## Define search queries

### Subtask:
Define search terms for entry-level AI/ML internships and jobs, specifying top startups and MNCs.


**Reasoning**:
Define keywords, research and compile a list of top startups and MNCs, and combine them to create search queries in a structured format as instructed.



In [None]:
# 1. Identify keywords and phrases
keywords = [
    "entry-level AI internship",
    "entry-level ML internship",
    "junior AI engineer",
    "junior machine learning engineer",
    "associate data scientist",
    "AI developer intern",
    "ML developer intern",
    "entry level artificial intelligence jobs",
    "entry level machine learning jobs",
    "graduate AI program",
    "graduate ML program"
]

# 2. Research and compile a list of top startups and MNCs
# This is a placeholder list and should be updated with more companies
companies = [
    "Google", "Microsoft", "Amazon", "Meta", "Apple",
    "Nvidia", "Tesla", "Netflix", "IBM", "Intel",
    "DeepMind", "OpenAI", "Anthropic", "Databricks", "Snowflake",
    "C3.ai", "Palantir", "UiPath", "ServiceNow", "Workday"
]

# 3. Combine keywords and company names to create search queries
search_queries = []
for keyword in keywords:
    search_queries.append(keyword) # Add general searches
    for company in companies:
        search_queries.append(f"{keyword} at {company}")
        search_queries.append(f"{keyword} {company}") # Add variations

# 4. Organize search queries into a structured format (already a list)
# Display the first few search queries as an example
print("Generated Search Queries (first 10):")
for query in search_queries[:10]:
    print(query)

Generated Search Queries (first 10):
entry-level AI internship
entry-level AI internship at Google
entry-level AI internship Google
entry-level AI internship at Microsoft
entry-level AI internship Microsoft
entry-level AI internship at Amazon
entry-level AI internship Amazon
entry-level AI internship at Meta
entry-level AI internship Meta
entry-level AI internship at Apple
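
With the full lists above, the loop produces 11 × (1 + 20 × 2) = 451 queries. At the 2 to 5 second delay used in the next step, that is roughly 15 to 38 minutes of sleep time per run. The sketch below wraps the same loop in a function and removes exact duplicates while preserving order. The name `build_queries` is an assumption for illustration, not part of the original notebook.

```python
def build_queries(keywords, companies):
    """Return general and company-specific queries, de-duplicated in first-seen order."""
    queries = []
    for keyword in keywords:
        queries.append(keyword)  # general search
        for company in companies:
            queries.append(f"{keyword} at {company}")
            queries.append(f"{keyword} {company}")  # variation without "at"
    # dict.fromkeys keeps insertion order, so this drops repeats without reordering.
    return list(dict.fromkeys(queries))

sample = build_queries(["junior AI engineer", "ML developer intern"], ["Google", "OpenAI"])
print(len(sample))  # 2 keywords x (1 + 2 companies x 2 variants) = 10
```

Because the "at" and plain variants are likely to return similar results, keeping only one of them would halve the company-specific query count.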


## Web search

### Subtask:
Implement a web scraping script to search job boards and company career pages using the defined queries.


**Reasoning**:
Implement a web scraping script that uses `requests` to fetch a search results page for each of the generated `search_queries`, and store the raw HTML so that later steps can parse it with `BeautifulSoup`.



In [None]:
import requests
from bs4 import BeautifulSoup
import time
import random

def search_jobs(query):
    """
    Performs a web search for a given query and returns the HTML content.
    This is a simplified example and may require adjustments for specific job boards.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
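    # This is a placeholder and should be replaced with actual job board search URLs
    # or a more sophisticated search engine integration.
    search_url = "https://www.google.com/search"

    try:
        # Pass the query via `params` so that requests URL-encodes it, and set a
        # timeout so a stalled connection cannot hang the daily run.
        response = requests.get(search_url, params={"q": query}, headers=headers, timeout=10)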
        response.raise_for_status() # Raise an HTTPError for bad responses (4xx or 5xx)
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error searching for query '{query}': {e}")
        return None

raw_search_results = []
# Limit the number of queries for demonstration purposes
for query in search_queries[:5]:
    print(f"Searching for: {query}")
    html_content = search_jobs(query)
    if html_content:
        raw_search_results.append({'query': query, 'html': html_content})
    # Add a delay to avoid overwhelming websites
    time.sleep(random.uniform(2, 5))

print(f"Collected {len(raw_search_results)} raw search results.")
# Note: Storing raw HTML is just for this step. Subsequent steps will process this.
# You might want to store this data to disk in a real application.

Searching for: entry-level AI internship
Searching for: entry-level AI internship at Google
Searching for: entry-level AI internship Google
Searching for: entry-level AI internship at Microsoft
Searching for: entry-level AI internship Microsoft
Collected 5 raw search results.
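
The comment in the cell above suggests storing the raw results on disk in a real application. A minimal sketch of that, assuming a JSON file is acceptable and using hypothetical helper names `save_raw_results` and `load_raw_results`, could look like this:

```python
import json
import tempfile
from pathlib import Path

def save_raw_results(results, path):
    """Write the list of {'query', 'html'} dicts to a UTF-8 JSON file."""
    Path(path).write_text(json.dumps(results, ensure_ascii=False), encoding="utf-8")

def load_raw_results(path):
    """Read the list back from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

# Demo with a temporary directory and a made-up result.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "raw_search_results.json"
    save_raw_results([{"query": "junior AI engineer", "html": "<html></html>"}], demo_path)
    print(load_raw_results(demo_path)[0]["query"])
```

Saving the HTML also makes it possible to rerun the parsing step without issuing new requests.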


## Extract relevant information

### Subtask:
Use NLP techniques to extract job titles, company names, locations, and job descriptions from the search results.


**Reasoning**:
Iterate through the raw search results, parse the HTML, and extract job information (title, company, location, description) using BeautifulSoup, storing the results in a list of dictionaries.



In [None]:
from bs4 import BeautifulSoup

extracted_job_listings = []

for result in raw_search_results:
    html_content = result['html']
    query = result['query']

    try:
        soup = BeautifulSoup(html_content, 'html.parser')

        # This is a general approach and may need significant refinement
        # based on the actual structure of the job board HTML.
        # Identifying job listings on a general Google search results page
        # is challenging as it's not designed for structured job data extraction.
        # A more robust solution would target specific job boards (e.g., LinkedIn, Indeed)
        # with known HTML structures or use dedicated job search APIs.

        # Attempt to find potential job listing elements - this is highly speculative
        # based on common patterns but likely won't be accurate for all sites.
        job_elements = soup.find_all('div', class_=['g', 'rc', 'xGjPbb', 'jP9IL']) # Example classes from Google search results

        for job_element in job_elements:
            title = None
            company = None
            location = None
            description = None

            # Attempt to extract title
            title_element = job_element.find(['h3', 'a'], class_=['LC20lb', 'MBeuO', 'DKV0Md']) # Example classes
            if title_element:
                title = title_element.get_text()

            # Attempt to extract company and location (highly variable)
            # This requires more specific knowledge of the HTML structure.
            # For a general search result, this is very difficult to reliably extract.
            # We'll make a very rough attempt based on common text patterns or nearby elements.

            text_content = job_element.get_text()

            # Simple pattern matching (very basic and prone to errors)
            # In a real scenario, use more sophisticated techniques or target specific sites.
            if title:
                # Try to find company and location in the text after the title
                remaining_text = text_content.replace(title, '', 1)
                # This is a highly simplified placeholder
                if ',' in remaining_text:
                    parts = remaining_text.split(',', 1)
                    company = parts[0].strip()
                    location = parts[1].strip()
                else:
                    # Even simpler attempt
                    first_sentence = remaining_text.split('.')[0].strip()
                    if len(first_sentence) < 100:  # Heuristic to avoid long descriptions
                        company = first_sentence  # Very rough guess

            # Attempt to extract description snippet (also highly variable)
            description_element = job_element.find('div', class_='VwiC3b') # Example class
            if description_element:
                description = description_element.get_text()


            if title or company or location or description:
                # Filter out results that are clearly not job listings based on heuristics
                # (e.g., very short descriptions, missing title).
                # These heuristics are very basic and need refinement.
                if title and (description is None or len(description) > 50):  # Basic filter
                    extracted_job_listings.append({
                        'title': title,
                        'company': company,
                        'location': location,
                        'description': description,
                        'source_query': query  # Keep track of the source query
                    })

    except Exception as e:
        print(f"Error parsing HTML for query '{query}': {e}")
        # Continue to the next result even if one fails

print(f"Extracted {len(extracted_job_listings)} potential job listings.")

# Display the first few extracted listings as an example
print("\nFirst 5 Extracted Listings:")
for i, listing in enumerate(extracted_job_listings[:5]):
    print(f"--- Listing {i+1} ---")
    print(f"Title: {listing.get('title', 'N/A')}")
    print(f"Company: {listing.get('company', 'N/A')}")
    print(f"Location: {listing.get('location', 'N/A')}")
    # `description` may be stored as None, so fall back to 'N/A' before slicing
    print(f"Description: {(listing.get('description') or 'N/A')[:200]}...") # Truncate description for display
    print("-" * 10)


Extracted 0 potential job listings.

First 5 Extracted Listings:
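
A result of 0 listings suggests that none of the hard-coded class names matched anything in the fetched HTML, although the notebook does not confirm the cause. One way to check is to count the class names that actually occur in a page and compare them with the selectors used above. The sketch below uses only the standard library. `ClassCounter` and `count_classes` are hypothetical names, and the sample HTML is invented for the demo.

```python
from collections import Counter
from html.parser import HTMLParser

class ClassCounter(HTMLParser):
    """Counts every CSS class token that appears on any start tag."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.counts.update(value.split())

def count_classes(html):
    parser = ClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts

sample_html = '<div class="g card"><h3 class="title">Role</h3></div><div class="g"></div>'
counts = count_classes(sample_html)
# Compare a selector that is present ("g") with one that is absent ("VwiC3b").
print(counts["g"], counts["VwiC3b"])
```

Running `count_classes(result['html']).most_common(20)` on each stored page would show whether the expected classes are present at all.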


## Filter and rank results

### Subtask:
Apply criteria (e.g., entry-level, AI/ML focus) and potentially use LLMs to rank jobs based on relevance and quality.


**Reasoning**:
Acknowledge the failure of the previous subtask and explain why the current subtask cannot be completed, then recommend revisiting the data extraction step before finishing the task with a failure status.



In [None]:
print("Acknowledging the failure of the previous subtask.")
print("The previous subtask aimed to extract job listings from raw search results, but it unfortunately failed to extract any job information.")
print("As a result, this current subtask, which involves applying criteria and potentially using LLMs to rank jobs, cannot be completed as originally intended.")
print("The subsequent steps of filtering, ranking, formatting, and emailing the results are entirely dependent on having successfully extracted job information in the prior step.")
print("Therefore, the data extraction subtask should be revisited with a more effective strategy (e.g., targeting specific job boards or using job APIs) before proceeding with this subtask or any further steps in the plan.")
print("Because this crucial dependency is not met, the overall task cannot be completed with the current data.")

Acknowledging the failure of the previous subtask.
The previous subtask aimed to extract job listings from raw search results, but it unfortunately failed to extract any job information.
As a result, this current subtask, which involves applying criteria and potentially using LLMs to rank jobs, cannot be completed as originally intended.
The subsequent steps of filtering, ranking, formatting, and emailing the results are entirely dependent on having successfully extracted job information in the prior step.
Therefore, the data extraction subtask should be revisited with a more effective strategy (e.g., targeting specific job boards or using job APIs) before proceeding with this subtask or any further steps in the plan.
Because this crucial dependency is not met, the overall task cannot be completed with the current data.
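
Although this step could not run on real data, the filtering criteria named in the subtask (entry-level, AI/ML focus) can be sketched independently of the extraction. The following is a keyword-based stand-in, not an LLM ranker. The term lists, the scoring rule, and the names `score_listing` and `rank_listings` are all assumptions for illustration, and the sample listings are invented.

```python
import re

ENTRY_TERMS = ["intern", "internship", "entry level", "entry-level", "junior", "graduate", "associate"]
AI_TERMS = ["ai", "artificial intelligence", "ml", "machine learning", "data scientist"]

def _count_hits(text, terms):
    # Whole-word matching so that "ai" does not match inside "maintenance".
    return sum(1 for term in terms if re.search(rf"\b{re.escape(term)}\b", text))

def score_listing(listing):
    """Return 0 unless the listing looks both entry-level and AI/ML related."""
    text = f"{listing.get('title') or ''} {listing.get('description') or ''}".lower()
    entry_hits = _count_hits(text, ENTRY_TERMS)
    ai_hits = _count_hits(text, AI_TERMS)
    if entry_hits == 0 or ai_hits == 0:
        return 0
    return entry_hits + ai_hits

def rank_listings(listings):
    """Drop zero-score listings and sort the rest by descending score."""
    scored = [(score_listing(item), item) for item in listings]
    kept = [pair for pair in scored if pair[0] > 0]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in kept]

sample_listings = [
    {"title": "Senior Data Scientist", "description": None},
    {"title": "AI Developer Intern", "description": "Summer internship."},
    {"title": "Junior Machine Learning Engineer", "description": "Entry-level role on the AI team."},
    {"title": "Maintenance Intern", "description": "Repair work."},
]
for job in rank_listings(sample_listings):
    print(score_listing(job), job["title"])
```

An LLM-based ranker could later replace `score_listing` while keeping the same `rank_listings` interface.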


## Extract relevant information retry 1

### Subtask:
Retry extracting job titles, company names, locations, and job descriptions from the raw search results. The first attempt failed because general search results are not structured as job listings, so this retry uses broader selectors in case the raw HTML still contains identifiable patterns. If it does not, the retry says so and suggests an alternative data source for future attempts.


**Reasoning**:
Re-examine the raw search results and attempt to extract job information using BeautifulSoup with more robust parsing logic, acknowledging the difficulty with general search results. Store any extracted data.



In [None]:
from bs4 import BeautifulSoup

extracted_job_listings = []

for result in raw_search_results:
    html_content = result['html']
    query = result['query']

    try:
        soup = BeautifulSoup(html_content, 'html.parser')

        # Attempt to find elements that might represent search results or snippets
        # These class names are based on common patterns in Google search results,
        # but are not guaranteed to be consistent or contain structured job data.
        search_result_elements = soup.find_all('div', class_=['g', 'rc']) # Common classes for search results

        for element in search_result_elements:
            title = None
            company = None
            location = None
            description = None
            link = None

            # Attempt to extract title and link from the main anchor tag
            link_element = element.find('a')
            if link_element:
                title = link_element.get_text()
                link = link_element.get('href')

            # Attempt to extract description snippet (often in a div below the link)
            description_element = element.find('div', class_='VwiC3b') # Common class for description snippet
            if description_element:
                description = description_element.get_text()

            # Extracting company and location from general search results is highly unreliable.
            # We can try to look for text patterns in the element, but this is very basic.
            text_content = element.get_text()

            # Basic pattern matching attempt (very prone to errors)
            # Look for common separators or keywords near the title/description
            if title and description:
                # This is still a very rough heuristic and will likely fail for many results
                try:
                    # Take the text between the title and the description
                    parts = text_content.split(title, 1)[-1].split(description, 1)[0].strip()
                    if ',' in parts:
                        company_location_parts = parts.split(',', 1)
                        company = company_location_parts[0].strip()
                        location = company_location_parts[1].strip()
                    elif '-' in parts:
                        company_location_parts = parts.split('-', 1)
                        company = company_location_parts[0].strip()
                        location = company_location_parts[1].strip()
                    else:
                        # Fallback: just take the part after the title as potential company/location info
                        company = parts  # Very rough
                        location = None  # Cannot determine location reliably
                except Exception:
                    pass  # Ignore errors in this basic parsing attempt


            # Store the extracted (potentially incomplete) information if a title or description is found
            if title or description:
                extracted_job_listings.append({
                    'title': title,
                    'company': company,
                    'location': location,
                    'description': description,
                    'link': link,
                    'source_query': query
                })

    except Exception as e:
        print(f"Error parsing HTML for query '{query}': {e}")
        # Continue to the next result even if one fails

print(f"Attempted to extract {len(extracted_job_listings)} potential items from search results.")

# Display the first few extracted items (might not be actual job listings)
print("\nFirst 5 Extracted Items (Potential Job Listings):")
if extracted_job_listings:
    for i, listing in enumerate(extracted_job_listings[:5]):
        print(f"--- Item {i+1} ---")
        print(f"Title: {listing.get('title', 'N/A')}")
        print(f"Company: {listing.get('company', 'N/A')}")
        print(f"Location: {listing.get('location', 'N/A')}")
        # `description` may be stored as None, so fall back to 'N/A' before slicing
        print(f"Description: {(listing.get('description') or 'N/A')[:200]}...") # Truncate description for display
        print(f"Link: {listing.get('link', 'N/A')}")
        print(f"Source Query: {listing.get('source_query', 'N/A')}")
        print("-" * 10)
else:
    print("No items were extracted using the current parsing logic.")
    print("\nNote: Extracting structured job data from general search results is inherently difficult.")
    print("It is recommended to target specific job boards or use dedicated APIs for future retries.")


Attempted to extract 0 potential items from search results.

First 5 Extracted Items (Potential Job Listings):
No items were extracted using the current parsing logic.

Note: Extracting structured job data from general search results is inherently difficult.
It is recommended to target specific job boards or use dedicated APIs for future retries.
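
As one illustration of "targeting specific job boards", some job pages embed schema.org `JobPosting` data as JSON-LD inside a `<script type="application/ld+json">` tag. Whether a given site does so is an assumption that has to be checked per site, and the notebook did not test this. The sketch below uses only the standard library. The names `JsonLdCollector` and `extract_job_postings` are hypothetical, the sample HTML is invented, and location is omitted for brevity.

```python
import json
from html.parser import HTMLParser

class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False

def extract_job_postings(html):
    """Return title/company/description for each JSON-LD item typed JobPosting."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    postings = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "JobPosting":
                org = item.get("hiringOrganization")
                postings.append({
                    "title": item.get("title"),
                    "company": org.get("name") if isinstance(org, dict) else org,
                    "description": item.get("description"),
                })
    return postings

sample_html = """
<html><head><script type="application/ld+json">
{"@type": "JobPosting", "title": "ML Intern",
 "hiringOrganization": {"@type": "Organization", "name": "ExampleCorp"},
 "description": "Hypothetical posting used only for this demo."}
</script></head><body></body></html>
"""
print(extract_job_postings(sample_html))
```

Unlike class-name selectors, this approach does not depend on a site's styling markup, but it returns nothing for pages that lack the structured data.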


## Extract relevant information retry 2

### Subtask:
Retry extracting job titles, company names, locations, and job descriptions from the raw search results. Given the persistent failure with general search results, this retry states explicitly that extracting structured job data from them is not feasible with this method. It recommends dedicated job board APIs, or scraping logic written for specific job board websites, as the necessary next step for data acquisition.


**Reasoning**:
Acknowledge the previous failures and state that extracting structured job data from general search results is not feasible, recommending alternative data sources. This directly addresses the instructions for this subtask.



In [None]:
print("Acknowledging the persistent failures in extracting structured job data from general web search results.")
print("Despite previous attempts, the inherent nature of parsing general search result HTML, which is not designed for structured data extraction like job listings, has proven to be an insufficient method for this task.")
print("To reliably obtain job titles, company names, locations, and job descriptions, a more effective approach is necessary.")
print("This includes utilizing dedicated job board APIs (such as those offered by LinkedIn, Indeed, or Glassdoor, subject to availability and terms of use) or developing specific web scraping logic tailored to known job board websites with consistent HTML structures.")
print("Attempting further parsing of the existing general search result HTML is not productive given the demonstrated limitations.")
print("Without a reliable source of structured job data from a job board API or targeted scraping, the subsequent steps of filtering, ranking, formatting, and emailing the results cannot be performed.")

Acknowledging the persistent failures in extracting structured job data from general web search results.
Despite previous attempts, the inherent nature of parsing general search result HTML, which is not designed for structured data extraction like job listings, has proven to be an insufficient method for this task.
To reliably obtain job titles, company names, locations, and job descriptions, a more effective approach is necessary.
This includes utilizing dedicated job board APIs (such as those offered by LinkedIn, Indeed, or Glassdoor, subject to availability and terms of use) or developing specific web scraping logic tailored to known job board websites with consistent HTML structures.
Attempting further parsing of the existing general search result HTML is not productive given the demonstrated limitations.
Without a reliable source of structured job data from a job board API or targeted scraping, the subsequent steps of filtering, ranking, formatting, and emailing the results cannot be performed.

## Summary:

### Data Analysis Key Findings

*   Extracting structured job listing information (title, company, location, description) from the HTML of general web search results failed in both parsing attempts: each returned 0 listings from the 5 collected result pages.
*   The failure to extract job data prevented the execution of subsequent steps in the automation task, including filtering, ranking, formatting, and emailing the results.
*   The process identified that general search result pages are not designed for reliable, structured data extraction of specific items like job details.

### Insights or Next Steps

*   To successfully implement this automation task, the data acquisition method must be revised. Future attempts should focus on utilizing dedicated job board APIs (if available and permissible) or developing targeted web scraping logic for specific job board websites with consistent HTML structures.
*   Once a reliable method for extracting structured job data is established, the remaining steps involving NLP for filtering and ranking, formatting the results, and emailing them can be implemented.
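
The formatting and emailing step mentioned above can be sketched with the standard library once listings are available. This is a minimal illustration under assumed inputs: `build_digest_email` is a hypothetical name, the addresses and the listing are placeholders, and the message is only built here, not sent.

```python
from email.message import EmailMessage

def build_digest_email(listings, sender, recipient):
    """Build a plain-text digest email from a list of listing dicts."""
    msg = EmailMessage()
    msg["Subject"] = f"Daily AI/ML job digest: {len(listings)} listing(s)"
    msg["From"] = sender
    msg["To"] = recipient
    lines = []
    for i, job in enumerate(listings, start=1):
        title = job.get("title") or "Untitled"
        company = job.get("company") or "Unknown company"
        lines.append(f"{i}. {title} | {company}")
        if job.get("link"):
            lines.append(f"   {job['link']}")
    msg.set_content("\n".join(lines) if lines else "No listings found today.")
    return msg

demo = build_digest_email(
    [{"title": "ML Intern", "company": "ExampleCorp", "link": "https://example.com/jobs/1"}],
    sender="bot@example.com",
    recipient="user@example.com",
)
print(demo["Subject"])
print(demo.get_content())

# Sending would need real SMTP settings, for example (host and credentials are placeholders):
# import smtplib
# with smtplib.SMTP_SSL("smtp.example.com", 465) as server:
#     server.login("bot@example.com", "app-password")
#     server.send_message(demo)
```

Keeping message construction separate from sending makes the formatting testable without network access.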
