This project aims to **scrape real job listings** from platforms like **Talent.com** and store the data for **location- and skill-based analysis**.

---

Objectives

Job Data Collection

Scrape real job listings using Python tools such as:

- `requests`
- `BeautifulSoup`
- `Selenium`

Fields to Collect:

- `job_id`  
- `job_title`  
- `company_name`  
- `location`  
- `application_link`  
- `skills_req`  
- `source` (portal name)

Data Storage

- Store data in a structured **CSV file**
- Append new entries with every run to allow:
  - Data accumulation over time
  - Real-world, up-to-date job insights

---

Analysis Capabilities

1. Skill-based Analysis

- For a given keyword or skill (e.g., `"Python"`, `"React"`), analyze **which locations** have the most job openings
- Identify **regional demand trends** for specific technologies

2. Location-based Analysis

- For a given **city or region**, analyze **which skills or technologies** are most in demand
- Determine **hot skills** in specific geographical areas
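
Both analyses come down to counting rows that match a filter. The sketch below uses only the standard library; the rows in `sample_jobs` are made-up values for illustration (not scraped data), and `top_locations_for_skill` and `top_skills_in_location` are hypothetical helper names.

```python
from collections import Counter

# Made-up rows for illustration only; real rows would come from the CSV file
sample_jobs = [
    {'job_title': 'Python Developer', 'location': 'Noida'},
    {'job_title': 'Senior Python Engineer', 'location': 'Noida'},
    {'job_title': 'React Developer', 'location': 'Delhi'},
    {'job_title': 'Python Data Analyst', 'location': 'Delhi'},
]

def top_locations_for_skill(jobs, skill):
    """Count job openings per location whose title mentions the skill."""
    skill = skill.lower()
    return Counter(job['location'] for job in jobs if skill in job['job_title'].lower())

def top_skills_in_location(jobs, location, skills):
    """Count how often each candidate skill appears in titles for one location."""
    counts = Counter()
    for job in jobs:
        if job['location'].lower() != location.lower():
            continue
        title = job['job_title'].lower()
        counts.update(s for s in skills if s.lower() in title)
    return counts

print(top_locations_for_skill(sample_jobs, 'Python').most_common())
print(top_skills_in_location(sample_jobs, 'Delhi', ['Python', 'React']).most_common())
```

With real data the same counts could be produced with pandas `groupby`; the plain-Python version is shown only to make the logic explicit.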

---

Visualization Tools

Use the following libraries for data exploration and visual presentation:

- `pandas` – for data wrangling and processing
- `matplotlib` or `seaborn` – for charts and visual insights

---

Future Scope

- **OOP Refactoring**: Move to object-oriented structure if scaling across multiple portals or complex pipelines
- **Database Integration**: Replace CSV with `SQLite` or `PostgreSQL` for advanced querying and performance
- **Web UI / Dashboard**: Add a **frontend interface** for users to:
  - Search and filter job trends
  - View dynamic visualizations
  - Interact with real-time job market data

---

>  *This project combines scraping, analysis, and visualization to provide meaningful insights into the evolving job market, with room for scalability and interactivity.* 


Installing libraries

In [61]:
!pip install requests --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet
!pip install pandas --quiet
!pip install selenium --quiet


[notice] A new release of pip is available: 24.0 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip



Importing the libraries

In [50]:
import urllib.parse
import requests
from bs4 import BeautifulSoup

Take the search keyword from the user. A keyword can be a company name, a job title, a skill or anything else related to jobs.

In [30]:
search_keyword = input('Enter a skill, company or job title:')

Enter a skill, company or job title: google


In [40]:
search_keyword

'data analyst'

Next we add the search keyword to the URL. Before that, we need to format the keyword the way each portal expects it in the URL:
- Convert spaces to + or - (depending on the site).
- Convert text to lowercase if required.
- Remove special characters if needed.

In [51]:
def format_str(word, site):
    formatted_word = word.strip()
    if site == 'indeed' or site == 'talent' or site == 'monster':
        formatted_word = urllib.parse.quote(formatted_word)
        formatted_word = formatted_word.replace('%20','+')
    elif site == 'linkedln':
        formatted_word = urllib.parse.quote(formatted_word)
        
    return formatted_word
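
As a cross-check on `format_str`, the standard library can already produce both encodings. This is a small sketch, not a replacement for the function above; the parameter names `k` and `l` are the ones that appear in the Talent.com search URL used in this notebook.

```python
from urllib.parse import quote, quote_plus, urlencode

keyword = 'data analyst'

# quote() percent-encodes spaces, which matches the LinkedIn-style URL
print(quote(keyword))        # data%20analyst

# quote_plus() turns spaces into '+', which matches the Indeed/Talent-style URL
print(quote_plus(keyword))   # data+analyst

# urlencode() builds a whole query string and applies quote_plus() to every value
query = urlencode({'k': 'big data', 'l': 'uttar pradesh'})
print(query)                 # k=big+data&l=uttar+pradesh
```

One difference to keep in mind: `quote()` leaves `/` unescaped by default, while `quote_plus()` escapes it.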

We want to pass the formatted search_keyword to the job portals, so next we build each URL by inserting the formatted keyword into its search query.

In [47]:
indeed_job_search_base_url = 'https://www.indeed.com/jobs?q={}'
linkedln_job_search_base_url = 'https://www.linkedin.com/jobs/search/?keywords={}'

indeed_url = indeed_job_search_base_url.format(format_str(search_keyword, 'indeed'))
linkedln_url = linkedln_job_search_base_url.format(format_str(search_keyword, 'linkedln'))

In [48]:
print(indeed_url + ', ' + linkedln_url)

https://www.indeed.com/jobs?q=data+analyst, https://www.linkedin.com/jobs/search/?keywords=data%20analyst


Once the URLs are ready, we can extract data from the pages they point to.
Action plan:
- Download the page using requests
- Use Beautiful Soup to extract the top 5-10 jobs from the page
- Each job will have a job ID, title, company and location, along with the link to the site to apply
- At the end of the list, we can provide one link to the site to view more job listings

Let's create a function that takes the URL as input and returns a BeautifulSoup object containing the parsed page HTML.

In [52]:
def get_page(page_url):
    response = requests.get(page_url)
    if response.status_code != 200:
        raise Exception('Failed to load the page {}'.format(page_url))
    page = BeautifulSoup(response.text, 'html.parser')
    return page

In [None]:
get_page(linkedln_url)

In [58]:
response = requests.get(linkedln_url)
response.status_code

200

With `requests` we cannot access pages that have additional human verification enabled, such as Indeed. `requests` also cannot render content that a site builds in the browser with JavaScript frameworks (React, Angular, etc.).
##### Use Selenium when:

- The site is **JavaScript-heavy** (like LinkedIn, Naukri, etc.).
- You need to **interact with the page** (click buttons, scroll, login).
- The content doesn't appear in page source unless **fully loaded**.


###### **Pros:**
- Can handle **dynamic pages**
- Simulates a **real browser**
- Great for websites with **scroll to load**, modals, or popups


###### **Cons:**
- **Slower** than `requests`
- Uses **more memory**
- Can be harder to **deploy in headless/cloud environments**
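
One rough way to decide between the two approaches is to check whether the HTML returned by a plain HTTP request already contains the element you expect. The sketch below is only a heuristic; `needs_browser` is a hypothetical helper and the HTML strings are made-up examples.

```python
def needs_browser(status_code, html, marker):
    """Return True when a plain HTTP fetch is unlikely to be enough."""
    # A non-200 status (for example 403) usually means the request was blocked
    if status_code != 200:
        return True
    # If the expected marker is missing, the content is probably rendered by JavaScript
    return marker not in html

# Made-up responses for illustration
blocked = needs_browser(403, '', 'srpResultCardContainer')
empty_shell = needs_browser(200, '<div id="root"></div>', 'srpResultCardContainer')
static_ok = needs_browser(200, '<div class="srpResultCardContainer">...</div>', 'srpResultCardContainer')

print(blocked, empty_shell, static_ok)
```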


So let's try accessing the websites using Selenium; its functions also let us add more search filters.

In [22]:
monster_url = 'https://www.foundit.in/'

In [24]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

In [32]:
# Set up Selenium WebDriver (Make sure you have ChromeDriver installed)
# driver = webdriver.Chrome('C:/Web Scraping course/chromedriver.exe')
driver = webdriver.Chrome()
driver.implicitly_wait(10) # seconds

# Open job search page
driver.get(monster_url)

# Enter the keyword and location specified by the user into the search boxes
# Xpath - '//*[@id="jobs-search-box-keyword-id-ember100"]'
# Full Xpath - '/html/body/div[6]/header/div/div/div/div[2]/div[1]/div/div/input[1]'
xpath_keyword = '//*[@id="heroSectionDesktop-skillsAutoComplete--input"]'
xpath_location = '//*[@id="heroSectionDesktop-locationAutoComplete--input"]'
xpath_search_btn = '//*[@id="searchForm"]/div/button/span'

# Fetch the search box elements into variables
keyword_input = driver.find_element(By.XPATH, xpath_keyword)
location_input = driver.find_element(By.XPATH, xpath_location)

# Type the keyword and location into the search boxes
# (keyword and location must already be defined before this cell runs)
keyword_input.send_keys(keyword)
location_input.send_keys(location)

# Click the search button to run the search
# keyword_input.send_keys(Keys.ENTER) - alternative: press Enter in the input field itself
driver.find_element(By.XPATH, xpath_search_btn).click()

time.sleep(2)

# Close the webpage
driver.quit()

Selenium is heavy and slow to load, so wherever a site serves its listings in plain HTML we can use requests + BeautifulSoup instead. Let's try this with Talent.com!

In [361]:
keyword = 'big data'
location = 'uttar pradesh'

In [362]:
talent_search_url = 'https://in.talent.com/jobs?k={}&l={}'
talent_formatted_url = talent_search_url.format(format_str(keyword, 'talent'), format_str(location, 'talent'))

In [363]:
talent_formatted_url

'https://in.talent.com/jobs?k=big+data&l=uttar+pradesh'

Create a function that downloads the page and returns a BeautifulSoup object

In [53]:
def get_webpage(page_url):
    # Get the page from the given URL
    response = requests.get(page_url)
    # Check the status of the page is successful else print the error
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(page_url))
    # Parse the web page to beautiful soup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [430]:
response = requests.get('https://www.foundit.in/srp/results?query=manager&locations=gurgaon&experienceRanges=4~4&experience=4')
response

<Response [403]>

In [364]:
talent_doc = get_webpage(talent_formatted_url)

In [113]:
talent_jobs_class = 'sc-8e83a395-1 eIzOpw'
talent_jobs_data = talent_doc.find_all('section', class_ = talent_jobs_class)

Let's decide which features we want to collect and show in our project. For example, we would want to show:
- ```Job Title```
- ```Company name```
- ```Location```
- ```Experience required```
- ```Application link```

In [114]:
job_title_class = 'sc-6f9356d1-20 sc-6f9356d1-23 hOTlXX giCKXi'
company_class = 'sc-6f9356d1-10 sc-6f9356d1-12 jHSVbo gyxIYw'
location_class = 'sc-6f9356d1-10 sc-6f9356d1-11 aSgez bNXZmD'

job_title = talent_jobs_data[3].find('h2', job_title_class)
company = talent_jobs_data[3].find('span', company_class)
location = talent_jobs_data[3].find('span', location_class)

---------------------------------------------------------------------------------------------------------

To get other details such as the application URL and skills, we need to open the job page for each listing and fetch the details from there. For that we need to prepare the URL to parse.  --TODO (Unable to figure out the logic!)

In [128]:
link_class = 'sc-6f9356d1-5 sc-6f9356d1-8 sc-d93925ca-5 eDTPEZ gjefvX kEITSw'
link = talent_jobs_data[0].find('a', link_class)
# The href already starts with '/', so the base URL must not end with one
'https://in.talent.com' + link['href']

'https://in.talent.com/view?id=96572993470b'
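
`urllib.parse.urljoin` from the standard library is an alternative to string concatenation when building the job URL. It resolves a relative `href` against the base URL, so a trailing slash on the base and a leading slash on the `href` do not produce a doubled slash. A small sketch using the `href` of the job card above:

```python
from urllib.parse import urljoin

base_url = 'https://in.talent.com/'
href = '/view?id=96572993470b'   # relative link taken from the job card

# urljoin resolves the path against the base, so no doubled slash appears
job_url = urljoin(base_url, href)
print(job_url)   # https://in.talent.com/view?id=96572993470b
```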

Set up headless Chrome
```
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # Runs Chrome in the background
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--window-size=1920,1080')

driver = webdriver.Chrome(options=options)

```

In [None]:
driver = webdriver.Chrome()
driver.implicitly_wait(20) # seconds
job_url = 'https://in.talent.com/view?id=96572993470b'
driver.get(job_url)
jobpage = BeautifulSoup(driver.page_source, 'html.parser')

app_link_btn = jobpage.find_all('a', {'target': '_blank'})

xpath_keyword = '/html/body/div[2]/section/div[1]/div/div/div/div/div/div/div/p[7]'

# Fetch the element at this XPath into a variable
keyword_input = driver.find_element("xpath", xpath_keyword)

driver.quit()

In [None]:
# Returns application link and skills
def fetch_talent_details_from_job_specific_url(job_url):
    job_doc = get_webpage(job_url)
    pass

The code below does not work: the job detail page appears to be rendered using JavaScript, so Beautiful Soup cannot fetch the details from the downloaded HTML. We need to use Selenium for this!

In [171]:
job_doc = get_webpage('https://in.talent.com//view?id=96572993470b')
link_div_class = 'job__description   '

app_link = job_doc.find_all('div', class_= link_div_class)

In [172]:
app_link

[]

--------------------------------------------------------------------------------------------------------------------

In [13]:
# Global job ID counter
job_id_counter = {
    'talent': 0,
    'monster': 0
}

In [14]:
def generate_job_id(source):
    source_key = source.lower()
    # Start unknown sources at zero instead of raising a KeyError
    job_id_counter[source_key] = job_id_counter.get(source_key, 0) + 1

    prefix = {
        'talent': 'TAL',
        'monster': 'MON'
    }.get(source_key, 'GEN')

    # Zero-pad the counter to four digits, e.g. TAL0001
    return f"{prefix}{job_id_counter[source_key]:04d}"
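
Because `job_id_counter` starts from zero whenever the kernel restarts, IDs can repeat across runs once rows are appended to the same CSV file. One possible alternative, sketched here as an assumption rather than the project's method, derives the ID from the listing itself. `stable_job_id` is a hypothetical helper.

```python
import hashlib

PREFIXES = {'talent': 'TAL', 'monster': 'MON'}

def stable_job_id(source, job_title, company, location):
    """Build an ID that is the same on every run for the same listing."""
    prefix = PREFIXES.get(source.lower(), 'GEN')
    # Normalise the fields so casing and stray spaces do not change the ID
    key = '|'.join(part.strip().lower() for part in (job_title, company, location))
    digest = hashlib.sha1(key.encode('utf-8')).hexdigest()[:8]
    return f'{prefix}-{digest}'

first = stable_job_id('talent', 'Data Engineer', 'EXL', 'Noida')
second = stable_job_id('Talent', ' data engineer ', 'exl', 'noida')
print(first, first == second)
```

The trade-off is that two genuinely different postings with the same title, company and location would share an ID.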


Let's wrap all this in a function to fetch the details of, let's say, 10 jobs

In [42]:
def fetch_jobs_from_talent(talent_doc):
    talent_jobs_class = 'sc-8e83a395-1 eIzOpw'
    talent_jobs_data = talent_doc.find_all('section', class_ = talent_jobs_class)

    job_title_class = 'sc-6f9356d1-20 sc-6f9356d1-23 hOTlXX giCKXi'
    company_class = 'sc-6f9356d1-10 sc-6f9356d1-12 jHSVbo gyxIYw'
    location_class = 'sc-6f9356d1-10 sc-6f9356d1-11 aSgez bNXZmD'
    job_url_class = 'sc-6f9356d1-5 sc-6f9356d1-8 sc-d93925ca-5 eDTPEZ gjefvX kEITSw'
    image_class = 'sc-5c54c4fb-1 ixNEQE'

    jobs = {
            'job_id' : [],
            'job_title' : [],
            'company_name' : [],
            'location' : [],
            'industry' : [],
            'job_position': [],
            'experience' : [],
            'source' : []
    }
    
    for job_data in talent_jobs_data:
        job_title = job_data.find('h2', job_title_class).text
        company = job_data.find('span', company_class).text
        location = job_data.find('span', location_class).text
        #job_url = 'https://in.talent.com/' + job_data.find('a', job_url_class)['href']
        image = job_data.find('div', image_class).find('img')['src']  # Logo URL of this job; can be used later to display in a frontend

        #fetch_talent_details_from_job_specific_url(job_url)
        jobs['job_id'].append(generate_job_id('talent'))
        jobs['job_title'].append(job_title)
        jobs['company_name'].append(company)
        jobs['location'].append(location)
        jobs['source'].append('Talent')
        #jobs['application_link'].append(job_url)
        jobs['industry'].append(categorize_industry(job_title))
        jobs['job_position'].append(get_job_position_level(job_title))
        jobs['experience'].append(extract_years_from_title(job_title))

    return jobs

In [12]:
import re

def extract_years_from_title(title):
    title = title.lower()

    # Explicit mention such as '5+ years' or '3 yrs' returns the number as an int
    match = re.search(r'(\d+)\+?\s*(?:years?|yrs?)', title)
    if match:
        return int(match.group(1))

    # Otherwise approximate from seniority terms. The first matching rule wins,
    # so 'Junior Lead' is matched by 'lead' before 'junior' is checked.
    if any(term in title for term in ['director', 'head', 'vp', 'chief']):
        return '10+ Years'
    elif 'manager' in title:
        return '6-10 Years'
    elif any(term in title for term in ['senior', 'lead']):
        return '4-6 Years'
    elif any(term in title for term in ['mid', 'associate']):
        return '2-3 Years'
    elif any(term in title for term in ['junior', 'entry', 'graduate', 'intern']):
        return '0 Years'

    return 'Unknown'


In [41]:
extract_years_from_title('Junior Lead')

'4-6 Years'

In [265]:
image = talent_jobs_data[0].find('div', 'sc-5c54c4fb-1 ixNEQE').find('img')['src']
image

'https://cdn-dynamic.talent.com/ajax/img/get-logo.php?empcode=clickcast-appcast-li-in-cpc&empname=EXL'

In [388]:
jobs = fetch_jobs_from_talent(talent_doc)

In [389]:
jobs['industry']

['Data Science / Analytics',
 'Data Science / Analytics',
 'Cloud',
 'Data Science / Analytics',
 'Artificial Intelligence / Machine Learning',
 'Data Science / Analytics',
 'Data Science / Analytics',
 'Cloud',
 'Data Science / Analytics',
 'Data Science / Analytics']

Let's create a function to print the jobs in a readable format so we can validate them :)

In [390]:
def print_jobs(jobs, limit=5):
    # Print at most `limit` jobs; 'application_link' is left out because it is not collected yet
    for i in range(min(limit, len(jobs['job_id']))):
        print(' JobId: {} \n\tJob Title: {} \n\tCompany Name: {} \n\tLocation: {} \n\tIndustry: {} \n\tJob Position: {} \n\tSource: {}'
              .format(jobs['job_id'][i], jobs['job_title'][i], jobs['company_name'][i], jobs['location'][i], jobs['industry'][i], jobs['job_position'][i], jobs['source'][i]))

In [None]:
print_jobs(jobs)

Let's create a function to determine the industry of each job, as it will be helpful during data analysis. For now we apply rule-based categorization; later we could build an ML model to classify the data.

In [11]:
def categorize_industry(job_title):
    title = job_title.lower()

    if any(keyword in title for keyword in ['gen ai', 'generative ai', 'llm', 'artificial intelligence', 'ai', 'machine learning', 'ml']):
        return 'Artificial Intelligence / Machine Learning'
    elif any(keyword in title for keyword in ['cloud', 'aws', 'azure', 'gcp']):
        return 'Cloud'
    elif any(keyword in title for keyword in ['data scientist', 'data analyst', 'analytics', 'data', 'etl']):
        return 'Data Science / Analytics'
    elif any(keyword in title for keyword in ['web developer', 'web development', 'front end', 'frontend', 'html', 'css', 'javascript', 'react']):
        return 'Web Development'
    elif 'devops' in title:
        return 'DevOps'
    elif 'qa' in title or 'quality assurance' in title or 'test' in title or 'tester' in title:
        return 'Quality Assurance / Testing'
    elif any(keyword in title for keyword in ['software', 'developer', 'engineer', 'backend', 'frontend', 'full stack', 'architect', 'designer']):
        return 'Software Development'
    elif any(keyword in title for keyword in ['accountant', 'finance', 'auditor', 'chartered']):
        return 'Finance'
    elif any(keyword in title for keyword in ['marketing', 'seo', 'digital']):
        return 'Marketing'
    elif any(keyword in title for keyword in ['sales', 'business development', 'business analyst', 'business']):
        return 'Business Analytics'
    elif any(keyword in title for keyword in ['consulting', 'consultant']):
        return 'Consulting'
    elif 'manager' in title or 'management' in title:
        return 'Management'
    elif any(keyword in title for keyword in ['hr', 'recruiter', 'talent']):
        return 'Human Resources'
    else:
        return 'Other'


In [376]:
categorize_industry('solutions architect')

'Software Development'
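
Plain substring checks such as `'ai' in title` also match inside longer words (for example `'ai'` occurs in `'maintenance'` and `'ml'` in `'html'`). A possible refinement, sketched below with a hypothetical helper `contains_keyword`, matches keywords only at word boundaries using the standard `re` module.

```python
import re

def contains_keyword(title, keywords):
    """Return True if any keyword appears in the title as a whole word or phrase."""
    title = title.lower()
    for keyword in keywords:
        # \b anchors the match at word boundaries; re.escape treats the keyword as literal text
        if re.search(r'\b' + re.escape(keyword.lower()) + r'\b', title):
            return True
    return False

ai_keywords = ['ai', 'ml', 'machine learning']

print('ai' in 'maintenance engineer')                          # True (substring match)
print(contains_keyword('Maintenance Engineer', ai_keywords))   # False
print(contains_keyword('Gen AI Engineer', ai_keywords))        # True
print(contains_keyword('HTML Developer', ai_keywords))         # False
```

Keywords that end in punctuation, such as `sr.`, would need different handling, since `\b` only matches next to a letter, digit or underscore.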

Let's create a function to determine the position level of each job, as it will be helpful during data analysis.

In [10]:
def get_job_position_level(title):
    title = title.lower()
    
    if "intern" in title or "trainee" in title:
        return "Intern"
    elif "manager" in title:
        return "Manager"
    elif "junior" in title or "jr." in title or "fresher" in title or "graduate" in title:
        return "Entry"
    elif "senior" in title or "sr." in title or "sr" in title:
        return "Senior"
    elif "lead" in title:
        return "Senior"
    elif "director" in title or "head" in title:
        return "Director"
    elif any(x in title for x in ["vp", "vice president", "chief", "ceo", "cto", "coo", "cfo"]):
        return "Executive"
    else:
        return "Mid"


In [352]:
get_job_position_level('Data Engineer')

'Mid'

Let's create a DataFrame using pandas to do more analysis on the data we collected!
- Store data in a structured CSV file
- Append new entries with every run to allow:
  - Data accumulation over time
  - Real-world, up-to-date job insights

CSV structure -
```
job_data.csv
┌──────────┬───────────────┬──────────────┬───────────┬────────────────────┬────────┐
│ job_id   │ job_title     │ company_name │ location  │ application_link   │ source │
└──────────┴───────────────┴──────────────┴───────────┴────────────────────┴────────┘

```

In [347]:
import pandas as pd

In [392]:
df_jobs = pd.DataFrame(jobs)

In [None]:
df_jobs[:5]

Let's save the results to a CSV file on each execution to build sample data for analysis and visualization

In [9]:

import pandas as pd
import os

def save_jobs_to_csv(df, filename='jobs_data.csv'):
    file_exists = os.path.isfile(filename)

    df.to_csv(
        filename,
        mode='a' if file_exists else 'w',
        index=False,
        header=not file_exists,
        encoding='utf-8'
    )
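
Appending on every run can write the same listing more than once. Below is a minimal sketch of one way to skip rows that are already in the file, assuming a listing is identified by its title, company and location. `append_new_rows` is a hypothetical helper and uses the standard `csv` module rather than pandas.

```python
import csv
import os
import tempfile

KEY_FIELDS = ('job_title', 'company_name', 'location')

def append_new_rows(rows, filename, fieldnames):
    """Append only the rows whose key fields are not already in the CSV file."""
    seen = set()
    file_exists = os.path.isfile(filename)
    if file_exists:
        with open(filename, newline='', encoding='utf-8') as f:
            for row in csv.DictReader(f):
                seen.add(tuple(row[k] for k in KEY_FIELDS))

    added = 0
    with open(filename, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if not file_exists:
            writer.writeheader()
        for row in rows:
            key = tuple(row[k] for k in KEY_FIELDS)
            if key not in seen:
                writer.writerow(row)
                seen.add(key)
                added += 1
    return added

# Demo in a temporary directory so no real data file is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'jobs_demo.csv')
    fields = ['job_title', 'company_name', 'location', 'source']
    row = {'job_title': 'Web Developer', 'company_name': 'Accenture India',
           'location': 'Noida', 'source': 'Monster'}
    first_run = append_new_rows([row], path, fields)
    second_run = append_new_rows([row], path, fields)

print(first_run, second_run)
```

The second call adds nothing because the row written by the first call is found when the file is read back.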

In [55]:
import pandas as pd

In [43]:
def input_from_users():
    keyword = input('Enter skill, position or title: ')
    location = input('Enter the location: ')
    company = input('Search by company: ')
    experience = input('Search by experience (in years):')

    keyword = keyword.strip() or 'Developer'  # Mandatory field, so fall back to a default
    location = location.strip() or 'India'
    # Accept experience only as a number from 0 to 40; anything else (including blank) means no filter
    try:
        years = float(experience)
        experience = str(int(years)) if 0 <= years <= 40 else ''
    except ValueError:
        experience = ''
    
    return keyword, location, company, experience

In [59]:
def talent_main():
    # User input for search keyword and location
    keyword, location, company, exp = input_from_users()
    # Build the search URL with the user's filters, formatted as a standard URL
    #https://in.talent.com/jobs?k=manager&l=Delhi%2C+IN&empname=accenture
    
    talent_search_url = 'https://in.talent.com/jobs?k={}&l={}'
    talent_formatted_url = ''
    if len(company.strip()) > 0:
        talent_search_url = 'https://in.talent.com/jobs?k={}&l={}%2C+IN&empname={}'
        talent_formatted_url = talent_search_url.format(format_str(keyword, 'talent'), format_str(location, 'talent'), format_str(company, 'talent'))
    else:
        talent_formatted_url = talent_search_url.format(format_str(keyword, 'talent'), format_str(location, 'talent'))
    # Open and access the webpage
    print('Accessing the webpage {}...'.format(talent_formatted_url))
    talent_doc = get_webpage(talent_formatted_url)
    print('Please Wait! Fetching the jobs...')
    # Fetch the list of jobs
    jobs = fetch_jobs_from_talent(talent_doc)
    print('Completed!')
    # Create a dataframe for jobs
    df_jobs = pd.DataFrame(jobs)
    print(df_jobs)
    # Create / Append the CSV file the jobs data
    print('Saving the data to CSV file...')
    filename = 'jobs_talent_data.csv'
    save_jobs_to_csv(df_jobs, filename)
    print('File named {} created/ updated.'.format(filename))
    

In [60]:
talent_main()

Enter skill, position or title:  engineer
Enter the location:  accenture
Search by company:  accenture
Search by experience (in years): 5


Accessing the webpage https://in.talent.com/jobs?k=engineer&l=accenture%2C+IN&empname=accenture...
Please Wait! Fetching the jobs...


NameError: name 'generate_job_id' is not defined

Let's try scraping job listings from the Monster website (served at foundit.in)


In [None]:
def monster_main():
    # User input for search keyword and location
    keyword, location, company, exp = input_from_users()
    # Build the search URL with the user's filters, formatted as a standard URL
    # https://www.foundit.in/srp/results?query=accenture&locations=india&experienceRanges=0~0&experience=0
    # https://www.foundit.in/srp/results?query=accenture&locations=india&experienceRanges=3~3&experience=3
    # https://www.foundit.in/srp/results?query=accenture&locations=india
    location = location if location else 'India'
    monster_search_url = 'https://www.foundit.in/srp/results?query={}&locations={}'
    monster_formatted_url = ''
    if len(exp.strip()) > 0:
        monster_search_url = 'https://www.foundit.in/srp/results?query={}&locations={}&experienceRanges={}~{}&experience={}'
        monster_formatted_url = monster_search_url.format(format_str(keyword, 'monster'), format_str(location, 'monster'),
                                                             exp, exp, exp)
    else:
        monster_formatted_url = monster_search_url.format(format_str(keyword, 'monster'), format_str(location, 'monster'))
        
    # Open and access the webpage
    print('Accessing the webpage {}...'.format(monster_formatted_url))
    monster_doc = get_webpage_selenium(monster_formatted_url)
    print('Please Wait! Fetching the jobs...')
    # Fetch the list of jobs
    jobs = fetch_jobs_from_monster(monster_doc)
    print('Completed!')
    # Create a dataframe for jobs
    df_jobs = pd.DataFrame(jobs)
    # Create / Append the CSV file the jobs data
    print('Saving the data to CSV file...')
    filename = 'jobs_monster_data.csv'
    save_jobs_to_csv(df_jobs, filename)
    print('File named {} created/ updated.'.format(filename))

In [3]:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

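def get_webpage_selenium(url):
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait up to 10 seconds for the job cards to be rendered by JavaScript
        try:
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, 'srpResultCardContainer'))
            )
        except Exception as e:
            print("Error waiting for element:", e)
        # Save a screenshot of the loaded page for debugging
        driver.save_screenshot("debug_page.png")
        doc = BeautifulSoup(driver.page_source, 'html.parser')
    finally:
        # Always close the browser, even if loading or parsing fails
        driver.quit()
    return doc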

In [21]:
jobs = fetch_jobs_from_monster(monster_doc)

In [22]:
df = pd.DataFrame(jobs)
df

| | job_id | job_title | company_name | location | industry | job_position | experience | source |
|---|---|---|---|---|---|---|---|---|
| 0 | MON0009 | Web Developer | Accenture India | Noida | Web Development | Mid | 2 - 5 Years | Monster |
| 1 | MON0010 | ASP.Net Web Developer | Easemytrip com | Gurugram, Noida | Web Development | Mid | 3 - 8 Years | Monster |
| 2 | MON0011 | UI/UX Web Developer (Freelancer) | Soul AI | Delhi, India | Web Development | Mid | 2 - 4 Years | Monster |
| 3 | MON0012 | Web Developer | Robotics Technologies | Thane, Noida, Delhi NCR | Web Development | Mid | 3 - 7 Years | Monster |
| 4 | MON0013 | Looking For Front End Developer/Web Developer-... | Evalueserve.com Private Limited | Gurugram | Web Development | Mid | 2 - 7 Years | Monster |
| 5 | MON0014 | Full Stack web developer | Number 11 | Delhi, Noida | Web Development | Mid | 2 - 5 Years | Monster |
| 6 | MON0015 | Freelance Web Developers for Top Indian Freela... | Mukul Consultants India Private Limited | Remote | Web Development | Mid | 3 - 10 Years | Monster |


In [20]:
def fetch_jobs_from_monster(monster_doc):
    detailsclass = 'srpResultCardContainer'
    job_lists = monster_doc.find_all('div', class_ = detailsclass)

    jobs = {
            'job_id' : [],
            'job_title' : [],
            'company_name' : [],
            'location' : [],
            'industry' : [],
            'job_position': [],
            'experience' : [],
            'source' : []
    }

    for job in job_lists:
        job_title = job.find('div', class_ = 'jobTitle').text.strip()
        company = job.find('div', class_ = 'companyName').text.strip()
        location = job.find('div', class_ = 'details location').text.strip()
        exp = job.find('span', class_ = 'details').text.strip()

        jobs['job_id'].append(generate_job_id('monster'))
        jobs['job_title'].append(job_title)
        jobs['company_name'].append(company)
        jobs['location'].append(location)
        jobs['source'].append('Monster')
        jobs['industry'].append(categorize_industry(job_title))
        jobs['experience'].append(exp)
        jobs['job_position'].append(get_job_position_level(job_title))
    return jobs
        
        

In [29]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

#options = Options()
#options.add_argument("--headless")
#options.add_argument('--disable-gpu')
#options.add_argument('--no-sandbox')
#driver = webdriver.Chrome(options=options)
driver = webdriver.Chrome()

url = 'https://www.foundit.in/srp/results?query=data+analyst&locations=Bengaluru'
driver.get(url)

driver.save_screenshot("debug_page.png")
try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'srpResultCardContainer'))
    )
except Exception as e:
    print("Error waiting for element:", e)
    
from bs4 import BeautifulSoup
monster_doc = BeautifulSoup(driver.page_source, 'html.parser')

driver.quit()

In [30]:
detailsclass = 'srpResultCardContainer'
details = monster_doc.find_all('div', class_ = detailsclass)

In [None]:
details

In [32]:
job_title = details[0].find('div', class_ = 'jobTitle').text.strip()
company_name = details[0].find('div', class_ = 'companyName').text.strip()
location = details[0].find('div', class_ = 'details location').text.strip()
exp = details[0].find('span', class_ = 'details').text.strip()

In [33]:
exp

'8 - 11 Years'
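
For numeric analysis, an experience string such as `'8 - 11 Years'` can be split into a minimum and a maximum. This is a small sketch with a hypothetical helper `parse_experience_range`, assuming the `'<min> - <max> Years'` format seen in the output above.

```python
import re

def parse_experience_range(text):
    """Return (min_years, max_years) from text like '8 - 11 Years', or None."""
    match = re.search(r'(\d+)\s*-\s*(\d+)', text)
    if not match:
        return None
    return int(match.group(1)), int(match.group(2))

print(parse_experience_range('8 - 11 Years'))   # (8, 11)
print(parse_experience_range('Unknown'))        # None
```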

In [466]:
atag = details[0].find_all('a')

In [467]:
atag

[]

In [488]:
btn_apply = monster_doc.find_all('button') #, {'id': 'applyNowBtn'})

In [489]:
btn_apply

[]

In [6]:
monster_doc

<html style="--vh: 6.99px;"><head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link crossorigin="" href="https://media.foundit.in/" rel="preconnect"/>
<link crossorigin="" href="https://static.criteo.net/" rel="preconnect"/>
<link crossorigin="" href="https://cdn.mouseflow.com/" rel="preconnect"/>
<link crossorigin="" href="https://px.ads.linkedin.com/" rel="preconnect"/>
<link crossorigin="" href="https://c.clarity.ms/" rel="preconnect"/>
<link crossorigin="" href="https://stats.g.doubleclick.net/" rel="preconnect"/>
<link crossorigin="" href="https://www.google-analytics.com/" rel="preconnect"/>
<link crossorigin="" href="https://static.hotjar.com/" rel="preconnect"/>
<link crossorigin="" href="https://script.hotjar.com/" rel="preconnect"/>
<link crossorigin="" href="https://vars.hotjar.com/" rel="preconnect"/>
<link crossorigin="" href="https://in1.wzrkt.com/" rel="preconnect"/>

In [5]:
monster_doc = get_webpage_selenium('https://www.foundit.in/srp/results?query=Web+developer&locations=Delhi&experienceRanges=3~3&experience=3')

In [1]:

import urllib.parse
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC