# **Introduction**  
In today's job market, finding relevant job listings efficiently is crucial for job seekers and recruiters. This Python script automates the process of scraping job listings from **Hirist**, a popular job portal, for various **Data Science and AI-related roles**. Instead of manually searching, this script extracts **structured job data**, including **job titles, descriptions, company details, experience requirements, salary, and skills**.  

# **Objective**  
The objective of this script is to:  
1. **Automate job data extraction** for various **AI and Data Science-related roles** from **Hirist**.  
2. **Retrieve job details efficiently**, including **company name, salary, location, and required skills**.  
3. **Store the extracted data in a structured format** (like a **Pandas DataFrame** or a **CSV file**) for further analysis.  
4. **Enhance job search automation** by allowing users to programmatically access job listings.  

# **Background**  
- **Hirist.tech** is a job portal specializing in technology-related roles.  
- **Job data is embedded in the page source** as JSON assigned to a JavaScript variable, so we parse that JSON directly instead of scraping the rendered HTML elements.  
- **Requests are used** to fetch job pages, and the **JSON data is extracted** from the `window.__PRELOADED_STATE__` variable embedded in the website's source code.  
- **The script loops through multiple job roles**, constructs job search URLs, and extracts relevant details from each job listing.  

---

In [1]:
import requests
import json
import pandas as pd
import re

# Headers to avoid getting blocked
headers = {
    "User-Agent": "Mozilla/5.0"
}

# Job roles to search for
job_roles = [
    "Data Scientist", "Data Science Intern", "Data Science Engineer",
    "Generative AI", "Data Analyst", "Power BI", "Machine Learning"
]

# Base URL for Hirist job search
base_url = "https://www.hirist.tech/search/"

# Function to create each job url using title and job id
def generate_hirist_url(job):
    """Generate a Hirist job post URL from JSON data."""
    base_url = "https://www.hirist.tech/j/"
    job_title = job['title'].lower().replace(" ", "-")  # Convert title to lowercase and replace spaces with "-"
    job_title = re.sub(r'[^a-z0-9-]', '', job_title)  # Remove special characters
    job_id = job['id']
    
    return f"{base_url}{job_title}-{job_id}.html"


# Marker that precedes the JSON state embedded in each page's source
STATE_MARKER = "window.__PRELOADED_STATE__ ="


def extract_preloaded_state(html_text):
    """Parse the JSON assigned to window.__PRELOADED_STATE__ in a page's HTML."""
    marker_pos = html_text.find(STATE_MARKER)
    if marker_pos == -1:
        raise ValueError("window.__PRELOADED_STATE__ not found in page source")
    start = marker_pos + len(STATE_MARKER)
    end = html_text.index("</script>", start)  # the JSON runs to the end of its script tag
    return json.loads(html_text[start:end].strip().rstrip(";"))


# Generate search URLs for each job role
job_urls = {role: f"{base_url}{role.replace(' ', '-').lower()}.html" for role in job_roles}
joblist = []

for role, url in job_urls.items():
    print(f"{role}: {url}")

    # Fetch the search results page and parse its embedded JSON
    response = requests.get(url, headers=headers)
    job_data = extract_preloaded_state(response.text)

    for job in job_data["feed"]["jobFeed"]:
        # Build and print the detail-page URL for this listing
        joburl = generate_hirist_url(job)
        print(joburl)
        try:
            # Fetch the detail page and parse its embedded JSON
            level2_response = requests.get(joburl, headers=headers)
            job_detail = extract_preloaded_state(level2_response.text)

            # Job fields live under jobInfo -> jobData -> data
            detail = job_detail.get("jobInfo", {}).get("jobData", {}).get("data", {})

            # Extract skills as a comma-separated string
            skills = ", ".join(skill.get("name", "") for skill in detail.get("tags", []))
            # An empty "locations" list falls back to "Unknown" instead of raising IndexError
            locations = detail.get("locations") or [{}]

            # Create a dictionary for the DataFrame
            job_dict = {
                "Job ID": detail.get("id"),
                "Job Description": detail.get("introText"),
                "Title": detail.get("title"),
                "Designation": detail.get("jobdesignation"),
                "Min Experience": detail.get("min"),
                "Max Experience": detail.get("max"),
                "Category ID": detail.get("categoryId"),
                "Job URL": detail.get("jobDetailUrl"),
                "Location": locations[0].get("name", "Unknown"),
                "Company ID": detail.get("companyData", {}).get("companyId"),
                "Company Name": detail.get("companyData", {}).get("companyName"),
                "Recruiter ID": detail.get("recruiter", {}).get("recruiterId"),
                "Recruiter Name": detail.get("recruiter", {}).get("recruiterName"),
                "Recruiter Designation": detail.get("recruiter", {}).get("designation"),
                "Min Salary": detail.get("minSal"),
                "Max Salary": detail.get("maxSal"),
                "Functional Area": detail.get("functionalAreaName"),
                "Skills": skills,
            }
            joblist.append(job_dict)
        except Exception:
            # Skip listings whose detail page cannot be fetched or parsed
            continue
    


Data Scientist: https://www.hirist.tech/search/data-scientist.html
https://www.hirist.tech/j/haleon---data-engineer---azure-databricksdata-factory-5-9-yrs-1432749.html
https://www.hirist.tech/j/data-analyst-0-3-yrs-1434028.html
https://www.hirist.tech/j/data-analyst-0-2-yrs-1433670.html
https://www.hirist.tech/j/data-scientist---machine-learning-models-4-8-yrs-1436311.html
https://www.hirist.tech/j/data-engineer---pythonspark-2-10-yrs-1434327.html
https://www.hirist.tech/j/data-scientist---machine-learning-models-5-15-yrs-1435185.html
https://www.hirist.tech/j/data-engineer---machine-learning-7-10-yrs-1433114.html
https://www.hirist.tech/j/zingbus---data-analyst---rpython-2-4-yrs-1434627.html
https://www.hirist.tech/j/engro-technologies---data-scientist---sas-viya-platform-1-3-yrs-1435959.html
https://www.hirist.tech/j/data-modeler---etlerwin-5-15-yrs-1432492.html
https://www.hirist.tech/j/skellam---data-engineer---etlsql-4-8-yrs-1432719.html
https://www.hirist.tech/j/data-analyst---po

## **🔹 Explanation of What We Did**
### **Step 1: Define Job Roles & Search URLs**
- We created a **list of job roles** such as **Data Scientist, Generative AI, Machine Learning, etc.**  
- Used the **Hirist job search base URL** and dynamically generated **search URLs** for each role.  
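
As a small sketch of the URL pattern described in this step, the helper below builds a search URL from a role name. The name `build_search_url` is hypothetical and used only for illustration; the base URL and the slug rule (lowercase, spaces replaced by hyphens) mirror the script.

```python
SEARCH_BASE_URL = "https://www.hirist.tech/search/"


def build_search_url(role: str) -> str:
    """Turn a role name into a search URL: lowercase, spaces become hyphens."""
    slug = role.replace(" ", "-").lower()
    return f"{SEARCH_BASE_URL}{slug}.html"


print(build_search_url("Data Scientist"))
# https://www.hirist.tech/search/data-scientist.html
```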

### **Step 2: Send Requests to Fetch Job Listings**
- Used **`requests.get(url, headers=headers)`** to send HTTP GET requests to each job search page.  
- Retrieved the **HTML response text** containing job data.  

### **Step 3: Extract JSON Data from `window.__PRELOADED_STATE__`**
- Job listings are embedded inside **a JavaScript variable** named `window.__PRELOADED_STATE__`.  
- We located this **JSON-encoded job data** in the page source using **string operations** (`find()`, slicing, and `strip()`), then parsed it with `json.loads()`.  
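
The extraction can be sketched on a made-up page as follows. The sample HTML and the helper name `extract_preloaded_state` are assumptions for illustration, and the sketch assumes the JSON ends where its `<script>` tag closes; a real page may differ in detail.

```python
import json

STATE_MARKER = "window.__PRELOADED_STATE__ ="


def extract_preloaded_state(html_text: str) -> dict:
    """Return the JSON object assigned to window.__PRELOADED_STATE__."""
    marker_pos = html_text.find(STATE_MARKER)
    if marker_pos == -1:
        raise ValueError("preloaded state not found")
    start = marker_pos + len(STATE_MARKER)
    # Assumption: the JSON runs up to the closing tag of its script element
    end = html_text.index("</script>", start)
    # Drop surrounding whitespace and an optional trailing semicolon
    return json.loads(html_text[start:end].strip().rstrip(";"))


# Made-up page source, not real Hirist output
sample_html = (
    "<html><body><script>"
    'window.__PRELOADED_STATE__ = {"feed": {"jobFeed": [{"id": 1, "title": "Data Analyst"}]}};'
    "</script></body></html>"
)

state = extract_preloaded_state(sample_html)
print(state["feed"]["jobFeed"][0]["title"])  # Data Analyst
```

Raising `ValueError` when the marker is missing makes a changed page layout fail loudly instead of producing a confusing JSON parse error.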

### **Step 4: Loop Through Job Listings**
- **Iterated through all jobs** inside `job_data["feed"]["jobFeed"]`.  
- **Constructed detailed job URLs** and printed them.  
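
To make the URL construction concrete, here is a minimal sketch of the same slug rule applied to one listing. The function name `build_job_url` is hypothetical, and the input title (including the experience range in parentheses) is an assumed example of what the feed might contain, not a value taken from the site.

```python
import re

JOB_BASE_URL = "https://www.hirist.tech/j/"


def build_job_url(title: str, job_id: int) -> str:
    """Slugify a job title and append the job ID to form a detail-page URL."""
    slug = title.lower().replace(" ", "-")  # spaces become hyphens
    slug = re.sub(r"[^a-z0-9-]", "", slug)  # drop everything except a-z, 0-9, and "-"
    return f"{JOB_BASE_URL}{slug}-{job_id}.html"


# Assumed example title; "/" and the parentheses are removed by the regex
print(build_job_url("Data Engineer - Python/Spark (2-10 yrs)", 1434327))
# https://www.hirist.tech/j/data-engineer---pythonspark-2-10-yrs-1434327.html
```

Because the hyphen in the title is surrounded by spaces, it turns into three consecutive hyphens in the slug, which matches the pattern visible in the printed URLs.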

### **Step 5: Fetch Job Details from Each Listing**
- Sent another **request to fetch job details** from the **job-specific URL**.  
- Extracted **deeper JSON data** containing more job information like **company details, salaries, experience, and skills**.  

### **Step 6: Parse Skills & Store Job Data**
- Extracted **skills as a comma-separated string** from the **"tags"** field in JSON.  
- Created a **dictionary (`job_dict`)** storing **all job details**.  
- Appended **each job entry to a list (`joblist`)** for later use.  
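
The parsing in this step can be sketched on a made-up record. The helper `summarize_job` and the sample dictionary are assumptions for illustration; only the key names (`tags`, `locations`, `companyData`) come from the script. The `or` fallbacks guard against keys that are present but empty or `None`.

```python
def summarize_job(detail: dict) -> dict:
    """Flatten a nested job record into a few columns, tolerating missing fields."""
    tags = detail.get("tags") or []
    locations = detail.get("locations") or [{}]  # empty list falls back to one empty dict
    company = detail.get("companyData") or {}
    return {
        "Job ID": detail.get("id"),
        "Title": detail.get("title"),
        "Location": locations[0].get("name", "Unknown"),
        "Company Name": company.get("companyName"),
        "Skills": ", ".join(tag.get("name", "") for tag in tags),
    }


# Made-up record with an empty locations list and no company data
sample_detail = {
    "id": 101,
    "title": "Data Analyst",
    "tags": [{"name": "SQL"}, {"name": "Python"}],
    "locations": [],
}

row = summarize_job(sample_detail)
print(row["Skills"])    # SQL, Python
print(row["Location"])  # Unknown
```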

---

## **🔹 Next Steps**
Now that we have the extracted job data, we can:
✔ **Convert it into a Pandas DataFrame:**  
```python
df = pd.DataFrame(joblist)
print(df.head())  # Preview data
```
✔ **Save it as a CSV file for analysis:**  
```python
df.to_csv("hirist_jobs.csv", index=False)
```
✔ **Analyze salary trends, required skills, and job locations!** 🚀
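
As one possible starting point for the skills analysis, the sketch below counts how often each skill appears across rows shaped like `job_dict`. It uses only the standard library; the function name `count_skills` and the sample rows are made up for illustration.

```python
from collections import Counter


def count_skills(rows):
    """Count skill occurrences across rows with a comma-separated 'Skills' field."""
    counts = Counter()
    for row in rows:
        for skill in (row.get("Skills") or "").split(","):
            skill = skill.strip()
            if skill:  # ignore empty entries from blank fields
                counts[skill] += 1
    return counts


# Made-up rows, not scraped data
sample_rows = [
    {"Skills": "Python, SQL"},
    {"Skills": "Python, Spark"},
    {"Skills": ""},
]

print(count_skills(sample_rows).most_common(1))  # [('Python', 2)]
```

With the scraped data, the same function could be called on `joblist` directly or on `df.to_dict("records")`.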

---

In [2]:
len(joblist)

682

In [3]:
# Convert the list of dictionaries into a DataFrame
df = pd.DataFrame(joblist)

# Remove duplicates based on the 'Job ID' column (keeping the first occurrence);
# drop_duplicates returns a new DataFrame, so assign the result back
df = df.drop_duplicates(subset="Job ID", keep="first")

# Display the DataFrame's dimensions
df.shape

In [None]:
df.to_csv("hirist.csv")

In [None]:
df.head()

Unnamed: 0,Job ID,Job Description,Title,Designation,Min Experience,Max Experience,Category ID,Job URL,Location,Company ID,Company Name,Recruiter ID,Recruiter Name,Recruiter Designation,Min Salary,Max Salary,Functional Area,Skills
0,1432749,<p><p>Job Description :</p><p><br/></p><p>Were...,Haleon - Data Engineer - Azure Databricks/Data...,Data Engineer,5,9,7,https://www.hirist.com/j/haleon-data-engineer-...,Bangalore,3348,Hirist,180511,Talent Bridge,HR,15,27,Data Analysis / Business Analysis,"Data Engineering, Data Pipeline, Data Architec..."
1,1434028,<p>Job Summary :</p><p><br/></p><p>We are look...,Data Analyst,Data Analyst,0,3,7,https://www.hirist.com/j/data-analyst-1434028....,Chennai,2767,VIBRANTUM LABZ PRIVATE LIMITED,194867,swetha,hr,6,12,Data Analysis / Business Analysis,"Data Analyst, Data Analytics, SQL, Reporting T..."
2,1433670,<p>Job Title : Data Analyst<br/><br/>Experienc...,Data Analyst,Data Analyst,0,2,7,https://www.hirist.com/j/data-analyst-1433670....,Gurgaon/Gurugram,0,Creencia Technologies Pvt Ltd,57068,Aruna,Talent Acquisition Specialist,7,8,Data Analysis / Business Analysis,"Data Analyst, Data Analytics, Data Management,..."
3,1434327,<p>Job Description :</p><p><br/></p><p>Our Cli...,Data Engineer - Python/Spark,Data Engineer,2,10,7,https://www.hirist.com/j/data-engineer-pythons...,Bangalore,0,Fidius Advisory,132726,Priyanka Mohan,Consultant,8,30,Data Engineering,"Data Engineering, Python, Spark, Scala, Data M..."
4,1433114,<p>Job Description :<br/><br/>- We are seeking...,Data Engineer - Machine Learning,Data Engineer,7,10,7,https://www.hirist.com/j/data-engineer-machine...,Bangalore,0,Interon IT Solutions,212071,Charan Reddy,US IT Recruiter,20,30,Data Engineering,"Machine Learning, Data Engineering, ETL, Azure..."


## **🎯 Summary**
✅ We **scraped job listings** from **Hirist.tech** for each job role in our list.  
✅ Extracted **job details like title, experience, salary, location, and skills**.  
✅ Stored the data in a structured format for **further analysis or automation**.  
✅ This approach can be **extended to other job portals** for **better job search automation**.  

💡 **Now, you can easily find relevant Data Science jobs without manual searches!** 🎯 🚀