<a href="https://colab.research.google.com/github/rohitpaul09/Web-Scraping-Data-Science-Job-Listings/blob/main/Web_Scraping_Job_Listings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Project Name: Web Scraping Data Science Job Listings**

##### **Project Type**    - EDA
##### **Contribution**    - Individual

## **Project Summary:**

 The primary goal is to develop a tool that streamlines data science job searches by scraping listings from the TimesJobs website. By extracting key details and presenting insights through visualizations, the tool aims to help individuals navigate the data science job market and to keep professionals, job seekers, and recruiters informed about industry trends.

**Web Scraping:** The project began with web scraping job listings from the TimesJobs website. The Python code utilized the BeautifulSoup library to extract relevant details from the job listings, including job title, company name, skills required, posting time, location, and salary. The scraping process involved iterating through multiple pages of job listings, refining the extraction process, and handling diverse data structures on the website.
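
The per-listing extraction described above can be sketched on a small inline HTML sample. This is a minimal illustration, not the notebook's exact code: the sample markup is hypothetical and only modeled on the class names the notebook targets, and the helper names `text_or_default` and `parse_listing` are assumptions introduced here. It reads every field from the same job card, so a missing element yields a placeholder instead of shifting values between rows.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the class names used in the notebook;
# the live TimesJobs page may be structured differently.
SAMPLE_HTML = """
<ul>
  <li class="clearfix job-bx wht-shd-bx">
    <h2>Data Analyst</h2>
    <h3 class="joblist-comp-name">Example Corp (More Jobs)</h3>
    <ul class="top-jd-dtl clearfix">
      <li>card_travel2 - 5 yrs</li>
      <li><span>Pune, Mumbai</span></li>
    </ul>
  </li>
  <li class="clearfix job-bx wht-shd-bx">
    <h2>BI Developer</h2>
    <h3 class="joblist-comp-name">Sample Ltd</h3>
  </li>
</ul>
"""

def text_or_default(tag, default="Not Provided"):
    """Return the stripped text of a tag, or a default when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else default

def parse_listing(job):
    """Build one record from a single job card so all fields stay aligned."""
    details = job.find("ul", class_="top-jd-dtl clearfix")
    exp_tag = details.find("li") if details else None
    experience = "Not Mentioned"
    if exp_tag is not None and "yrs" in exp_tag.get_text():
        # Drop the icon text and the unit, then trim surrounding whitespace
        experience = exp_tag.get_text().replace("card_travel", "").replace("yrs", "").strip()
    company = text_or_default(job.find("h3", class_="joblist-comp-name"))
    return {
        "Job Title": text_or_default(job.find("h2")),
        "Company": company.replace("(More Jobs)", "").strip(),
        "Location": text_or_default(details.find("span") if details else None),
        "Experience Required(Years)": experience,
    }

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [parse_listing(job) for job in soup.find_all("li", class_="clearfix job-bx wht-shd-bx")]
print(records)
```

The second sample card has no details block, so its location and experience fall back to the placeholder strings rather than raising an error.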

**Data Cleaning and Transformation:** Following data extraction, the code applied techniques like strip() and replace() to organize the data systematically using the pandas library. Custom functions ensured a clean extraction of salary information, handling variations like 'Lacs,' and refining experience data for consistency.
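
As a rough illustration of this cleaning step, the sketch below turns range strings such as `16.00 - 29.00` or `4 - 9` into numeric bounds. The helper names `parse_range` and `clean_salary` and the raw input format are assumptions made for this example; the notebook itself keeps these values as strings.

```python
import re

# Matches "low - high" where each side is an integer or a decimal number
RANGE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)")

def parse_range(text):
    """Return (low, high) as floats, or (None, None) when no range is present."""
    match = RANGE_PATTERN.search(text)
    if match is None:
        return (None, None)
    return (float(match.group(1)), float(match.group(2)))

def clean_salary(raw):
    """Remove the currency prefix and unit suffix, then parse the range."""
    cleaned = raw.strip().replace("₹Rs", "").replace("Lacs p.a.", "")
    return parse_range(cleaned)

# The raw salary string below is a hypothetical example of the scraped format
salary = clean_salary("₹Rs 16.00 - 29.00 Lacs p.a.")
experience = parse_range("4 - 9")
missing = parse_range("Not Provided")
print(salary, experience, missing)
```

Returning `(None, None)` for placeholders such as `Not Provided` keeps missing values distinguishable from real ranges when the bounds are later plotted.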

**Visualization/EDA (Exploratory Data Analysis):**
Various visualizations were created to offer insights into different facets of the data science job market:

* **WordCloud for In-Demand Skills** revealed Python, SQL, Machine Learning, and Data Mining as highly sought-after skills.
* **Top Cities with Most Job Openings** visualized job distribution across cities, highlighting Delhi with the highest number of openings.
* **Comparison of Full-Time Jobs and Internships** indicated that 88.0% of opportunities are full-time positions, with the remaining 12.0% being internships.
* **Top Companies Providing Internship Opportunities** identified Maxgen Technologies as the leading provider.
* **Comparison of Work from Home vs. On-Site Opportunities** showed 80.0% on-site and 20.0% work-from-home options.
* **Top Companies Providing Work from Home Options** showcased Minanshika Softech Solution Pvt Ltd and Soumya Gayen at the forefront.
* **Salary Distribution Analysis** depicted a concentration around 0-10 Lacs per annum, indicating entry-level pay scales.
* **Experience Requirements Analysis** revealed a demand for beginners and for roles requiring 1-3 years of experience.
* **Relationship Between Salary and Experience** visualized clusters, with the 0-10 Lacs range indicating entry-level roles and values near 50 Lacs corresponding to higher packages for experienced professionals.

In summary, the project developed a valuable tool for comprehending data science job opportunities. By extracting data from the website and employing visualizations, it offered beneficial insights for professionals, job seekers, and recruiters in the dynamic field of data science. **It is important to note that these insights are derived from a snapshot of data and may evolve with real-time updates.**

## **GitHub Link:** https://github.com/rohitpaul09/Web-Scraping-Data-Science-Job-Listings

## **Problem Statement: Navigating the Data Science Job Landscape**

🚀 Unleash your creativity in crafting a solution that taps into the heartbeat of the data science job market! Envision an ingenious project that seamlessly wields cutting-edge web scraping techniques and illuminating data analysis.

🔍 Your mission? To engineer a tool that effortlessly gathers job listings from a multitude of online sources, extracting pivotal nuggets such as job descriptions, qualifications, locations, and salaries.

🧩 However, the true puzzle lies in deciphering this trove of data. Can your solution discern patterns that spotlight the most coveted skills? Are there threads connecting job types to compensation packages? How might it predict shifts in industry demand?

🎯 The core **objectives** of this challenge are as follows:

1. Web Scraping Mastery: Forge an adaptable and potent web scraping mechanism. Your creation should adeptly harvest data science job postings from a diverse array of online platforms. Be ready to navigate evolving website structures and process hefty data loads.

2. Data Symphony: Skillfully distill vital insights from the harvested job listings. Extract and cleanse critical information like job titles, company names, descriptions, qualifications, salaries, locations, and deadlines. Think data refinement and organization.

3. Market Wizardry: Conjure up analytical tools that draw meaningful revelations from the gathered data. Dive into job demand trends, geographic distribution, salary variations tied to experience and location, favored qualifications, and emerging skill demands.

4. Visual Magic: Weave a tapestry of visualization magic. Design captivating charts, graphs, and visual representations that paint a crystal-clear picture of the analyzed data. Make these visuals the compass that guides users through job market intricacies.

🌐 While the web scraping universe is yours to explore, consider these platforms as potential stomping grounds:

* LinkedIn Jobs
* Indeed
* Naukri
* Glassdoor
* AngelList
* TimesJobs


🎈 Your solution should not only decode the data science job realm but also empower professionals, job seekers, and recruiters to harness the dynamic shifts of the industry. The path is open, the challenge beckons – are you ready to embark on this exciting journey?






## **Let's Begin!**

## ***1. Know Your Data***

### **Import Libraries**

In [3]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import requests
from pymongo import MongoClient
import warnings
warnings.filterwarnings('ignore')


### **Web Scraping Job Listings with BeautifulSoup and Pandas**


In [4]:
# Define the function to extract salary information
def extract_salary(job_element):
    # Extract salary information containing 'Lacs'
    salary_tags = job_element.find_all('li')
    for tag in salary_tags:
        if 'Lacs' in tag.text:
            # Remove the currency prefix and unit suffix, then trim leftover whitespace
            return tag.text.replace('₹Rs','').replace('Lacs p.a.','').strip()
    return 'Not Provided'

def scrape_jobs(pages):
    all_data = []
    experience_required_list = []

    for page in range(1, pages + 1):
        # Define the URL for each page
        url = f'https://www.timesjobs.com/candidate/job-search.html?from=submit&luceneResultSize=25&txtKeywords=0DQT0Data%20Analyst0DQT0%20,0DQT0Data%20Mining0DQT0,0DQT0Data%20Architect0DQT0,0DQT0Machine%20Learning0DQT0,0DQT0Power%20Bi0DQT0,0DQT0Business%20Analyst0DQT0,0DQT0senior%20business%20analyst0DQT0,0DQT0Bi%20Developer0DQT0&postWeek=7&searchType=personalizedSearch&actualTxtKeywords=0DQT0Data%20Analyst0DQT0%20,0DQT0Data%20Mining0DQT0,0DQT0Data%20Architect0DQT0,0DQT0Machine%20Learning0DQT0,0DQT0Power%20Bi0DQT0,0DQT0Business%20Analyst0DQT0,senior%20business%20analyst,0DQT0Bi%20Developer0DQT0&searchBy=0&rdoOperator=OR&pDate=I&sequence={page}&startPage=1'
        html_text = requests.get(url).text
        soup = BeautifulSoup(html_text, 'lxml')

        # Refining the extraction process to remove unwanted strings from the experience information
        for item in soup.find_all('ul', {'class': 'top-jd-dtl clearfix'}):
            exp_tag = item.find('li')
            if exp_tag and 'yrs' in exp_tag.text:
                # Extracting the experience text and removing any unwanted strings
                experience = exp_tag.text.replace('card_travel', '').replace('yrs','').strip()
                experience_required_list.append(experience)
            else:
                experience_required_list.append('Not Mentioned')

        # Extract job listings
        job_listings = soup.find_all('li', class_='clearfix job-bx wht-shd-bx')

        for job in job_listings:
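            # Normalize the skills text; a bare '.' (or an empty string) means no skills were listed
            skills = job.find('span', class_='srp-skills').text.strip().replace(' ', '').replace('\r', '').replace('\n', '')
            if skills in ('', '.'):
                skills = 'Not Provided'
            # Guard against a missing details block or a missing location span
            location_element = job.find('ul', class_='top-jd-dtl clearfix')
            location_span = location_element.find('span') if location_element else None
            location = location_span.text.strip() if location_span else 'Not Provided'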
            posted_ago = job.find('span', class_='sim-posted').span.text.strip().replace('\r', '').replace('Posted ', '').replace('\t', '').replace('\n', '')
            company_name = job.find('h3', class_='joblist-comp-name').text.strip().replace('\r', '').replace('\n', '').replace(' (More Jobs)', '')

            # Create a dictionary for job data
            job_data = {
                'Job Title': job.find('h2').text.strip(),
                'Company': company_name,
                'Skills Required': skills,
                'Job Posted Ago': posted_ago,
                'Location': location,
                'Salary(Lacs p.a.)': extract_salary(job)
            }
            # Append the job data to the list
            all_data.append(job_data)

    # Create a DataFrame from the collected job data
    df = pd.DataFrame(all_data)
    # Add the 'Experience Required' column to the DataFrame
    df['Experience Required(Years)'] = experience_required_list

    return df

# Scrape the first 10 pages of job listings
df = scrape_jobs(10)


### **Dataset First View**

In [5]:
# Display the head of the DataFrame with data from multiple pages
print('Scraped Data from Multiple Pages:')
df.head(10)


Scraped Data from Multiple Pages:


| | Job Title | Company | Skills Required | Job Posted Ago | Location | Salary(Lacs p.a.) | Experience Required(Years) |
|---|---|---|---|---|---|---|---|
| 0 | Data Modeller | Electrobrain modern technologies pvt ltd | BusinessAnalyst,Principal,AssociateDirector,Ar... | 1 day ago | Bengaluru / Bangalore, Chennai, Delhi/NCR, ... | Not Provided | 4 - 9 |
| 1 | Data Science Analytics | Electrobrain modern technologies pvt ltd | powerpoint,machinelearningalgorithms,statistic... | 1 day ago | Bengaluru / Bangalore, Chennai, Delhi/NCR, ... | Not Provided | 4 - 9 |
| 2 | Manager-Data Science | Electrobrain modern technologies pvt ltd | Projectmanagement,Finance,datascience,Automati... | 1 day ago | Bengaluru / Bangalore, Chennai, Delhi/NCR, ... | Not Provided | 5 - 9 |
| 3 | Mining Engineer Hiring For SINGAPORE | CLOUD VISA IMMIGRATION LLP | miningengineer,miningoperator,miningengineeran... | today | Singapore | 50.00 - 95.00 | 3 - 8 |
| 4 | Head of Design | Rina Israni | BusinessAnalyst,assistantmanager,teamleader,te... | today | Kolkata, Mumbai, Noida/Greater Noida, Pune,... | 16.00 - 29.00 | 13 - 18 |
| 5 | Summer Internship in Ahmedabad | Maxgen Technologies | Not Provided | today | Ahmedabad, Bhavnagar, Gandhinagar, Jamnagar... | Not Provided | 0 - 1 |
| 6 | GTU Internship in Ahmedabad | Maxgen Technologies | Not Provided | 1 day ago | Ahmedabad, Mehsana, Rajkot, Surat, Surendr... | Not Provided | 0 - 1 |
| 7 | Python Internship in Ahmedabad | Maxgen Technologies | Not Provided | today | Ahmedabad, Mehsana, Rajkot, Surat, Surendr... | Not Provided | 0 - 1 |
| 8 | Summer Internship in Pune | Maxgen Technologies | Not Provided | today | Pune, Amravati, Aurangabad, Sangli, Satara | 1.00 - 2.00 | 0 - 1 |
| 9 | Summer Internship in Pune | Maxgen Technologies | Not Provided | 1 day ago | Pune, Jalgaon, Kolhapur, Nagpur, Solapur | Not Provided | 0 - 1 |


### **Dataset Rows & Columns count**

In [6]:
# Dataset Rows & Columns count
print('Scraped Data Rows Count:',df.shape[0])
print('Scraped Data Columns Count:',df.shape[1])


Scraped Data Rows Count: 250
Scraped Data Columns Count: 7


### **Dataset Information**

In [7]:
# Dataset Info
print('Scraped Data Info:')
df.info()


Scraped Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 7 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Job Title                   250 non-null    object
 1   Company                     250 non-null    object
 2   Skills Required             250 non-null    object
 3   Job Posted Ago              250 non-null    object
 4   Location                    250 non-null    object
 5   Salary(Lacs p.a.)           250 non-null    object
 6   Experience Required(Years)  250 non-null    object
dtypes: object(7)
memory usage: 13.8+ KB


#### **Duplicate Values**

In [8]:
# Dataset Duplicate Value Count
print('Scraped Data Duplicate Value Count:',len(df[df.duplicated()]))


Scraped Data Duplicate Value Count: 1


#### **Missing Values/Null Values**

In [9]:
# Missing Values/Null Values Count
# Function to calculate the percentage of null values in each column
def unified_null_percent(data_fm):
    # Convert empty strings to NaN
    data_fm = data_fm.replace('', pd.NA)

    null_info = pd.DataFrame(index=data_fm.columns)
    null_info["datatype"] = data_fm.dtypes
    null_info["not null values"] = data_fm.count()
    null_info["null value"] = data_fm.isnull().sum()
    null_info["null value(%)"] = round(data_fm.isnull().mean() * 100, 2)

    return null_info
# Display the percentage of null values for the Scraped Data
print('Null value % in Scraped Data:', unified_null_percent(df), sep='\n')


Null value % in Scraped Data:
                           datatype  not null values  null value  \
Job Title                    object              250           0   
Company                      object              250           0   
Skills Required              object              250           0   
Job Posted Ago               object              250           0   
Location                     object              243           7   
Salary(Lacs p.a.)            object              250           0   
Experience Required(Years)   object              250           0   

                            null value(%)  
Job Title                             0.0  
Company                               0.0  
Skills Required                       0.0  
Job Posted Ago                        0.0  
Location                              2.8  
Salary(Lacs p.a.)                     0.0  
Experience Required(Years)            0.0  


### What did you know about your dataset?

The dataset comes from an online job portal and contains 250 rows and 7 columns, obtained by scraping the first 10 pages of search results. The 'Location' column has 2.8% missing values (7 of 250 rows, stored as empty strings). Our primary goal is to uncover trends in the current data science job market.
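
The sketch below illustrates, on toy data, why `df.info()` reports 250 non-null values for 'Location' while the helper above counts 7 missing: pandas does not treat an empty string as null until it is replaced. The toy DataFrame and variable names are made up for this example.

```python
import pandas as pd

# Toy data: one real city, one empty string, one genuinely missing value
toy = pd.DataFrame({"Location": ["Pune", "", None]})

# Only the None entry counts as null at this point
nulls_before = int(toy["Location"].isnull().sum())

# After converting empty strings to pd.NA, both gaps are counted
normalized = toy.replace("", pd.NA)
nulls_after = int(normalized["Location"].isnull().sum())
null_percent = round(normalized["Location"].isnull().mean() * 100, 2)

print(nulls_before, nulls_after, null_percent)
```

The same arithmetic applied to the scraped data gives 7 / 250 = 2.8%.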

## ***2. Understanding Your Variables***

In [10]:
# Dataset Columns
print('Scraped Dataset Columns:',df.columns,sep='\n',end='\n\n')


Scraped Dataset Columns:
Index(['Job Title', 'Company', 'Skills Required', 'Job Posted Ago', 'Location',
       'Salary(Lacs p.a.)', 'Experience Required(Years)'],
      dtype='object')



In [11]:
# Dataset Description
df.describe(include='object')


| | Job Title | Company | Skills Required | Job Posted Ago | Location | Salary(Lacs p.a.) | Experience Required(Years) |
|---|---|---|---|---|---|---|---|
| count | 250 | 250 | 250 | 250 | 250 | 250 | 250 |
| unique | 214 | 127 | 229 | 8 | 66 | 21 | 56 |
| top | Business Analyst | BUSISOL SOURCING INDIA PVT. LTD | Not Provided | few days ago | Bengaluru / Bangalore | Not Provided | 5 - 8 |
| freq | 11 | 21 | 7 | 83 | 57 | 227 | 23 |


### **Variables Description**

### Descriptions for Scraped Dataset:
**Job Title:** The specific designation associated with the job opening.

**Company:** The name of the organization that has posted the job.

**Skills Required:** The essential skills and qualifications needed for the job.

**Job Posted Ago:** A relative label for how long ago the job was posted (for example, 'today', '1 day ago', or 'few days ago'), providing insight into its freshness.

**Location:** The list of cities where the job opportunity is available.

**Salary (Lacs p.a.):** The salary range for the position on an annual basis, denoted in lakhs.

**Experience Required (Years):** The range of years of professional experience required for the job (for example, '4 - 9').
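
Because 'Skills Required' holds a comma-separated string per job, counting individual skills requires splitting it first. The following is a minimal sketch under that assumption; the function name `count_skills` and the sample strings are illustrative and not taken from the scraped data.

```python
from collections import Counter

def count_skills(skill_strings):
    """Count how often each skill appears across comma-separated skill strings."""
    counts = Counter()
    for raw in skill_strings:
        if raw == "Not Provided":
            continue  # skip the placeholder used for listings without skills
        for skill in raw.split(","):
            skill = skill.strip().lower()
            if skill:
                counts[skill] += 1
    return counts

# Illustrative values in the same shape as the 'Skills Required' column
sample = ["Python,SQL,MachineLearning", "python,DataMining", "Not Provided"]
skill_counts = count_skills(sample)
print(skill_counts.most_common(3))
```

Lowercasing before counting merges variants such as `Python` and `python`, which matters when the counts feed a word cloud or a bar chart.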

## 3. ***Data Wrangling***

In [12]:
# Show Dataset Rows & Columns count Before Removing Duplicates
print('Shape Before Removing Duplicates:')
print('Scraped Dataset Rows count:',df.shape[0])
print('Scraped Dataset Columns count:',df.shape[1],end='\n\n')

# Remove duplicates
df.drop_duplicates(inplace=True)

# Show Dataset Rows & Columns count After Removing Duplicates
print('Shape After Removing Duplicates:')
print('Scraped Dataset Rows count:',df.shape[0])
print('Scraped Dataset Columns count:',df.shape[1])


Shape Before Removing Duplicates:
Scraped Dataset Rows count: 250
Scraped Dataset Columns count: 7

Shape After Removing Duplicates:
Scraped Dataset Rows count: 249
Scraped Dataset Columns count: 7


In [13]:
# Show Dataset Rows & Columns count Before Removing Missing Values
print('Shape Before Removing Missing Values:')
print('Scraped Dataset Rows count:',df.shape[0])
print('Scraped Dataset Columns count:',df.shape[1],end='\n\n')

# Replace empty strings with NaN in the 'Location' column
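# (assign the result back; calling replace(..., inplace=True) on a selected column may not update df in newer pandas)
df['Location'] = df['Location'].replace('', pd.NA)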

# Drop rows with null values in the 'Location' column
df.dropna(subset=['Location'], inplace=True)

# Show Dataset Rows & Columns count After Removing Missing Values
print('Shape After Removing Missing Values:')
print('Scraped Dataset Rows count:',df.shape[0])
print('Scraped Dataset Columns count:',df.shape[1])


Shape Before Removing Missing Values:
Scraped Dataset Rows count: 249
Scraped Dataset Columns count: 7

Shape After Removing Missing Values:
Scraped Dataset Rows count: 242
Scraped Dataset Columns count: 7


In [14]:
# Check missing values again to confirm
print('Updated number of missing values in Scraped Dataset:')
df.isnull().sum()


Updated number of missing values in Scraped Dataset:


Job Title                     0
Company                       0
Skills Required               0
Job Posted Ago                0
Location                      0
Salary(Lacs p.a.)             0
Experience Required(Years)    0
dtype: int64

### What all manipulations have you done and insights you found?

- Checked the dataset shape before removing duplicates
  - Rows count: 250
  - Columns count: 7
- Removed duplicate rows with `df.drop_duplicates(inplace=True)`
  - Rows count: 249
  - Columns count: 7
- Replaced empty strings in the 'Location' column with `pd.NA` so that they count as missing values
- Dropped rows with a null 'Location' using `df.dropna(subset=['Location'], inplace=True)`
  - Rows count: 242
  - Columns count: 7
- Confirmed with `df.isnull().sum()` that no missing values remain

In [15]:


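# Connect to a local MongoDB instance (assumes a server is listening on the default port)
client = MongoClient('mongodb://localhost:27017/')
# Select the database and collection; MongoDB creates them on first insert
db = client['database_name']
collection = db['job_data']
# Convert each DataFrame row to a dictionary and insert all rows as documents
df_dict = df.to_dict(orient='records')
collection.insert_many(df_dict)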


<pymongo.results.InsertManyResult at 0x242b2b60430>
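
Re-running the insert adds the same listings again, because each inserted document receives a fresh `_id`. One possible way to make re-runs safer, sketched below, is to derive a stable `_id` from a few identifying fields before writing. The helper `with_stable_id` and the choice of key fields are assumptions for illustration, not part of the notebook.

```python
import hashlib
import json

def with_stable_id(record, key_fields=("Job Title", "Company", "Location")):
    """Return a copy of the record with an `_id` derived from its key fields."""
    key = json.dumps([record.get(field) for field in key_fields], ensure_ascii=False)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return {"_id": digest, **record}

# Illustrative records in the shape produced by df.to_dict(orient='records')
records = [
    {"Job Title": "Data Analyst", "Company": "Example Corp", "Location": "Pune"},
    {"Job Title": "Data Analyst", "Company": "Example Corp", "Location": "Pune"},
    {"Job Title": "BI Developer", "Company": "Sample Ltd", "Location": "Mumbai"},
]
documents = [with_stable_id(record) for record in records]
unique_ids = {doc["_id"] for doc in documents}
print(len(documents), len(unique_ids))
```

With stable identifiers, each document could then be written with PyMongo's `replace_one(..., upsert=True)` instead of `insert_many`, so that a repeated run overwrites existing listings rather than duplicating them.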

In [None]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from pymongo import MongoClient

# Scrape the first 10 pages of job listings
df = scrape_jobs(10)

# Display the head of the DataFrame with data from multiple pages
print('Scraped Data from Multiple Pages:')
print(df.head(10))

# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
db = client['job_database']
collection = db['job_collection']

# Convert DataFrame to dictionary
df_dict = df.to_dict(orient='records')

# Insert the data into MongoDB
collection.insert_many(df_dict)
