# Web scraping using Selenium and Beautiful Soup:

### Selenium:

![image.png](attachment:image.png)

### BeautifulSoup:

![image.png](attachment:image.png)

### Aim:
To fetch LinkedIn job posts that are less than a year old, filtered by country and keywords.

#### Import Libraries:

In [37]:
import requests
from bs4 import BeautifulSoup
import re, time, os, math, json
from datetime import datetime, timedelta

#### Test - Fetching data from a LinkedIn URL:

In [38]:
response = requests.get('https://www.linkedin.com/jobs/')

In [39]:
response.status_code

200

In [40]:
response.text

'<!DOCTYPE html>\n\n    \n    \n    \n    \n    \n    \n    \n    \n\n    \n    <html lang="en">\n      <head>\n        <meta name="pageKey" content="d_homepage-guest-jobs-home">\n<!---->        <meta name="locale" content="en_US">\n        <meta id="config" data-app-version="2.1.744" data-call-tree-id="AAX8rilvAE5rt/gb+x345Q==" data-jet-tags="guest-homepage" data-multiproduct-name="homepage-guest-frontend" data-service-name="homepage-guest-frontend" data-browser-id="74e7358f-4eb5-46ec-800b-c5e48fd9548f" data-enable-page-view-heartbeat-tracking data-disable-comscore-tracking data-page-instance="urn:li:page:d_homepage-guest-jobs-home;HkqtQLZ8QgSMry2fA+XUzQ==" data-disable-jsbeacon-pagekey-suffix="false" data-member-id="0">\n\n        <link rel="canonical" href="https://www.linkedin.com/jobs">\n          <link rel="alternate" hreflang="de" href="https://de.linkedin.com/jobs">\n          <link rel="alternate" hreflang="en-IE" href="https://ie.linkedin.com/jobs">\n          <link rel="alte

In [60]:
data1 = BeautifulSoup(response.text, 'html.parser')

In [61]:
data1.find('span', class_="results-context-header__job-count")
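
The span looked up above is where a search results page reports its result count. A minimal sketch of turning that label into a number of pages, assuming the label looks like `270` or `1,000+` and that each page holds 25 results; `parse_job_count` and `pages_needed` are hypothetical helper names, not part of the notebook:

```python
import math

def parse_job_count(text):
    """Convert a count label such as '1,000+' to an integer; return 0 if it is not numeric."""
    cleaned = text.strip().rstrip("+").replace(",", "")
    return int(cleaned) if cleaned.isdigit() else 0

def pages_needed(job_count, page_size=25):
    """Number of result pages required to list job_count jobs (page_size is an assumption)."""
    return math.ceil(job_count / page_size)

print(parse_job_count("270"), pages_needed(parse_job_count("270")))          # 270 11
print(parse_job_count(" 1,000+ "), pages_needed(parse_job_count("1,000+")))  # 1000 40
print(parse_job_count(""))                                                   # 0
```

Stripping whitespace before removing the `+` matters: with a trailing space, removing only the `+` would leave the sign in place and `int()` would fail.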

### Task 1. LinkedIn 1: Data Science and Machine Learning

Suppose that Integrify wants insights into the Machine Learning and Data Science job market, in order to establish best practices and update the curriculum so that students receive as many job offers as possible.

Your tasks are the following:
- Each group member will be working on one country (Finland, Netherlands, Denmark, Sweden, and Germany)
- Use the following keyword sets and try to locate 20 companies in each country:

    DataScience = [Data Science, Big data, Machine learning, Data mining, Artificial intelligence, Predictive modeling, Statistical analysis, Data visualization, Deep learning, Natural language processing, Business intelligence, Data warehousing, Data management, Data cleaning, Feature engineering, Time series analysis, Text analytics, Database, SQL, NoSQL, Neural networks, Regression analysis, Clustering, Dimensionality reduction, Anomaly detection, Recommender systems, Data integration, Data governance]

    MachineLearning = [Machine learning, Data preprocessing, Feature selection, Feature engineering, Data visualization, Model selection, Hyperparameter tuning, Cross-validation, Ensemble methods, Neural networks, Deep learning, Convolutional neural networks, Recurrent neural networks, Natural language processing, Computer vision, Reinforcement learning, Unsupervised learning, Clustering, Dimensionality reduction, Bayesian methods, Time series analysis, Random forest, Gradient boosting, Support vector machines, Decision trees, Regression analysis]
    

- Collect all job offers of each company for a one-year time frame. 
- You will end up with a dictionary where the keys are the company names and the values are a list of dictionaries. 
- The keys in the sub-dictionaries correspond to keywords, and the values correspond to the company’s posts that include those keywords. 
- In total, you will produce five dictionaries, each corresponding to one of the listed countries above. 
- Save each dictionary in JSON format under the name of the corresponding country.
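
The target structure described in the list above can be sketched as follows. The company names and links are placeholders, and `save_country_jobs` is a hypothetical helper used only for illustration:

```python
import json
import os

# Hypothetical sample data: company name -> list of {keyword: job post} dictionaries.
jobs_by_company = {
    "Example Company A": [
        {"Machine learning": "https://example.com/jobs/1"},
        {"SQL": "https://example.com/jobs/2"},
    ],
    "Example Company B": [
        {"Deep learning": "https://example.com/jobs/3"},
    ],
}

def save_country_jobs(jobs, country, directory="."):
    """Write one country's dictionary to <directory>/<country>.json and return the path."""
    path = os.path.join(directory, f"{country}.json")
    with open(path, "w", encoding="utf-8") as output_file:
        json.dump(jobs, output_file, indent=2, sort_keys=True, ensure_ascii=False)
    return path

# Show the JSON that would be written, without touching the file system.
print(json.dumps(jobs_by_company, indent=2, sort_keys=True))
```

Calling `save_country_jobs(jobs_by_company, "Finland")` would then write `Finland.json` to the current directory.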

#### Understanding the requirements:
1. We need to search for jobs by keyword and country.
2. We need to collect the job offers of each company that were posted within the last year.
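
The one-year limit in requirement 2 comes down to a date comparison. A minimal sketch, assuming post dates arrive as `YYYY-MM-DD` strings and that "one year" means 365 days; `is_within_last_year` is a hypothetical helper name:

```python
from datetime import datetime, timedelta

def is_within_last_year(date_string, today=None):
    """Return True if a YYYY-MM-DD date is at most 365 days before `today`."""
    today = today or datetime.today()
    posted = datetime.strptime(date_string, "%Y-%m-%d")
    return posted >= today - timedelta(days=365)

# A fixed reference day keeps the example reproducible.
reference_day = datetime(2023, 6, 1)
print(is_within_last_year("2023-05-20", today=reference_day))  # True
print(is_within_last_year("2022-01-15", today=reference_day))  # False
```
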

https://www.linkedin.com/jobs/search/?currentJobId=3617786050&f_TPR=r2592000&geoId=100456013&keywords=data%20scientist&location=Finland&refresh=true&sortBy=R&start=25

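Above is the URL produced when I searched LinkedIn manually for the keyword "data scientist". Its query parameters are:
* f_TPR - Date posted. The value appears to be a time window in seconds: r2592000 is 30 days, and r86400 would be the past 24 hours.
* geoId - Identifies the place; changes with the location
* keywords - Search terms, with spaces encoded as %20
* location - Country
* sortBy - Sort order; R means relevance
* start - Pagination offset. Starts at 0 for the first page and increases by 25 for each following page.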
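
The parameters listed above can be assembled programmatically instead of editing the URL string by hand. A minimal sketch using only the standard library; `build_search_url` and its argument names are hypothetical, and the set of parameters is taken from the manually observed URL rather than from any documented API:

```python
from urllib.parse import urlencode, quote

SEARCH_URL = "https://www.linkedin.com/jobs/search"
PAGE_SIZE = 25  # the start offset advances in steps of 25, as noted above

def build_search_url(keyword, location, geo_id, page=0, posted_within_seconds=None):
    """Build a job search URL for one keyword and one zero-based results page."""
    params = {
        "keywords": keyword.strip(),  # drop accidental leading/trailing spaces
        "location": location,
        "geoId": geo_id,
        "start": page * PAGE_SIZE,
    }
    if posted_within_seconds is not None:
        # Assumption: f_TPR takes the letter r followed by a window in seconds.
        params["f_TPR"] = f"r{posted_within_seconds}"
    # quote_via=quote encodes spaces as %20 instead of +
    return f"{SEARCH_URL}?{urlencode(params, quote_via=quote)}"

print(build_search_url("data scientist", "Finland", 100456013, page=1))
# https://www.linkedin.com/jobs/search?keywords=data%20scientist&location=Finland&geoId=100456013&start=25
```

Letting `urlencode` do the escaping also covers characters other than spaces, such as the `+` or `#` that can appear in technology names.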


#### Modifications for the url:

In [43]:
#The search url is turned into a template so that it can be reused for every keyword:
#the first {} takes the keyword and the second {} takes the pagination offset (start).
#We also add trk=public_jobs_jobs-search-bar_search-submit to the url.

base_url = "https://www.linkedin.com/jobs/search?keywords={}&location=Sweden&geoId=105117694&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start={}"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}

#### Function to extract job data from a results page:

In [51]:
#Extracting data for the jobs found:
def extract_job_data(data, keyword, pages, start, url, jobs_data):
    #Each job card is matched on its full class string, so this breaks if LinkedIn changes the markup:
    jobs_per_page = data.find_all('div', class_='base-card relative w-full hover:no-underline focus:no-underline base-card--link base-search-card base-search-card--link job-search-card')
    print('Number of jobs per page: ', len(jobs_per_page))

    for job in jobs_per_page:
        job_title = job.find('h3', class_='base-search-card__title').text.strip()
        company = job.find('h4', class_='base-search-card__subtitle').text.strip()
        job_link = job.find('a', class_='base-card__full-link')['href']
        date_posted = job.find('time', class_="job-search-card__listdate")

        if not date_posted:
            #No date element with this class was found, so the post is kept without an age check:
            print('Date Posted: ', date_posted)
        else:
            #Checking if it is posted less than 1 year ago:
            date_posted = date_posted['datetime']
            job_post_date = datetime.strptime(date_posted, '%Y-%m-%d')
            previous_year = datetime.today() - timedelta(days=365)

            if job_post_date < previous_year:
                print(f"Skipping job post since it is older than a year, job_post_date={job_post_date}")
                continue

        #Checking if a company is already in our list:
        if company in jobs_data:
            jobs_data[company].append({keyword: job_link})
            print("Add another post of ",company)
        else:
            jobs_data[company] = [{keyword: job_link}]
        if len(jobs_data.keys()) >= 20:
            break

    print(f"Pages={pages}")
    print(f"Job data keys = {len(jobs_data.keys())}")

    #Our task is to find 20 companies. If we have fewer than 20, we need to go to the next page.
    #An empty page means there are no more results, so we stop there:
    if jobs_per_page and pages > 0 and len(jobs_data.keys()) < 20:
        pages -= 1

        #Changing the start value in the url:
        next_start = start + 25
        url = url.replace("start="+str(start), "start="+str(next_start))
        print('New url with pagination:', url)

        #Fetching data from the new url:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            data = BeautifulSoup(response.text,'html.parser')
            #Passing next_start on, so that the next call replaces the right value in the url:
            extract_job_data(data, keyword, pages, next_start, url, jobs_data)
        else:
            print('Something went wrong while fetching data from: ', url)

    return jobs_data
            
            
                
                

In [64]:
def fetch_linkedin_jobs_by_keyword(keywords,file_prefix):
    jobs_data = {}
    for keyword in keywords:
        #Formatting the url:
        keyword = keyword.strip() #Removing accidental spaces around the keyword
        #Any space in the keyword should be replaced with %20 in the url.
        #The original keyword is kept unchanged, because it is used as a key in the output:
        url_keyword = keyword.replace(" ", "%20")
        start = 0
        url = base_url.format(url_keyword, start)
        print('Fetching data for url: ', url)
        #Fetching data with the formatted url:
        response = requests.get(url, headers = headers)

        #Checking status_code:
        if response.status_code == 200:
            #Parsing response.text into a BeautifulSoup object:
            data = BeautifulSoup(response.text,'html.parser')

            #Finding the number of jobs found:
            total_jobs_found = data.find('span', class_="results-context-header__job-count")
            if total_jobs_found:
                total_jobs_found = int(total_jobs_found.get_text().strip().strip("+").replace(",",""))
            else:
                total_jobs_found = 0
            print('Total number of jobs found: ', total_jobs_found)
            num_of_pages = math.ceil(total_jobs_found/25)

            jobs_data = extract_job_data(data, keyword, num_of_pages, start, url, jobs_data)
            time.sleep(5)

        else:
            print('Something went wrong while fetching data from: ', url)

    #Write output JSON for a country.
    #A raw string keeps the backslashes in the Windows path from being read as escape sequences:
    file_path = r"E:\ML-DS\python-integrify\ML-practice\Homeworks\Homework6_Reethika\linkedIn_data"
    output_path = file_path+"_"+file_prefix+".json"
    with open(output_path, "w") as output_file:
        print('Writing file to this path:\n', output_path)
        json.dump(jobs_data, output_file, sort_keys=True)
            
        
        

### Task 1 - Data Science and Machine Learning keywords:

In [53]:
data_science_keywords = ["Data Science", "Big data", "Machine learning", "Data mining", "Artificial intelligence", 
                         "Predictive modeling", "Statistical analysis", "Data visualization", "Deep learning", 
                         "Natural language processing", "Business intelligence", "Data warehousing", "Data management", 
                         "Data cleaning", "Feature engineering", "Time series analysis", "Text analytics", "Database",
                         "SQL", "NoSQL", "Neural networks", "Regression analysis", "Clustering", "Dimensionality reduction", 
                         "Anomaly detection", "Recommender systems", "Data integration", "Data governance"]

machine_learning_keywords = ["Machine learning", "Data preprocessing", "Feature selection", "Feature engineering", 
                    "Data visualization", "Model selection", "Hyperparameter tuning", "Cross-validation", 
                    "Ensemble methods", "Neural networks", "Deep learning", "Convolutional neural networks", 
                    "Recurrent neural networks", "Natural language processing", "Computer vision", "Reinforcement learning", 
                    "Unsupervised learning", "Clustering", "Dimensionality reduction", "Bayesian methods", "Time series analysis",
                    "Random forest", "Gradient boosting", "Support vector machines", "Decision trees", "Regression analysis"]

#Calling the fetch function we defined:
fetch_linkedin_jobs_by_keyword(set(data_science_keywords + machine_learning_keywords),file_prefix="ML")

Fetching data for url:  https://www.linkedin.com/jobs/search?keywords=Statistical%20analysis&location=Sweden&geoId=105117694&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=0
Total number of jobs found:  270
Number of jobs per page:  23
Add another post of  TalentKompass Deutschland
Add another post of  TalentKompass Deutschland
Add another post of  TalentKompass Deutschland
Add another post of  TalentKompass Deutschland
Date Posted:  None
Add another post of  JobBusters AB
Pages=11
Job data keys = 18
New url with pagination: https://www.linkedin.com/jobs/search?keywords=Statistical%20analysis&location=Sweden&geoId=105117694&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=25
Number of jobs per page:  23
Add another post of  Moleculent
Add another post of  Ark Kapital
Add another post of  TalentKompass Deutschland
Add another post of  TalentKompass Deutschland
Add another post of  TalentKompass Deutschland
Add another post of  Integro Co

Total number of jobs found:  0
Number of jobs per page:  0
Pages=0
Job data keys = 31
Fetching data for url:  https://www.linkedin.com/jobs/search?keywords=Neural%20networks&location=Sweden&geoId=105117694&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=0
Total number of jobs found:  19
Number of jobs per page:  17
Add another post of  emagine
Pages=1
Job data keys = 31
Fetching data for url:  https://www.linkedin.com/jobs/search?keywords=Business%20intelligence&location=Sweden&geoId=105117694&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=0
Total number of jobs found:  1000
Number of jobs per page:  25
Pages=40
Job data keys = 32
Fetching data for url:  https://www.linkedin.com/jobs/search?keywords=Data%20management&location=Sweden&geoId=105117694&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=0
Total number of jobs found:  1000
Number of jobs per page:  25
Add another post of  Volvo Group
Pages=40
Job data k

### Task 2 - Full-stack keywords:

In [63]:
fs_keywords = ["Front-end development", "HTML", "CSS", "JavaScript", "React", "Angular", "Vue.js", "Bootstrap",
               "jQuery", "responsive design", "Back-end development", "Node.js", "Python", "Ruby", "PHP", "Java",
               ".NET", "SQL", "NoSQL", "RESTful APIs", "web servers", "Database management", "MySQL", "PostgreSQL",
               "MongoDB", "Redis", "Cassandra", "Oracle", "SQL Server", "DevOps", "AWS", "Azure", "Google Cloud",
               "Docker", "Kubernetes", "Git", "Jenkins", "Travis CI", "CircleCI", "monitoring and logging tools",
               "Project management", "Agile", "Scrum", "Kanban", "JIRA", "Trello", "Asana", "project planning",
               "team collaboration", "communication skills"]

fetch_linkedin_jobs_by_keyword(set(fs_keywords),file_prefix="fullstack")

Fetching data for url:  https://www.linkedin.com/jobs/search?keywords=web%20servers&location=Sweden&geoId=105117694&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=0
Total number of jobs found:  138
Number of jobs per page:  23
Add another post of  FRG Technology Consulting
Add another post of  B3 Consulting Group
Add another post of  Ampstek
Add another post of  B3 Consulting Group
Add another post of  B3 Consulting Group
Pages=6
Job data keys = 18
New url with pagination: https://www.linkedin.com/jobs/search?keywords=web%20servers&location=Sweden&geoId=105117694&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=25
Number of jobs per page:  23
Add another post of  Lund University
Add another post of  Avalanche Studios Group
Add another post of  Adroit People Limited (UK)
Add another post of  FRG Technology Consulting
Add another post of  Konsu
Add another post of  Framtiden AB
Add another post of  SOKIGO
Add another post of  Tata Consult

Total number of jobs found:  1000
Number of jobs per page:  25
Pages=40
Job data keys = 31
Fetching data for url:  https://www.linkedin.com/jobs/search?keywords=%20Agile&location=Sweden&geoId=105117694&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=0
Total number of jobs found:  5000
Number of jobs per page:  25
Pages=200
Job data keys = 32
Fetching data for url:  https://www.linkedin.com/jobs/search?keywords=Cassandra&location=Sweden&geoId=105117694&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=0
Total number of jobs found:  47
Number of jobs per page:  24
Pages=2
Job data keys = 33
Fetching data for url:  https://www.linkedin.com/jobs/search?keywords=CircleCI&location=Sweden&geoId=105117694&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=0
Total number of jobs found:  6
Number of jobs per page:  5
Add another post of  Ombori
Pages=1
Job data keys = 33
Fetching data for url:  https://www.linkedin.com/jobs/se

Total number of jobs found:  854
Number of jobs per page:  25
Pages=35
Job data keys = 46
Fetching data for url:  https://www.linkedin.com/jobs/search?keywords=Python&location=Sweden&geoId=105117694&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=0
Total number of jobs found:  8000
Number of jobs per page:  25
Add another post of  FRG Technology Consulting
Pages=320
Job data keys = 46
Fetching data for url:  https://www.linkedin.com/jobs/search?keywords=Project%20management&location=Sweden&geoId=105117694&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=0
Total number of jobs found:  2000
Number of jobs per page:  25
Add another post of  TalentKompass Deutschland
Pages=80
Job data keys = 46
Fetching data for url:  https://www.linkedin.com/jobs/search?keywords=Java&location=Sweden&geoId=105117694&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=0
Total number of jobs found:  9000
Number of jobs per page:  25
Add an