# BIA-660: Web Scraping
**Final Project Part 1: Scraper**

**Date:** 

**Team**
- Jarrin Sacayanan
- Sabah Ahmed
- Million Mehari

## Instructions

### Steps
1. Collect at least 5,000 Job Ads for Data Scientists from Indeed.com
2. Collect at least 5,000 Job Ads for Software Engineers from Indeed.com
3. Get the HTML of the job description (as shown on the right side of the screen after you click on an Ad) for each Ad.
4. Extract the text from the HTML and create a CSV with 1 Ad per line and 2 columns: `<text>` and `<job title>`
5. Train a classification model that can predict whether a given Ad is for a Data Scientist or Software Engineer
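
Step 4 can be sketched with the standard library alone. The helper names `html_to_text` and `write_ads_csv` and the sample ad are illustrative assumptions, not part of the assignment, and the notebook below collects the text through Selenium's `.text` property instead. This minimal version also does not filter out `<script>` or `<style>` contents.

```python
import csv
import io
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Keep only non-empty text nodes
        if data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text of an HTML fragment as one space-joined string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.parts)


def write_ads_csv(ads, out_file):
    """Write (html, job_title) pairs as rows with columns text and job title."""
    writer = csv.writer(out_file)
    writer.writerow(['text', 'job title'])
    for html, job_title in ads:
        writer.writerow([html_to_text(html), job_title])


# Hypothetical sample ad, for illustration only
sample_ads = [('<div><p>Build ML models.</p><ul><li>Python</li></ul></div>', 'Data Scientist')]
buffer = io.StringIO()
write_ads_csv(sample_ads, buffer)
```

Writing to an in-memory buffer keeps the sketch side-effect free; passing a file opened with `newline=''` would produce the CSV on disk.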

### Notes
- Your trained model will be evaluated on a separate test set that you will not have access to before the deadline
- The deliverables include:
    - The scraping script(s) in .ipynb format
    - The classification script as a separate .ipynb Notebook
    - Instructions on how to run the 2 Notebooks
    - The CSV from step 4
- Your classification script should be able to read a test CSV that will include 1 job description per line (no labels). It should then produce a new file that includes the predicted label for each line in the test file.
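
The last note describes a file-in, file-out contract. A minimal sketch of that I/O follows, assuming a test file with one description per row and no header. `predict_label` is a hypothetical placeholder (a keyword rule) standing in for the trained model, and `label_test_file` is likewise an illustrative name.

```python
import csv
import io


def predict_label(description):
    """Placeholder for the trained model; a keyword rule for illustration only."""
    if 'machine learning' in description.lower():
        return 'Data Scientist'
    return 'Software Engineer'


def label_test_file(in_file, out_file, predict=predict_label):
    """Read one job description per row and write one predicted label per row."""
    writer = csv.writer(out_file)
    count = 0
    for row in csv.reader(in_file):
        # Treat a blank row as empty text so output rows stay aligned with input rows
        text = row[0] if row else ''
        writer.writerow([predict(text)])
        count += 1
    return count


# Hypothetical two-line test file
test_csv = io.StringIO('"We use machine learning daily."\n"Maintain our Java backend."\n')
predictions = io.StringIO()
n_rows = label_test_file(test_csv, predictions)
```

Passing the predictor as an argument means the same loop can be reused once a real model's prediction function is available.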

# Setup

In [67]:
# Basic imports
import pandas as pd
import numpy as np
import csv, re, time, os
import warnings

# Scraping imports
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service

In [68]:
# Set a warning filter for the driver deprecation warning
warnings.filterwarnings('ignore')

In [69]:
# Set some options for the webdriver
chrome_options = Options() 
chrome_options.add_argument('--disable-notifications')
chrome_options.add_argument('--headless')      

In [70]:
# Create the Chrome webdriver to prepare for scraping
service = Service(ChromeDriverManager().install())



Current google-chrome version is 101.0.4951
Get LATEST chromedriver version for 101.0.4951 google-chrome
Driver [C:\Users\jarri\.wdm\drivers\chromedriver\win32\101.0.4951.41\chromedriver.exe] found in cache


# Build a Scraping Function
This function runs an Indeed search for a given job label and city and collects up to a given number of ads. It returns a Pandas dataframe with the scraped contents, along with the elapsed time in seconds. The dataframe has three columns:
- `Title`
- `Description`
- `Label`
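
Each scraped ad becomes one row of the resulting dataframe, stored under the `Title`, `Description`, and `Label` keys used by `scrape_indeed`. A minimal sketch of how such a row could be assembled is shown here. `build_payload` is a hypothetical helper (the notebook builds the dict inline), and `str.removesuffix` assumes Python 3.9 or later.

```python
def build_payload(raw_title, description, label):
    """Return one result row as a dict keyed by Title, Description and Label."""
    # Drop the trailing "- job post" line that follows the title text
    title = raw_title.removesuffix('\n- job post') if raw_title is not None else None
    return {'Title': title, 'Description': description, 'Label': label}


# Hypothetical values, for illustration only
row = build_payload('Data Scientist II\n- job post', 'Build and ship models.', 'Data Scientist')
```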

In [71]:
def scrape_indeed(driver, label, city, lim, scrape_log):
    # Start a timer
    start_time = time.time()
    
    # Init storage for scraping results
    new_df = pd.DataFrame()
    
    # Init storage for job IDs
    seen_ids = []
    
    # Init a timeout counter
    timeout = 0
    
    # Init counters to test for infinite page looping
    page_loop_timeout = 0
    previous_job_count = 0
    
    # Init a bool to break the infinite loop later
    cont = True
    
    while timeout <= 3:
        try:
            # Go to the Indeed home page
            url = 'https://www.indeed.com/'
            driver.get(url)

            time.sleep(2)

            # Fill in the search fields
            job_field = driver.find_element(By.CSS_SELECTOR, '[id="text-input-what"]')
            job_field.send_keys(label)

            where_field = driver.find_element(By.CSS_SELECTOR, '[id="text-input-where"]')
            where_field.send_keys(Keys.CONTROL, 'a', Keys.DELETE)
            where_field.send_keys(city)

            # Perform the search
            print('Searching now...')
            scrape_log.write('Searching now...\n')
            time.sleep(2)
            where_field.send_keys(Keys.ENTER)

            # Wait before proceeding with page scrape
            time.sleep(5)

        except Exception as e:
            print(f'Search Error...{e}')
            scrape_log.write(f'Search Error...{e}\n')

        try:
            # Collect from indefinite pages
            while cont:
                # Collect all jobs on screen
                jobs = driver.find_elements(By.CSS_SELECTOR, '[class="jcs-JobTitle"]')
                    
                if len(jobs) == 0:
                    print('No job listings collected...')
                    scrape_log.write('No job listings collected...\n')
                else:
                    print(f'Number of jobs on page: {len(jobs)}')
                    scrape_log.write(f'Number of jobs on page: {len(jobs)}\n')

                # Loop through the job cards
                for job in jobs:
                    if len(new_df) < lim:
                        try:
                            # Make sure the job hasn't been collected already
                            if job.get_attribute('id') not in seen_ids:
                                seen_ids.append(job.get_attribute('id'))
                            else:
                                print(f'Job {job.get_attribute("id")} already seen...')
                                break
                                
                        except Exception as e:
                            print(f'Something went wrong...{e}')
                            scrape_log.write(f'Something went wrong...{e}\n')
                          
                        # Debug log
                        print(f'Getting Job: {len(new_df) + 1}')
                        scrape_log.write(f'Getting Job: {len(new_df) + 1}\n')

                        # Make it wait a second to prevent bot blocks
                        time.sleep(1)

                        # Init a payload for the df
                        payload = {}

                        # Click on the job to open the job details on the right of the page
                        try:
                            job.click()
                            time.sleep(1)
                        except Exception as e:
                            print(f"Can't click...{job.text}")
                            scrape_log.write(f"Can't click...{job.text}\n")
                            break

                        # Switch to the iframe context to collect the description
                        description_frame = driver.find_element(By.CSS_SELECTOR, '[id="vjs-container-iframe"]')
                        driver.switch_to.frame(description_frame)

                        # Try and store the job title 
                        try:
                            job_title = driver.find_element(By.CSS_SELECTOR, '[class*="jobsearch-JobInfoHeader-title"]')
                            payload['Title'] = job_title.text.removesuffix('\n- job post')
                        except Exception:
                            payload['Title'] = None

                        # Try and store the description
                        try:
                            job_description = driver.find_element(By.CSS_SELECTOR, '[id="jobDescriptionText"]')
                            payload['Description'] = job_description.text
                        except Exception:
                            payload['Description'] = None

                        # Store the label
                        payload['Label'] = label

                        # Append the payload to the df
                        new_df = pd.concat([new_df, pd.DataFrame([payload])], ignore_index=True)

                        # Switch the context back to the list of jobs
                        driver.switch_to.parent_frame()
                    else:
                        # Calculate end time
                        end_time = time.time()
                        delta = end_time - start_time

                        # Close the webpage
                        driver.close()

                        # Limit reached, so return the dataframe and the elapsed time
                        return new_df, delta

                # Scroll to the bottom of the page where the next button is at
                driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')

                # Look for the next page button
                try:
                    # Find the next button
                    next_button = WebDriverWait(driver,15).until(EC.presence_of_element_located((By.CSS_SELECTOR,'[aria-label="Next"]')))
                    print('Next button found')
                    scrape_log.write('Next button found\n')
                    
                    # Compare the result count with the count seen on the previous page
                    if len(new_df) != previous_job_count:
                        # New jobs were collected, so reset the page loop counter
                        previous_job_count = len(new_df)
                        page_loop_timeout = 0

                    elif page_loop_timeout < 3:
                        # Add to the page loop counter
                        page_loop_timeout += 1

                    else:
                        # Break the loop because it is stuck
                        print('Pages looping without grabbing jobs. Breaking loop.')
                        scrape_log.write('Pages looping without grabbing jobs. Breaking loop.\n')
                        
                        # Calculate end time
                        end_time = time.time()
                        delta = end_time - start_time

                        # Close the webpage
                        driver.close()

                        # Return whatever was collected, even though the limit wasn't reached
                        return new_df, delta
                    
                    # Click the button
                    next_button.click()
                    time.sleep(3)
                        
                except Exception:
                    next_button = None
                    print('Next button not found')
                    scrape_log.write('Next button not found\n')

                # Check for the final page
                if next_button is None and timeout >= 2:
                    print('Last page reached.')
                    scrape_log.write('Last page reached.\n')
                    
                    # Calculate end time
                    end_time = time.time()
                    delta = end_time - start_time

                    # Close the webpage
                    driver.close()

                    # Return whatever was collected, even though the limit wasn't reached
                    return new_df, delta
                
                elif next_button is None and timeout < 2:
                    print(f'Timeout: {timeout} - Retrying...')
                    timeout += 1
                    break

        except Exception as e:
            print(f'Job scrape error...{e}')
            scrape_log.write(f'Job scrape error...{e}\n')
            if timeout < 3:
                print(f'Timeout: {timeout}')
                scrape_log.write(f'Timeout: {timeout}\n')
                timeout += 1
            else:
                print('Timed out...moving on...')
                scrape_log.write('Timed out...moving on...\n')
                
                # Calculate end time
                end_time = time.time()
                delta = end_time - start_time

                # Close the webpage
                driver.close()

                # Return whatever was collected, even though the limit wasn't reached
                return new_df, delta
            
    # Calculate end time
    end_time = time.time()
    delta = end_time - start_time
    
    # Close the webpage
    driver.close()
    
    # Return whatever was collected, even though the limit wasn't reached
    return new_df, delta
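
The page-loop guard inside `scrape_indeed` can be read in isolation: stop once several consecutive pages add no new jobs. The sketch below is a simplified restatement of that idea rather than the notebook's code. `StallGuard` and its parameter names are hypothetical.

```python
class StallGuard:
    """Signal a stop after too many consecutive pages that add no new results."""

    def __init__(self, max_stalls=3):
        self.max_stalls = max_stalls
        self.stalls = 0
        self.previous_count = 0

    def should_stop(self, current_count):
        """Record the result count after a page and report whether to give up."""
        if current_count != self.previous_count:
            # Progress was made, so remember the count and reset the counter
            self.previous_count = current_count
            self.stalls = 0
            return False
        if self.stalls < self.max_stalls:
            # No progress on this page, but still within the allowed number of stalls
            self.stalls += 1
            return False
        return True


# Hypothetical result counts after each page: two productive pages, then a loop
guard = StallGuard(max_stalls=3)
decisions = [guard.should_stop(count) for count in [15, 30, 30, 30, 30, 30]]
```

Keeping the counter in a small object makes the stopping rule easy to test without launching a browser.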

In [72]:
# Set up a function to perform a search and log it
def setup_scrape(job_title, scrape_count):
    # Init storage
    scrape_df = pd.DataFrame()

    # Get list of 100 biggest cities in US
    cities = pd.read_csv('cities.txt', header=None)
    cities.columns = ['City', 'State']
    city_list = [f'{city},{state}' for city, state in zip(cities['City'], cities['State'])]

    # Get the current time for the log filename
    current_time = time.localtime()
    current_time = time.strftime('%H-%M-%S', current_time)

    # Build a file name
    log_file_name = f'{current_time}_{job_title}_scrape_log.txt'

    # Create a scraper log to track what's going on
    with open(log_file_name, 'w') as scrape_log:
        # Iterate through the cities and run the scraper for each one
        for city in city_list:
            # Add a logging line
            print(f'\nCurrent Result Count: {len(scrape_df)}\nStarting scrape {city}')
            scrape_log.write(f'\nCurrent Result Count: {len(scrape_df)}\nStarting scrape {city}\n')

            # Create the Chrome webdriver to prepare for scraping
            driver = webdriver.Chrome(service=service, options=chrome_options)

            time.sleep(2)

            # Call the function
            results, timer = scrape_indeed(driver, job_title, city, scrape_count, scrape_log)

            # Store the results
            scrape_df = pd.concat([scrape_df, results], ignore_index=True)

    # The with block closes the scrape log automatically
    print(f'Log File: {log_file_name}')
    
    # Return the results
    return scrape_df

In [73]:
# Make a function to name the output file
def get_file_name(job_acr):
    if job_acr not in ['ds', 'se']:
        print('File name not defined. Use either "ds" or "se".')
        return None

    # Create the output directory if it does not exist yet
    os.makedirs('scrapes', exist_ok=True)

    # Collect the numeric suffixes of existing scrapes for this job type
    prefix = f'{job_acr}_scrape_'
    file_inds = []
    for file in os.listdir('scrapes'):
        if file.startswith(prefix) and file.endswith('.csv'):
            ind = file[len(prefix):-len('.csv')]
            if ind.isdigit():
                file_inds.append(int(ind))

    # Use the next free index, starting at 1 when no scrapes exist yet
    next_ind = max(file_inds, default=0) + 1
    return f'scrapes/{job_acr}_scrape_{next_ind}.csv'

# Performing Scrapes
**Instructions**

The function must be called using the following format:

`<result dataframe> = setup_scrape(<job_title>, <job listings per city>)`

The scraper is designed to perform 100 separate scrapes, one for each of the 100 most populated cities in the United States, and to append the results into a single dataframe. Each row's `Label` column holds the `<job_title>` string passed to the function.
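
The per-city search strings come from `cities.txt`, which `setup_scrape` reads as two comma-separated columns. A minimal sketch of that step follows, using the standard `csv` module instead of pandas. The sample rows and the helper name `build_city_list` are assumptions for illustration, and unlike the notebook this version trims whitespace around each field.

```python
import csv
import io


def build_city_list(lines):
    """Turn "City,State" rows into the search strings passed to the scraper."""
    city_list = []
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) != 2:
            # Ignore blank or malformed rows
            continue
        city, state = (field.strip() for field in row)
        city_list.append(f'{city}, {state}')
    return city_list


# Hypothetical excerpt of cities.txt
sample = io.StringIO('New York City, New York\nLos Angeles, California\n\n')
cities = build_city_list(sample)
```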

We will be running the scraper to collect dataframes for:
- Data Scientist
- Software Engineer

The resulting dataframes will be exported as CSV files for use in the second script, `Final Project Classification.ipynb`.

## Scrape Data Scientist Jobs

In [74]:
# Collect Data Scientist jobs
ds_df = setup_scrape('Data Scientist', 100)


Current Result Count: 0
Starting scrape New York City, New York
Searching now...
No job listings collected...
Next button found
No job listings collected...
Next button found
Next button not found
Timeout: 0 - Retrying...
Searching now...
No job listings collected...
Next button found
No job listings collected...
Next button found
Pages looping without grabbing jobs. Breaking loop.

Current Result Count: 0
Starting scrape Los Angeles, California
Searching now...
No job listings collected...
Next button found
No job listings collected...
Next button found
Next button not found
Timeout: 0 - Retrying...


KeyboardInterrupt: 

In [None]:
ds_df.shape

In [None]:
ds_df.head()

In [None]:
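# Placeholder dataframe for testing get_file_name (this overwrites ds_df, so skip this cell after a real scrape)
ds_df = pd.DataFrame()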
ds_df['Test'] = [1, 2, 3]
ds_df['Test 2'] = [4, 5, 6]

In [None]:
ds_df.to_csv(get_file_name('ds'), index=False)

## Scrape Software Engineer Jobs

In [None]:
# Collect software engineer jobs
se_df = setup_scrape('Software Engineer', 100)

In [None]:
se_df.shape

In [None]:
se_df.head()

In [None]:
se_df.to_csv(get_file_name('se'), index=False)