# Software Testing, Automation and the Job Market - a Data Programming Project

### Research Questions, Project Limitations and Motivation 

#### Research Area
The goal of this project is to analyse a dataset from a large job board with regard to software testing roles. The analysis focuses on which skills are the most sought after, in order to shed some light on the ongoing debate about whether automation testing is going to replace manual testing (e.g. __[Can Manual Testing Be Completely Replaced by Automation Testing?](https://blog.qasource.com/resources/can-manual-testing-be-completely-replaced-by-automation-testing)__).

#### Research Questions
More specifically, this project aims to answer the following questions:

* What are the most popular skills required for prospective software testers?
* Which programming languages are the most popular for test automation?
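
As a rough illustration of how both questions could be approached, the sketch below counts how many adverts mention each keyword at least once. The keyword list and the advert snippets are made up for this example and are not taken from the dataset; the function name count_skill_mentions is also hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List

# Hypothetical keyword list, chosen only for illustration
SKILL_KEYWORDS: List[str] = ["python", "java", "selenium", "manual testing"]


def count_skill_mentions(adverts: Iterable[str], keywords: List[str]) -> Counter:
    """Count how many adverts mention each keyword at least once."""
    counts: Counter = Counter()
    for advert in adverts:
        text = advert.lower()
        for keyword in keywords:
            # Word boundaries stop "java" from matching inside "javascript"
            if re.search(r"\b" + re.escape(keyword) + r"\b", text):
                counts[keyword] += 1
    return counts


# Made-up advert snippets, not real data
sample_adverts = [
    "QA Engineer with Python and Selenium experience",
    "Manual testing role; JavaScript knowledge is a plus",
    "Automation tester: Java, Selenium",
]

skill_counts = count_skill_mentions(sample_adverts, SKILL_KEYWORDS)
print(skill_counts.most_common())
```

Counting adverts rather than raw occurrences keeps one advert that repeats a keyword many times from skewing the ranking.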

#### Future Work / Project Limitations 

Some questions that are interesting but out of scope for this project are listed below:

* How many job adverts mention both manual testing and test automation in their description?
* How many job descriptions mention training / upskilling?
* How many jobs focus exclusively on either manual or automation testing?

These questions are out of scope because the dataset used in this project is limited in terms of geographical region. Additionally, the questions listed above would require a much larger dataset to support a sensible analysis. Indeed, even with a larger dataset, the work would still be limited to one language, which excludes analysis of countries such as China or South Korea.

Of course, one can limit the project to a specific language or region, but this kind of project still requires a large dataset that takes into consideration where the jobs are located (rural areas may have less to offer than urban ones), what time period they were posted in, what seniority level they address and which area of testing they focus on (e.g. performance testing). The deeper one dives into the topic, the more likely one is to find irregularities and differing demands, which would make it difficult to generalise, at least without a thorough analysis using metrics collected over a specific length of time.

The time required to answer such questions also matters: technology is in a constant state of flux, so job boards would need to be analysed throughout a given period, with a focus on the areas where change is happening the most and on how quickly particular trends are growing.

#### Motivation
The motivation behind this specific area of research stems from my job as a QA, where I have worked with both automation and manual testing tools. While I personally believe that a good software tester has skills in __[both white box and black box testing](https://www.geeksforgeeks.org/differences-between-black-box-testing-vs-white-box-testing/)__, I am keen to find out which approach the job market is leaning towards. Data from job boards serves as a good basis for exploring this topic on a practical level.

Moreover, I used to work for a software house that focused on gathering data for recruitment companies. During that time, I occasionally searched job boards to gauge the expectations and trends for software testers. In fact, some of the observations about the limits of this project come from that direct experience of gathering and analysing such data.

#### Previous Exploration of the Topic 
While __[many articles](https://www.infoq.com/test-automation/articles/)__ have been and continue to be published on the subject, I have not seen any students discussing this topic in our university Slack channel or expressing much interest in software testing otherwise. Of interest, however, are the __[annual surveys carried out by PractiTest](https://www.practitest.com/qa-learningcenter/webinars/learn-from-2021-state-of-testing/)__, which analyse software testing trends and provide information on the current state of the field. They showcase the versatility and importance of testing, especially with the rise of AI and larger teams recognising the constant need for quality assurance.

Outside of the internet, there are __[also many workshops and courses available on automation testing](https://www.theknowledgeacademy.com/courses/automation-and-penetration-testing/fundamentals-of-test-automation-/bristol/)__ whose main focus is the upskilling of manual testers, which suggests how important automation is as a skill set for budding software testers.


#### Acquisition of the Data Set
The dataset comes from job boards that have public APIs available for collecting job data. In particular, the site __[The Programmable Web](https://www.programmableweb.com/news/top-10-jobs-apis-2021/brief/2021/06/30)__ proved to be a valuable source for finding APIs that were free and did not require any prior setup.

#### Choice of Data Source 

#### Ethical Implications 

### Initial Scraping of Data 

Before setting out to do a more thorough scrape of the data, I wanted to experiment with the basics, which is why I wrote two scripts that deal with web crawling: scraperHelpers.py and main.py. The former contains functions that help to scrape data from a given URL, while the latter calls these functions to produce a txt file with the job titles listed on the first page.

```python
from bs4 import BeautifulSoup
import requests
from typing import List

```

The script uses BeautifulSoup to parse the scraped data into a navigable HTML object, which can then be further analysed and processed. The requests library is used to get data from a given URL. Both libraries were chosen because of their ease of implementation and use.

```python

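def retrieve_data_as_text(url: str) -> str:
    """
    Retrieves data from a url
    :param url: the url in string format
    :return: data in form of a text
    """
    # A timeout stops the call from hanging indefinitely on an unresponsive server
    response = requests.get(url, timeout=10)
    # Fail loudly on 4xx/5xx responses rather than parsing an error page
    response.raise_for_status()
    return response.text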

```
The method above returns the response body as plain text, which can often be more readable than parsed HTML. This is useful for testing, where print() can be called to verify that the correct data was retrieved. The method is also reused in the parse_data_into_html method, which is called in the main.py script.

```python

def parse_data_into_html(url: str) -> BeautifulSoup:
    """
    Retrieves data in html format
    :param url: the url in string format
    :return: data parsed into HTML
    """
    data = retrieve_data_as_text(url)
    soup = BeautifulSoup(data, 'html.parser')
    return soup
```

The method above retrieves the content of the URL as text and then parses it into a BeautifulSoup object using the built-in HTML parser. This object serves as the basis for all further processing and analysis.
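
Since the parsing step depends on a live page, it can be handy to check it offline. The sketch below is one possible way to do that, assuming unittest.mock from the standard library is used to stand in for requests.get; the page content and the example.invalid URL are made up, and the two helpers are repeated in shortened form so that the block runs on its own.

```python
from unittest.mock import Mock, patch

import requests
from bs4 import BeautifulSoup


def retrieve_data_as_text(url: str) -> str:
    """Shortened copy of the helper above."""
    return requests.get(url).text


def parse_data_into_html(url: str) -> BeautifulSoup:
    """Shortened copy of the helper above."""
    return BeautifulSoup(retrieve_data_as_text(url), "html.parser")


# Made-up page content standing in for a real job board response
FAKE_PAGE = "<html><body><h2 class='jobTitle'>QA Engineer</h2></body></html>"

# Replace requests.get with a mock whose return value has a .text attribute,
# so no network request is made
with patch("requests.get", return_value=Mock(text=FAKE_PAGE)) as fake_get:
    soup = parse_data_into_html("https://example.invalid/jobs")

first_title = soup.find("h2", class_="jobTitle").get_text()
print(first_title)
```

Checking the helper against a fixed string in this way keeps the test independent of the job board's availability and of any changes to its markup.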

```python

def find_jobs_by_header_title(scraped_data: BeautifulSoup) -> List[List[str]]:
    """
    Finds jobs by header title
    :param scraped_data: the parsed HTML to search
    :return: a list with one entry per job title; each entry is the title text split on newlines
    """
    jobs = []
    # code credit for text splitting:
    # @ https://www.geeksforgeeks.org/scraping-indeed-job-data-using-python
    for item in scraped_data.find_all("h2", class_="jobTitle"):
        jobs.append(item.get_text().split("\n"))
    return jobs
    
```
The method above finds all elements that fall under the category of job title and appends their text to a list of job titles, which is then reused in the code below to create a text file with all the roles found on one page.
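
To make the shape of the returned value concrete, here is a small sketch that runs the same logic on an inline HTML snippet. The markup is made up for illustration and the real page structure may differ; the helper is repeated so that the block is self-contained.

```python
from typing import List

from bs4 import BeautifulSoup


def find_jobs_by_header_title(scraped_data: BeautifulSoup) -> List[List[str]]:
    """Same logic as the helper above: split each matching heading's text on newlines."""
    jobs = []
    for item in scraped_data.find_all("h2", class_="jobTitle"):
        jobs.append(item.get_text().split("\n"))
    return jobs


# Made-up markup for illustration only
SAMPLE_HTML = """
<h2 class="jobTitle">QA Engineer</h2>
<h2 class="jobTitle"><span>new</span>
<span>Test Analyst</span></h2>
<h2 class="other">Not a job title</h2>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_jobs = find_jobs_by_header_title(sample_soup)
print(sample_jobs)
```

Each entry is itself a list, because split returns a list, and headings with a different class are ignored.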

```python

def save_jobs_as_txt(jobs: List) -> None:
    """
    Saves job titles into a text file
    :param jobs: a list of jobs
    :return: None; the titles are written to jobtitles.txt
    """
    with open('jobtitles.txt', 'w') as f:
        f.write("\n".join(str(job) for job in jobs))

```

The code above is called from a script called main.py, which processes the scraped data into a text file containing all the available job titles. Separating the web scraping logic from the main file makes the code reusable.

```python

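#!/usr/bin/python
# Explicit imports make it clear which helpers the script relies on
from scraperHelpers import parse_data_into_html, find_jobs_by_header_title, save_jobs_as_txt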

URL = 'https://uk.indeed.com/Remote-QA-jobs'

scraped_data = parse_data_into_html(URL)
jobs = find_jobs_by_header_title(scraped_data)
save_jobs_as_txt(jobs)

```
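
One detail of this pipeline is worth noting: find_jobs_by_header_title appends the result of split, so each entry in jobs is a list, and str(job) writes that list's representation (for example ['QA Engineer']) rather than the bare title. If one title per line is preferred, a flattening step along the following lines could be used. This is only a sketch: save_flat_titles is a hypothetical variant and not part of the original scripts, and the input is made up.

```python
import tempfile
from itertools import chain
from pathlib import Path
from typing import List


def save_flat_titles(jobs: List[List[str]], path: Path) -> None:
    """Hypothetical variant of save_jobs_as_txt: one non-empty title part per line."""
    # Flatten the nested lists and drop parts that are empty after stripping
    titles = [part.strip() for part in chain.from_iterable(jobs) if part.strip()]
    path.write_text("\n".join(titles), encoding="utf-8")


# Made-up input in the shape that find_jobs_by_header_title returns
nested_jobs = [["new", "QA Engineer"], ["Test Analyst"]]

# Write to a temporary directory so the example leaves no file behind in the project
output_path = Path(tempfile.mkdtemp()) / "jobtitles.txt"
save_flat_titles(nested_jobs, output_path)
written = output_path.read_text(encoding="utf-8")
print(written)
```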