In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

## Setup libraries

Turn on the Internet connection in the settings section of this notebook so that the libraries can be downloaded.

In [None]:
!pip install lajobsparser
!pip install pdfminer-six

In [None]:
import pathlib
import pandas as pd
import seaborn as sns
import itertools

%matplotlib inline
sns.set_style('whitegrid')

In [None]:
%load_ext autoreload
%autoreload 2

Import lajobsparser, a custom Python library published on PyPI, to process the job bulletins

In [None]:
import lajobsparser as ljp

## Get jobs data

In [None]:
SCRIPT_DIR = pathlib.Path().resolve()
DATA_ROOT = SCRIPT_DIR / '..' / 'input' / 'cityofla' / \
    'CityofLA' / 'Job Bulletins'
paths = list(DATA_ROOT.glob('*'))

In [None]:
print('There are {} job bulletins'.format(len(paths)))

Get headers for job bulletins

In [None]:
headers = ljp.get_bulletin_headers(paths)
selected_headers = set(header for header in headers
                       if not header.startswith('REVISED'))

Display some of the header words in the job bulletins that show spelling errors

In [None]:
# flatten the words of every header into a single iterable
header_words = itertools.chain.from_iterable(
    header.split() for header in selected_headers)
unique_header_words = sorted(set(header_words))
sample_header_words = [
    word for word in unique_header_words if word.startswith('QUAL')
]
print(sample_header_words)

## Recommendation 1: Use a spell checker

Use a spell checker for the job bulletins. This will look more professional and
will also help if any machine learning algorithms are used on the job
descriptions.
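
As a rough illustration of how such a check could work, here is a minimal sketch that uses only the standard library's `difflib` module. The vocabulary and the sample words are hypothetical and chosen for illustration; a real check would use a full dictionary and the actual header words.

```python
import difflib

# Hypothetical vocabulary of correctly spelled header words (an assumption,
# not taken from the bulletins).
KNOWN_WORDS = [
    'QUALIFICATION', 'QUALIFICATIONS', 'REQUIREMENT', 'REQUIREMENTS'
]


def suggest_corrections(words, vocabulary, cutoff=0.8):
    """Map each word that is not in the vocabulary to its closest known word."""
    suggestions = {}
    for word in words:
        if word in vocabulary:
            continue
        matches = difflib.get_close_matches(
            word, vocabulary, n=1, cutoff=cutoff)
        # None means no vocabulary word is similar enough
        suggestions[word] = matches[0] if matches else None
    return suggestions


# Hypothetical misspellings for illustration
sample_words = ['QUALIFICATIONS', 'QUALIFCATIONS', 'QUALIFICATON']
corrections = suggest_corrections(sample_words, KNOWN_WORDS)
print(corrections)
```

Words that are already in the vocabulary are skipped, so only likely misspellings appear in the result.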

Get selected headers

In [None]:
# get headers that do not start with revised
selected_headers = set(header for header in headers
                       if not header.startswith('REVISED'))

Display some of the headers showing different wording

In [None]:
requirement_headers = sorted({
    header for header in selected_headers if header.startswith('REQUIREM')
})
print('\n'.join(requirement_headers))

## Recommendation 2: Standardize headers in job bulletins

Use the same headers in all job bulletins. This will make it easier to work
with the bulletins using automated tools.
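
One possible way to map header variants to a single standard form is sketched below. The canonical headers, the prefixes, and the sample variants are assumptions made for illustration, not a list taken from the bulletins.

```python
import re

# Hypothetical canonical headers keyed by a distinguishing prefix
# (an assumption for illustration).
CANONICAL_HEADERS = {
    'REQUIREM': 'REQUIREMENTS',
    'DUTIES': 'DUTIES',
    'SELECTION': 'SELECTION PROCESS',
}


def standardize_header(header):
    """Return the canonical header for a variant, or the cleaned header."""
    # collapse whitespace, drop surrounding spaces and colons, upper-case
    cleaned = re.sub(r'\s+', ' ', header).strip(' :').upper()
    for prefix, canonical in CANONICAL_HEADERS.items():
        if cleaned.startswith(prefix):
            return canonical
    return cleaned


# Hypothetical header variants for illustration
variants = [
    'REQUIREMENT', 'REQUIREMENTS/MINIMUM QUALIFICATIONS', 'Duties:', 'NOTES'
]
standardized = [standardize_header(variant) for variant in variants]
print(standardized)
```

Headers that match no known prefix are returned in cleaned form, so they can be reviewed by hand.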

Get job bulletin contents as a data frame

In [None]:
bulletin_df = ljp.get_job_bulletins(paths)

Display the first few rows of the data

In [None]:
bulletin_df.head()

Get the job titles data

In [None]:
ADDITIONAL_DATA_PATH = SCRIPT_DIR / '..' / 'input' / 'cityofla' / \
    'CityofLA' / 'Additional data'
JOB_TITLES_PATH = ADDITIONAL_DATA_PATH / 'job_titles.csv'
title_df = pd.read_csv(JOB_TITLES_PATH, header=None, names=['job_class_title'])

Combine job titles from bulletins and additional data

In [None]:
bulletin_job = set(bulletin_df.job_class_title.str.lower())
title_job = set(title_df.job_class_title.str.lower())

jobs_in_either = bulletin_job | title_job

jobs_in_both = bulletin_job & title_job

title_list_only = title_job - bulletin_job

bulletin_list_only = bulletin_job - title_job

Display the number of job titles found in the bulletins, in the job title list, or in both

In [None]:
title_compare_df = pd.DataFrame(
    {
        'jobs_in_either': len(jobs_in_either),
        'jobs_in_both': len(jobs_in_both),
        'jobs_in_title_list_only': len(title_list_only),
        'jobs_in_bulletin_list_only': len(bulletin_list_only)
    },
    index=[0])
title_compare_df

## Recommendation 3: Standardize the job titles

Most of the job titles used in the job bulletins match the titles in the job
title list, but a few do not. Using standard job titles will make it easier to
make recommendations.
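
As a sketch of how the titles that do not match could be reconciled, the example below pairs each unmatched title with its closest standard title using `difflib`. The titles shown are hypothetical; in this notebook the inputs would be sets such as `bulletin_list_only` and `title_job`.

```python
import difflib


def match_titles(unmatched, standard_titles, cutoff=0.85):
    """Pair each unmatched title with its closest standard title, if any."""
    standard = sorted(standard_titles)
    pairs = {}
    for title in sorted(unmatched):
        close = difflib.get_close_matches(
            title, standard, n=1, cutoff=cutoff)
        # None means no standard title is similar enough
        pairs[title] = close[0] if close else None
    return pairs


# Hypothetical titles for illustration
unmatched_titles = {'animal care technican', 'zookeeper'}
standard_titles = {'animal care technician', 'accountant'}
pairs = match_titles(unmatched_titles, standard_titles)
print(pairs)
```

Titles paired with `None` have no near match and would need manual review.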

## Recommendation 4: Improve the readability of the job description

There are other Kaggle kernels that explain this.

https://www.kaggle.com/silverfoxdss/city-of-la-readability-and-promotion-nudges

## Recommendation 5: Improve the language of the job bulletins using proselint

Proselint is an automated tool that helps improve the use of language.
According to its description, it checks for

    redundancy, jargon, illogic, clichés, sexism, misspelling, inconsistency,
    misuse of symbols, malapropisms, oxymorons, security gaffes, hedging,
    apologizing, pretension, and more.

https://github.com/amperser/proselint

When run on text, it claims to detect the following:

* sexism.misc - Avoiding sexist language
* lgbtq.offensive_terms - Avoiding offensive LGBTQ terms
* lgbtq.terms - Misused LGBTQ terms

However, when run on the job bulletins, it does not detect any offensive
language. Judging by its source code, it looks only for explicitly offensive
terms.

It does, however, detect jargon and corporate-speak.

## Recommendation 6: Send LA county employees job paths annually

The entry-level employees of LA are likely to be more diverse than those at
senior levels. Keep track of employees' job anniversaries. Because most jobs
require a whole number of years of experience, send each employee the relevant
job path document for their job title on their anniversary, and let them know
that they may be eligible to apply for a higher position.
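
The anniversary check could be sketched as follows. The function names and dates are hypothetical, and treating a February 29 hire date as February 28 in non-leap years is an assumption made for this example.

```python
import calendar
from datetime import date


def anniversary_in_year(hire_date, year):
    """Return the anniversary date in a year (Feb 29 falls back to Feb 28)."""
    if (hire_date.month, hire_date.day) == (2, 29) \
            and not calendar.isleap(year):
        return date(year, 2, 28)
    return hire_date.replace(year=year)


def completed_years_if_anniversary(hire_date, today):
    """Return whole years of service if today is an anniversary, else None."""
    if today.year > hire_date.year \
            and anniversary_in_year(hire_date, today.year) == today:
        return today.year - hire_date.year
    return None


# Hypothetical dates for illustration
years = completed_years_if_anniversary(date(2016, 6, 1), date(2019, 6, 1))
print('Completed years of service: {}'.format(years))
```

The returned number of years can then be compared with the experience requirement of the next job on the employee's job path.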

## Recommendation 7: Reduce unconscious bias about candidate names

One of the best-known and most-cited American Economic Review papers found that
resumes with typically black names received fewer callbacks than those with
typically white names. The paper was titled:

    Are Emily and Greg More Employable than Lakisha and Jamal?

To reduce unconscious bias, reviewers in LA county should evaluate resumes with
the candidates' names obscured.

https://www.nber.org/papers/w9873
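
A minimal sketch of obscuring a candidate's name in resume text is shown below. It assumes the name is already known from the application form. The function name, the placeholder, and the sample text are hypothetical.

```python
import re


def redact_name(text, candidate_name, placeholder='[CANDIDATE]'):
    """Replace the candidate's full name and each name part."""
    parts = candidate_name.split()
    if not parts:
        return text
    # try the full name first so it collapses to a single placeholder
    patterns = [r'\s+'.join(re.escape(part) for part in parts)]
    patterns += [re.escape(part) for part in parts]
    for pattern in patterns:
        text = re.sub(r'\b{}\b'.format(pattern), placeholder, text,
                      flags=re.IGNORECASE)
    return text


# Hypothetical resume text for illustration
resume = 'Jane Doe has five years of experience. Jane led a team of four.'
redacted = redact_name(resume, 'Jane Doe')
print(redacted)
```

This only handles exact name matches. Nicknames, initials, and other identifying details would need additional handling.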

## Recommendation 8: Gather data to help use machine learning in the future

The most successful machine learning examples use supervised learning.

https://www.youtube.com/watch?v=21EiKfQYZXc

Supervised learning needs inputs and outcomes. For LA job descriptions, the
inputs would be job bulletins that appeal to diverse candidates and job
bulletins that do not. The outcomes would be the number and diversity of the
candidates who apply for the jobs.

As of 2019, there does not appear to be a data set of labelled job descriptions
that can be used to train a machine learning algorithm. LA county can help make
such a data set available by doing the following:

1. Get experts to modify job descriptions to appeal to diverse candidates
2. Save the older job description
3. Keep statistics on the candidates that saw each kind of job bulletin and
   those that applied
4. Make the data set available to researchers
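
The statistics in step 3 could be stored in a simple record, as in the sketch below. The field names and the numbers are made up for illustration and are not real outcomes.

```python
from dataclasses import dataclass


@dataclass
class BulletinOutcome:
    """Outcome statistics for one version of a job bulletin."""
    bulletin_id: str
    version: str  # 'original' or 'revised'
    views: int  # candidates who saw this version
    applications: int  # candidates who applied after seeing it

    @property
    def application_rate(self) -> float:
        """Return the fraction of viewers who applied (0.0 if no views)."""
        return self.applications / self.views if self.views else 0.0


# Made-up numbers for illustration only
original = BulletinOutcome('bulletin-001', 'original', 200, 10)
revised = BulletinOutcome('bulletin-001', 'revised', 200, 16)
for outcome in (original, revised):
    print('{}: {:.1%}'.format(outcome.version, outcome.application_rate))
```

Keeping both versions of a bulletin under the same identifier makes it possible to compare outcomes before and after the experts' changes.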

## Recommendation 9: Create a code library to process job descriptions

To keep the code that processes job descriptions clean, it is best to write it
as a separate Python library.

The PyPI repository of Python libraries hosts the [lajobsparser][lajobsparser] library, created
specifically to work with LA county job descriptions.

[lajobsparser]: https://pypi.org/project/lajobsparser

Additionally, the code is open source and hosted on [GitHub][github_code].

[github_code]: https://github.com/gavinln/lajobsparser

It uses the [flake8][flake8] tool to validate that the code follows the [PEP 8][pep8] coding
conventions.

[flake8]: https://pypi.org/project/flake8/

[pep8]: https://www.python.org/dev/peps/pep-0008/

It also uses the [mypy][mypy] tool for static type checking.

[mypy]: http://mypy-lang.org/
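
To illustrate what static type checking works with, here is a sketch of a type-annotated function. The function is hypothetical and is not part of lajobsparser; it mirrors the header selection done earlier in this notebook.

```python
from typing import Iterable, List


def select_headers(headers: Iterable[str],
                   excluded_prefix: str = 'REVISED') -> List[str]:
    """Return the sorted unique headers that do not start with a prefix."""
    return sorted({
        header for header in headers
        if not header.startswith(excluded_prefix)
    })


# Hypothetical headers for illustration
selected = select_headers(
    ['REVISED 2018', 'DUTIES', 'DUTIES', 'ANNUAL SALARY'])
print(selected)
```

With these annotations, mypy can report a call such as `select_headers(42)` as a type error before the code runs.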