## Problem Objective:

#### Help the City of Los Angeles to structure and analyze its job descriptions

The City of Los Angeles faces a big hiring challenge: 1/3 of its 50,000 workers are eligible to retire by July of 2020. The city has partnered with Kaggle to create a competition to improve the job bulletins that will fill all those open positions.

The content, tone, and format of job bulletins can influence the quality of the applicant pool. Overly-specific job requirements may discourage diversity. The Los Angeles Mayor’s Office wants to reimagine the city’s job bulletins by using text analysis to identify needed improvements.

The goal is to convert a folder full of plain-text job postings into a structured CSV file and then to use this data to: 

(1) identify language that can negatively bias the pool of applicants; 

(2) improve the diversity and quality of the applicant pool; and/or 

(3) make it easier to determine which promotions are available to employees in each job class.

In [None]:
# Import python packages
import os
from os import walk
import shutil
from shutil import copytree, ignore_patterns
from PIL import Image
from wand.image import Image as Img
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from collections import Counter
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS

%matplotlib inline

## Introduction

Thanks to Paul Mooney's [Kernel](https://www.kaggle.com/paultimothymooney/explore-job-postings), let us start by looking at a couple of the job postings to get an idea of what they look like.

In [None]:
pdf = '../input/cityofla/CityofLA/Additional data/PDFs/2014/April 2014/040414/PORT POLICE SERGEANT 3222.pdf'
Img(filename=pdf, resolution=200)

In [None]:
pdf = '../input/cityofla/CityofLA/Additional data/PDFs/2018/December/Dec 14/CABLE TELEVISION PRODUCTION MANAGER 1801 121418.pdf'
Img(filename=pdf, resolution=200)

The job postings appear to follow a common format:

1. Job Title at the first line
2. Job code in the next line
3. Job open date in the next line 
4. Annual Salary in the next part
5. Duties in the next part
6. Requirements / Qualifications in the next part
7. Where to apply information
8. Application deadline information

So the aim of the contest (as well as this notebook) is to parse the job bulletins and extract structured information from these files. We will also do some analysis to understand the content of the job postings.
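
As a rough illustration of what parsing the first three fields could involve, here is a minimal sketch. The sample text is hypothetical (only the title and code echo the PDF file name above; the date format and salary figures are invented), and the labels `Class Code:` and `Open Date:`, the function name `parse_header`, and the output keys are all assumptions rather than the contest's required schema.

```python
import re

# Hypothetical bulletin text, written to mirror the layout listed above;
# real files may differ in spacing, labels, and date format.
sample_text = """PORT POLICE SERGEANT
Class Code:       3222
Open Date:  04-04-14

ANNUAL SALARY

$80,000 to $100,000
"""


def parse_header(text):
    """Extract title, class code, and open date from one bulletin's text."""
    # Assumption: the job title is the first non-empty line.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    title = lines[0] if lines else None

    # Assumption: the code and date follow "Class Code:" and "Open Date:" labels.
    code_match = re.search(r"Class\s+Code:\s*(\d+)", text, flags=re.IGNORECASE)
    date_match = re.search(
        r"Open\s+Date:\s*(\d{1,2}-\d{1,2}-\d{2,4})", text, flags=re.IGNORECASE
    )

    return {
        "title": title,
        "class_code": code_match.group(1) if code_match else None,
        "open_date": date_match.group(1) if date_match else None,
    }


header = parse_header(sample_text)
print(header)
```

Fields that cannot be found come back as `None`, which makes it easy to count how many bulletins deviate from the assumed layout before refining the patterns.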

We have also been given a sample template showing what the parsed file should look like. Let us have a look at it.

In [None]:
sample_output_df = pd.read_csv("../input/cityofla/CityofLA/Additional data/sample job class export template.csv")
sample_output_df.head(3)

We need to extract the contents into a structure like the one above. We also need to create a data dictionary, like the one below, for the fields we create.

In [None]:
data_dict_df = pd.read_csv("../input/cityofla/CityofLA/Additional data/kaggle_data_dictionary.csv")
data_dict_df.head()

We are also given the job descriptions as plain text files. Let us get the total number of files and look at the first 1,000 characters of one of them.

In [None]:
job_bulletins_path = "../input/cityofla/CityofLA/Job Bulletins/"
print("Number of Job bulletins : ",len(os.listdir(job_bulletins_path)))

In [None]:
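# Use the same ISO-8859-1 encoding as the extraction step to avoid decode errors
with open(os.path.join(job_bulletins_path, os.listdir(job_bulletins_path)[0]), encoding="ISO-8859-1") as f:
    print(f.read(1000))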

## Data Extraction

In this section, let us extract the data and create a structured table out of it.


In [None]:
# Read every bulletin into a [file name, full text] row
jobs_list = []
for file_name in os.listdir(job_bulletins_path):
    with open(os.path.join(job_bulletins_path, file_name), encoding="ISO-8859-1") as f:
        jobs_list.append([file_name, f.read()])
jobs_df = pd.DataFrame(jobs_list, columns=["FileName", "Content"])
jobs_df
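
One possible next step is to split each bulletin's text into the parts listed in the introduction. The sketch below is only an assumption about how that might work: the heading strings in `SECTION_HEADINGS`, the function name `split_sections`, the `HEADER` key, and the sample text are all hypothetical and would need checking against the real files.

```python
# Assumption: each section starts with an upper-case heading line that
# begins with one of these strings. Real bulletins may use other wording.
SECTION_HEADINGS = (
    "ANNUAL SALARY",
    "DUTIES",
    "REQUIREMENT",
    "WHERE TO APPLY",
    "APPLICATION DEADLINE",
)


def split_sections(text):
    """Group the non-empty lines of a bulletin under the heading they follow."""
    sections = {}
    current = "HEADER"  # hypothetical bucket for lines before the first heading
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # A heading line is fully upper-case and starts with a known heading.
        heading = next(
            (h for h in SECTION_HEADINGS
             if stripped == stripped.upper() and stripped.startswith(h)),
            None,
        )
        if heading is not None:
            current = heading
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(stripped)
    return {name: " ".join(body) for name, body in sections.items()}


# Hypothetical text for illustration only.
sample_text = """PORT POLICE SERGEANT
Class Code: 3222

ANNUAL SALARY

$80,000 to $100,000

DUTIES

Supervises officers.

REQUIREMENTS

Two years of experience.
"""

sections = split_sections(sample_text)
for name, body in sections.items():
    print(name, "->", body)
```

If the assumptions hold, something like `jobs_df["Content"].apply(split_sections)` would give one dictionary of sections per bulletin, which could then be expanded into columns.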

### More to come. Stay tuned! 