# Other Jobs You May Be Interested In

### Data Science for Good: City of Los Angeles
### Using Natural Language Processing to Find Similar Jobs

### Summary

The goal of this approach is to make job seekers aware of other jobs they may be interested in. Much as an online retailer displays similar items for sale, we will display similar available jobs.

In practice, this should increase the number of applicants the City of Los Angeles receives. It should also help applicants navigate what may be thousands of available positions.

### Example Results

Here are a few examples of the similar jobs this process finds:

| Job | Similar Job |
|---|---|
| MEDICAL ASSISTANT | LICENSED VOCATIONAL NURSE |
| PRINCIPAL ACCOUNTANT | DEPARTMENTAL CHIEF ACCOUNTANT |

### Approach

The approach is to use natural language processing (NLP) methods to compare the job descriptions of all provided jobs. We will use term frequency–inverse document frequency (TF-IDF) to represent each job description as a numeric vector. Then, we will compare each vector to all other vectors using cosine similarity. The result will be a "rating" of how similar each pair of job descriptions is. Finally, for each job we will keep the highest "rating", which identifies its most similar job description.
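For reference, the two quantities can be written out as follows. This is a sketch that assumes scikit-learn's default `TfidfVectorizer` settings (smoothed IDF, L2-normalized rows); other TF-IDF variants use slightly different formulas. Here $n$ is the number of job descriptions, $\mathrm{df}(t)$ is the number of descriptions containing term $t$, and $\mathrm{tf}(t, d)$ is the count of $t$ in description $d$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
$$

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}
$$

Because TF-IDF weights are never negative, the cosine similarity between two job descriptions falls between 0 (no shared terms) and 1 (identical term proportions).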

In [None]:
# Imports
import pandas as pd
import os

We will use the "DUTIES" section of the job bulletins for our work. This section describes the job responsibilities. We can extract it with simple string manipulation using the function below:

In [None]:
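def get_duty(filename):
    # Read the whole bulletin; the context manager closes the file for us
    with open(filename, "r") as f:
        description = f.read()

    # The job title is the first non-blank line of the bulletin
    title = description.lstrip().split("\n")[0]

    # Keep the text between the DUTIES heading and the next section.
    # Raises IndexError if the bulletin has no DUTIES heading.
    duties = description.split("DUTIES")[1]
    duties = duties.split("REQUIREMENT")[0]
    # A second split is required because the file format varies
    duties = duties.split("NOTE")[0]

    return title, duties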
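To see what the splitting steps in `get_duty` do, they can be tried on an in-memory string. The sketch below is illustrative only: `sample_bulletin` is invented text, not a real bulletin, and `split_duties` is a hypothetical helper that mirrors the function above but also strips surrounding whitespace.

```python
# Hypothetical bulletin text, invented for illustration only
sample_bulletin = """
SAMPLE JOB TITLE
Class Code: 0000

DUTIES

A Sample Employee performs illustrative tasks.

REQUIREMENTS

1. One year of illustrative experience.
"""


def split_duties(text):
    """Mirror the splitting steps of get_duty on an in-memory string."""
    # First non-blank line is treated as the title
    title = text.lstrip().split("\n")[0]
    # Text after the DUTIES heading, cut at the next section heading
    duties = text.split("DUTIES")[1]
    duties = duties.split("REQUIREMENT")[0]
    duties = duties.split("NOTE")[0]
    # Unlike get_duty, also trim the surrounding blank lines
    return title, duties.strip()


title, duties = split_duties(sample_bulletin)
print(title)
print(duties)

# Without a DUTIES heading, split() returns a one-element list,
# so indexing [1] raises IndexError
try:
    split_duties("SAMPLE JOB TITLE\nNo such section here")
    missing_section_detected = False
except IndexError:
    missing_section_detected = True
    print("No DUTIES section found")
```

The `IndexError` in the last step is what lets the calling code detect bulletins that have no DUTIES section.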

Next, we will loop through every job description file and call our function.  Our final data structure, "rows", will be a list of dictionaries containing the title and duties section.

Note: There are 6 job descriptions without a DUTIES section.  We will filter those out.

In [None]:
rows = []

bulletin_dir = "../input/cityofla/CityofLA/Job Bulletins/"
files = os.listdir(bulletin_dir)

for filename in files:
    try:
        title, duties = get_duty(os.path.join(bulletin_dir, filename))
        rows.append({"title": title, "duties": duties})
    except (IndexError, UnicodeDecodeError):
        # IndexError: no DUTIES heading; UnicodeDecodeError: file could not be decoded
        print("No Duties:", filename)

Let's convert to a Pandas data frame and view the results.

In [None]:
df = pd.DataFrame(rows)
df.head()

With our data prepared, let's call the TF-IDF functions provided by scikit-learn.

In [None]:
# Import and declare the vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()

# Fit the vectorizer on our DUTIES strings
duties_vectors = vec.fit_transform(df['duties'])
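To build some intuition for what the vectorizer produces, here is a small sketch on three invented duty strings (they are placeholders, not taken from the bulletins). It shows that the result has one row per document and that a term appearing in fewer documents receives a higher IDF weight. The names `toy_duties`, `toy_vec`, and `idf_of` are introduced only for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented duty strings, for illustration only
toy_duties = [
    "inspects and repairs air conditioning equipment",
    "supervises employees who repair air conditioning equipment",
    "prepares and audits financial records",
]

toy_vec = TfidfVectorizer()
toy_vectors = toy_vec.fit_transform(toy_duties)

# One row per document, one column per distinct term
print(toy_vectors.shape)

# Map each term to its learned IDF weight
idf_of = {term: toy_vec.idf_[col] for term, col in toy_vec.vocabulary_.items()}

# "air" occurs in two documents, "audits" in only one,
# so "audits" is treated as the more distinctive term
print("air:", idf_of["air"])
print("audits:", idf_of["audits"])
```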

With the vectors calculated, we can move on to cosine similarity, which measures how close two vectors are to each other and therefore how similar the corresponding job descriptions are.

In [None]:
# Import and create an empty list
from sklearn.metrics.pairwise import cosine_similarity
similar_jobs = []

# Loop through every job
for index in range(len(df)):
    # Calculate the cosine similarity between the current job's vector and all job vectors.
    sim = cosine_similarity(duties_vectors[index], duties_vectors).ravel()

    # Every job is most similar to itself, so exclude it before picking the best match
    sim[index] = -1
    similar_job_index = sim.argmax()

    # Finally, access the most similar job via its index, then add that record to our final data structure
    similar_job = df.iloc[similar_job_index]['title']
    similar_jobs.append(similar_job)
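The loop above compares one job at a time. As an alternative sketch, the full pairwise similarity matrix can be computed in a single call and the diagonal masked so that a job is never matched with itself. The titles below appear in this notebook, but the duty strings are invented for illustration, and the `toy_` names are hypothetical. This variant holds an n-by-n matrix in memory, which is only reasonable for a modest number of jobs.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_titles = [
    "AIR CONDITIONING MECHANIC",
    "AIR CONDITIONING MECHANIC SUPERVISOR",
    "PRINCIPAL ACCOUNTANT",
    "DEPARTMENTAL CHIEF ACCOUNTANT",
]
# Invented duty strings, for illustration only
toy_duties = [
    "installs and repairs air conditioning and refrigeration equipment",
    "supervises employees who install and repair air conditioning equipment",
    "prepares and audits financial records and accounting reports",
    "directs the accounting staff and reviews financial reports",
]

toy_vectors = TfidfVectorizer().fit_transform(toy_duties)

# All pairwise similarities at once: entry [i, j] compares job i with job j
sim_matrix = cosine_similarity(toy_vectors)

# A job always matches itself perfectly, so mask the diagonal
np.fill_diagonal(sim_matrix, -1.0)

# For each row, the column with the highest remaining similarity
best_match = sim_matrix.argmax(axis=1)

for title, match in zip(toy_titles, best_match):
    print(title, "->", toy_titles[match])
```

Masking the diagonal rather than taking the second-highest value also avoids returning a job as its own match when two descriptions are identical.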

Let's add our list to the data frame as a new column and review some results.

Many of the similar jobs we've found are very intuitive, e.g.:
AIR CONDITIONING MECHANIC and AIR CONDITIONING MECHANIC SUPERVISOR

In [None]:
df['similar_job'] = similar_jobs
df.head(10)

We can generate the CSV by selecting our two columns and exporting them.

It is my recommendation that this methodology be used in partnership with other methods from this competition.  After the City of Los Angeles finalizes their job descriptions, run my methodology and generate the similar jobs based on those descriptions.

In [None]:
df[['title','similar_job']].to_csv("Job_Similarity.csv", index=False)

In [None]:
# Entire output for the public notebook
df[['title','similar_job']]

### Summary

Using the above methods, we can systematically find similar jobs that may be available at the City of Los Angeles. This methodology can be improved by improving the job descriptions (possibly with help from this competition).