
## Job Data Importing and Cleaning

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np

In [None]:
job_data = pd.read_csv('/content/drive/MyDrive/Brainstation Capstone/Combined_Jobs_Final.csv')

In [None]:
job_data.head(5)

Unnamed: 0,Job.ID,Provider,Status,Slug,Title,Position,Company,City,State.Name,State.Code,Address,Latitude,Longitude,Industry,Job.Description,Requirements,Salary,Listing.Start,Listing.End,Employment.Type,Education.Required,Created.At,Updated.At
0,111,1,open,palo-alto-ca-tacolicious-server,Server @ Tacolicious,Server,Tacolicious,Palo Alto,California,CA,,37.443346,-122.16117,Food and Beverages,Tacolicious' first Palo Alto store just opened...,,8.0,,,Part-Time,,2013-03-12 02:08:28 UTC,2014-08-16 15:35:36 UTC
1,113,1,open,san-francisco-ca-claude-lane-kitchen-staff-chef,Kitchen Staff/Chef @ Claude Lane,Kitchen Staff/Chef,Claude Lane,San Francisco,California,CA,,37.78983,-122.404268,Food and Beverages,\r\n\r\nNew French Brasserie in S.F. Financia...,,0.0,,,Part-Time,,2013-04-12 08:36:36 UTC,2014-08-16 15:35:36 UTC
2,117,1,open,san-francisco-ca-machka-restaurants-corp-barte...,Bartender @ Machka Restaurants Corp.,Bartender,Machka Restaurants Corp.,San Francisco,California,CA,,37.795597,-122.402963,Food and Beverages,We are a popular Mediterranean wine bar and re...,,11.0,,,Part-Time,,2013-07-16 09:34:10 UTC,2014-08-16 15:35:37 UTC
3,121,1,open,brisbane-ca-teriyaki-house-server,Server @ Teriyaki House,Server,Teriyaki House,Brisbane,California,CA,,37.685073,-122.400275,Food and Beverages,● Serve food/drinks to customers in a profess...,,10.55,,,Part-Time,,2013-09-04 15:40:30 UTC,2014-08-16 15:35:38 UTC
4,127,1,open,los-angeles-ca-rosa-mexicano-sunset-kitchen-st...,Kitchen Staff/Chef @ Rosa Mexicano - Sunset,Kitchen Staff/Chef,Rosa Mexicano - Sunset,Los Angeles,California,CA,,34.073384,-118.460439,Food and Beverages,"Located at the heart of Hollywood, we are one ...",,10.55,,,Part-Time,,2013-07-17 15:26:18 UTC,2014-08-16 15:35:40 UTC


We only need the job title and description for this analysis, so I will drop the other columns:

In [None]:
job_data = job_data.filter(["Title", "Job.Description"])
job_data = job_data.rename(columns={"Title": "job_title", "Job.Description":"job_description"})
job_data.head(2)

Unnamed: 0,job_title,job_description
0,Server @ Tacolicious,Tacolicious' first Palo Alto store just opened...
1,Kitchen Staff/Chef @ Claude Lane,\r\n\r\nNew French Brasserie in S.F. Financia...


We also want to ensure that we do not include any rows that contain empty job descriptions or job titles:

In [None]:
job_data.dropna(axis=0, inplace=True)
job_data = job_data[job_data.job_title != "none"]
job_data = job_data[job_data.job_description != "none"]
job_data

Unnamed: 0,job_title,job_description
0,Server @ Tacolicious,Tacolicious' first Palo Alto store just opened...
1,Kitchen Staff/Chef @ Claude Lane,\r\n\r\nNew French Brasserie in S.F. Financia...
2,Bartender @ Machka Restaurants Corp.,We are a popular Mediterranean wine bar and re...
3,Server @ Teriyaki House,● Serve food/drinks to customers in a profess...
4,Kitchen Staff/Chef @ Rosa Mexicano - Sunset,"Located at the heart of Hollywood, we are one ..."
...,...,...
84085,Book Keeper @ National Japanese American Histo...,NJAHS stands for National Japanese American Hi...
84086,Kitchen Staff/Chef @ Emporio Rulli,Weekend Brunch Line Cook \r\n● Other shifts ma...
84087,Driver @ Onigilly,ONIGILLY (Japanese rice ball wraps) seeks outg...
84088,Line Cook @ Machka Restaurants Corp.,We are a popular Mediterranean restaurant in F...


I wanted to ensure that the jobs included in the dataset are unique, and leveraged the job_title field for this:

In [None]:
job_data.drop_duplicates(subset=["job_title"], inplace=True)

Finally, as you can see above, some titles include an "@" followed by the restaurant name, e.g. "Waiter @ Taco Bell". Because I removed the official Company name column from the dataset, I decided to remove the company name from the job titles as well:

In [None]:
def remove_at(string):
  if "@" in string:
    return string.split("@")[0].strip()
  else:
    return string

job_data["job_title"] = job_data["job_title"].map(remove_at)
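The same cleanup can also be written with pandas' vectorised string methods. The following is a minimal sketch on a toy Series (the titles are invented for illustration), not the code used above:

```python
import pandas as pd

# Toy titles, invented for illustration only
titles = pd.Series(["Server @ Tacolicious", "Bartender", "Line Cook @ A @ B"])

# Keep everything before the first "@" and trim surrounding whitespace
cleaned = titles.str.split("@").str[0].str.strip()
print(cleaned.tolist())
```

Note that titles which differed only by company name become identical after this step, so it may be worth checking for duplicates again afterwards.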

In [None]:
job_data.shape

(41838, 2)

We now have 41,838 jobs remaining.

## Experience Data Importing and Cleaning

In [None]:
experience_data = pd.read_csv('/content/drive/MyDrive/Brainstation Capstone/Experience.csv')
experience_data.head(2)

Unnamed: 0,Applicant.ID,Position.Name,Employer.Name,City,State.Name,State.Code,Start.Date,End.Date,Job.Description,Salary,Can.Contact.Employer,Created.At,Updated.At
0,10001,Account Manager / Sales Administration / Quali...,Barcode Resourcing,Bellingham,Washington,WA,2012-10-15,,,,,2014-12-12 20:10:02 UTC,2014-12-12 20:10:02 UTC
1,10001,Electronics Technician / Item Master Controller,Ryzex Group,Bellingham,Washington,WA,2001-12-01,2012-04-01,,,,2014-12-12 20:10:02 UTC,2014-12-12 20:10:02 UTC


For this analysis, we only need the applicant id, position name, and job description.

In [None]:
experience_data = experience_data.filter(["Applicant.ID","Position.Name", "Job.Description"])
experience_data = experience_data.rename(columns={"Applicant.ID": "applicant_id", "Position.Name":"job_title", "Job.Description": "job_description"})
experience_data.head()

Unnamed: 0,applicant_id,job_title,job_description
0,10001,Account Manager / Sales Administration / Quali...,
1,10001,Electronics Technician / Item Master Controller,
2,10001,Machine Operator,
3,10003,maintenance technician,"Necessary maintenance for ""Make Ready"" Plumbin..."
4,10003,Electrical Helper,repair and services of electrical construction


Again, we need to clean up any rows that contain null values.

In [None]:
experience_data.dropna(axis=0, inplace=True)
experience_data = experience_data[experience_data.job_title != "none"]
experience_data = experience_data[experience_data.job_description != "none"]

Applicant IDs are duplicated because each applicant may list multiple jobs on their CV. I will concatenate the rows for each applicant ID to produce "CVs":

In [None]:
experience_data = experience_data.groupby("applicant_id").agg("; ".join).reset_index().dropna()
experience_data.head(5)

Unnamed: 0,applicant_id,job_title,job_description
0,2,Writer for the Uloop Blog; Volunteer,"* Wrote articles for the ""Uloop Blog,"" which i..."
1,38,Sales Person & Phone Receptionist,Asking customer if they need any assistance an...
2,78,Impact team member,"Help maintain merchandise flow, Work on fillin..."
3,89,Healthcare Specialist / Combat Medic; Clerk's ...,"Clinical and field medicine, Healthcare educat..."
4,96,Cashier; Receptionist; Cashiet/Waiter,Greeting people and introducing/recommend food...


In [None]:
experience_data.shape

(2375, 3)
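To make the grouping step concrete, here is a small sketch on toy data (the rows are invented for illustration) showing how several experience rows collapse into one "CV" row per applicant:

```python
import pandas as pd

# Toy experience rows, invented for illustration only
rows = pd.DataFrame({
    "applicant_id": [1, 1, 2],
    "job_title": ["Cashier", "Waiter", "Driver"],
    "job_description": ["Handled payments", "Served tables", "Delivered orders"],
})

# Join every text column per applicant with "; " to form one CV row each
cvs = rows.groupby("applicant_id").agg("; ".join).reset_index()
print(cvs)
```

Applicant 1 ends up with a single row whose title and description are the "; "-joined values of the two original rows.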

### Making a Dataframe to represent a CV being entered for a job search

In [None]:
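# Take the last CV in the experience data as the example job seeker
cv = experience_data.iloc[-1].filter(["job_description"])
# DataFrame.append was removed in pandas 2.0, so add the CV row with pd.concat instead
job_data_plus_cv = pd.concat([job_data, cv.to_frame().T], ignore_index=True).filter(["job_description"])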

In [None]:
job_data_plus_cv.tail()

Unnamed: 0,job_description
41834,NJAHS stands for National Japanese American Hi...
41835,Weekend Brunch Line Cook \r\n● Other shifts ma...
41836,We are a popular Mediterranean restaurant in F...
41837,We are looking for a cashier! \r\n\r\n ● Take...
41838,revolving contract; BeautifulGenius Labs is a ...


In [None]:
job_data_plus_cv.shape

(41839, 1)

## Summary of Data

We now have two dataframes that have undergone preliminary cleaning.

* `job_data`, which represents a random set of 41,838 job posts
* `experience_data`, which represents a set of 2,375 CVs of individuals who may apply for those jobs.
* `job_data_plus_cv`, which represents the 41,838 jobs alongside one CV from the experience data (41,839 rows in total). That CV stands in for the person who is searching for a job.

## Text Preprocessing

To pre-process the text, I took the following steps:

1. **Converted all text to lower case**

2. **Removed English Stop Words, and words that were not made up of alpha characters only.**

This helped to remove noise in the data.

3. **Lemmatization** 

I preferred lemmatization over stemming, as lemmatization ensures that each of the resulting tokens is indeed an English word, and can also distinguish between tokens based on their context within a sentence and their part of speech (e.g. nouns, verbs, etc.).

4. **Term Frequency Inverse Document Frequency (TF-IDF) Vectorizer** 

I wanted to use TF-IDF rather than a CountVectorizer, as, in the context of CVs and Job Posts, the number of times a word appears does not necessarily correlate with its importance.
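As a rough illustration of why TF-IDF suits this task, the sketch below fits scikit-learn's `TfidfVectorizer` with its default settings on a toy corpus (the documents are invented for illustration) and compares the inverse document frequency of a word that appears everywhere with one that appears once. With the default smoothing, scikit-learn computes `idf(t) = ln((1 + n) / (1 + df(t))) + 1`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only
docs = [
    "server server server needed for busy restaurant",
    "restaurant needs a bartender",
    "restaurant seeks line cook",
]

tfidf_demo = TfidfVectorizer().fit(docs)
vocab = tfidf_demo.vocabulary_
idf = tfidf_demo.idf_

# "restaurant" occurs in every document, so it gets the lowest possible weight;
# "bartender" occurs in only one document, so it is weighted more heavily
print(idf[vocab["restaurant"]], idf[vocab["bartender"]])
```

A plain count would treat the repeated "server" and the ubiquitous "restaurant" as important, whereas the IDF term down-weights words that do little to distinguish one document from another.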

As we will be performing some Natural Language Processing (NLP) later on, I also want to normalise the data so that it is all in the same case:

In [None]:
job_data_plus_cv = job_data_plus_cv.applymap(str.lower)
job_data_plus_cv.shape

(41839, 1)

In [None]:
import nltk
from nltk.corpus import stopwords
import string

nltk.download("stopwords")
ENGLISH_STOP_WORDS = stopwords.words("english")

nltk.download("wordnet")
wordnet_lemmatizer = nltk.stem.WordNetLemmatizer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [None]:
def custom_tokenizer(sentence):
    # remove encoding of special characters
    sentence = sentence.replace("&nbsp", " ")
    sentence = sentence.replace("&amp", "&")
    sentence = sentence.replace("&rsquo", "'")

    # remove punctuation
    for punctuation_mark in string.punctuation:
        sentence = sentence.replace(punctuation_mark, '')
           
    # remove tabs, carriage returns, and newlines
    sentence = sentence.replace('\r', " ")
    sentence = sentence.replace('\n', " ")
    sentence = sentence.replace('\t', " ")

        
    # split sentence into words
    list_of_words = sentence.split(' ')
    list_of_lemmatized_words = []
    
    # remove stop words and any tokens that are just empty strings
    for word in list_of_words:
        if (not word in ENGLISH_STOP_WORDS) and (word!='') and word.isalpha():
            # Lemmatize word
            lemmatized_word = wordnet_lemmatizer.lemmatize(word)
            list_of_lemmatized_words.append(lemmatized_word)

    return list_of_lemmatized_words

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(tokenizer = custom_tokenizer, stop_words = ENGLISH_STOP_WORDS, ngram_range = (1,5))

In [None]:
job_data_plus_cv_transformed = tfidf.fit_transform(job_data_plus_cv["job_description"])



### Finding Jobs using Cosine Similarity

I will now use cosine similarity to find the jobs that are most similar to the selected CV, which is printed below:

In [None]:
print(cv["job_description"])

revolving contract; BeautifulGenius Labs is a global platform that designs and shares STEAM resources and projects between labs around the universe. 


BG Labs projects include: 
KaMasaJei BioDynamics "Brain Maps, Neurotransmitters and Bodywork" Interactive Exhibition, which has been presented at SU-NASA Ames, Fox Film Studios, TagDF Mexico City sponsored by Televisa, and FutureMed; 
GeniaBella Arts which infuses science/tech concepts into Music & Dance - including EA video game soundtrack; and 
WomenGoGlobal highlighting innovative women across the universe - including Women Olympic competitors at the 2012 London Olympics, Women Innovators, and Women Artists.; WGG History:
WomenGoGlobal established its roots as a global event series highlighting and education women entrepreneurs, leaders and womens resource networks, and awarded them for their work in the spheres of Sustainability, Techie, Social Mission, Global Innovation and Collaboration, Arts, Education and Wellness.  Women

For the sake of computation time, I will only process the first 20,000 jobs. 

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Convert once to CSR so that individual rows can be sliced efficiently
tfidf_matrix = job_data_plus_cv_transformed.tocsr()
# The CV is the last row of the matrix
cv_transformed = tfidf_matrix[-1, :]
similarities = []

#for index in range(job_data_plus_cv_transformed.shape[0] - 1):
for index in range(20000):
  print(f"Processing Job {index + 1}...")
  job = tfidf_matrix[index, :]
  # cosine_similarity accepts sparse rows directly, so no dense conversion is needed
  similarity = cosine_similarity(cv_transformed, job)[0][0]
  similarities.append((index, similarity))

# Sort from most to least similar
similarities = sorted(similarities, key=lambda x: x[1], reverse=True)
print(similarities)

Streaming output truncated to the last 5000 lines.
Processing Job 15002...
Processing Job 15003...
...
Processing Job 15040...
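The per-job loop can also be collapsed into a single `cosine_similarity` call, which is usually much faster. The following is a minimal sketch on a toy corpus (the documents and variable names are invented for illustration), assuming the CV is stored as the last row, as in `job_data_plus_cv`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus, invented for illustration: three "jobs" followed by one "CV"
docs = [
    "gallery attendant for art museum exhibition",
    "line cook for busy restaurant kitchen",
    "science exhibition guide at museum",
    "organised art and science exhibition events",
]

matrix = TfidfVectorizer().fit_transform(docs)  # sparse matrix, one row per document

# Compare the last row (the CV) against all job rows in one call;
# the result has shape (1, n_jobs), so flatten it to a 1-D array
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

# Job indexes ordered from most to least similar
ranking = scores.argsort()[::-1]
print(ranking, scores[ranking])
```

The line cook post shares no words with the toy CV, so it receives a similarity of zero and is ranked last.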

From this, we can determine the top 10 jobs and evaluate how well they match the CV:

In [None]:
top_10_jobs_indexes = []
top_10_jobs_similarities = []

# Get the indexes of the top 10 matching jobs. A job is skipped when its
# similarity (rounded to 4 decimal places) equals that of the previous job,
# as this usually indicates a duplicated posting.
for i, (index, score) in enumerate(similarities):
  if i == 0 or round(score, 4) != round(similarities[i-1][1], 4):
    top_10_jobs_indexes.append(index)
    top_10_jobs_similarities.append(score)
  # Stop as soon as 10 distinct jobs have been collected
  if len(top_10_jobs_indexes) == 10:
    break

In [None]:
# Display the top 10 job descriptions
top_10_jobs = job_data_plus_cv.filter(top_10_jobs_indexes, axis='index')
for i in range(10):
  print(f"==== Job {i+1} ====")
  print(top_10_jobs.iloc[i, 0])

==== Job 1 ====
seasonal gallery attendantphoenix art museum is looking for responsible, professionalgallery attendants to join the team for the upcoming exhibition of:leonardo da vinci’s codex leicester and the power of observationthe exhibition, at phoenix art museum, will be groundbreaking in its approach of bringing leonardo da vinci into a broad artistic context that explores his continuing influence on artists into our own time. this exhibition will take the perspective that curiosity, direct observation, and thinking on paper, which define leonardo’s active mind and working method, are vital ingredients to the creative process and are gateways to discovery and invention. in addition to the codex leicester, will be carefully selected works of art by a diverse group of artists who shared aspects of leonardo’s practices.gallery attendantgallery attendants ensure that all visitors are welcomed to the museum by providing the highest standards of customer service, and provide a safe, 


### Evaluation of the model 

As there is no ground truth in the data sources, and because the quality of a job recommendation is an inherently qualitative judgement, there are few options for a robust quantitative evaluation of this method.

That said, one evaluation metric available is precision, which is a measure of how relevant the job recommendations are for a given CV.

To evaluate the model, I manually judged how many of the job posts recommended for a CV are appropriate for that candidate. This scoring approach is subjective and hard to replicate, but the only way to remove that bias and know for certain whether a job post is relevant to an applicant would be to have the applicant confirm it themselves.

As this was not possible, I believe this was the best option available.

It is clear from reading the CV that the applicant has a lot of experience in setting up exhibitions, particularly on topics related to science, music, art, and women's empowerment. Based on this information, my best qualitative judgement of the relevance of the recommendations is as follows:

In [None]:
relevance_scores = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]

The `relevance_scores` array above is an array of 10 items, one for each of the top 10 recommended job posts. 1 means that I deemed the job post relevant, and 0 means that I deemed the job post irrelevant. 



In [None]:
precision = sum(relevance_scores)/10 * 100
print(precision)

60.0


These scores can be used to give a rough precision score for the model for this given CV, which is 60%.
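The calculation above can be wrapped in a small helper so that the same manual labels can be scored at different cut-offs. This is a sketch only: the function name `precision_at_k` is my own, and it assumes a non-empty list of 0/1 relevance labels ordered from the best to the worst recommendation.

```python
def precision_at_k(relevance, k):
    """Percentage of the top-k recommendations labelled relevant (1)."""
    top_k = relevance[:k]
    return 100 * sum(top_k) / len(top_k)

# Manual relevance labels for the top 10 recommendations (from the cell above)
relevance_scores = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]

print(precision_at_k(relevance_scores, 10))  # 60.0
print(precision_at_k(relevance_scores, 3))   # 100.0
```

Looking at a smaller cut-off, such as the top 3, shows whether the relevant posts are concentrated at the top of the ranking.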