# Find the most similar job posting to our resume

## Objective 

Narrow down the set of job postings to those that are most similar to our resume in preparation for further analysis.

## Workflow 

1. Obtain the resume from the GitHub repository. Transform job posting text and our resume into TF-IDF vectors using sklearn's TF-IDF vectorizer class.
2. Compute the cosine similarity between the vectorized resume and the job postings using sklearn's cosine similarity function.
3. Sort the job postings based on similarity to our resume, and choose an appropriate cutoff for selecting the most similar jobs. Store the most similar job postings in a new DataFrame for later use, and save the DataFrame to disk.
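
The three steps above can be sketched end to end on toy data. This is only a minimal sketch under stated assumptions, not the notebook's own code: the sample postings, the resume string, the `top_n` cutoff rule, and the output file name `most_similar_jobs.csv` are all hypothetical placeholders.

```python
import os
import tempfile

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the job postings and the resume text.
jobs = pd.DataFrame({
    "title": ["Data Scientist", "Line Cook", "ML Engineer"],
    "body": [
        "Build machine learning models in Python and Pandas.",
        "Prepare meals and keep the kitchen clean.",
        "Deploy machine learning pipelines with Python.",
    ],
})
resume = "Python, Pandas, machine learning, natural language processing"

# Step 1: fit the vocabulary on the postings, then reuse it for the resume.
vectorizer = TfidfVectorizer(stop_words="english")
job_vectors = vectorizer.fit_transform(jobs["body"])
resume_vector = vectorizer.transform([resume])

# Step 2: one cosine similarity score per posting.
scores = cosine_similarity(resume_vector, job_vectors).ravel()

# Step 3: sort by similarity and keep the top_n rows (an assumed cutoff rule).
top_n = 2
ranked = jobs.assign(similarity=scores).sort_values("similarity", ascending=False)
most_similar = ranked.head(top_n)

# Save the selection to disk (hypothetical location in the temp directory).
out_path = os.path.join(tempfile.gettempdir(), "most_similar_jobs.csv")
most_similar.to_csv(out_path, index=False)
```

A fixed `top_n` is only one possible cutoff; a similarity threshold chosen after plotting the sorted scores would serve the same purpose.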

In [35]:
import numpy as np
import pandas as pd

In [36]:
SRC_PATH = "../data/job_offers.csv"

In [37]:
df = pd.read_csv(SRC_PATH,delimiter=";")
df.head()

| | title | body | bullets |
|---|---|---|---|
| 0 | Data Scientist II - Payment Products - Seattle... | Data Scientist II - Payment Products - Seattle... | ('Bachelor’s degree in Computer Science, Mathe... |
| 1 | Data Scientist - Seattle, WA | Data Scientist - Seattle, WA\nRing is looking ... | ('Use predictive analytics and machine learnin... |
| 2 | Data Scientist - Jersey City, NJ 07311 | Data Scientist - Jersey City, NJ 07311\nWorkin... | ('Create predictive models using current and e... |
| 3 | 2020 PhD Data Scientist Internship - Uber Eats... | 2020 PhD Data Scientist Internship - Uber Eats... | ('Develop models for user behavior and marketp... |
| 4 | Data Analyst- Data Science & Analytics - Palo ... | Data Analyst- Data Science & Analytics - Palo ... | ('Detailed and clear understanding of data use... |


In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
# Fit the vocabulary on the job posting bullets and vectorize them
tfidf_matrix = tfidf_vectorizer.fit_transform(df.bullets.values)
tfidf_matrix.shape

(388, 5286)

In [75]:
# Read the resume text; "utf-8-sig" drops a leading byte-order mark if present
with open("../data/Liveproject Resume.txt", "r", encoding="utf-8-sig") as f:
    my_cv = f.read()

In [76]:
input_vectorized = tfidf_vectorizer.transform([my_cv])
input_np_vectorized = input_vectorized.toarray()
input_np_vectorized.T.shape

In [95]:
# TfidfVectorizer L2-normalizes rows by default, so this dot product is the cosine similarity
cosine_similarities = (tfidf_matrix @ input_np_vectorized.T).flatten()
most_similar_index = np.argmax(cosine_similarities)
similarity = cosine_similarities[most_similar_index]
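
The matrix product in this cell acts as a cosine similarity only because `TfidfVectorizer` L2-normalizes each row by default (`norm="l2"`). A minimal check on toy text, compared against sklearn's `cosine_similarity` (the function named in the workflow), might look like this; the sample strings are hypothetical:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus and query.
docs = ["python machine learning", "cooking and baking", "python data analysis"]
query = "machine learning with python"

vectorizer = TfidfVectorizer()  # norm="l2" by default
doc_matrix = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform([query])

# Plain dot product between the normalized vectors.
dot_scores = (doc_matrix @ query_vector.T).toarray().ravel()

# Explicit cosine similarity for comparison.
cos_scores = cosine_similarity(doc_matrix, query_vector).ravel()

assert np.allclose(dot_scores, cos_scores)
best = int(np.argmax(cos_scores))
print("Most similar document:", docs[best])
```

If the vectorizer were created with `norm=None`, the two results would generally differ, and `cosine_similarity` would be the safer choice.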

In [96]:
most_similar_job_offer = df.iloc[most_similar_index]

In [97]:
most_similar_job_offer

title      Machine Learning / Data Scientist Internship (...
body       Machine Learning / Data Scientist Internship (...
bullets    ('How to build models at scale using vast amou...
Name: 306, dtype: object
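
The `bullets` value looks like the string representation of a Python tuple. Assuming that holds for every row, `ast.literal_eval` can recover the individual bullet strings, whereas splitting on commas also breaks up any bullet that contains a comma. A small sketch with a made-up sample string:

```python
import ast

# Hypothetical sample in the same shape as the `bullets` column.
raw_bullets = "('Build models at scale, using large datasets.', 'Communicate results clearly.')"

# Parse the tuple literal safely (no arbitrary code execution).
bullets = ast.literal_eval(raw_bullets)

# Naive comma splitting cuts the first bullet in two.
naive = raw_bullets.split(",")

for bullet in bullets:
    print("-", bullet)
```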

In [98]:
print("My Resume")
print(my_cv)

print("\n************* BEST JOB OFFER ******************\n")
print("Job title",most_similar_job_offer.title)

print("\nMost similar offer bullets",most_similar_job_offer.bullets.split(","))



My Resume
﻿Good Student
Data Scientist
	  

Good Student
123 Fake Street
Some City, QT 12345
123.456.7890
no_reply@fakesite.com
	ㅡ
Skills
	  

Python, Pandas, machine learning, natural language processing
	ㅡ
Experience
	  

Manning / Data Analyst
Oct 2019 - PRESENT,  REMOTE
Analyzed and visualized vast amounts of data using Pandas, Python, and Matplotlib.
	ㅡ
Education
	  

Berkeley / B.S. Mathematics
August 2015 - May 2019,  BERKELEY, CA
Graduated summa cum laude.

	ㅡ
Awards
	  

Tau Beta Pi Honors Society


************* BEST JOB OFFER ******************

Job title Machine Learning / Data Scientist Internship (Summer 2020) - San Diego - San Diego, CA

Most similar offer bullets ["('How to build models at scale using vast amounts of structured and unstructured heterogeneous types of data.'", " 'Ensuring high accuracy based on industry’s stringent requirements around precision or recall and with minimum Type I and Type II errors.'", " 'Generating predictions for millions of rows of data