<a href="https://colab.research.google.com/github/MaiMejia/ML-Projects/blob/main/Potential_Talents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


### <b> Potential Candidates  - Predicting the best fit</b>
<a id='top'></a>

#### <b> Table of Contents</b>


1.  <a href="#Context">Context</a>
2.  <a href="#Data">Data</a>
3.  <a href="#Text Cleaning">Text Cleaning</a>
4. <a href="#Text Preprocessing">Text Preprocessing</a>
5. <a href="#Text Representation (Text Vectorization)">Text Representation (Text Vectorization)</a>
    <br><a href="#Traditional Approach: Term-Frequency-Inverse Document Frequency (TF-IDF)"> Traditional Approach: Term-Frequency-Inverse Document Frequency (TF-IDF)</a><br>
    <a href="#Keyword based search: Seeking Human Resources"> i) Keyword based search: Seeking Human Resources</a> <br>
    <a href="#Keyword based search: Aspiring Human Resources">ii) Keyword based search: Aspiring Human Resources</a><br>
    <a href="#Keyword based search: Research Assistant">iii) Keyword based search: Research Assistant</a><br>
  <a href="#Final Notes">Final Notes</a>

<a name="Context"></a>
#### <b> 1. Context</b>                                                     

<p style="font-size:14px; line-height:1.5; margin-top:0px; margin-bottom:4px;">
A talent sourcing and management company is interested in finding good candidates that it can outsource to tech companies. However, finding a good fit for open positions is not an easy task, because the firm faces three key challenges:  <br>
1) Getting a deep understanding of the position, <br>
2) Defining the skillset candidates must have to be selected, and  <br>
3) Contacting the best job seekers.

Because this work is labor-intensive, an automated process could benefit the company and help it connect with top-performing workers more efficiently.<br>

**Goal:** Predict how fit the candidate is based on their available information (variable fit).<br>

<b>Data</b> <br>
The data comes from our sourcing efforts. It contains a unique identifier for each candidate to ensure the company is compliant with data privacy laws. <br>

<b>Attributes</b>  <br>
**id:** unique identifier for candidate (numeric) <br>
**job_title:** job title for candidate (text) <br>
**location:** geographical location for candidate (text) <br>
**connection:** number of connections the candidate has; 500+ means over 500 (text) <br>

**Output (desired target):** <br>
**fit** - how fit the candidate is for the role? (numeric, probability between 0-1)  <br>
**Keywords** - “Aspiring human resources” or “seeking human resources”
</p>

In [None]:
#Installing libraries
%pip install nltk pyspark wordfreq gensim



<p style="font-size:14px; line-height:1.5; margin-top:0px; margin-bottom:4px;">
<b>Importing Libraries</b> </p>

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Model building libraries
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('words')
import gensim
from gensim.models import Word2Vec
from nltk.corpus import stopwords
from nltk.corpus import words
from wordfreq import word_frequency
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
import spacy
from spacy import displacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


<a name="Data"></a>
#### <b>2. Data</b>
<p style="font-size:14px; line-height:1.5; margin-top:0px; margin-bottom:4px;">Loading and previewing the raw data to understand its structure.</p>

In [None]:
### Open the csv file

# Read the file directly from Google Drive using the URL + file id
data = pd.read_csv("https://drive.google.com/uc?id=13p6JXZUvAdXccUOMSMHRo2qsko0hLlr3", on_bad_lines='skip', engine='python', index_col=0)
pd.set_option('display.max_colwidth', 200)                                      # Widen the column display so the full text is visible

data = data.iloc[:,:3]                                                          # Keep the first three columns; the fit column is dropped because it doesn't add any value
data.head()

id,job_title,location,connection
1,2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional,"Houston, Texas",85
2,Native English Teacher at EPIK (English Program in Korea),Kanada,500+
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44
4,People Development Coordinator at Ryan,"Denton, Texas",500+
5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+


In [None]:
# This is a small dataset
data.shape

(104, 3)

<a name="Text Cleaning"></a>
#### <b>3. Text Cleaning</b>
<p style="font-size:14px; line-height:1.5; margin-top:0px; margin-bottom:4px;">Cleaning the text data by expanding the abbreviations found in the job titles into their full forms.</p>

In [None]:
import re

abbv_map = {'c': 'charles','epik': 'english program korea','hr': 'human resources','hris': 'human resources information system','svp': 'senior vice president',
               'chro': 'chief human resources officer','csr': 'corporate social responsibility officer','engie': 'energy company','gphr': 'global professional human resources',
               'sphr': 'senior professional human resources','inc': 'company','mes': 'manufacturing execution system','heil': 'environmental company','gi': 'geographic information system',
               'rrp': 'recommended retail price','jti': 'japan tobacco international','ey': 'ernst young ','lab': 'laboratory','st': 'saint'}

def expand_abbv(text):
  '''
  Expand known abbreviations in a job title using abbv_map (case-insensitive, whole words only).
  '''
  if pd.isnull(text):
    return text
  sorted_keys = sorted(abbv_map.keys(), key=len, reverse=True)                # Longest keys first, so longer abbreviations are handled before shorter ones
  for key in sorted_keys:
    pattern = r'\b' + re.escape(key) + r'\b'                                  # Word boundaries match whole words only (e.g., 'hr' does not match inside 'hris')
    text = re.sub(pattern, abbv_map[key], text, flags=re.IGNORECASE)
  return text

text = data['job_title'].apply(expand_abbv)
text.head(10)

id,job_title
1,2019 charles.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional
2,Native English Teacher at english program korea (English Program in Korea)
3,Aspiring Human Resources Professional
4,People Development Coordinator at Ryan
5,Advisory Board Member at Celal Bayar University
6,Aspiring Human Resources Specialist
7,Student at Humber College and Aspiring Human Resources Generalist
8,human resources Senior Specialist
9,Student at Humber College and Aspiring Human Resources Generalist
10,Seeking Human Resources human resources information system and Generalist Positions
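
The whole-word matching used by `expand_abbv` can be checked in isolation. The sketch below uses a hypothetical two-entry map (`demo_map`) and a hypothetical function (`expand_demo`), neither of which is part of the notebook. It suggests that `\b` keeps `hr` from firing inside `hris`, while a single-letter key such as `c` still matches initials like `C.T.`, which would explain the `charles.T.` in the first row of the output above.

```python
import re

# Hypothetical mini-map for illustration only (not the notebook's abbv_map)
demo_map = {'hr': 'human resources', 'c': 'charles'}

def expand_demo(text):
    # Replace each key only where it appears as a whole word, ignoring case
    for key in sorted(demo_map, key=len, reverse=True):
        pattern = r'\b' + re.escape(key) + r'\b'
        text = re.sub(pattern, demo_map[key], text, flags=re.IGNORECASE)
    return text

print(expand_demo('HR Senior Specialist'))   # 'hr' is a whole word, so it is expanded
print(expand_demo('Seeking HRIS roles'))     # 'hr' inside 'hris' is left alone
print(expand_demo('2019 C.T. Bauer'))        # 'C' followed by '.' counts as a whole word
```

If initials should be left untouched, one option is to drop single-letter keys from the map or to require that the match is not followed by a period.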


<a name="Text Preprocessing"></a>
#### <b>4. Text Preprocessing</b>
<p> This step cleans, transforms, and prepares the text for analysis and modeling.<br>
<b>Key techniques:</b> tokenization, stopword removal, stemming, and lemmatization (the function below uses lemmatization rather than stemming). </p>

In [None]:
def text_preprocessing(text):
  '''
  Clean, transform, and prepare text for analysis and modeling.
  Accepts a single string or a pandas Series of strings and returns the same type.
  '''

  tokenizer = RegexpTokenizer(r'\w+')                                                  # Keep alphanumeric tokens and drop punctuation
  stop_words = set(word.lower() for word in nltk.corpus.stopwords.words('english'))
  lemmatizer = WordNetLemmatizer()

  def clean(doc):
    tokens = tokenizer.tokenize(doc.lower())                                           # Lowercase, then split the text into individual words (tokens)
    tokens = [word for word in tokens
              if word not in stop_words and not word.isdigit()]                        # Remove stopwords and digits
    tokens = [lemmatizer.lemmatize(word, pos=wordnet.VERB) for word in tokens]         # Lemmatize each token as a verb
    return ' '.join(tokens)                                                            # Join tokens into a single string

  # If passing a string
  if isinstance(text, str):
    return clean(text)

  # If passing a series
  return text.apply(clean)

# Applying the preprocessing function
text_prep = text_preprocessing(text)
text_prep.head(10)

id,job_title
1,charles bauer college business graduate magna cum laude aspire human resources professional
2,native english teacher english program korea english program korea
3,aspire human resources professional
4,people development coordinator ryan
5,advisory board member celal bayar university
6,aspire human resources specialist
7,student humber college aspire human resources generalist
8,human resources senior specialist
9,student humber college aspire human resources generalist
10,seek human resources human resources information system generalist position


<a name="Text Representation (Text Vectorization)"></a>
#### <b>5. Text Representation (Text Vectorization)</b>
<p style="font-size:14px; line-height:1.5; margin-top:0px; margin-bottom:4px;"> This step transforms the text into numerical vectors so that the ML algorithm can ingest it.</p>

<a name="Traditional Approach: Term-Frequency-Inverse Document Frequency (TF-IDF)"></a>
<b>Traditional Approach: Term-Frequency-Inverse Document Frequency (TF-IDF)</b>
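
To make the scores in the following sections easier to interpret, here is a minimal pure-Python sketch of TF-IDF weighting and cosine similarity on a toy corpus. It assumes unigrams only, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which correspond to the default settings of scikit-learn's `TfidfVectorizer`. The corpus and the names `tfidf_vector` and `cosine` are illustrative and not part of the notebook.

```python
import math
from collections import Counter

# Toy corpus of already-preprocessed job titles (illustrative only)
docs = [
    'seek human resources position',
    'aspire human resources professional',
    'native english teacher',
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))
vocab = sorted(df)

def tfidf_vector(tokens):
    # Term frequency times smoothed IDF, then L2-normalized; unknown terms are ignored
    tf = Counter(tokens)
    weights = [tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

def cosine(u, v):
    # Vectors are unit length (or all zeros), so the dot product is the cosine similarity
    return sum(a * b for a, b in zip(u, v))

query = tfidf_vector('seek human resources'.split())
scores = [cosine(query, tfidf_vector(tokens)) for tokens in tokenized]
for doc, score in zip(docs, scores):
    print(f'{score:.3f}  {doc}')
```

The title that shares all three query terms scores highest, the one that shares only "human resources" scores lower, and the unrelated title scores zero.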

In [None]:
# Ranking job titles against a keyword with TF-IDF plus an exact-match boost
from sklearn.metrics.pairwise import cosine_similarity

def keyword_ranking(text, ngram_range, boost):
    '''
    Rank candidates by the cosine similarity between their preprocessed job title and a
    keyword typed by the user, adding `boost` to titles that contain the keyword literally.
    '''
    # TF-IDF setup
    tfidf = TfidfVectorizer(ngram_range=ngram_range)                                        # n-grams let multi-word phrases count as features, which gives more precise results
    tfidf_vec = tfidf.fit_transform(text)

    # Keyword input, preprocessed the same way as the job titles
    keyword = input("Enter a keyword: ")
    keyword_prep = text_preprocessing(keyword)
    query_vec = tfidf.transform([keyword_prep])

    cosine_ekw = cosine_similarity(query_vec, tfidf_vec).flatten()

    # Exact keyword-based ranking
    df = data.copy()                                                                        # Work on a copy so repeated searches do not modify `data`
    df['score'] = cosine_ekw
    df['score'] += text.str.contains(keyword_prep, regex=False).astype(int) * boost        # Boost titles that contain the keyword as a literal substring
    df['ranks'] = df['score'].rank(method='dense', ascending=False).astype(int)

    pd.set_option('display.max_rows', None)
    return df.sort_values(by='ranks')
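
For scripted or repeatable runs, the same ranking logic can take the keyword as an argument instead of calling `input()`. The sketch below is a self-contained variant on a toy Series; the function name `rank_titles`, its parameters, and the sample titles are hypothetical, and it assumes that both the titles and the keyword are already preprocessed.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_titles(titles, keyword, ngram_range=(1, 3), boost=0.2):
    # Hypothetical helper: `titles` is a Series of preprocessed job titles
    tfidf = TfidfVectorizer(ngram_range=ngram_range)
    title_vecs = tfidf.fit_transform(titles)
    query_vec = tfidf.transform([keyword])

    out = pd.DataFrame({'title': titles})
    out['score'] = cosine_similarity(query_vec, title_vecs).flatten()
    # Add a fixed bonus when the keyword appears literally in the title
    out['score'] += titles.str.contains(keyword, regex=False).astype(int) * boost
    out['ranks'] = out['score'].rank(method='dense', ascending=False).astype(int)
    return out.sort_values('ranks')

# Toy data for illustration only
toy_titles = pd.Series([
    'seek human resources position',
    'aspire human resources professional',
    'native english teacher',
])
ranked = rank_titles(toy_titles, 'seek human resources')
print(ranked)
```

Passing the keyword explicitly also makes the function easier to test and to call in a loop over several keywords.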

<a name="Keyword based search: Seeking Human Resources"></a>
<b>i) Keyword based search: Seeking Human Resources</b>

In [None]:
# Getting ranks for seeking human resources
shr = keyword_ranking(
    text=text_prep,
    ngram_range=(1, 3),
    boost=0.2
)
shr

Enter a keyword: seeking human resources


id,job_title,location,connection,score,ranks
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.839155,1
30,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.839155,1
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.833866,2
62,Seeking Human Resources HRIS and Generalist Positions,Greater Philadelphia Area,500+,0.614512,3
40,Seeking Human Resources HRIS and Generalist Positions,Greater Philadelphia Area,500+,0.614512,3
53,Seeking Human Resources HRIS and Generalist Positions,Greater Philadelphia Area,500+,0.614512,3
10,Seeking Human Resources HRIS and Generalist Positions,Greater Philadelphia Area,500+,0.614512,3
94,Seeking Human Resources Opportunities. Open to travel and relocation.,Amerika Birleşik Devletleri,415,0.542529,4
75,"Nortia Staffing is seeking Human Resources, Payroll & Administrative Professionals!! (408) 709-2621","San Jose, California",500+,0.501241,5
73,"Aspiring Human Resources Manager, seeking internship in Human Resources.","Houston, Texas Area",7,0.198531,6
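
The repeated rank values in the table above come from `rank(method='dense')`: tied scores share a rank and the next distinct score takes the next integer. The short sketch below contrasts this with `method='min'` on made-up scores (the values are illustrative, not taken from the dataset).

```python
import pandas as pd

# Toy scores with a tie, for illustration only
scores = pd.Series([0.84, 0.84, 0.83, 0.61])

dense = scores.rank(method='dense', ascending=False).astype(int)
minimum = scores.rank(method='min', ascending=False).astype(int)

print(dense.tolist())    # [1, 1, 2, 3]: ties share a rank and no rank is skipped
print(minimum.tolist())  # [1, 1, 3, 4]: ties share a rank and the next rank is skipped
```

Dense ranking keeps the rank column compact, which is convenient when duplicated job titles produce many ties.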


<p style="font-size:14px; line-height:1.5; margin-top:0px; margin-bottom:4px;">
<b>Including job titles with scores greater than 0.10</b></p>

In [None]:
shr_top = shr.loc[shr['score']>0.10]
shr_top

id,job_title,location,connection,score,ranks
28,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.839155,1
30,Seeking Human Resources Opportunities,"Chicago, Illinois",390,0.839155,1
99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.833866,2
62,Seeking Human Resources HRIS and Generalist Positions,Greater Philadelphia Area,500+,0.614512,3
40,Seeking Human Resources HRIS and Generalist Positions,Greater Philadelphia Area,500+,0.614512,3
53,Seeking Human Resources HRIS and Generalist Positions,Greater Philadelphia Area,500+,0.614512,3
10,Seeking Human Resources HRIS and Generalist Positions,Greater Philadelphia Area,500+,0.614512,3
94,Seeking Human Resources Opportunities. Open to travel and relocation.,Amerika Birleşik Devletleri,415,0.542529,4
75,"Nortia Staffing is seeking Human Resources, Payroll & Administrative Professionals!! (408) 709-2621","San Jose, California",500+,0.501241,5
73,"Aspiring Human Resources Manager, seeking internship in Human Resources.","Houston, Texas Area",7,0.198531,6


<a name="Keyword based search: Aspiring Human Resources"></a>
<b>ii) Keyword based search: Aspiring Human Resources</b>

In [None]:
# Getting ranks for Aspiring human resources
ahr = keyword_ranking(
    text=text_prep,
    ngram_range=(1, 3),
    boost=0.2
)
ahr

Enter a keyword: Aspiring Human Reources


id,job_title,location,connection,score,ranks
97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71,0.520733,1
3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.520733,1
33,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.520733,1
21,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.520733,1
58,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.520733,1
46,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.520733,1
17,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,0.520733,1
6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.433985,2
24,Aspiring Human Resources Specialist,Greater New York City Area,1,0.433985,2
36,Aspiring Human Resources Specialist,Greater New York City Area,1,0.433985,2


<a name="Keyword based search: Research Assistant"></a>
<b>iii) Keyword based search: Research Assistant</b>

In [None]:
# Getting ranks for Research Assistant
resass = keyword_ranking(
    text=text_prep,
    ngram_range=(1, 2),
    boost=0.2
)
resass.head(20)

Enter a keyword: Research Assistant 


id,job_title,location,connection,score,ranks
90,Undergraduate Research Assistant at Styczynski Lab,Greater Atlanta Area,155,0.77735,1
1,2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional,"Houston, Texas",85,0.0,2
75,"Nortia Staffing is seeking Human Resources, Payroll & Administrative Professionals!! (408) 709-2621","San Jose, California",500+,0.0,2
74,Human Resources Professional,Greater Boston Area,16,0.0,2
73,"Aspiring Human Resources Manager, seeking internship in Human Resources.","Houston, Texas Area",7,0.0,2
72,Business Management Major and Aspiring Human Resources Manager,"Monroe, Louisiana Area",5,0.0,2
71,"Human Resources Generalist at ScottMadden, Inc.","Raleigh-Durham, North Carolina Area",500+,0.0,2
70,"Retired Army National Guard Recruiter, office manager, seeking a position in Human Resources.","Virginia Beach, Virginia",82,0.0,2
69,"Director of Human Resources North America, Groupe Beneteau","Greater Grand Rapids, Michigan Area",500+,0.0,2
68,Human Resources Specialist at Luxottica,Greater New York City Area,500+,0.0,2


<a name="Final Notes"></a>
#### <b>Final Notes</b>

<p style="font-size:14px; line-height:1.5; margin-top:0px; margin-bottom:4px;">
<ul>
<li> The <b>TF-IDF model</b> with an exact-match keyword boost produced better rankings because it rewards job titles that contain the keywords literally, such as <i>"seeking human resources"</i> and <i>"aspiring human resources".</i></li>
<br>
<li> Results can be filtered by defining a <b>score threshold,</b> which might vary depending on the keywords. Keep it looser (e.g., score > 0.10) when the job title is common within the dataset and, conversely, tighten it when the search is for an uncommon title (e.g., research assistant).</li>
<br>
<li> Keyword-based rankings can suffer from several types of <b>bias (vocabulary, synonym, or spelling)</b>, so a mitigation technique such as fuzzy matching, which finds partial string matches, should be implemented. </li>
<br>
<li> Lastly, I tried two <b>Neural Approaches:</b> Continuous Bag of Words (CBOW) and Skip-Gram models. However, neither improved the final scores, because these techniques rely on semantic similarity rather than an exact match. As a result, additional words in a job title can dilute or reinforce its meaning when ranking against a keyword.</li>
</ul>
</p>
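
As one possible take on the fuzzy-matching mitigation mentioned above, the sketch below uses `difflib.get_close_matches` from the Python standard library to map a misspelled query word to the closest word in the corpus vocabulary before ranking. The function name `correct_query`, the toy vocabulary, and the `cutoff` value of 0.8 are assumptions for illustration, not part of the notebook.

```python
from difflib import get_close_matches

def correct_query(query, vocabulary, cutoff=0.8):
    # Replace each query word with its closest vocabulary word, if one is similar enough
    corrected = []
    for word in query.lower().split():
        match = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)
    return ' '.join(corrected)

# Toy vocabulary standing in for the words found in the preprocessed job titles
vocabulary = ['aspire', 'seek', 'human', 'resources', 'professional', 'specialist']

print(correct_query('aspire human reources', vocabulary))
```

A higher `cutoff` reduces the chance of replacing a valid but unseen word with an unrelated one, at the cost of missing some genuine misspellings, so the value would need tuning on real queries.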