# Project 3 - Potential Talents

**Goal(s):**

Determine best matching candidates based on how fit these candidates are for a given role:
- Predict how fit the candidate is based on their available information (variable fit).
- Rank candidates based on a fitness score.
- Re-rank candidates when a candidate is starred.

Attributes:

- id : unique identifier for candidate (numeric)
- job_title : job title for candidate (text)
- location : geographical location for candidate (text)
- connections: number of connections candidate has, 500+ means over 500 (text)

Output (desired target):

- fit : how fit the candidate is for the role? (numeric, probability between 0-1)

**Results:**

- We were able to rank candidates based on a fitness score.
- We were able to re-rank candidates when a candidate is starred.

**Bonus(es):**

- *We are interested in a robust algorithm, tell us how your solution works and show us how your ranking gets better with each starring action.*

After keywords are added to the initial data, the text is preprocessed: punctuation and special characters are removed, words are lemmatized, and stopwords are dropped. For the algorithm to work, the same preprocessing must be applied to the keyword chosen for the first task. The preprocessed text is then encoded as TF-IDF vectors so that cosine similarity can be computed between the keyword and each candidate's description; the more terms a candidate shares with the chosen keyword, the higher its cosine similarity score. For the second task, candidates are re-ranked using a different keyword, chosen from the description of a specific (starred) candidate. As new keywords are chosen, they become more specific to what the company is looking for, which should bring the ranking closer to the ideal candidate.
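As a rough illustration of the scoring idea, the sketch below computes cosine similarity between a keyword and two candidate descriptions using raw term counts. It is a simplification written in plain Python: it has no IDF weighting, unlike the notebook's `TfidfVectorizer`, and the function name `cosine_similarity_counts` and the sample texts are assumptions made for this example.

```python
import math
from collections import Counter

def cosine_similarity_counts(a_tokens, b_tokens):
    """Cosine similarity between two bags of words (raw counts, no IDF weighting)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical, already-preprocessed texts.
query = "aspiring human resource".split()
candidates = {
    "a": "aspiring human resource professional".split(),
    "b": "native english teacher".split(),
}
scores = {name: cosine_similarity_counts(query, toks) for name, toks in candidates.items()}
```

Candidate `a` shares all three query terms and scores well above candidate `b`, which shares none and scores 0.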

- *How can we filter out candidates which in the first place should not be in this list?*

When running the algorithm for the first time, we can keep only the top N results or eliminate the candidates whose fit score is 0.
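Both filtering options can be combined in one small helper. This is a minimal sketch in plain Python, assuming scores are available as `(id, fit)` pairs; `filter_candidates`, `top_n` and `min_fit` are hypothetical names, and the scores are illustrative values rather than notebook output.

```python
def filter_candidates(scored, top_n=None, min_fit=0.0):
    """Keep candidates with fit strictly above min_fit, then optionally the top_n best."""
    kept = [(cid, fit) for cid, fit in scored if fit > min_fit]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if top_n is None else kept[:top_n]

# Illustrative (id, fit) pairs.
scored = [(50, 0.34), (2, 0.0), (18, 0.0), (3, 0.21), (7, 0.05)]
shortlist = filter_candidates(scored, top_n=2)
non_zero = filter_candidates(scored)
```

With the notebook's DataFrame, the same idea would amount to filtering on `fit_cos_sim > 0` before taking the first N rows of the sorted result.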

- *Can we determine a cut-off point that would work for other roles without losing high potential candidates?*

The original candidates always remain in the original database. If a filter is applied for a specific search, it can be removed and/or modified for future candidate searches.

- *Do you have any ideas that we should explore so that we can even automate this procedure to prevent human bias?*


## Analysis

In [1]:
import numpy as np
import pandas as pd

from utils import *
from state_information import USA_STATES

In [2]:
file_name = 'Seeking_human_resources.csv'
df = pd.read_csv(file_name)

feature_names = df.columns

display(df.head())
df.info()

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          104 non-null    int64  
 1   job_title   104 non-null    object 
 2   location    104 non-null    object 
 3   connection  104 non-null    object 
 4   fit         0 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 4.2+ KB


Since our task is to complete the Y (fit) column, we should start by doing some data wrangling on the columns job_title, location and connection.

### Data wrangling

#### Connection

In [3]:
connection_values = ['501' if x == '500+ ' else x for x in list(df['connection'].values)]
connection_values = list(map(int, connection_values))
print(f"Values between {np.min(connection_values)}-500+")

Values between 1-500+


In this case we will re-organize data into three categories:
- `100-`: less than 100 connections
- `100-500`: between 100 and 500 connections
- `500+`: more than 500 connections

In [4]:
def correct_connection_feature(idd):
    connection = df.loc[df['id'] == idd, 'connection'].values[0]
    if connection == '500+ ': df.loc[df['id'] == idd, 'connection'] = '500+'
    else:
        if (int(connection) <= 500) and (int(connection) >= 100): df.loc[df['id'] == idd, 'connection'] = '100-500'
        else: df.loc[df['id'] == idd, 'connection'] = '100-'

In [5]:
df['id'].apply(lambda x: correct_connection_feature(x))

print('Value counts:')
df['connection'].value_counts()

Value counts:


100-       49
500+       44
100-500    11
Name: connection, dtype: int64
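The same three-way bucketing can be written as a pure function of the raw value, which avoids looking up each row by `id`. This is a sketch under two assumptions: `bucket_connection` is a hypothetical name, and a bare number above 500 (which does not occur in this data) is treated as `500+`.

```python
def bucket_connection(raw):
    """Map a raw connection value such as '85' or '500+ ' to one of three buckets."""
    value = str(raw).strip()          # the raw data carries a trailing space in '500+ '
    if value.endswith('+'):
        return '500+'
    count = int(value)
    if count > 500:                   # assumption: not present in this dataset
        return '500+'
    if count >= 100:
        return '100-500'
    return '100-'

buckets = [bucket_connection(v) for v in ['85', '500+ ', '44', '100', '500']]
```

Applied to the DataFrame, this would be `df['connection'].map(bucket_connection)`.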

#### Location

Our objective is to add information that will help detect certain patterns in the text. In this case we will add the following information:
- Add alternative spellings
- Add English translations
- Add US regions, divisions and state acronyms

We will create a new column called `location_key_words` that accumulates all keywords for each `location` row. The `location` attribute keeps a clean version of the location for presentation purposes.

In [6]:
df['location_key_words'] = df['location']

In [7]:
def add_to_key_attribute(df, original_value, new_value, clean_value, original_att='location', key_att='location_key_words'):
    df.loc[df[original_att] == original_value, key_att] = df.loc[df[original_att] == original_value, key_att] + ', ' + new_value
    df.loc[df[original_att] == original_value, original_att] = clean_value

In [8]:
add_to_key_attribute(df, original_value='Kanada', new_value='Canada', clean_value='Canada')
add_to_key_attribute(df, original_value='İzmir, Türkiye', new_value='Izmir, Turkey', clean_value='Izmir, Turkey')
add_to_key_attribute(df, original_value='Amerika Birleşik Devletleri', new_value='United States of America, USA, US', clean_value='USA')

Add state acronyms and regions. According to the US Census Bureau, the US states can be divided into 4 regions:
<br><https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf>
<br><https://www.scouting.org/resources/los/states/>
	 
| West | Midwest | South | Northeast |
|-----------|---------|-------|------|
| Arizona | Indiana | Delaware | Connecticut |
| Colorado | Illinois | District of Columbia | Maine |
| Idaho | Michigan | Florida | Massachusetts |
| New Mexico | Ohio | Georgia | New Hampshire |
| Montana | Wisconsin | Maryland | Rhode Island |
| Utah | Iowa | North Carolina | Vermont |
| Nevada | Kansas | South Carolina | New Jersey |
| Wyoming | Minnesota | Virginia | New York |
| Alaska | Missouri | West Virginia | Pennsylvania |
| California | Nebraska | Alabama |  |
| Hawaii | North Dakota | Kentucky |  |
| Oregon | South Dakota | Mississippi |  |
| Washington |  | Tennessee |  |
|  |  | Arkansas |  |
|  |  | Louisiana |  |
|  |  | Oklahoma |  |
|  |  | Texas |  |
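The table can be turned into a lookup that also handles multi-word state names. The sketch below uses a small excerpt of the table; `STATE_REGION` and `find_state` are hypothetical names and are not part of the `state_information` module used by the notebook.

```python
# Small excerpt of the table above; a full version would list every state.
STATE_REGION = {
    'Texas': 'South',
    'North Carolina': 'South',
    'South Carolina': 'South',
    'Virginia': 'South',
    'West Virginia': 'South',
    'California': 'West',
    'Illinois': 'Midwest',
    'New York': 'Northeast',
    'Massachusetts': 'Northeast',
}

def find_state(location):
    """Return (state, region) for the longest state name found in location, else None."""
    # Longest names first, so 'West Virginia' is tried before 'Virginia'.
    for state in sorted(STATE_REGION, key=len, reverse=True):
        if state in location:
            return state, STATE_REGION[state]
    return None

matches = [find_state(loc) for loc in
           ['Houston, Texas', 'Myrtle Beach, South Carolina Area', 'Kanada']]
```

Matching on the whole location string, rather than on space-separated tokens, is what lets two-word names such as `South Carolina` be found without listing them as special cases.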

In [9]:
for loc in df['location'].value_counts().index:

    split = loc.split(' ') # split location value by ' ' to find state key words

    # Special cases: states with two-word names, or cities given without their state.
    # In the latter case, split is replaced by the name of the state the city belongs to.
    if 'North Carolina' in loc: split = ['North Carolina']
    elif 'New York' in loc: split = ['New York']
    elif ('San Francisco' in loc) and ('California' not in loc): split = ['California']
    elif ('Philadelphia' in loc) and ('Pennsylvania' not in loc): split = ['Pennsylvania']
    elif ('Chicago' in loc) and ('Illinois' not in loc): split = ['Illinois']
    elif ('Los Angeles' in loc) and ('California' not in loc): split = ['California']
    elif ('Dallas/Fort Worth' in loc) and ('Texas' not in loc): split = ['Texas']
    elif ('Boston' in loc) and ('Massachusetts' not in loc): split = ['Massachusetts']

    # For a special case, split has length 1 and its value should be a USA_STATES key
    if (len(split) == 1) and (split[0] in USA_STATES.keys()):
        nv = USA_STATES[split[0]]['Standard'] + ', ' + USA_STATES[split[0]]['Postal'] + ', ' + USA_STATES[split[0]]['Region']
        add_to_key_attribute(df, original_value=loc, new_value=nv, clean_value=loc)

    # split has length 1 but its value is not a USA_STATES key: nothing to add
    elif (len(split) == 1) and (split[0] not in USA_STATES.keys()): pass

    # Otherwise split has more than one element: iterate over every token in split
    # and add the state information for each token that is a USA_STATES key
    else:
        for s in split:
            if s in USA_STATES.keys():
                nv = USA_STATES[s]['Standard'] + ', ' + USA_STATES[s]['Postal'] + ', ' + USA_STATES[s]['Region']
                add_to_key_attribute(df, original_value=loc, new_value=nv, clean_value=loc)

df.head()  # TODO: should duplicates in location_key_words be eliminated?

Unnamed: 0,id,job_title,location,connection,fit,location_key_words
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",100-,,"Houston, Texas, Texas, TX, South, S, West Sout..."
1,2,Native English Teacher at EPIK (English Progra...,Canada,500+,,"Kanada, Canada"
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",100-,,"Raleigh-Durham, North Carolina Area, N.C., NC,..."
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,,"Denton, Texas, Texas, TX, South, S, West South..."
4,5,Advisory Board Member at Celal Bayar University,"Izmir, Turkey",500+,,"İzmir, Türkiye, Izmir, Turkey"
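The output shows repeated entries in `location_key_words`, such as `Texas, Texas`. If a cleaner column is wanted, duplicates can be dropped while keeping the original order. This is a sketch; `dedupe_keywords` is a hypothetical helper, and comparing case-insensitively is an assumption.

```python
def dedupe_keywords(key_words):
    """Drop repeated comma-separated keywords, keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for word in (w.strip() for w in key_words.split(',')):
        key = word.lower()            # assumption: 'Texas' and 'texas' count as duplicates
        if word and key not in seen:
            seen.add(key)
            unique.append(word)
    return ', '.join(unique)

cleaned = dedupe_keywords('Houston, Texas, Texas, TX, South, S')
```

Because `preprocess_text` later passes the tokens through `set`, duplicates here should not change the ranking; removing them mainly helps readability.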


Add alternative words for some cities:

In [10]:
for loc in df['location'].value_counts().index:
    if 'New York City' in loc:   add_to_key_attribute(df, original_value=loc, new_value='NYC', clean_value=loc)
    if 'Philadelphia' in loc:   add_to_key_attribute(df, original_value=loc, new_value='Philly', clean_value=loc)
    if 'Los Angeles' in loc:   add_to_key_attribute(df, original_value=loc, new_value='LA', clean_value=loc)

We can now convert all words into lower case:

In [11]:
df[['job_title', 'location', 'location_key_words']] = df[['job_title', 'location', 'location_key_words']].applymap(lambda x: x.lower())
df.head()

Unnamed: 0,id,job_title,location,connection,fit,location_key_words
0,1,2019 c.t. bauer college of business graduate (...,"houston, texas",100-,,"houston, texas, texas, tx, south, s, west sout..."
1,2,native english teacher at epik (english progra...,canada,500+,,"kanada, canada"
2,3,aspiring human resources professional,"raleigh-durham, north carolina area",100-,,"raleigh-durham, north carolina area, n.c., nc,..."
3,4,people development coordinator at ryan,"denton, texas",500+,,"denton, texas, texas, tx, south, s, west south..."
4,5,advisory board member at celal bayar university,"izmir, turkey",500+,,"i̇zmir, türkiye, izmir, turkey"


#### Job title

We will add some abbreviations to this column.

In [12]:
def add_abbreviations_to_job_attribute(df, original_value, new_value):
    df.loc[df['job_title'] == original_value, 'job_title'] = df.loc[df['job_title'] == original_value, 'job_title'] + ', ' + new_value

In [13]:
for job in df['job_title']:
    if 'hr' in job: add_abbreviations_to_job_attribute(df, job, 'human resources')
    if 'human resources' in job: add_abbreviations_to_job_attribute(df, job, 'hr')
    if ('sr.' in job) or ('sr ' in job): add_abbreviations_to_job_attribute(df, job, 'senior')
    if 'senior' in job: add_abbreviations_to_job_attribute(df, job, 'sr')
    if ('jr.' in job) or ('jr ' in job): add_abbreviations_to_job_attribute(df, job, 'junior')
    if 'junior' in job: add_abbreviations_to_job_attribute(df, job, 'jr')
    if 'entry-level' in job: add_abbreviations_to_job_attribute(df, job, 'entry level')
    if 'business intelligence' in job: add_abbreviations_to_job_attribute(df, job, 'bi')
    if 'bi' in job: add_abbreviations_to_job_attribute(df, job, 'business intelligence')
    if 'information systems' in job: add_abbreviations_to_job_attribute(df, job, 'it')
    if 'it' in job: add_abbreviations_to_job_attribute(df, job, 'information technology systems')
    if 'engineer' in job: add_abbreviations_to_job_attribute(df, job, 'eng')
    if 'bachelor of science' in job: add_abbreviations_to_job_attribute(df, job, 'bs')
    if 'hris' in job: add_abbreviations_to_job_attribute(df, job, 'human resources information systems hr it technology')
    if 'gis' in job: add_abbreviations_to_job_attribute(df, job, 'geographic information system it technology')
    if 'rrp' in job: add_abbreviations_to_job_attribute(df, job, 'recommended retail price')
    if 'mes' in job: add_abbreviations_to_job_attribute(df, job, 'manufacturing execution system')
    if 'svp' in job: add_abbreviations_to_job_attribute(df, job, 'senior vice president sr')
    if 'senior vice president' in job: add_abbreviations_to_job_attribute(df, job, 'svp sr')
    if 'chief human resources officer' in job: add_abbreviations_to_job_attribute(df, job, 'chro hr')
    if 'chro' in job: add_abbreviations_to_job_attribute(df, job, 'chief human resources officer hr')
    if 'csr' in job: add_abbreviations_to_job_attribute(df, job, 'corporate social responsibility')
    if 'corporate social responsibility' in job: add_abbreviations_to_job_attribute(df, job, 'csr')
    if 'gphr' in job: add_abbreviations_to_job_attribute(df, job, 'global professional in human resources hr')
    if 'global professional in human resources' in job: add_abbreviations_to_job_attribute(df, job, 'gphr hr')
    if 'sphr' in job: add_abbreviations_to_job_attribute(df, job, 'senior professional in human resources hr sr')
    if 'senior professional in human resources' in job: add_abbreviations_to_job_attribute(df, job, 'sphr hr sr')

df.head()

Unnamed: 0,id,job_title,location,connection,fit,location_key_words
0,1,2019 c.t. bauer college of business graduate (...,"houston, texas",100-,,"houston, texas, texas, tx, south, s, west sout..."
1,2,native english teacher at epik (english progra...,canada,500+,,"kanada, canada"
2,3,"aspiring human resources professional, hr, hum...","raleigh-durham, north carolina area",100-,,"raleigh-durham, north carolina area, n.c., nc,..."
3,4,people development coordinator at ryan,"denton, texas",500+,,"denton, texas, texas, tx, south, s, west south..."
4,5,advisory board member at celal bayar universit...,"izmir, turkey",500+,,"i̇zmir, türkiye, izmir, turkey"
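One caveat about the loop above: plain substring checks also match inside longer words. For example, `'it' in 'university'` is true, which appears to be why row 4 gained extra text. A whole-word check avoids this. The sketch below assumes lowercase input, as in the notebook; `has_term` is a hypothetical helper.

```python
import re

def has_term(text, term):
    """True if term occurs in text as a whole word or phrase, not inside a longer word."""
    # Lookarounds on letters are used instead of \b so that terms ending in
    # punctuation, such as 'sr.', still match.
    pattern = r'(?<![a-z])' + re.escape(term) + r'(?![a-z])'
    return re.search(pattern, text) is not None

title = 'advisory board member at celal bayar university'
substring_hit = 'it' in title          # True: matches inside 'university'
whole_word_hit = has_term(title, 'it') # False: no standalone 'it'
```

Each `if '<abbr>' in job` test could then be written as `if has_term(job, '<abbr>')`, though this would change which rows receive extra keywords and therefore the resulting scores.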


In [14]:
# join the three text fields with spaces so that adjacent words are not glued together
df['key_words'] = df['job_title'] + ' ' + df['connection'] + ' ' + df['location_key_words']

### Data preprocessing and word embeddings

Having added the keywords, we continue with data preprocessing. The text has already been converted to lower case; we will also apply lemmatization and remove stopwords.

In [15]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
from sklearn.feature_extraction.text import  TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


In [16]:
def preprocess_text(df):

    df = df.apply(lambda x: word_tokenize(x.lower()))   # tokenize: split each text into a list of lowercase words
    # replace everything except letters, '*' and '$' with spaces, re-tokenize, and keep only unique tokens
    df = df.apply(lambda x: list(set(word_tokenize(re.sub('[^A-Za-z*$]', ' ', str(x))))))
    
    lemmatizer = WordNetLemmatizer()
    df = df.apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
    df = df.apply(lambda x: ' '.join(x))

    return df

def get_encodings(corpus):
    from nltk.corpus import stopwords
    stop_words = stopwords.words('english')   # English stopword list, removed by the vectorizer
    tfvectoriser = TfidfVectorizer(stop_words=stop_words)
    encoding = tfvectoriser.fit_transform(corpus)
    # DataFrame with the TF-IDF value of each token (columns) for each row in the data
    encoding_df = pd.DataFrame(encoding.toarray(), columns=tfvectoriser.get_feature_names_out())
    return encoding, encoding_df

def rank_candidates(keyword, df, col):
    corpus = df[col].tolist()
    preprocessed_keyword = preprocess_text(pd.Series(keyword))
    corpus.append(preprocessed_keyword.tolist()[0])

    encoding, encoding_df = get_encodings(corpus)

    cos_sim = cosine_similarity(encoding.toarray()[:encoding_df.shape[0]-1], encoding.toarray()[encoding_df.shape[0]-1].reshape(1, -1))

    df['fit_cos_sim'] = cos_sim
    final_df = df.sort_values('fit_cos_sim', ascending=False)
    return final_df

### First task
Determine the best-matching candidates for a given role, based on some keywords.

In [17]:
df['key_words'] = preprocess_text(df['key_words'])

keyword = 'Aspiring human resources'
ranked_candidates = rank_candidates(keyword, df, 'key_words')
ranked_candidates

Unnamed: 0,id,job_title,location,connection,fit,location_key_words,key_words,fit_cos_sim
49,50,student at humber college and aspiring human r...,canada,100-,,"kanada, canada",canada student hr generalist human humber kana...,0.343088
38,39,student at humber college and aspiring human r...,canada,100-,,"kanada, canada",canada student hr generalist human humber kana...,0.343088
24,25,student at humber college and aspiring human r...,canada,100-,,"kanada, canada",canada student hr generalist human humber kana...,0.343088
51,52,student at humber college and aspiring human r...,canada,100-,,"kanada, canada",canada student hr generalist human humber kana...,0.343088
6,7,student at humber college and aspiring human r...,canada,100-,,"kanada, canada",canada student hr generalist human humber kana...,0.343088
...,...,...,...,...,...,...,...,...
19,20,native english teacher at epik (english progra...,canada,500+,,"kanada, canada",in canada english kanada program teacher epik ...,0.000000
1,2,native english teacher at epik (english progra...,canada,500+,,"kanada, canada",in canada english kanada program teacher epik ...,0.000000
17,18,people development coordinator at ryan,"denton, texas",500+,,"denton, texas, texas, tx, south, s, west south...",denton s texas ryan south central people devel...,0.000000
79,80,"junior mes engineer| information systems, jr","myrtle beach, south carolina area",100-,,"myrtle beach, south carolina area",junior south jr me system carolina area myrtle...,0.000000


### Second task
Once the fitting candidates have been listed and ranked, the company generally follows a manual review procedure in which a candidate other than the top-ranked one may be chosen. In that case, the company needs to re-rank the previous list based on the keywords found in the newly chosen candidate.

In [18]:
candidate_id = 0  # DataFrame index label of the starred candidate (here, the row with id 1)
new_ranked_candidates = rank_candidates(ranked_candidates.loc[candidate_id, 'key_words'], df, 'key_words')
new_ranked_candidates

Unnamed: 0,id,job_title,location,connection,fit,location_key_words,key_words,fit_cos_sim
0,1,2019 c.t. bauer college of business graduate (...,"houston, texas",100-,,"houston, texas, texas, tx, south, s, west sout...",laude s bauer professional texas resource west...,1.0
43,44,2019 c.t. bauer college of business graduate (...,"houston, texas",100-,,"houston, texas, texas, tx, south, s, west sout...",laude s bauer professional texas resource west...,1.0
56,57,2019 c.t. bauer college of business graduate (...,"houston, texas",100-,,"houston, texas, texas, tx, south, s, west sout...",laude s bauer professional texas resource west...,1.0
14,15,2019 c.t. bauer college of business graduate (...,"houston, texas",100-,,"houston, texas, texas, tx, south, s, west sout...",laude s bauer professional texas resource west...,1.0
30,31,2019 c.t. bauer college of business graduate (...,"houston, texas",100-,,"houston, texas, texas, tx, south, s, west sout...",laude s bauer professional texas resource west...,1.0
...,...,...,...,...,...,...,...,...
22,23,advisory board member at celal bayar universit...,"izmir, turkey",500+,,"i̇zmir, türkiye, izmir, turkey",board university i system turkey celal member ...,0.0
94,95,"student at westfield state university, informa...","bridgewater, massachusetts",100-,,"bridgewater, massachusetts, mass., ma, northea...",student ma massachusetts university ne westfie...,0.0
84,85,rrp brand portfolio executive at jti (japan to...,greater philadelphia area,500+,,"greater philadelphia area, p.a, pa, northeast,...",recommended retail price p northeast area at j...,0.0
89,90,undergraduate research assistant at styczynski...,greater atlanta area,100-500,,greater atlanta area,lab styczynski area assistant atlanta greater ...,0.0
