# Project 3 - Potential Talents

**Goal(s):**

Determine best matching candidates based on how fit these candidates are for a given role:
- Predict how fit the candidate is based on their available information (variable fit).
- Rank candidates based on a fitness score.
- Re-rank candidates when a candidate is starred.

Attributes:

- id : unique identifier for candidate (numeric)
- job_title : job title for candidate (text)
- location : geographical location for candidate (text)
- connections: number of connections candidate has, 500+ means over 500 (text)

Output (desired target):

- fit : how fit the candidate is for the role? (numeric, probability between 0-1)

**Results:**



## Analysis

In [1]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from utils import *
from state_information import USA_STATES

In [2]:
file_name = 'Seeking_human_resources.csv'
df = pd.read_csv(file_name)
df = df.rename({'fit': 'Y'}, axis=1)

feature_names = df.columns

display(df.head())
df.info()

Unnamed: 0,id,job_title,location,connection,Y
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          104 non-null    int64  
 1   job_title   104 non-null    object 
 2   location    104 non-null    object 
 3   connection  104 non-null    object 
 4   Y           0 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 4.2+ KB


Since our task is to fill in the `Y` (fit) column, we start with some data wrangling on the `job_title`, `location` and `connection` columns.

### Data wrangling

#### Connection

In [3]:
connection_values = ['501' if x == '500+ ' else x for x in list(df['connection'].values)]
connection_values = list(map(int, connection_values))
print(f"Values between {np.min(connection_values)}-500+")

Values between 1-500+


In this case we will re-organize data into three categories:
- `100-`: less than 100 connections
- `100-500`: between 100 and 500 connections
- `500+`: more than 500 connections

In [4]:
for idd in df['id']:
    connection = df.loc[df['id'] == idd, 'connection'].values[0]
    if connection == '500+ ': df.loc[df['id'] == idd, 'connection'] = '500+'
    else:
        if (int(connection) <= 500) and (int(connection) >= 100): df.loc[df['id'] == idd, 'connection'] = '100-500'
        else: df.loc[df['id'] == idd, 'connection'] = '100-'

print('Value counts:')
df['connection'].value_counts()

Value counts:


100-       49
500+       44
100-500    11
Name: connection, dtype: int64
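
The loop above can also be written as a small pure function that is applied to the whole column at once. The following is only a sketch: the name `bin_connections` is hypothetical, and, unlike the loop above, it sends any numeric count above 500 to `500+` (an assumption; no such value appears in this dataset).

```python
def bin_connections(raw: str) -> str:
    """Map a raw connection value to '100-', '100-500' or '500+'."""
    value = raw.strip()          # the raw file stores '500+ ' with a trailing space
    if value.endswith('+'):
        return '500+'
    count = int(value)
    if count < 100:
        return '100-'
    if count <= 500:
        return '100-500'
    return '500+'                # assumption: numeric counts above 500 also mean '500+'


print([bin_connections(v) for v in ['85', '44', '100', '500', '500+ ']])
```

With the DataFrame used here, this could be applied to the raw column as `df['connection'] = df['connection'].map(bin_connections)`.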

#### Location

In [5]:
df['location'].value_counts()

Kanada                                 12
Raleigh-Durham, North Carolina Area     8
Houston, Texas Area                     8
Greater New York City Area              7
Houston, Texas                          7
Denton, Texas                           6
San Francisco Bay Area                  5
Greater Philadelphia Area               5
İzmir, Türkiye                          4
Lake Forest, California                 4
Atlanta, Georgia                        4
Chicago, Illinois                       2
Austin, Texas Area                      2
Greater Atlanta Area                    2
Amerika Birleşik Devletleri             2
Long Beach, California                  1
Milpitas, California                    1
Greater Chicago Area                    1
Torrance, California                    1
Greater Los Angeles Area                1
Bridgewater, Massachusetts              1
Lafayette, Indiana                      1
Kokomo, Indiana Area                    1
Las Vegas, Nevada Area            

Our objective is to add information that will help detect certain patterns in the text. In this case we will add the following:
- Alternative spellings
- English translations
- US regions, divisions and state acronyms

We will create a new column called `location_key_words` that accumulates all keywords for each `location` row. The `location` attribute itself will keep a clean version of the location for presentation purposes.

In [6]:
df['location_key_words'] = df['location']

In [7]:
def add_to_key_attribute(df, original_value, new_value, clean_value, original_att='location', key_att='location_key_words'):
    df.loc[df[original_att] == original_value, key_att] = df.loc[df[original_att] == original_value, key_att] + ', ' + new_value
    df.loc[df[original_att] == original_value, original_att] = clean_value

In [8]:
add_to_key_attribute(df, original_value='Kanada', new_value='Canada', clean_value='Canada')
add_to_key_attribute(df, original_value='İzmir, Türkiye', new_value='Izmir, Turkey', clean_value='Izmir, Turkey')
add_to_key_attribute(df, original_value='Amerika Birleşik Devletleri', new_value='United States of America, USA, US', clean_value='USA')

Next we add state acronyms and regions. According to the US Census Bureau, the US states can be divided into 4 regions:
<br><https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf>
<br><https://www.scouting.org/resources/los/states/>

| West | Midwest | South | Northeast |
|-----------|---------|-------|------|
| Arizona | Indiana | Delaware | Connecticut |
| Colorado | Illinois | District of Columbia | Maine |
| Idaho | Michigan | Florida | Massachusetts |
| New Mexico | Ohio | Georgia | New Hampshire |
| Montana | Wisconsin | Maryland | Rhode Island |
| Utah | Iowa | North Carolina | Vermont |
| Nevada | Kansas | South Carolina | New Jersey |
| Wyoming | Minnesota | Virginia | New York |
| Alaska | Missouri | West Virginia | Pennsylvania |
| California | Nebraska | Alabama |  |
| Hawaii | North Dakota | Kentucky |  |
| Oregon | South Dakota | Mississippi |  |
| Washington |  | Tennessee |  |
|  |  | Arkansas |  |
|  |  | Louisiana |  |
|  |  | Oklahoma |  |
|  |  | Texas |  |
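
The `USA_STATES` dictionary imported from `state_information` is not shown in this notebook. Judging only from how it is indexed in the next cell, each state name maps to a record with `Standard`, `Postal` and `Region` fields. The excerpt below is a hypothetical stand-in with illustrative values (the real `Region` strings appear to hold more keywords), together with a hypothetical helper `state_key_words` that builds the keyword string.

```python
# Hypothetical excerpt: the real USA_STATES lives in state_information.py
# and may contain different or richer values.
USA_STATES_SKETCH = {
    'Texas': {'Standard': 'Texas', 'Postal': 'TX', 'Region': 'South'},
    'North Carolina': {'Standard': 'N.C.', 'Postal': 'NC', 'Region': 'South'},
    'California': {'Standard': 'Calif.', 'Postal': 'CA', 'Region': 'West'},
}


def state_key_words(state, table=USA_STATES_SKETCH):
    """Return 'Standard, Postal, Region' for a known state, else None."""
    info = table.get(state)
    if info is None:
        return None
    return ', '.join([info['Standard'], info['Postal'], info['Region']])


print(state_key_words('Texas'))
print(state_key_words('Kanada'))
```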

In [9]:
for loc in df['location'].value_counts().index:

    split = loc.split(' ') # split location value by ' ' to find state key words

    # Special cases include states with two words or cities.
    # In the latter case, the split value was replaced by the state name where that city belongs to
    if 'North Carolina' in loc: split = ['North Carolina']
    elif 'New York' in loc: split = ['New York']
    elif ('San Francisco' in loc) and ('California' not in loc): split = ['California']
    elif ('Philadelphia' in loc) and ('Pennsylvania' not in loc): split = ['Pennsylvania']
    elif ('Chicago' in loc) and ('Illinois' not in loc): split = ['Illinois']
    elif ('Los Angeles' in loc) and ('California' not in loc): split = ['California']
    elif ('Dallas/Fort Worth' in loc) and ('Texas' not in loc): split = ['Texas']
    elif ('Boston' in loc) and ('Massachusetts' not in loc): split = ['Massachusetts']

    # If it is a special case, split has length 1 and its value should correspond to one USA_STATES key
    if (len(split) == 1) and (split[0] in USA_STATES.keys()):
        nv = USA_STATES[split[0]]['Standard'] + ', ' + USA_STATES[split[0]]['Postal'] + ', ' + USA_STATES[split[0]]['Region']
        add_to_key_attribute(df, original_value=loc, new_value=nv, clean_value=loc)

    # In this case split has length 1 but its value does not correspond to any USA_STATES key
    elif (len(split) == 1) and (split[0] not in USA_STATES.keys()): pass

    # If it isn't a special case, split has more than one element. We iterate through all keywords stored in split.
    # If a keyword corresponds to a USA_STATES key, we call add_to_key_attribute
    else:
        for s in split:
            if s in USA_STATES.keys():
                nv = USA_STATES[s]['Standard'] + ', ' + USA_STATES[s]['Postal'] + ', ' + USA_STATES[s]['Region']
                add_to_key_attribute(df, original_value=loc, new_value=nv, clean_value=loc)

df.head()  # TODO: should I eliminate duplicates in location_key_words?

Unnamed: 0,id,job_title,location,connection,Y,location_key_words
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",100-,,"Houston, Texas, Texas, TX, South, S, West Sout..."
1,2,Native English Teacher at EPIK (English Progra...,Canada,500+,,"Kanada, Canada"
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",100-,,"Raleigh-Durham, North Carolina Area, N.C., NC,..."
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,,"Denton, Texas, Texas, TX, South, S, West South..."
4,5,Advisory Board Member at Celal Bayar University,"Izmir, Turkey",500+,,"İzmir, Türkiye, Izmir, Turkey"
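
The output above shows repeated keywords such as `Texas, Texas`. If duplicates turn out to be unwanted, one possible way to drop them while keeping the original order is sketched below. The helper name `dedupe_keywords` is hypothetical, and the sketch assumes that keywords are separated by commas.

```python
def dedupe_keywords(key_words: str, sep: str = ', ') -> str:
    """Remove repeated comma-separated keywords, keeping first occurrences in order."""
    # dict.fromkeys preserves insertion order and drops later duplicates
    unique = dict.fromkeys(part.strip() for part in key_words.split(','))
    return sep.join(k for k in unique if k)


print(dedupe_keywords('Houston, Texas, Texas, TX, South'))
```

It could be applied to the column with `df['location_key_words'].map(dedupe_keywords)`. Note that removing duplicates changes term counts, which matters if a count-based vectorizer is used later.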


Add alternative words for some cities:

In [10]:
for loc in df['location'].value_counts().index:
    if 'New York City' in loc:   add_to_key_attribute(df, original_value=loc, new_value='NYC', clean_value=loc)
    if 'Philadelphia' in loc:   add_to_key_attribute(df, original_value=loc, new_value='Philly', clean_value=loc)
    if 'Los Angeles' in loc:   add_to_key_attribute(df, original_value=loc, new_value='LA', clean_value=loc)

We can now convert all words into lower case:

In [11]:
text_columns = ['job_title', 'location', 'location_key_words']
df[text_columns] = df[text_columns].apply(lambda col: col.str.lower())  # column-wise; avoids the deprecated DataFrame.applymap
df.head()

Unnamed: 0,id,job_title,location,connection,Y,location_key_words
0,1,2019 c.t. bauer college of business graduate (...,"houston, texas",100-,,"houston, texas, texas, tx, south, s, west sout..."
1,2,native english teacher at epik (english progra...,canada,500+,,"kanada, canada"
2,3,aspiring human resources professional,"raleigh-durham, north carolina area",100-,,"raleigh-durham, north carolina area, n.c., nc,..."
3,4,people development coordinator at ryan,"denton, texas",500+,,"denton, texas, texas, tx, south, s, west south..."
4,5,advisory board member at celal bayar university,"izmir, turkey",500+,,"i̇zmir, türkiye, izmir, turkey"


#### Job title

In [12]:
df['job_title'].value_counts()

2019 c.t. bauer college of business graduate (magna cum laude) and aspiring human resources professional                 7
aspiring human resources professional                                                                                    7
student at humber college and aspiring human resources generalist                                                        7
people development coordinator at ryan                                                                                   6
native english teacher at epik (english program in korea)                                                                5
aspiring human resources specialist                                                                                      5
hr senior specialist                                                                                                     5
student at chapman university                                                                                            4
svp, chro, marke

We will add some abbreviations to this column.

In [13]:
def add_abbreviations_to_job_attribute(df, original_value, new_value):
    df.loc[df['job_title'] == original_value, 'job_title'] = df.loc[df['job_title'] == original_value, 'job_title'] + ', ' + new_value

In [14]:
for job in df['job_title']:
    if 'hr' in job: add_abbreviations_to_job_attribute(df, job, 'human resources')
    if 'human resources' in job: add_abbreviations_to_job_attribute(df, job, 'hr')
    if ('sr.' in job) or ('sr ' in job): add_abbreviations_to_job_attribute(df, job, 'senior')
    if 'senior' in job: add_abbreviations_to_job_attribute(df, job, 'sr')
    if ('jr.' in job) or ('jr ' in job): add_abbreviations_to_job_attribute(df, job, 'junior')
    if 'junior' in job: add_abbreviations_to_job_attribute(df, job, 'jr')
    if 'entry-level' in job: add_abbreviations_to_job_attribute(df, job, 'entry level')
    if 'business intelligence' in job: add_abbreviations_to_job_attribute(df, job, 'bi')
    if 'bi' in job: add_abbreviations_to_job_attribute(df, job, 'business intelligence')
    if 'information systems' in job: add_abbreviations_to_job_attribute(df, job, 'it')
    if 'it' in job: add_abbreviations_to_job_attribute(df, job, 'information technology systems')
    if 'engineer' in job: add_abbreviations_to_job_attribute(df, job, 'eng')
    if 'bachelor of science' in job: add_abbreviations_to_job_attribute(df, job, 'bs')
    if 'hris' in job: add_abbreviations_to_job_attribute(df, job, 'human resources information systems hr it technology')
    if 'gis' in job: add_abbreviations_to_job_attribute(df, job, 'geographic information system it technology')
    if 'rrp' in job: add_abbreviations_to_job_attribute(df, job, 'recommended retail price')
    if 'mes' in job: add_abbreviations_to_job_attribute(df, job, 'manufacturing execution system')
    if 'svp' in job: add_abbreviations_to_job_attribute(df, job, 'senior vice president sr')
    if 'senior vice president' in job: add_abbreviations_to_job_attribute(df, job, 'svp sr')
    if 'chief human resources officer' in job: add_abbreviations_to_job_attribute(df, job, 'chro hr')
    if 'chro' in job: add_abbreviations_to_job_attribute(df, job, 'chief human resources officer hr')
    if 'csr' in job: add_abbreviations_to_job_attribute(df, job, 'corporate social responsibility')
    if 'corporate social responsibility' in job: add_abbreviations_to_job_attribute(df, job, 'csr')
    if 'gphr' in job: add_abbreviations_to_job_attribute(df, job, 'global professional in human resources hr')
    if 'global professional in human resources' in job: add_abbreviations_to_job_attribute(df, job, 'gphr hr')
    if 'sphr' in job: add_abbreviations_to_job_attribute(df, job, 'senior professional in human resources hr sr')
    if 'senior professional in human resources' in job: add_abbreviations_to_job_attribute(df, job, 'sphr hr sr')

df.head()

Unnamed: 0,id,job_title,location,connection,Y,location_key_words
0,1,2019 c.t. bauer college of business graduate (...,"houston, texas",100-,,"houston, texas, texas, tx, south, s, west sout..."
1,2,native english teacher at epik (english progra...,canada,500+,,"kanada, canada"
2,3,"aspiring human resources professional, hr, hum...","raleigh-durham, north carolina area",100-,,"raleigh-durham, north carolina area, n.c., nc,..."
3,4,people development coordinator at ryan,"denton, texas",500+,,"denton, texas, texas, tx, south, s, west south..."
4,5,advisory board member at celal bayar universit...,"izmir, turkey",500+,,"i̇zmir, türkiye, izmir, turkey"
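
The checks above use plain substring tests, so `'it' in job` is also true for titles containing "university", and `'hr' in job` is true for "chro". A possible refinement, sketched below, matches abbreviations only as whole words. The table `ABBREVIATIONS` and the function `expand_abbreviations` are hypothetical and cover just a few of the pairs listed above.

```python
import re

# Hypothetical subset of the abbreviation pairs used in the notebook.
ABBREVIATIONS = {
    'hr': 'human resources',
    'bi': 'business intelligence',
    'it': 'information technology',
    'svp': 'senior vice president',
}


def expand_abbreviations(title, table=ABBREVIATIONS):
    """Append the long form of every abbreviation that appears as a whole word."""
    extras = [
        full
        for abbr, full in table.items()
        if re.search(rf'\b{re.escape(abbr)}\b', title) and full not in title
    ]
    return ', '.join([title, *extras])


print(expand_abbreviations('hr senior specialist'))
print(expand_abbreviations('student at chapman university'))
```

Because the function works on one string at a time, it can be applied with `df['job_title'].map(expand_abbreviations)`, which avoids modifying the column while iterating over it.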


Now that the keywords have been added, we continue with the text preprocessing. The words are already in lower case; we will also apply lemmatization and remove stopwords.

In [27]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize  # roughly: word_tokenize splits text into words, sent_tokenize into sentences
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

In [43]:
def preprocess_text(df):

    from nltk.corpus import stopwords
    # print(df[0])
    print('[INFO] Splitting sentence into array...')
    df = df.apply(lambda x: word_tokenize(x.lower()))   # tokenize: split sentence into array of words
    # print(df[0])
    
    print('[INFO] Removing special characters and punctuation...')
    df = df.apply(lambda x: list(set(word_tokenize(re.sub('[^A-Za-z*$]', ' ', str(x)))))) # remove special characters and digits
    # print(df[0])

    print('[INFO] Lemmatizing words...')
    lemmatizer = WordNetLemmatizer()
    df = df.apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
    # print(df[0])
    
    stopwords = stopwords.words('english')
    # df = df.apply(lambda x: [word for word in x if word not in stopwords])
    # print(df[0])

    print('[INFO] Removing stopwords and generating embeddings vector...')
    df = df.apply(lambda x: ' '.join(x))
    # display(df[0])

    vectorizer = CountVectorizer(stop_words = stopwords, lowercase = False, strip_accents = 'unicode')
    encoding = vectorizer.fit_transform(df)
    encoding_df = pd.DataFrame(encoding.todense(), columns = vectorizer.get_feature_names_out())
    # display(encoding_df)

    vectorizer = TfidfTransformer()
    encoding = vectorizer.fit_transform(encoding_df)
    encoding_df = pd.DataFrame(encoding.todense(), columns = vectorizer.get_feature_names_out())

    vectorizer = TfidfVectorizer(stop_words=stopwords)  # min_df and max_df can drop words that appear too rarely or too frequently
    encoding = vectorizer.fit_transform(df)
    encoding_df = pd.DataFrame(encoding.todense(), columns = vectorizer.get_feature_names_out())

    # If you need the term frequency (term count) vectors for different tasks, use TfidfTransformer.
    # If you need to compute tf-idf scores on documents within your “training” dataset, use TfidfVectorizer.
    # If you need to compute tf-idf scores on documents outside your “training” dataset, either one will work.
    # https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.Y5UtcdKZPb0

    return encoding, encoding_df

In [45]:
encoding, encoding_df = preprocess_text(df['location_key_words'])
query_encoding, query_encoding_df = preprocess_text(pd.Series('HR Specialist'))

[INFO] Splitting sentence into array...
[INFO] Removing special characters and punctuation...
[INFO] Lemmatizing words...
[INFO] Removing stopwords and generating embeddings vector...
[INFO] Splitting sentence into array...
[INFO] Removing special characters and punctuation...
[INFO] Lemmatizing words...
[INFO] Removing stopwords and generating embeddings vector...


In [49]:
from numpy.linalg import norm

for entry in df['location_key_words']:
    encoding, encoding_df = preprocess_text(pd.Series(entry))
    cosine = np.dot(query_encoding, encoding)/(norm(query_encoding)*norm(encoding))
    print("Cosine Similarity:", cosine)

[INFO] Splitting sentence into array...
[INFO] Removing special characters and punctuation...
[INFO] Lemmatizing words...
[INFO] Removing stopwords and generating embeddings vector...


ValueError: dimension mismatch

In [46]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(query_encoding, encoding)

ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 2 while Y.shape[1] == 112
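
Both errors above have the same cause: `preprocess_text` fits a new vectorizer on every call, so the query and the candidates end up in vector spaces with different vocabularies (2 features versus 112). A common way around this is to fit the vectorizer once on the candidate texts and only `transform` the query with it. The sketch below illustrates the idea on a few toy titles; the data and variable names are illustrative and this is not the notebook's final pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the preprocessed candidate texts (illustrative only).
titles = [
    "aspiring human resources professional hr",
    "native english teacher at epik",
    "hr senior specialist human resources",
]
query = "hr specialist"

vectorizer = TfidfVectorizer(stop_words="english")
title_matrix = vectorizer.fit_transform(titles)   # learn the vocabulary from the candidates
query_vector = vectorizer.transform([query])      # reuse that vocabulary for the query

# Both matrices now have the same number of columns, so similarity is well defined.
scores = cosine_similarity(query_vector, title_matrix).ravel()
ranking = scores.argsort()[::-1]                  # candidate indices, best match first

for idx in ranking:
    print(f"{scores[idx]:.3f}  {titles[idx]}")
```

Query words that are missing from the fitted vocabulary are simply ignored by `transform`, so a candidate with no shared terms gets a similarity of 0.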

In [40]:
# SVD lets us extract the most important and informative words.
# SVD is an algebraic transformation similar to PCA that finds linear combinations of the terms that are informative,
# so that we can describe the dataset with fewer combinations than the number of terms we originally had.
# These combinations can be read as latent semantic dimensions, that is, dimensions onto which it makes sense
# to project the dataset precisely because of its semantic content.

# We can reduce the dimensionality of the texts by projecting them onto these latent semantic dimensions because
# the set of documents is often redundant: many documents talk about the same topics using more or less different words.

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2)
P = svd.fit_transform(tfidf_encoding)

comp1, comp2 = svd.components_  # coefficients (weights) of the terms in each of the two dimensions

indices = np.argsort(comp1)  # sort in ascending order and keep the indices of their positions in the array
indices = indices[::-1]  # reverse so that they are ordered from largest to smallest

print('Dimension 1:')
print(np.array(vectorizer.get_feature_names_out())[indices])  # look up the terms at these positions

print('\n')

indices = np.argsort(comp2)
indices = indices[::-1]
print('Dimension 2:')
print(np.array(vectorizer.get_feature_names_out())[indices])

Dimension 1:
['aspiring' 'resources' 'human' 'professional' 'college' 'specialist'
 'generalist' 'student' 'humber' 'seeking' 'business' 'magna' 'bauer'
 'laude' 'graduate' 'cum' '2019' 'management' 'internship' 'positions'
 'hris' 'manager' 'opportunities' 'major' 'buckhead' 'atlanta'
 'intercontinental' 'senior' 'coordinator' 'hr' 'position' 'chapman'
 'university' 'loparex' 'schwan' 'luxottica' 'leader' 'retail'
 'experienced' 'staffing' 'recruiting' 'director' 'scottmadden' 'inc'
 'team' 'energetic' 'focused' 'ey' 'liberal' 'arts' 'analyst' 'world'
 'gis' 'software' 'may' 'louis' 'level' 'entry' 'st' '2020' 'graduating'
 'open' 'relocation' 'travel' 'work' 'environment' 'engaging' 'create'
 'inclusive' 'helping' 'passionate' 'heil' 'partner' 'environmental'
 'america' 'north' 'beneteau' 'groupe' 'army' 'office' 'guard' 'retired'
 'recruiter' 'national' 'development' 'people' 'ryan' '709' 'nortia'
 'professionals' 'payroll' 'administrative' '408' '2621' 'conflict'
 'policies' 'compe