# Starring candidates

Apziva project #3<br>
2023 07 17

__Summary:__
* __Iteratively favoring candidates__ ("starring"):
    * This is based on information beyond the initial automatic labeling.
    * Example: favoring candidates with a high number of connections.
* This is a __manual process__.
    * User interface: a CSV file that the user opens and edits, for example in Excel.
* Automatic processes after manual starring:
    * __Adding a bonus__ to the starred candidate's fitness.
    * __Transferring this bonus signal__ to other candidates using __kNN__.
* This learning process is __logged__ for the next notebook (learning history).

## TOC: <a class="anchor" id="TOC"></a>
* [Notebook description](#NotebookDescription)
* [Utilities](#Utilities)
* [General settings](#GeneralSetting)
* [Learning step](#LearningStep)
    * [Get last state of fitness values](#GetLastStateOfFitnessValues)
    * [Read star](#ReadStar)

## Notebook description <a class="anchor" id="NotebookDescription"></a>
[TOC](#TOC)

* This notebook...

## Utilities <a class="anchor" id="Utilities"></a>
[TOC](#TOC)

In [1]:
# own libraries
import Utilities as u
import MachineLearning as ml

# activate changes in libraries
import importlib
importlib.reload(u)
importlib.reload(ml)

# aliases
from Utilities import TypeChecker as t
from Utilities import PrintAlias as p

In [2]:
def JoinNumbers(varListOrSet, strSeparator=", "):
    '''
    Joins a list or set of numbers into one string. The built-in "str.join" accepts only strings, so the numbers are converted first.
    
    When       Who What
    2023 07 20 dh  Created
    '''
    
    # convert to list of numbers
    if isinstance(varListOrSet, list):
        lvarNumbers = varListOrSet.copy()
    elif isinstance(varListOrSet, set):
        lvarNumbers = list(varListOrSet.copy())
    else:
        p("Strange data type in function JoinNumbers")
        return

    # convert to list of strings
    lstrNumbers = [str(varNumber) for varNumber in lvarNumbers]

    # join
    return strSeparator.join(lstrNumbers)

if True:
    lintNumbers = [1, 2, 3, 4, 5]
    sfltNumbers = set([1, 2, 3, 4, 5.555])
    p("List of integers:",JoinNumbers(lintNumbers))
    p("Set of floats:",JoinNumbers(sfltNumbers))

List of integers: 1, 2, 3, 4, 5
Set of floats: 1, 2, 3, 4, 5.555
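
For comparison, a minimal sketch of the same idea without the type check: `str.join` accepts any iterable of strings, so converting each element with `str` is enough. The helper name `join_numbers` is hypothetical and not part of this notebook's libraries; the set is sorted only to make the output order deterministic.

```python
def join_numbers(numbers, separator=", "):
    """Join any iterable of numbers into one string (hypothetical helper)."""
    return separator.join(str(number) for number in numbers)

list_result = join_numbers([1, 2, 3, 4, 5])
# sets have no guaranteed order, so sort before joining
set_result = join_numbers(sorted({1, 2, 3, 4, 5.555}))
print(list_result)
print(set_result)
```
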


## General settings <a class="anchor" id="GeneralSetting"></a>
[TOC](#TOC)

In [3]:
# constants

# general
cfltRandomSeed = 42 # any number

# files
cstrSourcePath = "../data/raw/"
cstrSourceFile = "PotentialTalentsUTF8.csv"

cstrOutputPath = "../data/interim/"
cstrOutputFile = "RankedCandidates.csv"

## Learning step <a class="anchor" id="LearningStep"></a>
[TOC](#TOC)

### Get last state of fitness values <a class="anchor" id="GetLastStateOfFitnessValues"></a>
[TOC](#TOC)

In [4]:
from pathlib import Path

# get last fitness states
p("Getting last fitness states".upper())
objFilePath = Path("../models/dfrFitnessFromLearning.p") 
if objFilePath.exists():
    dfrFitnessFromLearning = u.FromDisk("dfrFitnessFromLearning","models")
    lintStarredCandidates = u.FromDisk("lintStarredCandidates","models")
    p("- Learning step based on previous learning step.")
else:
    dfrFitnessFromLearning = u.FromDisk("dfrFitnessFromScoring","models")
    lintStarredCandidates = []
    p("- Learning step based on fitness by scoring.")   

# best fitnesses on top
dfrFitnessFromLearning.sort_values("fit",ascending = False, inplace=True)

# create CSV file for user input
strFullPath = f"{cstrOutputPath}{cstrOutputFile}"
dfrRankedCandidates = dfrFitnessFromLearning.loc[:, ~dfrFitnessFromLearning.columns.str.startswith("L_")].copy()
dfrRankedCandidates["Star"] = " "
dfrRankedCandidates.to_csv(strFullPath, encoding='utf-8', sep=";", index=False)
p(f"- Candidate suggestions have been saved as a CSV file: {strFullPath}") 
p(f"- Open this CSV file:")
p(f"  - star a candidate.")
p(f"  - save file.")
p(f"  - close file.")

GETTING LAST FITNESS STATES
- Learning step based on previous learning step.
- Candidate suggestions have been saved as a CSV file: ../data/interim/RankedCandidates.csv
- Open this CSV file:
  - star a candidate.
  - save file.
  - close file.
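
The branching in the cell above (resume from a previous learning step if one exists, otherwise start from the scored fitness) can be sketched with the standard library alone. This is only an illustration: it assumes pickle files and uses hypothetical file names, and the notebook's own `u.FromDisk` may work differently.

```python
import pickle
import tempfile
from pathlib import Path

def load_fitness(models_dir):
    """Prefer the result of a previous learning step; fall back to the initial scoring."""
    learned = models_dir / "fitness_from_learning.p"   # hypothetical file names
    scored = models_dir / "fitness_from_scoring.p"
    starred_file = models_dir / "starred.p"
    if learned.exists():
        source = learned
        starred = pickle.loads(starred_file.read_bytes()) if starred_file.exists() else []
    else:
        source = scored
        starred = []                                    # no starring history yet
    return pickle.loads(source.read_bytes()), starred, source.name

with tempfile.TemporaryDirectory() as tmp:
    models = Path(tmp)
    # first run: only the scoring result exists
    (models / "fitness_from_scoring.p").write_bytes(pickle.dumps({40: 1.0, 71: 0.47}))
    first_run = load_fitness(models)
    # later run: a learning step has stored its own result and history
    (models / "fitness_from_learning.p").write_bytes(pickle.dumps({40: 1.0, 71: 0.65}))
    (models / "starred.p").write_bytes(pickle.dumps([71]))
    second_run = load_fitness(models)

print(first_run[2], second_run[2])
```
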


### Read star <a class="anchor" id="ReadStar"></a>
[TOC](#TOC)

In [5]:
import pandas as pd

def StarredId(dfrStarred):
    '''
    Returns the starred ID. Returns an error message if not exactly one ID is starred.
    
    When       Who What
    2023 07 19 dh  Created
    '''
    
    # count stars
    # blank spreadsheet cells arrive as NaN, so fill them before stripping
    dfrStarred["Star"] = dfrStarred["Star"].fillna("").astype(str).str.strip()
    intStars = len(dfrStarred[dfrStarred["Star"] != ""])

    # get ID
    if intStars == 1:
        intStarredId = dfrStarred.loc[dfrStarred["Star"] != "", "id"].values[0]
        return intStarredId
    elif intStars == 0:
        return "No candidate starred."
    else:
        return f"{intStars} candidates starred, instead of 1."

if False:
    dvarTest = {
        "id": [1, 2, 3, 4, 5],
        "Star": ["  ", " X ", " * ", " ", "  "]
    }
    dfrStarred = pd.DataFrame(dvarTest)
    varAnswer = StarredId(dfrStarred)
    t(varAnswer)
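
One case worth checking is how the `Star` column looks after a spreadsheet round trip: cells left empty are read back as NaN, and a column that is empty throughout is not a string column at all. The sketch below is an illustration under those assumptions, using in-memory CSV text and a hypothetical helper `starred_ids`; it fills missing values before stripping whitespace.

```python
import io
import pandas as pd

def starred_ids(csv_text):
    """Return the IDs whose Star cell is non-blank (hypothetical helper)."""
    dfr = pd.read_csv(io.StringIO(csv_text), sep=";")
    # empty cells are parsed as NaN: fill first, then strip whitespace
    stars = dfr["Star"].fillna("").astype(str).str.strip()
    return dfr.loc[stars != "", "id"].tolist()

one_star = starred_ids("id;fit;Star\n40;1.0;\n71;0.47; x \n99;0.99;\n")
no_star = starred_ids("id;fit;Star\n40;1.0;\n71;0.47;\n")
print(one_star, no_star)
```
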

In [6]:
def AdjustAsProbabilities(dfrSource, strColumn):
    '''
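    Scales column strColumn in place by dividing by its maximum, so that the largest value becomes 1.
    Values land in the range 0 to 1 as long as the column holds no negative numbers.
    
    When       Who What
    2023 07 19 dh  Created
    '''
    fltMaximum = dfrSource[strColumn].max()
    if fltMaximum != 0:
        dfrSource[strColumn] = dfrSource[strColumn] / fltMaximum
    else:
        dfrSource[strColumn] = 0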

In [7]:
def UpdateFitnessFromStar(dfrStarred,intStarredId,fltBonus = 0.30):
    '''
    Adds a bonus to the starred candidate and re-adjusts the probabilities to the range 0 to 1.
    
    When       Who What
    2023 07 19 dh  Created
    '''    
    dfrUpdatedFitness = dfrStarred.copy()
    dfrUpdatedFitness.loc[dfrUpdatedFitness['id'] == intStarredId, 'fit'] += fltBonus
    AdjustAsProbabilities(dfrUpdatedFitness,"fit")
    return dfrUpdatedFitness
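
A small worked example may help to see what the bonus does. The sketch below re-implements the two steps (add bonus, divide by the new maximum) in a hypothetical helper `add_bonus_and_rescale` on a toy frame; the fitness values are borrowed from the run shown later in this notebook. Starring a candidate below the maximum leaves everyone else unchanged, while starring the top candidate shrinks all other values.

```python
import pandas as pd

def add_bonus_and_rescale(dfr, starred_id, bonus=0.30):
    """Add the bonus to one candidate, then divide all fitness values by the new maximum."""
    out = dfr.copy()
    out.loc[out["id"] == starred_id, "fit"] += bonus
    maximum = out["fit"].max()
    out["fit"] = out["fit"] / maximum if maximum != 0 else 0.0
    return out

dfr_toy = pd.DataFrame({"id": [40, 99, 71], "fit": [1.0, 0.99211, 0.474608]})

# starring candidate 71: 0.474608 + 0.30 = 0.774608, the maximum stays 1.0
fit_star_71 = add_bonus_and_rescale(dfr_toy, 71).set_index("id")["fit"]
# starring candidate 40: the maximum becomes 1.30, so every value is divided by 1.30
fit_star_40 = add_bonus_and_rescale(dfr_toy, 40).set_index("id")["fit"]

print(fit_star_71.round(6).to_dict())
print(fit_star_40.round(6).to_dict())
```
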

In [8]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
def UpdateFitnessFromKNN(dfrFitnessOriginal,dfrBonusAdjustedFitness,intK = 2):
    '''
    Trains a KNN model and applies it back to the training dataset, which smooths each fitness value toward those of similar candidates.
    
    When       Who What
    2023 07 19 dh  Created
    '''  
    
    # init
    dfrUpdated = dfrFitnessOriginal.copy()
    
    # transfer bonus from starring
    dfrUpdated = dfrUpdated.merge(dfrBonusAdjustedFitness[['id', 'fit']], on='id', suffixes=('', '_bonus'))
    dfrUpdated['fit'] = dfrUpdated['fit_bonus']
    dfrUpdated.drop('fit_bonus', axis=1, inplace=True)

    # create X and y
    lstrFeatureColumns = ["connection"] + dfrUpdated.filter(regex=r'^L_').columns.tolist()
    strTargetColumn = "fit"
    X = dfrUpdated[lstrFeatureColumns]
    y = dfrUpdated[strTargetColumn]
    
    # scale
    objStandardScaler = StandardScaler()
    objStandardScaler.fit(X)
    X_scaled = objStandardScaler.transform(X)
    
    # train KNN model
    objKNN = KNeighborsRegressor(n_neighbors=intK)
    objKNN.fit(X_scaled, y)

    # re-fit training data
    a1fltAdjustedFitnessValues = objKNN.predict(X_scaled)
    dfrUpdated[strTargetColumn] = a1fltAdjustedFitnessValues
    dfrUpdated = dfrUpdated.sort_values(strTargetColumn, ascending=False)
    
    # finalize
    return dfrUpdated
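
To see why predicting on the training data transfers the bonus, it helps to spell out what a KNN regressor with uniform weights computes: the mean target of the k nearest rows, where each row normally counts as its own nearest neighbor. The sketch below is a hand-rolled NumPy stand-in for that behavior, not the scikit-learn call used above; the function name and the toy data are assumptions, and it presumes no feature column is constant.

```python
import numpy as np

def knn_smooth(features, targets, k=2):
    """Average each target with those of its k nearest rows (the row itself included)."""
    features = np.asarray(features, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # standardize each feature column (zero mean, unit population standard deviation)
    scaled = (features - features.mean(axis=0)) / features.std(axis=0)
    # pairwise Euclidean distances between all rows
    distances = np.linalg.norm(scaled[:, None, :] - scaled[None, :, :], axis=2)
    # indices of the k closest rows per row; distance 0 to itself comes first
    nearest = np.argsort(distances, axis=1, kind="stable")[:, :k]
    return targets[nearest].mean(axis=1)

connections = [[10], [12], [480], [500]]   # toy feature: number of connections
fitness = [0.2, 0.4, 0.5, 1.0]             # last candidate carries a bonus
smoothed = knn_smooth(connections, fitness)
print(smoothed)
```

With k = 2 the boosted candidate shares its value with its closest neighbor: both end up at 0.75, so part of the bonus moves to the similar candidate.
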

In [9]:
def SuccessReport(lintStarredCandidates,dfrSortedFitness):
    '''
    Evaluates if starring had a positive effect:
    - 100%: all candidates starred in the past are on top.
    -   0%: none of the starred candidates is on top.
    
    When       Who What
    2023 07 20 dh  Created
    '''    
        
    # init
    sintStarredCandidates = set(lintStarredCandidates)
    intStarredWithoutDuplicates = len(sintStarredCandidates)
    dfrTop = dfrSortedFitness.copy()
    
    # get top candidates
    dfrTop = dfrTop.head(intStarredWithoutDuplicates)
    srsMatchingRecords = dfrTop['id'].isin(sintStarredCandidates)
    intStarredCandidatesOnTop = srsMatchingRecords.sum()
    fltPortionOnTop = intStarredCandidatesOnTop/intStarredWithoutDuplicates
    
    # get mean fitness of... 
    srsMaskStarredCandidates = dfrSortedFitness['id'].isin(sintStarredCandidates)
    
    # ... starred candidates
    dfrFilteredRecords = dfrSortedFitness.loc[srsMaskStarredCandidates]
    fltMeanFitnessStarred = dfrFilteredRecords['fit'].mean()
    
    # ... candidates never starred
    dfrFilteredRecords = dfrSortedFitness.loc[~srsMaskStarredCandidates]
    fltMeanFitnessNotStarred = dfrFilteredRecords['fit'].mean()
    fltDelta = fltMeanFitnessStarred - fltMeanFitnessNotStarred

    # report
    p("Success report".upper())
    p(f"- starring events so far:")
    p(f"  - all:                {len(lintStarredCandidates)}, i.e. { JoinNumbers(lintStarredCandidates,' ')}")
    p(f"  - without duplicates: {intStarredWithoutDuplicates}, i.e. {JoinNumbers(sintStarredCandidates,' ')}")
    p(f"- starred candidates on top:")
    p(f"  - count:              {intStarredCandidatesOnTop}")
    p(f"  - portion:            {round(100 * fltPortionOnTop)}%")
    p(f"- mean fitness:")
    p(f"  - starred:            {round(100 * fltMeanFitnessStarred,1)}%")
    p(f"  - rest:               {round(100 * fltMeanFitnessNotStarred,1)}%")
    p(f"  - delta:              {round(100 * fltDelta,1)}%")

In [11]:
import sys
from pathlib import Path

import pandas as pd

p("Learning step".upper())

# get user input file
strFilename = f"{cstrOutputPath}{cstrOutputFile}"
objFilePath = Path(strFilename) 

if objFilePath.exists():
    cintExamplesSmall = 5
    cintExamplesBig = 10
    dfrStarred = pd.read_csv(strFilename, encoding='utf-8', sep=';')
    dfrStarred.sort_values(["Star","fit"], ascending=[False,False], inplace=True)
    p(f"- Top {cintExamplesSmall} candidates, before bonus:")
    u.DisplayDataFrame(dfrStarred.head(cintExamplesSmall))
    varAnswer = StarredId(dfrStarred)
    if isinstance(varAnswer, str):
        p(f"- error: {varAnswer}")
        p(f"- re-open CSV file and star exactly 1 candidate.")
        raise RuntimeError(varAnswer)
    else:
        
        # init
        intStarredId = varAnswer
        p(f"- ID starred: {intStarredId}")
        
        # convert star into bonus
        dfrUpdatedFitnessFromStar = UpdateFitnessFromStar(dfrStarred,intStarredId)
        dfrUpdatedFitnessFromStar = dfrUpdatedFitnessFromStar.drop("Star", axis=1)
        p(f"- Top {cintExamplesBig} candidates, after bonus:")
        dfrUpdatedFitnessFromStar.sort_values("fit",ascending=False, inplace=True)
        u.DisplayDataFrame(dfrUpdatedFitnessFromStar.head(cintExamplesBig))
        
        # use KNN to re-calculate fitness
        dfrUpdatedFitnessFromKNN = UpdateFitnessFromKNN(dfrFitnessFromLearning,dfrUpdatedFitnessFromStar)
        dfrUpdatedFitnessFromKNN.sort_values("fit",ascending=False, inplace=True)
        p(f"- Top {cintExamplesBig} candidates, after KNN:")
        u.DisplayDataFrame(dfrUpdatedFitnessFromKNN[["id","job_title","location","connection","fit"]].head(cintExamplesBig))
        
        # remember new fitness values
        u.ToDisk(dfrUpdatedFitnessFromKNN,"dfrFitnessFromLearning","models")
        
        # remember starring history
        lintStarredCandidates.append(intStarredId)
        u.ToDisk(lintStarredCandidates, strType="models")
        
        # feedback on effect of starring
        SuccessReport(lintStarredCandidates,dfrUpdatedFitnessFromKNN)
        
else:
    p("Problem:".upper())
    p(f"- For unknown reasons, the file '{strFilename}' does not exist.")
    p(f"- Restart this notebook.")
    sys.exit()

LEARNING STEP
- Top 5 candidates, before bonus:


Unnamed: 0,id,job_title,location,connection,fit,Star
35,71,"Human resources generalist at scottmadden, inc.","Raleigh-Durham, North Carolina Area",500,0.474608,x
0,40,Seeking human resources HRIS and generalist po...,Greater Philadelphia Area,500,1.0,
1,10,Seeking human resources HRIS and generalist po...,Greater Philadelphia Area,500,1.0,
2,53,Seeking human resources HRIS and generalist po...,Greater Philadelphia Area,500,1.0,
3,62,Seeking human resources HRIS and generalist po...,Greater Philadelphia Area,500,1.0,


- ID starred: 71
- Top 10 candidates, after bonus:


Unnamed: 0,id,job_title,location,connection,fit
1,10,Seeking human resources HRIS and generalist po...,Greater Philadelphia Area,500,1.0
2,53,Seeking human resources HRIS and generalist po...,Greater Philadelphia Area,500,1.0
3,62,Seeking human resources HRIS and generalist po...,Greater Philadelphia Area,500,1.0
0,40,Seeking human resources HRIS and generalist po...,Greater Philadelphia Area,500,1.0
4,99,Seeking human resources position,"Las Vegas, Nevada Area",48,0.99211
5,100,"Aspiring human resources manager, graduating m...","Cape Girardeau, Missouri",103,0.940991
6,70,"Retired army national guard recruiter, office ...","Virginia Beach, Virginia",82,0.9386
35,71,"Human resources generalist at scottmadden, inc.","Raleigh-Durham, North Carolina Area",500,0.774608
7,67,"Human resources, staffing and recruiting profe...","Jackson, Mississippi Area",500,0.620842
10,29,Aspiring human resources management student se...,"Houston, Texas Area",500,0.612089


- Top 10 candidates, after KNN:


Unnamed: 0,id,job_title,location,connection,fit
0,40,Seeking human resources HRIS and generalist po...,Greater Philadelphia Area,500,1.0
2,53,Seeking human resources HRIS and generalist po...,Greater Philadelphia Area,500,1.0
3,62,Seeking human resources HRIS and generalist po...,Greater Philadelphia Area,500,1.0
1,10,Seeking human resources HRIS and generalist po...,Greater Philadelphia Area,500,1.0
4,99,Seeking human resources position,"Las Vegas, Nevada Area",48,0.996055
5,100,"Aspiring human resources manager, graduating m...","Cape Girardeau, Missouri",103,0.96655
6,70,"Retired army national guard recruiter, office ...","Virginia Beach, Virginia",82,0.965355
35,71,"Human resources generalist at scottmadden, inc.","Raleigh-Durham, North Carolina Area",500,0.646127
9,27,Aspiring human resources management student se...,"Houston, Texas Area",500,0.612089
10,29,Aspiring human resources management student se...,"Houston, Texas Area",500,0.612089


SUCCESS REPORT
- starring events so far:
  - all:                7, i.e. 53 40 10 62 67 68 71
  - without duplicates: 7, i.e. 67 68 53 71 40 10 62
- starred candidates on top:
  - count:              4
  - portion:            57%
- mean fitness:
  - starred:            82.2%
  - rest:               36.1%
  - delta:              46.1%
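
The "portion on top" figure can be re-derived by hand from the listings above. The sketch below copies the ranked IDs from the "after KNN" table and the starring history from the success report, then repeats the counting that `SuccessReport` does; it is a plain-Python illustration, not the notebook's function.

```python
# IDs in ranked order, copied from the "after KNN" listing above (top 10 only)
ranked_ids = [40, 53, 62, 10, 99, 100, 70, 71, 27, 29]
# starring history, copied from the success report above
starred_history = [53, 40, 10, 62, 67, 68, 71]

starred_unique = set(starred_history)
# look at as many top rows as there are distinct starred candidates
top_ids = ranked_ids[:len(starred_unique)]
on_top = [candidate for candidate in top_ids if candidate in starred_unique]
portion_on_top = len(on_top) / len(starred_unique)
print(len(on_top), f"{round(100 * portion_on_top)}%")
```

Four of the seven starred candidates (40, 53, 62, 10) sit in the top seven rows, which gives the reported 57%. The delta line is the plain difference of the two means, 82.2% minus 36.1%, so it is best read as 46.1 percentage points.
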


__Learning step done:__
* Restart this notebook.
* After a reasonable number of learning steps (e.g. 7), proceed to the next notebook.