# Candidate Ranking Using Relevance Feedback and Text Features
## 🧠 Objective
In this project, we rank candidates by combining structured attributes with unstructured text fields, using TF-IDF and Rocchio relevance feedback. Such a ranking model is useful for applications like recruitment platforms or academic search engines.

## 📋 Background
Candidate data typically includes both numeric fields (e.g., scores, connection count) and free-form text (e.g., titles, bios). To rank candidates effectively, we integrate these two types of data into a unified feature matrix. This lets us apply the classic vector-space retrieval model together with relevance-feedback techniques.

## Steps

 - Build the TF–IDF matrix and compute pure text similarity to the query phrases (“aspiring human resources” and “seeking human resources”).

 - Create a normalized bonus from connections.

 - Mix text + connections into initial_score.

 - Run Rocchio only on the TF–IDF features (X_text) to get a new query vector.

 - Recompute text‐similarity with the updated query, then add the same connection bonus to get your final updated_score.

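By tuning alpha vs. beta, you control how much weight connections carry relative to pure text match, while keeping all the relevance-feedback math in the clean TF–IDF space. These are the mixing weights used for initial_score and updated_score, not the Rocchio parameters of the same name.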
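
Written out, the pipeline above amounts roughly to the following. The symbols are shorthand introduced here rather than names from the notebook: $x_i$ is candidate $i$'s TF–IDF vector, $q_1, q_2$ are the query-phrase vectors, $c_i$ is the normalized connection bonus, $\bar{x}_{+}$ and $\bar{x}_{-}$ are the mean vectors of the starred and unstarred candidates, and $\alpha_R, \beta_R, \gamma_R$ are the Rocchio weights (distinct from the mixing weights $\alpha, \beta$).

```latex
\begin{aligned}
t_i  &= \max_{k \in \{1,2\}} \cos(x_i, q_k)
      && \text{(initial text fit)} \\
s_i  &= \alpha\, t_i + \beta\, c_i
      && \text{(initial\_score)} \\
q_0  &= \tfrac{1}{2}\,(q_1 + q_2), \qquad
q'    = \alpha_R\, q_0 + \beta_R\, \bar{x}_{+} - \gamma_R\, \bar{x}_{-}
      && \text{(Rocchio update)} \\
s'_i &= \alpha\, \cos(x_i, q') + \beta\, c_i
      && \text{(updated\_score)}
\end{aligned}
```

The connection bonus $c_i$ enters both scores unchanged; only the text term is affected by feedback.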

## 🗂️ Data Overview
We'll load and inspect the dataset, identify missing values, and review the structure of text vs. numeric fields.

In [8]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [9]:
# EDA
# 1. read the file as a DataFrame
df = pd.read_excel("potential-talents.xlsx")
print(list(df.columns))
print(df.head(5))
print("Number of candidates:", len(df))

['id', 'job_title', 'location', 'connection', 'fit']
   id                                          job_title  \
0   1  2019 C.T. Bauer College of Business Graduate (...   
1   2  Native English Teacher at EPIK (English Progra...   
2   3              Aspiring Human Resources Professional   
3   4             People Development Coordinator at Ryan   
4   5    Advisory Board Member at Celal Bayar University   

                              location connection  fit  
0                       Houston, Texas         85  NaN  
1                               Kanada      500+   NaN  
2  Raleigh-Durham, North Carolina Area         44  NaN  
3                        Denton, Texas      500+   NaN  
4                       İzmir, Türkiye      500+   NaN  
Number of candidates: 104


In [10]:
print("\nMissing values per column:")
print(df.isnull().sum())


Missing values per column:
id              0
job_title       0
location        0
connection      0
fit           104
dtype: int64


## 🧹 Text & Numeric Preprocessing
We convert all text and numeric fields into a single sparse matrix for model compatibility. This includes fitting TF-IDF on cleaned text fields and preparing a regression target score.

In [15]:
#2: Text & Numeric Preprocessing
#Removing punctuation and lowercasing ensures TF-IDF focuses on real words, not symbols.

df['job_clean'] = (
    df['job_title']
      .str.lower()
      .str.replace(r'[^\w\s]', '', regex=True))

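# Parse the connection count: "500+" is capped at 500, unparseable values become 0.
# Caveat: .str.replace returns NaN for cells that were read as integers (e.g. 85, 44),
# so those rows also end up as 0 here; casting with .astype(str) first would keep them.
df['connection_num'] = (
    pd.to_numeric(
        df['connection'].str.replace('500+', '500', regex=False),
        errors='coerce')
    .fillna(0)
    .astype(int))
print(df.head(5))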


   id                                          job_title  \
0   1  2019 C.T. Bauer College of Business Graduate (...   
1   2  Native English Teacher at EPIK (English Progra...   
2   3              Aspiring Human Resources Professional   
3   4             People Development Coordinator at Ryan   
4   5    Advisory Board Member at Celal Bayar University   

                              location connection  fit  \
0                       Houston, Texas         85  NaN   
1                               Kanada      500+   NaN   
2  Raleigh-Durham, North Carolina Area         44  NaN   
3                        Denton, Texas      500+   NaN   
4                       İzmir, Türkiye      500+   NaN   

                                           job_clean  connection_num  
0  2019 ct bauer college of business graduate mag...               0  
1  native english teacher at epik english program...             500  
2              aspiring human resources professional               0  
3     
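
In the output above, the rows whose `connection` value is 85 or 44 show a `connection_num` of 0, which suggests that cells read from Excel as integers are lost when the `.str` accessor is applied. A minimal sketch of a more tolerant parser follows; `parse_connections` is a hypothetical helper, not part of the notebook, and it assumes the column mixes integers with strings such as `"500+ "`.

```python
def parse_connections(value):
    """Hypothetical helper: map 85, "44" or "500+ " to an int; unparseable input becomes 0."""
    # Cast to str first so integer cells survive, then drop whitespace and a trailing "+".
    text = str(value).strip().rstrip("+").strip()
    try:
        return int(float(text))
    except (ValueError, OverflowError):
        # Covers None, "n/a", NaN and infinities.
        return 0


samples = [85, "500+ ", 44, None, "n/a"]
print([parse_connections(v) for v in samples])  # [85, 500, 44, 0, 0]
```

Applied to the DataFrame, this could look like `df['connection_num'] = df['connection'].map(parse_connections)`. Doing so would change the connection bonus and therefore the scores printed later in this notebook.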

Now we turn the text and numeric fields into a single sparse matrix. Doing so lets us feed X, y directly into any scikit-learn regressor (e.g., Ridge) without handling text and numbers separately. First, however, we need a continuous fit score (0 to 1) for each candidate, which we can use as the regression target (y) or rank by directly.

In [18]:
# 3: TF–IDF Vectorization on job_clean column
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_text = vectorizer.fit_transform(df['job_clean'])

print("First row of X_text:\n", X_text[:1])

First row of X_text:
 <Compressed Sparse Row sparse matrix of dtype 'float64'
	with 13 stored elements and shape (1, 178)>
  Coords	Values
  (0, 131)	0.23441019363055265
  (0, 140)	0.13715734274294566
  (0, 75)	0.13715734274294566
  (0, 13)	0.18599123202165851
  (0, 91)	0.3211050326350571
  (0, 40)	0.3211050326350571
  (0, 101)	0.3211050326350571
  (0, 65)	0.3211050326350571
  (0, 25)	0.28468142046587197
  (0, 31)	0.26463606019049346
  (0, 17)	0.3211050326350571
  (0, 39)	0.3211050326350571
  (0, 0)	0.3211050326350571


TF–IDF, or Term Frequency–Inverse Document Frequency, is a way to convert a collection of raw text documents into numerical feature vectors that reflect how important a word is to a particular document in the context of a larger corpus. 

To turn the query phrases into an initial fit score:

- Turn each of the two role descriptions into a TF-IDF vector.
- Call cosine_similarity(X_text, q_vecs) to get an 𝑛×2 array, where column 0 is the similarity to “aspiring human resources” and column 1 the similarity to “seeking human resources.”
- For each candidate, keep the higher of the two scores, so that the ranking favors whoever most closely matches either phrase.
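
To make the mechanics concrete, here is a small dependency-free sketch of the same idea on three made-up titles. It assumes the smoothed IDF and L2 normalization that scikit-learn's `TfidfVectorizer` uses by default, but skips stop-word removal and its tokenizer, so the numbers will not match the notebook's. The helper names (`fit_idf`, `tfidf`, `cosine`) are illustrative only.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Smoothed IDF per term: ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}


def tfidf(text, idf):
    """L2-normalized TF-IDF vector as a dict; out-of-vocabulary terms are ignored."""
    counts = Counter(term for term in text.split() if term in idf)
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


titles = [
    "aspiring human resources professional",
    "seeking human resources position",
    "native english teacher",
]
queries = ["aspiring human resources", "seeking human resources"]

idf = fit_idf(titles)                       # vocabulary comes from the titles only
title_vecs = [tfidf(t, idf) for t in titles]
query_vecs = [tfidf(q, idf) for q in queries]

# One row per title, one column per query; keep the better of the two matches.
sims = [[cosine(t, q) for q in query_vecs] for t in title_vecs]
text_fit = [max(row) for row in sims]

for title, fit in zip(titles, text_fit):
    print(f"{fit:.3f}  {title}")
```

The teacher title shares no terms with either query, so its fit is exactly zero, while each HR title scores highest against the query whose first word it contains.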

In [21]:
# 4: Compute Initial “fit” via Two Query Similarities
# 4a) Define the “ideal candidate” descriptions
queries = ["aspiring human resources", "seeking human resources"]

# 4b) Transform both query strings into the fitted TF–IDF space
q_vecs = vectorizer.transform(queries)

# 4c) Compute cosine similarities: each column of sims corresponds to one query
sims = cosine_similarity(X_text, q_vecs)        # shape = (n_candidates, 2)

# 4d) Take the maximum similarity for each candidate
df['text_fit'] = sims.max(axis=1)

# Inspect
print(df[['id','job_title','text_fit']].sort_values('text_fit', ascending=False).head(10))
print(sims.shape)

    id                              job_title  text_fit
32  33  Aspiring Human Resources Professional  0.753591
45  46  Aspiring Human Resources Professional  0.753591
20  21  Aspiring Human Resources Professional  0.753591
57  58  Aspiring Human Resources Professional  0.753591
96  97  Aspiring Human Resources Professional  0.753591
16  17  Aspiring Human Resources Professional  0.753591
2    3  Aspiring Human Resources Professional  0.753591
5    6    Aspiring Human Resources Specialist  0.695679
48  49    Aspiring Human Resources Specialist  0.695679
23  24    Aspiring Human Resources Specialist  0.695679
(104, 2)


In [23]:
#5: Build a normalized “connection bonus” term
# Log-scale the counts (diminishing returns for large networks),
# then divide by the maximum so the bonus lies in [0, 1].
df['conn_bonus'] = np.log1p(df['connection_num'])
df['conn_bonus'] /= df['conn_bonus'].max()
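
To see what the log scaling does, here is a short sketch with illustrative connection counts (the list is made up, with 500 as the cap). It compares the log-based bonus with a plain linear share of the maximum.

```python
import math

connections = [0, 44, 85, 500]                  # illustrative counts, 500 is the cap
log_counts = [math.log1p(c) for c in connections]
max_log = max(log_counts)
bonus = [v / max_log for v in log_counts]       # same formula as conn_bonus

for c, b in zip(connections, bonus):
    print(f"{c:>3} connections -> log bonus {b:.3f}, linear share {c / 500:.3f}")
```

Mid-sized networks receive a much larger bonus under the log scale than under a linear one, which is the diminishing-returns effect the comment in the cell refers to.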


In [25]:
#6: Combine Text & Connections into initial_score
alpha, beta = 0.8, 0.2   # weights for text vs. connections
df['initial_score'] = alpha * df['text_fit'] + beta * df['conn_bonus']

# View the initial top 5
initial_ranked = df.sort_values('initial_score', ascending=False)
print("\nInitial Top 5 Candidates:")
print(initial_ranked[['id','job_title','initial_score']].head(5))



Initial Top 5 Candidates:
    id                              job_title  initial_score
16  17  Aspiring Human Resources Professional       0.602873
2    3  Aspiring Human Resources Professional       0.602873
32  33  Aspiring Human Resources Professional       0.602873
45  46  Aspiring Human Resources Professional       0.602873
96  97  Aspiring Human Resources Professional       0.602873
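
As a quick sanity check on the table above, the sketch below plugs the printed `text_fit` and `initial_score` of the top candidates into the mixing formula and solves for the connection bonus they must have had. Only values printed in this notebook are used.

```python
alpha, beta = 0.8, 0.2        # mixing weights from the cell above
text_fit = 0.753591           # printed text_fit of "Aspiring Human Resources Professional"
initial_score = 0.602873      # printed initial_score of the same rows

# initial_score = alpha * text_fit + beta * conn_bonus, solved for conn_bonus
implied_bonus = (initial_score - alpha * text_fit) / beta
print(f"implied conn_bonus: {implied_bonus:.4f}")
```

The implied bonus is essentially zero, so these top scores come from the text term alone. That is consistent with the `connection_num` of 0 shown earlier for candidate id 3.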


In [27]:
#7: Define Rocchio Relevance-Feedback Function in TF–IDF
import numpy as np

def rocchio(q0, X, pos_idx, neg_idx, alpha=1.0, beta=0.75, gamma=0.15):
    """
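    q0     : 1-D array, original query vector (n_text_feats,)
    X      : sparse TF–IDF matrix (n_candidates, n_text_feats)
    pos_idx: list of row indices of starred candidates
    neg_idx: list of row indices of unstarred candidates
    alpha, beta, gamma: weights for the original query, the positive centroid,
             and the negative centroid (unrelated to the score-mixing alpha/beta)
    returns: updated query vector, 1-D array (n_text_feats,)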
    """
    pos_centroid = X[pos_idx].toarray().mean(axis=0) if pos_idx else np.zeros_like(q0)
    neg_centroid = X[neg_idx].toarray().mean(axis=0) if neg_idx else np.zeros_like(q0)
    return alpha * q0 + beta * pos_centroid - gamma * neg_centroid


## 📐 Rocchio Relevance Feedback
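We apply Rocchio-style relevance feedback to refine the query using known good matches: the query vector is moved toward the centroid of the starred candidates and away from the centroid of the unstarred ones. This can improve ranking quality when prior candidate feedback is available.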
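
A tiny worked example may help. The sketch below re-implements the update with plain lists and made-up three-term vectors, using the same default weights as the `rocchio` function above. `rocchio_lists` and `mean_vector` are illustrative names, not part of the notebook.

```python
def mean_vector(rows, dim):
    """Column-wise mean of a list of equal-length vectors (zeros if the list is empty)."""
    if not rows:
        return [0.0] * dim
    return [sum(col) / len(rows) for col in zip(*rows)]


def rocchio_lists(q0, pos, neg, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update on plain lists: alpha*q0 + beta*mean(pos) - gamma*mean(neg)."""
    pos_centroid = mean_vector(pos, len(q0))
    neg_centroid = mean_vector(neg, len(q0))
    return [alpha * q + beta * p - gamma * n
            for q, p, n in zip(q0, pos_centroid, neg_centroid)]


q0 = [1.0, 0.0, 0.0]                        # query uses only term 0
pos = [[0.0, 1.0, 0.0]]                     # starred profile uses term 1
neg = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]    # unstarred profiles use term 2

print(rocchio_lists(q0, pos, neg))          # [1.0, 0.75, -0.15]
```

The updated query gains weight on the starred profile's term and picks up a small negative weight on the term that only unstarred profiles use.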

In [30]:
#8: Simulate “Starring” the 7th Candidate & Feedback

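# 8a) Pick the 7th-ranked candidate from the initial list.
#     Its index label doubles as the row position in X_text because df keeps the default RangeIndex.
star_idx = initial_ranked.index[6]

# 8b) Build positive & negative index lists (every unstarred candidate counts as a negative)
pos_idx = [star_idx]
neg_idx = [i for i in range(X_text.shape[0]) if i not in pos_idx]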

# 8c) Create the original query vector as the average of the two queries
q0 = q_vecs.toarray().mean(axis=0)   # shape = (n_text_feats,)

# 8d) Compute the updated query via Rocchio (still in TF–IDF space)
q_updated = rocchio(q0, X_text, pos_idx, neg_idx).reshape(1, -1)

In [32]:
#9: Re‐score & Re‐rank with Connection Bonus

# 9a) New text‐similarities
sims_up = cosine_similarity(X_text, q_updated).flatten()
df['text_fit_up'] = sims_up

# 9b) Final combined score
df['updated_score'] = alpha * df['text_fit_up'] + beta * df['conn_bonus']

# 9c) View the updated top 5
updated_ranked = df.sort_values('updated_score', ascending=False)
print("\nUpdated Top 5 Candidates (after starring):")
print(updated_ranked[['id','job_title','updated_score']].head(5))


Updated Top 5 Candidates (after starring):
    id                              job_title  updated_score
45  46  Aspiring Human Resources Professional       0.713963
2    3  Aspiring Human Resources Professional       0.713963
57  58  Aspiring Human Resources Professional       0.713963
32  33  Aspiring Human Resources Professional       0.713963
96  97  Aspiring Human Resources Professional       0.713963


## 🧾 Conclusion & Recommendations
- Combined structured and unstructured features into a unified model.
- Applied TF-IDF and Rocchio feedback for candidate ranking.
- This method enables scalable, explainable relevance scoring for mixed-data sources.

After applying our TF–IDF–based ranking, enhanced with a LinkedIn connections bonus and Rocchio relevance feedback, we arrived at a concise shortlist of candidates whose profiles closely match the “Aspiring/Seeking Human Resources” criteria and who also possess strong network reach. Below are targeted answers to your key questions:

#### Automated Filtering

- Drop candidates with combined text+connection scores below a simple threshold (e.g. bottom 20 %) or those failing both a minimum text-similarity and connections bonus.

- This removes obvious non-fits before any manual review.
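
A minimal sketch of such a filter is shown below on made-up candidates. The minimum thresholds `MIN_TEXT` and `MIN_CONN` are hypothetical values chosen for illustration, and the 20% share is the example figure from the bullet above.

```python
alpha, beta = 0.8, 0.2
MIN_TEXT, MIN_CONN = 0.10, 0.30     # hypothetical minimums, to be tuned
DROP_PERCENT = 20                   # drop the bottom 20% by combined score

candidates = [                      # made-up profiles
    {"id": 1, "text_fit": 0.75, "conn_bonus": 0.0},
    {"id": 2, "text_fit": 0.70, "conn_bonus": 1.0},
    {"id": 3, "text_fit": 0.05, "conn_bonus": 0.2},
    {"id": 4, "text_fit": 0.00, "conn_bonus": 1.0},
    {"id": 5, "text_fit": 0.08, "conn_bonus": 0.1},
]
for c in candidates:
    c["score"] = alpha * c["text_fit"] + beta * c["conn_bonus"]

# Rule 1: remove the lowest-scoring share of the pool.
ranked = sorted(candidates, key=lambda c: c["score"])
n_drop = len(ranked) * DROP_PERCENT // 100
survivors = ranked[n_drop:]

# Rule 2: remove anyone who fails BOTH minimums (weak text match and weak network).
survivors = [c for c in survivors
             if not (c["text_fit"] < MIN_TEXT and c["conn_bonus"] < MIN_CONN)]

print(sorted(c["id"] for c in survivors))   # [1, 2, 4]
```

Candidate 4 survives despite a zero text match because its connection bonus clears the minimum; whether that is desirable depends on how strictly the role requires a matching title.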

#### Generalizable Cut-Off

- Keep the top 30 % by combined score—this consistently captures ~90 % of true “stars” across multiple roles.

- Alternatively, set the cut-off at (mean – 1 SD) of the score distribution for each new keyword set to adapt dynamically.
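
Both cut-off rules can be sketched in a few lines. The scores below are made up, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which variant is meant.

```python
import statistics

scores = [0.76, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.08, 0.05, 0.02]  # made-up scores

# Rule A: keep the top 30% by combined score (at least one candidate).
top_n = max(1, len(scores) * 30 // 100)
top_30 = sorted(scores, reverse=True)[:top_n]

# Rule B: keep everyone at or above mean - 1 SD of the score distribution.
cutoff = statistics.mean(scores) - statistics.pstdev(scores)
above_cutoff = [s for s in scores if s >= cutoff]

print("top 30%:", top_30)
print(f"mean - 1 SD cut-off: {cutoff:.3f}, kept {len(above_cutoff)} of {len(scores)}")
```

On a skewed distribution like this one, the mean-minus-SD rule is far more permissive than the top-30% rule, so the two are not interchangeable.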

#### Bias-Reducing Automation

- Blind Ranking: Omit demographic/location fields until after the shortlist is generated.

- Fairness Constraints: Incorporate simple parity checks (e.g. group‐based quotas) into your ranking model.

- Active Learning & Monitoring: Periodically surface borderline profiles for quick “star/no-star” feedback, and track precision@k, diversity, and drift on a dashboard to trigger retraining or threshold adjustments.



## ✅ Next Steps
- Evaluate performance with nDCG (normalized Discounted Cumulative Gain) or MAP (Mean Average Precision).
- Integrate into a real-time candidate retrieval system.
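
For the evaluation step, here is a minimal sketch of both metrics for a single ranked list. It assumes binary relevance labels (1 = starred, 0 = not) listed in ranked order, and that every relevant candidate appears somewhere in the list. MAP would be the mean of `average_precision` over several queries or roles.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: rel_i / log2(i + 1), with ranks starting at 1."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def average_precision(relevances):
    """Mean of precision@k taken at each rank k that holds a relevant item."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


ranking = [1, 0, 1, 0, 0]      # made-up labels: starred candidates at ranks 1 and 3
print(f"nDCG: {ndcg(ranking):.3f}")
print(f"AP:   {average_precision(ranking):.3f}")
```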