<h3>Pipeline Stage 3: Keyword Scoring</h3>

<p>Stage 3 of the pipeline uses the keywords extracted in Pipeline Stage 2 to score and match students to job opportunities. Combining keyword matches with the labels assigned during extraction gives a more nuanced match than keyword overlap alone. Fuzzy matching additionally captures relevant connections between students and jobs when the keywords do not match exactly.</p>

<p>The scoring criteria used in this stage are somewhat arbitrary and should be revisited. The approach mirrors the existing scoring system but adds labels and distinct categories. Technical skills and industry alignment carry the highest weight, reflecting their importance to employers. Soft skills and value alignment carry less weight and act as supplementary factors that can raise a candidate’s standing once the primary qualifications are met.</p>

<p>My rationale is that companies prioritize technical skills and industry alignment when evaluating candidates, since these relate most directly to job performance. Soft skills and value alignment, while important, are treated as additional strengths that differentiate candidates who are already technically qualified. The weighting is intended to pair the most qualified candidates with the most suitable job opportunities.</p>
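<p>To make the weighting concrete, the sketch below lists the most points a single matched pair can earn in each category, read from the scoring tables in the code cells later in this notebook. The names <code>MAX_POINTS_PER_MATCH</code> and <code>relative_weight</code> are hypothetical and exist only for this illustration; how the category totals are combined into a final score is not shown here.</p>

```python
# Maximum points a single matched pair can earn in each category,
# read from the scoring tables defined later in this notebook.
MAX_POINTS_PER_MATCH = {
    "technical": 20 + 4,  # High/High base points plus the Advanced/Advanced bonus
    "industry": 40,       # Primary/Primary
    "soft_skills": 3,     # Advanced/Advanced
    "values": 3,          # fixed points per exact match
}


def relative_weight(category, baseline="values"):
    """Return how many times more a best-case match in `category` is worth than in `baseline`."""
    return MAX_POINTS_PER_MATCH[category] / MAX_POINTS_PER_MATCH[baseline]


for name, points in MAX_POINTS_PER_MATCH.items():
    print(f"{name}: up to {points} points per match "
          f"({relative_weight(name):.1f}x a values match)")
```

<p>These are per-match ceilings only. A profile with many matched skills accumulates points across all of them, so the effective balance between categories also depends on how many keywords each side lists.</p>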

<p>The scoring process is not parallelized or optimized for speed and currently takes some time to complete (around a minute). If you'd prefer to skip the scoring process, a pre-scored table is available in the final results section.</p>


In [4]:
#~~~
import os 
import re
import time 
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
import pandas as pd
from fuzzywuzzy import fuzz
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict

# Custom imports
from mypackage.utils import *

In [None]:
# Load the project path
path_to_project = load_project_path()

if path_to_project:
    print(f"Project path loaded: {path_to_project}")
else:
    print("Please set the project path in the initial notebook.")

In [None]:
SP_path = f'{path_to_project}/data/SP_table/SP2_post_extract_o_bot.parquet'

SP = pd.read_parquet(SP_path) 

print(SP.shape)

<h3>Technical Skills Matching and Scoring</h3>

<p>This section of the pipeline compares and scores technical skills between student profiles and job postings.</p>

<p>The following code block defines functions that compare technical skills between job postings and student profiles. The scoring uses the labels assigned during keyword extraction, so skills that are highly relevant to both the candidate and the job earn more points. The process is split into functions that handle the parsing, matching, and scoring of skills, using both exact and fuzzy matching.</p>

<p>One of the significant challenges in the current matching process is the inclusion of irrelevant skills in student profiles. Students, myself included, often add as many skills as possible to their profiles, regardless of their relevance to the jobs or industries they are targeting. This can lead to a cluttered and less effective matching process.</p>

<p>By combining the relevance scores from student profiles with the necessity scores from job postings, the scoring functions give these irrelevant skills little weight, so the score is driven by the skills that matter most to the job. This improves the overall quality and accuracy of the matches.</p>
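<p>The effect of the two labels can be illustrated with a lookup table. The sketch below copies the base points from <code>calculate_match_score</code> in the code cell further down; the names <code>BASE_POINTS</code> and <code>base_points</code> are hypothetical and used only for illustration.</p>

```python
# Base points indexed by (job necessity, student relevance), copied from
# calculate_match_score in the code cell below.
BASE_POINTS = {
    ("High", "High"): 20, ("High", "Medium"): 14, ("High", "Low"): 8,
    ("Medium", "High"): 16, ("Medium", "Medium"): 12, ("Medium", "Low"): 6,
    ("Low", "High"): 12, ("Low", "Medium"): 8, ("Low", "Low"): 4,
}


def base_points(job_necessity, student_relevance):
    """Look up base points; unknown label combinations score 0, as in the if/elif chain."""
    return BASE_POINTS.get((job_necessity, student_relevance), 0)


# A skill the job barely needs and the student lists as low relevance...
padding_skill = base_points("Low", "Low")
# ...versus a skill that is central to both the job and the student's profile.
core_skill = base_points("High", "High")
print(padding_skill, core_skill)  # prints "4 20"
```

<p>A matched low-priority skill still earns a few points, so padding a profile is discounted rather than removed entirely. Skills that match nothing in the job posting contribute no points at all.</p>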


<h4>Key Functions:</h4>

<ul>
    <li><strong>parse_skills:</strong> This function extracts and structures the technical skills from the text of either a student profile or a job posting. Depending on the source, it parses the skill name, relevance (or necessity) score, and skill level score, returning them as a list of tuples.</li>
    <li><strong>calculate_match_score:</strong> This function calculates the match score for a pair of skills, one from a job posting and one from a student profile. It assigns base points based on the alignment of the relevance (or necessity) scores and adds or subtracts points based on the alignment of the skill level scores.</li>
    <li><strong>compare_technical_skills:</strong> This function compares the technical skills between job postings and student profiles. It first performs exact matching, directly comparing skill names. Then, it applies fuzzy matching to find skills that are similar but not identical, using a specified similarity threshold. The function calculates the total score for all matched skills and returns this score along with a detailed breakdown of the matches.</li>
    <li><strong>apply_technical_skills_comparison:</strong> This function applies the technical skills comparison to each row in the DataFrame, which contains both student profiles and job postings. For each row, it calculates the total points for technical skill matches and stores a breakdown of the matched skills in the DataFrame.</li>
</ul>
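<p>The parsing step relies on each keyword sitting on its own line in a fixed format. The sketch below shows that format on a made-up job posting; the sample text and the name <code>JOB_SKILL_PATTERN</code> are hypothetical, and the pattern is the one used for job postings in the code cell below.</p>

```python
import re

# Line format inferred from the regular expressions in the code cell below:
#   "<skill>, (Necessity Score: <label>), (Skill Level Score: <label>)"
JOB_SKILL_PATTERN = re.compile(
    r'^(.*?), \(Necessity Score: (\w+)\), \(Skill Level Score: (\w+)\)$'
)

# Hypothetical job posting keywords, made up for illustration
sample_job_text = """Python, (Necessity Score: High), (Skill Level Score: Advanced)
SQL, (Necessity Score: Medium), (Skill Level Score: Intermediate)
this line does not follow the format and is skipped"""

parsed = []
for line in sample_job_text.splitlines():
    match = JOB_SKILL_PATTERN.match(line.strip())
    if match:
        parsed.append(tuple(group.strip() for group in match.groups()))

print(parsed)
# [('Python', 'High', 'Advanced'), ('SQL', 'Medium', 'Intermediate')]
```

<p>Lines that deviate from the format are dropped silently, so a change in the extraction prompt's output format would show up as missing skills rather than as an error.</p>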

<p>In this code, the technical skills from student profiles are compared against those required by job postings. The matching process is designed to be both rigorous and flexible, allowing for exact matches as well as near matches (fuzzy matching). By integrating relevance and necessity scores, the matching process focuses on the most pertinent skills, filtering out less relevant ones and providing a nuanced score that reflects the alignment between a candidate's skills and the job requirements.</p>
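<p>The idea behind threshold-based fuzzy matching can be sketched with the standard library alone. The code below is a simplified approximation of a token-set comparison built on <code>difflib</code>; it is not the implementation behind <code>fuzz.token_set_ratio</code>, and its scores may differ from the library's. The function names and sample skill pairs are hypothetical.</p>

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity of two strings on a 0-100 scale, using difflib."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_similarity(a, b):
    """Simplified token-set comparison: ignores case, word order, and repeated words."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    shared = " ".join(sorted(tokens_a & tokens_b))
    combined_a = (shared + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    combined_b = (shared + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    # Take the best of: shared words vs each side, and the two sides against each other
    return max(ratio(shared, combined_a), ratio(shared, combined_b), ratio(combined_a, combined_b))


THRESHOLD = 90  # same default as compare_technical_skills

for job_skill, student_skill in [
    ("Machine Learning", "machine learning"),
    ("Python", "Python Programming"),
    ("Java", "JavaScript"),
]:
    score = token_set_similarity(job_skill, student_skill)
    verdict = "match" if score >= THRESHOLD else "no match"
    print(f"{job_skill!r} vs {student_skill!r}: {score} -> {verdict}")
```

<p>In this sketch a skill whose words are a subset of another's scores 100, so "Python" matches "Python Programming", while "Java" and "JavaScript" stay well below the threshold because they share no whole word. It is worth spot-checking a few real skill pairs against the library's output before tuning the threshold.</p>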


In [7]:
# Function to parse skills from a given text
def parse_skills(skills_text, source):
    """
    Function Name: parse_skills
    Purpose/Description:
        Parses the skills from a given text, extracting relevant information based on the source (student or job posting).
    Parameters:
        - skills_text (str): The text containing the skills to parse.
        - source (str): The source of the text ('s' for student, 'j' for job posting).
    Return Value:
        - list: A list of tuples, each containing the skill name, relevance/necessity score, and skill level score.
    """

    # Student lines carry a Relevance Score; job posting lines carry a Necessity Score
    score_labels = {'s': 'Relevance Score', 'j': 'Necessity Score'}
    if source not in score_labels:
        # Previously an unknown source returned None, which failed later in the comparison loops
        raise ValueError(f"source must be 's' or 'j', got {source!r}")

    # Expected line format: "<skill>, (<score label>: <word>), (Skill Level Score: <word>)"
    pattern = re.compile(
        rf'^(.*?), \({score_labels[source]}: (\w+)\), \(Skill Level Score: (\w+)\)$'
    )

    skills = []
    for line in skills_text.splitlines():  # Split the text into lines and process each line
        match = pattern.match(line.strip())
        if match:
            skill, relevance, skill_level = (group.strip() for group in match.groups())
            skills.append((skill, relevance, skill_level))  # Append the parsed skill information to the list
    return skills

# Function to calculate the match score for a single skill pair
def calculate_match_score(job_skill, student_skill):
    """
    Function Name: calculate_match_score
    Purpose/Description:
        Calculates the match score for a pair of job and student skills based on relevance/necessity and skill level alignment.
    Parameters:
        - job_skill (tuple): A tuple containing the job skill name, relevance/necessity score, and skill level score.
        - student_skill (tuple): A tuple containing the student skill name, relevance score, and skill level score.
    Return Value:
        - int: The total match score for the skill pair.
    """
    
    job_skill_name, job_relevance, job_skill_level = job_skill
    student_skill_name, student_relevance, student_skill_level = student_skill
    
    # Calculate base points for Necessity and Relevance alignment
    base_points = 0
    if job_relevance == 'High' and student_relevance == 'High':
        base_points = 20
    elif job_relevance == 'High' and student_relevance == 'Medium':
        base_points = 14
    elif job_relevance == 'High' and student_relevance == 'Low':
        base_points = 8
    elif job_relevance == 'Medium' and student_relevance == 'High':
        base_points = 16
    elif job_relevance == 'Medium' and student_relevance == 'Medium':
        base_points = 12
    elif job_relevance == 'Medium' and student_relevance == 'Low':
        base_points = 6
    elif job_relevance == 'Low' and student_relevance == 'High':
        base_points = 12
    elif job_relevance == 'Low' and student_relevance == 'Medium':
        base_points = 8
    elif job_relevance == 'Low' and student_relevance == 'Low':
        base_points = 4
    
    # Calculate additional points for Skill Level alignment
    additional_points = 0
    if job_skill_level == 'Advanced' and student_skill_level == 'Advanced':
        additional_points = 4
    elif job_skill_level == 'Advanced' and student_skill_level == 'Intermediate':
        additional_points = 2
    elif job_skill_level == 'Advanced' and student_skill_level == 'Beginner':
        additional_points = -2
    elif job_skill_level == 'Intermediate' and student_skill_level == 'Advanced':
        additional_points = 2
    elif job_skill_level == 'Intermediate' and student_skill_level == 'Intermediate':
        additional_points = 0
    elif job_skill_level == 'Intermediate' and student_skill_level == 'Beginner':
        additional_points = -2
    elif job_skill_level == 'Beginner' and student_skill_level == 'Advanced':
        additional_points = 2
    elif job_skill_level == 'Beginner' and student_skill_level == 'Intermediate':
        additional_points = 0
    elif job_skill_level == 'Beginner' and student_skill_level == 'Beginner':
        additional_points = 0
    
    # Final match score
    total_score = base_points + additional_points
    return total_score

# Function to compare technical skills with exact and fuzzy matching
def compare_technical_skills(job_skills_text, student_skills_text, threshold=90):
    """
    Function Name: compare_technical_skills
    Purpose/Description:
        Compares technical skills between job postings and student profiles using both exact and fuzzy matching.
    Parameters:
        - job_skills_text (str): The text containing the technical skills from the job posting.
        - student_skills_text (str): The text containing the technical skills from the student profile.
        - threshold (int): The similarity threshold for fuzzy matching (default is 90).
    Return Value:
        - tuple: A tuple containing the total match score and a string describing the matched skills.
    """
    
    job_skills = parse_skills(job_skills_text, 'j')  # Parse skills from job posting
    student_skills = parse_skills(student_skills_text, 's')  # Parse skills from student profile

    total_score = 0
    matched_skills = []
    exact_matches = set()

    # Exact matching
    for job_skill in job_skills:
        for student_skill in student_skills:
            if job_skill[0] == student_skill[0]:  # Check for exact match of skill names
               
                score = calculate_match_score(job_skill, student_skill)  # Calculate match score
                total_score += score
                matched_skills.append(f"{job_skill[0]}: {score} points")  # Record matched skill and score
                exact_matches.add((job_skill[0], student_skill[0]))  # Track exact matches
    
    # Fuzzy matching, excluding exact matches
    for job_skill in job_skills:
        for student_skill in student_skills:      
            if (job_skill[0], student_skill[0]) not in exact_matches and fuzz.token_set_ratio(job_skill[0], student_skill[0]) >= threshold:
                score = calculate_match_score(job_skill, student_skill)  # Calculate match score for fuzzy match
                total_score += score
                matched_skills.append(f"{job_skill[0]} (Fuzzy Match: {student_skill[0]}): {score} points")  # Record fuzzy match and score
                exact_matches.add((job_skill[0], student_skill[0]))  # Track fuzzy matches to avoid duplicates

    return total_score, "\n".join(matched_skills)

# Apply the function to the DataFrame
def apply_technical_skills_comparison(df):
    """
    Function Name: apply_technical_skills_comparison
    Purpose/Description:
        Applies the technical skills comparison to each row of the DataFrame, calculating the total points and providing a breakdown of matched skills.
    Parameters:
        - df (pandas.DataFrame): The DataFrame containing the student and job profiles to compare.
    Return Value:
        - pandas.DataFrame: The updated DataFrame with calculated technical skill match scores and breakdowns.
    """
    
    df['Tech_total_points'] = 0  # Initialize column for total technical skill points
    df['Tech_total_points_breakup'] = ""  # Initialize column for the breakdown of matched skills

    for index, row in df.iterrows():  # Iterate through each row of the DataFrame
        job_skills_text = row['pos_technical_keywords']  # Extract job technical skills text
        student_skills_text = row['stu_technical_keywords']  # Extract student technical skills text
        
        total_score, matched_skills_breakup = compare_technical_skills(job_skills_text, student_skills_text)  # Compare skills
        
        df.at[index, 'Tech_total_points'] = total_score  # Assign total score to DataFrame
        df.at[index, 'Tech_total_points_breakup'] = f"Matched Technical Skills:\n{matched_skills_breakup}"  # Assign skill breakdown to DataFrame

    return df

# Assuming your DataFrame is called SP, you would run:
SP = apply_technical_skills_comparison(SP)


<h3>Industry Keyword Matching and Scoring</h3>

<p>This section of the pipeline is dedicated to comparing and scoring industry keywords between job postings and student profiles. The process involves parsing industry-related keywords from both job and student profiles, matching them based on their labels, and calculating a total match score. This helps in determining how well a student's industry interests align with the industries relevant to a job posting.</p>

<h4>Key Functions:</h4>

<ul>
    <li><strong>parse_industries:</strong> This function extracts and structures industry keywords from the provided text. It parses the industry name and its associated label (e.g., Primary, Secondary, Tertiary) and returns them as a list of tuples. This is done for both job postings and student profiles to prepare the data for comparison.</li>
    <li><strong>calculate_industry_match_score:</strong> This function calculates the match score for a pair of industries, one from a job posting and one from a student profile. The score is determined based on how the labels (Primary, Secondary, Tertiary) of the industries align. For example, a Primary-Primary match yields a higher score than a Primary-Tertiary match.</li>
    <li><strong>compare_industries:</strong> This function compares the industry keywords between a job posting and a student profile. It performs exact matching of industry names and calculates a total match score for the matched industries. The function returns the total score and a detailed breakdown of the matched industries and their respective scores.</li>
    <li><strong>apply_industry_comparison:</strong> This function applies the industry comparison process to each row in the DataFrame, which contains both student profiles and job postings. It calculates the total points for industry matches for each row and stores a breakdown of the matched industries in the DataFrame.</li>
</ul>

<p>This industry keyword matching lets the pipeline assess how well a student's industry interests align with the industries relevant to each job posting, which contributes to more accurate and meaningful student-job matches.</p>
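<p>As a worked example, the sketch below scores one made-up job posting against one made-up student profile. The point table is copied from <code>calculate_industry_match_score</code> in the code cell below; the sample text and the names <code>INDUSTRY_POINTS</code> and <code>parse</code> are hypothetical. For brevity it stores each profile as a dictionary, which assumes an industry appears at most once per profile, whereas the notebook's functions work on lists of tuples.</p>

```python
import re

# Points by (job label, student label), copied from calculate_industry_match_score below
INDUSTRY_POINTS = {
    ("Primary", "Primary"): 40, ("Primary", "Secondary"): 30, ("Primary", "Tertiary"): 5,
    ("Secondary", "Primary"): 30, ("Secondary", "Secondary"): 30, ("Secondary", "Tertiary"): 5,
    ("Tertiary", "Primary"): 15, ("Tertiary", "Secondary"): 10, ("Tertiary", "Tertiary"): 5,
}
INDUSTRY_PATTERN = re.compile(r'^(.*?), \(Industry Label: (\w+)\)$')


def parse(text):
    """Return {industry name: label} for every line that follows the expected format."""
    matches = (INDUSTRY_PATTERN.match(line.strip()) for line in text.splitlines())
    return {m.group(1).strip(): m.group(2) for m in matches if m}


# Hypothetical keyword text, made up for illustration
job_text = "Fintech, (Industry Label: Primary)\nHealthcare, (Industry Label: Tertiary)"
student_text = (
    "Fintech, (Industry Label: Secondary)\n"
    "Healthcare, (Industry Label: Primary)\n"
    "Gaming, (Industry Label: Primary)"
)

job, student = parse(job_text), parse(student_text)
total = sum(
    INDUSTRY_POINTS.get((job[name], student[name]), 0)
    for name in job
    if name in student  # exact, case-sensitive name match
)
print(total)  # Fintech: Primary/Secondary = 30, Healthcare: Tertiary/Primary = 15, total 45
```

<p>The student's interest in Gaming earns nothing because the job posting does not list it, and the job's label matters more than the student's: a Tertiary industry for the job is worth at most 15 points.</p>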


In [8]:
# Function to parse industry keywords from a given text
def parse_industries(industry_text):
    """
    Function Name: parse_industries
    Purpose/Description:
        Parses industry keywords from the provided text, extracting the industry name and its associated label.
    Parameters:
        - industry_text (str): The text containing the industry keywords to parse.
    Return Value:
        - list: A list of tuples, each containing the industry name and its corresponding label.
    """
    industries = []
    for line in industry_text.splitlines():  # Split the text into lines and process each line
        # Match the pattern for industries with labels
        match = re.match(r'^(.*?), \(Industry Label: (\w+)\)$', line.strip())
        if match:
            industry = match.group(1).strip()
            industry_label = match.group(2).strip()
            industries.append((industry, industry_label))  # Append the parsed industry information to the list
    return industries

# Function to calculate the match score for a single industry pair
def calculate_industry_match_score(job_industry, student_industry):
    """
    Function Name: calculate_industry_match_score
    Purpose/Description:
        Calculates the match score for a pair of industries, one from a job posting and one from a student profile, based on their labels.
    Parameters:
        - job_industry (tuple): A tuple containing the job industry name and its label.
        - student_industry (tuple): A tuple containing the student industry name and its label.
    Return Value:
        - int: The match score for the industry pair.
    """
    job_industry_name, job_label = job_industry
    student_industry_name, student_label = student_industry
    
    # Calculate points based on the alignment of industry labels
    if job_label == 'Primary' and student_label == 'Primary':
        return 40
    elif job_label == 'Primary' and student_label == 'Secondary':
        return 30
    elif job_label == 'Primary' and student_label == 'Tertiary':
        return 5
    elif job_label == 'Secondary' and student_label == 'Primary':
        return 30
    elif job_label == 'Secondary' and student_label == 'Secondary':
        return 30
    elif job_label == 'Secondary' and student_label == 'Tertiary':
        return 5
    elif job_label == 'Tertiary' and student_label == 'Primary':
        return 15
    elif job_label == 'Tertiary' and student_label == 'Secondary':
        return 10
    elif job_label == 'Tertiary' and student_label == 'Tertiary':
        return 5
    return 0  # Unrecognized label on either side: award no points instead of returning None

# Function to compare industry keywords
def compare_industries(job_industry_text, student_industry_text):
    """
    Function Name: compare_industries
    Purpose/Description:
        Compares industry keywords between job postings and student profiles, calculating a total match score.
    Parameters:
        - job_industry_text (str): The text containing the industry keywords from the job posting.
        - student_industry_text (str): The text containing the industry keywords from the student profile.
    Return Value:
        - tuple: A tuple containing the total match score and a string detailing the matched industries.
    """
    job_industries = parse_industries(job_industry_text)  # Parse industries from job posting
    student_industries = parse_industries(student_industry_text)  # Parse industries from student profile
    
    total_score = 0
    matched_industries = []

    # Exact matching of industries
    for job_industry in job_industries:
        for student_industry in student_industries:
            if job_industry[0] == student_industry[0]:  # Check for exact match of industry names
                score = calculate_industry_match_score(job_industry, student_industry)  # Calculate match score
                total_score += score
                matched_industries.append(f"{job_industry[0]}: {score} points")  # Record matched industry and score

    return total_score, "\n".join(matched_industries)

# Apply the function to the DataFrame
def apply_industry_comparison(df):
    """
    Function Name: apply_industry_comparison
    Purpose/Description:
        Applies the industry comparison function to each row in the DataFrame, calculating total points for industry matches.
    Parameters:
        - df (pandas.DataFrame): The DataFrame containing student and job profiles to compare.
    Return Value:
        - pandas.DataFrame: The updated DataFrame with calculated industry match scores and breakdowns.
    """
    
    df['Industry_total_points'] = 0  # Initialize column for total industry points
    df['Industry_total_points_breakup'] = ""  # Initialize column for the breakdown of matched industries

    for index, row in df.iterrows():  # Iterate through each row of the DataFrame
        job_industry_text = row['pos_industry_keywords']  # Extract job industry keywords text
        student_industry_text = row['stu_industry_keywords']  # Extract student industry keywords text
        
        total_score, matched_industries_breakup = compare_industries(job_industry_text, student_industry_text)  # Compare industries
        
        df.at[index, 'Industry_total_points'] = total_score  # Assign total score to DataFrame
        df.at[index, 'Industry_total_points_breakup'] = f"Matched Industries:\n{matched_industries_breakup}"  # Assign industry breakdown to DataFrame

    return df

# Assuming your DataFrame is called SP, you would run:
SP = apply_industry_comparison(SP)


<h3>Soft Skills Keyword Matching and Scoring</h3>

<p>This section of the pipeline is designed to compare and score soft skills between job postings and student profiles. The process involves parsing soft skills from both job and student profiles, matching them based on their skill levels, and calculating a total match score. This helps determine how well a student's soft skills align with the requirements or preferences of a job posting.</p>

<h4>Key Functions:</h4>

<ul>
    <li><strong>parse_soft_skills:</strong> This function extracts and structures soft skills from the provided text. It parses the skill name and its associated skill level score (e.g., Beginner, Intermediate, Advanced) and returns them as a list of tuples. This is done for both job postings and student profiles to prepare the data for comparison.</li>
    <li><strong>calculate_soft_skill_match_score:</strong> This function calculates the match score for a pair of soft skills, one from a job posting and one from a student profile, based on their skill levels. The score is determined by how well the skill levels align. For example, an Advanced-Advanced match yields a higher score than an Advanced-Beginner match.</li>
     <li><strong>compare_soft_skills:</strong> This function compares the soft skills between a job posting and a student profile. It performs exact matching of soft skill names and calculates a total match score for the matched skills. The function returns the total score and a detailed breakdown of the matched soft skills and their respective scores.</li>
    <li><strong>apply_soft_skills_comparison:</strong> This function applies the soft skills comparison process to each row in the DataFrame, which contains both student profiles and job postings. It calculates the total points for soft skill matches for each row and stores a breakdown of the matched skills in the DataFrame.</li>
</ul>

<p>This soft skills keyword matching lets the pipeline assess how well a student's soft skills align with those each job posting asks for, which contributes to more accurate and meaningful student-job matches.</p>
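<p>The level-alignment scoring can be sketched as a lookup table. The points below are copied from <code>calculate_soft_skill_match_score</code> in the code cell below; the matched pairs are made up, the names <code>SOFT_SKILL_POINTS</code> and <code>soft_skill_points</code> are hypothetical, and scoring unknown levels as 0 is an assumption of this sketch.</p>

```python
# Points by (job skill level, student skill level),
# copied from calculate_soft_skill_match_score below
SOFT_SKILL_POINTS = {
    ("Advanced", "Advanced"): 3, ("Advanced", "Intermediate"): 2, ("Advanced", "Beginner"): 1,
    ("Intermediate", "Advanced"): 2, ("Intermediate", "Intermediate"): 2, ("Intermediate", "Beginner"): 1,
    ("Beginner", "Advanced"): 1, ("Beginner", "Intermediate"): 1, ("Beginner", "Beginner"): 1,
}


def soft_skill_points(job_level, student_level):
    """Return the points for one matched soft skill; unknown levels score 0 (an assumption)."""
    return SOFT_SKILL_POINTS.get((job_level, student_level), 0)


# Hypothetical matched pairs: (skill name, job level, student level)
matched = [
    ("Communication", "Advanced", "Advanced"),
    ("Teamwork", "Intermediate", "Beginner"),
    ("Leadership", "Beginner", "Advanced"),
]
total = sum(soft_skill_points(job, student) for _, job, student in matched)
print(total)  # 3 + 1 + 1 = 5
```

<p>Every matched soft skill earns at least 1 point, and a student who exceeds a Beginner requirement is not rewarded beyond that, so the job's level caps the score.</p>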


In [9]:
# Function to parse soft skills from a given text
def parse_soft_skills(soft_skills_text):
    """
    Function Name: parse_soft_skills
    Purpose/Description:
        Parses soft skills from the provided text, extracting the skill name and its associated skill level score.
    Parameters:
        - soft_skills_text (str): The text containing the soft skills to parse.
    Return Value:
        - list: A list of tuples, each containing the soft skill name and its corresponding skill level score.
    """
    soft_skills = []
    for line in soft_skills_text.splitlines():  # Split the text into lines and process each line
        # Match the pattern for soft skills with skill level scores
        match = re.match(r'^(.*?), \(Skill Level Score: (\w+)\)$', line.strip())
        if match:
            skill = match.group(1).strip()
            skill_level = match.group(2).strip()
            soft_skills.append((skill, skill_level))  # Append the parsed soft skill information to the list
    return soft_skills

# Function to calculate the match score for a single soft skill pair
def calculate_soft_skill_match_score(job_skill, student_skill):
    """
    Function Name: calculate_soft_skill_match_score
    Purpose/Description:
        Calculates the match score for a pair of soft skills, one from a job posting and one from a student profile, based on their skill levels.
    Parameters:
        - job_skill (tuple): A tuple containing the job soft skill name and its skill level score.
        - student_skill (tuple): A tuple containing the student soft skill name and its skill level score.
    Return Value:
        - int: The match score for the soft skill pair.
    """
    job_skill_name, job_skill_level = job_skill
    student_skill_name, student_skill_level = student_skill
    
    # Calculate points based on the alignment of skill level scores
    if job_skill_level == 'Advanced' and student_skill_level == 'Advanced':
        return 3
    elif job_skill_level == 'Advanced' and student_skill_level == 'Intermediate':
        return 2
    elif job_skill_level == 'Advanced' and student_skill_level == 'Beginner':
        return 1
    elif job_skill_level == 'Intermediate' and student_skill_level == 'Advanced':
        return 2
    elif job_skill_level == 'Intermediate' and student_skill_level == 'Intermediate':
        return 2
    elif job_skill_level == 'Intermediate' and student_skill_level == 'Beginner':
        return 1
    elif job_skill_level == 'Beginner' and student_skill_level == 'Advanced':
        return 1
    elif job_skill_level == 'Beginner' and student_skill_level == 'Intermediate':
        return 1
    elif job_skill_level == 'Beginner' and student_skill_level == 'Beginner':
        return 1
    return 0  # Unrecognized skill level on either side: award no points instead of returning None

# Function to compare soft skills
def compare_soft_skills(job_soft_skills_text, student_soft_skills_text):
    """
    Function Name: compare_soft_skills
    Purpose/Description:
        Compares soft skills between job postings and student profiles, calculating a total match score.
    Parameters:
        - job_soft_skills_text (str): The text containing the soft skills from the job posting.
        - student_soft_skills_text (str): The text containing the soft skills from the student profile.
    Return Value:
        - tuple: A tuple containing the total match score and a string detailing the matched soft skills.
    """
    job_soft_skills = parse_soft_skills(job_soft_skills_text)  # Parse soft skills from job posting
    student_soft_skills = parse_soft_skills(student_soft_skills_text)  # Parse soft skills from student profile
    
    total_score = 0
    matched_soft_skills = []

    # Exact matching of soft skills
    for job_skill in job_soft_skills:
        for student_skill in student_soft_skills:
            if job_skill[0] == student_skill[0]:  # Check for exact match of soft skill names
                score = calculate_soft_skill_match_score(job_skill, student_skill)  # Calculate match score
                total_score += score
                matched_soft_skills.append(f"{job_skill[0]}: {score} points")  # Record matched soft skill and score

    return total_score, "\n".join(matched_soft_skills)

# Apply the function to the DataFrame
def apply_soft_skills_comparison(df):
    """
    Function Name: apply_soft_skills_comparison
    Purpose/Description:
        Applies the soft skills comparison function to each row in the DataFrame, calculating total points for soft skill matches.
    Parameters:
        - df (pandas.DataFrame): The DataFrame containing student and job profiles to compare.
    Return Value:
        - pandas.DataFrame: The updated DataFrame with calculated soft skill match scores and breakdowns.
    """
    
    df['Soft_Skills_total_points'] = 0  # Initialize column for total soft skills points
    df['Soft_Skills_total_points_breakup'] = ""  # Initialize column for the breakdown of matched soft skills

    for index, row in df.iterrows():  # Iterate through each row of the DataFrame
        job_soft_skills_text = row['pos_soft_keywords']  # Extract job soft skills keywords text
        student_soft_skills_text = row['stu_soft_keywords']  # Extract student soft skills keywords text
        
        total_score, matched_soft_skills_breakup = compare_soft_skills(job_soft_skills_text, student_soft_skills_text)  # Compare soft skills
        
        df.at[index, 'Soft_Skills_total_points'] = total_score  # Assign total score to DataFrame
        df.at[index, 'Soft_Skills_total_points_breakup'] = f"Matched Soft Skills:\n{matched_soft_skills_breakup}"  # Assign soft skill breakdown to DataFrame

    return df

# Assuming your DataFrame is called SP, you would run:
SP = apply_soft_skills_comparison(SP)


<h3>Values Keyword Matching and Scoring</h3>

<p>This section of the pipeline focuses on comparing and scoring the values between job postings and student profiles. The process involves parsing values from both job and student profiles, matching them based on exact matches, and calculating a total match score. This helps to assess how well a student's personal values align with the values emphasized in the job postings.</p>

<h4>Key Functions:</h4>

<ul>
    <li><strong>parse_values:</strong> This function extracts individual values from the provided text. It splits the text into lines, removes any extra whitespace, and filters out empty lines. This process is used for both job postings and student profiles to prepare the data for comparison.</li>
    <li><strong>calculate_value_match_score:</strong> This function calculates the match score for values between job postings and student profiles. If a value from the job posting matches a value from the student profile, a fixed number of points (3 points) is added to the total score. The function returns the total score and a breakdown of the matched values with their respective scores.</li>
    <li><strong>compare_values:</strong> This function compares the values from job postings and student profiles. It uses the <code>parse_values</code> function to extract the values and the <code>calculate_value_match_score</code> function to compute the match score. The function returns the total match score along with a detailed breakdown of the matched values.</li>
    <li><strong>apply_values_comparison:</strong> This function applies the values comparison process to each row in the DataFrame, which contains both student profiles and job postings. It calculates the total points for value matches for each row and stores a breakdown of the matched values in the DataFrame.</li>
</ul>

<p>This values keyword matching process gives the pipeline a measure of how well a student's values align with those of potential employers. Because only exact matches are counted, every point awarded corresponds to a value that appears verbatim in both profiles; the trade-off is that values differing in case or wording (for example, "Integrity" versus "integrity") are not matched.</p>
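<p>As a rough illustration of the exact-match scoring described above, the sketch below runs the same logic on two small, invented keyword strings. The helper <code>score_values</code>, the constant <code>POINTS_PER_MATCH</code>, and the sample inputs are hypothetical and exist only for this example; the pipeline's own functions follow in the next cell.</p>

```python
POINTS_PER_MATCH = 3  # Fixed score per matched value, mirroring the 3 points used in this stage


def parse_values(values_text):
    """Split newline-separated text into a list of stripped, non-empty values."""
    return [line.strip() for line in values_text.splitlines() if line.strip()]


def score_values(job_values_text, student_values_text):
    """Hypothetical helper: return (total_score, matched_values) using exact string comparison."""
    job_values = parse_values(job_values_text)
    student_values = parse_values(student_values_text)
    matched = [value for value in job_values if value in student_values]
    return POINTS_PER_MATCH * len(matched), matched


# Invented inputs: note the blank line, the trailing spaces, and the lowercase "integrity"
job_text = "Integrity\nInnovation\n\nTeamwork  "
student_text = "integrity\nInnovation\nTeamwork\nSustainability"

score, matched = score_values(job_text, student_text)
print(score)    # 6
print(matched)  # ['Innovation', 'Teamwork']
```

<p>Blank lines and surrounding whitespace are ignored, but "Integrity" and "integrity" do not match because the comparison is case-sensitive. Whether to normalize case before comparing depends on how consistently the keywords were extracted in the previous stage.</p>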


In [10]:
# Function to parse values from a given text
def parse_values(values_text):
    """
    Function Name: parse_values
    Purpose/Description:
        Parses values or keywords related to values from the provided text.
    Parameters:
        - values_text (str): The text containing the values to parse.
    Return Value:
        - list: A list of values extracted from the text.
    """
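    # Missing entries (e.g. None or NaN) are not strings; treat them as having no values
    if not isinstance(values_text, str):
        return []
    # Split the text into lines, strip any extra whitespace, and filter out any empty lines
    values = [line.strip() for line in values_text.splitlines() if line.strip()]
    return values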

# Function to calculate the match score for values
def calculate_value_match_score(job_values, student_values):
    """
    Function Name: calculate_value_match_score
    Purpose/Description:
        Calculates the match score for a set of values between job postings and student profiles.
    Parameters:
        - job_values (list): A list of values from the job posting.
        - student_values (list): A list of values from the student profile.
    Return Value:
        - tuple: The total match score and a string detailing the matched values and their respective scores.
    """
    total_score = 0
    matched_values = []

    # Exact matching of values
    for job_value in job_values:
        if job_value in student_values:  # If the job value is found in the student's values
            total_score += 3  # Add 3 points for each match
            matched_values.append(f"{job_value}: 3 points")  # Record the matched value and its score

    return total_score, "\n".join(matched_values)

# Function to compare values
def compare_values(job_values_text, student_values_text):
    """
    Function Name: compare_values
    Purpose/Description:
        Compares values between job postings and student profiles, calculating a total match score.
    Parameters:
        - job_values_text (str): The text containing the values from the job posting.
        - student_values_text (str): The text containing the values from the student profile.
    Return Value:
        - tuple: The total match score and a string detailing the matched values and their respective scores.
    """
    job_values = parse_values(job_values_text)  # Parse values from the job posting
    student_values = parse_values(student_values_text)  # Parse values from the student profile
    
    total_score, matched_values_breakup = calculate_value_match_score(job_values, student_values)  # Compare and score the values

    return total_score, matched_values_breakup

# Apply the function to the DataFrame
def apply_values_comparison(df):
    """
    Function Name: apply_values_comparison
    Purpose/Description:
        Applies the values comparison function to each row in the DataFrame, calculating total points for value matches.
    Parameters:
        - df (pandas.DataFrame): The DataFrame containing student and job profiles to compare.
    Return Value:
        - pandas.DataFrame: The updated DataFrame with calculated values match scores and breakdowns.
    """
    df['Values_total_points'] = 0  # Initialize a column for total values points
    df['Values_total_points_breakup'] = ""  # Initialize a column for the breakdown of matched values

    for index, row in df.iterrows():  # Iterate through each row of the DataFrame
        job_values_text = row['pos_values_keywords']  # Extract job values keywords text
        student_values_text = row['stu_values_keywords']  # Extract student values keywords text
        
        total_score, matched_values_breakup = compare_values(job_values_text, student_values_text)  # Compare values
        
        df.at[index, 'Values_total_points'] = total_score  # Assign total score to DataFrame
        df.at[index, 'Values_total_points_breakup'] = f"Matched Values:\n{matched_values_breakup}"  # Assign values breakdown to DataFrame

    return df

# Assuming your DataFrame is called SP, you would run:
SP = apply_values_comparison(SP)


In [11]:
# Extraction total points are calculated by summing the points from the four categories.
# The 'Extraction_' prefix distinguishes this column from the 'Total Points' column, which is scored using the current matching logic.

points = ['Tech_total_points', 'Industry_total_points', 'Soft_Skills_total_points', 'Values_total_points']

SP['Extraction_total_points'] = SP[points].sum(axis=1)

### Final Result

The final result of this pipeline is a DataFrame scored by keyword matching. In the next stage, these scores will be used to select the best matches between students and job postings, so that the most closely aligned candidates are paired with the most suitable opportunities.
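
As a toy illustration of how the four category scores roll up into `Extraction_total_points`, the sketch below uses a small DataFrame with invented numbers. The `toy` and `ranked` names and all the scores are hypothetical; only the column names come from this stage.

```python
import pandas as pd

# Hypothetical scores for two student-job pairs, invented for illustration
toy = pd.DataFrame({
    'Tech_total_points': [20, 5],
    'Industry_total_points': [10, 0],
    'Soft_Skills_total_points': [6, 3],
    'Values_total_points': [6, 0],
})

points = ['Tech_total_points', 'Industry_total_points', 'Soft_Skills_total_points', 'Values_total_points']

# Row-wise sum across the four category columns
toy['Extraction_total_points'] = toy[points].sum(axis=1)

# Rank pairs from highest to lowest combined score
ranked = toy.sort_values('Extraction_total_points', ascending=False)
print(ranked['Extraction_total_points'].tolist())  # [42, 8]
```

Because the total is a plain sum, the relative weight of each category is determined entirely by the points awarded inside that category's scoring function.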

If you have generated and scored your own data, comment out the file-read cell below so that it does not overwrite your `SP` DataFrame.

In [13]:
post_scoring_path = f'{path_to_project}/data/SP_table/SP3_post_keyword_scoring.parquet'
SP = pd.read_parquet(post_scoring_path)

In [None]:
# Example of the extracted points. Only rows with more than 40 total points and more than 0 Industry points are shown.

cols = '''
stu_Legal Name
pos_Company
pos_Name
Extraction_total_points
Tech_total_points
Tech_total_points_breakup
Industry_total_points
Industry_total_points_breakup
Soft_Skills_total_points
Soft_Skills_total_points_breakup
Values_total_points
Values_total_points_breakup
'''

cols = as_list(cols)

sample = SP[cols]
sample = sample[sample['Extraction_total_points'] > 40]
sample = sample[sample['Industry_total_points'] > 0]
sample_10 = sample.sample(min(10, len(sample)))  # Up to 10 random rows; avoids a ValueError if fewer than 10 pass the filters
pretty_print(sample_10)
