Open this notebook in Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Riminder/hrflow-cookbook/blob/main/examples/%5BParsing%5D%20parsing_evaluator.ipynb)

##### Copyright 2024 HrFlow's AI Research Department

Licensed under the Apache License, Version 2.0 (the "License");

In [None]:
# Copyright 2024 HrFlow's AI Research Department. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

Welcome to this Google Colaboratory tutorial for developers. This Jupyter notebook helps you **evaluate CV parsing** results produced by HrFlow. It **generates an Excel report** showing, for each resume already parsed into a given HrFlow source, which fields were extracted and how completely.

Before proceeding, make sure you have created a source to store your data and have already parsed some profiles into it. The following links explain how:
- **Create your source**: [Connectors Source Documentation](https://developers.hrflow.ai/docs/connectors-source)
- **Parse profiles**:
  - [📖 Resume Parsing Guide](https://developers.hrflow.ai/docs/resume-parsing)
  - [🎯 Endpoint Documentation](https://developers.hrflow.ai/reference/parse-a-resume)
  - [📓 Notebook](https://github.com/Riminder/hrflow-cookbook/blob/main/examples/%5BParsing%5D%20profile_job_parsing.ipynb)

# Getting Started

In [None]:
!pip install -q -U tqdm ipywidgets openpyxl pydantic==1.10.14 hrflow

In [None]:
import json
import os
import re
import typing as t
import urllib.request
from functools import wraps
from getpass import getpass
from glob import glob
from io import BytesIO
from time import sleep, time

from hrflow import Hrflow
from openpyxl import load_workbook
from openpyxl.utils.cell import get_column_interval
from openpyxl.workbook.workbook import Workbook
from openpyxl.worksheet.worksheet import Worksheet
from pydantic import BaseModel
from tqdm.notebook import tqdm

In [None]:
OUTPUT_PATH = "parsing-evaluation.xlsx"

API_SECRET = getpass("API secret: ")
API_USER = getpass("API user email: ")
SOURCE_KEY = getpass("Source key: ")

client = Hrflow(api_secret=API_SECRET, api_user=API_USER)

# 0. 🛠 Retrieve Parsed Profiles
This segment of the notebook fetches profiles stored in your designated HrFlow source. It showcases how to paginate through your profiles and collect them for evaluation.
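
As a rough offline illustration of that pagination pattern, the sketch below collects every page from a stand-in endpoint. The response shape (`meta.maxPage` and `data`) mirrors what the retrieval cell in this notebook reads; `collect_all_pages`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names introduced here and are not part of the HrFlow client.

```python
import typing as t


def collect_all_pages(
    fetch_page: t.Callable[[int], t.Dict[str, t.Any]]
) -> t.List[t.Any]:
    """Collect the "data" items of every page, bounded by "meta.maxPage".

    `fetch_page` is a hypothetical callable standing in for a paginated API call.
    """
    first = fetch_page(1)
    max_page = first["meta"]["maxPage"]
    items = list(first["data"])  # reuse page 1 instead of fetching it twice
    for page in range(2, max_page + 1):
        items += fetch_page(page)["data"]
    return items


# Fake three-page endpoint (illustration only, not the HrFlow client).
FAKE_PAGES = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}


def fake_fetch(page: int) -> t.Dict[str, t.Any]:
    return {"meta": {"maxPage": len(FAKE_PAGES)}, "data": FAKE_PAGES[page]}


all_items = collect_all_pages(fake_fetch)
print(all_items)
```

Unlike the cell that follows, this sketch keeps the data from the first request rather than using it only to read `maxPage`. Whether that saves a call against the real API depends on what the first request returns, so treat it as an optional variation.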

In [None]:
max_page = client.profile.storing.list(source_keys=[SOURCE_KEY])["meta"]["maxPage"]
retrieved_profiles = []
for page in tqdm(range(1, max_page + 1), "Retrieving profiles"):
    retrieved_profiles += client.profile.storing.list(
        source_keys=[SOURCE_KEY], page=page, return_profile=True)["data"]

# 1. ⭐ Profile Evaluation

To keep the analysis structured, we introduce several classes (`InfoEvaluation`, `ExperienceEvaluation`, `EducationEvaluation`, `OtherEvaluation`, `ProfileEvaluation`). Each class covers one part of a CV: personal information, experience, education, and the remaining sections. A field counts as parsed when it holds a non-empty value; for experiences and educations, each field score is the fraction of entries in which that field is filled, and the section score is the average of those field scores.

The `parsing_evaluator` function runs these evaluations on every profile and returns a list of `ProfileEvaluation` objects, which are then used to populate the Excel report.
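
To make the scoring concrete, here is a minimal sketch of the fill-rate idea applied to a made-up list of experiences. The helpers `presence` and `fill_rate` are hypothetical names used only for illustration, and the sketch averages three fields where the notebook's classes average six.

```python
import typing as t


def presence(value: t.Any) -> int:
    """Return 1 when a field holds a non-empty value, else 0."""
    return 1 if value else 0


def fill_rate(entries: t.List[t.Dict[str, t.Any]], field: str) -> float:
    """Fraction of entries whose `field` is non-empty (0.0 for an empty list)."""
    if not entries:
        return 0.0
    return sum(presence(entry.get(field)) for entry in entries) / len(entries)


# Made-up profile fragment, for illustration only.
experiences = [
    {"title": "Data Engineer", "company": "Acme", "date_start": "2020-01-01"},
    {"title": "Analyst", "company": None, "date_start": None},
]

title_rate = fill_rate(experiences, "title")        # 2 of 2 entries filled
company_rate = fill_rate(experiences, "company")    # 1 of 2 entries filled
start_rate = fill_rate(experiences, "date_start")   # 1 of 2 entries filled

# Section score: plain average of the per-field fill rates.
section_score = (title_rate + company_rate + start_rate) / 3
print(title_rate, company_rate, start_rate, round(section_score, 3))
```

Note that a score of this kind measures how much was extracted, not whether the extracted values are correct; checking correctness still requires comparing against the original resume.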

In [None]:
class InfoEvaluation(BaseModel):
    score: float
    person: float
    first_name: float
    last_name: float
    phone: float
    email: float
    location: float
    summary: float
    driving_license: float
    
    @staticmethod
    def from_profile(profile : t.Dict[str, t.Any]) -> "InfoEvaluation":
        info = profile["info"]
        first_name = 1 if info.get("first_name") else 0
        last_name = 1 if info.get("last_name") else 0
        phone = 1 if info.get("phone") else 0
        email = 1 if info.get("email") else 0
        location = 1 if info.get("location") else 0
        summary = 1 if info.get("summary") else 0
        driving_license = 1 if info.get("driving_license") else 0
        
        score = first_name + last_name + phone + email + location + summary + driving_license
        score /= 7
        return InfoEvaluation(
            score=score,
            person=1,
            first_name=first_name,
            last_name=last_name,
            phone=phone,
            email=email,
            location=location,
            summary=summary,
            driving_license=driving_license
        )

class ExperienceEvaluation(BaseModel):
    score: float
    count: int
    title: float
    company: float
    start_date: float
    end_date: float
    location: float
    description: float
    skills: int
    tasks: int
    courses: int
    certifications: int
    
    @staticmethod
    def from_profile(profile : t.Dict[str, t.Any]) -> "ExperienceEvaluation":
        experiences = profile["experiences"]
        
        count = len(experiences)
        
        title = 0
        company = 0
        start_date = 0
        end_date = 0
        location = 0
        description = 0
        
        skills = 0
        tasks = 0
        courses = 0
        certifications = 0
        
        for experience in experiences:
            title += 1 if experience.get("title") else 0
            company += 1 if experience.get("company") else 0
            start_date += 1 if experience.get("date_start") else 0
            end_date += 1 if experience.get("date_end") else 0
            # `or {}` / `or []` guard against keys that are present but set to None
            location += 1 if (experience.get("location") or {}).get("text") else 0
            description += 1 if experience.get("description") else 0

            skills += len(experience.get("skills") or [])
            tasks += len(experience.get("tasks") or [])
            courses += len(experience.get("courses") or [])
            certifications += len(experience.get("certifications") or [])
        
        if count > 0:
            title /= count
            company /= count
            start_date /= count
            end_date /= count
            location /= count
            description /= count
        
        score = title + company + start_date + end_date + location + description
        score /= 6
        
        return ExperienceEvaluation(
            score=score,
            count=count,
            title=title,
            company=company,
            start_date=start_date,
            end_date=end_date,
            location=location,
            description=description,
            skills=skills,
            tasks=tasks,
            courses=courses,
            certifications=certifications
        )

class EducationEvaluation(BaseModel):
    score: float
    count: int
    title: float
    school: float
    start_date: float
    end_date: float
    location: float
    description: float
    skills: int
    tasks: int
    courses: int
    certifications: int

    @staticmethod
    def from_profile(profile : t.Dict[str, t.Any]) -> "EducationEvaluation":
        educations = profile["educations"]
        
        count = len(educations)
        
        title = 0
        school = 0
        start_date = 0
        end_date = 0
        location = 0
        description = 0
        
        skills = 0
        tasks = 0
        courses = 0
        certifications = 0
        
        for education in educations:
            title += 1 if education.get("title") else 0
            school += 1 if education.get("school") else 0
            start_date += 1 if education.get("date_start") else 0
            end_date += 1 if education.get("date_end") else 0
            # `or {}` / `or []` guard against keys that are present but set to None
            location += 1 if (education.get("location") or {}).get("text") else 0
            description += 1 if education.get("description") else 0

            skills += len(education.get("skills") or [])
            tasks += len(education.get("tasks") or [])
            courses += len(education.get("courses") or [])
            certifications += len(education.get("certifications") or [])
        
        if count > 0:
            title /= count
            school /= count
            start_date /= count
            end_date /= count
            location /= count
            description /= count
        
        score = title + school + start_date + end_date + location + description
        score /= 6
        
        return EducationEvaluation(
            score=score,
            count=count,
            title=title,
            school=school,
            start_date=start_date,
            end_date=end_date,
            location=location,
            description=description,
            skills=skills,
            tasks=tasks,
            courses=courses,
            certifications=certifications
        )

class OtherEvaluation(BaseModel):
    skills: int
    languages: int
    tasks: int
    courses: int
    certifications: int
    interests: int
    
    @staticmethod
    def from_profile(profile : t.Dict[str, t.Any]) -> "OtherEvaluation":
        skills = len(profile.get("skills", []))
        languages = len(profile.get("languages", []))
        tasks = len(profile.get("tasks", []))
        courses = len(profile.get("courses", []))
        certifications = len(profile.get("certifications", []))
        interests = len(profile.get("interests", []))
        
        return OtherEvaluation(
            skills=skills,
            languages=languages,
            tasks=tasks,
            courses=courses,
            certifications=certifications,
            interests=interests
        )

class ProfileEvaluation(BaseModel):
    info: InfoEvaluation
    experience: ExperienceEvaluation
    education: EducationEvaluation
    other: OtherEvaluation
    
    filename: str
    resume_url: str
    profile_url: str
    
    @staticmethod
    def get_filename(profile : t.Dict[str, t.Any]) -> str:
        return profile["attachments"][0].get("original_file_name", "")
    
    @staticmethod
    def get_resume_url(profile : t.Dict[str, t.Any]) -> str:
        return profile["attachments"][0].get("public_url", "")
    
    @staticmethod
    def get_profile_url(profile : t.Dict[str, t.Any]) -> str:
        resume_url = ProfileEvaluation.get_resume_url(profile)
        base_url = resume_url.rsplit("/", 2)[0]
        return f"{base_url}/object.json"
    
    @staticmethod
    def from_profile(profile : t.Dict[str, t.Any]) -> "ProfileEvaluation":
        return ProfileEvaluation(
            info=InfoEvaluation.from_profile(profile),
            experience=ExperienceEvaluation.from_profile(profile),
            education=EducationEvaluation.from_profile(profile),
            other=OtherEvaluation.from_profile(profile),
            filename=ProfileEvaluation.get_filename(profile),
            resume_url=ProfileEvaluation.get_resume_url(profile),
            profile_url=ProfileEvaluation.get_profile_url(profile)
        )

def parsing_evaluator(profile_list : t.List[t.Dict[str, t.Any]]) -> t.List[ProfileEvaluation]:
    """
    Read a list of profiles and evaluate them

    Args:
        profile_list (t.List[t.Dict[str, t.Any]]): List of profiles

    Returns:
        t.List[ProfileEvaluation]: List of evaluated profiles
    """
    return [ProfileEvaluation.from_profile(profile) for profile in tqdm(profile_list, desc="Evaluating profiles")]

In [None]:
score_list = parsing_evaluator(retrieved_profiles)
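
The `get_profile_url` helper defined above derives the profile JSON link from the resume link by dropping the last two path segments and appending `object.json`. The sketch below replays that string manipulation on a hypothetical URL; the URL layout shown is an assumption for illustration, not a documented HrFlow storage scheme, and `derive_profile_url` is a stand-in name.

```python
def derive_profile_url(resume_url: str) -> str:
    """Drop the last two path segments and append `object.json`."""
    base_url = resume_url.rsplit("/", 2)[0]
    return f"{base_url}/object.json"


# Hypothetical URL layout, for illustration only.
example_resume_url = "https://example.com/storage/profile-123/resume/original.pdf"
example_profile_url = derive_profile_url(example_resume_url)
print(example_profile_url)

# Edge case: with an empty resume URL the result is not a usable link.
empty_case = derive_profile_url("")
print(repr(empty_case))
```

The edge case suggests that, when an attachment has no `public_url`, the profile link written to the report should not be trusted.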

# 2. 📝 Generate Excel Report

This section generates the parsing evaluation report from a predefined Excel template: it loads the template, fills it with the evaluation data, and saves the final report.

The Excel workbook consists of various sections, including metadata, personal info, experience, education, and other skills, offering a holistic view of parsing accuracy.
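
Each section is written to a fixed range of columns, and the fill functions pair columns with field names using `zip`, which silently truncates when the two lengths differ. A quick sanity check of the column ranges against the field counts can be sketched in pure Python as follows. The ranges and counts are copied from the configuration cell of this notebook; `column_index`, `column_span`, and `SECTIONS` are hypothetical helpers written for this check.

```python
def column_index(letters: str) -> int:
    """Convert an Excel column label such as "A" or "AJ" to a 1-based index."""
    index = 0
    for char in letters.upper():
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index


def column_span(start: str, end: str) -> int:
    """Number of columns in the inclusive range start..end."""
    return column_index(end) - column_index(start) + 1


# (start column, end column, number of fields) as configured in this notebook.
SECTIONS = {
    "info": ("D", "L", 9),
    "experience": ("M", "X", 12),
    "education": ("Y", "AJ", 12),
    "other": ("AK", "AP", 6),
}

for name, (start, end, field_count) in SECTIONS.items():
    # Fails loudly if a column range and its field list ever drift apart.
    assert column_span(start, end) == field_count, name
```

If the template or a field list changes, re-running a check like this helps catch a mismatch before values end up in the wrong columns.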

In [None]:
TEMPLATE_URL = "https://riminder-documents-eu-2019-12-dev.s3.eu-west-1.amazonaws.com/evaluation/parsing-evaluation-template.xlsx"
STATISTICS_SHEET_NAME = "1. Statistics"
START_ROW_ID = 5

FILENAME_COLUMN_ID = "A"
RESUME_COLUMN_ID = "B"
PROFILE_COLUMN_ID = "C"

INFO_FIELD_LIST = ("score", "person", "first_name", "last_name", "phone", "email", "location", "summary", "driving_license")
INFO_START_COLUMN_ID, INFO_END_COLUMN_ID = ("D", "L")

EXPERIENCE_FIELD_LIST = ("score", "count", "title", "company", "start_date", "end_date", "location", "description", "skills", "tasks", "courses", "certifications")
EXPERIENCE_START_COLUMN_ID, EXPERIENCE_END_COLUMN_ID = ("M", "X")

EDUCATION_FIELD_LIST = ("score", "count", "title", "school", "start_date", "end_date", "location", "description", "skills", "tasks", "courses", "certifications")
EDUCATION_START_COLUMN_ID, EDUCATION_END_COLUMN_ID = ("Y", "AJ")

OTHER_FIELD_LIST = ("skills", "languages", "tasks", "courses", "certifications", "interests")
OTHER_START_COLUMN_ID, OTHER_END_COLUMN_ID = ("AK", "AP")

def load_workbook_from_url(url : str) -> Workbook:
    """
    Load an excel file from a url

    Args:
        url (str): The url of the file to load 

    Returns:
        Workbook: The loaded workbook
    """
    file = urllib.request.urlopen(url).read()
    return load_workbook(filename=BytesIO(file))

def fill_metadata(work_sheet : Worksheet, profile_eval_list : t.List[ProfileEvaluation]) -> None:
    """
    Fill the metadata of the profiles in the worksheet

    Args:
        work_sheet (Worksheet): The worksheet to fill
        profile_eval_list (t.List[ProfileEvaluation]): The list of profile evaluations
    """
    for row_id, profile_eval in enumerate(tqdm(profile_eval_list, desc="Filling meta-data"), START_ROW_ID):
        work_sheet[f"{FILENAME_COLUMN_ID}{row_id}"].value = profile_eval.filename
        work_sheet[f"{RESUME_COLUMN_ID}{row_id}"].hyperlink = profile_eval.resume_url
        work_sheet[f"{PROFILE_COLUMN_ID}{row_id}"].hyperlink = profile_eval.profile_url

def fill_info(work_sheet : Worksheet, profile_eval_list : t.List[ProfileEvaluation]) -> None:
    """
    Fill the info scores of the profiles in the worksheet

    Args:
        work_sheet (Worksheet): The worksheet to fill
        profile_eval_list (t.List[ProfileEvaluation]): The list of profile evaluations
    """
    column_id_list = get_column_interval(INFO_START_COLUMN_ID, INFO_END_COLUMN_ID)
    for row_id, profile_eval in enumerate(tqdm(profile_eval_list, desc="Filling info scores"), START_ROW_ID):
        for column_id, field in zip(column_id_list, INFO_FIELD_LIST):
            work_sheet[f"{column_id}{row_id}"].value = getattr(profile_eval.info, field)

def fill_experience(work_sheet : Worksheet, profile_eval_list : t.List[ProfileEvaluation]) -> None:
    """
    Fill the experience scores of the profiles in the worksheet

    Args:
        work_sheet (Worksheet): The worksheet to fill
        profile_eval_list (t.List[ProfileEvaluation]): The list of profile evaluations
    """
    column_id_list = get_column_interval(EXPERIENCE_START_COLUMN_ID, EXPERIENCE_END_COLUMN_ID)
    for row_id, profile_eval in enumerate(tqdm(profile_eval_list, desc="Filling experience scores"), START_ROW_ID):
        for column_id, field in zip(column_id_list, EXPERIENCE_FIELD_LIST):
            work_sheet[f"{column_id}{row_id}"].value = getattr(profile_eval.experience, field)
            
def fill_education(work_sheet : Worksheet, profile_eval_list : t.List[ProfileEvaluation]) -> None:
    """
    Fill the education scores of the profiles in the worksheet

    Args:
        work_sheet (Worksheet): The worksheet to fill
        profile_eval_list (t.List[ProfileEvaluation]): The list of profile evaluations
    """
    column_id_list = get_column_interval(EDUCATION_START_COLUMN_ID, EDUCATION_END_COLUMN_ID)
    for row_id, profile_eval in enumerate(tqdm(profile_eval_list, desc="Filling education scores"), START_ROW_ID):
        for column_id, field in zip(column_id_list, EDUCATION_FIELD_LIST):
            work_sheet[f"{column_id}{row_id}"].value = getattr(profile_eval.education, field)

def fill_other(work_sheet : Worksheet, profile_eval_list : t.List[ProfileEvaluation]) -> None:
    """
    Fill the other scores of the profiles in the worksheet

    Args:
        work_sheet (Worksheet): The worksheet to fill
        profile_eval_list (t.List[ProfileEvaluation]): The list of profile evaluations
    """
    column_id_list = get_column_interval(OTHER_START_COLUMN_ID, OTHER_END_COLUMN_ID)
    for row_id, profile_eval in enumerate(tqdm(profile_eval_list, desc="Filling other scores"), START_ROW_ID):
        for column_id, field in zip(column_id_list, OTHER_FIELD_LIST):
            work_sheet[f"{column_id}{row_id}"].value = getattr(profile_eval.other, field)

In [None]:
work_book = load_workbook_from_url(TEMPLATE_URL)
work_sheet = work_book[STATISTICS_SHEET_NAME]

fill_metadata(work_sheet, score_list)
fill_info(work_sheet, score_list)
fill_experience(work_sheet, score_list)
fill_education(work_sheet, score_list)
fill_other(work_sheet, score_list)

work_book.save(OUTPUT_PATH)
work_book.close()

The output is an Excel file named `parsing-evaluation.xlsx`, summarizing the parsing accuracy of CVs stored in the specified HrFlow source.

The report contains 2 sheets:
1. **Definition**: This sheet explains each field and how to interpret the results.
2. **Statistics**: This sheet presents the comprehensive set of results.