Open this notebook in Google Colab : [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Riminder/hrflow-cookbook/blob/main/examples/%5BParsing%5D%20profile_job_parsing.ipynb)

##### Copyright 2023 HrFlow's AI Research Department

Licensed under the Apache License, Version 2.0 (the "License");

In [None]:
# Copyright 2023 HrFlow's AI Research Department. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

Welcome to this Google Colaboratory tutorial for developers. **In only 4 steps**, we'll help you **parse, store and get your profiles and jobs**. This will enable you to test the powerful **HrFlow.ai parsing feature**.

Before we proceed, please ensure that you have created a source and a board in HrFlow.ai to store your data. You can find detailed instructions on how to create them through the following links:
- **Create your source**: [Connectors Source Documentation](https://developers.hrflow.ai/docs/connectors-source)
- **Create your board**: [Connectors Board Documentation](https://developers.hrflow.ai/docs/connectors-board)

**NB:** You also need to activate **real-time (Sync)** parsing; follow the steps described in the links above.

Now, let's take a quick look at how this notebook is organized:
1. **Profiles**

    **1.1 📝 Parse and Store Profiles** 
    
    **1.2 👷 Retrieve Stored Profiles** 
2. **Jobs**
    
    **2.1 📝 Parse and Store Jobs**
    
    **2.2 🛠 Retrieve Stored Jobs**

Let's get started and harness the capabilities of HrFlow.ai!

# Getting Started

In [1]:
!pip install --quiet hrflow

In [1]:
import os
import re
import json
import typing
from glob import glob
from functools import wraps
from getpass import getpass
from time import time, sleep
from tqdm.notebook import tqdm
from pydantic import BaseModel, root_validator

from hrflow import Hrflow


API_SECRET = getpass("YOUR_API_SECRET")
API_USER = getpass("USER@EMAIL.DOMAIN")
SOURCE_KEY = getpass("YOUR_SOURCE_KEY")
BOARD_KEY = getpass("YOUR_BOARD_KEY")


def rate_limiter(
    max_requests_per_minute=30,
    min_sleep_per_request=1.0
):
    """
    Decorator that applies rate limiting to a function.

    Args:
        max_requests_per_minute (int): The maximum number of requests allowed per minute.
        min_sleep_per_request (float): The minimum time to sleep between consecutive requests.
    """
    def decorator(func):
        requests_per_minute = 0
        last_reset_time = time()

        @wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal requests_per_minute, last_reset_time

            current_time = time()
            elapsed_time = current_time - last_reset_time

            if elapsed_time < 60:
                requests_per_minute += 1
                if requests_per_minute >= max_requests_per_minute:
                    sleep(60 - elapsed_time)
                    requests_per_minute = 0
                    last_reset_time = time()
            else:
                requests_per_minute = 0
                last_reset_time = current_time

            sleep(min_sleep_per_request)
            return func(*args, **kwargs)

        return wrapper

    return decorator


client = Hrflow(api_secret=API_SECRET, api_user=API_USER)
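With its default settings, the rate limiter above allows roughly 30 calls per minute and sleeps at least 1 second before each call. As a rough, steady-state sketch (it ignores API latency, and `estimate_batch_seconds` is a hypothetical helper, not part of the SDK), you can estimate how long a batch will take:

```python
def estimate_batch_seconds(n_calls, max_requests_per_minute=30, min_sleep_per_request=1.0):
    """Rough duration of n_calls through the limiter, ignoring API latency."""
    # Each call costs at least the minimum sleep, and the per-minute cap
    # spreads calls to one every 60 / max_requests_per_minute seconds.
    per_call = max(60.0 / max_requests_per_minute, min_sleep_per_request)
    return n_calls * per_call


print(estimate_batch_seconds(300))  # 600.0 seconds, i.e. about 10 minutes
```

Actual run times will be longer, since each parsing request also takes time to complete.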

# 0. Create Folders to Store CVs and Job Descriptions

This notebook requires the file structure below. Store your raw CVs (in any format: `.pdf`, `.jpg`, `.png`...) in the `raw_cvs` folder and your raw job descriptions (plain text, `.txt`) in the `raw_jobs` folder.

```bash
.
├── raw_cvs/
│   ├── cv_{reference_1}.pdf
│   ├── cv_{reference_2}.pdf
│   └── ...
├── retrieved_cvs/
├── raw_jobs/
│   ├── job_{reference_1}.txt
│   ├── job_{reference_2}.txt
│   └── ...
├── parsed_jobs/
├── retrieved_jobs/
└── this_notebook_😁.ipynb
```

**NB**:
- Please respect the naming convention for your files, i.e. `cv_{reference}.[extension]` and `job_{reference}.txt`.
- Each file's reference must be unique: it is the main way to identify your profiles and jobs in HrFlow.ai.
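Before parsing anything, it can help to check that your file names follow the convention and that no reference appears twice. The sketch below is one possible way to do this; `extract_reference` and `find_duplicate_references` are hypothetical helpers introduced here for illustration.

```python
import os
import re
from collections import Counter

# Matches "cv_{reference}.ext" or "job_{reference}.ext" (pattern assumed from the convention above).
FILENAME_PATTERN = re.compile(r"^(cv|job)_(?P<reference>.+)\.[^.]+$")


def extract_reference(path):
    """Return the reference embedded in the file name, or None if the name does not match."""
    match = FILENAME_PATTERN.match(os.path.basename(path))
    return match.group("reference") if match else None


def find_duplicate_references(paths):
    """Return the sorted references that appear more than once in the given paths."""
    counts = Counter(extract_reference(path) for path in paths)
    return sorted(ref for ref, n in counts.items() if ref is not None and n > 1)


paths = [
    "raw_cvs/cv_jane-doe.pdf",
    "raw_cvs/cv_jane-doe.png",  # same reference as the PDF above
    "raw_cvs/cv_42.pdf",
    "raw_cvs/notes.txt",        # does not follow the convention
]
print([extract_reference(path) for path in paths])
print(find_duplicate_references(paths))
```

Here the two `jane-doe` files share a reference, so one of them should be renamed before uploading.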

In [None]:
folders = ["raw_cvs", "retrieved_cvs", "raw_jobs", "parsed_jobs", "retrieved_jobs"]
for folder in folders:
    # exist_ok=True makes the call a no-op when the folder already exists.
    os.makedirs(folder, exist_ok=True)

# 1. Profiles

# 1.1 📝 Parse and Store Profiles

To parse and upload your CVs to HrFlow.ai, we will use the `Profile` class below, which relies on the [HrFlow.ai Python SDK](https://github.com/Riminder/python-hrflow-api). To enrich profiles, you can also add:

- `created_at`, following the `ISO 8601` format, e.g. `2021-01-01T00:00:00`. Otherwise, it is automatically set to the current time.
- `tags` and `metadatas`, using the same class. Below is an example of `tags` and `metadatas`:

```python
tags = [
    {
        "name": "DUMMY_NAME",
        "value": "DUMMY_VALUE"          # <- Could be None
    },
    {
        "name": "DUMMY_NAME_2",
        "value": "DUMMY_VALUE_2"        # <- Could be None
    },
]
metadatas = [
    {
        "name": "DUMMY_NAME",
        "value": "DUMMY_VALUE"          # <- Could be None
    }
]
```
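As a minimal sketch, the values above can be built with small helpers. Both `iso_timestamp` and `to_name_value_list` are hypothetical names introduced here, and using UTC for the default timestamp is an assumption rather than a documented requirement.

```python
from datetime import datetime, timezone


def iso_timestamp(dt=None):
    """Format a datetime as ISO 8601 with second precision, e.g. 2021-01-01T00:00:00."""
    if dt is None:
        dt = datetime.now(timezone.utc)  # assumption: default to the current UTC time
    # Drop microseconds and the timezone suffix to match the format shown above.
    return dt.replace(microsecond=0, tzinfo=None).isoformat()


def to_name_value_list(mapping):
    """Turn {"name": "value"} pairs into the list-of-dicts shape shown above."""
    return [{"name": name, "value": value} for name, value in mapping.items()]


created_at = iso_timestamp(datetime(2021, 1, 1))
tags = to_name_value_list({"DUMMY_NAME": "DUMMY_VALUE", "DUMMY_NAME_2": "DUMMY_VALUE_2"})
print(created_at)
print(tags)
```

Note that the `Profile` class below asserts that every tag and metadata entry has a non-empty `name` and `value`, so entries with an empty value are rejected by it.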

In [5]:
import mimetypes

upload_cv = rate_limiter()(client.profile.parsing.add_file)


class Profile(BaseModel):
    raw_document_path: str
    reference: str
    # Optional fields default to None when they are not provided.
    created_at: typing.Optional[str] = None
    tags: typing.Optional[typing.List[typing.Dict]] = None
    metadatas: typing.Optional[typing.List[typing.Dict]] = None
    source_key: str = SOURCE_KEY
    profile_json: typing.Optional[typing.Dict[str, typing.Any]] = None

    @root_validator
    def check_custom_fields(cls, values):
        tags = values.get("tags")
        metadata = values.get("metadatas")
        if tags:
            for tag in tags:
                assert tag.get("name") and tag.get(
                    "value"), "All tags must have a name and a value"
        if metadata:
            for data in metadata:
                assert data.get("name") and data.get(
                    "value"), "All metadata must have a name and a value"
        return values

    def parse_document(self):
        with open(self.raw_document_path, "rb") as file:
            profile_file = file.read()
        # Guess the MIME type from the file extension (.pdf, .jpg, .png...),
        # falling back to PDF when the extension is unknown.
        content_type, _ = mimetypes.guess_type(self.raw_document_path)
        resp = upload_cv(
            source_key=self.source_key,
            profile_file=profile_file,
            profile_content_type=content_type or "application/pdf",
            reference=self.reference,
            tags=self.tags,
            metadatas=self.metadatas,
            created_at=self.created_at,
            sync_parsing=1,
            sync_parsing_indexing=1,
            webhook_parsing_sending=0,
        )
        self.profile_json = resp["data"]["profile"]
        return self

In [None]:
local_profiles = []
for cv_path in tqdm(glob("raw_cvs/*"), desc="Parsing CVs"):
    file_name = os.path.basename(cv_path)  # works with both "/" and "\\" separators
    profile_reference = re.findall(r"cv_(.*)\.", file_name)[0]
    local_profiles.append(
        Profile(
            raw_document_path=cv_path,
            reference=profile_reference,
        ).parse_document()
    )

# 1.2 👷 Retrieve Stored Profiles

Now let's retrieve the profiles from the HrFlow.ai store. They will be saved in the `retrieved_cvs` folder.

In [None]:
max_page = client.profile.storing.list(source_keys=[SOURCE_KEY])["meta"]["maxPage"]
retrieved_profiles = []
for page in tqdm(range(1, max_page + 1), "Retrieving profiles"):
    retrieved_profiles += client.profile.storing.list(
        source_keys=[SOURCE_KEY], page=page, return_profile=True)["data"]

In [12]:
for profile in retrieved_profiles:
    with open(f"retrieved_cvs/profile_{profile['reference']}.json", "w") as file:
        json.dump(profile, file)
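The retrieval loop above follows a common pagination pattern: read `meta.maxPage`, then request each page in turn. A minimal, generic sketch of that pattern is shown below; `iterate_pages` and `fake_list` are hypothetical names, and `fake_list` only imitates the response shape used in this notebook (`meta.maxPage` and `data`).

```python
def iterate_pages(list_page):
    """Yield every item from a paginated endpoint.

    `list_page(page)` must return {"meta": {"maxPage": int}, "data": [...]}.
    """
    page, max_page = 1, 1
    while page <= max_page:
        response = list_page(page)
        max_page = response["meta"]["maxPage"]  # refreshed on every page
        yield from response["data"]
        page += 1


# Stand-in for the API, used only to make this sketch runnable offline.
FAKE_PAGES = {1: ["profile_a", "profile_b"], 2: ["profile_c"]}


def fake_list(page):
    return {"meta": {"maxPage": len(FAKE_PAGES)}, "data": FAKE_PAGES[page]}


items = list(iterate_pages(fake_list))
print(items)
```

With the real client, `list_page` could be a small wrapper such as `lambda page: client.profile.storing.list(source_keys=[SOURCE_KEY], page=page, return_profile=True)`, which also avoids the separate initial request for `maxPage`.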

# 2. Jobs

# 2.1 📝 Parse and Store Jobs


To parse, format and upload a job description to HrFlow.ai, we will use the `Job` class below. It first parses the text description using the HrFlow.ai parsing API, then formats the result to match the HrFlow.ai job format, and finally uploads the job to HrFlow.ai. To enrich jobs, you can also add:

- Personalized `sections`. The class below doesn't split the job description into multiple sections. If your job descriptions contain sections such as `Responsibilities`, `Requirements`, `Benefits`... you can handle them by editing the `format` method of the `Job` class.
- `created_at`, following the `ISO 8601` format, e.g. `2021-01-01T00:00:00`. Otherwise, it is automatically set to the current time.
- `tags` and `metadatas`, using the same class. Below is an example of `tags` and `metadatas`:

```python
tags = [
    {
        "name": "DUMMY_NAME",
        "value": "DUMMY_VALUE"          # <- Could be None
    },
    {
        "name": "DUMMY_NAME_2",
        "value": "DUMMY_VALUE_2"        # <- Could be None
    },
]
metadatas = [
    {
        "name": "DUMMY_NAME",
        "value": "DUMMY_VALUE"          # <- Could be None
    }
]
```
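For personalized `sections`, one possible approach is to split the raw text on heading lines. The sketch below assumes that each heading sits alone on its own line, optionally followed by a colon. `split_sections` and the heading list are hypothetical and should be adapted to your own job descriptions.

```python
import re

# Assumed headings; adjust to the ones your job descriptions actually use.
SECTION_HEADINGS = ["Responsibilities", "Requirements", "Benefits"]


def split_sections(raw_text, headings=SECTION_HEADINGS):
    """Split text into sections on lines that contain only a known heading."""
    pattern = re.compile(
        r"^[ \t]*(" + "|".join(re.escape(h) for h in headings) + r")[ \t]*:?[ \t]*$",
        flags=re.IGNORECASE | re.MULTILINE,
    )
    matches = list(pattern.finditer(raw_text))
    if not matches:
        # No known heading found: keep the whole text as a single section.
        return [{"name": "section 1", "title": "title section 1", "description": raw_text}]

    sections = []
    intro = raw_text[: matches[0].start()].strip()
    if intro:
        sections.append({"name": "introduction", "title": "Introduction", "description": intro})
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw_text)
        title = match.group(1).strip()
        sections.append({
            "name": title.lower(),
            "title": title,
            "description": raw_text[match.end():end].strip(),
        })
    return sections


text = "Data Engineer\n\nResponsibilities:\nBuild pipelines.\n\nRequirements\nPython, SQL.\n"
sections = split_sections(text)
for section in sections:
    print(section["title"], "->", section["description"])
```

The returned list uses the same `name`, `title` and `description` keys as the single section built in the `format` method below, so it could replace that hard-coded section.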

In [9]:
parse_job = rate_limiter()(client.text.parsing.post)
upload_job = rate_limiter()(client.job.storing.add_json)


class Job(BaseModel):
    raw_text: str
    reference: str
    # Optional fields default to None when they are not provided.
    created_at: typing.Optional[str] = None
    tags: typing.Optional[typing.List[typing.Dict]] = None
    metadatas: typing.Optional[typing.List[typing.Dict]] = None
    board_key: str = BOARD_KEY
    parsed_job: typing.Optional[typing.Dict] = None
    job_json: typing.Optional[typing.Dict] = None

    @root_validator
    def check_custom_fields(cls, values):
        tags = values.get("tags")
        metadata = values.get("metadatas")
        if tags:
            for tag in tags:
                assert tag.get("name") and tag.get(
                    "value"), "All tags must have a name and a value"
        if metadata:
            for data in metadata:
                assert data.get("name") and data.get(
                    "value"), "All metadata must have a name and a value"
        return values

    def parse_job(self):
        response = parse_job(text=self.raw_text)
        self.parsed_job = response["data"]["parsing"]
        return self

    def format(self):
        # The `or` fallbacks guard against fields the parser left empty or missing.
        job_titles = self.parsed_job.get("job_titles") or [""]
        self.job_json = dict(
            name=job_titles[0],
            reference=self.reference,
            url=None,
            summary=None,
            created_at=self.created_at,
            sections=[{
                "name": "section 1",
                "title": "title section 1",
                "description": self.raw_text,
            }],
            skills=[
                {
                    "name": skill,
                    "value": None,
                    "type": "hard"
                }
                for skill in (self.parsed_job.get("skills_hard") or [])
            ] + [{
                "name": skill,
                "value": None,
                "type": "soft"
            }
                for skill in (self.parsed_job.get("skills_soft") or [])
            ],
            languages=[{
                "name": lang,
                "value": None
            }
                for lang in (self.parsed_job.get("languages") or [])
            ],
            tasks=[{
                "name": task,
                "value": None
            }
                for task in (self.parsed_job.get("tasks") or [])
            ],
            certifications=[{
                "name": certification,
                "value": None
            }
                for certification in (self.parsed_job.get("certifications") or [])
            ],
        )
        if self.tags:
            self.job_json["tags"] = self.tags
        if self.metadatas:
            self.job_json["metadatas"] = self.metadatas

        return self

    def cache_job(self, folder="parsed_jobs"):
        with open(f"{folder}/{self.reference}.json", "w") as file:
            json.dump(self.job_json, file)
        return self

    def upload_to_board(self):
        upload_job(board_key=self.board_key, job_json=self.job_json)
        # Return self so the call can end a method chain and still yield the Job.
        return self

Since jobs are first parsed, we will first cache them into the `parsed_jobs` folder. Then, we will upload them to HrFlow.ai.

In [None]:
local_jobs = []
for job_path in tqdm(glob("raw_jobs/*"), desc="Parsing jobs"):
    file_name = os.path.basename(job_path)
    job_reference = re.findall(r"job_(.*)\.", file_name)[0]
    # Read the raw description and close the file right away.
    with open(job_path) as file:
        raw_text = file.read()
    local_jobs.append(
        Job(
            raw_text=raw_text,
            reference=job_reference,
        ).parse_job().format().cache_job().upload_to_board()
    )

# 2.2 🛠 Retrieve Stored Jobs

Now let's retrieve the jobs from the HrFlow.ai store. They will be saved in the `retrieved_jobs` folder.

In [None]:
max_page = client.job.storing.list(board_keys=[BOARD_KEY])["meta"]["maxPage"]
retrieved_jobs = []
for page in tqdm(range(1, max_page + 1), "Retrieving jobs"):
    retrieved_jobs += client.job.storing.list(
        board_keys=[BOARD_KEY], page=page, return_job=True)["data"]

In [16]:
for job in retrieved_jobs:
    with open(f"retrieved_jobs/job_{job['reference']}.json", "w") as file:
        json.dump(job, file)