<a href="https://colab.research.google.com/github/PanoEvJ/GenAI-CoverLetter/blob/main/PE_Synthetic_Dataset_Creation_for_GenAI_CoverLetter_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### OpenAI Access

First things first, you'll need to set up an account on [OpenAI](https://platform.openai.com). Once you've done that, follow [these instructions](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key) to create an API key. Make sure you save your API key!

### OpenAI API Library

We'll be leveraging [this](https://github.com/openai/openai-python) library to access OpenAI's model endpoints.

There are a number of models to choose from and you can find resources about them [here](https://platform.openai.com/docs/models) and their pricing [here](https://openai.com/pricing).

Import the `openai` library and load your API key from your personal `.env` file.

In [1]:
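# This notebook uses the pre-1.0 interface of the library (openai.ChatCompletion, openai.Model)
# !pip install "openai<1" python-dotenv -q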

In [74]:
import openai 
import os 
from dotenv import load_dotenv

load_dotenv() # load OPENAI_API_KEY as an environment variable from the .env file

# Set your personal OPENAI_API_KEY environment variable in the .env file
openai.api_key = os.getenv("OPENAI_API_KEY")
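
If the variable is missing, `os.getenv` quietly returns `None` and the failure only shows up at the first API call. A minimal sketch of an early check follows; `require_env` is a hypothetical helper name, not part of the `openai` or `dotenv` libraries.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Hypothetical helper: raise early instead of failing at the first API call
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

It could be used as `openai.api_key = require_env("OPENAI_API_KEY")` in place of the plain `os.getenv` call.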

If you wanted to use `gpt-4`, you'd need an account that has closed beta access to the model endpoint. 

You can check if your API Key has access using the following cell.

In [5]:
# check if acct. has gpt-4 access
"gpt-4" in [model["root"] for model in openai.Model.list()["data"]]

False

For the rest of the tutorial, we're going to assume you're using `gpt-3.5-turbo` as your model.

Let's make some helper functions for prompting our model and generating our prompts.

In [6]:
def prompt_model(prompt_list, model="gpt-3.5-turbo"):
  return openai.ChatCompletion.create(model=model, messages=prompt_list)

def create_prompt(role, prompt):
  return {"role" : role, "content" : prompt}

As you can see, our prompts have to be in a specific format - as set by OpenAI.

Here's an example:

```
{"role" : "system", "content" : "You are an expert in Python programming."}
{"role" : "user", "content" : "Please define a function that provides the Nth number of the fibonacci sequence."}
```

Let's see that in action! Remember that you can feed OpenAI's chat completion endpoint with a list of prompts!
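
As a minimal sketch, the two example messages can be assembled with the `create_prompt` helper (repeated here so the cell runs on its own). The actual request is left commented out because it needs a valid API key and the `prompt_model` helper defined earlier.

```python
def create_prompt(role, prompt):
    # Same helper as defined earlier, repeated so this cell is self-contained
    return {"role": role, "content": prompt}

list_of_prompts = [
    create_prompt("system", "You are an expert in Python programming."),
    create_prompt("user", "Please define a function that provides the Nth number of the fibonacci sequence."),
]

# With a valid API key, the list could then be sent to the endpoint:
# response = prompt_model(list_of_prompts)
print(list_of_prompts)
```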

### Generating Synthetic Data

Alright, now we can pull everything together and start creating our synthetic data!

**NOTE:** Because this dataset is generated with OpenAI's endpoints, a model trained on it cannot be used commercially. The notebook is meant to demonstrate the method, which can be extended to any open-source LLM.

In [None]:
job_titles     = ['Machine Learning Engineer', 'Data Scientist', 'Research Scientist', 'Business Intelligence Developer', 'AI Product Manager', 'AI Consultant', 'Robotics Engineer', 'NLP Engineer', 'Research Assistant', 'Deep Learning Engineer', 'MLOps engineer']
position_level = [ "entry", "senior", "mid-level" ]

applicant_name       = "Jon Doe"
generic_company_name = 'COMPANY_NAME'

job_postings      = []
gen_cover_letters = []

for position_level_t in position_level:
  for job_title_t in job_titles:

    # Generate job postings
    list_of_prompts_for_job_postings = [
        {"role" : "system", "content" : "You are a technical hiring manager working at an AI company."}, 
        {"role" : "user", "content" : f"""Please define a job description for a {job_title_t} role at the {position_level_t} level. Write the requirements, responsibilities, location and job culture. Use the generic company name {generic_company_name}."""}
    ]

    job = prompt_model(list_of_prompts_for_job_postings)

    job_postings.append(job["choices"][0]["message"]["content"])


    # Generate cover letters 
    list_of_prompts_for_cover_letter = [
        {"role" : "system", "content" : "You are a Machine Learning Engineer."}, 
        {"role" : "user", "content" : f"""Create a generic cover letter based on the following job posting: {job_postings[-1]}. As applicant name use the generic name {applicant_name}. Explain why you want this job, what makes you a good fit and how your own skills can add value to the company. Explain how you fit with the company culture."""}
    ]

    cover_letter = prompt_model(list_of_prompts_for_cover_letter)
    gen_cover_letters.append(cover_letter["choices"][0]["message"]["content"])
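
The nested loops visit every combination of position level and job title, and each combination costs two API calls (one job posting, one cover letter). As a rough sketch, `itertools.product` can enumerate the same pairs in the same order so you can count the calls before spending any tokens. The names `pairs` and `n_api_calls` are introduced here only for illustration.

```python
from itertools import product

job_titles = ['Machine Learning Engineer', 'Data Scientist', 'Research Scientist',
              'Business Intelligence Developer', 'AI Product Manager', 'AI Consultant',
              'Robotics Engineer', 'NLP Engineer', 'Research Assistant',
              'Deep Learning Engineer', 'MLOps engineer']
position_level = ["entry", "senior", "mid-level"]

# Same order as the nested loops: levels in the outer loop, titles in the inner loop
pairs = list(product(position_level, job_titles))

# One job posting plus one cover letter per pair
n_api_calls = 2 * len(pairs)
print(f"{len(pairs)} level/title pairs, {n_api_calls} API calls")
```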


Each call to the endpoint returns much more than just the generated text.

We can see the number of tokens we used, why the output stopped, what the output is, and more!
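
To illustrate, here is a small sketch that pulls those fields out of a response. `sample_response` is a hand-written stand-in that mirrors the shape of a chat completion result, not real output from the loop above, and `summarize_response` is a hypothetical helper.

```python
# Hypothetical response, written by hand to mirror the shape of a chat completion result
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Example job posting text."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 40, "completion_tokens": 5, "total_tokens": 45},
}

def summarize_response(response):
    """Collect the generated text, the stop reason, and the token count."""
    choice = response["choices"][0]
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": response["usage"]["total_tokens"],
    }

print(summarize_response(sample_response))
```

A `finish_reason` of `"length"` rather than `"stop"` would indicate that the text was cut off at the token limit, which is worth checking before adding it to the dataset.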

Let's render the generated text more clearly using IPython's display utilities.

In [None]:
from IPython.display import display, Markdown

display(Markdown(job_postings[0]))
display(Markdown(gen_cover_letters[0]))

In [29]:
jobs_dataset = { "job_postings" : job_postings,
                 "cover_letters": gen_cover_letters }

### Uploading Dataset to Hugging Face Hub

Now that we've created our synthetic dataset - let's push it to the Hugging Face Hub!

In [2]:
# !pip install huggingface_hub -q

Now we can log-in to Hugging Face!

Make sure you have a Hugging Face account and have set up a token with write access, which is needed to push the dataset!

More info here: https://huggingface.co/docs/hub/security-tokens

In [78]:
# Load your personal HUGGINGFACE_TOKEN environment variable from the .env file
load_dotenv() 
hf_token = os.getenv("HUGGINGFACE_TOKEN")

In [79]:
!git config --global credential.helper store
# IPython expands $hf_token to the value of the Python variable
!huggingface-cli login --token $hf_token


Now we can load our data into the desired format - and upload it to the hub!

In [33]:
from datasets import load_dataset, Dataset
import pandas as pd

In [34]:
hf_dataset = Dataset.from_pandas(pd.DataFrame(data=jobs_dataset))
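
`pd.DataFrame` raises an error when the columns have different lengths, which can happen here if a request fails after a job posting was stored but before its cover letter was. A minimal sketch of a check to run before building the dataset follows; `check_aligned` is a hypothetical helper and the small dictionary is placeholder data.

```python
def check_aligned(dataset):
    """Return the length of each column, raising if they are not all equal."""
    lengths = {name: len(column) for name, column in dataset.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Columns have different lengths: {lengths}")
    return lengths

# Placeholder data standing in for jobs_dataset
example_dataset = {
    "job_postings": ["posting A", "posting B"],
    "cover_letters": ["letter A", "letter B"],
}
print(check_aligned(example_dataset))
```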

In [None]:
hf_username = "PanoEvJ"
dataset_name = "job_postings_GPT"

hf_dataset.push_to_hub(f"{hf_username}/{dataset_name}")

### Conclusion

And that's it! You just created a synthetic dataset and pushed it to the hub! 

Next stop? [Modeling!](https://colab.research.google.com/drive/1RfUuzG11Q8AaZuJIHLzXCVC087xoDeSd?usp=sharing)