<a href="https://colab.research.google.com/github/PanoEvJ/GenAI-CoverLetter/blob/main/PE_Synthetic_Dataset_Creation_for_%E2%9C%89%EF%B8%8F_MarketMail_AI%E2%9C%89%EF%B8%8F_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### OpenAI Access

First things first, you'll need to set up an account on [OpenAI](https://platform.openai.com). Once you've done that, follow [these instructions](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key) to create an API key. Make sure you save your API key!

In [2]:
import os 

# Set the OPENAI_API_KEY environment variable.
# Paste your own key here and never share or commit a notebook that contains a real key.
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
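A key typed directly into a notebook is easy to leak when the notebook is shared. Here is a minimal sketch of an alternative, assuming the key has already been exported in the environment before the notebook starts. The helper name `require_api_key` is hypothetical and not part of any library.

```python
import os

def require_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in `env_var`, or raise if it is missing or empty."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before running the notebook.")
    return key
```

With this in place, `openai.api_key = require_api_key()` fails loudly on a missing key instead of sending unauthenticated requests.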

### OpenAI API Library

We'll be leveraging [this](https://github.com/openai/openai-python) library to access OpenAI's model endpoints.

There are a number of models to choose from and you can find resources about them [here](https://platform.openai.com/docs/models) and their pricing [here](https://openai.com/pricing).

The first step is to install `openai`!

In [3]:
!pip install openai -q


Once we've installed it, we need to import it and set our API key!

In [4]:
import openai 

openai.api_key = os.environ.get("OPENAI_API_KEY")

If you wanted to use `gpt-4`, you'd need an account that has closed beta access to the model endpoint. 

You can check if your API Key has access using the following cell.

In [5]:
# check if acct. has gpt-4 access
"gpt-4" in [model["root"] for model in openai.Model.list()["data"]]

False
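The check above relies on the `root` field of each model entry. Below is a minimal sketch of a slightly more defensive variant that also looks at the `id` field. It runs against a hand-written sample list rather than a real API response; the helper `has_model` and the sample entries are hypothetical.

```python
def has_model(models, name):
    """Return True if any model entry matches `name` by its "id" or "root" field."""
    return any(name in (m.get("id"), m.get("root")) for m in models)

# Hand-written stand-in for the "data" list returned by the model listing call
sample_models = [
    {"id": "gpt-3.5-turbo", "root": "gpt-3.5-turbo"},
    {"id": "text-davinci-003", "root": "text-davinci-003"},
]

print(has_model(sample_models, "gpt-4"))          # False
print(has_model(sample_models, "gpt-3.5-turbo"))  # True
```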

For the rest of the tutorial, we're going to assume you're using `gpt-3.5-turbo` as your model.

Let's make some helper functions for prompting our model and generating our prompts.

In [6]:
def prompt_model(prompt_list, model="gpt-3.5-turbo"):
  return openai.ChatCompletion.create(model=model, messages=prompt_list)

def create_prompt(role, prompt):
  return {"role" : role, "content" : prompt}

As you can see, our prompts have to be in a specific format - as set by OpenAI.

Here's an example:

```
{"role" : "system", "content" : "You are an expert in Python programming."}
{"role" : "user", "content" : "Please define a function that provides the Nth number of the fibonacci sequence."}
```

Let's see that in action! Remember that you can feed OpenAI's chat completion endpoint with a list of prompts!
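As a minimal sketch, here is how the two example messages above can be assembled into a list with the `create_prompt` helper. The helper is repeated so the cell runs on its own, and the API call is left commented out because it needs a valid key.

```python
def create_prompt(role, prompt):
    # Same helper as defined earlier in the notebook
    return {"role": role, "content": prompt}

messages = [
    create_prompt("system", "You are an expert in Python programming."),
    create_prompt("user", "Please define a function that provides the Nth number of the fibonacci sequence."),
]

for message in messages:
    print(message["role"], "->", message["content"])

# With a valid API key, the list can be sent as-is:
# response = prompt_model(messages)
```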

### Generating Synthetic Data

Alright, now we can pull everything together and start creating our synthetic data!

**NOTE:** Because we use OpenAI's endpoints to create our dataset, a model trained on this data cannot be used commercially. This notebook is meant to demonstrate the method, which can be extended to any open-source LLM.

In [None]:
job_titles     = ['Machine Learning Engineer', 'Data Scientist', 'Research Scientist', 'Business Intelligence Developer', 'AI Product Manager', 'AI Consultant', 'Robotics Engineer', 'NLP Engineer', 'Research Assistant', 'Deep Learning Engineer', 'MLOps engineer']
position_level = [ "entry", "senior", "mid-level" ]

applicant_name       = "Jon Doe"
generic_company_name = 'COMPANY_NAME'

job_postings      = []
gen_cover_letters = []

for position_level_t in position_level:
  for job_title_t in job_titles:

    # Generate a job posting for this title and level
    list_of_prompts_for_job_postings = [
        {"role" : "system", "content" : "You are a technical hiring manager working at an AI company."},
        {"role" : "user", "content" : f"""Please define a job description for a {job_title_t} role at the {position_level_t} level. Write the requirements, responsibilities, location and job culture. Use the generic company name {generic_company_name}."""}
    ]

    job = prompt_model(list_of_prompts_for_job_postings)

    # Keep only the generated text, not the full response object
    job_text = job["choices"][0]["message"]["content"]
    job_postings.append(job_text)

    # Generate a cover letter for that posting (pass the posting text, not the raw response)
    list_of_prompts_for_cover_letter = [
        {"role" : "system", "content" : "You are a Machine Learning Engineer."},
        {"role" : "user", "content" : f"""Create a generic cover letter based on the following job posting: {job_text} As applicant name use the generic name {applicant_name}. Explain why you want this job, what makes you a good fit and how your own skills can add value to the company. Explain how you fit with the company culture."""}
    ]

    cover_letter = prompt_model(list_of_prompts_for_cover_letter)
    gen_cover_letters.append(cover_letter["choices"][0]["message"]["content"])

print(f"job_posting: {job_postings[0]} \n \n cover_letter: {gen_cover_letters[0]}")


We only keep the generated text here, but we get a lot more information back from the endpoint with each response.

The full response also tells us the number of tokens we used, why the output stopped, and more!
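To make those fields concrete, here is a minimal sketch that pulls the text, the stop reason, and the token count out of a response. The `sample_response` dictionary is hand-written to mimic the shape of a chat completion response, its values are purely illustrative, and the helper `summarize_response` is hypothetical.

```python
# Hand-written stand-in for a chat completion response (values are illustrative only)
sample_response = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "Job Title: Example Role"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 40, "completion_tokens": 300, "total_tokens": 340},
}

def summarize_response(response):
    """Collect the generated text, the stop reason, and the total token count."""
    choice = response["choices"][0]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": response["usage"]["total_tokens"],
    }

print(summarize_response(sample_response))
```

A `finish_reason` of `"length"` rather than `"stop"` would indicate that the output was cut off by the token limit.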

Let's view the output a bit more clearly using a display library.

In [17]:
# job_postings_backup = job_postings
len(gen_cover_letters)

33
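The count of 33 follows from the nested loops: 3 position levels times 11 job titles. A minimal sketch that reproduces the same iteration order with `itertools.product`, without calling the API:

```python
from itertools import product

job_titles = ['Machine Learning Engineer', 'Data Scientist', 'Research Scientist',
              'Business Intelligence Developer', 'AI Product Manager', 'AI Consultant',
              'Robotics Engineer', 'NLP Engineer', 'Research Assistant',
              'Deep Learning Engineer', 'MLOps engineer']
position_level = ["entry", "senior", "mid-level"]

# Same order as the nested loops: levels on the outside, titles on the inside
combinations = list(product(position_level, job_titles))

print(len(combinations))  # 33
print(combinations[4])    # ('entry', 'AI Product Manager')
```

Knowing the order makes it possible to map any index in `job_postings` or `gen_cover_letters` back to the level and title that produced it.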

In [9]:
from IPython.display import display, Markdown

# Each entry of job_postings is already the generated text, so it can be rendered directly
markdown_output = job_postings[0]

display(Markdown(markdown_output))

In [None]:
# job_postings already holds the generated text of each posting,
# so no further extraction from the response objects is needed
text_response = job_postings

In [None]:
text_response[4]

"Job Title: Junior AI Product Manager\n\nJob Description:\n\nOur AI company is looking for a Junior AI Product Manager to join our team. In this role, you will help define and shape our AI products, working closely with our development team and leadership.\n\nResponsibilities:\n\n- Conduct market research and analysis to assess customer needs and market opportunities\n- Collaborate with the development team to create and prioritize product requirements\n- Plan and track product roadmap and releases\n- Work with UX and UI designers to create user-centered designs\n- Define and track product metrics to ensure success and iterate as necessary\n- Keep up-to-date with industry trends and emerging technologies\n- Collaborate with cross-functional teams to build consensus and drive projects forward\n\nRequirements:\n\n- Bachelor's degree in a relevant field such as Computer Science, Engineering, Business or Marketing\n- Strong analytical and problem-solving skills\n- Excellent written and ver

In [None]:
# The dataset we upload below is the list of generated job postings
jobs_dataset = job_postings

### Uploading Dataset to Hugging Face Hub

Now that we've created our synthetic dataset - let's push it to the Hugging Face Hub!

As always, the first task is to get the required dependencies.

In [None]:
!pip install huggingface_hub -q

Now we can log in to Hugging Face!

Make sure you have a Hugging Face account and a token with write access, since we'll be pushing to the Hub!

More info here: https://huggingface.co/docs/hub/security-tokens

In [None]:
!git config --global credential.helper store
!huggingface-cli login


    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) y
Token is valid.
Your token has been saved in your configured git cred

Now we can load our data into the desired format - and upload it to the hub!

In [None]:
!pip install datasets -q


In [None]:
from datasets import load_dataset, Dataset
import pandas as pd

In [None]:
hf_dataset = Dataset.from_pandas(pd.DataFrame(data=jobs_dataset))

In [None]:
hf_dataset

Dataset({
    features: ['0'],
    num_rows: 33
})
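The single feature named `'0'` above comes from building the DataFrame out of an unnamed list. Below is a minimal sketch of a variant with named columns that also pairs each posting with its cover letter. The helper `to_columns`, the column names, and the sample strings are hypothetical, and the `Dataset.from_dict` step only runs if `datasets` is installed.

```python
def to_columns(job_postings, cover_letters):
    """Pair postings and cover letters into a dict of equally long, named columns."""
    if len(job_postings) != len(cover_letters):
        raise ValueError("job_postings and cover_letters must have the same length")
    return {"job_posting": list(job_postings), "cover_letter": list(cover_letters)}

# Placeholder strings stand in for the generated texts
columns = to_columns(["posting A", "posting B"], ["letter A", "letter B"])

try:
    from datasets import Dataset
    paired_dataset = Dataset.from_dict(columns)  # features: job_posting, cover_letter
except ImportError:
    paired_dataset = None  # datasets is not installed; the column dict is still usable

print(list(columns.keys()))
```

Named columns make the dataset easier to use downstream, since each row carries both the prompt side and the response side.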

In [None]:
hf_username = "PanoEvJ"
dataset_name = "job_postings_GPT"

hf_dataset.push_to_hub(f"{hf_username}/{dataset_name}")

### Conclusion

And that's it! You just created a synthetic dataset and pushed it to the hub! 

Next stop? [Modeling!](https://colab.research.google.com/drive/1RfUuzG11Q8AaZuJIHLzXCVC087xoDeSd?usp=sharing)