<a href="https://colab.research.google.com/github/CesarChaMal/WYNAssociates/blob/main/docs/ref-deeplearning/ex24f%20-%20process%20custom%20data%20from%20pdf%20and%20push%20to%20huggingface%20to%20prep%20for%20fine%20tune%20task%20of%20llama%202%20using%20lora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install Libraries

In [None]:
! pip install openai

Collecting openai
  Downloading openai-1.11.1-py3-none-any.whl (226 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.26.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.2-py3-none-any.whl (76 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none

In [None]:
import openai

In [None]:
! pip install datasets

## Load `openassistant-guanaco` Data as Demonstration

We load this dataset first to understand the data format that our own dataset should follow.

### Code

- **Importing `load_dataset`**: The `from datasets import load_dataset` statement imports the `load_dataset` function from the `datasets` library. This library is a part of the Hugging Face ecosystem, designed to easily share, load, and work with datasets in the machine learning field.

- **Loading the Dataset**: The `load_dataset("timdettmers/openassistant-guanaco")` function call tells the library to load a dataset identified by the name `timdettmers/openassistant-guanaco`. This identifier typically consists of the username or organization name (`timdettmers` in this case) and the specific dataset name (`openassistant-guanaco`).

In [None]:
from datasets import load_dataset

dataset = load_dataset("timdettmers/openassistant-guanaco")

In [None]:
dataset

In [None]:
dataset["train"]

In [None]:
dataset["train"][0]

In [None]:
type(dataset["train"][0])

In [None]:
dataset["train"][0].keys()

In [None]:
dataset["train"][1]["text"]

In [None]:
type(dataset["train"][0]["text"])

In [None]:
len(dataset["train"][0]["text"])
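Each record's `text` field stores a whole conversation as one string, with turns marked by `### Human:` and `### Assistant:`. A minimal sketch of splitting such a string into turns follows. The `split_turns` helper and the sample string are hypothetical and only illustrate the format; they are not taken from the dataset.

```python
import re

# Hypothetical sample in the "### Human: ... ### Assistant: ..." layout.
sample_text = "### Human: What is a heap dump?### Assistant: A snapshot of the JVM heap."

def split_turns(text):
    """Split a guanaco-style string into (role, content) pairs."""
    # re.split with one capture group returns:
    # [prefix, role, content, role, content, ...]
    parts = re.split(r"### (Human|Assistant):", text)
    roles = parts[1::2]
    contents = [part.strip() for part in parts[2::2]]
    return list(zip(roles, contents))

turns = split_turns(sample_text)
print(turns)
```

The same helper could be applied to `dataset["train"][0]["text"]` to inspect a real record turn by turn.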

## Scrape Any PDF

We use the `PyMuPDF` package to extract text from the PDF, so we install it first.

In [None]:
! pip install PyMuPDF

### `read_pdf_content` function

In [None]:
import fitz  # PyMuPDF

def read_pdf_content(pdf_path):
    """
    Reads a PDF and returns its content as a list of strings.

    Args:
    pdf_path (str): The file path to the PDF.

    Returns:
    list of str: A list where each element is the text content of a PDF page.
    """
    content_list = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            content_list.append(page.get_text())

    return content_list

In [None]:
%%time

scraped_content = read_pdf_content("/content/JVM Troubleshooting Guide.pdf")
print("\n")
print(scraped_content)

In [None]:
len(scraped_content[0])

In [None]:
scraped_content = ' '.join(scraped_content)
print(scraped_content)

In [None]:
len(scraped_content)

In [None]:
type(scraped_content)

In [None]:
scraped_content.split('. ')[0]
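Splitting on `'. '` yields one sentence per item, which may be too little context for generating a useful question-answer pair. One possible alternative, sketched below, groups consecutive sentences into larger chunks. The `chunk_sentences` helper and its default group size are hypothetical choices, not part of the original notebook.

```python
def chunk_sentences(text, sentences_per_chunk=3):
    """Group consecutive '. '-separated sentences into larger chunks."""
    # Drop empty fragments left over from extraction whitespace.
    sentences = [s.strip() for s in text.split('. ') if s.strip()]
    chunks = []
    for start in range(0, len(sentences), sentences_per_chunk):
        group = sentences[start:start + sentences_per_chunk]
        chunks.append('. '.join(group))
    return chunks

demo = "A. B. C. D. E"
print(chunk_sentences(demo, sentences_per_chunk=2))
```

Larger chunks also mean fewer API calls later on, at the cost of fewer generated samples.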

## API Call to Create Data

Here we wrap the `client.chat.completions.create` function from *OpenAI* in a helper function that we will use to generate question-answer pairs.

In [None]:
OPENAI_API_KEY = "sk-token"
openai_client = openai.OpenAI(api_key=OPENAI_API_KEY)


def call_chatgpt(query: str, model: str = "gpt-3.5-turbo") -> str:
    """
    Generates a response to a query using the specified language model.
    Args:
        query (str): The user's query that needs to be processed.
        model (str, optional): The language model to be used. Defaults to "gpt-3.5-turbo".
    Returns:
        str: The generated response to the query.
    """

    # Prepare the conversation context with system and user messages.
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Question: {query}."},
    ]

    # Use the OpenAI client to generate a response based on the model and the conversation context.
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )

    # Extract the content of the response from the first choice.
    content: str = response.choices[0].message.content

    # Return the generated content.
    return content
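Hard-coding the key in the notebook risks leaking it when the notebook is shared. A common alternative, sketched here, reads it from an environment variable. The `load_api_key` helper is hypothetical, and the variable name `OPENAI_API_KEY` is an assumption.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

The result could then be passed as `api_key=load_api_key()` when constructing the client.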

In [None]:
resp = call_chatgpt("What is the Java Heap?")

In [None]:
resp
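Calls to a hosted API can fail transiently, for example on rate limits or network errors. A minimal retry sketch with exponential backoff follows. `with_retries` and its parameters are hypothetical, and it catches every `Exception` for brevity, whereas real code would likely narrow this to the client's own error types.

```python
import time

def with_retries(func, *args, max_attempts=3, base_delay=1.0, **kwargs):
    """Call func(*args, **kwargs), retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

For example, `with_retries(call_chatgpt, "What is the Java Heap?")` would retry the helper defined earlier up to three times.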

### Prompt Engineering

We use prompt engineering to ensure the content `GPT` returns is in the same format as the `openassistant-guanaco` data:

```python
    ### Human:
    ### Assistant:
```

In [None]:
def prompt_engineered_api(text: str):

    prompt = f"""
        I have the following content: {text}

        I want to create a question-answer content that has the following format:

        ### Human:
        ### Assistant:

        Make sure to write question and answer based on the content I provided.

        The ### Human means it's a question, and the ### Assistant means it's an answer.
    """

    resp = call_chatgpt(prompt)

    return resp

In [None]:
scraped_content.split('. ')[0]

In [None]:
resp = prompt_engineered_api(scraped_content.split('. ')[0])

resp

In [None]:
type(resp)
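The model is not guaranteed to follow the requested layout, so it may be worth checking each response before adding it to the dataset. A minimal sketch, using a hypothetical `has_qa_format` helper:

```python
def has_qa_format(text):
    """Return True if text has a '### Human:' marker followed by '### Assistant:'."""
    human_pos = text.find("### Human:")
    assistant_pos = text.find("### Assistant:")
    return human_pos != -1 and assistant_pos > human_pos

print(has_qa_format("### Human: What is GC?\n### Assistant: Garbage collection."))
print(has_qa_format("Sorry, I cannot help with that."))
```

Responses that fail the check could be skipped or regenerated.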

## Create `DatasetDict` Data Structure

```python
from datasets import Dataset, DatasetDict

# Example data - replace these with your actual data
train_data = {'text': [resp]*3}
test_data = {'text': [resp]*2}

# Create Dataset objects for training and testing
train_dataset = Dataset.from_dict(train_data)
test_dataset = Dataset.from_dict(test_data)

# Combine them into a DatasetDict
dataset_dict = DatasetDict({
    'train': train_dataset,
    'test': test_dataset
})

# Display the structure of the dataset
print(dataset_dict)
```

In [None]:
from datasets import Dataset, DatasetDict
from tqdm import tqdm

In [None]:
raw_content_for_train = []

# Split once instead of re-splitting the full text on every iteration.
sentences = scraped_content.split('. ')

for sentence in tqdm(sentences):
    resp = prompt_engineered_api(sentence)
    raw_content_for_train.append(resp)

In [None]:
raw_content_for_train[0]
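The loop above makes one API call per sentence, so a failure partway through loses every response collected so far. One possible safeguard, sketched below, appends each result to a JSON Lines file as it arrives. The `generate_with_checkpoint` and `load_checkpoint` helpers and the file path are hypothetical; the `generate` parameter stands in for `prompt_engineered_api`.

```python
import json

def generate_with_checkpoint(pieces, generate, path):
    """Run generate on each piece and append every result to a JSONL file."""
    results = []
    with open(path, "a", encoding="utf-8") as f:
        for piece in pieces:
            text = generate(piece)
            f.write(json.dumps({"text": text}) + "\n")
            f.flush()  # Push the line out so a later crash does not lose it.
            results.append(text)
    return results

def load_checkpoint(path):
    """Read back the texts saved by generate_with_checkpoint."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]
```

Because the file is opened in append mode, rerunning on the remaining sentences adds to the earlier results rather than overwriting them.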

In [None]:
# Use the first 100 generated samples for training and the rest for testing
train_data = {'text': raw_content_for_train[:100]}
test_data = {'text': raw_content_for_train[100:]}

# Create Dataset objects for training and testing
train_dataset = Dataset.from_dict(train_data)
test_dataset = Dataset.from_dict(test_data)

# Combine them into a DatasetDict
dataset_dict = DatasetDict({
    'train': train_dataset,
    'test': test_dataset
})

# Display the structure of the dataset
print(dataset_dict)

In [None]:
dataset_dict["train"][0]["text"]
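The fixed cut-off at index 100 leaves the test split empty when 100 or fewer samples were generated. A ratio-based split with shuffling is one alternative. The sketch below uses a hypothetical `split_train_test` helper, and the 80/20 ratio and the seed are arbitrary assumptions.

```python
import random

def split_train_test(samples, train_ratio=0.8, seed=42):
    """Shuffle a copy of samples and split it by ratio into (train, test)."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train_texts, test_texts = split_train_test([f"sample {i}" for i in range(10)])
print(len(train_texts), len(test_texts))
```

The two lists can be passed to `Dataset.from_dict` as before. The `datasets` library also offers `Dataset.train_test_split` for the same purpose.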

## Push to HuggingFace Hub

In [None]:
! huggingface-cli login

In [None]:
from huggingface_hub import HfApi, create_repo

In [None]:
# Replace 'xxx' with your actual Hugging Face auth token
# Replace 'jvm_troubleshooting_guide' with your desired repository name
auth_token = 'xxx'
repo_name = 'jvm_troubleshooting_guide'
username = 'CesarChaMal'  # replace with your Hugging Face username

api = HfApi()
# repo_type="dataset" creates a dataset repository (the default is a model repository).
# Set private=True if you want it to be a private dataset.
create_repo(repo_name, token=auth_token, repo_type="dataset", private=False)


In [None]:
app_id = f"{username}/{repo_name}"
print(app_id)

In [None]:
dataset_dict.push_to_hub(app_id)