In [1]:
import json
import os

import pandas as pd

from open_ai import OpenAIClient
from parallel_post_processor import ParallelPostProcessor
%load_ext autoreload
%autoreload 2

In [2]:
# Load the previously preprocessed dataset
file_path = os.path.join(os.getcwd(), "dataset", "processed_dataset.csv")
data_set = pd.read_csv(file_path)
# Use a lowercase name for the ID column
data_set = data_set.rename(columns={"ID": "id"})

## Data labeling with the GPT API

Rate limits for gpt-4o-mini:

- Up to 500 requests per minute.
- Up to 200,000 tokens per minute (higher than gpt-4o).
- Up to 10,000 requests per day.

Because of the limit of 10,000 requests per day, we split the dataset into two parts and label them one after the other: first the first part, then the second.
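The need for a split follows directly from the daily cap. A quick check, assuming one request per post and the 13,013 posts implied by the two halves reported in this notebook (6,507 and 6,506); the constant names are illustrative only:

```python
import math

DAILY_REQUEST_LIMIT = 10_000   # daily cap quoted above
TOTAL_POSTS = 6507 + 6506      # sizes of the two halves reported in this notebook

# One request per post, so the number of daily runs is a ceiling division.
days_needed = math.ceil(TOTAL_POSTS / DAILY_REQUEST_LIMIT)
print(days_needed)
```

Retried requests would also count against the cap, so this is a lower bound.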

#### Important conclusions

- To stay within the daily request limit and reduce the total number of requests, it would be more efficient to structure the prompt so that a single request analyzes two or more texts. This is feasible since their combined length is less than 5,000 words.

- The class that sends requests in parallel (ParallelPostProcessor) can be improved. For example, writing each response to disk right after its request is less efficient than writing in batches of N responses. For our small dataset, however, this is not critical.

- At the end, we need to merge the 10 result files written by the processes. This could become a bottleneck with larger datasets, so the merge step may be worth optimizing.

- With a very large number of errors (not our case, as our error rate is below one percent), error handling could cause delays because the current implementation uses locks.

- So far, we have not been able to fully generalize the ParallelPostProcessor module: it remains tailored to the specific GPT response format used in our task and may need further refinement.
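The first conclusion, sending several texts per request, could be sketched as follows. This is only an illustration under assumptions: `group_texts` and `build_multi_text_block` are hypothetical helpers that are not part of ParallelPostProcessor, the 5,000-word budget is taken from the bullet above, and the prompt would also have to ask for one JSON object per numbered text.

```python
from typing import Iterable, List


def group_texts(texts: Iterable[str], max_words: int = 5000) -> List[List[str]]:
    """Greedily pack texts into groups whose total word count stays within max_words.

    A single text longer than max_words still gets a group of its own.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_words = 0
    for text in texts:
        n_words = len(text.split())
        # Start a new group when adding this text would exceed the budget.
        if current and current_words + n_words > max_words:
            groups.append(current)
            current, current_words = [], 0
        current.append(text)
        current_words += n_words
    if current:
        groups.append(current)
    return groups


def build_multi_text_block(group: List[str]) -> str:
    """Number each text so the model can return one answer per text."""
    return "\n".join(f'Text {i}: "{text}"' for i, text in enumerate(group, start=1))
```

Each group then costs one request instead of one request per post, which lowers both the per-minute and the per-day request counts.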


##### Labeling half of the dataset (6507 posts) took 20 minutes and cost 41 cents.
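These figures can be compared with the per-minute limit quoted earlier. The arithmetic below assumes one request per post and ignores retries, so the values are rough estimates rather than measurements:

```python
POSTS = 6507        # posts labeled in the first half
MINUTES = 20        # reported labeling time
COST_USD = 0.41     # reported cost
RPM_LIMIT = 500     # requests-per-minute cap quoted earlier

requests_per_minute = POSTS / MINUTES   # average throughput over the run
cost_per_post_usd = COST_USD / POSTS    # average cost of labeling one post

print(f"{requests_per_minute:.0f} requests/min (limit {RPM_LIMIT})")
print(f"${cost_per_post_usd:.6f} per post")
```

The average rate of roughly 325 requests per minute stays below the 500 requests-per-minute cap.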

In [4]:
split_index = len(data_set) // 2

data_set_part1 = data_set.iloc[:split_index + 1]
data_set_part2 = data_set.iloc[split_index + 1:]

Making requests requires an API key. Here it is read from a local JSON config file.

In [5]:
current_directory = os.getcwd()
configs_name = "configs/open_ai_access.json"
configs_name_path = os.path.join(current_directory, configs_name)
with open(configs_name_path, 'r') as json_file:
    open_ai_data = json.load(json_file)

Labeling the posts can take a considerable amount of time, so we process them with several processes in parallel. To do this, we use the ParallelPostProcessor class from the parallel_post_processor module. The results_dir argument is the directory where the processing results are stored. The work is split across 10 processes.
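One plausible way to divide the work is to cut the dataset into contiguous row ranges, one per process. The sketch below is an assumption about how such a split could look, not the actual ParallelPostProcessor implementation; `chunk_bounds` is a hypothetical helper.

```python
from typing import List, Tuple


def chunk_bounds(n_rows: int, n_chunks: int) -> List[Tuple[int, int]]:
    """Return (start, stop) row ranges covering n_rows in n_chunks near-equal pieces."""
    base, extra = divmod(n_rows, n_chunks)
    bounds = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional row.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


bounds = chunk_bounds(6506, 10)
```

Each `(start, stop)` pair could then select a slice such as `data_set_part2.iloc[start:stop]` for one worker process.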

In [21]:
path_to_labeling_data = os.path.join(current_directory, "dataset")
parallel_p_p = ParallelPostProcessor(
    results_dir=path_to_labeling_data,
    api_key=open_ai_data["api_key"],
    organization=open_ai_data["organization"],
)

Next, the prompt has to be formulated carefully. Playground was used to test candidate prompts and to choose the final one.

In addition to the main task, we also ask the model to determine the gender and age of the post's author where possible. This may be useful for later text analysis.

In [7]:
prompt = """
Analyze the following text and answer three questions:

Text: "{text}"
1. What personal problems does the author describe about themselves? Only include issues that are explicitly stated as part of the author's own life **strictly in Russian**.
2. What is the gender of the author? Answer with "male", "female". If gender cannot be determined from the text, leave the gender "unspecified". Pay close attention to grammatical cues such as verb forms, pronouns, and adjectives that indicate gender **strictly in English**.
3. If the author explicitly states their age, extract it. If no age is mentioned, return "unspecified" **strictly in English**.
Generate the response **in valid JSON format**.

The response should look like this:

{{
  "problems": ["[describe the problem 1]", "[describe the problem 2]", ...],
  "gender": "[male/female/unspecified]",
  "age": "[please specify age or "unspecified"]"
}}
"""

In [22]:
parallel_p_p.process_dataset_in_parallel(
    dataset=data_set_part2,
    prompt_template=prompt,
    num_processes=10,
)

Progress: 100/6506 posts processed.
Progress: 200/6506 posts processed.
Progress: 300/6506 posts processed.
Progress: 400/6506 posts processed.
Progress: 500/6506 posts processed.
Progress: 600/6506 posts processed.
Progress: 700/6506 posts processed.
Progress: 800/6506 posts processed.
Progress: 900/6506 posts processed.
Progress: 1000/6506 posts processed.
Progress: 1100/6506 posts processed.
Progress: 1200/6506 posts processed.
Progress: 1300/6506 posts processed.
Progress: 1400/6506 posts processed.
Progress: 1500/6506 posts processed.
Progress: 1600/6506 posts processed.
Progress: 1700/6506 posts processed.
Progress: 1800/6506 posts processed.
Progress: 1900/6506 posts processed.
Progress: 2000/6506 posts processed.
Progress: 2100/6506 posts processed.
Progress: 2200/6506 posts processed.
Progress: 2300/6506 posts processed.
Progress: 2400/6506 posts processed.
Progress: 2500/6506 posts processed.
Progress: 2600/6506 posts processed.
Progress: 2700/6506 posts processed.
Progress: 
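Each reply produced during labeling has to be turned into the `problems`, `gender`, and `age` fields requested by the prompt. A minimal sketch of such a parser is shown here. It is an assumption, not the parsing code inside ParallelPostProcessor: `parse_label_response` is a hypothetical helper, and the handling of a Markdown code fence around the JSON is a guess about how replies may be formatted.

```python
import json
from typing import Any, Dict

ALLOWED_GENDERS = {"male", "female", "unspecified"}
FENCE = "`" * 3  # Markdown code-fence marker


def parse_label_response(raw: str) -> Dict[str, Any]:
    """Parse one model reply into a dict with problems, gender, and age."""
    text = raw.strip()
    # Assumption: the model may wrap the JSON in a Markdown code fence.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    data = json.loads(text)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    problems = data.get("problems", [])
    if not isinstance(problems, list):
        raise ValueError("'problems' must be a list")
    # Fall back to "unspecified" for any value outside the allowed set.
    gender = str(data.get("gender", "unspecified")).strip().lower()
    if gender not in ALLOWED_GENDERS:
        gender = "unspecified"
    age = str(data.get("age", "unspecified")).strip()
    return {"problems": problems, "gender": gender, "age": age}
```

Replies that fail to parse raise an exception, so the caller can record them separately from successful requests.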

Next, we go through the files in which each process recorded its labeled results, combine them for both parts of the dataset, and save the merged result to a single file.

In [33]:
folders = [
    'folder_to_first_labeled_dataset',
    'folder_to_second_labeled_dataset'
]

column_names = ['id', 'problems', 'gender', 'age']
num_processes = 10  # must match the num_processes used during labeling

# Read every per-process result file, then concatenate once at the end
frames = []
for folder_path in folders:
    for i in range(num_processes):
        file_path = os.path.join(folder_path, f'successful_requests_process_{i}.txt')
        # The files have no header row; fields are separated by ';'
        frames.append(pd.read_csv(file_path, delimiter=';', header=None, names=column_names))

data_set_labeled = pd.concat(frames, ignore_index=True)
data_set_labeled.to_csv('folder_to_final_joined_dataset', index=False, sep=';')
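After merging, it can be useful to check which posts ended up without a label, for example because their requests failed. The helper below is a hypothetical sketch of such a check and assumes the `id` values in the labeled files match those in the source dataset:

```python
from typing import Iterable, List


def find_missing_ids(expected_ids: Iterable[int], labeled_ids: Iterable[int]) -> List[int]:
    """Return ids present in the source dataset but absent from the merged labels."""
    labeled = set(labeled_ids)
    return [post_id for post_id in expected_ids if post_id not in labeled]
```

In this notebook it could be called as `find_missing_ids(data_set["id"], data_set_labeled["id"])`, and the returned ids could be submitted again.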