Generating a dataset using GPT-3.5 to fine-tune Llama 2 for news classification

News articles are valuable in machine learning research for several reasons. They cover a wide range of subjects, such as politics, economics, and technology, and they often use intricate language, including metaphors, analogies, and specialized terminology. This diverse and abundant text is useful in both research and industry for training and evaluating machine learning models, and it supports progress in natural language understanding and related areas.

In [1]:
!pip install --upgrade openai --progress-bar off
!pip install -Uqqq datasets --progress-bar off

Collecting openai
  Downloading openai-1.2.3-py3-none-any.whl (220 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.25.1-py3-none-any.whl (75 kB)
Collecting httpcore (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.2-py3-none-any.whl (76 kB)
Collecting h11<0.15,>=0.13 (from httpcore->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, httpcore, httpx, openai
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires tiktoken, which is not installed.
Successfully installed h11-0.14.0 httpcore-1.0.2 httpx-0.25.1 openai-1.2.3


In [2]:
import pandas as pd
import numpy as np
import openai
import time
import random
from random import randrange
from tenacity import retry, stop_after_attempt, wait_random_exponential, retry_if_exception_type
from datasets import load_dataset
import warnings
warnings.filterwarnings('ignore')
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

Mounted at /content/drive


The Databricks Dolly 15K dataset contains roughly 15,000 high-quality, human-generated prompt/response pairs designed for instruction-tuning large language models.

In [3]:
instruction_dataset_name = "databricks/databricks-dolly-15k"
dataset = load_dataset(instruction_dataset_name, split = "train")

print(f'Number of prompts: {len(dataset)}')
print(f'Column names are: {dataset.column_names}')
print(dataset[randrange(len(dataset))])

Number of prompts: 15011
Column names are: ['instruction', 'context', 'response', 'category']
{'instruction': 'why did Syd Barrett left the Pink Floyd?', 'context': '', 'response': 'Syd Barrett was one of the main members of the Pink Floyd. He has used drugs a lot and after a while he was not able to perform with the team. Even though the band gave him multiple chances, he could not keep up with the band. In the end, they had to ask him to leave the band. The main reason that he has left the team is his health problems and addiction to drugs.', 'category': 'open_qa'}


In [4]:
path = "/content/drive/MyDrive/"
input_data_filename = "signalmedia-1m.jsonl.gz"
preprocessed_data_filename = "signalmedia_news_dataset_sample.csv"
processed_data_filename = "signalmedia_news_dataset_sample_classified.csv"
output_data_json_filename = "news_classification.json"
output_data_csv_filename = "news_classification.csv"

# Read the API key from the environment; never hard-code or commit a secret
import os
openai.api_key = os.environ["OPENAI_API_KEY"]
model_name = "gpt-3.5-turbo"

The Signal 1 Million News Articles Dataset by Signal AI, available as a gzip-compressed JSONL file, contains 1 million news articles and blog posts from a variety of sources, covering a period of one month (September 2015). Approximately 735K records are news articles and 265K are blog articles. We will select only 1,000 news articles to fine-tune Llama 2.
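The split between news and blog records can be checked without loading the whole file into memory. The following is a minimal sketch, assuming each line of the compressed file is one JSON object with a `media-type` field (the field used by the filtering cell in this notebook). The helper name `count_media_types` is hypothetical and not part of the original notebook.

```python
import gzip
import json
from collections import Counter


def count_media_types(jsonl_gz_path, field="media-type"):
    """Count records per media type by streaming a gzipped JSONL file.

    Assumption: one JSON object per line; `field` names the media-type key.
    """
    counts = Counter()
    with gzip.open(jsonl_gz_path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Records without the field are grouped under a placeholder key
            counts[record.get(field, "<missing>")] += 1
    return counts
```

Calling `count_media_types(f"{path}{input_data_filename}")` would return a `Counter` keyed by media type, which can be compared against the approximate figures quoted above.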

In [6]:
# Generating random indices
n_samples = 10
random_indices = random.sample(range(len(dataset)), n_samples)
samples = []

# Appending prompts to a list
for idx in random_indices:
    sample = dataset[idx]

    sample_data = {
        'instruction': sample['instruction'],
        'context': sample['context'],
        'response': sample['response'],
        'category': sample['category']
    }
    samples.append(sample_data)

# Creating a DataFrame
dolly_df = pd.DataFrame(samples)

In [7]:
display(dolly_df)

Unnamed: 0,instruction,context,response,category
0,What is the scientific name for a jaguar?,,Panthera onca,open_qa
1,From the passage identify the places where Bac...,Bacteria (/bækˈtɪəriə/ (listen); singular: bac...,"soil, water, acidic hot springs, radioactive w...",information_extraction
2,What is a noun?,,"Noun can be used that can define a Place, Name...",open_qa
3,What are questions you can ask to get to know ...,,Questions you can ask to get to know someone a...,open_qa
4,Is Singapore a good place to develop wealth?,Economy\nMain article: Economy of Singapore\nS...,"Singapore economy is regarded as free, innovat...",summarization
5,Classify each of the following as root or shoo...,,"tomato, brinjal, lady finger, cucumber are sho...",classification
6,What is the minimal set of garden tools to sta...,,"For an outdoor garden, you only need a spade, ...",brainstorming
7,What are the prizes of the Festival of Festiva...,"Golden Gryphon, Silver Gryphon, Bronze Gryphon...",Grand Prix – Gold or Golden Gryphon (Griffon) ...,information_extraction
8,What is the etymology of the word cookie?,The word cookie dates from at least 1701 in Sc...,The earliest known usage of the work cookie co...,summarization
9,What are the Olympic light weight events.,The first lightweight events were added to the...,Two Olympic lightweight events are men's doubl...,summarization


In [8]:
raw_news_df = pd.read_json(f"{path}{input_data_filename}", lines = True)

# Selecting "News" records
raw_news_df2 = raw_news_df[raw_news_df['media-type'] == "News"]

# Shuffling the dataset
raw_news_df3 = raw_news_df2.sample(frac = 1)

# Selecting top 1000 records/news articles
raw_news_df4 = raw_news_df3.head(1000)

# Saving the preprocessed data as a CSV file
raw_news_df4.to_csv(f"{path}{preprocessed_data_filename}", index = False)

ValueError: ignored
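
The traceback above is truncated, so the cause of the `ValueError` cannot be determined from the output alone. If the failure is related to parsing the full 1M-record file in a single call, one possible workaround is to stream the file line by line and keep a uniform random sample of the news records using reservoir sampling. This is a different technique from the shuffle-then-`head` approach in the cell above, not a reproduction of it. The sketch below assumes one JSON object per line; `sample_news_records` is a hypothetical helper, and only the `media-type` field comes from the original code.

```python
import gzip
import json
import random


def sample_news_records(jsonl_gz_path, k=1000, media_type="News", seed=None):
    """Return up to k uniformly sampled records whose media-type matches.

    Uses reservoir sampling, so only k records are held in memory at once.
    """
    rng = random.Random(seed)
    reservoir = []
    seen = 0  # number of matching records encountered so far
    with gzip.open(jsonl_gz_path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if record.get("media-type") != media_type:
                continue
            seen += 1
            if len(reservoir) < k:
                # Fill the reservoir with the first k matching records
                reservoir.append(record)
            else:
                # Keep the new record with probability k / seen
                j = rng.randrange(seen)
                if j < k:
                    reservoir[j] = record
    return reservoir
```

The sampled records could then be converted with `pd.DataFrame(sample_news_records(f"{path}{input_data_filename}"))` and saved with `to_csv` as in the cell above. Passing a `seed` makes the sample reproducible.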