Generating a dataset using GPT-3.5 to fine-tune Llama 2 for news classification

News articles are valuable in machine learning research for several reasons. They cover a wide range of subjects, such as politics, economics, and technology, and they often use intricate language, including metaphors, analogies, and specialized terminology. This diverse and abundant text is useful in both research and industry for training and evaluating machine learning models, particularly for natural language understanding.

In [31]:
!pip install --upgrade openai --progress-bar off
!pip install -Uqqq datasets --progress-bar off



In [32]:
import pandas as pd
import numpy as np
import openai
import time
import random
from random import randrange
from tenacity import retry, stop_after_attempt, wait_random_exponential, retry_if_exception_type
from datasets import load_dataset
import warnings
warnings.filterwarnings('ignore')
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

Mounted at /content/drive


Databricks Dolly 15K contains roughly 15,000 high-quality, human-generated prompt/response pairs designed for instruction-tuning large language models. Each record has an instruction, an optional context, a response, and a category.

In [33]:
instruction_dataset_name = "databricks/databricks-dolly-15k"
dataset = load_dataset(instruction_dataset_name, split = "train")

print(f'Number of prompts: {len(dataset)}')
print(f'Column names are: {dataset.column_names}')
print(dataset[randrange(len(dataset))])

Number of prompts: 15011
Column names are: ['instruction', 'context', 'response', 'category']
{'instruction': 'Give me list of main cast of Friends TV show', 'context': 'Friends is an American television sitcom created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast starring Jennifer Aniston, Courteney Cox, Lisa Kudrow, Matt LeBlanc, Matthew Perry and David Schwimmer, the show revolves around six friends in their 20s and 30s who live in Manhattan, New York City. The series was produced by Bright/Kauffman/Crane Productions, in association with Warner Bros. Television. The original executive producers were Kevin S. Bright, Kauffman, and Crane.', 'response': 'Jennifer Aniston, \nCourteney Cox, \nLisa Kudrow, \nMatt LeBlanc, \nMatthew Perry, \nDavid Schwimmer', 'category': 'information_extraction'}


In [34]:
path = "/content/drive/MyDrive/"
input_data_filename = "signalmedia-1m.jsonl.gz"
preprocessed_data_filename = "signalmedia_news_dataset_sample.csv"
processed_data_filename = "signalmedia_news_dataset_sample_classified.csv"
output_data_json_filename = "news_classification.json"
output_data_csv_filename = "news_classification.csv"

# Read the API key from the environment instead of hard-coding it in the notebook
import os
openai.api_key = os.environ["OPENAI_API_KEY"]
model_name = "gpt-3.5-turbo"

The Signal 1 Million News Articles Dataset by Signal AI is distributed as a gzipped JSONL file. It contains 1 million news and blog articles from a variety of sources, collected over one month (September 2015): approximately 735K news articles and 265K blog articles. We select only 1,000 news articles to build the fine-tuning dataset for Llama 2.
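The notebook loads the whole file into memory before filtering and sampling. As an alternative, the sketch below streams the gzipped JSONL file and keeps a uniform random sample of news records using reservoir sampling. The function name `sample_news_records` and its parameters are hypothetical; only the `media-type` field comes from the dataset as shown in this notebook.

```python
import gzip
import json
import random


def sample_news_records(path, k, media_type="News", seed=None):
    """Reservoir-sample up to k records of one media type from a gzipped JSONL file."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0  # number of matching records read so far
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if record.get("media-type") != media_type:
                continue
            seen += 1
            if len(reservoir) < k:
                # Fill the reservoir with the first k matching records
                reservoir.append(record)
            else:
                # Keep the new record with probability k / seen
                j = rng.randrange(seen)
                if j < k:
                    reservoir[j] = record
    return reservoir
```

The result is a list of dictionaries, so `pd.DataFrame(sample_news_records(f"{path}{input_data_filename}", 1000))` would give a frame comparable to the sampled one built later in the notebook, without holding all 1 million records in memory.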

In [35]:
# Generating random indices
n_samples = 10
random_indices = random.sample(range(len(dataset)), n_samples)
samples = []

# Appending prompts to a list
for idx in random_indices:
    sample = dataset[idx]

    sample_data = {
        'instruction': sample['instruction'],
        'context': sample['context'],
        'response': sample['response'],
        'category': sample['category']
    }
    samples.append(sample_data)

# Creating a DataFrame
dolly_df = pd.DataFrame(samples)

In [36]:
display(dolly_df)

Unnamed: 0,instruction,context,response,category
0,How to make a dish thicken fast without adding...,,Everyone likes when the dish is thick and crea...,general_qa
1,Classify each of the following as either title...,,"George R.R. Martin: A Game of Thrones, Dying o...",classification
2,Please give me a brief summary of this paragra...,"The winningest quarterback in NFL history, Bra...",Brady is the winningest quarterback in NFL his...,summarization
3,"Given this paragraph about Simon Bolivar, tell...",Simón José Antonio de la Santísima Trinidad Bo...,"Simon Bolivar was born in Caracas, Venezuela o...",closed_qa
4,Who planted the first wine vineyard at Califor...,,Father Junípero Serra,open_qa
5,"Classify each as a ocean, sea, or lake: Pacifi...",,Pacific - ocean\nMediterranean - sea \nErie - ...,classification
6,What high school did Paul Allen and Bill Gates...,,Lakeside High School in Seattle Washington,open_qa
7,What is the Maareech Advanced Torpedo Defence ...,Maareech Advanced Torpedo Defence System (ATDS...,The Maareech Advanced Torpedo Defence System (...,information_extraction
8,Give me a bulleted list of the 7 wonders of th...,,Here are the 7 wonders of the ancient world:\n...,brainstorming
9,"Given this reference text about beneficence, w...",Beneficence is a concept in research ethics th...,Ensure you are not harming your research parti...,closed_qa


In [37]:
raw_news_df = pd.read_json(f"{path}{input_data_filename}", lines = True)

# Selecting "News" records
raw_news_df2 = raw_news_df[raw_news_df['media-type'] == "News"]

# Shuffling the dataset
raw_news_df3 = raw_news_df2.sample(frac = 1)

# Selecting top 1000 records/news articles
raw_news_df4 = raw_news_df3.head(1000)

# Saving the preprocessed data as a CSV file
raw_news_df4.to_csv(f"{path}{preprocessed_data_filename}", index = False)

In [38]:
# Loading the preprocessed data as a Pandas DataFrame
prep_news_df = pd.read_csv(f"{path}{preprocessed_data_filename}")

In [39]:
display(prep_news_df)

Unnamed: 0,id,content,title,media-type,source,published
0,a4ed7af0-4d7d-491e-bbd2-1b0a973c0617,"We've glimpsed the future, and it's all about ...","""Sweatshirts Are the Future"" and More Kanye We...",News,Penfield Post,2015-09-25T05:08:44Z
1,86afc956-f25a-4be2-8387-f7acb472e66e,The lower jobless rate means the Fed is sure t...,The jobs report leaves Wall Street in a bind,News,MyInforms,2015-09-04T20:00:29Z
2,80cb4093-9c1d-4f9c-8a1a-09aadbf96628,Zacks lowered shares of Vedanta (NASDAQ:VEDL) ...,Zacks Downgrades Vedanta to Sell (VEDL),News,Lulegacy.com,2015-09-23T12:59:37Z
3,31d6cc7d-c1cc-46f7-aad8-11626ea2de3f,BY\r\n ...,Kurt Angle’s brother charged in woman’s death ...,News,New York Daily News,2015-09-22T15:07:15Z
4,7f71a919-e91d-4e21-b892-7257eecaefa5,Women in the City campaign aims to link city a...,Crow coasts to Olympic rowing berth,News,BrisbaneNews.Net,2015-09-04T16:14:41Z
...,...,...,...,...,...,...
995,156c5305-e571-444b-bb4f-440365372552,Photo Credit: Fox News A boy is diagnosed w...,Autoimmune disease tragically melts face of boy,News,Kicker Daily News,2015-09-03T10:53:10Z
996,329076f7-6f04-42c1-ac4f-c5ec8cccd638,during the UEFA Champions League group F footb...,Olympiakos' Nigerian forward Ideye Brown (R) v...,News,UAE NewsApp.com,2015-09-16T19:19:04Z
997,24a4c661-4407-4b3a-8503-f94bc0ea8844,"SOURCE Raytheon Company\n\nTUCSON, Ariz. \n\n""...",US Navy approves full-rate production for Rayt...,News,WFMJ 21 - TV,2015-09-03T12:28:00Z
998,85051f7e-0c5a-41bb-bc47-de1aac205799,(CNN) - \nUkraine banned Russian airlines fro...,Ukraine bans Russian airlines from country,News,KWCH 12,2015-09-26T10:42:27Z


In [40]:
# Defining bot behavior and instructing
SYSTEM_PROMPT = """You are ChatGPT, an intelligent bot. I will give you a news article. You have to classify the news into one of the 43 categories."""

USER_PROMPT_1 = """Are you clear about your role?"""

ASSISTANT_PROMPT_1 = """Sure, I'm ready to help you with your news classification task. Please provide me with the necessary information to get started."""

# Few Shot Prompting
PROMPT = (
"""
Categories:

U.S. NEWS
COMEDY
PARENTING
WORLD NEWS
CULTURE & ARTS
TECH
SPORTS
ENTERTAINMENT
POLITICS
WEIRD NEWS
ENVIRONMENT
EDUCATION
CRIME
SCIENCE
WELLNESS
BUSINESS
STYLE & BEAUTY
FOOD & DRINK
MEDIA
QUEER VOICES
HOME & LIVING
WOMEN
BLACK VOICES
TRAVEL
MONEY
RELIGION
LATINO VOICES
IMPACT
WEDDINGS
COLLEGE
PARENTS
ARTS & CULTURE
STYLE
GREEN
TASTE
HEALTHY LIVING
THE WORLDPOST
GOOD NEWS
WORLDPOST
FIFTY
ARTS
DIVORCE
ESG

If you don't know the category, respond with "OTHERS".

Output Format:
Category name

Examples:
1. News: New Product Gives Marketers Access to Real Keywords, Conversions and Results Along With 13 Months of Historical Data

SAN FRANCISCO, CA -- (Marketwired) -- 09/17/15 -- Jumpshot, a marketing analytics company that uses distinctive data sources to paint a complete picture of the online customer journey, today announced the launch of Jumpshot Elite, giving marketers insight into what their customers are doing the 99% of the time they're not on your site. For years, marketers have been unable to see what organic and paid search terms users were entering, much less tie those searches to purchases. Jumpshot not only injects that user search visibility back into the market, but also makes it possible to tie those keywords to conversions -- for any web site.

"Ever since search engines encrypted search results, marketers have been in the dark about keywords, impacting not only the insight into their own search investments, but also their ability to unearth high converting keywords for their competitors," said Deren Baker, CEO of Jumpshot. "Our platform eliminates the hacks, assumptions, and guesswork that marketers are doing now and provides real data: actual searches tied to actual conversions conducted by real people with nothing inferred."

Unlike other keyword research tools that receive data through the Adwords API or send bots to cobble together various data inputs and implied metrics, Jumpshot leverages its panel of over 115 million global consumers to analyze real search activity. As a result, Jumpshot is able to provide companies with actionable data to improve the ROI of their search marketing campaigns, SEO tactics and content marketing initiatives.

Available today, Jumpshot Elite provides 13 months of backward-looking data as well as:

Access to real queries used by searchers

Paid and organic results for any website

Visibility into organic keywords, eliminating the "not provided" outcome in web analytics

Real user queries, clicks and transactions instead of machine-generated clicks with inferred results

Ability to tie keywords to real transactions on any website

Variable attribution models and lookback windows

Launched in January, 2015, Jumpshot grew out of the ambitions of a group of smart marketers and data scientists who were frustrated about the limitations of the data they had access to, and excited about the opportunity to provide new insights into online behavior.

The company uses distinctive data sources to paint a complete picture of the online world for businesses, from where customers spend time online to what they do there and how they get from place to place. By tracking the online customer journey down to each click, Jumpshot reveals how and why customers arrive at purchase decisions. The company tracks more data in more detail than other services, tracking 160 billion monthly clicks generated by its extensive data panel.

About Jumpshot

Jumpshot is a marketing analytics platform that reveals the entire customer journey -- from the key sources of traffic to a site, to browsing and buying behavior on any domain. With a panel of 115 million users, Jumpshot provides marketers with the insight to understand what their customers are doing the 99% of the time they're not on their own site -- a scope of information never before attainable. Jumpshot was founded in 2015 and is headquartered in San Francisco.

For more information, please visit www.jumpshot.com.

Image Available: http://www2.marketwire.com/mw/frame_mw?attachid=2889222

Kelly Mayes

The Bulleit Group

615-200-8845

Published Sep. 17, 2015

Copyright © 2015 SYS-CON Media, Inc. — All Rights Reserved.

Syndicated stories and blog feeds, all rights reserved by the author.

Output: TECH

2. News: SOURCE Harwood Feffer LLP

NEW YORK

On July 21, 2015

On this news, VASCO stock nearly 33% and has not recovered.

Our investigation concerns whether the Company board of directors has breached its fiduciary duties to shareholders, grossly mismanaged the Company, and/or committed abuses of control in connection with the foregoing.

If you own VASCO shares and wish to discuss this matter with us, or have any questions concerning your rights and interests with regard to this matter, please contact:

Robert I. Harwood, Esq.

Harwood Feffer

The law firm responsible for this advertisement is Harwood Feffer LLP (www.hfesq.com). Prior results do not guarantee or predict a similar outcome with respect to any future matter.

Logo - http://photos.prnewswire.com/prnh/20120215/MM54604LOGO

To view the original version on PR Newswire, visit:http://www.prnewswire.com/news-releases/harwood-feffer-llp-announces-investigation-of-vasco-data-security-international-inc-300149371.html

©2015 PR Newswire. All Rights Reserved.

Output: BUSINESS

3. {}
Output:
"""
)

OpenAI Chat Completions API Call

In [41]:
import os
from openai import OpenAI

def openai_chat_completion_response(USER_PROMPT_2):
  # Read the API key from the environment rather than hard-coding it
  client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

  response = client.chat.completions.create(
    model=model_name,  # "gpt-3.5-turbo", defined in the configuration cell
    messages = [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": USER_PROMPT_1},
                    {"role": "assistant", "content": ASSISTANT_PROMPT_1},
                    {"role": "user", "content": USER_PROMPT_2}
                ]

  )

  # Return only the text of the first completion
  return response.choices[0].message.content



News Classification Function Call

In [42]:
# Function to classify news articles
def predict_news_category(news_body):
  # Add news article to the prompt
  NEWS = news_body
  FINAL_PROMPT = PROMPT.format(NEWS)
  # Send prompt for inference
  try:
    classify_news = openai_chat_completion_response(FINAL_PROMPT)
  except Exception:
    # Output "NA" if the request fails
    classify_news = "NA"
  time.sleep(20)
  return classify_news
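The classification function above makes a single attempt per article and records "NA" on any failure. A retry with randomized exponential backoff is a common way to recover from transient errors such as rate limits. The sketch below uses only the standard library; `call_with_backoff` and its parameters are hypothetical names, and the `tenacity` decorators imported at the top of the notebook offer the same pattern in decorator form.

```python
import random
import time


def call_with_backoff(func, *args, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call func(*args), retrying failures with randomized exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: let the caller decide what to do
                raise
            # The upper bound doubles each attempt, capped at max_delay
            upper = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, upper))
```

Inside `predict_news_category`, the request could then be written as `call_with_backoff(openai_chat_completion_response, FINAL_PROMPT)`, keeping the existing `try`/`except` as the final fallback to "NA".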

In [43]:
# Selecting 100 records at a time for inference
prep_news_df2 = prep_news_df.iloc[0:100,:].copy()

In [44]:
# Lambda function to iterate over news articles and save response as a new column
prep_news_df2['predicted_category'] = prep_news_df2['content'].apply(lambda x: predict_news_category(x))
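Since only 100 of the 1,000 sampled records are classified per run, it can help to compute the slice boundaries for every batch up front. This is a minimal sketch; `batch_ranges` is a hypothetical helper, not part of the original notebook.

```python
def batch_ranges(total, size):
    """Yield (start, stop) index pairs that cover range(total) in chunks of `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, total, size):
        # The last batch may be shorter than `size`
        yield start, min(start + size, total)
```

Each pair can be passed to `prep_news_df.iloc[start:stop, :].copy()`, and saving each classified batch to its own file before moving on means a failed run does not lose earlier results.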

In [46]:
display(prep_news_df2[['content', 'predicted_category']])

Unnamed: 0,content,predicted_category
0,"We've glimpsed the future, and it's all about ...",ENTERTAINMENT
1,The lower jobless rate means the Fed is sure t...,BUSINESS
2,Zacks lowered shares of Vedanta (NASDAQ:VEDL) ...,BUSINESS
3,BY\r\n ...,SPORTS
4,Women in the City campaign aims to link city a...,OTHERS
...,...,...
95,A teen says Obama is wrong for inviting Ahmed ...,POLITICS
96,"Sep 5, 2015; Baton Rouge, LA, USA; Lightning s...",SPORTS
97,"The Genesis Prize Foundation, in partnership w...",RELIGION
98,"]]> \nNEW YORK , Sept. 8, 2015 /PRNewswire/ --...",TECH


In [48]:
# Saving the output file

prep_news_df2.to_csv(f"{path}{processed_data_filename}", index = False)

Creating the Instruction Dataset

In [49]:

# Loading processed data as a Pandas DataFrame
prep_news_df2 = pd.read_csv(f"{path}{processed_data_filename}")

In [50]:
display(prep_news_df2)

Unnamed: 0,id,content,title,media-type,source,published,predicted_category
0,a4ed7af0-4d7d-491e-bbd2-1b0a973c0617,"We've glimpsed the future, and it's all about ...","""Sweatshirts Are the Future"" and More Kanye We...",News,Penfield Post,2015-09-25T05:08:44Z,ENTERTAINMENT
1,86afc956-f25a-4be2-8387-f7acb472e66e,The lower jobless rate means the Fed is sure t...,The jobs report leaves Wall Street in a bind,News,MyInforms,2015-09-04T20:00:29Z,BUSINESS
2,80cb4093-9c1d-4f9c-8a1a-09aadbf96628,Zacks lowered shares of Vedanta (NASDAQ:VEDL) ...,Zacks Downgrades Vedanta to Sell (VEDL),News,Lulegacy.com,2015-09-23T12:59:37Z,BUSINESS
3,31d6cc7d-c1cc-46f7-aad8-11626ea2de3f,BY\r\n ...,Kurt Angle’s brother charged in woman’s death ...,News,New York Daily News,2015-09-22T15:07:15Z,SPORTS
4,7f71a919-e91d-4e21-b892-7257eecaefa5,Women in the City campaign aims to link city a...,Crow coasts to Olympic rowing berth,News,BrisbaneNews.Net,2015-09-04T16:14:41Z,OTHERS
...,...,...,...,...,...,...,...
95,f072a43a-f939-47b8-8902-89b31d780634,A teen says Obama is wrong for inviting Ahmed ...,Teen slams White House over invite,News,Click2Houston.com,2015-09-21T01:09:07Z,POLITICS
96,a3a15467-cee3-4b95-8c2c-4af4a33b636a,"Sep 5, 2015; Baton Rouge, LA, USA; Lightning s...",LSU announces refund plan for canceled game 25...,News,WWL-TV,2015-09-08T22:13:33Z,SPORTS
97,133b39b2-56e7-4aac-919e-55300c5572fc,"The Genesis Prize Foundation, in partnership w...",Reaching out to bring in,News,Jewish News of Greater Phoenix,2015-09-02T17:00:00Z,RELIGION
98,4a9102f1-9999-4fa3-bd3d-218307af1d84,"]]> \nNEW YORK , Sept. 8, 2015 /PRNewswire/ --...",EdTech Startup Surpasses 2 Million User Downlo...,News,Sys-Con Media,2015-09-08T15:00:29Z,TECH


In [51]:
# Frequency distribution of predicted news categories
# value_counts already sorts in descending order; rename_axis plus
# reset_index(name=...) gives the same column names on pandas 1.x and 2.x
pred_cat_freq_dist = (
    prep_news_df2['predicted_category']
    .value_counts(dropna = False)
    .rename_axis("predicted_category")
    .reset_index(name = "count")
)
display(pred_cat_freq_dist)

Unnamed: 0,predicted_category,count
0,TECH,14
1,SPORTS,14
2,BUSINESS,12
3,OTHERS,10
4,ENTERTAINMENT,9
5,WORLD NEWS,7
6,POLITICS,6
7,CRIME,4
8,TRAVEL,3
9,MEDIA,2


In [52]:
# Dealing with new, unspecified categories:
# merging new or duplicate labels into the existing categories
category_merges = {
    "TECHNOLOGY": "TECH",
    "SPACE": "SCIENCE",
    "FINANCE": "MONEY",
    "MARKETING & ADVERTISING": "OTHERS",
    "ARTS & CULTURE": "CULTURE & ARTS",
}
prep_news_df2['predicted_category'] = prep_news_df2['predicted_category'].replace(category_merges)
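The merge step above handles a fixed set of known variants. A more general approach is to normalize every predicted label against an allowed set and fall back to "OTHERS" for anything unexpected. The sketch below assumes the 18 categories used in the instruction text as the target set; `normalize_category`, `ALLOWED_CATEGORIES`, and `CATEGORY_ALIASES` are hypothetical names.

```python
# Assumption: the target label set is the 18 categories from the instruction text
ALLOWED_CATEGORIES = {
    "WORLD NEWS", "COMEDY", "POLITICS", "TECH", "SPORTS", "BUSINESS",
    "OTHERS", "ENTERTAINMENT", "CULTURE & ARTS", "FOOD & DRINK", "MEDIA",
    "RELIGION", "MONEY", "HEALTHY LIVING", "SCIENCE", "EDUCATION", "CRIME",
    "ENVIRONMENT",
}

# Variants merged in the notebook, expressed as an alias table
CATEGORY_ALIASES = {
    "TECHNOLOGY": "TECH",
    "SPACE": "SCIENCE",
    "FINANCE": "MONEY",
    "MARKETING & ADVERTISING": "OTHERS",
    "ARTS & CULTURE": "CULTURE & ARTS",
}


def normalize_category(label, allowed=ALLOWED_CATEGORIES,
                       aliases=CATEGORY_ALIASES, fallback="OTHERS"):
    """Map a raw model label to an allowed category, or None if it is missing."""
    if not isinstance(label, str):
        # Missing values (e.g. NaN after reading "NA" from CSV) stay missing
        return None
    cleaned = " ".join(label.strip().upper().split())
    cleaned = aliases.get(cleaned, cleaned)
    return cleaned if cleaned in allowed else fallback
```

It could be applied with `prep_news_df2['predicted_category'].map(normalize_category)`. Note that under this assumed set, a label such as TRAVEL would fall back to OTHERS; extend `ALLOWED_CATEGORIES` if such labels should be kept.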

In [53]:
# Frequency distribution of updated predicted news categories
pred_cat_freq_dist = prep_news_df2['predicted_category'].value_counts(dropna = False).sort_values(ascending = False).reset_index()
pred_cat_freq_dist = pred_cat_freq_dist.rename(columns = {"index": "predicted_category", "predicted_category": "count"})
display(pred_cat_freq_dist)

Unnamed: 0,predicted_category,count
0,SPORTS,14
1,TECH,14
2,BUSINESS,12
3,OTHERS,10
4,ENTERTAINMENT,9
5,WORLD NEWS,7
6,POLITICS,6
7,CRIME,4
8,TRAVEL,3
9,RELIGION,2


In [54]:
# Adding the same instruction to every news article / news category pair
prep_news_df2['instruction'] = """Categorize the news article into one of the 18 categories:

WORLD NEWS
COMEDY
POLITICS
TECH
SPORTS
BUSINESS
OTHERS
ENTERTAINMENT
CULTURE & ARTS
FOOD & DRINK
MEDIA
RELIGION
MONEY
HEALTHY LIVING
SCIENCE
EDUCATION
CRIME
ENVIRONMENT

"""

In [55]:
# Removing null news category records
prep_news_df3 = prep_news_df2[~prep_news_df2['predicted_category'].isna()]

# Renaming and selecting relevant columns
prep_news_df4 = prep_news_df3.rename(columns = {'content': 'input', 'predicted_category': 'output'})
output_news_df = prep_news_df4[['instruction', 'input', 'output']]

In [56]:
display(output_news_df)

Unnamed: 0,instruction,input,output
0,Categorize the news article into one of the 18...,"We've glimpsed the future, and it's all about ...",ENTERTAINMENT
1,Categorize the news article into one of the 18...,The lower jobless rate means the Fed is sure t...,BUSINESS
2,Categorize the news article into one of the 18...,Zacks lowered shares of Vedanta (NASDAQ:VEDL) ...,BUSINESS
3,Categorize the news article into one of the 18...,BY\r\n ...,SPORTS
4,Categorize the news article into one of the 18...,Women in the City campaign aims to link city a...,OTHERS
...,...,...,...
95,Categorize the news article into one of the 18...,A teen says Obama is wrong for inviting Ahmed ...,POLITICS
96,Categorize the news article into one of the 18...,"Sep 5, 2015; Baton Rouge, LA, USA; Lightning s...",SPORTS
97,Categorize the news article into one of the 18...,"The Genesis Prize Foundation, in partnership w...",RELIGION
98,Categorize the news article into one of the 18...,"]]> \nNEW YORK , Sept. 8, 2015 /PRNewswire/ --...",TECH


In [57]:
# Converting to a list of JSON strings (one per record)
news_json = output_news_df.to_json(orient = 'records', lines = True).splitlines()

In [58]:
print(news_json[0])

{"instruction":"Categorize the news article into one of the 18 categories:\n\nWORLD NEWS\nCOMEDY\nPOLITICS\nTECH\nSPORTS\nBUSINESS\nOTHERS\nENTERTAINMENT\nCULTURE & ARTS\nFOOD & DRINK\nMEDIA\nRELIGION\nMONEY\nHEALTHY LIVING\nSCIENCE\nEDUCATION\nCRIME\nENVIRONMENT\n\n","input":"We've glimpsed the future, and it's all about Kanye. Kanye West gave a mega-interview to Vanity Fair that came out Thursday. They had a chance to talk to him just after his New York Fashion Week collection hit the runways. Here's what we learned: On stressing out about his collection West referred repeatedly to not sleeping for days on end, or crashing in his workspace. \"I slept at the studio and I would have dreams or nightmares about the look board,\" he said. Hillary Clinton has one request for Kanye's 2020 presidential run On how his album is coming along He described it as \"a sonic landscape, a two-year painting.\" During his show, they played a song called \"Fade\" from the upcoming album. \"The song I pl

In [59]:
# Saving as a JSON Lines file (one JSON record per line)
with open(f"{path}{output_data_json_filename}", 'w') as f:
    for line in news_json:
        f.write(f"{line}\n")
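Because the file is written one JSON record per line, it is easy to sanity-check before using it for fine-tuning. The sketch below verifies that every line parses and carries the three expected fields; `check_jsonl_lines` and `REQUIRED_KEYS` are hypothetical names.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}


def check_jsonl_lines(lines):
    """Return (number of valid records, 1-based line numbers of bad lines)."""
    valid = 0
    bad_lines = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad_lines.append(number)
            continue
        if isinstance(record, dict) and REQUIRED_KEYS.issubset(record):
            valid += 1
        else:
            bad_lines.append(number)
    return valid, bad_lines
```

A file handle can be passed directly, for example `check_jsonl_lines(open(f"{path}{output_data_json_filename}"))`, since iterating over a text file yields its lines.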

In [60]:
# Saving as a CSV file
output_news_df.to_csv(f"{path}{output_data_csv_filename}", index = False)