## Fine-tuning classification example

#### We will fine-tune an ada classifier to distinguish between two sports: baseball and hockey.

In [1]:
!pip install --quiet openai

In [2]:
!pip install --upgrade openai



In [2]:
from sklearn.datasets import fetch_20newsgroups

In [4]:
import numpy as np
import pandas as pd
import openai

In [3]:
categories=['rec.sport.baseball','rec.sport.hockey']
sports_dataset=fetch_20newsgroups(subset='train',shuffle=True,random_state=42,categories=categories)

## Data Exploration
##### The newsgroup dataset can be loaded using sklearn. First we will look at the data itself.

In [4]:
print(sports_dataset['data'][0])

From: dougb@comm.mot.com (Doug Bank)
Subject: Re: Info needed for Cleveland tickets
Reply-To: dougb@ecs.comm.mot.com
Organization: Motorola Land Mobile Products Sector
Distribution: usa
Nntp-Posting-Host: 145.1.146.35
Lines: 17

In article <1993Apr1.234031.4950@leland.Stanford.EDU>, bohnert@leland.Stanford.EDU (matthew bohnert) writes:

|> I'm going to be in Cleveland Thursday, April 15 to Sunday, April 18.
|> Does anybody know if the Tribe will be in town on those dates, and
|> if so, who're they playing and if tickets are available?

The tribe will be in town from April 16 to the 19th.
There are ALWAYS tickets available! (Though they are playing Toronto,
and many Toronto fans make the trip to Cleveland as it is easier to
get tickets in Cleveland than in Toronto.  Either way, I seriously
doubt they will sell out until the end of the season.)

-- 
Doug Bank                       Private Systems Division
dougb@ecs.comm.mot.com          Motorola Communications Sector
dougb@nwu.edu       

In [5]:
sports_dataset['target_names']

['rec.sport.baseball', 'rec.sport.hockey']

In [6]:
sports_dataset.target_names[sports_dataset['target'][0]]

'rec.sport.baseball'

In [7]:
# Target 0 is baseball and target 1 is hockey (see target_names above).
len_all=len(sports_dataset.data)
len_baseball=len([e for e in sports_dataset.target if e==0])
len_hockey=len([e for e in sports_dataset.target if e==1])
print(f'Total examples: {len_all}, Baseball_examples: {len_baseball},Hockey examples: {len_hockey}')

Total examples: 1197, Baseball_examples: 597,Hockey examples: 600
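
The same per-class counts can also be computed with NumPy. This is a minimal sketch, assuming the targets are integer labels as in `sports_dataset.target`. The short `target` array is made-up stand-in data so the snippet runs without downloading the dataset.

```python
import numpy as np

# Hypothetical stand-in for sports_dataset.target (0 = baseball, 1 = hockey).
target = np.array([0, 1, 0, 1, 1, 0, 1])
target_names = ['rec.sport.baseball', 'rec.sport.hockey']

# np.bincount counts how often each non-negative integer label occurs.
counts = np.bincount(target, minlength=len(target_names))

for name, count in zip(target_names, counts):
    print(f'{name}: {count}')
print(f'Total examples: {counts.sum()}')
```

Applied to the real dataset, `np.bincount(sports_dataset.target)` should reproduce the baseball and hockey counts printed above.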


## Data Preparation
We transform the dataset into a pandas DataFrame with a column for the prompt and a column for the completion. The prompt contains the email from the mailing list, and the completion is the name of the sport, either hockey or baseball. For demonstration purposes only, and to speed up fine-tuning, we take only 300 examples. In a real use case, more examples generally give better performance.

In [8]:
import pandas as pd

labels=[sports_dataset.target_names[x].split('.')[-1] for x in sports_dataset['target']]
texts=[text.strip() for text in sports_dataset['data']]
df=pd.DataFrame(zip(texts,labels),columns=['prompt','completion'])
df.head()

| | prompt | completion |
|---|---|---|
| 0 | From: dougb@comm.mot.com (Doug Bank)\nSubject:... | baseball |
| 1 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | hockey |
| 2 | From: rudy@netcom.com (Rudy Wade)\nSubject: Re... | baseball |
| 3 | From: monack@helium.gas.uug.arizona.edu (david... | hockey |
| 4 | Subject: Let it be Known\nFrom: <ISSBTL@BYUVM.... | baseball |
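
The text mentions keeping only 300 examples, while the cell as shown builds the DataFrame from every row. Below is a minimal sketch of how such a subset could be taken, assuming a DataFrame with the same `prompt` and `completion` columns. The toy rows, the helper name `take_subset`, and the use of random sampling are illustrative assumptions, not the notebook's own code.

```python
import pandas as pd


def take_subset(df: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    """Return up to n randomly sampled rows with a fresh 0..k-1 index."""
    n = min(n, len(df))  # never ask for more rows than exist
    return df.sample(n=n, random_state=seed).reset_index(drop=True)


# Toy stand-in for the real dataset (hypothetical rows).
toy = pd.DataFrame({
    'prompt': [f'message {i}' for i in range(10)],
    'completion': ['baseball' if i % 2 == 0 else 'hockey' for i in range(10)],
})

small = take_subset(toy, n=4)
print(small['completion'].value_counts())
```

With the real data this would be called as `take_subset(df, n=300)`. Random sampling is used here rather than slicing the first rows, although the dataset was already shuffled when it was loaded, so either approach would likely give a mixed sample.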


In [9]:
df.to_json('sport2.jsonl',orient='records',lines=True)
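
With `orient='records'` and `lines=True`, each row is written as one JSON object per line (the JSON Lines layout). A small sketch with made-up rows, serializing to a string instead of a file so nothing is written to disk:

```python
import io
import json

import pandas as pd

# Hypothetical rows standing in for the real prompt/completion pairs.
toy = pd.DataFrame({
    'prompt': ['Subject: Cleveland tickets', 'Subject: playoff schedule'],
    'completion': ['baseball', 'hockey'],
})

# With no path given, to_json returns the serialized text.
jsonl_text = toy.to_json(orient='records', lines=True)

# Each non-empty line parses on its own as a JSON object.
records = [json.loads(line) for line in jsonl_text.splitlines() if line]
print(records[0])

# Reading back with lines=True restores the same two columns.
restored = pd.read_json(io.StringIO(jsonl_text), lines=True)
print(restored)
```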


## Data Preparation tool
We can now use a data preparation tool which will suggest a few improvements to our dataset before fine-tuning. Before launching the tool we update the openai library to ensure we're using the latest data