# News Classifier with Cohere

With LLMs, instead of preparing thousands of training data points, you can get up and running with just a handful of examples; this is called *few-shot* classification. That said, you probably want some control over how you train a classifier, and especially over how to get the best performance out of a model. For example, if you do have a large dataset at your disposal, you will want to make full use of it when training a classifier. The Cohere API aims to give developers this flexibility.

In [None]:
#! pip install cohere 

Import library

In [53]:
# Import the required modules
import cohere
import numpy as np
import pandas as pd
import os,sys
sys.path.append('../scr')
import config
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

In [54]:
# Set up the Cohere client
#api_key = 'apikey' # Paste your API key here. Remember to not share it publicly 
api_key = config.cohere_api["api_key"]
co = cohere.Client(api_key)

# Prepare the Dataset

In [55]:
# Load the dataset to a dataframe
df = pd.read_csv('../data/Example_data.csv')
df.head()

Unnamed: 0,Domain,Title,Description,Body,Link,timestamp,Analyst_Average_Score,Analyst_Rank,Reference_Final_Score
0,rassegnastampa.news,Boris Johnson using a taxpayer-funded jet for ...,…often trigger a protest vote that can upset…t...,Boris Johnson using a taxpayer-funded jet for ...,https://rassegnastampa.news/boris-johnson-usin...,2021-09-09T18:17:46.258006,0.0,4,1.96
1,twitter.com,"Stumbled across an interesting case, a woman f...","Stumbled across an interesting case, a woman f...","Stumbled across an interesting case, a woman f...",http://twitter.com/CoruscaKhaya/status/1435585...,2021-09-08T13:02:45.802298,0.0,4,12.0
2,atpe-tchad.info,Marché Résines dans les peintures et revêtemen...,…COVID-19…COVID…COVID…COVID-19 et Post COVID…C...,Le rapport d’étude de marché Résines dans les ...,http://atpe-tchad.info/2021/09/13/marche-resin...,2021-09-13T07:32:46.244403,0.0,4,0.05
3,badbluetech.bitnamiapp.com,"AI drives data analytics surge, study finds",…hate raiders' linked to automated harassment ...,How to drive the funnel through content market...,http://badbluetech.bitnamiapp.com/p.php?sid=21...,2021-09-11T00:17:45.962605,0.0,4,6.1
4,kryptogazette.com,Triacetin Vertrieb Markt 2021: Globale Unterne...,…Abschnitten und Endanwendungen / Organisation...,Global Triacetin Vertrieb-Markt 2021 von Herst...,https://kryptogazette.com/2021/09/08/triacetin...,2021-09-08T12:47:46.078369,0.0,4,0.13


In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Domain                 10 non-null     object 
 1   Title                  10 non-null     object 
 2   Description            10 non-null     object 
 3   Body                   10 non-null     object 
 4   Link                   10 non-null     object 
 5   timestamp              10 non-null     object 
 6   Analyst_Average_Score  10 non-null     float64
 7   Analyst_Rank           10 non-null     int64  
 8   Reference_Final_Score  10 non-null     float64
dtypes: float64(2), int64(1), object(6)
memory usage: 848.0+ bytes


Performing Data Cleaning

In [63]:
# Obtaining Additional Stopwords From nltk
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

In [64]:
import gensim

# Tokenize the text, then drop stopwords (gensim and NLTK lists)
# and tokens with 2 or fewer characters
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 2 and token not in stop_words:
            result.append(token)

    return result

In [65]:
# Import the lemmatization helpers
from nltk.stem import WordNetLemmatizer
from textblob import Word

In [66]:
df['Body']

0    Boris Johnson using a taxpayer-funded jet for ...
1    Stumbled across an interesting case, a woman f...
2    Le rapport d’étude de marché Résines dans le p...
3    How to drive the funnel through content market...
4    Global Triacetin Vertrieb-Markt 2021 von Herst...
5    South African Police Service Office of the Pro...
6    Today is the 7th anniversary [Tragic collapse ...
7    Construction activity grew steadily by 4% in t...
8    - Former Eskom CEO Matshela Moses Koko sought ...
9    Global and Regional Beta-Carotene Market Resea...
Name: Body, dtype: object

In [67]:
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /home/sucess/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [68]:
# Lemmatize each whitespace-separated word of Body, then tokenize it with
# preprocess (lowercases, strips punctuation, drops stopwords and short tokens)
df['Body'] = df['Body'].astype(str).apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df['Body'] = df['Body'].astype(str).apply(preprocess)
# Apply the same two steps to Description
df['Description'] = df['Description'].astype(str).apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df['Description'] = df['Description'].astype(str).apply(preprocess)

In [69]:
# Writing a function to lemmatize words
lmtzr = WordNetLemmatizer()
def lem(text):
    return [lmtzr.lemmatize(word) for word in text]

# Applying the function to each row of the text
# i.e. reducing each word to its lemma
tokens_des = df['Description'].apply(lem)
tokens_body = df['Body'].apply(lem)

In [73]:
tokens_des

0    [trigger, protest, vote, upset, minister, brea...
1    [stumbled, interesting, case, woman, facing, e...
2    [covid, covid, covid, covid, post, covid, covi...
3    [hate, raider, linked, automated, harassment, ...
4    [abschnitten, und, endanwendungen, organisatio...
5    [crime, stamp, road, appear, court, sap, crime...
6    [lagos, nigeria, south, african, killed, build...
7    [additional, spending, building, repair, secur...
8    [lawsuit, public, participation, designed, int...
9    [key, player, dsm, basf, allied, biotech, chr,...
Name: Description, dtype: object

In [72]:
tokens_body

0    [boris, johnson, taxpayer, funded, jet, electi...
1    [stumbled, interesting, case, woman, facing, e...
2    [rapport, étude, marché, résines, dans, peintu...
3    [drive, funnel, content, marketing, link, buil...
4    [global, triacetin, vertrieb, markt, von, hers...
5    [south, african, police, service, office, prov...
6    [today, anniversary, tragic, collapse, buildin...
7    [construction, activity, grew, steadily, secon...
8    [eskom, ceo, matshela, moses, koko, sought, da...
9    [global, regional, beta, carotene, market, res...
Name: Body, dtype: object

In [71]:
# Each Body cell is now a list of tokens, so len gives the token count
lengths = df['Body'].apply(len)
lengths

0    1120
1      23
2     861
3    2316
4     986
5     156
6     152
7     259
8     376
9     489
Name: Body, dtype: int64

In [70]:
# Each Description cell is now a list of tokens, so len gives the token count
lengths = df['Description'].apply(len)
lengths

0    21
1    23
2    28
3    30
4    31
5    26
6    26
7    21
8    25
9    28
Name: Description, dtype: int64

In [99]:
# Coarse label: 'low' for an Analyst_Average_Score below 5, 'high' otherwise
rank__to_10 = df['Analyst_Average_Score'].apply(lambda x: 'low' if x < 5 else 'high')
df['rank_1'] = rank__to_10
df['rank_1']

0    low
1    low
2    low
3    low
4    low
5    low
6    low
7    low
8    low
9    low
Name: rank_1, dtype: object

In [76]:
def handle_sub_class(value):
    if 0 <= value < 1:
        return "low_1"
    elif 1 <= value < 2:
        return "low_2"
    elif 2 <= value < 3:
        return "low_3"
    elif 3 <= value < 4:
        return "low_4"
    elif 4 <= value < 5:
        return "low_5"
    elif 5 <= value < 6:
        return "high_1"
    elif 6 <= value < 7:
        return "high_2"
    elif 7 <= value < 8:
        return "high_3"
    elif 8 <= value < 9:
        return "high_4"
    else:
        return "high_5"

# We will add a new column with the value 'low_1' for rows with an Analyst_Average_Score in [0, 1),
# 'low_2' for [1, 2), and so on until 'low_5' for [4, 5).
# Likewise, the value is 'high_1' for scores in [5, 6), 'high_2' for [6, 7),
# and so on until 'high_5' for scores of 9 and above.

In [100]:
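# rank_2: sub-class of the score itself
df['rank_2'] = df['Analyst_Average_Score'].apply(handle_sub_class)
# rank_3: sub-class of the first decimal digit (character 2 of e.g. "1.33")
df['rank_3'] = df['Analyst_Average_Score'].apply(lambda x: handle_sub_class(int("{:.2f}".format(x)[2])))
# rank_4: sub-class of the second decimal digit (character 3 of e.g. "1.33")
df['rank_4'] = df['Analyst_Average_Score'].apply(lambda x: handle_sub_class(int("{:.2f}".format(x)[3])))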

In [101]:
df['final_score'] = df['rank_1'].map(str)+'__'+df['rank_2'].map(str)+'__'+df['rank_3'].map(str)+'__'+df['rank_4'].map(str)
df['final_score']

0      low__low_1__low_1__low_1
1      low__low_1__low_1__low_1
2      low__low_1__low_1__low_1
3      low__low_1__low_1__low_1
4      low__low_1__low_1__low_1
5      low__low_2__low_4__low_4
6      low__low_1__low_1__low_1
7    low__low_2__high_2__high_2
8      low__low_1__low_4__low_4
9      low__low_1__low_1__low_1
Name: final_score, dtype: object
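
To make the label construction easier to follow, here is a minimal standalone sketch of how one score turns into a four-part `final_score` label. `sub_class` is a compact restatement of `handle_sub_class`, and `compose_label` is a hypothetical helper that does not appear in the notebook. The scores are illustrative values chosen to reproduce the label patterns printed above. Note that the string indexing only works for scores below 10, since character 2 of `"10.00"` is the decimal point.

```python
def sub_class(value):
    """Compact restatement of handle_sub_class (same buckets)."""
    if 0 <= value < 9:
        bucket = int(value)                       # 0..8
        prefix = "low" if bucket < 5 else "high"
        return f"{prefix}_{bucket % 5 + 1}"
    return "high_5"                               # 9 and above (and any negative value)


def compose_label(score):
    """Build the four-part label for a single score (hypothetical helper)."""
    text = "{:.2f}".format(score)                 # e.g. 1.33 -> "1.33"
    rank_1 = "low" if score < 5 else "high"       # coarse label
    rank_2 = sub_class(score)                     # bucket of the score itself
    rank_3 = sub_class(int(text[2]))              # bucket of the first decimal digit
    rank_4 = sub_class(int(text[3]))              # bucket of the second decimal digit
    return "__".join([rank_1, rank_2, rank_3, rank_4])


# Illustrative scores only
for score in (0.0, 0.33, 1.33, 1.66):
    print(score, compose_label(score))
```

Because `rank_3` and `rank_4` reuse the same bucket names for single digits, a digit of 6 maps to `high_2` even when the overall score is low.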

In [81]:
# Split the dataset into training and test portions
# Training = For use in Sections 1, 2 and 3
# Test = For evaluating the classifier performance
# An integer test_size holds out exactly that many rows (here, 2)
X,y = df['Title'], df['final_score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2, random_state=21)

In [82]:
# View the list of all available categories
intents = y_train.unique().tolist()
print(intents)

['low__low_2__high_2__high_2', 'low__low_1__low_1__low_1', 'low__low_2__low_4__low_4', 'low__low_1__low_4__low_4']


# 1 - Few-shot classification with the Classify endpoint

Few-shot here means we only need to supply a few examples per class to get a decent classifier working. With Cohere’s Classify endpoint, the ‘training’ dataset is referred to as *examples*. The minimum number of examples per class is five, where each example consists of a text (in our case, the `Title`) and a label (in our case, the `final_score`).
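
Before sending examples to the endpoint, it can help to check which classes fall short of the per-class minimum stated above. The following is a minimal sketch using only the standard library. `classes_below_minimum` is a hypothetical helper and the labels are toy values, not the notebook's data.

```python
from collections import Counter


def classes_below_minimum(labels, minimum=5):
    """Return {label: count} for every class with fewer than `minimum` examples."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < minimum}


# Toy labels for illustration: six 'low' examples and two 'high' examples
example_labels = ["low"] * 6 + ["high"] * 2
print(classes_below_minimum(example_labels))
```

With the notebook's variables, the same check could be run as `classes_below_minimum(y_train.tolist())`.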

## Prepare the examples

In [84]:
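# Set the number of examples per category (sampling is disabled here)
#EX_PER_CAT = 6

# Create the list of examples containing texts and labels.
# Note: the loop appends the full training set once per class,
# so every training row appears len(intents) times in the examples.
ex_texts, ex_labels = [], []
for intent in intents:
  ex_texts += X_train.tolist()
  ex_labels += y_train.tolist()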

print(f'Number of classes: {len(intents)}')
print(f'Total number of examples: {len(ex_texts)}')

Number of classes: 4
Total number of examples: 32


## Get classifications via the Classify endpoint

In [85]:
# Collate the examples via the Example module
from cohere.classify import Example

examples = list()
for txt, lbl in zip(ex_texts,ex_labels):
  examples.append(Example(txt,lbl))

In [86]:
# Perform classification
def classify_text(text,examples):
  classifications = co.classify(
    model='medium', # model version - medium-22020720
    inputs=[text],
    examples=examples
    )
  return classifications.classifications[0].prediction

In [87]:
# Generate classification predictions on the test dataset (this will take a few minutes)
y_pred = X_test.apply(classify_text, args=(examples,)).tolist()

In [88]:
# Compute metrics on the test dataset
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print(f'Accuracy: {100*accuracy:.2f}')
print(f'F1-score: {100*f1:.2f}')

Accuracy: 100.00
F1-score: 100.00
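
To show what the two metrics measure, here is a minimal pure-Python sketch of accuracy and of the support-weighted F1 that `average='weighted'` requests. The function names and the toy labels are assumptions for illustration, and this is a simplified restatement, not scikit-learn's implementation.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n / total
    return score


# Toy labels for illustration only
y_true_toy = ["a", "a", "b", "b"]
y_pred_toy = ["a", "b", "b", "b"]
print(f"Accuracy: {100 * accuracy(y_true_toy, y_pred_toy):.2f}")
print(f"F1-score: {100 * weighted_f1(y_true_toy, y_pred_toy):.2f}")
```

With a test set of two rows, as in this notebook, accuracy can only take the values 0, 50, or 100.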


# 2 - Build your own classifier with the Embed endpoint

In this section, we’ll use the Embed endpoint to build a classifier: we generate an embedding for each title and train a classification model with these embeddings as inputs. For this, we’ll use the Support Vector Machine (SVM) algorithm.

## Generate embeddings for the input text

In [89]:
# Get embeddings for a list of texts
def embed_text(texts):
  output = co.embed(
                model='medium', # model version - medium-22020720
                texts=texts)
  return output.embeddings

# Embed and prepare the inputs
X_train_emb = np.array(embed_text(X_train.tolist()))
X_test_emb = np.array(embed_text(X_test.tolist()))

## Get classifications via the SVM algorithm

In [90]:
# Import modules
from sklearn.svm import SVC
from sklearn import preprocessing

# Prepare the labels
# Fit on all labels so that a test label absent from the training split can still be encoded
le = preprocessing.LabelEncoder()
le.fit(y)
y_train_le = le.transform(y_train)
y_test_le = le.transform(y_test)

# Initialize the model
svm_classifier = SVC(class_weight='balanced')

# Fit the training dataset to the model
svm_classifier.fit(X_train_emb, y_train_le)

In [96]:
# Generate classification predictions on the test dataset
y_pred_le = svm_classifier.predict(X_test_emb)
y_pred_le

array([0, 0])

In [92]:
# Compute metrics on the test dataset
accuracy = accuracy_score(y_test_le, y_pred_le)
f1 = f1_score(y_test_le, y_pred_le, average='weighted')

print(f'Accuracy: {100*accuracy:.2f}')
print(f'F1-score: {100*f1:.2f}')

Accuracy: 100.00
F1-score: 100.00


# 3 - Finetuning a model

In this section, we build a custom model that’s finetuned to excel at a specific task and can potentially outperform the previous two approaches.

## Prepare dataset

In [93]:
# Save the training dataset (title and label columns) to a CSV file for finetuning
df_train = pd.concat([X_train, y_train],axis=1)
df_train.to_csv("news_finetune.csv", index=False)

## Create a finetuned model

The finetuned model is created in the Playground. Refer to [this guide](https://docs.cohere.ai/finetuning-representation-models) for the finetuning steps.

## Get classifications via the Classify endpoint

In [97]:
# Perform classification using the finetuned model
def classify_text_finetune(text):
  classifications = co.classify(
    model='eeba7d8c-61bd-42cd-a6b5-e31db27403cc-ft', # replace with your own finetune model ID 
    inputs=text
    )
  return classifications.classifications

In [None]:
# Generate classification predictions on the test dataset (this will take a few minutes)
y_pred_raw = classify_text_finetune(X_test.tolist())
y_pred = [y.prediction for y in y_pred_raw]

In [None]:
# Compute metrics on the test dataset
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print(f'Accuracy: {100*accuracy:.2f}')
print(f'F1-score: {100*f1:.2f}')

Accuracy: 94.50
F1-score: 94.53


We have now seen how the different options compare performance-wise. Crucially, note the level of control that you have when working with the Classify endpoint.