# Text Classification Using Embeddings on Luganda Topic Classification Dataset
This notebook shows how to build a classifier using Cohere's embeddings.
<img src="https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/simple-classifier-embeddings.png"
style="width:100%; max-width:600px"
alt="first we embed the text in the dataset, then we use that to train a classifier"/>

The example classification task here is topic classification of Luganda sentences. We'll train a simple classifier to predict which of nine topics (for example Agriculture, Business, or Health) a sentence belongs to.

We'll go through the following steps:

1. Get the dataset
2. Get the embeddings of the sentences (for both the training set and the test set)
3. Train a classifier using the training set
4. Evaluate the performance of the classifier on the testing set

In [63]:
# Let's first install Cohere's python SDK
# !pip install cohere

## 1. Get the dataset

In [87]:
import pandas as pd
import cohere
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
pd.set_option('display.max_colwidth', None)

# Load the Luganda topic classification training set
df = pd.read_csv('/content/train.csv')

In [88]:
# Let's glance at the dataset
df.head()

| Unnamed: 0 | English | Luganda | Topic |
|---|---|---|---|
| 0 | Refugees have started practicing farming so as to earn a living. | Abanoonyiboobubudamu batandise okulima okusobola okwebeezaawo. | Agriculture |
| 1 | Men should start up savings groups. | Abasajja bateekeddwa okutandika ebibiina ebitereka ensimbi. | Business |
| 2 | The savings group is composed of people of different cultures. | Ekibiina ekiterekebwamu ensimbi kibeeramu abantu ab'obuwangwa obw'enjawulo. | Business |
| 3 | His body has been taken for postmortem. | Omulambo gwe gwatwaliddwa okwekebejebwa. | Health |
| 4 | The headteacher told us to be careful while registering for exams. | Omukulu w'essomero yatugambye tubeere begendereza nga twewandiisa okukola ebibuuzo. | Education |


In [89]:
# df =df.loc[:,~df.columns.str.contains('^Unnamed')]

In [90]:
# df.head()
label_encoder = LabelEncoder()
df['Target'] = label_encoder.fit_transform(df['Topic'])


In [68]:
df.head()

| Unnamed: 0 | English | Luganda | Topic | Target |
|---|---|---|---|---|
| 0 | Refugees have started practicing farming so as to earn a living. | Abanoonyiboobubudamu batandise okulima okusobola okwebeezaawo. | Agriculture | 0 |
| 1 | Men should start up savings groups. | Abasajja bateekeddwa okutandika ebibiina ebitereka ensimbi. | Business | 1 |
| 2 | The savings group is composed of people of different cultures. | Ekibiina ekiterekebwamu ensimbi kibeeramu abantu ab'obuwangwa obw'enjawulo. | Business | 1 |
| 3 | His body has been taken for postmortem. | Omulambo gwe gwatwaliddwa okwekebejebwa. | Health | 8 |
| 4 | The headteacher told us to be careful while registering for exams. | Omukulu w'essomero yatugambye tubeere begendereza nga twewandiisa okukola ebibuuzo. | Education | 4 |


In [77]:
df.Topic.unique()

array(['Agriculture', 'Business', 'Health', 'Education', 'Environment',
       'Gender Based Voilence', 'Culture', 'Entertainment', 'Covid'],
      dtype=object)
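
`LabelEncoder` assigns integer codes in sorted order of the class names, which is why `Health` became 8 and `Education` became 4 in the table above. The sketch below rebuilds that mapping from the topic names printed by `df.Topic.unique()` and shows how predicted codes can be turned back into names. The variable names `topics`, `encoder`, `topic_to_code`, and `decoded` are illustrative and not part of the original notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Topic names copied from the output above (spelling kept as in the dataset)
topics = ['Agriculture', 'Business', 'Health', 'Education', 'Environment',
          'Gender Based Voilence', 'Culture', 'Entertainment', 'Covid']

encoder = LabelEncoder()
encoder.fit(topics)

# classes_ is sorted, so the position of each name is its integer code
topic_to_code = {name: code for code, name in enumerate(encoder.classes_)}
print(topic_to_code)

# inverse_transform maps integer codes back to readable topic names
decoded = list(encoder.inverse_transform([0, 4, 8]))
print(decoded)
```

In the notebook itself, `label_encoder.inverse_transform(...)` can be applied to the classifier's predictions in the same way.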

We'll only use a subset of the dataset in this example: 500 sentences, since this is a toy example. You'll want to increase the number to get better performance and a more reliable evaluation.

In [69]:
num_examples = 500
df_sample = df.sample(num_examples)

# Split into training and testing sets
sentences_train, sentences_test, labels_train, labels_test = train_test_split(
            list(df_sample['Luganda']), list(df_sample['Target']), test_size=0.25, random_state=0)
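
With nine topics and a small sample, a plain random split may leave some topics under-represented in the test set. One option, shown here as a sketch on made-up data rather than the notebook's actual split, is to pass `stratify=` to `train_test_split` so each topic keeps roughly the same share in both sets. The toy `sentences` and `labels` below are assumptions for illustration only. Passing `random_state` to `df.sample` as well would make the 500-row subset reproducible.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in data: 3 hypothetical topic codes with uneven frequencies
sentences = [f"sentence {i}" for i in range(100)]
labels = [0] * 60 + [1] * 30 + [2] * 10

# stratify=labels keeps the class proportions similar in both splits
train_x, test_x, train_y, test_y = train_test_split(
    sentences, labels, test_size=0.25, random_state=0, stratify=labels)

print("train:", Counter(train_y))
print("test: ", Counter(test_y))
```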

## 2. Get the embeddings of the sentences
We're now ready to retrieve the embeddings from the API. You'll need your API key for this next cell. [Sign up to Cohere](https://os.cohere.ai/) and get one if you haven't yet.

In [70]:
# ADD YOUR API KEY HERE (never commit a real key to a shared notebook)
api_key = "YOUR_API_KEY"

# Create the Cohere client; API keys are created and retrieved at os.cohere.ai
co = cohere.Client(api_key)
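
To keep the key out of the notebook entirely, it can be read from an environment variable instead. This is a minimal sketch: the helper `load_api_key` and the variable name `COHERE_API_KEY` are assumptions chosen for illustration, not something the notebook or the SDK requires.

```python
import os

def load_api_key(env_var="COHERE_API_KEY"):
    """Return the key stored in the given environment variable."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key

# Usage, once the variable is set in your shell or Colab session:
# co = cohere.Client(load_api_key())
```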

In [71]:
# Embed the training set
embeddings_train = co.embed(texts=sentences_train,
                             model="large",
                             truncate="RIGHT").embeddings
# Embed the testing set
embeddings_test = co.embed(texts=sentences_test,
                             model="large",
                             truncate="RIGHT").embeddings

We now have two sets of embeddings: `embeddings_train` contains the embeddings of the training sentences, while `embeddings_test` contains the embeddings of the testing sentences.
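
If you increase `num_examples` substantially, it may help to send the sentences in smaller batches rather than in one request. The helper below is a sketch under that assumption: `embed_in_batches` and the default `batch_size` of 64 are hypothetical choices, and `fake_embed` is a stand-in so the example runs without an API key.

```python
def embed_in_batches(texts, embed_fn, batch_size=64):
    """Call embed_fn on consecutive slices of texts and join the results."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings

# Stand-in embedding function: one 2-value vector per text (not a real model)
def fake_embed(batch):
    return [[float(len(text)), 1.0] for text in batch]

vectors = embed_in_batches([f"text {i}" for i in range(150)], fake_embed)
print(len(vectors), vectors[0])
```

With the real client, `embed_fn` could be `lambda batch: co.embed(texts=batch, model="large", truncate="RIGHT").embeddings`, mirroring the call used above.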

Curious what an embedding looks like? We can print the first ten values of one:

In [72]:
print(f"Sentence text: {sentences_train[0]}")
print(f"Embedding vector: {embeddings_train[0][:10]}")

Sentence text: Obukiiko obudduukirize obw'enjawulo buweereddwa obuyambi okuyambako mu kulwanyisa akawuka ka kolona.
Embedding vector: [1.5654297, 1.4248047, 1.2636719, 0.6347656, 1.1640625, -1.4335938, 0.5439453, -1.4785156, 1.5078125, -0.56591797]


## 3. Train a classifier using the training set
Now that we have the embeddings, we can train our classifier. We'll use an SVM from sklearn and tune its hyperparameters with a grid search.

In [102]:
# import SVM classifier code
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


# Initialize a support vector machine. class_weight='balanced' weights each
# topic inversely to its frequency, which helps when some topics have fewer
# training sentences than others. Features are standardized first.
svm_classifier = make_pipeline(StandardScaler(), SVC(class_weight='balanced', probability=True))

# Define the parameter grid for GridSearchCV
param_grid = {
    'svc__C': [0.1, 1, 10],  # Adjust the range of C values
    'svc__kernel': ['linear', 'rbf', 'poly'],  # Experiment with different kernels
}

# Create GridSearchCV instance
grid_search = GridSearchCV(estimator=svm_classifier, param_grid=param_grid, cv=10, n_jobs=-1)

# Fit GridSearchCV to the data
grid_search.fit(embeddings_train, labels_train)

# Get the best SVM model from the grid search
best_svm_classifier = grid_search.best_estimator_

# Print the best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)

# No extra fit call is needed: GridSearchCV uses refit=True by default, so
# best_estimator_ has already been retrained on the full training set



Best Hyperparameters: {'svc__C': 10, 'svc__kernel': 'rbf'}
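
Beyond the winning parameters, the fitted search object also exposes the mean cross-validated accuracy of the best setting. The sketch below runs the same kind of pipeline and grid on synthetic data from `make_classification` (an assumption, so it runs without the Luganda embeddings) and reads `best_score_` and `best_estimator_`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the embedding vectors
X, y = make_classification(n_samples=120, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

pipeline = make_pipeline(StandardScaler(), SVC(class_weight='balanced'))
param_grid = {'svc__C': [0.1, 1, 10], 'svc__kernel': ['linear', 'rbf']}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy of best setting:", round(search.best_score_, 3))

# refit=True is the default, so best_estimator_ is already fitted on all of X
predictions = search.best_estimator_.predict(X[:5])
print(predictions)
```

Comparing `best_score_` with the test accuracy gives a rough check on whether the held-out result is in line with cross-validation.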


## 4. Evaluate the performance of the classifier on the testing set

In [103]:
# Get the accuracy on the held-out test set and print it
score = best_svm_classifier.score(embeddings_test, labels_test)
print(f"Test accuracy with the large model is {100*score}%!")

Test accuracy with the large model is 45.6%!
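
A single accuracy number is hard to read for a nine-class problem: if the topics were evenly represented, random guessing would score about 11%. Two checks that may help are a majority-class baseline and a per-class report. The sketch below uses synthetic data and a plain `SVC` as assumptions, so the numbers it prints say nothing about the Luganda dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 3-class data standing in for embeddings and topic codes
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = SVC().fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)
print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Model accuracy:    {model_acc:.3f}")

# Precision, recall, and F1 for each class
report = classification_report(y_test, model.predict(X_test), zero_division=0)
print(report)
```

In the notebook, the same calls could take `embeddings_train`, `labels_train`, `embeddings_test`, and `labels_test`, with `target_names=label_encoder.classes_` passed to `classification_report` to show topic names.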


In [97]:
# Inspect one test sentence and the first ten values of its embedding
print(f"Sentence text: {sentences_test[0]}")
print(f"Embedding vector: {embeddings_test[0][:10]}")

Sentence text: Abantu beewuunyizza obubonero bwange obulungi.
Embedding vector: [0.7753906, 1.5058594, 2.6152344, 1.2011719, 0.017822266, -0.30981445, 0.6459961, -1.3242188, 0.74121094, -0.30517578]


This was a small-scale example, meant as a proof of concept and designed to illustrate how you can quickly build a custom classifier using a small amount of labelled data and Cohere's embeddings. Increase the number of training examples to achieve better performance on this task.