# Bantu Language Classification

Written by Emmanuella Budu and Boago Okgetheng. This notebook details how to construct a simple language classifier using Naive Bayes. The languages are Setswana and Sesotho. The data was sourced from text on the Internet in the form of 'txt' files.

Importing the libraries

In [1]:
import pandas as pd

from string import punctuation

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score


Loading the data. The text files contain data extracted from documents written in the Setswana and Sesotho languages.

In [2]:
#Data
sesotho_data = open("../data/sesotho.txt", errors='ignore').read()
setswana_data = open("../data/setswana.txt", errors='ignore').read()

Preprocess the text: convert it to lowercase, remove numbers and punctuation, tokenize it, and remove the stop words.

In [3]:
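#Preprocessing
#Single-vowel tokens are treated as stop words (this name shadows the unused nltk.corpus.stopwords import)
stopwords=['a','e','i','o','u']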

#tokenizer = RegexpTokenizer(r"\w+")

#convert to lowercase
sesotho_data=sesotho_data.lower()
setswana_data=setswana_data.lower()

#remove numbers
def strip_numbers(s):
    return ''.join(c for c in s if not c.isdigit())

sesotho_data= strip_numbers(sesotho_data)
setswana_data= strip_numbers(setswana_data)

#remove punctuation
def strip_punctuation(s):
    return ''.join(c for c in s if c not in punctuation)

sesotho_data= strip_punctuation(sesotho_data)
setswana_data= strip_punctuation(setswana_data)


#Tokenize text
sesotho_token=word_tokenize(sesotho_data)
setswana_token=word_tokenize(setswana_data)

#remove stop words
sesotho_token = [word for word in sesotho_token if word not in stopwords]
setswana_token = [word for word in setswana_token if word not in stopwords]

print(sesotho_token)

['molimo', 'bopa', 'leholimo', 'le', 'lefatse', 'tsimolohong', 'molimo', 'ne', 'hlole', 'leholimo', 'le', 'lefatse', 'lefatse', 'le', 'ne', 'le', 'se', 'na', 'sebopeho', 'le', 'se', 'na', 'letho', 'lefifi', 'le', 'ne', 'le', 'aparetse', 'boliba', 'moea', 'oa', 'molimo', 'ne', 'okaokela', 'metsi', 'molimo', 're', 'leseli', 'le', 'be', 'teng', 'eaba', 'leseli', 'le', 'ba', 'teng', 'molimo', 'bona', 'hore', 'leseli', 'le', 'letle', 'me', 'molimo', 'le', 'arohanya', 'le', 'lefifi', 'molimo', 're', 'leseli', 'ke', 'motseare', 'me', 're', 'lefifi', 'ke', 'bosiu', 'ha', 'phirima', 'ha', 'esa', 'ea', 'eba', 'letsatsi', 'la', 'pele', 'eaba', 'molimo', 're', 'lapi', 'le', 'be', 'teng', 'pakeng', 'tsa', 'metsi', 'ho', 'arohanya', 'metsi', 'ho', 'metsi', 'mang', 'molimo', 'etsa', 'loapi', 'ho', 'arohanya', 'metsi', 'kaholimo', 'ho', 'lona', 'le', 'metsi', 'katlase', 'ho', 'lona', 'ha', 'fela', 'ha', 'eba', 'joalo', 'molimo', 're', 'loapi', 'ke', 'leholimo', 'ha', 'phirima', 'ha', 'esa', 'ea', 'eba
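
The preprocessing steps above can be sketched as a single function. This is a minimal sketch, not the notebook's own code: the `preprocess` helper, the `STOP_WORDS` name, and the sample string (assembled from tokens in the output above) are illustrative assumptions, and plain `str.split` stands in for NLTK's `word_tokenize` so the example runs without NLTK's tokenizer data.

```python
from string import punctuation

# Same single-vowel stop list as the cell above (hypothetical name)
STOP_WORDS = {'a', 'e', 'i', 'o', 'u'}

def preprocess(text):
    """Lowercase, drop digits and ASCII punctuation, split on whitespace, filter stop words."""
    text = text.lower()
    text = ''.join(c for c in text if not c.isdigit())
    text = ''.join(c for c in text if c not in punctuation)
    tokens = text.split()  # assumption: whitespace split instead of word_tokenize
    return [t for t in tokens if t not in STOP_WORDS]

# Made-up sample string, for illustration only
sample = "Molimo o bopa leholimo, 2 le lefatse!"
tokens = preprocess(sample)
print(tokens)
```

Wrapping the steps in one function means the same cleaning can be applied to both language files, and later to any new text passed to the classifier.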

Text data to Pandas dataframe

In [4]:
#Create a dataframe for Sesotho and Setswana text separately
sesotho_df= pd.DataFrame()

#Insert tokens in the dataframe along with the correct label
sesotho_df['text']=sesotho_token
sesotho_df['label']='sesotho'

setswana_df= pd.DataFrame()
setswana_df['text']=setswana_token
setswana_df['label']='setswana'

# Create a dataframe to store both Sesotho and Setswana text
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
train_df = pd.concat([sesotho_df, setswana_df])

#Shuffle the data frame
train_df = train_df.sample(frac=1).reset_index(drop=True)
train_df.head(50)

Unnamed: 0,text,label
0,rona,sesotho
1,ã,setswana
2,se,sesotho
3,mme,setswana
4,fela,sesotho
5,dilo,setswana
6,ha,sesotho
7,ba,sesotho
8,kgosi,setswana
9,le,setswana


Train and test sets

In [5]:
# Divide the dataset into training and testing sets
X= train_df.text

Y=train_df.label

X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.70, random_state = 0)
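
To make the effect of `train_size=0.70` and `random_state=0` concrete, here is a small sketch on ten toy tokens. The word list and its labels are illustrative assumptions, not the notebook's data. It shows the 70/30 split sizes and that a fixed `random_state` reproduces the same split.

```python
from sklearn.model_selection import train_test_split

# Toy tokens and labels, for illustration only
words = ["molimo", "leholimo", "lefatse", "leseli", "metsi",
         "kgosi", "dilo", "mme", "modimo", "le"]
labels = ["sesotho"] * 5 + ["setswana"] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    words, labels, train_size=0.70, random_state=0)

# Repeating the call with the same random_state gives an identical split
X_tr2, X_te2, _, _ = train_test_split(
    words, labels, train_size=0.70, random_state=0)

print(len(X_tr), len(X_te))
print(X_tr == X_tr2 and X_te == X_te2)
```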



TF-IDF Vectorizer:
1. Converts a collection of raw documents to a matrix of TF-IDF features; the output is a document-term matrix.
2. Equivalent to CountVectorizer followed by TfidfTransformer.
3. We transform the text because the Naive Bayes classifier does not take raw text as input.
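
Point 2 can be checked directly. The sketch below uses a few toy single-word documents (an illustrative assumption, chosen to mirror the one-token-per-row dataframe) and confirms that `TfidfVectorizer` yields the same matrix as `CountVectorizer` followed by `TfidfTransformer`, with default settings throughout.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Toy one-word "documents", for illustration only
docs = ["molimo", "leholimo", "lefatse", "kgosi", "molimo"]

# One step: raw text -> TF-IDF document-term matrix
direct = TfidfVectorizer().fit_transform(docs)

# Two steps: raw text -> counts -> TF-IDF
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

same = np.allclose(direct.toarray(), two_step.toarray())
print(direct.shape, same)
```

With the default `token_pattern`, both vectorizers ignore tokens shorter than two characters, so single-letter tokens never reach the vocabulary.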


In [6]:
#TF-IDF
tf_vect = TfidfVectorizer()

X_train_tf= tf_vect.fit_transform(X_train)
X_test_tf= tf_vect.transform(X_test)

print(tf_vect.vocabulary_)

{'efela': 94, 'leba': 245, 'boammaaruri': 38, 'œo': 573, 'yo': 561, 'khahlang': 221, 'lefatseng': 251, 'le': 243, 'serapeng': 493, 'litaba': 298, 're': 462, 'ntlha': 422, 'rene': 464, 'johane': 199, 'ne': 403, 'bana': 23, 'reng': 465, 'ka': 202, 'morena': 373, 'ho': 169, 'mphile': 393, 'nese': 407, 'motho': 384, 'itse': 184, 'serapa': 492, 'gagoâ': 138, 'se': 479, 'leina': 261, 'nyoloha': 429, 'ke': 212, 'li': 282, 'tla': 515, 'bogolo': 44, 'bophelo': 66, 'mangâ': 324, 'leholimo': 260, 'lefatse': 250, 'œnnyayaâ': 571, 'pula': 457, 'sefate': 482, 'la': 235, 'aparetse': 7, 'mesiaâ': 341, 'nqhekella': 419, 'sehubeng': 484, 'tsa': 525, 'seng': 488, 'joale': 196, 'bona': 56, 'ee': 92, 'moruti': 375, 'holima': 172, 'atang': 14, 'ea': 87, 'phetolelong': 452, 'lefifi': 254, 'latelang': 242, 'kamehla': 207, 'me': 334, 'motseare': 389, 'kaholimo': 205, 'molimo': 362, 'ile': 177, 'ngata': 408, 'bora': 67, 'mo': 345, 'sea': 480, 'lapi': 241, 'phirima': 453, 'rile': 467, 'ba': 15, 'ena': 102, 'hoo'

Classification Model

In [7]:
#Naive Bayes Model
clf= MultinomialNB().fit(X_train_tf, y_train)

#Make predictions using test data
y_pred = clf.predict(X_test_tf)
#y_pred=clf.predict(tf_vect.transform(["modimo"]))
#print(y_pred)

#Calculate the accuracy
print('ACCURACY:', accuracy_score(y_test, y_pred)*100)

ACCURACY: 81.32387706855792
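
The commented-out line in the cell above hints at classifying a single new word. A minimal end-to-end sketch of that idea follows. The six training tokens are a toy stand-in taken from outputs shown earlier, not the real corpus, and the variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training set, for illustration only
train_words = ["molimo", "leholimo", "lefatse", "kgosi", "dilo", "mme"]
train_labels = ["sesotho"] * 3 + ["setswana"] * 3

vect = TfidfVectorizer()
X_toy = vect.fit_transform(train_words)      # fit the vocabulary on training text only
model = MultinomialNB().fit(X_toy, train_labels)

# New text must go through the same fitted vectorizer before prediction
pred = model.predict(vect.transform(["kgosi"]))
print(pred[0])
```

A word absent from the fitted vocabulary becomes an all-zero vector, so the prediction for it falls back on the class priors alone.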
