
# 1. Introduction

## 1.1 Overview

South Africa is a multicultural society that is characterised by its rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and also contribute to the social, cultural, intellectual, economic and political life of the South African society.

The country is multilingual with 11 official languages, each of which is guaranteed equal status. Most South Africans are multilingual and able to speak two or more of the official languages.

## 1.2 Problem Statement

With such a multilingual population, it follows that our systems and devices should also communicate in multiple languages.

In this challenge, the task is to take text written in any of South Africa's 11 official languages and identify which language it is in. This is an example of language identification in NLP: the task of determining the natural language that a piece of text is written in.




## 1.3 Data Description

### 1.3.1 Data

The dataset used for this challenge is the NCHLT Text Corpora collected by the South African Department of Arts and Culture & Centre for Text Technology (CTexT, North-West University, South Africa). The training set was improved through additional cleaning done by Praekelt.

The data is in the form Language ID, Text. The text is in various states of cleanliness. Some NLP techniques will be necessary to clean up the data.

### 1.3.2 File Descriptions

train_set.csv - the training set

test_set.csv - the test set

sample_submission.csv - a sample submission file in the correct format

### 1.3.3 Language IDs
afr - Afrikaans
eng - English
nbl - isiNdebele
nso - Sepedi
sot - Sesotho
ssw - siSwati
tsn - Setswana
tso - Xitsonga
ven - Tshivenda
xho - isiXhosa
zul - isiZulu

# 2. Download, Import Packages and Loading the Data 

### 2.1 Download and install external libraries/packages


In [20]:
# Standard libraries
import re
import csv
import nltk
import spacy
import string
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Preprocessing
import en_core_web_sm
from collections import Counter
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords, wordnet  
from sklearn.feature_extraction.text import CountVectorizer   
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.model_selection import train_test_split, RandomizedSearchCV
nlp = spacy.load('en_core_web_sm')
from spacy.lang.en.stop_words import STOP_WORDS

# Building classification models
from sklearn.svm import SVR
from sklearn import preprocessing
from sklearn import utils
from sklearn import metrics
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Model evaluation
from sklearn.metrics import classification_report, confusion_matrix, f1_score, precision_score, recall_score

# Downloads
#nlp = spacy.load('en')
#nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords') 



[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### 2.2 Import Data

In [21]:
train= pd.read_csv('train_set.csv')
test= pd.read_csv('test_set.csv')
sample= pd.read_csv('sample_submission.csv')

### 2.3 Review Dataset

In [22]:
print('Shape of Train Dataset:', train.shape)

print('Shape of Test Dataset:', test.shape)

display(train.head())


display(test.head())

Shape of Train Dataset: (33000, 2)
Shape of Test Dataset: (5682, 2)


Unnamed: 0,lang_id,text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...
2,eng,the province of kwazulu-natal department of tr...
3,nso,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...


Unnamed: 0,index,text
0,1,"Mmasepala, fa maemo a a kgethegileng a letlele..."
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye...
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.
3,4,Kube inja nelikati betingevakala kutsi titsini...
4,5,Winste op buitelandse valuta.


In [23]:
sample.head()

Unnamed: 0,index,lang_id
0,1,tsn
1,2,nbl


In [24]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   lang_id  33000 non-null  object
 1   text     33000 non-null  object
dtypes: object(2)
memory usage: 515.8+ KB


The training dataset has 33000 entries in total and no missing values.

In [25]:
train.lang_id.value_counts()

xho    3000
eng    3000
nso    3000
ven    3000
tsn    3000
nbl    3000
zul    3000
ssw    3000
tso    3000
sot    3000
afr    3000
Name: lang_id, dtype: int64

In [26]:
train.lang_id.nunique()

11

The data is balanced, with 3000 observations in each of the 11 classes.

# 3. Data Cleaning

In [27]:
test.isnull().sum()

index    0
text     0
dtype: int64

In [28]:
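def preprocessing(text):
    """Lowercase the text and strip punctuation and digits."""
    # Lowercase every word in the sentence
    text = text.lower()

    # Remove punctuation, then digits. Only ASCII a-z and whitespace
    # survive, so accented letters such as 'š' are dropped as well.
    text = re.sub(r'[^a-z0-9\s]', '', text)
    message = re.sub(r'[0-9]+', '', text)
    return message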
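To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same rule to a few strings. The name `clean_text` is hypothetical and only mirrors the behaviour of `preprocessing` above in a single regular expression. The sample strings are shortened from the rows shown earlier, apart from the last one, which is made up for illustration.

```python
import re


def clean_text(text):
    """Lowercase, then keep only ASCII letters a-z and whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())


samples = [
    "Winste op buitelandse valuta.",
    "o netefatša gore o ba file dilo",
    "Section 21(3) applies.",  # made-up string with digits and brackets
]

cleaned = [clean_text(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

The second sample shows that `š` is removed along with punctuation, so `netefatša` becomes `netefata`. The third shows that removed characters can leave a double space behind.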

In [29]:
df = train.copy()

In [30]:
X = df["text"].apply(preprocessing)
y = df["lang_id"]
print(X.shape)
print(y.shape)

(33000,)
(33000,)


In [31]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size = 0.2, random_state=42)

In [32]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
X1 = count_vector.fit_transform(X_train)

In [33]:
X1.shape

(26400, 126260)

In [34]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_dtm = tfidf_transformer.fit_transform(X1)

In [35]:
X_train_dtm.shape

(26400, 126260)
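The count and TF-IDF steps above can also be chained with a classifier in a scikit-learn `Pipeline`, which applies the same fitted transforms at both training and prediction time. The following is a minimal sketch on toy sentences, not the notebook's actual workflow. The names `toy_texts`, `toy_labels` and `text_clf` are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for X_train / y_train; not taken from the competition data.
toy_texts = [
    "the cat sat on the mat",
    "the dog ran in the park",
    "die kat sit op die mat",
    "die hond hardloop in die park",
]
toy_labels = ["eng", "eng", "afr", "afr"]

# Each step is fitted on the training text only; predict() reuses the
# fitted vocabulary and IDF weights before calling the classifier.
text_clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(toy_texts, toy_labels)

toy_pred = text_clf.predict(["the cat ran", "die hond sit"])
print(toy_pred)
```

With a pipeline, raw strings go in at both `fit` and `predict`, so there is no separate matrix to keep in sync by hand.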

# 3. Model Predictions

Model 1: Logistic Regression

In [36]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(multi_class='ovr',solver = 'liblinear')

In [37]:
logreg.fit(X_train_dtm, y_train)

LogisticRegression(multi_class='ovr', solver='liblinear')

In [38]:
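# Vectorise the held-out split with the fitted CountVectorizer.
# Note: tfidf_transformer is not applied here, so the models below
# predict on raw counts although they were trained on TF-IDF features.
print(X_test.shape)
X_test_c = count_vector.transform(X_test)
print(f'Test data after transforming data {X_test_c.shape}')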

(6600,)
Test data after transforming data (6600, 126260)


Prediction

In [39]:
y_pred_logreg = logreg.predict(X_test_c)

In [40]:
#metrics
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_logreg))
print(metrics.confusion_matrix(y_test, y_pred_logreg))

0.9839393939393939
[[581   1   0   0   0   1   0   0   0   0   0]
 [  0 615   0   0   0   0   0   0   0   0   0]
 [  1   1 555   2   3   0   0   3   0   4  14]
 [  0   0   0 619   1   0   5   0   0   0   0]
 [  0   0   0   1 617   0   0   0   0   0   0]
 [  0   3   0   0   0 567   0   2   2   0  10]
 [  1   0   0   8   2   0 587   0   0   0   0]
 [  0   0   0   0   0   0   0 561   0   0   0]
 [  0   0   0   0   0   0   0   0 634   0   0]
 [  0   0   1   2   2   0   0   3   3 593   5]
 [  0   2   4   4   5   1   0   2   0   7 565]]
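The confusion matrix above is printed without labels. When no `labels` argument is given, scikit-learn orders rows (true classes) and columns (predicted classes) by sorted label, which for this dataset is afr, eng, nbl, nso, sot, ssw, tsn, tso, ven, xho, zul. A small sketch of how the matrix could be labelled for easier reading, using toy labels rather than the notebook's predictions:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy true/predicted labels for illustration only.
y_true_toy = ["afr", "eng", "zul", "zul", "eng", "afr"]
y_pred_toy = ["afr", "eng", "zul", "eng", "eng", "afr"]

# Pass the label order explicitly so rows and columns can be named.
labels = sorted(set(y_true_toy))
cm = pd.DataFrame(
    confusion_matrix(y_true_toy, y_pred_toy, labels=labels),
    index=labels,    # rows: true language
    columns=labels,  # columns: predicted language
)
print(cm)
```

Here one `zul` sentence is predicted as `eng`, which appears in row `zul`, column `eng`.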


Model 2: Multinomial Naive Bayes

In [41]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [42]:
nb.fit(X_train_dtm, y_train)

MultinomialNB()

In [43]:
y_pred_nb = nb.predict(X_test_c)
print(metrics.accuracy_score(y_test, y_pred_nb))
print(metrics.confusion_matrix(y_test, y_pred_nb))

0.9972727272727273
[[583   0   0   0   0   0   0   0   0   0   0]
 [  0 615   0   0   0   0   0   0   0   0   0]
 [  0   2 580   0   0   0   0   0   0   0   1]
 [  0   0   0 623   1   0   1   0   0   0   0]
 [  0   0   0   0 618   0   0   0   0   0   0]
 [  0   1   0   0   0 581   0   1   0   0   1]
 [  1   0   0   0   0   0 597   0   0   0   0]
 [  0   0   0   0   0   0   0 561   0   0   0]
 [  0   0   0   0   0   0   0   0 634   0   0]
 [  0   0   1   0   0   0   0   1   0 606   1]
 [  0   1   3   1   0   0   0   0   0   1 584]]


Model 3: Decision Tree Classifier

In [44]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=42)

In [45]:
tree.fit(X_train_dtm, y_train)

DecisionTreeClassifier(random_state=42)

In [46]:
y_pred_tree = tree.predict(X_test_c)
print(metrics.accuracy_score(y_test, y_pred_tree))
print(metrics.confusion_matrix(y_test, y_pred_tree))

0.9427272727272727
[[580   0   0   0   0   3   0   0   0   0   0]
 [  1 613   0   0   0   0   0   0   0   0   1]
 [  1   1 504   0  14  18   0   2   0  11  32]
 [  0   0   1 602   3   0  15   4   0   0   0]
 [  0   0   1   2 610   0   5   0   0   0   0]
 [  0   2   9   0   1 533   0   1   0   4  34]
 [  1   1   1   3   5   0 586   1   0   0   0]
 [  1   0   0   1   0   3   1 553   1   1   0]
 [  0   0   0   0   2   0   0   5 627   0   0]
 [  0   3  26   0  12  22   1   2   0 517  26]
 [  0   1  24   0   5  51   0   1   0  11 497]]


Model 4: Random Forest Classifier

In [47]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100, random_state=42)

In [48]:
forest.fit(X_train_dtm, y_train)

RandomForestClassifier(random_state=42)

In [49]:
y_pred_forest = forest.predict(X_test_c)
print(metrics.accuracy_score(y_test, y_pred_forest))
print(metrics.confusion_matrix(y_test, y_pred_forest))

0.9845454545454545
[[582   0   0   0   0   0   0   0   0   0   1]
 [  0 615   0   0   0   0   0   0   0   0   0]
 [  1   1 554   0   0   1   0   0   0   5  21]
 [  0   0   0 622   1   0   2   0   0   0   0]
 [  0   0   0   0 618   0   0   0   0   0   0]
 [  0   1   2   0   0 562   0   0   0   0  19]
 [  1   0   0   0   2   0 595   0   0   0   0]
 [  0   0   0   0   0   0   0 561   0   0   0]
 [  0   0   0   0   0   0   0   0 634   0   0]
 [  0   0   5   0   0   3   0   0   0 586  15]
 [  0   1   8   0   0   4   0   0   0   8 569]]


Model 5: Support Vector Machine

In [50]:
# Fit a linear SVM and predict on the held-out split
from sklearn.svm import LinearSVC
lsvc_1 = LinearSVC()
lsvc_1.fit(X_train_dtm, y_train)
pred_lsvc_1 = lsvc_1.predict(X_test_c)
print(metrics.accuracy_score(y_test,pred_lsvc_1))
print(metrics.confusion_matrix(y_test, pred_lsvc_1))

0.9934848484848485
[[582   1   0   0   0   0   0   0   0   0   0]
 [  0 615   0   0   0   0   0   0   0   0   0]
 [  1   1 570   0   0   0   0   1   0   1   9]
 [  0   0   0 623   1   0   1   0   0   0   0]
 [  0   0   0   1 617   0   0   0   0   0   0]
 [  0   1   0   0   0 577   0   0   1   0   5]
 [  1   0   0   1   1   0 595   0   0   0   0]
 [  0   0   0   0   0   0   0 561   0   0   0]
 [  0   0   0   0   0   0   0   0 634   0   0]
 [  0   0   1   1   0   0   0   0   1 601   5]
 [  0   1   2   1   1   0   0   0   0   3 582]]


# 4. Tuning Models

Hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process.

The same kind of machine learning model can require different constraints, weights or learning rates to generalize different data patterns. These measures are called hyperparameters, and have to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given independent data.
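As an illustration of such a search, the sketch below runs `GridSearchCV` over a small grid on toy data. It is not the tuning performed in this notebook: the grid values, the toy sentences and the names `pipe`, `param_grid` and `search` are assumptions. `TfidfVectorizer` is used as a shorthand for `CountVectorizer` followed by `TfidfTransformer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data for illustration only; four sentences per class.
toy_texts = [
    "the cat sat on the mat", "the dog ran in the park",
    "she reads the book", "they walk to the shop",
    "die kat sit op die mat", "die hond hardloop in die park",
    "sy lees die boek", "hulle loop na die winkel",
]
toy_labels = ["eng"] * 4 + ["afr"] * 4

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])

# Hypothetical grid: step name, double underscore, parameter name.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.01, 0.1, 1.0],
}

# Every combination is scored with 2-fold cross-validation.
search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

On the real data, `X_train` and `y_train` would take the place of the toy lists, and a larger `cv` value would give more stable scores.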

In [51]:
count_vector_2 = CountVectorizer(ngram_range =(1,2))
X_train_c2= count_vector_2.fit_transform(X_train)

In [52]:
tfidf2 =  TfidfTransformer()
X_train_dtm2 = tfidf2.fit_transform(X_train_c2)

In [53]:
nb_ngrams = MultinomialNB(alpha =0)

In [54]:
nb_ngrams.fit(X_train_dtm2, y_train)


MultinomialNB(alpha=0)

In [55]:
X_test_c2=count_vector_2.transform(X_test)

In [56]:
y_pred_nb_grams = nb_ngrams.predict(X_test_c2)
print(metrics.accuracy_score(y_test,y_pred_nb_grams))
print(metrics.confusion_matrix(y_test, y_pred_nb_grams))

0.998030303030303
[[583   0   0   0   0   0   0   0   0   0   0]
 [  0 615   0   0   0   0   0   0   0   0   0]
 [  0   0 583   0   0   0   0   0   0   0   0]
 [  0   0   0 623   2   0   0   0   0   0   0]
 [  0   0   0   0 618   0   0   0   0   0   0]
 [  0   0   0   0   0 583   1   0   0   0   0]
 [  1   0   0   0   1   0 596   0   0   0   0]
 [  0   0   0   0   0   0   0 561   0   0   0]
 [  0   0   0   0   0   0   0   0 634   0   0]
 [  0   0   1   0   0   0   0   0   0 607   1]
 [  0   1   3   0   0   0   0   0   0   2 584]]
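The change from the first Naive Bayes model is the `ngram_range=(1, 2)` setting on the vectorizer, which adds word pairs to the vocabulary alongside single words. A minimal sketch of what that does to one short sentence (the variable names here are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["winste op buitelandse valuta"]

# Default: single words only.
unigrams = CountVectorizer().fit(doc)
# Single words plus pairs of adjacent words.
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(doc)

uni_features = sorted(unigrams.vocabulary_)
uni_bi_features = sorted(uni_bi.vocabulary_)

print(uni_features)
# ['buitelandse', 'op', 'valuta', 'winste']
print(uni_bi_features)
# ['buitelandse', 'buitelandse valuta', 'op', 'op buitelandse',
#  'valuta', 'winste', 'winste op']
```

The four-word sentence yields four unigram features and three additional bigram features, which is why the bigram vocabulary on the full training set is much larger than the unigram one.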


In [57]:
from sklearn.model_selection import cross_val_score

In [58]:
from sklearn.model_selection import GridSearchCV

# 5. Submission

Submission guide:

Submission 1 = LogisticRegression

Submission 2 = MultinomialNB

Submission 3 = MultinomialNB (CountVectorizer with unigrams and bigrams)

Submission 4 = LinearSVC (Support Vector Machine)

In [59]:
test = pd.read_csv('test_set.csv')
test.head(20)

Unnamed: 0,index,text
0,1,"Mmasepala, fa maemo a a kgethegileng a letlele..."
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye...
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.
3,4,Kube inja nelikati betingevakala kutsi titsini...
4,5,Winste op buitelandse valuta.
5,6,"Ke feela dilense tše hlakilego, tša pono e tee..."
6,7,<fn>(762010101403 AM) 1495 Final Gems Birthing...
7,8,Ntjhafatso ya konteraka ya mosebetsi: Etsa bon...
8,9,u-GEMS uhlinzeka ngezinzuzo zemithi yezifo ezi...
9,10,"So, on occasion, are statistics misused."


In [60]:
test.shape

(5682, 2)

In [61]:
testdf = test.copy()

In [62]:
testdf['text'] = test['text'].apply(preprocessing)
testdf.head()

Unnamed: 0,index,text
0,1,mmasepala fa maemo a a kgethegileng a letlelel...
1,2,uzakwaziswa ngokufaneleko nakungafuneka eminye...
2,3,tshivhumbeo tshi fana na ngano dza vhathu
3,4,kube inja nelikati betingevakala kutsi titsini...
4,5,winste op buitelandse valuta


In [63]:
x_test = count_vector.transform(testdf['text'].values.astype(str))

In [64]:
x_test.shape

(5682, 126260)

In [65]:
x_test2 = count_vector_2.transform(testdf['text'].values.astype(str))

Submission 1

In [66]:
y_pred_test1 = logreg.predict(x_test)

In [67]:
textid = testdf['index']

In [68]:
submission_logreg = pd.DataFrame(
    {'index': textid,
     'lang_id': y_pred_test1
    })

In [69]:
submission_logreg.head(20)

Unnamed: 0,index,lang_id
0,1,tsn
1,2,nbl
2,3,ven
3,4,ssw
4,5,afr
5,6,nso
6,7,xho
7,8,sot
8,9,zul
9,10,eng


In [70]:
submission_logreg.to_csv("logreg_predictions.csv",encoding = 'utf-8', index = False)

Submission 2

In [71]:
x_test_dtm = tfidf_transformer.transform(x_test) 

In [72]:
# Note: predictions use the raw counts in x_test; x_test_dtm from the previous cell is unused.
y_pred_test2 = nb.predict(x_test)

In [73]:
submission_nb = pd.DataFrame(
    {'index': textid,
     'lang_id': y_pred_test2
    })
submission_nb.head(20)

Unnamed: 0,index,lang_id
0,1,tsn
1,2,nbl
2,3,ven
3,4,ssw
4,5,afr
5,6,nso
6,7,eng
7,8,sot
8,9,zul
9,10,eng


In [74]:
submission_nb.to_csv("multinomialNB.csv",encoding = 'utf-8', index = False)

Submission 3


In [75]:
y_pred_test3 = nb_ngrams.predict(x_test2)
submission_ngrams = pd.DataFrame(
    {'index': textid,
     'lang_id': y_pred_test3
    })
submission_ngrams.head()

Unnamed: 0,index,lang_id
0,1,tsn
1,2,nbl
2,3,ven
3,4,ssw
4,5,afr


In [76]:
submission_ngrams.to_csv("multinomialnb2.csv",encoding = 'utf-8', index = False)

Submission 4

In [78]:
y_pred_test4 = lsvc_1.predict(x_test)
submission_lsvc = pd.DataFrame(
    {'index': textid,
     'lang_id': y_pred_test4
    })
submission_lsvc.head()

Unnamed: 0,index,lang_id
0,1,tsn
1,2,nbl
2,3,ven
3,4,ssw
4,5,afr


In [79]:
submission_lsvc.to_csv("lsvc.csv",encoding = 'utf-8', index = False)
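
The four submission cells repeat the same pattern of building a two-column frame and writing it to CSV. A small helper could remove the repetition. The function name `make_submission` and its `path` parameter are hypothetical and do not appear in the notebook.

```python
import pandas as pd


def make_submission(ids, predictions, path=None):
    """Build an (index, lang_id) frame and optionally write it to CSV."""
    frame = pd.DataFrame({"index": list(ids), "lang_id": list(predictions)})
    if path is not None:
        # Same options as the to_csv calls used for the submissions above.
        frame.to_csv(path, encoding="utf-8", index=False)
    return frame


# Toy ids and labels; in the notebook these would be textid and a
# prediction array such as y_pred_test4.
demo = make_submission([1, 2, 3], ["tsn", "nbl", "ven"])
print(demo)
```

A call such as `make_submission(textid, y_pred_test4, "lsvc.csv")` would then stand in for the last two cells.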