# SOUTH AFRICAN LANGUAGE IDENTIFICATION HACKATHON

South Africa is a multicultural society that is characterised by its rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and also contribute to the social, cultural, intellectual, economic and political life of the South African society.

The country has 11 official languages, each of which is guaranteed equal status. Most South Africans are multilingual and can speak two or more of the official languages. With such a multilingual population, it makes sense for our systems and devices to communicate in multiple languages as well.



## AIM 

In this challenge, we take text written in any of South Africa's 11 official languages and identify which language it is in. This is an example of language identification, the NLP task of determining the natural language that a piece of text is written in.
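To make the task concrete, here is a minimal sketch of language identification as a text classification problem. The four snippets are the visible prefixes of rows from the training data previewed further down; the character n-gram features and the `toy_model` name are assumptions for illustration and are not the features used in the rest of this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus: visible prefixes of four training rows with their lang_id labels
texts = [
    "umgaqo-siseko wenza amalungiselelo kumaziko",
    "the province of kwazulu-natal department of",
    "o netefatša gore o ba file dilo ka moka tše",
    "khomishini ya ndinganyiso ya mbeu yo ewa maana",
]
labels = ["xho", "eng", "nso", "ven"]

# Character n-grams inside word boundaries (an illustrative choice, not the notebook's)
toy_model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))),
    ("logistic", LogisticRegression(max_iter=1000)),
])
toy_model.fit(texts, labels)

# Predict the language of an unseen snippet; the result is one of the four labels
toy_prediction = toy_model.predict(["the department of transport"])
print(toy_prediction)
```

With only four training snippets this is far too small to be reliable; it only shows the shape of the problem: text in, language label out.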

In [76]:
#IMPORT LIBRARIES
# imports for Natural Language  Processing
import pandas as pd
import numpy as np
import nltk
import string
from sklearn.pipeline import Pipeline
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# feature extraction
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Preprocessing
from collections import Counter
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import train_test_split

# classification models
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier



# Hyperparameter tuning methods
from sklearn.model_selection import GridSearchCV

# metrics

from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

import matplotlib.pyplot as plt
%matplotlib inline

In [77]:
# IMPORT DATA
train = pd.read_csv('train_set.csv')
test = pd.read_csv('test_set.csv')
sample_submission = pd.read_csv('sample_submission.csv')

In [78]:
train.head()

Unnamed: 0,lang_id,text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...
2,eng,the province of kwazulu-natal department of tr...
3,nso,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...


In [79]:
test.head()

Unnamed: 0,index,text
0,1,"Mmasepala, fa maemo a a kgethegileng a letlele..."
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye...
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.
3,4,Kube inja nelikati betingevakala kutsi titsini...
4,5,Winste op buitelandse valuta.


## EDA

In [80]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   lang_id  33000 non-null  object
 1   text     33000 non-null  object
dtypes: object(2)
memory usage: 515.8+ KB


In [81]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5682 entries, 0 to 5681
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   index   5682 non-null   int64 
 1   text    5682 non-null   object
dtypes: int64(1), object(1)
memory usage: 88.9+ KB


From the above we can see that the train dataset has 2 columns with 33000 rows each, both of data type object, and the test dataset has 2 columns with 5682 rows each, of data types int64 and object. Neither dataset has missing values.
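`info()` reports row counts and missing values but not how the rows are spread across the languages. A minimal sketch of that check is shown below on a hypothetical stand-in frame (`toy_train` and its contents are made up for illustration); on the real data the same calls would be applied to `train`.

```python
import pandas as pd

# Hypothetical stand-in for the real `train` frame (same column names)
toy_train = pd.DataFrame({
    "lang_id": ["xho", "xho", "eng", "nso", "ven", "eng"],
    "text": ["a", "b", "c", "d", "e", "f"],
})

# Number of rows per language: shows whether the classes are balanced
class_counts = toy_train["lang_id"].value_counts()

# Total number of missing cells across all columns
missing_total = int(toy_train.isnull().sum().sum())

print(class_counts)
print("missing cells:", missing_total)
```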

## DATA ENGINEERING

In [82]:
# Split text on hyphens
train['text_split'] = train['text'].str.split("-")
test['text_split'] = test['text'].str.split("-")

# Return to string from list
train['text_split'] = train.text_split.apply(lambda x: ' '.join([str(i) for i in x]))
test['text_split'] = test.text_split.apply(lambda x: ' '.join([str(i) for i in x]))

In [83]:
train.head()

Unnamed: 0,lang_id,text,text_split
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...,umgaqo siseko wenza amalungiselelo kumaziko ax...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...,i dha iya kuba nobulumko bokubeka umsebenzi na...
2,eng,the province of kwazulu-natal department of tr...,the province of kwazulu natal department of tr...
3,nso,o netefatša gore o ba file dilo ka moka tše le...,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...,khomishini ya ndinganyiso ya mbeu yo ewa maana...


In [84]:
test.head()

Unnamed: 0,index,text,text_split
0,1,"Mmasepala, fa maemo a a kgethegileng a letlele...","Mmasepala, fa maemo a a kgethegileng a letlele..."
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye...,Uzakwaziswa ngokufaneleko nakungafuneka eminye...
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.,Tshivhumbeo tshi fana na ngano dza vhathu.
3,4,Kube inja nelikati betingevakala kutsi titsini...,Kube inja nelikati betingevakala kutsi titsini...
4,5,Winste op buitelandse valuta.,Winste op buitelandse valuta.
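The split-then-join step above amounts to replacing each hyphen with a space. As a sketch, assuming that is the only intent, a single `str.replace` call gives the same result; `toy_text` is a small made-up series built from snippets shown in the previews.

```python
import pandas as pd

toy_text = pd.Series([
    "umgaqo-siseko wenza",
    "the province of kwazulu-natal",
    "Winste op buitelandse valuta.",
])

# Two-step version, as in the cell above: split on "-" and join with spaces
split_join = toy_text.str.split("-").apply(lambda parts: " ".join(str(p) for p in parts))

# One-step version: literal (non-regex) replacement of "-" with a space
replaced = toy_text.str.replace("-", " ", regex=False)

print(replaced.tolist())
```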


In [85]:
# Word Tokenization
nltk.download('punkt')
train['tokens'] = train['text_split'].apply(nltk.word_tokenize)
test['tokens'] = test['text_split'].apply(nltk.word_tokenize)

[nltk_data] Downloading package punkt to C:\Users\Mabotse
[nltk_data]     Selamolela\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [86]:
train.head()

Unnamed: 0,lang_id,text,text_split,tokens
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...,umgaqo siseko wenza amalungiselelo kumaziko ax...,"[umgaqo, siseko, wenza, amalungiselelo, kumazi..."
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...,i dha iya kuba nobulumko bokubeka umsebenzi na...,"[i, dha, iya, kuba, nobulumko, bokubeka, umseb..."
2,eng,the province of kwazulu-natal department of tr...,the province of kwazulu natal department of tr...,"[the, province, of, kwazulu, natal, department..."
3,nso,o netefatša gore o ba file dilo ka moka tše le...,o netefatša gore o ba file dilo ka moka tše le...,"[o, netefatša, gore, o, ba, file, dilo, ka, mo..."
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...,khomishini ya ndinganyiso ya mbeu yo ewa maana...,"[khomishini, ya, ndinganyiso, ya, mbeu, yo, ew..."


In [87]:
test.head()

Unnamed: 0,index,text,text_split,tokens
0,1,"Mmasepala, fa maemo a a kgethegileng a letlele...","Mmasepala, fa maemo a a kgethegileng a letlele...","[Mmasepala, ,, fa, maemo, a, a, kgethegileng, ..."
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye...,Uzakwaziswa ngokufaneleko nakungafuneka eminye...,"[Uzakwaziswa, ngokufaneleko, nakungafuneka, em..."
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.,Tshivhumbeo tshi fana na ngano dza vhathu.,"[Tshivhumbeo, tshi, fana, na, ngano, dza, vhat..."
3,4,Kube inja nelikati betingevakala kutsi titsini...,Kube inja nelikati betingevakala kutsi titsini...,"[Kube, inja, nelikati, betingevakala, kutsi, t..."
4,5,Winste op buitelandse valuta.,Winste op buitelandse valuta.,"[Winste, op, buitelandse, valuta, .]"


In [88]:
# Remove one-character tokens
train['tokens'] = train['tokens'].apply(lambda x: [token for token in x if len(token) > 1])
test['tokens'] = test['tokens'].apply(lambda x: [token for token in x if len(token) > 1])

In [89]:
test.head()

Unnamed: 0,index,text,text_split,tokens
0,1,"Mmasepala, fa maemo a a kgethegileng a letlele...","Mmasepala, fa maemo a a kgethegileng a letlele...","[Mmasepala, fa, maemo, kgethegileng, letlelela..."
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye...,Uzakwaziswa ngokufaneleko nakungafuneka eminye...,"[Uzakwaziswa, ngokufaneleko, nakungafuneka, em..."
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.,Tshivhumbeo tshi fana na ngano dza vhathu.,"[Tshivhumbeo, tshi, fana, na, ngano, dza, vhathu]"
3,4,Kube inja nelikati betingevakala kutsi titsini...,Kube inja nelikati betingevakala kutsi titsini...,"[Kube, inja, nelikati, betingevakala, kutsi, t..."
4,5,Winste op buitelandse valuta.,Winste op buitelandse valuta.,"[Winste, op, buitelandse, valuta]"


In [90]:
# Remove punctuation tokens
train['tokens'] = train['tokens'].apply(lambda x : [token for token in x if token not in string.punctuation])
test['tokens'] = test['tokens'].apply(lambda x : [token for token in x if token not in string.punctuation])
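One caveat with `token not in string.punctuation`: because `string.punctuation` is a single string, this is a substring test, so multi-character punctuation tokens such as `...` or the quote tokens that NLTK's tokenizer can emit are kept. The sketch below contrasts it with a character-level check; the token list is hypothetical.

```python
import string

# Hypothetical token list containing multi-character punctuation tokens
tokens = ["Winste", "``", "op", "...", "valuta", "''", "()"]

# Substring-style check, as in the cell above: only drops tokens that
# happen to appear as a contiguous piece of string.punctuation
substring_filtered = [t for t in tokens if t not in string.punctuation]

# Character-level check: keep a token only if it has at least one
# character that is not punctuation
strict_filtered = [t for t in tokens if any(ch not in string.punctuation for ch in t)]

print(substring_filtered)
print(strict_filtered)
```

Whether the stricter filter changes the model scores is not tested here; the vectorizers used later apply their own token pattern as well.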

In [91]:
# Install stopwordsiso, which provides the non-English stop-word lists used below
!pip install stopwordsiso



In [92]:
from nltk.corpus import stopwords
import stopwordsiso as stopwordz
nltk.download('stopwords')

stop_eng = stopwords.words('english')
stop_sotho = stopwordz.stopwords("st")
stop_zulu = stopwordz.stopwords("zu")
stop_afr = stopwordz.stopwords("af")


# Ensure all text is in lower case
train['no_stopwords'] = train['tokens'].apply(lambda x: [word.lower() for word in x])

# Remove stopwords (start from the lower-cased tokens so the step above is not overwritten)
train['no_stopwords'] = train['no_stopwords'].apply(lambda x: [item for item in x if item not in stop_eng])
train['no_stopwords'] = train['no_stopwords'].apply(lambda x: [item for item in x if item not in stop_sotho])
train['no_stopwords'] = train['no_stopwords'].apply(lambda x: [item for item in x if item not in stop_zulu])
train['no_stopwords'] = train['no_stopwords'].apply(lambda x: [item for item in x if item not in stop_afr])

#Test
test['no_stopwords'] = test['tokens'].apply(lambda x: [item for item in x if item not in stop_eng])
test['no_stopwords'] = test['no_stopwords'].apply(lambda x: [item for item in x if item not in stop_sotho])
test['no_stopwords'] = test['no_stopwords'].apply(lambda x: [item for item in x if item not in stop_zulu])
test['no_stopwords'] = test['no_stopwords'].apply(lambda x: [item for item in x if item not in stop_afr])
train.head()

[nltk_data] Downloading package stopwords to C:\Users\Mabotse
[nltk_data]     Selamolela\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,lang_id,text,text_split,tokens,no_stopwords
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...,umgaqo siseko wenza amalungiselelo kumaziko ax...,"[umgaqo, siseko, wenza, amalungiselelo, kumazi...","[umgaqo, siseko, wenza, amalungiselelo, kumazi..."
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...,i dha iya kuba nobulumko bokubeka umsebenzi na...,"[dha, iya, kuba, nobulumko, bokubeka, umsebenz...","[dha, iya, kuba, nobulumko, bokubeka, umsebenz..."
2,eng,the province of kwazulu-natal department of tr...,the province of kwazulu natal department of tr...,"[the, province, of, kwazulu, natal, department...","[province, kwazulu, natal, department, transpo..."
3,nso,o netefatša gore o ba file dilo ka moka tše le...,o netefatša gore o ba file dilo ka moka tše le...,"[netefatša, gore, ba, file, dilo, ka, moka, tš...","[netefatša, gore, file, dilo, moka, tše, dumel..."
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...,khomishini ya ndinganyiso ya mbeu yo ewa maana...,"[khomishini, ya, ndinganyiso, ya, mbeu, yo, ew...","[khomishini, ya, ndinganyiso, ya, mbeu, yo, ew..."
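The four filtering passes above could also be done in one pass by merging the stop-word lists into a single set, which makes each membership test a constant-time lookup. This is a minimal sketch with made-up miniature lists; `remove_stopwords` is a hypothetical helper, and the real lists would come from `stopwords.words('english')` and `stopwordz.stopwords(...)` as in the cell above.

```python
# Hypothetical miniature stop-word lists standing in for the real ones
stop_eng = ["the", "of"]
stop_afr = {"op", "die"}

# Merge everything into one set for a single filtering pass
all_stopwords = set(stop_eng) | set(stop_afr)


def remove_stopwords(tokens, stopword_set):
    """Lower-case each token and drop it if it is a stop word."""
    lowered = (tok.lower() for tok in tokens)
    return [tok for tok in lowered if tok not in stopword_set]


cleaned = remove_stopwords(["The", "province", "of", "kwazulu", "natal"], all_stopwords)
print(cleaned)
```

The same helper could then be applied to both frames, for example with `train['tokens'].apply(remove_stopwords, args=(all_stopwords,))`, so that train and test receive identical preprocessing.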


## MODELING

In [93]:
#Separate X and Y variables
X = train['no_stopwords']
y = train['lang_id']
X_test = test['no_stopwords']

In [94]:
#Train/Test Split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state = 42)
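The split above is purely random. If one wants every language to appear in the validation set in the same proportion as in the full data, `train_test_split` accepts a `stratify` argument. The sketch below shows this on made-up data (`toy_X`, `toy_y`); it is an optional variation, not what the notebook ran.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 20 texts, two languages with 10 rows each
toy_X = pd.Series([f"text {i}" for i in range(20)])
toy_y = pd.Series(["xho"] * 10 + ["eng"] * 10)

# stratify=toy_y keeps the class proportions equal in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(y_va.value_counts().to_dict())
```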

In [95]:
# Convert to string list
X_train = list(X_train.apply(' '.join))
X_val = list(X_val.apply(' '.join))
X_test = list(X_test.apply(' '.join))

### Decision Tree Classifier

In [96]:
# DecisionTreeClassifier Pipeline
tree_tfidf = Pipeline([('tfidf', TfidfVectorizer()),('tree', DecisionTreeClassifier()),])
tree_count = Pipeline([('CountVec',  CountVectorizer(analyzer = 'word', 
                             tokenizer = None, 
                             preprocessor = None, 
                             stop_words = None, 
                             max_features = 180000,
                             min_df = 1,
                             ngram_range = (1,2)
                            )),('tree', DecisionTreeClassifier()),])

tree_tfidf.fit(X_train, y_train)
tree_count.fit(X_train, y_train)

tree_prediction_cv = tree_count.predict(X_val) # DecisionTreeClassifier predictions
print('\nDecision Tree\n', classification_report(y_val, tree_prediction_cv))


Decision Tree
               precision    recall  f1-score   support

         afr       0.99      0.98      0.99       583
         eng       0.97      0.93      0.95       615
         nbl       0.92      0.89      0.90       583
         nso       0.99      0.95      0.97       625
         sot       0.94      0.99      0.96       618
         ssw       0.98      0.88      0.93       584
         tsn       0.95      0.97      0.96       598
         tso       1.00      0.99      0.99       561
         ven       1.00      0.97      0.99       634
         xho       0.96      0.90      0.93       609
         zul       0.73      0.93      0.81       590

    accuracy                           0.94      6600
   macro avg       0.95      0.94      0.94      6600
weighted avg       0.95      0.94      0.94      6600



In [97]:
# Print the overall model performance
tree_prediction_acc_vec = round(accuracy_score(y_val, tree_prediction_cv), 4)
print(f'\nOverall accuracy score for Decision Tree from CountVectorizer : {round(tree_prediction_acc_vec*100, 4)}')
tree_prediction_f1_vec = round(f1_score(y_val, tree_prediction_cv, average="weighted"), 4)
print(f'\nWeighted avg f1 score Decision Tree from CountVectorizer: {round(tree_prediction_f1_vec*100, 4)}')


Overall accuracy score for Decision Tree from CountVectorizer : 94.21

Weighted avg f1 score Decision Tree from CountVectorizer: 94.36


### Logistic Regression

In [98]:
# Logistic Regression pipeline
logreg_tfidf = Pipeline([('tfidf', TfidfVectorizer()),('logistic', LogisticRegression()),])
logreg_count = Pipeline([('CountVec',  CountVectorizer(analyzer = 'word', 
                             tokenizer = None, 
                             preprocessor = None, 
                             stop_words = None, 
                             max_features = 180000,
                             min_df = 1,
                             ngram_range = (1,3)
                            )),('logistic', LogisticRegression()),])
logreg_tfidf.fit(X_train, y_train)
logreg_count.fit(X_train, y_train)

logreg_prediction_cv = logreg_count.predict(X_val) # Logistic regression predictions
print('\nLogistic Regression\n', classification_report(y_val, logreg_prediction_cv))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



Logistic Regression
               precision    recall  f1-score   support

         afr       1.00      1.00      1.00       583
         eng       1.00      1.00      1.00       615
         nbl       0.99      0.98      0.99       583
         nso       1.00      0.99      1.00       625
         sot       0.99      1.00      1.00       618
         ssw       1.00      0.99      1.00       584
         tsn       0.99      0.99      0.99       598
         tso       1.00      1.00      1.00       561
         ven       1.00      1.00      1.00       634
         xho       1.00      0.99      0.99       609
         zul       0.96      0.99      0.98       590

    accuracy                           0.99      6600
   macro avg       0.99      0.99      0.99      6600
weighted avg       0.99      0.99      0.99      6600



In [99]:
# Print the overall model performance
logreg_acc_vec = round(accuracy_score(y_val, logreg_prediction_cv), 4)
print(f'\nOverall accuracy score for Logistic Regression from CountVectorizer : {round(logreg_acc_vec*100, 4)}')
logreg_f1_vec = round(f1_score(y_val, logreg_prediction_cv, average="weighted"), 4)
print(f'\nWeighted avg f1 score Logistic Regression from CountVectorizer: {round(logreg_f1_vec*100, 4)}')


Overall accuracy score for Logistic Regression from CountVectorizer : 99.35

Weighted avg f1 score Logistic Regression from CountVectorizer: 99.35
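The convergence warning printed above means the solver stopped at its iteration limit before converging. As the warning itself suggests, one option is to raise `max_iter`. The sketch below shows where that parameter goes in the pipeline; the value 1000 and the name `logreg_count_more_iter` are illustrative assumptions, and the toy documents only demonstrate that the pipeline fits.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

logreg_count_more_iter = Pipeline([
    ("CountVec", CountVectorizer(ngram_range=(1, 3), max_features=180000)),
    # max_iter=1000 is an illustrative value, not a tuned one
    ("logistic", LogisticRegression(max_iter=1000)),
])

# Made-up documents, only to show the pipeline fitting end to end
toy_docs = [
    "umgaqo siseko wenza",
    "province kwazulu natal",
    "umgaqo wenza",
    "department transport province",
]
toy_labels = ["xho", "eng", "xho", "eng"]
logreg_count_more_iter.fit(toy_docs, toy_labels)

# Number of iterations the solver actually used
n_iter_used = logreg_count_more_iter.named_steps["logistic"].n_iter_
print(n_iter_used)
```

On the real data, the scores may or may not change once the solver converges; that would need to be re-run and checked.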


### Random Forest Classifier

In [100]:
# RandomForestClassifier Pipeline
rfc_tfidf = Pipeline([('tfidf', TfidfVectorizer()), ('rfc', RandomForestClassifier())])
rfc_count = Pipeline([('CountVec',  CountVectorizer(analyzer = 'word', 
                             tokenizer = None, 
                             preprocessor = None, 
                             stop_words = None, 
                             max_features = 180000,
                             min_df = 1,
                             ngram_range = (1,2)
                            )),('rfc', RandomForestClassifier()),])

rfc_tfidf.fit(X_train, y_train)
rfc_count.fit(X_train, y_train)

rfc_prediction_cv = rfc_count.predict(X_val) # RandomForestClassifier predictions
print('\nRandomForestClassifier\n', classification_report(y_val, rfc_prediction_cv))


RandomForestClassifier
               precision    recall  f1-score   support

         afr       1.00      0.99      0.99       583
         eng       0.98      0.98      0.98       615
         nbl       1.00      0.90      0.95       583
         nso       1.00      0.99      0.99       625
         sot       0.97      1.00      0.99       618
         ssw       0.99      0.95      0.97       584
         tsn       0.99      0.98      0.99       598
         tso       1.00      1.00      1.00       561
         ven       1.00      1.00      1.00       634
         xho       0.99      0.92      0.96       609
         zul       0.81      0.98      0.88       590

    accuracy                           0.97      6600
   macro avg       0.98      0.97      0.97      6600
weighted avg       0.98      0.97      0.97      6600



In [101]:
# Print the overall model performance
random_forest_acc_vec = round(accuracy_score(y_val, rfc_prediction_cv), 4)
print(f'\nOverall accuracy score for RandomForestClassifier from CountVectorizer : {round(random_forest_acc_vec*100, 4)}')
random_forest_f1_vec = round(f1_score(y_val, rfc_prediction_cv, average="weighted"), 4)
print(f'\nWeighted avg f1 score RandomForestClassifier from CountVectorizer: {round(random_forest_f1_vec*100, 4)}')


Overall accuracy score for RandomForestClassifier from CountVectorizer : 97.24

Weighted avg f1 score RandomForestClassifier from CountVectorizer: 97.31


## Comparing the models

In [102]:
# Creating a dataframe with our models and their performance metrics
classifier_scores = {'Classifiers':['Decision Tree', 'Logistic Regression','Random Forest'],
                    'Accuracy on CV':[tree_prediction_acc_vec, logreg_acc_vec, random_forest_acc_vec,],
                     'F1 Score on CV':[tree_prediction_f1_vec, logreg_f1_vec, random_forest_f1_vec,]}
df = pd.DataFrame(classifier_scores)
df.sort_values(by=['F1 Score on CV'],ascending=False, inplace = True)
df

Unnamed: 0,Classifiers,Accuracy on CV,F1 Score on CV
1,Logistic Regression,0.9935,0.9935
2,Random Forest,0.9724,0.9731
0,Decision Tree,0.9421,0.9436
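In the reports above, `zul` has the lowest precision for the Decision Tree (0.73) and the Random Forest (0.81), which means other languages are being labelled as `zul`. A confusion matrix shows which ones. The sketch below uses made-up labels (`y_true`, `y_hat`) purely to illustrate the layout; on the real data one would pass `y_val` and a model's validation predictions.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only, not results from this notebook
y_true = ["zul", "xho", "nbl", "zul", "xho", "ssw"]
y_hat = ["zul", "zul", "zul", "zul", "xho", "ssw"]

langs = sorted(set(y_true))

# Rows are the true languages, columns are the predicted languages
cm = pd.DataFrame(
    confusion_matrix(y_true, y_hat, labels=langs),
    index=langs,
    columns=langs,
)
print(cm)
```

Reading down the `zul` column shows how many rows of each true language were predicted as `zul`.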


## MAKING SUBMISSIONS

In [103]:
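# Decision tree: predictions from the TF-IDF pipeline (the validation scores above are for the CountVectorizer pipeline)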
y_pred = tree_tfidf.predict(X_test)
test['lang_id'] = y_pred
#test[['index','lang_id']].to_csv('Decision Tree Classifier_tfidf.csv', index=False)
test[['index','lang_id']]


Unnamed: 0,index,lang_id
0,1,zul
1,2,nbl
2,3,ven
3,4,ssw
4,5,zul
...,...,...
5677,5678,zul
5678,5679,nso
5679,5680,sot
5680,5681,sot


In [104]:
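# Logistic Regression: predictions from the TF-IDF pipeline (the validation scores above are for the CountVectorizer pipeline)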
y_pred = logreg_tfidf.predict(X_test)
test['lang_id'] = y_pred
#test[['index','lang_id']].to_csv('Logistic Regression_tfidf.csv', index=False)
test[['index','lang_id']]

Unnamed: 0,index,lang_id
0,1,tsn
1,2,nbl
2,3,ven
3,4,ssw
4,5,zul
...,...,...
5677,5678,eng
5678,5679,nso
5679,5680,sot
5680,5681,sot


In [105]:
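# Random Forest: predictions from the TF-IDF pipeline (the validation scores above are for the CountVectorizer pipeline)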
y_pred = rfc_tfidf.predict(X_test)
test['lang_id'] = y_pred
#test[['index','lang_id']].to_csv('Random Forest Classifier_tfidf.csv', index=False)
test[['index','lang_id']]

Unnamed: 0,index,lang_id
0,1,tsn
1,2,zul
2,3,ven
3,4,ssw
4,5,zul
...,...,...
5677,5678,zul
5678,5679,nso
5679,5680,sot
5680,5681,sot
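Each cell above overwrites `test['lang_id']`, so only the last model's predictions remain in the frame. A minimal sketch of building a separate submission frame per model is shown below; `toy_test` and `toy_pred` are placeholders, and the CSV is written to an in-memory buffer rather than to a file.

```python
import io

import pandas as pd

# Placeholder stand-ins for the real `test` frame and a model's predictions
toy_test = pd.DataFrame({"index": [1, 2, 3], "text": ["a", "b", "c"]})
toy_pred = ["tsn", "nbl", "ven"]

# Build a separate frame so the test data itself is left unchanged
submission = toy_test[["index"]].copy()
submission["lang_id"] = toy_pred

# Write without the pandas row index, giving the columns index,lang_id
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

On the real data, `toy_pred` would be replaced by a fitted pipeline's output, for example `logreg_count.predict(X_test)` for the model that scored highest on validation, and the buffer by a file name.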
