# South African Language Identification Hack 2022
# EDSA 2201 & 2207 classification hackathon
© Explore Data Science Academy


## Honour Code
I **Mahlatse Philix, Ramabopa**, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the EDSA honour code (https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).  

Non-compliance with the honour code constitutes a material breach of contract.

## EXPLORE Data Science Academy Classification Hackathon
### Overview

South Africa is a multicultural society that is characterised by its rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and also contribute to the social, cultural, intellectual, economic and political life of the South African society.

The country is multilingual with 11 official languages, each of which is guaranteed equal status. Most South Africans are multilingual and able to speak at least two or more of the official languages.

From South African Government


<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Preprocessing</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Explanations</a>


 
-----------------------------------------------------------------------------------------------------
 



<a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>


In [25]:

# Importing modules
import nltk

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import re

# Magic function to enable graphs and plots to be plotted below the cell where your plotting commands are written
%matplotlib inline

# Set plot style
sns.set()

from sklearn import metrics

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Importing Model Modules
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB


 
-----------------------------------------------------------------------------------------------------
 



<a id="two"></a>
## 2. Loading Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>


In [26]:

# Loading the training data for the model 
df_train = pd.read_csv("train_set.csv")


In [27]:

# Loading the testing data for the model 
df_test = pd.read_csv("test_set.csv")


 
-----------------------------------------------------------------------------------------------------
 



<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>


In [28]:

# Checking the data types of all the columns I have in my data
df_train.dtypes


lang_id    object
text       object
dtype: object

In [29]:

# Checking the data types of all the columns I have in the test data
df_test.dtypes


index     int64
text     object
dtype: object

In [30]:

# Viewing the dimensions (rows, columns) of the train DataFrame
df_train.shape


(33000, 2)

In [31]:

# Viewing the dimensions (rows, columns) of the test DataFrame
df_test.shape


(5682, 2)

In [32]:

# Viewing the data type info for the train data
df_train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   lang_id  33000 non-null  object
 1   text     33000 non-null  object
dtypes: object(2)
memory usage: 515.8+ KB


In [33]:

# Viewing the data type info for the test data
df_test.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5682 entries, 0 to 5681
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   index   5682 non-null   int64 
 1   text    5682 non-null   object
dtypes: int64(1), object(1)
memory usage: 88.9+ KB


In [34]:

# Looking into the 'df_train' to check for missing values 
df_train.isnull().sum()


lang_id    0
text       0
dtype: int64

In [35]:

# Looking into our 'df_test' to check for missing values 
df_test.isnull().sum()


index    0
text     0
dtype: int64


- Neither dataset has any missing values, and the `text` column has the same data type (`object`) in both.


 
 -----------------------------------------------------------------------------------------------------
 



<a id="four"></a>
## 4. Preprocessing
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>


In [37]:

# Viewing the first five rows of the train data
df_train.head()


Unnamed: 0,lang_id,text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...
2,eng,the province of kwazulu-natal department of tr...
3,nso,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...


In [38]:

# Viewing the last five rows of the train data
df_train.tail()


Unnamed: 0,lang_id,text
32995,tsn,popo ya dipolateforomo tse ke go tlisa boetele...
32996,sot,modise mosadi na o ntse o sa utlwe hore thaban...
32997,eng,closing date for the submission of completed t...
32998,xho,nawuphina umntu ofunyenwe enetyala phantsi kwa...
32999,sot,mafapha a mang le ona a lokela ho etsa ditlale...


In [39]:

# Viewing the first five rows of the test data
df_test.head()


Unnamed: 0,index,text
0,1,"Mmasepala, fa maemo a a kgethegileng a letlele..."
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye...
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.
3,4,Kube inja nelikati betingevakala kutsi titsini...
4,5,Winste op buitelandse valuta.


In [16]:

# Viewing the last five rows of the test data
df_test.tail()


Unnamed: 0,index,text
5677,5678,You mark your ballot in private.
5678,5679,Ge o ka kgetha ka bowena go se šomiše Mofani k...
5679,5680,"E Ka kopo etsa kgetho ya hao ka hloko, hobane ..."
5680,5681,"TB ke bokudi ba PMB, mme Morero o tla lefella ..."
5681,5682,Vakatjhela iwebhusayidi yethu ku-www.



From viewing the first and last five rows of each dataset, I observed that `df_train` is already largely clean, whereas `df_test` is not: it contains noise in the form of capitalisation, punctuation and numbers.

##### Data Cleaning on `df_train` and `df_test`

- Making all text lowercase and removing numbers and punctuation


In [40]:

import string

# Make lower case
print('Lowering case...')
df_train["clean_text"] = df_train["text"].str.lower()
df_test["clean_text"] = df_test["text"].str.lower()

# Remove ASCII punctuation and digits
print('Cleaning numbers...')
print('Cleaning punctuation...')

def remove_punctuation_numbers(text):
    """Return `text` without ASCII punctuation marks and digits."""
    punc_numbers = string.punctuation + '0123456789'
    return ''.join([char for char in text if char not in punc_numbers])

df_train["clean_text"] = df_train["clean_text"].apply(remove_punctuation_numbers)
df_test["clean_text"] = df_test["clean_text"].apply(remove_punctuation_numbers)


Lowering case...
Cleaning numbers...
Cleaning punctuation...
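
As an optional variation on the cleaning step, the same removal of ASCII punctuation and digits can be sketched with `str.translate`, which is typically faster than filtering character by character. The helper name `remove_punctuation_numbers_fast` is hypothetical and not part of the notebook; the sample sentence is the last row of `df_test` shown in this section.

```python
import string

# Translation table that deletes every ASCII punctuation mark and digit.
# Accented letters such as "š" are not in string.punctuation, so they are kept.
_DELETE_TABLE = str.maketrans("", "", string.punctuation + string.digits)


def remove_punctuation_numbers_fast(text):
    """Hypothetical helper: drop ASCII punctuation and digits from `text`."""
    return text.translate(_DELETE_TABLE)


sample = "Vakatjhela iwebhusayidi yethu ku-www."
cleaned = remove_punctuation_numbers_fast(sample.lower())
print(cleaned)
```

With pandas, it could be applied in the same way as the original function, for example `df_test["clean_text"].apply(remove_punctuation_numbers_fast)`.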


In [41]:

# Viewing the first 10 rows
df_train.head(10)


Unnamed: 0,lang_id,text,clean_text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...,umgaqosiseko wenza amalungiselelo kumaziko axh...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...,idha iya kuba nobulumko bokubeka umsebenzi nap...
2,eng,the province of kwazulu-natal department of tr...,the province of kwazulunatal department of tra...
3,nso,o netefatša gore o ba file dilo ka moka tše le...,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...,khomishini ya ndinganyiso ya mbeu yo ewa maana...
5,nso,dinyakišišo tše tša go dirwa gabedi ka ngwaga ...,dinyakišišo tše tša go dirwa gabedi ka ngwaga ...
6,tsn,kgetse nngwe le nngwe e e sa faposiwang mo tsh...,kgetse nngwe le nngwe e e sa faposiwang mo tsh...
7,ven,mbadelo dze dza laelwa dzi do kwama mahatulele...,mbadelo dze dza laelwa dzi do kwama mahatulele...
8,nso,maloko a dikhuduthamaga a ikarabela mongwe le ...,maloko a dikhuduthamaga a ikarabela mongwe le ...
9,tsn,fa le dirisiwa lebone le tshwanetse go bontsha...,fa le dirisiwa lebone le tshwanetse go bontsha...


In [42]:

# Viewing the last 10 rows
df_train.tail(10)


Unnamed: 0,lang_id,text,clean_text
32990,eng,government has a long-term programme for the u...,government has a longterm programme for the up...
32991,nso,mo kgopelo e dirilwego go ya ka karolo ya go b...,mo kgopelo e dirilwego go ya ka karolo ya go b...
32992,zul,a umqondisi-jikelele azise umuntu okukhulunywa...,a umqondisijikelele azise umuntu okukhulunywa ...
32993,nso,molawana o akaretša mesepelo ka moka ya baname...,molawana o akaretša mesepelo ka moka ya baname...
32994,eng,manuel marin s ill-fated debt sources but very...,manuel marin s illfated debt sources but very ...
32995,tsn,popo ya dipolateforomo tse ke go tlisa boetele...,popo ya dipolateforomo tse ke go tlisa boetele...
32996,sot,modise mosadi na o ntse o sa utlwe hore thaban...,modise mosadi na o ntse o sa utlwe hore thaban...
32997,eng,closing date for the submission of completed t...,closing date for the submission of completed t...
32998,xho,nawuphina umntu ofunyenwe enetyala phantsi kwa...,nawuphina umntu ofunyenwe enetyala phantsi kwa...
32999,sot,mafapha a mang le ona a lokela ho etsa ditlale...,mafapha a mang le ona a lokela ho etsa ditlale...


In [43]:

# Viewing the first 10 rows
df_test.head(10)


Unnamed: 0,index,text,clean_text
0,1,"Mmasepala, fa maemo a a kgethegileng a letlele...",mmasepala fa maemo a a kgethegileng a letlelel...
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye...,uzakwaziswa ngokufaneleko nakungafuneka eminye...
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.,tshivhumbeo tshi fana na ngano dza vhathu
3,4,Kube inja nelikati betingevakala kutsi titsini...,kube inja nelikati betingevakala kutsi titsini...
4,5,Winste op buitelandse valuta.,winste op buitelandse valuta
5,6,"Ke feela dilense tše hlakilego, tša pono e tee...",ke feela dilense tše hlakilego tša pono e tee ...
6,7,<fn>(762010101403 AM) 1495 Final Gems Birthing...,fn am final gems birthing optionszulutxtfn
7,8,Ntjhafatso ya konteraka ya mosebetsi: Etsa bon...,ntjhafatso ya konteraka ya mosebetsi etsa bonn...
8,9,u-GEMS uhlinzeka ngezinzuzo zemithi yezifo ezi...,ugems uhlinzeka ngezinzuzo zemithi yezifo ezin...
9,10,"So, on occasion, are statistics misused.",so on occasion are statistics misused


In [44]:

# Viewing the last 10 rows
df_test.tail(10)


Unnamed: 0,index,text,clean_text
5672,5673,Die raad kan van tyd tot tyd en in ooreenstemm...,die raad kan van tyd tot tyd en in ooreenstemm...
5673,5674,halutshedzo ya ' tshiimo tsha u vha na vhudzim...,halutshedzo ya tshiimo tsha u vha na vhudzimu...
5674,5675,botlalo tšeo di hlokegago o mongwe le o mongwe.,botlalo tšeo di hlokegago o mongwe le o mongwe
5675,5676,Muanewa-muhali. Mafhungo o he a lungano a kwam...,muanewamuhali mafhungo o he a lungano a kwama ...
5676,5677,Afitafiti go tšwa go leloko go netefatša tlhok...,afitafiti go tšwa go leloko go netefatša tlhok...
5677,5678,You mark your ballot in private.,you mark your ballot in private
5678,5679,Ge o ka kgetha ka bowena go se šomiše Mofani k...,ge o ka kgetha ka bowena go se šomiše mofani k...
5679,5680,"E Ka kopo etsa kgetho ya hao ka hloko, hobane ...",e ka kopo etsa kgetho ya hao ka hloko hobane h...
5680,5681,"TB ke bokudi ba PMB, mme Morero o tla lefella ...",tb ke bokudi ba pmb mme morero o tla lefella t...
5681,5682,Vakatjhela iwebhusayidi yethu ku-www.,vakatjhela iwebhusayidi yethu kuwww


 
-----------------------------------------------------------------------------------------------------
 



<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>



- Feature extraction


In [45]:

# Feature extraction using the 'CountVectorizer'
betterVect = CountVectorizer(min_df=2,
                             max_df=0.5,
                             ngram_range=(1, 1))


In [46]:

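# Defining the feature matrix (X) and the target variable (y).
# Note: the raw `text` column is vectorised here, not `clean_text`.
# By default CountVectorizer lowercases the text and treats punctuation
# as a token separator, but it keeps digits.
X = betterVect.fit_transform(df_train["text"])

y = df_train["lang_id"]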
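
To make the effect of `min_df` and `max_df` concrete, here is a minimal sketch on an invented four-sentence corpus (the sentences are toy data, not taken from `train_set.csv`). With `min_df=2`, a word must appear in at least two documents; with `max_df=0.5`, a word that appears in more than half of the documents is dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus: two English-like and two Afrikaans-like sentences.
toy_docs = [
    "the court of the province act",
    "the department of transport act",
    "die raad van die provinsie act",
    "die departement van vervoer",
]

# Same settings as `betterVect` in the notebook.
vect = CountVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 1))
X_toy = vect.fit_transform(toy_docs)

# Words seen in only one document (e.g. "court") fail min_df=2.
# "act" is in 3 of 4 documents (75%), so it fails max_df=0.5.
vocab = sorted(vect.vocabulary_)
print(vocab)
print(X_toy.toarray())
```

Only words that occur in exactly two of the four documents survive both filters, which is why the resulting vocabulary is so small on this toy corpus.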


In [47]:

# Splitting the 'df_train' features and labels using 'train_test_split'
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
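
The split above is purely random. As an optional variation (not what the notebook does), `train_test_split` also accepts a `stratify` argument that keeps the share of each language the same in both splits. A minimal sketch on invented labels:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Invented toy data: 30 samples, 10 per language label.
toy_X = list(range(30))
toy_y = ["eng"] * 10 + ["afr"] * 10 + ["zul"] * 10

# stratify=toy_y is the assumption here; the notebook's split does not use it.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

test_counts = Counter(y_te)
print(len(X_tr), len(X_te), test_counts)
```

Whether stratification matters for this dataset depends on how balanced the `lang_id` classes are, which `df_train["lang_id"].value_counts()` would show.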



- Building some classification models


In [48]:

names = ["Logistic Regression", "Random Forest", "Naive Bayes"]


In [49]:

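# Candidate models, listed in the same order as `names`
classifiers = [
    LogisticRegression(),

    # A single shallow tree that considers one feature per split
    RandomForestClassifier(max_depth = 5, n_estimators = 1, max_features = 1),

    MultinomialNB(alpha = 0.3)
]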


In [None]:

results = []

models = {}
confusion = {}
class_report = {}

for name, clf in zip(names, classifiers):
    print ('Fitting {:s} model...'.format(name))
    run_time = %timeit -q -o clf.fit(X_train, y_train)

    print ('... predicting')
    y_pred = clf.predict(X_train)      # predictions on the training split
    y_pred_test = clf.predict(X_test)  # predictions on the held-out split

    print ('... scoring')
    # Accuracy, precision and recall are computed on the training split
    accuracy  = metrics.accuracy_score(y_train, y_pred)
    precision = metrics.precision_score(y_train, y_pred, average = "weighted")
    recall    = metrics.recall_score(y_train, y_pred, average = "weighted")

    # F1 is computed on both the training and the held-out split
    f1        = metrics.f1_score(y_train, y_pred, average = "weighted")
    f1_test   = metrics.f1_score(y_test, y_pred_test, average = "weighted")

    # Save the fitted model and its training-split reports to dictionaries
    models[name] = clf
    confusion[name] = metrics.confusion_matrix(y_train, y_pred)
    class_report[name] = metrics.classification_report(y_train, y_pred)

    results.append([name, accuracy, precision, recall, f1, f1_test, run_time.best])


results = pd.DataFrame(results, columns=['Classifier', 'Accuracy', 'Precision', 'Recall', 'F1 Train', 'F1 Test', 'Train Time'])
results.set_index('Classifier', inplace= True)

print ('... All done!')


Fitting Logistic Regression model...


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
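
The warning above suggests increasing `max_iter`. A minimal sketch of that fix on invented toy sentences follows; the value `max_iter=1000` is an assumption and may need adjusting for the full training set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy sentences, not rows from train_set.csv.
toy_text = [
    "the council may decide from time to time",
    "you mark your ballot in private",
    "the department of transport",
    "closing date for the submission",
    "die raad kan van tyd tot tyd besluit",
    "winste op buitelandse valuta",
    "die departement van vervoer",
    "sluitingsdatum vir die indiening",
]
toy_labels = ["eng"] * 4 + ["afr"] * 4

X_toy = CountVectorizer().fit_transform(toy_text)

# Give the solver more iterations than the default before it stops.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_toy, toy_labels)

# n_iter_ holds the iterations actually used; staying below max_iter
# means the solver stopped because it converged.
converged = max(clf.n_iter_) < clf.max_iter
print(converged)
```

In the notebook, the equivalent change would be to replace `LogisticRegression()` in the `classifiers` list with `LogisticRegression(max_iter=1000)`.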


Here are the results of the three models.


In [None]:

# Ranking the models by their F1 score on the training split
results.sort_values('F1 Train', ascending = False)
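
The `F1 Train` and `F1 Test` columns use `average = "weighted"`. As a rough illustration of what that means, the sketch below computes the per-class F1 score and weights each one by the number of true samples in that class. The function `weighted_f1` and the labels are invented for illustration; in the notebook the value comes from `metrics.f1_score`.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Hypothetical helper: support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * support[label] / total
    return score


# Invented labels and predictions for three languages.
toy_true = ["eng", "eng", "eng", "zul", "xho", "xho"]
toy_pred = ["eng", "eng", "zul", "zul", "xho", "zul"]
toy_score = weighted_f1(toy_true, toy_pred)
print(round(toy_score, 4))
```

Because each class is weighted by its support, a class with many samples influences the final score more than a rare one.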



Here is a plot of these values, to make them easier to compare visually.


In [None]:

# Plotting the bar graph to compare visually
fig, ax = plt.subplots(1, 2, figsize = (10, 5))
results.sort_values("F1 Train", ascending = False, inplace = True)
results.plot(y = ["F1 Test"], kind = "bar", ax = ax[0], xlim = [0, 1.1], ylim = [0.85, 0.92])
results.plot(y = "Train Time", kind = "bar", ax = ax[1])


In [None]:

# Printing out the classification report of all the models
print("Logistic Regression Classification Report")
print(class_report["Logistic Regression"])
print("\n")

print("Random Forest Classification Report")
print(class_report["Random Forest"])
print("\n")

print("Multinomial Naive Bayes Classification Report")
print(class_report["Naive Bayes"])


In [None]:

# Refitting Multinomial Naive Bayes (alpha = 0.1) as the model used for the submission
model = MultinomialNB(alpha = 0.1)
model.fit(X_train, y_train)


In [None]:

# Creating a 'csv' file and saving it for the Kaggle submission
submission = pd.DataFrame(df_test["index"])

# Vectorising the test text with the vectorizer fitted on the training data
test_text = df_test["text"]
test_vec = betterVect.transform(test_text)

# Predicting the language of each test sentence
y_pred = model.predict(test_vec)
submission["lang_id"] = y_pred

submission.to_csv("submission_final.csv", index = False)


In [None]:

# Viewing the first 20 rows to see the predictions
submission.head(20)


In [None]:

# Viewing the last 20 rows to see the predictions
submission.tail(20) 


 
-----------------------------------------------------------------------------------------------------
 



<a id="six"></a>
## 6. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>



In conclusion, I found that the `Naive Bayes` and `Logistic Regression` models performed equally well: they have the same f1_score and accuracy_score in the classification reports displayed above. The `Random Forest` model, on the other hand, performed poorly compared to the other two.

To further improve the performance of my models, I will use hyperparameter tuning.
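
One possible way to approach that tuning is a cross-validated grid search. The sketch below is only an illustration on invented toy sentences: the grid of `alpha` values, the two-fold cross-validation and the pipeline step names `vect` and `clf` are all assumptions, not choices made in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy sentences, not rows from train_set.csv.
toy_text = [
    "the council may decide from time to time",
    "you mark your ballot in private",
    "the department of transport",
    "closing date for the submission",
    "die raad kan van tyd tot tyd besluit",
    "winste op buitelandse valuta",
    "die departement van vervoer",
    "sluitingsdatum vir die indiening",
]
toy_labels = ["eng"] * 4 + ["afr"] * 4

# Vectorizer and classifier are tuned together so that each fold
# fits its vocabulary on the training part only.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultinomialNB()),
])

# Assumed grid of smoothing values to try.
param_grid = {"clf__alpha": [0.01, 0.1, 0.3, 1.0]}

search = GridSearchCV(pipeline, param_grid, scoring="f1_weighted", cv=2)
search.fit(toy_text, toy_labels)

best_alpha = search.best_params_["clf__alpha"]
print(best_alpha, search.best_score_)
```

On the real data, the same search could be run with `df_train["text"]` and `df_train["lang_id"]`, a wider grid and more folds.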
