---
---
<a id="cont"></a>

## Table of Contents
---
---



<a href=#three>1. Introduction</a>

<a href=#four>2. Problem Statement</a>

<a href=#five>3. Aim & Objectives</a>

<a href=#six>4. Literature Review</a>

<a href=#seven>5. Importing Packages</a>

<a href=#eight>6. Loading Data</a>

<a href=#nine>7. Exploratory Data Analysis (EDA)</a>

<a href=#ten>8. Preprocessing</a>

<a href=#eleven>9. Modelling and Evaluation</a>

<a href=#twelve>10. Analysis and Output</a>

<a href=#thirteen>11. Conclusion</a>


<a id="three"></a>
## 1. Introduction
<a href=#cont>Back to Table of Contents</a>

South Africa is a multicultural society that is characterised by its rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and also contribute to the social, cultural, intellectual, economic and political life of the South African society.

The country is multilingual with 11 official languages, each of which is guaranteed equal status. Most South Africans are multilingual and can speak at least two of the official languages.

<a id="four"></a>
## 2. Problem Statement 
<a href=#cont>Back to Table of Contents</a>

In such multilingual societies, people often need to understand text written in a language they do not know. Before any meaningful translation is possible, the language in which the text was written must first be identified. Automatic translation systems likewise need to identify the language of a text before mapping it to the corpus or lexicon of a known language for translation.

<a id="five"></a>
## 3. Aim & Objectives
<a href=#cont>Back to Table of Contents</a>

### Aim
- The aim of this project is to accurately classify any text into its appropriate language.

### Objectives
- Exploratory Data Analysis of the dataset provided.
- Data Preprocessing and Feature Engineering.
- Application of different classification models.
- Model Evaluation and Explanation.

<a id="six"></a>
## 4. Literature Review
<a href=#cont>Back to Table of Contents</a>

Language identification falls under Natural Language Processing (NLP), a subfield of a number of areas including computer science, artificial intelligence and linguistics. NLP is essentially concerned with making computer systems understand human language, in voice or text form, so that they can provide an accurate and intuitive response. It combines computational linguistics (rule-based modelling of human language) with statistical, machine learning and deep learning models. Together, these technologies enable computers to process human language in the form of text or voice data and to ‘understand’ its full meaning, complete with the speaker’s or writer’s intent and sentiment.

<a id="seven"></a>
## 5. Importing Packages
<a href=#cont>Back to Table of Contents</a>

In [10]:
# Basic libraries
import itertools
import os
import re
import string
import time
import unicodedata
import warnings
from collections import Counter
from html.parser import HTMLParser

import numpy as np
import pandas as pd

# Natural Language Processing
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import TweetTokenizer, word_tokenize

# Feature extraction and classification models
from sklearn import metrics
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Model selection and performance evaluation
from sklearn.metrics import (accuracy_score, auc, classification_report,
                             confusion_matrix, f1_score, precision_score,
                             recall_score, roc_curve)
from sklearn.model_selection import (GridSearchCV, KFold, StratifiedKFold,
                                     cross_val_score, cross_validate,
                                     train_test_split)
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import label_binarize

# Plotting and plot style
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

# Wordcloud
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

# Suppress warnings
warnings.filterwarnings("ignore")


<a id="eight"></a>
## 6. Loading Data
<a href=#cont>Back to Table of Contents</a>

In [11]:
df_train = pd.read_csv("train_set.csv") # load train dataset
df_test = pd.read_csv("test_set.csv") # load test dataset

In [12]:
df_train


| | lang_id | text |
|---|---|---|
| 0 | xho | umgaqo-siseko wenza amalungiselelo kumaziko ax... |
| 1 | xho | i-dha iya kuba nobulumko bokubeka umsebenzi na... |
| 2 | eng | the province of kwazulu-natal department of tr... |
| 3 | nso | o netefatša gore o ba file dilo ka moka tše le... |
| 4 | ven | khomishini ya ndinganyiso ya mbeu yo ewa maana... |
| ... | ... | ... |
| 32995 | tsn | popo ya dipolateforomo tse ke go tlisa boetele... |
| 32996 | sot | modise mosadi na o ntse o sa utlwe hore thaban... |
| 32997 | eng | closing date for the submission of completed t... |
| 32998 | xho | nawuphina umntu ofunyenwe enetyala phantsi kwa... |


In [13]:
df_test

| | index | text |
|---|---|---|
| 0 | 1 | Mmasepala, fa maemo a a kgethegileng a letlele... |
| 1 | 2 | Uzakwaziswa ngokufaneleko nakungafuneka eminye... |
| 2 | 3 | Tshivhumbeo tshi fana na ngano dza vhathu. |
| 3 | 4 | Kube inja nelikati betingevakala kutsi titsini... |
| 4 | 5 | Winste op buitelandse valuta. |
| ... | ... | ... |
| 5677 | 5678 | You mark your ballot in private. |
| 5678 | 5679 | Ge o ka kgetha ka bowena go se šomiše Mofani k... |
| 5679 | 5680 | E Ka kopo etsa kgetho ya hao ka hloko, hobane ... |
| 5680 | 5681 | TB ke bokudi ba PMB, mme Morero o tla lefella ... |


<a id="nine"></a>
## 7. Exploratory Data Analysis
<a href=#cont>Back to Table of Contents</a>

In [14]:
# explore shape of data
print(df_train.shape)
print(df_test.shape)

(33000, 2)
(5682, 2)


In [15]:
# Looking for duplicates
percent_duplicates = round((1-(df_train['text'].nunique()/len(df_train['text'])))*100,2)
print(percent_duplicates,'%')

9.25 %


In [16]:
df_train.drop_duplicates(subset="text", keep="first", inplace=True, ignore_index=True)  # remove duplicate entries from the training data
df_train.text.duplicated(keep="first").value_counts()  # confirm removal of duplicate entries

False    29948
Name: text, dtype: int64
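
The duplicate share reported above can be cross-checked from the row counts alone. The sketch below assumes the 33,000 rows reported by `df_train.shape` and the 29,948 rows left after `drop_duplicates`; the variable names are illustrative only.

```python
# Row counts taken from the notebook outputs above
total_rows = 33000    # df_train.shape[0] before deduplication
unique_texts = 29948  # rows remaining after drop_duplicates

# Same formula as the notebook cell: share of rows that repeat an earlier text
percent_duplicates = round((1 - unique_texts / total_rows) * 100, 2)
removed_rows = total_rows - unique_texts

print(f"{percent_duplicates} % duplicates, {removed_rows} rows removed")
```

The two figures agree: removing 3,052 repeated texts from 33,000 rows corresponds to the 9.25 % printed earlier.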

In [17]:
# explore summary statistics for dataset

df_train.describe()

| | lang_id | text |
|---|---|---|
| count | 29948 | 29948 |
| unique | 11 | 29948 |
| top | eng | umgaqo-siseko wenza amalungiselelo kumaziko ax... |
| freq | 2998 | 1 |


In [19]:
# checking for unique values for language id
langs = df_train['lang_id'].unique() 
langs

array(['xho', 'eng', 'nso', 'ven', 'tsn', 'nbl', 'zul', 'ssw', 'tso',
       'sot', 'afr'], dtype=object)

In [22]:
# checking for unique value count for the target variable
df_train['lang_id'].value_counts() 

eng    2998
zul    2924
nso    2873
tsn    2869
sot    2833
tso    2758
xho    2659
afr    2641
ven    2605
ssw    2426
nbl    2362
Name: lang_id, dtype: int64
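
The counts above suggest that the classes are close to balanced after deduplication. As a rough check, the sketch below recomputes the total and the ratio between the largest and smallest class. The dictionary is transcribed by hand from the printed output, and `imbalance_ratio` and `majority_baseline` are illustrative names, not part of the original notebook.

```python
# Per-language counts transcribed from the value_counts() output above
lang_counts = {
    "eng": 2998, "zul": 2924, "nso": 2873, "tsn": 2869, "sot": 2833,
    "tso": 2758, "xho": 2659, "afr": 2641, "ven": 2605, "ssw": 2426,
    "nbl": 2362,
}

total = sum(lang_counts.values())
largest = max(lang_counts, key=lang_counts.get)
smallest = min(lang_counts, key=lang_counts.get)

# Ratio of the most frequent to the least frequent language
imbalance_ratio = lang_counts[largest] / lang_counts[smallest]

# Accuracy obtained by always predicting the most frequent language
majority_baseline = lang_counts[largest] / total

print(total, largest, smallest)
print(round(imbalance_ratio, 2), round(majority_baseline, 2))
```

With a ratio this close to 1, plain accuracy is a reasonable headline metric, and any useful model should score far above the majority-class baseline.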

<a id="ten"></a>
## 8. Preprocessing
<a href=#cont>Back to Table of Contents</a>

In [23]:
def get_lemmas(df):
    """
    Take a dataframe with a 'text' column and return the same dataframe with
    additional columns holding the tokenised, POS-tagged and lemmatised
    versions of the texts.
    """
    df['tokens'] = df['text'].apply(word_tokenize)
    
    ### commence Part-of-Speech tagging    
    df['pos_tags'] = df['tokens'].apply(nltk.tag.pos_tag)

    def get_wordnet_pos(tag):

        if tag.startswith('J'):
            return wordnet.ADJ
        elif tag.startswith('V'):
            return wordnet.VERB
        elif tag.startswith('N'):
            return wordnet.NOUN
        elif tag.startswith('R'):
            return wordnet.ADV
        else:
            return wordnet.NOUN
        
    # create lemmatizer object    
    lemmatizer = WordNetLemmatizer() 
    df['pos_tags'] = df['pos_tags'].apply(lambda x: [(word, get_wordnet_pos(pos_tag)) for (word, pos_tag) in x])
    df['lemmatized'] = df['pos_tags'].apply(lambda x: [lemmatizer.lemmatize(word, tag) for word, tag in x])
    df['lemmatized'] = [' '.join(map(str, l)) for l in df['lemmatized']] 
    return df
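
`get_lemmas` relies on NLTK's default POS tagger and the WordNet lemmatiser, both of which are English resources, so it is likely to leave most tokens of the other ten languages unchanged. It is also not called in the cells that follow. A lighter, language-agnostic normalisation step is sketched below under the assumption that lowercasing and punctuation removal are sufficient. `clean_text` is a hypothetical helper and not part of the original notebook.

```python
import re
import string

# Punctuation to strip; the hyphen is kept because forms such as "i-dha"
# and "umgaqo-siseko" appear in the training texts
_PUNCT = string.punctuation.replace("-", "")
_PUNCT_RE = re.compile("[" + re.escape(_PUNCT) + "]")


def clean_text(text: str) -> str:
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    text = text.lower()
    text = _PUNCT_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Mmasepala, fa maemo a a kgethegileng!"))
```

If adopted, it could be applied with `df['text'].apply(clean_text)` before vectorisation. Accented letters such as `š` are left intact.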

In [24]:
# distinguish features and target variables
X = df_train["text"]
y = df_train["lang_id"]

# Create train and validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
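
The split above samples rows at random, so the language proportions in the two parts can drift slightly. One possible variant passes `stratify=` to `train_test_split`, which keeps each class's share the same in both parts. The sketch uses placeholder data rather than the real dataset, and a `test_size` of 0.2 so that the expected counts are whole numbers.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 100 placeholder texts with an uneven label mix (not the real dataset)
texts = [f"text {i}" for i in range(100)]
labels = ["eng"] * 50 + ["zul"] * 30 + ["afr"] * 20

# stratify=labels keeps each language's share the same in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(Counter(y_tr))
print(Counter(y_val))
```

In the notebook this would amount to adding `stratify=y` to the existing call. Whether it changes the scores noticeably on a dataset this balanced is not tested here.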

<a id="eleven"></a>
## 9. Modelling and Evaluation
<a href=#cont>Back to Table of Contents</a>

In [25]:
def model(models, X_train: pd.Series, y_train: pd.Series, X_test: pd.Series, y_test: pd.Series) -> pd.DataFrame:
    '''
    Cross-validate each (name, estimator) pair on the training split, print a
    classification report for the held-out split, and return a DataFrame of
    the per-fold cross-validation results.
    '''
    dfs = []
    kfold = KFold(n_splits=10, shuffle=True, random_state=50)  # same 10 folds for every model
    for name, estimator in models:
        cv_results = cross_validate(estimator, X_train, y_train, cv=kfold)
        clf = estimator.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print(name)
        # classification_report lists classes in sorted label order, so no
        # hand-written target_names list is passed (a list in a different
        # order would attach the wrong language to each row)
        print(classification_report(y_test, y_pred))
        this_df = pd.DataFrame(cv_results)
        this_df['model'] = name
        dfs.append(this_df)
    final = pd.concat(dfs, ignore_index=True)
    return final

In [26]:
# base models using count vectorizer
model_base = [
        ('LogReg', Pipeline([('Count', CountVectorizer()), ('classify', LogisticRegression())])),
        ('RF', Pipeline([('Count', CountVectorizer()), ('classify', RandomForestClassifier())])),
        ('KNN', Pipeline([('Count', CountVectorizer()), ('classify', KNeighborsClassifier())])),
        ('MULT', Pipeline([('Count', CountVectorizer()), ('classify', MultinomialNB())])),
        ('LINSVM', Pipeline([('Count', CountVectorizer()), ('classify', LinearSVC())]))]

In [27]:
modelled = model(model_base, X_train, y_train, X_test, y_test)

LogReg
              precision    recall  f1-score   support

         afr       1.00      1.00      1.00       879
         eng       1.00      1.00      1.00      1008
         nbl       0.98      0.98      0.98       757
         nso       1.00      1.00      1.00       930
         sot       1.00      1.00      1.00       941
         ssw       0.99      0.99      0.99       824
         tsn       1.00      1.00      1.00       928
         tso       1.00      1.00      1.00       894
         ven       1.00      1.00      1.00       859
         xho       0.99      0.99      0.99       864
         zul       0.98      0.97      0.98       999

    accuracy                           0.99      9883
   macro avg       0.99      0.99      0.99      9883
weighted avg       0.99      0.99      0.99      9883

RF
              precision    recall  f1-score   support

         afr       1.00      1.00      1.00       879
         eng       0.99      1.00      1.00      1008
         nbl  
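
The frame returned by `model` holds one row per fold and per model, so comparing models means aggregating the `test_score` column that `cross_validate` produces. A minimal sketch of that aggregation follows. `summarise_cv` is a hypothetical helper, and the scores in the toy frame are made-up placeholders, not results from this notebook.

```python
import pandas as pd


def summarise_cv(cv_df: pd.DataFrame) -> pd.DataFrame:
    """Mean and standard deviation of the fold scores per model, best first."""
    summary = cv_df.groupby("model")["test_score"].agg(["mean", "std"])
    return summary.sort_values("mean", ascending=False)


# Placeholder frame with the same columns as the output of model();
# the scores are invented for illustration only
toy = pd.DataFrame({
    "model": ["A", "A", "B", "B"],
    "test_score": [0.90, 0.94, 0.80, 0.84],
})

summary = summarise_cv(toy)
print(summary)
```

In the notebook the call would be `summarise_cv(modelled)`. Reporting the standard deviation alongside the mean helps judge whether a small gap between two models is larger than the fold-to-fold variation.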

<a id="twelve"></a>
## 10. Analysis and Output
<a href=#cont>Back to Table of Contents</a>

From the foregoing cross-validation results, the best-performing models are Multinomial Naive Bayes and the linear Support Vector Machine. Multinomial Naive Bayes narrowly outperforms the Support Vector Machine on both F1-score and accuracy.

Consequently, the Multinomial Naive Bayes model is refit on the whole training set, this time with a TF-IDF vectoriser in place of the count vectoriser, and the trained model is then used to predict the languages of the test data.

In [29]:
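# Multinomial Naive Bayes on TF-IDF features
multi = Pipeline([('tfidf', TfidfVectorizer(sublinear_tf=True,       # use 1 + log(tf) instead of raw tf
                                            smooth_idf=True,         # add one to document frequencies
                                            max_df=0.2,              # drop terms found in more than 20% of texts
                                            ngram_range=(1, 5),      # word n-grams of length 1 to 5
                                            stop_words='english')),  # removes English stop words only
                  ('clf', MultinomialNB())])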
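
To illustrate what changes when TF-IDF replaces raw counts, the sketch below compares the two vectorisers on a three-document toy corpus (not taken from the dataset) using scikit-learn's default settings. A term that occurs in every document gets the same count as a rare one, but a lower TF-IDF weight.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (not from the dataset): "the" occurs in every document
docs = ["the cat", "the dog", "the bird"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(docs).toarray()

# Column positions of the two terms of the first document
the_col = tfidf_vec.vocabulary_["the"]
cat_col = tfidf_vec.vocabulary_["cat"]

# Raw counts treat both terms alike; TF-IDF gives the rarer term more weight
print(counts[0, count_vec.vocabulary_["the"]], counts[0, count_vec.vocabulary_["cat"]])
print(weights[0, the_col], weights[0, cat_col])
```

The `max_df=0.2` setting goes one step further than down-weighting: terms that appear in more than 20% of the texts are dropped from the vocabulary altogether.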

In [30]:
#('MULT', Pipeline([('Count',CountVectorizer()),('classify',MultinomialNB())]))

In [31]:
multi.fit(X, y)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(max_df=0.2, ngram_range=(1, 5),
                                 stop_words='english', sublinear_tf=True)),
                ('clf', MultinomialNB())])

In [32]:
# set test data object for predictions
X_test = df_test['text']
predictions = multi.predict(X_test)
result_df = pd.DataFrame(predictions, columns=['lang_id'])
result_df


| | lang_id |
|---|---|
| 0 | tsn |
| 1 | nbl |
| 2 | ven |
| 3 | ssw |
| 4 | afr |
| ... | ... |
| 5677 | eng |
| 5678 | nso |
| 5679 | sot |
| 5680 | sot |


In [33]:
# prepare required output data file
output = pd.DataFrame({"index":df_test['index']})
submission = output.join(result_df)
submission

| | index | lang_id |
|---|---|---|
| 0 | 1 | tsn |
| 1 | 2 | nbl |
| 2 | 3 | ven |
| 3 | 4 | ssw |
| 4 | 5 | afr |
| ... | ... | ... |
| 5677 | 5678 | eng |
| 5678 | 5679 | nso |
| 5679 | 5680 | sot |
| 5680 | 5681 | sot |


In [34]:
# output the submission file in csv format
submission.to_csv("Joseph_submission.csv", index = False)
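
Because `output.join(result_df)` aligns rows on the default integer index, it may be worth checking the joined frame before relying on the file. The sketch below uses toy frames that mirror the first three rows shown above. `check_submission` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, n_expected: int, valid_labels: set) -> None:
    """Raise AssertionError if the submission frame looks malformed."""
    assert list(submission.columns) == ["index", "lang_id"]
    assert len(submission) == n_expected
    assert submission["lang_id"].notna().all()
    assert set(submission["lang_id"]) <= valid_labels


# Toy frames mirroring the first three rows of the notebook output
output = pd.DataFrame({"index": [1, 2, 3]})
result_df = pd.DataFrame(["tsn", "nbl", "ven"], columns=["lang_id"])
submission = output.join(result_df)  # join aligns rows on the default integer index

check_submission(submission, n_expected=3, valid_labels={"tsn", "nbl", "ven"})
print(submission)
```

For the real data, the call would pass `len(df_test)` and the set of language codes found in `df_train['lang_id']`.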

<a id="thirteen"></a>
## 11. Conclusion
<a href=#cont>Back to Table of Contents</a>

In this project, a model was built to identify, with a high degree of accuracy, the language in which a South African text is written. Several models were tested. Multinomial Naive Bayes performed best and was therefore refit on the whole training set to produce the final predictions.