# South African Language Identification Hack 2022
### EDSA 2201 & 2207 classification hackathon

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Import necessary libraries</a>

<a href=#two>2. Load and view the data</a>

<a href=#three>3. Data Preprocessing</a>

<a href=#four>4. Splitting the data</a>

<a href=#five>5. Performance Metrics for model evaluation</a>


###  Overview

South Africa is a multicultural society that is characterised by its rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and also contribute to the social, cultural, intellectual, economic and political life of the South African society.

The country is multilingual with 11 official languages, each of which is guaranteed equal status. Most South Africans are multilingual and able to speak at least two or more of the official languages.
From South African Government

With such a multilingual population, it is only obvious that our systems and devices also communicate in multi-languages.

In this challenge, you will take text which is in any of South Africa's 11 Official languages and identify which language the text is in. This is an example of NLP's Language Identification, the task of determining the natural language that a piece of text is written in.

### Data Description

The dataset used for this challenge is the NCHLT Text Corpora collected by the South African Department of Arts and Culture & Centre for Text Technology (CTexT, North-West University, South Africa). The training set was improved through additional cleaning done by Praekelt.

The data is in the form Language ID, Text. The text is in various states of cleanliness. Some NLP techniques will be necessary to clean up the data.

### 1. Import necessary libraries

In [17]:
#importing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import nltk
import re
import string
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
from nltk import word_tokenize, pos_tag, pos_tag_sents
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
from sklearn.utils import resample
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV


### 2. Load and view the data

In [18]:
#Import data
test = pd.read_csv("test_set.csv")
train = pd.read_csv("train_set.csv")
sample = pd.read_csv("sample_submission.csv")

In [12]:
train.head()

Unnamed: 0,lang_id,text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...
2,eng,the province of kwazulu-natal department of tr...
3,nso,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...


In [13]:
test.head()

Unnamed: 0,index,text
0,1,"Mmasepala, fa maemo a a kgethegileng a letlele..."
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye...
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.
3,4,Kube inja nelikati betingevakala kutsi titsini...
4,5,Winste op buitelandse valuta.


In [14]:
sample.head()

Unnamed: 0,index,lang_id
0,1,tsn
1,2,nbl


The data is in the form Language ID, Text. The text is in various states of cleanliness. Some NLP techniques will be necessary to clean up the data.. A description of each variable in the dataset is given below.  
### Training set
**Variable definitions:**  

- **lang_id** - Unique Language ID.
- **Text** -  string characters.       
  

**Each text is then labeled as one of the following languages:**  
 
    
| **Class** | **Tag** |
|:---------:|:---------
|   **2**   | **tsn** |
|   **1**   | **nbi** |
 

### Testing set  
During testing we do not have access to the **lang_id	** variable, but the testing dataset remains the same as the training dataset otherwise.  

### Data types 

Let's get quick overview of the datasets we will be working with throughout the notebook. The output below contains the shape of the dataset, a list of all columns with their data types and the number of non-null values present in each column.  

**Train data**  

In [15]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   lang_id  33000 non-null  object
 1   text     33000 non-null  object
dtypes: object(2)
memory usage: 515.8+ KB


The train dataset has 33000 entries, contains no null entries, and the data types for "lang_id" and "text" are both object data types.

#### Test data

In [16]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5682 entries, 0 to 5681
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   index   5682 non-null   int64 
 1   text    5682 non-null   object
dtypes: int64(1), object(1)
memory usage: 88.9+ KB


The test dataset has 5682 entries, contains no null entries, and the variable "tesxt" has the object data type.

### 3. Data Preprocessing

In [17]:
#Data Preprocessing
#Identifying missing values and data types
train.isna().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   lang_id  33000 non-null  bool 
 1   text     33000 non-null  bool 
dtypes: bool(2)
memory usage: 64.6 KB


In [18]:
test.isna().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5682 entries, 0 to 5681
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   index   5682 non-null   bool 
 1   text    5682 non-null   bool 
dtypes: bool(2)
memory usage: 11.2 KB


### Exploring the data

In [33]:
sns.countplot(train.lang_id, ax=axes[0])

NameError: name 'axes' is not defined

Looking at the above graph, there seems to be no class **imbalance**, therefore**:**
* **No** need for **upsampling** and,
* **No** need for **downsampling**

#### Copying train

In [19]:
data_lang = train.copy()

In [20]:
def hashtag_extract(text):  
    
  
    hashtags = []

    for a in text:
        ht = re.findall(r"#(\w+)", a)
        hashtags.append(ht)

    hashtags = sum(hashtags, [])
    frequency = nltk.FreqDist(hashtags)

    hashtag_data = pd.DataFrame({'hashtag': list(frequency.keys()),
                           'count': list(frequency.values())})
    hashtag_data = hashtag_data.nlargest(15, columns="count")

    return hashtag_data

#### Replacing url's and symbols

In [21]:
#remove all url/websites
pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+' #url regualr expressions
subs_url = r'url-web' # replace each url with 'url-web'
data_lang['text'] = data_lang['text'].replace(to_replace = pattern_url, value = subs_url, regex = True)
# make all lower case 
data_lang['text'] = data_lang['text'].str.lower()

#Removing RT ftom tweets
data_lang['text'] = data_lang['text'].str.strip('rt ')

# Remove @ mentions
pattern = r"@[\w]+" # pattern to remove
sub = r'' # what to replace it with
data_lang['text'] = data_lang['text'].replace(to_replace = pattern, value = sub, regex = True) #replace

In [22]:
def lookup_dict(text, dictionary):
    
    for word in text.split(): 
        if word.lower() in dictionary: 
            if word.lower() in text.split():
                text = text.replace(word, dictionary[word.lower()]) 
    return text

#### Removing punctuations

In [23]:
#remove puntuation
data_lang['text'] = data_lang['text'].apply(lambda x: ''.join([l for l in x if l not in string.punctuation]))
data_lang

Unnamed: 0,lang_id,text
0,xho,umgaqosiseko wenza amalungiselelo kumaziko axh...
1,xho,idha iya kuba nobulumko bokubeka umsebenzi nap...
2,eng,he province of kwazulunatal department of tran...
3,nso,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...
...,...,...
32995,tsn,popo ya dipolateforomo tse ke go tlisa boetele...
32996,sot,modise mosadi na o ntse o sa utlwe hore thaban...
32997,eng,closing date for the submission of completed t...
32998,xho,nawuphina umntu ofunyenwe enetyala phantsi kwa...


#### Tokentizing  the text column

In [24]:
#tokenizing the tweets
tokeniser = TreebankWordTokenizer()
data_lang['tokenized'] = data_lang['text'].apply(tokeniser.tokenize)

In [26]:
data_lang.head()

Unnamed: 0,lang_id,text,tokenized
0,xho,umgaqosiseko wenza amalungiselelo kumaziko axh...,"[umgaqosiseko, wenza, amalungiselelo, kumaziko..."
1,xho,idha iya kuba nobulumko bokubeka umsebenzi nap...,"[idha, iya, kuba, nobulumko, bokubeka, umseben..."
2,eng,he province of kwazulunatal department of tran...,"[he, province, of, kwazulunatal, department, o..."
3,nso,o netefatša gore o ba file dilo ka moka tše le...,"[o, netefatša, gore, o, ba, file, dilo, ka, mo..."
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...,"[khomishini, ya, ndinganyiso, ya, mbeu, yo, ew..."


### 4. Splitting the data

In [27]:
#Splitting features and target variables
X = train['text']#X is the features of the cleaned tweets
y = train['lang_id'] #Y is the target variable which is the train sentiment
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 11)

In [28]:
tfidf = TfidfVectorizer() #Call the TFidfVectorizer
cf= CountVectorizer() #Call the CountVectorizer

#### Training model and evaluation

### 5. Performance Metrics for model evaluation

We will evaluate our models using the the F1 Score which is the number of true instances for each label.

#### Precision

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations

$$ Precision = \frac{TP}{TP \space + FP} = \frac{TP}{Total \space Predicted \space Positive} $$

#### Recall

The recall is intuitively the ability of the classifier to find all the positive samples

$$ Recall = \frac{TP}{TP \space + FN} = \frac{TP}{Total \space Actual \space Positive}$$

#### F1 Score

Weighted average of precision and recall. 

$$F_1 = 2 \times \frac {Precision \space \times \space Recall }{Precision \space + \space Recall }$$

In [29]:
l_r = LogisticRegression(C=1, class_weight='balanced', max_iter=1000)
# call the model
clf_lr = Pipeline([('tfidf', tfidf), ('clf', l_r)]) #Create a pipeline
clf_lr.fit(X_train, y_train) #Fit the training data to the pipeline
y_pred_lr= clf_lr.predict(X_test)#Make predictions
print('accuracy %s' % accuracy_score(y_pred_lr, y_test)) #Print the accuracy
print('f1_score %s' % metrics.f1_score(y_test,y_pred_lr,average='weighted')) #Print the weighted f1 score
print(classification_report(y_test, y_pred_lr)) #Classification

accuracy 0.9954545454545455
f1_score 0.9954522748128464
              precision    recall  f1-score   support

         afr       1.00      1.00      1.00       323
         eng       1.00      1.00      1.00       280
         nbl       0.98      0.99      0.98       295
         nso       1.00      1.00      1.00       316
         sot       1.00      1.00      1.00       307
         ssw       0.99      1.00      0.99       295
         tsn       1.00      1.00      1.00       296
         tso       1.00      1.00      1.00       297
         ven       1.00      1.00      1.00       258
         xho       0.99      1.00      0.99       317
         zul       0.99      0.98      0.98       316

    accuracy                           1.00      3300
   macro avg       1.00      1.00      1.00      3300
weighted avg       1.00      1.00      1.00      3300



In [30]:
n_b = MultinomialNB()
clf_nb= Pipeline([('tfidf', tfidf), ('clf', n_b)])
clf_nb.fit(X_train, y_train)
y_pred_nb = clf_nb.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred_nb, y_test)) #Print the accuracy
print('f1_score %s' % metrics.f1_score(y_test,y_pred_nb,average='weighted')) #Print the f1 score
print(classification_report(y_test, y_pred_nb)) #Print out the classification

accuracy 0.9981818181818182
f1_score 0.99818151138554
              precision    recall  f1-score   support

         afr       1.00      1.00      1.00       323
         eng       0.99      1.00      1.00       280
         nbl       0.99      1.00      0.99       295
         nso       1.00      1.00      1.00       316
         sot       1.00      1.00      1.00       307
         ssw       1.00      1.00      1.00       295
         tsn       1.00      1.00      1.00       296
         tso       1.00      1.00      1.00       297
         ven       1.00      1.00      1.00       258
         xho       1.00      1.00      1.00       317
         zul       1.00      0.99      1.00       316

    accuracy                           1.00      3300
   macro avg       1.00      1.00      1.00      3300
weighted avg       1.00      1.00      1.00      3300



In [31]:
#MultinomialNB Hyperparameter tuning
tfid = TfidfVectorizer()
text = tfid.fit_transform(train['text'])
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(text,y, test_size = 0.2, random_state = 10)
params = {'alpha':[0.01,0.1,1]}

grid_MNB = GridSearchCV(MultinomialNB(), params)
grid_MNB.fit(X_train_h, y_train_h)
print(grid_MNB.best_params_)

{'alpha': 0.1}


In [15]:
multi = Pipeline([('tfid', TfidfVectorizer()),
             ('clf', MultinomialNB(alpha = 0.1))])
multi.fit(X_train, y_train)
t = test['text']
y_pred_m = multi.predict(t)
sub = pd.DataFrame( data = {'index': test['index'],
                             'lang_id': y_pred_m })
sub.to_csv('submission_m.csv', index = False)
print('accuracy %s' % accuracy_score(y_pred_nb, y_test)) #Print the accuracy
print('f1_score %s' % metrics.f1_score(y_test,y_pred_nb,average='weighted')) #Print the f1 score
print(classification_report(y_test, y_pred_nb)) #Print out the classification

accuracy 0.9981818181818182
f1_score 0.99818151138554
              precision    recall  f1-score   support

         afr       1.00      1.00      1.00       323
         eng       0.99      1.00      1.00       280
         nbl       0.99      1.00      0.99       295
         nso       1.00      1.00      1.00       316
         sot       1.00      1.00      1.00       307
         ssw       1.00      1.00      1.00       295
         tsn       1.00      1.00      1.00       296
         tso       1.00      1.00      1.00       297
         ven       1.00      1.00      1.00       258
         xho       1.00      1.00      1.00       317
         zul       1.00      0.99      1.00       316

    accuracy                           1.00      3300
   macro avg       1.00      1.00      1.00      3300
weighted avg       1.00      1.00      1.00      3300



In [32]:
# Make Submission
My_submission = pd.DataFrame(test['index'])
My_submission['lang_id'] = clf_nb.predict(test['text'])
My_submission.to_csv('Tebelelo_Selowa_Classification_Hackathon',index=False)