## South African Language Identification Hack 2022

### Overview

South Africa is a multicultural society that is characterised by its rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and also contribute to the social, cultural, intellectual, economic and political life of the South African society.

The country is multilingual with 11 official languages, each of which is guaranteed equal status. Most South Africans are multilingual and able to speak at least two of the official languages.
In this challenge, we take a text written in any of South Africa's 11 official languages and identify which language it is in. This is an instance of language identification, the NLP task of determining the natural language that a piece of text is written in.

### Problem statement

Develop a machine learning model that predicts which of the 11 official South African languages a given text is written in.

### Importing relevant packages

In [145]:
# Data loading and Text processing
import numpy as np
import pandas as pd
import string
import nltk
from sklearn.feature_extraction.text import CountVectorizer


# Data Visualisation
import matplotlib.pyplot as plt

# Modeling and Evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.preprocessing import StandardScaler


### Loading Data

In [146]:
# read train dataset
train_set = pd.read_csv(r'C:\Users\b1806\Desktop\experiment 2\South-African-Language-Identification-Hack-2022\train_set.csv')

# read test dataset
test_set = pd.read_csv(r'C:\Users\b1806\Desktop\experiment 2\South-African-Language-Identification-Hack-2022\test_set.csv')

### Exploratory data analysis

Exploratory data analysis is the process of deriving insights from our dataset without making any assumptions. Here we will use both graphical and non-graphical exploratory data analysis.

#### Overview of training set

In [4]:
#Training set
train_set.head()

| | lang_id | text |
|---|---|---|
| 0 | xho | umgaqo-siseko wenza amalungiselelo kumaziko ax... |
| 1 | xho | i-dha iya kuba nobulumko bokubeka umsebenzi na... |
| 2 | eng | the province of kwazulu-natal department of tr... |
| 3 | nso | o netefatša gore o ba file dilo ka moka tše le... |
| 4 | ven | khomishini ya ndinganyiso ya mbeu yo ewa maana... |


In [5]:
test_set.head()

| | index | text |
|---|---|---|
| 0 | 1 | Mmasepala, fa maemo a a kgethegileng a letlele... |
| 1 | 2 | Uzakwaziswa ngokufaneleko nakungafuneka eminye... |
| 2 | 3 | Tshivhumbeo tshi fana na ngano dza vhathu. |
| 3 | 4 | Kube inja nelikati betingevakala kutsi titsini... |
| 4 | 5 | Winste op buitelandse valuta. |


#### Analysis of languages

In [11]:
# Counting the occurrences of each language in the training set
Language_counts = train_set['lang_id'].value_counts()
print(Language_counts)

xho    3000
eng    3000
nso    3000
ven    3000
tsn    3000
nbl    3000
zul    3000
ssw    3000
tso    3000
sot    3000
afr    3000
Name: lang_id, dtype: int64


Each of the 11 languages appears 3000 times, so the training set is perfectly balanced.

In [12]:
#Viewing data type of each column
train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   lang_id  33000 non-null  object
 1   text     33000 non-null  object
dtypes: object(2)
memory usage: 515.8+ KB


We can see that there are 33000 rows and no null values.

#### Cleaning data

We will perform minimal cleaning on our dataset: converting the text to lowercase and removing punctuation.

In [18]:
def clean_data(text):   
    
    # change the case of all the words in the text to lowercase 
    text = text.lower()
    
    # remove punctuation
    text = "".join([x for x in text if x not in string.punctuation])
    return text
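
As a quick illustration of what this cleaning step does, here is a minimal sketch that repeats the same logic on two sample strings. The sample sentences are assumptions modelled on the previews above, not rows checked against the dataset. Note that `string.punctuation` covers ASCII punctuation only, and that it includes the hyphen, so hyphenated prefixes are fused to the following word.

```python
import string


def clean_data(text):
    """Lowercase the text and drop ASCII punctuation (same logic as the cell above)."""
    text = text.lower()
    return "".join(ch for ch in text if ch not in string.punctuation)


# Hypothetical sample with a capital letter, a comma and an exclamation mark
cleaned = clean_data("Mmasepala, fa maemo a a kgethegileng!")
print(cleaned)  # mmasepala fa maemo a a kgethegileng

# The hyphen is punctuation too, so "i-dha" becomes "idha"
hyphenated = clean_data("i-dha iya kuba")
print(hyphenated)  # idha iya kuba

# Non-ASCII letters such as "š" are left untouched
accented = clean_data("o netefatša gore")
print(accented)  # o netefatša gore
```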

In [21]:
#cleaning train dataset
train_set['text'] = train_set['text'].apply(clean_data)
#cleaning test dataset
test_set['text'] = test_set['text'].apply(clean_data)

#### Transforming Text into Numbers

Let's convert our language labels into integers as follows: {'xho':1, 'eng':2, 'nso':3, 'ven':4, 'tsn':5, 'nbl':6, 'zul':7, 'ssw':8, 'tso':9, 'sot':10, 'afr':11}

In [36]:
lang_dict = {'xho':1, 'eng':2, 'nso':3, 'ven':4, 'tsn':5, 'nbl':6,'zul':7, 'ssw':8, 'tso':9, 'sot':10, 'afr':11}

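# assign the mapped column back instead of calling replace(..., inplace=True) on a
# column selection, which newer pandas versions may not propagate to the DataFrame
train_set['lang_id'] = train_set['lang_id'].map(lang_dict)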

In [39]:
train_set['lang_id'].unique()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11], dtype=int64)
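
One way to guard this encoding step is to check that every label was actually mapped. The sketch below uses a small hypothetical `labels` Series (an assumption, not the real training column) and relies on the fact that `Series.map` with a dict returns `NaN` for any value missing from the dict.

```python
import pandas as pd

lang_dict = {'xho': 1, 'eng': 2, 'nso': 3, 'ven': 4, 'tsn': 5, 'nbl': 6,
             'zul': 7, 'ssw': 8, 'tso': 9, 'sot': 10, 'afr': 11}

# Hypothetical label column; 'fra' is deliberately not in lang_dict
labels = pd.Series(['xho', 'eng', 'afr', 'fra'])

encoded = labels.map(lang_dict)

# Labels missing from the dictionary become NaN, which makes them easy to find
unmapped = labels[encoded.isna()].tolist()
print(unmapped)  # ['fra']

# Keep only the rows that were mapped and store the codes as integers
encoded_clean = encoded.dropna().astype(int)
print(encoded_clean.tolist())  # [1, 2, 11]
```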

Let's extract our features using a count vectorizer.

In [134]:
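# A single vectorizer is shared by the train and test sets so that both
# matrices use the same vocabulary (same columns, in the same order)
vect = CountVectorizer(lowercase=True, max_features=10000)

def convert_text_numbers(list_of_words, fit=False):
    texts = list_of_words.values.astype(str)
    # learn the vocabulary only when fit=True, otherwise reuse it
    count = vect.fit_transform(texts) if fit else vect.transform(texts)
    matrix = count.toarray()
    return matrix

In [135]:
# converting messages in train set to numbers (learns the vocabulary)
X = convert_text_numbers(train_set['text'], fit=True)

# converting messages in test set to numbers (reuses the train vocabulary)
Xtest = convert_text_numbers(test_set['text'])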


In [136]:
X.shape

(33000, 10000)
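
For a model trained on one matrix to make sense of another, column j has to stand for the same word in both. A minimal sketch of that idea on toy sentences (the sentences and the `_demo` names are assumptions for illustration): the vocabulary is learned once with `fit_transform` and then reused with `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora, not taken from the dataset
train_texts = ["die kat sit op die mat", "the cat sat on the mat"]
test_texts = ["die hond sit", "the dog sat"]

vect_demo = CountVectorizer()
X_train_demo = vect_demo.fit_transform(train_texts)  # learns the vocabulary
X_test_demo = vect_demo.transform(test_texts)        # reuses the same columns

print(X_train_demo.shape, X_test_demo.shape)  # (2, 9) (2, 9)

# Words that never appeared in the training text are simply ignored
print("hond" in vect_demo.vocabulary_)  # False

# The column for "die" is the same in both matrices
die_col = vect_demo.vocabulary_["die"]
print(X_train_demo[0, die_col], X_test_demo[0, die_col])  # 2 1
```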

In [137]:
y= train_set['lang_id']

In [139]:
n= 10000

In [141]:
#splitting dataset into training and validation
X_train, X_test, y_train, y_test = train_test_split(X[:n], y[:n])
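
By default `train_test_split` holds out 25% of the rows and shuffles them with a fresh random seed on every run. If a reproducible, class-balanced validation set is wanted, one option is to pass `random_state` and `stratify`. The sketch below shows this on toy arrays; the data and the value 42 are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix with 20 rows and two balanced classes
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([1] * 10 + [2] * 10)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo,
    test_size=0.25,     # same proportion as the default
    stratify=y_demo,    # keep class proportions similar in both splits
    random_state=42,    # fixed seed so the split is repeatable
)

print(X_tr.shape, X_val.shape)  # (15, 2) (5, 2)

# Running the split again with the same seed gives the same validation rows
_, X_val_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)
print(np.array_equal(X_val, X_val_again))  # True
```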

### Modelling and Evaluation

We will build several types of classification models and compare their performance.

In [142]:
names = ['Logistic Regression', 'Nearest Neighbors', 
         'Linear SVM', 'RBF SVM',          
         'Decision Tree', 'Random Forest',  'AdaBoost']

In [143]:
classifiers = [
    LogisticRegression(), 
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),    
    AdaBoostClassifier()
]
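
The `names` and `classifiers` lists can be paired with `zip` to fit and score every model in one loop. The following is a sketch under stated assumptions: it runs on synthetic data from `make_classification` rather than the language data, and it sets `max_iter=1000` on the logistic regression to avoid convergence warnings, so the scores it prints say nothing about the real task.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

names = ['Logistic Regression', 'Nearest Neighbors',
         'Linear SVM', 'RBF SVM',
         'Decision Tree', 'Random Forest', 'AdaBoost']

classifiers = [
    LogisticRegression(max_iter=1000),  # max_iter is an added assumption
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    AdaBoostClassifier(),
]

# Synthetic stand-in for the vectorised text features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=0)

# Fit each model and record its accuracy on the held-out rows
scores = {}
for name, clf in zip(names, classifiers):
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_val, y_val)

# Print the models from best to worst validation accuracy
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} {score:.3f}")
```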

In [149]:
SVM = SVC(gamma=2, C=1)

In [None]:
SVM.fit(X_train, y_train)

In [None]:
y_valsvm = SVM.predict(X_test)

In [80]:
print(classification_report(y_test,y_valsvm))

              precision    recall  f1-score   support

           1       0.99      0.99      0.99       429
           2       0.99      1.00      1.00       471
           3       1.00      1.00      1.00       441
           4       1.00      1.00      1.00       452
           5       1.00      1.00      1.00       433
           6       0.99      0.99      0.99       462
           7       0.98      0.97      0.98       431
           8       0.99      0.99      0.99       450
           9       1.00      1.00      1.00       437
          10       1.00      1.00      1.00       449
          11       1.00      0.99      1.00       495

    accuracy                           0.99      4950
   macro avg       0.99      0.99      0.99      4950
weighted avg       0.99      0.99      0.99      4950



In [87]:
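# `log` is not defined anywhere in this notebook; the fitted SVM is used here,
# in line with the submission file name below
Prediction = SVM.predict(Xtest)
A = Prediction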

In [89]:
lang_dict = {'xho':1, 'eng':2, 'nso':3, 'ven':4, 'tsn':5, 'nbl':6,'zul':7, 'ssw':8, 'tso':9, 'sot':10, 'afr':11}



In [101]:
duh =np.array(A,dtype='object')

In [104]:
duh.dtype
A =duh
A.dtype

dtype('O')

In [105]:
A[A == 1] = 'xho'
A[A == 2] = 'eng'
A[A == 3] = 'nso'
A[A == 4] = 'ven'
A[A == 5] = 'tsn'
A[A == 6] = 'nbl'
A[A == 7] = 'zul'
A[A == 8] = 'ssw'
A[A == 9] = 'tso'
A[A == 10] = 'sot'
A[A == 11] = 'afr'

In [106]:
A

array(['ssw', 'ssw', 'ssw', ..., 'eng', 'xho', 'eng'], dtype=object)
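
The same decoding can also be written by inverting `lang_dict` and looking each prediction up in it, which avoids one assignment per language. A minimal sketch, assuming a small hypothetical `preds` array in place of the real predictions:

```python
import numpy as np

lang_dict = {'xho': 1, 'eng': 2, 'nso': 3, 'ven': 4, 'tsn': 5, 'nbl': 6,
             'zul': 7, 'ssw': 8, 'tso': 9, 'sot': 10, 'afr': 11}

# Swap keys and values: integer code -> language label
inverse_lang_dict = {code: lang for lang, code in lang_dict.items()}

# Hypothetical integer predictions standing in for the model output
preds = np.array([8, 2, 1, 11])

# Look up each code; int() converts the NumPy integer to a plain Python int
decoded = np.array([inverse_lang_dict[int(p)] for p in preds], dtype=object)
print(decoded.tolist())  # ['ssw', 'eng', 'xho', 'afr']
```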

In [110]:
test_set['index']

0          1
1          2
2          3
3          4
4          5
        ... 
5677    5678
5678    5679
5679    5680
5680    5681
5681    5682
Name: index, Length: 5682, dtype: int64

In [116]:
output = pd.DataFrame({'index':test_set['index'], 'lang_id':A})
output.to_csv('submission_svm.csv', index=False)
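
Before uploading a submission, it can help to run a few sanity checks on the frame. The sketch below uses a small hypothetical `output_demo` frame (its rows are made up for illustration) and checks that the indices are unique, that every label is one of the 11 language codes, and that the CSV header has the expected columns.

```python
import pandas as pd

valid_langs = {'xho', 'eng', 'nso', 'ven', 'tsn', 'nbl',
               'zul', 'ssw', 'tso', 'sot', 'afr'}

# Hypothetical submission rows, not real predictions
output_demo = pd.DataFrame({'index': [1, 2, 3],
                            'lang_id': ['tsn', 'nbl', 'ven']})

# Each test row should appear exactly once
indices_unique = output_demo['index'].is_unique

# Every predicted label should be a known language code
labels_valid = bool(output_demo['lang_id'].isin(valid_langs).all())

# With no path given, to_csv returns the CSV content as a string
csv_text = output_demo.to_csv(index=False)
header = csv_text.splitlines()[0]

print(indices_unique, labels_valid, header)  # True True index,lang_id
```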