# Introduction

South Africa is a multicultural society characterised by rich linguistic diversity. Language is an indispensable tool that can deepen democracy and contribute to the social, cultural, intellectual, economic and political life of South African society. With such a multilingual population, it is natural that our systems and devices should also communicate in multiple languages.

The challenge was to compete in a Kaggle competition by training a classification model that predicts which South African language a piece of text is written in.

### Import Libraries

In [194]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, f1_score
import warnings
warnings.filterwarnings('ignore')

### Import Data sets

In [195]:
train = pd.read_csv('train_set.csv')
test = pd.read_csv('test_set.csv')
sample_submission = pd.read_csv('sample_submission.csv')

### Accessing the data sets

In [196]:
train.head()

|   | lang_id | text |
|---|---------|------|
| 0 | xho | umgaqo-siseko wenza amalungiselelo kumaziko ax... |
| 1 | xho | i-dha iya kuba nobulumko bokubeka umsebenzi na... |
| 2 | eng | the province of kwazulu-natal department of tr... |
| 3 | nso | o netefatša gore o ba file dilo ka moka tše le... |
| 4 | ven | khomishini ya ndinganyiso ya mbeu yo ewa maana... |


In [197]:
test.head()

|   | index | text |
|---|-------|------|
| 0 | 1 | Mmasepala, fa maemo a a kgethegileng a letlele... |
| 1 | 2 | Uzakwaziswa ngokufaneleko nakungafuneka eminye... |
| 2 | 3 | Tshivhumbeo tshi fana na ngano dza vhathu. |
| 3 | 4 | Kube inja nelikati betingevakala kutsi titsini... |
| 4 | 5 | Winste op buitelandse valuta. |


In [198]:
sample_submission.head()

|   | index | lang_id |
|---|-------|---------|
| 0 | 1 | tsn |
| 1 | 2 | nbl |


# Model Building

### Defining the features and target variable

In [199]:
X = train['text']
y = train['lang_id']

### Splitting the training data into a training and validation set

In [200]:
# Hold out the last 25% of rows as a validation set.
# With shuffle=False the split is positional, so random_state has no effect.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False, stratify=None, random_state=42)
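
Because `shuffle=False` makes the split purely positional, the validation set only covers every language if the rows of `train_set.csv` are already mixed. The sketch below uses made-up toy labels (not the competition data) to contrast the positional split with a stratified one. The stratified variant is an alternative shown for comparison, not what this notebook does.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 8 rows per class, grouped by label.
texts = [f"text {i}" for i in range(16)]
labels = ["eng"] * 8 + ["afr"] * 8

# Positional split, as in the cell above: the last 25% of rows are held out.
_, _, _, y_val_pos = train_test_split(
    texts, labels, test_size=0.25, shuffle=False
)

# Stratified split (alternative): shuffles while keeping class proportions.
_, _, _, y_val_strat = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

positional_counts = dict(Counter(y_val_pos))
stratified_counts = dict(Counter(y_val_strat))
print(positional_counts)
print(stratified_counts)
```

On grouped data like this toy example, the positional validation set contains a single class, while the stratified one contains both in equal numbers.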

### Data transformation with TfidfVectorizer

In [201]:
# Pipeline: TF-IDF features followed by a multinomial Naive Bayes classifier
clf = text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
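
To make the `tfidf` step less of a black box, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: term count times a smoothed inverse document frequency, followed by L2 normalisation of each row. Tokenisation is simplified to a whitespace split, which is an assumption made for brevity, and the three toy documents are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2 norm)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({term: w / norm for term, w in raw.items()})
    return rows


docs = ["die kat sit", "the cat sat", "die hond sit"]
weights = tfidf(docs)
for doc, row in zip(docs, weights):
    print(doc, {term: round(w, 3) for term, w in row.items()})
```

Terms that occur in fewer documents, such as `kat` here, receive a larger weight than terms shared across documents, such as `die`.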

### Model fitting

In [202]:
#Fitting the model
model = clf.fit(X_train, y_train)

### Predictions

In [203]:
# Making predictions on the validation set
y_pred = model.predict(X_test)
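
For readers who want to try the fit and predict steps without the competition files, the following sketch runs the same kind of pipeline on a few made-up sentences. The sentences, labels and the names `toy_clf`, `toy_pred` and `toy_proba` are illustrative assumptions; with so little data the predictions themselves carry no weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, not taken from train_set.csv.
toy_texts = [
    "the cat sits in the house",
    "the dog runs in the garden",
    "die kat sit in die huis",
    "die hond hardloop in die tuin",
]
toy_labels = ["eng", "eng", "afr", "afr"]

toy_clf = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
toy_clf.fit(toy_texts, toy_labels)

new_texts = ["the dog is in the house", "die kat is in die tuin"]
toy_pred = toy_clf.predict(new_texts)         # one label per input text
toy_proba = toy_clf.predict_proba(new_texts)  # one row of class probabilities per text

for text, label, proba in zip(new_texts, toy_pred, toy_proba):
    print(text, "->", label, dict(zip(toy_clf.classes_, proba.round(3))))
```

`predict_proba` can be useful for spotting texts where the model is unsure, for example between closely related languages.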

### Classification Report

In [204]:
#Printing the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         afr       1.00      1.00      1.00       735
         eng       0.99      1.00      1.00       769
         nbl       1.00      0.99      1.00       765
         nso       1.00      1.00      1.00       711
         sot       1.00      1.00      1.00       769
         ssw       1.00      1.00      1.00       793
         tsn       1.00      1.00      1.00       748
         tso       1.00      1.00      1.00       748
         ven       1.00      1.00      1.00       756
         xho       1.00      1.00      1.00       723
         zul       1.00      0.99      1.00       733

    accuracy                           1.00      8250
   macro avg       1.00      1.00      1.00      8250
weighted avg       1.00      1.00      1.00      8250
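
The columns of the report can be reproduced by hand. The sketch below computes per-class precision, recall and F1, plus the macro average, on a small made-up set of labels. The helper name `per_class_scores` and the toy labels are assumptions for illustration only.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Hypothetical labels, chosen so that the three classes score differently.
y_true_toy = ["zul", "zul", "xho", "xho", "nbl", "nbl"]
y_pred_toy = ["zul", "xho", "xho", "xho", "nbl", "zul"]

scores = per_class_scores(y_true_toy, y_pred_toy)
# Macro average: unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)

for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro f1={macro_f1:.4f}")
```

The macro average treats every language equally regardless of its support, whereas the weighted average in the report weights each class by its number of samples.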



### Submission to Kaggle

In [185]:
# Predict the language of each test row and write the submission file
Sub_df = pd.DataFrame(test['index'])
Sub_df['lang_id'] = text_clf.predict(test['text'])
Sub_df.to_csv('language.csv', index=False)
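
Before uploading, it can help to compare the submission against the sample file. The sketch below is one possible check, using small in-memory frames in place of the real files; the helper `submission_problems` and the toy frames are hypothetical.

```python
import pandas as pd


def submission_problems(submission, sample, n_test_rows):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("columns differ from the sample file")
    if len(submission) != n_test_rows:
        problems.append("row count differs from the test set")
    if "lang_id" in submission.columns and submission["lang_id"].isna().any():
        problems.append("missing predictions in lang_id")
    return problems


# Toy frames standing in for sample_submission.csv and language.csv.
sample = pd.DataFrame({"index": [1, 2], "lang_id": ["tsn", "nbl"]})
good = pd.DataFrame({"index": [1, 2, 3], "lang_id": ["tsn", "nbl", "ven"]})
bad = good.rename(columns={"lang_id": "language"})

print(submission_problems(good, sample, n_test_rows=3))
print(submission_problems(bad, sample, n_test_rows=3))
```

In the notebook itself, the equivalent call would pass `Sub_df`, `sample_submission` and `len(test)`.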