# Overview
South Africa is a multicultural society characterised by rich linguistic diversity. Language is an indispensable tool for deepening democracy, and it contributes to the social, cultural, intellectual, economic and political life of South African society.

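The country has 11 official languages, each of which is guaranteed equal status. Most South Africans are multilingual and speak at least two of the official languages. This notebook trains a classifier to identify which of these languages a given text is written in, using TF-IDF features and a Multinomial Naive Bayes model.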

### Imports

In [38]:
import nltk

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score, classification_report, confusion_matrix



### Load the data

In [39]:
# Load the South African Language Identification data
df = pd.read_csv('train_set.csv')
df.head()

| | lang_id | text |
|---|---|---|
| 0 | xho | umgaqo-siseko wenza amalungiselelo kumaziko ax... |
| 1 | xho | i-dha iya kuba nobulumko bokubeka umsebenzi na... |
| 2 | eng | the province of kwazulu-natal department of tr... |
| 3 | nso | o netefatša gore o ba file dilo ka moka tše le... |
| 4 | ven | khomishini ya ndinganyiso ya mbeu yo ewa maana... |


### Data Preprocessing

In [40]:
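# Separate the features (text) from the labels (language IDs)
X = df['text']
y = df['lang_id']

# Hold out 20% of the rows for evaluation; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)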
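The split above is purely random, so the per-language proportions in the two splits can drift slightly. A minimal sketch of a stratified alternative follows, assuming class balance matters for the evaluation. The `toy_df` frame and its rows are hypothetical stand-ins for `df`, and `stratify` is an optional `train_test_split` parameter that the notebook itself does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the notebook's `df`, with the same column names.
toy_df = pd.DataFrame({
    "text": [f"sample {i}" for i in range(20)],
    "lang_id": ["xho"] * 10 + ["eng"] * 10,
})
X_toy, y_toy = toy_df["text"], toy_df["lang_id"]

# stratify=y_toy keeps the per-language proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

With ten rows per language and `test_size=0.2`, each language contributes two rows to the test split and eight to the training split.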

### Feature Extraction


In [41]:
# Convert the text into TF-IDF features.
# Fit the vocabulary on the training split only, then reuse it for the test split.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
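`TfidfVectorizer()` with default settings builds word-level features. For language identification, character n-grams are a commonly used alternative because they capture spelling patterns. The sketch below compares the two on a pair of toy strings taken from the preview rows above. It illustrates the `analyzer` and `ngram_range` parameters and is not the configuration this notebook uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two short toy strings (fragments of the preview rows shown earlier).
toy_texts = [
    "the province of kwazulu-natal",
    "umgaqo-siseko wenza amalungiselelo",
]

# Word-level features: the TfidfVectorizer default.
word_vec = TfidfVectorizer()
word_features = word_vec.fit_transform(toy_texts)

# Character 1- to 3-grams, built only from text inside word boundaries.
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
char_features = char_vec.fit_transform(toy_texts)

print("word features:", word_features.shape)
print("char features:", char_features.shape)
```

Both matrices have one row per document. The character-level vocabulary is larger, so whether it helps on the full data set would need to be checked against the held-out split.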


### Model

In [42]:
# Train a Multinomial Naive Bayes classifier on the TF-IDF features
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)

In [43]:
# Predict a language ID for every text in the held-out test split
predictions = classifier.predict(X_test_tfidf)
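The vectorizer and classifier are fitted and applied as two separate objects here. One possible way to keep them together is a scikit-learn pipeline, sketched below on toy data. The training strings and labels are hypothetical (two are fragments of the preview rows, the rest are made up for illustration), and `pipe` is not a name used elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data: two English and two isiXhosa strings.
toy_texts = [
    "the province of kwazulu-natal",
    "the department of transport",
    "umgaqo-siseko wenza amalungiselelo",
    "i-dha iya kuba nobulumko",
]
toy_labels = ["eng", "eng", "xho", "xho"]

# The pipeline fits the vectorizer and the classifier in one call and
# applies the same fitted vocabulary automatically at prediction time.
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(toy_texts, toy_labels)

toy_prediction = pipe.predict(["the province"])
print(toy_prediction)
```

A single fitted object like this takes raw text directly, which removes the risk of calling `fit_transform` on new data by mistake.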

### Assessing the Model Performance

In [44]:
# Evaluation of how well the model performs
f1 = f1_score(y_test, predictions, average='weighted')
print(f"F1 Score: {f1}")

print("\nClassification Report:")
print(classification_report(y_test, predictions))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, predictions))

F1 Score: 0.9980299054262277

Classification Report:
              precision    recall  f1-score   support

         afr       1.00      1.00      1.00       583
         eng       1.00      1.00      1.00       615
         nbl       0.99      1.00      0.99       583
         nso       1.00      1.00      1.00       625
         sot       1.00      1.00      1.00       618
         ssw       1.00      1.00      1.00       584
         tsn       1.00      1.00      1.00       598
         tso       1.00      1.00      1.00       561
         ven       1.00      1.00      1.00       634
         xho       1.00      1.00      1.00       609
         zul       1.00      0.99      0.99       590

    accuracy                           1.00      6600
   macro avg       1.00      1.00      1.00      6600
weighted avg       1.00      1.00      1.00      6600


Confusion Matrix:
[[583   0   0   0   0   0   0   0   0   0   0]
 [  0 615   0   0   0   0   0   0   0   0   0]
 [  1   0 582   0   0
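A raw confusion matrix is hard to read without row and column labels. The sketch below wraps the matrix in a labelled `DataFrame`, using hypothetical toy labels rather than the notebook's actual predictions. In the notebook, `y_test`, `predictions` and `classifier.classes_` could be passed in the same way.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels; one "zul" text is misclassified as "nbl".
y_true = ["afr", "eng", "zul", "zul", "nbl"]
y_pred = ["afr", "eng", "zul", "nbl", "nbl"]

# Fix the label order explicitly so rows and columns line up with the names.
labels = sorted(set(y_true))
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Rows are the true languages, columns are the predicted languages.
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
cm_df.index.name = "true"
cm_df.columns.name = "predicted"
print(cm_df)
```

Off-diagonal cells then show directly which pairs of languages are confused with each other.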

### Create Predictions on Unseen Data

In [47]:

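# Load the unlabelled test set and reuse the fitted vectorizer (transform only, no refit)
test_df = pd.read_csv('test_set.csv')
X_submission = test_df['text']
X_submission_tfidf = vectorizer.transform(X_submission)

# Predict a language ID for each row
submission_predictions = classifier.predict(X_submission_tfidf)

# Pair each prediction with its row index and write the submission file
submission_df = pd.DataFrame({'index': test_df['index'], 'lang_id': submission_predictions})
submission_df.to_csv('submission.csv', index=False)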
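Before uploading the file, a quick sanity check can catch a malformed submission. The helper below is a hypothetical sketch and not part of the original notebook. It assumes the expected columns are `index` and `lang_id`, as in the cell above, and it is demonstrated on a toy frame.

```python
import pandas as pd

def check_submission(submission, n_expected, known_labels):
    """Return a list of problems; an empty list means none were found."""
    if list(submission.columns) != ["index", "lang_id"]:
        return ["unexpected columns: " + ", ".join(map(str, submission.columns))]
    problems = []
    if len(submission) != n_expected:
        problems.append(f"expected {n_expected} rows, found {len(submission)}")
    if submission["lang_id"].isna().any():
        problems.append("missing predictions")
    # Any label the classifier was never trained on indicates a mix-up.
    unknown = set(submission["lang_id"].dropna()) - set(known_labels)
    if unknown:
        problems.append(f"unknown labels: {sorted(unknown)}")
    return problems

# Hypothetical toy submission for demonstration.
toy_submission = pd.DataFrame({"index": [1, 2, 3], "lang_id": ["xho", "eng", "zul"]})
ok = check_submission(toy_submission, n_expected=3, known_labels=["eng", "xho", "zul"])
bad = check_submission(toy_submission, n_expected=4, known_labels=["eng", "xho"])
print(ok)
print(bad)
```

In the notebook, the equivalent call would be `check_submission(submission_df, len(test_df), classifier.classes_)`.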