## South African Language Identification Hack 2022

### Overview

South Africa is a multicultural society characterised by rich linguistic diversity. Language is an indispensable tool that can deepen democracy and contribute to the social, cultural, intellectual, economic and political life of South African society.

The country has 11 official languages, each of which is guaranteed equal status. Most South Africans are multilingual and can speak two or more of the official languages.
In this challenge, we take a text written in any of South Africa's 11 official languages and identify which language it is in. This is an example of language identification, the NLP task of determining the natural language that a piece of text is written in.

### Problem statement

The goal is to develop a machine learning model that predicts which of the 11 official South African languages a given text is written in.

### Importing relevant packages

In [17]:
# Data loading and text processing
import numpy as np
import pandas as pd
import string
import nltk

# Data visualisation
import matplotlib.pyplot as plt

# Modelling and evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

### Loading Data

In [2]:
# read train dataset
train_set = pd.read_csv(r'C:\Users\b1806\Desktop\experiment 2\South-African-Language-Identification-Hack-2022\train_set.csv')

# read test dataset
test_set = pd.read_csv(r'C:\Users\b1806\Desktop\experiment 2\South-African-Language-Identification-Hack-2022\test_set.csv')
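
The absolute paths above only resolve on the author's machine. A minimal sketch of a more portable loader, assuming both CSV files sit in the same directory; the `DATA_DIR` value and the `load_datasets` helper are hypothetical names introduced for illustration and are not part of the original notebook:

```python
from pathlib import Path

import pandas as pd

# Hypothetical location of the competition files; adjust to your own setup.
DATA_DIR = Path("South-African-Language-Identification-Hack-2022")


def load_datasets(data_dir=DATA_DIR):
    """Return (train_set, test_set) read from the CSV files in data_dir."""
    data_dir = Path(data_dir)
    train_set = pd.read_csv(data_dir / "train_set.csv")
    test_set = pd.read_csv(data_dir / "test_set.csv")
    return train_set, test_set
```

Calling `train_set, test_set = load_datasets()` then works from any machine where the directory exists relative to the notebook.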

### Exploratory data analysis

Exploratory data analysis (EDA) is the process of deriving insights from a dataset before making any assumptions about it. Here we will use both graphical and non-graphical exploratory data analysis.

#### Overview of training set

In [4]:
#Training set
train_set.head()

|   | lang_id | text |
|---|---------|------|
| 0 | xho | umgaqo-siseko wenza amalungiselelo kumaziko ax... |
| 1 | xho | i-dha iya kuba nobulumko bokubeka umsebenzi na... |
| 2 | eng | the province of kwazulu-natal department of tr... |
| 3 | nso | o netefatša gore o ba file dilo ka moka tše le... |
| 4 | ven | khomishini ya ndinganyiso ya mbeu yo ewa maana... |


In [5]:
test_set.head()

|   | index | text |
|---|-------|------|
| 0 | 1 | Mmasepala, fa maemo a a kgethegileng a letlele... |
| 1 | 2 | Uzakwaziswa ngokufaneleko nakungafuneka eminye... |
| 2 | 3 | Tshivhumbeo tshi fana na ngano dza vhathu. |
| 3 | 4 | Kube inja nelikati betingevakala kutsi titsini... |
| 4 | 5 | Winste op buitelandse valuta. |


#### Analysis of languages

In [11]:
# Count the occurrences of each language in the training set
Language_counts = train_set['lang_id'].value_counts()
print(Language_counts)

xho    3000
eng    3000
nso    3000
ven    3000
tsn    3000
nbl    3000
zul    3000
ssw    3000
tso    3000
sot    3000
afr    3000
Name: lang_id, dtype: int64


Each of the 11 languages appears exactly 3000 times, so the training set is perfectly balanced across classes.
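
The balance can also be checked programmatically rather than by reading the printout. A minimal sketch on a small stand-in DataFrame; `toy_train` holds made-up rows and `is_balanced` is a hypothetical helper, so on the real data you would pass `train_set` instead:

```python
import pandas as pd

# Stand-in for train_set: made-up rows, not taken from the real dataset.
toy_train = pd.DataFrame({
    "lang_id": ["xho", "eng", "afr", "xho", "eng", "afr"],
    "text": ["a", "b", "c", "d", "e", "f"],
})


def is_balanced(df, label_col="lang_id"):
    """Return True when every label occurs the same number of times."""
    counts = df[label_col].value_counts()
    return counts.nunique() == 1


# Relative frequency of each language instead of raw counts.
proportions = toy_train["lang_id"].value_counts(normalize=True)
print(proportions)
print(is_balanced(toy_train))
```

With balanced classes, overall accuracy is not dominated by a majority language, which makes the per-class figures in a classification report easier to compare.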

In [12]:
# View the data type and non-null count of each column
train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   lang_id  33000 non-null  object
 1   text     33000 non-null  object
dtypes: object(2)
memory usage: 515.8+ KB


The training set has 33000 rows and two columns (`lang_id` and `text`), and neither column contains null values.
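
The same conclusion can be reached with an explicit per-column count of missing values. A sketch on made-up rows, where `toy_train` and the `null_counts` helper are hypothetical and one label is deliberately missing to show what a non-zero count looks like:

```python
import pandas as pd

# Made-up rows with one missing label, for illustration only.
toy_train = pd.DataFrame({
    "lang_id": ["xho", "eng", None],
    "text": ["molo", "hello", "hallo"],
})


def null_counts(df):
    """Return the number of missing values per column as a plain dict."""
    return df.isnull().sum().to_dict()


print(toy_train.shape)         # (rows, columns)
print(null_counts(toy_train))  # missing values per column
```

On the real training set, every value in the returned dict should be zero, in line with the `info()` output above.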

#### Cleaning data

We will perform only minimal cleaning on our dataset: converting the text to lower case and removing punctuation.

In [18]:
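def clean_data(text):
    """Lowercase the text and strip ASCII punctuation."""
    # change the case of all the words in the text to lowercase
    text = text.lower()

    # remove punctuation (only the ASCII characters in string.punctuation)
    text = "".join([x for x in text if x not in string.punctuation])
    return text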

In [21]:
# clean the train dataset
train_set['text'] = train_set['text'].apply(clean_data)
# clean the test dataset
test_set['text'] = test_set['text'].apply(clean_data)
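
To see what this cleaning step does and does not remove, here is a self-contained sketch. `clean_data_translate` is a hypothetical variant built on `str.translate`; it should produce the same result as the list comprehension in `clean_data`. Note that `string.punctuation` covers only ASCII punctuation, so digits and letters such as `š` are kept, and that removing hyphens joins hyphenated words together.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_data_translate(text):
    """Lowercase the text and delete ASCII punctuation (hypothetical variant)."""
    return text.lower().translate(_PUNCT_TABLE)


print(clean_data_translate("Winste op buitelandse valuta."))
print(clean_data_translate("umgaqo-siseko"))       # the hyphen is dropped
print(clean_data_translate("O netefatša 2 dilo."))  # digits and š survive
```

Building the translation table once and reusing it avoids the per-character membership test, which may matter when the function is applied to tens of thousands of rows.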

#### Transforming Text into Numbers
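
One common way to turn text into numbers for language identification is to count character n-grams. A minimal sketch, assuming scikit-learn's `CountVectorizer`; this is an illustration on made-up sentences and not necessarily the vectoriser used in this notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already-cleaned example sentences (not rows from the dataset).
texts = ["winste op buitelandse valuta", "the province of kwazulu natal"]

# Count character 1-grams and 2-grams instead of whole words.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

print(X.shape)  # (number of texts, number of distinct n-grams)
print(sorted(vectorizer.vocabulary_)[:10])  # a few of the learned n-grams
```

Each row of the sparse matrix `X` is the numeric representation of one text, with one column per distinct character n-gram seen during fitting.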

## Modelling and Evaluation