# ExploreAI Academy Classification Hackathon

## South African Language Identification Hack 2023

### EA language classification hackathon

## Overview
---
South Africa is a multicultural society that is characterised by its rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and also contribute to the social, cultural, intellectual, economic and political life of the South African society.

The country is multilingual with 11 official languages, each of which is guaranteed equal status. Most South Africans are multilingual and able to speak two or more of the official languages. (Source: South African Government)

With such a multilingual population, it is only natural that our systems and devices should also communicate in multiple languages.

In this challenge, you will take text written in any of South Africa's 11 official languages and identify which language it is in. This is an instance of language identification, the NLP task of determining the natural language that a piece of text is written in.

---




## Importing Packages

In [123]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC
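import warnings
# Silence all warnings for cleaner output (note: this also hides convergence and failed-fit warnings)
warnings.filterwarnings("ignore")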

## Loading the Data

In [124]:
# Load the training and test datasets
train_data = pd.read_csv('train_set.csv')
test_data = pd.read_csv('test_set.csv')


## Exploratory Data Analysis (EDA)

In [125]:
# View the first rows of the train dataset
print(train_data.head())

  lang_id                                               text
0     xho  umgaqo-siseko wenza amalungiselelo kumaziko ax...
1     xho  i-dha iya kuba nobulumko bokubeka umsebenzi na...
2     eng  the province of kwazulu-natal department of tr...
3     nso  o netefatša gore o ba file dilo ka moka tše le...
4     ven  khomishini ya ndinganyiso ya mbeu yo ewa maana...


In [126]:
# Get information about the train dataset
print(train_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   lang_id  33000 non-null  object
 1   text     33000 non-null  object
dtypes: object(2)
memory usage: 515.8+ KB
None


In [127]:
# Check the shape of the train dataset
print(train_data.shape) 

(33000, 2)


In [128]:
# View the first rows of the test dataset
print(test_data.head())

   index                                               text
0      1  Mmasepala, fa maemo a a kgethegileng a letlele...
1      2  Uzakwaziswa ngokufaneleko nakungafuneka eminye...
2      3         Tshivhumbeo tshi fana na ngano dza vhathu.
3      4  Kube inja nelikati betingevakala kutsi titsini...
4      5                      Winste op buitelandse valuta.


In [129]:
# Get information about the test dataset
print(test_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5682 entries, 0 to 5681
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   index   5682 non-null   int64 
 1   text    5682 non-null   object
dtypes: int64(1), object(1)
memory usage: 88.9+ KB
None


In [130]:
# Check the shape of the test dataset
print(test_data.shape)

(5682, 2)


In [131]:
# Show every row of train_data whose text appears more than once
train_data[train_data['text'].duplicated(keep=False)]

      lang_id                                               text
0         xho  umgaqo-siseko wenza amalungiselelo kumaziko ax...
9         tsn  fa le dirisiwa lebone le tshwanetse go bontsha...
10        nbl  lapho inarha yangeqadi ingenwe ngokungasimthet...
12        zul  i-tip-offs anonymous wusizo locingo oluzimele ...
19        nbl  isitifikhethi somtjhado esingakarhunyezwa namk...
...       ...                                                ...
32980     ssw  inhloso ye-wua kutsi yente bantfu bendzawo let...
32983     ssw  timiso tesigatjana titawusebenta ngetingucuko ...
32985     nso  ge o nyaka go kgopela phihlelelo ya direkoto t...
32989     ssw  imenenja yesigodzi utakwatisa ngembhalo uma le...
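
Rows that share the same text can land on both sides of a later train/validation split, which tends to flatter validation scores. A minimal sketch of counting and dropping such rows follows; the frame `toy` and its contents are invented for illustration and are not taken from the competition data.

```python
import pandas as pd

# Hypothetical miniature of train_data; the rows are invented for illustration.
toy = pd.DataFrame({
    "lang_id": ["xho", "eng", "xho", "afr"],
    "text": ["molo", "hello", "molo", "hallo"],
})

# keep=False flags every member of a duplicate group, not only the repeats.
n_involved = toy["text"].duplicated(keep=False).sum()

# Keep the first occurrence of each text and renumber the rows.
deduplicated = toy.drop_duplicates(subset="text").reset_index(drop=True)

print(n_involved, len(deduplicated))
```

Whether dropping duplicates helps on the real data is a judgment call; the sketch only shows the mechanics.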


In [132]:
# EDA: Analyze the distribution of languages in the training dataset
language_counts = train_data['lang_id'].value_counts()
print(language_counts)

xho    3000
eng    3000
nso    3000
ven    3000
tsn    3000
nbl    3000
zul    3000
ssw    3000
tso    3000
sot    3000
afr    3000
Name: lang_id, dtype: int64


## Data Processing

#### Data Processing: Convert text to lowercase and remove punctuation

In [133]:
# Define a function to preprocess the text
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Keep only letters and whitespace (this drops punctuation, digits and hyphens)
    text = ''.join(char for char in text if char.isalpha() or char.isspace())
    return text
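
To make the effect of this filter concrete, the sketch below repeats the function in compact form and applies it to three strings. The sample strings are illustrative; the last two are shortened from rows shown in the training preview.

```python
def preprocess_text(text):
    # Lowercase, then keep only alphabetic characters and whitespace.
    text = text.lower()
    return ''.join(char for char in text if char.isalpha() or char.isspace())

samples = ["Hello, World! 123", "umgaqo-siseko wenza", "o netefatša gore"]
cleaned = [preprocess_text(s) for s in samples]
for original, result in zip(samples, cleaned):
    print(repr(original), "->", repr(result))
```

Note that hyphenated words are merged into a single token (`umgaqosiseko`), digits disappear, and accented letters such as `š` survive because `str.isalpha` treats them as letters.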

In [134]:
# Apply the preprocessing function to the 'text' column in both datasets
train_data['text'] = train_data['text'].apply(preprocess_text)
test_data['text'] = test_data['text'].apply(preprocess_text)

## Feature Engineering

#### Feature Engineering: Extract additional features if available

In [135]:
# Add the text length as an extra column (kept for exploration; the models below use only the vectorised text)
train_data['text_length'] = train_data['text'].apply(len)
test_data['text_length'] = test_data['text'].apply(len)

#### Feature Extraction: Convert text data into numerical features using different methods

In [136]:
# Bag-of-Words representation
vectorizer_bow = CountVectorizer()
X_train_bow = vectorizer_bow.fit_transform(train_data['text'])
X_test_bow = vectorizer_bow.transform(test_data['text'])

In [137]:
# TF-IDF representation
vectorizer_tfidf = TfidfVectorizer()
X_train_tfidf = vectorizer_tfidf.fit_transform(train_data['text'])
X_test_tfidf = vectorizer_tfidf.transform(test_data['text'])
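
For intuition about what the TF-IDF weights represent, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies under its default settings (smoothed IDF, then L2 normalisation of each row). It is a simplification: tokens are split on whitespace, whereas the real vectorizer uses its own token pattern, and the three toy sentences are invented.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document.

    Uses idf(t) = ln((1 + n) / (1 + df(t))) + 1, then scales each row
    to unit L2 norm.
    """
    tokenised = [doc.split() for doc in docs]  # simplification: whitespace tokens
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["die kat sit", "die hond loop", "inja iyahamba"]  # toy sentences
rows = tfidf_rows(docs)
print(rows[0])
```

In the first row, `die` receives a lower weight than `kat` because it occurs in two of the three documents; rarer terms are treated as more informative.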


In [138]:
# Target labels
y_train = train_data['lang_id']

## Modelling

#### Model Training and Evaluation: Train different models and evaluate their performance

In [139]:
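# Split the training data into training and validation sets
# (both calls use the same random_state, so the two feature matrices get the identical row split and the label splits match)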
X_train_bow_split, X_val_bow_split, y_train_split, y_val_split = train_test_split(X_train_bow, y_train, test_size=0.20, random_state=42)
X_train_tfidf_split, X_val_tfidf_split, y_train_split, y_val_split = train_test_split(X_train_tfidf, y_train, test_size=0.20, random_state=42)
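
An alternative to calling the splitter twice is to pass both feature matrices in a single call, which guarantees the rows stay aligned, and to add `stratify` so every language keeps its share in each part. The arrays below are small stand-ins (assumptions for illustration) for the real Bag-of-Words and TF-IDF matrices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in feature matrices: 10 rows each, with row i of X_tfidf_toy tied to row i of X_bow_toy.
X_bow_toy = np.arange(20).reshape(10, 2)
X_tfidf_toy = X_bow_toy * 0.1
y_toy = np.array(["afr"] * 5 + ["zul"] * 5)

# One call splits all three arrays with the same row assignment.
bow_tr, bow_val, tfidf_tr, tfidf_val, y_tr, y_val = train_test_split(
    X_bow_toy, X_tfidf_toy, y_toy,
    test_size=0.20, random_state=42, stratify=y_toy,
)

print(bow_val.shape, sorted(y_val))
```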

In [140]:
# Model 1: Multinomial Naive Bayes classifier with Bag-of-Words
classifier_bow = MultinomialNB()
classifier_bow.fit(X_train_bow_split, y_train_split)
predictions_bow = classifier_bow.predict(X_val_bow_split)

In [141]:
# Model 2: Logistic Regression classifier with TF-IDF
classifier_tfidf = LogisticRegression(max_iter=1000)
classifier_tfidf.fit(X_train_tfidf_split, y_train_split)
predictions_tfidf = classifier_tfidf.predict(X_val_tfidf_split)

In [142]:
# Model 3: Stochastic Gradient Descent (SGD) classifier with Bag-of-Words
classifier_sgd = SGDClassifier()
param_grid_sgd = {
    'alpha': [0.0001, 0.001, 0.01],
    'penalty': ['l1', 'l2'],
    'max_iter': [100, 200, 500]
}
grid_search_sgd = GridSearchCV(classifier_sgd, param_grid=param_grid_sgd, cv=5)
grid_search_sgd.fit(X_train_bow_split, y_train_split)
best_classifier_sgd = grid_search_sgd.best_estimator_
predictions_sgd = best_classifier_sgd.predict(X_val_bow_split)
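
The vectorizers above are fitted on the full training set before any splitting, so the held-out rows contribute to the vocabulary. One way to keep that step inside each cross-validation fold is to wrap the vectorizer and the classifier in a `Pipeline` and search over the pipeline. The sketch below assumes a tiny invented corpus and a reduced grid; the step names `bow` and `sgd` are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy corpus: four Afrikaans and four English sentences.
texts = [
    "die kat sit op die mat", "die hond loop in die straat",
    "ek het die boek gelees", "ons gaan na die winkel",
    "the cat sits on the mat", "the dog walks in the street",
    "i have read the book", "we are going to the shop",
]
labels = ["afr"] * 4 + ["eng"] * 4

pipeline = Pipeline([
    ("bow", CountVectorizer()),             # refitted on the training part of every fold
    ("sgd", SGDClassifier(random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"sgd__alpha": [0.0001, 0.001], "sgd__penalty": ["l1", "l2"]}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["die kat"]))
```

Because the fitted pipeline accepts raw strings, the same object can be applied directly to the test text without a separate `transform` call.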

In [143]:
# Model 4: Linear Support Vector Classification (LinearSVC) with TF-IDF
classifier_svc = LinearSVC()
param_grid_svc = {
    'C': [0.1, 1.0, 10.0],
    'penalty': ['l1', 'l2'],
    'max_iter': [100, 200, 500]
}
grid_search_svc = GridSearchCV(classifier_svc, param_grid=param_grid_svc, cv=5)
grid_search_svc.fit(X_train_tfidf_split, y_train_split)
best_classifier_svc = grid_search_svc.best_estimator_
predictions_svc = best_classifier_svc.predict(X_val_tfidf_split)
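
A caveat on this grid: with its default squared-hinge loss, `LinearSVC` does not support `penalty='l1'` together with `dual=True`. Depending on the installed scikit-learn version's default for `dual`, the `'l1'` candidates may therefore fail to fit, and since warnings are silenced the search would score them as NaN without any visible notice. A hedged sketch of a grid that lists only valid combinations, on an invented toy corpus, could look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Invented toy corpus: four Afrikaans and four English sentences.
texts = [
    "die kat sit op die mat", "die hond loop in die straat",
    "ek het die boek gelees", "ons gaan na die winkel",
    "the cat sits on the mat", "the dog walks in the street",
    "i have read the book", "we are going to the shop",
]
labels = ["afr"] * 4 + ["eng"] * 4
X_toy = TfidfVectorizer().fit_transform(texts)

# A list of dicts lets each penalty be paired only with a solver mode it supports.
param_grid = [
    {"penalty": ["l2"], "dual": [True, False], "C": [0.1, 1.0]},
    {"penalty": ["l1"], "dual": [False], "C": [0.1, 1.0]},
]

# error_score="raise" makes a failed fit stop the search instead of becoming NaN.
search = GridSearchCV(LinearSVC(max_iter=1000), param_grid=param_grid,
                      cv=2, error_score="raise")
search.fit(X_toy, labels)

print(len(search.cv_results_["params"]), search.best_params_)
```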

In [144]:
# Performance Metrics
accuracy_bow = accuracy_score(y_val_split, predictions_bow)
precision_bow = precision_score(y_val_split, predictions_bow, average='weighted')
recall_bow = recall_score(y_val_split, predictions_bow, average='weighted')
f1_score_bow = f1_score(y_val_split, predictions_bow, average='weighted')

In [145]:
accuracy_tfidf = accuracy_score(y_val_split, predictions_tfidf)
precision_tfidf = precision_score(y_val_split, predictions_tfidf, average='weighted')
recall_tfidf = recall_score(y_val_split, predictions_tfidf, average='weighted')
f1_score_tfidf = f1_score(y_val_split, predictions_tfidf, average='weighted')

In [146]:
accuracy_sgd = accuracy_score(y_val_split, predictions_sgd)
precision_sgd = precision_score(y_val_split, predictions_sgd, average='weighted')
recall_sgd = recall_score(y_val_split, predictions_sgd, average='weighted')
f1_score_sgd = f1_score(y_val_split, predictions_sgd, average='weighted')

In [147]:
accuracy_svc = accuracy_score(y_val_split, predictions_svc)
precision_svc = precision_score(y_val_split, predictions_svc, average='weighted')
recall_svc = recall_score(y_val_split, predictions_svc, average='weighted')
f1_score_svc = f1_score(y_val_split, predictions_svc, average='weighted')


In [148]:
print("Bag-of-Words Model:")
print("Accuracy:", accuracy_bow)
print("Precision:", precision_bow)
print("Recall:", recall_bow)
print("F1 Score:", f1_score_bow)


Bag-of-Words Model:
Accuracy: 0.9989393939393939
Precision: 0.998941955790228
Recall: 0.9989393939393939
F1 Score: 0.9989392771541917


In [149]:
print("TF-IDF Model:")
print("Accuracy:", accuracy_tfidf)
print("Precision:", precision_tfidf)
print("Recall:", recall_tfidf)
print("F1 Score:", f1_score_tfidf)

TF-IDF Model:
Accuracy: 0.9943939393939394
Precision: 0.9944216933464599
Recall: 0.9943939393939394
F1 Score: 0.9943976987335509


In [150]:
print("SGD Model:")
print("Accuracy:", accuracy_sgd)
print("Precision:", precision_sgd)
print("Recall:", recall_sgd)
print("F1 Score:", f1_score_sgd)

SGD Model:
Accuracy: 0.9954545454545455
Precision: 0.9954657211179962
Recall: 0.9954545454545455
F1 Score: 0.995458252459493


In [151]:
print("LinearSVC Model:")
print("Accuracy:", accuracy_svc)
print("Precision:", precision_svc)
print("Recall:", recall_svc)
print("F1 Score:", f1_score_svc)

LinearSVC Model:
Accuracy: 0.9965151515151515
Precision: 0.9965163927529209
Recall: 0.9965151515151515
F1 Score: 0.996513421406879
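
In every block of results above, the weighted recall is identical to the accuracy. That is by construction: weighting each class's recall by its share of the samples reduces to the overall fraction of correct predictions. A small pure-Python check on invented labels:

```python
from collections import Counter

def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_recall(y_true, y_pred):
    # Per-class recall, weighted by the class's share of the true labels.
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    return sum((support[c] / total) * (correct[c] / support[c]) for c in support)

# Invented labels for illustration.
y_true = ["afr", "afr", "eng", "zul", "zul", "zul"]
y_pred = ["afr", "eng", "eng", "zul", "zul", "xho"]

acc = accuracy(y_true, y_pred)
rec = weighted_recall(y_true, y_pred)
print(acc, rec)
```

The weighted recall therefore adds no information beyond accuracy here; macro-averaged or per-language scores would be more revealing if some languages were harder than others.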


In [152]:
# Choose the best-performing model for final predictions
best_model = classifier_bow
best_predictions = predictions_bow
best_model_name = 'Bag-of-Words Model'

if accuracy_tfidf > accuracy_bow:
    best_model = classifier_tfidf
    best_predictions = predictions_tfidf
    best_model_name = 'TF-IDF Model'

if accuracy_sgd > accuracy_bow and accuracy_sgd > accuracy_tfidf:
    best_model = best_classifier_sgd  # fitted estimator from the grid search, not the unfitted template
    best_predictions = predictions_sgd
    best_model_name = 'SGD Model'

if accuracy_svc > accuracy_bow and accuracy_svc > accuracy_tfidf and accuracy_svc > accuracy_sgd:
    best_model = best_classifier_svc  # fitted estimator from the grid search, not the unfitted template
    best_predictions = predictions_svc
    best_model_name = 'LinearSVC Model'
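
The same selection can be written more compactly by storing the scores in a dictionary and taking the maximum. The sketch below uses the validation accuracies printed above; the `feature_kind` mapping is an illustrative helper (not part of the notebook) that records which representation each model was trained on.

```python
# Validation accuracies as reported in the outputs above.
validation_accuracy = {
    "Bag-of-Words Model": 0.9989393939393939,
    "TF-IDF Model": 0.9943939393939394,
    "SGD Model": 0.9954545454545455,
    "LinearSVC Model": 0.9965151515151515,
}

# Hypothetical helper: which feature representation each model expects.
feature_kind = {
    "Bag-of-Words Model": "bow",
    "TF-IDF Model": "tfidf",
    "SGD Model": "bow",
    "LinearSVC Model": "tfidf",
}

# max() over the keys, ranked by their accuracy.
best_model_name = max(validation_accuracy, key=validation_accuracy.get)
print(best_model_name, feature_kind[best_model_name])
```

Keeping the feature kind next to the model name avoids feeding a model test features from a representation it was not trained on.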

In [153]:
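# Final predictions: the Naive Bayes and SGD models were trained on Bag-of-Words features, the others on TF-IDF
bow_models = ('Bag-of-Words Model', 'SGD Model')
final_predictions = best_model.predict(X_test_bow if best_model_name in bow_models else X_test_tfidf)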

In [154]:
# Prepare submission file in CSV format
submission_df = pd.DataFrame({'index': test_data['index'], 'lang_id': final_predictions})
submission_df.to_csv('best_submission.csv', index=False)

In [155]:
# Prepare submission files for each model
submission_bow = pd.DataFrame({'index': test_data['index'], 'lang_id': classifier_bow.predict(X_test_bow)})
submission_bow.to_csv('submission_bow.csv', index=False)

In [156]:
submission_tfidf = pd.DataFrame({'index': test_data['index'], 'lang_id': classifier_tfidf.predict(X_test_tfidf)})
submission_tfidf.to_csv('submission_tfidf.csv', index=False)

In [157]:
submission_sgd = pd.DataFrame({'index': test_data['index'], 'lang_id': best_classifier_sgd.predict(X_test_bow)})
submission_sgd.to_csv('submission_sgd.csv', index=False)

In [158]:
submission_svc = pd.DataFrame({'index': test_data['index'], 'lang_id': best_classifier_svc.predict(X_test_tfidf)})
submission_svc.to_csv('submission_svc.csv', index=False)

In [159]:
print("Submission files created successfully!")

Submission files created successfully!
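
Before uploading, a quick sanity check on a submission frame can catch mistakes early. The sketch below assumes the expected format is exactly the two columns written above (`index`, `lang_id`); the function `check_submission` and the sample frames are hypothetical, and the language codes are the 11 that appear in the training labels.

```python
import pandas as pd

# The 11 language codes seen in train_data['lang_id'].
VALID_LANG_IDS = {"afr", "eng", "nbl", "nso", "sot", "ssw",
                  "tsn", "tso", "ven", "xho", "zul"}

def check_submission(submission, expected_index):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(submission.columns) != ["index", "lang_id"]:
        return ["columns must be ['index', 'lang_id']"]
    problems = []
    if list(submission["index"]) != list(expected_index):
        problems.append("index column does not match the test set")
    unknown = set(submission["lang_id"]) - VALID_LANG_IDS
    if unknown:
        problems.append(f"unknown language codes: {sorted(unknown)}")
    return problems

# Invented sample frames for illustration.
good = pd.DataFrame({"index": [1, 2, 3], "lang_id": ["tsn", "nbl", "ven"]})
bad = pd.DataFrame({"index": [1, 2, 3], "lang_id": ["tsn", "nbl", "xx"]})

good_problems = check_submission(good, [1, 2, 3])
bad_problems = check_submission(bad, [1, 2, 3])
print(good_problems)
print(bad_problems)
```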
