Language Identification in South African Text: Kaggle Competition

This notebook presents my approach to the Language Identification Challenge on Kaggle. The challenge is to classify text written in South Africa's 11 official languages. The notebook covers data exploration, preprocessing, feature extraction, model training, evaluation, and submission generation. Using machine learning techniques, I aim to build a classification model that accurately predicts the language of a given text.

Importing the necessary libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

Loading the training set

In [2]:
train_data = pd.read_csv('train_set.csv')

Loading the test set

In [3]:
test_data = pd.read_csv('test_set.csv')

Exploring the data

In [4]:
print(train_data.columns)


Index(['lang_id', 'text'], dtype='object')
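Beyond the column names, a few more checks are often useful before modelling. The sketch below runs them on a small hypothetical frame (`toy_train`, invented here for illustration and not taken from train_set.csv); the same calls could be applied to `train_data`.

```python
import pandas as pd

# Hypothetical stand-in for train_data; the real frame is read from train_set.csv
toy_train = pd.DataFrame({
    'lang_id': ['eng', 'eng', 'afr', 'zul'],
    'text': ['how are you', 'good morning', 'hoe gaan dit', 'ngiyabonga kakhulu'],
})

# Number of rows and columns
print(toy_train.shape)

# Number of examples per language (reveals class imbalance)
class_counts = toy_train['lang_id'].value_counts()
print(class_counts)

# Missing values per column
missing = toy_train.isna().sum()
print(missing)

# Summary statistics of the text length in characters
text_lengths = toy_train['text'].str.len()
print(text_lengths.describe())
```

If the classes turn out to be heavily imbalanced, a macro-averaged metric is usually more informative than plain accuracy.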


Preprocessing the data

In [5]:
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_data['text'])
y_train = train_data['lang_id']


In [6]:
X_test = vectorizer.transform(test_data['text'])


Training the model

In [None]:
# Define the algorithms with their hyperparameters
algorithms = {
    'Naive Bayes': MultinomialNB(alpha=0.1),
    'SVM': SVC(C=1, kernel='linear'),
    'Random Forest': RandomForestClassifier(n_estimators=100)
}

# Iterate over the algorithms and train each model on the full training set
for algorithm_name, algorithm in algorithms.items():
    print(f'Training {algorithm_name}...')
    algorithm.fit(X_train, y_train)

# Predict with Random Forest. No validation scores are computed in this cell,
# so treating it as the best-performing algorithm is an assumption.
best_algorithm = algorithms['Random Forest']
y_pred = best_algorithm.predict(X_test)


Training Naive Bayes...
Training SVM...
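The training cell fits every model on the full training set and then picks Random Forest without a held-out score. One way to compare candidates is to keep part of the labelled data aside and score each model on it. The sketch below does this on made-up sentences (`toy_texts` and `toy_labels` are hypothetical), using a macro-averaged F1 score as an assumed metric; in the notebook the split would be made on `train_data` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up labelled sentences (illustrative only, not from train_set.csv)
toy_texts = [
    'how are you today', 'good morning to you', 'thank you very much',
    'where is the shop', 'i am going home', 'the weather is nice',
    'hoe gaan dit met jou', 'goeie more vir jou', 'baie dankie vir alles',
    'waar is die winkel', 'ek gaan huis toe', 'die weer is mooi',
]
toy_labels = ['eng'] * 6 + ['afr'] * 6

# Hold out 4 sentences, keeping both languages represented in each split
texts_fit, texts_val, y_fit, y_val = train_test_split(
    toy_texts, toy_labels, test_size=4, stratify=toy_labels, random_state=42
)

candidates = {
    'Naive Bayes': MultinomialNB(alpha=0.1),
    'SVM': SVC(C=1, kernel='linear'),
}

scores = {}
for name, estimator in candidates.items():
    # The pipeline fits the vectorizer on the training split only
    pipeline = make_pipeline(TfidfVectorizer(), estimator)
    pipeline.fit(texts_fit, y_fit)
    scores[name] = f1_score(y_val, pipeline.predict(texts_val), average='macro')

# Candidate with the highest validation score
best_name = max(scores, key=scores.get)
print(scores, best_name)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the validation sentences out of the TF-IDF vocabulary, so the score is not inflated by leakage.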


In [None]:
best_algorithm = RandomForestClassifier(n_estimators=300)
best_algorithm.fit(X_train, y_train)
y_pred = best_algorithm.predict(X_test)
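The cell above changes `n_estimators` from 100 to 300 by hand. `GridSearchCV` is imported at the top of the notebook but not used; the following sketch shows how it could choose the value by cross-validation instead. The sentences, the grid values, `cv=2`, and the `f1_macro` scoring are all assumptions picked to keep the example small.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Made-up labelled sentences (illustrative only)
toy_texts = [
    'how are you today', 'good morning to you',
    'thank you very much', 'where is the shop',
    'hoe gaan dit met jou', 'goeie more vir jou',
    'baie dankie vir alles', 'waar is die winkel',
]
toy_labels = ['eng'] * 4 + ['afr'] * 4

pipeline = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))

# make_pipeline names each step after its lowercased class name
param_grid = {'randomforestclassifier__n_estimators': [10, 20]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring='f1_macro')
search.fit(toy_texts, toy_labels)

print(search.best_params_, search.best_score_)
```

On the real data a larger grid and more folds would be reasonable, at the cost of a longer training time.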


Making predictions on the test data

In [None]:
# Create and train the model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Making predictions
predictions = model.predict(X_test)


Testing my model

In [None]:
test_sentences = [
    "Hello, how are you?",
    "Hoe gaan dit?",
    "Ngiyi-Data Scientist enkulu",
    "Igama lam ndinguCara"
]

Preprocessing the test sentences

In [None]:
# Use a separate name so the competition test matrix X_test is not overwritten
X_sample = vectorizer.transform(test_sentences)

Predicting the language of test sentences

In [None]:
y_pred = model.predict(X_sample)

Mapping the predicted language codes to their corresponding language names

In [None]:
language_names = {
    'afr': 'Afrikaans',
    'eng': 'English',
    'nbl': 'isiNdebele',
    'nso': 'Sepedi',
    'sot': 'Sesotho',
    'ssw': 'siSwati',
    'tsn': 'Setswana',
    'tso': 'Xitsonga',
    'ven': 'Tshivenda',
    'xho': 'isiXhosa',
    'zul': 'isiZulu'
}

Printing the predicted languages for the test sentences

In [None]:
for sentence, pred in zip(test_sentences, y_pred):
    language = language_names[pred]
    print(f"Sentence: {sentence}")
    print(f"Predicted Language: {language}\n")

Creating a submission file

In [None]:
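# Use the predictions made on the full test set, not the sample-sentence y_pred.
# The column names should match the competition's sample submission file.
submission = pd.DataFrame({'index': test_data['index'], 'language': predictions})
submission.to_csv('sample_submission.csv', index=False)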
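Before uploading, it can help to sanity-check the submission frame. The sketch below shows a few such checks on a hypothetical three-row example (`toy_test` and `toy_predictions` are invented here); the same assertions could be run on `submission` and `test_data`.

```python
import pandas as pd

# Hypothetical stand-ins for test_data and predictions
toy_test = pd.DataFrame({'index': [1, 2, 3], 'text': ['a', 'b', 'c']})
toy_predictions = ['eng', 'afr', 'zul']

toy_submission = pd.DataFrame({
    'index': toy_test['index'],
    'language': toy_predictions,
})

# The 11 language codes used in the language_names mapping
valid_codes = {'afr', 'eng', 'nbl', 'nso', 'sot', 'ssw',
               'tsn', 'tso', 'ven', 'xho', 'zul'}

# One prediction per test row, with no duplicated indices
assert len(toy_submission) == len(toy_test)
assert toy_submission['index'].is_unique

# Every prediction is a known code and nothing is missing
assert toy_submission['language'].isin(valid_codes).all()
assert not toy_submission.isna().any().any()

print(toy_submission)
```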