# Language Identification in South African Text: Kaggle Competition

This notebook presents my approach to the Language Identification Challenge on Kaggle, which involves classifying text written in South Africa's 11 official languages. It covers data exploration, preprocessing, feature extraction, model training, evaluation, and submission generation. The goal is a classification model that accurately predicts the language of a given text.

## Importing necessary libraries

In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB


## Loading the data

In [None]:
train_df = pd.read_csv('train_set.csv')
test_df = pd.read_csv('test_set.csv')

## Exploratory Data Analysis (EDA)

In [None]:
print("Train Dataset:")
print(train_df.head())

print("\nTest Dataset:")
print(test_df.head())
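
Beyond the first rows, a few quick checks are useful before modeling: class balance, missing values, duplicate texts, and text length per language. The sketch below shows them on a small hypothetical DataFrame (`toy_train`) that only mimics the `lang_id` and `text` columns; the strings are made-up placeholders, not rows from `train_set.csv`.

```python
import pandas as pd

# Hypothetical toy stand-in for train_df (illustrative strings, not real data)
toy_train = pd.DataFrame({
    "lang_id": ["eng", "afr", "zul", "eng", "afr", "zul"],
    "text": [
        "the cat sat on the mat",
        "die kat sit op die mat",
        "ikati lihlezi phezu kukacansi",
        "good morning to you all",
        "goeie more aan julle almal",
        "sawubona nonke ekuseni",
    ],
})

# Samples per language: roughly equal counts indicate a balanced dataset
class_counts = toy_train["lang_id"].value_counts()

# Missing values per column and number of repeated texts
missing_per_column = toy_train.isna().sum()
duplicate_texts = int(toy_train["text"].duplicated().sum())

# Mean text length in characters for each language
mean_length = toy_train["text"].str.len().groupby(toy_train["lang_id"]).mean()

print(class_counts)
print(missing_per_column)
print("Duplicate texts:", duplicate_texts)
print(mean_length)
```

On the real data, the same calls would run on `train_df` directly. Strong class imbalance would be a reason to prefer a stratified validation split.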


## Data Preprocessing

In [None]:
def preprocess_data(train_df, test_df):
    # Initializing the TF-IDF vectorizer
    vectorizer = TfidfVectorizer()

    # Fitting the vectorizer on the training data
    vectorizer.fit(train_df['text'])

    # Transforming the training and test data using the fitted vectorizer
    train_features = vectorizer.transform(train_df['text'])
    test_features = vectorizer.transform(test_df['text'])

    return train_features, test_features, vectorizer



## Preprocessing the data

In [None]:
train_features, test_features, vectorizer = preprocess_data(train_df, test_df)
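
The preprocessing above uses `TfidfVectorizer` with its default word-level tokens. For language identification, character n-grams are a commonly used alternative, since short letter sequences tend to differ between languages. The following is a minimal sketch of that variant, not the configuration used in this notebook; the sample strings and the `ngram_range` value are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample texts (illustrative only)
sample_texts = [
    "the cat sat on the mat",
    "die kat sit op die mat",
    "ikati lihlezi phezu kukacansi",
]

# 'char_wb' builds character n-grams inside word boundaries;
# ngram_range=(1, 3) is an assumed setting, worth tuning on validation data
char_vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))

# Fit on the training texts, then reuse the fitted vocabulary for new text
char_features = char_vectorizer.fit_transform(sample_texts)
new_features = char_vectorizer.transform(["die hond blaf"])

print("Training matrix shape:", char_features.shape)
print("New text matrix shape:", new_features.shape)
```

Swapping this in would only require changing the vectorizer inside `preprocess_data`; whether it improves the F1 score here would need to be checked on the validation split.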


## Training and Evaluation

### Logistic Regression

In [None]:
# Holding out 20% of the training data for validation
X_train, X_val, y_train, y_val = train_test_split(
    train_features, train_df['lang_id'], test_size=0.2, random_state=42
)
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
lr_preds = lr_model.predict(X_val)
lr_f1 = f1_score(y_val, lr_preds, average='weighted')

print("Logistic Regression F1 Score:", lr_f1)


### K Nearest Neighbors (KNN)

In [None]:
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
knn_preds = knn_model.predict(X_val)
knn_f1 = f1_score(y_val, knn_preds, average='weighted')

print("KNN F1 Score:", knn_f1)


### Support Vector Machine

In [None]:
svm = SVC()
svm.fit(X_train, y_train)
svm_predictions = svm.predict(X_val)
svm_f1 = f1_score(y_val, svm_predictions, average='weighted')
print("SVM F1 Score:", svm_f1)

### Naive Bayes

In [None]:
nb = MultinomialNB()
nb.fit(X_train, y_train)
nb_predictions = nb.predict(X_val)
nb_f1 = f1_score(y_val, nb_predictions, average='weighted')
print("Naive Bayes F1 Score:", nb_f1)
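
The four models above are all fitted and scored in the same way, so the comparison can also be written as one loop. The sketch below does this on a small hypothetical corpus, wrapping the vectorizer and each model in a scikit-learn `Pipeline` so that the TF-IDF vocabulary is learned from the training split only. The toy texts, `n_neighbors=3`, `max_iter=1000`, and the stratified split are assumptions made for the example, not settings taken from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus (illustrative strings, not from train_set.csv)
texts = [
    "the cat sat on the mat", "good morning to you all",
    "the dog runs in the park", "you are all welcome here",
    "die kat sit op die mat", "goeie more aan julle almal",
    "die hond hardloop in die park", "julle is almal welkom hier",
    "ikati lihlezi phezu kukacansi", "sawubona nonke ekuseni",
    "inja igijima epaki", "nonke namukelekile lapha",
]
labels = ["eng"] * 4 + ["afr"] * 4 + ["zul"] * 4

# Stratified split keeps one validation sample per language here
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "SVM": SVC(),
    "Naive Bayes": MultinomialNB(),
}

scores = {}
for name, model in candidates.items():
    # The pipeline fits TF-IDF on the training split only
    pipeline = make_pipeline(TfidfVectorizer(), model)
    pipeline.fit(X_train, y_train)
    preds = pipeline.predict(X_val)
    scores[name] = f1_score(y_val, preds, average="weighted", zero_division=0)

best_name = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("Best on this toy split:", best_name)
```

Scores on a corpus this small are not meaningful; the point is the structure, which makes it easy to add or remove candidate models and pick the best one programmatically.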


## Generate predictions on the test set

In [None]:
# Converting the test data into TF-IDF vectors
X_test = vectorizer.transform(test_df['text'])

# Generating predictions with the best-performing model (Naive Bayes)
test_predictions = nb.predict(X_test)

## Creating a CSV for submission

In [None]:
# Creating a submission dataframe with 'index' and 'lang_id' columns
submission_df = pd.DataFrame({'index': test_df['index'], 'lang_id': test_predictions})

submission_df.to_csv('FinalSub1.csv', index=False)
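
Before uploading, it can help to reload the written file and confirm it has the expected shape. The sketch below shows such a check on a hypothetical three-row submission written to a temporary directory; the index values, labels, and file name are placeholders, and the exact format Kaggle requires should be confirmed against the competition's sample submission.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for test_df['index'] and the model's predictions
toy_index = [1, 2, 3]
toy_predictions = ["eng", "afr", "zul"]

toy_submission = pd.DataFrame({"index": toy_index, "lang_id": toy_predictions})

# Write to a temporary file and read it back, as a reviewer of the CSV would
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_submission.csv")
    toy_submission.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Basic structural checks on the reloaded file
columns_ok = list(reloaded.columns) == ["index", "lang_id"]
rows_ok = len(reloaded) == len(toy_index)
no_missing = not reloaded.isna().any().any()
unique_ids = bool(reloaded["index"].is_unique)

print("Columns OK:", columns_ok)
print("Row count OK:", rows_ok)
print("No missing values:", no_missing)
print("Unique index values:", unique_ids)
```

For the real submission, the same checks would read `FinalSub1.csv` and compare its row count against `len(test_df)`.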