# Hackathon Language Identification Challenge: Notebook

## Introduction
Welcome to the Language Identification Challenge Hackathon! In this challenge, we aim to build a robust language identification model that can accurately classify text into its respective language category. This notebook serves as a comprehensive guide to our approach, methodology, and the steps taken to create an effective language identification solution.

## Challenge Overview
Language identification is a crucial task in natural language processing (NLP) and has numerous applications, ranging from content filtering to improving machine translation systems. The goal of this hackathon is to leverage machine learning techniques to build a model that excels at accurately determining the language of a given text, even in cases of multilingual or ambiguous content.

## Dataset
Our dataset comprises a diverse collection of text samples from various languages. Each text entry is labeled with its corresponding language, forming the basis for supervised learning. The challenge is to train a classification model that can generalize well to unseen text data.

## Approach

#### Data Exploration:
We will begin by exploring the dataset to understand its structure and the distribution of languages.

#### Data Preprocessing: 
To prepare the data for model training, we will perform the necessary preprocessing steps, such as tokenization, handling missing values, and converting text into a format suitable for machine learning.

#### Feature Engineering: 
Extracting relevant features is crucial for the success of our model. We may consider techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings.
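
As a rough illustration of the TF-IDF option, the sketch below compares word-level features with character n-gram features on three text fragments copied from the training rows shown later in this notebook. The `char_wb` analyzer and the `(1, 3)` n-gram range are assumptions chosen for illustration, not settings used elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fragments copied from the first training rows displayed in this notebook.
texts = [
    "umgaqo-siseko wenza amalungiselelo kumaziko",
    "the province of kwazulu-natal department of",
    "khomishini ya ndinganyiso ya mbeu yo ewa",
]

# Word-level TF-IDF: one column per distinct token (the default behaviour).
word_vec = TfidfVectorizer()
word_features = word_vec.fit_transform(texts)

# Character n-grams inside word boundaries (assumed settings, for comparison only).
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
char_features = char_vec.fit_transform(texts)

print("word-level shape:", word_features.shape)
print("char-level shape:", char_features.shape)
```

Character n-grams are often worth trying for language identification because related languages can share whole words while differing in spelling patterns. Whether they help on this dataset would need to be checked on a validation split.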

#### Model Selection: 
We will experiment with several classification algorithms, such as logistic regression, support vector machines, and neural networks, to identify the one that performs best on this language identification task.

#### Model Training: 
Once a model is selected, we will train it on the training dataset and tune its hyperparameters to improve validation performance.
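
One possible way to tune hyperparameters is a cross-validated grid search. The sketch below is only an illustration: the toy sentences, the two candidate `alpha` values, and the choice of 2-fold cross-validation are all assumptions made to keep the example small, and none of them come from the competition data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (not competition data): four English and four
# Afrikaans sentences, just enough for stratified 2-fold cross-validation.
toy_texts = [
    "the cat sat on the mat",
    "the dog ran in the park",
    "she reads a book every day",
    "we are going to the market",
    "die kat sit op die mat",
    "die hond hardloop in die park",
    "sy lees elke dag 'n boek",
    "ons gaan na die mark",
]
toy_labels = ["eng"] * 4 + ["afr"] * 4

# Vectorizer and classifier are tuned together as one estimator.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# make_pipeline names each step after its lowercased class name.
param_grid = {"multinomialnb__alpha": [0.1, 1.0]}

search = GridSearchCV(pipeline, param_grid, scoring="f1_weighted", cv=2)
search.fit(toy_texts, toy_labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated weighted F1:", search.best_score_)
```

Wrapping the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is refitted inside each fold, so the validation folds do not influence the features.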

#### Evaluation: 
We will evaluate the model using appropriate metrics, considering factors like precision, recall, and F1-score, given the potential class imbalance.
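
To make these metrics concrete, here is a small sketch on hypothetical labels and predictions (not results from this notebook). It prints per-language precision, recall, and F1, then compares the macro and weighted averages.

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical true labels and predictions, for illustration only.
y_true = ["eng", "eng", "xho", "xho", "ven", "ven"]
y_pred = ["eng", "eng", "xho", "eng", "ven", "ven"]

# Per-class precision, recall and F1 in one table.
print(classification_report(y_true, y_pred, zero_division=0))

# Macro F1 averages the per-class scores equally; weighted F1 weights
# each class by its number of true samples.
macro_f1 = f1_score(y_true, y_pred, average="macro")
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
print("macro F1:", macro_f1)
print("weighted F1:", weighted_f1)
```

With balanced classes, as in this toy example, the two averages coincide. They diverge when some languages have far fewer samples than others, which is when the macro average becomes the stricter measure.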

#### Inference: 
After training the model, we will demonstrate its language identification capabilities on new, unseen text samples.
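
A minimal sketch of what inference could look like is shown below. It trains a tiny pipeline on four fragments copied from the training rows displayed later in this notebook and then classifies a new string. The helper name `identify_language` is hypothetical, and a model trained on four fragments is far too small to be useful beyond illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Fragments and labels copied from the first training rows shown in this notebook.
train_texts = [
    "umgaqo-siseko wenza amalungiselelo kumaziko",
    "the province of kwazulu-natal department of",
    "o netefatša gore o ba file dilo ka moka tše",
    "khomishini ya ndinganyiso ya mbeu yo ewa",
]
train_labels = ["xho", "eng", "nso", "ven"]

# The pipeline applies the fitted TF-IDF vocabulary automatically at predict time.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)


def identify_language(text):
    """Hypothetical helper: return the predicted language code for one string."""
    return model.predict([text])[0]


print(identify_language("department of the province"))
```

Keeping the vectorizer and classifier in a single object avoids having to remember to transform new text with the same fitted vectorizer before calling `predict`.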

# Importing Libraries

In [2]:
#importing of required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from nltk import bigrams
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
from nltk.corpus import reuters

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

In [3]:
#importing the Training Data
train_df = pd.read_csv('train_set.csv')

#Importing the test data
test_df = pd.read_csv('test_set.csv')

# Data Exploration

In [4]:
train_df.head()

  lang_id                                               text
0     xho  umgaqo-siseko wenza amalungiselelo kumaziko ax...
1     xho  i-dha iya kuba nobulumko bokubeka umsebenzi na...
2     eng  the province of kwazulu-natal department of tr...
3     nso  o netefatša gore o ba file dilo ka moka tše le...
4     ven  khomishini ya ndinganyiso ya mbeu yo ewa maana...


In [5]:
train_df.shape

(33000, 2)

In [6]:
print(f' There are {train_df.shape[0]} rows and {train_df.shape[1]} columns')

 There are 33000 rows and 2 columns


In [7]:
train_df.dtypes

lang_id    object
text       object
dtype: object

# *Observations*
#### **The dataset has the following columns**:

lang_id: The language identification abbreviation (label) for each entry, for example xho, eng, nso or ven.

text: Contains the text of the sentences associated with each language.

#### **Data Types**

Both columns are stored with the pandas `object` dtype and hold strings:
lang_id: string (str); text: string (str)

#### **Dataset Size**

The dataset consists of 33000 entries.

This dataset will be used to train and evaluate machine learning models that predict the language of each entry in the text column.

# **Observing the Target Variable**

We will explore the following:
<ul>
  <li>Summary Statistics</li>
  <li>Target Variable Distribution</li>
</ul>

In [8]:
#Explore summary Statistics
train_df['text'].describe()

count                                                 33000
unique                                                29948
top       ngokwesekhtjheni yomthetho ophathelene nalokhu...
freq                                                     17
Name: text, dtype: object
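
The target variable distribution can be inspected with `value_counts`. The sketch below wraps that in a hypothetical helper, `label_distribution`, and runs it on a stand-in frame that holds only the five labels returned by `train_df.head()`. On the real data it would be called as `label_distribution(train_df)`.

```python
import pandas as pd


def label_distribution(df, label_col="lang_id"):
    """Hypothetical helper: count and share of rows per label, most frequent first."""
    counts = df[label_col].value_counts()
    return pd.DataFrame({"count": counts, "share": counts / len(df)})


# Stand-in frame built from the five labels shown by train_df.head().
sample_df = pd.DataFrame({"lang_id": ["xho", "xho", "eng", "nso", "ven"]})

dist = label_distribution(sample_df)
print(dist)
```

A roughly flat `share` column would indicate balanced classes, in which case weighted and macro-averaged scores should be close to each other.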

# Exploratory Data Analysis (EDA)

In [9]:
print("Train Dataset:")
print(train_df.head())

print("\nTest Dataset:")
print(test_df.head())

Train Dataset:
  lang_id                                               text
0     xho  umgaqo-siseko wenza amalungiselelo kumaziko ax...
1     xho  i-dha iya kuba nobulumko bokubeka umsebenzi na...
2     eng  the province of kwazulu-natal department of tr...
3     nso  o netefatša gore o ba file dilo ka moka tše le...
4     ven  khomishini ya ndinganyiso ya mbeu yo ewa maana...

Test Dataset:
   index                                               text
0      1  Mmasepala, fa maemo a a kgethegileng a letlele...
1      2  Uzakwaziswa ngokufaneleko nakungafuneka eminye...
2      3         Tshivhumbeo tshi fana na ngano dza vhathu.
3      4  Kube inja nelikati betingevakala kutsi titsini...
4      5                      Winste op buitelandse valuta.


# Text Data Preprocessing

Pre-processing prepares the raw text for effective learning. The specific steps depend on the model and the task at hand.

The cell below defines `preprocess_data`, which fits a TF-IDF vectorizer on the 'text' column of the training DataFrame and uses it to transform both the training and the test text. It returns the two feature matrices together with the fitted vectorizer, so the same vocabulary can be reused at inference time. `TfidfVectorizer` lowercases and tokenizes the text by default, so no separate cleaning step is applied here.


In [20]:
# TfidfVectorizer is not part of the imports cell above, so it is imported here
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess_data(train_df, test_df):
    """Fit TF-IDF on the training text and transform both datasets."""
    # Initializing the TF-IDF vectorizer
    vectorizer = TfidfVectorizer()

    # Fitting the vectorizer on the training data only, so that no
    # test-set vocabulary leaks into the features
    vectorizer.fit(train_df['text'])

    # Transforming the training and test data using the fitted vectorizer
    train_features = vectorizer.transform(train_df['text'])
    test_features = vectorizer.transform(test_df['text'])

    return train_features, test_features, vectorizer

# Preprocessing

In [21]:
train_features, test_features, vectorizer = preprocess_data(train_df, test_df)

# Training and Evaluation

#### Logistic Regression

In [12]:
# Holding out 20% of the training rows for validation
X_train, X_val, y_train, y_val = train_test_split(train_features, train_df['lang_id'], test_size=0.2, random_state=42)

# Fitting logistic regression and scoring it with the weighted F1 on the validation split
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
lr_preds = lr_model.predict(X_val)
lr_f1 = f1_score(y_val, lr_preds, average='weighted')

print("Logistic Regression F1 Score:", lr_f1)

#### K Nearest Neighbors (KNN)

In [13]:
# Fitting KNN with its default settings on the same training split
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)

# Scoring on the validation split with the weighted F1
knn_preds = knn_model.predict(X_val)
knn_f1 = f1_score(y_val, knn_preds, average='weighted')

print("KNN F1 Score:", knn_f1)

#### Support Vector Machine

In [14]:
# Fitting a support vector classifier with its default settings
svm = SVC()
svm.fit(X_train, y_train)

# Scoring on the validation split with the weighted F1
svm_predictions = svm.predict(X_val)
svm_f1 = f1_score(y_val, svm_predictions, average='weighted')
print("SVM F1 Score:", svm_f1)

#### Naive Bayes

In [15]:
# Fitting multinomial Naive Bayes on the TF-IDF features
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Scoring on the validation split with the weighted F1
nb_predictions = nb.predict(X_val)
nb_f1 = f1_score(y_val, nb_predictions, average='weighted')
print("Naive Bayes F1 Score:", nb_f1)

# Generate predictions on the test set

In [16]:
# Converting the test data into TF-IDF vectors with the vectorizer fitted on the training text
X_test = vectorizer.transform(test_df['text'])

# Generating predictions with the Naive Bayes model; swap in whichever model scored highest on the validation split
test_predictions = nb.predict(X_test)

# Creating a csv for submission

In [17]:
# Creating a submission dataframe with 'index' and 'lang_id' columns
submission_df = pd.DataFrame({'index': test_df['index'], 'lang_id': test_predictions})

# Writing the submission file without the DataFrame's own row index
submission_df.to_csv('FinalSub1.csv', index=False)
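
Before uploading, it can be worth checking the submission frame for obvious problems. The sketch below is one possible check, not a competition requirement. The helper `check_submission` and the toy frames are hypothetical, and the expected column names are taken from the comment in the cell above.

```python
import pandas as pd


def check_submission(submission, test_df):
    """Hypothetical helper: return a list of problems; an empty list means none found."""
    problems = []
    if list(submission.columns) != ["index", "lang_id"]:
        problems.append("columns should be exactly ['index', 'lang_id']")
        return problems
    if len(submission) != len(test_df):
        problems.append("row count differs from the test set")
    elif not (submission["index"].to_numpy() == test_df["index"].to_numpy()).all():
        problems.append("index values do not match the test set order")
    if submission["lang_id"].isna().any():
        problems.append("some rows have no predicted language")
    return problems


# Hypothetical stand-ins for test_df and submission_df.
toy_test = pd.DataFrame({"index": [1, 2, 3], "text": ["a", "b", "c"]})
toy_submission = pd.DataFrame({"index": [1, 2, 3], "lang_id": ["eng", "xho", "ven"]})

print(check_submission(toy_submission, toy_test))
print(check_submission(toy_submission.iloc[:2], toy_test))
```

In the notebook itself the call would be `check_submission(submission_df, test_df)`, run just before `to_csv`.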