# Hackathon Language Identification Challenge: Notebook

## Introduction
Welcome to the Language Identification Challenge Hackathon! In this challenge, we aim to build a robust language identification model that can accurately classify text into its respective language category. This notebook serves as a comprehensive guide to our approach, methodology, and the steps taken to create an effective language identification solution.

## Challenge Overview
Language identification is a crucial task in natural language processing (NLP) and has numerous applications, ranging from content filtering to improving machine translation systems. The goal of this hackathon is to leverage machine learning techniques to build a model that excels at accurately determining the language of a given text, even in cases of multilingual or ambiguous content.

## Dataset
Our dataset comprises a diverse collection of text samples from various languages. Each text entry is labeled with its corresponding language, forming the basis for supervised learning. The challenge is to train a classification model that can generalize well to unseen text data.

## Approach

#### Data Exploration:
We will begin by exploring the dataset to understand its structure and the distribution of languages.

#### Data Preprocessing: 
To prepare the data for model training, we will handle missing values, tokenize the text, and convert it into a numeric format suitable for machine learning.
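As a minimal sketch of the cleaning part of this step, assuming a DataFrame with `lang_id` and `text` columns like the training set, one could drop rows with missing text and normalise case and whitespace. The helper `clean_text` and the toy rows are hypothetical and only illustrate the idea; tokenization is left to the vectorizer in this sketch.

```python
import re

import pandas as pd


def clean_text(text: str) -> str:
    """Lowercase a string, collapse runs of whitespace, and strip both ends."""
    return re.sub(r"\s+", " ", text.lower()).strip()


# Toy frame standing in for train_df (hypothetical rows, not real data)
toy_df = pd.DataFrame({
    "lang_id": ["eng", "afr", "eng"],
    "text": ["  The Province   of KwaZulu-Natal ", None, "Hello  World"],
})

# Drop rows with missing text, then normalise what remains
toy_df = toy_df.dropna(subset=["text"]).reset_index(drop=True)
toy_df["clean_text"] = toy_df["text"].apply(clean_text)

print(toy_df[["lang_id", "clean_text"]])
```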

#### Feature Engineering: 
Extracting relevant features is crucial for the success of our model. We may consider techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings.

#### Model Selection: 
We will experiment with several classification algorithms, such as logistic regression, support vector machines, and neural networks, to identify the one that performs best on this language identification task.
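One way such a comparison could look is sketched below, assuming scikit-learn and a tiny hypothetical two-language corpus. The candidate names, the `afr` label, and the cross-validation settings are illustrative assumptions; on the real data one would use more folds and the full training set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: four sentences per language
texts = [
    "the department of transport", "the minister of health",
    "this is the new school", "they are at the house",
    "die departement van vervoer", "die minister van gesondheid",
    "dit is die nuwe skool", "hulle is by die huis",
]
labels = ["eng"] * 4 + ["afr"] * 4

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
}

scores = {}
for name, clf in candidates.items():
    # Vectorizer and classifier are wrapped together so that the TF-IDF
    # vocabulary is fitted on the training folds only
    pipe = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)), clf
    )
    # 2-fold stratified cross-validation, scored with macro-averaged F1
    fold_scores = cross_val_score(pipe, texts, labels, cv=2, scoring="f1_macro")
    scores[name] = fold_scores.mean()

for name, score in scores.items():
    print(f"{name}: mean macro F1 = {score:.3f}")
```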

#### Model Training: 
Once a model is selected, we will train it on the training dataset and tune its hyperparameters to improve performance on held-out data.

#### Evaluation: 
We will evaluate the model using appropriate metrics, considering factors like precision, recall, and F1-score, given the potential class imbalance.
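The metrics named above can be computed with scikit-learn, as in this minimal sketch. The label lists are hypothetical and stand in for real test labels and model predictions.

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical true and predicted labels for six test sentences
y_true = ["eng", "eng", "xho", "xho", "ven", "ven"]
y_pred = ["eng", "xho", "xho", "xho", "ven", "ven"]

# Per-class precision, recall, and F1-score
print(classification_report(y_true, y_pred, zero_division=0))

# Macro averaging gives every language the same weight regardless of how
# many samples it has, which matters when classes are imbalanced
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"macro F1: {macro_f1:.3f}")
```

In this toy case one `eng` sentence is misclassified as `xho`, which lowers recall for `eng` and precision for `xho` while leaving `ven` untouched.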

#### Inference: 
After training the model, we will demonstrate its language identification capabilities on new, unseen text samples.
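A minimal sketch of the inference step, assuming a scikit-learn pipeline trained on a hypothetical toy corpus. The sentences, the `afr` label, and the choice of classifier are placeholders, not the notebook's final model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data
train_texts = [
    "the department of transport", "the minister of health",
    "die departement van vervoer", "die minister van gesondheid",
]
train_labels = ["eng", "eng", "afr", "afr"]

# The pipeline applies the same vectorization at training and prediction time
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# Unseen sentences are passed as raw strings
new_texts = ["the school is new", "die skool is nuut"]
predictions = model.predict(new_texts)

for text, lang in zip(new_texts, predictions):
    print(f"{lang}: {text}")
```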

# Importing Libraries

In [2]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


In [3]:
# Load the training data
train_df = pd.read_csv('train_set.csv')

# Load the test data
test_df = pd.read_csv('test_set.csv')

# Data Exploration

In [4]:
train_df.head()

  lang_id                                               text
0     xho  umgaqo-siseko wenza amalungiselelo kumaziko ax...
1     xho  i-dha iya kuba nobulumko bokubeka umsebenzi na...
2     eng  the province of kwazulu-natal department of tr...
3     nso  o netefatša gore o ba file dilo ka moka tše le...
4     ven  khomishini ya ndinganyiso ya mbeu yo ewa maana...


In [5]:
train_df.shape

(33000, 2)

In [6]:
print(f' There are {train_df.shape[0]} rows and {train_df.shape[1]} columns')

 There are 33000 rows and 2 columns


In [7]:
train_df.dtypes

lang_id    object
text       object
dtype: object

# *Observations*
#### **The dataset has the following columns**:

lang_id: The abbreviated language code (for example, xho, eng, nso, ven) that identifies the language of each text.

text: Contains the text of the sentences associated with each language.

#### **Data Types**

The data columns have the following data types:
lang_id: string (pandas `object` dtype); text: string (pandas `object` dtype)

#### **Dataset Size**

The dataset consists of 33000 entries and 2 columns.

This dataset will be used to train and evaluate machine learning models that predict the language (lang_id) of each entry in the text column.

# **Observing the Target Variable**

We will explore the following:
<ul>
  <li>Summary Statistics</li>
  <li>Target Variable Distribution</li>
</ul>

In [8]:
# Explore summary statistics of the text column
train_df['text'].describe()

count                                                 33000
unique                                                29948
top       ngokwesekhtjheni yomthetho ophathelene nalokhu...
freq                                                     17
Name: text, dtype: object
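The summary above reports 33000 rows but only 29948 unique texts, so 33000 - 29948 = 3052 rows repeat a text that already appears elsewhere. A minimal sketch of how such repeats can be counted is shown below on a hypothetical toy frame; the same calls could be applied to `train_df`, although whether duplicates should be dropped is a separate decision.

```python
import pandas as pd

# Hypothetical toy frame with one text appearing three times
toy_df = pd.DataFrame({
    "lang_id": ["eng", "eng", "xho", "eng"],
    "text": ["hello world", "hello world", "molo mhlaba", "hello world"],
})

# Rows whose text already appeared earlier in the frame
n_repeats = toy_df.duplicated(subset="text").sum()

# The same figure derived the way describe() reports it: count minus unique
count_minus_unique = toy_df["text"].count() - toy_df["text"].nunique()

print(n_repeats, count_minus_unique)
```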