## Problem Definition

Language detection is a Natural Language Processing (NLP) task that involves identifying the language of a text or document. A few years ago, using machine learning for language identification was a challenging endeavor due to the lack of data on languages. However, with the recent proliferation of data, powerful machine-learning models have been developed to facilitate language identification.
As a human, you can easily recognize the languages you know. For instance, I can quickly identify Kiswahili and English, but even as a Kenyan I cannot always identify every Kenyan language. This is where language identification technology can be of great help. Google Translate is one of the most widely used translation services in the world, with millions of people relying on it. It also includes a machine learning model that detects the language of the input, which is useful when you are unsure which language the source text is written in.

## Data Preparation

The most crucial part of training a language detection model is data. In general, the more representative text you have for each language, the more accurate the model tends to be on unseen input. For this purpose, I am using a dataset from Kaggle that contains 22 popular languages with 1000 sentences in each. Its size and balance make it a good fit for training a language detection model with machine learning. In the section below, I will take you through the process of training a language detection model.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
data = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/dataset.csv")
print(data.head())

                                                Text  language
0  klement gottwaldi surnukeha palsameeriti ning ...  Estonian
1  sebes joseph pereira thomas  på eng the jesuit...   Swedish
2  ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...      Thai
3  விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...     Tamil
4  de spons behoort tot het geslacht haliclona en...     Dutch


Now that the data is loaded, we can check whether there are any null values and decide if any cleaning is needed before we use it to train our model.

In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22000 entries, 0 to 21999
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Text      22000 non-null  object
 1   language  22000 non-null  object
dtypes: object(2)
memory usage: 343.9+ KB


In [3]:
data.isnull().sum()

Text        0
language    0
dtype: int64

Our dataset does not have any null values, so we can proceed to look at the languages present in it.

In [4]:
data["language"].value_counts()

Estonian      1000
Swedish       1000
English       1000
Russian       1000
Romanian      1000
Persian       1000
Pushto        1000
Spanish       1000
Hindi         1000
Korean        1000
Chinese       1000
French        1000
Portugese     1000
Indonesian    1000
Urdu          1000
Latin         1000
Turkish       1000
Japanese      1000
Dutch         1000
Tamil         1000
Thai          1000
Arabic        1000
Name: language, dtype: int64

This dataset contains 22 languages with 1000 sentences from each language. This is a very balanced dataset with no missing values, so we can say this dataset is completely ready to be used to train a machine learning model.
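
The same two checks (balanced classes, no missing values) can also be written as explicit tests. The following is a minimal sketch on a small made-up DataFrame that only mimics the column names of the real dataset; the rows and the `toy` name are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical rows, same column names).
toy = pd.DataFrame({
    "Text": ["hello world", "good morning", "hola mundo", "buenos dias"],
    "language": ["English", "English", "Spanish", "Spanish"],
})

counts = toy["language"].value_counts()

# Balanced means every language has the same number of rows.
is_balanced = counts.nunique() == 1
# True if any cell in the frame is null.
has_missing = bool(toy.isnull().values.any())

print(counts.to_dict(), is_balanced, has_missing)
```

Running the same three lines against `data` instead of `toy` should confirm what the output above already shows.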

## Building Language Detection Model
Now let’s convert the text into word-count features with `CountVectorizer` and split the data into training and test sets:

In [5]:
x = np.array(data["Text"])
y = np.array(data["language"])

cv = CountVectorizer()
X = cv.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.30, 
                                                    random_state=42)
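
To make the vectorization step more concrete, here is a small sketch of what `CountVectorizer` produces with its default settings. The two sentences and the `toy_cv` name are made up for illustration and are not part of the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, used only to illustrate the transformation.
docs = ["the cat sat", "the cat saw the dog"]

toy_cv = CountVectorizer()
counts = toy_cv.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each token to its column index; sort tokens by that index.
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

print(vocab)             # column order of the matrix
print(counts.toarray())  # how often each token occurs in each document
```

Each row is a bag-of-words vector: word order is discarded and only the counts remain. These counts are the features the classifier is trained on.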

Because this is a multiclass classification problem, I will use the Multinomial Naïve Bayes algorithm to train the language detection model. It is a common baseline for text classification with word-count features and often performs well on problems of this kind.

In [6]:
model = MultinomialNB()
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.9528787878787879
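
The value returned by `model.score` is the overall accuracy, which can hide languages that the model handles poorly. The sketch below shows how a per-language breakdown could be computed. The `y_true` and `y_pred` lists are hypothetical hand-written labels, not results from the model above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true = ["English", "English", "Spanish", "Spanish", "Dutch", "Dutch"]
y_pred = ["English", "English", "Spanish", "Dutch", "Dutch", "Dutch"]
labels = ["Dutch", "English", "Spanish"]

acc = accuracy_score(y_true, y_pred)  # fraction of correct predictions

# Rows are true languages, columns are predicted languages.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Per-language recall: correct predictions divided by the true count.
recall = cm.diagonal() / cm.sum(axis=1)

print(acc)
print(cm)
print(dict(zip(labels, recall)))
```

With the real model, `y_true` would be `y_test` and `y_pred` would be `model.predict(X_test)`.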

Our model reaches an accuracy of about 95% on the test set, which is quite good. We can now try it on some user input:

In [7]:
user = input("Enter a Text: ")
# Use a new name so the `data` DataFrame loaded earlier is not overwritten.
features = cv.transform([user])
output = model.predict(features)
print(output)

Enter a Text: Добрый день! Меня зовут Иван, и я рад познакомиться с вами. Я живу в Москве и люблю исследовать город и его историю. В свободное время я люблю готовить и путешествовать. Как ваши дела?
['Russian']


As you can see, the model identifies this example correctly. One thing to note is that the model can only detect the 22 languages present in the dataset.
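
To illustrate that limitation, here is a sketch with a toy two-language model. The training sentences are made up and are not the Kaggle data. The sketch wraps the vectorizer and classifier in a scikit-learn `Pipeline`, a design choice that differs from the code above and that lets `predict` take raw strings directly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training sentences; the toy model knows only two languages.
train_texts = [
    "good morning my friend",
    "how are you today",
    "buenos dias mi amigo",
    "como estas hoy",
]
train_labels = ["English", "English", "Spanish", "Spanish"]

pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(train_texts, train_labels)

# The only labels the model can ever return.
known = list(pipe.named_steps["multinomialnb"].classes_)

in_domain = pipe.predict(["good morning, how are you"])[0]
# German text: the model still answers with one of its known labels.
out_of_domain = pipe.predict(["guten morgen, wie geht es dir"])[0]

print(known, in_domain, out_of_domain)
```

Text in an unseen language is forced into one of the known classes, so a real application would need some separate way to reject such input.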


## Conclusion
A language detection project is a machine learning project focused on automatically determining which language a given input text is written in. The goal is to develop a model that can accurately classify languages based on patterns within words, phrases, and sentences. To achieve this, researchers often collect data from various sources to create datasets for training. Engineers then train their models using supervised learning techniques such as Naïve Bayes (used here), support vector machines, or neural networks. Once trained, the model is evaluated against a test set of unseen data to assess how accurately it predicts languages. Such models are useful in many natural language processing tasks, such as identifying the language of source content during web crawls or in automatic translation services. As we have seen, our model can help with these tasks, although a production system would need to be trained on much more data and on more languages.