# Language Detection

As a human, you can easily recognize the languages you know. For example, I can easily identify Hindi and English, but even as an Indian I cannot identify every Indian language. This is where the language identification task becomes useful. Google Translate, one of the most widely used translation services in the world, includes a machine learning model that detects the source language, which helps when you don’t know which language you are translating from.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


In [2]:
data = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/dataset.csv")
print(data.head())

                                                Text  language
0  klement gottwaldi surnukeha palsameeriti ning ...  Estonian
1  sebes joseph pereira thomas  på eng the jesuit...   Swedish
2  ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...      Thai
3  விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...     Tamil
4  de spons behoort tot het geslacht haliclona en...     Dutch


In [3]:
# Check whether the dataset contains any null values:
data.isnull().sum()

Text        0
language    0
dtype: int64

In [4]:
# Now let’s have a look at all the languages present in this dataset:
data['language'].value_counts()

Tamil         1000
Swedish       1000
Hindi         1000
Russian       1000
Persian       1000
Pushto        1000
Chinese       1000
Spanish       1000
Indonesian    1000
Urdu          1000
Korean        1000
Turkish       1000
Latin         1000
French        1000
Arabic        1000
Estonian      1000
Japanese      1000
Thai          1000
Romanian      1000
English       1000
Dutch         1000
Portugese     1000
Name: language, dtype: int64

In [5]:
# This dataset contains 22 languages with 1000 sentences from each language. It is balanced and has no missing values, so it can be used to train a machine learning model without further cleaning.

## Language Detection Model 

In [6]:
# First, extract the texts (inputs) and the language labels (targets) as NumPy arrays:
x = np.array(data['Text'])
y = np.array(data['language'])

In [7]:
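# Convert each text into a vector of word counts (bag of words).
# Note: the vectorizer is fitted on all texts before the split,
# so its vocabulary also includes words that occur only in the test set.
cv = CountVectorizer()
X = cv.fit_transform(x)

# Hold out 33% of the samples as a test set; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=42)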

In [8]:
# This is a multiclass classification problem, so I will use the Multinomial Naïve Bayes algorithm to train the language detection model. It is a common choice for text classification on word-count features and often performs well on such problems:

In [9]:
model = MultinomialNB()
model.fit(X_train, y_train)
# score() returns the mean accuracy on the held-out test set
model.score(X_test, y_test)

0.953168044077135
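
In the cells above, the vectorizer is fitted on all texts before the data is split. A possible alternative, sketched here under the assumption that you want the vocabulary learned from the training texts only, is to split the raw texts first and chain the two steps with scikit-learn's `make_pipeline`. The tiny corpus below is hypothetical and only stands in for `data['Text']` and `data['language']`, so the score it prints says nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for data['Text'] and data['language']
texts = [
    "good morning my friend", "thank you very much", "where is the station",
    "i like this book", "the weather is nice", "see you tomorrow",
    "buenos dias mi amigo", "muchas gracias por todo", "donde esta la estacion",
    "me gusta este libro", "el tiempo es bueno", "hasta manana amigo",
]
labels = ["English"] * 6 + ["Spanish"] * 6

# Split the raw texts first, so the test texts stay unseen during fitting
texts_train, texts_test, labels_train, labels_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training texts only,
# then trains the classifier on the resulting count vectors
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(texts_train, labels_train)

accuracy = pipe.score(texts_test, labels_test)
print(accuracy)
```

A side benefit of the pipeline is that `pipe.predict` accepts raw strings directly, so no separate `transform` call is needed at prediction time.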

In [10]:
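# Now let’s use this model to detect the language of a text by taking a user input:
user = input('Enter a text: ')
# Use a new variable name so the original `data` DataFrame is not overwritten;
# the text must be transformed with the same fitted vectorizer used for training
user_features = cv.transform([user])
output = model.predict(user_features)
print(output)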

Enter a text:  देखकर अच्छा लगता है
['Hindi']
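
The prediction step can also be wrapped in a small reusable function. The sketch below is one possible way to do that; the helper `detect_language` is hypothetical, and it is shown with a tiny made-up training set so the block runs on its own. In the notebook you would pass the fitted `cv` and `model` instead. It also returns the class probability from `predict_proba`, which gives a rough idea of how confident the model is.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def detect_language(text, vectorizer, classifier):
    """Return (predicted label, probability) for a single text.

    Hypothetical helper: `vectorizer` and `classifier` must already be fitted.
    """
    features = vectorizer.transform([text])        # sparse input works with MultinomialNB
    probs = classifier.predict_proba(features)[0]  # one probability per known class
    best = probs.argmax()
    return classifier.classes_[best], float(probs[best])


# Toy stand-ins for the fitted `cv` and `model` from the notebook
toy_texts = ["good morning my friend", "thank you very much",
             "buenos dias mi amigo", "muchas gracias por todo"]
toy_labels = ["English", "English", "Spanish", "Spanish"]

toy_cv = CountVectorizer()
toy_model = MultinomialNB().fit(toy_cv.fit_transform(toy_texts), toy_labels)

label, confidence = detect_language("gracias amigo", toy_cv, toy_model)
print(label, round(confidence, 3))
```

Words that the vectorizer never saw during fitting are ignored at prediction time, so very short inputs or inputs in an unseen language can still receive a label, just with little evidence behind it.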
