# Language Detection Model

Language detection is the task of automatically identifying the language of a given text or speech sample. A language detection model is a classifier trained to distinguish languages using features such as vocabulary, grammar, syntax, and character n-grams. Such models are trained on large multilingual text datasets and use statistical techniques to predict the language of new input. Practical applications include multilingual chatbots, machine translation, and content filtering. Models are typically evaluated with metrics such as accuracy, precision, and recall, and their performance can often be improved further with techniques such as ensemble learning and active learning.

**Importing Python Libraries**

In [1]:
import numpy as np
import pandas as pd
import string
import matplotlib.pyplot as plt
import seaborn as sns
import re

**Reading the Dataset**

Dataset available at https://www.kaggle.com/datasets/basilb2s/language-detection?resource=download

In [2]:
df = pd.read_csv('Language Detection.csv')

In [3]:
df.head()

Unnamed: 0,Text,Language
0,"Nature, in the broadest sense, is the natural...",English
1,"""Nature"" can refer to the phenomena of the phy...",English
2,"The study of nature is a large, if not the onl...",English
3,"Although humans are part of nature, human acti...",English
4,[1] The word nature is borrowed from the Old F...,English


In [4]:
# Remove ASCII punctuation (string.punctuation) and lowercase the text

def remove_punctuations(text):
    # str.translate deletes every listed character in a single pass
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.lower()

In [5]:
df['Text'] = df['Text'].apply(remove_punctuations)
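
`string.punctuation` contains only the 32 ASCII punctuation characters, so marks such as `¿` or the Devanagari danda `।` pass through `remove_punctuations` unchanged. Below is a minimal sketch of a Unicode-aware variant, assuming it is acceptable to drop every character whose Unicode general category starts with `P`. The name `remove_unicode_punctuation` is hypothetical and not part of the notebook.

```python
import unicodedata


def remove_unicode_punctuation(text):
    # Keep every character whose Unicode general category is not punctuation (P*).
    kept = [ch for ch in text if not unicodedata.category(ch).startswith("P")]
    return "".join(kept).lower()


print(remove_unicode_punctuation("¿Cómo estás?"))
print(remove_unicode_punctuation("Привет, я Глен!"))
```

This is not a strict superset of the original function: ASCII symbols such as `$` or `+` belong to the symbol categories (`S*`) and would be kept by this variant.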

In [6]:
df.head()

Unnamed: 0,Text,Language
0,nature in the broadest sense is the natural p...,English
1,nature can refer to the phenomena of the physi...,English
2,the study of nature is a large if not the only...,English
3,although humans are part of nature human activ...,English
4,1 the word nature is borrowed from the old fre...,English


In [7]:
df.shape

(10337, 2)

In [8]:
df['Language'].unique()

array(['English', 'Malayalam', 'Hindi', 'Tamil', 'Portugeese', 'French',
       'Dutch', 'Spanish', 'Greek', 'Russian', 'Danish', 'Italian',
       'Turkish', 'Sweedish', 'Arabic', 'German', 'Kannada'], dtype=object)

In [9]:
X = df['Text']
y = df['Language']

In [10]:
print(X.shape)
print(y.shape)

(10337,)
(10337,)


**Splitting the dataset into training and testing parts**

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
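
The split above is purely random, so small classes can end up unevenly represented between the two parts (the classification report further down shows only 20 Hindi samples in the test set). `train_test_split` accepts a `stratify` argument that preserves class proportions. Here is a small sketch on made-up toy labels, not the Kaggle data, just to show the effect:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in data (an assumption for illustration, not the Kaggle dataset):
# 16 "English" rows and 4 "Hindi" rows.
texts = [f"sample {i}" for i in range(20)]
labels = ["English"] * 16 + ["Hindi"] * 4

# stratify=labels keeps the 80/20 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=101, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```

In this notebook that would amount to passing `stratify=y` to the existing call. The scores reported below would then change, since the partition would differ.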

In [13]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(7235,)
(7235,)
(3102,)
(3102,)


**Using feature extraction**

In [14]:
from sklearn import feature_extraction

In [15]:
vectorizer = feature_extraction.text.TfidfVectorizer(ngram_range=(1,2), analyzer='char')
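
With `analyzer='char'` and `ngram_range=(1,2)`, the vectorizer builds its features from single characters and adjacent character pairs rather than from words. The sketch below is a rough pure-Python approximation of the counting step only. It does not reproduce the TF-IDF weighting or normalization that `TfidfVectorizer` applies, and `char_ngrams` is a hypothetical helper name.

```python
from collections import Counter


def char_ngrams(text, min_n=1, max_n=2):
    # Slide a window of each size n over the text, spaces included.
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


counts = Counter(char_ngrams("hej hej"))
print(counts.most_common(5))
```

Character-level features like these tend to work for language identification because letter frequencies and common letter pairs differ strongly between languages and scripts.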

In [16]:
from sklearn import pipeline

**Here I've used the logistic regression machine learning model**

In [17]:
from sklearn.linear_model import LogisticRegression

In [18]:
logistic_regression_model = LogisticRegression()

In [19]:
pipe = pipeline.Pipeline([('vec', vectorizer), ('logistic_regression_model', logistic_regression_model)])

In [20]:
pipe.fit(X_train, y_train)

In [21]:
pipe.classes_

array(['Arabic', 'Danish', 'Dutch', 'English', 'French', 'German',
       'Greek', 'Hindi', 'Italian', 'Kannada', 'Malayalam', 'Portugeese',
       'Russian', 'Spanish', 'Sweedish', 'Tamil', 'Turkish'], dtype=object)

In [22]:
predictions = pipe.predict(X_test)

In [23]:
predictions

array(['Malayalam', 'German', 'English', ..., 'French', 'Spanish',
       'Sweedish'], dtype=object)

**Using metrics to check the accuracy of the model**

In [24]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [25]:
accuracy_score(y_test, predictions)

0.9693745970341715

In [26]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

      Arabic       1.00      0.99      0.99       160
      Danish       0.93      0.85      0.89       131
       Dutch       0.96      0.93      0.95       158
     English       0.97      0.99      0.98       419
      French       0.96      0.98      0.97       303
      German       0.94      0.98      0.96       137
       Greek       1.00      1.00      1.00       107
       Hindi       1.00      1.00      1.00        20
     Italian       0.97      0.94      0.95       236
     Kannada       1.00      1.00      1.00       103
   Malayalam       1.00      1.00      1.00       181
  Portugeese       0.99      0.95      0.97       215
     Russian       0.98      1.00      0.99       197
     Spanish       0.93      0.95      0.94       233
    Sweedish       0.93      0.96      0.95       214
       Tamil       1.00      1.00      1.00       144
     Turkish       0.99      0.97      0.98       144

    accuracy              
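
The per-class rows above are enough to approximate the two averages a classification report usually summarizes: the macro average (every class counts equally) and the support-weighted average. The sketch below recomputes both for the F1 score from the rounded values printed in the report, so the results are only accurate to about two decimals.

```python
# (f1-score, support) pairs copied from the report above, in the same order.
rows = {
    "Arabic": (0.99, 160), "Danish": (0.89, 131), "Dutch": (0.95, 158),
    "English": (0.98, 419), "French": (0.97, 303), "German": (0.96, 137),
    "Greek": (1.00, 107), "Hindi": (1.00, 20), "Italian": (0.95, 236),
    "Kannada": (1.00, 103), "Malayalam": (1.00, 181), "Portugeese": (0.97, 215),
    "Russian": (0.99, 197), "Spanish": (0.94, 233), "Sweedish": (0.95, 214),
    "Tamil": (1.00, 144), "Turkish": (0.98, 144),
}

total_support = sum(support for _, support in rows.values())

# Macro average: plain mean over classes, regardless of class size.
macro_f1 = sum(f1 for f1, _ in rows.values()) / len(rows)

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in rows.values()) / total_support

print(total_support)
print(round(macro_f1, 3), round(weighted_f1, 3))
```

The macro average gives the 20 Hindi samples the same weight as the 419 English ones, which is worth keeping in mind when classes are this unbalanced.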

In [27]:
confusion_matrix(y_test, predictions)

array([[158,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   2,
          0,   0,   0,   0],
       [  0, 111,   3,   4,   1,   0,   0,   0,   0,   0,   0,   0,   0,
          0,  12,   0,   0],
       [  0,   4, 147,   3,   0,   4,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0],
       [  0,   1,   0, 413,   2,   1,   0,   0,   1,   0,   0,   0,   0,
          1,   0,   0,   0],
       [  0,   0,   0,   1, 297,   2,   0,   0,   1,   0,   0,   0,   0,
          1,   0,   0,   1],
       [  0,   0,   1,   0,   1, 134,   0,   0,   0,   0,   0,   0,   0,
          0,   1,   0,   0],
       [  0,   0,   0,   0,   0,   0, 107,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,  20,   0,   0,   0,   0,   0,
          0,   0,   0,   0],
       [  0,   0,   1,   1,   2,   0,   0,   0, 222,   0,   0,   1,   0,
          9,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0, 103,   0,   0,   0,
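
Each row of the confusion matrix corresponds to a true class and each column to a predicted class, both in the order given by `pipe.classes_`. As a small sketch of how to read it, the code below takes the Danish row (the second row printed above) and recovers its recall and its most frequent confusion.

```python
classes = ['Arabic', 'Danish', 'Dutch', 'English', 'French', 'German',
           'Greek', 'Hindi', 'Italian', 'Kannada', 'Malayalam', 'Portugeese',
           'Russian', 'Spanish', 'Sweedish', 'Tamil', 'Turkish']

# Second row of the matrix above: true label is Danish.
danish_row = [0, 111, 3, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 0, 0]

true_idx = classes.index('Danish')
# Recall = correctly predicted Danish samples / all true Danish samples.
recall = danish_row[true_idx] / sum(danish_row)

# Largest off-diagonal entry: the class Danish text is most often mistaken for.
errors = {c: n for c, n in zip(classes, danish_row) if c != 'Danish' and n > 0}
most_confused = max(errors, key=errors.get)

print(round(recall, 2), most_confused, errors[most_confused])
```

The row sum matches the support of 131 listed for Danish in the classification report, and the recall agrees with the value reported there.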
         

In [28]:
pipe.predict(['Hey, I am Glen'])

array(['English'], dtype=object)

In [29]:
pipe.predict(['Привет, я Глен'])

array(['Russian'], dtype=object)
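
`predict` always returns one of the training labels, even for text in a language the model has never seen. Because the last pipeline step is a `LogisticRegression`, the pipeline also exposes `predict_proba`, which can help flag low-confidence inputs. The sketch below uses a tiny made-up corpus (an assumption, not the Kaggle data) and a hypothetical `toy_pipe` so that it runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy corpus, for illustration only.
texts = ["hello how are you", "good morning my friend", "this is a book",
         "hola como estas", "buenos dias amigo", "esto es un libro"]
labels = ["English"] * 3 + ["Spanish"] * 3

toy_pipe = Pipeline([
    ("vec", TfidfVectorizer(ngram_range=(1, 2), analyzer="char")),
    ("clf", LogisticRegression()),
])
toy_pipe.fit(texts, labels)

# One row per input, one column per class, in toy_pipe.classes_ order.
proba = toy_pipe.predict_proba(["hello my friend"])[0]
for language, p in zip(toy_pipe.classes_, proba):
    print(f"{language}: {p:.2f}")
```

With the notebook's own model, `pipe.predict_proba([...])` would return 17 probabilities per input. Where to put a rejection threshold is a judgment call that would need checking against held-out data.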

In [30]:
import pickle

In [31]:
# The with-block closes the file even if pickling raises an error
with open('model.pkl', 'wb') as new_file:
    pickle.dump(pipe, new_file)
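
Loading the file back mirrors the save step. The round-trip sketch below uses a plain dictionary as a stand-in for the fitted pipeline and writes to a temporary directory, both assumptions made so the example runs on its own. In the notebook, the same `pickle.load` call on `model.pkl` would return the pipeline.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted `pipe`.
stand_in_model = {"classes": ["English", "Russian"], "ngram_range": (1, 2)}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Save: binary write mode, as in the cell above.
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load: binary read mode returns an equal but separate object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code. A pickled scikit-learn model is also best loaded with the same library version that saved it.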