## 1. Import dependencies

In [2]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import pickle

import warnings
warnings.simplefilter("ignore")

## 2. Loading the dataset

In [3]:
data = pd.read_csv("Language Detection.csv")

In [4]:
data.head()

Unnamed: 0,Text,Language
0,"Nature, in the broadest sense, is the natural...",English
1,"""Nature"" can refer to the phenomena of the phy...",English
2,"The study of nature is a large, if not the onl...",English
3,"Although humans are part of nature, human acti...",English
4,[1] The word nature is borrowed from the Old F...,English


## 3. Feature Engineering

In [5]:
X = data["Text"]
y = data["Language"]

In [6]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [7]:
le.classes_

array(['Arabic', 'Danish', 'Dutch', 'English', 'French', 'German',
       'Greek', 'Hindi', 'Italian', 'Kannada', 'Malayalam', 'Portugeese',
       'Russian', 'Spanish', 'Sweedish', 'Tamil', 'Turkish'], dtype=object)
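
The encoder maps each language name to an integer index into `le.classes_`, and the mapping can be reversed with `inverse_transform`. A minimal sketch, assuming a small hypothetical label list in place of `data["Language"]`:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels standing in for data["Language"]
labels = ["English", "French", "English", "Spanish"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)  # classes are stored in sorted order

# Map integer codes back to the original label strings
decoded = le_demo.inverse_transform(np.array([0, 1, 2]))
print(list(encoded), list(decoded))
```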

In [8]:
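# Note: data_list is not passed to the split below, which uses the raw X
data_list = []
for text in X:
    # Replace the listed punctuation, symbols, newlines and digits with spaces
    text = re.sub(r'[!@#$(),\n"%^*?\:;~`0-9]', ' ', text)
    # Replace square brackets (escaped so that [ and ] are each matched)
    text = re.sub(r'[\[\]]', ' ', text)
    text = text.lower()
    data_list.append(text)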
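
The cleaning step can also be packaged as a reusable function. The sketch below is one possible version, not the notebook's own code: `clean_text` and `_PUNCT_DIGITS` are hypothetical names, and the square brackets are folded into the same character class as the other symbols.

```python
import re

# Same symbols and digits as the loop above, with escaped square brackets added
_PUNCT_DIGITS = re.compile(r'[!@#$(),\n"%^*?:;~`0-9\[\]]')

def clean_text(text: str) -> str:
    """Replace the listed punctuation, digits and brackets with spaces, then lowercase."""
    return _PUNCT_DIGITS.sub(" ", text).lower()

print(clean_text("[1] Hello, World!").split())
```

Something like `[clean_text(t) for t in X]` would then produce the same kind of list as the loop.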

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
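
Because the split above is not seeded, the accuracy will vary slightly from run to run. A hedged sketch of a reproducible, class-balanced split on hypothetical toy data follows; the `random_state` value is an arbitrary assumption:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for X and y
texts = [f"sample {i}" for i in range(10)]
labels = [0] * 5 + [1] * 5

# random_state fixes the shuffle; stratify keeps class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.20, random_state=42, stratify=labels
)
print(len(X_tr), len(X_te), sorted(y_te))
```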

## 4. Performing NLP (creating a bag of words using CountVectorizer)

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
cv.fit(X_train)

x_train = cv.transform(X_train).toarray()
x_test  = cv.transform(X_test).toarray()
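
`CountVectorizer.transform` returns a SciPy sparse matrix, and `MultinomialNB` accepts sparse input directly, so the `.toarray()` conversion can be skipped when memory is tight. A minimal sketch on a hypothetical four-sentence corpus:

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for X_train / y_train
docs = ["hello how are you", "good morning to you",
        "bonjour comment vas tu", "salut comment ca va"]
langs = ["English", "English", "French", "French"]

vec = CountVectorizer()
X_sparse = vec.fit_transform(docs)  # sparse word-count matrix, no .toarray()

clf = MultinomialNB()
clf.fit(X_sparse, langs)            # trains on the sparse matrix as-is

pred_demo = clf.predict(vec.transform(["comment vas tu"]))
print(issparse(X_sparse), pred_demo)
```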

In [11]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(x_train, y_train)

MultinomialNB()

In [12]:
y_pred = model.predict(x_test)

## 5. Model Evaluation

In [13]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

ac = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)

In [14]:
print("Accuracy is :",ac)

Accuracy is : 0.9753384912959381
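
The confusion matrix and classification report computed above are not displayed. The sketch below shows how they might be printed with readable class names, using hypothetical labels and predictions for three classes:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical encoded labels and predictions for three classes
class_names = ["English", "French", "Spanish"]
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 1, 2, 2, 2]

cm_demo = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class
print(cm_demo)
print(classification_report(y_true, y_hat, target_names=class_names))
```

In this notebook, `target_names=le.classes_` would presumably label the rows with the 17 language names.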


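A scikit-learn `Pipeline` chains the vectorizer and the classifier so that raw text can be passed directly to `fit` and `predict`. Learn more: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html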

In [15]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([('vectorizer', cv), ('multinomialNB', model)])
pipe.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', CountVectorizer()),
                ('multinomialNB', MultinomialNB())])

In [16]:
y_pred2 = pipe.predict(X_test)
ac2 = accuracy_score(y_test, y_pred2)
print("Accuracy is :",ac2)

Accuracy is : 0.9753384912959381


## 6. Saving the trained model

In [17]:
with open('trained_pipeline-0.1.0.pkl','wb') as f:
    pickle.dump(pipe, f)
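
Loading the file back works the same way in reverse. The sketch below round-trips a hypothetical toy pipeline through a temporary file; the file name and training sentences are assumptions for illustration only:

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy pipeline standing in for `pipe`
toy_pipe = Pipeline([("vectorizer", CountVectorizer()),
                     ("multinomialNB", MultinomialNB())])
toy_pipe.fit(["hello how are you", "salut comment vas tu"], ["English", "French"])

path = os.path.join(tempfile.mkdtemp(), "toy_pipeline.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_pipe, f)

# Only unpickle files from a trusted source
with open(path, "rb") as f:
    loaded = pickle.load(f)

loaded_pred = loaded.predict(["comment vas tu"])
print(loaded_pred)
```

The notebook's pipeline was fitted on integer-encoded labels, so a consumer of the saved file would also need `le.classes_` (or the pickled encoder) to turn predictions back into language names.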

## 7. Testing the model

In [23]:
# text = "Hello, how are you?"
text = "salut comment vas-tu"

pred = pipe.predict([text])  # separate name so the encoded label array y is not overwritten
le.classes_[pred[0]], pred

('French', array([4]))
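
Beyond the single predicted label, `predict_proba` gives a probability for each class, which can help flag uncertain inputs. A minimal sketch on a hypothetical two-language toy pipeline:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy pipeline standing in for `pipe`
demo_pipe = Pipeline([("vectorizer", CountVectorizer()),
                      ("multinomialNB", MultinomialNB())])
demo_pipe.fit(["hello how are you", "salut comment vas tu"], ["English", "French"])

proba = demo_pipe.predict_proba(["comment vas tu"])[0]   # one probability per class
classes = demo_pipe.named_steps["multinomialNB"].classes_
best = int(np.argmax(proba))
print(classes[best], float(proba[best]))
```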

## Summary

In this notebook, we built a language detection model that identifies which of 17 languages a piece of text is written in: Arabic, Danish, Dutch, English, French, German, Greek, Hindi, Italian, Kannada, Malayalam, Portuguese, Russian, Spanish, Swedish, Tamil, and Turkish (the dataset labels spell two of these 'Portugeese' and 'Sweedish'). Using a bag-of-words representation from CountVectorizer and a multinomial Naive Bayes classifier, the model reached an accuracy of about 0.975 on the held-out 20% test split.
