**Language Detection using Python**

Let’s start the task of language detection with machine learning by importing the necessary Python libraries and the dataset:

In [15]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import joblib  # used to save and load the vectorizer and the model

In [16]:
data = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/dataset.csv")
print(data.head())

                                                Text  language
0  klement gottwaldi surnukeha palsameeriti ning ...  Estonian
1  sebes joseph pereira thomas  på eng the jesuit...   Swedish
2  ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...      Thai
3  விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...     Tamil
4  de spons behoort tot het geslacht haliclona en...     Dutch


Let’s check whether this dataset contains any null values:

In [21]:
data.isnull().sum()

Text        0
language    0
dtype: int64

Now let’s have a look at all the languages present in this dataset:

In [7]:
data["language"].value_counts()

language
Estonian      1000
Swedish       1000
English       1000
Russian       1000
Romanian      1000
Persian       1000
Pushto        1000
Spanish       1000
Hindi         1000
Korean        1000
Chinese       1000
French        1000
Portugese     1000
Indonesian    1000
Urdu          1000
Latin         1000
Turkish       1000
Japanese      1000
Dutch         1000
Tamil         1000
Thai          1000
Arabic        1000
Name: count, dtype: int64

This dataset contains 22 languages with 1,000 sentences each. The classes are perfectly balanced and there are no missing values, so the data can be used as is to train a machine learning model.
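The balance check above can also be expressed as a small reusable function. This is a minimal sketch using only the standard library; the function name `is_balanced`, its `tolerance` parameter, and the toy labels are hypothetical and stand in for the real `data["language"]` column.

```python
from collections import Counter

def is_balanced(labels, tolerance=0.0):
    """True if every class count is within `tolerance` (a fraction) of the largest count."""
    counts = Counter(labels)
    largest = max(counts.values())
    return all(largest - n <= tolerance * largest for n in counts.values())

# Hypothetical labels standing in for data["language"]
labels = ["Thai"] * 3 + ["Dutch"] * 3 + ["Tamil"] * 3
print(Counter(labels))
print(is_balanced(labels))
```

With the real dataset, passing `data["language"]` instead of the toy list should give the same kind of answer, since every language appears the same number of times.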

**Language Detection Model**

Now let’s convert the text into word-count features with CountVectorizer, split the data into training and test sets, and save the fitted vectorizer:

In [19]:
x = np.array(data["Text"])
y = np.array(data["language"])

# Convert each text into a vector of word counts
cv = CountVectorizer()
X = cv.fit_transform(x)

# Hold out 0.1% of the rows (22 of 22,000) for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=100)

# Save the fitted vectorizer using joblib
joblib.dump(cv, 'Vectorizer.joblib')

['Vectorizer.joblib']
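To make the vectorization step more concrete, here is a minimal bag-of-words sketch in plain Python. It is not how CountVectorizer is implemented: it splits on whitespace, whereas CountVectorizer's default tokenizer is based on a regular expression. The function names and the toy sentences are hypothetical.

```python
from collections import Counter

def fit_vocabulary(texts):
    """Map each distinct lowercase token to a column index (sorted for a stable order)."""
    tokens = {tok for text in texts for tok in text.lower().split()}
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}

def transform(texts, vocabulary):
    """Turn each text into a list of token counts; tokens outside the vocabulary are ignored."""
    rows = []
    for text in texts:
        counts = Counter(text.lower().split())
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

# Hypothetical toy sentences, not taken from the dataset
toy_texts = ["the cat sat", "de kat zat", "the the cat"]
vocab = fit_vocabulary(toy_texts)
matrix = transform(toy_texts, vocab)
print(vocab)
print(matrix)
```

Each row of the resulting matrix has one column per vocabulary word. This is why the same fitted vectorizer has to be reused at prediction time: the column order must match the one the model was trained on.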

Because this is a multiclass classification problem, I will use the **Multinomial Naïve Bayes algorithm** to train the language detection model. It is a common choice for text classification with word-count features:

In [9]:
model = MultinomialNB()
model.fit(X_train, y_train)
# Accuracy on the held-out test rows
model.score(X_test, y_test)

1.0
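It helps to know how many rows this score is based on. The sketch below assumes that a fractional test_size is turned into a row count by multiplying it by the number of rows and rounding up, which is my understanding of how train_test_split behaves. The helper name `holdout_size` and the alternative fraction 0.2 are hypothetical and only there for comparison.

```python
import math

def holdout_size(n_samples, test_size):
    """Number of test rows for a fractional test_size, rounding up."""
    return math.ceil(test_size * n_samples)

n_rows = 22 * 1000  # 22 languages x 1000 sentences, as reported above
for fraction in (0.001, 0.2):
    n_test = holdout_size(n_rows, fraction)
    print(f"test_size={fraction}: {n_test} test rows, {n_rows - n_test} training rows")
```

With test_size=0.001 only 22 rows are held out, so each test sentence accounts for about 4.5% of the score. A larger fraction would give a more reliable accuracy estimate at the cost of slightly less training data.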

**Accessing Feature Log Probabilities**

In [25]:
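# Access the feature log probabilities: one row per language, one column per
# vocabulary word, each entry being the smoothed log P(word | language)
feature_log_probs = model.feature_log_prob_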

# Print or use these probabilities as needed
print("Feature Log Probabilities:", feature_log_probs)

Feature Log Probabilities: [[-12.0623899  -12.75553708 -12.75553708 ... -12.75553708 -12.75553708
  -12.75553708]
 [-12.60973279 -12.60973279 -12.60973279 ... -12.60973279 -12.60973279
  -12.60973279]
 [-12.70950803 -12.01636085 -12.70950803 ... -12.70950803 -12.70950803
  -12.70950803]
 ...
 [-12.02734363 -12.72049081 -12.72049081 ... -12.72049081 -12.72049081
  -12.72049081]
 [-12.00601676 -12.69916394 -12.69916394 ... -12.69916394 -12.69916394
  -12.69916394]
 [-12.74922508 -12.74922508 -12.74922508 ... -12.74922508 -12.74922508
  -12.74922508]]
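Many values in each row above are identical, and that follows from smoothing. With the default alpha of 1.0 in MultinomialNB, every word that never occurs in a language gets the same smoothed probability. In the first row, -12.0624 and -12.7555 differ by about 0.6931, which is ln 2, consistent with a word seen once versus a word never seen. The sketch below reproduces this idea in plain Python under the usual add-alpha formula log((count + alpha) / (total + alpha * V)); the function name and the counts are hypothetical.

```python
import math

def feature_log_probs(class_counts, alpha=1.0):
    """Smoothed log P(word | class) for one class, given its per-word counts."""
    total = sum(class_counts) + alpha * len(class_counts)
    return [math.log((count + alpha) / total) for count in class_counts]

# Hypothetical word counts for a 5-word vocabulary in two classes
counts_by_class = {
    "English": [3, 0, 0, 2, 0],
    "Dutch": [0, 4, 1, 0, 0],
}
log_probs = {label: feature_log_probs(c) for label, c in counts_by_class.items()}
for label, row in log_probs.items():
    print(label, [round(v, 4) for v in row])
```

In each row, the words with a count of zero share a single value, just like the repeated entries in the model's output.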


**Save The Model**

In [30]:
# Save the trained model with joblib (the same model is written under two file names)
joblib.dump(model, "Language_Detection_Model.pkl")
joblib.dump(model, 'Language_Detection_Model.joblib')

['Language_Detection_Model.joblib']

Now let’s use the saved model to detect the language of a text entered by the user:

In [20]:
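user = input("Enter a Text: ")

# Load the saved model and vectorizer
model1 = joblib.load('/workspaces/Language_Detector-Translator/Models/Language Detector/Language_Detection_Model.joblib')
cv1 = joblib.load('/workspaces/Language_Detector-Translator/Models/Language Detector/Vectorizer.joblib')

# Vectorize the input with the vocabulary learned during training, then predict.
# A separate name is used so the `data` DataFrame is not overwritten.
features = cv1.transform([user]).toarray()
output = model1.predict(features)

message = str(output)
print(message)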

['English']
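Behind the predict call, a Multinomial Naïve Bayes model scores each language by adding its log prior to the count-weighted feature log probabilities, then picks the highest score. The following is a minimal sketch of that decision rule, not scikit-learn's implementation; the function name `predict`, the three-word vocabulary, and all the probabilities are hypothetical.

```python
import math

def predict(counts, log_priors, log_probs):
    """Return the label with the highest log prior plus count-weighted log likelihood."""
    scores = {}
    for label in log_priors:
        scores[label] = log_priors[label] + sum(
            c * lp for c, lp in zip(counts, log_probs[label])
        )
    return max(scores, key=scores.get), scores

# Hypothetical two-class model over the vocabulary ["the", "de", "cat"]
log_priors = {"English": math.log(0.5), "Dutch": math.log(0.5)}
log_probs = {
    "English": [math.log(0.6), math.log(0.1), math.log(0.3)],
    "Dutch": [math.log(0.1), math.log(0.7), math.log(0.2)],
}
label, scores = predict([2, 0, 1], log_priors, log_probs)  # counts for "the the cat"
print(label)
```

Because the scores only cover labels the model was trained on, an input in any other language is still assigned one of the known labels.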


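As you can see, the model identifies the input as English. Keep in mind that the accuracy of 1.0 reported earlier was measured on only 22 held-out sentences, so it is a rough estimate rather than a thorough evaluation. **One thing to note here is that this model can only detect the languages present in the dataset.**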

**Summary**

Language identification with machine learning was a difficult task a few years ago because little language data was available. Now that such data is easy to obtain, several powerful machine learning models are available for language identification. I hope you liked this project on detecting languages with machine learning using Python. Feel free to ask questions.