# Language Detection
- Language detection is a natural language processing task in which we identify the language of a text or document.

- Let's start the task of language detection with machine learning by importing the necessary Python libraries

In [85]:
# import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

- Let's now load the dataset

In [86]:
data = pd.read_csv('language.csv')
data.head()

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch


- Now let's look at the summary information of the dataset

In [87]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1715 entries, 0 to 1714
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Text      1715 non-null   object
 1   language  1714 non-null   object
dtypes: object(2)
memory usage: 26.9+ KB


- As the output above shows, our dataset contains two columns:
1. Text - a sentence in the respective language
2. language - the name of the language
- The language column also contains one null value. We can verify this by checking for null values in the dataset

In [88]:
data.isnull().sum()

Text        0
language    1
dtype: int64

- As you can see, the language column contains one null value. There are a few options for dealing with null values in a dataframe:
1. Drop the row with the null value
2. Impute the missing value
 - Use the most frequent value (mode)
 - Use the mean or median value (numeric columns only, so not applicable to the language column)
 - Use a predictive model to estimate the missing value (regression for numeric columns, classification for categorical ones)
3. Create a new category for null values

- Note:
 - If you have only a few null values, you may be able to manually identify and correct them.
 - If you have a large number of null values, you may want to consider using a machine learning model to predict the missing values.
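
The three options listed above can be sketched on a small made-up dataframe. This is only an illustration: the `toy` frame and its contents are assumptions, not rows from `language.csv`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "Text": ["bonjour le monde", "hello world", "hola mundo", "good morning"],
    "language": ["French", "English", None, "English"],
})

# Option 1: drop rows whose label is missing.
dropped = toy.dropna(subset=["language"])

# Option 2: impute with the most frequent label (the mode).
# Here that labels a Spanish sentence as English, which shows the risk
# of blindly imputing a target column.
most_frequent = toy["language"].mode()[0]
imputed = toy.fillna({"language": most_frequent})

# Option 3: treat missing labels as their own category.
flagged = toy.fillna({"language": "Unknown"})

print(len(dropped), most_frequent, flagged.loc[2, "language"])
```

For a label column like `language`, dropping or manually correcting the row is usually the safer choice, since an imputed label may simply be wrong.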

- In our case, since only a single value is missing from the language column, we will manually inspect it

In [89]:
data[data['language'].isnull()]

Unnamed: 0,Text,language
1714,إليوشن بالروسية илью́шин و هي شركة روسية لتصمي...,


- As you can see, the text in this row appears to mix two languages (Arabic and Russian), so we can't simply assign it a single language label. We will therefore drop the row

In [90]:
data = data.dropna(subset=['language'])
data.isnull().sum()

Text        0
language    0
dtype: int64

- We have successfully removed the row with the missing value from the dataset. Now let's have a look at the language column of the dataset

In [91]:
data['language'].value_counts().reset_index()

Unnamed: 0,language,count
0,Russian,100
1,Japanese,89
2,Latin,89
3,Dutch,85
4,Turkish,83
5,Portugese,83
6,Hindi,83
7,Tamil,81
8,Swedish,81
9,Persian,80


## Language Detection Model
- Now, let's convert the text into word-count features with CountVectorizer and split the data into training and test sets

In [92]:
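# Features are the raw texts; targets are the language labels
X = np.array(data['Text'])
y = np.array(data['language'])

# Convert each text into a sparse vector of word counts (bag of words)
cv = CountVectorizer()
X = cv.fit_transform(X)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)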

- As this is a multiclass classification problem, we will train the language detection model with the Multinomial Naïve Bayes algorithm, which is a common and often strong baseline for text classification on word-count features
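
To make the combination of word counts and Multinomial Naïve Bayes more concrete, here is a minimal sketch on a made-up four-sentence corpus. The names `texts`, `labels`, `toy_cv` and `toy_model` are hypothetical and exist only for this illustration; they are not part of the notebook's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up mini corpus (an assumption, not the tutorial's dataset).
texts = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "le chat est sur le tapis",
    "le chien mange un os",
]
labels = ["English", "English", "French", "French"]

# One row per text, one column per distinct word, values are word counts.
toy_cv = CountVectorizer()
counts = toy_cv.fit_transform(texts)

# The classifier learns how often each word occurs per language.
toy_model = MultinomialNB()
toy_model.fit(counts, labels)

# New text must go through the same fitted vectorizer before predict().
prediction = toy_model.predict(toy_cv.transform(["le chat mange"]))
print(counts.shape[0], prediction[0])
```

The words in the query only occur in the French training sentences, so the French class gets the higher probability.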

In [93]:
model = MultinomialNB()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.9416909620991254
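
One caveat about the score above: `CountVectorizer` was fitted on all rows before the split, so the vocabulary also reflects words from the test set. A possible alternative, sketched here under assumptions, is to split the raw strings first and wrap both steps in a scikit-learn `Pipeline`. The corpus below is made up so that the block runs on its own; with the real data you would pass `data['Text']` and `data['language']` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up stand-in corpus (an assumption, not the tutorial's dataset).
texts = [
    "the cat sat on the mat", "the dog ate the bone",
    "a bird sang in the tree", "the sun is in the sky",
    "le chat est sur le tapis", "le chien mange un os",
    "un oiseau chante dans le jardin", "le soleil est dans le ciel",
]
labels = ["English"] * 4 + ["French"] * 4

# Split the raw strings first, keeping both languages in each part.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# fit() learns the vocabulary from the training texts only.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(text_train, y_train)

# The pipeline accepts raw strings, so no manual transform() is needed.
predictions = pipeline.predict(text_test)
print(len(predictions), pipeline.score(text_test, y_test))
```

The remaining cells continue with the original `cv` and `model` objects; this sketch is only a variation on the evaluation setup.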

- Now let's use this model to detect the language of a text entered by the user

In [94]:
user = input('Enter a Text: ')
# Reuse the fitted vectorizer; a separate name avoids overwriting the `data` DataFrame
features = cv.transform([user]).toarray()
output = model.predict(features)
print(output)

Enter a Text: bonne nuit
['French']
