# Language Detection Model

Project overview: This model detects the language of a sentence among 23 languages, as a supporting step for language translation systems. The languages are: Thai, Tamil, Portuguese (labeled `Portugese` in the dataset), Indonesian, Hindi, Pushto, English, Dutch, Persian, Spanish, Romanian, Korean, Latin, Turkish, Urdu, Japanese, Nepali, French, Russian, Arabic, Estonian, Swedish, Chinese.

---


In [27]:
#necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
import joblib

## <div style="background-color:#FFFFC1">Data Load</div>

In [28]:
#reading the csv file with mixed dataset 
mixed_lang_df = pd.read_csv("Languages.csv")
mixed_lang_df.head()

Unnamed: 0,Text,language
0,klement gottwaldi surnukeha palsameeriti ning ...,Estonian
1,sebes joseph pereira thomas på eng the jesuit...,Swedish
2,ถนนเจริญกรุง อักษรโรมัน thanon charoen krung เ...,Thai
3,விசாகப்பட்டினம் தமிழ்ச்சங்கத்தை இந்துப் பத்திர...,Tamil
4,de spons behoort tot het geslacht haliclona en...,Dutch


In [29]:
#the dimension of the loaded dataset
mixed_lang_df.shape

(22000, 2)

The mixed-language dataset has 22,000 rows and two columns: the text and its language label. `shape` returns (rows, columns), so the data is a two-dimensional table.

In [30]:
#reading the nepali data csv file
nep_lang_df = pd.read_csv("nepali_real_data2.csv")
nep_lang_df.head()

Unnamed: 0,Text,language
0,आजको मौसम राम्रो छ।,Nepali
1,म स्कूल जान्छु।,Nepali
2,उनीहरू फुटबल खेलिरहेका छन्।,Nepali
3,नेपाल सुन्दर देश हो।,Nepali
4,मलाई किताब पढ्न मन पर्छ।,Nepali


In [31]:
#concatenating the two datasets row-wise; both share the columns Text and language
lang_df = pd.concat([mixed_lang_df, nep_lang_df], ignore_index=True)

In [32]:
#the final dataset is shuffled 
lang_df = lang_df.sample(frac=1, random_state=42)

In [33]:
lang_df

Unnamed: 0,Text,language
12626,"เรื่องราวของ ""galaxy angel"" จะเป็นเรื่องราวของ...",Thai
2004,பாட்டாளி வர்க்க சர்வாதிகாரத்தின் கீழ் வர்க்கப்...,Tamil
15062,"""vírus"" é um single do iron maiden lançado em ...",Portugese
259,paspor ini berisi atau halaman dan berlaku s...,Indonesian
2195,अंतर्राष्ट्रीय फ़्रेंचाइज़िंग वाले पक्ष दस्ताव...,Hindi
...,...,...
11964,باباجان غفورف تاریخ‌دان و نویسندهٔ کتاب تاریخ ...,Persian
21575,en fue invitado por fernando ii para ocupar l...,Spanish
5390,doğu kanada atabasklarına geleneksel olarak dü...,Turkish
860,پژواک د يوې ځانگړې پروژې په توگه د اساسي قانون...,Pushto
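The row labels in the output above (12626, 2004, ...) are the original index values, which `sample` carries along when it shuffles. The following is a minimal sketch on made-up data, not on the notebook's CSV files, of the same concatenate-and-shuffle step with an optional `reset_index(drop=True)` to renumber the rows afterwards. The frame names `mixed`, `nepali`, and `shuffled_reset` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files used in the notebook.
mixed = pd.DataFrame(
    {"Text": ["hello world", "hola mundo"], "language": ["English", "Spanish"]}
)
nepali = pd.DataFrame({"Text": ["नेपाल सुन्दर देश हो।"], "language": ["Nepali"]})

# ignore_index=True renumbers the combined rows 0..n-1.
combined = pd.concat([mixed, nepali], ignore_index=True)

# frac=1 returns every row in random order; the old index labels are kept.
shuffled = combined.sample(frac=1, random_state=42)

# Optional: discard the old labels so the shuffled rows are numbered 0..n-1 again.
shuffled_reset = shuffled.reset_index(drop=True)
```

Resetting the index is not required for the later steps, since the notebook converts the columns to NumPy arrays, but it makes the displayed table easier to read.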


## <div style="background-color:#FFFFC1">Data Understanding</div>

In [34]:
#to check if there are any missing values 
lang_df.isnull().sum()

Text        0
language    0
dtype: int64

Neither column has missing values: `isnull().sum()` reports 0 for both `Text` and `language`.

In [35]:
#confirming that every language has the same number of samples
lang_df['language'].value_counts()

language
Thai          1000
Tamil         1000
Portugese     1000
Indonesian    1000
Hindi         1000
Pushto        1000
English       1000
Dutch         1000
Persian       1000
Spanish       1000
Romanian      1000
Korean        1000
Latin         1000
Turkish       1000
Urdu          1000
Japanese      1000
Nepali        1000
French        1000
Russian       1000
Arabic        1000
Estonian      1000
Swedish       1000
Chinese       1000
Name: count, dtype: int64

Each of the 23 languages has exactly 1,000 samples, so the classes are balanced.
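The balance check can also be made programmatic rather than read off the table. This is a small sketch on toy labels, assuming the same `value_counts()` approach as the cell above. The names `labels` and `is_balanced` are illustrative.

```python
import pandas as pd

# Toy label column; the real notebook uses lang_df['language'].
labels = pd.Series(["Thai"] * 3 + ["Tamil"] * 3 + ["Nepali"] * 3, name="language")

counts = labels.value_counts()

# The classes are balanced when every language has the same count,
# i.e. the counts contain exactly one distinct value.
is_balanced = counts.nunique() == 1
```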

In [36]:
#to check the data type of columns
lang_df.dtypes

Text        object
language    object
dtype: object

Both columns have dtype `object`, the generic pandas dtype, which here holds the text and language labels as Python strings.
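Because `object` is a generic dtype, it does not by itself guarantee that every value is a string. A hedged sketch of a stricter check on a toy frame follows. The frame `df` and the stray integer are made up for illustration.

```python
import pandas as pd

# Toy frame with one deliberately non-string value in the Text column.
df = pd.DataFrame(
    {
        "Text": ["hello", "hola", 42],
        "language": ["English", "Spanish", "Unknown"],
    }
)

# Flag rows whose Text value is not a Python str.
is_string = df["Text"].map(lambda value: isinstance(value, str))
non_string_rows = df[~is_string]
```

An empty `non_string_rows` would confirm that the column can be passed to a text vectorizer without conversion.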

## <div style="background-color:#FFFFC1">Data Transform</div>

In [37]:
#converting the columns into numpy arrays for vectorization
array_text = np.array(lang_df['Text'])
array_lang = np.array(lang_df['language'])

In [38]:
array_text

array(['เรื่องราวของ "galaxy angel" จะเป็นเรื่องราวของหน่วยงานที่ได้รับการสนับสนุนโดยรัฐบาล ที่มีชื่อว่า angel troupe angel-tai มีหน้าที่หลักคือค้นหา เทคโนโลยีที่สาบสูญ lost technology เพื่อนำมาใช้งานต่างๆ',
       'பாட்டாளி வர்க்க சர்வாதிகாரத்தின் கீழ் வர்க்கப் போராட்டத்தைத் தொடர்வதற்கான சித்தாந்தமும் கருவியுமாக கலாச்சாரப் புரட்சி பயன்படுத்தப்பட வேண்டும்',
       '"vírus" é um single do iron maiden lançado em  é o primeiro single desde  "women in uniform" que não aparece em nenhum álbum oficial de estúdio do iron maiden foi no entanto caracterizado como um novo circuito de retrospectiva da carreira da banda é a única canção do iron maiden a ser creditada aos dois guitarristas da banda',
       ...,
       'doğu kanada atabasklarına geleneksel olarak düşman olan kızılderililer krilerdir plains cree woods cree swampy cree kuzey amerikada avrupalı tüccarların getirip kürk karşılığında ticaret karakollarında değiş tokuş ettikleri misket tüfeklerine sahip olan ilk yerliler krilerdir ve bun

In [39]:
array_lang

array(['Thai', 'Tamil', 'Portugese', ..., 'Turkish', 'Pushto', 'Japanese'],
      shape=(23000,), dtype=object)

In [40]:
#creating a CountVectorizer that counts individual characters instead of words
cv = CountVectorizer(analyzer='char')

In [41]:
#the vectorizer learns the character vocabulary from the text and transforms it into count vectors
text_vector = cv.fit_transform(array_text)

In [42]:
#splitting the data into training and test sets in an 80-20 ratio
X_train, X_test, y_train, y_test = train_test_split(text_vector, array_lang, test_size=0.2, random_state=42)
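The split above is random and not stratified, so the number of test sentences per language can differ slightly from one language to another. As a sketch of one possible alternative (not what the notebook does), `train_test_split` accepts a `stratify` argument that keeps the class proportions equal in both parts. The toy sentences and labels below are assumptions.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 10 samples for each of two languages.
toy_texts = [f"sentence {i}" for i in range(20)]
toy_labels = ["Nepali"] * 10 + ["Hindi"] * 10

# stratify=toy_labels preserves the 50/50 class ratio in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

test_counts = Counter(y_te)
```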

In [43]:
print(X_train)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 748489 stored elements and shape (18400, 7186)>
  Coords	Values
  (0, 0)	99
  (0, 20)	8
  (0, 14)	56
  (0, 25)	13
  (0, 37)	1
  (0, 27)	25
  (0, 18)	55
  (0, 33)	32
  (0, 31)	36
  (0, 28)	48
  (0, 34)	19
  (0, 29)	10
  (0, 22)	27
  (0, 32)	36
  (0, 16)	16
  (0, 35)	7
  (0, 80)	6
  (0, 76)	5
  (0, 26)	16
  (0, 17)	23
  (0, 74)	3
  (0, 19)	8
  (0, 30)	2
  (0, 70)	5
  (0, 15)	5
  :	:
  (18399, 364)	31
  (18399, 392)	6
  (18399, 346)	9
  (18399, 360)	4
  (18399, 358)	4
  (18399, 351)	23
  (18399, 400)	3
  (18399, 357)	11
  (18399, 399)	4
  (18399, 395)	1
  (18399, 350)	8
  (18399, 391)	5
  (18399, 408)	1
  (18399, 393)	9
  (18399, 361)	16
  (18399, 345)	4
  (18399, 355)	6
  (18399, 356)	5
  (18399, 354)	3
  (18399, 343)	6
  (18399, 348)	3
  (18399, 338)	1
  (18399, 368)	27
  (18399, 402)	2
  (18399, 405)	6


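The text column has been vectorized: `X_train` is a sparse matrix with 18,400 rows (one per sentence) and 7,186 columns (one per distinct character), and it stores only the non-zero counts.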

In [44]:
y_test

array(['Romanian', 'Persian', 'Nepali', ..., 'Portugese', 'Russian',
       'Indonesian'], shape=(4600,), dtype=object)

## <div style="background-color:#FFFFC1">Model Build</div>

In [45]:
#initializing multinomialNB (Naive Bayes) to classify the texts 
model = MultinomialNB()

In [46]:
model

In [47]:
#fitting the model on the training data
model.fit(X_train,y_train)

## <div style="background-color:#FFFFC1">Model Evaluate</div>

In [48]:
#to check the accuracy of the prediction
model.score(X_test,y_test)

0.957391304347826

The model reaches about 95.7% accuracy on the held-out test set: 4,404 of the 4,600 test sentences are assigned the correct language.
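A single accuracy figure does not show which languages are confused with each other. The sketch below, in plain Python on made-up labels, computes overall accuracy together with a per-language breakdown. The function `per_language_accuracy` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter


def per_language_accuracy(y_true, y_pred):
    """Return the fraction of correctly predicted samples for each true language."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {lang: correct[lang] / totals[lang] for lang in totals}


# Toy labels for illustration; in the notebook these would be
# y_test and model.predict(X_test).
y_true = ["Hindi", "Hindi", "Nepali", "Nepali", "Urdu"]
y_pred = ["Hindi", "Nepali", "Nepali", "Nepali", "Urdu"]

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
breakdown = per_language_accuracy(y_true, y_pred)
```

In this toy case the overall accuracy is 0.8, while the breakdown shows that all errors come from Hindi sentences, half of which are predicted as Nepali.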

In [54]:
#to try the model on a sentence typed by the user
user = input("Enter a text:")
#the same fitted vectorizer must be used; predict accepts the sparse matrix directly
data = cv.transform([user])
output = model.predict(data)
print(output)

Enter a text: All has been completed with a successful detection.


['English']
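For repeated use, the transform-and-predict steps can be wrapped in a function instead of relying on `input()`. This is a sketch under stated assumptions: the tiny training set is invented, and `detect_language` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented training set; the notebook trains on 18,400 sentences instead.
train_texts = [
    "the cat sat on the mat",
    "a dog ran in the park",
    "el gato come pescado",
    "la casa es muy grande",
]
train_labels = ["English", "English", "Spanish", "Spanish"]

cv = CountVectorizer(analyzer="char")
model = MultinomialNB().fit(cv.fit_transform(train_texts), train_labels)


def detect_language(text, vectorizer=cv, classifier=model):
    """Vectorize one sentence with the fitted vectorizer and return the predicted label."""
    features = vectorizer.transform([text])  # sparse input is accepted by predict
    return classifier.predict(features)[0]


guess = detect_language("el perro corre")
```

With so little training data the prediction itself is not meaningful; the point is only the call pattern, which always returns one of the labels seen during training.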


## <div style="background-color:#FFFFC1">Model Save</div>

In [55]:
#saving the trained model to disk (the fitted vectorizer cv is not included in this file)
joblib.dump(model,'Language_detect_model_joblib')

['Language_detect_model_joblib']
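The saved file contains only the classifier, so loading it elsewhere also requires the fitted `CountVectorizer` to turn new text into features. One possible approach, sketched here on invented data, is to combine both steps with `make_pipeline` (imported at the top of the notebook but not used) and save the pipeline as a single object. The file name `language_pipeline.joblib` and the toy sentences are assumptions.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set for illustration only.
texts = [
    "the cat sat on the mat",
    "a dog ran in the park",
    "el gato come pescado",
    "la casa es muy grande",
]
labels = ["English", "English", "Spanish", "Spanish"]

# The pipeline vectorizes raw text and classifies it in one object.
pipeline = make_pipeline(CountVectorizer(analyzer="char"), MultinomialNB())
pipeline.fit(texts, labels)

# Save and reload the whole pipeline (hypothetical file name, temporary directory).
path = os.path.join(tempfile.mkdtemp(), "language_pipeline.joblib")
joblib.dump(pipeline, path)
restored = joblib.load(path)

# The restored pipeline takes raw strings and reproduces the original predictions.
same = list(restored.predict(texts)) == list(pipeline.predict(texts))
```

A further effect of this design is that fitting the pipeline on training sentences only means the character vocabulary is learned from the training data alone, rather than from the full dataset before the split.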

---
