# Language Detection

The complete test-use process constitutes language assessment, which is considerably more than just administering a language test. In fact, the main objective of language evaluation is to provide us with better information so that we may make better judgments and take better actions in the field of language education. This Python program will assist us in this field by inputting a few sentences in a single language and returning the language of the sentences.

Done by: Joshua, Richard, Sally 

# Previous Work

This method demonstrates splitting phrases down into words, which are then stored by characters to detect sentences and to return the langauages along with the score of which is the model’s best prediction.

##  Import Libraries

In [22]:
from os import getcwd
import pandas as pd

## Load Dataset

In [23]:
fpath = getcwd()
data = pd.read_csv('/Users/sallypang/Library/CloudStorage/OneDrive-JamesCookUniversity/MA 3831/MA 3831 - Assignment01/formatted_data.csv', ";")
print(data.tail())

   language                                               text  length_text
16       pt  Reinício da sessãoDeclaro reaberta a sessão do...       730576
17       ro  Componenţa Parlamentului: a se vedea procesul-...       336424
18       sk  Schválenie zápisnice z predchádzajúceho rokova...       328185
19       sl  Sprejetje zapisnika predhodne seje: glej zapis...       308136
20       sv  Återupptagande av sessionenJag förklarar Europ...       674945


## Pre-process Data

In [24]:
def filter_tokens(bow):
    filtered = []
    punctuations = "!\"#$%&'()*+, -./:;<=>?@[\]^_`{|}~"
    for i in range(len(bow)):
        word = bow[i]
        for punctuation in punctuations:
            if punctuation in word:
                word = word.replace(punctuation, "")
                break
        filtered.append(word)
    return filtered

In [26]:
languages = {}
for ind in data.index:
    bow = data['text'][ind].lower().split(" ")
    bow = filter_tokens(bow)
    bow = "".join(bow)
    languages[data['language'][ind]] = bow

charSet = set()
for lang in languages.keys():
    charSet = charSet.union(set(languages[lang]))

for lang in languages.keys():
    charDict = dict.fromkeys(charSet, 0)
    for char in languages[lang]:
        charDict[char] += 1

    charDict = {key: value for key, value in sorted(charDict.items())}
    languages[lang] = charDict

df = pd.DataFrame(languages)

indexes = {}
for col in df.columns:
    z = df[col].sort_values(ascending=False)
    indexes[col] = z.index

## Test Model

In [31]:
inputted_text = input("Enter text: ")
print()
inputted_text = inputted_text.lower().split(" ")
inputted_text = filter_tokens(inputted_text)
inputted_text = "".join(inputted_text)
charDict = dict.fromkeys(charSet, 0)
for char in inputted_text:
    charDict[char] += 1
charDict = {key: value for key, value in sorted(charDict.items())}
inputted_text = pd.DataFrame(charDict, index=[0])
inputted_text = inputted_text.transpose()
inputted_text = inputted_text.sort_values(ascending=False, by=0)
inputted_text = list(inputted_text.index)

for key in indexes.keys():
    score = 0
    for i in range(len(indexes[key])):
        if df[key][indexes[key][i]] == 0:
            break
        if inputted_text.index(indexes[key][i]) < inputted_text.index(indexes[key][i + 1]):
            score += 1
    print(f"{key}: {score}")

Enter text: Парламента

bg: 71
cs: 55
da: 32
de: 28
el: 41
en: 30
es: 31
et: 58
fi: 27
fr: 32
hu: 50
it: 33
lt: 51
lv: 59
nl: 28
pl: 55
pt: 33
ro: 56
sk: 55
sl: 57
sv: 31


Despite the fact that this is simply the first layer of the algorithm, the program returns results with a low degree of accuracy. Since most languages use the same alphabets for the words, even though those words may have different meanings and this is one of the main contributing factors. 

# Import Libraries 

In [6]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import nltk
import csv
import warnings
warnings.simplefilter("ignore")

# Exploratory Data

In [7]:
data = pd.read_csv(r'/Users/sallypang/Library/CloudStorage/OneDrive-JamesCookUniversity/MA 3831/MA 3831 - Assignment01/formatted_data.csv', ";")

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   language     21 non-null     object
 1   text         21 non-null     object
 2   length_text  21 non-null     int64 
dtypes: int64(1), object(2)
memory usage: 632.0+ bytes


In [9]:
data.head(21)

Unnamed: 0,language,text,length_text
0,bg,Състав на Парламента: вж. протоколиОдобряване ...,327263
1,cs,Schválení zápisu z předchozího zasedání: viz z...,317927
2,da,Genoptagelse af sessionenJeg erklærer Europa-P...,678400
3,de,Wiederaufnahme der SitzungsperiodeIch erkläre ...,747690
4,el,Επαvάληψη της συvσδoυΚηρύσσω την επανάληψη της...,523277
5,en,Resumption of the sessionI declare resumed the...,690268
6,es,Reanudación del período de sesionesDeclaro rea...,733658
7,et,Eelmise istungi protokolli kinnitamine vaata p...,324119
8,fi,Istuntokauden uudelleenavaaminen Julistan perj...,694523
9,fr,Reprise de la sessionJe déclare reprise la ses...,756201


In [10]:
data.shape

(21, 3)

This "formatted_data.csv" dataset, which includes sentences in 21 different languages and along with ";" as the separator. The language code, text, and the amount of characters in the designated language are listed in the first, second, and third columns, respectively. 
The texts in this database were taken from the European Union Proceedings. 

# Code for Languages 

$ bg - Bulgarian \\ cs - Czech \\  da - Danish \\  de - German \\  el - Greek \\ en - English \\ es - Spanish \\ et - Estonian \\ fi - Finnish \\ fr - French \\ hu - Hungarian \\  it - Italian \\ lt - Lithuanian \\ lv - Latvian \\ nl - Dutch \\ pl - Polish \\ pt - Portuguese \\ ro - Romanian \\ sk - Slovak \\ sl - Slovenian \\  sv - Swedish $

# Pre-process Data

All text sentences from each language were split using the nltk function, and the data was then restored into a new csv file. Mark the language of each sentence in the Language column. Although there are many different types of methods, this is by far the most accurate and delivers the best results. We can easily proceed with tokenizing the variables by spitting out the text in each language.

In [11]:
language = []
for ind in data.index:
    lang = data["text"][ind]
    lang_list = nltk.tokenize.sent_tokenize(lang)
    for sentence in lang_list:
        new_sent = [sentence, data["language"][ind]]
        language.append(new_sent)

In [12]:
myFile = open('demo_file.csv', 'w')
writer = csv.writer(myFile)
writer.writerow(['Text', 'Language'])
for data_list in language:
    writer.writerow(data_list)
myFile.close()

# Test Model

Retrieve dependent and independent variables by reading the new entity file (demo file.csv). In order to pre-process text data before creating the vector representation, CountVectorizer helps with tokenization as it has flexible feature representation module for text due to this functionality. Additionally since this is a multiclass classification problem, we utilised the Multinomial Naive Bayes technique to train the language detection model as it consistently delivers excellent results in multiclass classification problems.

In [13]:
df = pd.read_csv(r'/Users/sallypang/Library/CloudStorage/OneDrive-JamesCookUniversity/MA 3831/MA 3831 - Assignment01/demo_file.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56886 entries, 0 to 56885
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Text      56886 non-null  object
 1   Language  56886 non-null  object
dtypes: object(2)
memory usage: 889.0+ KB


In [14]:
x = np.array(df["Text"])
y = np.array(df["Language"])

cv = CountVectorizer()
X = cv.fit_transform(x) # performs fit and transform on the input data at a single time and converts the data points.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [15]:
model = MultinomialNB()
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.9932325540516787

# Test for Languages

This section is to test 21 languages based on user's input. 

In [17]:
print("Test Languages\n *Enter to quit.\n------------------")
user = input("Enter a Text: ")
while user != "":
    data = cv.transform([user]).toarray()
    output = model.predict(data)
    print(output)
    user = input("Enter a Text: ")
print("------------------\nTest ended.")

Test Languages
 *Enter to quit.
------------------
Enter a Text: Израилтяните нека поставят шатрите си, всеки при знамето си, със знаковете на бащиния си дом; срещу шатъра за срещане да поставят шатрите с и изоколо.
['bg']
Enter a Text: De aquí en adelante no daréis hornija al pueblo para hacer ladrillo, como ayer y antes de ayer; vayan ellos y recojan hornija por sí mismos .
['es']
Enter a Text: Keď prichádzal do Kafarnauma, pristúpil k nemu rímsky stotník a poprosil ho:
['sk']
Enter a Text: Azután vidd be az asztalt, és hozd azt rendbe a reávalókkal. Vidd be a gyertyatartót is, és gyújtsd meg annak mécseit.
['hu']
Enter a Text: ”Skulle de då få behandla vår syster som en prostituerad?” svarade de.
['sv']
Enter a Text: Elon Musk has said that he will step down as CEO of Twitter once a suitable replacement can be found. On Sunday he ran a poll asking if he should leave the role, and Twitter users overwhelmingly told him to go. He didn't immediately respond to the results, but by Tuesda

# Conclusion

Our language identification system's findings are presented, and we have looked at some of the problems with automatically determining a text's language. While some more modern algorithms perform with a similar level of accuracy, our solution utilizes a straightforward, widely accepted methodology. However, although the total number of languages in this project is only 21, some of languages share the same word spellings but have different meanings. For instance, even though we meant "van" in the language of "en", it will be recognized as the language "nl". As a result, our project's drawback is that the model performs less efficient when we are only testing a single word from a specific language. 

Our method stores each sentence instead of paragraph into separated cells as seen on the original file to conserve memory space and lower calculation costs while increasing efficiency since words may be easily identified in sentences to distinguish different languages. In addition, eliminating the stop words from the text of each phrase provided with the permission to use the corpus is part of our strategy to advance our project by improving the program better. It will help draw more attention to the important points. Furthermore, the project could be improved by increasing the quantity of unique words in each language.