In [1]:
import math
import re

In [2]:
!unzip language-training.zip

Archive:  language-training.zip
   creating: language-training/Czech/
  inflating: language-training/Czech/23_cm_her101  
  inflating: language-training/Czech/bestiary  
  inflating: language-training/Czech/character  
  inflating: language-training/Czech/cn_jaskier07  
  inflating: language-training/Czech/cn_julian01  
  inflating: language-training/Czech/cn_kalkstein10  
  inflating: language-training/Czech/cn_lady01  
  inflating: language-training/Czech/cn_leuvaarden10  
  inflating: language-training/Czech/cn_raymond05  
  inflating: language-training/Czech/cn_shani19  
  inflating: language-training/Czech/cn_talar01  
  inflating: language-training/Czech/cn_triss01  
  inflating: language-training/Czech/cn_vincent02  
  inflating: language-training/Czech/tutorial  
   creating: language-training/English/
  inflating: language-training/English/23_cm_her101  
  inflating: language-training/English/bestiary  
  inflating: language-training/English/character  
  inflating: language-t

# LOADING TRAINING DATA

Load the training data for a given language from its folder.

In [3]:
def load_training_data(lang):
  files = ['23_cm_her101', 'bestiary', 'character', 'cn_jaskier07', 'cn_julian01', 'cn_kalkstein10', 'cn_lady01', 'cn_leuvaarden10', 'cn_raymond05', 'cn_shani19', 'cn_talar01', 'cn_triss01', 'cn_vincent02', 'tutorial']

  text = []
  for f in files:
    # read each file as UTF-8; the with-block closes the file handle promptly
    with open('language-training/' + lang + '/' + f, 'r', encoding="utf8") as fh:
      text.extend(fh.readlines())
  return text
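The file list above is hard-coded. As a possible alternative, the sketch below discovers the files in a language folder with pathlib instead. The function name load_language_dir and the root parameter are hypothetical (they are not part of this notebook), and the demo builds a throwaway folder rather than reading the real archive.

```python
import tempfile
from pathlib import Path

def load_language_dir(root, lang):
    """Return all lines from every file under root/lang, in sorted file order."""
    lines = []
    for path in sorted(Path(root, lang).iterdir()):
        if path.is_file():
            with path.open(encoding='utf8') as fh:
                lines.extend(fh.readlines())
    return lines

# Demo on a temporary folder that mimics the language-training layout.
with tempfile.TemporaryDirectory() as root:
    lang_dir = Path(root, 'English')
    lang_dir.mkdir()
    (lang_dir / 'b').write_text('second\n', encoding='utf8')
    (lang_dir / 'a').write_text('first\nalso first\n', encoding='utf8')
    demo_lines = load_language_dir(root, 'English')

print(demo_lines)  # ['first\n', 'also first\n', 'second\n']
```

Sorting the paths keeps the line order reproducible across runs, since directory iteration order is not guaranteed.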

The training data is a list of paragraphs, one per line of the input files. For example, here is eng_text[3]:

In [4]:
eng_text = load_training_data('English')
eng_text[3]

'I see you are a witcher. Has a villager finally sought to do something about the midday ladies?\n'

# PREPROCESSING TRAINING DATA

In [5]:
def preprocess_data(text):
  ret_words = []
  for sentence in text:
    words = sentence.split()
    for word in words:
      cleaned_word = re.sub(r'[\W]', '', word)
      if cleaned_word != '':
        ret_words.append(cleaned_word.lower())
  return ret_words

Preprocessing splits each paragraph into words, strips non-word characters such as ? and lowercases the result. For example, here is the output for the paragraph 'I see you are a witcher. Has a villager finally sought to do something about the midday ladies?\n':

In [6]:
eng_words = preprocess_data(["'I see you are a witcher. Has a villager finally sought to do something about the midday ladies?\n'"])
eng_words

['i',
 'see',
 'you',
 'are',
 'a',
 'witcher',
 'has',
 'a',
 'villager',
 'finally',
 'sought',
 'to',
 'do',
 'something',
 'about',
 'the',
 'midday',
 'ladies']
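Because several of the training languages use accented or Cyrillic letters, it helps to check what the \W pattern actually removes. The sketch below uses a hypothetical helper named clean that applies the same substitution to a single word. For str patterns in Python 3, \w is Unicode-aware, so precomposed accented letters and Cyrillic letters survive, as do digits and underscores.

```python
import re

def clean(word):
    # same substitution as preprocess_data, applied to a single word
    return re.sub(r'\W', '', word).lower()

print(clean('M\u00fcller,'))    # müller  (comma removed, umlaut kept)
print(clean('\u0425\u043e\u043c\u0430.'))  # хома    (Cyrillic kept, period removed)
print(clean("don't"))           # dont    (apostrophe removed)
print(clean('snake_case42!'))   # snake_case42  (underscore and digits kept)
```

One consequence, visible in the trigram counts later on, is that contractions such as "don't" collapse into a single token ("dont").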

# COUNTING NUMBER OF TIMES A TRIGRAM WAS OBSERVED

This function counts how often each character trigram occurs. Each word is padded with '.' on both sides, so trigrams also capture word beginnings and endings. For a given text in a language, the counts define an n-dimensional vector in which each distinct trigram corresponds to a dimension and the number of times that trigram was observed is the value along that dimension.

In [7]:
def create_trigram_vector(words):
  trigram_vector = {}
  for w in words:
    ch = '.' + w + '.'
    for c1, c2, c3 in zip(ch, ch[1:], ch[2:]):
      trigram_vector[c1+c2+c3] = trigram_vector.get(c1+c2+c3, 0) + 1
  return trigram_vector

For example, here are the trigram counts for English, built from the English training words:

In [39]:
eng_words = preprocess_data(load_training_data('English'))
eng_count = create_trigram_vector(eng_words)
print('printing first 10 dimensions')
list(eng_count.items())[:10]

printing first 10 dimensions


[('.i.', 690),
 ('.do', 339),
 ('don', 131),
 ('ont', 177),
 ('nt.', 508),
 ('.ha', 436),
 ('hav', 218),
 ('ave', 328),
 ('ve.', 535),
 ('.ti', 81)]
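To make the '.' padding concrete, here is a small equivalent sketch built on collections.Counter. The name trigram_counts is hypothetical and the two input words are chosen only for illustration; it is meant to mirror what create_trigram_vector does, not to replace it.

```python
from collections import Counter

def trigram_counts(words):
    counts = Counter()
    for w in words:
        padded = '.' + w + '.'  # '.' marks the start and end of a word
        # slide a window of width 3 over the padded word
        counts.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return counts

counts = trigram_counts(['the', 'then'])
print(dict(counts))
# {'.th': 2, 'the': 2, 'he.': 1, 'hen': 1, 'en.': 1}

print(dict(trigram_counts(['a'])))
# {'.a.': 1}
```

A one-letter word still yields exactly one trigram thanks to the padding, which is why '.i.' appears in the English counts above.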

# CREATING A MODEL

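A smaller angle between a language's trigram vector and the vector built from the given text implies greater similarity. Rather than computing the angle itself, we compute its cosine, that is, the dot product of the two vectors divided by the product of their lengths. Count vectors have no negative components, so the angle lies between 0° and 90°, where a larger cosine means a smaller angle. We therefore use the cosine directly as the score for a given language, and the language with the maximum score is the predicted language.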
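As a quick worked example of the score, the sketch below computes the cosine for tiny hand-made count dictionaries. The function name cosine_similarity and the sample counts are assumptions for illustration only; the zero-length guard is also an addition.

```python
import math

def cosine_similarity(a, b):
    # dot product over the trigrams the two sparse vectors share
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)

lang = {'.th': 3, 'the': 4}
text = {'.th': 6, 'the': 8}   # same proportions, twice the counts
other = {'xyz': 5}            # no trigram in common

print(cosine_similarity(lang, text))   # 1.0
print(cosine_similarity(lang, other))  # 0.0
```

The first result shows why cosine suits this task: it depends only on the proportions of the trigram counts, so a short paragraph can be compared against a much larger training corpus.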

In [19]:
class cosine_model:
  def __init__(self):
    # initializing language to trigram counts
    self.lang_trigram = {}

  # train model (create trigram vector for given language)
  def train(self, language, preprocessed_data):
    self.lang_trigram[language] = create_trigram_vector(preprocessed_data)

  # function to return vector length (Euclidean norm) given trigram counts
  def vec_length(self, trigram_count):
    # accumulate squared counts; avoid shadowing the built-in sum()
    total = 0
    for value in trigram_count.values():
      total += value*value
    return math.sqrt(total)

  # function return cosine of language vector and text trigram vector
  def cosine(self, language, text_trigram_vec):
    dot = 0.0
    # for given language look up the trigram vector
    lang_count = self.lang_trigram[language]
    # calculating dot product
    for key, value in text_trigram_vec.items():
      if key in lang_count:
        dot += (value * lang_count[key])
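    # calculating cosine using dot product of the vectors and their lengths
    norm = self.vec_length(lang_count) * self.vec_length(text_trigram_vec)
    # an empty text or language vector has zero length; score it 0 instead of dividing by zero
    if norm == 0:
      return 0.0
    return dot / norm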

  # predicting language for text given
  def predict(self, text):
    trigram_text = create_trigram_vector(text)
    result = {}
    for key, value in self.lang_trigram.items():
      result[key] = self.cosine(key, trigram_text)
        
    # rank languages by descending cosine score
    result = sorted(result.items(), key = lambda x: -x[1])
    if result[0][1] == 0.0:
      # no trigram in common with any trained language
      print('\ncannot detect language')
      return None
    print('\nlanguage of given text document is most likely to be: ', result[0][0])
    return result[0][0]
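The predict method reports only the winning language. To inspect the score of every language, one option is to return the full ranking. The following is a self-contained toy sketch of that idea: the helper names (words_of, trigrams, cosine, rank) and the two one-sentence "training corpora" are made up for illustration and are far too small for real use.

```python
import math
import re
from collections import Counter

def words_of(text):
    # split on whitespace, drop non-word characters, lowercase, skip empties
    cleaned = (re.sub(r'\W', '', t).lower() for t in text.split())
    return [w for w in cleaned if w]

def trigrams(words):
    counts = Counter()
    for w in words:
        p = '.' + w + '.'
        counts.update(p[i:i + 3] for i in range(len(p) - 2))
    return counts

def cosine(a, b):
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Toy language profiles (assumed sample sentences, not the notebook's data).
profiles = {
    'English': trigrams(words_of('the cat and the dog are in the house')),
    'Spanish': trigrams(words_of('el gato y el perro están en la casa')),
}

def rank(text):
    # list of (score, language) pairs, best match first
    vec = trigrams(words_of(text))
    return sorted(((cosine(vec, p), lang) for lang, p in profiles.items()),
                  reverse=True)

ranking = rank('the dog is in the garden')
for score, lang in ranking:
    print(f'{lang}: {score:.3f}')
```

Seeing the runner-up score next to the winner gives a rough sense of how confident a prediction is, which a single returned label cannot convey.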


# TESTING OUT THE MODEL

Training the model on all nine languages.

In [21]:
m = cosine_model()
languages = ['English', 'French', 'Czech', 'German', 'Hungarian', 'Italian', 'Polish', 'Russian', 'Spanish']
# build one trigram vector per language from its training files
for lang in languages:
  m.train(lang, preprocess_data(load_training_data(lang)))

Testing with an English paragraph

In [25]:
test = [input("enter a paragraph to detect language: ")]
test_clean = preprocess_data(test)
language = m.predict(test_clean)
assert language == 'English'

enter a paragraph to detect language: Along with the smooth flow of sentences, a paragraph’s coherence may also be related to its length. If you have written a very long paragraph, one that fills a double-spaced typed page, for example, you should check it carefully to see if it should start a new paragraph where the original paragraph wanders from its controlling idea. On the other hand, if a paragraph is very short (only one or two sentences, perhaps), you may need to develop its controlling idea more thoroughly, or combine it with another paragraph.

language of given text document is most likely to be:  English


Testing with a French paragraph

In [26]:
test = [input("enter a paragraph to detect language: ")]
test_clean = preprocess_data(test)
language = m.predict(test_clean)
assert language == 'French'

enter a paragraph to detect language: Je m’appelle Jessica. Je suis une fille, je suis française et j’ai treize ans. Je vais à l’école à Nice, mais j’habite à Cagnes-Sur-Mer. J’ai deux frères. Le premier s’appelle Thomas, il a quatorze ans. Le second s’appelle Yann et il a neuf ans. Mon papa est italien et il est fleuriste. Ma mère est allemande et est avocate. Mes frères et moi parlons français, italien et allemand à la maison. Nous avons une grande maison avec un chien, un poisson et deux chats.

language of given text document is most likely to be:  French


Testing with a Czech paragraph

In [27]:
test = [input("enter a paragraph to detect language: ")]
test_clean = preprocess_data(test)
language = m.predict(test_clean)
assert language == 'Czech'

enter a paragraph to detect language: Pan Novák stojí na nádraží a vyhlíží svůj vlak. „Už tu měl dávno být, asi má zpoždění,“ říká si. Dnes jede na pracovní schůzku do Brna. V Brně se mu líbí. Je to krásné město a stále se tam něco děje: výstavy, festivaly, koncerty, mají tam dobré restaurace a hezkou přírodu. Škoda jen, že se tam v centru špatně parkuje.

language of given text document is most likely to be:  Czech


Testing with a German paragraph

In [28]:
test = [input("enter a paragraph to detect language: ")]
test_clean = preprocess_data(test)
language = m.predict(test_clean)
assert language == 'German'

enter a paragraph to detect language: Familie Müller plant ihren Urlaub. Sie geht in ein Reisebüro und lässt sich von einem Angestellten beraten. Als Reiseziel wählt sie Mallorca aus. Familie Müller bucht einen Flug auf die Mittelmeerinsel. Sie bucht außerdem zwei Zimmer in einem großen Hotel direkt am Strand. Familie Müller badet gerne im Meer.

language of given text document is most likely to be:  German


Testing with a Hungarian paragraph

In [29]:
test = [input("enter a paragraph to detect language: ")]
test_clean = preprocess_data(test)
language = m.predict(test_clean)
assert language == 'Hungarian'

enter a paragraph to detect language: A nevelésnek az emberi személyiség teljes kibontakoztatására, valamint az emberi jogok és alapvető szabadságok tiszteletbentartásának megerősítésére kell irányulnia. A nevelésnek elő kell segítenie a nemzetek, valamint az összes faji és vallási csoportok közötti megértést, türelmet és barátságot, valamint az Egyesült Nemzetek által a béke fenntartásának érdekében kifejtett tevékenység kifejlődését. 3) A szülőket elsőbbségi jog illeti meg a gyermekeiknek adandó nevelés megválasztásában.

language of given text document is most likely to be:  Hungarian


Testing with an Italian paragraph

In [30]:
test = [input("enter a paragraph to detect language: ")]
test_clean = preprocess_data(test)
language = m.predict(test_clean)
assert language == 'Italian'

enter a paragraph to detect language: La nostra famiglia è composta anche da altre due persone, i nostri figli, Manuela che ha diciassette anni, e Marco che ha quindici anni, e poi c'è anche Tremendo, il cane che vive con noi da nove anni, ed è parte della famiglia. Viviamo tutti nella nostra splendida casa con un grande giardino.

language of given text document is most likely to be:  Italian


Testing with a Polish paragraph

In [31]:
test = [input("enter a paragraph to detect language: ")]
test_clean = preprocess_data(test)
language = m.predict(test_clean)
assert language == 'Polish'

enter a paragraph to detect language: Każdego roku Mateusz nie może się doczekać tego dnia. Już wiele tygodni przed tą datą starannie planuje całe przyjęcie. Zaczyna od wyboru listy gości. Nie można oczywiście zapomnieć o rodzinie. Dlatego zawsze mile widziani są: mama, tata, brat oraz siostra. Czasem udaje się też zaprosić babcię, jeżeli dobrze się czuje. Przecież im więcej gości tym lepiej - nie tylko ze względu na prezenty. Oprócz gości będących osobami z jego rodziny, Mateusz nigdy nie zapomina też o swoich kolegach i przyjaciołach. Co to byłyby za urodziny, na których nie pojawiłby się Kacper, Ola, Wojtek albo Dawid?

language of given text document is most likely to be:  Polish


Testing with a Russian paragraph

In [32]:
test = [input("enter a paragraph to detect language: ")]
test_clean = preprocess_data(test)
language = m.predict(test_clean)
assert language == 'Russian'

enter a paragraph to detect language: Я с детства хотел завести собаку, но родители мне не разрешали. Пока я был ребёнком, у меня жил хомяк Хома. Хома был очень маленький и пушистый. Его шерсть была средней длинны и коричневого цвета. Родители купили большую клетку для него, с двумя этажами. Я был очень рад, когда у меня появился маленький друг. Было очень весело смотреть как Хома бегает в колесе. Мне нравилось кормить его морковкой и орехами

language of given text document is most likely to be:  Russian


Testing with a Spanish paragraph

In [33]:
test = [input("enter a paragraph to detect language: ")]
test_clean = preprocess_data(test)
language = m.predict(test_clean)
assert language == 'Spanish'

enter a paragraph to detect language: Hoy hace mucho frío. Es invierno y todas las calles están cubiertas de nieve. Dentro de poco vendrá la primavera y con ella el sol y el tiempo cálido. La semana pasada estuvo de lluvia y tormenta. Incluso un rayo cayó encima de la campana de la catedral, pero no ocurrió nada. Los truenos siempre me han dado miedo y mucho respeto. Pero tenemos suerte... pues la previsión del tiempo para mañana es muy buena. Dicen que hoy habrá heladas y por la tarde granizo, pero mañana el día será soleado. A ver si tengo suerte y veo algún arcoíris.

language of given text document is most likely to be:  Spanish
