#### Name : Syed Khalid Ahmed
#### Matriculation number : 276970

# Language Identification using Python

This program identifies four languages (English, German, Italian and Spanish) using character n-grams. The language profiles are built from the udhr corpus, and to further improve the accuracy I have also used news text in these languages. I will now describe the logic and flow of the program.

First I started by importing the relevant libraries.

In [1]:
import math
import re

from nltk.corpus import udhr

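Then I declared the dictionaries that store the n-grams of the training text (udhr corpus and news text) for each language. I saved these dictionaries in a list so that it is easy to manipulate them later. The n-grams of the input text are kept in a separate, temporary dictionary inside the n-gram function.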

In [2]:
udhr_store_english = dict()
udhr_store_german = dict()
udhr_store_italian = dict()
udhr_store_spanish = dict()

stores = [udhr_store_english,udhr_store_german,udhr_store_italian,udhr_store_spanish]

This list stores the language names in the same order as the dictionaries above, so that index i of "names" corresponds to index i of "stores". The variable "weight" will later hold the n-gram size.

In [3]:
names = ['English','German','Italian','Spanish']
weight = None

## N-Gram Function

This function calculates the n-grams of the text passed as an argument. The choice argument selects the mode. If choice is 0, the text is user input: its n-grams are saved in a temporary dictionary, which is then passed to the cosine similarity function. If choice is greater than 0, the text is training data: its n-grams are saved in the dictionary at index choice-1 of the list "stores" (1 for English, 2 for German, 3 for Italian, 4 for Spanish).

In [4]:
def NgramCalculator(text,choice):

    input_store = dict()
    global stores
    global names
    global weight
    
    # Taking 3 gram
    weight = 3

    if choice == 0:                          # For Input 
        for i in range(len(text)):           # Loop until the length of the text   
            temp = text[i:i+weight].strip().lower()     # Produce n-grams, strip the whitespaces and convert into lowercase

            #if len(temp) < weight-1:
            #    continue
            
            if temp in input_store:     # If the n-gram is already in the store    
                input_store[temp] += 1  # Increment its count by 1

            else:                       # If appearing for the first time
                input_store[temp] = 1   # Create a key of it and assign a value of 1

        CosineSimilarity(input_store)   # Call this function to find the Cosine Similarity
        
    else:                               # For language corpora
        for i in range(len(text)):
            
            temp = text[i:i+weight].strip().lower()    # Produce n-grams, strip the whitespaces and convert into lowercase

            ## The following lines perform text cleanup by removing special characters,
            ## numbers, tabs and newline characters from the text. Since these do not
            ## help in language identification, so it is better to remove them. 
            ## Now the n-grams contain values which are of most interest. 

            temp = re.sub('[,!@#$-]','',temp)        
            temp = re.sub('[0-9]','',temp)
            temp = re.sub('\t',' ',temp)
            temp = re.sub('\n',' ',temp).strip()

            #########################################################
            
            #if len(temp) < weight-1:
            #    continue

            # Since stores is a list and each index holds a dictionary,
            # we can perform a dictionary lookup
            
            if temp in stores[choice-1]:            
                stores[choice-1][temp] += 1
            else:
                stores[choice-1][temp] = 1
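
To make the counting step easier to follow in isolation, here is a minimal sketch of character n-gram counting. It is not the function above: `char_ngrams` is a hypothetical helper, it uses `collections.Counter` instead of a plain dictionary, and it skips n-grams that become empty after cleanup, which is an assumption of mine.

```python
from collections import Counter
import re

def char_ngrams(text, n=3):
    """Count cleaned character n-grams of `text` (hypothetical helper)."""
    counts = Counter()
    for i in range(len(text)):
        gram = text[i:i + n].strip().lower()
        gram = re.sub(r"[,!@#$\-0-9]", "", gram)      # drop punctuation and digits
        gram = re.sub(r"[\t\n]", " ", gram).strip()   # tabs/newlines become spaces
        if gram:                                      # assumption: skip empty n-grams
            counts[gram] += 1
    return counts

print(char_ngrams("Hello"))
```

Because the window slides to the very end of the text, the last windows are shorter than n, so "Hello" yields "hel", "ell", "llo", "lo" and "o".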


## Lookup Function

This function cross-checks the other dictionaries for a given n-gram of the input text. If an n-gram appears in only one dictionary, then it is highly probable that it is characteristic of that language. For example, n-grams containing the German letters ä, ö, ü or ß rarely occur in the other three languages. Hence if such an n-gram appears in the German dictionary and in no other dictionary, it is strong evidence that the input is German.

The argument 'key' represents the n-gram and the argument 'dict_index' represents the index of the dictionary that is currently being compared with the input.

In [5]:
def Lookup(key,dict_index):
    global stores
    global names

    for i in range(len(stores)):    # Loop through the available dictionaries    
        if i == dict_index:         # If on the same dictionary as the language, skip it since we are interested in finding the n-gram in other dictionaries
            continue
        
        if key in stores[i]:        # If the n-gram is also present in another dictionary, then it is common among languages and hence of no interest
            return False            # immediately return false

    return True                     # If not present in any other dictionary, return true
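
The same uniqueness check can be sketched on a dictionary keyed by language name. This is only an illustration: `is_unique_to` and the toy profiles are hypothetical and the counts are invented. Like the lookup function above, it only checks the other languages and does not verify that the n-gram occurs in the language itself.

```python
def is_unique_to(key, lang, profiles):
    """Return True if `key` occurs in no profile other than `lang`."""
    return all(key not in grams
               for name, grams in profiles.items() if name != lang)

# Toy profiles, invented for illustration only
profiles = {
    "English": {"the": 12, "ing": 7},
    "German":  {"süß": 2, "der": 9, "ing": 1},
}

print(is_unique_to("süß", "German", profiles))   # True
print(is_unique_to("ing", "German", profiles))   # False
```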


## Cosine Similarity Function

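This function computes a cosine-style similarity score between the n-grams of the input text and the n-grams stored for each language. If an n-gram occurs in only one language (according to the lookup function), its contribution to the numerator is squared. Because of this boost the score can exceed 1, so it is clipped to 0.99.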

In [6]:
def CosineSimilarity(input_store):

    global stores
    global names
    global weight

    print("\t\t\t\t\tStatistics :\n")

    ## Loop through all the number of dictionaries present    
    ## Since there are 4 dictionaries, so this loop will run from 0-3
    for i in range(len(stores)):
        numerator = 0           # Reset the numerator for each language
        denominator = 0         # Reset the denominator for each language
        TrainingData_temp = 0   # Sum of squared n-gram counts of the input text
        TestingData_temp = 0    # Sum of squared n-gram counts of the language corpus

        ## Loop through each key-value pair in the input n-gram dictionary
        for key,value in input_store.items():

            # If the key is present in the language dictionary    
            if key in stores[i]:
                if Lookup(key,i):   # Perform a lookup on other dictionaries, if it is true then
                    numerator += (value * int(stores[i][key])) ** 2     # Square the term, since an n-gram unique to this language is strong evidence for it
                    TrainingData_temp += value**2
                    TestingData_temp += int(stores[i][key]) ** 2

                else:
                    numerator += (value * int(stores[i][key]))
                    TrainingData_temp += value**2
                    TestingData_temp += int(stores[i][key]) ** 2
            else:
                numerator += (value * 0)
                TrainingData_temp += value**2
                TestingData_temp += 0
                
        denominator = math.sqrt(TrainingData_temp) * math.sqrt(TestingData_temp)

        try:
            cos_theta = numerator / denominator
            if cos_theta >= 0.99:       # If the score reaches above 99%
                cos_theta = 0.99        # Clip it to 99%

            print("For " + names[i] + ", similarity percentage is: ")
            print(str(round(cos_theta*100,3)) + " % \n")

        except ZeroDivisionError:       # Raised when no n-gram matches, e.g. süß will never appear in English
            print("No matching word in "+names[i])
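
For comparison, here is a minimal sketch of the plain textbook cosine similarity between two sparse count dictionaries. It is not the function above: it has no boost for unique n-grams, and it sums the norm of the second vector over all of its n-grams rather than only those that also occur in the first. The name `cosine` and the toy counts are hypothetical.

```python
import math

def cosine(a, b):
    """Textbook cosine similarity between two sparse count dictionaries."""
    dot = sum(count * b.get(gram, 0) for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0          # no n-grams on one side: treat as no similarity
    return dot / (norm_a * norm_b)

# Invented counts: the profile is the query scaled by 2, so the angle is 0
query = {"hel": 1, "ell": 1, "llo": 1}
profile = {"hel": 2, "ell": 2, "llo": 2}
print(round(cosine(query, profile), 3))   # 1.0
```

With this definition the result always lies between 0 and 1 for non-negative counts, so no clipping is needed.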


## Training Function

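This function reads the udhr corpora as well as the news texts and passes them to the n-gram function, which then generates the n-grams and adds them to the dictionary of the respective language. The news texts are expected in files named after the languages (English.txt, German.txt, Italian.txt and Spanish.txt) in the working directory.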

In [7]:
def Training():

    global weight
    
    print("\nTraining the model using the given data , Please Wait . . . \n")

    ## Read the corpora
    english = udhr.raw("English-Latin1")
    german = udhr.raw("German_Deutsch-Latin1")
    italian = udhr.raw("Italian-Latin1")
    spanish = udhr.raw("Spanish-Latin1")

    ## Pass these to NgramCalculator to calculate n-grams
    NgramCalculator(english,1)
    NgramCalculator(german,2)
    NgramCalculator(italian,3)
    NgramCalculator(spanish,4)

    print("Taking "+str(weight)+" grams")

    ## Read the news files sequentially 
    for i in range(len(names)):
        filename = names[i]+".txt"
        with open(filename,encoding="utf-8") as file:
            string = file.read()            # Read the whole news file at once
        
        NgramCalculator(string,i+1)


    print("\nTraining Completed . . .\n")
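
The idea of training on several sources per language can be sketched without the corpus or the news files. The example below is an assumption-laden illustration, not the function above: `trigram_counts` and `build_profiles` are hypothetical helpers, only full-length trigrams are counted, and tiny invented snippets stand in for the udhr and news texts.

```python
from collections import Counter

def trigram_counts(text, n=3):
    """Count the full-length character n-grams of `text`."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def build_profiles(sources):
    """Merge every text of a language into one n-gram Counter."""
    profiles = {}
    for lang, texts in sources.items():
        profile = Counter()
        for text in texts:
            profile.update(trigram_counts(text))   # counts add up across texts
        profiles[lang] = profile
    return profiles

# Tiny invented snippets stand in for the udhr and news texts
sources = {
    "English": ["the weather", "the news"],
    "German":  ["das wetter", "die nachrichten"],
}
profiles = build_profiles(sources)
print(profiles["English"]["the"])   # 3
```

Adding a further source for a language only means appending another text to its list, since the counts are accumulated in the same profile.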

## Input Function

This function repeatedly reads a string from the user and prints its similarity to each language, until the user enters 'q'.

In [None]:
def TakeInput():

    while True:
        print("Please enter a string to find its language similarity (press 'q' to quit) \n")
        string = str(input("--> "))

        if string.strip() == 'q':
            print("\nGoodbye :)")
            break

        else:
            NgramCalculator(string.strip(),0)   # Pass this to N-gram function with code 0 specifying this as an input text
            print("____________________________________________________\n")

## Main

In [None]:
Training()
TakeInput()


Training the model using the given data , Please Wait . . . 

Taking 3 grams

Training Completed . . .

Please enter a string to find its language similarity (press 'q' to quit) 

--> Hello world
					Statistics :

For English, similarity percentage is: 
75.965 % 

For German, similarity percentage is: 
42.75 % 

For Italian, similarity percentage is: 
52.161 % 

For Spanish, similarity percentage is: 
42.725 % 

____________________________________________________

Please enter a string to find its language similarity (press 'q' to quit) 

--> das wetter ist gut
					Statistics :

For English, similarity percentage is: 
68.187 % 

For German, similarity percentage is: 
54.687 % 

For Italian, similarity percentage is: 
63.036 % 

For Spanish, similarity percentage is: 
42.79 % 

____________________________________________________

Please enter a string to find its language similarity (press 'q' to quit) 

--> ist gut nacht
					Statistics :

For English, similarity percentage is: 
5