# Question 1.
What approach would you use to automatically determine which languages are used in a given message? Please keep in mind that messages do not always contain two different languages; they can also contain just one. If you believe that some situations cannot be handled by your implementation, please specify which ones and why.

In [1]:
pip install langdetect

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
     |████████████████████████████████| 981 kB 24.5 MB/s
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993242 sha256=89f2f7d64c58624384c981a6227f2c290de4ace74217f761e5f97b8a1be9558d
  Stored in directory: /root/.cache/pip/wheels/c5/96/8a/f90c59ed25d75e50a8c10a1b1c2d4c402e4dacfa87f3aff36a
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [2]:
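from collections import Counter
from langdetect import DetectorFactory, detect
from typing import List

# langdetect is non-deterministic by default; a fixed seed makes runs repeatable.
DetectorFactory.seed = 0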

class Solution():
  """
  Detect languages in a text message, in a context where users often write 
  messages containing multiple languages.
  """

  def __init__(self, words_to_combine: int, message: str, max_lang: int):
    """ Class initialiser
    
    Args:
        words_to_combine (int): The number N of words in each N-gram.
        message (str): The user's multilingual text message.
        max_lang (int): The maximum number of languages to be detected.
    """
    self.words_to_combine = words_to_combine
    self.message = message
    self.max_lang = max_lang

  def generate_ngrams(self) -> List[str]:
      """ Creates the word N-grams of the message: every run of “N”
      successive words in the text.
      
      Returns:
          output (List[str]): The list of the generated N-grams. 
      """
      words = self.message.split()
      output = []  
      for i in range(len(words)-self.words_to_combine+1):
          # The slice end is exclusive, so i+N (not i+N-1) yields N words.
          output.append(" ".join(words[i:i+self.words_to_combine]))
      return output

  def detect_languages(self):
    """Detect the language of a text message. 
    
    Returns:
        detected_languages (dict): The dictionary of detected languages.
                detected_languages[n_gram_text] = language
        n (int): The number of generated N-grams.
    """
    detected_languages = {}
    n_grams = self.generate_ngrams()
    for sub_set in n_grams:
        detected_languages[sub_set] = detect(sub_set)
    return detected_languages, len(n_grams)
    
  def get_top_languages(self) -> None:
      """Print the dominant language, or the top self.max_lang languages.
      """
      detected_languages, _ = self.detect_languages()
      if not detected_languages: # The message is shorter than one N-gram
        print("The message '%s' is too short to analyse." % self.message)
        return
      count = Counter(detected_languages.values())
      # most_common() returns (language, votes) pairs, most frequent first.
      sorted_count = count.most_common()
      # Repeated N-grams share one dict key, so normalise by the votes counted.
      total_votes = sum(count.values())

      if sorted_count[0][1]/total_votes >= 0.80: # Only one language is present
        print("The detected language in: '%s' is: %s" %(self.message, sorted_count[0][0].upper()))
      else:
        top_languages = [lang.upper() for lang, _ in sorted_count[:self.max_lang]] # Multiple languages are present
        print("The detected languages in: '%s' are: %s" %(self.message, " and ".join(top_languages)))

### Test Solution ###
document = ["Hello, tu as vu Lost in the Middle of Night l’autre jour ?",
        "This is an English sentence written in english, dans un endroit frais et sec",
        "Who are you? 小家伙 or just what we call 非常小的家伙",
        "توفر Analytics Vidhya بوابة معرفية قائمة على المجتمع لمحترفي التحليلات وعلوم البيانات",
        "Hello, how are you doing?"]

solution = Solution(words_to_combine=3, message=document[1], max_lang=2)
solution.get_top_languages()

The detected languages in: 'This is an English sentence written in english, dans un endroit frais et sec' are: EN and FR
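The sliding-window voting idea behind `Solution` can be sketched without any external dependency. This is a minimal illustration, not the class above: `word_ngrams`, `vote_languages`, `toy_detect` and `FRENCH_WORDS` are hypothetical names, and `toy_detect` is only a stand-in for a real detector such as `langdetect.detect`. The 0.80 threshold mirrors the one used above.

```python
from collections import Counter
from typing import Callable, List


def word_ngrams(message: str, n: int) -> List[str]:
    """Return every run of n successive words; a shorter message yields itself."""
    words = message.split()
    if len(words) <= n:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def vote_languages(message: str, detect: Callable[[str], str], n: int = 3,
                   max_lang: int = 2, threshold: float = 0.80) -> List[str]:
    """Label each N-gram with detect() and keep the most frequent labels."""
    grams = word_ngrams(message, n)
    if not grams:
        return []
    # A list (not a dict) of labels, so repeated N-grams each keep their vote.
    ranked = Counter(detect(gram) for gram in grams).most_common()
    if ranked[0][1] / len(grams) >= threshold:
        return [ranked[0][0]]  # one language dominates
    return [lang for lang, _ in ranked[:max_lang]]


# Hypothetical toy detector (assumption): "fr" when most words are in a tiny
# French word list, "en" otherwise. Swap in a real detector in practice.
FRENCH_WORDS = {"dans", "un", "endroit", "frais", "et", "sec"}


def toy_detect(text: str) -> str:
    words = text.lower().split()
    french = sum(word in FRENCH_WORDS for word in words)
    return "fr" if french * 2 > len(words) else "en"


mixed = "This is an English sentence written in english, dans un endroit frais et sec"
print(vote_languages(mixed, toy_detect))
print(vote_languages("Hello, how are you doing?", toy_detect))
```

Passing the detector as a parameter keeps the voting logic testable on its own. Very short messages and messages where one language contributes only a word or two remain hard cases, because few N-grams are dominated by the minority language.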


# Question 2.
How would you proceed to detect if a given word is a variation of the word “blackhat”? Please write a Python function that determines if a string is a variation of this word. How would you generalize this to any given term?

In [3]:
def remove_duplicates(word: str) -> str:
    """ Collapse each run of consecutive duplicate characters in a string
        
    Args:
        word (str): A word that may contain repeated characters.

    Returns:
      output (str): The word with each run of repeated characters reduced to one.
    """
    chars = []
    prev = ""
    for char in word:
        if prev != char:
            chars.append(char)
            prev = char
    output = ''.join(chars)
    return output

def check_obfuscation(original_word: str, variation_word: str) -> bool:
    """ Verifies whether variation_word is a variation of original_word

    Args:
        original_word (str): An original word.
        variation_word (str): A possible obfuscation of the original one.

    Returns:
      same_word (bool): True if variation_word reduces to original_word,
      False otherwise.
    """
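    variation_word = variation_word.lower().replace("@", "a")
    new_word = "".join(char for char in variation_word if char.isalpha())
    new_word = remove_duplicates(new_word)
    # Collapse the original the same way, so that terms containing double
    # letters (e.g. "book") can still match their variations.
    same_word = remove_duplicates(original_word.lower()) == new_word
    return same_word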


### Test Solution ###
words = ["blackkkhat", "bl@khat", "b__la-c_k_hat", "abcd"]
check_obfuscation(original_word="blackhat", variation_word=words[0])

True
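One possible way to generalize the check to any term is to normalize both the term and the candidate in the same way, then compare them with a similarity ratio instead of strict equality, so that dropped letters (as in "bl@khat") are tolerated. The sketch below uses only the standard library; `LEET_MAP`, `normalize`, `is_variation` and the 0.85 threshold are assumptions chosen for illustration, not part of the solution above.

```python
from difflib import SequenceMatcher
from itertools import groupby

# Assumed substitution table; extend it with the obfuscations you expect.
LEET_MAP = str.maketrans({"@": "a", "4": "a", "3": "e", "1": "l", "0": "o", "$": "s"})


def normalize(word: str) -> str:
    """Lower-case, undo common substitutions, drop non-letters, collapse repeats."""
    mapped = word.lower().translate(LEET_MAP)
    letters = (char for char in mapped if char.isalpha())
    # groupby yields one key per run of identical consecutive characters.
    return "".join(char for char, _ in groupby(letters))


def is_variation(term: str, candidate: str, min_ratio: float = 0.85) -> bool:
    """True when the normalized candidate is close enough to the normalized term."""
    ratio = SequenceMatcher(None, normalize(term), normalize(candidate)).ratio()
    return ratio >= min_ratio


for word in ["blackkkhat", "bl@khat", "b__la-c_k_hat", "abcd"]:
    print(word, is_variation("blackhat", word))
```

The threshold trades recall for precision: a lower value catches more aggressive obfuscations but also starts matching unrelated words, so it would need tuning per term length.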