## Language Detection from documents using n-gram profiles

This notebook is an attempt at building a language detector based on n-gram profiles, inspired by [N-gram-based text categorization, Cavnar and Trenkle (1994)](https://sdmines.sdsmt.edu/upload/directory/materials/12247_20070403135416.pdf).



#### BibTeX entry
```bibtex
@inproceedings{Cavnar1994NgrambasedTC,
  title={N-gram-based text categorization},
  author={William B. Cavnar and John M. Trenkle},
  year={1994},
  url={https://api.semanticscholar.org/CorpusID:170740}
}
```

### Core concept

According to Zipf's law, the frequency of a word in a language is roughly inversely proportional to its rank in the frequency table: a small number of words occur very often, while most words occur rarely. N-gram profiles build on this idea by ranking the most frequent n-grams of a language.
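
To make the relationship between rank and frequency concrete, here is a small sketch that ranks the words of a toy text by how often they occur. The sample text and the helper name `rank_frequency` are made up for illustration, and on a text this short the Zipfian pattern is only loosely visible.

```python
from collections import Counter


def rank_frequency(tokens: list[str]) -> list[tuple[int, str, int]]:
    """Return (rank, token, count) triples, most frequent token first."""
    counts = Counter(tokens)
    return [
        (rank, token, count)
        for rank, (token, count) in enumerate(counts.most_common(), start=1)
    ]


# Toy text, made up for this example.
text = (
    "the cat sat on the mat and the dog sat on the rug "
    "while the bird sang and the cat slept"
)

table = rank_frequency(text.split())
for rank, token, count in table[:5]:
    print(rank, token, count)
```

A handful of tokens take the top ranks with high counts, and the long tail consists of tokens seen only once.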

Let's assume that we have a corpus $C$ covering $N$ languages. For each language $l$ in $C$, we create a ranking of its most common n-grams, which acts as the n-gram profile $R_l$ of $l$. Once the profiles for all languages have been computed, we can run inference on a held-out corpus containing $S$ sentences. For each sentence $s$ in that corpus, we first create its n-gram profile $R_s$. Then we measure the distance $d(R_s, R_l)$ between the rankings of the n-grams in $R_s$ and the n-gram profile of every language. In the end, the language with the smallest distance is selected as the prediction $\hat{y}_s$:

$$
\hat{y}_s = \arg\min_{l \in \{1, \dots, N\}} d(R_s, R_l)
$$
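
The ranking-and-distance idea can be sketched in a few lines of plain Python. This is a minimal illustration and not the classifier built later in this notebook: it assumes character bigrams, 0-based ranks, and a fixed penalty equal to the profile length for n-grams missing from a language profile. The function names and the two toy training sentences are hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive characters in the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def rank_profile(text: str, n: int = 2, size: int = 50) -> dict[str, int]:
    """Map the `size` most frequent n-grams to their rank (0 = most frequent)."""
    counts = Counter(char_ngrams(text.lower(), n))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(size))}


def out_of_place(text_profile: dict[str, int], language_profile: dict[str, int]) -> int:
    """Sum the rank differences between two profiles.

    N-grams absent from the language profile get a fixed penalty equal to
    the profile length (an assumption made for this sketch).
    """
    penalty = len(language_profile)
    return sum(
        abs(rank - language_profile[gram]) if gram in language_profile else penalty
        for gram, rank in text_profile.items()
    )


def predict(text: str, profiles: dict[str, dict[str, int]]) -> str:
    """Return the language whose profile is closest to the text's profile."""
    text_profile = rank_profile(text)
    return min(profiles, key=lambda lang: out_of_place(text_profile, profiles[lang]))


# Toy training texts, far too small for real use.
english = "the quick brown fox jumps over the lazy dog"
spanish = "el veloz zorro salta sobre el perro perezoso"
profiles = {"english": rank_profile(english), "spanish": rank_profile(spanish)}

print(predict(english, profiles))
print(predict(spanish, profiles))
```

Taking the `min` over the languages with the distance as key is the `argmin` from the formula: it returns the language itself, not the distance value.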

### Corpus

I am using this small corpus from Kaggle titled [Language Detection](https://www.kaggle.com/code/basilb2s/language-detection-using-nlp). It contains 17 languages.

In [1]:
import mlcroissant as mlc
import pandas as pd

DATASET_URL = "https://www.kaggle.com/datasets/basilb2s/language-detection/croissant/download"

def get_croissant_dataset(dataset_url: str = DATASET_URL) -> pd.DataFrame:
    # Fetch the Croissant JSON-LD
    croissant_dataset = mlc.Dataset(dataset_url)

    # Check what record sets are in the dataset
    record_sets = croissant_dataset.metadata.record_sets

    # Fetch the records and put them in a DataFrame
    df = pd.DataFrame(
        croissant_dataset.records(record_set=record_sets[0].uuid))
    
    # Rename the columns
    df.rename(columns={"Language+Detection.csv/Text": "text",
              "Language+Detection.csv/Language": "language"}, inplace=True)
    
    # convert the binary strings to utf-8
    df["text"] = df["text"].apply(lambda x: x.decode("utf-8"))
    df["language"] = df["language"].apply(lambda x: x.decode("utf-8"))
        
    return df

df = get_croissant_dataset()
df.head()



|   | text | language |
|---|------|----------|
| 0 | Nature, in the broadest sense, is the natural... | English |
| 1 | "Nature" can refer to the phenomena of the phy... | English |
| 2 | The study of nature is a large, if not the onl... | English |
| 3 | Although humans are part of nature, human acti... | English |
| 4 | [1] The word nature is borrowed from the Old F... | English |


### Language to index dictionary

I'm assigning an integer id to each unique target language in the dataset, which can then be used as an index when creating language-specific arrays later on.
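
As a small illustration of such a mapping, the sketch below builds both the language-to-index dictionary and its inverse from a toy list of labels. The labels and the name `index_to_language` are made up for this example.

```python
# Toy labels, made up for this example; duplicates are expected in real data.
labels = ["English", "French", "English", "Dutch", "French"]

# dict.fromkeys drops duplicates while keeping first-seen order.
unique_labels = list(dict.fromkeys(labels))

language_to_index = {language: index for index, language in enumerate(unique_labels)}

# The inverse mapping turns a predicted index back into a name in O(1).
index_to_language = {index: language for language, index in language_to_index.items()}

print(language_to_index)
print(index_to_language[2])
```

Keeping the inverse dictionary around avoids scanning the mapping every time an index has to be translated back to a language name.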

In [2]:
unique_languages = df["language"].unique()
unique_languages

array(['English', 'Malayalam', 'Hindi', 'Tamil', 'Portugeese', 'French',
       'Dutch', 'Spanish', 'Greek', 'Russian', 'Danish', 'Italian',
       'Turkish', 'Sweedish', 'Arabic', 'German', 'Kannada'], dtype=object)

Swedish and Portuguese are misspelled here. I am going to fix them before proceeding any further.

In [3]:
def fix_spelling(lang: str) -> str:
    """Fix the spelling of the language name"""
    spelling = {
        "Portugeese": "Portuguese",
        "Sweedish": "Swedish",
    }
    return spelling.get(lang, lang)


df["language"] = df["language"].apply(fix_spelling)
unique_languages = df["language"].unique()
unique_languages

array(['English', 'Malayalam', 'Hindi', 'Tamil', 'Portuguese', 'French',
       'Dutch', 'Spanish', 'Greek', 'Russian', 'Danish', 'Italian',
       'Turkish', 'Swedish', 'Arabic', 'German', 'Kannada'], dtype=object)

Now to apply the following pre-processing steps:

1. Tokenization
2. Stop word removal
3. Punctuation removal
4. Number removal

Now is also a good time to check which languages in the corpus have no stop word list. This matters because the whole approach relies on finding the dominant n-grams.
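
The four steps listed above can be sketched with the standard library alone. This is only an illustration under simplifying assumptions: a regular expression stands in for NLTK's `word_tokenize`, and `TOY_STOPWORDS` is a tiny, hypothetical stop word set, not the list NLTK ships.

```python
import re
import string

# Hypothetical stop word set, used only for this example.
TOY_STOPWORDS = {"the", "is", "in", "a", "of"}


def preprocess_text(text: str, stop_words: set[str]) -> list[str]:
    """Tokenize, then drop stop words, punctuation tokens and numbers."""
    # 1. tokenization: runs of word characters, or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # 2. stop word removal
    tokens = [t for t in tokens if t not in stop_words]
    # 3. punctuation removal: drop tokens made up only of punctuation
    tokens = [t for t in tokens if not all(ch in string.punctuation for ch in t)]
    # 4. number removal: drop tokens made up only of digits
    tokens = [t for t in tokens if not t.isdigit()]
    return tokens


sample = "Nature, in the broadest sense, is the natural world of 2024."
print(preprocess_text(sample, TOY_STOPWORDS))
```

Passing the stop words in as a set keeps each membership test cheap, which matters once the function is applied to every row of a DataFrame.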

In [4]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

In [5]:
from IPython.display import clear_output

nltk.download("punkt_tab")
nltk.download("stopwords")  # needed by stopwords.words() in the cells below
clear_output()

In [6]:
def find_sw_missing_languages(languages: list[str]):
    """Find the languages that are missing stopwords"""
    missing = []
    for lang in languages:
        try:
            _ = stopwords.words(lang.lower())
        except Exception:
            missing.append(lang)
        
            
    return missing


missing_languages = find_sw_missing_languages(unique_languages.tolist())
missing_languages

['Malayalam', 'Hindi', 'Tamil', 'Kannada']

I am going to exclude the data for these languages from the dataset.

In [7]:
# now to write a combined function to preprocess the corpus

import string

def lower_case(text: str, language: str) -> tuple[str, str]:
    """Lower case the text and the language name"""
    return text.lower(), language.lower()


def remove_stopwords(text: str, language: str) -> list[str]:
    """Tokenize the text and remove stopwords"""
    tokens = word_tokenize(text)
    # build the stopword set once instead of reloading the list for every token
    stop_words = set(stopwords.words(language))
    return [word for word in tokens if word not in stop_words]


def remove_punctuation(tokens: list[str]) -> list[str]:
    """Remove punctuation tokens from the token list"""
    return [word for word in tokens if word not in string.punctuation]

def remove_numbers(tokens: list[str]) -> list[str]:
    """Remove tokens that consist only of digits"""
    return [word for word in tokens if not word.isdigit()]

def remove_special_characters(tokens: list[str]) -> list[str]:
    """Keep only the alphanumeric tokens"""
    return [word for word in tokens if word.isalnum()]


def preprocess(df: pd.DataFrame) -> tuple[pd.DataFrame, dict[str, int]]:
    """Preprocess the data"""
    # remove languages missing stopwords
    # note: this lookup runs before the spelling fix below, so rows labelled
    # "Portugeese" and "Sweedish" are dropped here as well
    missing_languages = find_sw_missing_languages(df["language"].unique())
    # copy so that the column assignments below do not act on a slice of the original
    df = df[~df["language"].isin(missing_languages)].copy()
    assert find_sw_missing_languages(df["language"].unique()) == []
    
    # fix spelling
    df["language"] = df["language"].apply(fix_spelling)
    
    # lower case the text and the language name
    df["text"], df["language"] = zip(*df.apply(lambda x: lower_case(x["text"], x["language"]), axis=1))
    
    # tokenize and remove stopwords
    df["text"] = df.apply(lambda x: remove_stopwords(x["text"], x["language"]), axis=1)
    
    # remove punctuation
    df["text"] = df["text"].apply(remove_punctuation)
    
    # remove numbers
    df["text"] = df["text"].apply(remove_numbers)

    
    # mapping language to index
    unique_languages = df["language"].unique()
    language_to_index = {language: index for index, language in enumerate(unique_languages)}
    return df, language_to_index

# ==============================
# loading df again
df = get_croissant_dataset()
df, language_to_index = preprocess(df)
clear_output()

In [8]:
df.head()

|   | text | language |
|---|------|----------|
| 0 | [nature, broadest, sense, natural, physical, m... | english |
| 1 | [``, nature, '', refer, phenomena, physical, w... | english |
| 2 | [study, nature, large, part, science] | english |
| 3 | [although, humans, part, nature, human, activi... | english |
| 4 | [word, nature, borrowed, old, french, nature, ... | english |


### Train-Test Splits

In [9]:
from sklearn.model_selection import train_test_split

texts, languages = df["text"], df["language"]
train_texts, test_texts, train_languages, test_languages = train_test_split(texts, languages, test_size=0.2, random_state=42)

assert len(train_texts) == len(train_languages)
assert len(test_texts) == len(test_languages)

### Language Classifier

In [10]:
from dataclasses import dataclass
import numpy as np

@dataclass
class Language:
    id: int
    name: str
    profile: dict    
    def __eq__(self, other: "Language") -> bool:
        # two languages are equal if their profiles are the same
        return self.profile == other.profile

In [11]:
@dataclass
class NGrams:
    n: int
    language: str
    grams: list[str]
    
    def __eq__(self, other: "NGrams") -> bool:
        # two n-grams are equal if their n and the language are the same
        return self.n == other.n and self.language == other.language

In [30]:
from tqdm.auto import tqdm
from collections import Counter


class NGramProfileClassifier:
    def __init__(self, n: int, mapping: dict[str, int], profile_size: int = 25):
        self.n = n
        self.mapping = mapping
        self.profile_size = profile_size
        
        # a dictionary of languages and their n-gram profiles
        self.languages = {}
        self.n_grams = {}
        self.__populate()
        
        # number of languages
        self.n_languages = len(self.mapping)
        
        
    def __populate(self):
        for language, index in self.mapping.items():
            self.languages[language] = Language(index, language, {})
            self.n_grams[language] = NGrams(n=self.n, language=language, grams=[])
            
            
    def __process_single(self, text_tokens: list[str]) -> list[tuple[str, ...]]:
        # create the n-grams: tuples of n consecutive tokens
        return list(ngrams(text_tokens, self.n))
            
    
    def __process_texts(self, text_tokens: list[list[str]], languages: list[str]):
        for text_tokens, language in tqdm(zip(text_tokens, languages), total=len(text_tokens), desc="processing texts"):
            n_grams = self.__process_single(text_tokens)
                
            # add the n-grams to the n-grams list
            self.n_grams[language].grams.extend(n_grams)  
            
            
    def __get_least_common_n_grams(self, n_grams: list[str], n: int):
        # sort by ascending count, so the n rarest n-grams come first
        counts = Counter(n_grams)
        least_common = sorted(counts.items(), key=lambda x: x[1])[:n]
        return least_common
    
    
    def get_profile(self, n_grams: list[str], n: int):
        least_commons = self.__get_least_common_n_grams(n_grams, n)
        # k[0] keeps only the first element of each key: the first token of an
        # n-gram tuple, or the first character when the key is a plain string
        profile = {k[0]: v for k, v in least_commons}
        return profile
                
                
    def __build_language_profiles(self):
        for language, n_grams in self.n_grams.items():
            profile = self.get_profile(n_grams.grams, self.profile_size)
            self.languages[language].profile = dict(profile)
            
    
    def fit(self, text_tokens: list[list[str]], languages: list[str]):
        assert len(text_tokens) == len(languages)
        
        # first create the n-grams for each text and language
        self.__process_texts(text_tokens, languages)
        
        # create the profile for the language
        self.__build_language_profiles()
        
    
    def __distance(self, text_profile: dict, language_name: str):
        language_profile = self.languages[language_name].profile
        # find the common keys
        common_keys = set(text_profile.keys()) & set(language_profile.keys())
                
        distance = 0.0
        for ck in common_keys:
            distance += abs(text_profile[ck] - language_profile[ck])
        return distance
    
    def get_distance(self, text_profile: dict):
        distances  = []
        language_names = [lang for lang in self.languages.keys()]
        
        for language_name in language_names:
            distances.append(self.__distance(text_profile, language_name))
            
        return distances
        
    
    def __get_language_name(self, index: int):
        return [lang for lang, idx in self.mapping.items() if idx == index][0]
        
    
    def predict_single(self, text_tokens: list[str]) -> str:
        """Predict the language name in lowercase"""
        
        # build the profile directly from the tokens
        text_profile = self.get_profile(text_tokens, self.profile_size)
        
        # get the distances
        distances = self.get_distance(text_profile)
        distances = np.array(distances)
        
        prediction = np.argmin(distances)
        prediction = self.__get_language_name(prediction)
        return prediction
    
    def predict(self, text_tokens: list[list[str]]) -> list[str]:
        """Predict the language of the texts"""
        predictions = [self.predict_single(text_tokens) for text_tokens in text_tokens]
        return predictions
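
`nltk.util.ngrams` applied to a list of tokens yields tuples of consecutive tokens, so the classifier above works with word-level n-grams. Cavnar and Trenkle build their profiles from character n-grams of blank-padded words. The sketch below contrasts the two with a hypothetical helper `sliding_ngrams`; padding with a single blank on each side is a simplification chosen for this example.

```python
def sliding_ngrams(sequence, n: int) -> list[tuple]:
    """Return all length-n windows over a sequence as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


tokens = ["study", "nature", "large"]

# Word-level bigrams: windows over the token list.
word_bigrams = sliding_ngrams(tokens, 2)

# Character-level trigrams: windows over one blank-padded word.
char_trigrams = ["".join(gram) for gram in sliding_ngrams(" nature ", 3)]

print(word_bigrams)
print(char_trigrams)
```

Character n-grams repeat far more often than word n-grams in short texts, which gives a single sentence many more chances to overlap with a language profile.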
    

### Inference

In [27]:
def infer(clf: NGramProfileClassifier, test_dataset: list[list[str]]) -> list[str]:
    """Infer the language of the text"""
    return clf.predict(test_dataset)

### Evaluation

Since this is a classification task, I am using the usual accuracy, precision, recall and f1-score as metrics.

In [28]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report


def evaluate(predictions: list[str], labels: list[str]) -> tuple[float, float, float, float]:
    """Evaluate the predictions"""
    accuracy = accuracy_score(predictions, labels)
    precision = precision_score(predictions, labels, average='macro')
    recall = recall_score(predictions, labels, average='macro')
    f1 = f1_score(predictions, labels, average='macro')
    
    # also print the classification report
    print(classification_report(predictions, labels))
    
    return accuracy, precision, recall, f1
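
scikit-learn's metric functions take the true labels first and the predictions second (`y_true`, `y_pred`). Accuracy is the same in either order, but per-class precision and recall trade places when the two arguments are swapped. The sketch below shows this with a hand-written helper, `precision_recall`, and toy labels, both made up for illustration.

```python
def precision_recall(y_true: list[str], y_pred: list[str], label: str) -> tuple[float, float]:
    """Compute precision and recall for one class label."""
    true_positives = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # how often the label was predicted
    actual = sum(t == label for t in y_true)     # how often the label is correct
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Toy data: a classifier that answers "english" for every sample.
y_true = ["english", "french", "dutch", "english", "french"]
y_pred = ["english"] * 5

correct_order = precision_recall(y_true, y_pred, "english")
swapped_order = precision_recall(y_pred, y_true, "english")

print(correct_order)
print(swapped_order)
```

With the arguments in the intended order, a classifier that always answers "english" gets low precision and perfect recall for that class; with the arguments swapped, the two numbers are exchanged.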

### Experimenting with n-grams

In [31]:
# for unigrams
unigram_clf = NGramProfileClassifier(n=1, mapping=language_to_index)
unigram_clf.fit(train_texts, train_languages)

processing texts:   0%|          | 0/5941 [00:00<?, ?it/s]

In [32]:
unigram_predictions = unigram_clf.predict(test_texts)
unigram_scores = evaluate(unigram_predictions, test_languages)

              precision    recall  f1-score   support

      arabic       0.00      0.00      0.00         0
      danish       0.00      0.00      0.00         0
       dutch       0.00      0.00      0.00         0
     english       1.00      0.19      0.32      1486
      french       0.00      0.00      0.00         0
      german       0.00      0.00      0.00         0
       greek       0.00      0.00      0.00         0
     italian       0.00      0.00      0.00         0
     russian       0.00      0.00      0.00         0
     spanish       0.00      0.00      0.00         0
     turkish       0.00      0.00      0.00         0

    accuracy                           0.19      1486
   macro avg       0.09      0.02      0.03      1486
weighted avg       1.00      0.19      0.32      1486





In [33]:
# for bigrams
bigram_clf = NGramProfileClassifier(n=2, mapping=language_to_index)
bigram_clf.fit(train_texts, train_languages)

processing texts:   0%|          | 0/5941 [00:00<?, ?it/s]

In [34]:
bigram_predictions = bigram_clf.predict(test_texts)
bigram_scores = evaluate(bigram_predictions, test_languages)

              precision    recall  f1-score   support

      arabic       0.00      0.00      0.00         0
      danish       0.00      0.00      0.00         0
       dutch       0.00      0.00      0.00         0
     english       1.00      0.19      0.32      1486
      french       0.00      0.00      0.00         0
      german       0.00      0.00      0.00         0
       greek       0.00      0.00      0.00         0
     italian       0.00      0.00      0.00         0
     russian       0.00      0.00      0.00         0
     spanish       0.00      0.00      0.00         0
     turkish       0.00      0.00      0.00         0

    accuracy                           0.19      1486
   macro avg       0.09      0.02      0.03      1486
weighted avg       1.00      0.19      0.32      1486





In [39]:
# for trigrams
trigram_clf = NGramProfileClassifier(n=3, mapping=language_to_index)
trigram_clf.fit(train_texts, train_languages)

processing texts:   0%|          | 0/5941 [00:00<?, ?it/s]

In [40]:
trigram_predictions = trigram_clf.predict(test_texts)
trigram_scores = evaluate(trigram_predictions, test_languages)

              precision    recall  f1-score   support

      arabic       0.00      0.00      0.00         0
      danish       0.00      0.00      0.00         0
       dutch       0.00      0.00      0.00         0
     english       1.00      0.19      0.32      1486
      french       0.00      0.00      0.00         0
      german       0.00      0.00      0.00         0
       greek       0.00      0.00      0.00         0
     italian       0.00      0.00      0.00         0
     russian       0.00      0.00      0.00         0
     spanish       0.00      0.00      0.00         0
     turkish       0.00      0.00      0.00         0

    accuracy                           0.19      1486
   macro avg       0.09      0.02      0.03      1486
weighted avg       1.00      0.19      0.32      1486





In [42]:
unigram_clf.languages["english"].profile

{'origins': 1,
 'origin': 1,
 'pleasant': 1,
 'tasting': 1,
 '9th': 1,
 'millennium': 1,
 'bce': 1,
 'gardens': 1,
 'aesthetic': 1,
 'ornamentation': 1,
 'breeding': 1,
 'thermal': 1,
 'meteorite': 1,
 'collision': 1,
 'probably': 1,
 'triggered': 1,
 'non-avian': 1,
 'dinosaurs': 1,
 'reptiles': 1,
 'spared': 1,
 'mammals': 1,
 'complain': 1,
 'somehow': 1,
 'parliament': 1,
 'oviedo': 1}
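
Every entry in the profile printed above has a count of 1. For comparison, the following sketch shows how `collections.Counter` separates the most frequent items from the rarest ones, and how a rank profile could be derived from the frequent ones. The token list and the profile size of 2 are made up for illustration.

```python
from collections import Counter

# Toy tokens, made up for this example.
tokens = ["de", "la", "de", "el", "de", "la", "que", "en"]
counts = Counter(tokens)
k = 2

# Most frequent first: the entries a frequency-ranked profile would keep.
most_frequent = counts.most_common(k)

# Ascending sort by count: the rarest entries.
least_frequent = sorted(counts.items(), key=lambda item: item[1])[:k]

# A rank profile stores positions (1 = most frequent) instead of raw counts.
rank_profile = {token: rank for rank, (token, _) in enumerate(most_frequent, start=1)}

print(most_frequent)
print(least_frequent)
print(rank_profile)
```

Ranks are comparable between a short sentence and a large training corpus, whereas raw counts grow with the amount of text.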