## Language Detction from documents using n-gram profiles

This notebook is an attempt at building an n-gram profile based language detector inspired by [N-gram-based text categorization Cavnar, Trenkle (1994)](https://sdmines.sdsmt.edu/upload/directory/materials/12247_20070403135416.pdf).



#### BibTex entry
```bibtex
@inproceedings{Cavnar1994NgrambasedTC,
  title={N-gram-based text categorization},
  author={William B. Cavnar and John M. Trenkle},
  year={1994},
  url={https://api.semanticscholar.org/CorpusID:170740}
}
```

### Core concept

According to the Zipf's Law, the most dominant words in a language are lesser in frequency than their more frequent yet less dominant counterparts. N-gram profiles are built on the idea of the ranking of the most prominent n-grams in a language.

Let's assume that we have a corpus $C$ of $N$ languages. For each language $L$ in the $C$, we can then create the ranking of the most common n-grams, which will act as the n-gram profile, $R_l$ for $l$. Once the profiles for all languages have been computed, we can infer on a held out corpus, containing $S$ sentences. For each sentence $s$ in the corpus, we first create the n-gram profile of $s$, $R_s$. Then, we measure the distance in the rankings of the n-grams in $R_s$ against the n-gram profiles of all the languages. In the end, the language which will have the least distance is selected as the predicted result. For our prediction target $y_l$, 

$$
y_l = min(R_{s_i} - [R_{L_1} , R_{L_2}, ... , R_{N}])
$$

### Corpus

I am using this small corpus from Kaggle titled [Language Detection](https://www.kaggle.com/code/basilb2s/language-detection-using-nlp). It contains 17 languages.

In [5]:
import mlcroissant as mlc
import pandas as pd

DATASET_URL = "https://www.kaggle.com/datasets/basilb2s/language-detection/croissant/download"

def get_croissant_dataset(dataset_url: str = DATASET_URL) -> pd.DataFrame:
    # Fetch the Croissant JSON-LD
    croissant_dataset = mlc.Dataset(dataset_url)

    # Check what record sets are in the dataset
    record_sets = croissant_dataset.metadata.record_sets

    # Fetch the records and put them in a DataFrame
    df = pd.DataFrame(
        croissant_dataset.records(record_set=record_sets[0].uuid))
    
    # Rename the columns
    df.rename(columns={"Language+Detection.csv/Text": "text",
              "Language+Detection.csv/Language": "language"}, inplace=True)
    
    # convert the binary strings to utf-8
    df["text"] = df["text"].apply(lambda x: x.decode("utf-8"))
    df["language"] = df["language"].apply(lambda x: x.decode("utf-8"))
        
    return df

df = get_croissant_dataset()
df.head()

  -  [Metadata(Language Detection)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.


Unnamed: 0,text,language
0,"Nature, in the broadest sense, is the natural...",English
1,"""Nature"" can refer to the phenomena of the phy...",English
2,"The study of nature is a large, if not the onl...",English
3,"Although humans are part of nature, human acti...",English
4,[1] The word nature is borrowed from the Old F...,English


### Language to index dictionary

I'm assigning an integer id to each of the unique target languages in the dataset, which can then be used as index while creating language specific arrays later on.

In [14]:
unique_languages = df["language"].unique()

language_to_index = {language: index for index, language in enumerate(unique_languages)}
language_to_index

{'English': 0,
 'Malayalam': 1,
 'Hindi': 2,
 'Tamil': 3,
 'Portugeese': 4,
 'French': 5,
 'Dutch': 6,
 'Spanish': 7,
 'Greek': 8,
 'Russian': 9,
 'Danish': 10,
 'Italian': 11,
 'Turkish': 12,
 'Sweedish': 13,
 'Arabic': 14,
 'German': 15,
 'Kannada': 16}

### Train-Test Splits

In [15]:
from sklearn.model_selection import train_test_split

texts, languages = df["text"], df["language"]
train_texts, test_texts, train_languages, test_languages = train_test_split(texts, languages, test_size=0.2, random_state=42)

assert len(train_texts) == len(train_languages)
assert len(test_texts) == len(test_languages)

### Language Classifier

In [52]:
import numpy as np

class Language:
    id: int
    name: str
    profile: dict    
    def __eq__(self, other: "Language") -> bool:
        # two languages are equal if their profiles are the same
        return np.array_equal(self.profile, other.profile)

In [30]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

In [37]:
from IPython.display import clear_output

nltk.download("punkt_tab")
clear_output()

In [53]:
class NGrams:
    n: int
    language: str
    grams: list[str]
    
    def __eq__(self, other: "NGrams") -> bool:
        # two n-grams are equal if their n and the language are the same
        return self.n == other.n and self.language == other.language

In [67]:
from tqdm.auto import tqdm
from collections import Counter


class NGramProfileClassifier:
    def __init__(self, n: int, mapping: dict[str, int], most_common_n_grams: int = 25):
        self.n = n
        self.mapping = mapping
        self.most_common_n_grams = most_common_n_grams
        
        # a dictionary of languages and their n-gram profiles
        self.languages = {}
        self.n_grams = {}
        self.__populate()
        
        # number of languages
        self.n_languages = len(self.mapping)
        
        
    def __populate(self):
        for language, index in self.mapping.items():
            # lower case the language name
            language = language.lower()
            self.languages[language] = Language(index, language, {})
            self.n_grams[language] = NGrams(n=self.n, language=language, grams=[])
            
            
    def __process_single(self, text: str):
        # tokenize the text
        tokens = word_tokenize(text.lower())
        # remove stopwords
        # tokens = [token for token in tokens if token not in stopwords.words('english')]
        # create the n-grams
        n_grams = ngrams(tokens, self.n)
        n_grams = list(n_grams)
        return n_grams
            
    
    def __process_texts(self, texts: list[str], languages: list[str]):
        for text, language in tqdm(zip(texts, languages), total=len(texts), desc="processing texts"):
            n_grams = self.__process_single(text)
                
            # add the n-grams to the n-grams list
            self.n_grams[language.lower()].grams.extend(n_grams)  
                
                
                
    def __build_profile_language(self):
        for language, n_grams in self.n_grams.items():
            # take the count of the ngrams  
            counts = Counter(n_grams.grams)
            # take the top most_common_n_grams n-grams as profile
            profile = counts.most_common(self.most_common_n_grams)
            self.languages[language].profile = profile
            
    def __build_profile_text(self, text: str):
        n_grams = self.__process_single(text)
        counts = Counter(n_grams)
        profile = counts.most_common(self.most_common_n_grams)
        return profile
    

    
    def fit(self, texts: list[str], languages: list[str]):
        assert len(texts) == len(languages)
        
        # first create the n-grams for each text and language
        self.__process_texts(texts, languages)
        
        # create the profile for the language
        self.__build_profile_language()
        
        
    def get_profile_matrix(self):
        return self.profile_matrix
    
    def __distance(self, text_profile: dict, language: str):
        language_profile = self.languages[language].profile
        # find the common keys
        common_keys = set(text_profile.keys()) & set(language_profile.keys())
                
        distance = 0.0
        for ck in common_keys:
            distance += abs(text_profile[ck] - language_profile[ck])
        return distance
    
    def get_distance(self, text_profile: dict):
        language_profiles = [self.languages[language].profile for language in self.languages.keys()]
        distances  = []
        
        for lp in language_profiles:
            distances.append(self.__distance(text_profile, lp))
            
        return distances
        
    
    def __get_language_name(self, index: int):
        return [lang for lang, idx in self.mapping.items() if idx == index][0]
        
    
    def predict_single(self, text: str) -> str:
        """Predict the language name in lowercase"""
        
        # get ngrams
        text_n_grams = self.__process_single(text)
        
        # count the n-grams
        text_n_gram_counts = Counter(text_n_grams)

        # create a profile for the text
        text_profile = text_n_gram_counts.most_common(self.most_common_n_grams)
        
        distances = self.get_distance(text_profile)
        distances = np.array(distances)
        
        prediction = np.argmin(distances)
        prediction = self.__get_language_name(prediction)
        return prediction
    
    def predict(self, texts: list[str]) -> list[str]:
        """Predict the language of the texts"""
        predictions = [self.predict_single(text) for text in texts]
        return predictions
    

### Inference

In [68]:
def infer(clf: NGramProfileClassifier, test_dataset: list[str]) -> list[str]:
    """Infer the language of the text"""
    return clf.predict(test_dataset)

### Evaluation

Since this is a classification task, I am using the usual accuracy, precision, recall and f1-score as metric.

In [69]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report


def evaluate(predictions: list[str], labels: list[str]) -> tuple[float, float, float, float]:
    """Evaluate the predictions"""
    accuracy = accuracy_score(predictions, labels)
    precision = precision_score(predictions, labels, average='macro')
    recall = recall_score(predictions, labels, average='macro')
    f1 = f1_score(predictions, labels, average='macro')
    
    # also print the classification report
    print(classification_report(predictions, labels))
    
    return accuracy, precision, recall, f1

### Experimenting with n-grams

In [70]:
# for unigrams
unigram_clf = NGramProfileClassifier(n=1, mapping=language_to_index)
unigram_clf.fit(train_texts, train_languages)

processing texts:   0%|          | 0/8269 [00:00<?, ?it/s]

In [71]:
unigram_predictions = unigram_clf.predict(test_texts)
unigram_scores = evaluate(unigram_predictions, test_languages)

TypeError: unhashable type: 'list'

In [None]:
# for bigrams
bigram_clf = NGramProfileClassifier(n=2, mapping=language_to_index)
bigram_clf.fit(train_texts, train_languages)

In [None]:
# for trigrams
trigram_clf = NGramProfileClassifier(n=3, mapping=language_to_index)
trigram_clf.fit(train_texts, train_languages)

In [61]:
a = {"a": 45, "b": 10, "c": 100, "d": 1000}
b = {"a": 35, "b": 2}

# find common keys
common_keys = set(a.keys()) & set(b.keys())
common_keys = list(common_keys)
common_keys

distance = 0.0
for ck in common_keys:
    distance += abs(a[ck] - b[ck])
    
print(distance)

18.0
