## Language Detection from documents using n-gram profiles

This notebook is an attempt at building a language detector based on n-gram profiles, inspired by [N-gram-based text categorization (Cavnar and Trenkle, 1994)](https://sdmines.sdsmt.edu/upload/directory/materials/12247_20070403135416.pdf).



#### BibTeX entry
```bibtex
@inproceedings{Cavnar1994NgrambasedTC,
  title={N-gram-based text categorization},
  author={William B. Cavnar and John M. Trenkle},
  year={1994},
  url={https://api.semanticscholar.org/CorpusID:170740}
}
```

### Core concept

According to Zipf's law, the frequency of a word in a language is roughly inversely proportional to its rank in the frequency table: a handful of words occur very often, while the vast majority are rare. The same tends to hold for n-grams, so the ranking of the most frequent n-grams is a compact signature of a language. N-gram profiles are built on this ranking.
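
For reference, Zipf's law is usually written as a power law over rank. The symbols $f(r)$ (frequency of the item at rank $r$) and $\alpha$ (the exponent) are notation introduced for this note and are not used elsewhere in the notebook:

$$
f(r) \propto \frac{1}{r^{\alpha}}, \qquad \alpha \approx 1
$$

With $\alpha = 1$, the second most frequent item occurs roughly half as often as the first, the third roughly a third as often, and so on.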

Let's assume that we have a corpus $C$ covering $N$ languages $l_1, \dots, l_N$. For each language $l$ in $C$, we rank its most common n-grams; this ranking acts as the n-gram profile $R_l$ of $l$. Once the profiles for all languages have been computed, we can run inference on a held-out corpus containing $S$ sentences. For each sentence $s$ in that corpus, we first create its n-gram profile $R_s$. Then we measure the distance $d(R_s, R_l)$ between the ranks of the n-grams in $R_s$ and their ranks in the profile of every language $l$. The language with the smallest distance is selected as the prediction $\hat{y}_s$:

$$
\hat{y}_s = \underset{l \in \{l_1, \dots, l_N\}}{\arg\min} \; d(R_s, R_l)
$$
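
The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not necessarily the implementation used later in this notebook. It loosely follows the "out-of-place" rank distance from Cavnar and Trenkle, and the function names (`ngram_profile`, `out_of_place_distance`, `predict_language`), the underscore padding, the n-gram lengths 1 to 3, the `top_k` cutoff and the penalty for missing n-grams are all assumptions made for this example.

```python
from collections import Counter


def ngram_profile(text: str, n_max: int = 3, top_k: int = 300) -> dict[str, int]:
    """Map the top_k most frequent character n-grams to their rank (0 = most frequent)."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # assumption: underscores mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so ranks are deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place_distance(doc_profile: dict[str, int], lang_profile: dict[str, int]) -> int:
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    max_penalty = len(lang_profile)  # assumption: penalty equals the profile size
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def predict_language(sentence: str, lang_profiles: dict[str, dict[str, int]]) -> str:
    """Return the language whose profile is closest to the sentence profile."""
    doc_profile = ngram_profile(sentence)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place_distance(doc_profile, lang_profiles[lang]),
    )


# Toy training texts, purely illustrative (real profiles need far more text)
ENGLISH_TEXT = "the quick brown fox jumps over the lazy dog and the cat"
GERMAN_TEXT = "der schnelle braune fuchs springt über den faulen hund und die katze"

profiles = {
    "English": ngram_profile(ENGLISH_TEXT),
    "German": ngram_profile(GERMAN_TEXT),
}
print(predict_language("the dog and the fox", profiles))
```

Using ranks rather than raw counts keeps the comparison independent of text length, which matters here because a single sentence is compared against a profile built from a whole corpus.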

### Corpus

I am using this small corpus from Kaggle titled [Language Detection](https://www.kaggle.com/code/basilb2s/language-detection-using-nlp). It contains 17 languages.

```python
import mlcroissant as mlc
import pandas as pd

DATASET_URL = "https://www.kaggle.com/datasets/basilb2s/language-detection/croissant/download"


def get_croissant_dataset(dataset_url: str = DATASET_URL) -> pd.DataFrame:
    # Fetch the Croissant JSON-LD
    croissant_dataset = mlc.Dataset(dataset_url)

    # Check what record sets are in the dataset
    record_sets = croissant_dataset.metadata.record_sets

    # Fetch the records and put them in a DataFrame
    df = pd.DataFrame(
        croissant_dataset.records(record_set=record_sets[0].uuid))

    # Rename the columns
    df.rename(columns={"Language+Detection.csv/Text": "text",
                       "Language+Detection.csv/Language": "language"}, inplace=True)

    # Convert the byte strings to UTF-8 text
    df["text"] = df["text"].apply(lambda x: x.decode("utf-8"))
    df["language"] = df["language"].apply(lambda x: x.decode("utf-8"))

    return df


df = get_croissant_dataset()
df.head()
```

  -  [Metadata(Language Detection)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.


|   | text | language |
|---|------|----------|
| 0 | Nature, in the broadest sense, is the natural... | English |
| 1 | "Nature" can refer to the phenomena of the phy... | English |
| 2 | The study of nature is a large, if not the onl... | English |
| 3 | Although humans are part of nature, human acti... | English |
| 4 | [1] The word nature is borrowed from the Old F... | English |
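
Since the profiles are built per language and evaluated on held-out sentences, one possible way to prepare a DataFrame with `text` and `language` columns is sketched below. The helper name `split_by_language`, the 20% held-out fraction and the toy DataFrame are assumptions for illustration only; the real `df` loaded above would be passed in place of `toy`.

```python
import pandas as pd


def split_by_language(df: pd.DataFrame, held_out_frac: float = 0.2, seed: int = 0):
    """Return (train_texts, held_out).

    train_texts maps each language to its concatenated training text;
    held_out is a DataFrame of sentences kept aside for evaluation.
    """
    # Sample the same fraction of rows from every language
    held_out = df.groupby("language").sample(frac=held_out_frac, random_state=seed)
    train = df.drop(held_out.index)
    # One long training string per language, ready for profile building
    train_texts = train.groupby("language")["text"].agg(" ".join).to_dict()
    return train_texts, held_out


# Toy stand-in for the real dataset (hypothetical data)
toy = pd.DataFrame({
    "text": [f"english sentence {i}" for i in range(5)]
            + [f"deutscher satz {i}" for i in range(5)],
    "language": ["English"] * 5 + ["German"] * 5,
})
train_texts, held_out = split_by_language(toy)
print(sorted(train_texts), len(held_out))
```

Sampling within each language group keeps the held-out set balanced in the same proportions as the original data.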
