# PolyglotNLP: Language Detection with spaCy

This notebook documents how I built my first language detection model. It is trained on datasets covering German, Spanish, Italian, Russian, Portuguese, and English, with more languages planned.

The datasets for this project are sourced from [Tatoeba](https://tatoeba.org/en/downloads), a repository that offers sentences in many languages and publishes weekly updates. The goal is a robust model that can identify the languages listed above.

Author: Robert Heßhaus  
Date: 21/03/2024

## Loading Packages

Let's start by loading the required libraries and setting up the environment for the project.

In [19]:
import pandas as pd
import numpy as np
import re
import spacy
from spacy.lang.de import German
from spacy.lang.es import Spanish
from spacy.lang.en import English
from spacy.lang.it import Italian
from spacy.lang.ru import Russian
from spacy.lang.pt import Portuguese
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV

# Create blank spaCy language pipelines (tokenizer only, no trained components)
nlp_de = German()
nlp_es = Spanish()
nlp_en = English()
nlp_pt = Portuguese()
nlp_ru = Russian()
nlp_it = Italian()

## Loading Language Files

Next, I load the per-language files and merge them into a single DataFrame so the data is easier to work with.

In [20]:
# Function to load data

def load_data(file):
    df = pd.read_csv(file, sep='\t', header=None, names=['ID', 'Language', 'Sentence'])
    return df


# Paths to Language Files and dictionary with spaCy Models
lang_files = {
    "deu": "./lang_data/deu_sentences.tsv",
    "eng": "./lang_data/eng_sentences.tsv",
    "spa": "./lang_data/spa_sentences.tsv",
    "rus": "./lang_data/rus_sentences.tsv",
    "por": "./lang_data/por_sentences.tsv",
    "ita": "./lang_data/ita_sentences.tsv",
}

language_models = {
    'deu': nlp_de,
    'eng': nlp_en,
    'spa': nlp_es,
    'rus': nlp_ru,
    'por': nlp_pt,
    'ita': nlp_it,
}

# Load Data

dfs = []

for lang_code, file_path in lang_files.items():
    dfs.append(load_data(file_path))

# Merge into one

df_combined = pd.concat(dfs).reset_index(drop=True)                     
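
To make the expected file layout concrete, here is a minimal sketch that parses a small in-memory sample the same way `load_data` does. The three sample rows are invented for illustration. The `quoting=csv.QUOTE_NONE` argument is an assumption that is not part of the original loader: it tells pandas to treat double quotes as ordinary characters, which may help if sentences contain unbalanced quotation marks.

```python
import csv
import io

import pandas as pd

# Hypothetical sample in the three-column, tab-separated layout that load_data assumes
sample_tsv = (
    "1\tdeu\tGuten Morgen.\n"
    "2\teng\tGood morning.\n"
    '3\teng\t"Hello, he said.\n'
)

sample_df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    header=None,
    names=["ID", "Language", "Sentence"],
    quoting=csv.QUOTE_NONE,  # assumption: keep quote characters as literal text
)

print(sample_df)
```

Whether the real files need this option depends on their contents, so it is worth comparing row counts with and without it.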

## Checking and Removal of Duplicates

Even though Tatoeba regularly checks and updates its datasets, we still want to make sure that all duplicate sentences are removed.

In [21]:
df_combined.drop_duplicates(subset="Sentence", inplace=True)
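
One side effect worth knowing: deduplicating on the `Sentence` column alone also drops a sentence that is spelled identically in two languages, keeping only the first occurrence and its label. The sketch below shows the difference on an invented toy frame. Deduplicating on the language and sentence pair is shown only as a possible alternative, not as what this notebook does.

```python
import pandas as pd

# Toy data: "No." appears in two languages, "Hello." twice in the same language
toy = pd.DataFrame({
    "Language": ["spa", "ita", "eng", "eng"],
    "Sentence": ["No.", "No.", "Hello.", "Hello."],
})

# Same call as in the notebook: one row per distinct sentence text
by_sentence = toy.drop_duplicates(subset="Sentence")

# Alternative (assumption): one row per distinct (language, sentence) pair
by_pair = toy.drop_duplicates(subset=["Language", "Sentence"])

print(len(by_sentence), len(by_pair))
```

Which variant is preferable depends on whether a sentence that is valid in several languages should count as a training example for each of them.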

## Normalization and Tokenization

To prepare the data for model training, normalization and tokenization are crucial steps. These processes involve converting the text to a uniform format and breaking it down into manageable pieces (tokens), respectively. For this purpose, we leverage the spaCy library, which offers robust tools for natural language processing across multiple languages.

Normalization ensures a consistent text representation by converting all characters to lowercase and replacing every character that is not a letter of the supported languages (digits, punctuation, other scripts) with a space. Tokenization then splits the text into individual words, using the spaCy tokenizer that matches the language label of each row.

In [22]:
# Normalize the text
def normalize_text(text):
    if not text:
        return ""
        
    # Convert the text to lowercase
    text = text.lower()

    # Replace every run of characters that are not letters of the supported languages with a space
    text = re.sub(r"[^a-zA-ZäöüßáéíóúñÁÉÍÓÚÑàèìòùÀÈÌÒÙçÇâêîôûÂÊÎÔÛëÿüïöäËYÜÏÖÄãõÃÕёЁа-яА-Я]+", ' ', text)
    return text

# Tokenize the Rows based on Language
def tokenize_with_language(row):
    text = row['Normalized_Sentence']
    lang = row['Language']

    if not text:
        return []
    # Select the appropriate spaCy model based on language
    if lang in language_models:
        doc = language_models[lang](text)
    else:
        raise ValueError(f"Unrecognized language code: {lang}")
    
    # Extract the tokens from the processed row
    return [token.text for token in doc]

# Shuffling the Data to ensure random distribution
df_combined = df_combined.sample(frac=1).reset_index(drop=True)

# Apply normalization and tokenization
df_combined['Normalized_Sentence'] = df_combined['Sentence'].apply(normalize_text)
df_combined['Tokens'] = df_combined.apply(tokenize_with_language, axis=1)

# Display the first few rows of the processed data
df_combined.head()

| | ID | Language | Sentence | Normalized_Sentence | Tokens |
|---|---|---|---|---|---|
| 0 | 3999040 | rus | Он не сказал мне, как его зовут. | он не сказал мне как его зовут | [он, не, сказал, мне, как, его, зовут] |
| 1 | 8307109 | rus | Думаю, все мы хотим знать, что произошло. | думаю все мы хотим знать что произошло | [думаю, все, мы, хотим, знать, что, произошло] |
| 2 | 6015486 | por | Ele mudou o número da placa do carro dele. | ele mudou o número da placa do carro dele | [ele, mudou, o, número, da, placa, do, carro, ... |
| 3 | 6599133 | ita | Non lo so. Sono appena arrivato qua. | non lo so sono appena arrivato qua | [non, lo, so, sono, appena, arrivato, qua] |
| 4 | 7035322 | eng | A lot of students had a crush on our philosoph... | a lot of students had a crush on our philosoph... | [a, lot, of, students, had, a, crush, on, our,... |
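
To see the effect of the normalization step in isolation, here is a small sketch using a shortened character class. The pattern and the name `normalize_sketch` are simplified stand-ins and do not cover the full letter set used in `normalize_text`. The final `.strip()` is also an addition, made so that no leading or trailing space is left behind.

```python
import re

# Simplified stand-in: Latin letters, a few accented letters, and Cyrillic
LETTERS_ONLY = re.compile(r"[^a-zäöüßáéíóúñçãõа-яё]+")

def normalize_sketch(text):
    """Lowercase the text and replace every run of non-letters with one space."""
    if not text:
        return ""
    return LETTERS_ONLY.sub(" ", text.lower()).strip()

italian = normalize_sketch("Non lo so. Sono appena arrivato qua.")
russian = normalize_sketch("Он не сказал мне, как его зовут.")

print(italian)
print(russian)
```

Because a run of punctuation and whitespace is matched as a single unit, `". "` collapses into one space rather than two.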


## Vectorization & Model Compilation

Before the model can be trained, the preprocessed text has to be converted into a numerical format, a process known as vectorization. For this task, I use the TfidfVectorizer from the scikit-learn library, which transforms the text into a TF-IDF (Term Frequency-Inverse Document Frequency) matrix. Because the vectorizer below is configured with `analyzer='char'` and `ngram_range=(1, 2)`, the features are single characters and character pairs rather than whole words. TF-IDF weights each of them by how often it occurs in a sentence relative to how common it is across the dataset.

The vectorized data is then used to train a Naive Bayes model. Although it is a simpler classification algorithm than a neural network, Naive Bayes is often very effective for text classification tasks, including language detection, in part because its assumption of independence among predictors keeps the model simple and fast to train.

For further details, visit:
- [TfidfVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
- [Naive Bayes classifier on Wikipedia](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)

In [24]:
# Extract sentences and labels
sentences = df_combined['Normalized_Sentence']
labels = df_combined['Language']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(sentences, labels, test_size=0.2)

# Initialization and training of the vectorizer
vectorizer = TfidfVectorizer(ngram_range=(1, 2), analyzer='char')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Set up the hyperparameter grid
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}

# Instantiate the Naive Bayes classifier
classifier = MultinomialNB()

# Instantiate the randomized search for hyperparameter optimization
# (with n_iter=5 and five candidate values, every alpha is tried once)
nb_rando_search = RandomizedSearchCV(classifier, param_grid, cv=5, n_jobs=1, n_iter=5)

# Fit the Classifier to the data
nb_rando_search.fit(X_train_tfidf, y_train)



In [25]:
# Make predictions using the classifier and calculate accuracy as well as Cross-Validation Scores
predictions = nb_rando_search.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, predictions)
scores = cross_val_score(nb_rando_search, X_train_tfidf, y_train)

print(f"Cross Validation Scores: {scores}")
print(f"Average Score: {np.mean(scores)}")

# print the accuracy
print(f"Accuracy: {accuracy * 100:.2f}%")

Cross Validation Scores: [0.96530756 0.96539292 0.96521509 0.96546283 0.96534309]
Average Score: 0.9653442997626358
Accuracy: 96.54%
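
Overall accuracy can hide differences between individual languages. `f1_score` is already imported at the top of the notebook, so a per-language breakdown could look roughly like the sketch below. The labels here are invented toy values, not results from the model above.

```python
from sklearn.metrics import classification_report, f1_score

# Toy labels for illustration only: one Spanish sentence is mistaken for Portuguese
toy_true = ["deu", "deu", "spa", "spa", "por", "por"]
toy_pred = ["deu", "deu", "spa", "por", "por", "por"]

# Macro F1 averages the per-language F1 scores with equal weight
macro_f1 = f1_score(toy_true, toy_pred, average="macro")
print(f"Macro F1: {macro_f1:.3f}")

# Precision, recall, and F1 for each language
print(classification_report(toy_true, toy_pred))
```

In the notebook itself, `y_test` and the predictions on `X_test_tfidf` would take the place of the two toy lists.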


## Testing Lab

To try out the model, add your own sentences to the `new_sentences` list below, following the example sentence, and then run the cell.\
If an error shows up, make sure you have run all the code cells above this one.

In [27]:
new_sentences = [
    "This is an example!"
    
]
normalized_sentences = [normalize_text(sentence) for sentence in new_sentences]
new_sentences_tfidf = vectorizer.transform(normalized_sentences)
predictions = nb_rando_search.predict(new_sentences_tfidf)


language_codes = {'deu': 'German', 'eng': 'English', 'spa': 'Spanish', 'rus': 'Russian', 'ita': 'Italian', 'por': 'Portuguese'}
predicted_languages = [language_codes[code] for code in predictions]

for sentence, prediction in zip(new_sentences, predicted_languages):
    print(f"Sentence: '{sentence}'\nPredicted Language: {prediction}\n")

Sentence: 'This is an example!'
Predicted Language: English

