# PolyglotNLP: Language Detection with spaCy

This notebook documents how I built my first language detection model. It is trained on datasets covering three languages (German, Spanish, and English), with plans to add more.

The datasets for this project are sourced from [Tatoeba](https://tatoeba.org/en/downloads), a repository that offers an extensive array of sentences across numerous languages, with weekly updates to ensure richness and diversity. The project aims to build a robust model capable of identifying the three languages listed above.

Author: Robert Heßhaus  
Date: 21/03/2024

## Loading Packages

Let's start by loading the required libraries and setting up the environment for the project.

In [27]:
import pandas as pd
import numpy as np
import re
import spacy
from spacy.lang.de import German
from spacy.lang.es import Spanish
from spacy.lang.en import English
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

# Create blank spaCy pipelines (tokenizer only, no trained models are downloaded)
nlp_de = German()
nlp_es = Spanish()
nlp_en = English()

## Loading Language Files

Next, I'll load the language files and merge them into a single DataFrame so they are easier to work with.

In [4]:
# Function to load data

def load_data(file):
    df = pd.read_csv(file, sep='\t', header=None, names=['ID', 'Language', 'Sentence'])
    return df

# Paths to Language Files

en_file = "./lang_data/eng_sentences.tsv"
de_file = "./lang_data/deu_sentences.tsv"
es_file = "./lang_data/spa_sentences.tsv"

# Load Data

df_en = load_data(en_file)
df_de = load_data(de_file)
df_es = load_data(es_file)

# Merge into one

df_combined = pd.concat([df_en, df_de, df_es]).reset_index(drop=True)                     
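
One detail that may be worth checking when reading tab-separated sentence files: pandas treats `"` as a quote character by default, so sentences containing quotation marks can be altered during parsing. The sketch below shows one possible guard, passing `quoting=csv.QUOTE_NONE`. The in-memory sample rows and the helper name `load_data_raw` are made up for illustration and are not part of this notebook's pipeline.

```python
import csv
import io

import pandas as pd

# Hypothetical sample in the same three-column layout (ID, Language, Sentence)
raw = (
    '1\teng\t"Hello," she said.\n'
    '2\teng\tHe said "no.\n'
    '3\teng\t"Unclosed quote\n'
    '4\teng\tLast one.\n'
)

def load_data_raw(source):
    # Hypothetical variant of load_data: QUOTE_NONE keeps quotation marks as literal characters
    return pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["ID", "Language", "Sentence"],
        quoting=csv.QUOTE_NONE,
    )

df_sample = load_data_raw(io.StringIO(raw))
print(len(df_sample))
print(df_sample["Sentence"].tolist())
```

Whether the real files are affected depends on their content, so comparing row counts with and without the option is a cheap sanity check.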

## Checking and Removal of Duplicates

Even though Tatoeba regularly checks and updates its datasets, we still want to make sure that all duplicate sentences are removed.

In [5]:
df_combined.drop_duplicates(subset="Sentence", inplace=True)
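
To make the effect of this step concrete, here is a minimal sketch on a made-up four-row DataFrame (the rows are illustrative, not taken from Tatoeba). Because the check uses only the `Sentence` column, a sentence that is spelled identically in two languages is kept once, with the label of its first occurrence.

```python
import pandas as pd

# Made-up rows: "No." appears twice in English and once in Spanish
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Language": ["eng", "eng", "spa", "deu"],
    "Sentence": ["No.", "No.", "No.", "Nein."],
})

# Same call as in the notebook, but returning a new DataFrame instead of modifying in place
deduped = toy.drop_duplicates(subset="Sentence")
dropped = len(toy) - len(deduped)

print(deduped)
print(f"Removed {dropped} duplicate rows")
```

If cross-language duplicates should survive, deduplicating on `["Language", "Sentence"]` would be one option; whether that helps the classifier is an open question for this dataset.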

## Normalization and Tokenization

To prepare the data for model training, normalization and tokenization are crucial steps. These processes involve converting the text to a uniform format and breaking it down into manageable pieces (tokens), respectively. For this purpose, we leverage the spaCy library, which offers robust tools for natural language processing across multiple languages.

Normalization ensures consistency in text representation by converting all characters to lowercase and replacing every character outside the alphabets of the three languages (digits, punctuation, and other symbols) with a space. Tokenization then splits the text into individual words, using the spaCy tokenizer that matches each sentence's language.

In [11]:
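# Normalize the text
def normalize_text(text):
    # Treat missing or non-string values (e.g. NaN from pandas) as empty
    if not isinstance(text, str) or not text:
        return ""

    # Convert text to lowercase
    text = text.lower()

    # Replace every run of characters outside the English, German, and Spanish alphabets with a space
    # (uppercase letters are not needed in the class because the text is already lowercased)
    text = re.sub(r"[^a-zäöüßáéíóúñ]+", ' ', text)
    return text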

# Tokenize the Rows based on Language
def tokenize_with_language(row):
    text = row['Normalized_Sentence']
    lang = row['Language']

    if not text:
        return []
    # Select the appropriate spaCy model based on language
    if lang == 'deu':
        doc = nlp_de(text)
    elif lang == 'eng':
        doc = nlp_en(text)
    elif lang == 'spa':
        doc = nlp_es(text)
    else:
        raise ValueError(f"Unrecognized language code: {lang}")
    
    # Extract the tokens from the processed row
    return [token.text for token in doc]

# Shuffling the Data to ensure random distribution
df_combined = df_combined.sample(frac=1).reset_index(drop=True)

# Apply normalization and tokenization
df_combined['Normalized_Sentence'] = df_combined['Sentence'].apply(normalize_text)
df_combined['Tokens'] = df_combined.apply(tokenize_with_language, axis=1)

# Display the first few rows of processed data
df_combined.head()

| | ID | Language | Sentence | Normalized_Sentence | Tokens |
|---|---|---|---|---|---|
| 0 | 11101026 | spa | Este es mi ordenador portable. | este es mi ordenador portable | [este, es, mi, ordenador, portable] |
| 1 | 5813019 | deu | Tom ist bei uns zu Hause nicht gerne gesehen. | tom ist bei uns zu hause nicht gerne gesehen | [tom, ist, bei, uns, zu, hause, nicht, gerne, ... |
| 2 | 9425561 | eng | Nobody knows why Tom did it. | nobody knows why tom did it | [nobody, knows, why, tom, did, it] |
| 3 | 8567236 | eng | This towel has a nasty odor. | this towel has a nasty odor | [this, towel, has, a, nasty, odor] |
| 4 | 3075050 | eng | Minetest is a clone of Minecraft. | minetest is a clone of minecraft | [minetest, is, a, clone, of, minecraft] |
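
As a quick check of what the normalization step keeps and discards, the following sketch applies a lowercase version of the character class used in `normalize_text` to three made-up sentences. The helper name `normalize` and the added `.strip()` are assumptions for readability; the notebook's function does not strip leading or trailing spaces.

```python
import re

def normalize(text: str) -> str:
    # Hypothetical stand-in for normalize_text, with surrounding spaces stripped for display
    text = text.lower()
    return re.sub(r"[^a-zäöüßáéíóúñ]+", " ", text).strip()

# Made-up examples with punctuation, digits, and accented letters
samples = [
    "¿Dónde está el baño?",
    "Übung macht den Meister!",
    "It costs $5, right?",
]
normalized = [normalize(s) for s in samples]

for before, after in zip(samples, normalized):
    print(f"{before!r} -> {after!r}")
```

Note that digits and currency symbols disappear entirely, so a sentence such as the third one loses the token `$5` rather than keeping a placeholder for it.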


## Vectorization & Model compilation

To train our model, the first step is to convert the preprocessed text into a numerical format, a process known as vectorization. For this task, we employ the TfidfVectorizer from the scikit-learn library, which transforms the text into a TF-IDF (Term Frequency-Inverse Document Frequency) matrix. Because the vectorizer is configured with `analyzer='char'` and `ngram_range=(1, 3)`, the features are character n-grams of one to three characters rather than whole words. TF-IDF weights each n-gram by how often it occurs in a sentence relative to how common it is across the dataset.
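
To illustrate what character-level vectorization produces, here is a small sketch on three made-up sentences (they are not from the dataset). It uses the same `analyzer` and `ngram_range` settings as the notebook and lists a few of the resulting n-gram features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences, one per language (illustrative only)
docs = ["das ist gut", "this is good", "esto es bueno"]

vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
matrix = vec.fit_transform(docs)

# vocabulary_ maps each character n-gram to its column index in the matrix
ngrams = sorted(vec.vocabulary_)
print(matrix.shape)
print([g for g in ngrams if len(g) == 3][:10])
```

Each row of `matrix` is one sentence and each column one n-gram, so short letter sequences that are typical of a language (such as `th` in English) become individual features.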

Following the vectorization, the data is used to train a Naive Bayes model. Despite being a simpler classification algorithm than a neural network, Naive Bayes is remarkably effective for text classification tasks, including language detection, thanks to its simplifying assumption of independence among predictors.

For further details, visit:
- [TfidfVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
- [Naive Bayes classifier on Wikipedia](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
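
For reference, the decision rule of a multinomial Naive Bayes classifier with additive smoothing can be written as follows. The notation is chosen here for illustration: $x_i$ is the TF-IDF weight of n-gram $i$ in the sentence, $N_{c,i}$ is the total weight of n-gram $i$ over all training sentences of language $c$, $N_c = \sum_i N_{c,i}$, and $n$ is the number of distinct n-grams.

$$
\hat{y} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} x_i \log \theta_{c,i} \right],
\qquad
\theta_{c,i} = \frac{N_{c,i} + \alpha}{N_c + \alpha n}
$$

The smoothing parameter $\alpha$ is the `alpha` hyperparameter of `MultinomialNB`: larger values pull every $\theta_{c,i}$ toward a uniform distribution, while values close to zero trust the observed n-gram weights more strongly.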

In [30]:
# Extract sentences and labels
sentences = df_combined['Normalized_Sentence']
labels = df_combined['Language']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(sentences, labels, test_size=0.2)

# Initialization and training of the vectorizer
vectorizer = TfidfVectorizer(ngram_range=(1, 3), analyzer='char')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Set up the hyperparameter grid
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}

# Instantiate the Naive Bayes classifier
classifier = MultinomialNB()

# Instantiate the grid search for hyperparameter optimization
nb_grid_search = GridSearchCV(classifier, param_grid, cv=5)

# Fit the Classifier to the data
nb_grid_search.fit(X_train_tfidf, y_train)
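
In the cell above, the vectorizer is fitted on the whole training split before the grid search runs, so every cross-validation fold sees n-gram statistics that include its own validation part. One possible alternative, sketched here on made-up toy data, wraps both steps in a scikit-learn `Pipeline` so the vectorizer is refitted inside each fold. The toy sentences, `cv=2`, the reduced grid, and the step names `tfidf` and `nb` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up, already normalized sentences: four per language
toy_sentences = [
    "das ist ein haus", "ich habe einen hund", "wir gehen nach hause", "der apfel ist rot",
    "this is a house", "i have a dog", "we are going home", "the apple is red",
    "esta es una casa", "tengo un perro", "vamos a casa", "la manzana es roja",
]
toy_labels = ["deu"] * 4 + ["eng"] * 4 + ["spa"] * 4

# Vectorizer and classifier are fitted together, fold by fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, 3))),
    ("nb", MultinomialNB()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"nb__alpha": [0.01, 0.1, 1]}
grid = GridSearchCV(pipeline, param_grid, cv=2)
grid.fit(toy_sentences, toy_labels)

print(grid.best_params_)
print(grid.predict(["wo ist der hund"]))
```

A side effect of this design is that the fitted pipeline accepts raw strings directly, so no separate `vectorizer.transform` call is needed at prediction time.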

In [31]:
# Make predictions on the test set and calculate accuracy.
# cross_val_score on the grid search object runs nested cross-validation (the grid search is refitted in every fold)
predictions = nb_grid_search.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, predictions)
scores = cross_val_score(nb_grid_search, X_train_tfidf, y_train)

print(f"Cross Validation Scores: {scores}")
print(f"Average Score: {np.mean(scores)}")

# print the accuracy
print(f"Accuracy: {accuracy * 100:.2f}%")

Cross Validation Scores: [0.99739259 0.99725604 0.99727951 0.99726243 0.99713227]
Average Score: 0.9972645682501169
Accuracy: 99.73%
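
Accuracy is a single number for all three languages together. `f1_score` is imported at the top of the notebook but not used, so here is a minimal sketch of how per-language scores and a confusion matrix could be computed. The label lists are made up for illustration and do not reflect the model's actual predictions.

```python
from sklearn.metrics import confusion_matrix, f1_score

labels = ["deu", "eng", "spa"]

# Made-up ground truth and predictions: one English sentence is misread as Spanish
y_true = ["deu", "deu", "eng", "eng", "spa", "spa"]
y_pred = ["deu", "deu", "eng", "spa", "spa", "spa"]

# Rows are true languages, columns are predicted languages, both in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)

# average=None returns one F1 score per language; "macro" is their unweighted mean
per_language_f1 = f1_score(y_true, y_pred, labels=labels, average=None)
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")

print(cm)
print(dict(zip(labels, per_language_f1.round(3))))
print(f"Macro F1: {macro_f1:.3f}")
```

In the notebook itself, `y_test` and `predictions` would take the place of the two made-up lists.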


## Testing Lab

To try out the model, add your own sentences to the `new_sentences` list below, following the format of the example sentence, and then run the cell.\
If an error shows up, make sure you have run all code cells above this one.

In [32]:
new_sentences = [
    "Fabio ist ein runder Mensch"
]
normalized_sentences = [normalize_text(sentence) for sentence in new_sentences]
new_sentences_tfidf = vectorizer.transform(normalized_sentences)
predictions = nb_grid_search.predict(new_sentences_tfidf)


language_codes = {'deu': 'German', 'eng': 'English', 'spa': 'Spanish'}
predicted_languages = [language_codes[code] for code in predictions]

for sentence, prediction in zip(new_sentences, predicted_languages):
    print(f"Sentence: '{sentence}'\nPredicted Language: {prediction}\n")

Sentence: 'Fabio ist ein runder Mensch'
Predicted Language: German
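
A predicted label alone does not show how certain the classifier is. `MultinomialNB` also provides `predict_proba`, and the sketch below uses it to return a probability next to the language name. It trains a tiny stand-in model on made-up sentences so that it runs on its own; the helper name `detect_language`, the toy data, and the simple `.lower()` normalization are assumptions, not part of this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

LANGUAGE_NAMES = {"deu": "German", "eng": "English", "spa": "Spanish"}

# Tiny stand-in model trained on made-up sentences (illustrative only)
toy_model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    MultinomialNB(),
)
toy_model.fit(
    ["das ist ein haus", "ich habe einen hund",
     "this is a house", "i have a dog",
     "esta es una casa", "tengo un perro"],
    ["deu", "deu", "eng", "eng", "spa", "spa"],
)

def detect_language(sentence, model):
    """Hypothetical helper: return (language name, probability) of the most likely class."""
    probabilities = model.predict_proba([sentence.lower()])[0]
    best = probabilities.argmax()
    return LANGUAGE_NAMES[model.classes_[best]], float(probabilities[best])

name, confidence = detect_language("Fabio ist ein runder Mensch", toy_model)
print(f"Predicted Language: {name} ({confidence:.2f})")
```

A low probability could then be used to flag sentences the model is unsure about, for example very short inputs or text in a language it was not trained on.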



## TODO

The next steps are:
- Adding more languages
- Improving performance on those languages
- Improving the model's ability to handle code-switching
- Implementing a Speech-to-Text Feature

Since it's a study project, I will most likely not work on this project after finishing most of the above features.