### Create a Language Detector using RegExps.

### (I):

#### (1):

In [1]:
import re

def language_detector(text):
    greek_pattern = re.compile(r'\b[α-ωά-ώΆ-ΏίϊΐόύϋΰήΑ-ΩΊΪΌΎΫ\s]+\b', re.IGNORECASE)
    english_pattern = re.compile(r'\b[a-zA-Z\s]+\b', re.IGNORECASE)
    greeklish_pattern = re.compile(r'\b[α-ωά-ώΆ-Ώίϊΐόύϋΰήa-zA-Z\s]*'
                              r'(?:g|G|γ|Γ|th|TH|χ|Χ|ei|ou|th|ch|ph|ai|oi|ei|oi|si|ti|ri|ni|xi|psi|tsi|κ|Κ|άι|έι|όι|ού|υι|ευ|ηυ|αυ|άϊ|έϊ|ώϊ|οϊ|ϋι|ίς|ής|ος|ως|ας|ές|ής|ίς|ός|ύς|ώς|εί|αί|οί|ου|άς|ές|ής|ίς|ός|ύς|ώς|εί|αί|οί|ου|ά|έ|ή|ί|ό|ύ|ώ)?'
                              r'[a-zA-Z]+'
                              r'(?:is|aki|akis)?'
                              r'[a-zA-Z\s]*\b', re.IGNORECASE)


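    # Collect every span matched by each pattern
    greek_matches = greek_pattern.findall(text)
    english_matches = english_pattern.findall(text)
    greeklish_matches = greeklish_pattern.findall(text)

    # Guard against empty input, which would otherwise divide by zero
    if not text:
        return "Other"

    # Fraction of the input's characters covered by each pattern's matches
    greek_ratio = len(''.join(greek_matches)) / len(text)
    english_ratio = len(''.join(english_matches)) / len(text)
    greeklish_ratio = len(''.join(greeklish_matches)) / len(text)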

    if greek_ratio > 0.5 and english_ratio < 0.2:
        return "Greek"
    elif english_ratio > 0.5 and greek_ratio < 0.2:
        return "English"
    elif greeklish_ratio > 0.5:
        return "Greeklish"
    else:
        return "Other"
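
The detector decides on the share of characters that each pattern covers. The snippet below is a minimal sketch of that ratio computation in isolation. The two patterns are copied from `language_detector`, while the helper name `match_ratio` and the sample strings are hypothetical and only serve as illustration.

```python
import re

# Patterns copied from language_detector above
greek_pattern = re.compile(r'\b[α-ωά-ώΆ-ΏίϊΐόύϋΰήΑ-ΩΊΪΌΎΫ\s]+\b', re.IGNORECASE)
english_pattern = re.compile(r'\b[a-zA-Z\s]+\b', re.IGNORECASE)


def match_ratio(pattern, text):
    """Hypothetical helper: share of characters in `text` covered by matches."""
    if not text:
        return 0.0  # avoid dividing by zero on empty input
    return len(''.join(pattern.findall(text))) / len(text)


# Punctuation is not part of the character class, so it lowers the ratio:
# "Hello" and "world" are matched (10 characters) out of 13 in total.
print(match_ratio(english_pattern, "Hello, world!"))

# A sentence written only in Greek letters and spaces is covered entirely.
print(match_ratio(greek_pattern, "Καλημέρα κόσμε"))
```

Because the ratios are computed over the full input length, digits and punctuation count against every language, which is one way an input can fall below all thresholds and end up as "Other".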


#### (2):

#### Let's test it on a Greek/English/Greeklish dataset

First, read the CSV file.

In [2]:
import pandas as pd

# Load csv file using pandas
df = pd.read_csv("../csv/gold.csv")



### Assess Classifier

In [3]:
# Apply our Language detector
df['lang_detector'] = df['text'].apply(language_detector)
print(df['lang_detector'].value_counts())
df

lang_detector
English      20222
Greek        11586
Other         3209
Greeklish      618
Name: count, dtype: int64


| Unnamed: 0 | text | orig_lang | lang_detector |
|---|---|---|---|
| 0 | Γέλια σαν κομπολόγια | Greek | Greek |
| 1 | Καρδίαν καθαράν θέλ' ο Θεός | Greek | Greek |
| 2 | Ου Θεός κι ου γείτονας | Greek | Greek |
| 3 | Θέλει να κρυφθή πίσω από το δάχτυλό του | Greek | Greek |
| 4 | Όλοι κλαίν' τα χάλια τ'ς κι ο μυλωνάς τη δέσι | Greek | Greek |
| ... | ... | ... | ... |
| 35630 | Your money or your life | English | English |
| 35631 | Your name is mud | English | English |
| 35632 | You've never had it so good | English | English |
| 35633 | Zero tolerance | English | English |


In [4]:
# Overall accuracy: fraction of rows where the prediction equals the gold label
accuracy = (df['orig_lang'] == df['lang_detector']).mean()
print(f"Language Detector accuracy: {accuracy * 100:.2f}%")


Language Detector accuracy: 50.88%


Based on this dataset, the classification error probability (Pe = 1 - accuracy) is:

In [5]:
print(f"Pe = {(1-accuracy)*100:.2f}%")

Pe = 49.12%


We now have a basic picture of how our RegExp language detector performs.

Let's look at it in more detail, per language:

In [6]:
print("Based on gold.csv dataset:")
print("--------------------------------------------------")
for lang in df['orig_lang'].unique():
    # Rows whose gold label is `lang` and whose prediction agrees with it
    match_count = (df['lang_detector'][df['orig_lang'] == lang] == lang).sum()
    total_instances = len(df[df['orig_lang'] == lang])
    # Separate name so the overall `accuracy` from In [4] is not overwritten
    lang_accuracy = match_count / total_instances if total_instances > 0 else 0
    print(f"Accuracy for {lang}: {lang_accuracy * 100:.2f}%")
print("--------------------------------------------------")


Based on gold.csv dataset:
--------------------------------------------------
Accuracy for Greek: 97.65%
Accuracy for Greeklish: 1.70%
Accuracy for English: 92.67%
Accuracy for Other: 34.19%
--------------------------------------------------
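
Per-language accuracy shows how often each language is recognized, but not which label the misclassified rows receive. A confusion table answers that. The sketch below builds one with the standard library only; the labels in it are hypothetical toy data and are not taken from gold.csv.

```python
from collections import Counter

# Hypothetical toy labels for illustration only (NOT from gold.csv)
true_labels = ["Greek", "Greeklish", "Greeklish", "English", "Other"]
predicted = ["Greek", "English", "Greeklish", "English", "English"]

# Count each (true label, predicted label) pair
confusion = Counter(zip(true_labels, predicted))
labels = sorted(set(true_labels) | set(predicted))

# Rows are the true labels, columns are the predicted labels
print("true \\ predicted".ljust(18) + "".join(name.ljust(11) for name in labels))
for true_name in labels:
    counts = "".join(str(confusion[(true_name, pred)]).ljust(11) for pred in labels)
    print(true_name.ljust(18) + counts)
```

On the notebook's DataFrame, `pd.crosstab(df['orig_lang'], df['lang_detector'])` should give the same kind of table directly.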


In [7]:
print("Based on gold.csv dataset:")
print("----------------------------------------------------------")
for lang in df['orig_lang'].unique():
    match_count = (df['lang_detector'][df['orig_lang'] == lang] == lang).sum()
    total_instances = len(df[df['orig_lang'] == lang])
    # Separate name so the overall `accuracy` from In [4] is not overwritten
    lang_accuracy = match_count / total_instances if total_instances > 0 else 0
    print(f"Classification error probability for {lang}: {(1-lang_accuracy)*100:.2f}%")
print("----------------------------------------------------------")


Based on gold.csv dataset:
----------------------------------------------------------
Classification error probability for Greek: 2.35%
Classification error probability for Greeklish: 98.30%
Classification error probability for English: 7.33%
Classification error probability for Other: 65.81%
----------------------------------------------------------
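
One plausible reason for the very high Greeklish error is the order of the checks in `language_detector`: Greeklish is written in Latin letters, so `english_pattern` can cover it completely and the "English" branch fires before the Greeklish branch is reached. The sketch below illustrates this on a single hypothetical Greeklish sentence (not taken from gold.csv), with the two patterns copied from the function; it is an illustration of the mechanism, not a measurement of how often it occurs in the dataset.

```python
import re

# Patterns copied from language_detector
greek_pattern = re.compile(r'\b[α-ωά-ώΆ-ΏίϊΐόύϋΰήΑ-ΩΊΪΌΎΫ\s]+\b', re.IGNORECASE)
english_pattern = re.compile(r'\b[a-zA-Z\s]+\b', re.IGNORECASE)

# Hypothetical Greeklish input ("good morning, how are you")
sample = "kalimera ti kaneis"

english_ratio = len(''.join(english_pattern.findall(sample))) / len(sample)
# The Greek pattern allows whitespace, so here it only picks up the
# spaces between the Latin-script words.
greek_ratio = len(''.join(greek_pattern.findall(sample))) / len(sample)

print(f"english_ratio = {english_ratio:.2f}, greek_ratio = {greek_ratio:.2f}")

# Same condition as the second branch of language_detector
if english_ratio > 0.5 and greek_ratio < 0.2:
    print("Classified as English before the Greeklish check is evaluated")
```

If this is indeed the dominant failure mode, testing the Greeklish condition before the English one, or making the Greeklish pattern require markers that ordinary English lacks, would be natural things to try.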
