# Initial Setup and Dependencies

This cell sets up the required environment for a language classification system:

- Imports essential libraries:
  - NLP: NLTK for text processing
  - Data processing: NumPy, Pandas
  - Machine Learning: scikit-learn (TF-IDF vectorizer, classifiers, pipeline, model selection, metrics)
  - Utilities: tqdm, warnings

The cell also:
- Downloads required NLTK resources: udhr, punkt, stopwords
- Tokenizes a test sentence with `word_tokenize` to confirm that the tokenizer data loads
- Suppresses warnings for cleaner output
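
The resources listed above are `udhr`, `punkt`, and `stopwords`, while later cells also read the Gutenberg and Brown corpora. A minimal sketch of a pre-flight check follows, assuming NLTK's usual `corpora/...` and `tokenizers/...` lookup paths. The names `RESOURCE_PATHS`, `missing_resources`, and `ensure_resources` are hypothetical and not part of the notebook.

```python
# Hypothetical helper names; the lookup paths assume NLTK's data-directory layout.
RESOURCE_PATHS = {
    "udhr": "corpora/udhr",
    "punkt": "tokenizers/punkt",
    "stopwords": "corpora/stopwords",
    "gutenberg": "corpora/gutenberg",
    "brown": "corpora/brown",
}


def missing_resources(find, paths=RESOURCE_PATHS):
    """Return the resource names for which `find(path)` raises LookupError."""
    missing = []
    for name, path in paths.items():
        try:
            find(path)
        except LookupError:
            missing.append(name)
    return missing


def ensure_resources():
    """Download only the absent resources (needs nltk and network access)."""
    import nltk  # imported lazily so the helper above stays dependency-free

    for name in missing_resources(nltk.data.find):
        nltk.download(name)
```

Passing the lookup function as an argument keeps `missing_resources` testable without NLTK installed; in the notebook, `ensure_resources()` would be the only call needed.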

In [1]:
import nltk
import numpy as np
import pandas as pd
from nltk.corpus import udhr, gutenberg, brown, reuters
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords 
from collections import defaultdict, Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
import string
import re
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

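# Download required NLTK data.
# Note: later cells also read the gutenberg and brown corpora, which must
# already be available locally because they are not downloaded here.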
print("Downloading required NLTK resources...")
resources = ['udhr', 'punkt', 'stopwords']
for resource in resources:
    nltk.download(resource)

print("\nVerifying NLTK data...")
test_text = "This is a test sentence."
tokens = word_tokenize(test_text)
print("NLTK resources verified successfully!")

Downloading required NLTK resources...

Verifying NLTK data...
NLTK resources verified successfully!


[nltk_data] Downloading package udhr to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package udhr is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Enhanced Language Feature Extractor

Custom scikit-learn transformer that extracts linguistic features from text:

## Key Features
- TF-IDF vectorization over word n-grams (default range 1 to 3, English stop words removed)
- Text length metrics
- Character-level statistics:
  - Punctuation ratios
  - Case ratios (upper/lowercase)
  - Digit ratios
  - Whitespace analysis
  - Special character distribution
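
The character-level statistics above are all counts divided by the text length. The sketch below is a simplified illustration, not the notebook's implementation: it aggregates punctuation into a single ratio instead of one ratio per punctuation mark, and it returns zeros for empty input so that nothing divides by zero. The function name `char_ratios` is hypothetical.

```python
import string


def char_ratios(text):
    """Character-class ratios for one string; all zeros for empty input."""
    n = len(text)
    if n == 0:
        return {"upper": 0.0, "lower": 0.0, "digit": 0.0, "space": 0.0, "punct": 0.0}
    return {
        "upper": sum(c.isupper() for c in text) / n,
        "lower": sum(c.islower() for c in text) / n,
        "digit": sum(c.isdigit() for c in text) / n,
        "space": text.count(" ") / n,
        "punct": sum(c in string.punctuation for c in text) / n,
    }


# One uppercase letter, one lowercase letter, one space, one digit, one
# punctuation mark: each class accounts for 1 of 5 characters.
ratios = char_ratios("Ab 1!")
```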

## Methods
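- `get_advanced_features()`: Computes the length and character-ratio features for a single text
- `fit()`: Fits the TF-IDF vocabulary and records the combined feature names
- `transform()`: Returns a dense matrix with the handcrafted features first, followed by the TF-IDF columns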

Because it subclasses `BaseEstimator` and `TransformerMixin`, the extractor can be used as a step in a scikit-learn `Pipeline`.

In [2]:
class EnhancedLanguageFeatureExtractor(BaseEstimator, TransformerMixin):
    """Enhanced feature extractor with linguistic features"""
    
    def __init__(self, n_gram_range=(1, 3), max_features=1000):
        self.n_gram_range = n_gram_range
        self.max_features = max_features
        self.stopwords = set(stopwords.words('english'))
        self.feature_names_ = None
        self.tfidf = TfidfVectorizer(
            max_features=max_features,
            ngram_range=n_gram_range,
            stop_words='english'
        )
        
    def get_advanced_features(self, text):
        """Extract advanced linguistic features"""
        features = {}
        
        # Guard against empty input so the ratios below never divide by zero
        n_chars = max(len(text), 1)
        words = text.split()
        
        # Text length features
        features['total_length'] = len(text)
        features['avg_word_length'] = np.mean([len(w) for w in words]) if words else 0.0
        
        # Punctuation features (one ratio per punctuation mark)
        for punct in string.punctuation:
            features[f'punct_ratio_{punct}'] = text.count(punct) / n_chars
            
        # Case features
        features['uppercase_ratio'] = sum(1 for c in text if c.isupper()) / n_chars
        features['lowercase_ratio'] = sum(1 for c in text if c.islower()) / n_chars
        
        # Digit features
        features['digit_ratio'] = sum(1 for c in text if c.isdigit()) / n_chars
        
        # Whitespace features
        features['space_ratio'] = text.count(' ') / n_chars
        features['newline_ratio'] = text.count('\n') / n_chars
        
        # Language-specific features
        features['english_char_ratio'] = sum(1 for c in text if c in string.ascii_letters) / n_chars
        features['special_char_ratio'] = sum(1 for c in text if not c.isalnum() and c not in string.whitespace) / n_chars
        
        return features
    
    def fit(self, X, y=None):
        """Fit the feature extractor"""
        # Fit TF-IDF
        self.tfidf.fit(X)
        
        # Get sample features to establish feature names
        sample_features = self.get_advanced_features(X[0])
        self.feature_names_ = list(sample_features.keys()) + self.tfidf.get_feature_names_out().tolist()
        
        return self
    
    def transform(self, X):
        """Transform texts into feature matrix"""
        # Get TF-IDF features
        tfidf_features = self.tfidf.transform(X).toarray()
        
        # Get advanced features
        feature_matrix = []
        for text in tqdm(X, desc="Extracting features"):
            advanced_features = self.get_advanced_features(text)
            feature_vector = list(advanced_features.values())
            feature_matrix.append(feature_vector)
            
        # Combine features
        return np.hstack([np.array(feature_matrix), tfidf_features])

# Multilingual Text Collector

Text collection system that gathers and processes multilingual data from various sources:

## Features
- Collects English texts from:
  - Project Gutenberg corpus
  - Brown corpus
- Gathers non-English texts from:
  - Universal Declaration of Human Rights (UDHR) corpus (every `Latin1` file except `English-Latin1`)

## Key Functions
- `collect_english_texts()`: Extracts English text samples
- `collect_non_english_texts()`: Gathers non-English samples
- `chunk_text()`: Splits texts into manageable chunks
- `collect_dataset()`: Builds a balanced dataset by truncating both classes to the size of the smaller one, capped at `max_samples_per_class`

Includes progress tracking and error handling for robust data collection.
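
The chunking and balancing rules described above can be sketched as standalone functions. This is a simplified restatement of `chunk_text()` and the balancing step in `collect_dataset()`, not the notebook's code; the names `chunk_words` and `balance` are hypothetical.

```python
def chunk_words(text, min_length=1000):
    """Group whitespace-separated words into chunks of about `min_length` chars."""
    chunks, current, length = [], [], 0
    for word in text.split():
        current.append(word)
        length += len(word) + 1  # +1 accounts for the joining space
        if length >= min_length:
            chunks.append(" ".join(current))
            current, length = [], 0
    # Keep a trailing remainder only if it reaches half the target length.
    if current and length >= min_length / 2:
        chunks.append(" ".join(current))
    return chunks


def balance(english, non_english, max_per_class=1000):
    """Truncate both classes to the same size and build matching labels."""
    n = min(len(english), len(non_english), max_per_class)
    texts = english[:n] + non_english[:n]
    labels = ["english"] * n + ["non-english"] * n
    return texts, labels


# Seven 4-letter words with a target of 10: three full chunks plus a kept remainder.
kept = chunk_words("abcd " * 7, min_length=10)
# Same input with a target of 12: two full chunks, remainder too short and dropped.
dropped = chunk_words("abcd " * 7, min_length=12)
```

Because the counter includes one joining space per word, a chunk's joined string can be one character shorter than `min_length`. Slicing also keeps only the first `n` chunks of each class, so the retained samples come from whichever files were read first; shuffling before slicing is one option if a broader mix is wanted.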

In [3]:
class MultilingualTextCollector:
    """Collect and preprocess multilingual text data"""
    
    def __init__(self, min_text_length=1000):
        self.min_text_length = min_text_length
    
    def collect_english_texts(self):
        """Collect English texts from multiple sources"""
        english_texts = []
        
        print("\nCollecting English Texts:")
        print("-----------------------")
        
        # Collect from Gutenberg
        print("\nSamples from Gutenberg:")
        for fileid in gutenberg.fileids():
            try:
                text = ' '.join(gutenberg.words(fileid))
                chunks = self.chunk_text(text)
                if chunks:
                    english_texts.extend(chunks)
                    print(f"\nFile: {fileid}")
                    print("Sample text:")
                    print(chunks[0][:200] + "...\n")
            except Exception as e:
                print(f"Error processing Gutenberg file {fileid}: {str(e)}")
        
        # Collect from Brown corpus
        print("\nSamples from Brown corpus:")
        for fileid in brown.fileids():
            try:
                text = ' '.join(brown.words(fileid))
                chunks = self.chunk_text(text)
                if chunks:
                    english_texts.extend(chunks)
                    print(f"\nFile: {fileid}")
                    print("Sample text:")
                    print(chunks[0][:200] + "...\n")
            except Exception as e:
                print(f"Error processing Brown file {fileid}: {str(e)}")
        
        return english_texts
    
    def collect_non_english_texts(self):
        """Collect non-English texts from UDHR"""
        non_english_texts = []
        print("\nCollecting Non-English Texts:")
        print("---------------------------")
        
        available_languages = [fid for fid in udhr.fileids() 
                             if fid != 'English-Latin1' and 'Latin1' in fid]
        
        for lang in tqdm(available_languages, desc="Collecting non-English texts"):
            try:
                text = ' '.join(udhr.words(lang))
                chunks = self.chunk_text(text)
                if chunks:
                    non_english_texts.extend(chunks)
                    print(f"\nLanguage: {lang}")
                    print("Sample text:")
                    print(chunks[0][:200] + "...\n")
            except Exception as e:
                print(f"Error processing language {lang}: {str(e)}")
                continue
        
        return non_english_texts
    
    def chunk_text(self, text):
        """Split text into chunks of minimum length"""
        if not text:
            return []
            
        chunks = []
        current_chunk = []
        current_length = 0
        
        for word in text.split():
            current_chunk.append(word)
            current_length += len(word) + 1
            
            if current_length >= self.min_text_length:
                chunks.append(' '.join(current_chunk))
                current_chunk = []
                current_length = 0
        
        if current_chunk and current_length >= self.min_text_length / 2:
            chunks.append(' '.join(current_chunk))
            
        return chunks
    
    def collect_dataset(self, max_samples_per_class=1000):
        """Collect and prepare the complete dataset"""
        print("Collecting English texts...")
        english_texts = self.collect_english_texts()
        
        print(f"\nTotal English texts collected: {len(english_texts)}")
        if english_texts:
            print(f"Average length of English texts: {np.mean([len(text) for text in english_texts]):.0f} characters")
        
        print("\nCollecting non-English texts...")
        non_english_texts = self.collect_non_english_texts()
        
        print(f"\nTotal non-English texts collected: {len(non_english_texts)}")
        if non_english_texts:
            print(f"Average length of non-English texts: {np.mean([len(text) for text in non_english_texts]):.0f} characters")
        
        # Balance the dataset
        min_samples = min(len(english_texts), len(non_english_texts), max_samples_per_class)
        english_texts = english_texts[:min_samples]
        non_english_texts = non_english_texts[:min_samples]
        
        print("\nFinal Dataset Statistics:")
        print(f"Number of English samples: {len(english_texts)}")
        print(f"Number of non-English samples: {len(non_english_texts)}")
        
        # Create labels
        texts = english_texts + non_english_texts
        labels = ['english'] * len(english_texts) + ['non-english'] * len(non_english_texts)
        
        return texts, labels

# Complete Language Classification System

Binary English vs. non-English classification system that trains several models, evaluates them, and keeps the best one:

## Components
1. **Classifiers**
   - Random Forest
   - SVM
   - Neural Network (MLP)

2. **Pipeline Features**
   - Text preparation
   - Feature extraction
   - Model evaluation
   - Cross-validation

## Main Functions
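- `train_and_evaluate()`: Runs 5-fold cross-validation, fits each model, and keeps the one with the highest average of mean CV score and test score
- `predict()`: Returns the predicted label, or class probabilities when `get_probabilities=True`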
- `main()`: Demonstrates system with diverse test cases
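
Before predicting, the system pads short inputs by repetition and dampens the reported probabilities for them. The sketch below restates those two steps as plain functions, under the assumption of exactly two classes. The names `pad_to_length` and `shrink_binary` are hypothetical, and the empty-input check is an addition for illustration.

```python
def pad_to_length(text, target=100):
    """Repeat short text (space-separated), then cut to `target` characters.

    Inputs longer than `target` are truncated as well.
    """
    if not text:
        raise ValueError("text must be non-empty")
    if len(text) < target:
        repetitions = target // len(text) + 1
        text = (text + " ") * repetitions
    return text[:target]


def shrink_binary(probs):
    """Map each p to (p + 1) / 3, which stays normalized for two classes only."""
    return [(p + 1) / 3 for p in probs]


padded = pad_to_length("This is English.")
damped = shrink_binary([0.9, 0.1])
```

With two classes the damped values lie between 1/3 and 2/3 and still sum to 1; with any other number of classes they would not, so the adjustment is tied to the binary setup.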

## Testing Suite
Includes various test cases:
- Short texts
- Technical language
- Mixed language indicators
- Special characters and numbers

In [4]:
class LanguageClassificationSystem:
    """Complete language classification system with improvements"""
    
    def __init__(self, min_text_length=100, max_samples_per_class=1000):
        self.min_text_length = min_text_length
        self.max_samples_per_class = max_samples_per_class
        
        # Initialize classifiers with better parameters
        self.classifiers = {
            'random_forest': RandomForestClassifier(
                n_estimators=200,
                max_depth=20,
                min_samples_leaf=5,
                class_weight='balanced',
                random_state=42
            ),
            'svm': SVC(
                kernel='rbf',
                probability=True,
                class_weight='balanced',
                random_state=42
            ),
            'neural_net': MLPClassifier(
                hidden_layer_sizes=(200, 100, 50),
                max_iter=500,
                early_stopping=True,
                validation_fraction=0.2,
                random_state=42
            )
        }
        
        self.feature_extractor = EnhancedLanguageFeatureExtractor(
            max_features=2000
        )
        self.pipelines = {}
        self.best_pipeline = None
        
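    def prepare_text(self, text):
        """Pad short text by repetition, then cut to min_text_length characters"""
        if not text:
            # Avoid dividing by len(text) == 0 below
            raise ValueError("Cannot classify empty text.")
        if len(text) < self.min_text_length:
            repetitions = (self.min_text_length // len(text)) + 1
            text = (text + " ") * repetitions
        # Longer inputs are truncated to min_text_length as well
        return text[:self.min_text_length]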
    
    def build_pipelines(self):
        """Build classification pipelines"""
        for name, classifier in self.classifiers.items():
            self.pipelines[name] = Pipeline([
                ('features', self.feature_extractor),
                ('classifier', classifier)
            ])
    
    def train_and_evaluate(self):
        """Train and evaluate all models"""
        # Prepare data
        print("Preparing dataset...")
        collector = MultilingualTextCollector(min_text_length=self.min_text_length)
        texts, labels = collector.collect_dataset(max_samples_per_class=self.max_samples_per_class)
        
        # Split data
        X_train, X_test, y_train, y_test = train_test_split(
            texts, labels, test_size=0.2, stratify=labels, random_state=42
        )
        
        # Build and train pipelines
        print("\nTraining models...")
        self.build_pipelines()
        
        # Train and evaluate each model
        results = {}
        for name, pipeline in self.pipelines.items():
            print(f"\nEvaluating {name}...")
            
            # Cross-validation
            cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
            print(f"Cross-validation scores: {cv_scores}")
            print(f"Average CV score: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
            
            # Train on full training set
            pipeline.fit(X_train, y_train)
            
            # Evaluate on test set
            test_score = pipeline.score(X_test, y_test)
            print(f"Test score: {test_score:.4f}")
            
            # Detailed classification report
            y_pred = pipeline.predict(X_test)
            print("\nClassification Report:")
            print(classification_report(y_test, y_pred))
            
            # Save results
            results[name] = {
                'cv_mean': cv_scores.mean(),
                'cv_std': cv_scores.std(),
                'test_score': test_score,
                'pipeline': pipeline
            }
        
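        # Select best model based on both CV and test performance.
        # Note: because the test score is used for selection, the reported
        # test score of the chosen model is an optimistic estimate.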
        best_model = max(results.items(), 
                        key=lambda x: (x[1]['cv_mean'] + x[1]['test_score']) / 2)
        self.best_pipeline = best_model[1]['pipeline']
        print(f"\nBest model: {best_model[0]}")
        print(f"CV Score: {best_model[1]['cv_mean']:.4f}")
        print(f"Test Score: {best_model[1]['test_score']:.4f}")
        
        return self.best_pipeline
    
    def predict(self, text, get_probabilities=False):
        """Make prediction with text preparation"""
        if self.best_pipeline is None:
            raise ValueError("Model not trained. Call train_and_evaluate first.")
        
        # Prepare text
        prepared_text = self.prepare_text(text)
        
        if get_probabilities:
            probs = self.best_pipeline.predict_proba([prepared_text])[0]
            # Add confidence adjustment for very short texts
            if len(text) < self.min_text_length:
                # Shrink each probability toward 0.5 (range [1/3, 2/3]);
                # the result sums to 1 only in the two-class case
                probs = (probs + 1) / 3
            return probs
        return self.best_pipeline.predict([prepared_text])[0]

def main():
    """Main execution function with diverse test cases"""
    print("Initializing Language Classification System...")
    system = LanguageClassificationSystem(
        min_text_length=100,
        max_samples_per_class=2000
    )
    
    try:
        best_classifier = system.train_and_evaluate()
        
        # Test with various lengths and styles
        test_texts = [
            # Very short
            "This is English.",
            
            # Short with numbers and punctuation
            "Testing 123! Is this working properly?",
            
            # Medium with variety
            """This text includes numbers (123), punctuation marks (!?.,), 
            and some UPPERCASE words. It should test various features.""",
            
            # Non-English looking text
            "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
            
            # Mixed language indicators
            "English with émbelishments and école words.",
            
            # Technical English
            "The API endpoint returns JSON data with UTF-8 encoding.",
        ]
        
        print("\nTesting with diverse examples:")
        for i, test_text in enumerate(test_texts, 1):
            prediction = system.predict(test_text)
            probabilities = system.predict(test_text, get_probabilities=True)
            
            print(f"\nTest Example {i}:")
            print(f"Length: {len(test_text)} characters")
            print("Text:", test_text)
            print(f"Classification: {prediction}")
            print("Confidence scores:")
            # Read label order from the fitted pipeline instead of hardcoding it
            for label, prob in zip(system.best_pipeline.classes_, probabilities):
                print(f"  {label}: {prob:.4f}")
                
            # Add warning for very short texts
            if len(test_text) < system.min_text_length:
                print("Warning: Text is shorter than recommended length. Results may be less reliable.")
    
    except Exception as e:
        print(f"Error during execution: {str(e)}")

if __name__ == "__main__":
    main()

Initializing Language Classification System...
Preparing dataset...
Collecting English texts...

Collecting English Texts:
-----------------------

Samples from Gutenberg:

File: austen-emma.txt
Sample text:
[ Emma by Jane Austen 1816 ] VOLUME I CHAPTER I Emma Woodhouse , handsome , clever , and rich , with...


File: austen-persuasion.txt
Sample text:
[ Persuasion by Jane Austen 1818 ] Chapter 1 Sir Walter Elliot , of Kellynch Hall , in Somersetshire...


File: austen-sense.txt
Sample text:
[ Sense and Sensibility by Jane Austen 1811 ] CHAPTER 1 The family of Dashwood had long been settled...


File: bible-kjv.txt
Sample text:
[ The King James Bible ] The Old Testament of the King James Bible The First Book of Moses : Called...


File: blake-poems.txt
Sample text:
[ Poems by William Blake 1789 ] SONGS OF INNOCENCE AND OF EXPERIENCE and THE BOOK of THEL SONGS OF INNOCENCE...


File: bryant-stories.txt
Sample text:
[ Stories to Tell to Children by Sara Cone Bryant 1918 ] TWO LITTLE RIDD

Collecting non-English texts:   7%|▋         | 13/189 [00:00<00:01, 119.65it/s]


Language: Achehnese-Latin1
Sample text:
PEUNYATAANUMUM TEUNTANG HAK - HAK ASASI MANUSIA MUKADIMAH Assalammualaikum meunan ulon kheun wate bie...


Language: Achuar-Shiwiar-Latin1
Sample text:
Mash Nungkanam Pujuinau Angkan Pengker Pujusarat Tusar Pachisar Chichaamu Aints Mash Tu Pujusarti Tusar...


Language: Afaan_Oromo_Oromiffa-Latin1
Sample text:
Labsii Walii - gala Mirgoota Namummaa Murtoo Wal - ga ' ii Walii - gala kan Dhaabbata Waldaa Mootummootaatiin...


Language: Afrikaans-Latin1
Sample text:
UNIVERSELE VERKLARING VAN MENSEREGTE Aanhef AANGESIEN erkenning vir die inherente waardigheid en die...


Language: Aguaruna-Latin1
Sample text:
AENTS NII ANENTAIBAU , WAKEJAMU , YUPICHU DUTIKMAIN ASHI NUGKANUM ETSEJAMU NAGKABAU Juu ainawai ima...


Language: Albanian_Shqip-Latin1
Sample text:
DEKLARATA E PERGJITHSHME MBI TE DREJTAT E NJERIUT HYRJE Mbasi njohja e dinjitetit të lindur të të drejtave...


Language: Amahuaca-Latin1
Sample text:
DECLARACION UNIVERSAL DE DERECHOS HUMANOS [ Pr

Collecting non-English texts:  20%|█▉        | 37/189 [00:00<00:01, 118.68it/s]


Language: Caquinte-Latin1
Sample text:
QUERO ANCOTAVACAAJIAQUEMPANI MAASANO CAQUINTE ANCHOOCMIAQUENIJI CAMEETSA Jero oca iroqueti anquenqueajiaque...


Language: Cashibo-Cacataibo-Latin1
Sample text:
A ÑUCAMA SINANXUN CA AXA ANU TIMËCË UNICAMAN ËSAQUIN CAXA : A PAIN SINANCË BANA CAMAXUNBI CA ' UNANIA...


Language: Cashinahua-Latin1
Sample text:
JAVADA JUNIBUN MAl DASIBÍ AUN AKAIBUN , JATU NEMATIDUBUMAKI Preambulo Ainbu inun juni yudakuinki , yuinakamadan...


Language: Catalan-Latin1
Sample text:
Declaració Universal de Drets Humans PREÀMBUL Considerant que el reconeixement de la dignitat inherent...


Language: Catalan_Catala-Latin1
Sample text:
Declaració Universal de Drets Humans PREÀMBUL Considerant que el reconeixement de la dignitat inherent...


Language: Cebuano-Latin1
Sample text:
MALUKPANONG DEKLARASYON SA TAWHANONG MGA KATUNGOD [ Preamble ] Samtamg ang pag - ila sa tiunay nga kabililhon...


Language: Chamorro-Latin1
Sample text:
UNIVERSAL NA DECLARASION I DERECHO SIHA PAR

Collecting non-English texts:  26%|██▌       | 49/189 [00:00<00:01, 115.42it/s]


Language: Friulian_Friulano-Latin1
Sample text:
DECLARAZION UNIVERSÂL DAI DERITS DAL OM Fate buine e proclamade de Samblee gjenerâl des Nazion Unidis...


Language: Galician_Galego-Latin1
Sample text:
Declaración Universal dos Dereitos das Persoas Preámbulo A liberdade , a xustiza e a paz no mundo teñen...


Language: Garifuna_Garifuna-Latin1
Sample text:
Adamuridagunt to Tagumairagüdaru Garada to Ayanuhabaun luagu le yubat haun gürigia ubauwagu Tagúmeseha...


Language: German_Deutsch-Latin1
Sample text:
Die Allgemeine Erklärung der Menschenrechte Resolution 217 A ( III ) vom 10 . 12 . 1948 Präambel Da...


Language: Greenlandic_Inuktikut-Latin1
Sample text:
INUTTUT PISINNAATITAAFFIIT PILLUGIT SILARSUARMIOQATIGIINNUT NALUNAARUT AALLAQQAASIUT Ataqqinassusermik...


Language: Guarani-Latin1
Sample text:
TEKOVE YVYPORA KUERA MAYMAYVA DERECHO KUAAUKAHA Art . 1 Mayma yvypóra ou ko yvy ári iñapytl ' yre ha...


Language: HaitianCreole_Kreyol-Latin1
Sample text:
DECLARASYON INIVESEL DWA DE 

Collecting non-English texts:  32%|███▏      | 61/189 [00:00<00:01, 113.20it/s]


Language: Hmong_Miao-SouthernEast-Guizhou-Latin1
Sample text:
ghad bob did doub nangd renl njanl xand yanl kit doux nangd daot Ad leb ghad bob did doub nend , leb...


Language: Hmong_Miao_Northern-East-Guizhou-Latin1
Sample text:
FANGB DAB NONGD DAIL NAIX BANGF QUAIF LIT Hveb Dangi Denx Cenf rent laib zaid naix taix laix maix zenb...


Language: Huasteco-Latin1
Sample text:
DHÉY TSALAP ABAL PATAL AN INIK ANI AN UXUM AXI K ' WAJÍL TI AL AN TSABÁL ONU ti al an tamub 1948 . Kom...


Language: Huitoto_Murui-Latin1
Sample text:
Nana Comini Uri Illafue Naga nairai Naciones Unidas railla rafuemo danomo duide . Perú iemo duide ....


Language: Hungarian_Magyar-Latin1
Sample text:
Az Emberi Jogok Egyetemes Nyilatkozata Bevezetô Tekintettel arra , hogy az emberiség családja minden...


Language: Ibibio_Efik-Latin1
Sample text:
Edisuan Etop Mbana Mme Unen Owo Ke Ofuri Ekondo AKWA EWET NWED Kpukpuru owo emana ye ukemukem ye asana...


Language: Icelandic_Yslenska-Latin1
Sample text:
Mannréttinda

Collecting non-English texts:  39%|███▊      | 73/189 [00:00<00:01, 108.52it/s]


Language: Inuktikut_Greenlandic-Latin1
Sample text:
INUTTUT PISINNAATITAAFFIIT PILLUGIT SILARSUARMIOQATIGIINNUT NALUNAARUT AALLAQQAASIUT Ataqqinassusermik...


Language: IrishGaelic_Gaeilge-Latin1
Sample text:
DEARBHÚ UILE - CHOITEANN CEARTA AN DUINE [ Preamble ] De Bhrí gurb é aithint dínte dúchais agus chearta...


Language: Italian-Latin1
Sample text:
DICHIARAZIONE UNIVERSALE DEI DIRITTI UMANI Preambolo Considerato che il riconoscimento della dignità...


Language: Italian_Italiano-Latin1
Sample text:
DICHIARAZIONE UNIVERSALE DEI DIRITTI UMANI Preambolo Considerato che il riconoscimento della dignità...


Language: Javanese-Latin1
Sample text:
PRANYATAN UMUM NGENANI HAK - HAK ASASI ( UMAT ) MANUNGSA Saben umat manungsa lair kanthi hak - hak kang...


Language: Kaonde-Latin1
Sample text:
MUKAMBIZHO MUKATAMPE PA NSAMBU YAFWAINWA MUNTU LULAVANANO Byo kyayukanyikwa mba buneme ne kwesakena...


Language: Kapampangan-Latin1
Sample text:
Ding Pang Universung Karapatang Pantau A Miproklama

Collecting non-English texts:  44%|████▍     | 84/189 [00:00<00:00, 107.09it/s]


Language: Kituba-Latin1
Sample text:
Luzayisu Ya Yinza Muvimba Ya Baluve Ya Muntu Dyambu ya ntete Nakutalaka ti kutambula ngenda ya binama...


Language: Latin_Latina-Latin1
Sample text:
DECLARATIONEM HOMINIS IURIUM UNIVERSAM EXORDIUM Omnium humanae gentis partium perspecto et cognito consensum...


Language: Latin_Latina-v2-Latin1
Sample text:
UNIVERSALIS DE JURE HOMINUM DECLARATIO quae , decreto CCXVII A ( III ), a Communi Conventu probata et...


Language: Latvian-Latin1
Sample text:
VISPÂRÈJÂ CILVÈKA TIESÌBU DEKLARÂCIJA ANO Åenerâlâ Asambleja pieðèmusi 1948 . gada 10 . decembrì PREAMBULA...


Language: Lingala-Latin1
Sample text:
Lisakoli ya molongo ya makoki ya moto Maloba ya yambo Na botalaka  te kondima limemya ya bato nyonso...


Language: Lozi-Latin1
Sample text:
TUMELELANO YE TUNA YA SWANELO YA MUTU MAKALELO Kinto ye zibahala kuli mutu kaufela mwa lifasi ki yena...


Language: Luba-Kasai_Tshiluba-Latin1
Sample text:
MAPANGADIKA MANGATA KUDI BUKWA - BISAMBA BYA BULOBA BUJIMA

Collecting non-English texts:  50%|█████     | 95/189 [00:00<00:00, 107.28it/s]


Language: Makonde-Latin1
Sample text:
LILOVE LYA VILAMBO LYA WASA WA VAN Pachinjililo Lisiku lya 10 Disemba 1948 , Lukumbi Lukulu lwa Umoja...


Language: Malagasy-Latin1
Sample text:
FANAMBARANA IRAISAM - PIRENENA MOMBA NY ZON ' OLOMBELONA Sasin - Teny Heverina fa ny fankatoavana ny...


Language: Malay_BahasaMelayu-Latin1
Sample text:
PERISYTIHARAN HAK ASASI MANUSIA SEJAGAT MUKADIMAH Bahawasanya pengiktirafan keutuhan kemuliaan dan hak...


Language: Mam-Latin1
Sample text:
AT TU ' MALAAL KOPIB ' IL TWI ' YALAAL KAAWB ' IL B ' IX NIINB ' IL [ Preamble ] Tuj kyaqiil twitz tx...


Language: Maori-Latin1
Sample text:
WHAKAPUAKITANGA WHANUI O NGA MANA O TE TANGATA - 1948 No te mea na te whakanoa a na te whakahawea ki...


Language: Mapudungun_Mapuzgun-Latin1
Sample text:
Kom Mapu Fijke Az Tañi Az Mogeleam Tuwvlzugun (" Preámbulo " pi ta wigka ) Kimnieel fij mapu mew tañi...


Language: Marshallese-Latin1
Sample text:
NAN IN KWALOK EO AN LAL IN KIN MARON KO AN ARMIJ Kin kwaloki Kinke wat

Collecting non-English texts:  57%|█████▋    | 108/189 [00:00<00:00, 112.37it/s]


Language: Minangkabau-Latin1
Sample text:
Deklarasi Sadunia Hak - Hak Asasi Manusia Mukadimah Sasungguahnyo pangakuan taradok martabat dasar dan...


Language: Miskito_Miskito-Latin1
Sample text:
Upla sut Raitka nani ba Tasba aiska laka ba Bapuia Asla Takanka tara ba Naha Upla sut Raitka nani ba...


Language: Mixteco-Latin1
Sample text:
Tnu ' u saja ní ñayiví ja ñatu na sa ' a ndeva ' tna ' a Se ' e kaka ' an taka ma ñu na ' nu Uxi kií...


Language: Nahuatl-Latin1
Sample text:
TLAJTOLNEMILISTLI TLEN MOIJTOJTOK PARA MA KUALI TIMOUIKAKAJ IPAN NI TLALTIPAKTLI KENKE YOLKI NI TLAJTOLI...


Language: Ndebele-Latin1
Sample text:
AMALUNGELO OMUNTU WONKE EMHLABENI WONKE JIKELELE [ Preamble ] Loba isimilo lenhlonipho lamalungelo agcweleyo...


Language: Ngangela_Nyemba-Latin1
Sample text:
VWAMBULULO VWA CIFUTI VWA VILINGA VYA VANU Mu matangwa likumi a ngonde ya nzimbi ya mwa ka wa kanunu...


Language: NigerianPidginEnglish-Latin1
Sample text:
Universal Declaration of Human Rights For Decembe

Collecting non-English texts:  65%|██████▍   | 122/189 [00:01<00:00, 115.23it/s]


Language: Nyanja_Chinyanja-Latin1
Sample text:
CHIBVOMEREZO CA LAMULO LOSAMALIRA KHALIDWE LA MUNTHU PA DZIKO LONSE LA PANSI CIYAMBI Popeza kuti citsimikizo...


Language: OccitanAuvergnat-Latin1
Sample text:
NONSIAMEN DEÚ DRET DEÚ Z - OMEI PÀ LÀ TARÀ TENTEIRÀ Badazou En pezâ que couvenï de là dïneto lïjadà...


Language: OccitanLanguedocien-Latin1
Sample text:
DECLARACION UNIVERSALA DELS DRETS HUMANS Preambul Considerant que de reconéisser la dignitat inerenta...


Language: Oromiffa_AfaanOromo-Latin1
Sample text:
Labsii Walii - gala Mirgoota Namummaa Murtoo Wal - ga ' ii Walii - gala kan Dhaabbata Waldaa Mootummootaatiin...


Language: Oshiwambo_Ndonga-Latin1
Sample text:
OMUSHANGWA GWAAYEHE GUUTHEMBA WOMUNTU Oohaputetekeli Uuna mpoka pwa taambwa ko esimano lyomuntu pavalo...


Language: Otomi_Nahnu-Latin1
Sample text:
XO MAA NU XIJMOJOI NU KUCHTI KJA ' NI XO MAA Nu ro ndejte ka ro jiegi nu föste i nu jiegi xa ' to da...


Language: Paez-Latin1
Sample text:
NASA YA ' NWE ' WEWA ' TE 

Collecting non-English texts:  71%|███████▏  | 135/189 [00:01<00:00, 117.13it/s]


Language: Quechua-Latin1
Sample text:
LLAJTANCHEJNINTINPAJ QHELQESQA TUKUY RUNAJPA ATIYNINKUTA SUT ' INCHASPA [ Preamble ] Kaytaqa jatun tantakuypi...


Language: Quichua-Latin1
Sample text:
TUCUY RUNACUNAPAC HATUM RIMAY CALLARI Yuyashpa ima shina quishpiri , taripa , sumac causay pachapi ,...


Language: Rarotongan_MaoriCookIslands-Latin1
Sample text:
AKAKITEANGA KI TE KATOATOA I TE AU TIKAANGA TANGATA ( Arikiia e akakiteia ki te katoatoa na roto i te...


Language: Rhaeto-Romance_Rumantsch-Latin1
Sample text:
Decleranza universala dals drets da l ' uman Pream Considerand cha l ' arcugnuschentscha da la dignità...


Language: Romani-Latin1
Sample text:
SA THEMENQI DEKLARÀCIA E MANUSIKANE HAKAJENQIRI Anglivak Dikhindor so o prinzaripen e manu ? enqe somandrune...


Language: Rukonzo_Konjo-Latin1
Sample text:
EBIRI OMO KISAKANGO EKYEKIHUGHO KYOSI EKIKAHAMULHA OBUHOLHO NEKISUMBI KYO ' BUNDUEKYAMABIRIKANIBWAKO...


Language: Rundi_Kirundi-Latin1
Sample text:
7IBIMENYESHEJWE N ' AMAKUNGU 

Collecting non-English texts:  78%|███████▊  | 147/189 [00:01<00:00, 116.00it/s]


Language: ScottishGaelic_GaidhligAlbanach-Latin1
Sample text:
GAIRM CHOITCHEANN AIR COIRICHEAN A ' CHINNE - DAONNA ROI - RADH Do bhrìgh ' s gu bheil e air aideachadh...


Language: Sharanahua-Latin1
Sample text:
Non ahuuacaimain huatiroquin ishon . Niaifofoan ichanancashon mato cunushonifo . [ Preamble ] Manifoti...


Language: Shipibo-Conibo-Latin1
Sample text:
JATIBIAINOA JONÍ COSHIBAON , JASCAASHON JACON JAHUEQUL ARESTI JONIBAON JAHUEQUESCAMABÍ ITIAQUIN SHINANA...


Language: Shona-Latin1
Sample text:
KURUDZIRO YEKUCHENGETEDZVA KWEKODZERO DZEVANHU PASI POSE ZARURO Sezvo kucherechedza hunhu nekodzero...


Language: Siswati-Latin1
Sample text:
SIMEMETELO SEMHLABA WONKHE MAYELANA NEMALUNGELO EBUNTFU SINGENISO Njengoba kwatiswa ngekubakhona ngekwemvelo...


Language: SolomonsPidgin_Pijin-Latin1
Sample text:
Universol Declarason lo Hiuman Raits Pidgin  Solomon Aelans . Preamble Taem umi kam fo luk savve lo...


Language: Somali-Latin1
Sample text:
BAAQA CAALAMIGA EE XUQ WQDA AADANAHA H

Collecting non-English texts:  84%|████████▍ | 159/189 [00:01<00:00, 110.95it/s]


Language: Sundanese-Latin1
Sample text:
PERNYATAAN UMUM NGEUNAAN HAK - HAK ASASI MANUSA Sakabeh manusa , gubragna ka alam dunya teh bari nampa...


Language: Swaheli-Latin1
Sample text:
UMOJA WA MATAIFA OFISI YA IDARA YA HABARI TAARIFA YA ULIMWENGU JUU YA HAKI ZA BINADAMU UTANGULIZI Kwa...


Language: Swahili_Kiswahili-Latin1
Sample text:
UMOJA WA MATAIFA OFISI YA IDARA YA HABARI TAARIFA YA ULIMWENGU JUU YA HAKI ZA BINADAMU UTANGULIZI Kwa...


Language: Swedish_Svenska-Latin1
Sample text:
ALLMÄN FÖRKLARING OM DE MÄNSKLIGA RÄTTIGHETERNA INLEDNING Enär erkännandet av det inneboende värdet...


Language: Tenek_Huasteco-Latin1
Sample text:
JUNKUNAL BITSOWTSIK Ti laju , in ajumtal a iich ' laju chaab ban tamub 1948 , an junkun - talaab ban...


Language: Tetum-Latin1
Sample text:
Deklarasaun Mundu Nia Ba Direitus Ema Nia Preambulu Hanesan hatene tuir dignidade ho direitu ema hotu...


Language: Tiv-Latin1
Sample text:
MYEE U GBAR - GBAR U AKAA A I DOO SHA CI U A ER A HANMAOR UMACE KENG KEN

Collecting non-English texts:  90%|█████████ | 171/189 [00:01<00:00, 89.69it/s] 


Language: Uighur_Uyghur-Latin1
Sample text:
dunya kishilik hoquqi xitabnamisi söz béshi insanlar ailisining barliq ezalirining özige xas izzet -...


Language: Umbundu-Latin1
Sample text:
UNKANDA WOLWALI WOMOKO YOMUNU Eteke cakala ekwi ( 10 ) ko sayi ya cemba kulima wohulukayi ovita eciya...


Language: Urarina-Latin1
Sample text:
SATIIN CAA CHAURUATANE QUE NENACAAURU CACHAAURU RAl RAUHI Nunue Satiin , cachaauru raite ne rauhi ....


Language: Uzbek-Latin1
Sample text:
INSON HUQUQLARI UMUMJAHON DEKLARATCIYASI 1948 yil , 10 dekabrda Birlashgan Millatlar Tashkiloti Bosh...


Language: Vlach-Latin1
Sample text:
DECLARATSIA UNIVERSALÃ TI - NDREPTURLI - A OMLUI ZBOR NÃINTI Ti - atsea câ pricânushtearea - a nâmuziljei...


Language: Walloon_Wallon-Latin1
Sample text:
DÉCLARÅCION UNIVERSÈLE DÈS DREÛTS D ' L ' OME [ Preamble ] Il a stu ad ' mètu ' ne fèye po totes Qui...


Language: Waray-Latin1
Sample text:
Sangkalibutan Nga Pag - Asoy Bahin Han Kanan Tawo Mga Katungod Pasiuna Tungod han pag

Collecting non-English texts: 100%|██████████| 189/189 [00:01<00:00, 107.24it/s]


Language: Xhosa-Latin1
Sample text:
INKCAZO - JIKELELE NGEEMFANELO ZOLUNTU ISINGENISO Njengoko iimfanelo zesidima soluntu semvelo kunye...


Language: Yagua-Latin1
Sample text:
TUCHODA TITAJU NIJYANVAJYU VURYATIDYE VICHASARA SAMIRYA VARIY Rijechipiyajada sesionmu ttJau jiryatiy...


Language: Yao-Latin1
Sample text:
Mkamulano Wa Ilambo Yoscope Pa Ufulu Wa Chipago Wa Wandu Malowe Gandanda Aga ni maufulu gakasapagwa...


Language: Yapese-Latin1
Sample text:
MATT  AWEN GUBIN E GIDII NI NGAN NANG MORNGAAGEN Bochan ni ngan nang ni gubine gidii mabee mabay matt...


Language: Zapoteco-Latin1
Sample text:
Kuan mbe ' s men yuy par diti Kuan na mbes par lu ' Xa ndob kan lu ngak , xa yent ndob xa yuy kinó riet...


Language: Zapoteco-SanLucasQuiavini-Latin1
Sample text:
Declarasyoony x : te : e ' n Deree ' ch x : te : e ' Ra ' ta ' Bu : unny Introducsyoony Zi ' cy na :...


Language: Zhuang-Latin1
Sample text:
SEIQGYAIQ YINZGENZ SENHYENZ VAH BAIHNAJ Aenvih roxnyinh vunz lajmbwn bonilaiz couh m


Extracting features: 100%|██████████| 2560/2560 [00:00<00:00, 28876.53it/s]
Extracting features: 100%|██████████| 640/640 [00:00<00:00, 31864.80it/s]
Extracting features: 100%|██████████| 2560/2560 [00:00<00:00, 30836.13it/s]
Extracting features: 100%|██████████| 640/640 [00:00<00:00, 32430.32it/s]
Extracting features: 100%|██████████| 2560/2560 [00:00<00:00, 27758.11it/s]
Extracting features: 100%|██████████| 640/640 [00:00<00:00, 29239.11it/s]
Extracting features: 100%|██████████| 2560/2560 [00:00<00:00, 28705.53it/s]
Extracting features: 100%|██████████| 640/640 [00:00<00:00, 25246.22it/s]
Extracting features: 100%|██████████| 2560/2560 [00:00<00:00, 31649.99it/s]
Extracting features: 100%|██████████| 640/640 [00:00<00:00, 23424.91it/s]


Cross-validation scores: [0.9390625 0.9390625 0.94375   0.9421875 0.9578125]
Average CV score: 0.9444 (+/- 0.0139)


Extracting features: 100%|██████████| 3200/3200 [00:00<00:00, 29108.73it/s]
Extracting features: 100%|██████████| 800/800 [00:00<00:00, 28210.75it/s]


Test score: 0.9625


Extracting features: 100%|██████████| 800/800 [00:00<00:00, 24496.40it/s]



Classification Report:
              precision    recall  f1-score   support

     english       0.97      0.95      0.96       400
 non-english       0.95      0.97      0.96       400

    accuracy                           0.96       800
   macro avg       0.96      0.96      0.96       800
weighted avg       0.96      0.96      0.96       800


Evaluating svm...


Extracting features: 100%|██████████| 2560/2560 [00:00<00:00, 30335.03it/s]
Extracting features: 100%|██████████| 640/640 [00:00<00:00, 13520.34it/s]
Extracting features: 100%|██████████| 2560/2560 [00:00<00:00, 13846.03it/s]
Extracting features: 100%|██████████| 640/640 [00:00<00:00, 14207.37it/s]
Extracting features: 100%|██████████| 2560/2560 [00:00<00:00, 15529.45it/s]
Extracting features: 100%|██████████| 640/640 [00:00<00:00, 32573.17it/s]
Extracting features: 100%|██████████| 2560/2560 [00:00<00:00, 31003.52it/s]
Extracting features: 100%|██████████| 640/640 [00:00<00:00, 13848.23it/s]
Extracting features: 100%|██████████| 2560/2560 [00:00<00:00, 14595.23it/s]
Extracting features: 100%|██████████| 640/640 [00:00<00:00, 15555.07it/s]


Cross-validation scores: [0.7015625 0.6796875 0.7234375 0.7078125 0.7046875]
Average CV score: 0.7034 (+/- 0.0281)


Extracting features: 100%|██████████| 3200/3200 [00:00<00:00, 14541.56it/s]
Extracting features: 100%|██████████| 800/800 [00:00<00:00, 15451.76it/s]


Test score: 0.7063


Extracting features: 100%|██████████| 800/800 [00:00<00:00, 13154.42it/s]



Classification Report:
              precision    recall  f1-score   support

     english       0.64      0.96      0.77       400
 non-english       0.92      0.45      0.61       400

    accuracy                           0.71       800
   macro avg       0.78      0.71      0.69       800
weighted avg       0.78      0.71      0.69       800


Evaluating neural_net...


Extracting features: 100%|██████████| 2560/2560 [00:00<00:00, 13799.75it/s]
Extracting features: 100%|██████████| 640/640 [00:00<00:00, 14367.90it/s]
Extracting features: 100%|██████████| 2560/2560 [00:00<00:00, 15310.87it/s]
Extracting features: 100%|██████████| 640/640 [00:00<00:00, 12398.06it/s]
Extracting features: 100%|██████████| 2560/2560 [00:00<00:00, 15363.22it/s]
Extracting features: 100%|██████████| 640/640 [00:00<00:00, 14762.18it/s]
Extracting features: 100%|██████████| 2560/2560 [00:00<00:00, 14593.78it/s]
Extracting features: 100%|██████████| 640/640 [00:00<00:00, 11935.34it/s]
Extracting features: 100%|██████████| 2560/2560 [00:00<00:00, 14743.85it/s]
Extracting features: 100%|██████████| 640/640 [00:00<00:00, 13414.33it/s]


Cross-validation scores: [0.9921875 0.9953125 0.9984375 0.9984375 1.       ]
Average CV score: 0.9969 (+/- 0.0056)


Extracting features: 100%|██████████| 3200/3200 [00:00<00:00, 14466.11it/s]
Extracting features: 100%|██████████| 800/800 [00:00<00:00, 13474.16it/s]


Test score: 0.9988


Extracting features: 100%|██████████| 800/800 [00:00<00:00, 13793.93it/s]



Classification Report:
              precision    recall  f1-score   support

     english       1.00      1.00      1.00       400
 non-english       1.00      1.00      1.00       400

    accuracy                           1.00       800
   macro avg       1.00      1.00      1.00       800
weighted avg       1.00      1.00      1.00       800


Best model: neural_net
CV Score: 0.9969
Test Score: 0.9988

Testing with diverse examples:


Extracting features: 100%|██████████| 1/1 [00:00<00:00, 977.01it/s]
Extracting features: 100%|██████████| 1/1 [00:00<00:00, 1069.16it/s]



Test Example 1:
Length: 16 characters
Text: This is English.
Classification: non-english
Confidence scores:
  english: 0.4752
  non-english: 0.5248


Extracting features: 100%|██████████| 1/1 [00:00<00:00, 1004.38it/s]
Extracting features: 100%|██████████| 1/1 [00:00<00:00, 1001.74it/s]



Test Example 2:
Length: 38 characters
Text: Testing 123! Is this working properly?
Classification: non-english
Confidence scores:
  english: 0.4376
  non-english: 0.5624


Extracting features: 100%|██████████| 1/1 [00:00<00:00, 1002.22it/s]
Extracting features: 100%|██████████| 1/1 [00:00<?, ?it/s]



Test Example 3:
Length: 131 characters
Text: This text includes numbers (123), punctuation marks (!?.,), 
            and some UPPERCASE words. It should test various features.
Classification: non-english
Confidence scores:
  english: 0.1599
  non-english: 0.8401


Extracting features: 100%|██████████| 1/1 [00:00<00:00, 1056.23it/s]
Extracting features: 100%|██████████| 1/1 [00:00<00:00, 983.89it/s]



Test Example 4:
Length: 56 characters
Text: Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Classification: non-english
Confidence scores:
  english: 0.4532
  non-english: 0.5468


Extracting features: 100%|██████████| 1/1 [00:00<?, ?it/s]
Extracting features: 100%|██████████| 1/1 [00:00<?, ?it/s]



Test Example 5:
Length: 43 characters
Text: English with émbelishments and école words.
Classification: non-english
Confidence scores:
  english: 0.3619
  non-english: 0.6381


Extracting features: 100%|██████████| 1/1 [00:00<00:00, 707.54it/s]
Extracting features: 100%|██████████| 1/1 [00:00<00:00, 1000.07it/s]


Test Example 6:
Length: 55 characters
Text: The API endpoint returns JSON data with UTF-8 encoding.
Classification: non-english
Confidence scores:
  english: 0.3649
  non-english: 0.6351





# Model Persistence

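Optional code for model serialization:
- Saves the best-performing pipeline (`system.best_pipeline`) to `language_classifier.pkl` using pickle
- Loads the saved pipeline back from that file for later use
- Is commented out by default so the model is only written to disk when needed

Note: Uncomment the code below to save or reload the model for deployment.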

In [5]:
# import pickle
# # `system` is assumed to be the classifier object created in an earlier cell, so it needs no import

# # Save model
# with open('language_classifier.pkl', 'wb') as f:
#     pickle.dump(system.best_pipeline, f)

# # Load model
# with open('language_classifier.pkl', 'rb') as f:
#     loaded_model = pickle.load(f)
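
The save and load steps above can be wrapped in two small helpers. The following is a minimal sketch, not the notebook's own code: `save_model`, `load_model`, and the `stand_in_model` dictionary are hypothetical names introduced for illustration, and the dictionary stands in for the trained pipeline so the example runs without retraining.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize `model` to `path` with pickle (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a pickled object from `path`; only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical stand-in for the trained pipeline (assumption for this sketch).
stand_in_model = {"name": "neural_net", "labels": ["english", "non-english"]}

# Round-trip through a temporary directory so no file is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "language_classifier.pkl"
    save_model(stand_in_model, model_path)
    restored_model = load_model(model_path)

# The restored object should compare equal to the one that was saved.
print(restored_model == stand_in_model)
```

With the real classifier, the call would presumably be `save_model(system.best_pipeline, 'language_classifier.pkl')`. Loading a pickled scikit-learn pipeline generally requires that any custom transformer classes are importable and that the scikit-learn version matches the one used when saving.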