## Language Identification  
This project implements a machine-learning language identification system that automatically detects the language of a given sentence. It covers four languages: English, Urdu, Russian, and Chinese.

The dataset is WiLI-2018, obtained from Hugging Face. It contains 235 different languages; the cell below filters it down to the four target languages.

In [2]:
import pandas as pd
import os

# --- Configuration ---
# Define the numerical labels you want to keep
desired_label_numbers = [41, 179, 30, 24]

# Define the mapping from numerical labels to language names
label_to_language_map = {
    41: 'English',
    179: 'Chinese',
    30: 'Urdu',
    24: 'Russian'
}

# Define your input file paths (assuming they are in the current directory)
train_input_file = 'train.parquet'
test_input_file = 'test.parquet'

# Define your output file names for the new filtered and mapped files
train_output_file = 'train_filtered_languages.parquet'
test_output_file = 'test_filtered_languages.parquet'

# --- 1. Load the Parquet files ---

df_train = pd.read_parquet(train_input_file)
df_test = pd.read_parquet(test_input_file)
print("Original train and test data loaded successfully.")
print(f"Original train shape: {df_train.shape}")
print(f"Original test shape: {df_test.shape}")



# --- 2. Filter the DataFrames by desired labels ---
print("\nFiltering DataFrames...")
df_train_filtered = df_train[df_train['label'].isin(desired_label_numbers)].copy()
df_test_filtered = df_test[df_test['label'].isin(desired_label_numbers)].copy()

print(f"Train data filtered to {df_train_filtered.shape[0]} rows.")
print(f"Test data filtered to {df_test_filtered.shape[0]} rows.")

# --- 3. Convert numerical labels to language names ---
print("Converting numerical labels to language names...")
df_train_filtered['language'] = df_train_filtered['label'].map(label_to_language_map)
df_test_filtered['language'] = df_test_filtered['label'].map(label_to_language_map)


# Keep only the text and the mapped language name (the numeric 'label' column is dropped)
df_train_filtered = df_train_filtered[['sentence', 'language']]
df_test_filtered = df_test_filtered[['sentence', 'language']]

# --- 4. Save the new DataFrames to Parquet files ---
print(f"\nSaving filtered and mapped train data to '{train_output_file}'...")
df_train_filtered.to_parquet(train_output_file, index=False) # index=False prevents writing the DataFrame index as a column

print(f"Saving filtered and mapped test data to '{test_output_file}'...")
df_test_filtered.to_parquet(test_output_file, index=False)

print("\nProcess complete!")
print(f"New train file saved: {train_output_file}")
print(f"New test file saved: {test_output_file}")

# Display a sample of the new DataFrames to verify
print("\n--- Sample of Filtered Train DataFrame (first 5 rows) ---")
print(df_train_filtered.head())
print("\n--- Value counts for 'language' in Filtered Train DataFrame ---")
print(df_train_filtered['language'].value_counts())

print("\n--- Sample of Filtered Test DataFrame (first 5 rows) ---")
print(df_test_filtered.head())
print("\n--- Value counts for 'language' in Filtered Test DataFrame ---")
print(df_test_filtered['language'].value_counts())

Original train and test data loaded successfully.
Original train shape: (117500, 2)
Original test shape: (117500, 2)

Filtering DataFrames...
Train data filtered to 2000 rows.
Test data filtered to 2000 rows.
Converting numerical labels to language names...

Saving filtered and mapped train data to 'train_filtered_languages.parquet'...
Saving filtered and mapped test data to 'test_filtered_languages.parquet'...

Process complete!
New train file saved: train_filtered_languages.parquet
New test file saved: test_filtered_languages.parquet

--- Sample of Filtered Train DataFrame (first 5 rows) ---
                                              sentence language
44   برقی بار (electric charge) تمام زیرجوہری ذرات ...     Urdu
67   胡赛尼本人和小说的主人公阿米尔一样，都是出生在阿富汗首都喀布尔，少年时代便离开了这个国家。胡...  Chinese
219  Занимает пятое место в диптихе автокефальных п...  Russian
269  In 1978 Johnson was awarded an American Instit...  English
306  Bussy-Saint-Georges has built its identity on ...  English

--- Value coun

In [3]:
import pandas as pd

# Define the paths to your new files
train_new_file = 'train_filtered_languages.parquet'
test_new_file = 'test_filtered_languages.parquet'


df_train = pd.read_parquet(train_new_file)
df_test = pd.read_parquet(test_new_file)


In [4]:
df_test.head()

Unnamed: 0,sentence,language
0,大都会区有它自己的当地路边快餐口味，包括瓦达帕夫（蓬松面包劈开一半，填入锅贴）、潘尼普里（油...,Chinese
1,16 апреля 2009 года в Шатойском районе произош...,Russian
2,Anton (or Antonius) Maria Schyrleus (also Schy...,English
3,اس وقت کے گورنر ڈی وٹ کلنٹن نے دریائے ہڈسن کو ...,Urdu
4,近幾年來，由於氣候變遷對人類帶來的警訊，讓各國政府紛紛思考如何減碳節能。為減少對化石能源的依...,Chinese


In [5]:
df_train.head()

Unnamed: 0,sentence,language
0,برقی بار (electric charge) تمام زیرجوہری ذرات ...,Urdu
1,胡赛尼本人和小说的主人公阿米尔一样，都是出生在阿富汗首都喀布尔，少年时代便离开了这个国家。胡...,Chinese
2,Занимает пятое место в диптихе автокефальных п...,Russian
3,In 1978 Johnson was awarded an American Instit...,English
4,Bussy-Saint-Georges has built its identity on ...,English


### Data Preprocessing and Cleaning
Preprocessing is an important step in language detection: it reduces noise and can improve model performance. A custom `clean_text` function cleans each sentence by:

- Removing ASCII digits (`0-9`)
- Removing punctuation and special characters
- Collapsing runs of whitespace into a single space
- Stripping leading and trailing whitespace


In [6]:
import pandas as pd
import re

def clean_text(text):
    text = re.sub(r'[0-9]', '', text)  # Remove digits
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation and special characters
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
    text = text.strip()  # Remove leading and trailing whitespace
    return text

# Apply cleaning and replace original sentence column
df_train['sentence'] = df_train['sentence'].apply(clean_text)
df_test['sentence'] = df_test['sentence'].apply(clean_text)

# Optional: Preview cleaned data
print(df_train['sentence'].head(5))


0    برقی بار electric charge تمام زیرجوہری ذرات کی...
1    胡赛尼本人和小说的主人公阿米尔一样都是出生在阿富汗首都喀布尔少年时代便离开了这个国家胡赛尼直...
2    Занимает пятое место в диптихе автокефальных п...
3    In Johnson was awarded an American Institute o...
4    BussySaintGeorges has built its identity on a ...
Name: sentence, dtype: object
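
The preview above shows a side effect of the cleaning rules: punctuation is deleted rather than replaced with a space, so `Bussy-Saint-Georges` becomes `BussySaintGeorges`. In addition, `[0-9]` matches only ASCII digits. The sketch below reproduces `clean_text` and contrasts it with a hypothetical variant, `clean_text_spaced` (not part of the original notebook), that substitutes a space for punctuation and uses `\d` to also drop non-ASCII digits such as the Extended Arabic-Indic digits that can appear in Urdu text. Whether either change affects accuracy is untested here.

```python
import re

def clean_text(text):
    """Cleaning function as defined in the notebook."""
    text = re.sub(r'[0-9]', '', text)    # Remove ASCII digits only
    text = re.sub(r'[^\w\s]', '', text)  # Delete punctuation (no space inserted)
    text = re.sub(r'\s+', ' ', text)     # Collapse whitespace runs
    return text.strip()

def clean_text_spaced(text):
    """Hypothetical variant: space-separating punctuation, Unicode-aware digits."""
    text = re.sub(r'\d', '', text)        # \d matches any Unicode decimal digit
    text = re.sub(r'[^\w\s]', ' ', text)  # Replace punctuation with a space
    text = re.sub(r'\s+', ' ', text)      # Collapse whitespace runs
    return text.strip()

sample = "Bussy-Saint-Georges, 1978!"
print(clean_text(sample))         # BussySaintGeorges
print(clean_text_spaced(sample))  # Bussy Saint Georges

urdu_digits = "۱۲۳ abc"
print(clean_text(urdu_digits))         # ۱۲۳ abc  (non-ASCII digits survive)
print(clean_text_spaced(urdu_digits))  # abc
```

Because character n-grams are used downstream, joined words still produce mostly the same short n-grams, so the difference may be small in practice.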


In [7]:
df_test.head()

Unnamed: 0,sentence,language
0,大都会区有它自己的当地路边快餐口味包括瓦达帕夫蓬松面包劈开一半填入锅贴潘尼普里油炸crêpe...,Chinese
1,апреля года в Шатойском районе произошёл бой м...,Russian
2,Anton or Antonius Maria Schyrleus also Schyrl ...,English
3,اس وقت کے گورنر ڈی وٹ کلنٹن نے دریائے ہڈسن کو ...,Urdu
4,近幾年來由於氣候變遷對人類帶來的警訊讓各國政府紛紛思考如何減碳節能為減少對化石能源的依賴性有...,Chinese


### Embeddings using TF-IDF
To convert the text into numeric features, TF-IDF (Term Frequency-Inverse Document Frequency) vectors were extracted. Character-level patterns tend to be more effective than word-level ones for language detection, so character n-grams of length 1 to 5 were used, with the vocabulary capped at 10,000 features (`max_features=10000`).


In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Use character-level n-grams (works best for language detection)
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(1, 5), max_features=10000)

# Fit on training set and transform both train and test
X_train = vectorizer.fit_transform(df_train['sentence'])
X_test = vectorizer.transform(df_test['sentence'])

# Target labels
y_train = df_train['language']
y_test = df_test['language']
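
To make the `analyzer='char'` setting concrete, the following sketch enumerates character n-grams in plain Python. The helper `char_ngrams` is a hypothetical illustration, not scikit-learn's internal code; it only approximates what the vectorizer counts before TF-IDF weighting is applied (by default the vectorizer also lowercases the text first).

```python
def char_ngrams(text, min_n=1, max_n=5):
    """Return all character n-grams of length min_n..max_n, shortest first."""
    text = text.lower()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the string
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams

print(char_ngrams("the", 1, 3))
# ['t', 'h', 'e', 'th', 'he', 'the']

# Text in different scripts shares no n-grams at all, which is one reason
# character features separate these four languages so easily.
latin = set(char_ngrams("hello"))
cyrillic = set(char_ngrams("привет"))
print(latin & cyrillic)  # set()
```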


### Modeling and Evaluation 
A Multinomial Naive Bayes model was selected for training due to its effectiveness and speed on text classification tasks.
The model was evaluated using accuracy and a classification report, which includes precision, recall, and F1-score for each language.


In [12]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

# Initialize and train the model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predict
y_pred = nb_model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Save the model and vectorizer for future use
import joblib  
# Save the model
joblib.dump(nb_model, 'language_detection_model.pkl')  
# Save the vectorizer
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')
# Load the model and vectorizer
nb_model_loaded = joblib.load('language_detection_model.pkl')
vectorizer_loaded = joblib.load('tfidf_vectorizer.pkl')



Accuracy: 0.991

Classification Report:
               precision    recall  f1-score   support

     Chinese       1.00      0.98      0.99       500
     English       0.97      1.00      0.98       500
     Russian       1.00      1.00      1.00       500
        Urdu       1.00      0.99      0.99       500

    accuracy                           0.99      2000
   macro avg       0.99      0.99      0.99      2000
weighted avg       0.99      0.99      0.99      2000
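
The report's columns can be computed by hand from the true and predicted labels. The sketch below does so in plain Python on a made-up five-sentence example; `per_class_metrics` and the toy labels are illustrative assumptions, not the notebook's predictions. One Chinese sentence mislabelled as English lowers Chinese recall and English precision, a pattern consistent with the report above, where those two figures are the lowest.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Compute (precision, recall, f1) per label from paired label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed an example of the true label
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

# Made-up labels for illustration only
y_true_toy = ["English", "English", "Chinese", "Chinese", "Urdu"]
y_pred_toy = ["English", "English", "English", "Chinese", "Urdu"]

accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print("accuracy:", accuracy)  # accuracy: 0.8
for label, (p, r, f1) in per_class_metrics(y_true_toy, y_pred_toy).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# Chinese: precision=1.00 recall=0.50 f1=0.67
# English: precision=0.67 recall=1.00 f1=0.80
# Urdu: precision=1.00 recall=1.00 f1=1.00
```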



### Validating

In [10]:
text = "Hello, how are you?"
# Clean the input text
cleaned_text = clean_text(text)
# Transform the cleaned text into the same feature space
X_input = vectorizer.transform([cleaned_text])
# Predict the language
predicted_language = nb_model.predict(X_input)
print(f"The predicted language for the input text is: {predicted_language[0]}")

The predicted language for the input text is: English


In [11]:
text = "你好，你好吗？"
# Clean the input text 
cleaned_text = clean_text(text)
# Transform the cleaned text into the same feature space
X_input = vectorizer.transform([cleaned_text])
# Predict the language
predicted_language = nb_model.predict(X_input)
print(f"The predicted language for the input text is: {predicted_language[0]}")

The predicted language for the input text is: Chinese


In [None]:
# Helper that predicts the language of one sentence using the reloaded model and vectorizer
def predict_language(sentence):
    # Clean the input sentence
    cleaned_sentence = clean_text(sentence)
    # Transform the sentence using the loaded vectorizer
    X_new = vectorizer_loaded.transform([cleaned_sentence])
    # Predict the language
    predicted_language = nb_model_loaded.predict(X_new)
    return predicted_language[0]




In [15]:
# Example usage
example_sentence = "Hello, how are you?"
predicted_language = predict_language(example_sentence)
print(f"The predicted language for the example sentence is: {predicted_language}")

The predicted language for the example sentence is: English


In [16]:
example_sentence = "你好，你好吗？"
predicted_language = predict_language(example_sentence)
print(f"The predicted language for the example sentence is: {predicted_language}")
# Example usage with a different sentence  
example_sentence = "Привет, как дела?"
predicted_language = predict_language(example_sentence)
print(f"The predicted language for the example sentence is: {predicted_language}")

The predicted language for the example sentence is: Chinese
The predicted language for the example sentence is: Russian
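
The notebook saves the model and the vectorizer as two separate files, which must always be loaded and used together. One common alternative is to wrap both steps in a scikit-learn `Pipeline`, so that a single object handles vectorizing and predicting and can be saved with one `joblib.dump` call. The sketch below uses a tiny made-up training set and the hypothetical name `language_pipeline`; it is not the notebook's trained model, and raw input would still need to pass through `clean_text` first to match how the real model was trained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up training set, for illustration only
toy_sentences = [
    "hello how are you",
    "this is a short test",
    "привет как дела",
    "это короткий тест",
]
toy_labels = ["English", "English", "Russian", "Russian"]

# Chain the vectorizer and the classifier into one estimator
language_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, 5))),
    ("nb", MultinomialNB()),
])
language_pipeline.fit(toy_sentences, toy_labels)

# The pipeline accepts raw strings and applies both steps in order
predictions = language_pipeline.predict(["hello", "привет"])
print(list(predictions))
```

With this layout there is no way to pair a model with the wrong vectorizer, at the cost of retraining both steps together whenever either one changes.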
