# Language Detection with Machine Learning 

Md Khalid Siddiqui

## Install the necessary dependencies

In [None]:
# ! pip install pandas scikit-learn numpy
# (warnings is part of the Python standard library and is not installed with pip)

## Reading the dataset

### Information on dataset:
This compact dataset, available on Kaggle (a data science competition platform), is designed for language detection. It contains text samples in 17 languages, which is enough to train a natural language processing (NLP) model that predicts the language of a given text.

**List of Languages:**

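English, Malayalam, Hindi, Tamil, Kannada, French, Spanish, Portuguese, Italian, Russian, Swedish, Dutch, Arabic, Turkish, German, Danish, Greek.

Note that the dataset's own labels spell two of these as `Portugeese` and `Sweedish`, which is why they appear that way in the classification report.

**URL to download the dataset:** [Kaggle](https://www.kaggle.com/code/suryadeepti/language-detction)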

In [21]:
# ignore warnings
import warnings
warnings.simplefilter("ignore")

# read the dataset using pandas
# (a forward slash works on every OS and avoids the "\L" escape sequence in the path)
import pandas as pd
dataset = pd.read_csv("dataset/Language Detection.csv")

# show the data in rows 7000 - 7049
dataset[7000:7050]

Unnamed: 0,Text,Language
7000,lad mig afslutte.,Danish
7001,hold et øjeblik.,Danish
7002,forslag.,Danish
7003,"hvad siger du, vi går i biografen?",Danish
7004,hvad med at have pizza til middag i aften?,Danish
7005,det ville være rart.,Danish
7006,lyder godt for mig.,Danish
7007,det er jeg ikke sikker på.,Danish
7008,"nej, det tror jeg ikke.",Danish
7009,laver planer.,Danish


## Define dependent and independent variables

In [26]:
texts = dataset["Text"] # independent feature
labels = dataset["Language"] # dependent variable or Label


## Split dataset into Training and test data

We split the data into training and testing sets using train_test_split from scikit-learn, so that the model can be evaluated on data it has not seen during training. The test_size parameter is set to 0.2, giving an 80-20 split, and random_state=42 makes the split reproducible.

In [27]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)
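
One optional refinement, which this notebook does not use, is a stratified split. The sketch below runs on hypothetical toy data (`toy_texts`, `toy_labels`) and passes `stratify` so that each class keeps roughly the same share in the training and test sets. This may help when some languages have very few samples.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: three classes of unequal size (70 / 20 / 10)
toy_texts = [f"sample {i}" for i in range(100)]
toy_labels = ["A"] * 70 + ["B"] * 20 + ["C"] * 10

# stratify=toy_labels keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

test_counts = Counter(y_te)
print(test_counts)
```

Without `stratify`, the split is purely random, so a rare class can end up under-represented in the test set.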

## CountVectorizer

In the next step we use *CountVectorizer* for feature extraction. This converts the text data into a matrix of token counts, representing the frequency of words in the text. It's a common choice for text classification tasks.

The *fit_transform* operation on the training data lets the CountVectorizer learn the vocabulary (the unique words) of the training set and convert each text into a row of word counts, with one column per vocabulary word.

The *transform* method is then applied to the test data using the same CountVectorizer instance. This represents the test data with the same vocabulary, and therefore the same columns, as the training data. Words in the test data that were not seen during training are ignored.

In [28]:
# Create a CountVectorizer to convert text into numerical features
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

## Multinomial Naive Bayes (NB) Classifier:

Multinomial Naive Bayes is chosen for its simplicity, speed, and effectiveness on discrete data, which matches text represented as word counts. It assumes that features are independent given the class. That assumption rarely holds exactly for natural language, but the model still tends to work well for language detection, where the presence of specific words can strongly indicate a language.

## Fit Classifier:

The model is trained on the transformed training data, allowing it to learn the statistical relationships between word frequencies and language labels. The simplicity of the MultinomialNB algorithm makes it a good starting point, especially when dealing with text data.

In [29]:
# Create and train the Multinomial Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train_vec, y_train)
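
As a rough illustration of what the classifier learns, the sketch below trains MultinomialNB on four hypothetical sentences and prints the class probabilities for a short input. The toy data and the names `toy_vec` and `toy_clf` are assumptions made for this example, and the default smoothing (`alpha=1.0`) is used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: two sentences per language
toy_texts = ["the cat sat", "the dog ran", "le chat dort", "le chien court"]
toy_labels = ["English", "English", "French", "French"]

toy_vec = CountVectorizer()
toy_clf = MultinomialNB()  # default alpha=1.0, i.e. add-one (Laplace) smoothing
toy_clf.fit(toy_vec.fit_transform(toy_texts), toy_labels)

# Words seen more often in one class pull the probability toward that class
probs = toy_clf.predict_proba(toy_vec.transform(["le chat"]))[0]
for language, p in zip(toy_clf.classes_, probs):
    print(f"{language}: {p:.2f}")
```

Smoothing keeps the probability of a word that never occurred in a class above zero, so a single unfamiliar word does not rule a language out.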


## Make Predictions:

The trained model is applied to the test set to predict the languages of the unseen texts. This step tests the model's ability to generalize its learned patterns to new instances.


In [30]:
# Make predictions on the test set
predictions = classifier.predict(X_test_vec)

## Evaluate Model:

Model evaluation is crucial for understanding its performance. Accuracy provides an overall measure of correctness, while metrics like precision, recall, and F1-score offer insights into the model's performance on a per-language basis. This detailed evaluation helps identify potential weaknesses and areas for improvement.

### Accuracy:

Accuracy is a fundamental metric that measures the overall correctness of the model's predictions. It is calculated as the ratio of correctly predicted instances to the total number of instances. A higher accuracy generally indicates better model performance.

### Classification Report:

The classification report provides a more detailed breakdown of the model's performance for each class (language). It includes the following metrics:

### Precision: 

Precision is the ratio of correctly predicted positive observations (true positives) to the total predicted positives (true positives + false positives). In the context of language detection, precision measures how many of the predicted instances for a particular language are actually correct.

### Recall (Sensitivity): 

Recall is the ratio of correctly predicted positive observations to all actual positives (true positives + false negatives). In language detection, recall indicates how many instances of a particular language were correctly identified by the model.

### F1-Score: 

The F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall, giving a single metric that considers both false positives and false negatives.

### Support: 

Support is the number of actual occurrences of the class in the specified dataset. It provides context for the other metrics, especially in imbalanced datasets where some classes may have fewer instances.
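
The definitions above can be checked with a few lines of arithmetic. The counts `tp`, `fp`, and `fn` below are hypothetical and chosen only to illustrate the formulas. The last line applies the F1 formula to the rounded precision (0.92) and recall (1.00) reported for English in the results.

```python
# Hypothetical counts for one language (illustrative, not from the notebook)
tp, fp, fn = 90, 10, 30  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # share of predictions for this language that are right
recall = tp / (tp + fn)     # share of this language's texts that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
support = tp + fn           # number of actual instances of the class

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")

# F1 from the rounded English precision and recall in the report
english_f1 = 2 * 0.92 * 1.00 / (0.92 + 1.00)
print(f"English F1 from rounded values: {english_f1:.2f}")
```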


In [31]:
# Evaluate the model
from sklearn.metrics import accuracy_score, classification_report
accuracy = accuracy_score(y_test, predictions)
report = classification_report(y_test, predictions)

## Results:

The final step involves reviewing the accuracy and classification report. Understanding these results guides further model refinement, feature engineering, or the exploration of more complex models if needed.

In [33]:
print(f"Accuracy: {accuracy:.2f}")
print(f"Classification Report:\n{report}")

Accuracy: 0.98
Classification Report:
              precision    recall  f1-score   support

      Arabic       1.00      0.98      0.99       106
      Danish       0.97      0.96      0.97        73
       Dutch       0.99      0.97      0.98       111
     English       0.92      1.00      0.96       291
      French       0.99      0.99      0.99       219
      German       1.00      0.97      0.98        93
       Greek       1.00      0.97      0.99        68
       Hindi       1.00      1.00      1.00        10
     Italian       1.00      0.99      1.00       145
     Kannada       1.00      1.00      1.00        66
   Malayalam       1.00      0.98      0.99       121
  Portugeese       0.99      0.98      0.99       144
     Russian       1.00      0.99      0.99       136
     Spanish       0.99      0.97      0.98       160
    Sweedish       1.00      0.98      0.99       133
       Tamil       1.00      0.99      0.99        87
     Turkish       1.00      0.94      0.97

### Explanation of results: 
For English, the model achieved a precision of 92%, recall of 100%, and an F1-score of 96%. This indicates that when the model predicted English, it was correct 92% of the time, and it captured 100% of the actual English instances.

The overall accuracy of the model is 98%, representing the proportion of correctly predicted instances across all languages.

The macro avg and weighted avg rows of the classification report average the metrics across all classes: macro avg gives every class equal weight, while weighted avg weights each class by its support.
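
The difference between the two averages shows up most clearly on imbalanced data. The sketch below uses hypothetical labels (`y_true`, `y_pred`) with a large class A and a small class B, and computes both averages of the F1-score with scikit-learn's `f1_score`.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: class A has 8 instances, class B only 2
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]  # one B is misclassified as A

macro = f1_score(y_true, y_pred, average="macro")        # plain mean over classes
weighted = f1_score(y_true, y_pred, average="weighted")  # mean weighted by support

print(f"macro F1:    {macro:.3f}")
print(f"weighted F1: {weighted:.3f}")
```

Here the weak result on the small class lowers the macro average more than the weighted one, because the weighted average is dominated by the larger class.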

# Running Predictions on the model

## Saving the Model

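First we save the trained model and the fitted vectorizer to pickle (.pkl) files, which are then loaded for prediction. Both are needed, because the model only understands input that was transformed with the same vocabulary.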

In [35]:
# Save the model and vectorizer to pickle files
import pickle

# saving the model

with open('language_detection_model.pkl', 'wb') as model_file:
    pickle.dump(classifier, model_file)
    
# saving the vectorizer

with open('count_vectorizer.pkl', 'wb') as vectorizer_file:
    pickle.dump(vectorizer, vectorizer_file)
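
An alternative, not used in this notebook, is to bundle the vectorizer and the classifier into a single scikit-learn `Pipeline`, so that only one object has to be saved and loaded. The sketch below is a minimal illustration on hypothetical toy data. It pickles to an in-memory bytes object with `pickle.dumps` instead of writing a file.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: two sentences per language
toy_texts = ["the cat sat", "the dog ran", "le chat dort", "le chien court"]
toy_labels = ["English", "English", "French", "French"]

# The pipeline vectorizes raw text and classifies it in one step
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(toy_texts, toy_labels)

# Round trip through pickle, in memory for the sake of the example
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

print(restored.predict(["le chat"])[0])
```

As with any pickle file, a saved model should only be loaded from a trusted source, ideally with the same scikit-learn version that created it.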

## Use the Exported Model for Inference: 

### Loading the Model and Vectorizer

In [39]:
import pickle

# Load the pre-trained model
with open('language_detection_model.pkl', 'rb') as model_file:
    model = pickle.load(model_file)

# Load the corresponding CountVectorizer
with open('count_vectorizer.pkl', 'rb') as vectorizer_file:
    vectorizer = pickle.load(vectorizer_file)

### Define a function to predict language

In [40]:
# prediction function    
def predict_language(user_input):
    # Transform user input using the loaded CountVectorizer
    user_input_vec = vectorizer.transform([user_input])

    # Predict the language using the loaded model
    predicted_language = model.predict(user_input_vec)[0]

    return predicted_language

### Run Inference

In [41]:
user_text = "Bonjour, comment ça va?"
predicted_language = predict_language(user_text)
print(f"The predicted language is: {predicted_language}")

The predicted language is: French
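
The prediction function always returns one of the known languages, even for input that shares no words with the training vocabulary. One possible safeguard is sketched below on hypothetical toy data. The function name `predict_language_or_none` and the `min_probability` threshold of 0.6 are assumptions made for illustration and are not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the real training set
toy_texts = ["the cat sat", "the dog ran", "le chat dort", "le chien court"]
toy_labels = ["English", "English", "French", "French"]

toy_vec = CountVectorizer()
toy_model = MultinomialNB().fit(toy_vec.fit_transform(toy_texts), toy_labels)

def predict_language_or_none(user_input, min_probability=0.6):
    """Return the predicted language, or None when the input gives no usable signal."""
    vec = toy_vec.transform([user_input])
    if vec.nnz == 0:
        # None of the input words are in the vocabulary
        return None
    probs = toy_model.predict_proba(vec)[0]
    best = probs.argmax()
    if probs[best] < min_probability:
        # The model is not confident enough in its best guess
        return None
    return toy_model.classes_[best]

print(predict_language_or_none("le chat"))
print(predict_language_or_none("zzz qqq"))
```

A suitable threshold depends on the data and would need to be tuned on a validation set.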
