# Language Classifier Model

---

This is an ML model created by Keshav Ghai (an aspiring AI/ML dev).
It is a basic language classifier that distinguishes between three languages: English, Hindi and Punjabi. The model is built with TensorFlow and uses scikit-learn metrics for evaluation. The training script **"model_trainer.py"** reads the training data, runs it through the steps explained below, and produces a trained model plus a few graphs that show how well training went. The model has 100K+ parameters and is trained on a handpicked, cleaned dataset.

## Imports:- 
---

In [None]:
import tensorflow as tf
import os
import seaborn as sns
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import json
import numpy as np


## 1. Loading the Dataset (.txt in current directory)
---


> The dataset is read line by line from **"dataset.txt"**; each line is expected to have the form `text ### label`.

In [None]:
DATA_FILE = "dataset.txt"   # Name of your file in same folder
assert os.path.exists(DATA_FILE), "Dataset file not found!"

texts = []
labels = []

with open(DATA_FILE, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue

        # Expected format: "text ### label"
        if "###" in line:
            # Split on the last "###" so a separator inside the text cannot break unpacking
            text, label = line.rsplit("###", 1)
            texts.append(text.strip())
            labels.append(label.strip())
        else:
            print("Skipping line (wrong format):", line)

# Encode labels
label_to_id = {"english": 0, "hindi": 1, "punjabi": 2}
y = np.array([label_to_id[l.lower()] for l in labels])

print("Loaded samples:", len(texts))
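
To make the expected line format concrete, here is a minimal standalone sketch of parsing a single `text ### label` line. The helper name `parse_line` and the sample strings are hypothetical and not part of `model_trainer.py`. It assumes that lines with an unknown label or an empty text should be skipped rather than raise an error.

```python
LABEL_TO_ID = {"english": 0, "hindi": 1, "punjabi": 2}  # same mapping as in the cell above

def parse_line(line):
    """Return (text, label_id) for a 'text ### label' line, or None if it is malformed."""
    line = line.strip()
    if "###" not in line:
        return None
    # Split on the last separator only, so "###" inside the text is kept as text
    text, label = line.rsplit("###", 1)
    label_id = LABEL_TO_ID.get(label.strip().lower())
    if label_id is None or not text.strip():
        return None
    return text.strip(), label_id

print(parse_line("Hello, how are you? ### English"))
print(parse_line("no separator here"))
```

Returning `None` for bad lines lets the caller decide whether to skip or log them.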

## 2. Character Tokenizer 
---

> The tokenizer is set up and the text is converted into character indices.

In [None]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    char_level=True,
    lower=True,
    filters=""
)
tokenizer.fit_on_texts(texts)

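# Save tokenizer. to_json() already returns a JSON string, so the file holds a
# JSON-encoded string: read it back with json.load() before tokenizer_from_json().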
with open("char_tokenizer.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(tokenizer.to_json(), ensure_ascii=False))

print("Tokenizer saved as char_tokenizer.json")

# Convert text → char indices
sequences = tokenizer.texts_to_sequences(texts)

# Pad sequences
max_len = max(len(seq) for seq in sequences)
X = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_len)
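
As an illustration of what character-level tokenization and padding produce, the following is a simplified pure-Python stand-in, not the Keras implementation. The function names and toy strings are hypothetical. It assumes the behaviour used above: ids start at 1 with the most frequent character first, unseen characters are dropped, and sequences are padded with 0 on the left (the `pad_sequences` default).

```python
from collections import Counter

def build_char_index(texts):
    """Map each lowercased character to an id, most frequent first (ids start at 1)."""
    counts = Counter(ch for t in texts for ch in t.lower())
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def encode(text, char_index):
    """Convert text to ids, skipping characters that are not in the index."""
    return [char_index[ch] for ch in text.lower() if ch in char_index]

def pad_left(seq, max_len):
    """Left-pad with 0; if the sequence is too long, keep only its last max_len ids."""
    if len(seq) > max_len:
        seq = seq[len(seq) - max_len:]
    return [0] * (max_len - len(seq)) + seq

samples = ["aab", "ab"]                          # hypothetical toy data
index = build_char_index(samples)
encoded = [encode(s, index) for s in samples]
longest = max(len(s) for s in encoded)
padded = [pad_left(s, longest) for s in encoded]
print(index)    # {'a': 1, 'b': 2}
print(padded)   # [[1, 1, 2], [0, 1, 2]]
```

Id 0 is reserved for padding, which is why the vocabulary size used by the embedding layer is one larger than the number of distinct characters.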

## 3. Defining the Model's dimensions (FP32, 100K parameters)
---

> The model is created with 6 layers (counting the input layer). The specifications of these layers can be changed later; just make sure the model doesn't overfit.

In [None]:
vocab_size = len(tokenizer.word_index) + 1

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(max_len,), dtype="int32"),
    tf.keras.layers.Embedding(vocab_size, 128),       # char embedding
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax", dtype="float32")
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)

model.summary() # This shows you the model's details
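
The "100K+ parameters" figure can be checked by hand. The sketch below recomputes the trainable parameter count from the layer sizes used above. The function name is hypothetical, and `vocab_size=100` is only an assumed example value, since the real vocabulary size depends on the dataset.

```python
def count_parameters(vocab_size, embed_dim=128, lstm_units=64,
                     dense_units=128, num_classes=3):
    """Return a per-layer breakdown of trainable parameters for the model above."""
    embedding = vocab_size * embed_dim
    # One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias
    one_direction = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
    bilstm = 2 * one_direction                     # forward + backward
    # The Bidirectional layer concatenates both directions: 2 * lstm_units inputs
    dense = (2 * lstm_units) * dense_units + dense_units
    output = dense_units * num_classes + num_classes
    return {"embedding": embedding, "bilstm": bilstm,
            "dense": dense, "output": output,
            "total": embedding + bilstm + dense + output}

breakdown = count_parameters(vocab_size=100)       # assumed vocabulary size
print(breakdown["total"])                          # 128515 for vocab_size=100
```

The dropout layer adds no parameters. Everything except the embedding contributes 115,715 parameters, so the total stays above 100K for any vocabulary size.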

## 4. Training the model (No validation)
---
> The model is trained for 20 epochs with a batch size of 16. (These settings are sensitive, so only change them if you know what you are doing.)

In [None]:
history = model.fit(
    X, y,
    epochs=20,
    batch_size=16
)

model.save("language_classifier.keras")
print("\nModel saved as language_classifier.keras")
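
The script trains on the whole dataset without validation. If you want to check for overfitting, one option is to hold out part of the data. The following is a minimal sketch of a stratified split in plain Python. The function name, the 20% fraction and the toy labels are assumptions for illustration and are not part of the original script.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx), holding out about val_fraction of each class."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)                      # fixed seed for a repeatable split
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = int(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

toy_labels = [0] * 10 + [1] * 10 + [2] * 10       # hypothetical labels
train_idx, val_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx))                # 24 6
```

With NumPy arrays, the indices could then be used as `X[train_idx]`, `y[train_idx]`, and the held-out part passed to `model.fit` through its `validation_data` argument.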

## 5. Graphs
---
> Several graphs are created to visualize how training went: the loss per epoch, the training accuracy, and a confusion matrix. (Good for learning about ML)

### a. Loss Over Epochs:-

In [None]:
plt.figure(figsize=(6,4))
plt.plot(history.history['loss'])
plt.title("Training Loss Over Epochs")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.grid(True)
plt.savefig("loss_graph.png")
plt.close()

### b. Training Accuracy:-

In [None]:
train_acc = history.history['accuracy']

plt.figure(figsize=(6,4))
plt.plot(train_acc, label="Training Accuracy")
plt.title("Accuracy on Training Data")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.grid(True)
plt.savefig("train_accuracy_graph.png")
plt.close()

### c. Confusion Matrix:-

In [None]:
# Predict on the training data (this script does not use a held-out test set)
pred = model.predict(X)
pred_labels = np.argmax(pred, axis=1)

cm = confusion_matrix(y, pred_labels)

plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=["English", "Hindi", "Punjabi"],
            yticklabels=["English", "Hindi", "Punjabi"])
plt.title("Confusion Matrix")
plt.savefig("confusion_matrix.png")
plt.close()

print("Saved: loss_graph.png, train_accuracy_graph.png, confusion_matrix.png")
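
To read the confusion matrix, it helps to see how it is built: rows are the true classes and columns are the predicted classes, which is the convention `confusion_matrix` uses. Below is a small pure-Python sketch with made-up predictions. The function names and toy data are hypothetical.

```python
def confusion_counts(true_ids, pred_ids, num_classes=3):
    """Build a matrix where entry [t][p] counts samples of true class t predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_ids, pred_ids):
        matrix[t][p] += 1
    return matrix

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly (diagonal / row sum)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

true_ids = [0, 0, 1, 1, 2, 2]      # 0 = English, 1 = Hindi, 2 = Punjabi (toy data)
pred_ids = [0, 0, 1, 2, 2, 1]
toy_cm = confusion_counts(true_ids, pred_ids)
print(toy_cm)                      # [[2, 0, 0], [0, 1, 1], [0, 1, 1]]
print(per_class_recall(toy_cm))    # [1.0, 0.5, 0.5]
```

Off-diagonal entries show which languages get mixed up. In this toy example, Hindi and Punjabi are confused with each other while English is always correct.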

## 6. Interactive Testing
---
> This is a basic interactive test for the model: type a sentence and check whether the predicted language is correct. (Remember to test thoroughly if you changed the specifications.)

In [None]:
while True:
    user_inp = input("\nEnter text to classify (or 'quit'): ").strip()
    if user_inp.lower() == "quit":
        break

    seq = tokenizer.texts_to_sequences([user_inp])
    seq = tf.keras.preprocessing.sequence.pad_sequences(seq, maxlen=max_len)

    pred = model.predict(seq)[0]
    idx = np.argmax(pred)

    for lang, id_ in label_to_id.items():
        if id_ == idx:
            print("Predicted Language:", lang.capitalize())
            break
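
The last step of the loop turns the softmax output (one probability per language) into a label. A minimal sketch of that step with a reverse lookup table instead of a loop is shown below. `ID_TO_LABEL`, `decode_prediction` and the sample probabilities are hypothetical names and values used only for illustration.

```python
ID_TO_LABEL = {0: "english", 1: "hindi", 2: "punjabi"}   # inverse of label_to_id

def decode_prediction(probabilities):
    """Return (language, probability) for the highest-scoring class."""
    idx = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return ID_TO_LABEL[idx].capitalize(), probabilities[idx]

language, confidence = decode_prediction([0.1, 0.7, 0.2])  # made-up softmax output
print(language, confidence)                                # Hindi 0.7
```

Printing the probability alongside the label makes it easier to spot inputs the model is unsure about.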