# Programming Language Classifier with BiLSTM + Attention

This project builds a deep learning model to classify code snippets into their corresponding programming languages. It uses a custom architecture combining an embedding layer, a bidirectional LSTM, and an attention mechanism to improve classification accuracy across 10 language classes.

**Dataset**  
The model is trained on a subset of [IBM Project CodeNet](https://developer.ibm.com/exchanges/data/all/project-codenet/), a large-scale dataset of source code files in multiple programming languages.

**Model Objective**  
To learn language-specific patterns in source code using a sequence model, enabling accurate prediction of the programming language for a given snippet.

**Techniques Used**
- Text preprocessing and tokenization of source code
- Label encoding and padding of sequences
- Neural network built with PyTorch
- Evaluation using accuracy, F1-score, and classification report

This notebook contains all steps: data preprocessing, model architecture, training loop, and evaluation.


In [None]:
import re
from pathlib import Path
from collections import Counter

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score, f1_score

# PyTorch
import torch
from torch import nn, optim
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import TensorDataset, DataLoader, random_split

## Data Preprocessing

Each source file is read in full and stored as a `(code_text, language_label)` pair, so line breaks and file structure are preserved. The label is the file extension without the leading dot (e.g., `py`, `c`, `java`).

We tokenize each code snippet using simple word/symbol-level tokenization, transforming the raw text into sequences suitable for neural network input.
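To make the word/symbol-level scheme concrete, here is a small illustration using the regular expression `\w+|[^\s\w]`, the same pattern as the `tokenize` cell in this notebook. The sample strings are made up for the example.

```python
import re

def tokenize(code):
    # \w+ grabs identifiers, keywords and numbers; [^\s\w] grabs one symbol at a time
    return re.findall(r"\w+|[^\s\w]", code)

print(tokenize("x += 1;"))
# ['x', '+', '=', '1', ';']
print(tokenize('print("hi")  # note'))
# ['print', '(', '"', 'hi', '"', ')', '#', 'note']
```

Two consequences are worth keeping in mind: whitespace is discarded, so indentation is not visible to the model, and multi-character operators such as `+=` are split into single symbols.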

The processing pipeline includes:
- Encoding language labels using `LabelEncoder`
- Tokenizing code into vocabulary indices
- Padding sequences to uniform length
- Converting sequences and labels into PyTorch tensors
- Creating a combined dataset and splitting into training and validation sets

In [1]:
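def read_code_files(folder_path):
    """Return a list of (source_text, extension) pairs for every file under folder_path."""
    data = []
    for path in Path(folder_path).rglob("*.*"):
        if not path.is_file():  # skip directories whose names contain a dot
            continue
        # errors="ignore" drops bytes that are not valid UTF-8 instead of raising
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            content = f.read()
        label = path.suffix[1:]  # 'py', 'c', etc.; empty for names like '.gitkeep'
        data.append((content, label))
    return data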
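The labelling rule (suffix minus the leading dot) can be checked in isolation. The sketch below uses a hypothetical helper, `label_from_path`, and invented file names; it suggests why some files can end up with an empty label, which is presumably what the `label != ""` filter in the test cells removes.

```python
from pathlib import PurePosixPath

def label_from_path(path_str):
    # Hypothetical helper: same rule as the reader above (suffix minus the dot)
    return PurePosixPath(path_str).suffix[1:]

# Invented file names, for illustration only
for name in ["train/p001/main.py", "train/p001/Main.java", "train/.gitkeep"]:
    print(repr(name), "->", repr(label_from_path(name)))
```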

In [13]:
df = read_code_files("data/train")

I then turn this list into a DataFrame.

In [14]:
df = pd.DataFrame(df, columns=["code", "label"])
df.sample(5)

| | code | label |
|---|---|---|
| 738 | `#include<stdio.h>\n\n#define SET_MAX 1024\n\n/...` | c |
| 691 | `(function(input) {\n var p = input.replace(/\...` | js |
| 33 | `#include <bits/stdc++.h>\nusing namespace std;...` | cpp |
| 191 | `using System;\nusing System.IO;\nusing System....` | cs |
| 317 | `fn main(){\n loop {\n let nm: Vec<usize> =...` | rs |


Notice that the classes are balanced.

In [15]:
df["label"].value_counts()

label
cpp     90
py      90
cs      90
rs      90
java    90
hs      90
php     90
js      90
c       90
d       90
Name: count, dtype: int64

In [None]:
def tokenize(code):
    return re.findall(r'\w+|[^\s\w]', code)

In [17]:
tokenize(df["code"][0])

['/',
 '/',
 'template',
 '{',
 '{',
 '{',
 '#',
 'include',
 '<',
 'bits',
 '/',
 'stdc',
 '+',
 '+',
 '.',
 'h',
 '>',
 'using',
 'namespace',
 'std',
 ';',
 '/',
 '/',
 '#',
 'define',
 'int',
 'long',
 'long',
 '#',
 'define',
 'GET_MACRO',
 '(',
 'a',
 ',',
 'b',
 ',',
 'c',
 ',',
 'd',
 ',',
 'NAME',
 ',',
 '.',
 '.',
 '.',
 ')',
 'NAME',
 '#',
 'define',
 'REP2',
 '(',
 'i',
 ',',
 'n',
 ')',
 'REP3',
 '(',
 'i',
 ',',
 '0',
 ',',
 'n',
 ')',
 '#',
 'define',
 'REP3',
 '(',
 'i',
 ',',
 'a',
 ',',
 'b',
 ')',
 'REP4',
 '(',
 'i',
 ',',
 'a',
 ',',
 'b',
 ',',
 '1',
 ')',
 '#',
 'define',
 'REP4',
 '(',
 'i',
 ',',
 'a',
 ',',
 'b',
 ',',
 's',
 ')',
 'for',
 '(',
 'll',
 'i',
 '=',
 '(',
 'a',
 ')',
 ';',
 'i',
 '<',
 '(',
 'll',
 ')',
 '(',
 'b',
 ')',
 ';',
 'i',
 '+',
 '=',
 's',
 ')',
 '#',
 'define',
 'RREP2',
 '(',
 'i',
 ',',
 'n',
 ')',
 'RREP3',
 '(',
 'i',
 ',',
 '0',
 ',',
 'n',
 ')',
 '#',
 'define',
 'RREP3',
 '(',
 'i',
 ',',
 'a',
 ',',
 'b',
 ')',
 'for',
 '(',
 

We construct a vocabulary based on the training data, mapping each unique token to an integer index.

In [None]:
all_tokens = []
for code in df['code']:
    tokens = tokenize(code)
    all_tokens.extend(tokens)

# Vocabulary creation
token_freqs = Counter(all_tokens)
vocab = {token: i+2 for i, (token, _) in enumerate(token_freqs.items())}
vocab["<PAD>"] = 0
vocab["<UNK>"] = 1
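The `Counter` above is only used for its keys, so every token gets its own index. A possible variant, sketched below on toy data, uses the counts to drop rare tokens, which then fall back to `<UNK>`. The `build_vocab` helper and its `min_freq` parameter are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1):
    # Indices 0 and 1 are reserved, as in the notebook's vocabulary
    vocab = {"<PAD>": 0, "<UNK>": 1}
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    for token, freq in counts.items():
        if freq >= min_freq:
            vocab[token] = len(vocab)
    return vocab

def tokens_to_ids(tokens, vocab):
    return [vocab.get(t, vocab["<UNK>"]) for t in tokens]

toy_tokens = [["int", "x", ";"], ["int", "y", ";"]]   # invented token lists
toy_vocab = build_vocab(toy_tokens, min_freq=2)
print(toy_vocab)                                      # {'<PAD>': 0, '<UNK>': 1, 'int': 2, ';': 3}
print(tokens_to_ids(["int", "z", ";"], toy_vocab))    # [2, 1, 3]
```

With `min_freq=1` the result is equivalent in spirit to the notebook's vocabulary: every observed token is kept.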

Map the tokens in each code snippet to their corresponding indices in the vocabulary.

In [19]:
def tokens_to_ids(tokens, vocab):
    return [vocab.get(t, vocab["<UNK>"]) for t in tokens]

In [None]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(df["label"])

MAX_LEN = 500
encoded_sequences = []

for code in df["code"]:
    tokens = tokenize(code)
    token_ids = tokens_to_ids(tokens[:MAX_LEN], vocab)
    encoded_sequences.append(torch.tensor(token_ids, dtype=torch.long))

padded_sequences = pad_sequence(encoded_sequences, batch_first=True, padding_value=vocab["<PAD>"])
labels_tensor = torch.tensor(y_encoded, dtype=torch.long)

print(padded_sequences.shape)  # (num_samples, MAX_LEN)
print(labels_tensor.shape)     # (num_samples,)

torch.Size([900, 500])
torch.Size([900])
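Note that `pad_sequence` pads to the longest sequence in the list, not to `MAX_LEN`; the width of 500 above arises because at least one snippet reaches the truncation limit. If a fixed width is needed regardless of the data, a helper along these lines could be used. This is a plain-Python sketch, and `pad_or_truncate` is a hypothetical name.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    # Keep at most max_len ids, then right-pad with pad_id up to max_len
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))

print(pad_or_truncate([5, 6, 7], max_len=5))            # [5, 6, 7, 0, 0]
print(pad_or_truncate([5, 6, 7, 8, 9, 10], max_len=5))  # [5, 6, 7, 8, 9]
```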


In [21]:
padded_sequences

tensor([[   2,    2,    3,  ...,  105,  113,   17],
        [   5,    6,    7,  ...,   13,  231,   17],
        [   5,    6,    7,  ...,   13,  231,   17],
        ...,
        [ 709,   16,   11,  ..., 6300,   17, 6539],
        [ 709,   16,   11,  ...,   40,  393,   99],
        [ 709,   16,   11,  ...,    0,    0,    0]])

In [None]:
dataset = TensorDataset(padded_sequences, labels_tensor)

train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_ds, val_ds = random_split(dataset, [train_size, val_size])

train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=32)
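`random_split` draws the validation indices uniformly at random, so per-class counts in the validation set can vary even though the full dataset is balanced. One alternative, sketched here in plain Python, is a seeded stratified split; the resulting index lists could then be wrapped with `torch.utils.data.Subset`. The function name, the 20% fraction, and the seed are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=0):
    # Group sample indices by class, then carve the same fraction out of each class
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed -> same split on every run
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Toy labels mimicking the notebook: 10 classes x 90 samples
toy_labels = [c for c in range(10) for _ in range(90)]
train_idx, val_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx))  # 720 180
```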

## Model Architecture

The network is a standard sequence-classification pipeline:

- **Embedding Layer**: Transforms token IDs into dense vector representations
- **Bidirectional LSTM**: Captures context from both forward and backward directions
- **Attention Layer**: Assigns dynamic weights to sequence elements to emphasize informative tokens
- **Fully Connected Output Layer**: Maps the attention-aggregated features to one logit per class; the softmax is applied inside the cross-entropy loss, and predictions take the argmax of the logits

The model is trained using Cross-Entropy Loss and the Adam optimizer.
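The softmax does not need to appear in the model's `forward`: `nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally. The following small check, on made-up logits, illustrates the equivalence.

```python
import torch
from torch import nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])   # (batch=2, classes=3), made-up values
targets = torch.tensor([0, 2])

ce = nn.CrossEntropyLoss()(logits, targets)
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(ce, manual))  # True

# argmax over logits equals argmax over softmax probabilities
print(torch.argmax(logits, dim=1).tolist())  # [0, 2]
```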

In [None]:
class CodeClassifierBiLSTM_Attn(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(hidden_dim * 2, 1)  # Attention layer
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        x = self.embedding(x)                           # (B, T, E)
        out, _ = self.lstm(x)                           # (B, T, 2H)
        attn_weights = torch.softmax(self.attn(out), dim=1)  # (B, T, 1), softmax over time steps
        context = torch.sum(attn_weights * out, dim=1)  # (B, 2H)
        context = self.dropout(context)
        return self.fc(context)
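In the `forward` above, the softmax over `dim=1` includes padded positions, so `<PAD>` steps can receive attention weight. A possible refinement, shown as a sketch and not part of the trained model, masks those positions before the softmax. It assumes padding id 0 and at least one real token per sequence (a fully padded row would yield NaNs). The function name `masked_attention_pool` is hypothetical.

```python
import torch

def masked_attention_pool(outputs, scores, pad_mask):
    """outputs: (B, T, D); scores: (B, T, 1); pad_mask: (B, T), True at padding."""
    scores = scores.masked_fill(pad_mask.unsqueeze(-1), float("-inf"))
    weights = torch.softmax(scores, dim=1)         # padded steps get weight 0
    context = torch.sum(weights * outputs, dim=1)  # (B, D)
    return context, weights

torch.manual_seed(0)
token_ids = torch.tensor([[5, 6, 7, 0],
                          [8, 9, 0, 0]])           # 0 is the <PAD> id
outputs = torch.randn(2, 4, 6)                     # stand-in for BiLSTM outputs
scores = torch.randn(2, 4, 1)                      # stand-in for self.attn(out)

context, weights = masked_attention_pool(outputs, scores, token_ids.eq(0))
print(context.shape)      # torch.Size([2, 6])
print(weights[1, :, 0])   # the last two entries are 0
```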

In [42]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = CodeClassifierBiLSTM_Attn(
    vocab_size=len(vocab),
    embed_dim=128,
    hidden_dim=256,
    num_classes=len(label_encoder.classes_),
    dropout=0.3
).to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

In [None]:
def train_model(model, train_loader, val_loader, epochs=10, save_path="best_model.pt"):
    best_f1 = 0.0

    for epoch in range(epochs):
        model.train()
        total_loss = 0

        for batch in train_loader:
            inputs, labels = batch
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)  # logits
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(train_loader)

        # === VALIDATION ===
        model.eval()
        val_preds = []
        val_labels = []

        with torch.no_grad():
            for batch in val_loader:
                inputs, labels = batch
                inputs = inputs.to(device)
                outputs = model(inputs)
                preds = torch.argmax(outputs, dim=1).cpu().numpy()
                val_preds.extend(preds)
                val_labels.extend(labels.numpy())

        acc = accuracy_score(val_labels, val_preds)
        f1 = f1_score(val_labels, val_preds, average='macro')

        print(f"Epoch {epoch+1}/{epochs} | Train Loss: {avg_loss:.4f} | Val Acc: {acc:.4f} | Val F1: {f1:.4f}")

        # Save best model
        if f1 > best_f1:
            best_f1 = f1
            torch.save(model.state_dict(), save_path)
            print(f"✅ New best model saved with F1: {f1:.4f}")

In [45]:
def final_report(model, val_loader):
    model.eval()
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs = inputs.to(device)
            outputs = model(inputs)
            preds = torch.argmax(outputs, dim=1).cpu().numpy()
            all_preds.extend(preds)
            all_labels.extend(labels.numpy())

    print(classification_report(all_labels, all_preds, target_names=label_encoder.classes_))

In [49]:
#train_model(model, train_loader, val_loader, epochs=50)
final_report(model, val_loader)

              precision    recall  f1-score   support

           c       0.90      1.00      0.95        18
         cpp       1.00      1.00      1.00        10
          cs       1.00      1.00      1.00        18
           d       1.00      1.00      1.00        21
          hs       1.00      0.96      0.98        24
        java       1.00      0.96      0.98        24
          js       1.00      0.95      0.98        22
         php       1.00      1.00      1.00        13
          py       1.00      1.00      1.00        11
          rs       0.95      1.00      0.97        19

    accuracy                           0.98       180
   macro avg       0.98      0.99      0.99       180
weighted avg       0.98      0.98      0.98       180



## Model Evaluation

The test set goes through the same preprocessing steps (tokenization, index conversion, and padding), reusing the vocabulary and label encoder fitted on the training data.

Evaluation metrics include:
- **Accuracy**
- **F1-Score (per class and macro-average)**
- **Confusion matrix**
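For reference, macro-averaged F1 is the unweighted mean of the per-class F1 scores, where each class's F1 is the harmonic mean of its precision and recall. The plain-Python sketch below mirrors what `f1_score(..., average='macro')` computes on a toy example; the label lists are invented.

```python
def macro_f1(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores)

y_true = [0, 0, 1, 1, 2, 2]   # invented labels
y_pred = [0, 0, 1, 2, 2, 2]   # invented predictions
print(round(macro_f1(y_true, y_pred), 4))  # 0.8222
```

Because every class counts equally, a single weak class lowers macro F1 more visibly than it lowers accuracy.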

### Key Results:
- 6 out of 10 languages reached **F1 = 1.00** on the test set (C, C++, C#, D, Java, PHP); the other four scored 0.95
- The only two errors are one Haskell and one Rust file (recall 0.90 each), which were predicted as JavaScript and Python (precision 0.91 each)
- No class scored below F1 = 0.95
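A classification report shows how many errors each class has, but not which class they were assigned to. A count over `(true, predicted)` pairs fills that gap. The sketch below uses invented labels; in the notebook the inputs would be the `labels` and `preds` lists collected in `evaluate_on_test`, mapped back to names with `label_encoder.inverse_transform`. The helper name `confusion_pairs` is hypothetical.

```python
from collections import Counter

def confusion_pairs(y_true, y_pred):
    # Count only the off-diagonal cells: (true label, predicted label) with true != predicted
    return Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)

# Invented example, not the notebook's actual predictions
y_true = ["hs", "hs", "rs", "py", "js"]
y_pred = ["hs", "py", "rs", "py", "js"]
for (true_label, pred_label), n in confusion_pairs(y_true, y_pred).most_common():
    print(f"{true_label} -> {pred_label}: {n}")
# hs -> py: 1
```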

On the 100 held-out test files the model reaches 98% accuracy and a macro F1 of 0.98, indicating that it distinguishes the syntactic patterns of the ten languages reliably.

In [59]:
df_test = read_code_files("data/test")
df_test = pd.DataFrame(df_test, columns=["code", "label"])

In [60]:
df_test = df_test[df_test["label"] != ""]  # drop files without an extension

df_test["label"].value_counts()

label
cpp     10
py      10
cs      10
rs      10
java    10
hs      10
php     10
js      10
c       10
d       10
Name: count, dtype: int64

In [None]:
y_test = label_encoder.transform(df_test["label"])

# Tokenize and turn into IDs
encoded_test_sequences = []
for code in df_test["code"]:
    tokens = tokenize(code)
    token_ids = tokens_to_ids(tokens[:MAX_LEN], vocab)
    encoded_test_sequences.append(torch.tensor(token_ids, dtype=torch.long))

# Padding
padded_test = pad_sequence(encoded_test_sequences, batch_first=True, padding_value=vocab["<PAD>"])
labels_test_tensor = torch.tensor(y_test, dtype=torch.long)

In [63]:
test_dataset = TensorDataset(padded_test, labels_test_tensor)
test_loader = DataLoader(test_dataset, batch_size=32)

In [92]:
def evaluate_on_test(model, test_loader):
    model.eval()
    preds = []
    labels = []

    with torch.no_grad():
        for inputs, y in test_loader:
            inputs = inputs.to(device)
            outputs = model(inputs)
            batch_preds = torch.argmax(outputs, dim=1).cpu().numpy()
            preds.extend(batch_preds)
            labels.extend(y.numpy())
            
    print(classification_report(labels, preds, target_names=label_encoder.classes_))

In [None]:
# map_location lets a checkpoint saved on GPU load on a CPU-only machine
model.load_state_dict(torch.load("best_model.pt", map_location=device))
model.to(device)
evaluate_on_test(model, test_loader)

              precision    recall  f1-score   support

           c       1.00      1.00      1.00        10
         cpp       1.00      1.00      1.00        10
          cs       1.00      1.00      1.00        10
           d       1.00      1.00      1.00        10
          hs       1.00      0.90      0.95        10
        java       1.00      1.00      1.00        10
          js       0.91      1.00      0.95        10
         php       1.00      1.00      1.00        10
          py       0.91      1.00      0.95        10
          rs       1.00      0.90      0.95        10

    accuracy                           0.98       100
   macro avg       0.98      0.98      0.98       100
weighted avg       0.98      0.98      0.98       100



In [None]:
def prepare_input(code_str, vocab, max_len=500):
    tokens = tokenize(code_str)
    token_ids = tokens_to_ids(tokens[:max_len], vocab)
    seq = torch.tensor(token_ids, dtype=torch.long)

    if len(seq) < max_len:
        pad_len = max_len - len(seq)
        pad_tensor = torch.full((pad_len,), vocab["<PAD>"], dtype=torch.long)
        seq = torch.cat([seq, pad_tensor])
    return seq.unsqueeze(0)  # shape: (1, MAX_LEN)

In [72]:
def predict_language(code_str, model, vocab, label_encoder, max_len=500):
    model.eval()
    input_tensor = prepare_input(code_str, vocab, max_len).to(device)
    
    with torch.no_grad():
        logits = model(input_tensor)
        pred_idx = torch.argmax(logits, dim=1).item()
        pred_label = label_encoder.inverse_transform([pred_idx])[0]
        return pred_label
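`predict_language` returns only the arg-max label. To see how confident the model is, the logits can be turned into probabilities with a softmax and the top few classes listed. The helper below is a hypothetical, plain-Python sketch operating on a list of logits (for example `logits[0].tolist()`); the class subset and the numbers are invented.

```python
import math

def top_k_languages(logits, classes, k=3):
    # Numerically stable softmax: subtract the max before exponentiating
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(classes, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

toy_classes = ["c", "cpp", "py"]   # invented subset of labels
toy_logits = [2.0, 1.0, -1.0]      # invented logits
for label, prob in top_k_languages(toy_logits, toy_classes, k=2):
    print(f"{label}: {prob:.3f}")
```

A small gap between the top two probabilities would flag a snippet as ambiguous, which the bare label cannot show.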

## Model Demo

Below is an example of the model predicting the programming language of unseen code snippets.

In the example below, drawn at random from the test set, the classifier predicts the correct language, using both sequential context (via the BiLSTM) and learned attention weights.

In [None]:
rand_int = np.random.randint(0, len(df_test))
sample_code = df_test["code"].iloc[rand_int]  # Change this to test other samples
# Build strings outside the f-strings: backslashes and nested double quotes
# inside f-string expressions are a SyntaxError before Python 3.12
first_lines = "\n".join(sample_code.splitlines()[:10])
print(f"Code (first 10 lines):\n{first_lines}")
predicted_language = predict_language(sample_code, model, vocab, label_encoder, max_len=MAX_LEN)
actual_language = df_test["label"].iloc[rand_int]
print(f"Predicted value: {predicted_language}, Actual value: {actual_language}")

Code (first 10 lines):
import std.stdio, std.string, std.conv;
import std.array, std.algorithm, std.range;

void main()
{
    foreach(s;stdin.byLine())
    {
        int[10] c; foreach(n;s) ++c[n-'0'];
        bool dfs(int n, bool d)
        {
Predicted value: d, Actual value: d
