# Code Comment Classification - Encoding

This notebook performs the following encoding operations:
1. Load the split datasets
2. Separate features and target
3. Encode the Target Labels
4. Build and Fit the Feature Engineering Pipeline
5. Transform and Save Data

## 1. Load the split datasets
We load the separate training and test files created in the previous cleaning step. Keeping them in separate files lets us fit every encoder on the training data alone, so the test set stays unseen during fitting.

In [1]:
import pandas as pd
import numpy as np

# Load the training data (Use this to learn patterns/vocabulary)
df_train = pd.read_csv("code-comment-classification-train-unbalanced.csv")

# Load the test data (Use this ONLY for evaluation)
df_test = pd.read_csv("code-comment-classification-test.csv")

print(f"Training Set Shape: {df_train.shape}")
print(f"Test Set Shape:     {df_test.shape}")

# Verify columns
print("Columns:", df_train.columns.tolist())

Training Set Shape: (2291, 4)
Test Set Shape:     (573, 4)
Columns: ['comment_sentence_id', 'class', 'comment_sentence', 'category']
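Before encoding, it can be worth confirming that the two files really are disjoint and seeing how unbalanced the training labels are. The sketch below uses small toy frames (`toy_train`, `toy_test`, made up for illustration) and assumes that `comment_sentence_id` uniquely identifies a sentence, which the source does not state.

```python
import pandas as pd

# Toy stand-ins for df_train / df_test (hypothetical values, not the real data)
toy_train = pd.DataFrame({
    "comment_sentence_id": [1, 2, 3],
    "category": ["Usage", "Summary", "Usage"],
})
toy_test = pd.DataFrame({
    "comment_sentence_id": [4, 5],
    "category": ["Summary", "Usage"],
})

# IDs that appear in both splits; an empty set means no row is shared
overlap = set(toy_train["comment_sentence_id"]) & set(toy_test["comment_sentence_id"])
print("Shared IDs:", len(overlap))

# Relative frequency of each category in the training split
class_share = toy_train["category"].value_counts(normalize=True)
print(class_share)
```

The same two checks should apply to `df_train` and `df_test` by swapping in the real frames.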


## 2. Separate features and target
We separate the input features (`class`, `comment_sentence`) from the target variable (`category`) for both datasets.

In [2]:
FEATURES = ["class", "comment_sentence"]
TARGET = "category"

# Split Training Data
X_train = df_train[FEATURES]
y_train = df_train[TARGET]

# Split Test Data
X_test = df_test[FEATURES]
y_test = df_test[TARGET]

print("Features and Target separated.")

Features and Target separated.


## 3. Encode the Target Labels
We convert the text labels (e.g., "Usage", "Summary") into integers (0, 1, 2, ...).
- We `.fit()` the label encoder only on `y_train`.
- We `.transform()` `y_test` with the same mapping. This call raises a `ValueError` if the test set contains a label that was not seen during fitting (unlikely in this dataset, but a useful safeguard).

In [3]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

# Fit on Training labels
y_train_encoded = label_encoder.fit_transform(y_train)

# Transform Test labels (using the same mapping)
y_test_encoded = label_encoder.transform(y_test)

# Display the mapping
print("Category to Numeric Mapping:")
for i, category in enumerate(label_encoder.classes_):
    print(f"  {category}: {i}")

Category to Numeric Mapping:
  DevelopmentNotes: 0
  Expand: 1
  Parameters: 2
  Summary: 3
  Usage: 4
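The behaviour on unseen labels can be sketched on toy data. The label lists below are made up for illustration ("Pointer" is a hypothetical category that is absent from the toy training labels); the sketch shows an explicit check for unseen labels, the error raised by `transform`, and how `inverse_transform` recovers the original strings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for y_train / y_test (hypothetical values)
y_train_toy = ["Usage", "Summary", "Expand", "Usage"]
y_test_toy = ["Summary", "Pointer"]  # "Pointer" never appears in training

encoder = LabelEncoder().fit(y_train_toy)

# Labels present in the test split but unknown to the fitted encoder
unseen = sorted(set(y_test_toy) - set(encoder.classes_))
print("Unseen test labels:", unseen)

# transform() rejects unseen labels instead of silently encoding them
try:
    encoder.transform(y_test_toy)
    raised = False
except ValueError:
    raised = True
print("transform raised ValueError:", raised)

# inverse_transform maps integer codes back to the original strings
decoded = encoder.inverse_transform(np.array([0, 1, 2]))
print(list(decoded))
```

Running the explicit set difference first gives a readable list of offending labels, whereas relying on the exception alone only reports that something was unseen.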


## 4. Build and Fit the Feature Engineering Pipeline
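We build a `ColumnTransformer` that turns the raw columns into numeric features:
1. `OneHotEncoder`: Converts the `class` column into binary indicator columns. Categories not seen during fitting are encoded as all zeros (`handle_unknown="ignore"`).
2. `TfidfVectorizer`: Converts `comment_sentence` into TF-IDF weights over unigrams and bigrams, with English stop words removed.

We call `preprocess.fit(X_train)` so that the `class` categories, the vocabulary, and the IDF weights are learned from the training data only.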

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the transformers
preprocess = ColumnTransformer(
    transformers=[
        # Categorical: One-Hot Encoding
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["class"]),
        
        # Text: TF-IDF Encoding
        # Keep only the 5000 most frequent unigrams/bigrams to bound the feature count
        ("text", TfidfVectorizer(stop_words="english", ngram_range=(1, 2), max_features=5000), "comment_sentence")
    ]
)

# FIT ONLY ON TRAINING DATA
print("Fitting preprocessing pipeline on Training Data...")
preprocess.fit(X_train)
print("Pipeline fitted.")

Fitting preprocessing pipeline on Training Data...
Pipeline fitted.
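To see what the fitted transformer has learned, a minimal sketch on toy data follows. The frames `toy_train` and `toy_test` and their contents are invented for illustration, and the `TfidfVectorizer` here uses default settings rather than the notebook's options so the vocabulary is easy to count by hand.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for X_train / X_test (hypothetical values)
toy_train = pd.DataFrame({
    "class": ["Parser", "Parser", "Lexer"],
    "comment_sentence": [
        "parse the input stream",
        "return the parsed tree",
        "split input into tokens",
    ],
})
toy_test = pd.DataFrame({
    "class": ["Emitter"],                      # class never seen in training
    "comment_sentence": ["emit tokens quickly"],
})

toy_preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["class"]),
        ("text", TfidfVectorizer(), "comment_sentence"),
    ]
)
toy_preprocess.fit(toy_train)

# Number of columns contributed by each fitted transformer
n_cat = len(toy_preprocess.named_transformers_["cat"].categories_[0])
n_text = len(toy_preprocess.named_transformers_["text"].vocabulary_)
print(f"one-hot columns: {n_cat}, tf-idf columns: {n_text}")

# The output may be sparse or dense depending on density, so normalise to dense
encoded_test = toy_preprocess.transform(toy_test)
dense_test = encoded_test.toarray() if sparse.issparse(encoded_test) else encoded_test

# One-hot columns come first; an unseen class leaves them all zero
unseen_class_sum = dense_test[0, :n_cat].sum()
print("encoded test shape:", dense_test.shape)
```

Applied to the real `preprocess` object, the same two counts should add up to the encoded width reported in the next section.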


## 5. Transform and Save Data
We now transform both datasets with the fitted preprocessor, which yields sparse numeric matrices. We then save the features, the encoded targets, and the fitted objects separately so the Model Training notebook can load them easily.

In [5]:
from scipy import sparse
import joblib

# 1. Transform the Training Data
X_train_encoded = preprocess.transform(X_train)

# 2. Transform the Test Data (using the pipeline fitted on train)
X_test_encoded = preprocess.transform(X_test)

print(f"Encoded Train Shape: {X_train_encoded.shape}")
print(f"Encoded Test Shape:  {X_test_encoded.shape}")

# --- SAVING FILES ---

# Save Features (Sparse Matrices)
sparse.save_npz("train_features.npz", X_train_encoded)
sparse.save_npz("test_features.npz", X_test_encoded)

# Save Targets (CSVs)
pd.DataFrame(y_train_encoded, columns=['category']).to_csv("code-comment-classification-train-target.csv", index=False)
pd.DataFrame(y_test_encoded, columns=['category']).to_csv("code-comment-classification-test-target.csv", index=False)

# Save the Pipeline and Encoder for later use
joblib.dump(preprocess, "preprocessing_pipeline.pkl")
joblib.dump(label_encoder, "label_encoder.pkl")

print("\nFiles Saved Successfully:")
print("- train_features.npz & code-comment-classification-train-target.csv")
print("- test_features.npz  & code-comment-classification-test-target.csv")
print("- preprocessing_pipeline.pkl")
print("- label_encoder.pkl")

Encoded Train Shape: (2291, 5306)
Encoded Test Shape:  (573, 5306)

Files Saved Successfully:
- train_features.npz & code-comment-classification-train-target.csv
- test_features.npz  & code-comment-classification-test-target.csv
- preprocessing_pipeline.pkl
- label_encoder.pkl
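A downstream notebook could read these artifacts back roughly as sketched below. To stay self-contained, the sketch writes toy data to a temporary directory first. The toy matrix, the shortened target file name, and the variable names are assumptions for illustration, not the notebook's actual files.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for X_train_encoded, label_encoder and y_train_encoded
toy_features = sparse.csr_matrix(np.array([[0.0, 0.5], [1.0, 0.0]]))
toy_encoder = LabelEncoder().fit(["Summary", "Usage"])
toy_target = toy_encoder.transform(["Usage", "Summary"])

with tempfile.TemporaryDirectory() as tmp:
    features_path = os.path.join(tmp, "train_features.npz")
    target_path = os.path.join(tmp, "train_target.csv")   # hypothetical name
    encoder_path = os.path.join(tmp, "label_encoder.pkl")

    # Save the same way the notebook does
    sparse.save_npz(features_path, toy_features)
    pd.DataFrame(toy_target, columns=["category"]).to_csv(target_path, index=False)
    joblib.dump(toy_encoder, encoder_path)

    # Load everything back, as a training notebook might
    X_loaded = sparse.load_npz(features_path)
    y_loaded = pd.read_csv(target_path)["category"].to_numpy()
    encoder_loaded = joblib.load(encoder_path)

# Row counts of features and targets must agree
print(X_loaded.shape, y_loaded.shape)

# Recover the original category names from the integer targets
labels_loaded = encoder_loaded.inverse_transform(y_loaded)
print(list(labels_loaded))
```

Objects saved with `joblib.dump` are pickles, so they are best loaded with the same scikit-learn version that created them and only from trusted sources.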
