# Feature Encoding - Code Comment Classification

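This notebook builds the feature matrix from three sources: sentence embeddings from a BERT-style model (`all-MiniLM-L6-v2`), hand-crafted metadata features, and a one-hot encoding of the class name.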

## Steps:
1. Load train/test splits from data cleaning
2. Extract metadata features
3. Encode target labels
4. Generate BERT embeddings
5. One-hot encode class names
6. Combine all features
7. Save encoded features

## Input Files:
- `code-comment-classification-train-unbalanced.csv`
- `code-comment-classification-test.csv`

## Output Files:
- `train_features_4cat_bert_meta.npz`
- `test_features_4cat_bert_meta.npz`
- `train_target_4cat_meta.csv`
- `test_target_4cat_meta.csv`
- `class_encoder_4cat_meta.pkl`
- `bert_model_4cat_meta.pkl`
- `label_encoder_4cat_meta.pkl`

## Import Required Libraries

In [8]:
# Data manipulation
import pandas as pd
import numpy as np

# Scikit-learn preprocessing
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# BERT embeddings
from sentence_transformers import SentenceTransformer

# Utilities
from scipy import sparse
import joblib

print("All libraries imported successfully!")

All libraries imported successfully!


---
# Part 2: Encoding
---

## 2.1 Load the Split Datasets
We load the separate training and test files created in the previous cleaning step. Keeping them separate lets us fit every encoder below on the training set only, so the test set stays unseen during fitting.

In [9]:
# Load the training data (Use this to learn patterns/vocabulary)
df_train = pd.read_csv("code-comment-classification-train-unbalanced.csv")

# Load the test data (Use this ONLY for evaluation)
df_test = pd.read_csv("code-comment-classification-test.csv")

print(f"Training Set Shape: {df_train.shape}")
print(f"Test Set Shape:     {df_test.shape}")

# Verify columns
print("Columns:", df_train.columns.tolist())

Training Set Shape: (2249, 4)
Test Set Shape:     (563, 4)
Columns: ['comment_sentence_id', 'class', 'comment_sentence', 'category']
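
A quick way to confirm that the two splits really are disjoint is to compare their identifiers and their text. The sketch below assumes that `comment_sentence_id` uniquely identifies a row, which the notebook does not state. The tiny frames and the helper `split_overlap` are hypothetical stand-ins, not part of the original pipeline.

```python
import pandas as pd

# Hypothetical stand-ins for df_train / df_test (same column names as above).
train_demo = pd.DataFrame({
    "comment_sentence_id": [1, 2, 3],
    "comment_sentence": ["returns the name", "see also foo", "default is 5"],
})
test_demo = pd.DataFrame({
    "comment_sentence_id": [4, 5],
    "comment_sentence": ["creates a widget", "see also foo"],
})

def split_overlap(train, test, column):
    """Return the sorted values of `column` that appear in both splits."""
    return sorted(set(train[column]) & set(test[column]))

shared_ids = split_overlap(train_demo, test_demo, "comment_sentence_id")
shared_text = split_overlap(train_demo, test_demo, "comment_sentence")

print("Shared ids:", shared_ids)
print("Shared sentences:", shared_text)
```

An empty id overlap with a non-empty text overlap would indicate duplicate sentences stored under different ids, which is worth knowing before interpreting test scores.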


## 2.2 Separate Features and Target
We separate the input features (`class`, `comment_sentence`) from the target variable (`category`) for both datasets.

In [10]:
FEATURES = ["class", "comment_sentence"]
TARGET = "category"

# Split Training Data
X_train_enc = df_train[FEATURES].copy()
y_train_enc = df_train[TARGET]

# Split Test Data
X_test_enc = df_test[FEATURES].copy()
y_test_enc = df_test[TARGET]

print("Features and Target separated.")

Features and Target separated.


## 2.3 Extract Metadata Features

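Before encoding, we extract additional features that might help classification:
- **comment_length**: Number of whitespace-separated words in the comment
- **has_params**: Whether the comment contains a parameter-related keyword (e.g. `param`, `arg`, `int`), matched case-insensitively as a substring
- **has_code_symbols**: Whether the comment contains a code-related symbol (parentheses, brackets, braces, `_`, or `.`)
- **starts_with_verb**: Whether the comment starts with one of a fixed list of action verbs (e.g. "returns", "creates")
- **has_default**: Whether the comment contains the word "default"

These metadata features are later combined with the BERT embeddings and the class encoding.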

In [11]:
# Define the metadata extraction function
def extract_metadata_features(df):
    """Add five metadata columns derived from `comment_sentence`.

    The DataFrame is modified in place and also returned.
    """
    
    # 1. Comment length (number of words)
    df['comment_length'] = df['comment_sentence'].str.split().str.len()
    
    # 2. Has parameter-related keywords
    param_keywords = ['param', 'parameter', 'arg', 'argument', 'int', 'str', 'bool', 'float', 'list', 'dict', 'type']
    param_pattern = '|'.join(param_keywords)
    df['has_params'] = df['comment_sentence'].str.contains(param_pattern, case=False, regex=True).astype(int)
    
    # 3. Has code symbols
    code_symbols = [r'\(', r'\)', r'\[', r'\]', r'\{', r'\}', r'_', r'\.']
    code_pattern = '|'.join(code_symbols)
    df['has_code_symbols'] = df['comment_sentence'].str.contains(code_pattern, regex=True).astype(int)
    
    # 4. Starts with common action verbs
    action_verbs = ['returns', 'return', 'creates', 'create', 'provides', 'provide', 'handles', 'handle', 
                    'implements', 'implement', 'executes', 'execute', 'generates', 'generate',
                    'validates', 'validate', 'processes', 'process', 'manages', 'manage']
    # Non-capturing group (?:...) avoids the pandas "match groups" UserWarning
    verb_pattern = '^(?:' + '|'.join(action_verbs) + ')'
    df['starts_with_verb'] = df['comment_sentence'].str.contains(verb_pattern, case=False, regex=True).astype(int)
    
    # 5. Mentions default values
    df['has_default'] = df['comment_sentence'].str.contains('default', case=False).astype(int)
    
    return df

# Apply to training data
print("Extracting metadata features from training data...")
X_train_enc = extract_metadata_features(X_train_enc)

# Apply to test data
print("Extracting metadata features from test data...")
X_test_enc = extract_metadata_features(X_test_enc)

print("\nMetadata features extracted:")
print(f"  - comment_length: {X_train_enc['comment_length'].describe()['mean']:.1f} words (avg)")
print(f"  - has_params: {X_train_enc['has_params'].sum()} comments ({X_train_enc['has_params'].sum()/len(X_train_enc)*100:.1f}%)")
print(f"  - has_code_symbols: {X_train_enc['has_code_symbols'].sum()} comments ({X_train_enc['has_code_symbols'].sum()/len(X_train_enc)*100:.1f}%)")
print(f"  - starts_with_verb: {X_train_enc['starts_with_verb'].sum()} comments ({X_train_enc['starts_with_verb'].sum()/len(X_train_enc)*100:.1f}%)")
print(f"  - has_default: {X_train_enc['has_default'].sum()} comments ({X_train_enc['has_default'].sum()/len(X_train_enc)*100:.1f}%)")

print(f"\nTraining data shape: {X_train_enc.shape}")
print(f"Columns: {X_train_enc.columns.tolist()}")

Extracting metadata features from training data...
Extracting metadata features from test data...

Metadata features extracted:
  - comment_length: 6.9 words (avg)
  - has_params: 632 comments (28.1%)
  - has_code_symbols: 838 comments (37.3%)
  - starts_with_verb: 35 comments (1.6%)
  - has_default: 112 comments (5.0%)

Training data shape: (2249, 7)
Columns: ['class', 'comment_sentence', 'comment_length', 'has_params', 'has_code_symbols', 'starts_with_verb', 'has_default']
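
Note that `has_params` uses plain substring matching, so short keywords such as `int` or `str` also fire inside longer words like "Prints" or "string". Whether that is intended is not stated in the notebook. A minimal sketch of a stricter whole-word variant follows, using only the standard `re` module; the shortened keyword list and the helper name `has_keyword` are illustrative.

```python
import re

# Illustrative subset of the keyword list used in extract_metadata_features.
param_keywords = ["param", "parameter", "arg", "int", "str"]

# Substring pattern, as in the notebook.
substring_pattern = re.compile("|".join(param_keywords), re.IGNORECASE)

# Whole-word variant: \b anchors each keyword at word boundaries.
word_pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, param_keywords)) + r")\b",
    re.IGNORECASE,
)

def has_keyword(pattern, text):
    """Return 1 if the pattern matches anywhere in text, else 0."""
    return int(pattern.search(text) is not None)

examples = [
    "Prints a warning to the console",  # "int" occurs only inside "Prints"
    "The int value of the counter",     # "int" occurs as a whole word
]
for text in examples:
    print(text, has_keyword(substring_pattern, text), has_keyword(word_pattern, text))
```

The stricter pattern also stops matching plurals such as "params" or "args", so the keyword list would need extending before swapping it in. The same pattern string can be passed to `Series.str.contains`.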




## 2.4 Encode the Target Labels
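We convert the text labels (e.g., "Parameters", "Summary") into integers (0, 1, 2, ...).
- We `.fit()` the label encoder only on `y_train_enc`.
- The test labels go through `.transform()` with the same mapping, which raises a `ValueError` if the test set contains a label not seen in training (unlikely in this dataset, but a useful safeguard).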

In [12]:
label_encoder = LabelEncoder()

# Fit on Training labels
y_train_encoded = label_encoder.fit_transform(y_train_enc)

# Transform Test labels (using the same mapping)
y_test_encoded = label_encoder.transform(y_test_enc)

# Display the mapping
print("Category to Numeric Mapping:")
for i, category in enumerate(label_encoder.classes_):
    print(f"  {category}: {i}")

Category to Numeric Mapping:
  DevelopmentNotes: 0
  Expand: 1
  Parameters: 2
  Summary: 3
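
`LabelEncoder.transform` fails with a `ValueError` when it meets a label that was not present during `fit`. The sketch below shows an explicit check for such labels, plus `inverse_transform` for mapping encoded values back to category names. The label lists are hypothetical stand-ins for `y_train_enc` and `y_test_enc`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical label lists standing in for y_train_enc / y_test_enc.
train_labels = ["Summary", "Expand", "Parameters", "DevelopmentNotes", "Summary"]
test_labels = ["Summary", "Parameters", "Usage"]  # "Usage" never seen in training

encoder = LabelEncoder().fit(train_labels)

# Labels present in the test split but unknown to the fitted encoder.
unseen = np.setdiff1d(np.unique(test_labels), encoder.classes_)
print("Unseen test labels:", unseen.tolist())

# Only transform rows whose label is known.
known_mask = np.isin(test_labels, encoder.classes_)
encoded = encoder.transform(np.asarray(test_labels)[known_mask])

# Round trip back to the original names.
decoded = encoder.inverse_transform(encoded)
print(encoded.tolist(), decoded.tolist())
```

Because `classes_` is sorted alphabetically, the integer codes here follow the same mapping as the output above.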


## 2.5 Build and Fit the Feature Engineering Pipeline with BERT Embeddings + Metadata

We combine three types of features:
1. `OneHotEncoder`: converts the `class` column into binary columns.
2. `SentenceTransformer` (BERT): converts the `comment_sentence` column into dense semantic embeddings.
3. Metadata features: the 5 extracted features (`comment_length`, `has_params`, etc.).

**Hypothesis:** Combining BERT's semantic understanding with explicit metadata features should boost performance beyond 65%.

In [13]:
# Initialize BERT model (downloads ~80MB on first run)
print("Loading BERT model (all-MiniLM-L6-v2)...")
bert_model = SentenceTransformer('all-MiniLM-L6-v2')
print("BERT model loaded.")

# Encode training comments with BERT
print("\nEncoding training comments with BERT...")
train_comment_embeddings = bert_model.encode(
    X_train_enc['comment_sentence'].tolist(),
    show_progress_bar=True,
    batch_size=32
)
print(f"Training BERT embeddings shape: {train_comment_embeddings.shape}")

# Encode test comments with BERT
print("\nEncoding test comments with BERT...")
test_comment_embeddings = bert_model.encode(
    X_test_enc['comment_sentence'].tolist(),
    show_progress_bar=True,
    batch_size=32
)
print(f"Test BERT embeddings shape: {test_comment_embeddings.shape}")

# Create OneHotEncoder for the 'class' column (OneHotEncoder is imported at the top)
class_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=True)

# Fit and transform 'class' column
print("\nOne-hot encoding 'class' column...")
train_class_encoded = class_encoder.fit_transform(X_train_enc[['class']])
test_class_encoded = class_encoder.transform(X_test_enc[['class']])

print(f"Training class encoding shape: {train_class_encoded.shape}")
print(f"Test class encoding shape: {test_class_encoded.shape}")

# Extract metadata features as numpy arrays
print("\nExtracting metadata features...")
metadata_cols = ['comment_length', 'has_params', 'has_code_symbols', 'starts_with_verb', 'has_default']
train_metadata = X_train_enc[metadata_cols].values
test_metadata = X_test_enc[metadata_cols].values

print(f"Training metadata shape: {train_metadata.shape}")
print(f"Test metadata shape: {test_metadata.shape}")

print("\nFeature engineering complete.")

Loading BERT model (all-MiniLM-L6-v2)...


BERT model loaded.

Encoding training comments with BERT...


Training BERT embeddings shape: (2249, 384)

Encoding test comments with BERT...


Test BERT embeddings shape: (563, 384)

One-hot encoding 'class' column...
Training class encoding shape: (2249, 306)
Test class encoding shape: (563, 306)

Extracting metadata features...
Training metadata shape: (2249, 5)
Test metadata shape: (563, 5)

Feature engineering complete.
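
With `handle_unknown='ignore'`, a class name that appears only in the test split is encoded as an all-zero row instead of raising an error. A small sketch with hypothetical class names illustrates this; it relies on the encoder's default sparse output rather than setting `sparse_output` explicitly.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical class names; the real notebook has 306 distinct training classes.
train_demo = pd.DataFrame({"class": ["Foo", "Bar", "Foo"]})
test_demo = pd.DataFrame({"class": ["Bar", "Baz"]})  # "Baz" is unseen

demo_encoder = OneHotEncoder(handle_unknown="ignore")
demo_encoder.fit(train_demo[["class"]])

test_encoded = demo_encoder.transform(test_demo[["class"]])

# Each row sums to 1 for a known class and 0 for an unseen one.
row_sums = np.asarray(test_encoded.sum(axis=1)).ravel().tolist()
print(demo_encoder.categories_[0].tolist())
print(row_sums)
```

Counting the zero-sum rows in `test_class_encoded` the same way would show how many test comments come from classes the encoder has never seen.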


## 2.6 Combine All Features and Save
Now we stack the class encoding, BERT embeddings, and metadata features (in that column order) into the final feature matrices, then save them together with the fitted encoders.

In [14]:
# Convert to sparse matrices
from scipy.sparse import csr_matrix, hstack

train_bert_sparse = csr_matrix(train_comment_embeddings)
test_bert_sparse = csr_matrix(test_comment_embeddings)

train_metadata_sparse = csr_matrix(train_metadata)
test_metadata_sparse = csr_matrix(test_metadata)

# Combine: class encoding + BERT embeddings + metadata features
X_train_encoded = hstack([train_class_encoded, train_bert_sparse, train_metadata_sparse])
X_test_encoded = hstack([test_class_encoded, test_bert_sparse, test_metadata_sparse])

print(f"Final Training Features Shape: {X_train_encoded.shape}")
print(f"Final Test Features Shape: {X_test_encoded.shape}")

print("\nFeature breakdown:")
print(f"  - Class one-hot: {train_class_encoded.shape[1]} features")
print(f"  - BERT embeddings: {train_bert_sparse.shape[1]} features")
print(f"  - Metadata: {train_metadata_sparse.shape[1]} features")
print(f"  - Total: {X_train_encoded.shape[1]} features")

# --- SAVING FILES ---

# Save Features (Sparse Matrices)
sparse.save_npz("train_features_4cat_bert_meta.npz", X_train_encoded)
sparse.save_npz("test_features_4cat_bert_meta.npz", X_test_encoded)

# Save Targets (CSVs)
pd.DataFrame(y_train_encoded, columns=['category']).to_csv("train_target_4cat_meta.csv", index=False)
pd.DataFrame(y_test_encoded, columns=['category']).to_csv("test_target_4cat_meta.csv", index=False)

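# Save the encoders for later use
joblib.dump(class_encoder, "class_encoder_4cat_meta.pkl")
# Note: a pickled SentenceTransformer is tied to the installed library versions;
# bert_model.save(<directory>) is the library's own persistence format.
joblib.dump(bert_model, "bert_model_4cat_meta.pkl")
joblib.dump(label_encoder, "label_encoder_4cat_meta.pkl")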

print("\nFiles Saved Successfully:")
print("- train_features_4cat_bert_meta.npz & train_target_4cat_meta.csv")
print("- test_features_4cat_bert_meta.npz  & test_target_4cat_meta.csv")
print("- class_encoder_4cat_meta.pkl")
print("- bert_model_4cat_meta.pkl")
print("- label_encoder_4cat_meta.pkl")

Final Training Features Shape: (2249, 695)
Final Test Features Shape: (563, 695)

Feature breakdown:
  - Class one-hot: 306 features
  - BERT embeddings: 384 features
  - Metadata: 5 features
  - Total: 695 features

Files Saved Successfully:
- train_features_4cat_bert_meta.npz & train_target_4cat_meta.csv
- test_features_4cat_bert_meta.npz  & test_target_4cat_meta.csv
- class_encoder_4cat_meta.pkl
- bert_model_4cat_meta.pkl
- label_encoder_4cat_meta.pkl
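
Because `hstack` concatenates column blocks in the order given, a saved matrix can be split back into its parts by column offset. The sketch below uses small random stand-ins with hypothetical widths and a temporary file rather than the real outputs, so it can be run without the notebook's data.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 4 rows, 3 one-hot columns, 6 embedding dims, 2 metadata columns.
class_block = sparse.csr_matrix(np.eye(3)[[0, 2, 1, 0]])
bert_block = sparse.csr_matrix(rng.normal(size=(4, 6)))
meta_block = sparse.csr_matrix(np.array([[5, 1], [9, 0], [3, 1], [7, 0]]))

# Requesting format="csr" guarantees a row-sliceable result
# regardless of the format hstack would pick by default.
combined = sparse.hstack([class_block, bert_block, meta_block], format="csr")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_features.npz")
    sparse.save_npz(path, combined)
    loaded = sparse.load_npz(path)

# Recover each block by column offset.
n_class, n_bert = class_block.shape[1], bert_block.shape[1]
class_part = loaded[:, :n_class]
bert_part = loaded[:, n_class:n_class + n_bert]
meta_part = loaded[:, n_class + n_bert:]

print(loaded.shape, class_part.shape, bert_part.shape, meta_part.shape)
```

For the real files the offsets would be 306 and 306 + 384, following the feature breakdown printed above.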
