# Code Comment Classification - Encoding

This notebook performs the following encoding operations:
1. Load the cleaned dataset
2. Select features and target
3. Build a preprocessing pipeline
4. Apply the transformations to encode the dataset
5. Convert the sparse matrix to a dataframe 
6. Save the final encoded dataset and the pipeline

## 1. Load the cleaned dataset

In [51]:
import pandas as pd

# Load the CSV file containing the cleaned dataset
df = pd.read_csv("code-comment-classification-cleaned.csv")

# Show the first few rows to confirm it loaded correctly
df.head()

,comment_sentence_id,class,comment_sentence,category
0,512,MigrationGraph,migrations files can be marked as replacing an...,Usage
1,513,MigrationGraph,this is to support the squash feature.,Usage
2,514,MigrationGraph,the graph handler isn t responsible,Usage
3,515,MigrationGraph,"for these instead, the code to load them in he...",Usage
4,516,MigrationGraph,migration files and if the replaced migrations...,Usage


## 2. Select features and target

In [52]:
# These are the columns we will use as inputs for the model.
# They contain:
#   - categorical data: class
#   - text data: comment_sentence
FEATURES = ["class", "comment_sentence"]

# This is the value we want to predict
TARGET = "category"

# Split the dataset into features (X) and target (y)
X = df[FEATURES]
y = df[TARGET]

In [53]:
from sklearn.preprocessing import LabelEncoder

# Create label encoder for target variable
label_encoder = LabelEncoder()

# Fit and transform the target categories to numeric values
y_encoded = label_encoder.fit_transform(y)

# Display the mapping
print("Category to Numeric Mapping:")
for i, category in enumerate(label_encoder.classes_):
    print(f"  {category}: {i}")

print(f"\nOriginal target (first 5): {y.head().tolist()}")
print(f"Encoded target (first 5): {y_encoded[:5].tolist()}")

Category to Numeric Mapping:
  DevelopmentNotes: 0
  Expand: 1
  Parameters: 2
  Summary: 3
  Usage: 4

Original target (first 5): ['Usage', 'Usage', 'Usage', 'Usage', 'Usage']
Encoded target (first 5): [4, 4, 4, 4, 4]
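
The mapping above can also be applied in reverse to turn numeric predictions back into category names. A minimal sketch, assuming a hypothetical array `predicted_ids` that stands in for model output; the encoder is refit here on the five category names only so that the block runs on its own:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Refit on the five category names shown above so this block is self-contained
demo_encoder = LabelEncoder()
demo_encoder.fit(["DevelopmentNotes", "Expand", "Parameters", "Summary", "Usage"])

# Hypothetical model output (numeric class ids), not real predictions
predicted_ids = np.array([4, 0, 3])

# inverse_transform maps ids back to the original string labels
decoded = demo_encoder.inverse_transform(predicted_ids)
print(decoded.tolist())
```

In the notebook itself, the already fitted `label_encoder` would be used instead of `demo_encoder`.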


## 3. Build a preprocessing pipeline

In [54]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

preprocess = ColumnTransformer(
    transformers=[
        # Encode categorical columns into one-hot vectors
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["class"]),

        # Convert comment_sentence text into TF-IDF features
        ("text", TfidfVectorizer(stop_words="english", ngram_range=(1, 2)), 
         "comment_sentence")
    ]
)

# Fit the preprocessing pipeline on the dataset
# (this learns the TF-IDF vocabulary and the unique values of `class`)
preprocess.fit(X)

print("Preprocessing pipeline fitted successfully!")


Preprocessing pipeline fitted successfully!


## 4. Apply the transformations to encode the dataset
This produces a sparse matrix containing:
- one-hot encoded categorical features (from `class`)
- TF-IDF encoded text features (from `comment_sentence`)

In [55]:
X_encoded = preprocess.transform(X)

X_encoded

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 22390 stored elements and shape (2864, 8907)>
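
The output above reports 22390 stored elements in a 2864 x 8907 matrix. The sketch below shows one way to compute that density for any SciPy sparse matrix; the helper name `density` is an assumption introduced here, and the toy matrix is illustrative only:

```python
from scipy import sparse


def density(matrix):
    """Return the fraction of cells that are explicitly stored (hypothetical helper)."""
    n_rows, n_cols = matrix.shape
    return matrix.nnz / (n_rows * n_cols)


# Toy 2 x 4 matrix with 2 stored values, for illustration only
toy = sparse.csr_matrix([[0.0, 1.0, 0.0, 0.0],
                         [0.0, 0.0, 0.0, 2.0]])
print(density(toy))  # 2 stored values out of 8 cells

# Same arithmetic with the numbers reported in the output above
reported = 22390 / (2864 * 8907)
print(f"{reported:.4%}")
```

In the notebook, `density(X_encoded)` would give the same figure as `reported`, which is well under 1%.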

## 5. Convert the sparse matrix to a dataframe
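This dense copy is for inspection only. The matrix above has 2864 x 8907 (about 25.5 million) cells, of which only 22390 are stored, so `toarray()` allocates roughly 200 MB of float64 values that are almost all zeros. Downstream modeling can consume the sparse `X_encoded` directly.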

In [56]:
import numpy as np
from scipy.sparse import csr_matrix

# Retrieve the names of the generated features for interpretability
cat_features = preprocess.named_transformers_["cat"].get_feature_names_out(
    ["class"]
)
text_features = preprocess.named_transformers_["text"].get_feature_names_out()

# Combine them in order
all_feature_names = list(cat_features) + list(text_features)

# Convert sparse matrix to dense numpy array
X_dense = X_encoded.toarray()

# Create a DataFrame for inspection
encoded_df = pd.DataFrame(X_dense, columns=all_feature_names)

# Add the encoded target column (now numeric)
encoded_df["category"] = y_encoded

encoded_df.head()

,class_AbstractEngine,class_AbstractHolidayCalendar,class_AccessMixin,class_AccessorCallableDocumenter,class_AccessorDocumenter,class_Adadelta,class_Adam,class_Adamax,class_AdaptiveMaxPool3d,class_AmbiguityError,...,zero paddings,zero sum,zeroormore,zeroormore boolean,zeros,zeros circular,zip,zip files,zipfile,zipfile manifests
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
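
If a labeled table is needed without allocating the dense array, pandas can wrap a SciPy sparse matrix directly. A minimal sketch using `pd.DataFrame.sparse.from_spmatrix`; the names `toy_matrix` and `toy_names` are illustrative stand-ins for `X_encoded` and `all_feature_names`, and this is an alternative to the cell above, not what the notebook does:

```python
import pandas as pd
from scipy import sparse

# Toy stand-ins for X_encoded and all_feature_names (illustrative values)
toy_matrix = sparse.csr_matrix([[1.0, 0.0, 0.0],
                                [0.0, 0.0, 0.5]])
toy_names = ["class_Adam", "zip", "zip files"]

# Each column is stored as a pandas sparse array, so zeros are not materialized
sparse_df = pd.DataFrame.sparse.from_spmatrix(toy_matrix, columns=toy_names)

print(sparse_df.dtypes)
print(sparse_df.sparse.density)  # fraction of stored (nonzero) cells
```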


## 6. Save the final encoded dataset and the pipeline

In [57]:
from scipy import sparse
import joblib

# Save the original dataset with encoded target
df_encoded_target = df.copy()
df_encoded_target['category'] = y_encoded
df_encoded_target.to_csv("dataset_with_encoded_target.csv", index=False)

# Save encoded feature matrix in sparse format (.npz)
sparse.save_npz("encoded_features.npz", sparse.csr_matrix(X_encoded))

# Save encoded target values (numeric)
pd.DataFrame({'category': y_encoded}).to_csv("target.csv", index=False)

# Save the preprocessing pipeline so we can use it later
joblib.dump(preprocess, "preprocessing_pipeline.pkl")

# Save the label encoder so we can decode predictions later
joblib.dump(label_encoder, "label_encoder.pkl")

print("Files saved:")
print("- dataset_with_encoded_target.csv (original dataset with numeric category)")
print("- encoded_features.npz (features only, sparse format)")
print("- target.csv (numeric encoded target)")
print("- preprocessing_pipeline.pkl")
print("- label_encoder.pkl")

Files saved:
- dataset_with_encoded_target.csv (original dataset with numeric category)
- encoded_features.npz (features only, sparse format)
- target.csv (numeric encoded target)
- preprocessing_pipeline.pkl
- label_encoder.pkl
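
The saved artifacts can be loaded again in a later session with `sparse.load_npz` and `joblib.load`. The sketch below shows the round trip on toy stand-ins written to a temporary directory; `toy_features` and `toy_encoder` are illustrative and do not reproduce the real files:

```python
import os
import tempfile

import joblib
import numpy as np
from scipy import sparse
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the saved artifacts (illustrative values only)
toy_features = sparse.csr_matrix(np.array([[0.0, 1.0], [0.5, 0.0]]))
toy_encoder = LabelEncoder().fit(["Summary", "Usage"])

with tempfile.TemporaryDirectory() as tmp_dir:
    features_path = os.path.join(tmp_dir, "encoded_features.npz")
    encoder_path = os.path.join(tmp_dir, "label_encoder.pkl")

    # Save with the same calls the notebook uses
    sparse.save_npz(features_path, toy_features)
    joblib.dump(toy_encoder, encoder_path)

    # Load them back, as a later session would
    loaded_features = sparse.load_npz(features_path)
    loaded_encoder = joblib.load(encoder_path)

print(loaded_features.shape)
print(loaded_encoder.inverse_transform([1]).tolist())
```

The `.pkl` files are pickles, so they are best loaded with the same scikit-learn version that wrote them and only from a trusted source.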
