# Code Comment Classification - Encoding

This notebook performs the following encoding operations:
1. Load the cleaned dataset
2. Select features and target
3. Build a preprocessing pipeline
4. Apply the transformations to encode the dataset
5. Convert the sparse matrix to a dataframe
6. Save the final encoded dataset and the pipeline

## 1. Load the cleaned dataset

In [1]:
import pandas as pd

# Load the CSV file containing the cleaned dataset
df = pd.read_csv("code-comment-classification-cleaned.csv")

# Show the first few rows to confirm it loaded correctly
df.head()

| | comment_sentence_id | class | category | comment_sentence | partition | instance_type |
|---|---|---|---|---|---|---|
| 0 | 1 | AccessMixin | DevelopmentNotes | abstract cbv mixin that gives access mixins th... | 1 | 0 |
| 1 | 1 | AccessMixin | Expand | abstract cbv mixin that gives access mixins th... | 1 | 0 |
| 2 | 1 | AccessMixin | Parameters | abstract cbv mixin that gives access mixins th... | 1 | 0 |
| 3 | 1 | AccessMixin | Summary | abstract cbv mixin that gives access mixins th... | 1 | 1 |
| 4 | 1 | AccessMixin | Usage | abstract cbv mixin that gives access mixins th... | 0 | 0 |
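
Before encoding, it can be worth confirming that the columns used later are present and free of missing values, since missing text values typically make `TfidfVectorizer` raise an error. The following is a minimal sketch under stated assumptions: `check_schema`, `EXPECTED_COLUMNS`, and the `demo` frame are hypothetical names and made-up rows, not part of the original notebook.

```python
import pandas as pd

# Columns the later cells rely on (taken from the preview above)
EXPECTED_COLUMNS = ["class", "category", "comment_sentence", "partition", "instance_type"]


def check_schema(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> dict:
    """Report expected columns that are absent and null counts for those present."""
    missing = [col for col in expected if col not in frame.columns]
    present = [col for col in expected if col in frame.columns]
    null_counts = frame[present].isna().sum().to_dict()
    return {"missing_columns": missing, "null_counts": null_counts}


# Tiny stand-in frame (made-up rows) so the sketch runs without the CSV file
demo = pd.DataFrame({
    "class": ["AccessMixin", "Adam"],
    "category": ["Summary", "Usage"],
    "comment_sentence": ["abstract cbv mixin", None],
    "partition": [1, 0],
    "instance_type": [1, 0],
})

report = check_schema(demo)
print(report)
```

In the notebook itself the same helper could be called as `check_schema(df)`.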


## 2. Select features and target

In [2]:
# These are the columns we will use as inputs for the model.
# They contain:
#   - categorical data: class, category
#   - text data: comment_sentence
#   - numeric data: partition
FEATURES = ["class", "category", "comment_sentence", "partition"]

# This is the value we want to predict (binary classification)
TARGET = "instance_type"

# Split the dataset into features (X) and target (y)
X = df[FEATURES]
y = df[TARGET]
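
Because the target is binary, a quick look at the class balance can inform later modelling choices. The sketch below uses a made-up label series (`y_demo`, a hypothetical stand-in for `y`) so that it runs on its own; it says nothing about the real distribution of `instance_type`.

```python
import pandas as pd

# Made-up labels standing in for df["instance_type"]
y_demo = pd.Series([0, 0, 0, 1, 0], name="instance_type")

# normalize=True returns the fraction of rows per class instead of raw counts
class_share = y_demo.value_counts(normalize=True).sort_index()
print(class_share)
```

Replacing `y_demo` with `y` would report the balance of the actual dataset.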


## 3. Build a preprocessing pipeline
We use `ColumnTransformer` to apply a different preprocessing step to each group of columns:
- `OneHotEncoder` for the categorical columns (`class`, `category`)
- `TfidfVectorizer` for the text column (`comment_sentence`)
- `passthrough` for the numeric column (`partition`)

In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

preprocess = ColumnTransformer(
    transformers=[
        # Encode categorical columns into one-hot vectors
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["class", "category"]),

        # Convert comment_sentence text into TF-IDF features over unigrams and
        # bigrams, ignoring English stop words
        ("text", TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
         "comment_sentence"),

        # Keep numerical columns as they are
        ("num", "passthrough", ["partition"])
    ]
)

# Fit the preprocessing pipeline on the full dataset
# (this learns the TF-IDF vocabulary and the set of categories in each categorical column)
preprocess.fit(X)

print("Preprocessing pipeline fitted successfully!")


Preprocessing pipeline fitted successfully!


## 4. Apply the transformations to encode the dataset
This produces a sparse matrix whose columns follow the order of the transformers:
- one-hot encoded categorical features
- TF-IDF encoded text features
- the numeric `partition` column, passed through unchanged

In [4]:
X_encoded = preprocess.transform(X)

X_encoded

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 114523 stored elements and shape (12775, 8913)>
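
The reported shape and number of stored elements imply that only about 0.1% of the cells are non-zero, which is why the sparse format matters here. A minimal sketch of that arithmetic follows; the `density` helper and the small `demo` matrix are hypothetical.

```python
from scipy import sparse


def density(matrix) -> float:
    """Fraction of cells that hold an explicitly stored value."""
    n_rows, n_cols = matrix.shape
    return matrix.nnz / (n_rows * n_cols)


# Small made-up CSR matrix: 2 stored values in a 2 x 4 grid
demo = sparse.csr_matrix([[0.0, 1.5, 0.0, 0.0],
                          [0.0, 0.0, 0.0, 2.0]])
print(density(demo))  # 2 / 8 = 0.25

# Same arithmetic using the figures reported in the output above
reported_density = 114523 / (12775 * 8913)
print(f"{reported_density:.4%}")
```

In the notebook, `density(X_encoded)` would give the same figure directly.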

## 5. Convert the sparse matrix to a dataframe
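This step is only for inspecting the encoded output; most scikit-learn estimators accept the sparse matrix directly. Note that densifying a 12775 x 8913 `float64` matrix allocates roughly 0.9 GB.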

In [5]:
from scipy.sparse import csr_matrix

# Retrieve the names of the generated features for interpretability
cat_features = preprocess.named_transformers_["cat"].get_feature_names_out(
    ["class", "category"]
)
text_features = preprocess.named_transformers_["text"].get_feature_names_out()
num_features = ["partition"]

# Combine them in the same order as the transformers in the ColumnTransformer
all_feature_names = list(cat_features) + list(text_features) + num_features

# Convert sparse matrix to dense numpy array
X_dense = X_encoded.toarray()

# Create a DataFrame for inspection
encoded_df = pd.DataFrame(X_dense, columns=all_feature_names)

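# Add the target column for completeness.
# Caution: if "target" is also a TF-IDF vocabulary term, this assignment
# overwrites that feature column instead of appending a new one.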
encoded_df["target"] = y.values

encoded_df.head()

| | class_AbstractEngine | class_AbstractHolidayCalendar | class_AccessMixin | class_AccessorCallableDocumenter | class_AccessorDocumenter | class_Adadelta | class_Adam | class_Adamax | class_AdaptiveMaxPool3d | class_AmbiguityError | ... | zero sum | zeroormore | zeroormore boolean | zeros | zeros circular | zip | zip files | zipfile | zipfile manifests | partition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
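
If a labelled view of the encoded features is needed without allocating a dense array, pandas can wrap a SciPy sparse matrix directly through `pd.DataFrame.sparse.from_spmatrix`. This is an alternative sketch, not what the notebook does above; `demo_matrix` and `demo_names` are made-up stand-ins for `X_encoded` and `all_feature_names`.

```python
import pandas as pd
from scipy import sparse

# Made-up sparse matrix and column names for illustration
demo_matrix = sparse.csr_matrix([[0.0, 1.0, 0.0],
                                 [0.5, 0.0, 0.0]])
demo_names = ["class_A", "zip", "partition"]

# Each column is stored as a pandas sparse array, so zeros are not materialised
sparse_df = pd.DataFrame.sparse.from_spmatrix(demo_matrix, columns=demo_names)
print(sparse_df.dtypes)
print(sparse_df.sparse.density)  # share of explicitly stored values
```

With thousands of columns, building such a frame can still be slow, so it is best kept for inspection.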


## 6. Save the final encoded dataset and the pipeline

In [6]:
from scipy import sparse
import joblib

# Save encoded feature matrix in sparse format (.npz)
# (sparse.csr_matrix avoids relying on an import from an earlier cell)
sparse.save_npz("encoded_features.npz", sparse.csr_matrix(X_encoded))

# Save target values
y.to_csv("target.csv", index=False)

# Save the preprocessing pipeline so we can use it later
joblib.dump(preprocess, "preprocessing_pipeline.pkl")

print("Files saved:")
print("- encoded_features.npz")
print("- target.csv")
print("- preprocessing_pipeline.pkl")


Files saved:
- encoded_features.npz
- target.csv
- preprocessing_pipeline.pkl
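
The saved artifacts can be read back with `sparse.load_npz`, `pd.read_csv`, and `joblib.load`. The sketch below checks such a round trip in a temporary directory. As a simplifying assumption it uses a plain `TfidfVectorizer` on two made-up sentences in place of the full `ColumnTransformer`, and all variable names are hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the text column and the target
texts = pd.Series(["abstract mixin for views",
                   "optimizer with adaptive learning rate"])
labels = pd.Series([1, 0], name="instance_type")

vectorizer = TfidfVectorizer().fit(texts)
features = vectorizer.transform(texts)

with tempfile.TemporaryDirectory() as tmp:
    features_path = os.path.join(tmp, "encoded_features.npz")
    target_path = os.path.join(tmp, "target.csv")
    pipeline_path = os.path.join(tmp, "preprocessing_pipeline.pkl")

    # Save the three artifacts, mirroring the cell above
    sparse.save_npz(features_path, features)
    labels.to_csv(target_path, index=False)
    joblib.dump(vectorizer, pipeline_path)

    # Load them back
    features_back = sparse.load_npz(features_path)
    labels_back = pd.read_csv(target_path)["instance_type"]
    vectorizer_back = joblib.load(pipeline_path)

# Element-wise comparison of sparse matrices: no differing entries means equal
same_features = (features != features_back).nnz == 0
same_transform = (vectorizer_back.transform(texts) != features).nnz == 0
print(same_features, same_transform, labels_back.tolist())
```

Pickled scikit-learn objects are generally meant to be loaded with the same library version that wrote them, and only from trusted sources.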
