# Code Comment Classification - Outlier Detection

This notebook performs the following outlier detection steps:
1. Load the encoded dataset
2. Outlier detection using isolation forest
3. Add outlier labels back to original data
4. Analyze results
5. Save outlier results

## 1. Load the encoded dataset

In [3]:
import pandas as pd
from scipy import sparse

# Load the encoded sparse matrix (.npz)
X_encoded = sparse.load_npz("encoded_features.npz")

# Load target column
y = pd.read_csv("target.csv")["instance_type"]

print("Encoded feature matrix shape:", X_encoded.shape)
print("Target shape:", y.shape)


Encoded feature matrix shape: (12775, 8913)
Target shape: (12775,)


## 2. Outlier detection using isolation forest
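`Isolation Forest` is chosen because it:
- accepts sparse input directly, so the 8,913-column encoded matrix does not have to be densified
- scales well, since each tree is grown on a small random subsample of rows
- is unsupervised (no labels required)
- scores anomalies by how few random splits are needed to isolate a point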
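
To make the isolation idea concrete, here is a minimal sketch on synthetic data rather than the notebook's dataset. The matrix `X_demo`, its size, and the deliberately dense first row are all assumptions made for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for an encoded matrix: 200 rows, 50 binary columns,
# roughly 5% of the entries set to 1 (illustrative values only).
dense = (rng.random((200, 50)) < 0.05).astype(float)
dense[0, :] = 1.0  # row 0 is made deliberately unusual: every feature is active
X_demo = sparse.csr_matrix(dense)

iso_demo = IsolationForest(n_estimators=100, random_state=0).fit(X_demo)

# score_samples returns one score per row; lower means easier to isolate,
# i.e. more anomalous.
scores = iso_demo.score_samples(X_demo)

print("Score of the injected row:", scores[0])
print("Median score of all rows: ", np.median(scores))
print("Index of the lowest score:", int(np.argmin(scores)))
```

The injected row should be separated from the rest after very few splits, so its score is expected to sit well below the median.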

In [4]:
from sklearn.ensemble import IsolationForest

# Create the outlier detector
iso = IsolationForest(
    n_estimators=200,
    contamination="auto",  # use the default score threshold (offset_ = -0.5) instead of a fixed outlier share
    random_state=42
)

# Fit the model on encoded features
iso.fit(X_encoded)

# Predict outliers:
#   +1 → normal point
#   -1 → outlier (anomaly)
outlier_labels = iso.predict(X_encoded)

# Convert to a convenient format:
#   1 → outlier
#   0 → normal
outliers_binary = (outlier_labels == -1).astype(int)

print("Outlier detection complete!")


Outlier detection complete!
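
In scikit-learn, `contamination="auto"` does not estimate a proportion from the data. It fixes the threshold attribute `offset_` at -0.5, and `predict` returns -1 only for rows whose `decision_function` value is negative. The sketch below checks that relationship on a small synthetic array; `X_toy` and the three shifted points are assumptions for illustration, not part of the dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Toy data: 300 Gaussian points in 5 dimensions, with the first three
# shifted far away from the rest (illustrative values only).
X_toy = rng.normal(size=(300, 5))
X_toy[:3] += 8.0

iso_toy = IsolationForest(
    n_estimators=200, contamination="auto", random_state=42
).fit(X_toy)

raw = iso_toy.score_samples(X_toy)           # raw anomaly score, lower = more anomalous
decision = iso_toy.decision_function(X_toy)  # raw score shifted by offset_
labels = iso_toy.predict(X_toy)              # +1 normal, -1 outlier

print("offset_:", iso_toy.offset_)
print("decision == raw - offset_:", np.allclose(decision, raw - iso_toy.offset_))
print("predict == -1 exactly where decision < 0:",
      np.array_equal(labels == -1, decision < 0))
```

Passing a float such as `contamination=0.01` instead sets the threshold so that roughly that share of the training rows is flagged.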


## 3. Add outlier labels back to original data

In [5]:
# Load original dataset (before encoding)
df_original = pd.read_csv("code-comment-classification-cleaned.csv")

# Add new column: 1 = outlier, 0 = normal
df_original["outlier"] = outliers_binary

df_original.head()


| Unnamed: 0 | comment_sentence_id | class | category | comment_sentence | partition | instance_type | outlier |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | AccessMixin | DevelopmentNotes | abstract cbv mixin that gives access mixins th... | 1 | 0 | 0 |
| 1 | 1 | AccessMixin | Expand | abstract cbv mixin that gives access mixins th... | 1 | 0 | 0 |
| 2 | 1 | AccessMixin | Parameters | abstract cbv mixin that gives access mixins th... | 1 | 0 | 0 |
| 3 | 1 | AccessMixin | Summary | abstract cbv mixin that gives access mixins th... | 1 | 1 | 0 |
| 4 | 1 | AccessMixin | Usage | abstract cbv mixin that gives access mixins th... | 0 | 0 | 0 |
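
Assigning the labels by position only makes sense if row *i* of the encoded matrix and row *i* of the cleaned CSV describe the same comment. One possible sanity check, sketched below on toy data, is to compare a column that exists on both sides. The helper `rows_aligned` is hypothetical, and using it here assumes that `target.csv` was written from the cleaned CSV in the same row order.

```python
import pandas as pd

def rows_aligned(left, right):
    """Return True when both sequences have equal length and identical values at every position."""
    left = pd.Series(left).reset_index(drop=True)
    right = pd.Series(right).reset_index(drop=True)
    return len(left) == len(right) and bool((left == right).all())

# Toy stand-ins for `y` and `df_original` (illustrative values only)
toy_target = pd.Series([0, 0, 1, 0])
toy_original = pd.DataFrame({"instance_type": [0, 0, 1, 0]})
toy_shuffled = pd.DataFrame({"instance_type": [1, 0, 0, 0]})

print(rows_aligned(toy_target, toy_original["instance_type"]))  # same order
print(rows_aligned(toy_target, toy_shuffled["instance_type"]))  # reordered rows
```

In this notebook the equivalent call would be `rows_aligned(y, df_original["instance_type"])`, run before the `outlier` column is added.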


## 4. Analyze results

In [6]:
# Count outliers
num_outliers = df_original["outlier"].sum()
num_normal = len(df_original) - num_outliers

print(f"Total rows: {len(df_original)}")
print(f"Outliers detected: {num_outliers}")
print(f"Normal rows: {num_normal}")

# Show actual outlier rows
df_original[df_original["outlier"] == 1].head(10)


Total rows: 12775
Outliers detected: 0
Normal rows: 12775


| Unnamed: 0 | comment_sentence_id | class | category | comment_sentence | partition | instance_type | outlier |
| --- | --- | --- | --- | --- | --- | --- | --- |
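
Zero flagged rows means that no score fell below the model's threshold, not that every row is equally typical. One way to still surface candidates for manual review is to rank rows by their raw score and mark the lowest-scoring share. The sketch below uses a hypothetical helper, `flag_lowest_scores`, on made-up scores; the fraction is an arbitrary illustration, not a recommended value.

```python
import numpy as np

def flag_lowest_scores(scores, fraction=0.01):
    """Return a 0/1 array marking the `fraction` of rows with the lowest scores (at least one row)."""
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(round(fraction * len(scores))))
    lowest = np.argsort(scores)[:k]  # indices of the k smallest scores
    flags = np.zeros(len(scores), dtype=int)
    flags[lowest] = 1
    return flags

# Made-up scores for five rows; lower means more anomalous
toy_scores = np.array([-0.41, -0.44, -0.62, -0.40, -0.43])
toy_flags = flag_lowest_scores(toy_scores, fraction=0.2)
print(toy_flags)  # marks only the row with score -0.62
```

With the fitted model from this notebook, the scores would come from `iso.score_samples(X_encoded)`. Rows flagged this way are review candidates rather than confirmed outliers.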


## 5. Save outlier results

In [7]:
# Save updated dataset with outlier labels
df_original.to_csv("code-comment-classification-with-outliers.csv", index=False)

# Save just the outlier predictions
pd.DataFrame({"outlier": outliers_binary}).to_csv("outlier-labels.csv", index=False)

print("Saved:")
print("- code-comment-classification-with-outliers.csv")
print("- outlier-labels.csv")


Saved:
- code-comment-classification-with-outliers.csv
- outlier-labels.csv
