# 01 - EDA

## Setting Up Project Dir

In [1]:
from jupyter_init import setup

setup()

from src_code.config import *

## Loading Dataset

In [2]:
import pandas as pd
import numpy as np

TRANSFORMED_DF = EXTRACTED_DATA_DIR / "train_labeled_features_partial_copy.feather"

# ---- LOAD ----
df = pd.read_feather(TRANSFORMED_DF)
print(f"Loaded dataframe with {len(df)} rows and {len(df.columns)} columns\n")

df.dtypes

Loaded dataframe with 111567 rows and 31 columns



datetime                      datetime64[us, pytz.FixedOffset(-120)]
commit                                                        object
repo                                                          object
filepath                                                      object
content                                                       object
methods                                                       object
lines                                                         object
author_email                                                  object
canonical_datetime                               datetime64[ns, UTC]
author_exp_pre                                                 int64
author_recent_activity_pre                                     int64
label                                                          int64
loc_added                                                      int64
loc_deleted                                                    int64
files_changed                     

## Correlation between suspicious negatives and repo

Negative `time_since_last_change` values are suspicious. Check whether they concentrate in particular repos:

In [None]:
df[df['time_since_last_change'] < 0]['repo'].value_counts()


Ansible dominates the negatives. A likely explanation is its extremely branch-heavy development model, with large numbers of parallel feature branches.

Ansible's development process:

- is highly distributed
- is patch-based
- is very branch-heavy
- merges large numbers of old branches late
- has frequent external contributors with old author dates
- has multiple maintainers committing patches on contributors' behalf

This creates many opportunities for chronologically inconsistent commit sequences, which would show up as a negative `time_since_last_change`.
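
Raw counts favour repos that simply have more rows, so it may help to compare the share of negative values per repo as well. The following is a minimal sketch on toy data; the helper name `negative_rate_by_repo` and the toy values are illustrative assumptions, not part of the project code or the real dataset.

```python
import pandas as pd


def negative_rate_by_repo(frame: pd.DataFrame,
                          col: str = "time_since_last_change") -> pd.DataFrame:
    """Count and share of rows with a negative `col`, per repo (hypothetical helper)."""
    return (
        frame.assign(is_negative=frame[col] < 0)
        .groupby("repo")["is_negative"]
        # negatives: number of negative rows, rows: total rows, share: negatives / rows
        .agg(negatives="sum", rows="count", share="mean")
        .sort_values("share", ascending=False)
    )


# Toy data for illustration only (not values from the real dataset).
toy = pd.DataFrame({
    "repo": ["ansible", "ansible", "ansible", "other", "other", "other", "other"],
    "time_since_last_change": [-5.0, -1.0, 10.0, 3.0, 7.0, -2.0, 8.0],
})
print(negative_rate_by_repo(toy))
```

On the real data the call would be `negative_rate_by_repo(df)`. A repo with a high share but few rows deserves less weight than one with both a high share and many rows.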

## Label leakage

Make sure labels aren’t trivially encoded:

- Does loc_added == 0 imply no bug?
- Does complexity_delta > 0 imply bug?
- Are bug labels perfectly predicted by any single feature?

In [3]:
df.groupby('label').mean(numeric_only=True)

| label | author_exp_pre | author_recent_activity_pre | loc_added | loc_deleted | files_changed | hunks_count | msg_len | has_fix_kw | has_bug_kw | ast_delta | complexity_delta | max_func_change | time_since_last_change | todo | fixme | try | except | raise | recent_churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 784.679983 | 65.879603 | 96.397537 | 115.449355 | 25.264013 | 100.265608 | 129.482188 | 0.287016 | 0.057431 | 1108.177383 | 36.830376 | 59.136929 | 19389.024214 | 0.190741 | 0.073947 | 100.347621 | 9.687835 | 2.284899 | 22404.352773 |
| 1 | 580.079231 | 48.285646 | 86.107986 | 122.336533 | 16.258294 | 73.775573 | 239.289335 | 0.353747 | 0.068942 | 147.683507 | 5.394581 | 106.860477 | 11273.012123 | 0.112688 | 0.025131 | 3.037189 | 1.632857 | 1.085681 | 7766.710967 |
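
Group means can be skewed by outliers and do not directly answer the three leakage questions. One possible complement is a per-feature ROC AUC (computed here from ranks, which is equivalent to the Mann-Whitney U statistic) plus a crosstab for a specific condition such as `loc_added == 0`. This is a sketch on toy data: the helper name `single_feature_auc` and the `leaky` column are hypothetical, and an AUC near 0 or 1 is only a hint that a feature may encode the label.

```python
import pandas as pd


def single_feature_auc(frame: pd.DataFrame, label_col: str = "label") -> pd.Series:
    """ROC AUC of each numeric feature used alone as a score (hypothetical helper)."""
    y = frame[label_col].astype(bool)
    features = frame.select_dtypes("number").drop(columns=[label_col])
    scores = {}
    for name in features.columns:
        valid = features[name].notna()          # ignore missing values per feature
        y_valid = y[valid]
        n_pos, n_neg = int(y_valid.sum()), int((~y_valid).sum())
        if n_pos == 0 or n_neg == 0:
            continue                            # AUC undefined with a single class
        ranks = features.loc[valid, name].rank(method="average")
        u_stat = ranks[y_valid].sum() - n_pos * (n_pos + 1) / 2
        scores[name] = u_stat / (n_pos * n_neg)
    # Sort by distance from 0.5: both very high and very low AUC are suspicious.
    return pd.Series(scores).sort_values(key=lambda s: (s - 0.5).abs(), ascending=False)


# Toy data for illustration only: "leaky" separates the classes perfectly.
toy = pd.DataFrame({
    "label":     [0, 0, 0, 1, 1, 1],
    "leaky":     [1, 2, 3, 10, 11, 12],
    "loc_added": [0, 5, 3, 5, 0, 3],
})
auc = single_feature_auc(toy)
print(auc)

# Label distribution when loc_added == 0 versus when it is not.
print(pd.crosstab(toy["loc_added"] == 0, toy["label"], normalize="index"))
```

The same crosstab pattern applies to `complexity_delta > 0`. A row of the crosstab that is entirely one class would mean the condition determines the label on that subset.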


## Extremely Short Commit Messages

Check the distribution of `msg_len`: the minimum is 1, so some messages are extremely short and may need normalization.

In [4]:
df['msg_len'].describe()


count    111567.000000
mean        175.108939
std         305.239324
min           1.000000
25%          48.000000
50%          76.000000
75%         195.000000
max        9628.000000
Name: msg_len, dtype: float64
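
To decide whether short messages matter, it may help to count how many rows fall below a few length cut-offs and how their label rate compares. The sketch below uses toy data; the helper name `short_message_summary` and the thresholds 5, 10, and 20 are arbitrary assumptions chosen for illustration.

```python
import pandas as pd


def short_message_summary(frame: pd.DataFrame,
                          thresholds=(5, 10, 20)) -> pd.DataFrame:
    """Rows with msg_len below each threshold, and their mean label (hypothetical helper)."""
    rows = []
    for t in thresholds:
        short = frame["msg_len"] < t
        rows.append({
            "threshold": t,
            "n_short": int(short.sum()),
            "share_short": float(short.mean()),
            # Mean of the 0/1 label among short messages; NaN if there are none.
            "bug_rate_short": float(frame.loc[short, "label"].mean())
            if short.any() else float("nan"),
        })
    return pd.DataFrame(rows).set_index("threshold")


# Toy data for illustration only (not values from the real dataset).
toy = pd.DataFrame({
    "msg_len": [1, 4, 8, 15, 48, 76, 195, 300],
    "label":   [0, 1, 0, 0, 1, 0, 1, 1],
})
summary = short_message_summary(toy)
print(summary)
```

On the real data the call would be `short_message_summary(df)`. If the short-message rows are few, dropping or flagging them is likely cheaper than normalizing their text.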