
# Machine Learning for Software Analysis (MLSA)

#### Fabio Pinelli
<a href="mailto:fabio.pinelli@imtlucca.it">fabio.pinelli@imtlucca.it</a><br/>
IMT School for Advanced Studies Lucca<br/>
2025/2026<br/>
October 9, 2025

## Exercise 1

**Dataset**: KC1 (NASA PROMISE) — static code metrics (Halstead, McCabe/Cyclomatic, LOC...).

**Metrics**:
- *Halstead* (operators/operands, volume, difficulty, effort, estimated bugs) —
[https://en.wikipedia.org/wiki/Halstead_complexity_measures](https://en.wikipedia.org/wiki/Halstead_complexity_measures)

- *Cyclomatic complexity (McCabe)*: [https://en.wikipedia.org/wiki/Cyclomatic_complexity](https://en.wikipedia.org/wiki/Cyclomatic_complexity)


**Goal**: binary classification of defective vs non-defective modules.
  
+ **Size / comments**

    - **loc**: Lines of code (LOC): total number of lines in the module.
    - **loccodeandcomment**: Number of lines that contain both code and a comment.
    - **locode**: Effective lines of code (excluding comment-only and blank lines).
    - **locomment**: Number of comment lines.
    - **loblank**: Number of blank lines.
    - **branchcount**: Number of branches in the control flow (approximately the number of decisions).

+ **McCabe / Cyclomatic**
    - **v(g)**: Cyclomatic complexity v(G): number of linearly independent paths in the CFG (McCabe).
    - **ev(g)**: Essential complexity ev(G): measures the degree of structuredness.
    - **iv(g)**: Design complexity iv(G): complexity related to design/call structure.
    - **cyclomatic_complexity**: Cyclomatic complexity: number of independent paths.

+ **Halstead**
    - **uniq_op**: Distinct operators (η₁).
    - **uniq_opnd**: Distinct operands (η₂).
    - **total_op**: Total operators (N₁).
    - **total_opnd**: Total operands (N₂).
    - **n**: Length N = N₁ + N₂.
    - **v**: Volume V = N × log₂(η₁ + η₂).
    - **l**: Level L (inverse of difficulty, L = 1/D).
    - **d**: Difficulty D = (η₁/2) × (N₂/η₂).
    - **i**: Intelligence content I = L × V (interpretations vary).
    - **e**: Effort E = D × V.
    - **b**: Estimated bugs B ≈ V/3000 (or E^(2/3)/3000).
    - **t**: Time to program T = E/18 (seconds).
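
The Halstead formulas listed above can be sketched as a small function that derives every metric from the four base counts. This is only an illustration: `halstead_metrics` is a hypothetical helper, not part of the dataset tooling, and it assumes non-zero operand counts. The inputs are the base counts of row 2106 as shown in the dataframe preview in this notebook, and the rounded results agree with the values stored in that row.

```python
import math

def halstead_metrics(uniq_op, uniq_opnd, total_op, total_opnd):
    """Derive the Halstead metrics from the four base counts (hypothetical helper)."""
    n = total_op + total_opnd                     # length N = N1 + N2
    v = n * math.log2(uniq_op + uniq_opnd)        # volume V = N * log2(eta1 + eta2)
    d = (uniq_op / 2) * (total_opnd / uniq_opnd)  # difficulty D
    l = 1 / d                                     # level L = 1 / D
    e = d * v                                     # effort E = D * V
    return {
        "n": n, "v": v, "l": l, "d": d,
        "i": l * v,      # intelligence content I = L * V
        "e": e,
        "b": v / 3000,   # estimated bugs, using the V/3000 variant
        "t": e / 18,     # time to program, in seconds
    }

# Base counts of row 2106 in the dataframe preview.
m = halstead_metrics(uniq_op=3, uniq_opnd=1, total_op=3, total_opnd=1)
print({name: round(value, 2) for name, value in m.items()})
```

Recomputing the derived columns this way is a quick sanity check on a few rows. Rows whose stored values disagree with the formulas may deserve a closer look.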





In [None]:
import os
import sys

import numpy as np
import pandas as pd
import seaborn as sns
import sklearn

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Load the modified KC1 dataset
df = pd.read_csv('https://raw.githubusercontent.com/fpinell/mlsa/refs/heads/main/AA20252026/data/kc1_modified.csv')

In [None]:
df

Unnamed: 0,ev(g),iv(g),n,v,l,d,i,e,b,t,...,loblank,loccodeandcomment,uniq_op,uniq_opnd,total_op,total_opnd,branchcount,loc_qbin,v(g)_bin,defects
0,1.4,1.4,1.3,1.30,1.30,1.30,1.30,1.30,1.30,1.30,...,2,2,1.2,1.2,1.2,1.2,1.4,Q1,Low,False
1,1.0,1.0,1.0,1.00,1.00,1.00,1.00,1.00,1.00,1.00,...,1,1,1.0,1.0,1.0,1.0,1.0,Q1,Low,True
2,1.0,11.0,171.0,927.89,0.04,23.04,40.27,21378.61,0.31,1187.70,...,6,0,18.0,25.0,107.0,64.0,21.0,Q4,Low,True
3,6.0,8.0,141.0,769.78,0.07,14.86,51.81,11436.73,0.26,635.37,...,5,0,16.0,28.0,89.0,52.0,15.0,Q4,Low,True
4,1.0,3.0,58.0,254.75,0.11,9.35,27.25,2381.95,0.08,132.33,...,2,0,11.0,10.0,41.0,17.0,5.0,Q4,Low,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2104,1.0,2.0,40.0,175.69,0.15,6.82,25.77,1197.90,0.06,66.55,...,2,0,10.0,11.0,25.0,15.0,3.0,Q3,Low,False
2105,3.0,3.0,60.0,278.63,0.10,9.69,28.75,2700.58,0.09,150.03,...,2,0,12.0,13.0,39.0,21.0,5.0,Q3,Low,False
2106,1.0,1.0,4.0,8.00,0.67,1.50,5.33,12.00,0.00,0.67,...,0,0,3.0,1.0,3.0,1.0,1.0,Q1,Low,False
2107,,,17.0,60.94,0.25,4.00,15.24,243.78,0.02,13.54,...,5,0,6.0,6.0,9.0,8.0,1.0,Q3,Low,False


In [None]:
df.defects.value_counts()

defects,count
False,1783
True,326
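
The counts above show a clear class imbalance. As a rough sketch, the snippet below uses those two counts (taken before any duplicate removal) to compute the defect rate and the accuracy of a trivial model that always predicts "not defective". That baseline is a useful reference point: a classifier that does not beat it on accuracy has learned little.

```python
# Class counts copied from the value_counts() output above.
counts = {False: 1783, True: 326}
total = sum(counts.values())

# Share of modules labelled as defective.
defect_rate = counts[True] / total

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(counts.values()) / total

print(f"defect rate: {defect_rate:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

With this degree of imbalance, metrics such as precision, recall, or F1 on the defective class are usually more informative than accuracy alone.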


In [None]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2109 entries, 0 to 2108
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ev(g)              1997 non-null   float64
 1   iv(g)              1997 non-null   float64
 2   n                  2109 non-null   float64
 3   v                  2109 non-null   float64
 4   l                  2109 non-null   float64
 5   d                  2109 non-null   float64
 6   i                  2109 non-null   float64
 7   e                  2109 non-null   float64
 8   b                  2109 non-null   float64
 9   t                  2109 non-null   float64
 10  locode             2109 non-null   int64  
 11  locomment          2109 non-null   int64  
 12  loblank            2109 non-null   int64  
 13  loccodeandcomment  2109 non-null   int64  
 14  uniq_op            2109 non-null   float64
 15  uniq_opnd          2109 non-null   float64
 16  total_op           2109 

In [None]:
# This is a classification problem.
# ev(g) and iv(g) have missing values (1997 non-null out of 2109 rows).
df["v(g)_bin"].value_counts()

v(g)_bin,count
Low,2029
MedLow,65
MedHigh,14
High,1


In [None]:
df[["locode","locomment","loccodeandcomment"]]

Unnamed: 0,locode,locomment,loccodeandcomment
0,2,2,2
1,1,1,1
2,65,10,0
3,37,2,0
4,21,0,0
...,...,...,...
2104,12,1,0
2105,18,1,0
2106,0,0,0
2107,6,0,0


In [None]:
# Check for duplicate rows in the original dataframe
print("Number of duplicate rows in the original dataframe:", df.duplicated().sum())

# Remove duplicate rows
df.drop_duplicates(inplace=True)

print("Shape of dataframe after removing duplicates:", df.shape)

# Now proceed with the stratified split in the next cell

Number of duplicate rows in the original dataframe: 906
Shape of dataframe after removing duplicates: (1203, 22)
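
One side effect of `drop_duplicates` is worth illustrating: the surviving rows keep their original index labels, so the index is no longer a contiguous `0..n-1` range. The toy dataframe below (hypothetical values) is a minimal sketch of how label-based `.loc` and position-based `.iloc` then diverge, and of how `reset_index(drop=True)` restores a contiguous index.

```python
import pandas as pd

# Toy data (hypothetical): rows 1 and 3 duplicate rows 0 and 2.
toy = pd.DataFrame({
    "x": [1, 1, 2, 2, 3],
    "defects": [False, False, True, True, False],
})
deduped = toy.drop_duplicates()

# The surviving rows keep their original labels, leaving gaps.
print(list(deduped.index))

# Label-based lookup with positions 0, 1, 2 fails because labels 1 is missing.
try:
    deduped.loc[[0, 1, 2]]
    loc_failed = False
except KeyError:
    loc_failed = True
print("loc with positional indices failed:", loc_failed)

# Position-based lookup works regardless of the labels.
last_by_position = deduped.iloc[2]

# Alternatively, rebuild a contiguous index so labels and positions agree.
reindexed = deduped.reset_index(drop=True)
print(list(reindexed.index))
```

This matters for any later step that produces positional indices, such as the splitters in `sklearn.model_selection`.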


In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

# Use the df that still contains the 'defects' column.
# Note: drop_duplicates() left gaps in df's index labels, while split.split()
# returns positional indices, so the rows must be selected with .iloc, not .loc.

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# split() takes the dataframe and the target column used for stratification
for train_index, test_index in split.split(df, df["defects"]):
    strat_train_set = df.iloc[train_index]
    strat_test_set = df.iloc[test_index]

# Separate features (X) and target (y) from the stratified sets
X_train = strat_train_set.drop("defects", axis=1)
y_train = strat_train_set["defects"]

X_test = strat_test_set.drop("defects", axis=1)
y_test = strat_test_set["defects"]


print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

# You can now proceed with preprocessing X_train and X_test,
# including handling the row where v(g)_bin was 'High' if needed.

In [None]:
print("Distribution of the target variable in y_train:")
print(y_train.value_counts(normalize=True))

print("\nDistribution of the target variable in y_test:")
print(y_test.value_counts(normalize=True))
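
As an alternative sketch of the same idea, `train_test_split` with the `stratify` argument performs a stratified split in a single call. The toy dataframe below is hypothetical (80 non-defective and 20 defective rows) and is only meant to show that both subsets keep the 80/20 class proportions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 80 non-defective and 20 defective rows.
toy = pd.DataFrame({
    "feature": range(100),
    "defects": [False] * 80 + [True] * 20,
})

X = toy.drop(columns="defects")
y = toy["defects"]

# stratify=y keeps the class proportions equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Because `train_test_split` indexes the inputs by position, it is not affected by gaps in the dataframe index.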

In [None]:
from sklearn.impute import SimpleImputer
import numpy as np

# Calculate the median of the columns with missing values on the training set
median_evg = X_train['ev(g)'].median()
median_ivg = X_train['iv(g)'].median()

# Store the medians (optional, but good practice for later use on test set)
imputation_medians = {
    'ev(g)': median_evg,
    'iv(g)': median_ivg
}

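# Impute missing values in X_train with the calculated medians.
# Assigning the result back avoids chained in-place fillna, which pandas
# warns about and which may silently leave the dataframe unchanged.
X_train['ev(g)'] = X_train['ev(g)'].fillna(imputation_medians['ev(g)'])
X_train['iv(g)'] = X_train['iv(g)'].fillna(imputation_medians['iv(g)'])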

# Verify that there are no more missing values in these columns in X_train
print("Missing values in X_train after imputation:")
print(X_train[['ev(g)', 'iv(g)']].isnull().sum())

# Display the first few rows of X_train to see the changes
display(X_train.head())
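
The same median imputation can also be sketched with `SimpleImputer`, which stores the training medians and reuses them on new data. The two toy frames below contain hypothetical values. The point is only that the imputer is fitted on the training data and then applied unchanged to the test data, so no test-set information leaks into the medians.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames (hypothetical values) with gaps in the two columns of interest.
train = pd.DataFrame({
    "ev(g)": [1.0, 3.0, np.nan, 5.0],
    "iv(g)": [2.0, np.nan, 4.0, 6.0],
})
test = pd.DataFrame({
    "ev(g)": [np.nan, 2.0],
    "iv(g)": [1.0, np.nan],
})

cols = ["ev(g)", "iv(g)"]
imputer = SimpleImputer(strategy="median")

# Learn the medians from the training data only, then reuse them on the test data.
train[cols] = imputer.fit_transform(train[cols])
test[cols] = imputer.transform(test[cols])

print(imputer.statistics_)  # per-column medians learned from `train`
print(test)
```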

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Identify the categorical columns
categorical_cols = ['loc_qbin', 'v(g)_bin']

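# Create an OrdinalEncoder instance
# handle_unknown='use_encoded_value' and unknown_value=-1 are useful
# for handling categories in the test set that might not be in the training set.
# Note: without an explicit `categories` argument the encoder sorts the labels
# lexicographically ('High' < 'Low' < 'MedHigh' < 'MedLow'), which does not
# match their natural order.
ordinal_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)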

# Fit the encoder on the training data and transform X_train
X_train[categorical_cols] = ordinal_encoder.fit_transform(X_train[categorical_cols])

# Display the first few rows of X_train to see the transformed columns
print("X_train after Ordinal Encoding:")
display(X_train.head())

# Display the categories learned by the encoder (useful for understanding the mapping)
print("\nCategories learned by the Ordinal Encoder:")
print(ordinal_encoder.categories_)
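
If the encoded integers should follow the natural order of the bins, the order can be passed explicitly through the `categories` parameter. The sketch below rests on assumed orderings: `Q1 < Q2 < Q3 < Q4` for `loc_qbin` (the label `Q2` does not appear in the preview above and is an assumption) and `Low < MedLow < MedHigh < High` for `v(g)_bin`. The toy dataframe is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed natural orderings; 'Q2' is an assumption, not confirmed by the preview.
loc_order = ["Q1", "Q2", "Q3", "Q4"]
vg_order = ["Low", "MedLow", "MedHigh", "High"]

# Toy data (hypothetical) covering every category once.
toy = pd.DataFrame({
    "loc_qbin": ["Q1", "Q4", "Q3", "Q2"],
    "v(g)_bin": ["Low", "High", "MedLow", "MedHigh"],
})

encoder = OrdinalEncoder(
    categories=[loc_order, vg_order],   # one ordered list per column
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
encoded = encoder.fit_transform(toy)
print(encoded)
```

With explicit categories, a larger code always means a larger bin. This is what linear and distance-based models implicitly assume about ordinal features.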