
# Machine Learning for Software Analysis (MLSA)

#### Fabio Pinelli
<a href="mailto:fabio.pinelli@imtlucca.it">fabio.pinelli@imtlucca.it</a><br/>
IMT School for Advanced Studies Lucca<br/>
2025/2026<br/>
October 9, 2025

## Exercise 1

**Dataset**: KC1 (NASA PROMISE) — static code metrics (Halstead, McCabe/Cyclomatic, LOC...).

**Metrics**:
- *Halstead* (operators/operands, volume, difficulty, effort, estimated bugs): [https://en.wikipedia.org/wiki/Halstead_complexity_measures](https://en.wikipedia.org/wiki/Halstead_complexity_measures)

- *Cyclomatic complexity (McCabe)*: [https://en.wikipedia.org/wiki/Cyclomatic_complexity](https://en.wikipedia.org/wiki/Cyclomatic_complexity)


**Goal**: binary classification of defective vs non-defective modules.
  
+ **Size / comments**

    - **loc**: Lines of code (LOC), the total number of lines in the module.
    - **loccodeandcomment**: Number of lines that contain both code and a comment.
    - **locode**: Effective lines of code (no comments or blank lines).
    - **locomment**: Number of comment lines.
    - **loblank**: Number of blank lines.
    - **branchcount**: Number of branches in the control flow (approximately the number of decisions).

+ **McCabe / Cyclomatic**
    - **v(g)**: Cyclomatic complexity v(G), the number of linearly independent paths in the control-flow graph (McCabe).
    - **ev(g)**: Essential complexity ev(G), a measure of the degree of structuredness.
    - **iv(g)**: Design complexity iv(G), the complexity related to design/call structure.
    - **cyclomatic_complexity**: Cyclomatic complexity, the number of independent paths.

+ **Halstead**
    - **uniq_op**: Halstead distinct operators (η₁).
    - **uniq_opnd**: Halstead distinct operands (η₂).
    - **total_op**: Halstead total operators (N₁).
    - **total_opnd**: Halstead total operands (N₂).
    - **n**: Halstead length N = N₁ + N₂.
    - **v**: Halstead volume V = N × log₂(η₁ + η₂).
    - **l**: Halstead level L (inverse of difficulty).
    - **d**: Halstead difficulty D = (η₁/2) × (N₂/η₂).
    - **i**: Halstead intelligence content I = L × V (interpretations vary).
    - **e**: Halstead effort E = D × V.
    - **b**: Halstead estimated bugs B ≈ V/3000 (or (E^(2/3))/3000).
    - **t**: Halstead time to program T = E/18 (seconds).
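
The Halstead formulas above can be checked against the data. The sketch below uses a hypothetical helper, `halstead_metrics` (not part of the original notebook), to derive the measures from the four base counts. The input values are those of row 2 in the dataset preview (`uniq_op=18`, `uniq_opnd=25`, `total_op=107`, `total_opnd=64`). It assumes B = V/3000 and that the operand counts are non-zero.

```python
import math

def halstead_metrics(uniq_op, uniq_opnd, total_op, total_opnd):
    """Derive Halstead measures from the four base counts (illustrative helper)."""
    n = total_op + total_opnd                     # length N = N1 + N2
    v = n * math.log2(uniq_op + uniq_opnd)        # volume V = N * log2(eta1 + eta2)
    d = (uniq_op / 2) * (total_opnd / uniq_opnd)  # difficulty D = (eta1/2) * (N2/eta2)
    l = 1 / d                                     # level L, the inverse of difficulty
    e = d * v                                     # effort E = D * V
    return {
        "n": n, "v": v, "l": l, "d": d,
        "i": l * v,      # intelligence content I = L * V
        "e": e,
        "b": v / 3000,   # estimated bugs, assuming the V/3000 variant
        "t": e / 18,     # time to program, in seconds
    }

# Base counts taken from row 2 of the dataset preview
m = halstead_metrics(uniq_op=18, uniq_opnd=25, total_op=107, total_opnd=64)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, the results agree with the `n`, `v`, `l`, `d`, `i`, `e`, `b` and `t` values stored for that row, which suggests the dataset columns follow these definitions.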





In [44]:
import os
import sys

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn

# Render figures inline and use larger axis and tick labels
%matplotlib inline
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Load the modified KC1 dataset
df = pd.read_csv('https://raw.githubusercontent.com/fpinell/mlsa/refs/heads/main/AA20252026/data/kc1_modified.csv')

In [45]:
df

,ev(g),iv(g),n,v,l,d,i,e,b,t,...,loblank,loccodeandcomment,uniq_op,uniq_opnd,total_op,total_opnd,branchcount,loc_qbin,v(g)_bin,defects
0,1.4,1.4,1.3,1.30,1.30,1.30,1.30,1.30,1.30,1.30,...,2,2,1.2,1.2,1.2,1.2,1.4,Q1,Low,False
1,1.0,1.0,1.0,1.00,1.00,1.00,1.00,1.00,1.00,1.00,...,1,1,1.0,1.0,1.0,1.0,1.0,Q1,Low,True
2,1.0,11.0,171.0,927.89,0.04,23.04,40.27,21378.61,0.31,1187.70,...,6,0,18.0,25.0,107.0,64.0,21.0,Q4,Low,True
3,6.0,8.0,141.0,769.78,0.07,14.86,51.81,11436.73,0.26,635.37,...,5,0,16.0,28.0,89.0,52.0,15.0,Q4,Low,True
4,1.0,3.0,58.0,254.75,0.11,9.35,27.25,2381.95,0.08,132.33,...,2,0,11.0,10.0,41.0,17.0,5.0,Q4,Low,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2104,1.0,2.0,40.0,175.69,0.15,6.82,25.77,1197.90,0.06,66.55,...,2,0,10.0,11.0,25.0,15.0,3.0,Q3,Low,False
2105,3.0,3.0,60.0,278.63,0.10,9.69,28.75,2700.58,0.09,150.03,...,2,0,12.0,13.0,39.0,21.0,5.0,Q3,Low,False
2106,1.0,1.0,4.0,8.00,0.67,1.50,5.33,12.00,0.00,0.67,...,0,0,3.0,1.0,3.0,1.0,1.0,Q1,Low,False
2107,,,17.0,60.94,0.25,4.00,15.24,243.78,0.02,13.54,...,5,0,6.0,6.0,9.0,8.0,1.0,Q3,Low,False


In [46]:
df.defects.value_counts()

defects,count
False,1783
True,326
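
These counts show that the raw data are imbalanced. As a quick sanity check, the sketch below computes the defective share and the accuracy of a trivial classifier that always predicts the majority class. The counts are copied from the output above and describe the data before any cleaning; the variable names are illustrative.

```python
# Class counts copied from df.defects.value_counts() on the raw data
counts = {False: 1783, True: 326}

total = sum(counts.values())
defective_share = counts[True] / total
majority_baseline = max(counts.values()) / total  # accuracy of "always predict the majority class"

print(f"rows: {total}")
print(f"defective share: {defective_share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model trained on data with this distribution should be compared against that baseline, not against 50%.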


In [47]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2109 entries, 0 to 2108
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ev(g)              1997 non-null   float64
 1   iv(g)              1997 non-null   float64
 2   n                  2109 non-null   float64
 3   v                  2109 non-null   float64
 4   l                  2109 non-null   float64
 5   d                  2109 non-null   float64
 6   i                  2109 non-null   float64
 7   e                  2109 non-null   float64
 8   b                  2109 non-null   float64
 9   t                  2109 non-null   float64
 10  locode             2109 non-null   int64  
 11  locomment          2109 non-null   int64  
 12  loblank            2109 non-null   int64  
 13  loccodeandcomment  2109 non-null   int64  
 14  uniq_op            2109 non-null   float64
 15  uniq_opnd          2109 non-null   float64
 16  total_op           2109 

In [48]:
# This is a classification problem.
# ev(g) and iv(g) contain missing values (1997 non-null out of 2109 rows).
df["v(g)_bin"].value_counts()

v(g)_bin,count
Low,2029
MedLow,65
MedHigh,14
High,1


In [49]:
df[["locode","locomment","loccodeandcomment"]]

,locode,locomment,loccodeandcomment
0,2,2,2
1,1,1,1
2,65,10,0
3,37,2,0
4,21,0,0
...,...,...,...
2104,12,1,0
2105,18,1,0
2106,0,0,0
2107,6,0,0


In [50]:
# Check for duplicate rows in the original dataframe
print("Number of duplicate rows in the original dataframe:", df.duplicated().sum())

# Remove duplicate rows and reset the index
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)

# Remove rows where 'v(g)_bin' is 'High'
df = df[df['v(g)_bin'] != 'High'].copy()
df.reset_index(drop=True, inplace=True)

print("Shape of dataframe after removing duplicates and 'High' v(g)_bin:", df.shape)

# Now proceed with the stratified split in the next cell

Number of duplicate rows in the original dataframe: 906
Shape of dataframe after removing duplicates and 'High' v(g)_bin: (1202, 22)
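
The reported numbers are consistent: 2109 rows minus 906 duplicates leaves 1203, and dropping the single `'High'` row gives 1202. The toy example below (made-up data, not KC1) illustrates what `duplicated()` counts by default: every row that repeats an earlier row, with the first occurrence not counted.

```python
import pandas as pd

# Toy frame: the first three rows are identical
toy = pd.DataFrame({
    "loc": [10, 10, 10, 25],
    "defects": [True, True, True, False],
})

# duplicated() marks rows equal to an earlier row (keep='first' by default),
# so three identical rows contribute two duplicates.
n_dupes = int(toy.duplicated().sum())

# Non-inplace equivalent of the cleaning step used above
deduped = toy.drop_duplicates().reset_index(drop=True)

print("duplicates:", n_dupes)
print("rows after deduplication:", len(deduped))
```

Because deduplication changes the class proportions, it may be worth re-running `value_counts()` on the target afterwards.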


In [51]:
from sklearn.model_selection import StratifiedShuffleSplit

# df was deduplicated and re-indexed in the previous cell
# and still contains the 'defects' target column.

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# split() yields positional indices, stratified on the target column
for train_index, test_index in split.split(df, df["defects"]):
    strat_train_set = df.iloc[train_index]
    strat_test_set = df.iloc[test_index]

# Separate features (X) and target (y) from the stratified sets
X_train = strat_train_set.drop("defects", axis=1)
y_train = strat_train_set["defects"]

X_test = strat_test_set.drop("defects", axis=1)
y_test = strat_test_set["defects"]


print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

# You can now proceed with preprocessing X_train and X_test.
# The single row with v(g)_bin == 'High' was already removed in the cleaning cell.

Shape of X_train: (961, 21)
Shape of X_test: (241, 21)
Shape of y_train: (961,)
Shape of y_test: (241,)
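
For a single stratified split, `train_test_split` with the `stratify` argument is a shorter alternative to `StratifiedShuffleSplit`. The sketch below shows it on a synthetic frame (the `toy` data and column `x` are made up for illustration); it is not the approach used in the notebook, only an equivalent way to obtain one stratified hold-out set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 25% positive class
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x": rng.normal(size=100),
    "defects": [True] * 25 + [False] * 75,
})

# stratify keeps the class proportions (approximately) equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(columns="defects"),
    toy["defects"],
    test_size=0.2,
    stratify=toy["defects"],
    random_state=42,
)

print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```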


In [52]:
print("Target distribution in y_train:")
print(y_train.value_counts(normalize=True))

print("\nTarget distribution in y_test:")
print(y_test.value_counts(normalize=True))

Target distribution in y_train:
defects
False    0.740895
True     0.259105
Name: proportion, dtype: float64

Target distribution in y_test:
defects
False    0.738589
True     0.261411
Name: proportion, dtype: float64


First, let's identify the numerical and categorical columns that need preprocessing.

In [53]:
# Identify numerical and categorical features.
# X_train no longer contains the target 'defects'; the two bin columns are treated as categorical.
numerical_features = X_train.select_dtypes(include=np.number).columns.tolist()
categorical_features = X_train.select_dtypes(include='object').columns.tolist()

# Defensive check: make sure the bin columns end up in the categorical list only
for col in ['loc_qbin', 'v(g)_bin']:
    if col in numerical_features:
        numerical_features.remove(col)
    if col not in categorical_features:
        categorical_features.append(col)


print("Numerical features:", numerical_features)
print("Categorical features:", categorical_features)

Numerical features: ['ev(g)', 'iv(g)', 'n', 'v', 'l', 'd', 'i', 'e', 'b', 't', 'locode', 'locomment', 'loblank', 'loccodeandcomment', 'uniq_op', 'uniq_opnd', 'total_op', 'total_opnd', 'branchcount']
Categorical features: ['loc_qbin', 'v(g)_bin']


Now, let's build the preprocessing pipeline using `ColumnTransformer`. Numerical features are imputed with the median (`SimpleImputer`) and then standardized (`StandardScaler`). Categorical features are imputed with the most frequent value (`SimpleImputer`) and then one-hot encoded (`OneHotEncoder`).

In [54]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Create pipelines for numerical and categorical features
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')), # Use median for numerical imputation
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')), # Use most frequent for categorical imputation
    ('onehot', OneHotEncoder(handle_unknown='ignore')) # Handle unknown categories during transformation
])

# Combine preprocessing steps using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ],
    remainder='passthrough' # Keep any other columns that were not specified (though in this case all are specified)
)

# Fit the preprocessor on the training data
X_train_processed = preprocessor.fit_transform(X_train)

# Transform the test data
X_test_processed = preprocessor.transform(X_test)

print("Shape of processed training data:", X_train_processed.shape)
print("Shape of processed test data:", X_test_processed.shape)

Shape of processed training data: (961, 26)
Shape of processed test data: (241, 26)


Now that the data is preprocessed, let's train a RandomForestClassifier model and evaluate its accuracy.

In [55]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize and train the RandomForestClassifier model
# Using a random_state for reproducibility
model = RandomForestClassifier(random_state=42)
model.fit(X_train_processed, y_train)

# Make predictions on the processed test data
y_pred = model.predict(X_test_processed)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy of the RandomForestClassifier model: {accuracy:.4f}")

Accuracy of the RandomForestClassifier model: 0.7178
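
Accuracy alone can mislead on imbalanced data. The test set is about 73.9% non-defective, so always predicting "non-defective" would score roughly 0.739, which is above the 0.7178 reported here. The sketch below shows one way to compare a model against a majority-class baseline and to inspect per-class behaviour. It runs on synthetic data from `make_classification` (an assumption for illustration, not KC1), so its numbers say nothing about the notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             classification_report, confusion_matrix)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 74% / 26%), for illustration only
X_toy, y_toy = make_classification(n_samples=600, n_features=10,
                                   weights=[0.74, 0.26], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

base_pred = baseline.predict(X_te)
forest_pred = forest.predict(X_te)

for name, pred in [("baseline", base_pred), ("random forest", forest_pred)]:
    print(f"{name}: accuracy={accuracy_score(y_te, pred):.3f}, "
          f"balanced accuracy={balanced_accuracy_score(y_te, pred):.3f}")

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, forest_pred)
print(cm)
print(classification_report(y_te, forest_pred, zero_division=0))
```

On the notebook's data, the same calls with `y_test` and `y_pred` would show how many defective modules the model actually finds.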
