# Extra Trees Model Training (from scraped GitHub code)

This notebook trains an **Extra Trees** classifier on the processed metrics dataset, evaluates it on a held-out split, and saves the fitted model.


## 1) Load libraries

In [30]:
# Step 1: Imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, classification_report
import pickle
from pathlib import Path

## 2) Load dataset

In [31]:
df = pd.read_csv('../data/processed/dataset_processed.csv')
df.head()

Unnamed: 0,abbreviation_density,average_cyclomatic_complexity,avg_line_length,comment_code_mismatch_score,comment_lines,comment_percentage,decision_density,documentation_coverage,external_vs_internal_field_access_ratio,functions,global_usages_total,globals_declared,halstead_difficulty,halstead_effort,halstead_estimated_bugs,halstead_volume,inter_file_coupling,large_parameter_list_indicator,lazy_class_indicator,lines_added,lines_of_code,long_method_indicator,maintainability_score,max_cyclomatic_ratio,max_line_length,max_lines_per_function,max_nesting_level,mean_cyclomatic_ratio,mean_lines_per_function,mean_param_entropy,nesting_variance,percent_lines_over_80,source_lines,test_files_found,test_function_count,test_lines,test_to_source_ratio,total_imports,y_FeatureEnvy,y_FormattingIssues,y_LargeParameterList,y_MisleadingComments,y_PoorDocumentation,y_UntestedCode,complexity_score,code_health,doc_quality,has_tests,coupling_complexity,smell_density,effort_impact_ratio
0,0.0,0.666324,0.568106,0.0,1.0,1.0,0.623314,0.0,0.458333,0.125,0.0,0.0,0.310958,0.461683,0.409628,0.409485,0.5,False,False,0.247761,0.247761,False,0.545455,0.172667,0.684211,0.259259,0.375,0.254697,0.356322,0.0,0.259471,0.085183,0.25731,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.718428,0.880992,0.4165,0,0.541667,0.150928,0.44942
1,0.0,0.306982,0.531561,0.0,0.25,1.0,0.326629,0.0,0.125,0.0625,0.25,0.25,0.0,0.0,0.0,0.0,0.083333,False,False,0.059701,0.059701,False,0.727273,0.166667,0.473684,0.088889,0.125,0.26096,0.137931,0.613327,0.0,0.0,0.081871,1.0,1.0,1.0,1.0,0.230769,0.0,0.0,1.0,0.0,1.0,0.0,0.35237,0.932637,0.4525,1,0.083333,0.402474,0.0
2,0.333,0.204312,0.348837,0.0,0.0,0.0,0.1632,0.0,0.041667,0.0,0.25,0.25,0.0,0.0,0.0,0.0,0.083333,False,False,0.065672,0.065672,False,0.727273,0.333333,0.56391,0.02963,0.125,0.521921,0.045977,0.613327,0.0,0.311382,0.081871,0.0,0.0,0.0,0.0,0.230769,0.0,1.0,0.0,0.0,1.0,0.0,0.30012,0.929269,0.5,0,0.083333,0.551215,0.0
3,0.5,0.204312,0.325581,0.0,0.0,0.0,0.104,0.0,0.041667,0.0,0.25,0.25,0.067127,0.016964,0.070242,0.070217,0.083333,False,False,0.083582,0.083582,False,0.727273,0.266667,0.398496,0.037037,0.125,0.417537,0.057471,0.643482,0.0,0.0,0.128655,0.0,0.0,0.0,0.0,0.230769,0.0,0.0,0.0,0.0,1.0,0.0,0.27922,0.932637,0.5,0,0.083333,0.291447,0.019656
4,0.5,0.204312,0.320598,0.0,0.0,0.0,0.099429,0.0,0.041667,0.0,0.25,0.25,0.0,0.0,0.0,0.0,0.083333,False,False,0.089552,0.089552,False,0.727273,0.266667,0.406015,0.037037,0.125,0.417537,0.057471,0.643482,0.0,0.0,0.134503,0.0,0.0,0.0,0.0,0.307692,0.0,0.0,0.0,0.0,1.0,0.0,0.27922,0.932637,0.5,0,0.083333,0.272644,0.0


In [32]:
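# Features: every column except the last. Target: the last column (effort_impact_ratio).
# Note: X still contains the CSV index column ("Unnamed: 0") and the y_-prefixed columns.
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Classifiers require discrete labels, so a continuous numeric target is binarized at its median.
# Caveat: with `>=`, every row lands in class 1 whenever the median equals the minimum value.
if pd.api.types.is_numeric_dtype(y):
    y = (y >= y.median()).astype(int)

y.value_counts(dropna=False)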

effort_impact_ratio
1    3088
Name: count, dtype: int64
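
The counts above show a single class (all 3088 rows labeled 1), which is what `y >= y.median()` produces when the median equals the smallest value. A minimal sketch of a guard follows. It assumes a strict `>` comparison is an acceptable alternative threshold. The helper name `binarize_at_median` and the toy values are hypothetical and are not taken from the dataset.

```python
import pandas as pd


def binarize_at_median(y: pd.Series) -> pd.Series:
    """Label rows strictly above the median as 1, the rest as 0.

    Raises ValueError if the result contains only one class.
    """
    labels = (y > y.median()).astype(int)
    if labels.nunique() < 2:
        raise ValueError(
            "Target collapsed to a single class; "
            "choose another threshold or another target column."
        )
    return labels


# Hypothetical toy values: most entries are 0.0, so the median is also 0.0.
toy = pd.Series([0.0, 0.0, 0.0, 0.02, 0.45], name="effort_impact_ratio")

inclusive = (toy >= toy.median()).astype(int)  # the notebook's rule
strict = binarize_at_median(toy)               # the guarded alternative

print(inclusive.value_counts().to_dict())  # {1: 5}
print(strict.value_counts().to_dict())     # {0: 3, 1: 2}
```

If one of the `y_`-prefixed columns is the intended target (an assumption; the notebook does not say), selecting it by name would avoid the thresholding step altogether.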

## 3) Split the data

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
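
Once the target holds more than one class, passing `stratify` to `train_test_split` keeps the class proportions similar in both splits. The sketch below uses synthetic data with an assumed 80/20 class ratio. The names `X_toy` and `y_toy` are placeholders, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 20% positive class.
rng = np.random.default_rng(42)
X_toy = pd.DataFrame(rng.random((100, 3)), columns=["f1", "f2", "f3"])
y_toy = pd.Series([0] * 80 + [1] * 20)

# stratify=y_toy asks the splitter to preserve the class ratio in each split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```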

## 4) Train Extra Trees

In [34]:
# Step 5: Initialize + train model
et_model = ExtraTreesClassifier(
    n_estimators=200,
    max_depth=None,
    random_state=42,
    n_jobs=-1,
)
et_model.fit(X_train, y_train)

parameter,value
n_estimators,200
criterion,'gini'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,'sqrt'
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,False


## 5) Evaluate

In [35]:
# Step 6: Make predictions
y_pred = et_model.predict(X_test)

# Step 7: Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 1.0

Classification Report:
               precision    recall  f1-score   support

           1       1.00      1.00      1.00       927

    accuracy                           1.00       927
   macro avg       1.00      1.00      1.00       927
weighted avg       1.00      1.00      1.00       927
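
The report lists only class 1, with a support of 927, so an accuracy of 1.0 is reachable by always predicting that class. A minimal sketch of a majority-class baseline check is shown here. It uses placeholder arrays that mirror the reported support rather than the real test split, and `X_placeholder` is a hypothetical name.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Placeholder labels mirroring the report: 927 test rows, all in class 1.
y_true = np.ones(927, dtype=int)
X_placeholder = np.zeros((927, 1))  # DummyClassifier ignores feature values

# A baseline that always predicts the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_true)
baseline_acc = accuracy_score(y_true, baseline.predict(X_placeholder))

print("Baseline accuracy:", baseline_acc)  # 1.0
```

A trained model is only informative when its score exceeds this kind of baseline on a test set that contains every class.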



## 6) Save model

In [36]:
# Step 8: Save model (pickle and Path were imported in Step 1)

models_dir = Path("../models")
models_dir.mkdir(parents=True, exist_ok=True)

model_path = models_dir / "extra_trees_classifier.pkl"
with open(model_path, "wb") as f:
    pickle.dump(et_model, f)

print("Extra Trees classifier saved successfully as pickle! ->", model_path)

Extra Trees classifier saved successfully as pickle! -> ../models/extra_trees_classifier.pkl
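
To reuse the saved model, the pickle can be read back with `pickle.load`. The sketch below is a self-contained round trip on a small synthetic model written to a temporary directory, so it does not touch `../models`. The toy data and variable names are assumptions for illustration.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Small synthetic training set standing in for the real features and labels.
rng = np.random.default_rng(0)
X_toy = rng.random((40, 4))
y_toy = (X_toy[:, 0] > 0.5).astype(int)
model = ExtraTreesClassifier(n_estimators=10, random_state=42).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "extra_trees_classifier.pkl"
    # Save the fitted model, then load it back from disk.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print("Predictions match after reload:", same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.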
