## Part 2: Model Training and Evaluation
Goal: This notebook loads the clean data created in Part 1. Its sole focus is to build, train, and evaluate two separate machine learning models:
1.  A model to predict the **Component (`co`)**.
2.  A model to predict the **Priority (`pr`)**.


In [12]:
import pandas as pd
import pickle
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from lightgbm import LGBMClassifier

# --- Configuration ---
CLEAN_DATA_PATH = '../data/cleaned_bug_reports.csv'
OUTPUT_MODEL_PATH = '../models/final_multimodel.pkl'

### 1. Load Clean Data
Start by loading the `cleaned_bug_reports.csv` file that was the output of the previous notebook.

In [4]:
print(f"Loading clean data from {CLEAN_DATA_PATH}...")
df_final = pd.read_csv(CLEAN_DATA_PATH)
print("Clean data loaded.")
df_final.head()

Loading clean data from ../data/cleaned_bug_reports.csv...
Clean data loaded.


| Unnamed: 0 | sd | pd | os | bs | target_co_simplified | target_pr_simplified |
|---|---|---|---|---|---|---|
| 0 | pde quickfix creates invalid @since tag | PDE | Linux | VERIFIED | Other | P3 |
| 1 | grant access to projects storage service to th... | Community | Linux | CLOSED | CI-Jenkins | P3 |
| 2 | add relation information to rest-api | MDMBL | All | CLOSED | General | P3 |
| 3 | provide platform independent plug-in to set th... | Platform | Linux | CLOSED | UI | P3 |
| 4 | inline method refacting reports "inaccurate re... | JDT | Linux | RESOLVED | UI | P3 |


### 2. Define Features and Targets
Next, we separate the dataset into features (`X`) and our two distinct target variables (`y_component` and `y_priority`). We also initialize and fit a separate `LabelEncoder` for each target to convert their text labels into integers.

In [8]:
TEXT_COLUMN = 'sd'
CATEGORICAL_COLUMNS = ['pd', 'os', 'bs']

# Define features (X) and our two targets (y)
X = df_final[ [TEXT_COLUMN] + CATEGORICAL_COLUMNS ]
y_component = df_final['target_co_simplified']
y_priority = df_final['target_pr_simplified']

# Encode both targets with separate encoders
le_component = LabelEncoder()
le_priority = LabelEncoder()
y_component_enc = le_component.fit_transform(y_component)
y_priority_enc = le_priority.fit_transform(y_priority)
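
To illustrate what each encoder stores, here is a minimal sketch on a few made-up priority labels (the values below are illustrative, not taken from the dataset). `LabelEncoder` sorts the distinct labels and maps each one to its index, and `inverse_transform` maps predicted integers back to the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only; the real targets come from df_final.
sample_labels = ["P3", "P1", "P3", "P2"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(sample_labels)

# classes_ holds the sorted unique labels; each code is an index into it.
classes = le_demo.classes_.tolist()
codes = encoded.tolist()

# inverse_transform turns integer predictions back into label strings.
decoded = le_demo.inverse_transform([0, 2]).tolist()

print(classes)  # ['P1', 'P2', 'P3']
print(codes)    # [2, 0, 2, 1]
print(decoded)  # ['P1', 'P3']
```

This is why the fitted encoders need to be kept alongside the models: without them, a prediction is only an integer code.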

### 3. Train and Evaluate Component Prediction Model
First, we build and train the model to predict the component. We use a `ColumnTransformer` to handle text and categorical features differently, then bundle it all into a `Pipeline` with our `LGBMClassifier`.

In [9]:
# Create the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('text', TfidfVectorizer(stop_words='english', max_features=10000, ngram_range=(1, 2)), TEXT_COLUMN),
        ('categorical', OneHotEncoder(handle_unknown='ignore'), CATEGORICAL_COLUMNS)
    ])

# Create the full pipeline
component_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LGBMClassifier(random_state=42))
])

# Split data and train
X_train, X_test, y_train_co, y_test_co = train_test_split(
    X, y_component_enc, test_size=0.2, random_state=42, stratify=y_component_enc
)

print("\nTraining the Component model...")
component_pipeline.fit(X_train, y_train_co)
print("Component model training complete!")

# Evaluate the model
y_pred_co = component_pipeline.predict(X_test)
acc_co = accuracy_score(y_test_co, y_pred_co)
print(f"Component Model Test Accuracy: {acc_co:.4f}")


Training the Component model...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006608 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8437
[LightGBM] [Info] Number of data points in the train set: 6782, number of used features: 526
[LightGBM] [Info] Start training from score -4.077095
[LightGBM] [Info] Start training from score -3.634642
[LightGBM] [Info] Start training from score -4.727683
[LightGBM] [Info] Start training from score -3.050586
[LightGBM] [Info] Start training from score -4.177636
[LightGBM] [Info] Start training from score -4.851735
[LightGBM] [Info] Start training from score -4.833043
[LightGBM] [Info] Start training from score -1.975084
[LightGBM] [Info] Start training from score -3.902046
[LightGBM] [Info] Start training from score -3.198010
[LightGBM] [Info] Start training from score -4.870784
[LightGBM] [Info] St
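
As a small illustration of how the `ColumnTransformer` routes columns, the sketch below fits the same two kinds of transformer on a three-row toy frame. The rows are made up, and the vectorizer uses default settings rather than the notebook's `max_features` and `ngram_range`, so the resulting shape is only illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows using the notebook's column names.
toy = pd.DataFrame({
    "sd": ["editor crashes on save", "build fails on linux", "editor freezes"],
    "pd": ["JDT", "PDE", "JDT"],
    "os": ["Linux", "Linux", "All"],
    "bs": ["CLOSED", "RESOLVED", "CLOSED"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    # A single column name (string) passes a 1-D series, which TfidfVectorizer needs.
    ("text", TfidfVectorizer(), "sd"),
    # A list of names passes a 2-D frame, which OneHotEncoder needs.
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["pd", "os", "bs"]),
])

features = toy_preprocessor.fit_transform(toy)

fitted = toy_preprocessor.named_transformers_
n_text = len(fitted["text"].vocabulary_)
n_cat = sum(len(cats) for cats in fitted["categorical"].categories_)

# The output is the TF-IDF block and the one-hot block stacked side by side.
print(features.shape, n_text, n_cat)
```

Because `handle_unknown='ignore'` is set, a category that was not seen during fitting is encoded as all zeros in its one-hot block instead of raising an error at prediction time.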



### 4. Train and Evaluate Priority Prediction Model
We repeat the same process for the priority prediction model. The preprocessing configuration and pipeline structure are the same, but this time the pipeline is trained on the priority labels.

In [10]:
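from sklearn.base import clone

# Create the pipeline for the priority model.
# clone() gives this pipeline its own unfitted copy of the preprocessor. Passing the
# same object would refit the transformer already used by component_pipeline.
priority_pipeline = Pipeline(steps=[
    ('preprocessor', clone(preprocessor)),
    ('classifier', LGBMClassifier(random_state=42))
])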

# Split data and train
X_train, X_test, y_train_pr, y_test_pr = train_test_split(
    X, y_priority_enc, test_size=0.2, random_state=42, stratify=y_priority_enc
)

print("\nTraining the Priority model...")
priority_pipeline.fit(X_train, y_train_pr)
print("Priority model training complete!")

# Evaluate the model
y_pred_pr = priority_pipeline.predict(X_test)
acc_pr = accuracy_score(y_test_pr, y_pred_pr)
print(f"Priority Model Test Accuracy: {acc_pr:.4f}")


Training the Priority model...
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004143 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 8445
[LightGBM] [Info] Number of data points in the train set: 6782, number of used features: 524
[LightGBM] [Info] Start training from score -5.295667
[LightGBM] [Info] Start training from score -4.278733
[LightGBM] [Info] Start training from score -0.052831
[LightGBM] [Info] Start training from score -3.451389
[LightGBM] [Info] Start training from score -7.030268
Priority model training complete!
Priority Model Test Accuracy: 0.9393
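
One way to put the 0.9393 figure in context: the "Start training from score" values in the log above are consistent with the log of each class's share of the training set (their exponentials sum to about 1), so the largest class would account for roughly exp(-0.052831) ≈ 0.949 of the rows. A model that always predicts that class would score about the same on accuracy. The sketch below uses synthetic labels rather than the notebook's data and hypothetical variable names; it shows how a `DummyClassifier` baseline and a class-sensitive metric could be compared.

```python
import math
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Share of the largest class implied by the log line above.
majority_share = math.exp(-0.052831)

# Synthetic stand-in labels: 95 rows of class 2 and 5 rows of class 0.
y_demo = np.array([2] * 95 + [0] * 5)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

base_acc = accuracy_score(y_demo, y_base)
# Balanced accuracy averages per-class recall, so it exposes the missed minority class.
base_bal = balanced_accuracy_score(y_demo, y_base)

print(f"majority share: {majority_share:.3f}")
print(f"baseline accuracy: {base_acc:.2f}, balanced accuracy: {base_bal:.2f}")
```

In the notebook, the same comparison could be run with `y_test_pr` and `y_pred_pr` in place of the synthetic arrays.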




### 5. Save Both Models
Finally, we save both trained pipelines and their corresponding label encoders into a single `.pkl` file. This file contains everything the Flask API will need to make predictions.

In [13]:
with open(OUTPUT_MODEL_PATH, 'wb') as f:
    pickle.dump({
        'component_model': component_pipeline,
        'priority_model': priority_pipeline,
        'component_encoder': le_component,
        'priority_encoder': le_priority
    }, f)
print(f"\n Both models saved to {OUTPUT_MODEL_PATH}")


 Both models saved to ../models/final_multimodel.pkl
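
For reference, a consumer such as the Flask API could load the bundle and decode predictions roughly as follows. This is a sketch under assumptions: `predict_bug` and `make_stand_in` are hypothetical helpers, and the tiny `LogisticRegression` bundle is a stand-in built only so the example runs without the real `.pkl` file. Note that unpickling can execute arbitrary code, so only load files from a trusted source.

```python
import pickle
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


def predict_bug(bundle, record):
    """Hypothetical helper: return decoded (component, priority) for one bug report."""
    row = pd.DataFrame([record])  # pipelines expect a DataFrame with the training columns
    co = bundle["component_model"].predict(row)
    pr = bundle["priority_model"].predict(row)
    return (
        bundle["component_encoder"].inverse_transform(co)[0],
        bundle["priority_encoder"].inverse_transform(pr)[0],
    )


def make_stand_in(labels, frame):
    """Hypothetical helper: build a toy pipeline and encoder shaped like the saved ones."""
    encoder = LabelEncoder()
    pipe = Pipeline([
        ("preprocessor", ColumnTransformer([
            ("text", TfidfVectorizer(), "sd"),
            ("categorical", OneHotEncoder(handle_unknown="ignore"), ["pd", "os", "bs"]),
        ])),
        ("classifier", LogisticRegression()),  # stand-in for LGBMClassifier
    ])
    pipe.fit(frame, encoder.fit_transform(labels))
    return pipe, encoder


toy = pd.DataFrame({
    "sd": ["editor crashes on save", "build fails on linux"],
    "pd": ["JDT", "PDE"], "os": ["Linux", "Linux"], "bs": ["CLOSED", "RESOLVED"],
})
co_model, co_enc = make_stand_in(["UI", "Other"], toy)
pr_model, pr_enc = make_stand_in(["P3", "P2"], toy)

# Round-trip through pickle bytes, mirroring the dictionary layout saved above.
payload = pickle.dumps({
    "component_model": co_model, "priority_model": pr_model,
    "component_encoder": co_enc, "priority_encoder": pr_enc,
})
bundle = pickle.loads(payload)

component, priority = predict_bug(
    bundle, {"sd": "editor crashes", "pd": "JDT", "os": "Linux", "bs": "CLOSED"}
)
print(component, priority)
```

With the real file, `pickle.load` on `OUTPUT_MODEL_PATH` would replace the in-memory round trip, and the loading environment would need compatible versions of scikit-learn and LightGBM installed.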
