### This notebook details the preparation of the data to be fed into the model.

First, we'll load the raw `Eclipse.csv` dataset, skipping any lines that fail to parse, and then explore its initial structure to understand our features.

In [2]:
import pandas as pd

file_path = '../data/Eclipse.csv'

# Load the data into a pandas DataFrame
df = pd.read_csv(file_path, on_bad_lines='skip', sep=';')

# Look at the first 5 rows
print(df.head())

    bugID                                                 sd  \
0  550000            PDE quickfix creates invalid @Since tag   
1  550001  Grant access to projects storage service to th...   
2  550002               Add relation information to REST-API   
3  550003  Provide platform independent plug-in to set th...   
4  550004  Inline method refacting reports "Inaccurate re...   

                   cl         pd          co   rp     os        bs  \
0     Eclipse Project        PDE   API Tools   PC  Linux  VERIFIED   
1  Eclipse Foundation  Community  CI-Jenkins   PC  Linux    CLOSED   
2          Automotive      MDMBL     General  All    All    CLOSED   
3     Eclipse Project   Platform          UI   PC  Linux    CLOSED   
4     Eclipse Project        JDT          UI   PC  Linux  RESOLVED   

           rs  pr     bsr  
0       FIXED  P3  normal  
1   DUPLICATE  P3  normal  
2       FIXED  P3   major  
3   DUPLICATE  P3  normal  
4  WORKSFORME  P3  normal  
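Because `on_bad_lines='skip'` drops malformed rows silently, it can help to know how many were lost. The following is a minimal sketch, assuming pandas 1.4 or later (where `on_bad_lines` accepts a callable when `engine='python'`). The inline CSV text and the `record_bad_line` helper are made up for illustration and are not part of `Eclipse.csv`.

```python
import io
import pandas as pd

# Illustrative CSV text: the second data row has one field too many
raw = (
    "bugID;sd;co\n"
    "1;Editor crash;UI\n"
    "2;Bad;row;extra\n"
    "3;Build fails;Core\n"
)

skipped = []  # collects the fields of every malformed line

def record_bad_line(fields):
    """Hypothetical helper: remember the bad line, then drop it."""
    skipped.append(fields)
    return None  # returning None tells pandas to skip the line

df_demo = pd.read_csv(
    io.StringIO(raw),
    sep=';',
    engine='python',            # callables are only supported by the Python engine
    on_bad_lines=record_bad_line,
)

print(f"Rows kept: {len(df_demo)}, rows skipped: {len(skipped)}")
```

The Python engine is slower than the default C engine, so this is probably best used as a one-off diagnostic rather than as the regular loading path.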


In [3]:
print(df.columns)

Index(['bugID', 'sd', 'cl', 'pd', 'co', 'rp', 'os', 'bs', 'rs', 'pr', 'bsr'], dtype='object')


NOTE: THIS IS WHAT THE COLUMN SHORT-FORMS STAND FOR
- bugID: Bug ID - The unique identification number of each bug report.
- sd: Short Description - The title or summary of the bug (the most important feature and the model's input).
- cl: Classification - The general category of the software.
- pd: Product - The specific software product where the bug was found.
- co: Component - The particular part of the product or the team responsible for it (what the model will predict).
- rp: Platform - The hardware platform on which the bug was reported (e.g., PC, All).
- os: Operating System - The OS on which the bug was discovered (e.g., Windows, macOS, Linux).
- bs: Bug Status - The current state of the bug report (e.g., VERIFIED, CLOSED, RESOLVED).
- rs: Resolution - The final outcome of the bug report (e.g., FIXED, WONTFIX, DUPLICATE).
- pr: Priority - How urgently the bug should be fixed (e.g., P1, P2, P3).
- bsr: Bug Severity - The impact of the bug on the system (e.g., critical, major, normal, minor, trivial).
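The abbreviations listed above can be swapped for readable names when exploring the data. This is only a sketch: the long names in `COLUMN_NAMES` and the `rename_columns` helper are hypothetical choices, not names defined by the dataset, and the rest of this notebook keeps the original short forms.

```python
import pandas as pd

# Hypothetical readable names for the abbreviated columns (illustrative only)
COLUMN_NAMES = {
    'bugID': 'bug_id',
    'sd': 'short_description',
    'cl': 'classification',
    'pd': 'product',
    'co': 'component',
    'rp': 'platform',
    'os': 'operating_system',
    'bs': 'bug_status',
    'rs': 'resolution',
    'pr': 'priority',
    'bsr': 'bug_severity',
}

def rename_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with readable column names; the input frame is unchanged."""
    return frame.rename(columns=COLUMN_NAMES)

# Tiny stand-in frame using one row from the output shown earlier
sample = pd.DataFrame({
    'bugID': [550000],
    'sd': ['PDE quickfix creates invalid @Since tag'],
    'co': ['API Tools'],
})
renamed = rename_columns(sample)
print(list(renamed.columns))
```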

In [4]:
# Select the input feature (X) and the target label (y)
df_clean = df[['sd', 'co']].copy()

# Drop rows where either the description or the component is missing
df_clean.dropna(inplace=True)

# Assign to X and y
X = df_clean['sd']
y = df_clean['co']

print("Data prepared!")
print(f"Number of bug reports: {len(X)}")

Data prepared!
Number of bug reports: 8478
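Since `co` is the prediction target, it is worth checking how many distinct components there are and how unevenly the reports are spread across them. The sketch below uses a toy series (`y_demo`, with made-up values) in place of the real `y`; the threshold of 2 is an arbitrary illustration.

```python
import pandas as pd

# Toy stand-in for the `co` target column (values are illustrative)
y_demo = pd.Series(['UI', 'UI', 'UI', 'Core', 'Core', 'Debug'])

# Number of reports per component, most frequent first
class_counts = y_demo.value_counts()

# Number of distinct components the model would have to tell apart
n_classes = y_demo.nunique()

# Components with fewer than 2 reports are hard to learn and to evaluate
rare = class_counts[class_counts < 2].index.tolist()

print(class_counts)
print(f"Distinct components: {n_classes}, rare components: {rare}")
```

Running the same three lines on the real `y` would show whether a handful of components dominate the dataset.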


### Splitting the data into:
- **Training set** → used by the model to learn patterns.  
- **Testing set** → used to evaluate how well the model performs on new data.  

In [5]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")

Training set size: 6782
Testing set size: 1696
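The split above is purely random. One possible variation, not used in this notebook, is a stratified split, which keeps each component's share roughly equal in both sets. The sketch below runs on toy data (`X_toy`, `y_toy` are made up). Note that `stratify` raises a `ValueError` if any class has fewer than two members, so rare components would have to be filtered out or merged first.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 30 reports spread evenly over 3 components (illustrative only)
X_toy = pd.Series([f"report {i}" for i in range(30)])
y_toy = pd.Series(['UI'] * 10 + ['Core'] * 10 + ['Debug'] * 10)

# stratify=y_toy preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"Training set size: {len(X_tr)}")
print(f"Testing set size: {len(X_te)}")
print(y_te.value_counts().to_dict())
```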


### Converting Text to Numerical Features (TF-IDF)

Machine learning models can’t work directly with raw text.  
We use **TF-IDF (Term Frequency–Inverse Document Frequency)** to turn text into numerical feature vectors.


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

# Learn the vocabulary from the training data and transform it
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data using the same learned vocabulary
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print("Text data has been vectorized!")
print(f"Shape of the training data vectors: {X_train_tfidf.shape}")

Text data has been vectorized!
Shape of the training data vectors: (6782, 5000)
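To see what "fit on train, transform on test" means in practice, here is a small sketch on a made-up corpus (the three sentences are not from the dataset). The vocabulary comes only from the training sentences, so a word that appears only in the test sentence gets no column and is ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example sentences (not taken from Eclipse.csv)
train_docs = ["editor crashes on save", "debugger crashes on launch"]
test_docs = ["editor freezes on launch"]  # "freezes" never occurs in training

vectorizer = TfidfVectorizer(stop_words='english')

# Learn the vocabulary and IDF weights from the training sentences only
train_matrix = vectorizer.fit_transform(train_docs)

# Reuse that vocabulary for the test sentence; unseen words are dropped
test_matrix = vectorizer.transform(test_docs)

vocabulary = sorted(vectorizer.vocabulary_)
print(f"Vocabulary: {vocabulary}")
print(f"Train shape: {train_matrix.shape}, test shape: {test_matrix.shape}")
print(f"Non-zero entries in the test vector: {test_matrix.nnz}")
```

Both matrices have the same number of columns, which is what lets a model trained on one be applied to the other.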
