### Goal
The purpose of this notebook is to load the raw `Eclipse.csv` dataset, perform all necessary cleaning and preprocessing steps, and save a final, clean DataFrame. This clean data will be used in the next notebook for model training.

### Key Steps:
1.  Load the raw, semicolon-separated data.
2.  Clean the text descriptions.
3.  Analyze and simplify the `component` and `priority` target variables.
4.  Save the final, analysis-ready dataset.

### 1. Load Raw Data

In [20]:
import pandas as pd

raw_data_path = '../data/Eclipse.csv'

# --- Helper Function for Cleaning Text ---
def clean_text(s):
    if pd.isna(s):
        return ""
    return str(s).lower().replace("\n"," ").replace("\r"," ")

# Load the data into a pandas DataFrame (rows that fail to parse are skipped)
df = pd.read_csv(raw_data_path, on_bad_lines='skip', sep=';')
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8478 entries, 0 to 8477
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   bugID   8478 non-null   int64 
 1   sd      8478 non-null   object
 2   cl      8478 non-null   object
 3   pd      8478 non-null   object
 4   co      8478 non-null   object
 5   rp      8478 non-null   object
 6   os      8478 non-null   object
 7   bs      8478 non-null   object
 8   rs      7363 non-null   object
 9   pr      8478 non-null   object
 10  bsr     8478 non-null   object
dtypes: int64(1), object(10)
memory usage: 728.7+ KB
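
The `clean_text` helper defined in the cell above replaces `\n` and `\r` one at a time, so a Windows-style `\r\n` line break becomes two spaces. The sketch below shows that behavior on a made-up string and compares it with a hypothetical variant, `clean_text_collapsed`, which is not part of the notebook and is only an assumption about what a stricter cleaner could look like.

```python
import pandas as pd

def clean_text(s):
    # Same helper as in the notebook: missing values become an empty string
    if pd.isna(s):
        return ""
    return str(s).lower().replace("\n", " ").replace("\r", " ")

def clean_text_collapsed(s):
    # Hypothetical variant: also collapses any run of whitespace into one space
    if pd.isna(s):
        return ""
    return " ".join(str(s).lower().split())

# Made-up example string, not taken from Eclipse.csv
sample = "NPE in\r\nEditor  Pane"

original_result = clean_text(sample)            # keeps doubled spaces
collapsed_result = clean_text_collapsed(sample)  # single spaces only
missing_result = clean_text(float("nan"))       # empty string for NaN

print(repr(original_result))
print(repr(collapsed_result))
print(repr(missing_result))
```

Whether the extra spaces matter depends on the tokenizer used in the next notebook; most whitespace-based tokenizers ignore them.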


Note: the column abbreviations stand for the following:
- bugID: Bug ID - The unique identification number for each bug report.
- sd: Short Description - The title or summary of the bug. (the main text feature)
- cl: Classification - The general category of the software.
- pd: Product - The specific software product where the bug was found.
- co: Component - The particular part of the product or the team responsible for it. (a prediction target)
- rp: Reporter - The person who submitted the bug report.
- os: Operating System - The OS on which the bug was discovered (e.g., Windows, macOS, Linux).
- bs: Bug Severity - The impact of the bug on the system (e.g., critical, major, minor, trivial).
- rs: Resolution Status - The final outcome of the bug report (e.g., FIXED, WONTFIX, DUPLICATE).
- pr: Priority - How urgently the bug should be fixed (e.g., P1, P2, P3). (a prediction target)
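
If the two-letter column names become hard to follow, the abbreviations listed above could be mapped to descriptive names. This is only a sketch: the long names in `COLUMN_NAMES` and the `rename_columns` helper are hypothetical and are not used anywhere else in this notebook, which keeps the short names. Columns missing from the mapping, such as `bsr`, are left unchanged.

```python
import pandas as pd

# Hypothetical descriptive names for the abbreviations listed above
COLUMN_NAMES = {
    "bugID": "bug_id",
    "sd": "short_description",
    "cl": "classification",
    "pd": "product",
    "co": "component",
    "rp": "reporter",
    "os": "operating_system",
    "bs": "bug_severity",
    "rs": "resolution_status",
    "pr": "priority",
}

def rename_columns(frame):
    # DataFrame.rename ignores mapping keys that are absent from the frame
    # and leaves unmapped columns untouched
    return frame.rename(columns=COLUMN_NAMES)

# Made-up one-row frame for illustration
sample = pd.DataFrame({"bugID": [1], "sd": ["crash on save"], "co": ["UI"], "bsr": ["x"]})
renamed = rename_columns(sample)
print(list(renamed.columns))
```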

### 2. Clean and Prepare Data

In [21]:
# Clean the main text column
df['sd'] = df['sd'].apply(clean_text)

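# Simplify the 'co' (component) column: components with fewer than
# threshold_co reports are grouped under a single 'Other' label
target_col_co = df['co'].dropna()
threshold_co = 50
value_counts_co = target_col_co.value_counts()
to_replace_co = value_counts_co[value_counts_co < threshold_co].index
df['target_co_simplified'] = df['co'].replace(to_replace_co, 'Other').astype(str)

# Prepare the 'pr' (priority) column
# Note: astype(str) would turn a missing value into the string 'nan';
# 'co' and 'pr' have no nulls in this dataset, so that does not occur here
df['target_pr_simplified'] = df['pr'].astype(str)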

print("Text cleaned and target variables prepared.")

Text cleaned and target variables prepared.
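
The rare-category grouping used for `co` can be checked on a small made-up series. The sketch below wraps the same idea (count values, find those below a threshold, relabel them) in a hypothetical helper, `group_rare_categories`; the helper name and the toy data are assumptions for illustration, and it uses `Series.where` instead of `replace`, which should give the same labels for non-missing values.

```python
import pandas as pd

def group_rare_categories(series, threshold, other_label="Other"):
    # Count occurrences of each category (NaN is excluded by default)
    counts = series.value_counts()
    # Categories seen fewer than `threshold` times are considered rare
    rare = counts[counts < threshold].index
    # Keep non-rare values, replace rare ones with the catch-all label
    return series.where(~series.isin(rare), other_label)

# Toy data: 'UI' appears 3 times, 'Core' twice, 'Debug' once
toy = pd.Series(["UI", "UI", "UI", "Core", "Core", "Debug"])
grouped = group_rare_categories(toy, threshold=2)

print(grouped.value_counts())
```

Running `value_counts()` on the simplified column in the same way is a quick check that the threshold leaves a reasonable number of classes and that `Other` does not dominate.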


### 3. Save the Clean Data

In [22]:
# Define all columns that need to be carried over
features_to_use = ['sd', 'pd', 'os', 'bs']
all_needed_columns = features_to_use + ['target_co_simplified', 'target_pr_simplified']

# Create the final clean DataFrame
df_clean = df[all_needed_columns].dropna()

# Save the clean data to a new file
CLEAN_DATA_PATH = '../data/cleaned_bug_reports.csv'
df_clean.to_csv(CLEAN_DATA_PATH, index=False)

print(f"Clean data successfully saved to {CLEAN_DATA_PATH}")
print(f"Final shape of the clean dataset: {df_clean.shape}")

Clean data successfully saved to ../data/cleaned_bug_reports.csv
Final shape of the clean dataset: (8478, 6)
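
One thing to watch when the next notebook reloads the file: `clean_text` turns missing descriptions into empty strings, and `pd.read_csv` reads empty fields back as `NaN` by default. The sketch below shows this round trip on a made-up two-row frame written to a temporary file; the toy data and file name are assumptions, and `keep_default_na=False` is one possible way to keep the empty strings, not necessarily what the next notebook does.

```python
import os
import tempfile

import pandas as pd

# Made-up frame: the second description is an empty string
toy = pd.DataFrame({
    "sd": ["crash on save", ""],
    "target_pr_simplified": ["P1", "P3"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_toy.csv")
    toy.to_csv(path, index=False)

    # Default behavior: the empty field is parsed as NaN
    default_read = pd.read_csv(path)
    # With keep_default_na=False the empty field stays an empty string
    safe_read = pd.read_csv(path, keep_default_na=False)

print(default_read["sd"].isna().sum())
print(safe_read["sd"].tolist())
```

Alternatively, the reloading code can call `fillna("")` on the text column after a default `read_csv`.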
