The data in the "eclipse_clear.json" file is preprocessed in the following steps:

1. Filtering: Entries with normal or X severity are removed, so that only the severity levels of interest remain.
2. Binary Severity Creation: A new column, 'binary_severity', assigns each remaining entry to one of two severity classes.
3. Bag-of-Words Preprocessing: The description text is cleaned for a bag-of-words representation; in the output further down, common words and punctuation have been removed from the descriptions.
4. Stemming: The cleaned text is then stemmed, reducing each word to its root form (for example, "released" becomes "releas"), and the result is stored in a separate 'stemmed_description' column.
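
The first two steps can be illustrated with plain pandas. This is only a sketch under stated assumptions, not the code in `baseline_functions`: the helper names `drop_excluded_severities` and `add_binary_severity` are hypothetical, the set of labels treated as severe is assumed, and only the normal level is excluded because the second excluded level is not spelled out above.

```python
import pandas as pd

# Assumption: which labels count as "severe" is not stated in this notebook.
SEVERE_LABELS = {"blocker", "critical", "major"}
# Assumption: only "normal" is listed here; the second excluded level is left unspecified.
EXCLUDED_LABELS = {"normal"}


def drop_excluded_severities(df, severity_col="bug_severity", excluded=EXCLUDED_LABELS):
    """Keep only rows whose severity is not in the excluded set (index is preserved)."""
    return df[~df[severity_col].isin(excluded)].copy()


def add_binary_severity(df, severity_col="bug_severity", severe=SEVERE_LABELS):
    """Add a 0/1 column: 1 for severe labels, 0 for everything else."""
    out = df.copy()
    out["binary_severity"] = out[severity_col].isin(severe).astype(int)
    return out


toy = pd.DataFrame(
    {
        "bug_id": [1, 2, 3, 4],
        "bug_severity": ["normal", "major", "minor", "critical"],
    }
)
result = add_binary_severity(drop_excluded_severities(toy))
print(result)
```

Because the filter is a boolean mask, the surviving rows keep their original index labels, which is consistent with the non-contiguous index values (44, 162, ...) in the outputs below.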

In [1]:
from pathlib import Path
import pandas as pd

from string import Template 

In [2]:
import sys

sys.path.append('../baseline/')
from baseline_functions import filter_bug_severity, create_binary_feature, preprocess_text


In [3]:
data_path = Path('../../data/eclipse_clear.json')

In [4]:
df_json = pd.read_json(data_path, lines=True)
df_json.head(2)

Unnamed: 0,_id,bug_id,product,description,bug_severity,dup_id,short_desc,priority,version,component,delta_ts,bug_status,creation_ts,resolution
0,{'$oid': '52e9b44754dc1c25ebdb1ee5'},3,Platform,KM (10/2/2001 5:55:18 PM)\n\tThis PR about the...,normal,[],Sync does not indicate deletion (1GIEN83),P5,2.0,Team,2010-05-07 10:28:53 -0400,RESOLVED,2001-10-10 21:34:00 -0400,FIXED
1,{'$oid': '52e9b44754dc1c25ebdb1ee6'},1,Platform,- Setup a project that contains a *.gif resour...,normal,[],Usability issue with external editors (1GE6IRL),P3,2.0,Team,2012-02-09 15:57:47 -0500,CLOSED,2001-10-10 21:34:00 -0400,FIXED


In [5]:
df_filtered = filter_bug_severity(df_json, severity_col='bug_severity')
df_binary = create_binary_feature(df_filtered, severity_col='bug_severity')

In [6]:
df_binary.head(2)

Unnamed: 0,bug_id,description,bug_severity,binary_severity
44,43,I have a project (Junk) that has been released...,major,1
162,163,AK (6/12/01 4:55:24 PM)\n\ti got this exceptio...,critical,1


In [7]:
df_preprocess = preprocess_text(df_binary)
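
The body of `preprocess_text` is not shown in this notebook. As a rough sketch of the kind of cleaning it appears to perform, the following uses only the standard library. The stop-word list, the punctuation rule, and the helper names `remove_stop_words` and `stem_text` are all assumptions made for illustration, and the stemmer is passed in as a parameter rather than fixed.

```python
import re
from typing import Callable, Iterable

# Assumption: a tiny illustrative stop-word list; the list actually used by
# preprocess_text is not shown in this notebook.
STOP_WORDS = frozenset(
    {"i", "have", "a", "that", "has", "been", "the", "this", "is", "and", "to", "of"}
)


def remove_stop_words(text: str, stop_words: Iterable[str] = STOP_WORDS) -> str:
    """Strip punctuation (except '/' and ':') and stop words, keeping original case."""
    stop = set(stop_words)
    # Replace every character that is not a word character, '/', ':' or whitespace.
    cleaned = re.sub(r"[^\w/:\s]", " ", text)
    tokens = [tok for tok in cleaned.split() if tok.lower() not in stop]
    return " ".join(tokens)


def stem_text(text: str, stem: Callable[[str], str]) -> str:
    """Apply a word-level stemming function to each whitespace-separated token."""
    return " ".join(stem(tok) for tok in text.split())


sample = "I have a project (Junk) that has been released"
cleaned = remove_stop_words(sample)
print(cleaned)
# str.lower is only a placeholder here; it lowercases but does not stem.
print(stem_text(cleaned, stem=str.lower))
```

With NLTK installed, `nltk.stem.PorterStemmer().stem` could be passed as `stem`; it lowercases its input and would reduce "released" to "releas", which matches the form seen in the 'stemmed_description' column below.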

In [8]:
df_preprocess.shape

(72192, 5)

In [9]:
df_preprocess.head(3)

Unnamed: 0,bug_id,bug_severity,binary_severity,description,stemmed_description
44,43,major,1,project Junk released teamstream rename projec...,project junk releas teamstream renam project a...
162,163,critical,1,AK 6/12/01 4:55:24 PM got exception last night...,ak 6/12/01 4:55:24 pm got except last night wo...
192,194,major,1,1 Add global ignore pattern BazProject 2 Creat...,1 add global ignor pattern bazproject 2 creat ...


In [10]:
df_preprocess.to_json("../../data/eclipse_72k.json", orient="records", indent=4)
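
Note that `orient="records"` without `lines=True` writes a single JSON array rather than JSON Lines, so the saved file is read back without `lines=True`, unlike the input file above. A minimal round trip on a toy frame (the file name `toy.json` and the values are made up for illustration):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with a non-contiguous index, similar to the filtered data above.
toy = pd.DataFrame(
    {"bug_id": [43, 163], "binary_severity": [1, 1]},
    index=[44, 162],
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "toy.json"  # hypothetical file name
    toy.to_json(out_path, orient="records", indent=4)
    # The file holds one JSON array, so it is read back without lines=True.
    restored = pd.read_json(out_path)

print(restored)
```

The records layout does not store the index, so the restored frame gets a fresh 0-based index; if the original row labels matter, they would need to be saved as a regular column first.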