The data in the "eclipse_clear.json" file is preprocessed in four steps:

1. Filtering: Entries with normal or X severity are removed, so the data covers only the severity levels of interest.
2. Binary Severity Creation: A new column, 'binary_severity', assigns each remaining entry to one of two severity classes (in the outputs below, major and critical map to 1 and minor maps to 0).
3. Bag-of-Words Preprocessing: The text is cleaned for a bag-of-words representation; in the outputs below, common function words and punctuation such as parentheses are removed.
4. Stemming: Each word of the cleaned text is reduced to its stem (for example, "released" becomes "releas"), and the result is stored in 'stemmed_description'.
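
Steps 1 and 2 can be sketched in pandas as follows. The helpers imported from `baseline_functions` are not shown in this notebook, so the function names, the `drop` default, and the `SEVERE` set below are assumptions made for illustration; only the mapping of major and critical to 1 and of minor to 0 is visible in the notebook outputs.

```python
import pandas as pd

# Hypothetical stand-ins for filter_bug_severity / create_binary_feature;
# the real helpers in baseline_functions may behave differently.
SEVERE = {"blocker", "critical", "major"}  # assumed "severe" class


def sketch_filter_severity(df, severity_col="bug_severity", drop=("normal",)):
    """Keep only rows whose severity is not listed in `drop`."""
    return df[~df[severity_col].isin(drop)].copy()


def sketch_binary_feature(df, severity_col="bug_severity"):
    """Add binary_severity: 1 for assumed severe levels, 0 otherwise."""
    out = df.copy()
    out["binary_severity"] = out[severity_col].isin(SEVERE).astype(int)
    return out


toy = pd.DataFrame({
    "bug_id": [3, 43, 163, 35],
    "bug_severity": ["normal", "major", "critical", "minor"],
})
labelled = sketch_binary_feature(sketch_filter_severity(toy))
print(labelled)
```

Filtering keeps the original index, which is consistent with the non-contiguous index values (44, 162, ...) in the outputs of this notebook.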

In [1]:
from pathlib import Path
import pandas as pd

from string import Template 

In [2]:
import sys

sys.path.append('../baseline/')
from baseline_functions import filter_bug_severity, create_binary_feature, preprocess_text



In [3]:
data_path = Path('../../data/eclipse_clear.json')

In [4]:
df_json = pd.read_json(data_path, lines=True)
df_json.head(2)

Unnamed: 0,_id,bug_id,product,description,bug_severity,dup_id,short_desc,priority,version,component,delta_ts,bug_status,creation_ts,resolution
0,{'$oid': '52e9b44754dc1c25ebdb1ee5'},3,Platform,KM (10/2/2001 5:55:18 PM)\n\tThis PR about the...,normal,[],Sync does not indicate deletion (1GIEN83),P5,2.0,Team,2010-05-07 10:28:53 -0400,RESOLVED,2001-10-10 21:34:00 -0400,FIXED
1,{'$oid': '52e9b44754dc1c25ebdb1ee6'},1,Platform,- Setup a project that contains a *.gif resour...,normal,[],Usability issue with external editors (1GE6IRL),P3,2.0,Team,2012-02-09 15:57:47 -0500,CLOSED,2001-10-10 21:34:00 -0400,FIXED


In [5]:
df_filtered = filter_bug_severity(df_json, severity_col='bug_severity')
df_binary = create_binary_feature(df_filtered, severity_col='bug_severity')

In [6]:
df_binary.head(2)

Unnamed: 0,bug_id,description,bug_severity,binary_severity
44,43,I have a project (Junk) that has been released...,major,1
162,163,AK (6/12/01 4:55:24 PM)\n\ti got this exceptio...,critical,1


In [7]:
df_preprocess = preprocess_text(df_binary)

In [8]:
df_preprocess.shape

(72192, 5)

In [9]:
df_preprocess.head(3)

Unnamed: 0,bug_id,bug_severity,binary_severity,description,stemmed_description
44,43,major,1,project Junk released teamstream rename projec...,project junk releas teamstream renam project a...
162,163,critical,1,AK 6/12/01 4:55:24 PM got exception last night...,ak 6/12/01 4:55:24 pm got except last night wo...
192,194,major,1,1 Add global ignore pattern BazProject 2 Creat...,1 add global ignor pattern bazproject 2 creat ...
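
The two text columns above are consistent with stop-word removal followed by Porter-style stemming. A minimal sketch of that idea, assuming NLTK's `PorterStemmer` and a tiny inline stop list (both are assumptions; the actual `preprocess_text` implementation is not shown and may tokenize differently):

```python
import string

from nltk.stem import PorterStemmer

# Assumed stop list for illustration only; the real helper likely uses a
# much larger one.
STOP_WORDS = {"i", "have", "a", "that", "has", "been"}
stemmer = PorterStemmer()


def sketch_clean(text):
    """Strip punctuation from token edges and drop stop words."""
    tokens = (t.strip(string.punctuation) for t in text.split())
    return " ".join(t for t in tokens if t and t.lower() not in STOP_WORDS)


def sketch_stem(text):
    """Stem each whitespace-separated token (PorterStemmer lowercases)."""
    return " ".join(stemmer.stem(t) for t in text.split())


cleaned = sketch_clean("I have a project (Junk) that has been released")
stemmed = sketch_stem(cleaned)
print(cleaned)
print(stemmed)
```

Stripping punctuation only at token edges leaves tokens such as `6/12/01` intact, as in the second row of the output above.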


In [10]:
df_preprocess.to_json("../../data/eclipse_72k.json", orient="records", indent=4)
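
Note that the input file is read with `lines=True` (one JSON object per line), whereas `orient="records"` with `indent=4` writes a single indented JSON array. A small sketch of the difference, using a toy frame rather than the real data:

```python
from io import StringIO

import pandas as pd

toy = pd.DataFrame({
    "bug_id": [43, 163],
    "stemmed_description": ["project junk releas", "ak got except"],
})

# With no path argument, to_json returns the JSON text instead of writing a file.
as_array = toy.to_json(orient="records", indent=4)   # one JSON array
as_lines = toy.to_json(orient="records", lines=True)  # one object per line

# The array form is read back without lines=True; the line form needs it.
from_array = pd.read_json(StringIO(as_array), orient="records")
from_lines = pd.read_json(StringIO(as_lines), lines=True)
print(from_array)
```

Any later notebook that loads "eclipse_72k.json" would therefore read it without `lines=True`.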

# Mozilla

In [4]:
mozilla_path = Path('../../data/mozilla_mozall.json')
mozilla_json = pd.read_json(mozilla_path, lines=True)
mozilla_json.shape

(768335, 14)

In [13]:
mozilla_json.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768335 entries, 0 to 768334
Data columns (total 14 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   _id           768335 non-null  object
 1   bug_id        768335 non-null  int64 
 2   product       768335 non-null  object
 3   description   768326 non-null  object
 4   bug_severity  768335 non-null  object
 5   dup_id        768335 non-null  object
 6   short_desc    768335 non-null  object
 7   priority      768335 non-null  object
 8   version       768335 non-null  object
 9   component     768335 non-null  object
 10  delta_ts      768335 non-null  object
 11  bug_status    768335 non-null  object
 12  creation_ts   768335 non-null  object
 13  resolution    768335 non-null  object
dtypes: int64(1), object(13)
memory usage: 82.1+ MB


In [24]:
mozilla_json.head(10)

Unnamed: 0,_id,bug_id,product,description,bug_severity,dup_id,short_desc,priority,version,component,delta_ts,bug_status,creation_ts,resolution
0,{'$oid': '52eaece454dc1c410c4fbc01'},35,MozillaClassic,Created by (weitsang@cs.cornell.edu) on Mond...,minor,[],Navigator does not free preference hash table ...,P3,1998-03-31,XFE,2013-11-19 18:16:47 -0800,VERIFIED,1998-04-07 01:37:03 -0700,WONTFIX
1,{'$oid': '52eaece454dc1c410c4fbc02'},36,SeaMonkey,Created by J. Daniel Powell (dan@java-linux.or...,critical,[],Floating Point Exception on Execution,P2,Trunk,Build Config,2012-10-31 18:36:36 -0700,VERIFIED,1998-04-07 02:04:03 -0700,INVALID
2,{'$oid': '52eaece454dc1c410c4fbc03'},37,MozillaClassic,Created by Chen Ronghua (chenrh@usa.net) on Mo...,normal,[],Preference Dialog does not show,P2,1998-03-31,Windows FE,2000-12-25 17:53:17 -0800,VERIFIED,1998-04-07 02:20:01 -0700,FIXED
3,{'$oid': '52eaece454dc1c410c4fbc04'},39,MozillaClassic,Created by Chen Ronghua (chenrh@usa.net) on Mo...,normal,[],Bookmark properties leads to an Assert failed,P2,1998-03-31,Aurora/RDF BE,2013-11-19 23:42:54 -0800,VERIFIED,1998-04-07 02:34:14 -0700,WONTFIX
4,{'$oid': '52eaece454dc1c410c4fbc05'},42,MozillaClassic,Created by Stephan Nagy (steph8@flash.net) on...,minor,[],navigator redraw after initial startup,P2,1998-03-31,XFE,2013-07-22 06:53:51 -0700,VERIFIED,1998-04-07 05:42:04 -0700,FIXED
5,{'$oid': '52eaece454dc1c410c4fbc06'},38,MozillaClassic,Created by Chen Ronghua (chenrh@usa.net) on Mo...,normal,[],Close Mozilla lead to a Assert Failed,P2,1998-03-31,NetLib,2000-12-25 17:52:34 -0800,VERIFIED,1998-04-07 02:30:03 -0700,FIXED
6,{'$oid': '52eaece454dc1c410c4fbc07'},41,MozillaClassic,Created by Chris Kennedy (chris@groovy.org) on...,critical,[],Multiple Netscape windows cannot recieve data ...,P2,1998-03-31,XFE,2003-04-16 06:31:21 -0700,VERIFIED,1998-04-07 05:05:25 -0700,WONTFIX
7,{'$oid': '52eaece554dc1c410c4fbc08'},43,MozillaClassic,Created by Aleksandar Totic (atotic@netscape.c...,normal,[],Just testing,P3,1998-03-31,Macintosh FE,2002-09-13 16:18:32 -0700,VERIFIED,1998-04-07 06:53:01 -0700,INVALID
8,{'$oid': '52eaece554dc1c410c4fbc09'},51,MozillaClassic,Created by Svein Erik Brostigen (sbrostig@no.o...,normal,[],Resizing communicator changes display format o...,P2,1998-03-31,StubFE,2000-12-25 17:52:41 -0800,VERIFIED,1998-04-07 08:22:31 -0700,FIXED
9,{'$oid': '52eaece554dc1c410c4fbc0a'},61,MozillaClassic,Created by Jukka Santala (donwulff@iki.fi) on ...,minor,[],Navigator shutdown set-zero-context bug,P3,1998-03-31,Windows FE,2000-11-07 10:41:58 -0800,VERIFIED,1998-04-07 11:56:14 -0700,FIXED


In [21]:
id_eclipse_list = df_preprocess['bug_id'].tolist()
id_mozilla_list = mozilla_json['bug_id'].tolist()

In [23]:
common_elements = set(id_eclipse_list) & set(id_mozilla_list)

if common_elements:
    print("There are common elements.")
    print("Common elements:", len(common_elements))
else:
    print("No common elements found.")

There are common elements.
Common elements: 67055
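
This overlap is computed on raw `bug_id` values only. Assuming the two files come from separate trackers that number their bugs independently (an assumption; the notebook does not state it), equal IDs need not refer to the same report. If the datasets are ever combined, pairing each ID with a source label keeps them distinct. A sketch with toy ID lists and hypothetical labels:

```python
# Toy ID lists standing in for id_eclipse_list and id_mozilla_list.
eclipse_ids = [1, 3, 43, 163]
mozilla_ids = [35, 36, 43, 163]

# Raw integer IDs collide across the two sources.
raw_overlap = set(eclipse_ids) & set(mozilla_ids)

# (source, id) pairs cannot collide across sources.
# The labels "eclipse" and "mozilla" are illustrative choices.
eclipse_keys = {("eclipse", i) for i in eclipse_ids}
mozilla_keys = {("mozilla", i) for i in mozilla_ids}
keyed_overlap = eclipse_keys & mozilla_keys

print(sorted(raw_overlap))
print(len(keyed_overlap))
```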


In [5]:
mozilla_filtered = filter_bug_severity(mozilla_json, severity_col='bug_severity', description_col='short_desc')
mozilla_binary = create_binary_feature(mozilla_filtered, severity_col='bug_severity')

In [6]:
mozilla_binary.head(3)

Unnamed: 0,bug_id,short_desc,bug_severity,binary_severity
0,35,Navigator does not free preference hash table ...,minor,0
1,36,Floating Point Exception on Execution,critical,1
4,42,navigator redraw after initial startup,minor,0


In [7]:
mozilla_binary.shape

(201106, 4)

In [8]:
mozilla_preprocess = preprocess_text(dataframe=mozilla_binary, col_to_process='short_desc')

In [9]:
mozilla_preprocess.shape

(201093, 5)
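
The row count falls from 201106 to 201093 across `preprocess_text`, so 13 rows are dropped. The notebook does not say why; one plausible cause (an assumption) is text that is missing or empty after cleaning. If the helper keeps the original index, as the outputs here suggest, the dropped rows can be listed with an index difference. A sketch on toy data, with a simple filter standing in for `preprocess_text`:

```python
import pandas as pd

before = pd.DataFrame(
    {
        "bug_id": [35, 36, 42, 51],
        "short_desc": ["Just testing", "", "navigator redraw", None],
    },
    index=[0, 1, 4, 8],
)

# Stand-in for preprocess_text: drop rows whose text is missing or blank.
after = before[before["short_desc"].fillna("").str.strip() != ""]

# Index labels present before but not after identify the dropped rows.
dropped_index = before.index.difference(after.index)
dropped_rows = before.loc[dropped_index]
print(dropped_rows)
```

Applied to the real frames, the same comparison would be `mozilla_binary.index.difference(mozilla_preprocess.index)`.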

In [10]:
mozilla_preprocess.head(2)

Unnamed: 0,bug_id,bug_severity,binary_severity,description,stemmed_description
0,35,minor,0,Navigator free preference hash table exit,navig free prefer hash tabl exit
1,36,critical,1,Floating Point Exception Execution,float point except execut


In [11]:
mozilla_preprocess.to_json("../../data/mozilla_201k.json", orient="records", indent=4)