# Extract data from the Public Jira Dataset
* This notebook extracts data from the Public Jira Dataset.
* The dataset is stored in MongoDB.
* MongoDB must be installed and running on your system.

For more details, refer to https://zenodo.org/record/5901956

The data was exported with the following command, which takes about 15 minutes to complete:

`mongodump --db=JiraRepos --gzip --archive=mongodump-JiraRepos.archive`

Restore the data with the accompanying command, which also takes about 15 minutes to complete. Once expanded, the data occupies about 60 GB inside MongoDB.

`mongorestore --gzip --archive=mongodump-JiraRepos.archive --nsFrom "JiraRepos.*" --nsTo "JiraRepos.*"`

Change the value of the `--nsTo` option to contain the desired name for the JiraRepos database:

`mongorestore --gzip --archive=mongodump-JiraRepos.archive --nsFrom "JiraRepos.*" --nsTo "JiraRepos.Apache"`


For more information see: https://docs.mongodb.com/manual/tutorial/backup-and-restore-tools/
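
The restore step can also be scripted. The sketch below only assembles the argument list for `mongorestore` from the flags shown above. `build_restore_command` and the `MyJiraRepos` target name are hypothetical, and the sketch assumes the wildcard form `<name>.*` for `--nsTo`, as in the first restore command.

```python
import shlex


def build_restore_command(archive, target_db, source_db="JiraRepos"):
    """Assemble a mongorestore argument list (hypothetical helper, not part of the dataset tooling)."""
    return [
        "mongorestore",
        "--gzip",
        f"--archive={archive}",
        "--nsFrom", f"{source_db}.*",
        "--nsTo", f"{target_db}.*",  # assumption: rename the whole database via a wildcard
    ]


cmd = build_restore_command("mongodump-JiraRepos.archive", "MyJiraRepos")
print(shlex.join(cmd))  # shell-quoted form, suitable for copy and paste

# To actually run it (slow, and requires a running MongoDB instance):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the `*` wildcards.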

The filtered Jira dataset for TD was extracted from https://zenodo.org/record/5901956 (https://arxiv.org/pdf/2201.08368.pdf) and adapted.

Montgomery, Lloyd, Lüders, Clara, & Maalej, Prof. Dr. Walid. (2022). The Public Jira Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5901956



In [1]:
import pymongo
# Default connection to localhost
myclient = pymongo.MongoClient("mongodb://localhost:27017/")



In [2]:
mydb = myclient["JiraRepos"]
mydb


Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'JiraRepos')

In [3]:
collist = mydb.list_collection_names()
collist


['Spring',
 'RedHat',
 'Sakai',
 'JiraEcosystem',
 'Jira',
 'Hyperledger',
 'Apache',
 'SecondLife',
 'MariaDB',
 'MongoDB',
 'Mojang',
 'Qt',
 'JFrog',
 'IntelDAOS',
 'Mindville',
 'Sonatype']

In [4]:
# Name of the collection to extract
collection_name = 'JiraEcosystem'

In [5]:

import pymongo as pm
import pandas as pd
import os

CHUNK_SIZE = 500
client = pm.MongoClient()
coll = client.get_database('JiraRepos').get_collection(collection_name)
cursor = coll.find({}, batch_size=CHUNK_SIZE)

# Count total documents and calculate total chunks
total_docs = coll.count_documents({})
total_chunks = (total_docs + CHUNK_SIZE - 1) // CHUNK_SIZE  # Rounds up the division

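def yield_rows(cursor, chunk_size):
    """
    Yield successive lists of at most chunk_size documents from a cursor.

    :param cursor: iterable of documents (e.g. a pymongo cursor)
    :param chunk_size: maximum number of documents per chunk
    :return: generator of lists of documents
    """
    chunk = []
    for row in cursor:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []  # start a new list instead of mutating the one already yielded
    if chunk:  # final partial chunk; skipped when the cursor is empty
        yield chunk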

# Define the columns you wish to extract
desired_columns = [
    "id", 
    "fields.project.name", 
    "fields.priority.name", 
    "fields.created", 
    "fields.labels", 
    "fields.summary", 
    "fields.description", 
    "fields.status.name",
    "fields.status.description", 
    "fields.issuetype.name", 
    "fields.issuetype.description", 
    "fields.issuetype.subtask", 
    "fields.comments"
]

# Ensure the directory exists
os.makedirs("dataset_dump/" + collection_name, exist_ok=True)

chunks = yield_rows(cursor, CHUNK_SIZE)

# Initialize chunk counter
chunk_counter = 0

for chunk in chunks:
    chunk_counter += 1  # Increment the chunk counter
    df = pd.json_normalize(chunk, errors='ignore')
    
    # Select only the columns that exist in the DataFrame
    available_columns = [col for col in desired_columns if col in df.columns]
    df = df[available_columns]

    # Save the chunk to its own CSV file ('id' is always present, so at least one column is written)
    df.to_csv(f"dataset_dump/{collection_name}/{collection_name}-{chunk_counter}.csv", index=False)

    print(f"Processed chunk {chunk_counter} of {total_chunks}")

# Print completion message
print("All chunks processed.")

Processed chunk 1 of 84
Processed chunk 2 of 84
Processed chunk 3 of 84
Processed chunk 4 of 84
Processed chunk 5 of 84
Processed chunk 6 of 84
Processed chunk 7 of 84
Processed chunk 8 of 84
Processed chunk 9 of 84
Processed chunk 10 of 84
Processed chunk 11 of 84
Processed chunk 12 of 84
Processed chunk 13 of 84
Processed chunk 14 of 84
Processed chunk 15 of 84
Processed chunk 16 of 84
Processed chunk 17 of 84
Processed chunk 18 of 84
Processed chunk 19 of 84
Processed chunk 20 of 84
Processed chunk 21 of 84
Processed chunk 22 of 84
Processed chunk 23 of 84
Processed chunk 24 of 84
Processed chunk 25 of 84
Processed chunk 26 of 84
Processed chunk 27 of 84
Processed chunk 28 of 84
Processed chunk 29 of 84
Processed chunk 30 of 84
Processed chunk 31 of 84
Processed chunk 32 of 84
Processed chunk 33 of 84
Processed chunk 34 of 84
Processed chunk 35 of 84
Processed chunk 36 of 84
Processed chunk 37 of 84
Processed chunk 38 of 84
Processed chunk 39 of 84
Processed chunk 40 of 84
Processed
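
The chunking logic in the cell above can be checked without a database. The following is a minimal sketch using only the standard library. `iter_chunks` is a hypothetical stand-in for `yield_rows`, a plain `range` stands in for the MongoDB cursor, and the document count of 41865 is an assumption used only to illustrate the formula.

```python
from itertools import islice


def iter_chunks(iterable, chunk_size):
    """Yield lists of at most chunk_size items; never yields an empty list."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def count_chunks(total_docs, chunk_size):
    """Ceiling division, the same formula used for total_chunks."""
    return (total_docs + chunk_size - 1) // chunk_size


docs = range(7)  # stand-in for a cursor over 7 documents
chunks = list(iter_chunks(docs, 3))
print(chunks)                     # [[0, 1, 2], [3, 4, 5], [6]]
print(count_chunks(7, 3))         # 3
print(count_chunks(41865, 500))   # 84
```

The number of chunks produced by the generator always equals the value computed by the ceiling division, which is what makes the "chunk N of M" progress message reliable.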

In [6]:
import dask
dask.config.set({'dataframe.query-planning': True})
import dask.dataframe as dd
df = dd.read_csv(f'dataset_dump/{collection_name}/{collection_name}-*.csv')
df



<dask_expr.expr.DataFrame: expr=ReadCSV(746c9c8)>

In [7]:
df = df.compute()

In [8]:
df

| | id | fields.project.name | fields.priority.name | fields.created | fields.labels | fields.summary | fields.description | fields.status.name | fields.status.description | fields.issuetype.name | fields.issuetype.description | fields.issuetype.subtask | fields.comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 276837 | Who's Looking for OnDemand | Minor | 2019-05-10T02:55:41.828-0500 | [] | User profile picture doesn't display | Some user profile picture doesn't display when... | Reviewing Request | | Bug | A problem which impairs or prevents the functi... | False | [{'self': 'https://ecosystem.atlassian.net/res... |
| 1 | 259197 | Who's Looking for OnDemand | Major | 2018-06-27T05:38:26.604-0500 | [] | CLONE - Users are occasionally displayed twice | Reported here: https://jira.atlassian.com/brow... | To Do | | Bug | A problem which impairs or prevents the functi... | False | [] |
| 2 | 253083 | Who's Looking for OnDemand | Major | 2018-03-25T23:04:58.826-0500 | [] | CLONE - Username changes are not read by WL | For those users for whom we have had to change... | To Do | | Bug | A problem which impairs or prevents the functi... | False | [] |
| 3 | 134767 | Who's Looking for OnDemand | Critical | 2016-03-04T05:21:24.601-0600 | [] | Add compatibility with JIRA 7.x | Encountered this plugin within the ecosystem.a... | In Progress | This issue is being actively worked on at the ... | New Feature | A new feature of the product, which has yet to... | False | [] |
| 4 | 110186 | Who's Looking for OnDemand | Major | 2015-01-07T09:35:10.351-0600 | [] | Username changes are not read by WL | For those users for whom we have had to change... | Reviewing Request | | Bug | A problem which impairs or prevents the functi... | False | [{'self': 'https://ecosystem.atlassian.net/res... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | 96951 | Activity Streams | Major | 2014-07-17T02:40:05.326-0500 | [] | atlassian-streams 5.4.x Failing JDK8 Builds | | Closed | The issue is considered finished, the resoluti... | Bug | A problem which impairs or prevents the functi... | False | [{'self': 'https://ecosystem.atlassian.net/res... |
| 496 | 96917 | Activity Streams | Major | 2014-07-16T08:04:38.759-0500 | ['clarified'] | Reproducible test failure assertThatOnlyEntrie... | {{AppLinksTests fail}} on {{activity-stream-5.... | Closed | The issue is considered finished, the resoluti... | Bug | A problem which impairs or prevents the functi... | False | [{'self': 'https://ecosystem.atlassian.net/res... |
| 497 | 96817 | Activity Streams | Major | 2014-07-15T04:18:26.724-0500 | ['activity-stream', 'crucible', 'fecru', 'fish... | Invalid space encoding in changeset-file links | Links to changed files rendered by {{streams-f... | Closed | The issue is considered finished, the resoluti... | Bug | A problem which impairs or prevents the functi... | False | [] |
| 498 | 96687 | Activity Streams | Minor | 2014-07-11T01:25:33.967-0500 | [] | Unable to comment on Confluence page via JIRA ... | Steps to replicate:\r\n\r\n# In OnDemand, crea... | Closed | The issue is considered finished, the resoluti... | Bug | A problem which impairs or prevents the functi... | False | [] |
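
The dotted column names such as `fields.project.name` come from flattening the nested Jira documents with `pd.json_normalize`. The sketch below illustrates the idea with the standard library only. `flatten` is a hypothetical helper, not pandas' implementation, and the sample document is reconstructed from a few values of the first row shown above.

```python
def flatten(doc, parent_key="", sep="."):
    """Flatten nested dicts into one dict with dotted keys; lists are kept as values."""
    flat = {}
    for key, value in doc.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Assumed shape of a stored issue, rebuilt from the first table row
sample = {
    "id": "276837",
    "fields": {
        "project": {"name": "Who's Looking for OnDemand"},
        "priority": {"name": "Minor"},
        "labels": [],
    },
}

flat = flatten(sample)
print(sorted(flat))
# ['fields.labels', 'fields.priority.name', 'fields.project.name', 'id']
```

Because lists are not expanded, list-valued fields such as `fields.labels` and `fields.comments` end up as single cells, which is why they appear as bracketed values in the table.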


In [9]:
df["fields.priority.name"].value_counts().to_frame()[:50]


| fields.priority.name | count |
|---|---|
| Major | 29060 |
| Minor | 6320 |
| Critical | 2837 |
| Blocker | 2439 |
| Trivial | 836 |


In [10]:
df["fields.issuetype.name"].value_counts().to_frame()[:50]

| fields.issuetype.name | count |
|---|---|
| Bug | 20410 |
| Improvement | 7098 |
| Task | 5028 |
| Story | 2993 |
| New Feature | 2318 |
| Sub-task | 2294 |
| Suggestion | 542 |
| Documentation | 463 |
| Epic | 348 |
| Support Request | 147 |


In [11]:
df["fields.project.name"].value_counts().to_frame()[:50]

| fields.project.name | count |
|---|---|
| Atlassian Marketplace | 4563 |
| Atlassian User Interface | 4183 |
| Universal Plugin Manager | 3763 |
| Atlassian Connector for IntelliJ IDE (discontinued) | 2889 |
| Activity Streams | 2293 |
| Atlassian Connect in Jira Cloud | 1999 |
| Atlassian Connect | 1544 |
| Atlassian Connector for Eclipse (discontinued) | 1499 |
| Atlassian Maven Plugin Suite | 1490 |
| Atlassian Gadgets | 1485 |


In [12]:
# Number of different projects
df["fields.project.name"].nunique()

101

In [13]:
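# Write the combined dataset to a single CSV file
os.makedirs('final_dataset', exist_ok=True)  # to_csv does not create missing directories
df.to_csv(f'final_dataset/{collection_name}.csv', index=False)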

In [14]:
# Read the CSV back to verify the export
df1 = pd.read_csv(f'final_dataset/{collection_name}.csv')
df1

| | id | fields.project.name | fields.priority.name | fields.created | fields.labels | fields.summary | fields.description | fields.status.name | fields.status.description | fields.issuetype.name | fields.issuetype.description | fields.issuetype.subtask | fields.comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 276837 | Who's Looking for OnDemand | Minor | 2019-05-10T02:55:41.828-0500 | [] | User profile picture doesn't display | Some user profile picture doesn't display when... | Reviewing Request | | Bug | A problem which impairs or prevents the functi... | False | [{'self': 'https://ecosystem.atlassian.net/res... |
| 1 | 259197 | Who's Looking for OnDemand | Major | 2018-06-27T05:38:26.604-0500 | [] | CLONE - Users are occasionally displayed twice | Reported here: https://jira.atlassian.com/brow... | To Do | | Bug | A problem which impairs or prevents the functi... | False | [] |
| 2 | 253083 | Who's Looking for OnDemand | Major | 2018-03-25T23:04:58.826-0500 | [] | CLONE - Username changes are not read by WL | For those users for whom we have had to change... | To Do | | Bug | A problem which impairs or prevents the functi... | False | [] |
| 3 | 134767 | Who's Looking for OnDemand | Critical | 2016-03-04T05:21:24.601-0600 | [] | Add compatibility with JIRA 7.x | Encountered this plugin within the ecosystem.a... | In Progress | This issue is being actively worked on at the ... | New Feature | A new feature of the product, which has yet to... | False | [] |
| 4 | 110186 | Who's Looking for OnDemand | Major | 2015-01-07T09:35:10.351-0600 | [] | Username changes are not read by WL | For those users for whom we have had to change... | Reviewing Request | | Bug | A problem which impairs or prevents the functi... | False | [{'self': 'https://ecosystem.atlassian.net/res... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 41861 | 96951 | Activity Streams | Major | 2014-07-17T02:40:05.326-0500 | [] | atlassian-streams 5.4.x Failing JDK8 Builds | | Closed | The issue is considered finished, the resoluti... | Bug | A problem which impairs or prevents the functi... | False | [{'self': 'https://ecosystem.atlassian.net/res... |
| 41862 | 96917 | Activity Streams | Major | 2014-07-16T08:04:38.759-0500 | ['clarified'] | Reproducible test failure assertThatOnlyEntrie... | {{AppLinksTests fail}} on {{activity-stream-5.... | Closed | The issue is considered finished, the resoluti... | Bug | A problem which impairs or prevents the functi... | False | [{'self': 'https://ecosystem.atlassian.net/res... |
| 41863 | 96817 | Activity Streams | Major | 2014-07-15T04:18:26.724-0500 | ['activity-stream', 'crucible', 'fecru', 'fish... | Invalid space encoding in changeset-file links | Links to changed files rendered by {{streams-f... | Closed | The issue is considered finished, the resoluti... | Bug | A problem which impairs or prevents the functi... | False | [] |
| 41864 | 96687 | Activity Streams | Minor | 2014-07-11T01:25:33.967-0500 | [] | Unable to comment on Confluence page via JIRA ... | Steps to replicate:\r\n\r\n# In OnDemand, crea... | Closed | The issue is considered finished, the resoluti... | Bug | A problem which impairs or prevents the functi... | False | [] |
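
After the round trip through CSV, list-valued columns such as `fields.labels` come back as strings (for example `"['clarified']"`), and `fields.created` is a string with a numeric UTC offset. A minimal sketch for decoding single cell values follows. `parse_list_cell` and `parse_created` are hypothetical helpers, and the timestamp format string is an assumption based on the values shown in the table.

```python
import ast
from datetime import datetime


def parse_list_cell(cell):
    """Turn a stringified Python list from the CSV back into a list.

    Missing values (None, NaN, empty string) become an empty list.
    """
    if not isinstance(cell, str) or not cell.strip():
        return []
    return ast.literal_eval(cell)  # evaluates literals only, never arbitrary code


def parse_created(value):
    """Parse a timestamp like '2019-05-10T02:55:41.828-0500' into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f%z")


labels = parse_list_cell("['clarified']")
created = parse_created("2019-05-10T02:55:41.828-0500")
print(labels)               # ['clarified']
print(created.isoformat())  # 2019-05-10T02:55:41.828000-05:00
```

With pandas, the helpers could be applied per column, for example `df1["fields.labels"].apply(parse_list_cell)`. Note that `ast.literal_eval` raises an error on cells that are not valid Python literals, so malformed values would need separate handling.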
