# Extract data from the Public Jira Dataset
* This notebook extracts data from the Public Jira Dataset.
* The dataset is stored in MongoDB.
* MongoDB must be installed and running on your system.
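
Before running the cells below, it can help to confirm that something is listening on the default MongoDB port. The following is a minimal sketch using only the standard library; the helper name `is_port_open` is hypothetical, and the check only shows that a TCP port accepts connections, not that the service behind it is MongoDB.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 27017, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False


# 27017 is the port used in the connection string later in this notebook
print("MongoDB port reachable:", is_port_open())
```

If this prints `False`, start the MongoDB service before continuing.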

For more details, refer to https://zenodo.org/record/5901956

The data was exported with the following command, which takes about 15 minutes to complete:

`mongodump --db=JiraRepos --gzip --archive=mongodump-JiraRepos.archive`

Restore the data with the accompanying command, which also takes about 15 minutes to complete. Once expanded, the data occupies about 60 GB inside MongoDB.

`mongorestore --gzip --archive=mongodump-JiraRepos.archive --nsFrom "JiraRepos.*" --nsTo "JiraRepos.*"`

Change the value of the `--nsTo` option to contain the desired name for the JiraRepos database, for example:

`mongorestore --gzip --archive=mongodump-JiraRepos.archive --nsFrom "JiraRepos.*" --nsTo "JiraRepos.Apache"`
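
As an illustration of how the restore command could be assembled for a different target database, here is a small sketch that only builds and prints the argument list; it does not run `mongorestore`. The helper name `build_restore_command` and the target name `JiraReposCopy` are hypothetical, and the sketch assumes the wildcard form `"<database>.*"` is used on both `--nsFrom` and `--nsTo`.

```python
import shlex
from typing import List


def build_restore_command(archive: str, target_db: str, source_db: str = "JiraRepos") -> List[str]:
    """Build mongorestore arguments that map every collection of source_db into target_db."""
    return [
        "mongorestore",
        "--gzip",
        f"--archive={archive}",
        "--nsFrom", f"{source_db}.*",
        "--nsTo", f"{target_db}.*",
    ]


# "JiraReposCopy" is a made-up database name used only for this example
cmd = build_restore_command("mongodump-JiraRepos.archive", "JiraReposCopy")
print(shlex.join(cmd))  # shell-quoted form, suitable for pasting into a terminal
```

Keeping the arguments as a list means the same value could later be passed to `subprocess.run` without any shell quoting concerns.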


For more information see: https://docs.mongodb.com/manual/tutorial/backup-and-restore-tools/

The filtered Jira dataset for TD was extracted and adapted from https://zenodo.org/record/5901956 (https://arxiv.org/pdf/2201.08368.pdf).

Montgomery, Lloyd, Lüders, Clara, & Maalej, Walid. (2022). The Public Jira Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5901956



In [1]:
import pymongo
# Default connection to localhost
myclient = pymongo.MongoClient("mongodb://localhost:27017/")



In [2]:
mydb = myclient["JiraRepos"]
mydb


Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'JiraRepos')

In [3]:
myapache = mydb["Apache"]
myjiraecosystem = mydb["JiraEcosystem"]
mySonatype = mydb["Sonatype"]
mymongo = mydb["MongoDB"]


In [4]:
collist = mydb.list_collection_names()
collist


['Spring',
 'RedHat',
 'Sakai',
 'JiraEcosystem',
 'Jira',
 'Hyperledger',
 'Apache',
 'SecondLife',
 'MariaDB',
 'MongoDB',
 'Mojang',
 'Qt',
 'JFrog',
 'IntelDAOS',
 'Mindville',
 'Sonatype']

In [5]:
# Name of the collection to extract
collection_name = 'Jira'

In [6]:

import pymongo as pm
import pandas as pd
import os

CHUNK_SIZE = 500
client = pm.MongoClient()
coll = client.get_database('JiraRepos').get_collection(collection_name)
cursor = coll.find({}, batch_size=CHUNK_SIZE)

# Count total documents and calculate total chunks
total_docs = coll.count_documents({})
total_chunks = (total_docs + CHUNK_SIZE - 1) // CHUNK_SIZE  # Rounds up the division

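def yield_rows(cursor, chunk_size):
    """
    Generator that yields lists of documents read from a cursor.
    :param cursor: iterable of documents (e.g. a pymongo cursor)
    :param chunk_size: maximum number of documents per chunk
    :return: yields lists containing at most chunk_size documents
    """
    chunk = []
    for row in cursor:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []  # start a new list so an already yielded chunk is never mutated
    if chunk:  # skip the trailing chunk when it is empty
        yield chunk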

# Define the columns you wish to extract
desired_columns = [
    "id", 
    "fields.project.name", 
    "fields.priority.name", 
    "fields.created", 
    "fields.labels", 
    "fields.summary", 
    "fields.description", 
    "fields.status.name",
    "fields.status.description", 
    "fields.issuetype.name", 
    "fields.issuetype.description", 
    "fields.issuetype.subtask", 
    "fields.comments"
]

# Ensure the directory exists
os.makedirs("dataset_dump/" + collection_name, exist_ok=True)

chunks = yield_rows(cursor, CHUNK_SIZE)

# Initialize chunk counter
chunk_counter = 0

for chunk in chunks:
    chunk_counter += 1  # Increment the chunk counter
    df = pd.json_normalize(chunk, errors='ignore')
    
    # Select only the columns that exist in the DataFrame
    available_columns = [col for col in desired_columns if col in df.columns]
    df = df[available_columns]

    # Save each chunk to its own CSV file (assumes 'id' is always present)
    df.to_csv(f"dataset_dump/{collection_name}/{collection_name}-{chunk_counter}.csv", index=False)

    print(f"Processed chunk {chunk_counter} of {total_chunks}")

# Print completion message
print("All chunks processed.")

Processed chunk 1 of 550
Processed chunk 2 of 550
Processed chunk 3 of 550
Processed chunk 4 of 550
Processed chunk 5 of 550
Processed chunk 6 of 550
Processed chunk 7 of 550
Processed chunk 8 of 550
Processed chunk 9 of 550
Processed chunk 10 of 550
Processed chunk 11 of 550
Processed chunk 12 of 550
Processed chunk 13 of 550
Processed chunk 14 of 550
Processed chunk 15 of 550
Processed chunk 16 of 550
Processed chunk 17 of 550
Processed chunk 18 of 550
Processed chunk 19 of 550
Processed chunk 20 of 550
Processed chunk 21 of 550
Processed chunk 22 of 550
Processed chunk 23 of 550
Processed chunk 24 of 550
Processed chunk 25 of 550
Processed chunk 26 of 550
Processed chunk 27 of 550
Processed chunk 28 of 550
Processed chunk 29 of 550
Processed chunk 30 of 550
Processed chunk 31 of 550
Processed chunk 32 of 550
Processed chunk 33 of 550
Processed chunk 34 of 550
Processed chunk 35 of 550
Processed chunk 36 of 550
Processed chunk 37 of 550
Processed chunk 38 of 550
Processed chunk 39 of
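
The chunking logic and the rounded-up chunk count used in the cell above can be tried in isolation, without a database. This is a minimal sketch using `itertools.islice`; the names `iter_chunks` and `count_chunks` are hypothetical, and a `range` stands in for the MongoDB cursor.

```python
from itertools import islice


def iter_chunks(iterable, chunk_size):
    """Yield successive lists holding at most chunk_size items each."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # source exhausted, nothing left to yield
        yield chunk


def count_chunks(total_items, chunk_size):
    """Ceiling division using integers only."""
    return (total_items + chunk_size - 1) // chunk_size


rows = range(1203)  # stand-in for a cursor over 1203 documents
sizes = [len(chunk) for chunk in iter_chunks(rows, 500)]
print(sizes, count_chunks(1203, 500))  # [500, 500, 203] 3
```

Each chunk is a fresh list, so a consumer may keep a reference to an earlier chunk without it changing later.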

In [7]:
import dask
dask.config.set({'dataframe.query-planning': True})
import dask.dataframe as dd
df = dd.read_csv(f'dataset_dump/{collection_name}/{collection_name}-*.csv')
df



<dask_expr.expr.DataFrame: expr=ReadCSV(5e06ce2)>
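
One way to sanity-check the chunk files before loading them is to count their data rows and compare the result with `total_docs`. The sketch below uses only the standard library; the helper name `count_csv_rows` is hypothetical, and the demo runs on throwaway files rather than on `dataset_dump/`.

```python
import csv
import glob
import os
import tempfile


def count_csv_rows(pattern):
    """Count data rows (headers excluded) across all CSV files matching the glob pattern."""
    csv.field_size_limit(2**31 - 1)  # long text fields can exceed the default field limit
    total = 0
    for path in glob.glob(pattern):
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row
            total += sum(1 for _ in reader)
    return total


# Demo: two small files, one of which contains a multi-line field
with tempfile.TemporaryDirectory() as tmp:
    samples = [["1", "a"], ["2", "line one\nline two"]]
    for n, row in enumerate(samples, start=1):
        with open(os.path.join(tmp, f"Demo-{n}.csv"), "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["id", "fields.summary"])
            writer.writerow(row)
    demo_total = count_csv_rows(os.path.join(tmp, "Demo-*.csv"))

print(demo_total)  # 2
```

Because `csv.reader` parses quoted fields, a description that spans several lines still counts as one row. For the real data, the pattern would be `f"dataset_dump/{collection_name}/{collection_name}-*.csv"`.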

In [8]:
df = df.compute()

In [9]:
df

| | id | fields.project.name | fields.priority.name | fields.created | fields.labels | fields.summary | fields.description | fields.status.name | fields.status.description | fields.issuetype.name | fields.issuetype.description | fields.issuetype.subtask | fields.comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1843729 | Sourcetree for Windows | | 2022-01-04T09:13:21.000+0000 | [] | Please make warning LESS SCARY about a detache... | I wanted to follow a perfectly normal and feas... | Gathering Interest | This suggestion needs more unique domain votes... | Suggestion | | False | [] |
| 1 | 1843575 | Sourcetree for Windows | Low | 2022-01-04T07:39:09.000+0000 | [] | Bug in listing bitbucket repos | some errors show up as shown in the screenshot... | Waiting for Release | A fix for this issue has been implemented and ... | Bug | A problem which impairs or prevents the functi... | False | [{'self': 'https://jira.atlassian.com/rest/api... |
| 2 | 1843552 | Sourcetree for Windows | Low | 2022-01-03T21:46:14.000+0000 | [] | Spaces in File path of Custom Action | I have been using Sourcetree 3.4.4. We use cu... | Needs Triage | This issue is waiting to be reviewed by a memb... | Bug | A problem which impairs or prevents the functi... | False | [] |
| 3 | 1843397 | Sourcetree for Windows | Low | 2022-01-03T14:07:33.000+0000 | [] | Failed to start (System.BadImageFormatException) | After installing SourceTree for Windows 10 64b... | Needs Triage | This issue is waiting to be reviewed by a memb... | Bug | A problem which impairs or prevents the functi... | False | [{'self': 'https://jira.atlassian.com/rest/api... |
| 4 | 1842733 | Sourcetree for Windows | | 2021-12-25T02:30:54.000+0000 | [] | commit view show in branchs | When the project contains many branches and I ... | Gathering Interest | This suggestion needs more unique domain votes... | Suggestion | | False | [] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | 1777577 | Jira Service Management Cloud | | 2021-07-28T08:49:54.000+0000 | [] | Printing label tags for assets | Cant figure out a way to print labels on Insig... | Gathering Interest | This suggestion needs more unique domain votes... | Suggestion | | False | [{'self': 'https://jira.atlassian.com/rest/api... |
| 496 | 1777505 | Jira Service Management Cloud | Low | 2021-07-28T07:18:51.000+0000 | [] | Having child objects with no individual attrib... | h3. Issue Summary\r\n\r\nCreating child object... | Long Term Backlog | A fix for this issue is required, but planned ... | Bug | A problem which impairs or prevents the functi... | False | [] |
| 497 | 1777387 | Jira Service Management Cloud | | 2021-07-28T06:50:23.000+0000 | [] | Ability to display Insight custom field in Kan... | h3. Problem\r\n\r\nCurrently there is no capab... | Gathering Interest | This suggestion needs more unique domain votes... | Suggestion | | False | [] |
| 498 | 1777462 | Jira Service Management Cloud | Low | 2021-07-27T18:57:59.000+0000 | [] | Insight IQL uses timezones in a way that can b... | h3. Issue Summary\r\nInsight IQL uses timezone... | Gathering Impact | This issue has been reviewed, but needs more s... | Bug | A problem which impairs or prevents the functi... | False | [] |
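
The dotted column names in the table above, such as `fields.project.name`, come from flattening nested documents. The sketch below imitates that naming scheme in plain Python to show where the names come from; it is a simplified illustration and not how `pd.json_normalize` is implemented. The function name `flatten` and the sample document are hand-made for this example.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one dict with dotted keys; lists are kept as plain values."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, extending the dotted prefix
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hand-made sample shaped like a Jira issue document (not read from the dataset)
issue = {
    "id": "1843575",
    "fields": {
        "project": {"name": "Sourcetree for Windows"},
        "priority": {"name": "Low"},
        "labels": [],
    },
}

flat = flatten(issue)
print(sorted(flat))
```

This also shows why a column can be missing from a chunk: if no document in the chunk contains a given nested key, no dotted column is produced for it, which is what the `available_columns` filter in the extraction cell accounts for.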


In [10]:
df["fields.priority.name"].value_counts().to_frame()[:50]


| fields.priority.name | count |
|---|---|
| Low | 76547 |
| Medium | 43672 |
| High | 10283 |
| Highest | 5022 |


In [11]:
df["fields.issuetype.name"].value_counts().to_frame()[:50]

| fields.issuetype.name | count |
|---|---|
| Suggestion | 138378 |
| Bug | 131040 |
| Sub-task | 2501 |
| Support Request | 2438 |
| Public Security Vulnerability | 98 |
| Improvement | 52 |
| New Feature | 23 |
| Task | 12 |
| Fug | 2 |
| Feedback | 1 |


In [12]:
df["fields.project.name"].value_counts().to_frame()[:50]

| fields.project.name | count |
|---|---|
| Jira Server and Data Center | 47225 |
| Confluence Server and Data Center | 43910 |
| Jira Cloud | 27916 |
| Confluence Cloud | 25115 |
| Bitbucket Cloud | 20352 |
| Bamboo | 14625 |
| Jira Software Server and Data Center | 13273 |
| Jira Software Cloud | 12733 |
| Sourcetree for Windows | 9646 |
| Bitbucket Server | 7885 |


In [13]:
# Number of different projects
df["fields.project.name"].nunique()

30

In [14]:
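# Write the combined dataset to a single CSV file
os.makedirs('final_dataset', exist_ok=True)  # create the output directory if it is missing
df.to_csv(f'final_dataset/{collection_name}.csv', index=False)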

In [16]:
# Read the combined CSV back to check the result
df1 = pd.read_csv(f'final_dataset/{collection_name}.csv')
df1

| | id | fields.project.name | fields.priority.name | fields.created | fields.labels | fields.summary | fields.description | fields.status.name | fields.status.description | fields.issuetype.name | fields.issuetype.description | fields.issuetype.subtask | fields.comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1843729 | Sourcetree for Windows | | 2022-01-04T09:13:21.000+0000 | [] | Please make warning LESS SCARY about a detache... | I wanted to follow a perfectly normal and feas... | Gathering Interest | This suggestion needs more unique domain votes... | Suggestion | | False | [] |
| 1 | 1843575 | Sourcetree for Windows | Low | 2022-01-04T07:39:09.000+0000 | [] | Bug in listing bitbucket repos | some errors show up as shown in the screenshot... | Waiting for Release | A fix for this issue has been implemented and ... | Bug | A problem which impairs or prevents the functi... | False | [{'self': 'https://jira.atlassian.com/rest/api... |
| 2 | 1843552 | Sourcetree for Windows | Low | 2022-01-03T21:46:14.000+0000 | [] | Spaces in File path of Custom Action | I have been using Sourcetree 3.4.4. We use cu... | Needs Triage | This issue is waiting to be reviewed by a memb... | Bug | A problem which impairs or prevents the functi... | False | [] |
| 3 | 1843397 | Sourcetree for Windows | Low | 2022-01-03T14:07:33.000+0000 | [] | Failed to start (System.BadImageFormatException) | After installing SourceTree for Windows 10 64b... | Needs Triage | This issue is waiting to be reviewed by a memb... | Bug | A problem which impairs or prevents the functi... | False | [{'self': 'https://jira.atlassian.com/rest/api... |
| 4 | 1842733 | Sourcetree for Windows | | 2021-12-25T02:30:54.000+0000 | [] | commit view show in branchs | When the project contains many branches and I ... | Gathering Interest | This suggestion needs more unique domain votes... | Suggestion | | False | [] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 274540 | 1777577 | Jira Service Management Cloud | | 2021-07-28T08:49:54.000+0000 | [] | Printing label tags for assets | Cant figure out a way to print labels on Insig... | Gathering Interest | This suggestion needs more unique domain votes... | Suggestion | | False | [{'self': 'https://jira.atlassian.com/rest/api... |
| 274541 | 1777505 | Jira Service Management Cloud | Low | 2021-07-28T07:18:51.000+0000 | [] | Having child objects with no individual attrib... | h3. Issue Summary\r\n\r\nCreating child object... | Long Term Backlog | A fix for this issue is required, but planned ... | Bug | A problem which impairs or prevents the functi... | False | [] |
| 274542 | 1777387 | Jira Service Management Cloud | | 2021-07-28T06:50:23.000+0000 | [] | Ability to display Insight custom field in Kan... | h3. Problem\r\n\r\nCurrently there is no capab... | Gathering Interest | This suggestion needs more unique domain votes... | Suggestion | | False | [] |
| 274543 | 1777462 | Jira Service Management Cloud | Low | 2021-07-27T18:57:59.000+0000 | [] | Insight IQL uses timezones in a way that can b... | h3. Issue Summary\r\nInsight IQL uses timezone... | Gathering Impact | This issue has been reviewed, but needs more s... | Bug | A problem which impairs or prevents the functi... | False | [] |
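
After the round trip through CSV, list-valued columns such as `fields.labels` and `fields.comments` are read back as text (for example the string `"[]"`) rather than as Python lists. A minimal sketch for converting such cells back is shown below. It assumes the cells hold Python literal syntax, as the displayed values suggest; the helper name `parse_list_cell` is hypothetical, and cells that cannot be parsed are returned as `None` rather than guessed at.

```python
import ast


def parse_list_cell(cell):
    """Convert a cell like "[]" or "[{'a': 1}]" back into a list; return None if that fails."""
    if not isinstance(cell, str) or not cell.strip():
        return None  # missing values (NaN) or empty strings
    try:
        value = ast.literal_eval(cell)  # evaluates literals only, never arbitrary code
    except (ValueError, SyntaxError):
        return None  # truncated or non-literal content
    return value if isinstance(value, list) else None


# Made-up cell contents for illustration
print(parse_list_cell("[]"))
print(parse_list_cell("[{'body': 'Thanks for the report'}]"))
print(parse_list_cell("[{'self': 'https://jira..."))  # truncated text cannot be parsed
```

Applied to the loaded frame, this could look like `df1["fields.comments"].apply(parse_list_cell)`. If the lists need to survive storage without this step, a format that preserves nested types may be a better fit than CSV.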
