# Extract data from the Public Jira Dataset
* This notebook extracts data from the Public Jira Dataset.
* The dataset is stored in MongoDB.
* MongoDB must be installed and running on your system.

For more details, refer to https://zenodo.org/record/5901956

Command used to export the data (this command takes about 15 minutes to complete).

`mongodump --db=JiraRepos --gzip --archive=mongodump-JiraRepos.archive`

Accompanying command to restore the data (this command takes about 15 minutes to complete). Expanded, this data is ~60GB inside MongoDB.

`mongorestore --gzip --archive=mongodump-JiraRepos.archive --nsFrom "JiraRepos.*" --nsTo "JiraRepos.*"`

Change the `--nsTo` option to contain the desired name for the JiraRepos database, for example:

`mongorestore --gzip --archive=mongodump-JiraRepos.archive --nsFrom "JiraRepos.*" --nsTo "JiraRepos.Apache"`


For more information see: https://docs.mongodb.com/manual/tutorial/backup-and-restore-tools/
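
If the restore is scripted, only the `--nsTo` pattern needs to change to pick another database name. The following is a minimal sketch that builds the argument list in Python. The helper names `build_restore_command` and `restore` and the target name `MyJiraRepos` are hypothetical, and the sketch assumes `mongorestore` is available on the PATH.

```python
import subprocess


def build_restore_command(archive, target_db, source_db="JiraRepos"):
    """Return the mongorestore argument list for restoring source_db as target_db."""
    return [
        "mongorestore",
        "--gzip",
        f"--archive={archive}",
        "--nsFrom", f"{source_db}.*",
        "--nsTo", f"{target_db}.*",
    ]


def restore(archive, target_db):
    """Run mongorestore; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(build_restore_command(archive, target_db), check=True)


# Hypothetical target name; printing shows the command without running it.
print(" ".join(build_restore_command("mongodump-JiraRepos.archive", "MyJiraRepos")))
```

Passing the arguments as a list avoids shell quoting problems with the `*` wildcard.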

The Jira dataset filtered for TD was extracted from https://zenodo.org/record/5901956 (https://arxiv.org/pdf/2201.08368.pdf) and adapted.

Montgomery, Lloyd, Lüders, Clara, & Maalej, Prof. Dr. Walid. (2022). The Public Jira Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5901956



In [1]:
import pymongo
# Default connection to localhost
myclient = pymongo.MongoClient("mongodb://localhost:27017/")



In [2]:
mydb = myclient["JiraRepos"]
mydb


Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'JiraRepos')

In [3]:
collist = mydb.list_collection_names()
collist


['Spring',
 'RedHat',
 'Sakai',
 'JiraEcosystem',
 'Jira',
 'Hyperledger',
 'Apache',
 'SecondLife',
 'MariaDB',
 'MongoDB',
 'Mojang',
 'Qt',
 'JFrog',
 'IntelDAOS',
 'Mindville',
 'Sonatype']
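
Before choosing a collection to extract, it can help to know how large each one is. The sketch below assumes a pymongo database handle such as `mydb`. The helper name `collection_sizes` is hypothetical, and it relies on pymongo's `estimated_document_count()`, which reads collection metadata instead of scanning the documents.

```python
def collection_sizes(db):
    """Map each collection name in db to its estimated document count, largest first."""
    sizes = {
        name: db[name].estimated_document_count()
        for name in db.list_collection_names()
    }
    # Sort by count so the largest repositories are listed first
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))
```

With the handle from the cell above, `collection_sizes(mydb)` would return a dictionary keyed by the collection names listed here.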

In [4]:
# Name of the collection to extract
collection_name = 'Qt'

In [5]:

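import os

import pandas as pd
import pymongo as pm

# Number of documents fetched per batch and written to each CSV file
CHUNK_SIZE = 500
client = pm.MongoClient()  # Default connection to localhost:27017
coll = client.get_database('JiraRepos').get_collection(collection_name)
cursor = coll.find({}, batch_size=CHUNK_SIZE)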

# Count total documents and calculate total chunks
total_docs = coll.count_documents({})
total_chunks = (total_docs + CHUNK_SIZE - 1) // CHUNK_SIZE  # Rounds up the division

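def yield_rows(cursor, chunk_size):
    """
    Generator to yield the documents of a cursor in chunks
    :param cursor: iterable of documents, e.g. a pymongo cursor
    :param chunk_size: maximum number of documents per chunk
    :return: generator of lists of documents
    """
    chunk = []
    for row in cursor:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []  # Start a new list so earlier chunks are not mutated
    # Yield the last, partially filled chunk (skipped when it is empty)
    if chunk:
        yield chunk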

# Define the columns you wish to extract
desired_columns = [
    "id", 
    "fields.project.name", 
    "fields.priority.name", 
    "fields.created", 
    "fields.labels", 
    "fields.summary", 
    "fields.description", 
    "fields.status.name",
    "fields.status.description", 
    "fields.issuetype.name", 
    "fields.issuetype.description", 
    "fields.issuetype.subtask", 
    "fields.comments"
]

# Ensure the directory exists
os.makedirs("dataset_dump/" + collection_name, exist_ok=True)

chunks = yield_rows(cursor, CHUNK_SIZE)

# Initialize chunk counter
chunk_counter = 0

for chunk in chunks:
    chunk_counter += 1  # Increment the chunk counter
    df = pd.json_normalize(chunk, errors='ignore')
    
    # Select only the columns that exist in the DataFrame
    available_columns = [col for col in desired_columns if col in df.columns]
    df = df[available_columns]

    # Save the chunk to its own CSV file ('id' is assumed to be present in every document)
    df.to_csv(f"dataset_dump/{collection_name}/{collection_name}-{chunk_counter}.csv", index=False)

    print(f"Processed chunk {chunk_counter} of {total_chunks}")

# Print completion message
print("All chunks processed.")

Processed chunk 1 of 298
Processed chunk 2 of 298
Processed chunk 3 of 298
Processed chunk 4 of 298
Processed chunk 5 of 298
Processed chunk 6 of 298
Processed chunk 7 of 298
Processed chunk 8 of 298
Processed chunk 9 of 298
Processed chunk 10 of 298
Processed chunk 11 of 298
Processed chunk 12 of 298
Processed chunk 13 of 298
Processed chunk 14 of 298
Processed chunk 15 of 298
Processed chunk 16 of 298
Processed chunk 17 of 298
Processed chunk 18 of 298
Processed chunk 19 of 298
Processed chunk 20 of 298
Processed chunk 21 of 298
Processed chunk 22 of 298
Processed chunk 23 of 298
Processed chunk 24 of 298
Processed chunk 25 of 298
Processed chunk 26 of 298
Processed chunk 27 of 298
Processed chunk 28 of 298
Processed chunk 29 of 298
Processed chunk 30 of 298
Processed chunk 31 of 298
Processed chunk 32 of 298
Processed chunk 33 of 298
Processed chunk 34 of 298
Processed chunk 35 of 298
Processed chunk 36 of 298
Processed chunk 37 of 298
Processed chunk 38 of 298
Processed chunk 39 of
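
The chunking and the rounded-up chunk count used in the cell above can be checked in isolation. The following is a sketch, not the notebook's own code: `chunked` and `expected_chunks` are hypothetical names, the generator uses `itertools.islice` as an alternative to the hand-written loop, and the count of 1,203 documents is invented for the demonstration.

```python
from itertools import islice


def chunked(iterable, chunk_size):
    """Yield successive lists of at most chunk_size items from iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def expected_chunks(total_docs, chunk_size):
    """Ceiling division: number of chunks needed to hold total_docs items."""
    return (total_docs + chunk_size - 1) // chunk_size


# Stand-in for a cursor: 1,203 fake documents (hypothetical count).
docs = ({"id": i} for i in range(1203))
sizes = [len(chunk) for chunk in chunked(docs, 500)]
print(sizes, expected_chunks(1203, 500))  # [500, 500, 203] 3
```

An empty input yields no chunks at all, so no empty CSV file would be written for an empty collection.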

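The dotted names in `desired_columns` come from `pd.json_normalize`, which flattens nested dictionaries into columns joined by `.` and leaves lists as single cell values. A small sketch with an invented document (the field values are hypothetical, only the nesting mirrors a Jira issue):

```python
import pandas as pd

# Hypothetical document with the same nesting as a Jira issue.
issue = {
    "id": "1",
    "fields": {
        "project": {"name": "Qt"},
        "labels": ["example"],
        "summary": "Example issue",
    },
}

flat = pd.json_normalize([issue])

# Keep only the wanted columns that exist, as the extraction cell does.
wanted = ["id", "fields.project.name", "fields.labels", "fields.priority.name"]
available = [col for col in wanted if col in flat.columns]
print(available)  # fields.priority.name is absent from this document
```

Filtering against `flat.columns` is what keeps the extraction from raising a `KeyError` when a chunk lacks one of the fields.
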
In [6]:
import dask
dask.config.set({'dataframe.query-planning': True})
import dask.dataframe as dd
df = dd.read_csv(f'dataset_dump/{collection_name}/{collection_name}-*.csv')
df



<dask_expr.expr.DataFrame: expr=ReadCSV(1a95163)>

In [7]:
df = df.compute()

In [8]:
df

,id,fields.project.name,fields.priority.name,fields.created,fields.labels,fields.summary,fields.description,fields.status.name,fields.status.description,fields.issuetype.name,fields.issuetype.description,fields.issuetype.subtask,fields.comments
0,348468,Qt Project Website,P2: Important,2022-01-04T08:51:51.000+0000,[],Custom css is not usable in published document...,While documentation projects can define custom...,Open,The issue is open and ready for the assignee t...,Bug,A problem which impairs or prevents the functi...,False,[]
1,331989,Qt Project Website,P2: Important,2021-12-16T11:55:21.000+0000,[],Second Cookie banner,Second cookie banner pop ups on top of the for...,Open,The issue is open and ready for the assignee t...,Task,A task that needs to be done.,False,[]
2,331192,Qt Project Website,P3: Somewhat important,2021-12-06T08:00:42.000+0000,[],doc.qt.io search: Changing Sort by criterium a...,Changing the 'Sort by criteria' in a doc.qt.io...,Open,The issue is open and ready for the assignee t...,Bug,A problem which impairs or prevents the functi...,False,[]
3,331155,Qt Project Website,Not Evaluated,2021-12-03T16:50:53.000+0000,[],planet.qt.io did not fetch my last post,"Hi, I did a day ago a new post on cutelyst.org...",Open,The issue is open and ready for the assignee t...,Bug,A problem which impairs or prevents the functi...,False,[{'self': 'https://bugreports.qt.io/rest/api/2...
4,330942,Qt Project Website,Not Evaluated,2021-12-01T12:19:21.000+0000,[],.sha256 virtual files are missing for qt 6.2.2...,missing\r\nhttps://download.qt.io/official_rel...,Closed,"The issue is considered finished, the resoluti...",Bug,A problem which impairs or prevents the functi...,False,[{'self': 'https://bugreports.qt.io/rest/api/2...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,301329,Qt,P2: Important,2020-07-01T12:03:13.000+0000,[],Unable to record more than one video on Raspbe...,The application crashes when trying to record ...,Closed,"The issue is considered finished, the resoluti...",Bug,A problem which impairs or prevents the functi...,False,[{'self': 'https://bugreports.qt.io/rest/api/2...
496,301328,Qt,Not Evaluated,2020-07-01T11:51:59.000+0000,['Reported_by_support_standard'],QQC2 AbstractButton.display: add option to sel...,Currently seems like the only option to edit o...,Reported,"The issue has been reported, but no validation...",Suggestion,Public suggestions,False,[]
497,301327,Qt,P2: Important,2020-07-01T11:11:14.000+0000,['Reported_by_support_standard'],QTreeView: FetchMore is not getting called for...,When a Treeview is specified to fetch more row...,Closed,"The issue is considered finished, the resoluti...",Bug,A problem which impairs or prevents the functi...,False,[{'self': 'https://bugreports.qt.io/rest/api/2...
498,301326,Qt,P2: Important,2020-07-01T11:00:14.000+0000,[],Select SDK and XCode for Build Qt from sources...,"Hi,\r\n\r\nI'm attempting to build Qt 5.15 fro...",Closed,"The issue is considered finished, the resoluti...",Bug,A problem which impairs or prevents the functi...,False,[{'self': 'https://bugreports.qt.io/rest/api/2...
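
The row labels in the output above end at 498 rather than near the total row count. This is consistent with each CSV partition keeping its own 0-based index when the partitions are concatenated by `compute()`. The sketch below reproduces the effect with plain pandas; the two small frames are stand-ins for partitions.

```python
import pandas as pd

# Two stand-in partitions, each with its own 0-based index,
# as when every CSV file is read separately.
part1 = pd.DataFrame({"id": [348468, 331989]})
part2 = pd.DataFrame({"id": [301327, 301326]})

combined = pd.concat([part1, part2])
print(combined.index.tolist())  # [0, 1, 0, 1]

# Renumber the rows from 0 to n-1 and discard the old labels
renumbered = combined.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

In this notebook, calling `df = df.reset_index(drop=True)` after `compute()` would give unique row labels. Because the CSV is later written with `index=False`, the duplicate labels do not reach the final file.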


In [9]:
df["fields.priority.name"].value_counts().to_frame()[:50]


fields.priority.name,count
P2: Important,46926
Not Evaluated,42453
P3: Somewhat important,28075
P1: Critical,20750
P4: Low,7375
P0: Blocker,2050
P5: Not important,935
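
Note that `value_counts()` leaves out missing values by default, so the counts need not add up to the number of rows. A short sketch on an invented Series shows how `dropna=False` and `normalize=True` change the result:

```python
import pandas as pd

# Invented sample with one missing priority
priorities = pd.Series(["P2: Important", "P2: Important", None, "P1: Critical"])

print(priorities.value_counts())              # missing value is not counted
print(priorities.value_counts(dropna=False))  # adds a row for the missing value
print(priorities.value_counts(normalize=True).round(2))  # shares instead of counts
```

The same arguments apply to `df["fields.priority.name"].value_counts()` in the cell above.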


In [10]:
df["fields.issuetype.name"].value_counts().to_frame()[:50]

fields.issuetype.name,count
Bug,106804
Suggestion,15723
Task,12015
Technical task,4830
Sub-task,4792
User Story,3401
Epic,793
Change Request,209
Improvement,11
Research,1


In [11]:
df["fields.project.name"].value_counts().to_frame()[:50]

fields.project.name,count
Qt,97172
Qt Creator,25249
Qt Design Studio,4619
Qt Quality Assurance Infrastructure,4169
Qt 3D Studio,4092
Qt Installer Framework,2178
Qt Mobility,1926
"Qbs (""Cubes"")",1634
Qt for Python,1595
Qt Automotive Suite,1521


In [12]:
# Number of different projects
df["fields.project.name"].nunique()

21

In [13]:
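# Write the combined DataFrame to a single CSV file
os.makedirs('final_dataset', exist_ok=True)  # Create the output directory if needed
df.to_csv(f'final_dataset/{collection_name}.csv', index=False)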

In [14]:
# Read the CSV back to verify the export
df1 = pd.read_csv(f'final_dataset/{collection_name}.csv')
df1

,id,fields.project.name,fields.priority.name,fields.created,fields.labels,fields.summary,fields.description,fields.status.name,fields.status.description,fields.issuetype.name,fields.issuetype.description,fields.issuetype.subtask,fields.comments
0,348468,Qt Project Website,P2: Important,2022-01-04T08:51:51.000+0000,[],Custom css is not usable in published document...,While documentation projects can define custom...,Open,The issue is open and ready for the assignee t...,Bug,A problem which impairs or prevents the functi...,False,[]
1,331989,Qt Project Website,P2: Important,2021-12-16T11:55:21.000+0000,[],Second Cookie banner,Second cookie banner pop ups on top of the for...,Open,The issue is open and ready for the assignee t...,Task,A task that needs to be done.,False,[]
2,331192,Qt Project Website,P3: Somewhat important,2021-12-06T08:00:42.000+0000,[],doc.qt.io search: Changing Sort by criterium a...,Changing the 'Sort by criteria' in a doc.qt.io...,Open,The issue is open and ready for the assignee t...,Bug,A problem which impairs or prevents the functi...,False,[]
3,331155,Qt Project Website,Not Evaluated,2021-12-03T16:50:53.000+0000,[],planet.qt.io did not fetch my last post,"Hi, I did a day ago a new post on cutelyst.org...",Open,The issue is open and ready for the assignee t...,Bug,A problem which impairs or prevents the functi...,False,[{'self': 'https://bugreports.qt.io/rest/api/2...
4,330942,Qt Project Website,Not Evaluated,2021-12-01T12:19:21.000+0000,[],.sha256 virtual files are missing for qt 6.2.2...,missing\r\nhttps://download.qt.io/official_rel...,Closed,"The issue is considered finished, the resoluti...",Bug,A problem which impairs or prevents the functi...,False,[{'self': 'https://bugreports.qt.io/rest/api/2...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
148574,301329,Qt,P2: Important,2020-07-01T12:03:13.000+0000,[],Unable to record more than one video on Raspbe...,The application crashes when trying to record ...,Closed,"The issue is considered finished, the resoluti...",Bug,A problem which impairs or prevents the functi...,False,[{'self': 'https://bugreports.qt.io/rest/api/2...
148575,301328,Qt,Not Evaluated,2020-07-01T11:51:59.000+0000,['Reported_by_support_standard'],QQC2 AbstractButton.display: add option to sel...,Currently seems like the only option to edit o...,Reported,"The issue has been reported, but no validation...",Suggestion,Public suggestions,False,[]
148576,301327,Qt,P2: Important,2020-07-01T11:11:14.000+0000,['Reported_by_support_standard'],QTreeView: FetchMore is not getting called for...,When a Treeview is specified to fetch more row...,Closed,"The issue is considered finished, the resoluti...",Bug,A problem which impairs or prevents the functi...,False,[{'self': 'https://bugreports.qt.io/rest/api/2...
148577,301326,Qt,P2: Important,2020-07-01T11:00:14.000+0000,[],Select SDK and XCode for Build Qt from sources...,"Hi,\r\n\r\nI'm attempting to build Qt 5.15 fro...",Closed,"The issue is considered finished, the resoluti...",Bug,A problem which impairs or prevents the functi...,False,[{'self': 'https://bugreports.qt.io/rest/api/2...
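
After the CSV round trip, list-valued columns such as `fields.labels` and `fields.comments` come back as strings (for example `"['Reported_by_support_standard']"`), not as Python lists. The sketch below shows one way to parse them again. The helper name `parse_list_cell` is hypothetical, and it assumes the cells contain only Python literals, which `ast.literal_eval` can read without executing code.

```python
import ast


def parse_list_cell(text):
    """Turn a stringified Python list from the CSV back into a list.

    Returns an empty list for missing, empty, or unparsable cells.
    """
    if not isinstance(text, str) or not text.strip():
        return []
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return []
    return value if isinstance(value, list) else []


print(parse_list_cell("['Reported_by_support_standard']"))
print(parse_list_cell("[]"))
```

Applied to the DataFrame, this could look like `df1["fields.labels"].apply(parse_list_cell)`.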
