# Extract data from the Public Jira Dataset
* This notebook extracts data from the Public Jira Dataset.
* The dataset is stored in MongoDB.
* MongoDB must be installed and running on your system.

For more details, refer to https://zenodo.org/record/5901956

Command used to export the data (this command takes about 15 minutes to complete).

`mongodump --db=JiraRepos --gzip --archive=mongodump-JiraRepos.archive`

Accompanying command to restore the data (this command takes about 15 minutes to complete). Expanded, this data is ~60GB inside MongoDB.

`mongorestore --gzip --archive=mongodump-JiraRepos.archive --nsFrom "JiraRepos.*" --nsTo "JiraRepos.*"`

Change the `--nsTo` option to contain the desired name for the JiraRepos database.

`mongorestore --gzip --archive=mongodump-JiraRepos.archive --nsFrom "JiraRepos.*" --nsTo "JiraRepos.Apache"`


For more information see: https://docs.mongodb.com/manual/tutorial/backup-and-restore-tools/

The Jira dataset filtered for TD was extracted and adapted from https://zenodo.org/record/5901956 (https://arxiv.org/pdf/2201.08368.pdf).

Montgomery, Lloyd, Lüders, Clara, & Maalej, Prof. Dr. Walid. (2022). The Public Jira Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5901956



In [10]:
import pymongo
# Default connection to localhost
myclient = pymongo.MongoClient("mongodb://localhost:27017/")



In [11]:
mydb = myclient["JiraRepos"]
mydb


Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'JiraRepos')

In [12]:
collist = mydb.list_collection_names()
collist


['Spring',
 'RedHat',
 'Sakai',
 'JiraEcosystem',
 'Jira',
 'Hyperledger',
 'Apache',
 'SecondLife',
 'MariaDB',
 'MongoDB',
 'Mojang',
 'Qt',
 'JFrog',
 'IntelDAOS',
 'Mindville',
 'Sonatype']

In [13]:
# Name of the collection to extract
collection_name = 'MariaDB'

In [14]:

import pymongo as pm
import pandas as pd
import os

CHUNK_SIZE = 500
client = pm.MongoClient()
coll = client.get_database('JiraRepos').get_collection(collection_name)
cursor = coll.find({}, batch_size=CHUNK_SIZE)

# Count total documents and calculate total chunks
total_docs = coll.count_documents({})
total_chunks = (total_docs + CHUNK_SIZE - 1) // CHUNK_SIZE  # Rounds up the division

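def yield_rows(cursor, chunk_size):
    """
    Yield successive lists of at most `chunk_size` documents from `cursor`.

    :param cursor: iterable of documents (for example a pymongo cursor)
    :param chunk_size: maximum number of documents per chunk
    :return: generator of lists of documents
    """
    chunk = []
    for row in cursor:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []  # start a new list so a yielded chunk is never modified
    if chunk:  # emit the final partial chunk, but never an empty one
        yield chunk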

# Define the columns you wish to extract
desired_columns = [
    "id", 
    "fields.project.name", 
    "fields.priority.name", 
    "fields.created", 
    "fields.labels", 
    "fields.summary", 
    "fields.description", 
    "fields.status.name",
    "fields.status.description", 
    "fields.issuetype.name", 
    "fields.issuetype.description", 
    "fields.issuetype.subtask", 
    "fields.comments"
]

# Ensure the directory exists
os.makedirs("dataset_dump/" + collection_name, exist_ok=True)

chunks = yield_rows(cursor, CHUNK_SIZE)

# Initialize chunk counter
chunk_counter = 0

for chunk in chunks:
    chunk_counter += 1  # Increment the chunk counter
    df = pd.json_normalize(chunk, errors='ignore')
    
    # Select only the columns that exist in the DataFrame
    available_columns = [col for col in desired_columns if col in df.columns]
    df = df[available_columns]

    # Save to CSV, considering 'id' is always present
    df.to_csv(f"dataset_dump/{collection_name}/{collection_name}-{chunk_counter}.csv", index=False)

    print(f"Processed chunk {chunk_counter} of {total_chunks}")

# Print completion message
print("All chunks processed.")

Processed chunk 1 of 63
Processed chunk 2 of 63
Processed chunk 3 of 63
Processed chunk 4 of 63
Processed chunk 5 of 63
Processed chunk 6 of 63
Processed chunk 7 of 63
Processed chunk 8 of 63
Processed chunk 9 of 63
Processed chunk 10 of 63
Processed chunk 11 of 63
Processed chunk 12 of 63
Processed chunk 13 of 63
Processed chunk 14 of 63
Processed chunk 15 of 63
Processed chunk 16 of 63
Processed chunk 17 of 63
Processed chunk 18 of 63
Processed chunk 19 of 63
Processed chunk 20 of 63
Processed chunk 21 of 63
Processed chunk 22 of 63
Processed chunk 23 of 63
Processed chunk 24 of 63
Processed chunk 25 of 63
Processed chunk 26 of 63
Processed chunk 27 of 63
Processed chunk 28 of 63
Processed chunk 29 of 63
Processed chunk 30 of 63
Processed chunk 31 of 63
Processed chunk 32 of 63
Processed chunk 33 of 63
Processed chunk 34 of 63
Processed chunk 35 of 63
Processed chunk 36 of 63
Processed chunk 37 of 63
Processed chunk 38 of 63
Processed chunk 39 of 63
Processed chunk 40 of 63
Processed
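
The extraction loop above can be tried without a MongoDB server by running the same steps on a few in-memory documents. The sketch below is a minimal illustration, not the notebook's code: the sample documents, the `chunked` helper, and the shortened `desired_columns` list are made up for the example. It shows how `pd.json_normalize` turns nested fields into dotted column names such as `fields.project.name`, and why the column filter is needed when a field is absent from a chunk.

```python
from itertools import islice

import pandas as pd

# Hypothetical sample documents shaped like Jira issues (not taken from the dataset).
docs = [
    {"id": "1", "fields": {"project": {"name": "Server"}, "labels": ["bug"]}},
    {"id": "2", "fields": {"project": {"name": "Server"}, "labels": []}},
    {"id": "3", "fields": {"project": {"name": "Connector"}}},
]

# Shortened column list for the example; "fields.summary" is absent on purpose.
desired_columns = ["id", "fields.project.name", "fields.labels", "fields.summary"]


def chunked(iterable, chunk_size):
    """Yield lists of at most chunk_size items; each chunk is a new list."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


frames = []
for chunk in chunked(docs, 2):
    # Nested dictionaries become dotted column names; lists stay as list objects.
    df = pd.json_normalize(chunk)
    # Keep only the desired columns that this chunk actually contains.
    df = df[[col for col in desired_columns if col in df.columns]]
    frames.append(df)

for frame in frames:
    print(list(frame.columns))
```

Because each chunk is normalized on its own, different chunk files can end up with different sets of columns, which is worth keeping in mind when the files are combined later.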

In [15]:
import dask
dask.config.set({'dataframe.query-planning': True})
import dask.dataframe as dd
df = dd.read_csv(f'dataset_dump/{collection_name}/{collection_name}-*.csv')
df

<dask_expr.expr.DataFrame: expr=ReadCSV(5e2ecf6)>

In [16]:
df = df.compute()

ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+---------------------------+--------+----------+
| Column                    | Found  | Expected |
+---------------------------+--------+----------+
| fields.status.description | object | float64  |
+---------------------------+--------+----------+

The following columns also raised exceptions on conversion:

- fields.status.description
  ValueError("could not convert string to float: 'This issue is not being actively worked on at the moment.'")

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'fields.status.description': 'object'}

to the call to `read_csv`/`read_table`.
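
A likely cause of the mismatch reported above, assuming the chunk files differ in content, is that the column is empty in some files and contains text in others, so the type inferred from one file does not fit the next. The sketch below reproduces this with plain pandas on two tiny, made-up CSV files (the file names and contents are assumptions for illustration) and shows that passing an explicit `dtype` for the column makes the types agree.

```python
import os
import tempfile

import pandas as pd

COLUMN = "fields.status.description"

with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "chunk-1.csv")
    second = os.path.join(tmp, "chunk-2.csv")

    # First file: the description column is empty in every row.
    with open(first, "w", encoding="utf-8") as handle:
        handle.write(f"id,{COLUMN}\n1,\n2,\n")
    # Second file: the same column holds text.
    with open(second, "w", encoding="utf-8") as handle:
        handle.write(f"id,{COLUMN}\n3,Work has started on this issue.\n")

    # Without a dtype, each file is inferred on its own and the results differ.
    inferred = [str(pd.read_csv(path)[COLUMN].dtype) for path in (first, second)]

    # With an explicit dtype, both files agree and can be combined safely.
    frames = [pd.read_csv(path, dtype={COLUMN: "object"}) for path in (first, second)]
    combined = pd.concat(frames, ignore_index=True)

print(inferred)
print(combined[COLUMN].dtype)
```

In the notebook, the equivalent change would be to pass `dtype={'fields.status.description': 'object'}` to `dd.read_csv`, as the error message suggests.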

In [None]:
df

In [None]:
df["fields.priority.name"].value_counts().to_frame()[:50]


In [None]:
df["fields.issuetype.name"].value_counts().to_frame()[:50]

In [None]:
df["fields.project.name"].value_counts().to_frame()[:50]

In [None]:
# Number of different projects
df["fields.project.name"].nunique()

In [None]:
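# Save the combined dataset to CSV (create the output directory first)
os.makedirs('final_dataset', exist_ok=True)
df.to_csv(f'final_dataset/{collection_name}.csv', index=False)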

In [None]:
# Read the CSV back to check the result
df1 = pd.read_csv(f'final_dataset/{collection_name}.csv')
df1
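
One detail to keep in mind when reading the file back: a CSV stores every cell as text, so list-valued columns such as `fields.labels` come back as strings rather than Python lists. The sketch below, using a small made-up DataFrame and a hypothetical `parse_list` helper, shows one way to restore them with `ast.literal_eval`. It assumes the column was written by `to_csv` from Python lists.

```python
import ast
import io

import pandas as pd

# Hypothetical data: one list-valued column, as produced by json_normalize.
original = pd.DataFrame({"id": [1, 2], "fields.labels": [["bug", "ui"], []]})

# Round-trip through CSV in memory instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)


def parse_list(value):
    """Turn the text form of a list back into a list; leave other values alone."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    return value


# Values as read from the CSV (strings), then after parsing (lists).
as_text = restored["fields.labels"].tolist()
restored["fields.labels"] = restored["fields.labels"].apply(parse_list)
as_lists = restored["fields.labels"].tolist()

print(as_text)
print(as_lists)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.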