# **PARSING OF JIRA BUG TICKETS FOR APACHE OPEN-SOURCE PROJECT**

## WEB SCRAPING APACHE PROJECT BUG TICKET DATA

The data about bugs in the Apache project is fetched from Apache's JIRA issue tracker using the JIRA REST API.

## Findings about fetched data
1. It is in JSON format, which maps directly to a dictionary in Python.
2. The fetched data has 5 fields/keys [***Label1***]:<br>
    a. expand: a string<br>
    b. startAt: the index of the first issue returned in this response<br>
    c. maxResults: the maximum number of issues returned per request<br>
       Note: maxResults equals the page_length given in the code, but its maximum value can be 1000<br>
    d. total: the total number of issues in the tracker that match the query<br>
       Note: It can be greater than the maxResults that can be fetched via the API in one request
       (e.g. in the case of Apache Airflow)<br>
    e. issues: a list of issues, each in JSON/dictionary form
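
As a rough illustration of how these keys relate, the sketch below uses a hand-written dictionary shaped like one response page and derives how many requests are needed to cover `total` issues. The values in `sample_page` are illustrative (only `total` and `maxResults` match the numbers reported later in this notebook), and the helper names `pages_needed` and `start_offsets` are hypothetical.

```python
import math

# Hand-written stand-in for one page of the search response.
# Only the five top-level keys mirror the real structure; the values are illustrative.
sample_page = {
    "expand": "schema,names",
    "startAt": 0,
    "maxResults": 1000,
    "total": 7038,
    "issues": [],  # would hold up to maxResults issue dictionaries
}


def pages_needed(total, max_results):
    """Number of requests required to fetch `total` issues, `max_results` at a time."""
    if max_results <= 0:
        raise ValueError("max_results must be positive")
    return math.ceil(total / max_results)


def start_offsets(total, max_results):
    """The startAt value to use for each request."""
    return list(range(0, total, max_results))


print(sorted(sample_page))
print(pages_needed(sample_page["total"], sample_page["maxResults"]))
print(start_offsets(sample_page["total"], sample_page["maxResults"]))
```

Because `total` is only known after the first response arrives, at least one request has to be made before the number of remaining pages can be computed.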

In [222]:
#Importing the requests module
import requests

#Base url
jira_api_base_url = "https://issues.apache.org/jira/"

#max number of results fetched per request
page_length = 1000

#name of project
project = "airflow"
proj_jira_id = project.upper()

#starting index
start_idx = 0

#Dictionary to store fetched results
r_dict={}

"""
Initially we don't know the total number of issues in the JIRA issue
tracker for a single project,
so we set it to 0"""
total=0

"""
To learn the total number of issues, we need to request the url at least once.

This can be done in many ways, but here a do-while loop is used.

Since Python doesn't support do-while loops,
below is one way to mimic a do-while loop in Python.
"""
while True:
  #Making the final url
  url = (
      jira_api_base_url
      + f"rest/api/2/search?jql=project={proj_jira_id}+order+by+created"
      + f"&issuetypeNames=Bug&maxResults={page_length}&"
      + f"startAt={start_idx}&fields=id,key,priority,labels,versions,"
      + "status,components,creator,reporter,issuetype,description,"
      + "summary,resolutiondate,created,updated"
  )

  #Printing the final url from where the data is fetched
  print(f"Getting data from {url}...")

  #Sends a GET request to fetch data from the specified url
  r = requests.get(url)

  #Stop with a clear error if the server answered with an HTTP error status
  r.raise_for_status()

  #Storing the json data fetched from the url in a temporary dictionary
  #You can refer to "Label1" in the above cell for reviewing the structure of the json data fetched
  temp_dict = r.json()

  #Increasing the start index of the next request by the page length,
  #as we have already fetched that many issues
  start_idx+=page_length

  #Making total equal to the total number of issues present in the tracker
  total=temp_dict['total']

  #Initially our result dictionary is empty
  #So we can check if it's empty using the built-in len function
  if len(r_dict)==0:
    #Copying temporary fetched dictionary to our empty result dictionary
    r_dict=temp_dict.copy()

  #If it is not empty then
  #We just need to add the fetched list of 'issues'
  #from temporary dictionary(temp_dict) to our final result dictionary(r_dict)
  else:
    #Adding two lists using the extend function
    r_dict['issues'].extend(temp_dict['issues'])

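  #Now the loop condition is checked
  #If the start index has reached or exceeded the total number of issues, the loop breaks
  #(using >= avoids one extra, empty request when total is an exact multiple of page_length)
  if start_idx>=total:
    break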

Getting data from https://issues.apache.org/jira/rest/api/2/search?jql=project=AIRFLOW+order+by+created&issuetypeNames=Bug&maxResults=1000&startAt=0&fields=id,key,priority,labels,versions,status,components,creator,reporter,issuetype,description,summary,resolutiondate,created,updated...
Getting data from https://issues.apache.org/jira/rest/api/2/search?jql=project=AIRFLOW+order+by+created&issuetypeNames=Bug&maxResults=1000&startAt=1000&fields=id,key,priority,labels,versions,status,components,creator,reporter,issuetype,description,summary,resolutiondate,created,updated...
Getting data from https://issues.apache.org/jira/rest/api/2/search?jql=project=AIRFLOW+order+by+created&issuetypeNames=Bug&maxResults=1000&startAt=2000&fields=id,key,priority,labels,versions,status,components,creator,reporter,issuetype,description,summary,resolutiondate,created,updated...
Getting data from https://issues.apache.org/jira/rest/api/2/search?jql=project=AIRFLOW+order+by+created&issuetypeNames=Bug&maxResults

In [223]:
#Checking the maximum number of results fetched per request
r_dict['maxResults']

1000

In [224]:
#Total number of issues in the Tracker
r_dict['total']

7038

In [225]:
#Checking the length of issues in our final dictionary(r_dict) to make sure it is equal to total number of issues in the tracker
len(r_dict['issues'])

7038
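
The pagination logic can also be separated from the HTTP call, which makes it possible to try it out without touching the network. The sketch below is one possible arrangement, not the notebook's own method: `fetch_all_issues` and `make_fake_fetch_page` are hypothetical names, and the fake endpoint only imitates the `total`/`issues` structure described under Label1. It advances by the number of issues actually received, on the assumption that a server may return fewer issues per page than requested.

```python
def fetch_all_issues(fetch_page, page_length=1000):
    """Collect every issue by calling fetch_page(start_at, max_results) repeatedly.

    fetch_page is any callable returning a dictionary with 'total' and 'issues'
    keys, like the JSON structure described under Label1.
    """
    issues = []
    start_at = 0
    while True:
        page = fetch_page(start_at, page_length)
        issues.extend(page["issues"])
        # Advance by the number of issues actually received, in case the
        # server returns fewer per page than requested.
        start_at += len(page["issues"])
        # Stop when everything is fetched, or when a page comes back empty
        # (guards against looping forever if 'total' is never reached).
        if start_at >= page["total"] or not page["issues"]:
            break
    return issues


def make_fake_fetch_page(total, server_cap):
    """Hypothetical in-memory stand-in for the JIRA endpoint, used only for this demo."""
    all_issues = [{"key": f"DEMO-{i + 1}"} for i in range(total)]

    def fetch_page(start_at, max_results):
        size = min(max_results, server_cap)
        return {
            "startAt": start_at,
            "maxResults": size,
            "total": total,
            "issues": all_issues[start_at:start_at + size],
        }

    return fetch_page


demo_issues = fetch_all_issues(make_fake_fetch_page(total=2500, server_cap=1000))
print(len(demo_issues))
```

In the notebook itself, `fetch_page` would wrap the `requests.get(url)` call and return `r.json()`; the loop body would stay the same.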

## COLLECTION OF RELEVANT DATA
The columns follow the [sample](https://github.com/HelgeCPH/ase-effect-branching-on-defects/blob/master/data/samples/airflow_interface.csv) interface suggested by [HelgeCPH](https://github.com/HelgeCPH/).

We need to extract five pieces of information from our dictionary (r_dict).
The five columns are as follows:<br>
    1. key<br>
    2. priority<br>
    3. created<br>
    4. status<br>
    5. versions

In [226]:
#Empty list to store our data
data=[]

In [227]:
#For each issue in the fetched list, we collect five items (namely: key, priority, created, status, versions)
for issue in r_dict['issues']:
  #List to store data for one row/issue
  row=[]

  #Getting the key value
  key=issue['key']

  #Getting priority name
  priority=issue['fields']['priority']['name']

  #Getting the date and time when issue was created
  created=issue['fields']['created']

  #Getting the name of status of the issue(e.g. Open,Closed,Resolved)
  status=issue['fields']['status']['name']

  #Getting the versions of the product affected by the bug

  #In some issues the version is not given
  #So this condition checks whether version information is present
  if len(issue['fields']['versions'])==0:
    #If no version information present then version field is empty string
    version=''
  else:
    #If it has version, then it can be more than one

    #Creating a list for version names
    version_names=[]

    #Iterating through each version
    for v in issue['fields']['versions']:

      #Fetching its name
      name=v['name']

      #We just need the numerical values so removing the non-numerical part
      #Final version string variable
      final_name=''

      #For each character we will only store the number and '.'(dot)
      for c in name:

        #Check if the character is number or dot('.')
        if c.isnumeric() or c=='.':
          #if yes add it to final string name
          final_name=final_name+c

      #Add the final version name to the version_names list
      version_names.append(final_name)
    
    #Add all version names in one string
    version=' '.join(version_names)

  #Add each column information in our row list
  row.extend([key,priority,created,status,version])

  #Add the row information in final data
  data.append(row)
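
The per-issue extraction above can also be written as a small function. The following is a minimal sketch under two assumptions that the original loop does not make: that `priority` or `status` might be missing or null for some issues, and that version names contain only ASCII characters. The names `clean_version` and `extract_row` are hypothetical, and the example issue is hand-written rather than fetched from JIRA.

```python
import re


def clean_version(name):
    """Keep only ASCII digits and dots, similar to the character loop above."""
    return re.sub(r"[^0-9.]", "", name)


def extract_row(issue):
    """Build [key, priority, created, status, version] for one issue dictionary."""
    fields = issue["fields"]
    # Assumption: priority/status may be missing or null; fall back to ''.
    priority = (fields.get("priority") or {}).get("name", "")
    status = (fields.get("status") or {}).get("name", "")
    versions = fields.get("versions") or []
    version = " ".join(clean_version(v["name"]) for v in versions)
    return [issue["key"], priority, fields["created"], status, version]


# Hand-written example issue (values are illustrative, not fetched from JIRA).
example_issue = {
    "key": "DEMO-1",
    "fields": {
        "priority": {"name": "Minor"},
        "created": "2020-04-01T19:34:38.000+0000",
        "status": {"name": "Resolved"},
        "versions": [{"name": "1.10.5"}, {"name": "v1.10.6"}],
    },
}
print(extract_row(example_issue))
```

One caveat of keeping only digits and dots, in either form: a hypothetical version name such as `1.8.0rc1` becomes `1.8.01`, because the digit after the stripped letters is appended directly to the preceding part.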

In [228]:
#The final data
#Note: The data starts from the latest bugs
data

[['AIRFLOW-7121', 'Major', '2020-05-26T22:08:20.000+0000', 'Open', '1.10.10'],
 ['AIRFLOW-7120',
  'Blocker',
  '2020-04-28T12:14:24.000+0000',
  'Open',
  '1.10.10'],
 ['AIRFLOW-7119', 'Minor', '2020-04-02T07:03:01.000+0000', 'Open', '1.10.9'],
 ['AIRFLOW-7118',
  'Minor',
  '2020-04-01T19:34:38.000+0000',
  'Resolved',
  '1.10.5 1.10.6 1.10.7 1.10.8 1.10.9'],
 ['AIRFLOW-7117',
  'Minor',
  '2020-04-01T18:42:15.000+0000',
  'Resolved',
  '1.10.5 1.10.6 1.10.7 1.10.8 1.10.9'],
 ['AIRFLOW-7116', 'Major', '2020-04-01T00:49:58.000+0000', 'Open', '1.10.9'],
 ['AIRFLOW-7115',
  'Minor',
  '2020-03-30T14:33:51.000+0000',
  'Resolved',
  '1.10.9'],
 ['AIRFLOW-7114',
  'Minor',
  '2020-03-30T13:57:11.000+0000',
  'Resolved',
  '1.10.9'],
 ['AIRFLOW-7113',
  'Major',
  '2020-03-27T18:22:28.000+0000',
  'Closed',
  '1.10.10'],
 ['AIRFLOW-7112',
  'Trivial',
  '2020-03-27T10:07:53.000+0000',
  'Resolved',
  '1.10.9'],
 ['AIRFLOW-7111',
  'Major',
  '2020-03-25T17:11:25.000+0000',
  'Resolved',
  

In [229]:
#Reverse the data to start it from the first issue
final_data=data[::-1]

In [230]:
#The final data
final_data

[['AIRFLOW-1', 'Major', '2016-04-15T04:07:26.000+0000', 'Resolved', ''],
 ['AIRFLOW-6', 'Major', '2016-04-22T00:04:48.000+0000', 'Closed', ''],
 ['AIRFLOW-7', 'Minor', '2016-04-27T00:04:31.000+0000', 'Closed', ''],
 ['AIRFLOW-9', 'Major', '2016-04-27T18:28:20.000+0000', 'Closed', ''],
 ['AIRFLOW-10', 'Major', '2016-04-27T18:40:00.000+0000', 'Closed', ''],
 ['AIRFLOW-11', 'Major', '2016-04-27T18:40:12.000+0000', 'Closed', ''],
 ['AIRFLOW-12', 'Major', '2016-04-27T18:41:30.000+0000', 'Closed', ''],
 ['AIRFLOW-13', 'Major', '2016-04-27T18:41:51.000+0000', 'Resolved', ''],
 ['AIRFLOW-14', 'Major', '2016-04-28T14:13:53.000+0000', 'Closed', ''],
 ['AIRFLOW-15', 'Major', '2016-04-28T14:43:00.000+0000', 'Resolved', ''],
 ['AIRFLOW-16', 'Major', '2016-04-28T16:30:27.000+0000', 'Resolved', ''],
 ['AIRFLOW-17', 'Major', '2016-04-28T16:40:31.000+0000', 'Closed', ''],
 ['AIRFLOW-18', 'Critical', '2016-04-28T18:38:59.000+0000', 'Closed', ''],
 ['AIRFLOW-19', 'Minor', '2016-04-28T21:08:00.000+0000', 
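
Reversing the list works because the API returned the issues newest first. An alternative that does not depend on the order of the response is to sort on the `created` column. The sketch below assumes every timestamp has the same shape as the ones printed above; `parse_created` and `sort_oldest_first` are hypothetical helper names, and the three rows are copied from the output above in shuffled order.

```python
from datetime import datetime

# Format of the 'created' strings shown above, e.g. '2016-04-15T04:07:26.000+0000'
JIRA_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_created(created):
    """Parse a JIRA 'created' string into a timezone-aware datetime."""
    return datetime.strptime(created, JIRA_TIME_FORMAT)


def sort_oldest_first(rows):
    """Return the rows ordered by creation time (column index 2), oldest first."""
    return sorted(rows, key=lambda row: parse_created(row[2]))


rows = [
    ['AIRFLOW-7', 'Minor', '2016-04-27T00:04:31.000+0000', 'Closed', ''],
    ['AIRFLOW-1', 'Major', '2016-04-15T04:07:26.000+0000', 'Resolved', ''],
    ['AIRFLOW-6', 'Major', '2016-04-22T00:04:48.000+0000', 'Closed', ''],
]
print([row[0] for row in sort_oldest_first(rows)])
```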

In [231]:
#Importing pandas to save the data as a CSV file
import pandas as pd

In [232]:
#The column names for our dataframe(finally our csv file)
col_names=['key','priority','created','status','version']

In [233]:
#Creating a pandas DataFrame
final_df=pd.DataFrame(final_data,columns=col_names)

In [234]:
#Final dataframe
final_df

| | key | priority | created | status | version |
|---|---|---|---|---|---|
| 0 | AIRFLOW-1 | Major | 2016-04-15T04:07:26.000+0000 | Resolved | |
| 1 | AIRFLOW-6 | Major | 2016-04-22T00:04:48.000+0000 | Closed | |
| 2 | AIRFLOW-7 | Minor | 2016-04-27T00:04:31.000+0000 | Closed | |
| 3 | AIRFLOW-9 | Major | 2016-04-27T18:28:20.000+0000 | Closed | |
| 4 | AIRFLOW-10 | Major | 2016-04-27T18:40:00.000+0000 | Closed | |
| ... | ... | ... | ... | ... | ... |
| 7033 | AIRFLOW-7117 | Minor | 2020-04-01T18:42:15.000+0000 | Resolved | 1.10.5 1.10.6 1.10.7 1.10.8 1.10.9 |
| 7034 | AIRFLOW-7118 | Minor | 2020-04-01T19:34:38.000+0000 | Resolved | 1.10.5 1.10.6 1.10.7 1.10.8 1.10.9 |
| 7035 | AIRFLOW-7119 | Minor | 2020-04-02T07:03:01.000+0000 | Open | 1.10.9 |
| 7036 | AIRFLOW-7120 | Blocker | 2020-04-28T12:14:24.000+0000 | Open | 1.10.10 |


In [235]:
#Saving the data from dataframe to csv file
final_df.to_csv("airflow_interface.csv")
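
Called this way, `to_csv` also writes the DataFrame's row index as an unnamed first column. Whether the target interface expects that column is not stated here; if it does not, passing `index=False` to `to_csv` omits it. To show what the resulting file looks like without an index column, the sketch below serializes two rows taken from the outputs above using only Python's standard `csv` module rather than pandas. The helper name `rows_to_csv_text` is hypothetical.

```python
import csv
import io

col_names = ['key', 'priority', 'created', 'status', 'version']
rows = [
    ['AIRFLOW-1', 'Major', '2016-04-15T04:07:26.000+0000', 'Resolved', ''],
    ['AIRFLOW-7118', 'Minor', '2020-04-01T19:34:38.000+0000', 'Resolved',
     '1.10.5 1.10.6 1.10.7 1.10.8 1.10.9'],
]


def rows_to_csv_text(rows, header):
    """Serialize rows to CSV text with a header line and no index column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv_text(rows, col_names)
print(csv_text)

# Reading the text back recovers the header and the same rows.
parsed = list(csv.reader(io.StringIO(csv_text)))
```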