# Microtask 1

## Aim of the task: 
Analysis of data fetched by perceval on a per-quarter basis.  
This includes (but not limited to) :
- The number of new committers per quarter
- The number of new issue and pull request submitters per quarter
- The total number of issues, commits and pull requests per quarter

This task has been done purely in Python

# Collecting Data
Just like microtask-0, data is collected from the progit project on github. Specifically, the repositories progit2-ru and progit2-zh were used.

```python
github_url = "https://github.com/"
owner = "progit"
repos_used = ["progit2-ru", "progit2-zh"]
repo_urls = [github_url + owner + "/" + repo_used for repo_used in repos_used]
auth_token = " " # please populate this field with your github auth token 
```
And then:
```python
for repo, repo_url in zip(repos_used, repo_urls):
    print(repo, repo_url)
    if repo == repos_used[0]:
        !perceval git --json-line $repo_url > ../progit.json
        
    else:
        !perceval git --json-line $repo_url >> ../progit.json
        
    !perceval github -t $auth_token --json-line --sleep-for-rate --category pull_request $owner $repo >> ../progit.json
    
    !perceval github -t $auth_token --json-line --sleep-for-rate --category issue $owner $repo >> ../progit.json
```
The code snippet above has already been used in microtask0 to generate the progit.json file present in the parent directory of our current location. Thus, I am not putting this snippet in a runnable cell for this microtask. 



In [1]:
import json
import csv
import datetime

## Defining a few constants
**start_year and end_year**:
    these denote the range of quarters we'll be considering. For example, the current usage implies that 8 quarters will be considered for this metric. THus, the end_year will be considered.

In [2]:
# Here both start year and end year are included
start_year = 2017
end_year = 2018 

# This dicts represent the range of dates which fall into each quarter
quar1_dates = {"start_date": "01-01", "end_date": "03-31"}
quar2_dates = {"start_date": "04-01", "end_date": "06-30"}
quar3_dates = {"start_date": "07-01", "end_date": "09-30"}
quar4_dates = {"start_date": "10-01", "end_date": "12-31"}

# These sets allow one to track the number of new contributers for each item (commit, pull request, issue)
old_committers = set()
old_issue_subs = set()
old_pr_subs = set()

## The Quarter class
I was looking for an easy way to represent the collected data. I went with a class representation of a quarter so that when the time comes to analyze data per quarter, all that's left to be done is to use the instance variables of that quarter object.

In [3]:
class Quarter:
    
    def __init__(self, number, year):
        # the quarter number and year (these make a quarter unique (like a candidate key))
        self.number = number    
        self.year = year   
   
        # these store the number of commits, issues and pull_requests created during a particular quarter
        self.num_commits = 0  
        self.num_issues = 0
        self.num_pullrequests = 0
        
        # these store the number of new contributers in that particular quarter
        self.new_committers = 0
        self.new_issue_subs = 0
        self.new_pr_subs = 0
        
        # these represent the date range over which a particular quarter is valid
        self.start_date = ""
        self.end_date = ""
        
        # populate the self.start_date and self.end_date instance variables
        if self.number == 1:
            self.start_date = str(self.year) + '-' + quar1_dates["start_date"]
            self.end_date = str(self.year) + '-' + quar1_dates["end_date"]
            
        if self.number == 2:
            self.start_date = str(self.year) + '-' + quar2_dates["start_date"]
            self.end_date = str(self.year) + '-' + quar3_dates["end_date"]            
            
        if self.number == 3:
            self.start_date = str(self.year) + '-' + quar3_dates["start_date"]
            self.end_date = str(self.year) + '-' + quar3_dates["end_date"]
            
        if self.number == 4:
            self.start_date = str(self.year) + '-' + quar4_dates["start_date"]
            self.end_date = str(self.year) + '-' + quar4_dates["end_date"]
        
            
    def is_includes_data(self, date):
        """
        :param data: this is a date in the form of a string which will be converted 
        into a date time object using the _str_to_dt_data() static method. 
        
        Note: the Quarter instance variables: self.start_date and self.end_date are also strings. 
        To convert them to a date time object of the same format, _str_to_dt_quarter() is used.
        """
        if self._str_to_dt_quarter(self.start_date) <= self._str_to_dt_data(date) < self._str_to_dt_quarter(self.end_date):
            return True

        return False
    
    def add_analysis(self, datapoint):
        
        if datapoint['category'] == "commit":
            self.num_commits += 1 
            
            # if the author has already committed before, do nothing
            if datapoint['author'] not in old_committers:
                self.new_committers += 1
                
            old_committers.add(datapoint['author'])
                
        if datapoint['category'] == "issue":
            self.num_issues += 1 
            
            if datapoint['author'] not in old_issue_subs:
                self.new_issue_subs += 1

            old_issue_subs.add(datapoint['author'])
            
        if datapoint['category'] == "pull_request":
            self.num_pullrequests += 1 
            
            if datapoint['author'] not in old_pr_subs:
                self.new_pr_subs += 1
                
            old_pr_subs.add(datapoint['author'])
    
    @staticmethod
    def _str_to_dt_data(date):
        """
        :param date: converts date (str) to a datetime object 
        Note: the string format for the date in the json file is either: 
         - %a %b %d %H:%M:%S %Y %z --> for commits
         - %Y-%m-%dT%H:%M:%SZ      --> for issues and pull requests
        """        
        try:
            datetimestr =  datetime.datetime.strptime(date, "%a %b %d %H:%M:%S %Y %z").strftime("%Y-%m-%d")
        
        except ValueError as ve:
            datetimestr =  datetime.datetime.strptime(date, "%Y-%m-%dT%H:%M:%SZ").strftime("%Y-%m-%d")
        
        finally:
            datetimeobj = datetime.datetime.strptime(datetimestr, "%Y-%m-%d")
            return datetimeobj
        
    @staticmethod
    def _str_to_dt_quarter(date):
        
        datetimeobj =  datetime.datetime.strptime(date, "%Y-%m-%d")
        return datetimeobj
    
    def __str__(self):
        return str(self.number) + " " + str(self.year)

# Cleaning the data
This class creates a dictionary with three keys:
    - "commit"
    - "issue"
    - "pull_request"
The value of each key is a list, whose elements are indivisual commits, pull requests or issues (based on the key). Each element is of type `dict`

In [4]:
class CleanJson():   
    
    def __init__(self, path_to_file):
        
        self.clean_data = {
            'commit': [],
            'issue': [],
            'pull_request': []
        }
        
        with open(path_to_file, 'r') as raw_data:
            for line in raw_data:
                line = json.loads(line)

                clean_line = dict()
                if line['category'] == "commit":
                    clean_line = self._clean_commit(line)

                elif line['category'] == "issue":
                    clean_line = self._clean_issue(line)

                elif line['category'] == "pull_request":
                    clean_line = self._clean_pr(line)

                self.clean_data[line['category']].append(clean_line)

                    
    @staticmethod                
    def _clean_commit(line):
        repo_name = line['origin']
        line_data = line['data']
        summary = {
            'repo': repo_name,
            'hash': line_data['commit'],
            'category': "commit",
            'commit': line_data['Commit'],
            'author': line_data['Author'],
            'created_date': line_data['CommitDate'],
            'files_no': len(line_data['files'])
        }
        
        actions = 0
        
        for file in line_data['files']:
            if 'action' in file:
                actions += 1
                summary['files_action'] = actions
                summary['merge'] = 'Merge' in line_data
        return summary
    
    @staticmethod
    def _clean_issue(line):
        repo_name = line['origin']
        line_data = line['data']
        cleaned_line ={
            'repo': repo_name,
            'hash': line_data['id'],
            'category': "issue",
            'author': line_data['user']['login'],
            'created_date': line_data['created_at'],
            'current_status': line_data['state']   
        }
        
        return cleaned_line
    
    @staticmethod
    def _clean_pr(line):
        repo_name = line['origin']
        line_data = line['data']
        cleaned_line ={
            'repo': repo_name,
            'hash': line_data['id'],
            'category': "pull_request",
            'author': line_data['user']['login'],
            'created_date': line_data['created_at'],
            'current_status': line_data['state']   
        }
        
        return cleaned_line
    

## Creating the required number of quarters
The list comprehension below is a general way to create the required quarter objects, depending on the start_year and end_year global variables declared above.

In [5]:
year_list = [x for x in range(start_year, end_year + 1)]
quar_list = [Quarter(num, year)  for year in year_list for num in range(1, 5)]

In [6]:
clean_data = CleanJson('../progit.json')

## Populating the quarter objects
The snippet below loops through each key in the clean_data.clean_data dictionary. Remember, the structure of clean_data.clean_data is:
```python
    {
        'commit': [commit1_dict, commit2_dict, ....], 
        'issue': [issue1_dict, issue2_dict, ....], 
        'pull_request': [pr1_dict, pr2_dict, ....], 
    }
```
For each key, it loop through each quarter object, present in the quar_list list and then decides if each element in the value for that key falls in that quarter. 
If that is the case, it updates the quarter object's instance variables using the quarter.add_analysis(data_point) method. 

In [7]:
for category in ("commit", "issue", "pull_request"):
    data = clean_data.clean_data[category]
    for quarter in quar_list:
        
        for data_point in data:
            if quarter.is_includes_data(data_point["created_date"]):
                quarter.add_analysis(data_point)


# Analysis
The idea of creating an object for each quarter allows one to easily analyze the data returned by perceval. For each topic, be it number of items, or number of new contributers, simple print the corresponding instance variable for each Quarter object

## Number of commits, pull requests and issues per quadrant

In [8]:
for q in quar_list:
    print("Quarter num and year: " ,q,    
          " \n commits: ", q.num_commits,
          " \n issues: ", q.num_issues, 
          " \n pull requests: ", q.num_pullrequests)
    print("--------------------------")

Quarter num and year:  1 2017  
 commits:  13  
 issues:  8  
 pull requests:  4
--------------------------
Quarter num and year:  2 2017  
 commits:  9  
 issues:  12  
 pull requests:  8
--------------------------
Quarter num and year:  3 2017  
 commits:  5  
 issues:  8  
 pull requests:  5
--------------------------
Quarter num and year:  4 2017  
 commits:  11  
 issues:  8  
 pull requests:  7
--------------------------
Quarter num and year:  1 2018  
 commits:  18  
 issues:  12  
 pull requests:  7
--------------------------
Quarter num and year:  2 2018  
 commits:  19  
 issues:  16  
 pull requests:  12
--------------------------
Quarter num and year:  3 2018  
 commits:  3  
 issues:  4  
 pull requests:  4
--------------------------
Quarter num and year:  4 2018  
 commits:  4  
 issues:  6  
 pull requests:  2
--------------------------


## Number of new committers, new issue submitters and pull request creators

In [9]:
for q in quar_list:
    print("Quarter num and year: " ,q,    
          " \n new committers: ", q.new_committers,
          " \n new issue submitters: ", q.new_issue_subs, 
          " \n new pull request creators: ", q.new_pr_subs)
    print("--------------------------")

Quarter num and year:  1 2017  
 new committers:  7  
 new issue submitters:  7  
 new pull request creators:  4
--------------------------
Quarter num and year:  2 2017  
 new committers:  5  
 new issue submitters:  9  
 new pull request creators:  7
--------------------------
Quarter num and year:  3 2017  
 new committers:  0  
 new issue submitters:  0  
 new pull request creators:  0
--------------------------
Quarter num and year:  4 2017  
 new committers:  4  
 new issue submitters:  4  
 new pull request creators:  3
--------------------------
Quarter num and year:  1 2018  
 new committers:  8  
 new issue submitters:  9  
 new pull request creators:  6
--------------------------
Quarter num and year:  2 2018  
 new committers:  6  
 new issue submitters:  9  
 new pull request creators:  8
--------------------------
Quarter num and year:  3 2018  
 new committers:  0  
 new issue submitters:  0  
 new pull request creators:  0
--------------------------
Quarter num and year

# Viewing data as a csv file

## Writing the cleaned data to a csv
The following function takes a file path as a parameter and writes to that file the following: 
    quarter number and year
    number of commits, issues, and pull requests in that quarter
    number of new committers, new issue submitters and new pull request creators.
    
The actual process of writing to the csv is done with the help of the `csv` python package and specifically, `csv.writer`

In [10]:
def write_to_csv(file_path):
    with open(file_path, 'w', ) as csvfile:
        csv_writer = csv.writer(csvfile, delimiter=',')
        
        file_headers = ["Quarter(Num)", "Quarter(Year)", "Num_Commits", "Num_Issues", "Num_PRs", "Num_new_commits", "Num_new_issues", "Num_new_prs"]
        csv_writer.writerow(file_headers)
        
        for quar in quar_list:
            row = [str(quar.number),         \
                 str(quar.year),             \
                 str(quar.num_commits) ,     \
                 str(quar.num_issues)  ,     \
                 str(quar.num_pullrequests), \
                 str(quar.new_committers),   \
                 str(quar.new_issue_subs),   \
                 str(quar.new_pr_subs)       
                  ]
            csv_writer.writerow(x for x in row)
            

In [11]:
write_to_csv("../progit.csv")

## Displaying a table based on the csv file
The following function creates a table after the reading the csv file created above, allowing one to visualize the data stored in the csv

In [12]:
def create_table(file_path):
    with open(file_path, 'r', ) as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=',')

        for row in csv_reader:
            for field in row:
                print("%-10s" %field, end="\t")
            print()


In [13]:
create_table('../progit.csv')

Quarter(Num)	Quarter(Year)	Num_Commits	Num_Issues	Num_PRs   	Num_new_commits	Num_new_issues	Num_new_prs	
1         	2017      	13        	8         	4         	7         	7         	4         	
2         	2017      	9         	12        	8         	5         	9         	7         	
3         	2017      	5         	8         	5         	0         	0         	0         	
4         	2017      	11        	8         	7         	4         	4         	3         	
1         	2018      	18        	12        	7         	8         	9         	6         	
2         	2018      	19        	16        	12        	6         	9         	8         	
3         	2018      	3         	4         	4         	0         	0         	0         	
4         	2018      	4         	6         	2         	1         	5         	1         	
