# Microtask 0

## Aim:
- Get a basic understanding of perceval and the github api data it fetches
- Get comfortable analyzing said data: 
    - Total number of commits
    - Number of issues and pull-requests
    - Number of open issues and closed issues

# Collecting the data
Data is collected from the progit project on github. 
Specifically, the repositories used were: 
- [progit2-ru](https://github.com/progit/progit2-ru)
- [progit2-zh](https://github.com/progit/progit2-zh)

The data was fetched at: `Wednesday 20 March 2019 05:52:27 PM IST`

### Some numbers (from the analysis in this notebook): 
**The total number of commits is**:   
    progit2-ru: 1292  
    progit2-zh: 1732  
      
**The total number of issues is**:   
    progit2-ru: 218  
    progit2-zh: 366  
      
**The total number of pull-requests is**:       
    progit2-ru: 200  
    progit2-zh: 258  

## Getting the data
The cell below helps understand the data to be fetched, like the **owner**, the **repository names**, the **repository urls** and most importantly the **github authentication token**.  

**Make sure to fill in your token for the `auth_token` variable in the cell below.**

In [44]:
github_url = "https://github.com/"  # the github url domain: used for generating repo_urls
owner = "progit"
repos_used = ["progit2-ru", "progit2-zh"]
repo_urls = [github_url + owner + "/" + repo_used for repo_used in repos_used]
auth_token = "" # Please enter your github token here

### The script in the cell below uses the notebook's shell escapes (`!`) as a generalized way to create and populate a file named `progit.json` using perceval.  

The steps involved are simple: 
For each repository specified in the `repos_used` variable, fetch its git data (commits), then its pull request data and finally its issue data from the github api, appending each to `progit.json` in that order. 

**Note**: the cell has been commented out to prevent an accidental overwrite of the `progit.json` file in the parent directory of the current directory. To work on more recent data, or to analyze a completely different set of repositories, uncomment the snippet below and run the cell.

In [31]:
# for repo, repo_url in zip(repos_used, repo_urls):
#     print(repo, repo_url)
#     if repo == repos_used[0]:
#         !perceval git --json-line $repo_url > ../progit.json

#     else:
#         !perceval git --json-line $repo_url >> ../progit.json

#     !perceval github -t $auth_token --json-line --sleep-for-rate --category pull_request $owner $repo >> ../progit.json

#     !perceval github -t $auth_token --json-line --sleep-for-rate --category issue $owner $repo >> ../progit.json
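
Outside a notebook, the same sequence of perceval invocations can be assembled as argument lists, for example to hand to `subprocess.run`. The sketch below is only an illustration: the helper name `perceval_commands` is hypothetical, and it uses only the sub-commands and flags already shown in the commented cell above.

```python
def perceval_commands(owner, repo, token, github_url="https://github.com/"):
    """Return the three perceval command lines for one repository, in fetch order."""
    repo_url = github_url + owner + "/" + repo
    # 1. git data (commits) comes from the repository URL
    commands = [["perceval", "git", "--json-line", repo_url]]
    # 2. pull requests, then 3. issues, both through the github backend
    for category in ("pull_request", "issue"):
        commands.append([
            "perceval", "github", "-t", token, "--json-line",
            "--sleep-for-rate", "--category", category, owner, repo,
        ])
    return commands


# "<your-token>" is a placeholder; nothing is executed here, the commands are only printed.
for command in perceval_commands("progit", "progit2-ru", "<your-token>"):
    print(" ".join(command))
```

Building the commands as lists rather than as one shell string keeps the token and repository names from being re-interpreted by a shell.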

In [32]:
import json
import datetime
import pprint

## The CleanJson class:

### Structure: 
    
variables:
- self.clean_data

instance methods:
- \_\_init__   
    - parameters:   path_to_file  
    - returns: None  
 
- number_of_repos  
    - parameters:  None  
    - returns: int (number of repos used)  

- total_commits  
    - parameters:  None  
    - returns: int (total_commits)  

- total_commits_per_repo  
    - parameters:   None  
    - returns: dict  
  
- count_from_to   
    - parameters:   start=None, end=None, type_of_date="author_date", empty=True, merge=True  
    - returns: int  
        
private methods:
- \_str_to_dt_data: converts a date string to a datetime object truncated to the day ("%Y-%m-%d")
- \_clean_commit: takes a parsed line (dict) from the json file and keeps only the commit fields used here
- \_clean_issue: takes a parsed line (dict) from the json file and keeps only the issue fields used here
- \_clean_pr: takes a parsed line (dict) from the json file and keeps only the pull request fields used here
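
Every line of `progit.json` is one JSON object, and the class relies on three top-level keys: `category`, `origin` and `data`. As a rough illustration, the sketch below builds a made-up commit item and reads it back. All values are placeholders rather than real data from the repositories, and real items carry many more fields than the ones `_clean_commit` reads.

```python
import json

# A hypothetical, heavily trimmed commit item as it would appear on one line of the file.
sample_line = json.dumps({
    "category": "commit",
    "origin": "https://github.com/progit/progit2-ru",
    "data": {
        "commit": "0000000000000000000000000000000000000000",  # placeholder hash
        "Commit": "Jane Doe <jane@example.com>",               # committer (placeholder)
        "Author": "Jane Doe <jane@example.com>",               # author (placeholder)
        "AuthorDate": "Wed Mar 20 17:52:27 2019 +0530",
        "CommitDate": "Wed Mar 20 17:52:27 2019 +0530",
        "files": [{"file": "README.md", "action": "M"}],
    },
})

item = json.loads(sample_line)
print(item["category"], item["origin"])
print(sorted(item["data"]))  # the keys that _clean_commit reads
```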
     

In [33]:
class CleanJson:
                
    def __init__(self, path_to_file):
        
        self.clean_data = {
            "commit": [], 
            "issue": [], 
            "pull_request": []
        }
        
        with open(path_to_file, 'r') as raw_data:
            for line in raw_data:
                line = json.loads(line)

                clean_line = dict()
                if line['category'] == "commit":
                    clean_line = self._clean_commit(line)
                    
                elif line['category'] == "issue":
                    clean_line = self._clean_issue(line)

                elif line['category'] == "pull_request":
                    clean_line = self._clean_pr(line)

                self.clean_data[line['category']].append(clean_line)        
    
    
    def number_of_repos(self):
        return len(repos_used)
    
    def total_commits(self):
        return len(self.clean_data['commit'])
    
    def total_commits_per_repo(self):
        commits_per_repo = {el:0 for el in repo_urls}
        
        for commit in self.clean_data['commit']:
            
            commits_per_repo[commit['repo']] += 1
    
        return commits_per_repo
    
    def count_from_to(self, start=None, end=None, type_of_date="author_date", empty=True, merge=True):
        # commit_list has elements of a specific category
        category = "commit"
        commit_list = self.clean_data[category]
        start_date = datetime.datetime.strptime(start, "%Y-%m-%d") if start is not None else datetime.datetime.min
        end_date = datetime.datetime.strptime(end, "%Y-%m-%d") if end is not None else datetime.datetime.max
        
        required_commit_set = set()
        for elem in commit_list:
            if start_date <= self._str_to_dt_data(elem[type_of_date]) <= end_date:
                keep_empty = empty or elem['files_action'] != 0
                keep_merge = merge or not elem['merge']
                if keep_empty and keep_merge:
                    required_commit_set.add(elem['hash'])
        return len(required_commit_set)
                    
    
    # private methods to clean data ---------------------------
    @staticmethod
    def _str_to_dt_data(date):
        """
        Convert a date string to a naive datetime object truncated to the day.

        :param date: date (str) in one of the formats used in the json file:
         - %a %b %d %H:%M:%S %Y %z --> for commits
         - %Y-%m-%dT%H:%M:%SZ      --> for issues and pull requests
        """
        try:
            parsed = datetime.datetime.strptime(date, "%a %b %d %H:%M:%S %Y %z")
        except ValueError:
            parsed = datetime.datetime.strptime(date, "%Y-%m-%dT%H:%M:%SZ")

        # keep only the calendar date so it compares cleanly with the naive start/end dates
        return datetime.datetime(parsed.year, parsed.month, parsed.day)
    
    @staticmethod                
    def _clean_commit(line):
        repo_name = line['origin']
        line_data = line['data']
        cleaned_line = {
            'repo': repo_name,
            'hash': line_data['commit'],      # commit hash
            'category': "commit",
            'commit': line_data['Commit'],    # committer
            'author': line_data['Author'],
            'author_date': line_data['AuthorDate'],
            'commit_date': line_data['CommitDate'],
            'files_no': len(line_data['files'])
        }
        
        # number of files affected by the commit (entries that carry an 'action' key)
        cleaned_line['files_action'] = sum(1 for file in line_data['files'] if 'action' in file)
        # a 'Merge' key is present only for merge commits
        cleaned_line['merge'] = 'Merge' in line_data
        return cleaned_line
    
    @staticmethod
    def _clean_issue(line):
        repo_name = line['origin']
        line_data = line['data']
        cleaned_line ={
            'repo': repo_name,
            'hash': line_data['id'],
            'category': "issue",
            'author': line_data['user']['login'],
            'created_date': line_data['created_at'],
            'current_status': line_data['state']   
        }
        return cleaned_line
    
    @staticmethod
    def _clean_pr(line):
        repo_name = line['origin']
        line_data = line['data']
        cleaned_line ={
            'repo': repo_name,
            'hash': line_data['id'],
            'category': "pull_request",
            'author': line_data['user']['login'],
            'created_date': line_data['created_at'],
            'current_status': line_data['state']   
        }
        return cleaned_line
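
The two date formats handled by `_str_to_dt_data` can be checked in isolation. The standalone function below is a sketch of the same idea (the name `to_naive_date` and the sample strings are made up for illustration): try the commit format first, fall back to the issue and pull request format, then keep only the calendar date.

```python
import datetime


def to_naive_date(date):
    """Parse either date format and keep only the calendar date (no time, no timezone)."""
    try:
        # format used for commits, e.g. "Wed Mar 20 17:52:27 2019 +0530"
        parsed = datetime.datetime.strptime(date, "%a %b %d %H:%M:%S %Y %z")
    except ValueError:
        # format used for issues and pull requests, e.g. "2019-03-20T12:22:27Z"
        parsed = datetime.datetime.strptime(date, "%Y-%m-%dT%H:%M:%SZ")
    return datetime.datetime(parsed.year, parsed.month, parsed.day)


commit_style = to_naive_date("Wed Mar 20 17:52:27 2019 +0530")
issue_style = to_naive_date("2019-03-20T12:22:27Z")
print(commit_style, issue_style)
```

Note that the date is taken as written in the string: a commit date keeps the day in the committer's own timezone and is not converted to UTC, so commits made close to midnight can land on a different day than the UTC-based issue dates would suggest.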

# Analyzing the data using the Class
The cell below creates a CleanJson object `data` for the file `progit.json` located in the parent directory of this notebook's directory.

In [34]:
data = CleanJson('../progit.json')

## Total number of commits
A simple use of the `total_commits()` and `total_commits_per_repo()` methods

In [35]:
print("The total number of commits in all repos is: ", data.total_commits())
print("The number of commits repo-wise is: \n", data.total_commits_per_repo())

The total number of commits in all repos is:  3024
The number of commits repo-wise is: 
 {'https://github.com/progit/progit2-ru': 1292, 'https://github.com/progit/progit2-zh': 1732}


## Total number of commits between dates
A simple use of the count_from_to method with different use cases

In [36]:
print("Code changes count all period:", data.count_from_to())
print("Code changes count from 2018-01-01 to 2019-07-01:",
      data.count_from_to(start="2018-01-01", end="2019-07-01"))
print("Code changes count from 2018-01-01 to 2018-07-01 (no merge commits):",
      data.count_from_to(start="2018-01-01", end="2018-07-01", merge=False))


Code changes count all period: 2402
Code changes count from 2018-01-01 to 2019-07-01: 77
Code changes count from 2018-01-01 to 2018-07-01 (no merge commits): 27
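
The all-period count (2402) is lower than the 3024 commits reported by `total_commits()`. This follows from how `count_from_to` works: it collects commit hashes in a set, so a hash that appears more than once in `progit.json` (for example, under both repositories) is counted only once. The toy data below is made up purely to show that effect.

```python
# Hypothetical cleaned commits: the same hash appears under two repositories.
commits = [
    {"repo": "repo-a", "hash": "abc"},
    {"repo": "repo-b", "hash": "abc"},  # same commit seen in a second repository
    {"repo": "repo-b", "hash": "def"},
]

total_items = len(commits)                                   # what total_commits() counts
unique_hashes = len({commit["hash"] for commit in commits})  # what count_from_to() counts
print(total_items, unique_hashes)
```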


# Analyzing the json file directly
Even without the CleanJson class, perceval makes it very easy to analyze the data it produces. A simple context manager that reads the json file line by line is all that's required.

Once populated, `github_data` will look something like:

```python
    {
        'commit': [commit1_dict, commit2_dict, ....], 
        'issue': [issue1_dict, issue2_dict, ....], 
        'pull_request': [pr1_dict, pr2_dict, ....], 
    }
```


In [38]:
github_data =  {
                "commit": [], 
                "issue": [], 
                "pull_request": []
                }

num_items = 0  # counts every line read: commits, issues and pull requests
with open('../progit.json', 'r') as github_data_file:
    for line in github_data_file:
        data_line = json.loads(line)
        num_items += 1

        category = data_line['category']
        github_data[category].append(data_line)

## Total number of non empty commits
Empty commits are those which do not affect any files in the repository. To count the non-empty commits, we loop through all the commits in `github_data` and check whether any file was affected by each one. As soon as one such file is found, we count that commit as non-empty and move on to the next one.

In [40]:
non_empty_commits = 0

for commit in github_data['commit']:
    for file in commit['data']['files']:
        if 'action' in file:
            non_empty_commits += 1
            break
            
print(non_empty_commits)            

2610


## Total number of non-merge commits

In [41]:
count = 0

for commit in github_data['commit']:
    if 'Merge' not in commit['data']:
        count += 1
        
print("Number of non-merge commits is: %d" %count)

Number of non-merge commits is: 2453


# Pull Requests and Issues
Though commits are an important way to measure how active an organization is, pull requests and issues give a better picture of the entire community involved in that project. If you are wondering why I decided to structure `github_data` variable the way I did, you will soon realize that the part of the analysis dealing with pull requests and issues becomes so much easier.

## Total number of pull requests and issues

Let's create a dictionary `repo_wise_issues_prs`, whose structure is shown below: 
```python
    {
        repo_url_1: {"issue": .., "pull_request": ..}, 
        repo_url_2: {"issue": .., "pull_request": ..},
        repo_url_3: {"issue": .., "pull_request": ..}, 
        ..
        .
    }
```
The generic keys used allow this part of the script to work no matter which repositories or projects are used for the analysis.

Looping through each issue and pull request in `github_data`, simply populate `repo_wise_issues_prs`

In [42]:
repo_wise_issues_prs = {repo_url: {"issue": 0, "pull_request": 0} for repo_url in repo_urls}
total_issues = 0
total_prs = 0

for elem in github_data['issue']:
    repo_wise_issues_prs[elem['origin']]['issue'] += 1
    total_issues += 1
    
for elem in github_data['pull_request']:
    repo_wise_issues_prs[elem['origin']]['pull_request'] += 1
    total_prs += 1

print(json.dumps(repo_wise_issues_prs, indent=4))
print("Total number of issues: ", total_issues)
print("Total number of pull requests: ", total_prs)

{
    "https://github.com/progit/progit2-ru": {
        "issue": 218,
        "pull_request": 200
    },
    "https://github.com/progit/progit2-zh": {
        "issue": 366,
        "pull_request": 258
    }
}
Total number of issues:  584
Total number of pull requests:  458


## Total number of open and closed issues
From the results in the previous cell, i.e. the total number of issues, we can check the number of open and closed issues. Though a large number of open issues can be correlated with inactivity, this is not always the case: certain issues pertaining to specific project topics are more useful when left open, so that people can keep adding comments to them. A good example is the issues for GSoC ideas.

Thus, the cell below simply inspects the `state` key of each issue. Of course, it is easy to lose track of things when so many features and dimensions are involved. 
To reiterate: 
```python
github_data = {
    "commit": [commit1_dict, commit2_dict, ..],
    "issue": [issue1_dict, issue2_dict, ..],
    "pull_request": [pull_request1_dict, pull_request2_dict, ..]
}
}
```

Coming back to the analysis, if the 'state' of an issue is open, we increment the `num_open_issues` counter. 

In [43]:
num_open_issues = 0
for issue in github_data["issue"]:
    if issue['data']['state'] == "open":
        num_open_issues += 1
        
print("The number of open issues is ", num_open_issues)
print("The number of closed issues is ", total_issues - num_open_issues)

The number of open issues is  32
The number of closed issues is  552
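
The same `state` check can be broken down per repository by counting `(origin, state)` pairs. The sketch below uses `collections.Counter` on made-up issue items trimmed to the two fields involved. With the real data, `issues` would be `github_data["issue"]`.

```python
from collections import Counter

# Hypothetical issue items, trimmed to the two fields used here.
issues = [
    {"origin": "repo-a", "data": {"state": "open"}},
    {"origin": "repo-a", "data": {"state": "closed"}},
    {"origin": "repo-b", "data": {"state": "closed"}},
]

# One counter entry per (repository, state) combination.
states_per_repo = Counter((issue["origin"], issue["data"]["state"]) for issue in issues)
for (repo, state), count in sorted(states_per_repo.items()):
    print(repo, state, count)
```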
