## Microtask0

### Problem Statement:
Use the notebook implementing the Code_Changes metric (also available in MyBinder) as an example of how to collect the data, producing a single JSON file per data source with all items (commits, issues, pull/merge requests) in it. Produce one notebook per data source (git, GitHub/GitLab issues, GitHub pull requests / GitLab merge requests) showing a summary of the contents of that file: the number of items in it and the number of different identities in it (authors/committers for git, submitters for issues and pull/merge requests). In each notebook, also include the list of repositories retrieved and the date of retrieval, using data available in the JSON file. This microtask is mandatory, to show that you can retrieve data and produce a notebook showing it.
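The summary requested above (items, distinct identities, repositories, retrieval date) can be sketched in one pass over the file. This is a minimal sketch, not the notebook's own code: it assumes one Perceval item per line, with the top-level `category`, `origin`, `timestamp` and `data` fields, and the helper name `summarize` is hypothetical.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def summarize(path):
    """Summarize a file that holds one Perceval item (a JSON document) per line."""
    items = defaultdict(int)        # category -> number of items
    identities = defaultdict(set)   # category -> distinct identities seen
    origins = set()                 # repositories the items came from
    latest = None                   # most recent retrieval timestamp (epoch seconds)

    with open(path) as handle:
        for raw in handle:
            if not raw.strip():
                continue
            item = json.loads(raw)
            category, data = item["category"], item["data"]
            items[category] += 1
            origins.add(item["origin"])
            stamp = item["timestamp"]
            latest = stamp if latest is None else max(latest, stamp)

            if category == "commit":
                # git items: count both authors and committers
                identities[category].update({data["Author"], data["Commit"]})
            else:
                # issues and pull requests: count the submitter, if present
                login = (data.get("user") or {}).get("login")
                if login is not None:
                    identities[category].add(login)

    retrieved = (datetime.fromtimestamp(latest, tz=timezone.utc)
                 if latest is not None else None)
    return {
        "items": dict(items),
        "identities": {name: len(found) for name, found in identities.items()},
        "repositories": sorted(origins),
        "retrieved_on": retrieved,
    }
```

Once the data file exists, `pprint(summarize("trial.json"))` would print the four pieces of information in one dictionary.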

### Project: django

### Repositories

In [10]:
import json
import datetime
from collections import defaultdict
from pprint import pprint

### Getting the data

In [16]:
github_url = "https://github.com/"
owner = "sarthak-sehgal"
repos_used = ["timetable-visualizer"]
# github_url = "https://github.com/"
# owner = "django"
# repos_used = ["djangoproject.com"]
repo_urls = [github_url + owner + "/" + repo_used for repo_used in repos_used]
auth_token = "<GITHUB_API_TOKEN>"  # placeholder: supply your own token and keep it out of the notebook
print(repo_urls)



['https://github.com/sarthak-sehgal/timetable-visualizer']


In [17]:
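# Start from an empty file so that re-running this cell does not duplicate items
open("trial.json", "w").close()

for repo, repo_url in zip(repos_used, repo_urls):
    print(repo, repo_url)
    # Append (>>) everywhere so that a later repository does not overwrite an earlier one
    !perceval git --json-line $repo_url >> trial.json
    print("...")
    !perceval github -t $auth_token --json-line --sleep-for-rate --category pull_request $owner $repo >> trial.json
    print("...")
    !perceval github -t $auth_token --json-line --sleep-for-rate --category issue $owner $repo >> trial.json
    print("...")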

timetable-visualizer https://github.com/sarthak-sehgal/timetable-visualizer
[2019-03-20 11:16:25,170] - Sir Perceval is on his quest.
[2019-03-20 11:16:25,172] - Fetching commits: 'https://github.com/sarthak-sehgal/timetable-visualizer' git repository from 1970-01-01 00:00:00+00:00 to 2100-01-01 00:00:00+00:00; all branches
[2019-03-20 11:16:27,015] - Fetch process completed: 14 commits fetched
[2019-03-20 11:16:27,015] - Sir Perceval completed his quest.
...
[2019-03-20 11:16:27,604] - Sir Perceval is on his quest.
[2019-03-20 11:16:30,704] - Sir Perceval completed his quest.
...
[2019-03-20 11:16:31,194] - Sir Perceval is on his quest.
[2019-03-20 11:16:34,960] - Getting info for https://api.github.com/users/kunal-mohta
[2019-03-20 11:16:36,882] - Getting info for https://api.github.com/users/sarthak-sehgal
[2019-03-20 11:16:38,053] - Sir Perceval completed his quest.
...
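The three Perceval calls above can also be assembled as argument lists, which keeps the API token out of the notebook source. The following is only a sketch: the helper name `perceval_commands` and the environment variable `GITHUB_TOKEN` are assumptions made for illustration, while the flags are the ones used in the cell above. It builds and prints the commands without running them.

```python
import os
import shlex


def perceval_commands(owner, repo, token, base_url="https://github.com/"):
    """Return the perceval command lines (as argument lists) for one repository."""
    repo_url = f"{base_url}{owner}/{repo}"
    commands = [["perceval", "git", "--json-line", repo_url]]
    for category in ("pull_request", "issue"):
        commands.append([
            "perceval", "github", "-t", token, "--json-line",
            "--sleep-for-rate", "--category", category, owner, repo,
        ])
    return commands


# GITHUB_TOKEN is a hypothetical variable name; use whatever your setup defines
token = os.environ.get("GITHUB_TOKEN", "")

for command in perceval_commands("sarthak-sehgal", "timetable-visualizer", token):
    # Mask the token before printing so it never lands in the cell output
    shown = ["***" if token and part == token else part for part in command]
    print(shlex.join(shown))
```

Each list could then be passed to `subprocess.run(command, stdout=handle, check=True)` with `handle` opened in append mode on the data file, which mirrors the `>>` redirection of the shell version.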


In [102]:
class Code_Changes:
    """Load Perceval items from a JSON-lines file and compute commit counts."""

    def __init__(self, path_to_file):
        
        self.clean_data = defaultdict(list)
        
        with open(path_to_file, 'r') as raw_data:
            for line in raw_data:
                line = json.loads(line)

                clean_line = dict()
                if line['category'] == "commit":
                    clean_line = self._clean_commit(line)
                    
                elif line['category'] == "issue":
                    clean_line = self._clean_issue(line)

                elif line['category'] == "pull_request":
                    clean_line = self._clean_pr(line)

                self.clean_data[line['category']].append(clean_line)        
    
    
    def number_of_repos(self):
        return len(repos_used)
    
    def total_commits(self):
        return len(self.clean_data['commit'])
    
    def total_commits_per_repo(self):
        commits_per_repo = {el:0 for el in repo_urls}
        
        for commit in self.clean_data['commit']:
            
            commits_per_repo[commit['repo']] += 1
    
        return commits_per_repo
    
    def count_from_to(self, start=None, end=None, type_of_date="author_date", empty=True, merge=True):
        """Count distinct commits dated between start and end ("YYYY-MM-DD" strings).

        empty=False skips commits that touch no files; merge=False skips merge commits.
        """
        commit_list = self.clean_data["commit"]
        start_date = datetime.datetime.strptime(start, "%Y-%m-%d") if start is not None else datetime.datetime.min
        end_date = datetime.datetime.strptime(end, "%Y-%m-%d") if end is not None else datetime.datetime.max

        required_commit_set = set()
        for elem in commit_list:
            if not start_date <= self._clean_date(elem[type_of_date]) <= end_date:
                continue
            # .get() keeps the filters safe for commits that lack these keys
            if not empty and elem.get('files_action', 0) == 0:
                continue
            if not merge and elem.get('merge', False):
                continue
            required_commit_set.add(elem['hash'])
        return len(required_commit_set)
                    
    
    # private methods to clean data ---------------------------
    
    @staticmethod
    def _clean_date(date_long_format):
        datetimeobj = datetime.datetime.strptime(date_long_format, "%a %b %d %H:%M:%S %Y %z")
        datetimeobj = datetimeobj.replace(tzinfo=None)
    
        return datetimeobj
    
    @staticmethod                
    def _clean_commit(line):
        repo_name = line['origin']
        line_data = line['data']
        cleaned_line = {
            'repo': repo_name,
            'hash': line_data['commit'],
            'category': "commit",
            'commit': line_data['Commit'],
            'author': line_data['Author'],
            'author_date': line_data['AuthorDate'],
            'commit_date': line_data['CommitDate'],
            'files_no': len(line_data['files'])
        }
        
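        # Count the files touched by the commit that carry an action
        actions = 0
        for file in line_data['files']:
            if 'action' in file:
                actions += 1

        # Set both keys for every commit, including commits without file actions
        cleaned_line['files_action'] = actions
        cleaned_line['merge'] = 'Merge' in line_data
        return cleaned_line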
    
    @staticmethod
    def _clean_issue(line):
        repo_name = line['origin']
        line_data = line['data']
        cleaned_line ={
            'repo': repo_name,
            'hash': line_data['id'],
            'category': "issue",
            'author': line_data['user']['login'],
            'created_date': line_data['created_at'],
            'current_status': line_data['state']   
        }
        
        return cleaned_line
    
    @staticmethod
    def _clean_pr(line):
        repo_name = line['origin']
        line_data = line['data']
        cleaned_line ={
            'repo': repo_name,
            'hash': line_data['id'],
            'category': "pull_request",
            'author': line_data['user']['login'],
            'created_date': line_data['created_at'],
            'current_status': line_data['state']   
        }
        
        return cleaned_line
    

# Analyzing the data using the Class

In [103]:
commits_data = Code_Changes('./trial.json')

## Total number of commits 

In [104]:
print("The total number of commits in all repos is: ", commits_data.total_commits())
print("The number of commits repo-wise is ", commits_data.total_commits_per_repo())

The total number of commits in all repos is:  14
The number of commits repo-wise is  {'https://github.com/sarthak-sehgal/timetable-visualizer': 14}


## Total number of commits between dates

In [134]:
print("Code changes count all period:", commits_data.count_from_to())
print("Code changes count from 2018-01-01 to 2019-07-01:",
      commits_data.count_from_to(start="2018-01-01", end="2019-07-01"))
print("Code changes count from 2018-01-01 to 2018-07-01 (no merge commits):",
      commits_data.count_from_to(start="2018-01-01", end="2018-07-01", merge=False))


Code changes count all period: 14
Code changes count from 2018-01-01 to 2019-07-01: 14
Code changes count from 2018-01-01 to 2018-07-01 (no merge commits): 0
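One detail of the date filter is worth illustrating: `_clean_date` drops the UTC offset with `replace(tzinfo=None)`, so each commit is compared using the wall-clock time of its author rather than a common time zone. The sketch below contrasts that with converting to UTC first. The helper names and the sample date are illustrative only, and which behaviour the metric should use is a design choice.

```python
from datetime import datetime, timezone

# Date format used by the git items in the data file
GIT_DATE_FORMAT = "%a %b %d %H:%M:%S %Y %z"


def drop_offset(date_long_format):
    """Parse a git date and discard its offset (what _clean_date does)."""
    parsed = datetime.strptime(date_long_format, GIT_DATE_FORMAT)
    return parsed.replace(tzinfo=None)


def to_naive_utc(date_long_format):
    """Parse a git date, convert it to UTC, then discard the tzinfo."""
    parsed = datetime.strptime(date_long_format, GIT_DATE_FORMAT)
    return parsed.astimezone(timezone.utc).replace(tzinfo=None)


sample = "Mon Dec 31 23:30:00 2018 -0500"  # made-up date for illustration
print(drop_offset(sample))   # 2018-12-31 23:30:00
print(to_naive_utc(sample))  # 2019-01-01 04:30:00
```

With a period ending on 2019-01-01, the first reading places this commit inside the period and the second one outside it, so commits close to a boundary can be counted differently depending on the convention.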


# Analyzing the json file directly

In [141]:
github_data = defaultdict(list)

count = 0
with open('trial.json', 'r') as github_data_file:
    for line in github_data_file:
        data_line = json.loads(line)
        count += 1

        category = data_line['category']
        data = data_line['data']
        github_data[category].append(data)
            
print(len(github_data['commit']))

14


## Total number of commits in the master branch


In [155]:
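master_commits = set()

# Note: this collects only the commit at the tip of master and its direct
# parents, not the whole history of the branch.
for elem in github_data['commit']:
    if 'HEAD -> refs/heads/master' in elem['refs']:
        master_commits.add(elem['commit'])
        # A set ignores duplicates, so no membership check is needed
        master_commits.update(elem['parents'])

print(len(master_commits))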
        

2
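The cell above stops at the direct parents of the branch tip, which is why it reports 2 for a repository with 14 commits. To count every commit reachable from master, the parent links can be followed all the way back. The sketch below assumes each commit is a dictionary with `commit`, `parents` and `refs` keys, as in the cell above; the function name and the toy history are hypothetical.

```python
def reachable_commits(commits, head_ref="HEAD -> refs/heads/master"):
    """Return the hashes of all commits reachable from the commit carrying head_ref."""
    by_hash = {commit["commit"]: commit for commit in commits}
    pending = [commit["commit"] for commit in commits
               if head_ref in commit.get("refs", [])]
    seen = set()

    while pending:
        sha = pending.pop()
        # Skip commits already visited and parents missing from the data
        if sha in seen or sha not in by_hash:
            continue
        seen.add(sha)
        pending.extend(by_hash[sha].get("parents", []))
    return seen


# Toy history: a <- b <- c (tip of master), plus d on an unmerged side branch
toy_history = [
    {"commit": "a", "parents": [], "refs": []},
    {"commit": "b", "parents": ["a"], "refs": []},
    {"commit": "c", "parents": ["b"], "refs": ["HEAD -> refs/heads/master"]},
    {"commit": "d", "parents": ["a"], "refs": ["refs/heads/feature"]},
]
print(len(reachable_commits(toy_history)))  # 3
```

On the notebook's data the call would be `len(reachable_commits(github_data['commit']))`.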


## Total number of non-empty commits

In [156]:
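num_non_empty_commits = 0

# The category key is 'commit' (singular). The run recorded below used the
# misspelled key 'commits', which matches no items, hence the printed 0.
for commit in github_data['commit']:
    for file in commit['files']:
        if 'action' in file:
            num_non_empty_commits += 1
            break

print(num_non_empty_commits)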
            

0


## Total number of non-merge commits

In [143]:
count = 0

for commit in github_data['commit']:
    if 'Merge' not in commit:
        count += 1
        
print("Number of non-merge commits is: %d" %count)

Number of non-merge commits is: 14
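The empty and merge checks used in the last two cells can be folded into small predicates and tallied in one pass. This is a sketch under assumptions: the `Merge` key and the per-file `action` key are the ones the notebook relies on, treating a commit with more than one parent as a merge is an extra heuristic added here, and the helper names and toy commits are hypothetical.

```python
from collections import Counter


def is_merge(commit):
    """A commit counts as a merge if it has a 'Merge' key or several parents."""
    return "Merge" in commit or len(commit.get("parents", [])) > 1


def is_empty(commit):
    """A commit counts as empty if none of its files carries an action."""
    return not any("action" in file for file in commit.get("files", []))


def classify(commits):
    """Tally merge/non-merge and empty/non-empty commits."""
    tally = Counter()
    for commit in commits:
        tally["merge" if is_merge(commit) else "non-merge"] += 1
        tally["empty" if is_empty(commit) else "non-empty"] += 1
    return tally


# Toy commits for illustration only
toy_commits = [
    {"commit": "a", "parents": [], "files": [{"file": "x.py", "action": "A"}]},
    {"commit": "b", "parents": ["a"], "files": []},
    {"commit": "c", "parents": ["a", "b"], "Merge": "a b", "files": []},
]
print(dict(classify(toy_commits)))
```

On the notebook's data, `classify(github_data['commit'])` would return all four counts at once.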


# Pull Requests and Issues

## Total number of pull requests and issues

In [109]:
total_issues = len(github_data["issue"])

print("The number of issues is {0}".format(total_issues))
print("The number of pull requests is {0}".format(len(github_data["pull_request"])))

The number of issues is 1
The number of pull requests is 0


## Total number of open and closed issues

In [110]:
num_open_issues = 0
for issue in github_data["issue"]:
    if issue['state'] == "open":
        num_open_issues += 1
        
print("The number of open issues is ", num_open_issues)
print("The number of closed issues is ", total_issues - num_open_issues)

The number of open issues is  0
The number of closed issues is  1
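The open/closed split can also be computed with a `Counter`, which handles any state value without a separate subtraction. The sketch below additionally skips items that carry a `pull_request` key, on the assumption that the issue category may include pull requests (GitHub's REST API lists pull requests among issues and marks them with that key). The function name and sample data are hypothetical.

```python
from collections import Counter


def issue_states(issues, skip_pull_requests=True):
    """Count issues per state, optionally ignoring items that are pull requests."""
    states = Counter()
    for issue in issues:
        if skip_pull_requests and "pull_request" in issue:
            continue
        states[issue["state"]] += 1
    return states


# Toy items for illustration only
toy_issues = [
    {"state": "closed"},
    {"state": "open"},
    {"state": "open", "pull_request": {"url": "https://example.org/pr/1"}},
]
print(dict(issue_states(toy_issues)))
print(dict(issue_states(toy_issues, skip_pull_requests=False)))
```

On the notebook's data the call would be `issue_states(github_data["issue"])`.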
