# Microtask 0

## Aim:
- Get a basic understanding of Perceval and the GitHub API data it fetches
- Get comfortable analyzing that data:
    - Total number of commits
    - Number of issues and pull-requests
    - Number of open issues and closed issues

## Getting the data
The cell below defines what will be fetched: the **owner**, the **repository names**, the **repository URLs** and, most importantly, the **GitHub authentication token**.  

**Make sure to fill in your token for the `auth_token` variable in the cell below.**

In [23]:
github_url = "https://github.com/"  # the github url domain: used for generating repo_urls
owner = "progit"
repos_used = ["progit2-ru", "progit2-zh"]
repo_urls = [github_url + owner + "/" + repo_used for repo_used in repos_used]
auth_token = "" # Please enter your github token here

### Using the notebook's shell escapes (`!`), the script in the cell below is a general way to create and populate a file named `progit.json` using Perceval.

The steps involved are simple: 
For each repository specified in the `repos_used` variable, fetch its git data, then its pull request data and finally its issue data from the GitHub API, in that order, and append each result to `progit.json`. 

**Note**: the snippet is commented out to prevent an accidental overwrite of the `progit.json` file, which lives in the parent directory of the current directory. To work on more recent data, or to analyze a completely different set of repositories, uncomment the snippet below and run the cell.

In [24]:
# for repo, repo_url in zip(repos_used, repo_urls):
#     print(repo, repo_url)
#     if repo == repos_used[0]:
#         !perceval git --json-line $repo_url > ../progit.json

#     else:
#         !perceval git --json-line $repo_url >> ../progit.json

#     !perceval github -t $auth_token --json-line --sleep-for-rate --category pull_request $owner $repo >> ../progit.json

#     !perceval github -t $auth_token --json-line --sleep-for-rate --category issue $owner $repo >> ../progit.json
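To preview the commands the loop above would issue without running them (and without touching `progit.json`), they can be assembled as plain strings. This is a minimal sketch: `build_perceval_commands` is a hypothetical helper, not part of Perceval, and the flags and output path simply mirror the commented cell.

```python
def build_perceval_commands(owner, repos, token, out_path="../progit.json"):
    """Return the shell commands from the cell above as strings (hypothetical helper)."""
    github_url = "https://github.com/"
    commands = []
    for index, repo in enumerate(repos):
        repo_url = github_url + owner + "/" + repo
        # The first git fetch truncates the file (>); everything else appends (>>)
        redirect = ">" if index == 0 else ">>"
        commands.append(f"perceval git --json-line {repo_url} {redirect} {out_path}")
        for category in ("pull_request", "issue"):
            commands.append(
                f"perceval github -t {token} --json-line --sleep-for-rate "
                f"--category {category} {owner} {repo} >> {out_path}"
            )
    return commands


# "<token>" is a placeholder, not a real token
for command in build_perceval_commands("progit", ["progit2-ru", "progit2-zh"], "<token>"):
    print(command)
```

Only the very first command uses `>`, so a single run starts from an empty file and every later fetch appends to it.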

In [25]:
import json
import datetime
import pprint

# Analyzing the json file directly
Even without the CleanJson class, the data produced by Perceval is easy to analyze. A simple context manager that reads the JSON file line by line is all that's required.

Once populated, `github_data` will look something like:

```python
    {
        'commit': [commit1_dict, commit2_dict, ....], 
        'issue': [issue1_dict, issue2_dict, ....], 
        'pull_request': [pr1_dict, pr2_dict, ....], 
    }
```
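As a small illustration of this grouping, the sketch below reads JSON lines from an in-memory string instead of `progit.json`. The three sample records are made up and carry only the `category` and `origin` fields; real Perceval items contain many more.

```python
import io
import json

# Made-up sample standing in for ../progit.json: one JSON document per line
sample_file = io.StringIO(
    '{"category": "commit", "origin": "https://github.com/progit/progit2-ru"}\n'
    '{"category": "issue", "origin": "https://github.com/progit/progit2-ru"}\n'
    '{"category": "pull_request", "origin": "https://github.com/progit/progit2-zh"}\n'
)

grouped = {"commit": [], "issue": [], "pull_request": []}
for line in sample_file:
    item = json.loads(line)
    # Each item lands in the list that matches its category
    grouped[item["category"]].append(item)

print({category: len(items) for category, items in grouped.items()})
```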


In [26]:
months = 3 # as asked in the problem statement


def get_date_range(months):
    current_date = datetime.datetime.now()
    timediff = datetime.timedelta(hours=24*30*months)
    start_date = current_date - timediff    
    return start_date

start_date = get_date_range(months)
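Note that `get_date_range` approximates a month as 30 days, so three months means the last 90 days rather than three calendar months. The sketch below makes that explicit; the fixed reference date is an arbitrary value chosen only so the result is reproducible.

```python
import datetime

# 24 * 30 * months hours is the same as 30 * months days
months = 3
window = datetime.timedelta(hours=24 * 30 * months)
assert window == datetime.timedelta(days=90)

# Arbitrary reference date, used here instead of datetime.datetime.now()
reference = datetime.datetime(2019, 3, 31)
print(reference - window)  # 2018-12-31 00:00:00
```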

In [27]:
def in_range(date_string, start_date):
    """Return True if date_string falls on or after start_date."""
    try:
        # git commit dates, matched by "%a %b %d %H:%M:%S %Y %z"
        parsed = datetime.datetime.strptime(date_string, "%a %b %d %H:%M:%S %Y %z")
    except ValueError:
        # GitHub API timestamps, matched by "%Y-%m-%dT%H:%M:%SZ"
        parsed = datetime.datetime.strptime(date_string, "%Y-%m-%dT%H:%M:%SZ")

    # Compare calendar dates only, dropping the time and timezone information
    date_only = datetime.datetime(parsed.year, parsed.month, parsed.day)
    return date_only >= start_date
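
The two `strptime` patterns correspond to the two date formats found in the data: git commit dates and GitHub API timestamps. A quick sketch with made-up timestamps shows how each one is parsed and reduced to a calendar date. `to_date` is a hypothetical helper, and the `%a`/`%b` fields assume an English locale.

```python
import datetime

GIT_FORMAT = "%a %b %d %H:%M:%S %Y %z"   # CommitDate in git items
GITHUB_FORMAT = "%Y-%m-%dT%H:%M:%SZ"     # created_at in issues and pull requests


def to_date(date_string):
    """Parse either format and keep only the calendar date (hypothetical helper)."""
    for fmt in (GIT_FORMAT, GITHUB_FORMAT):
        try:
            return datetime.datetime.strptime(date_string, fmt).date()
        except ValueError:
            continue
    raise ValueError("Unrecognized date format: " + date_string)


print(to_date("Tue Aug 14 14:30:13 2018 -0700"))  # 2018-08-14
print(to_date("2018-08-14T21:30:13Z"))            # 2018-08-14
```

Unlike a bare `try`/`except` pair, this variant raises a clear error when a string matches neither format.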

In [28]:
github_data =  {
                "commit": [], 
                "issue": [], 
                "pull_request": []
                }


with open('../progit.json', 'r') as github_data_file:
    for line in github_data_file:
        data_line = json.loads(line)
        category = data_line['category']
        if category == "commit":
            if in_range(data_line["data"]["CommitDate"], start_date):
                github_data[category].append(data_line)
        else:
            if in_range(data_line["data"]["created_at"], start_date):
                github_data[category].append(data_line)

## Total number of pull requests and issues

Let's create a dictionary `repo_wise_data`, whose structure is shown below: 
```python
    {
        repo_url_1: {"issue": .., "pull_request": .., "commit": ..}, 
        repo_url_2: {"issue": .., "pull_request": .., "commit": ..},
        repo_url_3: {"issue": .., "pull_request": .., "commit": ..}, 
        ..
        .
    }
```
The generic keys allow this part of the script to work no matter which repositories or projects are analyzed.

Looping through each issue, pull request and commit in `github_data`, we simply increment the matching counter in `repo_wise_data`.

In [29]:
repo_wise_data = {repo_url: {"issue": 0, "pull_request": 0, "commit": 0} for repo_url in repo_urls}
total_issues = 0
total_prs = 0
total_commits = 0

for elem in github_data['issue']:
    repo_wise_data[elem['origin']]['issue'] += 1
    total_issues += 1
    
for elem in github_data['pull_request']:
    repo_wise_data[elem['origin']]['pull_request'] += 1
    total_prs += 1
    
for elem in github_data['commit']:
    repo_wise_data[elem['origin']]['commit'] += 1
    total_commits += 1

print(json.dumps(repo_wise_data, indent=4))
print("Total number of issues: ", total_issues)
print("Total number of pull requests: ", total_prs)

{
    "https://github.com/progit/progit2-ru": {
        "issue": 18,
        "pull_request": 18,
        "commit": 30
    },
    "https://github.com/progit/progit2-zh": {
        "issue": 4,
        "pull_request": 3,
        "commit": 6
    }
}
Total number of issues:  22
Total number of pull requests:  21
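
The same tallies can also be computed with `collections.Counter`, which avoids pre-populating the dictionary with every repository URL. This is only a sketch of an alternative, run on made-up items that carry just the `origin` and `category` fields.

```python
from collections import Counter

# Made-up items; real Perceval items have many more fields
items = [
    {"origin": "https://github.com/progit/progit2-ru", "category": "issue"},
    {"origin": "https://github.com/progit/progit2-ru", "category": "commit"},
    {"origin": "https://github.com/progit/progit2-zh", "category": "issue"},
]

# Count once per (repository, category) pair, and once per category overall
per_repo = Counter((item["origin"], item["category"]) for item in items)
per_category = Counter(item["category"] for item in items)

print(per_repo[("https://github.com/progit/progit2-ru", "issue")])  # 1
print(per_category["issue"])  # 2
```

A `Counter` returns 0 for keys it has never seen, so a repository with no pull requests needs no special handling.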


In [41]:
import csv


def write_to_csv(file_path, repo_wise_data):
    # Sort repositories by total activity (issues + pull requests + commits)
    repo_wise_data = dict(sorted(repo_wise_data.items(),
                                 key=lambda tup: tup[1]["issue"] + tup[1]["pull_request"] + tup[1]["commit"]))

    with open(file_path, 'w', newline='') as csvfile:
        csv_writer = csv.writer(csvfile, delimiter=',')

        file_headers = ["Repository", "Num_Commits", "Num_Issues", "Num_PRs", "Total"]
        csv_writer.writerow(file_headers)

        # One row per repository, in the column order given by file_headers
        for repo_url, counts in repo_wise_data.items():
            total = counts["commit"] + counts["issue"] + counts["pull_request"]
            row = [repo_url, counts["commit"], counts["issue"], counts["pull_request"], total]
            csv_writer.writerow(row)

In [None]:
import csv


def create_table(file_path):
    with open(file_path, 'r', newline='') as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=',')

        for row in csv_reader:
            for field in row:
                # Left-align each field in a 10-character column
                print("%-10s" % field, end="\t")
            print()
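
To try the write-then-read round trip without creating a file, `csv` works equally well on an in-memory buffer. In this sketch the counts are taken from the output shown earlier, while the column names and the 12-character column width are assumptions made for illustration.

```python
import csv
import io

# In-memory stand-in for a CSV file on disk
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",")
writer.writerow(["Repository", "Num_Commits", "Num_Issues", "Num_PRs"])
writer.writerow(["progit2-ru", 30, 18, 18])
writer.writerow(["progit2-zh", 6, 4, 3])

# Rewind and read the rows back; csv.reader returns every field as a string
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter=","))
for row in rows:
    # Left-align each field in a 12-character column
    print("".join("%-12s" % field for field in row))
```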


In [39]:
a = {"x": [1, 2], "y": [2, -10], "z": [-1, 100]}
dict(sorted(a.items(), key=lambda kv: sum(kv[1])))

{'y': [2, -10], 'x': [1, 2], 'z': [-1, 100]}