# Microtask 4

## Aim:
- Organize data for all repositories for the last three months
- Present data as a csv and table
- Order data based on the sum of commits, pull requests and issues

## Getting the data
The data used for this microtask is the same as that used for the previous microtasks. Please check microtask 0 to see the cell output generated by running the commented script present two cells below this one. 
The cell below defines what is to be fetched: the **owner**, the **repository names**, the **repository urls** and, most importantly, the **github authentication token**.  

**Make sure to fill in your token for the `auth_token` variable in the cell below.**

In [1]:
github_url = "https://github.com/"  # the github url domain: used for generating repo_urls
owner = "progit"
repos_used = ["progit2-ru", "progit2-zh"]
repo_urls = [github_url + owner + "/" + repo_used for repo_used in repos_used]
auth_token = "" # Please enter your github token here

### Harnessing the power of jupyter notebooks
The script in the cell below is a generalized way to create and populate a file named `progit.json`  using perceval.  

The steps involved are simple: 
For each repository specified in the `repos_used` variable, fetch its git data, its pull_requests data and finally its issues data from the github api in that order and append them to `progit.json`. 

**Note**: the cell is commented out to prevent an accidental overwrite of the `progit.json` file, which lives in the parent directory of the current one. To work on more recent data, or to analyze a completely different set of repositories, uncomment the snippet below and run the cell.

In [2]:
# for repo, repo_url in zip(repos_used, repo_urls):
#     print(repo, repo_url)
#     if repo == repos_used[0]:
#         !perceval git --json-line $repo_url > ../progit.json

#     else:
#         !perceval git --json-line $repo_url >> ../progit.json

#     !perceval github -t $auth_token --json-line --sleep-for-rate --category pull_request $owner $repo >> ../progit.json

#     !perceval github -t $auth_token --json-line --sleep-for-rate --category issue $owner $repo >> ../progit.json

In [3]:
import json
import datetime
import pprint
import csv

# Analyzing the json file

## Functions
The `num_months` variable holds the number of most recent months to include in the analysis. 
The two utility functions defined in the cells below are described here: 

**get_date_range**
    
  - parameters: int (number of months)
  - returns: datetime object (start date)
  
  The function first creates a timedelta object based on the value of months. Here, a month is approximated to be 30 days. But this is an implementation detail and can be easily modified based on preference. Next, the function calculates the start date based on the timedelta object created. 
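
If the 30-day approximation is too coarse, the start date can instead be computed in whole calendar months. The following is only a sketch of one way to do that with the standard library; the function name `get_calendar_start_date` and its `today` parameter are hypothetical and are not part of this notebook.

```python
import calendar
import datetime


def get_calendar_start_date(months, today=None):
    """Return the date `months` calendar months before `today`."""
    today = today or datetime.date.today()
    # Count months from year 0 so the subtraction can cross year boundaries.
    month_index = today.year * 12 + (today.month - 1) - months
    year, month = divmod(month_index, 12)
    month += 1
    # Clamp the day to the length of the target month,
    # so that 31 May minus 3 months lands on the last day of February.
    last_day = calendar.monthrange(year, month)[1]
    return datetime.date(year, month, min(today.day, last_day))


print(get_calendar_start_date(3, datetime.date(2019, 5, 31)))
```

This sketch returns a `datetime.date`, so it would need to be converted with `datetime.datetime.combine` before being compared with the `datetime` objects used elsewhere in this notebook.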
  

**is_in_range**
    
  - parameters: str (date of item in string form as returned by perceval)
                datetime object (start date as calculated by get_date_range)
  - returns: Boolean (object lies in range or not)
  
   The code for this function is more or less the same as the one used in the other microtasks. The function is complicated by the fact that commit dates use a different format from issue and pull request dates in the data fetched by perceval. 
   Once the item's date is converted to a comparable form, it is compared with `start_date`, giving either True or False depending on whether the item was created in the last `num_months` months.
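
To make the two date formats concrete, here is a small sketch that parses one string of each kind. It is a variant, not the notebook's implementation: it keeps the timezone information instead of reducing the value to a calendar date. The helper name `parse_item_date` and both sample strings are made up for illustration.

```python
import datetime

GIT_FORMAT = "%a %b %d %H:%M:%S %Y %z"   # format of the commit dates
GITHUB_FORMAT = "%Y-%m-%dT%H:%M:%SZ"     # format of issue and pull request dates


def parse_item_date(date_string):
    """Parse either date format into a timezone-aware datetime."""
    try:
        return datetime.datetime.strptime(date_string, GIT_FORMAT)
    except ValueError:
        # The trailing "Z" means UTC, so attach that timezone explicitly.
        parsed = datetime.datetime.strptime(date_string, GITHUB_FORMAT)
        return parsed.replace(tzinfo=datetime.timezone.utc)


# Hypothetical sample values describing the same instant in both formats.
commit_date = parse_item_date("Tue Aug 14 14:30:13 2018 -0700")
issue_date = parse_item_date("2018-08-14T21:30:13Z")
print(commit_date == issue_date)
```

Timezone-aware values like these can only be compared with a start date that is also timezone-aware, which is one reason the notebook reduces everything to a naive calendar date instead.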

In [4]:
num_months = 3 # as asked in the problem statement


def get_date_range(months):
    current_date = datetime.datetime.now()
    timediff = datetime.timedelta(hours=24*30*months)
    start_date = current_date - timediff    
    return start_date

start_date = get_date_range(num_months)

In [5]:
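def is_in_range(date_string, start_date):
    try:
        # Commit dates look like "Mon Jan 01 12:00:00 2018 +0000"
        parsed = datetime.datetime.strptime(date_string, "%a %b %d %H:%M:%S %Y %z")
    except ValueError:
        # Issue and pull request dates look like "2018-01-01T12:00:00Z"
        parsed = datetime.datetime.strptime(date_string, "%Y-%m-%dT%H:%M:%SZ")

    # Keep only the calendar date so the value is naive, like start_date.
    # Returning from a `finally` block would hide any parsing error, so the
    # comparison happens after the try/except instead.
    datetimeobj = datetime.datetime(parsed.year, parsed.month, parsed.day)
    return datetimeobj >= start_date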

Once populated, `github_data` will look something like:

```python
    {
        'commit': [commit1_dict, commit2_dict, ....], 
        'issue': [issue1_dict, issue2_dict, ....], 
        'pull_request': [pr1_dict, pr2_dict, ....], 
    }
```

In [6]:
github_data =  {
                "commit": [], 
                "issue": [], 
                "pull_request": []
                }


with open('../progit.json', 'r') as github_data_file:
    for line in github_data_file:
        data_line = json.loads(line)
        category = data_line['category']
        if category == "commit":
            if is_in_range(data_line["data"]["CommitDate"], start_date):
                github_data[category].append(data_line)
        else:
            if is_in_range(data_line["data"]["created_at"], start_date):
                github_data[category].append(data_line)

## Total number of pull requests and issues

Let's create a dictionary `repo_wise_data`, whose structure is shown below: 
```python
    {
        repo_url_1: {"issue": .., "pull_request": .., "commit": ..}, 
        repo_url_2: {"issue": .., "pull_request": .., "commit": ..},
        repo_url_3: {"issue": .., "pull_request": .., "commit": ..}, 
        ..
        .
    }
```
The generic keys used allow this part of the script to work no matter which repositories or projects are used for the analysis.

Looping through each issue, pull request and commit in `github_data`, simply increment the matching counter in `repo_wise_data`. Each item's `origin` field holds the url of the repository it came from, which is why the dictionary is keyed by repository url.

In [7]:
repo_wise_data = {repo_url: {"issue": 0, "pull_request": 0, "commit": 0} for repo_url in repo_urls}
total_issues = 0
total_prs = 0
total_commits = 0

for elem in github_data['issue']:
    repo_wise_data[elem['origin']]['issue'] += 1
    total_issues += 1
    
for elem in github_data['pull_request']:
    repo_wise_data[elem['origin']]['pull_request'] += 1
    total_prs += 1
    
for elem in github_data['commit']:
    repo_wise_data[elem['origin']]['commit'] += 1
    total_commits += 1

print(json.dumps(repo_wise_data, indent=4))
print("Total number of issues: ", total_issues)
print("Total number of pull requests: ", total_prs)

{
    "https://github.com/progit/progit2-ru": {
        "issue": 18,
        "pull_request": 18,
        "commit": 30
    },
    "https://github.com/progit/progit2-zh": {
        "issue": 4,
        "pull_request": 3,
        "commit": 6
    }
}
Total number of issues:  22
Total number of pull requests:  21
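
The same tally can also be sketched with `collections.Counter`, which removes the need for one loop per category. This is an alternative under stated assumptions rather than the code used above: `sample_items` is made-up data containing only the two fields the tally needs.

```python
from collections import Counter

# Hypothetical items carrying only the fields used for counting.
sample_items = [
    {"category": "issue", "origin": "https://github.com/example/repo-a"},
    {"category": "commit", "origin": "https://github.com/example/repo-a"},
    {"category": "commit", "origin": "https://github.com/example/repo-a"},
    {"category": "pull_request", "origin": "https://github.com/example/repo-b"},
]

# Count every (repository, category) pair in a single pass.
pair_counts = Counter((item["origin"], item["category"]) for item in sample_items)

# Reshape the flat counts into the nested per-repository structure.
repo_counts = {}
for (origin, category), count in pair_counts.items():
    repo_counts.setdefault(origin, {"issue": 0, "pull_request": 0, "commit": 0})
    repo_counts[origin][category] = count

# Totals per category across all repositories.
category_totals = Counter(item["category"] for item in sample_items)

print(repo_counts)
print(category_totals["commit"])
```

Unlike the loops above, which raise a `KeyError` for an `origin` that is missing from `repo_urls`, this sketch silently creates an entry for any repository it encounters.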


# Viewing data as a csv file

## Writing the cleaned data to a csv
The following function takes a file path as a parameter and writes the following to that file: 
- The repositories for which data was fetched
- The number of commits, issues and pull requests in the last `num_months` months
- The total number of items (commits + issues + pull requests)

The actual writing is done with the `csv` module from the Python standard library, specifically `csv.writer`.
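
The ordering and writing steps can be tried out on in-memory data before touching any file. The sketch below uses made-up counts (`sample_counts`) and an `io.StringIO` buffer in place of a real file. It sorts in descending order via `reverse=True`, which is an assumption about the preferred ordering; dropping that argument gives ascending order.

```python
import csv
import io

# Hypothetical per-repository counts, not real results.
sample_counts = {
    "/example/repo-b": {"commit": 1, "issue": 0, "pull_request": 1},
    "/example/repo-a": {"commit": 5, "issue": 2, "pull_request": 1},
}


def total_items(counts):
    """Sum of commits, issues and pull requests for one repository."""
    return counts["commit"] + counts["issue"] + counts["pull_request"]


# Largest total first; remove reverse=True for ascending order.
ordered = sorted(sample_counts.items(),
                 key=lambda pair: total_items(pair[1]),
                 reverse=True)

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Repository", "Num_Commits", "Num_Issues", "Num_PRs", "Total"])
for repo, counts in ordered:
    writer.writerow([repo, counts["commit"], counts["issue"],
                     counts["pull_request"], total_items(counts)])

csv_lines = buffer.getvalue().splitlines()
print("\n".join(csv_lines))
```

`csv.writer` converts the integer fields to text on its own, so wrapping each value in `str()` is not required.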

In [8]:
def write_to_csv(file_path, repo_wise_data):

    def total(counts):
        return counts["commit"] + counts["issue"] + counts["pull_request"]

    # Sort repositories in ascending order of commits + issues + pull requests
    sorted_data = sorted(repo_wise_data.items(), key=lambda tup: total(tup[1]))

    # newline='' lets the csv module control line endings, as its documentation recommends
    with open(file_path, 'w', newline='') as csvfile:
        csv_writer = csv.writer(csvfile, delimiter=',')

        file_headers = ["Repository", "Num_Commits", "Num_Issues", "Num_PRs", "Total"]
        csv_writer.writerow(file_headers)

        for key, val in sorted_data:
            # Store only the "/owner/repo" part of the repository url
            row = [key.replace("https://github.com", ''),
                   val["commit"],
                   val["issue"],
                   val["pull_request"],
                   total(val)]
            csv_writer.writerow(row)
            

## Displaying a table based on the csv file
The following function reads the csv file created above and prints it as a table, allowing one to visualize the data stored in the csv.

In [9]:
def create_table(file_path):
    with open(file_path, 'r', newline='') as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=',')

        for row in csv_reader:
            for field in row:
                print("%-18s" %field, end="\t")
            print()


In [10]:
write_to_csv("../progit_last_3_months.csv", repo_wise_data)
create_table("../progit_last_3_months.csv")

Repository        	Num_Commits       	Num_Issues        	Num_PRs           	Total             	
/progit/progit2-zh	6                 	4                 	3                 	13                	
/progit/progit2-ru	30                	18                	18                	66                	
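
The fixed 18-character column width used in `create_table` lines up only as long as no field is longer than that. As a sketch of one possible refinement, the hypothetical helper `format_table` below sizes each column to its longest field; the sample rows are made up.

```python
def format_table(rows):
    """Return rows as left-aligned text columns sized to their longest field."""
    column_count = len(rows[0])
    # Width of each column is the length of its longest field.
    widths = [max(len(str(row[i])) for row in rows) for i in range(column_count)]
    lines = []
    for row in rows:
        cells = [str(field).ljust(width) for field, width in zip(row, widths)]
        # Strip the padding after the last column.
        lines.append("  ".join(cells).rstrip())
    return "\n".join(lines)


# Hypothetical rows, including a name longer than 18 characters.
sample_rows = [
    ["Repository", "Total"],
    ["/example/a-repository-with-a-long-name", "8"],
    ["/example/repo-b", "2"],
]
table = format_table(sample_rows)
print(table)
```

The rows returned by `csv.reader` could be collected into a list and passed to such a helper, at the cost of reading the whole file before printing anything.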
