# Microtask 5

## Aim:
- Organize data for all repositories for the last three months
- Present data as a csv and table
- Order data based on the sum of commits, pull requests and issues
- This is exactly the same as microtask 4, except that this must be done using pandas

## Getting the data
The data used for this microtask is the same as that used for the previous microtasks. Please check microtask 0 to see the cell output generated by running the commented script present two cells below this one. 
The cell below defines what is to be fetched: the **owner**, the **repository names**, the **repository urls** and, most importantly, the **github authentication token**.

**Make sure to fill in your token for the `auth_token` variable in the cell below.**

In [1]:
github_url = "https://github.com/"  # the github url domain: used for generating repo_urls
owner = "atom"
repos_used = ["language-java", "teletype"]
repo_urls = [github_url + owner + "/" + repo_used for repo_used in repos_used]
auth_token = "" # Please enter your github token here
file_name = owner + ".json" # file to which perceval stores data (a ../ is automatically added)
csv_name = owner + "_last_3_months" + ".csv" # file to which csv data is written (a ../ is automatically added)
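
Hardcoding a token in a notebook makes it easy to commit it by accident. As an optional alternative, the token could be read from an environment variable. The sketch below assumes a variable named `GITHUB_TOKEN` and a helper called `read_token`; both names are hypothetical choices for this example, not something perceval requires.

```python
import os


def read_token(env_var="GITHUB_TOKEN"):
    """Return the token stored in env_var, or an empty string if it is unset.

    Both the helper and the variable name are illustrative assumptions.
    """
    return os.environ.get(env_var, "").strip()


auth_token = read_token()
if not auth_token:
    print("No token found; set GITHUB_TOKEN or fill in auth_token by hand.")
```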

### Harnessing the power of jupyter notebooks
The script in the cell below is a generalized way to create and populate a json file using perceval.  

The steps involved are simple: 
For each repository specified in the `repos_used` variable, fetch its git data, its pull request data and finally its issue data from the github api, in that order, and append them to the json file. 

**Note**: the script is commented out to prevent accidentally overwriting the json file, which lives in the parent directory of the present directory. To work on more recent data, or to perform an analysis on a completely different set of repositories, please uncomment the snippet below and run the cell.
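
Outside a notebook the `!` shell magic is not available. As a rough sketch, the same three fetches could be driven from plain Python with `subprocess`. The helpers `build_perceval_commands` and `fetch_to_file` are hypothetical names introduced here; the perceval flags are copied from the cell below and `<token>` is a placeholder.

```python
import subprocess


def build_perceval_commands(owner, repo, repo_url, auth_token):
    """Return the three perceval invocations for one repository, in fetch order.

    Hypothetical helper: the flags are copied from the notebook cell.
    """
    github_base = ["perceval", "github", "-t", auth_token,
                   "--json-line", "--sleep-for-rate"]
    return [
        ["perceval", "git", "--json-line", repo_url],
        github_base + ["--category", "pull_request", owner, repo],
        github_base + ["--category", "issue", owner, repo],
    ]


def fetch_to_file(commands, file_path):
    """Run each command and append its standard output to file_path."""
    with open(file_path, "a") as out_file:
        for command in commands:
            subprocess.run(command, stdout=out_file, check=True)


commands = build_perceval_commands(
    "atom", "teletype", "https://github.com/atom/teletype", "<token>")
for command in commands:
    print(" ".join(command))
```

Only the command lists are built and printed here; calling `fetch_to_file(commands, "../atom.json")` would actually run perceval and append to the file.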

In [2]:
# for repo, repo_url in zip(repos_used, repo_urls):
#     print(repo, repo_url)

#     !perceval git --json-line $repo_url >> ../$file_name

#     !perceval github -t $auth_token --json-line --sleep-for-rate --category pull_request $owner $repo >> ../$file_name

#     !perceval github -t $auth_token --json-line --sleep-for-rate --category issue $owner $repo >> ../$file_name

In [3]:
import json
import datetime
import csv
import pandas as pd

# Analyzing the json file

## Functions
The `num_months` variable holds the number of most recent months to include in the analysis. 
The two utility functions defined in the cells below are described here: 

**get_date_range**
    
  - parameters: int (number of months)
  - returns: datetime object (start date)
  
  The function first creates a timedelta object based on the value of months. Here, a month is approximated to be 30 days. But this is an implementation detail and can be easily modified based on preference. Next, the function calculates the start date based on the timedelta object created. 
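
If the 30-day approximation is not wanted, one possible modification is to step back whole calendar months instead. The sketch below is an assumption about how that could look, not part of the notebook; `get_calendar_start_date` and its `now` parameter are hypothetical names.

```python
import calendar
import datetime


def get_calendar_start_date(months, now=None):
    """Hypothetical variant of get_date_range that steps back whole calendar months."""
    now = now or datetime.datetime.now()
    # Count months since year 0 so that crossing a year boundary is plain arithmetic.
    month_index = now.year * 12 + (now.month - 1) - months
    year, month = divmod(month_index, 12)
    month += 1
    # Clamp the day, e.g. 31 May minus three months becomes 28 February.
    last_day = calendar.monthrange(year, month)[1]
    return now.replace(year=year, month=month, day=min(now.day, last_day))


print(get_calendar_start_date(3, now=datetime.datetime(2019, 5, 31)))
```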
  

**is_in_range**
    
  - parameters: str (date of item in string form as returned by perceval)
                datetime object (start date as calculated by get_date_range)
  - returns: Boolean (object lies in range or not)
  
   The code for this function is more or less the same one used in the other microtasks. The function quickly becomes complicated due to the fact that the date parameter for commits is different from that of issues or pull requests in the data fetched by perceval. 
   Once the item's date is converted to a comparable form, it is compared with `start_date`, which results in either True or False depending on whether the item was created in the last `num_months` months.
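
To make the difference between the two date formats concrete, here is a small sketch with made-up sample values (they are not taken from the fetched data). The constants `GIT_FORMAT` and `ISO_FORMAT` are names introduced for this example only.

```python
import datetime

# Made-up sample values in the two formats described above.
commit_date = "Tue Feb 12 15:52:31 2019 -0800"   # "CommitDate" of a commit
created_at = "2019-02-12T15:52:31Z"              # "created_at" of an issue or pull request

GIT_FORMAT = "%a %b %d %H:%M:%S %Y %z"
ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

parsed_commit = datetime.datetime.strptime(commit_date, GIT_FORMAT)
parsed_item = datetime.datetime.strptime(created_at, ISO_FORMAT)

# The commit date is timezone-aware and the other is naive, so they cannot be
# compared directly; reducing both to a calendar date sidesteps that.
print(parsed_commit.date(), parsed_item.date())

# Parsing with the wrong format raises ValueError, which is what lets
# is_in_range fall back from one format to the other.
try:
    datetime.datetime.strptime(created_at, GIT_FORMAT)
except ValueError:
    print("not a git-style date")
```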

In [4]:
num_months = 3 # as asked in problem statement


def get_date_range(months):
    current_date = datetime.datetime.now()
    timediff = datetime.timedelta(hours=24*30*months)
    start_date = current_date - timediff    
    return start_date

start_date = get_date_range(num_months)

In [5]:
def is_in_range(date_string, start_date):
    # Commits use git's date format; issues and pull requests use ISO 8601.
    try:
        parsed = datetime.datetime.strptime(date_string, "%a %b %d %H:%M:%S %Y %z")
    except ValueError:
        parsed = datetime.datetime.strptime(date_string, "%Y-%m-%dT%H:%M:%SZ")

    # Keep only the calendar date: this gives a naive datetime that can be
    # compared with start_date, which is naive as well.
    datetimeobj = datetime.datetime(parsed.year, parsed.month, parsed.day)
    return datetimeobj >= start_date

Once populated, the contents of `github_data` will look something like:

```python
    {
        'commit': [commit1_dict, commit2_dict, ....], 
        'issue': [issue1_dict, issue2_dict, ....], 
        'pull_request': [pr1_dict, pr2_dict, ....], 
    }
```

Note: In the cell below, you might notice that there is an extra condition for when the category of the data is "issue". As mentioned in previous microtasks, the GitHub API treats every pull request as an issue as well, and hence redundant pull request data is present in the "issue" category. The extra condition filters it out.
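
The extra condition can be illustrated in isolation. The records below are made up and only mimic the shape of perceval items: an item in the "issue" category that is really a pull request carries a `pull_request` key in its `data`. The helper name `is_real_issue` is hypothetical.

```python
# Made-up records that mimic the shape of perceval items in the "issue" category.
items = [
    {"category": "issue", "data": {"number": 1, "title": "A real issue"}},
    {"category": "issue", "data": {"number": 2, "title": "A pull request",
                                   "pull_request": {"url": "..."}}},
]


def is_real_issue(item):
    """True for items in the issue category that are not pull requests."""
    return item["category"] == "issue" and "pull_request" not in item["data"]


real_issues = [item for item in items if is_real_issue(item)]
print([item["data"]["number"] for item in real_issues])
```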

In [6]:
github_data =  {
                "commit": [], 
                "issue": [], 
                "pull_request": []
                }


with open('../' + file_name, 'r') as github_data_file:
    for line in github_data_file:
        data_line = json.loads(line)
        category = data_line['category']
        if category == "commit":
            if is_in_range(data_line["data"]["CommitDate"], start_date):
                github_data[category].append(data_line)
        else:
            if is_in_range(data_line["data"]["created_at"], start_date):
                if (category == "issue" and "pull_request" not in data_line['data']) or category == "pull_request":
                    github_data[category].append(data_line)

## Total number of pull requests and issues

Let's create a dictionary `repo_wise_data`, whose structure is shown below: 
```python
    {
        repo_url_1: {"issue": .., "pull_request": .., "commit": ..}, 
        repo_url_2: {"issue": .., "pull_request": .., "commit": ..},
        repo_url_3: {"issue": .., "pull_request": .., "commit": ..}, 
        ..
        .
    }
```
The generic keys used allow this part of the script to work no matter which repositories or projects are used for the analysis.

Looping through each issue, pull request and commit in `github_data`, simply populate `repo_wise_data`.
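
Since this microtask is about pandas, the same per-repository counts could also be obtained with a `groupby` instead of explicit loops. This is only a sketch of that alternative, using a handful of made-up items reduced to the two fields that matter (`origin` and `category`).

```python
import pandas as pd

# Made-up items, reduced to the fields needed for counting.
items = [
    {"origin": "https://github.com/atom/language-java", "category": "issue"},
    {"origin": "https://github.com/atom/language-java", "category": "commit"},
    {"origin": "https://github.com/atom/teletype", "category": "issue"},
    {"origin": "https://github.com/atom/teletype", "category": "issue"},
]

# Count items per (repository, category) pair, then spread the categories
# into columns; combinations that never occur are filled with 0.
counts = (
    pd.DataFrame(items)
    .groupby(["origin", "category"])
    .size()
    .unstack(fill_value=0)
)
print(counts)
```

The result has one row per repository and one column per category, which is the same shape the dictionary takes once it is turned into a dataframe.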

In [7]:
repo_wise_data = {repo_url: {"issue": 0, "pull_request": 0, "commit": 0} for repo_url in repo_urls}
total_issues = 0
total_prs = 0
total_commits = 0

for elem in github_data['issue']:
    repo_wise_data[elem['origin']]['issue'] += 1
    total_issues += 1
    
for elem in github_data['pull_request']:
    repo_wise_data[elem['origin']]['pull_request'] += 1
    total_prs += 1
    
for elem in github_data['commit']:
    repo_wise_data[elem['origin']]['commit'] += 1
    total_commits += 1

print(json.dumps(repo_wise_data, indent=4))
print("Total number of issues: ", total_issues)
print("Total number of pull requests: ", total_prs)

{
    "https://github.com/atom/language-java": {
        "issue": 7,
        "pull_request": 5,
        "commit": 5
    },
    "https://github.com/atom/teletype": {
        "issue": 6,
        "pull_request": 2,
        "commit": 1
    }
}
Total number of issues:  13
Total number of pull requests:  7


# Viewing data as a csv file

## Writing the cleaned data to a csv
The following function takes a file path and a dictionary containing the data as parameters and writes the following to that file:
- the repositories for which data was fetched
- the number of commits, issues and pull requests in the last `num_months` months
- the total number of items (commits + issues + pull requests)

The writing itself is done by `pandas.DataFrame.to_csv`. The dictionary `repo_wise_data` is first converted to a dataframe, a `Total` column is added, and the dataframe is then sorted on that column. 
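
The conversion can be tried out without touching the file system. The sketch below uses the counts printed by the previous cell and writes to an in-memory buffer; `DataFrame.from_dict(..., orient="index")` is used here as one way to get a row per repository, which is an alternative to transposing.

```python
import io

import pandas as pd

# Counts copied from the output of the previous cell.
repo_wise_data = {
    "https://github.com/atom/language-java": {"issue": 7, "pull_request": 5, "commit": 5},
    "https://github.com/atom/teletype": {"issue": 6, "pull_request": 2, "commit": 1},
}

# One row per repository, one column per category.
df = pd.DataFrame.from_dict(repo_wise_data, orient="index")
df["Total"] = df[["commit", "issue", "pull_request"]].sum(axis=1)
df = df.sort_values("Total")

# Write to an in-memory buffer instead of a file to inspect the csv text.
buffer = io.StringIO()
df.to_csv(buffer, index_label="Repository")
print(buffer.getvalue())
```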

In [8]:
def write_to_csv(file_path, repo_wise_data):
    # Rows are repositories; columns are the per-category counts.
    df = pd.DataFrame(repo_wise_data).transpose()
    df["Total"] = df["issue"] + df["pull_request"] + df["commit"]
    df = df.sort_values("Total")

    # Select the columns by name before renaming them, so the labels cannot
    # drift from the data whatever order pandas uses for the dictionary keys.
    df = df[["commit", "issue", "pull_request", "Total"]]
    df.columns = ["Num_Commits", "Num_Issues", "Num_PRs", "Total"]

    # Strip the domain from the repository url and use the rest as row label.
    df.index = df.index.str.replace("https://github.com", "", regex=False)
    df.index.name = "Repository"
    df.to_csv(file_path)
            

In [9]:
write_to_csv("../" + csv_name, repo_wise_data)

## Displaying a table based on the csv file
Instead of the `csv.reader` used in microtask 1, `pandas.read_csv()` is used to populate a dataframe, which is displayed below:

In [10]:
pd.read_csv("../" + csv_name)

|   | Repository | Num_Commits | Num_Issues | Num_PRs | Total |
|---|---|---|---|---|---|
| 0 | /atom/teletype | 1 | 6 | 2 | 9 |
| 1 | /atom/language-java | 5 | 7 | 5 | 17 |
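
By default `read_csv` adds a fresh integer index, which is the leftmost column in the table above. If the repository name should serve as the row label instead, `index_col` can be passed. The sketch below reads the same rows from an in-memory string so that it runs without the csv file being present.

```python
import io

import pandas as pd

# The same rows as in the table above, as csv text.
csv_text = """Repository,Num_Commits,Num_Issues,Num_PRs,Total
/atom/teletype,1,6,2,9
/atom/language-java,5,7,5,17
"""

# index_col makes the Repository column the row label of the dataframe.
table = pd.read_csv(io.StringIO(csv_text), index_col="Repository")
print(table)

# Rows can now be looked up by repository name.
print(table.loc["/atom/teletype", "Total"])
```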
