# Microtask 2

## Aim of the task: 
Analysis of data fetched by perceval on a per-quarter basis.
This includes (but not limited to) :

- The number of new committers per quarter
- The number of new issue and pull request submitters per quarter
- The total number of issues, commits and pull requests per quarter

This task is exactly the same as microtask 1, except for the fact that this one is supposed to be done using pandas. Thus, only the points that differ from microtask1  will be discussed here.

The data used and the method of data collection is the same as that of microtask 0 and hence is not mentioned here.  

In [1]:
import json
import csv
import datetime
import pandas as pd

In [2]:
# Here both start year and end year are included
start_year = 2017
end_year = 2018 

quar1_dates = {"start_date": "01-01", "end_date": "03-31"}
quar2_dates = {"start_date": "04-01", "end_date": "06-30"}
quar3_dates = {"start_date": "07-01", "end_date": "09-30"}
quar4_dates = {"start_date": "10-01", "end_date": "12-31"}

old_committers = set()
old_issue_subs = set()
old_pr_subs = set()

In [3]:
class Quarter:
    
    def __init__(self, number, year):
        self.number = number    
        self.year = year   
   
        self.num_commits = 0  
        self.num_issues = 0
        self.num_pullrequests = 0
        
        self.new_committers = 0
        self.new_issue_subs = 0
        self.new_pr_subs = 0
        
        self.start_date = ""
        self.end_date = ""
        
        if self.number == 1:
            self.start_date = str(self.year) + '-' + quar1_dates["start_date"]
            self.end_date = str(self.year) + '-' + quar1_dates["end_date"]
            
        if self.number == 2:
            self.start_date = str(self.year) + '-' + quar2_dates["start_date"]
            self.end_date = str(self.year) + '-' + quar3_dates["end_date"]            
            
        if self.number == 3:
            self.start_date = str(self.year) + '-' + quar3_dates["start_date"]
            self.end_date = str(self.year) + '-' + quar3_dates["end_date"]
            
        if self.number == 4:
            self.start_date = str(self.year) + '-' + quar4_dates["start_date"]
            self.end_date = str(self.year) + '-' + quar4_dates["end_date"]
        
            
    def is_includes_data(self, date):
        if self._str_to_dt_quarter(self.start_date) <= self._str_to_dt_data(date) < self._str_to_dt_quarter(self.end_date):
            return True

        return False
    
    def add_analysis(self, datapoint):
        if datapoint['category'] == "commit":
            self.num_commits += 1 
            
            if datapoint['author'] not in old_committers:
                self.new_committers += 1
                
            old_committers.add(datapoint['author'])
            
                
        if datapoint['category'] == "issue":
            self.num_issues += 1 
            
            if datapoint['author'] not in old_issue_subs:
                self.new_issue_subs += 1
                
            old_issue_subs.add(datapoint['author'])
            
                
        if datapoint['category'] == "pull_request":
            self.num_pullrequests += 1 
            
            if datapoint['author'] not in old_pr_subs:
                self.new_pr_subs += 1
                
            old_pr_subs.add(datapoint['author'])
    
    @staticmethod
    def _str_to_dt_data(date):
        try:
            datetimestr =  datetime.datetime.strptime(date, "%a %b %d %H:%M:%S %Y %z").strftime("%Y-%m-%d")
            
        
        except ValueError as ve:
            datetimestr =  datetime.datetime.strptime(date, "%Y-%m-%dT%H:%M:%SZ").strftime("%Y-%m-%d")
        
        finally:
            datetimeobj = datetime.datetime.strptime(datetimestr, "%Y-%m-%d")
            return datetimeobj    
        
    @staticmethod
    def _str_to_dt_quarter(date):
        
        datetimeobj =  datetime.datetime.strptime(date, "%Y-%m-%d")
        return datetimeobj
    
    def __str__(self):
        return str(self.number) + " " + str(self.year)

# Cleaning and Organizing the Data
The CleanJson class is just like the one used in microtask 1, the only difference being that each item (commit, pull_request or issue) is a row in the corresponding dataframe. 
Thus, 
    - clean_commit_df : Each row is a commit
    - clean_issue_df: Each row is an issue
    - clean_pr_df: Each row is a pull request
    
A dictionary `clean_dict` with the keys "commit", "issue" and "pull_request" is another member of the CleanJson class. The corresponding values of its keys are the dataframes mentioned above.
Thus, the overall structure is:
```python
    clean_dict = {
        "commit": clean_commit_df,
        "issue": clean_issue_df,
        "pull_request": clean_pr_df
    }
```

In [4]:
class CleanJson():   
    
    def __init__(self, path_to_file):

        # The dataframes mentioned in the above cell will be populated using the 
        # following lists: pd.DataFrame(list_name)
        clean_commit_list = list()
        clean_issue_list = list()
        clean_pr_list = list()
        
        with open(path_to_file, 'r') as raw_data:
            for line in raw_data:
                line = json.loads(line)
                
                clean_line = dict()
                if line['category'] == "commit":
                    clean_line = self._clean_commit(line)
                    clean_commit_list.append(clean_line)
                    

                elif line['category'] == "issue":
                    clean_line = self._clean_issue(line)
                    clean_issue_list.append(clean_line)
                    

                elif line['category'] == "pull_request":
                    clean_line = self._clean_pr(line)
                    clean_pr_list.append(clean_line)
                        
                        
                self.clean_commit_df = pd.DataFrame(clean_commit_list)
                self.clean_issue_df = pd.DataFrame(clean_issue_list)
                self.clean_pr_df = pd.DataFrame(clean_pr_list)
                
            self.clean_dict = {
                'commit': self.clean_commit_df,
                'issue': self.clean_issue_df,
                'pull_request': self.clean_pr_df
            }

    @staticmethod
    def _clean_commit(line):
            repo_name = line['origin']
            line_data = line['data']
            cleaned_line = {
                'repo': repo_name,
                'hash': line_data['commit'],
                'category': "commit",
                'commit': line_data['Commit'],
                'author': line_data['Author'],
                'created_date': line_data['CommitDate'],
                'files_no': len(line_data['files'])
            }

            actions = 0

            for file in line_data['files']:
                if 'action' in file:
                    actions += 1
                    cleaned_line['files_action'] = actions
                    cleaned_line['merge'] = 'Merge' in line_data
            return cleaned_line

    @staticmethod
    def _clean_issue(line):
        repo_name = line['origin']
        line_data = line['data']
        cleaned_line ={
            'repo': repo_name,
            'hash': line_data['id'],
            'category': "issue",
            'author': line_data['user']['login'],
            'created_date': line_data['created_at'],
            'current_status': line_data['state']   
        }

        return cleaned_line

    @staticmethod
    def _clean_pr(line):
        repo_name = line['origin']
        line_data = line['data']
        cleaned_line ={
            'repo': repo_name,
            'hash': line_data['id'],
            'category': "pull_request",
            'author': line_data['user']['login'],
            'created_date': line_data['created_at'],
            'current_status': line_data['state']   
        }

        return cleaned_line
    

# Creating the Quarter object
Again, this step is exactly the same as microtask 1, where, based on the `start_year` and `end_year` variables in the third cell of this notebook, quarters are created. Be sure to note that even the `end_year` is included:
    For example, if `start_year` is 2017 and `end_year` is 2018, 8 quarters will be created.

In [5]:
year_list = [x for x in range(start_year, end_year + 1)]
quar_list = [Quarter(num, year)  for year in year_list for num in range(1, 5)]

In [6]:
clean_data = CleanJson('../progit.json')

In [7]:
for df in clean_data.clean_dict.values():
    data = df
    for quarter in quar_list:
        
        for index, data_point in data.iterrows():
            
            data_point = pd.Series.to_dict(data_point)
            if quarter.is_includes_data(data_point["created_date"]):
                quarter.add_analysis(data_point)


# Analysis



## Number of commits, pull requests and issues per quadrant

In [8]:
for q in quar_list:
    print("Quarter num and year: " ,q,    
          " \n commits: ", q.num_commits,
          " \n issues: ", q.num_issues, 
          " \n pull requests: ", q.num_pullrequests)
    print("--------------------------------")

Quarter num and year:  1 2017  
 commits:  13  
 issues:  8  
 pull requests:  4
--------------------------------
Quarter num and year:  2 2017  
 commits:  9  
 issues:  12  
 pull requests:  8
--------------------------------
Quarter num and year:  3 2017  
 commits:  5  
 issues:  8  
 pull requests:  5
--------------------------------
Quarter num and year:  4 2017  
 commits:  11  
 issues:  8  
 pull requests:  7
--------------------------------
Quarter num and year:  1 2018  
 commits:  18  
 issues:  12  
 pull requests:  7
--------------------------------
Quarter num and year:  2 2018  
 commits:  19  
 issues:  16  
 pull requests:  12
--------------------------------
Quarter num and year:  3 2018  
 commits:  3  
 issues:  4  
 pull requests:  4
--------------------------------
Quarter num and year:  4 2018  
 commits:  4  
 issues:  6  
 pull requests:  2
--------------------------------


## Number of new committers, new issue submitters and pull request creators

In [9]:
for q in quar_list:
    print("Quarter num and year: " ,q,    
          " \n new committers: ", q.new_committers,
          " \n new issue submitters: ", q.new_issue_subs, 
          " \n new pull request creators: ", q.new_pr_subs)
    print("-----------------------------------")

Quarter num and year:  1 2017  
 new committers:  7  
 new issue submitters:  7  
 new pull request creators:  4
-----------------------------------
Quarter num and year:  2 2017  
 new committers:  5  
 new issue submitters:  9  
 new pull request creators:  7
-----------------------------------
Quarter num and year:  3 2017  
 new committers:  0  
 new issue submitters:  0  
 new pull request creators:  0
-----------------------------------
Quarter num and year:  4 2017  
 new committers:  4  
 new issue submitters:  4  
 new pull request creators:  3
-----------------------------------
Quarter num and year:  1 2018  
 new committers:  8  
 new issue submitters:  9  
 new pull request creators:  6
-----------------------------------
Quarter num and year:  2 2018  
 new committers:  6  
 new issue submitters:  9  
 new pull request creators:  8
-----------------------------------
Quarter num and year:  3 2018  
 new committers:  0  
 new issue submitters:  0  
 new pull request creato

# Viewing data as a csv file
This part is almost completely the same as microtask 1, except for the part where a table is created using the csv file. 

## Writing the cleaned data to a csv

In [10]:
def write_to_csv(file_path):
    with open(file_path, 'w', ) as csvfile:
        csv_writer = csv.writer(csvfile, delimiter=',')
        
        file_headers = ["Quarter(Num)", "Quarter(Year)", "Num_Commits", "Num_Issues", "Num_PRs", "Num_new_commits", "Num_new_issues", "Num_new_prs"]
        csv_writer.writerow(file_headers)
        
        for quar in quar_list:
            row = [str(quar.number),         \
                 str(quar.year),             \
                 str(quar.num_commits) ,     \
                 str(quar.num_issues)  ,     \
                 str(quar.num_pullrequests), \
                 str(quar.new_committers),   \
                 str(quar.new_issue_subs),   \
                 str(quar.new_pr_subs)       
                  ]
            csv_writer.writerow(x for x in row)

In [11]:
write_to_csv("../progit.csv")

## Displaying a table based on the csv file
Instead of the csv.reader used in microtask1, the `pandas.read_csv()` is used to populate a dataframe, which is displayed below:

In [12]:
pd.read_csv('../progit.csv')

Unnamed: 0,Quarter(Num),Quarter(Year),Num_Commits,Num_Issues,Num_PRs,Num_new_commits,Num_new_issues,Num_new_prs
0,1,2017,13,8,4,7,7,4
1,2,2017,9,12,8,5,9,7
2,3,2017,5,8,5,0,0,0
3,4,2017,11,8,7,4,4,3
4,1,2018,18,12,7,8,9,6
5,2,2018,19,16,12,6,9,8
6,3,2018,3,4,4,0,0,0
7,4,2018,4,6,2,1,5,1
