In [5]:
import pandas as pd


df = pd.read_parquet('../data/2015-01-01-10.parquet')

# Simple Data Analysis

I will run a simple analysis on a single batch of data to decide which steps and questions I want to implement in pyspark. Working on one batch avoids long execution times and gives more visibility into the data.

Note: this analysis considers only the provided schema and its stable columns.

Things to do:

- Drop the columns named payload and other: they hold JSON with unstable keys
- Drop columns that are not useful for the analysis
- Keep created_at only at hour granularity?
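
The steps above can be sketched in pandas as follows. This is a minimal illustration on a made-up three-row frame, not the real batch; the `created_hour` column name is an assumption, and the equivalent pyspark code may look different.

```python
import pandas as pd

# Tiny stand-in for the real batch; all values are made up for illustration.
events = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "type": ["PushEvent", "PushEvent", "WatchEvent"],
        "payload": ["{...}", "{...}", "{...}"],
        "other": [None, None, None],
        "created_at": pd.to_datetime(
            ["2015-01-01 10:00:00", "2015-01-01 10:59:59", "2015-01-01 11:00:01"],
            utc=True,
        ),
    }
)

# Drop the JSON columns with unstable keys.
slim = events.drop(columns=["payload", "other"])

# Truncate created_at to the start of its hour (hypothetical column name).
slim["created_hour"] = slim["created_at"].dt.floor("h")

print(slim)
```

Flooring keeps the full date alongside the hour, so the same column still works once several days are loaded.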

Possible useful questions for the data:

- Top contributors? -> actor_id / actor_login -> new column with the last actor_login for each actor_id? This has problems, discussed in the possibilities below
- N commits by hour? -> type and created_at
- N actions by hour? -> any column and created_at
- % of events from organizations by day? -> org_id and created_at
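
A rough pandas sketch of the hourly and daily questions, on made-up rows. Two assumptions are worth flagging: commits are approximated by counting PushEvent rows (a single push can carry several commits, so this counts pushes, not commits), and an event is treated as coming from an organization when org_id is not null. The helper column names `hour`, `day` and `has_org` are hypothetical.

```python
import pandas as pd

# Made-up events; only the columns needed for these questions are included.
events = pd.DataFrame(
    {
        "type": ["PushEvent", "PushEvent", "WatchEvent", "PushEvent"],
        "org_id": [None, 3020012.0, None, 3020012.0],
        "created_at": pd.to_datetime(
            [
                "2015-01-01 10:00:00",
                "2015-01-01 10:30:00",
                "2015-01-01 10:45:00",
                "2015-01-01 11:05:00",
            ],
            utc=True,
        ),
    }
)

# Helper columns (hypothetical names) for the time buckets and the org flag.
events = events.assign(
    hour=events["created_at"].dt.floor("h"),
    day=events["created_at"].dt.floor("D"),
    has_org=events["org_id"].notna(),
)

# N actions by hour: every event counts.
actions_by_hour = events.groupby("hour").size()

# N pushes by hour: same count restricted to PushEvent rows.
pushes_by_hour = events[events["type"] == "PushEvent"].groupby("hour").size()

# % of events with an organization by day: mean of a boolean flag, times 100.
org_pct_by_day = events.groupby("day")["has_org"].mean() * 100

print(actions_by_hour, pushes_by_hour, org_pct_by_day, sep="\n\n")
```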

Possibilities for the top contributors:
1. New column with the last actor_login for the actor_id
1. Return actor_id or actor_login (the first is more stable but less interpretable; the second is less stable but more interpretable)
1. Make another table with the actor_id - last actor_login (you need to join the tables for the dashboard)

1. New column with the last actor_login for the actor_id?

    Pros:
    - Avoids problems with renamed users (we don't know whether the same actor_id always maps to the same actor_login)
    - ids generally more stable
    - Easier in terms of using it for the report/dashboard. Direct conversion.
    - Easier than having two tables: one for the events and another for the users.

    Cons:
    - More complex than using actor_login or actor_id directly
    - Using the last login causes backwards-compatibility problems: all previous records have to be updated whenever someone changes their name
    - Scales poorly
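
A minimal sketch of this first option in pandas, assuming "last" means the login on the most recent event by created_at. The data and the `actor_last_login` column name are made up for illustration.

```python
import pandas as pd

# Made-up events in which actor 1 renames their account between events.
events = pd.DataFrame(
    {
        "actor_id": [1, 2, 1, 2],
        "actor_login": ["alice", "bob", "alice-renamed", "bob"],
        "created_at": pd.to_datetime(
            [
                "2015-01-01 10:00:00",
                "2015-01-01 10:05:00",
                "2015-01-01 10:10:00",
                "2015-01-01 10:15:00",
            ],
            utc=True,
        ),
    }
)

# Sort by time so that "last" within each group is the most recent login.
events = events.sort_values("created_at")

# Broadcast the most recent login of each actor_id to all of that actor's rows.
events["actor_last_login"] = (
    events.groupby("actor_id")["actor_login"].transform("last")
)

print(events)
```

Every earlier row of a renamed actor is rewritten here, which is the backwards-compatibility cost listed in the cons.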

1. Return actor_id or actor_login:

    Pros:
    - Simplest

    Cons:
    - Biased when using actor_login: a renamed user's activity is split across several logins

1. Make another table with the actor_id - last actor_login (you need to join the tables for the dashboard)

    Pros:
    - Most flexible
    - Solves the backwards compatibility
    - Only needs to update renamed users and insert new ones

    Cons:
    - Two tables to maintain
    - More storage
    - The tables have to be joined afterwards to get the same result as the first option
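
A minimal sketch of this third option in pandas: a small lookup table from actor_id to the most recent actor_login, joined to per-actor counts only when the report is built. The data and the names `actors`, `last_actor_login` and `n_events` are made up for illustration.

```python
import pandas as pd

# Made-up events in which actor 1 renames their account between events.
events = pd.DataFrame(
    {
        "actor_id": [1, 2, 1, 2],
        "actor_login": ["alice", "bob", "alice-renamed", "bob"],
        "created_at": pd.to_datetime(
            [
                "2015-01-01 10:00:00",
                "2015-01-01 10:05:00",
                "2015-01-01 10:10:00",
                "2015-01-01 10:15:00",
            ],
            utc=True,
        ),
    }
)

# Lookup table: one row per actor_id, keeping the login of the latest event.
actors = (
    events.sort_values("created_at")
    .drop_duplicates(subset="actor_id", keep="last")[["actor_id", "actor_login"]]
    .rename(columns={"actor_login": "last_actor_login"})
)

# Aggregate on the stable key, then join the readable login for the dashboard.
counts = events.groupby("actor_id").size().rename("n_events").reset_index()
report = counts.merge(actors, on="actor_id", how="left")

print(report)
```

The events table is left untouched; a rename only changes one row in `actors`.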

Conclusions:

1. If most queries (or the only query) target a table covering a predefined time window rather than the full history, the first option is a good way to avoid bias without running into scalability problems.
2. The second option suits results where the risk of bias is not a problem (actor_login), or internal reports where the person's name can be looked up afterwards (actor_id).
3. In a more realistic project, the third option fits best: keep an intermediate table with all the actor information (inserting rows for new users and for each update, i.e. a user history) and derive the simplified table from it.


This project is closest to the second case, so I will use actor_login.
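
As a rough illustration of that choice, top contributors can be ranked by counting events per actor_login. The rows below are made up, and the count by actor_id is included only to show the rename bias that this choice accepts.

```python
import pandas as pd

# Made-up events: actor 1 appears under two logins because of a rename.
events = pd.DataFrame(
    {
        "actor_id": [1, 1, 1, 2, 2, 1],
        "actor_login": ["alice", "alice", "alice", "bob", "bob", "alice-renamed"],
    }
)

# Top contributors by login: readable, but a renamed user is split in two.
top_by_login = events["actor_login"].value_counts().head(2)

# Same ranking on the stable key, for comparison only.
by_id = events["actor_id"].value_counts()

print(top_by_login)
print(by_id)
```

Here actor 1 has four events in total, but the ranking by login credits only three of them to "alice".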

In [6]:
df.head()

Unnamed: 0,id,type,payload,public,created_at,actor_id,actor_login,actor_gravatar_id,actor_url,actor_avatar_url,repo_id,repo_name,repo_url,org_id,org_login,org_gravatar_id,org_url,org_avatar_url,other
0,2489546120,PushEvent,"{'action': None, 'before': '5a4fd385410b90ea0c...",True,2015-01-01 10:00:00+00:00,5270945,wistone,,https://api.github.com/users/wistone,https://avatars.githubusercontent.com/u/5270945?,28683463,wistone/recipes,https://api.github.com/repos/wistone/recipes,,,,,,
1,2489546121,PullRequestEvent,"{'action': 'opened', 'before': None, 'comment'...",True,2015-01-01 10:00:00+00:00,3261477,clarkorz,,https://api.github.com/users/clarkorz,https://avatars.githubusercontent.com/u/3261477?,20100867,strongloop/loopback-boot,https://api.github.com/repos/strongloop/loopba...,3020012.0,strongloop,,https://api.github.com/orgs/strongloop,https://avatars.githubusercontent.com/u/3020012?,
2,2489546122,IssuesEvent,"{'action': 'opened', 'before': None, 'comment'...",True,2015-01-01 10:00:01+00:00,8024050,AmbujMishra,,https://api.github.com/users/AmbujMishra,https://avatars.githubusercontent.com/u/8024050?,27638946,AmbujMishra/Gbits,https://api.github.com/repos/AmbujMishra/Gbits,,,,,,
3,2489546125,CreateEvent,"{'action': None, 'before': None, 'comment': No...",True,2015-01-01 10:00:01+00:00,10363284,xLiimited,,https://api.github.com/users/xLiimited,https://avatars.githubusercontent.com/u/10363284?,28684233,xLiimited/PwSPlugin,https://api.github.com/repos/xLiimited/PwSPlugin,,,,,,
4,2489546127,PushEvent,"{'action': None, 'before': '3a25aca13dff66dacc...",True,2015-01-01 10:00:01+00:00,3181834,aplnosun,,https://api.github.com/users/aplnosun,https://avatars.githubusercontent.com/u/3181834?,9339636,aplnosun/aplnosun.github.com,https://api.github.com/repos/aplnosun/aplnosun...,,,,,,


In [3]:
df.type.value_counts()

PushEvent                        3820
CreateEvent                       881
WatchEvent                        816
IssueCommentEvent                 470
PullRequestEvent                  326
IssuesEvent                       304
ForkEvent                         243
DeleteEvent                       174
GollumEvent                        83
CommitCommentEvent                 41
PullRequestReviewCommentEvent      34
ReleaseEvent                       24
MemberEvent                        12
PublicEvent                         6
Name: type, dtype: int64