# Spark, v2

## Implementation


### IMPORTANT NOTE!

Because my machine is very underpowered, I have to read and process HOURLY packages instead of running Spark over a whole day or an even longer period.

### Impl

In [1]:
import pathlib, os

spark_resources_path = "/home/jovyan/work/data/"

In [2]:
spark_resources_path_obj = pathlib.Path(spark_resources_path)

In [3]:
import pyspark

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

In [5]:
master = "spark://spark:7077"
conf = SparkConf().setAppName("Spark TEST v2").setMaster(master)

In [4]:
# jars_to_load = [
#     "clickhouse-jdbc-0.4.6-all.jar",
#     "clickhouse-spark-runtime-3.3_2.12-0.7.2.jar"
# ]

# jar_path_joined = ":".join(["/opt/spark/resources/jars/{}".format(jar_name) for jar_name in jars_to_load])
# jar_path_joined

'/opt/spark/resources/jars/clickhouse-jdbc-0.4.6-all.jar:/opt/spark/resources/jars/clickhouse-spark-runtime-3.3_2.12-0.7.2.jar'

In [98]:
# os.environ['PYSPARK_SUBMIT_ARGS'] = \
#   '--jars {} pyspark-shell'.format(":".join(["../jars/{}".format(jar_name) for jar_name in jars_to_load]))


In [6]:
# conf.set('spark.jars', jar_path_joined)
# conf.set('spark.driver.extraClassPath', jar_path_joined)
# conf.set('spark.driver.extraLibraryPath', jar_path_joined)

In [7]:
conf.getAll()

[('spark.app.name', 'Spark TEST v2'), ('spark.master', 'spark://spark:7077')]

In [8]:
# Pass the `conf` built above; a fresh SparkConf() would ignore the app name and master set on it.
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .getOrCreate()

In [9]:
spark.sparkContext

In [10]:
given_date = "2023-07-23"
glob_pattern = "{}-*.json.gz".format(given_date)

In [11]:
paths = (str(path) for path in sorted(spark_resources_path_obj.glob(glob_pattern), key=os.path.getmtime))
cur_path = next(paths)
print(cur_path)

/home/jovyan/work/data/2023-07-23-1.json.gz
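
Sorting by `os.path.getmtime` yields the files in download order, which is not necessarily hour order. A minimal sketch of an alternative, assuming the files follow the `YYYY-MM-DD-H.json.gz` naming seen above; the helper name `hourly_files` is hypothetical and not part of the notebook:

```python
import pathlib
import re

# Matches names such as "2023-07-23-1.json.gz" and captures the hour part.
HOUR_RE = re.compile(r"^\d{4}-\d{2}-\d{2}-(\d{1,2})\.json\.gz$")


def hourly_files(data_dir, given_date):
    """Return the hourly archives for `given_date`, ordered by hour."""
    found = []
    pattern = "{}-*.json.gz".format(given_date)
    for path in pathlib.Path(data_dir).glob(pattern):
        match = HOUR_RE.match(path.name)
        if match:
            found.append((int(match.group(1)), str(path)))
    # Sort numerically on the hour, so that "-2" comes before "-10".
    return [path for _hour, path in sorted(found, key=lambda item: item[0])]
```

Each returned path can then be passed to `spark.read.json(...)` one at a time, so only a single hour is held in memory.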


In [12]:
df = spark.read.json(cur_path)

In [13]:
df.count()

112858

In [14]:
df.createOrReplaceTempView("activity")

In [15]:
spark.sql(
    "SELECT min(created_at), max(created_at) FROM activity"
).show(10)

+--------------------+--------------------+
|     min(created_at)|     max(created_at)|
+--------------------+--------------------+
|2023-07-23T01:00:00Z|2023-07-23T01:59:59Z|
+--------------------+--------------------+



In [16]:
spark.sql(
    "SELECT type, count(*) FROM activity group by type order by count(*) desc"
).show(50)

+--------------------+--------+
|                type|count(1)|
+--------------------+--------+
|           PushEvent|   80900|
|         CreateEvent|   10209|
|    PullRequestEvent|    5721|
|          WatchEvent|    4711|
|   IssueCommentEvent|    3056|
|         DeleteEvent|    2236|
|         IssuesEvent|    1962|
|           ForkEvent|    1106|
|PullRequestReview...|     810|
|        ReleaseEvent|     767|
|  CommitCommentEvent|     537|
|PullRequestReview...|     418|
|         PublicEvent|     226|
|         MemberEvent|     103|
|         GollumEvent|      96|
+--------------------+--------+



## Metrics

* List of Developers that own more than one repository;
* List of Developers who did more than one commit in a day, ordered by name and number of commits;
* List of Developers with less than one commit in a day;
* Total Developers grouped by gender;
* Total projects with more than 10 members;

### LoD - Own > 1 Repo

In [17]:
split_col = pyspark.sql.functions.split(df['repo.name'], '/')

df = df.withColumn('repo_author', split_col.getItem(0))
df = df.withColumn('repo_name', split_col.getItem(1))

In [18]:
df.createOrReplaceTempView("activity")

The goal is the following list of __DISTINCT__ records:
```
| date | hour | repo_author | repo_name |
```

In [19]:
df_repos = spark.sql(
    """
    SELECT
        date(created_at) as date,
        hour(created_at) as hour,
        repo_author, 
        repo_name,
        count(*) as total_events
    FROM activity 
    GROUP BY 
        date,
        hour,
        repo_author,
        repo_name
    ORDER BY total_events DESC
    """
)

In [20]:
df_repos.show()

+----------+----+-----------------+--------------------+------------+
|      date|hour|      repo_author|           repo_name|total_events|
+----------+----+-----------------+--------------------+------------+
|2023-07-23|   1|       1Panel-dev|            appstore|         865|
|2023-07-23|   1|    happyfish2024|                mins|         717|
|2023-07-23|   1|        B74LABgit|                 CAM|         692|
|2023-07-23|   1|          unifyai|                 ivy|         542|
|2023-07-23|   1|  bullet-dev-team|   demo-app-env-list|         509|
|2023-07-23|   1|  bullet-dev-team|python-pyramid-pu...|         493|
|2023-07-23|   1|       CyberCommy|biqu520net-60001-...|         373|
|2023-07-23|   1|      timisalin01|               green|         357|
|2023-07-23|   1|       CyberCommy|biqu520net-50001-...|         267|
|2023-07-23|   1|  networkoperator|demo-cluster-mani...|         265|
|2023-07-23|   1|           zomeru|              zomeru|         242|
|2023-07-23|   1|   

In [21]:
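# Count once and reuse; the ratio is raw rows / aggregated rows.
total_rows = df.count()
total_repo_rows = df_repos.count()

print(
"""
Degree of ~compression~ via aggregation: {:0.2f} ({} / {})
""".format(total_rows / total_repo_rows, total_rows, total_repo_rows)
)


Degree of ~compression~ via aggregation: 1.84 (112858 / 61477)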



#### Limitations & Problems

1. GH Archive snapshots are hourly, so a single snapshot contains __no data__ about a repository's past activity. \
Without that history, the number of owned repos is incomplete: it can only be derived from the events that we've crawled & stored. \
For a way around this, see the next section, `How can it be improved?`

#### How can it be improved?

* Get a bigger slice of data, then:
  * Store daily snapshots of all unique combinations of `date`, `hour`, `repo_author`, `repo_name`
  * Compute the metric as a global `count(distinct repo_name)` aggregation grouped by `repo_author`
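
The aggregation described above can be sketched in plain Python, independent of Spark. This is only an illustration of the logic, assuming the stored snapshot rows are available as `(date, hour, repo_author, repo_name)` tuples; the function name is hypothetical:

```python
from collections import defaultdict


def owners_with_multiple_repos(snapshot_rows):
    """Return {repo_author: distinct_repo_count} for authors with more than one repo.

    `snapshot_rows` is any iterable of (date, hour, repo_author, repo_name)
    tuples, for example rows gathered from several stored hourly snapshots.
    """
    repos_by_author = defaultdict(set)
    for _date, _hour, repo_author, repo_name in snapshot_rows:
        # A set keeps each repo once, no matter how many hours it shows up in.
        repos_by_author[repo_author].add(repo_name)
    return {
        author: len(repos)
        for author, repos in repos_by_author.items()
        if len(repos) > 1
    }
```

In Spark SQL the same idea would be a `GROUP BY repo_author` with `HAVING count(distinct repo_name) > 1` over the accumulated table.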

#### Loading - Clickhouse

In [22]:
import clickhouse_connect

client = clickhouse_connect.get_client(
    host='clickhouse_server', 
    username='altenar', 
    password='altenar_ch_demo_517'
)

In [23]:
client.command('USE gharchive;')

''

In [24]:
client.command("show tables;")

''

In [25]:
create_table_cmd = """
CREATE TABLE IF NOT EXISTS gharchive.repo_aggregated
(
    date Date,
    hour UInt8,
    repo_author String,
    repo_name String,
    total_events UInt32
)
ENGINE MergeTree
ORDER BY date;
"""

print(create_table_cmd)

client.command(create_table_cmd)


CREATE TABLE IF NOT EXISTS gharchive.repo_aggregated
(
    date Date,
    hour UInt8,
    repo_author String,
    repo_name String,
    total_events UInt32
)
ENGINE MergeTree
ORDER BY date;



''

In [29]:
spark.sparkContext.addFile("../jars/clickhouse-jdbc-0.4.6-all.jar")

spark.sparkContext._jsc.addJar("../jars/clickhouse-jdbc-0.4.6-all.jar")

spark.sparkContext.addFile("../jars/clickhouse-spark-runtime-3.3_2.12-0.7.2.jar")

spark.sparkContext._jsc.addJar("../jars/clickhouse-spark-runtime-3.3_2.12-0.7.2.jar")

spark.sparkContext

In [27]:
type(df_repos)

pyspark.sql.dataframe.DataFrame

In [30]:
df_repos.write \
    .format("jdbc") \
    .mode("append") \
    .option("driver", "com.github.housepower.jdbc.ClickHouseDriver") \
    .option("url", "jdbc:clickhouse://clickhouse-server:9000") \
    .option("user", "altenar") \
    .option("password", "altenar_ch_demo_517") \
    .option("dbtable", "gharchive.repo_aggregated") \
    .option("truncate", "true") \
    .option("batchsize", 10000) \
    .option("isolationLevel", "NONE") \
    .save()

Py4JJavaError: An error occurred while calling o80.save.
: java.lang.ClassNotFoundException: com.github.housepower.jdbc.ClickHouseDriver
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
	at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:46)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1(JDBCOptions.scala:103)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1$adapted(JDBCOptions.scala:103)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:103)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcOptionsInWrite.<init>(JDBCOptions.scala:246)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcOptionsInWrite.<init>(JDBCOptions.scala:250)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:47)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79)
	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:133)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:856)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:387)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:360)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:247)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:833)
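
A likely cause of the `ClassNotFoundException` above: `com.github.housepower.jdbc.ClickHouseDriver` belongs to the separate housepower ClickHouse-Native-JDBC project, while the `clickhouse-jdbc-0.4.6-all.jar` loaded here provides `com.clickhouse.jdbc.ClickHouseDriver`. That driver talks to ClickHouse over HTTP (port 8123 by default) rather than the native port 9000. In addition, jars added with `addJar` after the session has started may not be visible to the driver-side class loader, so setting `spark.jars` before the session is created is usually safer. The sketch below only builds the write options; the helper name, the jar location, and the port are assumptions and were not verified against this cluster:

```python
# Assumed jar location (taken from the commented-out cell earlier in the notebook).
CLICKHOUSE_JDBC_JAR = "/opt/spark/resources/jars/clickhouse-jdbc-0.4.6-all.jar"


def clickhouse_jdbc_options(host, database, table, user, password, port=8123):
    """Build options for DataFrameWriter with the official ClickHouse JDBC driver."""
    return {
        "driver": "com.clickhouse.jdbc.ClickHouseDriver",
        "url": "jdbc:clickhouse://{}:{}/{}".format(host, port, database),
        "dbtable": table,
        "user": user,
        "password": password,
        "batchsize": "10000",
        "isolationLevel": "NONE",
    }


# Intended usage in the notebook (not executed here):
#   conf.set("spark.jars", CLICKHOUSE_JDBC_JAR)   # before SparkSession is created
#   options = clickhouse_jdbc_options(
#       "clickhouse-server", "gharchive", "repo_aggregated", user, password)
#   df_repos.write.format("jdbc").mode("append").options(**options).save()
```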


### LoD - Did > 1 Commit a day

https://stackoverflow.com/questions/64605066/explode-array-with-nested-array-raw-spark-sql

#### Notes

1. Commit authors are available under `payload.commits.author.name` (an array with one entry per commit in a push).\
Given the hourly and typed nature of the dataset, we can:
    * Filter by `type`, leaving only `PushEvent`-s
    * Explode author names
    * Get the count aggregation

#### Impl

In [96]:
df.select(
    df["payload.commits.author.name"],
    df["payload.ref"],
    df["payload.ref_type"],
    df["payload.review"],
    df["type"],
).show(50)

+--------------------+--------------------+----------+--------------------+--------------------+
|                name|                 ref|  ref_type|              review|                type|
+--------------------+--------------------+----------+--------------------+--------------------+
|                null|                null|      null|                null|          WatchEvent|
|          [moyan222]|     refs/heads/main|      null|                null|           PushEvent|
|         [tokenhash]|     refs/heads/main|      null|                null|           PushEvent|
|       [TheCamBloch]|     refs/heads/main|      null|                null|           PushEvent|
|             [Peter]|     refs/heads/safe|      null|                null|           PushEvent|
|       [wendellopes]|refs/heads/featur...|      null|                null|           PushEvent|
|                null|     lib-default-use|       tag|                null|         CreateEvent|
|[DEV Registry Ser...|   refs/

In [71]:
spark.sql(
    """
    SELECT
        date(created_at) as created_at_date,
        explode(payload.commits.author.name) as author_name
    FROM activity
    WHERE 
        type = 'PushEvent'
    """
).show(10)

+---------------+--------------------+
|created_at_date|         author_name|
+---------------+--------------------+
|     2023-07-27|            moyan222|
|     2023-07-27|           tokenhash|
|     2023-07-27|         TheCamBloch|
|     2023-07-27|               Peter|
|     2023-07-27|         wendellopes|
|     2023-07-27|DEV Registry Service|
|     2023-07-27|DEV Registry Service|
|     2023-07-27|             preciad|
|     2023-07-27| github-actions[bot]|
|     2023-07-27| github-actions[bot]|
+---------------+--------------------+
only showing top 10 rows



In [72]:
spark.sql(
    """
    SELECT 
        sq.created_at_date as date,
        sq.author_name, 
        count(*) total_commits from 
    (
        SELECT
            date(created_at) as created_at_date,
            explode(payload.commits.author.name) as author_name
        FROM activity
        WHERE 
            type = 'PushEvent'
    ) sq
    GROUP BY 
        date, 
        sq.author_name
    HAVING total_commits > 1
    ORDER BY 
        total_commits DESC, 
        author_name DESC
    """
).show(10)

+----------+-------------------+-------------+
|      date|        author_name|total_commits|
+----------+-------------------+-------------+
|2023-07-28|github-actions[bot]|        64469|
|2023-07-27|github-actions[bot]|        35880|
|2023-07-27|        Upptime Bot|        27881|
|2023-07-28|        Upptime Bot|        13536|
|2023-07-27|      renovate[bot]|        10954|
|2023-07-27|           sgou1969|         8488|
|2023-07-27|    dependabot[bot]|         8071|
|2023-07-28|           sgou1969|         6061|
|2023-07-28|      renovate[bot]|         5313|
|2023-07-27|         readme-bot|         5125|
+----------+-------------------+-------------+
only showing top 10 rows



#### Limitations & Problems

1. Using a subquery in Spark SQL is potentially not good practice. Still, it is acceptable for a prototype.
2. We additionally have to group by date, because the metric is the DAILY number of commits.

#### How can it be improved?

* Probably, we should get rid of bot entities :)
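
One way to drop bot entities is a name-based heuristic. The sketch below is an assumption, not a rule taken from the dataset: it flags names ending in `[bot]` or containing `bot` as a separate word, which covers names such as `github-actions[bot]`, `Upptime Bot`, and `readme-bot` from the output above, but it will miss bots with other naming schemes and may flag humans.

```python
import re

# Heuristic only: "[bot]" suffix (GitHub Apps) or "bot" as a standalone word.
BOT_NAME_RE = re.compile(r"\[bot\]$|\bbot\b", re.IGNORECASE)


def is_probable_bot(author_name):
    """Return True when the commit author name looks like a bot account."""
    return bool(BOT_NAME_RE.search(author_name))


def drop_bots(rows):
    """Filter (date, author_name, total_commits) rows, keeping probable humans."""
    return [row for row in rows if not is_probable_bot(row[1])]
```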

### [!!!] LoD - < 1 commit in a day

#### Notes

* The trick is that an author can have 1 commit in, let's say, a 25h time span, which is more than 1 day. \
So we should regroup the data not by date, but by consecutive time spans of 25h each...
* To test it, we need at least __2 days' worth of data__! \
For now I have trouble getting this data, so I skip the metric.
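
The regrouping by consecutive time spans could look roughly like this. It is a plain-Python sketch under assumptions: `created_at` uses the `YYYY-MM-DDTHH:MM:SSZ` format seen in the data, the spans are counted from a chosen `origin`, and the 25h length follows the note above; the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def span_index(created_at, origin, span=timedelta(hours=25)):
    """Index of the consecutive `span`-long window that `created_at` falls into."""
    ts = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ")
    ts = ts.replace(tzinfo=timezone.utc)
    # Floor division of two timedeltas gives a whole number of spans.
    return (ts - origin) // span


def commits_per_span(commits, origin, span=timedelta(hours=25)):
    """Count commits per (span index, author) for (created_at, author) pairs."""
    counts = Counter()
    for created_at, author_name in commits:
        counts[(span_index(created_at, origin, span), author_name)] += 1
    return counts
```

Authors whose count in a span is at most 1 would then be the candidates for this metric, mirroring the `HAVING` clause used in the date-based query.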

#### Impl

In [73]:
spark.sql(
    """
    SELECT 
        sq.created_at_date as date,
        sq.author_name, 
        count(*) total_commits from 
    (
        SELECT
            date(created_at) as created_at_date,
            explode(payload.commits.author.name) as author_name
        FROM activity
        WHERE 
            type = 'PushEvent'
    ) sq
    GROUP BY 
        date, 
        sq.author_name
    HAVING total_commits <= 1
    ORDER BY 
        total_commits DESC, 
        author_name DESC
    """
).show(10)

+----------+--------------------+-------------+
|      date|         author_name|total_commits|
+----------+--------------------+-------------+
|2023-07-28|            🤖 R2-D2|            1|
|2023-07-28|      💵 moneybot 💵|            1|
|2023-07-28| 🐼 Samrose Ahmed 🐼|            1|
|2023-07-27|𒀳 Scribe of the ...|            1|
|2023-07-27|                  ｚ|            1|
|2023-07-28|              황지환|            1|
|2023-07-28|              홍세빈|            1|
|2023-07-27|              현기홍|            1|
|2023-07-28|                현경|            1|
|2023-07-27|              한창수|            1|
+----------+--------------------+-------------+
only showing top 10 rows



### [!!!] Total Developers grouped by gender

#### Notes

* Probably can be done via `user[profile_pronouns]`, but that means a lookup for EACH unique GitHub user... That's just dumb.
* I haven't found such info (sex, gender, pronouns) in the dataset.

### Total projects with more than 10 members

#### Notes

* Where to find members ???

#### Impl


In [98]:
spark.sql(
    """
    SELECT
        type,
        payload.member
    FROM activity
    WHERE
        payload.member is not NULL
        
    """
).show(10)

+-----------+--------------------+
|       type|              member|
+-----------+--------------------+
|MemberEvent|{https://avatars....|
|MemberEvent|{https://avatars....|
|MemberEvent|{https://avatars....|
|MemberEvent|{https://avatars....|
|MemberEvent|{https://avatars....|
|MemberEvent|{https://avatars....|
|MemberEvent|{https://avatars....|
|MemberEvent|{https://avatars....|
|MemberEvent|{https://avatars....|
|MemberEvent|{https://avatars....|
+-----------+--------------------+
only showing top 10 rows



In [114]:
df.select(
    df["payload.member"],
    df["payload.member.id"],
    df["payload.member.login"],
    df["payload.member.type"],
    df["payload.action"],
    df["org.login"],
    df["actor.login"],
    df["repo.name"],
    df["type"],
).filter(df["type"] == 'MemberEvent').show(50)

+--------------------+---------+-------------------+----+------+--------------------+--------------------+--------------------+-----------+
|              member|       id|              login|type|action|               login|               login|                name|       type|
+--------------------+---------+-------------------+----+------+--------------------+--------------------+--------------------+-----------+
|{https://avatars....|107618870|            bsozeau|User| added|                null|      noirblancrouge|noirblancrouge/Pa...|MemberEvent|
|{https://avatars....|132350039|         RolivhuwaN|User| added|                null|        Spinofficial| Spinofficial/printf|MemberEvent|
|{https://avatars....|132908419|        stheeCamile|User| added|                null|              faellm| faellm/calculoRotas|MemberEvent|
|{https://avatars....|140736057|       bevin-crypto|User| added|                null|       Programmer231|Programmer231/Roc...|MemberEvent|
|{https://avatars...

In [118]:
spark.sql(
    """
    SELECT
        repo.name,
        count(distinct payload.member.login) as total_member_logins
    FROM activity
    WHERE
        type = 'MemberEvent'
        and payload.action = 'added'
    GROUP BY
        repo.name
    ORDER BY
        total_member_logins DESC
    """
).show(10)

+--------------------+-------------------+
|                name|total_member_logins|
+--------------------+-------------------+
|jgranadoscunoc/re...|                 29|
|Jucer74/WebDevelo...|                 17|
|hdtoledo/nodejs_a...|                 11|
|InstrucJavaReclui...|                  9|
|ChungLeba/nextjsv...|                  9|
|iramgutierrez/433...|                  9|
|emrchi/TestNGProj...|                  8|
|aydaakcay/TestPro...|                  7|
|ufuk-muhsiroglu/T...|                  7|
|GoldenMEmre/com.w...|                  7|
+--------------------+-------------------+
only showing top 10 rows



In [119]:
spark.sql(
    """
    SELECT
        date(created_at) as date,
        hour(created_at) as hour,
        repo.name,
        count(distinct payload.member.login) as total_member_logins
    FROM activity
    WHERE
        type = 'MemberEvent'
        and payload.action = 'added'
    GROUP BY
        date, hour, repo.name
    ORDER BY
        total_member_logins DESC
    """
).show(10)

+----------+----+--------------------+-------------------+
|      date|hour|                name|total_member_logins|
+----------+----+--------------------+-------------------+
|2023-07-27|  20|jgranadoscunoc/re...|                 21|
|2023-07-28|   1|Jucer74/WebDevelo...|                 16|
|2023-07-28|   1|InstrucJavaReclui...|                  9|
|2023-07-27|  19|hdtoledo/nodejs_a...|                  9|
|2023-07-27|  19|emrchi/TestNGProj...|                  7|
|2023-07-27|  19|ufuk-muhsiroglu/T...|                  7|
|2023-07-27|  18|aydaakcay/TestPro...|                  7|
|2023-07-27|  18|GoldenMEmre/com.w...|                  7|
|2023-07-27|  19|mehmetfilik/comWo...|                  6|
|2023-07-27|  19|FabianoCarneiro/p...|                  6|
+----------+----+--------------------+-------------------+
only showing top 10 rows



In [122]:
spark.sql(
    """
    SELECT
        date(created_at) as date,
        hour(created_at) as hour,
        created_at,
        repo.name,
        payload.member.login as member_name
    FROM activity
    WHERE
        type = 'MemberEvent'
        and payload.action = 'added'
    ORDER BY
        date ASC,
        hour ASC,
        created_at ASC,
        repo.name ASC,
        member_name ASC
    """
).show(10)

+----------+----+--------------------+--------------------+---------------+
|      date|hour|          created_at|                name|    member_name|
+----------+----+--------------------+--------------------+---------------+
|2023-07-27|  18|2023-07-27T18:00:15Z|noirblancrouge/Pa...|        bsozeau|
|2023-07-27|  18|2023-07-27T18:00:31Z| Spinofficial/printf|     RolivhuwaN|
|2023-07-27|  18|2023-07-27T18:00:33Z| faellm/calculoRotas|    stheeCamile|
|2023-07-27|  18|2023-07-27T18:00:39Z|Programmer231/Roc...|   bevin-crypto|
|2023-07-27|  18|2023-07-27T18:00:41Z|    Drefdu/DentalNew|AntonioEstrada0|
|2023-07-27|  18|2023-07-27T18:00:44Z|code-X16/simple_s...|      udeme-goc|
|2023-07-27|  18|2023-07-27T18:01:13Z|maxDes23/project-...|       MadDog83|
|2023-07-27|  18|2023-07-27T18:01:15Z|Diego582/ideas-pi...|       FedeSabi|
|2023-07-27|  18|2023-07-27T18:01:16Z|CaravanStudios/dc...| caseylivingood|
|2023-07-27|  18|2023-07-27T18:01:20Z|devfabien/Gym-git...| IbrahimBagalwa|
+----------+

#### Limitations & Problems

1. GH Archive snapshots are hourly, so a single snapshot contains __no data__ about a repository's past activity. \
Without that history, the number of members is incomplete: it can only be derived from events, like `MemberEvent`, that we've crawled & stored. \
For a way around this, see the next section, `How can it be improved?`

#### How can it be improved?

* Get a bigger slice of data, then:
  * Store daily snapshots of all unique combinations of `date`, `repo`, `member`, `action`
  * Compute the metric as a global `count(distinct member)` aggregation grouped by `repo`
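
Per-snapshot distinct counts cannot simply be summed, because the same member may appear in more than one snapshot; the member sets have to be merged first. A minimal plain-Python sketch of that merge, assuming each stored snapshot has already been reduced to a `{repo: set_of_added_member_logins}` mapping (the function names and this intermediate shape are hypothetical):

```python
def merge_member_snapshots(snapshots):
    """Union the per-repo member sets of several snapshots into one mapping."""
    merged = {}
    for snapshot in snapshots:
        for repo, members in snapshot.items():
            # update() deduplicates members seen in earlier snapshots.
            merged.setdefault(repo, set()).update(members)
    return merged


def projects_with_more_members_than(merged, threshold=10):
    """Return the sorted repo names whose merged member count exceeds `threshold`."""
    return sorted(
        repo for repo, members in merged.items() if len(members) > threshold
    )
```

With only `added` events available, this still undercounts members who joined before the first stored snapshot.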