## Setup Guide

Welcome to the OSS Insight setup guide.

This guide helps you set up your environment for OSS Insight.

### Prerequisites

Before you begin, you need to have the following installed:

- TiDB Cluster - Database to store the data
- MyCLI - Connect to the TiDB Cluster
- Node.js 18.x or above - API server runtime
- PNPM - Package manager for Node.js
- Python 3.x and pip - For Jupyter Notebook

#### 1. Start a TiDB Cluster

First, you need to start a TiDB cluster. You can create a **serverless tier** cluster on TiDB Cloud; see [Create a TiDB cluster](https://docs.pingcap.com/tidbcloud/tidb-cloud-quickstart#step-1-create-a-tidb-cluster) to learn how to create a new serverless tier cluster.

> **Note**
>
> If you already have a TiDB cluster that you can connect to, you can skip this step.
>

After you create a TiDB cluster on TiDB Cloud, open the cluster details page.

On this page, you can find the connection information in the connection panel.

<center>
  <img align="middle" width="800" alt="Serverless Tier Cluster Manage Interface" src="https://user-images.githubusercontent.com/5086433/204476069-0ddbdf6f-419c-4291-b929-ccfbd2f5ea5f.png">
  <p><i>Serverless Tier Cluster Manage Interface</i></p>
</center>

To generate a root user password, open the cluster's security settings window by clicking the modify menu in the upper-right corner of the cluster details page.

<center>
  <img width="480" alt="The Cluster Modify Menu" src="https://user-images.githubusercontent.com/85985765/204876779-3a4c6ac4-8814-47cd-b82a-40eb5e4d8f96.png">
  <p><i>The Cluster Modify Menu</i></p>
</center>

<center>
  <img width="720" alt="Security Settings" src="https://user-images.githubusercontent.com/85985765/204877348-5c3e9012-f7bf-42e9-8a03-fd9f14bfc826.png">
  <p><i>Security Settings</i></p>
</center>

#### 3. Prepare Your GitHub Access Token

You need to prepare a personal access token to allow the application to access the data through the GitHub API.

> **Note**
>
> If you are viewing this document from GitHub Codespaces, you can skip this step because `GITHUB_TOKEN` is set by default in the codespace environment.
>

To learn how to generate one, read [Creating a personal access token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token), or go directly to the [new personal access token page](https://github.com/settings/personal-access-tokens/new) to generate one quickly.

<center>
  <img align="middle" width="800" alt="Create a New GitHub Personal Access Token" src="https://user-images.githubusercontent.com/5086433/204564273-93cccbe4-d10a-4d1b-a9d1-112a1144712a.png">
  <p><i>Create a new GitHub personal access token</i></p>
</center>


#### 4. Set Up the Environment Variables

Run the script below **in VS Code** and enter your database connection information and personal access token when prompted. The script writes them to `./packages/api-server/.env`.


```python
import getpass
import os
import sys

api_server_dot_env = "./packages/api-server/.env"
if os.path.exists(api_server_dot_env):
    # An empty answer falls back to the default "y".
    con = input(api_server_dot_env + " file already exists, do you want to overwrite it? (y/n) [y]") or "y"
    if con != "y":
        sys.exit()

# Configure the GitHub personal access token.
github_token = os.getenv("GITHUB_TOKEN")
if github_token is None:
    github_token = getpass.getpass(prompt="Enter your personal access token of GitHub: ")

# Configure the database connection.
db_endpoint = input("The endpoint of TiDB cluster: ")
# Apply the default before converting, so that an empty answer does not raise ValueError.
db_port = int(input("The port of TiDB cluster [4000]: ") or 4000)
db_username = input("The username of TiDB cluster: ")
db_password = getpass.getpass(prompt="Enter the password of TiDB cluster: ")
db_name = "ossinsight"
db_enable_ssl = input("Enable ssl connection to the TiDB cluster? (y/n) [y]") or "y"
db_ssl_config = '&ssl={"minVersion":"TLSv1.2"}'
if db_enable_ssl == "n":
    db_ssl_config = ""

# Write to the ./packages/api-server/.env file.
with open(api_server_dot_env, "w") as file:
    file.write(
        "DATABASE_URL=mysql://{}:{}@{}:{}/{}?connectionLimit=100&queueLimit=10000{}\n".format(
            db_username, db_password, db_endpoint, db_port, db_name, db_ssl_config
        )
    )
    file.write("ENABLE_CACHE=false\n")
    file.write("GITHUB_ACCESS_TOKENS={}\n".format(github_token))

print("Setup successfully!")
```

If the setup succeeds, the script prints `Setup successfully!`.
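The setup script writes the username and password into `DATABASE_URL` verbatim, so a password containing characters such as `@` or `:` could produce an ambiguous URL. The following is a minimal sketch of one way to guard against that by percent-encoding the credentials. The helper name `build_database_url` is hypothetical, and the sketch assumes that whatever reads `DATABASE_URL` decodes percent-encoded credentials, which this guide does not confirm.

```python
from urllib.parse import quote


def build_database_url(username, password, endpoint, port=4000,
                       database="ossinsight", enable_ssl=True):
    """Build a DATABASE_URL value in the same shape as the setup script.

    Hypothetical helper: it is not part of the OSS Insight code base.
    """
    # Percent-encode the credentials so that '@', ':' or '/' cannot be
    # mistaken for URL delimiters. Assumption: the consumer of the URL
    # decodes percent-encoded credentials.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    ssl_config = '&ssl={"minVersion":"TLSv1.2"}' if enable_ssl else ""
    return (
        f"mysql://{user}:{secret}@{endpoint}:{port}/{database}"
        f"?connectionLimit=100&queueLimit=10000{ssl_config}"
    )


# Placeholder values for illustration only.
print(build_database_url("root", "p@ss:word", "tidb.example.com"))
```

If the API server turns out not to decode the credentials, keep the original verbatim form and choose a password without URL delimiter characters instead.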


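#### 5. Check if You Can Connect to Your TiDB Cluster

Execute the following command in a VS Code terminal to verify that you can connect to the TiDB cluster. The command expects `DB_ENDPOINT`, `DB_PORT`, `DB_USERNAME`, and `DB_PASSWORD` to hold your connection information.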

```bash
mycli -h ${DB_ENDPOINT} -P ${DB_PORT} -u ${DB_USERNAME} -p ${DB_PASSWORD} -D test \
    --ssl-ca=/etc/ssl/certs/ca-certificates.crt \
    --ssl-verify-server-cert \
     -e 'SELECT tidb_version()\G'
```

If the connection succeeds, you will see the version information of the TiDB cluster, like the following:

```
tidb_version()
Release Version: v6.3.0-serverless
Edition: Community
Git Commit Hash: e87c16b215d518aed4921b8ef3b13e90e3ed6e2d
Git Branch: release-6.3-serverless
UTC Build Time: 2022-11-25 09:31:28
GoVersion: go1.19
Race Enabled: false
TiKV Min Version: 6.1.0
Check Table Before Drop: false
Store: tikv
```

### Load Data

GitHub was launched in April 2008, and the `/events` API was published in February 2011, so there is a large amount of both historical and real-time data.

> **GitHub API Docs**
> 
> Link: https://docs.github.com/en/rest/activity/events
>


#### Load Historical Data

We can't fetch historical data from the GitHub `/events` API, but fortunately it has been archived by [GH Archive](https://gharchive.org).

GitHub provides 20+ event types, which range from new commits and fork events, to opening new tickets, commenting, and adding members to a project. These events are aggregated into hourly archives, which you can access with any HTTP client:

| Query | Downloadable Files |
| ---- | ---- |
| Activity for 1/1/2015 @ 3PM UTC | `https://data.gharchive.org/2015-01-01-15.json.gz` |
| Activity for 1/1/2015 | `https://data.gharchive.org/2015-01-01-{0..23}.json.gz` |
| Activity for all of January 2015 | `https://data.gharchive.org/2015-01-{01..31}-{0..23}.json.gz` |
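The brace patterns in the table are shell expansions. As a rough sketch, the same hourly URLs can be generated in Python, and a downloaded archive can be decoded, assuming each archive is a gzip-compressed file with one JSON event per line. The function names `archive_urls` and `parse_archive` are illustrative, not part of OSS Insight.

```python
import gzip
import json
from datetime import datetime, timedelta


def archive_urls(start, end):
    """Yield one GH Archive URL per hour from start to end (inclusive)."""
    current = start.replace(minute=0, second=0, microsecond=0)
    while current <= end:
        # The hour is not zero-padded, matching the {0..23} pattern in the table.
        yield "https://data.gharchive.org/{:%Y-%m-%d}-{}.json.gz".format(
            current, current.hour
        )
        current += timedelta(hours=1)


def parse_archive(raw_bytes):
    """Decode the bytes of one archive into a list of event dictionaries.

    Assumption: the archive is gzip-compressed, newline-delimited JSON.
    """
    text = gzip.decompress(raw_bytes).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# All 24 hourly archives for 1/1/2015.
urls = list(archive_urls(datetime(2015, 1, 1, 0), datetime(2015, 1, 1, 23)))
print(len(urls))   # 24
print(urls[15])    # https://data.gharchive.org/2015-01-01-15.json.gz
```

The bytes passed to `parse_archive` can come from any HTTP client. Downloading is left out of the sketch because a single hourly archive can be large.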

For the convenience of the demo, we have prepared a sample dataset and stored it in AWS S3.


#### Load Realtime Data

According to the [GitHub Events API Docs](https://docs.github.com/en/rest/activity/events), you can fetch real-time events that happened on GitHub through the `/events` API:

```bash
curl -s \
  -H "Accept: application/vnd.github.v3+json" \
  -H "Authorization: token $GITHUB_TOKEN" \
  https://api.github.com/events
```

The response is a JSON array of events. The beginning of a sample response (truncated) looks like this:

```
[
  {
    "id": "25584614669",
    "type": "IssueCommentEvent",
    "actor": {
      "id": 49699333,
      "login": "dependabot[bot]",
      "display_login": "dependabot",
      "gravatar_id": "",
      "url": "https://api.github.com/users/dependabot[bot]",
      "avatar_url": "https://avatars.githubusercontent.com/u/49699333?"
    },
    "repo": {
      "id": 280926365,
      "name": "daviseares/goweather-react-native",
      "url": "https://api.github.com/repos/daviseares/goweather-react-native"
    },
    "payload": {
      "action": "created",
      "issue": {
        "url": "https://api.github.com/repos/daviseares/goweather-react-native/issues/212",
        "repository_url": "https://api.github.com/repos/daviseares/goweather-react-native",
        "labels_url": "https://api.github.com/repos/daviseares/goweather-react-native/issues/212/labels{/name}",
        "comments_url": "https://api.g
```
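As a minimal sketch of working with this response in Python, the snippet below sends the same request with the standard library and counts events by their `type` field. The function names are illustrative, and the sample list is hand-written for demonstration rather than real API output.

```python
import json
import urllib.request
from collections import Counter


def fetch_events(token=None):
    """Fetch one page of public events from the GitHub Events API."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = "token " + token
    request = urllib.request.Request(
        "https://api.github.com/events", headers=headers
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def count_event_types(events):
    """Count how many events of each type appear in a response."""
    return Counter(event["type"] for event in events)


# Hand-written sample with only the field used here, not real API output.
sample_events = [
    {"type": "IssueCommentEvent"},
    {"type": "PushEvent"},
    {"type": "IssueCommentEvent"},
]
print(count_event_types(sample_events).most_common())
# [('IssueCommentEvent', 2), ('PushEvent', 1)]
```

To run it against live data, call `count_event_types(fetch_events(os.getenv("GITHUB_TOKEN")))` after importing `os`. Unauthenticated requests also work but are subject to stricter rate limits.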

## Get Insight With SQL


#### Example 1

```sql
SELECT * FROM github_events LIMIT 1;
```

```sql
EXPLAIN SELECT * FROM github_events LIMIT 1;
```