EPIC Infrastructure

EPIC Infrastructure backend monorepo. Contains all services, Kubernetes deployment files, Dataproc definitions, and more.

Diagram

[Image: EPIC Infrastructure architecture diagram]

Overview

This project is designed to run on top of a pre-configured Kubernetes cluster. Basic knowledge of Kubernetes is required; a good start is this Udacity course. The project is structured into two separate parts: collection pipeline services and dashboard services.

Collection Pipeline Services

This part is in charge of connecting to Twitter and collecting tweets 24/7. It consists of two services: TwitterStream (downloads tweets and sends them to Kafka) and tweet-store (receives tweets from Kafka and uploads them to Google Cloud Storage).

Tweets are stored following an EVENT/YYYY/MM/DD/HH/ folder structure (a tweet received at 2 PM on April 3rd, 2019 for the event winter would be stored in the folder winter/2019/04/03/14) in the epic-collect Google Cloud Storage bucket. Each tweet received is buffered; when the buffer reaches 1000 tweets, a file is created and uploaded to the corresponding folder.
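For example, the batch files for that hour can be listed with gsutil (the bucket and path follow the layout above; the file names inside are generated by tweet-store):

    # List the tweet batch files for event "winter" collected on 2019-04-03, 2 PM
    gsutil ls gs://epic-collect/winter/2019/04/03/14/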


Requirements

List of requirements for deploying or developing in this repository.

Development

In order to work on the services, you will need the following (a consolidated setup sketch follows this list):

  • Java 8 installed (Ex: brew install java8)
  • Maven installed (Ex: brew install maven)
  • Make installed (Ex: brew install make)
  • Our authlib installed: cd authlib && mvn install
  • Your local Maven installation set up to pull from the GitHub repository (read how to do so here)
  • Log in on your GCloud CLI. gcloud auth login
  • Create a default token using your GCloud user. gcloud auth application-default login
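Put together, a first-time setup might look like this (a sketch; it assumes Homebrew on macOS and that the Maven/GitHub setup linked above is already done):

    # Install the toolchain (macOS / Homebrew; formula names may vary by setup)
    brew install java8 maven make

    # Build and install the shared auth library into your local Maven repository
    cd authlib && mvn install && cd ..

    # Authenticate the gcloud CLI and create default application credentials
    gcloud auth login
    gcloud auth application-default login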

Deploying

In order to deploy, you will need the following (a sketch of the one-time CLI setup follows this list):

  • Docker CLI installed (Ex: brew install docker)
  • A hub.docker.com account and your Docker CLI connected to it (docker login)
  • Editor access to the Project EPIC Docker Hub organization.
  • Editor access to the GCloud project.
  • kubectl installed (Ex: brew install kubectl)
  • kubectl connected to the corresponding cluster (Project EPIC: gcloud container clusters get-credentials epic-prod --zone us-central1-c --project crypto-eon-164220)
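The CLI setup from the list above boils down to the following (commands taken from the list; run once per machine):

    # Authenticate the Docker CLI against hub.docker.com
    docker login

    # Point kubectl at the Project EPIC production cluster
    gcloud container clusters get-credentials epic-prod \
      --zone us-central1-c --project crypto-eon-164220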

New service

Start development

Requirements: see the Development requirements above.

Deployment

  • (ONLY FIRST TIME) Create a new Kubernetes definition file in the api folder
  • Make sure your resources are protected with the right annotations (see how to do it here)
  • Make sure you have health checks configured properly for external dependencies
  • Add a new path entry under spec in ingress.yaml
    • Ex.
        - path: /new-api-folder/*
          backend:
            serviceName: new-api
            servicePort: 8080
    
  • Update the image version in the Makefile
  • Create and upload the Docker image: make push
  • Update the Docker image version in your API definition file
  • kubectl replace -f api/NEW.yml (replace NEW with your API file name), or kubectl apply -f api/NEW.yml
  • kubectl apply -f ingress.yaml
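After the first-time steps, a release cycle is roughly the following (a sketch; new-api is the hypothetical service from the ingress example above):

    # Build and push the Docker image at the version set in the Makefile
    make push

    # Roll out the updated service definition and the ingress
    kubectl apply -f api/new-api.yml
    kubectl apply -f ingress.yaml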

System deployment


Queries

How to run various queries on the system, over both new and old data.

Collection query

Streaming collection for events happening at the moment.

  • Open dashboard.gerard.space
  • Select Events on the side bar.
  • Press the pink button in the bottom-left corner.
  • Fill in the form with information about the event and use the keywords field to add keywords to collect on. Read more about how Twitter tracking works here.

BigQuery query

  • Open the desired table (see sections below)
  • Click Query table
  • Build the SQL statement for the query you are interested in. See the syntax here.
  • Run the query
  • Download the data by clicking Save results
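The same kind of query can also be run from a terminal with the bq CLI (a sketch; the events dataset and winter table names are hypothetical):

    # Count the tweets in an event table using standard SQL
    bq query --use_legacy_sql=false \
      'SELECT COUNT(*) AS tweet_count FROM `crypto-eon-164220.events.winter`'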

Open table for new infrastructure event

Query on an event collected in the new infrastructure.

  • Open dashboard.gerard.space
  • Select Events on the side bar.
  • Select the event to query.
  • Select the Dashboard tab at the top.
  • (ONLY FIRST TIME) Click Create BigQuery Table
  • Click Explore in BigQuery

Open table on legacy imported events

  • Open the historic dataset in BigQuery
  • If the table already exists: run your query
  • Otherwise, click Create table and set the table configuration to the following (if not specified, leave as defaulted):
    • Create table from: "Google Cloud Storage"
    • Select file...: Browse to the corresponding folder in the epic-historic_tweets bucket and select one file, then replace the filename with a wildcard (Ex: epic-historic-tweets/2012 Canada Fires/*)
    • File format: "JSON (newline delimited)"
    • Table type: "External table"
    • Table name: Fill in a distinct table name
    • Check the Auto detect - Schema and input parameters box
    • Advanced options:
      • Number of errors allowed: 2147483647
      • Check the Ignore unknown values box
  • Click Create table
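Equivalently, the external table can be created from the command line (a sketch; the historic dataset and canada_fires_2012 table names are hypothetical, and the error-tolerance settings from the UI steps would still need to be added to the generated definition file):

    # Build an external table definition over the event's files in Cloud Storage
    bq mkdef --autodetect \
      --source_format=NEWLINE_DELIMITED_JSON \
      'gs://epic-historic-tweets/2012 Canada Fires/*' > table_def.json

    # Create the external table in the historic dataset from that definition
    bq mk --external_table_definition=table_def.json historic.canada_fires_2012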

Frequent errors

Google Cloud is giving an authorization error locally

  • Log in on your GCloud CLI. gcloud auth login
  • Make sure you have been added to the proper Google Cloud project.
  • Create a default token using your GCloud user. gcloud auth application-default login
  • Make sure you don't have any GOOGLE_APPLICATION_CREDENTIALS environment variable set.
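A quick way to check the last point (a sketch):

    # If this prints a path, a service-account key file is overriding your user
    # credentials; unset it for this shell session
    echo $GOOGLE_APPLICATION_CREDENTIALS
    unset GOOGLE_APPLICATION_CREDENTIALS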