Michał Liss
Marceli Sokólski
Piotr Krzystanek

# Installing Pig

Pig was installed in the Hadoop container with the following Dockerfile commands:

```Dockerfile
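# Download and unpack Pig 0.17.0, then move it to /usr/local/pig.
# Each RUN starts in a fresh shell, so the standalone `RUN cd pig-0.17.0` had no effect and is dropped.
RUN wget https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz \
    && tar -xf pig-0.17.0.tar.gz \
    && mv pig-0.17.0 /usr/local/pig \
    && rm pig-0.17.0.tar.gz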
ENV PATH="${PATH}:/usr/local/pig/bin"
```

# Installing Hive

The following additional containers were added:
```yml
  hive-server:
    image: przydasie99/pwr-hadoop-hive-server:amd64
    build:
      context: hive-europe
      dockerfile: Dockerfile
    container_name: hive-server
    volumes:
      - ../master_volume:/data/master_volume
    env_file:
      - ./hadoop.env
    environment:
      HIVE_CORE_CONF_javax_jdo_option_ConnectionURL: "jdbc:postgresql://hive-metastore/metastore"
      SERVICE_PRECONDITION: "hive-metastore:9083"
    depends_on:
      - hive-metastore
    ports:
      - "10000:10000"
```
```yml
  hive-metastore:
    image: przydasie99/pwr-hadoop-hive-server:amd64
    build:
      context: hive-europe
      dockerfile: Dockerfile
    container_name: hive-metastore
    env_file:
      - ./hadoop.env
    command: /opt/hive/bin/hive --service metastore
    environment:
      SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode:9864 hive-metastore-postgresql:5432"
    depends_on:
      - hive-metastore-postgresql
    ports:
      - "9083:9083"
```
```yml
  hive-metastore-postgresql:
    build:
      context: hive-metastore
      dockerfile: Dockerfile
    image: przydasie99/pwr-hadoop-hive-postgres:amd64
    container_name: hive-metastore-postgresql
    ports:
      - '5433:5432'
```
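
The `SERVICE_PRECONDITION` values above are lists of `host:port` pairs that a container appears to wait for before starting its service. Below is a minimal sketch of how the same check could be run from the notebook, for example to confirm that the metastore is reachable before submitting a job. The names `parse_preconditions`, `is_open` and `wait_for_services` are hypothetical, and the polling logic is an assumption about the behaviour, not the images' actual entrypoint code.

```python
import socket
import time


def parse_preconditions(value):
    """Split a 'host:port host:port' string into (host, port) tuples."""
    pairs = []
    for item in value.split():
        host, _, port = item.rpartition(":")
        pairs.append((host, int(port)))
    return pairs


def is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_services(value, retries=30, delay=5.0):
    """Poll every host:port in `value` until all are reachable or retries run out."""
    pending = parse_preconditions(value)
    for _ in range(retries):
        # Keep only the endpoints that are still unreachable.
        pending = [(h, p) for h, p in pending if not is_open(h, p)]
        if not pending:
            return True
        time.sleep(delay)
    return False
```

Calling `wait_for_services("hive-metastore:9083")` from a container on the same Docker network would then return `True` once the metastore port accepts connections.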

# Installation



In [None]:
import docker
import uuid
import paramiko
import os
from timeit import default_timer as timer
from dataclasses import dataclass
import re
import requests
import statistics
import pandas as pd
import numpy as np

In [None]:
def _run_over_ssh(host, command):
    """Run a shell command on `host` over SSH and return (stdout lines, stderr lines)."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username="root", password="pass")
    try:
        _, ssh_stdout, ssh_stderr = ssh.exec_command(command)
        # Read both streams before the connection is closed.
        return (ssh_stdout.readlines(), ssh_stderr.readlines())
    finally:
        ssh.close()

def run_in_master(command):
    # Run from /app on the namenode with the environment variables loaded.
    return _run_over_ssh("namenode", f"cd /app/ && . /env_var_path.sh && {command}")

def run_in_hive(command):
    # The command is wrapped in single quotes, so it must not contain any itself.
    return _run_over_ssh("hive-server", f"bash -c '. /env_var_path.sh && {command}'")
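
`run_in_hive` wraps the command in single quotes, so a command that itself contains a single quote (for example an inline `hive -e` query) would break the `bash -c` call. Below is a minimal sketch of building that string with the standard library's `shlex.quote` instead. `build_hive_command` is a hypothetical helper and not part of the original notebook.

```python
import shlex


def build_hive_command(command):
    """Return a `bash -c` invocation that sources the environment file and runs `command`."""
    inner = f". /env_var_path.sh && {command}"
    # shlex.quote escapes embedded single quotes so bash receives `inner` unchanged.
    return f"bash -c {shlex.quote(inner)}"
```

The returned string can be passed directly to `ssh.exec_command` in place of the hand-built f-string.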

Create a Hive database and the schema for the covid table.

In [None]:
run_in_hive("hive -f /data/master_volume/hive_scripts/covid_table.hql")

Copy covid-dataset.csv into the covid table's directory in the Hive warehouse on HDFS.

In [None]:
run_in_master("hdfs dfs -cp /datasets/covid-dataset.csv /user/hive/warehouse/covid.db/covid/covid-dataset.csv")

Process the data and write the result into another table.

In [None]:
start = timer()
results = run_in_hive("hive -f /data/master_volume/hive_scripts/covid_data.hql")
stop = timer()

print(results)
print(f"HIVE JOB TOOK: {stop - start:.2f}s")
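
A single wall-clock measurement can be noisy. Below is a minimal sketch of repeating a job several times and summarising the timings with the `statistics` module imported earlier. `time_job` and its `repeats` parameter are hypothetical and not part of the original notebook. Like the cell above, it measures the whole round trip, SSH connection included, and it assumes the script can safely be run more than once.

```python
import statistics
from timeit import default_timer as timer


def time_job(job, repeats=3):
    """Call `job()` `repeats` times; return (mean, stdev) of the durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = timer()
        job()
        durations.append(timer() - start)
    # stdev needs at least two samples; report 0.0 for a single run.
    spread = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return statistics.mean(durations), spread
```

For example, `time_job(lambda: run_in_hive("hive -f /data/master_volume/hive_scripts/covid_data.hql"))` would return the mean and standard deviation over three runs.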

Compare the contents of the new table's file with the original dataset.

In [None]:
run_in_master("hdfs dfs -cat /user/hive/warehouse/covid.db/covid/covid-dataset.csv")

In [None]:
run_in_master("hdfs dfs -cat /user/hive/warehouse/covid.db/calculated/000000_0")

Run a Pig script in MapReduce mode and time it for comparison with the Hive job.

In [None]:
start = timer()
results = run_in_master("pig -x mapreduce /data/master_volume/pig_scripts/test.pig")
stop = timer()

print(results)
print(f"PIG JOB TOOK: {stop - start:.2f}s")

In [15]:
run_in_hive("hive -f /data/master_volume/hive_scripts/steam_01_combine.hql")

(['570\tDota 2\tValve\tValve\t\t1456697\t291673\t0\t100,000,000 .. 200,000,000\t39576\t1788\t1021\t944\t0\t0\t0\t612293\n',
  '730\tCounter-Strike: Global Offensive\tValve, Hidden Path Entertainment\tValve\t\t5680980\t756795\t0\t50,000,000 .. 100,000,000\t29878\t858\t6437\t300\t0\t0\t0\t679488\n',
  '578080\tPUBG: BATTLEGROUNDS\tKRAFTON, Inc.\tKRAFTON, Inc.\t\t1139437\t888720\t0\t50,000,000 .. 100,000,000\t21370\t964\t6632\t176\t0\t0\t0\t279181\n',
  '1063730\tNew World\tAmazon Games\tAmazon Games\t\t153807\t73148\t0\t50,000,000 .. 100,000,000\t8392\t590\t3408\t490\t3999\t3999\t0\t20403\n',
  '440\tTeam Fortress 2\tValve\tValve\t\t813122\t56073\t0\t50,000,000 .. 100,000,000\t11858\t1769\t348\t367\t0\t0\t0\t82571\n',
  '304930\tUnturned\tSmartly Dressed Games\tSmartly Dressed Games\t\t438411\t40925\t0\t20,000,000 .. 50,000,000\t10750\t3927\t331\t817\t0\t0\t0\t50423\n',
  '271590\tGrand Theft Auto V\tRockstar North\tRockstar Games\t\t1144201\t208030\t0\t20,000,000 .. 50,000,000\t13182\t8
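
The first element of the returned tuple is a list of tab-separated rows. Below is a minimal sketch of splitting them into fields for easier inspection. `parse_rows` is a hypothetical helper, and since the column meanings are not stated here, the fields are left unnamed.

```python
def parse_rows(lines):
    """Split tab-separated output lines into lists of string fields."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if line:  # skip blank lines
            rows.append(line.split("\t"))
    return rows
```

With the output above stored as `stdout, stderr = run_in_hive(...)`, `parse_rows(stdout)[0][1]` would give the name in the first row, and the nested lists can be passed to `pd.DataFrame` for further analysis.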
