# Report 6 - data acquisition

## Environment

We have an Ubuntu virtual machine set up with Vagrant (which uses VirtualBox under the hood). On this virtual machine we run Docker containers.

To make later work with the Hadoop images easier, we added a data volume to the master node (by modifying the scripts that generate the docker-compose file). This lets us move files conveniently (i.e. by dropping them into the right folder) to a location that is reachable from the Hadoop container. Note that the virtual machine also has such a volume, which is provided by Vagrant.

```yaml
master:
    image: hjben/hadoop-eco:$hadoop_version
    hostname: master
    container_name: master
    privileged: true
    ports:
      - 8088:8088
      - 9870:9870
      - 8042:8042
      - 10000:10000
      - 10002:10002
      - 16010:16010
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup
      - $hdfs_path:/data/hadoop
      - $hadoop_log_path:/usr/local/hadoop/logs
      - $hbase_log_path/master:/usr/local/hbase/logs
      - $hive_log_path:/usr/local/hive/logs
      - $sqoop_log_path:/usr/local/sqoop/logs
      - /vagrant/master_volume:/data/master_volume  # added volume
    networks:
      hadoop-cluster:
        ipv4_address: 10.1.2.3
    extra_hosts:
      - "mariadb:10.1.2.2"
      - "master:10.1.2.3"
```

## Downloading the data
Unfortunately, because a Kaggle API key has to be generated and supplied, a few steps are required before the data can be downloaded.

1. Download the API key (kaggle.json) from the Kaggle website.
2. Create a .kaggle folder in the user's home directory and copy the API key there.
3. The functions below:
    1. Download from Kaggle:
       * [YouTube Trending Video Dataset](https://www.kaggle.com/datasets/rsrishav/youtube-trending-video-dataset)
       * [Steam Dataset](https://www.kaggle.com/datasets/souyama/steam-dataset)
    2. Download the [COVID data](https://covid.ourworldindata.org/data/owid-covid-data.csv) from the web.
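
Step 2 can also be scripted. The following is a minimal sketch, not part of the original notebook: `install_kaggle_key` is a hypothetical helper, and restricting the file to owner-only access is an assumption about good practice for credential files rather than something the report requires.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_key(downloaded_key, home=None):
    """Copy a downloaded kaggle.json into <home>/.kaggle and return the new path."""
    home_dir = Path(home) if home is not None else Path.home()
    target_dir = home_dir / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / "kaggle.json"
    shutil.copyfile(downloaded_key, target)
    # Owner read/write only (assumption: the key should not be readable by others).
    # On Windows, chmod only affects the read-only flag.
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target
```

It would be called once, for example as `install_kaggle_key("Downloads/kaggle.json")`, where the source path is wherever the browser saved the key.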

In [1]:
import os
from timeit import default_timer as timer
import requests
import docker
import json
import opendatasets as od

In [2]:
if os.getcwd().startswith("/tmp"):
    os.chdir("/vagrant/sprawozdania/akwizycja")

In [3]:
output_dir = "../../master_volume/datasets"

In [14]:
%%timeit -r 1 -n 1
if not os.path.isdir(f"{output_dir}/youtube-trending-video-dataset"):
    od.download("https://www.kaggle.com/datasets/rsrishav/youtube-trending-video-dataset", f"{output_dir}")
else:
    print("Dataset youtube-trending-video-dataset already exists, skipping download")

Downloading youtube-trending-video-dataset.zip to ../../master_volume/datasets\youtube-trending-video-dataset


100%|██████████| 1.24G/1.24G [02:17<00:00, 9.70MB/s]



2min 32s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [18]:
%%timeit -r 1 -n 1
if not os.path.isdir(f"{output_dir}/steam-dataset"):
    od.download("https://www.kaggle.com/datasets/souyama/steam-dataset", f"{output_dir}")
else:
    print("Dataset steam-dataset already exists, skipping download")

Downloading steam-dataset.zip to ../../master_volume/datasets\steam-dataset


100%|██████████| 610M/610M [00:57<00:00, 11.1MB/s] 



1min 9s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [19]:
%%timeit -r 1 -n 1
path = f"{output_dir}/covid-dataset.csv"

if not os.path.isfile(path):
    print(f"Downloading covid-dataset to {path}")
    start = timer()
    r = requests.get("https://covid.ourworldindata.org/data/owid-covid-data.csv", allow_redirects=True)
    r.raise_for_status()  # fail loudly instead of saving an error page as CSV
    with open(path, 'wb') as file:
        file.write(r.content)
    end = timer()
    print(f"Download finished in {end - start:.02f}s")
else:
    print("Dataset covid-dataset already exists, skipping download")

print(f"covid-dataset.csv: {os.stat(path).st_size / (1024 * 1024):.02f}MB")

Downloading covid-dataset to ../../master_volume/datasets/covid-dataset.csv
Download finished in 1.80s
covid-dataset.csv: 77.76MB
1.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


## Formatting the data
After unpacking the data it becomes clear that some of it is in a format that is awkward to work with later. In the end we decided to convert all JSON files to the .jsonl format before putting them into HDFS. A .jsonl file contains JSON objects, one per line. Because every line can be parsed on its own, this format makes map-reduce processes easy to implement.
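
To illustrate the format on a small scale, here is a minimal sketch with made-up sample data (only the Counter-Strike entry mirrors the dataset, and the second entry is purely illustrative). The helper names `to_jsonl_lines` and `parse_jsonl` are hypothetical. It shows a top-level dict being flattened into key/value records and each line being parsed independently, as a mapper would.

```python
import json


def to_jsonl_lines(data):
    """Return one JSON string per record; a dict becomes key/value records."""
    if isinstance(data, dict):
        data = [{"key": key, "value": value} for key, value in data.items()]
    # json.dumps escapes newlines inside strings, so each record stays on one line
    return [json.dumps(obj) for obj in data]


def parse_jsonl(text):
    """Parse every non-empty line on its own, without reading the whole file as one document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Illustrative sample, not taken from the real files
sample = {"10": {"name": "Counter-Strike"}, "20": {"name": "Example"}}
lines = to_jsonl_lines(sample)
records = parse_jsonl("\n".join(lines))

print(lines[0])  # {"key": "10", "value": {"name": "Counter-Strike"}}
```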


In [20]:
%%timeit -r 1 -n 1

print("Converting JSON to JSONL")
for root, directories, files in os.walk(f"{output_dir}"):
        for filename in files:
            path = os.path.join(root,filename)
            output_path = f"{path}l"

            if os.path.isfile(output_path):
                continue

            if path.endswith(".json"):
                print(path)
                with open(path, "r") as file:
                    data = json.load(file)
                    if isinstance(data, dict):
                        # flatten a top-level object into one record per key
                        data = [{"key": key, "value": value} for key, value in data.items()]

                    with open(output_path, "w") as jsonl_file:
                        for obj in data:
                            json.dump(obj, jsonl_file)
                            jsonl_file.write("\n")
                    print(output_path)

print("Done")

Converting JSON to JSONL
../../master_volume/datasets\steam-dataset\steam_dataset\appinfo\dlc_data\missing.json
../../master_volume/datasets\steam-dataset\steam_dataset\appinfo\dlc_data\missing.jsonl
../../master_volume/datasets\steam-dataset\steam_dataset\appinfo\dlc_data\steam_dlc_data.json
../../master_volume/datasets\steam-dataset\steam_dataset\appinfo\dlc_data\steam_dlc_data.jsonl
../../master_volume/datasets\steam-dataset\steam_dataset\appinfo\store_data\steam_store_data.json
../../master_volume/datasets\steam-dataset\steam_dataset\appinfo\store_data\steam_store_data.jsonl
../../master_volume/datasets\steam-dataset\steam_dataset\news_data\missing.json
../../master_volume/datasets\steam-dataset\steam_dataset\news_data\missing.jsonl
../../master_volume/datasets\steam-dataset\steam_dataset\news_data\steam_news_data.json
../../master_volume/datasets\steam-dataset\steam_dataset\news_data\steam_news_data.jsonl
../../master_volume/datasets\steam-dataset\steam_dataset\steamspy\basic\stea

## Adding files to HDFS

In [21]:
client = docker.from_env()
container = client.containers.get('master')

def hdfs_mkdir(path):
    container.exec_run(f"hdfs dfs -mkdir -p /{path}/")

def hdfs_upload(path):
    directory = "/".join(path.split("/")[:-1])
    hdfs_mkdir(directory)
    cmd = f"hdfs dfs -put /data/master_volume/{path} /{directory}"
    print(cmd)
    code, output = container.exec_run(cmd)
    print(f"exit code {code}")
    print(output)

In [22]:
%%timeit -r 1 -n 1

print("Uploading to HDFS")

for root, directories, files in os.walk(f"{output_dir}"):
        for filename in files:
            if filename.endswith(".json"):
                continue # skip JSON files, we have JSONL from previous step
            filepath = os.path.join(root, filename)

            # path relative to the shared volume, with forward slashes
            path = filepath\
                .replace("../../master_volume/", "")\
                .replace("\\", "/")

            start_single = timer()
            hdfs_upload(path)
            end_single = timer()
            print(f"HDFS upload of {os.stat(filepath).st_size / (1024 * 1024):.02f}MB took {end_single - start_single:.02f}s")

Uploading to HDFS
hdfs dfs -put /data/master_volume/datasets/covid-dataset.csv /datasets
exit code 0
b''
HDFS upload of 77.76MB took 5.08s
hdfs dfs -put /data/master_volume/datasets/steam-dataset/steam_dataset/appinfo/dlc_data/missing.jsonl /datasets/steam-dataset/steam_dataset/appinfo/dlc_data
exit code 0
b''
HDFS upload of 0.00MB took 2.72s
hdfs dfs -put /data/master_volume/datasets/steam-dataset/steam_dataset/appinfo/dlc_data/steam_dlc_data.jsonl /datasets/steam-dataset/steam_dataset/appinfo/dlc_data
exit code 0
b''
HDFS upload of 230.57MB took 9.75s
hdfs dfs -put /data/master_volume/datasets/steam-dataset/steam_dataset/appinfo/dlc_data/timestamp.txt /datasets/steam-dataset/steam_dataset/appinfo/dlc_data
exit code 0
b''
HDFS upload of 0.00MB took 2.84s
hdfs dfs -put /data/master_volume/datasets/steam-dataset/steam_dataset/appinfo/store_data/steam_store_data.jsonl /datasets/steam-dataset/steam_dataset/appinfo/store_data
exit code 0
b''
HDFS upload of 552.35MB took 19.69s
hdfs dfs -pu

In [23]:
def hdfs_set_replication_level(number):
    container.exec_run(f"hdfs dfs -setrep -R {number} /")

hdfs_set_replication_level(3)
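
The replication helper above discards the result of `exec_run`, and `hdfs_mkdir` does the same, so a failed command would go unnoticed. Below is a minimal sketch of a wrapper that raises on a non-zero exit code. `run_checked` and `FakeContainer` are hypothetical names introduced here, and the fake only stands in for the real container so that the sketch runs without Docker.

```python
def run_checked(container, cmd):
    """Run cmd through container.exec_run and raise if the exit code is non-zero."""
    exit_code, output = container.exec_run(cmd)
    text = output.decode("utf-8", errors="replace")
    if exit_code != 0:
        raise RuntimeError(f"{cmd!r} failed with exit code {exit_code}: {text}")
    return text


class FakeContainer:
    """Stand-in for the Docker container, used only to demonstrate the helper."""

    def __init__(self, exit_code, output):
        self._result = (exit_code, output)

    def exec_run(self, cmd):
        # Mimics the (exit_code, output) pair unpacked in hdfs_upload
        return self._result


print(run_checked(FakeContainer(0, b"ok\n"), "hdfs dfs -ls /"))
```

With the real `container` object, `run_checked(container, "hdfs dfs -setrep -R 3 /")` would surface an error instead of silently continuing.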

## Acquisition of variable data

Our process assumes that, based on part of the static data about games from Steam (steam_dataset), we select the games whose player counts over time we then request from the SteamCharts API using a very simple query:
```
https://steamcharts.com/app/<appid>/chart-data.json
```

To acquire this data we prepared a map-reduce process that takes the required ids as input and returns the query results in a suitable form (as shown in the illustration below). At present, the selection of games is simulated (the first 20 are simply taken). The first subprocess, which aggregates the data, has been correctly implemented as a map-reduce process.

![](images/SteamChartsBig.png)
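
For a single game, the query and the reshaping of its result can be sketched in Python as follows. This is an illustration only, not the map-reduce job itself. It assumes the endpoint returns a JSON array of `[timestamp_ms, count]` pairs, which is inferred from the job output shown later in this report. The helper names are hypothetical, and the sample payload reuses two values from that output instead of calling the network.

```python
import json
from datetime import datetime, timezone

CHART_URL = "https://steamcharts.com/app/{appid}/chart-data.json"


def chart_url(appid):
    """Build the SteamCharts query URL for one game id."""
    return CHART_URL.format(appid=appid)


def parse_chart_data(payload):
    """Turn a JSON array of [timestamp_ms, count] pairs into records (assumed layout)."""
    return [
        {"timestamp": int(row[0]), "count": int(row[-1])}
        for row in json.loads(payload)
    ]


def to_utc_date(timestamp_ms):
    """Convert a millisecond Unix timestamp to an ISO date in UTC."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc).date().isoformat()


# Sample payload in the assumed layout; a real one could come from
# requests.get(chart_url(appid), timeout=10).text
sample_payload = "[[1546300800000, 2], [1548979200000, 1]]"
records = parse_chart_data(sample_payload)

print(chart_url(10))
print(records[0], to_utc_date(records[0]["timestamp"]))
```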

### Data aggregation

In [24]:
client = docker.from_env()
container = client.containers.get('master')

res1 = container.exec_run(("yarn jar /data/master_volume/map_reduce_jars/1.jar "
"/datasets/steam-dataset/steam_dataset/appinfo/store_data/steam_store_data.jsonl "
"/datasets/steam-dataset/steam_dataset/steamspy/basic/steam_spy_scrap.jsonl "
"/out_steam_1"))

res1

ExecResult(exit_code=0, output=b"2023-04-24 05:41:14,965 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at master/10.1.2.3:8050\n2023-04-24 05:41:15,176 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.\n2023-04-24 05:41:15,186 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1682280865018_0001\n2023-04-24 05:41:15,882 INFO input.FileInputFormat: Total input files to process : 2\n2023-04-24 05:41:16,054 INFO mapreduce.JobSubmitter: number of splits:6\n2023-04-24 05:41:16,158 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1682280865018_0001\n2023-04-24 05:41:16,158 INFO mapreduce.JobSubmitter: Executing with tokens: []\n2023-04-24 05:41:16,287 INFO conf.Configuration: resource-types.xml not found\n2023-04-24 05:41:16,287 INFO resource.ResourceUtils: Unable

In [25]:
raw = container.exec_run("hdfs dfs -cat /out_steam_1/part-r-00000").output.decode('utf-8')
print(f"{raw[0:10000]}")

10	{"game_id":10,"name":"Counter-Strike","positive":196557,"negative":5070,"owners":"10,000,000 .. 20,000,000","ccu":12877,"release_date":"Nov 1, 2000"}
1000000	{"game_id":1000000,"name":"ASCENXION","positive":27,"negative":5,"owners":"0 .. 20,000","ccu":0,"release_date":"May 14, 2021"}
1000010	{"game_id":1000010,"name":"Crown Trick","positive":3809,"negative":583,"owners":"200,000 .. 500,000","ccu":51,"release_date":"Oct 16, 2020"}
1000030	{"game_id":1000030,"name":"Cook, Serve, Delicious! 3?!","positive":1469,"negative":101,"owners":"50,000 .. 100,000","ccu":54,"release_date":"Oct 14, 2020"}
1000040	{"game_id":1000040,"name":"细胞战争","positive":0,"negative":1,"owners":"0 .. 20,000","ccu":0,"release_date":"Mar 30, 2019"}
1000080	{"game_id":1000080,"name":"Zengeon","positive":1011,"negative":431,"owners":"50,000 .. 100,000","ccu":2,"release_date":"Jun 24, 2019"}
1000100	{"game_id":1000100,"name":"干支セトラ　陽ノ卷｜干支etc.　陽之卷","positive":18,"negative":6,"owners":"0 .. 20,000","ccu":0,"release_dat

In [26]:
res2 = container.exec_run(("yarn jar /data/master_volume/map_reduce_jars/2.jar "
"/out_steam_1/part-r-00000 "
"/out_steam_2"))

res2

ExecResult(exit_code=0, output=b"2023-04-24 05:43:20,863 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at master/10.1.2.3:8050\n2023-04-24 05:43:21,084 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.\n2023-04-24 05:43:21,096 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1682280865018_0002\n2023-04-24 05:43:22,221 INFO input.FileInputFormat: Total input files to process : 1\n2023-04-24 05:43:22,372 INFO mapreduce.JobSubmitter: number of splits:1\n2023-04-24 05:43:22,499 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1682280865018_0002\n2023-04-24 05:43:22,499 INFO mapreduce.JobSubmitter: Executing with tokens: []\n2023-04-24 05:43:22,612 INFO conf.Configuration: resource-types.xml not found\n2023-04-24 05:43:22,612 INFO resource.ResourceUtils: Unable

In [27]:
raw = container.exec_run("hdfs dfs -cat /out_steam_2/part-r-00000").output.decode('utf-8')
print(f"{raw[0:10000]}")

id	{"game_id":999990,"name":"Bouncing Hero"}
id	{"game_id":999930,"name":"Fantasy Sino-Japanese War 幻想甲午"}
id	{"game_id":999900,"name":"Studio by RADiCAL"}
id	{"game_id":999890,"name":"Bruken"}
id	{"game_id":999880,"name":"NASA's Exoplanet Excursions"}
id	{"game_id":999860,"name":"Enemy On Board"}
id	{"game_id":999840,"name":"RONE"}
id	{"game_id":999830,"name":"Becalm"}
id	{"game_id":999760,"name":"Mobile Wars X"}
id	{"game_id":999750,"name":"BLASTER LiLO"}
id	{"game_id":999730,"name":"Secret Neighbor Beta"}
id	{"game_id":999660,"name":"SAMURAI SHODOWN NEOGEO COLLECTION"}
id	{"game_id":999640,"name":"Cube Defense"}
id	{"game_id":999560,"name":"Fragile Equilibrium"}
id	{"game_id":999540,"name":"Scroll2Read"}
id	{"game_id":999430,"name":"Santa Tracker"}
id	{"game_id":999410,"name":"Magnibox"}
id	{"game_id":999360,"name":"soul room"}
id	{"game_id":999350,"name":"Uzak Diyar Destanları 1: Remastered"}
id	{"game_id":999310,"name":"THE NED BALLS"}



In [28]:
res3 = container.exec_run(("yarn jar /data/master_volume/map_reduce_jars/3.jar "
"/out_steam_2/part-r-00000 "
"/out_steam_3"))

res3

ExecResult(exit_code=0, output=b"2023-04-24 05:44:03,537 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at master/10.1.2.3:8050\n2023-04-24 05:44:03,736 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.\n2023-04-24 05:44:03,750 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1682280865018_0003\n2023-04-24 05:44:04,827 INFO input.FileInputFormat: Total input files to process : 1\n2023-04-24 05:44:04,967 INFO mapreduce.JobSubmitter: number of splits:1\n2023-04-24 05:44:05,086 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1682280865018_0003\n2023-04-24 05:44:05,086 INFO mapreduce.JobSubmitter: Executing with tokens: []\n2023-04-24 05:44:05,187 INFO conf.Configuration: resource-types.xml not found\n2023-04-24 05:44:05,187 INFO resource.ResourceUtils: Unable

In [29]:
raw = container.exec_run("hdfs dfs -cat /out_steam_3/part-r-00000").output.decode('utf-8')
print(f"{raw[0:1000000]}")

999310	List([{"timestamp":1546300800000,"count":2},{"timestamp":1548979200000,"count":1},{"timestamp":1551398400000,"count":1},{"timestamp":1554076800000,"count":1},{"timestamp":1556668800000,"count":1},{"timestamp":1559347200000,"count":0},{"timestamp":1561939200000,"count":0},{"timestamp":1564617600000,"count":0},{"timestamp":1567296000000,"count":0},{"timestamp":1569888000000,"count":0},{"timestamp":1572566400000,"count":0},{"timestamp":1575158400000,"count":0},{"timestamp":1577836800000,"count":0},{"timestamp":1580515200000,"count":0},{"timestamp":1583020800000,"count":0},{"timestamp":1585699200000,"count":0},{"timestamp":1588291200000,"count":0},{"timestamp":1590969600000,"count":0},{"timestamp":1593561600000,"count":0},{"timestamp":1596240000000,"count":0},{"timestamp":1598918400000,"count":0},{"timestamp":1601510400000,"count":0},{"timestamp":1604188800000,"count":0},{"timestamp":1606780800000,"count":0},{"timestamp":1609459200000,"count":0},{"timestamp":1612137600000,"count":0}

### Data acquisition

## Map-reduce code snippets

As is easy to notice, the code below is written in Scala. The biggest piece of work needed to adapt map-reduce to Scala was writing a function that converts the iterator passed to reduce. That iterator is not compatible with the API, so it first had to be converted to an ArrayList.

```scala
  class MyMapper extends HadoopJob.HadoopMapper[AnyRef, Text, Text, Text] {
    override def myMap(key: AnyRef, value: Text, emit: (Text, Text) => Unit): Unit = {
      val jsons  = value.toString.split("\n").map(x => x.dropWhile(!_.isWhitespace)).map(_.trim).toList
      val mapped = jsons.flatMap(x => Input.decoder.decodeJson(x).toOption)

      def downloadTimestamps(id: Int) = for {
        res  <- zio.http.Client.request(f"""https://steamcharts.com/app/$id/chart-data.json""")
        data <- res.body.asString
        json  = data.fromJson[ApiResult].getOrElse(List.empty)
      } yield json

      val workflow = ZIO.foreach(mapped)(x => downloadTimestamps(x.game_id).map(res => (x, res)))

      val result = runZIO(
        workflow
          .tapError(err => { ZIO.succeed(emit(Text("error!"), Text(err.getMessage))) })
          .orElse(ZIO.succeed(List.empty))
          .provide(zio.http.Client.default)
      ).map(x => Result(x._1.game_id, x._2.map(y => PlayCount(y.head, y.last))))

      result.foreach { x =>
        emit(new Text(x.game_id.toString), Text(x.playcounts.toJson))
      }
    }
  }

  class MyReducer extends HadoopJob.HadoopReducer[Text, Text, Text] {
    override def myReduce(key: Text, values: List[String], emit: (Text, Text) => Unit): Unit = {
      emit(key, Text(values.toString()))
    }
  }

  def main(args: Array[String]) = {

    java.lang.System.setProperty("java.net.preferIPv4Stack", "true")

    val conf = new Configuration
    val job  = Job.getInstance(conf, "word count")

    job.setJarByClass(classOf[Main.type])
    job.setMapperClass(classOf[MyMapper])
    job.setReducerClass(classOf[MyReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[Text])

    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))

    java.lang.System.exit(
      if (job.waitForCompletion(true)) 0
      else 1
    )
  }
```