# Report 6 - data acquisition

## Environment

We use an Ubuntu virtual machine provisioned with Vagrant (which uses VirtualBox under the hood). The Docker containers run on this virtual machine.

To make later work with the Hadoop images easier, we added a data volume to the master node (by modifying the scripts that generate docker-compose). This lets us conveniently move files, simply by dropping them into the right folder, to a location that is reachable from the machine running Hadoop. Note that the virtual machine has such a volume as well; it is provided by Vagrant.

```yaml
master:
    image: hjben/hadoop-eco:$hadoop_version
    hostname: master
    container_name: master
    privileged: true
    ports:
      - 8088:8088
      - 9870:9870
      - 8042:8042
      - 10000:10000
      - 10002:10002
      - 16010:16010
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup
      - $hdfs_path:/data/hadoop
      - $hadoop_log_path:/usr/local/hadoop/logs
      - $hbase_log_path/master:/usr/local/hbase/logs
      - $hive_log_path:/usr/local/hive/logs
      - $sqoop_log_path:/usr/local/sqoop/logs
      - /vagrant/master_volume:/data/master_volume  # added volume
    networks:
      hadoop-cluster:
        ipv4_address: 10.1.2.3
    extra_hosts:
      - "mariadb:10.1.2.2"
      - "master:10.1.2.3"
```
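
The two mount points in the fragment above mean that the same file has one path on the virtual machine and another inside the master container. A minimal sketch of that mapping follows; `vm_to_container` is a hypothetical helper that is not part of the original scripts, and the two roots are copied from the added volume line.

```python
from pathlib import PurePosixPath

# Mount points copied from the added volume entry in the compose fragment.
VM_ROOT = PurePosixPath("/vagrant/master_volume")
CONTAINER_ROOT = PurePosixPath("/data/master_volume")


def vm_to_container(vm_path):
    """Translate a path under the VM-side folder into the path seen inside the master container."""
    # relative_to raises ValueError when vm_path lies outside the shared volume.
    relative = PurePosixPath(vm_path).relative_to(VM_ROOT)
    return str(CONTAINER_ROOT / relative)
```

For example, a file saved on the virtual machine as `/vagrant/master_volume/datasets/covid-dataset.csv` would be visible in the container under `/data/master_volume/datasets/covid-dataset.csv`.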

## Downloading the data
Because the Kaggle service requires an API key to be generated and supplied, a few steps are needed before the data can be downloaded.

1. Download the API key (kaggle.json) from the Kaggle website.
2. Create a .kaggle folder in the user's home directory and copy the API key there.
3. The code below:
    1. Downloads from Kaggle:
       * [YouTube Trending Video Dataset](https://www.kaggle.com/datasets/rsrishav/youtube-trending-video-dataset)
       * [Steam Dataset](https://www.kaggle.com/datasets/souyama/steam-dataset)
    2. Downloads the [COVID data](https://covid.ourworldindata.org/data/owid-covid-data.csv) from the web.
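
Step 2 can also be scripted. The sketch below is only an illustration: `install_kaggle_key` and its parameters are hypothetical names, and restricting the file to its owner is an assumption based on the Kaggle client warning about keys that other users can read.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_key(key_file, home=None):
    """Copy a downloaded kaggle.json into <home>/.kaggle and restrict its permissions."""
    home_dir = Path(home) if home is not None else Path.home()
    target_dir = home_dir / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(key_file, target)
    # Owner read/write only (assumption: a POSIX file system).
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target
```

Calling `install_kaggle_key("kaggle.json")` from the folder that holds the downloaded key would place it under the current user's home directory.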

In [1]:
import os
from timeit import default_timer as timer
import requests
import docker
import json
import opendatasets as od
import csv
import pandas
import paramiko

In [2]:
output_dir = "/data/master_volume/datasets"

In [3]:
os.system("ls /data/master_volume")

datasets
map_reduce_jars


0

In [4]:
if not os.path.isdir(f"{output_dir}/youtube-trending-video-dataset"):
    od.download("https://www.kaggle.com/datasets/rsrishav/youtube-trending-video-dataset", f"{output_dir}")
else:
    print("Dataset youtube-trending-video-dataset already exists, skipping download")

Dataset youtube-trending-video-dataset already exists, skipping download


In [5]:
if not os.path.isdir(f"{output_dir}/steam-dataset"):
    od.download("https://www.kaggle.com/datasets/souyama/steam-dataset", f"{output_dir}")
else:
    print("Dataset steam-dataset already exists, skipping download")

Dataset steam-dataset already exists, skipping download


In [6]:
path = f"{output_dir}/covid-dataset.csv"

if not os.path.isfile(path):
    print(f"Downloading covid-dataset to {path}")
    start = timer()
    r = requests.get("https://covid.ourworldindata.org/data/owid-covid-data.csv", allow_redirects=True)
    r.raise_for_status()  # do not save an HTTP error page as the dataset
    with open(path, 'wb') as file:
        file.write(r.content)
    end = timer()
    print(f"Download finished in {end - start:.02f}s")
else:
    print("Dataset covid-dataset already exists, skipping download")

print(f"covid-dataset.csv: {os.stat(path).st_size / (1024 * 1024):.02f}MB")

Dataset covid-dataset already exists, skipping download
covid-dataset.csv: 77.76MB


## Formatting the data
After unpacking the data, it turns out that some of it is in a format that is hard to work with later. In the end we decided to convert all files to the .jsonl format before uploading them to HDFS. A .jsonl file contains JSON objects, one per line. This format will make it easy to implement map-reduce jobs.
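
To make the format concrete, here is a minimal sketch of writing and reading JSONL in memory. The helper names `to_jsonl` and `from_jsonl` and the sample records are made up for illustration and do not come from the datasets.

```python
import json


def to_jsonl(records):
    """Serialize an iterable of JSON-compatible objects, one object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def from_jsonl(text):
    """Parse JSONL text back into a list of objects, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Made-up sample records, for illustration only.
rows = [{"country": "A", "new_cases": "12"}, {"country": "B", "new_cases": "3"}]
encoded = to_jsonl(rows)
decoded = from_jsonl(encoded)
```

Because every line is a complete record, a mapper can parse its input line by line without seeing the rest of the file.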


In [7]:
%%timeit -r 1 -n 1

print("Converting CSV to JSONL")
for root, directories, files in os.walk(output_dir):
    for filename in files:
        # Only CSV files are converted
        if not filename.endswith(".csv"):
            continue

        path = os.path.join(root, filename)
        output_path = path.replace(".csv", ".jsonl")

        # Skip files converted by an earlier run
        if os.path.isfile(output_path):
            # os.remove(output_path)
            continue

        try:
            print(path)
            with open(path, 'r', encoding='utf-8', errors='replace') as infile, open(output_path, 'w', encoding='utf-8', errors='replace') as outfile:
                # Drop NUL characters, which csv.reader rejects on older Python versions
                reader = csv.reader(x.replace('\0', '') for x in infile)
                headers = next(reader)
                data = list(reader)
                df = pandas.DataFrame(data, columns=headers)
                # One JSON object per line, keyed by the CSV header
                for row in df.to_dict('records'):
                    json.dump(row, outfile)
                    outfile.write('\n')
            print(output_path)
        except Exception as e:
            print(e)

Converting CSV to JSONL
336 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [8]:
%%timeit -r 1 -n 1

print("Converting JSON to JSONL")
for root, directories, files in os.walk(output_dir):
    for filename in files:
        # Only JSON files are converted
        if not filename.endswith(".json"):
            continue

        path = os.path.join(root, filename)
        output_path = f"{path}l"  # foo.json -> foo.jsonl

        # Skip files converted by an earlier run
        if os.path.isfile(output_path):
            continue

        print(path)
        with open(path, "r") as file:
            data = json.load(file)

        # A top-level object becomes one {"key", "value"} record per entry
        if isinstance(data, dict):
            data = [{"key": key, "value": value} for key, value in data.items()]

        with open(output_path, "w") as jsonl_file:
            for obj in data:
                json.dump(obj, jsonl_file)
                jsonl_file.write("\n")
        print(output_path)

print("Done")

Converting JSON to JSONL
Done
430 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


## Adding the files to HDFS

In [13]:
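def run_in_master(command):
    """Run a shell command on the Hadoop master over SSH; returns (stdout lines, stderr lines)."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect("namenode", username="root", password="pass")
    try:
        ssh_stdin, ssh_stdout, ssh_stderr = ssh.exec_command(f"cd /app/ && . /env_var_path.sh && {command}")
        return (ssh_stdout.readlines(), ssh_stderr.readlines())
    finally:
        ssh.close()  # release the connection once the output has been read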
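
`run_in_master` returns the stdout and stderr lines of the remote command, but not its exit status. A minimal sketch of how the status could be collected as well follows, assuming paramiko's `Channel.recv_exit_status()`. `run_and_check` is a hypothetical helper that is not part of the original notebook, and it expects an already connected client to be passed in.

```python
def run_and_check(client, command):
    """Run `command` through a connected SSH client.

    `client` is assumed to behave like a connected paramiko.SSHClient.
    Returns (exit_status, stdout_lines, stderr_lines).
    """
    _stdin, stdout, stderr = client.exec_command(command)
    out_lines = stdout.readlines()
    err_lines = stderr.readlines()
    # recv_exit_status() blocks until the remote command has finished.
    exit_status = stdout.channel.recv_exit_status()
    return exit_status, out_lines, err_lines
```

A non-zero status would then distinguish a failed `hdfs dfs -put` from a successful one even when both print nothing to stdout.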

In [14]:
def hdfs_mkdir(path):
    run_in_master(f"hdfs dfs -mkdir -p /{path}/")

def hdfs_upload(path):
    directory = "/".join(path.split("/")[:-1])
    hdfs_mkdir(directory)
    cmd = f"hdfs dfs -put /data/master_volume/{path} /{directory}"
    print(cmd)
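    # NOTE: run_in_master returns (stdout lines, stderr lines); the value labelled "exit code" is stdout
    stdout_lines, stderr_lines = run_in_master(cmd)
    print(f"exit code {stdout_lines}")
    print(stderr_lines)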

In [15]:
print("Uploading to HDFS")

for root, directories, files in os.walk(f"{output_dir}"):
        for filename in files:
            if filename.endswith(".json"):
                continue # skip JSON files, we have JSONL from previous step
            filepath = os.path.join(root, filename)

            path = filepath\
                .replace("/data/master_volume/", "")\
                .replace("\\", "/")

            start_single = timer()
            hdfs_upload(path)
            end_single = timer()
            print(f"HDFS upload of {os.stat(filepath).st_size / (1024 * 1024):.02f}MB took {end_single - start_single:.02f}s")

Uploading to HDFS
hdfs dfs -put /data/master_volume/datasets/covid-dataset.jsonl /datasets
exit code []
[]
HDFS upload of 601.45MB took 3.25s
hdfs dfs -put /data/master_volume/datasets/covid-dataset.csv /datasets
exit code []
[]
HDFS upload of 77.76MB took 2.38s
hdfs dfs -put /data/master_volume/datasets/steam-dataset/steam_dataset/steamspy/basic/steam_spy_scrap.jsonl /datasets/steam-dataset/steam_dataset/steamspy/basic
exit code []
[]
HDFS upload of 19.11MB took 2.31s
hdfs dfs -put /data/master_volume/datasets/steam-dataset/steam_dataset/appinfo/store_data/steam_store_data.jsonl /datasets/steam-dataset/steam_dataset/appinfo/store_data
exit code []
[]
HDFS upload of 552.30MB took 3.05s
hdfs dfs -put /data/master_volume/datasets/youtube-trending-video-dataset/US_youtube_trending_data.csv /datasets/youtube-trending-video-dataset
exit code []
[]
HDFS upload of 283.76MB took 2.72s
hdfs dfs -put /data/master_volume/datasets/youtube-trending-video-dataset/US_category_id.jsonl /datasets/youtu

In [12]:
run_in_master("hdfs dfs -setrep -R 3 /")

([], ['bash: env_var.sh: No such file or directory\n'])