# Task 1
- Load the file `zamowienia.csv` into a pandas DataFrame, 
then insert NaN values in a few places (but not in the first 10 rows) to simulate missing data. 
- Save the DataFrame to the file `zamowienia_missing.csv`. 
- Now load the file into a Dask DataFrame and check which data types were assigned. 
    - Do they match the types from the original file? 
- Run any computation on the whole Dask DataFrame to force a call to `.compute()`. 
    - Did an error about inconsistent data types appear? 
- Try running the Dask DataFrame loading function several times with different values of the `sample` parameter. 

Documentation for `dask.dataframe.read_csv()`: https://docs.dask.org/en/stable/generated/dask.dataframe.read_csv.html

In [1]:
import random
import numpy as np
import pandas as pd
import os

df = pd.read_csv(os.path.join("..", "L1", "zamowienia.csv"), delimiter=';')
df.dtypes

Kraj                object
Sprzedawca          object
Data zamowienia     object
idZamowienia         int64
Utarg              float64
dtype: object

In [2]:
columns = df.columns.values.tolist()
max_rows = df.shape[0]
for i in range(10):
    # Skip the first 10 rows; randint is inclusive on both ends, so stop at max_rows - 1
    df.loc[random.randint(10, max_rows - 1), random.choice(columns)] = np.nan
df

Unnamed: 0,Kraj,Sprzedawca,Data zamowienia,idZamowienia,Utarg
0,Polska,Kowalski,2003-07-16,10248.0,440.00
1,Polska,Sowiński,2003-07-10,10249.0,1863.40
2,Niemcy,Peacock,2003-07-12,10250.0,1552.60
3,Niemcy,Leverling,2003-07-15,10251.0,654.06
4,Niemcy,Peacock,2003-07-11,10252.0,3597.90
...,...,...,...,...,...
794,Polska,King,2005-04-30,11048.0,525.00
795,Niemcy,Leverling,2005-05-01,11052.0,1332.00
796,Niemcy,Fuller,2005-04-29,11053.0,3055.00
797,Niemcy,Callahan,2005-05-01,11056.0,3740.00


In [3]:
df.to_csv("zamowienia_missing.csv", header=True, index=False)

In [4]:
import dask.dataframe as dd

ddf = dd.read_csv(os.path.join("zamowienia_missing.csv"))
ddf

Unnamed: 0_level_0,Kraj,Sprzedawca,Data zamowienia,idZamowienia,Utarg
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
,string,string,string,float64,float64
,...,...,...,...,...


In [5]:
ddf.dtypes

Kraj               string[pyarrow]
Sprzedawca         string[pyarrow]
Data zamowienia    string[pyarrow]
idZamowienia               float64
Utarg                      float64
dtype: object

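The types differ from the original file: `idZamowienia` is now `float64` instead of `int64`, and the text columns are `string[pyarrow]` instead of `object`.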
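The `int64` to `float64` change is what NumPy-backed integer columns do when a value is missing: `NaN` is a float, so the whole column is read as float. The sketch below reproduces this with plain pandas on made-up CSV text (the column names are reused from the exercise, the values are illustrative) and shows the nullable `Int64` dtype as one possible way to keep integers.

```python
import io
import pandas as pd

# Illustrative CSV text, not the real zamowienia.csv
complete_csv = "idZamowienia,Utarg\n10248,440.0\n10249,1863.4\n10250,1552.6\n"
gappy_csv = "idZamowienia,Utarg\n10248,440.0\n,1863.4\n10250,1552.6\n"

full = pd.read_csv(io.StringIO(complete_csv))
gappy = pd.read_csv(io.StringIO(gappy_csv))

# Without gaps the column is int64; one empty field turns it into float64
print(full["idZamowienia"].dtype, gappy["idZamowienia"].dtype)

# The nullable extension dtype keeps integers and stores the gap as <NA>
nullable = pd.read_csv(io.StringIO(gappy_csv), dtype={"idZamowienia": "Int64"})
print(nullable["idZamowienia"].dtype)
```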

In [6]:
ddf.head()

Unnamed: 0,Kraj,Sprzedawca,Data zamowienia,idZamowienia,Utarg
0,Polska,Kowalski,2003-07-16,10248.0,440.0
1,Polska,Sowiński,2003-07-10,10249.0,1863.4
2,Niemcy,Peacock,2003-07-12,10250.0,1552.6
3,Niemcy,Leverling,2003-07-15,10251.0,654.06
4,Niemcy,Peacock,2003-07-11,10252.0,3597.9


In [7]:
ddf.memory_usage(deep=True).compute()

Index                132
Kraj               43922
Sprzedawca         46402
Data zamowienia    47060
idZamowienia        6392
Utarg               6392
dtype: int64

In [8]:
utarg_sum = ddf.groupby(['Kraj']).Utarg.sum()
utarg_sum.compute()

Kraj
Niemcy    894756.09
Polska    332702.51
Name: Utarg, dtype: float64

No dtype mismatch error was raised.

In [9]:
ddf = dd.read_csv(os.path.join("zamowienia_missing.csv"), sample=256000)
ddf

Unnamed: 0_level_0,Kraj,Sprzedawca,Data zamowienia,idZamowienia,Utarg
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
,string,string,string,float64,float64
,...,...,...,...,...


In [10]:
ddf = dd.read_csv(os.path.join("zamowienia_missing.csv"), sample=100)
ddf.head()

Unnamed: 0,Kraj,Sprzedawca,Data zamowienia,idZamowienia,Utarg
0,Polska,Kowalski,2003-07-16,10248.0,440.0
1,Polska,Sowiński,2003-07-10,10249.0,1863.4
2,Niemcy,Peacock,2003-07-12,10250.0,1552.6
3,Niemcy,Leverling,2003-07-15,10251.0,654.06
4,Niemcy,Peacock,2003-07-11,10252.0,3597.9


In [11]:
ddf = dd.read_csv(os.path.join("zamowienia_missing.csv"), sample=25000)
ddf.head()

Unnamed: 0,Kraj,Sprzedawca,Data zamowienia,idZamowienia,Utarg
0,Polska,Kowalski,2003-07-16,10248.0,440.0
1,Polska,Sowiński,2003-07-10,10249.0,1863.4
2,Niemcy,Peacock,2003-07-12,10250.0,1552.6
3,Niemcy,Leverling,2003-07-15,10251.0,654.06
4,Niemcy,Peacock,2003-07-11,10252.0,3597.9
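`sample` sets how many bytes from the start of the file Dask uses to infer dtypes, which is presumably why the exercise asks for NaN values outside the first rows. For a file this small, all tested values gave the same result. The mechanism can be imitated in plain pandas by inferring dtypes from only the first rows; the CSV text and the names `head_rows`, `whole`, and `explicit` are made up for illustration.

```python
import io
import pandas as pd

# Made-up CSV: the only missing value sits in the last row
csv_text = "id,amount\n1,10.5\n2,20.0\n3,30.25\n,40.0\n"

# Inferring from the first rows only (a stand-in for a small `sample`)
head_rows = pd.read_csv(io.StringIO(csv_text), nrows=2)
# Inferring from the whole file
whole = pd.read_csv(io.StringIO(csv_text))

# The two inferences disagree about the `id` column
print(head_rows.dtypes["id"], whole.dtypes["id"])

# Stating the dtype up front removes the dependence on what the sample contains
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"id": "float64"})
print(explicit.dtypes["id"])
```

`dd.read_csv` accepts a `dtype=` mapping as well and passes it on to pandas, so the same approach should apply to the Dask version of the load.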


# Task 2
Following https://docs.dask.org/en/stable/dashboard.html, set up the Dask dashboard plugin for JupyterLab and test that it works.

# Task 3
- Configure a local cluster (`Client`) so that it does not allocate all resources (e.g. leave 8 GB of RAM and 2 cores for the host system). 
- Download the data shared in the previous class (https://huggingface.co/datasets/vargr/private_instagram/tree/refs%2Fconvert%2Fparquet/default/train)
- Load as many parts as you can into a Dask DataFrame, without any optimization. 
- Measure the time of this operation. 

In [12]:
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=4,
    memory_limit='8GB'
)

client = Client(cluster)
client

Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
Workers: 4
Total threads: 12
Total memory: 29.80 GiB
Status: running
Using processes: True
Comm: tcp://127.0.0.1:57144
Started: Just now

Comm,Total threads,Memory,Dashboard,Nanny,Local directory
tcp://127.0.0.1:57166,3,7.45 GiB,http://127.0.0.1:57169/status,tcp://127.0.0.1:57147,C:\Users\weron\AppData\Local\Temp\dask-scratch-space\worker-s473xnzg
tcp://127.0.0.1:57163,3,7.45 GiB,http://127.0.0.1:57167/status,tcp://127.0.0.1:57149,C:\Users\weron\AppData\Local\Temp\dask-scratch-space\worker-54gxog1g
tcp://127.0.0.1:57165,3,7.45 GiB,http://127.0.0.1:57170/status,tcp://127.0.0.1:57151,C:\Users\weron\AppData\Local\Temp\dask-scratch-space\worker-57hyzkvc
tcp://127.0.0.1:57164,3,7.45 GiB,http://127.0.0.1:57171/status,tcp://127.0.0.1:57153,C:\Users\weron\AppData\Local\Temp\dask-scratch-space\worker-fk2po0r2
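Since `memory_limit` applies to each worker, four workers at `'8GB'` add up to the 29.80 GiB total shown above. One way to leave headroom for the host is to derive the per-worker settings from the machine's totals. This is a sketch only: the helper name `plan_local_cluster` and the 12-core, 32 GiB machine are assumptions, not values taken from the notebook.

```python
def plan_local_cluster(total_cores, total_mem_gib, reserve_cores=2,
                       reserve_mem_gib=8, n_workers=2):
    """Return keyword arguments for LocalCluster that leave headroom for the host."""
    usable_cores = max(1, total_cores - reserve_cores)
    usable_mem = max(1.0, total_mem_gib - reserve_mem_gib)
    n_workers = min(n_workers, usable_cores)
    return {
        "n_workers": n_workers,
        "threads_per_worker": max(1, usable_cores // n_workers),
        # memory_limit is a per-worker limit, so divide the usable memory
        "memory_limit": f"{usable_mem / n_workers:.2f}GiB",
    }

# Hypothetical machine: 12 logical cores, 32 GiB RAM
kwargs = plan_local_cluster(total_cores=12, total_mem_gib=32)
print(kwargs)
```

The resulting dictionary could then be passed as `LocalCluster(**kwargs)`.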


In [13]:
from datetime import datetime

start = datetime.now()
ddf = dd.read_parquet(os.path.join("..", "L1", "data", "*.parquet"))
print(f'end: {datetime.now() - start}')

end: 0:00:00.083008
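`dd.read_parquet` is lazy, so the 0.08 s above most likely covers reading metadata and building the task graph rather than loading the data. A small timing helper makes it easier to wrap the step that actually triggers work, such as a `.compute()` or `.persist()` call. The `timed` name and the generator used as a stand-in workload are assumptions for illustration.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Measure wall-clock time of the enclosed block and store it in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        print(f"{label}: {results[label]:.3f} s")

timings = {}

# Defining a lazy object is cheap: nothing is evaluated yet
with timed("define", timings):
    lazy_squares = (n * n for n in range(2_000_000))

# Consuming it is where the actual work happens
with timed("evaluate", timings):
    total = sum(lazy_squares)
```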


# Task 4
Run a few operations on the local cluster using the data from Task 3:
* display the top 10 users with the highest number of likes,
* retrieve only the data for the first half of 2019.

Each time, measure and print the time of the operation and watch the dashboard.

In [14]:
ddf.head()

Unnamed: 0,sid,sid_profile,post_id,profile_id,date,post_type,description,likes,comments,username,bio,following,followers,num_posts,is_business_account,lang,category
0,28370919,3496776,BXdjjUlgcgq,2237947779,2017-08-06 20:06:57,2,Wreckloose! Deevalley bike park laps on the @i...,80,0,andylund_,"Professional Bicycle technician, Intense Racin...",520,1204,494,False,en,travel_&_adventure
1,13623950,3496776,BeyPed5hKj9,2237947779,2018-02-04 19:35:20,1,The dirty south was prime today. Top day with ...,86,2,andylund_,"Professional Bicycle technician, Intense Racin...",520,1204,494,False,en,diaries_&_daily_life
2,28370905,3496776,Bunhd1DFVAG,2237947779,2019-03-05 08:03:11,1,Tech Tuesday. Been flat out on the tools. Got ...,168,3,andylund_,"Professional Bicycle technician, Intense Racin...",520,1204,494,False,en,science_&_technology
3,28370907,3496776,Bppi85gliQK,2237947779,2018-11-01 20:17:41,1,"On the tools, my favourite wheel builds @stans...",102,2,andylund_,"Professional Bicycle technician, Intense Racin...",520,1204,494,False,en,diaries_&_daily_life
4,32170690,3496776,BuDfIyslzfw,2237947779,2019-02-19 08:10:11,1,Solid effort on the bar turn.\nFully turned.\n...,145,2,andylund_,"Professional Bicycle technician, Intense Racin...",520,1204,494,False,en,diaries_&_daily_life


In [15]:
start = datetime.now()
likes_by_user = ddf.groupby('username')['likes'].sum().reset_index().nlargest(10, 'likes').reset_index()
likes_by_user.compute()

Unnamed: 0,index,username,likes
0,116584,instagram,29864166
1,485370,lizakoshy,19217644
2,1073,433,16457870
3,19189,amandacerny,15019135
4,13380,akshaykumar,13352324
5,167260,maisie_williams,12808999
6,54348,chiaraferragni,11709658
7,57600,claireholt,10109849
8,64706,danbilzerian,9253425
9,268577,urvashirautela,9104884


In [16]:
print(f'end: {datetime.now() - start}')

end: 0:00:02.804178
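The top-N query can be written a little more compactly, because `nlargest` is also available on a Series, which makes the two `reset_index` calls unnecessary. The sketch uses plain pandas on made-up data so that it runs without the dataset; the same chain should carry over to a Dask frame followed by `.compute()`, though that is not tested here.

```python
import pandas as pd

# Made-up posts; only the column names match the dataset
posts = pd.DataFrame({
    "username": ["ann", "bob", "ann", "cy", "bob", "dee"],
    "likes": [10, 5, 7, 30, 1, 2],
})

# Sum likes per user, then keep the three largest totals
top = posts.groupby("username")["likes"].sum().nlargest(3)
print(top)
```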


In [17]:
start = datetime.now()
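# Keep only January-June 2019; ISO-formatted date strings compare chronologically
half_2019 = ddf[(ddf['date'] >= '2019-01-01') & (ddf['date'] < '2019-07-01')]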
half_2019.compute()



Unnamed: 0,sid,sid_profile,post_id,profile_id,date,post_type,description,likes,comments,username,bio,following,followers,num_posts,is_business_account,lang,category
2,28370905,3496776,Bunhd1DFVAG,2237947779,2019-03-05 08:03:11,1,Tech Tuesday. Been flat out on the tools. Got ...,168,3,andylund_,"Professional Bicycle technician, Intense Racin...",520,1204,494,False,en,science_&_technology
4,32170690,3496776,BuDfIyslzfw,2237947779,2019-02-19 08:10:11,1,Solid effort on the bar turn.\nFully turned.\n...,145,2,andylund_,"Professional Bicycle technician, Intense Racin...",520,1204,494,False,en,diaries_&_daily_life
5,14315358,3496776,BxJsMDpA2yH,2237947779,2019-05-07 08:33:51,1,Annual springtime flora picture.\nTurn bars in...,124,2,andylund_,"Professional Bicycle technician, Intense Racin...",520,1204,494,False,en,arts_&_culture
6,8304346,3496776,Bt5LFpZlm3z,2237947779,2019-02-15 08:02:35,1,Laps in spring like conditions. Getting these ...,150,3,andylund_,"Professional Bicycle technician, Intense Racin...",520,1204,494,False,en,sports
7,14315346,3496776,BxZIzaQhS-o,2237947779,2019-05-13 08:32:30,1,Cheers Scotland 🏴󠁧󠁢󠁳󠁣󠁴󠁿 See you in a few weeks...,166,2,andylund_,"Professional Bicycle technician, Intense Racin...",520,1204,494,False,en,sports
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1147352,43194859,4492285,BxEr8JbHVIZ,3198207653,2019-05-05 09:55:28,1,A restaurant with no sign and a one-word menu 🍜,112,2,zarazhangg,Harvard grad from China working in tech 🇨🇳🇸🇬🇯🇵...,1801,1428,249,False,en,food_&_dining
1147353,43194861,4492285,Bw6NjgYH9VF,3198207653,2019-05-01 08:17:33,1,Beijing has nice days too,151,3,zarazhangg,Harvard grad from China working in tech 🇨🇳🇸🇬🇯🇵...,1801,1428,249,False,en,diaries_&_daily_life
1147354,43194863,4492285,Bwossrynm05,3198207653,2019-04-24 13:03:22,1,Breaded,149,2,zarazhangg,Harvard grad from China working in tech 🇨🇳🇸🇬🇯🇵...,1801,1428,249,False,en,food_&_dining
1147355,43194867,4492285,Bvx0jgoHk8h,3198207653,2019-04-03 05:33:47,1,A taste of Tokyo in Beijing 🍱 #waimai,74,1,zarazhangg,Harvard grad from China working in tech 🇨🇳🇸🇬🇯🇵...,1801,1428,249,False,en,travel_&_adventure


In [18]:
print(f'end: {datetime.now() - start}')

end: 0:03:56.543575
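Selecting the first half of 2019 means bounding the month as well as the year. The pandas sketch below, on made-up timestamps, compares a string range filter with a datetime-based one and shows that a year-only test also keeps July.

```python
import pandas as pd

# Made-up timestamps around the boundaries of the first half of 2019
frame = pd.DataFrame({"date": [
    "2018-12-31 23:59:59", "2019-01-01 00:00:00",
    "2019-06-30 23:59:59", "2019-07-01 00:00:00",
]})

# ISO-formatted strings sort chronologically, so a half-open range works
by_string = frame[(frame["date"] >= "2019-01-01") & (frame["date"] < "2019-07-01")]

# The same selection after parsing to datetime64
parsed = pd.to_datetime(frame["date"])
by_datetime = frame[(parsed.dt.year == 2019) & (parsed.dt.month <= 6)]

# A year-only test also keeps the July row
whole_year = frame[frame["date"].str.startswith("2019")]

print(len(by_string), len(by_datetime), len(whole_year))
```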


# Task 5
Load the same data into a Dask DataFrame as in Task 3, but specify the data types chosen during the optimization carried out in the Lab 01 exercises. Compare the loading time with Task 3. Also run the same operations as in Task 4 and compare the times. Follow the execution of the tasks by watching the task graph.

In [19]:
from datetime import datetime

dtypes = {
    'sid': 'int32',
    'sid_profile': 'int32',
    'post_id': 'object',
    'profile_id': 'int64',
    'post_type': 'category',
    'description': 'object',
    'likes': 'int32',
    'comments': 'int32',
    'username': 'object',
    'bio': 'object',
    'following': 'int32',
    'followers': 'int32',
    'num_posts': 'int32',
    'is_business_account': 'bool',
    'lang': 'category',
    'category':'category'
}

start = datetime.now()
ddf2 = dd.read_parquet(os.path.join("..", "L1", "data", "*.parquet"))
for key, value in dtypes.items():
    ddf2[key] = ddf2[key].astype(value)
ddf2['date'] = dd.to_datetime(ddf2['date'])
print(f'end: {datetime.now() - start}')

end: 0:00:00.248838


In [20]:
ddf2.dtypes

sid                             int32
sid_profile                     int32
post_id                        object
profile_id                      int64
date                   datetime64[ns]
post_type                    category
description                    object
likes                           int32
comments                        int32
username                       object
bio                            object
following                       int32
followers                       int32
num_posts                       int32
is_business_account              bool
lang                         category
category                     category
dtype: object
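The point of these narrower dtypes is a smaller memory footprint. A rough way to see the effect is to compare `memory_usage(deep=True)` before and after casting. The sketch uses plain pandas and random made-up data, with column names borrowed from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Made-up data with the default, wider dtypes
raw = pd.DataFrame({
    "likes": rng.integers(0, 1_000_000, size=n, dtype=np.int64),
    "lang": rng.choice(["en", "pl", "de"], size=n),
})

# Narrow the integer column and encode the repeated strings as categories
compact = raw.astype({"likes": "int32", "lang": "category"})

before = raw.memory_usage(deep=True).sum()
after = compact.memory_usage(deep=True).sum()
print(f"before: {before / 1e6:.1f} MB, after: {after / 1e6:.1f} MB")
```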

In [21]:
start = datetime.now()
likes_by_user = ddf2.groupby('username')['likes'].sum().reset_index().nlargest(10, 'likes').reset_index()
likes_by_user.compute()

Unnamed: 0,index,username,likes
0,116584,instagram,29864166
1,485370,lizakoshy,19217644
2,1073,433,16457870
3,19189,amandacerny,15019135
4,13380,akshaykumar,13352324
5,167260,maisie_williams,12808999
6,54348,chiaraferragni,11709658
7,57600,claireholt,10109849
8,64706,danbilzerian,9253425
9,268577,urvashirautela,9104884


In [22]:
print(f'end: {datetime.now() - start}')

end: 0:00:04.377370


In [23]:
start = datetime.now()
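# Keep only January-June 2019 (testing the year alone would select the whole year)
half_2019 = ddf2[(ddf2['date'].dt.year == 2019) & (ddf2['date'].dt.month <= 6)]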
half_2019.compute()

KilledWorker: Attempted to run task ('repartitiontofewer-d422ce38a6909393af9662d7e446404f', 0) on 4 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:57496. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.

In [None]:
print(f'end: {datetime.now() - start}')

# Task 6
Split the `darr` array from the examples into different numbers of chunks (experiment) and run the same computation (the mean). For each number of chunks, print the computation time (run the same computation at least 10 times and average, to make the results somewhat more reliable) and compare the results. Write a conclusion about the results of your experiments and about automatic chunking. Did you manage to achieve better performance than with the default settings?

In [29]:
import dask.array as da
chunks = [
    (10_000,200_000),
    (15_000,300_000),
    (100_000,200_000),
    (5_000,100_000),
    (20_000,20_000),
    (5_000, 5_000)
]

for chunk in chunks:
    times = []
    darr = da.random.normal(5, 0.2, size=(20_000, 20_000), chunks=chunk)
    for i in range(10):
        start = datetime.now()
        darr.mean(axis=0).compute()
        times.append(datetime.now() - start)
    print(f'mean time for chunk {chunk}: {np.mean(times)}')

mean time for chunk (10000, 200000): 0:00:04.611820
mean time for chunk (15000, 300000): 0:00:03.946801
mean time for chunk (100000, 200000): 0:00:06.649932
mean time for chunk (5000, 100000): 0:00:02.828701
mean time for chunk (20000, 20000): 0:00:07.666836
mean time for chunk (5000, 5000): 0:00:01.571550


In [30]:
default_chunks = da.random.normal(5, 0.2, size=(20_000, 20_000))
times = []

In [31]:
for i in range(10):
    start = datetime.now()
    default_chunks.mean(axis=0).compute()
    times.append(datetime.now() - start)
print(f'mean time for default chunk: {np.mean(times)}')

mean time for default chunk: 0:00:01.413450


In general, the larger the chunks, the longer the computation took. 
The smallest chunks, (5000, 5000), gave the best time among the tested settings (about 1.57 s), but this was still slower than the default chunking (about 1.41 s).
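One way to read these timings is to work out what each `chunks` setting means for a 20,000 × 20,000 `float64` array. Dask clips chunk dimensions that exceed the array shape, so several of the tested settings collapse into one or two very large chunks. The helper below (hypothetical name `chunk_layout`) only does the arithmetic and does not call Dask.

```python
import math

def chunk_layout(shape, chunks, itemsize=8):
    """Return (number of chunks, size of the largest chunk in MiB)."""
    # A chunk cannot be larger than the array along any axis
    clipped = [min(c, s) for c, s in zip(chunks, shape)]
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, clipped))
    largest_mib = math.prod(clipped) * itemsize / 2**20
    return n_chunks, largest_mib

shape = (20_000, 20_000)
layouts = {}
for chunks in [(10_000, 200_000), (15_000, 300_000), (100_000, 200_000),
               (5_000, 100_000), (20_000, 20_000), (5_000, 5_000)]:
    layouts[chunks] = chunk_layout(shape, chunks)
    n, mib = layouts[chunks]
    print(f"{chunks}: {n} chunk(s), largest about {mib:.0f} MiB")
```

Under this reading, `(100_000, 200_000)` and `(20_000, 20_000)` describe the same single-chunk layout, which leaves no room for parallelism, while `(5_000, 5_000)` yields 16 chunks of roughly 191 MiB each.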