# Data Formats (2)

Materials:
* S. V. Makrushin, "Lecture 5: Data Formats (Part 2)"
* https://docs.python.org/3/library/csv.html
* https://docs.h5py.org/en/stable/
* Wes McKinney, "Python for Data Analysis"

## Problems to work through together

1. Read the data from the file `open_pubs.csv` using `csv.reader` and convert it to a data structure of the following form:

`{'fas_id': [24, 30, ...], 'name': ['Anchor Inn', 'Angel Inn', ...], ... }`

In [1]:
import csv

with open('./data/open_pubs.csv', newline='') as fp:
    reader = csv.reader(fp)
    header = next(reader)
    print(header)
    for row in reader:
        print(row)
        break

['fas_id', 'name', 'address', 'postcode', 'easting', 'northing', 'latitude', 'longitude', 'local_authority']
['24', 'Anchor Inn', 'Upper Street, Stratford St Mary, COLCHESTER, Essex', 'CO7 6LW', '604748', '234405', '51.97039', '0.979328', 'Babergh']
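
The cell above only prints the header and the first row. A minimal sketch of how the column-oriented dictionary from the task statement could be built is shown below. The helper name `read_columns` and the two-row in-memory sample are assumptions made for illustration; the real file has more columns and rows.

```python
import csv
import io


def read_columns(fp):
    """Read CSV rows from an open text file into a dict of column lists."""
    reader = csv.reader(fp)
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        for name, value in zip(header, row):
            columns[name].append(value)
    return columns


# Hypothetical two-row sample laid out like the first columns of open_pubs.csv.
sample = io.StringIO("fas_id,name\n24,Anchor Inn\n30,Angel Inn\n")

columns = read_columns(sample)
# csv.reader yields strings only, so numeric columns are converted explicitly.
columns["fas_id"] = [int(value) for value in columns["fas_id"]]
print(columns)
```

With the real file, the same helper would be called inside `with open('./data/open_pubs.csv', newline='') as fp:`.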


2. Generate 2 random matrices of size 10_000 x 10_000 and compute their product. How long does each of these three operations take? Save the 3 resulting matrices to an .npz file under matching names.

In [2]:
import numpy as np

A = np.random.randint(0, 100, size=(10_000, 10_000))
B = np.random.randint(0, 100, size=(10_000, 10_000))

In [3]:
np.save("./out/A.npy", A)
np.savez("./out/AB.npz", arr1=A, arr2=B)

In [4]:
r = np.load("./out/AB.npz")
r

<numpy.lib.npyio.NpzFile at 0x10515d4c0>

In [5]:
r.files

['arr1', 'arr2']
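
The cells above save only `A` and `B` and do not measure time. One possible way to time all three operations and store the product as well is sketched below. The helper `timed`, the reduced size `n = 200`, and the temporary output path are assumptions chosen so the example runs quickly; the task itself asks for 10_000 x 10_000, where the integer product can take considerably longer.

```python
import os
import tempfile
import time

import numpy as np


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


n = 200  # assumption: small size for a quick demo, the task uses 10_000
rng = np.random.default_rng(0)

A, t_a = timed(rng.integers, 0, 100, size=(n, n))
B, t_b = timed(rng.integers, 0, 100, size=(n, n))
C, t_c = timed(np.matmul, A, B)
print(f"A: {t_a:.4f} s, B: {t_b:.4f} s, A @ B: {t_c:.4f} s")

# Keyword names become the array names inside the archive.
path = os.path.join(tempfile.mkdtemp(), "matrices.npz")
np.savez(path, A=A, B=B, C=C)

with np.load(path) as archive:
    names = sorted(archive.files)
    restored_C = archive["C"]
```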

3. Create 2 matrices of size 1000x1000 using different parameterized distributions from numpy (https://docs.scipy.org/doc/numpy-1.15.0/reference/routines.random.html#distributions)

Then save the resulting matrices to an HDF5 file as two separate datasets. As the description of each dataset, specify the parameters of the distribution used.

In [6]:
import h5py

In [7]:
with h5py.File("./out/test.h5", 'w') as hdf:
    ds1 = hdf.create_dataset("arrA", data=A)
    ds2 = hdf.create_dataset("arrB", data=B)
    
    ds1.attrs["Description"] = "Array A is stored here"
    ds2.attrs["Description"] = "Array B is stored here"

In [8]:
with h5py.File("./out/test.h5","r") as hdf:
    ds1 = hdf['arrA']
    print(type(ds1))
    # The data is actually read from the file only at this point
    arr = ds1[:1000]

<class 'h5py._hl.dataset.Dataset'>
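
The cells above reuse the integer matrices `A` and `B` rather than drawing from parameterized distributions. A minimal sketch closer to the task statement follows. The choice of a normal and a gamma distribution, their parameter values, the dataset names, and the temporary file path are all assumptions for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

rng = np.random.default_rng(42)

# Assumed distributions and parameters; any parameterized ones would do.
normal_params = {"loc": 0.0, "scale": 2.0}
gamma_params = {"shape": 2.0, "scale": 1.5}

normal_matrix = rng.normal(size=(1000, 1000), **normal_params)
gamma_matrix = rng.gamma(size=(1000, 1000), **gamma_params)

path = os.path.join(tempfile.mkdtemp(), "distributions.h5")
with h5py.File(path, "w") as hdf:
    for name, matrix, params in (
        ("normal", normal_matrix, normal_params),
        ("gamma", gamma_matrix, gamma_params),
    ):
        ds = hdf.create_dataset(name, data=matrix)
        # Store the distribution name and each parameter as attributes.
        ds.attrs["distribution"] = name
        for key, value in params.items():
            ds.attrs[key] = value

with h5py.File(path, "r") as hdf:
    stored_shapes = {name: ds.shape for name, ds in hdf.items()}
    stored_attrs = {name: dict(ds.attrs) for name, ds in hdf.items()}
print(stored_attrs)
```

Storing each parameter as its own attribute, instead of one free-text description, keeps the values machine-readable when the file is opened later.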


## Lab 5

### csv

1.1 The file `tags_sample.csv` contains information about the tags assigned to recipes. Using `csv.reader`, read this file and build a dictionary of the form `recipe_id: [list of tags]`. Save this dictionary to the file `tags_sample.json`.

In [9]:
import csv
import json

In [10]:
data_dict = {}

with open('./data/tags_sample.csv', newline='') as file:
    reader = csv.reader(file)
    header = next(reader)
    print(header)
    for row in reader:
        
        item_id = int(row[0])
        item_tag = row[1]
        
        if item_id not in data_dict:
            data_dict[item_id] = []
        data_dict[item_id].append(item_tag)
        
data_dict

['id', 'tag']


{44123: ['weeknight',
  'time-to-make',
  'course',
  'main-ingredient',
  'cuisine',
  'preparation',
  'occasion',
  'north-american',
  'soups-stews',
  'beans',
  'poultry',
  'american',
  'chicken',
  'stove-top',
  'dietary',
  'gluten-free',
  'comfort-food',
  'californian',
  'black-beans',
  'free-of-something',
  'meat',
  'taste-mood',
  'equipment',
  'grilling',
  '4-hours-or-less'],
 67664: ['15-minutes-or-less',
  'time-to-make',
  'course',
  'preparation',
  'occasion',
  'low-protein',
  'healthy',
  '5-ingredients-or-less',
  'desserts',
  'lunch',
  'snacks',
  'easy',
  'kid-friendly',
  'low-fat',
  'summer',
  'frozen-desserts',
  'freezer',
  'dietary',
  'low-sodium',
  'low-cholesterol',
  'seasonal',
  'low-saturated-fat',
  'inexpensive',
  'healthy-2',
  'toddler-friendly',
  'low-in-something',
  'equipment',
  'small-appliance',
  'mixer',
  'number-of-servings',
  '3-steps-or-less'],
 38798: ['30-minutes-or-less',
  'time-to-make',
  'course',
  'main-

In [11]:
with open("./out/tags_sample.json", "w") as file:
    json.dump(data_dict, file)

In [12]:
with open("./out/tags_sample.json", "r") as file:
    data_dict_new = json.load(file)
data_dict_new

{'44123': ['weeknight',
  'time-to-make',
  'course',
  'main-ingredient',
  'cuisine',
  'preparation',
  'occasion',
  'north-american',
  'soups-stews',
  'beans',
  'poultry',
  'american',
  'chicken',
  'stove-top',
  'dietary',
  'gluten-free',
  'comfort-food',
  'californian',
  'black-beans',
  'free-of-something',
  'meat',
  'taste-mood',
  'equipment',
  'grilling',
  '4-hours-or-less'],
 '67664': ['15-minutes-or-less',
  'time-to-make',
  'course',
  'preparation',
  'occasion',
  'low-protein',
  'healthy',
  '5-ingredients-or-less',
  'desserts',
  'lunch',
  'snacks',
  'easy',
  'kid-friendly',
  'low-fat',
  'summer',
  'frozen-desserts',
  'freezer',
  'dietary',
  'low-sodium',
  'low-cholesterol',
  'seasonal',
  'low-saturated-fat',
  'inexpensive',
  'healthy-2',
  'toddler-friendly',
  'low-in-something',
  'equipment',
  'small-appliance',
  'mixer',
  'number-of-servings',
  '3-steps-or-less'],
 '38798': ['30-minutes-or-less',
  'time-to-make',
  'course',
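
Note that the keys in the reloaded dictionary are strings (`'44123'`), while the original dictionary used integers. JSON object keys are always strings, so `json.dump` converts integer keys on the way out. A small sketch of restoring them after loading is given below; the two-recipe sample dictionary is a shortened stand-in for the real data.

```python
import json

# Shortened stand-in for the dictionary built from tags_sample.csv.
tags_by_recipe = {44123: ["weeknight", "course"], 67664: ["easy"]}

encoded = json.dumps(tags_by_recipe)
decoded = json.loads(encoded)

# Keys come back as strings; convert them to int explicitly.
restored = {int(recipe_id): tags for recipe_id, tags in decoded.items()}

print(list(decoded))   # string keys
print(list(restored))  # integer keys again
```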
  

1.2 Read the file `recipes_sample_with_filled_nsteps.csv` (__Lab 4__) as a `pd.DataFrame`. Add 2 columns to the table: `n_tags`, containing the number of tags for the recipe, and `tags`, containing the set of tags as a single string (tags within the string are separated by the `;` character).

In [13]:
import pandas as pd

In [14]:
recipes_filled = pd.read_csv("./data/recipes_sample_with_filled_nsteps.csv")
recipes_filled

Unnamed: 0,id,name,minutes,contributor_id,submitted,n_steps,description,n_ingredients
0,44123,george s at the cove black bean soup,90,35193,2002-10-25,11,an original recipe created by chef scott meska...,18.0
1,67664,healthy for them yogurt popsicles,10,91970,2003-07-26,3,my children and their friends ask for my homem...,
2,38798,i can t believe it s spinach,30,1533,2002-08-29,5,"these were so go, it surprised even me.",8.0
3,35173,italian gut busters,45,22724,2002-07-27,7,my sister-in-law made these for us at a family...,
4,84797,love is in the air beef fondue sauces,25,4470,2004-02-23,4,i think a fondue is a very romantic casual din...,
...,...,...,...,...,...,...,...,...
29995,267661,zurie s holey rustic olive and cheddar bread,80,200862,2007-11-25,16,this is based on a french recipe but i changed...,10.0
29996,386977,zwetschgenkuchen bavarian plum cake,240,177443,2009-08-24,22,"this is a traditional fresh plum cake, thought...",11.0
29997,103312,zwiebelkuchen southwest german onion cake,75,161745,2004-11-03,10,this is a traditional late summer early fall s...,
29998,486161,zydeco soup,60,227978,2012-08-29,7,this is a delicious soup that i originally fou...,


Add the `n_tags` column, containing the number of tags for the recipe:

In [15]:
patched_dict = {k : len(v) for (k,v) in data_dict.items()}

In [16]:
def filler(x):
    if x["id"] in patched_dict:
        return patched_dict[x["id"]] 
    return float("nan")
    
recipes_filled['n_tags'] = recipes_filled.apply(lambda x: filler(x), axis=1)

# TODO: this can also be done like this
recipes_filled['id'].map(patched_dict.get)

0        25
1        31
2        17
3        11
4        19
         ..
29995    18
29996    19
29997    20
29998    20
29999    12
Name: id, Length: 30000, dtype: int64
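
The `Series.map` variant avoids a row-wise `apply` and can produce both columns directly. A minimal sketch on a made-up three-row table follows; the ids and tags are invented for illustration. When a dictionary is passed to `map`, ids missing from it become `NaN`.

```python
import pandas as pd

# Made-up miniature of the recipes table and the tag dictionary.
recipes = pd.DataFrame({"id": [1, 2, 3]})
tags_by_recipe = {1: ["easy", "lunch"], 2: ["summer"]}

n_tags_by_id = {recipe_id: len(tags) for recipe_id, tags in tags_by_recipe.items()}
tags_str_by_id = {recipe_id: ";".join(tags) for recipe_id, tags in tags_by_recipe.items()}

# Recipe 3 has no entry, so both new columns hold NaN for it.
recipes["n_tags"] = recipes["id"].map(n_tags_by_id)
recipes["tags"] = recipes["id"].map(tags_str_by_id)
print(recipes)
```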

In [17]:
recipes_filled

Unnamed: 0,id,name,minutes,contributor_id,submitted,n_steps,description,n_ingredients,n_tags
0,44123,george s at the cove black bean soup,90,35193,2002-10-25,11,an original recipe created by chef scott meska...,18.0,25
1,67664,healthy for them yogurt popsicles,10,91970,2003-07-26,3,my children and their friends ask for my homem...,,31
2,38798,i can t believe it s spinach,30,1533,2002-08-29,5,"these were so go, it surprised even me.",8.0,17
3,35173,italian gut busters,45,22724,2002-07-27,7,my sister-in-law made these for us at a family...,,11
4,84797,love is in the air beef fondue sauces,25,4470,2004-02-23,4,i think a fondue is a very romantic casual din...,,19
...,...,...,...,...,...,...,...,...,...
29995,267661,zurie s holey rustic olive and cheddar bread,80,200862,2007-11-25,16,this is based on a french recipe but i changed...,10.0,18
29996,386977,zwetschgenkuchen bavarian plum cake,240,177443,2009-08-24,22,"this is a traditional fresh plum cake, thought...",11.0,19
29997,103312,zwiebelkuchen southwest german onion cake,75,161745,2004-11-03,10,this is a traditional late summer early fall s...,,20
29998,486161,zydeco soup,60,227978,2012-08-29,7,this is a delicious soup that i originally fou...,,20


In [18]:
patched_dict = {k : ";".join(v) for (k,v) in data_dict.items()}
patched_dict

{44123: 'weeknight;time-to-make;course;main-ingredient;cuisine;preparation;occasion;north-american;soups-stews;beans;poultry;american;chicken;stove-top;dietary;gluten-free;comfort-food;californian;black-beans;free-of-something;meat;taste-mood;equipment;grilling;4-hours-or-less',
 67664: '15-minutes-or-less;time-to-make;course;preparation;occasion;low-protein;healthy;5-ingredients-or-less;desserts;lunch;snacks;easy;kid-friendly;low-fat;summer;frozen-desserts;freezer;dietary;low-sodium;low-cholesterol;seasonal;low-saturated-fat;inexpensive;healthy-2;toddler-friendly;low-in-something;equipment;small-appliance;mixer;number-of-servings;3-steps-or-less',
 38798: '30-minutes-or-less;time-to-make;course;main-ingredient;preparation;appetizers;side-dishes;vegetables;oven;refrigerator;freezer;dietary;oamc-freezer-make-ahead;low-carb;low-in-something;equipment;number-of-servings',
 35173: '60-minutes-or-less;time-to-make;course;preparation;lunch;main-dish;oven;easy;dietary;sandwiches;equipment',
 

In [19]:
def filler(x):
    if x["id"] in patched_dict:
        return patched_dict[x["id"]]
    return float("nan")

recipes_filled['tags'] = recipes_filled.apply(lambda x: filler(x), axis=1)

In [20]:
recipes_filled

Unnamed: 0,id,name,minutes,contributor_id,submitted,n_steps,description,n_ingredients,n_tags,tags
0,44123,george s at the cove black bean soup,90,35193,2002-10-25,11,an original recipe created by chef scott meska...,18.0,25,weeknight;time-to-make;course;main-ingredient;...
1,67664,healthy for them yogurt popsicles,10,91970,2003-07-26,3,my children and their friends ask for my homem...,,31,15-minutes-or-less;time-to-make;course;prepara...
2,38798,i can t believe it s spinach,30,1533,2002-08-29,5,"these were so go, it surprised even me.",8.0,17,30-minutes-or-less;time-to-make;course;main-in...
3,35173,italian gut busters,45,22724,2002-07-27,7,my sister-in-law made these for us at a family...,,11,60-minutes-or-less;time-to-make;course;prepara...
4,84797,love is in the air beef fondue sauces,25,4470,2004-02-23,4,i think a fondue is a very romantic casual din...,,19,30-minutes-or-less;time-to-make;course;main-in...
...,...,...,...,...,...,...,...,...,...,...
29995,267661,zurie s holey rustic olive and cheddar bread,80,200862,2007-11-25,16,this is based on a french recipe but i changed...,10.0,18,time-to-make;course;main-ingredient;cuisine;pr...
29996,386977,zwetschgenkuchen bavarian plum cake,240,177443,2009-08-24,22,"this is a traditional fresh plum cake, thought...",11.0,19,time-to-make;course;main-ingredient;cuisine;pr...
29997,103312,zwiebelkuchen southwest german onion cake,75,161745,2004-11-03,10,this is a traditional late summer early fall s...,,20,time-to-make;course;main-ingredient;cuisine;pr...
29998,486161,zydeco soup,60,227978,2012-08-29,7,this is a delicious soup that i originally fou...,,20,ham;60-minutes-or-less;time-to-make;course;mai...


In [21]:
recipes_filled[recipes_filled['n_tags'].isna()]

Unnamed: 0,id,name,minutes,contributor_id,submitted,n_steps,description,n_ingredients,n_tags,tags


1.3 The file `ingredients_sample.csv` contains information about the ingredients required for each recipe. Using `csv.DictReader`, read this file and build a dictionary of the form `recipe_id: [list of ingredients]`.

In [22]:
data_dict = {}

with open('./data/ingredients_sample.csv', newline='') as file:
    reader = csv.DictReader(file)
    for item in reader:
        ingredient, recipe_id = item["ingredient"], int(item["recipe_id"])

        if recipe_id not in data_dict:
            data_dict[recipe_id] = []
        data_dict[recipe_id].append(ingredient)

        
data_dict

{44123: ['unsalted butter',
  'carrot',
  'onion',
  'celery',
  'broccoli stem',
  'dried thyme',
  'dried oregano',
  'dried sweet basil leaves',
  'dry white wine',
  'chicken stock',
  'worcestershire sauce',
  'tabasco sauce',
  'smoked chicken',
  'black beans',
  'broccoli floret',
  'heavy cream',
  'salt & fresh ground pepper',
  'cornstarch'],
 250900: ['unsalted butter',
  'all-purpose flour',
  'walnuts',
  'light brown sugar',
  'refrigerated pie crust',
  'granny smith apples'],
 120462: ['unsalted butter',
  'onion',
  'milk',
  'salt',
  'egg',
  'cream cheese',
  'extra-sharp cheddar cheese',
  'fresh ground black pepper',
  'garlic clove',
  'penne pasta',
  'gruyere cheese',
  'hot red pepper flakes',
  'sweet hungarian paprika',
  'saltines'],
 257111: ['unsalted butter',
  'milk',
  'eggs',
  'honey',
  'white bread',
  'vanilla',
  'ground cinnamon',
  'hot water'],
 148114: ['unsalted butter',
  'nuts',
  'granulated sugar',
  'semi-sweet chocolate chips'],
 1564
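
The "create the list if the key is missing" pattern used in the cell above can also be expressed with `collections.defaultdict`. The sketch below runs on a three-row in-memory sample; the column order in the sample is an assumption, and only the column names `recipe_id` and `ingredient` are taken from the code above.

```python
import csv
import io
from collections import defaultdict

# In-memory sample; the column order is assumed.
sample = io.StringIO(
    "recipe_id,ingredient\n"
    "44123,carrot\n"
    "44123,onion\n"
    "250900,walnuts\n"
)

ingredients_by_recipe = defaultdict(list)
for item in csv.DictReader(sample):
    # A missing key is initialized with an empty list automatically.
    ingredients_by_recipe[int(item["recipe_id"])].append(item["ingredient"])

# Convert back to a plain dict so later lookups do not create keys by accident.
ingredients_by_recipe = dict(ingredients_by_recipe)
print(ingredients_by_recipe)
```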

1.4 Add to the table from task 1.2 an `ingredients` column containing the set of ingredients as a single string (ingredients within the string are separated by the `*` character).

For rows with missing values in the `n_ingredients` column, fill them in based on the file `ingredients_sample.csv`.

In [23]:
patched_dict = {k : "*".join(v) for k, v in data_dict.items()}

In [24]:
def filler(x):
    if x["id"] in patched_dict:
        return patched_dict[x["id"]]
    return float("nan")
    
recipes_filled['ingredients'] = recipes_filled.apply(lambda x: filler(x), axis=1)

In [25]:
recipes_filled

Unnamed: 0,id,name,minutes,contributor_id,submitted,n_steps,description,n_ingredients,n_tags,tags,ingredients
0,44123,george s at the cove black bean soup,90,35193,2002-10-25,11,an original recipe created by chef scott meska...,18.0,25,weeknight;time-to-make;course;main-ingredient;...,unsalted butter*carrot*onion*celery*broccoli s...
1,67664,healthy for them yogurt popsicles,10,91970,2003-07-26,3,my children and their friends ask for my homem...,,31,15-minutes-or-less;time-to-make;course;prepara...,milk*frozen juice concentrate*plain yogurt
2,38798,i can t believe it s spinach,30,1533,2002-08-29,5,"these were so go, it surprised even me.",8.0,17,30-minutes-or-less;time-to-make;course;main-in...,onion*frozen chopped spinach*eggs*garlic powde...
3,35173,italian gut busters,45,22724,2002-07-27,7,my sister-in-law made these for us at a family...,,11,60-minutes-or-less;time-to-make;course;prepara...,sandwich bun*good seasonings italian salad dre...
4,84797,love is in the air beef fondue sauces,25,4470,2004-02-23,4,i think a fondue is a very romantic casual din...,,19,30-minutes-or-less;time-to-make;course;main-in...,beef steaks*vegetable oil*spicy mustard*fresh ...
...,...,...,...,...,...,...,...,...,...,...,...
29995,267661,zurie s holey rustic olive and cheddar bread,80,200862,2007-11-25,16,this is based on a french recipe but i changed...,10.0,18,time-to-make;course;main-ingredient;cuisine;pr...,dry white wine*eggs*cheddar cheese*baking powd...
29996,386977,zwetschgenkuchen bavarian plum cake,240,177443,2009-08-24,22,"this is a traditional fresh plum cake, thought...",11.0,19,time-to-make;course;main-ingredient;cuisine;pr...,unsalted butter*milk*flour*salt*vanilla*all-pu...
29997,103312,zwiebelkuchen southwest german onion cake,75,161745,2004-11-03,10,this is a traditional late summer early fall s...,,20,time-to-make;course;main-ingredient;cuisine;pr...,onion*milk*eggs*butter*flour*salt*pepper*sugar...
29998,486161,zydeco soup,60,227978,2012-08-29,7,this is a delicious soup that i originally fou...,,20,ham;60-minutes-or-less;time-to-make;course;mai...,onion*celery*dried thyme*dried oregano*fresh p...


In [26]:
patched_dict = {k : len(v) for k, v in data_dict.items()}

In [27]:
def filler(x):
    if pd.isna(x["n_ingredients"]) and x["id"] in patched_dict:
        return patched_dict[x["id"]]
    return x["n_ingredients"]

recipes_filled['n_ingredients'] = recipes_filled.apply(lambda x: filler(x), axis=1)

In [28]:
recipes_filled

Unnamed: 0,id,name,minutes,contributor_id,submitted,n_steps,description,n_ingredients,n_tags,tags,ingredients
0,44123,george s at the cove black bean soup,90,35193,2002-10-25,11,an original recipe created by chef scott meska...,18.0,25,weeknight;time-to-make;course;main-ingredient;...,unsalted butter*carrot*onion*celery*broccoli s...
1,67664,healthy for them yogurt popsicles,10,91970,2003-07-26,3,my children and their friends ask for my homem...,3.0,31,15-minutes-or-less;time-to-make;course;prepara...,milk*frozen juice concentrate*plain yogurt
2,38798,i can t believe it s spinach,30,1533,2002-08-29,5,"these were so go, it surprised even me.",8.0,17,30-minutes-or-less;time-to-make;course;main-in...,onion*frozen chopped spinach*eggs*garlic powde...
3,35173,italian gut busters,45,22724,2002-07-27,7,my sister-in-law made these for us at a family...,9.0,11,60-minutes-or-less;time-to-make;course;prepara...,sandwich bun*good seasonings italian salad dre...
4,84797,love is in the air beef fondue sauces,25,4470,2004-02-23,4,i think a fondue is a very romantic casual din...,12.0,19,30-minutes-or-less;time-to-make;course;main-in...,beef steaks*vegetable oil*spicy mustard*fresh ...
...,...,...,...,...,...,...,...,...,...,...,...
29995,267661,zurie s holey rustic olive and cheddar bread,80,200862,2007-11-25,16,this is based on a french recipe but i changed...,10.0,18,time-to-make;course;main-ingredient;cuisine;pr...,dry white wine*eggs*cheddar cheese*baking powd...
29996,386977,zwetschgenkuchen bavarian plum cake,240,177443,2009-08-24,22,"this is a traditional fresh plum cake, thought...",11.0,19,time-to-make;course;main-ingredient;cuisine;pr...,unsalted butter*milk*flour*salt*vanilla*all-pu...
29997,103312,zwiebelkuchen southwest german onion cake,75,161745,2004-11-03,10,this is a traditional late summer early fall s...,13.0,20,time-to-make;course;main-ingredient;cuisine;pr...,onion*milk*eggs*butter*flour*salt*pepper*sugar...
29998,486161,zydeco soup,60,227978,2012-08-29,7,this is a delicious soup that i originally fou...,22.0,20,ham;60-minutes-or-less;time-to-make;course;mai...,onion*celery*dried thyme*dried oregano*fresh p...
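
The same fill can be written without a row-wise `apply` by combining `fillna` with `map`. This is a sketch on made-up data, not the notebook's own approach: existing values are kept, and only the gaps are filled from the ingredient counts.

```python
import pandas as pd

# Made-up miniature: recipe 2 has a missing n_ingredients value.
recipes = pd.DataFrame({"id": [1, 2, 3], "n_ingredients": [18.0, None, 8.0]})
ingredients_by_recipe = {
    1: ["carrot", "onion"],
    2: ["milk", "frozen juice concentrate", "plain yogurt"],
}

counts = {recipe_id: len(items) for recipe_id, items in ingredients_by_recipe.items()}

# fillna aligns on the index, so each gap receives the count for its own row.
recipes["n_ingredients"] = recipes["n_ingredients"].fillna(recipes["id"].map(counts))
print(recipes)
```

Recipe 1 keeps its original value of 18 even though the sample dictionary lists only two ingredients, which mirrors the "fill only the missing values" requirement.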


1.5 Check whether the `n_ingredients` column contains missing values. If not, convert it to an integer type and save the result to the file `recipes_sample_with_tags_ingredients.csv`.

In [29]:
recipes_filled[recipes_filled["n_ingredients"].isnull()]

Unnamed: 0,id,name,minutes,contributor_id,submitted,n_steps,description,n_ingredients,n_tags,tags,ingredients


In [30]:
recipes_filled["n_ingredients"] = recipes_filled["n_ingredients"].astype(int)

In [31]:
recipes_filled.dtypes

id                 int64
name              object
minutes            int64
contributor_id     int64
submitted         object
n_steps            int64
description       object
n_ingredients      int64
n_tags             int64
tags              object
ingredients       object
dtype: object

In [32]:
recipes_filled

Unnamed: 0,id,name,minutes,contributor_id,submitted,n_steps,description,n_ingredients,n_tags,tags,ingredients
0,44123,george s at the cove black bean soup,90,35193,2002-10-25,11,an original recipe created by chef scott meska...,18,25,weeknight;time-to-make;course;main-ingredient;...,unsalted butter*carrot*onion*celery*broccoli s...
1,67664,healthy for them yogurt popsicles,10,91970,2003-07-26,3,my children and their friends ask for my homem...,3,31,15-minutes-or-less;time-to-make;course;prepara...,milk*frozen juice concentrate*plain yogurt
2,38798,i can t believe it s spinach,30,1533,2002-08-29,5,"these were so go, it surprised even me.",8,17,30-minutes-or-less;time-to-make;course;main-in...,onion*frozen chopped spinach*eggs*garlic powde...
3,35173,italian gut busters,45,22724,2002-07-27,7,my sister-in-law made these for us at a family...,9,11,60-minutes-or-less;time-to-make;course;prepara...,sandwich bun*good seasonings italian salad dre...
4,84797,love is in the air beef fondue sauces,25,4470,2004-02-23,4,i think a fondue is a very romantic casual din...,12,19,30-minutes-or-less;time-to-make;course;main-in...,beef steaks*vegetable oil*spicy mustard*fresh ...
...,...,...,...,...,...,...,...,...,...,...,...
29995,267661,zurie s holey rustic olive and cheddar bread,80,200862,2007-11-25,16,this is based on a french recipe but i changed...,10,18,time-to-make;course;main-ingredient;cuisine;pr...,dry white wine*eggs*cheddar cheese*baking powd...
29996,386977,zwetschgenkuchen bavarian plum cake,240,177443,2009-08-24,22,"this is a traditional fresh plum cake, thought...",11,19,time-to-make;course;main-ingredient;cuisine;pr...,unsalted butter*milk*flour*salt*vanilla*all-pu...
29997,103312,zwiebelkuchen southwest german onion cake,75,161745,2004-11-03,10,this is a traditional late summer early fall s...,13,20,time-to-make;course;main-ingredient;cuisine;pr...,onion*milk*eggs*butter*flour*salt*pepper*sugar...
29998,486161,zydeco soup,60,227978,2012-08-29,7,this is a delicious soup that i originally fou...,22,20,ham;60-minutes-or-less;time-to-make;course;mai...,onion*celery*dried thyme*dried oregano*fresh p...


In [33]:
recipes_filled.to_csv("./out/recipes_sample_with_tags_ingredients.csv")
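
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. A small sketch of the check, the cast, and a save without the index is shown below; the two-row table and the in-memory buffer are stand-ins for the real table and output file.

```python
import io

import pandas as pd

# Two-row stand-in for the recipes table.
recipes = pd.DataFrame({"id": [44123, 67664], "n_ingredients": [18.0, 3.0]})

# Cast only when the column has no missing values; NaN cannot be cast to int.
if not recipes["n_ingredients"].isna().any():
    recipes["n_ingredients"] = recipes["n_ingredients"].astype(int)

buffer = io.StringIO()
recipes.to_csv(buffer, index=False)  # index=False drops the row index column

buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.dtypes)
```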

### npy

2.1 Split the table obtained in task 1.5 into two tables: one contains the recipes submitted before 2000; the other contains all the rest. In the resulting tables, keep only the numeric columns and convert them to `numpy.array`.

In [34]:
recipes_filled["submitted"] = pd.to_datetime(recipes_filled["submitted"], format='%Y-%m-%d')
recipes_filled

Unnamed: 0,id,name,minutes,contributor_id,submitted,n_steps,description,n_ingredients,n_tags,tags,ingredients
0,44123,george s at the cove black bean soup,90,35193,2002-10-25,11,an original recipe created by chef scott meska...,18,25,weeknight;time-to-make;course;main-ingredient;...,unsalted butter*carrot*onion*celery*broccoli s...
1,67664,healthy for them yogurt popsicles,10,91970,2003-07-26,3,my children and their friends ask for my homem...,3,31,15-minutes-or-less;time-to-make;course;prepara...,milk*frozen juice concentrate*plain yogurt
2,38798,i can t believe it s spinach,30,1533,2002-08-29,5,"these were so go, it surprised even me.",8,17,30-minutes-or-less;time-to-make;course;main-in...,onion*frozen chopped spinach*eggs*garlic powde...
3,35173,italian gut busters,45,22724,2002-07-27,7,my sister-in-law made these for us at a family...,9,11,60-minutes-or-less;time-to-make;course;prepara...,sandwich bun*good seasonings italian salad dre...
4,84797,love is in the air beef fondue sauces,25,4470,2004-02-23,4,i think a fondue is a very romantic casual din...,12,19,30-minutes-or-less;time-to-make;course;main-in...,beef steaks*vegetable oil*spicy mustard*fresh ...
...,...,...,...,...,...,...,...,...,...,...,...
29995,267661,zurie s holey rustic olive and cheddar bread,80,200862,2007-11-25,16,this is based on a french recipe but i changed...,10,18,time-to-make;course;main-ingredient;cuisine;pr...,dry white wine*eggs*cheddar cheese*baking powd...
29996,386977,zwetschgenkuchen bavarian plum cake,240,177443,2009-08-24,22,"this is a traditional fresh plum cake, thought...",11,19,time-to-make;course;main-ingredient;cuisine;pr...,unsalted butter*milk*flour*salt*vanilla*all-pu...
29997,103312,zwiebelkuchen southwest german onion cake,75,161745,2004-11-03,10,this is a traditional late summer early fall s...,13,20,time-to-make;course;main-ingredient;cuisine;pr...,onion*milk*eggs*butter*flour*salt*pepper*sugar...
29998,486161,zydeco soup,60,227978,2012-08-29,7,this is a delicious soup that i originally fou...,22,20,ham;60-minutes-or-less;time-to-make;course;mai...,onion*celery*dried thyme*dried oregano*fresh p...


In [35]:
recipes_filled.dtypes

id                         int64
name                      object
minutes                    int64
contributor_id             int64
submitted         datetime64[ns]
n_steps                    int64
description               object
n_ingredients              int64
n_tags                     int64
tags                      object
ingredients               object
dtype: object

In [36]:
recipes_before = recipes_filled[recipes_filled["submitted"].dt.year < 2000]
recipes_before

Unnamed: 0,id,name,minutes,contributor_id,submitted,n_steps,description,n_ingredients,n_tags,tags,ingredients
189,3441,30 minute smoked sausage and corn chowder,30,1562,1999-10-18,8,"i love corn chowder, have a pot on now! recipe...",8,10,30-minutes-or-less;time-to-make;course;main-in...,heavy cream*vegetable oil*green onion*cream-st...
434,4205,alfredo fettuccine,25,1617,1999-11-09,3,recipe from the olive garden cookbook,5,14,30-minutes-or-less;time-to-make;course;main-in...,milk*butter*parmesan cheese*cream cheese*fettu...
439,3258,alfredo sauce with pasta,0,1534,1999-10-10,8,,6,20,15-minutes-or-less;time-to-make;course;main-in...,heavy cream*butter*salt*pepper*parmesan cheese...
669,153,amish friendship bread and starter,70,1540,1999-09-06,19,many recipes have been posted for the amish br...,12,16,time-to-make;course;cuisine;preparation;north-...,milk*eggs*flour*salt*baking powder*baking soda...
785,5197,anytime cheese ball,0,1534,1999-12-15,3,,5,16,15-minutes-or-less;time-to-make;main-ingredien...,garlic powder*white vinegar*philadelphia cream...
...,...,...,...,...,...,...,...,...,...,...,...
29196,465,whipped cappuccino,25,1737,1999-08-30,6,the denver post,4,19,30-minutes-or-less;time-to-make;course;prepara...,water*condensed milk*ice cubes*instant espress...
29243,2889,white cake with coconut pecan frosting,120,1646,1999-09-27,7,,11,18,weeknight;time-to-make;course;main-ingredient;...,eggs*butter*vegetable oil*flour*baking soda*su...
29463,3752,wine jelly,0,1535,1999-10-30,13,this make excellent gifts as it is quick and e...,4,9,15-minutes-or-less;time-to-make;course;prepara...,water*sugar*wine*liquid certo
29662,4801,yellow rice ii,20,1598,1999-11-22,4,in south africa we have a rich heritage of amo...,7,18,15-minutes-or-less;time-to-make;course;main-in...,butter*salt*turmeric*cinnamon stick*seedless r...


In [37]:
recipes_after = recipes_filled[recipes_filled["submitted"].dt.year >= 2000]
recipes_after

Unnamed: 0,id,name,minutes,contributor_id,submitted,n_steps,description,n_ingredients,n_tags,tags,ingredients
0,44123,george s at the cove black bean soup,90,35193,2002-10-25,11,an original recipe created by chef scott meska...,18,25,weeknight;time-to-make;course;main-ingredient;...,unsalted butter*carrot*onion*celery*broccoli s...
1,67664,healthy for them yogurt popsicles,10,91970,2003-07-26,3,my children and their friends ask for my homem...,3,31,15-minutes-or-less;time-to-make;course;prepara...,milk*frozen juice concentrate*plain yogurt
2,38798,i can t believe it s spinach,30,1533,2002-08-29,5,"these were so go, it surprised even me.",8,17,30-minutes-or-less;time-to-make;course;main-in...,onion*frozen chopped spinach*eggs*garlic powde...
3,35173,italian gut busters,45,22724,2002-07-27,7,my sister-in-law made these for us at a family...,9,11,60-minutes-or-less;time-to-make;course;prepara...,sandwich bun*good seasonings italian salad dre...
4,84797,love is in the air beef fondue sauces,25,4470,2004-02-23,4,i think a fondue is a very romantic casual din...,12,19,30-minutes-or-less;time-to-make;course;main-in...,beef steaks*vegetable oil*spicy mustard*fresh ...
...,...,...,...,...,...,...,...,...,...,...,...
29995,267661,zurie s holey rustic olive and cheddar bread,80,200862,2007-11-25,16,this is based on a french recipe but i changed...,10,18,time-to-make;course;main-ingredient;cuisine;pr...,dry white wine*eggs*cheddar cheese*baking powd...
29996,386977,zwetschgenkuchen bavarian plum cake,240,177443,2009-08-24,22,"this is a traditional fresh plum cake, thought...",11,19,time-to-make;course;main-ingredient;cuisine;pr...,unsalted butter*milk*flour*salt*vanilla*all-pu...
29997,103312,zwiebelkuchen southwest german onion cake,75,161745,2004-11-03,10,this is a traditional late summer early fall s...,13,20,time-to-make;course;main-ingredient;cuisine;pr...,onion*milk*eggs*butter*flour*salt*pepper*sugar...
29998,486161,zydeco soup,60,227978,2012-08-29,7,this is a delicious soup that i originally fou...,22,20,ham;60-minutes-or-less;time-to-make;course;mai...,onion*celery*dried thyme*dried oregano*fresh p...


In [38]:
recipes_after.dtypes

id                         int64
name                      object
minutes                    int64
contributor_id             int64
submitted         datetime64[ns]
n_steps                    int64
description               object
n_ingredients              int64
n_tags                     int64
tags                      object
ingredients               object
dtype: object

In [39]:
recipes_after_np = np.array([recipes_after[item] for item in ["id", "minutes", "contributor_id","n_steps","n_ingredients","n_tags"]])
recipes_after_np 

array([[ 44123,  67664,  38798, ..., 103312, 486161, 298512],
       [    90,     10,     30, ...,     75,     60,     29],
       [ 35193,  91970,   1533, ..., 161745, 227978, 506822],
       [    11,      3,      5, ...,     10,      7,      9],
       [    18,      3,      8, ...,     13,     22,     10],
       [    25,     31,     17, ...,     20,     20,     12]])

In [40]:
recipes_before_np = np.array([recipes_before[item] for item in ["id", "minutes", "contributor_id","n_steps","n_ingredients","n_tags"]])
recipes_before_np

array([[  3441,   4205,   3258, ...,   3752,   4801,   2982],
       [    30,     25,      0, ...,      0,     20,      0],
       [  1562,   1617,   1534, ...,   1535,   1598, 124030],
       [     8,      3,      8, ...,     13,      4,      6],
       [     8,      5,      6, ...,      4,      7,      7],
       [    10,     14,     20, ...,      9,     18,     13]])
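
The arrays above are built from a list of columns, so they have shape `(6, n_recipes)`, with one row per column of the table. An alternative sketch that selects numeric columns automatically and keeps one row per recipe is shown below. The three-row table is a shortened stand-in, and whether the transposed layout is preferable depends on how the arrays are used later.

```python
import pandas as pd

# Shortened stand-in for the recipes table.
recipes = pd.DataFrame({
    "id": [3441, 44123, 67664],
    "name": ["chowder", "black bean soup", "yogurt popsicles"],
    "minutes": [30, 90, 10],
    "submitted": pd.to_datetime(["1999-10-18", "2002-10-25", "2003-07-26"]),
})

is_before = recipes["submitted"].dt.year < 2000

# select_dtypes keeps integer and float columns and drops text and datetime ones.
before_np = recipes[is_before].select_dtypes(include="number").to_numpy()
after_np = recipes[~is_before].select_dtypes(include="number").to_numpy()

print(before_np.shape, after_np.shape)
```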

2.2 Save the 2 resulting arrays to an `npz` archive. Give the arrays readable names.

In [41]:
my_path = "./out/task_2_2.npz"
np.savez(my_path, recipes_before_2000=recipes_before_np, recipes_after_2000=recipes_after_np )

2.3 Read the created archive and demonstrate that the data was read correctly.

In [42]:
data = np.load(my_path)
data.files

['recipes_before_2000', 'recipes_after_2000']

In [43]:
recipes_before_2000 = data["recipes_before_2000"]
recipes_before_2000

array([[  3441,   4205,   3258, ...,   3752,   4801,   2982],
       [    30,     25,      0, ...,      0,     20,      0],
       [  1562,   1617,   1534, ...,   1535,   1598, 124030],
       [     8,      3,      8, ...,     13,      4,      6],
       [     8,      5,      6, ...,      4,      7,      7],
       [    10,     14,     20, ...,      9,     18,     13]])

In [44]:
recipes_after_2000 = data["recipes_after_2000"]
recipes_after_2000

array([[ 44123,  67664,  38798, ..., 103312, 486161, 298512],
       [    90,     10,     30, ...,     75,     60,     29],
       [ 35193,  91970,   1533, ..., 161745, 227978, 506822],
       [    11,      3,      5, ...,     10,      7,      9],
       [    18,      3,      8, ...,     13,     22,     10],
       [    25,     31,     17, ...,     20,     20,     12]])
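
Comparing printed arrays by eye only covers the few values that are displayed. A programmatic check with `np.array_equal` compares every element. The sketch below uses two small stand-in arrays and a temporary path; only the archive key names are taken from the cells above.

```python
import os
import tempfile

import numpy as np

# Small stand-ins for the two arrays from task 2.1.
before = np.array([[3441, 4205], [30, 25]])
after = np.array([[44123, 67664], [90, 10]])

path = os.path.join(tempfile.mkdtemp(), "task_2_2_demo.npz")
np.savez(path, recipes_before_2000=before, recipes_after_2000=after)

with np.load(path) as archive:
    # array_equal checks shape and every element.
    round_trip_ok = (
        np.array_equal(archive["recipes_before_2000"], before)
        and np.array_equal(archive["recipes_after_2000"], after)
    )
print(round_trip_ok)
```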

### hdf

3.1 Print the names of all datasets in the file `nutrition_sample.h5`, along with the shape of the matrix stored in each dataset and its metadata.

Output format:
```
Dataset name=dataset_0, dataset size=(30000,), metadata={'info': 'calories (#)'}
Dataset name=dataset_1, dataset size=(30000,), metadata={'info': 'total fat (PDV)'}
...
```

In [45]:
datasets_dict = {}
with h5py.File("./data/nutrition_sample.h5", "r") as hdf:
    for key in hdf.keys():
        dataset = hdf[key]
        # Copy the attributes into a plain dict so they outlive the open file.
        metadata = dict(dataset.attrs)
        print(f"Dataset name={key}, dataset size={dataset.shape}, metadata={metadata}")
        # dataset[:] reads the whole dataset into a NumPy array.
        datasets_dict[key] = {"all": {"data": dataset[:]}, "attrs": metadata}



Dataset name=dataset_0, dataset size=(30000, 2), metadata={'col_0': 'recipe_id', 'col_1': 'calories (#)'}
Dataset name=dataset_1, dataset size=(30000, 2), metadata={'col_0': 'recipe_id', 'col_1': 'total fat (PDV)'}
Dataset name=dataset_2, dataset size=(30000, 2), metadata={'col_0': 'recipe_id', 'col_1': 'sugar (PDV)'}
Dataset name=dataset_3, dataset size=(30000, 2), metadata={'col_0': 'recipe_id', 'col_1': 'sodium (PDV)'}
Dataset name=dataset_4, dataset size=(30000, 2), metadata={'col_0': 'recipe_id', 'col_1': 'protein (PDV)'}
Dataset name=dataset_5, dataset size=(30000, 2), metadata={'col_0': 'recipe_id', 'col_1': 'saturated fat (PDV)'}
Dataset name=dataset_6, dataset size=(30000, 2), metadata={'col_0': 'recipe_id', 'col_1': 'carbohydrates (PDV)'}


In [46]:
datasets_dict

{'dataset_0': {'all': {'data': array([[4.41230e+04, 8.04700e+02],
          [6.76640e+04, 1.64600e+02],
          [3.87980e+04, 5.38000e+01],
          ...,
          [1.03312e+05, 8.64100e+02],
          [4.86161e+05, 4.15200e+02],
          [2.98512e+05, 1.88000e+02]])},
  'attrs': {'col_0': 'recipe_id', 'col_1': 'calories (#)'}},
 'dataset_1': {'all': {'data': array([[4.41230e+04, 1.08000e+02],
          [6.76640e+04, 3.00000e+00],
          [3.87980e+04, 5.00000e+00],
          ...,
          [1.03312e+05, 8.70000e+01],
          [4.86161e+05, 2.60000e+01],
          [2.98512e+05, 1.10000e+01]])},
  'attrs': {'col_0': 'recipe_id', 'col_1': 'total fat (PDV)'}},
 'dataset_2': {'all': {'data': array([[4.41230e+04, 2.60000e+01],
          [6.76640e+04, 5.00000e+00],
          [3.87980e+04, 2.00000e+00],
          ...,
          [1.03312e+05, 3.00000e+01],
          [4.86161e+05, 3.40000e+01],
          [2.98512e+05, 5.70000e+01]])},
  'attrs': {'col_0': 'recipe_id', 'col_1': 'sugar (PD

3.2 Split each of the available datasets into two parts: part 1 contains only the rows where PDV (Percent Daily Value) exceeds 100%; part 2 contains the rows where PDV is at most 100%.<br>Create 2 groups in the file and place the corresponding parts of the datasets in them, preserving the metadata of the original datasets. The result should be 2 groups, each containing several datasets. Save the results to the file `nutrition_grouped.h5`.

In [47]:
datasets_dict = {k : v for k, v in datasets_dict.items() if "PDV" in v["attrs"]["col_1"]}

In [48]:
new_datasets_dict = {"PDV_more_100": {}, "PDV_less_100": {}}
for dataset, values in datasets_dict.items():
    data = values["all"]["data"]

    # Column 1 holds the PDV value. This cell compares it with 1,
    # whereas the task statement puts the boundary at 100%.
    mask = data[:, 1] > 1
    new_datasets_dict["PDV_more_100"][dataset] = {"data": data[mask], "attrs": values["attrs"], "name": dataset}

    # Strict inequality: rows with a value of exactly 1 land in neither group.
    mask = data[:, 1] < 1
    new_datasets_dict["PDV_less_100"][dataset] = {"data": data[mask], "attrs": values["attrs"], "name": dataset}
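
The task statement sets the boundary at 100%. Below is a minimal sketch of such a split, assuming the second column stores PDV in percent units and using toy data (the array and the names `more_than_100` and `at_most_100` are illustrative). Taking the complement of a single mask guarantees that every row lands in exactly one part.

```python
import numpy as np

# Toy data: column 0 is a recipe id, column 1 is a PDV value in percent (assumed).
data = np.array([
    [1.0, 108.0],
    [2.0, 3.0],
    [3.0, 100.0],
    [4.0, 250.0],
])

threshold = 100  # boundary taken from the task statement
mask = data[:, 1] > threshold

more_than_100 = data[mask]   # rows with PDV strictly above 100%
at_most_100 = data[~mask]    # complement: PDV <= 100%, so no row is dropped

print(more_than_100[:, 0], at_most_100[:, 0])
```

Because the two parts come from `mask` and `~mask`, their row counts always add up to the size of the original dataset, which is an easy sanity check after the split.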



In [49]:
for dataset, data in new_datasets_dict.items():
    print(dataset)
    print(data)

PDV_more_100
{'dataset_1': {'data': array([[4.41230e+04, 1.08000e+02],
       [6.76640e+04, 3.00000e+00],
       [3.87980e+04, 5.00000e+00],
       ...,
       [1.03312e+05, 8.70000e+01],
       [4.86161e+05, 2.60000e+01],
       [2.98512e+05, 1.10000e+01]]), 'attrs': {'col_0': 'recipe_id', 'col_1': 'total fat (PDV)'}, 'name': 'dataset_1'}, 'dataset_2': {'data': array([[4.41230e+04, 2.60000e+01],
       [6.76640e+04, 5.00000e+00],
       [3.87980e+04, 2.00000e+00],
       ...,
       [1.03312e+05, 3.00000e+01],
       [4.86161e+05, 3.40000e+01],
       [2.98512e+05, 5.70000e+01]]), 'attrs': {'col_0': 'recipe_id', 'col_1': 'sugar (PDV)'}, 'name': 'dataset_2'}, 'dataset_3': {'data': array([[4.41230e+04, 1.90000e+01],
       [3.87980e+04, 3.00000e+00],
       [3.51730e+04, 9.70000e+01],
       ...,
       [1.03312e+05, 1.80000e+01],
       [4.86161e+05, 2.60000e+01],
       [2.98512e+05, 1.10000e+01]]), 'attrs': {'col_0': 'recipe_id', 'col_1': 'sodium (PDV)'}, 'name': 'dataset_3'}, 'datas

In [50]:
h5_path = "./out/nutrition_grouped.h5"
with h5py.File(h5_path, mode='w') as hdf:
    for group_name, datasets in new_datasets_dict.items():
        new_group = hdf.create_group(group_name)
        for entry in datasets.values():
            item = new_group.create_dataset(name=entry['name'], data=entry['data'])
            # Carry over the metadata of the original dataset.
            item.attrs.update(entry['attrs'])

3.3 Print the names of all groups and of the datasets inside them from the file `nutrition_grouped.h5`, along with the shape of the matrices stored in the datasets and their metadata.

In [51]:
with h5py.File(h5_path, mode='r') as hdf:
    for group_name, group in hdf.items():
        print(f'\nGroup: {group_name}')

        for key in group:
            dataset = group[key]
            metadata = dict(dataset.attrs)
            print(f"Dataset name={key}, dataset size={dataset.shape}, metadata={metadata}")



Group: PDV_less_100
Dataset name=dataset_1, dataset size=(2146, 2), metadata={'col_0': 'recipe_id', 'col_1': 'total fat (PDV)'}
Dataset name=dataset_2, dataset size=(1371, 2), metadata={'col_0': 'recipe_id', 'col_1': 'sugar (PDV)'}
Dataset name=dataset_3, dataset size=(2588, 2), metadata={'col_0': 'recipe_id', 'col_1': 'sodium (PDV)'}
Dataset name=dataset_4, dataset size=(1274, 2), metadata={'col_0': 'recipe_id', 'col_1': 'protein (PDV)'}
Dataset name=dataset_5, dataset size=(2569, 2), metadata={'col_0': 'recipe_id', 'col_1': 'saturated fat (PDV)'}
Dataset name=dataset_6, dataset size=(1653, 2), metadata={'col_0': 'recipe_id', 'col_1': 'carbohydrates (PDV)'}

Group: PDV_more_100
Dataset name=dataset_1, dataset size=(27025, 2), metadata={'col_0': 'recipe_id', 'col_1': 'total fat (PDV)'}
Dataset name=dataset_2, dataset size=(27822, 2), metadata={'col_0': 'recipe_id', 'col_1': 'sugar (PDV)'}
Dataset name=dataset_3, dataset size=(26148, 2), metadata={'col_0': 'recipe_id', 'col_1': 'sodium
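
The nested loops above work for a file that is exactly two levels deep. For arbitrary nesting, h5py offers `Group.visititems`, which walks the whole hierarchy and calls a function for each object. A minimal sketch on an in-memory file (the file name, the toy layout, and the `describe` helper are assumptions for illustration):

```python
import h5py
import numpy as np

found = []

def describe(name, obj):
    # visititems calls this for every group and dataset below the root;
    # returning None tells h5py to keep walking.
    if isinstance(obj, h5py.Dataset):
        found.append((name, obj.shape, dict(obj.attrs)))

# The "core" driver with backing_store=False keeps the file in memory only.
with h5py.File("demo_visit.h5", "w", driver="core", backing_store=False) as hdf:
    for group_name in ("PDV_less_100", "PDV_more_100"):
        group = hdf.create_group(group_name)
        ds = group.create_dataset("dataset_1", data=np.zeros((3, 2)))
        ds.attrs["col_1"] = "total fat (PDV)"
    hdf.visititems(describe)

for name, shape, attrs in found:
    print(f"Dataset name={name}, dataset size={shape}, metadata={attrs}")
```

The `name` passed to the callback is the path relative to the group being visited, for example `PDV_less_100/dataset_1`, so the group membership is visible without a second loop.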

3.4 Modify the code from 3.3 so that the datasets are saved with compression. Compare the size of the resulting file with the size of the file from 3.3. Comment on the result.

In [52]:
h5_compressed_path = "./out/nutrition_grouped_compressed.h5"
with h5py.File(h5_compressed_path, mode='w') as hdf:
    for group_name, datasets in new_datasets_dict.items():
        new_group = hdf.create_group(group_name)
        for entry in datasets.values():
            # gzip with the maximum compression level (9).
            item = new_group.create_dataset(
                name=entry['name'],
                data=entry['data'],
                compression="gzip",
                compression_opts=9,
            )
            item.attrs.update(entry['attrs'])
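
To see the effect of the `compression` argument in isolation, the same array can be written with and without it and the file sizes compared. A minimal sketch, assuming a toy array of zeros and temporary files (all names here are illustrative); such repetitive data compresses far better than real measurements would:

```python
import os
import tempfile

import h5py
import numpy as np

# Highly repetitive toy data; real data will shrink much less.
data = np.zeros((1000, 1000))

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "plain.h5")
    gzip_path = os.path.join(tmp_dir, "gzip.h5")

    with h5py.File(plain_path, "w") as hdf:
        hdf.create_dataset("x", data=data)

    with h5py.File(gzip_path, "w") as hdf:
        ds = hdf.create_dataset("x", data=data, compression="gzip", compression_opts=9)
        used_filter = ds.compression   # name of the compression filter in use
        chunk_shape = ds.chunks        # compressed datasets are stored in chunks

    plain_size = os.path.getsize(plain_path)
    gzip_size = os.path.getsize(gzip_path)

print(used_filter, chunk_shape, plain_size, gzip_size)
```

Compression is transparent on reading: `ds[:]` returns the same values either way, so the trade-off is extra CPU time when writing and reading in exchange for a smaller file.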

In [53]:
import os

In [54]:
def get_file_size_2(file):
    # Print the full stat record, then return the size in bytes.
    stat = os.stat(file)
    print(stat)
    size = stat.st_size
    return size

In [55]:
size = get_file_size_2(h5_path)
print('File size: ' + str(size) + ' bytes')

os.stat_result(st_mode=33188, st_ino=27995154, st_dev=16777234, st_nlink=1, st_uid=501, st_gid=20, st_size=2788576, st_atime=1634820763, st_mtime=1634820763, st_ctime=1634820763)
File size: 2788576 bytes


In [56]:
size = get_file_size_2(h5_compressed_path)
print('File size: ' + str(size) + ' bytes')

os.stat_result(st_mode=33188, st_ino=27997350, st_dev=16777234, st_nlink=1, st_uid=501, st_gid=20, st_size=914789, st_atime=1633279205, st_mtime=1634820764, st_ctime=1634820764)
File size: 914789 bytes
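
The task also asks for a comment on the result. The two sizes printed above can be turned into a compression ratio with a few lines of arithmetic. In this sketch the variable names are illustrative and the numbers are copied from the `os.stat` output:

```python
# File sizes in bytes, copied from the os.stat output above.
plain_size = 2788576
compressed_size = 914789

ratio = plain_size / compressed_size
saved_fraction = 1 - compressed_size / plain_size

print(f"compression ratio: {ratio:.2f}x, space saved: {saved_fraction:.1%}")
```

With gzip at level 9 the file is about three times smaller, so roughly two thirds of the disk space is saved for this data.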
