# Querying tables

> Objectives:
> * Compare queries of tabular data for **in-memory** containers
> * Compare the memory sizes and query times of those containers

In [1]:
from ipython_memwatcher import MemWatcher
mw = MemWatcher()
mw.start_watching_memory()

In [1] used 0.027 MiB RAM in 0.001s, peaked 0.000 MiB above current, total RAM usage 29.012 MiB


In [2]:
import os
dset = '/home/faltet/blosc/movielens-bench/ml-1m'
fdata = os.path.join(dset, 'ratings.dat.gz')
fitem = os.path.join(dset, 'movies.dat')

In [2] used 0.023 MiB RAM in 0.006s, peaked 0.000 MiB above current, total RAM usage 29.035 MiB


In [3]:
import pandas as pd
# pass in column names for each CSV
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(fdata, sep=';', names=r_cols, compression='gzip')

m_cols = ['movie_id', 'title', 'genres']
movies = pd.read_csv(fitem, sep=';', names=m_cols,
                     dtype={'title': "S100", 'genres': "S100"})

In [3] used 78.168 MiB RAM in 0.789s, peaked 0.000 MiB above current, total RAM usage 107.203 MiB


In [4]:
lens = pd.merge(movies, ratings)

In [4] used 53.961 MiB RAM in 0.164s, peaked 0.000 MiB above current, total RAM usage 161.164 MiB


In [21]:
lens[:10]
lens.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns (total 6 columns):
movie_id          1000209 non-null int64
title             1000209 non-null object
genres            1000209 non-null object
user_id           1000209 non-null int64
rating            1000209 non-null int64
unix_timestamp    1000209 non-null int64
dtypes: int64(4), object(2)
memory usage: 53.4+ MB
In [21] used 17.160 MiB RAM in 0.188s, peaked 0.000 MiB above current, total RAM usage 293.570 MiB


In [6]:
result = lens.query("(title == 'Tom and Huck (1995)') & (rating == 5)")['user_id']
%timeit lens.query("(title == 'Tom and Huck (1995)') & (rating == 5)")['user_id']

10 loops, best of 3: 44.5 ms per loop
In [6] used 38.516 MiB RAM in 2.106s, peaked 0.000 MiB above current, total RAM usage 199.855 MiB


In [7]:
import bcolz
print bcolz.print_versions()
bcolz.defaults.cparams['cname'] = 'lz4'
bcolz.defaults.cparams['clevel'] = 5
bcolz.set_nthreads(4)

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
bcolz version:     0.9.0
NumPy version:     1.9.2
Blosc version:     1.4.1 ($Date:: 2014-07-08 #$)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
Numexpr version:   2.4.3
Python version:    2.7.10 |Anaconda 2.1.0 (64-bit)| (default, May 28 2015, 17:02:03) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Platform:          linux2-x86_64
Byte-ordering:     little
Detected cores:    4
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
None


4

In [7] used 1.309 MiB RAM in 0.042s, peaked 0.000 MiB above current, total RAM usage 201.164 MiB


In [8]:
zlens = bcolz.ctable.fromdataframe(lens)

In [8] used 13.137 MiB RAM in 0.325s, peaked 71.156 MiB above current, total RAM usage 214.301 MiB


In [10]:
zlens2 = bcolz.ctable.fromdataframe(lens)

In [10] used 8.203 MiB RAM in 0.337s, peaked 70.016 MiB above current, total RAM usage 222.512 MiB


In [11]:
zlens

ctable((1000209,), [('movie_id', '<i8'), ('title', 'S82'), ('genres', 'S47'), ('user_id', '<i8'), ('rating', '<i8'), ('unix_timestamp', '<i8')])
  nbytes: 153.57 MB; cbytes: 7.89 MB; ratio: 19.45
  cparams := cparams(clevel=5, shuffle=True, cname='lz4')
[(1, 'Toy Story (1995)', "Animation|Children's|Comedy", 1, 5, 978824268)
 (1, 'Toy Story (1995)', "Animation|Children's|Comedy", 6, 4, 978237008)
 (1, 'Toy Story (1995)', "Animation|Children's|Comedy", 8, 4, 978233496)
 ...,
 (3952, 'Contender, The (2000)', 'Drama|Thriller', 5837, 4, 1011902656)
 (3952, 'Contender, The (2000)', 'Drama|Thriller', 5927, 1, 979852537)
 (3952, 'Contender, The (2000)', 'Drama|Thriller', 5998, 4, 1001781044)]

In [11] used 0.000 MiB RAM in 0.006s, peaked 0.000 MiB above current, total RAM usage 222.512 MiB


We can see that the bcolz container takes around 7x less space (!) than the pandas one: 7.89 MB compressed versus the 53.4+ MB that pandas reports.

### Exercise 1

Why do you think the number of uncompressed bytes (nbytes) that the ctable reports is about 3x larger than what pandas reports (153 MB vs 54 MB)?

*Hint:* Pandas stores the string columns in NumPy containers with 'object' dtype, whereas bcolz uses the equivalent of NumPy's fixed-width 'string' dtype (here 'S82' and 'S47').
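If you want to check your answer, the two reported sizes can be reproduced with a little arithmetic on the dtypes shown in the outputs above. This is only a sketch: it assumes a 64-bit platform (8-byte object pointers) and that both tools report "MB" in units of 2**20 bytes, which is an assumption that happens to match the printed figures.

```python
# Per-row storage implied by the dtypes shown in the outputs above.
N_ROWS = 1000209
MIB = 2 ** 20  # assumption: both tools report "MB" in units of 2**20 bytes

# ctable: four int64 fields plus the fixed-width strings 'S82' and 'S47'
ctable_row = 4 * 8 + 82 + 47       # 161 bytes per row

# pandas: four int64 columns, two object columns (one 8-byte pointer each),
# plus the Int64Index (8 bytes per row); the string payloads are not counted,
# which is what the "+" in "53.4+ MB" signals
pandas_row = 4 * 8 + 2 * 8 + 8     # 56 bytes per row

ctable_mib = N_ROWS * ctable_row / float(MIB)
pandas_mib = N_ROWS * pandas_row / float(MIB)
print("ctable: %.2f MiB, pandas: %.1f MiB, ratio: %.1fx"
      % (ctable_mib, pandas_mib, ctable_mib / pandas_mib))
# prints: ctable: 153.57 MiB, pandas: 53.4 MiB, ratio: 2.9x
```

Every title occupies 82 bytes in the ctable regardless of its actual length, while pandas only counts one pointer per string.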

In [14]:
resultz = [(r.nrow__, r.user_id) for r in zlens.where("(title == 'Tom and Huck (1995)') & (rating == 5)", outcols=['nrow__', 'user_id'])]
%timeit [(r.nrow__, r.user_id) for r in zlens.where("(title == 'Tom and Huck (1995)') & (rating == 5)", outcols=['nrow__', 'user_id'])]

10 loops, best of 3: 24.5 ms per loop
In [14] used 0.152 MiB RAM in 1.133s, peaked 0.000 MiB above current, total RAM usage 222.988 MiB


In [15]:
print("results with pandas Dataframe:", result)
print("results with bcolz ctable:", resultz)

('results with pandas Dataframe:', 5121      75
5164    3842
5187    6031
Name: user_id, dtype: int64)
('results with bcolz ctable:', [(5121, 75), (5164, 3842), (5187, 6031)])
In [15] used 0.000 MiB RAM in 0.005s, peaked 0.000 MiB above current, total RAM usage 222.988 MiB


## Using structured NumPy arrays

It turns out that this is a consequence of NumPy not being optimal in accessing unaligned data. Initially, one could have blamed the CPU for having to access unaligned data, but that is no longer the case. For more info see: http://mail.scipy.org/pipermail/numpy-discussion/2015-July/073146.html
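To make "unaligned data" concrete, here is a minimal sketch that builds a NumPy structured dtype with the same fields as the ctable shown earlier and compares the packed layout with an aligned one. Using that particular field list is an assumption made for illustration; the offsets you get with `align=True` depend on the platform's alignment rules.

```python
import numpy as np

# Field layout copied from the ctable repr shown earlier (illustrative choice).
fields = [('movie_id', '<i8'), ('title', 'S82'), ('genres', 'S47'),
          ('user_id', '<i8'), ('rating', '<i8'), ('unix_timestamp', '<i8')]

packed = np.dtype(fields)               # default: fields stored back to back
aligned = np.dtype(fields, align=True)  # padding added so fields are aligned

# dtype.fields maps each name to a (dtype, byte offset) pair
for name in ('user_id', 'rating', 'unix_timestamp'):
    print("%-15s packed offset: %3d   aligned offset: %3d"
          % (name, packed.fields[name][1], aligned.fields[name][1]))
print("itemsize: packed %d, aligned %d" % (packed.itemsize, aligned.itemsize))
```

In the packed layout the 82- and 47-byte string fields push the following int64 fields to an offset that is not a multiple of 8, so every access to them is unaligned. The aligned layout trades a few padding bytes per row for aligned access.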

In [18]:
nalens = lens.to_records()

In [18] used 0.000 MiB RAM in 1.031s, peaked 53.297 MiB above current, total RAM usage 276.406 MiB


In [19]:
resultna = nalens[(nalens['title'] == 'Tom and Huck (1995)') & (nalens['rating'] == 5)]
%timeit nalens[(nalens['title'] == 'Tom and Huck (1995)') & (nalens['rating'] == 5)]
resultna

10 loops, best of 3: 16.7 ms per loop


rec.array([ (5121, 8, 'Tom and Huck (1995)', "Adventure|Children's", 75, 5, 977851520),
       (5164, 8, 'Tom and Huck (1995)', "Adventure|Children's", 3842, 5, 967986151),
       (5187, 8, 'Tom and Huck (1995)', "Adventure|Children's", 6031, 5, 956718223)], 
      dtype=[('index', '<i8'), ('movie_id', '<i8'), ('title', 'O'), ('genres', 'O'), ('user_id', '<i8'), ('rating', '<i8'), ('unix_timestamp', '<i8')])

In [19] used 0.000 MiB RAM in 0.804s, peaked 0.000 MiB above current, total RAM usage 276.406 MiB


Again, NumPy is the fastest of the in-memory containers (16.7 ms per query here, versus 24.5 ms for bcolz and 44.5 ms for pandas), while its memory consumption is close to that of pandas.
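The boolean-mask query used on `nalens` can be tried without the MovieLens files. The sketch below runs the same kind of query on a small synthetic structured array; the table contents, the `U32` title width, and the helper names `make_mask` and `query` are all made up for illustration, and the timing it prints will not match the figures above.

```python
import timeit
import numpy as np

# Hypothetical miniature of the `lens` table; the values are invented.
dtype = [('movie_id', '<i8'), ('title', 'U32'),
         ('user_id', '<i8'), ('rating', '<i8')]
rows = [
    (1, 'Toy Story (1995)', 1, 5),
    (8, 'Tom and Huck (1995)', 75, 5),
    (8, 'Tom and Huck (1995)', 99, 3),
    (8, 'Tom and Huck (1995)', 3842, 5),
]
table = np.array(rows * 1000, dtype=dtype)

def make_mask(t):
    # One boolean array per condition, combined element-wise with &
    return (t['title'] == 'Tom and Huck (1995)') & (t['rating'] == 5)

def query(t):
    # Indexing with the mask copies the matching rows into a new array
    return t[make_mask(t)]

hits = query(table)
# Row numbers of the matches, comparable to bcolz's nrow__ output column
row_numbers = np.flatnonzero(make_mask(table))

best = min(timeit.repeat(lambda: query(table), number=10, repeat=3)) / 10
print("%d matching rows; best of 3: %.3f ms per query" % (len(hits), best * 1e3))
```

Each condition scans a whole column, so the cost grows with the number of rows, not with the number of matches.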

## Rules of thumb for querying in-memory tabular datasets

* Choose pure NumPy recarrays if you need the highest speed
* Choose bcolz ctables if you need to store lots of data in limited memory and do not want to lose too much speed
* Choose pandas if what you need is rich functionality (at the cost of some speed)