What is Vaex?
=============

Vaex (originally VaEx: "Visualization and Exploration") is a Python library for lazy **Out-of-Core DataFrames** (similar to Pandas), to visualize and explore big tabular datasets. 

It can calculate *statistics* such as mean, sum, count, standard deviation, etc., on an *N-dimensional grid* at up to **a billion** ($10^9$) objects/rows **per second**.
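To make "statistics on a grid" concrete, here is a minimal plain-Python sketch of the one-dimensional case: the mean of one column, binned by another. It is not how Vaex implements it; the function name `binned_mean` and the sample values are made up for illustration. An N-dimensional grid would use one bin index per axis.

```python
# Plain-Python sketch of a binned statistic (not Vaex's implementation).
# `binned_mean` and the sample data are hypothetical, for illustration only.

def binned_mean(x, values, lo, hi, bins):
    """Mean of `values` in each of `bins` equal-width bins of x over [lo, hi)."""
    sums = [0.0] * bins
    counts = [0] * bins
    width = (hi - lo) / bins
    for xi, vi in zip(x, values):
        if lo <= xi < hi:
            # Clamp to guard against floating-point rounding at the upper edge.
            idx = min(int((xi - lo) / width), bins - 1)
            sums[idx] += vi
            counts[idx] += 1
    # Empty bins have no mean; report None for them.
    return [s / c if c else None for s, c in zip(sums, counts)]

abv = [0.5, 4.8, 5.2, 9.5, 5.0]      # hypothetical x values (e.g. ABV)
score = [2.0, 4.0, 3.5, 4.5, 3.0]    # hypothetical values to average
means = binned_mean(abv, score, lo=0.0, hi=10.0, bins=2)
print(means)
```

Each row is visited once and only the per-bin sums and counts are kept, which is why this kind of statistic suits data too large to hold in memory.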

Visualization is done using **histograms**, **density plots** and **3d volume rendering**, allowing interactive exploration of big data. 

Vaex uses memory mapping, a zero memory copy policy, and lazy computations for best performance (no memory wasted).
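As a rough illustration of what memory mapping buys us, the sketch below uses only Python's standard `mmap` module, not Vaex. It writes some float64 values to a temporary file (made-up data), maps the file, and reads it through a `memoryview` without first copying the file contents into a Python list.

```python
import mmap
import os
import struct
import tempfile

# Write 1000 float64 values to a temporary binary file (illustrative data).
values = [float(i) for i in range(1000)]
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack(f"{len(values)}d", *values))
    path = f.name

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Reinterpret the mapped bytes as float64 without copying them.
        with memoryview(mm) as raw, raw.cast("d") as view:
            n = len(view)
            total = sum(view)   # the OS pages data in as it is touched

os.remove(path)
print(n, total)
```

The operating system loads pages of the file only as they are accessed, so opening a mapped file costs little no matter how large it is.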

In [None]:
import vaex
df = vaex.open('data/beer_small.csv')
df

For a quick API demo, and to show the parallels between Vaex and Pandas, we'll run the Pandas "Beer Exploration" code and see some similarities and differences.

In [None]:
len(df)

Note: unlike Pandas, Vaex's `count()` doesn't return counts by column, since we don't necessarily want to "commit" to that work.

In [None]:
df.count()

However, we can get info for columns we're interested in

In [None]:
df.count('brewery_name')

`count` also has a number of additional features aimed at working with larger data: https://docs.vaex.io/en/latest/api.html#vaex.dataframe.DataFrame.count ... in the source it just delegates to aggregation

In [None]:
df.count()

Let's get summary statistics for the numeric columns ... things like review score and ABV

In [None]:
df.describe()

There are some really low-alcohol beers in there ... maybe even bogus data.

Find all entries with ABV less than 1%

In [None]:
low_abv = df[df.beer_abv < 1]

low_abv

How many of these reviews are there?

In [None]:
len(low_abv)

This includes multiple reviews for the same beer, so let's group by beer and count.

In [None]:
grouping = low_abv.groupby('beer_name')
try:
    grouping.size()
except Exception as err:
    print(err)

In [None]:
grouping.agg('count')
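Conceptually, grouping by beer and counting is just a tally of rows per key. Here is a small plain-Python sketch with `collections.Counter`; the rows are hypothetical stand-ins for `low_abv`, not data from the beer file.

```python
from collections import Counter

# Hypothetical (beer_name, review_overall) rows standing in for low_abv.
reviews = [
    ("O'Doul's", 2.0),
    ("O'Doul's", 3.0),
    ("Sample Near Beer", 3.5),
    ("O'Doul's", 1.5),
]

# One pass over the rows, tallying how many reviews each beer has.
counts = Counter(name for name, _ in reviews)
print(counts)
```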

How consistent are the O'Doul's overall scores?

In [None]:
scores = low_abv[low_abv.beer_name=="O'Doul's"]['review_overall']
scores

Let's plot a histogram

In [None]:
try:
    scores.hist()
except Exception as err:
    print(err)

In [None]:
low_abv[low_abv.beer_name=="O'Doul's"].plot1d('review_overall')

The default behavior is to plot the range covering 99.7% of the data (roughly ±3σ) and omit outliers; we can adjust the limits:

In [None]:
low_abv[low_abv.beer_name=="O'Doul's"].plot1d('review_overall', limits=[0, 5])
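As a side note on where the 99.7% figure comes from: for a normal distribution, that is the share of values within three standard deviations of the mean. A quick check with the standard library follows. This assumes normality, which review scores need not satisfy, so treat the ±3σ reading as a rule of thumb.

```python
from statistics import NormalDist

# Fraction of a standard normal distribution lying within +/- 3 sigma.
std_normal = NormalDist(mu=0.0, sigma=1.0)
coverage = std_normal.cdf(3.0) - std_normal.cdf(-3.0)
print(f"{coverage:.4f}")  # prints 0.9973
```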

In [None]:
scores.mean(), scores.std()

In the full dataset, can we count reviews by brewery, and then by style within that brewery?

In [None]:
df.groupby(['brewery_name', 'beer_style']).agg('count')

### Now we'll try to build up a slightly more complex report

Step 1: Find all rows corresponding to reviews where the beer style starts with "American"

In [None]:
all_american = df[df.beer_style.str.startswith('American')]
all_american

Next, make a dataframe with just the `beer_style` and `review_overall` fields for those rows.

In [None]:
narrowed = all_american[['beer_style', 'review_overall']]
narrowed

In [None]:
narrowed[narrowed.beer_style=='American Malt Liquor'].plot1d('review_overall')

In [None]:
narrowed[narrowed.beer_style=='American IPA'].plot1d('review_overall')