<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#VAEX" data-toc-modified-id="VAEX-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>VAEX</a></span><ul class="toc-item"><li><span><a href="#Vaex-out-of-core" data-toc-modified-id="Vaex-out-of-core-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Vaex out-of-core</a></span></li><li><span><a href="#Why-vaex" data-toc-modified-id="Why-vaex-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Why vaex</a></span></li></ul></li></ul></div>

Web harvesting example for the [Towards Data Science](https://towardsdatascience.com) website, using the Vaex Python library.

# VAEX

## Vaex out-of-core

Vaex is a Python library for lazy out-of-core DataFrames (similar to Pandas), used to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, and standard deviation on an N-dimensional grid at up to a billion (10^9) objects/rows per second. Visualization is done using histograms, density plots and 3D volume rendering, allowing interactive exploration of big data.

Vaex uses memory mapping, a zero memory copy policy, and lazy computations for best performance (no memory wasted).
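
The idea of memory mapping can be illustrated without vaex itself. The sketch below uses NumPy's `np.memmap` as a stand-in (it is not vaex's internal implementation): an array is written to a temporary file and a small slice is read back without loading the whole file. The file name and array size are arbitrary choices made for this illustration.

```python
import os
import tempfile

import numpy as np

# Write one million float64 values to a temporary binary file.
path = os.path.join(tempfile.mkdtemp(), "column.bin")
data = np.arange(1_000_000, dtype=np.float64)
data.tofile(path)

# Map the file into memory: nothing is read until the data is accessed.
column = np.memmap(path, dtype=np.float64, mode="r", shape=(1_000_000,))

# Only the pages backing this small slice need to be read from disk.
first_mean = float(column[:10].mean())
print(first_mean)
```

The same principle lets a memory-mapped column be filtered or aggregated while the operating system decides which parts of the file actually reside in RAM.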

The most important class (data structure) in vaex is the DataFrame. A DataFrame is obtained by either opening a dataset or connecting to a remote server.

> df1 = vaex.open("somedata.hdf5")

> df2 = vaex.open("somedata.fits")

> df3 = vaex.open("somedata.arrow")

> df4 = vaex.open("somedata.csv")

> df_remote = vaex.open("http://try.vaex.io/nyc_taxi_2015")

The [vaex API documentation](https://vaex.readthedocs.io/en/latest/api.html#vaex.dataset.Dataset.to_pandas_df) describes these calls in detail; the link points to `to_pandas_df`, which is used later in this notebook.

In [1]:
# Installing the package
# !pip install vaex

## Why vaex

Performance: works with huge tabular data, processes >10^9 rows/second

Lazy / Virtual columns: compute on the fly, without wasting RAM

Memory efficient: no memory copies when doing filtering/selections/subsets

Visualization: directly supported, a one-liner is often enough

User friendly API: you will only need to deal with the DataFrame object, and tab completion + docstring will help you out: ds.mean<tab>, feels very similar to Pandas
    
Lean: separated into multiple packages:

* vaex-core: DataFrame and core algorithms, takes numpy arrays as input columns

* vaex-hdf5: Provides memory mapped numpy arrays to a DataFrame

* vaex-arrow: Arrow support for cross language data sharing

* vaex-viz: Visualization based on matplotlib

* vaex-jupyter: Interactive visualization based on Jupyter widgets / ipywidgets, bqplot, ipyvolume and ipyleaflet

* vaex-astro: Astronomy related transformations and FITS file support

* vaex-server: Provides a server to access a DataFrame remotely

* vaex-distributed: (Proof of concept) combined multiple servers / cluster into a single DataFrame for distributed computations

* vaex-qt: Program written using Qt GUI

* vaex: Meta package that installs all of the above

* vaex-ml: Machine learning

Jupyter integration: vaex-jupyter will give you interactive visualization and selection in the Jupyter notebook and Jupyter lab


In [2]:
# System settings
print("Importing ...")
import os
import pandas as pd
import psutil
import multiprocessing as mp

# Harvester methods/libraries
import requests
from bs4 import BeautifulSoup as bs
import vaex

# Display the Python version and operating-system-specific parameters
import sys
print(sys.version)

# Check the number of cores and memory usage
num_cores = mp.cpu_count()
print("This kernel has ",num_cores,"cores and memory usage of:",psutil.virtual_memory())

# # Check Dask and Hosting the diagnostics dashboard
# cluster = LocalCluster()
# client = Client(cluster)
# client

# Expands the visualization of a matrix
pd.set_option("display.max_columns", 300)
pd.set_option("display.max_rows", 300)

print("System ready to go!")


Importing ...
3.7.3 (default, Mar 27 2019, 16:54:48) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
This kernel has  4 cores and memory usage of: svmem(total=8589934592, available=2771288064, percent=67.7, used=4451819520, free=15011840, active=2757619712, inactive=2521808896, wired=1694199808)
System ready to go!


In [3]:
url = "https://towardsdatascience.com/archive/2020/04/30"
response = requests.get(url)
soup = bs(response.text, "html.parser")

In [4]:
# class_tweets =  "container" # "TweetAuthor js-inViewportScribingTarget" # "timeline-Header timeline-InformationCircle-widgetParent" 
# class_feeds = "_3dp _29k" # "post_message" 

In [5]:
class_stories_id = "postArticle postArticle--short js-postArticle js-trackPostPresentation js-trackPostScrolls"
class_author_class = "postMetaInline postMetaInline-authorLockup ui-captionStrong u-flex1 u-noWrapWithEllipsis"

In [6]:
tag = soup.findAll("div", {"class": class_stories_id})[0]
title = tag.find("h3").get_text()
title

'Resources I Wish I Knew When I Started Out With Data\xa0Science'

In [7]:
subtitle = tag.find("h4").get_text()
subtitle

'A Powerful Learning Guide for\xa0Serious…'

In [8]:
df_stories = pd.DataFrame(columns=["title", "subtitle", "author", "date", "reading_time", "claps"])
dates_list = pd.date_range("2020-05-24", "2020-06-04").astype(str).tolist()
dates_list[:5]

['2020-05-24', '2020-05-25', '2020-05-26', '2020-05-27', '2020-05-28']
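
The scraping loop further down turns each of these date strings into an archive URL by replacing `-` with `/`. As a minimal sketch of the same step, the dates can be formatted directly with `strftime`. The helper name `archive_urls` and the constant `BASE_URL` are hypothetical; the URL pattern is the one used in this notebook.

```python
import pandas as pd

# URL prefix taken from the notebook; the constant name is illustrative.
BASE_URL = "https://towardsdatascience.com/archive/"


def archive_urls(start, end):
    """Return one archive URL per day between start and end (inclusive)."""
    days = pd.date_range(start, end)
    return [BASE_URL + day.strftime("%Y/%m/%d") for day in days]


urls = archive_urls("2020-05-24", "2020-06-04")
print(len(urls), urls[0], urls[-1])
```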

In [9]:
def parse_tag_element(tag, element):
    """
    Parses tags and elements from inspected website.
    """
    try:
        story_title = tag.find(element).get_text()
    except AttributeError:
        return None
    return story_title

class_stories_id = "postArticle postArticle--short js-postArticle js-trackPostPresentation js-trackPostScrolls"
class_author_class = "postMetaInline postMetaInline-authorLockup ui-captionStrong u-flex1 u-noWrapWithEllipsis"

for date in dates_list:
    date = date.replace("-", "/")
    url = "https://towardsdatascience.com/archive/%s" % date
    print("Scraping", url)
    
    response = requests.get(url)
    soup = bs(response.text, "html.parser")
    
    for i, tag in enumerate(soup.findAll("div", {"class": class_stories_id}), 1):
        # parse story title
        story_title = ""
        for element in ["h3", "h2"]:
            title = parse_tag_element(tag, element)
            if title is not None:
                story_title = title
                break
        
        # parse story subtitle
        story_subtitle = ""
        for element in ["h4", "p"]:
            subtitle = parse_tag_element(tag, element)
            if subtitle is not None:
                story_subtitle = subtitle
                break
                
        author_tag = tag.find("div", {"class": class_author_class})
        author_name = author_tag.get_text(separator= ",").split(",")[0]
        reading_time = author_tag.find("span", {"class": "readingTime"})["title"]
        n_claps = tag.find("span", {"class": "u-relative u-background js-actionMultirecommendCount u-marginLeft5"}
                          ).get_text()
        row = {"title": story_title, "subtitle": story_subtitle, "author": author_name, "date": date,
              "reading_time": reading_time, "claps": n_claps}
        # DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement
        df_stories = pd.concat([df_stories, pd.DataFrame([row])], ignore_index=True)
#         print(i, story_title, story_subtitle, author_name, date, reading_time, n_claps)
        
        # save Dataframe in each iteration so that progress is not lost if something breaks
#         df_stories.to_csv("tds_stories.csv")


import numpy as np  # needed for the masked-array conversion below


# Defined at module level (not inside the scraping loop) so it is created once.
def from_pandas(df, name="pandas", copy_index=False, index_name="index"):
    """
    Create an in-memory vaex DataFrame from a pandas DataFrame.

    :param pandas.DataFrame df: pandas DataFrame
    :param name: unique name for the DataFrame

    >>> import vaex, pandas as pd
    >>> df_pandas = pd.read_csv('test.csv')
    >>> df = vaex.from_pandas(df_pandas)

    :rtype: DataFrame
    """
    vaex_df = vaex.dataframe.DataFrameArrays(name)

    def add(name, column):
        values = column.values
        # nullable integer columns are converted to NumPy masked arrays
        if isinstance(values, pd.core.arrays.integer.IntegerArray):
            values = np.ma.array(values._data, mask=values._mask)
        try:
            vaex_df.add_column(name, values)
        except Exception as e:
            print("could not convert column %s, error: %r, will try to convert it to string" % (name, e))
            try:
                values = values.astype("S")
                vaex_df.add_column(name, values)
            except Exception as e:
                print("Giving up column %s, error: %r" % (name, e))

    # use the df argument rather than the global df_stories
    for name in df.columns:
        add(name, df[name])
    if copy_index:
        add(index_name, df.index)
    return vaex_df


Scraping https://towardsdatascience.com/archive/2020/05/24
Scraping https://towardsdatascience.com/archive/2020/05/25
Scraping https://towardsdatascience.com/archive/2020/05/26
Scraping https://towardsdatascience.com/archive/2020/05/27
Scraping https://towardsdatascience.com/archive/2020/05/28
Scraping https://towardsdatascience.com/archive/2020/05/29
Scraping https://towardsdatascience.com/archive/2020/05/30
Scraping https://towardsdatascience.com/archive/2020/05/31
Scraping https://towardsdatascience.com/archive/2020/06/01
Scraping https://towardsdatascience.com/archive/2020/06/02
Scraping https://towardsdatascience.com/archive/2020/06/03
Scraping https://towardsdatascience.com/archive/2020/06/04
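
Growing a pandas DataFrame one row at a time copies the whole frame on every iteration. A common alternative, sketched below under the assumption that each scraped story is available as a dictionary, is to collect the rows in a plain list and build the DataFrame once at the end. The two sample rows reuse values shown in this notebook's output; `scraped_rows` is a hypothetical stand-in for the scraping loop.

```python
import pandas as pd

columns = ["title", "subtitle", "author", "date", "reading_time", "claps"]

# Stand-in for the scraping loop: in the notebook these dictionaries
# would be produced one by one while parsing each archive page.
scraped_rows = [
    {"title": "5 Books That Will Teach You the Math Behind Ma...",
     "subtitle": "A guide to the beautiful world of…",
     "author": "Tivadar Danka", "date": "2020/05/24",
     "reading_time": "5 min read", "claps": "608"},
    {"title": "A definitive guide for Setting up a Deep Learn...",
     "subtitle": "DL Rig",
     "author": "Rahul Agarwal", "date": "2020/05/24",
     "reading_time": "8 min read", "claps": "266"},
]

rows = []
for row in scraped_rows:
    rows.append(row)  # appending to a list does not copy earlier rows

# Build the DataFrame a single time, with a fixed column order.
df_rows = pd.DataFrame(rows, columns=columns)
print(df_rows.shape)
```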


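The `from_pandas` function defined above builds an in-memory vaex DataFrame (`vaex.dataframe.DataFrameArrays`) from the pandas DataFrame. The commented-out alternative, `vaex.from_csv` with `convert=True`, instead reads a .csv file in chunks, converts it to HDF5, and persists the result to disk.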

In [10]:
# path = "dataset.csv"
# dv = vaex.from_csv(path, convert=True, chunk_size=10_000)
vaex_df = from_pandas(df_stories)
type(vaex_df)

vaex.dataframe.DataFrameArrays

In [11]:
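# Call the method to get the list of column names (string columns included)
col_names = vaex_df.get_column_names(strings=True)

# Convert the vaex DataFrame back to pandas for inspection
df = vaex_df.to_pandas_df(col_names)
df.head(2)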

|   | title | subtitle | author | date | reading_time | claps |
|---|---|---|---|---|---|---|
| 0 | 5 Books That Will Teach You the Math Behind Ma... | A guide to the beautiful world of… | Tivadar Danka | 2020/05/24 | 5 min read | 608 |
| 1 | A definitive guide for Setting up a Deep Learn... | DL Rig | Rahul Agarwal | 2020/05/24 | 8 min read | 266 |


In [12]:
df.shape

(885, 6)

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 885 entries, 0 to 884
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         885 non-null    object
 1   subtitle      885 non-null    object
 2   author        885 non-null    object
 3   date          885 non-null    object
 4   reading_time  885 non-null    object
 5   claps         885 non-null    object
dtypes: object(6)
memory usage: 41.6+ KB
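
Every column is stored as `object` (string), so `reading_time` and `claps` cannot be aggregated or sorted numerically as they stand. The sketch below shows one way to derive numeric columns. It assumes reading times look like "5 min read" and that large clap counts may be abbreviated with a "K" suffix; the "1.2K" value is made up to illustrate that case and does not come from the scraped data. The names `parse_claps`, `reading_minutes` and `claps_count` are hypothetical.

```python
import pandas as pd


def parse_claps(text):
    """Convert a claps string such as '608' or '1.2K' to an integer."""
    text = text.strip().upper()
    if text.endswith("K"):
        # Assumed abbreviation: '1.2K' means 1200 claps.
        return int(round(float(text[:-1]) * 1000))
    return int(text)


sample = pd.DataFrame({
    "reading_time": ["5 min read", "8 min read", "4 min read"],
    "claps": ["608", "266", "1.2K"],  # '1.2K' is an invented example value
})

# Pull the leading number out of strings like '5 min read'.
sample["reading_minutes"] = (
    sample["reading_time"].str.extract(r"(\d+)", expand=False).astype(int)
)
sample["claps_count"] = sample["claps"].apply(parse_claps)
print(sample.dtypes)
```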


In [14]:
# Checking if the scraping algorithm was able to scrape all the titles
print("Number of Null values: ", df_stories[df_stories.title.isnull()].shape)

# Count the rows that do have a title (filtering with notnull() would drop any rows with a missing title):
print("Number of values: ", df_stories[df_stories.title.notnull()].shape)

Number of Null values:  (0, 6)
Number of values:  (885, 6)


In [19]:
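# Duplicates: list the later occurrences of titles that appear more than once
# (this cell was run after In [17], which adds the id column shown in the output)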
df[df.title.duplicated()].sort_values("claps", ascending=False)

|   | title | subtitle | author | date | reading_time | claps | id |
|---|---|---|---|---|---|---|---|
| 226 | How ISIS Uses Twitter | Using Social Network Analysis and Community De... | Mitchell Telatnik | 2020/05/26 | 4 min read | 46 | 1262484266718225157 |
| 609 | Predicting Reddit Flairs using Machine Learnin... |  | Prakhar Rathi | 2020/05/31 | 8 min read | 14 | 3406829216753561466 |
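
By default `duplicated()` flags only the second and later occurrences of a value, so the first copy of each repeated title is not listed above. The sketch below, on a small invented frame, shows how `keep=False` reveals every copy and how `drop_duplicates` keeps just one. The dates and the second title are illustrative values, not scraped data.

```python
import pandas as pd

stories = pd.DataFrame({
    "title": ["How ISIS Uses Twitter", "Another Story", "How ISIS Uses Twitter"],
    "date": ["2020/05/25", "2020/05/26", "2020/05/26"],  # illustrative dates
})

# Default behaviour: only the later occurrence is flagged.
later_copies = stories[stories.title.duplicated()]

# keep=False flags every row whose title occurs more than once.
all_copies = stories[stories.title.duplicated(keep=False)]

# Keep the first occurrence of each title and drop the rest.
deduplicated = stories.drop_duplicates(subset="title", keep="first")
print(len(later_copies), len(all_copies), len(deduplicated))
```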


In [16]:
# Check each column for nulls or single-space values.
# Note: this test does not catch empty strings (""), which the scraper uses as its default.

for i in df_stories.columns:
#     print(i)
    print("Column ", i, ": ", df_stories[(df_stories[i].isnull()) | (df_stories[i] == " ")])
    

Column  title :  Empty DataFrame
Columns: [title, subtitle, author, date, reading_time, claps]
Index: []
Column  subtitle :  Empty DataFrame
Columns: [title, subtitle, author, date, reading_time, claps]
Index: []
Column  author :  Empty DataFrame
Columns: [title, subtitle, author, date, reading_time, claps]
Index: []
Column  date :  Empty DataFrame
Columns: [title, subtitle, author, date, reading_time, claps]
Index: []
Column  reading_time :  Empty DataFrame
Columns: [title, subtitle, author, date, reading_time, claps]
Index: []
Column  claps :  Empty DataFrame
Columns: [title, subtitle, author, date, reading_time, claps]
Index: []
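
The check above compares against a single space, so a value that is an empty string passes unnoticed (the scraper defaults missing titles and subtitles to `""`). A minimal sketch of a stricter check, on an invented sample frame, treats nulls, empty strings and whitespace-only strings alike. The helper name `count_blank` is hypothetical.

```python
import pandas as pd

# Invented sample containing a null, an empty string and a whitespace-only value.
sample = pd.DataFrame({
    "title": ["Story A", "Story B", "Story C", None],
    "subtitle": ["Sub A", "", "   ", "Sub D"],
})


def count_blank(column):
    """Count values that are null, empty, or whitespace-only."""
    as_text = column.fillna("").astype(str)
    return int((as_text.str.strip() == "").sum())


blank_counts = {name: count_blank(sample[name]) for name in sample.columns}
print(blank_counts)
```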


A great way to check for duplicates across many columns is to concatenate the values of each row, hash the result, and check for duplicated hashes. The downside of this method is that it does not handle null values well: concatenating a string with a null yields a null for the whole row.
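
Two further caveats are worth keeping in mind. Python's built-in `hash` for strings is randomized per interpreter process unless `PYTHONHASHSEED` is set, so the resulting ids are not comparable between sessions, and plain concatenation cannot tell `"ab" + "c"` from `"a" + "bc"`. The sketch below is one possible alternative, not the method used in this notebook: it joins the fields with a separator character, maps nulls to empty strings, and uses `hashlib.sha256` for a stable digest. The sample rows and the helper name `row_digest` are invented for illustration.

```python
import hashlib

import pandas as pd


def row_digest(row):
    """Return a stable SHA-256 hex digest of all values in a row."""
    # The unit-separator character keeps adjacent fields from running together;
    # nulls are mapped to empty strings so they do not break the join.
    joined = "\x1f".join("" if pd.isna(value) else str(value) for value in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


sample = pd.DataFrame({
    "title": ["Story A", "Story B", "Story A", "Story C"],
    "subtitle": ["Sub A", "Sub B", "Sub A", None],
    "claps": ["10", "20", "10", "30"],
})

sample["id"] = sample.apply(row_digest, axis=1)
dupes = sample[sample.id.duplicated()]
print(dupes.index.tolist())
```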

In [17]:
df["id"] = (df.title + df.subtitle + df.author + df.date + df.reading_time + df.claps).apply(hash)
df[df.id.duplicated()]

|   | title | subtitle | author | date | reading_time | claps | id |
|---|---|---|---|---|---|---|---|


In [18]:
# Final check and Saving the dataframe to a .csv file
print("The shape of a validated dataset is: ", df.shape , "and column names are: \n", df.columns)
# df.to_csv("TowardsDataScience_validated.csv", index=False, 
#           columns=["title", "subtitle", "author", "date", "reading_time", "claps"])

The shape of a validated dataset is:  (885, 7) and column names are: 
 Index(['title', 'subtitle', 'author', 'date', 'reading_time', 'claps', 'id'], dtype='object')
