<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#VAEX" data-toc-modified-id="VAEX-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>VAEX</a></span><ul class="toc-item"><li><span><a href="#Vaex-out-of-core" data-toc-modified-id="Vaex-out-of-core-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Vaex out-of-core</a></span></li><li><span><a href="#Why-vaex" data-toc-modified-id="Why-vaex-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Why vaex</a></span></li></ul></li></ul></div>

A web-harvesting example that scrapes the [Towards Data Science](https://towardsdatascience.com) archive and loads the results with the Vaex Python library.

# VAEX

## Vaex out-of-core

Vaex is a Python library for lazy out-of-core DataFrames (similar to pandas), built to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, and standard deviation on an N-dimensional grid at up to a billion (10<sup>9</sup>) objects/rows per second. Visualization is done using histograms, density plots, and 3D volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero-memory-copy policy, and lazy computations for best performance (no memory wasted).

The most important class (data structure) in vaex is the DataFrame. A DataFrame is obtained by either opening a dataset or connecting to a remote server.

> df1 = vaex.open("somedata.hdf5")

> df2 = vaex.open("somedata.fits")

> df3 = vaex.open("somedata.arrow")

> df4 = vaex.open("somedata.csv")

> df_remote = vaex.open("http://try.vaex.io/nyc_taxi_2015")

See the [Vaex documentation](https://vaex.readthedocs.io/en/latest/api.html#vaex.dataset.Dataset.to_pandas_df) for the full API.

In [1]:
# Installing the package
# !pip install vaex

## Why vaex

Performance: works with huge tabular data, processes >10<sup>9</sup> rows/second

Lazy / Virtual columns: compute on the fly, without wasting RAM

Memory efficient: no memory copies when doing filtering/selections/subsets

Visualization: directly supported, a one-liner is often enough

User friendly API: you will only need to deal with the DataFrame object, and tab completion + docstring will help you out: ds.mean<tab>, feels very similar to Pandas
    
Lean: separated into multiple packages:

* vaex-core: DataFrame and core algorithms, takes numpy arrays as input columns

* vaex-hdf5: Provides memory mapped numpy arrays to a DataFrame

* vaex-arrow: Arrow support for cross language data sharing

* vaex-viz: Visualization based on matplotlib

* vaex-jupyter: Interactive visualization based on Jupyter widgets / ipywidgets, bqplot, ipyvolume and ipyleaflet

* vaex-astro: Astronomy related transformations and FITS file support

* vaex-server: Provides a server to access a DataFrame remotely

* vaex-distributed: (Proof of concept) combines multiple servers / a cluster into a single DataFrame for distributed computations

* vaex-qt: Program written using Qt GUI

* vaex: Meta package that installs all of the above

* vaex-ml: Machine learning

Jupyter integration: vaex-jupyter will give you interactive visualization and selection in the Jupyter notebook and Jupyter lab


In [2]:
import requests
from bs4 import BeautifulSoup as bs
import numpy as np  # used by from_pandas for masked integer arrays
import pandas as pd
import vaex



In [3]:
url = "https://towardsdatascience.com/archive/2020/04/30"
response = requests.get(url)
soup = bs(response.text, "html.parser")

In [4]:
# class_tweets =  "container" # "TweetAuthor js-inViewportScribingTarget" # "timeline-Header timeline-InformationCircle-widgetParent" 
# class_feeds = "_3dp _29k" # "post_message" 

In [5]:
class_stories_id = "postArticle postArticle--short js-postArticle js-trackPostPresentation js-trackPostScrolls"
class_author_class = "postMetaInline postMetaInline-authorLockup ui-captionStrong u-flex1 u-noWrapWithEllipsis"

In [6]:
tag = soup.findAll("div", {"class": class_stories_id})[0]
title = tag.find("h3").get_text()
title

'Resources I Wish I Knew When I Started Out With Data\xa0Science'

In [7]:
subtitle = tag.find("h4").get_text()
subtitle

'A Powerful Learning Guide for\xa0Serious…'

In [8]:
df_stories = pd.DataFrame(columns=["title", "subtitle", "author", "date", "reading_time", "claps"])
dates_list = pd.date_range("2020-05-24", "2020-06-04").astype(str).tolist()
dates_list[:5]

['2020-05-24', '2020-05-25', '2020-05-26', '2020-05-27', '2020-05-28']
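The date strings above are later turned into archive URLs by replacing `-` with `/`. As a minimal sketch of the same idea using only the standard library, the helper below builds one URL per day for an inclusive date range. The function name `archive_urls` and the constant `BASE_URL` are illustrative and are not part of the original notebook.

```python
from datetime import date, timedelta

# Assumption: the archive keeps the /archive/YYYY/MM/DD layout used in this notebook.
BASE_URL = "https://towardsdatascience.com/archive"


def archive_urls(start, end):
    """Yield (day, url) pairs for every day from start to end, inclusive."""
    n_days = (end - start).days
    for offset in range(n_days + 1):
        day = start + timedelta(days=offset)
        # date objects accept strftime codes directly in a format spec
        yield day, f"{BASE_URL}/{day:%Y/%m/%d}"


urls = list(archive_urls(date(2020, 5, 24), date(2020, 6, 4)))
print(len(urls))
print(urls[0][1])
```

Building the URL from a `date` object avoids the separate string replacement step and keeps the real date available for later use.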

In [9]:
def parse_tag_element(tag, element):
    """
    Parses tags and elements from inspected website.
    """
    try:
        story_title = tag.find(element).get_text()
    except AttributeError:
        return None
    return story_title

class_stories_id = "postArticle postArticle--short js-postArticle js-trackPostPresentation js-trackPostScrolls"
class_author_class = "postMetaInline postMetaInline-authorLockup ui-captionStrong u-flex1 u-noWrapWithEllipsis"

for date in dates_list:
    date = date.replace("-", "/")
    url = "https://towardsdatascience.com/archive/%s" % date
    print("Scrapping", url)
    
    response = requests.get(url)
    soup = bs(response.text, "html.parser")
    
    for i, tag in enumerate(soup.findAll("div", {"class": class_stories_id}), 1):
        # parse story title
        story_title = ""
        for element in ["h3", "h2"]:
            title = parse_tag_element(tag, element)
            if title is not None:
                story_title = title
                break
        
        # parse story subtitle
        story_subtitle = ""
        for element in ["h4", "p"]:
            subtitle = parse_tag_element(tag, element)
            if subtitle is not None:
                story_subtitle = subtitle
                break
                
        author_tag = tag.find("div", {"class": class_author_class})
        author_name = author_tag.get_text(separator= ",").split(",")[0]
        reading_time = author_tag.find("span", {"class": "readingTime"})["title"]
        n_claps = tag.find("span", {"class": "u-relative u-background js-actionMultirecommendCount u-marginLeft5"}
                          ).get_text()
        row = {"title": story_title, "subtitle": story_subtitle, "author": author_name, "date": date,
              "reading_time": reading_time, "claps": n_claps}
        df_stories = df_stories.append(row, ignore_index=True)
        print(i, story_title, story_subtitle, author_name, date, reading_time, n_claps)
        
        # save Dataframe in each iteration so that progress is not lost if something breaks
#         df_stories.to_csv("tds_stories.csv")
        
# Defined once at module level instead of being redefined for every scraped story.
def from_pandas(df, name="pandas", copy_index=False, index_name="index"):
    """
    Create an in-memory vaex DataFrame from a pandas DataFrame.

    :param pandas.DataFrame df: pandas DataFrame to convert
    :param str name: unique name for the DataFrame
    :param bool copy_index: also copy the pandas index as a column
    :param str index_name: column name used for the copied index

    >>> import vaex, pandas as pd
    >>> df_pandas = pd.read_csv('test.csv')
    >>> df = vaex.from_pandas(df_pandas)

    :rtype: DataFrame
    """
    vaex_df = vaex.dataframe.DataFrameArrays(name)

    def add(name, column):
        values = column.values
        # Nullable integer columns become masked arrays (requires numpy as np).
        if isinstance(values, pd.core.arrays.integer.IntegerArray):
            values = np.ma.array(values._data, mask=values._mask)
        try:
            vaex_df.add_column(name, values)
        except Exception as e:
            print("could not convert column %s, error: %r, will try to convert it to string" % (name, e))
            try:
                values = values.astype("S")
                vaex_df.add_column(name, values)
            except Exception as e:
                print("Giving up column %s, error: %r" % (name, e))

    # Use the df argument rather than the global df_stories.
    for name in df.columns:
        add(name, df[name])
    if copy_index:
        add(index_name, df.index)
    return vaex_df
        
        

Scrapping https://towardsdatascience.com/archive/2020/05/24
1 5 Books That Will Teach You the Math Behind Machine Learning A guide to the beautiful world of… Tivadar Danka 2020/05/24 5 min read 608
2 A definitive guide for Setting up a Deep Learning Workstation with Ubuntu 18.04 DL Rig Rahul Agarwal 2020/05/24 8 min read 266
3 How to Accelerate your Data Science Career by Putting yourself in the Right Environment  Khuyen Tran 2020/05/24 7 min read 70
4 Quick Python Tip: Suppress Known Exception Without Try Except Handle known exceptions in a more… Christopher Tao 2020/05/24 5 min read 212
5 Complete Data Engineer’s Vocabulary DATA ENGINEERING Kovid Rathee 2020/05/24 7 min read 492
6 The Complete Guide to Linear Regression Analysis This article is about understanding the linear… Abhay Jidge 2020/05/24 11 min read 37
7 You Should Become a Data Scientist. Here’s Why. Opinion Matt Przybyla 2020/05/24 5 min read 310
8 All About Python List Comprehension Elegant, comfortable, concise, and fa

65 Processing large data files with python multithreading We spend a lot of time waiting for some data… Miguel Albrecht 2020/05/24 3 min read 6
66 How to upload your R code on GitHub: example with an R script on MacOS See a step-by-step guide (with… Antoine Soetewey 2020/05/24 4 min read 4
67 One mouse click to label an image within a Jupyter Notebook Key to training an image recognition model… Gareth Morinan 2020/05/24 2 min read 54
68 GLR with Python and scikit-learn library An Introduction to Generalized Linear Regression Guillaume Androz 2020/05/24 7 min read 1
69 Introduction to regression techniques in Machine Learning for beginners. Learning and implementing the… Kabirkhanna 2020/05/24 5 min read 111
70 Assertive Programming in R Your code should work as intended or fail immediately Denis Gontcharov 2020/05/24 4 min read 7
71 Word Embeddings and Embedding Projector of TensorFlow Theoretical explanation and a practical example. Soner Yıldırım 2020/05/24 5 min read 5
72 Entropy an

77 Writing Professional Data Science Documentation  Adam Gajtkowski 2020/05/25 3 min read 13
78 Supervised vs Unsupervised Machine Learning Artificial Intelligence & Education Marco Santos 2020/05/25 6 min read 1
79 StayAtHome — A Story of COVID-19 An Analysis of Trend and Perspective about StayAtHome Campaign Robert 2020/05/25 6 min read 9
80 API Private Information Analyzer Privacy is all about data management. You need to know which type of… Manu Cohen Yashar 2020/05/25 3 min read 1
81 “Now, how am I actually going to do this?” You’ve got your great project idea, now what? Kate Christensen 2020/05/25 3 min read 3
82 Beware of Using the Casefold() Method when Dealing with Strings in Python Although string handling in… Christos Zeglis 2020/05/25 3 min read 
83 Don’t Be an Expandable Data Scientist! 3 tips to be the rock star they can’t afford to lose. Karim Lahrichi 2020/05/25 5 min read 1
84 Keeping things .secret A guide to hiding your API keys from an army of hackers. Ravi Malde 20

73 What are People Asking About COVID-19? A New Question Classification Dataset COVID-Q is a new dataset… Jerry Wei 2020/05/26 6 min read 52
74 Scikit-learn Linear Regression for Predicting Golf Performance How I built a simple tool for helping players understand their skills using data. Mark Vrahas 2020/05/26 6 min read 5
75 Pneumonia Detection using Convolutional Neural Network A detailed description of how to build a simple… Nischal Madiraju 2020/05/26 8 min read 18
76 4 things I learned recruiting with Bayesian inference  Nicholas Heal 2020/05/26 7 min read 3
77 Silhouette Coefficient : Validating clustering techniques This is my first medium story, so please… Ashutosh Bhardwaj 2020/05/26 3 min read 3
78 How-To: A Color-Coded, Segmented Bar Graph I’ve been doing a lot of COVID-19 analysis recently. Over the weekend I wanted to… barrysmyth 2020/05/26 6 min read 1
79 A Swift Introduction To Metaprogramming in Julia The basics of using Julia’s Meta package for… Emmett Boudreau 2020/05

71 Image Classification Using TensorFlow in Python What is Image Classification and how can we use… Cvetanka Eftimoska 2020/05/27 6 min read 1
72 Statistical Test for Time Series It determines whether the model is ready to use or not. Irfan Alghani Khalid 2020/05/27 6 min read 3
73 Overall Equipment Effectiveness with Python Python for Industrial Engineers Roberto Salazar 2020/05/27 5 min read 5
74 Decision Trees: 6 important things to always remember All we need to know about Decision trees without… Sivakar Siva 2020/05/27 5 min read 11
75 Collaboration will unlock AI’s business value For AI, collaboration is today, what computing power was… Prajakta Kharkar Nigam 2020/05/27 4 min read 63
76 Dockerize, Deploy, and Call my SDR Email Detector Model via API  Rodrigo Fuentes 2020/05/27 7 min read 5
77 Predicting Reddit Flairs using Machine Learning and Deploying the Model using Heroku  Prakhar Rathi 2020/05/27 12 min read 76
78 Interaction analyses — Appropriately adjusting for control va

73 Analyzing “Tilt” to Win More Games (League of Legends)  Jack J 2020/05/28 5 min read 3
74 Improving Classifier Performance by Changing the Difficulty of Images We propose a difficulty… Jerry Wei 2020/05/28 3 min read 11
75 Stacked Bar Graphs, Why & How  Darío Weitz 2020/05/28 8 min read 6
76 5 Statistical Functions in PyTorch PyTorch functions useful for machine learning Anurag Lahon 2020/05/28 3 min read 53
77 Looking Inside Mahalanobis Metric Matching How does it actually work? Bowen Chen 2020/05/28 4 min read 
78 Amazing Free Geolocation Alternative to Google Maps To young age startups and small businesses… Kapil Raghuwanshi 2020/05/28 3 min read 16
79 How to Formulate Good Research Question for Data Analysis Learn how to ask good questions to bring… Rashida Nasrin Sucky 2020/05/28 3 min read 
80 Several Different Ways to Combine Datasets in SAS Simple tutorial to explain in SAS studio Sydney Chen 2020/05/28 9 min read 76
81 How to efficiently design machine learning system Key i

63 Logistic Regression A gentle introduction to Logistic Regression Sangeet Aggarwal 2020/05/29 7 min read 102
64 Why was Pippen such a bargain to the Chicago Bulls? Despite his invaluable contribution to the team… Everton Almeida 2020/05/29 9 min read 
65 Fairness in Decision-Making for Criminal Justice Do machine learning algorithms ensure fairness in the criminal justice system, or do they perpetuate inequality? Andy Mandrell 2020/05/29 8 min read 216
66 Interactive Distribution Plots with Plotly How to create informative distribution plots Soner Yıldırım 2020/05/29 5 min read 32
67 Enhance Power BI report with Tooltip Pages Find out various tips on how Tooltip Pages can enhance your… Nikola Ilic 2020/05/29 6 min read 10
68 Aaron Mayer: Empowering Engineers to Build a Better World Innovators in Technology Series Amber Teng 2020/05/29 9 min read 125
69 Logistics Process Models for Automated Integration Testing A Case-Study on Continuous Integration and… Wladimir Hofmann 2020/05/29 5 

52 Why Big Data? Why Big Data, Where did it Come from and the Value of it. Tharuka KasthuriArachchi 2020/05/30 7 min read 292
53 Analyzing #WhenTrumpIsOutOfOffice tweets A step-by-step guide to cleaning and analyzing tweets in R Feng Lim 2020/05/30 9 min read 2
54 “The devil is in the details”. Machine Learning does not escape from it. Humans are the only animal… Oscar GR 2020/05/30 4 min read 
55 Building Linear Regression models with Alteryx In the business analytics field, Alteryx has not only… Vishal Sharma 2020/05/30 4 min read 15
Scrapping https://towardsdatascience.com/archive/2020/05/31
1 3 Reasons why you Shouldn’t become a Data Scientist Opinion Dario Radečić 2020/05/31 5 min read 608
2 How to Build a Data Science Portfolio Website Showcasing your work — with a website from scratch Julia Nikulski 2020/05/31 6 min read 824
3 A Single Line of Python Code Scraping Dataset from Webpages Hunting for API endpoints from webpages… Christopher Tao 2020/05/31 6 min read 571
4 The Seven

1 10 Smooth Python Tricks For Python Gods 10 Tricks that will individualize and better your Python code Emmett Boudreau 2020/06/01 6 min read 1.5K
2 How to process a DataFrame with billions of rows in seconds Yet another Python library for Data… Roman Orac 2020/06/01 6 min read 1.97K
3 Web App Development in Python Introduction to building a front-end user experience Roman Paolucci 2020/06/01 3 min read 265
4 Let’s Code Convolutional Neural Network in plain NumPy Mysteries of Neural Networks Piotr Skalski 2020/06/01 10 min read 598
5 Can you land your first data science job without having a MOOC certificate? Yes, you can, and here’s… Srishti Singh 2020/06/01 5 min read 298
6 You Want to Learn Rust but You Don’t Know Where to Start A Complete Resource for Rust Beginners Shinichi Okada 2020/06/01 9 min read 304
7 Ultimate Guide to Python Debugging Let’s explore the Art of debugging using Python logging… Martin Heinz 2020/06/01 6 min read 230
8 How to Design For Panic Resilience in Rust E

74 Multiclass Classification With Logistic Regression One vs All Method From Scratch Using Python  Rashida Nasrin Sucky 2020/06/01 4 min read 3
75 Explaining AI to children How to prepare our younger generation for the challenges of tomorrow! Alexiei Dingli 2020/06/01 5 min read 10
76 AI & Arbitration of Truth Can we make an AI fact-checker? Language, knowledge, and opinions are all… Nathan Lambert 2020/06/01 7 min read 11
77 Investigating a dataset using Pandas and seaborn  Vishal Sharma 2020/06/01 4 min read 5
78 Feeding The Machine — How We Digitize The World For Them. Thanks to COVID19. Michel Kana 2020/06/01 7 min read 1
79 Create A Machine Learning Model With Google Cloud BigQuery ML Using SQL  Aakash Rathor 2020/06/01 5 min read 3
80 Understanding Deep Associative Embedding in Convolutional Neural Networks An elegant method to group… Shuchen Du 2020/06/01 4 min read 2
81 Lovecraft with Natural Language Processing — Part 1: Rule-Based Sentiment Analysis  Mate Pocs 2020/06/01 10 m

74 Clustered & Overlapped Bar Charts  Darío Weitz 2020/06/02 6 min read 1
75 Find the Intersection of Two Sets of Coordinates and Sort By Colors Using Python OOP  Rashida Nasrin Sucky 2020/06/02 3 min read 4
76 AI-Based Fuzzing (AIF) Fuzzing refers to the process of using semi-valid input in a computer program to verify exceptions to behavior, memory… Ensar Seker 2020/06/02 4 min read 2
77 The Balance-Sample Size Frontier in Matching What is it, and how does it work? Bowen Chen 2020/06/02 7 min read 
Scrapping https://towardsdatascience.com/archive/2020/06/03
Scrapping https://towardsdatascience.com/archive/2020/06/04
1 How I Land My First Job in Data Science The transition from a postdoc in microbiology to a data… Jun 2020/06/04 9 min read 92
2 Topic Modeling in Power BI using PyCaret A step-by-step tutorial for implementing Topic Model in Power… Moez Ali 2020/06/04 7 min read 155
3 Machine Learning Classifiers Comparison with Python Python for Machine Learning Roberto Salazar 2020/06
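The scraping cell mentions saving progress on each iteration so that nothing is lost if a request fails. One possible way to do that without rewriting the whole file every time is to append each story to a CSV file as soon as it is parsed. The sketch below uses only the standard library; `append_row`, `FIELDS`, and the demo rows are hypothetical and are not taken from the notebook.

```python
import csv
import os
import tempfile

FIELDS = ["title", "subtitle", "author", "date", "reading_time", "claps"]


def append_row(path, row):
    """Append one story to a CSV file, writing the header only once."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


# Demonstration with a throwaway file and made-up rows.
demo_path = os.path.join(tempfile.mkdtemp(), "tds_stories_demo.csv")
append_row(demo_path, {"title": "Example A", "subtitle": "", "author": "Jane Doe",
                       "date": "2020/05/24", "reading_time": "5 min read", "claps": "12"})
append_row(demo_path, {"title": "Example B", "subtitle": "Intro", "author": "John Doe",
                       "date": "2020/05/25", "reading_time": "3 min read", "claps": "1.5K"})

with open(demo_path, newline="", encoding="utf-8") as handle:
    saved_rows = list(csv.DictReader(handle))
print(len(saved_rows))
```

Appending rows to a file also sidesteps growing a DataFrame one row at a time, which copies the existing data on every call.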

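`vaex.from_csv` with `convert=True` reads a .csv file in chunks, converts it to HDF5, and persists the result to disk. The `from_pandas` helper defined above instead builds an in-memory vaex DataFrame (`vaex.dataframe.DataFrameArrays`) from the scraped pandas DataFrame.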

In [10]:
# path = "dataset.csv"
# dv = vaex.from_csv(path, convert=True, chunk_size=10_000)
vaex_df = from_pandas(df_stories)
type(vaex_df)

vaex.dataframe.DataFrameArrays

In [11]:
# Call get_column_names() to obtain the list of names, then convert back to pandas.
col_names = vaex_df.get_column_names()

df = vaex_df.to_pandas_df(col_names)
type(df)
df.head(2)

Unnamed: 0,title,subtitle,author,date,reading_time,claps
0,5 Books That Will Teach You the Math Behind Ma...,A guide to the beautiful world of…,Tivadar Danka,2020/05/24,5 min read,608
1,A definitive guide for Setting up a Deep Learn...,DL Rig,Rahul Agarwal,2020/05/24,8 min read,266


In [12]:
df.shape

(830, 6)

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 830 entries, 0 to 829
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         830 non-null    object
 1   subtitle      830 non-null    object
 2   author        830 non-null    object
 3   date          830 non-null    object
 4   reading_time  830 non-null    object
 5   claps         830 non-null    object
dtypes: object(6)
memory usage: 39.0+ KB


In [14]:
# Checking if the scraping algorithm was able to scrape all the titles
print("Number of Null values: ", df_stories[df_stories.title.isnull()].shape)

# In case of any missing values, remove them:
print("Number of values: ", df_stories[df_stories.title.notnull()].shape)

Number of Null values:  (0, 6)
Number of values:  (830, 6)


In [15]:
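# Rows whose author already appeared earlier (repeat authors).
# Note: claps are still strings here, so this sort is lexicographic, not numeric.
df[df.author.duplicated()].sort_values("claps", ascending=False)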

Unnamed: 0,title,subtitle,author,date,reading_time,claps
559,Productive NLP Experimentation with Python usi...,How to use Pytorch…,Arie Pratama Sutiono,2020/05/31,7 min read,96
802,The Ultimate Glossary of Data Science,Beginners-friendly A-Z Guide.,Oleksii Kharkovyna,2020/06/04,12 min read,93
598,Can Ronaldo Score a Goal? Let us find out usin...,"In this article, I will go through on how we…",Rajat Keshri,2020/05/31,8 min read,90
593,Batch vs Stochastic Gradient Descent,Learn difference between Batch & Stochastic De...,Amar Mandal,2020/05/31,4 min read,9
115,A Practical Guide for Exploratory Data Analysi...,Exploring 2019–2020 season of…,Soner Yıldırım,2020/05/25,7 min read,9
...,...,...,...,...,...,...
776,Simple Guide to Choropleth Maps,Choropleth Maps using Plotly to track COVID 19...,Shraddha Anala,2020/06/02,3 min read,
539,“The devil is in the details”. Machine Learnin...,Humans are the only animal…,Oscar GR,2020/05/30,4 min read,
752,Matplotlib: Tutorial (with code) for Python’s ...,Explaining the basics…,Maurizio Sluijmers,2020/06/02,8 min read,
664,Hypothesis Testing along with Type I & Type II...,How to select the right test…,Ratul Ghosh,2020/06/01,6 min read,
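In the table above, `claps` holds strings, so the sort is lexicographic: `96` comes before `9`, and labels such as `1.5K` are not treated as numbers. A minimal sketch of a conversion step follows. It assumes that an empty label means zero claps and that `K` is the only suffix that occurs; `parse_claps` is an illustrative helper, not part of the notebook.

```python
def parse_claps(text):
    """Convert a clap label such as '608', '1.5K' or '' to an integer."""
    text = text.strip().upper()
    if not text:
        return 0  # assumption: a missing label means no claps
    if text.endswith("K"):
        return int(round(float(text[:-1]) * 1000))
    return int(text)


labels = ["96", "9", "1.97K", "", "608"]

# String sort, as in the cell above
print(sorted(labels, reverse=True))                  # ['96', '9', '608', '1.97K', '']
# Numeric sort after conversion
print(sorted(labels, key=parse_claps, reverse=True)) # ['1.97K', '608', '96', '9', '']
```

With pandas, the same function could be applied to the column (for example with `df.claps.apply(parse_claps)`) before sorting.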


In [16]:
# Verifying if columns don't have nulls or empty string values.

for i in df_stories.columns:
#     print(i)
    print("Column ", i, ": ", df_stories[(df_stories[i].isnull()) | (df_stories[i] == " ")])
    

Column  title :  Empty DataFrame
Columns: [title, subtitle, author, date, reading_time, claps]
Index: []
Column  subtitle :  Empty DataFrame
Columns: [title, subtitle, author, date, reading_time, claps]
Index: []
Column  author :  Empty DataFrame
Columns: [title, subtitle, author, date, reading_time, claps]
Index: []
Column  date :  Empty DataFrame
Columns: [title, subtitle, author, date, reading_time, claps]
Index: []
Column  reading_time :  Empty DataFrame
Columns: [title, subtitle, author, date, reading_time, claps]
Index: []
Column  claps :  Empty DataFrame
Columns: [title, subtitle, author, date, reading_time, claps]
Index: []
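The check above compares each value with a single space (`" "`), so it would not flag truly empty strings such as the blank `claps` values visible in the duplicates table. As a hedged sketch, the helper below treats `None`, empty strings, and whitespace-only strings as blank. The names `is_blank` and `count_blanks` and the sample rows are made up for illustration.

```python
def is_blank(value):
    """Return True for None and for strings that are empty or whitespace only."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def count_blanks(rows, columns):
    """Count blank values per column in a list of row dictionaries."""
    return {col: sum(is_blank(row.get(col)) for row in rows) for col in columns}


rows = [
    {"title": "Example A", "subtitle": "", "claps": "12"},
    {"title": "Example B", "subtitle": "Intro", "claps": ""},
    {"title": "Example C", "subtitle": " ", "claps": None},
]

blank_counts = count_blanks(rows, ["title", "subtitle", "claps"])
print(blank_counts)  # {'title': 0, 'subtitle': 2, 'claps': 2}
```

On a pandas column of strings, an equivalent test would be `df[col].isnull() | (df[col].str.strip() == "")`.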


A great way to check for duplicates with many columns is to combine the values, calculate the hash
and check for duplicated hashes. The downside of this method is that it doesn't work well with null
values.

In [17]:
df["id"] = (df.title + df.subtitle + df.author + df.date + df.reading_time + df.claps).apply(hash)
df[df.id.duplicated()]

Unnamed: 0,title,subtitle,author,date,reading_time,claps,id
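Two caveats apply to the hashing approach above: Python's built-in `hash` for strings changes between interpreter sessions, and plain concatenation cannot tell `"ab" + "c"` from `"a" + "bc"`. A possible variant, sketched below, joins the fields with a separator and uses `hashlib` so the identifier is stable across runs. `row_fingerprint` and the choice of the unit-separator character are assumptions for illustration, not the notebook's method.

```python
import hashlib


def row_fingerprint(row, columns, sep="\x1f"):
    """Return a stable SHA-256 hex digest for the selected columns of a row."""
    # None becomes an empty string so missing values do not raise an error.
    parts = ["" if row.get(col) is None else str(row[col]) for col in columns]
    return hashlib.sha256(sep.join(parts).encode("utf-8")).hexdigest()


columns = ["title", "author"]
first = {"title": "ab", "author": "c"}
second = {"title": "a", "author": "bc"}   # same concatenation, different fields
third = {"title": "ab", "author": "c"}    # true duplicate of `first`
missing = {"title": "ab", "author": None}

print(row_fingerprint(first, columns) == row_fingerprint(second, columns))  # False
print(row_fingerprint(first, columns) == row_fingerprint(third, columns))   # True
```

Because the digest depends only on the field values, it can also be stored and compared against rows scraped in a later session.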


In [18]:
# Final check and Saving the dataframe to a .csv file
print("The shape of a validated dataset is: ", df.shape , "and the column names are: \n", df.columns)
# df.to_csv("TowardsDataScience_validated.csv", index=False, 
#           columns=["title", "subtitle", "author", "date", "reading_time", "claps"])

The shape of a validated dataset is:  (830, 7) and the column names are: 
 Index(['title', 'subtitle', 'author', 'date', 'reading_time', 'claps', 'id'], dtype='object')
