# Bash commands for quick data analysis

This notebook aims to give a hands-on approach to quickly browsing and understanding data stored in files.

First, let's make sure we have a connection to the cluster, starting by looking at the current directory.

Using the exclamation mark (!) we can run single-line shell commands. Using ``%%bash`` or ``%%sh``, we can do this for multi-line commands.

In [None]:
# Installing some additional dependencies
!pip3 install csvkit termgraph

In [None]:
%%bash 
wget -O miller.tar.gz "https://github.com/johnkerl/miller/releases/download/v6.11.0/miller-6.11.0-linux-amd64.tar.gz"; tar -xzf miller.tar.gz

In [None]:
!ln miller-6.11.0-linux-amd64/mlr ./mlr

In [None]:
%%bash
ls 

Okay, we have a connection. Let's move on to some exercises.

## Getting started

Now let's download the data for this class.

__wget__ is a command used to download files from the internet. In this case, we will download a randomly generated CSV file containing sales data.  
This is the url of the file: http://eforexcel.com/wp/wp-content/uploads/2020/09/5m-Sales-Records.zip

In [None]:
%%bash

# Your code goes here
wget -O sales.zip http://eforexcel.com/wp/wp-content/uploads/2020/09/5m-Sales-Records.zip

Can you list the directory again and confirm the file was downloaded?

In [None]:
%%bash
# Your code goes here

Next let's **unzip** the file

In [None]:
%%bash
# Your code goes here

Confirm the file is now unzipped and check how much space it takes (**ls** might be a solution here; _**ls --help**_ lists all of its options).
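As a hedged illustration of checking file sizes, the sketch below creates a tiny throwaway file and inspects it with `ls -lh` and `wc -c`. The temporary directory and the name `tiny.csv` are made up for this example and are not part of the exercise data.

```bash
# Create a throwaway directory and a tiny file (hypothetical names).
workdir=$(mktemp -d)
printf 'Region,Country\nEurope,Portugal\n' > "$workdir/tiny.csv"

# -l prints the long listing (including size), -h makes sizes human-readable.
ls -lh "$workdir"

# wc -c reports the exact size in bytes; $(( )) strips any padding.
size_bytes=$(( $(wc -c < "$workdir/tiny.csv") ))
echo "tiny.csv takes $size_bytes bytes"
```

The same `ls -lh` call works on any directory, so it can be pointed at the folder holding the unzipped file.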

In [None]:
%%bash
# Your code goes here

Optionally rename the file to a simpler name using the mv command (hint: mv --help)

In [None]:
%%bash

# Your code goes here

## Diving into the data

We'll start by understanding what data we are dealing with.  
Take the first 5 rows and see what the file contains.
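A minimal sketch of peeking at the top of a CSV file with `head`. It runs on a small made-up file, `sample_sales.csv`, whose six columns are an assumed, shortened version of the real header.

```bash
# Build a small sample file in a temporary directory (hypothetical data).
workdir=$(mktemp -d)
csv="$workdir/sample_sales.csv"
cat > "$csv" <<'EOF'
Region,Country,Item Type,Sales Channel,Order Priority,Order Date
Europe,Portugal,Clothes,Online,H,1/1/2020
Europe,Spain,Fruits,Offline,L,1/2/2020
Europe,Portugal,Snacks,Online,M,1/3/2020
Asia,Japan,Clothes,Online,C,1/4/2020
Europe,France,Cereal,Online,H,1/5/2020
Europe,Portugal,Meat,Offline,L,1/6/2020
EOF

# head -n 5 prints the first five lines: the header plus four data rows.
first_rows=$(head -n 5 "$csv")
echo "$first_rows"
```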

In [None]:
%%bash

# Your code goes here

In [None]:
# An alternative approach using the "csvkit"


So it is a regular CSV file where the first line is the header and the columns are separated by commas.

Next, let's see how many rows the file contains.
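One possible way to count rows is `wc -l`, sketched here on a small made-up file (`sample_sales.csv`, with an assumed six-column layout). Since `wc -l` also counts the header line, the sketch subtracts one to get the number of data rows.

```bash
# Build a small sample file in a temporary directory (hypothetical data).
workdir=$(mktemp -d)
csv="$workdir/sample_sales.csv"
cat > "$csv" <<'EOF'
Region,Country,Item Type,Sales Channel,Order Priority,Order Date
Europe,Portugal,Clothes,Online,H,1/1/2020
Europe,Spain,Fruits,Offline,L,1/2/2020
Europe,Portugal,Snacks,Online,M,1/3/2020
Asia,Japan,Clothes,Online,C,1/4/2020
Europe,France,Cereal,Online,H,1/5/2020
Europe,Portugal,Meat,Offline,L,1/6/2020
EOF

# Reading from stdin keeps the file name out of the wc output.
total_lines=$(( $(wc -l < "$csv") ))
# Exclude the header line from the count.
data_rows=$(( total_lines - 1 ))
echo "lines: $total_lines, data rows: $data_rows"
```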

In [None]:
%%bash

# Your code goes here

How many of these rows relate to Europe?

In [None]:
%%bash

# Your code goes here

And of the European ones, how many were Online orders?
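Filters like these can be sketched with `grep`. The example below runs on a small made-up file and assumes the region is the first column and the sales channel appears as its own comma-delimited field; anchoring the pattern (`^Europe,`) avoids matching the word elsewhere in a line.

```bash
# Build a small sample file in a temporary directory (hypothetical data).
workdir=$(mktemp -d)
csv="$workdir/sample_sales.csv"
cat > "$csv" <<'EOF'
Region,Country,Item Type,Sales Channel,Order Priority,Order Date
Europe,Portugal,Clothes,Online,H,1/1/2020
Europe,Spain,Fruits,Offline,L,1/2/2020
Europe,Portugal,Snacks,Online,M,1/3/2020
Asia,Japan,Clothes,Online,C,1/4/2020
Europe,France,Cereal,Online,H,1/5/2020
Europe,Portugal,Meat,Offline,L,1/6/2020
EOF

# grep -c counts matching lines instead of printing them.
europe_rows=$(grep -c '^Europe,' "$csv")

# Chain a second grep to keep only the Online orders among the European rows.
europe_online=$(grep '^Europe,' "$csv" | grep -c ',Online,')

echo "Europe: $europe_rows, Europe and Online: $europe_online"
```

Matching on raw text assumes no field contains a quoted comma; for messier files a CSV-aware tool is the safer choice.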

In [None]:
%%bash

# Your code goes here

Output to a file the Item Type, the order date and the order priority of orders from Portugal.

_Hint:_  
_As mentioned in the intro, the **>** operator can be used to redirect the output to a file._
> echo "New line" > new_file.txt

_This creates a file called new_file.txt containing the text "New line"._
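A hedged sketch of selecting columns and redirecting them to a file, using `awk` on a small made-up file. The column positions (country in column 2, item type in 3, order priority in 5, order date in 6) and the output name `portugal_orders.csv` are assumptions for this example.

```bash
# Build a small sample file in a temporary directory (hypothetical data).
workdir=$(mktemp -d)
csv="$workdir/sample_sales.csv"
cat > "$csv" <<'EOF'
Region,Country,Item Type,Sales Channel,Order Priority,Order Date
Europe,Portugal,Clothes,Online,H,1/1/2020
Europe,Spain,Fruits,Offline,L,1/2/2020
Europe,Portugal,Snacks,Online,M,1/3/2020
Asia,Japan,Clothes,Online,C,1/4/2020
Europe,France,Cereal,Online,H,1/5/2020
Europe,Portugal,Meat,Offline,L,1/6/2020
EOF

# -F, splits on commas; keep rows whose country is Portugal and print
# item type, order date and order priority. > writes the result to a file.
out="$workdir/portugal_orders.csv"
awk -F, '$2 == "Portugal" {print $3 "," $6 "," $5}' "$csv" > "$out"

cat "$out"
n_lines=$(( $(wc -l < "$out") ))
first_line=$(head -n 1 "$out")
```

`awk` is used here rather than `cut` because `cut` always emits fields in file order, whereas `awk` can print them in any order.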

In [None]:
%%bash

# Your code goes here

Among European Online orders, which 5 countries had the highest number of orders? (hint: A combination of **sort** and **uniq** will probably help)
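The usual counting idiom is `sort | uniq -c | sort -rn | head`, sketched below on a small made-up file with an assumed column layout (country in column 2). `uniq -c` only merges adjacent duplicates, which is why the first `sort` is needed.

```bash
# Build a small sample file in a temporary directory (hypothetical data).
workdir=$(mktemp -d)
csv="$workdir/sample_sales.csv"
cat > "$csv" <<'EOF'
Region,Country,Item Type,Sales Channel,Order Priority,Order Date
Europe,Portugal,Clothes,Online,H,1/1/2020
Europe,Spain,Fruits,Offline,L,1/2/2020
Europe,Portugal,Snacks,Online,M,1/3/2020
Asia,Japan,Clothes,Online,C,1/4/2020
Europe,France,Cereal,Online,H,1/5/2020
Europe,Portugal,Meat,Offline,L,1/6/2020
EOF

# Filter rows, keep the country column, count occurrences per country,
# sort by count in descending numeric order and keep the top five.
top5=$(grep '^Europe,' "$csv" | grep ',Online,' \
  | cut -d, -f2 | sort | uniq -c | sort -rn | head -n 5)
echo "$top5"
```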

In [None]:
%%bash

# Your code goes here


We can achieve similar results using ``mlr`` and "then"-chaining. Note that this is equivalent to calling ``| ./mlr *verb*`` sequentially.

In [None]:
%%bash
 
# Your code goes here

We can also generate summary statistics with ``mlr``, using the ``stats1`` verb for univariate data.

In [None]:
%%bash

# Your code goes here

We can use the ``csvsql`` utility contained in ``csvkit`` to run SQL queries on the data.

In [None]:
# Your code goes here

In [None]:
# Your code goes here

It is also easy to integrate with ``Python``. Given the following script, can you make a histogram of the number of orders per region?

In [None]:
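%%bash
# Write a small helper that reads "<count> <label>" lines from stdin
# and prints each label followed by a bar of length floor(ln(count)).
cat > log_histogram.py <<'EOF'
import math
import sys

for line in sys.stdin:
    count, label = line.strip().split(" ", 1)
    # Truncate the logarithm so the bar width is an explicit integer.
    bar_width = int(math.log(int(count)))
    print("{:<35}{:=<{width}}".format(label, "", width=bar_width))
EOF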

In [None]:
%%bash 
# Your code goes here

In [None]:
%%bash

# We can also use the "termgraph" utility 

Now let's split the file with:

In [None]:
%%bash
# Your code goes here


Can you run the same operation as previously but now searching through the multiple files created? (_Hint: the **cat** command will probably be useful here)_
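As a sketch under stated assumptions, the example below splits a small made-up file into chunks of three lines with `split -l` (the `part_` prefix is a hypothetical choice) and then uses `cat` to stream all chunks back through the same `grep` filter. The count from the chunks should match the count from the whole file.

```bash
# Build a small sample file in a temporary directory (hypothetical data).
workdir=$(mktemp -d)
csv="$workdir/sample_sales.csv"
cat > "$csv" <<'EOF'
Region,Country,Item Type,Sales Channel,Order Priority,Order Date
Europe,Portugal,Clothes,Online,H,1/1/2020
Europe,Spain,Fruits,Offline,L,1/2/2020
Europe,Portugal,Snacks,Online,M,1/3/2020
Asia,Japan,Clothes,Online,C,1/4/2020
Europe,France,Cereal,Online,H,1/5/2020
Europe,Portugal,Meat,Offline,L,1/6/2020
EOF

# Split into files of 3 lines each: part_aa, part_ab, part_ac.
split -l 3 "$csv" "$workdir/part_"
n_parts=$(( $(ls "$workdir"/part_* | wc -l) ))

# Count European Online orders on the whole file...
whole_count=$(grep '^Europe,' "$csv" | grep -c ',Online,')
# ...and again by concatenating the chunks back into one stream.
chunk_count=$(cat "$workdir"/part_* | grep '^Europe,' | grep -c ',Online,')

echo "parts: $n_parts, whole: $whole_count, chunks: $chunk_count"
```

Note that only the first chunk carries the header line, which matters for tools that expect a header in every file.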

In [None]:
%%bash

# Your code goes here

Which file contains the following entry? And in what line? (_Hint: the **grep** command will probably be useful here)_
`Australia and Oceania,Solomon Islands,Clothes,Offline,H,1/1/2020,562158238,1/20/2020,3970,109.28,35.84,433841.60,142284.80,291556.80`
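A hedged sketch of locating a line across several files: `grep -n` prefixes each match with its line number, `-H` forces the file name to be printed, and `-F` treats the pattern as a fixed string rather than a regular expression. The sample data, the `part_` prefix and the searched row are made up for this example.

```bash
# Build a small sample file and split it into chunks (hypothetical data).
workdir=$(mktemp -d)
csv="$workdir/sample_sales.csv"
cat > "$csv" <<'EOF'
Region,Country,Item Type,Sales Channel,Order Priority,Order Date
Europe,Portugal,Clothes,Online,H,1/1/2020
Europe,Spain,Fruits,Offline,L,1/2/2020
Europe,Portugal,Snacks,Online,M,1/3/2020
Asia,Japan,Clothes,Online,C,1/4/2020
Europe,France,Cereal,Online,H,1/5/2020
Europe,Portugal,Meat,Offline,L,1/6/2020
EOF
split -l 3 "$csv" "$workdir/part_"

# Output format is <file>:<line number>:<matching line>.
needle='Asia,Japan,Clothes,Online,C,1/4/2020'
match=$(grep -HnF "$needle" "$workdir"/part_*)
echo "$match"

# Pull the file name and line number out of the first two fields.
match_file=${match%%:*}
rest=${match#*:}
match_line=${rest%%:*}
echo "found in $(basename "$match_file") at line $match_line"
```

The reported line number is relative to the chunk, not to the original unsplit file.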

In [None]:
%%bash

# Your code goes here

## Parallel Processing

We are going to use "GNU parallel", which allows us to run code in parallel using the shell

In [None]:
!apt-get -y install parallel

In [None]:
%%bash
# Parallel processing using GNU "parallel"
seq 0 2 20 | parallel "echo {}^2" | head

"Embarassingly parallel" code can easily be parallelized, such as the following webscraping example. Every call to the website is independent of each other. We first upload the sites.txt file to the file store and confirm that it is there.

In [None]:
# dbutils gives us access to some databricks utilities. fs lets us explore the file system
dbutils.fs.cp("dbfs:/FileStore/sites.txt", "file:/tmp/sites.txt")

In [None]:
%%bash

# Your code goes here

We can perform HTML parsing from the command line too, using the "pup" utility

In [None]:
%%bash 
# Installing pup
curl -Lo pup.zip "https://github.com/ericchiang/pup/releases/download/v0.4.0/pup_v0.4.0_linux_amd64.zip"; unzip pup.zip

Counting the number of rows in the HTML table. Note that we use ./pup in the command since the path is relative to the current directory.

In [None]:
# Your code goes here

We can put all of it together in a single command, using the ";" delimiter. We first find all the "html" files in the current directory and re-direct them into another file. Then, we perform parallel counts of the number of rows contained in any of those files. We use 5 jobs at the same time. Finally, we print the name of the file and use "pup" to count the number of rows in the table.

In [None]:
# Your code goes here

Want to learn more?  
Check the free O’Reilly book here: https://www.datascienceatthecommandline.com/