# Bash commands for quick data analysis

This notebook aims to give a hands-on approach to quickly browsing and understanding data stored in files.

First, let's make sure we have a connection to the cluster by listing the current directory.

In [None]:
%%bash
ls

Okay, we have a connection. Let's move on to some exercises.

### Getting started

Now let's download the data for this class.

__wget__ is a command used to download files from the internet. In this case we will download a randomly generated CSV containing sales data.  
This is the url of the file: http://eforexcel.com/wp/wp-content/uploads/2020/09/5m-Sales-Records.zip

In [None]:
%%bash

# Your code goes here


Can you list the directory again and confirm the file was downloaded?

In [None]:
%%bash
# Your code goes here


Next, let's **unzip** the file.

In [None]:
%%bash
# Your code goes here


Confirm the file is now unzipped and check how much space it takes (**ls** might be a solution here; _**ls --help**_ lists all the options for extra functionality).
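
As a small illustration of checking file sizes, the sketch below creates a made-up 2048-byte file (`toy.bin` is an assumption for this example, not part of the sales data) in a temporary directory and inspects it with `ls -lh` and `wc -c`.

```shell
# Work in a throwaway directory so nothing in the notebook folder is touched
workdir="$(mktemp -d)"

# Create a made-up 2048-byte file to have something to measure
head -c 2048 /dev/zero > "$workdir/toy.bin"

# -l shows the long listing (including size), -h prints sizes in human-readable units
ls -lh "$workdir"

# wc -c reports the exact size in bytes
size_bytes="$(wc -c < "$workdir/toy.bin")"
echo "$size_bytes"
```

The same flags should work on the unzipped CSV; only the file name changes.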

In [None]:
%%bash
# Your code goes here


Optionally rename the file to a simpler name using the **mv** command, for example `sales.csv`, the name used later in this notebook (hint: mv --help).

In [None]:
%%bash

# Your code goes here


## Diving in the data

We will start by understanding what data we are dealing with.  
Take the first 5 rows and see what the file contains.

In [None]:
%%bash

# Your code goes here


So it is a regular CSV where the first line is the header and the columns are separated by commas.
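
To make that structure concrete, here is a minimal sketch on a tiny made-up CSV (the file `toy.csv` and its contents are assumptions for illustration only). It separates the header from the data rows and picks out a single column.

```shell
# Work in a throwaway directory so nothing in the notebook folder is touched
workdir="$(mktemp -d)"

# toy.csv is a made-up file used only for illustration
printf 'Region,Country,Units\nEurope,Portugal,10\nAsia,Japan,7\n' > "$workdir/toy.csv"

# The first line is the header
header="$(head -n 1 "$workdir/toy.csv")"
echo "$header"

# Everything from line 2 onwards is data; cut -d, -f2 keeps the second column
countries="$(tail -n +2 "$workdir/toy.csv" | cut -d, -f2)"
echo "$countries"
```

Note that `cut` splits on every comma, so this simple approach assumes that no field contains a comma inside quotes.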

Next, let's see how many rows the file contains.

In [None]:
%%bash

# Your code goes here


How many of these rows relate to Europe?

In [None]:
%%bash

# Your code goes here


And of the European ones, how many were Online orders?

In [None]:
%%bash

# Your code goes here


_Hint:_  
_As mentioned in the intro, the **>** operator can be used to redirect the output to a file._
> _echo "New line" > new_file.txt_

_This will create a file called new_file.txt containing the text New line._
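
A short sketch of how redirection behaves, run in a temporary directory so no existing file is overwritten. The `>>` operator is not mentioned in the hint above and is included here only for contrast.

```shell
# Work in a throwaway directory so nothing in the notebook folder is touched
workdir="$(mktemp -d)"

# > creates the file, or overwrites it if it already exists
echo "New line" > "$workdir/new_file.txt"
echo "Replaced line" > "$workdir/new_file.txt"

# >> appends to the end of the file instead of overwriting it
echo "Appended line" >> "$workdir/new_file.txt"

contents="$(cat "$workdir/new_file.txt")"
echo "$contents"
```

Any command's output can be redirected this way, not just `echo`.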

Write the Item Type, the Order Date and the Order Priority of orders from Portugal to a file.

In [None]:
%%bash

# Your code goes here


Check the contents of the file to confirm you got the desired output.

In [None]:
%%bash

# Your code goes here


Among European Online orders, which 5 countries have the highest number of orders? (hint: a combination of **sort** and **uniq** will probably help)
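
The counting pattern behind that hint can be sketched on a made-up list of values (the fruit names are purely illustrative). It assumes one value per line.

```shell
# Made-up list of values, one per line, used only for illustration
values="$(printf 'pear\napple\npear\nfig\napple\npear\n')"

# uniq -c only merges adjacent duplicates, so sort first;
# sort -rn then orders the counts from highest to lowest, and head keeps the top 2
top="$(echo "$values" | sort | uniq -c | sort -rn | head -n 2)"
echo "$top"

# Drop the count column to keep just the names, in ranked order
top_names="$(echo "$top" | awk '{print $2}')"
echo "$top_names"
```

The first `sort` matters: without it, `uniq -c` would count each run of repeated lines separately.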

In [None]:
%%bash

# Your code goes here


It is also easy to integrate with Python. Given the following script, can you make a histogram with the number of orders per region?

In [None]:
%%bash
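echo '''
import math
import sys
# Each input line is expected to look like "<count> <label>", e.g. the output of uniq -c
for line in sys.stdin:
  width, data = line.strip().split(" ",1)
  # Bar length is the natural log of the count, truncated to an integer
  print("{:<35}{:=<{width}}".format(data, "", width=int(math.log(int(width)))))
''' > log_histogram.py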

In [None]:
%%bash

# Your code goes here


Now let's split the file with:

In [None]:
%%bash
split -n 3 --additional-suffix=-sales sales.csv

Can you run the same operation as before, but now searching through the multiple files created? (_Hint: the **cat** command will probably be useful here_)
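
As a sketch of why **cat** helps, the example below splits a made-up three-line file into pieces and joins them back together. The file name `toy.csv`, the `part-` prefix and the use of `split -l 1` are assumptions for this illustration and differ from the split command above.

```shell
# Work in a throwaway directory so nothing in the notebook folder is touched
workdir="$(mktemp -d)"

# Made-up three-line file, split into one-line pieces part-aa, part-ab, part-ac
printf 'a,1\nb,2\nc,3\n' > "$workdir/toy.csv"
split -l 1 "$workdir/toy.csv" "$workdir/part-"

# cat concatenates the pieces in the order given; the glob expands alphabetically
rejoined="$(cat "$workdir"/part-*)"
echo "$rejoined"

original="$(cat "$workdir/toy.csv")"
```

Because the concatenated stream matches the original file, it can be piped into the same commands that were used on the single file.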

In [None]:
%%bash

# Your code goes here


Which file contains the following entry, and on which line? (_Hint: the **grep** command will probably be useful here_)
`Australia and Oceania,Solomon Islands,Clothes,Offline,H,1/1/2020,562158238,1/20/2020,3970,109.28,35.84,433841.60,142284.80,291556.80`
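
A minimal sketch of searching several files at once, using two made-up files (`part-aa` and `part-ab` are hypothetical names for this example).

```shell
# Work in a throwaway directory so nothing in the notebook folder is touched
workdir="$(mktemp -d)"
printf 'a,1\nb,2\n' > "$workdir/part-aa"
printf 'c,3\nd,4\n' > "$workdir/part-ab"

# -F treats the pattern as a fixed string, -n adds the line number;
# with several files grep also prefixes each match with the file name
match="$(cd "$workdir" && grep -Fn 'd,4' part-*)"
echo "$match"
```

The `-F` flag is a cautious choice here: the entry contains dots, which would otherwise be treated as regular-expression wildcards.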

In [None]:
%%bash

# Your code goes here


> **Now, what if we wanted to read this file with Spark?**  
(we will learn more details about this in the following labs)

> First, let's check the absolute path of the file:

In [None]:
%sh
# Your code goes here

> Now we pass this absolute path to `spark.read.text`:

In [None]:
spark.read.text("#Your code goes here")

#### Uh-oh. What's wrong?

In Databricks, when we run standard bash commands, we are accessing the **driver local storage**.

The **driver local storage** is a temporary storage space that is available on the **driver node of your Spark cluster**.  
You can use it to store files that are needed for your Spark job, such as configuration files, libraries, or intermediate results. However, you should be aware that the driver local storage is not persistent, and will be deleted when your cluster is terminated or restarted.  

To access the driver local storage in Databricks, you can use the /databricks/driver path, which is a symbolic link to the driver local storage.

The problem here is that we also have the **Databricks File System** (DBFS), which is the default location for Spark commands (that's why `spark.read` can't find that path).  

The DBFS, on the other hand, is a distributed file system that is built into the Databricks platform. It provides a simple and scalable way to store and access data within Databricks, and is designed to work seamlessly with Spark.


| Feature | Driver Ephemeral Volume Storage | DBFS |
|---------|--------------------------|------|
| Location | Local disk storage | Distributed file system |
| Persistence | Not persistent, deleted when cluster is terminated or restarted | Persistent, not deleted when cluster is terminated or restarted |
| Access | /databricks/driver | dbfs:/ scheme with dbutils.fs module |
| Speed | Faster | Slower |
| Cost | Cheaper | More expensive |
| Reliability | Not reliable | Reliable |
| Scalability | Not scalable | Scalable |
| Use case | Temporary data needed for a specific Spark job | Data needed across multiple Spark jobs or that cannot be regenerated if lost |

When I say that the DBFS is reliable and scalable, I mean that it provides better data durability and availability, and can handle larger amounts of data and higher levels of concurrency, in comparison to the local disk storage.
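
One way to picture the difference is through the paths themselves. The sketch below only builds and prints strings: the file name `sales.csv` and the `dbfs:/tmp` destination are assumptions for illustration, and the `/dbfs` mount point is merely checked, since it may not be available on every cluster.

```shell
# Hypothetical file in the current working directory of the shell
local_path="$(pwd)/sales.csv"

# Spark resolves bare paths against DBFS, so a driver-local file needs the file: scheme
spark_local_uri="file:${local_path}"

# Hypothetical DBFS destination, written with the dbfs: scheme
spark_dbfs_uri="dbfs:/tmp/sales.csv"

echo "shell sees:        ${local_path}"
echo "Spark, local file: ${spark_local_uri}"
echo "Spark, DBFS file:  ${spark_dbfs_uri}"

# On many Databricks clusters DBFS is also mounted at /dbfs for shell commands
if [ -d /dbfs ]; then echo "/dbfs is mounted"; else echo "/dbfs is not mounted here"; fi
```

Reading through the `file:` scheme is only a stopgap on a multi-node cluster, because the worker nodes do not share the driver's local disk.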

> **The easiest way to access the DBFS is to use the magic `%fs` instead of `%sh`:**

> Let's copy the sales file to the DBFS and try to read it using Spark again:

In [None]:
%fs

# Your code goes here

In [None]:
%fs

# Your code goes here

>  Now we can read it with Spark!

In [None]:
spark.read.text("#Your code goes here").count()

The root path on Databricks depends on the code executed:

| Command          | Default location     |
| ---------------- | -------------------- |
| %sh              | driver local storage |
| %fs              | DBFS                 |
| most Python code | driver local storage |
| PySpark          | DBFS                 |

## Extra

Installing packages or tools is also easier than with the well-known Windows installer.  
To demonstrate it we will use a fortune telling cow.

First we need to install the packages with the **apt** command.

In [None]:
%%bash

apt install -y fortune cowsay

Normally these two commands should be available as **fortune** and **cowsay**. In this case they will be available as **/usr/games/fortune** and **/usr/games/cowsay** due to a problem with Databricks.  
Can you use these two commands together to create a fortune telling cow?

In [None]:
%%bash 
# Your code goes here

Want to learn more?  
Check the free O’Reilly book here: https://www.datascienceatthecommandline.com/