# Har2Tree Tutorial

Crawling a web page can sound like a bit of an abstract concept at first. How exactly can we extract data from a web page? What data is really interesting to look at? Where can it be found?  

&rarr; Most web browsers can export a **[HAR file](https://www.keycdn.com/support/what-is-a-har-file#:~:text=HAR%2C%20short%20for%20HTTP%20Archive,times%2C%20and%20page%20rendering%20problems.)** (short for HTTP Archive) describing how a web page was loaded. This file mostly contains information about the resources loaded by the browser, as it was originally designed to identify possible performance issues. However, as the whole file is in a *standard JSON* format, **we can reverse engineer the process to extract useful information** and build a whole tree out of all the resources found in that HAR file. This step is particularly important because it is hard to understand what is going on by simply reading the raw HAR file. *Example [here](https://gist.githubusercontent.com/Felalex57/8a90a3bd0628e3aef16ee04fb08e7e7e/raw/ecee33d26c5696989c600ba87683becff270ccc1/example.har)!*
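
To make the structure more concrete, here is a minimal sketch that reads a HAR document with the standard library alone. The `sample_har_text` value is a hand-written stand-in and not a real capture, and `list_requests` is a hypothetical helper. Only the `log` → `entries` → `request` / `response` layout comes from the HAR format itself.

```python
import json

# Hand-written stand-in for the content of a HAR file (illustrative values only).
sample_har_text = json.dumps({
    "log": {
        "version": "1.2",
        "entries": [
            {"request": {"method": "GET", "url": "https://example.com/"},
             "response": {"status": 200}},
            {"request": {"method": "GET", "url": "https://example.com/style.css"},
             "response": {"status": 200}},
        ],
    }
})


def list_requests(har_text):
    """Return (method, url, status) for every entry of a HAR document."""
    har = json.loads(har_text)
    return [
        (entry["request"]["method"],
         entry["request"]["url"],
         entry["response"]["status"])
        for entry in har["log"]["entries"]
    ]


requests_seen = list_requests(sample_har_text)
for method, url, status in requests_seen:
    print(method, url, status)
```

A flat list like this shows which resources were loaded, but not which resource triggered which. That relationship is what a tree makes visible.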

This notebook will guide you through the core features that **[Har2Tree](https://github.com/Lookyloo/har2tree)** offers.

It is also important to note that Har2Tree is an API based on the **[TreeNode](http://etetoolkit.org/docs/latest/reference/reference_tree.html) class of ETE3 Toolkit**. Its documentation is a good place to look if you want to know more about how the program works.

# Before we do anything: Setup

## 1. Prerequisites


For the following tutorial, we assume you have the following environment at your disposal.

1. Ubuntu 20.04 or more recent. You can also work with WSL 2 


2. Python 3.8 or 3.9

## 2. Installing har2tree

If you are here it means that you already cloned the har2tree repository:  **you should be all set up already**!

In case you got here another way, simply clone the repository in your desired folder:  
```bash
git clone https://github.com/Lookyloo/har2tree.git
```

You may also want to initialize a submodule containing a few sample captures:

```bash
git submodule update --init --recursive
```

## (Optional) 3. Retrieving useful files

At this point, you could use a pre-existing capture made for the tests of har2tree. They are located in **`tests/capture_samples`**.
However, you might want to take a look at how the files are downloaded **to have a better understanding of the program** and later use it on pages of your choice.
<br/>
<br/>

**Important note:** Because Har2Tree was made for Lookyloo, it may require some additional files located in the same folder as the HAR file to be completely operational. To ensure that the program fully works, we will simulate a capture using the **[public Lookyloo instance](https://lookyloo.circl.lu/)** rather than downloading the HAR file in the conventional way *(on Chrome: Ctrl + Shift + J > Network > F5 (Reload the page) > Arrow facing downwards)*.



By simply **adding `/export` at the end of the URL** when browsing a capture, we can **download all the files generated by Lookyloo**. This includes the complete HTML capture of the page along with various other files that we will get into later on.

 
Capture link: &nbsp;&nbsp; &rarr;  https://lookyloo.circl.lu/tree/b6b29698-4c97-4a21-adaa-f934e5bfb042  
Download link: &rarr; https://lookyloo.circl.lu/tree/b6b29698-4c97-4a21-adaa-f934e5bfb042/export
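
As a small sketch of that rule, the download link can be derived from the capture link in code. The helper name `export_url` is made up for this example. It only appends `/export` as described above and tolerates a trailing slash.

```python
def export_url(capture_url: str) -> str:
    """Append /export to a Lookyloo capture URL, tolerating a trailing slash."""
    return capture_url.rstrip("/") + "/export"


capture_link = "https://lookyloo.circl.lu/tree/b6b29698-4c97-4a21-adaa-f934e5bfb042"
download_link = export_url(capture_link)
print(download_link)
```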

You can then unzip the archive into a folder of your choice, and your HAR folder is ready!  
**Tip:** unzip the archive in the same directory as this notebook; it will make things easier later.

## Getting started

The place where the magic of the API begins is the **[CrawledTree object](https://github.com/Lookyloo/har2tree/blob/9f92dab3909e406877cb36b3dbc30d0c5ead8c63/har2tree/parser.py#L15)**: it takes a list of **HAR file paths** and a **[uuid](https://en.wikipedia.org/wiki/Universally_unique_identifier#:~:text=A%20universally%20unique%20identifier%20UUID,%2C%20for%20practical%20purposes%2C%20unique.)** as parameters. <br/> To keep things simple for now, we will only be using **one HAR file per tree**.
To build OS paths in Python, we are going to use the **Path** class from **pathlib**.  

Note that the `__file__` variable is not defined in a Jupyter notebook, so paths cannot be built relative to the notebook file that way.  

Let's see how we can tell the program to display our home directory:

In [None]:
from pathlib import Path
Path.home()
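
Since `__file__` is not available in a notebook, one possible workaround is to build paths from the current working directory instead. The sketch below assumes the notebook is run from its own folder inside the cloned repository, so the sample captures would sit under `../tests/capture_samples`.

```python
from pathlib import Path

# In a notebook, __file__ is undefined, so start from the working directory.
notebook_dir = Path.cwd()

# The / operator joins path components in an OS-independent way.
sample_har = notebook_dir / ".." / "tests" / "capture_samples" / "http_redirect" / "0.har"

print(sample_har)
print(sample_har.exists())  # True only if the sample captures are present
```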

Great. Now let's try to create our first tree. As mentioned before, you will also need to pass a uuid as a parameter, but don't worry, Python has everything you need:

In [None]:
import uuid
uuid.uuid4()

Two small notes, though:
- *CrawledTree* takes the uuid as a string and not as a UUID object, so we have to convert it
- it takes a list of HAR paths, even if there is only one path, as mentioned before
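
Both notes can be sketched with the standard library only. The file name `0.har` is a placeholder here, and nothing is passed to har2tree yet.

```python
import uuid
from pathlib import Path

# uuid4() returns a uuid.UUID object; str() gives its textual form.
capture_uuid = str(uuid.uuid4())

# A single HAR path still has to be wrapped in a list.
har_paths = [Path("0.har")]  # placeholder file name

print(type(capture_uuid).__name__, capture_uuid)
print(har_paths)
```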

You might want to change the HAR path to what you downloaded in part 3 of the setup.
Enough talking:

## Creating the tree

In [None]:
from har2tree import CrawledTree
har_path = Path() / '..' / 'tests' / 'capture_samples' / 'http_redirect' / '0.har'
my_first_crawled_tree = CrawledTree([har_path], str(uuid.uuid4()))
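
If you unzipped a Lookyloo export instead of using the test samples, you first need to locate the HAR file inside it. The following is only a sketch: `find_har_files` is a hypothetical helper, it assumes HAR files end in `.har`, and it runs on a throwaway folder standing in for a real export.

```python
import tempfile
from pathlib import Path


def find_har_files(capture_dir):
    """Return every *.har file below capture_dir, sorted by path."""
    capture_dir = Path(capture_dir)
    if not capture_dir.is_dir():
        return []
    return sorted(capture_dir.rglob("*.har"))


# Demonstration on a temporary folder that imitates an unzipped export.
with tempfile.TemporaryDirectory() as tmp:
    fake_capture = Path(tmp) / "my_capture"  # hypothetical folder name
    fake_capture.mkdir()
    (fake_capture / "0.har").write_text("{}")
    (fake_capture / "0.html").write_text("")
    found_names = [path.name for path in find_har_files(fake_capture)]

print(found_names)
```

The list returned by such a helper has the shape expected for the first argument of `CrawledTree`.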

## Part 1: Extracting simple data

If you didn't get an error, everything worked! Let's now see what we can do with that CrawledTree. 
You can find all the **properties** in the **[parser.py](https://github.com/Lookyloo/har2tree/blob/9f92dab3909e406877cb36b3dbc30d0c5ead8c63/har2tree/parser.py#L76)** file.

First, let's see what website you got the capture from:

In [None]:
my_first_crawled_tree.root_url

Why not also check at what time the capture was made, as well as the **[user agent](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent)** that made it:

In [None]:
print(my_first_crawled_tree.start_time)
print(my_first_crawled_tree.user_agent)
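
For comparison, here is a sketch of how the same kind of information could be dug out of a raw HAR entry by hand. The `entry` dictionary is a hand-written stand-in with illustrative values, `header_value` is a hypothetical helper, and har2tree may well derive these values differently.

```python
from datetime import datetime

# Hand-written stand-in for one HAR entry (illustrative values only).
entry = {
    "startedDateTime": "2021-05-04T10:15:30.123+00:00",
    "request": {
        "url": "https://example.com/",
        "headers": [
            {"name": "Accept", "value": "text/html"},
            {"name": "User-Agent", "value": "ExampleBrowser/1.0"},
        ],
    },
}


def header_value(headers, wanted):
    """Return the first header matching `wanted` (case-insensitive), or None."""
    for header in headers:
        if header["name"].lower() == wanted.lower():
            return header["value"]
    return None


started = datetime.fromisoformat(entry["startedDateTime"])
agent = header_value(entry["request"]["headers"], "user-agent")
print(started, agent)
```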

Finally, what really interests us: **let's see if there are any [redirects](https://moz.com/learn/seo/redirection) on the page**.

In [None]:
print(my_first_crawled_tree.redirects)
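
To get a feel for what a redirect looks like in the raw data, the sketch below follows HTTP-level redirects through a hand-written list of HAR-style entries. It is not how har2tree computes `redirects`. It only relies on the `status` and `redirectURL` fields of a HAR response, assumes absolute redirect URLs, and uses a made-up helper name.

```python
def http_redirect_chain(entries, start_url):
    """Follow 3xx responses from start_url and return the URLs reached after it."""
    responses = {e["request"]["url"]: e["response"] for e in entries}
    chain = []
    seen = {start_url}
    current = start_url
    while current in responses:
        response = responses[current]
        target = response.get("redirectURL", "")
        is_redirect = 300 <= response["status"] < 400
        if not is_redirect or not target or target in seen:
            break  # final page reached, or a loop detected
        chain.append(target)
        seen.add(target)
        current = target
    return chain


# Hand-written entries (illustrative values only).
entries = [
    {"request": {"url": "http://example.com/"},
     "response": {"status": 301, "redirectURL": "https://example.com/"}},
    {"request": {"url": "https://example.com/"},
     "response": {"status": 302, "redirectURL": "https://example.com/home"}},
    {"request": {"url": "https://example.com/home"},
     "response": {"status": 200, "redirectURL": ""}},
]

chain = http_redirect_chain(entries, "http://example.com/")
print(chain)
```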

And that's it for the first part. With very few lines of code, we are able to extract very useful information in negligible execution time. This is much easier than going through the HAR file by hand to find what you're looking for.

In the second part, we'll dig into the more complex features.

## Part 2: The second level

In this part we will look into the `root_hartree` property of CrawledTree, which is nothing more than a Har2Tree object held inside the CrawledTree. You can see that it is **initialized [here](https://github.com/Lookyloo/har2tree/blob/f908280347b4b33f8ef0dd750d372dcd825634ed/har2tree/parser.py#L25)**.  

The few properties we saw before exist to simplify access to that sub-level tree. Internally, they are forwarded **[this way](https://github.com/Lookyloo/har2tree/blob/f908280347b4b33f8ef0dd750d372dcd825634ed/har2tree/parser.py#L79)**: `CrawledTree.root_hartree.method`
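
The forwarding idea can be illustrated with two toy classes. These are not the har2tree sources: `InnerTree` and `OuterTree` are invented names, and only the attribute name `root_hartree` mirrors the real object.

```python
class InnerTree:
    """Stand-in for the inner object that holds the actual data."""

    def __init__(self, start_time):
        self.start_time = start_time


class OuterTree:
    """Stand-in for the wrapper that exposes shortcuts to the inner object."""

    def __init__(self, start_time):
        self.root_hartree = InnerTree(start_time)

    @property
    def start_time(self):
        # Shortcut: simply forwards to the inner object.
        return self.root_hartree.start_time


outer = OuterTree("2021-05-04T10:15:30")
print(outer.start_time == outer.root_hartree.start_time)
```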

### Har2Tree properties

Let's start with something simple and display the start time to check if we get the same result as before:

In [None]:
print(my_first_crawled_tree.root_hartree.start_time)
print(my_first_crawled_tree.root_hartree.start_time == my_first_crawled_tree.start_time)

The `stats` property gathers several other useful properties and returns them together in a JSON-like structure. You can find what it calls **[here](https://github.com/Lookyloo/har2tree/blob/f908280347b4b33f8ef0dd750d372dcd825634ed/har2tree/har2tree.py#L370)** and trace it back to the other properties in case you want to know something specific that is not covered here.

In [None]:
my_first_crawled_tree.root_hartree.stats
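
If the notebook output is hard to read, a structure like this can be pretty-printed with the `json` module. The dictionary below is a placeholder with invented keys, not real har2tree output. `default=str` is an assumption that covers values JSON cannot encode natively, such as dates.

```python
import json

# Placeholder dictionary standing in for the real result (keys are invented).
example_stats = {"example_count": 3, "example_names": ["a", "b"]}

# indent=2 spreads the structure over several lines; default=str is a fallback
# for values that are not JSON-serializable.
pretty = json.dumps(example_stats, indent=2, default=str)
print(pretty)
```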