# Har2Tree Tutorial

Crawling a web page can sound like a bit of an abstract concept at first. How exactly can we extract data from a web page? What data is really interesting to look at? Where can it be found?  

&rarr; Most web browsers can record a **[HAR file](https://www.keycdn.com/support/what-is-a-har-file#:~:text=HAR%2C%20short%20for%20HTTP%20Archive,times%2C%20and%20page%20rendering%20problems.)** (short for HTTP Archive) while loading a web page. This file mostly contains information about the resources loaded by the browser, as it was originally designed to identify possible performance issues. However, as the whole file is in a *standard JSON* format, **we can reverse engineer the process to extract useful information** and build a whole tree out of all the resources found in that HAR file. This step is particularly important because it is really hard to understand what is going on by simply looking at the raw HAR file. *Example [here](https://gist.githubusercontent.com/Felalex57/8a90a3bd0628e3aef16ee04fb08e7e7e/raw/ecee33d26c5696989c600ba87683becff270ccc1/example.har)!*
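
To make the JSON structure more concrete, here is a minimal sketch that parses a tiny hand-written HAR document with the standard library only. The data is hypothetical and much smaller than a real capture; only the `log > entries > request > url` layout is taken from the HAR format.

```python
import json

# A tiny, hand-written HAR document (hypothetical data, not a real capture).
har_text = """
{
  "log": {
    "version": "1.2",
    "creator": {"name": "example", "version": "0.0"},
    "entries": [
      {
        "startedDateTime": "2021-01-01T00:00:00.000+00:00",
        "time": 12.5,
        "request": {"method": "GET", "url": "http://example.com/"},
        "response": {"status": 200, "content": {"mimeType": "text/html"}}
      },
      {
        "startedDateTime": "2021-01-01T00:00:00.100+00:00",
        "time": 30.0,
        "request": {"method": "GET", "url": "http://example.com/style.css"},
        "response": {"status": 200, "content": {"mimeType": "text/css"}}
      }
    ]
  }
}
"""

har = json.loads(har_text)
entries = har["log"]["entries"]  # one entry per request made by the browser

# Keep only the URL and the HTTP status of each entry.
summary = [(e["request"]["url"], e["response"]["status"]) for e in entries]
for url, status in summary:
    print(status, url)
```

Even on two entries, a flat list like this says nothing about which resource triggered which request; that is the gap a tree representation fills.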

This notebook will guide you through the core features that **[Har2Tree](https://github.com/Lookyloo/har2tree)** offers.

It is also important to note that Har2Tree is an API based on the **[TreeNode](http://etetoolkit.org/docs/latest/reference/reference_tree.html) class of the ETE3 Toolkit**, and that a lot of help can be found in the ETE3 documentation in case you want to know a bit more about how the program works.

# Before we do anything: Setup

## 1. Prerequisites


For the following tutorial, we assume you have the following environment at your disposal.

1. Ubuntu 20.04 or more recent. You can also work with WSL 2 


2. Python 3.8 or 3.9

## 2. Installing har2tree

If you are here it means that you already cloned the har2tree repository:  **you should be all set up already**!

In case you got here another way, simply clone the repository in your desired folder:  
```bash
git clone https://github.com/Lookyloo/har2tree.git
```

You may also want to initialize a submodule containing a few sample captures:

``` bash
git submodule update --init --recursive
```

## (Optional) 3. Retrieving useful files

At this point, you could use a pre-existing capture made for the tests of har2tree. They are located in `tests/capture_samples`.
However, you might want to take a look at how the files are downloaded **to have a better understanding of the program** and later use it on some pages of your choice.
<br/>
<br/>

**Important note:** Because Har2Tree was made for Lookyloo, it may require some additional files located in the same folder as the HAR file to be completely operational. To ensure that the program will fully work, we will simulate a capture using the **[public Lookyloo instance](https://lookyloo.circl.lu/)** rather than download the HAR file in the conventional way *(on Chrome: Ctrl + Shift + J > Network > F5 (Reload the page) > Arrow facing downwards)* . 



By simply **adding `/export` at the end of the url** when browsing on a capture, we can **download all the files generated by Lookyloo**. This includes the complete html capture of the page along with various other files that we will get into later on.

 
Capture link: &nbsp;&nbsp; &rarr;  https://lookyloo.circl.lu/tree/b6b29698-4c97-4a21-adaa-f934e5bfb042  
Download link: &rarr; https://lookyloo.circl.lu/tree/b6b29698-4c97-4a21-adaa-f934e5bfb042/export

You can then unzip the archive into a folder of your choice, and your capture folder is ready!  
**Tip:** unzip the folder in the same directory as this notebook, it will be easier for later.
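
If you prefer to unzip from Python, the following sketch uses the standard `zipfile` module. The archive name `capture.zip` and the file layout inside it are assumptions made for the demo; the sketch builds a throwaway archive so it runs anywhere, and the real export may be organized differently.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_capture(archive: Path, destination: Path) -> list:
    """Unzip an exported capture and return the HAR files found in it."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(destination)
    return sorted(destination.rglob('*.har'))

# Demo on a throwaway archive (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    archive = tmp_path / 'capture.zip'
    with zipfile.ZipFile(archive, 'w') as zf:
        zf.writestr('my_capture/0.har', '{"log": {"entries": []}}')
    har_files = extract_capture(archive, tmp_path / 'unzipped')
    har_names = [p.name for p in har_files]

print(har_names)
```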

## Getting started

The place where the magic of the API begins is the **[CrawledTree object](https://github.com/Lookyloo/har2tree/blob/9f92dab3909e406877cb36b3dbc30d0c5ead8c63/har2tree/parser.py#L15) :**  it takes a list of **HAR file paths** and a  **[uuid](https://en.wikipedia.org/wiki/Universally_unique_identifier#:~:text=A%20universally%20unique%20identifier%20UUID,%2C%20for%20practical%20purposes%2C%20unique.)** as parameters. <br/> To keep things simple for now, we will only be using **one HAR file per tree**.
To build OS paths in python, we are going to use the **Path** class from **pathlib**.  

Note that the `__file__` variable is not defined in a Jupyter notebook.  

Let's see how we can tell the program to display our home directory:

In [None]:
from pathlib import Path
Path.home()

Great. Now let's try to create our first tree. As mentioned before, you will also need to pass a uuid as a parameter, but don't worry, python has everything you need:

In [None]:
import uuid
uuid.uuid4()

Little notes though:
- *CrawledTree* takes a string as parameter and not a UUID, we just have to make a little conversion
- it takes a list of HAR paths, even if there's only one path as mentioned before

You might want to change the HAR path to what you downloaded in part 3 of the setup.
Enough talking:

## Creating the tree

In [None]:
from har2tree import CrawledTree
har_path = Path() / '..' / 'tests' / 'capture_samples' / 'http_redirect' / '0.har'
my_first_crawled_tree = CrawledTree([har_path], str(uuid.uuid4()))

## Part 1: Extracting simple data

If you didn't get any error, everything worked! Let's now see what we can do with that CrawledTree. 
You can find all the **properties** in the **[parser.py](https://github.com/Lookyloo/har2tree/blob/9f92dab3909e406877cb36b3dbc30d0c5ead8c63/har2tree/parser.py#L76)** file.

First, let's see what website you got the capture from:

In [None]:
my_first_crawled_tree.root_url

Why not also check at what time the capture was made, as well as the **[user agent](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent)** that made it:

In [None]:
print(my_first_crawled_tree.start_time)
print(my_first_crawled_tree.user_agent)

Finally, what really interests us: **let's see if there are any [redirects](https://moz.com/learn/seo/redirection) on the page**.

In [None]:
print(my_first_crawled_tree.redirects)
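
To get an intuition for where such a list can come from, here is a simplified sketch that follows the `redirectURL` field of HAR responses by hand. It is not har2tree's implementation: the entries are hypothetical, `follow_redirects` is a name made up for this example, and only plain HTTP redirects with absolute URLs are handled.

```python
from typing import Dict, List

def follow_redirects(entries: List[Dict], start_url: str) -> List[str]:
    """Return the URLs reached from start_url by following redirectURL fields."""
    # Map each requested URL to the redirect target announced in its response.
    targets = {
        e["request"]["url"]: e["response"].get("redirectURL", "")
        for e in entries
    }
    chain: List[str] = []
    seen = {start_url}
    current = start_url
    while targets.get(current):
        current = targets[current]
        if current in seen:  # guard against redirect loops
            break
        seen.add(current)
        chain.append(current)
    return chain

# Hypothetical entries: two redirects, then the final page.
entries = [
    {"request": {"url": "http://a.example/"},
     "response": {"status": 302, "redirectURL": "http://b.example/"}},
    {"request": {"url": "http://b.example/"},
     "response": {"status": 301, "redirectURL": "https://b.example/"}},
    {"request": {"url": "https://b.example/"},
     "response": {"status": 200, "redirectURL": ""}},
]

redirects = follow_redirects(entries, "http://a.example/")
print(redirects)
```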

And that's it for the first part. With very few lines of code, we are able to extract very useful information in negligible execution time. This is much easier than going through the HAR file by hand to find what you're looking for.

In the second part, we'll dig into the more complex features.

## Part 2: the second level

In this part we will look into the `root_hartree` property of CrawledTree, which is nothing other than a Har2Tree object inside CrawledTree. You can see that it is **initialized [here](https://github.com/Lookyloo/har2tree/blob/f908280347b4b33f8ef0dd750d372dcd825634ed/har2tree/parser.py#L25)**.

Har2Tree's goal is to build a tree **out of the different contents of a given HAR file**. In the process, a lot of other methods are invoked, particularly in the **nodes.py** file. We will cover them later on.

The few properties we saw before are here to simplify the access of that sub-level tree. They are called **[that way](https://github.com/Lookyloo/har2tree/blob/f908280347b4b33f8ef0dd750d372dcd825634ed/har2tree/parser.py#L79)** : `CrawledTree.root_hartree.method`

### Har2Tree properties

Let's start with something simple and display the start time to check if we get the same result as before:

In [None]:
print(my_first_crawled_tree.root_hartree.start_time)
print(my_first_crawled_tree.root_hartree.start_time == my_first_crawled_tree.start_time)

- The `stats` property calls several other useful properties and displays them nicely in a JSON format. You can find what it calls **[here](https://github.com/Lookyloo/har2tree/blob/f908280347b4b33f8ef0dd750d372dcd825634ed/har2tree/har2tree.py#L370)** and trace it back to the other properties in case you want to know something specific that is not covered here.

In [None]:
my_first_crawled_tree.root_hartree.stats

- You can get a pretty good idea of the time taken to load the resources of the capture by calling the `total_load_time` property. It's not 100% precise, as some resources are loaded in parallel, but it gives a good approximation. Along with it, you can call `total_size_responses`, which gives you the size in bytes of the response bodies:

In [None]:
print(my_first_crawled_tree.root_hartree.total_load_time)
print(my_first_crawled_tree.root_hartree.total_size_responses)
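
The remark about parallel loads can be illustrated on raw HAR fields. The sketch below assumes that such totals are sums over the `time` and `response.bodySize` fields of each entry (an assumption about the approach, not a copy of har2tree's code) and compares the summed time with the wall-clock duration on hypothetical entries.

```python
from datetime import datetime, timedelta

# Hypothetical entries: the last two requests start at the same time.
entries = [
    {"startedDateTime": "2021-01-01T00:00:00.000+00:00", "time": 100.0,
     "response": {"bodySize": 1200}},
    {"startedDateTime": "2021-01-01T00:00:00.050+00:00", "time": 100.0,
     "response": {"bodySize": 300}},
    {"startedDateTime": "2021-01-01T00:00:00.050+00:00", "time": 200.0,
     "response": {"bodySize": 500}},
]

# Sum of the individual load times, in milliseconds.
summed_ms = sum(e["time"] for e in entries)

# Sum of the body sizes; HAR uses -1 for "unknown", which we count as 0.
total_body_bytes = sum(max(e["response"]["bodySize"], 0) for e in entries)

# Wall-clock duration: from the first request start to the last response end.
starts = [datetime.fromisoformat(e["startedDateTime"]) for e in entries]
ends = [s + timedelta(milliseconds=e["time"]) for s, e in zip(starts, entries)]
wall_clock_ms = (max(ends) - min(starts)).total_seconds() * 1000

print(summed_ms, wall_clock_ms, total_body_bytes)
```

Because two requests overlap, the summed time is larger than the wall-clock time, which is why a summed value is only an approximation.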

You can see that with this simple capture, it doesn't take a lot of time. What about a more complex one?

### Task: export the HAR data of a website of your choice and make a tree out of it. Then check the value of total_load_time

In [None]:
har_path = Path() / '..' / 'path_to_your_capture' / '0.har'  # Placeholder: replace with the path of your capture
complex_crawled_tree = CrawledTree([har_path], str(uuid.uuid4()))
print(complex_crawled_tree.root_hartree.total_load_time)

- A very interesting property to look at is `root_after_redirect`: it returns a URL in case there is at least one redirect on the capture. 
The returned URL is the URL that you'll end up with after following all redirects of the page. 

In [None]:
my_first_crawled_tree.root_hartree.root_after_redirect

If you really want to dig deeper and **investigate the whole construction of the tree**, **I recommend you take a look [here](https://github.com/Lookyloo/har2tree/blob/9eafd89563721acd803defa204fa353c762ca9c9/har2tree/har2tree.py#L579)**. Reading the code will give you more insight, as the construction cannot be covered here step by step.

Same thing for **[make_hostname_tree](https://github.com/Lookyloo/har2tree/blob/9eafd89563721acd803defa204fa353c762ca9c9/har2tree/har2tree.py#L499)**: in short, it builds the hostname tree by aggregating the nodes of the URL tree according to their hostnames.
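
The aggregation idea can be sketched without any tree at all. The following example only groups a flat list of hypothetical URLs by hostname with the standard library; the real `make_hostname_tree` works on tree nodes and keeps the parent/child relations, which this sketch ignores.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical URLs loaded by a page.
urls = [
    "https://example.com/",
    "https://example.com/style.css",
    "https://cdn.example.net/lib.js",
    "https://example.com/logo.png",
]

# Group the URLs by hostname: one bucket per host.
by_host = defaultdict(list)
for url in urls:
    by_host[urlsplit(url).hostname].append(url)

for host, host_urls in by_host.items():
    print(host, len(host_urls))
```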

## HarFile class

If you take a look at **[the code](https://github.com/Lookyloo/har2tree/blob/66817c2c56697fd9a6ff3e42820978802e84faa0/har2tree/har2tree.py#L480)**, you'll notice that we actually call the function from **`har`**, which is nothing other than **[an instance of HarFile](https://github.com/Lookyloo/har2tree/blob/66817c2c56697fd9a6ff3e42820978802e84faa0/har2tree/har2tree.py#L259)** kept inside the Har2Tree class for easier access. You can find its definition in **[the same file](https://github.com/Lookyloo/har2tree/blob/66817c2c56697fd9a6ff3e42820978802e84faa0/har2tree/har2tree.py#L251)**. Its main use is to pre-process a lot of the data needed by the Har2Tree class and to provide a more *Pythonic* interface that lets us access the contents of the HAR file with ease.

Let's see some of its interesting features:

In [None]:
har_properties = my_first_crawled_tree.root_hartree.har    # For more readability

In [None]:
print('uuid: ' + har_properties.capture_uuid)   
print('Path: ' + str(har_properties.path) + '\n')    # What we defined before

print('Initial redirects: ' + str(har_properties.has_initial_redirects))  # Our example from before 
print('Final redirect: ' + har_properties.final_redirect + '\n')    # Same as root_after_redirect

print('Unique representation: ' + repr(har_properties))   # path of the capture and the uuid at the same time

Only execute this cell if you want to see all the information about one given URL.
You could also remove the `[0]` to print out all entries, but it takes <span style="color:red">a lot of time</span>, and printing just the first one is enough to get a good idea of what an entry looks like:


In [None]:
print(har_properties.entries[0])

As you can see, the URL is located in `request > url`. Let's see what we can do with that.


This next example is quite interesting. It prints out all the URLs loaded by the page, followed by the number of entries in the capture; you can even retrace the redirects in the first 4 URLs.

In [None]:
for entry in har_properties.entries:
    print(entry['request']['url'])
print(har_properties.number_entries)

<span style="color:blue">Note:</span> this is of course also implemented in the `all_url_requests` attribute, but in the Har2Tree class.

In [None]:
print(my_first_crawled_tree.root_hartree.all_url_requests)
print(len(my_first_crawled_tree.root_hartree.all_url_requests))

However, you can see that we now only have 6 URLs, compared to 7 before, because the duplicates were removed.
And that weird Tree Node thing... we're getting there in the next part.
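
The difference between the two counts above is easy to reproduce on a plain list. This sketch uses hypothetical URLs (7 entries, one of them requested twice) and is only meant to show the effect of removing duplicates, not how har2tree stores its requests.

```python
# Hypothetical list of requested URLs: the last one repeats the first.
urls = [
    "http://example.com/",
    "http://example.com/a.css",
    "http://example.com/b.js",
    "http://example.com/c.png",
    "http://example.com/d.png",
    "http://example.com/e.woff",
    "http://example.com/",
]

# dict.fromkeys keeps the first occurrence of each URL, in order.
unique_urls = list(dict.fromkeys(urls))

print(len(urls), len(unique_urls))
```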

## nodes.py
As you can see [in this file](https://github.com/Lookyloo/har2tree/blob/66817c2c56697fd9a6ff3e42820978802e84faa0/har2tree/nodes.py), the nodes we use come from the [Ete3 toolkit API](http://etetoolkit.org/docs/latest/reference/reference_tree.html). 

Let's try to see what one of those nodes looks like; we'll begin with **URLNode** as it's pretty self explanatory.

We are going to use the [rendered_node](https://github.com/Lookyloo/har2tree/blob/9eafd89563721acd803defa204fa353c762ca9c9/har2tree/har2tree.py#L489) property as it returns the node which will ultimately be displayed on the screen.

In [None]:
print(my_first_crawled_tree.root_hartree.rendered_node)
print("\n")
print(my_first_crawled_tree.root_hartree.rendered_node.describe())

You might find it weird that *<span style="color:#8b8b8b">the rendered URL itself is not displayed</span>*. This is because the name of the node itself is not returned by the `__str__` [method](https://github.com/etetoolkit/ete/blob/1f587a315f3c61140e3bdbe697e3e86eda6d2eca/ete3/coretype/tree.py#L251) of the TreeNode class, as the `show_internal` parameter is [set to False](https://github.com/etetoolkit/ete/blob/1f587a315f3c61140e3bdbe697e3e86eda6d2eca/ete3/coretype/tree.py#L67) by default.

- You could simply print out the name of the node like this:

In [None]:
my_first_crawled_tree.root_hartree.rendered_node.name

- Or you could invoke the `get_ascii` method, because its default `show_internal` parameter [is set to True](https://github.com/etetoolkit/ete/blob/1f587a315f3c61140e3bdbe697e3e86eda6d2eca/ete3/coretype/tree.py#L1491), but you will have to zoom out to get something readable:

In [None]:
print(my_first_crawled_tree.root_hartree.rendered_node.get_ascii())

- Finally, you could run this little script. It invokes the `.show()` method of a node, which opens a window with an interactive interface and really helps to visualize what the node actually contains. However, you may face a lot of problems while running it, so [here is a screenshot](tree.png) just in case.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; **<span style="color:blue">Note:</span>** you might want to take a look [here](https://itsfoss.com/unable-to-locate-package-error-ubuntu/) and [there](https://gist.github.com/davydka/6715e74bd67501ee0a98e40b9820857c), it really helped me to fix bugs.

In [None]:
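# TODO: I didn't manage to fix that one...
# The quotes around the sudo command were removed: quoted, the shell looks for
# a single command literally named "sudo -S ./interactive_tree.sh".
from getpass import getpass
!echo {getpass()} | sudo -S ./interactive_tree.sh
# The easiest way is to run ./interactive_tree.sh in your shell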

<br/>

**As you can see, it's a lot of trouble to retrieve very little data that is not formatted nicely.** That's why Lookyloo is also complex: it takes care of that in a more effective manner than ete3.  

- Here is another interesting property that lists the URLs that the elements with an `href` attribute lead to (it's way easier than it sounds).

In [None]:
my_first_crawled_tree.root_hartree.rendered_node.urls_in_rendered_page

- Time for a bit more complicated example: let's try to find the node containing the root URL using a method we saw previously:

In [None]:
root_node = my_first_crawled_tree.root_hartree.url_tree.search_nodes(name=my_first_crawled_tree.root_url)[0]
print(root_node.name)

[comment]: <> (to check: url_tree.to_json and rendered_node.to_json both return circular references)
- To see all the information that a node contains, you can simply dump all the features using the `to_json` method:

In [None]:
my_first_crawled_tree.root_hartree.hostname_tree.to_json()

But this is difficult to read. Instead, you could check the `features` property, which is updated in the [add_url](https://github.com/Lookyloo/har2tree/blob/7697d43a513d43e55b8f04778e835d5c6950a8ac/har2tree/nodes.py#L471) method for HostNode or in [load_har_entry](https://github.com/Lookyloo/har2tree/blob/7697d43a513d43e55b8f04778e835d5c6950a8ac/har2tree/nodes.py#L78) for URLNode, and gives a much clearer view of what's inside the node:

In [None]:
my_first_crawled_tree.root_hartree.hostname_tree.features

A few more interesting HostNode features:

- request_cookie: returns the number of unique cookies sent in the requests of all URL nodes
- response_cookie: returns the number of unique cookies received in the responses of all URL nodes
- third_party_cookies_received: returns the number of unique 3rd party cookies received in the responses of all URL nodes
- mixed_content: returns true if there are both http and http**s** URL nodes, false otherwise

In [None]:
print("request cookies: " + str(my_first_crawled_tree.root_hartree.hostname_tree.request_cookie))
print("response cookies: " + str(my_first_crawled_tree.root_hartree.hostname_tree.response_cookie))
print("3rd party cookies: " + str(my_first_crawled_tree.root_hartree.hostname_tree.third_party_cookies_received))
print("mixed content: " + str(my_first_crawled_tree.root_hartree.hostname_tree.mixed_content))
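
To see what "unique cookies" can mean at the HAR level, here is a sketch that counts distinct `(name, value)` pairs in the `cookies` arrays of requests and responses. The entries are hypothetical and `unique_cookies` is a helper invented for this example; har2tree may count cookies differently.

```python
# Hypothetical entries: the same cookie is sent twice, one cookie is received.
entries = [
    {
        "request": {"url": "http://example.com/",
                    "cookies": [{"name": "session", "value": "abc"}]},
        "response": {"cookies": []},
    },
    {
        "request": {"url": "http://example.com/app.js",
                    "cookies": [{"name": "session", "value": "abc"}]},
        "response": {"cookies": [{"name": "tracker", "value": "xyz"}]},
    },
]

def unique_cookies(entries, side):
    """Collect the distinct (name, value) pairs on the 'request' or 'response' side."""
    return {
        (cookie["name"], cookie["value"])
        for entry in entries
        for cookie in entry[side].get("cookies", [])
    }

sent = unique_cookies(entries, "request")
received = unique_cookies(entries, "response")
print(len(sent), len(received))
```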

Let's see what happens with the cookie capture of `capture_samples` where we passed a cookie in the request:

In [None]:
har_path = Path() / '..' / 'tests' / 'capture_samples' / 'cookie' / '0.har'
cookie_crawled_tree = CrawledTree([har_path], str(uuid.uuid4()))
print("request cookies: " + str(cookie_crawled_tree.root_hartree.hostname_tree.request_cookie))
print("response cookies: " + str(cookie_crawled_tree.root_hartree.hostname_tree.response_cookie))
print("3rd party cookies: " + str(cookie_crawled_tree.root_hartree.hostname_tree.third_party_cookies_received))
print("mixed content: " + str(cookie_crawled_tree.root_hartree.hostname_tree.mixed_content))

#### Interpreting the data:
- we passed one cookie in the request
- all the captures were made on www.lookyloo-testing.herokuapp.com: it makes sense that there are no cookies in the response, as it's a clean website and there would be no use for any cookies there

**to check:  mixed_content should be True because herokuapp is http and the rest of the URLs are https**
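
For reference, one possible reading of `mixed_content`, based only on the bullet description earlier in this section, is "both schemes appear among the URLs". The sketch below implements that reading on a flat list of hypothetical URLs; `has_mixed_content` is a made-up helper, and har2tree may scope the check differently (for example per node), which could explain a different result.

```python
from urllib.parse import urlsplit

def has_mixed_content(urls):
    """Return True if both http and https URLs are present in the list."""
    schemes = {urlsplit(url).scheme for url in urls}
    return {"http", "https"} <= schemes

# Hypothetical URL lists.
mixed = ["http://example.com/", "https://cdn.example.net/lib.js"]
https_only = ["https://example.com/", "https://cdn.example.net/lib.js"]

print(has_mixed_content(mixed), has_mixed_content(https_only))
```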

That's it for the nodes. We have now covered **all the classes** available in the har2tree API. 

## helper.py

We'll now briefly cover the methods of the [helper file](https://github.com/Lookyloo/har2tree/blob/7697d43a513d43e55b8f04778e835d5c6950a8ac/har2tree/helper.py). They do not offer an interface like the classes but it is interesting to take a look at them as they are central to the API.

- [rebuild_url](https://github.com/Lookyloo/har2tree/blob/7697d43a513d43e55b8f04778e835d5c6950a8ac/har2tree/helper.py#L57) takes a base URL and a partial (possibly broken) URL, and tries to rebuild the latter until it matches a URL that would be correct
- [find_external_ressources](https://github.com/Lookyloo/har2tree/blob/7697d43a513d43e55b8f04778e835d5c6950a8ac/har2tree/helper.py#L184) crawls all the different types of resources available in a given HTML file and creates a dictionary out of them to summarize what types of content can be found in a given page &rarr; those resources are then displayed in the tree on Lookyloo
- [url_cleanup](https://github.com/Lookyloo/har2tree/blob/7697d43a513d43e55b8f04778e835d5c6950a8ac/har2tree/helper.py#L131) cleans the URL one more time when trying to rebuild it in rebuild_url; it removes characters such as `\\` or `'` 
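
The general idea behind rebuilding a partial URL can be tried out with the standard library. The sketch below uses `urllib.parse.urljoin` and a made-up `naive_cleanup` helper; it is a stand-in to illustrate the concept, not the logic of `rebuild_url` or `url_cleanup`, which handle many more broken cases.

```python
from urllib.parse import urljoin

def naive_cleanup(url: str) -> str:
    """Hypothetical helper: drop backslashes, quotes and surrounding whitespace."""
    for char in ("\\", "'", '"'):
        url = url.replace(char, "")
    return url.strip()

base = "https://example.com/a/b.html"

# Partial URLs as they might appear in a page (hypothetical examples).
partials = [
    "../img/logo.png",         # relative path
    "//cdn.example.net/x.js",  # scheme-relative URL
    "\\'/login\\'",            # escaped quotes left over from a script
]

rebuilt = [urljoin(base, naive_cleanup(partial)) for partial in partials]
for url in rebuilt:
    print(url)
```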

Thank you for following this tutorial. Hopefully you have a better understanding of how the API works now!