# Scraping a web page

## Why?

Websites are often used to trick users into handing information to malicious actors, to spread malware, or to leak private information to third parties without the user's knowledge.

Any website, especially when hosted somewhere, will give us information about the people behind it and how they operate. This information can be used to track a specific campaign, identify the malicious actor(s) behind it, warn other potential targets of the attack, get a victim to act and avoid further consequences, or get in touch with the impersonated organisation so that it can take appropriate measures.

It is also more and more common to see nasty information leaks caused by third-party components, or by a developer's mistake. The problem is that, with websites becoming more and more complex and many moving parts interacting with each other, investigating a website gets tricky very fast.

# Part 0: Setup

## The work environment 

For this tutorial, we assume you have the following environment at your disposal.

1. Ubuntu 20.04 or 21.10. Another similar general-purpose operating system (Debian 10, Fedora) will also do, but specialized distros such as Kali Linux are strongly discouraged and won't be supported if you run into issues.

   **Important note**: It is assumed that you're *not* running as root, but that the account you're using has administrator rights (tl;dr: `sudo` works).

2. Python 3.8 or 3.9.

3. Basic command line tools: `curl`, `wget`, `grep`, `git`

4. Poetry 1.1.0 (or more recent), preferably installed this way: 

  ```bash
  curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py > get-poetry.py
  python3 ./get-poetry.py
  ```

In [None]:
!python -V

In [None]:
!poetry self -V
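
The requirements listed above can also be checked from Python. The following is a minimal sketch: the helper names `python_version_ok` and `missing_tools` are made up for this example, and treating 3.8 as a minimum version is an assumption (the tutorial only names 3.8 and 3.9).

```python
import shutil
import sys

# Command line tools listed in the requirements above, plus Poetry.
REQUIRED_TOOLS = ["curl", "wget", "grep", "git", "poetry"]


def python_version_ok(version_info=sys.version_info, minimum=(3, 8)):
    """Return True if the interpreter is at least the given (major, minor)."""
    return tuple(version_info[:2]) >= minimum


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


print("Python OK:", python_version_ok())
print("Missing tools:", missing_tools())
```

An empty list of missing tools means every command was found on the `PATH`; it says nothing about their versions.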

## The tutorial 

1. Clone the repository (requires `git`)
  ```bash
  git clone https://github.com/Lookyloo/scraping-tutorial.git
  cd scraping-tutorial
  ```
2. Install the dependencies:
  ```bash
  poetry install
  ```
3. Run the lab:
  ```bash
  poetry shell
  jupyter-lab
  ```
4. Move to your browser. Running `jupyter-lab` should have opened a tab in your favorite browser. If it didn't, look in the terminal for hints.

At this point, you're all set up for the tutorial.

## How to use jupyter

Jupyter lets you run commands the same way you would in a terminal, and also run Python code, directly in your browser.

All the code snippets can be executed by pressing SHIFT + Enter.

In [None]:
def hello_world():
    print(">>>>> hello world")
    
hello_world()

To run a system command, you need to prefix it with `!`.

In [None]:
!ls
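
The `!` prefix only works inside Jupyter. If the output of a command is needed in plain Python code, the standard `subprocess` module is one way to get it. This is a small sketch; `run_command` is a hypothetical helper name, not something provided by the tutorial repository.

```python
import subprocess
import sys


def run_command(args):
    """Run a command (given as a list of arguments) and return its standard output."""
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout


# The Python interpreter is used as the command so the example runs anywhere;
# run_command(["ls"]) would be the counterpart of the `!ls` cell.
output = run_command([sys.executable, "-c", "print('hello from a subprocess')"])
print(output)
```

With `check=True`, a command that exits with a non-zero status raises an exception instead of failing silently.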

# Part I: The trivial approach

In this first part, we will look at existing tools and use simple techniques to extract indicators from websites.

**Goal**: get a listing of all the URLs a webpage loads content from. The content can be internal (same domain) or external (a completely different domain).
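
The internal/external distinction can be sketched in a few lines of Python. The function `classify` and the resource paths below are assumptions made for illustration, and "same domain" is simplified here to "exactly the same hostname".

```python
from urllib.parse import urljoin, urlparse


def classify(page_url, resource_url):
    """Return 'internal' if the resource is on the same host as the page, else 'external'."""
    # Resolve relative references such as "/assets/style.css" against the page URL.
    absolute = urljoin(page_url, resource_url)
    page_host = urlparse(page_url).hostname
    resource_host = urlparse(absolute).hostname
    return "internal" if resource_host == page_host else "external"


page = "https://circl.lu/"
# Hypothetical resource references, as they could appear in a page's HTML.
resources = [
    "/assets/css/style.css",
    "https://circl.lu/logo.png",
    "https://example.org/lib.js",
    "//cdn.example.net/font.woff",
]
for resource in resources:
    print(classify(page, resource), urljoin(page, resource))
```

Because the comparison is a strict hostname match, a resource on `www.circl.lu` would count as external to a page on `circl.lu`. Whether that is the right call depends on the investigation.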

## Simple website

In [None]:
!curl https://circl.lu

### Notes

Commonly loaded stuff:

* css stylesheets
* fonts
* icons
* images
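
Most of the items in this list are declared in the HTML itself, so they can be pulled out with Python's built-in HTML parser instead of `grep`. The sketch below is only an illustration: `ResourceCollector` is a made-up class, the tag list is not exhaustive, and `SAMPLE_HTML` is invented markup standing in for the real output of `curl`.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collect URLs from tags that make a browser fetch something."""

    # Tag -> attribute holding the URL. <a href> is left out on purpose: a link is
    # only followed when the user clicks it, it is not loaded with the page.
    URL_ATTRIBUTES = {"link": "href", "script": "src", "img": "src", "iframe": "src"}

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRIBUTES.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.resources.append((tag, value))


# Hypothetical HTML, standing in for the output of `curl https://circl.lu`.
SAMPLE_HTML = """
<html><head>
  <link rel="stylesheet" href="/assets/css/main.css">
  <link rel="icon" href="/favicon.ico">
  <script src="https://example.org/lib.js"></script>
</head><body>
  <img src="/assets/images/logo.png" alt="logo">
  <a href="/contact/">Contact</a>
</body></html>
"""

collector = ResourceCollector()
collector.feed(SAMPLE_HTML)
for tag, url in collector.resources:
    print(tag, url)
```

Fonts are usually referenced from a stylesheet rather than from the HTML, so a parser like this one will not see them.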

Example search:

In [None]:
!curl https://circl.lu | grep raleway

    
**Questions**:
* Do we see external content?
* Can any of the resources loaded by the page load something else?


Let's cheat and open that page in a browser and look at the network traffic:

1. Open a new window in your favorite browser
2. Go to `http://circl.lu`
3. Open the dev tools
  * Firefox: Right click > `Inspect Elements`
  * Chrome/Chromium: Right click > `Inspect`
4. Go to the `Network` tab
5. Reload the page (F5)

**Question**:

* Why is this URL loaded? http://circl.lu/assets/fonts/raleway-regular.ttf

Note: In Chrome/Chromium, the `Initiator` tab will tell you.
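
A common reason for a font request is a stylesheet that references the file with `url(...)`, which the browser then fetches on its own. Whether that is what happens here is for the dev tools to confirm. The sketch below shows how such references could be extracted from CSS. The stylesheet content and its location (`/assets/css/main.css`) are invented for the example, and the regular expression is a rough approximation, not a full CSS parser.

```python
import re
from urllib.parse import urljoin

# Matches url(...) with optional single or double quotes around the value.
CSS_URL = re.compile(r"""url\(\s*['"]?([^'")]+?)['"]?\s*\)""")


def urls_in_css(css_text, stylesheet_url):
    """Return absolute URLs referenced by url(...) in a stylesheet."""
    # Relative references in CSS are resolved against the stylesheet, not the page.
    return [urljoin(stylesheet_url, ref) for ref in CSS_URL.findall(css_text)]


# Hypothetical stylesheet content and location, for illustration only.
SAMPLE_CSS = """
@font-face {
  font-family: "Raleway";
  src: url("../fonts/raleway-regular.ttf") format("truetype");
}
body { background: url(/assets/images/bg.png); }
"""

for url in urls_in_css(SAMPLE_CSS, "http://circl.lu/assets/css/main.css"):
    print(url)
```

This is why a listing based only on the HTML is incomplete: every stylesheet (and script) can trigger further requests of its own.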

## Not so simple website


In [None]:
!curl salon.com

### wat?

Let's figure out why that didn't return anything.

In [None]:
!curl -vvv salon.com

There is a redirect (HTTP code 301).

How do we follow that?

In [None]:
!man curl

### Redirect

```
       -L, --location
              (HTTP) If the server reports that the requested page has moved to a different location (indicated with a Location: header and a 3XX response code),  this  option
              will make curl redo the request on the new place. If used together with -i, --include or -I, --head, headers from all requested pages will be shown. When authen‐
              tication is used, curl only sends its credentials to the initial host. If a redirect takes curl to a different host, it won't be able to intercept the user+pass‐
              word. See also --location-trusted on how to change this. You can limit the amount of redirects to follow by using the --max-redirs option.

              When  curl  follows a redirect and the request is not a plain GET (for example POST or PUT), it will do the following request with a GET if the HTTP response was
              301, 302, or 303. If the response code was any other 3xx code, curl will re-send the following request using the same unmodified method.

              You can tell curl to not change the non-GET request method to GET after a 30x response by  using  the  dedicated  options  for  that:  --post301,  --post302  and
              --post303.

```

In [None]:
!curl -L salon.com | grep href
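
What `-L` does can be sketched as a loop: request the URL, and if the answer is a redirect, request the address given in the `Location` header instead. The code below is a simplified model, not curl's implementation. `follow_redirects`, the `fetch` callable, and the simulated responses are all assumptions made so that the example runs offline; the real redirect chain of `salon.com` may look different.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_redirects(url, fetch, max_redirects=10):
    """Follow Location headers until a non-redirect response; return the chain of URLs.

    `fetch` is a hypothetical callable returning (status_code, headers_dict) for a URL.
    """
    chain = [url]
    while True:
        status, headers = fetch(url)
        location = headers.get("Location")
        if status not in REDIRECT_CODES or not location:
            return chain
        if len(chain) > max_redirects:
            raise RuntimeError(f"more than {max_redirects} redirects")
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
        chain.append(url)


# Simulated responses so the example runs offline; the real targets may differ.
FAKE_RESPONSES = {
    "http://salon.com/": (301, {"Location": "https://salon.com/"}),
    "https://salon.com/": (301, {"Location": "https://www.salon.com/"}),
    "https://www.salon.com/": (200, {}),
}

print(follow_redirects("http://salon.com/", FAKE_RESPONSES.__getitem__))
```

Keeping the whole chain, and not just the final URL, is useful in an investigation: every intermediate hop is a host that was contacted.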

Let's do the same as for the simple website above and search for the internal and external resources loaded by the page.

# Phishing website

To find a valid, working phishing website, have a look at PhishTank:
http://phishtank.org/phish_search.php?valid=y&active=y&Search=Search

In [None]:
!curl -L -vvv https://www.inelle.fr/modules/paypal/smarty/plugins/www.netflix.de/www/spotify/
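
The URL above is worth taking apart before looking at the response: the host that actually serves the page is not the brand name that appears further along in the path. The sketch below splits the two. `describe_url` and the `HOSTNAME_LIKE` pattern are a rough heuristic written for this example, not a detection method from the tutorial, and the pattern would also match ordinary file names such as `index.html`.

```python
import re
from urllib.parse import urlparse

# Rough pattern for a path segment that looks like a hostname, e.g. "www.netflix.de".
HOSTNAME_LIKE = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$", re.IGNORECASE)


def describe_url(url):
    """Split a URL into its real host and any hostname-looking path segments."""
    parsed = urlparse(url)
    segments = [segment for segment in parsed.path.split("/") if segment]
    lookalikes = [segment for segment in segments if HOSTNAME_LIKE.match(segment)]
    return {"host": parsed.hostname, "hostname_like_segments": lookalikes}


info = describe_url(
    "https://www.inelle.fr/modules/paypal/smarty/plugins/www.netflix.de/www/spotify/"
)
print(info)
```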

In [None]:
!curl -vvv  https://luxembourg-post.com/