In [18]:
# Read the instructions before choosing a kernel and running the cells
with open('links.txt', encoding='utf-8') as file:
    URLS = file.read().split()

# Crawling to your host machine
This is relatively straightforward: the results are saved to files on your local file system.

In [2]:
from ggplace_review_crawler import ReviewCrawler

metadata_file = open('../docker-hadoop/data-mount/place_meta.csv', 'w', encoding='utf-8')
reviews_file = open('../docker-hadoop/data-mount/reviews.csv', 'w', encoding='utf-8')
crawler = ReviewCrawler()

metadata_file.write('place_index,address,price_range\n')
reviews_file.write('review,rating,place_index\n');

In [3]:
crawler.open()

num_reviews = 0
load_timeout = 2
for i, url in enumerate(URLS):
    print(f'Crawled {i+1}/{len(URLS)} links\n')

    data = crawler.crawl_from(url, num_reviews, load_timeout)

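    # Quote the address because it may itself contain commas
    address = str(data['address']).replace('"', "'")
    metadata_file.write(f'{i},"{address}",{data["price_range"]}\n')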
    for review, rating in zip(data['reviews'], data['ratings']):
        review = review.replace('\n', '. ').replace('"', "'")
        reviews_file.write(f'"{review}",{rating},{i}\n')

    metadata_file.flush()
    reviews_file.flush()

Crawled 1/18 links

Now crawling from https://www.google.co.vi/maps/place/KFC+Phan+Huy+%C3%8Dch/@10.8295038,106.6301008,17z/data=!3m1!4b1!4m6!3m5!1s0x317529a4fb79dbed:0xced4a7a09e5c949a!8m2!3d10.8294985!4d106.6326757!16s%2Fg%2F11r7tm_3n2?entry=ttu&g_ep=EgoyMDI1MDQyMy4wIKXMDSoASAFQAw%3D%3D
Getting place overview...
Begin crawling reviews
Crawling finished!
Crawled 2/18 links

Now crawling from https://www.google.co.vi/maps/place/KFC+Nguy%E1%BB%85n+V%C4%83n+Qu%C3%A1/@10.8372582,106.6268548,17z/data=!3m1!4b1!4m6!3m5!1s0x31752912173e1fbf:0xa9b22b550b3a5279!8m2!3d10.8372529!4d106.6294297!16s%2Fg%2F11s938b10b?entry=ttu&g_ep=EgoyMDI1MDQyMy4wIKXMDSoASAFQAw%3D%3D
Getting place overview...
Begin crawling reviews
Crawling finished!
Crawled 3/18 links

Now crawling from https://www.google.co.vi/maps/place/KFC+Pandora/@10.8074047,106.6314162,17z/data=!3m1!4b1!4m6!3m5!1s0x317529f7450bbb47:0xc96469088e4537ca!8m2!3d10.8073994!4d106.6339911!16s%2Fg%2F1

In [4]:
# don't forget to close these
crawler.close()
metadata_file.close()
reviews_file.close()
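
Hand-built CSV lines need every comma, quote, and newline in a field handled manually. As an alternative, here is a minimal sketch using Python's standard `csv` module, which quotes fields only where needed. The helper name `rows_to_csv` and the sample values are hypothetical and not part of the crawler package.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize an iterable of rows to CSV text, quoting fields where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample values, for illustration only.
sample = rows_to_csv([
    ('Great food, fast service', 5, 0),
    ('They said "sold out"', 2, 0),
])
print(sample)
```

The returned string could be passed to `reviews_file.write(...)` in place of the f-string. Embedded double quotes are doubled rather than replaced, so the original review text survives a round trip through `csv.reader`.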

# Crawling to a hosted HDFS on Docker
If you are hosting an HDFS locally on your machine, this should be straightforward: establish a client as shown below (after changing the IP address, of course) and write to a file on the HDFS.

However, if you are hosting the HDFS through a Docker container, it is a bit more involved. You can create the client the same way, but you need to expose the datanode's port 9864 so that the client can transfer data.\
Alternatively, you can run your Python scripts inside the Docker application so that no ports need to be exposed, which is what this project does.
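
Before creating a client, it can help to confirm that the relevant ports are reachable from wherever the script runs. The following is a minimal sketch using only the standard library. The host name `namenode` and port 9870 match the client URL used later in this notebook, while `datanode` is an assumed service name that should be replaced with the one from your own setup.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


if __name__ == '__main__':
    # 'datanode' is an assumption; substitute your own service name.
    for host, port in [('namenode', 9870), ('datanode', 9864)]:
        status = 'reachable' if port_is_open(host, port) else 'NOT reachable'
        print(f'{host}:{port} is {status}')
```

If the namenode port is reachable but the datanode port is not, metadata calls such as creating directories may succeed while reads and writes fail, which is the symptom of a datanode port that has not been exposed.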

## How does this work?
The project hosts an HDFS within a Docker application. The application consists of multiple containers that run the Hadoop services, the Spark service, and a Jupyter service with PySpark installed.

When the application is built and run, a Jupyter server is hosted locally, and you can connect to it to run Python scripts.

## What to do now?
Assuming that you have successfully started the container stack in [`docker-hadoop`](../docker-hadoop/), follow the steps below. If not, check out the [`README.md`](../docker-hadoop/README.md) first:
1. Before doing anything within this notebook, you must first upload the [`ggplace_review_crawler`](./ggplace_review_crawler/) package to the Jupyter server. You can do this either by copying everything into the appropriate container or through the web UI at http://localhost:8888.
2. Because the code uses a Firefox web driver, Firefox must be installed in the running container. First run `docker exec -u root -it <container ID> bash`, which opens a bash shell inside the spark-notebook container. Then follow the steps in this [blog](https://www.omgubuntu.co.uk/2022/04/how-to-install-firefox-deb-apt-ubuntu-22-04). You only need steps 1 to 6 to finish the installation; everything can be copied and pasted straight into bash.
3. When the installation is done, check that Firefox runs within the container by running `firefox -headless`, or check that the web driver can establish a connection using the code provided below.
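
As a standalone way to check the web driver from step 3, here is a minimal sketch that starts headless Firefox through Selenium and loads a blank page. It assumes `selenium` is already installed in the kernel, and it is independent of how `ReviewCrawler` configures its own driver. The function name `firefox_driver_works` is hypothetical.

```python
def firefox_driver_works(test_url='about:blank'):
    """Try to start headless Firefox via Selenium and load one page."""
    # Imported here so the function can be defined before selenium is installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument('-headless')  # same flag as `firefox -headless`
    driver = None
    try:
        driver = webdriver.Firefox(options=options)
        driver.get(test_url)
        return True
    except Exception as error:  # e.g. missing Firefox binary or driver
        print(f'Firefox web driver check failed: {error}')
        return False
    finally:
        if driver is not None:
            driver.quit()
```

Calling `firefox_driver_works()` returns `True` when the browser starts and loads the page; otherwise it prints the underlying error, which usually points to a missing Firefox binary or driver.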

You should be able to run the following cells after the steps above.

In [1]:
# Install the necessary libraries
%pip install selenium
%pip install hdfs

Collecting selenium
  Downloading selenium-4.31.0-py3-none-any.whl.metadata (7.5 kB)
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.30.0-py3-none-any.whl.metadata (8.5 kB)
Collecting trio-websocket~=0.9 (from selenium)
  Downloading trio_websocket-0.12.2-py3-none-any.whl.metadata (5.1 kB)
Collecting typing_extensions~=4.9 (from selenium)
  Downloading typing_extensions-4.13.2-py3-none-any.whl.metadata (3.0 kB)
Collecting websocket-client~=1.8 (from selenium)
  Downloading websocket_client-1.8.0-py3-none-any.whl.metadata (8.0 kB)
Collecting attrs>=23.2.0 (from trio~=0.17->selenium)
  Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB)
Collecting outcome (from trio~=0.17->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.9->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl.metadata (5.6 kB)
Collecting h11<1,>=0.9.0 (from wsproto>=0.14->trio-websocket~=0.9->selenium)
  Downloading h11

In [14]:
from hdfs import Client
from ggplace_review_crawler import ReviewCrawler

client = Client('http://namenode:9870')
crawler = ReviewCrawler()

client.makedirs('/review_data')
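# overwrite=True lets this cell be re-run; without it, write() fails if the file already exists
client.write('/review_data/place_meta.csv', data='place_index,address,price_range\n',
             encoding='utf-8', overwrite=True)
client.write('/review_data/reviews.csv', data='review,rating,place_index\n',
             encoding='utf-8', overwrite=True)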

In [None]:
crawler.open()

num_reviews = 10
load_timeout = 2
for i, url in enumerate(URLS):
    print(f'Crawled {i+1}/{len(URLS)} links\n')

    data = crawler.crawl_from(url, num_reviews, load_timeout)

    client.write('/review_data/place_meta.csv',
                 # Quote the address because it may itself contain commas
                 data='{},"{}",{}\n'.format(
                     i, str(data['address']).replace('"', "'"), data['price_range']),
                 encoding='utf-8',
                 append=True
    )

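    # Wrap each review in double quotes so commas inside the text do not
    # split the row, matching the format used for the local file
    review_data = ''.join([
        '"' + review.replace('\n', '. ').replace('"', "'") + '"' +
            f',{rating},{i}\n'
        for review, rating in zip(data['reviews'], data['ratings'])
    ])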
    client.write('/review_data/reviews.csv',
                 data=review_data,
                 encoding='utf-8',
                 append=True
    )

Now crawling from https://www.google.co.vi/maps/place/KFC+Phan+Huy+%C3%8Dch/@10.8295038,106.6301008,17z/data=!3m1!4b1!4m6!3m5!1s0x317529a4fb79dbed:0xced4a7a09e5c949a!8m2!3d10.8294985!4d106.6326757!16s%2Fg%2F11r7tm_3n2?entry=ttu&g_ep=EgoyMDI1MDQyMy4wIKXMDSoASAFQAw%3D%3D
Getting place overview...
Begin crawling reviews
Crawling finished!
Now crawling from https://www.google.co.vi/maps/place/KFC+Nguy%E1%BB%85n+V%C4%83n+Qu%C3%A1/@10.8372582,106.6268548,17z/data=!3m1!4b1!4m6!3m5!1s0x31752912173e1fbf:0xa9b22b550b3a5279!8m2!3d10.8372529!4d106.6294297!16s%2Fg%2F11s938b10b?entry=ttu&g_ep=EgoyMDI1MDQyMy4wIKXMDSoASAFQAw%3D%3D
Getting place overview...
Begin crawling reviews
Crawling finished!
Now crawling from https://www.google.co.vi/maps/place/KFC+Pandora/@10.8074047,106.6314162,17z/data=!3m1!4b1!4m6!3m5!1s0x317529f7450bbb47:0xc96469088e4537ca!8m2!3d10.8073994!4d106.6339911!16s%2Fg%2F11stkgyrhl?entry=ttu&g_ep=EgoyMDI1MDQyMy4wIKXMDSoASAFQAw%3D%3

In [21]:
# don't forget to close
crawler.close()
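
Once the crawl has finished, the file on HDFS can be read back as a sanity check. The sketch below parses CSV text with the standard `csv` module and reads the file through the same `hdfs` client. The helper names `parse_reviews_csv` and `load_reviews` are hypothetical. `csv.DictReader` stores any surplus fields under the `None` key, so a row with that key indicates a review whose commas were not quoted.

```python
import csv
import io


def parse_reviews_csv(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def load_reviews(client, hdfs_path='/review_data/reviews.csv'):
    """Read a CSV file through an hdfs Client and parse it."""
    with client.read(hdfs_path, encoding='utf-8') as reader:
        return parse_reviews_csv(reader.read())


def malformed_rows(rows):
    """Return rows that have more fields than the header declares."""
    return [row for row in rows if None in row]
```

With the `client` created earlier, `rows = load_reviews(client)` followed by `print(len(rows), len(malformed_rows(rows)))` gives the number of parsed reviews and the number of rows that did not fit the three-column header.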