# Crawling to your host machine
This is relatively straightforward: the results are saved to files on your local file system.

In [None]:
from ggplace_review_crawler import ReviewCrawler

urls = []

metadata_file = open('../docker-hadoop/data-mount/place_meta.csv', 'w', encoding='utf-8')
reviews_file = open('../docker-hadoop/data-mount/reviews.csv', 'w', encoding='utf-8')
crawler = ReviewCrawler()

# Header rows; the trailing newline keeps the first data row on its own line
metadata_file.write('place_index,address,price_range\n')
reviews_file.write('review,rating,place_index\n')

In [None]:
crawler.open()

for i, url in enumerate(urls):
    data = crawler.crawl_from(url)

    metadata_file.write(f"{i},{data['address']},{data['price_range']}\n")
    for review, rating in zip(data['reviews'], data['ratings']):
        review = review.replace('\n ', '. ').replace('"', "'")
        reviews_file.write(f'"{review}",{rating},{i}\n')

# don't forget to close these
crawler.close()
metadata_file.close()
reviews_file.close()
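
Place addresses often contain commas, and reviews may contain quotes, so rows assembled by hand can be misread by a CSV parser. The following is a minimal sketch of the same write using Python's standard `csv` module, which quotes fields as needed. The helper name `write_rows` and the sample record are hypothetical; the `data` keys are assumed to match the crawler output used above.

```python
import csv
import io


def write_rows(meta_out, reviews_out, index, data):
    """Append one place's metadata row and its review rows as CSV.

    `data` is assumed to have the same keys as the crawler output used
    above: 'address', 'price_range', 'reviews' and 'ratings'.
    """
    meta_writer = csv.writer(meta_out, lineterminator='\n')
    review_writer = csv.writer(reviews_out, lineterminator='\n')

    meta_writer.writerow([index, data['address'], data['price_range']])
    for review, rating in zip(data['reviews'], data['ratings']):
        # Flatten line breaks as the loop above does; csv.writer escapes quotes itself.
        review_writer.writerow([review.replace('\n ', '. '), rating, index])


# Made-up record, only for illustration.
sample = {
    'address': '12 Example St, Ward 1',
    'price_range': '$$',
    'reviews': ['Great "pho"\n Will return'],
    'ratings': [5],
}
meta_buf, reviews_buf = io.StringIO(), io.StringIO()
write_rows(meta_buf, reviews_buf, 0, sample)
print(meta_buf.getvalue(), end='')
print(reviews_buf.getvalue(), end='')
```

When writing to real files instead of in-memory buffers, open them with `newline=''`, as the `csv` module documentation recommends.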

# Crawling to a hosted HDFS on Docker
If you are hosting an HDFS instance directly on your machine, this should be straightforward: create a client as shown below (after changing the address, of course) and write to files on the HDFS.

However, if you are hosting the HDFS in a Docker container, it is a bit more involved. You can create the client the same way, but you need to expose the datanode's port 9864 so the client can transfer data.\
Alternatively, you can run your Python scripts inside the Docker application so that no ports need to be exposed, which is what this project does.
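
If you do go the route of exposing ports, it can help to confirm from the host that they are reachable before creating a client. The sketch below uses only the standard library. It assumes the ports are published on `localhost`; 9870 is the namenode port taken from the client URL used in this notebook, and 9864 is the datanode port mentioned above. The function names are hypothetical.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_hdfs_ports(host='localhost', ports=(9870, 9864)):
    """Map each port to whether it is reachable on host."""
    return {port: port_is_open(host, port) for port in ports}


if __name__ == '__main__':
    for port, ok in check_hdfs_ports().items():
        print(f"{port}: {'reachable' if ok else 'NOT reachable'}")
```

A port that shows as not reachable usually means the container's port mapping is missing, although this check only tests TCP connectivity, not that HDFS itself is healthy.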

## How does this work?
The project hosts an HDFS within a Docker application. The application consists of multiple containers, which run the Hadoop services, the Spark service, and a Jupyter service with PySpark installed.

When the application is built and run, a Jupyter server is hosted locally, and you can connect to it to run Python scripts.

## What to do now?
Before doing anything else in this notebook, you must upload the [`ggplace_review_crawler`](./ggplace_review_crawler/) package to the Jupyter server. You can do this either by copying the package into the appropriate container or through the web UI at http://localhost:8888.

After that, you may run the following cells.

In [None]:
# Install the necessary libraries
%pip install selenium
%pip install hdfs

In [None]:
from hdfs import Client
from ggplace_review_crawler import ReviewCrawler

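client = Client('http://namenode:9870')
crawler = ReviewCrawler()

urls = []  # place URLs to crawl; must be defined again on this Jupyter server

# Create the files with their header rows (overwrite=True replaces existing files)
client.write('/review_data/place_meta.csv', data='place_index,address,price_range\n', overwrite=True)
client.write('/review_data/reviews.csv', data='review,rating,place_index\n', overwrite=True)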

In [None]:
crawler.open()

for i, url in enumerate(urls):
    data = crawler.crawl_from(url)

    client.write('/review_data/place_meta.csv',
                 data=f"{i},{data['address']},{data['price_range']}\n",
                 encoding='utf-8',
                 append=True
    )

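    # Build the rows in a plain loop: reusing the same quote type (or a backslash)
    # inside an f-string expression is a syntax error before Python 3.12.
    rows = []
    for review, rating in zip(data['reviews'], data['ratings']):
        review = review.replace('\n ', '. ').replace('"', "'")
        rows.append(f'"{review}",{rating},{i}\n')
    review_data = ''.join(rows)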
    client.write('/review_data/reviews.csv',
                 data=review_data,
                 encoding='utf-8',
                 append=True
    )

# don't forget to close
crawler.close()
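
After the crawl finishes, you may want to check that the rows actually landed on the HDFS. The following is a minimal sketch of a preview helper, assuming the `hdfs` client's `read` method is used as a context manager with `encoding` and `delimiter` set, as in that library's documented examples. The helper name `preview` is hypothetical.

```python
from itertools import islice


def preview(client, hdfs_path, n=5):
    """Return the first n lines of a UTF-8 text file on HDFS."""
    # With `delimiter` set, the reader yields one decoded line at a time.
    with client.read(hdfs_path, encoding='utf-8', delimiter='\n') as reader:
        return list(islice(reader, n))
```

For example, `preview(client, '/review_data/reviews.csv')` would return the header row followed by the first few review rows.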