# Homework 3 - Places of the world


## 1. Data collection

For this homework, there is no provided dataset; you have to build your own. Your search engine will run on text documents, so this section details the data collection procedure to follow.


### 1.3. Parse downloaded pages

At this point, you should have all the HTML documents about the places of interest, and you can start to extract the places' information. The list of the information we desire for each place and their format is as follows:

1. Place Name (to save as `placeName`): String.
2. Place Tags (to save as `placeTags`): List of Strings.
3. \# of people who have been there (to save as `numPeopleVisited`): Integer.
4. \# of people who want to visit the place (to save as `numPeopleWant`): Integer.
5. Description (to save as `placeDesc`): String. Everything from under the first image up to "know before you go" (orange frame on the example image).
6. Short Description (to save as `placeShortDesc`): String. Everything from the title and location up to the image (blue frame on the example image).
7. Nearby Places (to save as `placeNearby`): Extract the names of all nearby places, but only keep unique values: List of Strings.
8. Address of the place (to save as `placeAddress`): String.
9. Latitude and Longitude of the place's location (to save as `placeAlt` and `placeLong`): Integers.
10. The username of the post editors (to save as `placeEditors`): List of Strings.
11. Post publishing date (to save as `placePubDate`): datetime.
12. The names of the lists that the place was included in (to save as `placeRelatedLists`): List of Strings.
13. The names of the related places (to save as `placeRelatedPlaces`): List of Strings.
14. The URL of the page of the place (to save as `placeURL`): String.

<p align="center">
<img src="img/last_version_place.png" width = 1000>
</p>


For each place, you create a `place_i.tsv` file of this structure:

```
placeName \t placeTags \t  ... \t placeURL
```

If a piece of information is missing, leave it as an empty string.
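
The file structure above can be sketched as follows. This is a minimal sketch, assuming Python's standard `csv` module with a tab delimiter; `FIELDS`, `write_place_tsv`, and the sample record are hypothetical names used only for illustration, and storing list-valued fields as their Python string form is just one possible convention.

```python
import csv
import os
import tempfile

# Column order follows the list of fields above; placeAlt and placeLong are separate columns.
FIELDS = ["placeName", "placeTags", "numPeopleVisited", "numPeopleWant",
          "placeDesc", "placeShortDesc", "placeNearby", "placeAddress",
          "placeAlt", "placeLong", "placeEditors", "placePubDate",
          "placeRelatedLists", "placeRelatedPlaces", "placeURL"]

def write_place_tsv(record, index, out_dir):
    """Write one place as a single tab-separated row; missing fields become ''."""
    path = os.path.join(out_dir, f"place_{index}.tsv")
    row = ["" if record.get(f) is None else str(record[f]) for f in FIELDS]
    with open(path, "w", newline="", encoding="utf-8") as fh:
        csv.writer(fh, delimiter="\t").writerow(row)
    return path

# Hypothetical record with most fields missing (assumption: made-up values).
demo_dir = tempfile.mkdtemp()
demo_path = write_place_tsv(
    {"placeName": "Example Place", "placeTags": ["caves", "art"]}, 1, demo_dir)

# Read the row back to confirm the structure.
with open(demo_path, newline="", encoding="utf-8") as fh:
    demo_row = next(csv.reader(fh, delimiter="\t"))
```

Using `csv.writer` rather than a manual `"\t".join(...)` means a description that itself contains a tab or a newline is quoted instead of silently breaking the column layout.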


### IMPORT

In [11]:
import requests
from bs4 import BeautifulSoup

### 1.1. Get the list of places

We start with the list of places to include in your corpus of documents. In particular, we focus on the [Most popular places](https://www.atlasobscura.com/places?sort=likes_count). We want you to **collect the URL** associated with each place in this list.
The list is long and split into many pages. Therefore, we ask you to retrieve only the URLs of the places listed in **the first 400 pages** (each page has 18 places, so you will end up with 7200 unique place URLs).

The output of this step is a `.txt` file in which each line contains the URL of one place.


Build the URL components needed to request every page of the list of places.

In [1]:
url = 'https://www.atlasobscura.com/places?sort=likes_count'
main_domain = 'https://www.atlasobscura.com'
# The page number is passed through the "page" query parameter.
get_page_query = '/places?page='
query_end = '&sort=likes_count'
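
As an alternative to concatenating the pieces by hand, the page URL can be assembled with the standard library. This is a minimal sketch, assuming the list accepts the `page` and `sort` query parameters shown in the link above; `BASE` and `page_url` are hypothetical names.

```python
from urllib.parse import urlencode

BASE = "https://www.atlasobscura.com/places"

def page_url(page):
    """Return the URL of one page of the list, sorted by likes."""
    return BASE + "?" + urlencode({"page": page, "sort": "likes_count"})

print(page_url(3))  # https://www.atlasobscura.com/places?page=3&sort=likes_count
```

`urlencode` inserts the `&` separators and escapes special characters, which avoids malformed query strings.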

Main code to get the URL of every place and save them locally in `places_url.txt`.

In [None]:
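cnt = 0
# Write one place URL per line; the with-block closes the file even if a request fails.
with open("places_url.txt", 'w') as output:
    for i in range(1, 401):
        req = requests.get(main_domain + get_page_query + str(i) + query_end)
        # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning.
        soup = BeautifulSoup(req.text, 'html.parser')
        places_url = [main_domain + x.get('href') for x in soup.find_all('a', {'class': 'content-card-place'})]
        for place_url in places_url:
            output.write(place_url + "\n")
            cnt += 1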

Check that all 7200 URLs have been written to `places_url.txt`.

In [17]:
print(cnt)

7200
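
The counter above counts written lines, not distinct URLs. A stricter check can be sketched as follows, assuming the file holds one URL per line; `count_urls` is a hypothetical helper, and the demo runs on a temporary file with made-up URLs rather than on the real `places_url.txt`.

```python
import os
import tempfile

def count_urls(path):
    """Return (total, unique) counts of the non-empty lines in a URL file."""
    with open(path, encoding="utf-8") as fh:
        urls = [line.strip() for line in fh if line.strip()]
    return len(urls), len(set(urls))

# Demo on a throwaway file that contains one duplicated URL.
fd, demo_file = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as fh:
    fh.write("https://example.com/places/a\n"
             "https://example.com/places/b\n"
             "https://example.com/places/a\n")

total, unique = count_urls(demo_file)
print(total, unique)  # 3 2
```

For the real file, `count_urls("places_url.txt")` would be expected to return `(7200, 7200)` if no place was collected twice.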


### 1.2. Crawl places

Once you get all the URLs in the first 400 pages of the list, you:

1. Download the HTML corresponding to each of the collected URLs.
2. After you collect a single page, immediately save its `HTML` in a file. In this way, if your program stops for any reason, you will not lose the data collected up to the stopping point.
3. Organize the entire set of downloaded `HTML` pages into folders. Each folder will contain the `HTML` of the places on page 1, page 2, ... of the list of locations.

__Tip__: Because you have to download a large number of pages, you can use methods that shorten the download time. If you employ a particular process or approach, please describe it.
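
One way to shorten the download time is to run several requests at once on a pool of threads, since downloading is dominated by waiting on the network. This is a minimal sketch of that idea, not necessarily the approach used in this notebook; `download_all`, `fake_fetch`, and the worker counts are assumptions, and `fake_fetch` stands in for a real download function so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(jobs, fetch, max_workers=8):
    """Run fetch(*job) for every job on a thread pool; results keep the job order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda job: fetch(*job), jobs))

# Stand-in for a real download function (hypothetical): it only formats its arguments.
def fake_fetch(url, number):
    return f"{number}:{url}"

# Each job is a tuple of arguments for the fetch function.
jobs = [(f"https://example.com/places/{n}", n) for n in range(5)]
results = download_all(jobs, fake_fetch, max_workers=3)
```

In a real crawl, each job would carry the arguments of the download function, and a modest number of workers keeps the load on the server reasonable.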

#### IMPORT

In [None]:
import linecache
import requests
import os

#### get_html function

This function downloads the HTML page of a place on www.atlasobscura.com.

*Parameters*

- __url__: the URL of the HTML page to download. It is read from a `.txt` file containing the 7200 URLs of places on www.atlasobscura.com.
- __path__: the local directory in which to save the `.html` file.
- __number__: an index from 0 to 7199; each number corresponds to a different place.
- __page__: a number from 1 to 400; each number corresponds to a page of the list at `https://www.atlasobscura.com/places?page=<page>&sort=likes_count`.

*Execution*

If `place<number>.html` does not yet exist in `path`, or is smaller than 2000 bytes, the function makes a GET request to the given `url`. When the request returns status code 200, the response text (`req.text`) is written to `place<number>.html`. The function then prints feedback on the standard output: "done!" if the file had already been saved locally, or "Downloaded place `<number>`, Page `<page>`" if it has just been saved.

In [None]:
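def get_html(url, path, number, page):
    file_path = os.path.join(path, "place" + str(number) + ".html")
    try:
        # Download only if the file is missing or suspiciously small (< 2000 bytes),
        # so an interrupted crawl can be resumed without repeating finished pages.
        if not os.path.exists(file_path) or os.path.getsize(file_path) < 2000:
            req = requests.get(url)
            # Open the file only after a successful response, so failures leave no empty file.
            if req.status_code == 200:
                with open(file_path, "w", encoding="utf-8") as file_html:
                    file_html.write(req.text)
                print("Downloaded place " + str(number) + ", Page " + str(page))
        else:
            print("done!")
    except Exception as e:
        print("Error occurred! " + str(e))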

This block of code creates the 400 needed directories, one for each page of the list at `https://www.atlasobscura.com/places?page=<page_index>&sort=likes_count`.

In [None]:
for page_index in range(1, 401):
    dir_path = "HTML_FILES/page" + str(page_index)
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)

Main code that calls `get_html` for every URL saved in `places_url.txt`.
I used the `linecache` module to directly access a specific line of the `.txt` file.

In [None]:
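for i in range(0, 7200):
    # Each list page holds 18 places, so place i belongs to page i // 18 + 1.
    page = (i // 18) + 1
    dir_path = "HTML_FILES/page" + str(page)
    # linecache line numbers start at 1; strip() drops the trailing newline.
    url = linecache.getline('places_url.txt', i + 1).strip()
    get_html(url, dir_path, i, page)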
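
The mapping from a place index to its folder, used in the loop above, can be checked in isolation. This is a minimal sketch; `PLACES_PER_PAGE` and `page_dir` are hypothetical names, and the value 18 comes from the number of places per list page stated in section 1.1.

```python
PLACES_PER_PAGE = 18  # each page of the list shows 18 places

def page_dir(index, root="HTML_FILES"):
    """Map a 0-based place index to the folder of the list page it came from."""
    page = index // PLACES_PER_PAGE + 1
    return f"{root}/page{page}"

print(page_dir(0))     # HTML_FILES/page1
print(page_dir(17))    # HTML_FILES/page1
print(page_dir(18))    # HTML_FILES/page2
print(page_dir(7199))  # HTML_FILES/page400
```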