![Genietalk](http://genietalk.com/wp-content/uploads/2017/03/Website-Logo-Medium.png)

## Objective:
Extract all available information about the hotels in a specific city.
This can be achieved using `urllib` and BeautifulSoup, a very handy web-scraping library.

In this notebook we will do the following:
- Count the number of listing pages for the hotels in a specific city
- Iterate through every page and append the hotel links from every page to a list `all_hotels`
- Create functions to extract specific information from every hotel page, such as:
> + Hotel_name
> + Hotel_address
> + Hotel_url
> + Hotel_rating
> + Hotel_info
> 
>   ...
> + Hotel_facilities

### Import the required libraries

In [1]:
from urllib.request import Request,urlopen
from bs4 import BeautifulSoup as soup
import json

Create a base URL `https://me.cleartrip.com` and a path extension for the location.

Changing the path in the extension changes the location

`(e.g. /hotels/united-states/miami -> /hotels/india/mumbai)`

In [2]:
base_url = 'https://me.cleartrip.com'
url_ext = '/hotels/united-states/miami'
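
Since only the path changes between locations, the listing URL can be assembled from its parts. The following is a minimal sketch, assuming that the `/hotels/<country>/<city>` pattern shown above holds for other locations and that pages are selected with the `?page=` query parameter used later in this notebook. `listing_url` is a hypothetical helper, not part of the original code.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = 'https://me.cleartrip.com'

def listing_url(country, city, page=None):
    """Build a hotel-listing URL (hypothetical helper).

    Assumes the '/hotels/<country>/<city>' path pattern and an optional
    '?page=<n>' query parameter.
    """
    path = '/hotels/{}/{}'.format(country, city)
    url = urljoin(BASE_URL, path)
    if page is not None:
        url += '?' + urlencode({'page': page})
    return url

print(listing_url('united-states', 'miami'))
print(listing_url('india', 'mumbai', page=2))
```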

Download and parse the page information using this function

In [3]:
# Downloading and parsing the page
def download_data(main_url):
    url_client = urlopen(main_url)
    page_html = url_client.read()
    page_soup=soup(page_html,'html.parser')
    url_client.close()
    return page_soup
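
`Request` is imported above but not used by `download_data`. One possible use, sketched below, is to send an explicit `User-Agent` header and a timeout with each download. The header value and the timeout are assumptions chosen for illustration, and `build_request` and `fetch_html` are hypothetical helpers.

```python
from urllib.request import Request, urlopen

# Assumed values for illustration only; adjust them to your own needs.
USER_AGENT = 'Mozilla/5.0 (compatible; hotel-scraper-example)'
TIMEOUT_SECONDS = 30

def build_request(main_url):
    """Wrap the URL in a Request that carries an explicit User-Agent header."""
    return Request(main_url, headers={'User-Agent': USER_AGENT})

def fetch_html(main_url):
    """Download the raw page bytes; the with-block closes the connection."""
    with urlopen(build_request(main_url), timeout=TIMEOUT_SECONDS) as url_client:
        return url_client.read()

# Inspect the request without touching the network.
req = build_request('https://me.cleartrip.com/hotels/united-states/miami')
print(req.full_url)
print(req.get_header('User-agent'))  # urllib stores header names in this capitalisation
```

The bytes returned by `fetch_html` could then be passed to `soup(page_html, 'html.parser')` exactly as in `download_data`.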

In [4]:
page_soup = download_data(base_url+url_ext)

In [5]:
#check if the above function works
main_soup = page_soup.find('section',{'id':'content'})
main_soup.h1

<h1>Miami Hotels</h1>

#### Count the total number of pages of hotels for a specific location
We count the links in the pagination block of the first page and use that count as the total number of pages

In [6]:
page_number = page_soup.find('div',{'class':'pagination'}).findAll('a')
total_pages=len(page_number)
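
Counting the links gives the right answer only if the pagination block shows exactly one link per page. A more defensive variant, sketched below, reads the numeric link labels and takes the largest one. It assumes the labels would come from something like `[a.text for a in page_number]`; `last_page_number` is a hypothetical helper and the sample labels are made up.

```python
def last_page_number(labels, default=1):
    """Return the largest numeric label, ignoring entries such as 'Next'."""
    numbers = [int(label.strip()) for label in labels if label.strip().isdigit()]
    return max(numbers) if numbers else default

sample_labels = ['1', '2', '3', ' 4 ', 'Next']  # made-up labels
print(last_page_number(sample_labels))
print(last_page_number([]))  # falls back to the default when nothing is numeric
```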

In [7]:
hotel_name=main_soup.findAll('div',{'class':'hotels-card-cnt'})
page_links = [hotelpage_links.div.h2.a['href'] for hotelpage_links in hotel_name]

In [9]:
# Iterate over all listing pages, fetch the hotel URLs
# and store them in the list all_hotels
all_hotels=[]
for p_no in range(1,total_pages+1):
    sub_url=url_ext+'?page='+str(p_no) # reuse url_ext so a new location needs only one change
    page_soup=download_data(base_url+sub_url)
    main_soup = page_soup.find('section',{'id':'content'})
    hotel_name=main_soup.findAll('div',{'class':'hotels-card-cnt'})
    page_links = [hotelpage_links.div.h2.a['href'] for hotelpage_links in hotel_name]
    all_hotels+=page_links
len(all_hotels) # Print the total number of hotels

755

As a quick manual sanity check, we print the list of all links

In [10]:
all_hotels

['/hotels/info/hilton-bentley-south-beach-192017',
 '/hotels/info/epic-a-kimpton-hotel-292714',
 '/hotels/info/jw-marriott-marquis-miami-345087',
 '/hotels/info/sls-south-beach-367313',
 '/hotels/info/king-grove-tides-south-beach-367317',
 '/hotels/info/viceroy-367353',
 '/hotels/info/the-ritz-carlton-bal-harbour-miami-453735',
 '/hotels/info/ritz-carlton-key-biscayne-miami-514188',
 '/hotels/info/the-ritz-carlton-coconut-grove-miami-275840',
 '/hotels/info/the-carillon-hotel-and-spa-338669',
 '/hotels/info/hotel-beaux-arts-miami-345107',
 '/hotels/info/jw-marriott-marquis-miami-531425',
 '/hotels/info/boulan-south-beach-569703',
 '/hotels/info/gale-south-beach-660995',
 '/hotels/info/apartment-collins-avenue-1214040',
 '/hotels/info/metropolitan-by-como-miami-beach-1232342',
 '/hotels/info/loews-192045',
 '/hotels/info/acqualina-192008',
 '/hotels/info/viceroy-miami-253754',
 '/hotels/info/turnberry-isle-miami-autograph-collection-417953',
 '/hotels/info/fontainebleau-miami-bch-140302

In [11]:
# Download and parse the hotel information
# (same steps as download_data, so we simply reuse it)
def download_hotel_data(hotel_url):
    return download_data(hotel_url)

In [12]:
# Functions to extract the hotel information from a hotel page.
# Each one returns None when the element it looks for is missing.
import re

def find_name(page_soup):
    try:
        hotel_name=page_soup.find('h1',{'class':'hotel-title'}).text
        return hotel_name
    except:
        return None

def hotel_address(page_soup):
    try:
        hotel_add=page_soup.find('div',{'class':'hotels-location'}).text
        return hotel_add
    except:
        return None

def hotel_price(page_soup):
    try:
        hotel_price=page_soup.find('p',{'class':'minprice'}).contents[-1]
        return hotel_price
    except:
        return None
    
def hotel_rating(page_soup):
    try:
        hotel_tarating=page_soup.find('span',{'class':'schema-TAcount'}).contents[-1]
        return hotel_tarating.split('/')[0]
    except:
        return None

def hotel_review_number(page_soup):
    try:
        hotel_tarating=page_soup.find('span',{'itemprop':'reviewCount'})
        return hotel_tarating.text
    except:
        return None

def hotel_dscr(page_soup):
    hotel_desc=page_soup.find('div',{'class':'hotel-description'})
    data={}
    try:
        for i in hotel_desc:
            key=i.text.split(":")[0].lower()
            key=re.sub('[^ a-zA-Z0-9]','',key)
            key='_'.join(key.split(" "))
            val=i.text.split(":")[1]
            data[key]=val
        return data
    except:
        return None

def hotel_others(page_soup):
    dct={}
    others=page_soup.find_all('div',{'class':'amenities-category'})
    tstst = [i for i in others]
    try:
        check_in=tstst[0].text.split(" ")[5]# if tstst[0].text.split(" ")[5] else 0
        check_out=tstst[0].text.split(" ")[9]
        rooms_no=tstst[0].text.split(" ")[13]# if tstst[0].text.split(" ")[13] else 0
        dct['check_in']=check_in
        dct['check_out']=check_out
        dct['rooms']=rooms_no
        others=tstst[1]
        heads=[i.text for i in others.find_all('div')]
        lsts=others.findAll('ul',{'class':'list-inline amenities'})
        lk=0
        for i in lsts:
            key=heads[lk].lower()
            key=re.sub('[^ a-zA-Z0-9]','',key)
            key='_'.join(key.split(' '))
            dct[key]=i.text.split("  ")[1:]
            lk+=1
        return dct
    except:
        return None
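
The simpler extractors above share one pattern: find an element, return its text, and fall back to `None` if it is missing. That pattern could be factored into a single helper, as in the sketch below. `safe_text` is a hypothetical helper, the HTML fragment is made up to mirror the selectors used above, and unlike the original functions it strips surrounding whitespace.

```python
from bs4 import BeautifulSoup as soup

def safe_text(page_soup, tag, attrs):
    """Return the stripped text of the first matching element, or None."""
    element = page_soup.find(tag, attrs)
    return element.get_text(strip=True) if element is not None else None

# Made-up fragment that mirrors the selectors used above.
sample_html = '''
<h1 class="hotel-title"> Example Hotel </h1>
<div class="hotels-location">Downtown Miami, Miami</div>
'''
sample_soup = soup(sample_html, 'html.parser')

print(safe_text(sample_soup, 'h1', {'class': 'hotel-title'}))
print(safe_text(sample_soup, 'div', {'class': 'hotels-location'}))
print(safe_text(sample_soup, 'p', {'class': 'minprice'}))  # element is absent
```

Checking for `None` explicitly avoids a bare `except:`, which would also hide unrelated errors such as a typo in a variable name.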


In [13]:
# Collect the extracted information for one hotel in a single dict
def create_dict(url):
    page_soup=download_hotel_data(url)
    dct=dict()
    dct['hotel_name']=find_name(page_soup)
    dct['hotel_url']=str(url)
    dct['hotel_address']=hotel_address(page_soup)
    dct['hotel_rating']=hotel_rating(page_soup)
    dct['hotel_review_num']=hotel_review_number(page_soup)
    dct['hotel_price']=hotel_price(page_soup)
    dct['hotel_info']=hotel_dscr(page_soup)
    others=hotel_others(page_soup)
    if others: # hotel_others returns None when parsing fails
        dct.update(others)
    return dct

In [14]:
# Try our function on a sample slice all_hotels[:5]
ls_test=[]
for url in all_hotels[:5]:
    print(base_url+url)
    ls_test.append(create_dict(base_url+url))
ls_test

https://me.cleartrip.com/hotels/info/hilton-bentley-south-beach-192017
https://me.cleartrip.com/hotels/info/epic-a-kimpton-hotel-292714
https://me.cleartrip.com/hotels/info/jw-marriott-marquis-miami-345087
https://me.cleartrip.com/hotels/info/sls-south-beach-367313
https://me.cleartrip.com/hotels/info/king-grove-tides-south-beach-367317


[{'basics': ['Air Conditioning', 'Lift '],
  'business_services': ['Business Center '],
  'check_in': '16:00',
  'check_out': '-',
  'hotel_address': None,
  'hotel_info': None,
  'hotel_name': 'Hilton Bentley South Beach',
  'hotel_price': None,
  'hotel_rating': '4.0',
  'hotel_review_num': '1103',
  'hotel_url': 'https://me.cleartrip.com/hotels/info/hilton-bentley-south-beach-192017',
  'personal_services': ['Room Service '],
  'recreation': ['Gym', 'Heated Pool', 'Sauna '],
  'rooms': '218',
  'travel': ['Parking', 'Porter ']},
 {'business_services': ['Audio Visual Equipment',
   'Business Centre',
   'Meeting Room '],
  'check_in': '16:00',
  'check_out': '11:00',
  'food__beverage': ['24 Hour Room Service',
   'Banquet Hall',
   'Bar',
   'Coffee Shop',
   'Restaurant '],
  'front_desk_services': ['Concierge '],
  'general': ['24 Hour Front Desk',
   'ATM/Cash Machine',
   'Air Conditioning',
   'Central Heating',
   'Elevator '],
  'hotel_address': 'Downtown Miami, Miami',
  'ho
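
The double underscore in the `food__beverage` key above is a side effect of the key clean-up in `hotel_dscr` and `hotel_others`: removing a symbol between two spaces leaves two consecutive spaces, and `split(' ')` turns each of them into a separator. The sketch below reproduces this and shows one possible fix. The heading text `'Food & Beverage'` is an assumption about what the page contains, and both function names are hypothetical.

```python
import re

def original_key(heading):
    """Reproduce the key clean-up used in hotel_dscr and hotel_others."""
    key = re.sub('[^ a-zA-Z0-9]', '', heading.lower())
    return '_'.join(key.split(' '))

def normalise_key(heading):
    """Same clean-up, but collapse runs of whitespace into one underscore."""
    key = re.sub('[^ a-zA-Z0-9]', '', heading.lower())
    return '_'.join(key.split())

heading = 'Food & Beverage'  # assumed page heading behind the key above
print(original_key(heading))   # food__beverage
print(normalise_key(heading))  # food_beverage
```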

In [16]:
ls_final=[]
for url in all_hotels:
    print(base_url+url) # Progress indicator: shows that the loop is still running
    ls_final.append(create_dict(base_url+url))

https://me.cleartrip.com/hotels/info/hilton-bentley-south-beach-192017
https://me.cleartrip.com/hotels/info/epic-a-kimpton-hotel-292714
https://me.cleartrip.com/hotels/info/jw-marriott-marquis-miami-345087
https://me.cleartrip.com/hotels/info/sls-south-beach-367313
https://me.cleartrip.com/hotels/info/king-grove-tides-south-beach-367317
https://me.cleartrip.com/hotels/info/viceroy-367353
https://me.cleartrip.com/hotels/info/the-ritz-carlton-bal-harbour-miami-453735
https://me.cleartrip.com/hotels/info/ritz-carlton-key-biscayne-miami-514188
https://me.cleartrip.com/hotels/info/the-ritz-carlton-coconut-grove-miami-275840
https://me.cleartrip.com/hotels/info/the-carillon-hotel-and-spa-338669
https://me.cleartrip.com/hotels/info/hotel-beaux-arts-miami-345107
https://me.cleartrip.com/hotels/info/jw-marriott-marquis-miami-531425
https://me.cleartrip.com/hotels/info/boulan-south-beach-569703
https://me.cleartrip.com/hotels/info/gale-south-beach-660995
https://me.cleartrip.com/hotels/info/apar

https://me.cleartrip.com/hotels/info/1451-brickell-by-miami-vacations-2322084
https://me.cleartrip.com/hotels/info/one-broadway-luxury-suites-downtown-2326280
https://me.cleartrip.com/hotels/info/bright-3br-in-little-havana-by-sonder-2325648
https://me.cleartrip.com/hotels/info/sonesta-resort-by-1st-homerent-2324624
https://me.cleartrip.com/hotels/info/sls-brickell-2326326
https://me.cleartrip.com/hotels/info/riviera-luxury-living-at-icon-brickell-2328106
https://me.cleartrip.com/hotels/info/opera-tower-2400306
https://me.cleartrip.com/hotels/info/palmeiras-beach-club-at-grove-isle-2310592
https://me.cleartrip.com/hotels/info/bright-studio-in-wynwood-by-sonder-2312406
https://me.cleartrip.com/hotels/info/nuovo-miami-apartments-at-design-district-midtown-2311654
https://me.cleartrip.com/hotels/info/hyatt-regency-140225
https://me.cleartrip.com/hotels/info/trump-international-beach-192089
https://me.cleartrip.com/hotels/info/marenas-resort-210117
https://me.cleartrip.com/hotels/info/eden

https://me.cleartrip.com/hotels/info/sapphire-south-beach-1282014
https://me.cleartrip.com/hotels/info/comfort-suites-miami-airport-north-1297216
https://me.cleartrip.com/hotels/info/miami-airport-marriott-2315954
https://me.cleartrip.com/hotels/info/courtyard-by-marriott-miami-airport-2318380
https://me.cleartrip.com/hotels/info/opera-tower-bay-2319506
https://me.cleartrip.com/hotels/info/lively-2br-in-miami-river-inn-by-sonder-2319646
https://me.cleartrip.com/hotels/info/lively-1br-in-coconut-grove-by-sonder-2322514
https://me.cleartrip.com/hotels/info/vibrant-1br-in-coconut-grove-by-sonder-2322532
https://me.cleartrip.com/hotels/info/simple-2br-in-miami-river-inn-by-sonder-2327956
https://me.cleartrip.com/hotels/info/holiday-inn-miami-79th-street-2399800
https://me.cleartrip.com/hotels/info/hyatt-centric-brickell-miami-2400164
https://me.cleartrip.com/hotels/info/the-vagabond-hotel-2310688
https://me.cleartrip.com/hotels/info/hilton-garden-inn-miami-dolphin-mall-2310772
https://me.c

https://me.cleartrip.com/hotels/info/collins-studios-by-design-suites-miami-1123922
https://me.cleartrip.com/hotels/info/seacoast-suites-1135000
https://me.cleartrip.com/hotels/info/miami-princess-hotel-1149772
https://me.cleartrip.com/hotels/info/colony-hotel-1318860
https://me.cleartrip.com/hotels/info/bright-2br-in-miami-river-inn-by-sonder-2315856
https://me.cleartrip.com/hotels/info/brickell-city-suites-by-yourent-2316790
https://me.cleartrip.com/hotels/info/smart-1br-in-coconut-grove-by-sonder-2317098
https://me.cleartrip.com/hotels/info/global-luxury-suites-at-one-broadway-2318172
https://me.cleartrip.com/hotels/info/homewood-suites-by-hilton-miami-dolphin-mall-2319554
https://me.cleartrip.com/hotels/info/colorful-1br-in-miami-river-inn-by-sonder-2320486
https://me.cleartrip.com/hotels/info/bright-1br-in-miami-river-inn-by-sonder-2321064
https://me.cleartrip.com/hotels/info/lyx-suites-at-bayshore-grove-in-coconut-grove-2321654
https://me.cleartrip.com/hotels/info/posh-studio-in-

https://me.cleartrip.com/hotels/info/sky-city-at-icon-brickell-774100
https://me.cleartrip.com/hotels/info/aloft-miami-brickell-774940
https://me.cleartrip.com/hotels/info/bars-bb-south-beach-hotel-918668
https://me.cleartrip.com/hotels/info/the-mercury-hotel-936424
https://me.cleartrip.com/hotels/info/kent-hotel-952806
https://me.cleartrip.com/hotels/info/the-shepley-hotel-1000560
https://me.cleartrip.com/hotels/info/congress-hotel-by-miavac-1001250
https://me.cleartrip.com/hotels/info/1021-euclid-apartment-1014874
https://me.cleartrip.com/hotels/info/ocean-five-studios-1010626
https://me.cleartrip.com/hotels/info/aloft-miami-brickell-1022072
https://me.cleartrip.com/hotels/info/pelican-residences-in-coral-gables-walk-to-merrick-park-1076910
https://me.cleartrip.com/hotels/info/south-beach-home-1175044
https://me.cleartrip.com/hotels/info/luxury-apartments-by-miamiluxsuites-1190578
https://me.cleartrip.com/hotels/info/mare-azur-design-district-luxury-apartments-2316622
https://me.clea

https://me.cleartrip.com/hotels/info/continental-hotel-oceanfront-south-beach-321981
https://me.cleartrip.com/hotels/info/baymont-inn-and-suites-west-miami-airport-324099
https://me.cleartrip.com/hotels/info/wishes-coral-gables-529343
https://me.cleartrip.com/hotels/info/hotel-pierre-646817
https://me.cleartrip.com/hotels/info/extended-stay-america-miami-coral-gables-1210954
https://me.cleartrip.com/hotels/info/hampton-inn-miami-290306
https://me.cleartrip.com/hotels/info/econo-lodge-miami-1277088
https://me.cleartrip.com/hotels/info/hampton-inn-miami-airport-east-fl-2314212
https://me.cleartrip.com/hotels/info/days-inn-miami-2310628
https://me.cleartrip.com/hotels/info/continental-south-beach-192026
https://me.cleartrip.com/hotels/info/the-continental-hotel-214073
https://me.cleartrip.com/hotels/info/venezia-hotel-249330
https://me.cleartrip.com/hotels/info/deco-walk-hostel-beach-club-452206
https://me.cleartrip.com/hotels/info/miami-hostel-524112
https://me.cleartrip.com/hotels/info/

https://me.cleartrip.com/hotels/info/even-hotel-miami-airport-2316172
https://me.cleartrip.com/hotels/info/king-motel-miami-582924
https://me.cleartrip.com/hotels/info/saturn-motel-miami-627474
https://me.cleartrip.com/hotels/info/carls-el-padre-motel-624758
https://me.cleartrip.com/hotels/info/hotel-mimo-630567
https://me.cleartrip.com/hotels/info/the-megghy-894864
https://me.cleartrip.com/hotels/info/suites-r-us-in-ocean-drive-924084
https://me.cleartrip.com/hotels/info/casa-grande-suites-by-south-beach-vacations-932416
https://me.cleartrip.com/hotels/info/south-beach-studio-indian-creek-935298
https://me.cleartrip.com/hotels/info/doral-vacation-apartments-by-envisend-939346
https://me.cleartrip.com/hotels/info/congress-by-design-suites-miami-950640
https://me.cleartrip.com/hotels/info/esseza-residence-970974
https://me.cleartrip.com/hotels/info/lenox-cottage-on-water-979122
https://me.cleartrip.com/hotels/info/chic-bohemian-apartment-at-the-mercury-hotel-1000980
https://me.cleartrip
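
The loop above requests several hundred pages back to back and stops at the first download error. A gentler variant, sketched below, pauses between requests, numbers the progress lines with `enumerate`, and records failures instead of aborting. The one-second delay is an assumption, `crawl` is a hypothetical helper, and `fake_fetch` stands in for `create_dict` so that the sketch runs without network access.

```python
import time

def crawl(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Apply fetch to every URL, pausing between requests.

    Returns (results, failures); failures holds (url, error message) pairs
    so that one bad page does not abort the whole run.
    """
    results, failures = [], []
    for position, url in enumerate(urls, start=1):
        print('{}/{} {}'.format(position, len(urls), url))  # progress indicator
        try:
            results.append(fetch(url))
        except Exception as error:  # e.g. urllib.error.URLError
            failures.append((url, str(error)))
        if position < len(urls):
            sleep(delay_seconds)  # no pause needed after the last URL
    return results, failures

# Stand-in for create_dict so the sketch runs without network access.
def fake_fetch(url):
    if url.endswith('broken'):
        raise ValueError('could not download page')
    return {'hotel_url': url}

results, failures = crawl(['/a', '/broken', '/c'], fake_fetch, delay_seconds=0)
print(len(results), len(failures))
```

With the real functions, the call would look like `crawl([base_url+url for url in all_hotels], create_dict)`.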

### Dump the data into a JSON file

In [17]:
with open('slice_data.json', 'w') as fp: #dumping test dataset
    json.dump(ls_test, fp)

In [18]:
with open('miami_hotel_data.json', 'w') as fp: #dumping complete dataset
    json.dump(ls_final, fp)
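
To confirm that a dump is readable, the file can be loaded back and compared with the list in memory. The sketch below does this for a made-up record in a temporary directory; the file name and the record are assumptions for illustration. Note that `None` values are written as JSON `null` and come back as `None`.

```python
import json
import os
import tempfile

records = [{'hotel_name': 'Example Hotel', 'hotel_price': None}]  # made-up record

# Write to a temporary directory so no real data file is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'example_data.json')
with open(path, 'w') as fp:
    json.dump(records, fp, indent=2)

with open(path) as fp:
    loaded = json.load(fp)

print(loaded == records)
```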

In [20]:
len(ls_final) #Sanity check for the number of hotels in the final dataset

755

The number of records in `ls_final` equals the number of hotel links in `all_hotels`, so every hotel page was downloaded and processed.
Individual fields can still be `None`, as the sample output above shows.
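
A matching count says nothing about how complete each record is. One way to check that, sketched below, is to count how often each field is `None`. `missing_field_counts` is a hypothetical helper and the two sample records are made up; on the real data it would be called with `ls_final`.

```python
from collections import Counter

def missing_field_counts(records):
    """Count, per field, how many records hold None for that field."""
    missing = Counter()
    for record in records:
        for field, value in record.items():
            if value is None:
                missing[field] += 1
    return dict(missing)

sample = [
    {'hotel_name': 'Hotel A', 'hotel_address': None, 'hotel_price': None},
    {'hotel_name': 'Hotel B', 'hotel_address': 'Downtown Miami, Miami', 'hotel_price': None},
]
print(missing_field_counts(sample))
```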