## Overview


The goal of this Proof of Concept (PoC) project is to build a tool that scrapes tourist destination data from the Holidify page "https://www.holidify.com/country/india/places-to-visit.html".

### Key Features:


- Scrape tourist place info: Extract the place name, state, description, rating (out of 5.0), rank (out of 100), number of tourist attractions, and best time to visit.

- Store Data: Save the scraped data to a CSV file or database for future analysis.
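The notebook itself only writes a CSV file. For the database option, a minimal sketch using Python's built-in `sqlite3` module could look like the following. The table name `tourist_places`, the helper `save_places`, and the column names are assumptions made for illustration; the sample row reuses values from the notebook output.

```python
import sqlite3

# Hypothetical schema: the column names are illustrative, not taken from the site.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tourist_places (
    name TEXT PRIMARY KEY,
    state TEXT,
    description TEXT,
    rating TEXT,
    place_rank TEXT,
    tourist_attractions TEXT,
    best_time TEXT
)
"""
COLUMNS = ("name", "state", "description", "rating",
           "place_rank", "tourist_attractions", "best_time")


def save_places(conn, places):
    """Insert or update one row per place and return the table's row count."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT OR REPLACE INTO tourist_places VALUES (?, ?, ?, ?, ?, ?, ?)",
        [tuple(place.get(col, "") for col in COLUMNS) for place in places],
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM tourist_places").fetchone()[0]


# Example with an in-memory database and a single sample row.
conn = sqlite3.connect(":memory:")
sample = [{"name": "Gulmarg", "state": "Jammu and Kashmir",
           "rating": "4.4/5", "place_rank": "1 out of 100"}]
print(save_places(conn, sample))
```

Using the place name as the primary key with `INSERT OR REPLACE` means a repeated scrape updates existing rows instead of duplicating them.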

In [5]:
#!pip install requests beautifulsoup4 pandas

In [27]:
# Import libraries (numpy and re were imported originally but are never used)
import os
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [7]:
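webpage = "https://www.holidify.com/country/india/places-to-visit.html"

# A timeout keeps the request from hanging indefinitely if the server does not respond.
response = requests.get(url=webpage, timeout=10)
response.ok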

True
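`response.ok` only reports the status; nothing stops the later cells from parsing an error page. A small sketch of a stricter fetch step is shown below. The helper name `fetch_html` and the 10-second default timeout are assumptions; the session is passed in so the function can also be exercised without network access.

```python
def fetch_html(session, url, timeout=10.0):
    """Return the page body, raising if the request fails or returns an HTTP error."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # with requests, raises HTTPError on 4xx/5xx
    return response.text


# Usage with the notebook's URL (requires network access):
#   import requests
#   with requests.Session() as session:
#       html = fetch_html(session, webpage)
```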

In [8]:
inspect_data = BeautifulSoup(response.text,'html.parser')

In [9]:
# Each destination card sits in a div with this exact class string.
containers = inspect_data.find_all("div", {"class": "col-12 col-md-6 pr-md-3"})
len(containers)

42

In [10]:

data_list = []

In [11]:
containers[0]

<div class="col-12 col-md-6 pr-md-3">
<div class="card content-card" data-itemid="GULMARG">
<a data-href="/places/gulmarg/" data-position="1" href="/places/gulmarg/">
<h3 class="card-heading"> 1. Gulmarg </h3>
<div class="position-relative">
<div class="collection-scrollable" data-hotel-position="1">
<img alt="" class="card-img-top lazy" data-original="https://www.holidify.com/images/bgImages/GULMARG.jpg" src="/res/images/patt.png"/>
<div class="lazyBG card-img-top" data-original="https://www.holidify.com/images/compressed/dest_wiki_5433.jpg" style="background-image:url('https://holidify.com/images/patt.png');">
</div>
<div class="lazyBG card-img-top" data-original="https://www.holidify.com/images/cmsuploads/compressed/IMG_3768_20190710152743.JPG" style="background-image:url('https://holidify.com/images/patt.png');">
</div>
<div class="lazyBG card-img-top" data-original="https://www.holidify.com/images/cmsuploads/compressed/2771936432_d603c3fbd9_b_20190710152801.jpg" style="background-

In [12]:
for container in containers:
    data = {}

    # Heading text looks like " 1. Gulmarg "; drop the leading position number.
    pname = container.find("h3", {"class": "card-heading"})
    data['p_name'] = pname.get_text().split('.', 1)[1].split('-')[0].strip()

    # Rating badge: the score is in <b>, the maximum (e.g. "/5") in span.light.
    rating = container.find("span", {"class": "rating-badge"})
    if rating:
        rating_val = rating.find('b').get_text().strip()
        rating_max = rating.find('span', {"class": "light"}).get_text().strip()
        data['p_rating'] = rating_val + rating_max
    else:
        # Empty string, like the other missing fields, to avoid mixing types.
        data['p_rating'] = ''

    # The "objective" paragraph holds the rank, then the attraction count.
    objective = container.find("p", {"class": "objective"}).get_text().strip()
    rank_part, _, attractions_part = objective.partition('Places to visit in India')
    data['rank'] = rank_part.strip()
    attractions_part = attractions_part.strip()
    data['tourist_attractions'] = attractions_part.split()[0] if attractions_part else ''

    # State or region name, when the card provides one.
    mb = container.find("p", {"class": "mb-2"})
    location_link = mb.find('a') if mb else None
    data['location'] = location_link.get_text().strip() if location_link else ''

    data['description'] = container.find('p', {'class': 'card-text'}).get_text().strip()

    # The "mb-3" paragraphs carry the "Known For" list and the best time to visit.
    for para in container.find_all('p', class_='mb-3'):
        label = para.find('b')
        if label and label.get_text().strip() == 'Known For :':
            spans = para.find_all("span", {"class": "clickable align-middle"})
            data['known_for'] = [span.find('b').get_text().strip() for span in spans]
        elif "Best Time:" in para.get_text():
            data['best_time'] = para.get_text().split("Best Time: ")[-1].strip()
    data_list.append(data)
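The loop stores the rating and rank as display strings such as `4.4/5` and `1 out of 100`. If numeric values are wanted for later analysis, one possible approach is a set of small regex helpers like the sketch below. The function names are hypothetical, and the input formats are assumed from the heading and output shown in this notebook.

```python
import re


def parse_heading(text):
    """Split a heading such as ' 1. Gulmarg ' into (position, name)."""
    match = re.match(r"\s*(\d+)\.\s*(.+?)\s*$", text)
    if not match:
        return None, text.strip()
    return int(match.group(1)), match.group(2)


def parse_rating(text):
    """Convert a rating such as '4.4/5' to a float, or None if it does not match."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*/\s*5\s*", text)
    return float(match.group(1)) if match else None


def parse_rank(text):
    """Convert a rank such as '1 out of 100' to (rank, total), or None."""
    match = re.search(r"(\d+)\s+out of\s+(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


print(parse_heading(" 1. Gulmarg "))  # (1, 'Gulmarg')
print(parse_rating("4.4/5"))          # 4.4
print(parse_rank("1 out of 100"))     # (1, 100)
```

Returning `None` for unparseable input keeps missing values explicit instead of mixing strings and numbers in one column.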

In [13]:
result = pd.DataFrame(data_list,columns = ['p_name','p_rating','rank','tourist_attractions','location','description','known_for','best_time'])


In [14]:
result.head(10)  # preview the first 10 rows

| Unnamed: 0 | p_name | p_rating | rank | tourist_attractions | location | description | known_for | best_time |
|---|---|---|---|---|---|---|---|---|
| 0 | Gulmarg | 4.4/5 | 1 out of 100 | 29.0 | Jammu and Kashmir | Situated at an altitude of 2730 m above sea le... | [Gulmarg Gondola, Alpather Lake, Apharwat Peak] | October to June |
| 1 | Munnar | 4.2/5 | 2 out of 100 | 54.0 | Kerala | Famous for the tea estates, greenery, winding ... | [Mattupetty Dam, Kolukkumalai Tea Estate, Rose... | September to May |
| 2 | Gangtok | 4.6/5 | 3 out of 100 | 34.0 | Sikkim | Incredibly alluring, pleasantly boisterous and... | [Nathula Pass, MG Marg, Rumtek Monastery] | Throughout the year |
| 3 | Manali | 4.8/5 | 4 out of 100 | 53.0 | Himachal Pradesh | With spectacular valleys, breathtaking views, ... | [Solang Valley, Hadimba Devi Temple, Manali Ma... | October to June |
| 4 | Srinagar | 4.2/5 | 5 out of 100 | 59.0 | Jammu and Kashmir | Famously known as 'Heaven on Earth, Srinagar i... | [Dal Lake, Indira Gandhi Memorial Tulip Garden... | April to October |
| 5 | Shillong | 4.2/5 | 6 out of 100 | 33.0 | Meghalaya | Nestled amidst the pine-clad hills, Shillong, ... | [Umiam Lake, Elephant Falls, Shillong Peak] | September to May |
| 6 | Ooty | 4.5/5 | 7 out of 100 | 43.0 | Tamil Nadu | Nestled amidst Nilgiri hills, Ooty, also known... | [Nilgiri Mountain Railway, Ooty Lake, Emerald ... | Throughout the year |
| 7 | Darjeeling | 4.8/5 | 8 out of 100 | 31.0 | West Bengal | Darjeeling, the former summer capital of India... | [Darjeeling Himalayan Railway, Darjeeling Peac... | February to March, September to December |
| 8 | Ladakh | 4.7/5 | 9 out of 100 | 70.0 | | Ladakh, located in the northernmost region of ... | [Pangong Lake, Khardung La, Nubra Valley] | April - Mid-July |
| 9 | Rishikesh | 4.5/5 | 10 out of 100 | 54.0 | Uttarakhand | Located in the foothills of the Himalayas alon... | [Triveni Ghat, Rafting in Rishikesh, Lakshman ... | Throughout the year |


In [51]:
parent_dir = os.getcwd()
result.to_csv(os.path.join(parent_dir,'tourist_places.csv'),index=False)
print('Stored the Results to CSV file')

Stored the Results to CSV file
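One thing to keep in mind: `to_csv` writes the list-valued `known_for` column as its Python string representation, so it comes back as plain text when the file is reloaded. A possible way to keep that column machine-readable is to serialize it as JSON before writing. The sketch below uses only the standard library; the helper names `rows_to_csv` and `csv_to_rows` are hypothetical and not part of the notebook.

```python
import csv
import io
import json


def rows_to_csv(rows, fieldnames, list_field="known_for"):
    """Write dict rows to CSV text, storing the list-valued field as a JSON string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        out = dict(row)
        out[list_field] = json.dumps(row.get(list_field, []))
        writer.writerow(out)
    return buffer.getvalue()


def csv_to_rows(text, list_field="known_for"):
    """Read the CSV text back, decoding the JSON field into a list again."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row[list_field] = json.loads(row[list_field])
    return rows


rows = [{"p_name": "Gulmarg",
         "known_for": ["Gulmarg Gondola", "Alpather Lake", "Apharwat Peak"]}]
text = rows_to_csv(rows, fieldnames=["p_name", "known_for"])
print(csv_to_rows(text)[0]["known_for"])
```

The same idea applies to the DataFrame: converting `known_for` with `json.dumps` before `to_csv` and with `json.loads` after reading restores real lists.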
