# Scraping 'Holidify' Website To Get Top 100 Indian Tourist Places
![](https://i.imgur.com/nnDuExi.jpg)

Travel planning has always been messy and difficult. *Holidify* attempts to collect all the information we will ever need to plan a trip: when to go, where to go, how to get there, and which hidden gems to explore in every destination. *Holidify* is now India's favourite trip planning website, as it is a one-stop solution for all our travel planning needs.

Now, through this project, let us get information about the top Indian tourist places from [Holidify-Indian Tourist Places](https://www.holidify.com/country/india/places-to-visit.html?pageNum=0) using *web scraping*.

Web scraping is an automated method of obtaining large amounts of data from websites. It uses appropriate functions and [Python libraries](https://analyticsindiamag.com/top-7-python-web-scraping-tools-for-data-scientists/) to extract content and data from a website in a structured format by crawling through HTML pages. *[Know More about Web Scraping](https://www.datacamp.com/tutorial/web-scraping-using-python)*.





## Project Overview

Here's an outline of steps we'll follow:
1. Install libraries like [Jovian](https://docs.jovian.ai/docs/user-guide/install.html), [Requests](https://www.w3schools.com/python/module_requests.asp) and [BeautifulSoup](https://pypi.org/project/beautifulsoup4/).
2. Download the web page using *`Requests`*.
3. Parse the HTML source code of the website using *`BeautifulSoup`*.
4. Extract information like next page links, place names, best time to visit, ratings etc.
5. Prepare Python lists and dictionaries with the extracted and cleaned information.
6. Create Data Frames with the extracted data.
7. Combine and save the required information from all pages to CSV files.


Let us look at the format of the CSV file that will be created by the end of the project:

**Place Name,Best Time To Visit,Ratings,Link to know more About the Place**  
MANALI,October to Jun,4.5/5,https://www.holidify.com//places/manali  
LADAKH,Jun to Sep,4.6/5,https://www.holidify.com//places/ladakh
 

In [5]:
!pip install jovian --upgrade --quiet

In [6]:
import jovian

## 1.Download the web page using *requests*

*Requests* library can be used to download the web page from the link. It can be installed using `pip`.

In [7]:
!pip install requests --upgrade --quiet

In [8]:
import requests

In [9]:
Holidify_page_url ='https://www.holidify.com/country/india/places-to-visit.html'

#Execute this to get the response from the page
page_response = requests.get(Holidify_page_url)

Let us check the `type` and status of *page_response*. The `.status_code` property tells us whether the request was successful. A successful response has a status code between 200 and 299.

In [10]:
#to know about the type of response
type(page_response)

requests.models.Response

In [11]:
#checking the status of response
page_response.status_code

200
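
As a small aside on status codes, the sketch below uses only the standard library's `http.HTTPStatus` enum to show which codes fall in the successful 200 to 299 range. The helper name `is_successful` is made up for this illustration and is not part of *requests*.

```python
from http import HTTPStatus

def is_successful(status_code):
    """Return True when the code is in the 2xx (success) range."""
    return 200 <= status_code <= 299

# A few common codes with their standard reason phrases
for code in (200, 404, 500):
    print(code, HTTPStatus(code).phrase, is_successful(code))
```

With *requests*, calling `page_response.raise_for_status()` raises an exception for 4xx and 5xx responses, which can stand in for a manual comparison.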

Yaay! The request was successful. Now we can get the contents of the page using *page_response*`.text`.

In [12]:
contents_of_page = page_response.text

The number of characters in the page can be found using `len()`.

In [13]:
len(contents_of_page)

313493

It seems there are over 313,000 characters in the page. Let us see what the first 200 characters are!

In [14]:
#to get the top 200 Characters of the source code
contents_of_page[:200]

"<!doctype html>\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n<html>\n   <head>\n   \n   \n      \n    \n    \n\n    \n\n\n\n<script>\n\t  dataLayer = [{\n\t    'pageType': 'Country.TopPlaces',\n\t    'countryCode' : 'INDIA',\n\t    'contin"

This is what the *HTML source code* of the page looks like. [*HTML*](https://www.w3schools.com/html/) is a language used to create pages on the Web. Let us save the entire code into a file and view the page locally within *Jupyter* using "File > Open".

In [15]:
#to create a local HTML file with extracted source code
with open('Holidify-webpage.html', 'w', encoding='utf-8') as file:
    file.write(contents_of_page)

Let's see what the *HTML* code and its output look like in Jupyter. The page looks similar to the original web page, but none of the links in it work.

By this we have successfully downloaded the web page.

## 2.Parse the HTML Source code of the website using *BeautifulSoup.*

Now let us install and import `BeautifulSoup` for parsing the data from the website. We will also see what tags we can get from the page.

In [16]:
#Installing Beautifulsoup
!pip install beautifulsoup4 --upgrade --quiet

In [17]:
from bs4 import BeautifulSoup

We have installed and imported BeautifulSoup. Using it, let us parse the page contents into a variable called *doc* and check its *type*.

In [18]:
#applying beautifulsoup
doc = BeautifulSoup(contents_of_page,'html.parser')

In [19]:
type(doc)

bs4.BeautifulSoup

Check what can be parsed from the website.

In [20]:
#to check the first element under the title_tag
doc.find('title')

<title> 100  Places To Visit In India | Tourist Places in India | Holidify  </title>

###### Now, Let's write a function, covering everything we have done till here.

In [21]:
#Function to get the beautifulsoup document from the link
def Get_page_contents(page_link):
    response = requests.get(page_link)
    if response.status_code != 200:
        raise Exception('Unable to fetch the page {}'.format(page_link))
    #parse the downloaded HTML text
    doc = BeautifulSoup(response.text,'html.parser')
    return doc

Let's see how it works.

In [22]:
#Check the function
demo_doc= Get_page_contents(Holidify_page_url)

In [23]:
demo_doc.find('title')

<title> 100  Places To Visit In India | Tourist Places in India | Holidify  </title>

So, we can use the function '*Get_page_contents*' to download and parse the HTML of any web page.

## 3.Extract the required information from the page.

Extracting the required information is the crucial and complicated part of web scraping. We will take the help of *BeautifulSoup* and *HTML* to extract the required information.  
To extract a particular piece of information from the web page, right-click on that element and select `Inspect`. The browser then opens its developer tools with the HTML of that element highlighted.
Here, I wanted to extract the *ratings* of places, so I inspected the page by right-clicking on a rating. The HTML tag and code of that particular rating are highlighted in the HTML code window.


### 3.1 Let us extract the links of other pages from the [Holidify](https://www.holidify.com/country/india/places-to-visit.html) web page.
When we inspect the page, we can see that the page links are in *[a_tags](https://www.w3schools.com/tags/tag_a.asp)* with class 'page-link'. Let us write code to get that information and `print` it to see the result.
![](https://i.imgur.com/7p4saDZ.png)

In [24]:
#extract other page links from the webpage
page_links =doc.find_all('a',class_="page-link")
page_link_urls =[]
#To get a live link, add the below URL before the elements we are extracting.
url='https://www.holidify.com/'
for tag in page_links:
    page_link_urls.append(url+tag['href'])
print(page_link_urls)

['https://www.holidify.com//country/india/places-to-visit.html?pageNum=0', 'https://www.holidify.com//country/india/places-to-visit.html?pageNum=0', 'https://www.holidify.com//country/india/places-to-visit.html?pageNum=1', 'https://www.holidify.com//country/india/places-to-visit.html?pageNum=2', 'https://www.holidify.com//country/india/places-to-visit.html?pageNum=1']
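
The output above contains a double slash after the domain and some repeated links. One possible way to tidy this up is sketched below with the standard library's `urllib.parse.urljoin`. It assumes the `href` values start with a leading slash, which is inferred from the double slash in the printed URLs; the `hrefs` list is sample data written out by hand, not fetched from the site.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.holidify.com/'

# Sample href values, modelled on the shape of the output above
hrefs = [
    '/country/india/places-to-visit.html?pageNum=0',
    '/country/india/places-to-visit.html?pageNum=0',
    '/country/india/places-to-visit.html?pageNum=1',
    '/country/india/places-to-visit.html?pageNum=2',
    '/country/india/places-to-visit.html?pageNum=1',
]

# urljoin resolves each href against the base, so no double slash appears
full_urls = [urljoin(BASE_URL, href) for href in hrefs]

# dict.fromkeys keeps the first occurrence of each URL and preserves order
unique_urls = list(dict.fromkeys(full_urls))
print(unique_urls)
```

The double-slash URLs shown in the outputs of this notebook were evidently accepted by the site, so this is a matter of tidiness rather than a required fix.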


Now, let us write a function to get next page links from the *BeautifulSoup* Document automatically.

In [25]:
#function to get next page links
def Get_page_links(page_doc):
    page_links =page_doc.find_all('a',class_="page-link")
    page_link_urls =[]
    url='https://www.holidify.com/'
    for tag in page_links:
        page_link_urls.append(url+tag['href'])
    return page_link_urls

Let's check how it works.


In [26]:
#checking the function
Links_of_pages = Get_page_links(Get_page_contents(Holidify_page_url))

In [27]:
#execute to get the link of second page
Links_of_pages[2]

'https://www.holidify.com//country/india/places-to-visit.html?pageNum=1'

Yaay! We made it. Similarly, let's write functions to extract all the required information.

### 3.2 Extract 'Tourist place names'.
By inspecting the page, we find that the place names are stored in the `data-itemid` attribute of [div_tags](https://www.w3schools.com/tags/tag_div.ASP#:~:text=The%20tag%20defines%20a,inside%20the%20tag!) with class 'card content-card'. So let's extract the information using the tag name and class as filters.
![](https://i.imgur.com/SBSBee5.png)

In [28]:
#get all div_tags with class 'card content-card'
Place_Name_Tags = doc.find_all('div', class_= 'card content-card')
Place_Names=[]
#Code to get only required information from the contents inside the tag
for tag in Place_Name_Tags:
    Place_Names.append(tag['data-itemid'])
print(Place_Names)

['MANALI', 'LADAKH', 'COORG', 'ANDAMAN-NICOBAR-ISLANDS', 'LAKSHADWEEP-ISLANDS', 'GOA', 'UDAIPUR', 'SRINAGAR', 'GANGTOK', 'MUNNAR', 'VARKALA', 'MCLEODGANJ', 'RISHIKESH', 'ALLEPPEY', 'DARJEELING', 'NAINITAL', 'SHIMLA', 'OOTY', 'JAIPUR', 'LONAVALA', 'MUSSOORIE', 'KODAIKANAL', 'DALHOUSIE', 'PACHMARHI', 'VARANASI', 'MUMBAI', 'AGRA', 'KOLKATA', 'JODHPUR', 'BANGALORE', 'AMRITSAR', 'DELHI', 'JAISALMER', 'MOUNT-ABU', 'WAYANAD', 'HYDERABAD', 'PONDICHERRY', 'KHAJURAHO', 'CHENNAI', 'VAISHNO-DEVI', 'AJANTA-AND-ELLORA-CAVES', 'HARIDWAR']
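
To make the filtering step more concrete, here is a rough sketch of the same idea using only Python's built-in `html.parser`, as a stand-in for BeautifulSoup rather than a replacement. The sample markup and the class name `PlaceNameParser` are assumptions made for illustration; the real page contains much more HTML.

```python
from html.parser import HTMLParser

# Hypothetical markup, modelled on the tag, class and attribute named above
SAMPLE_HTML = """
<div class="card content-card" data-itemid="MANALI"></div>
<div class="card" data-itemid="IGNORED"></div>
<div class="card content-card" data-itemid="LADAKH"></div>
"""

class PlaceNameParser(HTMLParser):
    """Collect data-itemid from div tags that carry both CSS classes."""

    def __init__(self):
        super().__init__()
        self.place_names = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get('class') or '').split()
        has_both_classes = {'card', 'content-card'} <= set(classes)
        if tag == 'div' and has_both_classes and 'data-itemid' in attributes:
            self.place_names.append(attributes['data-itemid'])

parser = PlaceNameParser()
parser.feed(SAMPLE_HTML)
place_names = parser.place_names
print(place_names)   # ['MANALI', 'LADAKH']
```

BeautifulSoup does this bookkeeping for us, which is why the scraping code in this project stays so short.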


Now let's create a function to get Place Names automatically from the *BeautifulSoup* doc.

In [29]:
#function to extract place names from the beautifulsoup document
def Get_place_names(page_doc):
    Place_Name_Tags = page_doc.find_all('div', class_= 'card content-card')   
    Place_Names=[]
    for tag in Place_Name_Tags:
        Place_Names.append(tag['data-itemid'])
    return Place_Names

Let's check how it works by getting the first 5 names from the second page.

In [30]:
#checking the function Get_place_names
Get_place_names(Get_page_contents(Links_of_pages[2]))[:5]

['KANYAKUMARI', 'PUNE', 'KOCHI', 'AHMEDABAD', 'KANHA-NATIONAL-PARK']

Yup. It worked! Here we have used all the functions we have created till now.

### 3.3 Extract information to get '*Best Time To Visit*' the places.
From the HTML code of the web page, we can see that the best time to visit each place is given in [p_tags](https://www.w3schools.com/tags/tag_p.asp) with class 'mb-3'. Let's get that information.
![](https://i.imgur.com/dsNg9HN.png)

In [31]:
#first extract all p_tags with class 'mb-3'
Visit_time_Tags=doc.find_all('p',class_='mb-3')
Best_time_to_visit=[]
#now get only text showing when to when
for tag in Visit_time_Tags:
    Best_time_to_visit.append(tag.text.strip('Best Time').strip(':').strip())
print(Best_time_to_visit)

['October to Jun', 'Jun to Sep', 'October to March', 'October to Jun', 'September to May', 'October to March', 'October to March', 'April to October', 'Throughout the year', 'September to May', 'Throughout the year', 'October to Jun', 'Throughout the year', 'June to March', 'February to March, September to December', 'Throughout the year', 'October to Jun', 'Throughout the year', 'October to March', 'Throughout the year', 'September to Jun', 'September to May', 'Throughout the year', 'Throughout the year', 'October to March', 'October to February', 'October to March', 'October to March', 'November to February', 'Throughout the year', 'October to March', 'October to March', 'October to March', 'October to March', 'Throughout the year', 'September to March', 'October to March', 'July to March', 'October to March', 'Throughout the year', 'June to March', 'Throughout the year']
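
One thing worth knowing about the code above: `str.strip('Best Time')` does not remove the phrase "Best Time". It removes any of the characters `B`, `e`, `s`, `t`, `T`, `i`, `m` and the space from both ends of the string. This may be why some months appear as "Jun" in the output, although the site may also abbreviate them itself; that cannot be told from the output alone. The sketch below shows the effect on a made-up string and one possible alternative using a regular expression. The helper name `remove_label` is hypothetical.

```python
import re

raw_text = 'Best Time: October to June'   # made-up example of the tag text

# strip() treats its argument as a set of characters, so the final 'e'
# of 'June' is removed together with the leading label
stripped = raw_text.strip('Best Time').strip(':').strip()

def remove_label(text):
    """Drop only a leading 'Best Time:' label and surrounding whitespace."""
    return re.sub(r'^\s*Best Time\s*:\s*', '', text).strip()

print(stripped)                 # October to Jun
print(remove_label(raw_text))   # October to June
```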


Now let's create a function to get the same information easily.

In [32]:
#function to extract 'Best time to visit' from the beautifulsoup document
def Get_best_time_to_visit(page_doc):
    Visit_time_Tags=page_doc.find_all('p',class_='mb-3')
    Best_time=[]
    for tag in Visit_time_Tags:
        Best_time.append(tag.text.strip('Best Time').strip(':').strip())
    return Best_time

Now, let's find the best time to visit the **top 5** places on the **3rd page.**

In [33]:
#using 'Get_page_contents' function to get the beautifulsoup doc
Get_best_time_to_visit(Get_page_contents(Links_of_pages[3]))[:5]

['October to March',
 'October to March',
 'September to March',
 'Throughout the year',
 'July to March']

### 3.4 Extract the *Ratings* given for the places.
We can find the ratings in [b_tags](https://www.w3schools.com/tags/tag_b.asp) inside [span_tags](https://www.w3schools.com/tags/tag_span.asp) with class_='rating-badge'. Let's get that information.
![](https://i.imgur.com/qkSBHQs.png)

In [34]:
#Get all span_tags with class_='rating-badge'
Rating_Tags= doc.find_all('span', class_='rating-badge')
#Now get required text and strip all the unwanted information 
Ratings=[]
for tag in Rating_Tags:
    Ratings.append(tag.text.strip('\n''/5').strip()+'/5')
print(Ratings)

['4.5/5', '4.6/5', '4.2/5', '4.5/5', '4.0/5', '4.5/5', '4.4/5', '4.5/5', '4.4/5', '4.5/5', '4.5/5', '4.4/5', '4.3/5', '4.5/5', '4.3/5', '4.3/5', '4.3/5', '4.3/5', '4.4/5', '4.1/5', '4.2/5', '4.4/5', '4.2/5', '4.4/5', '4.5/5', '4.2/5', '4.2/5', '4.3/5', '4.2/5', '4.1/5', '4.4/5', '4.1/5', '4.4/5', '4.4/5', '4.3/5', '4.1/5', '4.1/5', '4.6/5', '3.9/5', '4.4/5', '4.3/5', '4.0/5']


Let's create a function for this now.

In [35]:
#function to extract ratings from the beautifulsoup document
def Get_ratings(page_doc):
    Rating_Tags= page_doc.find_all('span', class_='rating-badge')
    Ratings=[]
    for tag in Rating_Tags:
        Ratings.append(tag.text.strip('\n''/5').strip()+'/5')
    return Ratings

Now, let's find the ratings given to the top 5 places on the 2nd page.

In [36]:
#check how the above function works
Get_ratings(Get_page_contents(Links_of_pages[2]))[:5]

['4.1/5', '4.1/5', '4.2/5', '4.1/5', '3.8/5']
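
The ratings come back as strings such as '4.1/5'. If we later want to sort or average them, they need to be numbers. A minimal sketch of that conversion follows, using the five values printed above; the helper name `rating_to_float` is made up for this example.

```python
ratings = ['4.1/5', '4.1/5', '4.2/5', '4.1/5', '3.8/5']   # output shown above

def rating_to_float(rating):
    """Convert a string such as '4.2/5' to the number 4.2."""
    score, _, _scale = rating.partition('/')
    return float(score)

scores = [rating_to_float(r) for r in ratings]
print(scores)
print(max(scores))   # 4.2
```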

### 3.5 Extract 'links' that provide detailed information about the places.
We can see that the link to more information about each place is in the `data-href` attribute of div_tags with class 'content-card-footer'. Let's get those links out.
![](https://i.imgur.com/T3DZeYY.png)

In [37]:
#get the links from div tags, under class 'content-card-footer'
Place_link_Tags =doc.find_all('div',class_='content-card-footer')
know_more_about_place_link =[]
#To get a live link, add the below URL before the elements we are extracting.
link_url='https://www.holidify.com/'
for tag in Place_link_Tags:
    know_more_about_place_link.append(link_url+tag['data-href'])
    
#lets print first 6 links from the list
print (know_more_about_place_link[:6])

['https://www.holidify.com//places/manali', 'https://www.holidify.com//places/ladakh', 'https://www.holidify.com//places/coorg', 'https://www.holidify.com//places/andaman-nicobar-islands', 'https://www.holidify.com//places/lakshadweep-islands', 'https://www.holidify.com//places/goa']


Let's create a function for the above code.

In [38]:
#function to extract place links from the beautifulsoup document
def Get_place_info_links(page_doc):
    Place_link_Tags =page_doc.find_all('div',class_='content-card-footer')
    know_more_about_place_link =[]
    link_url='https://www.holidify.com/'
    for tag in Place_link_Tags:
        know_more_about_place_link.append(link_url+tag['data-href'])
    return know_more_about_place_link

Let us get the links to know more about *Manali, Ladakh and Coorg*.

In [39]:
#Get links of top 3 places from page 1
Get_place_info_links(Get_page_contents(Links_of_pages[1]))[:3]

['https://www.holidify.com//places/manali',
 'https://www.holidify.com//places/ladakh',
 'https://www.holidify.com//places/coorg']

By now, we have extracted all the required data separately. Let us prepare lists and dictionaries using this information.

## 4.Prepare python lists and dictionaries with the extracted and cleaned information.
We have successfully extracted all the required data from the page. Now let us create a few [lists](https://www.w3schools.com/python/python_lists.asp) and [dictionaries](https://www.w3schools.com/python/python_dictionaries.asp) to get any particular information.  
Let us write a function to get the details of a particular place from the tags at its index.

**Let's create a function to get the details of a place as a tuple**

In [40]:
#Function to get the details of a particular place with respect to their index value.
def Get_place_Details (Place_Name_Tags,Rating_Tags,Visit_time_Tags,Place_link_Tags):
    # Provides required information about the tourist Places.
    Tourist_place = Place_Name_Tags['data-itemid']
    Best_time_to_Visit = Visit_time_Tags.text.strip('Best Time').strip(':').strip()
    Rating_for_Place = Rating_Tags.text.strip('\n''/5').strip()+'/5'
    #build the full link here instead of relying on a global variable
    Link_for_more_details = 'https://www.holidify.com/' + Place_link_Tags['data-href']
    
    return Tourist_place,Best_time_to_Visit,Rating_for_Place,Link_for_more_details

Let us try to get details of 3rd place in page 1 and see how it works.

In [41]:
#to get the details of 3rd place
Get_place_Details (Place_Name_Tags[2],Rating_Tags[2],Visit_time_Tags[2],Place_link_Tags[2])

('COORG',
 'October to March',
 '4.2/5',
 'https://www.holidify.com//places/coorg')

**Creating a dictionary with all the lists we have prepared.**


In [42]:
#Give the 'Key' names for dictionary
Tourist_places_dict={
    'Tourist place':[],
    'Best time to Visit':[],
    'Rating for Place':[],
    'Link for more details':[]
}

#add the respective values to the keys
for i in range(len(Place_Name_Tags)):
    Tourism_Details = Get_place_Details(Place_Name_Tags[i],Rating_Tags[i],Visit_time_Tags[i],Place_link_Tags[i])
    Tourist_places_dict['Tourist place'].append(Tourism_Details[0])
    Tourist_places_dict['Best time to Visit'].append(Tourism_Details[1])
    Tourist_places_dict['Rating for Place'].append(Tourism_Details[2])
    Tourist_places_dict['Link for more details'].append(Tourism_Details[3])
    
#Lets see what we get by executing this.
Tourist_places_dict

{'Tourist place': ['MANALI',
  'LADAKH',
  'COORG',
  'ANDAMAN-NICOBAR-ISLANDS',
  'LAKSHADWEEP-ISLANDS',
  'GOA',
  'UDAIPUR',
  'SRINAGAR',
  'GANGTOK',
  'MUNNAR',
  'VARKALA',
  'MCLEODGANJ',
  'RISHIKESH',
  'ALLEPPEY',
  'DARJEELING',
  'NAINITAL',
  'SHIMLA',
  'OOTY',
  'JAIPUR',
  'LONAVALA',
  'MUSSOORIE',
  'KODAIKANAL',
  'DALHOUSIE',
  'PACHMARHI',
  'VARANASI',
  'MUMBAI',
  'AGRA',
  'KOLKATA',
  'JODHPUR',
  'BANGALORE',
  'AMRITSAR',
  'DELHI',
  'JAISALMER',
  'MOUNT-ABU',
  'WAYANAD',
  'HYDERABAD',
  'PONDICHERRY',
  'KHAJURAHO',
  'CHENNAI',
  'VAISHNO-DEVI',
  'AJANTA-AND-ELLORA-CAVES',
  'HARIDWAR'],
 'Best time to Visit': ['October to Jun',
  'Jun to Sep',
  'October to March',
  'October to Jun',
  'September to May',
  'October to March',
  'October to March',
  'April to October',
  'Throughout the year',
  'September to May',
  'Throughout the year',
  'October to Jun',
  'Throughout the year',
  'June to March',
  'February to March, September to December',

We got a dictionary with lists of all details.
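
The loop above walks four lists by index, so it quietly relies on all of them having the same length. As a rough sketch of a more defensive variant, the code below checks the lengths first and then builds one dictionary per place with `zip`. The three sample rows are copied from the outputs above; the variable names are only illustrative.

```python
place_names = ['MANALI', 'LADAKH', 'COORG']
best_times = ['October to Jun', 'Jun to Sep', 'October to March']
ratings = ['4.5/5', '4.6/5', '4.2/5']
links = ['https://www.holidify.com//places/manali',
         'https://www.holidify.com//places/ladakh',
         'https://www.holidify.com//places/coorg']

columns = {
    'Tourist place': place_names,
    'Best time to Visit': best_times,
    'Rating for Place': ratings,
    'Link for more details': links,
}

# Fail early if one list is shorter than the others
lengths = {name: len(values) for name, values in columns.items()}
if len(set(lengths.values())) != 1:
    raise ValueError('Column lengths differ: {}'.format(lengths))

# One dictionary per place, built by walking the lists in parallel
records = [dict(zip(columns, row)) for row in zip(*columns.values())]
print(records[2])
```

A length mismatch usually means one of the selectors matched extra or missing tags, so it is worth catching before the data reaches a table.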

## 5.Create Pandas DataFrame
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). In other words, the data is aligned in a tabular fashion in rows and columns. Know more about Data Frames [here](https://www.geeksforgeeks.org/python-pandas-dataframe/#:~:text=Pandas%20DataFrame%20is%20two%2Ddimensional,fashion%20in%20rows%20and%20columns.).

Let us install and import Pandas.

In [43]:
#Install Pandas
!pip install pandas --quiet

In [44]:
#import pandas
import pandas as pd

Let's write a code to get all required information from a page into a Data Frame. We use `pd.DataFrame()` to convert a dictionary into a Data Frame Structure.

In [45]:
#Create a dictionary with all the required keys and appropriate values.
Travel_dict={
    'Tourist Place':Place_Names,
    'Best time to Visit':Best_time_to_visit,
    'Ratings':Ratings,
    'More about The Place':know_more_about_place_link}
#convert the dictionary into a DataFrame
Travel_Details = pd.DataFrame(Travel_dict)

#Let's see the output
Travel_Details

| | Tourist Place | Best time to Visit | Ratings | More about The Place |
|---|---|---|---|---|
| 0 | MANALI | October to Jun | 4.5/5 | https://www.holidify.com//places/manali |
| 1 | LADAKH | Jun to Sep | 4.6/5 | https://www.holidify.com//places/ladakh |
| 2 | COORG | October to March | 4.2/5 | https://www.holidify.com//places/coorg |
| 3 | ANDAMAN-NICOBAR-ISLANDS | October to Jun | 4.5/5 | https://www.holidify.com//places/andaman-nicob... |
| 4 | LAKSHADWEEP-ISLANDS | September to May | 4.0/5 | https://www.holidify.com//places/lakshadweep-i... |
| 5 | GOA | October to March | 4.5/5 | https://www.holidify.com//places/goa |
| 6 | UDAIPUR | October to March | 4.4/5 | https://www.holidify.com//places/udaipur |
| 7 | SRINAGAR | April to October | 4.5/5 | https://www.holidify.com//places/srinagar |
| 8 | GANGTOK | Throughout the year | 4.4/5 | https://www.holidify.com//places/gangtok |
| 9 | MUNNAR | September to May | 4.5/5 | https://www.holidify.com//places/munnar |



This is the Data Frame for the information extracted from the first Page. **Now let us create a function to get a Data Frame out of any page from the website.**




In [46]:
#function to get a data frame out of any selected page
def Get_Data_Frame(Page_Number):
    page_url= 'https://www.holidify.com/country/india/places-to-visit.html?pageNum='+(str(Page_Number -1))
    response = requests.get(page_url)
    #Check the status of the response
    if response.status_code != 200:
        raise Exception('Failed! Unable to fetch information from page {}'.format(Page_Number))
    page_doc = BeautifulSoup(response.text,'html.parser')

    #Create a dictionary with all the required keys and values
    dict_tourist_place={
        'Place Name':Get_place_names(page_doc),
        'Best Time To Visit':Get_best_time_to_visit(page_doc),
        'Ratings':Get_ratings(page_doc),
        'Link to know more About the Place':Get_place_info_links(page_doc)}
    #create a data frame out of dictionary
    return pd.DataFrame(dict_tourist_place)


Try getting a DataFrame from page 3 of the Holidify website.

In [47]:
#to create a data frame for page 3 details
Get_Data_Frame(3)

| | Place Name | Best Time To Visit | Ratings | Link to know more About the Place |
|---|---|---|---|---|
| 0 | AJMER | October to March | 3.8/5 | https://www.holidify.com//places/ajmer |
| 1 | AURANGABAD | October to March | 3.8/5 | https://www.holidify.com//places/aurangabad |
| 2 | JAMMU | September to March | 4.1/5 | https://www.holidify.com//places/jammu |
| 3 | DEHRADUN | Throughout the year | 3.9/5 | https://www.holidify.com//places/dehradun |
| 4 | PURI | July to March | 4.3/5 | https://www.holidify.com//places/puri |
| 5 | CHERRAPUNJEE | September to May | 4.5/5 | https://www.holidify.com//places/cherrapunjee |
| 6 | BIKANER | October to March | 4.3/5 | https://www.holidify.com//places/bikaner |
| 7 | SHIMOGA | July to December | 4.4/5 | https://www.holidify.com//places/shimoga |
| 8 | HOGENAKKAL | October to March | 4.0/5 | https://www.holidify.com//places/hogenakkal |
| 9 | GIR-NATIONAL-PARK | July to March | 4.2/5 | https://www.holidify.com//places/gir-national-... |


By now, we have extracted data from different pages and converted it into DataFrames. Next, let's write code to generate CSV files from this information.

## 6.Save the required information to *CSV files.*

A [CSV (Comma Separated Values)](https://www.programiz.com/python-programming/csv#:~:text=To%20write%20to%20a%20CSV,data%20into%20a%20delimited%20string.) format is one of the simplest and most common ways to store tabular data. A CSV file is saved with the `.csv` file extension. Let us write code to convert the information extracted from different pages into different CSV files.
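
To see what CSV text actually looks like, here is a small sketch that uses Python's built-in `csv` module instead of the pandas route taken in this project. It writes two rows, taken from the values extracted earlier, into an in-memory buffer rather than a file. Notice how the value that itself contains a comma gets wrapped in quotes.

```python
import csv
import io

rows = [
    {'Place Name': 'MANALI',
     'Best Time To Visit': 'October to Jun',
     'Ratings': '4.5/5'},
    {'Place Name': 'DARJEELING',
     'Best Time To Visit': 'February to March, September to December',
     'Ratings': '4.3/5'},
]

buffer = io.StringIO()   # in-memory stand-in for a .csv file
writer = csv.DictWriter(
    buffer, fieldnames=['Place Name', 'Best Time To Visit', 'Ratings'])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Reading the text back with `csv.DictReader` returns the original values, comma included, so the quoting is handled for us in both directions.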

### 6.1 Save individual pages into distinct CSV files.         

Let us save the data from each page into its own CSV file. As we have 3 pages, let's create 3 CSV files, one for each page. We will write a function to automatically scrape information from all pages and create the respective CSV files.

In [48]:
#Function to create csv files from each page
def Create_CSV_Files():
    for i in range(1,4):
        Indian_tourism_details = Get_Data_Frame(i)
        #To create a CSV file and saving it to .CSV File
        Indian_tourism_details.to_csv('Indian Tourism page {}.csv'.format(i), index= None )
        print( 'scraping Indian Tourism page {}'.format(i))

#Lets Check how it works.
Create_CSV_Files()

scraping Indian Tourism page 1
scraping Indian Tourism page 2
scraping Indian Tourism page 3


We have created three CSV files. We can find them in `Jupyter Notebook` > `File` > `Open`.  
But so far we have the data separately for each page. Let's combine all the data in one place.

### 6.2 Save all pages into a MEGA  *CSV* File

Now, we will create appropriate functions to combine the extracted data from all the pages, create a dataframe with it and save it to a single CSV file. This will provide us with all the data from the selected pages in a single file.

####  Create a Mega Data Frame
Let us write functions to combine information from all the pages and create a single, mega DataFrame.

In [49]:
#create a function to get all required information from a particular page
def get_info_from_page(page_number):
    url='https://www.holidify.com/country/india/places-to-visit.html?pageNum='+(str(page_number -1))
    #use all the functions we have created to get the data
    page_doc=Get_page_contents(url)
    place_names = Get_place_names(page_doc)
    Best_time_to_visit = Get_best_time_to_visit(page_doc)
    place_ratings = Get_ratings(page_doc)
    More_info_link = Get_place_info_links(page_doc)
    return place_names,Best_time_to_visit,place_ratings,More_info_link

Let's use `for` loop to combine data from all the pages. 

In [50]:
#Let's combine all the data from 3 pages
def combine_pages():

    #Let's add respective data    
    all_places,all_best_timings,all_ratings,all_info_links = [],[],[],[]
    for page_number in range(1,4):
        places,Best_timings,ratings,info_links = get_info_from_page(page_number)
        #add data from each page to respective keys
        all_places += places
        all_best_timings += Best_timings
        all_ratings += ratings
        all_info_links += info_links

    #Create a dictionary with all the data we have extracted
    all_pages_info = {
        'Place Name':all_places,
        'Best Time To Visit': all_best_timings ,
        'Ratings':all_ratings,
        'Link to know more About the Place':all_info_links}
    #convert the dictionary into a dataframe and write it to a variable
    return pd.DataFrame(all_pages_info)
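
The function above hard-codes `range(1,4)` because the site showed three pages when this was written. As a hedged alternative, the page numbers could be read from the pagination links instead. The sketch below does that with the standard library's `urllib.parse`, using the link list printed in section 3.1 as sample input. It assumes the pagination links cover every page, which may not hold if the site shows only a window of page numbers; the helper name `page_numbers` is made up.

```python
from urllib.parse import urlparse, parse_qs

# Pagination links in the shape printed in section 3.1
page_link_urls = [
    'https://www.holidify.com//country/india/places-to-visit.html?pageNum=0',
    'https://www.holidify.com//country/india/places-to-visit.html?pageNum=0',
    'https://www.holidify.com//country/india/places-to-visit.html?pageNum=1',
    'https://www.holidify.com//country/india/places-to-visit.html?pageNum=2',
    'https://www.holidify.com//country/india/places-to-visit.html?pageNum=1',
]

def page_numbers(urls):
    """Return sorted, de-duplicated 1-based page numbers found in the links."""
    numbers = set()
    for url in urls:
        values = parse_qs(urlparse(url).query).get('pageNum', [])
        for value in values:
            numbers.add(int(value) + 1)   # pageNum in the URL is 0-based
    return sorted(numbers)

print(page_numbers(page_link_urls))   # [1, 2, 3]
```

The result could then drive the loop, for example `for page_number in page_numbers(Get_page_links(doc)):`.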

Let us check how the Mega data frame looks.

In [51]:
#Call the function to get a dataframe with details from all 3 pages
combine_pages()

| | Place Name | Best Time To Visit | Ratings | Link to know more About the Place |
|---|---|---|---|---|
| 0 | MANALI | October to Jun | 4.5/5 | https://www.holidify.com//places/manali |
| 1 | LADAKH | Jun to Sep | 4.6/5 | https://www.holidify.com//places/ladakh |
| 2 | COORG | October to March | 4.2/5 | https://www.holidify.com//places/coorg |
| 3 | ANDAMAN-NICOBAR-ISLANDS | October to Jun | 4.5/5 | https://www.holidify.com//places/andaman-nicob... |
| 4 | LAKSHADWEEP-ISLANDS | September to May | 4.0/5 | https://www.holidify.com//places/lakshadweep-i... |
| ... | ... | ... | ... | ... |
| 95 | PUSHKAR | October to March | 4.0/5 | https://www.holidify.com//places/pushkar |
| 96 | CHITTORGARH | October to March | 4.1/5 | https://www.holidify.com//places/chittorgarh |
| 97 | NAHAN | April to September | 3.9/5 | https://www.holidify.com//places/nahan |
| 98 | LAVASA | Throughout the year | 3.9/5 | https://www.holidify.com//places/lavasa |


Yaay! We got a data frame with all the **100 rows** from 3 pages, with the required **4 columns**.

####  Create a Mega CSV File

Let us create a single CSV file using the DataFrame that contains all the data from the three pages.

In [52]:
#execute this to save all the data into single csv file
combine_pages().to_csv('Top 100 Indian Tourist Places.csv',index=None)
print('MEGA CSV file created')

MEGA CSV file created


**Add this CSV file to the Project Notebook**

In [53]:
# Execute this to save the notebook and add the required file to it
jovian.commit(files=['Top 100 Indian Tourist Places.csv'])

<IPython.core.display.Javascript object>

[jovian] Updating notebook "tharakdasari25/indiantourism-web-scraping-project-1" on https://jovian.com
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.com/tharakdasari25/indiantourism-web-scraping-project-1


'https://jovian.com/tharakdasari25/indiantourism-web-scraping-project-1'

## 7.Write down all the functions  used in the project.

Let us write all the libraries, functions, code and data involved in the project into a single cell.

In [54]:
#Install and import required libraries

!pip install jovian --upgrade --quiet
import jovian

!pip install requests --upgrade --quiet
import requests

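!pip install beautifulsoup4 --upgrade --quiet
from bs4 import BeautifulSoup

#pandas is needed by Get_Data_Frame and combine_pages below
!pip install pandas --quiet
import pandas as pd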

Holidify_page_url ='https://www.holidify.com/country/india/places-to-visit.html'

#Function to get the HTML source code from the page
def Get_page_contents(page_link):
    response = requests.get(page_link)
    if response.status_code != 200:
        raise Exception('Unable to fetch the page {}'.format(page_link))
    #parse the downloaded HTML text
    doc = BeautifulSoup(response.text,'html.parser')
    return doc

#Function to extract next page links from the selected webpage.
def Get_page_links(page_doc):
    page_links =page_doc.find_all('a',class_="page-link")
    page_link_urls =[]
    url='https://www.holidify.com/'
    for tag in page_links:
        page_link_urls.append(url+tag['href'])
    return page_link_urls


#Function to extract 'Tourist place Name' from the website.
def Get_place_names(page_doc):
    Place_Name_Tags = page_doc.find_all('div', class_= 'card content-card')   
    Place_Names=[]
    for tag in Place_Name_Tags:
        Place_Names.append(tag['data-itemid'])
    return Place_Names


#Function to get the 'Best time to visit' the particular place.
def Get_best_time_to_visit(page_doc):
    Visit_time_Tags=page_doc.find_all('p',class_='mb-3')
    Best_time=[]
    for tag in Visit_time_Tags:
        Best_time.append(tag.text.strip('Best Time').strip(':').strip())
    return Best_time


#Function to get the ratings given for the place.
def Get_ratings(page_doc):
    Rating_Tags= page_doc.find_all('span', class_='rating-badge')
    Ratings=[]
    for tag in Rating_Tags:
        Ratings.append(tag.text.strip('\n''/5').strip()+'/5')
    return Ratings


#Function to get the links 
def Get_place_info_links(page_doc):
    Place_link_Tags =page_doc.find_all('div',class_='content-card-footer')
    know_more_about_place_link =[]
    link_url='https://www.holidify.com/'
    for tag in Place_link_Tags:
        know_more_about_place_link.append(link_url+tag['data-href'])
    return know_more_about_place_link


#Function to get the details of a particular place with respect to their index value.
def Get_place_Details (Place_Name_Tags,Rating_Tags,Visit_time_Tags,Place_link_Tags):
    # Provides required information about the tourist Places.
    Tourist_place = Place_Name_Tags['data-itemid']
    Best_time_to_Visit = Visit_time_Tags.text.strip('Best Time').strip(':').strip()
    Rating_for_Place = Rating_Tags.text.strip('\n''/5').strip()+'/5'
    #link_url is not defined at module level in this cell, so build the link here
    Link_for_more_details = 'https://www.holidify.com/' + Place_link_Tags['data-href']
    
    return Tourist_place,Best_time_to_Visit,Rating_for_Place,Link_for_more_details



#Function to scrape each page and get a required  Dataframe
def Get_Data_Frame(Page_Number):
    page_url= 'https://www.holidify.com/country/india/places-to-visit.html?pageNum='+(str(Page_Number -1))
    response = requests.get(page_url)
    #Check the status of the response
    if response.status_code != 200:
        raise Exception('Failed! Unable to fetch information from page {}'.format(Page_Number))
    page_doc = BeautifulSoup(response.text,'html.parser')

    #Create a dictionary with all the required keys and values
    dict_tourist_place={
        'Place Name':Get_place_names(page_doc),
        'Best Time To Visit':Get_best_time_to_visit(page_doc),
        'Ratings':Get_ratings(page_doc),
        'Link to know more About the Place':Get_place_info_links(page_doc)}
    #create a data frame out of dictionary
    return pd.DataFrame(dict_tourist_place)


#Function to create one CSV file per page
def Create_CSV_Files():
    for i in range(1,4):
        print( 'scraping Indian Tourism page {}'.format(i))
        Indian_tourism_details = Get_Data_Frame(i)
        #Save the data frame to a CSV file without the index column
        Indian_tourism_details.to_csv('Indian Tourism page {}.csv'.format(i), index=False )
        
#create a function to get all required information from a particular page
def get_info_from_page(page_number):
    url='https://www.holidify.com/country/india/places-to-visit.html?pageNum='+(str(page_number -1))
    #use all the functions we have created to get the data
    page_doc=Get_page_contents(url)
    place_names = Get_place_names(page_doc)
    Best_time_to_visit = Get_best_time_to_visit(page_doc)
    place_ratings = Get_ratings(page_doc)
    More_info_link = Get_place_info_links(page_doc)
    return place_names,Best_time_to_visit,place_ratings,More_info_link


#Let's combine the data from all 3 pages
def combine_pages():  
    all_places,all_best_timings,all_ratings,all_info_links = [],[],[],[]
    for page_number in range(1,4):
        places,Best_timings,ratings,info_links = get_info_from_page(page_number)
        #add data from each page to respective keys
        all_places += places
        all_best_timings += Best_timings
        all_ratings += ratings
        all_info_links += info_links
    #Create a dictionary with all the data we have extracted
    all_pages_info = {
        'Place Name':all_places,
        'Best Time To Visit': all_best_timings ,
        'Ratings':all_ratings,
        'Link to know more About the Place':all_info_links}
    #convert the dictionary into a dataframe and return it
    return pd.DataFrame(all_pages_info)

#execute this to save all the data into a single CSV file
combine_pages().to_csv('Top 100 Indian Tourist Places.csv',index=False)
print('MEGA CSV file created')


MEGA CSV file created


With this, we have finished scraping the top 100 tourist places in India from the *Holidify* website.
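
Before relying on the saved file, it can help to confirm that every row has all four fields filled in. The snippet below is a minimal sketch using only the standard library; the helper name `find_incomplete_rows` and the two sample rows are made up for illustration and are not part of the original notebook.

```python
import csv
import io

# Column names used for the combined CSV file in this project.
EXPECTED_COLUMNS = [
    'Place Name',
    'Best Time To Visit',
    'Ratings',
    'Link to know more About the Place',
]

def find_incomplete_rows(csv_file):
    """Return (row_count, numbers of data rows with a missing or empty field)."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError('Unexpected header: {}'.format(reader.fieldnames))
    row_count = 0
    incomplete = []
    for row_number, row in enumerate(reader, start=1):
        row_count += 1
        # DictReader fills missing trailing fields with None, hence the "or ''".
        if any(not (row[column] or '').strip() for column in EXPECTED_COLUMNS):
            incomplete.append(row_number)
    return row_count, incomplete

# Made-up sample data; the second row deliberately has no rating.
sample = io.StringIO(
    'Place Name,Best Time To Visit,Ratings,Link to know more About the Place\n'
    'Place A,October to June,4.5/5,https://example.com/places/a\n'
    'Place B,All year,,https://example.com/places/b\n'
)
row_count, incomplete = find_incomplete_rows(sample)
print(row_count, incomplete)
```

To check the real file, the same function can be called on `open('Top 100 Indian Tourist Places.csv', newline='', encoding='utf-8')` instead of the in-memory sample.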

## Summary Of the Project
1. Install the required libraries: *Jovian*, *Requests* and *BeautifulSoup*.
2. Download the *Holidify* web page using `requests`.
3. Save the HTML of the page to a file from the notebook.
4. Parse the HTML source code using `BeautifulSoup`.
5. Extract information about the tourist places, such as place name, best time to visit, rating and link.
6. Compile the extracted information into Python lists and dictionaries.
7. Install and import the `pandas` library.
8. Save the extracted information into a `DataFrame`.
9. Create `CSV` files from the scraped information.
10. Combine the data from all pages into a single `DataFrame` and `CSV` file.
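
The cleaning in step 5 deserves a closer look, because `str.strip(chars)` removes any of the given characters from both ends of a string rather than a literal prefix or suffix. The sketch below contrasts that behaviour with a prefix-based cleanup and uses `urllib.parse.urljoin` to build absolute links. The helper names `clean_best_time`, `clean_rating` and `absolute_link` and the sample strings are hypothetical; the actual text on the page may be formatted differently.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.holidify.com/'

def clean_best_time(text):
    """Drop a leading 'Best Time' label and its colon, if present."""
    text = text.strip()
    label = 'Best Time'
    if text.startswith(label):
        text = text[len(label):]
    return text.lstrip(' :').strip()

def clean_rating(text):
    """Keep the number before '/' and append '/5' again."""
    return text.split('/')[0].strip() + '/5'

def absolute_link(href):
    """Join a relative link with the site root without doubling the slash."""
    return urljoin(BASE_URL, href)

raw_time = 'Best Time: October to June'
# Character-set stripping also removes the trailing 'e' of 'June'.
stripped_time = raw_time.strip('Best Time').strip(':').strip()
print(stripped_time)              # October to Jun
print(clean_best_time(raw_time))  # October to June

raw_rating = '4.5/5'
# Without a space before '/5', character-set stripping eats the '5' of '4.5'.
stripped_rating = raw_rating.strip('\n/5').strip() + '/5'
print(stripped_rating)            # 4./5
print(clean_rating(raw_rating))   # 4.5/5

print(absolute_link('/places/manali'))  # https://www.holidify.com/places/manali
```

Splitting on the separator or removing a known prefix keeps the cleanup independent of which letters happen to appear at the end of the value.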

## Future work
1. Other details, such as the number of tourist attractions and package details, can be scraped from the web page.
2. The details behind each "know more about the place" link can also be extracted individually.
3. Get the Google Maps locations of the tourist places.
4. Get scenic images for the different tourist places.
5. Get tourism details for other countries, such as Nepal, Bhutan and China.
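
For the last idea, one possible starting point is to make the country part of the page URL a parameter. The sketch below is only an assumption: it reuses the URL pattern seen for India in this project, and whether other countries follow the same pattern on *Holidify* has not been checked here. The function name `build_page_url` is hypothetical.

```python
from urllib.parse import urlencode

def build_page_url(country, page_number):
    """Build a listing URL for a 1-based page number.

    Assumption: other countries use the same path pattern as India.
    """
    if page_number < 1:
        raise ValueError('page_number starts at 1')
    base = 'https://www.holidify.com/country/{}/places-to-visit.html'.format(country)
    # The site numbers pages from 0, so page 1 maps to pageNum=0.
    return base + '?' + urlencode({'pageNum': page_number - 1})

india_first_page = build_page_url('india', 1)
print(india_first_page)
# https://www.holidify.com/country/india/places-to-visit.html?pageNum=0
```

The resulting URL can then be passed to the same download and parsing functions used above, after confirming in a browser that the page exists and has the same structure.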


## References

* Basics of web scraping: [Web Scraping and REST APIs](https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/web-scraping-and-rest-apis) lesson from the Jovian [Data Science and Machine Learning Bootcamp](https://jovian.ai/learn/zero-to-data-analyst-bootcamp).
* Tips for documentation: https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/documentation-and-storytelling
* More about Python libraries for web scraping: https://www.analyticsvidhya.com/blog/2020/04/5-popular-python-libraries-web-scraping/
* Page used for scraping: https://www.holidify.com/country/india/places-to-visit.html
* Image links from: https://imgur.com/upload


# Execute this to save the notebook and attach the CSV file to it
jovian.commit(files=['Top 100 Indian Tourist Places.csv'])

[jovian] Updating notebook "tharakdasari25/indiantourism-web-scraping-project-1" on https://jovian.com
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.com/tharakdasari25/indiantourism-web-scraping-project-1


'https://jovian.com/tharakdasari25/indiantourism-web-scraping-project-1'