### Introduction

This project explores ways to extract information from the web and store it in a database. It scrapes details about the top 30 pizzerias in San Francisco from Yellow Pages search results and stores them in MongoDB.

### Variables extracted    
- Pizzeria
    - Name
    - URL
    - Ratings
    - Reviews
    - Amenities
    - Years in business

### Business Outcome:

Data scraped from the web can be a valuable resource for statistical analysis and research. Storing it in a database puts all the relevant information in one queryable place, which saves time and makes analysis more efficient.

The data on the top 30 pizzerias in San Francisco can support business decisions based on factors such as location or the range of services offered. Because the details are stored in a structured format, they are straightforward to query and analyze.

The stored data can also feed more advanced analytics and statistical modeling, such as clustering, regression, and predictive modeling. These techniques can identify patterns and relationships within the data and support forecasts based on historical trends.

Overall, structured data in a database is a useful tool for organizations that want to base business strategy on data-driven decisions.

### Tools and Technologies Used
- Selenium
- Beautiful Soup
- MongoDB






Importing libraries


In [4]:
from bs4 import BeautifulSoup
import requests
import time
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import os
import json
import pymongo


### Top 30 pizzerias


Storing the HTML page that lists the top 30 pizzerias on disk

In [1418]:
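URL = "https://www.yellowpages.com/search?search_terms=pizzeria&geo_location_terms=San+Francisco%2C+CA"
#Request headers must be a dict; the original values were elided, so add your own (for example a User-Agent)
header={}
page1 = requests.get(URL,headers=header)
doc1 = BeautifulSoup(page1.text, 'lxml')
#saveString is a helper that is not shown in this section; it writes the parsed page to disk
saveString(doc1,"sf_pizzeria_search_page.html")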
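The `saveString` helper is called here but its definition does not appear in this section. A minimal sketch of what it might look like, assuming it only needs to serialise the page and write it to a UTF-8 file (the body below is an assumption, not the notebook's own code):

```python
from pathlib import Path


def saveString(html, filename):
    """Write HTML (a plain string or a BeautifulSoup object) to disk as UTF-8."""
    # str() leaves a plain string unchanged and serialises a BeautifulSoup
    # document back to markup, so both input types are handled the same way.
    Path(filename).write_text(str(html), encoding="utf-8")
    return filename
```

Writing with an explicit UTF-8 encoding matches the `encoding='utf-8'` used when the saved pages are read back later.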


Opening the saved search results page and loading its HTML into a Beautiful Soup object


In [13]:
with open("sf_pizzeria_search_page.html",'r',encoding='utf-8') as f:
    #Parse the saved page directly; no intermediate list is needed
    soup = BeautifulSoup(f.read(), 'lxml')

Function to get rank number of each search result


In [1420]:
def search_rank(html_object,num):
    rank_list=[]
    for counter in range(0,num):
        y2=html_object.find('div',attrs={"class":"search-results organic"}).findAll("div",attrs={"class":"info"})[counter].find("h2",attrs={"class":"n"}).text
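        #Raw string avoids an invalid-escape warning; matching only the leading digits keeps periods in shop names out of the rank
        a=re.findall(r"^\s*(\d+)\.", str(y2))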
        rank_list.append(a[0])
    return rank_list
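The rank is taken from heading text that is assumed to look like `12. Shop name`. The sketch below shows one way to parse it so that a period inside the shop name cannot leak into the rank. The function name `parse_rank` and the sample headings are hypothetical.

```python
import re

# Leading whitespace, then the rank digits, then the period that follows them.
RANK_PATTERN = re.compile(r"^\s*(\d+)\.")


def parse_rank(heading_text):
    """Return the leading rank number as a string, or "NA" if there is none."""
    match = RANK_PATTERN.match(heading_text)
    return match.group(1) if match else "NA"


# A greedy "(.*)\." pattern would stop at the last period in the heading instead.
print(parse_rank("12. Mr. Example Pizza"))  # prints 12
print(parse_rank("Example Pizza"))          # prints NA
```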


Function to get name of the shop


In [1421]:
def name(html_object,num):   
    name_list=[]
    for counter in range(0,num):
        y=html_object.find('div',attrs={"class":"search-results organic"}).findAll("div",attrs={"class":"info"})[counter].find("h2",attrs={"class":"n"}).find('span').text
        name_list.append(y)
    return name_list


Function to get the list of URLs


In [1422]:
def get_url(html_object,num):
    list_url=[]

    for counter in range(0,num):
        list_of_contents=html_object.find('div',attrs={"class":"search-results organic"}).findAll('div',attrs= {"class" : "info-section info-primary"})
        part2=(list_of_contents[counter].findAll('a')[0].get("href"))
        list_url.append("https://www.yellowpages.com"+part2)
    return list_url


Function to get star ratings if they exist


In [1423]:
def get_star_ratings(html_object,num):
    star_ratings_list=[]

    for counter in range(0,num):
        try:
            y=html_object.find('div',attrs={"class":"search-results organic"}).findAll("div",attrs={"class":"info"})[counter].find("a",attrs={"class":"rating hasExtraRating"}).find("div").get("class")[1]
            star_ratings_list.append(y)
        except Exception:
            star_ratings_list.append("NA")
    return star_ratings_list
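The function above stores the rating as a CSS class word rather than a number. A small sketch of converting such a class list into a numeric value follows. It assumes the classes look like `["result-rating", "four", "half"]`; the vocabulary and the helper name `stars_from_classes` are assumptions, so check them against the saved HTML before relying on this.

```python
# Assumed vocabulary of rating class words.
WORD_TO_NUMBER = {"one": 1.0, "two": 2.0, "three": 3.0, "four": 4.0, "five": 5.0}


def stars_from_classes(class_list):
    """Convert a list of CSS classes into a star rating, or "NA" if none is present."""
    rating = None
    for token in class_list:
        if token in WORD_TO_NUMBER:
            rating = WORD_TO_NUMBER[token]
    if rating is None:
        return "NA"
    # An extra "half" class is assumed to add half a star.
    if "half" in class_list:
        rating += 0.5
    return rating
```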

Function to fetch the number of reviews behind each star rating


In [1424]:
def star_reviews_count(html_object,num):
    star_num_reviews=[]
    for counter in range(0,num):
        try:
            y=html_object.find('div',attrs={"class":"search-results organic"}).findAll("div",attrs={"class":"info"})[counter].find("a",attrs={"class":"rating hasExtraRating"}).text
            star_num_reviews.append(y)
        except Exception:
            star_num_reviews.append("NA")
    return star_num_reviews

Function to get TripAdvisor (TA) ratings and the number of TA reviews


In [1425]:
def ta_ratings(html_object,num):
    trip_advisor_ratings=[]
    ta_num_reviews=[]
    for counter in range(0,num):
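        try:
            y=html_object.find('div',attrs={"class":"search-results organic"}).findAll("div",attrs={"class":"info"})[counter].find("div",attrs={"class":"ratings"}).get('data-tripadvisor')
            z=json.loads(y)
            #Take the first key starting with 'rating' and with 'count', falling back to "NA" if one is missing
            rating=next((value for key, value in z.items() if key.startswith('rating')), "NA")
            count=next((value for key, value in z.items() if key.startswith('count')), "NA")
        except Exception:
            #Raised when the listing has no TripAdvisor data or the attribute is not valid JSON
            rating, count = "NA", "NA"
        #Append exactly one value per listing so both lists stay aligned with the other columns
        trip_advisor_ratings.append(rating)
        ta_num_reviews.append(count)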
            
    return (trip_advisor_ratings,ta_num_reviews)
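The TripAdvisor data is read from a `data-tripadvisor` attribute that holds a JSON string. The sketch below isolates that parsing step so it can be checked without any HTML. It assumes the keys are exactly `rating` and `count`; the notebook only checks that key names start with those words, so the exact names are an assumption, as is the helper name `parse_tripadvisor`.

```python
import json


def parse_tripadvisor(raw):
    """Return (rating, count) from a data-tripadvisor value, or ("NA", "NA")."""
    try:
        payload = json.loads(raw)
    except (TypeError, ValueError):
        # TypeError: raw is None because the attribute is missing.
        # ValueError: the attribute is present but is not valid JSON.
        return ("NA", "NA")
    if not isinstance(payload, dict):
        return ("NA", "NA")
    # Each field falls back to "NA" on its own, so one result is always a pair.
    return (payload.get("rating", "NA"), payload.get("count", "NA"))
```

Returning a pair for every input keeps the ratings list and the review-count list the same length as the other columns.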

Function to get price range


In [1426]:
def price(html_object,num):
    price_range=[]

    for counter in range(0,num):
        try:
            y=html_object.find('div',attrs={"class":"search-results organic"}).findAll("div",attrs={"class":"info"})[counter].find("div",attrs={"class":"price-range"}).text
            price_range.append(y)
        except Exception:
            price_range.append("NA")
    return price_range

Function to fetch the number of years in business


In [1427]:
def years_active(html_object,num):
    years_business=[]
    
    for counter in range(0,num):
        try:
            y=html_object.find('div',attrs={"class":"search-results organic"}).findAll("div",attrs={"class":"info"})[counter].find("div",attrs={"class":"number"}).text
            years_business.append(y)
        except Exception:
            years_business.append("NA")

    return years_business

Function to fetch customer reviews


In [1428]:
def cust_reviews(html_object,num):
    customer_reviews=[]
    
    for counter in range(0,num):
        try:
            y=html_object.find('div',attrs={"class":"search-results organic"}).findAll("div",attrs={"class":"info"})[counter].find("p",attrs={"class":"body with-avatar"}).text
            customer_reviews.append(y)
        except Exception:
            customer_reviews.append("NA")
    return customer_reviews

Function to get amenities from each span 


In [1429]:
def get_amenities(html_object,num):
    amenities=[]
    
    for counter in range(0,num):
        try:    
            y=html_object.find('div',attrs={"class":"search-results organic"}).findAll("div",attrs={"class":"info"})[counter].find("div",attrs={"class":"amenities-info"}).findAll("span")
            int_list=[]
            for tag in y:
                int_list.append(tag.text)
                
            amenities.append(int_list)
        except Exception:
            amenities.append("NA")
    return amenities
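The extraction functions above all share one pattern: try a chain of lookups and fall back to `"NA"` when any step fails. That pattern could be factored into a single helper, sketched below. The name `safe_extract` is hypothetical and the helper is not part of the original notebook.

```python
def safe_extract(extractor, default="NA"):
    """Call extractor() and return its result, or default if a lookup step fails."""
    try:
        value = extractor()
    except (AttributeError, IndexError, TypeError, KeyError):
        # AttributeError: a find() returned None and the next call failed.
        # IndexError / KeyError / TypeError: a missing list item, key or attribute.
        return default
    return default if value is None else value
```

With a parsed listing tag called `card` (a hypothetical variable), a call would look like `safe_extract(lambda: card.find("div", attrs={"class": "price-range"}).text)`.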



Call specific functions to get shop information


In [1430]:
#Get rank number of search result
rank_list=search_rank(soup,30)

#Get name of the shop
name_list=name(soup,30)

#Get list of URLs
list_url=get_url(soup,30)

#Get star ratings if they exist
star_ratings_list=get_star_ratings(soup,30)

#Count of star ratings reviews
star_num_reviews=star_reviews_count(soup,30)

#TA ratings and count of ratings
(trip_advisor_ratings,ta_num_reviews)=ta_ratings(soup,30)

#Price range of the store in dollar signs
price_range=price(soup,30)

#Get years active business
years_business=years_active(soup,30)

#Customer reviews
customer_reviews=cust_reviews(soup,30)

#Get amenities from each span
amenities=get_amenities(soup,30)


Connecting to the MongoDB client


In [1152]:
client=pymongo.MongoClient()

Creating a new MongoDB database


In [None]:
db=client.get_database("Pizzeria")

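#Checking which databases exist in the MongoDB instance
#MongoDB creates a database lazily, so "Pizzeria" appears in this list only after data has been written to it
print(client.list_database_names())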

Creating the collection


In [1154]:
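#Reuse the collection on re-runs; create_collection raises an error if it already exists
exists="sf_pizzerias" in db.list_collection_names()
store_collection=db.get_collection("sf_pizzerias") if exists else db.create_collection("sf_pizzerias")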

Inserting documents into the MongoDB collection


In [1157]:

for i in range(0,30):
    response=store_collection.insert_one(
    {
        "Search Rank": rank_list[i],
        "Name": name_list[i],
        "Linked url": list_url[i],
        "Star rating": star_ratings_list[i] ,
        "Number of reviews":star_num_reviews[i] ,
        "TripAdvisor rating": trip_advisor_ratings[i],
        #ta_num_reviews is the list returned by ta_ratings
        "Number of TA reviews":ta_num_reviews[i] ,
        "Dollar value": price_range[i],
        "Years in business": years_business[i],
        "Review":customer_reviews[i],
        "Amenities": amenities[i]
     })
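The loop above builds one document per shop by indexing into parallel lists. The same shape can be produced with `zip`, which also makes it easy to check that all lists have the same length before anything is written. A minimal sketch, with `build_documents` as a hypothetical helper name:

```python
def build_documents(columns):
    """Turn {"field": [v0, v1, ...], ...} into [{"field": v0, ...}, {"field": v1, ...}, ...]."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        # Misaligned lists would silently pair values from different shops.
        raise ValueError("all columns must have the same length")
    field_names = list(columns)
    return [dict(zip(field_names, row)) for row in zip(*columns.values())]


docs = build_documents({"Search Rank": ["1", "2"], "Name": ["Shop A", "Shop B"]})
print(docs[0])  # prints {'Search Rank': '1', 'Name': 'Shop A'}
```

The resulting list could then be passed to pymongo's `insert_many` in a single call, for example `store_collection.insert_many(docs)`.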

Accessing the URLs stored in the MongoDB collection


In [1158]:
cursor=store_collection.find({},{"Linked url":1,"_id":0}).limit(30)
#Iterate the cursor once; indexing it as a[i] issues a separate query for every index
list_url=[doc['Linked url'] for doc in cursor]

Saving the HTML of each shop page to disk


In [1159]:
for a in range(0,30):
    page1 = requests.get(list_url[a],headers=header)
    doc1 = BeautifulSoup(page1.text, 'lxml')
    saveString(doc1,"sf_pizzeria_"+str(rank_list[a])+".html")
    

Reading the 30 downloaded shop pages: open each saved shop page and load its HTML into a Beautiful Soup object


In [None]:
shop_address=[]
shop_phone=[]
shop_website=[]
for i in range(1,31):
    #Use file_name rather than name, so the name() function defined earlier is not overwritten
    file_name="sf_pizzeria_"+str(i) +".html"
    with open(file_name,'r',encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'lxml')
        
        #Find address of the store
        z=soup.find("span",attrs={"class":"address"})
        shop_address.append(z.contents[0].text+", "+z.contents[1].text)
        
        #Get phone number of the store
        q=soup.find("a",attrs={"class":"phone dockable"})
        shop_phone.append(q.text)
        
        #Get website address of the store
        try:
            s=soup.find("a",attrs={"class":"website-link dockable"})
            shop_website.append(s.get("href"))
        except Exception:
            shop_website.append("NA")
          

Positionstack:
Query the positionstack API with an access key to get the latitude and longitude of each shop address


In [1162]:
shop_latitude=[]
shop_longitude=[]
url = "http://api.positionstack.com/v1/forward"
#Request headers must be a dict; the original values were elided, so add your own if needed
header={}

for i in range(0,30):
    page = requests.get(url,headers=header,params={
        'access_key': '',
        'query': shop_address[i],
        'region': 'United States',
        'limit': 1,
        })

    #The API returns JSON, so parse it directly instead of passing it through an HTML parser
    json_dict = page.json()
    
    try:
        shop_longitude.append(json_dict['data'][0]['longitude'])
    except (KeyError, IndexError, TypeError):
        shop_longitude.append('NA')
    try:
        shop_latitude.append(json_dict['data'][0]['latitude'])
    except (KeyError, IndexError, TypeError):
        shop_latitude.append('NA')
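The coordinate lookup can be tested without calling the API by separating it from the request. The sketch below assumes the response shape used in the cell above, a `data` list whose first item carries `latitude` and `longitude`. The helper name `extract_coordinates` is hypothetical.

```python
def extract_coordinates(payload):
    """Return (latitude, longitude) from the first geocoding result, or ("NA", "NA")."""
    try:
        first = payload["data"][0]
        return (first["latitude"], first["longitude"])
    except (KeyError, IndexError, TypeError):
        # KeyError: no "data" key (for example an error response) or a missing field.
        # IndexError: "data" is an empty list. TypeError: an item is not a dict.
        return ("NA", "NA")
```

Unlike the cell above, which fills longitude and latitude independently, this version returns either both values or neither, so a shop never ends up with half a coordinate.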


Adding each shop's address, phone number, website and geolocation to the existing MongoDB collection


In [1163]:
for i in range(0,30):
    #Set all four fields with a single update per shop
    store_collection.update_one(
        {'Search Rank' : str(i+1)},
        { "$set": {
            "Address": shop_address[i],
            "PhoneNum": shop_phone[i],
            "Website": shop_website[i],
            #Geolocation is stored as a "longitude, latitude" string
            "Geolocation": str(shop_longitude[i])+", "+str(shop_latitude[i])
        } })
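The geolocation is stored as a `longitude, latitude` string, which is easy to read but cannot be used for distance queries. If location-based analysis is a goal, one option is to store a GeoJSON `Point` instead, which MongoDB can index with a `2dsphere` index. A minimal sketch, with `to_geojson_point` as a hypothetical helper that is not part of the original notebook:

```python
def to_geojson_point(longitude, latitude):
    """Build a GeoJSON Point, or return None if either coordinate is not numeric."""
    try:
        # GeoJSON lists coordinates in [longitude, latitude] order.
        return {"type": "Point", "coordinates": [float(longitude), float(latitude)]}
    except (TypeError, ValueError):
        # Covers the "NA" placeholders used when geocoding failed.
        return None
```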

