<a href="https://colab.research.google.com/github/MV2290/MV2290/blob/main/GoogleMaps_selenium_TO_BE_SHARED_WIP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction and guide

This script scrapes a specific Google Maps page without needing the Google Maps API.
It collects the general overview, the information page, and the reviews.

The script runs entirely online, from the platform to the output. For this reason you need to connect your Google Drive to Google Colab so the data can be written to a Google Sheet saved on your Drive. The output layout follows a given template.

Keep in mind that scraping time depends on your connection and on how many reviews the page has.

For an overview of the script, refer to the GMS_overview picture.


**Step-by-step guide:**

1.   Run point 1. and authorize Google Colab to connect to your Google Drive. Then run point 2. and point 3. (see the video 'GMS_FirstSession' uploaded with the files)

2.   At point 4., enter the inputs for scraping (see the video 'GMS_InputAndExtraction'):
*   Name of the Google Sheet where the output should be stored
*   URL of the Google Maps page
*   How many reviews you want to scrape
*   XPath of the Information and Reviews buttons; if you are not sure which XPaths are required, see the dedicated step-by-step instructions after this list or the related video (GMS_Xpath)

3.   Point 5. starts the extraction; useful progress information is printed while it runs





**Instructions to retrieve the XPath for reviews and information:**


1.   Go to the Google Maps page that you want to scrape
2.   Press F12 (command+option+I on Mac) to open the developer tools and see the HTML code
3.   Activate the element inspector with ctrl+shift+C (command+shift+C on Mac)
4.   Click the Reviews or Information button; the matching HTML code is highlighted on the right side
5.   Select the HTML line beginning with "button" and right-click it
6.   In the menu, go to Copy and then select "Copy XPath"
7.   Paste the XPath into the input cell at point 4.

P.S. Most of the time the XPath stays the same, so you don't need to replace it every time you run the code, only when the last cell reports an error indicating "Xpath not found".
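
To illustrate why a copied XPath can stop matching, here is a minimal sketch that uses Python's built-in `xml.etree.ElementTree` on a hypothetical, heavily simplified page. The markup, the button labels, and the helper `button_label` are assumptions made for illustration; the real Google Maps page is nested much more deeply. The point is that a trailing step such as `button[2]` selects a button by position, so the path breaks as soon as the layout of the page shifts.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the Google Maps tab bar.
PAGE = """
<html><body>
  <div id="QA0Szd">
    <div>
      <button>Overview</button>
      <button>Reviews</button>
      <button>About</button>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())


def button_label(position):
    """Return the label of the button at a 1-based position, or None."""
    # ElementTree only accepts relative paths, hence the leading "."
    match = root.find(f".//*[@id='QA0Szd']/div/button[{position}]")
    return match.text if match is not None else None


print(button_label(2))  # Reviews
print(button_label(3))  # About
print(button_label(4))  # None: there is no fourth button, so nothing matches
```

If a page shows fewer or more tabs than the one the XPath was copied from, the same position can point to a different button or to nothing at all, which is when the XPath has to be copied again.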

# 1. Authorize and link your Google Drive - *Run once*

When you run the first cell, a pop-up window appears asking you to authorize the connection between Google Colab and Google Drive. After that you can run the remaining cells; if you run the whole section at once, they execute automatically.

In [None]:
from google.colab import auth
auth.authenticate_user()

In [None]:
import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)

# 2. Selenium setup - *Run only once*

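Setup for running Selenium in Google Colab.
The installation commands (the lines starting with `!`) are not needed if you work in a Jupyter notebook or another local Python environment where Selenium and a Chrome driver are already installed; the imports and options are still required by the later cells.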
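
If you are not sure which environment you are in, the following minimal sketch checks for Colab. It relies on the assumption that Colab loads the `google.colab` package when the kernel starts, which is a common but unofficial detection trick; the function name `running_in_colab` is hypothetical.

```python
import sys


def running_in_colab(modules=None):
    """Return True when the `google.colab` package is already loaded."""
    # Assumption: Colab imports `google.colab` at kernel start, so finding it
    # in sys.modules indicates that the notebook runs on Colab.
    modules = sys.modules if modules is None else modules
    return "google.colab" in modules


if running_in_colab():
    print("Colab detected: run the installation commands in this section.")
else:
    print("Local environment: install Selenium and a Chrome driver yourself.")
```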

In [None]:
# install chromium, its driver, and selenium
!apt update
!apt install chromium-chromedriver
!pip install selenium
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import re
import sys
# make the chromedriver location used by this Colab setup available on the path
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
# set options to be headless, ..
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

[33m0% [Working][0m            Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
[33m0% [Waiting for headers] [Connecting to security.ubuntu.com (185.125.190.36)] [Waiting for headers] [0m                                                                                                    Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]

# 3. Core extraction code - *Run only once*

In [None]:
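class MapScraper:
    def __init__(self, url, file_name):
        self.name = file_name      # name of the target Google Sheet
        self.driver = driver       # global Selenium driver created in point 5
        self.unique = []           # names already scraped, to avoid duplicates
        self.url = url

    def search_location(self):
        self.driver.get(self.url)
        try:
            # dismiss the initial dialog (presumably the consent prompt), if present
            search_accept = self.driver.find_element(By.XPATH,'//*[@id="yDmH0d"]/c-wiz/div/div/div/div[2]/div[1]/div[3]/div[1]/div[1]/form[2]/div/div/button/span')
            search_accept.click()
        except Exception:
            pass
        time.sleep(2)

        # call the method on this instance instead of the global `Utils` object
        self.get_business_info()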

    def get_business_info(self):
        try:
            time.sleep(2)
            start_time = time.time()
            source = self.driver.page_source
            soup = BeautifulSoup(source, 'html.parser')
            # use `self` rather than the global `Utils` instance
            name = self.parse_name(soup)

            if name not in self.unique and name:
                self.unique.append(name)
                url = self.link(self.url)
                rating, reviews = self.parse_rating_and_reviews(soup)
                address = self.parse_address(soup)
                category = self.parse_category(soup)
                website, phone = self.parse_contact_info(soup)
                detail = self.get_detail(soup)
                hours = self.parse_time(soup)

                record = [
                    name,
                    address,
                    phone,
                    website,
                    rating,
                    reviews,
                    hours,
                    category,
                    detail
                ]
                time.sleep(2)
                print(f'Scraped informations, excel first page: {record}')
                info = self.get_information()
                print(f'Information scrapped: {info}')
                time.sleep(2)
                # get_reviews prints its own summary of how many reviews were scraped
                self.get_reviews()
                time.sleep(2)
                elapsed = time.time() - start_time
                print(f'Scraping time: {round(elapsed)} seconds or {round(elapsed/60, 2)} minutes')

                self.driver.quit()

        except Exception as err:
            print(f'get_business_info: {err}')

    # get information page
    def get_information(self):
        try:
            # open the information tab using the XPath set in point 4
            click_info = self.driver.find_element(By.XPATH,info_xpath)
            click_info.click()
            time.sleep(2)
            source = self.driver.page_source
            soup = BeautifulSoup(source, 'html.parser')
            ws = gc.open(self.name).get_worksheet(1)   # second sheet of the workbook
            info = soup.find("div",{"class":"m6QErb DxyBCb kA9KIf dS8AEf"})
            c = 2   # first output row
            for i in info.find_all("div"):
                try:
                    # raises AttributeError when the block has no <h2> heading
                    cat = i.find('h2').text
                    ws.update('B'+str(c),cat)
                    for span in i.find_all('span'):
                        ws.update('B'+str(c),cat)
                        ws.update('C'+str(c),span.text)
                        c += 1
                except Exception:
                    continue

            try:
                # return to the overview page
                back_click = self.driver.find_element(By.XPATH,'//*[@id="omnibox-singlebox"]/div/div[1]/button')
                back_click.click()
            except Exception:
                pass
        except Exception as err:
            print(f'get_information: {err}')

        return "done"

    def parse_name(self, content):
        ws = gc.open(self.name).sheet1
        activity_name = "Not found"
        try:
            activity_name = content.find('h1', {"class": "DUwDvf"}).text
        except Exception as err:
            print("parse_name", err)
        ws.update('C2',activity_name)

        return activity_name

    def parse_address(self, content):
        ws = gc.open(self.name).sheet1
        address = "Not found"
        try:
            address_block = content.find_all('div', {"class": "RcCsl fVHpi w4vB1d NOE9ve M0S7ae AG25L"})
            address = address_block[0].text

        except Exception as err:
            print("parse_address", err)
        ws.update('C5',address)
        return address

    def parse_rating_and_reviews(self,content):
        rating = None
        reviews = None
        try:
            rating_area = content.find('div', {"class": "F7nice"}).text.split("(")
            if len(rating_area) > 1:
                rating = rating_area[0].strip()
                reviews = rating_area[1].split(")")[0].strip()

        except Exception as err:
            print("parse_rating_and_reviews", err)

        return rating, reviews

    def parse_contact_info(self, content):
        ws = gc.open(self.name).sheet1
        website = "Not found"
        phone = "Not found"
        try:
            pattern = re.compile(r'[a-zA-Z0-9]+\.+[a-zA-Z]+')
            address_block = content.find_all('div', {"class": "RcCsl fVHpi w4vB1d NOE9ve M0S7ae AG25L"})

            for container in address_block:

                if pattern.search(container.text):
                    # guard against blocks that look like a domain but hold no link
                    link_tag = container.find("a")
                    if link_tag is not None:
                        website = link_tag.get('href')

                # a phone entry starts with "+" followed by the country code
                if re.match(r'\+[1-9]', container.text):
                    phone = container.text

        except Exception as err:
            print("parse_contact_info", err)

        ws.update('C6',phone)
        ws.update('C7',website)
        return website, phone

    def parse_category(self, content):
        ws = gc.open(self.name).sheet1
        category = "Not found"
        try:
            category = content.find('button', {"class": "DkEaL"}).text

        except Exception as err:
            print("parse_category", err)
        ws.update('C4',category)

        return category

    def parse_time(self, content):
        # The opening-hours layout differs between pages, so several XPaths are tried in turn;
        # add a new XPath to this list if none of them matches.
        xpath = [ '//*[@id="QA0Szd"]/div/div/div[1]/div[2]/div/div[1]/div/div/div[9]/div[4]/div[1]/div[2]/div/span[2]',
            '/html/body/div[3]/div[8]/div[9]/div/div/div[1]/div[2]/div/div[1]/div/div/div[7]/div[4]/div[1]/div[2]/div/span[2]',
        ]

        for x in xpath:
            try:
                open_hours_click = self.driver.find_element(By.XPATH, x)
                open_hours_click.click()
                break
            except Exception:
                continue

        ws = gc.open(self.name).sheet1
        hours = "Not found"   # written to the sheet when no hours table exists
        days = []
        h = []
        timing = []
        try:
            table = content.find("table",{"class":"eK4R0e fontBodyMedium"})
            for i in table.find_all("div"):
                if i.text != "":
                    days.append(i.text + ": ")
            for i in table.find_all("ul"):
                if i.text != "":
                    h.append(i.text + " ")
            # interleave day names and opening hours; zip stops at the shorter list
            for day, slot in zip(days, h):
                timing.append(day)
                timing.append(slot)
            hours = "".join(timing)

        except Exception as err:
            print("parse_time", err)
        ws.update('C8',hours)

        return hours

    def get_detail(self, content):
        ws = gc.open(self.name).sheet1
        detail = "Not found"
        try:
            detail = content.find('div', {"class": "PYvSYb"}).text

        except Exception as err:
            print("get_detail", err)
        ws.update('C9',detail)

        return detail

    def link(self,url):
        ws = gc.open(self.name).sheet1
        ws.update('C3',url)

        return url

    # get reviews
    def get_reviews(self):
        # open the reviews tab using the XPath set in point 4
        click_info = self.driver.find_element(By.XPATH,reviews_xpath)
        click_info.click()
        time.sleep(3)
        len_review = 0   # stays 0 if the extraction fails
        try:
            wait = WebDriverWait(self.driver, 20)

            # Scrollable panel that contains the reviews
            body = self.driver.find_element(By.XPATH, "//div[contains(@class, 'm6QErb') and contains(@class, 'DxyBCb') and contains(@class, 'kA9KIf') and contains(@class, 'dS8AEf')]")

            num_reviews = len(self.driver.find_elements(By.CLASS_NAME, 'wiI7pd'))

            # Scroll down until no new reviews appear
            while True:
                body.send_keys(Keys.END)
                time.sleep(3)  # Adjust the delay based on your internet speed and page loading time
                new_num_reviews = len(self.driver.find_elements(By.CLASS_NAME, 'wiI7pd'))
                if new_num_reviews == num_reviews:
                    # Scroll back to the top once everything is loaded
                    body.send_keys(Keys.HOME)
                    time.sleep(3)
                    break
                num_reviews = new_num_reviews

            # Wait for the reviews to load completely
            wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'wiI7pd')))

            # Expand long reviews. By.CLASS_NAME accepts a single class only,
            # so the two classes are combined in a CSS selector.
            complete_review = self.driver.find_elements(By.CSS_SELECTOR,'.w8nwRe.kyuRq')
            for cr in complete_review:
                try:
                    cr.click()
                except Exception:
                    continue

            ws = gc.open(self.name).get_worksheet(2)

            # Use self.driver (not the global driver) and keep at most reviews_limit items
            review_name = self.driver.find_elements(By.CLASS_NAME, 'd4r55')[:reviews_limit]
            review_time = self.driver.find_elements(By.CLASS_NAME, 'rsqaWe')[:reviews_limit]
            review_stars = self.driver.find_elements(By.CLASS_NAME, 'kvMYJc')[:reviews_limit]
            review_text = self.driver.find_elements(By.CLASS_NAME, 'wiI7pd')[:reviews_limit]
            len_review = len(review_text)

            reviews1 = [element.text for element in review_name]
            reviews2 = [element.text for element in review_time]
            reviews3 = [element.get_attribute("aria-label") for element in review_stars]
            reviews4 = [element.text for element in review_text]
            cell = 2

            # zip stops at the shortest list, so a missing field cannot raise an IndexError
            for i, (stars, name, when, text) in enumerate(zip(reviews3, reviews1, reviews2, reviews4)):
                ws.update("B"+str(cell),i)
                ws.update("C"+str(cell),stars)
                ws.update("D"+str(cell),name)
                ws.update("E"+str(cell),when)
                ws.update("F"+str(cell),text)
                cell += 1

        except Exception as err:
            print('get_reviews',err)

        print(f"Number of reviews scraped: {len_review}")
        return len_review
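
To illustrate how `parse_rating_and_reviews` separates the rating from the review count, here is a minimal standalone sketch of the same splitting logic. The function name `split_rating_and_reviews` is hypothetical, and the sample text `'3.9(104)'` is an assumption about how the rating block reads, built from the values reported in the output of point 5.

```python
def split_rating_and_reviews(text):
    """Split text such as '3.9(104)' into ('3.9', '104')."""
    rating = None
    reviews = None
    parts = text.split("(")
    # Only text containing an opening parenthesis carries a review count
    if len(parts) > 1:
        rating = parts[0].strip()
        reviews = parts[1].split(")")[0].strip()
    return rating, reviews


print(split_rating_and_reviews("3.9(104)"))    # ('3.9', '104')
print(split_rating_and_reviews("No reviews"))  # (None, None)
```

Both values are returned as strings, so convert them with `float` and `int` if you need to do arithmetic on them.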


# 4. Start extraction - *Insert Google sheet name, URL and xpath*

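Extract information to a Google Sheet:

1.   google_sheet_name = name of the Google Sheet file (always inside '')
2.   url = URL of the page to scrape, copied and pasted (always inside '')
3.   reviews_limit = maximum number of reviews to scrape
4.   reviews_xpath and info_xpath = XPath of the Reviews and Information buttons (always inside '')

**This is the only cell that you have to modify, with the Google Sheet name and URL:**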

In [None]:
# Google Sheet name, inside quotes
google_sheet_name = 'Google Maps Data V01'
# URL you want to scrape
url = 'https://maps.app.goo.gl/3q4QM3QW7qJXoZyc9'
# maximum number of reviews you want to scrape
reviews_limit = 100
# XPath of the Reviews button; see the instructions or the video (GM_Scrape_RetriveXpath)
reviews_xpath = '//*[@id="QA0Szd"]/div/div/div[1]/div[2]/div/div[1]/div/div/div[3]/div/div/button[2]'
# XPath of the Information button; see the instructions or the video (GM_Scrape_RetriveXpath)
info_xpath = '//*[@id="QA0Szd"]/div/div/div[1]/div[2]/div/div[1]/div/div/div[3]/div/div/button[3]'
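
Before starting a long extraction it can help to sanity-check these inputs. The following is a minimal sketch, not part of the original script: the helper `check_inputs` is hypothetical, and the rule that an XPath starts with `/` is an assumption based on the paths that the browser's "Copy XPath" command produces.

```python
from urllib.parse import urlparse


def check_inputs(sheet_name, url, reviews_limit, xpaths):
    """Return a list of problems found in the inputs (empty if none)."""
    problems = []
    if not sheet_name.strip():
        problems.append("google_sheet_name is empty")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("url is not a complete http(s) address")
    if not isinstance(reviews_limit, int) or reviews_limit < 1:
        problems.append("reviews_limit must be a positive integer")
    # Assumption: copied XPaths begin with "/" (for example "//*[@id=...]")
    for name, value in xpaths.items():
        if not value.startswith("/"):
            problems.append(f"{name} does not look like an XPath")
    return problems


problems = check_inputs(
    'Google Maps Data V01',
    'https://maps.app.goo.gl/3q4QM3QW7qJXoZyc9',
    100,
    {'reviews_xpath': '//*[@id="QA0Szd"]/div/div/div[1]/div[2]/div/div[1]/div/div/div[3]/div/div/button[2]'},
)
print(problems)  # []
```

The check only looks at the shape of the values; it cannot tell whether the sheet exists or whether the XPath still matches the page.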

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 5. Start extraction - *Run to start extraction*

In [None]:
driver = webdriver.Chrome(options=options)

The script also reports any errors and any information it could not find on the page.

In [None]:
Utils = MapScraper(url, google_sheet_name)
location = Utils.search_location()

get_detail 'NoneType' object has no attribute 'text'
Scraped informations, excel first page: ['FerreroLegno S.p.A.', 'SS28, 26, 12060 Magliano Alpi CN, Italy', '+39 0174 622411', 'http://www.ferrerolegno.com/', '3.9', '104', 'Tuesday: 8\u202fAM–5:30\u202fPM Wednesday: 8\u202fAM–12\u202fPM1:30–5:30\u202fPM Thursday: 8\u202fAM–12\u202fPM1:30–5:30\u202fPM Friday: 8\u202fAM–12\u202fPM1:30–5:30\u202fPM Saturday: Closed Sunday: Closed Monday: 8\u202fAM–12\u202fPM1:30–5:30\u202fPM ', 'Door supplier', 'Not found']
Information scrapped: done
get_reviews {'code': 429, 'message': "Quota exceeded for quota metric 'Write requests' and limit 'Write requests per minute per user' of service 'sheets.googleapis.com' for consumer 'project_number:522309567947'.", 'status': 'RESOURCE_EXHAUSTED', 'details': [{'@type': 'type.googleapis.com/google.rpc.ErrorInfo', 'reason': 'RATE_LIMIT_EXCEEDED', 'domain': 'googleapis.com', 'metadata': {'quota_limit': 'WriteRequestsPerMinutePerUser', 'service': 'sheets.googlea
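
The 429 response in this output reports that the Sheets API limit on write requests per minute was exceeded. `get_reviews` issues one `ws.update` call per cell, five per review, so 100 reviews lead to roughly 500 write requests. One possible way to reduce this, sketched here under stated assumptions, is to collect all rows first and send them in a single call. The helpers `build_review_rows` and `write_reviews`, the stand-in class `FakeSheet`, and the sample reviews are hypothetical. The sketch also assumes that `Worksheet.update` in gspread accepts the keyword arguments `range_name` and `values`, which avoids depending on their positional order.

```python
def build_review_rows(names, times, stars, texts, limit):
    """Return one row per review: [index, stars, name, time, text]."""
    rows = []
    # zip stops at the shortest list, so incomplete reviews are skipped
    for i, (star, name, when, text) in enumerate(zip(stars, names, times, texts)):
        if i >= limit:
            break
        rows.append([i, star, name, when, text])
    return rows


def write_reviews(worksheet, rows, first_row=2):
    """Write all rows with one request instead of five requests per review."""
    if not rows:
        return None
    last_row = first_row + len(rows) - 1
    cell_range = f"B{first_row}:F{last_row}"
    # Assumption: the worksheet's update method accepts these keyword arguments
    worksheet.update(range_name=cell_range, values=rows)
    return cell_range


class FakeSheet:
    """Stand-in for a worksheet that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def update(self, range_name, values):
        self.calls.append((range_name, values))


sheet = FakeSheet()
rows = build_review_rows(
    ["Anna", "Luca"], ["a week ago", "a month ago"],
    ["5 stars", "4 stars"], ["Great", "Good"], limit=100,
)
print(write_reviews(sheet, rows))  # B2:F3
print(len(sheet.calls))            # 1
```

With a real sheet you would pass the worksheet object returned by `gc.open(google_sheet_name).get_worksheet(2)` instead of `FakeSheet()`. The columns B to F match the ones that `get_reviews` fills cell by cell.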