# Scraping Dynamic Pages with Selenium

## BeautifulSoup alone is not enough to scrape content that JavaScript generates on a dynamic page, because it only parses the HTML it is given and does not wait for elements loaded by JS. Selenium solves this problem by driving a real browser. There are a few things to take into account, though; the most common one is page loading time (timeout errors).
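To make the limitation concrete, here is a minimal sketch that uses only the standard library's `html.parser` as a stand-in for any static parser. The HTML string is made up for illustration, and the class name `contentpagetitle` is borrowed from the selectors used later in this notebook. The link is created by a script, so a parser that does not execute JavaScript never sees it.

```python
from html.parser import HTMLParser

# Hypothetical page: the article link is created by JavaScript at load time,
# so it is absent from the raw HTML that a static parser receives.
RAW_HTML = """
<html><body>
  <div id="articles"></div>
  <script>
    var link = document.createElement('a');
    link.className = 'contentpagetitle';
    link.textContent = 'Some article';
    document.getElementById('articles').appendChild(link);
  </script>
</body></html>
"""


class TitleLinkCounter(HTMLParser):
    """Counts <a class="contentpagetitle"> tags present in the static markup."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'a' and 'contentpagetitle' in classes:
            self.count += 1


counter = TitleLinkCounter()
counter.feed(RAW_HTML)
print(counter.count)  # the script never ran, so no title links are found
```

A browser driven by Selenium would run the script first, which is why `driver.page_source` can contain elements that the raw response does not.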

### First we need to import the essential libraries

In [1]:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import csv
import pandas as pd
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

### Before scraping we need to make sure that no errors will occur. The first error I encountered was "ERROR:ssl_client_socket_openssl.cc(1158)] handshake failed with ChromeDriver Chrome browser and Selenium". I found the solution, with an explanation, [here](https://stackoverflow.com/questions/37883759/errorssl-client-socket-openssl-cc1158-handshake-failed-with-chromedriver-chr)

### "You get this error when the browser asks you to accept the certificate from the website. You can set to ignore these errors by default in order avoid these errors."
### You can find the solution for the Chrome browser below

In [None]:
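from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')
# Raw string so the backslash in the Windows path is kept literally
driver_path = r'C:\chromedriver'
# Selenium 4 syntax; Selenium 3 used webdriver.Chrome(driver_path, chrome_options=options)
driver = webdriver.Chrome(service=Service(driver_path), options=options)
driver.maximize_window()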
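A side note on the driver path: in an ordinary Python string literal a backslash can start an escape sequence, so Windows paths are safer written as raw strings. The paths in this small sketch are hypothetical and only illustrate the difference.

```python
# In a normal string literal, "\t" is a tab and "\n" is a newline,
# so a Windows path can silently change meaning.
risky_path = 'C:\tools\new_driver'   # contains a tab and a newline
safe_path = r'C:\tools\new_driver'   # raw string keeps every backslash

print(len(risky_path), len(safe_path))
print('\t' in risky_path, '\t' in safe_path)
```

The two strings differ in length because each escape sequence collapses two characters into one.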

### The example below uses a website that lists its articles on numbered pages whose offsets follow the pattern 0, 8, 16, 24, etc. The link f"https://www.insertherelinktowebsite.com?start={page}" therefore contains the variable page, which changes every time the loop starts over. This is only an example; the URL scheme differs from site to site, so keep that in mind
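As a small illustration of that pattern, the sketch below builds the first few listing URLs. The names `BASE_URL`, `PAGE_STEP` and `page_url` are hypothetical helpers introduced here; the URL itself is the placeholder from the text.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.insertherelinktowebsite.com"  # placeholder from the text
PAGE_STEP = 8  # offset between pages, per the 0, 8, 16, 24 pattern


def page_url(offset):
    """Build the listing URL for a given article offset."""
    return f"{BASE_URL}?{urlencode({'start': offset})}"


# First four listing pages
urls = [page_url(n * PAGE_STEP) for n in range(4)]
for url in urls:
    print(url)
```

Keeping the step in one constant makes it easy to adapt the loop to a site that paginates differently.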

### Before we start our "scraping loop" we create/open the csv file in which we will save the data after every pass of the loop

In [None]:
# We will need the variable page later, so that our scraping loop can access the next page of the website
page = 0
with open('scraped_data.csv', 'w', encoding='utf-8', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['Title', 'Date', 'Comment'])
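The following sketch shows the same csv pattern end to end on made-up rows, including the difference between the `'w'` and `'a'` modes. It writes to a temporary directory so it can be run safely; the row values are hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for scraped data
rows = [
    ['First title', '2019-05-01', '3 Comments'],
    ['Second title', '2019-04-28', '0 Comments'],
]

path = os.path.join(tempfile.mkdtemp(), 'scraped_data.csv')

# 'w' creates (or truncates) the file; newline='' avoids blank lines on Windows
with open(path, 'w', encoding='utf-8', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['Title', 'Date', 'Comment'])
    writer.writerows(rows)

# 'a' appends to the existing file without rewriting the header
with open(path, 'a', encoding='utf-8', newline='') as outfile:
    csv.writer(outfile).writerow(['Third title', '2019-04-20', '1 Comment'])

# Read the file back to confirm what was saved
with open(path, encoding='utf-8', newline='') as infile:
    saved = list(csv.reader(infile))
print(len(saved))
```

Opening the file once and keeping the loop inside the `with` block, as the notebook does, means every page is written through the same writer.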

### I decided to use a while loop, since I don't know the exact range of pages that I want to access. My goal is to scrape all articles from 2019, so I use while True and break out of it when the program finds 2018 in a date
### To handle errors we use try, under which we put our code, starting with a dictionary whose keys will collect the information from each page (note that the while loop has to be nested inside the with open... block)

In [None]:
    while True:
        try:
            articles_dict = {
                'Title': [],
                'Comment': [],
                'Date': [],
            }
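The stopping rule described above (leave the loop once the first date on a page is from 2018) can be sketched in isolation. The date strings and the helper names `first_year` and `should_stop` are assumptions made for this example; the real site's date format may differ.

```python
import re


def first_year(text):
    """Return the first four-digit year found in text, or None."""
    match = re.search(r'\b(?:19|20)\d{2}\b', text)
    return int(match.group()) if match else None


def should_stop(date_texts, stop_year=2018):
    """True when the first date on the page is from stop_year or earlier."""
    if not date_texts:
        return False  # nothing to check on this page
    year = first_year(date_texts[0])
    return year is not None and year <= stop_year


print(should_stop(['12 January 2019', '10 January 2019']))
print(should_stop(['30 December 2018']))
```

Comparing the parsed year rather than searching for the substring `'2018'` also stops the loop if a page happens to start with an even older article.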

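### Next we define the home_url that we are going to scrape
### We also make the program open home_url and set a 20-second implicit wait with driver.implicitly_wait(20). Note that an implicit wait only applies to element lookups made through the driver (such as find_element); it does not delay driver.page_source, so a slowly rendering page may call for an explicit wait (more about timeout errors below)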

In [None]:
            home_url = f"https://www.insertherelinktowebsite.com?start={page}"
            driver.get(home_url)
            driver.implicitly_wait(20)
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            page += 8
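One possible way to wait for JS-rendered content is an explicit wait, built from the `WebDriverWait`, `By` and `expected_conditions` names imported in the first cell. The sketch below is an assumption about how it could be wired in, not the approach this notebook uses: the function name `load_page_source` is hypothetical, and the class name `contentpagetitle` is taken from the selectors used later on.

```python
def load_page_source(driver, url, timeout=20):
    """Open url, wait for at least one title link, return the rendered HTML.

    Raises selenium.common.exceptions.TimeoutException if no element with
    class "contentpagetitle" appears within `timeout` seconds.
    """
    # Imported here so this sketch can be loaded without Selenium installed
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    # Block until the element is present in the DOM, or time out
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'contentpagetitle'))
    )
    return driver.page_source
```

Inside the loop, `soup = BeautifulSoup(load_page_source(driver, home_url), 'html.parser')` could then stand in for the separate `get`, `implicitly_wait` and `page_source` calls, and a `TimeoutException` raised by the wait is exactly what an `except TimeoutException` branch catches.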

### The next step is to make our program write the comments, titles and dates to the file that we opened in the previous steps

In [None]:
# Collect the comment counts, titles and dates found on the page
comments = soup.find_all('span', class_="disqus-comment-count")
for comment in comments:
    articles_dict['Comment'].append(comment.text)

titles = soup.find_all('a', class_='contentpagetitle')
for title in titles:
    articles_dict['Title'].append(title.text.strip('\n\t'))

dates = soup.find_all('span', class_='createdate')
for date in dates:
    articles_dict['Date'].append(date.text)

# Writing to csv file; zip pairs the lists up and stops at the shortest one,
# so a missing value cannot raise an IndexError
for row in zip(articles_dict['Title'],
               articles_dict['Date'],
               articles_dict['Comment']):
    writer.writerow(row)
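The row-writing step assumes that the three lists line up one to one. If a page can be missing a value, one option is to pad the shorter lists, as in this sketch; the data is made up, and padding only keeps rows from being lost, it cannot tell which article lacked the value.

```python
from itertools import zip_longest

# Hypothetical scraped values; one comment counter is missing
titles = ['First title', 'Second title', 'Third title']
dates = ['2019-05-01', '2019-04-28', '2019-04-20']
comments = ['3 Comments', '0 Comments']

# zip_longest pads the shorter lists with fillvalue instead of failing
rows = [list(row) for row in zip_longest(titles, dates, comments, fillvalue='')]
for row in rows:
    print(row)
```

A more robust design would locate each article's container first and read the title, date and comment count from inside it, so that the values stay attached to the right article.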

### The last step is to tell our program what to do in case of exceptions
### Here we also check whether 2018 appears in the date of the first article on the page; if it does, we break the loop and finish the program

In [None]:
# Handling exceptions
except TimeoutException:
    # If the loading took too long, print message, refresh and try again
    print("Loading took too much time!")
    driver.refresh()
# The requests handlers below only fire if requests calls are added to the loop;
# Selenium itself does not raise them
except requests.ConnectionError as e:
    print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
    print(str(e))
except requests.Timeout as e:
    print("OOPS!! Timeout Error")
    print(str(e))
except requests.RequestException as e:
    print("OOPS!! General Error")
    print(str(e))
except KeyboardInterrupt:
    print("Someone closed the program")
    break

# dates[0] is a Tag, so the year has to be looked up in its text
if dates and '2018' in dates[0].text:
    break
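Retrying forever after a timeout can hang the scraper on a page that never loads. A bounded retry is one way around that; the sketch below uses a stand-in exception and a fake fetch function so it runs without a browser. `PageTimeout`, `fetch_with_retries`, `flaky_fetch` and the limit of 3 are all assumptions made for illustration.

```python
MAX_RETRIES = 3  # assumed limit; tune to the site


class PageTimeout(Exception):
    """Stand-in for Selenium's TimeoutException so the sketch runs anywhere."""


def fetch_with_retries(fetch, url, max_retries=MAX_RETRIES):
    """Call fetch(url) until it succeeds or max_retries attempts have failed."""
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url)
        except PageTimeout:
            print(f"Attempt {attempt} timed out for {url}")
    return None  # the caller decides whether to skip the page or stop


# Demo with a fake fetcher that fails twice, then succeeds
calls = {'count': 0}


def flaky_fetch(url):
    calls['count'] += 1
    if calls['count'] < 3:
        raise PageTimeout(url)
    return '<html>ok</html>'


result = fetch_with_retries(flaky_fetch, 'https://www.insertherelinktowebsite.com?start=0')
print(result)
```

In the real loop the `fetch` argument would be a function that calls `driver.get` and returns `driver.page_source`, and the caught exception would be `TimeoutException`.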

### The complete code looks as follows:

In [None]:
import csv

import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')
# Raw string so the backslash in the Windows path is kept literally
driver_path = r'C:\chromedriver'
# Selenium 4 syntax; Selenium 3 used webdriver.Chrome(driver_path, chrome_options=options)
driver = webdriver.Chrome(service=Service(driver_path), options=options)
driver.maximize_window()


page = 0

# We can change 'w' to 'a' if we want to append to already existing data
with open('scraped_data.csv', 'w', encoding='utf-8', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['Title', 'Date', 'Comment'])
    while True:
        dates = []  # reset each pass so the year check is safe after an error
        try:
            articles_dict = {
                'Title': [],
                'Comment': [],
                'Date': [],
            }

            home_url = f"https://www.insertherelinktowebsite.com?start={page}"
            driver.get(home_url)
            driver.implicitly_wait(20)
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            page += 8

            # Collect the comment counts, titles and dates found on the page
            comments = soup.find_all('span', class_="disqus-comment-count")
            for comment in comments:
                articles_dict['Comment'].append(comment.text)

            titles = soup.find_all('a', class_='contentpagetitle')
            for title in titles:
                articles_dict['Title'].append(title.text.strip('\n\t'))

            dates = soup.find_all('span', class_='createdate')
            for date in dates:
                articles_dict['Date'].append(date.text)

            # Writing to csv file; zip stops at the shortest list,
            # so a missing value cannot raise an IndexError
            for row in zip(articles_dict['Title'],
                           articles_dict['Date'],
                           articles_dict['Comment']):
                writer.writerow(row)

        # Handling exceptions
        except TimeoutException:
            # If the loading took too long, print message, refresh and try again
            print("Loading took too much time!")
            driver.refresh()
        # The requests handlers below only fire if requests calls are added
        # to the loop; Selenium itself does not raise them
        except requests.ConnectionError as e:
            print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
            print(str(e))
        except requests.Timeout as e:
            print("OOPS!! Timeout Error")
            print(str(e))
        except requests.RequestException as e:
            print("OOPS!! General Error")
            print(str(e))
        except KeyboardInterrupt:
            print("Someone closed the program")
            break

        # dates[0] is a Tag, so the year has to be looked up in its text
        if dates and '2018' in dates[0].text:
            break
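The complete script never closes the browser. A common pattern is to wrap the scraping work in try/finally and call `driver.quit()`, the WebDriver method that ends the session. The sketch below demonstrates the pattern with a hypothetical `FakeDriver` class so it runs without a browser; only the `quit()` call mirrors the real API.

```python
class FakeDriver:
    """Hypothetical stand-in exposing the same quit() method as a WebDriver."""

    def __init__(self):
        self.closed = False

    def get(self, url):
        # Simulate a failure in the middle of a scraping run
        raise RuntimeError(f"could not load {url}")

    def quit(self):
        self.closed = True


driver = FakeDriver()
try:
    driver.get("https://www.insertherelinktowebsite.com?start=0")
except RuntimeError as error:
    print(f"Scrape aborted: {error}")
finally:
    # Runs after success and after errors alike, so the browser never lingers
    driver.quit()

print(driver.closed)
```

With a real driver, the `with open(...)` block and the `while` loop would sit inside the `try`, leaving `driver.quit()` in the `finally` clause.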