

# Web Scraper 2
***
This web scraper is designed for web pages that mix static content with dynamic content generated by JavaScript; more specifically, pages where part of the content (the user comments) is managed by disqus.com.

As an improvement over version 1, this version loops through a list of URLs, so each page no longer has to be scraped manually.

Prerequisites:
- Install Selenium (pip install selenium)
- Install BeautifulSoup (pip install beautifulsoup4)
- Download and install the Chrome web browser, plus a ChromeDriver executable (the code below expects it at ./chromedriver)
***

### Import the required libraries

In [1]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import json
import csv
import os.path
import time

from web_scraper_functions import *  

### Set up the web driver

In [2]:
# Selenium setup: path to the ChromeDriver executable and the browser options
driver_path = './chromedriver'
s = Service(driver_path)
options = Options()
options.headless = False
browser = webdriver.Chrome(service=s, options=options)
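
Because the Disqus comments are injected by JavaScript after the initial page load, the scraping loop further down pauses for a fixed interval (`time.sleep(10)`) before reading the page source. As one possible alternative, the sketch below polls a condition until it holds or a timeout expires. The helper name `wait_until` and its parameters are hypothetical and not part of this notebook or of Selenium; Selenium's own `WebDriverWait` class covers the same need.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll `predicate` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Hypothetical usage with the `browser` object created above
# (assumes the Disqus iframe sits inside the div with id "disqus_thread"):
#
# from selenium.webdriver.common.by import By
# iframes = wait_until(
#     lambda: browser.find_elements(By.CSS_SELECTOR, '#disqus_thread iframe'),
#     timeout=10,
# )
```

Polling returns as soon as the iframe appears, so fast pages do not pay the full ten seconds, while slow pages still get the whole timeout.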

### Set up the URLs of the web pages

In [3]:
# URLs of the web pages for scraping
urls = [
    'https://teslamag.de/news/supercharger-limburg-tesla-karte-geoffnet-fremde-elektroautos-49280', 
    'https://teslamag.de/news/test-niederlande-tesla-oeffnet-erste-zehn-supercharger-fremde-elektroautos-42542', 
    'https://teslamag.de/news/fremde-elektroautos-supercharger-vw-id3-problemlos-audi-fahrer-ueberzeugt-42649',
    'https://teslamag.de/news/kreativ-supercharger-tesla-andere-elektroauto-fahrer-loesungen-teilen-49209'
]

### Scrape!

In [4]:
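csv_file_path = './comments.csv'
# Write the header row only when the CSV file does not exist yet
new_file = not os.path.exists(csv_file_path)

count = 0
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open(csv_file_path, 'a', encoding='UTF8', newline='') as f: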
    writer = csv.writer(f)
    if new_file:
        header = ['article', 'comment', 'created', 'likes', 'dislikes']
        writer.writerow(header)
    
    for url in urls:
        count += 1
        print(str(count) + '.', url)
        #time.sleep(5)
        # Open the url in Chrome browser
        browser.get(url)
        time.sleep(10)
        # Get the page in html
        page_html = browser.page_source
        # Parse the html source
        soup = BeautifulSoup(page_html, "html.parser")
        # Get the Disqus thread (comments thread)
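        disqus = soup.find('div', id='disqus_thread')
        iframe = disqus.find('iframe') if disqus else None
        if iframe is None:
            # The comments iframe has not loaded (or the page has none): skip this URL
            print('   No Disqus iframe found, skipping')
            continue
        post_js_url = iframe.attrs['src']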
        # Load the Disqus iframe URL directly and parse its HTML
        browser.get(post_js_url)
        post_html = browser.page_source
        soup2 = BeautifulSoup(post_html, "html.parser")
        # Read the thread data (JSON) from the script tag and convert it into a dictionary
        thread_data = soup2.find('script', id='disqus-threadData')
        thread_json = json.loads(thread_data.string)
        title = soup.title.text
        posts = thread_json['response']['posts']
        for post in posts:
            writer.writerow([title, post['raw_message'], post['createdAt'], post['likes'], post['dislikes']])
        print('')
        time.sleep(1)
print('Mission accomplished!\n')

1.  https://teslamag.de/news/supercharger-limburg-tesla-karte-geoffnet-fremde-elektroautos-49280

2.  https://teslamag.de/news/test-niederlande-tesla-oeffnet-erste-zehn-supercharger-fremde-elektroautos-42542

3.  https://teslamag.de/news/fremde-elektroautos-supercharger-vw-id3-problemlos-audi-fahrer-ueberzeugt-42649

4.  https://teslamag.de/news/kreativ-supercharger-tesla-andere-elektroauto-fahrer-loesungen-teilen-49209

Mission accomplished!
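
The loop above reads `raw_message`, `createdAt`, `likes`, and `dislikes` from every post, so a post that lacks one of these fields would raise a `KeyError`. The following is a minimal sketch, assuming the thread data keeps the `response` / `posts` layout used above, of a helper that builds the CSV rows with defaults for missing fields. The function name `thread_rows`, the default values, and the sample payload are all made up for illustration and are not real Disqus output.

```python
import csv
import io
import json


def thread_rows(thread_json, title):
    """Turn a Disqus thread-data JSON string into CSV rows, one per post."""
    data = json.loads(thread_json)
    posts = data.get('response', {}).get('posts', [])
    rows = []
    for post in posts:
        rows.append([
            title,
            post.get('raw_message', ''),   # assumed default: empty comment text
            post.get('createdAt', ''),     # assumed default: empty timestamp
            post.get('likes', 0),          # assumed default: zero likes
            post.get('dislikes', 0),       # assumed default: zero dislikes
        ])
    return rows


# Made-up sample payload; the second post deliberately omits 'dislikes'
sample = json.dumps({
    'response': {
        'posts': [
            {'raw_message': 'First comment', 'createdAt': '2022-01-01T10:00:00',
             'likes': 3, 'dislikes': 0},
            {'raw_message': 'Second comment', 'createdAt': '2022-01-02T11:30:00',
             'likes': 1},
        ]
    }
})

rows = thread_rows(sample, 'Example article')

# Write to an in-memory buffer instead of comments.csv to keep the sketch side-effect free
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['article', 'comment', 'created', 'likes', 'dislikes'])
writer.writerows(rows)
print(buffer.getvalue())
```

Keeping the row-building step in a plain function also makes it possible to check the CSV layout without opening a browser.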

