# Web Scraping using Selenium and Beautiful Soup
Selenium is a browser automation tool that is useful for more than testing. Because it drives a real browser, it can also scrape data that is rendered on the client side by JavaScript, which a plain HTTP request would miss.

Installation:

In [None]:
!pip install selenium

or

In [None]:
!conda install -y selenium

To use Selenium, a WebDriver for your browser of choice must also be installed. The Firefox WebDriver (GeckoDriver) can be installed by going to [this page](https://github.com/mozilla/geckodriver/releases/) and downloading the appropriate file for your operating system. Once the download has finished, extract the archive.

Now the extracted file can either be [added to the PATH](https://www.architectryan.com/2018/03/17/add-to-the-path-on-windows-10/) or copied into the working directory. I chose to copy it to my working directory because I’m not using it that often.
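
A quick way to confirm that the driver is somewhere Selenium can find it is to look for the executable on the PATH and in the working directory. The sketch below uses only the standard library. `find_driver` is a hypothetical helper name, not part of Selenium, and it assumes the executable is called `geckodriver` (`geckodriver.exe` on Windows).

```python
import os
import shutil


def find_driver(name="geckodriver", search_dir="."):
    """Return the path to the driver executable, or None if it cannot be found.

    Hypothetical helper: checks the PATH first, then `search_dir`.
    """
    # shutil.which searches the directories listed in the PATH variable
    on_path = shutil.which(name)
    if on_path:
        return on_path
    # Fall back to the given directory (Windows builds end in .exe)
    for candidate in (name, name + ".exe"):
        local = os.path.join(search_dir, candidate)
        if os.path.isfile(local):
            return os.path.abspath(local)
    return None


print(find_driver() or "geckodriver not found on PATH or in the working directory")
```

If this prints the fallback message, `webdriver.Firefox()` may fail to start until the driver is moved or the PATH is updated.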

Importing:

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
import re
import os

In [2]:
# website url
base_url = "https://programmingwithgilbert.firebaseapp.com/"
videos_url = "https://programmingwithgilbert.firebaseapp.com/videos/keras-tutorials"

Let's first try to load the page with urllib. This returns only the initial HTML document. The tutorial content is missing because it is inserted by JavaScript after the page has loaded, and urllib does not execute scripts.

In [None]:
import urllib.request

page = urllib.request.urlopen(videos_url)
soup = BeautifulSoup(page, 'html.parser')
soup
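
To see why a static request comes back without the tutorial content, it helps to look at what a client-side rendered page typically sends before any JavaScript runs. The sketch below parses two made-up HTML strings with the standard library's `html.parser` (so it runs without a browser or network access) and counts elements carrying the `code-toolbar` class, which is the class scraped later in this notebook. `SHELL_HTML`, `RENDERED_HTML`, `ClassCounter` and `count_class` are illustrative assumptions, not the real page source.

```python
from html.parser import HTMLParser

# Hypothetical page shell: the content container is empty until a script fills it
SHELL_HTML = """
<html>
  <body>
    <div id="app"></div>
    <script src="main.js"></script>
  </body>
</html>
"""

# Hypothetical markup after the browser has run the script
RENDERED_HTML = """
<html>
  <body>
    <div id="app">
      <div class="code-toolbar"><pre>import numpy as np</pre></div>
      <div class="code-toolbar"><pre>import pandas as pd</pre></div>
    </div>
  </body>
</html>
"""


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in `html` carry `class_name`."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


print(count_class(SHELL_HTML, "code-toolbar"))     # 0
print(count_class(RENDERED_HTML, "code-toolbar"))  # 2
```

A static fetch sees something like the first string, while a real browser session sees something like the second.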

Now we can create a Firefox session and navigate to the URL of the video section.

In [4]:
# Firefox session
driver = webdriver.Firefox()
driver.get(videos_url)
driver.implicitly_wait(100)
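
An implicit wait makes the driver keep retrying element lookups until the element appears or the timeout expires, instead of failing on the first attempt. The same polling idea sits behind `WebDriverWait`. The sketch below shows that idea without a browser; `wait_until` is a hypothetical helper written for illustration, not Selenium's implementation.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} seconds".format(timeout))
        time.sleep(poll_interval)


# Usage: a stand-in for "element not present yet" that succeeds on the third try
attempts = {"n": 0}


def element_ready():
    attempts["n"] += 1
    return "element" if attempts["n"] >= 3 else None


print(wait_until(element_ready, timeout=2.0, poll_interval=0.01))
```

A long timeout costs nothing when the element shows up quickly, but it also means a missing element is only reported after the full timeout has elapsed.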

To reach the individual tutorial pages, we need to find the buttons labelled "Watch", then visit each page in turn, scrape the data, save it, and go back to the main page.

In [5]:
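from selenium.webdriver.common.by import By  # locator strategies

# the find_elements_by_* helpers were removed in Selenium 4.3; use find_elements(By..., ...) instead
num_links = len(driver.find_elements(By.LINK_TEXT, 'Watch'))
code_blocks = []
for i in range(num_links):
    # look the buttons up again on every pass: references from an earlier page load go stale
    button = driver.find_elements(By.CLASS_NAME, "btn-primary")[i]
    button.click()
    # wait until the tutorial page has rendered, then parse it
    WebDriverWait(driver, 10).until(lambda d: d.find_element(By.ID, 'iframe_container'))
    tutorial_soup = BeautifulSoup(driver.page_source, 'html.parser')
    tutorial_code_soup = tutorial_soup.find_all('div', attrs={'class': 'code-toolbar'})
    # collect the text of every code container on the page
    tutorial_code = [block.get_text() for block in tutorial_code_soup]
    code_blocks.append(tutorial_code)
    # go back to initial page
    driver.execute_script("window.history.go(-1)")
code_blocks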

[['import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nfrom keras.datasets import mnist\nfrom keras.utils import to_categorical ',
  'def getData(): Copy',
  'def getData():\n    (X_train, y_train), (X_test, y_test) = mnist.load_data()\n    img_rows, img_cols = 28, 28     ',
  '    y_train = to_categorical(y_train, num_classes=10)\n    y_test = to_categorical(y_test, num_classes=10) Copy',
  '    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)\n    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1) ',
  '    plt.imshow(X_train[0][:,:,0])\n    plt.show() Copy',
  '    return X_train, y_train, X_test, y_test\ngetData() ',
  'import numpy as np import pandas as pd\nimport matplotlib.pyplot as plt\nfrom keras.datasets import mnist\nfrom keras.utils import to_categorical\ndef getData():\n    (X_train, y_train), (X_test, y_test) = mnist.load_data()\n    img_rows, img_cols = 28, 28\n    y_train = to_categorical(y_train, num_classes=10)\n  
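
Some entries in the output above end in ` Copy`, which looks like the label of a copy button inside the `code-toolbar` container rather than part of the tutorial code. A minimal sketch of a clean-up step, assuming that reading is correct, is shown below. `clean_block` and `COPY_LABEL` are hypothetical names introduced here.

```python
import re

# Matches a trailing "Copy" label preceded by whitespace, plus any trailing whitespace
COPY_LABEL = re.compile(r"\s+Copy\s*$")


def clean_block(text):
    """Remove a trailing 'Copy' label and trailing whitespace from one scraped block."""
    return COPY_LABEL.sub("", text).rstrip()


sample = [
    "def getData(): Copy",
    "import numpy as np\nimport pandas as pd ",
]
print([clean_block(block) for block in sample])
# ['def getData():', 'import numpy as np\nimport pandas as pd']
```

Note that this would also strip a genuine code line that happens to end in the word `Copy`, so it is worth checking the results before relying on it.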

In [7]:
code_blocks[1]

['import numpy as np import pandas as pd\nimport matplotlib.pyplot as plt\nfrom keras.datasets import mnist\nfrom keras.utils import to_categorical\ndef getData():\n    (X_train, y_train), (X_test, y_test) = mnist.load_data()\n    img_rows, img_cols = 28, 28\n    y_train = to_categorical(y_train, num_classes=10)\n    y_test = to_categorical(y_test, num_classes=10)\n    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)\n    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)\n    plt.imshow(X_train[0][:,:,0])\n    plt.show()\n    return X_train, y_train, X_test, y_test\n\ngetData() ',
 'X_train, y_train, X_test, y_test = getData()',
 'from keras.models import Sequential, model_from_json\nfrom keras.layers import Conv2D, MaxPool2D, Dropout, Dense, Flatten\nfrom keras.preprocessing.image import ImageDataGenerator\nfrom keras.optimizers import RMSprop\nfrom keras.callbacks import ReduceLROnPlateau\nimport os ',
 'def trainModel(X_train, y_train, X_test, y_test)

After scraping all the needed data, we can close the browser session and save the results to .txt files.

In [6]:
driver.quit()
for i, tutorial_code in enumerate(code_blocks):
    with open('code_blocks{}.txt'.format(i), 'w') as f:
        for code_block in tutorial_code:
            f.write(code_block+"\n")
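
A variant of the save step that may be easier to reuse: the sketch below writes each tutorial's blocks into a chosen directory with an explicit UTF-8 encoding, so the result does not depend on the platform default. `save_code_blocks` and its `out_dir` parameter are hypothetical names introduced here; the file naming follows the loop above, and the usage part writes placeholder data to a temporary directory instead of the scraped results.

```python
import tempfile
from pathlib import Path


def save_code_blocks(code_blocks, out_dir="."):
    """Write one text file per tutorial and return the list of paths written."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for i, tutorial_code in enumerate(code_blocks):
        target = out_path / "code_blocks{}.txt".format(i)
        # Same layout as the loop above: every block followed by a newline
        content = "".join(block + "\n" for block in tutorial_code)
        target.write_text(content, encoding="utf-8")
        written.append(target)
    return written


# Usage with placeholder data in a temporary directory
demo = [["import numpy as np", "print(np.pi)"], ["x = 1"]]
with tempfile.TemporaryDirectory() as tmp:
    paths = save_code_blocks(demo, out_dir=tmp)
    print([p.name for p in paths])
    print(paths[0].read_text(encoding="utf-8"))
```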