# Selenium Setup for Google Colab

## Introduction:
This guide provides instructions for setting up Selenium on Google Colab, allowing you to automate browser interactions within the Colab environment.

## Prerequisites:
- Access to a Google Colab environment
- Basic knowledge of running commands in Google Colab

## Steps:

### 1. Update System Packages:
```bash
sudo apt -y update
```


In [None]:
%%shell
# Set up Selenium in Google Colab (%%shell must be the first line of the cell).
# Skip this cell when working in Jupyter Notebook or another local Python environment.
sudo apt -y update
sudo apt install -y wget curl unzip
wget http://archive.ubuntu.com/ubuntu/pool/main/libu/libu2f-host/libu2f-udev_1.1.4-1_all.deb
dpkg -i libu2f-udev_1.1.4-1_all.deb
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
dpkg -i google-chrome-stable_current_amd64.deb
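# Note: this legacy endpoint appears to list ChromeDriver releases only up to version 114,
# so the driver may not match a current Chrome; chromedriver-autoinstaller (next section) avoids this.
CHROME_DRIVER_VERSION=$(curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE)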
wget -N https://chromedriver.storage.googleapis.com/$CHROME_DRIVER_VERSION/chromedriver_linux64.zip -P /tmp/
unzip -o /tmp/chromedriver_linux64.zip -d /tmp/
chmod +x /tmp/chromedriver
mv /tmp/chromedriver /usr/local/bin/chromedriver
pip install selenium
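
After the setup cell finishes, it can help to confirm that both executables are actually reachable before starting Selenium. The following is a minimal sketch, not part of the original notebook: `locate_binaries` is a hypothetical helper, and it assumes the Chrome package exposes a `google-chrome` command on `PATH`.

```python
import shutil


def locate_binaries(names=("google-chrome", "chromedriver")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Print one line per executable so a missing install is easy to spot
    for name, path in locate_binaries().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If either entry is reported as not found, re-running the setup cell and reading its output for `dpkg` or `wget` errors is a reasonable first step.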

[33m0% [Working][0m            Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
[33m0% [Waiting for headers] [Connecting to security.ubuntu.com] [Connected to clou[0m                                                                               Get:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
[33m0% [2 InRelease 12.7 kB/119 kB 11%] [Connecting to security.ubuntu.com (185.125[0m                                                                               Get:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
[33m0% [2 InRelease 21.4 kB/119 kB 18%] [Connecting to security.ubuntu.com (185.125[0m[33m0% [2 InRelease 21.4 kB/119 kB 18%] [Connecting to security.ubuntu.com (185.125[0m                                                                               Hit:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-backports InRele



# Selenium Setup with Automatic Chromedriver Installation

## Introduction:
This guide provides instructions for setting up Selenium with automatic Chromedriver installation in Python, allowing you to automate browser interactions. With Chromedriver autoinstaller, you can ensure that the appropriate Chromedriver version is installed for your Chrome browser.

## Prerequisites:
- Python environment with pip installed
- Basic knowledge of Python programming

## Steps:

### 1. Install Chromedriver Autoinstaller:
```bash
!pip install chromedriver-autoinstaller
```


In [None]:
!pip install chromedriver-autoinstaller

import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')


from selenium import webdriver
import chromedriver_autoinstaller

# setup chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless') # ensure GUI is off
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# download a chromedriver that matches the installed Chrome and add it to PATH
chromedriver_autoinstaller.install()

# set up the webdriver
driver = webdriver.Chrome(options=chrome_options)
# quit the driver
driver.quit()
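
To confirm that the installed driver really matches the browser, the major versions reported by a session can be compared. The sketch below is an illustration under assumptions: `driver_matches_browser` is a hypothetical helper, the sample values are made up, and the `browserVersion` and `chrome.chromedriverVersion` keys are the ones Chrome sessions typically expose through `driver.capabilities`.

```python
def driver_matches_browser(capabilities):
    """Return True when Chrome and chromedriver report the same major version."""
    browser_major = capabilities["browserVersion"].split(".")[0]
    # chromedriverVersion typically looks like "114.0.5735.90 (<build hash>)"
    driver_major = capabilities["chrome"]["chromedriverVersion"].split(".")[0]
    return browser_major == driver_major


# Illustrative capabilities dict (made-up values) standing in for driver.capabilities
sample = {
    "browserVersion": "114.0.5735.198",
    "chrome": {"chromedriverVersion": "114.0.5735.90 (abc123)"},
}
print(driver_matches_browser(sample))  # True
```

With a live session, the call would be `driver_matches_browser(driver.capabilities)`, made before `driver.quit()`.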



# Web Scraping for Refresher Readings from CFA Institute Website

## Introduction:
This Python script demonstrates web scraping techniques to extract refresher reading details from the CFA Institute website. It utilizes Selenium for dynamic page interaction and Beautiful Soup for HTML parsing.

## Prerequisites:
- Python environment with the necessary libraries installed (Selenium, Beautiful Soup, pandas, chromedriver-autoinstaller)
- Chrome browser installed

## Setup:
1. Install required libraries:
```bash
pip install selenium beautifulsoup4 pandas chromedriver-autoinstaller
```


**Script Overview:**
This script performs the following tasks:

- Initializes the WebDriver with headless Chrome options.
- Scrapes the main page of CFA Institute's refresher readings and navigates through pagination to extract reading titles and links.
- Extracts detailed information for each reading, including the topic, year, level, introduction, learning outcomes, summary, and overview.
- Structures the extracted data into a DataFrame and saves it to a CSV file.
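
The script pauses with fixed `time.sleep(5)` calls after each navigation, which is simple but waits the full five seconds even when the page is ready sooner. One possible alternative is to poll for a condition with a timeout. The sketch below is an assumption-laden illustration, not the script's method: `wait_until` is a hypothetical helper written in plain Python.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns a truthy value or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            # Hand back whatever the predicate produced (for example, a list of elements)
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

In the scraper, a call such as `wait_until(lambda: driver.find_elements(By.CLASS_NAME, "coveo-title"))` could stand in for a fixed sleep, since an empty result list is falsy. Selenium's own `WebDriverWait` offers the same idea built in and is likely the better choice in practice.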

In [16]:
#!/usr/bin/env python
# coding: utf-8

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import time
from bs4 import BeautifulSoup
import pandas as pd
import chromedriver_autoinstaller
from selenium.common.exceptions import NoSuchElementException

def initialize_driver():
    # setup chrome options
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless') # ensure GUI is off
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    # download a chromedriver that matches the installed Chrome and add it to PATH
    chromedriver_autoinstaller.install()
    driver = webdriver.Chrome(options=chrome_options)
    driver.maximize_window()
    return driver

def close_privacy_warning(driver):
    # The banner is not always shown, so a missing button is not treated as an error
    try:
        driver.find_element(By.ID, "closePrivacyWarning").click()
    except NoSuchElementException:
        pass

def click_next_button(driver):
    try:
        next_button = driver.find_element(By.CLASS_NAME, "coveo-pager-next")
        next_button.click()
        time.sleep(5)
        return driver
    except NoSuchElementException:
        return None

def scrape(driver, refresher_readings_list):
    time.sleep(5)  # Wait for the page to load
    html_content = driver.page_source
    soup = BeautifulSoup(html_content, 'html.parser')
    titles = soup.find_all('h4', class_='coveo-title')
    for title in titles:
        link = title.find('a', class_='CoveoResultLink')['href']
        reading = [title.text.strip(), link]
        refresher_readings_list.append(reading)

def get_reading_detail_data(driver, reading):
    driver.get(reading[1])
    time.sleep(5)
    html_content = driver.page_source
    soup = BeautifulSoup(html_content, 'html.parser')

    meta_data = soup.find('div', class_="content-utility")
    # Guard against pages that lack the metadata block
    span_elements = meta_data.find_all('span', class_=['content-utility-curriculum', 'content-utility-topic']) if meta_data else []

    data = {
        "topic": "",
        "year": "",
        "level": "",
        "introduction": "",
        "learning_outcomes": "",
        "summary": "",
        "overview": ""
    }

    # Extract text content from selected span elements
    if len(span_elements) >= 3:  # Ensure 'curriculum', 'topic', and 'level' span elements are present
        data["year"] = span_elements[0].text.strip().split()[0]
        data["level"] = span_elements[1].text.strip()
        data["topic"] = span_elements[2].text.strip()

    # Extract data from other sections
    headings = soup.find_all('h2', class_="article-section")
    for section in headings:
        heading = section.text.strip()
        if heading == "Introduction":
            # The introduction text is taken from the heading's parent container
            data["introduction"] = section.find_parent().text.strip()
        elif heading in ("Learning Outcomes", "Summary", "Overview"):
            # For the other sections, use the next element in the document after the heading
            key = heading.lower().replace(" ", "_")
            data[key] = section.find_next().text.strip()

    return data


def scrape_reading_detail(refresher_readings_list):
    data_list = []
    driver = initialize_driver()
    for reading in refresher_readings_list:
        reading_detail = get_reading_detail_data(driver, reading)
        data_list.append({
            'Title': reading[0],
            'Topic': reading_detail['topic'],
            'Year': reading_detail['year'],
            'Level': reading_detail['level'],
            'Introduction': reading_detail['introduction'],
            'Learning Outcomes': reading_detail['learning_outcomes'],
            'Summary': reading_detail['summary'],
            'Overview': reading_detail['overview']
        })
    driver.quit()
    df = pd.DataFrame(data_list)
    return df


def main():
    refresher_readings_list = []
    driver = initialize_driver()
    url = "https://www.cfainstitute.org/en/membership/professional-development/refresher-readings#first=10&sort=%40refreadingcurriculumyear%20descending"
    driver.get(url)
    close_privacy_warning(driver)
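    for page_num in range(23):  # upper bound on the number of result pages to visit
        scrape(driver, refresher_readings_list)
        # Keep the driver reference so the browser can still be closed afterwards
        if click_next_button(driver) is None:
            break
    driver.quit()  # close the listing browser; the detail pass opens its own
    df = scrape_reading_detail(refresher_readings_list)
    print(df)
    df.to_csv('refresher_readings.csv', index=False)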

if __name__ == "__main__":
    main()



                                                 Title  \
0                                 Time-Series Analysis   
1                               Credit Analysis Models   
2              Introduction to Alternative Investments   
3                                 Credit Default Swaps   
4                       Valuation of Contingent Claims   
..                                                 ...   
219                  Fixed-Income Cash Flows and Types   
220  Private Capital, Real Estate, Infrastructure, ...   
221                  Extensions of Multiple Regression   
222  Pricing and Valuation of Forward Contracts and...   
223          Option Replication Using Put-Call Parity​   

                       Topic  Year     Level  \
0       Quantitative Methods  2024  Level II   
1               Fixed Income  2024  Level II   
2    Alternative Investments  2023   Level I   
3               Fixed Income  2024  Level II   
4                Derivatives  2024  Level II   
..             