# Part 1: Web Scraping with Selenium and Beautiful Soup

In [2]:
pip install selenium

Collecting selenium
  Downloading selenium-4.11.2-py3-none-any.whl (7.2 MB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.11.1-py3-none-any.whl (17 kB)
Collecting trio~=0.17
  Downloading trio-0.22.2-py3-none-any.whl (400 kB)
Collecting certifi>=2021.10.8
  Downloading certifi-2024.2.2-py3-none-any.whl (163 kB)
Collecting urllib3[socks]<3,>=1.26
  Downloading urllib3-2.0.7-py3-none-any.whl (124 kB)
Collecting wsproto>=0.14
  Downloading wsproto-1.2.0-py3-none-any.whl (24 kB)
Collecting exceptiongroup; python_version < "3.11"
  Downloading exceptiongroup-1.2.0-py3-none-any.whl (16 kB)
Collecting outcome
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl (10 kB)
Collecting attrs>=20.1.0
  Downloading attrs-23.2.0-py3-none-any.whl (60 kB)
Collecting sniffio
  Downloading sniffio-1.3.0-py3-none-any.whl (10 kB)
Collecting h11<1,>=0.9.0
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, wsproto, attrs, outcome, exceptiongroup, sniffio, t

ERROR: pytest-astropy 0.8.0 requires pytest-cov>=2.0, which is not installed.
ERROR: pytest-astropy 0.8.0 requires pytest-filter-subpackage>=0.1, which is not installed.
ERROR: requests 2.22.0 has requirement urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1, but you'll have urllib3 2.0.7 which is incompatible.


Initializing Selenium WebDriver for Chrome

This code first imports the webdriver module from the selenium package. Then, it creates an instance of the Chrome WebDriver. The webdriver.Chrome() command launches a new Chrome browser session controlled by Selenium. The driver object created here will serve as our primary interface to interact with the web browser. With driver, we can navigate to URLs, interact with web elements, and extract data as needed.

In [1]:
from selenium import webdriver

driver = webdriver.Chrome()


The following Python code demonstrates how to use Selenium WebDriver for Chrome to navigate to the CFA Institute's website and extract URLs of refresher readings. Selinum is sed to extract the links in each page and the links are added all_links array. Selinum is used to extract the links because the webpage is dynamic and article links are store in javascript. 

The code performs the following tasks:

1) Initialization and Page Navigation: 
  * We start by creating a Chrome WebDriver instance (driver) and navigating to the CFA Institute's refresher readings page.

2) Handling Privacy Banner: 
  * A function close_privacy_banner() is defined and used to close any potential privacy consent banners that may appear on the site.

3) Extracting URLs in a Loop: 
  * We then enter a while loop to traverse through the web pages. In each iteration, the script:
  * Waits for the page content to load.
  * Extracts the URLs of the refresher readings using JavaScript and stores them in the all_links list.
  * Checks and clicks the 'next page' button if available, or exits the loop if there are no more pages.
  * Error Handling:The script includes error handling for timeouts and cases where elements (the next page button) is not found.

4) Closing the Browser: 
  * Finally, after extracting all URLs or encountering an error, the script closes the browser session.

In [8]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException, StaleElementReferenceException

driver = webdriver.Chrome()
driver.get("https://www.cfainstitute.org/en/membership/professional-development/refresher-readings#sort=%40refreadingcurriculumyear%20descending")
wait = WebDriverWait(driver, 10)

def close_privacy_banner():
    try:
        driver.execute_script("document.getElementById('privacy-banner').style.display='none';")
    except Exception as e:
        print("Privacy banner not found or could not be closed:", e)
all_links = []  # Array to store all the links

current_page = 1
while True:
    close_privacy_banner()

    try:
        # Wait for the new content to load
        wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "CoveoResultLink")))

        # Extract URLs using JavaScript
        links = driver.execute_script(
            "return Array.from(document.querySelectorAll('.CoveoResultLink')).map(link => link.getAttribute('href'));"
        )
        for link in links:
            print(f"Page {current_page}: {link}")
            all_links.append(link)
        # Check if the next page exists
        try:
            next_page = wait.until(EC.presence_of_element_located((By.XPATH, f"//a[contains(@class, 'coveo-pager-anchor') and text()='{current_page + 1}']")))
            if next_page:
                next_page.click()
                current_page += 1
                wait.until(EC.staleness_of(next_page))  # Wait for the old next page to go stale
        except NoSuchElementException:
            print("No more pages to navigate.")
            break
      

    except TimeoutException as e:
        print(f"Error loading page content: {e}")
        break

driver.quit()


Page 1: https://www.cfainstitute.org/membership/professional-development/refresher-readings/time-series-analysis
Page 1: https://www.cfainstitute.org/membership/professional-development/refresher-readings/credit-analysis-models
Page 1: https://www.cfainstitute.org/membership/professional-development/refresher-readings/introduction-alternative-investments
Page 1: https://www.cfainstitute.org/membership/professional-development/refresher-readings/credit-default-swaps
Page 1: https://www.cfainstitute.org/membership/professional-development/refresher-readings/valuation-contingent-claims
Page 1: https://www.cfainstitute.org/membership/professional-development/refresher-readings/introduction-commodities-commodity-derivatives
Page 1: https://www.cfainstitute.org/membership/professional-development/refresher-readings/understanding-income-statements
Page 1: https://www.cfainstitute.org/membership/professional-development/refresher-readings/pricing-and-valuation-of-forward-commitments
Page 1: ht

Page 8: https://www.cfainstitute.org/membership/professional-development/refresher-readings/ethics-and-trust-investment-profession
Page 8: https://www.cfainstitute.org/membership/professional-development/refresher-readings/ethics-application
Page 8: https://www.cfainstitute.org/membership/professional-development/refresher-readings/guidance-standards-i-vii-l3
Page 8: https://www.cfainstitute.org/membership/professional-development/refresher-readings/introduction-gips
Page 8: https://www.cfainstitute.org/membership/professional-development/refresher-readings/trade-strategy-execution
Page 8: https://www.cfainstitute.org/membership/professional-development/refresher-readings/portfolio-performance-evaluation
Page 8: https://www.cfainstitute.org/membership/professional-development/refresher-readings/exchange-traded-funds-mechanics-applications
Page 8: https://www.cfainstitute.org/membership/professional-development/refresher-readings/fixed-income-active-management-credit-strategies
Page 8: 

Page 15: https://www.cfainstitute.org/membership/professional-development/refresher-readings/2020/ICE-RSS-FEED-active-equity-investing-strategies
Page 15: https://www.cfainstitute.org/membership/professional-development/refresher-readings/2020/Refresher-Reading
Page 15: https://www.cfainstitute.org/membership/professional-development/refresher-readings/2020/Refresher-Reading
Page 15: https://www.cfainstitute.org/membership/professional-development/refresher-readings/2020/Copy-of-Refresher-Reading-Test
Page 15: https://www.cfainstitute.org/membership/professional-development/refresher-readings/2020/QA-Test-Refresher-Reading-1
Page 15: https://www.cfainstitute.org/membership/professional-development/refresher-readings/2020/Test-3-big-data-projects
Page 15: https://www.cfainstitute.org/membership/professional-development/refresher-readings/2020/Test-1-big-data-projects
Page 15: https://www.cfainstitute.org/membership/professional-development/refresher-readings/2020/Test20-cost-capital
Pag

Page 22: https://www.cfainstitute.org/membership/professional-development/refresher-readings/Industry-and-Competitive-Analysis
Page 22: https://www.cfainstitute.org/membership/professional-development/refresher-readings/Company-Analysis-Forecasting
Page 22: https://www.cfainstitute.org/membership/professional-development/refresher-readings/investments-real-estate-pubicly-traded-securities
Page 22: https://www.cfainstitute.org/membership/professional-development/refresher-readings/cost-capital-advanced-topics
Page 22: https://www.cfainstitute.org/membership/professional-development/refresher-readings/arbitrage-replication-cost-carry-pricing-derivatives
Page 22: https://www.cfainstitute.org/membership/professional-development/refresher-readings/Company-Analysis-Past-and-Present
Page 22: https://www.cfainstitute.org/membership/professional-development/refresher-readings/Business-Models
Page 22: https://www.cfainstitute.org/membership/professional-development/refresher-readings/Fixed-Incom

In [3]:
all_links[0]


'https://www.cfainstitute.org/membership/professional-development/refresher-readings/time-series-analysis'

In [4]:
pip install requests beautifulsoup4 pandas


Note: you may need to restart the kernel to use updated packages.


This Python script is designed to scrape information from the URLs of CFA Institute refresher readings (previously collected) and then process and save this information into a CSV file. The script performs several key operations:

Text Cleaning Function:

1) clean_text: 
  * A function defined to clean and normalize text. 
  * It converts the text to ASCII, handles special characters like dashes and quotation marks, and performs other general cleaning tasks.

2) Data Scraping and Processing:
  * The script iterates through each URL stored in the all_links list.
  * For each URL, it sends an HTTP GET request, retrieves the HTML content, and parses it using BeautifulSoup.
  * It then extracts various pieces of information, such as the title (topic name), year, level, learning outcomes, summary, and download link. If certain data is not found, it is marked as 'N/A', expcet for year it is 0.
  * The extracted text is cleaned using the clean_text function to ensure readability and uniformity.

3) Data Organization:
  * Extracted data is organized into a dictionary (new_row) for each URL.
  * These dictionaries are collected in the all_rows list.

4) DataFrame Creation and CSV Export:

  * A DataFrame is created from the all_rows list using pandas, with predefined columns.
  * This DataFrame is then exported to a CSV file named 'Assignment.csv'.
  * This script effectively automates the process of extracting and organizing data from web pages into a structured format, which can be easily analyzed or shared.

In [25]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

def clean_text(text):
    # Normalize the text to NFKD form which separates characters and their diacritics
    # which should turn characters like "â€" into their separate components "a" and "€"
    text = unicodedata.normalize('NFKD', text)
    
    # Encode to ASCII bytes, then decode back to a string ignoring non-ASCII characters
    text = text.encode('ascii', 'ignore').decode('ascii')
    
    # Replace common encoding issues
    replacements = {
        '\u2013': '-',  # En-dash
        '\u2014': '--', # Em-dash
        '\u2018': "'",  # Left single quotation mark
        '\u2019': "'",  # Right single quotation mark
        '\u201c': '"',  # Left double quotation mark
        '\u201d': '"',  # Right double quotation mark
        '\u2026': '...',# Ellipsis
        '\u00a0': ' ',  # Non-breaking space
    }
    
    # Apply replacements
    for src, dest in replacements.items():
        text = text.replace(src, dest)
    
    # Additional cleaning can be done here if needed
    text = re.sub(r'\n\s*', ' ', text)
    text = text.replace('\r', ' ')

    return text


def extract_year(text):
    # Match only the year number
    year_match = re.search(r'\b\d{4}\b', text)
    return year_match.group() if year_match else 'N/A'

# Set the path for the CSV file
csv_path = 'Assignment.csv'
# Define the columns of the DataFrame
columns = ["Name of the Topic", "Level", "Year", "Introduction", "Learning Outcomes", "Summary", "Link to the Summary Page", "Download Link"]
all_rows = []


# Loop through all the URLs in the `all_links` list
for url in all_links:
    response = requests.get(url)
    response.encoding = 'utf-8'
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')
    new_row = {}

    # Fetching the title
    title = soup.find('h1', class_='article-title')
    new_row['Name of the Topic'] = clean_text(title.get_text()) if title else 'N/A'
    
    # Fetching the Level
    level = soup.find('span', class_='content-utility-topic')
    new_row['Level'] = clean_text(level.get_text()) if level else 'N/A'

    # Learning Outcomes
    outcomes_header = soup.find('h2', class_='article-section', text='Learning Outcomes')
    outcomes_text = outcomes_header.find_next_sibling('section').get_text() if outcomes_header else 'N/A'
    new_row['Learning Outcomes'] = clean_text(outcomes_text)

    # Summary
    summary_header = soup.find('h2', class_='article-section', text='Summary')
    summary_text = summary_header.find_next_sibling('div').get_text() if summary_header else 'N/A'
    new_row['Summary'] = clean_text(summary_text)

    # Link to the Summary Page
    new_row['Link to the Summary Page'] = url
    
    # Download Link
    download_link = soup.find('a', class_='locked-content')
    new_row['Download Link'] = download_link['href'] if download_link else 'N/A'

    # Year
    year_span = soup.find('span', class_='content-utility-curriculum')
    new_row['Year'] = extract_year(year_span.get_text()) if year_span else 'N/A'


    # Introduction
    intro_header = soup.find('h2', class_='article-section', text='Introduction')
    if intro_header:
        # Find the parent section of the introduction header
        intro_section = intro_header.find_parent('section')
        # Find all p tags following the introduction header within the section
        intro_paragraphs = intro_section.find_all('p', recursive=False) if intro_section else []
        # Join the text from all p tags
        intro_text = ' '.join(p.get_text(strip=True) for p in intro_paragraphs)
        new_row['Introduction'] = clean_text(intro_text)
    else:
        new_row['Introduction'] = 'N/A'



    # Collect the new row in our list
    all_rows.append(new_row)

# Create a DataFrame from our list of rows
df = pd.DataFrame(all_rows, columns=columns)

# Save the DataFrame to a CSV file
df.to_csv(csv_path, index=False)


In [26]:
df.head()

Unnamed: 0,Name of the Topic,Level,Year,Introduction,Learning Outcomes,Summary,Link to the Summary Page,Download Link
0,Time-Series Analysis,Level II,2024,"As financial analysts, we often use time-serie...",The member should be able to: calculate and e...,The predicted trend value of a time series in...,https://www.cfainstitute.org/membership/profes...,https://study.cfainstitute.org/app/member-acce...
1,Credit Analysis Models,Level II,2024,Credit analysis plays an important role in the...,The member should be able to: explain expecte...,This reading has covered several important top...,https://www.cfainstitute.org/membership/profes...,https://study.cfainstitute.org/app/member-acce...
2,Introduction to Alternative Investments,Level I,2023,"In this section, we explain what alternative i...",The member should be able to: describe types ...,This reading provides a comprehensive introduc...,https://www.cfainstitute.org/membership/profes...,/-/media/documents/protected/refresher-reading...
3,Credit Default Swaps,Level II,2024,Derivative instruments in which the underlying...,The member should be able to: describe credit...,A credit default swap (CDS) is a contract bet...,https://www.cfainstitute.org/membership/profes...,https://study.cfainstitute.org/app/member-acce...
4,Valuation of Contingent Claims,Level II,2024,A contingent claim is a derivative instrument ...,The member should be able to: describe and in...,This reading on the valuation of contingent cl...,https://www.cfainstitute.org/membership/profes...,https://study.cfainstitute.org/app/member-acce...


In [27]:
df.isnull().sum()

Name of the Topic           0
Level                       0
Year                        0
Introduction                0
Learning Outcomes           0
Summary                     0
Link to the Summary Page    0
Download Link               0
dtype: int64

In [28]:
df.iloc[0,5]

' The predicted trend value of a time series in period t is  b 0 + b 1 t in a linear trend model; the predicted trend value of a time series in a log-linear trend model is  e b 0 + b 1 t . Time series that tend to grow by a constant amount from period to period should be modeled by linear trend models, whereas time series that tend to grow at a constant rate should be modeled by log-linear trend models. Trend models often do not completely capture the behavior of a time series, as indicated by serial correlation of the error term. If the DurbinWatson statistic from a trend model differs significantly from 2, indicating serial correlation, we need to build a different kind of model. An autoregressive model of order p, denoted AR(p), uses p lags of a time series to predict its current value: xt = b 0 + b 1 xt 1 + b 2 xt 2 + . . . + bpxt p +  t . A time series is covariance stationary if the following three conditions are satisfied: First, the expected value of the time series must be con