# Setting Up the Environment

This notebook uses the "web-scraper" environment. The following packages are pre-installed in it:

- requests
- beautifulsoup4
- selenium
- webdriver_manager
- pandas
- openai
  
The website loads its content dynamically with JavaScript, which `requests` alone cannot execute, so we use Selenium to scrape it. Selenium needs a WebDriver that matches the installed browser version. For Google Chrome, the driver can be downloaded from the [Chrome for Testing](https://googlechromelabs.github.io/chrome-for-testing/) page and should then be added to the PATH. Alternatively, `webdriver_manager`, which this notebook uses below, downloads a matching driver automatically.

In [16]:
# libraries
import requests
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time
from openai import OpenAI
import os

Variables that will be used often in the notebook are defined below:

In [17]:
#variables
MIT_ESS = R'https://ocw.mit.edu/search/?d=Earth%2C%20Atmospheric%2C%20and%20Planetary%20Sciences&s=department_course_numbers.sort_coursenum&type=course'
MIT_base = R'https://ocw.mit.edu'
MIT_curriculum = R'https://catalog.mit.edu/subjects/12/'
chromedriver = R'C:\Users\FarzanehSadeghi\Downloads\chromedriver-win64\chromedriver-win64\chromedriver.exe'
api_key = 'your api'

First, we fetch the URLs of all the pages to scrape from the [MIT OCW site for Earth, Atmospheric, and Planetary Sciences](https://ocw.mit.edu/search/?d=Earth%2C%20Atmospheric%2C%20and%20Planetary%20Sciences&s=department_course_numbers.sort_coursenum&type=course) (`MIT_ESS`).
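
The query string in `MIT_ESS` is percent-encoded, which makes it hard to read. As a minimal sketch, the same URL can be assembled from readable parameters with the standard library. The `params` dictionary and the `build_search_url` helper are illustrative names introduced here, not part of the notebook:

```python
from urllib.parse import quote, urlencode

# Readable search parameters, decoded from the MIT_ESS URL above.
params = {
    "d": "Earth, Atmospheric, and Planetary Sciences",
    "s": "department_course_numbers.sort_coursenum",
    "type": "course",
}

def build_search_url(base, params, plus_for_spaces=False):
    """Return base + '?' + encoded params (hypothetical helper)."""
    if plus_for_spaces:
        query = urlencode(params)  # default quote_plus: spaces become '+'
    else:
        query = urlencode(params, quote_via=quote)  # spaces become '%20'
    return f"{base}?{query}"

search_url = build_search_url("https://ocw.mit.edu/search/", params)
print(search_url)
```

Both encodings of a space (`%20` and `+`) decode to the same query parameters, so either form should reach the same search results.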

In web automation, a "driver" gives a script automated control over a web browser. It acts as an interface between the automation script and the browser, letting the script open web pages, click buttons, fill out forms, and extract information. Here we also need to scroll down the page to load all the search results, which we do with the `execute_script` method. `execute_script` runs JavaScript in the context of the currently selected frame or window; the script fragment is executed as the body of an anonymous function. We read the `scrollHeight` property to get the height of the entire document in pixels and scroll by that amount, repeating until the height stops growing.


We open `MIT_ESS` with the driver, take the rendered HTML, and parse it to extract the course code, URL, and title from each `<article>` element. First, all the `<article>` tags are located and their count is verified (it should be 106):

In [18]:
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)

try:
    driver.get(MIT_ESS)
    # until() blocks until the given condition is met or the 10-second timeout is reached.
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.TAG_NAME, 'article'))
    )
    # Run JavaScript with execute_script to scroll down the page and load all the courses
    last_height = driver.execute_script("return document.body.scrollHeight")
    # Loop indefinitely; the break below ends it once the page height stops growing
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # The loop above scrolled to the bottom so that every course is loaded; now parse the page
    MIT_ESS_content = driver.page_source
    soup = BeautifulSoup(MIT_ESS_content, 'html.parser')
    
    articles = soup.find_all('article')
    print(f"Found {len(articles)} courses.")
finally:
    driver.quit()


Found 106 courses.
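
The scroll loop in the cell above can also be written as a reusable function. The following is a sketch under assumptions, not the notebook's own code: `scroll_until_stable`, its `max_rounds` safety cap, and the `FakePage` stand-in are hypothetical names introduced here so the logic can be run without a browser.

```python
import time

def scroll_until_stable(get_height, scroll_to_bottom, pause=2.0, max_rounds=50):
    """Scroll until the page height stops changing; return the number of scrolls."""
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_to_bottom()
        time.sleep(pause)  # give lazily loaded results time to render
        new_height = get_height()
        if new_height == last_height:
            return rounds  # nothing new was loaded, so we reached the end
        last_height = new_height
    return max_rounds  # safety cap (an assumption, absent from the original loop)

class FakePage:
    """Stand-in for a browser page that grows three times, then stops."""
    def __init__(self):
        self.height = 1000
        self.loads_left = 3

    def get_height(self):
        return self.height

    def scroll_to_bottom(self):
        if self.loads_left > 0:
            self.height += 1000
            self.loads_left -= 1

page = FakePage()
rounds = scroll_until_stable(page.get_height, page.scroll_to_bottom, pause=0)
print(rounds, page.height)
```

With Selenium, the two callables would wrap the same `execute_script` calls used above, for example `lambda: driver.execute_script("return document.body.scrollHeight")` for `get_height`.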


In [19]:
with open('output.html', 'w', encoding='utf-8') as file:
    file.write(soup.prettify())

In `MIT_ESS_content`, each `<article>` represents one ESS course, from which we can extract the course code, the relative course URL, and the course title. This is the first `<article>`:
```html
<article aria-labelledby="search-result-0-title" aria-setsize="-1" aria-posinset="1" tabindex="0"><div class="card learning-resource-card list-view"><div class="card-contents"><div class="lr-info search-result has-min-height"><div class="lr-row resource-header"><div class="resource-type">12.000 | Undergraduate</div></div><div class="lr-row course-title"><a href="/courses/12-000-solving-complex-problems-fall-2009/" class="w-100"><div><span id="search-result-0-title">Solving Complex Problems</span></div></a></div><div class="lr-row subtitles"><div class="lr-row subtitle"><div class="lr-subtitle listitem"><div class="content"><a href="/search/?q=%22Prof.%20Samuel%20Bowring%22">Prof. Samuel Bowring </a></div></div></div></div><div class="lr-row subtitles"><div class="lr-row subtitle"><div class="lr-subtitle listitem topics-list"><div class="content"><a class="topic-link" href="/search/?t=Engineering">Engineering </a><a class="topic-link" href="/search/?t=Science">Science </a><a class="topic-link" href="/search/?t=Social%20Science">Social Science </a><a class="more" href="/courses/12-000-solving-complex-problems-fall-2009/">+ 7 more</a></div></div></div></div></div><div class="cover-image"><a href="/courses/12-000-solving-complex-problems-fall-2009/"><img src="/courses/12-000-solving-complex-problems-fall-2009/d0858e57959fc33931782b75722957ef_12-000f09.jpg" height="130" alt=""></a></div></div></div></article>
```

The information needed for this step is:

`12.000` from 
```html 
<div class="resource-type">12.000 | Undergraduate</div>
```
`/courses/12-000-solving-complex-problems-fall-2009/` from 
```html
<a href="/courses/12-000-solving-complex-problems-fall-2009/" class="w-100">
```
`Solving Complex Problems` from 
```html
<span id="search-result-0-title">Solving Complex Problems</span>
```

In [20]:
# Iterate through each article to extract course information

courses = []

for article in articles:
    course_data = {}
    
    # Extract course code
    course_code_div = article.find('div', class_='resource-type')
    if course_code_div:
        course_data['course_code'] = course_code_div.text.split(' | ')[0]
    
    # Extract course URL
    course_url_tag = article.find('a', class_='w-100')
    if course_url_tag:
        course_data['course_url'] = course_url_tag['href']
    
    # Extract course title
    course_title_span = article.find('span', id=lambda x: x and x.startswith('search-result'))
    if course_title_span:
        course_data['course_title'] = course_title_span.text
    
    courses.append(course_data)

print(courses)


[{'course_code': '12.000', 'course_url': '/courses/12-000-solving-complex-problems-fall-2009/', 'course_title': 'Solving Complex Problems'}, {'course_code': '12.000', 'course_url': '/courses/12-000-solving-complex-problems-fall-2003/', 'course_title': 'Solving Complex Problems'}, {'course_code': '12.001', 'course_url': '/courses/12-001-introduction-to-geology-fall-2013/', 'course_title': 'Introduction to Geology'}, {'course_code': '12.002', 'course_url': '/courses/12-002-physics-and-chemistry-of-the-terrestrial-planets-fall-2008/', 'course_title': 'Physics and Chemistry of the Terrestrial Planets'}, {'course_code': '12.003', 'course_url': '/courses/12-003-atmosphere-ocean-and-climate-dynamics-fall-2008/', 'course_title': 'Atmosphere, Ocean and Climate Dynamics'}, {'course_code': '12.005', 'course_url': '/courses/12-005-applications-of-continuum-mechanics-to-earth-atmospheric-and-planetary-sciences-spring-2006/', 'course_title': 'Applications of Continuum Mechanics to Earth, Atmospheric

In [23]:
for course in courses:
    course['full_url'] = MIT_base + course['course_url']
    print(course['full_url'])

https://ocw.mit.edu/courses/12-000-solving-complex-problems-fall-2009/
https://ocw.mit.edu/courses/12-000-solving-complex-problems-fall-2003/
https://ocw.mit.edu/courses/12-001-introduction-to-geology-fall-2013/
https://ocw.mit.edu/courses/12-002-physics-and-chemistry-of-the-terrestrial-planets-fall-2008/
https://ocw.mit.edu/courses/12-003-atmosphere-ocean-and-climate-dynamics-fall-2008/
https://ocw.mit.edu/courses/12-005-applications-of-continuum-mechanics-to-earth-atmospheric-and-planetary-sciences-spring-2006/
https://ocw.mit.edu/courses/12-006j-nonlinear-dynamics-chaos-fall-2022/
https://ocw.mit.edu/courses/12-007-geobiology-spring-2013/
https://ocw.mit.edu/courses/12-009j-theoretical-environmental-analysis-spring-2015/
https://ocw.mit.edu/courses/12-010-computational-methods-of-scientific-programming-fall-2011/
https://ocw.mit.edu/courses/12-085-seminar-in-environmental-science-spring-2008/
https://ocw.mit.edu/courses/12-086-modeling-environmental-complexity-fall-2014/
https://ocw
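
String concatenation works here because `MIT_base` has no trailing slash and every relative URL starts with one. As a hedged alternative, `urllib.parse.urljoin` from the standard library handles both cases; the variable `relative_url` below is just an illustrative name holding the first course's path:

```python
from urllib.parse import urljoin

MIT_base = "https://ocw.mit.edu"
relative_url = "/courses/12-000-solving-complex-problems-fall-2009/"

# urljoin resolves the relative path against the base URL.
full_url = urljoin(MIT_base, relative_url)
# A trailing slash on the base does not produce a double slash.
full_url_with_slash = urljoin(MIT_base + "/", relative_url)

print(full_url)
```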

The information needed from each course homepage is:

- Course title from 
```html
<h1>
    <a class="text-capitalize m-0 text-white" href="/courses/12-000-solving-complex-problems-fall-2009/">Solving Complex Problems</a>
</h1>
```
- Course code from 
```html
<span class="course-number-term-detail">12.000 | Fall 2009 | Undergraduate</span>
```
- Date from 
```html
<span class="course-number-term-detail">12.000 | Fall 2009 | Undergraduate</span>
```
- Level from 
```html
<span class="course-number-term-detail">12.000 | Fall 2009 | Undergraduate</span>
```
- Description from 
```html
<div id="expanded-description" class="description"><p><em>12.000</em> <em>Solving Complex Problems</em> is designed to provide students the opportunity to work as part of a team to propose solutions to a complex problem that requires an interdisciplinary approach. For the students of the class of 2013, 12.000 will revolve around the issues associated with what we can and must do about the steadily increasing amounts CO<sub>2</sub> in Earth’s atmosphere.</p>
<p>12.000 is a core course for the <a href="https://terrascope.mit.edu/" target="_blank" rel="noopener">MIT Terrascope</a> freshman learning community. Each year’s class explores a different problem in detail through the study of complementary case histories and the development of creative solution strategies. It includes training in Web site development, effective written and oral communication, and team building. Initially developed with major financial support from the <a href="http://web.mit.edu/darbeloff/" target="_blank" rel="noopener">d’Arbeloff Fund for Excellence in Education</a>, 12.000 is designed to enhance the freshman experience by helping students develop contexts for other subjects in the sciences and humanities, and by helping them to establish learning communities that include upperclassmen, faculty, MIT alumni, and professionals in science and engineering fields.</p>
<button id="collapse-description" type="button" class="btn btn-link link-button p-0">Show less</button></div>
```
- Instructor from 
```html
<a class="course-info-instructor strip-link-offline" href="/search?q=Prof.+Samuel+Bowring">Prof. Samuel Bowring</a>
```
- Topics from 
```html
<ul class="list-unstyled pb-2 m-0 ">...</ul>
```
The function below extracts all of these fields from a course homepage:

In [24]:
def fetch_metadata(course_url):
    """Fetch a course homepage and return its metadata as a dict, or None if the request fails."""

    response = requests.get(course_url)
    if response.status_code != 200:
        print(f"Failed to fetch course page: {course_url}")
        return None
    
    course_soup = BeautifulSoup(response.content, 'html.parser')

    # Extract course title
    title_tag = course_soup.find('h1')
    if title_tag and title_tag.find('a'):
        course_title = title_tag.find('a').get_text(strip=True)
    else:
        course_title = None    
    
    # Extract course code, date, and level
    course_details = course_soup.find('span', class_='course-number-term-detail')
    if course_details:
        code = course_details.text.split(' | ')[0]
        date = course_details.text.split(' | ')[1]
        level = course_details.text.split(' | ')[2]
    else:
        code = date = level = None

    # Extract description
    description_div = course_soup.find('div', id='expanded-description')
    description = description_div.get_text(separator=' ', strip=True) if description_div else None
    
    # Extract instructors
    instructor_tags = course_soup.find_all('a', class_='course-info-instructor')
    instructors = list(set(instructor.get_text(strip=True) for instructor in instructor_tags))  # Remove duplicates
    
    # Extract topics from the topic list
    topics_ul = course_soup.find('ul', class_='list-unstyled pb-2 m-0')
    topics = set()
    if topics_ul:
        for li in topics_ul.find_all('li'):
            subtopics = li.find_all('a', class_='course-info-topic')
            if subtopics:
                topics.add(subtopics[-1].get_text(strip=True))  # Get the lowest level topic
    topics = list(topics)
    
    return {
        'course_title': course_title,
        'course_code': code,
        'date': date,
        'level': level,
        'description': description,
        'instructors': instructors,
        'topics': topics
    }
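
In `fetch_metadata`, the code, date, and level all come from one `' | '`-separated string, and indexing the split result directly raises an `IndexError` if a page has fewer than three fields. A more defensive split could look like the sketch below; `parse_course_details` is a hypothetical helper, not part of the notebook:

```python
def parse_course_details(text, n_fields=3):
    """Split 'code | term | level' into exactly n_fields values, padding with None."""
    parts = [part.strip() for part in text.split("|")]
    parts = [part or None for part in parts]  # turn empty fields into None
    parts += [None] * (n_fields - len(parts))  # pad when fields are missing
    return tuple(parts[:n_fields])

code, date, level = parse_course_details("12.000 | Fall 2009 | Undergraduate")
print(code, date, level)
```

The assignment is positional, so a string with only two fields fills the first two slots and leaves the third as `None`.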

In [25]:
course_metadata = []

# Loop through each course URL and fetch metadata
for course in courses:
    metadata = fetch_metadata(course['full_url'])
    if metadata:
        metadata['url'] = course['full_url']  # Add the URL to the metadata
        course_metadata.append(metadata)

# Convert the list of dictionaries to a pandas DataFrame
df = pd.DataFrame(course_metadata)

# Save the DataFrame as a CSV file
df.to_csv('course_metadata.csv', index=False)

print("Course metadata saved to course_metadata.csv")


Course metadata saved to course_metadata.csv
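
Note that the `instructors` and `topics` columns hold Python lists, which a CSV file can only store as text. The sketch below uses the standard `csv` module and an in-memory buffer (both are stand-ins chosen for illustration, not what the notebook does) to show how such a cell can be turned back into a list with `ast.literal_eval`:

```python
import ast
import csv
import io

# Write one row shaped like the metadata above; the list is stored as its string form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["course_code", "instructors"])
writer.writeheader()
writer.writerow({"course_code": "12.000", "instructors": str(["Prof. Samuel Bowring"])})

# Read the row back: the cell is now plain text, not a list.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["instructors"]

# literal_eval safely parses the text back into a Python list.
instructors = ast.literal_eval(raw)
print(type(raw).__name__, type(instructors).__name__)
```

When reading the file back with pandas, the same conversion can likely be applied through the `converters` argument of `pd.read_csv`.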


Next, we extract course descriptions from the MIT subject page: https://catalog.mit.edu/subjects/12/

The information needed for this step is:

- Course title from: 
```html
<h4 class="courseblocktitle"><span font="Meta Pro"><strong>12.00 Frontiers and Careers in Earth, Planets, Climate, and Life</strong></span></h4>
```
- Course description from:
```html
<p class="courseblockdesc">Provides a broad overview of topics, technologies, and career paths at the forefront of Earth, Atmospheric and Planetary Sciences. Introduces the complex interplay between physics, mathematics, chemistry, biology, and computational methods used to study processes associated with a changing Earth and climate, distant planets, and life. Sessions guided by faculty members discussing current research problems, and by EAPS alumni describing how their careers have evolved. Subject can count toward the 6-unit discovery-focused credit limit for first year students. </p>
```

In [81]:
response = requests.get(MIT_curriculum)
if response.status_code == 200:
    print("Successfully fetched the page.")
else:
    print("Failed to fetch the page. Status code:", response.status_code)

Successfully fetched the page.


In [82]:
soup = BeautifulSoup(response.content, 'html.parser')

course_blocks = soup.find_all('div', class_='courseblock')
print(f"Found {len(course_blocks)} course blocks.")

Found 234 course blocks.


In [83]:
courses_data = []

for course in course_blocks:
    title_tag = course.find('h4', class_='courseblocktitle')
    course_number_title = title_tag.get_text(strip=True) if title_tag else None
    
    desc_tag = course.find('p', class_='courseblockdesc')
    course_description = desc_tag.get_text(strip=True) if desc_tag else None
    
    courses_data.append({
        'course_number_title': course_number_title,
        'course_description': course_description
    })

for course in courses_data:
    print(course)

{'course_number_title': '12.00 Frontiers and Careers in Earth, Planets, Climate, and Life', 'course_description': 'Provides a broad overview of topics, technologies, and career paths at the forefront of Earth, Atmospheric and Planetary Sciences. Introduces the complex interplay between physics, mathematics, chemistry, biology, and computational methods used to study processes associated with a changing Earth and climate, distant planets, and life. Sessions guided by faculty members discussing current research problems, and by EAPS alumni describing how their careers have evolved. Subject can count toward the 6-unit discovery-focused credit limit for first year students.'}
{'course_number_title': '12.000 Solving Complex Problems', 'course_description': "Provides an opportunity for entering freshmen to gain firsthand experience in integrating the work of small teams to develop effective solutions to complex problems in Earth system science and engineering. Each year's class explores a di
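
Each `course_number_title` combines the course number and the title in one string. If the two are needed separately, for example to match catalog entries against the OCW course codes, they can be split at the first space. This is a minimal sketch that assumes the number never contains a space; `split_number_title` is a hypothetical helper:

```python
def split_number_title(number_title):
    """Split '12.00 Frontiers ...' into ('12.00', 'Frontiers ...')."""
    number, _, title = number_title.strip().partition(" ")
    return number, title.strip()

number, title = split_number_title(
    "12.00 Frontiers and Careers in Earth, Planets, Climate, and Life"
)
print(number)
print(title)
```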

In [84]:
df = pd.DataFrame(courses_data)

df.to_csv('mit_curriculum.csv', index=False)

print("Course data saved to mit_curriculum.csv")

Course data saved to mit_curriculum.csv
