<h1><center>edX Courses Scraping - Notebook 3</center></h1>

In this notebook, I scrape data from the individual course links that were extracted from each subject page of the edX website in the previous notebook.

## Importing Python Libraries

In [1]:
from selenium import webdriver
import pandas as pd
import pickle
import re

## Separating the course links not yet scraped

The previous notebook extracted 2987 course links, of which 998 were already scraped in the first notebook. Here I separate out the course links that have not yet been scraped.

In [2]:
#loading the pickled dictionary
with open('Data/all_sub_links.pkl','rb') as file:
    all_sub_links = pickle.load(file)

In [3]:
#extracting the unique course links 
all_courses = []
for sub in all_sub_links:
    all_courses.extend(all_sub_links[sub][1])
#set stores only unique values
all_courses = set(all_courses)

In [4]:
#loading the courses already scraped 
course_details = pd.read_csv('Data/edX_Course.csv')
extracted_courses = set(course_details['Course Link'])

In [5]:
#finding the courses not scraped
not_extracted_courses = all_courses.difference(extracted_courses)
print(f'There are {len(not_extracted_courses)} courses yet to be scraped')

There are 1989 courses yet to be scraped
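
The deduplication and set difference above can be illustrated on toy data. This is a minimal sketch, assuming each value in `all_sub_links` is a sequence whose second item is the list of course links (as the indexing `[1]` suggests); the subject names and URLs below are made up.

```python
# Toy stand-in for the pickled dictionary; keys, first items, and URLs are hypothetical
all_sub_links = {
    'subject-a': ('subject page a', ['https://example.org/course/x',
                                     'https://example.org/course/y']),
    'subject-b': ('subject page b', ['https://example.org/course/y',
                                     'https://example.org/course/z']),
}

# A course listed under several subjects is stored only once in the set
all_courses = set()
for links in all_sub_links.values():
    all_courses.update(links[1])

# Links already scraped in an earlier run
extracted_courses = {'https://example.org/course/x'}

# Set difference keeps only the links that still need scraping
not_extracted_courses = all_courses - extracted_courses
print(sorted(not_extracted_courses))
```

The counts reported in this notebook are consistent with the same arithmetic: 2987 unique links minus 998 already scraped leaves 1989.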


## Scraping the individual course pages

In [6]:
#writing functions to extract and return required value from each edX course page
#these functions will return 'Missing' if certain fields are not found in course page

def get_title():
    try:
        title = driver.find_element_by_xpath('//h1[@class="course-intro-heading mb-2"]').text
    except:
        title = 'Missing'
    finally:
        return title
    
def get_short_description():
    try:
        des = driver.find_element_by_xpath('//div[@class="course-intro-lead-in mb-3"]/p').text
    except:
        des = 'Missing'
    finally:
        return des
    
def get_length():
    try:
        length = driver.find_element_by_xpath('(//li[@class="list-group-item d-flex row px-0"])[1]')
        length = length.find_element_by_xpath('./div[@class="col"]').text
    except:
        length = 'Missing'
    finally:
        return length
    
def get_effort():
    try:
        effort = driver.find_element_by_xpath('(//li[@class="list-group-item d-flex row px-0"])[2]')
        effort = effort.find_element_by_xpath('./div[@class="col"]').text
    except:
        effort = 'Missing'
    finally:
        return effort

def get_price():
    try:
        price = driver.find_element_by_xpath('(//li[@class="list-group-item d-flex row px-0"])[3]')
        price = price.find_element_by_xpath('./div[@class="col"]').text
        #extract only the value starting with $ or ₹
        price = re.findall(r'[\$\₹].*', price)[0]
    except:
        price = 'Missing'
    finally:
        return price

def get_institution():
    try:
        institution = driver.find_element_by_xpath('(//li[@class="list-group-item d-flex row px-0"])[4]')
        institution = institution.find_element_by_xpath('./div[@class="col"]').text
    except:
        institution = 'Missing'
    finally:
        return institution

def get_subject():
    try:
        subject = driver.find_element_by_xpath('(//li[@class="list-group-item d-flex row px-0"])[5]')
        subject = subject.find_element_by_xpath('./div[@class="col"]').text
    except:
        subject = 'Missing'
    finally:
        return subject

def get_level():
    try:
        level = driver.find_element_by_xpath('(//li[@class="list-group-item d-flex row px-0"])[6]')
        level = level.find_element_by_xpath('./div[@class="col"]').text
    except:
        level = 'Missing'
    finally:
        return level
    
def get_language():
    try:
        language = driver.find_element_by_xpath('(//li[@class="list-group-item d-flex row px-0"])[7]')
        language = language.find_element_by_xpath('./div[@class="col"]').text
    except:
        language = 'Missing'
    finally:
        return language

def get_prerequisites():
    try:
        prerequisites = driver.find_element_by_xpath('//div[@class="col prerequisite-sidebar"]//p').text
    except:
        prerequisites = 'Missing'
    finally:
        return prerequisites
    
def get_img_src():
    try:
        image = driver.find_element_by_xpath('//img[@class="header-image"]').get_attribute('src')
    except:
        try:
            image = driver.find_element_by_xpath('//img[@class="video-thumb"]').get_attribute('src') 
        except:
            image = "Missing"
    finally:
        return image
        
def get_already_enrolled():
    try:
        enrolled = driver.find_element_by_xpath('//div[@id="js-number-enrolled-label"]/span/span').text
        enrolled = int(enrolled.replace(',',''))
    except:
        enrolled = 'Missing'
    finally:
        return enrolled
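
The getters above all follow the same pattern: try to read a field and fall back to 'Missing'. A minimal sketch of factoring that pattern into one helper is shown below. `value_or_missing`, `price_from`, and the sample sidebar texts are hypothetical, and plain functions stand in for the Selenium lookups so that the block runs without a browser. It catches `Exception` rather than using a bare `except`, so a `KeyboardInterrupt` can still stop a long scrape.

```python
import re

MISSING = 'Missing'

def value_or_missing(fetch, default=MISSING):
    """Call fetch() and return its result, or default if it raises."""
    try:
        return fetch()
    except Exception:
        return default

def price_from(text):
    # Same idea as get_price: keep the text from the first $ or ₹
    # up to the end of that line; raises IndexError if there is none
    return re.findall(r'[$₹].*', text)[0]

# Hypothetical sidebar texts, not copied from a real course page
paid = 'FREE\nAdd a Verified Certificate for $99 USD'
free_only = 'FREE'

print(value_or_missing(lambda: price_from(paid)))       # price found
print(value_or_missing(lambda: price_from(free_only)))  # falls back to 'Missing'
```

In the notebook, a getter such as `get_title` could then be reduced to `value_or_missing(lambda: driver.find_element_by_xpath('//h1[@class="course-intro-heading mb-2"]').text)`.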

In [7]:
#loop through course link and fetch required fields
for course_link in not_extracted_courses:
    course_dict = {}
    driver = webdriver.Chrome('chromedriver.exe')
    driver.get(course_link)
    
    #extract information using functions and storing it in temp dict    
    course_dict['Course Link'] = course_link
    course_dict['Title'] = get_title()
    course_dict['Short Description'] = get_short_description()
    course_dict['Length'] = get_length()
    course_dict['Effort'] = get_effort()
    course_dict['Price'] = get_price()
    course_dict['Institution'] = get_institution()
    course_dict['Subject'] = get_subject()
    course_dict['Level'] = get_level()
    course_dict['Prerequisites'] = get_prerequisites()
    course_dict['Image Source'] = get_img_src()
    course_dict['Already Enrolled'] = get_already_enrolled()
    course_dict['Language'] = get_language()
    #quit() ends the whole browser session, not just the current window
    driver.quit()
    
    #append the extracted info to the DataFrame
    #(pd.concat is used because DataFrame.append was removed in pandas 2.0)
    course_details = pd.concat([course_details, pd.DataFrame([course_dict])], ignore_index=True)

## Handling missed links

>In addition to individual courses, some subjects also list other certification programs such as XSeries, MicroMasters, and Professional Certificates. These need to be extracted in a different format.

In [8]:
#keeping only links that contain 'course', to drop XSeries, MicroMasters and professional certificate programs
course_details = course_details[course_details['Course Link'].str.contains('course')]

In [9]:
#separating missed links
missed_links = list(course_details[course_details['Title'] == 'Missing']['Course Link'])

if len(missed_links):
    print(f'{len(missed_links)} courses are missed')
else:
    print("All courses got loaded")

8 courses are missed


In [10]:
#dropping the missed courses from DataFrame
course_details = course_details[course_details.Title != 'Missing']

In [11]:
#loop through missed course link and fetch required fields
for course_link in missed_links:
    course_dict = {}
    driver = webdriver.Chrome('chromedriver.exe')
    driver.get(course_link)
    
    #extract information using functions and storing it in temp dict    
    course_dict['Course Link'] = course_link
    course_dict['Title'] = get_title()
    course_dict['Short Description'] = get_short_description()
    course_dict['Length'] = get_length()
    course_dict['Effort'] = get_effort()
    course_dict['Price'] = get_price()
    course_dict['Institution'] = get_institution()
    course_dict['Subject'] = get_subject()
    course_dict['Level'] = get_level()
    course_dict['Prerequisites'] = get_prerequisites()
    course_dict['Image Source'] = get_img_src()
    course_dict['Already Enrolled'] = get_already_enrolled()
    course_dict['Language'] = get_language()
    #quit() ends the whole browser session, not just the current window
    driver.quit()
    
    #append the extracted info to the DataFrame
    #(pd.concat is used because DataFrame.append was removed in pandas 2.0)
    course_details = pd.concat([course_details, pd.DataFrame([course_dict])], ignore_index=True)
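
The retry above runs a single extra pass. A minimal sketch of repeating it for a bounded number of rounds follows. `scrape_with_retries` is a hypothetical helper, `scrape_one` is assumed to be a function that takes a link and returns a dictionary with a 'Title' key, and `flaky_scrape` is a made-up stand-in that fails on its first call for each link.

```python
def scrape_with_retries(links, scrape_one, max_rounds=3):
    """Scrape links, retrying those whose Title is 'Missing' for up to max_rounds passes."""
    done, pending = [], list(links)
    for _ in range(max_rounds):
        still_missing = []
        for link in pending:
            row = scrape_one(link)
            if row['Title'] == 'Missing':
                still_missing.append(link)   # try this link again next round
            else:
                done.append(row)
        pending = still_missing
        if not pending:
            break
    return done, pending

# Made-up scraper: the first attempt for each link "fails to load"
attempts = {}

def flaky_scrape(link):
    attempts[link] = attempts.get(link, 0) + 1
    title = 'Missing' if attempts[link] == 1 else f'Title of {link}'
    return {'Course Link': link, 'Title': title}

rows, missed = scrape_with_retries(['link-a', 'link-b'], flaky_scrape)
print(len(rows), missed)
```

Links still left in `missed` after the last round are worth inspecting by hand, since they may not be course pages at all.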

## Writing the data to CSV file

In [12]:
#checking for missed links
(course_details['Title'] == 'Missing').any()

True

>Some courses are still missing. Let's examine their links.

In [14]:
#separating missed links
missed_links = list(course_details[course_details['Title'] == 'Missing']['Course Link'])

print(f'Missed courses are {missed_links}')

Missed courses are ['https://www.edx.org/professional-certificate/edx-course-creator-plus']


> The missed link is a professional certificate, which will be extracted in the next notebook.
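
The professional certificate passed the earlier `str.contains('course')` filter because its URL slug, `edx-course-creator-plus`, contains the substring `course`. A minimal sketch of a stricter check on the first path segment follows, assuming individual course pages live under a `/course/` path; `is_course_link` and the second URL are hypothetical.

```python
from urllib.parse import urlparse

def is_course_link(url):
    """True when the first path segment of the URL is exactly 'course'."""
    segments = urlparse(url).path.strip('/').split('/')
    return segments[0] == 'course'

links = [
    'https://www.edx.org/professional-certificate/edx-course-creator-plus',
    'https://www.edx.org/course/example-course',  # made-up course URL
]

for link in links:
    # Substring test versus path-segment test
    print('course' in link, is_course_link(link))
```

The substring test accepts both links, while the path-segment test accepts only the second.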

In [15]:
#replacing the en dash in Effort ranges with a plain hyphen
course_details['Effort'] = course_details['Effort'].str.replace('–','-')

In [16]:
#writing the extracted data to CSV file
course_details.to_csv('Data/edX_Course.csv',index=False)