# Web Scraping Exercise - Coursera

* Nontapat Pintira
* Student ID: 6088118

# Let's import the libraries needed for this assignment

In this assignment I am going to retrieve data from the Coursera search page, using Beautiful Soup for the web scraping. After I get the data, I will construct a DataFrame and export it as a .csv file.

Let's do this!!

In [1]:
import requests
import bs4
import lxml
import json
import time
import pandas as pd

# Scraping Coursera

## Testing the imports
In this case I am going to retrieve the page for the search query "python" and extract some information such as the course name, number of enrollments, rating, etc.

After inspecting the page, I see that the HTML tag for the course name is *h2[class="color-primary-text card-title headline-1-text"]*

Therefore, I am going to use soup to select the card `<div>` that contains this tag and read the `<h2>` from it.

Note: The web page is encoded in UTF-8.

In [2]:
query = 'python'

response = requests.get('https://www.coursera.org/search?query=' + query)
response.encoding = 'utf-8'

soup = bs4.BeautifulSoup(response.text, "lxml")

# The tag for the <h2> course name is in this div tag.
courses = soup.select("div.card-info.vertical-box")
for c in courses:
    print(c.h2.text)

Python for Everybody
Python 3 Programming
IBM Data Science
Google IT Automation with Python
Applied Data Science with Python
Programming for Everybody (Getting Started with Python)
Python for Data Science and AI
Основы программирования на Python
Introducción a la programación en Python I: Aprendiendo a programar con Python
Python Basics
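
The same selection can be tried offline on a small hand-written snippet. This is only a minimal sketch: the HTML below is made up to imitate the card structure described above, the `sample_*` names are hypothetical, and it assumes Python's built-in `html.parser` instead of `lxml` so that it runs without the extra dependency.

```python
import bs4

# Hypothetical markup that imitates two search-result cards.
sample_html = """
<div class="card-info vertical-box">
  <h2 class="color-primary-text card-title headline-1-text">Python for Everybody</h2>
</div>
<div class="card-info vertical-box">
  <h2 class="color-primary-text card-title headline-1-text">Python Basics</h2>
</div>
"""

sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")

# "div.card-info.vertical-box" matches every <div> that carries both classes.
sample_cards = sample_soup.select("div.card-info.vertical-box")

# c.h2 is the first <h2> inside the card; .text is its text content.
sample_names = [c.h2.text for c in sample_cards]
print(sample_names)
```

Working on a fixed snippet like this makes it easy to check a selector before sending any request to the real site.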


***
The request works, so let's get other information from it.

* The tag for the course provider name is 'span[class="partner-name"]'

* The tag for rating is 'span[class="ratings-text"]'

* The tag for the number of enrolled students is 'span[class="enrollment-number"]'

* And the tag for difficulty level is 'span[class="difficulty"]'

Note: We can also get the name of the course provider from the courses tag
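
One detail about the selectors listed above: the attribute form `span[class="..."]` compares the whole `class` attribute as a string, while the class form `span.ratings-text` and `find_all(..., class_=...)` only require that the element carries that class. The sketch below illustrates the difference on made-up markup (the `highlighted` class and the `sample_*` names are hypothetical, and `html.parser` is assumed so the example runs without `lxml`).

```python
import bs4

# Hypothetical markup: the same class appears alone and next to a second class.
sample_html = """
<span class="ratings-text">4.8</span>
<span class="ratings-text highlighted">4.7</span>
<span class="highlighted ratings-text">4.6</span>
"""
sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")

# Attribute selector: the class attribute must equal this string exactly.
exact = [s.text for s in sample_soup.select('span[class="ratings-text"]')]

# Class selector: the element only needs to carry the class, in any position.
by_class = [s.text for s in sample_soup.select("span.ratings-text")]

# find_all with class_ also matches on a single class.
by_find_all = [s.text for s in sample_soup.find_all("span", class_="ratings-text")]

print(exact)
print(by_class)
print(by_find_all)
```

If the site ever adds or reorders classes on a tag, the class form keeps matching while the exact attribute form stops matching.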

## Getting Courses Detail

In [3]:
ratings = soup.find_all("span", class_="ratings-text")
enrollments = soup.find_all("span", class_="enrollment-number")
difficulties = soup.find_all("span", class_="difficulty")

for i in range(0, len(courses)):
    print(courses[i].h2.text, courses[i].span.text, ratings[i].text, enrollments[i].text, difficulties[i].text)

Python for Everybody University of Michigan 4.8 1.2m Beginner
Python 3 Programming University of Michigan 4.7 83k Beginner
IBM Data Science IBM 4.6 400k Beginner
Google IT Automation with Python Google 4.7 49k Beginner
Applied Data Science with Python University of Michigan 4.5 420k Intermediate
Programming for Everybody (Getting Started with Python) University of Michigan 4.8 1.1m Mixed
Python for Data Science and AI IBM 4.6 140k Beginner
Основы программирования на Python National Research University Higher School of Economics 4.6 73k Beginner
Introducción a la programación en Python I: Aprendiendo a programar con Python Pontificia Universidad Católica de Chile 4.5 93k Beginner
Python Basics University of Michigan 4.8 70k Beginner
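
The loop above pairs the four lists by position, which only lines up when every card carries a rating, an enrollment number and a difficulty. A possible alternative is to look up each field inside its own card, so a missing field becomes `None` instead of shifting the later rows. This is a minimal sketch on made-up markup; `text_or_none`, `records` and the sample HTML are hypothetical, and `html.parser` is assumed.

```python
import bs4

# Hypothetical markup: the second card has no rating, enrollment or difficulty.
sample_html = """
<div class="card-info vertical-box">
  <h2>Python for Everybody</h2>
  <span class="partner-name">University of Michigan</span>
  <span class="ratings-text">4.8</span>
  <span class="enrollment-number">1.2m</span>
  <span class="difficulty">Beginner</span>
</div>
<div class="card-info vertical-box">
  <h2>Launching into Machine Learning en Español</h2>
  <span class="partner-name">Google Cloud</span>
</div>
"""


def text_or_none(card, selector):
    """Return the text of the first match inside `card`, or None if there is none."""
    node = card.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")

records = []
for card in sample_soup.select("div.card-info.vertical-box"):
    # Every lookup is scoped to this card, so fields cannot mix between courses.
    records.append({
        "Course Name": text_or_none(card, "h2"),
        "Partner": text_or_none(card, "span.partner-name"),
        "Rating": text_or_none(card, "span.ratings-text"),
        "Enrollment": text_or_none(card, "span.enrollment-number"),
        "Level": text_or_none(card, "span.difficulty"),
    })

for r in records:
    print(r)
```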


***

And here is the course information displayed on the first page of Python courses on the Coursera website.

However, this is a bit hard to read, so let's put it in a data frame.

## Creating DataFrame

In [4]:
courses_text = []
providers_text = []
ratings_text = []
enrollments_text = []
difficulties_text = []

for i in range(0, len(courses)):
    courses_text.append(courses[i].h2.text)
    providers_text.append(courses[i].span.text)
    ratings_text.append(ratings[i].text)
    enrollments_text.append(enrollments[i].text)
    difficulties_text.append(difficulties[i].text)
    
data = {
    "Course Name": courses_text,
    "Partner": providers_text,
    "Rating": ratings_text,
    "Enrollment": enrollments_text,
    "Level": difficulties_text
}
    
df = pd.DataFrame(data)
df

| | Course Name | Partner | Rating | Enrollment | Level |
|---|---|---|---|---|---|
| 0 | Python for Everybody | University of Michigan | 4.8 | 1.2m | Beginner |
| 1 | Python 3 Programming | University of Michigan | 4.7 | 83k | Beginner |
| 2 | IBM Data Science | IBM | 4.6 | 400k | Beginner |
| 3 | Google IT Automation with Python | Google | 4.7 | 49k | Beginner |
| 4 | Applied Data Science with Python | University of Michigan | 4.5 | 420k | Intermediate |
| 5 | Programming for Everybody (Getting Started wit... | University of Michigan | 4.8 | 1.1m | Mixed |
| 6 | Python for Data Science and AI | IBM | 4.6 | 140k | Beginner |
| 7 | Основы программирования на Python | National Research University Higher School of ... | 4.6 | 73k | Beginner |
| 8 | Introducción a la programación en Python I: Ap... | Pontificia Universidad Católica de Chile | 4.5 | 93k | Beginner |
| 9 | Python Basics | University of Michigan | 4.8 | 70k | Beginner |
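
The frame can also be built from one record per course instead of five parallel lists. The sketch below is just an illustration of that alternative, using two rows copied from the output above; `rows` and `sample_df` are hypothetical names.

```python
import pandas as pd

# Two rows copied from the output above; each dict describes one course.
rows = [
    {"Course Name": "Python for Everybody", "Partner": "University of Michigan",
     "Rating": "4.8", "Enrollment": "1.2m", "Level": "Beginner"},
    {"Course Name": "IBM Data Science", "Partner": "IBM",
     "Rating": "4.6", "Enrollment": "400k", "Level": "Beginner"},
]

# pandas takes the dict keys as column names and creates one row per dict.
sample_df = pd.DataFrame(rows)
print(sample_df)
```

Keeping each course in a single record means a field can never end up in the wrong row.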


***
And there we have our data frame.
Now let's define a function to do this task for all the pages.

Note that some courses do not have a rating, a number of enrollments, or a level, and I have to take that into account.
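
Before touching the network, the "all pages" loop can be sketched on its own. The version below is only an illustration under stated assumptions: `collect_pages`, `fake_fetch` and `fake_site` are hypothetical names, the page download is replaced by a dictionary lookup, and the loop stops at the first page that returns no rows (a simpler stop condition than checking the page for a pagination button).

```python
def collect_pages(fetch_rows, first_page=1):
    """Call fetch_rows(page) for consecutive pages until one returns no rows."""
    all_rows = []
    page = first_page
    while True:
        rows = fetch_rows(page)
        if not rows:
            # An empty page is taken to mean we are past the last page.
            break
        all_rows.extend(rows)
        page += 1
    return all_rows


# Stand-in for a real page download: three pages of results, then nothing.
fake_site = {1: ["a", "b"], 2: ["c"], 3: ["d", "e"]}


def fake_fetch(page):
    return fake_site.get(page, [])


result = collect_pages(fake_fetch)
print(result)
```

Passing the fetch function as an argument keeps the loop testable; a real fetcher should still pause between requests so the site is not flooded.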

## Defining Functions

In [5]:
def getPageSource(query, page):
    # Let requests build the query string so the search term is URL-encoded.
    response = requests.get('https://www.coursera.org/search',
                            params={'query': query,
                                    'page': page,
                                    'index': 'prod_all_products_term_optimization'})
    response.encoding = 'utf-8'
    return response.text


def peekPages(query, page):
    # Returns the matching pagination buttons; getAllInfo() treats an
    # empty list as "this page does not exist".
    soup = bs4.BeautifulSoup(getPageSource(query, page), "lxml")
    pages = soup.select('button[class="box number current"]')
    return pages


def getInfo(query, page):

    data = []

    soup = bs4.BeautifulSoup(getPageSource(query, page), "lxml")
    card = soup.select("div.card-info.vertical-box")

    for c in card:

        # Flatten the card into its text fragments, in document order.
        s = c.get_text(separator="|", strip=True)
        temp_list = s.split("|")
        size = len(temp_list)

        if size > 10:
            # Full card: type, rating, enrollment and difficulty are read
            # at fixed offsets from the end of the fragment list.
            l = [c.h2.text, c.span.text, temp_list[size-9], temp_list[size-8], temp_list[size-4], temp_list[size - 1]]
        else:
            # Short card: rating, enrollment and difficulty are missing.
            l = [c.h2.text, c.span.text, temp_list[2], 'None', 'None', 'None']

        data.append(l)

    return data


def getAllInfo(query):

    columns = ['Course', 'Provider', 'Type', 'Rating', 'Enrollment', 'Difficulty']
    frames = []
    page_counter = 1

    while peekPages(query, str(page_counter)):
        print("Retrieving page: " + str(page_counter))
        data = getInfo(query, str(page_counter))
        frames.append(pd.DataFrame(data, columns=columns))

        # Pause between pages to avoid hammering the server.
        time.sleep(2.0)
        page_counter += 1

    # No result pages at all: return an empty frame instead of failing.
    if not frames:
        return pd.DataFrame(columns=columns)

    return pd.concat(frames, ignore_index=True)
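
To see what `getInfo()` works with, it helps to look at what `get_text(separator="|", strip=True)` returns for a single card. The sketch below uses a made-up card (the real page has more fragments than this, and `html.parser` is assumed here), and also shows `stripped_strings`, which yields the same fragments without joining and splitting.

```python
import bs4

# Hypothetical card; the layout of the real page may differ.
sample_html = """
<div class="card-info vertical-box">
  <h2>Python Basics</h2>
  <span class="partner-name">University of Michigan</span>
  <div>COURSE</div>
  <span class="ratings-text">4.8</span>
  <span class="enrollment-number">70k</span>
  <span class="difficulty">Beginner</span>
</div>
"""
sample_card = bs4.BeautifulSoup(sample_html, "html.parser").div

# All text fragments of the card, stripped and joined with "|".
joined = sample_card.get_text(separator="|", strip=True)
fragments = joined.split("|")

# The same fragments, produced directly as a list.
direct = list(sample_card.stripped_strings)

print(joined)
print(fragments == direct)
```

One caveat of the join-and-split approach: a course title that itself contains `|` would be split into two fragments and shift every index after it, which `stripped_strings` avoids.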

***

Now let's run the function getAllInfo() to see if our code is working.

In [49]:
df = getAllInfo('python')

Retrieving page: 1
Retrieving page: 2
Retrieving page: 3
Retrieving page: 4
Retrieving page: 5
Retrieving page: 6
Retrieving page: 7
Retrieving page: 8
Retrieving page: 9
Retrieving page: 10
Retrieving page: 11
Retrieving page: 12
Retrieving page: 13
Retrieving page: 14
Retrieving page: 15
Retrieving page: 16
Retrieving page: 17
Retrieving page: 18
Retrieving page: 19
Retrieving page: 20
Retrieving page: 21
Retrieving page: 22
Retrieving page: 23
Retrieving page: 24
Retrieving page: 25
Retrieving page: 26
Retrieving page: 27
Retrieving page: 28
Retrieving page: 29
Retrieving page: 30
Retrieving page: 31
Retrieving page: 32
Retrieving page: 33
Retrieving page: 34
Retrieving page: 35
Retrieving page: 36
Retrieving page: 37
Retrieving page: 38
Retrieving page: 39
Retrieving page: 40
Retrieving page: 41


Looks good.
Let's display the result.

In [50]:
df

| | Course | Provider | Type | Rating | Enrollment | Difficulty |
|---|---|---|---|---|---|---|
| 0 | Python for Everybody | University of Michigan | SPECIALIZATION | 4.8 | 1.2m | Beginner |
| 1 | Python 3 Programming | University of Michigan | SPECIALIZATION | 4.7 | 83k | Beginner |
| 2 | IBM Data Science | IBM | PROFESSIONAL CERTIFICATE | 4.6 | 400k | Beginner |
| 3 | Google IT Automation with Python | Google | PROFESSIONAL CERTIFICATE | 4.7 | 49k | Beginner |
| 4 | Applied Data Science with Python | University of Michigan | SPECIALIZATION | 4.5 | 420k | Intermediate |
| ... | ... | ... | ... | ... | ... | ... |
| 399 | Smart Analytics, Machine Learning, and AI on GCP | Google Cloud | COURSE | 4.5 | 2.1k | Intermediate |
| 400 | Introduction to Virtual Reality | University of London | COURSE | 4.7 | 14k | Beginner |
| 401 | Launching into Machine Learning en Español | Google Cloud | COURSE | | | |
| 402 | MongoDB Aggregation Framework | MongoDB Inc. | COURSE | 4.7 | 4.3k | Intermediate |


***
Finally, let's save the data frame to a .csv file.

In [51]:
df.to_csv('webscraping_6088118.csv', index=False)