# Udacity Courses Web Scraping

In this notebook I will try to extract information about every course listed on the [Udacity website](https://www.udacity.com).

![image.png](images/image.png)

### Import necessary libraries for extraction

We will use:
+ the `requests` library to fetch the site's HTML
+ `BeautifulSoup` from the `bs4` library to extract useful information from the HTML tags
+ `pandas` and `numpy` to build the DataFrame of all courses

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

### Extract the HTML of the site

In [3]:
req = requests.get('https://www.udacity.com/courses/all')

In [11]:
html_soup = BeautifulSoup(req.content, 'html.parser')

#### Extract all the courses' cards to get our needed information

In [42]:
all_cards = html_soup.find_all('div', class_='card-content')

#### Get every needed piece of information from that card

In [67]:
courses_links = []
courses_name = []
courses_level = []
courses_school = []
courses_skills = []

for card in all_cards:
    # Add Title and Link
    a_title = card.find('a', class_='capitalize')
    courses_links.append('https://www.udacity.com' + a_title['href'])
    courses_name.append(a_title.text.strip())
    # Add skill
    # A missing tag makes card.find(...) return None, so .text raises AttributeError
    try:
        skill = card.find('span', class_='truncate-content').text.strip()
        courses_skills.append(skill)
    except AttributeError:
        courses_skills.append(np.nan)
    # Add Level
    try:
        level = card.find('span', class_='capitalize').text
        courses_level.append(level)
    except AttributeError:
        courses_level.append(np.nan)
    # Add School
    try:
        school = card.find('h4', class_='category ng-star-inserted').text.strip()
        courses_school.append(school)
    except AttributeError:
        courses_school.append(np.nan)
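
The three `try`/`except` blocks above repeat one pattern: look up a tag, take its text, and fall back to a placeholder when the tag is missing. As a minimal sketch, that pattern could be factored into a helper. The name `text_or_default` and the sample HTML are hypothetical and only serve as illustration; they are not taken from the Udacity page.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, tag, class_name, default=None):
    """Return the stripped text of the first matching child, or `default` if none matches."""
    node = parent.find(tag, class_=class_name)
    return node.get_text(strip=True) if node is not None else default


# Made-up card markup, used only to exercise the helper
sample_html = """
<div class="card-content">
  <a class="capitalize" href="/course/example">Example Course</a>
  <span class="truncate-content"> Skill A, Skill B </span>
</div>
"""
card = BeautifulSoup(sample_html, 'html.parser').find('div', class_='card-content')

skills = text_or_default(card, 'span', 'truncate-content')
# There is no <span class="capitalize"> in the sample, so the default is returned
level = text_or_default(card, 'span', 'capitalize', default='unknown')
print(skills, '|', level)
```

In the loop above, passing `default=np.nan` would keep the same missing-value marker the notebook already uses.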

#### Load the extracted information into a DataFrame for further manipulation
This initial DataFrame will have 5 columns:
1. Course Name
2. Course Link
3. Course School
4. Course Skills
5. Course Level

In [68]:
df = pd.DataFrame({
    'Course Name': courses_name,
    'Course Link': courses_links,
    'Course School': courses_school,
    'Course Skills': courses_skills,
    'Course Level': courses_level,
})
df.head()

| Unnamed: 0 | Course Name | Course Link | Course School | Course Skills | Course Level |
|---|---|---|---|---|---|
| 0 | Establishing Data Infrastructure | https://www.udacity.com/course/establishing-da... | School of Business | Data Pipelines, Data Consumers, Data Producers... | intermediate |
| 1 | Intermediate JavaScript | https://www.udacity.com/course/intermediate-ja... | School of Programming | Functional Programming, DOM, Data Structures, ... | intermediate |
| 2 | Monetization Strategy | https://www.udacity.com/course/monetization-st... | School of Business | Product Management, Monetization Models, Prici... | intermediate |
| 3 | Applying Data Science to Product Management | https://www.udacity.com/course/applying-data-s... | School of Business | Data Science, Product Management, Data Visuali... | intermediate |
| 4 | Data Product Manager | https://www.udacity.com/course/data-product-ma... | School of Business | Data Science, Product Management, Product Desi... | intermediate |


## Extract Additional Information
Here we visit each course's own page to extract additional information about it.

In [397]:
types = []
costs = []
times = []
instructors = []

for link in df['Course Link']:
    page_req = requests.get(link)
    page_soup = BeautifulSoup(page_req.content, 'html.parser')
    ######################################################################
    ### There are 3 main types of course pages (Free - Single - Paid),   #
    ### so we try the extraction method for each page type in turn.      #
    ### Values are kept in local variables and appended once at the end, #
    ### so a partial failure cannot leave the four lists misaligned.     #
    ######################################################################
    try:
        type_ = page_soup.find('h6', class_='hero__course--type bar--bottom bar--green').text
        cost = 'Free'
        timeline = page_soup.find_all('div', class_='col')[1].find('h5').text[7:].strip()
        try:
            instructors_names = page_soup.find('div', class_='instructors__list').find_all('h3', class_='h5 instructor--name')
            names = ', '.join([_.text.strip() for _ in instructors_names])
        except AttributeError:
            names = np.nan

    except (AttributeError, IndexError):
        try:
            type_ = page_soup.find('h6', class_='hero__label').text
            cost = 'Free'
            timeline = page_soup.find('div', class_='details__overview__item ng-star-inserted').text[17:].strip()
            try:
                instructors_names = page_soup.find_all('div', class_='leads__instructor ng-star-inserted')
                names = ', '.join([_.text[:-12].strip() for _ in instructors_names])
            except AttributeError:
                names = np.nan

        except AttributeError:
            type_ = page_soup.find('div', class_='content__header').find('h6').text
            cost = np.nan
            timeline = page_soup.find('ul', class_='column-list').find('li').find('h5').text
            instructors_names = page_soup.find_all('h5', class_='instructor__name')
            names = ', '.join([_.text.strip() for _ in instructors_names])

    # Append exactly one value per course to each list
    types.append(type_)
    costs.append(cost)
    times.append(timeline)
    instructors.append(names)
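
The loop above sends one request per course and stops at the first network error. A minimal sketch of a retry wrapper follows, assuming transient failures are worth retrying after a short pause. The helper `fetch_with_retries` and its parameters are hypothetical, not part of `requests`; the fetch function is passed in so the sketch can be tried without network access.

```python
import time


def fetch_with_retries(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying up to `retries` times with a pause between attempts."""
    if retries < 1:
        raise ValueError('retries must be at least 1')
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as error:  # a real scraper might catch a narrower exception type
            last_error = error
            if attempt < retries - 1:
                sleep(delay)  # pause before the next attempt
    raise last_error


# Stand-in for a real fetch function: fails twice, then succeeds
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError('temporary failure')
    return 'ok:' + url

result = fetch_with_retries(flaky_fetch, 'https://example.com', delay=0)
print(result, len(calls))
```

In the notebook this could wrap the existing call, for example `fetch_with_retries(lambda u: requests.get(u, timeout=10), link)`.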


Merging the new dataframe with the first one

In [399]:
df2 = pd.DataFrame({'Course Type': types, 'Course Cost': costs, 'Course Lenght': times, 'Instructors': instructors})
df = pd.concat([df, df2], axis=1)
df.head()

| Unnamed: 0 | Course Name | Course Link | Course School | Course Skills | Course Level | Course Type | Course Cost | Course Lenght | Instructors |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Establishing Data Infrastructure | https://www.udacity.com/course/establishing-da... | School of Business | Data Pipelines, Data Consumers, Data Producers... | intermediate | COURSE TWO OF THREE | | 1 Month | Vaishali Agarwal |
| 1 | Intermediate JavaScript | https://www.udacity.com/course/intermediate-ja... | School of Programming | Functional Programming, DOM, Data Structures, ... | intermediate | Nanodegree Program | | 3 months | Alyssa Hope, Rachel Manning, Andrew Wong, Rich... |
| 2 | Monetization Strategy | https://www.udacity.com/course/monetization-st... | School of Business | Product Management, Monetization Models, Prici... | intermediate | COURSE THREE OF THREE | | 1 Month | Rizwan Ansary |
| 3 | Applying Data Science to Product Management | https://www.udacity.com/course/applying-data-s... | School of Business | Data Science, Product Management, Data Visuali... | intermediate | COURSE ONE OF THREE | | 1 Month | JJ Miclat |
| 4 | Data Product Manager | https://www.udacity.com/course/data-product-ma... | School of Business | Data Science, Product Management, Product Desi... | intermediate | Nanodegree Program | | 3 Months | JJ Miclat, Vaishali Agarwal, Anne Rynearson |
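
`pd.concat(..., axis=1)` aligns rows by index and does not raise an error when the two frames differ in length; the shorter one is padded with `NaN`. A small sketch with made-up data shows why it may be worth checking the lengths before merging:

```python
import pandas as pd

# Made-up frames of different lengths, for illustration only
left = pd.DataFrame({'Course Name': ['A', 'B', 'C']})
right = pd.DataFrame({'Course Type': ['x', 'y']})

merged = pd.concat([left, right], axis=1)
# The third row has no matching entry in `right`, so it is filled with NaN
missing = int(merged['Course Type'].isna().sum())
same_length = len(left) == len(right)
print(merged.shape, missing, same_length)
```

A check such as `assert len(df) == len(df2)` before the merge would surface a skipped or duplicated course early.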


In [414]:
# Save the dataframe in a csv file
df.to_csv('courses.csv')
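
By default `to_csv` also writes the row index as an extra column without a header, which shows up as `Unnamed: 0` when the file is read back with `pd.read_csv`. The sketch below uses a made-up two-row frame and an in-memory buffer instead of a file to compare the default with `index=False`:

```python
import io

import pandas as pd

# Made-up data, for illustration only
sample = pd.DataFrame({'Course Name': ['A', 'B'],
                       'Course Level': ['beginner', 'advanced']})

# Default: the index is written as a first column with an empty header
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

If the index carries no information, `df.to_csv('courses.csv', index=False)` keeps the saved file to the data columns only.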

# Conclusion

We now have a dataset of all Udacity courses and their details, saved in a CSV file.
