# Scraping a Website for Its Courses and Course Information


## Intro to Web Scraping
- Web scraping is a technique for fetching data from websites.
- One way to collect the data is to copy and paste it manually, which is both tedious and time-consuming. Web scraping automates the process of extracting data from websites.

## Tools I Used
- Python
    - requests library
    - BeautifulSoup 
    - Pandas

## I am going to scrape 'https://online.stanford.edu'


## Project Outline (Step by Step: How the Code Was Written)
- I am going to use the requests library to download the web pages.
- Then I am going to decide which information I want to scrape.
- I will use BeautifulSoup to parse the HTML code of each page.
- Then I will identify the tags that hold the information I want to scrape.
- I will store the results in lists, put the lists into a dictionary, and create a data frame from it using pandas.
- Finally, I will save the data frame to a CSV file with the columns Topic_Titles, School_Name, and Link.


In [13]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


### Creating Several Functions, Each Performing One Task

- The function below uses the get() method from the requests library to download a web page.
- It then stores the text of the response returned by get().
- Finally, it parses that text with BeautifulSoup so that we can extract information from it.

In [14]:
def parsing_url(URL):

    # Send a browser-like User-Agent header, because the site may reject
    # requests that identify themselves as an automated client
    Headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36'
    }

    # requests.get() returns a Response object, which we keep in a variable
    response = requests.get(URL, headers=Headers)

    # A status code of 200 means the web page was downloaded successfully
    if response.status_code != 200:
        raise Exception("Failed to Load Page")

    # Parse the HTML code using BeautifulSoup
    content = response.text
    doc = BeautifulSoup(content, 'html.parser')
    return doc
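
As a hedged variation on `parsing_url`, the sketch below adds a request timeout and lets `requests` raise its own error for 4xx and 5xx responses through `raise_for_status()`. The name `fetch_page`, the 10-second timeout, and the shortened User-Agent string are assumptions made for illustration; they are not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_page(url, timeout=10):
    """Download a page and return it parsed (hypothetical helper)."""
    # Assumed, shortened User-Agent; the notebook sends a full Chrome string.
    headers = {'User-Agent': 'Mozilla/5.0'}
    # timeout stops the call from waiting forever on an unresponsive server.
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')
```

Calling `fetch_page(url)` would return the same kind of object as `parsing_url(url)`, so the rest of the functions could use it unchanged.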

### The function below finds the tags that contain the topic names
- To find these tags, right-click a topic name in the browser and choose Inspect.


In [15]:
def extract_topic_tags(doc):
    h3_tags = doc.find_all('h3')
    return h3_tags

### After finding the tags, we read the text of each one, using a loop to collect every topic name on the page
- I start with an empty list in which to store the topic names.

- In the loop, I go through each tag and append its text to the list.

- Then I return the list.

In [16]:
def get_topics(h3_tags):
    topics_list = []
    for tag in h3_tags:
        topics_list.append(tag.text)
    return topics_list
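
To see what `find_all('h3')` and `.text` return without downloading anything, here is a minimal sketch that runs on a made-up HTML snippet. The snippet, its course names, and its nesting are assumptions chosen to resemble a catalog page; the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a catalog page.
sample_html = """
<a href="/courses/intro-to-ai"><div><h3>Intro to AI</h3></div></a>
<a href="/courses/statistics"><div><h3>Statistics</h3></div></a>
"""

doc = BeautifulSoup(sample_html, 'html.parser')

# find_all() returns every matching tag in document order.
h3_tags = doc.find_all('h3')

# .text gives the text inside each tag.
topics = [tag.text for tag in h3_tags]
print(topics)  # ['Intro to AI', 'Statistics']
```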

### The function below finds the tags that hold the URL of each course
- The parameter list takes doc, the parsed HTML of the whole page, and topic_tags, the list of tags that contain the topic names. Only topic_tags is used in the function body.

- By right-clicking a course link and choosing Inspect, I observed that the tag holding the URL is the parent of the parent of the tag containing the topic name.

- So I start with an empty list to store the tags that hold the URLs.

- Then I go through the list of topic tags and collect the grandparent of each one.

- Then I return the list.

In [17]:
def extract_url_tags(doc,topic_tags):
    a_tags = [] # for storing the tags which contains url
    for tag in topic_tags:
        a_tags.append(tag.parent.parent)
    return a_tags
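
The parent-of-parent step can be tried on a small made-up snippet. The sketch below assumes an `<a>` that wraps a `<div>` that wraps the `<h3>`, which matches the observation described above but is not copied from the real page. It also shows `find_parent('a')`, a BeautifulSoup method that walks upward until it reaches an `<a>` tag, as a possible alternative if the nesting depth ever changes.

```python
from bs4 import BeautifulSoup

# Made-up markup: the link wraps a div, which wraps the topic heading.
sample_html = '<a href="/courses/intro-to-ai"><div><h3>Intro to AI</h3></div></a>'
doc = BeautifulSoup(sample_html, 'html.parser')
h3 = doc.find('h3')

# Two steps up: h3 -> div -> a, as in extract_url_tags.
grandparent = h3.parent.parent

# find_parent('a') searches the ancestors for the nearest <a> tag.
link_tag = h3.find_parent('a')

print(grandparent.name)   # a
print(link_tag['href'])   # /courses/intro-to-ai
```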


### The function below extracts the URLs from the tags we collected
- First, I create an empty list to store the URL of each course.

- I define the base URL, which is common to every URL on the site.

- Then I go through every tag, read its 'href' attribute, and append the value to the base URL to form the full URL.

- Then I return the list of URLs.

In [18]:
def get_urls(url_tags):
    urls_list = []
    base_url = 'https://online.stanford.edu'
    for tag in url_tags:
        urls_list.append(base_url + tag['href'])
    return urls_list
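
String concatenation works when every `href` is a path starting with `/`. As a hedged alternative, the standard library's `urljoin` also copes with an `href` that is already a full URL. The `href` values below are hypothetical examples, not links taken from the site.

```python
from urllib.parse import urljoin

base_url = 'https://online.stanford.edu'

# Hypothetical href values: one relative path and one absolute URL.
hrefs = [
    '/courses/example-course',
    'https://online.stanford.edu/courses/other-course',
]

# urljoin resolves a relative path against the base and leaves absolute URLs unchanged.
urls = [urljoin(base_url, href) for href in hrefs]
for url in urls:
    print(url)
```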

### The name of the school is on the course pages whose URLs we scraped, so now we scrape every one of those pages
- First, I create an empty list to store the school names.

- Then I go through the list of URLs and call parsing_url(url) to download and parse each course page.

- By right-clicking the school name and choosing Inspect, we can see that it sits inside a p tag with a specific class.

- If the page contains a p tag with that class:
    - I take the first occurrence of that p tag.

    - Then I search it for the first a tag, because that tag contains the school name (the information we want).

    - I append the text of that tag to the list.

- Otherwise, I append None so that the list stays the same length as the list of URLs.

- Then I return the list of school names.

In [19]:
def get_school_names(List_Of_Urls):
    schools = []
    for url in List_Of_Urls:
        # Parse each course page, because the school name is on the course page itself
        page_doc = parsing_url(url)
        # find() returns the first matching p tag, or None if the page has none
        p_tag = page_doc.find('p', class_='text-red text-strong')
        # The first a tag inside that p tag holds the school name (the info we want)
        a_tag = p_tag.find('a') if p_tag else None
        # Store None when no school name is found, so the list stays aligned with the URLs
        schools.append(a_tag.text if a_tag else None)
    return schools

### Creating a DataFrame with pandas from a dictionary

- I create a dictionary in which each key is a column name and each value is the list holding the data for that column.

- Then I return the data frame built from that dictionary.

In [20]:
def converting_into_Dataframe():
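    # Build the data frame from a dictionary; Topic_list, school_name and url_list
    # are the module-level lists that the notebook fills in before calling this function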
    dict_scrapped_things = {
        'Topic_Titles' : Topic_list,
        'School_Name' : school_name,
        'Link' : url_list
    }
    return pd.DataFrame(dict_scrapped_things)
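
The dictionary-to-DataFrame step can be sketched with a few made-up rows. The course names, school name, and links below are placeholders, not scraped data. Every list in the dictionary must have the same length or pandas raises a `ValueError`, which is why a missing school name is stored as `None` rather than skipped.

```python
import pandas as pd

# Placeholder values standing in for the scraped lists.
topics = ['Course A', 'Course B']
schools = ['School X', None]
links = [
    'https://online.stanford.edu/courses/a',
    'https://online.stanford.edu/courses/b',
]

# Each key becomes a column name and each list becomes that column's values.
df = pd.DataFrame({'Topic_Titles': topics, 'School_Name': schools, 'Link': links})
print(df.shape)  # (2, 3)
```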


The following code calls all the functions defined above.

- It goes through 8 different catalog pages and performs these tasks on each page:

    - Parsing the URL

    - Collecting the topic tags and the URL tags

    - Collecting the topic names from the topic tags and the URLs from the URL tags

- After the loop, it collects the school names from the course URLs and converts everything into a data frame.


In [21]:
Topic_list = []
url_list = []
school_name = []

dummy_url='https://online.stanford.edu/search-catalog?type=All&topics[1054]=1054&topics[1049]=1049&topics[1066]=1066&topics[1069]=1069&topics[1070]=1070&topics[1059]=1059&topics[1047]=1047&topics[1057]=1057&topics[1064]=1064&topics[1073]=1073&topics[1062]=1062&topics[1060]=1060&topics[1065]=1065&topics[1063]=1063&topics[1061]=1061&topics[1094]=1094&topics[1043]=1043&topics[1050]=1050&topics[1048]=1048&topics[1072]=1072&topics[1045]=1045&topics[1042]=1042&topics[1046]=1046&topics[1044]=1044&topics[1055]=1055&topics[1071]=1071&topics[1053]=1053&topics[1052]=1052&topics[1068]=1068&topics[1067]=1067&topics[1098]=1098&topics[1058]=1058&topics[1079]=1079&topics[1051]=1051&topics[1056]=1056&free_or_paid[free]=free&page={k:d}'

######################################## LOOP BEGINS HERE #################################################################    

# range(8) because the catalog results we scrape span 8 pages
for i in range(8):
    # The page URLs differ only in the number at the end, so format() fills it in
    url = dummy_url.format(k=i)

    # doc is the parsed HTML code
    doc = parsing_url(url)

    # Collect the tags for the topic titles, then extract the topic names from them
    Topic_tags = extract_topic_tags(doc)
    Topic_list += get_topics(Topic_tags)

    # Collect the tags for the topic URLs, then extract the link for each topic
    url_tags = extract_url_tags(doc, Topic_tags)
    url_list += get_urls(url_tags)

######################################## LOOP ENDS HERE #################################################################    

# Collecting the names of school of each course
school_name = get_school_names(url_list)

# Converting the collected info into data frame using pandas
Topics_Dataframe = converting_into_Dataframe()
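
The way `format()` fills in the page number can be seen in isolation. The template below is a shortened, hypothetical stand-in for the long catalog URL; only the trailing `page={k:d}` placeholder is taken from the notebook.

```python
# Shortened stand-in for the catalog URL; {k:d} is replaced by an integer.
template = 'https://online.stanford.edu/search-catalog?type=All&page={k:d}'

# Pages are numbered from 0, so range(8) yields page=0 through page=7.
page_urls = [template.format(k=i) for i in range(8)]

print(page_urls[0])   # https://online.stanford.edu/search-catalog?type=All&page=0
print(page_urls[-1])  # https://online.stanford.edu/search-catalog?type=All&page=7
```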

### NOTE: to_csv() overwrites an existing file of the same name. If the write is denied, the file is most likely open in another program (such as a spreadsheet application); close it and run the cell again

### Converting the data frame to a CSV file using the to_csv('name.csv') method

In [22]:
# Write the data frame we created to a CSV file, leaving out the row index

Topics_Dataframe.to_csv('Coursess_Stanford.csv', index=False)
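
A minimal sketch of the save step, using a one-row placeholder data frame and a temporary directory so that nothing is left on disk. The file name `example.csv` and the row values are assumptions for illustration. Writing to the same path twice replaces the file, because `to_csv()` opens it in write mode by default.

```python
import os
import tempfile

import pandas as pd

# One placeholder row with the same columns as the scraped data frame.
df = pd.DataFrame({
    'Topic_Titles': ['Course A'],
    'School_Name': ['School X'],
    'Link': ['https://online.stanford.edu/courses/a'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.csv')
    df.to_csv(path, index=False)
    # A second write to the same path replaces the first file.
    df.to_csv(path, index=False)
    # Read the file back to check what was written.
    loaded = pd.read_csv(path)

print(loaded.columns.tolist())  # ['Topic_Titles', 'School_Name', 'Link']
```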