# Web scraping
Web scraping is an automated way of extracting large amounts of data from websites. The extracted data can then be saved to a file on your computer or loaded into a spreadsheet.

In [None]:
# Imports
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import re
import os


%matplotlib inline

### Getting the website and extracting the data

In [None]:
#riturl="http://www.rit.ac.in"
#riturl="https://purdue.edu"
riturl='https://scholarships.gov.in/'
webpage = requests.get(riturl)
ritsoup = BeautifulSoup(webpage.content, "html.parser")
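The request above assumes the server answers quickly and successfully. A minimal sketch of a more defensive fetch is shown below; the helper name `fetch_soup` and the 10-second default timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download `url` and return a BeautifulSoup object, or None on failure.

    `fetch_soup` is a hypothetical helper; the timeout value is an assumption.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for 4xx/5xx answers instead of parsing an error page.
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs and HTTP errors.
        print(f"Could not fetch {url}: {exc}")
        return None
    return BeautifulSoup(response.content, "html.parser")
```

With this helper, `ritsoup = fetch_soup(riturl)` would behave like the cell above when the request succeeds, and return `None` otherwise.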

In [None]:
print(ritsoup)


<!DOCTYPE html>

<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width" name="viewport">
<title>Home - National Scholarship Portal</title>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<link href="/public/Content/img/favicon.ico" rel="icon" type="image/png"/>
<link href="/public/Content/css/bootstrap.min.css" rel="stylesheet"/>
<link href="/public/Content/css/own.css" rel="stylesheet"/>
<link href="/public/Content/css/plugins-2.1.css" rel="stylesheet"/>
<link href="/public/Content/css/menudropsub.css" rel="stylesheet"/>
<link href="/public/Content/css/opensans.css" rel="stylesheet"/>
<link href="/public/Content/FaIcons/css/font-awesome.min.css" rel="stylesheet"/>
<script src="/public/Content/js/popper.min.js"></script>
<script src="/public/Content/js/jquery.js"></script>
<scr

The snippet above uses the requests library to store the server's reply in a requests.models.Response variable called "webpage", and then uses BeautifulSoup with html.parser to parse the content of the website into a variable called "ritsoup". BeautifulSoup objects are structured representations of web pages. They have some properties of nested dicts, but you can also perform different kinds of traversal of the structure.
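The dict-like access and the traversal mentioned above can be sketched on a small inline page, so the example does not depend on any live website. The HTML string below is made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page used only for this illustration.
html = """
<html><head><title>Demo page</title></head>
<body>
  <div id="menu"><a href="/home">Home</a><a href="/about">About</a></div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute-style access returns the first matching tag.
first_link = soup.a

# Dict-style access reads an attribute of the tag.
href = first_link["href"]                                # '/home'

# Traversal: up to the parent, sideways to a sibling, down to the children.
parent_id = first_link.parent["id"]                      # 'menu'
next_text = first_link.find_next_sibling("a").string     # 'About'
child_names = [child.name for child in soup.div.children]  # ['a', 'a']

print(href, parent_id, next_text, child_names)
```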

In [None]:
# Title of the parsed page
ritsoup.title

<title>Purdue University - Indiana's Land Grant University</title>

In [None]:
# Find all links
links = [link.get('href') for link in ritsoup.find_all('a')]
print(links)
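The list returned above can contain relative paths, in-page anchors, and `None` for `<a>` tags without an `href`. One possible clean-up step, sketched with the standard library's `urljoin`, is shown below. The sample `raw_links` values and the helper name `absolute_http_links` are hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin

base_url = "https://scholarships.gov.in/"

# Hypothetical href values of the kind find_all('a') may return.
raw_links = [
    "/public/faq",               # relative path (made up)
    "#main",                     # in-page anchor
    None,                        # <a> tag without an href attribute
    "https://example.com/page",  # already absolute
    "mailto:help@example.com",   # not a web page
]


def absolute_http_links(base, hrefs):
    """Resolve hrefs against `base` and keep only http(s) links."""
    cleaned = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # skip missing hrefs and in-page anchors
        full = urljoin(base, href)
        if full.startswith(("http://", "https://")):
            cleaned.append(full)
    return cleaned


clean_links = absolute_http_links(base_url, raw_links)
print(clean_links)
```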



### Accessing elements

In [None]:
ritsoup.h1

<h1 class="sr-only">Purdue University</h1>

In [None]:
ritsoup.h1.name

'h1'

In [None]:
ritsoup.head

In [None]:
ritsoup.head.meta

<meta content="dUyYAkgfYaJZPg1QMpbqQU1ve7YG0t0qNb9_8zT1Xgo" name="google-site-verification"/>

### Finding things
The most common method for finding things is find_all. Here we call .find_all on the ritsoup variable that we declared before; it holds the parsed content of the website, so we search it for the elements we need. In the code below you can replace "p" with any other HTML tag name, replace "class" with another attribute such as "id", and replace "hide" with the specific class or id value that you want to find.

In [None]:
# Gets all the <p> elements:

paragraphs = ritsoup.find_all("p")
print(paragraphs)

[<p class="hide">Find Info For</p>, <p class="hide">Quick Links</p>, <p class="social_description" id="description-social">Follow @LifeAtPurdue to see what is happening around campus.</p>, <p>Purdue University, 610 Purdue Mall, West Lafayette, IN, 47907, 765-494-4600</p>, <p><a href="https://www.purdue.edu/purdue/disclaimer.php">© 2022 Purdue University</a> | <a href="https://www.purdue.edu/purdue/ea_eou_statement.php">An equal access/equal opportunity university</a> | <a href="https://www.purdue.edu/purdue/about/integrity_statement.php">Integrity Statement</a> | <a href="https://www.purdue.edu/securepurdue/security-programs/copyright-policies/reporting-alleged-copyright-infringement.php" target="_blank">Copyright Complaints</a> | <a href="https://www.purdue.edu/brand/" target="_blank">Brand Toolkit</a> | <a href="https://www.purdue.edu/marketing/">Maintained by Purdue Marketing and Communications</a></p>, <p>Contact Purdue Marketing and Communications at <a href="mailto:digital-market

In [None]:
# Gets all the p elements with a "class" attribute with value "hide":

ritsoup.find_all("p", attrs={"class": "hide"})

[<p class="hide">Find Info For</p>, <p class="hide">Quick Links</p>]
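Besides the `attrs` dictionary, BeautifulSoup also accepts keyword filters such as `class_` and `id`, and CSS selectors through `select`. The sketch below runs on a small made-up HTML fragment rather than the live page, so the tag contents are assumptions.

```python
from bs4 import BeautifulSoup

# Made-up fragment used only for this illustration.
html = """
<div id="footer">
  <p class="hide">Quick Links</p>
  <p class="note hide">Hidden note</p>
  <p id="address">610 Example Street</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any <p> whose class list includes "hide".
by_class = soup.find_all("p", class_="hide")

# find returns only the first match; here we filter on the id attribute.
by_id = soup.find(id="address")

# select takes a CSS selector and returns a list of matches.
by_css = soup.select("div#footer p.hide")

print(len(by_class), by_id.string, [p.get_text() for p in by_css])
```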

In [None]:
# We can use regular expressions too!
# This matches every <p> or <a> tag and keeps the first three results.
ritsoup.find_all(re.compile("^(p|a)$"))[:3]

[<a class="nav nav-skipto" href="#main">Skip to main content</a>,
 <a class="dropdown-toggle" data-toggle="dropdown" href="#"><span class="sr-only">Search</span><i aria-hidden="true" class="fa fa-search fa-lg"></i></a>,
 <a class="dropdown-toggle" data-toggle="dropdown" href="#">Find Info For <b class="caret"></b></a>]
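A compiled regular expression can also be used as an attribute filter, not only as a tag-name filter. The sketch below keeps only links whose `href` starts with `http://` or `https://`; the HTML fragment is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment used only for this illustration.
html = """
<a href="https://example.com/a">External</a>
<a href="/local">Local</a>
<a href="mailto:someone@example.com">Mail</a>
<a>No href</a>
"""
soup = BeautifulSoup(html, "html.parser")

# The pattern is searched in the href value; tags without an href never match.
external = soup.find_all("a", href=re.compile(r"^https?://"))
external_texts = [a.string for a in external]
print(external_texts)
```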

### Getting the string from elements

If an element doesn't contain any other HTML tags, then the string attribute will give you the intuitive string content:

In [None]:
ritsoup.h1.string

'Purdue University'

In [None]:
# The contents attribute is similar but always returns a list:

ritsoup.h1.contents

['Purdue University']

In [None]:
# If the element contains text mixed with other tags, then string will return None.
# paragraphs[2] holds only text, so string still works on it:
paragraphs[2]

<p class="social_description" id="description-social">Follow @LifeAtPurdue to see what is happening around campus.</p>

In [None]:
paragraphs[2].string

'Follow @LifeAtPurdue to see what is happening around campus.'
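The paragraph used above contains only text, so it does not show the `None` case. A small sketch on made-up HTML illustrates the three situations: text only, text mixed with a tag, and a single nested tag.

```python
from bs4 import BeautifulSoup

# Made-up fragment used only for this illustration.
html = (
    '<p id="plain">Only text</p>'
    '<p id="mixed">Contact <a href="/us">us</a> today</p>'
    '<p id="wrapped"><b>Bold</b></p>'
)
soup = BeautifulSoup(html, "html.parser")

plain = soup.find(id="plain")
mixed = soup.find(id="mixed")
wrapped = soup.find(id="wrapped")

print(plain.string)      # text only: the text itself
print(mixed.string)      # several children: None
print(wrapped.string)    # exactly one child tag: that child's string
print(mixed.get_text())  # get_text joins all nested strings
```

When an element may contain nested tags, `get_text()` is usually the safer way to read its text.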

In [None]:
# However, contents returns a list as before, which can mix strings and tags:
para2 = paragraphs[2].contents
para2

['Follow @LifeAtPurdue to see what is happening around campus.']

In [None]:
para7 = paragraphs[5].contents
para7

['Contact Purdue Marketing and Communications at ',
 <a href="mailto:digital-marketing@groups.purdue.edu?subject=Accessibility Issue with Your Webpage">digital-marketing@groups.purdue.edu</a>,
 ' for accessibility issues with this page | ',
 <a href="https://www.purdue.edu/disabilityresources/">Accessibility Resources</a>,
 ' | ',
 <a href="https://www.purdue.edu/purdue/contact-us">Contact Us</a>]

In [None]:
para7[1]['href']

'mailto:digital-marketing@groups.purdue.edu?subject=Accessibility Issue with Your Webpage'

In [None]:
para7[1].string

'digital-marketing@groups.purdue.edu'
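Reading `href` and `string` from one link at a time extends naturally to collecting every link of a paragraph into rows that can be saved as CSV and opened in a spreadsheet. This is a sketch under assumptions: the HTML fragment and the column names `text` and `href` are made up, and the CSV is written to an in-memory buffer instead of a file.

```python
import csv
import io
from bs4 import BeautifulSoup

# Made-up fragment used only for this illustration.
html = (
    '<p>Contact <a href="mailto:team@example.com">team@example.com</a> | '
    '<a href="https://example.com/help">Help</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# One row per link: visible text plus target address.
rows = [
    {"text": a.get_text(strip=True), "href": a.get("href")}
    for a in soup.p.find_all("a")
]

# Write to an in-memory buffer; open("links.csv", "w", newline="") would write a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "href"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```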

In [None]:
#You can also use stripped_strings, which is a generator over all the strings (tags removed)
#inside the element; this is a fast way to extract the raw texts, with all tag soup strained off:
for s in paragraphs[3].stripped_strings:
    print("="*50)
    print(s)

==================================================
Purdue University, 610 Purdue Mall, West Lafayette, IN, 47907, 765-494-4600
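`stripped_strings` is most useful on elements that mix text and tags, where it yields each piece of text with the surrounding whitespace removed. The sketch below uses a made-up paragraph and joins the pieces into one line.

```python
from bs4 import BeautifulSoup

# Made-up fragment used only for this illustration.
html = """
<p>
  Contact <a href="/us">Contact Us</a>
  | <a href="/help">Accessibility Resources</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Each text node is stripped; whitespace-only nodes are skipped.
pieces = list(soup.p.stripped_strings)

# Joining the pieces gives one clean line of text.
joined = " ".join(pieces)
print(pieces)
print(joined)
```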


## Advanced web scraping tools

**[Scrapy](https://scrapy.org)** is a Python framework for large-scale web scraping. It gives you the tools you need to efficiently extract data from websites, process it as you want, and store it in your preferred structure and format.

**[ARGUS](https://github.com/datawizard1337/ARGUS)** is an easy-to-use web mining tool that's built on Scrapy. It is able to crawl a broad range of different websites.

**[Selenium](https://selenium-python.readthedocs.io/index.html)** is an umbrella project encapsulating a variety of tools and libraries enabling web browser automation. Selenium specifically provides infrastructure for the W3C WebDriver specification, a platform- and language-neutral coding interface compatible with all major web browsers. We can use it to imitate a user's behaviour and interact with JavaScript-driven elements (buttons, sliders, etc.).

In [None]:
%pip install scrapy

In [None]:
import scrapy


In [None]:
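class BlogSpider(scrapy.Spider):
    # Name used to refer to this spider when running it
    name = 'blogspider'
    # The crawl starts from these URLs
    start_urls = ['https://www.zyte.com/blog/']

    def parse(self, response):
        # Yield one item per post title found on the page
        for title in response.css('.oxy-post-title'):
            yield {'title': title.css('::text').get()}

        # Follow the "next page" link, if any, and parse it the same way
        for next_page in response.css('a.next'):
            yield response.follow(next_page, self.parse)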