<a href="https://colab.research.google.com/github/AkashBabu1712/Web-Scrapping/blob/main/Web_Scrapping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**What is Web Scraping?**

*Web scraping (or data scraping) is a technique used to collect content and data from the internet. This data is usually saved in a local file so that it can be manipulated and analyzed as needed. If you’ve ever copied and pasted content from a website into an Excel spreadsheet, this is essentially what web scraping is, but on a very small scale.*

####**How does a web scraper function?**

* Step 1: Making an HTTP request to a server
* Step 2: Extracting and parsing (or breaking down) the website’s code
* Step 3: Saving the relevant data locally
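
The three steps above can be sketched in a few lines. This is a minimal illustration, not the notebook's final code: the `SAMPLE_HTML` string and the helper names `parse_titles` and `save_rows` are invented for the example, and the inline string stands in for the body of an HTTP response so that the sketch runs without network access.

```python
import csv
import io

from bs4 import BeautifulSoup

# Step 1 (stand-in): in a real scraper this text would come from an HTTP
# response, e.g. requests.get(url).text. A hypothetical page is inlined here.
SAMPLE_HTML = """
<html><body>
  <h2 class="title">First post</h2>
  <h2 class="title">Second post</h2>
</body></html>
"""


def parse_titles(html):
    """Step 2: parse the markup and collect the text of every <h2 class="title">."""
    soup = BeautifulSoup(html, "html.parser")
    return [{"title": h2.get_text(strip=True)} for h2 in soup.find_all("h2", class_="title")]


def save_rows(rows, file_obj):
    """Step 3: write the extracted rows as CSV to any writable text file object."""
    writer = csv.DictWriter(file_obj, fieldnames=["title"])
    writer.writeheader()
    writer.writerows(rows)


rows = parse_titles(SAMPLE_HTML)
buffer = io.StringIO()  # swap in open("titles.csv", "w", newline="") to save to disk
save_rows(rows, buffer)
print(buffer.getvalue())
```

Keeping the parse and save steps in separate functions makes it easy to test each one on a small, fixed piece of HTML before pointing the scraper at a live site.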

###**How to scrape the web (step-by-step)**



1.   Find the URLs you want to scrape
2.   Inspect the page
3.   Identify the data you want to extract
4.   Write the necessary code
5.   Execute the code
6.   Store the data



###**What tools can you use to scrape the web?**

* BeautifulSoup : Used to parse data from XML and HTML documents. 

* Scrapy : Framework that crawls and extracts structured data from the web.

* Pandas : Data analysis library, often used alongside BeautifulSoup to clean, tabulate, and export the scraped data.

* Parsehub : Free online tool (to be clear, this one’s not a Python library) that makes it easy to scrape online data.


In [73]:
# Install the libraries
# The Requests library provides simple functions for sending HTTP GET and POST requests.
# For example, the function that sends an HTTP GET request is aptly named get().
# html5lib is the HTML parser used later, and bs4 provides Beautiful Soup.
!pip install requests
!pip install html5lib
!pip install bs4

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [74]:
#Importing the libraries
import numpy as np
import pandas as pd
import requests


###**Beautiful Soup**

*Beautiful Soup is a Python library that works with a parser to extract data from HTML and XML, and it can turn even invalid markup into a parse tree. It only parses: it cannot fetch pages from a web server, so it is paired with a library such as Requests that downloads the HTML first.*
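
As a small illustration of the "invalid markup" point, the sketch below feeds Beautiful Soup a fragment with unclosed tags. It uses Python's built-in `html.parser` backend rather than html5lib so that it runs without extra installs, and the fragment itself is invented for the example. Different parsers may repair broken markup in different ways.

```python
from bs4 import BeautifulSoup

# Invalid fragment: neither <p> nor <b> is ever closed.
broken = "<p>Hello <b>world"

soup = BeautifulSoup(broken, "html.parser")

# The parse tree is still navigable: the <b> tag sits inside the <p> tag.
bold_text = soup.b.get_text()
paragraph_text = soup.p.get_text()
print(bold_text)
print(paragraph_text)
```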



In [75]:
#Accessing the HTML content from webpage 
#URL = "https://www.ambitionbox.com/list-of-companies?page=1"
#r = requests.get(URL)
#print(r.content)

In [76]:
#Some servers reject requests that lack a browser-like User-Agent header, so send one explicitly
URL = "http://www.values.com/inspirational-quotes"
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246"}
# This User-Agent string identifies the Edge browser on Windows 10; substitute your own browser's string if you prefer.
r = requests.get(url=URL, headers=headers)
print(r.content)

b'<!DOCTYPE html>\n<html class="no-js" dir="ltr" lang="en-US">\n<head>\n  <meta charset="utf-8">\n  <meta http-equiv="content-type" content="text/html; charset=utf-8" />\n  <meta http-equiv="X-UA-Compatible" content="IE=edge" />\n  <meta name="viewport" content="width=device-width,initial-scale=1.0" />\n  <title>Inspirational Quotes - Motivational Quotes - | The Foundation for a Better Life</title>\n<meta name="description" content="Find the perfect quotation from our hand-picked collection of inspiring quotes by hundreds of authors.">\n<meta name="keywords" content="pass, it, on, passiton, values, kindness">\n<meta name="twitter:site_name" content="The Foundation for a Better Life">\n<meta name="twitter:site" content="@passiton_values">\n<meta name="twitter:card" content="summary">\n<meta name="twitter:description" content="Thank you for visiting.">\n<meta name="twitter:image" content="https://www.passiton.com/passiton_fbl.jpg">\n<meta property="og:url" content="https://www.passiton.c
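
To see what the `headers` argument changes without contacting any server, a request can be prepared but not sent. The sketch below assumes only the `requests` library already used above; the shortened User-Agent string is a placeholder, not a real browser string.

```python
import requests

URL = "http://www.values.com/inspirational-quotes"
headers = {"User-Agent": "Mozilla/5.0 (placeholder browser string)"}

# Build the request objects locally; nothing is sent over the network.
default_request = requests.Request("GET", URL).prepare()
custom_request = requests.Request("GET", URL, headers=headers).prepare()

# A bare Request carries no User-Agent of its own. When a request is actually
# sent, requests fills in a default that identifies the client as python-requests.
print(default_request.headers.get("User-Agent"))
print(requests.utils.default_user_agent())
print(custom_request.headers["User-Agent"])
```

With a real response it is also worth checking `r.status_code`, or calling `r.raise_for_status()`, before handing the content to a parser.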

In [77]:
#Parsing the HTML content 
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.content, 'html5lib') # If this line raises an error, install the parser with 'pip install html5lib'
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" dir="ltr" lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width,initial-scale=1.0" name="viewport"/>
  <title>
   Inspirational Quotes - Motivational Quotes - | The Foundation for a Better Life
  </title>
  <meta content="Find the perfect quotation from our hand-picked collection of inspiring quotes by hundreds of authors." name="description"/>
  <meta content="pass, it, on, passiton, values, kindness" name="keywords"/>
  <meta content="The Foundation for a Better Life" name="twitter:site_name"/>
  <meta content="@passiton_values" name="twitter:site"/>
  <meta content="summary" name="twitter:card"/>
  <meta content="Thank you for visiting." name="twitter:description"/>
  <meta content="https://www.passiton.com/passiton_fbl.jpg" name="twitter:image"/>
  <meta content="https://www.passiton.com/ins

**soup = BeautifulSoup(r.content, 'html5lib')**

* r.content : The raw HTML of the response, as bytes.
* html5lib : The name of the HTML parser we want Beautiful Soup to use.

**soup.prettify():**

*Returns the parse tree built from the raw HTML as an indented string, which makes the nesting of the tags easier to read.*

In [78]:
#Searching and navigating through the parse tree 

import csv
URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')


In [79]:
# a list to store quotes
quotes=[] 

table = soup.find('div', attrs = {'id':'all_quotes'})
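
`soup.find` returns `None` when nothing matches, which would make the later loop over `table` fail with an `AttributeError`. A minimal sketch of guarding against that, using an invented HTML snippet and the built-in `html.parser` backend (the `card` class name is hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with the same container id the notebook looks for.
html = '<div id="all_quotes"><div class="card"><h5>Hope</h5></div></div>'
soup = BeautifulSoup(html, "html.parser")

table = soup.find("div", attrs={"id": "all_quotes"})
missing = soup.find("div", attrs={"id": "no_such_id"})

# Fail early with a clear message instead of an AttributeError further down.
if table is None:
    raise ValueError("container not found; the page layout may have changed")

themes = [card.h5.get_text() for card in table.find_all("div", attrs={"class": "card"})]
print(themes)
print(missing)
```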

In [80]:
#print(table)

In [84]:
#Build one dictionary per quote card inside the container
for row in table.find_all('div',
                          attrs = {'class':'col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top'}):
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']    # attribute values are plain strings, so no .text
    quote['img'] = row.img['src']
    quote['lines'] = row.img['alt'].split(" #")[0]
    quote['author'] = row.img['alt'].split(" #")[1]
    quotes.append(quote)
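
The `lines` and `author` fields both come from the image's `alt` text, split on `" #"`. The sketch below shows that split on a made-up `alt` string (the real page's text is not reproduced here). It uses `str.partition`, which splits on the first `" #"` only, so a value without the separator yields an empty author instead of raising `IndexError`. `split_alt` is a hypothetical helper, not part of the notebook.

```python
def split_alt(alt_text):
    """Split 'quote text #author' into (lines, author) on the first ' #'."""
    lines, _separator, author = alt_text.partition(" #")
    return lines.strip(), author.strip()


# Made-up examples; the second has no ' #' separator at all.
with_author = split_alt("Stay curious. #Jane Doe")
without_author = split_alt("Stay curious.")
print(with_author)
print(without_author)
```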

In [82]:
#Write the quotes to a CSV file
filename = 'inspirational_quotes.csv'
with open(filename, 'w', newline='', encoding='utf-8') as f:
    w = csv.DictWriter(f, fieldnames=['theme','url','img','lines','author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
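
A quick way to check the result is to read it back with `csv.DictReader`. The sketch below writes a single made-up row to an in-memory buffer instead of `inspirational_quotes.csv`, so it can run on its own; the row values are placeholders.

```python
import csv
import io

fieldnames = ["theme", "url", "img", "lines", "author"]
quotes = [
    {
        "theme": "KINDNESS",
        "url": "/example-quote",
        "img": "https://example.com/quote.jpg",
        "lines": "Stay curious, and be kind.",
        "author": "Jane Doe",
    }
]

# Write with csv.DictWriter, as in the cell above, but into a text buffer.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames)
writer.writeheader()
writer.writerows(quotes)

# Read the rows back; every value comes back as a string.
buffer.seek(0)
rows_read = list(csv.DictReader(buffer))
print(rows_read == quotes)
```

The comma inside the `lines` value is quoted automatically by the writer, which is why the row survives the round trip unchanged.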