## WEB SCRAPING
This workbook demonstrates how to use Python to scrape information from a website.

In [35]:
import requests
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [36]:
from bs4 import BeautifulSoup

In [37]:
import bs4

In [38]:
url="http://127.0.0.1:5500/Web%20Scrapping/response.html"

In [39]:
x=requests.get(url)

#### We check whether we obtained a response to the GET request sent.

In [40]:
x.status_code

200

The status code of 200 above indicates that the request succeeded and the server returned a response.
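To make other status codes easier to read, here is a minimal sketch that labels a code with its broad class using only the standard library. The helper name `describe_status` is hypothetical and not part of requests; for the plain success check, requests itself offers `x.ok` and `x.raise_for_status()`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a readable label for an HTTP status code (hypothetical helper)."""
    # HTTPStatus raises ValueError for codes the standard library does not know
    phrase = HTTPStatus(code).phrase
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # The first digit of the code decides its class
    kind = classes[code // 100]
    return f"{code} {phrase} ({kind})"

print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (client error)
```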

Next we check the type of content retrieved from the above URL

In [41]:
type(x.content)

bytes

The response object, denoted here by 'x', exposes the body in two forms. x.text returns the body decoded to a string, which suits text-based responses such as HTML or XML,
while x.content returns the raw bytes, which suits binary responses such as JPG or PNG images.  
Since we need the HTML as text, we use x.text here.
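The difference between bytes and text can be seen without any network call. The sketch below uses a made-up UTF-8 sample rather than the real response; requests performs a comparable decoding step, based on x.encoding, when it builds x.text from x.content.

```python
# Bytes as they might arrive over the network (made-up sample, UTF-8 encoded)
raw = "<p>café</p>".encode("utf-8")

# Decoding with the right encoding recovers the original text
text = raw.decode("utf-8")

# Decoding with the wrong encoding raises no error here but garbles non-ASCII characters
garbled = raw.decode("latin-1")

print(type(raw).__name__, type(text).__name__)  # bytes str
print(text)     # <p>café</p>
print(garbled)  # <p>cafÃ©</p>
```

If the text of a page ever looks garbled like the last line, the encoding guessed for the response is a likely suspect.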

In [42]:
with open('response.html','w') as f:
    f.write(x.text)

Here we create a file named **response.html** and store the HTML content in it.
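One detail worth noting: open() without an encoding argument falls back to a platform-dependent default, which can trip over non-ASCII characters on some systems. A minimal sketch of the same write and read with an explicit encoding follows; the sample content and the temporary directory are assumptions made so that the example leaves no files behind.

```python
import os
import tempfile

html = "<h1>Hello, Start Learning!</h1><p>café</p>"  # made-up sample content

# Work inside a temporary directory that is deleted afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "response.html")
    # Write and read back with the same explicit encoding
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

print(round_trip == html)  # True
```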

### DISPLAY HTML CONTENT

In [43]:
with open('response.html','r') as f:
    content=f.read()
    print(content)

<!doctype html>
<html lang="en">
   <head>
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
      <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css" integrity="sha384-JcKb8q3iqJ61gNV9KGb8thSsNjpSL0n8PARn9HuZOnIxN0hoP+VmmDGMN5t9UJ0Z" crossorigin="anonymous">
      <title>My Courses</title>
   </head>
   <body>
      <h1>Hello, Start Learning!</h1>
      <div class="card" id="card-python-for-beginners">
         <div class="card-header">
            Python
         </div>
         <div class="card-body">
            <h5 class="card-title">Python for beginners</h5>
            <p class="card-text">If you are new to Python, this is the course that you should buy!</p>
            <a href="#" class="btn btn-primary">Start for 20$</a>
         </div>
      </div>
      <div class="card" id="card-python-web-development">
         <div class="card-header">
            Pyt

Beautiful Soup needs a parser to work with. Several options exist, such as lxml and Python's built-in html.parser.
For speed, lxml is usually the better choice.
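Since lxml is a separate install, one possible pattern is to prefer it and fall back to the built-in parser when it is missing. This is only a sketch: the helper name `make_soup` is hypothetical, and the HTML string is a made-up fragment shaped like the page used here.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html):
    """Parse html with lxml if available, else with the built-in parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed
        return BeautifulSoup(html, "html.parser")

sample = "<html><body><h5 class='card-title'>Python for beginners</h5></body></html>"
demo_soup = make_soup(sample)
print(demo_soup.h5.string)  # Python for beginners
```

The two parsers can build slightly different trees for malformed HTML, so results are only guaranteed to match on well-formed input like this sample.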

In [44]:
%pip install lxml


[notice] A new release of pip is available: 24.2 -> 24.3.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [45]:
import lxml

In [46]:
os.listdir()

['.DS_Store',
 'timesjobs.html',
 'response.html',
 '.jovianrc',
 'WEB SCRAPPING.ipynb',
 '.ipynb_checkpoints',
 'WEB SCRAPPING INFO FROM A JOB WEBSITE.ipynb']

In [47]:
with open('response.html','r') as f:
    # Read the whole saved page into a single string
    html_str=f.read()

In [48]:
soup=BeautifulSoup(html_str,'lxml')

In [49]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <link crossorigin="anonymous" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css" integrity="sha384-JcKb8q3iqJ61gNV9KGb8thSsNjpSL0n8PARn9HuZOnIxN0hoP+VmmDGMN5t9UJ0Z" rel="stylesheet"/>
  <title>
   My Courses
  </title>
 </head>
 <body>
  <h1>
   Hello, Start Learning!
  </h1>
  <div class="card" id="card-python-for-beginners">
   <div class="card-header">
    Python
   </div>
   <div class="card-body">
    <h5 class="card-title">
     Python for beginners
    </h5>
    <p class="card-text">
     If you are new to Python, this is the course that you should buy!
    </p>
    <a class="btn btn-primary" href="#">
     Start for 20$
    </a>
   </div>
  </div>
  <div class="card" id="card-python-web-development">
   <div class="card-header">
    Python
   </div>
   <div class="card-body">
    <h5 class="ca

## WEB SCRAPING

Let's grab hold of the **h5** tags present in the document, starting with the first one.

In [50]:
h5_tag_info=soup.h5    

In [51]:
type(h5_tag_info)

bs4.element.Tag

In [52]:
h5_tag_info

<h5 class="card-title">Python for beginners</h5>

**soup.tag_name** or **soup.find(tag_name)** returns the first tag_name tag in the HTML document, or None if no such tag is present. To obtain all matching tags, **soup.find_all(tag_name)** is used.
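The behaviour described above can be checked on a tiny document. The snippet below is a sketch on a made-up two-card fragment, parsed with the built-in html.parser so that it runs without lxml; the name `demo_soup` is chosen to avoid overwriting the notebook's `soup`.

```python
from bs4 import BeautifulSoup

html = """
<div class="card-body"><h5 class="card-title">Python for beginners</h5></div>
<div class="card-body"><h5 class="card-title">Python Web Development</h5></div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

first = demo_soup.find("h5")            # first match only
all_h5 = demo_soup.find_all("h5")       # every match, in document order
missing = demo_soup.find("table")       # no match: None
no_tables = demo_soup.find_all("table") # no match: empty result list

print(first.string)    # Python for beginners
print(len(all_h5))     # 2
print(missing)         # None
print(len(no_tables))  # 0
```

Because find returns None on a miss, chaining an attribute access such as `.string` onto it raises AttributeError, whereas looping over an empty find_all result simply does nothing.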

In [53]:
h5_tags_info=soup.find_all('h5')
print(h5_tags_info)

[<h5 class="card-title">Python for beginners</h5>, <h5 class="card-title">Python Web Development</h5>, <h5 class="card-title">Python Machine Learning</h5>]


In [54]:
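def Retrieve_Info_Stored_InTags(doc,tag):
    # Collect the text and the attribute dict of every <tag> element in doc
    tag_texts=[]
    tag_attr=[]
    for element in doc.find_all(tag):
        # .string is None when the element has more than one child node
        tag_texts.append(element.string)
        # Only elements that carry at least one attribute are recorded
        if element.attrs:
            tag_attr.append(element.attrs)
    return tag_texts,tag_attr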

In [55]:
h5_texts,h5_attributes=Retrieve_Info_Stored_InTags(soup,'h5')

In [56]:
h5_texts

['Python for beginners', 'Python Web Development', 'Python Machine Learning']

In [57]:
h5_attributes

[{'class': ['card-title']},
 {'class': ['card-title']},
 {'class': ['card-title']}]
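Two details of the tag API matter when reading these results: `.string` only works for tags with a single child, and attributes are read like dictionary entries. The following sketch illustrates both on a made-up fragment (not taken from the page), again using the built-in html.parser.

```python
from bs4 import BeautifulSoup

html = '<p id="intro">Learn <b>Python</b> today</p><a class="btn btn-primary" href="#">Start for 20$</a>'
demo_soup = BeautifulSoup(html, "html.parser")

p_tag = demo_soup.find("p")
a_tag = demo_soup.find("a")

# <p> has three children (text, <b>, text), so .string gives None
print(p_tag.string)        # None
# get_text() joins the text of all descendants instead
print(p_tag.get_text())    # Learn Python today

# class can hold several values, so it comes back as a list
print(a_tag.get("class"))  # ['btn', 'btn-primary']
print(a_tag.get("href"))   # #
# .get() returns None for an attribute that is absent
print(a_tag.get("title"))  # None
```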

Now let us find out what text the **anchor** tags on the page contain.

In [58]:
a_texts,a_attributes=Retrieve_Info_Stored_InTags(soup,'a')

In [59]:
a_texts

['Start for 20$', 'Start for 50$', 'Start for 100$']

Let us find out which external web pages, if any, the anchor tags point to.

In [60]:
ext_webpages=[]
for link in a_attributes:
    if 'href' in link and not(link['href'].startswith('#')):
        ext_webpages.append(link['href'])

In [61]:
# assert len(ext_webpages)>0,'No Link To Any External Webpage Found'

The list ext_webpages stays empty, so no anchor on this page points to an external webpage.
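The startswith('#') check is enough for this page, but it would also count a relative path such as '/courses' as external. A stricter test, sketched below with the standard library, compares host names instead. The helper name `is_external` and the sample hrefs are hypothetical and only serve the illustration.

```python
from urllib.parse import urlparse

def is_external(href, base_url):
    """Return True when href names a host different from base_url's host."""
    target_host = urlparse(href).netloc
    base_host = urlparse(base_url).netloc
    # Fragments ('#') and relative paths have an empty netloc, so they are internal
    return bool(target_host) and target_host != base_host

base = "http://127.0.0.1:5500/Web%20Scrapping/response.html"
samples = ["#", "/courses", "http://127.0.0.1:5500/other.html", "https://example.com/python"]

print([href for href in samples if is_external(href, base)])
# ['https://example.com/python']
```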

#### OBTAIN COURSE NAMES AND PRICES

Inspecting the page shows that all course names sit inside **h5** tags, whereas all course prices sit inside **a** tags.

In [62]:
def Obtain_CourseNames_And_Prices(bs_obj):
    # Pair each course title (h5) with the price link (a) at the same position
    ans={}
    for course_tag,price_tag in zip(bs_obj.find_all('h5'),bs_obj.find_all('a')):
        # The price is the last word of the link text, e.g. 'Start for 20$' -> '20$'
        ans[course_tag.string]=str(price_tag.string).split(' ')[-1]
    return ans

In [63]:
courses_info=Obtain_CourseNames_And_Prices(soup)

In [64]:
courses_info

{'Python for beginners': '20$',
 'Python Web Development': '50$',
 'Python Machine Learning': '100$'}

Another way to obtain the same result is to inspect the page and see where the price information sits: each course's title and price share one **card-body** div.

In [65]:
def Obtain_CourseNames_And_Prices_2(bs_obj):
    # Each card-body div holds one course's title (h5) and price link (a)
    ans={}
    for item in bs_obj.find_all('div',class_='card-body'):
        # Keep the last word of the link text, e.g. 'Start for 20$' -> '20$'
        ans[item.h5.string]=str(item.a.string).split(' ')[-1]
    return ans

In [66]:
res=Obtain_CourseNames_And_Prices_2(soup)

In [67]:
res

{'Python for beginners': '20$',
 'Python Web Development': '50$',
 'Python Machine Learning': '100$'}
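The prices above are still strings, so they cannot be summed or sorted numerically as they are. A small sketch of one way to convert them follows; the helper name `parse_price` is hypothetical, and it assumes every price is a whole number followed by a dollar sign, as in the output above.

```python
import re

def parse_price(text):
    """Extract the integer amount from strings such as '20$' or 'Start for 20$'."""
    match = re.search(r"(\d+)\s*\$", text)
    if match is None:
        return None  # no '<digits>$' pattern found
    return int(match.group(1))

# Same mapping as the result shown above
courses = {
    "Python for beginners": "20$",
    "Python Web Development": "50$",
    "Python Machine Learning": "100$",
}
prices = {name: parse_price(price) for name, price in courses.items()}

print(prices)
# {'Python for beginners': 20, 'Python Web Development': 50, 'Python Machine Learning': 100}
print(sum(prices.values()))  # 170
```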