# **Web Scraping**

## **Web Scraping**
Web scraping is the process of automatically extracting data from websites using code instead of copying it manually.

## **Example:**
Imagine a job website with 1,000 listings. Copying company names, job titles, and salaries by hand would take hours. With **web scraping**, you can write a script that collects all of this information in seconds and saves it to an Excel file or a database.

## **How does it work?**
1. You send an HTTP request to the website.
2. The server responds with the HTML content of the page.
3. You use a library such as BeautifulSoup or lxml to parse that HTML and extract only what you need.
4. You save the data to CSV, Excel, or a database.
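
The steps above can be sketched end to end. This is a minimal sketch, not a definitive implementation: the HTML is an inline sample standing in for `response.text` (so no network request is made), the class names `job`, `title`, and `company` are hypothetical, and Python's built-in `html.parser` is used in place of `lxml` to avoid an extra dependency.

```python
import csv
import io

from bs4 import BeautifulSoup

# Steps 1-2 stand-in: in a real run this string would come from
# requests.get(url).text; the markup below is a made-up sample.
SAMPLE_HTML = """
<div class="job"><h2 class="title">Data Analyst</h2><span class="company">Acme</span></div>
<div class="job"><h2 class="title">Python Developer</h2><span class="company">Globex</span></div>
"""


def extract_jobs(html):
    """Step 3: parse the HTML and keep only the fields we need."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="job"):
        rows.append({
            "title": card.find("h2", class_="title").get_text(strip=True),
            "company": card.find("span", class_="company").get_text(strip=True),
        })
    return rows


def to_csv(rows):
    """Step 4: serialise the rows as CSV text (write it to a file to save it)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "company"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


jobs = extract_jobs(SAMPLE_HTML)
print(to_csv(jobs))
```

Keeping extraction and saving in separate functions means the parsing logic can be tried on a saved HTML file before it is pointed at a live site.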

## **Note:**
Not all websites allow scraping. Some block it with protections such as CAPTCHAs, and others render their content with JavaScript, so the data is missing from the raw HTML.
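
Besides technical protections, many sites publish a `robots.txt` file stating which paths automated clients may fetch. Below is a hedged sketch of checking it with the standard library's `urllib.robotparser`. The rules and the `example.com` URLs are invented for illustration; a real script would load the site's actual file with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so no request is made.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch(useragent, url) reports whether the rules permit fetching the URL.
allowed = parser.can_fetch("*", "https://example.com/jobs")
blocked = parser.can_fetch("*", "https://example.com/private/data")
print(allowed, blocked)
```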

--------------------------------

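## **Web Scraping Tools:**
**1. BeautifulSoup**: an HTML parsing library; easy to learn and well suited to beginners and small projects.

**2. Scrapy**: a full crawling framework; fast and suited to large scraping jobs.

**3. Selenium**: a browser automation tool; useful when a site relies on JavaScript to render its content.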

--------------------------------

## **Note:**
Python is a good fit for web scraping because it has simple syntax, mature libraries (Requests, BeautifulSoup, Scrapy, Selenium), and easy integration with Pandas for data analysis.

----------------

## **Database vs. API:**
A **database** is the storage system for the raw data, while an **API** is the interface that allows applications to access or manipulate that data in a controlled, standardized way.
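
The distinction can be illustrated with a small sketch. It assumes an in-memory SQLite database as the storage layer and a hypothetical function, `get_jobs_by_company`, standing in for an API endpoint; the table, columns, and rows are made up.

```python
import sqlite3

# The database: raw storage (in-memory here, with invented sample rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (title TEXT, company TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [("Python Developer", "Acme"), ("Data Analyst", "Globex"), ("ML Engineer", "Acme")],
)


def get_jobs_by_company(company):
    """The 'API': a controlled entry point that exposes only one kind of query."""
    cursor = conn.execute(
        "SELECT title FROM jobs WHERE company = ? ORDER BY title", (company,)
    )
    return [row[0] for row in cursor]


print(get_jobs_by_company("Acme"))
```

Callers never see the table layout or write SQL themselves; they only get what the function chooses to expose.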

-------------

## **❓ What is the difference between web scraping and an AI agent?**

## **Web Scraping:**
Web scraping is the process of automatically extracting data from websites. It typically involves sending HTTP requests, parsing the HTML, and collecting specific information such as product prices, job postings, or news articles. It is focused purely on data collection.

## **AI Agent:**
An AI agent, on the other hand, is a system powered by artificial intelligence that can perceive, reason, and act to achieve a goal. An AI agent may use web scraping (or APIs, or databases) as one tool, but it goes further: it can analyze data, make decisions, and interact dynamically with its environment or users.

### **✅ Example:**

1. Web Scraping: A script that extracts the prices of phones from an e-commerce website.

2. AI Agent: A system that collects prices from multiple sites, compares them, analyzes trends, and recommends the best phone for your budget.

------------

# **Installing Python libraries for web scraping:**

In [7]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [8]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


In [9]:
pip install lxml

Note: you may need to restart the kernel to use updated packages.


# **Read the file as HTML:**

In [41]:
# The page declares charset utf-8, so open it with that encoding explicitly
with open('home.html', 'r', encoding='utf-8') as html_file:
    print(html_file.read())

<!doctype html>
<html lang="en">
   <head>
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
      <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css">
      <title>DEPI_R3_BNS_DS Courses</title>
      <style>
         body {
            background-color: #368be0;
         }
         .header-title {
            text-align: center;
            margin-top: 40px;
            margin-bottom: 40px;
            font-weight: bold;
            color: #01070c;
         }
         .card-deck .card {
            transition: transform 0.3s, box-shadow 0.3s;
         }
         .card-deck .card:hover {
            transform: translateY(-5px);
            box-shadow: 0 8px 20px rgba(0, 0, 0, 0.2);
         }
         .btn-custom {
            background-color: #007bff;
            border: none;
            transition: background-color 0.3s;
         }
         .btn-custom:hover {

In [10]:
# Read the HTML file into a string so it can be reused later
with open('home.html', 'r', encoding='utf-8') as html_file:
    content = html_file.read()
    print(content)


# **BeautifulSoup**

In [None]:
# Import the BeautifulSoup class from the bs4 package
from bs4 import BeautifulSoup

# Open the file for reading
with open('home.html', 'r', encoding='utf-8') as html_file:
    # Read the whole file into a string
    content = html_file.read()

# Create a BeautifulSoup object that parses the HTML with the 'lxml' parser
soup = BeautifulSoup(content, 'lxml')

# Print the HTML with one tag per line and consistent indentation
print(soup.prettify())


<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <link href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css" rel="stylesheet"/>
  <title>
   DEPI_R3_BNS_DS Courses
  </title>
  <style>
   body {
            background-color: #368be0;
         }
         .header-title {
            text-align: center;
            margin-top: 40px;
            margin-bottom: 40px;
            font-weight: bold;
            color: #01070c;
         }
         .card-deck .card {
            transition: transform 0.3s, box-shadow 0.3s;
         }
         .card-deck .card:hover {
            transform: translateY(-5px);
            box-shadow: 0 8px 20px rgba(0, 0, 0, 0.2);
         }
         .btn-custom {
            background-color: #007bff;
            border: none;
            transition: background-color 0.3s;
         }
         .btn-custom:hover {
            backg

## **Example:**

In [19]:
from bs4 import BeautifulSoup

html_doc = """
<html>
  <head>
    <title> TEST TITLE </title>
  </head>

  <body>
    <h1> Hello World! </h1>
    <p class="desc"> This is a paragraph. </p>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc , 'lxml')

print(soup.title.text)
print(soup.h1.text)
print(soup.p.text)

 TEST TITLE 
 Hello World! 
 This is a paragraph. 
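
The output above keeps the spaces that surround the text inside each tag. One way to drop them, sketched here with the same document, is `get_text(strip=True)`. Python's built-in `html.parser` is used instead of `lxml` only so the example needs nothing beyond `bs4`.

```python
from bs4 import BeautifulSoup

html_doc = """
<html>
  <head><title> TEST TITLE </title></head>
  <body>
    <h1> Hello World! </h1>
    <p class="desc"> This is a paragraph. </p>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

raw_title = soup.title.text                      # keeps the surrounding spaces
clean_title = soup.title.get_text(strip=True)    # strips them
paragraph = soup.find("p", class_="desc").get_text(strip=True)

print(repr(raw_title))
print(repr(clean_title))
print(repr(paragraph))
```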


In [None]:
from bs4 import BeautifulSoup

with open('home.html', 'r', encoding='utf-8') as html_file:
    content = html_file.read()

soup = BeautifulSoup(content, 'lxml')

# find() locates the first element that matches the search criteria.
# It returns a single element (the first match), not a list,
# or None if nothing matches.
# Each course name is stored in an <h5> tag.
print(soup.find('h5'), '\n')        # the whole tag (without .text)
print(soup.find('h5').text, '\n')   # only the text inside the tag

# find_all() locates every element that matches the search criteria.
# It returns a list of elements, even if only one match is found.
courses = soup.find_all('h5')
for course in courses:
    print(course.text)

<h5 class="card-title">Python for Beginners</h5> 

Python for Beginners 

Python for Beginners
Python Web Development
Python Machine Learning
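
A practical consequence of the difference: when nothing matches, `find()` returns `None` while `find_all()` returns an empty list, so chaining `.text` onto a failed `find()` raises `AttributeError`. A small sketch, using an inline snippet modelled on the card markup rather than `home.html` itself:

```python
from bs4 import BeautifulSoup

snippet = """
<div class="card"><h5 class="card-title">Python for Beginners</h5></div>
<div class="card"><h5 class="card-title">Python Web Development</h5></div>
"""
soup = BeautifulSoup(snippet, "html.parser")

first = soup.find("h5")            # first match only (a single Tag)
every = soup.find_all("h5")        # all matches (a list-like ResultSet)
missing = soup.find("h6")          # no match -> None
none_found = soup.find_all("h6")   # no match -> empty list

titles = [tag.text for tag in every]
print(first.text, titles, missing, len(none_found))
```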


In [46]:
from bs4 import BeautifulSoup

with open('home.html', 'r', encoding='utf-8') as html_file:
    content = html_file.read()

soup = BeautifulSoup(content, 'lxml')

# class_ (with a trailing underscore) is used because class is a Python keyword
courses_card = soup.find_all('div', class_='card')

print(courses_card)

[<div class="card">
<div class="card-header bg-primary text-white">Python and Web Scraping</div>
<div class="card-body">
<h5 class="card-title">Python for Beginners</h5>
<p class="card-text">Welcome to the DEPI DS Program. Start your journey with Python fundamentals!</p>
<a class="btn btn-custom btn-block" href="#">Win $20</a>
</div>
</div>, <div class="card">
<div class="card-header bg-success text-white">Data Science</div>
<div class="card-body">
<h5 class="card-title">Python Web Development</h5>
<p class="card-text">Already know Python? Learn how to build powerful web applications with it!</p>
<a class="btn btn-custom btn-block" href="#">Win $30</a>
</div>
</div>, <div class="card">
<div class="card-header bg-danger text-white">Machine Learning</div>
<div class="card-body">
<h5 class="card-title">Python Machine Learning</h5>
<p class="card-text">Dive into AI and ML. Master data modeling and prediction using Python!</p>
<a class="btn btn-custom btn-block" href="#">Win $100</a>
</div>

In [None]:
from bs4 import BeautifulSoup

with open('home.html', 'r', encoding='utf-8') as html_file:
    content = html_file.read()

soup = BeautifulSoup(content, 'lxml')

courses_card = soup.find_all('div', class_='card')

for course in courses_card:
    course_name = course.h5.text        # the course name is in the <h5> tag
    course_win = course.a.text          # the prize text is in the <a> tag
    print(course_name)
    print(course_win, '\n')

for course in courses_card:
    # Read the name inside this loop too; otherwise every line
    # reuses the last name left over from the previous loop.
    course_name = course.h5.text
    course_win = course.a.text.split()[-1]   # keep only the amount, e.g. '$20'
    print(f"{course_name} costs {course_win}")

Python for Beginners
Win $20 

Python Web Development
Win $30 

Python Machine Learning
Win $100 

Python for Beginners costs $20
Python Web Development costs $30
Python Machine Learning costs $100
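
If the amounts are needed as numbers (for sorting or totals), the link text can be parsed rather than split. A sketch using a regular expression; the helper name `parse_price` is hypothetical, and it assumes labels shaped like `Win $20`.

```python
import re


def parse_price(label):
    """Return the dollar amount in a label such as 'Win $20' as an int, or None."""
    match = re.search(r"\$(\d+)", label)
    return int(match.group(1)) if match else None


labels = ["Win $20", "Win $30", "Win $100"]
prices = [parse_price(label) for label in labels]
print(prices, sum(prices))
```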


# **Requests**

In [55]:
# Check the response: when you call requests.get(),
# the server sends back a status code that tells you what happened.

import requests
response = requests.get('https://m.timesjobs.com/mobile/jobs-search-result.html?txtKeywords=python&cboWorkExp1=-1&txtLocation=')
print(response)    

#print(response.text)

<Response [200]>
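
`<Response [200]>` means the request succeeded. As a rough guide to other codes, the sketch below classifies a status code by its range using the standard library's `http.HTTPStatus`. The helper name `describe_status` is hypothetical; with a real `requests` response you would pass it `response.status_code`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'.

    HTTPStatus raises ValueError for codes it does not define.
    """
    if 100 <= code < 200:
        kind = "informational"
    elif 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    else:
        kind = "server error"
    return f"{code} {HTTPStatus(code).phrase} ({kind})"


for code in (200, 404, 500):
    print(describe_status(code))
```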



In [None]:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://m.timesjobs.com/mobile/jobs-search-result.html?txtKeywords=python&cboWorkExp1=-1&txtLocation=')

html_text = response.text

soup = BeautifulSoup(html_text,'lxml')

jobs = soup.find_all('div', class_ = 'srp-listing clearfix')

print(jobs)


[<div class="srp-listing clearfix" id="liDiv70769810">
<a class="srp-apply-new" href="https://m.timesjobs.com/mobile/job-detail/python-developer-seven-consultancy-hyderabad-secunderabad-0-to-3-yrs-jobid-ATqt92bFMIxzpSvf__PLUS__uAgZw==&amp;bc=+&amp;sequence=1" onclick="trackClickEvent('Search Result Page','SRPJobDetail');"></a>
<span class="short-star" id="shortstar70769810" onclick="syncShortlist('70769810'); return false;"></span>
<div class="srp-job-bx">
<div class="clearfix srp-heading">
<div class="srp-comp-logo">
<!-- <span class="logo-comp-name">SEVEN CONSULTANCY</span> -->
<span><img alt="SEVEN CONSULTANCY" src="/resources/images/svg_images/comp-default-logo.svg"/></span>
</div>
<div class="srp-job-heading">
<h3><a href="https://m.timesjobs.com/mobile/job-detail/python-developer-seven-consultancy-hyderabad-secunderabad-0-to-3-yrs-jobid-ATqt92bFMIxzpSvf__PLUS__uAgZw==&amp;bc=+&amp;sequence=1" onclick="trackClickEvent('Search Result Page','SRPJobDetail');">Python Developer</a></h3


In [67]:
# Goal: collect the company name, the skills, and the published date for each job.
# Target output format:
#
# company name   : ...
# skills         : python, Java, ... etc.
# published date : 2 days ago
#
# (repeated for every job on the page)

In [87]:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://m.timesjobs.com/mobile/jobs-search-result.html?txtKeywords=python&cboWorkExp1=-1&txtLocation=')
html_text = response.text
soup = BeautifulSoup(html_text, 'lxml')

# find() returns only the first job listing on the page
job = soup.find('div', class_='srp-listing clearfix')

company_name = job.find('span', class_='srp-comp-name').text
print(company_name)

# Replacing every space with ', ' also splits multi-word skills into separate words
skills = job.find('div', class_='srp-keyskills').text.replace(' ', ', ')
print(skills)

published_date = job.find('span', class_='posting-time').text
print(published_date)

SEVEN CONSULTANCY

rest, python, web, developer, storage, software, developer

24h
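
On a live page some listings may lack one of these elements, in which case `find()` returns `None` and `.text` fails with `AttributeError`. A defensive sketch follows. The helper `text_or_default` is hypothetical, and the inline HTML only imitates the class names used above; it is not the real TimesJobs markup.

```python
from bs4 import BeautifulSoup

# Made-up listing that deliberately has no 'posting-time' element.
sample = """
<div class="srp-listing clearfix">
  <span class="srp-comp-name"> SEVEN CONSULTANCY </span>
  <div class="srp-keyskills">python rest</div>
</div>
"""


def text_or_default(parent, name, class_name, default="N/A"):
    """Return the stripped text of the first matching child, or a default."""
    element = parent.find(name, class_=class_name)
    return element.get_text(strip=True) if element is not None else default


job = BeautifulSoup(sample, "html.parser").find("div", class_="srp-listing clearfix")

company = text_or_default(job, "span", "srp-comp-name")
posted = text_or_default(job, "span", "posting-time")
print(company, "|", posted)
```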


In [None]:
import requests
from bs4 import BeautifulSoup

response = requests.get(
    'https://m.timesjobs.com/mobile/jobs-search-result.html?txtKeywords=python&cboWorkExp1=-1&txtLocation='
)
html_text = response.text
print("The Server's Response = ", response, '\n')

soup = BeautifulSoup(html_text, 'lxml')
jobs = soup.find_all('div', class_='srp-listing clearfix')
print('Found', len(jobs), 'jobs\n')
# print(jobs)   # the output is a list

# find_all() returns a list, so loop over it to process each job
for job in jobs:
    company_name = job.find('span', class_='srp-comp-name').text
    print('Company name is: ', company_name, '\n')

    skills = job.find('div', class_='srp-keyskills').text.replace(' ', ', ')
    print('Skills are: ', skills)

    published_date = job.find('span', class_='posting-time').text
    print('Published date is:', published_date)

    print('----------------------------------\n')

The Server's Response =  <Response [200]> 

Found 25 jobs

Company name is:  SEVEN CONSULTANCY 

Skills are:  
rest, python, web, developer, storage, software, developer

Published date is: 24h
----------------------------------

Company name is:  SYNECHRON 

Skills are:  
advanced, python, programming, rest, api, development, flask, or, django, framework, oop, and, soa, principles, version, control, (, git, ), software, requirements, artificial, intelligence, sql, security, technical, skills, mobile, nosql

Published date is: 24h
----------------------------------

Company name is:  SYNECHRON 

Skills are:  
python, core, pandas, /, numpy, experience, sql, alchemy, /, orm, flask, /, django, knowledge, multiprocessing, /, multithreading, artificial, intelligence, shell, scripting, technical, skills, rest, gtp, mobile

Published date is: 1 days ago
----------------------------------

Company name is:  IQVIA 

Skills are:  
data, analysis, pl, /, sql, proficiency, machine, learning, pyth