
# Web Scraping

## Part 1: Loading Web Pages with 'requests'

The requests module allows you to send HTTP requests using Python.

Each request returns a Response object that holds the response data (content, encoding, status code, and so on). Here is one example of fetching the HTML of a page:

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Make a request to https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/
# and store the result in the 'res' variable
res = requests.get(
    'https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/')
txt = res.text              # decoded HTML of the page
status = res.status_code    # HTTP status code (200 means success)

# Print the status code
print(status)

200
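
The call above assumes the request succeeds. A minimal sketch of a more defensive fetch might add a timeout and let requests raise an exception on 4xx/5xx statuses. The helper name fetch_html and the 10-second default are assumptions for illustration, not part of the original classroom:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML; raise on network errors or 4xx/5xx statuses.

    fetch_html is a hypothetical helper name chosen for this sketch.
    """
    # Without a timeout, requests.get can wait indefinitely on a stalled server.
    res = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError if the status code signals a client or server error.
    res.raise_for_status()
    return res.text
```

Calling fetch_html with the classroom URL should return the same HTML as res.text; a failed request then surfaces as a subclass of requests.RequestException rather than as an error page stored in a variable.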


## Part 2: Extracting title with BeautifulSoup

Throughout this classroom, you’ll use a Python library called BeautifulSoup to do web scraping. Some features that make BeautifulSoup a powerful solution are:

- It provides a lot of simple methods and Pythonic idioms for navigating, searching, and modifying a DOM tree. It doesn't take much code to write an application.

- BeautifulSoup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.

- BeautifulSoup tolerates imperfect markup, so it can parse most of the HTML and XML you find on the web.

In [None]:
# Make a request to https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/
page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")

soup = BeautifulSoup(page.content, 'html.parser')

# Extract title of page
page_title = soup.title.text

# print the result
print(page_title)

codedamn Web Scraper demo
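
The navigating, searching, and modifying features listed above can be tried without any network access. The following is a small sketch on made-up HTML; the markup and the names sample_html, demo_soup, and link are assumptions for illustration only:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; not taken from the classroom site.
sample_html = """
<html>
  <head><title>Demo shop</title></head>
  <body>
    <a class="title" href="/item/1">Laptop</a>
  </body>
</html>
"""

demo_soup = BeautifulSoup(sample_html, "html.parser")

# Navigating: attribute-style access walks down to the first matching tag.
demo_title = demo_soup.title.text

# Searching: find() returns the first tag that matches the given filters.
link = demo_soup.find("a", class_="title")
link_text = link.text
link_href = link["href"]  # tag attributes are read like dictionary keys

# Modifying: assigning to .string replaces the tag's text in the tree.
link.string = "Notebook"

print(demo_title, link_text, link_href, demo_soup.a.text)
```

The sketch uses the built-in html.parser so that it needs no extra install; passing "lxml" or "html5lib" instead should work the same way, provided that parser is installed.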


## Part 3: Soup-ed body and head

In the previous part you called .text on a tag to get its string content. You can also print a tag without calling .text, which gives you its full markup. Try running the example below:

In [None]:
# Extract title of page
page_title = soup.title.text

# Extract body of page
page_body = soup.body

# Extract head of page
page_head = soup.head

# print the result
print(page_body)

<body>
<header class="navbar navbar-fixed-top navbar-static" role="banner">
<div class="container">
<div class="navbar-header">
<a data-target=".side-collapse" data-target-2=".side-collapse-container" data-toggle="collapse-side">
<button aria-controls="navbar" aria-expanded="false" class="navbar-toggle pull-right collapsed" data-target="#navbar" data-target-2=".side-collapse-container" data-target-3=".side-collapse" data-toggle="collapse" type="button">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar top-bar"></span>
<span class="icon-bar middle-bar"></span>
<span class="icon-bar bottom-bar"></span>
</button>
</a>
<div class="navbar-brand">
<a href="/webscraper-python-codedamn-classroom-website/"><img alt="Web Scraper" src="/webscraper-python-codedamn-classroom-website/logo_white.svg"/></a>
</div>
</div>
<div class="side-collapse in">
<nav class="navbar-collapse collapse" id="navbar" role="navigation">
<ul class="nav navbar-nav navbar-right">
<li class="hidden">
<a
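
The difference between a tag and its .text is easier to see on a tiny document. This is a sketch on made-up HTML, with variable names chosen only for this example:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
demo_soup = BeautifulSoup(
    "<html><head><title>Demo</title></head>"
    "<body><p>Hi <b>there</b></p></body></html>",
    "html.parser",
)

title_markup = str(demo_soup.title)  # the tag itself, including its markup
title_text = demo_soup.title.text    # only the text inside the tag
body_markup = str(demo_soup.body)    # the body tag with all nested tags
body_text = demo_soup.body.text      # text of all descendants, joined together

print(title_markup)
print(title_text)
print(body_markup)
print(body_text)
```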

## Part 4: select with BeautifulSoup

Once you have the soup variable (as in the previous labs), you can call .select on it. This method takes a CSS selector and returns a list of all matching elements, so you can reach down the DOM tree the same way you would select elements with CSS. Let's look at an example:

In [None]:
soup.select('h1')

[<h1>Test Sites</h1>, <h1>E-commerce training site</h1>]

In [None]:
lst = soup.select('p')

In [None]:
# Create all_h1_tags as empty list
all_h1_tags = []

# Append the text of every h1 tag in the soup to all_h1_tags
for element in soup.select('h1'):
    all_h1_tags.append(element.text)

# Create seventh_p_text and set it to 7th p element text of the page
seventh_p_text = soup.select('p')[6].text

print(all_h1_tags, seventh_p_text)

['Test Sites', 'E-commerce training site'] 7 reviews
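
Selectors can also combine classes and nesting, and indexing into the result list is worth guarding. The following sketch runs on made-up HTML, not on the classroom page; every name in it is an assumption for illustration:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
sample_html = """
<div class="thumbnail">
  <h4><a class="title" href="/a">Item A</a></h4>
  <p class="description">First item</p>
</div>
<div class="thumbnail">
  <h4><a class="title" href="/b">Item B</a></h4>
  <p class="description">Second item</p>
</div>
"""
demo_soup = BeautifulSoup(sample_html, "html.parser")

# Class, descendant, and child combinators work as they do in CSS.
titles = [a.text for a in demo_soup.select("div.thumbnail h4 > a.title")]

# select_one() returns the first match, or None when nothing matches.
first_description = demo_soup.select_one("p.description").text
missing = demo_soup.select_one("span.price")

# Indexing the list from select() raises IndexError when it is too short,
# so check the length before asking for, say, the 7th element.
paragraphs = demo_soup.select("p")
seventh = paragraphs[6].text if len(paragraphs) > 6 else None

print(titles, first_description, missing, seventh)
```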


## Part 5: Top items being scraped right now

Let's go ahead and extract the top items scraped from the URL: https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/

If you open this page in a new tab, you’ll see some top items. Your task is to scrape their names and store them in a list called top_items. You will also extract the reviews for these items. (The solution below collects both in a dictionary called info.)

In [None]:
info = {
        "title": [],
        "review": []
}

# Extract each product's title and review text and store them in info
products = soup.select('div.thumbnail')
for elem in products:
    title = elem.select('a.title')[0].text
    review_label = elem.select('div.ratings')[0].text
    info["title"].append(title.strip())
    info["review"].append(review_label.strip())

print(info)

{'title': ['Asus AsusPro Adv...', 'Asus ROG Strix G...', 'Acer Aspire 3 A3...'], 'review': ['7 reviews', '4 reviews', '2 reviews']}


Note that this is only one possible solution; you can approach the task in a different way too. In this solution:

- First, you select all the div.thumbnail elements, which gives you a list of individual products.
- Then you iterate over them.
- Because you're already inside a loop over div.thumbnail, the a.title selector gives you only one result, inside a list. You take that list's 0th element and extract its text. The div.ratings selector is handled the same way.
- Finally, you strip any extra whitespace and append the values to your lists.
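
The steps above index into each result list with [0], which raises IndexError if a product card lacks a title or a ratings block. As a hedged alternative, here is a sketch that uses select_one and skips incomplete cards. The function name extract_products and the sample markup are assumptions for illustration, not taken from the classroom site:

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure described above; the second card
# deliberately lacks a ratings block.
sample_html = """
<div class="thumbnail">
  <h4><a class="title" href="/a">  Item A  </a></h4>
  <div class="ratings"><p>7 reviews</p></div>
</div>
<div class="thumbnail">
  <h4><a class="title" href="/b">Item B</a></h4>
</div>
"""


def extract_products(soup):
    """Collect title and review text for each complete product card."""
    info = {"title": [], "review": []}
    for card in soup.select("div.thumbnail"):
        title_tag = card.select_one("a.title")
        ratings_tag = card.select_one("div.ratings")
        # select_one() returns None instead of raising when nothing matches.
        if title_tag is None or ratings_tag is None:
            continue
        info["title"].append(title_tag.text.strip())
        info["review"].append(ratings_tag.text.strip())
    return info


demo_info = extract_products(BeautifulSoup(sample_html, "html.parser"))
print(demo_info)
```

Skipping incomplete cards keeps the two lists the same length, which matters when the dictionary is later turned into a table.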

## Final step: Convert to a `csv` file

In [None]:
pd.DataFrame(info).to_csv('out.csv', index=False)

In [None]:
# Check out.csv by reading it back
df = pd.read_csv('out.csv')
df

,title,review
0,Asus AsusPro Adv...,7 reviews
1,Asus ROG Strix G...,4 reviews
2,Acer Aspire 3 A3...,2 reviews
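
The index=False argument matters for the round trip. The sketch below uses made-up data and an in-memory buffer instead of a file to compare both settings, and also shows one possible way to turn the review text into integers. All variable names are assumptions for illustration:

```python
import io

import pandas as pd

# Made-up data for illustration.
demo_info = {"title": ["Item A", "Item B"], "review": ["7 reviews", "4 reviews"]}
demo_df = pd.DataFrame(demo_info)

# With no path argument, to_csv() returns the CSV text instead of writing a file.
csv_without_index = demo_df.to_csv(index=False)
csv_with_index = demo_df.to_csv()  # default index=True also writes the row labels

# Reading both back shows the extra, unnamed column the index produces.
columns_without = list(pd.read_csv(io.StringIO(csv_without_index)).columns)
columns_with = list(pd.read_csv(io.StringIO(csv_with_index)).columns)

# Pull the digits out of strings like "7 reviews" for numeric work.
review_counts = (
    demo_df["review"].str.extract(r"(\d+)", expand=False).astype(int).tolist()
)

print(columns_without)
print(columns_with)
print(review_counts)
```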
