# Web Scraping Tutorial

This notebook is a step-by-step guide to scraping data from a website. Web scraping extracts information from web pages and converts it into a structured format, which is particularly useful for data analysis, machine learning, and other data-driven tasks.

In this tutorial, we will walk through the process of scraping product information from a sample e-commerce site. By following these steps, you will learn how to:

1. Send HTTP requests to retrieve web pages.
2. Parse HTML content using BeautifulSoup.
3. Identify and extract relevant data elements from the parsed HTML.
4. Store the extracted data in a structured format using pandas.
5. Save the data to a CSV file.
6. Optionally, save the data to a database such as MongoDB.

The website we will be scraping is [ScrapeMe](https://scrapeme.live/shop/). This site is designed for practice purposes and contains a variety of products with details such as names and prices, which makes it an ideal candidate for learning web scraping techniques.

Before you begin, please visit the site to understand its structure. This will help you identify the elements you need to scrape.

Let's get started!

## Install and import libraries

In [2]:
pip install requests beautifulsoup4 pymongo

Collecting pymongo
  Downloading pymongo-4.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.6.1-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Downloading dnspython-2.6.1-py3-none-any.whl (307 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.6.1 pymongo-4.8.0


In [22]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import pymongo
from pymongo import MongoClient
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display, HTML

In [47]:
pip install python-dotenv


## Step 1: Send a request to the website

In [23]:
url = 'https://scrapeme.live/shop/'

response = requests.get(url)

print(response.status_code)

print(response.text[:500])

200

<!doctype html>
<html lang="en-GB">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=2.0">
<link rel="profile" href="http://gmpg.org/xfn/11">
<link rel="pingback" href="https://scrapeme.live/xmlrpc.php">

<title>Products &#8211; ScrapeMe</title>
<link rel='dns-prefetch' href='//fonts.googleapis.com' />
<link rel='dns-prefetch' href='//s.w.org' />
<link rel="alternate" type="application/rss+xml" title="ScrapeMe &raquo; Feed" href="ht
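
The request above succeeds, but a scraper that runs unattended usually benefits from a timeout and an explicit status check. The sketch below wraps both in a helper; the name `fetch_html` and the 10-second default are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text.

    Raises a requests.exceptions.RequestException subclass on network
    errors, invalid URLs, or non-2xx status codes.
    """
    # Without a timeout, requests can wait indefinitely for a slow server
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx responses into exceptions instead of parsing an error page
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://scrapeme.live/shop/')` would then return the same HTML as `response.text` in the cell above.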


## Step 2: Parse the HTML content of the page

In [24]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')

print(soup.title)


<title>Products – ScrapeMe</title>


## Step 3: Inspect the website and identify the elements to scrape
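Use your browser's developer tools to inspect the page and identify the elements that hold the data you want (e.g., product names and prices). On this page, each product is an `<li>` element with the class `product`.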

In [25]:
soup.find_all('li', {'class': 'product'})

[<li class="post-759 product type-product status-publish has-post-thumbnail product_cat-pokemon product_cat-seed product_tag-bulbasaur product_tag-overgrow product_tag-seed first instock sold-individually taxable shipping-taxable purchasable product-type-simple">
 <a class="woocommerce-LoopProduct-link woocommerce-loop-product__link" href="https://scrapeme.live/shop/Bulbasaur/"><img alt="" class="attachment-woocommerce_thumbnail size-woocommerce_thumbnail wp-post-image" height="324" sizes="(max-width: 324px) 100vw, 324px" src="https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png" srcset="https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png 350w, https://scrapeme.live/wp-content/uploads/2018/08/001-150x150.png 150w, https://scrapeme.live/wp-content/uploads/2018/08/001-300x300.png 300w, https://scrapeme.live/wp-content/uploads/2018/08/001-100x100.png 100w, https://scrapeme.live/wp-content/uploads/2018/08/001-250x250.png 250w, https://scrapeme.live/wp-content/uploa

In [28]:
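def get_product_info(product):
    """Return the (name, price) pair for a single product <li> element."""
    # The product title is the <h2> inside each product card
    product_name = product.find('h2', {'class': 'woocommerce-loop-product__title'}).text.strip()
    # The price, including the currency symbol, is in the price <span>
    product_price = product.find('span', {'class': 'woocommerce-Price-amount amount'}).text.strip()
    return product_name, product_price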

In [29]:
for product in soup.find_all('li', {'class': 'product'}):
    print(get_product_info(product))

('Bulbasaur', '£63.00')
('Ivysaur', '£87.00')
('Venusaur', '£105.00')
('Charmander', '£48.00')
('Charmeleon', '£165.00')
('Charizard', '£156.00')
('Squirtle', '£130.00')
('Wartortle', '£123.00')
('Blastoise', '£76.00')
('Caterpie', '£73.00')
('Metapod', '£148.00')
('Butterfree', '£162.00')
('Weedle', '£25.00')
('Kakuna', '£148.00')
('Beedrill', '£168.00')
('Pidgey', '£159.00')
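
`get_product_info` assumes every product card contains both a title and a price; if either is missing, `find` returns `None` and `.text` raises an `AttributeError`. A more defensive variant might look like the sketch below. The function name `get_product_info_safe` and the simplified `SAMPLE_HTML` are hypothetical and only mirror the class names seen on the page.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup that reuses the class names from the shop page
SAMPLE_HTML = """
<ul>
  <li class="product">
    <h2 class="woocommerce-loop-product__title">Bulbasaur</h2>
    <span class="woocommerce-Price-amount amount">£63.00</span>
  </li>
  <li class="product">
    <h2 class="woocommerce-loop-product__title">No price listed</h2>
  </li>
</ul>
"""


def get_product_info_safe(product):
    """Return (name, price), using None for any element that is missing."""
    name_tag = product.select_one('h2.woocommerce-loop-product__title')
    price_tag = product.select_one('span.woocommerce-Price-amount')
    name = name_tag.get_text(strip=True) if name_tag else None
    price = price_tag.get_text(strip=True) if price_tag else None
    return name, price


sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = [get_product_info_safe(li) for li in sample_soup.select('li.product')]
print(rows)  # [('Bulbasaur', '£63.00'), ('No price listed', None)]
```

Returning `None` rather than raising keeps one malformed card from stopping the whole loop; the missing values can be filtered out later in pandas.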


## Step 4: Extract the desired data

In [32]:
product_name = []
product_price = []

for product in soup.find_all('li', {'class': 'product'}):
    name, price = get_product_info(product)
    product_name.append(name)
    product_price.append(price)

print(product_name)
print(product_price)

['Bulbasaur', 'Ivysaur', 'Venusaur', 'Charmander', 'Charmeleon', 'Charizard', 'Squirtle', 'Wartortle', 'Blastoise', 'Caterpie', 'Metapod', 'Butterfree', 'Weedle', 'Kakuna', 'Beedrill', 'Pidgey']
['£63.00', '£87.00', '£105.00', '£48.00', '£165.00', '£156.00', '£130.00', '£123.00', '£76.00', '£73.00', '£148.00', '£162.00', '£25.00', '£148.00', '£168.00', '£159.00']
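
The prices are still strings with a currency symbol, so they cannot be summed or sorted numerically yet. One possible conversion, assuming all prices are in pounds and may contain thousands separators, is sketched below; `parse_price` is a hypothetical helper and the third sample value is made up to show the separator case.

```python
from decimal import Decimal


def parse_price(price_text):
    """Convert a price string such as '£63.00' into a Decimal amount."""
    # Drop the currency symbol and any thousands separators before converting
    cleaned = price_text.replace('£', '').replace(',', '').strip()
    return Decimal(cleaned)


sample_prices = ['£63.00', '£87.00', '£1,105.50']
print([parse_price(p) for p in sample_prices])
# [Decimal('63.00'), Decimal('87.00'), Decimal('1105.50')]
```

`Decimal` is used here instead of `float` so that monetary amounts keep their exact two-decimal representation.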


In [11]:
extract_data = lambda product: get_product_info(product)

In [15]:
def extract_data(product):
    product_name = product.find('h2', {'class': 'woocommerce-loop-product__title'}).text.strip()
    product_price = product.find('span', {'class': 'woocommerce-Price-amount amount'}).text.strip()
    return product_name, product_price

## Step 5: Create a DataFrame to store the extracted data

In [39]:
# Build a DataFrame from the two parallel lists
df = pd.DataFrame({'name': product_name, 'price': product_price})

# Return the name and price of a single product as a tuple
def extract_data(product):
    product_name = product.find('h2', {'class': 'woocommerce-loop-product__title'}).text.strip()
    product_price = product.find('span', {'class': 'woocommerce-Price-amount amount'}).text.strip()
    return product_name, product_price

df

|    | name | price |
|---|---|---|
| 0 | Bulbasaur | £63.00 |
| 1 | Ivysaur | £87.00 |
| 2 | Venusaur | £105.00 |
| 3 | Charmander | £48.00 |
| 4 | Charmeleon | £165.00 |
| 5 | Charizard | £156.00 |
| 6 | Squirtle | £130.00 |
| 7 | Wartortle | £123.00 |
| 8 | Blastoise | £76.00 |
| 9 | Caterpie | £73.00 |


## Step 6: Save the data to a CSV file

In [40]:
df.to_csv('product_data.csv', index=False)
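
A quick way to confirm that a CSV was written correctly is to read it back and compare it with the original data. The sketch below does this with a small stand-in DataFrame and a temporary directory; `sample_df`, `csv_path`, and `loaded_df` are names chosen for illustration, and the explicit `encoding='utf-8'` simply makes the handling of the `£` symbol visible.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the scraped DataFrame
sample_df = pd.DataFrame({
    'name': ['Bulbasaur', 'Ivysaur'],
    'price': ['£63.00', '£87.00'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'product_data.csv')
    # index=False keeps the row numbers out of the file
    sample_df.to_csv(csv_path, index=False, encoding='utf-8')
    # Read the file back while the temporary directory still exists
    loaded_df = pd.read_csv(csv_path, encoding='utf-8')

print(loaded_df)
```

The same `pd.read_csv('product_data.csv')` call can be run against the file saved above to check that the names and prices survived the round trip.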

## Step 7: Print the DataFrame to verify the extracted data


In [41]:
print(df)

          name    price
0    Bulbasaur   £63.00
1      Ivysaur   £87.00
2     Venusaur  £105.00
3   Charmander   £48.00
4   Charmeleon  £165.00
5    Charizard  £156.00
6     Squirtle  £130.00
7    Wartortle  £123.00
8    Blastoise   £76.00
9     Caterpie   £73.00
10     Metapod  £148.00
11  Butterfree  £162.00
12      Weedle   £25.00
13      Kakuna  £148.00
14    Beedrill  £168.00
15      Pidgey  £159.00


## Step 8: Save the data to a database

This step is optional. The example below uses MongoDB through `pymongo`.

In [49]:
import os
from pymongo import MongoClient

# Read the connection string from an environment variable rather than
# hardcoding credentials in the notebook (the variable name is an assumption)
MONGO_CONNECTION_STRING = os.environ['MONGO_CONNECTION_STRING']
client = MongoClient(MONGO_CONNECTION_STRING)

db = client['T5_db']
collection = db['Web_scraping_db']

data_dict = df.to_dict("records")

collection.insert_many(data_dict)

InsertManyResult([ObjectId('66b11ecd0bc9ffdad2fc165f'), ObjectId('66b11ecd0bc9ffdad2fc1660'), ObjectId('66b11ecd0bc9ffdad2fc1661'), ObjectId('66b11ecd0bc9ffdad2fc1662'), ObjectId('66b11ecd0bc9ffdad2fc1663'), ObjectId('66b11ecd0bc9ffdad2fc1664'), ObjectId('66b11ecd0bc9ffdad2fc1665'), ObjectId('66b11ecd0bc9ffdad2fc1666'), ObjectId('66b11ecd0bc9ffdad2fc1667'), ObjectId('66b11ecd0bc9ffdad2fc1668'), ObjectId('66b11ecd0bc9ffdad2fc1669'), ObjectId('66b11ecd0bc9ffdad2fc166a'), ObjectId('66b11ecd0bc9ffdad2fc166b'), ObjectId('66b11ecd0bc9ffdad2fc166c'), ObjectId('66b11ecd0bc9ffdad2fc166d'), ObjectId('66b11ecd0bc9ffdad2fc166e')], acknowledged=True)