
# Web Scraping Tutorial

This notebook is a step-by-step guide to scraping data from a website. Web scraping extracts information from web pages and converts it into a structured format, which is useful for data analysis, machine learning, and other data-driven tasks.

In this tutorial, we will walk through the process of scraping product information from a sample e-commerce site. By following these steps, you will learn how to:

1. Send HTTP requests to retrieve web pages.
2. Parse HTML content using BeautifulSoup.
3. Identify and extract relevant data elements from the parsed HTML.
4. Store the extracted data in a structured format using pandas.
5. Save the data to a CSV file.
6. Optionally, save the data to a database such as MongoDB.

The website we will be scraping is [ScrapeMe](https://scrapeme.live/shop/). This site is designed for practice purposes and contains a variety of products with details such as names and prices, which makes it an ideal candidate for learning web scraping techniques.

Before you begin, please visit the site to understand its structure. This will help you identify the elements you need to scrape.

Let's get started!

## Install and import libraries

In [1]:
!pip install requests
!pip install beautifulsoup4



In [2]:
!pip install pandas



In [3]:
import requests
from bs4 import BeautifulSoup

## Step 1: Send a request to the website

In [4]:
url = 'https://scrapeme.live/shop/'

# Fetch the content of the web page
response = requests.get(url)
print(response.status_code)
print(response.text[:500])

200

<!doctype html>
<html lang="en-GB">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=2.0">
<link rel="profile" href="http://gmpg.org/xfn/11">
<link rel="pingback" href="https://scrapeme.live/xmlrpc.php">

<title>Products &#8211; ScrapeMe</title>
<link rel='dns-prefetch' href='//fonts.googleapis.com' />
<link rel='dns-prefetch' href='//s.w.org' />
<link rel="alternate" type="application/rss+xml" title="ScrapeMe &raquo; Feed" href="ht
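
The request above has no timeout and only prints the status code. A minimal sketch of a slightly more defensive fetch follows. The helper name `fetch_html`, the 10-second timeout, and the optional `session` argument are assumptions made for illustration; they are not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, raising on HTTP errors (hypothetical helper)."""
    # Reuse a caller-supplied session if given; otherwise create one for this call.
    session = session or requests.Session()
    response = session.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

It could be called as `html = fetch_html('https://scrapeme.live/shop/')`, and the result passed to `BeautifulSoup` in the next step.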


## Step 2: Parse the HTML content of the page

In [5]:
# Create a BeautifulSoup object and specify the parser
soup = BeautifulSoup(response.text, 'html.parser')

# Print the title of the web page
print(soup.title)

<title>Products – ScrapeMe</title>


## Step 3: Inspect the website and identify the elements to scrape
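Use your browser's developer tools to inspect the page and find the elements that hold the data you want (e.g., product names and prices). On this site, each product is an `li` element with the class `product`.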

In [6]:
# Find every <li> element with the class 'product'
products = soup.find_all('li', class_='product')
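
To see how these class names map onto markup, the sketch below parses a small hand-written HTML fragment. The fragment is an assumption modeled on the class names used in this notebook, not a copy of the live page, and `extract_products` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption for illustration).
SAMPLE_HTML = """
<ul class="products">
  <li class="product type-product">
    <h2 class="woocommerce-loop-product__title">Bulbasaur</h2>
    <span class="price"><span class="woocommerce-Price-amount amount"><span class="woocommerce-Price-currencySymbol">£</span>63.00</span></span>
  </li>
  <li class="product type-product">
    <h2 class="woocommerce-loop-product__title">Ivysaur</h2>
    <span class="price"><span class="woocommerce-Price-amount amount"><span class="woocommerce-Price-currencySymbol">£</span>87.00</span></span>
  </li>
</ul>
"""


def extract_products(html):
    """Return one dict per product found in the HTML (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for item in soup.find_all('li', class_='product'):
        name = item.find('h2', class_='woocommerce-loop-product__title')
        price = item.find('span', class_='woocommerce-Price-amount')
        # find() returns None when nothing matches, so skip incomplete items.
        if name is None or price is None:
            continue
        rows.append({
            'Product_Name': name.get_text(strip=True),
            'Product_Price': price.get_text(strip=True),
        })
    return rows


print(extract_products(SAMPLE_HTML))
```

Checking for `None` before reading the text avoids an `AttributeError` if a listing is missing its title or price.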


## Step 4: Extract the desired data

In [7]:
all_products = []
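for page in range(1, 3):  # Pages 1 and 2; range() excludes the stop value
    url = f'https://scrapeme.live/shop/page/{page}'
    # Fetch and parse each page; reusing the Step 2 soup would repeat page 1
    page_soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    products = page_soup.find_all('li', class_='product')
    for product in products: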

        product_name = product.find('h2', class_='woocommerce-loop-product__title').text
        product_price = product.find('span', class_='woocommerce-Price-amount').text

        all_products.append({
            'Product_Name': product_name,
            'Product_Price': product_price
        })
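
When looping over page numbers, note that `range(1, 3)` yields 1 and 2, while the tuple `(1, 3)` yields 1 and 3. The sketch below builds the page URLs separately so they can be checked before any request is sent. `page_urls` is a hypothetical helper, and the `/page/{n}` pattern is the one used in the loop above.

```python
def page_urls(base_url, first_page, last_page):
    """Return listing URLs for first_page..last_page, inclusive (hypothetical helper)."""
    base = base_url.rstrip('/')
    # range() excludes its stop value, hence last_page + 1.
    return [f'{base}/page/{page}' for page in range(first_page, last_page + 1)]


print(page_urls('https://scrapeme.live/shop/', 1, 2))
# prints ['https://scrapeme.live/shop/page/1', 'https://scrapeme.live/shop/page/2']
```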



In [8]:
len(all_products)

32

In [9]:
all_products

[{'Product_Name': 'Bulbasaur', 'Product_Price': '£63.00'},
 {'Product_Name': 'Ivysaur', 'Product_Price': '£87.00'},
 {'Product_Name': 'Venusaur', 'Product_Price': '£105.00'},
 {'Product_Name': 'Charmander', 'Product_Price': '£48.00'},
 {'Product_Name': 'Charmeleon', 'Product_Price': '£165.00'},
 {'Product_Name': 'Charizard', 'Product_Price': '£156.00'},
 {'Product_Name': 'Squirtle', 'Product_Price': '£130.00'},
 {'Product_Name': 'Wartortle', 'Product_Price': '£123.00'},
 {'Product_Name': 'Blastoise', 'Product_Price': '£76.00'},
 {'Product_Name': 'Caterpie', 'Product_Price': '£73.00'},
 {'Product_Name': 'Metapod', 'Product_Price': '£148.00'},
 {'Product_Name': 'Butterfree', 'Product_Price': '£162.00'},
 {'Product_Name': 'Weedle', 'Product_Price': '£25.00'},
 {'Product_Name': 'Kakuna', 'Product_Price': '£148.00'},
 {'Product_Name': 'Beedrill', 'Product_Price': '£168.00'},
 {'Product_Name': 'Pidgey', 'Product_Price': '£159.00'},
 ...
 

## Step 5: Create a DataFrame to store the extracted data

In [10]:
import pandas as pd
df = pd.DataFrame(all_products)
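
The `Product_Price` column holds strings such as `'£63.00'`, so numeric operations (sorting by value, averaging) need a conversion first. A minimal sketch, assuming every price is a currency symbol followed by a plain decimal number; `parse_price` is a hypothetical helper.

```python
from decimal import Decimal
import re


def parse_price(text):
    """Convert a string like '£1,063.00' to Decimal('1063.00') (hypothetical helper)."""
    # Drop everything except digits and the decimal point.
    cleaned = re.sub(r'[^0-9.]', '', text)
    if not cleaned:
        raise ValueError(f'no numeric value in {text!r}')
    return Decimal(cleaned)


print(parse_price('£63.00'))  # prints 63.00
```

Applied to the DataFrame, this could look like `df['Price'] = df['Product_Price'].map(parse_price)`, where the `Price` column name is an assumption.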


In [11]:
df.head()

  Product_Name Product_Price
0    Bulbasaur        £63.00
1      Ivysaur        £87.00
2     Venusaur       £105.00
3   Charmander        £48.00
4   Charmeleon       £165.00


## Step 6: Save the data to a CSV file

In [12]:

df.to_csv('products.csv', index=False)
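
A quick way to confirm the data survives the trip to CSV is to read it back and compare. The sketch below does the round trip through an in-memory buffer rather than `products.csv`, and uses a two-row stand-in DataFrame (an assumption) so that it runs on its own.

```python
import io

import pandas as pd

# Stand-in data with the same columns as the scraped DataFrame.
sample = pd.DataFrame([
    {'Product_Name': 'Bulbasaur', 'Product_Price': '£63.00'},
    {'Product_Name': 'Ivysaur', 'Product_Price': '£87.00'},
])

buffer = io.StringIO()
sample.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored)
```

For the real file, the same check would start from `pd.read_csv('products.csv')`.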

## Step 7: Print the DataFrame to verify the extracted data

In [13]:
print(df)

   Product_Name Product_Price
0     Bulbasaur        £63.00
1       Ivysaur        £87.00
2      Venusaur       £105.00
3    Charmander        £48.00
4    Charmeleon       £165.00
5     Charizard       £156.00
6      Squirtle       £130.00
7     Wartortle       £123.00
8     Blastoise        £76.00
9      Caterpie        £73.00
10      Metapod       £148.00
11   Butterfree       £162.00
12       Weedle        £25.00
13       Kakuna       £148.00
14     Beedrill       £168.00
15       Pidgey       £159.00
...


## Step 8: Save the data to a database

This example uses MongoDB through `pymongo`; any database of your choice would work.

In [14]:
!pip install pymongo

Collecting pymongo
  Downloading pymongo-4.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.6.1-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Downloading dnspython-2.6.1-py3-none-any.whl (307 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.6.1 pymongo-4.8.0


In [15]:
from pymongo import MongoClient

In [16]:
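# Replace <username> and <password> with your own credentials; never commit real ones to a notebook.
MONGO_CONNECTION_STRING = "mongodb+srv://<username>:<password>@hatemcluster.1ruxsni.mongodb.net/?retryWrites=true&w=majority&appName=HatemCluster"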


In [17]:
# Connect to the cluster and select the database and collection
client = MongoClient(MONGO_CONNECTION_STRING)
db = client['Pokemon_database']
collection = db['pokemon_collection']

In [18]:
data_dict = df.to_dict("records")
collection.insert_many(data_dict)

InsertManyResult([ObjectId('66b11dc0dd99682045ad9264'), ObjectId('66b11dc0dd99682045ad9265'), ObjectId('66b11dc0dd99682045ad9266'), ObjectId('66b11dc0dd99682045ad9267'), ObjectId('66b11dc0dd99682045ad9268'), ObjectId('66b11dc0dd99682045ad9269'), ObjectId('66b11dc0dd99682045ad926a'), ObjectId('66b11dc0dd99682045ad926b'), ObjectId('66b11dc0dd99682045ad926c'), ObjectId('66b11dc0dd99682045ad926d'), ObjectId('66b11dc0dd99682045ad926e'), ObjectId('66b11dc0dd99682045ad926f'), ObjectId('66b11dc0dd99682045ad9270'), ObjectId('66b11dc0dd99682045ad9271'), ObjectId('66b11dc0dd99682045ad9272'), ObjectId('66b11dc0dd99682045ad9273'), ObjectId('66b11dc0dd99682045ad9274'), ObjectId('66b11dc0dd99682045ad9275'), ObjectId('66b11dc0dd99682045ad9276'), ObjectId('66b11dc0dd99682045ad9277'), ObjectId('66b11dc0dd99682045ad9278'), ObjectId('66b11dc0dd99682045ad9279'), ObjectId('66b11dc0dd99682045ad927a'), ObjectId('66b11dc0dd99682045ad927b'), ObjectId('66b11dc0dd99682045ad927c'), ObjectId('66b11dc0dd99682045ad92

In [19]:
document_count = collection.count_documents({})
print(f'The number of documents in the collection is: {document_count}')

The number of documents in the collection is: 32
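
A scrape can contain the same product more than once (for example, if a page is fetched twice), and `insert_many` stores every record it is given. One way to guard against that is to drop repeated records before inserting. A minimal sketch, assuming `Product_Name` identifies a product uniquely; `dedupe_records` is a hypothetical helper.

```python
def dedupe_records(records, key='Product_Name'):
    """Keep the first record seen for each key value (hypothetical helper)."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value in seen:
            continue
        seen.add(value)
        unique.append(record)
    return unique


records = [
    {'Product_Name': 'Bulbasaur', 'Product_Price': '£63.00'},
    {'Product_Name': 'Ivysaur', 'Product_Price': '£87.00'},
    {'Product_Name': 'Bulbasaur', 'Product_Price': '£63.00'},
]
print(len(dedupe_records(records)))  # prints 2
```

With the notebook's variables, the insert would become `collection.insert_many(dedupe_records(data_dict))`.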
