
# Web Scraping Tutorial

This notebook is a step-by-step guide to scraping data from a website. Web scraping extracts information from web pages and converts it into a structured format, which is useful for data analysis, machine learning, and other data-driven tasks.

In this tutorial, we will walk through the process of scraping product information from a sample e-commerce site. By following these steps, you will learn how to:

1. Send HTTP requests to retrieve web pages.
2. Parse HTML content using BeautifulSoup.
3. Identify and extract relevant data elements from the parsed HTML.
4. Store the extracted data in a structured format using pandas.
5. Save the data to a CSV file.
6. Optionally, save the data to a database such as MongoDB.

The website we will be scraping is [ScrapeMe](https://scrapeme.live/shop/). This site is designed for practice purposes and contains a variety of products with details such as names and prices, which makes it an ideal candidate for learning web scraping techniques.

Before you begin, please visit the site to understand its structure. This will help you identify the elements you need to scrape.

Let's get started!

In [1]:
!pip install pymongo requests beautifulsoup4


## Import libraries here

In [11]:
import os                      # read settings such as the database URL

import pandas as pd            # tabular storage and CSV export
import pymongo                 # MongoDB client
import requests                # HTTP requests
from bs4 import BeautifulSoup  # HTML parsing

## Step 1: Send a request to the website

In [3]:
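import requests

url = 'https://scrapeme.live/shop/'

# A timeout keeps the request from hanging indefinitely.
response = requests.get(url, timeout=10)

print(response.status_code)  # 200 means the request succeeded
print(response.text[:500])   # preview the first 500 characters of the HTML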

200

<!doctype html>
<html lang="en-GB">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=2.0">
<link rel="profile" href="http://gmpg.org/xfn/11">
<link rel="pingback" href="https://scrapeme.live/xmlrpc.php">

<title>Products &#8211; ScrapeMe</title>
<link rel='dns-prefetch' href='//fonts.googleapis.com' />
<link rel='dns-prefetch' href='//s.w.org' />
<link rel="alternate" type="application/rss+xml" title="ScrapeMe &raquo; Feed" href="ht
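
A status code of 200 is the happy path, but a scraper usually also needs to cope with slow servers and error responses. The sketch below is one possible wrapper around `requests.get`; the function name `fetch_html` and its `get` parameter are assumptions made for illustration (the `get` parameter only exists so the function can be exercised without network access).

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body as text, or raise if the request fails.

    `fetch_html` and its `get` parameter are hypothetical helpers,
    not part of the original notebook.
    """
    # A timeout stops the call from waiting forever on a slow server.
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://scrapeme.live/shop/')` would return the same HTML as `response.text` above, and would raise an exception instead of silently returning an error page.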


## Step 2: Parse the HTML content of the page

In [4]:
# Build a parse tree from the raw HTML using Python's built-in parser.
soup = BeautifulSoup(response.text, 'html.parser')


## Step 3: Inspect the website and identify the elements to scrape
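Open the shop page in your browser's developer tools and find the tags and classes that hold the data you want. On this page, each product name is in an `h2` element with the class `woocommerce-loop-product__title`, and each price is in a `span` element with the class `price`. Printing the page text, as in the next cell, is a quick way to confirm that the data is present in the HTML.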
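
Before extracting anything, it can help to look at the markup of a single product. The sketch below runs on a simplified, hypothetical product card: only the class names `woocommerce-loop-product__title` and `price` come from this notebook, and the surrounding `ul`/`li` structure is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for one product card. Only the two class
# names on <h2> and <span> are taken from this notebook.
sample_html = """
<ul class="products">
  <li class="product">
    <h2 class="woocommerce-loop-product__title">Bulbasaur</h2>
    <span class="price">£63.00</span>
  </li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# select_one() takes a CSS selector and returns the first match (or None).
title_tag = sample_soup.select_one("h2.woocommerce-loop-product__title")
price_tag = sample_soup.select_one("span.price")

print(title_tag.get_text(strip=True))
print(price_tag.get_text(strip=True))

# prettify() prints a tag with indentation, which makes nesting easier to read.
print(title_tag.parent.prettify())
```

On the real page, the same selectors can be tried against `soup` once the HTML has been parsed.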

In [5]:
print(soup.text)

Products – ScrapeMe

Skip to navigation
Skip to content

ScrapeMeJust another WordPress site

Search for:
Search

Menu
Home
Home

£0.00 0 items

Home / Product

Sort by popularity
Sort by average rating
Sort by newness
Sort by price: low to high
Sort by price: high to low

Showing 1–16 of 755 results

1
2
3
4
…
46
47
48
→

Bulbasaur
£63.00
Add to basket

Ivysaur
£87.00
Add to basket

Venusaur
£105.00
Add to basket

Charmander
£48.00
Add to basket

Charmeleon
£165.00
Add to basket

Charizard
£156.00
Add to basket

Squirtle
£130.00
Add to basket

Wartortle
£123.00
Add to basket

Blastoise
£76.00
Add to basket

Caterpie
£73.00
Add to basket

Metapod
£148.00
Add to basket

Butterfree
£162.00
Add to basket

Weedle
£25.00
Add to basket

Kakuna
£148.00
Add to basket

Beedrill
£168.00
Add to basket

Pidgey
£159.00
Add to basket



Sort by popularity
Sort by average rating
Sort by newness
Sort by price: low to hi
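
The printed text includes the line "Showing 1–16 of 755 results" and page links up to 48, so the first page holds only a small part of the catalogue. As a small worked example, the page count can be derived from that summary line alone; the variable names below are illustrative.

```python
import math
import re

# Summary line copied from the page text printed above.
summary = "Showing 1–16 of 755 results"

# Pull out the three integers: first item shown, last item shown, total.
first, last, total = (int(n) for n in re.findall(r"\d+", summary))

per_page = last - first + 1
pages = math.ceil(total / per_page)

print(per_page, pages)  # prints: 16 48
```

The result agrees with the last page number in the pagination links. The rest of this notebook works with the first page only.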

## Step 4: Extract the desired data

In [6]:
# Each product price sits in a <span class="price"> element.
for price in soup.find_all('span', class_='price'):
    print(price.text[:20])

£63.00
£87.00
£105.00
£48.00
£165.00
£156.00
£130.00
£123.00
£76.00
£73.00
£148.00
£162.00
£25.00
£148.00
£168.00
£159.00
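
The prices come back as strings such as `£63.00`, which cannot be summed or sorted numerically. A minimal sketch of converting them follows; the helper name `parse_price` and the assumption that prices use a `£` prefix with optional thousands separators are illustrative, not something the site guarantees.

```python
from decimal import Decimal


def parse_price(text, currency_symbol="£"):
    """Convert a price string such as '£63.00' to a Decimal.

    Hypothetical helper: assumes a leading currency symbol and an
    optional comma as thousands separator.
    """
    cleaned = text.strip().replace(currency_symbol, "").replace(",", "")
    return Decimal(cleaned)


raw_prices = ["£63.00", "£87.00", "£105.00"]
numeric_prices = [parse_price(p) for p in raw_prices]

print(numeric_prices)
print(sum(numeric_prices))  # prints: 255.00
```

`Decimal` is used here instead of `float` so that currency values keep their exact two decimal places.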


## Step 5: Create a DataFrame to store the extracted data

In [7]:
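# Product names are in <h2 class="woocommerce-loop-product__title"> elements.
prods = soup.find_all('h2', attrs={"class": "woocommerce-loop-product__title"})
titles = [prod.text for prod in prods]

# Prices are in <span class="price"> elements.
prices = [price.text.strip()[:20] for price in soup.find_all('span', class_='price')]

# Trim both lists to the same length so they fit in one DataFrame.
min_length = min(len(titles), len(prices))
titles = titles[:min_length]
prices = prices[:min_length]
data = {'Products Name': titles, 'Price': prices}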

In [8]:
df = pd.DataFrame(data)
df

Unnamed: 0,Products Name,Price
0,Bulbasaur,£63.00
1,Ivysaur,£87.00
2,Venusaur,£105.00
3,Charmander,£48.00
4,Charmeleon,£165.00
5,Charizard,£156.00
6,Squirtle,£130.00
7,Wartortle,£123.00
8,Blastoise,£76.00
9,Caterpie,£73.00
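
Collecting names and prices in two separate lists works when every product has both, but if one product lacked a price, every later name would be paired with the wrong price. One alternative is to loop over each product card and read both fields from it. The sketch below assumes each product is wrapped in an `li` element with class `product`; that container and the sample markup are assumptions, not taken from the page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second card deliberately has no price.
sample_html = """
<ul class="products">
  <li class="product">
    <h2 class="woocommerce-loop-product__title">Bulbasaur</h2>
    <span class="price">£63.00</span>
  </li>
  <li class="product">
    <h2 class="woocommerce-loop-product__title">Ivysaur</h2>
  </li>
  <li class="product">
    <h2 class="woocommerce-loop-product__title">Venusaur</h2>
    <span class="price">£105.00</span>
  </li>
</ul>
"""


def extract_rows(html):
    """Return one dict per product card, with None for missing fields."""
    card_soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in card_soup.select("li.product"):  # assumed container
        title = card.select_one("h2.woocommerce-loop-product__title")
        price = card.select_one("span.price")
        rows.append({
            "Products Name": title.get_text(strip=True) if title is not None else None,
            "Price": price.get_text(strip=True) if price is not None else None,
        })
    return rows


rows = extract_rows(sample_html)
for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed directly to `pd.DataFrame(rows)`.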


## Step 6: Save the data to a CSV file

In [9]:
df.to_csv('store.csv')
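
By default, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below illustrates the difference with a small stand-in DataFrame written to in-memory buffers; `df_demo` and the buffer names are illustrative.

```python
import io

import pandas as pd

# Small stand-in for the scraped DataFrame.
df_demo = pd.DataFrame({
    "Products Name": ["Bulbasaur", "Ivysaur"],
    "Price": ["£63.00", "£87.00"],
})

# Default behaviour: the index is written as an extra, unnamed column.
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False writes only the data columns.
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'Products Name', 'Price']
print(cols_without_index)  # ['Products Name', 'Price']
```

Passing `index=False` is usually the simpler choice when the index is just a row counter.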

## Step 7: Print the DataFrame to verify the extracted data

In [10]:
df

Unnamed: 0,Products Name,Price
0,Bulbasaur,£63.00
1,Ivysaur,£87.00
2,Venusaur,£105.00
3,Charmander,£48.00
4,Charmeleon,£165.00
5,Charizard,£156.00
6,Squirtle,£130.00
7,Wartortle,£123.00
8,Blastoise,£76.00
9,Caterpie,£73.00


## Step 8: Save the data to a database
Save the data to a database of your choice. If you are using MongoDB, include the code here.

In [12]:
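# Read the connection string from an environment variable instead of
# hardcoding credentials in the notebook. MONGO_URL is a name chosen here;
# set it to your own MongoDB connection string before running this cell.
mongo_url = os.environ['MONGO_URL']
client = pymongo.MongoClient(mongo_url)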

In [13]:
db = client['storeDB']
collection = db['storeDB']
data = [{'Product Name': title, 'Price': price} for title, price in zip(titles, prices)]

collection.insert_many(data)
print("Data saved successfully to MongoDB.")

OperationFailure: bad auth : authentication failed, full error: {'ok': 0, 'errmsg': 'bad auth : authentication failed', 'code': 8000, 'codeName': 'AtlasError'}
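
Once authentication succeeds, running `insert_many` again on every scrape would store the same products repeatedly. One way around this is to upsert each product instead. The sketch below assumes product names are unique and uses a hypothetical helper `save_products`; it relies only on the collection's `update_one` method with `upsert=True`.

```python
def save_products(collection, titles, prices):
    """Upsert one document per product, keyed on the product name.

    Hypothetical helper: `collection` is expected to behave like a pymongo
    collection, and product names are assumed to be unique.
    """
    saved = 0
    for title, price in zip(titles, prices):
        collection.update_one(
            {"Product Name": title},    # match an existing document by name
            {"$set": {"Price": price}}, # update the price, or set it on insert
            upsert=True,                # insert if no document matched
        )
        saved += 1
    return saved
```

With the objects defined above, this would be called as `save_products(collection, titles, prices)`. Rerunning it updates prices in place rather than adding duplicates.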