# Getting Data From Internet

## Part 1. Scraping shopping website


https://scrapingclub.com/exercise/list_basic/?page=1



Note: Don't scrape data from a website if you are not sure you are allowed to.

Eg. https://unsplash.com does not allow you to scrape images


### Objective: 
- Loop through all pages to get the links of all dresses
- For each link: get the following information:

image-link

price

name of the dress

descriptin of the dress (inside every page)

- Make a data Frame of the above information

- Download all the images to a single folder

<h3>Import necessary modules</h3>

In [1]:
import requests
from bs4 import BeautifulSoup

## The http request

In [2]:
url = "https://scrapingclub.com/exercise/list_basic/?page=1"
response = requests.get(url)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")

Success


Change to any other page here:

In [3]:
page = input("Please enter the number of the page ")
url = "https://scrapingclub.com/exercise/list_basic/?page=" + page
response = requests.get(url)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")

Please enter the number of the page 1
Success


**Exercise 1.** How many pages do we have?

In [4]:
url = "https://scrapingclub.com/exercise/list_basic/"
response = requests.get(url)

if response.status_code == 200:
    print("Success")
    results_page = BeautifulSoup(response.content,'lxml')
    pagination = results_page.find("ul", {"class": "pagination"})
    pages = pagination.find_all("a")
    print("Il y'a {} pages.".format(len(pages)))
else:
    print("Failure")


Success
Il y'a 7 pages.


<h3>Set up the BeautifulSoup object</h3>

In [5]:
results_page = BeautifulSoup(response.content,'lxml')
print(results_page.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="/static/img/icon.611132651e39.png" rel="icon" type="image/png"/>
  <meta content="Not only crawl products but also handle pagination" name="description"/>
  <title>
   Recursively Scraping pages | ScrapingClub
  </title>
  <!-- Bootstrap core CSS -->
  <link href="/static/bootstrap/css/bootstrap.min.a9766a313743.css" rel="stylesheet"/>
  <link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css" rel="stylesheet"/>
  <link href="/static/css/custom.e6122d0f915e.css" rel="stylesheet"/>
 </head>
 <body>
  <nav class="navbar navbar-expand-lg fixed-top navbar-dark bg-primary">
   <div class="container">
    <a class="navbar-brand" href="/">
     <img alt="ScrapingClub" class="nav-logo" src="/static/img/brand-logo.ad7a4888d334.png"/>
    </a>
    <bu

<h3>BS4 functions</h3>

`find_all` finds all instances of a specified tag. Return a list

In [6]:
all_a_tags = results_page.find_all('a')
print(type(all_a_tags))

<class 'bs4.element.ResultSet'>


In [7]:
len(all_a_tags)

43

You can check the following elements on your browsers

In [8]:
all_a_tags[0]

<a class="navbar-brand" href="/">
<img alt="ScrapingClub" class="nav-logo" src="/static/img/brand-logo.ad7a4888d334.png"/>
</a>

In [9]:
all_a_tags[10]

<a href="/exercise/list_basic_detail/96436-A/">Patterned Slacks</a>

`find` finds the first instance of a specified tag

In [10]:
results_page.find("div")

<div class="container">
<a class="navbar-brand" href="/">
<img alt="ScrapingClub" class="nav-logo" src="/static/img/brand-logo.ad7a4888d334.png"/>
</a>
<button aria-controls="navbarCollapse" aria-expanded="false" aria-label="Toggle navigation" class="navbar-toggler" data-target="#navbarCollapse" data-toggle="collapse" type="button">
<span class="navbar-toggler-icon"></span>
</button>
<div class="collapse navbar-collapse" id="navbarCollapse">
<ul class="navbar-nav mr-auto">
<li class="nav-item">
<a class="nav-link" href="/">Home
            <span class="sr-only">(current)</span>
</a>
</li>
<li class="nav-item">
<a class="nav-link" href="/blog/">Blog</a>
</li>
<li class="nav-item">
<a class="nav-link" href="/about/">About</a>
</li>
<li class="nav-item">
<a class="nav-link" href="/contact/">Contact</a>
</li>
<li class="nav-item active">
<a class="nav-link" href="//eepurl.com/dmPGn9"><i class="fa fa-send"></i> Subscribe</a>
</li>
</ul>
</div>
</div>

<h4>bs4 functions can be recursively applied on elements</h4>

### Real work
Now go to the objective of the exercise

### Fetch data from page 1

In [11]:
import time

In [12]:
url = "https://scrapingclub.com/exercise/list_basic/?page=1"
response = requests.get(url)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")

Success


In [13]:
results_page = BeautifulSoup(response.content,'lxml')
print(results_page.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="/static/img/icon.611132651e39.png" rel="icon" type="image/png"/>
  <meta content="Not only crawl products but also handle pagination" name="description"/>
  <title>
   Recursively Scraping pages | ScrapingClub
  </title>
  <!-- Bootstrap core CSS -->
  <link href="/static/bootstrap/css/bootstrap.min.a9766a313743.css" rel="stylesheet"/>
  <link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css" rel="stylesheet"/>
  <link href="/static/css/custom.e6122d0f915e.css" rel="stylesheet"/>
 </head>
 <body>
  <nav class="navbar navbar-expand-lg fixed-top navbar-dark bg-primary">
   <div class="container">
    <a class="navbar-brand" href="/">
     <img alt="ScrapingClub" class="nav-logo" src="/static/img/brand-logo.ad7a4888d334.png"/>
    </a>
    <bu

In [14]:
cards = results_page.find_all("div", {"class": "card"})
print(cards)

[<div class="card">
<a href="/exercise/list_basic_detail/90008-E/"><img alt="" class="card-img-top img-fluid" src="/static/img/90008-E.jpg"/></a>
<div class="card-body">
<h4 class="card-title">
<a href="/exercise/list_basic_detail/90008-E/">Short Dress</a>
</h4>
<h5>$24.99</h5>
</div>
</div>, <div class="card">
<a href="/exercise/list_basic_detail/96436-A/"><img alt="" class="card-img-top img-fluid" src="/static/img/96436-A.jpg"/></a>
<div class="card-body">
<h4 class="card-title">
<a href="/exercise/list_basic_detail/96436-A/">Patterned Slacks</a>
</h4>
<h5>$29.99</h5>
</div>
</div>, <div class="card">
<a href="/exercise/list_basic_detail/93926-B/"><img alt="" class="card-img-top img-fluid" src="/static/img/93926-B.jpg"/></a>
<div class="card-body">
<h4 class="card-title">
<a href="/exercise/list_basic_detail/93926-B/">Short Chiffon Dress</a>
</h4>
<h5>$49.99</h5>
</div>
</div>, <div class="card">
<a href="/exercise/list_basic_detail/90882-B/"><img alt="" class="card-img-top img-fluid

In [15]:
cards[:9]

[<div class="card">
 <a href="/exercise/list_basic_detail/90008-E/"><img alt="" class="card-img-top img-fluid" src="/static/img/90008-E.jpg"/></a>
 <div class="card-body">
 <h4 class="card-title">
 <a href="/exercise/list_basic_detail/90008-E/">Short Dress</a>
 </h4>
 <h5>$24.99</h5>
 </div>
 </div>, <div class="card">
 <a href="/exercise/list_basic_detail/96436-A/"><img alt="" class="card-img-top img-fluid" src="/static/img/96436-A.jpg"/></a>
 <div class="card-body">
 <h4 class="card-title">
 <a href="/exercise/list_basic_detail/96436-A/">Patterned Slacks</a>
 </h4>
 <h5>$29.99</h5>
 </div>
 </div>, <div class="card">
 <a href="/exercise/list_basic_detail/93926-B/"><img alt="" class="card-img-top img-fluid" src="/static/img/93926-B.jpg"/></a>
 <div class="card-body">
 <h4 class="card-title">
 <a href="/exercise/list_basic_detail/93926-B/">Short Chiffon Dress</a>
 </h4>
 <h5>$49.99</h5>
 </div>
 </div>, <div class="card">
 <a href="/exercise/list_basic_detail/90882-B/"><img alt="" clas

In [16]:
li = []
# Exclude last card cause it's for menu
for card in cards[:9]:
    # Get datas from card
    link = card.find("a").get("href")
    img_link = card.find("img").get("src")
    title = card.find("a").get_text()
    price = float(card.find("h5").get_text()[1:])
    
    base_url = "https://scrapingclub.com/"
    full_url = base_url + link
    full_img_url = base_url + img_link
    time.sleep(5)
    # Get datas from detail page
    response = requests.get(full_url)
    res_page = BeautifulSoup(response.content, 'lxml')
    desc = res_page.find('p', {'class': 'card-text'}).get_text()
    
    y = [link, img_link, title, price, desc]
    li.append(y)

In [17]:
li

[['/exercise/list_basic_detail/90008-E/',
  '/static/img/90008-E.jpg',
  '',
  24.99,
  'Short dress in woven fabric. Round neckline and opening at back of neck with a button. Yoke at back with concealed pleats, long sleeves, and narrow cuffs with ties. Side pockets. 100% polyester. Machine wash cold.'],
 ['/exercise/list_basic_detail/96436-A/',
  '/static/img/96436-A.jpg',
  '',
  29.99,
  'Ankle-length slacks in patterned stretch cotton satin. Regular waist with concealed hook-and-eye fastener and zip fly. Side pockets and tapered legs with slits at hems. 61% cotton, 36% polyester, 3% spandex. Machine wash...'],
 ['/exercise/list_basic_detail/93926-B/',
  '/static/img/93926-B.jpg',
  '',
  49.99,
  'Short V-neck dress in plumeti chiffon. Gathers and small ruffle at shoulders, dropped shoulders, and long, wide sleeves with buttons at cuffs. Narrow, elasticized seam at waist and circle skirt with ruffled tiers. Lined. 100% polyester. Machine wash warm.'],
 ['/exercise/list_basic_detail

In [18]:
import pandas as pd

df = pd.DataFrame(li)

Unnamed: 0,0,1,2,3,4
0,/exercise/list_basic_detail/90008-E/,/static/img/90008-E.jpg,,24.99,Short dress in woven fabric. Round neckline an...
1,/exercise/list_basic_detail/96436-A/,/static/img/96436-A.jpg,,29.99,Ankle-length slacks in patterned stretch cotto...
2,/exercise/list_basic_detail/93926-B/,/static/img/93926-B.jpg,,49.99,Short V-neck dress in plumeti chiffon. Gathers...
3,/exercise/list_basic_detail/90882-B/,/static/img/90882-B.jpg,,59.99,"Short, fitted off-the-shoulder dress in stretc..."
4,/exercise/list_basic_detail/93756-C/,/static/img/93756-C.jpg,,24.99,Top in woven fabric with V-neck front and back...
5,/exercise/list_basic_detail/93926-C/,/static/img/93926-C.jpg,,49.99,Short V-neck dress in plumeti chiffon. Gathers...
6,/exercise/list_basic_detail/93756-B/,/static/img/93756-B.jpg,,24.99,Top in woven fabric with V-neck front and back...
7,/exercise/list_basic_detail/93756-D/,/static/img/93756-D.jpg,,24.99,Top in woven fabric with V-neck front and back...
8,/exercise/list_basic_detail/96643-A/,/static/img/96643-A.jpg,,59.99,"Short, straight-cut dress in lace. Opening at ..."


In [20]:
df.columns = ["link", "image_link", "title", "price", "desc"]

In [21]:
df

Unnamed: 0,link,image_link,title,price,desc
0,/exercise/list_basic_detail/90008-E/,/static/img/90008-E.jpg,,24.99,Short dress in woven fabric. Round neckline an...
1,/exercise/list_basic_detail/96436-A/,/static/img/96436-A.jpg,,29.99,Ankle-length slacks in patterned stretch cotto...
2,/exercise/list_basic_detail/93926-B/,/static/img/93926-B.jpg,,49.99,Short V-neck dress in plumeti chiffon. Gathers...
3,/exercise/list_basic_detail/90882-B/,/static/img/90882-B.jpg,,59.99,"Short, fitted off-the-shoulder dress in stretc..."
4,/exercise/list_basic_detail/93756-C/,/static/img/93756-C.jpg,,24.99,Top in woven fabric with V-neck front and back...
5,/exercise/list_basic_detail/93926-C/,/static/img/93926-C.jpg,,49.99,Short V-neck dress in plumeti chiffon. Gathers...
6,/exercise/list_basic_detail/93756-B/,/static/img/93756-B.jpg,,24.99,Top in woven fabric with V-neck front and back...
7,/exercise/list_basic_detail/93756-D/,/static/img/93756-D.jpg,,24.99,Top in woven fabric with V-neck front and back...
8,/exercise/list_basic_detail/96643-A/,/static/img/96643-A.jpg,,59.99,"Short, straight-cut dress in lace. Opening at ..."


### Question 2 : récupérer le contenu de toute les pages (env 60 robes)

In [22]:
url = "https://scrapingclub.com/exercise/list_basic/"
response = requests.get(url)
results_page = BeautifulSoup(response.content,'lxml')
pagination = results_page.find("ul", {"class": "pagination"})
pages = pagination.find_all("a")
print("Il y'a {} pages.".format(len(pages)))

Il y'a 7 pages.


In [23]:
def get_page_datas(page):
    cards = page.find_all("div", {"class": "card"})
    li = []
    # Exclude last card cause it's for menu
    for card in cards[:9]:
        # Get datas from card
        link = card.find("a").get("href")

        # TODO: fix none issue
        img_link = card.find("img").get("src")
        title = card.find("a").get_text()
        price = float(card.find("h5").get_text()[1:])

        base_url = "https://scrapingclub.com/"
        full_url = base_url + link
        full_img_url = base_url + img_link
        time.sleep(5)
        # Get datas from detail page
        response = requests.get(full_url)
        res_page = BeautifulSoup(response.content, 'lxml')
        desc = res_page.find('p', {'class': 'card-text'}).get_text()

        y = [link, img_link, title, price, desc]
        li.append(y)
    df = pd.DataFrame(li)
    return df

In [45]:
import pandas as pd

main_df = pd.DataFrame()

for page_number in range (1, len(pages)):
    url = "https://scrapingclub.com/exercise/list_basic/?page=" + str(page)
    time.sleep(5)
    response = requests.get(url)
    if response.status_code == 200:
        print("Success getting page {}.".format(page_number))
        print("Fetching datas...")
        results_page = BeautifulSoup(response.content,'lxml')
        page_df = get_page_datas(results_page)
        main_df = main_df.append(page_df)
    else:
        print("Failure")


Success getting page 1.
Fetching datas...
Success getting page 2.
Fetching datas...


Unnamed: 0,0,1,2,3,4
0,/exercise/list_basic_detail/90008-E/,/static/img/90008-E.jpg,,24.99,Short dress in woven fabric. Round neckline an...
1,/exercise/list_basic_detail/96436-A/,/static/img/96436-A.jpg,,29.99,Ankle-length slacks in patterned stretch cotto...
2,/exercise/list_basic_detail/93926-B/,/static/img/93926-B.jpg,,49.99,Short V-neck dress in plumeti chiffon. Gathers...
3,/exercise/list_basic_detail/90882-B/,/static/img/90882-B.jpg,,59.99,"Short, fitted off-the-shoulder dress in stretc..."
4,/exercise/list_basic_detail/93756-C/,/static/img/93756-C.jpg,,24.99,Top in woven fabric with V-neck front and back...
5,/exercise/list_basic_detail/93926-C/,/static/img/93926-C.jpg,,49.99,Short V-neck dress in plumeti chiffon. Gathers...
6,/exercise/list_basic_detail/93756-B/,/static/img/93756-B.jpg,,24.99,Top in woven fabric with V-neck front and back...
7,/exercise/list_basic_detail/93756-D/,/static/img/93756-D.jpg,,24.99,Top in woven fabric with V-neck front and back...
8,/exercise/list_basic_detail/96643-A/,/static/img/96643-A.jpg,,59.99,"Short, straight-cut dress in lace. Opening at ..."
0,/exercise/list_basic_detail/90008-E/,/static/img/90008-E.jpg,,24.99,Short dress in woven fabric. Round neckline an...


In [46]:
main_df.columns = ["link", "image_link", "title", "price", "desc"]
main_df

Unnamed: 0,link,image_link,title,price,desc
0,/exercise/list_basic_detail/90008-E/,/static/img/90008-E.jpg,,24.99,Short dress in woven fabric. Round neckline an...
1,/exercise/list_basic_detail/96436-A/,/static/img/96436-A.jpg,,29.99,Ankle-length slacks in patterned stretch cotto...
2,/exercise/list_basic_detail/93926-B/,/static/img/93926-B.jpg,,49.99,Short V-neck dress in plumeti chiffon. Gathers...
3,/exercise/list_basic_detail/90882-B/,/static/img/90882-B.jpg,,59.99,"Short, fitted off-the-shoulder dress in stretc..."
4,/exercise/list_basic_detail/93756-C/,/static/img/93756-C.jpg,,24.99,Top in woven fabric with V-neck front and back...
5,/exercise/list_basic_detail/93926-C/,/static/img/93926-C.jpg,,49.99,Short V-neck dress in plumeti chiffon. Gathers...
6,/exercise/list_basic_detail/93756-B/,/static/img/93756-B.jpg,,24.99,Top in woven fabric with V-neck front and back...
7,/exercise/list_basic_detail/93756-D/,/static/img/93756-D.jpg,,24.99,Top in woven fabric with V-neck front and back...
8,/exercise/list_basic_detail/96643-A/,/static/img/96643-A.jpg,,59.99,"Short, straight-cut dress in lace. Opening at ..."
0,/exercise/list_basic_detail/90008-E/,/static/img/90008-E.jpg,,24.99,Short dress in woven fabric. Round neckline an...


#### Download all image in 1 folder

In [25]:
import os

In [33]:
os.mkdir("images")

FileExistsError: [Errno 17] File exists: 'images'

Unnamed: 0,0,1,2,3,4
0,/exercise/list_basic_detail/90008-E/,/static/img/90008-E.jpg,,24.99,Short dress in woven fabric. Round neckline an...
1,/exercise/list_basic_detail/96436-A/,/static/img/96436-A.jpg,,29.99,Ankle-length slacks in patterned stretch cotto...
2,/exercise/list_basic_detail/93926-B/,/static/img/93926-B.jpg,,49.99,Short V-neck dress in plumeti chiffon. Gathers...
3,/exercise/list_basic_detail/90882-B/,/static/img/90882-B.jpg,,59.99,"Short, fitted off-the-shoulder dress in stretc..."
4,/exercise/list_basic_detail/93756-C/,/static/img/93756-C.jpg,,24.99,Top in woven fabric with V-neck front and back...
5,/exercise/list_basic_detail/93926-C/,/static/img/93926-C.jpg,,49.99,Short V-neck dress in plumeti chiffon. Gathers...
6,/exercise/list_basic_detail/93756-B/,/static/img/93756-B.jpg,,24.99,Top in woven fabric with V-neck front and back...
7,/exercise/list_basic_detail/93756-D/,/static/img/93756-D.jpg,,24.99,Top in woven fabric with V-neck front and back...
8,/exercise/list_basic_detail/96643-A/,/static/img/96643-A.jpg,,59.99,"Short, straight-cut dress in lace. Opening at ..."


In [48]:
main_df.iloc[4].image_link

'/static/img/93756-C.jpg'

In [None]:
base_url = "https://scrapingclub.com/"
for i in range(df.shape[0]):
    url = base_url + main_df.iloc[i].image_link
    url.split("/")
    filename = url.split("/")[-1]
    print(f"Downloading image {filename}")
    time.sleep(5)
    response = requests.get(url)
    with open("images/"+filename, 'wb') as f:
        f.write(response.content)
print("Download complete.")

Downloading image 90008-E.jpg
Downloading image 96436-A.jpg


0
1
2
3
4
5
6
7
8


IndexError: single positional indexer is out-of-bounds