# 🕷️ Web Scraping with Python & Scrapy – Live Course Notebook

## 1. Introduction
Web scraping = extracting information from websites.

**When is it useful?** Collecting data for analysis, research, or automation.

⚠️ **Ethical/Legal Reminder**: Always check `robots.txt` and site policies. Prefer APIs when possible.

- We will use **Scrapy selectors** to extract data.  
- Scrapy selectors are based on **XPath** and **CSS selectors**.  
- Compared to **BeautifulSoup**, selectors are often **faster** and closer to what you already use in the browser (Inspect → Copy XPath / CSS).  

⚖️ **Important:** Always check a website’s Terms of Service before scraping, and respect `robots.txt`.  
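The `robots.txt` check can be automated with Python's standard library. The sketch below is a minimal illustration, not a complete compliance check. The user agent name `course-bot` and the inline rules are hypothetical, chosen so the example runs without network access. Against a real site you would point the parser at the site's own `robots.txt` with `set_url(...)` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, written inline so the sketch runs offline.
# For a real site: rp.set_url("https://example.com/robots.txt"); rp.read()
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(url, user_agent="course-bot", rules=EXAMPLE_RULES):
    """Return True if `rules` permit `user_agent` to fetch `url`."""
    rp = RobotFileParser()
    rp.parse(rules.splitlines())
    return rp.can_fetch(user_agent, url)


print(is_allowed("http://quotes.toscrape.com/page/1/"))
print(is_allowed("http://quotes.toscrape.com/private/data"))
```

Note that `robots.txt` covers crawler access only. The Terms of Service still have to be read by a human.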



## 2. Setup & Imports    


In [1]:
# !pip install scrapy parsel pandas requests

import requests
import pandas as pd
from scrapy import Selector


## 3. Scraping the Quotes Website  

This site is very useful for learning scraping:  
👉 http://quotes.toscrape.com/  

We will extract:  
- The **quote text**  
- The **author**  
- The **tags**  


### Step 1: Request the page

In [2]:

url = "http://quotes.toscrape.com/"
response = requests.get(url)

print(response.status_code)
print(response.text[:500])


200
<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
    
    
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div cla


### Step 2: Create a Selector

In [3]:
sel = Selector(text=response.text)

### Step 3: Extract Quote Texts

In [4]:

quotes = sel.css("span.text::text").getall()
print("Number of quotes found:", len(quotes))

Number of quotes found: 10


In [5]:
print(quotes[:5])

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"]


### Step 4: Extract Authors

In [6]:
authors = sel.css("small.author::text").getall()
print("Number of authors found:", len(authors))

Number of authors found: 10


In [7]:
print(authors[:5])


['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe']


### Step 5: Extract Tags

In [8]:

tags = sel.css("div.tags a.tag::text").getall()
print("Number of tags found:", len(tags))

Number of tags found: 30


In [9]:
print(tags[:10])


['change', 'deep-thoughts', 'thinking', 'world', 'abilities', 'choices', 'inspirational', 'life', 'live', 'miracle']


### Step 6: Store Results in a DataFrame

In [10]:

data_quotes = pd.DataFrame({
    "quote": quotes,
    "author": authors
})
data_quotes.head()


| | quote | author |
|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling |
| 2 | “There are only two ways to live your life. On... | Albert Einstein |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe |


In [11]:
len(data_quotes)

10

In [12]:
# Example with XPath
# url = "http://quotes.toscrape.com/random"
# r = requests.get(url)
# sel_random = Selector(text=r.text)  # wrap the HTML so that .xpath() is available
# quote = sel_random.xpath("/html/body/div/div[2]/div[1]/div/span[1]/text()").get()
# author = sel_random.xpath("/html/body/div/div[2]/div[1]/div/span[2]/small/text()").get()
# tags = sel_random.xpath("/html/body/div/div[2]/div[1]/div/div/a/text()").getall()

### Step 7: Extract in one block and filter

In [13]:
url = "http://quotes.toscrape.com/page/1/"

response = requests.get(url)
response = Selector(text=response.text)

quote_frames = response.css("div[class='quote']")
quote_frames

[<Selector query="descendant-or-self::div[@class = 'quote']" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class = 'quote']" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class = 'quote']" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class = 'quote']" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class = 'quote']" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class = 'quote']" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class = 'quote']" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class = 'quote']" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class = 'quote']" data='<div class="quote" itemscope itemtype...'>,
 

In [14]:
trimmed_quotes = []
for quote_frame in quote_frames: 
    quote = quote_frame.css("span[class='text']::text").get()
    author = quote_frame.css("small[class='author']::text").get()
    tags = quote_frame.css("a[class='tag']::text").getall()
    trimmed_quotes.append((quote, author, tags))

In [15]:
df = pd.DataFrame(trimmed_quotes)
df

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | [change, deep-thoughts, thinking, world] |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | [abilities, choices] |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | [inspirational, life, live, miracle, miracles] |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | [aliteracy, books, classic, humor] |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | [be-yourself, inspirational] |
| 5 | “Try not to become a man of success. Rather be... | Albert Einstein | [adulthood, success, value] |
| 6 | “It is better to be hated for what you are tha... | André Gide | [life, love] |
| 7 | “I have not failed. I've just found 10,000 way... | Thomas A. Edison | [edison, failure, inspirational, paraphrased] |
| 8 | “A woman is like a tea bag; you never know how... | Eleanor Roosevelt | [misattributed-eleanor-roosevelt] |
| 9 | “A day without sunshine is like, you know, nig... | Steve Martin | [humor, obvious, simile] |


### Step 8: What if I want to scrape all the pages?

In [16]:
url = "http://quotes.toscrape.com/page/1/"

r = requests.get(url)
response = Selector(text=r.text)

In [17]:
print(r.text)

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
    
    
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="auth

In [18]:
response.css("span[class='text']::text").getall()

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']

In [19]:
response.css("small[class='author']::text").getall()

['Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin']

In [20]:
response.css("a[class='tag']::text").getall()

['change',
 'deep-thoughts',
 'thinking',
 'world',
 'abilities',
 'choices',
 'inspirational',
 'life',
 'live',
 'miracle',
 'miracles',
 'aliteracy',
 'books',
 'classic',
 'humor',
 'be-yourself',
 'inspirational',
 'adulthood',
 'success',
 'value',
 'life',
 'love',
 'edison',
 'failure',
 'inspirational',
 'paraphrased',
 'misattributed-eleanor-roosevelt',
 'humor',
 'obvious',
 'simile',
 'love',
 'inspirational',
 'life',
 'humor',
 'books',
 'reading',
 'friendship',
 'friends',
 'truth',
 'simile']

#### Method 1 

In [21]:
base_url = "http://quotes.toscrape.com/page/{}/"

all_quotes = []

# Loop through pages; range(1, 2) covers page 1 only (use range(1, 4) for pages 1 to 3)
for page in range(1, 2):
    url = base_url.format(page)
    print(f"Scraping {url} ...")

    # Send request
    r = requests.get(url)
    response = Selector(text=r.text)

    # Extract all quotes on the page
    quotes = response.css("span.text::text").getall()

    # Store them
    all_quotes.extend(quotes)

# Show results
for i, quote in enumerate(all_quotes, 1):
    print(f"{i}. {quote}")


Scraping http://quotes.toscrape.com/page/1/ ...
1. “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
2. “It is our choices, Harry, that show what we truly are, far more than our abilities.”
3. “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
4. “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
5. “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
6. “Try not to become a man of success. Rather become a man of value.”
7. “It is better to be hated for what you are than to be loved for what you are not.”
8. “I have not failed. I've just found 10,000 ways that won't work.”
9. “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
10. “A day without sunshine is like, you know, night.”


#### Method 2

- **Pagination**: Find the **Next** link (`li.next a::attr(href)`) and build an **absolute URL** to continue the loop.

In [22]:
response.css("nav li[class='next']").css("a[href]::attr(href)").get()

'/page/2/'
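The href returned above is relative to the site root, so it has to be resolved into an absolute URL before it can be requested. A minimal sketch using `urllib.parse.urljoin` from the standard library; the function name `absolute_next_url` is a hypothetical helper introduced for illustration.

```python
from urllib.parse import urljoin


def absolute_next_url(current_url, next_href):
    """Resolve the href of the Next link against the page it was found on."""
    return urljoin(current_url, next_href)


current_url = "http://quotes.toscrape.com/page/1/"
next_url = absolute_next_url(current_url, "/page/2/")
print(next_url)  # http://quotes.toscrape.com/page/2/

# urljoin also handles hrefs that are relative to the current path,
# which plain string concatenation with the base URL would get wrong
print(absolute_next_url(current_url, "../2/"))  # http://quotes.toscrape.com/page/2/
```

String concatenation works on this site because every Next link starts with `/`. `urljoin` is simply the more general choice when the link style is not known in advance.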

In [23]:
base_url = "http://quotes.toscrape.com"
url = base_url  # start at the main page

all_quotes = []

while url:
    print(f"Scraping {url} ...")
    
    # Request page
    r = requests.get(url)
    response = Selector(text=r.text)
    
    # Extract quotes
    quotes = response.css("span.text::text").getall()
    all_quotes.extend(quotes)
    
    # Check if there is a 'next' button
    next_page = response.css("li.next a::attr(href)").get()
    
    if next_page:
        # Build absolute URL for the next page
        url = base_url + next_page
    else:
        url = None  # stop the loop

# Show results
for i, quote in enumerate(all_quotes, 1):
    print(f"{i}. {quote}")


Scraping http://quotes.toscrape.com ...
Scraping http://quotes.toscrape.com/page/2/ ...
Scraping http://quotes.toscrape.com/page/3/ ...
Scraping http://quotes.toscrape.com/page/4/ ...
Scraping http://quotes.toscrape.com/page/5/ ...
Scraping http://quotes.toscrape.com/page/6/ ...
Scraping http://quotes.toscrape.com/page/7/ ...
Scraping http://quotes.toscrape.com/page/8/ ...
Scraping http://quotes.toscrape.com/page/9/ ...
Scraping http://quotes.toscrape.com/page/10/ ...
1. “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
2. “It is our choices, Harry, that show what we truly are, far more than our abilities.”
3. “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
4. “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
5. “Imperfection is beauty, madness is genius and it's better to be absolute
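The `while url:` loop above stops only when no Next link is found. On a site whose pagination loops back on itself, or never ends, it would run forever. The sketch below adds two guards: a set of already visited URLs and a page cap. It is an illustration under assumptions: `get_next_href` is a hypothetical callable standing in for "request the page and read the Next link", and the three-page `fake_site` is invented so the example runs offline.

```python
from urllib.parse import urljoin


def crawl(start_url, get_next_href, max_pages=50):
    """Follow Next links from start_url, visiting each URL at most once.

    get_next_href(url) is a hypothetical callable returning the href of the
    Next link on that page, or None when the page has no Next link.
    """
    visited = []
    seen = set()
    url = start_url
    while url and url not in seen and len(visited) < max_pages:
        seen.add(url)
        visited.append(url)
        href = get_next_href(url)
        url = urljoin(url, href) if href else None
    return visited


# Invented three-page site; page 3 links back to page 1 to simulate a cycle
fake_site = {
    "http://example.com/page/1/": "/page/2/",
    "http://example.com/page/2/": "/page/3/",
    "http://example.com/page/3/": "/page/1/",
}

pages = crawl("http://example.com/page/1/", fake_site.get)
print(pages)
```

In the real loop, `get_next_href` would wrap the `requests.get` call and the `li.next a::attr(href)` selector used above.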


## 4. Scraping IMDb  

Now let’s try with a real website: IMDb.  
👉 Example page: https://www.imdb.com/chart/top  

We will extract:  
- **Movie titles**  
- **Movie links**  
- **Ratings**  


### Step 1: Request the Page

In [24]:
url = "https://www.imdb.com/fr/chart/top/?ref_=hm_nv_menu"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3 ',
    "Accept-Language": "fr-FR,fr;q=0.9"
}


In [25]:
response_imdb = requests.get(url, headers=headers)

In [26]:
print(response_imdb.status_code)
print(response_imdb.text)

200
<!DOCTYPE html><html lang="fr-FR" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml"><head><meta charSet="utf-8"/><meta name="viewport" content="width=device-width"/><script>if(typeof uet === 'function'){ uet('bb', 'LoadTitle', {wb: 1}); }</script><script>window.addEventListener('load', (event) => {
        if (typeof window.csa !== 'undefined' && typeof window.csa === 'function') {
            var csaLatencyPlugin = window.csa('Content', {
                element: {
                    slotId: 'LoadTitle',
                    type: 'service-call'
                }
            });
            csaLatencyPlugin('mark', 'clickToBodyBegin', 1760451563343);
        }
    })</script><title>Les 250 meilleurs films selon IMDb</title><meta name="description" content="Évalué par les votants réguliers d&#x27;IMDb." data-id="main"/><meta name="google-site-verification" content="0cadf7898134e79b"/><meta name="msvalidate.01" content="C1DACEF2769068C0B0D2

In [27]:
sel_imdb = Selector(text=response_imdb.text)

In [28]:
print(sel_imdb)

<html lang="fr-FR" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width"><script>if(typeof uet === 'function'){ uet('bb', 'LoadTitle', {wb: 1}); }</script><script>window.addEventListener('load', (event) => {
        if (typeof window.csa !== 'undefined' && typeof window.csa === 'function') {
            var csaLatencyPlugin = window.csa('Content', {
                element: {
                    slotId: 'LoadTitle',
                    type: 'service-call'
                }
            });
            csaLatencyPlugin('mark', 'clickToBodyBegin', 1760451563343);
        }
    })</script><title>Les 250 meilleurs films selon IMDb</title><meta name="description" content="Évalué par les votants réguliers d'IMDb." data-id="main"><meta name="google-site-verification" content="0cadf7898134e79b"><meta name="msvalidate.01" content="C1DACEF2769068C0B0D2687C9E5105FA"><meta name="ro

### Step 2: Extract Movie Information

In [29]:
sel_imdb.css("h3[class='ipc-title__text ipc-title__text--reduced']::text").getall()



['Graphiques IMDb',
 '1. Les Évadés',
 '2. Le Parrain',
 '3. The Dark Knight : Le Chevalier noir',
 '4. Le Parrain, 2ᵉ partie',
 '5. 12 Hommes en colère',
 '6. Le Seigneur des anneaux : Le Retour du roi',
 '7. La Liste de Schindler',
 "8. Le Seigneur des anneaux : La Communauté de l'anneau",
 '9. Pulp Fiction',
 '10. Le Bon, la Brute et le Truand',
 '11. Forrest Gump',
 '12. Le Seigneur des anneaux : Les Deux Tours',
 '13. Fight Club',
 '14. Inception',
 "15. L'Empire contre-attaque",
 '16. Matrix',
 '17. Les Affranchis',
 '18. Interstellar',
 "19. Vol au-dessus d'un nid de coucou",
 '20. Seven',
 '21. La vie est belle',
 '22. Le silence des agneaux',
 '23. Les 7 Samouraïs',
 '24. Il faut sauver le soldat Ryan',
 '25. La Ligne verte',
 'Récemment consultés']

In [30]:
sel_imdb.css("span[class='ipc-rating-star--rating']::text").getall()

['9,3',
 '9,2',
 '9,1',
 '9,0',
 '9,0',
 '9,0',
 '9,0',
 '8,9',
 '8,8',
 '8,8',
 '8,8',
 '8,8',
 '8,8',
 '8,8',
 '8,7',
 '8,7',
 '8,7',
 '8,7',
 '8,7',
 '8,6',
 '8,6',
 '8,6',
 '8,6',
 '8,6',
 '8,6']

In [31]:
titles = sel_imdb.css("h3[class='ipc-title__text ipc-title__text--reduced']::text").getall()
marks = sel_imdb.css("span[class='ipc-rating-star--rating']::text").getall()

In [32]:
len(titles)


27

In [33]:
len(marks)

25
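The two lengths differ (27 titles, 25 ratings) because the title selector also matches two headings that are not movies: `'Graphiques IMDb'` and `'Récemment consultés'`. A possible clean-up, sketched below with a few sample values copied from the outputs above, keeps only entries of the form `N. Title` and converts the ratings, which use a decimal comma, to floats. The helper names `clean_titles` and `to_float` are hypothetical.

```python
import re

# Sample values copied from the outputs above
titles = ["Graphiques IMDb", "1. Les Évadés", "2. Le Parrain", "Récemment consultés"]
marks = ["9,3", "9,2"]

# A rank, a dot, whitespace, then the title itself
RANKED = re.compile(r"^(\d+)\.\s+(.*)$")


def clean_titles(raw_titles):
    """Keep only 'N. Title' entries and return (rank, title) tuples."""
    cleaned = []
    for text in raw_titles:
        match = RANKED.match(text)
        if match:
            cleaned.append((int(match.group(1)), match.group(2)))
    return cleaned


def to_float(mark):
    """Convert a rating written with a decimal comma, e.g. '9,3', to 9.3."""
    return float(mark.replace(",", "."))


movies = [
    (rank, title, to_float(mark))
    for (rank, title), mark in zip(clean_titles(titles), marks)
]
print(movies)
```

Pairing by position with `zip` is only safe once both lists have the same length, which is why the filter runs first.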

In [34]:
# another way to do it
section = sel_imdb.css("ul[class='ipc-metadata-list ipc-metadata-list--dividers-between sc-2b8fdbce-0 emPbxy compact-list-view ipc-metadata-list--base']")
len(section)

0
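The selector above returns 0 elements. An attribute selector of the form `[class='...']` compares the whole attribute string, so it matches only if every class name, including generated-looking ones such as `sc-2b8fdbce-0 emPbxy`, appears exactly as written. A class selector such as `ul.ipc-metadata-list` tests a single class name instead. One plausible cause of the empty result, not verified against the page, is that the generated names differ in the HTML actually served. The plain-Python sketch below only mimics the two comparisons; it is not how Scrapy implements matching, and `page_class` is a hypothetical value.

```python
# Class string used in the exact-match selector above
selector_value = (
    "ipc-metadata-list ipc-metadata-list--dividers-between "
    "sc-2b8fdbce-0 emPbxy compact-list-view ipc-metadata-list--base"
)

# Hypothetical class attribute served by the page: same stable names,
# different generated ones (an assumption for illustration only)
page_class = (
    "ipc-metadata-list ipc-metadata-list--dividers-between "
    "sc-aaaaaaaa-0 zzzzzz compact-list-view ipc-metadata-list--base"
)


def matches_exact_attr(attr, value):
    """Mimics ul[class='value']: true only if the whole string is equal."""
    return attr == value


def matches_class_token(attr, token):
    """Mimics ul.token: true if token is one of the class names."""
    return token in attr.split()


print(matches_exact_attr(page_class, selector_value))         # False
print(matches_class_token(page_class, "ipc-metadata-list"))   # True
```

This is why selecting on one stable class name tends to survive site updates better than copying the full class string from the browser inspector.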

In [35]:
for box in section.css("div[class='sc-ec40e84d-0 dTHKNo']"):
    print(box.css("h3[class='ipc-title__text ipc-title__text--reduced']::text").get())

In [36]:
# Select the first <div> that contains a movie title
first_movie = sel_imdb.css("div.cli-title").get()

print(first_movie)  # You’ll see the full HTML block as a string


<div class="ipc-title ipc-title--base ipc-title--title ipc-title--title--reduced ipc-title-link-no-icon ipc-title--on-textPrimary sc-87337ed2-2 dRlLYG cli-title with-margin"><a href="/fr/title/tt0111161/?ref_=chttp_t_1" class="ipc-title-link-wrapper" tabindex="0"><h3 class="ipc-title__text ipc-title__text--reduced">1. Les Évadés</h3></a></div>


In [37]:
first_movie_element = sel_imdb.css("div.cli-title")[0]

title_tag = first_movie_element.css("h3").get()
print(title_tag)  # You’ll see: <h3 class="...">1. Les Évadés</h3>


<h3 class="ipc-title__text ipc-title__text--reduced">1. Les Évadés</h3>


In [38]:
url = "https://www.imdb.com/chart/boxoffice/"

headers = {
    'User-Agent': 'Mozilla/5.0',
    'Accept-Language': 'fr-FR,fr;q=0.9'
}

# Step 1: Get the page
r = requests.get(url, headers=headers)
response = Selector(text=r.text)

# Step 2: Select all the movie blocks
movies = response.css("div.cli-title")

print(f"Found {len(movies)} movie blocks")

# Step 3: Loop through each movie and extract info
for movie in movies:
    title = movie.css("h3.ipc-title__text::text").get()
    relative_link = movie.css("a::attr(href)").get()

    full_link = "https://www.imdb.com" + relative_link if relative_link else "No link"

    print("📽️ Title:", title)
    print("🔗 Link:", full_link)
    print("------")



Found 10 movie blocks
📽️ Title: Tron: Ares
🔗 Link: https://www.imdb.com/fr/title/tt6604188/?ref_=chtbo_t_1
------
📽️ Title: Roofman
🔗 Link: https://www.imdb.com/fr/title/tt4627382/?ref_=chtbo_t_2
------
📽️ Title: Une bataille après l'autre
🔗 Link: https://www.imdb.com/fr/title/tt30144839/?ref_=chtbo_t_3
------
📽️ Title: Gabby et la maison magique: Le film
🔗 Link: https://www.imdb.com/fr/title/tt32214143/?ref_=chtbo_t_4
------
📽️ Title: Conjuring: L'heure du jugement
🔗 Link: https://www.imdb.com/fr/title/tt22898462/?ref_=chtbo_t_5
------
📽️ Title: Soul on Fire
🔗 Link: https://www.imdb.com/fr/title/tt28078628/?ref_=chtbo_t_6
------
📽️ Title: Demon Slayer: Kimetsu no Yaiba La Forteresse Infinie Film 1
🔗 Link: https://www.imdb.com/fr/title/tt32820897/?ref_=chtbo_t_7
------
📽️ Title: The Smashing Machine
🔗 Link: https://www.imdb.com/fr/title/tt11214558/?ref_=chtbo_t_8
------
📽️ Title: Les Intrus - Chapitre 2
🔗 Link: https://www.imdb.com/fr/title/tt28671344/?ref_=chtbo_t_9
------
📽️ Title: G
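Each link printed above ends with a `?ref_=...` query string, which looks like a referrer marker rather than part of the movie's address (an assumption based on its name). If a stable identifier is more useful than the full link, the `tt...` part of the path can be extracted. A minimal sketch with the standard library; `strip_query` and `title_id` are hypothetical helper names.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Drop the query string and fragment, e.g. the ?ref_=... part."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def title_id(url):
    """Return the 'tt...' identifier found in the URL path, or None."""
    match = re.search(r"/title/(tt\d+)/", urlsplit(url).path)
    return match.group(1) if match else None


link = "https://www.imdb.com/fr/title/tt6604188/?ref_=chtbo_t_1"
print(strip_query(link))  # https://www.imdb.com/fr/title/tt6604188/
print(title_id(link))     # tt6604188
```

The identifier also makes a convenient key for removing duplicates when the same movie appears in several charts.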

## 5. Scraping the Booking.com Website  

Now let’s try another real website: Booking.com.  
👉 Goal: scrape some information about hotels in Paris.

### Step 1: Request the Page

In [39]:

url_booking = "https://www.booking.com/searchresults.html?ss=Paris"
response_booking = requests.get(url_booking, headers={"User-Agent": "Mozilla/5.0"})
sel_booking = Selector(text=response_booking.text)


In [40]:
print(sel_booking)

<html><head><script type="text/javascript" nonce="wzjxO0nnZQutTiC">
window.PCM = {
config: {
isBanner: false,
domainUUID: '3ea94870-d4b1-483a-b1d2-faf1d982bb31',
isPCSIntegration: true,
shareConsentWithin: 'www.booking.com',
isCrossDomainCookieSharingEnabled: false,
nonce: 'wzjxO0nnZQutTiC',
countryCode: 'us',
isUserLoggedIn: false,
},
pcm_consent_cookie: 'analytical=false&amp;countryCode=FR&amp;consentId=4c6e7aeb-833a-400b-a138-ae8fec98da81&amp;consentedAt=2025-10-14T14:19:25.430Z&amp;expiresAt=2026-04-12T14:19:25.430Z&amp;implicit=true&amp;marketing=false&amp;regionCode=IDF&amp;regulation=gdpr&amp;legacyRegulation=gdpr'
};
(()=>{let e,c=e=>{let n,t,o=new FormData,i={error:e.message||"",message:e.message||"",stack:e.stack||"",name:"js_errors",colno:0,lno:0,url:window.location.hostname+window.location.pathname,pid:(null==(t=null==(n=window.B)?void 0:n.env)?void 0:t.pageview_id)||1,be_running:1,be_column:0,be_line:0,be_stack:e.stack||"",be_message:e.message||"",be_file:window.location.h

### Step 2: Extract Hotel Names

In [41]:
import requests 
from scrapy import Selector

response = requests.get("https://www.booking.com/searchresults.fr.html?ss=Paris%2C+France&efdco=1&label=gen173nr-10CAEoggI46AdIM1gEaE2IAQGYATO4ARfIAQzYAQPoAQH4AQGIAgGoAgG4Asmaj8YGwAIB0gIkMTZiMjIyZDItNzQ0NC00ODliLWIwYTMtZDAyYjc4MTE3YTdm2AIB4AIB&sid=3f4cfd30c66e4b9ba4934305976617dd&aid=304142&lang=fr&sb=1&src_elem=sb&src=index&dest_id=-1456928&dest_type=city&checkin=2025-09-19&checkout=2025-09-21&group_adults=2&no_rooms=1&group_children=0")
response = Selector(text=response.text)


In [42]:
print(response)

<html><head><script type="text/javascript" nonce="l9cqzQKqqJkbyox" src="https://cdn.cookielaw.org/consent/3ea94870-d4b1-483a-b1d2-faf1d982bb31/OtAutoBlock.js"></script>
<script type="text/javascript" nonce="l9cqzQKqqJkbyox">
(function () {
document.addEventListener('click', function(e) {
if (e.target && e.target.classList.contains('ot-preference-center-footer')) {
e.preventDefault();
Optanon && Optanon.ToggleInfoDisplay();
}
});
document.addEventListener('cookie_banner_closed', function(e) {
if (window.PCM && window.B && window.B.et) {
window.B.et.goal((window.PCM.Marketing || window.PCM.Analytical) ? 'cookie_consent_accepted_policy_banner' : 'cookie_consent_declined_policy_banner');
}
});
})();
</script>
<script type="text/javascript" nonce="l9cqzQKqqJkbyox">
window.PCM = {
config: {
isBanner: true,
domainUUID: '3ea94870-d4b1-483a-b1d2-faf1d982bb31',
isPCSIntegration: true,
shareConsentWithin: 'www.booking.com',
isCrossDomainCookieSharingEnabled: false,
nonce: 'l9cqzQKqqJkbyox',
count

### Step 3: Understand the structure

In [43]:
response.css("div[data-testid='title']").getall()

['<div data-testid="title" class="ecb8d66605 d92ce1996e"><h3 class="e7addce19e f546354b44 cc045b173b f7ffd4d542"><div>Too Hotel &amp; Spa Paris - MGallery Collection<div class="ce1f6dc681"><div class="b5ab46c480 d7f0fd0b38" role="img" aria-label="4\xa0étoiles sur 5."><div class="e03979cfad"><span aria-hidden="true" class="fc70cba028 bdc459fcb4 f24706dc71"><svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24" width="50px"><path d="M23.555 8.729a1.505 1.505 0 0 0-1.406-.98h-6.087a.5.5 0 0 1-.472-.334l-2.185-6.193a1.5 1.5 0 0 0-2.81 0l-.005.016-2.18 6.177a.5.5 0 0 1-.471.334H1.85A1.5 1.5 0 0 0 .887 10.4l5.184 4.3a.5.5 0 0 1 .155.543l-2.178 6.531a1.5 1.5 0 0 0 2.31 1.684l5.346-3.92a.5.5 0 0 1 .591 0l5.344 3.919a1.5 1.5 0 0 0 2.312-1.683l-2.178-6.535a.5.5 0 0 1 .155-.543l5.194-4.306a1.5 1.5 0 0 0 .433-1.661"></path></svg></span><span aria-hidden="true" class="fc70cba028 bdc459fcb4 e2cec97860 f24706dc71"><svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24" width="50px"><path d="

In [44]:
response.css("div[data-testid='title']").css("div::text").getall()

['Too Hotel & Spa Paris - MGallery Collection',
 'Hôtel à Paris (13e arr.)',
 'SO/ Paris Hotel',
 'Hôtel à Paris (4e arr.)',
 'Drawing House',
 'Hôtel à Paris (14e arr.)',
 'La Demeure Montaigne',
 'Hôtel à Paris (8e arr.)',
 'Hôtel La Canopée',
 'Hôtel à Paris (8e arr.)',
 'Hôtel Le Milie Rose',
 'Hôtel à Paris (10e arr.)',
 "Paris j'Adore Hotel & Spa",
 'Hôtel à Paris (17e arr.)',
 'Quinzerie hôtel',
 'Hôtel à Paris (15e arr.)',
 'Hôtel Moderniste',
 'Hôtel à Paris (15e arr.)',
 'La Planque Hotel',
 'Hôtel à Paris (10e arr.)']
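In the list above, `div::text` returns two strings per hotel: the name, then a short description line such as `'Hôtel à Paris (13e arr.)'`. Assuming this strict alternation holds, which is true for the output shown but not guaranteed for every results page, the flat list can be regrouped with slicing. The sample values are copied from the output above; `pair_names` is a hypothetical helper.

```python
# First four entries copied from the output above
texts = [
    "Too Hotel & Spa Paris - MGallery Collection",
    "Hôtel à Paris (13e arr.)",
    "SO/ Paris Hotel",
    "Hôtel à Paris (4e arr.)",
]


def pair_names(flat_texts):
    """Group an alternating [name, description, ...] list into tuples."""
    if len(flat_texts) % 2 != 0:
        raise ValueError("expected an even number of entries")
    # Even positions hold names, odd positions hold descriptions
    return list(zip(flat_texts[::2], flat_texts[1::2]))


hotels = pair_names(texts)
for name, description in hotels:
    print(name, "|", description)
```

Extracting both values from each title block separately is more robust than slicing, because a block with a missing description then cannot shift every later pair.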

In [45]:
first_answers = response.css("div[data-testid='title']")

In [46]:
first_answers

[<Selector query="descendant-or-self::div[@data-testid = 'title']" data='<div data-testid="title" class="ecb8d...'>,
 <Selector query="descendant-or-self::div[@data-testid = 'title']" data='<div data-testid="title" class="ecb8d...'>,
 <Selector query="descendant-or-self::div[@data-testid = 'title']" data='<div data-testid="title" class="ecb8d...'>,
 <Selector query="descendant-or-self::div[@data-testid = 'title']" data='<div data-testid="title" class="ecb8d...'>,
 <Selector query="descendant-or-self::div[@data-testid = 'title']" data='<div data-testid="title" class="ecb8d...'>,
 <Selector query="descendant-or-self::div[@data-testid = 'title']" data='<div data-testid="title" class="ecb8d...'>,
 <Selector query="descendant-or-self::div[@data-testid = 'title']" data='<div data-testid="title" class="ecb8d...'>,
 <Selector query="descendant-or-self::div[@data-testid = 'title']" data='<div data-testid="title" class="ecb8d...'>,
 <Selector query="descendant-or-self::div[@data-testid = 'title'

In [47]:
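def parse(title_blocks):
    # Re-select the title <div>s (the query also matches the blocks themselves)
    answers = title_blocks.css("div[data-testid='title']")
    for answer in answers: 
        # .get() returns only the first text node, i.e. the hotel name
        print(answer.css("div::text").get())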

In [48]:
parse(first_answers)

Too Hotel & Spa Paris - MGallery Collection
SO/ Paris Hotel
Drawing House
La Demeure Montaigne
Hôtel La Canopée
Hôtel Le Milie Rose
Paris j'Adore Hotel & Spa
Quinzerie hôtel
Hôtel Moderniste
La Planque Hotel


## 6. Scrapy 

In [49]:
import os 

os.listdir()

['D02-Scraping.ipynb',
 'scrapy_quotes.py',
 'scrapy_booking.py',
 'D03-Boto3-live.ipynb',
 'D04-SQL-alchemy-live.ipynb',
 'scrapy_imdb.py',
 'D01-API-Requests-live.ipynb',
 'scrapy_quotes_all.py',
 'src']

In [50]:
! python scrapy_booking.py

2025-10-14 16:19:29 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: scrapybot)
2025-10-14 16:19:29 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.12.9, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:35:41) - [Clang 16.0.6 ], pyOpenSSL 23.2.0 (OpenSSL 3.5.0 8 Apr 2025), cryptography 41.0.7, Platform macOS-15.6.1-arm64-arm-64bit
2025-10-14 16:19:29 [scrapy.addons] INFO: Enabled addons:
[]
2025-10-14 16:19:29 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2025-10-14 16:19:29 [scrapy.extensions.telnet] INFO: Telnet Password: f9ae268483136313
2025-10-14 16:19:29 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2025-10-14 16:19:29 [scrapy.crawler] INFO