![LU Logo](https://www.lu.lv/fileadmin/user_upload/LU.LV/www.lu.lv/Logo/Logo_jaunie/LU_logo_LV_horiz.png)


# Web Development with Flask and Data Extraction (Scraping) with Scrapy

## Lesson Contents

In this lesson we will cover the following topics:

* Flask - a server-side web development framework for Python
* Scrapy - a Python library for extracting data from web pages, also known as scraping

## Lesson Objectives

By the end of the lesson you will be able to:

* Understand the basics of server-side web development
* Create a simple web application using Flask
* Understand the basics of web scraping
* Extract data from a web page using Scrapy

## Prerequisites

Before starting this lesson, you should:

* Understand the basics of Python programming - variables, data types, control structures, functions, OOP, and file input/output
* Know how to install Python packages using `pip`
* Know how to create a virtual environment using `venv`
* Understand the basics of HTML - see the MDN Web Docs [Introduction to HTML](https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML)




## Web Development Basics

Web development is a broad term that covers many different technologies and skills.

One way to divide it is by whether you work on the user interface (front end) or on the server side (back end).


### User Interface - Front End Web Development

Front end development involves building a web page's user interface and user experience (UX). This includes designing the layout, colors, fonts, and interactive elements of a page. Front end developers use HTML, CSS, and JavaScript to build the visual elements that users interact with.


### Server Side - Back End Web Development

Back end web development involves building server-side logic and database interaction. This includes handling user requests, processing data, and generating dynamic content. Back end developers use server-side programming languages such as Python, PHP, Ruby, Java, and others to build the server-side components of a website.

API development, user authentication, and database management are some of the tasks back end developers are responsible for.

### Full Stack Web Development

Full stack web development involves working with both front end and back end technologies. Full stack developers are skilled in both areas and can build complete web applications from start to finish.



## Flask - a Python Web Development Framework

Flask is a simple, lightweight web framework for Python, designed to help you build web applications quickly and with little setup.
Both experienced developers and beginners use it because it is easy to understand and use.

### How Flask Works


Flask is built on the WSGI (Web Server Gateway Interface) standard, which lets it run on any WSGI-compatible server, such as Gunicorn, uWSGI, or Apache with mod_wsgi. A general-purpose web server such as Nginx is typically placed in front of the WSGI server as a reverse proxy.
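
To make the WSGI idea more concrete, here is a minimal sketch of a bare WSGI application written with the standard library only. It is not how Flask is implemented internally; it only illustrates the interface that a Flask `app` object also exposes to a server. The names `simple_app` and `call_app` are made up for this illustration.

```python
from wsgiref.util import setup_testing_defaults


def simple_app(environ, start_response):
    """A bare WSGI application: a callable taking (environ, start_response)."""
    path = environ.get("PATH_INFO", "/")
    body = f"Hello from {path}".encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]  # WSGI expects an iterable of bytes


def call_app(app, path):
    """Call a WSGI app directly, the way a server would, and collect the result."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the keys a WSGI server must provide
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, response_headers):
        captured["status"] = status
        captured["headers"] = response_headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body.decode("utf-8")


if __name__ == "__main__":
    print(call_app(simple_app, "/greet"))
```

A WSGI server does roughly what `call_app` does here for every incoming HTTP request, which is why the same application can run unchanged on different servers.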

Flask uses decorators to bind URL routes to the functions that run when a request for that route arrives. This makes it easy to define what should happen for each URL. We will look at routes later in this lesson.
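
The decorator mechanism can be sketched in plain Python. This is a simplified illustration under the assumption that a route is just a dictionary entry; Flask's real routing is considerably more capable. The names `routes`, `route`, and `dispatch` are hypothetical.

```python
routes = {}  # hypothetical registry: URL path -> handler function


def route(path):
    """Return a decorator that records the decorated function under `path`."""
    def decorator(func):
        routes[path] = func
        return func  # the function itself is left unchanged
    return decorator


@route("/")
def index():
    return "Hello, World!"


@route("/about")
def about():
    return "About this site"


def dispatch(path):
    """Look up the handler for `path` and call it, or report a 404."""
    handler = routes.get(path)
    if handler is None:
        return "404 Not Found"
    return handler()


if __name__ == "__main__":
    for path in ("/", "/about", "/missing"):
        print(path, "->", dispatch(path))
```

The key point is that `@route("/")` runs once, when the function is defined, and only stores the function for later use.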

### Setting Up a Virtual Environment

Before installing Flask, it is strongly recommended that you create a virtual environment for your project. This helps you manage dependencies and avoid conflicts with other projects.

There are several ways to create a virtual environment in Python. One of the most common is the built-in `venv` module. Here is how to create a virtual environment with `venv`:


```bash
# Create a new directory for your project
mkdir myproject
cd myproject
python -m venv myvenv
```

Instead of `myvenv`, you can use any name you like for your virtual environment.

#### Activating the Virtual Environment

To activate the virtual environment, use the following command:

```bash
# On Windows
myvenv\Scripts\activate

# On macOS and Linux
source myvenv/bin/activate
```
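
If you are unsure whether activation worked, a quick check from Python itself can help. The sketch below relies on the fact that inside a `venv` environment `sys.prefix` differs from `sys.base_prefix`; the helper name `in_virtualenv` is made up for this example.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
print("Interpreter:", sys.executable)
```

When the environment is active, the interpreter path printed last should point inside your `myvenv` folder.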

### Installing Flask

Once you have created and ACTIVATED your virtual environment, you can install Flask using `pip`:

```bash
pip install Flask
```


## Building a Simple Web Application with Flask


### Hello World Example


In [1]:
# let's create a simple Flask application

from flask import Flask # import the Flask class from the flask module

app = Flask(__name__) # this creates a new Flask app object

# we will be using app.route() decorator to define the URL that will trigger the function below
@app.route('/') # this route means that the function below will be called when the user goes to the root URL of your website
def hello_world():
    return 'Hello, World!'

# you'd add this line to run the app in script mode
# if __name__ == '__main__':
#     app.run()

app.run() # same call as in the commented-out block above, but without the __name__ guard; in a script, prefer the guarded form so the server does not start on import
# usually you would not run this from a Jupyter notebook, but from a terminal

# use Ctrl+C to stop the server on terminal
# in Jupyter notebook, you can stop the server by clicking on the stop button

 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
Press CTRL+C to quit


### Using Parameters in Routes

The next step is to create a simple web application that takes a parameter from the URL and displays it on the page. Here is an example:


```python
from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello_world():
    return 'Hello, World!'

@app.route('/greet/<name>')
def greet(name):
    return f'Hello, {name}!'

if __name__ == '__main__':
    app.run()
``` 

In [None]:
from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello_world():
    return 'Hello, World!'

# note that the URL is case-sensitive
# note the use of <name> in the URL this is a variable part of the URL
@app.route('/greet/<name>')
def greet(name):
    return f'Hello, {name}!'

# note that only first level of the URL will be caught by this route
# so /greet/Janis/Berzins will not work - you would need a separate route for that
# but /greet/Janis will work

if __name__ == '__main__':
    app.run()

# use Ctrl+C to stop the server on terminal
# in Jupyter notebook, you can stop the server by clicking on the stop button

 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
127.0.0.1 - - [09/Dec/2024 16:54:53] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [09/Dec/2024 16:54:59] "GET /greet/Valdis HTTP/1.1" 200 -
127.0.0.1 - - [09/Dec/2024 16:55:05] "GET /greet/LUPython HTTP/1.1" 200 -
127.0.0.1 - - [09/Dec/2024 16:55:11] "GET /greet/LUPython/Latvia HTTP/1.1" 404 -
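
The 404 in the log above can be reproduced without starting a server by using Flask's built-in test client. The sketch below also adds a second, hypothetical route `greet_full` for two-part names; it is only one possible way to handle such URLs.

```python
from flask import Flask

app = Flask(__name__)


@app.route('/greet/<name>')
def greet(name):
    return f'Hello, {name}!'


# Hypothetical extra route for a two-part name such as /greet/Janis/Berzins
@app.route('/greet/<first>/<last>')
def greet_full(first, last):
    return f'Hello, {first} {last}!'


client = app.test_client()  # sends requests to the app without a running server

for path in ['/greet/Valdis', '/greet/Janis/Berzins', '/greet/a/b/c']:
    response = client.get(path)
    print(path, response.status_code)
```

With both routes registered, one-part and two-part names are matched, while a three-part path still matches nothing and returns 404.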


### Using Query Parameters

Query parameters are another way to pass data to a web application. They are appended to the URL after a question mark `?` in the form `key=value`, with multiple pairs separated by `&`. Here is an example:

```python

from flask import Flask, request

app = Flask(__name__)

@app.route('/')
def hello_world():
    return 'Hello, World!'

@app.route('/sveiks')
def greet():
    name = request.args.get('name', 'World')
    return f'Hello, {name}!'

if __name__ == '__main__':
    app.run()
```

You can now access the `/sveiks` route with a query parameter like this: `http://localhost:5000/sveiks?name=Uldis`
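
To see what a query string looks like once it is taken apart, the standard library's `urllib.parse` can be used on its own. This sketch only illustrates the idea behind `request.args`; it is not what Flask runs internally.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = "http://localhost:5000/sveiks?name=Uldis"

parts = urlsplit(url)
print(parts.path)   # the route part of the URL
print(parts.query)  # everything after the question mark

params = parse_qs(parts.query)           # maps each key to a list of values
name = params.get("name", ["World"])[0]  # same fallback idea as request.args.get
print(name)

# Building a query string safely, including spaces and non-ASCII characters
query = urlencode({"name": "Jānis Bērziņš"})
print(query)
```

Building the query with `urlencode` instead of string concatenation ensures that special characters are percent-encoded correctly.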



In [None]:
## Using query parameters

from flask import Flask, request

app = Flask(__name__)

@app.route('/')
def hello_world():
    return 'Hello, World!'

@app.route('/sveiks')
def greet():
    name = request.args.get('name', 'Pasaule') #if no name argument is given, we will use 'Pasaule'
    return f'Hello, {name}!'

if __name__ == '__main__':
    app.run()

# on local server try something like http://127.0.0.1:5000/sveiks?name=Valdis

 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
127.0.0.1 - - [09/Dec/2024 17:07:54] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [09/Dec/2024 17:08:03] "GET /sveiks?name=Valdis HTTP/1.1" 200 -


### Using Templates and Static Files

We usually do not want to build HTML inside our Python code. Flask lets you use templates to separate HTML from Python code. Flask uses Jinja2 as its template engine.

The full Jinja2 documentation is available [here](https://jinja.palletsprojects.com/en/3.0.x/).

The basic idea is that you create a `templates` folder in your project and put your HTML files there. Templates let us insert variables and expressions that are replaced with real data when the template is rendered.

In addition to dynamic content, you may also need static files such as CSS, JavaScript, and images. To include them in your project, create a `static` folder and put the files there.
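
As a rough illustration of how variables and expressions are filled in, the sketch below renders a small template string directly with Jinja2. The template text is a hypothetical stand-in and is not the lesson's actual `index.html`; in a Flask app you would normally call `render_template` on a file in `templates` instead.

```python
from jinja2 import Template

# Hypothetical stand-in for templates/index.html; the lesson's real file may differ
source = """<h1>Python topics</h1>
<ul>
{% for item in items %}  <li>{{ item }}</li>
{% endfor %}</ul>
<footer>Year: {{ year }}</footer>"""

html = Template(source).render(
    year=2024,
    items=["Saraksti", "Vārdnīcas", "Citi objekti"],
)
print(html)
```

`{{ ... }}` inserts a value, while `{% ... %}` holds control statements such as loops and conditions.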




In [None]:
# example of using templates and static files
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def index():
    items = ["Saraksti", "Vārdnīcas", "Citi objekti"] # these could be from a database or other data source
    # also these objects could be from url query parameters
    return render_template('index.html', year=2024, items=items) # thus template will receive year and items
# see index.html templates folder on how it is handled on the template side
# you might also check out base.html to see how templates can be extended

@app.route('/about')
def about():
    return render_template('about.html', year=2024)
# about.html is even simpler template than index.html
# it also extends base.html in templates folder


if __name__ == '__main__':
    # app.run(debug=True) # debug mode will reload the server on code changes and provide more verbose output
    app.run() # here the server runs with debug mode off (see "Debug mode: off" in the output below)

 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
127.0.0.1 - - [09/Dec/2024 18:24:57] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [09/Dec/2024 18:24:57] "GET /static/mystyle.css HTTP/1.1" 200 -
127.0.0.1 - - [09/Dec/2024 18:24:57] "GET /static/myscript.js HTTP/1.1" 200 -
127.0.0.1 - - [09/Dec/2024 18:25:00] "GET /static/mystyle.css HTTP/1.1" 304 -
127.0.0.1 - - [09/Dec/2024 18:25:12] "GET /about HTTP/1.1" 200 -
127.0.0.1 - - [09/Dec/2024 18:25:12] "GET /static/mystyle.css HTTP/1.1" 200 -
127.0.0.1 - - [09/Dec/2024 18:25:12] "GET /static/myscript.js HTTP/1.1" 200 -
127.0.0.1 - - [09/Dec/2024 18:25:13] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [09/Dec/2024 18:25:13] "GET /static/mystyle.css HTTP/1.1" 200 -
127.0.0.1 - - [09/Dec/2024 18:25:13] "GET /static/myscript.js HTTP/1.1" 200 -
127.0.0.1 - - [09/Dec/2024 18:25:14] "GET /about HTTP/1.1" 200 -
127.0.0.1 - - [09/Dec/2024 18:25:14] "GET /static/mystyle.css HTTP/1.1" 200 -
127.0.0.1 - - [09/Dec/2024 18:25:14] "GET /static/myscript.js HTTP/1.

### Flask Project Structure

To keep your Flask project well organized, you can use a structure like this:

```
project_root/
│
├── app.py                # Your main Flask file - there may be other .py files as well
├── static/               # Folder for static files (CSS, JavaScript, images, etc.)
│   ├── styles.css        # Your CSS file - there can of course be several
│   └── script.js         # Your JavaScript file - there can of course be several
├── templates/            # Folder for templates
│   ├── base.html         # Base template
│   └── index.html        # Other templates - there can of course be several
└── requirements.txt      # (Recommended but optional) list of all required libraries
```
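
If you prefer to generate this layout rather than create it by hand, a short script can do it. This is only a convenience sketch: the helper name `create_skeleton` is made up, the files it creates are empty placeholders, and the demo writes into a temporary directory.

```python
from pathlib import Path
import tempfile


def create_skeleton(root):
    """Create the folders and empty placeholder files of a small Flask project."""
    root = Path(root)
    files = [
        "app.py",
        "static/styles.css",
        "static/script.js",
        "templates/base.html",
        "templates/index.html",
        "requirements.txt",
    ]
    for relative in files:
        path = root / relative
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch(exist_ok=True)  # leaves existing files untouched
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


# Demonstrated in a temporary directory so nothing is written to your project
with tempfile.TemporaryDirectory() as tmp:
    created = create_skeleton(Path(tmp) / "project_root")
    print(created)
```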


### Learning Materials and Resources for Flask

*To become a well-rounded Flask developer, you will need to learn more about the following topics:*

- Form handling with Flask
- Using databases with Flask, through SQLAlchemy or other ORM (Object-Relational Mapping) tools
- User authentication with Flask
- Session management with Flask
- Deploying Flask applications - AWS, PythonAnywhere, DigitalOcean, a local server, etc.

To continue learning Flask, you can use the following resources:
- Official Flask documentation: https://flask.palletsprojects.com/en/2.0.x/
- Miguel Grinberg's Flask Mega-Tutorial: https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-i-hello-world
- Corey Schafer's Flask Tutorial Series: https://www.youtube.com/playlist?list=PL-osiE80TeTs4UjLw5MM6OjgkjFeUxCYH




## Web Scraping - What Is It?

Web scraping is the process of extracting data from web pages. It involves sending HTTP requests to a web page, parsing the HTML content, and extracting the required data. Web scraping is used for data collection, price monitoring, content aggregation, and other purposes.

In theory, scraping can also be done manually by simply saving the content of the pages you visit, but it is usually more efficient to use a web scraping library such as Scrapy to automate the process.

### Scraping Rules

* Play fair! Do not overload a website with requests, as this can cause server problems and get you blocked.
* Before you start scraping, check the website's terms of use and its robots.txt file to make sure you are not breaking any rules.
* If the data is available through an API, use the API rather than scraping.
* Use the data you obtain only in accordance with copyright and the law.
* If possible, use an already published dataset, ideally from the site's own authors, instead of scraping the web page.
* Use web scraping responsibly and ethically.
* Keep in mind the difference between scraping data and publishing it. Scraping is only data collection; publishing is a separate matter.
* For research purposes, it is advisable to obtain permission from the website owners where possible.
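
Checking robots.txt can itself be automated with the standard library. The sketch below parses a made-up robots.txt text so that no network request is needed; the rules, the bot name `my-course-bot`, and the example.com URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at the root of the site
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse text directly, no network request

user_agent = "my-course-bot"  # hypothetical bot name

allowed_public = parser.can_fetch(user_agent, "https://example.com/public/page.html")
allowed_private = parser.can_fetch(user_agent, "https://example.com/private/data.html")
delay = parser.crawl_delay(user_agent)

print("public page allowed:", allowed_public)
print("private page allowed:", allowed_private)
print("requested delay between requests (seconds):", delay)
```

For a real site you would point the parser at the live file with `set_url()` and `read()`, and wait at least the requested delay between requests. Scrapy projects can also enable the `ROBOTSTXT_OBEY` setting to have this check done automatically.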



## Scrapy - A Python Web Scraping Library

Scrapy is a powerful web scraping library for Python that makes it easy to extract data from websites. It provides a high-level API for crawling websites and extracting data, making it a great choice for web scraping projects.

### How Scrapy Works

Scrapy works by sending HTTP requests to a website, parsing the HTML content, and extracting the data you need. It provides a set of tools and libraries for building web scrapers, including a built-in web crawler, a powerful selector system, and support for handling cookies and sessions.

### Installing Scrapy

As usual it is best to create and activate a virtual environment before installing Scrapy. 

You can install Scrapy using `pip`:

```bash
pip install Scrapy
```

### Simple web scraping example with Scrapy

Let's say we want to scrape the cities and their populations from the Wikipedia page https://en.wikipedia.org/wiki/List_of_cities_in_Latvia
(Note that Wikipedia offers an API for accessing its data, so scraping is not necessary in this case. This is just an example.)

Here is a simple example of how to scrape data from a website using Scrapy:




In [6]:
# first let's import scrapy and check its version
try:
    import scrapy
    print(f"Scrapy version is {scrapy.__version__}")
except ImportError:
    print("Scrapy is not installed")
    print("You can install Scrapy with pip install scrapy")

Scrapy version is 2.12.0


### Basic scraping example with Scrapy

In [None]:
# now let's scrape the following web page
url = 'https://en.wikipedia.org/wiki/List_of_cities_in_Latvia'
print(f"We will scrape {url}")
# Note: usually there is no need to scrape Wikipedia as they have APIs and data dumps available
# but for learning purposes we will scrape this page
# we are interested in table with cities and their population

# we've already imported scrapy so we can start using it
# let's start with basic example where we simply fetch the page and extract the title
from scrapy.http import HtmlResponse

try:
    import requests  # we also use requests library to fetch the page
except ImportError:
    print("Requests is not installed")
    print("You can install Requests with pip install requests")

# NOTE: Scrapy has its own request object as well, but it is more complex

response = requests.get(url)
# let's create a scrapy response object
scrapy_response = HtmlResponse(url, body=response.text, encoding='utf-8')

# let's extract the title using css selector
title = scrapy_response.css('title::text').get()
print("Web page title is", title)


2024-12-09 19:38:03 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): en.wikipedia.org:443
2024-12-09 19:38:03 [urllib3.connectionpool] DEBUG: https://en.wikipedia.org:443 "GET /wiki/List_of_cities_in_Latvia HTTP/11" 200 38490


We will scrape https://en.wikipedia.org/wiki/List_of_cities_in_Latvia
Web page title is List of cities and towns in Latvia - Wikipedia


### Using CSS selectors to extract data

Scrapy provides a powerful selector system that allows you to extract data from HTML using CSS selectors. You can use CSS selectors to target specific elements on a page and extract the data you need.

**CSS selectors** allow for simpler and often more intuitive queries using CSS rules. Scrapy's `response.css()` method provides an interface for selecting elements based on CSS selectors.
 
#### Common CSS Examples:
| CSS Selector | Description |
| --- | --- |
| `tag` | Selects all `<tag>` elements. |
| `.class` | Selects all elements with the class `class`. |
| `#id` | Selects the element with the ID `id`. |
| `tag.class` | Selects `<tag>` elements with the class `class`. |
| `tag[attr="value"]` | Selects `<tag>` elements with an attribute `attr="value"`. |
| `tag > child` | Selects `child` elements that are direct children of `<tag>`. |
| `tag child` | Selects `child` elements that are descendants of `<tag>` at any depth. |
| `tag:first-child` | Selects `<tag>` elements that are the first child of their parent. |
| `tag:nth-child(n)` | Selects `<tag>` elements that are the nth child of their parent. |

#### Scrapy CSS Methods:

- `response.css('<CSS selector>')`: Returns the elements matching the CSS selector as a list of selectors.
- `.get()`: Returns the first matching value, or `None` if nothing matches.
- `.getall()`: Returns all matching values as a list.
- `.re('<regex>')`: Returns the strings matched by a regular expression as a list.

In [16]:
# now let's use Selector to get the table with cities and their population
# we will use CSS selectors
# we will use the table with class wikitable sortable

# first let's see about getting all tables
tables = scrapy_response.css('table')
# how many tables are there?
print("There are", len(tables), "tables on the page")

# by using inspect element in browser we can see that the table we want is the first table with class wikitable sortable
table = scrapy_response.css('table.wikitable.sortable') 
# how many tables are there?
print("There are", len(table), "wikitable sortable tables on the page")
# in our case this is fine as we want elements from both tables
# if we only needed the first table we could use table = scrapy_response.css('table.wikitable.sortable')[0] as the first table is at index 0

# how many tr elements are in the tables
rows = table.css('tr')
print("Table(s) have rows", len(rows))

There are 7 tables on the page
There are 2 wikitable sortable tables on the page
Table(s) have rows 83


In [27]:
# let's save data in a list of dictionaries for easier processing
cities = []
# let's iterate over rows and extract the data
for row in rows:
    # city = row.css('td:nth-child(1)::text').get()
    # we want all text from first child td element that is contained in its children
    city = row.css('td:nth-child(1) *::text').getall() # note the * after td:nth-child(1) we get all text from all children
    population = row.css('td:nth-child(2)::text').get() # here we get only the text from the td element
    if city and population: # we only want rows with city and population
        cities.append({'city': city, 'population': population})
# print first 5 cities
print("Biggest five cities", cities[:5])
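# last 5 rows; judging by the output, the second table is sorted by name, so these are not necessarily the five smallest cities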
print("Smallest five cities", cities[-5:])

Biggest five cities [{'city': ['Rīga', '\xa0', 'pronunciation', 'ⓘ', '\xa0', '\n'], 'population': '658,640\n'}, {'city': ['Daugavpils', '\xa0', 'pronunciation', 'ⓘ', '\xa0', '\n'], 'population': '93,312\n'}, {'city': ['Liepāja', '\xa0', 'pronunciation', 'ⓘ', '\xa0', '\n'], 'population': '76,731\n'}, {'city': ['Jelgava', '\xa0', 'pronunciation', 'ⓘ', '\xa0', '\n'], 'population': '59,511\n'}, {'city': ['Jūrmala', '\xa0', 'pronunciation', 'ⓘ', '\xa0', '\n'], 'population': '50,840\n'}]
Smallest five cities [{'city': ['Varakļāni', '\xa0', 'pronunciation', 'ⓘ', '\xa0', '\n'], 'population': '2,106\n'}, {'city': ['Viesīte', '\xa0', 'pronunciation', 'ⓘ', '\xa0', '\n'], 'population': '1,902\n'}, {'city': ['Viļaka', '\xa0', 'pronunciation', 'ⓘ', '\xa0', '\n'], 'population': '1,525\n'}, {'city': ['Viļāni', '\xa0', 'pronunciation', 'ⓘ', '\xa0', '\n'], 'population': '3,468\n'}, {'city': ['Zilupe', '\xa0', 'pronunciation', 'ⓘ', '\xa0', '\n'], 'population': '1,746\n'}]


In [28]:
# we can see that we only care about the first entry for city name so let's clean up the list of dictionaries
# we will keep only the first entry for city name
cities = [{'city': d['city'][0], 'population': d['population']} for d in cities] # list comprehension
# print first 5 cities
print("Biggest five cities", cities[:5])
# last 5 rows (alphabetical order in the second table, so not necessarily the five smallest)
print("Smallest five cities", cities[-5:])

Biggest five cities [{'city': 'Rīga', 'population': '658,640\n'}, {'city': 'Daugavpils', 'population': '93,312\n'}, {'city': 'Liepāja', 'population': '76,731\n'}, {'city': 'Jelgava', 'population': '59,511\n'}, {'city': 'Jūrmala', 'population': '50,840\n'}]
Smallest five cities [{'city': 'Varakļāni', 'population': '2,106\n'}, {'city': 'Viesīte', 'population': '1,902\n'}, {'city': 'Viļaka', 'population': '1,525\n'}, {'city': 'Viļāni', 'population': '3,468\n'}, {'city': 'Zilupe', 'population': '1,746\n'}]


In [None]:
# now we could save the results to a file or database
# alternatively we could load the data into a pandas DataFrame for further processing
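
# One possible way to do the clean-up and saving mentioned above (a sketch, standard library only).
# `raw` holds two sample records in the same shape as `cities`; the names `raw`, `clean`,
# `cleaned`, `as_json`, and `buffer` are made up for this illustration.

```python
import csv
import io
import json

# Two sample records in the shape produced by the extraction loop above
raw = [
    {'city': 'Rīga', 'population': '658,640\n'},
    {'city': 'Daugavpils', 'population': '93,312\n'},
]


def clean(record):
    """Strip whitespace and convert a population string such as '658,640' to an int."""
    return {
        'city': record['city'].strip(),
        'population': int(record['population'].strip().replace(',', '')),
    }


cleaned = [clean(r) for r in raw]

# JSON keeps non-ASCII city names readable with ensure_ascii=False
as_json = json.dumps(cleaned, ensure_ascii=False, indent=2)

# CSV is written to an in-memory buffer here; to save a real file, use
# open('cities.csv', 'w', newline='', encoding='utf-8') instead of io.StringIO()
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['city', 'population'])
writer.writeheader()
writer.writerows(cleaned)

print(as_json)
print(buffer.getvalue())
```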

### Using XPath selectors to extract data

XPath is more powerful than CSS selectors, but it is also more complex. 

**XPath** (XML Path Language) allows you to navigate and query nodes in an HTML or XML document. Scrapy's `response.xpath()` method provides an interface for selecting elements based on XPath expressions.

#### Common XPath Examples:
| XPath Expression | Description |
| --- | --- |
| `//tag` | Selects all `<tag>` elements anywhere in the document. |
| `./tag` | Selects `<tag>` elements that are direct children of the current node. |
| `//tag[@attr="value"]` | Selects `<tag>` elements with an attribute `attr="value"`. |
| `//tag/text()` | Selects the text nodes directly inside `<tag>` elements. |
| `//tag[contains(@attr, "val")]` | Selects `<tag>` elements where `attr` contains `"val"`. |
| `//tag[1]` | Selects every `<tag>` that is the first `<tag>` child of its parent; use `(//tag)[1]` for the first one in the document. |
| `//tag[last()]` | Selects every `<tag>` that is the last `<tag>` child of its parent. |
| `//tag[position() < 3]` | Selects the first two `<tag>` children of each parent. |
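#### Scrapy XPath Methods:

- `response.xpath('<XPath>')`: Returns the elements matching the XPath expression as a list of selectors.
- `.get()`: Returns the first matching value, or `None` if nothing matches.
- `.getall()`: Returns all matching values as a list.
- `.extract()`: Older name from before `.get()` and `.getall()` were introduced; on a list of selectors it behaves like `.getall()`. It still works, but the newer names are preferred.
- `.re('<regex>')`: Returns the strings matched by a regular expression as a list.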


In [29]:
# now let's see how we could have used XPath selectors to get the same data

# let's extract the table with XPath
table = scrapy_response.xpath('//table[contains(@class, "wikitable") and contains(@class, "sortable")]')
# how many tables are there?
print("There are", len(table), "wikitable sortable tables on the page")

There are 2 wikitable sortable tables on the page


In [None]:
# now let's extract the rows with XPath
rows = table.xpath('.//tr')
print("Table(s) have rows", len(rows))

Table(s) have rows 83


In [None]:
# now let's try something slightly fancier: we want to extract the text of the first anchor inside the first td element - the city name
# example row from the page:
# <tr>
# <td><a href="/wiki/R%C4%ABga" class="mw-redirect" title="Rīga">Rīga</a>&nbsp;<small><span class="noprint"><span class="ext-phonos"><span data-nosnippet="" id="ooui-php-1" class="ext-phonos-PhonosButton noexcerpt oo-ui-widget oo-ui-widget-enabled oo-ui-buttonElement oo-ui-buttonElement-frameless oo-ui-iconElement oo-ui-labelElement oo-ui-buttonWidget" data-ooui="{&quot;_&quot;:&quot;mw.Phonos.PhonosButton&quot;,&quot;href&quot;:&quot;\/\/upload.wikimedia.org\/wikipedia\/commons\/transcoded\/f\/f5\/Lv-R%C4%ABga.ogg\/Lv-R%C4%ABga.ogg.mp3&quot;,&quot;rel&quot;:[&quot;nofollow&quot;],&quot;framed&quot;:false,&quot;icon&quot;:&quot;volumeUp&quot;,&quot;label&quot;:{&quot;html&quot;:&quot;pronunciation&quot;},&quot;data&quot;:{&quot;ipa&quot;:&quot;&quot;,&quot;text&quot;:&quot;&quot;,&quot;lang&quot;:&quot;en&quot;,&quot;wikibase&quot;:&quot;&quot;,&quot;file&quot;:&quot;Lv-R\u012bga.ogg&quot;},&quot;classes&quot;:[&quot;ext-phonos-PhonosButton&quot;,&quot;noexcerpt&quot;]}"><a role="button" tabindex="0" href="//upload.wikimedia.org/wikipedia/commons/transcoded/f/f5/Lv-R%C4%ABga.ogg/Lv-R%C4%ABga.ogg.mp3" rel="nofollow" aria-label="Play audio" title="Play audio" class="oo-ui-buttonElement-button"><span class="oo-ui-iconElement-icon oo-ui-icon-volumeUp"></span><span class="oo-ui-labelElement-label">pronunciation</span><span class="oo-ui-indicatorElement-indicator oo-ui-indicatorElement-noIndicator"></span></a></span><sup class="ext-phonos-attribution noexcerpt navigation-not-searchable"><a href="/wiki/File:Lv-R%C4%ABga.ogg" title="File:Lv-Rīga.ogg">ⓘ</a></sup></span></span></small>&nbsp;<figure class="mw-halign-right" typeof="mw:File"><a href="/wiki/File:Greater_Coat_of_Arms_of_Riga_-_for_display.svg" class="mw-file-description"><img src="//upload.wikimedia.org/wikipedia/commons/thumb/9/99/Greater_Coat_of_Arms_of_Riga_-_for_display.svg/55px-Greater_Coat_of_Arms_of_Riga_-_for_display.svg.png" decoding="async" width="55" height="33" class="mw-file-element" 
# srcset="//upload.wikimedia.org/wikipedia/commons/thumb/9/99/Greater_Coat_of_Arms_of_Riga_-_for_display.svg/83px-Greater_Coat_of_Arms_of_Riga_-_for_display.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/9/99/Greater_Coat_of_Arms_of_Riga_-_for_display.svg/110px-Greater_Coat_of_Arms_of_Riga_-_for_display.svg.png 2x" data-file-width="510" data-file-height="303"></a><figcaption></figcaption></figure>
# </td>
# <td>658,640
# </td>
# <td>632,614
# </td>
# <td>605,273
# </td></tr>

# here we can see that city name Rīga is in the first td element inside its first anchor child

In [None]:
# now let's save data in a list of dictionaries for easier processing
also_cities = []
# let's iterate over rows and extract the data
for row in rows:
    # city = row.xpath('.//td[1]//text()').getall() # we want all text from first child td element that is contained in its children
    # we want text from first anchor element in td[1]
    city = row.xpath('.//td[1]//a[1]//text()').get() # note the //a[1] we get text from first anchor child of the first td element
    population = row.xpath('.//td[2]//text()').get() # .get() returns the first text node found inside the second td element
    if city and population: # we only want rows with city and population
        also_cities.append({'city': city, 'population': population})
# print first 5 cities
print("Biggest five cities", also_cities[:5])
# last 5 rows (alphabetical order in the second table, so not necessarily the five smallest)
print("Smallest five cities", also_cities[-5:])

# using XPath we did not have to do any Python list comprehension to clean up the data after extraction
# now we could still clean up newlines and convert population to integer but that is easy and not part of scraping itself

Biggest five cities [{'city': 'Rīga', 'population': '658,640\n'}, {'city': 'Daugavpils', 'population': '93,312\n'}, {'city': 'Liepāja', 'population': '76,731\n'}, {'city': 'Jelgava', 'population': '59,511\n'}, {'city': 'Jūrmala', 'population': '50,840\n'}]
Smallest five cities [{'city': 'Varakļāni', 'population': '2,106\n'}, {'city': 'Viesīte', 'population': '1,902\n'}, {'city': 'Viļaka', 'population': '1,525\n'}, {'city': 'Viļāni', 'population': '3,468\n'}, {'city': 'Zilupe', 'population': '1,746\n'}]


### CSS vs XPath key differences in Scrapy

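| Feature | XPath | CSS |
| --- | --- | --- |
| Syntax Complexity | More expressive and versatile | Simpler, easier to write |
| Attribute Matching | `[@attr="value"]` | `[attr="value"]` |
| Text Content | `text()` for direct text, `.//text()` for all nested text | `::text` for direct text, ` *::text` for nested text (Scrapy extensions) |
| Positional Matching | `[n]` or `[position()=n]`, counted among siblings with the same tag | `:nth-child(n)`, counted among all siblings |
| Advanced Navigation | Supports parent, ancestor, and sibling axes | Child, descendant, and following-sibling combinators; cannot select a parent |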

### Scraping multiple pages with Scrapy

For scraping multiple pages, you can use Scrapy's built-in web crawler to follow links and scrape data from multiple pages. You can define rules to follow links and extract data from each page.

For now we will use a preset list of URLs, but in practice you would usually collect the links to follow from the pages themselves.

In [2]:
start_urls = [
    "https://en.wikipedia.org/wiki/List_of_cities_in_Estonia",
    "https://en.wikipedia.org/wiki/List_of_cities_in_Latvia",
    "https://en.wikipedia.org/wiki/List_of_cities_in_Lithuania",
]
print("We will scrape the following pages:", *start_urls, sep="\n") # we use * to unpack the list into separate arguments for neat printing

We will scrape the following pages:
https://en.wikipedia.org/wiki/List_of_cities_in_Estonia
https://en.wikipedia.org/wiki/List_of_cities_in_Latvia
https://en.wikipedia.org/wiki/List_of_cities_in_Lithuania


In [None]:
# we have the start_urls now we can create a Scrapy spider to scrape these pages
# in this case we will use a simple spider that will only extract the title of the page

# let's do a simple spider that will extract the title of the page

# we want to save these titles in a list of dictionaries as JSON

import scrapy
from scrapy.crawler import CrawlerProcess

class SimpleSpider(scrapy.Spider):
    name = 'simple_spider'
    start_urls = [
        "https://en.wikipedia.org/wiki/List_of_cities_in_Estonia",
        "https://en.wikipedia.org/wiki/List_of_cities_in_Latvia",
        "https://en.wikipedia.org/wiki/List_of_cities_in_Lithuania",
    ]  # we could have passed this as a parameter to the spider

    def parse(self, response):
        title = response.css('title::text').get()
        # ADD your own code here to extract more data from each page
        # you can use either css or xpath selectors
        yield {'title': title}

# let's run the spider
# we also want to save the results to a file

process = CrawlerProcess(settings={
    # FEEDS replaces the FEED_URI / FEED_FORMAT settings, which are deprecated since Scrapy 2.1
    'FEEDS': {'cities_titles.json': {'format': 'json'}},
})

process.crawl(SimpleSpider)
process.start()
# NOTE this process is really meant to be run from a script or terminal
# if you run it in a Jupyter notebook you might have to restart the kernel to run it again
# otherwise you will get errors about already running Twisted reactor

2024-12-09 20:14:36 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: scrapybot)
2024-12-09 20:14:36 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.12.7 (tags/v3.12.7:0b05ead, Oct  1 2024, 03:06:41) [MSC v.1941 64 bit (AMD64)], pyOpenSSL 24.3.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform Windows-11-10.0.22631-SP0
2024-12-09 20:14:36 [scrapy.addons] INFO: Enabled addons:
[]
2024-12-09 20:14:36 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-12-09 20:14:36 [scrapy.extensions.telnet] INFO: Telnet Password: 230407da82ccaf53
  exporter = cls(crawler)

2024-12-09 20:14:36 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2024-12-09 20:14:36 [scrapy.crawler] INFO: Overridden setting

### Multi page scraper overview

 
1. **Create the Spider**:
   - Include all URLs in `start_urls` or load them dynamically using `start_requests()`.

2. **Parse Each Page**:
   - Use the same `parse()` method for extracting data since the structure of the pages is similar.

3. **Identify the Context**:
   - Use information from the URL or page content to tag the data (e.g., country name).

4. **Run the Spider**:
   - Execute the spider and save the output to a file.

5. **Post-Processing**:
   - Optionally, combine or clean the scraped data further (e.g., merge JSON files).
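
Step 3 can be sketched with a small helper that derives the country from the URL. It assumes URLs shaped like the three in `start_urls`; the function names `country_from_url` and `tag_item` are made up for this example. Inside a spider's `parse()` method you could call it with `response.url`.

```python
from urllib.parse import urlsplit, unquote


def country_from_url(url):
    """Derive a country tag from URLs shaped like .../List_of_cities_in_<Country>."""
    last_segment = urlsplit(url).path.rsplit('/', 1)[-1]
    marker = '_in_'
    if marker not in last_segment:
        return None  # URL does not follow the expected pattern
    country = last_segment.split(marker, 1)[1]
    return unquote(country).replace('_', ' ')


def tag_item(item, url):
    """Return a copy of a scraped item with the country added."""
    return {**item, 'country': country_from_url(url)}


urls = [
    "https://en.wikipedia.org/wiki/List_of_cities_in_Estonia",
    "https://en.wikipedia.org/wiki/List_of_cities_in_Latvia",
    "https://en.wikipedia.org/wiki/List_of_cities_in_Lithuania",
]
for url in urls:
    print(tag_item({'title': 'placeholder title'}, url))
```

Returning `None` for unexpected URLs keeps the spider running and makes such items easy to filter out during post-processing.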

### Scrapy Learning References

For further exploration of Scrapy, you can refer to the following resources:

- Official Scrapy Documentation: https://docs.scrapy.org/en/latest/
- Scrapy Tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html
- YouTube Tutorial: https://www.youtube.com/watch?v=ve_0h4Y8nuI

## Practice

### Flask Practice

1. Create a simple web application using Flask that displays a list of items. The list should be stored in a Python list and displayed on the web page.
2. Add a form to the web application that allows users to add new items to the list.
3. Add a delete button next to each item in the list that allows users to delete items from the list.

### Scrapy Practice

1. Create a Scrapy spider that scrapes data from a website of your choice. The spider should extract at least two fields from the website and save the data to a JSON or CSV file.
2. Modify the spider to save the data to a database instead of a JSON or CSV file.
3. Add error handling to the spider to handle cases where the website is down or the data is missing.

## Summary

In this lesson, we covered the basics of server-side web development using Flask and web scraping using Scrapy. We learned how to create a simple web application with Flask and how to scrape data from a website using Scrapy. We also discussed the rules of web scraping and best practices for working with web scraping libraries.