# Problem 1: scraping houses prices

## Strategy
Loop over all the pages.
For each page, loop over all the list items: 
```html
<li class="sold-results__normal-hit">
  ...
</li>
```
For each list item, extract the following information:
### Sold date: 
```html
<span class="hcl-label hcl-label--state hcl-label--sold-at">
    Såld 9 oktober 2023
</span>
```
remove the "Såld " prefix

### Address: 
```html
<h2 class="sold-property-listing__heading qa-selling-price-title hcl-card__title">
  Skårby station 350
</h2>
```
### Location of the estate: 
```html
   <div class="sold-property-listing__location">
     <div>
       <span class="property-icon property-icon--result">...</span>
       Kareby,
       Kungälvs kommun
     </div>
   </div>
```
access the div child and remove the span element

### Area of the house & number of rooms:
```html  
<div class="sold-property-listing__subheading sold-property-listing__area">
    143
      <span class="listing-card__attribute--normal-weight">
        + 25&nbsp;m²
      </span>
    &nbsp;
    7&nbsp;rum
</div>
```
OR
```html
<div class="sold-property-listing__subheading sold-property-listing__area">
  123&nbsp;m²
  &nbsp;
  6&nbsp;rum
</div>
```
remove the span element if present, remove the m² and rum suffixes
save the biarea if a "+" is present

### Area of the plot:
```html
<div class="sold-property-listing__land-area">
  2&nbsp;963&nbsp;m² tomt
</div>
```
remove the m² tomt suffix

### Closing price:
```html
<span class="hcl-text hcl-text--medium">
  Slutpris 4&nbsp;395&nbsp;000&nbsp;kr
</span>
```
remove the "Slutpris " prefix and the kr suffix
  
---

In [28]:
import glob
from bs4 import BeautifulSoup

Loop over all the pages.

In [29]:
file_pattern = "kungalv_slutpriser/kungalv_slutpris_page_*.html"

l = []

for file_name in glob.glob(file_pattern):
	with open(file_name, "r") as f:
		content = f.read()
		soup = BeautifulSoup(content, "html.parser")
		print("file_name:", file_name)
		for li in soup.find_all("li", class_="sold-results__normal-hit"):
			# Sold date
			sold_date_element = li.find(
				"span", class_="hcl-label hcl-label--state hcl-label--sold-at"
			)
			sold_date = None
			if sold_date_element is not None:
				sold_date = sold_date_element.text.strip()
				sold_date = sold_date.replace("Såld ", "")

			# Address
			address_element = li.find(
				"h2", class_="sold-property-listing__heading qa-selling-price-title hcl-card__title"
			)
			address = None
			if address_element is not None:
				address = address_element.text.strip()

			# Location
			location_element = li.find("div", class_="sold-property-listing__location").div
			location = None
			if location_element is not None:
				location_element.find("span").decompose()
				location = location_element.text.strip()
			
			area_element = li.find("div", class_="sold-property-listing__subheading sold-property-listing__area")
			boarea = None
			biarea = None
			if area_element is not None:
				if area_element.find("span") is not None:
					biarea = area_element.find("span").text.strip()
					biarea = biarea.replace(" m²", "")
					biarea = biarea.replace("+ ", "")
					area_element.find("span").decompose()
				boarea = area_element.text.strip().split("\xa0")[0]
				boarea = boarea.replace(" m²", "")
				# rooms = area_element.text.strip().split("\xa0")[1]
				rooms = rooms.replace(" rum", "")

			# Plot
			plot_element = li.find("div", class_="sold-property-listing__land-area")
			plot = None
			if plot_element is not None:
				plot = plot_element.text.strip()
				plot = plot.replace(" m² tomt", "")
			
			price_element = li.find("span", class_="hcl-text hcl-text--medium")
			price = None
			if price_element is not None:
				price = price_element.text.strip()
				price = price.replace("Slutpris ", "")
				price = price.replace(" kr", "")
			
			print("sold_date:", sold_date)
			print("address:", address)
			print("location:", location)
			print("boarea:", boarea)
			print("biarea:", biarea)
			print("rooms:", rooms)
			print("plot:", plot)
			print("price:", price)
			print("-------------------------")

file_name: kungalv_slutpriser/kungalv_slutpris_page_27.html
sold_date: 23 november 2017
address: Sjöhåla 580
location: Kovikshamn,
          Kungälvs kommun
boarea: 94
                  
              
biarea: 87 m²
rooms: 
                6
plot: 1 068 m² tomt
price: 3 100 000 kr
-------------------------
sold_date: 18 november 2017
address: Galeasgatan 15
location: Kungälv,
          Kungälvs kommun
boarea: 103
                  
              
biarea: 64 m²
rooms: 
                6
plot: 610 m² tomt
price: 3 850 000 kr
-------------------------
sold_date: 17 november 2017
address: Västerhöjdsvägen 36
location: Kärna,
          Kungälvs kommun
boarea: 107
biarea: None
rooms: 
                6
plot: 258 m² tomt
price: 4 000 000 kr
-------------------------
sold_date: 16 november 2017
address: Gråstensvägen 19
location: Kode Halltorp,
          Kungälvs kommun
boarea: 94
biarea: None
rooms: 
                6
plot: 1 197 m² tomt
price: 3 200 000 kr
-------------------------
sold_date