# Project proposal
### _Josef Švec & Markéta Malá_

We would like to analyse Prague's real estate market. Our data source is Sreality.cz, a server run by Seznam that offers a truly extensive pool of real estate listings for rent or for sale (and, due to the coronavirus crisis, one that is more extensive than ever before). We would like to narrow our focus to rental properties within Prague.

Below, at the very end of this notebook, we list all the variables that we should be able to scrape from the web (with a bit of time and luck). Based on this list, we concluded that the most interesting way of approaching the data could be to import a simple map of Prague and express the average price per m^2 on it visually (e.g. by colour intensity or by the height of a surface in a 3D model). We would also like to plot a histogram comparing prices across Prague's districts. Alternatively, we could plot a different set of variables once we have examined our dataset. To create the visualisations, we plan to use a library such as GeoPandas, although we are not yet familiar with its capabilities.

In the notebook below, we deal with the issue of scraping dynamic webpages. We came up with a solution using the Selenium library. However, we estimate that scraping this way would take hours, because every link has to be opened separately and that takes time. If there is a better approach, we would be very happy to learn it.
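To make the "hours" estimate concrete, here is a rough back-of-the-envelope sketch. The advert count and the page size are taken from the notebook below; `seconds_per_load` is an assumed value (page load plus a fixed sleep), not something we measured, and `estimate_hours` is only a helper name used for this illustration.

```python
from math import ceil

matches = 7536          # number of adverts reported by the search (from the notebook)
per_page = 20           # adverts shown per result page (from the notebook)
seconds_per_load = 3.0  # ASSUMPTION: page load + fixed sleep, not a measured value


def estimate_hours(n_adverts, per_page, seconds_per_load):
    """Rough total time in hours for a sequential crawl, one page load at a time."""
    listing_pages = ceil(n_adverts / per_page)  # result pages that hold the links
    total_loads = listing_pages + n_adverts     # plus one load per advert detail page
    return total_loads * seconds_per_load / 3600


hours = estimate_hours(matches, per_page, seconds_per_load)
print(f"about {hours:.1f} hours")
```

Under these assumptions the crawl needs 7913 page loads, which comes to roughly 6.6 hours. The result scales linearly with `seconds_per_load`, so the real figure depends on how long each page actually takes to render.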

PS: We are aware that the code will have to be rewritten in an object-oriented style. This is only a proof of concept.

## Start of a nice Jupyter notebook

In [2]:
from selenium import webdriver

In [3]:
import time

In [4]:
from math import ceil

In [5]:
from bs4 import BeautifulSoup

## How the link is structured

The website is located under the domain name www.sreality.cz.

It lets a client search for real estate by specified parameters. We specify:

offer: renting

type: flats

location: Prague

The search results display 20 adverts per page.

Here we break down how the link is structured:

In [6]:
link_base = "https://www.sreality.cz"

In [7]:
search = "/hledani/pronajem/byty/praha"

In [8]:
pg = "?strana="

From the number of matches we can calculate the number of pages from which we have to collect links.

In [9]:
matches = 7536

In [10]:
pages = ceil(matches/20)

In [11]:
pages

377

Because we understand the structure of the link, we can create a list of all the pages that we want to visit.

In [12]:
all_pages = [link_base+search+pg+str(k+1) for k in range(0, pages)]

We print the first and last items of the list to verify our result.

In [18]:
[all_pages[0], all_pages[-1]]

['https://www.sreality.cz/hledani/pronajem/byty/praha?strana=1',
 'https://www.sreality.cz/hledani/pronajem/byty/praha?strana=377']
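The link construction above could be wrapped in a small helper, which is one step towards the planned refactoring. This is only a sketch: `build_page_urls` and its parameter names are our own illustrative choices, and the default of 20 adverts per page is the value observed above.

```python
from math import ceil


def build_page_urls(matches, per_page=20,
                    base="https://www.sreality.cz",
                    search="/hledani/pronajem/byty/praha"):
    """Return the URL of every result page for the given number of matches."""
    n_pages = ceil(matches / per_page)
    # Page numbering in the link starts at 1, hence range(1, n_pages + 1).
    return [f"{base}{search}?strana={page}" for page in range(1, n_pages + 1)]


urls = build_page_urls(7536)
print(len(urls), urls[0], urls[-1])
```

Keeping `matches` as an argument means the list can be rebuilt whenever the number of adverts on the site changes.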

We have now gathered the links to all the result pages, which in turn hold the links to the adverts that we will scrape later.
Sreality is a dynamic page, so we will use Selenium <3

### Selenium

#### code for one page

We visit the webpage with Selenium, get the HTML code and find all the links that lead to our desired destination.

In [19]:
driver = webdriver.Chrome("C:\\chromedriver\\chromedriver.exe")
driver.get("https://www.sreality.cz/hledani/pronajem/byty/praha")
time.sleep(2)
res1 = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()

In [20]:
soup = BeautifulSoup(res1, "lxml")

In [21]:
links = soup.find_all("a", {"class":"title"})

We check that we collected 21 links: the 20 adverts on the page plus one sponsored link.
Then we print the first link that is not sponsored (the sponsored one comes first, at index 0).

In [22]:
len(links)

21

In [23]:
links[1]['href']

'/detail/pronajem/byt/2+kk/praha-zizkov-blahnikova/2249539164'

#### code for many pages

Now we generalise this code to visit all the pages that we previously set out to get links from.

In [24]:
driver = webdriver.Chrome("C:\\chromedriver\\chromedriver.exe")
all_links = []
for k in range(2): # when ready replace 2 with variable "pages"
    driver.get(all_pages[k])
    time.sleep(1)
    res = driver.execute_script("return document.documentElement.outerHTML")
    soup = BeautifulSoup(res, "lxml")
    links = soup.find_all("a", {"class":"title"})
    for j in links[1:]:
        all_links.append(j['href'])
driver.quit()

In [25]:
len(all_links)

40

In [30]:
all_links[:5]

['/detail/pronajem/byt/2+kk/praha-zizkov-blahnikova/2249539164',
 '/detail/pronajem/byt/3+kk/praha-vinohrady-slavikova/4239736412',
 '/detail/pronajem/byt/1+kk/praha-nove-mesto-wenzigova/3521785436',
 '/detail/pronajem/byt/2+kk/praha-nove-mesto-wenzigova/2774216284',
 '/detail/pronajem/byt/4+kk/praha-stare-mesto-naprstkova/1197882972']

This is just the beginning of all the links we can get :D
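The relative links already carry some information of their own. The sketch below splits one link into its parts and builds the absolute URL. It assumes that every link has the same six-segment layout as the samples above, which we have not verified for all adverts; `parse_detail_link` and `unique_links` are hypothetical helper names.

```python
from urllib.parse import urljoin

LINK_BASE = "https://www.sreality.cz"


def parse_detail_link(href):
    """Split a relative advert link into its parts (layout assumed from the samples)."""
    parts = href.strip("/").split("/")
    # Expected: ['detail', 'pronajem', 'byt', '<layout>', '<locality>', '<advert id>']
    if len(parts) != 6 or parts[0] != "detail":
        raise ValueError(f"unexpected link format: {href}")
    return {
        "url": urljoin(LINK_BASE, href),
        "layout": parts[3],
        "locality": parts[4],
        "advert_id": parts[5],
    }


def unique_links(hrefs):
    """Drop repeated links while keeping the original order."""
    return list(dict.fromkeys(hrefs))


sample = "/detail/pronajem/byt/2+kk/praha-zizkov-blahnikova/2249539164"
parsed = parse_detail_link(sample)
print(parsed)
```

The advert ID at the end of the link looks like a natural key for storing the scraped data, and removing duplicates first would avoid opening the same advert twice.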

## Now we need to decide what we want from individual sites.

In [47]:
link_explore = link_base + all_links[1]
driver = webdriver.Chrome("C:\\chromedriver\\chromedriver.exe")
driver.get(link_explore)
time.sleep(2)
gist = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()

In [49]:
info = BeautifulSoup(gist, "lxml")

### magic from now on 

In [50]:
lists = info.find_all("li")
dict = {}  # note: this name shadows the built-in dict type

# title and address come from dedicated elements
dict["title"] = [info.find("span",{"itemprop":"name"}).findChild().get_text().replace("\n","").replace("\xa0"," ") ]
dict["adresa"] = [info.find("span",{"class": "location"}).get_text().replace("\n","").replace("\xa0"," ")]
for k in lists:
    # the label of the list item becomes the column name
    try:
        column = k.find("label").get_text()
    except AttributeError:
        # stop at the first <li> that has no <label>
        break
    # plain parameters keep their value in <strong>
    try:
        value = [k.find("strong").get_text()]
    except AttributeError:
        # points of interest: a name (link or plain text) plus a distance
        try:
            value = [k.find("a").get_text(), k.find("span", {"class":"c-pois__distance ng-binding"}).get_text()]
        except AttributeError:
            value = [k.find("span",{"class":"c-pois__poi-text ng-binding ng-scope"}).get_text(),k.find("span", {"class":"c-pois__distance ng-binding"}).get_text()]

    # yes/no parameters are rendered as icons, so the text is only newlines
    if value[0] == "\n\n\n\n\n\n\n":
        try:
            value[0] = k.find("strong").findChild("span",{"class": "icof icon-cross ng-scope"})["ng-if"].split(" ")[-1]
        except (AttributeError, TypeError, KeyError):
            value[0] = k.find("strong").findChild("span",{"class": "icof icon-ok ng-scope"})["ng-if"].split(" ")[-1]

    # we clean the values a little bit
    cleaned = [val.replace("\n","").replace("\xa0"," ") for val in value]

    # add it to the dictionary
    dict[column] = cleaned

print(dict)

{'title': ['Pronájem bytu 3+kk 50 m²'], 'adresa': ['Slavíkova, Praha 2 - Vinohrady'], 'Celková cena:': ['15 000 Kč za měsíc'], 'Poznámka k ceně:': ['služby 800,-/osoba/měsíc + převod energií na nájemce'], 'ID zakázky:': ['411/3466'], 'Aktualizace:': ['Dnes'], 'Stavba:': ['Cihlová'], 'Stav objektu:': ['Po rekonstrukci'], 'Vlastnictví:': ['Osobní'], 'Podlaží:': ['6. podlaží z celkem 6'], 'Užitná plocha:': ['50m2'], 'Plocha podlahová:': ['50m2'], 'Energetická náročnost budovy:': ['Třída G - Mimořádně nehospodárná č. 78/2013 Sb. podle vyhlášky'], 'Bezbariérový:': ["'boolean-false'"], 'Vybavení:': ["'boolean-true'"], 'Výtah:': ["'boolean-true'"], 'Cukrárna:': ['Vinohradské dorty', ' (64 m)'], 'Kino:': ['Filmový klub VŠE', ' (790 m)'], 'Hřiště:': ['Dětské hřiště Milešovská', ' (225 m)'], 'Kulturní památka:': ['Dům U černé Matky Boží', ' (1857 m)'], 'Večerka:': ['Potraviny Kubelíkova', ' (350 m)'], 'Hospoda:': ['U Pižďucha', ' (713 m)'], 'Divadlo:': ['Theatre ROYAL', ' (505 m)'], 'Veterinář:'

In [51]:
dict

{'title': ['Pronájem bytu 3+kk 50 m²'],
 'adresa': ['Slavíkova, Praha 2 - Vinohrady'],
 'Celková cena:': ['15 000 Kč za měsíc'],
 'Poznámka k ceně:': ['služby 800,-/osoba/měsíc + převod energií na nájemce'],
 'ID zakázky:': ['411/3466'],
 'Aktualizace:': ['Dnes'],
 'Stavba:': ['Cihlová'],
 'Stav objektu:': ['Po rekonstrukci'],
 'Vlastnictví:': ['Osobní'],
 'Podlaží:': ['6. podlaží z celkem 6'],
 'Užitná plocha:': ['50m2'],
 'Plocha podlahová:': ['50m2'],
 'Energetická náročnost budovy:': ['Třída G - Mimořádně nehospodárná č. 78/2013 Sb. podle vyhlášky'],
 'Bezbariérový:': ["'boolean-false'"],
 'Vybavení:': ["'boolean-true'"],
 'Výtah:': ["'boolean-true'"],
 'Cukrárna:': ['Vinohradské dorty', ' (64 m)'],
 'Kino:': ['Filmový klub VŠE', ' (790 m)'],
 'Hřiště:': ['Dětské hřiště Milešovská', ' (225 m)'],
 'Kulturní památka:': ['Dům U černé Matky Boží', ' (1857 m)'],
 'Večerka:': ['Potraviny Kubelíkova', ' (350 m)'],
 'Hospoda:': ['U Pižďucha', ' (713 m)'],
 'Divadlo:': ['Theatre ROYAL', ' (

That said, this code should be rewritten at least as functions, maybe as OOP.
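Before any analysis, the scraped strings have to be turned into numbers and booleans. Here is a minimal sketch of such cleaning functions, using values copied from the output above. `parse_number` and `parse_flag` are hypothetical helper names, and the sketch assumes that prices, areas and distances always appear as an integer, possibly with spaces as thousands separators.

```python
import re


def parse_number(text):
    """Return the first integer in text, ignoring spaces used as thousands separators."""
    match = re.search(r"\d[\d ]*", text.replace("\xa0", " "))
    if match is None:
        return None  # e.g. 'Dnes' contains no number
    return int(match.group().replace(" ", ""))


def parse_flag(text):
    """Map the scraped 'boolean-true' / 'boolean-false' markers to Python booleans."""
    cleaned = text.strip("'\" ")
    if cleaned == "boolean-true":
        return True
    if cleaned == "boolean-false":
        return False
    return None  # not a yes/no marker


# A few fields copied from the scraped advert above.
record = {
    "Celková cena:": ["15 000 Kč za měsíc"],
    "Užitná plocha:": ["50m2"],
    "Výtah:": ["'boolean-true'"],
    "Kino:": ["Filmový klub VŠE", " (790 m)"],
}

price = parse_number(record["Celková cena:"][0])   # monthly rent in Kč
area = parse_number(record["Užitná plocha:"][0])   # usable area in m^2
price_per_m2 = price / area
print(price, area, price_per_m2)
```

The price per m^2 computed this way is the quantity we would like to show on the map. Adverts whose price field holds no number would yield `None` and need separate handling.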

## Now we can collect data 
### is it going to be a dictionary of dictionaries? or json?

In [52]:
dict['Tram:'][1]

' (186 m)'
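One possible answer to the storage question above is to use both: keep a dictionary of dictionaries in memory, keyed by the advert ID from the link, and write it out as JSON. The sketch below uses a few fields of the advert scraped above. The choice of the advert ID as the key is our assumption, not something the site prescribes.

```python
import json

# advert ID -> scraped fields (values copied from the advert scraped above)
adverts = {
    "4239736412": {
        "title": ["Pronájem bytu 3+kk 50 m²"],
        "adresa": ["Slavíkova, Praha 2 - Vinohrady"],
        "Celková cena:": ["15 000 Kč za měsíc"],
        "Kino:": ["Filmový klub VŠE", " (790 m)"],
    }
}

# ensure_ascii=False keeps Czech characters readable in the output
text = json.dumps(adverts, ensure_ascii=False, indent=2)

# Reading the JSON back gives the same nested dictionary.
restored = json.loads(text)
print(restored["4239736412"]["adresa"][0])
```

Because every value is a list of strings, the structure maps onto JSON without any custom encoding, and one record per advert ID makes it easy to resume an interrupted scrape.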