# Project proposal
### _Josef Švec & Markéta Malá_

We want to analyse Prague's real estate market. Our data source is Sreality.cz, a server run by Seznam that offers a truly extensive pool of real estate listings for rent or sale (and, due to the coronavirus crisis, a more extensive one than ever before). We would like to narrow our focus to rentals within Prague.

At the very end of this notebook we list all the variables that we should be able to scrape from the web (with a bit of time and luck). Based on this list, we concluded that the most interesting way of approaching the data could be to import a simple map of Prague and express the average prices per m² on it visually (e.g. by colour intensity or by the height of a surface in a 3D model). We would also like to add a histogram comparing prices across Prague's districts.

Currently, however, we are dealing with a problem: the page is dynamic and requires Selenium, which makes the scraping technically very demanding. By our count, scraping with our proposed approach would take hours, because every link has to be opened separately. If there is a better approach, we will be very happy to learn it.
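A rough back-of-the-envelope calculation shows where the hours come from. The sketch below uses the 7536 matches found for our search and assumes roughly 3 seconds per advert page (loading plus the sleep). The 3 seconds is an assumption, not a measurement, and `estimated_hours` is a hypothetical helper.

```python
def estimated_hours(n_adverts, seconds_per_page):
    """Return the total scraping time in hours when pages are opened one by one."""
    return n_adverts * seconds_per_page / 3600


# Assumed inputs: 7536 adverts, about 3 s per detail page (load + sleep).
print(round(estimated_hours(7536, 3), 2))  # 6.28
```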

## Start of a nice jupyter

In [1]:
from selenium import webdriver

In [2]:
import time

In [3]:
from math import ceil

In [4]:
from bs4 import BeautifulSoup

## How is the link structured 

The website is located under the domain name www.sreality.cz.

It lets a client search for real estate by specified parameters. We specify:

- offer: renting
- type: flats
- location: Prague

The site then displays 20 adverts per page.

Here we break down how the link is structured:

In [8]:
link_base = "https://www.sreality.cz"

In [9]:
search = "/hledani/pronajem/byty/praha"

In [10]:
pg = "?strana="
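The three pieces above can also be combined in a small helper. This is only a sketch: `page_url` is a hypothetical name, and it assumes that the page number is the only query parameter we need.

```python
from urllib.parse import urlencode

LINK_BASE = "https://www.sreality.cz"
SEARCH = "/hledani/pronajem/byty/praha"


def page_url(page_number):
    """Build the URL of one results page (pages are numbered from 1)."""
    query = urlencode({"strana": page_number})  # gives e.g. "strana=3"
    return f"{LINK_BASE}{SEARCH}?{query}"


print(page_url(3))  # https://www.sreality.cz/hledani/pronajem/byty/praha?strana=3
```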

From the number of matches we can calculate the number of pages from which we have to collect the links.

In [11]:
matches = 7536

In [12]:
pages = ceil(matches/20)

In [13]:
pages

377

Because we understand the structure of the link, we can create a list of all the pages that we want to visit.

In [16]:
all_pages = [link_base+search+pg+str(k+1) for k in range(0, pages)]

In [23]:
all_pages

['https://www.sreality.cz/hledani/pronajem/byty/praha?strana=1',
 'https://www.sreality.cz/hledani/pronajem/byty/praha?strana=2',
 'https://www.sreality.cz/hledani/pronajem/byty/praha?strana=3',
 'https://www.sreality.cz/hledani/pronajem/byty/praha?strana=4',
 'https://www.sreality.cz/hledani/pronajem/byty/praha?strana=5',
 'https://www.sreality.cz/hledani/pronajem/byty/praha?strana=6',
 'https://www.sreality.cz/hledani/pronajem/byty/praha?strana=7',
 'https://www.sreality.cz/hledani/pronajem/byty/praha?strana=8',
 'https://www.sreality.cz/hledani/pronajem/byty/praha?strana=9',
 'https://www.sreality.cz/hledani/pronajem/byty/praha?strana=10',
 'https://www.sreality.cz/hledani/pronajem/byty/praha?strana=11',
 'https://www.sreality.cz/hledani/pronajem/byty/praha?strana=12',
 'https://www.sreality.cz/hledani/pronajem/byty/praha?strana=13',
 'https://www.sreality.cz/hledani/pronajem/byty/praha?strana=14',
 'https://www.sreality.cz/hledani/pronajem/byty/praha?strana=15',
 'https://www.sreal

We have now generated the links to all the result pages, from which we will later scrape the links to the individual adverts.
Sreality is a dynamic page, so we will use Selenium <3

### Selenium

#### code for one page

We visit the web page with Selenium, get the HTML code, and find all the links that lead to our desired destination.

In [25]:
driver = webdriver.Chrome("C:\\chromedriver\\chromedriver.exe")
driver.get("https://www.sreality.cz/hledani/pronajem/byty/praha")
time.sleep(2)
res1 = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()

In [26]:
soup = BeautifulSoup(res1, "lxml")

In [27]:
links = soup.find_all("a", {"class":"title"})

We check that we collected 21 links: the 20 regular adverts plus one sponsored link per page.
Then we print the first link that is not sponsored.

In [28]:
len(links)

21

In [30]:
links[1]['href']

'/detail/pronajem/byt/1+1/praha-haje-brechtova/3401264732'

#### code for many pages

Now we generalize this code to visit all the pages from which we set out to collect links.

In [34]:
driver = webdriver.Chrome("C:\\chromedriver\\chromedriver.exe")
all_links = []
for k in range(2): # when ready replace 2 with variable "pages"
    driver.get(all_pages[k])
    time.sleep(1)
    res = driver.execute_script("return document.documentElement.outerHTML")
    soup = BeautifulSoup(res, "lxml")
    links = soup.find_all("a", {"class":"title"})
    for j in links[1:]:  # skip the first link, which is the sponsored one
        all_links.append(j['href'])
driver.quit()

In [35]:
len(all_links)

60

In [36]:
all_links[:5]

['/detail/pronajem/byt/1+1/praha-strasnice-starostrasnicka/2781822556',
 '/detail/pronajem/byt/1+1/praha-haje-brechtova/3401264732',
 '/detail/pronajem/byt/2+kk/praha-vrsovice-mexicka/2075864668',
 '/detail/pronajem/byt/1+kk/praha-bubenec-u-akademie/820854364',
 '/detail/pronajem/byt/1+kk/praha-reporyje-k-trebonicum/1313029724']

This is just the beginning of all the links we can get :D
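The collected hrefs are relative, and each one already carries some information in its path. A minimal sketch of turning one into a full URL and splitting it into fields follows. The meaning given to each path segment is our assumption based on the sample links, and `parse_detail_href` is a hypothetical helper.

```python
from urllib.parse import urljoin

LINK_BASE = "https://www.sreality.cz"


def parse_detail_href(href):
    """Split a relative advert link into a full URL and its path segments.

    Assumed shape: /detail/<deal>/<property type>/<layout>/<locality>/<id>.
    A link with a different number of segments raises ValueError.
    """
    _, deal, prop_type, layout, locality, advert_id = href.strip("/").split("/")
    return {
        "url": urljoin(LINK_BASE, href),
        "deal": deal,
        "type": prop_type,
        "layout": layout,
        "locality": locality,
        "advert_id": advert_id,
    }


parsed = parse_detail_href("/detail/pronajem/byt/1+1/praha-haje-brechtova/3401264732")
print(parsed["layout"], parsed["advert_id"])  # 1+1 3401264732
```

The advert ID at the end of the path looks like a natural key for storing the scraped records later.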

## Now we need to decide what we want from the individual advert pages.

In [39]:
link_explore = link_base + all_links[0]
driver = webdriver.Chrome("C:\\chromedriver\\chromedriver.exe")
driver.get(link_explore)
time.sleep(1)
gist = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()  # close the browser once we have the HTML

In [38]:
info = BeautifulSoup(gist, "lxml")

### magic from now on 

In [41]:
lists = info.find_all("li")
dict = {}  # note: this name shadows the built-in dict type

# title and address are stored as one-element lists, like the other values
dict["title"] = [info.find("span",{"itemprop":"name"}).findChild().get_text().replace("\n","").replace("\xa0"," ") ]
dict["adresa"] = [info.find("span",{"class": "location"}).get_text().replace("\n","").replace("\xa0"," ")]
for k in lists:
    # the parameter name sits in a <label>; stop at the first <li> without one
    try:
        column = k.find("label").get_text()
    except AttributeError:
        break
    # plain parameters keep their value in <strong>
    try:
        value = [k.find("strong").get_text()]
    except AttributeError:
        # nearby places: a name (as a link or as plain text) plus a distance
        try:
            value = [k.find("a").get_text(), k.find("span", {"class":"c-pois__distance ng-binding"}).get_text()]
        except AttributeError:
            value = [k.find("span",{"class":"c-pois__poi-text ng-binding ng-scope"}).get_text(),k.find("span", {"class":"c-pois__distance ng-binding"}).get_text()]

    # icon-only values have no text; take the last token of the icon's ng-if attribute instead
    if value[0] == "\n\n\n\n\n\n\n":
        value[0] = k.find("strong").findChild("span",{"class": "icof icon-ok ng-scope"})["ng-if"].split(" ")[-1]

    # we clean the values a little bit
    keys = [val.replace("\n","") for val in value]
    keys_a = [ki.replace("\xa0"," ") for ki in keys]

    # add it to the dictionary
    dict[column] = keys_a

print(dict)

{'title': ['Pronájem bytu 1+1 43 m²'], 'adresa': ['Starostrašnická, Praha 10 - Strašnice Panorama'], 'Celková cena:': ['12 000 Kč za měsíc'], 'Poznámka k ceně:': ['+ 1500,- služby včetně el. a plynu'], 'ID zakázky:': ['273-N00709'], 'Aktualizace:': ['Dnes'], 'Stavba:': ['Cihlová'], 'Stav objektu:': ['Velmi dobrý'], 'Vlastnictví:': ['Osobní'], 'Podlaží:': ['3. podlaží z celkem 6'], 'Užitná plocha:': ['43m2'], 'Plocha podlahová:': ['43m2'], 'Voda:': ['Místní zdroj'], 'Topení:': ['Ústřední dálkové'], 'Plyn:': ['Plynovod'], 'Odpad:': ['Veřejná kanalizace'], 'Telekomunikace:': ['Telefon, Internet, Kabelová televize'], 'Elektřina:': ['400V'], 'Doprava:': ['MHD'], 'Energetická náročnost budovy:': ['Třída G - Mimořádně nehospodárná'], 'Cukrárna:': ['OVOCNÝ SVĚTOZOR', ' (569 m)'], 'Kino:': ['Cinema City Flora', ' (2219 m)'], 'Přírodní zajímavost:': ['Mokřad Triangl', ' (2384 m)'], 'Hřiště:': ['Dětské hřiště Nad Primaskou', ' (174 m)'], 'Večerka:': ['BILLA', ' (421 m)'], 'Hospoda:': ['Hospůdka U

In [42]:
dict

{'title': ['Pronájem bytu 1+1 43 m²'],
 'adresa': ['Starostrašnická, Praha 10 - Strašnice Panorama'],
 'Celková cena:': ['12 000 Kč za měsíc'],
 'Poznámka k ceně:': ['+ 1500,- služby včetně el. a plynu'],
 'ID zakázky:': ['273-N00709'],
 'Aktualizace:': ['Dnes'],
 'Stavba:': ['Cihlová'],
 'Stav objektu:': ['Velmi dobrý'],
 'Vlastnictví:': ['Osobní'],
 'Podlaží:': ['3. podlaží z celkem 6'],
 'Užitná plocha:': ['43m2'],
 'Plocha podlahová:': ['43m2'],
 'Voda:': ['Místní zdroj'],
 'Topení:': ['Ústřední dálkové'],
 'Plyn:': ['Plynovod'],
 'Odpad:': ['Veřejná kanalizace'],
 'Telekomunikace:': ['Telefon, Internet, Kabelová televize'],
 'Elektřina:': ['400V'],
 'Doprava:': ['MHD'],
 'Energetická náročnost budovy:': ['Třída G - Mimořádně nehospodárná'],
 'Cukrárna:': ['OVOCNÝ SVĚTOZOR', ' (569 m)'],
 'Kino:': ['Cinema City Flora', ' (2219 m)'],
 'Přírodní zajímavost:': ['Mokřad Triangl', ' (2384 m)'],
 'Hřiště:': ['Dětské hřiště Nad Primaskou', ' (174 m)'],
 'Večerka:': ['BILLA', ' (421 m)'],


However, this code should be rewritten at least into functions, or maybe in an OOP style.
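One of those functions could turn the scraped strings into numbers, since the price per m² is what we ultimately want to plot. The sketch below works on the value formats seen in the output above ('12 000 Kč za měsíc', '43m2'). The helper names are hypothetical, and other adverts may use formats that this does not handle.

```python
import re


def parse_price_czk(text):
    """Extract the price in CZK from text such as '12 000 Kč za měsíc'."""
    match = re.search(r"\d[\d\s]*", text)
    if match is None:
        return None  # no digits found, e.g. no price given
    return int(re.sub(r"\s", "", match.group()))


def parse_area_m2(text):
    """Extract the area in square metres from text such as '43m2'."""
    match = re.match(r"\s*(\d+)", text)
    return int(match.group(1)) if match else None


def price_per_m2(record):
    """Compute the monthly price per m2 for one scraped record, or None."""
    price = parse_price_czk(record["Celková cena:"][0])
    area = parse_area_m2(record["Užitná plocha:"][0])
    if price is None or not area:
        return None
    return price / area


sample = {"Celková cena:": ["12 000 Kč za měsíc"], "Užitná plocha:": ["43m2"]}
print(round(price_per_m2(sample), 1))  # 279.1
```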

## Now we can collect data 
### is it going to be a dictionary of dictionaries? or json?

In [45]:
dict['Tram:'][0]

'Strašnická'
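Both options from the heading can be combined: a dictionary of dictionaries in memory that is saved as JSON. The sketch below assumes we key each record by the advert ID taken from the end of its link; the record itself is shortened to three fields from the output above.

```python
import json

# One shortened record in the shape produced by the scraping cell,
# keyed by the advert ID (an assumed choice of key).
records = {
    "2781822556": {
        "title": ["Pronájem bytu 1+1 43 m²"],
        "Celková cena:": ["12 000 Kč za měsíc"],
        "Tram:": ["Strašnická"],
    },
}

# ensure_ascii=False keeps the Czech characters readable in the output
text = json.dumps(records, ensure_ascii=False, indent=2)
restored = json.loads(text)
print(restored["2781822556"]["Tram:"][0])  # Strašnická
```

The lists and strings survive the round trip unchanged, so the same lookups work on the restored data.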

In [None]:
Scrape(2)