# Parsing HTML documents

First you need to understand what an HTML document is and how you can read data from it.

HTML (HyperText Markup Language) is the standard markup language for documents displayed in a web browser. A browser receives an HTML document from a server over HTTP/HTTPS or opens it from a local disk, then renders the markup as the page shown on screen.

HTML elements are the building blocks of HTML pages. With HTML, images and other objects, such as interactive web forms, can be embedded in the displayed page. HTML provides elements for headings, paragraphs, lists, links, quotes, and more. Elements are delimited by tags written in angle brackets. Browsers do not display the tags themselves, but use them to interpret the page content.

## DOM-tree and searching for values

The DOM (Document Object Model) is a representation of an HTML document as a tree of nodes, most of which correspond to tags.

Let's start with this simple document:

<img src="./pictures/1.png"  
  width="600"
/>

Everything in HTML, even comments, is part of the DOM.
The `<!DOCTYPE...>` directive is a DOM node too. It sits in the DOM tree right before the `html` element. We won't work with this node and we don't draw it in our diagrams, but it exists.
Even the `document` object, which represents the entire document, is technically a DOM node.
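
To make the tree idea concrete, here is a minimal sketch that records how tags nest, using only Python's standard `html.parser` module. The sample markup, the `TreePrinter` class, and the `VOID_TAGS` set are illustrative assumptions and are not part of the page parsed later in this notebook.

```python
from html.parser import HTMLParser

# Assumption: a small set of tags that never have a closing tag
VOID_TAGS = {"br", "img", "meta", "link", "input", "hr"}


class TreePrinter(HTMLParser):
    """Collects one indented line per tag or comment, mirroring the DOM nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

    def handle_comment(self, data):
        # comments are nodes of the tree as well
        self.lines.append("  " * self.depth + "#comment")


html = (
    "<html><head><title>Demo</title></head>"
    "<body><!-- note --><div><p>Hi<br>there</p></div></body></html>"
)

parser = TreePrinter()
parser.feed(html)
tree_lines = parser.lines
print("\n".join(tree_lines))
```

Each level of indentation in the printed result corresponds to one parent-child step in the tree.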

In practice the DOM tree is more complex: each element has its own tag and relates to the others as a parent or a child:

<img src="./pictures/2.png"  
  width="1200"
/>

With such a scheme, it would be easy to extract the necessary data, but real documents are usually messier: child and parent elements can share the same tag (in the example below, the nested `div` is itself a container):

<img src="./pictures/3.png"  
  width="600"
/>

## Class and Id

<b>Class</b> is a global attribute that assigns one or more class names to any element on the page. The class name is then used as a selector in CSS to control the element's styling. Class names are also convenient for finding and manipulating elements on the page.

<b>Id</b> defines the unique identifier of an HTML element. The value of the <b>Id</b> attribute must be unique within the HTML document.
The <b>Id</b> attribute is used to target a specific style declaration in a style sheet. It is also used to access and manipulate the element with that ID.

<img src="./pictures/4.png"  
  width="1400"
/>

In some cases, the DOM tree is generated on the fly (typically by JavaScript), and its content depends on which part of the site has loaded or on navigation.
Such pages are hard to parse with the approach used here, because we rely on a ready-made “snapshot” of the page as returned by the server and parse that.

## Errors from the web application

- 400 - Bad Request. Typically this status signals invalid input, for example when the user submits a malformed email address.
- 401 - Unauthorized. The user tried to access something that requires authentication without providing valid credentials.
- 403 - Forbidden. The difference between this status and 401 is subtle: 403 indicates that the server understood the request but refuses to fulfill it, for example because the user does not have the rights to perform the action.
- 404 - Not Found. This is the best-known error response code. It reports that the requested resource was not found, for example because of an incorrect URL or a deleted or moved page.
- 409 - Conflict. The request conflicts with the current state of the resource. In most cases this points to a version conflict, for example when the user tries to upload a version of a file that is older than the one already stored. It can also signal a violated uniqueness constraint, for example when the user resends an email (clicks the Send button a second time without waiting for the action to complete).
- 500 - Internal Server Error. This status can be described as: “Something went wrong, but we don’t know what exactly.”
- 503 - Service Unavailable. The server is temporarily unable to handle the request, for example because of overload or maintenance; the outage may be planned or unplanned.

## Success and redirect statuses

- 200 OK - the most common response code. It means that the client's request was valid and the server processed it without problems. All pages that should be indexed by search engines should return 200 OK.
- 301 Moved Permanently - indicates a permanent redirect from one URL to another.
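
The codes listed above fall into classes by their first digit. As a small sketch, the standard library's `http.HTTPStatus` enum can be used to look up the official phrase of a code; the helper name `describe_status` and the class labels are assumptions made for this illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the standard phrase and the broad class of an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    return phrase, kind


for code in (200, 301, 400, 401, 403, 404, 409, 500, 503):
    print(code, *describe_status(code))
```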

# New libraries

## requests

The requests library lets us interact with web applications easily and with a minimum of code. We need it whenever information has to travel from the client to the server and back.

Let's import the library and look at its main methods.

In [1]:
import requests

URL - Uniform Resource Locator.

A URL is simply the address of a resource on the Internet. In theory, every valid URL points to a unique resource.
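
As a quick illustration of what a URL is made of, the sketch below splits an address into its parts with the standard `urllib.parse` module. The variable names are chosen for this example only.

```python
from urllib.parse import urlsplit

url = "https://www.ss.lv/lv/real-estate/flats/riga/all/hand_over/page48.html"
parts = urlsplit(url)

print(parts.scheme)  # protocol used to fetch the resource
print(parts.netloc)  # host (domain) that serves it
print(parts.path)    # location of the resource on that host

# the last path segment is the page file name
last_segment = parts.path.rsplit("/", 1)[-1]
print(last_segment)
```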

Knowing what a URL is, let’s load our page and look at the result.

In [3]:
URL_TEMPLATE = "https://www.ss.lv/lv/real-estate/flats/riga/all/hand_over/page48.html"

In [4]:
r = requests.get(URL_TEMPLATE)
print(r.status_code)

200


In [7]:
r.text[:100]

'<!DOCTYPE html>\r\n<HTML><HEAD>\r\n<title>SS.LV Dzīvokļi - Rīga, Cenas. Blakus, caurstaigājama..., Izīrē'

We have downloaded a complete snapshot of the page by its URL. Now we need to extract the data we are interested in; for this we will use the Beautiful Soup library.

## bs4

Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages, which can be used to extract data from HTML, which is useful for scraping web pages.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [8]:
from bs4 import BeautifulSoup as bs

Running the HTML document downloaded with requests through Beautiful Soup gives us a BeautifulSoup object that represents the document as a nested data structure:

In [9]:
soup = bs(r.text, "html.parser")

In [11]:
# here you can see this structure
# soup

<img src="./pictures/5.png"  
  width="1200"
/>

Here are some simple ways to navigate this data structure:

In [12]:
soup.title

<title>SS.LV Dzīvokļi - Rīga, Cenas. Blakus, caurstaigājama..., Izīrē - Visi sludinājumi</title>

In [13]:
soup.title.name

'title'

In [14]:
soup.title.string

'SS.LV Dzīvokļi - Rīga, Cenas. Blakus, caurstaigājama..., Izīrē - Visi sludinājumi'

In [11]:
soup.title.parent.name

'head'

In [15]:
soup.a

<a href="/lv/" title="Sludinājumi"><img alt="Sludinājumi" border="0" class="page_header_logo_ss" src="https://i.ss.lv/img/p.gif"/></a>

In [16]:
soup.find_all('a')[:5]

[<a href="/lv/" title="Sludinājumi"><img alt="Sludinājumi" border="0" class="page_header_logo_ss" src="https://i.ss.lv/img/p.gif"/></a>,
 <a class="a_menu" href="/lv/real-estate/flats/new/" title="Iesniegt Sludinājumu">Iesniegt Sludinājumu</a>,
 <a class="a_menu" href="/lv/login/" title="Mani Sludinājumi">Mani Sludinājumi</a>,
 <a class="a_menu" href="/lv/real-estate/flats/riga/search/" title="Meklēt sludinājumus">Meklēšana</a>,
 <a class="a_menu" href="/lv/favorites/" title="Memo">Memo</a>]

In [17]:
soup.find(id="tr_53431909")

The <b>.find_all</b> method accepts search parameters: tag names, classes, ids, and other tag attributes. In our case, we will search by tag + class.

In [18]:
parsed_data = soup.find_all('td', class_='msga2-o pp6')

Let's look at the result:

In [19]:
parsed_data[:12]

[<td c="1" class="msga2-o pp6" nowrap="">centrs<br/>Bruņinieku 52</td>,
 <td c="1" class="msga2-o pp6" nowrap="">3</td>,
 <td c="1" class="msga2-o pp6" nowrap="">71</td>,
 <td c="1" class="msga2-o pp6" nowrap="">3/5</td>,
 <td c="1" class="msga2-o pp6" nowrap="">P. kara</td>,
 <td c="1" class="msga2-o pp6" nowrap="">360  €/mēn.</td>,
 <td c="1" class="msga2-o pp6" nowrap="">Krasta r-ns<br/>Salacas 16</td>,
 <td c="1" class="msga2-o pp6" nowrap="">1</td>,
 <td c="1" class="msga2-o pp6" nowrap="">50</td>,
 <td c="1" class="msga2-o pp6" nowrap="">10/16</td>,
 <td c="1" class="msga2-o pp6" nowrap="">104.</td>,
 <td c="1" class="msga2-o pp6" nowrap="">39  €/dienā</td>]

Another common task is to extract all the text from a page:

<img src="./pictures/6.png"  
  width="1000"
/>

In [20]:
parsed_data[0].get_text()

'centrsBruņinieku 52'

get_text() removed all tags and their attributes, but the `br` tag (used for line breaks) was removed as well, so our text ended up “stuck together”.

To avoid this, we pass a separator that marks where the text was split and is easy to remove or replace later:

In [21]:
parsed_data[0].get_text("|")

'centrs|Bruņinieku 52'
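
The same behaviour can be reproduced without touching the network. The sketch below runs `find_all` and `get_text` on a tiny hand-written table; the markup is a made-up fragment that only imitates the cells shown above.

```python
from bs4 import BeautifulSoup

# Assumption: a hand-written fragment that imitates the listing cells
html = (
    '<table><tr>'
    '<td class="msga2-o pp6">centrs<br/>Bruņinieku 52</td>'
    '<td class="msga2-o pp6">3</td>'
    '<td class="other">x</td>'
    '</tr></table>'
)

soup = BeautifulSoup(html, "html.parser")

# search by tag name plus the exact value of the class attribute
cells = soup.find_all("td", class_="msga2-o pp6")

glued = cells[0].get_text()         # text pieces are joined with nothing in between
separated = cells[0].get_text("|")  # text pieces are joined with the separator

# the separator makes it easy to split the cell back into its parts
district, street = separated.split("|", 1)
print(glued, separated, district, street, sep="\n")
```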

In [22]:
for i in range(0,7):
    print(parsed_data[i].get_text("|"))

centrs|Bruņinieku 52
3
71
3/5
P. kara
360  €/mēn.
Krasta r-ns|Salacas 16


# Example 

## Variant 1

In [23]:
import pandas as pd
import numpy as np

In [24]:
URL_TEMPLATE = "https://www.ss.lv/lv/real-estate/flats/riga/all/sell/"
r = requests.get(URL_TEMPLATE)
print(r.status_code)

200


In [25]:
soup = bs(r.text, "html.parser")
parsed_data = soup.find_all('td', class_='msga2-o pp6')

In [26]:
parsed_data[:12]

[<td c="1" class="msga2-o pp6" nowrap="">Teika<br/>Kuršu 32</td>,
 <td c="1" class="msga2-o pp6" nowrap="">2</td>,
 <td c="1" class="msga2-o pp6" nowrap="">43</td>,
 <td c="1" class="msga2-o pp6" nowrap="">1/2</td>,
 <td c="1" class="msga2-o pp6" nowrap="">Renov.</td>,
 <td c="1" class="msga2-o pp6" nowrap="">79,600  €</td>,
 <td c="1" class="msga2-o pp6" nowrap="">centrs<br/>Katrīnas d. 24 - K3</td>,
 <td c="1" class="msga2-o pp6" nowrap="">3</td>,
 <td c="1" class="msga2-o pp6" nowrap="">67</td>,
 <td c="1" class="msga2-o pp6" nowrap="">1/2</td>,
 <td c="1" class="msga2-o pp6" nowrap="">Staļina</td>,
 <td c="1" class="msga2-o pp6" nowrap="">98,500  €</td>]

<img src="./pictures/7.png"  
  width="1000"
/>

In [27]:
page_array = []

i = 0
for data in parsed_data:
    page_array.append([i, data.get_text("|")])
    i += 1
    
df_tmp = pd.DataFrame(page_array, columns=['line', 'data'])    

In [29]:
df_tmp.head(10)

Unnamed: 0,line,data
0,0,Teika|Kuršu 32
1,1,2
2,2,43
3,3,1/2
4,4,Renov.
5,5,"79,600 €"
6,6,centrs|Katrīnas d. 24 - K3
7,7,3
8,8,67
9,9,1/2


Here we have a numbered array of cells.

As the figure above shows, all the data (even the tabular data) has been flattened into rows. To fix this, let’s compute a column index; for this we use the remainder of dividing the line number by 6 (the number of columns in the table).

In [30]:
df_tmp['head'] = df_tmp['line']%6

In [31]:
df_tmp.head(10)

Unnamed: 0,line,data,head
0,0,Teika|Kuršu 32,0
1,1,2,1
2,2,43,2
3,3,1/2,3
4,4,Renov.,4
5,5,"79,600 €",5
6,6,centrs|Katrīnas d. 24 - K3,0
7,7,3,1
8,8,67,2
9,9,1/2,3
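
The arithmetic behind the column index can be checked in plain Python. In this sketch `N_COLS` and the short `cells` list are assumptions taken from the output above; `divmod` returns both the integer quotient (the row) and the remainder (the column) in one call.

```python
N_COLS = 6  # assumption: six cells per listing, as in the table above

cells = ["Teika|Kuršu 32", "2", "43", "1/2", "Renov.", "79,600 €",
         "centrs|Katrīnas d. 24 - K3", "3", "67"]

# divmod(line, N_COLS) == (line // N_COLS, line % N_COLS) == (row, column)
positions = [divmod(line, N_COLS) for line in range(len(cells))]

for (row, col), text in zip(positions, cells):
    print(row, col, text)
```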


Great! We now have a column index; all that remains is to add a row index. For this we use the `.cumcount()` method, which numbers the rows within each group starting from 0, much like the SQL window function `row_number() over (partition by column1 order by ...) - 1`.

In [32]:
df_tmp['group'] = df_tmp.groupby('head').cumcount()

In [35]:
df_tmp.head(10)

Unnamed: 0,line,data,head,group
0,0,Teika|Kuršu 32,0,0
1,1,2,1,0
2,2,43,2,0
3,3,1/2,3,0
4,4,Renov.,4,0
5,5,"79,600 €",5,0
6,6,centrs|Katrīnas d. 24 - K3,0,1
7,7,3,1,1
8,8,67,2,1
9,9,1/2,3,1


Let's assemble a readable view of our table:

In [36]:
df = df_tmp.loc[df_tmp['head']==0][['group', 'data']]

In [38]:
df = df.merge(df_tmp.loc[df_tmp['head']==1][['group', 'data']], how='left', on='group', suffixes=('','_rooms'))
df = df.merge(df_tmp.loc[df_tmp['head']==2][['group', 'data']], how='left', on='group', suffixes=('','_m2'))
df = df.merge(df_tmp.loc[df_tmp['head']==3][['group', 'data']], how='left', on='group', suffixes=('','_floor'))
df = df.merge(df_tmp.loc[df_tmp['head']==4][['group', 'data']], how='left', on='group', suffixes=('','_seria'))
df = df.merge(df_tmp.loc[df_tmp['head']==5][['group', 'data']], how='left', on='group', suffixes=('','_price'))

In [39]:
df.head()

Unnamed: 0,group,data,data_rooms,data_m2,data_floor,data_seria,data_price
0,0,Teika|Kuršu 32,2,43,1/2,Renov.,"79,600 €"
1,1,centrs|Katrīnas d. 24 - K3,3,67,1/2,Staļina,"98,500 €"
2,2,centrs|Eksporta 8,4,235,2/2,Jaun.,"380,000| €"
3,3,centrs|Bruņinieku 87,2,57,4/4,P. kara,"119,800 €"
4,4,Bieriņi|Ārlavas 5,2,70,1/1,Priv. m.,"125,000 €"
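
The chain of merges above can also be expressed as a single reshape. The following is a sketch of that alternative using `DataFrame.pivot`, not the approach used in the rest of this notebook; the two sample listings are copied from the output above, and `N_COLS` and `wide` are names introduced for the illustration.

```python
import pandas as pd

cells = ["Teika|Kuršu 32", "2", "43", "1/2", "Renov.", "79,600 €",
         "centrs|Katrīnas d. 24 - K3", "3", "67", "1/2", "Staļina", "98,500 €"]
N_COLS = 6  # assumption: six cells per listing

df_tmp = pd.DataFrame({"line": range(len(cells)), "data": cells})
df_tmp["head"] = df_tmp["line"] % N_COLS              # column index
df_tmp["group"] = df_tmp.groupby("head").cumcount()   # row index within each column

# one row per listing, one column per cell position
wide = df_tmp.pivot(index="group", columns="head", values="data")
wide.columns = ["data", "data_rooms", "data_m2", "data_floor", "data_seria", "data_price"]
wide = wide.reset_index()
print(wide)
```

The result has the same columns as the merged table, so later cleaning steps would not need to change.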


We have extracted our first data from the HTML page, but there is a problem: the site has more pages.

So we will write a function that downloads and processes the data from every page.

### update script

You've probably heard of `DDoS` attacks: a type of attack in which cybercriminals send a continuous stream of requests from many different sources to disrupt a server’s operation.

One of the simplest defenses against such attacks is to limit the number of requests a client can send to the site. Sometimes the limit is 500 requests per second, and sometimes 50 per minute (depending on the server's configuration).

We are not going to stress-test the site; our task is to obtain data without harming its operation. To do this, we will use the standard `time` module, which lets us slow Python down by pausing between requests.
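
One possible way to keep the pause logic in one place is a small throttle object. This is only a sketch under assumptions: the `Throttle` class and its `min_interval` parameter are hypothetical names, and the loop below prints instead of sending real requests.

```python
import time


class Throttle:
    """Makes sure that consecutive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(min_interval=0.2)
for page_num in range(1, 4):
    throttle.wait()  # a real script would call requests.get(...) right after this
    print("would request page", page_num)
```

Unlike a fixed `time.sleep` before every request, this only sleeps for the part of the interval that has not already passed while the previous page was being parsed.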

In [40]:
import time
from tqdm.notebook import tqdm

In [41]:
def load_data(link, time_sleep, page_num):
    
    # time.sleep takes the pause length in seconds
    time.sleep(time_sleep)
    
    # build the URL of the requested page
    # the page URLs differ only in the page number, so looping through them is easy
    # https://www.ss.lv/lv/real-estate/flats/riga/all/hand_over/page48.html
    # https://www.ss.lv/lv/real-estate/flats/riga/all/hand_over/page49.html
    
    link = link + 'page' + str(page_num) + '.html'
    
    # send the request
    r = requests.get(link)
    
    # important: if the returned status is not 200, exit the function early (it then returns None)
    if r.status_code!=200:
        print('Error status', r.status_code)
        return 
    
    # parse our data, all previously performed operations
    soup = bs(r.text, "html.parser")
    parsed_data = soup.find_all('td', class_='msga2-o pp6')
    
    page_array = []

    i = 0
    for data in parsed_data:
        page_array.append([i, data.get_text("|")])
        i += 1

    df_tmp = pd.DataFrame(page_array, columns=['line', 'data'])
    df_tmp['head'] = df_tmp['line']%6
    df_tmp['group'] = df_tmp.groupby('head').cumcount()
    
    
    # collect our data into a single dataframe
    df_page = df_tmp.loc[df_tmp['head']==0][['group', 'data']]       
    df_page = df_page.merge(df_tmp.loc[df_tmp['head']==1][['group', 'data']], how='left', on='group', suffixes=('','_rooms'))
    df_page = df_page.merge(df_tmp.loc[df_tmp['head']==2][['group', 'data']], how='left', on='group', suffixes=('','_m2'))
    df_page = df_page.merge(df_tmp.loc[df_tmp['head']==3][['group', 'data']], how='left', on='group', suffixes=('','_floor'))
    df_page = df_page.merge(df_tmp.loc[df_tmp['head']==4][['group', 'data']], how='left', on='group', suffixes=('','_seria'))
    df_page = df_page.merge(df_tmp.loc[df_tmp['head']==5][['group', 'data']], how='left', on='group', suffixes=('','_price'))
    
    return df_page

In [42]:
# set the start URL
URL_TEMPLATE = "https://www.ss.lv/lv/real-estate/flats/riga/all/sell/"

In [43]:
# do a test run with these parameters:
# our URL
# a waiting time of 0.5 seconds between requests
# the first page

df = load_data(URL_TEMPLATE, 0.5, 1)

In [44]:
# the data looks fine; repeat the operation for the remaining pages
df.head()

Unnamed: 0,group,data,data_rooms,data_m2,data_floor,data_seria,data_price
0,0,Ķengarags|Aglonas 29,2,45,5/5,Hrušč.,"46,500 €"
1,1,Maskavas priekšpilsēta|Lomonosova 12,2,46,4/5,Staļina,"43,800| €"
2,2,centrs|Stabu 61,4,117,3/5,Renov.,"159,400 €"
3,3,centrs|Dainas 1,3,59,4/4,P. kara,"106,000 €"
4,4,Teika|Kuršu 32,2,43,1/2,Renov.,"79,600 €"


In [45]:
# note: range() excludes its upper bound, so this loads pages 2 through 51
for i in tqdm(range(2,52)):
    df  = pd.concat([df, load_data(URL_TEMPLATE, 0.5, i)])

  0%|          | 0/50 [00:00<?, ?it/s]

In [46]:
# number of sale listings loaded (1400+ rows)
len(df)

1530

Now let's clean up the results.

In [47]:
df.head()

Unnamed: 0,group,data,data_rooms,data_m2,data_floor,data_seria,data_price
0,0,Ķengarags|Aglonas 29,2,45,5/5,Hrušč.,"46,500 €"
1,1,Maskavas priekšpilsēta|Lomonosova 12,2,46,4/5,Staļina,"43,800| €"
2,2,centrs|Stabu 61,4,117,3/5,Renov.,"159,400 €"
3,3,centrs|Dainas 1,3,59,4/4,P. kara,"106,000 €"
4,4,Teika|Kuršu 32,2,43,1/2,Renov.,"79,600 €"


In [48]:
# clean the data in data_price: remove ' €', '|' and ','
df['data_price'] = df['data_price'].apply(lambda x: x.replace(' €','').replace('|','').replace(',',''))
df['data_price'] = df['data_price'].astype('int64')

In [49]:
# divide district and street into two columns
df[['data_district', 'data_street']] = df['data'].str.split(pat='|', n=1 , expand=True )

In [51]:
# divide floor and maximum floor into two columns
df[['data_cur_floor', 'data_max_floor']] = df['data_floor'].str.split(pat='/', n=1 , expand=True )

# df['data_cur_floor'] = df['data_cur_floor'].astype('int64')
# df['data_max_floor'] = df['data_max_floor'].astype('int64')

In [52]:
df['data_cur_floor'].unique()

array(['5', '4', '3', '1', '2', '8', '7', '10', '11', '9', '12', '6',
       '15', '17', '16', '14', '1.00', '23', '19'], dtype=object)

In [53]:
df['data_max_floor'].unique()

array(['5', '4', '2', '1', '9', '3', '12', '16', '10', '7', '6', '25',
       '18', '8', '23', '22', '13', '24', '11', '15', '14'], dtype=object)

In [54]:
df['data_cur_floor'] = np.where(df['data_cur_floor']=='1.00', 1,df['data_cur_floor'])

In [55]:
df['data_cur_floor'] = df['data_cur_floor'].astype('int64')
df['data_max_floor'] = df['data_max_floor'].astype('int64')

In [57]:
df['data_rooms'].unique()

array(['2', '4', '3', '1', '6', '5', 'Citi'], dtype=object)

In [58]:
# df['data_rooms'] = df['data_rooms'].astype('int64')

In [59]:
df = df[['data_district','data_street','data_rooms','data_cur_floor','data_max_floor','data_m2','data_seria','data_price']]

In [60]:
# our data
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1530 entries, 0 to 29
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   data_district   1530 non-null   object
 1   data_street     1530 non-null   object
 2   data_rooms      1530 non-null   object
 3   data_cur_floor  1530 non-null   int64 
 4   data_max_floor  1530 non-null   int64 
 5   data_m2         1530 non-null   object
 6   data_seria      1530 non-null   object
 7   data_price      1530 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 107.6+ KB


In [61]:
df.head()

Unnamed: 0,data_district,data_street,data_rooms,data_cur_floor,data_max_floor,data_m2,data_seria,data_price
0,Ķengarags,Aglonas 29,2,5,5,45,Hrušč.,46500
1,Maskavas priekšpilsēta,Lomonosova 12,2,4,5,46,Staļina,43800
2,centrs,Stabu 61,4,3,5,117,Renov.,159400
3,centrs,Dainas 1,3,4,4,59,P. kara,106000
4,Teika,Kuršu 32,2,1,2,43,Renov.,79600


## Variant 2

If you open each advertisement separately, you can find additional information. When building machine learning models, any extra information can be useful, so let’s try to get it.

<img src="./pictures/8.png"  
  width="800"
/>

First of all, look at the URL that leads to this page and compare it with a couple of others:

- https://www.ss.lv/msg/ru/real-estate/flats/riga/centre/eixne.html
- https://www.ss.lv/msg/ru/real-estate/flats/riga/imanta/iddkc.html
- https://www.ss.lv/msg/ru/real-estate/flats/riga/mezhapark/idogx.html

The pages we see are not numbered, and each district has its own directory. There is no point in enumerating every combination; we will take all the links directly.

In [62]:
URL_TEMPLATE = "https://www.ss.lv/lv/real-estate/flats/riga/all/hand_over/"

In [63]:
r = requests.get(URL_TEMPLATE)
print(r.status_code)

200


This time we will look for links; they have the tag `a` and the class `am`.

In [64]:
soup = bs(r.text, "html.parser")
parsed_data = soup.find_all('a', class_='am')

In [65]:
parsed_data[:2]

[<a class="am" data="JUFCJTlBJUE0JTlCJThCJTg0JUJFZiU4OCVCMCU5NyVBOCU5RiU4RCU3RXplJThCJUE5JTlEJUE0JTlDJTg0JTdCcg==|xdtkTKB5R" href="/msg/lv/real-estate/flats/riga/imanta/gloeh.html" id="dm_53855608">Saimnieks izīrē plašu, saulainu, energoefektīvu dzīvokli ar ideā</a>,
 <a class="am" data="bndneSU4QSVDRSVBQSU3RG92aHolOEIlODYlQUUlN0RqJTdCY3IlOEMlODIlQTl3|6D1BZRyG" href="/msg/lv/real-estate/flats/riga/agenskalns/eedmc.html" id="dm_53877939"><b>Īpašnieks izīrē mājīgu vienistabas dzīvokli īrniekam bez kaitīgi</b></a>]

The data we are interested in is stored in the `href` attribute, which holds the target of the link.

In [66]:
parsed_data[0].get('href')

'/msg/lv/real-estate/flats/riga/imanta/gloeh.html'

It's time to update our function and download all the links.

In [67]:
def get_link(link, time_sleep, page_num):
    
    time.sleep(time_sleep)
    link = link + 'page' + str(page_num) + '.html'
    r = requests.get(link)
    
    if r.status_code!=200:
        # return an empty list so the caller can still concatenate the result
        return []
    
    soup = bs(r.text, "html.parser")
    parsed_data = soup.find_all('a', class_='am')
    
    pars_links = []
    
    for data in parsed_data:
        pars_links.append(data.get('href'))
        
    return pars_links

In [68]:
link_array = []
URL_TEMPLATE = "https://www.ss.lv/lv/real-estate/flats/riga/all/hand_over/"

In [69]:
for i in tqdm(range(1,10)):
    link_array = link_array + get_link(URL_TEMPLATE, 1, i)

  0%|          | 0/9 [00:00<?, ?it/s]

In [70]:
link_array[:10]

['/msg/lv/real-estate/flats/riga/imanta/gloeh.html',
 '/msg/lv/real-estate/flats/riga/agenskalns/eedmc.html',
 '/msg/lv/real-estate/flats/riga/centre/gxglp.html',
 '/msg/lv/real-estate/flats/riga/centre/ajghx.html',
 '/msg/lv/real-estate/flats/riga/centre/adoxo.html',
 '/msg/lv/real-estate/flats/riga/vecriga/aejhb.html',
 '/msg/lv/real-estate/flats/riga/centre/dlcnh.html',
 '/msg/lv/real-estate/flats/riga/chiekurkalns/chigj.html',
 '/msg/lv/real-estate/flats/riga/kengarags/fbgmd.html',
 '/msg/lv/real-estate/flats/riga/centre/anmck.html']

Now it remains to repeat the operation above for every link. Let's see what the parser for a single advertisement page looks like.
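
The links collected above are relative (they start with `/msg/...`), so they have to be combined with the site's domain before they can be requested. Besides plain string concatenation, the standard `urllib.parse.urljoin` function can do this; the names `BASE_URL` and `full_url` below are chosen for this sketch only.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.ss.lv"
relative_link = "/msg/lv/real-estate/flats/riga/imanta/gloeh.html"

# urljoin resolves the relative link against the base address
full_url = urljoin(BASE_URL, relative_link)
print(full_url)
```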

In [71]:
# note that only the domain is left of our URL; the rest comes from the saved links
URL_TEMPLATE = "https://www.ss.lv"

In [72]:
URL_TEMPLATE += link_array[0]

In [73]:
r = requests.get(URL_TEMPLATE)
print('Error status', r.status_code)

Error status 200


In [74]:
soup = bs(r.text, "html.parser")
# parsed_data = soup.find_all('td', class_='msga2-o pp6')

In [75]:
parsed_data = soup.find_all('td', class_='ads_opt')
parsed_data

[<td class="ads_opt" id="tdo_20" nowrap=""><b>Rīga</b></td>,
 <td class="ads_opt" id="tdo_856" nowrap=""><b>Imanta</b></td>,
 <td class="ads_opt" id="tdo_11" nowrap=""><b>Kurzemes pr. 14</b> <span class="td15">[<a class="ads_opt_link_map" href="javascript:;" id="mnu_map" onclick="mnu('map',1,1,'/lv/gmap/fTgTeF4QAzt4FD4eFFM=.html?mode=1&amp;c=56.9568932, 24.0336084, 14');return false;">Karte</a>]</span></td>,
 <td class="ads_opt" id="tdo_1" nowrap="">2</td>,
 <td class="ads_opt" id="tdo_3" nowrap="">49 m²</td>,
 <td class="ads_opt" id="tdo_4" nowrap="">2/5</td>,
 <td class="ads_opt" id="tdo_6" nowrap="">LT proj.</td>,
 <td class="ads_opt" id="tdo_2" nowrap="">Mūra</td>,
 <td class="ads_opt" id="tdo_1734" nowrap="">Balkons</td>]

In [76]:
# print the text of every cell
for i in parsed_data:
    print(i.get_text("|"))

Rīga
Imanta
Kurzemes pr. 14| |[|Karte|]
2
49 m²
2/5
LT proj.
Mūra
Balkons


In [77]:
# now let's try to get the map link
parsed_map = soup.find_all('a', class_='ads_opt_link_map')
print(parsed_map[0]['onclick'])

mnu('map',1,1,'/lv/gmap/fTgTeF4QAzt4FD4eFFM=.html?mode=1&c=56.9568932, 24.0336084, 14');return false;


In [78]:
# and then we’ll get the description of the apartment
parsed_text = soup.find_all('div', id='msg_div_msg')
parsed_text[0].get_text(" | ")

# note that the container with id=msg_div_msg also contains the attribute cells we have already extracted:
# because of nesting, our td tags are located inside this container

'\n | \r\n\r\nSaimnieks izīrē plašu, saulainu, energoefektīvu dzīvokli ar ideālu lokāciju ilgtermiņā.  | \r\nJūsu ērtībai koridors ar koridora skapi, viesistaba ar dīvānu, guļamistaba ar balkonu un ietilpīgu sienas skapi, atsevišķs WC, vannas istaba ar dušu un veļas mašīnu, virtuve ar virtuves iekārtu un visu nepieciešamo sadzīves tehniku- trauku mašīna, gāzes plīts ar cepeškrāsni, tvaika nosūcējs.  | \r\nDzīvoklis ir pēc remonta. Sadzīves tehnika un santehnikas iekārtas jaunas.  | \r\nĒka, kurā atrodas dzīvoklis ir renovēta, ieguvusi balvu, kā energoefektīvākā māja Latvijā, dzīvoklī ir siltuma alokatori- mazi apkures rēķini. Par apkuri tiek maksāts tikai pēc patērētā.  | \r\nApsaimnieko biedrība, mazi komunālie rēķini.  | \r\nDomofons, bezmaksas stāvvieta pie mājas, metāla ārdurvis, koda atslēga, kopta tīra kāpņu telpa. Balkons ar skatu uz skaistu, zaļu iekšpagalmu.  | \r\nBlakus sabiedriskais transports un visa nepieciešamā infrastruktūra.  | \r\nOficiāls līgums ar iespēju deklarētie

In [79]:
# and what about the price
parsed_price = soup.find_all('td', class_='ads_price')
parsed_price[0].get_text()

'360 €/mēn. (7.35 €/m²)'

### update script

In [291]:
# As usual, anything that is executed more than once goes into a function.
# Let's write a function that takes a link and returns all the data as a list.

In [80]:
def get_data_link(url, time_sleep):
    
    page_array = []
    time.sleep(time_sleep)
    
    # prepend the site domain to the relative link
    link = "https://www.ss.lv"
    link += url
    
    r = requests.get(link)
    if r.status_code!=200:
        return 
    
    soup = bs(r.text, "html.parser")
        
    # data
    parsed_data = soup.find_all('td', class_='ads_opt')   
    # coordinates
    parsed_map = soup.find_all('a', class_='ads_opt_link_map')   
        
    # price
    parsed_price = soup.find_all('td', class_='ads_price')    
    # description 
    parsed_text = soup.find_all('div', id='msg_div_msg')
    
    
    for data in parsed_data:
        page_array.append(data.get_text("|"))

    if len(parsed_map)==1:
        page_array.append(parsed_map[0]['onclick'])
    else:
        page_array.append('')
    
    page_array.append(parsed_price[0].get_text())       
    page_array.append(parsed_text[0].get_text(" | "))
    
    return page_array

In [24]:
# data_array = []

# data_array.append(get_data_link(link_array[447], 1))

In [25]:
# data_array

In [81]:
# write down all the data
data_array = []

for links in tqdm(link_array):
    # one request per collected link (270 here); that is quite a lot, and sometimes it is better to increase the waiting time than to get banned
    # in our case, downloading + parsing (with the pause) takes ~1.8 seconds per link, which is usually enough
    # in my experience, limits are on average set at up to 500 requests per minute, or about 8.3 per second
    data_array.append(get_data_link(links, 0.25))

  0%|          | 0/270 [00:00<?, ?it/s]

In [106]:
# data_array[10]

In [108]:
# only some rows contain the optional facilities field and/or the "Kadastra numurs" field;
# pad the shorter rows with empty strings at position 8 so that every row has 13 values for the df
data_array_upd = []

for i in data_array:
    if len(i)==11:
        i.insert(8, '')
    if len(i)==12:
        i.insert(8, '')
    data_array_upd.append(i)

data_array_upd_1 = []
for i in data_array_upd:
    if len(i)==12:
        i.insert(8, '')
    data_array_upd_1.append(i)
    
    

In [110]:
data_array_upd = data_array_upd_1

In [112]:
df = pd.DataFrame(data_array_upd, columns=['city', 'district','street','rooms','area','floor','seria','house_type','kadastr_numb','facilities', 'map','price','all_data'])

In [113]:
df.sample(2)

Unnamed: 0,city,district,street,rooms,area,floor,seria,house_type,kadastr_numb,facilities,map,price,all_data
104,Rīga,centrs,Elizabetes 10b| |[|Karte|],2,92 m²,3/5/lifts,P. kara,Mūra,,Balkons,"mnu('map',1,1,'/lv/gmap/fTgTeF4QAzt4FD4eFFM=.h...",785 €/mēn. (8.53 €/m²),\n | \r\n\r\nKlusais centrs jau izsenis pazīst...
59,Rīga,Purvciems,Ūnijas 76a| |[|Karte|],2,38 m²,1/5,LT proj.,Paneļu,,Parkošanas vieta,"mnu('map',1,1,'/lv/gmap/fTgTeF4QAzt4FD4eFFM=.h...",300 €/mēn. (7.89 €/m²),\n | \r\n\r\nIlgtermiņā tiek izīrēts. | \n | ...


In [140]:
# let's start cleaning the data

In [114]:
df['city'].unique()

array(['Rīga'], dtype=object)

In [115]:
df['district'].unique()

array(['Imanta', 'Āgenskalns', 'centrs', 'Vecrīga', 'Čiekurkalns',
       'Ķengarags', 'Ziepniekkalns', 'Pļavnieki', 'Krasta r-ns',
       'Šampēteris-Pleskodāle', 'Zolitūde', 'Iļģuciems',
       'Maskavas priekšpilsēta', 'Dzegužkalns', 'Teika', 'Purvciems',
       'Vecmīlgrāvis', 'Jugla', 'Mežciems', 'Mežaparks', 'Sarkandaugava',
       'Dārzciems', 'Mangaļi', 'Zasulauks', 'Vecāķi', 'Bolderāja',
       'Torņakalns', 'Bieriņi', 'Klīversala', 'Jaunciems', 'Vecdaugava'],
      dtype=object)

In [116]:
df[['data_street', 'map_link']] = df['street'].str.split(pat='|', n=1 , expand=True )

In [117]:
df = df.drop(['city','map_link','street','kadastr_numb'], axis=1)

In [119]:
df.head(2)

Unnamed: 0,district,rooms,area,floor,seria,house_type,facilities,map,price,all_data,data_street
0,Imanta,2,49 m²,2/5,LT proj.,Mūra,Balkons,"mnu('map',1,1,'/lv/gmap/fTgTeF4QAzt4FD4eFFM=.h...",360 €/mēn. (7.35 €/m²),"\n | \r\n\r\nSaimnieks izīrē plašu, saulainu, ...",Kurzemes pr. 14
1,Āgenskalns,1,22 m²,1/2,P. kara,Koka,,"mnu('map',1,1,'/lv/gmap/fTgTeF4QAzt4FD4eFFM=.h...",200 €/mēn. (9.09 €/m²),\n | \r\n\r\nĪpašnieks izīrē mājīgu vienistaba...,Strazdu 8


In [120]:
df[['cur_floor', 'max_floor']] = df['floor'].str.split(pat='/', n=1 , expand=True )

In [121]:
df = df.drop(['floor'], axis=1)

In [122]:
def get_cord(row):
    
    #mnu('map',1,1,'/lv/gmap/fTgTeF4QAzt4FD4eFFM=.html?mode=1&c=56.9568932, 24.0336084, 14');return false;
    
    # looking for a starting point
    point_start = row['map'].find('c=') + 2
    
    first_coma = row['map'][point_start:].find(',') + 1
    second_coma = row['map'][point_start+first_coma:].find(',')
    
    cord = row['map'][point_start:point_start+first_coma+second_coma]
    
    return cord    
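
An alternative to counting commas by hand is a regular expression. The sketch below is a hedged variant, not a drop-in replacement for `get_cord`: the pattern `COORD_RE` and the helper `extract_coords` are hypothetical names, and the pattern assumes the `c=<lat>, <lon>` layout seen in the `onclick` string printed earlier.

```python
import re

# Assumption: coordinates follow "c=" as two decimal numbers separated by a comma
COORD_RE = re.compile(r"c=(-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?)")


def extract_coords(onclick):
    """Return (latitude, longitude) as floats, or None if no coordinates are found."""
    match = COORD_RE.search(onclick)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


sample = ("mnu('map',1,1,'/lv/gmap/fTgTeF4QAzt4FD4eFFM=.html"
          "?mode=1&c=56.9568932, 24.0336084, 14');return false;")
coords = extract_coords(sample)
print(coords)
print(extract_coords(""))  # rows without a map link yield None
```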

In [123]:
df['cord_map'] = df.apply(get_cord, axis=1)

In [124]:
df[['len', 'lon']] = df['cord_map'].str.split(pat=',', n=1 , expand=True )
df = df.drop(['cord_map'], axis=1)
df = df.drop(['map'], axis=1)

In [125]:
df.head(2)

Unnamed: 0,district,rooms,area,seria,house_type,facilities,price,all_data,data_street,cur_floor,max_floor,len,lon
0,Imanta,2,49 m²,LT proj.,Mūra,Balkons,360 €/mēn. (7.35 €/m²),"\n | \r\n\r\nSaimnieks izīrē plašu, saulainu, ...",Kurzemes pr. 14,2,5,56.9568932,24.0336084
1,Āgenskalns,1,22 m²,P. kara,Koka,,200 €/mēn. (9.09 €/m²),\n | \r\n\r\nĪpašnieks izīrē mājīgu vienistaba...,Strazdu 8,1,2,56.9389931,24.0778974


In [126]:
df['area'] = df['area'].apply(lambda x: x.replace(' m²',''))

In [127]:
df[['price_eur', 'else_price']] = df['price'].str.split(pat='(', n=1 , expand=True )

In [128]:
df = df.drop(['price','else_price'], axis=1)

In [129]:
df['max_floor'].unique()

array(['5', '2', '7/lifts', '23', '4', '6/lifts', '6', '5/lifts',
       '9/lifts', '1', '3', '11/lifts', '10/lifts', '9', '30/lifts',
       '23/lifts', '4/lifts', '8', '7', '12/lifts', '8/lifts', '24/lifts',
       '16/lifts', '12'], dtype=object)

In [130]:
df[['total_floor', 'lift']] = df['max_floor'].str.split(pat='/', n=1 , expand=True )
df = df.drop(['max_floor'], axis=1)
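
The floor field mixes up to three pieces of information: current floor, total floors, and an optional lift flag. As a sketch, the same split can be written as one small function; `parse_floor` is a hypothetical helper, and it assumes only the formats seen in the outputs above (`2/5`, `3/5/lifts`, `1.00/2`).

```python
def parse_floor(value):
    """Split a floor string into (current_floor, total_floors, has_lift)."""
    parts = value.split("/")
    current = int(float(parts[0]))  # float() first, to tolerate values like '1.00'
    total = int(parts[1])
    has_lift = len(parts) > 2 and parts[2] == "lifts"
    return current, total, has_lift


for raw in ("2/5", "3/5/lifts", "1.00/2"):
    print(raw, parse_floor(raw))
```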

In [131]:
df[['price', 'currency']] = df['price_eur'].str.split(pat=' €/', n=1 , expand=True )

In [132]:
df = df.drop(['price_eur'], axis=1)

In [133]:
df['price'] = df['price'].apply(lambda x: x.replace(' ',''))

In [134]:
df['lon'] = df['lon'].fillna('-1')
df['len'] = df['len'].fillna('-1')

df.loc[df['len']=='', 'len'] = '-1'
df.loc[df['lon']=='', 'lon'] = '-1'

In [135]:
df['len'] = df['len'].apply(lambda x: x.replace(' ',''))
df['lon'] = df['lon'].apply(lambda x: x.replace(' ',''))

In [136]:
df = df[['district','data_street','rooms','area','price','cur_floor','total_floor', 'lift', 'seria','house_type','facilities','len','lon','all_data']]

In [138]:
# df['rooms'] = df['rooms'].astype('int64')
df['area'] = df['area'].astype('float64')
df['price'] = df['price'].astype('int64')
df['cur_floor'] = df['cur_floor'].astype('int64')
df['total_floor'] = df['total_floor'].astype('int64')
df['len'] = df['len'].astype('float64')
df['lon'] = df['lon'].astype('float64')

In [139]:
df_all_data = df['all_data']

In [140]:
df = df.drop(['all_data'], axis=1)

In [141]:
df['facilities'].unique()

array(['Balkons', '', 'Parkošanas vieta', 'Balkons, Parkošanas vieta',
       'Terase, Parkošanas vieta', 'Lodžija, Parkošanas vieta',
       '01000300083001', 'Lodžija', 'Pirts, Parkošanas vieta',
       'Balkons, Terase, Parkošanas vieta', 'Terase', '01009264161',
       'Balkons, Lodžija', '01000200110', '010092439718', '01000050035',
       'Balkons, Lodžija, Parkošanas vieta'], dtype=object)

In [142]:
arr_facilities = ['Terase', 'Terase, Parkošanas vieta', 'Lodžija',
       'Lodžija, Parkošanas vieta', 'Parkošanas vieta', 'Balkons',
       'Balkons, Lodžija, Terase',
       'Pirts, Parkošanas vieta', 'Balkons, Parkošanas vieta',
       'Balkons, Lodžija', 'Balkons, Lodžija, Parkošanas vieta',
       'Balkons, Lodžija, Terase, Parkošanas vieta',
       'Balkons, Terase, Parkošanas vieta',
       'Lodžija, Terase, Parkošanas vieta', 'Lodžija, Terase',
       'Terase, Pirts']

In [143]:
df['lift'] = np.where(df['lift']=='lifts',1,0)

In [144]:
df['facilities'] = np.where(df['facilities'].isin(arr_facilities),df['facilities'],'')

In [145]:
df.head()

Unnamed: 0,district,data_street,rooms,area,price,cur_floor,total_floor,lift,seria,house_type,facilities,len,lon
0,Imanta,Kurzemes pr. 14,2,49.0,360,2,5,0,LT proj.,Mūra,Balkons,56.956893,24.033608
1,Āgenskalns,Strazdu 8,1,22.0,200,1,2,0,P. kara,Koka,,56.938993,24.077897
2,centrs,Strēlnieku 7,3,84.3,1300,4,7,1,Jaun.,Ķieģeļu-paneļu,Parkošanas vieta,56.95962,24.107779
3,centrs,J. Daliņa 8,3,87.0,1300,4,23,0,Jaun.,Paneļu,"Balkons, Parkošanas vieta",56.97037,24.12858
4,centrs,Eksporta 8,4,260.0,3000,2,2,0,Jaun.,Mūra,"Terase, Parkošanas vieta",56.965,24.098444
