# Web Scraping

Importing the necessary libraries

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

We scrolled down the page with a Chrome extension until all the data we wanted had loaded.
After that, we saved the page locally and scraped the saved copy.

We did this to avoid wasting hours on repeated runs: it took us more than 6 hours of scrolling to load the data we wanted.

The link to the original website: https://www.nadlan.gov.il/.

We also uploaded the saved copy of the website, taken after we finished scrolling down.

In [2]:
path = 'C:/Users/matan/Data Science Project/check/website/website.html'

In [3]:
# opening the saved page as UTF-8 so that its text is decoded correctly
webSite = open(path, encoding='utf8')

In [4]:
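# parsing the page with BeautifulSoup; the parser is named explicitly
# so the result does not depend on which parsers happen to be installed
bs = BeautifulSoup(webSite.read(), 'html.parser')
webSite.close()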

In [5]:
# by exploring the page in DevTools, we saw that all of the relevant data is inside div elements with the class "tableCol"
data = bs.find_all('div', {'class': "tableCol"})

By checking how the data looks, we saw that the relevant data is in the text itself.

That means we have to iterate through all of the tags we found and placed in 'data',
and take only the text from each one of them.

In [6]:
# map applies the function to every tag, and list collects the results into a new list.
# strip removes the whitespace around the text of each tag.
text_data = list(map(lambda d: d.text.strip(), data))
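The extraction step can be tried on a tiny page first. The sketch below uses made-up HTML that only mimics the "tableCol" structure described above; the real page is far larger and its surrounding markup is an assumption here.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure of the saved page (not real data).
html = """
<div class="tableRow">
  <div class="tableCol"> 20.12.2021 </div>
  <div class="tableCol"> Example St 12 </div>
  <div class="other">ignored</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Same lookup as in the notebook: every div whose class is "tableCol".
cells = soup.find_all("div", {"class": "tableCol"})

# Keep only the text of each tag, without the surrounding whitespace.
texts = [cell.text.strip() for cell in cells]
print(texts)
```

The div with the class "other" is skipped, so only the two "tableCol" values end up in `texts`.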

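For now we have one big flat list (text_data) that contains all of the data.
Each row of the site's table occupies 10 consecutive entries in that list, which is why the loops below step through it in multiples of 10.

To create a pandas DataFrame, we first build a separate list for each column.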
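As a minimal sketch of that layout, the flat list can be cut into rows of 10 before any column is parsed. The name `chunk_rows` and the sample values are hypothetical; the row width of 10 is taken from the `i*10` indexing used in this notebook.

```python
FIELDS_PER_ROW = 10  # assumption: taken from the i*10 indexing in the notebook


def chunk_rows(flat, size=FIELDS_PER_ROW):
    """Split a flat list into consecutive rows of `size` items.

    A trailing partial row (fewer than `size` items) is dropped,
    which matches looping over int(len(flat) / size) rows.
    """
    full_rows = len(flat) // size
    return [flat[i * size:(i + 1) * size] for i in range(full_rows)]


# Made-up data: 23 cells give 2 full rows and 3 leftover cells.
sample = [f"cell{i}" for i in range(23)]
rows = chunk_rows(sample)
print(len(rows), rows[1][0])
```

With rows in hand, `row[0]` is the date, `row[1]` the address, and so on, instead of computing `i*10+k` by hand.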

In [7]:
year_sold = []
for i in range(int(len(text_data)/10)):
    index = i*10
    year_ind = text_data[index].rfind(".")
    year = text_data[index][year_ind+1:]
    year_int = int(year)
    year_sold.append(year_int)

In [8]:
month_sold = []
for i in range(int(len(text_data)/10)):
    index = i*10
    first_ind = text_data[index].find(".")
    last_ind = text_data[index].rfind(".")
    month = text_data[index][first_ind+1:last_ind]
    month_int = int(month)
    month_sold.append(month_int)

In [9]:
day_sold = []
for i in range(int(len(text_data)/10)):
    index = i*10
    ind = text_data[index].find(".")
    day = text_data[index][:ind]
    day_int = int(day)
    day_sold.append(day_int)
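The three cells above each parse one part of the same date string. A possible one-pass alternative, assuming every date looks like `day.month.year` with exactly two dots (as the `find`/`rfind` logic implies), is sketched below; `parse_sale_date` is a hypothetical helper name.

```python
def parse_sale_date(text):
    """Split a 'day.month.year' string into three integers."""
    day, month, year = (int(part) for part in text.split("."))
    return day, month, year


# Made-up sample in the same format as the first column of each row.
day, month, year = parse_sale_date("20.12.2021")
print(day, month, year)
```

A string with a different number of dots raises a `ValueError` here, which makes malformed dates visible instead of silently producing a wrong slice.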

In [10]:
street = []
for i in range (int(len(text_data)/10)):
    index = text_data[i*10+1].rfind(" ")
    street_name = text_data[i*10+1][0:index]
    street.append(street_name)
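One edge case of the slicing above is worth knowing: when an address has no space, `rfind(" ")` returns -1 and the slice drops the last character of the street name. The sketch below shows that effect and a more defensive variant. It is only an illustration: the helper name and the sample addresses are made up, and the rule "drop the last token only if it is all digits" is an assumption that would miss house numbers containing letters.

```python
def street_without_number(address):
    """Return the address without a trailing all-digit house number."""
    head, sep, tail = address.rpartition(" ")
    if sep and tail.isdigit():
        return head
    return address


# Slicing as in the notebook, applied to an address that has no number.
no_number = "Nirim"
sliced = no_number[0:no_number.rfind(" ")]  # rfind returns -1 here

print(sliced)
print(street_without_number("Nirim"))
print(street_without_number("Yehuda Halevi 7"))
```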

In [11]:
kind = []
for i in range (int(len(text_data)/10)):
    kind.append(text_data[i*10+3])

In [12]:
num_of_rooms = []
for i in range (int(len(text_data)/10)):
    # cast the number of rooms to float; empty cells stay '' and are dropped later
    rooms = text_data[i*10+4]
    num_of_rooms.append(float(rooms) if rooms != '' else '')
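The same "number or empty" conversion is needed again for the area column, so it could live in one small helper. This is a sketch with a hypothetical name, `to_float_or_blank`. Unlike the cell above, it also turns non-numeric text into `''` instead of raising an error, which is a deliberate change and may not be wanted.

```python
def to_float_or_blank(value):
    """Convert a string to float, or return '' when it is not a number."""
    try:
        return float(value)
    except ValueError:
        return ''


# Made-up sample cells: a number, an empty cell, and unexpected text.
samples = ["4", "", "n/a"]
converted = [to_float_or_blank(s) for s in samples]
print(converted)
```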

In [13]:
# Hebrew floor names; the position in the list is the floor number (0 = ground floor).
FLOOR_NAMES = [
    'קרקע', 'ראשונה', 'שניה', 'שלישית', 'רביעית', 'חמישית', 'שישית',
    'שביעית', 'שמינית', 'תשיעית', 'עשירית', 'אחת עשרה', 'שתים עשרה',
    'שלוש עשרה', 'ארבע עשרה', 'חמש עשרה', 'שש עשרה', 'שבע עשרה',
    'שמונה עשרה', 'תשע עשרה', 'עשרים', 'עשרים ואחת', 'עשרים ושתים',
    'עשרים ושלוש', 'עשרים וארבע', 'עשרים וחמש', 'עשרים ושש',
]

# Map both the floor name and the 'קומה N' form to the floor number.
FLOOR_LOOKUP = {name: n for n, name in enumerate(FLOOR_NAMES)}
FLOOR_LOOKUP.update({f'קומה {n}': n for n in range(len(FLOOR_NAMES))})
FLOOR_LOOKUP['עשרים ושתיים'] = 22  # alternative spelling of floor 22

def floor_str_to_int(floor_str):
    # Unknown floor descriptions become '' so the row is dropped later with the other empty values.
    return FLOOR_LOOKUP.get(floor_str, '')

In [14]:
floor = []
for i in range (int(len(text_data)/10)):
    fl = text_data[i*10+5]
    if len(fl) == 8 and fl[-2] >= '0' and fl[-2] <= '9':
        fl = fl[0:5]+fl[6:7]
    elif len(fl) == 9 and fl[-2] >= '0' and fl[-2] <= '9':
        fl = fl[0:5]+fl[6:8]
    int_floor = floor_str_to_int(fl)
    floor.append(int_floor)

In [15]:
squared = []
for i in range (int(len(text_data)/10)):
    # cast the area to float; empty cells stay '' and are dropped later
    area = text_data[i*10+6]
    squared.append(float(area) if area != '' else '')

In [16]:
price = []
for i in range (int(len(text_data)/10)):
    str_price = text_data[i*10+7].replace(',','')
    int_price = int(str_price)
    price.append(int_price)

We preferred not to use the 'מגמת שינוי' (change trend) and 'סוג נכס' (property type) data at all, because when we ran our models with them, the predictions got worse.

In [17]:
df_dict = {"Year sold":year_sold, "Month sold":month_sold, "Day sold":day_sold, "Street":street, "Kind":kind, "Number of rooms":num_of_rooms, "Floor":floor, "Squared meter":squared, "Price":price}

## Creating the data frame

In [18]:
df = pd.DataFrame(df_dict)

In [19]:
df.head()

| | Year sold | Month sold | Day sold | Street | Kind | Number of rooms | Floor | Squared meter | Price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021 | 12 | 20 | נירים | דירה בבית קומות | 6.0 | 6 | 134.6 | 3670000 |
| 1 | 2021 | 12 | 19 | קרל נטר | דירה בבית קומות | 3.0 | 3 | 81.0 | 1770000 |
| 2 | 2021 | 12 | 16 | יהודה הלוי | דירה בבית קומות | 5.0 | 5 | 120.0 | 2400000 |
| 3 | 2021 | 12 | 15 | שרירא שמואל | דירה בבית קומות | 4.0 | 2 | 64.32 | 2050000 |
| 4 | 2021 | 12 | 15 | תרמ"ב | דירה בבית קומות | 3.0 | 3 | 80.98 | 1640000 |


## Dropping NaN/empty values

In [20]:
# We want to keep only the rows that have values in all of the columns.
# To do that, we convert every empty string to NaN
# and then drop all the rows that contain at least one NaN value.
df = df.replace(r'^\s*$', np.nan, regex=True)
df = df.dropna()
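The effect of this cell can be checked on a toy frame. The sketch below uses made-up values and hypothetical column names; it only illustrates that a row with an empty or whitespace-only string disappears after the two calls.

```python
import numpy as np
import pandas as pd

# Made-up frame: the second row has an empty cell, the third a blank one.
toy = pd.DataFrame({
    "rooms": [3.0, "", "  ", 4.0],
    "price": [100, 200, 300, 400],
})

# The regex matches strings that are empty or contain only whitespace.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True).dropna()
print(cleaned)
```

Note that the surviving rows keep their original index labels (0 and 3 here), which is why the notebook resets the index afterwards.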

In [21]:
# We decided that our goal is to predict the price at which an apartment was sold,
# so we keep only the apartment kinds and remove all other kinds of property.
apartment_kinds = [
    'דירה',
    'דירת גג',
    'דירת גג (פנטהאוז)',
    'דירת גן',
    'דירה בבית קומות',
]
df = df.loc[df['Kind'].isin(apartment_kinds)]

In [22]:
# resetting the index and removing the old index column
df.reset_index(inplace = True)
df.pop('index')

0            0
1            1
2            2
3            3
4            4
         ...  
25832    39775
25833    39776
25834    39777
25835    39778
25836    39779
Name: index, Length: 25837, dtype: int64

In [23]:
df.shape

(25837, 9)

In [24]:
df.duplicated().sum()

1

We checked, and there are no real duplicates. One duplicate is reported only because we did not keep the street number data, so two different apartments on the same street can look identical.

Note: we preferred not to use that data at all because we got worse results with it.
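A check like this can be done by listing every row that takes part in a duplicate, not only counting them. A minimal sketch on made-up data, assuming the standard `keep=False` option of `DataFrame.duplicated`:

```python
import pandas as pd

# Made-up frame: the first two rows are identical.
toy = pd.DataFrame({
    "Street": ["A", "A", "B"],
    "Price": [100, 100, 200],
})

# Default behaviour: only the later copy is flagged, so the count is 1.
n_dupes = int(toy.duplicated().sum())

# keep=False flags every copy, which makes the pair easy to inspect.
dupe_rows = toy[toy.duplicated(keep=False)]
print(n_dupes)
print(dupe_rows)
```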

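## Saving the data as a CSV file
The 'utf-8-sig' encoding, which we found on Google, writes a byte order mark (BOM) at the start of the file. This helps programs such as Excel detect the encoding, so the Hebrew text in the CSV file is not displayed as garbage.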

In [25]:
df.to_csv('C:/Users/matan/Data Science Project/Real_Estate_Rishon_Lezion.csv', encoding = 'utf-8-sig', index_label = 'index')
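The difference between 'utf-8' and 'utf-8-sig' can be seen directly on the encoded bytes. The sketch below uses a sample Hebrew word that is not taken from the dataset.

```python
import codecs

text = "שלום"  # sample Hebrew word ("hello"), not from the dataset

plain = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")

# 'utf-8-sig' prepends the 3-byte UTF-8 byte order mark; the rest is identical.
has_bom = with_bom.startswith(codecs.BOM_UTF8)
same_payload = with_bom[len(codecs.BOM_UTF8):] == plain

# Reading back with 'utf-8-sig' removes the mark again.
round_trip = with_bom.decode("utf-8-sig")
print(has_bom, same_payload, round_trip == text)
```

When the file is read back with pandas, passing `encoding='utf-8-sig'` to `pd.read_csv` keeps the mark out of the first column name.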