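First, we fetch search results for houses and apartments from Immoweb.
We call get_search_results with the argument 2. The output lists many more than 2 links, so the argument is presumably a number of result pages rather than a number of properties.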

In [102]:
import get_search_results as search

dict_urls = search.get_search_results(2)
dict_urls

{'https://www.immoweb.be/en/classified/house/for-sale/evere/1140/8957801': True,
 'https://www.immoweb.be/en/classified/town-house/for-sale/liege-2/4020/8911319': True,
 'https://www.immoweb.be/en/classified/town-house/for-sale/anderlecht/1070/8957822': True,
 'https://www.immoweb.be/en/classified/villa/for-sale/les-bons-villers/6210/8957808': True,
 'https://www.immoweb.be/en/classified/house/for-sale/lier/2500/8957509': True,
 'https://www.immoweb.be/en/classified/house/for-sale/gent/9000/8957821': True,
 'https://www.immoweb.be/en/classified/house/for-sale/gentbrugge/9050/8957820': True,
 'https://www.immoweb.be/en/classified/house/for-sale/gent-(9000)/9000/8957819': True,
 'https://www.immoweb.be/en/classified/house/for-sale/kuurne/8520/8957813': True,
 'https://www.immoweb.be/en/classified/house/for-sale/harelbeke/8530/8906204': True,
 'https://www.immoweb.be/en/classified/town-house/for-sale/bredene/8450/8957811': True,
 'https://www.immoweb.be/en/classified/house/for-sale/breden

Then, we iterate through the found properties and scrape data. Only valid (reachable) links are processed.

In [103]:
import scrap_search_result as scrap

lists_dict = scrap.scrap_list(dict_urls)
lists_dict

{'hyperlink': ['https://www.immoweb.be/en/classified/house/for-sale/evere/1140/8957801',
  'https://www.immoweb.be/en/classified/town-house/for-sale/liege-2/4020/8911319',
  'https://www.immoweb.be/en/classified/town-house/for-sale/anderlecht/1070/8957822',
  'https://www.immoweb.be/en/classified/villa/for-sale/les-bons-villers/6210/8957808',
  'https://www.immoweb.be/en/classified/house/for-sale/lier/2500/8957509',
  'https://www.immoweb.be/en/classified/house/for-sale/gent/9000/8957821',
  'https://www.immoweb.be/en/classified/house/for-sale/gentbrugge/9050/8957820',
  'https://www.immoweb.be/en/classified/house/for-sale/gent-(9000)/9000/8957819',
  'https://www.immoweb.be/en/classified/house/for-sale/kuurne/8520/8957813',
  'https://www.immoweb.be/en/classified/house/for-sale/harelbeke/8530/8906204',
  'https://www.immoweb.be/en/classified/town-house/for-sale/bredene/8450/8957811',
  'https://www.immoweb.be/en/classified/house/for-sale/bredene/8450/8957809',
  'https://www.immoweb.b
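
The internals of scrap_list are not shown in this notebook. As a minimal sketch, assuming the boolean stored for each URL in dict_urls marks whether the link is reachable, the filtering step could look like this (reachable_links is a hypothetical helper, not part of scrap_search_result, and the URLs are placeholders):

```python
def reachable_links(url_flags: dict) -> list:
    """Return only the URLs whose flag is True (hypothetical helper)."""
    return [url for url, is_reachable in url_flags.items() if is_reachable]


# Placeholder data, not real listings
sample_urls = {
    "https://example.com/listing/1": True,
    "https://example.com/listing/2": False,
}
links = reachable_links(sample_urls)
```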

The lists_dict dictionary is loaded into the DataQuality class and then "flagged".
The flagging process marks duplicate rows and counts the null values in each row.

In [104]:
import dataquality

dq = dataquality.DataQuality(lists_dict)

flagged = dq.flag()
flagged

Unnamed: 0,hyperlink,locality,postcode,house_is,property_subtype,price,sale,rooms_number,area,kitchen_has,...,terrace,terrace_area,garden,garden_area,land_surface,land_plot_surface,facades_number,swimming_pool_has,duplicates,null
0,https://www.immoweb.be/en/classified/house/for...,evere,1140,True,house,449000.0,,4.0,200.0,True,...,True,8.0,True,40.0,,140.0,2.0,False,False,
1,https://www.immoweb.be/en/classified/town-hous...,liege-2,4020,True,town-house,215000.0,,2.0,100.0,False,...,True,5.0,True,157.0,,308.0,3.0,False,False,
2,https://www.immoweb.be/en/classified/town-hous...,anderlecht,1070,True,town-house,489000.0,,3.0,165.0,False,...,False,,False,,,214.0,3.0,False,False,
3,https://www.immoweb.be/en/classified/villa/for...,les-bons-villers,6210,True,villa,450000.0,,4.0,210.0,False,...,False,,False,,,2400.0,4.0,False,False,
4,https://www.immoweb.be/en/classified/house/for...,lier,2500,True,house,321000.0,,4.0,175.0,True,...,True,,True,25.0,,125.0,2.0,False,False,
5,https://www.immoweb.be/en/classified/house/for...,gent,9000,True,house,498000.0,,4.0,,False,...,False,,True,,,87.0,2.0,False,False,
6,https://www.immoweb.be/en/classified/house/for...,gentbrugge,9050,True,house,489000.0,,4.0,,True,...,False,,True,,,200.0,2.0,False,False,
7,https://www.immoweb.be/en/classified/house/for...,gent-(9000),9000,True,house,239000.0,,1.0,,True,...,False,,False,,,,2.0,False,False,
8,https://www.immoweb.be/en/classified/house/for...,kuurne,8520,True,house,398000.0,,4.0,268.0,True,...,False,,True,,,1032.0,4.0,False,False,
9,https://www.immoweb.be/en/classified/house/for...,harelbeke,8530,True,house,395000.0,,4.0,272.0,False,...,False,,True,,,655.0,4.0,False,False,
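
The implementation of DataQuality.flag() is not shown here. A rough pandas sketch of the same idea, assuming duplicates are identified by the hyperlink column (an assumption) and using a hypothetical function name, might be:

```python
import pandas as pd


def flag_rows(df: pd.DataFrame, key: str = "hyperlink") -> pd.DataFrame:
    """Hypothetical stand-in for DataQuality.flag()."""
    flagged = df.copy()
    # Mark every repeat of a key after its first occurrence
    flagged["duplicates"] = df.duplicated(subset=[key], keep="first")
    # Count the missing values per row, computed on the original columns only
    flagged["null"] = df.isna().sum(axis=1)
    return flagged


# Illustrative rows, not scraped data
sample = pd.DataFrame({
    "hyperlink": ["u1", "u2", "u1"],
    "price": [100.0, None, 100.0],
    "area": [None, None, 50.0],
})
flagged_sample = flag_rows(sample)
```

Counting nulls on the original columns keeps the two helper columns from influencing each other.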


After flagging, the main statistical parameters of each dataframe column are computed:
1. count: number of valid (not null) values
1. unique: number of unique values
1. top: the most frequent value (only the first is shown when several values are equally frequent)
1. freq: frequency of the top value
1. mean: arithmetic mean
1. std: standard deviation
1. min: minimum value
1. 5%: 5th percentile, a possible lower limit for outliers
1. 50%: median
1. 95%: 95th percentile, a possible upper limit for outliers
1. max: maximum value
1. dtypes: data type

True and False values are converted to 1 and 0 respectively, so that meaningful statistics can be computed for boolean columns.

In [105]:
description = dq.describe()
description

Unnamed: 0,hyperlink,locality,postcode,house_is,property_subtype,price,sale,rooms_number,area,kitchen_has,furnished,open_fire,terrace,terrace_area,garden,garden_area,land_surface,land_plot_surface,facades_number,swimming_pool_has
count,58,58,58,58,58,56,58,57,48,58,58,58,58,14,58,12,0,28,42,58
unique,58,49,44,,7,,1,6,44,,,,,12,,11,0,27,4,
top,https://www.immoweb.be/en/classified/apartment...,oostende,8300,,apartment,,,3,90,,,,,8,,60,,450,2,
freq,1,3,4,,25,,58,19,3,,,,,2,,2,,2,23,
mean,,,,0.517241,,410985,,,,0.551724,0.0172414,0.0689655,0.465517,,0.37931,,,,,0.0172414
std,,,,0.504067,,314653,,,,0.501661,0.131306,0.255609,0.503166,,0.489453,,,,,0.131306
min,,,,0,,134500,,,,0,0,0,0,,0,,,,,0
5%,,,,0,,169000,,,,0,0,0,0,,0,,,,,0
50%,,,,1,,298500,,,,1,0,0,0,,0,,,,,0
95%,,,,1,,998750,,,,1,0,1,1,,1,,,,,0
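
DataQuality.describe() itself is not listed in this notebook. Most of the rows above can be reproduced with the standard pandas describe method, so a sketch under that assumption could look as follows (describe_all is a hypothetical name, and the dtypes row is left out):

```python
import pandas as pd


def describe_all(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for DataQuality.describe()."""
    # Cast boolean columns to 0/1 so that mean, std and percentiles are defined
    bool_cols = df.select_dtypes(include="bool").columns
    numeric = df.astype({col: int for col in bool_cols})
    # include="all" also reports count/unique/top/freq for text columns
    return numeric.describe(percentiles=[0.05, 0.5, 0.95], include="all")


# Illustrative rows, not scraped data
sample = pd.DataFrame({
    "locality": ["gent", "gent", "lier", "evere"],
    "garden": [True, False, True, True],
    "price": [100.0, 200.0, 300.0, 400.0],
})
summary = describe_all(sample)
```

With the booleans cast to integers, the mean of a column such as garden reads directly as the share of properties that have one.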


Duplicate values are removed.

In [106]:
cleaned = dq.clean()
cleaned

Unnamed: 0,hyperlink,locality,postcode,house_is,property_subtype,price,sale,rooms_number,area,kitchen_has,furnished,open_fire,terrace,terrace_area,garden,garden_area,land_surface,land_plot_surface,facades_number,swimming_pool_has
0,https://www.immoweb.be/en/classified/house/for...,evere,1140,True,house,449000.0,,4.0,200.0,True,False,False,True,8.0,True,40.0,,140.0,2.0,False
1,https://www.immoweb.be/en/classified/town-hous...,liege-2,4020,True,town-house,215000.0,,2.0,100.0,False,False,False,True,5.0,True,157.0,,308.0,3.0,False
2,https://www.immoweb.be/en/classified/town-hous...,anderlecht,1070,True,town-house,489000.0,,3.0,165.0,False,False,False,False,,False,,,214.0,3.0,False
3,https://www.immoweb.be/en/classified/villa/for...,les-bons-villers,6210,True,villa,450000.0,,4.0,210.0,False,False,False,False,,False,,,2400.0,4.0,False
4,https://www.immoweb.be/en/classified/house/for...,lier,2500,True,house,321000.0,,4.0,175.0,True,False,False,True,,True,25.0,,125.0,2.0,False
5,https://www.immoweb.be/en/classified/house/for...,gent,9000,True,house,498000.0,,4.0,,False,False,False,False,,True,,,87.0,2.0,False
6,https://www.immoweb.be/en/classified/house/for...,gentbrugge,9050,True,house,489000.0,,4.0,,True,False,False,False,,True,,,200.0,2.0,False
7,https://www.immoweb.be/en/classified/house/for...,gent-(9000),9000,True,house,239000.0,,1.0,,True,False,False,False,,False,,,,2.0,False
8,https://www.immoweb.be/en/classified/house/for...,kuurne,8520,True,house,398000.0,,4.0,268.0,True,False,False,False,,True,,,1032.0,4.0,False
9,https://www.immoweb.be/en/classified/house/for...,harelbeke,8530,True,house,395000.0,,4.0,272.0,False,False,True,False,,True,,,655.0,4.0,False
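
How DataQuality.clean() decides what counts as a duplicate is not shown. As a minimal illustration with plain pandas, assuming fully identical rows are treated as duplicates:

```python
import pandas as pd

# Illustrative rows, not scraped data
sample = pd.DataFrame({
    "hyperlink": ["u1", "u2", "u1"],
    "price": [100.0, 200.0, 100.0],
})
# Keep the first occurrence of each identical row and renumber the index
deduplicated = sample.drop_duplicates(keep="first").reset_index(drop=True)
```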


Values are formatted according to the required output.

In [107]:
_VALUES_FORMAT = dict(hyperlink='str', locality='str', postcode='int', house_is='yn', property_subtype='str',
                      price='int', sale='str', rooms_number='int', area='int', kitchen_has='yn', furnished='yn',
                      open_fire='yn', terrace='yn', terrace_area='int', garden='yn', garden_area='int',
                      land_surface='int', land_plot_surface='int', facades_number='int', swimming_pool_has='yn')

cleaned = dq.values_format(df=cleaned, columns_dtypes=_VALUES_FORMAT)
cleaned

Unnamed: 0,hyperlink,locality,postcode,house_is,property_subtype,price,sale,rooms_number,area,kitchen_has,furnished,open_fire,terrace,terrace_area,garden,garden_area,land_surface,land_plot_surface,facades_number,swimming_pool_has
0,https://www.immoweb.be/en/classified/house/for...,evere,1140,Yes,house,449000.0,,4.0,200.0,Yes,No,No,Yes,8.0,Yes,40.0,,140.0,2.0,No
1,https://www.immoweb.be/en/classified/town-hous...,liege-2,4020,Yes,town-house,215000.0,,2.0,100.0,No,No,No,Yes,5.0,Yes,157.0,,308.0,3.0,No
2,https://www.immoweb.be/en/classified/town-hous...,anderlecht,1070,Yes,town-house,489000.0,,3.0,165.0,No,No,No,No,,No,,,214.0,3.0,No
3,https://www.immoweb.be/en/classified/villa/for...,les-bons-villers,6210,Yes,villa,450000.0,,4.0,210.0,No,No,No,No,,No,,,2400.0,4.0,No
4,https://www.immoweb.be/en/classified/house/for...,lier,2500,Yes,house,321000.0,,4.0,175.0,Yes,No,No,Yes,,Yes,25.0,,125.0,2.0,No
5,https://www.immoweb.be/en/classified/house/for...,gent,9000,Yes,house,498000.0,,4.0,,No,No,No,No,,Yes,,,87.0,2.0,No
6,https://www.immoweb.be/en/classified/house/for...,gentbrugge,9050,Yes,house,489000.0,,4.0,,Yes,No,No,No,,Yes,,,200.0,2.0,No
7,https://www.immoweb.be/en/classified/house/for...,gent-(9000),9000,Yes,house,239000.0,,1.0,,Yes,No,No,No,,No,,,,2.0,No
8,https://www.immoweb.be/en/classified/house/for...,kuurne,8520,Yes,house,398000.0,,4.0,268.0,Yes,No,No,No,,Yes,,,1032.0,4.0,No
9,https://www.immoweb.be/en/classified/house/for...,harelbeke,8530,Yes,house,395000.0,,4.0,272.0,No,No,Yes,No,,Yes,,,655.0,4.0,No
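
The body of DataQuality.values_format() is not part of this notebook. The sketch below shows one way the 'yn', 'int' and 'str' codes could be interpreted; format_values is a hypothetical name, and the handling of each code is an assumption. The output above still shows decimals in the 'int' columns, so the real method may treat them differently.

```python
import pandas as pd


def format_values(df: pd.DataFrame, columns_dtypes: dict) -> pd.DataFrame:
    """Hypothetical stand-in for DataQuality.values_format()."""
    out = df.copy()
    for column, kind in columns_dtypes.items():
        if kind == "yn":
            # Map booleans to Yes/No; missing values stay missing
            out[column] = out[column].map({True: "Yes", False: "No"})
        elif kind == "int":
            # Nullable integer dtype, so rows with missing values survive the cast
            out[column] = out[column].astype("Int64")
        elif kind == "str":
            # Pandas string dtype keeps missing values as missing
            out[column] = out[column].astype("string")
    return out


# Illustrative rows, not scraped data
sample = pd.DataFrame({
    "locality": ["evere", "lier"],
    "garden": [True, False],
    "price": [449000.0, None],
})
formatted = format_values(sample, {"locality": "str", "garden": "yn", "price": "int"})
```

The nullable Int64 dtype is used because a plain integer column cannot hold missing values in pandas.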


The description of the cleaned dataframe can now be shown.

In [108]:
description_cleaned = dq.describe(df=cleaned)
description_cleaned

Unnamed: 0,hyperlink,locality,postcode,house_is,property_subtype,price,sale,rooms_number,area,kitchen_has,furnished,open_fire,terrace,terrace_area,garden,garden_area,land_surface,land_plot_surface,facades_number,swimming_pool_has
count,56,56,56,56,56,54,56,55,46,56,56,56,56,14,56,12,0,28,42,56
unique,56,49,,2,7,,1,,,2,2,2,2,,2,,0,,,2
top,https://www.immoweb.be/en/classified/apartment...,schaerbeek,,Yes,house,,,,,Yes,No,No,No,,No,,,,,No
freq,1,3,,30,23,,56,,,31,55,52,30,,34,,,,,55
mean,,,5091.36,,,403151,,3,166.261,,,,,23.7857,,146.583,,526.893,2.52381,
std,,,3371.75,,,308023,,1.56347,112.801,,,,,18.3352,,142.505,,550.792,0.833391,
min,,,1000,,,134500,,1,55,,,,,2,,15,,74,1,
5%,,,1015,,,169000,,1,62.25,,,,,3.95,,20.5,,85.7,2,
50%,,,4060,,,298500,,3,137.5,,,,,19,,108.5,,331.5,2,
95%,,,9730,,,863250,,4,284,,,,,48.7,,380.1,,1669.65,4,


Finally, the results are exported as CSV files.

In [109]:
import pandas
import os

output_csvs = {"flagged": flagged, "description": description, "cleaned": cleaned, "description_cleaned": description_cleaned}


def table_to_csv(table, filename: str, path=os.path.abspath('')):
    """Write a dataframe (or a dictionary of lists) to <path>/<filename>.csv."""
    if isinstance(table, dict):
        # A dictionary of lists is converted to a dataframe first
        table = pandas.DataFrame(table)
    elif not isinstance(table, pandas.DataFrame):
        raise TypeError("Provided table is neither a dataframe nor a dictionary of lists")
    table.to_csv(os.path.join(path, filename + ".csv"))
    print(filename + ".csv" + " created at: " + path)


for key, value in output_csvs.items():
    table_to_csv(value, key)

flagged.csv created at: C:\Users\Fra\OneDrive\Desktop\belearner\challenge-collecting-data
description.csv created at: C:\Users\Fra\OneDrive\Desktop\belearner\challenge-collecting-data
cleaned.csv created at: C:\Users\Fra\OneDrive\Desktop\belearner\challenge-collecting-data
description_cleaned.csv created at: C:\Users\Fra\OneDrive\Desktop\belearner\challenge-collecting-data
