## Study of housing prices in Belgium, part 2: cleaning the data

In the first part of the project, we extracted the HTML code of every property for sale on the website and scraped the characteristics of each property from it. The two resulting datasets were saved in .xls format.

In this part we merge the two datasets and clean the data:

1. importing and merging the datasets
2. deleting the unnecessary columns/variables
3. cleaning the variables and applying the appropriate type to each variable
4. deleting duplicated rows and outliers/mistakes from the dataset
5. running some exploratory data analysis

### 1. Importing the datasets

In [1]:
import pandas as pd
import numpy as np
import time

date_string = time.strftime("%Y_%m_%d")
path = "C:/Users/Bedoret/OneDrive/Data Science/Housing prices in Belgium/"

pd.options.display.float_format= "{:.0f}".format
pd.options.display.max_columns = None

In [2]:
# add new data extraction to the lists

id_files = ["property_id_2021_08_11"]
char_files = ["property_char_2021_08_11"]

# extract and concatenate the IDs' files and characteristics' files
property_id = []
for file in id_files:
    data = pd.read_excel('{}{}.xls'.format(path,file), index_col=0)
    property_id.append(data)
property_id = pd.concat(property_id)   

property_char = []
for file in char_files:
    data = pd.read_excel('{}{}.xls'.format(path,file), index_col=0)
    property_char.append(data)
property_char = pd.concat(property_char)

### 2. Merging the datasets

We merge the two datasets on the ID variable, which uniquely identifies each property. This variable already exists in property_id but not yet in property_char, where it has to be extracted from the end of each property's URL.

In [3]:
# extract the id values from the url string in property_char
property_char["id"] = pd.to_numeric(property_char["url"].str[-7:])

# merge the two datasets and drop all duplicates
properties = property_id.merge(property_char, how="outer", on="id")
properties.drop_duplicates(inplace = True)

# Depending on the date of the extraction, we might have to rename these variables
properties.rename(columns={'id':'Identifiant','type': 'Type', 'zip': 'Code postal', 'price': 'Prix', 'url':'URL'}, inplace=True)
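
The slice `.str[-7:]` relies on every ID being exactly seven digits long. A minimal sketch of a more defensive variant, assuming the ID is the trailing run of digits in the URL, is given here. The sample URLs and values are invented for illustration, and `indicator=True` is used only to check which rows found a match in the outer merge.

```python
import pandas as pd

# Hypothetical sample data; the URLs and values are invented for illustration.
property_id = pd.DataFrame({"id": [9103775, 8536825, 10000001],
                            "price": [250000, 180000, 320000]})
property_char = pd.DataFrame({
    "url": ["https://example.com/annonce/9103775",
            "https://example.com/annonce/10000001",
            "https://example.com/annonce/7654321"],
    "Chambres": [3, 2, 4],
})

# Capture the trailing digits of the URL instead of a fixed 7-character slice.
property_char["id"] = pd.to_numeric(
    property_char["url"].str.extract(r"(\d+)$", expand=False)
)

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
merged = property_id.merge(property_char, how="outer", on="id", indicator=True)
print(merged["_merge"].value_counts())
```

Rows flagged `left_only` or `right_only` have no counterpart in the other dataset, which is worth checking before any cleaning.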

### 3. Cleaning subgroups of properties

Apartment groups and house groups are properties sold as part of a group of properties. Although these could provide valuable information on housing prices, the available data on them is insufficient to produce meaningful statistics, so we keep only individual houses and apartments.

In [4]:
properties = properties.loc[(properties['Type'] == "HOUSE") | 
                            (properties['Type'] == "APARTMENT")]

# uncomment to show summary of properties by type
#properties.groupby("Type").mean(numeric_only=True)

### 4. Removing unnecessary columns

Many columns/variables are not necessary for the analysis. In addition, some variables contain too little information. We can look at the percentage of missing data for each variable.

Note: "Adresse" has been incorrectly extracted from the data: it corresponds to the address of the selling agency, not the address of the property. It must be deleted.

In [5]:
def percent_missing(df):
    # percentage of missing values per column, sorted from most to least missing
    missing = pd.DataFrame(df.isnull().mean() * 100)
    missing.sort_values(0, inplace = True, ascending = False)
    return missing
#percent_missing(properties)
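
As an illustration of how the missing-data percentages can guide the column selection, here is a small sketch on invented data. It uses a variant of `percent_missing` that returns a Series rather than a DataFrame, and the 50 % threshold is an arbitrary assumption, not a rule used in this project.

```python
import numpy as np
import pandas as pd

def percent_missing(df):
    """Share of missing values per column, in percent, sorted descending."""
    return (df.isnull().mean() * 100).sort_values(ascending=False)

# Invented toy data, for illustration only.
toy = pd.DataFrame({
    "Prix": [100000, 200000, np.nan, 150000],
    "Jardin": [np.nan, np.nan, np.nan, "Oui"],
    "Type": ["HOUSE", "HOUSE", "APARTMENT", "HOUSE"],
})

missing = percent_missing(toy)
print(missing)

# Keep only the columns with at most 50 % missing values (arbitrary threshold).
kept = missing[missing <= 50].index.tolist()
print(kept)
```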

In [6]:
properties = properties[["Identifiant",
                        "Type",
                        "Étage",
                        "Code postal",
                        "Prix",
                        "Surface habitable",
                        "Surface du terrain",
                        "Chambres",
                        "Type de cuisine",
                        "Salles de bains",
                        "Salles de douche",
                        "Toilettes",
                        "Terrasse",
                        "Surface de la terrasse",
                        "Jardin",
                        "Surface du jardin",
                        "Nombre de façades",
                        "Parkings extérieurs",
                        "Parkings intérieurs",
                        "Année de construction",
                        "État du bâtiment",
                        "Type de chauffage",
                        "Classe énergétique",
                        "URL"]]

### 5. Cleaning the columns and applying the appropriate type to each variable

#### Cuisine, classe énergétique & bâtiment

In [7]:
# creating dictionaries of the new labels for the variables 'type de cuisine',
# 'classe énergétique', and 'état du bâtiment'.

cuisine_dictionary={'Américaine hyper-équipée':1,'Hyper équipée':1,
                    'Américaine équipée':2,'Équipée':2,
                    'Américaine semi-équipée':3, 'Semi-équipée':3,
                    'Américaine non-équipée':4, 'Pas équipée':4
                    }
classe_ener_dictionary={'A++':1,
                        'A+':2,
                        'A':3,
                        'B':4,
                        'C':5,
                        'D':6,
                        'E':7,
                        'F':8,
                        'G':9,
                        'Non communiqué': None
                       }
batiment_dictionary={"Excellent état":1,
                     "Fraîchement rénové":2,
                     "Bon":3,
                     "À rafraîchir":4,
                     "À rénover":5,
                     "À restaurer":6
                    }

# energy classes whose label contains an underscore are recoded as "Non communiqué"
properties["Classe énergétique"] = np.where(properties["Classe énergétique"].str.contains("_"), "Non communiqué", properties["Classe énergétique"])

properties = properties.replace({"Type de cuisine":cuisine_dictionary,"Classe énergétique":classe_ener_dictionary,"État du bâtiment":batiment_dictionary})
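
`DataFrame.replace` leaves any label that is not in the dictionary unchanged, so an unexpected label would remain in the column as a string. A possible check, sketched here on invented data with a subset of the dictionary above, is to use `Series.map`, which returns NaN for labels that are absent from the dictionary. The label "Inconnu" is hypothetical.

```python
import pandas as pd

# Subset of the building-state dictionary, for illustration only.
etat_order = {"Excellent état": 1, "Bon": 3, "À rénover": 5}

# Invented sample; "Inconnu" stands for a label missing from the dictionary.
toy = pd.Series(["Bon", "À rénover", "Inconnu", None], name="État du bâtiment")

# map() returns NaN for every label that is not a key of the dictionary.
coded = toy.map(etat_order)

# Labels that were present in the data but could not be mapped.
unmapped = toy[coded.isna() & toy.notna()].unique().tolist()
print(coded.tolist())
print(unmapped)
```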


#### Terrasse, Jardin & Parkings

In [8]:
# Clean and complete some missing values

# if any terrace size has been marked, ensure that the variable terrace is marked as yes
properties['Terrasse_new'] = np.where(((properties['Terrasse'] == "Oui")| 
                                       (properties['Surface de la terrasse'] > 0)
                                      ), 1, 0
                                     )

# if any garden size has been marked, ensure that the variable garden is marked as yes
properties['Jardin_new'] = np.where(((properties['Jardin'] == "Oui")| 
                                     (properties['Surface du jardin'] > 0)
                                    ), 1, 0
                                   )

# create the variable parking as a boolean if any indoor or outdoor parking is given
properties['Parking'] = np.where(((properties['Parkings extérieurs'] > 0)| 
                                  (properties['Parkings intérieurs'] > 0)
                                 ), 1, 0
                                )

properties = properties.drop(columns=['Terrasse', 'Jardin','Parkings extérieurs','Parkings intérieurs'])
properties = properties.rename(columns= {"Terrasse_new":"Terrasse",
                                         "Jardin_new": "Jardin"})

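The indicator logic can be checked on a few invented rows. Note that comparisons involving NaN evaluate to False, so a property with neither a "Oui" flag nor a positive surface is coded 0, whether the information was "Non" or simply missing. This sketch only reproduces the terrace case.

```python
import numpy as np
import pandas as pd

# Invented rows covering the four situations of interest.
toy = pd.DataFrame({
    "Terrasse": ["Oui", np.nan, np.nan, "Non"],
    "Surface de la terrasse": [np.nan, 12.0, np.nan, 0.0],
})

# 1 if the flag is "Oui" or a positive surface is reported, 0 otherwise.
toy["Terrasse_new"] = np.where(
    (toy["Terrasse"] == "Oui") | (toy["Surface de la terrasse"] > 0), 1, 0
)
print(toy)
```
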
### 6. Cleaning unnecessary rows

#### Duplicates

In [9]:
# The same property is sometimes listed by several real estate agencies on the same website.
# We delete those duplicates by comparing every column except the listing-specific ones.
properties.drop_duplicates(subset = properties.columns.difference(["Identifiant","URL","Page"]), inplace = True)
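
To see what `columns.difference` does here, consider this small sketch on invented data. Names passed to `difference` that are not columns (such as "Page" in this toy frame) are simply ignored.

```python
import pandas as pd

# Invented listings: rows 1 and 2 describe the same property under two IDs.
toy = pd.DataFrame({
    "Identifiant": [1, 2, 3],
    "URL": ["a", "b", "c"],
    "Prix": [200000, 200000, 310000],
    "Code postal": [1000, 1000, 4000],
})

# All columns except the listing-specific ones, returned in sorted order.
compare_on = toy.columns.difference(["Identifiant", "URL", "Page"]).tolist()

# Rows identical on these columns are duplicates; the first one is kept.
deduplicated = toy.drop_duplicates(subset=compare_on)
print(compare_on)
print(deduplicated)
```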

#### Removing missing values and outliers

In [10]:
# delete properties with no reported living area, or with a living area below 10 or above 7000 square meters
properties.drop(properties[(properties['Surface habitable'].isna()) | 
                           (properties['Surface habitable'] < 10) |
                           (properties['Surface habitable'] > 7000) 
                          ].index, inplace=True)

# delete properties which have no price or whose price is below 10 000 €
properties.drop(properties[(properties['Prix'].isna()) | 
                           (properties['Prix'] < 10000)
                          ].index, inplace=True)

# delete properties with more than 30 bedrooms: these are mostly blocks of apartments sold in batches
# and should have been listed as apartment groups or house groups, as defined above.
# Below 30 bedrooms, some are batches of apartments, but others are small castles or huge villas.
properties.drop(properties[properties.Chambres > 30].index, inplace = True)

# set floor numbers above 50 to missing
properties.loc[properties['Étage'] > 50, 'Étage'] = None

# set land areas of 5 square meters or less to missing
properties.loc[properties['Surface du terrain'] <= 5, 'Surface du terrain'] = None

# set terrace sizes of 1000 square meters or more to missing
properties.loc[properties['Surface de la terrasse'] >= 1000, 'Surface de la terrasse'] = None

# set garden sizes of 5 square meters or less to missing
properties.loc[properties['Surface du jardin'] <= 5, 'Surface du jardin'] = None

# mark properties who have reported over 4 facades as 4 facades
properties.loc[properties['Nombre de façades'] > 4, 'Nombre de façades'] = 4
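
The cell above applies three different treatments: dropping the row, setting a value to missing, and capping a value. This sketch shows the same three treatments on invented data, written with `between` and `clip` as one possible alternative to the index-based drops. The thresholds are the ones used above.

```python
import numpy as np
import pandas as pd

# Invented rows, for illustration only.
toy = pd.DataFrame({
    "Surface habitable": [5.0, 120.0, 90.0, np.nan],
    "Étage": [2.0, 75.0, 1.0, 0.0],
    "Nombre de façades": [2.0, 6.0, 4.0, 3.0],
})

# 1. Drop rows whose key variable is missing or implausible.
#    NaN is never "between" two bounds, so missing sizes are dropped too.
toy = toy[toy["Surface habitable"].between(10, 7000)].copy()

# 2. Keep the row but set an implausible secondary value to missing.
toy.loc[toy["Étage"] > 50, "Étage"] = np.nan

# 3. Cap a value at a plausible maximum.
toy["Nombre de façades"] = toy["Nombre de façades"].clip(upper=4)
print(toy)
```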

#### Create new variables

In [11]:
# create a new variable which gives the number of bedrooms per bathroom
properties["Ratio chambres sdb"] = properties["Chambres"] / properties["Salles de bains"]

# create a variable which gives the price per square meter
properties["Prix m2"]= properties["Prix"] / properties["Surface habitable"]
properties = properties.drop(properties[properties["Prix m2"].isna()].index)

# reorder the columns ("Salles de douche" is not kept)
properties = properties[["Identifiant", "Type", "Étage", "Code postal", "Prix", "Prix m2",
                         "Surface habitable", "Surface du terrain", "Chambres",
                         "Type de cuisine", "Salles de bains", "Ratio chambres sdb",
                         "Toilettes", "Terrasse", "Surface de la terrasse", "Jardin",
                         "Surface du jardin", "Nombre de façades", "Parking",
                         "Année de construction", "État du bâtiment", "Type de chauffage",
                         "Classe énergétique", "URL"]]
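
One thing to keep in mind with ratio variables: in pandas, dividing by zero yields `inf` rather than raising an error. Whether the real data contains properties with zero bathrooms is not known here, so the following is only a sketch, on invented values, of how such cases could be turned into missing values.

```python
import numpy as np
import pandas as pd

# Invented values: one property reports 0 bathrooms, one reports nothing.
toy = pd.DataFrame({"Chambres": [3, 2, 4], "Salles de bains": [1, 0, np.nan]})

# Division by zero gives inf, division by NaN gives NaN.
ratio = toy["Chambres"] / toy["Salles de bains"]

# Treat infinite ratios as missing so they do not distort averages.
ratio_clean = ratio.replace([np.inf, -np.inf], np.nan)
print(ratio.tolist())
print(ratio_clean.tolist())
```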

#### Independent drops of rows which are inappropriate

In [12]:
# an empty farm with many erroneous variables
properties = properties.drop(properties[properties["Identifiant"] == 9103775].index)
# an apartment with a wrong size
properties = properties.drop(properties[properties["Identifiant"] == 8536825].index)
# misreported values for this apartment
properties = properties.drop(properties[properties["Identifiant"] == 9227766].index)

### 7. Exploratory data analysis

Let's have a look at our cleaned dataset.
- How many properties do we have?
- How many houses and apartments?
- What is the average price, overall and for houses vs apartments?
- What is the average price per square meter, overall and for houses vs apartments?

In [13]:
properties.shape

(43712, 24)

In [14]:
properties.groupby("Type").mean(numeric_only=True)

| Type | Identifiant | Étage | Code postal | Prix | Prix m2 | Surface habitable | Surface du terrain | Chambres | Type de cuisine | Salles de bains | Ratio chambres sdb | Toilettes | Terrasse | Surface de la terrasse | Jardin | Surface du jardin | Nombre de façades | Parking | Année de construction | État du bâtiment | Classe énergétique |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| APARTMENT | 9244555 | 3.0 | 4753 | 338220 | 3264 | 105 | | 2 | 2 | 1 | 2 | 1 | 1 | 19 | 0 | 491 | 2 | 0 | 1997 | 2 | 5 |
| HOUSE | 9302511 | | 5185 | 447829 | 2010 | 224 | 10239.0 | 4 | 2 | 2 | 3 | 2 | 1 | 33 | 0 | 1085 | 3 | 1 | 1968 | 3 | 6 |
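
The questions listed above (counts and average prices per type) can also be answered with a single named aggregation. This is a minimal sketch on invented data; the output column names `n`, `prix_moyen` and `prix_m2_moyen` are hypothetical.

```python
import pandas as pd

# Invented data, for illustration only.
toy = pd.DataFrame({
    "Type": ["HOUSE", "HOUSE", "APARTMENT"],
    "Prix": [400000, 500000, 300000],
    "Prix m2": [2000, 2500, 3000],
    "URL": ["a", "b", "c"],
})

# One row per type: number of properties, mean price, mean price per m2.
summary = toy.groupby("Type").agg(
    n=("Prix", "size"),
    prix_moyen=("Prix", "mean"),
    prix_m2_moyen=("Prix m2", "mean"),
)
print(summary)
```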


In [15]:
houses = properties.loc[properties["Type"]=="HOUSE"]
apartments = properties.loc[properties["Type"]=="APARTMENT"]

We now have three DataFrames to work with:

- properties, which contains the data on all properties
- apartments, which contains the data on apartments only
- houses, which contains the data on houses only

We save the three DataFrames in Excel format.

In [16]:
properties.to_excel('{}properties_{}.xlsx'.format(path,date_string), sheet_name= 'properties')
apartments.to_excel('{}apartments_{}.xlsx'.format(path,date_string), sheet_name= 'apartments')
houses.to_excel('{}houses_{}.xlsx'.format(path,date_string), sheet_name= 'houses')