## Preprocessing of data, Part 2 - checking and cleaning the dataset

In this notebook we check the data in the CSV file we created in part 1, to see whether there is anything we need to pay extra attention to when we start part 3 (removing or replacing missing data).


### Import libraries and dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Uses csv-file with unwanted columns removed
df = pd.read_csv('germany_housing_data_part1_with_unwanted_columns_removed.csv')
df.head(10)

,Living_space,Lot,Rooms,Bedrooms,Bathrooms,Floors,Year_built,Year_renovated,Garages,Condition,Heating,Energy_source,State,Garagetype,Type,Price
0,106.0,229.0,5.5,3.0,1.0,2.0,2005.0,,2.0,modernized,central heating,gas,Baden-Württemberg,Parking lot,Multiple dwelling,498000.0
1,140.93,517.0,6.0,3.0,2.0,,1994.0,,7.0,modernized,stove heating,,Baden-Württemberg,Parking lot,Mid-terrace house,495000.0
2,162.89,82.0,5.0,3.0,2.0,4.0,2013.0,,1.0,dilapidated,stove heating,other combinations of energy sources,Baden-Württemberg,Garage,Farmhouse,749000.0
3,140.0,814.0,4.0,,2.0,2.0,1900.0,2000.0,1.0,fixer-upper,central heating,electricity,Baden-Württemberg,Garage,Farmhouse,259000.0
4,115.0,244.0,4.5,2.0,1.0,,1968.0,2019.0,1.0,refurbished,central heating,oil,Baden-Württemberg,Garage,Multiple dwelling,469000.0
5,310.0,860.0,8.0,,,3.0,1969.0,,2.0,maintained,,oil,Baden-Württemberg,Garage,Mid-terrace house,1400000.0
6,502.0,5300.0,13.0,,4.0,,2004.0,,7.0,dilapidated,stove heating,other combinations of energy sources,Baden-Württemberg,Parking lot,Duplex,3500000.0
7,263.0,406.0,10.0,,,3.0,1989.0,,2.0,modernized,stove heating,gas,Baden-Württemberg,Garage,Duplex,630000.0
8,227.0,973.0,10.0,4.0,4.0,2.0,1809.0,2015.0,8.0,modernized,central heating,electricity,Baden-Württemberg,Parking lot,Duplex,364000.0
9,787.0,933.0,30.0,,,3.0,1920.0,,12.0,modernized,stove heating,other combinations of energy sources,Baden-Württemberg,Parking lot,Duplex,1900000.0


### Checking the data

We start with an overview of our data.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10552 entries, 0 to 10551
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Living_space    10552 non-null  float64
 1   Lot             10552 non-null  float64
 2   Rooms           10552 non-null  float64
 3   Bedrooms        6878 non-null   float64
 4   Bathrooms       8751 non-null   float64
 5   Floors          7888 non-null   float64
 6   Year_built      9858 non-null   float64
 7   Year_renovated  5349 non-null   float64
 8   Garages         8592 non-null   float64
 9   Condition       10229 non-null  object 
 10  Heating         9968 non-null   object 
 11  Energy_source   9325 non-null   object 
 12  State           10551 non-null  object 
 13  Garagetype      8592 non-null   object 
 14  Type            10150 non-null  object 
 15  Price           10552 non-null  float64
dtypes: float64(10), object(6)
memory usage: 1.3+ MB
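Since part 3 deals with missing data, it can help to see the share of missing values per column next to the raw non-null counts. The following is a minimal sketch on a made-up toy frame (the `toy` rows are invented and only borrow three of the column names above); on the real data the same expression would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Made-up toy rows that only mimic the notebook's column names
toy = pd.DataFrame({
    'Rooms': [5.0, 6.0, 4.0, 8.0],
    'Bedrooms': [3.0, np.nan, np.nan, 4.0],
    'Year_renovated': [np.nan, np.nan, 2000.0, np.nan],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```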


In [4]:
df.describe()

,Living_space,Lot,Rooms,Bedrooms,Bathrooms,Floors,Year_built,Year_renovated,Garages,Price
count,10552.0,10552.0,10552.0,6878.0,8751.0,7888.0,9858.0,5349.0,8592.0,10552.0
mean,216.721008,1491.659004,7.388978,4.169817,2.308993,2.283976,1958.821465,2010.7063,2.698673,556685.1
std,172.421321,8582.361675,5.378126,2.577169,1.74233,0.821288,55.958072,10.548651,3.195068,608741.0
min,0.0,0.0,1.0,0.0,0.0,0.0,1300.0,1900.0,1.0,0.0
25%,130.0,370.0,5.0,3.0,1.0,2.0,1935.0,2006.0,1.0,250000.0
50%,176.775,656.5,6.0,4.0,2.0,2.0,1971.0,2015.0,2.0,405215.0
75%,250.0,1047.0,8.0,5.0,3.0,3.0,1996.0,2018.0,3.0,655000.0
max,5600.0,547087.0,170.0,61.0,44.0,13.0,2022.0,2206.0,70.0,13000000.0


We can see that there is at least one faulty value in 'Year_renovated', since the max value states that a house was renovated in the year 2206.

But let us start from the beginning.

#### Rooms vs Bedrooms

'Rooms' is defined as rooms with daylight, not counting the kitchen or bathrooms.

We do a quick check to see whether there are any cases where the number of 'Bedrooms' is larger than the number of 'Rooms'.

In [5]:
num_of_rooms = df['Rooms'].values
num_of_bedrooms = df['Bedrooms'].values

In [6]:
strange_rows = 0
index_to_check = []
for i in range(len(num_of_rooms)):
    if num_of_bedrooms[i] > num_of_rooms[i]:
        strange_rows += 1
        index_to_check.append(i)
print(strange_rows)
print(index_to_check)

14
[441, 458, 677, 778, 810, 997, 1322, 1758, 6564, 6890, 6960, 8584, 8796, 10239]
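The same check can probably be written without an explicit loop, using a boolean mask. This is a sketch on invented toy rows, not the notebook's data; the names `toy` and `index_to_check_toy` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up toy rows; only the column names come from the notebook
toy = pd.DataFrame({
    'Rooms': [5.0, 2.5, 6.0, 4.0],
    'Bedrooms': [6.0, 3.0, np.nan, 2.0],
})

# Comparisons against NaN evaluate to False, so rows with a missing
# 'Bedrooms' value are never flagged, just as in the loop
mask = toy['Bedrooms'] > toy['Rooms']
index_to_check_toy = toy.index[mask].tolist()

print(mask.sum())
print(index_to_check_toy)
```

Keeping the mask around also makes it easy to inspect the flagged rows with `toy[mask]`.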


In [7]:
df.loc[index_to_check]

,Living_space,Lot,Rooms,Bedrooms,Bathrooms,Floors,Year_built,Year_renovated,Garages,Condition,Heating,Energy_source,State,Garagetype,Type,Price
441,150.0,498.0,5.0,6.0,1.0,3.0,1936.0,,,modernized,central heating,gas,Baden-Württemberg,,Mid-terrace house,799000.0
458,628.0,646.0,5.0,6.0,8.0,3.0,1715.0,2017.0,6.0,modernized,wood-pellet heating,electricity,Baden-Württemberg,Parking lot,Duplex,645000.0
677,100.0,105.0,2.5,3.0,1.0,3.0,1923.0,,1.0,refurbished,underfloor heating,oil,Baden-Württemberg,Outside parking lot,Mid-terrace house,97500.0
778,157.0,661.0,3.0,4.0,3.0,3.0,1998.0,2015.0,3.0,modernized,,,Baden-Württemberg,Garage,Duplex,529000.0
810,500.0,2228.0,14.0,15.0,3.0,3.0,,,,maintained,stove heating,oil,Baden-Württemberg,,Bungalow,270000.0
997,356.0,3550.0,2.5,6.0,4.0,4.0,1937.0,,2.0,refurbished,stove heating,gas,Baden-Württemberg,Outside parking lot,Villa,7950000.0
1322,528.56,800.0,6.0,12.0,6.0,3.0,1969.0,2020.0,6.0,dilapidated,,oil,Bayern,Underground parking lot,Duplex,2700000.0
1758,611.0,500.0,19.0,20.0,6.0,4.0,1750.0,2019.0,5.0,fixer-upper,stove heating,gas,Bayern,Outside parking lot,Duplex,799900.0
6564,355.12,3912.0,5.5,8.0,5.0,,1964.0,,4.0,modernized,stove heating,gas,Nordrhein-Westfalen,Garage,Duplex,1300000.0
6890,150.0,700.0,2.0,4.0,4.0,3.0,1948.0,2019.0,3.0,by arrangement,stove heating,gas,Nordrhein-Westfalen,Outside parking lot,Duplex,689030.0


Two ways to deal with this:
* Delete the rows in question
* Replace the value for 'Bedrooms' with NaN and deal with this in part 3, with the rest of the missing data.

Remember to repeat this check in part 3C, when we impute values.
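For illustration, the second option (replacing the suspicious 'Bedrooms' values with NaN) could look roughly like this. The toy rows are invented, and this is only a sketch of one possible approach, not necessarily the one used later in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up toy rows; only the column names come from the notebook
toy = pd.DataFrame({
    'Rooms': [5.0, 6.0, 2.5],
    'Bedrooms': [6.0, 3.0, 3.0],
})

# Flag rows where 'Bedrooms' exceeds 'Rooms' and blank out only that value
mask = toy['Bedrooms'] > toy['Rooms']
toy.loc[mask, 'Bedrooms'] = np.nan

print(toy)
```

All rows are kept this way; only the implausible 'Bedrooms' entries become missing values.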


#### Year_built

We can see that there are houses with a 'Year_built' value greater than 2020. Since the max value is 2022, this is probably not an error but rather the expected completion year of new construction.

In [8]:
df['Year_built'].describe()

count    9858.000000
mean     1958.821465
std        55.958072
min      1300.000000
25%      1935.000000
50%      1971.000000
75%      1996.000000
max      2022.000000
Name: Year_built, dtype: float64

We check how many we are talking about...

In [9]:
building_years = df['Year_built'].values

In [10]:
new = 0
new_buildings = []
index = []
for i in range(len(building_years)):
    # NaN fails this comparison, so rows with a missing year are skipped
    if building_years[i] > 2020:
        new += 1
        new_buildings.append(building_years[i])
        index.append(i)
print(str(new) + ' houses had not been completed by the end of 2020.')

66 houses had not been completed by the end of 2020.


In [11]:
print(str(new_buildings.count(2021)) + ' are reported to be completed in 2021.')
print(str(new_buildings.count(2022)) + ' are reported to be completed in 2022.')

61 are reported to be completed in 2021.
5 are reported to be completed in 2022.
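A more compact way to get the same per-year counts is likely `value_counts` on the filtered column. The sketch below uses an invented `toy_years` series instead of the real 'Year_built' column.

```python
import numpy as np
import pandas as pd

# Made-up build years, including a missing value
toy_years = pd.Series(
    [1994.0, 2021.0, np.nan, 2022.0, 2021.0, 2005.0], name='Year_built'
)

# Keep only years after 2020 (NaN is dropped by the comparison), then count per year
late = toy_years[toy_years > 2020]
counts = late.value_counts().sort_index()

print(len(late))
print(counts)
```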


In [12]:
#df.loc[index]

We have decided to accept this.

#### Year_renovated

As seen again below, the max value is 2206, which has to be an error in the data.

In [13]:
df['Year_renovated'].describe()

count    5349.000000
mean     2010.706300
std        10.548651
min      1900.000000
25%      2006.000000
50%      2015.000000
75%      2018.000000
max      2206.000000
Name: Year_renovated, dtype: float64

We choose to replace all values greater than 2021 with a missing value.

In [14]:
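# Assign through .loc: writing into the array returned by .values only
# changes df when that array is a view, which pandas does not guarantee
df.loc[df['Year_renovated'] > 2021, 'Year_renovated'] = np.nan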

A final check to see that it is done.

In [15]:
df['Year_renovated'].describe()

count    5346.000000
mean     2010.664796
std        10.204488
min      1900.000000
25%      2006.000000
50%      2015.000000
75%      2018.000000
max      2020.000000
Name: Year_renovated, dtype: float64

#### Year_built vs Year_renovated

The next thing to check is the relationship between 'Year_built' and 'Year_renovated'.

A house cannot be renovated before it was built.

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10552 entries, 0 to 10551
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Living_space    10552 non-null  float64
 1   Lot             10552 non-null  float64
 2   Rooms           10552 non-null  float64
 3   Bedrooms        6878 non-null   float64
 4   Bathrooms       8751 non-null   float64
 5   Floors          7888 non-null   float64
 6   Year_built      9858 non-null   float64
 7   Year_renovated  5346 non-null   float64
 8   Garages         8592 non-null   float64
 9   Condition       10229 non-null  object 
 10  Heating         9968 non-null   object 
 11  Energy_source   9325 non-null   object 
 12  State           10551 non-null  object 
 13  Garagetype      8592 non-null   object 
 14  Type            10150 non-null  object 
 15  Price           10552 non-null  float64
dtypes: float64(10), object(6)
memory usage: 1.3+ MB


In [17]:
year_r = df['Year_renovated'].values
year_b = df['Year_built'].values

In [18]:
len(year_r)

10552

In [19]:
strange_rows_2 = 0
index_to_check_2 = []
for i in range(len(year_r)):
    if year_b[i] > year_r[i]:
        strange_rows_2 += 1
        index_to_check_2.append(i)
print(strange_rows_2)
print(index_to_check_2)

11
[972, 3367, 3978, 4740, 6842, 7459, 7851, 8135, 8684, 10361, 10379]


In [20]:
df.loc[index_to_check_2]

,Living_space,Lot,Rooms,Bedrooms,Bathrooms,Floors,Year_built,Year_renovated,Garages,Condition,Heating,Energy_source,State,Garagetype,Type,Price
972,246.0,443.0,10.0,5.0,4.0,3.0,2019.0,2000.0,3.0,maintained,stove heating,oil,Baden-Württemberg,Duplex lot,Duplex,770000.0
3367,120.0,694.0,6.0,4.0,1.0,2.0,2006.0,2004.0,1.0,modernized,stove heating,gas,Bremen,Garage,Mid-terrace house,729000.0
3978,135.17,655.0,5.0,4.0,2.0,2.0,2020.0,2019.0,2.0,,stove heating,"solar, gas",Hessen,Outside parking lot,Mid-terrace house,471000.0
4740,314.0,9188.0,7.0,5.0,3.0,2.0,2015.0,1998.0,1.0,fixer-upper,heat pump,gas,Mecklenburg-Vorpommern,Garage,Mid-terrace house,765000.0
6842,145.0,800.0,6.0,5.0,1.0,3.0,2012.0,2000.0,1.0,modernized,stove heating,gas,Nordrhein-Westfalen,Garage,Mid-terrace house,379000.0
7459,280.0,911.0,10.0,8.0,3.0,4.0,2013.0,2010.0,2.0,renovated,stove heating,liquefied petroleum gas,Nordrhein-Westfalen,Outside parking lot,Mid-terrace house,449000.0
7851,131.0,782.0,5.0,4.0,2.0,2.0,2021.0,2020.0,4.0,,stove heating,electricity,Rheinland-Pfalz,Parking lot,Mid-terrace house,382500.0
8135,113.0,136.0,5.0,3.0,1.0,3.0,1940.0,1917.0,,renovated,stove heating,oil,Rheinland-Pfalz,,Single dwelling,189000.0
8684,280.0,620.0,8.0,5.0,3.0,2.0,2016.0,2015.0,,modernized,stove heating,gas,Rheinland-Pfalz,,Duplex,1250000.0
10361,180.0,827.0,6.0,4.0,2.0,1.0,2004.0,1974.0,2.0,renovated,heat pump,gas,Schleswig-Holstein,Parking lot,,650000.0


Three ways to deal with this:
* Delete the rows in question
* Replace the values with NaN and deal with this in part 3, with the rest of the missing data.
* Swap the values in the two columns
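To make the third option concrete, here is a minimal sketch of exchanging the two year values on the flagged rows. It runs on invented toy rows and assumes that the two values really were entered in the wrong columns, which the data alone cannot confirm.

```python
import numpy as np
import pandas as pd

# Made-up toy rows; only the column names come from the notebook
toy = pd.DataFrame({
    'Year_built': [2019.0, 1900.0, 2006.0, 1968.0],
    'Year_renovated': [2000.0, 2000.0, 2004.0, np.nan],
})

# Rows where the build year is later than the renovation year
mask = toy['Year_built'] > toy['Year_renovated']
cols = ['Year_built', 'Year_renovated']

# .to_numpy() drops the column labels, so the reversed columns are
# assigned by position instead of being re-aligned by name
toy.loc[mask, cols] = toy.loc[mask, cols[::-1]].to_numpy()

print(toy)
```

Rows with a missing renovation year are left untouched, because the comparison with NaN is False.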


### Removing rows

We choose to remove the rows in both cases above, and we do it all at once.

In [21]:
rows_to_drop = index_to_check + index_to_check_2
print(len(rows_to_drop))
rows_to_drop

25


[441,
 458,
 677,
 778,
 810,
 997,
 1322,
 1758,
 6564,
 6890,
 6960,
 8584,
 8796,
 10239,
 972,
 3367,
 3978,
 4740,
 6842,
 7459,
 7851,
 8135,
 8684,
 10361,
 10379]

In [22]:
df.drop(rows_to_drop, inplace=True)

A quick check...

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10527 entries, 0 to 10551
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Living_space    10527 non-null  float64
 1   Lot             10527 non-null  float64
 2   Rooms           10527 non-null  float64
 3   Bedrooms        6854 non-null   float64
 4   Bathrooms       8726 non-null   float64
 5   Floors          7865 non-null   float64
 6   Year_built      9834 non-null   float64
 7   Year_renovated  5326 non-null   float64
 8   Garages         8573 non-null   float64
 9   Condition       10206 non-null  object 
 10  Heating         9946 non-null   object 
 11  Energy_source   9301 non-null   object 
 12  State           10526 non-null  object 
 13  Garagetype      8573 non-null   object 
 14  Type            10126 non-null  object 
 15  Price           10527 non-null  float64
dtypes: float64(10), object(6)
memory usage: 1.4+ MB
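Besides comparing row counts, one could re-run both consistency checks on the cleaned frame and assert that nothing is flagged any more. The sketch below demonstrates the idea on invented toy rows; `toy`, `cleaned`, `bad_bedrooms` and `bad_years` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up toy rows; only the column names come from the notebook
toy = pd.DataFrame({
    'Rooms': [5.0, 2.5, 6.0, 4.0, 7.0],
    'Bedrooms': [6.0, 2.0, np.nan, 2.0, 3.0],
    'Year_built': [1936.0, 1923.0, 2019.0, np.nan, 1998.0],
    'Year_renovated': [np.nan, 2001.0, 2000.0, 2010.0, 2015.0],
})

# Collect the index labels flagged by each check and drop them together
bad_bedrooms = toy.index[toy['Bedrooms'] > toy['Rooms']].tolist()
bad_years = toy.index[toy['Year_built'] > toy['Year_renovated']].tolist()
cleaned = toy.drop(bad_bedrooms + bad_years)

# After dropping, neither check should flag any remaining row
assert not (cleaned['Bedrooms'] > cleaned['Rooms']).any()
assert not (cleaned['Year_built'] > cleaned['Year_renovated']).any()

print(len(toy) - len(cleaned), 'rows removed')
```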


### Saving the changes

We save our changes to a new CSV file for the next step in our data preprocessing.

In [24]:
df.to_csv('germany_housing_data_part2_after_cleaning.csv', index=False)