# Data preprocessing example

Data preprocessing is one of the most important stages in the whole data life cycle.
If you work with data, you know that we very rarely get clean, well-structured data first-hand.
In this article I'm going to show you how to preprocess data using pandas.
I'm going to use a dataset that contains flat rental offers scraped from a Polish website for one particular city, Poznań.
So here's the dataset:

In [None]:
import pandas as pd
import numpy as np
df = pd.read_parquet('data.parquet')

In [None]:
df

My approach to a problem like this is to call the unique method on each pandas Series and simply print and examine the unique values in every column at a glance.
We can achieve this with the following code:

In [None]:
for column in df.columns:
    print(column)
    print(df[column].unique())
    print()
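For wide datasets, printing every unique value can get long. A compact overview is one possible complement. The sketch below uses a small made-up DataFrame (`sample` and its values are hypothetical, not taken from the real dataset) and counts distinct and missing values per column:

```python
import pandas as pd

# Hypothetical sample standing in for the scraped dataset
sample = pd.DataFrame({
    'fees': ['800 zł', '1 200 zł', None, '800 zł'],
    'floor': ['parter', '1', '2', '1'],
})

# One row per column: number of distinct non-missing values and number of missing values
overview = pd.DataFrame({
    'n_unique': sample.nunique(),
    'n_missing': sample.isna().sum(),
})
print(overview)
```

Columns with only a handful of distinct values are then good candidates for a closer look with unique.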

Now I'm going to explain the preprocessing steps for every column:

**fees**

We can create a simple function which iterates over every character in the string and joins only the numeric ones; if the value is None (an empty value), the function returns [np.nan](https://numpy.org/doc/stable/reference/constants.html?highlight=nan#numpy.nan):

In [None]:
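def get_numeric_values_from_string(s):
    """Keep only the numeric characters of s and return them as a float."""
    if s is None:
        return np.nan
    digits = ''.join(v for v in s if v.isnumeric())
    # A string without any digits has no numeric value to extract
    return float(digits) if digits else np.nan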

Simple test:

In [None]:
def test_get_numeric_values_from_string():
    s = '800 zł'
    actual = get_numeric_values_from_string(s)
    expected = 800
    assert actual == expected

test_get_numeric_values_from_string()

Finally, we use the map method to apply the function written above:

In [None]:
df.fees = df.fees.map(get_numeric_values_from_string)
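Note that joining digits alone also drops any decimal separator, so a value such as '1 200,50 zł' would become 120050. If the real data contains decimal amounts (an assumption; the fees here may all be whole numbers), a regex-based variant is one option. The name `parse_amount` is hypothetical, and the sketch assumes thousands are separated by spaces rather than dots:

```python
import math
import re

def parse_amount(s):
    """Parse an amount such as '1 200,50 zł' into a float (hypothetical helper)."""
    if s is None:
        return math.nan
    # Remove all whitespace, including non-breaking spaces used as thousands separators
    compact = re.sub(r'\s+', '', s)
    # First run of digits, optionally followed by a decimal comma or dot and more digits
    match = re.search(r'\d+(?:[.,]\d+)?', compact)
    if match is None:
        return math.nan
    return float(match.group().replace(',', '.'))

print(parse_amount('800 zł'))
print(parse_amount('1 200,50 zł'))
```

Like the function above, it is a plain Python function, so it can be tested in the same way before being passed to map.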

**number_of_rooms, year_built and number_of_parking_spaces**


Let's look at the data types before we do anything:


In [None]:
df.dtypes

The columns number_of_rooms, year_built and number_of_parking_spaces should be numeric. Why?

Let's look at the unique values in these columns again:

In [None]:
columns_to_check = ['number_of_rooms', 'year_built', 'number_of_parking_spaces']

for column_to_check in columns_to_check:
    print(column_to_check)
    print(df[column_to_check].unique())
    print()

Let's create a Python dictionary where each key is the name of a column we want to convert and each value is the data type that column should be converted to:

In [None]:
convert_dict = {
    'number_of_rooms': float, 
    'year_built': float, 
    'number_of_parking_spaces': float
}

df = df.astype(convert_dict)

So the data types now look like this:

In [None]:
df.dtypes
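Casting to float is the simplest way to allow missing values, but it makes whole numbers display as 1998.0. A possible alternative, sketched here on made-up values rather than the real column, is the pandas nullable integer dtype `Int64`:

```python
import pandas as pd

# Hypothetical values in the shape of the year_built column
year_built = pd.Series(['1998', '2015', None], dtype=object)

# Parse to float first (missing values become NaN), then switch to the nullable integer dtype
year_built_int = pd.to_numeric(year_built).astype('Int64')
print(year_built_int)
```

Whether this is worth it depends on what consumes the data afterwards; some libraries still expect plain float columns.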

**number_of_floors_in_the_building and floor**

We need to replace some values in these two columns.
To do so, I'm going to create a Python dictionary that will be used as a mapper: each key is a value we want to replace, and its value is the new value we want to set. Here's the code:

In [None]:
mapper: dict = {
    '0 (parter)': 0
}
df.number_of_floors_in_the_building = df.number_of_floors_in_the_building.replace(mapper)

The same approach works for the floor column:

In [None]:
mapper: dict = {
    'parter': 0,
    'low parter': 0
}
df.floor = df.floor.replace(mapper)

Now we can convert both columns to float and take a quick look at the data types:

In [None]:
convert_dict = {
    'number_of_floors_in_the_building': float, 
    'floor': float, 
}

df = df.astype(convert_dict)

In [None]:
df.dtypes

Fantastic! Let's go further
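As an aside, the cast to float raises an error if the mapper misses a label. One way to check coverage beforehand is sketched below on hypothetical values ('poddasze' is a made-up leftover label, not something observed in the dataset):

```python
import pandas as pd

# Hypothetical raw values; dtype=object mimics a column of mixed strings
floor = pd.Series(['parter', '1', 'low parter', '4', 'poddasze'], dtype=object)
mapper = {'parter': 0, 'low parter': 0}

replaced = floor.replace(mapper)
# Values that are present but still cannot be parsed as numbers
is_unparsed = pd.to_numeric(replaced, errors='coerce').isna() & replaced.notna()
unparsed = replaced[is_unparsed]
print(unparsed.unique())
```

Any label that shows up here needs either a new entry in the mapper or a decision to treat it as missing.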

**area_in_m2**

My approach is to create a new function:

In [None]:
def convert_area_in_m2_to_numeric(s):
    return float(s.replace('m2', '').replace(',', '.').strip())

Then we can write a simple test for this function:

In [None]:
def test_convert_area_in_m2_to_numeric():
    s = '100,57 m2'
    actual = convert_area_in_m2_to_numeric(s)
    expected = 100.57
    assert actual == expected
    
test_convert_area_in_m2_to_numeric()

Finally, we use the map method to apply it to the pandas Series:

In [None]:
df.area_in_m2 = df.area_in_m2.map(convert_area_in_m2_to_numeric)

Of course, we could instead use vectorized string methods on the pandas Series and write it like this:

In [None]:
# df.area_in_m2 = df.area_in_m2.str.replace('m2', '').str.replace(',', '.').str.strip().astype('float')

But if we're going to use this in, for instance, an ETL pipeline, my recommendation is to define a function. The biggest advantage of writing a function is that we can test it, which is extremely important!
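The two approaches can also coexist. As a rough sketch, the tested function can serve as a reference for checking the vectorized version on a few made-up values (the `area` Series below is hypothetical):

```python
import pandas as pd

def convert_area_in_m2_to_numeric(s):
    return float(s.replace('m2', '').replace(',', '.').strip())

# Hypothetical sample values in the same format as the real column
area = pd.Series(['100,57 m2', '45 m2', '32,5 m2'], dtype=object)

mapped = area.map(convert_area_in_m2_to_numeric)
# regex=False makes str.replace treat the pattern as a literal string
vectorized = (
    area.str.replace('m2', '', regex=False)
        .str.replace(',', '.', regex=False)
        .str.strip()
        .astype(float)
)
# Both approaches should produce identical results
assert mapped.equals(vectorized)
```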

**location**

The last column that needs preprocessing is location. Again, there are two approaches: write a function and test it, or use vectorized string methods.

First approach:

New function:

In [None]:
def get_location(s):
    return ' '.join(s.strip().split())

Test:

In [None]:
def test_get_location() -> None:
    s = '''

    Poznań,       Stare Miasto,    wielkopolskie
    
    '''
    actual = get_location(s)
    expected = 'Poznań, Stare Miasto, wielkopolskie'
    assert actual == expected
    
test_get_location()

Use the map method on the pandas Series:

In [None]:
df.location = df.location.map(get_location)

Second approach (not my recommendation, but also valid):

In [None]:
# df.location = df.location.str.strip().str.split().apply(lambda x: ' '.join(x))
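Another vectorized variant, shown only as a sketch on made-up values, collapses runs of whitespace with a regular expression. One difference from the function-based approach is that missing values pass through the `.str` methods instead of raising an error:

```python
import pandas as pd

# Hypothetical raw values, including one missing entry
location = pd.Series(
    ['\n  Poznań,       Stare Miasto,    wielkopolskie\n  ', None],
    dtype=object,
)

# Collapse every run of whitespace into a single space, then trim both ends
normalized = location.str.replace(r'\s+', ' ', regex=True).str.strip()
print(normalized)
```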

Let's use the unique method to see the locations:

In [None]:
df.location.unique()

One of the values, 'Poznań, wielkopolskie', doesn't carry any information: we already know that we're dealing with flats from Poznań, and wielkopolskie is the name of the voivodeship. Let's replace it with [np.nan](https://numpy.org/doc/stable/reference/constants.html?highlight=nan#numpy.nan):

In [None]:
df.location = df.location.replace({'Poznań, wielkopolskie': np.nan})

The last thing we want to do is extract the district from location. We can do this easily by picking the second element (index 1) of the split location. To do so, let's define a new function:

In [None]:
def get_district(s):
    # pd.isna covers both None and NaN; an identity check against np.nan can miss some NaN values
    if pd.isna(s):
        return np.nan
    else:
        splited: list = s.split(', ')
        return splited[1]

Of course, **test it**:

In [None]:
def test_get_district():
    s = 'Poznań, Stare Miasto, wielkopolskie'
    actual = get_district(s)
    expected = 'Stare Miasto'
    assert actual == expected
    
test_get_district()

And use it:

In [None]:
df['district'] = df.location.map(get_district)
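For comparison, a vectorized sketch of the same extraction on made-up values. The last entry, a location without a district, is a hypothetical edge case: indexing the list directly with `[1]` would raise an IndexError for it, while `.str[1]` yields NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned locations, including a missing value and a value without a district
location = pd.Series(
    ['Poznań, Stare Miasto, wielkopolskie', 'Poznań, Jeżyce, wielkopolskie', np.nan, 'Poznań'],
    dtype=object,
)

# .str[1] picks the second element of each split list and yields NaN when it does not exist
district = location.str.split(', ').str[1]
print(district)
```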

Hurray!

Preprocessing is finished and the final dataset looks great. Let's take a look:

In [None]:
df

Now the further use cases for this dataset are almost **unlimited**, for example:

* building a machine learning model
* storing it in a data warehouse and using it for BI
* creating data visualizations