# Data preparation

Let's take a first look at the raw data and perform some preliminary operations before going into the details of the ML algorithm.

We are going to load a copy of the raw dataset from a GitHub repo (it is better to keep the original dataset in a safe place, so that hours of data collection are not put at risk).

## 1. Loading the dataset

In [64]:
import pandas as pd

In [65]:
url = 'https://raw.githubusercontent.com/MarcoCollesei/RentinBo/master/copy_of_dataset.csv'
copy_of_dataset = pd.read_csv(url)
copy_of_dataset

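| | tipologia | genere | zona | bagno | cucina | salotto | balcone | euro | range euro |
|---|---|---|---|---|---|---|---|---|---|
| 0 | singola | F | San Donato | 1 | 1 | 1 | 0 | 320 | 300-325 |
| 1 | singola | M/F | Marconi | 1 | 1 | 1 | 0 | 465 | 450-475 |
| 2 | singola | M/F | Saffi | 1 | 1 | 1 | 0 | 430 | 425-450 |
| 3 | singola | F | Irnerio | 2 | 1 | 1 | 1 | 550 | 550-575 |
| 4 | singola | F | S.Vitale | 1 | 1 | 0 | 1 | 400 | 400-425 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | doppia | M/F | Bolognina | 1 | 1 | 0 | 2 | 237 | 225-250 |
| 196 | doppia | M/F | Malpighi | 2 | 1 | 1 | 0 | 250 | 250-275 |
| 197 | doppia | M/F | Bolognina | 1 | 1 | 1 | 0 | 250 | 250-275 |
| 198 | doppia | F | Malpighi | 1 | 1 | 1 | 0 | 350 | 350-375 |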


Here's our dataset: rental prices for single and double rooms in Bologna.
Each row contains real information that I personally collected by browsing Subito.it.

## 2. Inspect the dataset dimensions


Simply take a look at the shape of 'copy_of_dataset'.

In [66]:
shape = copy_of_dataset.shape
shape

(200, 9)

In [67]:
class_counts = copy_of_dataset.groupby('tipologia').size()
print(class_counts)

tipologia
doppia     100
singola    100
dtype: int64


We have 200 rows, split evenly between single and double rooms, and 9 columns: the first 7 are the features and the last two describe the price, which is our target. (We'll drop the second-to-last column, 'euro', later, because our purpose is to predict a range of rent prices, not an exact value. This choice reflects the fact that rent prices depend on homeowners' personal decisions and on other factors that are not always included in ads.)
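As a minimal sketch of the split described above, and assuming the column order shown in the table, the features and the target could be separated as follows. The `toy` frame and the names `X` and `y` are illustrative and not part of the original notebook; the two rows are copied from the table displayed earlier.

```python
import pandas as pd

# Two rows with the same column layout as the dataset (copied from the table above).
toy = pd.DataFrame({
    'tipologia': ['singola', 'doppia'],
    'genere': ['F', 'M/F'],
    'zona': ['San Donato', 'Bolognina'],
    'bagno': [1, 1],
    'cucina': [1, 1],
    'salotto': [1, 0],
    'balcone': [0, 2],
    'euro': [320, 237],
    'range euro': ['300-325', '225-250'],
})

# The first 7 columns are the features.
X = toy.iloc[:, :7]
# The price range is the label to predict; 'euro' is left out on purpose.
y = toy['range euro']

print(X.shape, y.shape)
```

Selecting by position keeps the sketch short, but selecting by column name would be more robust if the column order ever changes.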

## 3. Get an idea of the types in the dataset

In [68]:
types = copy_of_dataset.dtypes
print(types)

tipologia     object
genere        object
zona          object
bagno          int64
cucina         int64
salotto        int64
balcone        int64
euro           int64
range euro    object
dtype: object


The result was fairly predictable, but it's better not to leave anything to chance.
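Since the dtypes tell text columns apart from numeric ones, they can also be used to build the two lists of column names programmatically. This is only a sketch on a toy frame; the names `numerical_cols` and `categorical_cols` are assumptions introduced here for illustration.

```python
import pandas as pd

# Toy frame mirroring the dtypes printed above (text columns plus int64 columns).
toy = pd.DataFrame({
    'tipologia': ['singola', 'doppia'],
    'genere': ['F', 'M/F'],
    'zona': ['San Donato', 'Bolognina'],
    'bagno': [1, 1],
    'cucina': [1, 1],
    'salotto': [1, 0],
    'balcone': [0, 2],
    'euro': [320, 237],
    'range euro': ['300-325', '225-250'],
})

# Numeric columns (int64 here) versus everything else (the text columns).
numerical_cols = toy.select_dtypes(include='number').columns.tolist()
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numerical_cols)
print(categorical_cols)
```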

## 4. Distribution of features

Let's dive into the dataset and get a feel for how each column is distributed.
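The per-column counts in this section are computed one cell at a time, but the same `groupby(...).size()` call can be run in a single loop. Here is a hedged sketch on a small made-up frame; the `toy` data and the `counts` dictionary are illustrative only.

```python
import pandas as pd

# Made-up rows, just to show the pattern.
toy = pd.DataFrame({
    'genere': ['F', 'M/F', 'M/F', 'F'],
    'bagno': [1, 1, 2, 1],
})

# One pass over the columns of interest instead of one cell per column.
counts = {col: toy.groupby(col).size() for col in toy.columns}

for col, series in counts.items():
    print(series, end='\n\n')
```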

In [69]:
class_counts = copy_of_dataset.groupby('genere').size()
print(class_counts)

genere
F       74
M       21
M/F    105
dtype: int64


In [70]:
class_counts = copy_of_dataset.groupby('zona').size()
print(class_counts)

zona
Barca               2
Bolognina          20
Borgo Panigale      1
Colli               2
Corticella          4
Costa Saragozza    15
Galvani             6
Irnerio            23
Lame                8
Malpighi            9
Marconi            20
Mazzini            17
Murri              17
S.Ruffillo          3
S.Viola             2
S.Vitale           12
Saffi              14
San Donato         25
dtype: int64


In [71]:
class_counts = copy_of_dataset.groupby('bagno').size()
print(class_counts)

bagno
1    166
2     33
3      1
dtype: int64


In [72]:
class_counts = copy_of_dataset.groupby('cucina').size()
print(class_counts)

cucina
1    197
2      3
dtype: int64


In [73]:
class_counts = copy_of_dataset.groupby('salotto').size()
print(class_counts)

salotto
0    137
1     63
dtype: int64


In [74]:
class_counts = copy_of_dataset.groupby('balcone').size()
print(class_counts)

balcone
0    122
1     71
2      5
3      2
dtype: int64


In [75]:
class_counts = copy_of_dataset.groupby('range euro').size()
print(class_counts)

range euro
175-200     3
200-225    12
225-250    20
250-275    30
275-300    13
300-325    21
325-350    14
350-375    24
375-400     8
400-425    24
425-450     4
450-475    16
500-525     6
525-550     1
550-575     3
650-675     1
dtype: int64


To be honest, the last distribution is not very useful, since it mixes the ranges for single and double rooms, and prices differ considerably between the two types.
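One way to look at the price ranges separately for each room type is a cross-tabulation. The sketch below uses `pd.crosstab` on a few made-up rows; the `toy` frame and the name `per_type` are assumptions for illustration, and this step is not part of the original notebook.

```python
import pandas as pd

# Made-up rows with the two columns of interest.
toy = pd.DataFrame({
    'tipologia': ['singola', 'singola', 'doppia', 'doppia', 'doppia'],
    'range euro': ['300-325', '450-475', '225-250', '250-275', '250-275'],
})

# Rows are price ranges, columns are room types, cells are counts.
per_type = pd.crosstab(toy['range euro'], toy['tipologia'])
print(per_type)
```

Each column of `per_type` is then the distribution of price ranges for one room type, so the two can be compared side by side.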

## 5. Correlations

To check the quality of our data, we need to see whether any of our numerical features are correlated.

In [76]:
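# numeric_only=True skips the text columns, which pandas 2.0+ would otherwise reject
correlations = copy_of_dataset.corr(method='pearson', numeric_only=True)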
correlations

| | bagno | cucina | salotto | balcone | euro |
|---|---|---|---|---|---|
| bagno | 1.0 | 0.259115 | 0.054106 | 0.16588 | -0.008089 |
| cucina | 0.259115 | 1.0 | -0.083683 | 0.185856 | -0.090096 |
| salotto | 0.054106 | -0.083683 | 1.0 | 0.046829 | -0.11683 |
| balcone | 0.16588 | 0.185856 | 0.046829 | 1.0 | -0.049141 |
| euro | -0.008089 | -0.090096 | -0.11683 | -0.049141 | 1.0 |


At least among the numerical features, there are clearly no strong linear correlations: the largest coefficient in absolute value is about 0.26, between 'bagno' and 'cucina'.
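Rather than scanning the matrix by eye, the strongest pair can be extracted programmatically. The sketch below rebuilds the matrix from the values printed above and looks for the largest off-diagonal coefficient in absolute value; the names `strongest_pair` and `strongest_value` are illustrative.

```python
import numpy as np
import pandas as pd

cols = ['bagno', 'cucina', 'salotto', 'balcone', 'euro']
# Coefficients copied from the correlation table above.
correlations = pd.DataFrame(
    [
        [1.0, 0.259115, 0.054106, 0.16588, -0.008089],
        [0.259115, 1.0, -0.083683, 0.185856, -0.090096],
        [0.054106, -0.083683, 1.0, 0.046829, -0.11683],
        [0.16588, 0.185856, 0.046829, 1.0, -0.049141],
        [-0.008089, -0.090096, -0.11683, -0.049141, 1.0],
    ],
    index=cols,
    columns=cols,
)

# Work on absolute values and zero the diagonal (each column vs itself is always 1).
abs_corr = correlations.abs().to_numpy().copy()
np.fill_diagonal(abs_corr, 0.0)

# Position of the largest remaining entry, mapped back to column names.
i, j = np.unravel_index(abs_corr.argmax(), abs_corr.shape)
strongest_pair = (cols[i], cols[j])
strongest_value = correlations.iloc[i, j]

print(strongest_pair, strongest_value)
```

Note that Pearson's coefficient only captures linear relationships, so a low value does not rule out other kinds of dependence.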

## 6. Preparing the dataset

Before choosing an ML algorithm, we still have to work a bit on our dataset.

As we saw at the beginning, we want to predict a range of rent prices and not an exact value, which is why we are going to drop the 'euro' column.

In [77]:
copy_of_dataset = copy_of_dataset.drop(columns='euro')
copy_of_dataset

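| | tipologia | genere | zona | bagno | cucina | salotto | balcone | range euro |
|---|---|---|---|---|---|---|---|---|
| 0 | singola | F | San Donato | 1 | 1 | 1 | 0 | 300-325 |
| 1 | singola | M/F | Marconi | 1 | 1 | 1 | 0 | 450-475 |
| 2 | singola | M/F | Saffi | 1 | 1 | 1 | 0 | 425-450 |
| 3 | singola | F | Irnerio | 2 | 1 | 1 | 1 | 550-575 |
| 4 | singola | F | S.Vitale | 1 | 1 | 0 | 1 | 400-425 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | doppia | M/F | Bolognina | 1 | 1 | 0 | 2 | 225-250 |
| 196 | doppia | M/F | Malpighi | 2 | 1 | 1 | 0 | 250-275 |
| 197 | doppia | M/F | Bolognina | 1 | 1 | 1 | 0 | 250-275 |
| 198 | doppia | F | Malpighi | 1 | 1 | 1 | 0 | 350-375 |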


Moreover, we would like to shuffle our dataset before feeding it to the ML algorithm.
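A common way to shuffle a DataFrame is to sample all of its rows without replacement. The following is a minimal sketch on made-up rows, not necessarily the approach used later in this project; the `toy` frame, the name `shuffled`, and the seed value 42 are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up rows ordered by room type, like the real dataset.
toy = pd.DataFrame({
    'tipologia': ['singola', 'singola', 'doppia', 'doppia'],
    'range euro': ['300-325', '450-475', '225-250', '250-275'],
})

# frac=1 draws every row exactly once, i.e. a random permutation of the rows.
# random_state makes the shuffle reproducible; reset_index discards the old order.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```

Shuffling matters here because the rows are sorted by room type, so an unshuffled train/test split could end up with only one type in each part.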