### Sections <a class="anchor" id="sections"></a>

- [1. Getting started](#section1)
    - [1-1. Drop first column 'Unnamed: 0'](#section1.1)
    - [1-2. Clean feature - floor](#section1.2)
    - [1-3. Clean feature - animal](#section1.3)
    - [1-4. Clean feature - furniture](#section1.4)
    - [1-5. Clean multiple features - hoa, rent amount, property tax, fire insurance, total](#section1.5)
- [2. Convert all column dtypes to integers](#section2)
    - [2-1. Value error - invalid literal for int( ) with base 10: 'Sem info'](#section2.1)
    - [2-2. Value error - invalid literal for int( ) with base 10: 'Incluso'](#section2.2)
    - [2-3. Convert all column dtypes to integers II](#section2.3)
- [3. Shuffle the data](#section3)
- [4. Split the data](#section4)
    - [4-1. Scale features before train test split](#section4.1)
    - [4-2. Train test split](#section4.2)
- [5. Train the models](#section5)
    - [5-1. Logistic Regression classifier](#section5.1)
    - [5-2. C-Support Vector Classification](#section5.2)
    - [5-3. Multi-layer Perceptron classifier](#section5.3)
- [6. Percentage of dataset that is positive (ie, label is city)](#section6)
- [7. Predict class labels for samples in X_test](#section7)
- [8. F-score](#section8)

In [1]:
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

### 1. Getting started<a class="anchor" id="section1"></a>

In [2]:
df = pd.read_csv('data/houseData.csv')
print(df.shape)
display(df.head())

(6080, 14)


Unnamed: 0.1,Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa,rent amount,property tax,fire insurance,total
0,0,1,240,3,3,4,-,acept,furnished,R$0,"R$8,000","R$1,000",R$121,"R$9,121"
1,1,0,64,2,1,1,10,acept,not furnished,R$540,R$820,R$122,R$11,"R$1,493"
2,2,1,443,5,5,4,3,acept,furnished,"R$4,172","R$7,000","R$1,417",R$89,"R$12,680"
3,3,1,73,2,2,1,12,acept,not furnished,R$700,"R$1,250",R$150,R$16,"R$2,116"
4,4,1,19,1,1,0,-,not acept,not furnished,R$0,"R$1,200",R$41,R$16,"R$1,257"


#### 1-1. Drop first column 'Unnamed: 0'<a class="anchor" id="section1.1"></a>

In [3]:
df.drop(df.columns[0], axis=1, inplace=True)
display(df.head())

Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa,rent amount,property tax,fire insurance,total
0,1,240,3,3,4,-,acept,furnished,R$0,"R$8,000","R$1,000",R$121,"R$9,121"
1,0,64,2,1,1,10,acept,not furnished,R$540,R$820,R$122,R$11,"R$1,493"
2,1,443,5,5,4,3,acept,furnished,"R$4,172","R$7,000","R$1,417",R$89,"R$12,680"
3,1,73,2,2,1,12,acept,not furnished,R$700,"R$1,250",R$150,R$16,"R$2,116"
4,1,19,1,1,0,-,not acept,not furnished,R$0,"R$1,200",R$41,R$16,"R$1,257"


#### 1-2. Clean feature - floor<a class="anchor" id="section1.2"></a>

In [4]:
# Return unique column values
print(df['floor'].unique())

['-' '10' '3' '12' '2' '16' '6' '4' '1' '7' '13' '9' '14' '5' '8' '15'
 '11' '19' '20' '24' '23' '17' '18' '22' '27' '85' '28' '25' '29' '35'
 '21' '31' '99' '26' '68' '32' '51']


In [5]:
# Replace '-' symbol with 0 (assign back rather than modifying a column selection in place)
df['floor'] = df['floor'].replace(to_replace='-', value=0)

#### 1-3. Clean feature - animal<a class="anchor" id="section1.3"></a>

In [6]:
# Return unique column values
print(df['animal'].unique())

['acept' 'not acept']


In [7]:
# Replace 'acept' value with 1
df['animal'] = df['animal'].replace('acept', 1)

# Replace 'not acept' value with 0
df['animal'] = df['animal'].replace('not acept', 0)

#### 1-4. Clean feature - furniture<a class="anchor" id="section1.4"></a>

In [8]:
# Return unique column values
print(df['furniture'].unique())

['furnished' 'not furnished']


In [9]:
# Replace 'furnished' value with 1
df['furniture'] = df['furniture'].replace('furnished', 1)

# Replace 'not furnished' value with 0
df['furniture'] = df['furniture'].replace('not furnished', 0)
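
The same encoding can also be written as one dictionary lookup per column. A minimal sketch, assuming the two columns contain only the values printed above; the small `sample` frame and the `binary_maps` name are hypothetical stand-ins, not part of the notebook:

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df, using the values printed above
sample = pd.DataFrame({
    'animal': ['acept', 'not acept', 'acept'],
    'furniture': ['furnished', 'not furnished', 'not furnished'],
})

# One mapping per column: the positive value becomes 1, the negative value 0
binary_maps = {
    'animal': {'acept': 1, 'not acept': 0},
    'furniture': {'furnished': 1, 'not furnished': 0},
}

for column, mapping in binary_maps.items():
    # Series.map turns any value missing from the mapping into NaN,
    # so unexpected categories show up instead of passing through silently
    sample[column] = sample[column].map(mapping)

print(sample)
```

Unlike `replace`, `map` does not leave unknown values untouched, which can make a misspelled or new category easier to spot.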

#### 1-5. Clean multiple features - hoa, rent amount, property tax, fire insurance, total<a class="anchor" id="section1.5"></a>

In [10]:
print(df.columns[8:])

Index(['hoa', 'rent amount', 'property tax', 'fire insurance', 'total'], dtype='object')


In [11]:
# Remove the 'R$' prefix & ',' thousands separators from values
# (raw string so that '\$' is passed to the regex engine unchanged)
for column in df.columns[8:]:
    df[column] = df[column].replace(r'R\$|,', '', regex=True)

### 2. Convert all column dtypes to integers<a class="anchor" id="section2"></a>

In [12]:
# View column dtypes
print(df.dtypes)

city               int64
area               int64
rooms              int64
bathroom           int64
parking spaces     int64
floor             object
animal             int64
furniture          int64
hoa               object
rent amount       object
property tax      object
fire insurance    object
total             object
dtype: object


In [13]:
# Convert all values in the dataframe to integers
df = df.astype(dtype=np.int64)

ValueError: invalid literal for int() with base 10: 'Sem info'

#### 2-1. Value error - invalid literal for int( ) with base 10: 'Sem info'<a class="anchor" id="section2.1"></a>

In [14]:
# Check each column for a specific value
print(df.isin(['Sem info']).any())

city              False
area              False
rooms             False
bathroom          False
parking spaces    False
floor             False
animal            False
furniture         False
hoa                True
rent amount       False
property tax      False
fire insurance    False
total             False
dtype: bool


In [15]:
# Replace 'Sem info' (Portuguese for "no information") value with 0
df['hoa'] = df['hoa'].replace('Sem info', 0)

#### 2-2. Value error - invalid literal for int( ) with base 10: 'Incluso'<a class="anchor" id="section2.2"></a>

In [16]:
# Check each column for a specific value
print(df.isin(['Incluso']).any())

city              False
area              False
rooms             False
bathroom          False
parking spaces    False
floor             False
animal            False
furniture         False
hoa                True
rent amount       False
property tax       True
fire insurance    False
total             False
dtype: bool


In [17]:
# Replace 'Incluso' (Portuguese for "included") value with 0
for column in ['hoa', 'property tax']:
    df[column] = df[column].replace('Incluso', 0)
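
Instead of discovering each non-numeric token through a new `ValueError`, the offending values can be listed in one pass. A sketch, assuming the currency columns have already had 'R$' and ',' stripped; the `sample` series below is made up for illustration:

```python
import pandas as pd

# Hypothetical cleaned column: digits stored as strings plus two text tokens
sample = pd.Series(['0', '540', 'Sem info', '4172', 'Incluso', '700'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
as_numbers = pd.to_numeric(sample, errors='coerce')

# The entries that became NaN are exactly the tokens that break astype(int)
bad_tokens = sample[as_numbers.isna()].unique()
print(bad_tokens)

# Treat those tokens as 0, matching the replacement used in this notebook
cleaned = as_numbers.fillna(0).astype('int64')
print(cleaned.tolist())
```

Whether 0 is the right stand-in for "no information" or "included" is a modelling choice; the sketch only mirrors what the cells above do.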

#### 2-3. Convert all column dtypes to integers II<a class="anchor" id="section2.3"></a>

In [18]:
df = df.astype(dtype=np.int64)

# View column dtypes
print(df.dtypes)

city              int64
area              int64
rooms             int64
bathroom          int64
parking spaces    int64
floor             int64
animal            int64
furniture         int64
hoa               int64
rent amount       int64
property tax      int64
fire insurance    int64
total             int64
dtype: object


### 3. Shuffle the data<a class="anchor" id="section3"></a>

In [19]:
# Sample 100% of the rows in random order, then reset the index & drop the old one
dfShuffled = df.sample(frac=1).reset_index(drop=True)

# Compare
display(df.head(), dfShuffled.head())

Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa,rent amount,property tax,fire insurance,total
0,1,240,3,3,4,0,1,1,0,8000,1000,121,9121
1,0,64,2,1,1,10,1,0,540,820,122,11,1493
2,1,443,5,5,4,3,1,1,4172,7000,1417,89,12680
3,1,73,2,2,1,12,1,0,700,1250,150,16,2116
4,1,19,1,1,0,0,0,0,0,1200,41,16,1257


Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa,rent amount,property tax,fire insurance,total
0,1,82,2,1,1,5,1,0,860,3400,0,44,4304
1,1,230,3,3,3,11,1,0,1900,4080,417,52,6449
2,1,210,5,4,1,0,1,0,0,15000,1667,226,16890
3,1,233,3,4,4,15,0,1,3780,4300,2567,55,10700
4,1,135,2,3,1,22,1,0,1415,6500,339,83,8337


### 4. Split the data<a class="anchor" id="section4"></a>

In [20]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [21]:
df = dfShuffled
display(df.head())

Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa,rent amount,property tax,fire insurance,total
0,1,82,2,1,1,5,1,0,860,3400,0,44,4304
1,1,230,3,3,3,11,1,0,1900,4080,417,52,6449
2,1,210,5,4,1,0,1,0,0,15000,1667,226,16890
3,1,233,3,4,4,15,0,1,3780,4300,2567,55,10700
4,1,135,2,3,1,22,1,0,1415,6500,339,83,8337


In [22]:
# y - label variable
y = df['city']
print(type(y))

# X - feature variables
X = df.drop('city', axis=1)
print(type(X))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6080 entries, 0 to 6079
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   city            6080 non-null   int64
 1   area            6080 non-null   int64
 2   rooms           6080 non-null   int64
 3   bathroom        6080 non-null   int64
 4   parking spaces  6080 non-null   int64
 5   floor           6080 non-null   int64
 6   animal          6080 non-null   int64
 7   furniture       6080 non-null   int64
 8   hoa             6080 non-null   int64
 9   rent amount     6080 non-null   int64
 10  property tax    6080 non-null   int64
 11  fire insurance  6080 non-null   int64
 12  total           6080 non-null   int64
dtypes: int64(13)
memory usage: 617.6 KB


In [24]:
display(df.describe().T)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
city,6080.0,0.863322,0.343535,0.0,1.0,1.0,1.0,1.0
area,6080.0,151.143914,375.559485,10.0,58.0,100.0,200.0,24606.0
rooms,6080.0,2.492599,1.129665,1.0,2.0,3.0,3.0,10.0
bathroom,6080.0,2.341612,1.43886,1.0,1.0,2.0,3.0,10.0
parking spaces,6080.0,1.75625,1.611909,0.0,1.0,1.0,2.0,12.0
floor,6080.0,5.672204,6.168918,0.0,0.0,4.0,9.0,99.0
animal,6080.0,0.767434,0.422502,0.0,1.0,1.0,1.0,1.0
furniture,6080.0,0.260197,0.438778,0.0,0.0,0.0,1.0,1.0
hoa,6080.0,1088.42648,3981.357627,0.0,24.5,650.0,1436.0,220000.0
rent amount,6080.0,4395.844408,3576.668946,420.0,1800.0,3111.0,5952.5,45000.0


#### 4-1. Scale features before train test split<a class="anchor" id="section4.1"></a>

In [25]:
# Create scaler object
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit the scaler to the features (learns each column's min & max)
scaler.fit(X)

# Scale features of X according to feature_range
X = scaler.transform(X)
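
For reference, with `feature_range=(0, 1)` the scaler maps each column to (x - min) / (max - min), using that column's own minimum and maximum. A small sketch on made-up numbers (the `area` array is hypothetical) that checks the formula against `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature column
area = np.array([[19.0], [64.0], [73.0], [240.0], [443.0]])

# Manual min-max scaling, column by column
manual = (area - area.min(axis=0)) / (area.max(axis=0) - area.min(axis=0))

# The same transformation through scikit-learn
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(area)

print(np.allclose(manual, scaled))
```

Because the column maximum defines the top of the range, a single extreme value (such as the maximum area of 24606 in the `describe` output) pushes most other scaled values close to 0.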

In [26]:
# X is now a numpy array
display(X)

array([[2.92730525e-03, 1.11111111e-01, 0.00000000e+00, ...,
        0.00000000e+00, 6.08308605e-02, 9.79464574e-03],
       [8.94454383e-03, 2.22222222e-01, 2.22222222e-01, ...,
        1.13841114e-03, 7.27002967e-02, 1.55601548e-02],
       [8.13140348e-03, 4.44444444e-01, 3.33333333e-01, ...,
        4.55091455e-03, 3.30860534e-01, 4.36243415e-02],
       ...,
       [4.06570174e-04, 0.00000000e+00, 0.00000000e+00, ...,
        3.54900355e-04, 2.96735905e-02, 5.09353833e-03],
       [5.69198244e-03, 4.44444444e-01, 4.44444444e-01, ...,
        2.50341250e-03, 6.52818991e-02, 1.56004731e-02],
       [2.03285087e-03, 1.11111111e-01, 1.11111111e-01, ...,
        2.73000273e-04, 2.37388724e-02, 4.36512203e-03]])

In [27]:
# View X as a pandas DataFrame - all values now between 0 & 1
display(pd.DataFrame(X))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,0.002927,0.111111,0.000000,0.083333,0.050505,1.0,0.0,0.003909,0.066846,0.000000,0.060831,0.009795
1,0.008945,0.222222,0.222222,0.250000,0.111111,1.0,0.0,0.008636,0.082100,0.001138,0.072700,0.015560
2,0.008131,0.444444,0.333333,0.083333,0.000000,1.0,0.0,0.000000,0.327052,0.004551,0.330861,0.043624
3,0.009067,0.222222,0.333333,0.333333,0.151515,0.0,1.0,0.017182,0.087035,0.007008,0.077151,0.026986
4,0.005082,0.111111,0.222222,0.083333,0.222222,1.0,0.0,0.006432,0.136384,0.000925,0.118694,0.020635
...,...,...,...,...,...,...,...,...,...,...,...,...
6075,0.008131,0.222222,0.333333,0.250000,0.050505,1.0,1.0,0.018636,0.213661,0.002730,0.183976,0.039001
6076,0.005285,0.333333,0.111111,0.083333,0.080808,1.0,1.0,0.008182,0.044415,0.001327,0.041543,0.010905
6077,0.000407,0.000000,0.000000,0.000000,0.010101,1.0,1.0,0.002736,0.030956,0.000355,0.029674,0.005094
6078,0.005692,0.444444,0.444444,0.333333,0.030303,0.0,1.0,0.008182,0.073576,0.002503,0.065282,0.015600


#### 4-2. Train test split<a class="anchor" id="section4.2"></a>

In [28]:
# Training size 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(4864, 12) (1216, 12) (4864,) (1216,)
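
An alternative ordering is to split first and fit the scaler on the training rows only, so that test-set minima and maxima do not influence the scaling. This is a sketch of that variant, not what this notebook does; the synthetic `X_demo` / `y_demo` data, the `random_state` value and the use of `stratify` are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 100 rows, 3 features, 86% positive labels
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1000, size=(100, 3))
y_demo = np.array([1] * 86 + [0] * 14)

# stratify keeps the class ratio similar in both splits;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=0
)

# Fit on the training rows only, then apply the same scaling to both splits
demo_scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
print(y_tr.mean(), y_te.mean())
```

With this ordering, scaled test values can fall slightly outside the 0 to 1 range when a test row exceeds the training minimum or maximum.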


### 5. Train the models<a class="anchor" id="section5"></a>

In [29]:
# model 1. Logistic Regression classifier
from sklearn.linear_model import LogisticRegression

# model 2. C-Support Vector Classification
from sklearn.svm import SVC

# model 3. Multi-layer Perceptron classifier
from sklearn.neural_network import MLPClassifier

#### 5-1. Logistic Regression classifier<a class="anchor" id="section5.1"></a>

In [30]:
# Create model
logModel = LogisticRegression()

# Fit the model according to the given training data
logModel.fit(X_train, y_train)

# Return the mean accuracy on the given test data and labels
print(logModel.score(X_test, y_test))

0.8618421052631579


#### 5-2. C-Support Vector Classification<a class="anchor" id="section5.2"></a>

In [31]:
# Create model
svmModel = SVC()

# Fit the model according to the given training data
svmModel.fit(X_train, y_train)

# Return the mean accuracy on the given test data and labels
print(svmModel.score(X_test, y_test))

0.8618421052631579


#### 5-3. Multi-layer Perceptron classifier<a class="anchor" id="section5.3"></a>

In [32]:
# Create model
# - two hidden layers, each with 16 nodes
mlpModel = MLPClassifier(hidden_layer_sizes=(16,16))

# Fit the model according to the given training data
mlpModel.fit(X_train, y_train)

# Return the mean accuracy on the given test data and labels
print(mlpModel.score(X_test, y_test))

0.8856907894736842


### 6. Percentage of dataset that is positive (ie, label is city)<a class="anchor" id="section6"></a>

In [33]:
# Sum the column to count the positive examples (ie where city = 1)
print(df['city'].sum())

# Total number of examples
print(df.shape[0])

# Calculate the % of data that is positive (ie where city = 1)
print(df['city'].sum() / df.shape[0])

5249
6080
0.8633223684210526


86% of our dataset is labelled as a positive example.

The mean accuracy values seen earlier (5-1., 5-2., 5-3.) are simply the number of correct predictions divided by the total number of predictions.

Suppose a model predicted y = 1 for every example in our dataset. It would then be correct 86% of the time without having learned anything about the features.

Accuracy is therefore a useful metric only when a dataset has a roughly equal number of positive & negative examples. The dataset used in this example is skewed, ie the target labels are not in equal proportion.

We can use a different metric called the "F-score". It is used to evaluate binary classification systems, which classify examples as "positive" or "negative". The F-score combines the precision & recall of the model: it is defined as the harmonic mean of the two.
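
To make the definition concrete: precision = TP / (TP + FP), recall = TP / (TP + FN), and F = 2 · precision · recall / (precision + recall). The sketch below applies these to a hypothetical model that predicts 1 for every example, using the class counts printed above (5249 positives out of 6080); the `f_score` helper is illustrative, not part of scikit-learn:

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """F-score (harmonic mean of precision and recall) from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Always predicting 1 on the full dataset: every positive is found (no false
# negatives) and every negative becomes a false positive
positives, total = 5249, 6080
tp, fp, fn = positives, total - positives, 0

accuracy = tp / total
print(round(accuracy, 4))            # 0.8633, same as the positive share
print(round(f_score(tp, fp, fn), 4))
```

Even this trivial model scores above 0.92, so a high F-score on the positive class does not by itself show that a model separates the two classes.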

### 7. Predict class labels for samples in X_test<a class="anchor" id="section7"></a>

In [34]:
logPrediction = logModel.predict(X_test)
display(logPrediction)

array([1, 1, 1, ..., 1, 1, 1])

In [35]:
svmPrediction = svmModel.predict(X_test)
display(svmPrediction)

array([1, 1, 1, ..., 1, 1, 1])

In [36]:
mlpPrediction = mlpModel.predict(X_test)
display(mlpPrediction)

array([1, 1, 1, ..., 1, 1, 0])

In [37]:
# For reference
display(df.head(), df.tail())

Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa,rent amount,property tax,fire insurance,total
0,1,82,2,1,1,5,1,0,860,3400,0,44,4304
1,1,230,3,3,3,11,1,0,1900,4080,417,52,6449
2,1,210,5,4,1,0,1,0,0,15000,1667,226,16890
3,1,233,3,4,4,15,0,1,3780,4300,2567,55,10700
4,1,135,2,3,1,22,1,0,1415,6500,339,83,8337


Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa,rent amount,property tax,fire insurance,total
6075,1,210,3,4,3,5,1,1,4100,9945,1000,127,15170
6076,1,140,4,2,1,8,1,1,1800,2400,486,31,4717
6077,1,20,1,1,0,1,1,1,602,1800,130,23,2555
6078,1,150,5,5,4,3,0,1,1800,3700,917,47,6464
6079,1,60,2,2,1,5,1,0,700,1465,100,19,2284


### 8. F-score<a class="anchor" id="section8"></a>

In [38]:
from sklearn.metrics import f1_score

In [39]:
# f1_score expects the true labels first, then the predictions
print(f1_score(y_test, logPrediction))

0.9257950530035335


In [40]:
# f1_score expects the true labels first, then the predictions
print(f1_score(y_test, svmPrediction))

0.9257950530035335


In [41]:
# f1_score expects the true labels first, then the predictions
print(f1_score(y_test, mlpPrediction))

0.9369042215161143


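The F-score compares each model's predictions with y_test, combining precision & recall to give us a better insight into each model's performance than accuracy alone.

Each F-score looks high. Both the Logistic Regression classifier & the C-Support Vector Classification model have identical results (0.926), while the Multi-layer Perceptron neural network performed slightly better (0.937). Note, however, that a model predicting 1 for every test example would have recall 1 & precision equal to its accuracy; with the accuracy of 0.8618 reported in 5-1. & 5-2., that gives 2 × 0.8618 / (1 + 0.8618) ≈ 0.9258, the same value as the first two F-scores. Those two results are therefore consistent with always predicting the positive class.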
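
Because the F-score above is computed for the positive class only, it can help to look at the negative class as well. A hedged sketch with made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical), using scikit-learn's `confusion_matrix` and the `pos_label` parameter of `f1_score`:

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical skewed labels and a model that always predicts the positive class
y_true_demo = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred_demo = [1, 1, 1, 1, 1, 1, 1, 1]

# Rows are true classes (0, 1), columns are predicted classes (0, 1)
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
print(cm)

# F-score with class 1 as "positive" (the default)
f1_city = f1_score(y_true_demo, y_pred_demo)

# F-score with class 0 as "positive"; zero_division=0 avoids a warning
# when class 0 is never predicted
f1_other = f1_score(y_true_demo, y_pred_demo, pos_label=0, zero_division=0)
print(f1_city, f1_other)
```

Applied to the notebook's own `y_test` and predictions, the same two calls would show how many negative examples each model actually identifies.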