In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Objective

In this notebook, the goal is to perform exploratory data analysis, feature engineering, data treatment, and to apply Machine Learning models to predict the **RENT AMOUNT (R$)**.
I will follow this plan:
1. Perform an EDA in order to gain insights and choose the best features;
2. Preprocess the data;
3. Test models and choose the best one;
4. Perform the final test with the chosen model.

# Load Data

In [None]:
raw_data = pd.read_csv('../input/brasilian-houses-to-rent/houses_to_rent_v2.csv')
raw_data.head(20)

The features are:
* **city** - city where the property is located
* **area** - property area
* **rooms** - quantity of rooms
* **bathroom** - quantity of bathrooms
* **parking spaces** - quantity of parking spaces
* **floor** - floor
* **animal** - accepts animals?
* **furniture** - is the property furnished?
* **hoa** - Homeowners association tax
* **property tax** - IPTU / property tax
* **rent amount** - rental price
* **fire insurance** - fire insurance
* **total** - total value


# Exploratory Data Analysis (EDA)

#### Shape

In [None]:
print('ROWS: ', raw_data.shape[0])
print('COLUMNS: ', raw_data.shape[1])

#### Basic info

In [None]:
raw_data.info()

#### Basic description

In [None]:
raw_data.describe().T

#### NULL values

In [None]:
raw_data.isnull().sum()

There are no null values, which is common for curated Kaggle datasets.
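
A zero null count does not rule out placeholder values stored as ordinary strings. Below is a minimal sketch of such a check on a toy frame. The helper `count_placeholders`, the toy values, and the `'-'` placeholder are illustrative assumptions, not something taken from this dataset.

```python
import pandas as pd

# Toy frame; the values and the '-' placeholder are hypothetical
toy = pd.DataFrame({
    'floor': ['7', '-', '3', '-'],
    'rooms': [2, 1, 3, 2],
})

def count_placeholders(df, placeholders=('-', '', 'NA', '?')):
    """Count placeholder strings in every non-numeric column (hypothetical helper)."""
    text_cols = df.select_dtypes(exclude='number')
    return text_cols.isin(placeholders).sum()

placeholder_counts = count_placeholders(toy)
print(toy.isnull().sum())   # no nulls are reported
print(placeholder_counts)   # placeholder entries are still found in 'floor'
```

The same helper could be pointed at `raw_data` to see whether any text column hides missing values behind a placeholder.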

### Rent amount (R$) - Analysis

#### Histogram


In [None]:
plt.figure(figsize=(12, 6))
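# histplot replaces the deprecated distplot; stat='density' keeps the same y-axis
sns.histplot(x=raw_data['rent amount (R$)'], kde=True, stat='density')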
plt.xticks(np.arange(raw_data['rent amount (R$)'].min(), raw_data['rent amount (R$)'].max(), step=3000));

The distribution is strongly right-skewed (**right skew**), and most rent amounts fall between 450,00 and 3.450,00. There are likely to be several outliers in this data, and from my business knowledge, a rent above R$ **12.000,00** is unusual.
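
The skew can also be quantified instead of read off the plot. Here is a minimal sketch on synthetic data: the log-normal sample below is only a stand-in for `raw_data['rent amount (R$)']`, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "rents" (log-normal), used only for illustration
rng = np.random.default_rng(0)
rents = pd.Series(rng.lognormal(mean=8.0, sigma=0.7, size=5000))

skew_raw = rents.skew()            # a value above 0 indicates a right skew
skew_log = np.log1p(rents).skew()  # the log transform pulls in the long right tail

print(f'skew (raw):   {skew_raw:.2f}')
print(f'skew (log1p): {skew_log:.2f}')
```

A skewness close to zero after `np.log1p` would suggest that a log-transformed target is worth trying, although this notebook keeps the raw rent amount.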

#### Boxplot

In [None]:
plt.figure(figsize=(10, 7))

sns.boxplot(x=raw_data['rent amount (R$)'])
plt.xticks(np.arange(raw_data['rent amount (R$)'].min(), raw_data['rent amount (R$)'].max(), step=3000))

plt.show()

As expected, there are several outliers above **9.500,00**.

### City

In [None]:
cities = raw_data['city'].unique()
cities

#### Histogram

In [None]:
plt.figure(figsize=(18, 8))

i = 1
for city in cities:
    
    if city == 'São Paulo':
        continue
    
    plt.subplot(2, 3, i)
    plt.title(city)
    city_name = raw_data.loc[raw_data['city'] == city]
    sns.histplot(x=city_name['rent amount (R$)'], kde=True, stat='density')
    plt.xticks(np.arange(city_name['rent amount (R$)'].min(), city_name['rent amount (R$)'].max(), step=2000))
    i+=1
    

plt.tight_layout()
plt.show()

Rents in these 4 cities are right-skewed and do not usually exceed **2.500,00**.

In [None]:
plt.figure(figsize=(18, 5))

sp = raw_data.loc[raw_data['city'] == 'São Paulo']
sns.histplot(x=sp['rent amount (R$)'], kde=True, stat='density')
plt.xticks(np.arange(sp['rent amount (R$)'].min(), sp['rent amount (R$)'].max(), step=2000))

plt.show()

In São Paulo, the data is still right-skewed, and most rental values range from just above **2.500,00** to almost **4.500,00**.

#### Boxplot

In [None]:
plt.figure(figsize=(16, 8))

i = 1
step = 5000
for city in cities:
    if step < 2000:
        step = 2000
    plt.subplot(2, 3, i)
    plt.title(city)
    city_name = raw_data.loc[raw_data['city'] == city]
    sns.boxplot(x=city_name['rent amount (R$)'])
    plt.xticks(np.arange(city_name['rent amount (R$)'].min(), city_name['rent amount (R$)'].max(),
                        step=step))
    step-=3000
    i+=1

    

plt.tight_layout()
plt.show()

Disregarding outliers:
- **São Paulo**: rent amount around **500,00** to **12.000,00**
- **Porto Alegre**: rent amount around **500,00** to **4.500,00**
- **Rio de Janeiro**: rent amount around **500,00** to **7.500,00**
- **Campinas**: rent amount around **500,00** to **5.500,00**
- **Belo Horizonte**: rent amount around **500,00** to **9.500,00**

## Correlations

In [None]:
plt.figure(figsize=(10, 10))

# Use the public API instead of the private _get_numeric_data()
numData = raw_data.select_dtypes(include='number')
var_num_corr = numData.corr()

sns.heatmap(var_num_corr, vmin=-1, vmax=1, annot=True, linewidth=0.01, linecolor='black', cmap='RdBu_r')

plt.show()

In [None]:
var_num_corr['rent amount (R$)'].round(3)

The features most positively correlated with the **rent amount** (correlation >= 0.5) are:
* rooms
* bathroom
* parking spaces
* fire insurance (R$)
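
The threshold-based selection above can be sketched in code. The toy frame and its column names are made up for illustration. On the real data, `var_num_corr['rent amount (R$)']` would play the role of `corr_with_target`.

```python
import pandas as pd

# Toy numeric frame; the values are made up for illustration
toy = pd.DataFrame({
    'rooms': [1, 2, 3, 4, 5, 6],
    'floor': [3, 1, 4, 1, 5, 2],
    'rent': [900, 1500, 2300, 3100, 3800, 5000],
})

target = 'rent'
# Correlation of every feature with the target, excluding the target itself
corr_with_target = toy.corr()[target].drop(target)

# Keep only the features whose correlation with the target is >= 0.5
selected = corr_with_target[corr_with_target >= 0.5].index.tolist()
print(selected)
```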

I will analyze them better below:

## Analysis of important features

### rooms

In [None]:
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.barplot(x=raw_data['rooms'], y=raw_data['rent amount (R$)'])

plt.subplot(1, 2, 2)
sns.boxplot(x=raw_data['rooms'])
plt.xticks(np.arange(raw_data['rooms'].min(), raw_data['rooms'].max(), step=1))


plt.show()

The number of rooms usually varies between 1 and 4, and the more rooms, the higher the rent, as expected. The value of 10 rooms is strange...

### Bathroom


In [None]:
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.barplot(x=raw_data['bathroom'], y=raw_data['rent amount (R$)'])

plt.subplot(1, 2, 2)
sns.boxplot(x=raw_data['bathroom'])
plt.xticks(np.arange(raw_data['bathroom'].min(), raw_data['bathroom'].max(), step=1))


plt.show()

The number of bathrooms usually varies between 1 and 6, and the more bathrooms, the higher the rent, as expected. The value of 9 bathrooms is strange...

### Parking spaces

In [None]:
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.barplot(x=raw_data['parking spaces'], y=raw_data['rent amount (R$)'])

plt.subplot(1, 2, 2)
sns.boxplot(x=raw_data['parking spaces'])
plt.xticks(np.arange(raw_data['parking spaces'].min(), raw_data['parking spaces'].max(), step=1))


plt.show()

The number of parking spaces usually varies between 0 and 5, and the more parking spaces, the higher the rent, as expected. The rent starts to decrease from 7 parking spaces onward, which is strange...
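
One possible explanation for an odd mean at high counts is that very few properties fall into those groups, so their average is unreliable. Below is a sketch for checking group sizes next to group means, on made-up data (the column names mirror the dataset, the values do not).

```python
import pandas as pd

# Made-up data: a single property with 8 parking spaces
toy = pd.DataFrame({
    'parking spaces': [0, 0, 1, 1, 1, 2, 2, 2, 8],
    'rent amount (R$)': [1000, 1200, 1800, 2000, 2200, 3000, 3200, 3400, 2500],
})

# Number of properties and mean rent for each parking-space count
summary = (
    toy.groupby('parking spaces')['rent amount (R$)']
       .agg(['count', 'mean'])
)
print(summary)
```

Running the same aggregation on `raw_data` would show whether the groups with 7 or more parking spaces are large enough to trust their bar heights.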

### Fire insurance

In [None]:
plt.figure(figsize=(18, 6))

sns.regplot(x=raw_data['fire insurance (R$)'], y=raw_data['rent amount (R$)'], line_kws={'color': 'r'})
plt.xticks(np.arange(raw_data['fire insurance (R$)'].min(), raw_data['fire insurance (R$)'].max(), step=20))

plt.show()

The value of **fire insurance** has a positive influence on **rent amount**. Most of the values are between **3,00** and **200,00**.

### Furniture

In [None]:
furniture = raw_data['furniture'].value_counts()
pd.DataFrame(furniture)

There are about 3x more unfurnished houses than furnished ones.

In [None]:
plt.figure(figsize=(11, 5))

plt.subplot(1, 2, 1)
plt.title('Furniture ratio')
# Take the labels from the value_counts index so they always match the slices
plt.pie(furniture, labels=furniture.index, colors=['r', 'g'],
        explode=(0, 0.1), autopct='%1.1f%%')

plt.subplot(1, 2, 2)
plt.title('Furniture vs Rent amount')
sns.barplot(x=raw_data['furniture'], y=raw_data['rent amount (R$)'])

plt.tight_layout()
plt.show()

Houses that **are furnished** tend to have a higher **rent amount**.

## Analysis of *not so important* features

### Animal

In [None]:
plt.figure(figsize=(15, 6))

plt.subplot(1, 2, 1)
plt.title('Accept or not accept')
sns.countplot(x=raw_data['city'], hue=raw_data['animal'])

plt.subplot(1, 2, 2)
plt.title('Boxplot')
sns.boxplot(x=raw_data['rent amount (R$)'], y=raw_data['animal'])
plt.xticks(np.arange(raw_data['rent amount (R$)'].min(), raw_data['rent amount (R$)'].max(), step=5000))

plt.tight_layout()
plt.show()

Whether or not the property accepts animals has only a small influence on the **rent amount**.

### hoa

In [None]:
plt.figure(figsize=(15, 6))

plt.subplot(1, 2, 1)
sns.regplot(x=raw_data['hoa (R$)'], y=raw_data['rent amount (R$)'], line_kws={'color': 'r'})
plt.xscale('log')
plt.yscale('log')

plt.show()

There doesn't seem to be much correlation between the **hoa** and the **rent price**.

### Property tax

In [None]:
plt.figure(figsize=(15, 6))

plt.subplot(1, 2, 1)
sns.regplot(x=raw_data['property tax (R$)'], y=raw_data['rent amount (R$)'], line_kws={'color': 'r'})
plt.xscale('log')
plt.yscale('log')

plt.show()

There doesn't seem to be much correlation between the **property tax** and the **rent price**.

### Area

In [None]:
plt.figure(figsize=(15, 6))

plt.subplot(1, 2, 1)
sns.regplot(x=raw_data['area'], y=raw_data['rent amount (R$)'], line_kws={'color': 'r'})
plt.xscale('log')
plt.yscale('log')

plt.show()

There doesn't seem to be much correlation between the **area** and the **rent price**.
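
Pearson correlation is sensitive to extreme values, so a rank-based measure can be a useful second opinion before discarding a feature. Here is a sketch on made-up data in which a single extreme area dominates Pearson's r. Spearman's rho is computed as Pearson's r on the ranks.

```python
import pandas as pd

# Made-up data with one extreme area value
toy = pd.DataFrame({
    'area': [30, 45, 60, 80, 120, 46000],
    'rent': [800, 1200, 1500, 2200, 3500, 3600],
})

pearson = toy['area'].corr(toy['rent'])
# Spearman's rho is Pearson's r computed on the ranks
spearman = toy['area'].rank().corr(toy['rent'].rank())

print(f'Pearson: {pearson:.2f}, Spearman: {spearman:.2f}')
```

If the two measures disagree strongly on the real data, the feature may carry more signal than the heatmap suggests, especially once outliers are removed.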

# Testing ML models

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# ML models
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from xgboost import XGBRegressor

As this is a continuous value forecast, I will use regression models.

First, I will remove the outliers from the data using the interquartile range (IQR).
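
As a small worked example of the interquartile range rule, the sketch below computes the limits for a single made-up series of rents and keeps only the values inside them. The numbers are illustrative only.

```python
import pandas as pd

# Made-up rents for a single city
rents = pd.Series([800, 1000, 1200, 1500, 2000, 2500, 3000, 15000])

q1 = rents.quantile(0.25)
q3 = rents.quantile(0.75)
iqr = q3 - q1

# Values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

kept = rents[rents.between(lower, upper)]
print(f'Q1={q1}, Q3={q3}, IQR={iqr}, limits=({lower}, {upper})')
print(kept.tolist())
```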

### rent price (R$) with outliers

In [None]:
plt.figure(figsize=(8, 6))

sns.boxplot(x=raw_data['city'], y=raw_data['rent amount (R$)'])

plt.show()

Maximum rental prices do not usually exceed **15.000**, but with outliers they reach **40.000**.

### Select quantiles

In [None]:
# Grouping cities
city_group = raw_data.groupby('city')['rent amount (R$)']

In [None]:
# Q1 = 25th percentile, Q3 = 75th percentile (computed per city)
Q1 = city_group.quantile(.25)
Q3 = city_group.quantile(.75)

# IQR = Interquartile Range
IQR = Q3 - Q1

# Limits
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

### Remove outliers

In [None]:
# DataFrame to store the new data
new_data = pd.DataFrame()

for city in city_group.groups.keys():
    is_city = raw_data['city'] == city
    accepted_limit = ((raw_data['rent amount (R$)'] >= lower[city]) &
                     (raw_data['rent amount (R$)'] <= upper[city]))
    
    select = is_city & accepted_limit
    data_select = raw_data[select]
    new_data = pd.concat([new_data, data_select])

new_data.head()

### Comparison

In [None]:
plt.figure(figsize=(15, 6))

plt.subplot(1, 2, 1)
plt.title('With outliers')
sns.boxplot(x=raw_data['city'], y=raw_data['rent amount (R$)'])

plt.subplot(1, 2, 2)
plt.title('Without outliers')
sns.boxplot(x=new_data['city'], y=new_data['rent amount (R$)'])

plt.tight_layout(pad=5.0)
plt.show()

Much better!
P.S. Thanks to [Samuel Natividade](http://www.kaggle.com/juxwzera) for the very useful notebook.
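
The per-city loop used to remove outliers could also be written without explicit iteration. The sketch below uses `groupby(...).transform` on made-up data. It is an alternative formulation, not the code used in this notebook, and it keeps the original row order instead of grouping rows by city.

```python
import pandas as pd

# Made-up data; the column names mirror the dataset
toy = pd.DataFrame({
    'city': ['A'] * 6 + ['B'] * 6,
    'rent amount (R$)': [1000, 1100, 1200, 1300, 1400, 9000,
                         3000, 3200, 3400, 3600, 3800, 20000],
})

grouped = toy.groupby('city')['rent amount (R$)']
# transform broadcasts each city's quartile back to every row of that city
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

within_limits = toy['rent amount (R$)'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
filtered = toy[within_limits]
print(filtered)
```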

### Train new data without outliers

#### Categorical columns handler

In [None]:
catTransformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

#### Numerical columns handler

In [None]:
numTransformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

#### Select FEATURES (X)

In [None]:
cols = ['city', 'rooms', 'bathroom', 'parking spaces', 'fire insurance (R$)',
        'furniture']

X = new_data[cols]
X.head()

In [None]:
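# Treat every selected feature as categorical, except fire insurance, which stays numeric
cat_cols = [col for col in X.columns if col != 'fire insurance (R$)']
X = X.astype({col: 'category' for col in cat_cols})
X['fire insurance (R$)'] = X['fire insurance (R$)'].astype('int64')
X.info()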

#### Select TARGET (y)

In [None]:
y = new_data['rent amount (R$)']
y

#### Select numerical features

In [None]:
numFeatures = X.select_dtypes(include=['int64', 'float64']).columns
numFeatures

#### Select categorical features

In [None]:
catFeatures = X.select_dtypes(include=['category']).columns
catFeatures

#### Handling numerical and categorical features

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numTransformer, numFeatures),
        ('categoric', catTransformer, catFeatures)
    ])

#### Select TRAIN and TEST data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

#### List of ML models

In [None]:
regressors = [
    DecisionTreeRegressor(),
    RandomForestRegressor(),
    SVR(),
    LinearRegression(),
    XGBRegressor()
]

#### Fit all ML models and select best

In [None]:
# Seed
np.random.seed(42)

for regressor in regressors:
    
    estimator = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', regressor)
    ])
    
    estimator.fit(X_train, y_train)
    preds = estimator.predict(X_test)
    
    print(regressor)

    print('MAE:', mean_absolute_error(y_test, preds))
    print('RMSE:', np.sqrt(mean_squared_error(y_test, preds)))
    print('R2:', r2_score(y_test, preds))
    print('-' * 40)

In this specific case, **XGBoost** showed better results compared to the other models.
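
A single train/test split can favor one model by chance, so a k-fold comparison is a more stable way to rank candidates. Below is a minimal sketch with scikit-learn's `cross_val_score` on synthetic data. `X_demo`, `y_demo`, and the two candidate models are assumptions chosen to keep the example self-contained. In the notebook, the pipeline built from `preprocessor` and each regressor would take their place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the real features and target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=300)

candidates = {
    'linear': LinearRegression(),
    'tree': DecisionTreeRegressor(random_state=42),
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_mae = {}
for name, reg in candidates.items():
    # scikit-learn reports the negated MAE so that higher is always better
    scores = cross_val_score(reg, X_demo, y_demo, cv=cv,
                             scoring='neg_mean_absolute_error')
    cv_mae[name] = -scores.mean()
    print(f'{name}: MAE = {cv_mae[name]:.3f} (+/- {scores.std():.3f})')
```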

## RandomizedSearchCV with the best model (XGBRegressor)

In [None]:
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('model', XGBRegressor(random_state=42))
                      ])

In [None]:
params = {
        # np.arange(0.01, 0.1) uses a step of 1 and would yield only [0.01]
        'model__learning_rate': np.linspace(0.01, 0.1, 10),
        'model__n_estimators': np.arange(100, 1000, step=50),
        'model__max_depth': np.arange(1, 20, step=2),
        'model__subsample': [0.8, 0.9, 1],
        'model__colsample_bytree': [0.8, 0.9, 1],
        'model__gamma': [0, 1, 3, 5]
         }

In [None]:
estimator = RandomizedSearchCV(pipe, cv=20, param_distributions=params, n_jobs=-1)
estimator.fit(X_train,y_train)

In [None]:
estimator.best_params_

#### Train with the best model and best params 

In [None]:
estimator = XGBRegressor(colsample_bytree=0.8,
                           gamma=0, 
                           learning_rate=0.01, 
                           max_depth=5, 
                           n_estimators=950, 
                           subsample=1)

In [None]:
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', estimator)
])

In [None]:
model.fit(X_train, y_train)

#### Predictions

In [None]:
preds = model.predict(X_test)

#### Evaluate

In [None]:
print('MAE:', mean_absolute_error(y_test, preds))
print('RMSE:', np.sqrt(mean_squared_error(y_test, preds)))
print('R2:', r2_score(y_test, preds))

In [None]:
plt.figure(figsize=(12, 8))

# kdeplot replaces the deprecated distplot(hist=False)
sns.kdeplot(x=y_test, color='b', label='Actual')
sns.kdeplot(x=preds, color='r', label='Predicted')

plt.legend()
plt.show()

Although the predicted distribution has a higher peak around 2.000,00 than the actual one, the model seems to do a good job of forecasting the other values.
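
The three metrics printed in the evaluation cell can also be computed by hand, which makes their meaning explicit. Below is a sketch with NumPy on made-up values. `y_true` and `y_pred` stand in for `y_test` and `preds`.

```python
import numpy as np

# Made-up actual and predicted rents
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_pred = np.array([1100.0, 1900.0, 3300.0, 3700.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))          # average absolute error, in R$
rmse = np.sqrt(np.mean(errors ** 2))   # penalizes large errors more than MAE

ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # share of variance explained

print(f'MAE: {mae:.1f}, RMSE: {rmse:.1f}, R2: {r2:.3f}')
```

Because MAE and RMSE are in the same unit as the target, they can be read directly as an error in R$, while R2 is unitless.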

# Save model

In [None]:
from joblib import dump, load
dump(model, 'model_2.joblib')
model = load('model_2.joblib')