# Regression

In regression, we want to predict a continuous value, such as the price of a property, the height of a person, or the revenue of a game in its first month of release.

Nearly all the techniques studied so far, such as cross-validation, hyperparameter optimization, and feature selection, also apply to regression problems.

The main difference between regression and classification is the type of target we want to predict (a continuous value rather than a class label) and the metrics used to evaluate performance.
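As a rough illustration of the metrics side of that difference, the sketch below computes three common regression metrics (mean absolute error, root mean squared error, and R²) with NumPy. The function name `regression_metrics` and the sample values are made up for illustration; they do not come from the house price dataset.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equal-length sequences (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))           # average absolute error
    rmse = np.sqrt(np.mean(errors ** 2))    # penalizes large errors more heavily
    ss_res = np.sum(errors ** 2)            # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot              # share of variance explained

    return {"mae": mae, "rmse": rmse, "r2": r2}

# Made-up targets and predictions, for illustration only
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 320.0, 380.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAE and RMSE are expressed in the same unit as the target (here, a price), while R² is unitless, which makes it easier to compare across datasets.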

We will use the house price prediction dataset available on Kaggle ([House Prices: Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/)).

As additional reading, I strongly suggest exploring the notebooks from this competition, learning from the competitors, and perhaps even participating in the competition yourself!

## Importing Libs

In [5]:
import os
import numpy as np
import pandas as pd



## Reading the Data

In [7]:
df_train = pd.read_csv('/Users/dellacorte/py-projects/data-science/supervised-learning-regression-reference/databases/house-price.csv', sep=";")
df_train.head()

| | zoneamento | tam_terreno | forma_terreno | qualidade_geral | condicao | ano_construcao | qualidade_aquecedor | ar_condicionado | tam_primeiro_andar | tam_segundo_andar | ... | qtde_banheiros | qtde_comodos | qtde_lareiras | qtde_carros_garagem | tam_garagem | tam_piscina | qualidade_piscina | mes_venda | ano_venda | preco |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RL | 785.03035 | Reg | 7 | 5 | 2003 | Ex | Y | 79.524968 | 79.339162 | ... | 30 | 8 | 0 | 2 | 50.910844 | 0.0 | NotAv | 2 | 2008 | 846510.0 |
| 1 | RL | 891.8688 | Reg | 6 | 8 | 1976 | Ex | Y | 117.243586 | 0.0 | ... | 2 | 6 | 1 | 2 | 42.73538 | 0.0 | NotAv | 5 | 2007 | 734478.4 |
| 2 | RL | 1045.15875 | IR1 | 7 | 5 | 2001 | Ex | Y | 85.47076 | 80.453998 | ... | 2 | 6 | 1 | 2 | 56.485024 | 0.0 | NotAv | 9 | 2008 | 907410.0 |
| 3 | RL | 887.22365 | IR1 | 7 | 5 | 1915 | Gd | Y | 89.279783 | 70.234668 | ... | 1 | 7 | 1 | 3 | 59.643726 | 0.0 | NotAv | 2 | 2006 | 568400.0 |
| 4 | RL | 1324.79678 | IR1 | 8 | 5 | 2000 | Ex | Y | 106.373935 | 97.826859 | ... | 2 | 9 | 1 | 3 | 77.666908 | 0.0 | NotAv | 12 | 2008 | 1015000.0 |


## Exploratory Data Analysis

In [8]:
df_train.shape

(1458, 23)

The dataset has 1,458 records and 23 columns: 22 features plus the target, `preco`.

In [10]:
df_train.index.nunique()

1458

In [11]:
df_train.dtypes

zoneamento              object
tam_terreno            float64
forma_terreno           object
qualidade_geral          int64
condicao                 int64
ano_construcao           int64
qualidade_aquecedor     object
ar_condicionado         object
tam_primeiro_andar     float64
tam_segundo_andar      float64
tam_sala_estar         float64
qtde_quartos             int64
qualidade_cozinha       object
qtde_banheiros           int64
qtde_comodos             int64
qtde_lareiras            int64
qtde_carros_garagem      int64
tam_garagem            float64
tam_piscina            float64
qualidade_piscina       object
mes_venda                int64
ano_venda                int64
preco                  float64
dtype: object

In [12]:
df_train.isnull().sum()

zoneamento             0
tam_terreno            0
forma_terreno          0
qualidade_geral        0
condicao               0
ano_construcao         0
qualidade_aquecedor    0
ar_condicionado        0
tam_primeiro_andar     0
tam_segundo_andar      0
tam_sala_estar         0
qtde_quartos           0
qualidade_cozinha      0
qtde_banheiros         0
qtde_comodos           0
qtde_lareiras          0
qtde_carros_garagem    0
tam_garagem            0
tam_piscina            0
qualidade_piscina      0
mes_venda              0
ano_venda              0
preco                  0
dtype: int64

In [13]:
df_train.describe()

| | tam_terreno | qualidade_geral | condicao | ano_construcao | tam_primeiro_andar | tam_segundo_andar | tam_sala_estar | qtde_quartos | qtde_banheiros | qtde_comodos | qtde_lareiras | qtde_carros_garagem | tam_garagem | tam_piscina | mes_venda | ano_venda | preco |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1458.0 | 1458.0 | 1458.0 | 1458.0 | 1458.0 | 1458.0 | 1458.0 | 1458.0 | 1458.0 | 1458.0 | 1458.0 | 1458.0 | 1458.0 | 1458.0 | 1458.0 | 1458.0 | 1458.0 |
| mean | 976.9452 | 6.098765 | 5.57476 | 1971.237311 | 108.031215 | 32.196752 | 140.771685 | 2.866255 | 1.583676 | 6.517147 | 0.613169 | 1.766118 | 43.925405 | 0.256662 | 6.316187 | 2007.817558 | 734224.4 |
| std | 927.893099 | 1.382749 | 1.112835 | 30.20988 | 35.933321 | 40.504647 | 48.774779 | 0.815482 | 0.92561 | 1.624721 | 0.644829 | 0.747104 | 19.869505 | 3.735141 | 2.700471 | 1.327982 | 322438.0 |
| min | 120.7739 | 1.0 | 1.0 | 1872.0 | 31.029602 | 0.0 | 31.029602 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 141694.0 |
| 25% | 700.906683 | 5.0 | 5.0 | 1954.0 | 81.940446 | 0.0 | 105.003616 | 2.0 | 1.0 | 5.0 | 0.0 | 1.0 | 30.797345 | 0.0 | 5.0 | 2007.0 | 527495.5 |
| 50% | 880.581085 | 6.0 | 5.0 | 1972.5 | 100.985561 | 0.0 | 136.009992 | 3.0 | 2.0 | 6.0 | 1.0 | 2.0 | 44.546988 | 0.0 | 6.0 | 2008.0 | 661780.0 |
| 75% | 1077.6748 | 7.0 | 6.0 | 2000.0 | 129.29775 | 67.633384 | 164.995728 | 3.0 | 2.0 | 7.0 | 1.0 | 2.0 | 53.512128 | 0.0 | 8.0 | 2009.0 | 868840.0 |
| max | 19996.906235 | 10.0 | 9.0 | 2010.0 | 435.900876 | 191.844695 | 524.158726 | 8.0 | 30.0 | 14.0 | 3.0 | 4.0 | 131.736454 | 68.562414 | 12.0 | 2010.0 | 3065300.0 |


Since no numeric variable takes negative values (every minimum in the summary above is at least 0), we will replace missing values in numeric variables with the arbitrary sentinel value -999, which cannot collide with a real observation. For categorical variables, we will replace missing values with the word `missing`.
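One possible way to apply this rule with pandas is sketched below. The helper name `fill_missing` and the toy DataFrame are hypothetical and only reuse column names from the dataset; the values are made up. The sketch also assumes that every non-numeric column should be treated as categorical.

```python
import numpy as np
import pandas as pd

def fill_missing(df, numeric_fill=-999, categorical_fill="missing"):
    """Return a copy of df with missing values replaced by sentinel values."""
    out = df.copy()
    # Assumption: any column that is not numeric is treated as categorical
    numeric_cols = out.select_dtypes(include="number").columns
    categorical_cols = out.select_dtypes(exclude="number").columns

    out[numeric_cols] = out[numeric_cols].fillna(numeric_fill)
    out[categorical_cols] = out[categorical_cols].fillna(categorical_fill)
    return out

# Toy data with made-up values, for illustration only
toy = pd.DataFrame({
    "tam_terreno": [785.0, np.nan, 1045.2],
    "qtde_quartos": [3, np.nan, 2],
    "qualidade_piscina": ["NotAv", None, "Gd"],
})

filled = fill_missing(toy)
print(filled)
```

Because the helper works on a copy, the original DataFrame keeps its missing values, which makes it easy to compare the data before and after the replacement.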