# Regression - Model training
---
**Practical Class 03**: Training a linear regression model

**Objective**: Train a linear regression model

Datasets:

**Restaurant tips**:

Dataset provided by the plotly package

"One waiter recorded information about each tip he received over a period of a few months working in one restaurant. He collected several variables:

* tip in dollars,

* bill in dollars,

* sex of the bill payer,

* whether there were smokers in the party,

* day of the week,

* time of day,

* size of the party."

**Used car prices**

[Available on Kaggle](https://www.kaggle.com/datasets/rishabhkarn/used-car-dataset/data)

[Available for download](https://drive.google.com/file/d/1Ny6GypPH4AtJi6CJHmEUEI3KN11hDuGG/view?usp=drive_link)

We will use the data cleaned in class 2.

## Importing the main functions and reading the data


---

In [1]:
import pandas as pd  # package for reading the data
import numpy as np
import plotly.express as px

In [14]:
df_tips = px.data.tips()

In [3]:
df_tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [4]:
path = 'dado_tratado.csv'
df = pd.read_csv(path)

In [5]:
df.head()

Unnamed: 0,seats,kms_driven,mileage,engine,max_power,torque,price,registration_tratada_antes_2010,registration_tratada_depois 2020,fuel_type_Diesel,fuel_type_Petrol,ownsership_tratado_First Owner,ownsership_tratado_None,ownsership_tratado_Second Owner,ownsership_tratado_Third Owner,transmission_tratado_Manual,transmission_tratado_None
0,5,56000,7.81,2996.0,2996.0,333.0,63.75,0,0,0,1,1,0,0,0,0,0
1,5,30615,17.4,999.0,999.0,9863.0,8.99,0,1,0,1,1,0,0,0,0,0
2,5,24000,20.68,1995.0,1995.0,188.0,23.75,0,0,1,0,1,0,0,0,0,0
3,5,18378,16.5,1353.0,1353.0,13808.0,13.56,0,0,0,1,1,0,0,0,1,0
4,5,44900,14.67,1798.0,1798.0,17746.0,24.0,0,0,0,1,1,0,0,0,0,0


## Training a regression model - Tips dataset


---


To train a regression model we will use the sklearn package.


### Splitting the dataset into training and test sets
The first step in training a model is to split the dataset into a training set and a test set. For this we will use the `train_test_split` function.


``` python
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.3, random_state=15)
```
In the example above, X is a dataframe containing the model's features and Y is a dataframe with the target variable.


The `test_size` parameter controls the fraction of the data reserved for testing.


The `random_state` parameter fixes the seed of the random shuffling, so re-running the code produces the same training and test sets.


Splitting the dataset matters because the training set is used to fit the models and the test set is used to evaluate them on data they did not see during fitting.
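As a minimal sketch of the split described above, the snippet below uses a small made-up dataset (the values are hypothetical, not the tips data) to show the resulting shapes and the effect of `random_state`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 rows, one feature and one target.
X = pd.DataFrame({"total_bill": np.arange(10, 20, dtype=float)})
Y = pd.Series(np.arange(10, dtype=float), name="tip")

# 30% of the rows go to the test set, the rest to the training set.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=.3, random_state=15
)
print(X_train.shape, X_test.shape)

# Repeating the call with the same random_state reproduces the same split.
X_train2, X_test2, _, _ = train_test_split(X, Y, test_size=.3, random_state=15)
same_split = X_test.index.equals(X_test2.index)
print(same_split)
```

With 10 rows and `test_size=.3`, 3 rows end up in the test set and 7 in the training set, and no row appears in both.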


### Training the model
Now that we have the training and test data, we can train our regression model. For this we will use the `LinearRegression` class.


``` python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, Y_train)
```


In the code above, `model` is an object of type `LinearRegression`. We use it to fit the model and to make predictions, and it also stores the fitted coefficients.


``` python
# Access the coefficients
model.coef_
# Access the intercept
model.intercept_
# Make predictions (predict takes the features, not the target)
model.predict(X_test)
```
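To make these attributes concrete, here is a small sketch on synthetic data that follows `y = 2*x + 1` exactly. The data is an assumption made for illustration, so the fitted values can be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): y = 2*x + 1 with no noise.
X = np.arange(10, dtype=float).reshape(-1, 1)  # 2-D feature matrix
y = 2 * X.ravel() + 1

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]        # one coefficient per feature
intercept = model.intercept_  # prediction when every feature is 0
pred = model.predict(np.array([[10.0]]))  # predict expects a 2-D array
print(slope, intercept, pred)
```

Because the data is perfectly linear, the fit recovers a slope close to 2 and an intercept close to 1.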


### Evaluating the model
To evaluate the trained model we will use the metrics covered in the theory class.


``` python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Predictions on the test set
Y_predict = model.predict(X_test)

# Mean squared error
mean_squared_error(Y_test, Y_predict)

# Mean absolute error
mean_absolute_error(Y_test, Y_predict)

# R2 score
r2_score(Y_test, Y_predict)
```
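The sketch below computes the same three metrics by hand with NumPy on three made-up observations, which may help when reading the sklearn results. The numbers are hypothetical.

```python
import numpy as np

# Hypothetical true values and predictions.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

residuals = y_true - y_pred

# Mean squared error: average of the squared residuals.
mse = np.mean(residuals ** 2)

# Mean absolute error: average of the absolute residuals.
mae = np.mean(np.abs(residuals))

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, mae, r2)
```

Here the residuals are 1, 0 and -2, so the MSE is 5/3, the MAE is 1 and R2 is 1 - 5/8 = 0.375.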

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

### Data analysis and processing


---

Exercise:


* Analyze the tips dataset through the correlation between its variables. Is there any correlation?
* Analyze the categorical variables and create dummy variables for them.




#### Solution

In [16]:
df_tips.corr(numeric_only=True)

Unnamed: 0,total_bill,tip,size
total_bill,1.0,0.675734,0.598315
tip,0.675734,1.0,0.489299
size,0.598315,0.489299,1.0


In [8]:
df_tips[['sex']].value_counts(dropna=False)

sex   
Male      157
Female     87
dtype: int64

In [9]:
df_tips[['smoker']].value_counts(dropna=False)

smoker
No        151
Yes        93
dtype: int64

In [10]:
df_tips[['day']].value_counts(dropna=False)

day 
Sat     87
Sun     76
Thur    62
Fri     19
dtype: int64

In [11]:
df_tips[['time']].value_counts(dropna=False)

time  
Dinner    176
Lunch      68
dtype: int64

In [12]:
df_tips = pd.get_dummies(df_tips, drop_first=True)
df_tips.head()

Unnamed: 0,total_bill,tip,size,sex_Male,smoker_Yes,day_Sat,day_Sun,day_Thur,time_Lunch
0,16.99,1.01,2,0,0,0,1,0,0
1,10.34,1.66,3,1,0,0,1,0,0
2,21.01,3.5,3,1,0,0,1,0,0
3,23.68,3.31,2,1,0,0,1,0,0
4,24.59,3.61,4,0,0,0,1,0,0
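
The effect of `drop_first=True` can be seen on a tiny made-up frame (hypothetical values). The first level of each categorical column is dropped and becomes the reference category, which is why `day_Fri` does not appear in the output above. `dtype=int` is passed here only so the dummies print as 0/1.

```python
import pandas as pd

# Toy frame (hypothetical): one categorical and one numeric column.
toy = pd.DataFrame({"day": ["Fri", "Sat", "Sun", "Fri"], "size": [2, 3, 2, 4]})

# Numeric columns are kept; categorical ones become 0/1 columns,
# minus the first level ("Fri"), which is the reference category.
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies.columns.tolist())
print(dummies)
```

A row with `day_Sat == 0` and `day_Sun == 0` is therefore a Friday row in this toy example.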


### First model


---


Exercise:


* Build a first model using the variable total_bill to explain the variable tip. Use 30% of the data for testing.


* What is the interpretation of the coefficient?
* What is the interpretation of the intercept?
* Evaluate the model with the metrics above.


Hint:


When using a single variable, the data must be reshaped into a matrix (a 2-D array with one column). To do so:


``` python
model.fit(np.array(X_train).reshape(-1,1), Y_train)
```
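
A short sketch of what the reshape does, using three made-up values: a single column selected with `df['col']` is 1-D, while sklearn expects a 2-D feature matrix. Selecting with double brackets is shown as an alternative that also yields a 2-D object.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values).
df_toy = pd.DataFrame({"total_bill": [16.99, 10.34, 21.01]})

as_series = df_toy["total_bill"]                # 1-D: shape (3,)
as_matrix = np.array(as_series).reshape(-1, 1)  # 2-D: shape (3, 1)
as_frame = df_toy[["total_bill"]]               # 2-D dataframe: shape (3, 1)

print(as_series.shape, as_matrix.shape, as_frame.shape)
```

In `reshape(-1, 1)`, the `-1` lets NumPy infer the number of rows and the `1` fixes a single column.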




#### Solution

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(df_tips['total_bill'],
                                                    df_tips['tip'],
                                                    test_size=.3,
                                                    random_state=15)

In [None]:
X_train.head()

136    10.33
47     32.40
129    22.82
148     9.78
95     40.17
Name: total_bill, dtype: float64

In [None]:
model = LinearRegression()
model.fit(np.array(X_train).reshape(-1,1), Y_train)

In [None]:
model.coef_

array([0.10460459])

In [None]:
model.intercept_

0.9544635765372389
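
One way to read these two numbers is to plug them into the fitted line by hand. The sketch below uses the coefficient and intercept printed above and a hypothetical bill of 20 dollars.

```python
# Values copied from the outputs above.
coef = 0.10460459
intercept = 0.9544635765372389

# Hypothetical bill amount, chosen only for illustration.
bill = 20.0

# Fitted line: tip = intercept + coef * total_bill
predicted_tip = intercept + coef * bill
print(round(predicted_tip, 2))
```

Under this model, each extra dollar on the bill is associated with about 0.10 dollars more in tip, and the intercept (about 0.95) is the predicted tip for a bill of 0, which lies outside the range of real bills.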

In [None]:
x_range = np.arange(0,10)

px.line(x=x_range, y=model.predict(x_range.reshape(-1,1)))

In [None]:
predicao = model.predict(np.array(X_test).reshape(-1,1))
real = Y_test


px.scatter(x=X_test, y=[Y_test, predicao])

In [None]:
mean_squared_error(Y_test, model.predict(np.array(X_test).reshape(-1,1)))

1.1498071109044476

In [None]:
mean_absolute_error(Y_test, model.predict(np.array(X_test).reshape(-1,1)))

0.7931003781497372

In [None]:
r2_score(Y_test, model.predict(np.array(X_test).reshape(-1,1)))

0.3853346512142004

### Full model


---
Exercise:


* Build a model using all the variables available in the dataset. Use 30% of the data for testing.


* What is the interpretation of the coefficient?
* What is the interpretation of the intercept?
* Evaluate the model. Is this model better than the previous one?

Hint:

To get a dataframe with the coefficients and their respective names:

``` python
pd.DataFrame(model.coef_, index=X_train.columns[X_train.columns!='tip'])
```

#### Solution

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(df_tips.loc[:, df_tips.columns != 'tip'],
                                                    df_tips['tip'],
                                                    test_size=.3,
                                                    random_state=15)

In [None]:
model = LinearRegression()
model.fit(X_train, Y_train)

In [None]:
pd.DataFrame(model.coef_, index=df_tips.columns[df_tips.columns!='tip'])

Unnamed: 0,0
total_bill,0.094404
size,0.185418
sex_Male,-0.023921
smoker_Yes,-0.071442
day_Sat,0.046031
day_Sun,0.126079
day_Thur,-0.288821
time_Lunch,0.410427


In [None]:
model.intercept_

0.6153202594823224

In [None]:
mean_squared_error(Y_test, model.predict(X_test))

1.1488381500076714

In [None]:
mean_absolute_error(Y_test, model.predict(X_test))

0.7935295518631128

In [None]:
r2_score(Y_test, model.predict(X_test))

0.3858526395636622
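
To compare the two tips models, the test-set metrics printed above can be placed side by side. This is only a convenience sketch. The numbers are copied from the outputs of this notebook and the row labels are made up.

```python
import pandas as pd

# Metrics copied from the outputs above (same split, random_state=15).
metrics = pd.DataFrame(
    {
        "mse": [1.1498071109044476, 1.1488381500076714],
        "mae": [0.7931003781497372, 0.7935295518631128],
        "r2": [0.3853346512142004, 0.3858526395636622],
    },
    index=["total_bill only", "all variables"],
)

# Difference: full model minus single-variable model.
diff = metrics.loc["all variables"] - metrics.loc["total_bill only"]
print(metrics)
print(diff)
```

The differences are on the order of 0.001, so on this split the extra variables add very little beyond `total_bill`.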

## Training a regression model - car price dataset

---
Exercise:

* Analyze the car price dataset using correlation.
* Are any variables correlated?

#### Solution

In [None]:
df.corr()

Unnamed: 0.1,Unnamed: 0,seats,kms_driven,mileage,engine,max_power,torque,price,registration_tratada_antes_2010,registration_tratada_depois 2020,fuel_type_Diesel,fuel_type_Petrol,ownsership_tratado_First Owner,ownsership_tratado_None,ownsership_tratado_Second Owner,ownsership_tratado_Third Owner,transmission_tratado_Manual,transmission_tratado_None
Unnamed: 0,1.0,-0.053076,0.047386,-0.052555,0.005692,0.005692,0.021863,-0.015435,0.048447,-0.072193,-0.026525,0.026913,0.062119,-0.080727,-0.045189,0.042692,0.154855,-0.080727
seats,-0.053076,1.0,0.045104,0.069348,0.191673,0.191673,-0.025201,-0.013211,-0.028311,0.017562,0.313214,-0.313372,-0.058751,-0.027815,0.091508,-0.037561,-0.122982,-0.027815
kms_driven,0.047386,0.045104,1.0,-0.100283,-0.041644,-0.041644,0.059113,0.003888,0.047717,-0.329386,0.146317,-0.138685,-0.063032,-0.039234,0.056002,0.090806,0.124208,-0.039234
mileage,-0.052555,0.069348,-0.100283,1.0,0.309145,0.309145,-0.002287,0.036203,0.003507,0.12391,0.069891,-0.083191,-0.15737,0.473974,-0.04355,-0.024431,-0.092391,0.473974
engine,0.005692,0.191673,-0.041644,0.309145,1.0,1.0,-0.009431,-0.003196,-0.005946,0.116568,0.095375,-0.092406,-0.014323,-0.012035,0.024381,-0.00789,-0.072724,-0.012035
max_power,0.005692,0.191673,-0.041644,0.309145,1.0,1.0,-0.009431,-0.003196,-0.005946,0.116568,0.095375,-0.092406,-0.014323,-0.012035,0.024381,-0.00789,-0.072724,-0.012035
torque,0.021863,-0.025201,0.059113,-0.002287,-0.009431,-0.009431,1.0,-0.006461,0.05149,-0.029259,-0.047844,0.05108,-0.078649,0.087637,0.00555,0.125008,0.025616,0.087637
price,-0.015435,-0.013211,0.003888,0.036203,-0.003196,-0.003196,-0.006461,1.0,0.196729,-0.023591,-0.029655,0.030735,-0.087081,-0.007667,0.053312,0.146854,0.037526,-0.007667
registration_tratada_antes_2010,0.048447,-0.028311,0.047717,0.003507,-0.005946,-0.005946,0.05149,0.196729,1.0,-0.051003,-0.031114,0.033432,-0.084427,-0.01578,0.002912,0.18067,0.052247,-0.01578
registration_tratada_depois 2020,-0.072193,0.017562,-0.329386,0.12391,0.116568,0.116568,-0.029259,-0.023591,-0.051003,1.0,-0.079249,0.067514,0.140543,-0.034451,-0.123646,-0.041907,-0.211669,-0.034451


In [None]:
df = df.drop(columns=['max_power'])
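
The drop above follows from the correlation of 1.0 between `engine` and `max_power`. As a sketch of how such pairs could be found programmatically, the snippet below scans a correlation matrix on a made-up frame. The values and the 0.999 threshold are assumptions.

```python
import pandas as pd

# Toy frame (hypothetical): max_power duplicates engine.
toy = pd.DataFrame(
    {
        "engine": [1000.0, 1500.0, 2000.0, 3000.0],
        "max_power": [1000.0, 1500.0, 2000.0, 3000.0],
        "seats": [5, 7, 5, 4],
    }
)

corr = toy.corr()
cols = corr.columns

# Collect each pair of distinct columns whose |correlation| is close to 1.
duplicates = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > 0.999
]
print(duplicates)
```

Keeping both columns of such a pair gives the regression two copies of the same information, so one of them is removed.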

In [None]:
df[df.transmission_tratado_None==1]

Unnamed: 0.1,Unnamed: 0,seats,kms_driven,mileage,engine,torque,price,registration_tratada_antes_2010,registration_tratada_depois 2020,fuel_type_Diesel,fuel_type_Petrol,ownsership_tratado_First Owner,ownsership_tratado_None,ownsership_tratado_Second Owner,ownsership_tratado_Third Owner,transmission_tratado_Manual,transmission_tratado_None
116,116,5,60000,2993.0,26149.0,620.0,55.0,0,0,1,0,0,1,0,0,0,1
136,136,5,60000,2993.0,26149.0,620.0,55.0,0,0,1,0,0,1,0,0,0,1
170,170,5,100000,1461.0,1085.0,248.0,6.0,0,0,1,0,0,1,0,0,0,1
190,190,5,100000,1461.0,1085.0,248.0,6.0,0,0,1,0,0,1,0,0,0,1
210,210,5,10000,998.0,11841.0,172.0,11.7,0,1,0,1,0,1,0,0,0,1
213,213,5,30000,1995.0,188.0,400.0,19.5,0,0,1,0,0,1,0,0,0,1
228,228,6,20000,1451.0,141.0,250.0,17.5,0,1,0,1,0,1,0,0,0,1
231,231,5,32000,1995.0,188.0,400.0,18.2,0,0,1,0,0,1,0,0,0,1
233,233,5,32000,1995.0,188.0,400.0,18.2,0,0,1,0,0,1,0,0,0,1
234,234,5,32000,1995.0,188.0,400.0,18.2,0,0,1,0,0,1,0,0,0,1


In [None]:
df=df[df.transmission_tratado_None==0]

In [None]:
df = df.drop(columns=['ownsership_tratado_None', 'transmission_tratado_None'])

### First model

---

Exercise:


* Build a first model using the variable kms_driven to explain the variable price. Use 30% of the data for testing.


* What is the interpretation of the coefficient?
* What is the interpretation of the intercept?
* Evaluate the model.

#### Solution

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(df['kms_driven'],
                                                    df['price'],
                                                    test_size=.3,
                                                    random_state=15)

In [None]:
X_train.head()

985     60000
1421    44000
1151    75000
321     40000
736     25000
Name: kms_driven, dtype: int64

In [None]:
Y_train.head()

985     38.00
1421     4.65
1151     5.70
321      9.99
736     64.00
Name: price, dtype: float64

In [None]:
X_train.shape, X_test.shape

((1052,), (451,))

In [None]:
model = LinearRegression()
model.fit(np.array(X_train).reshape(-1,1), Y_train)

In [None]:
model.coef_

array([0.00038978])

In [None]:
model.intercept_

61.21400403353527

In [None]:
model.predict(np.array([[100]]))

array([61.25298232])

In [None]:
import plotly.express as px

x_range = np.arange(0,10000)

px.line(x=x_range, y=model.predict(x_range.reshape(-1,1)))

In [None]:
px.scatter(x=X_train, y=Y_train, trendline="ols")

In [None]:
mean_squared_error(Y_test, model.predict(np.array(X_test).reshape(-1,1)))

30820990.201757334

In [None]:
mean_absolute_error(Y_test, model.predict(np.array(X_test).reshape(-1,1)))

434.27601759530245

In [None]:
r2_score(Y_test, model.predict(np.array(X_test).reshape(-1,1)))

-0.0028804715663206526
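
An R2 slightly below zero means the model does no better on the test set than always predicting the mean of the test target. The sketch below illustrates this with made-up numbers and a hand-written `r2` helper (hypothetical, mirroring the usual definition).

```python
import numpy as np

def r2(y_true, y_pred):
    """R2 = 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Hypothetical target values.
y_true = np.array([1.0, 2.0, 3.0])

# Always predicting the mean gives R2 = 0.
r2_mean = r2(y_true, np.full(3, y_true.mean()))

# Predictions worse than the mean give a negative R2.
r2_worse = r2(y_true, np.array([3.0, 2.0, 1.0]))

print(r2_mean, r2_worse)
```

In this toy case the reversed predictions have a residual sum of squares of 8 against a total of 2, so R2 is 1 - 4 = -3.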

### Full model
---
Exercise:


* Build a model using all the variables available in the dataset. Use 30% of the data for testing.

* What is the interpretation of the coefficient?
* What is the interpretation of the intercept?
* Evaluate the model. Is this model better than the previous one?

Remove the observations where price is greater than 90 and train a new model.

* What is the interpretation of the coefficient?
* What is the interpretation of the intercept?
* Plot the prediction against the actual value. Do you notice anything strange?


#### Solution

In [None]:
var = [col for col in df.columns if col!='price']

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(df[var],
                                                    df['price'],
                                                    test_size=.3,
                                                    random_state=15)

In [None]:
model = LinearRegression()
model.fit(X_train, Y_train)

In [None]:
pd.DataFrame(model.coef_, index=var)

Unnamed: 0,0
Unnamed: 0,-0.218519
seats,-34.28686
kms_driven,0.0004391538
mileage,0.3328505
engine,-2.424372e-10
torque,-8.000881e-05
registration_tratada_antes_2010,-143.025
registration_tratada_depois 2020,-42.5884
fuel_type_Diesel,214.8241
fuel_type_Petrol,277.3297


In [None]:
df.price.quantile([.9, .95, .99])

0.90    43.00
0.95    58.55
0.99    84.96
Name: price, dtype: float64

In [None]:
df = df[df.price < 90]
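
The quantiles above show that 99% of the prices are below 85, so the filter removes only a small tail of extreme values. The sketch below shows, on made-up prices, how a single extreme value can dominate summary statistics and why such a filter changes the fit so much.

```python
import pandas as pd

# Hypothetical prices with one extreme value.
prices = pd.Series([5.0, 10.0, 15.0, 20.0, 1000.0], name="price")

mean_all = prices.mean()    # pulled up by the extreme value

# Same kind of boolean filter as df[df.price < 90].
kept = prices[prices < 90]
mean_kept = kept.mean()

print(mean_all, mean_kept, len(kept))
```

In this toy example the mean drops from 210 to 12.5 after one of the five rows is removed.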

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(df[var],
                                                    df['price'],
                                                    test_size=.3,
                                                    random_state=15)
model = LinearRegression()
model.fit(X_train, Y_train)

In [None]:
pd.DataFrame(model.coef_, index=var)

Unnamed: 0,0
Unnamed: 0,-0.001859016
seats,1.316171
kms_driven,-9.928134e-05
mileage,0.002511354
engine,-1.30469e-11
torque,3.322378e-06
registration_tratada_antes_2010,-1.444792
registration_tratada_depois 2020,6.669329
fuel_type_Diesel,9.781439
fuel_type_Petrol,4.522822


In [None]:
model.intercept_

9.558894899452742

In [None]:
r2_score(Y_test, model.predict(X_test))

0.36568109577361774

In [None]:
mean_absolute_error(Y_test, model.predict(X_test))

8.501375139755494

In [None]:
px.scatter(x=Y_train, y=model.predict(X_train))

### Transforming the data
---
Exercise:

* Transform the variable price to its logarithm
* Fit a model using the logarithm
* Plot the exponential of the prediction against the actual value. Did the model improve?

#### Solution

In [None]:
model = LinearRegression()
model.fit(X_train, np.log(Y_train))
pd.DataFrame(np.round(model.coef_,4), index=var)

Unnamed: 0,0
Unnamed: 0,-0.0002
seats,0.1519
kms_driven,-0.0
mileage,-0.0
engine,-0.0
torque,-0.0
registration_tratada_antes_2010,-0.8112
registration_tratada_depois 2020,0.3777
fuel_type_Diesel,0.6126
fuel_type_Petrol,0.1474


In [None]:
px.scatter(x=np.log(Y_train), y=model.predict(X_train))

In [None]:
px.scatter(x=Y_test, y=np.exp(model.predict(X_test)))
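
The log-then-exp round trip used above can be checked on synthetic data where the target grows exponentially with the feature. The data below is an assumption made for illustration: fitting on `np.log(y)` recovers the linear relation, and `np.exp` brings the predictions back to the original scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): log(y) = 0.3*x + 1 with no noise.
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.exp(0.3 * X.ravel() + 1.0)

# Fit on the log of the target.
model = LinearRegression().fit(X, np.log(y))

pred_log = model.predict(X)    # predictions on the log scale
pred_price = np.exp(pred_log)  # back to the original scale

print(model.coef_[0], model.intercept_)
```

In a model fitted on the log of the target, a coefficient `b` corresponds to multiplying the predicted value by `exp(b)` for a one-unit increase in that variable, rather than adding a fixed amount.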