# Feature Engineering for Linear Regression

## *Proposed Scenario*

- Get the best R² score from a linear model
- Use only plain linear regression (scikit-learn's `LinearRegression`)
- Use feature engineering!

### Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

- Data

In [20]:
 
df = pd.read_csv('./visits.csv', sep='\t')
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date')

df_original = df.copy()

- Creating temporal markers

In [3]:
 
df['day_of_week'] = df['date'].dt.day_of_week
df['month'] = df['date'].dt.month
df['day_of_month'] = df['date'].dt.day


- Boolean: payday window (days 1 to 5 and 28 to 31 of the month) => 1

In [4]:
df['is_payday'] = df['day_of_month'].isin([1, 2, 3, 4, 5, 28, 29, 30, 31]).astype(int)
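
As a quick sanity check of the payday flag, the sketch below applies the same day list to a toy one-month calendar. The `toy` frame and the `payday_days` name are illustrative assumptions and are not part of the `visits.csv` data.

```python
import pandas as pd

# Hypothetical toy calendar covering January 2023 (not the visits.csv data).
toy = pd.DataFrame({"date": pd.date_range("2023-01-01", "2023-01-31", freq="D")})
toy["day_of_month"] = toy["date"].dt.day

# Same day list as the notebook: the first five and the last four calendar days.
payday_days = [1, 2, 3, 4, 5, 28, 29, 30, 31]
toy["is_payday"] = toy["day_of_month"].isin(payday_days).astype(int)

n_payday = int(toy["is_payday"].sum())
print(n_payday)  # 9 flagged days in a 31-day month
```

In shorter months fewer days are flagged, because days 29 to 31 may not exist.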

- Creating interaction variables between the days of the week and the months of the year

In [5]:

 
df['day_month_inter'] = df['day_of_week'].astype(str) + '_' + df['month'].astype(str)
df = pd.concat([df, pd.get_dummies(df['day_month_inter'], prefix='inter', drop_first=True, dtype=float)], axis=1)
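
To make the interaction encoding concrete, here is a minimal sketch on four hand-picked dates (an assumption for illustration, not rows taken from the dataset). Each weekday and month pair becomes its own column, and `drop_first=True` drops one pair, which then acts as the baseline absorbed by the intercept.

```python
import pandas as pd

# Hypothetical toy dates: a Monday and a Tuesday in January and in February 2023.
toy = pd.DataFrame(
    {"date": pd.to_datetime(["2023-01-02", "2023-01-03", "2023-02-06", "2023-02-07"])}
)
toy["day_of_week"] = toy["date"].dt.dayofweek  # Monday = 0
toy["month"] = toy["date"].dt.month

# Build the "weekday_month" key and one-hot encode it, as in the cell above.
toy["day_month_inter"] = toy["day_of_week"].astype(str) + "_" + toy["month"].astype(str)
dummies = pd.get_dummies(toy["day_month_inter"], prefix="inter", drop_first=True, dtype=float)

print(list(dummies.columns))  # ['inter_0_2', 'inter_1_1', 'inter_1_2']
```

The first row (Monday in January, key `0_1`) has zeros in every remaining column, so it is the reference category.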
 

- Dummy: is it a holiday?

In [6]:
feriados = [
    '2023-01-01', '2023-02-20', '2023-02-21', '2023-02-22',
    '2023-04-07', '2023-04-21', '2023-05-01', '2023-06-08',
    '2023-09-07', '2023-10-12', '2023-11-02', '2023-11-15', '2023-12-25'
]
df['is_holiday'] = df['date'].isin(pd.to_datetime(feriados)).astype(int)
df['is_holiday']

0      1
1      0
2      0
3      0
4      0
      ..
313    0
314    0
315    0
316    0
317    0
Name: is_holiday, Length: 318, dtype: int64

- Creating lags of 1 to 14 days and 7-day rolling statistics (mean, max, min), all computed on past values only

In [7]:
for i in range(1, 15):
    df[f'lag_{i}'] = df['visits'].shift(i)
 
df['roll_mean_7'] = df['visits'].shift(1).rolling(window=7).mean()
df['roll_max_7'] = df['visits'].shift(1).rolling(window=7).max()
df['roll_min_7'] = df['visits'].shift(1).rolling(window=7).min()
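
The `shift(1)` before `rolling` matters: without it, the window that ends on a given day would include that day's own `visits` value, leaking the target into the feature. A minimal sketch on a toy series (a window of 3 is assumed here instead of 7 to keep the output short):

```python
import pandas as pd

# Hypothetical toy series standing in for df['visits'].
visits = pd.Series([10, 20, 30, 40, 50], name="visits")

# Without the shift, the window ending at row t includes visits[t] itself.
leaky = visits.rolling(window=3).mean()

# With shift(1), the window ending at row t covers rows t-3 .. t-1 only.
safe = visits.shift(1).rolling(window=3).mean()

print(leaky.tolist())  # [nan, nan, 20.0, 30.0, 40.0]
print(safe.tolist())   # [nan, nan, nan, 20.0, 30.0]
```

The shifted version produces one extra leading `NaN`. In the notebook, rows with missing lag or rolling values are removed later by `dropna()`.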

- Preparing the data: one dummy per extreme-outlier date, then dropping the rows left incomplete by the lags

In [8]:

outliers_extremos = [
    '2023-11-14', '2023-06-10', '2023-08-08', '2023-04-09', '2023-08-13', '2023-05-14',
    '2023-08-15', '2023-04-28', '2023-02-18', '2023-08-11', '2023-08-26'
]
 
for date in outliers_extremos:
    col_name = f'special_{date}'
    df[col_name] = (df['date'] == pd.to_datetime(date)).astype(int)
 
df_model = df.dropna().copy()
y = df_model['visits']
cols_to_drop = ['date', 'visits', 'day_of_week', 'month', 'day_of_month', 'day_month_inter']
X = df_model.drop(cols_to_drop, axis=1)
X = X.astype(float)
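
A dummy that is 1 on a single row lets least squares fit that row exactly, so the outlier no longer pulls on the other coefficients. The sketch below illustrates this with NumPy on synthetic data; the data, the outlier position, and the variable names are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: a clean linear trend plus small noise.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=10)
y[4] += 50.0  # inject one extreme outlier at row 4

# Design matrix: intercept, x, and a dummy that is 1 only on the outlier row.
dummy = np.zeros(10)
dummy[4] = 1.0
A = np.column_stack([np.ones(10), x, dummy])

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ coef

print(round(float(coef[1]), 2))     # slope stays close to 2.0
print(abs(residuals[4]) < 1e-6)     # True: the flagged row is fitted exactly
```

Because each flagged day ends up with a zero residual, these dummies also raise an R² that is measured on the training data.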
 


### Training and evaluating our model

In [9]:
model = LinearRegression()
model.fit(X, y)
score = model.score(X, y)
 
print("-" * 30)
print(f"R² (Scikit-Learn): {score:.4f}")
print("-" * 30)


------------------------------
R² (Scikit-Learn): 0.8888
------------------------------
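
For reference, `LinearRegression.score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. Here it is evaluated on the same `X` and `y` used for fitting, so it is a training score. A minimal sketch of the formula on made-up numbers (the arrays below are illustrative assumptions):

```python
import numpy as np

# Hypothetical targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares: 0.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 20.0
r2 = 1.0 - ss_res / ss_tot

print(r2)  # 0.975
```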


## Comparison

- To show how much feature engineering matters, I will train a model on only a few attributes!

In [13]:
df_original.head()

Unnamed: 0,date,visits
0,2023-01-01,56
1,2023-01-02,52
2,2023-01-03,83
3,2023-01-04,84
4,2023-01-05,94


In [21]:
df_original['date'] = pd.to_datetime(df_original['date'])

df_original['day_of_week'] = df_original['date'].dt.day_of_week

df_original = pd.concat([df_original, pd.get_dummies(df_original['day_of_week'], prefix='inter', drop_first=True, dtype=int)], axis=1)
df_original

Unnamed: 0,date,visits,day_of_week,inter_1,inter_2,inter_3,inter_4,inter_5,inter_6
0,2023-01-01,56,6,0,0,0,0,0,1
1,2023-01-02,52,0,0,0,0,0,0,0
2,2023-01-03,83,1,1,0,0,0,0,0
3,2023-01-04,84,2,0,1,0,0,0,0
4,2023-01-05,94,3,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...
313,2023-11-10,155,4,0,0,0,1,0,0
314,2023-11-11,126,5,0,0,0,0,1,0
315,2023-11-12,67,6,0,0,0,0,0,1
316,2023-11-13,140,0,0,0,0,0,0,0


In [22]:
y_ofc = df_original['visits']

cols_to_drop = ['date', 'visits', 'day_of_week']
X_ofc = df_original.drop(cols_to_drop, axis=1)
X_ofc = X_ofc.astype(float)

In [23]:
model.fit(X_ofc, y_ofc)
score = model.score(X_ofc, y_ofc)

score

0.5007761278265374

## Conclusion

- Our linear model's R² went from 0.50 to 0.89 (0.5008 to 0.8888) using feature engineering techniques alone. Both scores are measured on the same data the model was fitted on.