
In [81]:
!pip install scikit-optimize boruta

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [82]:
import pandas as pd
import numpy as np
from boruta import BorutaPy
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
from skopt import BayesSearchCV

1. CRIM: per capita crime rate by town

2. ZN: proportion of residential land zoned for lots over 25,000 square feet

3. INDUS: proportion of non-retail business acres per town

4. CHAS: Charles River dummy variable (= 1 if the tract bounds the river; 0 otherwise)

5. NOX: nitric oxides concentration (parts per 10 million)

6. RM: average number of rooms per dwelling

7. AGE: proportion of owner-occupied units built before 1940

8. DIS: weighted distances to five Boston employment centres

9. RAD: index of accessibility to radial highways

10. TAX: full-value property-tax rate per $10,000

11. PTRATIO: pupil-teacher ratio by town

12. B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town

13. LSTAT: % lower status of the population

14. MEDV: median value of owner-occupied homes in US$ 1,000s

In [83]:
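url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# The file has no header row and is whitespace-delimited.
# sep=r'\s+' replaces delim_whitespace=True, which is deprecated in recent pandas versions.
data = pd.read_csv(url, header=None, names=cols, sep=r'\s+')
# Features: every column except the target; target: median home value (MEDV)
X = data.drop('MEDV', axis=1)
y = data['MEDV']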
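The loading step can be tried without downloading anything. The following is a minimal sketch, assuming two rows copied from the table displayed in the next cell; the `sample` string and its spacing are illustrative and are not the exact layout of the downloaded file. It uses `sep=r'\s+'`, which splits on any run of whitespace.

```python
import io
import pandas as pd

cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
        'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# Two whitespace-delimited rows (values from the displayed table;
# the spacing is an assumption made for illustration).
sample = (
    "0.00632  18.0   2.31  0  0.538  6.575  65.2  4.0900   1  296.0  15.3  396.90   4.98  24.0\n"
    "0.02731   0.0   7.07  0  0.469  6.421  78.9  4.9671   2  242.0  17.8  396.90   9.14  21.6\n"
)

# No header row in the data, so the column names are supplied explicitly.
df = pd.read_csv(io.StringIO(sample), header=None, names=cols, sep=r'\s+')

# Same feature/target split as in the notebook.
X_sample = df.drop('MEDV', axis=1)
y_sample = df['MEDV']
print(X_sample.shape, y_sample.tolist())
```

The same call should work on the full file by passing the URL instead of the `io.StringIO` buffer.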


In [84]:
data

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296.0,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273.0,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273.0,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273.0,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273.0,21.0,393.45,6.48,22.0


In [87]:
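# Shallow random forest that Boruta uses to compute feature importances
model = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
feat_selector = BorutaPy(
    verbose=0,
    estimator=model,
    n_estimators='auto',  # let Boruta pick the number of trees from the dataset size
    max_iter=10           # few iterations; some features may be left undecided
)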

In [88]:
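# Boruta expects NumPy arrays. Note: the selector is fit on the full dataset,
# before the train/test split, so test rows also influence the selection.
feat_selector.fit(X.values, y.values.ravel())
X_selected = feat_selector.transform(X.values)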
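The idea behind Boruta is to compare each real feature against "shadow" copies whose values have been shuffled. The sketch below shows a single round of that comparison on synthetic data. It is a simplified illustration, not BorutaPy's implementation: the real algorithm repeats the round many times and applies a statistical test to the accumulated hits. The names `X_demo`, `y_demo` and `X_shadow` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): the target depends only on
# columns 0 and 1; column 2 is pure noise.
n = 300
X_demo = rng.normal(size=(n, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=n)

# Shadow features: each column shuffled independently, which keeps its
# distribution but destroys any relationship with the target.
X_shadow = np.column_stack(
    [rng.permutation(X_demo[:, j]) for j in range(X_demo.shape[1])]
)

# Fit one forest on real and shadow features together.
forest = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
forest.fit(np.hstack([X_demo, X_shadow]), y_demo)

importances = forest.feature_importances_
real_importance = importances[:3]
shadow_max = importances[3:].max()

# A real feature scores a "hit" in this round if it beats the best shadow.
hits = real_importance > shadow_max
print(hits)
```

On the fitted selector from the cell above, `feat_selector.support_` holds the boolean mask of confirmed features and `feat_selector.ranking_` their ranks, so `X.columns[feat_selector.support_]` lists the columns that `transform` keeps.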

In [89]:
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=1)

In [90]:
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

In [91]:
params = {
    'regressor__fit_intercept': [True, False],
    'regressor__copy_X': [True, False],
    'regressor__n_jobs': [-1]
}

In [92]:
search = BayesSearchCV(pipe, params, n_iter=20, cv=2, random_state=1)
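The search space defined above is small: 2 × 2 × 1 = 4 distinct configurations, fewer than the `n_iter=20` evaluations requested. Only `fit_intercept` changes the fitted model; `copy_X` and `n_jobs` affect memory handling and parallelism, not the predictions. A minimal sketch that enumerates the space in plain Python:

```python
from itertools import product

params = {
    'regressor__fit_intercept': [True, False],
    'regressor__copy_X': [True, False],
    'regressor__n_jobs': [-1],
}

# Cartesian product of all candidate values, one dict per configuration.
names = list(params)
configs = [dict(zip(names, values)) for values in product(*params.values())]

print(len(configs))
for cfg in configs:
    print(cfg)
```

With a space this small, an exhaustive search such as `GridSearchCV` (imported at the top of the notebook but unused) would evaluate the same four candidates.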

In [93]:
search.fit(X_train, y_train)



In [94]:
# With the default scoring, score() returns R² for this regressor pipeline.
train_score = search.score(X_train, y_train)
test_score = search.score(X_test, y_test)

In [95]:
print(f'Train Score: {train_score:.2f}')
print(f'Test Score: {test_score:.2f}')

Train Score: 0.70
Test Score: 0.73
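Assuming `BayesSearchCV` was left with its default scoring, as in the call above, the score printed here is the coefficient of determination R² reported by the pipeline's own `score` method. A minimal sketch of how R² is computed, using made-up numbers and a hypothetical helper `r2_score_manual`:

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

# Perfect predictions give the maximum score.
perfect = r2_score_manual(y_true, y_true)

# Always predicting the mean of y_true (6.0) is the zero-score baseline.
baseline = r2_score_manual(y_true, [6.0] * 4)

print(perfect, baseline)
```

Read this way, a test score of 0.73 means the model explains roughly 73% of the variance of MEDV on the held-out rows.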


In [96]:
y_pred_train = search.predict(X_train)
y_pred_test = search.predict(X_test)

In [97]:
mse_train = mean_squared_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)
rmse_train = np.sqrt(mse_train)
rmse_test = np.sqrt(mse_test)
mae_train = mean_absolute_error(y_train, y_pred_train)
mae_test = mean_absolute_error(y_test, y_pred_test)

In [98]:
print("Train MSE: ", mse_train)
print("Test MSE: ", mse_test)
print("Train RMSE: ", rmse_train)
print("Test RMSE: ", rmse_test)
print("Train MAE: ", mae_train)
print("Test MAE: ", mae_test)

Train MSE:  24.02616797276034
Test MSE:  26.734358645459576
Train RMSE:  4.901649515495813
Test RMSE:  5.170527888471309
Train MAE:  3.3773302991540137
Test MAE:  4.068134385962243
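The three error metrics follow directly from the residuals. The sketch below computes them by hand with NumPy; `y_true` reuses the first four MEDV values from the table above, while `y_pred` is made up for illustration and does not come from the fitted model.

```python
import numpy as np

y_true = np.array([24.0, 21.6, 34.7, 33.4])  # first four MEDV values from the table
y_pred = np.array([26.0, 21.6, 30.7, 33.4])  # made-up predictions (assumption)

errors = y_true - y_pred

mse = np.mean(errors ** 2)      # mean squared error, in squared target units
rmse = np.sqrt(mse)             # root of the MSE, back in the target's units
mae = np.mean(np.abs(errors))   # mean absolute error, less sensitive to outliers

print(mse, rmse, mae)
```

Because MEDV is expressed in US$ 1,000s, the RMSE and MAE are in the same unit: a test RMSE of about 5.17 corresponds to a typical error of roughly US$ 5,170.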
