### Installing libs

In [1]:
# pip install pydotplus

### Column description

| Variable 	| Description 	|
|:-:	|:-	|
|CRIM   | Per capita crime rate by town 	|
|ZN     | Proportion of residential land zoned for lots over 25,000 square feet (approximately 2,320 square meters)|
|INDUS  | Proportion of non-retail business acres per town|
|CHAS   | Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)|
|NOX    | Nitric oxides concentration (parts per 10 million)|
|RM     | Average number of rooms per dwelling|
|AGE    | Proportion of owner-occupied units built before 1940|
|DIS    | Weighted distances to five Boston employment centers|
|RAD    | Index of accessibility to radial highways|
|TAX    | Full-value property tax rate per 10,000 dollars|
|PTRATIO| Pupil-teacher ratio by town|
|B      | 1000(Bk - 0.63) ^ 2, where Bk is the proportion of Black residents by town (dataset from 1978)|
|LSTAT  | Percentage of the population with lower status|
|MEDV   | Median value of owner-occupied homes in units of 1,000 dollars|


### Importing Libs

In [6]:
# dataset import
from sklearn.datasets import (
    load_boston
)

# data visualization
%matplotlib inline
import seaborn as sns

import matplotlib.pyplot as plt

from seaborn import (
    jointplot,
    pairplot,
    boxplot,
    heatmap
)

from yellowbrick.features import (
    Rank2D, 
    RadViz,
    FeatureImportances,
    ParallelCoordinates,
    JointPlotVisualizer,
)

from yellowbrick.classifier import (
    ConfusionMatrix
)

import dtreeviz

import pydotplus

from io import(
    StringIO
)

from IPython.display import (
    Image
)

# data manipulation
import numpy as np
import pandas as pd
from pandas.plotting import(
    radviz
)

import janitor as jn

from ydata_profiling import ProfileReport

# missing values
import missingno as msno

from sklearn.impute import (
    SimpleImputer
)

# machine learning models
from sklearn import (
    svm,
    tree,
    impute,
    ensemble,
    preprocessing,
    model_selection
)

from sklearn.utils import (
    resample
)

from sklearn.dummy import (
    DummyClassifier
)

from sklearn.model_selection import (
    train_test_split
)

from sklearn.experimental import (
    enable_iterative_imputer
)

from sklearn.linear_model import (
    LogisticRegression
)

from sklearn.naive_bayes import (
    GaussianNB
)

from sklearn.tree import (
    DecisionTreeClassifier,
    export_graphviz,
    plot_tree
)

from sklearn.neighbors import (
    KNeighborsClassifier
)

from sklearn.svm import (
    SVC
)

from sklearn.ensemble import (
    RandomForestClassifier
)

from imblearn.over_sampling import (
    RandomOverSampler,
)

from sklearn.dummy import (
    DummyRegressor
)

import shap

import rfpimp

import lightgbm as lgb

import xgboost as xgb

import xgbfir

# data model metrics
from lime import (
    lime_tabular
)

from treeinterpreter import (
    treeinterpreter as ti
)

from sklearn.metrics import (
    auc,
    f1_score,
    roc_curve,
    recall_score,
    roc_auc_score,
    accuracy_score,
    precision_score,
    confusion_matrix,
    average_precision_score
)

import scikitplot as skplt

from yellowbrick.classifier import (
    ROCAUC,
    ClassBalance,
    ConfusionMatrix,
    ClassPredictionError,
    ClassificationReport,
    PrecisionRecallCurve,
    DiscriminationThreshold,
)

from yellowbrick.model_selection import (
    LearningCurve,
    ValidationCurve,
)

# data prep-model
from sklearn.model_selection import (
    train_test_split,
    StratifiedKFold,
    learning_curve
)

# model deploy
import pickle

### Regression
**Regression** is a supervised machine learning process. It is similar to classification, but instead of predicting a label, it predicts a continuous (numeric) value.<br><br>
**sklearn** can apply many of the same model families used for **classification** to **regression** problems. The API is also the same: estimators expose *.fit*, *.score*, and *.predict*.<br><br>
For **regression**, we will use the Boston housing dataset.
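
As a minimal sketch of that shared API, the cell below fits a classifier and a regressor from the same model family on synthetic data. The data, the variable names, and the choice of decision trees are assumptions made for illustration only; they are not part of the Boston workflow in this notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# synthetic data (illustrative assumption, not the Boston dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y_cont = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)  # continuous target
y_label = (y_cont > 0).astype(int)                                   # binary label

# same three calls for both estimators: fit, predict, score
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_label)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y_cont)

clf_pred = clf.predict(X[:5])       # class labels (0 or 1)
reg_pred = reg.predict(X[:5])       # real-valued predictions

clf_score = clf.score(X, y_label)   # accuracy for classifiers
reg_score = reg.score(X, y_cont)    # R2 for regressors
```

The calls are identical, but *.score* means something different in each case: accuracy for the classifier and the coefficient of determination for the regressor.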

### Reading the Boston Housing Dataset

In [3]:
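# loading boston dataset
# note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn version
b = load_boston()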


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

### Preparing the dataset

In [9]:
# creating X dataset
bos_X = pd.DataFrame(b.data, columns=b.feature_names)

# creating y dataset
bos_y = b.target

# train_test_split dataset
bos_X_train, bos_X_test, bos_y_train, bos_y_test = model_selection.train_test_split(
    bos_X,
    bos_y,
    test_size=0.3,
    random_state=42
)

bos_X

| 	| CRIM 	| ZN 	| INDUS 	| CHAS 	| NOX 	| RM 	| AGE 	| DIS 	| RAD 	| TAX 	| PTRATIO 	| B 	| LSTAT 	|
|:-:	|-:	|-:	|-:	|-:	|-:	|-:	|-:	|-:	|-:	|-:	|-:	|-:	|-:	|
|0|0.00632|18.0|2.31|0.0|0.538|6.575|65.2|4.0900|1.0|296.0|15.3|396.90|4.98|
|1|0.02731|0.0|7.07|0.0|0.469|6.421|78.9|4.9671|2.0|242.0|17.8|396.90|9.14|
|2|0.02729|0.0|7.07|0.0|0.469|7.185|61.1|4.9671|2.0|242.0|17.8|392.83|4.03|
|3|0.03237|0.0|2.18|0.0|0.458|6.998|45.8|6.0622|3.0|222.0|18.7|394.63|2.94|
|4|0.06905|0.0|2.18|0.0|0.458|7.147|54.2|6.0622|3.0|222.0|18.7|396.90|5.33|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|501|0.06263|0.0|11.93|0.0|0.573|6.593|69.1|2.4786|1.0|273.0|21.0|391.99|9.67|
|502|0.04527|0.0|11.93|0.0|0.573|6.120|76.7|2.2875|1.0|273.0|21.0|396.90|9.08|
|503|0.06076|0.0|11.93|0.0|0.573|6.976|91.0|2.1675|1.0|273.0|21.0|396.90|5.64|
|504|0.10959|0.0|11.93|0.0|0.573|6.794|89.3|2.3889|1.0|273.0|21.0|393.45|6.48|


### Preparing the standardized dataset

In [10]:
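# standardizing the features (zero mean, unit variance per column)
# note: the scaler is fit on the full dataset, before the train/test split
bos_sX = preprocessing.StandardScaler().fit_transform(
    bos_X
)

# train_test_split on the standardized dataset (same size and random_state as above)
bos_sX_train, bos_sX_test, bos_sy_train, bos_sy_test = model_selection.train_test_split(
    bos_sX,
    bos_y,
    test_size=0.3,
    random_state=42
)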



### Base model
A baseline regression model gives us a reference against which we can compare other models.<br><br>
In **sklearn**, the default result of the *.score()* method for regressors is the *coefficient of determination* (*r2* or *R2*). This number is the proportion of the variance in the target that the predictions explain. In general, the value will be between 0 and 1, but it can be negative for particularly poor models.<br><br>
The default strategy of *DummyRegressor* is to predict the mean of the training set targets.

In [11]:
dr = DummyRegressor()

dr.fit(bos_X_train, bos_y_train)

dr.score(bos_X_test, bos_y_test)

-0.03469753992352409
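
The slightly negative score above is expected for a mean predictor. A small sketch with made-up numbers (the arrays below are illustrative assumptions, not Boston data) computes R2 by hand as 1 - SS_res / SS_tot and compares it with *.score()*:

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# toy targets (illustrative assumption); features are irrelevant to DummyRegressor
y_train = np.array([10.0, 20.0, 30.0, 40.0])
y_test = np.array([15.0, 25.0, 50.0])
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

dummy = DummyRegressor(strategy="mean").fit(X_train, y_train)
pred = dummy.predict(X_test)                      # every prediction is the training mean

# R2 = 1 - (residual sum of squares) / (total sum of squares around the test mean)
ss_res = np.sum((y_test - pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

train_r2 = dummy.score(X_train, y_train)          # 0: the mean explains no variance
test_r2 = dummy.score(X_test, y_test)             # below 0 when train and test means differ
```

On the training set the mean predictor scores exactly 0. On the test set the constant prediction is the training mean rather than the test mean, so the residual sum of squares exceeds the total sum of squares and the score falls below 0.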

### Linear regression
A simple **linear regression** fits the formula *y = mx + b* by minimizing the sum of squared errors. The fitted model has an *intercept* and a *coefficient*. <br><br>
The *intercept* provides a base value for a prediction, which is then modified by the sum of the products of each coefficient and its input value. This form generalizes to higher dimensions, where each attribute has its own coefficient. The larger the absolute value of a coefficient, the more impact the attribute has on the target, provided the attributes are on comparable scales.<br><br>
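
As a hedged sketch of how the intercept and coefficients combine, the cell below fits *LinearRegression* on noiseless synthetic data (the data and the true values 3.0, 2.0, and -0.5 are assumptions for illustration) and rebuilds the predictions by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data with known intercept and coefficients (illustrative assumption)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = 3.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]   # noiseless linear target

lr = LinearRegression().fit(X, y)

# prediction = intercept + sum of (coefficient * input value)
manual = lr.intercept_ + X[:3] @ lr.coef_
sk_pred = lr.predict(X[:3])
```

Because the target is an exact linear function of the inputs, `lr.intercept_` and `lr.coef_` recover the values used to build it, and `manual` matches `sk_pred`.
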
This model assumes that the prediction is a linear combination of the input data. For some datasets this is not flexible enough. More complexity can be added through attribute transformation (**sklearn**'s *preprocessing.PolynomialFeatures* transformer can create polynomial combinations of attributes). If this results in overfitting, **ridge** and **lasso** regressions can be used to regularize the estimator.<br><br>
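
A minimal sketch of that idea, assuming a synthetic quadratic target and an arbitrary `alpha=1.0` for the ridge penalty (both are illustrative choices, not tuned values):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# synthetic quadratic relationship (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 + 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.2, size=200)

# plain linear model: cannot follow the curve
linear = LinearRegression().fit(X, y)

# same estimator after adding the squared term as an attribute
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X, y)

# ridge variant: penalizes large coefficients to limit overfitting
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    Ridge(alpha=1.0),
).fit(X, y)

linear_r2 = linear.score(X, y)
poly_r2 = poly.score(X, y)
ridge_r2 = poly_ridge.score(X, y)
```

These are training scores, so they only show that the polynomial attributes let the linear model capture the curve. Whether regularization helps has to be judged on held-out data.
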
This model is also susceptible to *heteroscedasticity*: the variance of the prediction error (the residuals) changes as the input values change. Another issue to consider is *multicollinearity*: if columns are highly correlated, the coefficients become harder to interpret.
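
Both issues can be spotted with simple checks. The sketch below builds synthetic data that has them on purpose (the data, the noise model, and the median split are illustrative assumptions, not a formal statistical test):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 1000
x1 = rng.uniform(1, 10, size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # near-duplicate of x1: multicollinearity
noise = rng.normal(scale=0.5 * x1)         # noise grows with x1: heteroscedasticity
y = 2.0 * x1 + noise

X = np.column_stack([x1, x2])
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# multicollinearity check: correlation between the two columns
corr = np.corrcoef(x1, x2)[0, 1]

# heteroscedasticity check: residual spread for small vs. large x1
median = np.median(x1)
low_spread = residuals[x1 < median].std()
high_spread = residuals[x1 >= median].std()

# the individual coefficients are unstable, but their sum stays near the true slope
coef_sum = model.coef_.sum()
```

A correlation close to 1 signals that the two coefficients cannot be interpreted separately, and a residual spread that grows with the input signals heteroscedasticity. A residuals plot is the usual visual version of the second check.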