## Regression models

Use the following datasets:
1. [CPU Computer Hardware](https://archive.ics.uci.edu/ml/datasets/Computer+Hardware); exclude the columns vendor name, model name, and estimated relative performance from the dataset; the column to predict is "published relative performance".
1. [Boston Housing](http://archive.ics.uci.edu/ml/machine-learning-databases/housing/)
1. [Wisconsin Breast Cancer](http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html); find Wisconsin Breast Cancer in the left-hand panel and follow the steps in "My personal Notes".
1. [Communities and Crime](http://archive.ics.uci.edu/ml/datasets/communities+and+crime); drop the first 5 columns and any features with missing values.

For each dataset, apply at least 5 regression models from scikit-learn. For each model, report the mean absolute error, mean squared error, and median absolute error (see [sklearn.metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)) using 5-fold cross-validation. Hyperparameter values must be searched with grid search (cv=3) and random search (n_iter of your choice). The metric used for hyperparameter optimization is mean squared error. Report the mean results for both the training folds and the test folds. Hint: you can use `cross_validate` with `return_train_score=True`, passing a `GridSearchCV` or `RandomizedSearchCV` object as the model.
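
The hint above can be sketched as follows. This is a minimal illustration, not the required solution: the data is synthetic (`make_regression`) and the `alpha` grid for `Ridge` is arbitrary, both chosen only for the example. Passing a list to `scoring` lets one `cross_validate` call return all three metrics for the training and test folds.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_validate

# Synthetic stand-in for one of the datasets (assumption, not the real data).
X, y = make_regression(n_samples=120, n_features=6, noise=10.0, random_state=0)

# Inner loop: 3-fold grid search optimized for mean squared error.
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},  # illustrative grid only
    scoring="neg_mean_squared_error",
    cv=3,
)

# Outer loop: 5-fold cross-validation reporting all three metrics in one pass.
scoring = [
    "neg_mean_absolute_error",
    "neg_mean_squared_error",
    "neg_median_absolute_error",
]
results = cross_validate(search, X, y, cv=5, scoring=scoring, return_train_score=True)

# Keys follow the pattern train_<metric> / test_<metric>.
for metric in scoring:
    train_mean = results[f"train_{metric}"].mean()
    test_mean = results[f"test_{metric}"].mean()
    print(f"{metric}: train={train_mean:.3f} test={test_mean:.3f}")
```

The search object is passed unfitted because `cross_validate` clones and refits it on each outer fold.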

The results will be stored in a dataframe. In an intermediate state the values carry a minus sign: for implementation reasons, sklearn reports these error metrics as negative numbers (its scorers follow a higher-is-better convention); see the image below:

![intermediate report](./images/cpu_intermediate_blurred.png)


The values are then made positive, and the maximum and minimum values are highlighted. As a guide, see the image below, which shows the dataframe as displayed in the notebook; you may use other dataframe styling options, such as those described at https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#.
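
One possible way to flip the sign directly on a dataframe is sketched below. The numbers are made up and the column names merely imitate the `neg_*` scorer names; none of them come from a real run.

```python
import pandas as pd

# Hypothetical intermediate report with negative scores (made-up numbers).
negative_df = pd.DataFrame({
    "Model_name": ["Ridge", "Lasso"],
    "test_neg_mean_squared_error": [-4.0, -2.5],
    "train_neg_mean_squared_error": [-3.0, -2.0],
})

# Only the score columns are negated; text columns are left untouched.
score_cols = [c for c in negative_df.columns if "_neg_" in c]
positive = negative_df.copy()
positive[score_cols] = -positive[score_cols]

# Drop the "_neg" marker from the column names once the values are positive.
positive = positive.rename(columns={c: c.replace("_neg", "") for c in score_cols})
print(positive)
```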

A final report in HTML or PDF format must be produced as separate file(s). At a minimum, the report must contain the dataset name and the dataframe; preferably, the colour highlighting from the notebook is preserved.

![report](./images/cpu_results_blurred.png)

Grading:
1. 20 points are awarded by default.
1. Model optimization and performance measurement: 3 points for each dataset + model combination = 60 points.
1. Model documentation: number of models * 2 points = 10 points. Document each model used in the Jupyter notebook, in Romanian. You may create a separate section documenting the algorithms. Each model must have a description of at least 20 lines, at least one associated image, and at least 2 bibliographic references.
1. 10 points: export to HTML or PDF.



In [164]:
!pip install dominate



In [165]:
import numpy as np
import pandas as pd
from typing import List, Dict
from dominate.tags import *
from dominate.util import raw
from sklearn.datasets import load_boston, load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import BayesianRidge

In [166]:
def lin_reg_params() -> Dict[str,List[bool]]:
    """
    Builds a dictionary with parameter names as keys and lists of candidate
    values as values; for Linear Regression
    return: the dictionary
    """
    copy_X:List[bool] = [True,False]
    fit_intercept:List[bool] = [True,False]
    # Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2;
    # drop this key when running on newer releases.
    normalize:List[bool] = [True,False]
    positive:List[bool] = [True,False]
    return dict(copy_X=copy_X, fit_intercept=fit_intercept, normalize=normalize, positive=positive)

def lasso_ridge_elastic_params() -> Dict[str,List]:
    """
    Builds a dictionary with parameter names as keys and lists of candidate
    values as values; for Lasso, Ridge and Elastic Net
    return: the dictionary
    """
    alpha:List[float] = [1.0,1.1,1.2,1.3,1.4,1.5]
    fit_intercept:List[bool] = [True,False]
    # Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2;
    # drop this key when running on newer releases.
    normalize:List[bool] = [True,False]
    copy_X:List[bool] = [True,False]
    max_iter:List[int] = [1000,1100,1200,1300,1400,1500]
    return dict(alpha=alpha, fit_intercept=fit_intercept, normalize=normalize, copy_X=copy_X, max_iter=max_iter)

def bayesian_ridge_params() -> Dict[str,List]:
    """
    Builds a dictionary with parameter names as keys and lists of candidate
    values as values; for Bayesian Ridge
    return: the dictionary
    """
    alpha_1:List[float] = [1.e-5,1.e-6]
    alpha_2:List[float] = [1.e-5,1.e-6]
    lambda_1:List[float] = [1.e-5,1.e-6]
    lambda_2:List[float] = [1.e-5,1.e-6]
    return dict(alpha_1=alpha_1, alpha_2=alpha_2, lambda_1=lambda_1, lambda_2=lambda_2)

def get_errors(model, data:np.ndarray, target:np.ndarray) -> None:
    """
    Prints the mean absolute error, mean squared error and median absolute error
    of a regression model on a dataset, one value per fold (5-fold cross-validation)
    param model: the regression model
    param data: the features of the dataset
    param target: the target of the dataset
    """
    # cross_val_score returns one (negated) score per fold as a NumPy array.
    neg_mean_abs_err:np.ndarray = cross_val_score(model, data, target, scoring='neg_mean_absolute_error', cv=5)
    neg_mean_sqr_err:np.ndarray = cross_val_score(model, data, target, scoring='neg_mean_squared_error', cv=5)
    neg_median_abs_err:np.ndarray = cross_val_score(model, data, target, scoring='neg_median_absolute_error', cv=5)
    print(f'Negative mean absolute errors for {model} are {neg_mean_abs_err}')
    print(f'Negative mean squared errors for {model} are {neg_mean_sqr_err}')
    print(f'Negative median absolute errors for {model} are {neg_median_abs_err}')

def grid_search(model, param_grid:Dict[str,List], data:np.ndarray, target:np.ndarray) -> GridSearchCV:
    """
    Searches for the best hyperparameter values of a regression model on a dataset using grid search
    param model: the regression model
    param param_grid: dictionary mapping parameter names to lists of candidate values
    param data: the features of the dataset
    param target: the target of the dataset
    return: the fitted GridSearchCV object
    """
    grid_src = GridSearchCV(estimator=model, param_grid=param_grid, scoring='neg_mean_squared_error', cv=3, return_train_score=True)
    grid_src.fit(data, target)
    return grid_src
    
def randomized_search(model, param_distributions:Dict[str,List], data:np.ndarray, target:np.ndarray) -> RandomizedSearchCV:
    """
    Searches for the best hyperparameter values of a regression model on a dataset using random search
    param model: the regression model
    param param_distributions: dictionary mapping parameter names to lists of candidate values
    param data: the features of the dataset
    param target: the target of the dataset
    return: the fitted RandomizedSearchCV object
    """
    randomized_src = RandomizedSearchCV(estimator=model, param_distributions=param_distributions, n_iter=6, scoring='neg_mean_squared_error', cv=3, return_train_score=True)
    randomized_src.fit(data, target)
    return randomized_src

def get_mean_values(model, data:np.ndarray, target:np.ndarray) -> None:
    """
    Prints the mean scores over the training folds and over the test folds (5-fold cross-validation)
    param model: the chosen model
    param data: the features of the dataset
    param target: the target of the dataset
    """
    # cross_validate returns a dict of per-fold arrays ('train_score', 'test_score', timings).
    res_mean_abs_err:Dict[str,np.ndarray] = cross_validate(model, data, target, cv=5, return_train_score=True, scoring='neg_mean_absolute_error')
    res_mean_sqr_err:Dict[str,np.ndarray] = cross_validate(model, data, target, cv=5, return_train_score=True, scoring='neg_mean_squared_error')
    res_median_abs_err:Dict[str,np.ndarray] = cross_validate(model, data, target, cv=5, return_train_score=True, scoring='neg_median_absolute_error')
    print('The mean absolute error for train scores is ', res_mean_abs_err['train_score'].mean())
    print('The mean absolute error for test scores is ', res_mean_abs_err['test_score'].mean())
    print('The mean squared error for train scores is ', res_mean_sqr_err['train_score'].mean())
    print('The mean squared error for test scores is ', res_mean_sqr_err['test_score'].mean())
    print('The median absolute error for train scores is ', res_median_abs_err['train_score'].mean())
    print('The median absolute error for test scores is ', res_median_abs_err['test_score'].mean())
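
The random search above samples from lists of discrete values. As a hedged alternative sketch, `RandomizedSearchCV` also accepts continuous distributions. The example below assumes `scipy.stats.loguniform` (SciPy 1.4 or newer), synthetic data, and an arbitrary range for `alpha`; none of these come from the notebook.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption).
X, y = make_regression(n_samples=120, n_features=6, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    estimator=Ridge(),
    # alpha is drawn log-uniformly between 1e-3 and 1e2 (illustrative range).
    param_distributions={"alpha": loguniform(1e-3, 1e2)},
    n_iter=6,
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=0,  # makes the sampled candidates reproducible
    return_train_score=True,
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
print("best alpha:", best_alpha)
print("best (negated) MSE:", search.best_score_)
```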
    

In [167]:
linear_reg = LinearRegression()
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=1.0)
bayesian_ridge = BayesianRidge()
elastic_net = ElasticNet(random_state=0)

In [168]:
computer_hardware:str = './data/machine.data'
names_computer_hardware:List[str] = ['vendor_name', 'model_name', 'myct', 'mmin', 'mmax', 'cach', 'chmin', 'chmax', 'prp', 'erp']
computer_hardware_data = pd.read_csv(computer_hardware, names=names_computer_hardware)
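# Keep the six numeric features plus the target (prp); drop vendor_name, model_name and erp.
computer_hardware_data = computer_hardware_data.values[:, 2:-1].astype(float)

X1:np.ndarray = computer_hardware_data[:, :-1]
y1:np.ndarray = computer_hardware_data[:, -1]

# Fit every model on the real target y1 (published relative performance).
linear_reg.fit(X1, y1)
ridge.fit(X1, y1)
lasso.fit(X1, y1)
bayesian_ridge.fit(X1, y1)
elastic_net.fit(X1, y1)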

get_errors(ridge, X1, y1)
grid_search(ridge, lasso_ridge_elastic_params(), X1, y1)
randomized_search(ridge, lasso_ridge_elastic_params(), X1, y1)
get_mean_values(ridge, X1, y1)

Negative mean absolute errors for Ridge() are [-61.3399541  -31.95820468 -28.01434149 -35.29783247 -60.26591349]
Negative mean squared errors for Ridge() are [ -7132.22717357  -2313.54305212  -1501.78912702  -2325.37181398
 -18642.63089465]
Negative median absolute errors for Ridge() are [-42.57144874 -22.57109952 -22.08688641 -23.87018445 -24.16193043]
The mean absolute error for train scores is  -36.69557755818259
The mean absolute error for test scores is  -43.37524924729828
The mean squared error for train scores is  -3243.698631843855
The mean squared error for test scores is  -6383.112412268876
The median absolute error for train scores is  -25.58146831141265
The median absolute error for test scores is  -27.052309909141457


In [169]:
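# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release.
boston_housing = load_boston()
X2:np.ndarray = boston_housing.data
y2:np.ndarray = boston_housing.target

# Fit every model on the real target y2.
linear_reg.fit(X2, y2)
ridge.fit(X2, y2)
lasso.fit(X2, y2)
bayesian_ridge.fit(X2, y2)
elastic_net.fit(X2, y2)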

get_errors(linear_reg, X2, y2)
grid_search(linear_reg, lin_reg_params(), X2, y2)
randomized_search(linear_reg, lin_reg_params(), X2, y2)
get_mean_values(linear_reg, X2, y2)

Negative mean absolute errors for LinearRegression() are [-2.62190565 -3.90725478 -4.386606   -5.57073637 -4.76333993]
Negative mean squared errors for LinearRegression() are [-12.46030057 -26.04862111 -33.07413798 -80.76237112 -33.31360656]
Negative median absolute errors for LinearRegression() are [-1.93044442 -3.20683505 -3.73356502 -2.87614975 -4.94657505]
The mean absolute error for train scores is  -3.221564506407365
The mean absolute error for test scores is  -4.249968544192532
The mean squared error for train scores is  -20.735084629886178
The mean squared error for test scores is  -37.13180746769914
The median absolute error for train scores is  -2.393983912971861
The median absolute error for test scores is  -3.3387138560783045


In [170]:
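# Note: load_breast_cancer is scikit-learn's diagnostic dataset with a binary (0/1) target,
# so the regressors below are fitted on a class label; this may not be the dataset
# the assignment links to.
breast_cancer = load_breast_cancer()
X3:np.ndarray = breast_cancer.data
y3:np.ndarray = breast_cancer.target

# Fit every model on the real target y3.
linear_reg.fit(X3, y3)
ridge.fit(X3, y3)
lasso.fit(X3, y3)
bayesian_ridge.fit(X3, y3)
elastic_net.fit(X3, y3)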

get_errors(lasso, X3, y3)
grid_search(lasso, lasso_ridge_elastic_params(), X3, y3)
randomized_search(lasso, lasso_ridge_elastic_params(), X3, y3)
get_mean_values(lasso, X3, y3)

Negative mean absolute errors for Lasso() are [-0.33221043 -0.28178556 -0.22624776 -0.24051973 -0.23382699]
Negative mean squared errors for Lasso() are [-0.1791947  -0.12221632 -0.08211334 -0.08168521 -0.09210719]
Negative median absolute errors for Lasso() are [-0.22588341 -0.21865268 -0.18394417 -0.20790057 -0.19130003]
The mean absolute error for train scores is  -0.25245518215654095
The mean absolute error for test scores is  -0.2629180933980965
The mean squared error for train scores is  -0.1025975088287813
The mean squared error for test scores is  -0.111463351408519
The median absolute error for train scores is  -0.20084023571611875
The median absolute error for test scores is  -0.2055361744872072


In [171]:
# sep=r'\s+' is equivalent to delim_whitespace=True, which newer pandas releases deprecate.
attributes = pd.read_csv('./data/attributes.csv', sep=r'\s+')
communities_crimes = pd.read_csv('./data/communities.data', names=attributes['attributes'])
communities_crimes.head()

Unnamed: 0,state,county,community,communityname,fold,population,householdsize,racepctblack,racePctWhite,racePctAsian,...,LandArea,PopDens,PctUsePubTrans,PolicCars,PolicOperBudg,LemasPctPolicOnPatr,LemasGangUnitDeploy,LemasPctOfficDrugUn,PolicBudgPerPop,ViolentCrimesPerPop
0,8,?,?,Lakewoodcity,1,0.19,0.33,0.02,0.9,0.12,...,0.12,0.26,0.2,0.06,0.04,0.9,0.5,0.32,0.14,0.2
1,53,?,?,Tukwilacity,1,0.0,0.16,0.12,0.74,0.45,...,0.02,0.12,0.45,?,?,?,?,0.0,?,0.67
2,24,?,?,Aberdeentown,1,0.0,0.42,0.49,0.56,0.17,...,0.01,0.21,0.02,?,?,?,?,0.0,?,0.43
3,34,5,81440,Willingborotownship,1,0.04,0.77,1.0,0.08,0.12,...,0.02,0.39,0.28,?,?,?,?,0.0,?,0.12
4,42,95,6096,Bethlehemtownship,1,0.01,0.55,0.02,0.95,0.09,...,0.04,0.09,0.02,?,?,?,?,0.0,?,0.03


In [172]:
# Drop the first five (non-predictive) columns, then every column that contains missing values.
communities_crimes = communities_crimes.drop(columns=['state','county','community','communityname','fold'])
communities_crimes = communities_crimes.replace('?', np.nan)
communities_crimes = communities_crimes.dropna(axis=1)
communities_crimes.head()

Unnamed: 0,population,householdsize,racepctblack,racePctWhite,racePctAsian,racePctHisp,agePct12t21,agePct12t29,agePct16t24,agePct65up,...,PctForeignBorn,PctBornSameState,PctSameHouse85,PctSameCity85,PctSameState85,LandArea,PopDens,PctUsePubTrans,LemasPctOfficDrugUn,ViolentCrimesPerPop
0,0.19,0.33,0.02,0.9,0.12,0.17,0.34,0.47,0.29,0.32,...,0.12,0.42,0.5,0.51,0.64,0.12,0.26,0.2,0.32,0.2
1,0.0,0.16,0.12,0.74,0.45,0.07,0.26,0.59,0.35,0.27,...,0.21,0.5,0.34,0.6,0.52,0.02,0.12,0.45,0.0,0.67
2,0.0,0.42,0.49,0.56,0.17,0.04,0.39,0.47,0.28,0.32,...,0.14,0.49,0.54,0.67,0.56,0.01,0.21,0.02,0.0,0.43
3,0.04,0.77,1.0,0.08,0.12,0.1,0.51,0.5,0.34,0.21,...,0.19,0.3,0.73,0.64,0.65,0.02,0.39,0.28,0.0,0.12
4,0.01,0.55,0.02,0.95,0.09,0.05,0.38,0.38,0.23,0.36,...,0.11,0.72,0.64,0.61,0.53,0.04,0.09,0.02,0.0,0.03


In [173]:
X4:pd.DataFrame = communities_crimes.drop(columns=['ViolentCrimesPerPop'])
y4:pd.Series = communities_crimes['ViolentCrimesPerPop']

# Fit every model on the real target y4.
linear_reg.fit(X4, y4)
ridge.fit(X4, y4)
lasso.fit(X4, y4)
bayesian_ridge.fit(X4, y4)
elastic_net.fit(X4, y4)

get_errors(bayesian_ridge, X4, y4)
grid_search(bayesian_ridge, bayesian_ridge_params(), X4, y4)
randomized_search(bayesian_ridge, bayesian_ridge_params(), X4, y4)
get_mean_values(bayesian_ridge, X4, y4)

Negative mean absolute errors for BayesianRidge() are [-0.09631268 -0.10385865 -0.08741068 -0.09242159 -0.09237512]
Negative mean squared errors for BayesianRidge() are [-0.02025571 -0.02279832 -0.01650001 -0.01613057 -0.01740948]
Negative median absolute errors for BayesianRidge() are [-0.06664378 -0.06785932 -0.05817873 -0.06864403 -0.06217283]
The mean absolute error for train scores is  -0.09123194984053748
The mean absolute error for test scores is  -0.09447574447653886
The mean squared error for train scores is  -0.017128726010436506
The mean squared error for test scores is  -0.01861881875889384
The median absolute error for train scores is  -0.06378250172302716
The median absolute error for test scores is  -0.06469973943338093


In [174]:
def grid_calculate(model, params:Dict[str,List], data:np.ndarray, target:np.ndarray, score_type:str, error_type:str) -> float:
    """
    Reports the mean result over the training or the test folds, tuning hyperparameters with GridSearchCV
    param model: the regression model
    param params: dictionary mapping parameter names to lists of candidate values
    param data: the features of the dataset
    param target: the target of the dataset
    param score_type: which scores to average ('train_score' or 'test_score')
    param error_type: the scoring string (e.g. 'neg_mean_squared_error')
    return: the mean score
    """
    # cross_validate clones and refits the estimator on every outer fold, so the search
    # is passed unfitted; fitting it on the full data first would be wasted work.
    grd_src = GridSearchCV(estimator=model, param_grid=params, scoring='neg_mean_squared_error', cv=3)
    result:Dict[str,np.ndarray] = cross_validate(grd_src, data, target, cv=5, return_train_score=True, scoring=error_type)
    return result[score_type].mean()

In [175]:
def randomize_calculate(model, params:Dict[str,List], data:np.ndarray, target:np.ndarray, score_type:str, error_type:str) -> float:
    """
    Reports the mean result over the training or the test folds, tuning hyperparameters with RandomizedSearchCV
    param model: the regression model
    param params: dictionary mapping parameter names to lists of candidate values
    param data: the features of the dataset
    param target: the target of the dataset
    param score_type: which scores to average ('train_score' or 'test_score')
    param error_type: the scoring string (e.g. 'neg_mean_squared_error')
    return: the mean score
    """
    # cross_validate clones and refits the estimator on every outer fold, so the search
    # is passed unfitted; fitting it on the full data first would be wasted work.
    rand_src = RandomizedSearchCV(estimator=model, param_distributions=params, n_iter=6, scoring='neg_mean_squared_error', cv=3)
    result:Dict[str,np.ndarray] = cross_validate(rand_src, data, target, cv=5, return_train_score=True, scoring=error_type)
    return result[score_type].mean()

In [176]:
def get_errors_score(score_type: str, error_type: str, data:np.ndarray, target:np.ndarray) -> List[float]:
    """
    Builds a list of mean scores (training or test folds) for the given dataset,
    with one entry per model and search strategy
    param score_type: which scores to average ('train_score' or 'test_score')
    param error_type: the scoring string (e.g. 'neg_mean_squared_error')
    param data: the features of the dataset
    param target: the target of the dataset
    return: the list of means
    """
    column:List[float] = []
    column.append(grid_calculate(linear_reg, lin_reg_params(), data, target, score_type, error_type))
    column.append(randomize_calculate(linear_reg, lin_reg_params(), data, target, score_type, error_type))
    column.append(grid_calculate(ridge, lasso_ridge_elastic_params(), data, target, score_type, error_type))  
    column.append(randomize_calculate(ridge, lasso_ridge_elastic_params(), data, target, score_type, error_type))
    column.append(grid_calculate(lasso, lasso_ridge_elastic_params(), data, target, score_type, error_type))   
    column.append(randomize_calculate(lasso, lasso_ridge_elastic_params(), data, target, score_type, error_type))
    column.append(grid_calculate(elastic_net, lasso_ridge_elastic_params(), data, target, score_type, error_type))   
    column.append(randomize_calculate(elastic_net, lasso_ridge_elastic_params(), data, target, score_type, error_type))
    column.append(grid_calculate(bayesian_ridge, bayesian_ridge_params(), data, target, score_type, error_type))
    column.append(randomize_calculate(bayesian_ridge, bayesian_ridge_params(), data, target, score_type, error_type))
    return column

In [177]:
def get_data_frame(data:np.ndarray, target:np.ndarray) -> Dict[str,list]:
    """
    Builds a dictionary of columns holding the mean scores for the training folds
    and for the test folds, for the given dataset
    param data: the features of the dataset
    param target: the target of the dataset
    return: the dictionary
    """
    test_neg_mean_absolute_error:List[float] = get_errors_score('test_score', 'neg_mean_absolute_error', data, target)    
    test_neg_mean_squared_error:List[float] = get_errors_score('test_score', 'neg_mean_squared_error', data, target)
    test_neg_median_absolute_error:List[float] = get_errors_score('test_score', 'neg_median_absolute_error', data, target)
    train_neg_mean_absolute_error:List[float] = get_errors_score('train_score', 'neg_mean_absolute_error', data, target)
    train_neg_mean_squared_error:List[float] = get_errors_score('train_score', 'neg_mean_squared_error', data, target)
    train_neg_median_absolute_error:List[float] = get_errors_score('train_score', 'neg_median_absolute_error', data, target)

    data_frame:Dict[str,list] = {
            'Model_name': ['Linear Regression', 'Linear Regression', 'Ridge', 'Ridge', 'Lasso', 'Lasso', 'Elastic Net', 'Elastic Net', 'Bayesian Ridge', 'Bayesian Ridge'],
            'Search_strategy': ['GridSearchCV', 'RandomizedSearchCV', 'GridSearchCV', 'RandomizedSearchCV', 'GridSearchCV', 'RandomizedSearchCV','GridSearchCV', 'RandomizedSearchCV', 'GridSearchCV', 'RandomizedSearchCV'],
            'test_neg_mean_absolute_error': test_neg_mean_absolute_error,
            'test_neg_mean_squared_error': test_neg_mean_squared_error,
            'test_neg_median_absolute_error': test_neg_median_absolute_error,
            'train_neg_mean_absolute_error': train_neg_mean_absolute_error,
            'train_neg_mean_squared_error': train_neg_mean_squared_error,
            'train_neg_median_absolute_error': train_neg_median_absolute_error,
        }
    return data_frame

In [178]:
data_frame = get_data_frame(X1, y1)
df = pd.DataFrame(data_frame)
display(df)

KeyboardInterrupt: 
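
The interrupted run is consistent with the size of the search. A rough count for the grid returned by `lasso_ridge_elastic_params` (copied verbatim below) can be obtained with `ParameterGrid`; the arithmetic assumes the inner `cv=3`, one refit on the best candidate, and the outer `cv=5` used in the notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as lasso_ridge_elastic_params(); ParameterGrid only enumerates combinations.
grid = {
    "alpha": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5],
    "fit_intercept": [True, False],
    "normalize": [True, False],
    "copy_X": [True, False],
    "max_iter": [1000, 1100, 1200, 1300, 1400, 1500],
}

n_candidates = len(ParameterGrid(grid))  # 6 * 2 * 2 * 2 * 6 = 288
inner_fits = n_candidates * 3 + 1        # 3 inner folds plus one refit = 865
outer_fits = inner_fits * 5              # repeated on each of 5 outer folds = 4325

# get_data_frame evaluates 6 metric/score combinations, each running the grid
# search for Ridge, Lasso and Elastic Net.
total_grid_fits = outer_fits * 3 * 6     # 77850

print(n_candidates, inner_fits, outer_fits, total_grid_fits)
```

Since `copy_X` does not change the fitted model, removing it from the grid would halve the number of candidates.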

In [None]:
def get_positive_data_frame(data_frame: Dict[str,List]) -> Dict[str,List]:
    """
    Builds a dictionary with positive values from the columns of mean scores
    for the training folds and for the test folds
    param data_frame: dictionary of columns, as returned by get_data_frame
    return: the dictionary with absolute values and '_neg' removed from the keys
    """
    pos_data_frame:Dict[str,List] = {}
    lst:List = []
    for key in data_frame:
        for value in data_frame[key]:
            if isinstance(value, float):
                lst.append(abs(value))
            else:
                lst.append(value)
        pos_data_frame[key.replace("_neg","")] = lst
        lst = []
    return pos_data_frame

In [None]:
positive_data_frame:Dict[str,List] = get_positive_data_frame(data_frame)
pos_df = pd.DataFrame(positive_data_frame)
display(pos_df)

In [None]:
style = pos_df.style.\
    highlight_max(color = 'green', axis = 0).\
    highlight_min(color = 'red', axis = 0)

style

In [None]:
head_title = 'CPU Computer Hardware'
# Note: Styler.render() is deprecated since pandas 1.4 in favour of Styler.to_html().
raw_html_content = style.render()
#print(html(head(title(head_title)), body(raw(raw_html_content))))