# Lab 8

## Regression models

Use the following datasets:
1. [CPU Computer Hardware](https://archive.ics.uci.edu/ml/datasets/Computer+Hardware); exclude the columns vendor name, model name, and estimated relative performance from the dataset; the target to estimate is the "published relative performance" column.
1. [Boston Housing](http://archive.ics.uci.edu/ml/machine-learning-databases/housing/)
1. [Wisconsin Breast Cancer](http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html); look for Wisconsin Breast Cancer in the left-hand panel and follow the steps in "My personal Notes"
1. [Communities and Crime](http://archive.ics.uci.edu/ml/datasets/communities+and+crime); remove the first 5 dimensions and the features with missing values.

For each dataset, apply at least 5 regression models from scikit-learn. For each model, report the mean absolute error, mean squared error, and median absolute error (see [sklearn.metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)) using 5-fold cross-validation. Hyperparameter values must be searched with grid search (cv=3) and random search (n_iter of your choice). The metric used for the hyperparameter search is the mean squared error. Report the mean results for both the training folds and the test folds; hint: you can use the `cross_validate` function with `return_train_score=True`, passing a `GridSearchCV` or `RandomizedSearchCV` object as the model.
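
As a rough sketch of the hint above, the snippet below wraps a `GridSearchCV` (3 inner folds, selecting by mean squared error) in `cross_validate` (5 outer folds) and reads back the per-fold train and test scores. The synthetic data from `make_regression`, the `Ridge` model, and the `alpha` grid are assumptions made for illustration only; they are not part of the assignment.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_validate

# Synthetic stand-in for one of the lab datasets (illustrative assumption).
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# Inner loop: 3-fold grid search, hyperparameters selected by MSE.
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]},
                      cv=3, scoring="neg_mean_squared_error")

# Outer loop: 5-fold cross-validation of the whole search procedure.
metrics = ("neg_mean_absolute_error", "neg_mean_squared_error",
           "neg_median_absolute_error")
results = cross_validate(search, X, y, scoring=metrics, cv=5,
                         return_train_score=True)

# One array of 5 fold scores per metric, for both train and test folds.
for key in sorted(results):
    if key.startswith(("test_", "train_")):
        print(key, results[key].mean())
```

Because the search object is refit inside every outer fold, the reported test scores come from data that played no part in choosing the hyperparameters.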

The results will be stored in a dataframe. In an intermediate state the values carry a minus sign: because its scorers follow a higher-is-better convention, sklearn reports these error metrics as negative numbers; see the image below:

![intermediate report](./images/cpu_intermediate_blurred.png)
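
The sign convention can be checked on a small example. The sketch below assumes synthetic data and a plain `LinearRegression`, neither of which is part of the lab; it shows the fold means as sklearn returns them and the same values after the sign is flipped and the `neg_` marker is removed from the labels.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic data, used only to produce some scores (illustrative assumption).
X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
scores = cross_validate(LinearRegression(), X, y, cv=5,
                        scoring=("neg_mean_absolute_error",
                                 "neg_mean_squared_error"))

# Intermediate state: fold means as returned by sklearn, all <= 0.
raw = pd.Series({key: values.mean()
                 for key, values in scores.items() if key.startswith("test_")})

# Final state: flip the sign and drop the "neg_" marker from the labels.
errors = -raw
errors.index = [name.replace("neg_", "") for name in errors.index]

print(pd.DataFrame({"raw": raw.values, "error": errors.values},
                   index=errors.index))
```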


The values will then be converted to positive numbers, and the maximum and minimum values will be highlighted; as a guide, you can use the image below, which shows the dataframe as displayed in the notebook; you may also use other dataframe styling options, such as those described at https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html.

A final report will be created in HTML or PDF format, as separate file(s). At a minimum, the report must contain the dataset name and the dataframe object; preferably it should keep the colour highlighting produced in the notebook.

![report](./images/cpu_results_blurred.png)

Grading:
1. 20 points are awarded by default.
1. Model optimization and performance measurement: 3 points for each dataset + model combination = 60 points
1. Model documentation: number of models * 2 points = 10 points. Document each of the models used in the Jupyter notebook, in Romanian. You may create a separate section documenting the algorithms. Each model must have a description of at least 20 lines, at least one associated image, and at least 2 bibliographic references.
1. 10 points: export in HTML or PDF format.



*Note:* the lab will be saved in the GitHub repository and presented during the week of 6-10 May.

In [22]:
import sys
import warnings

import numpy as np
import pandas as pd
from scipy.stats import uniform as sp_rand
from sklearn import ensemble, linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_validate
from sklearn.neighbors import KNeighborsRegressor

# Silence library warnings unless the user asked for them with -W.
# `warnings` is imported unconditionally because later cells call it directly.
if not sys.warnoptions:
    warnings.simplefilter("ignore")

In [2]:
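# CPU Computer Hardware: drop vendor name and model name (first two columns)
# and estimated relative performance (last column); the target is the
# published relative performance, which is then the last remaining column.
data_cpu = pd.read_csv('machine.data', header=None)
data_cpu = data_cpu.values[:, 2:-1].astype(float)
x_cpu = data_cpu[:, :-1]
y_cpu = data_cpu[:, -1]

# Boston Housing: whitespace-separated file; the last column is the target.
boston_housing = pd.read_csv('housing.data', header=None, sep=r'\s+')
x_housing = boston_housing.values[:, :-1]
y_housing = boston_housing.values[:, -1]

# Wisconsin Breast Cancer: the last column is the target.
wisconsin_breast_cancer = pd.read_csv('r_wpbc.data', header=None)
x_wisconsin = wisconsin_breast_cancer.values[:, :-1]
y_wisconsin = wisconsin_breast_cancer.values[:, -1]

# Communities and Crime: '?' marks a missing value. This keeps only the rows
# that contain no '?', then skips the first 5 columns; the last column is
# the target.
communities_and_crime = pd.read_csv('communities.data', header=None)
communities_and_crime = communities_and_crime[(communities_and_crime != '?').all(axis=1)]
x_communities = communities_and_crime.values[:, 5:-1].astype(float)
y_communities = communities_and_crime.values[:, -1].astype(float)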
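
The assignment asks for the first 5 dimensions and the features (columns) with missing values to be removed, whereas the cell above filters out rows. A possible column-wise variant is sketched below on a tiny made-up frame; the toy values and the names `toy`, `features`, `complete`, `x_toy`, and `y_toy` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Tiny made-up frame in the style of communities.data: '?' marks a missing value.
toy = pd.DataFrame([
    [8, "?", "?", "TownA", 1, 0.19, "?", 0.33, 0.20],
    [53, "?", "?", "TownB", 1, 0.00, 0.12, 0.16, 0.67],
    [24, 5, 100, "TownC", 2, 0.04, "?", 0.42, 0.43],
])

# 1) Drop the first 5 (non-predictive) columns.
features = toy.iloc[:, 5:]

# 2) Keep only the columns in which no value equals '?'.
complete = features.loc[:, (features != "?").all(axis=0)]

x_toy = complete.iloc[:, :-1].to_numpy(dtype=float)  # predictors
y_toy = complete.iloc[:, -1].to_numpy(dtype=float)   # target (last column)
print(x_toy.shape, y_toy.shape)
```

Dropping the first 5 columns before filtering matters here, because some of those columns themselves contain `?` and would otherwise shift the positions of the remaining ones.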


In [3]:
def fun(x, y, model):
    """Tune `alpha` with grid and random search; evaluate each with 5-fold CV."""
    warnings.simplefilter(action='ignore')
    alphas = {'alpha': [0.1, 1.0, 10.0]}
    scoring = ('neg_mean_absolute_error', 'neg_mean_squared_error',
               'neg_median_absolute_error')

    def mean_errors(cv_results):
        # sklearn returns errors as negative scores: average over the folds and
        # take the absolute value, renaming e.g. test_neg_mean_squared_error
        # to test_mean_squared_error. Timing entries are skipped.
        return {key.replace('neg_', ''): abs(values.mean())
                for key, values in cv_results.items() if '_neg_' in key}

    # Inner loop: 3-fold grid search that selects alpha by mean squared error.
    grid = GridSearchCV(model, param_grid=alphas, cv=3,
                        scoring='neg_mean_squared_error')
    # Outer loop: 5-fold cross-validation of the whole search procedure.
    grid = cross_validate(grid, x, y, scoring=scoring, cv=5,
                          return_train_score=True)

    # scoring is set explicitly so that the random search, like the grid
    # search, selects by mean squared error instead of the estimator default.
    rsearch = RandomizedSearchCV(estimator=model, param_distributions=alphas,
                                 n_iter=50, scoring='neg_mean_squared_error')
    rsearch = cross_validate(rsearch, x, y, scoring=scoring, cv=5,
                             return_train_score=True)

    dataset = pd.DataFrame({'GridSearchCV': mean_errors(grid),
                            'RandomizedSearchCV': mean_errors(rsearch)})
    # One row per search strategy, metrics as alphabetically ordered columns.
    return dataset.T.sort_index(axis=1)
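
`fun` passes the same three-value list to both searches, so the random search can only ever sample those three values. One possible alternative, sketched here under assumptions, is to sample `alpha` from a continuous range with `scipy.stats.uniform`. The range [0.1, 10.1], `n_iter=20`, `random_state=0`, and the synthetic data are illustrative choices, not values taken from the lab.

```python
from scipy.stats import uniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for one of the lab datasets (illustrative assumption).
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# uniform(loc, scale) samples from [loc, loc + scale], here [0.1, 10.1].
distributions = {"alpha": uniform(loc=0.1, scale=10.0)}

search = RandomizedSearchCV(Ridge(), param_distributions=distributions,
                            n_iter=20, cv=3, scoring="neg_mean_squared_error",
                            random_state=0)
search.fit(X, y)

# best_score_ is a negated MSE, so flip the sign for reporting.
print(search.best_params_, -search.best_score_)
```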

In [4]:
def neighReg(x, y):
    """Tune KNeighborsRegressor with grid and random search; evaluate with 5-fold CV."""
    params = {'n_neighbors': [5, 6, 7, 8, 9, 10],
              'leaf_size': [1, 2, 3, 5],
              'weights': ['uniform', 'distance'],
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'n_jobs': [-1]}
    scoring = ('neg_mean_absolute_error', 'neg_mean_squared_error',
               'neg_median_absolute_error')

    def mean_errors(cv_results):
        # Average the negative scores over the folds and make them positive,
        # renaming e.g. test_neg_mean_squared_error to test_mean_squared_error.
        return {key.replace('neg_', ''): abs(values.mean())
                for key, values in cv_results.items() if '_neg_' in key}

    # Inner loop: 3-fold grid search that selects by mean squared error.
    # The `iid` argument is omitted: it was removed from recent scikit-learn releases.
    grid = GridSearchCV(KNeighborsRegressor(), param_grid=params, cv=3,
                        n_jobs=-1, scoring='neg_mean_squared_error')
    # Outer loop: 5-fold cross-validation of the whole search procedure.
    grid = cross_validate(grid, x, y, scoring=scoring, cv=5,
                          return_train_score=True)

    rsearch = RandomizedSearchCV(estimator=KNeighborsRegressor(),
                                 param_distributions=params, n_iter=50,
                                 scoring='neg_mean_squared_error')
    rsearch = cross_validate(rsearch, x, y, scoring=scoring, cv=5,
                             return_train_score=True)

    dataset = pd.DataFrame({'GridSearchCV': mean_errors(grid),
                            'RandomizedSearchCV': mean_errors(rsearch)})
    # One row per search strategy, metrics as alphabetically ordered columns.
    return dataset.T.sort_index(axis=1)

In [5]:
data_cpu1 = fun(x_cpu, y_cpu, linear_model.Ridge())
data_cpu2 = fun(x_cpu, y_cpu, linear_model.Lasso())
data_cpu3 = fun(x_cpu, y_cpu, linear_model.ElasticNet())
data_cpu4 = fun(x_cpu, y_cpu, linear_model.LassoLars())
#data_cpu5 = neighReg(x_cpu, y_cpu)

# Stack the per-model results (DataFrame.append was removed in pandas 2.0).
# Row pairs appear in the order LassoLars, Ridge, Lasso, ElasticNet.
data_cpu4 = pd.concat([data_cpu4, data_cpu1, data_cpu2, data_cpu3])
data_cpu4

| | test_mean_absolute_error | test_mean_squared_error | test_median_absolute_error | train_mean_absolute_error | train_mean_squared_error | train_median_absolute_error |
|---|---|---|---|---|---|---|
| GridSearchCV | 41.288543 | 6201.732604 | 25.326891 | 35.639909 | 3427.445058 | 23.082545 |
| RandomizedSearchCV | 39.688969 | 6938.306195 | 22.917932 | 35.171268 | 3650.149457 | 21.517765 |
| GridSearchCV | 43.348357 | 6378.281774 | 27.039073 | 36.694713 | 3243.700668 | 25.580515 |
| RandomizedSearchCV | 43.348357 | 6378.281774 | 27.039073 | 36.694713 | 3243.700668 | 25.580515 |
| GridSearchCV | 43.105958 | 6325.128609 | 27.084825 | 36.681951 | 3247.735472 | 25.389984 |
| RandomizedSearchCV | 42.184375 | 6225.743794 | 26.559089 | 36.692699 | 3250.294222 | 25.412552 |
| GridSearchCV | 42.873289 | 6283.86577 | 26.973227 | 36.664509 | 3246.636762 | 25.617069 |
| RandomizedSearchCV | 41.989119 | 6163.512772 | 26.471834 | 36.696225 | 3253.220653 | 25.743453 |
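
In the stacked table the rows are labelled only by search strategy, so the model behind each pair of rows has to be inferred from the stacking order. One way to keep the model name, sketched here with invented numbers and a hypothetical `fake_result` helper standing in for the frames returned by `fun`, is to pass a dict to `pd.concat`:

```python
import pandas as pd

def fake_result(offset):
    # Stand-in for the frame returned by fun(); the numbers are invented.
    return pd.DataFrame(
        {"test_mean_squared_error": [1.0 + offset, 2.0 + offset],
         "train_mean_squared_error": [0.5 + offset, 0.7 + offset]},
        index=["GridSearchCV", "RandomizedSearchCV"])

per_model = {"Ridge": fake_result(0.0), "Lasso": fake_result(10.0)}

# Passing a dict to pd.concat uses its keys as an extra outer index level.
report = pd.concat(per_model, names=["model", "search"])
print(report)

# A single cell can then be addressed by (model, search) and metric.
print(report.loc[("Lasso", "GridSearchCV"), "test_mean_squared_error"])
```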


In [6]:
boston_housing1 = fun(x_housing, y_housing, linear_model.Ridge())
boston_housing2 = fun(x_housing, y_housing, linear_model.Lasso())
boston_housing3 = fun(x_housing, y_housing, linear_model.ElasticNet())
boston_housing4 = fun(x_housing, y_housing, linear_model.LassoLars())
boston_housing5 = neighReg(x_housing, y_housing)

# Stack the per-model results (DataFrame.append was removed in pandas 2.0).
# Row pairs: LassoLars, Ridge, Lasso, ElasticNet, KNeighborsRegressor.
boston_housing4 = pd.concat([boston_housing4, boston_housing1, boston_housing2,
                             boston_housing3, boston_housing5])
boston_housing4

| | test_mean_absolute_error | test_mean_squared_error | test_median_absolute_error | train_mean_absolute_error | train_mean_squared_error | train_median_absolute_error |
|---|---|---|---|---|---|---|
| GridSearchCV | 4.732101 | 47.578549 | 3.236772 | 3.991472 | 32.854899 | 2.84821 |
| RandomizedSearchCV | 4.732101 | 47.578549 | 3.236772 | 3.991472 | 32.854899 | 2.84821 |
| GridSearchCV | 3.943297 | 33.395224 | 2.873217 | 3.250549 | 21.699319 | 2.369281 |
| RandomizedSearchCV | 3.943297 | 33.395224 | 2.873217 | 3.250549 | 21.699319 | 2.369281 |
| GridSearchCV | 4.5842 | 42.613151 | 3.446231 | 3.613896 | 27.584772 | 2.611563 |
| RandomizedSearchCV | 4.625231 | 45.22604 | 3.458034 | 3.774538 | 30.214985 | 2.748519 |
| GridSearchCV | 4.643469 | 44.682147 | 3.385351 | 3.657341 | 28.398273 | 2.657646 |
| RandomizedSearchCV | 4.71406 | 47.16725 | 3.481667 | 3.812657 | 30.85548 | 2.788845 |


In [7]:
wisconsin_breast_cancer1 = fun(x_wisconsin, y_wisconsin, linear_model.Ridge())
wisconsin_breast_cancer2 = fun(x_wisconsin, y_wisconsin, linear_model.Lasso())
wisconsin_breast_cancer3 = fun(x_wisconsin, y_wisconsin, linear_model.ElasticNet())
wisconsin_breast_cancer4 = fun(x_wisconsin, y_wisconsin, linear_model.LassoLars())
wisconsin_breast_cancer5 = neighReg(x_wisconsin, y_wisconsin)

# Stack the per-model results (DataFrame.append was removed in pandas 2.0).
# Row pairs: LassoLars, Ridge, Lasso, ElasticNet, KNeighborsRegressor.
wisconsin_breast_cancer4 = pd.concat([wisconsin_breast_cancer4, wisconsin_breast_cancer1,
                                      wisconsin_breast_cancer2, wisconsin_breast_cancer3,
                                      wisconsin_breast_cancer5])
wisconsin_breast_cancer4

| | test_mean_absolute_error | test_mean_squared_error | test_median_absolute_error | train_mean_absolute_error | train_mean_squared_error | train_median_absolute_error |
|---|---|---|---|---|---|---|
| GridSearchCV | 30.665849 | 1380.216616 | 27.050985 | 24.789832 | 890.256356 | 22.761165 |
| RandomizedSearchCV | 30.665849 | 1380.216616 | 27.050985 | 24.789832 | 890.256356 | 22.761165 |
| GridSearchCV | 29.967607 | 1291.239413 | 29.037879 | 24.52172 | 887.707859 | 21.904468 |
| RandomizedSearchCV | 30.45159 | 1329.947168 | 29.76831 | 24.072996 | 868.574423 | 21.20442 |
| GridSearchCV | 29.055021 | 1200.794006 | 27.715184 | 25.407462 | 929.394374 | 22.99784 |
| RandomizedSearchCV | 29.113336 | 1207.442494 | 27.77492 | 25.337865 | 927.049009 | 23.021285 |
| GridSearchCV | 28.963436 | 1195.898527 | 28.322895 | 25.297216 | 924.010111 | 22.791684 |
| RandomizedSearchCV | 28.963436 | 1195.898527 | 28.322895 | 25.297216 | 924.010111 | 22.791684 |


In [40]:
communities_and_crime1 = fun(x_communities, y_communities, linear_model.Ridge())
communities_and_crime2 = fun(x_communities, y_communities, linear_model.Lasso())
communities_and_crime3 = fun(x_communities, y_communities, linear_model.ElasticNet())
communities_and_crime4 = fun(x_communities, y_communities, linear_model.LassoLars())
communities_and_crime5 = neighReg(x_communities, y_communities)

# Stack the per-model results (DataFrame.append was removed in pandas 2.0).
# Row pairs: LassoLars, Ridge, Lasso, ElasticNet, KNeighborsRegressor.
communities_and_crime4 = pd.concat([communities_and_crime4, communities_and_crime1,
                                    communities_and_crime2, communities_and_crime3,
                                    communities_and_crime5])
communities_and_crime4

| | test_mean_absolute_error | test_mean_squared_error | test_median_absolute_error | train_mean_absolute_error | train_mean_squared_error | train_median_absolute_error |
|---|---|---|---|---|---|---|
| GridSearchCV | 0.228038 | 0.074914 | 0.2072 | 0.2255065 | 0.07367124 | 0.206605 |
| RandomizedSearchCV | 0.228038 | 0.074914 | 0.2072 | 0.2255065 | 0.07367124 | 0.206605 |
| GridSearchCV | 0.105249 | 0.019533 | 0.078593 | 0.08341744 | 0.01212998 | 0.060696 |
| RandomizedSearchCV | 0.105249 | 0.019533 | 0.078593 | 0.08341744 | 0.01212998 | 0.060696 |
| GridSearchCV | 0.228038 | 0.074914 | 0.2072 | 0.2255065 | 0.07367124 | 0.206605 |
| RandomizedSearchCV | 0.228038 | 0.074914 | 0.2072 | 0.2255065 | 0.07367124 | 0.206605 |
| GridSearchCV | 0.197273 | 0.05641 | 0.173538 | 0.1933309 | 0.054131 | 0.170149 |
| RandomizedSearchCV | 0.197273 | 0.05641 | 0.173538 | 0.1933309 | 0.054131 | 0.170149 |
| GridSearchCV | 0.117354 | 0.02264 | 0.08711 | 6.136048e-09 | 6.438233e-16 | 0.0 |
| RandomizedSearchCV | 0.117354 | 0.02264 | 0.08711 | 8.399748e-09 | 1.289758e-15 | 0.0 |


In [25]:
def highlight_max(data, color='lime'):
    '''
    Highlight the maximum in a Series or DataFrame.
    '''
    attr = f'background-color: {color}'
    if data.ndim == 1:  # Series from .apply(axis=0) or axis=1
        is_max = data == data.max()
        return [attr if v else '' for v in is_max]
    else:  # DataFrame from .apply(axis=None)
        is_max = data == data.max().max()
        return pd.DataFrame(np.where(is_max, attr, ''),
                            index=data.index, columns=data.columns)

def highlight_min(data, color='yellow'):
    '''
    Highlight the minimum in a Series or DataFrame.
    '''
    attr = f'background-color: {color}'
    if data.ndim == 1:  # Series from .apply(axis=0) or axis=1
        is_min = data == data.min()
        return [attr if v else '' for v in is_min]
    else:  # DataFrame from .apply(axis=None)
        is_min = data == data.min().min()
        return pd.DataFrame(np.where(is_min, attr, ''),
                            index=data.index, columns=data.columns)

# Reset the index so the row labels are unique, then mark the maximum (lime)
# and minimum (yellow) of each column.
data_cpu4_styled = data_cpu4.reset_index(drop=True).style.apply(highlight_max)
data_cpu4_styled = data_cpu4_styled.apply(highlight_min)
data_cpu4_styled

| | test_mean_absolute_error | test_mean_squared_error | test_median_absolute_error | train_mean_absolute_error | train_mean_squared_error | train_median_absolute_error |
|---|---|---|---|---|---|---|
| 0 | 41.2885 | 6201.73 | 25.3269 | 35.6399 | 3427.45 | 23.0825 |
| 1 | 39.689 | 6938.31 | 22.9179 | 35.1713 | 3650.15 | 21.5178 |
| 2 | 43.3484 | 6378.28 | 27.0391 | 36.6947 | 3243.7 | 25.5805 |
| 3 | 43.3484 | 6378.28 | 27.0391 | 36.6947 | 3243.7 | 25.5805 |
| 4 | 43.106 | 6325.13 | 27.0848 | 36.682 | 3247.74 | 25.39 |
| 5 | 42.1844 | 6225.74 | 26.5591 | 36.6927 | 3250.29 | 25.4126 |
| 6 | 42.8733 | 6283.87 | 26.9732 | 36.6645 | 3246.64 | 25.6171 |
| 7 | 41.9891 | 6163.51 | 26.4718 | 36.6962 | 3253.22 | 25.7435 |


In [48]:
data_cpu4.to_html('filename.html')
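
The assignment asks for the report to contain at least the dataset name and the dataframe. A minimal sketch of such a page follows; the `build_report` helper, the temporary output path, and the numbers in `results` are hypothetical and serve only as an illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up numbers standing in for one dataset's results table.
results = pd.DataFrame(
    {"test_mean_squared_error": [1.5, 1.7],
     "train_mean_squared_error": [1.1, 1.2]},
    index=["GridSearchCV", "RandomizedSearchCV"])

def build_report(dataset_name, frame):
    # Minimal page: a heading with the dataset name followed by the table.
    return ("<html><body>\n"
            f"<h2>{dataset_name}</h2>\n"
            f"{frame.to_html()}\n"
            "</body></html>\n")

# Hypothetical output location; any writable path would do.
report_path = os.path.join(tempfile.mkdtemp(), "report.html")
with open(report_path, "w", encoding="utf-8") as handle:
    handle.write(build_report("CPU Computer Hardware", results))
```

To keep the colour highlighting as well, the styled object could be rendered in place of the plain frame; in pandas 1.3 and later, `Styler.to_html()` returns the table together with its CSS.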