# Lab 8

## Regression models

Use the following datasets:
1. [CPU Computer Hardware](https://archive.ics.uci.edu/ml/datasets/Computer+Hardware); exclude the columns vendor name, model name, and estimated relative performance from the dataset; the column to be estimated is "published relative performance".
1. [Boston Housing](http://archive.ics.uci.edu/ml/machine-learning-databases/housing/)
1. [Wisconsin Breast Cancer](http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html); look for Wisconsin Breast Cancer in the left-hand panel and follow the steps in "My personal Notes"
1. [Communities and Crime](http://archive.ics.uci.edu/ml/datasets/communities+and+crime); delete the first 5 dimensions and the features with missing values.
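
The preprocessing requested for Communities and Crime could be sketched roughly as follows. This is only an illustration on a tiny made-up table (the rows, town names, and values are invented, not taken from the real file), assuming that missing entries are marked with `?` as in the UCI file and that the target sits in the last column.

```python
import io
import pandas as pd

# Toy stand-in for communities.data (invented values): 5 leading identifier
# columns, two features (one of them with a missing value), target last.
raw = io.StringIO(
    "1,?,?,TownA,1,0.19,0.33,0.20\n"
    "2,?,?,TownB,1,0.00,?,0.67\n"
    "3,?,?,TownC,1,0.00,0.42,0.43\n"
)

# na_values="?" turns the '?' placeholders into NaN while parsing.
df = pd.read_csv(raw, header=None, na_values="?")

df = df.iloc[:, 5:]     # delete the first 5 dimensions
df = df.dropna(axis=1)  # delete every feature (column) that has a missing value

X = df.iloc[:, :-1].to_numpy(dtype=float)
y = df.iloc[:, -1].to_numpy(dtype=float)
print(X.shape, y.shape)
```

Dropping the identifier columns first matters here: some of them are almost entirely `?`, so they disappear either way, but slicing by position only works before other columns have been removed.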

For each dataset, apply at least 5 regression models from scikit-learn. For each model, report the mean absolute error, mean squared error, and median absolute error (see [sklearn.metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)) using 5-fold cross-validation. Hyperparameter values must be searched with grid search (cv=3) and random search (n_iter of your choice). The metric used for the hyperparameter search is mean squared error. Report the mean results for both the training folds and the test folds; hint: you can use `cross_validate` with the parameter `return_train_score=True`, passing a `GridSearchCV` or `RandomizedSearchCV` object as the model.
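
As a rough sketch of the hint above, a search object can be passed to `cross_validate` so that hyperparameters are tuned on inner folds (cv=3) and errors are measured on outer folds (cv=5). The synthetic data, the `Ridge` model, the `alpha` ranges, and `n_iter=8` are assumptions chosen only for illustration.

```python
import pandas as pd
from scipy.stats import uniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_validate

# Synthetic regression problem standing in for one of the lab datasets.
X, y = make_regression(n_samples=120, n_features=6, noise=10.0, random_state=0)

scoring = ("neg_mean_absolute_error",
           "neg_mean_squared_error",
           "neg_median_absolute_error")

# Inner searches: both select alpha by (negated) mean squared error with cv=3.
grid = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]},
                    cv=3, scoring="neg_mean_squared_error")
rand = RandomizedSearchCV(Ridge(),
                          param_distributions={"alpha": uniform(loc=0.01, scale=10.0)},
                          n_iter=8, cv=3, scoring="neg_mean_squared_error",
                          random_state=0)

results = {}
for name, search in (("GridSearchCV", grid), ("RandomizedSearchCV", rand)):
    # Outer loop: 5-fold evaluation of the whole "search + refit" procedure.
    scores = cross_validate(search, X, y, cv=5, scoring=scoring,
                            return_train_score=True)
    # Average over folds and flip the sign so that errors are positive.
    results[name] = {key.replace("neg_", ""): -values.mean()
                     for key, values in scores.items()
                     if key.startswith(("train_", "test_"))}

report = pd.DataFrame(results).T
print(report)
```

Using a continuous distribution for the random search, rather than the same short list as the grid, is one way to make the two strategies explore genuinely different candidates.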

The results will be stored in a dataframe. In an intermediate state, the values will carry a minus sign: scikit-learn scorers follow a higher-is-better convention, so error metrics are reported negated (the `neg_*` scorers); see the image below:
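
A minimal sketch of that intermediate state and of the sign flip, assuming a small hand-written table (the numbers are placeholders, not results from the lab):

```python
import pandas as pd

# Hypothetical intermediate report: the neg_* scorers yield negated errors.
intermediate = pd.DataFrame(
    {"test_neg_mean_absolute_error": [-2.0, -1.5],
     "test_neg_mean_squared_error": [-9.0, -4.0]},
    index=["GridSearchCV", "RandomizedSearchCV"],
)

# Negate the values and drop the "neg_" prefix so names match what is stored.
report = -intermediate
report.columns = [name.replace("neg_", "") for name in report.columns]
print(report)
```

Negating rather than taking the absolute value keeps an unexpected positive score visible instead of silently hiding it.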

![intermediate report](./images/cpu_intermediate_blurred.png)


The values will then be converted to positive numbers, and the maximum and minimum values will be highlighted; as a guide, you can use the image below, which shows the dataframe as displayed in the notebook; you may use other dataframe styling options, such as those at https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#.

A final report will be created in HTML or PDF format, as separate file(s). At a minimum, the report must contain the dataset name and the dataframe object; preferably, keep the color highlighting produced in the notebook.

![report](./images/cpu_results_blurred.png)

Grading:
1. 20 points are awarded by default.
1. Optimization and performance measurement of the models: 3 points for each dataset + model combination = 60 points
1. Model documentation: number of models * 2 points = 10 points. Document each of the models used in the Jupyter notebook, in Romanian. You may create a separate section for the algorithm documentation. Each model must have a description of at least 20 lines, at least one associated image, and at least 2 bibliographic references.
1. 10 points: export in HTML or PDF format.



*Grading:* the lab will be saved in the GitHub repository and presented during the week of May 6-10.

In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_validate
from sklearn import linear_model
from scipy.stats import uniform as sp_rand
from sklearn import ensemble
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
import sys
import warnings

# Silence library warnings unless the user asked for them (e.g. via -W).
if not sys.warnoptions:
    warnings.simplefilter("ignore")

In [5]:
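# CPU: drop vendor name and model name (first two columns) and the estimated
# relative performance (last column); the target is the published relative performance.
data_cpu = pd.read_csv('machine.data', header=None)
data_cpu = data_cpu.values[:, 2:-1]
x_cpu = data_cpu[:, :-1]
y_cpu = data_cpu[:, -1]

# Boston Housing: the file is whitespace-delimited; the last column is the target.
boston_housing = pd.read_csv('housing.data', header=None, sep=r'\s+')
x_housing = boston_housing.values[:, :-1]
y_housing = boston_housing.values[:, -1]

wisconsin_breast_cancer = pd.read_csv('r_wpbc.data', header=None)
x_wisconsin = wisconsin_breast_cancer.values[:, :-1]
y_wisconsin = wisconsin_breast_cancer.values[:, -1]

communities_and_crime = pd.read_csv('communities.data', header=None)
# Keeps only the rows without '?' (missing) entries. Note that the assignment asks
# for the features (columns) with missing values to be removed instead.
communities_and_crime = communities_and_crime[(communities_and_crime != '?').all(axis=1)]
# Skip the first 5 columns; the last column is the target.
x_communities = communities_and_crime.values[:, 5:-1]
y_communities = communities_and_crime.values[:, -1]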


In [6]:
METRICS = ('mean_absolute_error', 'mean_squared_error', 'median_absolute_error')
SCORING = tuple('neg_' + metric for metric in METRICS)

def summarize(cv_results):
    # sklearn reports errors negated; abs() brings the fold means back to positive values.
    return {f'{split}_{metric}': abs(cv_results[f'{split}_neg_{metric}'].mean())
            for split in ('test', 'train') for metric in METRICS}

def fun(x, y, model):
    """Evaluate `model` with grid and random search over `alpha`, each inside 5-fold CV."""
    warnings.simplefilter(action='ignore')
    alphas = {'alpha': [0.1, 1.0, 10.0]}

    # Inner loop: grid search (cv=3) picks alpha by MSE; outer loop: 5-fold evaluation.
    grid = GridSearchCV(model, param_grid=alphas, cv=3, scoring='neg_mean_squared_error')
    grid_scores = cross_validate(grid, x, y, scoring=SCORING, cv=5, return_train_score=True)

    # NOTE: no `scoring` or `cv` is given here, so this search falls back to the
    # estimator's default score (R^2) and the library's default number of folds.
    rsearch = RandomizedSearchCV(estimator=model, param_distributions=alphas, n_iter=3)
    rsearch_scores = cross_validate(rsearch, x, y, scoring=SCORING, cv=5, return_train_score=True)

    dataset = pd.DataFrame({'GridSearchCV': summarize(grid_scores),
                            'RandomizedSearchCV': summarize(rsearch_scores)})
    return dataset.T

In [7]:
def neighReg(x, y):
    """Same evaluation as `fun`, for KNeighborsRegressor with its own search space."""
    params = {'n_neighbors': [5, 6, 7, 8], 'leaf_size': [1, 2, 3], 'weights': ['uniform', 'distance']}

    # NOTE: neither search below sets `scoring`, so both select hyperparameters
    # with the estimator's default score (R^2) rather than MSE.
    grid = GridSearchCV(KNeighborsRegressor(), param_grid=params, cv=3, n_jobs=-1)
    grid_scores = cross_validate(grid, x, y, scoring=SCORING, cv=5, return_train_score=True)

    rsearch = RandomizedSearchCV(estimator=KNeighborsRegressor(), param_distributions=params, n_iter=20)
    rsearch_scores = cross_validate(rsearch, x, y, scoring=SCORING, cv=5, return_train_score=True)

    dataset = pd.DataFrame({'GridSearchCV': summarize(grid_scores),
                            'RandomizedSearchCV': summarize(rsearch_scores)})
    return dataset.T

In [8]:
data_cpu1=fun(x_cpu, y_cpu, linear_model.Ridge())
data_cpu2=fun(x_cpu, y_cpu, linear_model.Lasso())
data_cpu3=fun(x_cpu, y_cpu, linear_model.ElasticNet())
data_cpu4=fun(x_cpu, y_cpu, linear_model.LassoLars())
data_cpu5=neighReg(x_cpu, y_cpu)

data_cpu4

| | test_mean_absolute_error | test_mean_squared_error | test_median_absolute_error | train_mean_absolute_error | train_mean_squared_error | train_median_absolute_error |
|---|---|---|---|---|---|---|
| GridSearchCV | 41.288543 | 6201.732604 | 25.326891 | 35.639909 | 3427.445058 | 23.082545 |
| RandomizedSearchCV | 39.688969 | 6938.306195 | 22.917932 | 35.171268 | 3650.149457 | 21.517765 |


In [9]:
boston_housing1 = fun(x_housing, y_housing, linear_model.Ridge())
boston_housing2 = fun(x_housing, y_housing, linear_model.Lasso())
boston_housing3 = fun(x_housing, y_housing, linear_model.ElasticNet())
boston_housing4 = fun(x_housing, y_housing, linear_model.LassoLars())
boston_housing5 = neighReg(x_housing, y_housing)

boston_housing4

| | test_mean_absolute_error | test_mean_squared_error | test_median_absolute_error | train_mean_absolute_error | train_mean_squared_error | train_median_absolute_error |
|---|---|---|---|---|---|---|
| GridSearchCV | 4.732101 | 47.578549 | 3.236772 | 3.991472 | 32.854899 | 2.84821 |
| RandomizedSearchCV | 4.732101 | 47.578549 | 3.236772 | 3.991472 | 32.854899 | 2.84821 |


In [10]:
wisconsin_breast_cancer1 = fun(x_wisconsin, y_wisconsin, linear_model.Ridge())
wisconsin_breast_cancer2 = fun(x_wisconsin, y_wisconsin, linear_model.Lasso())
wisconsin_breast_cancer3 = fun(x_wisconsin, y_wisconsin, linear_model.ElasticNet())
wisconsin_breast_cancer4 = fun(x_wisconsin, y_wisconsin, linear_model.LassoLars())
wisconsin_breast_cancer5 = neighReg(x_wisconsin, y_wisconsin)

wisconsin_breast_cancer4

| | test_mean_absolute_error | test_mean_squared_error | test_median_absolute_error | train_mean_absolute_error | train_mean_squared_error | train_median_absolute_error |
|---|---|---|---|---|---|---|
| GridSearchCV | 30.665849 | 1380.216616 | 27.050985 | 24.789832 | 890.256356 | 22.761165 |
| RandomizedSearchCV | 30.665849 | 1380.216616 | 27.050985 | 24.789832 | 890.256356 | 22.761165 |


In [11]:
communities_and_crime1 = fun(x_communities, y_communities, linear_model.Ridge())
communities_and_crime2 = fun(x_communities, y_communities, linear_model.Lasso())
communities_and_crime3 = fun(x_communities, y_communities, linear_model.ElasticNet())
communities_and_crime4 = fun(x_communities, y_communities, linear_model.LassoLars())
communities_and_crime5 = neighReg(x_communities, y_communities)

communities_and_crime4

| | test_mean_absolute_error | test_mean_squared_error | test_median_absolute_error | train_mean_absolute_error | train_mean_squared_error | train_median_absolute_error |
|---|---|---|---|---|---|---|
| GridSearchCV | 0.228038 | 0.074914 | 0.2072 | 0.225506 | 0.073671 | 0.206605 |
| RandomizedSearchCV | 0.228038 | 0.074914 | 0.2072 | 0.225506 | 0.073671 | 0.206605 |


In [12]:
def highlight_max(s):
    '''
    Highlight the maximum in a Series in red.
    '''
    is_max = s == s.max()
    return ['background-color: red' if v else '' for v in is_max]

def highlight_min(s):
    '''
    Highlight the minimum in a Series in lime.
    '''
    is_min = s == s.min()
    return ['background-color: lime' if v else '' for v in is_min]

# data_cpu4_styled=data_cpu4.reset_index(drop=True).style.apply(highlight_max)
# data_cpu4_styled=data_cpu4_styled.apply(highlight_min)
# data_cpu4_styled

In [22]:
temp = {'Ridge': data_cpu1, 'Lasso': data_cpu2, 'ElasticNet':data_cpu3, 'LassoLars': data_cpu4, 'KNeighborsRegressor':data_cpu5}
result=pd.concat(temp)
result = result.style.apply(highlight_max).apply(highlight_min)
result

| | | test_mean_absolute_error | test_mean_squared_error | test_median_absolute_error | train_mean_absolute_error | train_mean_squared_error | train_median_absolute_error |
|---|---|---|---|---|---|---|---|
| ElasticNet | GridSearchCV | 42.8733 | 6283.87 | 26.9732 | 36.6645 | 3246.64 | 25.6171 |
| ElasticNet | RandomizedSearchCV | 41.9891 | 6163.51 | 26.4718 | 36.6962 | 3253.22 | 25.7435 |
| KNeighborsRegressor | GridSearchCV | 37.0517 | 5869.26 | 17.4342 | 2.56936 | 98.8612 | 0.0 |
| KNeighborsRegressor | RandomizedSearchCV | 36.9836 | 5859.01 | 17.4342 | 2.56936 | 98.8612 | 0.0 |
| Lasso | GridSearchCV | 43.106 | 6325.13 | 27.0848 | 36.682 | 3247.74 | 25.39 |
| Lasso | RandomizedSearchCV | 42.1844 | 6225.74 | 26.5591 | 36.6927 | 3250.29 | 25.4126 |
| LassoLars | GridSearchCV | 41.2885 | 6201.73 | 25.3269 | 35.6399 | 3427.45 | 23.0825 |
| LassoLars | RandomizedSearchCV | 39.689 | 6938.31 | 22.9179 | 35.1713 | 3650.15 | 21.5178 |
| Ridge | GridSearchCV | 43.3484 | 6378.28 | 27.0391 | 36.6947 | 3243.7 | 25.5805 |
| Ridge | RandomizedSearchCV | 43.3484 | 6378.28 | 27.0391 | 36.6947 | 3243.7 | 25.5805 |


In [34]:
temp = {'Ridge': boston_housing1, 'Lasso': boston_housing2, 'ElasticNet':boston_housing3, 'LassoLars': boston_housing4, 'KNeighborsRegressor':boston_housing5}
result=pd.concat(temp)
result = result.style.apply(highlight_max).apply(highlight_min)
result

| | | test_mean_absolute_error | test_mean_squared_error | test_median_absolute_error | train_mean_absolute_error | train_mean_squared_error | train_median_absolute_error |
|---|---|---|---|---|---|---|---|
| ElasticNet | GridSearchCV | 4.64347 | 44.6821 | 3.38535 | 3.65734 | 28.3983 | 2.65765 |
| ElasticNet | RandomizedSearchCV | 4.71406 | 47.1673 | 3.48167 | 3.81266 | 30.8555 | 2.78885 |
| KNeighborsRegressor | GridSearchCV | 6.10634 | 77.524 | 4.2765 | 0.809298 | 6.50452 | 0.571429 |
| KNeighborsRegressor | RandomizedSearchCV | 6.10634 | 77.524 | 4.2765 | 0.809298 | 6.50452 | 0.571429 |
| Lasso | GridSearchCV | 4.5842 | 42.6132 | 3.44623 | 3.6139 | 27.5848 | 2.61156 |
| Lasso | RandomizedSearchCV | 4.62523 | 45.226 | 3.45803 | 3.77454 | 30.215 | 2.74852 |
| LassoLars | GridSearchCV | 4.7321 | 47.5785 | 3.23677 | 3.99147 | 32.8549 | 2.84821 |
| LassoLars | RandomizedSearchCV | 4.7321 | 47.5785 | 3.23677 | 3.99147 | 32.8549 | 2.84821 |
| Ridge | GridSearchCV | 3.9433 | 33.3952 | 2.87322 | 3.25055 | 21.6993 | 2.36928 |
| Ridge | RandomizedSearchCV | 3.9433 | 33.3952 | 2.87322 | 3.25055 | 21.6993 | 2.36928 |


In [35]:
temp = {'Ridge': wisconsin_breast_cancer1, 'Lasso': wisconsin_breast_cancer2, 'ElasticNet':wisconsin_breast_cancer3, 'LassoLars': wisconsin_breast_cancer4, 'KNeighborsRegressor':wisconsin_breast_cancer5}
result=pd.concat(temp)
result = result.style.apply(highlight_max).apply(highlight_min)
result

| | | test_mean_absolute_error | test_mean_squared_error | test_median_absolute_error | train_mean_absolute_error | train_mean_squared_error | train_median_absolute_error |
|---|---|---|---|---|---|---|---|
| ElasticNet | GridSearchCV | 28.9634 | 1195.9 | 28.3229 | 25.2972 | 924.01 | 22.7917 |
| ElasticNet | RandomizedSearchCV | 28.9634 | 1195.9 | 28.3229 | 25.2972 | 924.01 | 22.7917 |
| KNeighborsRegressor | GridSearchCV | 32.5522 | 1471.52 | 29.5636 | 14.1885 | 494.442 | 12.644 |
| KNeighborsRegressor | RandomizedSearchCV | 32.5522 | 1471.52 | 29.5636 | 14.1885 | 494.442 | 12.644 |
| Lasso | GridSearchCV | 29.055 | 1200.79 | 27.7152 | 25.4075 | 929.394 | 22.9978 |
| Lasso | RandomizedSearchCV | 29.1133 | 1207.44 | 27.7749 | 25.3379 | 927.049 | 23.0213 |
| LassoLars | GridSearchCV | 30.6658 | 1380.22 | 27.051 | 24.7898 | 890.256 | 22.7612 |
| LassoLars | RandomizedSearchCV | 30.6658 | 1380.22 | 27.051 | 24.7898 | 890.256 | 22.7612 |
| Ridge | GridSearchCV | 29.9676 | 1291.24 | 29.0379 | 24.5217 | 887.708 | 21.9045 |
| Ridge | RandomizedSearchCV | 30.4516 | 1329.95 | 29.7683 | 24.073 | 868.574 | 21.2044 |


In [36]:
temp = {'Ridge': communities_and_crime1, 'Lasso': communities_and_crime2, 'ElasticNet':communities_and_crime3, 'LassoLars': communities_and_crime4, 'KNeighborsRegressor':communities_and_crime5}
result=pd.concat(temp)
result = result.style.apply(highlight_max).apply(highlight_min)
result

| | | test_mean_absolute_error | test_mean_squared_error | test_median_absolute_error | train_mean_absolute_error | train_mean_squared_error | train_median_absolute_error |
|---|---|---|---|---|---|---|---|
| ElasticNet | GridSearchCV | 0.197273 | 0.0564097 | 0.173538 | 0.193331 | 0.054131 | 0.170149 |
| ElasticNet | RandomizedSearchCV | 0.197273 | 0.0564097 | 0.173538 | 0.193331 | 0.054131 | 0.170149 |
| KNeighborsRegressor | GridSearchCV | 0.117179 | 0.0233241 | 0.0894509 | 0.0 | 0.0 | 0.0 |
| KNeighborsRegressor | RandomizedSearchCV | 0.117179 | 0.0233241 | 0.0894509 | 0.0 | 0.0 | 0.0 |
| Lasso | GridSearchCV | 0.228038 | 0.0749139 | 0.2072 | 0.225506 | 0.0736712 | 0.206605 |
| Lasso | RandomizedSearchCV | 0.228038 | 0.0749139 | 0.2072 | 0.225506 | 0.0736712 | 0.206605 |
| LassoLars | GridSearchCV | 0.228038 | 0.0749139 | 0.2072 | 0.225506 | 0.0736712 | 0.206605 |
| LassoLars | RandomizedSearchCV | 0.228038 | 0.0749139 | 0.2072 | 0.225506 | 0.0736712 | 0.206605 |
| Ridge | GridSearchCV | 0.105249 | 0.0195327 | 0.0785926 | 0.0834174 | 0.01213 | 0.060696 |
| Ridge | RandomizedSearchCV | 0.105249 | 0.0195327 | 0.0785926 | 0.0834174 | 0.01213 | 0.060696 |


In [19]:
# Stack the per-model tables (DataFrame.append is not available in recent pandas).
data_cpu1 = pd.concat([data_cpu1, data_cpu2, data_cpu3, data_cpu4, data_cpu5])

data_cpu1.to_html('cpu.html')

In [13]:
boston_housing1 = pd.concat([boston_housing1, boston_housing2, boston_housing3,
                             boston_housing4, boston_housing5])

boston_housing1.to_html('housing.html')

wisconsin_breast_cancer1 = pd.concat([wisconsin_breast_cancer1, wisconsin_breast_cancer2,
                                      wisconsin_breast_cancer3, wisconsin_breast_cancer4,
                                      wisconsin_breast_cancer5])

wisconsin_breast_cancer1.to_html('cancer.html')

communities_and_crime1 = pd.concat([communities_and_crime1, communities_and_crime2,
                                    communities_and_crime3, communities_and_crime4,
                                    communities_and_crime5])

communities_and_crime1.to_html('crime.html')