# Exercise 6

## SVM & Regularization


For this homework we consider a set of observations on a number of red and white wine varieties involving their chemical properties and ranking by tasters. Wine industry shows a recent growth spurt as social drinking is on the rise. The price of wine depends on a rather abstract concept of wine appreciation by wine tasters, opinion among whom may have a high degree of variability. Pricing of wine depends on such a volatile factor to some extent. Another key factor in wine certification and quality assessment is physicochemical tests which are laboratory-based and takes into account factors like acidity, pH level, presence of sugar and other chemical properties. For the wine market, it would be of interest if human quality of tasting can be related to the chemical properties of wine so that certification and quality assessment and assurance process is more controlled.

Two datasets are available of which one dataset is on red wine and have 1599 different varieties and the other is on white wine and have 4898 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 different properties of the wines one of which is Quality, based on sensory data, and the rest are on chemical properties of the wines including density, acidity, alcohol content etc. All chemical properties of wines are continuous variables. Quality is an ordinal variable with possible ranking from 1 (worst) to 10 (best). Each variety of wine is tasted by three independent tasters and the final rank assigned is the median rank given by the tasters.

A predictive model developed on this data is expected to provide guidance to vineyards regarding quality and price expected on their produce without heavy reliance on volatility of wine tasters.

In [1]:
import pandas as pd
import numpy as np

In [2]:
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')

In [3]:
data = data_w.assign(type = 'white')

data = data.append(data_r.assign(type = 'red'), ignore_index=True)
data.sample(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
879,6.9,0.12,0.36,2.2,0.037,18.0,111.0,0.9919,3.41,0.82,11.9,8,white
3436,6.9,0.24,0.23,7.1,0.041,20.0,97.0,0.99246,3.1,0.85,11.4,6,white
5086,7.9,0.5,0.33,2.0,0.084,15.0,143.0,0.9968,3.2,0.55,9.5,5,red
5663,9.2,0.67,0.1,3.0,0.091,12.0,48.0,0.99888,3.31,0.54,9.5,6,red
4271,6.9,0.42,0.2,15.4,0.043,57.0,201.0,0.99848,3.08,0.54,9.4,5,white


# Exercise 6.1

Show the frecuency table of the quality by type of wine

In [4]:
tabla_de_frecuencia = pd.DataFrame(data['quality'].groupby(data['type']).value_counts())
tabla_de_frecuencia

Unnamed: 0_level_0,Unnamed: 1_level_0,quality
type,quality,Unnamed: 2_level_1
red,5,681
red,6,638
red,7,199
red,4,53
red,8,18
red,3,10
white,6,2198
white,5,1457
white,7,880
white,8,175


# SVM

# Exercise 6.2

* Standarized the features (not the quality)
* Create a binary target for each type of wine
* Create two Linear SVM's for the white and red wines, repectively.


In [5]:
data.quality.unique()

array([6, 5, 7, 8, 4, 3, 9], dtype=int64)

In [6]:
# Se tomarán sólo como buenos los de calidad mayor a 6
data['quality_dummy'] = data['quality'] > 6
print(data['quality_dummy'].value_counts())
data['quality_dummy'] = data['quality_dummy'].astype(int)
data_r = data[data['type'] == 'red']
data_w = data[data['type'] == 'white']
data_r.head()

False    5220
True     1277
Name: quality_dummy, dtype: int64


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,quality_dummy
4898,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,0
4899,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red,0
4900,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red,0
4901,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red,0
4902,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,0


In [7]:
data_w.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,quality_dummy
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,white,0
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,white,0
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,white,0
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white,0
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white,0


In [8]:
variables = data.columns.values.tolist()
variables.remove('quality')
variables.remove('quality_dummy')
variables.remove('type')

variables_y = ['quality','quality_dummy']

# Tener claro cuáles serán las X's
variables

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol']

In [9]:
# Dividir X e Y
X_w = data_w[variables]
y_w = data_w[variables_y]

X_r = data_r[variables]
y_r = data_r[variables_y]

# Aquí para dividir en train y test
from sklearn.model_selection import train_test_split
X_r_train, X_r_test, y_r_train, y_r_test = train_test_split(X_r, y_r)
X_w_train, X_w_test, y_w_train, y_w_test = train_test_split(X_w, y_w)

# Normalizar datos
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit(X_r_train)
scaler.fit(X_w_train)
X_w_train_scaled = scaler.transform(X_w_train)
X_w_test_scaled = scaler.transform(X_w_test)

X_r_train_scaled = scaler.transform(X_r_train)
X_r_test_scaled = scaler.transform(X_r_test)

In [10]:
from sklearn.svm import SVC # "Support Vector Classifier"
clf_lin = SVC(kernel='linear', gamma='auto')

In [11]:
clf_lin.fit(X_r_train_scaled,y_r_train['quality_dummy'])

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [12]:
y_r_pred_lin = clf_lin.predict(X_r_test_scaled)

In [13]:
from sklearn.metrics import accuracy_score
accuracy_score(y_r_test['quality_dummy'], y_r_pred_lin)

0.8675

In [14]:
clf_lin.fit(X_w_train_scaled,y_w_train['quality_dummy'])

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [15]:
y_w_pred_lin = clf_lin.predict(X_w_test_scaled)

In [16]:
accuracy_score(y_w_test['quality_dummy'], y_w_pred_lin)

0.7812244897959184

# Exercise 6.3

Test the two SVM's using the different kernels (‘poly’, ‘rbf’, ‘sigmoid’)


In [17]:
clf_poly = SVC(kernel='poly', gamma='auto')
clf_rbf = SVC(kernel='rbf', gamma='auto')
clf_sig = SVC(kernel='sigmoid', gamma='auto')

### Tipo Red

In [18]:
clf_poly.fit(X_r_train_scaled, y_r_train['quality_dummy'])
clf_rbf.fit(X_r_train_scaled, y_r_train['quality_dummy'])
clf_sig.fit(X_r_train_scaled, y_r_train['quality_dummy'])

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [19]:
y_r_pred_poly = clf_poly.predict(X_r_test_scaled)
y_r_pred_rbf = clf_rbf.predict(X_r_test_scaled)
y_r_pred_sig = clf_sig.predict(X_r_test_scaled)

In [20]:
print(f"Accuracy Score para kernel 'poly' es: {accuracy_score(y_r_test['quality_dummy'], y_r_pred_poly)}")
print(f"Accuracy Score para kernel 'rbf' es: {accuracy_score(y_r_test['quality_dummy'], y_r_pred_rbf)}")
print(f"Accuracy Score para kernel 'sig' es: {accuracy_score(y_r_test['quality_dummy'], y_r_pred_sig)}")

Accuracy Score para kernel 'poly' es: 0.8775
Accuracy Score para kernel 'rbf' es: 0.8925
Accuracy Score para kernel 'sig' es: 0.81


In [21]:
# El kernel más adecuado para los vinos tipo red es el rbf

### Tipo White

In [22]:
clf_poly.fit(X_w_train_scaled, y_w_train['quality_dummy'])
clf_rbf.fit(X_w_train_scaled, y_w_train['quality_dummy'])
clf_sig.fit(X_w_train_scaled, y_w_train['quality_dummy'])

y_w_pred_poly = clf_poly.predict(X_w_test_scaled)
y_w_pred_rbf = clf_rbf.predict(X_w_test_scaled)
y_w_pred_sig = clf_sig.predict(X_w_test_scaled)

print(f"Accuracy Score para kernel 'poly' es: {accuracy_score(y_w_test['quality_dummy'], y_w_pred_poly)}")
print(f"Accuracy Score para kernel 'rbf' es: {accuracy_score(y_w_test['quality_dummy'], y_w_pred_rbf)}")
print(f"Accuracy Score para kernel 'sig' es: {accuracy_score(y_w_test['quality_dummy'], y_w_pred_sig)}")

Accuracy Score para kernel 'poly' es: 0.806530612244898
Accuracy Score para kernel 'rbf' es: 0.8171428571428572
Accuracy Score para kernel 'sig' es: 0.7387755102040816


In [23]:
# El kernel más adecuado para los vinos tipo white es el rbf

# Exercise 6.4
Using the best SVM find the parameters that gives the best performance

'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]

### Tipo Red

In [24]:
accuracy = []
for c in [0.1, 1, 10, 100, 1000]:
    for g in [0.01, 0.001, 0.0001]:
        clf_rbf_cg = SVC(kernel='rbf', C=c, gamma=g)
        clf_rbf_cg.fit(X_r_train_scaled, y_r_train['quality_dummy'])
        y_r_pred_rbf_cg = clf_rbf_cg.predict(X_r_test_scaled)
        accuracy.append([c, g, accuracy_score(y_r_test['quality_dummy'], y_r_pred_rbf_cg)])
accuracy = pd.DataFrame(accuracy, columns = ['C', 'Gamma', 'Accuracy'])
print(accuracy.loc[accuracy['Accuracy'].idxmax()])

C           10.0000
Gamma        0.0100
Accuracy     0.8925
Name: 6, dtype: float64


### Tipo White

In [25]:
accuracy = []
for c in [0.1, 1, 10, 100, 1000]:
    for g in [0.01, 0.001, 0.0001]:
        clf_rbf_cg = SVC(kernel='rbf', C=c, gamma=g)
        clf_rbf_cg.fit(X_w_train_scaled, y_w_train['quality_dummy'])
        y_w_pred_rbf_cg = clf_rbf_cg.predict(X_w_test_scaled)
        accuracy.append([c, g, accuracy_score(y_w_test['quality_dummy'], y_w_pred_rbf_cg)])
accuracy = pd.DataFrame(accuracy, columns = ['C', 'Gamma', 'Accuracy'])
print(accuracy.loc[accuracy['Accuracy'].idxmax()])

C           100.000000
Gamma         0.010000
Accuracy      0.817143
Name: 9, dtype: float64


# Exercise 6.5

Compare the results with other methods

In [26]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9, solver='liblinear', multi_class='ovr')
logreg_l1 = LogisticRegression(C=0.1, penalty='l1', solver='liblinear', multi_class='ovr')
logreg_l2 = LogisticRegression(C=0.1, penalty='l2', solver='liblinear', multi_class='ovr')

### Tipo Red

In [27]:
logreg.fit(X_r_train_scaled, y_r_train['quality_dummy'])
logreg_l1.fit(X_r_train_scaled, y_r_train['quality_dummy'])
logreg_l2.fit(X_r_train_scaled, y_r_train['quality_dummy'])

y_r_pred_logreg = logreg.predict(X_r_test_scaled)
y_r_pred_logreg_l1 = logreg_l1.predict(X_r_test_scaled)
y_r_pred_logreg_l2 = logreg_l2.predict(X_r_test_scaled)

print(f"Accuracy Score para uan regresión logística es: {accuracy_score(y_r_test['quality_dummy'], y_r_pred_logreg)}")
print(f"Accuracy Score para uan regresión logística tipo Lasso es: {accuracy_score(y_r_test['quality_dummy'], y_r_pred_logreg_l1)}")
print(f"Accuracy Score para uan regresión logística truncada es: {accuracy_score(y_r_test['quality_dummy'], y_r_pred_logreg_l2)}")

print(f'''
Teniendo en cuenta los resultados, para los vino de tipo red usaría un SVM con kernel = 'rbf' ya que genera un 
Accuracy Score de 0.89
''')

Accuracy Score para uan regresión logística es: 0.89
Accuracy Score para uan regresión logística tipo Lasso es: 0.89
Accuracy Score para uan regresión logística truncada es: 0.8925

Teniendo en cuenta los resultados, para los vino de tipo red usaría un SVM con kernel = 'rbf' ya que genera un 
Accuracy Score de 0.89



### Tipo White

In [28]:
logreg.fit(X_w_train_scaled, y_w_train['quality_dummy'])
logreg_l1.fit(X_w_train_scaled, y_w_train['quality_dummy'])
logreg_l2.fit(X_w_train_scaled, y_w_train['quality_dummy'])

y_w_pred_logreg = logreg.predict(X_w_test_scaled)
y_w_pred_logreg_l1 = logreg_l1.predict(X_w_test_scaled)
y_w_pred_logreg_l2 = logreg_l2.predict(X_w_test_scaled)

print(f"Accuracy Score para uan regresión logística es: {round(accuracy_score(y_w_test['quality_dummy'], y_w_pred_logreg),4)}")
print(f"Accuracy Score para uan regresión logística tipo Lasso es: {round(accuracy_score(y_w_test['quality_dummy'], y_w_pred_logreg_l1),4)}")
print(f"Accuracy Score para uan regresión logística truncada es: {round(accuracy_score(y_w_test['quality_dummy'], y_w_pred_logreg_l2),4)}")

print(f'''
Teniendo en cuenta los resultados, para los vino de tipo white usaría un SVM con kernel = 'rbf' ya que genera un 
Accuracy Score de 0.824
''')

Accuracy Score para uan regresión logística es: 0.8122
Accuracy Score para uan regresión logística tipo Lasso es: 0.809
Accuracy Score para uan regresión logística truncada es: 0.8131

Teniendo en cuenta los resultados, para los vino de tipo white usaría un SVM con kernel = 'rbf' ya que genera un 
Accuracy Score de 0.824



# Regularization

# Exercise 6.6


* Train a linear regression to predict wine quality (Continous)

* Analyze the coefficients

* Evaluate the RMSE

In [29]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression(normalize=False)

### Tipo Red

In [30]:
linreg.fit(X_r_train_scaled, y_r_train['quality'])

coef_linreg_r = pd.DataFrame(linreg.coef_, columns=['Beta'])
coef_linreg_r.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
Beta,-0.015404,-0.103305,-0.015133,0.038274,-0.037201,0.079775,-0.169505,0.052479,-0.100879,0.097121,0.379539


In [31]:
linreg.intercept_

5.669317719339707

Los coeficientes son muy pequeños frente al intercepto lo que podría sesgar el análisis.

In [32]:
y_r_pred_linreg = linreg.predict(X_r_test_scaled)

from sklearn import metrics
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_r_test['quality'], y_r_pred_linreg)))

RMSE: 0.6665619235919926


### Tipo White

In [33]:
linreg.fit(X_w_train_scaled, y_w_train['quality'])

coef_linreg_w = pd.DataFrame(linreg.coef_, columns=['Beta'])
coef_linreg_w.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
Beta,0.065848,-0.174214,0.007385,0.429454,-0.005372,0.076625,-0.029241,-0.467087,0.124442,0.079157,0.230011


In [34]:
linreg.intercept_

5.881840457391785

Los coeficientes son muy pequeños frente al intercepto lo que podría sesgar el análisis.

In [35]:
y_w_pred_linreg = linreg.predict(X_w_test_scaled)

print('RMSE:', np.sqrt(metrics.mean_squared_error(y_w_test['quality'], y_w_pred_linreg)))

RMSE: 0.7827957991035311


# Exercise 6.7

* Estimate a ridge regression with alpha equals 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [36]:
from sklearn.linear_model import Ridge
linreg_l2_01 = Ridge(alpha=0.1)
linreg_l2_1 = Ridge(alpha=1)

### Tipo Red

In [37]:
linreg_l2_01.fit(X_r_train_scaled, y_r_train['quality'])
linreg_l2_1.fit(X_r_train_scaled, y_r_train['quality'])

coef_ridgereg_01_r = pd.DataFrame(linreg_l2_01.coef_, columns = ['Beta'])
coef_ridgereg_01_r.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
Beta,-0.015315,-0.103302,-0.015127,0.038427,-0.037197,0.079695,-0.169435,0.052187,-0.100764,0.097134,0.379384


In [38]:
comp_ridgereg_01_linreg_r = pd.DataFrame(coef_linreg_r >= coef_ridgereg_01_r)
comp_ridgereg_01_linreg_r.Beta.value_counts()

False    8
True     3
Name: Beta, dtype: int64

In [39]:
coef_ridgereg_1_r = pd.DataFrame(linreg_l2_1.coef_, columns = ['Beta'])
coef_ridgereg_1_r.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
Beta,-0.014538,-0.103277,-0.015067,0.039754,-0.037163,0.078983,-0.16881,0.049625,-0.099742,0.097246,0.378011


In [40]:
comp_ridgereg_1_linreg_r = pd.DataFrame(coef_linreg_r >= coef_ridgereg_1_r)
comp_ridgereg_1_linreg_r.Beta.value_counts()

False    8
True     3
Name: Beta, dtype: int64

In [41]:
y_r_pred_linreg_l2_01 = linreg_l2_01.predict(X_r_test_scaled)
y_r_pred_linreg_l2_1 = linreg_l2_1.predict(X_r_test_scaled)

print('RMSE con alpha(0.1) =', np.sqrt(metrics.mean_squared_error(y_r_test['quality'], y_r_pred_linreg_l2_01)))
print('RMSE con alpha(1) =', np.sqrt(metrics.mean_squared_error(y_r_test['quality'], y_r_pred_linreg_l2_1)))

RMSE con alpha(0.1) = 0.6665471357210262
RMSE con alpha(1) = 0.6664172188918969


### Tipo White

In [42]:
linreg_l2_01.fit(X_w_train_scaled, y_w_train['quality'])
linreg_l2_1.fit(X_w_train_scaled, y_w_train['quality'])

coef_ridgereg_01_w = pd.DataFrame(linreg_l2_01.coef_, columns=['Beta'])
coef_ridgereg_01_w.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
Beta,0.065738,-0.174215,0.007382,0.42913,-0.005396,0.076642,-0.029261,-0.466615,0.124352,0.07913,0.230208


In [43]:
comp_ridgereg_01_linreg_w = pd.DataFrame(coef_linreg_w >= coef_ridgereg_01_w)
comp_ridgereg_01_linreg_w.Beta.value_counts()

True     8
False    3
Name: Beta, dtype: int64

In [44]:
coef_ridgereg_1_w = pd.DataFrame(linreg_l2_1.coef_, columns=['Beta'])
coef_ridgereg_1_w.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
Beta,0.064762,-0.174227,0.007349,0.426242,-0.005606,0.076799,-0.02944,-0.462418,0.123553,0.078886,0.231963


In [45]:
comp_ridgereg_1_linreg_w = pd.DataFrame(coef_linreg_w >= coef_ridgereg_1_w)
comp_ridgereg_1_linreg_w.Beta.value_counts()

True     8
False    3
Name: Beta, dtype: int64

In [46]:
y_w_pred_linreg_l2_01 = linreg_l2_01.predict(X_w_test_scaled)
y_w_pred_linreg_l2_1 = linreg_l2_1.predict(X_w_test_scaled)

print('RMSE con alpha(0.1) =', np.sqrt(metrics.mean_squared_error(y_w_test['quality'], y_w_pred_linreg_l2_01)))
print('RMSE con alpha(1) =', np.sqrt(metrics.mean_squared_error(y_w_test['quality'], y_w_pred_linreg_l2_1)))

RMSE con alpha(0.1) = 0.7827941851494262
RMSE con alpha(1) = 0.7827801696309228


Encontramos que para los dos tipos de vino la mayoría de coeficientes son menores a los que se presentan con la regresión lineal, lo que hace que el intercepto tenga aún más fuerza en el análisis y siga sesgando el mismo.

# Exercise 6.8

* Estimate a lasso regression with alpha equals 0.01, 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [47]:
from sklearn.linear_model import Lasso
linreg_l1_001 = Lasso(alpha=0.01)
linreg_l1_01 = Lasso(alpha=0.1)
linreg_l1_1 = Lasso(alpha=1)

### Tipo Red

In [48]:
linreg_l1_001.fit(X_r_train_scaled, y_r_train['quality'])
linreg_l1_01.fit(X_r_train_scaled, y_r_train['quality'])
linreg_l1_1.fit(X_r_train_scaled, y_r_train['quality'])

coef_lasso_001_r = pd.DataFrame(linreg_l1_001.coef_, columns=['Beta'])
coef_lasso_001_r.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
Beta,0.0,-0.100217,-0.0,0.0,-0.032497,0.00835,-0.110644,0.0,-0.057632,0.092902,0.347792


In [49]:
comp_lasso_001_linreg_r = pd.DataFrame(coef_linreg_r >= coef_lasso_001_r)
comp_lasso_001_linreg_r.Beta.value_counts()

False    6
True     5
Name: Beta, dtype: int64

In [50]:
coef_lasso_01_r = pd.DataFrame(linreg_l1_01.coef_, columns=['Beta'])
coef_lasso_01_r.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
Beta,0.005032,-0.101742,0.0,0.0,-0.00474,-0.0,-0.0,-0.0,-0.0,0.047711,0.257034


In [51]:
comp_lasso_01_linreg_r = pd.DataFrame(coef_linreg_r >= coef_lasso_01_r)
comp_lasso_01_linreg_r.Beta.value_counts()

False    6
True     5
Name: Beta, dtype: int64

In [52]:
coef_lasso_1_r = pd.DataFrame(linreg_l1_1.coef_, columns=['Beta'])
coef_lasso_1_r.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
Beta,0.0,-0.0,0.0,0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0


In [53]:
comp_lasso_1_linreg_r = pd.DataFrame(coef_linreg_r >= coef_lasso_1_r)
comp_lasso_1_linreg_r.Beta.value_counts()

False    6
True     5
Name: Beta, dtype: int64

In [54]:
y_r_pred_linreg_l1_001 = linreg_l1_001.predict(X_r_test_scaled)
y_r_pred_linreg_l1_01 = linreg_l1_01.predict(X_r_test_scaled)
y_r_pred_linreg_l1_1 = linreg_l1_1.predict(X_r_test_scaled)

print('RMSE con alpha(0.01) =', np.sqrt(metrics.mean_squared_error(y_r_test['quality'], y_r_pred_linreg_l1_001)))
print('RMSE con alpha(0.1) =', np.sqrt(metrics.mean_squared_error(y_r_test['quality'], y_r_pred_linreg_l1_01)))
print('RMSE con alpha(1) =', np.sqrt(metrics.mean_squared_error(y_r_test['quality'], y_r_pred_linreg_l1_1)))

RMSE con alpha(0.01) = 0.6649019129040484
RMSE con alpha(0.1) = 0.6834522262176693
RMSE con alpha(1) = 0.8280619786294363


### Tipo White

In [55]:
linreg_l1_001.fit(X_w_train_scaled, y_w_train['quality'])
linreg_l1_01.fit(X_w_train_scaled, y_w_train['quality'])
linreg_l1_1.fit(X_w_train_scaled, y_w_train['quality'])

coef_lasso_001_w = pd.DataFrame(linreg_l1_001.coef_, columns=['Beta'])
coef_lasso_001_w.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
Beta,-0.0,-0.173391,0.0,0.217835,-0.012763,0.066323,-0.016237,-0.172053,0.061155,0.052198,0.353043


In [56]:
comp_lasso_001_linreg_w = pd.DataFrame(coef_linreg_w >= coef_lasso_001_w)
comp_lasso_001_linreg_w.Beta.value_counts()

True     7
False    4
Name: Beta, dtype: int64

In [57]:
coef_lasso_01_w = pd.DataFrame(linreg_l1_01.coef_, columns=['Beta'])
coef_lasso_01_w.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
Beta,-0.0,-0.079922,0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,0.0,0.295375


In [58]:
comp_lasso_01_linreg_w = pd.DataFrame(coef_linreg_w >= coef_lasso_01_w)
comp_lasso_01_linreg_w.Beta.value_counts()

True     6
False    5
Name: Beta, dtype: int64

In [59]:
coef_lasso_1_w = pd.DataFrame(linreg_l1_1.coef_, columns=['Beta'])
coef_lasso_1_w.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
Beta,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,0.0,0.0,0.0


In [60]:
comp_lasso_1_linreg_w = pd.DataFrame(coef_linreg_w >= coef_lasso_1_w)
comp_lasso_1_linreg_w.Beta.value_counts()

True     7
False    4
Name: Beta, dtype: int64

In [61]:
y_w_pred_linreg_l1_001 = linreg_l1_001.predict(X_w_test_scaled)
y_w_pred_linreg_l1_01 = linreg_l1_01.predict(X_w_test_scaled)
y_w_pred_linreg_l1_1 = linreg_l1_1.predict(X_w_test_scaled)

print('RMSE con alpha(0.01) =', np.sqrt(metrics.mean_squared_error(y_w_test['quality'], y_w_pred_linreg_l1_001)))
print('RMSE con alpha(0.1) =', np.sqrt(metrics.mean_squared_error(y_w_test['quality'], y_w_pred_linreg_l1_01)))
print('RMSE con alpha(1) =', np.sqrt(metrics.mean_squared_error(y_w_test['quality'], y_w_pred_linreg_l1_1)))

RMSE con alpha(0.01) = 0.782051406408641
RMSE con alpha(0.1) = 0.811928779315237
RMSE con alpha(1) = 0.9072707333821783


Al igual que con la regresión tipo Ridge, encontramos que para los dos tipos de vino la mayoría de coeficientes son menores a los que se presentan con la regresión lineal, lo que hace que el intercepto tenga aún más fuerza en el análisis y siga sesgando el mismo.

# Exercise 6.9

* Create a binary target

* Train a logistic regression to predict wine quality (binary)

* Analyze the coefficients

* Evaluate the f1score

In [62]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9, solver='liblinear', multi_class='ovr')

### Tipo Red

In [63]:
logreg.fit(X_r_train_scaled, y_r_train['quality_dummy'])

coef_logreg_r = pd.DataFrame(logreg.coef_, index=['Beta'])
coef_logreg_r

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
Beta,0.138811,-0.278299,0.100696,0.847051,-0.183551,0.23218,-0.739228,-0.431471,-0.003492,0.365485,0.979307


In [64]:
y_r_pred_logreg = logreg.predict(X_r_test_scaled)

from sklearn.metrics import f1_score

f1_score(y_r_test['quality_dummy'], y_r_pred_logreg)

0.4761904761904762

### Tipo White

In [65]:
logreg.fit(X_w_train_scaled, y_w_train['quality_dummy'])

coef_logreg_w = pd.DataFrame(logreg.coef_, index=['Beta'])
coef_logreg_w

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
Beta,0.513332,-0.339762,-0.054728,1.65522,-0.252555,0.171852,-0.035274,-2.264085,0.555222,0.277077,0.030405


In [66]:
y_w_pred_logreg = logreg.predict(X_w_test_scaled)

f1_score(y_w_test['quality_dummy'], y_w_pred_logreg)

0.4191919191919192

Para los dos tipos de vino encontramos un F1_Score inferior a 0.50, lo que nos dice que ni siquiera la mitad de las predicciones desarrolladas por medio de estos modelos serían acertadas; siendo mejor el modelo de los vinos tipo White

# Exercise 6.10

* Estimate a regularized logistic regression using:
* C = 0.01, 0.1 & 1.0
* penalty = ['l1, 'l2']
* Compare the coefficients and the f1score

In [67]:
score = []
def logreg_fun(X_train, y_train, X_test, y_test):
    for p in ['l1', 'l2']:
        for c in [0.01, 0.1, 1.0]:
            logreg = LogisticRegression(C=c, penalty=p, solver='liblinear', multi_class='ovr')
            logreg.fit(X_train, y_train)
            y_pred_logreg = logreg.predict(X_test)
            score.append([p, c, f1_score(y_test, y_pred_logreg)])
# score = pd.DataFrame(score, columns=['Penalty', 'C', 'F1_Score'])
# print(score.loc[score['F1_Score'].idxmax()])

### Tipo Red

In [68]:
score = []
logreg_fun(X_r_train_scaled, y_r_train['quality_dummy'], X_r_test_scaled, y_r_test['quality_dummy'])
score = pd.DataFrame(score, columns=['Penalty', 'C', 'F1_Score'])
print(score.loc[score['F1_Score'].idxmax()])

Penalty           l2
C                0.1
F1_Score    0.494118
Name: 4, dtype: object


### Tipo White

In [69]:
score = []
logreg_fun(X_w_train_scaled, y_w_train['quality_dummy'], X_w_test_scaled, y_w_test['quality_dummy'])
score = pd.DataFrame(score, columns=['Penalty', 'C', 'F1_Score'])
print(score.loc[score['F1_Score'].idxmax()])

Penalty           l1
C                  1
F1_Score    0.412214
Name: 2, dtype: object


Ajustando las regresiones logísticas con los penalties encontramos un mejor desempeño del modelo para el tipo de vino Red, donde su F1_Score asciende a 0.45; sigue siendo inferior a 0.50 pero muy superior al 0.37 que obteniamos sin los penalties. El desempeño del modelo para vino tipo White desmejora