# Exercise 6

## SVM & Regularization


For this homework we consider a set of observations on a number of red and white wine varieties involving their chemical properties and ranking by tasters. Wine industry shows a recent growth spurt as social drinking is on the rise. The price of wine depends on a rather abstract concept of wine appreciation by wine tasters, opinion among whom may have a high degree of variability. Pricing of wine depends on such a volatile factor to some extent. Another key factor in wine certification and quality assessment is physicochemical tests which are laboratory-based and takes into account factors like acidity, pH level, presence of sugar and other chemical properties. For the wine market, it would be of interest if human quality of tasting can be related to the chemical properties of wine so that certification and quality assessment and assurance process is more controlled.

Two datasets are available of which one dataset is on red wine and have 1599 different varieties and the other is on white wine and have 4898 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 different properties of the wines one of which is Quality, based on sensory data, and the rest are on chemical properties of the wines including density, acidity, alcohol content etc. All chemical properties of wines are continuous variables. Quality is an ordinal variable with possible ranking from 1 (worst) to 10 (best). Each variety of wine is tasted by three independent tasters and the final rank assigned is the median rank given by the tasters.

A predictive model developed on this data is expected to provide guidance to vineyards regarding quality and price expected on their produce without heavy reliance on volatility of wine tasters.

In [175]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [176]:
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')

In [177]:
data = data_w.assign(type = 'white')

data = data.append(data_r.assign(type = 'red'), ignore_index=True)
data.sample(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
480,6.7,0.47,0.34,8.9,0.043,31.0,172.0,0.9964,3.22,0.6,9.2,5,white
567,6.0,0.26,0.5,2.2,0.048,59.0,153.0,0.9928,3.08,0.61,9.8,5,white
1634,8.3,0.25,0.49,16.8,0.048,50.0,228.0,1.0001,3.03,0.52,9.2,6,white
5749,9.3,0.43,0.44,1.9,0.085,9.0,22.0,0.99708,3.28,0.55,9.5,5,red
4675,5.7,0.21,0.37,4.5,0.04,58.0,140.0,0.99332,3.29,0.62,10.6,6,white


# Exercise 6.1

Show the frecuency table of the quality by type of wine

In [178]:
data[['quality','type']].groupby(['type','quality'])['type'].count().unstack()

quality,3,4,5,6,7,8,9
type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
red,10.0,53.0,681.0,638.0,199.0,18.0,
white,20.0,163.0,1457.0,2198.0,880.0,175.0,5.0


# SVM

# Exercise 6.2

* Standarized the features (not the quality)
* Create a binary target for each type of wine
* Create two Linear SVM's for the white and red wines, repectively.


In [179]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC  # "Support Vector Classifier"
import numpy as np

#Estandarizar el DF
scaler=StandardScaler()
data_scaled=scaler.fit_transform(data.drop(columns=['quality','type']))
data_scaled=pd.DataFrame(data_scaled, columns=data.drop(columns=['quality','type']).columns)
data_scaled[['quality','type']]=data[['quality','type']]
df=data_scaled

#Crear variables binarias para cada tipo de vino
data_scaled.loc[data_scaled.quality <= 5, 'quality'] = 0
data_scaled.loc[data_scaled.quality > 5, 'quality'] = 1


#Dividir en 2 DF
wwine=data_scaled[data_scaled['type']=='white']

rwine=data_scaled[data_scaled['type']!='white']

In [180]:
#MODELO WHITE WINE
#Train and test Data

#Definir variables
x=wwine.drop(columns=['quality','type'])
y=wwine.quality

#Partición de sets
X1_train, X1_test, y1_train, y1_test = train_test_split(x, y, test_size=0.3)


#Crear los SVM lineales, uno para cada tipo de vino
clf1 = SVC(kernel='linear')
clf1.fit(X1_train, y1_train)

clf1.score(X1_test,y1_test)

0.754421768707483

In [181]:
#MODELO RED WINE

#Definir variables
x=rwine.drop(columns=['quality','type'])
y=rwine.quality

#Partición de sets
X2_train, X2_test, y2_train, y2_test = train_test_split(x, y, test_size=0.3)

#Crear los SVM lineales, uno para cada tipo de vino
clf = SVC(kernel='linear')
clf.fit(X2_train, y2_train)

clf.score(X2_test, y2_test)

0.74375

# Exercise 6.3

Test the two SVM's using the different kernels (‘poly’, ‘rbf’, ‘sigmoid’)


In [182]:
#MODELO WHITE WINE

#Crear los Kernels
wclf_poly = SVC(kernel='poly')
wclf_poly.fit(X1_train, y1_train)

wclf_rbf = SVC(kernel='rbf')
wclf_rbf.fit(X1_train, y1_train)

wclf_sigmoid = SVC(kernel='sigmoid')
wclf_sigmoid.fit(X1_train, y1_train)

print('Kernel Poly score',str(round(wclf_poly.score(X1_test, y1_test),2)),'\n'
      'Kernel rbf score',str(round(wclf_rbf.score(X1_test, y1_test),2)),'\n'
      'Kernel sigmoid score',str(round(wclf_sigmoid.score(X1_test, y1_test),2)))

Kernel Poly score 0.74 
Kernel rbf score 0.78 
Kernel sigmoid score 0.67


In [183]:
#MODELO RED WINE

#Crear los Kernels
rclf_poly = SVC(kernel='poly')
rclf_poly.fit(X2_train, y2_train)

rclf_rbf = SVC(kernel='rbf')
rclf_rbf.fit(X2_train, y2_train)

rclf_sigmoid = SVC(kernel='sigmoid')
rclf_sigmoid.fit(X2_train, y2_train)

print('Kernel Poly score',str(round(rclf_poly.score(X2_test, y2_test),2)),'\n'
      'Kernel rbf score',str(round(rclf_rbf.score(X2_test, y2_test),3)),'\n'
      'Kernel sigmoid score',str(round(rclf_sigmoid.score(X2_test, y2_test),2)))

Kernel Poly score 0.74 
Kernel rbf score 0.74 
Kernel sigmoid score 0.64


# Exercise 6.4
Using the best SVM find the parameters that gives the best performance

'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]

In [184]:
def bestsvc(tvino,x_train,y_train,x_test,y_test,c,g):
    score=0
    for i in c:
        for j in g:
            clf=SVC(kernel='rbf',C=i,gamma=j).fit(x_train,y_train)
            if clf.score(x_test, y_test)>score:
                score=clf.score(x_test, y_test)
                C=i
                gamma=j
    print('Vino',tvino,'\nScore:',round(score,2),'\nC:',C,'\ngamma:',gamma,'\n')

In [185]:
C=[0.1, 1, 10, 100, 1000]
gamma=[0.01, 0.001, 0.0001]
bestsvc('White',X1_train,y1_train,X1_test,y1_test,C,gamma)
bestsvc('Red',X2_train,y2_train,X2_test,y2_test,C,gamma)

Vino White 
Score: 0.77 
C: 100 
gamma: 0.01 

Vino Red 
Score: 0.74 
C: 1000 
gamma: 0.01 



# Exercise 6.5

Compare the results with other methods

In [186]:
from sklearn.linear_model import LogisticRegression
def logit(tvino,x_train,y_train,x_test,y_test):
    logreg = LogisticRegression(C=1e9,solver='liblinear')
    logreg.fit(x_train,y_train)
    score=logreg.score(x_test,y_test)
    print('Vino',tvino,'\nScore:',round(score,2),'\n')

In [187]:
logit('White',X1_train,y1_train,X1_test,y1_test)
logit('Red',X2_train,y2_train,X2_test,y2_test)

Vino White 
Score: 0.76 

Vino Red 
Score: 0.74 



# Regularization

# Exercise 6.6


* Train a linear regression to predict wine quality (Continous)

* Analyze the coefficients

* Evaluate the RMSE

In [188]:
from sklearn.linear_model import LinearRegression

#Data frame estandarizado con variable dependiente continua
df=data_scaled
df['quality']=data['quality']

#Tiwpo de vino a dummies
df['type']=pd.get_dummies(df['type'], drop_first=True)

#Definir Variables
x=df.drop(columns=['quality'])
y=df['quality']

#Sets de entrenamiento y prueba
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

#Modelo
linearm=LinearRegression().fit(X_train,y_train)

In [189]:
linearm.score(X_test,y_test)

0.2955125586191577

In [190]:
#Obtener los coeficientes
for i in range(0,len(linearm.coef_)):
    print('B'+str(i)+'('+str(df.drop(columns=['quality']).columns[i])+')=',round(linearm.coef_[i],2))

B0(fixed acidity)= 0.11
B1(volatile acidity)= -0.25
B2(citric acid)= -0.01
B3(residual sugar)= 0.32
B4(chlorides)= -0.02
B5(free sulfur dioxide)= 0.08
B6(total sulfur dioxide)= -0.08
B7(density)= -0.34
B8(pH)= 0.08
B9(sulphates)= 0.09
B10(alcohol)= 0.26
B11(type)= -0.42


In [191]:
from sklearn.metrics import mean_squared_error
#Raiz del error cuadratico medio
print('RMSE:',round(mean_squared_error(y_test,linearm.predict(X_test)),2))

RMSE: 0.53


# Exercise 6.7

* Estimate a ridge regression with alpha equals 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [192]:
from sklearn.linear_model import Ridge
from sklearn import metrics
#Ridge con alfa=1
ridgereg1 = Ridge(alpha=1, normalize=True)
ridgereg1.fit(X_train, y_train)
y1_pred = ridgereg1.predict(X_test)

In [193]:
#Accuracy
ridgereg1.score(X_test,y_test)

0.21933383365792058

In [194]:
#Ridge con alfa=0.1
ridgereg2 = Ridge(alpha=0.1, normalize=True)
ridgereg2.fit(X_train, y_train)
y2_pred = ridgereg2.predict(X_test)

In [195]:
#Accuracy
ridgereg2.score(X_test,y_test)

0.2917384733887014

In [196]:
#Comparación de coeficientes
df = pd.DataFrame({'Variables': x.columns,'Coef lineal':linearm.coef_,
                   'Coef Ridge 1':ridgereg1.coef_,'Coef Ridge 0.1':ridgereg2.coef_})
df

Unnamed: 0,Variables,Coef lineal,Coef Ridge 1,Coef Ridge 0.1
0,fixed acidity,0.11061,-0.002488,0.026996
1,volatile acidity,-0.249387,-0.09608,-0.208207
2,citric acid,-0.009999,0.021666,0.002981
3,residual sugar,0.315712,0.027374,0.144277
4,chlorides,-0.021667,-0.042864,-0.036808
5,free sulfur dioxide,0.084324,0.02907,0.078584
6,total sulfur dioxide,-0.076798,-0.035582,-0.084715
7,density,-0.336798,-0.070853,-0.120305
8,pH,0.079879,0.013269,0.032666
9,sulphates,0.092677,0.039228,0.078051


In [197]:
#Evaluación de los 2 métodos
print('Alfa 1\nRMSE',round(np.sqrt(metrics.mean_squared_error(y_test, y1_pred)),2),'\n')
print('Alfa 0.1\nRMSE',round(np.sqrt(metrics.mean_squared_error(y_test, y2_pred)),2),'\n')

Alfa 1
RMSE 0.77 

Alfa 0.1
RMSE 0.73 



# Exercise 6.8

* Estimate a lasso regression with alpha equals 0.01, 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [198]:
from sklearn.linear_model import Lasso
#Lasso alfa 0.01
lassoreg1 = Lasso(alpha=0.01, normalize=True)
lassoreg1.fit(X_train, y_train)
yl1_pred = lassoreg1.predict(X_test)

In [199]:
#Lasso alfa 0.1
lassoreg2 = Lasso(alpha=0.1, normalize=True)
lassoreg2.fit(X_train, y_train)
yl2_pred = lassoreg2.predict(X_test)

In [200]:
#Lasso alfa 1
lassoreg3 = Lasso(alpha=1, normalize=True)
lassoreg3.fit(X_train, y_train)
yl3_pred = lassoreg3.predict(X_test)

In [201]:
#Comparación de coeficientes
df = pd.DataFrame({'Variables': x.columns,'Coef lineal':linearm.coef_,
                   'Coef Lasso 0.001':lassoreg1.coef_,'Coef Lasso 0.1':lassoreg2.coef_,'Coef Lasso 1':lassoreg3.coef_})
df

Unnamed: 0,Variables,Coef lineal,Coef Lasso 0.001,Coef Lasso 0.1,Coef Lasso 1
0,fixed acidity,0.11061,-0.0,-0.0,-0.0
1,volatile acidity,-0.249387,-0.0,-0.0,-0.0
2,citric acid,-0.009999,0.0,0.0,0.0
3,residual sugar,0.315712,-0.0,-0.0,-0.0
4,chlorides,-0.021667,-0.0,-0.0,-0.0
5,free sulfur dioxide,0.084324,0.0,0.0,0.0
6,total sulfur dioxide,-0.076798,-0.0,-0.0,-0.0
7,density,-0.336798,-0.0,-0.0,-0.0
8,pH,0.079879,0.0,0.0,0.0
9,sulphates,0.092677,0.0,0.0,0.0


# Exercise 6.9

* Create a binary target

* Train a logistic regression to predict wine quality (binary)

* Analyze the coefficients

* Evaluate the f1score

In [230]:
from sklearn.linear_model import LogisticRegression
#Renombrar el df, traer el df con categoría binaria
df=data_scaled.copy()

#Crear variables binarias 
df.loc[df.quality <= 5, 'quality'] = 0
df.loc[df.quality > 5, 'quality'] = 1

#Definir Variables
x=df.drop(columns=['quality'])
y=df['quality']

#Sets de entrenamiento y prueba
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

#Regresión logística
logreg = LogisticRegression(C=1e9,multi_class='multinomial', solver='newton-cg').fit(X_train,y_train)

In [231]:
logreg.coef_

array([[ 0.04509411, -0.38669205, -0.03726443,  0.22531822, -0.03973588,
         0.15599361, -0.16539878, -0.1434953 ,  0.04415876,  0.14941317,
         0.52269807, -0.28383609]])

In [248]:
from sklearn.metrics import f1_score
f1_score(y_test,logreg.predict(X_test))

0.7913894324853229

# Exercise 6.10

* Estimate a regularized logistic regression using:
* C = 0.01, 0.1 & 1.0
* penalty = ['l1, 'l2']
* Compare the coefficients and the f1score

In [249]:
#Estimar Modelos
logreg1 = LogisticRegression(C=0.01, penalty='l1',solver='saga',multi_class='multinomial').fit(X_train, y_train)
logreg2 = LogisticRegression(C=0.01, penalty='l2',solver='saga',multi_class='multinomial').fit(X_train, y_train)
logreg3 = LogisticRegression(C=0.1, penalty='l1',solver='saga',multi_class='multinomial').fit(X_train, y_train)
logreg4 = LogisticRegression(C=0.1, penalty='l2',solver='saga',multi_class='multinomial').fit(X_train, y_train)
logreg5 = LogisticRegression(C=1, penalty='l1',solver='saga',multi_class='multinomial').fit(X_train, y_train)
logreg6 = LogisticRegression(C=1, penalty='l2',solver='saga',multi_class='multinomial').fit(X_train, y_train)

In [259]:
#Comparar Coeficientes
coef = pd.DataFrame({'Variables': x.columns,'Coef C=0.01 P=l1':logreg1.coef_[0],
                   'Coef C=0.01 P=l2':logreg2.coef_[0],'Coef C=0.1 P=l1':logreg3.coef_[0],
                   'Coef C=0.1 P=l2':logreg4.coef_[0],'Coef C=1 P=l1':logreg5.coef_[0],
                   'Coef C=1 P=l2':logreg6.coef_[0]})
coef

Unnamed: 0,Variables,Coef C=0.01 P=l1,Coef C=0.01 P=l2,Coef C=0.1 P=l1,Coef C=0.1 P=l2,Coef C=1 P=l1,Coef C=1 P=l2
0,fixed acidity,0.0,0.033596,0.0,0.041145,0.034749,0.044485
1,volatile acidity,-0.242936,-0.316791,-0.352272,-0.374971,-0.384011,-0.385354
2,citric acid,0.0,-0.021946,-0.029943,-0.036177,-0.036608,-0.0372
3,residual sugar,0.010934,0.158734,0.119521,0.207031,0.205672,0.222814
4,chlorides,0.0,-0.045065,-0.03185,-0.039811,-0.039438,-0.039719
5,free sulfur dioxide,0.0,0.134209,0.146564,0.155675,0.15558,0.156087
6,total sulfur dioxide,0.0,-0.164375,-0.178718,-0.171863,-0.167586,-0.166392
7,density,0.0,-0.095038,0.0,-0.122025,-0.114872,-0.140329
8,pH,0.0,0.041087,0.017541,0.042433,0.03817,0.043865
9,sulphates,0.041948,0.139737,0.137245,0.148945,0.147249,0.149379


In [270]:
#Comparar f1 score
print('Coef C=0.01 P=l1:',round(f1_score(y_test,logreg1.predict(X_test)),4),
                   '\nCoef C=0.01 P=l2:',round(f1_score(y_test,logreg1.predict(X_test)),4),
                   '\nCoef C=0.1 P=l1:',round(f1_score(y_test,logreg2.predict(X_test)),4),
                   '\nCoef C=0.1 P=l2:',round(f1_score(y_test,logreg3.predict(X_test)),4),
                   '\nCoef C=1 P=l1:',round(f1_score(y_test,logreg4.predict(X_test)),4),
                   '\nCoef C=1 P=l2:',round(f1_score(y_test,logreg5.predict(X_test)),4))

Coef C=0.01 P=l1: 0.7869 
Coef C=0.01 P=l2: 0.7869 
Coef C=0.1 P=l1: 0.7866 
Coef C=0.1 P=l2: 0.7879 
Coef C=1 P=l1: 0.7867 
Coef C=1 P=l2: 0.7903
