Problem Statement: Classifying Wine based on Flavanoids

- Linear regression is poorly suited to classification tasks, which require categorical outputs. **Logistic regression** is used instead. It computes the same linear combination as linear regression, then squashes the result into the range (0, 1) with the **sigmoid function**.

    
        Formula: 1/(1 + e^-z), z = mx + b

- Much of logistic regression mirrors linear regression. The only difference is that the linear output is passed through the sigmoid so that it can be read as the probability of one of two classes. This makes it well suited to **binary classification**.
- Gradient descent also applies: a loss function can be defined for logistic regression and minimized in the same way.

        Loss Function: J(m, b) = -(1/n) Σ_{i=1}^{n} [y_i * log(h(x_i)) + (1 - y_i) * log(1 - h(x_i))], where h(x) = sigmoid(mx + b) is the model's predicted probability of class 1.
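
As a minimal sketch, the sigmoid and the loss can be written in plain Python. The function names `sigmoid` and `log_loss` and the toy data are assumptions made for illustration, and the loss is averaged over the samples.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(m, b, xs, ys):
    """Average binary cross-entropy of the model h(x) = sigmoid(m*x + b)."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(m * x + b)
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / len(xs)

# Toy data (hypothetical): small x values belong to class 1, large ones to class 0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1, 1, 0, 0]

untrained = log_loss(0.0, 0.0, xs, ys)  # every prediction is 0.5
better = log_loss(-2.0, 5.0, xs, ys)    # decision boundary at x = 2.5
print(untrained, better)
```

With m = 0 and b = 0 every prediction is 0.5, so the loss equals log(2), about 0.693. Parameters that separate the two classes give a lower value.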

In a way, logistic regression is a simple perceptron model: it takes a set of features and a matching set of weights, computes the weighted sum (mx + b in the one-feature case), and passes the result through the sigmoid function to obtain a probability.
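
The perceptron view can be sketched for any number of features. The name `neuron` and the example weights are hypothetical, chosen only to illustrate the weighted sum followed by the sigmoid.

```python
import math

def neuron(features, weights, bias):
    """Weighted sum of the features plus a bias, passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# With a single feature this reduces to sigmoid(m*x + b).
p_single = neuron([2.0], [0.5], -1.0)            # z = 0.5*2.0 - 1.0 = 0
p_multi = neuron([1.0, 3.0], [0.2, -0.4], 0.5)   # z = 0.2 - 1.2 + 0.5 = -0.5
print(p_single, p_multi)
```

A weighted sum of zero gives a probability of exactly 0.5, and a negative sum gives a probability below 0.5.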

Stochastic gradient descent can be applied to logistic regression just as it is to linear regression. The per-sample update rule is as follows.

      m = m - l(h(x_i) - y_i)x_i; l is the learning rate, h(x) is the sigmoid prediction.
      b = b - l(h(x_i) - y_i)
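
The gradient of the per-sample loss with respect to m is (h(x_i) - y_i) x_i, and with respect to b it is h(x_i) - y_i; these are the quantities the training loop further down subtracts. The sketch below checks both against finite differences. The parameter values and the single sample are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_loss(m, b, x, y):
    """Cross-entropy loss for a single sample."""
    h = sigmoid(m * x + b)
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

# Hypothetical parameter values and one hypothetical sample.
m, b, x, y = 0.3, -0.2, 2.5, 1
lr = 0.001  # learning rate

# Analytic gradients: (h - y) * x for m, and (h - y) for b.
h = sigmoid(m * x + b)
grad_m = (h - y) * x
grad_b = h - y

# Central finite differences as an independent check.
eps = 1e-6
num_grad_m = (sample_loss(m + eps, b, x, y) - sample_loss(m - eps, b, x, y)) / (2 * eps)
num_grad_b = (sample_loss(m, b + eps, x, y) - sample_loss(m, b - eps, x, y)) / (2 * eps)
print(grad_m, num_grad_m)
print(grad_b, num_grad_b)

# One descent step moves against the gradient and lowers this sample's loss.
new_m = m - lr * grad_m
new_b = b - lr * grad_b
print(sample_loss(m, b, x, y), sample_loss(new_m, new_b, x, y))
```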

Let's apply this to a small intuition-building project: classifying wine based on flavanoids.

In [None]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.datasets import load_wine #Load dataset for intuition problem.
wine = load_wine()
X = wine.data
pd.DataFrame(X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,14.23,1.71,2.43,15.6,127.0,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.20,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.50,16.8,113.0,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,13.71,5.65,2.45,20.5,95.0,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740.0
174,13.40,3.91,2.48,23.0,102.0,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750.0
175,13.27,4.28,2.26,20.0,120.0,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835.0
176,13.17,2.59,2.37,20.0,120.0,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840.0


In [None]:
wine.feature_names # Find the index of 'flavanoids' so that it is the only feature selected

['alcohol',
 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280/od315_of_diluted_wines',
 'proline']

In [None]:
#Now to make sure that we only perform binary classification from flavanoids [Index 6 in the features]
X = wine.data[:, 6]
Y = wine.target
print(X.shape, Y.shape)

(178,) (178,)


In [None]:
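#Stack labels and flavanoids into a two-column dataframe: column 0 is the class label, column 1 is flavanoids
z = pd.DataFrame(np.vstack((Y, X)).T)
#Keep only classes 0 and 1 so that the problem is binary
tz = z[(z[0] == 0) | (z[0] == 1)]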
tz

Unnamed: 0,0,1
0,0.0,3.06
1,0.0,2.76
2,0.0,3.24
3,0.0,3.49
4,0.0,2.69
...,...,...
125,1.0,2.65
126,1.0,3.15
127,1.0,2.24
128,1.0,2.45


In [None]:
Y = np.array(tz[0])
X = np.array(tz[1])
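
The same filtering can also be done without a dataframe by using a NumPy boolean mask. This is only a sketch of an alternative: the arrays `labels` and `flavanoids` below are small hypothetical stand-ins for `wine.target` and the flavanoid column.

```python
import numpy as np

# Hypothetical stand-ins for wine.target and the flavanoid column.
labels = np.array([0, 0, 1, 2, 1, 2])
flavanoids = np.array([3.06, 2.76, 0.57, 0.61, 1.09, 0.75])

# Boolean mask that is True for samples in class 0 or class 1.
mask = np.isin(labels, [0, 1])

X_bin = flavanoids[mask]
Y_bin = labels[mask]
print(X_bin, Y_bin)
```

Applying one mask to both arrays keeps each flavanoid value aligned with its label.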

In [None]:
X, Y

(array([3.06, 2.76, 3.24, 3.49, 2.69, 3.39, 2.52, 2.51, 2.98, 3.15, 3.32,
        2.43, 2.76, 3.69, 3.64, 2.91, 3.14, 3.4 , 3.93, 3.03, 3.17, 2.41,
        2.88, 2.37, 2.61, 2.68, 2.94, 2.19, 2.97, 2.33, 3.25, 3.19, 2.69,
        2.74, 2.53, 2.98, 2.68, 2.43, 2.64, 3.04, 3.29, 2.68, 3.56, 2.63,
        3.  , 2.65, 3.17, 3.39, 2.92, 3.54, 3.27, 2.99, 3.74, 2.79, 2.9 ,
        2.78, 3.  , 3.23, 3.67, 0.57, 1.09, 1.41, 1.79, 3.1 , 1.75, 2.65,
        3.18, 2.  , 1.3 , 1.28, 1.02, 2.86, 1.84, 2.89, 2.14, 1.57, 2.03,
        1.32, 1.85, 2.55, 2.26, 2.53, 1.58, 1.59, 2.21, 1.94, 1.69, 1.61,
        1.69, 1.59, 1.5 , 1.25, 1.46, 2.25, 2.26, 2.27, 0.99, 2.5 , 3.75,
        2.99, 2.17, 1.36, 2.11, 1.64, 1.92, 1.84, 2.03, 1.76, 2.04, 2.92,
        2.58, 2.27, 2.03, 2.01, 2.29, 2.17, 1.6 , 2.09, 1.25, 1.64, 2.79,
        5.08, 2.13, 2.65, 3.03, 2.65, 3.15, 2.24, 2.45, 1.75]),
 array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0.

In [None]:
#Define m, b, and n similarly to linear regression
m = 0
b = 0
n = len(Y)
LR = 0.001
epochs = 300
accuracy = 0 #We can evaluate accuracy of classification now since it is categorical

def sigmoid(z):
  sig = 1/(1 + np.exp(-z))
  return sig

#Stochastic gradient descent: m and b are updated after every sample
for epoch in range(1,epochs+1):
    for i in range(len(X)):
        error = sigmoid((m * X[i]) + b) - Y[i] #prediction minus label
        gr_wrt_m = X[i] * error
        gr_wrt_b = error
        m = m - LR * gr_wrt_m
        b = b - LR * gr_wrt_b

print("Optimal values of m and b are", m, b)

Optimal values of m and b are -0.8295583744974768 2.1891794437591803
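
The learned parameters imply a decision boundary. Since sigmoid(z) equals 0.5 exactly when z = 0, the prediction flips at x = -b/m. The sketch below works this out from the values printed above; nothing is retrained here.

```python
import math

# Learned parameters copied from the training output.
m = -0.8295583744974768
b = 2.1891794437591803

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# sigmoid(z) = 0.5 exactly when z = m*x + b = 0, so the boundary is x = -b/m.
boundary = -b / m
print(boundary)

# Because m is negative, flavanoid values below the boundary map to class 1.
print(sigmoid(m * 2.0 + b) >= 0.5)  # below the boundary
print(sigmoid(m * 3.0 + b) >= 0.5)  # above the boundary
```

The boundary sits at a flavanoid value of about 2.64: samples below it are predicted as class 1 and samples above it as class 0.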


In [None]:
predictions = []
accuracy = 0 #Reset so that re-running this cell does not double count
for i in range(len(X)):
    z = (m * X[i]) + b
    y_pred = sigmoid(z)
    #If the predicted probability of class 1 is at least 50%, predict class 1; otherwise predict class 0.
    if y_pred>=0.5:
        predictions.append(1)
    else:
        predictions.append(0)

# 'predictions' now holds the predicted class label for every sample, using the learned 'm' and 'b'

for k in range(len(X)):
    if predictions[k] == int(Y[k]):
        accuracy = accuracy + 1
print(" Accuracy is  ", accuracy/n)

 Accuracy is   0.8076923076923077
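
The prediction and accuracy loops can also be written as array operations. The sketch below assumes the learned parameters printed earlier and uses seven (label, flavanoids) rows copied from the filtered table, so the resulting figure describes only that small subset.

```python
import numpy as np

# Learned parameters copied from the training output.
m = -0.8295583744974768
b = 2.1891794437591803

# A handful of (label, flavanoids) rows copied from the filtered table.
y_true = np.array([0, 0, 0, 1, 1, 1, 1])
x = np.array([3.06, 2.76, 3.24, 2.65, 3.15, 2.24, 2.45])

# Sigmoid, thresholding, and comparison applied to the whole array at once.
probs = 1.0 / (1.0 + np.exp(-(m * x + b)))
preds = (probs >= 0.5).astype(int)
subset_accuracy = np.mean(preds == y_true)
print(preds, subset_accuracy)
```

Five of these seven rows are classified correctly. The two misses are class 1 samples with flavanoid values of 2.65 and 3.15, both above the point where the model switches to class 0.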


Now, let's implement this using sklearn. It provides a LogisticRegression model that can be fit [or trained] on labeled X and Y data and then used to predict labels from X data. The library also includes evaluation metrics.

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

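X1 = X.reshape(-1,1) #sklearn expects a 2D array of shape (n_samples, n_features)

model = LogisticRegression(solver='newton-cg', max_iter=10)
model.fit(X1, Y)
pred = model.predict(X1)
accuracy = accuracy_score(Y, pred) #Accuracy on the same data the model was trained on
print(accuracy)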

0.8153846153846154
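
To compare the fitted sklearn model with the hand-written one, its slope and intercept can be read from `coef_` and `intercept_`. The sketch below is self-contained and, as an assumption, uses the default `LogisticRegression()` settings rather than the solver and iteration limit above, so its numbers may differ slightly from that run.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression

wine = load_wine()
keep = wine.target < 2                   # classes 0 and 1 only
X1 = wine.data[keep, 6].reshape(-1, 1)   # flavanoids as a single column
Y = wine.target[keep]

# Default settings (an assumption for this sketch).
model = LogisticRegression()
model.fit(X1, Y)

slope = model.coef_[0, 0]        # plays the role of m
intercept = model.intercept_[0]  # plays the role of b
print(slope, intercept, -intercept / slope)

# predict_proba returns one column per class, ordered as in model.classes_.
probs = model.predict_proba(np.array([[2.0], [3.0]]))
print(model.classes_, probs)
```

sklearn's LogisticRegression applies L2 regularization by default (C=1.0), so its parameters need not match those found by the unregularized loop above even though the two accuracies are close.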
