# <font color=#8c0062><center>Logistic regression</center></font>

Logistic regression belongs to the family of linear models. It is mainly used for binary classification problems, but can be extended to more complex ones. The main concept behind this model, as the name suggests, is the logistic function, which helps us understand the relationship between one binary (0-1) variable and one or more other variables. At its core, the logistic (sigmoid) function maps a value onto the 0-1 scale, using the base of the natural logarithm, by this equation:

__$$\frac{1} { 1 + e^{- value}}$$__

This equation keeps the output between 0 and 1. Instead of using the least squares method (which can't be used here), the model predicts a probability and looks for the best-fitting sigmoid (an S-shaped function): the curve with the maximum likelihood, given the data we feed the model, is the one selected.
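To make the two ideas above a bit more concrete, here is a minimal sketch in plain Python. The toy data, the two candidate curves and the helper name `log_likelihood` are made up for illustration; this is not how scikit-learn implements the fit internally, it only shows what "the curve with the higher likelihood" means.

```python
import math

def sigmoid(value):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))

def log_likelihood(weight, bias, xs, ys):
    """Sum of the log-probabilities the curve assigns to the observed 0/1 labels."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(weight * x + bias)  # predicted probability of class 1
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# Toy data (made up): small x values are labelled 0, large ones 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]

# Two hypothetical candidate curves; the first is centred between the classes.
good_fit = log_likelihood(2.0, -7.0, xs, ys)
poor_fit = log_likelihood(0.1, 0.0, xs, ys)

print(sigmoid(0.0))         # 0.5, the midpoint of the S curve
print(good_fit > poor_fit)  # True: the better curve has the higher likelihood
```

Fitting the model amounts to searching over the weight and bias for the curve with the highest (log-)likelihood, instead of comparing just two hand-picked candidates.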

If you want to go deeper into the math and logic behind it, there are many relevant sources to learn from, much better than anything I could attempt to create here, so I suggest you do so. So let's try to implement this in sklearn.


We will try to predict the quality of red wine:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression # import specifically the logistic regression model from sklearn
from sklearn.model_selection import train_test_split # import the train_test_split to help us to easily divide our data

%matplotlib inline

In [2]:
datasets_path_raw = ".jupyter\\datasets\\raw\\"
df = pd.read_csv(datasets_path_raw + "winequality-red.csv")

In [3]:
df.head(2)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5


In [4]:
df.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


In [5]:
df.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

We loaded a red wine dataset that is clean right from the start, with quality ranging from 3 to 8. We will translate it into high quality (6-8 = 1) and low quality (3-5 = 0) and create a logistic regression model predicting high- or low-quality red wine.

In [6]:
# function for evaluating high/low quality wine (aka 1 or 0)
def calc_bin_class(x):
    if x <= 5:
        return 0
    else:
        return 1

# apply our evaluation function to every value in the quality column
df["binary_quality"] = df["quality"].apply(calc_bin_class)

df.sample(3)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,binary_quality
1062,8.0,0.38,0.44,1.9,0.098,6.0,15.0,0.9956,3.3,0.64,11.4,6,1
1188,6.7,0.64,0.23,2.1,0.08,11.0,119.0,0.99538,3.36,0.7,10.9,5,0
74,9.7,0.32,0.54,2.5,0.094,28.0,83.0,0.9984,3.28,0.82,9.6,5,0


In [7]:
X = df.drop(columns=["quality", "binary_quality"])
y = df["binary_quality"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LogisticRegression(max_iter=50000) # I had to raise max_iter so that the solver converges
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.76875

We are getting an accuracy a little above 76%. That is not bad, but let's try to squeeze at least a little bit more out of our model. We are trying to learn how best to create the model, so play with it to your heart's content. Open up the scikit-learn docs, look at the parameters and tweak them. __Of course, the more of the math behind it you understand, the better you will be able to set them__, but do not be discouraged if you do not understand it (my case). Try what you can, look into the individual params from time to time as you try them out, and with time, practice and a little consistency mixed with interest, you will get there!
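One tweak worth trying before touching the model parameters: the wine features live on very different scales (density is close to 1, total sulfur dioxide goes up to 289), and solvers like lbfgs often need far fewer iterations when the features are standardized first. Below is a minimal sketch of that idea. It uses a synthetic dataset from `make_classification` as a stand-in for the wine data, so the numbers it prints are not wine results, and the variable names are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data (assumption: 11 numeric features, binary target).
X, y = make_classification(n_samples=1600, n_features=11, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The scaler is fitted on the training split only and then reused on the test split.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_train, y_train)

accuracy = scaled_model.score(X_test, y_test)
# n_iter_ holds the number of iterations the solver actually used.
n_iter = scaled_model.named_steps["logisticregression"].n_iter_[0]
print("accuracy:", accuracy)
print("lbfgs iterations used:", n_iter)
```

On the real data you would pass the wine `X` and `y` instead of the synthetic ones. Setting `random_state` in `train_test_split` also makes the score repeatable between runs, which helps when comparing tweaks.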

In [8]:
model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=50000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Alright, I am interested and will try to adjust 2 parameters that could have at least some impact on accuracy:
* __C__ - inverse of regularization strength - smaller values mean stronger regularization
* __solver__ - algorithm used for the optimization problem

Yeah, I know, I just kinda copied the docs... But let's not be lazy here: if you do not know those algos, just find some blog or YT tutorials. As far as I can tell, the data science community is really great and helpful in many ways!
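To get a feeling for what C does, here is a small sketch that fits the same model with three values of C and prints the size of the learned coefficients. It runs on synthetic data from `make_classification` (an assumption, not the wine dataset), and the names `clf` and `coef_norms` are just for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 3 informative features plus 2 pure-noise features.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # Euclidean length of the coefficient vector: shrinks as regularization gets stronger.
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print("C =", C, "-> coefficient norm:", round(coef_norms[C], 3))
```

With the default L2 penalty, a small C pulls the coefficients toward zero, while a large C lets the model fit the training data more freely.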

In [9]:
solvers = ["newton-cg", "lbfgs", "sag", "saga"]
print("original score: " + str(model.score(X_test, y_test)))

for solver in solvers:
    model.solver = solver
    print("Testing with " + solver + " solver:")
    for j in range(1, 16, 2):
        model.C = j / 10  # C values 0.1, 0.3, ..., 1.5
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)
        print("Accuracy with C set to " + str(model.C) + " is: " + str(score))

original score: 0.76875
Testing with newton-cg solver:
Accuracy with C set to 0.1 is: 0.746875
Accuracy with C set to 0.3 is: 0.759375
Accuracy with C set to 0.5 is: 0.771875
Accuracy with C set to 0.7 is: 0.765625
Accuracy with C set to 0.9 is: 0.76875
Accuracy with C set to 1.1 is: 0.76875
Accuracy with C set to 1.3 is: 0.76875
Accuracy with C set to 1.5 is: 0.771875
Testing with lbfgs solver:
Accuracy with C set to 0.1 is: 0.746875
Accuracy with C set to 0.3 is: 0.759375
Accuracy with C set to 0.5 is: 0.771875
Accuracy with C set to 0.7 is: 0.7625
Accuracy with C set to 0.9 is: 0.76875
Accuracy with C set to 1.1 is: 0.76875
Accuracy with C set to 1.3 is: 0.76875
Accuracy with C set to 1.5 is: 0.771875
Testing with sag solver:
Accuracy with C set to 0.1 is: 0.7625
Accuracy with C set to 0.3 is: 0.778125
Accuracy with C set to 0.5 is: 0.775
Accuracy with C set to 0.7 is: 0.775
Accuracy with C set to 0.9 is: 0.775
Accuracy with C set to 1.1 is: 0.775
Accuracy with C set to 1.3 is: 0.77

Well, those differences are not that great, but hey, every 0.1% is valuable information! In the output above, the "sag" solver reaps the best results, peaking at 0.778125 with C set to 0.3. You can do a lot more testing; it all depends on you. This is just a mini showcase of how you can try to build your model (and there are probably far more efficient ways), but this is one of my actual ways to test and fit models.
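One of those more efficient ways could be scikit-learn's `GridSearchCV`, which runs the same kind of loop but scores every combination with cross-validation on the training data. That matters because picking parameters by their test-set score, as the loop above does, tends to make the final number look a little better than it really is. The sketch below again uses synthetic data as a stand-in for the wine dataset (an assumption), and the grid simply mirrors the solvers and C values tried above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data (assumption: 11 numeric features, binary target).
X, y = make_classification(n_samples=1600, n_features=11, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

# Parameter names are prefixed with the pipeline step name and a double underscore.
param_grid = {
    "logisticregression__solver": ["newton-cg", "lbfgs", "sag", "saga"],
    "logisticregression__C": [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5],
}

# 5-fold cross-validation on the training split; the test split stays untouched.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("mean cross-validated accuracy:", search.best_score_)
print("accuracy on the held-out test set:", search.score(X_test, y_test))
```

The test set is used only once, at the very end, so its score stays an honest estimate of how the chosen model behaves on unseen data.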

I hope the code above was at least a little bit beneficial, and I would like to thank you for the time you spent going through this little notebook. I wish you the best in your data science journey!