# Logistic Regression 01

## <center>Supervised Learning</center>

- making inferences from labeled data.

#### 1. Classification (categorical data)
- binary classification (tumor: benign, malignant)
- multiclass classification (books: maths, physics, stats, psychology, etc.)
- example algorithms: KNN, Linear Models, Decision Trees, SVMs, etc.

#### 2. Regression (continuous data)
- predicting income, stock prices, age, and other continuous data
- example algorithms: KNN, Linear Regression, Decision Trees, SVMs, etc.
___

Linear models (Linear Regression, Polynomial Regression, Gaussian Regression, Sigmoid Regression, etc.) make predictions according to a linear function of the input features. <br>
Many ML algorithms (including those specified above) can be used for both classification and regression.

Logistic regression is a linear model for classification: it is used when the dependent (output) variable is categorical. The "logistic" in the name comes from the function at the core of the algorithm, the logistic function (also called the sigmoid function). The model passes a linear function of the input features through it to obtain a value between 0 and 1. The logistic function can be written as:

$$f(x) = \frac{1}{1+e^{-x}}$$

![Sigmoid function](images/sigmoid.png)
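
To connect the formula to classification, here is a brief sketch of how a linear model uses $f$. The symbols $\mathbf{w}$ (weights), $b$ (intercept), and $\mathbf{x}$ (feature vector) are notation assumed here for illustration; they do not appear in the text above:

$$z = \mathbf{w}^\top \mathbf{x} + b, \qquad P(y = 1 \mid \mathbf{x}) = f(z) = \frac{1}{1+e^{-z}}$$

Since $f(0) = \tfrac{1}{2}$, $f(z) \to 1$ as $z \to \infty$, and $f(z) \to 0$ as $z \to -\infty$, predicting class 1 whenever $f(z) > \tfrac{1}{2}$ is the same as checking whether $z > 0$.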

Logistic regression is a classification algorithm. We can implement the logistic function in NumPy as follows:



In [None]:
import matplotlib.pyplot as plt
import numpy as np

def sigmoid(x):
    """Logistic (sigmoid) function, applied element-wise to NumPy arrays."""
    return 1.0 / (1.0 + np.exp(-x))

In [3]:
numbers = np.linspace(-20, 20, 50)  # generate 50 evenly spaced numbers from -20 to 20
numbers

array([-20.        , -19.18367347, -18.36734694, -17.55102041,
       -16.73469388, -15.91836735, -15.10204082, -14.28571429,
       -13.46938776, -12.65306122, -11.83673469, -11.02040816,
       -10.20408163,  -9.3877551 ,  -8.57142857,  -7.75510204,
        -6.93877551,  -6.12244898,  -5.30612245,  -4.48979592,
        -3.67346939,  -2.85714286,  -2.04081633,  -1.2244898 ,
        -0.40816327,   0.40816327,   1.2244898 ,   2.04081633,
         2.85714286,   3.67346939,   4.48979592,   5.30612245,
         6.12244898,   6.93877551,   7.75510204,   8.57142857,
         9.3877551 ,  10.20408163,  11.02040816,  11.83673469,
        12.65306122,  13.46938776,  14.28571429,  15.10204082,
        15.91836735,  16.73469388,  17.55102041,  18.36734694,
        19.18367347,  20.        ])

In [4]:
# pass every number through the sigmoid function (NumPy applies it element-wise)
results = sigmoid(numbers)
results[:20]  # show the first 20 values

array([2.06115362e-09, 4.66268920e-09, 1.05478167e-08, 2.38610019e-08,
       5.39777490e-08, 1.22107080e-07, 2.76227484e-07, 6.24874560e-07,
       1.41357420e-06, 3.19774584e-06, 7.23382998e-06, 1.63640365e-05,
       3.70175420e-05, 8.37362281e-05, 1.89405944e-04, 4.28366894e-04,
       9.68517025e-04, 2.18827951e-03, 4.93663522e-03, 1.10983776e-02])
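
Only the first 20 values are printed above. The following is a minimal sketch that checks a few properties of the whole array; it repeats the `sigmoid` definition and the `numbers` array so that it runs on its own, and the particular checks are chosen here for illustration.

```python
import numpy as np

def sigmoid(x):
    # Same definition as in the cell above
    return 1.0 / (1.0 + np.exp(-x))

numbers = np.linspace(-20, 20, 50)
results = sigmoid(numbers)

# Every output lies strictly between 0 and 1
print(results.min(), results.max())

# The curve crosses 0.5 exactly at x = 0
print(sigmoid(0.0))

# Symmetry: sigmoid(-x) equals 1 - sigmoid(x)
print(np.allclose(sigmoid(-numbers), 1.0 - sigmoid(numbers)))

# Outputs grow with the input (the function is increasing)
print(np.all(np.diff(results) > 0))
```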

As you can see, all the numbers are squashed into the interval (0, 1).

Now we will implement logistic regression using scikit-learn.

In [1]:
# Using LogisticRegression on the cancer dataset. 

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
%matplotlib inline

cancer = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [2]:
print('Accuracy on the training subset: {:.3f}'.format(log_reg.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(log_reg.score(X_test, y_test)))

Accuracy on the training subset: 0.953
Accuracy on the test subset: 0.958
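
To tie the sigmoid back to the fitted model, the sketch below checks that the class-1 probabilities reported by scikit-learn are the sigmoid of the model's linear score. It refits the model so that it runs on its own. Passing `solver='liblinear'` is an assumption made here to match the parameters printed above, since the default solver can differ between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Assumption: liblinear, as in the printed parameters above
log_reg = LogisticRegression(solver='liblinear')
log_reg.fit(X_train, y_train)

# Linear score for each test sample: w . x + b
scores = log_reg.decision_function(X_test)
manual_scores = X_test @ log_reg.coef_.ravel() + log_reg.intercept_[0]

# Probability of class 1 as reported by scikit-learn
proba = log_reg.predict_proba(X_test)[:, 1]

print(np.allclose(scores, manual_scores))    # score is a linear function of the features
print(np.allclose(proba, sigmoid(scores)))   # probability is the sigmoid of the score
print(np.array_equal(log_reg.predict(X_test), (scores > 0).astype(int)))  # class 1 when score > 0
```

A positive score corresponds to a probability above 0.5, which is why thresholding the score at 0 reproduces `predict`.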


**Regularization**:

- prevents overfitting (according to the Müller and Guido ML book)
- L1 - assumes only a few features are important, and can set the coefficients of the remaining features to exactly zero
- L2 - does not assume only a few features are important; used by default in scikit-learn's LogisticRegression
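
One way to see the difference between the two penalties is to count how many coefficients each fitted model keeps non-zero. This is a minimal sketch, assuming the `penalty` argument with the `liblinear` solver as in the parameters printed earlier; newer scikit-learn releases may spell this option differently or emit a warning. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Same data and default C, only the penalty differs
l1_model = LogisticRegression(penalty='l1', solver='liblinear').fit(X_train, y_train)
l2_model = LogisticRegression(penalty='l2', solver='liblinear').fit(X_train, y_train)

# Count the features whose coefficient is not exactly zero
n_l1 = int(np.sum(l1_model.coef_ != 0))
n_l2 = int(np.sum(l2_model.coef_ != 0))

print('Non-zero coefficients with L1:', n_l1)
print('Non-zero coefficients with L2:', n_l2)
```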
               
**'C'**:

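- parameter that controls the strength of regularization; in scikit-learn, C is the inverse of the regularization strength
- lower C => stronger regularization: the model adjusts to the majority of data points.
- higher C => weaker regularization: the model tries to classify each individual training point correctly.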

In [3]:
log_reg100 = LogisticRegression(C=100)
log_reg100.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(log_reg100.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(log_reg100.score(X_test, y_test)))

Accuracy on the training subset: 0.972
Accuracy on the test subset: 0.965


In [4]:
log_reg001 = LogisticRegression(C=0.01)
log_reg001.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(log_reg001.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(log_reg001.score(X_test, y_test)))

Accuracy on the training subset: 0.934
Accuracy on the test subset: 0.930
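
The accuracies above change with C because C changes how strongly the coefficients are shrunk. The sketch below compares the size of the coefficient vector across the same three values of C. It assumes the `liblinear` solver to match the parameters printed earlier, and `coef_norms` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Assumption: liblinear with the default L2 penalty
    model = LogisticRegression(C=C, solver='liblinear').fit(X_train, y_train)
    # Euclidean length of the coefficient vector
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print('C={:>6}: coefficient norm = {:.3f}'.format(C, coef_norms[C]))
```

A smaller norm at low C indicates stronger regularization, which is consistent with the simpler, less flexible model.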
