# Logistic Regression

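Despite its name, logistic regression is a machine learning model for classification, not regression. It passes a linear function of the input through the logistic (sigmoid) function, so the output always lies between 0 and 1: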

$$ y = \frac{1}{1+e^{-(aX+b)}} $$

Q: What are X, y, a, b?
* `X` is our input data.
    * Q: What is the shape of X?
    * A: (# of data points, # of features) \[e.g. (# of penguins, # of penguin characteristics we want to use)\]
* `y` (also written `y_true`) holds our labels; `y_pred` holds the predictions of our model
* `a` and `b` are model parameters (in this case, coefficient(s) and intercept); **they are what our model learns**!

Let's take the example of the penguins dataset:
* If we're using only two features, e.g. body mass and flipper length, our logistic regression will look like this:
$$ y = \frac{1}{1+e^{-(a_1 x_1+a_2 x_2+b)}} $$
* If we're using all four numerical features, our logistic regression will look like this:
$$ y = \frac{1}{1+e^{-(a_1 x_1+a_2 x_2 + a_3 x_3 + a_4 x_4 +b)}} $$

Here, $x_1$ to $x_4$ are our _features_, and $a_1$ to $a_4$, plus $b$ are our _parameters_.
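
As a minimal sketch of the formula above, the model output for several data points at once can be computed with a matrix-vector product. The names `logistic`, `X_demo`, `a_demo` and `b_demo` and all the numbers are made up for illustration; they are not learned from the penguins data.

```python
import numpy as np

def logistic(X, a, b):
    """Apply the logistic function to the linear combination X @ a + b."""
    return 1 / (1 + np.exp(-(X @ a + b)))

# hypothetical values: 3 data points, 2 features
X_demo = np.array([[0.0, 0.0],
                   [1.0, -1.0],
                   [2.0, 0.5]])
a_demo = np.array([1.5, -0.5])  # one coefficient per feature (made up)
b_demo = 0.0                    # intercept (made up)

y_demo = logistic(X_demo, a_demo, b_demo)
print(X_demo.shape, y_demo.shape)  # (3, 2) (3,)
print(y_demo[0])                   # 0.5, because the linear part is 0 for this row
```

Each row of `X_demo` yields one value between 0 and 1, which is why `X` has shape (# of data points, # of features) and `y` has shape (# of data points,).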

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
def sigmoid(a, b, x):
    return 1/(1+np.exp(-(a*x+b)))

In [None]:
x = np.linspace(-10, 10, 201)

In [None]:
a = 1
b = 0

In [None]:
plt.plot(x, sigmoid(a, b, x))

In [None]:
a_list = [0.5, 1, 5]
b_list = [-5, 0, 5]

In [None]:
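# vary the coefficient a: larger values make the curve steeper
for val in a_list:
    plt.plot(x, sigmoid(val, b, x), label=f'a={val}')
plt.legend()  # draw the legend once, after all curves are plotted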

In [None]:
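# vary the intercept b: the curve shifts left or right
for val in b_list:
    plt.plot(x, sigmoid(a, val, x), label=f'b={val}')
plt.legend()  # draw the legend once, after all curves are plotted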

Pros and cons of logistic regression:
* Pros:
    * Outputs probabilities
    * Interpretable: on standardized features, the size of each coefficient indicates how strongly that feature influences the prediction
    * Easy to implement
    * Fast
* Cons:
    * Linear decision boundary
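
To illustrate the "linear decision boundary" limitation, here is a small sketch on synthetic XOR-style data, where the class depends on the sign of `x1 * x2`. The data and all variable names are made up for this illustration. Because no single straight line can separate these two classes, the accuracy should stay far below 100%.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# four tight clusters at the corners of a square;
# diagonally opposite clusters share a class (XOR pattern)
centers = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]])
labels = np.array([1, 1, 0, 0])
X_xor = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])
y_xor = np.repeat(labels, 50)

clf = LogisticRegression().fit(X_xor, y_xor)
acc_xor = clf.score(X_xor, y_xor)
print(f"training accuracy on XOR-style data: {acc_xor:.2f}")
```

A straight line can classify at most about three of the four clusters correctly here, so a model with a non-linear boundary (or extra features such as `x1 * x2`) would be needed for this kind of data.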

### Build a logistic regression model to predict penguin species

#### 1. Prepare the data

In [None]:
df = pd.read_csv('penguins_simple.csv', sep=';')

In [None]:
df = df[df['Species'] != 'Chinstrap']

In [None]:
df['Species'].value_counts()

In [None]:
sns.scatterplot(x='Body Mass (g)', y='Flipper Length (mm)', data=df, hue='Species', style='Species', palette=['red','blue'])

In [None]:
# encode the label as a number: Adelie -> 0, the other remaining species -> 1
df['Species_category'] = np.where(df['Species'] == 'Adelie', 0, 1)

In [None]:
df.head(150)

In [None]:
sns.scatterplot(x='Body Mass (g)', y='Flipper Length (mm)', data=df, hue='Species_category', style='Species_category', palette=['red','blue'])

#### 2. Build the model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
X = df[['Body Mass (g)', 'Flipper Length (mm)']]
y = df['Species_category']

In [None]:
X.shape, y.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

In [None]:
df.isna().sum()

In [None]:
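# standardize using the mean and standard deviation of the training set
train_mean, train_std = X_train.mean(), X_train.std()
X_train = (X_train - train_mean) / train_std

In [None]:
# scale the test set with the *training* statistics so both sets share one scale
X_test = (X_test - train_mean) / train_std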
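
A common way to standardize is to compute the mean and standard deviation on the training set only and reuse them for the test set. The sketch below shows this with scikit-learn's `StandardScaler` on made-up numbers that merely stand in for body mass and flipper length. Note that `StandardScaler` uses the population standard deviation, so its results differ very slightly from pandas' `.std()`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(13)

# made-up stand-ins for body mass (g) and flipper length (mm)
X_train_demo = rng.normal(loc=[4200, 200], scale=[800, 14], size=(80, 2))
X_test_demo = rng.normal(loc=[4200, 200], scale=[800, 14], size=(20, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn mean/std on the training data only
X_test_scaled = scaler.transform(X_test_demo)        # reuse the training mean/std

print(X_train_scaled.mean(axis=0))  # approximately 0 for both features
print(X_test_scaled.mean(axis=0))   # close to 0, but not exactly
```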

In [None]:
# instantiate the model
m = LogisticRegression()

In [None]:
# train the model
m.fit(X_train, y_train)

In [None]:
# our coefficients
m.coef_, m.intercept_
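
With two features, the learned parameters define the decision boundary: the model predicts a probability of 0.5 exactly where $a_1 x_1 + a_2 x_2 + b = 0$, i.e. on the line $x_2 = -(a_1 x_1 + b)/a_2$. The sketch below checks this on made-up two-class data; `model`, `X_demo` and the other names are illustrative and separate from the penguin model `m`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# made-up two-feature data with two overlapping classes
X0 = rng.normal(loc=[-1, -1], scale=1.0, size=(60, 2))
X1 = rng.normal(loc=[1, 1], scale=1.0, size=(60, 2))
X_demo = np.vstack([X0, X1])
y_demo = np.array([0] * 60 + [1] * 60)

model = LogisticRegression().fit(X_demo, y_demo)
(a1, a2), b = model.coef_[0], model.intercept_[0]

# on the boundary a1*x1 + a2*x2 + b = 0, so x2 = -(a1*x1 + b) / a2
x1_line = np.linspace(-3, 3, 5)
x2_line = -(a1 * x1_line + b) / a2

# every point on that line should get a predicted probability of 0.5
proba_on_line = model.predict_proba(np.column_stack([x1_line, x2_line]))[:, 1]
print(proba_on_line)
```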

#### 3. Evaluate the model

In [None]:
# score on the training data
m.score(X_train, y_train)

In [None]:
# score on the testing data
m.score(X_test, y_test)
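
For a classifier, `score` returns the accuracy, i.e. the fraction of correctly predicted labels. The following sketch on made-up data (all names are illustrative) shows that it matches both `accuracy_score` and a manual computation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# made-up two-class data
X_demo = np.vstack([rng.normal(-1, 1, size=(60, 2)),
                    rng.normal(1, 1, size=(60, 2))])
y_demo = np.array([0] * 60 + [1] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=13)
model = LogisticRegression().fit(X_tr, y_tr)

score_te = model.score(X_te, y_te)                # built-in score
acc_te = accuracy_score(y_te, model.predict(X_te))  # same metric from sklearn.metrics
manual_te = np.mean(model.predict(X_te) == y_te)    # fraction of correct predictions

print(score_te, acc_te, manual_te)
```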

#### 4. Predict

In [None]:
m.coef_
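
To make the link between the coefficients and the predictions concrete, here is a sketch on made-up data (names such as `model` and `X_new` are illustrative, not part of the penguin example). `predict` returns class labels, `predict_proba` returns probabilities, and for a binary model with default settings the probability of class 1 equals the sigmoid of the linear combination.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# made-up two-class training data
X_demo = np.vstack([rng.normal(-1, 1, size=(60, 2)),
                    rng.normal(1, 1, size=(60, 2))])
y_demo = np.array([0] * 60 + [1] * 60)
model = LogisticRegression().fit(X_demo, y_demo)

# three hypothetical new data points
X_new = np.array([[-2.0, -1.5], [0.1, 0.0], [1.8, 2.2]])
labels_new = model.predict(X_new)             # predicted class labels (0 or 1)
proba_new = model.predict_proba(X_new)[:, 1]  # predicted probability of class 1

# the same probabilities, computed by hand from the learned parameters
manual_proba = 1 / (1 + np.exp(-(X_new @ model.coef_[0] + model.intercept_[0])))

print(labels_new)
print(proba_new)
print(manual_proba)
```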

In [None]:
import warnings  # needed by plot_decision_regions below

from matplotlib.colors import ListedColormap

#removing column names and replacing with int
#X_train= X_train[:].values

# function for plotting the decision boundary
def versiontuple(v):
    return tuple(map(int, (v.split("."))))


def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):

    # setup marker generator and color map
    markers = ('o', 'x', 's', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    
    # convert the input to a plain NumPy array (drops DataFrame column names)
    Xnew = np.asarray(X)

    # plot the decision surface
    x1_min, x1_max = Xnew[:, 0].min() - 1, Xnew[:, 0].max() + 1
    x2_min, x2_max = Xnew[:, 1].min() - 1, Xnew[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    for idx, cl in enumerate(np.unique(y)):
        # use `color` for a single RGBA value; `c` expects one value per point
        plt.scatter(x=Xnew[y == cl, 0], y=Xnew[y == cl, 1],
                    alpha=0.8, color=cmap(idx),
                    marker=markers[idx], label=cl)

    # highlight test samples
    if test_idx:
        # plot all samples
        if not versiontuple(np.__version__) >= versiontuple('1.9.0'):
            X_test, y_test = Xnew[list(test_idx), :], y[list(test_idx)]
            warnings.warn('Please update to NumPy 1.9.0 or newer')
        else:
            X_test, y_test = Xnew[test_idx, :], y[test_idx]

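        # draw hollow circles around the test samples
        # (c='' is rejected by current matplotlib versions)
        plt.scatter(X_test[:, 0],
                    X_test[:, 1],
                    facecolors='none',
                    edgecolors='black',
                    alpha=1.0,
                    linewidths=1,
                    marker='o',
                    s=55, label='test set')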

In [None]:
plot_decision_regions(X_test, y_test, classifier=m)
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()