# Introduction to Logistic Regression

This notebook illustrates graphically how logistic regression works.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

sns.set()

Next, we'll use the `make_classification` function from scikit-learn to generate data.
In its basic form, logistic regression models a binary outcome, so we generate data points that belong to one of two classes.

In [None]:
# We set a random state so the generated data is the same for each run of this cell
x, y = make_classification(
    n_samples=100,
    n_features=1,
    n_classes=2,
    n_clusters_per_class=1,
    flip_y=0.03,
    n_informative=1,
    n_redundant=0,
    n_repeated=0,
    random_state=3789
)

In [None]:
# Flatten the (100, 1) feature array to 1-D; y is already 1-D
data = {"x": x.ravel(), "y": y}

In [None]:
df = pd.DataFrame(data=data)
df

In [None]:
# statistical properties for each class y (0,1)
statistical_properties_for_class = df.groupby(by=['y'])[["x"]].describe()
statistical_properties_for_class

In [None]:
# We plot the relationship between the feature and classes. 
plt.scatter(x, y, c=y, cmap='rainbow')
plt.xlabel('x')
plt.ylabel('y')
plt.title("Binary Classification Data")
plt.savefig("binary_classification_data.png");

The plot above shows the binary classification dataset. The y-axis shows the true class label of each data point; we treat y = 0 as the negative class and y = 1 as the positive class. The x-axis shows the single feature, x.
For observations in the negative class (y = 0), x ranges from -2.6 to 2.03; for observations in the positive class (y = 1), x ranges from -0.48 to 2.30. See the `statistical_properties_for_class` variable for more summary statistics.
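
The per-class ranges quoted above can be read off directly with a grouped aggregation. The following is a minimal sketch that regenerates the same data so it runs on its own; the name `class_ranges` is introduced here for illustration and is not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Regenerate the dataset with the same settings used earlier in the notebook
x, y = make_classification(
    n_samples=100, n_features=1, n_classes=2, n_clusters_per_class=1,
    flip_y=0.03, n_informative=1, n_redundant=0, n_repeated=0,
    random_state=3789,
)
df = pd.DataFrame({"x": x.ravel(), "y": y})

# Minimum and maximum of x within each class (hypothetical helper variable)
class_ranges = df.groupby("y")["x"].agg(["min", "max"])
print(class_ranges.round(2))
```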

In [None]:
# Prior to training our model, we’ll set aside a portion of our data in order to evaluate its performance.
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=1)

In [None]:
# We create an instance of the LogisticRegression class and call its fit method with the features and the labels
# (logistic regression is a supervised machine learning algorithm, so it needs the labels to learn from).

lr = LogisticRegression()
lr.fit(X_train, y_train)

In [None]:
# We can inspect the fitted model parameters: the coefficient (slope) and the intercept.
print(lr.coef_)
print(lr.intercept_)

In [None]:
# Let’s see how the model performs against data that it hasn’t been trained on.
y_pred = lr.predict(X_test)

In [None]:
# Given that this is a classification problem, we use a confusion matrix to evaluate our model.
confusion_matrix(y_test, y_pred)
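
To make the confusion matrix easier to read, the sketch below unpacks its four cells and derives accuracy and recall from them. It refits the model so that it runs on its own; passing `labels=[0, 1]` is an addition made here to guarantee a 2x2 matrix, and the variable names `tn`, `fp`, `fn`, `tp` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Same data, split and model as in the notebook
x, y = make_classification(
    n_samples=100, n_features=1, n_classes=2, n_clusters_per_class=1,
    flip_y=0.03, n_informative=1, n_redundant=0, n_repeated=0,
    random_state=3789,
)
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=1)
lr = LogisticRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # share of all predictions that are correct
recall = tp / (tp + fn)          # share of true positives that were found
print(cm)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```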

In [None]:
# If for whatever reason we’d like to check the actual probability that a data point belongs to a given class, 
# we can use the predict_proba function.
lr.predict_proba(X_test)

In [None]:
pd.DataFrame(lr.predict_proba(X_test))

The first column is the probability that the sample belongs to the first class (here y = 0), and the second column is the probability that it belongs to the second class (here y = 1).
Before plotting the sigmoid function, which gives the probability that an observation belongs to the positive class, we create a DataFrame containing our test data and sort it by x.
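
The column order of `predict_proba` follows the fitted model's `classes_` attribute. The sketch below, which refits the model so it is self-contained, checks that each row sums to 1 and that taking the most probable class reproduces `predict`; `pred_from_proba` is a name introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same data, split and model as in the notebook
x, y = make_classification(
    n_samples=100, n_features=1, n_classes=2, n_clusters_per_class=1,
    flip_y=0.03, n_informative=1, n_redundant=0, n_repeated=0,
    random_state=3789,
)
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=1)
lr = LogisticRegression().fit(X_train, y_train)

proba = lr.predict_proba(X_test)
print(lr.classes_)        # class label for each column of proba
print(proba.sum(axis=1))  # each row sums to 1

# Picking the most probable class per row should match lr.predict
pred_from_proba = lr.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(pred_from_proba, lr.predict(X_test)))
```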

The sigmoid function reads:
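$$\sigma(t) = \dfrac{1}{1 + e^{-t}} = \dfrac{1}{1 + e^{-(b_{0} + b_{1} X)}}, \qquad t = b_{0} + b_{1} X$$
Here $b_{0}$ is the intercept (`lr.intercept_`) and $b_{1}$ is the coefficient (`lr.coef_`) of the fitted model.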
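
As a sanity check on the formula, the sketch below evaluates the sigmoid by hand with NumPy and compares it with the second column of `predict_proba`. It assumes the default binary `LogisticRegression` setup used in this notebook and refits the model so it runs on its own; `b0`, `b1`, `t` and `sigma` are names chosen to mirror the equation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same data, split and model as in the notebook
x, y = make_classification(
    n_samples=100, n_features=1, n_classes=2, n_clusters_per_class=1,
    flip_y=0.03, n_informative=1, n_redundant=0, n_repeated=0,
    random_state=3789,
)
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=1)
lr = LogisticRegression().fit(X_train, y_train)

b0 = lr.intercept_[0]  # intercept of the fitted model
b1 = lr.coef_[0, 0]    # coefficient of the single feature

t = b0 + b1 * X_test[:, 0]        # linear predictor
sigma = 1.0 / (1.0 + np.exp(-t))  # sigmoid applied by hand

# Should agree with the model's probability for the positive class
print(np.allclose(sigma, lr.predict_proba(X_test)[:, 1]))
```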

In [None]:
df = pd.DataFrame({'x': X_test[:,0], 'y': y_test})
df = df.sort_values(by='x')

In [None]:
df

In [None]:
from scipy.special import expit

# Evaluate the fitted sigmoid, expit(b0 + b1 * x), at the sorted test points
sigmoid_function = expit(df['x'].to_numpy() * lr.coef_[0, 0] + lr.intercept_[0])

In [None]:
fig, ax = plt.subplots(sharey=True)

# Plot binary classification data
ax.scatter(df['x'], df['y'], c=df['y'], cmap='rainbow', edgecolors='b')
ax.tick_params(axis='y')
ax.set_xlabel('X')
ax.set_ylabel('y\nclasses 0 1', rotation=45)
ax.set_yticks([0.0, 1.0])

# Generate a new Axes instance, on the twin-X axes (same position)
ax2 = ax.twinx()

# Plot sigmoid function, change tick color
ax2.plot(df['x'],sigmoid_function, color='green', label= "Sigmoid")
ax2.tick_params(axis='y', labelcolor='green')
# Add a tick at 0.5 (the default threshold) without relying on the number of default ticks
ax2.set_yticks(np.union1d(ax2.get_yticks()[1:-1], [0.5]))
ax2.text(x=1.8,y=1.05,s="Probability",color='Green')

ax2.legend(loc=(0.01,0.9))
plt.show()

The plot above shows the test set together with the fitted sigmoid curve. The left y-axis shows the true class label of each data point (y = 0 is the negative class, y = 1 the positive class), the right y-axis shows the predicted probability, and the x-axis shows the single feature.
Observations in the negative class (y = 0), shown as blue dots, have x between -2.55 and 0.74; observations in the positive class (y = 1), shown as red dots, have x between 0.54 and 1.58. The green line is the sigmoid function; it crosses the probability threshold of 0.5 at about x = 0. Five blue dots have x > 0.
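
The point where the sigmoid crosses 0.5 can also be computed from the fitted parameters: the probability is 0.5 exactly when the linear predictor is zero, that is, at x = -b0 / b1. The sketch below refits the model so it is self-contained; `x_boundary` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same data, split and model as in the notebook
x, y = make_classification(
    n_samples=100, n_features=1, n_classes=2, n_clusters_per_class=1,
    flip_y=0.03, n_informative=1, n_redundant=0, n_repeated=0,
    random_state=3789,
)
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=1)
lr = LogisticRegression().fit(X_train, y_train)

b0 = lr.intercept_[0]
b1 = lr.coef_[0, 0]

# b0 + b1 * x = 0  <=>  sigmoid = 0.5
x_boundary = -b0 / b1
p_at_boundary = lr.predict_proba(np.array([[x_boundary]]))[0, 1]
print(f"decision boundary at x = {x_boundary:.3f}")
print(f"predicted probability there = {p_at_boundary:.3f}")
```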

**Check your understanding:**
- Which value of X corresponds to a probability threshold of 0.5?
- How many data points will be misclassified in the graph above if you choose a probability threshold at 0.5?
- Which threshold would you choose to improve the accuracy?
- Which threshold would you choose to improve recall?
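
One way to explore the threshold questions is to apply different cut-offs to the predicted probabilities and compare the resulting metrics. The sketch below refits the model so it runs on its own; the thresholds 0.3, 0.5 and 0.7 are arbitrary illustrative values, and `results` is a hypothetical helper dictionary. In practice a threshold would usually be chosen on a separate validation set rather than on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Same data, split and model as in the notebook
x, y = make_classification(
    n_samples=100, n_features=1, n_classes=2, n_clusters_per_class=1,
    flip_y=0.03, n_informative=1, n_redundant=0, n_repeated=0,
    random_state=3789,
)
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=1)
lr = LogisticRegression().fit(X_train, y_train)

# Probability of the positive class for each test point
proba_pos = lr.predict_proba(X_test)[:, 1]

results = {}  # threshold -> (accuracy, recall)
for threshold in (0.3, 0.5, 0.7):  # illustrative values only
    # Predict the positive class when the probability reaches the threshold
    y_hat = (proba_pos >= threshold).astype(int)
    acc = accuracy_score(y_test, y_hat)
    rec = recall_score(y_test, y_hat)
    results[threshold] = (acc, rec)
    print(f"threshold={threshold:.1f}  accuracy={acc:.2f}  recall={rec:.2f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases as the threshold goes down; accuracy may move in either direction.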