# <font color='#eb3483'> Logistic Regression </font>

In this module, we'll explore how to build a logistic regression model using scikit-learn. Remember, logistic regression is a binary classification algorithm: instead of predicting a number (regression) or one of several labels (multi-class classification), we'll predict one of two outcomes, true or false. Let's start by importing our usual toolkit.

In [None]:
from IPython.display import Image
import pandas as pd
import numpy as np
from sklearn import datasets

import seaborn as sns
sns.set(rc={'figure.figsize':(6,6)}) #Set our seaborn aesthetics (we're going to customize our figure size)

import warnings
warnings.simplefilter("ignore")

## <font color='#eb3483'> Breast Cancer Data </font>

We are going to use the [Wisconsin Breast Cancer Dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)). It is a dataset that contains measurements computed from digitized images of fine needle aspirate (FNA) breast mass samples. The goal is to predict whether the cells are benign (non-cancerous) or malignant (cancerous).

The images look like this one:

![title](media/breast_cancer.png)

In [None]:
# load the cancer dataset as "cancer_data" and view its keys
cancer_data = ...

In [None]:
# print out the "DESCR" key

This dataset poses a binary classification problem: the target variable is either negative or positive, encoded as a 0 or a 1.

In [None]:
# view the first 20 values using the "target" key

We can see the target labels (the names that 0 and 1 represent) by reading the key `target_names`.

In [None]:
# use the "target_names" key to see the target labels

So we see that a 0 means the sample is malignant and a 1 means it's benign.
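
To make the encoding concrete, here is a minimal sketch that maps the numeric codes back to their names. It assumes the data is loaded with scikit-learn's `load_breast_cancer`; the names `label_names` and `first_labels` are made up for this illustration.

```python
from sklearn import datasets

# Assumption: the dataset comes from scikit-learn's built-in loader
cancer_data = datasets.load_breast_cancer()

# target_names is indexed by the encoded label:
# position 0 holds the name for 0, position 1 holds the name for 1
label_names = {code: str(name) for code, name in enumerate(cancer_data["target_names"])}
print(label_names)

# Translate the first few encoded targets into readable labels
first_labels = cancer_data["target_names"][cancer_data["target"][:5]]
print(first_labels)
```

Indexing `target_names` with the array of codes is a quick way to get readable labels without writing a loop.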

In [None]:
# create a new dataframe called "cancer_df" using the data we loaded.
# the columns are from the "feature_names" key
cancer_df = pd.DataFrame(...)

# add a new column, "target", from the "target" key
cancer_df["target"] = ...

cancer_df.head()

**The independent variables are measurements taken on the cell images, mostly measures of the cell nuclei.**

In [None]:
# view the shape of the dataframe

In [None]:
# check the counts of the "target" column. Hint: use value_counts(True)

So we see that 62.7% of the samples are benign and 37.3% are malignant.
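
As a rough check of those proportions, the sketch below recomputes them directly from the target array. It assumes the data comes from scikit-learn's `load_breast_cancer`; the variable `majority_baseline` is a hypothetical name added here to show why the class balance matters.

```python
import pandas as pd
from sklearn import datasets

# Assumption: the dataset comes from scikit-learn's built-in loader
cancer_data = datasets.load_breast_cancer()
target = pd.Series(cancer_data["target"], name="target")

# normalize=True returns proportions instead of raw counts
proportions = target.value_counts(normalize=True)
print(proportions.round(3))

# A model that always predicts the majority class would already reach this accuracy
majority_baseline = proportions.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any classifier we train should beat this majority-class baseline to be worth using.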

## <font color='#eb3483'> Logistic Regression in Scikit-learn </font>

In [None]:
# import the LogisticRegression class from scikit-learn
from sklearn.linear_model import ...

# import train_test_split from scikit-learn
from sklearn.model_selection import ...


In [None]:
LogisticRegression?

### <font color='#eb3483'> 1) Let's split our data set into training and testing sets </font>

In [None]:
X = ...
y = ...

# use 20% of the dataset as the testing set
X_train, X_test, y_train, y_test = ...

In [None]:
print('X_train: ',X_train.shape)
print('X_test: ',X_test.shape)
print('y_train: ',y_train.shape)
print('y_test: ',y_test.shape)

This estimator (also called a classifier, the name for estimators that solve classification problems) is used the same way as `LinearRegression`.
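
The shared interface can be sketched on a tiny made-up dataset. The values and the names `X_toy`, `y_number`, `y_class`, `regressor` and `classifier` are hypothetical and are not part of the breast cancer exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Tiny made-up dataset (hypothetical values, for illustration only)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_number = np.array([0.1, 0.9, 2.1, 2.9, 4.2, 5.0])  # continuous target
y_class = np.array([0, 0, 0, 1, 1, 1])               # binary target

# Both estimators follow the same fit / predict pattern
regressor = LinearRegression().fit(X_toy, y_number)
classifier = LogisticRegression().fit(X_toy, y_class)

number_prediction = regressor.predict([[2.5]])          # a number
class_predictions = classifier.predict([[0.5], [4.5]])  # class labels (0 or 1)
print(number_prediction, class_predictions)
```

Only the kind of target changes: the regressor returns a continuous value, while the classifier returns one of the class labels it saw during training.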

### <font color='#eb3483'>  2) Create and fit the logistic regression model </font>

In [None]:
model = ...
model.fit(...)

### <font color='#eb3483'> 3) See the model predictions </font>

In [None]:
predictions = ...

predictions[:10]

We see that the `predict` method directly outputs the predicted class (0 or 1). If we want to see the probability the model assigns to each class (that is, how confident the model is that the observation belongs to one class or the other), we can use the method `predict_proba`.
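
The relationship between the two methods can be sketched on a tiny made-up dataset. The data and the names `X_toy`, `y_toy`, `toy_model` and `most_likely` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset (hypothetical values, for illustration only)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
toy_model = LogisticRegression().fit(X_toy, y_toy)

# One row per observation, one column per class
probabilities = toy_model.predict_proba(X_toy)
print(toy_model.classes_)          # the column order of predict_proba
print(probabilities.round(2))
print(probabilities.sum(axis=1))   # each row sums to 1

# predict returns the class whose column holds the largest probability
most_likely = toy_model.classes_[probabilities.argmax(axis=1)]
print((most_likely == toy_model.predict(X_toy)).all())
```

The columns of `predict_proba` follow the order of `classes_`, so with labels 0 and 1 the second column is the probability of class 1.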

In [None]:
# np.set_printoptions(suppress=True) # uncomment this to suppress scientific notation
predictions_probabilities = ...
predictions_probabilities[:10]

# why are there two columns of results here?

In [None]:
# histogram of the predicted probability of class 0
ax = sns.histplot(predictions_probabilities[:,0], bins=10, kde=False, label='Prob0')
ax.set_xlabel('Prob0')

In [None]:
# histogram of the predicted probability of class 1
ax = sns.histplot(predictions_probabilities[:,1], bins=10, kde=False, label='Prob1')
ax.set_xlabel('Prob1')

We can see how the predicted class changes with the predicted probabilities.

In [None]:
probs_df = pd.DataFrame(predictions_probabilities)
probs_df = round(probs_df, 2)
probs_df.head()

We can check how the model computes the probabilities and assigns a class given a threshold.
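
As a rough illustration of where the probabilities come from, the sketch below applies the logistic (sigmoid) function to the model's linear score and compares the result with `predict_proba`. The toy data and the names `X_toy`, `y_toy`, `toy_model`, `scores` and `manual_prob_1` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset (hypothetical values, for illustration only)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
toy_model = LogisticRegression().fit(X_toy, y_toy)

# Linear score: intercept + coefficient * feature
scores = toy_model.decision_function(X_toy)

# The logistic (sigmoid) function squashes the score into a probability for class 1
manual_prob_1 = 1.0 / (1.0 + np.exp(-scores))

print(np.allclose(manual_prob_1, toy_model.predict_proba(X_toy)[:, 1]))
```

A score of 0 corresponds to a probability of exactly 0.5, which is why the sign of the score and a 0.5 cut-off lead to the same class.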

In [None]:
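# build a comparison table without overwriting the feature matrix X
results_df = X_test.reset_index().copy()
results_df["target"] = y_test.tolist()
results_df["prediction"] = predictions
results_df = pd.concat([results_df, probs_df], axis=1)
results_df[["target", "prediction", 0, 1]].head(20)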

By default, the threshold for this split is 0.5: an observation is assigned to class 1 when its predicted probability for class 1 is above 0.5.
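
If a different cut-off is needed, one option is to threshold the probabilities by hand. The sketch below is only an illustration: it assumes the data comes from `load_breast_cancer`, and the `random_state`, the raised `max_iter` and the 0.9 threshold are all arbitrary choices made here.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# max_iter is raised (an assumption) to help the solver converge on unscaled features
model = LogisticRegression(max_iter=10000).fit(X_train, y_train)

prob_benign = model.predict_proba(X_test)[:, 1]  # probability of class 1

# Reproduce predict(): class 1 when its probability is above 0.5
default_predictions = (prob_benign > 0.5).astype(int)
print((default_predictions == model.predict(X_test)).all())

# A stricter, hypothetical threshold: call a sample benign only when very confident
strict_predictions = (prob_benign > 0.9).astype(int)
print("benign at 0.5:", default_predictions.sum(), "| benign at 0.9:", strict_predictions.sum())
```

Raising the threshold can only reduce the number of samples labelled as class 1, which trades missed benign cases for fewer malignant samples mislabelled as benign.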

The logistic regression model has many hyperparameters.

In [None]:
# use get_params() to check the model hyperparameters

In [None]:
LogisticRegression?

The most important hyperparameters are:
- **penalty**: `l1` or `l2`, the type of regularization applied to the coefficients
- **fit_intercept**: `True` or `False`, whether to include an intercept term in the linear model
- **C**: float, the inverse of the regularization strength; the smaller the value, the more strongly the model regularizes the feature coefficients
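
The effect of `C` can be sketched by fitting the same model with a few values and comparing the size of the coefficients. This is only an illustration: the list of `C` values, the standardization step, the raised `max_iter` and the name `coef_sizes` are assumptions made here.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)

coef_sizes = {}
for C in [0.01, 1.0, 100.0]:
    # Features are standardized first (an assumption) so the coefficients are comparable
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=10000))
    pipeline.fit(X, y)
    coefficients = pipeline.named_steps["logisticregression"].coef_
    # Euclidean norm of the coefficient vector as a single measure of its size
    coef_sizes[C] = np.linalg.norm(coefficients)
    print(f"C={C:>6}: coefficient norm = {coef_sizes[C]:.2f}")
```

Smaller values of `C` shrink the coefficients toward zero, while larger values let the model fit the training data more freely.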

Extra reading on regularization: https://towardsdatascience.com/over-fitting-and-regularization-64d16100f45c and https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c