## **Logistic regression**
Logistic regression is another form of regression, different from the classical linear regression we saw in the first lesson. While linear regression predicts continuous values, logistic regression outputs probabilities, which can take any value from 0 to 1. Despite being called regression, it is mainly used for binary classification tasks. In **binary classification** we have two classes: **class 0** and **class 1**.

In logistic regression we perform binary classification by assigning an observation to class 1 if its predicted probability of belonging to class 1 is greater than **0.5**, and to class 0 otherwise. In linear regression we fit a straight line to our data points, but in logistic regression we fit a logistic function, which has an S-shape. Watch this cool video from [statquest](https://www.youtube.com/watch?v=yIYKR4sgzI8).
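
To make the S-shaped function and the 0.5 threshold concrete, here is a minimal sketch using NumPy. The helper name `logistic` and the input values in `z` are made up for illustration; they are not part of scikit-learn.

```python
import numpy as np


def logistic(z):
    """Logistic (sigmoid) function: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical scores; in a fitted model these would come from the linear part.
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = logistic(z)

# Assign class 1 when the probability is greater than 0.5, else class 0.
classes = (probs > 0.5).astype(int)

print(probs)
print(classes)
```

Large negative inputs give probabilities close to 0, large positive inputs give probabilities close to 1, and an input of 0 gives exactly 0.5.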

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

We import the breast cancer dataset, standardize it, and then build a logistic regression model.

In [2]:
bc_data = load_breast_cancer()
x = bc_data.data # data
y = bc_data.target # labels
print(x.shape)
print(y.shape)
print(bc_data.feature_names) # use this to get names of variables
print(bc_data.target_names) # use this to find the various classes

(569, 30)
(569,)
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
['malignant' 'benign']


We standardize the data using StandardScaler, so that each variable has zero mean and unit variance.

In [3]:
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
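
As a rough illustration of what the scaler does, the sketch below recomputes the standardization by hand (subtract each column's mean, divide by its standard deviation) and compares it with the output of `StandardScaler`. The variable name `manual` is made up for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

x = load_breast_cancer().data

# Standardize with scikit-learn.
x_scaled = StandardScaler().fit_transform(x)

# Standardize by hand: per-column mean and (population) standard deviation.
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# Both approaches should agree up to floating point error.
print(np.allclose(x_scaled, manual))

# After scaling, every column has mean close to 0 and standard deviation close to 1.
print(x_scaled.mean(axis=0).round(6)[:3])
print(x_scaled.std(axis=0).round(6)[:3])
```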


We split our data into a training set and a test set, then build our logistic regression model. We use a special kind of splitting called stratified splitting. It ensures that the training and test sets each contain the two classes in the same proportions as the full dataset.

In [4]:
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, stratify=y)
logreg_clf = LogisticRegression()
logreg_clf.fit(x_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
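
To see what stratified splitting does, the following sketch prints the fraction of each class in the full dataset, the training set and the test set. The `random_state=0` argument is an assumption added here only to make the example reproducible; it is not used in the cell above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

x, y = load_breast_cancer(return_X_y=True)

# random_state=0 is an arbitrary choice so the split is the same on every run.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0
)

# Fraction of class 0 and class 1 in each set.
for name, labels in [("full", y), ("train", y_train), ("test", y_test)]:
    print(name, np.bincount(labels) / len(labels))
```

The three lines of output should show nearly identical class fractions, which is the point of passing `stratify=y`.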

Let us predict the class of one observation from our test set.

In [5]:
predicted_class = logreg_clf.predict(x_test[5].reshape(1, -1))
predicted_prob = logreg_clf.predict_proba(x_test[5].reshape(1, -1))
print('Predicted class = ',predicted_class, ' and Predicted probabilities = ',predicted_prob, 'Real class = ', y_test[5])

Predicted class =  [1]  and Predicted probabilities =  [[0.00271056 0.99728944]] Real class =  1


From the above we see that our logistic regression model predicts a much higher probability for class 1 (**probability ≈ 0.997**) than for class 0, so it predicts class 1, which matches the real class. The **predict** function above gives us the class prediction, while the **predict_proba** function gives us the probabilities of belonging to class 0 and class 1.
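
The link between the two functions can be checked with a short sketch: picking the class with the highest probability from **predict_proba** should give the same result as **predict**. The names `clf`, `proba` and `from_proba`, and the `random_state=0` argument, are assumptions made for this self-contained example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

x, y = load_breast_cancer(return_X_y=True)
x_scaled = StandardScaler().fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(
    x_scaled, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression().fit(x_train, y_train)

# One row per observation, one column per class (class 0, class 1).
proba = clf.predict_proba(x_test)

# Pick the class with the highest probability in each row.
from_proba = clf.classes_[np.argmax(proba, axis=1)]

print(np.allclose(proba.sum(axis=1), 1.0))               # each row sums to 1
print(np.array_equal(from_proba, clf.predict(x_test)))   # same as predict
```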

Let us now predict the classes for all the observations in our test set. We will also see how well we perform in our predictions by measuring our accuracy.

In [8]:
predictions_test = logreg_clf.predict(x_test)
predictions_train = logreg_clf.predict(x_train)
accuracy_test = accuracy_score(y_test, predictions_test)*100
accuracy_train = accuracy_score(y_train, predictions_train)*100
print(accuracy_test, accuracy_train)
print(predictions_test)

98.24561403508771 98.68131868131869
[1 1 1 1 1 1 0 1 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 1 1 1 0 0 0 0 1 0 0 1 0 1 1
 1 1 1 1 0 1 1 0 1 0 0 1 1 1 0 0 0 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 1 0
 1 0 1 1 1 0 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 1 0 1
 1 1 1]


In [7]:
y_true = [1, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1, 1]
print(accuracy_score(y_true, y_pred))

0.8333333333333334


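**How to calculate accuracy**

accuracy = (true positives + true negatives) / n

where n is the total number of observations. In the small example above, 5 of the 6 predictions are correct, so the accuracy is 5/6 ≈ 0.83.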
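
As a minimal sketch, the same calculation can be done by hand from the confusion matrix and compared with `accuracy_score`. It reuses the small `y_true` and `y_pred` lists from the cell above; the name `manual_accuracy` is made up for this example.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1, 1]

# For binary labels [0, 1], ravel() returns the counts in the order
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Accuracy = correct predictions / all predictions.
manual_accuracy = (tp + tn) / len(y_true)

print(tn, fp, fn, tp)
print(manual_accuracy, accuracy_score(y_true, y_pred))
```

Here there are 5 true positives, 0 true negatives and 1 false positive, so both values come out to 5/6.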





On our test data alone we get about 98% accuracy in the run shown above. That is impressive, right?

**What you should have learned.**

1. What logistic regression is.
2. What it is used for.
3. How to build a logistic regression classifier with scikit-learn.

**Your turn !!!**

Pick a dataset of your choice, build a logistic regression model on it, and show how much accuracy you can achieve. You can use all the variables or select just some of them.

**Hint**

Sklearn provides some datasets if you wish to use one from there. Have a look at [sklearn_datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html).