# Binary Classification with Logistic Regression

In this module, you will learn how to perform binary classification with logistic regression in scikit-learn. We will work with the iris dataset, a classic benchmark for classification algorithms, and then use the accuracy score to evaluate the model we train.

<b>Functions and attributes in this lecture: </b>
- `pandas` - Pandas package with alias `pd`
  - `.value_counts()` - Get the value distribution for the pandas series
  - `.corr()` - Get the correlation matrix for a pandas dataframe
- `sklearn.linear_model` - Submodule for linear models
  - `LogisticRegression()` - The logistic regression model
    - `.fit()` - Training the model on the data
    - `.predict()` - Predicting on new data using the model
    - `.predict_proba()` - Get the predicted class probabilities for new data using the model
- `sklearn.metrics` - Submodule for metrics used to evaluate models
  - `accuracy_score()` - Finding the accuracy score for a set of predictions
- `sklearn.datasets` - Submodule of sklearn for toy datasets
  - `load_iris()` - A function for loading the iris dataset
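
As a quick orientation, the calls listed above can be tried on a tiny made-up dataset. This is only a sketch: the `toy` frame, its values, and every variable name here are invented for illustration and are not part of the lecture data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Tiny made-up dataset: one feature, two classes (illustrative values only)
toy = pd.DataFrame({"feature": [0.0, 1.0, 2.0, 3.0], "label": [0, 0, 1, 1]})

# pandas helpers from the list above
class_counts = toy["label"].value_counts()  # number of rows per class
correlations = toy.corr()                   # pairwise Pearson correlations

# Fit the model, then predict hard labels and class probabilities
toy_model = LogisticRegression()
toy_model.fit(toy[["feature"]], toy["label"])
toy_labels = toy_model.predict(toy[["feature"]])
toy_probabilities = toy_model.predict_proba(toy[["feature"]])  # shape (n_samples, n_classes)

# Fraction of predicted labels that match the true labels
toy_accuracy = accuracy_score(toy["label"], toy_labels)

print(class_counts, correlations, toy_probabilities, toy_accuracy, sep="\n\n")
```

Note that the accuracy here is computed on the same rows the model was trained on, so it only demonstrates the call and is not a meaningful evaluation.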

In [2]:
# Non-sklearn packages
import numpy as np
import pandas as pd

# Sklearn modules & functions
from sklearn import datasets
from sklearn.model_selection import train_test_split

## Working with the Iris Dataset

Let us begin by importing the iris dataset and checking it out!

In [12]:
# Loading the Iris dataset
X, y = datasets.load_iris(return_X_y=True, as_frame=True)

# Some info about the dataset
print(datasets.load_iris()["DESCR"])

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

In [None]:
# Get a statistical summary of the features
X.describe()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


In [15]:
# Check out the datatypes of the features
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB


In [16]:
# Checking the values that the output can take
y.value_counts()

target
0    50
1    50
2    50
Name: count, dtype: int64

In [18]:
# Keeping only the last two classes (versicolor and virginica) and relabeling them as 0 and 1
X = X[50:]
y = y[50:] - 1
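
Positional slicing with `X[50:]` relies on the rows being sorted by class, and re-running the cell slices the already-sliced data again. A minimal sketch of an alternative, assuming a fresh load and the hypothetical names `X_two` and `y_two`, selects rows by label value instead:

```python
from sklearn import datasets

# Fresh load, so the result does not depend on how often earlier cells were run
X_all, y_all = datasets.load_iris(return_X_y=True, as_frame=True)

# Boolean mask: keep only rows labeled 1 (versicolor) or 2 (virginica)
is_kept = y_all.isin([1, 2])

# Leave X_all and y_all untouched and store the subset under new names
X_two = X_all[is_kept]
y_two = y_all[is_kept] - 1  # relabel 1 -> 0 and 2 -> 1

print(X_two.shape)
print(y_two.value_counts())
```

Because the original variables are not overwritten, this version can be executed repeatedly and always yields the same 100 rows.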

In [20]:
# Collect all the variables
all_variables = pd.concat([X, y], axis=1)

In [21]:
# Checking the correlation
all_variables.corr()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| sepal length (cm) | 1.0 | 0.457228 | 0.864225 | 0.281108 | |
| sepal width (cm) | 0.457228 | 1.0 | 0.401045 | 0.537728 | |
| petal length (cm) | 0.864225 | 0.401045 | 1.0 | 0.322108 | |
| petal width (cm) | 0.281108 | 0.537728 | 0.322108 | 1.0 | |
| target | | | | | |
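
The `target` row and column in the output above are empty. `.corr()` returns missing values for a column with zero variance, and one possible cause is that the slicing cell was run more than once, leaving a single class. The following sketch assumes a fresh load and uses hypothetical names (`binary_table`, `target_corr`) to compute how each feature correlates with the binary target:

```python
import pandas as pd
from sklearn import datasets

# Fresh load so the result is independent of the notebook's current state
X_fresh, y_fresh = datasets.load_iris(return_X_y=True, as_frame=True)

# Keep classes 1 and 2 and relabel them as 0 and 1
keep_rows = y_fresh.isin([1, 2])
binary_table = pd.concat([X_fresh[keep_rows], y_fresh[keep_rows] - 1], axis=1)

# Correlation of each feature with the target, strongest first
target_corr = (
    binary_table.corr()["target"]
    .drop("target")
    .sort_values(ascending=False)
)
print(target_corr)
```

With both classes present, the target has nonzero variance, so every entry in `target_corr` is a number.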


In [None]:
# A small visualization
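
The cell above is empty in the source. One possible visualization, offered only as a sketch, is a scatter plot of the two petal measurements colored by class. It assumes `matplotlib` is installed; the choice of features and all variable names are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

# Fresh load of the two classes used in this module
X_plot, y_plot = datasets.load_iris(return_X_y=True, as_frame=True)
keep_plot = y_plot.isin([1, 2])
X_plot, y_plot = X_plot[keep_plot], y_plot[keep_plot] - 1

fig, ax = plt.subplots(figsize=(6, 4))

# One scatter call per class so that each class gets its own legend entry
for label, name in [(0, "versicolor"), (1, "virginica")]:
    rows = y_plot == label
    ax.scatter(
        X_plot.loc[rows, "petal length (cm)"],
        X_plot.loc[rows, "petal width (cm)"],
        label=name,
        alpha=0.7,
    )

ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
ax.set_title("Petal measurements by class")
ax.legend()
```

In a notebook the figure is rendered at the end of the cell; in a plain script, a call to `plt.show()` would be needed.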

## Logistic Regression

We will now train a logistic regression model to classify iris flowers into the two remaining species.
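
A minimal sketch of that workflow follows. It makes several assumptions that the text does not specify: a fresh load of the data, a 25% stratified test split, `random_state=0`, and `max_iter=1000`. All variable names are illustrative.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Fresh load so the sketch does not depend on earlier cells
X_full, y_full = datasets.load_iris(return_X_y=True, as_frame=True)
keep = y_full.isin([1, 2])
X_bin, y_bin = X_full[keep], y_full[keep] - 1

# Hold out a quarter of the rows for evaluation, preserving the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X_bin, y_bin, test_size=0.25, stratify=y_bin, random_state=0
)

# Train on the training rows only
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Hard labels, class probabilities, and accuracy on the held-out rows
test_predictions = clf.predict(X_test)
test_probabilities = clf.predict_proba(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Evaluating on rows the model has not seen gives a more honest estimate than scoring the training data, which is why the split comes before `.fit()`.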