# Classification
- #### Classification is a problem in which the output variable $y$ can take only one of a small set of possible values, instead of any number in an infinite range as in regression.
- #### Linear regression is not a good algorithm for classification problems: its output is unbounded, so it cannot be read directly as a class probability.

## Logistic Regression
- #### Logistic regression is the regression-style model used for classification, where linear regression is not a good choice.

### Sigmoid/Logistic Function
<h4>

A sigmoid function is any mathematical function whose graph has a characteristic S-shaped or sigmoid curve.
$$
g(x) = \frac{1}{(1 + e^{-x})}
$$
<p align="center">
<img src="https://media.licdn.com/dms/image/D4D12AQGIXdSG7IJCNw/article-cover_image-shrink_600_2000/0/1694183259537?e=2147483647&v=beta&t=OtnfeqwCtKTSVrdKZdyOzNYECyLLZuEUIxkTfTQ0dS0">
</p>

- Sigmoid function in Logistic Regression
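
As a quick numerical check of the formula above, here is a minimal NumPy sketch. The function name `sigmoid` and the sample inputs are my own illustrative choices, not part of the original notebook. It shows that the outputs stay between 0 and 1, that $g(0) = 0.5$, and that $g(-x) = 1 - g(x)$.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid g(x) = 1 / (1 + e^(-x)), applied element-wise."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
values = sigmoid(xs)
print(values)

# g(0) is one half, and the curve is symmetric about that point
print(sigmoid(0.0))
print(np.allclose(sigmoid(-xs), 1.0 - values))
```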

### Logistic Regression Algorithm

<h4>

- First compute the linear combination $z = \vec{w} \cdot \vec{x} + b$

- This $z$ is then passed into the sigmoid function.

- $g(z) = \frac{1}{1 + e^{-z}}$

- Thus the model becomes:
$$
f_{\vec{w}, b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b) = \frac{1}{1 + e^{-(\vec{w} \cdot \vec{x} + b)}}
$$

- The logistic regression model outputs the **probability** that the class is 1 (Yes) for the input $\vec{x}$.
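
The steps above can be sketched in a few lines of NumPy. This is only an illustration: the parameters `w` and `b`, the input rows, the helper name `predict_proba`, and the 0.5 threshold are hypothetical values picked for the example, not fitted to any data.

```python
import numpy as np

def predict_proba(X, w, b):
    """Return f_{w,b}(x) = g(w . x + b) for every row x of X."""
    z = X @ w + b                      # linear part, one z per row
    return 1.0 / (1.0 + np.exp(-z))    # sigmoid maps z to a value between 0 and 1

# Hypothetical parameters and inputs, chosen only for illustration
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[0.0, 0.0],
              [3.0, 1.0],
              [-1.0, 2.0]])

probs = predict_proba(X, w, b)         # estimated P(y = 1 | x) per row
labels = (probs >= 0.5).astype(int)    # a common default threshold of 0.5
print(probs)
print(labels)
```

With a threshold of 0.5, a row is labelled 1 exactly when its $z$ is non-negative, because $g(0) = 0.5$.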

## Example: Logistic Regression on the Breast Cancer Dataset

In [34]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.preprocessing
from sklearn.datasets import load_breast_cancer
from sklearn import linear_model, model_selection
import seaborn as sns

<h5>

Breast cancer diagnostic dataset
--------------------------------------------

**Data Set Characteristics:**

- Number of Instances: 569

- Number of Attributes: 30 numeric, predictive attributes and the class

- Attribute Information:
    - radius (mean of distances from center to points on the perimeter)
    - texture (standard deviation of gray-scale values)
    - perimeter
    - area
    - smoothness (local variation in radius lengths)
    - compactness (perimeter^2 / area - 1.0)
    - concavity (severity of concave portions of the contour)
    - concave points (number of concave portions of the contour)
    - symmetry
    - fractal dimension ("coastline approximation" - 1)

    The mean, standard error, and "worst" or largest (mean of the three
    worst/largest values) of these features were computed for each image,
    resulting in 30 features.  For instance, field 0 is Mean Radius, field
    10 is Radius SE, field 20 is Worst Radius.

    - class:
            - WDBC-Malignant
            - WDBC-Benign

- Summary Statistics:


|              Attribute              |  Min  |  Max  |
| :---------------------------------: | :---: | :---: |
|radius (mean):                       | 6.981 | 28.11 |
|texture (mean):                      |  9.71 | 39.28 |
|perimeter (mean):                    | 43.79 | 188.5 |
|area (mean):                         | 143.5 | 2501.0|
|smoothness (mean):                   | 0.053 | 0.163 |
|compactness (mean):                  | 0.019 | 0.345 |
|concavity (mean):                    | 0.0   | 0.427 |
|concave points (mean):               | 0.0   | 0.201 |
|symmetry (mean):                     | 0.106 | 0.304 |
|fractal dimension (mean):            | 0.05  | 0.097 |
|radius (standard error):             | 0.112 | 2.873 |
|texture (standard error):            | 0.36  | 4.885 |
|perimeter (standard error):          | 0.757 | 21.98 |
|area (standard error):               | 6.802 | 542.2 |
|smoothness (standard error):         | 0.002 | 0.031 |
|compactness (standard error):        | 0.002 | 0.135 |
|concavity (standard error):          | 0.0   | 0.396 |
|concave points (standard error):     | 0.0   | 0.053 |
|symmetry (standard error):           | 0.008 | 0.079 |
|fractal dimension (standard error):  | 0.001 | 0.03  |
|radius (worst):                      | 7.93  | 36.04 |
|texture (worst):                     | 12.02 | 49.54 |
|perimeter (worst):                   | 50.41 | 251.2 |
|area (worst):                        | 185.2 | 4254.0|
|smoothness (worst):                  | 0.071 | 0.223 |
|compactness (worst):                 | 0.027 | 1.058 |
|concavity (worst):                   | 0.0   | 1.252 |
|concave points (worst):              | 0.0   | 0.291 |
|symmetry (worst):                    | 0.156 | 0.664 |
|fractal dimension (worst):           | 0.055 | 0.208 |


- Missing Attribute Values: None

- Class Distribution: 212 - Malignant, 357 - Benign

</h5>

In [39]:
# Load the dataset as a scikit-learn Bunch object
data_bunch = load_breast_cancer()

# Convert the Bunch data to a pandas DataFrame
data = pd.DataFrame(data=data_bunch.data, columns=data_bunch.feature_names)

# Keep the target labels (0 = malignant, 1 = benign) as a separate array
target_data = data_bunch.target
data

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [46]:
# Size of the whole feature DataFrame (the target is kept separately)
shape = data.shape # (569, 30)

# No. of Training Data Rows
training_rows = int(shape[0] * 0.8) # 455

# No. of Testing Data Rows
testing_rows = shape[0] - training_rows # 569 - 455 = 114

# Split the dataset sequentially (no shuffling) into Training (first 80%) and Testing (last 20%)
training_data = data.iloc[:training_rows]
testing_data = data.iloc[training_rows:]

# Dividing Target Data
training_target_data = target_data[:training_rows]
testing_target_data = target_data[training_rows:]
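
The slice-based split above takes the first 80% of rows for training and the last 20% for testing, so it depends on the row order of the dataset. A common alternative, sketched here as an assumption rather than what the notebook does, is scikit-learn's `train_test_split`, which shuffles the rows and can keep the class ratio similar in both parts via `stratify`. The variable names and `random_state=0` are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Shuffle, then hold out 20% of the rows for testing.
# stratify=y keeps the malignant/benign ratio similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```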

In [48]:
training_data

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.38,17.33,184.60,2019.0,0.16220,0.6656,0.7119,0.26540,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.99,23.41,158.80,1956.0,0.12380,0.1866,0.2416,0.18600,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.57,25.53,152.50,1709.0,0.14440,0.4245,0.4504,0.24300,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.91,26.50,98.87,567.7,0.20980,0.8663,0.6869,0.25750,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.54,16.67,152.20,1575.0,0.13740,0.2050,0.4000,0.16250,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
450,11.87,21.54,76.83,432.0,0.06613,0.10640,0.08777,0.02386,0.1349,0.06612,...,12.79,28.18,83.51,507.2,0.09457,0.3399,0.3218,0.08750,0.2305,0.09952
451,19.59,25.00,127.70,1191.0,0.10320,0.09871,0.16550,0.09063,0.1663,0.05391,...,21.44,30.96,139.80,1421.0,0.15280,0.1845,0.3977,0.14660,0.2293,0.06091
452,12.00,28.23,76.77,442.5,0.08437,0.06450,0.04055,0.01945,0.1615,0.06104,...,13.09,37.88,85.07,523.7,0.12080,0.1856,0.1811,0.07116,0.2447,0.08194
453,14.53,13.98,93.86,644.2,0.10990,0.09242,0.06895,0.06495,0.1650,0.06121,...,15.80,16.93,103.10,749.9,0.13470,0.1478,0.1373,0.10690,0.2606,0.07810


In [49]:
testing_data

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
455,13.380,30.72,86.34,557.2,0.09245,0.07426,0.02819,0.03264,0.1375,0.06016,...,15.050,41.61,96.69,705.6,0.11720,0.14210,0.07003,0.07763,0.2196,0.07675
456,11.630,29.29,74.87,415.1,0.09357,0.08574,0.07160,0.02017,0.1799,0.06166,...,13.120,38.81,86.04,527.8,0.14060,0.20310,0.29230,0.06835,0.2884,0.07220
457,13.210,25.25,84.10,537.9,0.08791,0.05205,0.02772,0.02068,0.1619,0.05584,...,14.350,34.23,91.29,632.9,0.12890,0.10630,0.13900,0.06005,0.2444,0.06788
458,13.000,25.13,82.61,520.2,0.08369,0.05073,0.01206,0.01762,0.1667,0.05449,...,14.340,31.88,91.06,628.5,0.12180,0.10930,0.04462,0.05921,0.2306,0.06291
459,9.755,28.20,61.68,290.9,0.07984,0.04626,0.01541,0.01043,0.1621,0.05952,...,10.670,36.92,68.03,349.9,0.11100,0.11090,0.07190,0.04866,0.2321,0.07211
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.560,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.41070,0.22160,0.2060,0.07115
565,20.130,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.32150,0.16280,0.2572,0.06637
566,16.600,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.34030,0.14180,0.2218,0.07820
567,20.600,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.93870,0.26500,0.4087,0.12400


In [50]:
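# Fit sklearn's LinearRegression on the 0/1 labels to obtain the linear score z.
# Note: this is a least-squares fit, not the logistic-loss fit used by LogisticRegression.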
model = linear_model.LinearRegression()
model.fit(training_data, training_target_data)

In [55]:
# Predict the linear score z for the test rows
predictions_z = model.predict(testing_data)

In [106]:
# Pass z through the sigmoid function
predictions = 1 / (1 + np.exp(-predictions_z))

# Apply the decision threshold: label 1 if g(z) >= 0.63, else 0 (vectorized)
predictions = np.where(predictions >= 0.63, 1, 0)

# Make sure the labels are stored as integers
predictions = predictions.astype(int)

predictions

array([1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 1])
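
Because the sigmoid is strictly increasing, thresholding $g(z)$ is equivalent to thresholding $z$ itself. As a worked check for the threshold of 0.63 used above:
$$
g(z) \ge 0.63 \iff \frac{1}{1 + e^{-z}} \ge 0.63 \iff e^{-z} \le \frac{0.37}{0.63} \iff z \ge \ln\frac{0.63}{0.37} \approx 0.532
$$
So the rows labelled 1 are those whose linear-regression output is at least about 0.53; the sigmoid changes the scale of the scores but not their ordering.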

In [107]:
testing_target_data

array([1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 1])

In [108]:
from sklearn.metrics import r2_score
# Note: R^2 is a regression metric; it is applied here to 0/1 labels, so read the value with care
r2_score(testing_target_data, predictions)

0.8505244755244756
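
`r2_score` is meant for regression targets. For 0/1 class labels, accuracy (the fraction of matching labels) and a confusion matrix are more conventional. The sketch below uses small made-up label arrays, not the notebook's data, to show the calls; `accuracy_score` and `confusion_matrix` are standard functions in `sklearn.metrics`. For reference, the predicted and true arrays printed above differ in 3 of 114 positions, which corresponds to an accuracy of 111/114 ≈ 0.974.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, r2_score

# Made-up labels for illustration only
y_true = np.array([1, 1, 0, 1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 1, 1, 0, 1, 0, 0])

acc = accuracy_score(y_true, y_pred)    # fraction of labels that match
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
r2 = r2_score(y_true, y_pred)           # regression metric, shown for contrast

print(acc)
print(cm)
print(r2)
```

On these toy labels 6 of 8 predictions match, so accuracy is 0.75, while R² comes out slightly negative, which shows how hard R² is to interpret for class labels.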

## The same classification example using sklearn's built-in LogisticRegression

In [110]:
log_regression = linear_model.LogisticRegression()
log_regression.fit(training_data, training_target_data)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [114]:
sklearn_predictions = log_regression.predict(testing_data)
sklearn_predictions

array([0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0,
       0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 1])

In [115]:
r2_score(testing_target_data, sklearn_predictions)

0.6013986013986015
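
To connect the formula $f_{\vec{w}, b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b)$ with scikit-learn, the sketch below fits `LogisticRegression` on a tiny made-up one-feature dataset and checks that, for a binary problem with default settings, applying the sigmoid to `decision_function` (the value $z$) reproduces the class-1 column of `predict_proba`. The data and variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, two classes
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

z = model.decision_function(X)            # z = w . x + b for each row
manual = 1.0 / (1.0 + np.exp(-z))         # sigmoid applied by hand
builtin = model.predict_proba(X)[:, 1]    # P(y = 1 | x) from scikit-learn

print(np.allclose(manual, builtin))
# predict() labels a row 1 when z > 0, i.e. when the probability exceeds 0.5
print(model.predict(X))
```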