# Logistic Regression 

## Import Dependencies & Data

In [5]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
%matplotlib inline 
import seaborn as sns
sns.set()

from sklearn import datasets
from sklearn.model_selection import train_test_split

In [6]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

In [7]:
iris_data = load_iris()
print(iris_data.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [10]:
X = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)
y = pd.DataFrame(data=iris_data.target, columns=["iris_type"])

# features
X.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [11]:
# target variable
y.head()

|   | iris_type |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


In [22]:
X.describe()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


## Train Test Split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# y is a single-column DataFrame; flatten it to a 1-D array, the shape scikit-learn expects for targets.
y_train, y_test = np.ravel(y_train), np.ravel(y_test)
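
To see what the flattening step changes, the sketch below repeats the split and compares the target's shape before and after `np.ravel`. The names `shape_before`, `shape_after`, and `y_train_flat` are introduced here for illustration only, and the split relies on the default test size of `train_test_split`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_data = load_iris()
X = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
y = pd.DataFrame(iris_data.target, columns=["iris_type"])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# As a single-column DataFrame, the target is 2-D: (n_samples, 1).
shape_before = y_train.shape

# np.ravel flattens it to a 1-D array of length n_samples.
y_train_flat = np.ravel(y_train)
shape_after = y_train_flat.shape

print(shape_before, shape_after)
```

Fitting on the 2-D version usually still works, but scikit-learn tends to warn that a column vector was passed where a 1-D array was expected.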

## Logistic Regression 

In [13]:
logreg = LogisticRegression(random_state=0, solver="lbfgs", multi_class="multinomial")
logreg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

- Rather than passing new data to the model by hand, we'll give it X_test, which it has not seen during training.
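
For contrast, passing a new sample by hand might look like the sketch below. The measurements in `new_flower` are made-up values chosen for illustration, and `max_iter=1000` is an assumption added to reduce the chance of a convergence warning; neither comes from the notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris_data = load_iris()
X = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
y = iris_data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
logreg = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)

# Hypothetical measurements (in cm) for a single flower, typed in by hand.
# Reusing the training column names keeps the input consistent with the fit.
new_flower = pd.DataFrame([[5.0, 3.4, 1.5, 0.2]], columns=iris_data.feature_names)

pred = logreg.predict(new_flower)            # array with one class index
pred_name = iris_data.target_names[pred[0]]  # map the index to a species name
print(pred, pred_name)
```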

In [15]:
y_pred = logreg.predict(X_test)

In [17]:
y_pred.shape

(38,)

In [18]:
y_test.shape

(38,)

- We can also call .predict_proba() to get the predicted probability of each class for every sample, rather than the assigned class labels. Each row sums to 1, and the columns follow the order of logreg.classes_.

In [20]:
logreg.predict_proba(X_test)

array([[1.17922388e-04, 5.61467204e-02, 9.43735357e-01],
       [1.26292326e-02, 9.60453076e-01, 2.69176916e-02],
       [9.84397411e-01, 1.56025501e-02, 3.85521957e-08],
       [1.25169336e-06, 2.31508254e-02, 9.76847923e-01],
       [9.70234900e-01, 2.97649377e-02, 1.62569141e-07],
       [2.01665199e-06, 5.94446983e-03, 9.94053514e-01],
       [9.81899507e-01, 1.81004229e-02, 7.04287691e-08],
       [2.84243189e-03, 7.47093540e-01, 2.50064028e-01],
       [1.50916144e-03, 7.38518909e-01, 2.59971930e-01],
       [2.05290442e-02, 9.35891937e-01, 4.35790187e-02],
       [9.22384402e-05, 1.59464222e-01, 8.40443540e-01],
       [6.98633156e-03, 8.09996346e-01, 1.83017322e-01],
       [4.08224311e-03, 7.93600620e-01, 2.02317137e-01],
       [3.05683925e-03, 7.60908560e-01, 2.36034601e-01],
       [3.87703025e-03, 7.10277309e-01, 2.85845661e-01],
       [9.82815578e-01, 1.71843652e-02, 5.65333914e-08],
       [6.72908291e-03, 7.56467769e-01, 2.36803149e-01],
       [1.14293523e-02, 8.45109
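
The relationship between these probabilities and the predicted labels can be checked with a short sketch. It refits a model on the same split (with `max_iter=1000` as an added assumption, and without the `multi_class` argument, which newer scikit-learn releases deprecate), then confirms that each row sums to 1 and that the most probable column gives the same label as `.predict()`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(random_state=0, max_iter=1000)
logreg.fit(X_train, y_train)

proba = logreg.predict_proba(X_test)  # shape: (n_test_samples, n_classes)
row_sums = proba.sum(axis=1)          # each row is a probability distribution

# Take the most probable column per row and map it back to a class label.
labels_from_proba = logreg.classes_[np.argmax(proba, axis=1)]
y_pred = logreg.predict(X_test)

print(np.allclose(row_sums, 1.0))
print(np.array_equal(labels_from_proba, y_pred))
```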

- The next step is to measure the model's accuracy on the test set.

We can do this by calling .score() on the model and passing it the test features and the true test labels.

In [21]:
logreg.score(X_test,y_test)

0.9736842105263158

- A common mistake is to pass y_pred rather than X_test as the first argument to .score().

- .score() runs the prediction step itself, so it needs the test features (X_test) and the true test labels (y_test), not predictions you have already computed.
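
A minimal sketch of that equivalence: for a classifier, `.score(X_test, y_test)` should match `accuracy_score` applied to the true labels and the model's own predictions. The variable names `score_via_model` and `score_via_metric` are illustrative, and `max_iter=1000` is an added assumption.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)

# .score() predicts on X_test internally, then compares against y_test.
score_via_model = logreg.score(X_test, y_test)

# The same number, computed in two explicit steps.
y_pred = logreg.predict(X_test)
score_via_metric = accuracy_score(y_test, y_pred)

print(score_via_model, score_via_metric)
```

If you already have `y_pred`, `accuracy_score(y_test, y_pred)` is the function to reach for; `.score()` always starts from the features.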

Now that we have our score, let's put it in context. A score is only meaningful if it beats a baseline, which here can be taken as the accuracy of assigning class labels at random.

- In this case, since we have three equally represented class labels, random guessing has a 33.33% chance of assigning any label correctly.

- So let's compare that to our obtained score of about 97.4%.

- Did our logistic classification score beat the baseline? Yes, by a wide margin: the model misclassified only 1 of the 38 test samples.
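
The baseline comparison can be sketched in code as follows. This is one possible approach rather than the notebook's own: it computes the uniform-chance baseline from the number of classes and, as an extra reference point not used in the text, scores a `DummyClassifier` that always predicts the most frequent training class. `max_iter=1000` is an added assumption.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chance of guessing right when picking one of the classes uniformly at random.
n_classes = len(np.unique(y))
chance_baseline = 1.0 / n_classes

# A second baseline: always predict the most frequent class in the training set.
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
majority_score = majority.score(X_test, y_test)

logreg = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)
model_score = logreg.score(X_test, y_test)

print(f"chance: {chance_baseline:.4f}")
print(f"majority class: {majority_score:.4f}")
print(f"logistic regression: {model_score:.4f}")
print("beats chance:", model_score > chance_baseline)
```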