# Modeling - Logistic Regression

What is it?
- a machine learning algorithm used for predicting categorical target variables
- Pipeline: Plan - Acquire - Prepare - Explore - **Model** - Deliver

Why do we care?
- once we build a model, we can use it to predict the target variable for new, unseen observations!

How does it work?
- [slides we already saw](https://docs.google.com/presentation/d/1uK_PLp_gjowSTUEIhPyJrniuHFXN4waR/edit?usp=sharing&ouid=110448495992573862737&rtpof=true&sd=true)

How do we use it?
- acquire, prepare, explore our data
- split data for modeling
- build models on train
    - create rules based on our input data
- evaluate models on train & validate
    - see how our rules work on unseen data
- pick the best-performing model on validate, and evaluate only that model on test
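The steps above can be sketched end to end in a few lines. This is only a rough outline, not the notebook's actual code: it assumes scikit-learn's bundled iris data as a stand-in for `acquire`/`prepare`, and the 60/20/20 split, the `random_state`, and the two candidate `C` values are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): scikit-learn's bundled iris dataset,
# with the target recoded as 1 = virginica, 0 = any other species.
X, y = load_iris(return_X_y=True, as_frame=True)
y = (y == 2).astype(int)

# Split: 20% test first, then 25% of the remainder as validate (60/20/20 overall).
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_validate, y_train, y_validate = train_test_split(
    X_train_val, y_train_val, test_size=0.25, stratify=y_train_val, random_state=42
)

# Build candidate models on train only.
models = {
    "C=1.0": LogisticRegression(C=1.0, max_iter=1000).fit(X_train, y_train),
    "C=0.01": LogisticRegression(C=0.01, max_iter=1000).fit(X_train, y_train),
}

# Evaluate every candidate on train and validate.
for name, model in models.items():
    print(name, model.score(X_train, y_train), model.score(X_validate, y_validate))

# Pick the candidate with the best validate accuracy and score it once on test.
best_name = max(models, key=lambda name: models[name].score(X_validate, y_validate))
test_accuracy = models[best_name].score(X_test, y_test)
print(best_name, test_accuracy)
```

Test is touched exactly once, at the very end, so it stays a fair estimate of performance on unseen data.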

In [1]:
import numpy as np
import pandas as pd

from sklearn.metrics import classification_report

import acquire
import prepare

#new import! 
from sklearn.linear_model import LogisticRegression

### Acquire

In [17]:
df = acquire.get_iris_data()

csv file found and loaded


In [18]:
df.head()

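|   | species_id | measurement_id | sepal_length | sepal_width | petal_length | petal_width | species_name |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 1 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 1 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 1 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |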


### Prepare

In [19]:
df = prepare.prep_iris(df)

In [20]:
df.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


### We will utilize binary classification

In [21]:
df.species.value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64

#### Predict if species is virginica or not

In [22]:
# Binary classification: predict whether the species is virginica or not
# encode virginica as 1 and the other species (setosa, versicolor) as 0
df['species'] = (df['species'] == 'virginica').astype(int)

In [23]:
df.species.value_counts()

0    100
1     50
Name: species, dtype: int64

In [24]:
#split my data
train, validate, test = prepare.my_train_test_split(df, 'species')
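`prepare.my_train_test_split` is a local helper whose code isn't shown here. A hypothetical stand-in might look like the sketch below. The function name, the 60/20/20 proportions, the stratification, and the seed are all assumptions (the proportions are guessed from the 90-row train set and 60/30 class counts that show up later).

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_validate_test(df, target, seed=42):
    """Hypothetical stand-in for prepare.my_train_test_split.

    Returns train (60%), validate (20%), and test (20%) dataframes,
    stratified on the target column so each keeps the same class balance.
    """
    train_validate, test = train_test_split(
        df, test_size=0.2, stratify=df[target], random_state=seed
    )
    train, validate = train_test_split(
        train_validate,
        test_size=0.25,  # 25% of the remaining 80% = 20% of the whole
        stratify=train_validate[target],
        random_state=seed,
    )
    return train, validate, test


# Toy data with the same class balance as above: 100 zeros and 50 ones.
toy = pd.DataFrame({"feature": range(150), "species": [0] * 100 + [1] * 50})
train, validate, test = split_train_validate_test(toy, "species")
print(train.shape, validate.shape, test.shape)
```

Stratifying on the target keeps the 2:1 class ratio in every split, which matters when one class is smaller than the other.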

In [25]:
train.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 117 | 7.7 | 3.8 | 6.7 | 2.2 | 1 |
| 133 | 6.3 | 2.8 | 5.1 | 1.5 | 1 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 141 | 6.9 | 3.1 | 5.1 | 2.3 | 1 |
| 61 | 5.9 | 3.0 | 4.2 | 1.5 | 0 |


### Explore

ONLY USING TRAIN!

I completed the following steps on my features and target variable:
1. hypothesize
2. visualize
3. analyze
4. summarize

These steps aren't written out here; however, I found that petal width and petal length were the features that best distinguished the species.

### Modeling

split into features and target variable
- need to do this on each of my train, validate, and test dataframes
- will end up with the following variables:
    - X_train, X_validate, X_test: all the features we plan to put into our model
    - y_train, y_validate, y_test: the target variable

In [26]:
X_train = train.drop(columns='species')
y_train = train.species

In [27]:
X_validate = validate.drop(columns='species')
y_validate = validate.species

In [28]:
X_test = test.drop(columns='species')
y_test = test.species

In [29]:
X_train.head()

|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 117 | 7.7 | 3.8 | 6.7 | 2.2 |
| 133 | 6.3 | 2.8 | 5.1 | 1.5 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 141 | 6.9 | 3.1 | 5.1 | 2.3 |
| 61 | 5.9 | 3.0 | 4.2 | 1.5 |


#### sklearn modeling process

1. create the thing! 
2. fit the thing!
3. use the thing! 

In [33]:
#create it
logit1 = LogisticRegression()
logit1

LogisticRegression()

In [34]:
#fit it
logit1.fit(X_train, y_train)

LogisticRegression()

In [35]:
#use it
logit1.score(X_train, y_train)

0.9888888888888889

In [36]:
#take a look at predictions
logit1.predict(X_train)

array([1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0])

In [41]:
# View predicted probabilities; columns are P(species = 0) and P(species = 1)
logit1.predict_proba(X_train).round(2)[:5]

array([[0.01, 0.99],
       [0.43, 0.57],
       [1.  , 0.  ],
       [0.16, 0.84],
       [0.88, 0.12]])
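To connect `predict_proba` back to `predict`: for a binary model, the predicted class is the one with the higher probability, which is the same as checking whether the second column is above 0.5. Here is a small sketch of that relationship. It assumes scikit-learn's bundled iris data as a stand-in and fits on the full dataset just to keep it short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Stand-in data (assumption): bundled iris, target recoded to 1 = virginica.
X, y = load_iris(return_X_y=True, as_frame=True)
y = (y == 2).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Each row holds one probability per class, ordered like model.classes_.
probabilities = model.predict_proba(X)
positive_probability = probabilities[:, 1]

# Applying a 0.5 cutoff by hand reproduces model.predict.
manual_predictions = (positive_probability > 0.5).astype(int)
print(model.classes_)
print(np.array_equal(manual_predictions, model.predict(X)))

# A stricter cutoff labels fewer observations as virginica.
strict_predictions = (positive_probability > 0.8).astype(int)
print(manual_predictions.sum(), strict_predictions.sum())
```

Working from the probabilities directly is handy when a false positive and a false negative don't cost the same, since the cutoff can be moved away from 0.5.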

In [43]:
#classification report
print(classification_report(y_train, logit1.predict(X_train)))

              precision    recall  f1-score   support

           0       1.00      0.98      0.99        60
           1       0.97      1.00      0.98        30

    accuracy                           0.99        90
   macro avg       0.98      0.99      0.99        90
weighted avg       0.99      0.99      0.99        90
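The numbers in this report can be traced back to four counts. Working backwards from the supports, recalls, and the 89/90 accuracy, the model must have made exactly one mistake: one non-virginica flower predicted as virginica. The sketch below rebuilds labels with those counts (inferred from the report, not taken from the notebook's actual predictions) and recomputes the metrics by hand.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Counts inferred from the report above (an inference, not notebook output):
# 60 actual zeros, of which 59 are predicted 0 and 1 is predicted 1;
# 30 actual ones, all predicted 1.
y_true = np.array([0] * 60 + [1] * 30)
y_pred = np.array([0] * 59 + [1] * 1 + [1] * 30)

# For binary labels, ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)          # of everything predicted 1, how much was right?
recall = tp / (tp + fn)             # of everything actually 1, how much did we catch?
accuracy = (tp + tn) / len(y_true)  # overall share of correct predictions

print(round(precision, 2), round(recall, 2), round(accuracy, 2))
```

These match the class 1 row (precision 0.97, recall 1.00) and the accuracy line (0.99) of the report.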



In [44]:
# coefficients, in the same order as X_train.columns
logit1.coef_

array([[ 0.00936447, -0.3395498 ,  2.46399837,  1.88864922]])

In [46]:
#columns
X_train.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], dtype='object')
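Lining the coefficients up with the column names makes them easier to read, and it also shows what the model does with them: it computes a weighted sum (the log-odds) and pushes it through the sigmoid function to get a probability. A minimal sketch, assuming scikit-learn's bundled iris data as a stand-in and renaming its columns to match the ones used here:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Stand-in data (assumption): bundled iris, columns renamed to match this notebook.
X, y = load_iris(return_X_y=True, as_frame=True)
X.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
y = (y == 2).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Label each coefficient with its feature name, largest magnitude first.
coefficients = pd.Series(model.coef_[0], index=X.columns)
print(coefficients.sort_values(key=abs, ascending=False))

# Log-odds = features . coefficients + intercept; sigmoid turns that into P(class 1).
log_odds = X.to_numpy() @ model.coef_[0] + model.intercept_[0]
probability = 1 / (1 + np.exp(-log_odds))
print(np.allclose(probability, model.predict_proba(X)[:, 1]))
```

One caution: coefficient sizes are only directly comparable when the features are on similar scales, so treat the ranking as a rough guide rather than a feature-importance score.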

### Change hyperparameter
#### Regularization:
- Keeps the model simple
- Constrains the coefficients


#### C = Inverse of regularization strength:

- Lower C means stronger regularization
- Lower C discourages the model from learning a more complex fit
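One way to see the effect of `C` is to fit the same model at a few values and watch the coefficients shrink as `C` gets smaller. This is a sketch only: it assumes scikit-learn's bundled iris data as a stand-in, the default L2 penalty, and an arbitrary set of `C` values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Stand-in data (assumption): bundled iris, target recoded to 1 = virginica.
X, y = load_iris(return_X_y=True, as_frame=True)
y = (y == 2).astype(int)

# Fit one model per C value and record the overall size of its coefficients.
coefficient_sizes = {}
for c in [100, 1, 0.01]:
    model = LogisticRegression(C=c, max_iter=5000).fit(X, y)
    coefficient_sizes[c] = float(np.linalg.norm(model.coef_))
    print(c, round(coefficient_sizes[c], 3), model.coef_.round(2))
```

With a very small `C` the coefficients are pulled close to zero, so the model leans toward predicting the majority class, which is the same pattern the `C=0.01` model shows in this notebook.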

In [48]:
# Change hyperparameter C = 0.01
logit2 = LogisticRegression(C=0.01)
logit2

LogisticRegression(C=0.01)

In [49]:
# fit the model
logit2.fit(X_train, y_train)

LogisticRegression(C=0.01)

In [51]:
# make prediction
logit2.predict(X_train)

array([1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0])

In [53]:
# score
logit2.score(X_train, y_train)

0.8

In [55]:
#classification report
print(classification_report(y_train, logit2.predict(X_train)))

              precision    recall  f1-score   support

           0       0.77      1.00      0.87        60
           1       1.00      0.40      0.57        30

    accuracy                           0.80        90
   macro avg       0.88      0.70      0.72        90
weighted avg       0.85      0.80      0.77        90



### Evaluate Model 1 and 2 performance on 'Validate'

In [56]:
logit1.score(X_validate, y_validate)

0.9666666666666667

In [57]:
logit2.score(X_validate, y_validate)

0.8333333333333334
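Putting the four accuracy scores side by side makes the comparison easier. The sketch below just tabulates the numbers printed above (rounded to four places); the column names are made up for illustration.

```python
import pandas as pd

# Accuracy values copied from the notebook outputs above, rounded to 4 places.
scores = pd.DataFrame(
    {
        "model": ["logit1 (default C=1.0)", "logit2 (C=0.01)"],
        "train": [0.9889, 0.8000],
        "validate": [0.9667, 0.8333],
    }
)

# A large positive gap between train and validate would hint at overfitting.
scores["train_minus_validate"] = scores["train"] - scores["validate"]
print(scores)

# Choose the model with the highest validate accuracy.
best_model = scores.sort_values("validate", ascending=False).iloc[0]["model"]
print(best_model)
```

One reasonable reading: logit1 loses only about 0.02 going from train to validate, while logit2 scores low on both, so logit1 is the one to carry forward to test.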