# Modeling - Logistic Regression

What is it?
- a machine learning algorithm used for predicting categorical target variables
- Pipeline: Plan - Acquire - Prepare - Explore - **Model** - Deliver

Why do we care?
- once we build a model, we can use it to predict the target for new observations!

How does it work?
- [slides we already saw](https://docs.google.com/presentation/d/1uK_PLp_gjowSTUEIhPyJrniuHFXN4waR/edit?usp=sharing&ouid=110448495992573862737&rtpof=true&sd=true)

How do we use it?
- acquire, prepare, explore our data
- split data for modeling
- build models on train
    - create rules based on our input data
- evaluate models on train & validate
    - see how our rules work on unseen data
- pick the best model, and evaluate only that model on test

In [70]:
import numpy as np
import pandas as pd
import math

from sklearn.metrics import classification_report, confusion_matrix

import acquire
import prepare
# new import!
from sklearn.linear_model import LogisticRegression


### Acquire

In [71]:
df = acquire.get_iris_data()

In [72]:
df.head()


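| | Unnamed: 0 | species_id | measurement_id | sepal_length | sepal_width | petal_length | petal_width | species_name |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 1 | 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 2 | 1 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 3 | 1 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 4 | 1 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |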


### Prepare

In [73]:
df = prepare.prep_iris(df)

In [74]:
df.head()


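| | sepal_length | sepal_width | petal_length | petal_width | species | species_setosa | species_versicolor | species_virginica |
|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 | 0 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 | 0 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 | 0 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 | 0 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1 | 0 | 0 |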


### We will utilize binary classification

#### Predict if species is virginica or not

In [75]:
df.species.value_counts()





train, validate, test = prepare.split_data(df, 'species')

train.species.value_counts()

versicolor    30
virginica     30
setosa        30
Name: species, dtype: int64

In [25]:
# Binary classification - predict if species is non-virginica or virginica
# change setosa and versicolor to 0 and virginica to 1

    sepal_length  sepal_width  petal_length  petal_width  species  \
89           5.5          2.5           4.0          1.3        0   
3            4.6          3.1           1.5          0.2        0   
46           5.1          3.8           1.6          0.2        0   
80           5.5          2.4           3.8          1.1        0   
10           5.4          3.7           1.5          0.2        0   
..           ...          ...           ...          ...      ...   
94           5.6          2.7           4.2          1.3        0   
69           5.6          2.5           3.9          1.1        0   
27           5.2          3.5           1.5          0.2        0   
38           4.4          3.0           1.3          0.2        0   
75           6.6          3.0           4.4          1.4        0   

    species_setosa  species_versicolor  species_virginica  
89               0                   1                  0  
3                1                   0             
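The code for the cell above is not shown, only its comments and part of its output. One way to get a 0/1 `species` column like the one in that output is sketched below. The tiny `train_demo` dataframe is a hypothetical stand-in, and comparing against the string `'virginica'` is an assumption about how the original cell worked.

```python
import pandas as pd

# Hypothetical stand-in for the prepared dataframe (not the real iris data)
train_demo = pd.DataFrame({
    "petal_width": [0.2, 1.3, 2.1, 1.8],
    "species": ["setosa", "versicolor", "virginica", "virginica"],
})

# 1 when the species is virginica, 0 for setosa and versicolor
train_demo["species"] = (train_demo["species"] == "virginica").astype(int)

print(train_demo)
```

The comparison gives a boolean column, and `astype(int)` turns `True`/`False` into 1/0.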

In [76]:
#split my data
train, validate, test = prepare.split_data(df, 'species')
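`prepare.split_data` lives in a helper module that is not shown here. A rough sketch of what a function like it might do is below, using scikit-learn's `train_test_split` with stratification. The name `split_data_sketch`, the `seed` parameter, and the 60/20/20 proportions are all assumptions; the proportions were picked because they give 90 training rows with 30 per species, which matches the counts printed earlier.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_sketch(df, target, seed=123):
    """Hypothetical stand-in for prepare.split_data: stratified 60/20/20 split."""
    # First carve off 20% of the rows as the test set
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # Then take 25% of what is left (20% of the original) as validate
    train, validate = train_test_split(
        train_validate,
        test_size=0.25,
        random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test


# Made-up dataframe with the same class balance as iris (50 rows per species)
demo = pd.DataFrame({
    "petal_width": range(150),
    "species": ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50,
})

train_s, validate_s, test_s = split_data_sketch(demo, "species")
print(len(train_s), len(validate_s), len(test_s))
```

Stratifying on the target keeps the same share of each species in all three pieces.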

## Explore 

ONLY USING TRAIN!

completed the following steps on my features and target variable
1. hypothesize
2. visualize
3. analyze
4. summarize

These steps aren't written out here; however, I found that petal width and petal length distinguished the species best.

### Modeling

split into features and target variable
- need to do this on my train, validate, and test dataframe
- will end up with the following variables:
    - X_train, X_validate, X_test: all the features we plan to put into our model
    - y_train, y_validate, y_test: the target variable

In [82]:
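# Note: dropping only 'species' leaves the species_* dummy columns in X_train,
# including species_virginica, which is the target itself (target leakage)
X_train = train.drop(columns='species')
y_train = train.species_virginica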

In [83]:
X_validate = validate.drop(columns='species')
y_validate = validate.species_virginica

In [86]:
X_test = test.drop(columns='species')
y_test = test.species_virginica
y_train.value_counts()

0    60
1    30
Name: species_virginica, dtype: int64
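Dropping only `species` keeps the three `species_*` dummy columns in the features, so the model can read the answer straight from `species_virginica`. A minimal sketch of a feature/target split that avoids this is below. The miniature `train_demo` dataframe and the names `X_demo` and `y_demo` are made up for illustration; the column names follow the prepared dataframe shown earlier.

```python
import pandas as pd

# Hypothetical miniature of the prepared dataframe
train_demo = pd.DataFrame({
    "sepal_length": [5.1, 6.4, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "petal_length": [1.4, 4.5, 6.0],
    "petal_width": [0.2, 1.5, 2.5],
    "species": ["setosa", "versicolor", "virginica"],
    "species_setosa": [1, 0, 0],
    "species_versicolor": [0, 1, 0],
    "species_virginica": [0, 0, 1],
})

target = "species_virginica"

# Every column that starts with 'species' is either the label or derived from it
label_cols = [col for col in train_demo.columns if col.startswith("species")]

X_demo = train_demo.drop(columns=label_cols)  # measurements only
y_demo = train_demo[target]                   # 1 = virginica, 0 = not virginica

print(X_demo.columns.tolist())
```

With only the four measurement columns as features, the scores would likely differ from the ones shown in this notebook, since the model could no longer lean on the dummy columns.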

#### sklearn modeling process

1. create the object
2. fit the thing
3. use the thing

In [80]:
logit1 = LogisticRegression()
logit1

In [87]:
logit1.fit(X_train, y_train)

In [88]:
logit1.score(X_train, y_train)

1.0

In [89]:
#take a look at predictions
logit1.predict(X_train)

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 0], dtype=uint8)

In [90]:
# View raw probabilities (output from the model)
logit1.predict_proba(X_train).round(2)

array([[0.88, 0.12],
       [0.13, 0.87],
       [0.94, 0.06],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [0.97, 0.03],
       [1.  , 0.  ],
       [0.88, 0.12],
       [0.9 , 0.1 ],
       [0.84, 0.16],
       [0.99, 0.01],
       [0.97, 0.03],
       [0.05, 0.95],
       [0.02, 0.98],
       [0.13, 0.87],
       [0.93, 0.07],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [0.97, 0.03],
       [0.95, 0.05],
       [0.05, 0.95],
       [0.02, 0.98],
       [0.13, 0.87],
       [0.02, 0.98],
       [0.94, 0.06],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [0.92, 0.08],
       [0.04, 0.96],
       [0.97, 0.03],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [0.06, 0.94],
       [0.04, 0.96],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [0.95, 0.05],
       [0.97, 0.03],
       [1.  , 0.  ],
       [0.96, 0.04],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [1.  , 0.  ],
       [0.97, 0.03],
       [1.  , 0.  ],
       [1.  ,
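Each row of `predict_proba` holds the probability of class 0 and then class 1, and the two add up to 1. For a binary target, `predict` returns class 1 when the class 1 probability is above 0.5. The sketch below checks this on made-up one-feature data; `X_demo`, `y_demo`, `model`, and `manual` are illustrative names, not part of this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data: larger values belong to class 1
X_demo = np.array([[0.2], [0.4], [1.3], [1.5], [1.8], [2.1], [2.3], [2.5]])
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X_demo, y_demo)

# Columns follow model.classes_, so column 0 is P(class 0) and column 1 is P(class 1)
proba = model.predict_proba(X_demo)

# Rebuild the hard predictions by thresholding P(class 1) at 0.5
manual = (proba[:, 1] > 0.5).astype(int)

print(model.classes_)
print(proba.round(2))
print(manual)
```

Thresholding by hand like this also makes it possible to try a cutoff other than 0.5, for example when missing a virginica is more costly than a false alarm.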

In [100]:
# score (accuracy on train)
logit1.score(X_train, y_train)


In [118]:
# classification report
print(classification_report(y_train, logit1.predict(X_train)))


In [91]:
# coefficients: one weight per feature, in the same order as X_train.columns
logit1.coef_

array([[ 0.40755728, -0.14265297,  1.39078556,  0.90561891, -0.04143467,
        -1.71176196,  1.75324074]])

In [92]:
# feature names, in the same order as the coefficients
X_train.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species_setosa', 'species_versicolor', 'species_virginica'],
      dtype='object')
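The coefficients are easier to read when paired with their column names. The sketch below copies the two outputs above into a pandas Series and sorts by absolute size; the names `coef`, `columns`, `weights`, and `odds_ratios` are illustrative.

```python
import numpy as np
import pandas as pd

# Values copied from the logit1.coef_ and X_train.columns outputs
coef = np.array([[0.40755728, -0.14265297, 1.39078556, 0.90561891,
                  -0.04143467, -1.71176196, 1.75324074]])
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width",
           "species_setosa", "species_versicolor", "species_virginica"]

# coef_ has shape (1, n_features) for a binary target, so take row 0
weights = pd.Series(coef[0], index=columns)

# Order the features from largest to smallest absolute weight
order = weights.abs().sort_values(ascending=False).index
weights = weights.reindex(order)
print(weights)

# exp(weight) is the factor by which the odds of class 1 change
# for a one-unit increase in that feature, holding the others fixed
odds_ratios = np.exp(weights)
print(odds_ratios.round(2))
```

The two largest weights by absolute value sit on `species_virginica` and `species_versicolor`, which are dummy columns built from the label rather than measurements.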

### Change hyperparameter
#### Regularization:
- Keeps the model simple
- Constrains the coefficients


#### C = Inverse of regularization strength:

- Lower C means stronger regularization
- Lower C discourages learning a more complex model
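To see the effect of `C` on the coefficients, the sketch below fits the same model with three values of `C` on made-up two-feature data and records the size of the coefficient vector. The data, the chosen `C` values, and the `norms` dictionary are assumptions for illustration only. It relies on scikit-learn's default L2 penalty.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: two overlapping clusters, one per class
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C).fit(X_demo, y_demo)
    # Euclidean length of the coefficient vector: smaller means more shrinkage
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: coefficient norm = {norms[C]:.3f}")
```

The smallest `C` gives the shortest coefficient vector, which is what "stronger regularization" means in practice.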

In [93]:
# Change hyperparameter C = 0.01
logit2 = LogisticRegression(C=0.01)
logit2

In [94]:
# fit the model
logit2.fit(X_train, y_train)

In [95]:
# make prediction
logit2.predict(X_train)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 0], dtype=uint8)

In [96]:
# score
logit2.score(X_train, y_train)

0.9

In [97]:
#classification report
print(classification_report(y_train,logit2.predict(X_train)))

              precision    recall  f1-score   support

           0       0.87      1.00      0.93        60
           1       1.00      0.70      0.82        30

    accuracy                           0.90        90
   macro avg       0.93      0.85      0.88        90
weighted avg       0.91      0.90      0.89        90
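The numbers in the report can be rebuilt from four counts. The counts below are inferred from the report itself (support of 60 and 30, recall of 1.00 and 0.70, precision of 1.00 for class 1), not read from a confusion matrix in this notebook, so treat them as a worked check rather than model output.

```python
# Counts implied by the report, with class 1 = virginica:
# 60 true negatives, 0 false positives, 9 false negatives, 21 true positives
tn, fp, fn, tp = 60, 0, 9, 21

# Class 1 (virginica)
precision_1 = tp / (tp + fp)   # of the rows predicted virginica, how many were
recall_1 = tp / (tp + fn)      # of the real virginica rows, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (not virginica)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

In other words, this model never mislabels a non-virginica flower as virginica, but it misses 9 of the 30 virginica flowers in train.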



### Evaluate Model 1 and 2 performance on 'Validate'

In [98]:
logit1.score(X_validate, y_validate)

1.0

In [99]:
logit2.score(X_validate, y_validate)

0.8666666666666667
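The remaining steps from the outline are to compare models on validate, pick one, and score only that one on test. A minimal sketch of that flow is below, on made-up data with the same 2:1 class balance as `y_train`. Every name in it (`X_tr`, `X_va`, `X_te`, `results`, `best_C`, `baseline`) is hypothetical, and the 60/20/20 split is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows of class 0 and 50 rows of class 1
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (50, 2))])
y_demo = np.array([0] * 100 + [1] * 50)

# Stratified 60/20/20 split into train, validate, test
X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.25, random_state=42, stratify=y_tv)

# Baseline: accuracy from always predicting the most common class in train
baseline = np.bincount(y_tr).max() / len(y_tr)

# Fit each candidate on train, then score it on train and validate
results = {}
for C in [0.01, 1.0]:
    model = LogisticRegression(C=C).fit(X_tr, y_tr)
    results[C] = {
        "model": model,
        "train": model.score(X_tr, y_tr),
        "validate": model.score(X_va, y_va),
    }
    print(C, round(results[C]["train"], 3), round(results[C]["validate"], 3))

# Choose by validate accuracy, then touch test exactly once
best_C = max(results, key=lambda C: results[C]["validate"])
test_accuracy = results[best_C]["model"].score(X_te, y_te)

print(f"baseline={baseline:.3f} best C={best_C} test accuracy={test_accuracy:.3f}")
```

Comparing against the baseline shows whether a model beats simply guessing the majority class, and a large gap between train and validate accuracy is a sign of overfitting.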