# Modeling - Logistic Regression

What is it?
- a machine learning algorithm used for predicting categorical target variables
- Pipeline: Plan - Acquire - Prepare - Explore - **Model** - Deliver

Why do we care?
- we can predict the target for new, unseen observations using the model we build!

How does it work?
- [slides we already saw](https://docs.google.com/presentation/d/1uK_PLp_gjowSTUEIhPyJrniuHFXN4waR/edit?usp=sharing&ouid=110448495992573862737&rtpof=true&sd=true)
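
As a quick reminder of the core idea (a rough sketch, not a replacement for the slides): logistic regression takes a weighted sum of the features and passes it through the sigmoid function to get a probability between 0 and 1. The weights, intercept, and observation below are made-up values for illustration only.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical weights, intercept, and one observation (illustration only)
weights = [1.5, -0.5]
intercept = -2.0
observation = [2.0, 1.0]

# Weighted sum of the inputs, then squash it into a probability
z = intercept + sum(w * x for w, x in zip(weights, observation))
probability = sigmoid(z)

# Predict the positive class when the probability is at least 0.5
predicted_class = int(probability >= 0.5)
```

A weighted sum of exactly 0 gives a probability of 0.5, which is why 0.5 is the usual cutoff between the two classes.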

How do we use it?
- acquire, prepare, explore our data
- split data for modeling
- build models on train
    - create rules based on our input data
- evaluate models on train & validate
    - see how our rules work on unseen data
- pick the best model and evaluate only that model on test

In [15]:
import numpy as np
import pandas as pd
import math

from sklearn.metrics import classification_report, confusion_matrix

import acquire as acq
import prepare as prep

#new import! 
from sklearn.linear_model import LogisticRegression

### Acquire

In [2]:
df = acq.get_iris_data()

csv file found and loaded


In [109]:
df.head()

Unnamed: 0,species_id,measurement_id,sepal_length,sepal_width,petal_length,petal_width,species_name
0,1,1,5.1,3.5,1.4,0.2,setosa
1,1,2,4.9,3.0,1.4,0.2,setosa
2,1,3,4.7,3.2,1.3,0.2,setosa
3,1,4,4.6,3.1,1.5,0.2,setosa
4,1,5,5.0,3.6,1.4,0.2,setosa


### Prepare

In [3]:
df = prep.prep_iris(df)

In [111]:
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


### We will utilize binary classification

#### Predict if species is virginica or not

In [17]:
# Binary classification - predict if species is non-virginica or virginica
# encode setosa and versicolor as 0 and virginica as 1
df.species_name.value_counts()

setosa        50
versicolor    50
virginica     49
Name: species_name, dtype: int64

In [18]:
df.species_name == 'virginica'

0      False
1      False
2      False
3      False
4      False
       ...  
145     True
146     True
147     True
148     True
149     True
Name: species_name, Length: 149, dtype: bool
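
The boolean Series above is already most of the way to a binary target. A minimal sketch of finishing the job, assuming a tiny hand-made DataFrame in place of the real iris data; the column name `is_virginica` is an assumption and does not appear elsewhere in this notebook.

```python
import pandas as pd

# Tiny stand-in for the iris DataFrame (illustration only)
df = pd.DataFrame({
    "petal_width": [0.2, 1.3, 2.1, 1.8],
    "species_name": ["setosa", "versicolor", "virginica", "virginica"],
})

# True/False -> 1/0: virginica becomes 1, every other species becomes 0
df["is_virginica"] = (df.species_name == "virginica").astype(int)
```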

In [21]:
#split my data
X = df.drop(columns= 'species_name')
Y = df.species_name

X = pd.DataFrame(X)
Y = pd.DataFrame(Y)

X_train, X_validate, X_test, y_train, y_validate, y_test = acq.train_validate_test_split(X, Y)
#train, validate, test = prepare.my_train_test_split(df, 'species')

In [12]:
X_train = pd.DataFrame(X_train)
X_validate = pd.DataFrame(X_validate)
X_test = pd.DataFrame(X_test)
y_train = pd.DataFrame(y_train)
y_validate = pd.DataFrame(y_validate)
y_test = pd.DataFrame(y_test)

In [22]:
X_train = X_train.drop(columns = 'species_name_versicolor')


In [16]:
y_train.head()

Unnamed: 0,species_name
52,versicolor
36,setosa
125,virginica
85,versicolor
18,setosa


## Explore 

ONLY USING TRAIN!

completed the following steps on my features and target variable
1. hypothesize
2. visualize
3. analyze
4. summarize

These steps aren't written out here; however, I found that petal width and petal length were the features that best distinguished the species.

### Modeling

split into features and target variable
- need to do this on my train, validate, and test dataframe
- will end up with the following variables:
    - X_train, X_validate, X_test: all the features we plan to put into our model
    - y_train, y_validate, y_test: the target variable
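
One possible way to end up with these six variables is to call scikit-learn's `train_test_split` twice. This is only a sketch: the helper name `split_train_validate_test`, the 60/20/20 proportions, and the seed are assumptions, and scikit-learn's bundled iris data stands in for the output of the acquire and prepare steps.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def split_train_validate_test(X, y, seed=42):
    """Hypothetical helper: roughly 60% train, 20% validate, 20% test."""
    # First carve off the test set (20% of all rows)
    X_train_val, X_test, y_train_val, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    # Then split the remainder: 25% of the remaining 80% is 20% overall
    X_train, X_validate, y_train, y_validate = train_test_split(
        X_train_val, y_train_val, test_size=0.25, random_state=seed,
        stratify=y_train_val,
    )
    return X_train, X_validate, X_test, y_train, y_validate, y_test

# Bundled iris data as a stand-in for the prepared DataFrame
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_validate, X_test, y_train, y_validate, y_test = split_train_validate_test(X, y)
```

Stratifying on the target keeps the species proportions about the same in every split.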

#### sklearn modeling process

1. Create the thing
2. Fit the thing
3. Use the thing
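
The three steps can be sketched end to end as follows. This assumes scikit-learn's bundled iris data rather than this notebook's DataFrames, and for brevity it fits on the full dataset; in the lesson, fit on train only. The `max_iter=1000` setting is an assumption chosen to give the solver room to converge.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True, as_frame=True)

# 1. Create the thing: an unfitted model object with its hyperparameters
model = LogisticRegression(max_iter=1000)

# 2. Fit the thing: learn coefficients from the features and the target
model.fit(X, y)

# 3. Use the thing: predict classes and score accuracy
predictions = model.predict(X)
accuracy = model.score(X, y)
```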

In [25]:
# create it

logit1 = LogisticRegression()
logit1

In [26]:
# fit it

logit1.fit(X_train, y_train)


  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
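
Both warnings above can usually be avoided. The `column_or_1d` warning appears when the target is passed as a one-column DataFrame instead of a 1-D array, and the iteration warning suggests raising `max_iter` or scaling the data. A minimal sketch of one way to do both, assuming scikit-learn's bundled iris data; the pipeline and the `max_iter=1000` value are illustrative choices, not what this notebook ran.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True, as_frame=True)
y_frame = pd.DataFrame(y)  # one-column DataFrame, like y_train in this notebook

# Flatten the one-column target to 1-D so scikit-learn gets the shape it expects
y_1d = y_frame.values.ravel()

# Scale the features and allow more iterations to help the solver converge
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y_1d)
accuracy = model.score(X, y_1d)
```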


In [27]:
#use it
logit1.score(X_train, y_train)

1.0

In [31]:
#take a look at predictions
logit1.predict(X_train)

array(['versicolor', 'setosa', 'virginica', 'versicolor', 'setosa',
       'virginica', 'virginica', 'setosa', 'setosa', 'virginica',
       'versicolor', 'virginica', 'virginica', 'versicolor', 'virginica',
       'virginica', 'setosa', 'versicolor', 'setosa', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor', 'setosa',
       'versicolor', 'versicolor', 'setosa', 'versicolor', 'setosa',
       'versicolor', 'setosa', 'virginica', 'setosa', 'setosa',
       'versicolor', 'setosa', 'virginica', 'setosa', 'setosa', 'setosa',
       'versicolor', 'setosa', 'virginica', 'versicolor', 'versicolor',
       'virginica', 'virginica', 'virginica', 'versicolor', 'virginica',
       'versicolor', 'setosa', 'setosa', 'virginica', 'virginica',
       'virginica', 'virginica', 'setosa', 'setosa', 'setosa',
       'virginica', 'virginica', 'versicolor', 'virginica', 'virginica',
       'versicolor', 'virginica', 'setosa', 'setosa', 'virginica',
       'versicolor', 'versico

In [35]:
# View raw probabilities (output from the model)
logit1.predict_proba(X_train).round(2)[:5]

array([[0.  , 0.87, 0.13],
       [0.98, 0.02, 0.  ],
       [0.  , 0.04, 0.96],
       [0.01, 0.9 , 0.09],
       [0.96, 0.04, 0.  ]])
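
Each row of `predict_proba` holds one probability per class, in the order stored in the model's `classes_` attribute (sorted labels), and `predict` returns the class with the largest probability. The sketch below reproduces that by hand from the five rows printed above, without needing the fitted model.

```python
# Class order as scikit-learn stores it in `classes_` (sorted labels)
classes = ["setosa", "versicolor", "virginica"]

# The five probability rows printed above
probabilities = [
    [0.00, 0.87, 0.13],
    [0.98, 0.02, 0.00],
    [0.00, 0.04, 0.96],
    [0.01, 0.90, 0.09],
    [0.96, 0.04, 0.00],
]

# For each row, pick the class with the largest probability
predicted = [classes[row.index(max(row))] for row in probabilities]
```

The result matches the first five entries of the `logit1.predict(X_train)` output.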

In [36]:
#classification report
print(classification_report(y_train, logit1.predict(X_train)))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        30
  versicolor       1.00      1.00      1.00        30
   virginica       1.00      1.00      1.00        29

    accuracy                           1.00        89
   macro avg       1.00      1.00      1.00        89
weighted avg       1.00      1.00      1.00        89



In [37]:
#coef
logit1.coef_

array([[-0.38380716,  0.59193411, -1.829478  , -0.7200031 , -0.02638308],
       [ 0.14712769, -0.3421631 ,  0.40487703, -0.15996982, -1.8275633 ],
       [ 0.23667947, -0.24977101,  1.42460097,  0.87997291,  1.85394638]])

In [41]:
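# columns: drop the leftover 'species_name_virginica' dummy column
# note: logit1 was fit while this target-derived column was still in X_train
# (coef_ has 5 columns), which leaks the answer and likely explains the 1.0 train score
X_train = X_train.drop(columns = 'species_name_virginica')
X_train.columns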

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], dtype='object')
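
A raw `coef_` array is hard to read on its own. It has one row per class and one column per feature the model saw at fit time, so labeling it is a quick way to check which columns the model actually used. A minimal sketch, assuming scikit-learn's bundled iris data (whose column names differ from this notebook's):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True, as_frame=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# One row per class, one column per feature the model was fit on
coef_table = pd.DataFrame(model.coef_, index=model.classes_, columns=X.columns)
```

If the number of columns in `coef_` does not match the features you intend to use, the model needs to be refit.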

### change hyperparameter
#### Regularization:
- Keep model simple
- Constrains the coefficients


#### C = Inverse of regularization strength:

- Lower C means stronger regularization
- Lower C discourages learning a more complex model
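
To see the effect of C directly, one option is to fit the same model with several values and compare the overall size of the coefficients. This sketch assumes scikit-learn's bundled iris data and the default L2 penalty; the specific C values and `max_iter` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True, as_frame=True)

# Fit the same model with progressively weaker regularization (larger C)
coef_sizes = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    # Overall magnitude of the coefficient matrix as a rough complexity measure
    coef_sizes[C] = np.linalg.norm(model.coef_)
```

Smaller C should shrink the coefficients toward zero, which is what "keep the model simple" means in practice.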

In [65]:
# Change hyperparameter C = 0.01
logit2 = LogisticRegression(C=0.01)
logit2

In [66]:
# fit the model
logit2.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


In [67]:
# make prediction
logit2.predict(X_train)

array(['virginica', 'setosa', 'virginica', 'virginica', 'setosa',
       'virginica', 'virginica', 'setosa', 'setosa', 'virginica',
       'versicolor', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'setosa', 'versicolor', 'setosa', 'versicolor',
       'versicolor', 'virginica', 'virginica', 'versicolor', 'setosa',
       'versicolor', 'versicolor', 'setosa', 'versicolor', 'setosa',
       'versicolor', 'setosa', 'virginica', 'setosa', 'setosa',
       'versicolor', 'setosa', 'virginica', 'setosa', 'setosa', 'setosa',
       'versicolor', 'setosa', 'virginica', 'versicolor', 'versicolor',
       'virginica', 'virginica', 'versicolor', 'versicolor', 'virginica',
       'virginica', 'setosa', 'setosa', 'virginica', 'virginica',
       'virginica', 'virginica', 'setosa', 'setosa', 'setosa',
       'virginica', 'virginica', 'versicolor', 'virginica', 'virginica',
       'versicolor', 'virginica', 'setosa', 'setosa', 'virginica',
       'versicolor', 'virginica', 

In [68]:
# score
logit2.score(X_train, y_train)

0.8876404494382022

In [69]:
#classification report
print(classification_report(y_train, logit2.predict(X_train)))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        30
  versicolor       0.95      0.70      0.81        30
   virginica       0.76      0.97      0.85        29

    accuracy                           0.89        89
   macro avg       0.90      0.89      0.89        89
weighted avg       0.91      0.89      0.89        89
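
The numbers in the report come from simple counts. As a worked example for the versicolor row, the counts below are inferred from the rounded report values and the support of 30, so treat them as an assumption rather than output from this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts for one class."""
    precision = tp / (tp + fp)  # of everything predicted as this class, how much was right
    recall = tp / (tp + fn)     # of everything truly this class, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Inferred counts for versicolor (assumption): 21 of 30 found,
# 9 missed, and 1 other flower wrongly labeled versicolor
precision, recall, f1 = precision_recall_f1(tp=21, fp=1, fn=9)
```

Rounded to two decimals these give 0.95, 0.70, and 0.81, matching the versicolor row above.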



### Evaluate Model 1 and 2 performance on 'Validate'

In [72]:
logit1.score(X_validate, y_validate)

Feature names seen at fit time, yet now missing:
- species_name_virginica



ValueError: X has 4 features, but LogisticRegression is expecting 5 features as input.

In [71]:
X_validate = X_validate.drop(columns = 'species_name_virginica')

KeyError: "['species_name_virginica'] not found in axis"
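
Both errors above come from the feature columns differing between the data the model was fit on and the data it was scored on. One way to avoid that is to decide on the feature list once and select it from every split. A minimal sketch, assuming scikit-learn's bundled iris data (hence the different column names) and a simple two-way split for brevity:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Decide on the feature columns once, before fitting anything
features = ["petal length (cm)", "petal width (cm)"]

# Fit and score using the same column list on every split
model = LogisticRegression(max_iter=1000).fit(X_train[features], y_train)
train_accuracy = model.score(X_train[features], y_train)
validate_accuracy = model.score(X_validate[features], y_validate)
```

If the feature list changes after a model has been fit, the model has to be refit before it is scored again.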

In [59]:
y_validate.head()

Unnamed: 0,species_name
129,virginica
99,versicolor
114,virginica
90,versicolor
39,setosa


In [64]:
logit2.score(X_validate, y_validate)

0.9333333333333333