# Cost Functions and Solutions To the Optimization Problem
Unlike the least-squares problem for linear regression, no one has yet found a closed-form solution to the optimization problem presented by logistic regression. But even if one exists, the computation would no doubt be so complex that we'd be better off using some sort of approximation method instead.

Recall the cost function for linear regression: <br/><br/>
$SSE = \Sigma_i(y_i - \hat{y}_i)^2 = \Sigma_i(y_i - (\beta_0 + \beta_1x_{i1} + ... + \beta_nx_{in}))^2$.

This function, $SSE(\vec{\beta})$, is convex.

If we plug in our new logistic equation for $\hat{y}$, we get: <br/><br/>
$SSE_{log} = \Sigma_i(y_i - \hat{y}_i)^2 = \Sigma_i\left(y_i - \left(\frac{1}{1+e^{-(\beta_0 + \beta_1x_{i1} + ... + \beta_nx_{in})}}\right)\right)^2$.

## However... 
*This* function, $SSE_{log}(\vec{\beta})$, is [**not** convex](https://towardsdatascience.com/why-not-mse-as-a-loss-function-for-logistic-regression-589816b5e03c).

That means that, if we tried to use gradient descent or some other approximation method that looks for the minimum of this function, we could easily find a local rather than a global minimum.
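
To make the convexity claim concrete, here is a minimal sketch on a deliberately tiny, hypothetical setup: a single observation with $x = 1$, $y = 1$ and one coefficient $\beta$ (no intercept). It is not a proof, only a numerical check that the squared-error curve bends the wrong way somewhere while the log-loss curve for the same observation does not.

```python
import numpy as np
from scipy.special import expit  # the logistic (sigmoid) function

# Hypothetical single observation: x = 1, y = 1, one coefficient beta.
beta = np.linspace(-6, 6, 121)
y_hat = expit(beta * 1.0)

sse = (1.0 - y_hat) ** 2            # squared error using the logistic y_hat
ll = np.logaddexp(0, -beta)         # -ln(y_hat), computed in a numerically stable way

# Second differences approximate curvature; a convex curve has none below zero.
sse_curvature = np.diff(sse, n=2)
ll_curvature = np.diff(ll, n=2)

print("SSE bends downward somewhere:     ", (sse_curvature < 0).any())
print("log-loss bends downward somewhere:", (ll_curvature < 0).any())
```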

## The Good News is... 
We can use **log-loss** instead:

$\mathcal{L}(\vec{y}, \hat{\vec{y}}) = -\frac{1}{N}\Sigma^N_{i=1}\left(y_iln(\hat{y}_i)+(1-y_i)ln(1-\hat{y}_i)\right)$,

where $\hat{y}_i$ is the probability that $(x_{i1}, ... , x_{in})$ belongs to **class 1**.
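
As a quick sanity check of the formula, the sketch below computes the log-loss by hand on four made-up labels and probabilities (illustrative values, not taken from any dataset in this notebook) and compares the result with `sklearn.metrics.log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities of class 1 (illustration only)
y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.6, 0.4])

# Direct translation of the log-loss formula
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# scikit-learn's implementation, given the same probabilities
from_sklearn = log_loss(y_true, y_prob)

print(manual, from_sklearn)
```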

**Additional resources on the log-loss function**:

https://towardsdatascience.com/optimization-loss-function-under-the-hood-part-ii-d20a239cde11

https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a

http://wiki.fast.ai/index.php/Log_Loss

## Great... we have a cost function for logistic regression, but how do we interpret our outputs now? 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns

# For our modeling steps
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, LogisticRegression,\
LassoCV, RidgeCV
from sklearn.model_selection import train_test_split, KFold,\
cross_val_score, cross_validate, ShuffleSplit
from sklearn.metrics import mean_squared_error, accuracy_score, confusion_matrix, classification_report


# For demonstrative purposes
from scipy.special import logit, expit
from sklearn import datasets

In [None]:
# glass identification dataset
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data'
col_names = ['id','ri','na','mg','al','si','k','ca','ba','fe','glass_type']
glass = pd.read_csv(url, names=col_names, index_col='id')
glass.sort_values('al', inplace=True)
glass.head()

In [None]:
# types 1, 2, 3 are window glass
# types 5, 6, 7 are household glass
glass['household'] = glass.glass_type.map({1:0, 2:0, 3:0, 5:1, 6:1, 7:1})
glass.head()

## Train Logistic Regression

In [None]:
# fit a logistic regression model and store the class predictions

logreg = LogisticRegression(random_state=42)
feature_cols = ['al']
X = glass[feature_cols]
y = glass.household
logreg.fit(X, y)
glass['household_pred_class'] = logreg.predict(X)

## Interpreting Logistic Regression Coefficients

In [None]:
logreg.coef_

How do we interpret the coefficients of a logistic regression? For a linear regression, the situation was like this:

- Linear Regression: We construct the best-fit line and get a set of coefficients. Suppose $\beta_1 = k$. In that case we would expect a 1-unit change in $x_1$ to produce a $k$-unit change in $y$.

- Logistic Regression: We find the coefficients of the best-fit line by some approximation method. Suppose $\beta_1 = k$. In that case we would expect a 1-unit change in $x_1$ to produce a $k$-unit change (not in $y$ but) in the log-odds, $\ln\left(\frac{y}{1-y}\right)$.

We have:

$\ln\left(\frac{y(x_1+1, ... , x_n)}{1-y(x_1+1, ... , x_n)}\right) = \ln\left(\frac{y(x_1, ... , x_n)}{1-y(x_1, ... , x_n)}\right) + k$.

Exponentiating both sides:

$\frac{y(x_1+1, ... , x_n)}{1-y(x_1+1, ... , x_n)} = e^{\ln\left(\frac{y(x_1, ... , x_n)}{1-y(x_1, ... , x_n)}\right) + k}$ <br/><br/> $\frac{y(x_1+1, ... , x_n)}{1-y(x_1+1, ... , x_n)}= e^{\ln\left(\frac{y(x_1, ... , x_n)}{1-y(x_1, ... , x_n)}\right)}\cdot e^k$ <br/><br/> $\frac{y(x_1+1, ... , x_n)}{1-y(x_1+1, ... , x_n)}= e^k\cdot\frac{y(x_1, ... , x_n)}{1-y(x_1, ... , x_n)}$

That is, the odds at $x_1+1$ are $e^k$ times the odds at $x_1$; in other words, $e^k$ is the odds ratio for a 1-unit increase in $x_1$.

For more on interpretation, see [this page](https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/how-to/binary-logistic-regression/interpret-the-results/all-statistics-and-graphs/coefficients/).
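
The derivation above can be checked numerically. The sketch below fits a one-feature model on synthetic data (the data-generating coefficients 0.5 and 2 are arbitrary assumptions, not values from the glass dataset) and compares the ratio of the odds at $x_1 = 1$ and $x_1 = 0$ with $e^k$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: one feature, labels drawn with assumed true log-odds 0.5 + 2x
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + 2 * x[:, 0])))).astype(int)

model = LogisticRegression().fit(x, y)
k = model.coef_[0, 0]  # fitted coefficient for the single feature

def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

# Predicted probability of class 1 at x = 0 and at x = 1
p_at_0, p_at_1 = model.predict_proba([[0.0], [1.0]])[:, 1]
odds_ratio = odds(p_at_1) / odds(p_at_0)

print("odds ratio for a 1-unit increase:", odds_ratio)
print("exp(k):                          ", np.exp(k))
```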

In [None]:
# examine the intercept

logodds = logreg.intercept_
logodds

> **Interpretation:** For an 'al' value of 0, the log-odds of 'household' is -6.01.

In [None]:
odds = np.exp(logodds)
odds

In [None]:
prob = odds / (1 + odds)
prob
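
The two steps above (exponentiate the log-odds, then convert odds to a probability) are exactly what `scipy.special.expit` does in one call, and `logit` goes the other way. A minimal sketch, using the intercept value quoted in the interpretation above as a plain number:

```python
import numpy as np
from scipy.special import expit, logit

logodds = -6.01  # log-odds of 'household' at al = 0, as quoted above

# Two-step route: log-odds -> odds -> probability
odds = np.exp(logodds)
prob_from_odds = odds / (1 + odds)

# One-step route: expit is the inverse of logit
prob_from_expit = expit(logodds)

print(prob_from_odds, prob_from_expit)
print(logit(prob_from_expit))  # recovers the log-odds
```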

# Complete Logistic Regression Walk-thru 

In [None]:
df = pd.read_csv('data/adult.csv')
df.head()

In [None]:
df['country'] = df['country'].replace(' ?',np.nan)
df['workclass'] = df['workclass'].replace(' ?',np.nan)
df['occupation'] = df['occupation'].replace(' ?',np.nan)

df.dropna(how='any',inplace=True)
df.info()

In [None]:
salary_map = {' <=50K':1, ' >50K':0}
df['salary'] = df['salary'].map(salary_map).astype(int)

In [None]:
#let's look at a countplot to visualize our new dependent variable 
sns.countplot(x='salary', data=df)

In [None]:
# pass column names as keywords; recent seaborn versions do not accept a positional Series here
sns.countplot(x='sex', hue='salary', data=df)

## Splitting into train/test sets 

In [None]:
X = df.drop(['salary'], axis=1)
y = df['salary']

split_size = 0.3

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=split_size,random_state=0)

## One Hot Encoding our categorical variables 
### Training set first - need to fit and transform 

In [None]:
# Taking in other features (category)
ohe = OneHotEncoder(drop='first')
dummies = ohe.fit_transform(X_train[['workclass', 'education', 'marital-status', 'occupation',
                                    'relationship', 'race', 'sex', 'country']])

# Getting a DF
# get_feature_names_out() requires scikit-learn >= 1.0
dummies_df = pd.DataFrame(dummies.todense(), columns=ohe.get_feature_names_out(),
                         index=X_train.index)

# What we'll feed into our model
X_train_df = pd.concat([X_train[['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
                                'hours-per-week']], dummies_df], axis=1)
X_train_df.head()

## Encoding our test set - only transform 

In [None]:
# Note the same transformation (not FIT) to match structure
test_dummies = ohe.transform(X_test[['workclass', 'education', 'marital-status', 'occupation',
                                    'relationship', 'race', 'sex', 'country']])
# get_feature_names_out() requires scikit-learn >= 1.0
test_df = pd.DataFrame(test_dummies.todense(), columns=ohe.get_feature_names_out(),
                       index=X_test.index)
X_test_df = pd.concat([X_test[['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
                                'hours-per-week']], test_df], axis=1)

## First Model 

In [None]:
lg1 = LogisticRegression()
lg1.fit(X_train_df, y_train)

In [None]:
#How did our training data do? 
lg1.score(X_train_df, y_train)

Evaluate how well the model generalizes with cross-validation (only training data). _Remember_,  cross-validation works like this: First I'll partition my training data into $k$-many *folds*. Then I'll train a model on $k-1$ of those folds and "test" it on the remaining fold. I'll do this for all possible divisions of my $k$ folds into $k-1$ training folds and a single "testing" fold. Since there are $\binom{k}{1} = k$ ways of doing this, I'll be building $k$-many models!

![](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)
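
The procedure described above can be written out as an explicit loop. This is a minimal sketch on synthetic data from `make_classification` (not the adult dataset), with $k = 5$ chosen arbitrarily; it builds one model per fold and then checks that `cross_val_score` produces the same fold scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for a training set
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, val_idx in kf.split(X):
    # A fresh model for every split: trained on k-1 folds, scored on the held-out fold
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[val_idx], y[val_idx]))

# The same thing in one call (default scoring for a classifier is accuracy)
sklearn_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

print(manual_scores)
print(sklearn_scores)
```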

In [None]:
cv_results = cross_validate(
                X=X_train_df, 
                y=y_train,
                estimator=lg1, 
                cv=10,
                scoring='accuracy',
                return_train_score=True
)

In [None]:
cv_results.keys()

In [None]:
cv_results 

## Let's see how the model performs on our test data 

In [None]:
#predictions
prediction = lg1.predict(X_test_df)
prediction

In [None]:
print('Accuracy Score:')
print(accuracy_score(y_test, prediction))

print('-'*40)
print('Confusion Matrix:')
print(confusion_matrix(y_test, prediction))

## What are we actually trying to determine here? 

## Review: Bias-Variance tradeoff

![](https://miro.medium.com/max/700/1*oO0KYF7Z84nePqfsJ9E0WQ.png)

The important idea here is that there is a *trade-off*: If our model is too simple, e.g. it has too few predictors, we run the risk of high *bias*, i.e. an underfit model. On the other hand, if we have too many predictors (especially ones that are collinear), or too few data in our sample (training set) to pin them down, we run the risk of high *variance*, i.e. an overfit model.

### Underfitting 
> Underfit models fail to capture all of the information in the data
* low complexity --> high bias, low variance
* training error: large
* testing error: large

### Overfitting 
> Overfit models fit to the noise in the data and fail to generalize
* high complexity --> low bias, high variance
* training error: low
* testing error: large

**We use training, validating and testing to help us understand if our model is over/underfitting:** 
Roughly:
- Training data is for building the model;
- Validation data is for *tweaking* the model;
- Testing data is for evaluating the model on unseen data.
<br/>

- Think of **training** data as what you study for a test
- Think of **validation** data as taking a practice test (sometimes called the **dev** set)
- Think of **testing** data as what you use to judge the model
    - A **holdout** set is when your test dataset is never used for training (unlike in cross-validation)
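
One common way to get all three sets is to call `train_test_split` twice. The sketch below is only an illustration on synthetic data; the 60/20/20 proportions are an assumption, not a rule.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a full dataset
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# First carve off the holdout test set (20% of everything) ...
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# ... then split the remainder into training and validation (25% of 80% = 20%)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```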
    

## If our model is over/underfitting... 

### A class imbalance is most likely to result in underfitting... why? 

In [None]:
#smote
from imblearn.over_sampling import SMOTE

smote = SMOTE()
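# fit_resample replaces the older fit_sample, which newer imbalanced-learn versions no longer provide
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_df, y_train) 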
print(pd.Series(y_train_resampled).value_counts())

In [None]:
#train on resampled data 
lg2 = LogisticRegression()
lg2.fit(X_train_resampled, y_train_resampled)

In [None]:
lg2.score(X_train_resampled, y_train_resampled)

## Ugh... it got worse... why? 

## If our model is overfitting... 
### Regularization could help! 

Again, complex models are very flexible in the patterns that they can model but this also means that they can easily find patterns that are simply statistical flukes of one particular dataset rather than patterns reflective of the underlying data-generating process.

When a model has large weights, the model is "too confident". This translates to a model with high variance, which puts it in danger of overfitting! We need to penalize large (confident) weights by adding their size to the error function.

**Some Ways to Reduce Overfitting:**

1. Reducing the number of features
2. Increasing the amount of data
3. Popular regularization techniques: Ridge, Lasso, Elastic Net

## The Strategy Behind Ridge / Lasso / Elastic Net

Overfit models overestimate the relevance that predictors have for a target. Thus overfit models tend to have **overly large coefficients**. 

Generally, overfitting is a result of high model variance. High model variance can be caused by:

- having irrelevant or too many predictors
- multicollinearity
- large coefficients

Regularization is about introducing a factor into our model designed to enforce the structure that the coefficients stay small, by _penalizing_ the ones that get too large.

That is, we'll alter our loss function so that the goal now is not merely to minimize the difference between actual values and our model's predicted values. Rather, we'll add in a term to our loss function that represents the sizes of the coefficients.

### Lasso: L1 Regularization - Absolute Value
- Tend to get sparse vectors (small weights go to 0)
- Reduce number of weights
- Good for feature selection: the weights that stay nonzero pick out the important features

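$$ J(W,b) = \dfrac{1}{m} \sum^m_{i=1}\mathcal{L}(\hat y_i, y_i)+ \dfrac{\lambda}{m}\sum^n_{j=1}|w_j| $$

where $\mathcal{L}(\hat y_i, y_i) = -\left(y_i\ln(\hat{y}_i)+(1-y_i)\ln(1-\hat{y}_i)\right)$ is the log-loss of observation $i$, $m$ is the number of observations, and the penalty sums over the $n$ weights.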

### Ridge: L2 Regularization - Squared Value

- Not sparse vectors (weights homogeneous & small)
- Tends to give better results for training

    
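$$ J(W,b) = \dfrac{1}{m} \sum^m_{i=1}\mathcal{L}(\hat y_i, y_i)+ \dfrac{\lambda}{m}\sum^n_{j=1}w_j^2 $$

where $\mathcal{L}(\hat y_i, y_i)$ is the log-loss of observation $i$ and the penalty sums over the $n$ squared weights.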
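
To see what the penalty term contributes, here is a small sketch of a penalized cost: the mean log-loss plus $\frac{\lambda}{m}$ times either the sum of absolute weights (L1) or the sum of squared weights (L2). The function name `penalized_log_loss` and all the numbers are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy.special import expit

def penalized_log_loss(w, b, X, y, lam=0.0, penalty="l2"):
    """Mean log-loss plus (lam / m) times the L1 or L2 size of the weights."""
    m = len(y)
    y_hat = expit(X @ w + b)
    data_term = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    if penalty == "l1":
        size = np.sum(np.abs(w))   # Lasso: absolute values
    else:
        size = np.sum(w ** 2)      # Ridge: squared values
    return data_term + (lam / m) * size

# Made-up data and weights (illustration only)
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 0.3], [0.2, 0.8]])
y = np.array([1, 0, 0, 1])
w = np.array([1.0, -2.0])
b = 0.1

unpenalized = penalized_log_loss(w, b, X, y, lam=0.0)
l1_cost = penalized_log_loss(w, b, X, y, lam=0.5, penalty="l1")
l2_cost = penalized_log_loss(w, b, X, y, lam=0.5, penalty="l2")

print("extra cost from L1:", l1_cost - unpenalized)  # (0.5 / 4) * (1 + 2)
print("extra cost from L2:", l2_cost - unpenalized)  # (0.5 / 4) * (1 + 4)
```

The intercept `b` is left out of the penalty here, which is a common convention but still a design choice.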

### When should you use L1 and L2 Regularization? 
* **L2**
 - when you have features with high multicollinearity
 - reduce model complexity
 - features with large coefficients (high variance)
 
* **L1**
 - when you have a lot of small coefficients
 - when you have LOTS of features
 - Feature selection 
 
 ![](https://hackernoon.com/hn-images/0*Hb81qZ91t-kZg2eo.png)
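
The sparsity difference can be seen directly by counting coefficients that are exactly zero. The sketch below uses synthetic data in which only 5 of 30 features are informative; the dataset shape and `C=0.05` are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 features, only 5 of which carry signal
X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)  # penalties assume comparable feature scales

# Same regularization strength, different penalty
l1_model = LogisticRegression(penalty="l1", C=0.05, solver="liblinear").fit(X, y)
l2_model = LogisticRegression(penalty="l2", C=0.05, solver="liblinear").fit(X, y)

n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print("coefficients exactly zero with L1:", n_zero_l1)
print("coefficients exactly zero with L2:", n_zero_l2)
```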

## Logistic Regression has regularization already built in 
[Let's check it out](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

### We could run a for loop to iterate over a list of possible C values to see which is best for our model 

In [None]:
C = [100, 10, 1, .1, .001]
for c in C:
    lg3 = LogisticRegression(penalty='l1', C=c, solver='liblinear')
    lg3.fit(X_train_df, y_train)
    print('C:', c)
    print('Training accuracy:', lg3.score(X_train_df, y_train))
    print('Test accuracy:', lg3.score(X_test_df, y_test))
    print('')
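
One caveat with the loop above is that picking `C` by test accuracy lets the test set influence the model. A possible alternative, sketched here on synthetic data rather than the adult dataset, is to let `GridSearchCV` choose `C` by cross-validation on the training set and touch the test set only once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

candidate_C = [100, 10, 1, 0.1, 0.001]  # same grid as the loop above

search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": candidate_C},
    cv=5,                 # 5-fold cross-validation on the training data only
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_C = search.best_params_["C"]
test_accuracy = search.score(X_test, y_test)  # refit best model, scored once on test

print("Best C:", best_C)
print("Test accuracy:", test_accuracy)
```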

## Other Hyperparameters 
### LogisticRegression has several optional parameters that define the behavior of the model and approach: 

**penalty**- is a string ('l2' by default) that decides whether there is regularization and which approach to use. Other options are 'l1', 'elasticnet', and 'none'.

**dual**- is a Boolean (False by default) that decides whether to use primal (when False) or dual formulation (when True).

**tol**- is a floating-point number (0.0001 by default) that defines the tolerance for stopping the procedure.

**C**- is a positive floating-point number (1.0 by default) that defines the relative strength of regularization. Smaller values indicate stronger regularization.

**fit_intercept**- is a Boolean (True by default) that decides whether to calculate the intercept 𝑏₀ (when True) or consider it equal to zero (when False).

**intercept_scaling**- is a floating-point number (1.0 by default) that defines the scaling of the intercept 𝑏₀.

**class_weight**- is a dictionary, 'balanced', or None (default) that defines the weights related to each class. When None, all classes have the weight one.

**random_state**- is an integer, an instance of numpy.random.RandomState, or None (default) that defines what pseudo-random number generator to use.

**solver**- is a string ('lbfgs' by default since scikit-learn 0.22; 'liblinear' in earlier versions) that decides what solver to use for fitting the model. Other options are 'liblinear', 'newton-cg', 'sag', and 'saga'.

**max_iter**- is an integer (100 by default) that defines the maximum number of iterations by the solver during model fitting.

**multi_class**- is a string ('auto' by default since scikit-learn 0.22; 'ovr' in earlier versions) that decides the approach to use for handling multiple classes. Other options are 'ovr' and 'multinomial'.

**verbose**- is a non-negative integer (0 by default) that defines the verbosity for the 'liblinear' and 'lbfgs' solvers.

**warm_start**- is a Boolean (False by default) that decides whether to reuse the previously obtained solution.

**n_jobs**- is an integer or None (default) that defines the number of parallel processes to use. None usually means to use one core, while -1 means to use all available cores.

**l1_ratio**- is either a floating-point number between zero and one or None (default). It defines the relative importance of the L1 part in the elastic-net regularization.

#### Warning: 
**You should carefully match the solver and regularization method for several reasons:**

'liblinear' solver doesn’t work without regularization. <br/>
'newton-cg', 'sag', and 'lbfgs' don’t support L1 regularization; 'liblinear' and 'saga' do. <br/>
'saga' is the only solver that supports elastic-net regularization.
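
As a rough check of how scikit-learn enforces these pairings, the sketch below (on synthetic data) asks for an L1 penalty with the 'lbfgs' solver, which is expected to be rejected with a `ValueError` at fit time, and then fits an elastic-net model with 'saga'. The exact error message depends on the installed scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, scaled so that 'saga' converges quickly
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)

# Mismatched pair: 'lbfgs' does not support an L1 penalty
try:
    LogisticRegression(penalty="l1", solver="lbfgs").fit(X, y)
    mismatch_rejected = False
except ValueError as err:
    mismatch_rejected = True
    print("Rejected:", err)

# Matched pair: elastic-net requires 'saga' and an l1_ratio
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, max_iter=5000)
enet.fit(X, y)
print("Elastic-net coefficients shape:", enet.coef_.shape)
```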

## Pros and Cons of Logistic Regression 

Advantages of logistic regression:

- Highly interpretable (if you remember how)
- Model training and prediction are fast
- Not many parameters to tune
- Can perform well with a small number of observations
- Outputs well-calibrated predicted probabilities

Disadvantages of logistic regression:

- Presumes a linear relationship between the features and the log-odds of the response
- Performance is (generally) not competitive with the best supervised learning methods
- Can't automatically learn feature interactions

## Review: 
- Bias-Variance tradeoff is essential for optimizing all machine learning models 
- Training, validating, testing confirms if our model is over/underfitting 
- Classification metrics help to understand if we should prioritize precision/recall/f1 and give us specifics about where our model is going wrong
- Hyperparameter tuning allows us to adjust our models to fit the data 