# Logistic Regression
Logistic regression (a.k.a. the logit model) is a statistical model used to estimate the probability of an event occurring, given previously observed data.

Logistic regression is used when the possible outcomes are categorical rather than numerical.

Examples: Yes/No, Pass/Fail, 1/0, etc.

**Logit Model:**
$odds = e^{(b_0 + b_1 x_1 + b_2 x_2 + ... + b_k x_k)}$

where,

$odds = \frac{p(x)}{1 - p(x)}$

p(x) = probability of event happening
    
1 - p(x) = probability of event not happening
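
The odds definition above can be inverted to recover the probability: p(x) = odds / (1 + odds). The snippet below is a minimal sketch of that conversion. The helper names `odds_from_linear` and `prob_from_odds` and the coefficient values are made up for illustration and are not part of the bank example.

```python
import math

def odds_from_linear(z):
    """Odds implied by the logit model for a linear predictor z = b0 + b1*x1 + ..."""
    return math.exp(z)

def prob_from_odds(odds):
    """Invert odds = p / (1 - p) to recover p."""
    return odds / (1.0 + odds)

# Illustrative values only; they are not estimated from any data.
b0, b1, x1 = -1.0, 0.5, 2.0
z = b0 + b1 * x1  # linear predictor, 0.0 here
p = prob_from_odds(odds_from_linear(z))
print(p)  # 0.5: odds of 1 mean the event is as likely to happen as not
```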

# Problem Statement
In this very simple example, we will build a univariate logistic regression model.

We have a bank-related problem at hand. We have been given some bank data, and we have to construct a model that predicts whether a given client will subscribe to our term deposit.

The data consists of two columns. One contains the duration for which the client has been associated with the bank, and the other tells whether the client subscribed to the term deposit. We need to train our model on this data.

# 1. Importing Relevant Libraries 

In [None]:
import numpy as np, pandas as pd
import matplotlib.pyplot as plt, seaborn as sns
import statsmodels.api as sm

sns.set()

# 2. Loading Data

In [None]:
raw_data = pd.read_csv("../input/simple-bank-data/simple_bank_data.csv")
raw_data.head()

# 3. Making Data Regression Ready
We do not need the first column, "Unnamed: 0", as it is just another index. We will drop it.

In [None]:
data = raw_data.copy()  # always better to make a copy to edit and keep raw data untouched

data = data.drop(["Unnamed: 0"], axis = 1)
data.head()

In [None]:
data.info()  # checking for null values

There are no null values.

We will now map 'y' into 1s and 0s. 1 for successful subscription, 0 for failure.

In [None]:
data["y"] = data["y"].map({"yes": 1, "no": 0})
data.head(3)

# 4. Declaring Dependent and Independent Variables

In [None]:
x1 = data["duration"]  # feature
y = data["y"]  # target

# 5. Visualization of Data:
It is useful to visualize the data we have: it helps us form hypotheses that we can then test.

In [None]:
plt.scatter(x1, y)
plt.xlabel('Duration', fontsize = 20)
plt.ylabel('Subscription', fontsize = 20)
plt.show()

From this scatter plot, it appears that clients with a longer duration of association with the bank are more likely to accept the term deposit. Applying logistic regression will let us test this hypothesis.

# 6. Applying Logistic Regression

In [None]:
x = sm.add_constant(x1)
reg_log = sm.Logit(y, x)
results_reg = reg_log.fit()

results_reg.summary()

**Summary Interpretation:**

- Pseudo R-squ. - McFadden's pseudo R-squared, used to assess the model; values between 0.2 and 0.4 are generally considered good
- Log-Likelihood - the log of the likelihood of the given data under the fitted model
- LL-Null - the log-likelihood of a model with no explanatory variables, i.e. intercept only (Log-Likelihood should always be greater than or equal to LL-Null; the larger the gap, the better)
- LLR p-value - p-value of the log-likelihood ratio test, which assesses whether our model is significantly different from the LL-Null model, which is basically a useless model (p-value: 0.000, significant model)
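
To see how these summary numbers relate to each other, here is a small sketch that recomputes McFadden's pseudo R-squared and the likelihood ratio test from two log-likelihoods. The log-likelihood values are hypothetical, not taken from the summary above, and the helper names are made up for illustration. The p-value shortcut via `math.erfc` only holds for a single added feature (one degree of freedom), which matches this univariate setting. On a fitted statsmodels result, the corresponding values should be available as `llf`, `llnull`, `prsquared` and `llr_pvalue`.

```python
import math

def mcfadden_pseudo_r2(llf, llnull):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(null)."""
    return 1.0 - llf / llnull

def llr_test_df1(llf, llnull):
    """Likelihood ratio statistic and its p-value for ONE added feature.

    For one degree of freedom, the chi-square survival function
    reduces to erfc(sqrt(stat / 2)).
    """
    stat = 2.0 * (llf - llnull)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods, chosen only to illustrate the formulas
llf_example, llnull_example = -280.0, -350.0

r2 = mcfadden_pseudo_r2(llf_example, llnull_example)
stat, p_value = llr_test_df1(llf_example, llnull_example)
print(round(r2, 3), stat, p_value)
```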

# 7. Substituting Coefficients in Logit Model Equation
<center>$odds = e^{(-1.7001 + 0.0051 \cdot x_1)}$</center>

Let's find odds for different durations:

**Making a list of different durations:**

In [None]:
# durations from 0 to 2000 in steps of 100
duration_list = list(range(0, 2001, 100))
print(duration_list)

**Defining a function to calculate odds using constructed model:**

In [None]:
def odds(durations):
    """Return the model's odds, rounded to 2 decimals, for each duration."""
    # coefficients are taken from the regression summary above
    return [round(np.exp(-1.7001 + 0.0051 * d), 2) for d in durations]

odds_list = odds(duration_list)
print(odds_list)

**Making a list of tuples containing two values. (i.e. duration and odds)**

In [None]:
durations_odds = zip(duration_list, odds_list)
durations_odds = list(durations_odds)
durations_odds

**Interpreting what we have found out:**

- odds < 1 means that failure is more probable than a successful subscription
- odds = 1 means that failure and a successful subscription are equally probable
- odds > 1 means that a successful subscription is more probable than failure

- **odds = x means that a successful subscription is x times as probable as failure**

**In this case,** the odds increase with duration, which supports the hypothesis we formed from the scatter plot: clients with a longer association with the bank have a higher chance of taking the term deposit.
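
Since odds = p / (1 - p), each odds value can also be turned back into a probability with p = odds / (1 + odds). The sketch below does this with the coefficients from section 7 and also computes the duration at which the odds equal 1 (the break-even point, -b0 / b1). The names `subscription_probability` and `break_even` are introduced here only for illustration.

```python
import math

B0, B1 = -1.7001, 0.0051  # coefficients quoted in section 7

def subscription_probability(duration):
    """Probability of subscription implied by the fitted odds equation."""
    odds = math.exp(B0 + B1 * duration)
    return odds / (1.0 + odds)

# Duration at which odds = 1, i.e. success and failure are equally likely
break_even = -B0 / B1

for d in (0, 100, 500, 1000):
    print(d, round(subscription_probability(d), 3))
print("break-even duration:", round(break_even, 1))
```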

## Additional: How does a single feature affect the odds?
**There is another way in which results can be interpreted.**

Let's take two durations and their odds as an example,

<center>odd_1 = e^(-1.7001 + 0.0051\*duration1)  --> eq 1</center>

<center>odd_2 = e^(-1.7001 + 0.0051\*duration2)  --> eq 2</center>

Taking the natural log of both equations,

<center>=> log(odd_1) = -1.7001 + 0.0051\*duration1  --> eq 1</center>

<center>=> log(odd_2) = -1.7001 + 0.0051\*duration2  --> eq 2</center>

Subtracting eq 1 from eq 2,

<center>=> log(odd_2) - log(odd_1) = -1.7001 + 0.0051\*duration2 - (-1.7001 + 0.0051\*duration1)</center>

<center>=> log(odd_2/odd_1) = 0.0051\*(duration2 - duration1)</center>

Exponentiating both sides,

<center>=> odd_2/odd_1 = e^(0.0051\*(duration2 - duration1))</center>

For a change in duration of 1 day,

<center>=> odd_2/odd_1 = e^(0.0051\*1)</center>

<center>=> odd_2/odd_1 = e^0.0051</center>

<center>=> odd_2/odd_1 = 1.005113027136717</center>

<center>=> odd_2 = 100.51% * odd_1</center>

From the above equation, we can see that a one-day increase in duration increases the odds by 0.51%.

From here on, we can use the following equation to find the impact on the odds of a one-unit increase in some feature $x_k$, holding all other features constant,

<center>$\frac{odds_2}{odds_1} = e^{b_k}$</center>
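
As a quick numerical check of the derivation, the sketch below compares the odds at a duration and at that duration plus one, using the coefficients from section 7. The ratio should equal e^0.0051 whatever the starting duration. The helper names `model_odds` and `odds_ratio` are hypothetical and used only for this illustration.

```python
import math

B0, B1 = -1.7001, 0.0051  # coefficients quoted in section 7

def model_odds(duration):
    """Odds from the fitted equation in section 7."""
    return math.exp(B0 + B1 * duration)

def odds_ratio(delta, coef=B1):
    """Multiplicative change in odds when the feature increases by delta."""
    return math.exp(coef * delta)

ratios = []
for start in (0, 500, 1500):
    ratio = model_odds(start + 1) / model_odds(start)
    ratios.append(ratio)
    print(start, ratio)  # the same ratio for every starting duration

print(odds_ratio(1))  # e^0.0051, the per-unit factor derived above
```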