# Logistic Regression

Nicely explained:
* https://towardsdatascience.com/logistic-regression-python-7c451928efee  

Worked example:
* https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8

__Logistic Regression is a supervised machine learning algorithm used for binary classification__.  
Logistic Regression __fits a linear decision boundary (a line, in two dimensions) to a dataset and then returns the probability__ that a new sample belongs to one of the two classes, based on which side of the boundary the sample falls and how far from it.
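
The description above can be sketched in a few lines. This is a minimal illustration, assuming a model with two features; the weights `w` and intercept `b` are hypothetical values chosen by hand, not fitted to any dataset. The linear score `X @ w + b` is passed through the sigmoid function, so a point exactly on the boundary (score 0) gets probability 0.5.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Probability of class 1 for each row of X under the linear score X @ w + b."""
    return sigmoid(X @ w + b)

# Hypothetical weights and intercept, chosen only for illustration.
w = np.array([1.0, -2.0])
b = 0.5

X = np.array([
    [0.0, 0.25],  # score = 0.0 - 0.5 + 0.5 = 0.0, exactly on the boundary
    [3.0, 0.0],   # score = 3.5, well on the class-1 side
    [0.0, 2.0],   # score = -3.5, well on the class-0 side
])
probs = predict_proba(X, w, b)

# Threshold the probabilities at 0.5 to get hard class labels.
labels = (probs >= 0.5).astype(int)
```

The further a sample lies from the boundary, the closer its probability gets to 0 or 1.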

### Logistic Regression Assumptions
* Binary logistic regression requires the dependent variable to be binary.
* For a binary regression, the factor level 1 of the dependent variable should represent the desired outcome.
* Only the meaningful variables should be included.
* The independent variables should be independent of each other. That is, the model should have little or no multicollinearity.
* The independent variables are linearly related to the log odds.
* Logistic regression requires quite large sample sizes.
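
The linearity assumption can be written out explicitly. The notation here is introduced only for illustration: $p$ is the probability of the outcome coded 1, $x_1, \dots, x_k$ are the independent variables, and $\beta_0, \dots, \beta_k$ are the coefficients.

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
$$

Under this form, each coefficient $\beta_j$ is the change in the log odds for a one-unit increase in $x_j$, with the other variables held fixed.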

### Difference between probability and odds
Odds are calculated by taking the number of events where something happened and dividing by the number of events where that same something didn’t happen. For example, if the odds of winning a game are 5 to 2, we calculate the ratio as 5/2=2.5.

Probability, on the other hand, is calculated by taking the number of events where something happened and dividing by the total number of events (counting both the events where it happened and the events where it didn’t). For example, the probability of winning a game with the same odds is 5/(5+2)=0.714.
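
The arithmetic above can be checked with a few small helpers. This is a minimal sketch; the function names are illustrative and not taken from any library.

```python
def odds_from_counts(happened, did_not_happen):
    """Odds: events where it happened divided by events where it did not."""
    return happened / did_not_happen

def probability_from_counts(happened, did_not_happen):
    """Probability: events where it happened divided by all events."""
    return happened / (happened + did_not_happen)

def probability_from_odds(odds):
    """Convert odds to a probability."""
    return odds / (1 + odds)

def odds_from_probability(p):
    """Convert a probability (p < 1) back to odds."""
    return p / (1 - p)

odds = odds_from_counts(5, 2)         # 5 to 2 gives 2.5
prob = probability_from_counts(5, 2)  # 5/7, about 0.714
```

The two conversions are inverses of each other: `probability_from_odds(2.5)` gives back 5/7, and `odds_from_probability(5/7)` gives back 2.5 (up to floating-point rounding).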

The dataset comes from the UCI Machine Learning repository (http://archive.ics.uci.edu/ml/index.php).

In [6]:
! curl -OL https://raw.githubusercontent.com/madmashup/targeted-marketing-predictive-engine/master/banking.csv



In [6]:
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import matplotlib.pyplot as plt 
plt.rc("font", size=14)

import seaborn as sns
sns.set(style="whitegrid", color_codes=True)  # white background with grid lines for all plots

In [2]:
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

The dataset provides the bank customers’ information. It includes 41,188 records and 21 fields.  
It is __related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe (1/0) to a term deposit (variable y).__ 

In [9]:
df = pd.read_csv("./banking.csv")
df.shape

(41188, 21)

In [10]:
df.describe()

| | age | duration | campaign | pdays | previous | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 |
| mean | 40.02406 | 258.28501 | 2.567593 | 962.475454 | 0.172963 | 0.081886 | 93.575664 | -40.5026 | 3.621291 | 5167.035911 | 0.112654 |
| std | 10.42125 | 259.279249 | 2.770014 | 186.910907 | 0.494901 | 1.57096 | 0.57884 | 4.628198 | 1.734447 | 72.251528 | 0.316173 |
| min | 17.0 | 0.0 | 1.0 | 0.0 | 0.0 | -3.4 | 92.201 | -50.8 | 0.634 | 4963.6 | 0.0 |
| 25% | 32.0 | 102.0 | 1.0 | 999.0 | 0.0 | -1.8 | 93.075 | -42.7 | 1.344 | 5099.1 | 0.0 |
| 50% | 38.0 | 180.0 | 2.0 | 999.0 | 0.0 | 1.1 | 93.749 | -41.8 | 4.857 | 5191.0 | 0.0 |
| 75% | 47.0 | 319.0 | 3.0 | 999.0 | 0.0 | 1.4 | 93.994 | -36.4 | 4.961 | 5228.1 | 0.0 |
| max | 98.0 | 4918.0 | 56.0 | 999.0 | 7.0 | 1.4 | 94.767 | -26.9 | 5.045 | 5228.1 | 1.0 |


In [11]:
df.sample(5)

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | duration | campaign | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25774 | 35 | management | married | university.degree | no | no | no | telephone | nov | mon | 986 | 1 | 7 | 3 | failure | -3.4 | 92.649 | -30.1 | 0.714 | 5017.5 | 0 |
| 9331 | 42 | management | married | high.school | no | no | no | telephone | jun | fri | 20 | 5 | 999 | 0 | nonexistent | 1.4 | 94.465 | -41.8 | 4.959 | 5228.1 | 0 |
| 8364 | 46 | blue-collar | married | basic.9y | no | yes | no | telephone | may | mon | 194 | 5 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.858 | 5191.0 | 0 |
| 24413 | 33 | technician | married | professional.course | no | yes | no | cellular | may | wed | 221 | 1 | 999 | 0 | nonexistent | -1.8 | 92.893 | -46.2 | 1.334 | 5099.1 | 0 |
| 38413 | 59 | admin. | married | university.degree | no | no | yes | cellular | sep | wed | 193 | 1 | 3 | 1 | success | -1.1 | 94.199 | -37.5 | 0.886 | 4963.6 | 1 |


In [16]:
df.count()
# Equivalent: df.notnull().sum()

age               41188
job               41188
marital           41188
education         41188
default           41188
housing           41188
loan              41188
contact           41188
month             41188
day_of_week       41188
duration          41188
campaign          41188
pdays             41188
previous          41188
poutcome          41188
emp_var_rate      41188
cons_price_idx    41188
cons_conf_idx     41188
euribor3m         41188
nr_employed       41188
y                 41188
dtype: int64

In [22]:
df["education"].nunique(), df["education"].unique()

(8, array(['basic.4y', 'unknown', 'university.degree', 'high.school',
        'basic.9y', 'professional.course', 'basic.6y', 'illiterate'],
       dtype=object))

In [23]:
df["job"].nunique(), df["job"].unique()

(12, array(['blue-collar', 'technician', 'management', 'services', 'retired',
        'admin.', 'housemaid', 'unemployed', 'entrepreneur',
        'self-employed', 'unknown', 'student'], dtype=object))

In [24]:
df.describe(include="all")

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | duration | campaign | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 41188.0 | 41188 | 41188 | 41188 | 41188 | 41188 | 41188 | 41188 | 41188 | 41188 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 |
| unique | | 12 | 4 | 8 | 3 | 3 | 3 | 2 | 10 | 5 | | | | | 3 | | | | | | |
| top | | admin. | married | university.degree | no | yes | no | cellular | may | thu | | | | | nonexistent | | | | | | |
| freq | | 10422 | 24928 | 12168 | 32588 | 21576 | 33950 | 26144 | 13769 | 8623 | | | | | 35563 | | | | | | |
| mean | 40.02406 | | | | | | | | | | 258.28501 | 2.567593 | 962.475454 | 0.172963 | | 0.081886 | 93.575664 | -40.5026 | 3.621291 | 5167.035911 | 0.112654 |
| std | 10.42125 | | | | | | | | | | 259.279249 | 2.770014 | 186.910907 | 0.494901 | | 1.57096 | 0.57884 | 4.628198 | 1.734447 | 72.251528 | 0.316173 |
| min | 17.0 | | | | | | | | | | 0.0 | 1.0 | 0.0 | 0.0 | | -3.4 | 92.201 | -50.8 | 0.634 | 4963.6 | 0.0 |
| 25% | 32.0 | | | | | | | | | | 102.0 | 1.0 | 999.0 | 0.0 | | -1.8 | 93.075 | -42.7 | 1.344 | 5099.1 | 0.0 |
| 50% | 38.0 | | | | | | | | | | 180.0 | 2.0 | 999.0 | 0.0 | | 1.1 | 93.749 | -41.8 | 4.857 | 5191.0 | 0.0 |
| 75% | 47.0 | | | | | | | | | | 319.0 | 3.0 | 999.0 | 0.0 | | 1.4 | 93.994 | -36.4 | 4.961 | 5228.1 | 0.0 |


In [25]:
df["marital"].nunique(), df["marital"].unique()

(4, array(['married', 'single', 'divorced', 'unknown'], dtype=object))

In [30]:
len(df), len(df.dropna())

(41188, 41188)
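
`dropna()` removes nothing here because the file contains no NaN values. The categorical columns do, however, contain an `unknown` category (visible in the `education`, `job`, and `marital` values above), which pandas does not treat as missing. A minimal sketch of counting those entries per column follows; `toy` is a small made-up frame, not rows from `banking.csv`, and on the real data the same expressions would be applied to `df`.

```python
import pandas as pd

# Hypothetical miniature frame that mimics the column types of banking.csv.
toy = pd.DataFrame({
    "age": [35, 42, 46, 33],
    "job": ["management", "unknown", "blue-collar", "technician"],
    "marital": ["married", "married", "unknown", "unknown"],
})

# There are no NaN values, so dropna() keeps every row.
rows_before, rows_after = len(toy), len(toy.dropna())

# Count "unknown" entries in the non-numeric columns only.
categorical = toy.select_dtypes(exclude="number")
unknown_counts = (categorical == "unknown").sum()
```

Whether to keep `unknown` as its own category or handle it some other way is a modelling decision that this sketch leaves open.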

### Input variables
1. age (numeric)
2. job: type of job (categorical: “admin.”, “blue-collar”, “entrepreneur”, “housemaid”, “management”, “retired”, “self-employed”, “services”, “student”, “technician”, “unemployed”, “unknown”)
3. marital: marital status (categorical: “divorced”, “married”, “single”, “unknown”)
4. education (categorical: “basic.4y”, “basic.6y”, “basic.9y”, “high.school”, “illiterate”, “professional.course”, “university.degree”, “unknown”)
5. default: has credit in default? (categorical: “no”, “yes”, “unknown”)
6. housing: has housing loan? (categorical: “no”, “yes”, “unknown”)
7. loan: has personal loan? (categorical: “no”, “yes”, “unknown”)
8. contact: contact communication type (categorical: “cellular”, “telephone”)
9. month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
10. day_of_week: last contact day of the week (categorical: “mon”, “tue”, “wed”, “thu”, “fri”)
11. duration: last contact duration, in seconds (numeric). Important note: this attribute strongly affects the output target (e.g., if duration=0 then y=“no”). The duration is not known before a call is made, and once the call ends, y is known. This input should therefore be included only for benchmarking and discarded if the goal is a realistic predictive model.
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed after the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: “failure”, “nonexistent”, “success”)
16. emp_var_rate: employment variation rate (numeric)
17. cons_price_idx: consumer price index (numeric)
18. cons_conf_idx: consumer confidence index (numeric)
19. euribor3m: Euribor 3-month rate (numeric)
20. nr_employed: number of employees (numeric)
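
Two of the notes in the list above translate directly into preprocessing steps: `duration` should be dropped for a realistic model, and `pdays` uses 999 as a sentinel rather than a real day count. A minimal sketch on a made-up frame follows. The helper name `prepare_features` and the new column `previously_contacted` are illustrative assumptions, not part of the dataset or of any library.

```python
import pandas as pd

def prepare_features(frame):
    """Return a copy without `duration` and with the pdays sentinel made explicit."""
    # drop() returns a new DataFrame, so the input frame is left untouched.
    out = frame.drop(columns=["duration"])
    # 999 means "not previously contacted"; expose that as a 0/1 flag.
    out["previously_contacted"] = (out["pdays"] != 999).astype(int)
    return out

# Hypothetical rows shaped like banking.csv (values chosen for illustration).
toy = pd.DataFrame({
    "age": [35, 42, 59],
    "duration": [986, 20, 193],
    "pdays": [7, 999, 3],
    "y": [0, 0, 1],
})
features = prepare_features(toy)
```

Keeping the raw `pdays` column alongside the flag is one possible choice; a linear model would otherwise read 999 as a very long gap rather than as "never contacted".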

In [31]:
df["education"].nunique(), df["education"].unique()

(8, array(['basic.4y', 'unknown', 'university.degree', 'high.school',
        'basic.9y', 'professional.course', 'basic.6y', 'illiterate'],
       dtype=object))

In [33]:
# Collapse the three "basic.*" schooling levels into a single "Basic" category.
df['education'] = df['education'].replace(['basic.4y', 'basic.6y', 'basic.9y'], 'Basic')

In [34]:
df["education"].nunique(), df["education"].unique()

(6, array(['Basic', 'unknown', 'university.degree', 'high.school',
        'professional.course', 'illiterate'], dtype=object))