# Logistic Regression

* https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
* https://towardsdatascience.com/understanding-logistic-regression-using-a-simple-example-163de52ea900

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import seaborn as sns
import matplotlib.pyplot as plt
# A single call is enough; a second sns.set() would override the first
sns.set(style="whitegrid", color_codes=True)

### Read Data

In [3]:
dt_train=pd.read_csv('../TrainData.csv')

In [4]:
dt_test=pd.read_csv('../TestData.csv')

### Theory

Logistic regression is a machine learning classification algorithm that predicts the probability of a categorical dependent variable. In binary logistic regression, the dependent variable is coded as 1 (yes, success, etc.) or 0 (no, failure, etc.), so the model predicts P(Y=1) as a function of X.
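As a minimal sketch of what "P(Y=1) as a function of X" means: the model passes a linear combination of the features through the sigmoid function. The weights `w`, intercept `b`, and sample `x` below are made-up values for illustration, not values fitted to this dataset.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, intercept, and one sample (illustrative values only)
w = np.array([0.8, -1.5])
b = 0.2
x = np.array([1.0, 0.5])

z = np.dot(w, x) + b   # linear score, which is the log odds of Y=1
p = sigmoid(z)         # modelled P(Y=1 | x)
print(z, p)
```

Taking `np.log(p / (1 - p))` recovers `z`, which is why the assumptions below are stated in terms of the log odds.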

Logistic Regression Assumptions

* Binary logistic regression requires the dependent variable to be binary.
* For a binary regression, the factor level 1 of the dependent variable should represent the desired outcome.
* Only the meaningful variables should be included.
* The independent variables should be independent of each other. That is, the model should have little or no multicollinearity.
* The independent variables are linearly related to the log odds.
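* Logistic regression requires a fairly large sample size, because its coefficients are estimated by maximum likelihood, which becomes unreliable on small samples.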
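One rough way to check the multicollinearity assumption is to look for highly correlated feature pairs. The sketch below uses synthetic data, and both the column names and the 0.9 cut-off are assumptions chosen for illustration. In this notebook the same idea could be applied to `X_train`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)

# Synthetic features: "b" is nearly a rescaled copy of "a", "c" is unrelated
demo = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=n),
    "c": rng.normal(size=n),
})

corr = demo.corr().abs()
# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > 0.9]   # 0.9 is an arbitrary cut-off
print(high_pairs)
```

A pairwise correlation check only catches relationships between two features at a time, so it is a first pass rather than a full diagnosis.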

### Implementation

#### Dummy Variables

Dummy variables are not necessary in this case: the discrete variables carry their own numeric meaning rather than serving an ordinal or categorical purpose.

In [51]:
# Not needed here, but listed for reference in case dummy variables are required later
# sex = pd.get_dummies(train['Sex'], drop_first=True)
# embark = pd.get_dummies(train['Embarked'], drop_first=True)

#### X & Y

In [5]:
# The first column is the target; the remaining columns are the features
X_train = dt_train.iloc[:, 1:]
y_train = dt_train.iloc[:, 0]
X_test = dt_test.iloc[:, 1:]
y_test = dt_test.iloc[:, 0]

#### The Model


In [6]:
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()

In [7]:
logmodel.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()
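The convergence warning above suggests scaling the data or raising `max_iter`. A minimal sketch of both, on synthetic stand-in data rather than the notebook's files, could look like this. The value `max_iter=1000` and the `make_classification` settings are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; in the notebook this would be X_train, y_train
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo[:, 0] *= 1000.0  # exaggerate one feature's scale

# Standardise the features, then fit with a higher iteration limit
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

# Number of solver iterations actually used
n_iter = scaled_model.named_steps["logisticregression"].n_iter_
print(n_iter)
```

Wrapping the scaler and the model in one pipeline means the scaling learned on the training data is reused automatically when calling `predict` or `predict_proba` on test data.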

In [8]:
predictions = logmodel.predict(X_test)

#### Check Accuracy and Calculate Cost

In [9]:
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.94      0.99      0.97     33589
           1       0.59      0.15      0.24      2447

    accuracy                           0.94     36036
   macro avg       0.77      0.57      0.60     36036
weighted avg       0.92      0.94      0.92     36036
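The report shows high accuracy (0.94) alongside low recall for class 1 (0.15), which is typical when one class dominates. To make the definitions concrete, here is a small sketch that computes the same metrics by hand. The labels and predictions are toy values, not the notebook's data.

```python
import numpy as np

# Toy labels and predictions (illustrative only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.mean(y_true == y_pred)
print(precision, recall, f1, accuracy)
```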



#### If the output is a probability, we can calculate the cost

In [10]:
pred_prob=logmodel.predict_proba(X_test)

In [12]:
pred_prob

array([[0.85581462, 0.14418538],
       [0.98827879, 0.01172121],
       [0.95848708, 0.04151292],
       ...,
       [0.98675724, 0.01324276],
       [0.96585111, 0.03414889],
       [0.89544991, 0.10455009]])

In [13]:
pred_prob[:,1]

array([0.14418538, 0.01172121, 0.04151292, ..., 0.01324276, 0.03414889,
       0.10455009])
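For a binary model, the hard labels correspond to cutting the class-1 probability at 0.5. The sketch below applies that cut-off, and a lower hypothetical one, to the six probabilities visible in the output above. The 0.1 threshold is an assumption for illustration, not a recommended value.

```python
import numpy as np

# The six class-1 probabilities visible in the output above
p1 = np.array([0.14418538, 0.01172121, 0.04151292,
               0.01324276, 0.03414889, 0.10455009])

default_labels = (p1 > 0.5).astype(int)  # the usual 0.5 cut-off
lower_labels = (p1 > 0.1).astype(int)    # hypothetical lower cut-off
print(default_labels)
print(lower_labels)
```

Lowering the threshold flags more samples as class 1, which tends to raise recall at the expense of precision.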

In [15]:
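def cost(truth, pred):
    # Mean binary cross-entropy (log loss):
    # truth holds the 0/1 labels, pred holds the predicted P(y=1)
    out = np.sum(-truth*np.log(pred) - (1-truth)*np.log(1-pred)) / len(truth)
    return out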

In [16]:
c=cost(y_test,pred_prob[:,1])
print('The cost is',c)

The cost is 0.21378883611653318
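The `cost` function above is the mean binary cross-entropy, which scikit-learn exposes as `sklearn.metrics.log_loss`. As a sanity check, the sketch below compares a hand-written version against it on toy values. The helper name `cost_clipped` and the `eps` value are assumptions. The clipping is there only to avoid `log(0)` if a probability is exactly 0 or 1.

```python
import numpy as np
from sklearn.metrics import log_loss

def cost_clipped(truth, pred, eps=1e-15):
    """Mean binary cross-entropy with probabilities clipped away from 0 and 1."""
    truth = np.asarray(truth, dtype=float)
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1 - eps)
    return np.mean(-truth * np.log(pred) - (1 - truth) * np.log(1 - pred))

# Toy labels and class-1 probabilities (illustrative only)
y_demo = np.array([0, 0, 1, 1])
p_demo = np.array([0.1, 0.4, 0.35, 0.8])

manual = cost_clipped(y_demo, p_demo)
reference = log_loss(y_demo, p_demo)
print(manual, reference)
```

In the notebook, the equivalent call would be `log_loss(y_test, pred_prob[:, 1])`.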


### Takeaway