# Logistic Regression

Logistic regression is a linear classification algorithm. Classification is the task of assigning a category (class) to a new instance, based on the properties of each class learned from existing labeled data, called the training set. Examples include classifying emails as spam or not spam, or using height, weight, and other attributes to classify a person as fit or unfit.

**Why should we learn Logistic Regression?**

- It is often the first supervised learning algorithm data science practitioners reach for when they need a strong baseline against which to measure the uplift of more complex models.

- It is a fundamental, powerful, and easy-to-implement algorithm. It is also intuitive and interpretable, because the fitted coefficients describe the relationship between each feature and the response variable.

**Is logistic regression a regressor or a classifier?**

Logistic regression is usually used as a classifier because it predicts discrete classes.
Technically, however, it outputs a continuous value for each prediction: the estimated probability that the instance belongs to the positive class.
The class label is then obtained by thresholding that probability, commonly at 0.5.

So logistic regression produces a probability score along with its classification prediction.
It is fair to call it a classifier because it is used for classification, although under the hood it is a regression model fitted to the log-odds of the class.
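The relationship between the continuous output and the predicted class can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation: the weights, bias, and instance below are hypothetical values chosen only for the example, and the 0.5 threshold is the common default rather than a requirement.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Estimated probability of the positive class for one instance."""
    # Linear part: bias plus the weighted sum of the features
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict(features, weights, bias, threshold=0.5):
    """Turn the probability into a discrete class label (0 or 1)."""
    return int(predict_proba(features, weights, bias) >= threshold)


# Hypothetical parameters and instance, for illustration only
weights = [1.5, -2.0]
bias = 0.25
instance = [1.0, 0.5]

p = predict_proba(instance, weights, bias)  # linear part is 0.25 + 1.5 - 1.0 = 0.75
label = predict(instance, weights, bias)
print(f"probability={p:.3f}, class={label}")
```

The "regression" part is the continuous probability `p`; the "classification" part is only the final thresholding step.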


## What parameters can be tuned in logistic regression models?
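The parameters that appear in this notebook are `C` (inverse regularization strength), `penalty`, `solver`, and `max_iter`. One common way to tune them is a cross-validated grid search. The sketch below is an assumption about how that could look, not a method taken from the original notebook: it uses synthetic data from `make_classification` as a stand-in for the Titanic features, and the grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: 9 numeric features, binary target)
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=42)

# Hypothetical grid: smaller C means stronger regularization
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "solver": ["liblinear", "lbfgs"],
}

search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refitted on all of the supplied data with the best parameter combination.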





## Using logistic regression in Titanic survival prediction

In [1]:
# import libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

**Loading the final dataframe**

To start modeling, we focus on the training data and temporarily set aside the dataset for which we need to make predictions.

Let's start by loading our clean Titanic training data and naming it final_df.

In [5]:
# loading clean train dataset

final_df = pd.read_csv('datasets/clean_titanic_train.csv')


In [10]:
# Let's take a look at our first 10 rows to verify all data is numerical

final_df.head(10)

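| Unnamed: 0.1 | Unnamed: 0 | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | fam_mbrs |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | 1 | 0.271174 | 1 | 0 | 0.027567 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0.472229 | 1 | 0 | 0.271039 | 1 | 1 |
| 2 | 2 | 1 | 3 | 0 | 0.321438 | 0 | 0 | 0.030133 | 0 | 0 |
| 3 | 3 | 1 | 1 | 0 | 0.434531 | 1 | 0 | 0.201901 | 0 | 1 |
| 4 | 4 | 0 | 3 | 1 | 0.434531 | 0 | 0 | 0.030608 | 0 | 0 |
| 5 | 5 | 0 | 3 | 1 | 0.346569 | 0 | 0 | 0.032161 | 2 | 0 |
| 6 | 6 | 0 | 1 | 1 | 0.673285 | 0 | 0 | 0.197196 | 0 | 0 |
| 7 | 7 | 0 | 3 | 1 | 0.019854 | 3 | 1 | 0.080133 | 0 | 4 |
| 8 | 8 | 1 | 3 | 0 | 0.334004 | 0 | 2 | 0.042332 | 0 | 2 |
| 9 | 9 | 1 | 2 | 0 | 0.170646 | 1 | 0 | 0.114338 | 1 | 1 |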


**Separate features and target as X and y**

In [17]:
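# Drop the target and the leftover index column 'Unnamed: 0'
# ('Unnamed: 0.1', another leftover index, stays in X: hence nine coefficients below)
X = final_df.drop(['Survived', 'Unnamed: 0'], axis=1)
y = final_df['Survived']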

**Split the dataframe into a training set and a testing set**

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [13]:
from sklearn.linear_model import LogisticRegression

# Instantiate Logistic Regression

model = LogisticRegression()

In [14]:
# Fit the model to the training data

model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
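The warning above suggests two remedies: raising `max_iter` or scaling the features. The sketch below shows both options under stated assumptions: it runs on synthetic data from `make_classification` rather than the Titanic frame, and one column is deliberately inflated to mimic an unscaled, index-like feature. The names `X_demo`, `y_demo`, `model_more_iter`, and `scaled_model` are introduced here for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption: 9 numeric features)
X_demo, y_demo = make_classification(n_samples=500, n_features=9, random_state=42)
X_demo[:, 0] *= 1000  # mimic a feature on a much larger scale than the others

# Option 1: give the solver more iterations than the default of 100
model_more_iter = LogisticRegression(max_iter=1000)
model_more_iter.fit(X_demo, y_demo)

# Option 2: standardize the features first, then fit with default settings
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used
print(model_more_iter.n_iter_, scaled_model[-1].n_iter_)
```

Scaling usually addresses the cause rather than the symptom, since the default solver tends to converge more slowly when features live on very different scales.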

In [15]:
model.coef_

array([[ 2.93931897e-04, -8.75048187e-01, -2.54984177e+00,
        -2.72265416e-02, -9.61048282e-02, -2.07671242e-02,
         1.12536765e+00,  2.30871173e-01, -1.16871952e-01]])
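Each coefficient is a change in the log-odds of survival per one-unit increase in the corresponding feature, so exponentiating it gives an odds ratio that is often easier to read. The sketch below applies that to the coefficients printed above. The feature order is an assumption: it takes the columns of `final_df` in their displayed order after dropping `Survived` and `Unnamed: 0`.

```python
import math

# Assumed feature order: final_df columns minus 'Survived' and 'Unnamed: 0'
feature_names = [
    "Unnamed: 0.1", "Pclass", "Sex", "Age", "SibSp",
    "Parch", "Fare", "Embarked", "fam_mbrs",
]

# Coefficients copied from the model.coef_ output above
coefficients = [
    2.93931897e-04, -8.75048187e-01, -2.54984177e+00,
    -2.72265416e-02, -9.61048282e-02, -2.07671242e-02,
    1.12536765e+00, 2.30871173e-01, -1.16871952e-01,
]

# exp(coefficient) is the multiplicative change in the odds per unit increase
odds_ratios = {name: math.exp(c) for name, c in zip(feature_names, coefficients)}

# Print from the strongest negative effect to the strongest positive effect
for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:>12}: odds ratio = {ratio:.3f}")
```

An odds ratio below 1 means the feature lowers the odds of survival as it increases, and a ratio above 1 raises them, holding the other features fixed. Under the assumed order, moving `Sex` from 0 to 1 multiplies the odds by roughly 0.08. Because the features are on different scales, the ratios are not directly comparable with one another.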

In [None]:
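# Tune hyperparameters: stronger L2 regularization (smaller C) with the liblinear solver

optimized_model = LogisticRegression(solver='liblinear', penalty='l2', random_state=42, C=0.01)

# The model must be fitted before coef_ is available
optimized_model.fit(X_train, y_train)

optimized_model.coef_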
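Coefficients alone do not show whether a tuned model is better; that takes a comparison on held-out data. The following is a minimal sketch of such a comparison, assuming synthetic data in place of the Titanic split. The names `baseline` and `regularized` are hypothetical, and accuracy is just one possible metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: 9 numeric features, binary target)
X_demo, y_demo = make_classification(n_samples=400, n_features=9, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Default settings, with extra iterations so the solver can converge
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Strongly regularized variant (L2 is the default penalty)
regularized = LogisticRegression(
    solver="liblinear", C=0.01, random_state=42
).fit(X_tr, y_tr)

# Compare both models on data neither of them saw during fitting
scores = {
    "baseline": accuracy_score(y_te, baseline.predict(X_te)),
    "regularized": accuracy_score(y_te, regularized.predict(X_te)),
}
print(scores)
```

On the real data, the same pattern would use `X_train`, `X_test`, `y_train`, and `y_test` from the split above.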

Sources:

- https://towardsdatascience.com/a-handbook-for-logistic-regression-bb2d0dc6d8a8
- https://www.displayr.com/how-to-interpret-logistic-regression-coefficients/