# Logistic Regression

## 0) Warmup

1. What are we trying to predict in this week's project?
- Which values does y take on?
- What information do we use to make the prediction?

## 1) Predicting Probabilities

Instead of directly predicting the binary outcome (0, 1), we are actually predicting a probability of "success" (belonging to class 1).

$f(X) = \hat{p}(X)$

where X are input features such as *age*, *Pclass*, *gender*, ...

How do we do that?

- The predicted probability is turned into a class label with a threshold value of 0.5: above 0.5 we predict class 1, otherwise class 0
- The parameters are responsible for the predictions
    - w are the weights of the input features --> they determine the steepness (sensitivity) of the curve
    - b is a parameter that shifts the curve to the left (>0) or right (<0), assuming a positive w. It determines the predicted probability for x=0
- How do we find the parameters? --> The loss is minimized --> Most machine learning algorithms have some kind of loss (objective function) that is minimized during training.
- For logistic regression, the minimization of the loss (the log loss) is equivalent to the maximization of the likelihood of observing the data points that we have observed
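
The points above can be made concrete with a short sketch. It assumes the standard logistic (sigmoid) form $\hat{p}(x) = 1 / (1 + e^{-(wx + b)})$ and the log loss as the objective. The helper names `sigmoid` and `log_loss_manual` and the toy numbers are made up for illustration and are not part of the project.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_manual(y_true, p_hat):
    """Average negative log-likelihood of binary labels under p_hat."""
    y_true = np.asarray(y_true, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    return -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

# Toy data: one feature, binary labels (made up for illustration)
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1, 1])

# b sets the predicted probability at x = 0; with b = 0 it is exactly 0.5
p_at_zero = sigmoid(1.0 * 0.0 + 0.0)

# Parameters that describe the data better give a lower loss
loss_good = log_loss_manual(y, sigmoid(2.0 * x + 1.0))
loss_bad = log_loss_manual(y, sigmoid(-2.0 * x - 1.0))
print(p_at_zero, loss_good, loss_bad)
```

Fitting the model amounts to searching for the w and b with the lowest loss, which is the same as the highest likelihood of the observed labels.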

## 2) Let's do it

In [2]:
# Import the necessary packages

import pandas as pd

# Import logistic regression
from sklearn.linear_model import LogisticRegression

In [3]:
# Import the dataset
df = pd.read_csv("train.csv", index_col=0)
df["SibSp"].value_counts()

0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64

In [4]:
# Define X
# Try passenger class. Double brackets return a DataFrame: sklearn expects X to be 2-D, so a 1-D Series does not work

X = df[["Pclass"]]
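
The difference between single and double brackets is easiest to see in the shapes. This is a minimal sketch on a made-up stand-in DataFrame (`df_demo` is hypothetical, not the Titanic file):

```python
import pandas as pd

# Tiny stand-in for the Titanic data (values are made up)
df_demo = pd.DataFrame({"Pclass": [1, 3, 2, 3], "Survived": [1, 0, 1, 0]})

as_series = df_demo["Pclass"]     # single brackets -> 1-D Series
as_frame = df_demo[["Pclass"]]    # double brackets -> 2-D DataFrame with one column

print(as_series.shape)
print(as_frame.shape)
```

The 2-D version has one row per passenger and one column per feature, which is the layout sklearn expects for X.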

In [5]:
# Define y
# y should be 1-D, so a Series works here
y = df["Survived"]

In [6]:
# Split the data into a training set and a test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
# By default it splits the data 75-25 (train-test)
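
Because the split is random, the fitted parameters change from run to run. One possible way to make the split reproducible is sketched below on made-up stand-in data. The value `42` for `random_state` is arbitrary, and `stratify` is an optional extra that keeps the share of survivors similar in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in data with 100 rows
df_demo = pd.DataFrame({
    "Pclass": [1, 2, 3, 3] * 25,
    "Survived": [1, 0, 0, 1] * 25,
})
X_demo = df_demo[["Pclass"]]
y_demo = df_demo["Survived"]

# random_state fixes the shuffle; stratify keeps the class balance in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
```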


In [7]:
# Create a model

m = LogisticRegression()

In [8]:
# Train a model

m.fit(X_train, y_train) # <-- whole iterative process of finding parameters

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [9]:
# What are the parameter coefficients?
w = m.coef_[0] 
w

array([-0.80907802])

In [10]:
# What is the intercept?
b = m.intercept_
b

array([1.40767376])
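
With only one feature, the fitted parameters are enough to compute the predicted survival probability for each passenger class by hand. The sketch below assumes the sigmoid form $\hat{p}(x) = 1 / (1 + e^{-(wx + b)})$ and copies w and b from the outputs above; a rerun with a different random split will give different numbers.

```python
import numpy as np

# Parameters copied from the outputs above
w = np.array([-0.80907802])
b = np.array([1.40767376])

def sigmoid(z):
    """Logistic function: maps a real number to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

pclass = np.array([1, 2, 3])
p_survive = sigmoid(w[0] * pclass + b[0])

for c, p in zip(pclass, p_survive):
    print(f"Pclass {c}: P(Survived=1) = {p:.4f}")
```

Since w is negative, a higher class number lowers the predicted probability. The three values agree with the second column of the `m.predict_proba(X_test)` output in this notebook.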

In [11]:
# Use the model to make predictions on the unseen (test) data

# ypred_train = m.predict(X_train)  # <-- predictions on the seen (training) data
ypred_test = m.predict(X_test) # <-- the predicted y values

ypred_test

array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
       0, 1, 1])

In [12]:
m.predict_proba(X_test)  # <-- the predicted probabilities; columns are [P(y=0), P(y=1)]

array([[0.73488694, 0.26511306],
       [0.73488694, 0.26511306],
       [0.55242716, 0.44757284],
       [0.73488694, 0.26511306],
       [0.35466503, 0.64533497],
       [0.55242716, 0.44757284],
       [0.73488694, 0.26511306],
       [0.55242716, 0.44757284],
       [0.35466503, 0.64533497],
       [0.73488694, 0.26511306],
       [0.73488694, 0.26511306],
       [0.73488694, 0.26511306],
       [0.73488694, 0.26511306],
       [0.73488694, 0.26511306],
       [0.55242716, 0.44757284],
       [0.73488694, 0.26511306],
       [0.73488694, 0.26511306],
       [0.73488694, 0.26511306],
       [0.73488694, 0.26511306],
       [0.73488694, 0.26511306],
       [0.35466503, 0.64533497],
       [0.55242716, 0.44757284],
       [0.73488694, 0.26511306],
       [0.55242716, 0.44757284],
       [0.55242716, 0.44757284],
       [0.35466503, 0.64533497],
       [0.35466503, 0.64533497],
       [0.35466503, 0.64533497],
       [0.35466503, 0.64533497],
       [0.35466503, 0.64533497],
       [0.
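
The threshold of 0.5 is what links these probabilities to the predicted labels. A minimal sketch on made-up stand-in data (`X_demo`, `y_demo` and `m_demo` are hypothetical names) checks that thresholding the second column reproduces what `predict` returns:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up stand-in data: lower class numbers survive more often
X_demo = pd.DataFrame({"Pclass": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]})
y_demo = pd.Series([1, 1, 1, 1, 1, 0, 0, 0, 0, 1])

m_demo = LogisticRegression().fit(X_demo, y_demo)

proba_class1 = m_demo.predict_proba(X_demo)[:, 1]   # second column = P(y=1)
labels_from_threshold = (proba_class1 > 0.5).astype(int)

# Compare the thresholded probabilities with the labels from predict
print(np.array_equal(labels_from_threshold, m_demo.predict(X_demo)))
```

Each row of `predict_proba` sums to 1, so the first column is simply one minus the second.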

#### In the long run, we want to use more than one predictor (X-variable) 

In [13]:
df_multi = df.dropna(subset=["Age"])

In [14]:
X_multi = df_multi[['Pclass', 'Age']]
y_multi = df_multi['Survived']
# Can then use m.fit - here skipping the train-test split

m.fit(X_multi, y_multi)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [15]:
m.coef_

array([[-1.22653571, -0.04149665]])
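
One common way to read these coefficients is through odds: in a logistic regression, raising a feature by one unit multiplies the odds $p / (1 - p)$ by $e^{w}$ while the other feature is held fixed. The sketch below copies the coefficients from the output above and assumes they are in the column order `['Pclass', 'Age']` used for `X_multi`.

```python
import numpy as np

# Coefficients copied from the output above, in the order ['Pclass', 'Age']
coef = np.array([[-1.22653571, -0.04149665]])

# exp(coefficient) is the factor by which the odds p / (1 - p) change
# when the feature increases by one unit and the other feature is held fixed
odds_ratios = np.exp(coef[0])

for name, ratio in zip(["Pclass", "Age"], odds_ratios):
    print(f"{name}: odds multiplied by {ratio:.3f} per unit increase")
```

Both factors are below 1, so a higher class number and a higher age both lower the predicted odds of survival in this model.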

In [16]:
# Accuracy - what fraction of the data points were classified correctly? (Here measured on the training data)
m.score(X_multi, y_multi)

0.696078431372549
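
The score above is computed on the same rows the model was fitted on. To estimate how the model does on new passengers, the accuracy would be measured on a held-out test set. The sketch below shows the pattern on synthetic stand-in data; the data-generating numbers and the names `df_demo` and `m_demo` are assumptions for illustration, not results from the Titanic file.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up stand-in data with the same column names as above
rng = np.random.default_rng(0)
n = 200
df_demo = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Age": rng.uniform(1, 70, size=n),
})
# Synthetic labels: lower class number and lower age make survival more likely
logit = 3.0 - 1.0 * df_demo["Pclass"] - 0.03 * df_demo["Age"]
df_demo["Survived"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    df_demo[["Pclass", "Age"]], df_demo["Survived"], random_state=0
)

m_demo = LogisticRegression().fit(X_tr, y_tr)

train_acc = m_demo.score(X_tr, y_tr)   # accuracy on data the model has seen
test_acc = m_demo.score(X_te, y_te)    # accuracy on held-out data
manual_acc = accuracy_score(y_te, m_demo.predict(X_te))  # same number as test_acc
print(train_acc, test_acc, manual_acc)
```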