# 1. Logistic regression

Logistic regression is a widely used statistical method for binary classification. Unlike linear regression, which predicts continuous values, it models the probability of a categorical outcome. This section gives an overview of logistic regression, how it differs from linear regression, and its typical use cases.

## 1.1. Overview of Logistic Regression

**Purpose**

Logistic regression is used to model the probability that a given input belongs to a particular category. It is particularly suited for binary classification tasks where the outcome is categorical, such as yes/no, true/false, or success/failure.

**Logistic Function**

The key component of logistic regression is the logistic function (also known as the sigmoid function), which maps any real-valued number into a value between 0 and 1:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

Where $z$ is a linear combination of the input features:

$$
z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n
$$
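
As a quick illustration of the logistic function defined above, here is a minimal NumPy sketch. The function name `sigmoid` and the sample inputs are chosen only for this example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Sample inputs chosen for illustration
z_values = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
probs = sigmoid(z_values)

# Large negative z gives values near 0, z = 0 gives exactly 0.5,
# and large positive z gives values near 1.
print(np.round(probs, 4))
```

The outputs increase monotonically with $z$ and are symmetric around 0.5, which is why $z = 0$ corresponds to a predicted probability of one half.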

**Logistic Regression Model**

The logistic regression model predicts the probability $P(y = 1 \mid x)$ as:

$$
P(y = 1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n)}}
$$

Where:
- $P(y = 1 \mid x)$ is the probability of the outcome being 1, given the input features $x$.
- $\beta_0$ is the intercept.
- $\beta_1, \beta_2, \ldots, \beta_n$ are the coefficients for the input features $x_1, x_2, \ldots, x_n$.
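
To make the formula concrete, the sketch below evaluates it for a single input with two features. The coefficient values are hypothetical, picked only for illustration and not fitted to any data.

```python
import numpy as np

# Hypothetical coefficients (assumptions for this example, not fitted values)
beta_0 = -1.5                     # intercept
beta = np.array([0.8, -0.4])      # coefficients for x_1 and x_2

x = np.array([2.0, 1.0])          # one input with two features

z = beta_0 + beta @ x             # linear combination: -1.5 + 1.6 - 0.4 = -0.3
p = 1.0 / (1.0 + np.exp(-z))      # P(y = 1 | x)
log_odds = np.log(p / (1.0 - p))  # taking the log-odds recovers z

print(f"z = {z:.2f}, P(y=1|x) = {p:.3f}, log-odds = {log_odds:.2f}")
```

Here $z$ is slightly negative, so the predicted probability falls a little below 0.5 (roughly 0.43). Because the log-odds equal $z$, each coefficient $\beta_j$ is the change in log-odds for a one-unit increase in $x_j$.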


## 1.2. Differences Between Logistic and Linear Regression

1. **Nature of the Dependent Variable**

    - **Linear Regression:** Used for predicting a continuous dependent variable.
    
    - **Logistic Regression:** Used for predicting a categorical dependent variable, typically binary.

2. **Output Interpretation**

    - **Linear Regression:** Directly predicts the outcome value, which can range from $-\infty$ to $+\infty$.
    
    - **Logistic Regression:** Predicts the probability of the outcome belonging to a certain class, which ranges between 0 and 1.

3. **Link Function**

    - **Linear Regression:** Uses an identity link function, meaning the prediction is a direct linear combination of inputs.
    
    - **Logistic Regression:** Uses the logit (log-odds) link function. Its inverse, the logistic (sigmoid) function, maps the linear combination of inputs to a probability.


4. **Cost Function**

    - **Linear Regression:** Typically uses Mean Squared Error (MSE) as the cost function.

      $$
      \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
      $$

      Where:
      - $n$ is the number of observations.
      - $y_i$ is the actual value.
      - $\hat{y}_i$ is the predicted value.

    - **Logistic Regression:** Uses the log-loss (or cross-entropy loss) function, which is suited for classification tasks.

      $$
      \text{Log-Loss} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]
      $$

      Where:
      - $m$ is the number of instances.
      - $y_i$ is the true label for instance $i$ (0 or 1).
      - $p_i$ is the predicted probability for instance $i$ that $y = 1$.

5. **Optimization**
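    - **Linear Regression:** Has a closed-form solution (the normal equations) and can also be fit with gradient descent.
    - **Logistic Regression:** Has no closed-form solution for the coefficients, so it is fit with iterative optimization algorithms such as gradient descent or Newton-type methods. The log-loss is convex in the coefficients, so any minimum these methods find is a global one.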
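
The two cost functions compared above can be written in a few lines of NumPy. This is a minimal sketch: the helper names `mse` and `log_loss`, the clipping constant `eps`, and the toy labels and probabilities are all assumptions made for the example.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between actual and predicted values."""
    return np.mean((y_true - y_pred) ** 2)

def log_loss(y_true, p, eps=1e-15):
    """Binary cross-entropy; p is clipped so that log(0) never occurs."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Toy labels and predicted probabilities, invented for illustration
y_true = np.array([1, 0, 1, 0])
p_good = np.array([0.9, 0.1, 0.8, 0.2])   # confident and correct
p_bad = np.array([0.1, 0.9, 0.8, 0.2])    # first two are confidently wrong

print("log-loss (good):", log_loss(y_true, p_good))
print("log-loss (bad): ", log_loss(y_true, p_bad))
print("MSE (good):     ", mse(y_true, p_good))
print("MSE (bad):      ", mse(y_true, p_bad))
```

Both losses grow when predictions get worse, but the log-loss penalizes confident mistakes much more heavily, because $-\log(p)$ grows without bound as the probability assigned to the true class approaches 0.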


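The iterative fitting described under Optimization can be sketched with plain batch gradient descent on the log-loss. Everything specific here is an assumption for illustration: the synthetic data, the "true" coefficients used to generate it, the learning rate, and the number of iterations. Library implementations such as scikit-learn use more sophisticated solvers and add regularization by default.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from known coefficients (illustrative only)
m = 200
X = rng.normal(size=(m, 2))
true_beta = np.array([2.0, -1.0])
p_true = 1.0 / (1.0 + np.exp(-(0.5 + X @ true_beta)))
y = (rng.random(m) < p_true).astype(float)

# Prepend a column of ones so that beta[0] acts as the intercept
Xb = np.column_stack([np.ones(m), X])

def log_loss(beta):
    p = 1.0 / (1.0 + np.exp(-(Xb @ beta)))
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

beta = np.zeros(3)
learning_rate = 0.1
initial_loss = log_loss(beta)      # equals log(2) when all coefficients are 0

for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(Xb @ beta)))
    grad = Xb.T @ (p - y) / m      # gradient of the log-loss w.r.t. beta
    beta -= learning_rate * grad

final_loss = log_loss(beta)
print("initial loss:", initial_loss)
print("final loss:  ", final_loss)
print("estimated coefficients:", beta)
```

The gradient has the simple form $\frac{1}{m} X^\top (p - y)$, so each update moves the coefficients in the direction that reduces the gap between predicted probabilities and observed labels.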

## 1.3. Typical Use Cases for Logistic Regression

1. **Binary Classification**

    - Logistic regression is the go-to model for binary classification problems, such as spam detection (spam vs. not spam), disease diagnosis (positive vs. negative), and customer churn (churn vs. no churn).

2. **Credit Scoring**

    - Predicting the likelihood of default on a loan based on borrower characteristics.

3. **Medical Diagnosis**

    - Determining the presence or absence of a disease based on patient data.

4. **Marketing**

    - Predicting customer purchase behavior (buy vs. not buy) based on demographic and behavioral data.

5. **Fraud Detection**

    - Identifying fraudulent transactions based on transaction patterns.

## 1.4. Implementing Logistic Regression in Python

The Titanic dataset includes information about the passengers on the Titanic, such as their age, sex, class, and whether they survived. We'll use logistic regression to predict the likelihood of survival (`survived` column) based on these features.

```python
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load the Titanic dataset and drop rows with missing values in the columns we use
df = sns.load_dataset('titanic').dropna(subset=['age', 'embarked', 'fare', 'deck'])

# Preprocessing: convert categorical variables to numerical codes.
# Note: integer codes impose an order on 'embarked' and 'deck'; this keeps the
# example simple, but one-hot encoding is a common alternative.
df['sex'] = df['sex'].map({'male': 0, 'female': 1})
df['embarked'] = df['embarked'].map({'C': 0, 'Q': 1, 'S': 2})
# 'deck' is a categorical column, so cast the mapped values to plain integers
df['deck'] = df['deck'].map({'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6}).astype(int)

# Define the feature matrix X and target vector y
X = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'deck']]
y = df['survived']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features (fit on the training set only, then apply to the test set)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and fit the logistic regression model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = log_reg.predict(X_test_scaled)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nAccuracy Score:")
print(accuracy_score(y_test, y_pred))
```


```text
Confusion Matrix:
[[ 8  5]
 [ 9 15]]

Classification Report:
              precision    recall  f1-score   support

           0       0.47      0.62      0.53        13
           1       0.75      0.62      0.68        24

    accuracy                           0.62        37
   macro avg       0.61      0.62      0.61        37
weighted avg       0.65      0.62      0.63        37


Accuracy Score:
0.6216216216216216
```
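
The figures in the classification report can be reproduced by hand from the confusion matrix shown above. The sketch below hard-codes that matrix and relies on scikit-learn's layout, where rows are true classes and columns are predicted classes. The variable names are chosen for this example.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true, columns = predicted
cm = np.array([[8, 5],
               [9, 15]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()    # 23 / 37
precision_1 = tp / (tp + fp)       # 15 / 20
recall_1 = tp / (tp + fn)          # 15 / 24
precision_0 = tn / (tn + fn)       # 8 / 17
recall_0 = tn / (tn + fp)          # 8 / 13
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy = {accuracy:.2f}")
print(f"class 0: precision = {precision_0:.2f}, recall = {recall_0:.2f}, f1 = {f1_0:.2f}")
print(f"class 1: precision = {precision_1:.2f}, recall = {recall_1:.2f}, f1 = {f1_1:.2f}")
```

With only 37 test samples, each metric moves by several percentage points when a single prediction changes, so these numbers are best read as a rough indication rather than a precise estimate of model quality.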

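The example above reports hard class labels from `predict`, while the model itself estimates probabilities. The sketch below shows how to read those probabilities with `predict_proba` and apply a custom decision threshold. It uses synthetic data from `make_classification` as a stand-in so that it runs without downloading anything. The dataset parameters, the `demo_model` name, and the 0.7 threshold are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Titanic dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_scaler = StandardScaler()
Xd_train_scaled = demo_scaler.fit_transform(Xd_train)
Xd_test_scaled = demo_scaler.transform(Xd_test)

demo_model = LogisticRegression(max_iter=1000).fit(Xd_train_scaled, yd_train)

# One column per class, ordered as in demo_model.classes_
proba = demo_model.predict_proba(Xd_test_scaled)
p_class1 = proba[:, 1]                           # estimated P(y = 1 | x)

labels_default = demo_model.predict(Xd_test_scaled)   # equivalent to a 0.5 cutoff
labels_strict = (p_class1 >= 0.7).astype(int)         # stricter, hypothetical cutoff

print("first five probabilities:", p_class1[:5].round(3))
print("positives at 0.5:", labels_default.sum(), "| positives at 0.7:", labels_strict.sum())
```

Raising the threshold trades recall for precision: fewer samples are labeled positive, but those that are carry a higher estimated probability. The same calls apply to the fitted `log_reg` model and `X_test_scaled` from the Titanic example.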