# Understanding Logistic Regression for Classification 🚗

Despite its name, **Logistic Regression** is a widely used algorithm for **classification**, not regression. It is used when the goal is to predict a categorical outcome, such as Yes/No, True/False, or, in this notebook, 0/1 (does not own a car vs. owns a car).

The key to logistic regression is the **Sigmoid (or Logistic) function**. While a linear regression model outputs a continuous value that can range from negative to positive infinity, the sigmoid function takes any real-valued number and "squashes" it into a value between 0 and 1. This output can be interpreted as a probability.


The formula for the sigmoid function is:
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

Typically, if the output probability is greater than 0.5, we classify the outcome as 1; otherwise, we classify it as 0.
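
As a quick illustration of the squashing and the 0.5 threshold described above, here is a minimal sketch in plain Python. The helper name `classify` and its `threshold` parameter are illustrative choices, not part of any library.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1 / (1 + math.exp(-z))

def classify(z, threshold=0.5):
    """Hypothetical helper: 1 if the probability exceeds the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0

# Large negative z gives a probability near 0, large positive z near 1,
# and z = 0 sits exactly on the 0.5 boundary.
for z in (-4, -1, 0, 1, 4):
    print(f"z={z:+d}  sigmoid={sigmoid(z):.3f}  class={classify(z)}")
```

Because the rule is "greater than 0.5", a probability of exactly 0.5 (that is, `z = 0`) falls into class 0 in this sketch.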

This notebook will walk through building a logistic regression model in `scikit-learn` and then break down the underlying math to see how the predictions are made.

---

## 1. Loading the Data

We'll start by loading a simple dataset that contains the `monthly_salary` of individuals and a binary target variable, `owns_car` (where 0 means No and 1 means Yes).

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('car_ownership.csv')
df.head()

| Unnamed: 0 | monthly_salary | owns_car |
|---|---|---|
| 0 | 22000 | 0 |
| 1 | 25000 | 0 |
| 2 | 47000 | 1 |
| 3 | 52000 | 0 |
| 4 | 46000 | 1 |


## 2. Model Training and Evaluation with Scikit-Learn

Using `scikit-learn`, we can quickly train and evaluate our model. We'll perform a train-test split, fit the `LogisticRegression` model to our training data, and then check its accuracy on the unseen test data.

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

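# Feature matrix (2-D, one column) and target vector
X = df[['monthly_salary']]
y = df['owns_car']

# Hold out 30% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
# Mean accuracy on the held-out test set
model.score(X_test, y_test)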

0.8888888888888888

The model achieves an accuracy of about **88.9%** on the test data, meaning it correctly predicted car ownership for 8 of the 9 people in the test set.

Let's look at the model's predictions compared to the actual values from the test set.

In [3]:
y_test.tolist()

[1, 0, 1, 0, 0, 0, 1, 1, 0]

In [4]:
model.predict(X_test)

array([1, 0, 1, 0, 0, 0, 0, 1, 0], dtype=int64)

Comparing the two lists, we can see the model made one mistake (the 7th value was predicted as 0 but was actually 1), which aligns with the ~89% accuracy score.
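
The comparison above can also be done programmatically. The sketch below copies the two lists from the cell outputs and recomputes the accuracy by hand. The variable names are illustrative.

```python
# Values copied from the outputs of y_test.tolist() and model.predict(X_test)
y_true = [1, 0, 1, 0, 0, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 1, 0]

# Accuracy = number of matching positions / number of test samples
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# 1-based positions where the prediction differs from the true label
mistakes = [i + 1 for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]

print(f"{correct}/{len(y_true)} correct, accuracy={accuracy:.4f}, mistakes at {mistakes}")
```

This is the same quantity that `model.score` reports for a classifier: the fraction of test samples predicted correctly.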

## 3. Under the Hood: The Mathematics of the Prediction

How does the model make these predictions? It involves a two-step process: a linear calculation followed by the non-linear sigmoid function.

### Step 3.1: The Linear Equation

First, the model calculates a value `z` using a simple linear equation, just like linear regression:

$$ z = mx + b $$

Here, `x` is the input feature (the monthly salary), `m` is the **coefficient**, and `b` is the **intercept**. We can get `m` and `b` from our trained model.

In [5]:
model.coef_, model.intercept_

(array([[0.00013621]]), array([-5.39725076]))
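
One thing these two numbers already tell us is where the model switches from predicting 0 to predicting 1. Since the sigmoid returns exactly 0.5 at `z = 0`, the switch happens at the salary where `mx + b = 0`, that is, `x = -b / m`. The sketch below uses the rounded values printed above, so the result is approximate.

```python
# Rounded values as printed by model.coef_ and model.intercept_ above
m = 0.00013621
b = -5.39725076

# The predicted probability crosses 0.5 where z = m * x + b = 0
boundary_salary = -b / m
print(f"Approximate decision boundary: {boundary_salary:,.0f}")

# Salaries above the boundary give z > 0 (class 1); below it, z < 0 (class 0)
for salary in (25000, 47000):
    z = m * salary + b
    print(salary, "-> z > 0" if z > 0 else "-> z < 0")
```

With these rounded parameters the boundary comes out at roughly 39,600, which is consistent with the low salaries in the sample rows being labelled 0.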

### Step 3.2: The Sigmoid Function

The output `z` is not the final prediction. It is then passed into the sigmoid function to convert it into a probability. Let's define this function in Python.

In [6]:
import math

def sigmoid(z):
    # Logistic function: maps any real number z to a value in (0, 1)
    return 1 / (1 + math.exp(-z))
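
A practical caveat: `math.exp` raises `OverflowError` when its result is too large for a float, so the simple definition above fails for strongly negative inputs such as `-1000`. That does not matter for the salary values in this notebook, but a numerically safer variant is easy to write. The following is one possible sketch, and the name `stable_sigmoid` is an illustrative choice.

```python
import math

def stable_sigmoid(z):
    """Sigmoid that only ever exponentiates a non-positive number."""
    if z >= 0:
        # exp(-z) is at most 1 here, so it cannot overflow
        return 1 / (1 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z)
    e = math.exp(z)
    return e / (1 + e)

def naive_sigmoid(z):
    return 1 / (1 + math.exp(-z))

try:
    naive_sigmoid(-1000)
except OverflowError as err:
    print("naive version failed:", err)

print(stable_sigmoid(-1000), stable_sigmoid(0), stable_sigmoid(1000))
```

Both branches compute the same mathematical function; they differ only in which form avoids a huge intermediate value.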

### Step 3.3: Making a Manual Prediction

Now, let's combine these pieces to manually predict the probability of car ownership for someone with a **monthly salary of 62,000**.

First, we calculate `z`:

In [10]:
z = model.coef_*62000 + model.intercept_
z[0][0]

3.047874614448303
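
As a check on the arithmetic, plugging the rounded coefficient and intercept printed in Step 3.1 into the linear equation by hand gives:

$$ z = 0.00013621 \times 62000 + (-5.39725076) \approx 8.44502 - 5.39725 \approx 3.0478 $$

This agrees with the computed value to about three decimal places. The small remaining gap is most likely because the printed coefficient is a rounded display of the value stored in the model.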

Next, we pass this `z` value to our sigmoid function. We can create a helper function for this.

In [11]:
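def prediction_function(salary, model):
    # Linear step: z = m * salary + b (coef_ has shape (1, 1), so z does too)
    z = model.coef_*salary + model.intercept_
    # Non-linear step: convert the scalar z into a probability
    y = sigmoid(z[0][0])
    return y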

In [9]:
prediction_function(62000, model)

0.954690678813166

The result is approximately **0.955**. This means the model predicts a **95.5% probability** that a person with a monthly salary of 62,000 owns a car. Since this probability is greater than 0.5, the model classifies this person as **1** (owns a car).
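
The manual two-step calculation can be cross-checked against scikit-learn's own `predict_proba` method, which returns one column of probabilities per class. The sketch below is self-contained and therefore uses a small made-up dataset (salaries in thousands) rather than `car_ownership.csv`; the names `X_toy`, `y_toy`, and `toy_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, NOT the notebook's CSV: salary in thousands and a 0/1 label
X_toy = np.array([[20], [25], [30], [35], [40], [45], [50], [55], [60], [65]], dtype=float)
y_toy = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

toy_model = LogisticRegression().fit(X_toy, y_toy)

# Manual route: linear step, then sigmoid
z = X_toy @ toy_model.coef_.T + toy_model.intercept_   # shape (10, 1)
manual_proba = (1.0 / (1.0 + np.exp(-z))).ravel()

# Library route: column 1 holds the probability of class 1
library_proba = toy_model.predict_proba(X_toy)[:, 1]

print(np.allclose(manual_proba, library_proba))
```

In the notebook itself, the analogous check would be to compare `prediction_function(62000, model)` with the class-1 column of `model.predict_proba` for a one-row input with `monthly_salary` equal to 62000.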