# Module 4.3: Logistic Regression for Classification

So far, we've predicted continuous values (like house prices). But what if we want to predict a category? For example:
* Will a customer churn or not? (Yes/No)
* Is an email spam or not spam? (Spam/Not Spam)
* Does a patient have a disease? (Positive/Negative)

This type of problem is called **Classification**. Our first and most fundamental classification algorithm is **Logistic Regression**.

**How it works:** Like linear regression, Logistic Regression starts by computing a weighted sum of the input features. Instead of using that sum directly as the prediction, it passes it through a special 'S'-shaped curve called the **sigmoid function**, which maps any real number to a probability between 0 and 1. We can then set a threshold (usually 0.5) to classify the outcome as one category or the other.

[Image of a sigmoid function curve]
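
To make the sigmoid and the threshold step concrete, here is a minimal sketch in plain Python. The function names `sigmoid` and `classify` and the sample scores are assumptions chosen for illustration; they are not part of Scikit-Learn or of this notebook's model.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z: float, threshold: float = 0.5) -> int:
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

# Hypothetical scores (weighted sums of features), chosen only for illustration
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"z={z:+.1f}  p={sigmoid(z):.3f}  class={classify(z)}")
```

Large negative scores give probabilities near 0, large positive scores give probabilities near 1, and a score of 0 sits exactly at 0.5. Raising the threshold makes the classifier more reluctant to predict the positive class.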

**Goal of this Notebook:**
1.  Introduce the concept of Feature Scaling.
2.  Train a Logistic Regression model.
3.  Understand and interpret key classification metrics like the **Confusion Matrix**, **Accuracy**, **Precision**, and **Recall**.

### Dataset Setup

We'll use a dataset about social network ads to predict whether a user will purchase a product based on their age and salary.

➡️ **Action:** Go to the `02_Data_Analysis_and_Wrangling/data/` folder. Create a new file named `Social_Network_Ads.csv` and paste the following content into it:

```csv
User ID,Gender,Age,EstimatedSalary,Purchased
15624510,Male,19,19000,0
15810944,Male,35,20000,0
15668575,Female,26,43000,0
15603246,Female,27,57000,0
15804002,Male,19,76000,0
15728773,Male,27,58000,0
15598044,Female,27,84000,0
15694829,Female,32,150000,1
15600575,Male,25,33000,0
15727311,Female,35,65000,0
15570769,Female,26,80000,0
15606274,Female,26,52000,0
15746139,Male,20,86000,0
15704987,Male,32,18000,0
15628972,Male,18,82000,0
15697695,Male,29,80000,0
15735878,Male,47,25000,1
15617482,Male,45,26000,1
15704583,Male,46,28000,1
15622836,Female,48,29000,1
15649487,Male,45,22000,1
15737691,Female,46,49000,1
15724550,Female,47,47000,1
15679401,Male,48,41000,1
15744232,Female,45,22000,1
15638646,Male,46,79000,1
15757646,Male,47,33000,1
15594452,Female,49,28000,1
15639174,Male,49,36000,1
15756820,Male,42,65000,0
```

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [None]:
df = pd.read_csv('../02_Data_Analysis_and_Wrangling/data/Social_Network_Ads.csv')
df.head()

## 1. Preparing the Data

We define our features (X) and target (y) and split the data as usual.

In [None]:
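# Features: the two numeric columns; User ID and Gender are not used here
X = df[['Age', 'EstimatedSalary']]
# Target: 1 if the user purchased, 0 otherwise
y = df['Purchased']

# Hold out 25% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)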

### A New Step: Feature Scaling

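Look at our data: in the sample above, 'Age' ranges from 18 to 49, while 'EstimatedSalary' ranges from 18,000 to 150,000. This large difference in scale can cause problems for many machine learning algorithms, particularly those that rely on gradient-based optimization, regularization, or distances, because the feature with the larger values tends to dominate.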

**Feature Scaling** standardizes the range of our features. We use `StandardScaler` from Scikit-Learn, which transforms the data so it has a mean of 0 and a standard deviation of 1.

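**Important:** We call `fit_transform` on the training data, but only `transform` on the test data. The scaler therefore learns its mean and standard deviation from the training set alone and reuses those same values on the test set. This prevents 'data leakage' from the test set into our training process.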

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
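
To see what the fit and transform steps compute, here is a minimal sketch of standardization done by hand with the standard library. The age values and the train/test grouping are hypothetical and chosen only for illustration; the sketch mirrors the idea behind `StandardScaler` and is not its implementation.

```python
from statistics import fmean, pstdev

# Hypothetical ages, split by hand into "train" and "test" for illustration
train_ages = [19.0, 35.0, 26.0, 27.0, 47.0, 45.0]
test_ages = [32.0, 48.0]

# "fit": learn the mean and standard deviation from the training data only
mu = fmean(train_ages)
sigma = pstdev(train_ages)  # population std (divide by n), as StandardScaler does

# "transform": apply the same training statistics to both sets
train_scaled = [(a - mu) / sigma for a in train_ages]
test_scaled = [(a - mu) / sigma for a in test_ages]

print([round(v, 2) for v in train_scaled])
print([round(v, 2) for v in test_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction. The scaled test values generally will not, and that is expected: they are measured against the training statistics.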

## 2. Training the Model

In [None]:
# Instantiate and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

## 3. Evaluating the Model

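For classification, accuracy alone is often not enough, especially when one class is much rarer than the other: a model that always predicts the majority class can score high accuracy while missing every positive case. We need more detailed metrics.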

### The Confusion Matrix
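A table that summarizes a classification model's predictions against the actual labels. In Scikit-Learn's `confusion_matrix`, rows are the actual classes and columns are the predicted classes, so for labels 0 and 1 the output is laid out as `[[TN, FP], [FN, TP]]`. It shows: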
* **True Positives (TP):** Correctly predicted positive.
* **True Negatives (TN):** Correctly predicted negative.
* **False Positives (FP):** Incorrectly predicted positive (Type I Error).
* **False Negatives (FN):** Incorrectly predicted negative (Type II Error).
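
The four counts above can be tallied directly from a list of actual and predicted labels. The labels below are hypothetical, invented for illustration, and are not the output of the model in this notebook.

```python
# Hypothetical labels, invented for illustration (1 = purchased, 0 = did not)
actual    = [0, 0, 1, 1, 0, 1, 0, 1]
predicted = [0, 1, 1, 1, 0, 0, 0, 1]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # correctly predicted positive
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # correctly predicted negative
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # predicted positive, actually negative
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # predicted negative, actually positive

# Rows are actual classes, columns are predicted classes
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)
```

Every example falls into exactly one of the four cells, so the counts always add up to the number of examples.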

In [None]:
predictions = model.predict(X_test)

# Generate the confusion matrix
cm = confusion_matrix(y_test, predictions)
print("Confusion Matrix:")
print(cm)

### Accuracy, Precision, and Recall

* **Accuracy:** Overall, how often is the classifier correct? `(TP+TN)/Total`
* **Precision:** Of all the positive predictions, how many were actually correct? `TP/(TP+FP)`
* **Recall (Sensitivity):** Of all the actual positive cases, how many did we correctly identify? `TP/(TP+FN)`
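
As a worked example of the three formulas, the sketch below computes them from confusion-matrix counts. The helper name `metrics` and all counts are hypothetical and chosen for illustration; returning 0.0 when a denominator is zero is an assumption made here to keep the example simple.

```python
def metrics(tp: int, tn: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Assumption for this sketch: report 0.0 when a denominator is zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical balanced case
acc, prec, rec = metrics(tp=3, tn=4, fp=1, fn=2)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")

# Hypothetical imbalanced case: the model predicts "negative" for everyone
acc_i, prec_i, rec_i = metrics(tp=0, tn=95, fp=0, fn=5)
print(f"accuracy={acc_i:.2f} precision={prec_i:.2f} recall={rec_i:.2f}")
```

In the second case accuracy is 0.95 even though the model never identifies a single positive example, which is why precision and recall are reported alongside accuracy.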

The `classification_report` gives us all these metrics at once.

In [None]:
print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}\n")
print("Classification Report:")
print(classification_report(y_test, predictions))

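> **Interpretation:** Our model has an overall accuracy of 88%. For the 'Purchased' class (1), it has high precision (meaning when it predicts someone will buy, it's usually right) and good recall (meaning it catches a good portion of the people who actually buy). Keep in mind that with only 30 rows and `test_size=0.25`, the test set holds just 8 examples, so a single misclassification shifts accuracy by 12.5 percentage points. Treat these figures as illustrative rather than as a reliable estimate.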


## ✅ What's Next?

You've successfully trained your first classification model and learned how to evaluate it properly. Understanding precision and recall is a critical skill.

Logistic Regression is a linear model. Next, we'll explore non-linear models like **Decision Trees and Random Forests**, which can capture more complex patterns in data.