# Logistic Regression — Beginner Friendly Notebook
**Audience:** Students new to classification and basic statistics.

**What this notebook is for (plain language):**
We will learn how to predict categories (like pass/fail, spam/ham) using *logistic regression*.
You will see how the logistic (sigmoid) function maps a linear score to a probability, then fit models and evaluate them with standard classification metrics.

**Prerequisites**
- Basic Python and algebra.
- Some familiarity with probabilities is helpful but not required.

**Notebook structure**
1. Intuition and a tiny synthetic example (1D).
2. Logistic function and loss (explained simply).
3. Hands-on with the Breast Cancer dataset.
4. Confusion matrix, probabilities, and thresholds.
5. ROC and precision-recall curves.
6. Regularization and class imbalance.
7. Calibration.
8. Practical tips, glossary, and exercises.


## 1 — Intuition with a tiny example

We create synthetic data where the probability of class 1 increases with a feature `x`. We'll fit logistic regression and plot predicted probability vs `x`.
This shows why we use a sigmoid: outputs between 0 and 1 that we can interpret as probabilities.


In [None]:
# Synthetic logistic example (1D)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
np.random.seed(0)

# create data
X = np.linspace(-6, 6, 200).reshape(-1,1)
# true probability (sigmoid of a linear function)
true_prob = 1 / (1 + np.exp(-0.8 * X.squeeze()))
y = (np.random.rand(X.shape[0]) < true_prob).astype(int)

clf = LogisticRegression(solver='lbfgs', max_iter=10000)
clf.fit(X, y)
prob_pred = clf.predict_proba(X)[:,1]

# Plot
# Observed labels sit exactly at 0 or 1 (no jitter is applied)
plt.scatter(X, y, alpha=0.2, label='observed class (0 or 1)')
plt.plot(X, prob_pred, color='red', label='predicted P(y=1)')
plt.xlabel('x')
plt.ylabel('Probability / class')
plt.legend()
plt.title('Synthetic logistic regression: probability curve')
plt.show()


**Beginner explanation**:
- Observations are 0 or 1 (class labels). The sigmoid maps a linear combination to probabilities between 0 and 1.
- We can choose a cutoff (usually 0.5) to convert probabilities to class predictions.


## 2 — Logistic function and loss (short, non-technical)

The logistic (sigmoid) function squashes any real number to the interval (0,1).  
The learning procedure chooses parameters to make predicted probabilities match observed labels — this is done by maximizing the likelihood (or equivalently minimizing log-loss / cross-entropy).
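
To make the two ideas above concrete, here is a minimal sketch of the sigmoid and the log-loss written with NumPy. The helper names `sigmoid` and `log_loss`, the `eps` clipping constant, and the small demo arrays are assumptions made for illustration; they are not part of the notebook's own code.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    y = np.asarray(y_true)
    p = np.clip(p_pred, eps, 1 - eps)  # clip to avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Large negative scores map near 0, a score of 0 maps to 0.5, large positive scores map near 1
scores_demo = np.array([-4.0, 0.0, 4.0])
print(sigmoid(scores_demo))

# Hypothetical labels with one good and one bad set of predicted probabilities
y_demo = np.array([0, 1, 1])
good = log_loss(y_demo, np.array([0.1, 0.8, 0.9]))
bad = log_loss(y_demo, np.array([0.9, 0.2, 0.1]))
print('log-loss (good predictions):', good)
print('log-loss (bad predictions): ', bad)
```

Predictions that put high probability on the correct label get a small loss, and confident wrong predictions are penalized heavily. Fitting the model means searching for the parameters that make this loss as small as possible on the training data.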


## 3 — Hands-on: Breast cancer dataset (binary classification)


In [None]:
# Load the data, hold out a stratified test set, and fit a baseline model
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

data = load_breast_cancer(as_frame=True)
df = data.frame.copy()
X = df.drop(columns=['target'])
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=1)

clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train)
print('Test accuracy:', clf.score(X_test, y_test))
print('\nClassification report:\n', classification_report(y_test, clf.predict(X_test)))


## 4 — Confusion matrix, probabilities, and thresholds

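A confusion matrix counts true positives/negatives and false positives/negatives. In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes, so with labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. In this dataset, label 1 means benign and label 0 means malignant, so the "positive" class here is benign.  
Raising the threshold usually increases precision and lowers recall; lowering it does the opposite. We'll compute predicted probabilities and show the confusion matrix at 0.5.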


In [None]:
# Confusion matrix and probabilities
import numpy as np
from sklearn.metrics import confusion_matrix

probs = clf.predict_proba(X_test)[:,1]
pred50 = (probs >= 0.5).astype(int)
cm = confusion_matrix(y_test, pred50)
print('Confusion matrix (threshold=0.5):\n', cm)
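
The trade-off between precision and recall is easier to see on a tiny example. The sketch below uses ten hypothetical labels and probabilities invented for illustration (they are not model output from this notebook), and the helper `precision_recall_at` is likewise an assumed name.

```python
import numpy as np

# Hypothetical labels and predicted probabilities, invented for illustration
y_toy = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
p_toy = np.array([0.05, 0.20, 0.35, 0.45, 0.40, 0.60, 0.55, 0.70, 0.85, 0.95])

def precision_recall_at(threshold):
    """Precision and recall when predicting class 1 for p_toy >= threshold."""
    pred = (p_toy >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_toy == 1))
    fp = np.sum((pred == 1) & (y_toy == 0))
    fn = np.sum((pred == 0) & (y_toy == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

for t in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(t)
    print(f'threshold={t:.1f}  precision={p:.3f}  recall={r:.3f}')
```

On this toy data, moving the threshold from 0.3 to 0.7 raises precision from 0.625 to 1.0 while recall drops from 1.0 to 0.6. The same loop can be pointed at `probs` and `y_test` from the cell above.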


## 5 — ROC and Precision-Recall (intuition)

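- ROC curve: plots true positive rate vs false positive rate as the threshold varies. AUC (area under the curve) summarizes it in one number: 0.5 corresponds to random ranking and 1.0 to perfect ranking.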
- Precision-Recall: preferred when the positive class is rare; shows precision for different recall levels.
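
One way to build intuition for AUC: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below checks this on six hypothetical scores invented for illustration; the variable names are assumptions.

```python
import numpy as np

# Hypothetical labels and scores, invented for illustration
y_rank = np.array([0, 0, 1, 0, 1, 1])
s_rank = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.90])

pos = s_rank[y_rank == 1]  # scores of positive examples
neg = s_rank[y_rank == 0]  # scores of negative examples

# Compare every positive score with every negative score
diff = pos[:, None] - neg[None, :]
auc_pairwise = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
print('Pairwise AUC:', auc_pairwise)
```

Here 6 of the 9 pairs are ranked correctly, so the AUC is about 0.667. Because it depends only on the ordering of the scores, AUC does not change if all probabilities are shifted or rescaled monotonically.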


In [None]:
# ROC and Precision-Recall curves
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

fpr, tpr, _ = roc_curve(y_test, probs)
auc = roc_auc_score(y_test, probs)
plt.plot(fpr, tpr, label=f'AUC={auc:.3f}')
plt.plot([0,1],[0,1],'--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.legend()
plt.show()

precision, recall, _ = precision_recall_curve(y_test, probs)
ap = average_precision_score(y_test, probs)
plt.plot(recall, precision, label=f'AP={ap:.3f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve')
plt.legend()
plt.show()


## 6 — Regularization and class imbalance (short)

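Logistic regression supports L1 and L2 penalties (scikit-learn uses L2 by default). When classes are imbalanced, consider `class_weight='balanced'` or resampling.
We'll run a quick grid search over `C`, which in scikit-learn is the *inverse* of the regularization strength (smaller `C` means stronger regularization), using AUC as the metric.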


In [None]:
# Quick grid search for regularization (AUC)
from sklearn.model_selection import GridSearchCV
param_grid = {'C':[0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, scoring='roc_auc', cv=4)
grid.fit(X_train, y_train)
best = grid.best_estimator_
print('Best params:', grid.best_params_)
print('Test AUC (best):', roc_auc_score(y_test, best.predict_proba(X_test)[:,1]))
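
For the class-imbalance option mentioned above, scikit-learn documents the `'balanced'` weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula by hand to a hypothetical 90/10 label vector, which is an assumption made for illustration and not data from this notebook.

```python
import numpy as np

# Hypothetical imbalanced labels: 90 examples of class 0 and 10 of class 1
y_imb = np.array([0] * 90 + [1] * 10)

counts = np.bincount(y_imb)                    # examples per class
weights = len(y_imb) / (len(counts) * counts)  # the documented 'balanced' formula

for label, (c, w) in enumerate(zip(counts, weights)):
    print(f'class {label}: count={c}, weight={w:.3f}, total weight={c * w:.1f}')
```

The rare class gets a weight of 5.0 and the common class about 0.556, so both classes contribute the same total weight (50 each) to the loss. This usually raises recall on the rare class at some cost in precision.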


## 7 — Calibration (why it matters briefly)

If you want probabilities (not just labels) to be trustworthy, check calibration. Methods such as Platt scaling or isotonic regression adjust predicted probabilities to better match observed frequencies.
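
A simple calibration check is a reliability table: group predictions into bins and compare the average predicted probability with the observed fraction of positives in each bin. The sketch below does this on simulated data where the "model" deliberately over-predicts (the true positive rate is the square of the predicted probability). The simulation, the five-bin layout, and all variable names are assumptions made for illustration. It only diagnoses miscalibration; it does not apply Platt scaling or isotonic regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated predicted probabilities and outcomes; the true rate is p**2, so the model over-predicts
p_model = rng.uniform(0, 1, size=n)
y_sim = (rng.uniform(0, 1, size=n) < p_model ** 2).astype(int)

# Five equal-width bins on [0, 1]; digitize returns a bin index from 0 to 4
edges = np.linspace(0, 1, 6)
bin_idx = np.digitize(p_model, edges[1:-1])

reliability = []
for b in range(5):
    in_bin = bin_idx == b
    mean_pred = p_model[in_bin].mean()  # average predicted probability in the bin
    observed = y_sim[in_bin].mean()     # observed fraction of positives in the bin
    reliability.append((mean_pred, observed))
    print(f'{edges[b]:.1f}-{edges[b + 1]:.1f}: mean predicted={mean_pred:.2f}, observed={observed:.2f}')
```

For a well-calibrated model the two columns would be close in every bin. Here the predicted column is consistently higher than the observed one, which is the pattern a calibration method would try to correct.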


## 8 — Practical tips, glossary, and exercises

**Tips**
- For simple problems, logistic regression is fast and interpretable.
- Always look at predicted probabilities, not just accuracy.
- Choose the evaluation metric according to the task (precision, recall, AUC, AP).
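
As a small illustration of the interpretability point: in a logistic model, increasing a feature by one unit multiplies the odds p / (1 - p) by exp(coefficient). The sketch below checks this numerically using the slope 0.8 and zero intercept that generated the synthetic data in section 1. The starting point `x0` and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def odds(p):
    """Convert a probability into odds p / (1 - p)."""
    return p / (1.0 - p)

w, b = 0.8, 0.0  # slope and intercept of the true model used in section 1
x0 = 1.5         # arbitrary starting point (assumption)

# Ratio of the odds after and before a one-unit increase in x
ratio = odds(sigmoid(w * (x0 + 1) + b)) / odds(sigmoid(w * x0 + b))
print('odds ratio for +1 in x:', ratio)
print('exp(w):                ', np.exp(w))
```

Both numbers agree (about 2.23), and the ratio does not depend on `x0`. For a fitted scikit-learn model, the same reading applies to `np.exp(clf.coef_)`, keeping in mind that the size of a coefficient depends on the scale of its feature.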

**Glossary**
- Precision: TP / (TP + FP) — when predicted positive, how often correct.
- Recall: TP / (TP + FN) — of all actual positives, how many we found.
- AUC: area under ROC — single number summary of separability.

**Exercises**
1. Try different thresholds and plot precision and recall vs threshold.
2. Use `class_weight='balanced'` and observe changes in recall/precision.
3. Calibrate the classifier using `CalibratedClassifierCV` and compare reliability.

_End of beginner logistic regression notebook._