# C1: INTRODUCTION TO ODDS AND PROBABILITY, BINOMIAL LOGISTIC REGRESSION

## Standard Process of Data Science Projects

**CRISP-DM Framework:**
1. **Business Understanding:** Define the problem clearly.  
   *Example:* Predict if a patient has cancer.  
2. **Data Understanding:** Explore data, understand features, distributions, missing values.  
3. **Data Preparation:** Clean, transform, encode categorical features, handle missing values, scaling.  
4. **Modeling:** Choose algorithm, train, tune hyperparameters.  
5. **Evaluation:** Use metrics (accuracy, precision, recall, F1, AUC).  
6. **Deployment:** Integrate into production, monitor performance.  

## Basic Concepts Revisited

- **Supervised Learning:** Learns a mapping from input features to a known output (the target) using labeled examples; covers both classification and regression.  
- **Classification:** Output variable is categorical.  
  *Example:* Yes/No, Class A/Class B.  
- **Regression:** Output variable is continuous.  
  *Example:* House price.  
- **Features:** Independent variables.  
- **Target:** Dependent variable.  

## Odds and Probability

#### Probability
- Measures how likely an event is, on a scale from 0 to 1. When all outcomes are equally likely:  
  $\mathrm{P(event)} = \dfrac{\text{favorable outcomes}}{\text{total outcomes}}$  

- *Example:* If 3 out of 10 patients have cancer,  
  $\mathrm{P(cancer)} = \dfrac{3}{10} = 0.3$  

#### Odds
- Ratio of the probability that the event happens to the probability that it does not:  
  $\mathrm{Odds} = \dfrac{\mathrm{P(event)}}{1 - \mathrm{P(event)}}$  

- *Example:*  
  $\mathrm{Odds(cancer)} = \dfrac{0.3}{1 - 0.3} = \dfrac{0.3}{0.7} \approx 0.43$  
  **Interpretation:** Odds below 1 mean the event is less likely to happen than not; odds of exactly 1 mean both are equally likely.  
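
The conversion between probability and odds can be sketched in a few lines of Python. The function names `probability_to_odds` and `odds_to_probability` are illustrative choices, not part of any library; the numbers reuse the 3-out-of-10 cancer example above.

```python
def probability_to_odds(p: float) -> float:
    """Convert a probability (0 <= p < 1) to odds."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)


def odds_to_probability(odds: float) -> float:
    """Convert non-negative odds back to a probability."""
    if odds < 0.0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)


# 3 out of 10 patients have cancer
p_cancer = 3 / 10
odds_cancer = probability_to_odds(p_cancer)

print(f"P(cancer)    = {p_cancer:.2f}")
print(f"Odds(cancer) = {odds_cancer:.2f}")
print(f"Back to P    = {odds_to_probability(odds_cancer):.2f}")
```

The round trip shows that odds and probability carry the same information on different scales.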

#### Odds Ratio
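- Compares odds between two groups:  
  $\mathrm{Odds\;Ratio} = \dfrac{\text{Odds in Group 1}}{\text{Odds in Group 2}}$  
  **Interpretation:** A ratio above 1 means the odds are higher in Group 1; a ratio of 1 means the odds are the same in both groups.  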
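
As a rough illustration, the odds ratio can be computed from event counts in each group. The counts below (30 of 100 in Group 1, 10 of 100 in Group 2) are made up for this sketch and do not come from any dataset; `odds_from_counts` is a hypothetical helper.

```python
def odds_from_counts(events: int, total: int) -> float:
    """Odds of an event given how many of `total` cases had it."""
    non_events = total - events
    if events < 0 or non_events <= 0:
        raise ValueError("need 0 <= events < total")
    return events / non_events


# Hypothetical counts, for illustration only
odds_group_1 = odds_from_counts(events=30, total=100)  # 30 / 70
odds_group_2 = odds_from_counts(events=10, total=100)  # 10 / 90

odds_ratio = odds_group_1 / odds_group_2

print(f"Odds in Group 1: {odds_group_1:.3f}")
print(f"Odds in Group 2: {odds_group_2:.3f}")
print(f"Odds ratio:      {odds_ratio:.3f}")
```

With these counts the odds in Group 1 are a little under four times the odds in Group 2, even though the probability is only three times higher (0.30 versus 0.10).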

## Binomial Logistic Regression

- **Purpose:**  
  Predicts the probability of a binary outcome (two classes).  
  *Example:* Will the email be spam (Yes/No)?  

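- **Key Formula:**  
  $\log\left(\dfrac{P}{1 - P}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots$  
  Here $P$ is the probability of the positive class, $\beta_0$ is the intercept, and $\beta_i$ is the coefficient of feature $X_i$. The left-hand side is the natural log of the odds (the log-odds, or logit).  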
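
One consequence of the formula is that a one-unit increase in a feature multiplies the odds by $e^{\beta}$ for that feature. The sketch below checks this numerically for a single feature; `BETA_0` and `BETA_AGE` are hypothetical values picked for illustration, not fitted coefficients.

```python
import math

# Hypothetical coefficients, chosen only for illustration (not fitted values)
BETA_0 = -4.0    # intercept
BETA_AGE = 0.08  # coefficient for a single feature: age in years


def log_odds(age: float) -> float:
    """Linear predictor: log(P / (1 - P)) = beta_0 + beta_age * age."""
    return BETA_0 + BETA_AGE * age


# Odds are the exponential of the log-odds
odds_at_50 = math.exp(log_odds(50))
odds_at_51 = math.exp(log_odds(51))

# Ratio of odds for a one-year increase in age
odds_multiplier = odds_at_51 / odds_at_50

print(f"Odds at age 50:        {odds_at_50:.4f}")
print(f"Odds at age 51:        {odds_at_51:.4f}")
print(f"Odds multiplier:       {odds_multiplier:.4f}")
print(f"exp(BETA_AGE):         {math.exp(BETA_AGE):.4f}")
```

The last two printed values agree, which is why exponentiated logistic regression coefficients are commonly read as odds ratios.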


## Sigmoid Function

- Transforms log-odds into probability:  
  $P = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + \dots)}}$  

- **Shape:** S-curve between 0 and 1.  
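
A minimal sketch of the sigmoid, using only the standard library. The two-branch form is one common way to avoid overflow for large negative inputs; it is an implementation choice for this example, not something the formula requires. The `logit` helper is included only to show that the sigmoid inverts the log-odds.

```python
import math


def sigmoid(z: float) -> float:
    """Map a log-odds value z to a probability in [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form to avoid overflow in exp(-z)
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


# The curve rises from near 0 to near 1 and crosses 0.5 at z = 0
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"z = {z:+.1f} -> P = {sigmoid(z):.4f}")

# Round trip: probability -> log-odds -> probability
print(f"sigmoid(logit(0.3)) = {sigmoid(logit(0.3)):.4f}")
```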

```python
# Python Example: Logistic Regression
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Example dataset
data = pd.DataFrame({
    'age': [25, 32, 47, 51, 62, 23, 34, 45],
    'bp': [120, 130, 140, 150, 160, 110, 125, 135],
    'has_disease': [0, 0, 1, 1, 1, 0, 0, 1]
})

# Features & target
X = data[['age', 'bp']]
y = data['has_disease']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Model training
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

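```text
Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         2

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2
```

**Note:** The test set holds only two samples, both from class 0, so this perfect score says little about how well the model generalizes.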
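
To connect the fitted model back to odds and the sigmoid, the sketch below refits the same toy data and inspects predicted probabilities and coefficients. It assumes scikit-learn, pandas, and NumPy are installed. `max_iter=1000` is an added setting, chosen here only to reduce the chance of a convergence warning on unscaled features, so the fit may differ slightly from the example above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same toy dataset as the example above
data = pd.DataFrame({
    'age': [25, 32, 47, 51, 62, 23, 34, 45],
    'bp': [120, 130, 140, 150, 160, 110, 125, 135],
    'has_disease': [0, 0, 1, 1, 1, 0, 0, 1]
})
X = data[['age', 'bp']]
y = data['has_disease']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# max_iter=1000 is an assumption of this sketch, not part of the original example
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predicted probabilities; columns follow the order of model.classes_
proba = model.predict_proba(X_test)

# decision_function returns the log-odds (beta_0 + beta_1*X_1 + ...)
log_odds = model.decision_function(X_test)

# Applying the sigmoid by hand should reproduce P(class 1)
manual_proba = 1.0 / (1.0 + np.exp(-log_odds))

# exp(coefficient) is the odds multiplier per one-unit increase in a feature
odds_ratios = np.exp(model.coef_[0])

print("P(class 1) from predict_proba:", proba[:, 1])
print("P(class 1) via manual sigmoid:", manual_proba)
for name, ratio in zip(X.columns, odds_ratios):
    print(f"Odds multiplier per unit of {name}: {ratio:.3f}")
```

With only six training rows, the coefficients themselves are not meaningful; the point is the relationship between `decision_function`, the sigmoid, and `predict_proba`.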

