<a href="https://colab.research.google.com/github/CaseyXu2021/MachineLearningNotebooks/blob/master/predictive%20learning_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Red Cross: Modeling Results

This notebook contains the modeling results of the analyst at the Red Cross. Your task is to answer the accompanying pre-class questions based on your understanding of the results and the case study. You’ll need to run and **write** your own code.

## 1-Loading the data

Let's start by loading the data and looking at the first few rows.

In [1]:
import pandas as pd

file_name = 'mailing.csv'
df = pd.read_csv(file_name)
df.head()

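| Unnamed: 0 | Income | Firstdate | Lastdate | Amount | rfaf2 | rfaa2 | pepstrfl | glast | gavr | class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | 9409 | 9509 | 0.06 | 1 | G | 0 | 50 | 30.0 | 0 |
| 1 | 2 | 9201 | 9602 | 0.16 | 4 | G | X | 20 | 20.55 | 1 |
| 2 | 0 | 9510 | 9603 | 0.2 | 4 | E | 0 | 5 | 8.75 | 0 |
| 3 | 6 | 9409 | 9603 | 0.13 | 2 | G | 0 | 25 | 22.5 | 0 |
| 4 | 0 | 9310 | 9511 | 0.1 | 1 | G | 0 | 25 | 12.5 | 0 |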


Here's some info about the data.

| Column | Description |
| --- | --- |
| Income | Household income. |
| Firstdate | Date associated with the first gift by this individual. |
| Lastdate | Date associated with the most recent gift. |
| Amount | Average amount by this individual over all periods (incl. zeros). |
| rfaf2 | Frequency code. |
| rfaa2 | Donation amount code. |
| pepstrfl | Flag indicating a star donor. |
| glast | Amount of last gift. |
| gavr | Amount of average gift. |
| class | Whether the person donated blood (1 represents yes). |
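
Before modeling, it can help to check which columns hold text codes and how rare the positive class is. The sketch below is only an illustration: it re-types the five rows displayed above instead of reading the full `mailing.csv`, and the names `sample`, `non_numeric_cols`, and `positive_share` are hypothetical, not part of the original notebook.

```python
import pandas as pd

# The five rows shown by df.head() above, re-typed by hand for illustration.
sample = pd.DataFrame({
    "Income":    [3, 2, 0, 6, 0],
    "Firstdate": [9409, 9201, 9510, 9409, 9310],
    "Lastdate":  [9509, 9602, 9603, 9603, 9511],
    "Amount":    [0.06, 0.16, 0.20, 0.13, 0.10],
    "rfaf2":     [1, 4, 4, 2, 1],
    "rfaa2":     ["G", "G", "E", "G", "G"],
    "pepstrfl":  ["0", "X", "0", "0", "0"],
    "glast":     [50, 20, 5, 25, 25],
    "gavr":      [30.0, 20.55, 8.75, 22.5, 12.5],
    "class":     [0, 1, 0, 0, 0],
})

# Columns that are not numeric must be encoded before entering a linear model.
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Share of rows in the positive class (1 = donated).
positive_share = sample["class"].mean()

print("Non-numeric columns:", non_numeric_cols)
print("Positive share in this sample:", positive_share)
```

On the full data, the same two checks would be run on `df`. Note that `rfaf2` is stored as a number even though the table describes it as a code, so a dtype check alone will not flag it.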

## 2-Modeling

The cell below builds and evaluates a logistic regression model. The workflow has four steps:

- **Data preparation:** Splits the dataset into a training set and a test set (10% held out, stratified by class) and separates categorical from numerical features.  
- **Preprocessing:** Uses a `ColumnTransformer` to one-hot encode categorical variables and standardize numerical ones.  
- **Modeling and tuning:** Wraps preprocessing and logistic regression in a `Pipeline`, then tunes the regularization strength (`C`) via grid search with 5-fold cross-validation, optimizing for ROC AUC.  
- **Evaluation:** Applies the model fitted with the best regularization strength to the test set, computes predicted probabilities, labels an individual as a donor when the probability is at least 0.5, and reports **ROC AUC** and a **classification report** (accuracy, precision, recall, F1). `confusion_matrix` is imported but not printed in this cell.


In [2]:
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report

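# --- Data setup ---
target = "class"
cat = ['rfaf2', 'rfaa2', 'pepstrfl']  # columns treated as categorical codes
# Every remaining column is treated as numeric. Note that this includes
# 'Unnamed: 0', which looks like a saved row index rather than a donor attribute.
num = [c for c in df.columns if c not in cat + [target]]
X, y = df.drop(columns=[target]), df[target]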

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42
)

# --- Pipeline and tuning ---
pre = ColumnTransformer([
    ("cat", OneHotEncoder(drop="first", handle_unknown="ignore"), cat),
    ("num", StandardScaler(), num)
])
pipe = Pipeline([
    ("prep", pre),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=1000, random_state=42))
])

grid = {"clf__C": [10**p for p in np.arange(-2, 2.5, 0.5)]}
print("Tuning model. Please be patient. If it's taking too long, replace the 0.5 in the grid with a 1.")
search = GridSearchCV(pipe, grid, scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
model = search.best_estimator_

# --- Evaluation ---
y_proba = model.predict_proba(X_test)[:, 1]
y_pred = (y_proba >= 0.5).astype(int)

print("Best C:", search.best_params_["clf__C"])
print("Test ROC AUC:", round(roc_auc_score(y_test, y_proba), 4))
print("\nReport:\n", classification_report(y_test, y_pred, zero_division=0))

Tuning model. Please be patient. If it's taking too long, replace the 0.5 in the grid with a 1.
Best C: 1.0
Test ROC AUC: 0.5957

Report:
               precision    recall  f1-score   support

           0       0.95      1.00      0.97     18206
           1       0.00      0.00      0.00       972

    accuracy                           0.95     19178
   macro avg       0.47      0.50      0.49     19178
weighted avg       0.90      0.95      0.92     19178



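Test ROC AUC is 0.5957: the model ranks donors above non-donors only modestly better than chance (0.5).

Precision, recall, and F1-score for class 1 are all 0.00: at the 0.5 threshold, the model labels essentially every individual as a non-donor.

Accuracy is 0.95, but only because the data set is highly imbalanced (972 donors among 19,178 test individuals, about 5%). Predicting 0 for everyone yields high accuracy that is useless for deciding whom to contact.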
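
The pattern described above can be reproduced on made-up numbers. The sketch below uses synthetic probabilities (not the Red Cross data) that all sit well below 0.5, so the 0.5 cutoff flags nobody as a donor. The names ending in `_demo` are hypothetical stand-ins; with the notebook's own variables, the same steps would be applied to `y_test` and `y_proba`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic stand-ins for y_test and y_proba (illustrative values only).
y_true_demo = np.array([0] * 19 + [1])            # 5% positives
y_proba_demo = np.linspace(0.01, 0.20, num=20)    # every score is below 0.5

threshold = 0.5
y_pred_demo = (y_proba_demo >= threshold).astype(int)

# Number of individuals scored at or above the threshold.
n_flagged = int(y_pred_demo.sum())

# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()

print("Flagged at 0.5:", n_flagged)
print(cm)
print("Accuracy:", accuracy)
```

Here accuracy is high even though no donor is identified, which is why ROC AUC and the predicted probabilities themselves are more informative than accuracy for this problem.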

### Pre-class assignment questions

1. **How many individuals were estimated to have more than a 50% probability of donating blood in the test set?**
2. **How successful was the campaign overall?** Consider both benefits and costs. You will need to make one or more assumptions to answer this question. State them clearly and justify them briefly.
3. **What if the campaign had targeted only individuals with a predicted probability of donating above 5%?** Estimate how successful the campaign would have been under this strategy. Base your analysis on the model’s predictions in the test set (`y_proba` and `y_test`).
4. **If your goal were purely to maximize the monetary value of the campaign, who should have been targeted?** Identify the probability threshold you would use to decide whom to contact, and justify your choice. Clearly state any assumptions you make.
5. **Imagine you are in Natalie’s shoes.** Write a short, one-paragraph memo to the Red Cross explaining the potential value of the model. Support your explanation with evidence—for example, by showing how campaign performance in the test set changes across different probability thresholds.
6. **Would you like to present your analysis in class?** If so, make sure your notebook is clean and well-documented. I’ll select one or two students to present based on the quality of their pre-class work, but not necessarily the very best submissions (I simply won’t have the time to review them all). If you aren’t selected, please don’t take it as a reflection of your work. A good presentation can significantly boost your class contribution grade (and if you’re selected, it can only help, not hurt, your grade).
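
For questions 3 to 5, one possible starting point is a small helper that computes the net value of contacting everyone at or above a probability threshold. This is a sketch under stated assumptions, not the expected solution: the function `campaign_profit`, the demo arrays, and the values `VALUE_PER_DONOR` and `COST_PER_CONTACT` are all hypothetical, and the economics are exactly the kind of assumption the questions ask you to state and justify.

```python
import numpy as np

def campaign_profit(y_true, y_proba, threshold, value_per_donor, cost_per_contact):
    """Net value of contacting everyone whose score is at or above `threshold`."""
    y_true = np.asarray(y_true)
    contacted = np.asarray(y_proba) >= threshold
    n_contacted = int(contacted.sum())
    n_donors = int(y_true[contacted].sum())
    return n_donors * value_per_donor - n_contacted * cost_per_contact

# Made-up outcomes and scores, NOT the Red Cross test set.
y_true_demo = np.array([0, 0, 0, 1, 0, 1, 0, 1])
y_proba_demo = np.array([0.01, 0.02, 0.03, 0.04, 0.06, 0.08, 0.10, 0.20])

# Hypothetical economics (assumptions to replace with your own).
VALUE_PER_DONOR = 10.0
COST_PER_CONTACT = 1.0

for t in [0.0, 0.05, 0.5]:
    profit = campaign_profit(
        y_true_demo, y_proba_demo, t, VALUE_PER_DONOR, COST_PER_CONTACT
    )
    print(f"threshold={t:.2f}  profit={profit:.2f}")
```

With the notebook's variables, the call would take `y_test` and `y_proba` in place of the demo arrays. Under this simple cost model, contacting one person has positive expected value when their probability times the value per donor exceeds the cost per contact, which suggests comparing thresholds near that ratio.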